heterogeneous system coherence

HETEROGENEOUS SYSTEM COHERENCE
FOR INTEGRATED CPU-GPU SYSTEMS
JASON POWER*, ARKAPRAVA BASU*, JUNLI GU†, SOORAJ PUTHOOR†,
BRADFORD M BECKMANN†, MARK D HILL*†, STEVEN K REINHARDT†, DAVID A WOOD*†
*University of Wisconsin-Madison
†Advanced Micro Devices, Inc.
Powerpoint version available on:
http://pages.cs.wisc.edu/~powerjg/
2 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46
ABSTRACT
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
‒High bandwidth difficult to support at directory
‒Extreme resource requirements
We propose Heterogeneous System Coherence
‒Leverages spatial locality and region coherence
‒Reduces bandwidth by 94%
‒Reduces resource requirements by 95%
PHYSICAL INTEGRATION
[Figure: CPU cores and GPU integrated with stacked high-bandwidth DRAM. Credit: IBM]
LOGICAL INTEGRATION
General-purpose GPU computing
‒OpenCL
‒CUDA
Heterogeneous Uniform Memory Access (hUMA)
‒Shared virtual address space
‒Cache coherence
Allows new heterogeneous apps
OUTLINE
Motivation
Background
‒System overview
‒Cache architecture reminder
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
Conclusions
SYSTEM OVERVIEW
SYSTEM LEVEL
[Figure: an Accelerated Processing Unit (APU) connected to DRAM channels through a high-bandwidth interconnect]
SYSTEM OVERVIEW
APU
[Figure: APU containing a GPU cluster, a CPU cluster, a directory, and a direct-access bus (used for graphics), with a path to DRAM. Arrow thickness indicates bandwidth; invalidation traffic is marked.]
GPU compute accesses must stay coherent
SYSTEM OVERVIEW
GPU
[Figure: GPU cluster with many compute units (CUs), each with its own L1, sharing the GPU L2 cache. Inset of one CU: I-Fetch / Decode, Register File, execution units (Ex), Local Scratchpad Memory, Coalescer, path to L1.]
Very high bandwidth: L2 has high miss rate
SYSTEM OVERVIEW
CPU
[Figure: CPU cluster with CPU cores, each with its own L1, sharing an L2 that connects to the directory]
Low bandwidth: low L2 miss rate
CACHE ARCHITECTURE REMINDER
CPU/GPU L2 CACHE
[Figure: L2 cache pipeline with MSHRs (MSHR entries), cache tag arrays, and a coherent network interface. Flows shown: demand requests, miss requests, probe requests, data responses, core data responses.]
‒Demand requests from L1
‒Searches cache tags for a tag match
‒Allocates an MSHR entry
‒On a hit, return data to the L1
‒On a miss, send request to directory
‒On a directory probe, check MSHRs and tags
‒Tag hit on probe: send data to other core
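The request flow on this slide can be sketched as a small Python model. This is an illustrative sketch only, not the authors' implementation: the class name `L2Cache`, the MSHR count, the merge and stall behaviour, and the assumption that a probe invalidates the block are all hypothetical. Only the 64-B block size and the hit, miss, and probe steps come from the slides.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 64  # block size used in the slides


@dataclass
class L2Cache:
    """Toy model of the L2 request flow; sizes and policies are assumptions."""
    num_mshrs: int = 4
    tags: set = field(default_factory=set)    # block numbers present in the cache
    mshrs: set = field(default_factory=set)   # block numbers with a miss in flight

    def demand(self, addr: int) -> str:
        """Handle a demand request arriving from an L1."""
        block = addr // BLOCK_BYTES
        if block in self.tags:
            return "hit: return data to L1"
        if block in self.mshrs:
            return "merge: miss already outstanding"
        if len(self.mshrs) >= self.num_mshrs:
            return "stall: no free MSHR"
        self.mshrs.add(block)
        return "miss: send request to directory"

    def fill(self, addr: int) -> None:
        """Data arrived for an outstanding miss: free the MSHR, install the tag."""
        block = addr // BLOCK_BYTES
        self.mshrs.discard(block)
        self.tags.add(block)

    def probe(self, addr: int) -> str:
        """Handle a directory probe by checking both the tags and the MSHRs."""
        block = addr // BLOCK_BYTES
        if block in self.tags:
            self.tags.discard(block)  # assumption: the probe invalidates the block
            return "probe hit: send data to other core"
        if block in self.mshrs:
            return "probe matches in-flight miss"
        return "probe miss"


cache = L2Cache(num_mshrs=1)
print(cache.demand(0x100))  # first touch of the block misses
cache.fill(0x100)
print(cache.demand(0x13F))  # same 64-B block, so this hits
```

With a single MSHR, a second miss to a different block stalls until `fill` frees the entry, which is the back-pressure effect the later bottleneck slides describe.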
DIRECTORY ARCHITECTURE REMINDER
DIRECTORY
[Figure: directory pipeline with MSHRs (MSHR entries), directory tag array, and probe request RAM (PR entries). Flows shown: coherent block requests, block probe requests/responses, hit, miss, to DRAM.]
‒Demand requests from L2 cache
‒Allocates an MSHR entry
‒Searches block tags for a tag match
‒On a miss, the data comes from DRAM
‒Allocate and send probes to L2 caches
BACKGROUND SUMMARY
System under investigation
‒Heterogeneous CPU-GPU on chip
‒High-bandwidth DRAM
Directory pipeline complex
‒MSHR array is associative
‒Difficult to pipeline with more than 1 request per cycle
‒Important resources: MSHR entries
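A minimal Python sketch of why MSHR entries are the important resource: each request holds an entry, found associatively by block address, until it completes. The class `Directory`, its method names, the retry policy, and the simplification that every request is exclusive are assumptions made for illustration and do not come from the slides.

```python
class Directory:
    """Toy block directory; every request is treated as exclusive (an assumption)."""

    def __init__(self, num_mshrs):
        self.num_mshrs = num_mshrs
        self.mshrs = {}      # block -> requester; looked up associatively by block
        self.holders = {}    # directory tags: block -> set of clusters holding it
        self.peak_mshrs = 0  # highest number of entries in use at once

    def request(self, block, requester):
        if block in self.mshrs:
            return "retry: block busy"
        if len(self.mshrs) >= self.num_mshrs:
            return "retry: MSHRs full"
        # The entry stays allocated for the entire duration of the request.
        self.mshrs[block] = requester
        self.peak_mshrs = max(self.peak_mshrs, len(self.mshrs))
        holders = self.holders.get(block)
        if holders is None:
            return "miss: data comes from DRAM"
        targets = sorted(holders - {requester})
        if targets:
            return "hit: probe " + ",".join(targets)
        return "hit: no probes needed"

    def complete(self, block):
        """Request finished: free the MSHR and record the new holder."""
        requester = self.mshrs.pop(block)
        self.holders[block] = {requester}


directory = Directory(num_mshrs=2)
print(directory.request(1, "CPU"))  # not tracked yet, so data comes from DRAM
directory.complete(1)
print(directory.request(1, "GPU"))  # tracked block: probe the CPU cluster
```

Because an entry is held from `request` until `complete`, the number of entries needed grows with both request rate and request latency.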
OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
‒Simulation overview
‒Directory bandwidth
‒MSHRs
‒Performance is significantly affected
Heterogeneous System Coherence Details
Results
Conclusions
SIMULATION DETAILS
gem5 simulator
‒Simple CPU
‒GPU simulator based on AMD GCN
‒All memory requests through gem5
Workloads
‒Modified to use hUMA
‒Rodinia & AMD APP SDK

Parameter            Value
CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)
GPGPU BENCHMARKS
Rodinia benchmarks
‒ bp trains the connection weights on a neural network
‒ bfs breadth-first search
‒ hs performs a transient 2D thermal simulation (5-point stencil)
‒ lud matrix decomposition
‒ nw performs a global optimization for DNA sequence alignment
‒ km does k-means clustering
‒ sd speckle-reducing anisotropic diffusion
AMD SDK
‒ bn bitonic sort
‒ dct discrete cosine transform
‒ hg histogram
‒ mm matrix multiplication
SYSTEM BOTTLENECKS
Difficult to scale directory bandwidth
‒Difficult to multi-port
‒Complicated pipeline
High resource usage
‒Must allocate MSHR for entire duration of request
‒MSHR array difficult to scale
[Figure: APU in which the GPU cluster (high bandwidth) and the CPU cluster feed a directory designed to support CPU bandwidth, with a path to DRAM]
DIRECTORY TRAFFIC
[Figure: bar chart of directory accesses per GPU cycle (y-axis 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Difficult to support >1 request per cycle
RESOURCE USAGE
[Figure: bar chart of maximum MSHRs (log scale, 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Very difficult to scale MSHR array
Steady state at 700 GB/s
Causes significant back-pressure on L2s
PERFORMANCE OF BASELINE
COMPARED TO UNCONSTRAINED RESOURCES
[Figure: bar chart of slowdown (y-axis 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Back-pressure from limited MSHRs and bandwidth
BOTTLENECKS SUMMARY
Directory bandwidth
‒Must support up to 4 requests per cycle
‒Difficult to construct pipeline
Resource usage
‒MSHRs are a constraining resource
‒Need more than 10,000
‒Without resource constraints, up to 4x better performance
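One way to see why the MSHR count grows so large is Little's law: entries in flight equal request rate times the time each request holds an entry. The sketch below applies it to the 700 GB/s peak bandwidth and 64-B blocks from the slides. The 1000 ns and 100 ns hold times are purely illustrative assumptions, and the function name `mshrs_needed` is hypothetical; the slides do not state how their MSHR figure was obtained.

```python
BLOCK_BYTES = 64       # block size from the slides
PEAK_BW_GB_S = 700     # peak DRAM bandwidth from the slides (1 GB = 10**9 bytes)


def mshrs_needed(bandwidth_gb_s, hold_time_ns, block_bytes=BLOCK_BYTES):
    """Little's law: entries in flight = request rate x time each entry is held."""
    # GB/s multiplied by ns gives bytes in flight, because 10**9 * 10**-9 == 1.
    bytes_in_flight = bandwidth_gb_s * hold_time_ns
    # Round up: a partially filled block still occupies a whole entry.
    return -(-bytes_in_flight // block_bytes)


# Assumed hold times, chosen only to show how the requirement scales.
print(mshrs_needed(PEAK_BW_GB_S, 1000))
print(mshrs_needed(PEAK_BW_GB_S, 100))
```

Under these assumptions the requirement scales linearly with how long each request occupies its entry, so shortening the directory's involvement per request reduces the MSHRs needed.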
OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
‒Overall system design
‒Region buffer design
‒Region directory design
‒Example
‒Hardware complexity
Results
Conclusions
BASELINE DIRECTORY COHERENCE
[Figure: APU with GPU cluster and CPU cluster connected to the directory, with a path to DRAM. Animation steps: Initialization, Kernel Launch, Read result.]
HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure: APU with GPU cluster and CPU cluster connected to the directory, with a path to DRAM. Animation steps: Initialization, Kernel Launch.]
[Figure: the same APU with a region buffer at the GPU cluster, a region buffer at the CPU cluster, a direct-access bus, and a region directory, with a path to DRAM]
Region buffers coordinate with region directory
HSC: EXAMPLE MEMORY REQUEST
[Figure: animated example of a memory request involving the GPU L2 cache, the GPU region buffer, and the region directory, shown on the HSC APU diagram]
HSC: L2 CACHE & REGION BUFFER
[Figure: L2 cache pipeline (MSHRs, cache tag arrays) extended with a region buffer, a direct-access bus interface, and the coherent network interface]
‒Region buffer: region tags and permissions
‒Only region-level permission traffic
‒Interface for direct-access bus
HSC: REGION DIRECTORY
[Figure: directory pipeline in which the block directory tag array becomes a region directory tag array and coherent block requests become region permission requests. Other labels: MSHRs (MSHR entries), probe request RAM (PR entries), probe requests/responses, hit, miss, to DRAM.]
‒Region directory tag array: region tags, sharers, and permissions
HSC: HARDWARE COMPLEXITY
Region protocols reduce
directory size
‒Region directory: 8x fewer entries
Region buffers
‒At each L2 cache
‒1-KB region (16 64-B blocks)
‒16-K region entries
‒Overprovisioned for low-locality workloads
(a) Region Directory Entry
Region Tag (18 bits) | State (2 bits) | CPU, GPU (1 valid bit per cluster)
(b) Region Buffer Entry
Region Tag (18 bits) | State (2 bits) | B0 B1 B2 ... B15 (1 valid bit per block in the region)
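The region geometry and entry layouts above can be checked with a few lines of Python. The field widths, region size, and entry count are taken from the slide; the function name `split_address`, the example address, and the storage total (which counts only tag, state, and valid bits) are illustrative assumptions. How a full region number maps onto the 18-bit tag plus a set index is not stated in the slides and is not modelled here.

```python
BLOCK_BYTES = 64
BLOCKS_PER_REGION = 16
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION  # 1 KB, as on the slide


def split_address(addr):
    """Split a byte address into (region number, block within region, byte offset)."""
    region = addr // REGION_BYTES
    block_in_region = (addr % REGION_BYTES) // BLOCK_BYTES
    offset = addr % BLOCK_BYTES
    return region, block_in_region, offset


TAG_BITS = 18
STATE_BITS = 2
CLUSTERS = 2  # CPU and GPU: one valid bit per cluster

region_directory_entry_bits = TAG_BITS + STATE_BITS + CLUSTERS
region_buffer_entry_bits = TAG_BITS + STATE_BITS + BLOCKS_PER_REGION

# 16-K region buffer entries per L2, counting only the bits listed on the slide.
region_buffer_bytes = 16 * 1024 * region_buffer_entry_bits // 8

print(split_address(0x12345))
print(region_directory_entry_bits, region_buffer_entry_bits, region_buffer_bytes)
```

The per-block valid bits are what make a region buffer entry wider than a region directory entry: 16 bits per region versus 2.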
HSC SUMMARY
Key insight
‒GPU-CPU applications exhibit high spatial locality
‒Use direct-access bus present in systems
‒Offload bandwidth onto direct-access bus
Use coherence network only for permission
Add region buffer to track region information
‒At each L2 cache
‒Bypass coherence network and directory
Replace directory with region directory
‒Significantly reduces total size needed
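The summary above can be sketched as a toy Python model: an L2 miss consults the local region buffer, and only a miss in the region buffer goes to the region directory for permission. This is a sketch under stated assumptions, not the protocol from the paper: the class and method names are hypothetical, permission is modelled as exclusive only, and the handling of the other cluster's cached blocks when a region is taken away is omitted.

```python
REGION_BYTES = 1024  # 1-KB regions (16 64-B blocks), as in the slides


class RegionDirectory:
    """Tracks which cluster holds permission for each region (exclusive only)."""

    def __init__(self):
        self.owner = {}    # region number -> name of the cluster holding permission
        self.requests = 0  # permission requests seen over the coherence network

    def acquire(self, region, cluster, buffers):
        self.requests += 1
        previous = self.owner.get(region)
        if previous is not None and previous != cluster:
            # Take the region away from the other cluster. A real design must
            # also deal with that cluster's cached blocks; omitted here.
            buffers[previous].regions.discard(region)
        self.owner[region] = cluster


class RegionBuffer:
    """Sits next to one L2 cache and remembers regions it may access directly."""

    def __init__(self, name):
        self.name = name
        self.regions = set()

    def l2_miss(self, addr, directory, buffers):
        region = addr // REGION_BYTES
        path = "direct-access bus"
        if region not in self.regions:
            directory.acquire(region, self.name, buffers)
            self.regions.add(region)
            path = "region directory, then direct-access bus"
        return path


buffers = {"CPU": RegionBuffer("CPU"), "GPU": RegionBuffer("GPU")}
directory = RegionDirectory()
# The GPU misses on all 16 blocks of region 0.
paths = [buffers["GPU"].l2_miss(addr, directory, buffers) for addr in range(0, 1024, 64)]
print(directory.requests, paths.count("direct-access bus"))
```

In this model, 16 misses to one region produce a single region directory request; the other 15 bypass the coherence network and directory.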
OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
‒Speed-up
‒Latency of loads
‒Bandwidth
‒MSHR usage
Conclusions
THREE CACHE-COHERENCE PROTOCOLS
Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size
HSC PERFORMANCE
[Figure: bar chart of normalized speed-up (y-axis 0 to 5) for Broadcast, Baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Largest slow-downs from constrained resources
DIRECTORY TRAFFIC REDUCTION
[Figure: bar chart of normalized directory bandwidth (y-axis 0 to 1.2) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Average bandwidth significantly reduced
Theoretical reduction from 16-block regions
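A short Python calculation of what the theoretical reduction for 16-block regions plausibly refers to. It assumes perfect spatial locality, so that one region permission request replaces one directory request per block; that reading of the chart annotation is an assumption, and the function name is hypothetical.

```python
def theoretical_traffic_fraction(blocks_per_region):
    """Directory traffic remaining if one request covers a whole region."""
    return 1 / blocks_per_region


for blocks in (4, 8, 16):
    remaining = theoretical_traffic_fraction(blocks)
    print(blocks, remaining, 1 - remaining)

reduction_16 = 1 - theoretical_traffic_fraction(16)
```

For 16-block regions this gives a 93.75% reduction, which is close to the 94% bandwidth reduction quoted in the abstract.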
HSC RESOURCE USAGE
[Figure: bar chart of normalized directory MSHRs required (y-axis 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Maximum MSHRs significantly reduced
RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
‒Reduces the average load latency
‒Decreases bandwidth requirement of directory
HSC reduces the required MSHRs at the directory
RELATED WORK
Coarse-grained coherence
‒Region coherence
‒ Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
‒Spatiotemporal coherence [Alisafaee, MICRO 2012]
‒Dual-grain directory coherence [Basu, UW-TR 2013]
‒ Primarily focused on directory size
GPU coherence [Singh et al. HPCA 2013]
‒Intra-GPU coherence
CONCLUSIONS
Hardware coherence can increase the utility of
heterogeneous systems
Major bottlenecks in current coherence implementations
‒High bandwidth difficult to support at directory
‒Extreme resource requirements
We propose Heterogeneous System Coherence
‒Leverages spatial locality and region coherence
‒Reduces bandwidth by 94%
‒Reduces resource requirements by 95%
Questions?
Contact:
powerjg@cs.wisc.edu
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of
Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance
Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
BACKUP SLIDES
LOAD LATENCY
[Figure: bar chart of normalized load latency (y-axis 0 to 4.5) for broadcast, baseline, and HSC across bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Average load time significantly reduced
EXECUTION TIME BREAKDOWN
[Figure: stacked bar chart of execution time (%) split between GPU and CPU for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]