HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
JASON POWER*, ARKAPRAVA BASU*, JUNLI GU†, SOORAJ PUTHOOR†, BRADFORD M BECKMANN†, MARK D HILL*†, STEVEN K REINHARDT†, DAVID A WOOD*†
*University of Wisconsin-Madison †Advanced Micro Devices, Inc.
Powerpoint version available on: http://pages.cs.wisc.edu/~powerjg/
2 | HETEROGENEOUS SYSTEM COHERENCE | DECEMBER 11, 2013 | MICRO-46

ABSTRACT
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
‒High bandwidth difficult to support at directory
‒Extreme resource requirements
We propose Heterogeneous System Coherence
‒Leverages spatial locality and region coherence
‒Reduces bandwidth by 94%
‒Reduces resource requirements by 95%

PHYSICAL INTEGRATION
[Figure: physical integration of stacked high-bandwidth DRAM, GPU, and CPU cores. Credit: IBM]

LOGICAL INTEGRATION
General-purpose GPU computing
‒OpenCL
‒CUDA
Heterogeneous Uniform Memory Access (hUMA)
‒Shared virtual address space
‒Cache coherence
Allows new heterogeneous apps

OUTLINE
Motivation
Background
‒System overview
‒Cache architecture reminder
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
Conclusions

SYSTEM OVERVIEW
SYSTEM LEVEL
[Figure: system-level diagram. Labels: high-bandwidth interconnect, Accelerated Processing Unit (APU), DRAM channels]

SYSTEM OVERVIEW
APU
GPU compute accesses must stay coherent
Diagram labels: GPU cluster, direct-access bus, CPU cluster, directory
(the direct-access bus is used for graphics), to DRAM, invalidation traffic. Arrow thickness indicates bandwidth.

SYSTEM OVERVIEW
GPU
Very high bandwidth: L2 has high miss rate
[Figure: GPU cluster of compute units (CUs), each with an L1, sharing the GPU L2 cache. CU detail labels: I-Fetch / Decode, Register File, Ex (execution units), Local Scratchpad Memory, Coalescer, To L1]

SYSTEM OVERVIEW
CPU
Low bandwidth: low L2 miss rate
[Figure: CPU cluster of CPU cores, each with an L1, sharing an L2. Label on the L2 output: To Dir]

CACHE ARCHITECTURE REMINDER
CPU/GPU L2 CACHE
Demand requests from L1 cache
‒Allocates an MSHR entry
‒Searches cache tags for a tag match
‒On a hit, return data to the L1
‒On a miss, send request to directory
On a directory probe, check MSHRs and tags
‒Tag hit on probe: send data to other core
[Figure: L2 cache pipeline. Labels: MSHR Entries, MSHRs, Cache Tag Arrays, Demand Requests, Hit, Miss, Core Data Responses, Miss Requests, Data Responses, Probe Requests, Coherent Network Interface]

DIRECTORY ARCHITECTURE REMINDER
DIRECTORY
Demand requests from L2 cache
‒Allocates an MSHR entry
‒Searches cache tags for a tag match
‒On a miss, the data comes from DRAM
Allocate and send probes to L2 caches
[Figure: directory pipeline. Labels: MSHR Entries, MSHRs, PR Entries, Probe Request RAM, Directory Tag Array, Block Requests, Block Probe Requests/Responses, Hit, Miss, To DRAM]

BACKGROUND SUMMARY
System under investigation
‒Heterogeneous CPU-GPU on chip
‒High-bandwidth DRAM
Directory pipeline complex
‒MSHR array is associative
‒Difficult to pipeline with more than 1 request per cycle
‒Important resources: MSHR entries

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
‒Simulation overview
‒Directory bandwidth
‒MSHRs
‒Performance is significantly affected
Heterogeneous System Coherence Details
Results
Conclusions

SIMULATION DETAILS
gem5 simulator
‒Simple CPU
‒GPU simulator based on AMD GCN
‒All memory requests through gem5
Workloads
‒Modified to use hUMA
‒Rodinia & AMD APP SDK

Parameter            Value
CPU Clock            2 GHz
CPU Cores            2
CPU Shared L2        2 MB (16-way banked)
GPU Clock            1 GHz
Compute Units        32
GPU Shared L2        4 MB (64-way banked)
L3 (Memory-side)     16 MB (16-way banked)
DRAM                 DDR3, 16 channels
Peak Bandwidth       700 GB/s
Baseline Directory   256k entries (8-way banked)

GPGPU BENCHMARKS
Rodinia benchmarks
‒ bp: trains the connection weights on a neural network
‒ bfs: breadth-first search
‒ hs: performs a transient 2D thermal simulation (5-point stencil)
‒ lud: matrix decomposition
‒ nw: performs a global optimization for DNA sequence alignment
‒ km: does k-means clustering
‒ sd: speckle-reducing anisotropic diffusion
AMD SDK
‒ bn: bitonic sort
‒ dct: discrete cosine transform
‒ hg: histogram
‒ mm: matrix multiplication

SYSTEM BOTTLENECKS
Difficult to scale directory bandwidth
‒Difficult to multi-port
‒Complicated pipeline
High resource usage
‒Must allocate MSHR for entire duration of request
‒MSHR array difficult to scale
[Figure: APU diagram with GPU cluster, CPU cluster, directory, and path to DRAM. Callouts: "Designed to support CPU bandwidth", "High bandwidth"]

DIRECTORY TRAFFIC
[Figure: directory accesses per GPU cycle (axis 0 to 4.5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Difficult to support >1 request per cycle

RESOURCE USAGE
[Figure: maximum MSHRs (log axis, 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Very difficult to scale MSHR array
Steady
state at 700 GB/s
Causes significant back-pressure on L2s

PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES
[Figure: slow-down (axis 0 to 5) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Back-pressure from limited MSHRs and bandwidth

BOTTLENECKS SUMMARY
Directory bandwidth
‒Must support up to 4 requests per cycle
‒Difficult to construct pipeline
Resource usage
‒MSHRs are a constraining resource
‒Need more than 10,000
‒Without resource constraints, up to 4x better performance

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
‒Overall system design
‒Region buffer design
‒Region directory design
‒Example
‒Hardware complexity
Results
Conclusions

BASELINE DIRECTORY COHERENCE
[Figure: APU diagram with GPU cluster, CPU cluster, directory, and path to DRAM. Phases shown: initialization, kernel launch, read result]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure: APU diagram with GPU cluster, CPU cluster, directory, and path to DRAM. Phases shown: initialization, kernel launch]

HETEROGENEOUS SYSTEM COHERENCE (HSC)
[Figure: APU diagram. Labels: GPU cluster with region buffer, CPU cluster with region buffer, direct-access bus, region directory, directory, to DRAM]
Region buffers coordinate with region directory

HSC: EXAMPLE MEMORY REQUEST
[Figure: a memory request traced through the GPU L2 cache, GPU region buffer, and region directory, shown next to the APU diagram (GPU cluster with region buffer, CPU cluster with region buffer, region directory, to DRAM)]

HSC: L2 CACHE & REGION BUFFER
[Figure: L2 cache pipeline next to the region buffer. Labels: MSHR Entries, MSHRs, Cache Tag Arrays, Region Buffer, Demand Requests, Hit, Miss, Core Data Responses, Miss Requests, Data Responses, Probe Requests, Direct Access Bus Interface, Coherent Network Interface]
Region buffer: region tags and permissions
Only region-level permission traffic
Interface for direct-access bus

HSC: REGION DIRECTORY
[Figure: directory pipeline with the Block Directory Tag Array replaced by a Region Directory Tag Array, and Block Requests replaced by Region Permission Requests. Other labels: MSHR Entries, MSHRs, PR Entries, Probe Request RAM, Block Probe Requests/Responses, Hit, Miss, To DRAM]
Region directory: region tags, sharers, and permissions

HSC: HARDWARE COMPLEXITY
Region protocols reduce directory size
‒Region directory: 8x fewer entries
Region buffers
‒At each L2 cache
‒1-KB region (16 64-B blocks)
‒16-K region entries
‒Overprovisioned for low-locality workloads

(a) Region Directory Entry
Field         Width
Region Tag    18 bits
State         2 bits
CPU, GPU      1 valid bit per cluster

(b) Region Buffer Entry
Field         Width
Region Tag    18 bits
State         2 bits
B0 ... B15    1 valid bit per block in the region

HSC SUMMARY
Key insight
‒GPU-CPU applications exhibit high spatial locality
‒Use direct-access bus present in systems
‒Offload bandwidth onto direct-access bus
Use coherence network only for permission
Add region buffer to track region information
‒At each L2 cache
‒Bypass coherence network and directory
Replace directory with region directory
‒Significantly reduces total size needed

OUTLINE
Motivation
Background
Heterogeneous System Bottlenecks
Heterogeneous System Coherence Details
Results
‒Speed-up
‒Latency of loads
‒Bandwidth
‒MSHR usage
Conclusions

THREE CACHE-COHERENCE PROTOCOLS
Broadcast: Null-directory that broadcasts on all requests
Baseline: Block-based, mostly inclusive, directory
HSC: Region-based directory with 1-KB region size

HSC PERFORMANCE
[Figure: normalized speed-up (axis 0 to 5) of Broadcast, Baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Largest slow-downs from constrained resources

DIRECTORY TRAFFIC REDUCTION
[Figure: normalized directory bandwidth (axis 0 to 1.2) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm, with a reference line for the theoretical reduction from 16-block regions]
Average bandwidth significantly reduced

HSC RESOURCE USAGE
[Figure: normalized directory MSHRs required (axis 0 to 0.25) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Maximum MSHRs significantly reduced

RESULTS SUMMARY
Used a detailed timing simulator for CPU and GPU
HSC significantly improves performance
‒Reduces the average load latency
‒Decreases bandwidth requirement of directory
HSC reduces the required MSHRs at the
directory

RELATED WORK
Coarse-grained coherence
‒Region coherence
  ‒ Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007]
  ‒ Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013]
‒Spatiotemporal coherence [Alisafaee, MICRO 2012]
‒Dual-grain directory coherence [Basu, UW-TR 2013]
  ‒ Primarily focused on directory size
GPU coherence [Singh et al. HPCA 2013]
‒Intra-GPU coherence

CONCLUSIONS
Hardware coherence can increase the utility of heterogeneous systems
Major bottlenecks in current coherence implementations
‒High bandwidth difficult to support at directory
‒Extreme resource requirements
We propose Heterogeneous System Coherence
‒Leverages spatial locality and region coherence
‒Reduces bandwidth by 94%
‒Reduces resource requirements by 95%

Questions?
Contact: powerjg@cs.wisc.edu

DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

Backup Slides

LOAD LATENCY
[Figure: normalized load latency (axis 0 to 4.5) of broadcast, baseline, and HSC for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
Average load time significantly reduced

EXECUTION TIME BREAKDOWN
[Figure: execution time (%) split between GPU and CPU for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm]
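
The region geometry on the hardware-complexity slide (1-KB regions of 16 64-B blocks, one valid bit per block) and the "theoretical reduction from 16 block regions" reference on the directory-traffic slide can be illustrated with a small sketch. This is a toy model under stated assumptions, not the protocol evaluated in the talk: `ToyRegionBuffer`, `split_address`, `streaming_sweep`, and the `has_permission` flag are hypothetical names, the buffer is unbounded (no evictions), there is a single cluster, and no probes or permission downgrades are modeled. It only counts how many L2 misses would need to reach the region directory when accesses have perfect spatial locality.

```python
from dataclasses import dataclass

BLOCK_BYTES = 64                      # block size given on the slide
BLOCKS_PER_REGION = 16                # 1-KB region = 16 x 64-B blocks
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION


def split_address(addr: int) -> tuple[int, int]:
    """Map a byte address to (region number, block index within the region)."""
    return addr // REGION_BYTES, (addr % REGION_BYTES) // BLOCK_BYTES


@dataclass
class RegionBufferEntry:
    # Stand-in for the 2-bit state field; the real encoding is not in the slides.
    has_permission: bool = False
    # 16-bit mask, one valid bit per block in the region (B0..B15).
    valid_bits: int = 0


class ToyRegionBuffer:
    """Hypothetical, unbounded region buffer that only counts request paths."""

    def __init__(self) -> None:
        self.entries: dict[int, RegionBufferEntry] = {}
        self.directory_requests = 0   # misses that must ask the region directory
        self.bypassed_misses = 0      # misses covered by an existing permission

    def l2_miss(self, addr: int) -> None:
        region, block = split_address(addr)
        entry = self.entries.get(region)
        if entry is None or not entry.has_permission:
            # First touch of the region: request region-level permission.
            self.directory_requests += 1
            entry = self.entries[region] = RegionBufferEntry(has_permission=True)
        else:
            # Permission already held: the miss bypasses the directory.
            self.bypassed_misses += 1
        entry.valid_bits |= 1 << block


def streaming_sweep(num_regions: int) -> ToyRegionBuffer:
    """Miss on every block of num_regions consecutive regions, in order."""
    buf = ToyRegionBuffer()
    for addr in range(0, num_regions * REGION_BYTES, BLOCK_BYTES):
        buf.l2_miss(addr)
    return buf


if __name__ == "__main__":
    buf = streaming_sweep(64)
    total = buf.directory_requests + buf.bypassed_misses
    print(f"{buf.directory_requests} of {total} misses reach the region directory")
```

With perfect locality, one miss in 16 reaches the region directory in this model, an ideal reduction of 1 - 1/16 = 93.75%. That is close to the 94% bandwidth reduction quoted in the conclusions, although the slides do not say the measured figure is derived this way.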