Memory Hierarchy
Arquitectura de Computadoras (Computer Architecture)
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información
adiaz@cinvestav.mx

The Big Picture: Where are We Now?

The Five Classic Components of a Computer:
■ Processor: Control and Datapath
■ Memory
■ Input
■ Output

Today's Topics:
■ Locality and Memory Hierarchy
■ Memory Organization

Memory Trends

Memory Technology    Typical Access Time         $ per GB in 2004
SRAM                 0.5-5 ns                    $4000-$10,000
DRAM                 50-70 ns                    $100-$200
Magnetic Disk        5,000,000-20,000,000 ns     $0.50-$2

Technology Trends

          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years

DRAM generations (Size: 1000:1! Cycle Time: 2:1!)

Year    Size      Cycle Time
1980    64 Kb     250 ns
1983    256 Kb    220 ns
1986    1 Mb      190 ns
1989    4 Mb      165 ns
1992    16 Mb     145 ns
1995    64 Mb     120 ns
2004    2 GB      50-70 ns

Who Cares About the Memory Hierarchy?

Figure: Processor-DRAM Memory Gap (latency). Performance (log scale, 1 to 1000) versus time (1980 to 2000).
■ Processor-Memory Performance Gap: grows 50% / year
■ CPU ("Moore's Law"): µProc 60%/yr. (2X/1.5 yr)
■ DRAM ("Less' Law?"): 9%/yr.
(2X/10 yrs)

Current Microprocessor

Rely on caches to bridge the microprocessor-DRAM performance gap
■ Time of a full cache miss, in instructions executed:
  ● 1st Alpha (7000): 340 ns / 5.0 ns = 68 clks x 2, or 136 instructions
  ● 2nd Alpha (8400): 266 ns / 3.3 ns = 80 clks x 4, or 320 instructions
  ● 3rd Alpha: 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
■ 1/2X latency x 3X clock rate x 3X Instr/clock ⇒ ~5X

Impact on Performance

Suppose a processor executes at
■ Clock Rate = 200 MHz (5 ns per cycle)
■ CPI = 1.1
■ 50% arith/logic, 30% ld/st, 20% control

Suppose that 10% of memory operations get a 50-cycle miss penalty

CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles) + (0.30 (data mops/instr) x 0.10 (miss/data mop) x 50 (cycles/miss))
    = 1.1 cycles + 1.5 cycles
    = 2.6

58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!

Figure (pie chart of the CPI breakdown): Ideal CPI (1.1) 35%; DataMiss (1.6) 49%; Inst Miss (0.5) 16%

The Goal: illusion of large, fast, cheap memory

Fact: Large memories are slow, fast memories are small

How do we create a memory that is large, cheap and fast (most of the time)?
■ Hierarchy
■ Parallelism

An Expanded View of the Memory System

Figure: the Processor (Control and Datapath) followed by a chain of Memory levels. Moving away from the processor:
■ Speed: Fastest to Slowest
■ Size: Smallest to Biggest
■ Cost: Highest to Lowest

Why hierarchy works

The Principle of Locality:
■ Programs access a relatively small portion of the address space at any instant of time.
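The principle of locality can be made concrete with a toy simulation. The sketch below is not from the slides: it assumes a small fully associative cache with LRU replacement, and the names `hit_rate`, `local_trace` and `scattered_trace`, as well as the cache and block sizes, are hypothetical choices made only for illustration. It compares a trace that keeps sweeping over a small array with a trace that jumps around a large address space.

```python
from collections import OrderedDict


def hit_rate(addresses, num_blocks, block_words):
    """Fraction of accesses that hit in a toy fully associative LRU cache.

    Assumption (not from the slides): the cache holds `num_blocks` blocks of
    `block_words` contiguous words each and evicts the least recently used block.
    """
    cache = OrderedDict()  # block number -> None, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // block_words      # neighbouring words share a block
        if block in cache:
            hits += 1
            cache.move_to_end(block)     # mark as most recently used
        else:
            cache[block] = None          # bring the whole block in on a miss
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / len(addresses)


# A program with good locality: sweep a 64-word array ten times.
local_trace = [i for _ in range(10) for i in range(64)]

# A program with poor locality: 640 accesses spread over a 2**20-word space.
scattered_trace = [(i * 4099) % (1 << 20) for i in range(640)]

print(hit_rate(local_trace, num_blocks=16, block_words=4))      # 0.975
print(hit_rate(scattered_trace, num_blocks=16, block_words=4))  # 0.0
```

With the same 64-word cache, the looping trace misses only on the first touch of each block, while the scattered trace never finds its data in the cache. The hierarchy pays off only because real programs tend to behave like the first trace.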
Figure: probability of reference plotted over the address space, from 0 to 2^n - 1.

Memory Hierarchy: How Does it Work?

■ Temporal Locality (Locality in Time): clustering in time. Items referenced in the immediate past have a high probability of being re-referenced in the immediate future.
  => Keep the most recently accessed data items closer to the processor
■ Spatial Locality (Locality in Space): clustering in space. Items located physically near an item referenced in the immediate past have a high probability of being referenced in the immediate future.
  => Move blocks consisting of contiguous words to the upper levels

Figure: the processor exchanges data (To Processor / From Processor) with the Upper Level Memory, which holds Blk X; the Lower Level Memory holds Blk Y.

Visualizing Locality

The memory map [Hatfield and Gerald 1971]

Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:
■ Present the user with as much memory as is available in the cheapest technology.
■ Provide access at the speed offered by the fastest technology.

Level                                                  Speed (ns)                     Size (bytes)
Registers (in the processor)                           1s                             100s
On-Chip Cache and Second Level Cache (SRAM)            10s                            Ks
Main Memory (DRAM)                                     100s                           Ms
Secondary Storage (Disk)                               10,000,000s (10s ms)           Gs
Tertiary Storage (Tape)                                10,000,000,000s (10s sec)      Ts

How is the hierarchy managed?

Registers <-> Memory
■ by compiler (programmer?)
Cache <-> Memory
■ by the hardware

Memory <-> Disks
■ by the hardware and operating system (virtual memory)
■ by the programmer (files)

Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)
■ Hit Rate: the fraction of memory accesses found in the upper level
■ Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)
■ Miss Rate = 1 - (Hit Rate)
■ Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor

Hit Time << Miss Penalty

Figure: the processor exchanges data (To Processor / From Processor) with the Upper Level Memory, which holds Blk X; the Lower Level Memory holds Blk Y.

Memory Hierarchy Technology

Random Access:
■ "Random" is good: access time is the same for all locations
■ DRAM: Dynamic Random Access Memory
  ● High density, low power, cheap, slow
  ● Dynamic: needs to be "refreshed" regularly
■ SRAM: Static Random Access Memory
  ● Low density, high power, expensive, fast
  ● Static: content will last "forever" (until power is lost)

"Not-so-random" Access Technology:
■ Access time varies from location to location and from time to time
■ Examples: Disk, CDROM

Sequential Access Technology: access time linear in location (e.g., Tape)

We will concentrate on random access technology
■ The Main Memory: DRAMs + Caches: SRAMs

Main Memory Background

Performance of Main Memory:
■ Latency: Cache Miss Penalty
  ● Access Time: time between the request and the arrival of the word
  ● Cycle Time: time between requests
■ Bandwidth: I/O & Large Block Miss Penalty (L2)

Main Memory is DRAM: Dynamic Random Access Memory
■ Dynamic since it needs to be refreshed periodically (8 ms)
■ Addresses divided into 2 halves
(Memory as a 2D matrix):
  ● RAS or Row Access Strobe
  ● CAS or Column Access Strobe

Cache uses SRAM: Static Random Access Memory
■ No refresh (6 transistors/bit vs. 1 transistor)

Size ratio, DRAM/SRAM: 4-8
Cost and cycle time ratio, SRAM/DRAM: 8-16

Typical SRAM Organization

Figure: a 16x4 SRAM memory block with data inputs Din 3 to Din 0, write enable WrEn, address inputs A0 to A3, and data outputs Dout 3 to Dout 0.

Typical SRAM Organization: 16-word x 4-bit

Figure: Din 3 to Din 0 each feed a write driver and precharger (controlled by WrEn and Precharge). An address decoder on A0 to A3 selects one of the rows Word 0 to Word 15, each made of four SRAM cells. A sense amplifier at the bottom of each column drives Dout 3 to Dout 0.

Classical DRAM Organization

Figure: a RAM cell array. The row decoder, driven by the row address, activates one word (row) select line; the bit (data) lines run to the column selector and I/O circuits, driven by the column address, which deliver the data. Each intersection represents a 1-T DRAM cell.

Row and Column Address together:
■ Select 1 bit at a time

Main Memory Performance

Simple:
■ CPU, Cache, Bus, Memory same width (32 bits)

Wide:
■ CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)

Interleaved:
■ CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

Summary First Part

Two Different Types of Locality:
■ Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
■ Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon.

By taking advantage of the principle of locality:
■ Present the user with as much memory as is available in the cheapest technology.
■ Provide access at the speed offered by the fastest technology.

DRAM is slow but cheap and dense:
■ Good choice for presenting the user with a BIG memory system

SRAM is fast but expensive and not very dense:
■ Good choice for providing the user FAST access time.
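The cost of missing in the fast level can be checked against the numbers on the Impact on Performance slide (ideal CPI 1.1, 30% loads/stores, 10% data miss rate, 50-cycle miss penalty). The sketch below only redoes that arithmetic; the function names `effective_cpi` and `stall_fraction` are hypothetical, and it assumes, as the slide appears to, exactly one instruction fetch per instruction when an instruction miss rate is given.

```python
def effective_cpi(ideal_cpi, data_ops_per_instr, data_miss_rate,
                  miss_penalty_cycles, instr_miss_rate=0.0):
    """CPI = ideal CPI + average memory stall cycles per instruction."""
    data_stalls = data_ops_per_instr * data_miss_rate * miss_penalty_cycles
    # Assumption: one instruction fetch per instruction executed.
    instr_stalls = instr_miss_rate * miss_penalty_cycles
    return ideal_cpi + data_stalls + instr_stalls


def stall_fraction(ideal_cpi, total_cpi):
    """Share of execution time spent stalled waiting for memory."""
    return (total_cpi - ideal_cpi) / total_cpi


# Data misses only, as in the slide's main calculation.
cpi = effective_cpi(1.1, 0.30, 0.10, 50)
print(round(cpi, 2))                       # 2.6
print(round(stall_fraction(1.1, cpi), 3))  # 0.577, the slide's "58%"

# Adding a 1% instruction miss rate costs another 0.5 cycles per instruction.
cpi_with_instr_misses = effective_cpi(1.1, 0.30, 0.10, 50, instr_miss_rate=0.01)
print(round(cpi_with_instr_misses, 2))     # 3.1
```

Lowering `data_miss_rate` in this sketch shows how quickly the CPI falls back toward the ideal 1.1, which is the quantitative reason for placing a small, fast SRAM level in front of a large DRAM.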