Set-Associative Cache Architecture

From Computer Organization and Design: The Hardware/Software Interface, 5th Edition, Chapter 5.

Performance Summary

When CPU performance increases, the miss penalty becomes more significant:
- Decreasing the base CPI means a greater proportion of execution time is spent on memory stalls.
- Increasing the clock rate means memory stalls account for more CPU cycles.
- Cache behavior therefore cannot be neglected when evaluating system performance, as the sketch below illustrates.
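To see why a shrinking base CPI makes the same stalls loom larger, here is a minimal sketch in Python. The miss rate, penalty, and access mix are assumed numbers chosen for illustration (the 36% load/store mix echoes the example used later in these slides); they are not measurements.

```python
# Minimal sketch (assumed numbers): the same memory-stall cycles consume a
# growing share of execution time as the base CPI shrinks.
miss_rate = 0.02           # assumed L1 miss rate
miss_penalty = 100         # assumed miss penalty, in cycles
accesses_per_instr = 1.36  # assumed: 1 instruction fetch + 0.36 loads/stores

stall_cpi = accesses_per_instr * miss_rate * miss_penalty

for base_cpi in (2.0, 1.0, 0.5):
    total_cpi = base_cpi + stall_cpi
    share = 100 * stall_cpi / total_cpi
    print(f"base CPI {base_cpi}: total CPI {total_cpi:.2f}, "
          f"{share:.0f}% of cycles are memory stalls")
```

With these numbers, the stall share grows from 58% to 84% as the base CPI drops from 2.0 to 0.5, even though the cache itself is unchanged.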

Review: Reducing Cache Miss Rates #1: Allow more flexible block placement

- In a direct-mapped cache, a memory block maps to exactly one cache block.
- At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache.
- The compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set, specified by the index field, and can be placed in any way of that set.

Comparing the two extremes and the compromise:
- Fully associative cache: a given block can go in any cache entry. A read requires all entries to be searched at once, which needs a comparator per entry (expensive).
- n-way set associative cache: each set contains n entries, and the block number determines which set holds the requested item: (block number) modulo (number of sets in the cache). Only the entries in one set are searched at once, needing just n comparators (less expensive). A sketch of the address decomposition follows below.

Spectrum of Associativity

[Figure: the spectrum of associativity for a cache with 8 entries, from direct mapped through 2-way and 4-way to fully associative.]

Four-Way Set Associative Cache

[Figure: a four-way set associative cache with 2^8 = 256 sets, each with four ways (each way holding one block). The address, bits 31..0, splits into a tag, an 8-bit set index, and a byte offset; the tags of Way 0 through Way 3 in the selected set are compared in parallel.]
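The following is a minimal sketch (mine, not from the slides) of that address split for the four-way configuration above. The 256-set, one-word-block parameters come from the figure; the function name and sample address are made up for illustration.

```python
# Minimal sketch: split an address into tag / set index / byte offset for a
# set associative cache. Assumes power-of-two set count and block size.
def decompose(addr, num_sets=256, block_bytes=4):
    offset_bits = block_bytes.bit_length() - 1  # log2(block_bytes)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)  # block number mod #sets
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = decompose(0x12345678)  # sample address, made up
print(f"tag={tag:#x} set={index} byte offset={offset}")
```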

Another Direct-Mapped Cache Example

Consider the main-memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). Words 0 and 4 map to the same cache block, so each reference evicts the other: all eight references miss.

8 requests, 8 misses. This is the ping-pong effect due to conflict misses: two memory locations that map into the same cache block repeatedly displace each other.

Set Associative Cache Example

- Q2: How do we find it? Use the next 1 low-order memory address bit (above the byte offset) to determine which cache set the block belongs to (i.e., the block number modulo the number of sets in the cache).
- Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache.
- Each word is 4 bytes, the two low-order bits select the byte in the word, and the cache has two sets.

2-Way Set Associative Cache Example

Consider the same main-memory word reference string, 0 4 0 4 0 4 0 4, again starting with an empty cache. The first references to 0 and 4 miss and fill the two ways of the set; everything after that hits: miss, miss, hit, hit, hit, hit, hit, hit.

8 requests, 2 misses. This solves the ping-pong effect that conflict misses cause in a direct-mapped cache, since two memory locations that map into the same cache set can now coexist (a simulation of both configurations is sketched below).
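Here is a minimal simulation sketch (mine, not from the slides) that replays the reference string against both organizations. The 4-block direct-mapped and 2-set, 2-way configurations are assumptions chosen so that words 0 and 4 conflict exactly as in the example.

```python
# Minimal sketch: count misses for a reference string of block numbers in a
# set associative cache with LRU replacement. ways=1 gives a direct-mapped
# cache; num_sets=1 would give a fully associative one.
def count_misses(refs, num_sets, ways):
    sets = [[] for _ in range(num_sets)]  # per set: resident blocks, LRU first
    misses = 0
    for block in refs:
        s = sets[block % num_sets]        # block number mod number of sets
        if block in s:
            s.remove(block)               # hit: refresh LRU position
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                  # evict the least recently used way
        s.append(block)                   # most recently used at the end
    return misses

refs = [0, 4, 0, 4, 0, 4, 0, 4]
print("direct mapped (4 blocks):", count_misses(refs, num_sets=4, ways=1))  # 8
print("2-way set associative:   ", count_misses(refs, num_sets=2, ways=2))  # 2
```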

Range of Set Associative Caches

For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit (the resulting field widths are sketched below).

[Figure: address fields (tag, index, block offset, byte offset). The tag is used for the tag compare, the index selects the set, and the offset selects the word in the block. Decreasing associativity ends at direct mapped (only one way: smaller tags, only a single comparator); increasing associativity ends at fully associative (only one set: the tag is all the address bits except the block and byte offset).]
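This minimal sketch (an illustration, not from the slides) tabulates how one bit migrates from the index to the tag as associativity doubles, using a 64 KB cache with 16-word blocks and an assumed 32-bit address.

```python
# Minimal sketch: index and tag widths for a fixed-size cache as
# associativity doubles. Assumes 32-bit addresses.
from math import log2

CACHE_BYTES = 64 * 1024  # 64 KB data cache
BLOCK_BYTES = 16 * 4     # 16-word blocks
offset_bits = int(log2(BLOCK_BYTES))

for ways in (1, 2, 4, 8):
    num_sets = CACHE_BYTES // (BLOCK_BYTES * ways)
    index_bits = int(log2(num_sets))
    tag_bits = 32 - index_bits - offset_bits
    print(f"{ways}-way: {num_sets:4d} sets, "
          f"index {index_bits} bits, tag {tag_bits} bits")
```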

How Much Associativity is Right?

- Increased associativity decreases the miss rate, but with diminishing returns.
- Simulation of a system with a 64 KB D-cache and 16-word blocks on SPEC2000 shows the miss rate shrinking from 1-way through 2-way, 4-way, and 8-way, with the largest step at the first doubling.

Costs of Set Associative Caches

An n-way set associative cache costs:
- n comparators (delay and area);
- a MUX delay for set selection before the data is available;
- data available only after set selection and the hit/miss decision. In a direct-mapped cache, by contrast, the cache block is available before the hit/miss decision, so the processor can assume a hit and continue, recovering later if it was a miss; with set associativity that shortcut is not possible.
- And when a miss occurs, which way's block do we pick for replacement?

Replacement Policy

- Direct mapped: no choice.
- Set associative: prefer a non-valid entry if there is one; otherwise, choose among the entries in the set.
- Least recently used (LRU) is common: choose the entry unused for the longest time. This is simple for 2-way (see the sketch below), manageable for 4-way, and too complicated beyond that.
- Random replacement gives about the same performance as LRU at high associativity.

Benefits of Set Associative Caches

The choice between direct mapped and set associative depends on the cost of a miss versus the cost of implementation. The largest gains come from going from direct mapped to 2-way (a 20%+ reduction in miss rate).
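For 2-way associativity, the LRU state can be a single bit per set. This is a minimal sketch of that bookkeeping (an assumed implementation for illustration; the names are made up):

```python
# Minimal sketch: one LRU bit per set suffices for a 2-way cache.
# lru[s] names the way in set s that was used least recently, i.e., the
# victim to replace on the next miss in that set.
NUM_SETS = 256
lru = [0] * NUM_SETS

def touch(set_idx, way):
    """Record a hit (or fill) of `way`: the other way becomes the victim."""
    lru[set_idx] = 1 - way

def victim(set_idx):
    """Way to replace on a miss in this set."""
    return lru[set_idx]
```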

Reducing Cache Miss Rates #2: Use multiple levels of caches

With advancing technology, there is more than enough room on the die for bigger L1 caches or for a second level of cache, normally a unified L2 cache (it holds both instructions and data), and in some cases even a unified L3 cache.

For our example, take a CPI_ideal of 2, a 100-cycle miss penalty (to main memory), a 25-cycle miss penalty (to the unified L2$), 36% load/stores, a 2% L1 I$ miss rate, a 4% L1 D$ miss rate, and a 0.5% UL2$ miss rate:

CPI_stalls = 2 + 0.02(25) + 0.36(0.04)(25) + 0.005(100) + 0.36(0.005)(100) = 3.54

as compared to 5.44 with no L2$ (the arithmetic is worked below).
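The slide's CPI arithmetic, spelled out as a short Python check (the variable names are mine):

```python
# Worked check of the two-level CPI example above.
cpi_ideal = 2.0
ld_st = 0.36              # fraction of instructions that are loads/stores
l1i_miss, l1d_miss = 0.02, 0.04
l2_miss = 0.005           # unified L2 miss rate (global, per reference)
l2_penalty, mem_penalty = 25, 100

# With the L2: L1 misses pay the L2 penalty; L2 misses also pay main memory.
cpi_with_l2 = (cpi_ideal
               + l1i_miss * l2_penalty + ld_st * l1d_miss * l2_penalty
               + l2_miss * mem_penalty + ld_st * l2_miss * mem_penalty)

# Without the L2: every L1 miss goes straight to main memory.
cpi_no_l2 = cpi_ideal + l1i_miss * mem_penalty + ld_st * l1d_miss * mem_penalty

print(f"CPI with L2:    {cpi_with_l2:.2f}")  # 3.54
print(f"CPI without L2: {cpi_no_l2:.2f}")    # 5.44
```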

Multilevel Cache Design Considerations

Design considerations for L1 and L2 caches are different:
- The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller capacity with smaller block sizes.
- The secondary cache(s) should focus on reducing the miss rate, to cut the penalty of long main-memory access times: larger capacity with larger block sizes and higher levels of associativity.
- The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) but have a higher miss rate.
- For the L2 cache, hit time is less important than miss rate: the L2$ hit time determines the L1$'s miss penalty!

Memory Sort Example

[Figure: comparing Quicksort and Radix Sort by (a) instructions executed per item sorted, (b) time per item sorted, and (c) cache misses per item sorted. This data is from a paper by LaMarca and Ladner [1996]. Although the numbers would change for newer computers, the idea still holds. Due to such results, new versions of Radix Sort have been invented that take the memory hierarchy into account, to regain its algorithmic advantages (see Section ). Copyright 2009 Elsevier, Inc. All rights reserved.]

The basic idea of cache optimizations is to use all the data in a block repeatedly before it is replaced on a miss, as the traversal sketch below illustrates.
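A minimal sketch of that idea (mine, not from the slides): over a row-major, contiguous array, unit-stride traversal uses every word of a cache block before moving on, while large-stride traversal touches a different block on almost every reference. Python here stands in for the access pattern; the miss behavior described in the comments is what a C-like contiguous array would experience.

```python
# Minimal sketch: two traversal orders over the same row-major data.
N = 1024
a = [0] * (N * N)  # row-major storage: element (i, j) lives at a[i*N + j]

def unit_stride_sum():
    # Consecutive addresses: all words of each cache block are used
    # before the block is replaced.
    return sum(a[i * N + j] for i in range(N) for j in range(N))

def strided_sum():
    # Stride-N addresses: for large N, almost every reference lands in a
    # different block, so blocks are replaced before being reused.
    return sum(a[i * N + j] for j in range(N) for i in range(N))
```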

Two Machines' Cache Parameters

[Figure: table of cache parameters for the two machines examined next, the ARM Cortex-A8 and the Intel Core i7 920.]

Cortex-A8 Data Cache Miss Rates

[FIGURE 5.45: Data cache miss rates for the ARM Cortex-A8 when running Minnespec, a small version of SPEC2000. Applications with larger memory footprints tend to have higher miss rates in both L1 and L2. Note that the L2 rate is the global miss rate, that is, counting all references, including those that hit in L1 (see the Elaboration in Section ). Mcf is known as a cache buster.]

Intel Core i7 920 Data Cache Miss Rates

[Figure: L1, L2, and L3 data cache miss rates for the Intel Core i7 920 running the full integer SPECCPU2006 benchmarks.]

Summary: Improving Cache Performance

1. Reduce the time to hit in the cache:
- smaller cache, direct-mapped cache, smaller blocks;
- for writes: either no write allocate (no "hit" on the cache, just write to the write buffer), or write allocate (and, to avoid the two cycles of first checking for a hit and then writing, pipeline writes via a delayed write buffer into the cache).

2. Reduce the miss rate:
- bigger cache;
- more flexible placement (increase associativity);
- larger blocks (16 to 64 bytes is typical);
- a victim cache: a small buffer holding the most recently replaced blocks.

3. Reduce the miss penalty:
- smaller blocks;
- use a write buffer to hold dirty blocks being replaced, so a read need not wait for the write to complete;
- check the write buffer (and/or the victim cache) on a read miss; it may hold the needed block;
- for large blocks, fetch the critical word first;
- use multiple cache levels; an L2 cache is often not tied to the CPU clock rate.

Summary: The Cache Design Space

Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back. The optimal choice is a compromise: it depends on access characteristics of the workload, which are hard to predict, and on technology and cost; simplicity often wins.

[Figure: design-space trade-off sketch, plotting cache size and block size from less to more against bad-to-good behavior for two competing factors A and B.]

Concluding Remarks

Fast memories are small, and large memories are slow: we want fast, large memories. :(

Caching gives this illusion. :)

Principle of locality: programs use a small part of their memory space frequently.

Memory hierarchy: L1 cache <-> L2 cache <-> ... <-> DRAM memory <-> disk.

