CPU and its Cache
In this post, we'll dive into CPU caches — tiny but powerful memory units that play a huge role in how fast our computers run. We'll look at what they are, why they're needed, how they work across different levels, and what affects their performance. The goal is to give programmers, like me, a clear, practical overview without getting too overwhelming.
I'm writing this from a software engineer's perspective, so while I've done my best to be accurate, there may be some gaps.
If you spot something off or have insights to add or change, I'd love to hear your feedback in the comments or by email.
In the end, it's just an informative blog post, not a formal paper, so contributions are welcome!
1. Why Caches Became Essential
First things first - let's understand why we need caches in our CPUs.
For that purpose, let's imagine a world without them — like in the early days of computing.
Back then, whenever the CPU needed data to execute an instruction, it had to go all the way out to main memory (RAM), fetch the required data, and bring it back before the instruction could run.
The Problem with Direct RAM Access
This approach is not inherently bad, but you can clearly see the problem here: the execution unit has to wait for the data to arrive from RAM before it can do its job, which wastes a huge number of clock cycles.
To be fair, in the early days CPUs and RAM worked at similar speeds, so this wasn't a big issue.
But starting in the 1990s, CPU speeds skyrocketed while RAM didn't keep up.
Suddenly, you had processors running billions of operations per second but constantly stuck waiting hundreds of cycles just to get data from memory.
Think of it like a world-class chef who can cook lightning-fast but has to walk to a far-away pantry for every single ingredient—even if it's something they just used a second ago. The constant trips waste time and completely defeat the purpose of having such a fast chef.
Every time the CPU needed something, the request had to travel across the Front Side Bus, through the memory controller (in the Northbridge), to the RAM, and then all the way back.
That round trip was painfully slow compared to the CPU's pace, creating a huge bottleneck.
Modern CPUs (since the late 2000s) solved the bus bottleneck by integrating the memory controller directly onto the CPU die.
Now RAM connects straight to the processor through dedicated memory channels, eliminating that long journey through the Front Side Bus and Northbridge.
But even with this improvement, RAM access still takes hundreds of CPU cycles — the speed gap between processors and memory remains the fundamental problem, not the path the data takes.
The Birth of Caches: Using Locality
Fortunately, computer engineers found that programs tend to follow certain patterns, known as locality:
- Temporal Locality: If data was used recently, it'll probably be needed again soon.
- Spatial Locality: If some data was used, nearby data will likely be used next.
Engineers realized they could take advantage of these patterns. Instead of sending the CPU to slow RAM every time, they placed a small amount of very fast memory close to the CPU — the cache.
In simple words, when the CPU needs data, it first checks this cache. If the data is there (a cache hit), it's served almost instantly. If not (a cache miss), the CPU falls back to RAM. This dramatically reduces wait times and keeps the CPU busy instead of idle.
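To make these two patterns concrete, here's a tiny C sketch (the function and variable names are just for illustration):

```c
#include <stddef.h>

/* Summing an array shows both kinds of locality:
 * - temporal: `sum` and `i` are reused on every iteration;
 * - spatial:  data[i] and data[i + 1] sit next to each other in memory, so
 *   one cache-line fill from RAM serves several of the following iterations. */
long sum_array(const int *data, size_t n)
{
    long sum = 0;                /* reused every iteration (temporal locality) */
    for (size_t i = 0; i < n; i++)
        sum += data[i];          /* sequential access (spatial locality)       */
    return sum;
}
```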
Why Not Just Make Faster RAM?
Now you might ask: if caches are so great, why not just make main memory as fast as the cache? It's technically possible, but impractical for two reasons:
- Cost — Cache uses SRAM, which is much faster but also much more expensive than the DRAM used in RAM. Building gigabytes of SRAM would be wildly unaffordable.
- Practicality — Even if you had a small machine with superfast RAM, it wouldn't help much once the program's working set (the data it actively uses) gets larger than that memory. At that point, the system would need to use even slower storage like a hard drive or SSD, which would be disastrous for performance.
That's why computers use a layered memory system: a large, affordable main memory (DRAM) paired with smaller, ultra-fast caches (SRAM). This gives us the best balance of speed, size, and cost.
2. The Power of Caches in Action
Imagine this setup: accessing main memory (RAM) takes 300 cycles, accessing the cache takes 20 cycles, and a program processes 50 data elements, using each element 50 times.
Without a cache:
The CPU goes to RAM every time. Total cycles:
50 × 50 × 300 = 750,000 cycles
That's three quarters of a million cycles spent waiting for memory.
With a cache:
- First access to each element = cache miss → load from RAM (300 cycles).
- Remaining 49 accesses per element = cache hits → served from cache (20 cycles).
So the math looks like this:
Cache misses: 50 × 300 = 15,000 cycles
Cache hits: 50 × 49 × 20 = 49,000 cycles
Total (with cache) = 15,000 + 49,000 = 64,000 cycles
The result:
- Without cache: 750,000 cycles
- With cache: 64,000 cycles
- Performance improvement: ~91% fewer cycles (roughly an 11.7× speedup)
That's a massive reduction in CPU wait time. This example shows how even a small cache can drastically cut down memory access times by leveraging locality. The CPU spends far less time waiting and much more time doing actual work.
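If you want to play with these numbers yourself, here's a small C snippet that simply reproduces the arithmetic above; the 300- and 20-cycle latencies are the same made-up figures, not measurements:

```c
#include <stdio.h>

int main(void)
{
    const long ram_cycles = 300, cache_cycles = 20;
    const long elements = 50, uses_per_element = 50;

    long without_cache = elements * uses_per_element * ram_cycles;
    long with_cache    = elements * ram_cycles                             /* one miss per element */
                       + elements * (uses_per_element - 1) * cache_cycles; /* the rest are hits    */

    printf("without cache: %ld cycles\n", without_cache);   /* 750000 */
    printf("with cache:    %ld cycles\n", with_cache);      /* 64000  */
    printf("speedup:       %.1fx\n", (double)without_cache / with_cache);
    return 0;
}
```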
3. Cache Size and the "Working Set"
Caches are super fast but much smaller than RAM.
For example, a workstation might have 4 MB of CPU cache and 4 GB of main memory — a 1:1000 ratio.
Because the cache can only hold a tiny fraction of what's in RAM, it can't store everything — it has to be selective.
This size constraint introduces the idea of a program's working set: the data and instructions it actively uses right now.
If the working set fits in the cache, most accesses are cache hits. You get high hit rates and great performance — no problem with the limited cache size.
In real-world systems, especially with many processes running, the combined working set usually exceeds the cache. Then the CPU must constantly choose what to keep and what to evict. When needed data has been evicted, you get cache misses and must fetch from slower RAM, which reduces the performance benefit.
Balancing the working set and the cache is a joint effort:
- Hardware designers pick cache sizes, levels, and replacement policies.
- Software developers write code that improves locality — reusing data and accessing nearby memory to raise hit rates.
When both do their jobs well, the working set fits nicely in cache and the program flies. When they don't, you get cache misses—and the CPU is back to waiting hundreds of cycles for RAM.
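As a quick illustration of what "writing code that improves locality" means in practice, here's a classic C example: both functions compute the same sum over the same matrix, but the traversal order decides whether you work with the cache or against it (the matrix size is arbitrary):

```c
#include <stddef.h>

#define N 1024

/* Row-major walk: touches memory sequentially, so every fetched cache line
 * is fully used before moving on.                                           */
long sum_row_major(int m[N][N])
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += m[i][j];      /* cache friendly */
    return sum;
}

/* Column-major walk: jumps N * sizeof(int) bytes per step, so roughly every
 * access pulls a new line into the cache and evicts one it will need again. */
long sum_col_major(int m[N][N])
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += m[i][j];      /* cache hostile  */
    return sum;
}
```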
4. Cache Organization and Hierarchy: A Multi-Layered Approach
Okay. Now that we know cache size is limited, hardware designers faced a key question:
how do we organize these small, fast memories to get the most benefit?
The solution they found is a multi-level cache hierarchy.
Instead of one cache, modern CPUs use several layers of caches, each with different sizes and speeds, working together to bridge the gap between the CPU and main memory.
Cache Hierarchy in Modern CPUs
L1 Cache (Level 1)
- Fastest and smallest cache, located directly on the CPU core.
- Usually split into two parts:
- L1d — data cache storing information the CPU uses.
- L1i — instruction cache storing the instructions the CPU is about to execute.
- Separating data and instructions improves performance since their access patterns differ. L1 is accessed in just a few CPU cycles.
L2 Cache (Level 2)
- Larger and slightly slower than L1 but much faster than main memory.
- Stores data and instructions that don't fit in L1 but are still frequently used. Often unified, meaning it handles both data and instructions.
L3 Cache (Level 3)
- Largest and slowest on-chip cache, yet still faster than main memory.
- Typically shared across multiple CPU cores. On some designs it acts as a "victim cache" for L2, holding data evicted from L2 but still potentially needed.
By keeping frequently used data closer to the CPU at the right speed, the memory hierarchy minimizes those slow trips to main memory. For programmers, this mostly works behind the scenes — but understanding it is the difference between code that flies and code that stalls. When you write with locality in mind, you're working with the cache instead of against it, and that can make your programs orders of magnitude faster.
5. Multi-Core and Multi-Thread Cache Sharing: The Complexities of Concurrent Access
Modern processors aren't just faster — they're more parallel, with multiple cores on a single chip. Each core is essentially an independent processing unit that can execute its own instructions simultaneously. We won't dive deep into the mechanics of cores and threads here, but the key point is this: when multiple cores share the same caches, things get complicated. They need to coordinate access, maintain consistency, and avoid stepping on each other's toes, which adds a whole new layer of complexity to cache design. Two key levels of parallelism are important to understand:
Cores
A single CPU often contains multiple cores. Each core is largely independent and usually has its own dedicated L1 data (L1d) and L1 instruction (L1i) caches. This means that different cores executing different code can operate with minimal interference at the L1 level.
Threads (Hyper-threads / SMT)
Some architectures support multiple threads per core, often called hyper-threads or Simultaneous Multi-Threading (SMT). Unlike separate cores, these threads share nearly all of the core's resources, including the L1 caches. Each thread has its own registers, but they contend for the same L1 cache space. If two threads on the same core access different data, they may evict each other's data from the cache, leading to cache pollution.
Beyond L1
L2 cache is typically private to each core, giving it fast access without contention from other cores. L3 cache (Last Level Cache) is shared across all cores within the CPU, acting as a common pool for frequently accessed data. In multi-socket systems (where multiple physical CPU packages sit on the same motherboard), each socket has its own complete cache hierarchy (L1, L2, L3). Communication between CPUs in different sockets must traverse a slower interconnect (like Intel's UPI or AMD's Infinity Fabric), making cross-socket memory access significantly more expensive.
Understanding this hierarchical sharing pattern is crucial for programmers. It affects how data is arranged and threads are scheduled to reduce conflicts and maximize cache locality, especially in parallel applications.
6. Cache Operation at a Granular Level & Cost of Hits and Misses
CPUs don't fetch individual bytes or words from memory—they work with larger blocks called cache lines.
- Cache Line Granularity: Modern caches typically operate on 64-byte lines (sometimes 32 or 128 bytes, depending on the architecture). When you access a single byte, the entire 64-byte block containing it gets loaded into cache. This leverages spatial locality: if you need one byte, you'll probably need nearby bytes soon. It also aligns with how RAM transfers data efficiently in bursts. (The short sketch after this list shows how to query these sizes on your own machine.)
- Cacheable vs. Uncacheable Memory: Most normal program data and instructions are cached automatically. However, certain memory regions are marked uncacheable, typically memory-mapped hardware registers or DMA buffers where caching would cause incorrect behavior (e.g., the CPU cache could become out of sync with what a device wrote directly to RAM).
- Address Mapping (Tag, Index, Offset): To locate data, the CPU splits each memory address into three fields:
- Offset (lowest bits): Identifies the specific byte within the 64-byte cache line.
- Index (middle bits): Selects which cache set to check. (We'll cover sets and associativity shortly.)
- Tag (highest bits): The unique identifier for that particular memory block. The CPU compares this tag against stored tags to determine if the requested data is in cache (a hit) or not (a miss).
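As promised above, here's a small sketch for querying the cache geometry of the machine you're on. The _SC_LEVEL* names are Linux/glibc extensions, so this won't compile everywhere, and some systems report 0 when the information isn't available:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc-specific sysconf() keys; reported values are in bytes */
    printf("L1d line size: %ld\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d size:      %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 size:       %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 size:       %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
```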
Cost of Cache Hits and Misses
Memory access speed varies dramatically depending on where data is found. Here are approximate latencies for a modern x86 processor (actual numbers vary by generation and workload):
- Register Access: ~1 cycle — data is already in the CPU's registers, the fastest storage available.
- L1 Cache Hit: ~4-5 cycles — extremely fast, making L1 performance critical.
- L2 Cache Hit: ~12-15 cycles — noticeably slower than L1 but still fast.
- L3 Cache Hit: ~40-50 cycles — acts as a shared backstop before hitting main memory.
- Main Memory (DRAM): ~200-300+ cycles — orders of magnitude slower than cache. This is why cache misses are so expensive.
These timings can vary based on factors like memory contention, prefetching effectiveness, and TLB (Translation Lookaside Buffer) hits or misses. The key takeaway: cache hits are cheap, cache misses are catastrophic. The performance gap between L1 and RAM is roughly 50-100x, which is why optimizing for cache locality is one of the highest-leverage performance techniques available to programmers.
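If you'd like to see these latency steps on your own machine, here's a rough pointer-chasing sketch in C. It's a toy benchmark, not a rigorous one: the randomized walk defeats the hardware prefetcher, so nanoseconds per access roughly track raw cache/memory latency as the working set grows past L1, L2, and L3:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Walks a single random cycle through an array of indices and reports the
 * average time per access. Sattolo's shuffle guarantees one big cycle.     */
static double chase_ns(size_t n_elems, size_t steps)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return -1.0;
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {      /* Sattolo's algorithm */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    volatile size_t idx = 0;                        /* keeps the chain alive */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) idx = next[idx];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}

int main(void)
{
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 4)   /* 16 KiB ... 64 MiB */
        printf("%6zu KiB working set: %.1f ns/access\n",
               kb, chase_ns(kb * 1024 / sizeof(size_t), 10 * 1000 * 1000));
    return 0;
}
```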
7. Address Splitting for Cache Access: Locating Data in the Hierarchy
To efficiently locate data in a multi-level cache, the CPU splits memory addresses into distinct components. This helps determine if a requested piece of data is in the cache and where.
- Offset (Block Offset): Lowest bits of the address specify the byte position within the cache line. For a 64-byte cache line, the lowest 6 bits (2^6 = 64) serve as the offset.
- Index (Set Field): Middle bits above the offset determine which cache set to search. In set-associative caches, each set holds multiple cache lines (e.g., 8-way associativity = 8 lines per set). The index narrows the search to a specific set.
- Tag (Tag Field): Highest bits of the address uniquely identify which memory block is stored. When the CPU searches a set, it compares the requested tag with the tags of all lines in that set. A match = cache hit; no match = cache miss, requiring access to a slower memory level.
This three-part address splitting allows caches to be both fast and practical. The index directs the CPU quickly to the right set, while the tag confirms that the correct block of memory is present, enabling efficient management of limited high-speed storage.
Example: How a Memory Address is Split for Cache Access
Suppose we have a CPU with the following cache configuration:
- Cache line size: 64 bytes
- Cache organization: 8-way set associative (8 lines per set)
- Total cache size: 32 KB
- Number of sets: 32 KB ÷ (64 bytes/line × 8 lines/set) = 64 sets
Now imagine the CPU wants to access memory address 0x0001A2C0 (32-bit address).
Step 1: Convert to binary
0x0001A2C0 = 0000 0000 0000 0001 1010 0010 1100 0000
Step 2: Split the address
For 64 sets, we need log₂(64) = 6 bits for the index.
For 64-byte lines, we need log₂(64) = 6 bits for the offset.
Tag (20 bits): 0000 0000 0000 0001 1010 → 0x0001A
Index (6 bits): 001011 → set 11
Offset (6 bits): 000000 → byte 0
Breaking it down:
- Offset: 000000 (binary) = 0 (decimal) — we're accessing the first byte in the cache line.
- Index: 001011 (binary) = 11 (decimal) — the CPU searches set #11.
- Tag: 0x0001A — this identifies which specific memory block should be in that set.
Step 3: How the CPU checks the cache
- Go to set #11 in the cache.
- Compare the tag 0x0001A with the tags of all 8 lines in that set (since it's 8-way associative).
- If a match is found → cache hit → the CPU reads the data from that line.
- If no match → cache miss → the CPU fetches the entire 64-byte block from the next level (L2, L3, or RAM) and stores it in one of the 8 lines in set #11, potentially evicting an existing line using a replacement policy (LRU, random, etc.).
This example shows exactly how the CPU splits a memory address into tag, index, and offset to efficiently locate data in a set-associative cache.
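Here's the same split expressed as a few lines of C, using the geometry from this example (6 offset bits, 6 index bits); it prints tag 0x1A, set 11, offset 0 for the address above:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6   /* 64-byte lines */
#define INDEX_BITS  6   /* 64 sets       */

int main(void)
{
    uint32_t addr   = 0x0001A2C0;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr=0x%08X -> tag=0x%X, set=%u, offset=%u\n",
           addr, tag, index, offset);
    return 0;
}
```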
8. Cache Write Operations: Ensuring Data Integrity and Performance
When the CPU writes data, caches introduce extra complexity. How writes are handled affects both performance and data integrity.
Write Allocation and Partial Writes
When the CPU writes to memory, the cache must decide whether to bring the data into cache (if it's not already there). There are two approaches:
- Write-Allocate: On a write miss, the cache line is loaded from memory into the cache before being modified. This is common in write-back caches because future reads or writes to that line will benefit from it being cached.
- No-Write-Allocate: On a write miss, the data is written directly to memory without loading it into cache. Common in write-through caches to avoid unnecessary cache pollution.
For partial writes (e.g., writing just 4 bytes of a 64-byte line), the cache must perform a read-modify-write: load the full line, modify the relevant bytes, then mark it dirty. This ensures the cache line contains coherent data.
Dirty Cache Lines
Once a cache line is modified, it differs from the data in main memory. The cache marks it as dirty to indicate that it must eventually be written back. An unmodified line is clean and can be evicted without writing to memory.
Dirty lines are written back to memory when:
- The line is evicted to make room for new data
- The cache is explicitly flushed (e.g., during context switches or I/O operations)
- Cache coherence protocols require it (in multi-core systems)
Write Policies
Two main strategies control when modified data reaches main memory:
- Write-Through: Every write updates both the cache and main memory immediately. This ensures memory is always up-to-date but generates significant memory bus traffic. Clean lines can be evicted instantly without writeback. Typically paired with no-write-allocate.
- Write-Back: Writes only update the cache and mark the line as dirty. Memory is updated only when the dirty line is evicted or flushed. This dramatically reduces bus traffic and improves performance but requires careful coherency management in multi-core systems. Typically paired with write-allocate.
Special Write Regions
- Write-Combining (WC): A special memory type that allows multiple writes to be coalesced in a write-combining buffer before being sent to memory as a single burst. The CPU doesn't guarantee ordering between WC writes, making it ideal for frame buffers and graphics memory where sequential ordering isn't critical but bandwidth is precious.
- Uncacheable (UC) Memory: Certain regions bypass the cache entirely, with reads and writes going directly to memory or hardware. This is essential for memory-mapped I/O registers, DMA buffers, and hardware that requires real-time visibility of every access. Caching such regions would cause stale data and incorrect behavior.
9. Cache Eviction and Data Movement: Making Room for New Data
Caches are limited in size, so when new data needs to be loaded and the cache is full, older or less-used data must be removed.
When Eviction Happens
Eviction occurs when there's a cache miss and the target cache set is already full. The cache must choose a victim line to evict before loading the new data. What happens next depends on whether the victim is clean or dirty:
- Clean lines: Can be discarded immediately since they match the data in the next level of memory. No write-back needed.
- Dirty lines: Must be written back to the next level of the memory hierarchy (L1 → L2 → L3 → main memory) before being replaced.
Hierarchical Movement
Eviction costs vary by cache level:
- Evicting from L1 (closest to CPU) writes back to L2, which is fast (~12-15 cycles)
- Evicting from L2 writes back to L3, which is slower (~40-50 cycles)
- Evicting from L3 (farthest from CPU) writes back to main memory, which is very expensive (~200-300+ cycles)
This is why keeping frequently-used data in L1 is so critical — each level down the hierarchy gets progressively more expensive.
Inclusive vs. Exclusive Cache Hierarchies
Different CPU architectures organize their cache hierarchies differently:
- Inclusive Caches: Data in L1 is guaranteed to also exist in L2, and data in L2 exists in L3. When L1 evicts a line, L2 already has a copy, simplifying the operation. This design makes cache coherence easier in multi-core systems (the L3 can track what all cores have), but wastes capacity by storing duplicates. Example: Most Intel processors.
- Exclusive Caches: Each cache line exists in only one level at a time. A line in L1 is not also in L2. This maximizes total effective cache capacity since there's no duplication. However, it complicates coherence protocols and eviction (when L1 evicts, the data must be moved to L2). Example: Some AMD processors.
- Non-Inclusive (NINE): A middle ground where cache levels don't enforce inclusion but also don't prevent it. Data may or may not be duplicated across levels. Example: Recent AMD Zen architectures.
Replacement Policies
When a set is full, the cache must decide which line to evict. Common policies include:
- LRU (Least Recently Used): Evicts the line that hasn't been accessed in the longest time. Effective but requires tracking access history.
- Pseudo-LRU: Approximates LRU with less overhead, commonly used in real hardware.
- Random: Simple and fast, but less effective than LRU. Sometimes used in L1 caches where speed is critical.
The replacement policy significantly impacts cache hit rates, especially for workloads with complex access patterns.
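To make the idea concrete, here's a toy C model of a single 8-way set with true LRU replacement. Real hardware usually settles for cheaper pseudo-LRU bits, so treat this as a sketch of the concept rather than a description of any actual CPU:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS 8

/* One cache set: age[w] counts accesses since way w was last used. */
struct cache_set {
    uint32_t tag[WAYS];
    bool     valid[WAYS];
    unsigned age[WAYS];
};

/* Returns true on a hit; on a miss, installs `tag` in a free way or evicts
 * the least recently used one.                                             */
static bool access_set(struct cache_set *s, uint32_t tag)
{
    for (int w = 0; w < WAYS; w++)
        s->age[w]++;

    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag) {     /* hit: refresh recency */
            s->age[w] = 0;
            return true;
        }

    int victim = 0;                                /* miss: pick a victim  */
    for (int w = 1; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }   /* free way available   */
        if (s->valid[victim] && s->age[w] > s->age[victim])
            victim = w;
    }
    s->tag[victim] = tag;
    s->valid[victim] = true;
    s->age[victim] = 0;
    return false;
}

int main(void)
{
    struct cache_set set = {0};
    uint32_t tags[] = { 1, 2, 3, 1, 2, 9, 1 };
    for (int i = 0; i < 7; i++)
        printf("access tag %u -> %s\n", tags[i],
               access_set(&set, tags[i]) ? "hit" : "miss");
    return 0;
}
```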
10. Cache Coherency in Multi-Processor Systems: Maintaining a Unified View
In multi-CPU or multi-core systems, each CPU may have its own caches. Ensuring that all see the latest data is critical for correct operation. This is cache coherency.
10a. The Challenge
Directly accessing another CPU’s cache is slow. Processors need protocols to synchronize states and maintain a consistent view of memory.
10b. MESI Protocol
The MESI protocol defines four states for each cache line:
- Modified (M): Changed by this CPU, not in memory, exclusive to this cache.
- Exclusive (E): Same as memory, exclusive to this cache.
- Shared (S): Same as memory, may exist in other CPUs’ caches.
- Invalid (I): Data not valid, must be loaded before use.
10c. Maintaining Coherency via Snooping
Processors monitor (snoop) the bus for other CPUs’ memory accesses:
- Write Access (Request For Ownership, RFO): If another CPU wants to write a line you have, you must invalidate your copy. The writing CPU gets exclusive ownership.
- Read Access: If another CPU requests a line in Modified state, you send the data and update your state to Shared.
10d. Performance Implications
Bus operations and RFOs are expensive. Frequent writes to the same cache line or CPU migration can cause delays. Minimizing unnecessary coherency traffic is essential for high-performance multi-core software.
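The textbook way this bites programmers is false sharing: two threads write to different variables that happen to live in the same cache line, and the line ping-pongs between cores through a stream of RFOs. Below is a hedged C11/pthreads sketch; it assumes a 64-byte line size, and the iteration counts are arbitrary:

```c
#include <pthread.h>
#include <stdio.h>

/* Both counters in one cache line: every increment from either thread forces
 * an RFO that steals the line from the other core (false sharing).          */
struct shared_bad {
    volatile long a;
    volatile long b;
};

/* Each counter aligned to its own 64-byte line: no false sharing. */
struct shared_good {
    _Alignas(64) volatile long a;
    _Alignas(64) volatile long b;
};

static struct shared_good counters;   /* switch the type to shared_bad and
                                         re-run to watch the slowdown       */

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) counters.b++;
    return NULL;
}

int main(void)   /* build with: cc -O2 -pthread falseshare.c */
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", counters.a, counters.b);
    return 0;
}
```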
11. Cache Associativity: How Data Maps to Cache Locations
Beyond simply existing, the way a cache maps main memory addresses to its internal storage locations profoundly impacts efficiency. This mapping strategy is known as associativity. It dictates how flexible the cache is in storing data and how prone it is to certain types of cache misses.
Cache designers face a trade-off: designs that allow maximum flexibility are complex and costly, while simpler designs can limit performance. There are three main types of cache associativity:
11a. Fully Associative Cache
Concept: In a fully associative cache, any main memory block (cache line) can be stored in any available location within the cache. There are no restrictions on placement.
Lookup Mechanism: The processor compares the tag of the requested memory block with the tags of all cache lines in parallel.
Advantages:
- Maximum flexibility → minimizes conflict misses (when two active blocks compete for the same location).
- Highest possible hit rate for a given cache size.
Disadvantages:
- Extremely complex and expensive → requires one comparator per cache line for parallel checks.
- Consumes significant power and chip area for large caches.
Practical Use: Due to complexity, fully associative caches are usually reserved for very small, specialized caches, such as the Translation Lookaside Buffer (TLB), which may have only a few dozen entries.
11b. Direct-Mapped Cache
Concept: The simplest and most restrictive design. Each memory block can be stored in only one specific cache location. The cache set index in the memory address directly points to that location.
Lookup Mechanism: The processor uses the set index to select a single cache line, then compares the memory address tag with the tag in that line.
Advantages:
- Simple and fast → only one comparator is needed per lookup.
- Economical in terms of transistors and power.
Disadvantages:
- Highly prone to conflict misses → two frequently used blocks mapping to the same line evict each other, even if other lines are free.
- Can severely reduce effective cache hit rate and performance.
Practical Use: Rare for larger caches in modern CPUs due to conflict misses.
11c. Set-Associative Cache (The Hybrid Approach)
Concept: Balances the extremes of fully associative and direct-mapped caches. The cache is divided into multiple sets, each containing a small number of ways (e.g., 2-way, 4-way, 8-way, 16-way). A memory block maps to a specific set but can reside in any way within that set.
Lookup Mechanism:
- The cache set index from the memory address identifies the correct set.
- Within the set, the processor compares the memory address tag with all tags of the ways in parallel.
- If a match is found → cache hit.
Advantages:
- Reduces conflict misses by allowing multiple locations per set.
- Manageable complexity → number of comparisons limited to the associativity degree (e.g., 8 comparisons for an 8-way set-associative cache).
- Scalable → increasing cache size primarily increases the number of sets, not comparators per set.
Practical Use: Standard in modern CPU caches. L1 caches often use 8-way set-associativity, while L2/L3 caches can use 16-way or higher.
Understanding associativity is crucial for programmers: it directly influences how memory access patterns can maximize cache hits or cause performance-degrading conflict misses. Organizing data to avoid multiple active blocks mapping to the same set can significantly improve performance.
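Here's a small C sketch of how conflict misses show up in practice, assuming the 32 KB, 8-way, 64-set geometry from the earlier example. The constants are illustrative, not universal; the point is that a power-of-two stride equal to (number of sets × line size) funnels every access into the same set:

```c
#include <stddef.h>

#define STRIDE 4096   /* 64 sets * 64-byte lines for the assumed geometry   */
#define SLOTS  16     /* more than 8 ways -> the set keeps evicting itself  */

/* buf must be at least SLOTS * STRIDE bytes. All SLOTS addresses map to the
 * same cache set, so with only 8 ways the loop thrashes that one set even
 * though the rest of the cache sits idle. Power-of-two strides like this
 * appear naturally in matrix code whose rows are a power-of-two size.      */
long sum_conflicting(const char *buf)
{
    long sum = 0;
    for (int round = 0; round < 1000; round++)
        for (int k = 0; k < SLOTS; k++)
            sum += buf[(size_t)k * STRIDE];
    return sum;
}
```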
12. TLB (Translation Look-Aside Buffer) Influence: The Hidden Cost of Virtual Memory
While CPU caches accelerate data access, another crucial performance component comes into play with virtual memory: the Translation Look-Aside Buffer (TLB). Unlike data or instruction caches that store actual memory content, the TLB is a specialized, extremely fast cache that stores recent virtual-to-physical address translations.
Virtual vs. Physical Addresses
Modern operating systems use virtual memory, giving each program its own isolated address space. Before the CPU can access data in physical RAM, its virtual address must be translated into a physical address via the Memory Management Unit (MMU). This translation involves looking up page tables in main memory—a multi-step, slow process.
How the TLB Works
- TLB Hit: If the translation is found in the TLB, the CPU retrieves the physical address almost instantly and proceeds quickly.
- TLB Miss: If not found, the MMU performs a full "page table walk" through main memory. This is very costly, taking hundreds of cycles.
TLB Characteristics
- Small Size: Typically only dozens to a few hundred entries, because they must be extremely fast.
- Multi-Level TLBs: L1 ITLB for instructions, L1 DTLB for data, and a unified L2 TLB for both.
- Frequent Flushes: Context switches clear the TLB since each process has unique page tables, so the TLB is rarely “warm” for long.
- Performance Impact: If a program’s working set exceeds TLB capacity, misses are frequent, adding to cache miss delays and reducing overall performance. Hardware prefetching across page boundaries is also less effective when TLB misses are common.
13. Critical Word Load: Accelerating Data Arrival
The Problem
When a cache miss occurs, a full cache line (e.g., 64 bytes) must be fetched. The CPU might only need a specific byte or word (the critical word) to continue execution. Waiting for the entire cache line to arrive would stall the CPU unnecessarily.
The Solution: Critical Word First & Early Restart
- The memory controller prioritizes sending the critical word first within the incoming cache line.
- Once it arrives, the CPU resumes execution immediately, while the rest of the cache line continues to transfer.
- This "early restart" hides some latency of a cache miss and allows the CPU to work sooner.
Limitations
This optimization works best when the CPU knows which word is critical. Aggressive prefetching may interfere, as the exact critical word might be in flight or unknown. Despite this, Critical Word Load significantly reduces perceived memory latency.
14. Cache Placement: Strategic Allocation in Multi-Core Systems
The physical arrangement and sharing of caches in multi-core and multi-processor systems are hardware-defined, but programmers must understand them to optimize software performance. This "cache placement" determines which caches are shared and which are private.
Fixed by Hardware
Cache placement (e.g., whether L1, L2, or L3 caches are shared or private) is determined by the CPU architecture and cannot be changed by programmers.
L1 Caches (Typically Private)
L1d and L1i caches are almost always private to each CPU core, minimizing contention and allowing each core to operate at maximum L1 speed.
Higher-Level Caches (Shared or Private)
- Early Multi-Core Processors: Each core had separate L2 caches, like L1.
- Later Intel Models: Many dual-core Intel processors featured a shared L2 cache for two cores.
- Quad-Core Processors (Intel Core 2 QX6700/6800): Often had two L2 caches, each shared by a pair of cores.
- AMD Family 10h Opteron Processors: Each core had its own L2 cache, with a unified L3 cache shared by all cores.
Implications of Sharing
- Shared Caches: Advantage: overlapping working sets across cores benefit from more total cache memory → fewer misses. Disadvantage: contention can occur when threads access different data mapping to the same cache sets, leading to inefficient use ("cache pollution").
- Private Caches: Advantage: dedicated cache space per core → good for independent workloads. Disadvantage: underutilization if working sets are small or data sharing between cores requires main memory transfers.
Programmer’s Role
While cache placement is fixed, programmers can influence thread affinity—deciding which CPU cores or hyper-threads run which software. Strategically assigning threads to cores that share caches (or placing independent threads on cores with private caches) helps align software with hardware for maximum performance and minimal contention.
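On Linux, thread affinity can be set with the non-portable pthread_setaffinity_np() call. The sketch below pins the calling thread to one logical CPU; which CPU numbers share an L2 or L3 is machine-specific (tools like lstopo show the topology), so the core ID used here is just a placeholder:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single logical CPU. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

int main(void)
{
    int err = pin_to_cpu(2);           /* CPU 2 is only an example */
    if (err != 0)
        fprintf(stderr, "failed to set affinity (error %d)\n", err);
    else
        printf("pinned to CPU 2\n");
    return 0;
}
```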
15. Cache Prefetching: Predicting Data Before It's Needed
Modern CPUs don’t wait for the program to request data—they try to anticipate it. This technique, known as prefetching, aims to reduce cache misses by bringing data into the cache before the CPU actually needs it.
Hardware Prefetching
- Implemented within the CPU itself.
- Detects patterns in memory accesses, such as sequential access, and automatically loads the next block into the cache.
- Example: If a program accesses addresses 0x1000, 0x1010, 0x1020, the prefetcher may load 0x1030 and 0x1040 ahead of time.
- Benefits: Reduces latency for predictable workloads.
- Pitfalls: If the access pattern is irregular, prefetching may bring unnecessary data into the cache, wasting bandwidth and evicting useful data.
Software Prefetching
- Program instructions explicitly request data to be loaded into the cache ahead of use.
- Often used in high-performance computing (HPC) and graphics applications.
- Example (C syntax):
_mm_prefetch(ptr, _MM_HINT_T0) tells the CPU to fetch the data at ptr into the L1 cache.
- Balances control between programmer knowledge and CPU speculation.
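Here's a slightly fuller sketch of how that intrinsic is typically used inside a loop. The prefetch distance of 16 elements is a tuning knob picked for illustration, not a recommendation, and on modern x86 hardware the automatic prefetcher often makes this unnecessary:

```c
#include <xmmintrin.h>   /* _mm_prefetch (x86 only) */
#include <stddef.h>

long sum_with_prefetch(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* While working on element i, start pulling in the element we will
         * need a few iterations from now.                                  */
        if (i + 16 < n)
            _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```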
16. Cache Replacement Policies: Deciding Which Data to Evict
When a cache set is full, the CPU must evict a cache line to make room for new data. The policy used to choose the victim affects performance and conflict misses.
Least Recently Used (LRU)
- Evicts the cache line that has not been accessed for the longest time.
- Advantage: Exploits temporal locality.
- Commonly used in L1/L2 caches.
First-In, First-Out (FIFO)
- Evicts the oldest cache line, regardless of usage.
- Simpler but can perform poorly if older data is still frequently accessed.
Random Replacement
- Chooses a cache line at random to evict.
- Reduces hardware complexity and works surprisingly well in some scenarios.
Pseudo-LRU (PLRU)
- Approximation of LRU for higher associativity caches.
- Requires fewer hardware resources than true LRU.
17. Cache Miss Types: Understanding Why Data Isn't Found
Not all cache misses are the same. Understanding the type helps programmers optimize memory access patterns.
17a. Compulsory Misses
- Also called cold-start misses.
- Occur when a cache line is accessed for the first time.
- Cannot be avoided, but can be mitigated with prefetching.
17b. Capacity Misses
- Occur when the cache is too small to hold the working set of the program.
- Even with perfect placement, data is evicted due to limited cache size.
17c. Conflict Misses
- Happen in set-associative or direct-mapped caches when multiple blocks map to the same set.
- Example: Two frequently accessed addresses always map to the same set, evicting each other repeatedly.
- Increasing associativity reduces these misses.
18. NUMA Considerations: Multi-Socket Memory Architectures
In systems with multiple CPUs (multi-socket), memory is physically distributed. Access time depends on which CPU "owns" the memory.
18a. Local vs Remote Memory
- Local memory: Memory physically attached to the same CPU socket.
- Remote memory: Memory attached to another CPU socket. Accessing it is slower.
18b. Cache and NUMA
- Each CPU still has its private L1/L2 caches and possibly a shared L3 cache.
- NUMA-aware programming is crucial: assign threads to CPUs close to the memory they access most.
- Tools like numactl and the libnuma library allow fine-grained control of thread and memory placement.
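For completeness, here's a minimal C sketch using libnuma (link with -lnuma) that keeps a thread and the buffer it works on within the same node. The node number and buffer size are placeholders:

```c
#include <numa.h>        /* libnuma, Linux only */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    numa_run_on_node(0);                        /* keep this thread on node 0 */
    size_t size = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(size, 0);     /* memory local to node 0     */
    if (!buf) return 1;

    memset(buf, 0, size);                       /* touch pages so they are    */
                                                /* actually placed on node 0  */
    /* ... work on buf ... */
    numa_free(buf, size);
    return 0;
}
```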