
CPU and its Cache

October 4, 2025

In this post, we'll dive into CPU caches — tiny but powerful memory units that play a huge role in how fast our computers run. We'll look at what they are, why caches are needed, how they work across different levels, and what affects their performance. The goal is to give programmers, like me, a clear, practical overview without getting too overwhelming.

I'm writing this from a software engineer's perspective, so while I've done my best to be accurate, there may be some gaps. If you spot something off or have insights to add or change, I'd love to hear your feedback in the comments or by email.

In the end, it's just an informative blog post, not a formal paper, so contributions are welcome!

1. Why Caches Became Essential

First things first - let's understand why we need caches in our CPUs.

For that purpose, let's imagine a world without them — like in the early days of computing. Back then, whenever the CPU needed data to execute an instruction, it had to go all the way to main memory (RAM), fetch the required data, and bring it back before the instruction could run.

The Problem with Direct RAM Access

This approach is not inherently bad, but you can clearly see the problem here: the execution unit has to wait for the data to arrive from RAM before it can do its job, which is a huge waste of clock cycles.

In the past, CPUs and RAM worked at similar speeds, so this wasn't a big issue. But starting in the 1990s, CPU speeds skyrocketed while RAM didn't keep up. Suddenly, you had processors running billions of operations per second but constantly stuck waiting hundreds of cycles just to get data from memory.

Think of it like a world-class chef who can cook lightning-fast but has to walk to a far-away pantry for every single ingredient—even if it's something they just used a second ago. The constant trips waste time and completely defeat the purpose of having such a fast chef.

Every time the CPU needed something, the request had to travel across the Front Side Bus, through the memory controller (in the Northbridge), to the RAM, and then all the way back. That round trip was painfully slow compared to the CPU's pace, creating a huge bottleneck.

Modern CPUs (since the late 2000s) solved the bus bottleneck by integrating the memory controller directly onto the CPU die. Now RAM connects straight to the processor through dedicated memory channels, eliminating that long journey through the Front Side Bus and Northbridge. But even with this improvement, RAM access still takes hundreds of CPU cycles — the speed gap between processors and memory remains the fundamental problem, not the path the data takes.

The Birth of Caches: Using Locality

Fortunately, computer engineers found that programs tend to follow certain patterns, known as locality:

  1. Temporal Locality: If data was used recently, it'll probably be needed again soon.
  2. Spatial Locality: If some data was used, nearby data will likely be used next.

Engineers realized they could take advantage of these patterns. Instead of sending the CPU to slow RAM every time, they placed a small amount of very fast memory close to the CPU — the cache.

In simple words, when the CPU needs data, it first checks this cache. If the data is there (a cache hit), it's served almost instantly. If not (a cache miss), the CPU falls back to RAM. This dramatically reduces wait times and keeps the CPU busy instead of idle.
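To make locality concrete, here's a minimal C sketch (my own illustration, not taken from any particular source): it sums a square matrix twice, once row by row and once column by column. Both loops do the same arithmetic, but the row-by-row version walks memory sequentially and benefits from spatial locality, so on most machines it runs noticeably faster. The matrix size and timing method are arbitrary choices.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* One big allocation, indexed as a 2D matrix in row-major order. */
    int *m = malloc((size_t)N * N * sizeof *m);
    for (size_t i = 0; i < (size_t)N * N; i++) m[i] = 1;

    long long sum = 0;
    double t0 = now_sec();
    for (int r = 0; r < N; r++)           /* row-major: consecutive addresses, */
        for (int c = 0; c < N; c++)       /* great spatial locality            */
            sum += m[(size_t)r * N + c];
    double t1 = now_sec();
    for (int c = 0; c < N; c++)           /* column-major: jumps N*4 bytes     */
        for (int r = 0; r < N; r++)       /* per access, poor locality         */
            sum += m[(size_t)r * N + c];
    double t2 = now_sec();

    printf("row-major: %.3fs  column-major: %.3fs  (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(m);
    return 0;
}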

Why Not Just Make Faster RAM?

Now you might ask: if caches are so great, why not just make main memory as fast as caches? It is technically possible, but impractical for the following two reasons:

  1. Cost — Cache uses SRAM, which is much faster but also much more expensive than the DRAM used in RAM. Building gigabytes of SRAM would be wildly unaffordable.
  2. Practicality — Even if you had a small machine with superfast RAM, it wouldn't help much once the program's working set (the data it actively uses) gets larger than that memory. At that point, the system would need to use even slower storage like a hard drive or SSD, which would be disastrous for performance.

That's why computers use a layered memory system: a large, affordable main memory (DRAM) paired with smaller, ultra-fast caches (SRAM). This gives us the best balance of speed, size, and cost.

2. The Power of Caches in Action

Imagine this setup:

- Accessing main memory (RAM) takes 300 cycles.
- Accessing the cache takes 20 cycles.
- A program processes 50 data elements, using each element 50 times.

Without a cache:

The CPU goes to RAM every time. Total cycles:

50 × 50 × 300 = 750,000 cycles

That's three quarters of a million cycles spent waiting for memory.

With a cache:

Each element is fetched from RAM once (a miss on its first use) and then served from the cache for the remaining 49 uses. So the math looks like this:

Cache misses: 50 × 300 = 15,000 cycles
Cache hits: 50 × 49 × 20 = 49,000 cycles
Total (with cache) = 15,000 + 49,000 = 64,000 cycles

The result:

That's roughly a 12× reduction in CPU wait time (from 750,000 cycles down to 64,000). This example shows how even a small cache can drastically cut down memory access times by leveraging locality. The CPU spends far less time waiting and much more time doing actual work.

3. Cache Size and the "Working Set"

Caches are super fast but much smaller than RAM. For example, a workstation might have 4 MB of CPU cache and 4 GB of main memory — a 1:1000 ratio.

Because the cache can only hold a tiny fraction of what's in RAM, it can't store everything — it has to be selective. This size constraint introduces the idea of a program's working set: the data and instructions it actively uses right now.

If the working set fits in the cache, most accesses are cache hits. You get high hit rates and great performance — no problem with the limited cache size.

In real-world systems, especially with many processes running, the combined working set usually exceeds the cache. Then the CPU must constantly choose what to keep and what to evict. When needed data has been evicted, you get cache misses and must fetch from slower RAM, which reduces the performance benefit.

Balancing the working set against the cache is a joint effort between the hardware and the programmer: the hardware decides what to keep and what to evict, while the programmer controls data layout and access patterns.

When both do their jobs well, the working set fits nicely in cache and the program flies. When they don't, you get cache misses, and the CPU is back to waiting hundreds of cycles for RAM.
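Here's a rough C sketch (my own illustration) of one way to observe the working-set effect: it repeatedly walks buffers of increasing size and reports the average time per access. On a typical machine, the cost per access jumps each time the buffer outgrows a cache level (L1, then L2, then L3). The sizes, stride, and iteration count are arbitrary.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    /* Working-set sizes from 16 KB up to 64 MB. */
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2) {
        size_t n = kb * 1024 / sizeof(int);
        int *buf = malloc(n * sizeof(int));
        for (size_t i = 0; i < n; i++) buf[i] = (int)i;

        volatile long long sum = 0;        /* volatile: keep the loop honest  */
        size_t accesses = 1u << 24;        /* same total work for every size  */
        size_t idx = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < accesses; i++) {
            sum += buf[idx];
            idx = (idx + 9973) % n;        /* odd stride: visits every slot,  */
        }                                  /* hard for the prefetcher         */
        double t1 = now_sec();

        printf("%6zu KB: %.2f ns/access\n", kb, (t1 - t0) * 1e9 / accesses);
        free(buf);
    }
    return 0;
}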

4. Cache Organization and Hierarchy: A Multi-Layered Approach

Okay. Now that we know cache size is limited, hardware designers faced a key question: how do we organize these small, fast memories to get the most benefit? The solution they found is a multi-level cache hierarchy.

Instead of one cache, modern CPUs use several layers of caches, each with different sizes and speeds, working together to bridge the gap between the CPU and main memory.

Cache Hierarchy in Modern CPUs

L1 Cache (Level 1)

The smallest and fastest level, typically a few tens of kilobytes per core and usually split into separate instruction (L1i) and data (L1d) caches. It responds in just a few cycles.

L2 Cache (Level 2)

Larger and a bit slower, typically a few hundred kilobytes to a few megabytes, and in most modern designs private to each core.

L3 Cache (Level 3)

The largest and slowest on-chip level, typically several to tens of megabytes, shared by all cores. Still far faster than a trip to main memory.

By keeping frequently used data closer to the CPU at the right speed, the memory hierarchy minimizes those slow trips to main memory. For programmers, this mostly works behind the scenes — but understanding it is the difference between code that flies and code that stalls. When you write with locality in mind, you're working with the cache instead of against it, and that can make your programs orders of magnitude faster.
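If you're curious what hierarchy your own machine has, here's a small sketch that works on Linux with glibc (the _SC_LEVEL* sysconf names are a glibc extension, so treat the availability as an assumption; `lscpu` or /sys/devices/system/cpu/cpu0/cache/ expose the same information):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Each entry pairs a human-readable label with a glibc sysconf name. */
    struct { const char *name; int key; } caches[] = {
        { "L1 data cache size",   _SC_LEVEL1_DCACHE_SIZE     },
        { "L1 line size",         _SC_LEVEL1_DCACHE_LINESIZE },
        { "L2 cache size",        _SC_LEVEL2_CACHE_SIZE      },
        { "L3 cache size",        _SC_LEVEL3_CACHE_SIZE      },
    };

    for (size_t i = 0; i < sizeof caches / sizeof caches[0]; i++) {
        long v = sysconf(caches[i].key);
        if (v > 0)
            printf("%-20s %ld bytes\n", caches[i].name, v);
        else
            printf("%-20s (not reported)\n", caches[i].name);
    }
    return 0;
}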

5. Multi-Core and Multi-Thread Cache Sharing: The Complexities of Concurrent Access

Modern processors aren't just faster — they're more parallel, with multiple cores on a single chip. Each core is essentially an independent processing unit that can execute its own instructions simultaneously. We won't dive deep into the mechanics of cores and threads here, but the key point is this: when multiple cores share the same caches, things get complicated. They need to coordinate access, maintain consistency, and avoid stepping on each other's toes, which adds a whole new layer of complexity to cache design. Two key levels of parallelism are important to understand:

Cores

A single CPU often contains multiple cores. Each core is largely independent and usually has its own dedicated L1 data (L1d) and L1 instruction (L1i) caches. This means that different cores executing different code can operate with minimal interference at the L1 level.

Threads (Hyper-threads / SMT)

Some architectures support multiple hardware threads per core, often called hyper-threads or Simultaneous Multi-Threading (SMT). Unlike separate cores, these threads share nearly all of the core’s resources, including the L1 caches. Each thread has its own registers, but they contend for the same L1 cache space. If two threads on the same core access different data, they may evict each other’s data from the cache, leading to cache pollution.

Beyond L1

L2 cache is typically private to each core, giving it fast access without contention from other cores. L3 cache (Last Level Cache) is shared across all cores within the CPU, acting as a common pool for frequently accessed data. In multi-socket systems — where multiple physical CPU packages sit on the same motherboard — each socket has its own complete cache hierarchy (L1, L2, L3). Communication between CPUs in different sockets must traverse a slower interconnect (like Intel's UPI or AMD's Infinity Fabric), making cross-socket memory access significantly more expensive.

Understanding this hierarchical sharing pattern is crucial for programmers. It affects how data is arranged and threads are scheduled to reduce conflicts and maximize cache locality, especially in parallel applications.

6. Cache Operation at a Granular Level & Cost of Hits and Misses

CPUs don't fetch individual bytes or words from memory—they work with larger blocks called cache lines, typically 64 bytes on modern processors.

Cost of Cache Hits and Misses

Memory access speed varies dramatically depending on where data is found. Here are approximate latencies for a modern x86 processor (actual numbers vary by generation and workload):

L1 hit:          ~4 cycles
L2 hit:          ~12 cycles
L3 hit:          ~40 cycles
Main memory:     ~200-400 cycles

These timings can vary based on factors like memory contention, prefetching effectiveness, and TLB (Translation Lookaside Buffer) hits or misses. The key takeaway: cache hits are cheap, cache misses are catastrophic. The performance gap between L1 and RAM is roughly 50-100x, which is why optimizing for cache locality is one of the highest-leverage performance techniques available to programmers.
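Because data moves in whole lines, the number of cache lines you touch often matters more than the number of bytes you read. Here's a small sketch (my own illustration; it assumes 64-byte lines) comparing a loop that reads every int of a large array with one that reads only every 16th int. The second loop does a sixteenth of the work but still touches every cache line, so it usually takes nearly as long.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(int))   /* 64 MB: far larger than any cache */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int *a = malloc(N * sizeof(int));
    for (size_t i = 0; i < N; i++) a[i] = 1;

    volatile long long sum = 0;
    double t0 = now_sec();
    for (size_t i = 0; i < N; i++)        /* touches every byte                 */
        sum += a[i];
    double t1 = now_sec();
    for (size_t i = 0; i < N; i += 16)    /* 16 ints = 64 bytes: one read per   */
        sum += a[i];                      /* cache line, 1/16 of the work       */
    double t2 = now_sec();

    printf("every int: %.3fs   every 16th int: %.3fs\n", t1 - t0, t2 - t1);
    free(a);
    return 0;
}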

7. Address Splitting for Cache Access: Locating Data in the Hierarchy

To efficiently locate data in a multi-level cache, the CPU splits each memory address into three components. This helps determine whether a requested piece of data is in the cache, and if so, where:

- Tag: identifies which block of main memory a cache line holds.
- Index: selects which cache set to look in.
- Offset: the byte position within the cache line.

This three-part address splitting allows caches to be both fast and practical. The index directs the CPU quickly to the right set, while the tag confirms that the correct block of memory is present, enabling efficient management of limited high-speed storage.

Example: How a Memory Address is Split for Cache Access

Suppose we have a CPU with the following cache configuration:

- Cache size: 32 KB
- Associativity: 8-way set-associative
- Cache line size: 64 bytes
- Number of sets: 32 KB / (64 bytes × 8 ways) = 64 sets

Now imagine the CPU wants to access memory address 0x0001A2C0 (32-bit address).

Step 1: Convert to binary

0x0001A2C0 = 0000 0000 0000 0001 1010 0010 1100 0000

Step 2: Split the address

For 64 sets, we need log₂(64) = 6 bits for the index.
For 64-byte lines, we need log₂(64) = 6 bits for the offset.

Tag (20 bits)                 Index (6 bits)      Offset (6 bits)
0000 0000 0000 0001 1010      | 001011            | 000000
0x0001A                       |   11              |   0

Breaking it down:

- Offset = 000000 (0): the byte position inside the 64-byte cache line.
- Index = 001011 (11): selects set #11 out of the 64 sets.
- Tag = 0x0001A: identifies which block of main memory this line holds.

Step 3: How the CPU checks the cache

  1. Go to set #11 in the cache.
  2. Compare the tag 0x0001A with the tags of all 8 lines in that set (since it's 8-way associative).
  3. If a match is found → cache hit → the CPU reads the data from that line.
  4. If no match → cache miss → the CPU fetches the entire 64-byte block from the next level (L2, L3, or RAM) and stores it in one of the 8 lines in set #11, potentially evicting an existing line using a replacement policy (LRU, random, etc.).
... (sets #0 through #8)
Set #9
Tag: 0x00012
64 bytes of data...
Tag: 0x0003F
64 bytes of data...
... (6 more lines)
Set #10
Tag: 0x00089
64 bytes of data...
Tag: 0x000A1
64 bytes of data...
... (6 more lines)
Set #11 ← INDEX points here!
Tag: 0x00012
64 bytes of data...
Tag: 0x00045
64 bytes of data...
Tag: 0x0001A ✓ MATCH!
64 bytes of data...
↑ This is our data!
Tag: 0x000B2
64 bytes of data...
... (4 more lines)
Set #12
Tag: 0x0001D
64 bytes of data...
Tag: 0x00078
64 bytes of data...
... (6 more lines)
... (51 more sets, up to Set #63)

This example shows exactly how the CPU splits a memory address into tag, index, and offset to efficiently locate data in a set-associative cache.
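For completeness, here's a small C sketch (my own illustration, hard-coded to the example cache above: 8 ways, 64 sets, 64-byte lines) that performs the same split in code:

#include <stdio.h>
#include <stdint.h>

/* Parameters of the example cache: 64-byte lines and 64 sets. */
#define OFFSET_BITS 6                      /* log2(64-byte line) */
#define INDEX_BITS  6                      /* log2(64 sets)      */

static void split_address(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr 0x%08X -> tag 0x%05X, set #%u, offset %u\n",
           addr, tag, index, offset);
}

int main(void) {
    split_address(0x0001A2C0);   /* prints: tag 0x0001A, set #11, offset 0 */
    return 0;
}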

8. Cache Write Operations: Ensuring Data Integrity and Performance

When the CPU writes data, caches introduce extra complexity. How writes are handled affects both performance and data integrity.

Write Allocation and Partial Writes

When the CPU writes to memory, the cache must decide whether to bring the data into the cache (if it's not already there). There are two approaches:

- Write-allocate: on a write miss, the line is first loaded into the cache and then modified there. This pairs naturally with write-back caches.
- No-write-allocate (write-around): on a write miss, the write is sent directly to the next level and the line is not brought into the cache.

For partial writes (e.g., writing just 4 bytes of a 64-byte line), the cache must perform a read-modify-write: load the full line, modify the relevant bytes, then mark it dirty. This ensures the cache line contains coherent data.

Dirty Cache Lines

Once a cache line is modified, it differs from the data in main memory. The cache marks it as dirty to indicate that it must eventually be written back. An unmodified line is clean and can be evicted without writing to memory.

Dirty lines are written back to memory when:

- The line is evicted to make room for new data.
- Another core or processor needs the data and the coherency protocol forces a write-back.
- Software explicitly flushes the line (for example, before an I/O device reads that memory).

Write Policies

Two main strategies control when modified data reaches main memory:

- Write-through: every write updates the cache and the next level immediately. Simple and always consistent, but it generates a lot of memory traffic.
- Write-back: writes update only the cache line and mark it dirty; main memory is updated later, when the line is evicted. Most modern CPUs use this because it saves bandwidth.

Special Write Regions

Not all memory is cached normally. Some address ranges are marked write-combining (writes are buffered and sent out in bursts, useful for things like graphics frame buffers) or uncacheable (every access goes straight to the device, which is required for memory-mapped I/O).

9. Cache Eviction and Data Movement: Making Room for New Data

Caches are limited in size, so when new data needs to be loaded and the cache is full, older or less-used data must be removed.

When Eviction Happens

Eviction occurs when there's a cache miss and the target cache set is already full. The cache must choose a victim line to evict before loading the new data. What happens next depends on whether the victim is clean or dirty:

- A clean line can simply be dropped; the copy in the next level (or RAM) is already up to date.
- A dirty line must first be written back to the next level, which makes the miss more expensive.

Hierarchical Movement

Eviction costs vary by cache level:

- Evicting from L1 usually just moves the line down to L2, which is relatively cheap.
- Evicting from L2 pushes the line to L3.
- Evicting a dirty line from L3 means writing it all the way back to main memory, the most expensive case.

This is why keeping frequently-used data in L1 is so critical — each level down the hierarchy gets progressively more expensive.

Inclusive vs. Exclusive Cache Hierarchies

Different CPU architectures organize their cache hierarchies differently:

- Inclusive: the lower level (L3) keeps a copy of everything stored in the levels above it. This simplifies coherency checks but wastes some capacity.
- Exclusive: a line lives in only one level at a time, so the total effective capacity is larger, at the cost of more data movement between levels.
- Many real designs sit somewhere in between (non-inclusive, non-exclusive).

Replacement Policies

When a set is full, the cache must decide which line to evict. Common policies include LRU (least recently used), pseudo-LRU approximations, FIFO, and random replacement; these are covered in more detail in section 16.

The replacement policy significantly impacts cache hit rates, especially for workloads with complex access patterns.

10. Cache Coherency in Multi-Processor Systems: Maintaining a Unified View

In multi-CPU or multi-core systems, each CPU may have its own caches. Ensuring that all see the latest data is critical for correct operation. This is cache coherency.

10a. The Challenge

Directly accessing another CPU’s cache is slow. Processors need protocols to synchronize states and maintain a consistent view of memory.

10b. MESI Protocol

The MESI protocol defines four states for each cache line:

- Modified (M): this cache holds the only copy, and it has been changed; main memory is stale.
- Exclusive (E): this cache holds the only copy, but it still matches main memory.
- Shared (S): the line may be present in other caches; all copies match memory.
- Invalid (I): the line holds no valid data.

10c. Maintaining Coherency via Snooping

Processors monitor (snoop) the bus for other CPUs’ memory accesses:

- If another CPU reads a line this CPU holds in the Modified state, the owner supplies the data (and/or writes it back) and the line becomes Shared.
- If a CPU wants to write to a line that others hold, it issues a Request For Ownership (RFO), which invalidates all other copies before the write proceeds.

10d. Performance Implications

Bus operations and RFOs are expensive. Frequent writes to the same cache line or CPU migration can cause delays. Minimizing unnecessary coherency traffic is essential for high-performance multi-core software.
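A common way this coherency traffic bites programmers is false sharing: two threads write to different variables that happen to live on the same cache line, so the line ping-pongs between cores even though no data is logically shared. Here's a hedged sketch using POSIX threads (my own illustration); the 64-byte alignment assumes 64-byte lines, and the exact slowdown you see will vary by machine. Build with something like gcc -O2 -pthread.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

/* Two counters on the SAME cache line (false sharing)... */
static struct { volatile long a, b; } tight;
/* ...versus two counters forced onto SEPARATE 64-byte lines. */
static struct { _Alignas(64) volatile long a;
                _Alignas(64) volatile long b; } padded;

static void *bump(void *p) {                 /* each thread hammers one counter */
    volatile long *c = p;
    for (long i = 0; i < ITERS; i++) (*c)++;
    return NULL;
}

static double run_pair(volatile long *x, volatile long *y) {
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, bump, (void *)x);
    pthread_create(&b, NULL, bump, (void *)y);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same line (false sharing): %.2fs\n", run_pair(&tight.a,  &tight.b));
    printf("separate lines:            %.2fs\n", run_pair(&padded.a, &padded.b));
    return 0;
}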

11. Cache Associativity: How Data Maps to Cache Locations

A cache's size isn't the whole story: the way it maps main memory addresses to its internal storage locations profoundly impacts efficiency. This mapping strategy is known as associativity. It dictates how flexible the cache is in storing data and how prone it is to certain types of cache misses.

Cache designers face a trade-off: designs that allow maximum flexibility are complex and costly, while simpler designs can limit performance. There are three main types of cache associativity:

11a. Fully Associative Cache

Concept: In a fully associative cache, any main memory block (cache line) can be stored in any available location within the cache. There are no restrictions on placement.

Lookup Mechanism: The processor compares the tag of the requested memory block with the tags of all cache lines in parallel.

Advantages:

- No conflict misses: a line is only evicted when the whole cache is full, so placement never forces an eviction.
- Best possible use of the available capacity.

Disadvantages:

- Every lookup must compare the tag against every line in parallel, which requires a comparator per line; this is expensive in silicon, power, and latency.
- Doesn't scale: impractical for caches beyond a few dozen or a few hundred entries.

Practical Use: Due to complexity, fully associative caches are usually reserved for very small, specialized caches, such as the Translation Lookaside Buffer (TLB), which may have only a few dozen entries.

11b. Direct-Mapped Cache

Concept: The simplest and most restrictive design. Each memory block can be stored in only one specific cache location. The cache set index in the memory address directly points to that location.

Lookup Mechanism: The processor uses the set index to select a single cache line, then compares the memory address tag with the tag in that line.

Advantages:

- Very simple and fast: only one location to check, and a single tag comparison per lookup.
- Cheap in hardware and power.

Disadvantages:

- Conflict misses: two hot addresses that map to the same slot keep evicting each other, even if the rest of the cache is empty.

Practical Use: Rare for larger caches in modern CPUs due to conflict misses.

11c. Set-Associative Cache (The Hybrid Approach)

Concept: Balances the extremes of fully associative and direct-mapped caches. The cache is divided into multiple sets, each containing a small number of ways (e.g., 2-way, 4-way, 8-way, 16-way). A memory block maps to a specific set but can reside in any way within that set.

Lookup Mechanism:

  1. The cache set index from the memory address identifies the correct set.
  2. Within the set, the processor compares the memory address tag with all tags of the ways in parallel.
  3. If a match is found → cache hit.

Advantages:

- Far fewer conflict misses than a direct-mapped cache, since each block has several possible homes within its set.
- Much cheaper than a fully associative design: only a handful of tag comparisons per lookup instead of one per cache line.

Practical Use: Standard in modern CPU caches. L1 caches often use 8-way set-associativity, while L2/L3 caches can use 16-way or higher.

Understanding associativity is crucial for programmers: it directly influences how memory access patterns can maximize cache hits or cause performance-degrading conflict misses. Organizing data to avoid multiple active blocks mapping to the same set can significantly improve performance.
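To see conflict misses in action, here's a rough sketch (my own illustration; the stride values assume a 32 KB, 8-way L1d with 64-byte lines, so adjust them for your CPU). It repeatedly touches 16 cache lines that all map to the same set, which overwhelms the 8 ways and forces constant evictions, then touches 16 lines spread across different sets, which fit comfortably:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed L1d: 32 KB, 8-way, 64-byte lines -> 64 sets, so addresses
   4096 bytes apart land in the same set. */
#define SAME_SET_STRIDE 4096
#define SPREAD_STRIDE   (4096 + 64)   /* shifts each block into a different set */
#define BLOCKS          16            /* more than the 8 ways -> thrashing      */
#define REPS            10000000L

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double walk(volatile char *buf, size_t stride) {
    double t0 = now_sec();
    for (long r = 0; r < REPS; r++)
        for (int b = 0; b < BLOCKS; b++)
            buf[b * stride]++;            /* 16 lines, all (or not) in one set */
    return now_sec() - t0;
}

int main(void) {
    volatile char *buf = calloc(1, BLOCKS * SPREAD_STRIDE);
    printf("same set (conflict misses): %.2fs\n", walk(buf, SAME_SET_STRIDE));
    printf("spread across sets:         %.2fs\n", walk(buf, SPREAD_STRIDE));
    free((void *)buf);
    return 0;
}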

12. TLB (Translation Look-Aside Buffer) Influence: The Hidden Cost of Virtual Memory

While CPU caches accelerate data access, another crucial performance component comes into play with virtual memory: the Translation Look-Aside Buffer (TLB). Unlike data or instruction caches that store actual memory content, the TLB is a specialized, extremely fast cache that stores recent virtual-to-physical address translations.

Virtual vs. Physical Addresses

Modern operating systems use virtual memory, giving each program its own isolated address space. Before the CPU can access data in physical RAM, its virtual address must be translated into a physical address via the Memory Management Unit (MMU). This translation involves looking up page tables in main memory—a multi-step, slow process.

How the TLB Works

When the CPU needs to translate a virtual address, it first checks the TLB. On a TLB hit, the physical address is available almost immediately. On a TLB miss, the MMU must walk the page tables in memory (a process that can cost tens to hundreds of cycles), and the resulting translation is then stored in the TLB for next time.

TLB Characteristics

TLBs are small (typically tens to a few thousand entries), often split into separate instruction and data TLBs, and frequently organized in two levels, mirroring the data caches. Because each entry covers one page (commonly 4 KB), using larger pages ("huge pages", e.g., 2 MB) lets the same number of entries cover far more memory, reducing TLB misses for large working sets.
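If your program scans very large buffers, asking the OS for huge pages is one practical way to cut TLB pressure. Here's a hedged, Linux-specific sketch using transparent huge pages via madvise (MADV_HUGEPAGE is a Linux extension, and whether the kernel honors the hint depends on system configuration):

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 1UL << 30;   /* 1 GB working area */

    /* Anonymous mapping; the kernel may back it with 2 MB pages if asked. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint: use transparent huge pages for this range (Linux-specific). */
    if (madvise(buf, len, MADV_HUGEPAGE) != 0)
        perror("madvise");    /* not fatal: we just fall back to 4 KB pages */

    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;           /* touch every page; fewer TLB misses with 2 MB pages */

    munmap(buf, len);
    return 0;
}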

13. Critical Word Load: Accelerating Data Arrival

The Problem

When a cache miss occurs, a full cache line (e.g., 64 bytes) must be fetched. The CPU might only need a specific byte or word (the critical word) to continue execution. Waiting for the entire cache line to arrive would stall the CPU unnecessarily.

The Solution: Critical Word First & Early Restart

With Critical Word First, the memory system sends the word the CPU actually asked for before the rest of the cache line. With Early Restart, the CPU resumes execution as soon as that word arrives, while the remaining bytes of the line continue to fill in the background.

Limitations

This optimization works best when the CPU knows which word is critical. Aggressive prefetching may interfere, as the exact critical word might be in flight or unknown. Despite this, Critical Word Load significantly reduces perceived memory latency.

14. Cache Placement: Strategic Allocation in Multi-Core Systems

The physical arrangement and sharing of caches in multi-core and multi-processor systems are hardware-defined, but programmers must understand them to optimize software performance. This "cache placement" determines which caches are shared and which are private.

Fixed by Hardware

Cache placement (e.g., whether L1, L2, or L3 caches are shared or private) is determined by the CPU architecture and cannot be changed by programmers.

L1 Caches (Typically Private)

L1d and L1i caches are almost always private to each CPU core, minimizing contention and allowing each core to operate at maximum L1 speed.

Higher-Level Caches (Shared or Private)

L2 is usually private per core as well, while L3 (the last-level cache) is typically shared by all cores in the package. Some designs instead share an L2 between pairs of cores or split the L3 into per-core slices, so the exact layout depends on the architecture.

Implications of Sharing

Threads that share a cache can exchange data cheaply and benefit from each other's warm cache lines, but they also compete for its capacity. Two cache-hungry threads on cores that share an L3 (or two hyper-threads sharing an L1/L2) can evict each other's data and slow both down.

Programmer’s Role

While cache placement is fixed, programmers can influence thread affinity—deciding which CPU cores or hyper-threads run which software. Strategically assigning threads to cores that share caches (or placing independent threads on cores with private caches) helps align software with hardware for maximum performance and minimal contention.
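On Linux with glibc, thread affinity can be set with the non-portable pthread_attr_setaffinity_np / pthread_setaffinity_np calls (or sched_setaffinity for whole processes). A minimal sketch, assuming core IDs 0 and 1 exist on your machine; whether those two cores actually share a cache level depends on the CPU topology:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    (void)arg;
    printf("worker running on CPU %d\n", sched_getcpu());
    return NULL;
}

/* Create a thread pinned to one core (GNU/Linux-specific attribute). */
static pthread_t spawn_on_core(int core) {
    pthread_t t;
    pthread_attr_t attr;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof set, &set);
    pthread_create(&t, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return t;
}

int main(void) {
    /* e.g., keep two cooperating threads on cores that share a cache level */
    pthread_t a = spawn_on_core(0);
    pthread_t b = spawn_on_core(1);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}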

15. Cache Prefetching: Predicting Data Before It's Needed

Modern CPUs don’t wait for the program to request data—they try to anticipate it. This technique, known as prefetching, aims to reduce cache misses by bringing data into the cache before the CPU actually needs it.

Hardware Prefetching

The CPU watches the stream of memory accesses and, when it detects a simple pattern (typically sequential or constant-stride access), it starts fetching the next cache lines before the program asks for them. This works well for linear scans of arrays and poorly for pointer-chasing or random access.

Software Prefetching

The programmer (or compiler) can insert explicit prefetch hints, for example GCC and Clang's __builtin_prefetch or x86's prefetcht0/prefetchnta instructions, telling the CPU to start loading an address it will need soon. Used carefully, this hides memory latency; used carelessly, it wastes bandwidth and pollutes the cache.
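Here's a hedged sketch of software prefetching during a linked-list traversal, the kind of pointer-chasing pattern hardware prefetchers struggle with. __builtin_prefetch is a GCC/Clang extension, the node layout is my own invention, and prefetching exactly one node ahead is an arbitrary distance that real code would need to tune and measure:

#include <stddef.h>

struct node {
    struct node *next;
    long payload[15];          /* ~128 bytes per node: spans two cache lines */
};

long sum_list(const struct node *n) {
    long total = 0;
    while (n != NULL) {
        /* Ask for the next node while we're still working on this one.
           Arguments: address, 0 = read, 3 = keep in all cache levels. */
        if (n->next != NULL)
            __builtin_prefetch(n->next, 0, 3);

        for (int i = 0; i < 15; i++)
            total += n->payload[i];
        n = n->next;
    }
    return total;
}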

16. Cache Replacement Policies: Deciding Which Data to Evict

When a cache set is full, the CPU must evict a cache line to make room for new data. The policy used to choose the victim affects performance and conflict misses.

Least Recently Used (LRU)

- Evicts the cache line that has not been accessed for the longest time.
- Advantage: Exploits temporal locality.
- Commonly used in L1/L2 caches.

First-In, First-Out (FIFO)

- Evicts the oldest cache line, regardless of usage.
- Simpler but can perform poorly if older data is still frequently accessed.

Random Replacement

- Chooses a cache line at random to evict.
- Reduces hardware complexity and works surprisingly well in some scenarios.

Pseudo-LRU (PLRU)

- Approximation of LRU for higher-associativity caches.
- Requires fewer hardware resources than true LRU.

17. Cache Miss Types: Understanding Why Data Isn't Found

Not all cache misses are the same. Understanding the type helps programmers optimize memory access patterns.

17a. Compulsory Misses

- Also called cold-start misses.
- Occur when a cache line is accessed for the first time.
- Cannot be avoided, but can be mitigated with prefetching.

17b. Capacity Misses

- Occur when the cache is too small to hold the working set of the program.
- Even with perfect placement, data is evicted due to limited cache size.

17c. Conflict Misses

- Happen in set-associative or direct-mapped caches when multiple blocks map to the same set.
- Example: Two frequently accessed addresses always map to the same set, evicting each other repeatedly.
- Increasing associativity reduces these misses.

18. NUMA Considerations: Multi-Socket Memory Architectures

In systems with multiple CPUs (multi-socket), memory is physically distributed. Access time depends on which CPU “owns” the memory.

18a. Local vs Remote Memory

- Local memory: Memory physically attached to the same CPU socket.
- Remote memory: Memory attached to another CPU socket. Accessing it is slower.

18b. Cache and NUMA

- Each CPU still has its private L1/L2 caches and possibly a shared L3 cache.
- NUMA-aware programming is crucial: assign threads to CPUs close to the memory they access most.
- Tools like numactl (and the libnuma library) allow fine-grained control of thread and memory placement, as in the sketch below.
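As an example of that control, here's a hedged sketch using libnuma (link with -lnuma on Linux; the node number 0 is an assumption about your machine): it runs the current thread on NUMA node 0 and allocates its buffer from that node's local memory, so every access stays local.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    /* Run this thread on the CPUs of node 0... */
    numa_run_on_node(0);

    /* ...and allocate 256 MB from node 0's local memory. */
    size_t len = 256UL << 20;
    char *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) { fprintf(stderr, "numa_alloc_onnode failed\n"); return 1; }

    for (size_t i = 0; i < len; i += 4096)
        buf[i] = 1;   /* local accesses: no cross-socket hops */

    numa_free(buf, len);
    return 0;
}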

19. Cache and Power Considerations

High-speed caches improve performance but consume power, impacting energy efficiency, especially in mobile and embedded devices.

19a. Trade-offs

- Larger caches: higher hit rate but more transistors and power.
- Smaller caches: lower power but higher miss rates.

19b. Techniques for Power Efficiency

- Cache gating: selectively turning off unused cache sections.
- Dynamic resizing: some modern CPUs adjust cache resources based on workload.

20. Speculative Execution and Cache Side Effects

Modern CPUs predict future instructions to keep execution pipelines full. While this improves speed, it interacts with caches in subtle ways.

20a. Speculative Prefetching

- CPU may load data for predicted branches into the cache.
- Helps reduce pipeline stalls but can evict useful data if predictions are wrong.

20b. Security Implications

- Speculative execution combined with caching can leak data (e.g., Spectre, Meltdown vulnerabilities).
- Attackers exploit timing differences between cache hits and misses to infer sensitive information.
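The timing difference itself is easy to observe. Here's a rough, x86-specific sketch (my own illustration, using the rdtsc and clflush intrinsics; the fences are only a crude attempt at serialization and the cycle counts will be noisy) that times the same load once after the line has been flushed from the cache and once while it is hot:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>    /* __rdtsc, _mm_clflush, _mm_mfence, _mm_lfence */

static uint64_t time_read(volatile char *p) {
    _mm_mfence(); _mm_lfence();          /* crude serialization around the load */
    uint64_t start = __rdtsc();
    (void)*p;
    _mm_lfence();
    return __rdtsc() - start;
}

int main(void) {
    static char data[64];

    _mm_clflush(data);                   /* throw the line out of every cache */
    _mm_mfence();
    uint64_t cold = time_read(data);     /* cache miss: hundreds of cycles    */

    uint64_t warm = time_read(data);     /* cache hit: a handful of cycles    */

    printf("miss: ~%llu cycles, hit: ~%llu cycles\n",
           (unsigned long long)cold, (unsigned long long)warm);
    return 0;
}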

20c. Programmer Awareness

- While speculative execution is automatic, understanding its interaction with caches is important for performance tuning and security-sensitive applications.