Matrix and its Multiplication
Before jumping straight to GPUs and AI accelerators, I wanted to really understand how matrix multiplication (MM) works on a CPU.
I started with simple, hand-written code and gradually explored tricks like changing loop order, blocking for the cache, and using SIMD instructions.
Along the way, I got a sense of how libraries like OpenBLAS squeeze out every bit of performance — and it’s fascinating to see how much speed you can gain without touching a GPU.
1. Hand Calculation: Naive MatMul
Let’s start with two 4×4 matrices and multiply them the usual “by-hand” way, using the ijk loop order.
That simply means we go through the rows of A (that’s the i loop), then the columns of B (the j loop), and for each pair, we run through all the elements along that row and column (the k loop) to get one number in the result.
It’s the most natural way to think about matrix multiplication — row by column — and it’s exactly how we’d do it by hand. We’ll look at how changing this loop order can affect performance in the next section.
We want to compute C = A × B. Using the ijk order, the formula for each element is:
C[i][j] = A[i][0]×B[0][j] + A[i][1]×B[1][j] + … + A[i][N−1]×B[N−1][j]
Each element of C is the sum of products of a row from A and a column from B.
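Here is that computation as a tiny C program, a minimal sketch using the two matrices we are about to trace by hand (A holds 1 through 16 row by row, B holds 16 down to 1):
#include <stdio.h>

int main(void) {
    // The two 4x4 matrices we multiply by hand below
    int A[4][4] = {{ 1,  2,  3,  4},
                   { 5,  6,  7,  8},
                   { 9, 10, 11, 12},
                   {13, 14, 15, 16}};
    int B[4][4] = {{16, 15, 14, 13},
                   {12, 11, 10,  9},
                   { 8,  7,  6,  5},
                   { 4,  3,  2,  1}};
    int C[4][4] = {0};

    // ijk order: row of A (i), column of B (j), dot product along k
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                C[i][j] += A[i][k] * B[k][j];

    // Print the result matrix
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
            printf("%5d", C[i][j]);
        printf("\n");
    }
    return 0;
}
Running it prints exactly the matrix C we derive step by step below.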
Step-by-Step Calculation
Now, let’s compute each element of C = A × B using the ijk order.
For each row i in A and column j in B,
we sum over all k values (from 0 to 3) as shown in our formula above.
Row 0 of C (i = 0):
C[0][0] = 1×16 + 2×12 + 3×8 + 4×4 = 16 + 24 + 24 + 16 = 80
C[0][1] = 1×15 + 2×11 + 3×7 + 4×3 = 15 + 22 + 21 + 12 = 70
C[0][2] = 1×14 + 2×10 + 3×6 + 4×2 = 14 + 20 + 18 + 8 = 60
C[0][3] = 1×13 + 2×9 + 3×5 + 4×1 = 13 + 18 + 15 + 4 = 50
Row 1 of C (i = 1):
C[1][0] = 5×16 + 6×12 + 7×8 + 8×4 = 80 + 72 + 56 + 32 = 240
C[1][1] = 5×15 + 6×11 + 7×7 + 8×3 = 75 + 66 + 49 + 24 = 214
C[1][2] = 5×14 + 6×10 + 7×6 + 8×2 = 70 + 60 + 42 + 16 = 188
C[1][3] = 5×13 + 6×9 + 7×5 + 8×1 = 65 + 54 + 35 + 8 = 162
Row 2 of C (i = 2):
C[2][0] = 9×16 + 10×12 + 11×8 + 12×4 = 144 + 120 + 88 + 48 = 400
C[2][1] = 9×15 + 10×11 + 11×7 + 12×3 = 135 + 110 + 77 + 36 = 358
C[2][2] = 9×14 + 10×10 + 11×6 + 12×2 = 126 + 100 + 66 + 24 = 316
C[2][3] = 9×13 + 10×9 + 11×5 + 12×1 = 117 + 90 + 55 + 12 = 274
Row 3 of C (i = 3):
C[3][0] = 13×16 + 14×12 + 15×8 + 16×4 = 208 + 168 + 120 + 64 = 560
C[3][1] = 13×15 + 14×11 + 15×7 + 16×3 = 195 + 154 + 105 + 48 = 502
C[3][2] = 13×14 + 14×10 + 15×6 + 16×2 = 182 + 140 + 90 + 32 = 444
C[3][3] = 13×13 + 14×9 + 15×5 + 16×1 = 169 + 126 + 75 + 16 = 386
After computing all rows, we get our final result matrix C — the product of A and B using the naive ijk order:
[[80, 70, 60, 50],
[240, 214, 188, 162],
[400, 358, 316, 274],
[560, 502, 444, 386]]
Before We Move On to Loop Ordering Methods
Number of Operations
For our 4×4 matrix multiplication, we calculated by hand that each element needs 4 multiplications and 3 additions. Since we have 16 elements in the result matrix:
Total multiplications: 4 × 4 × 4 = 64
Total additions: 16 × 3 = 48
Total operations: 112
The formula for any N×N matrix multiplication is 2N³ − N² total operations (N³ multiplications plus N²(N − 1) additions). As the matrix size grows, operations increase cubically: an 8×8 matrix needs 2×8³ − 8² = 960 operations (almost 9× more than the 4×4 case), and a 512×512 matrix needs about 268 million operations.
Memory Usage
We need to store three 4×4 matrices (A, B, and C), which is 48 total elements. Memory depends on the data type:
fp16 (2 bytes per number): 96 bytes total
fp32 (4 bytes per number): 192 bytes total
fp64 (8 bytes per number): 384 bytes total
The formula is 3 × N² × bytes per element. Memory grows quadratically with N. A 512×512 matrix multiplication needs 1.5 MB for fp16, or 6 MB for fp64.
For 4×4 matrices, everything is tiny and fits in cache easily. But as N grows, the number of operations (N³) grows much faster than memory (N²), which is why cache optimization becomes critical for larger matrices. (we will come back to this later)
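To make both growth rates concrete, here is a short C sketch that evaluates the two formulas (2N³ − N² operations, and 3 × N² × 8 bytes for three fp64 matrices) for a few sizes:
#include <stdio.h>

int main(void) {
    long long sizes[] = {4, 8, 512, 1024};
    for (int s = 0; s < 4; s++) {
        long long n = sizes[s];
        long long ops = 2 * n * n * n - n * n;               // 2N^3 - N^2 operations
        double mem_mb = 3.0 * n * n * 8 / (1024.0 * 1024.0); // three fp64 matrices
        printf("N=%5lld: %12lld ops, %10.4f MB\n", n, ops, mem_mb);
    }
    return 0;
}
For N = 512 this prints the 268,173,312 operations and 6 MB we just derived; the operation count has already outgrown the memory footprint by a huge margin.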
2. Loop Ordering (ijk vs ikj vs kij)
We just computed matrix multiplication using the ijk order — the most intuitive way where we compute each element of C completely before moving to the next. But here's something interesting: we can change the order of these three nested loops and still get the exact same result!
The three loops are:
- i loop: iterates through rows of A (and rows of C)
- j loop: iterates through columns of B (and columns of C)
- k loop: the summation index that runs along A's columns and B's rows
By changing which loop is outermost, middle, and innermost, we change how we accumulate the result. While all orderings compute the same answer mathematically, they access memory in different patterns.
Let's look at two alternative orderings: ikj and kij.
ikj Ordering
In ikj ordering, we fix a row of A (i), then for each element in that row (k), we update an entire row of C by multiplying that single element with the corresponding row of B.
The formula is the same, but we accumulate differently: for each fixed i and k, we do C[i][j] += A[i][k] × B[k][j] for every j.
Formula explanation: Instead of completing one C[i][j] at a time, we process C row by row. For each A[i][k], we multiply it with the entire k-th row of B and add the results to row i of C. This means we're building up each row of C gradually across all k values.
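In loop form, the ikj accumulation looks like the sketch below (assuming C starts zeroed; the benchmarked version of this function appears in section 4):
// Sketch of the ikj update. C must start zeroed; n = 4 in our example.
void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)         // fix row i of A (and of C)
        for (int k = 0; k < n; k++) {   // walk along row i of A
            double a = A[i][k];         // one scalar from A, reused n times
            for (int j = 0; j < n; j++) // update all of row i of C
                C[i][j] += a * B[k][j]; // row k of B, read left to right
        }
}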
Step-by-Step Calculation (ikj)
Let's compute the same matrices A and B, but now using ikj ordering. We'll initialize C to all zeros first.
i=0, k=0: Take A[0][0]=1, multiply with entire row B[0]
C[0][0] += 1×16 = 16
C[0][1] += 1×15 = 15
C[0][2] += 1×14 = 14
C[0][3] += 1×13 = 13
Row 0 of C so far: [16, 15, 14, 13]
i=0, k=1: Take A[0][1]=2, multiply with entire row B[1]
C[0][0] += 2×12 = 16+24 = 40
C[0][1] += 2×11 = 15+22 = 37
C[0][2] += 2×10 = 14+20 = 34
C[0][3] += 2×9 = 13+18 = 31
Row 0 of C so far: [40, 37, 34, 31]
i=0, k=2: Take A[0][2]=3, multiply with entire row B[2]
C[0][0] += 3×8 = 40+24 = 64
C[0][1] += 3×7 = 37+21 = 58
C[0][2] += 3×6 = 34+18 = 52
C[0][3] += 3×5 = 31+15 = 46
Row 0 of C so far: [64, 58, 52, 46]
i=0, k=3: Take A[0][3]=4, multiply with entire row B[3]
C[0][0] += 4×4 = 64+16 = 80
C[0][1] += 4×3 = 58+12 = 70
C[0][2] += 4×2 = 52+8 = 60
C[0][3] += 4×1 = 46+4 = 50
Row 0 of C complete: [80, 70, 60, 50]
Continuing for i=1, i=2, i=3... (following the same pattern for each row)
After processing all rows with ikj ordering, we get:
Row 0: [80, 70, 60, 50]
Row 1: [240, 214, 188, 162]
Row 2: [400, 358, 316, 274]
Row 3: [560, 502, 444, 386]
Same result! The final matrix C is identical to what we got with ijk ordering.
Metrics for ikj
Number of Operations: Same as ijk — 64 multiplications, 48 additions, 112 total operations. The number of operations doesn't change; only the order changes.
Memory Access Pattern: The key difference is how we access B. In ikj, we access B[k][j] where k is fixed and j varies — this means we're reading entire rows of B sequentially. This is better for cache because consecutive elements are loaded together in cache lines.
Cache Behavior: For small 4×4 matrices, there's no practical difference. But for 512×512 matrices, ikj is about 2× faster than ijk because it accesses B row-wise instead of column-wise, resulting in fewer cache misses.
kij Ordering
Now, let's take a look at the last method.
In kij ordering, we fix a value of k (which corresponds to one column of A and one row of B), then for each row i of A, we take A[i][k] and multiply it with the entire k-th row of B, adding the results to the corresponding row of C. This is like computing the result as a sum of "outer products" — each iteration of k contributes one layer to building up the entire matrix C.
The formula remains the same; only the accumulation order changes: for each fixed k, we do C[i][j] += A[i][k] × B[k][j] for every i and j.
Formula explanation: Instead of working row by row (ikj) or element by element (ijk), we process k first. For each k value, we take the entire k-th column of A and k-th row of B, and perform an outer product-like operation that updates all of C at once. We build up C gradually by summing contributions from each k.
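As a loop sketch (again assuming C starts zeroed; this is the same structure we benchmark in section 4):
// Sketch of the kij update. C must start zeroed.
void matmul_kij(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int k = 0; k < n; k++)         // one outer-product "layer" per k
        for (int i = 0; i < n; i++) {
            double a = A[i][k];         // element i of column k of A
            for (int j = 0; j < n; j++) // add a * (row k of B) to row i of C
                C[i][j] += a * B[k][j];
        }
}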
Step-by-Step Calculation (kij)
Let's compute the same matrices A and B using kij ordering. Again, we start with C initialized to all zeros.
k=0: Use column A[:,0] = [1, 5, 9, 13] and row B[0,:] = [16, 15, 14, 13]
C[0][0] += 1×16 = 16, C[0][1] += 1×15 = 15, C[0][2] += 1×14 = 14, C[0][3] += 1×13 = 13
C[1][0] += 5×16 = 80, C[1][1] += 5×15 = 75, C[1][2] += 5×14 = 70, C[1][3] += 5×13 = 65
C[2][0] += 9×16 = 144, C[2][1] += 9×15 = 135, C[2][2] += 9×14 = 126, C[2][3] += 9×13 = 117
C[3][0] += 13×16 = 208, C[3][1] += 13×15 = 195, C[3][2] += 13×14 = 182, C[3][3] += 13×13 = 169
Now k=0 is done. So, C after k=0:
[[16, 15, 14, 13],
[80, 75, 70, 65],
[144, 135, 126, 117],
[208, 195, 182, 169]]
k=1: Use column A[:,1] = [2, 6, 10, 14] and row B[1,:] = [12, 11, 10, 9]
C[0][0] += 2×12 = 16+24 = 40, C[0][1] += 2×11 = 15+22 = 37, C[0][2] += 2×10 = 14+20 = 34, C[0][3] += 2×9 = 13+18 = 31
C[1][0] += 6×12 = 80+72 = 152, C[1][1] += 6×11 = 75+66 = 141, C[1][2] += 6×10 = 70+60 = 130, C[1][3] += 6×9 = 65+54 = 119
C[2][0] += 10×12 = 144+120 = 264, C[2][1] += 10×11 = 135+110 = 245, C[2][2] += 10×10 = 126+100 = 226, C[2][3] += 10×9 = 117+90 = 207
C[3][0] += 14×12 = 208+168 = 376, C[3][1] += 14×11 = 195+154 = 349, C[3][2] += 14×10 = 182+140 = 322, C[3][3] += 14×9 = 169+126 = 295
Now k=1 is done. So C after k=1:
[[40, 37, 34, 31],
[152, 141, 130, 119],
[264, 245, 226, 207],
[376, 349, 322, 295]]
k=2: Use column A[:,2] = [3, 7, 11, 15] and row B[2,:] = [8, 7, 6, 5]
C[0][0] += 3×8 = 40+24 = 64, C[0][1] += 3×7 = 37+21 = 58, C[0][2] += 3×6 = 34+18 = 52, C[0][3] += 3×5 = 31+15 = 46
C[1][0] += 7×8 = 152+56 = 208, C[1][1] += 7×7 = 141+49 = 190, C[1][2] += 7×6 = 130+42 = 172, C[1][3] += 7×5 = 119+35 = 154
C[2][0] += 11×8 = 264+88 = 352, C[2][1] += 11×7 = 245+77 = 322, C[2][2] += 11×6 = 226+66 = 292, C[2][3] += 11×5 = 207+55 = 262
C[3][0] += 15×8 = 376+120 = 496, C[3][1] += 15×7 = 349+105 = 454, C[3][2] += 15×6 = 322+90 = 412, C[3][3] += 15×5 = 295+75 = 370
Now k=2 is done. So C after k=2:
[[64, 58, 52, 46],
[208, 190, 172, 154],
[352, 322, 292, 262],
[496, 454, 412, 370]]
k=3: Use column A[:,3] = [4, 8, 12, 16] and row B[3,:] = [4, 3, 2, 1]
C[0][0] += 4×4 = 64+16 = 80, C[0][1] += 4×3 = 58+12 = 70, C[0][2] += 4×2 = 52+8 = 60, C[0][3] += 4×1 = 46+4 = 50
C[1][0] += 8×4 = 208+32 = 240, C[1][1] += 8×3 = 190+24 = 214, C[1][2] += 8×2 = 172+16 = 188, C[1][3] += 8×1 = 154+8 = 162
C[2][0] += 12×4 = 352+48 = 400, C[2][1] += 12×3 = 322+36 = 358, C[2][2] += 12×2 = 292+24 = 316, C[2][3] += 12×1 = 262+12 = 274
C[3][0] += 16×4 = 496+64 = 560, C[3][1] += 16×3 = 454+48 = 502, C[3][2] += 16×2 = 412+32 = 444, C[3][3] += 16×1 = 370+16 = 386
Final C:
[[80, 70, 60, 50],
[240, 214, 188, 162],
[400, 358, 316, 274],
[560, 502, 444, 386]]
Same result again! All three loop orderings (ijk, ikj, kij) produce the identical matrix C.
I showed the full kij calculation step-by-step because this ordering might feel the most different from how we naturally think about matrix multiplication.
While ijk goes element-by-element and ikj goes row-by-row, kij builds up the entire result matrix layer-by-layer through outer products.
It looks complex at first, but it's really just reorganizing the same 112 multiplications and additions — we're doing the exact same arithmetic,
just in a different sequence. Once you see it fully worked out, the pattern becomes clear: for each k,
we're adding one "contribution" from A's k-th column and B's k-th row to the entire result matrix.
Metrics for kij
Number of Operations: Still 64 multiplications, 48 additions, 112 total operations. The computational work is identical across all loop orderings.
Memory Access Pattern: In kij, the expensive access is well amortized. For each fixed k, A[i][k] is loaded once per row and can then sit in a register while the inner loop runs; row k of B is read sequentially; and row i of C is updated sequentially. The only strided access is that single load of A[i][k], and its cost is spread across N updates to C.
Cache Behavior: Like ikj, kij has good cache performance for large matrices. For 512×512 matrices, kij performs similarly to ikj (about 2× faster than ijk) because it avoids the column-wise traversal of B that makes ijk slow. The outer product style of computation updates the entire matrix C in a cache-friendly way.
3. Here Comes the Catch: The Cache in MatMul
Important: I hope readers have a basic understanding of CPU caches. If not, feel free to check out my post about CPU caches. It's not mandatory, though — I will try to explain things from the ground up.
CPUs execute instructions using ALUs and FPUs, which constantly need data to work on. While the main memory can provide it, accessing it directly is slow. This is where the CPU cache comes in — a small, fast memory located close to the CPU that reduces access time. Cache is much smaller than main memory, with L1 being the fastest and tiniest. For matrix multiplication, where matrices can be large, keeping cache in mind is crucial for writing efficient code.
How Matrices are Stored in Memory
Just take our matrix A as an example (rewritten here for convenience):
 1   2   3   4
 5   6   7   8
 9  10  11  12
13  14  15  16
What do you see here? A 4×4 grid of numbers, arranged neatly in rows and columns. But a computer cannot store it exactly like this: memory is a single linear sequence of addresses, so the numbers have to be laid out one after another. Now you might ask: in what order? This is where row-major and column-major storage comes into play.
Row-Major vs Column-Major Memory Layout
Row-major and column-major order are two distinct methods for storing multi-dimensional arrays, like matrices, in computer memory, which is linear.
In row-major order, elements within the same row are stored contiguously in memory. This means that after storing all elements of the first row, the elements of the next row follow directly.
left to right, top to bottom
Conversely, in column-major order, elements within the same column are stored contiguously. So, all elements of the first column are stored sequentially, followed by the elements of the second column, and so on.
top to bottom, left to right
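For our matrix A, the two layouts put the same 16 numbers in different linear orders. The short C sketch below makes this concrete; note that C itself is row-major, so the column-major walk is simulated with index arithmetic:
#include <stdio.h>

#define N 4

int main(void) {
    int A[N][N] = {{ 1,  2,  3,  4},
                   { 5,  6,  7,  8},
                   { 9, 10, 11, 12},
                   {13, 14, 15, 16}};
    int *flat = &A[0][0]; // C stores A row-major: this pointer walks memory in order

    printf("Row-major (actual memory order): ");
    for (int idx = 0; idx < N * N; idx++)
        printf("%d ", flat[idx]);           // 1 2 3 4 5 6 7 8 ...
    printf("\nColumn-major order (simulated): ");
    for (int j = 0; j < N; j++)             // walk column by column
        for (int i = 0; i < N; i++)
            printf("%d ", flat[i * N + j]); // 1 5 9 13 2 6 10 14 ...
    printf("\n");
    return 0;
}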
4. MatMul Performance Benchmarking in C
I’m sorry, everyone! I know most of you love Python for ML/AI, but Python alone can’t clearly show the performance effects we want to demonstrate. Don’t worry — we’ll come back to Python and NumPy later when we explore OpenBLAS and kernel-level optimizations.
In this section, we'll benchmark three different loop orderings for matrix multiplication to understand how memory access patterns affect performance.
All three implementations compute the exact same result C = A × B — but they access memory in different orders.
The differences in performance will reveal how critical cache locality is for computational efficiency.
We're using C for these benchmarks because it gives us direct control over memory access patterns and allows us to see the raw performance differences without layers of abstraction.
Each implementation will multiply two 512×512 matrices, performing over 268 million floating-point operations.
The key insight: it's not just what you compute, but how you access memory while computing it.
The ijk Ordering (Naive Approach)
This is the most intuitive implementation — it directly follows the mathematical definition of matrix multiplication. For each element C[i][j], we sum the products A[i][k] × B[k][j] across all k. However, this ordering has poor cache performance because B[k][j] accesses column elements of B, which are far apart in memory due to row-major storage.
#define N 512
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
C[i][j] = 0.0;
for (int k = 0; k < N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
The ikj Ordering (Cache-Friendly)
By swapping the j and k loops, we improve cache locality for matrix B. Now the innermost loop traverses B[k][j] across consecutive columns (row-wise in memory), while A[i][k] is reused across multiple iterations. This change alone can yield significant speedups by reducing cache misses.
#define N 512
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N]) {
for (int i = 0; i < N; i++) {
for (int k = 0; k < N; k++) {
for (int j = 0; j < N; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
The kij Ordering (Alternative Optimization)
This ordering computes the result by accumulating outer products — for each k, we add the contribution A[i][k] × B[k][j] to all elements. While less intuitive, this ordering also provides good cache behavior by maintaining spatial locality in how we access both input matrices and write to the output matrix C.
#define N 512
void matmul_kij(double A[N][N], double B[N][N], double C[N][N]) {
for (int k = 0; k < N; k++) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
Matrix Initialization and Helper Functions
Before we can benchmark our multiplication algorithms, we need utility functions to initialize matrices with random values, zero them out for each test run, and verify that different loop orderings produce identical results (within floating-point precision).
// Initialize matrix with random values between 0 and 10
void init_matrix(double M[N][N]) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
M[i][j] = (double)rand() / RAND_MAX * 10.0;
}
}
}
// Initialize matrix to zero (important for accumulation operations)
void zero_matrix(double M[N][N]) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
M[i][j] = 0.0;
}
}
}
// Verify two matrices are equal within epsilon tolerance
int matrices_equal(double M1[N][N], double M2[N][N], double epsilon) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
if (fabs(M1[i][j] - M2[i][j]) > epsilon) {
return 0;
}
}
}
return 1;
}
The Benchmarking Framework
Our main function orchestrates the entire benchmark: allocating five matrices (A, B, and three output matrices for each ordering), initializing them, running each multiplication variant, measuring execution time, and verifying correctness. We use heap allocation rather than stack allocation to avoid stack overflow with large matrices, and we measure performance in GFLOPS (billions of floating-point operations per second) to standardize our results.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

// Count total floating-point operations for matrix multiplication
long long count_operations(int n) {
long long muls = (long long)n * n * n; // N³ multiplications
long long adds = (long long)n * n * (n - 1); // N²(N-1) additions
return muls + adds;
}
int main() {
// Allocate matrices on heap (stack might overflow for large N)
double (*A)[N] = malloc(N * N * sizeof(double));
double (*B)[N] = malloc(N * N * sizeof(double));
double (*C_ijk)[N] = malloc(N * N * sizeof(double));
double (*C_ikj)[N] = malloc(N * N * sizeof(double));
double (*C_kij)[N] = malloc(N * N * sizeof(double));
if (!A || !B || !C_ijk || !C_ikj || !C_kij) {
printf("Memory allocation failed!\n");
return 1;
}
srand(42); // Fixed seed for reproducibility
printf("Matrix Multiplication Performance Comparison\n");
printf("Matrix size: %d x %d\n", N, N);
printf("Total operations: %lld\n", count_operations(N));
printf("Memory per matrix: %.2f MB\n", (N * N * sizeof(double)) / (1024.0 * 1024.0));
// Initialize matrices
init_matrix(A);
init_matrix(B);
// Benchmark each ordering
// Test 1: ijk ordering
printf("\n---- Testing ijk ordering ----\n");
zero_matrix(C_ijk);
clock_t start = clock();
matmul_ijk(A, B, C_ijk);
clock_t end = clock();
double time_ijk = ((double)(end - start)) / CLOCKS_PER_SEC;
double gflops_ijk = (count_operations(N) / 1e9) / time_ijk;
printf("Time: %.4f seconds\n", time_ijk);
printf("Performance: %.2f GFLOPS\n", gflops_ijk);
// Test 2: ikj ordering
printf("\n---- Testing ikj ordering ----\n");
zero_matrix(C_ikj);
start = clock();
matmul_ikj(A, B, C_ikj);
end = clock();
double time_ikj = ((double)(end - start)) / CLOCKS_PER_SEC;
double gflops_ikj = (count_operations(N) / 1e9) / time_ikj;
printf("Time: %.4f seconds\n", time_ikj);
printf("Performance: %.2f GFLOPS\n", gflops_ikj);
printf("Speedup vs ijk: %.2fx\n", time_ijk / time_ikj);
// Test 3: kij ordering
printf("\n---- Testing kij ordering ----\n");
zero_matrix(C_kij);
start = clock();
matmul_kij(A, B, C_kij);
end = clock();
double time_kij = ((double)(end - start)) / CLOCKS_PER_SEC;
double gflops_kij = (count_operations(N) / 1e9) / time_kij;
printf("Time: %.4f seconds\n", time_kij);
printf("Performance: %.2f GFLOPS\n", gflops_kij);
printf("Speedup vs ijk: %.2fx\n", time_ijk / time_kij);
// Verify correctness
printf("\n--- Correctness Check ---\n");
printf("ikj matches ijk: %s\n", matrices_equal(C_ijk, C_ikj, 1e-6) ? "YES ✓" : "NO ✗");
printf("kij matches ijk: %s\n", matrices_equal(C_ijk, C_kij, 1e-6) ? "YES ✓" : "NO ✗");
// Performance summary
printf("\n=== Performance Summary ===\n");
printf("%-10s %-10s %-10s %s\n", "Ordering", "Time(s)", "GFLOPS", "Speedup");
printf("%-10s %-10.4f %-10.2f %.2fx\n", "ijk", time_ijk, gflops_ijk, 1.0);
printf("%-10s %-10.4f %-10.2f %.2fx\n", "ikj", time_ikj, gflops_ikj, time_ijk / time_ikj);
printf("%-10s %-10.4f %-10.2f %.2fx\n", "kij", time_kij, gflops_kij, time_ijk / time_kij);
// Cleanup
free(A);
free(B);
free(C_ijk);
free(C_ikj);
free(C_kij);
return 0;
}
With a solid understanding of matrix multiplication, cache hierarchies, and memory access patterns behind us, it's time to move from theory to practice. Let's compile our code, run the benchmarks, and see how different loop orderings perform in reality.
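If you want to reproduce the run, all of the code above can live in a single file. A typical invocation looks like this (the file name matmul.c is my own choice, and exact numbers will vary with your compiler, flags, and hardware):
gcc -O2 -o matmul matmul.c -lm
./matmul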
5. Benchmark Results: The Numbers Speak for Themselves
Matrix Multiplication Performance Comparison
=============================================
Matrix size: 512 x 512
Total operations: 268173312
Memory per matrix: 2.00 MB
Total memory: 10.00 MB
Initializing matrices...
---- Testing ijk ordering ----
Time: 0.7866 seconds
Performance: 0.34 GFLOPS
---- Testing ikj ordering ----
Time: 0.3420 seconds
Performance: 0.78 GFLOPS
Speedup vs ijk: 2.30x
---- Testing kij ordering ----
Time: 0.3613 seconds
Performance: 0.74 GFLOPS
Speedup vs ijk: 2.18x
---- Correctness Check ----
ikj matches ijk: YES ✓
kij matches ijk: YES ✓
=== Performance Summary ===
Ordering Time(s) GFLOPS Speedup
ijk 0.7866 0.34 1.00x
ikj 0.3420 0.78 2.30x
kij 0.3613 0.74 2.18x
=============================================
The results are dramatic and exactly what we predicted. Simply by reordering our loops — without changing a single calculation — we achieved a 2.3x speedup with the ikj ordering and a 2.18x speedup with kij ordering compared to the naive ijk approach. Both optimized versions compute the exact same result (verified by our correctness check), yet they run in less than half the time.
Let's break down what happened:
The ijk ordering took 0.79 seconds and achieved only 0.34 GFLOPS, spending most of its time waiting for data to arrive from slower memory levels.
Meanwhile, the ikj ordering completed in just 0.34 seconds with 0.78 GFLOPS — more than double the throughput.
The kij ordering performed similarly at 0.36 seconds and 0.74 GFLOPS.
These aren't small improvements — we're talking about getting more than twice the work done in the same amount of time, purely by being smarter about memory access patterns.
This is the power of cache-aware programming. Same algorithm, same number of operations, same hardware — but radically different performance simply because we're working with the cache hierarchy instead of against it.
Understanding the Results: A Deep Dive with Concrete Examples
Numbers are great, but let's understand why we're seeing these performance differences. We'll walk through exactly what happens at the hardware level when each loop ordering executes, using concrete examples with realistic cache behavior, memory latency, and CPU pipeline effects.
Setting Up Our Example with Real Hardware Characteristics
Let's work with small 4×4 matrices for visualization, but we'll also analyze the full 512×512 case. First, let's establish our hardware model based on typical modern CPUs:
Cache Hierarchy (Typical x86-64 CPU):
┌─────────────────────────────────────────────────────────────┐
│ L1 Data Cache: 32 KB, 8-way set associative │
│ - Cache Line Size: 64 bytes (8 doubles) │
│ - Latency: ~4 cycles (~1.3 ns @ 3 GHz) │
│ - Bandwidth: ~100 GB/s │
├─────────────────────────────────────────────────────────────┤
│ L2 Cache: 256 KB, 8-way set associative │
│ - Latency: ~12 cycles (~4 ns) │
│ - Bandwidth: ~50 GB/s │
├─────────────────────────────────────────────────────────────┤
│ L3 Cache: 8 MB, 16-way set associative │
│ - Latency: ~40 cycles (~13 ns) │
│ - Bandwidth: ~30 GB/s │
├─────────────────────────────────────────────────────────────┤
│ Main Memory (DDR4): 16 GB │
│ - Latency: ~200 cycles (~65 ns) │
│ - Bandwidth: ~20 GB/s │
└─────────────────────────────────────────────────────────────┘
Important Constants:
- sizeof(double) = 8 bytes
- Cache line = 64 bytes = 8 doubles
- Matrix size = 512 × 512 = 262,144 doubles = 2 MB per matrix
- Total working set = 3 matrices = 6 MB (fits in L3, spills from L1/L2)
How Matrices Are Stored in Memory (Row-Major Order)
In C, matrices are stored in row-major order. Understanding the exact memory layout is crucial:
Matrix A:              Matrix B:
┌─────────────────┐    ┌─────────────────┐
│  1   2   3   4  │    │ 16  15  14  13  │
│  5   6   7   8  │    │ 12  11  10   9  │
│  9  10  11  12  │    │  8   7   6   5  │
│ 13  14  15  16  │    │  4   3   2   1  │
└─────────────────┘    └─────────────────┘
Memory Layout (each box = 8 bytes):
Address: 0x1000 0x1008 0x1010 0x1018 0x1020 0x1028...
A: [ 1 ][ 2 ][ 3 ][ 4 ][ 5 ][ 6 ]...
A[0][0] A[0][1] A[0][2] A[0][3] A[1][0] A[1][1]
Address: 0x2000 0x2008 0x2010 0x2018 0x2020 0x2028...
B: [ 16 ][ 15 ][ 14 ][ 13 ][ 12 ][ 11 ]...
B[0][0] B[0][1] B[0][2] B[0][3] B[1][0] B[1][1]
Column Access Pattern (the problem):
- B[0][0] is at address: 0x2000
- B[1][0] is at address: 0x2000 + (1 × 4 × 8) = 0x2020 (skip 4 doubles)
- B[2][0] is at address: 0x2000 + (2 × 4 × 8) = 0x2040 (skip 4 doubles)
- B[3][0] is at address: 0x2000 + (3 × 4 × 8) = 0x2060 (skip 4 doubles)
For 512×512 matrices:
- B[0][0] is at address: base
- B[1][0] is at address: base + 4096 bytes (skip 512 doubles = 4 KB!)
- B[2][0] is at address: base + 8192 bytes (skip another 4 KB)
- ...
Cache Line Behavior and Set-Associative Mapping
Modern caches are set-associative, not fully associative. This matters for understanding conflicts:
L1 Cache Organization (32 KB, 8-way set associative):
- Total cache lines: 32 KB ÷ 64 bytes = 512 cache lines
- Number of sets: 512 ÷ 8 ways = 64 sets
- Each memory address maps to a specific set based on its address
Address Breakdown (assuming 64-byte cache lines):
┌──────────────┬──────────────┬────────────────┐
│   Tag bits   │  Set Index   │  Offset        │
│              │  (6 bits)    │  (6 bits, 0-63)│
└──────────────┴──────────────┴────────────────┘
                 determines      byte offset
                 which set       within the line
Cache Conflict Example:
If more than 8 cache lines in active use map to the same set, we get "cache thrashing": constantly evicting and reloading lines.
For 512×512 matrices accessing a column, consecutive elements are 4,096 bytes apart, which is exactly 64 cache lines. With 64 sets, the set index is (address ÷ 64) mod 64, so a stride of 64 lines wraps all the way around to the very same set:
- Row 0, column 0: maps to set X
- Row 1, column 0: 64 cache lines later, so it maps to set X again!
- Row 2, column 0: set X again...
- Every element of the column lands in set X, so with only 8 ways, by row 8 we're already evicting the lines loaded for rows 0-7.
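You can check this with a few lines of C. The sketch below hard-codes the L1 geometry we assumed above (64-byte lines, 64 sets) and prints the set index for the start of each row of B:
#include <stdio.h>
#include <stdint.h>

#define N 512

int main(void) {
    static double B[N][N]; // static so the 2 MB array doesn't overflow the stack
    for (int row = 0; row < 10; row++) {
        uintptr_t addr = (uintptr_t)&B[row][0];      // rows are N*8 = 4096 bytes apart
        unsigned set = (unsigned)((addr / 64) % 64); // 64-byte lines, 64 sets
        printf("B[%d][0] -> L1 set %u\n", row, set);
    }
    return 0; // every line printed shows the same set index
}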
The ijk Ordering: A Cycle-by-Cycle Analysis
Let's trace what happens when we compute C[0][0] using ijk ordering, counting CPU cycles:
// The ijk loop structure in C:
for (int i = 0; i < N; i++) { // i = 0
for (int j = 0; j < N; j++) { // j = 0
C[i][j] = 0.0; // C[0][0] = 0.0
for (int k = 0; k < N; k++) { // k = 0, 1, 2, 3...
C[i][j] += A[i][k] * B[k][j];
}
}
}
Let's look at it iteration by iteration for C[0][0]:
k=0: C[0][0] += A[0][0] * B[0][0]
- Load A[0][0]: cold miss → load cache line [A[0][0]...A[0][7]] → ~4 cycles (an optimistic first-touch cost in this simplified model)
- Load B[0][0]: cold miss → load cache line [B[0][0]...B[0][7]] → ~4 cycles
- Multiply: 1 cycle (pipelined)
- Add to C[0][0]: 1 cycle
- Total: ~10 cycles
k=1: C[0][0] += A[0][1] * B[1][0]
- Load A[0][1]: L1 HIT (in cache from k=0) → ~1 cycle
- Load B[1][0]: L1 MISS (the next row of B, a full 4 KB away at N=512)
→ Best case L2: ~12 cycles
→ Likely L3: ~40 cycles
- Multiply + Add: 2 cycles
- Total: ~43-55 cycles
k=2: C[0][0] += A[0][2] * B[2][0]
- Load A[0][2]: L1 HIT → ~1 cycle
- Load B[2][0]: L1 MISS → ~40 cycles (L3) or ~200 cycles (RAM)
- Multiply + Add: 2 cycles
- Total: ~43-203 cycles
k=3: C[0][0] += A[0][3] * B[3][0]
- Load A[0][3]: L1 HIT → ~1 cycle
- Load B[3][0]: L1 MISS → ~40-200 cycles
- Multiply + Add: 2 cycles
- Total: ~43-203 cycles
Tallying just these first four k iterations (a small slice of computing one element, C[0][0], at N=512):
- Total cycles: 10 + 43 + 43 + 43 = ~139 cycles (best case with L3 hits)
- Or: 10 + 200 + 200 + 200 = ~610 cycles (worst case with RAM)
- Cache misses on B: 4 out of 4 accesses (100% miss rate!)
- Time waiting for memory: ~130-600 cycles
- Time computing: ~6 cycles
- Memory stalls are 20-100x longer than computation!
For a 512×512 matrix, computing ONE element C[0][0] in full:
- Total accesses to B: 512
- Cache hits: 0 (complete cache thrashing due to set conflicts)
- Cache misses: 512 (every access misses!)
- Time per cache miss:
- If in L2: ~12 cycles × 512 = 6,144 cycles
- If in L3: ~40 cycles × 512 = 20,480 cycles
- If in RAM: ~200 cycles × 512 = 102,400 cycles
- Arithmetic operations: 512 muls + 512 adds = 1,024 ops at 1-2 cycles each
- Memory stalls DOMINATE: 20,480 cycles waiting vs 1,024 cycles computing (20x more time waiting!)
Full matrix (262,144 elements × 512 operations each):
- Total B column accesses: 134 million
- Cache misses: ~130 million (near 100% miss rate due to pathological cache thrashing)
- Time in cache misses alone: 130M × 40 cycles = 5.2 billion cycles = 1.73 seconds @ 3GHz
- Our observed time was 0.79 seconds, which is faster than this estimate because:
- Some accesses hitting L2 instead of L3 (lower latency)
- Compiler optimizations (register reuse, loop unrolling)
- Hardware prefetching reducing effective latency for some patterns
- The model correctly predicts the order of magnitude and explains why ijk ordering is slow
The ikj Ordering: Why It's 2.3x Faster
Now the same computation with ikj ordering:
// The ikj loop structure in C:
for (int i = 0; i < N; i++) { // i = 0
for (int k = 0; k < N; k++) { // k = 0
for (int j = 0; j < N; j++) { // j = 0, 1, 2, 3...
C[i][j] += A[i][k] * B[k][j];
}
}
}
Let's look at it iteration by iteration for i=0, k=0:
j=0: C[0][0] += A[0][0] * B[0][0]
- Load A[0][0]: L1 miss → load [A[0][0]...A[0][7]] → ~4 cycles
- Load B[0][0]: L1 miss → load [B[0][0]...B[0][7]] → ~4 cycles
- Multiply + Add: 2 cycles
- Total: ~10 cycles
j=1: C[0][1] += A[0][0] * B[0][1]
- Load A[0][0]: IN REGISTER (reused from j=0) → 0 cycles!
- Load B[0][1]: L1 HIT (consecutive in cache line) → ~1 cycle
- Multiply + Add: 2 cycles
- Total: ~3 cycles
j=2: C[0][2] += A[0][0] * B[0][2]
- A[0][0]: still in register → 0 cycles
- B[0][2]: L1 HIT → ~1 cycle
- Multiply + Add: 2 cycles
- Total: ~3 cycles
j=3 through j=7: Similar, ~3 cycles each
j=8: C[0][8] += A[0][0] * B[0][8]
- A[0][0]: still in register
- B[0][8]: NEW cache line → L1 miss → ~4 cycles
- Multiply + Add: 2 cycles
- Total: ~6 cycles
Pattern for 512 iterations (full row):
When we process the entire innermost loop (j from 0 to 511), we're accessing B[0][0] through B[0][511] sequentially. Since each cache line holds 8 doubles, we need to load 64 cache lines total. This means:
- Cache line loads: 512 elements ÷ 8 per line = 64 cache line loads
- Cache hits: 512 - 64 = 448 hits
- Hit rate: 448/512 = 87.5% (compared to 0% for ijk!)
Cycles per iteration:
- First element of each cache line (64 times): ~6 cycles
- Remaining 7 elements of each line (448 times): ~3 cycles
- Total: 64 × 6 + 448 × 3 = 384 + 1,344 = 1,728 cycles for 512 operations
Compare this to ijk for the same 512 operations:
- ijk: ~20,480 cycles (512 cache misses × 40 cycles each)
- ikj: ~1,728 cycles
- Speedup: ~12x for this inner loop alone!
But why only 2.3x overall speedup instead of 12x?
- The 12x speedup only applies to the innermost loop execution
- Outer loops (changing i and k values) introduce additional memory accesses
- Writing to C matrix also causes cache pressure and potential misses
- Overall program includes initialization, verification, and other overhead
The Fundamental Lesson
This deep dive reveals several critical insights about modern computer architecture that we, as software engineers, need to understand:
- Memory is the bottleneck: We spend 200 cycles waiting for memory but only 1-2 cycles computing. A modern CPU can do billions of operations per second, but only if you keep it fed with data.
- Cache hierarchy is everything: The difference between L1 (4 cycles) and RAM (200 cycles) is 50x. Keeping data in cache is worth more than most algorithmic improvements.
- Access patterns matter more than code: ijk and ikj have identical algorithms and operation counts. The only difference is the order of memory access. Yet one is 2.3x faster.
- Hardware is optimized for sequential access: CPUs have prefetchers, cache lines, and TLBs all designed around the assumption that you'll access consecutive memory locations. Fighting this costs you dearly.
- The roofline model applies: Our ikj implementation achieves 0.78 GFLOPS out of a theoretical 12 GFLOPS peak (6.5% efficiency; 12 GFLOPS is what, say, a 3 GHz core sustaining 4 double-precision FLOPs per cycle would deliver). We're memory-bound, not compute-bound.
The takeaway: Modern CPUs are lightning-fast at computation but still bottlenecked by memory access — the so-called "memory wall." Our 2.3x speedup came entirely from improving memory access patterns, not from any algorithmic tricks or specialized instructions. And this is just the beginning — next, we’re moving on to tiling, where we’ll push cache efficiency even further.