Cache blocking/tiling
This course page was updated until March 2022, when I left Durham University. The materials herein are therefore not necessarily still up to date.

Achieving reasonable performance for loopy code #

Many of the algorithms we encounter in scientific computing have quite “simple” data access patterns. Numerical code often has multiple nested loops with regular array indexing. This is actually a reason the roofline model is so successful: its optimistic assumptions are not too optimistic.

Despite this simplicity, on hardware with memory caches, we still need to do some work to turn this “simple” code into something that runs with reasonable performance.

What do we mean by “reasonable”? Perhaps 50% of the relevant resource bottleneck (memory bandwidth or floating point throughput), and predictable performance without surprises. That is, we would hope to achieve this 50% of peak throughput independent of the problem size.

This does not necessarily mean that our algorithm should have linear runtime scaling with problem size. For example, it is possible to achieve constant throughput (FLOPs/s) for matrix-matrix multiply. So the performance is problem size independent, but since matrix-matrix multiplication is an $\mathcal{O}(N^3)$ algorithm, bigger problems will take more than linearly longer.
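To make that concrete: if we sustain a fixed throughput of $F$ flops/s, the roughly $2N^3$ floating point operations of a matrix-matrix multiply take

$$ T(N) \approx \frac{2N^3}{F}, $$

so doubling $N$ makes the computation take eight times as long, even though the achieved flops/s is unchanged.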

What can go wrong? #

We’ll look at two exemplar problems that on the face of it are simple, and we’ll observe how their naive implementation does not result in “surprise-free” performance.

Both of our examples come from dense linear algebra.

  1. Dense matrix transpose

    $$ B_{ij} \gets A_{ji} \quad A, B \in \mathbb{R}^{N\times N} $$

  2. Dense matrix-matrix multiply

    $$ C_{ij} \gets C_{ij} + \sum_{k=1}^{N} A_{ik}B_{kj} \quad A, B, C \in \mathbb{R}^{N\times N} $$

These are at “opposite” ends of the roofline plot. Matrix transposition has an arithmetic intensity of zero (it does no floating point operations), and is therefore definitely limited by streaming bandwidth. Conversely, matrix-matrix multiplication does $\mathcal{O}(N^3)$ floating point operations on $\mathcal{O}(N^2)$ data, for an arithmetic intensity of $\mathcal{O}(N)$ in the perfect cache model.

This is actually overly optimistic about what we can achieve in practice, but matrix-matrix multiplication is nonetheless still firmly in the flop-limited regime.
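As a rough count under the perfect cache model (my arithmetic, assuming double precision): the multiply performs about $2N^3$ flops, while $A$ and $B$ are read and $C$ is read and written, so

$$ \text{arithmetic intensity} \approx \frac{2N^3}{4 \times 8 N^2} = \frac{N}{16} \ \text{flops/byte}, $$

which grows with $N$, consistent with the claim that the problem is flop-limited for large matrices.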

Matrix transposition #

The naive implementation of matrix transposition looks like this:

void transpose(const double * restrict A, double * restrict B, int N)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      B[i*N + j] = A[j*N + i];
}

Since all this is doing is copying data, we might hope to see performance close to the streaming memory bandwidth, independently of $N$. Let’s have a look. On my laptop, the STREAM triad memory bandwidth is around 18 GB/s.
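Before looking at the numbers, here is a sketch of how such a bandwidth figure can be obtained. This is a hypothetical harness, not necessarily what the exercise code does; it counts one read of A and one write of B per transpose:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void transpose(const double * restrict A, double * restrict B, int N); /* as defined above */

int main(int argc, char **argv)
{
  int N = argc > 1 ? atoi(argv[1]) : 1000;
  double *A = malloc((size_t)N * N * sizeof(*A));
  double *B = malloc((size_t)N * N * sizeof(*B));
  for (int i = 0; i < N * N; i++) A[i] = i;

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  transpose(A, B, N);
  clock_gettime(CLOCK_MONOTONIC, &end);

  double seconds = (end.tv_sec - start.tv_sec) + 1e-9 * (end.tv_nsec - start.tv_nsec);
  double bytes = 2.0 * N * N * sizeof(double); /* read A once, write B once */
  printf("N = %d, BW = %.0f MB/s\n", N, bytes / seconds / 1e6);

  free(A);
  free(B);
  return 0;
}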

Using the transpose code from exercise 7, I can measure the achieved memory bandwidth for a range of matrix sizes.

$ for N in 100 200 300 500 700 1000 1500 2000 2500 3000 4000 5000; do
> ./transpose $N $N
> done

This prints a bunch of output, which I elide, but we can put it in a table

$N$     BW [MB/s]
100     60000
200     46700
300     46600
500     30600
700     23600
1000    11900
1500    11800
2000     4800
2500     4300
3000     3800
4000     3300
5000     2900

So we start out in the cache, and observe bandwidth significantly above the main memory bandwidth. However, as the matrices get larger, this falls off, and we end up with throughput much below the streaming bandwidth.

If we inspect the code, we notice that we have streaming access to B (good!), but stride-N access to A. This latter access pattern is bad for performance because we read a full cache line (64 bytes) each time we load an entry of A, but only use one double precision entry (8 bytes). The remaining entries of that line are not needed again until the next iteration of the outer loop, so to get any reuse we must hold $N$ cache lines, or $64N$ bytes, of A in the cache at once. Once the matrices get too large, this is no longer possible.
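To put a number on “too large” (a rough estimate, assuming typical cache sizes): at $N = 2000$ we would need

$$ 64N = 64 \times 2000 \ \text{bytes} = 128\ \text{KB} $$

of A resident, well beyond a typical 32 KB L1 data cache and a sizeable fraction of a per-core L2, which lines up with the progressive fall-off in the table above.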

Suppose that we estimate the time to write, or read, one byte as the inverse of the streaming bandwidth

$$ t_\text{write} = t_\text{read} = \frac{1}{18} \text{ s/GB} $$

Remembering that, when we are streaming from main memory, we only use an eighth of the effective bandwidth for the load of A, a model for the total time to transpose an $N\times N$ matrix is $$ T = N^2(8t_\text{read} + t_\text{write}). $$ In other words, our model says that in the limit of large $N$ we will only observe $\frac{1}{9}\text{th}$ of the streaming memory bandwidth.
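Plugging in the 18 GB/s streaming figure from above, the model predicts roughly

$$ \frac{18\ \text{GB/s}}{9} = 2\ \text{GB/s} $$

for large $N$, which is the right order of magnitude for the roughly 2900 MB/s measured at $N = 5000$.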

Here’s a picture of what is going on. The data are laid out typewriter style, and the red boxes indicate accesses that provoke the load of a full cache line. We see that when writing to B, we always use all of the loaded cache line, but the same is not true for A.

Figure: Access to B is stride-1, so works well.

Figure: Access to A is stride-N, and so for large matrices we do not get reuse.

Cache tiling for better performance #

What we need to do is to reorder the loops so that we run over little blocks, transposing them while keeping them in cache. The techniques we use are called strip mining and loop reordering. We might also hear the terms cache blocking, loop tiling, or loop blocking. The idea is to break loops into blocks of consecutive elements and then reorder for better spatial locality.

Original loops

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    B[i*N + j] = A[j*N + i];

After loop blocking

#define MIN(a, b) ((a) < (b) ? (a) : (b)) /* plain C has no min() */

for (int ii = 0; ii < N; ii += stridei)
  for (int jj = 0; jj < N; jj += stridej)
    for (int i = ii; i < MIN(N, ii + stridei); i++)
      for (int j = jj; j < MIN(N, jj + stridej); j++)
        B[i*N + j] = A[j*N + i];

The loop over i is split into two, with the outer loop strided by stridei; the j loop is similarly split (with stridej). The MIN calls handle the case where N is not an exact multiple of the block size.

This changes the iteration order over the matrices

Before loop blocking

Figure: Iteration over B is good.

Figure: Iteration over A does not get cache reuse.

After loop blocking

Figure: Iteration over B is still good.

Figure: Iteration over A now gets cache reuse.
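How big should the blocks be? As a rough sizing argument (my back-of-envelope, not part of the exercise): while we work on one tile, the parts of A and B it touches each occupy roughly $8 \times \text{stridei} \times \text{stridej}$ bytes once whole cache lines are counted, so we want

$$ 2 \times 8 \times \text{stridei} \times \text{stridej} \lesssim \text{cache size}. $$

For example, stridei = stridej = 64 gives 64 KB in total, which fits comfortably in a typical per-core L2 cache.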

Exercise 7 also provides a blocked version. So we can run the same experiment as before, except this time with loop tiling

$N$     BW original [MB/s]   BW tiled [MB/s]   BW OpenBLAS [MB/s]
100     60000                56200             63200
200     46700                45300             57900
300     46600                47800             58600
500     30600                41600             54500
700     23600                19200             35900
1000    11900                12700             27000
1500    11800                10600             24100
2000     4800                10500             23500
2500     4300                10400             23700
3000     3800                10800             26600
4000     3300                11200             22300
5000     2900                12600             19000

So now we get consistent performance outside of the cache, with around 60-70% of streaming bandwidth. I also added the bandwidth obtained from using an optimised transpose from the OpenBLAS library. We see that this is even slightly higher than the STREAM triad number. I suspect this is because they have implemented this copy with a non-temporal (or streaming) store.
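To illustrate what a non-temporal store looks like (this is my sketch of the general technique with SSE2 intrinsics, not OpenBLAS’s actual implementation), the destination is written directly to memory instead of through the cache:

#include <emmintrin.h> /* SSE2 intrinsics: _mm_loadu_pd, _mm_stream_pd, _mm_sfence */

/* Copy n doubles from src to dst using non-temporal (streaming) stores.
   Assumes dst is 16-byte aligned and n is even. */
void copy_streaming(const double * restrict src, double * restrict dst, int n)
{
  for (int i = 0; i < n; i += 2) {
    __m128d v = _mm_loadu_pd(&src[i]); /* load two doubles */
    _mm_stream_pd(&dst[i], v);         /* store, bypassing the cache */
  }
  _mm_sfence(); /* make the streaming stores visible before returning */
}

Because a streaming store does not first read the destination cache line (no write-allocate), each write costs one memory transfer rather than two, which is one plausible reason for beating the triad number.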

Matrix-matrix multiplication #

Now let’s look at the matrix multiplication problem. To analyse this problem, we will postulate a two-level memory system with a small fast memory, and a large slow memory. We start out with all the data in slow memory.
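For reference in what follows (written here with one particular loop order), the naive implementation of the update defined earlier is:

void matmult(int N, const double * restrict A,
             const double * restrict B, double * restrict C)
{
  /* C <- C + A*B for square N x N matrices stored row-major. */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i*N + j] += A[i*N + k] * B[k*N + j];
}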