Caching Issues in Multicore Performance

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
This path is relatively slow, forcing the CPU to wait for up to 200 clock cycles just to do a store to, or a load from, memory.

Depending on your CPU’s ability to process instructions out-of-order, it might go idle during this time.

This is a huge performance hit!
Solution: Hierarchical Memory Systems, or “Cache”

The solution is to add intermediate memory systems. The one closest to the ALU (L1) is small and fast. The memory systems get slower and larger as they get farther away from the ALU.

L3 cache also exists on some high-end CPU chips.
Cache and Memory are Named by “Distance Level” from the ALU

L3 cache has been added to many CPU chips as well
## Storage Level Characteristics

<table>
<thead>
<tr>
<th></th>
<th>L1</th>
<th>L2</th>
<th>L3</th>
<th>Memory</th>
<th>Disk</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Type of Storage</strong></td>
<td>On-chip</td>
<td>On-chip</td>
<td>On-chip</td>
<td>Off-chip</td>
<td>Disk</td>
</tr>
<tr>
<td><strong>Typical Size</strong></td>
<td>100 KB</td>
<td>8 MB</td>
<td>32 MB</td>
<td>32 GB</td>
<td>Many GBs</td>
</tr>
<tr>
<td><strong>Typical Access Time (ns)</strong></td>
<td>.25</td>
<td>.50</td>
<td>10.8</td>
<td>50</td>
<td>5,000,000</td>
</tr>
<tr>
<td><strong>Scaled Access Time</strong></td>
<td>1 second</td>
<td>2 seconds</td>
<td>43 seconds</td>
<td>3.3 minutes</td>
<td>231 days</td>
</tr>
<tr>
<td><strong>Managed by</strong></td>
<td>Hardware</td>
<td>Hardware</td>
<td>Hardware</td>
<td>OS</td>
<td>OS</td>
</tr>
</tbody>
</table>


Usually there are two L1 caches – one for Instructions and one for Data. You will often see this referred to in data sheets as: “L1 cache: 32KB + 32KB” or “I and D cache”
Cache Hits and Misses

When the CPU asks for a value from memory, and that value is already in the cache, it can get it quickly. This is called a **cache hit**.

When the CPU asks for a value from memory, and that value is not already in the cache, it will have to go off the chip to get it. This is called a **cache miss**.

While cache might be multiple kilo- or megabytes, the bytes are transferred in much smaller quantities, each called a **cache line**. The size of a cache line is typically just **64 bytes**.

Performance programming should strive to avoid as many cache misses as possible. That’s why it is very helpful to know the cache structure of your CPU.
Successful use of the cache depends on **Spatial Coherence**:

“*If you need one memory address’s contents now, then you will probably also need the contents of some of the memory locations around it soon.*”

Successful use of the cache depends on **Temporal Coherence**:

“*If you need one memory address’s contents now, then you will probably also need its contents again soon.*”

If these assumptions are true, then you will generate a lot of cache hits.

If these assumptions are not true, then you will generate a lot of cache misses, and you end up re-loading the cache a lot.
How Bad Is It? -- Demonstrating the Cache-Miss Problem

C and C++ store 2D arrays a row-at-a-time, like this, $A[i][j]$:

Caching issues arise if we access elements of $A$ in a different order. For example, consider accessing the elements in the order $A[0][0], A[0][1], ..., A[0][9], A[1][0], ..., A[N-1][9]$. If this order places data into different cache lines, each access may cause a cache miss.

For large arrays, would it be better to add the elements by row, or by column? Which will avoid the most cache misses?

Example:

```cpp
float f = Array[i][j];
float f = Array[j][i];
```

Sequential memory order
Jump-around-in-memory order
Demonstrating the Cache-Miss Problem – Across Rows

```c
#define NUM 10000
float Array[NUM][NUM];
double MyTimer();

int main( int argc, char *argv[] )
{
    float sum = 0.;
    double start = MyTimer();
    for( int i = 0; i < NUM; i++ )
    {
        for( int j = 0; j < NUM; j++ )
        {
            sum += Array[i][j]; // access across a row
        }
    }
    double finish = MyTimer();
    double row_secs = finish – start;
```
Demonstrating the Cache-Miss Problem – Down Columns

```c
float sum = 0.;
double start  = MyTimer( );
for( int i = 0; i < NUM; i++ )
{
    for( int j = 0; j < NUM; j++ )
    {
        sum += Array[ j ][ i ]; // access down a column
    }
}
double finish = MyTimer( );
double col_secs = finish - start;
```
Demonstrating the Cache-Miss Problem

Time, in seconds, to compute the array sums, based on by-row versus by-column order:

Dimension (NUM)

( Total array size = NUMxNUM )
Good Object-Oriented Programming Style can sometimes be Inconsistent with Good Cache Use:

```cpp
class xyz
{
  public:
    float x, y, z;
    xyz *next;
    xyz( );
    static xyz *Head = NULL;
};

xyz::xyz( )
{
  xyz * n = new xyz;
  n->next = Head;
  Head = n;
};
```

This is good OO style – it encapsulates and isolates the data for this class. Once you have created a linked list whose elements are all over memory, is it the best use of the cache?
But, Here Is a Compromise:

It might be better to create a large array of xyz structures and then have the constructor method pull new ones from that list. That would keep many of the elements close together while preserving the flexibility of the linked list.

When you need more, allocate another large array and link to it.
```c
#include <cstdio>
define NUMALLOC 1024

struct node
{
    float data;
    bool canBeDeleted;
    struct node *next;
};
struct node *Head = NULL;

struct node *
GetNewNode( )
{
    if( Head == NULL )
    {
        struct node *array = new struct node[NUMALLOC];
        Head = &array[0];
        for( int i = 0; i < NUMALLOC - 1; i++ )
        {
            array[i].canBeDeleted = false;
            array[i].next = &array[i+1];
        }
        array[NUMALLOC-1].next = NULL;
    }
    struct node *p = Head;
    Head = Head->next;
    return p;
}

void
DeleteNode( struct node *n )
{
    n->canBeDeleted = true;
}

But, Here Is a Compromise:

Remember: in this scheme, you cannot delete an individual node because it was allocated as part of an array. The best you can do is track which nodes can be deleted and then when all of an array's nodes are flagged, delete the whole array.
```
Why Can We Get This Kind of Performance Decrease as Data Sets Get Larger?

We are violating Temporal Coherence
We Can Help the Temporal Problem with Pre-Fetching

We will cover this in further detail when we discuss SIMD
An Example of Where Cache Coherence Really Matters: Matrix Multiply

The usual approach is multiplying the entire A row * entire B column
This is equivalent to computing a single dot product

\[
\sum_{k=0}^{\text{SIZE}} A[i][k] \times B[k][j] \rightarrow C[i][j]
\]

**Problem:** Column j of the B matrix is not doing a unit stride
Scalable Universal Matrix Multiply Algorithm (SUMMA)

Entire A row * one element of B row

Equivalent to computing one item in many separate dot products

\[
\begin{align*}
\text{for}(i = 0; i < \text{SIZE}; i++) \\
\quad \text{for}(k = 0; k < \text{SIZE}; k++) \\
\quad \quad \text{for}(j = 0; j < \text{SIZE}; j++) \\
A[i][k] \ast B[k][j] \rightarrow C[i][j]
\end{align*}
\]
Performance vs. Number of Threads (MegaMultiplies / Sec)

- i-j-k
- i-k-j
- j-k-i
- j-i-k
- k-i-j
- k-j-i
N-way Set Associative – a cache line from a particular block of memory can appear in a limited number of places in cache. Each “limited place” is called a set of cache lines. A set contains \( N \) cache lines.

The memory block can appear in any cache line in its set.

Most Caches today are N-way Set Associative

- \( N \) is typically 4 for L1 and 8 or 16 for L2

This would be called “2-way”
How do you figure out where in cache a specific memory address will live?

Total #cache lines ÷ N-way → #cache sets

Memory address in bytes ÷ 64 → Cache Line Block in Memory

modulo #cache sets % 4

Pick Least Recently Used Cache Line in that Cache Set

Cache Set #

Offset in the Cache Line

Cache Line #

Where to find a certain memory address in the cache
A Specific Example with Numbers

Memory address = 1234 bytes

Cache Line Block in Memory = $\frac{1234}{64} = 19$  
Because there are 64 bytes in a cache line

Cache Set # = $19 \% 4 = 3$  
Because there are 4 sets to rotate through

Offset in the Cache Line = $1234 - 19 \times 64 = 18$  
Because there are 18 bytes left after filling 19 complete cache lines

It lives in one of these 2 locations in cache
How Different Cores’ Cache Lines Keep Track of Each Other

Each core has its own separate L2 cache, but a write by one can impact the state of the others.

For example, if one core writes a value into one of its own cache lines, any other core using a copy of that same cache line can no longer count on its values being up-to-date. In order to regain that confidence, the core that wrote must flush that cache line back to memory and the other core must then reload its copy of that cache line.

To maintain this organization, each core’s L2 cache has 4 states (MESI):

1. Modified
2. Exclusive
3. Shared
4. Invalid
A Simplified View of How MESI Works

1. Core A reads a value. Those values are brought into its cache. That cache line is now tagged **Exclusive**.

2. Core B reads a value from the same area of memory. Those values are brought into its cache, and now both cache lines are re-tagged **Shared**.

3. If Core B writes into that value. Its cache line is re-tagged **Modified** and Core A’s cache line is re-tagged **Invalid**.

4. Core A tries to read a value from that same part of memory. But its cache line is tagged **Invalid**. So, *Core B’s cache line is flushed back to memory and then Core A’s cache line is re-loaded from memory*. Both cache lines are now tagged **Shared**.

<table>
<thead>
<tr>
<th>Step</th>
<th>Cache Line A</th>
<th>Cache Line B</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Exclusive</td>
<td>-----</td>
</tr>
<tr>
<td>2</td>
<td>Shared</td>
<td>Shared</td>
</tr>
<tr>
<td>3</td>
<td>Invalid</td>
<td>Modified</td>
</tr>
<tr>
<td>4</td>
<td>Shared</td>
<td>Shared</td>
</tr>
</tbody>
</table>

This is a huge performance hit, and is referred to as **False Sharing**.

Note that False Sharing doesn’t create incorrect results – it just creates a performance hit. If anything, False Sharing *prevents* getting incorrect results.
A Simplified View of How MESI Works – Core A’s State Diagram

Start: Core A reads a value into its cache

Core B reads a value from this same area of memory into its cache – the two cores’ cache lines now point to the same area of memory

Core A writes a value into its cache line, invalidating B’s cache line

A: Exclusive

Core A then writes a value into its cache line

A: Modified

A continues to read values from its cache line

Core B writes a value into its cache line

A: Shared

Core A tries reading a value from its cache line -- B’s cache line now has to be written back to memory and A’s cache line now has to be reloaded

A: Invalid

Note: A’s cache line being labeled Invalid doesn’t affect Core A at all right now – not until Core A tries to use that cache line the next time.
struct s
{
    float value;
}   Array[4];

omp_set_num_threads(  4 );

#pragma omp parallel for
    for( int i = 0; i < 4; i++ )
    {
        for( int j = 0; j < SomeBigNumber; j++ )
        {
            Array[ i ].value = Array[ i ].value + (float)rand( );
        }
    }

Some unpredictable function so the compiler
doesn't try to optimize the j-for-loop away.
#include <stdlib.h>

```c
struct s {
    float value;
    int pad[NUMPAD];
} Array[4];
```

```c
const int SomeBigNumber = 100000000; // keep less than 2B

omp_set_num_threads(4);

#pragma omp parallel for
for (int i = 0; i < 4; i++)
{
    for (int j = 0; j < SomeBigNumber; j++)
    {
        Array[i].value = Array[i].value + (float)rand();
    }
}
```

This works because successive Array elements are forced onto different cache lines, so less (or no) cache line conflicts exist.
Why do these curves look this way?
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 0
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 1
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 2
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 3
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 4
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 5
False Sharing – Fix #1
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 6

![Diagram of cache lines showing false sharing]
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 7

---

[Diagram showing false sharing with NUMPAD = 7]
False Sharing – Fix #1
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 8
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 9
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 10

Diagram showing the spread of data to multiple cache lines, with green highlighted areas representing data placement.
False Sharing – Fix #1
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 11
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 12
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 13
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 14
False Sharing – the Effect of Spreading Your Data to Multiple Cache Lines

NUMPAD = 15
False Sharing – Fix #1
False Sharing – Fix #2: Using local (private) variables

OK, wasting memory to put your data on different cache lines seems a little silly (even though it works well). Can we do something else?

Remember our discussion in the OpenMP section about how stack space is allocated for different threads?

If we use local variables, instead of contiguous array locations, that will spread our writes out in memory, and to different cache lines.
This works because a localized temporary variable is created in each core's stack area, so little or no cache line conflict exists.
False Sharing – Fix #2 vs. Fix #1

Note that Fix #2 with \{1, 2, 4\} threads gives the same performance as NUMPAD= \{0,7,15\}
What if you are malloc’ing, and want to be sure your data structure starts on a cache line boundary?

Knowing that cache lines start on fixed 64-byte boundaries lets you do this. Consider a memory address. The top N-6 bits tell you what cache line number this address is a part of. The bottom 6 bits tell you what offset that address has within that cache line. So, for example, on a 32-bit memory system:

<table>
<thead>
<tr>
<th>Cache line number</th>
<th>Offset in that cache line</th>
</tr>
</thead>
<tbody>
<tr>
<td>32 - 6 = 26 bits</td>
<td>6 bits: 0-63</td>
</tr>
</tbody>
</table>

For example $101010_2 = 42$

So, if you see a memory address whose bottom 6 bits are 000000, then you know that that memory location begins on a cache line boundary.
Let's say that you have a structure and you want to malloc an ARRAYSIZE array of them. Normally, you would do this:

```c
struct xyzw *p = (struct xyzw *) malloc( (ARRAYSIZE)*sizeof(struct xyzw) );
struct xyzw *Array = &p[0];
...
Array[ i ].x = 10. ;
```

If you wanted to make sure that array of structures started on a cache line boundary, you would do this:

```c
unsigned char *p = (unsigned char *) malloc( 64 + (ARRAYSIZE)*sizeof(struct xyzw) );
unsigned int offset = (unsigned int)p & 0x3f; // 0x3f = bottom 6 bits are all 1's
struct xyzw *Array;
If( offset == 0 )
    Array = p;
Else
    Array = (struct xyzw *) &p[64-offset];
...
Array[ i ].x = 10. ;
```

Remember that when you want to free this malloc'ed space, be sure to say:

```c
free( p );  // not:
free( Array );
```