# **Vector Processing**

# (aka, Single Instruction Multiple Data, or SIMD)



mjb@cs.oregonstate.edu



This work is licensed under a <u>Creative Commons</u> <u>Attribution-NonCommercial-NoDerivatives 4.0</u> <u>International License</u>





simd.vector.pptx mjb – March 26, 2025

#### Performance!

Many hardware architectures today, both CPU and GPU, allow you to perform arithmetic operations on multiple array elements simultaneously.

(Thus the label, "Single Instruction Multiple Data".)

We care about this because many problems, especially scientific and engineering, can be cast this way. Examples include convolution, Fourier transform, power spectrum, autocorrelation, etc.



| Year<br>Released | Name    | Width<br>(bits) | Width<br>(FP words) |
|------------------|---------|-----------------|---------------------|
| 1996             | MMX     | 64              | 2                   |
| 1999             | SSE     | 128             | 4                   |
| 2011             | AVX     | 256             | 8                   |
| 2013             | AVX-512 | 512             | <b>1</b> 16         |
|                  | -       |                 |                     |

Note: one complete cache line! Also note: a 4x4 transformation matrix!

### If you care:

Xeon Phi

- MMX stands for "MultiMedia Extensions"
- SSE stands for "Streaming SIMD Extensions"
- AVX stands for "Advanced Vector Extensions"



#### Sidebar: Matrix SIMD in Intel CPUs

Intel announced **AMX – the Advanced Matrix Extensions** in 2020 and built it into CPU chips starting in 2023. It consists of 2D registers, called "tiles". These can multiply 16x16 matrices of data types fp16, int16, and int8 with a single instruction.

This is billed as an "Al Acceleration Engine". I suspect this is much like the Tensor Cores on Nvidia GPUs.



University Computer Graphics

| Intol  | SSE |
|--------|-----|
| IIILEI | JJE |

|          | Year<br>Released | Name    | Width<br>(bits) | Width<br>(FP words) |
|----------|------------------|---------|-----------------|---------------------|
|          | 1996             | MMX     | 64              | 2                   |
| <b>•</b> | 1999             | SSE     | 128             | 4                   |
|          | 2011             | AVX     | 256             | 8                   |
|          | 2013             | AVX-512 | 512             | 16                  |

Intel and AMD CPU architectures support vectorization. The most well-known form is called Streaming SIMD Extension, or **SSE**. It allows four floating point operations to happen simultaneously.

Normally a *scalar* floating point multiplication instruction happens like this:



The SSE version of the multiplication instruction happens like this:







mulss r1, r0



mulps xmm1, xmm0



# This all sounds great! What's the catch?

The catch is that compilers haven't caught up to producing really efficient SIMD code. So, while there are great ways to express the desire for SIMD in code, you won't get the full potential speedup ... yet.

One way to get a better speedup is to use assembly language. That's what we are going to do. Don't worry – *you* wouldn't need to write it.

We are giving you two assembly functions:

- 1. SimdMul: C[ 0:len ] = A[ 0:len ] \* B[ 0:len ]
- 2. SimdMulSum: return (  $\sum A[0:len] * B[0:len]$ )



Warning – due to the nature of how different compilers and systems handle local variables, these two functions only work on *flip* and *rabbit* using gcc/g++, without any optimization !!!

## Getting at the full SIMD power until compilers catch up

```
void
SimdMul(float *a, float *b, float *c, int len)
{
    int limit = (len/SSE WIDTH) * SSE WIDTH;
       asm
         ".att syntax\n\t"
         "movq
                  -24(%rbp), %r8\n\t"
                                            // a
         "movq -32(%rbp), %rcx\n\t"
                                            // b
         "movq -40(%rbp), %rdx\n\t"
                                            // c
    );
    for(int i = 0; i < limit; i += SSE WIDTH)
            asm
              ".att syntax\n\t"
              "movups (%r8), %xmm0\n\t"
                                              // load the first sse register
              "movups (%rcx), %xmm1\n\t"
                                              // load the second sse register
              "mulps %xmm1, %xmm0\n\t"
                                              // do the multiply
              "movups %xmm0, (%rdx)\n\t"
                                              // store the result
              "addg $16, %r8\n\t"
              "addg $16, %rcx\n\t"
              "addg $16, %rdx\n\t"
         );
    }
    for( int i = limit; i < len; i++)
```



{

c[i] = a[i] \* b[i];

This only works on *flip* and *rabbit* using gcc/g++, without any optimization !!!

## Getting at the full SIMD power until compilers catch up

```
float
SimdMulSum( float *a, float *b, int len )
    float sum[4] = \{ 0., 0., 0., 0. \};
    int limit = ( len/SSE_WIDTH ) * SSE_WIDTH;
      _asm
         ".att syntax\n\t"
         "mova
                 -40(%rbp), %r8\n\t"
                                             // a
         "movq
                 -48(%rbp), %rcx\n\t"
                                             // b
         "leaq
                  -32(%rbp), %rdx\n\t"
                                             // &sum[0]
         "movups (%rdx), %xmm2\n\t"
                                             // 4 copies of 0. in xmm2
    );
    for(int i = 0; i < limit; i += SSE WIDTH)
           _asm
              ".att syntax\n\t"
              "movups (%r8), %xmm0\n\t"
                                              // load the first sse register
              "movups (%rcx), %xmm1\n\t"
                                              // load the second sse register
              "mulps %xmm1, %xmm0\n\t"
                                               // do the multiply
              "addps %xmm0, %xmm2\n\t"
                                               // do the add
              "addq $16, %r8\n\t"
             "addq $16, %rcx\n\t"
         );
```

This only works on *flip* and *rabbit* using gcc/g++, without any optimization !!!



asm

# **Array\*Array Multiplication Speed**





#### **Array \* Array**

#### Array \* Scalar



University Computer Graphics



## Requirements for a For-Loop to be SIMD'd

- If there are nested loops, the one to vectorize must be the inner one.
- There can be no jumps or branches. "Masked assignments" (an if-statement-controlled assignment) are OK, e.g.,

- The total number of iterations must be known at runtime when the loop starts
- There can be no inter-loop data dependencies such as:

$$a[i] = a[i-1] + 1.;$$

```
a[100] = a[99] + 1.; // this crosses an SSE boundary, so it is ok a[101] = a[100] + 1.; // this is within one SSE operation, so it is not OK a[101] = a[100] + 1.; // this is within one SSE operation, so it is not OK
```

It helps performance if the elements have contiguous memory addresses.



# **Avoiding Assembly Language: the Intel Intrinsics**

Intel has a mechanism to get at the SSE SIMD without resorting to assembly language. These are called *Intrinsics*.

| Intrinsic     | Meaning                                |
|---------------|----------------------------------------|
| m128          | Declaration for a 128 bit 4-float word |
| _mm_loadu_ps  | Load am128 word from memory            |
| _mm_storeu_ps | Store am128 word into memory           |
| _mm_mul_ps    | Multiply twom128 words                 |
| _mm_add_ps    | Add twom128 words                      |



## SimdMul using Intel Intrinsics

```
#include <xmmintrin.h>
#define SSE WIDTH
                             4
void
SimdMul(float *a, float *b, float *c, int len )
    int limit = ( len/SSE WIDTH ) * SSE WIDTH;
    register float *pa = a;
    register float *pb = b;
    register float *pc = c;
    for(int i = 0; i < limit; i += SSE WIDTH)
         mm storeu ps(pc, mm mul ps( mm loadu ps(pa), mm loadu ps(pb));
         pa += SSE WIDTH;
         pb += SSE WIDTH;
         pc += SSE WIDTH;
    for( int i = limit; i < len; i++)
         c[i] = a[i] * b[i];
```



## SimdMulSum using Intel Intrinsics

```
float
   SimdMulSum( float *a, float *b, int len )
        float sum[4] = \{0., 0., 0., 0., 0.\};
        int limit = (len/SSE WIDTH) * SSE WIDTH;
        register float *pa = a;
        register float *pb = b;
          m128 ss = mm loadu ps(&sum[0]);
        for(int i = 0; i < limit; i += SSE WIDTH)
             ss = _mm_add_ps( ss, _mm_mul_ps( _mm_loadu_ps( pa ), _mm_loadu_ps( pb ) ) );
             pa += SSE WIDTH;
             pb += SSE WIDTH;
        mm storeu ps(&sum[0], ss);
        for( int i = limit; i < len; i++)
             sum[0] += a[ i ] * b[ i ];
        return sum[0] + sum[1] + sum[2] + sum[3];
Or
```

## **Intel Intrinsics**





# Each Core Has Its Very Own SIMD Unit! That Means You Can Combine SIMD and Multicore

```
#define NUM ELEMENTS PER CORE
                                           (ARRAYSIZE / NUMT)
omp set num threads( NUMT );
                                     The variable first is the first array element that
double maxMegaMultsPerSecond = 0.;
                                     thisThread is in charge of.
double time0 = omp get wtime();
                                     &A[first] is the memory address of thisThread's first
#pragma omp parallel
                                     element.
    int thisThread = omp_get_thread_num();
    int first = thisThread * NUM_ELEMENTS_PER_CORE;
    SimdMul( &A[first], &B[first], &C[first], NUM ELEMENTS PER CORE );
double time1 = omp get wtime();
double megaMultsPerSecond = (double)ARRAYSIZE / (time1 - time0) / 1000000.;
```



#### **Notes:**

- Remember that #pragma omp parallel creates a thread team and that all threads execute everything in the curly braces.
- The variable **thisThread** is the thread number of the thread who is executing this code right now. There will eventually be NUMT threads who get to execute this code. Thus, all the instances of **thisThread** will be between 0 and NUMT-1.
- The variable first is the first array element number that thisThread will execute.
- Starting the SIMD multiplications at &A[first], &B[first], &C[first] gives each thread its very own set of contiguous array elements to work on. The SimdMul function depends on this.



# **Combining SIMD with Multicore**



Computer Graphics

"cores + SIMD" = as the code looks on last two slides

Prefetching is used to place a cache line in memory before it is to be used, thus hiding the latency of fetching from off-chip memory.

There are two key issues here:

- 1. Issuing the prefetch at the right time
- 2. Issuing the prefetch at the right distance

#### The right time:

If the prefetch is issued too late, then the memory values won't be back when the program wants to use them, and the processor has to wait anyway.

If the prefetch is issued too early, then there is a chance that the prefetched values could be evicted from cache by another need before they can be used.

#### The right distance:

The "prefetch distance" is how far ahead the prefetch memory is than the memory we are using right now.

Too far, and the values sit in cache for too long, and possibly get evicted.

Too near, and the program is ready for the values before they have arrived.

Computer Grapnics

#### **Array Multiplication**

Length of Arrays (NUM): 1,000,000 Length per SIMD call (ONETIME): 256

```
for( int i = 0; i < NUM; i += ONETIME )
{
    __builtin_prefetch ( &A[i+PD], WILL_READ_ONLY, LOCALITY_LOW );
    __builtin_prefetch ( &B[i+PD], WILL_READ_ONLY, LOCALITY_LOW );
    __builtin_prefetch ( &C[i+PD], WILL_READ_AND_WRITE, LOCALITY_LOW );
    SimdMul( A, B, C, ONETIME );
}</pre>
```



# The Effects of Prefetching on SIMD Computations





- SIMD is an important way to achieve array-operation speed-ups on a CPU
- For now, you might have to write in assembly language to get to all of it
- I suspect that #pragma omp simd will catch up
- I suspect that Intel Intrinsics will catch up
- Prefetching can really help SIMD

