

## **MISC**

All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA.

# **OUTLINE**

CPU versus GPU

Why are they different?

CUDA

Terminology

GPU

Pascal

Opportunities

What skills you should know




# CPU

#### Priorities

Latency

Event driven programming

Managing complex control flow

Branch prediction & speculative execution

Reduction of memory access

Large caches

Small collection of active threads

Results in less space devoted to math






#### **GPU**

**Priorities**

Parallelization

Workloads related to games are massively parallel

"4K" = 3840x2160 @ ~60fps ~= 500M threads per sec!

"SOL" of VR for 2018 -> 4K @ 120Hz ~= 1B threads per sec!

Latency Trade-off

Advanced batching and scheduling of data

Computation

Less complex control flow, more math

Most space devoted to math

With a smart thread scheduler to keep it busy

**FUTURE**

CPU

Simpler, more power-efficient (IoT)

Integration of components == cheaper

GPU

Perf/W

Low-latency (VR requirement)

HPC "co-processors"

Power efficiency! Xeon Phi, Tesla, FirePro.

Fast interconnects (NV-Link, future PCI-E)

"Heterogeneous computing"


#### WHY CARE?

Understand what architectures are best for the problem you need to solve

FEM/Graphics -> Co-proc/GPU

GUI/OS -> CPU

DNN -> GPU/co-proc/TPU?

It's a spectrum; don't force the problem onto a poorly suited architecture

Example: Deep Neural Networks


#### **CUDA**

Compute Unified Device Architecture

NVIDIA's parallel language

OpenCL is similar

API/language focuses on compute features

LD/ST to generic buffers

Support for advanced scheduling features




#### **CUDA**

## Constructs

Thread

Per-thread local memory

Warp = 32 threads

Share through registers (warp shuffle) or shared memory

Block/CTA = Group of warps

Shared mem per CTA: intra-CTA sync/communication

Grid = Group of CTAs that share the same kernel

Global mem for cross-CTA communication








**CUDA Compute Integration**

cuFFT

```c
/* Fragment: NX, NY, NZ are the FFT dimensions, defined elsewhere. */
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);

/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);

/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);

/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
```

OpenACC

```c
#include <stdio.h>
#define N 1000000

int main(void) {
  double pi = 0.0;
  long i;
  /* Older PGI Accelerator spelling; current OpenACC would write
     "#pragma acc parallel loop reduction(+:pi)". */
  #pragma acc region for
  for (i=0; i<N; i++) {
    double t = (double)((i+0.5)/N);
    pi += 4.0/(1.0+t*t);
  }
  printf("pi=%16.15f\n", pi/N);
  return 0;
}
```

#### WHY CARE?

Constructs reflect:

Communication boundaries

Read/write coherency

Warps

Low level primitive HW operates on

Blocks/CTAs are made up of integer # of warps (48 threads = 2 warps, not 1.5)

AMD/Intel HPC chips have similar constructs

Use existing libraries

Don't reinvent the wheel, better use of your time

#### **GPU ARCHITECTURE**

Evolution

Fixed function (GeForce 256, Radeon R100)

Hardware T&L, Render Output Units/ Raster Operations Pipeline (ROPs)

Programmability (GeForce 3 -> 7)

Vertex, Pixel shaders

Unification (GeForce 8)

Generic "compute" units

Generalization (Fermi, Maxwell, AMD GCN)

Queuing, batching, advanced scheduling

Compute focused



















#### **PASCAL**

Utilization

Latency

Memory access: ~10 cycles -> 800 cycles

"CUDA cores" take a few cycles to complete

# of warps on an SM is tied to usage of the register file (RF)

Complex programs with a high # of registers fill up the RF

This decreases the # of warps available on the SM for issue

Balance work-load

Tex versus FP versus SFU versus LD/ST

Avoid "peak" utilization of any one unit

#### **PERF #'S**

|                     | PRICE (\$)                   | TFLOPS (SP) | TFLOPS (DP) | GFLOPS/W (SP) | GFLOPS/W (DP) |
|---------------------|------------------------------|-------------|-------------|---------------|---------------|
| GTX 1080            | 599.00                       | 8.23        | 0.257       | 45.6          | 1.43          |
| NVIDIA GV100        | ??? (Tesla P100 was 5599.00) | 15          | 7.5         | 50            | 25            |
| Intel Xeon Phi 7290 | 6254.00                      | 6           | 3.5         | 24.5          | 14.3          |
| Proj. Scorpio       | ???                          | 6           | ???         | ???           | ???           |
| PS4 Pro             | 399.99                       | 4.12        | ???         | ???           | ???           |

#### WHAT'S GOOD ENOUGH?

GPU demands are never ending...

Physically based materials for lighting

Indirect illumination

Physically based materials for interaction

4k? 5k? 8k?

VR doubles required work... needs to hit 120fps+

API demands are never ending...

New low-level APIs (Vulkan, DX12) create their own problems...

## **OPPORTUNITIES**

NVIDIA/AMD/Intel/Google/Qualcomm/Apple

Very Strong C/C++ fundamentals

Very Strong OS/Algorithms fundamentals

Very Strong Graphics fundamentals

Parallel Programming w/ emphasis on Performance

Compiler experience a plus

Other industries

Games Industry

Biochemistry simulations

Weather/climate modeling

CAD

#### **QUESTIONS?**

cschultz@nvidia.com

http://www.nvidia.com/object/careers.html



