cs8803 GPU Hardware and Software

Parallel Programming

CPU & GPU

  • 3D graphics accelerator
  • massive parallel processors
  • ML accelerators
  • High performance processors

CPU Pipeline

Superscalar: multiple issue lanes

Superscalar processor: Each pipeline stage handles more than one instruction

Instruction per cycle (IPC) > 1

Instruction-level parallelism (ILP) measures how many independent instructions can be executed at the same time.

An out-of-order processor helps exploit ILP → it finds more independent instructions to execute.

Improving ILP has been one of the main focuses in CPU designs.

Multi-threading: multiple HW threads with a PC and a register file each

Amdahl's Law

P = parallel fraction (1-S)

N = number of processors (2, 4, 8, …)

S = serial fraction

When the serial fraction is small (S close to 0), the speedup is approximately proportional to N.
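The speedup formula itself (stated here for completeness) is: Speedup = 1 / (S + P/N) = 1 / (S + (1 - S)/N). For example, with S = 0.1 and N = 8, Speedup = 1 / (0.1 + 0.9/8) ≈ 4.7x; as N grows, the speedup is bounded by 1/S = 10x.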

Parallel Programming

Flynn’s Classical Taxonomy

  • SISD: Single Instruction, Single Data
  • SIMD/SIMT: Single Instruction, Multiple Data / Single Instruction, Multiple Threads
  • MISD: Multiple Instruction, Single Data
  • MIMD: Multiple Instruction, Multiple Data

SPMD Programming

SPMD is the typical GPU programming pattern, and the hardware's execution model is SIMT.

  • Step 1: Discover Concurrency

  • Step 2: Structure the Algorithm

  • Step 3: Implementation

  • Step 4: Execution and Optimization

Parallel Programming Patterns

1. Master/Worker Pattern

2. SPMD Pattern (Single Program, Multiple Data)

3. Loop Parallelism Pattern

4. Fork/Join Pattern

5. Pipeline Pattern

Programming with Shared Memory

All processors can access the same memory

Proc 2 can observe updates made by Proc 1 by simply reading values from shared memory.

Programming with Distributed Memory

In distributed memory systems, each processor has its own memory space.

To access data in another memory space, processors send messages.

For example, processor 2 requests data from processors 1 and 3 via messages.

OpenMP Programming

  • Parallelization

  • Specifying thread counts

  • Scheduling

  • Data Sharing

  • Synchronization

Mutex Operation

  • Mutex (mutual exclusion) ensures that only one thread at a time can execute a critical section of code.

  • Lock: acquire the mutex to enter the critical section

  • Unlock: release the mutex after finishing the critical section; other threads are then allowed to enter the critical section
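A minimal sketch of this lock/unlock flow using the OpenMP lock API (the shared counter and the loop count of 1000 are made-up illustration values):

#include <omp.h>
#include <stdio.h>

int main() {
    omp_lock_t lock;
    omp_init_lock(&lock);            // create the mutex
    int counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000; ++i) {
        omp_set_lock(&lock);         // lock: acquire the mutex to enter the critical section
        counter++;                   // only one thread updates the counter at a time
        omp_unset_lock(&lock);       // unlock: release the mutex so other threads can enter
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);
    return 0;
}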

OpenMP Example

#include <iostream>
#include <omp.h>

int main() {
    const int size = 1000;                 // Size of the array
    int data[size];
    for (int i = 0; i < size; ++i) {       // Initialize the array
        data[i] = 1;
    }
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < size; ++i) {
        sum += data[i];
    }
    std::cout << "Sum of array elements: " << sum << std::endl;
    return 0;
}

Compiler directives

Works for C/C++/Fortran (used widely in HPC applications)

Compiler replaces directives with calls to runtime library

Library function handles thread create/join

#pragma omp directive [ clause [ clause ] … ]

  • Directives are the main OpenMP construct: #pragma omp parallel for

  • Clauses provide additional information: reduction(+:sum)

Reduction is commonly used.

Number of threads

  • by environment variable: OMP_NUM_THREADS
  • the omp_set_num_threads() function within the code.
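A small illustration of both options (assuming a 4-thread run is desired):

// Option 1: environment variable, set before running the program
//   $ OMP_NUM_THREADS=4 ./a.out

// Option 2: runtime call inside the code
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(4);                       // request 4 threads
    #pragma omp parallel
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}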

Thread Synchronization

  • Barrier: #pragma omp barrier
    • A synchronization point that all participating threads must reach before any of them continues
    • The "green" work (after the barrier) will not start until all of the "blue" work (before the barrier) is finished.
    • E.g., sort first, then update
  • Critical section: #pragma omp critical [(name)] { // Critical section code }
    • A critical section is executed by only one thread at a time.
  • Atomic: #pragma omp atomic
    • Ensures that an operation is performed atomically
    • Atomically → either the work is done completely or not at all; no partial work is visible
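A compact sketch tying these constructs together (the work itself is represented only by comments; total_count is an illustrative shared counter, not from the lecture):

#include <omp.h>
#include <stdio.h>

int total_count = 0;          // illustrative shared counter

int main() {
    #pragma omp parallel
    {
        // "blue" work: e.g., each thread sorts its own chunk (omitted)

        #pragma omp barrier   // all threads wait here before the "green" work starts

        // "green" work: e.g., update step that depends on every chunk being sorted

        #pragma omp critical
        {
            printf("thread %d entered the critical section\n", omp_get_thread_num());
        }

        #pragma omp atomic    // atomic update: either fully done or not at all
        total_count += 1;
    }
    printf("total_count = %d\n", total_count);
    return 0;
}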

Parallel Sections

#pragma omp parallel sections
{
    #pragma omp section
    {
        // work-1
    }
    #pragma omp section
    {
        // work-2
    }
}
#pragma omp parallel
{
    #pragma omp for ordered
    for (int i = 0; i < 5; i++) {
        #pragma omp ordered
        {
            // This block of code will be executed in order
            printf("Hello thread %d is doing iteration %d\n", omp_get_thread_num(), i);
        }
    }
}
#pragma omp parallel
{
    #pragma omp single
    {
        // This block of code will be executed by only one thread
        printf("This is a single thread task.\n");
    }
    // Other parallel work...
}

MPI Programming

MPI stands for message passing interface, a communication model for parallel computing. Example: Two processes want to communicate with each other.

  • Process 0 sends an integer value to process 1 using MPI_Send()
  • Process 1 receives the value sent by process 0 using MPI_Recv()

Broadcasting

  • Introducing MPI_Bcast(), a method to broadcast data from one process to all other processes.
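A minimal sketch of the two-process exchange and the broadcast described above (error handling is omitted; the value 42 and the message tag 0 are arbitrary illustration choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      // process 0 sends to process 1
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              // process 1 receives from process 0
        printf("rank 1 received %d\n", value);
    }

    // Broadcast: process 0 sends value to all other processes
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}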

GPU Programming

GPU Architecture

  • Each core can execute multiple threads.
  • Stream processors are the ALU units (SIMT lanes, "CUDA cores") on GPUs
  • CPU cores correspond to "streaming multiprocessors" (SMs) in NVIDIA terminology, also called SIMT multiprocessors
  • Core ≠ stream processor


GPU Pipeline

  • Fetch
    • One instruction for each warp
    • Multiple PC registers exist to support multi-threaded architecture
    • Round-robin scheduler
    • Greedy scheduler: switch warps on I-cache miss or branch
  • Decode
  • Register read
  • Scheduler (scoreboarding)
  • Execution (SIMT)
  • Writeback


Execution Unit: Warp/Wave-front

  • Warp/wave-front is the basic unit of execution
  • A group of threads (e.g. 32 threads for the Tesla GPU architecture)

Programmable GPU Architecture Evolution

  • Cache hierarchies (L1, L2 etc.)

  • Extension from FP32 to FP64 to support HPC applications

  • Integration of atomic and fast integer operations to support more diverse workloads

  • Moving from one PC per warp to one PC per thread

  • Utilization of HBM memory (High bandwidth memory)

  • Addition of smaller floating-point formats (FP16, later FP8 and others) to support ML workloads

  • Incorporation of tensor cores to support ML workloads

  • Integration of the transformer engine to support transformer ML workloads

CUDA Code Example: Vector Add

__global__ void vectorAdd(const float *A, const float *B, float *C,
                          int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i] + 0.0f;
    }
}
int main(void) {
    cudaError_t err = cudaSuccess;
    // (host vectors h_A, h_B, h_C, size, and numElements are set up earlier; omitted here)

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);

    // Copy the host input vectors A and B in host memory to the device input
    // vectors in device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
}
  1. Host code is executed on CPUs

  2. Kernel code is invoked with <<< … >>>

  3. Kernel code is executed on GPUs

  • GPU kernel code is Single Program Multiple Data
  • All threads execute the same program.
  • There is no execution order among threads.
  • But each thread needs to operate on different data.

threadIdx.x

  • Even though each thread executes the same program

  • Now each thread has a unique identifier (each thread has a built-in variable that represents its x-axis coordinate).

Execution Hierarchy

  • A group of threads forms a block.

  • CUDA block: a group of threads that are executed concurrently.

  • Data is divided by block.

  • For now, let’s just assume that each block is executed on each GPU SM.

  • No ordering among CUDA block execution

__global__ void vectorAdd(/* arguments should come here */) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

Shared Memory

  • Scratchpad memory

  • Software controlled memory space

  • Use __shared__

  • On chip storage → faster access compared to global memory

  • Accessible only within a CUDA block (later GPUs allow different policy)
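A minimal sketch of the two ways to declare shared memory (the kernel names and the 256-element tile size are illustrative, not from the lecture):

// Static shared memory: size known at compile time
__global__ void kernelStatic(float *in) {
    __shared__ float tile[256];                  // one tile per CUDA block, on-chip
    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                             // make the tile visible to all threads in the block
    // ... use tile[] ...
}

// Dynamic shared memory: size passed as the 3rd kernel-launch parameter
__global__ void kernelDynamic(float *in) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // ... use tile[] ...
}

// Launch example: kernelDynamic<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in);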

Thread Synchronizations

  • Wait for all threads to complete tasks
__global__ void Kernel() {
    // work-1
    __syncthreads();   // synchronization
    // work-2
}
  • Typical usage: __syncthreads() within a CUDA block.

  • Interblock synchronization: achieved through different kernel launches

Typical Usage with Thread Synchronizations

  • Typical computation flow

  • Load → compute → store

  • Load to the shared memory from global memory

  • Compute with the data in the shared memory

  • Store the results into global memory

  • BSP programming model (bulk synchronous parallel)

__global__ void Kernel() {
    // Load to shared memory
    __syncthreads();   // synchronization
    // compute
    __syncthreads();   // synchronization
    // store
}

Kernel Launch

  • Launch a kernel from the host code using the <<< >>> syntax.

  • Total number of threads in a kernel = numBlocks x threadsPerBlock

  • A grid consists of multiple blocks; a block consists of multiple threads.

  • Multiple different tasks → multiple kernels

  • Each kernel (sortKernel, addKernel, storeKernel) handles different tasks sequentially.

sortKernel<<<1, 1>>>(d_data, dataSize);
addKernel<<<gridDim, blockDim>>>(d_data, d_data, d_result, dataSize);
storeKernel<<<1, 1>>>(d_result, dataSize);
Memory          | Scope                  | Note
Shared Memory   | Within one CUDA block  |
Global Memory   | All blocks in a kernel |
Local Memory    | Within a CUDA thread   |
Constant Memory | All blocks in a kernel | Read only (3D graphics)
Texture Memory  | All blocks in a kernel | Read only (3D graphics)
  • Information sharing is limited across the "CUDA execution" hierarchy:

  • Data in the shared memory is visible only within one CUDA block.

  • → data in the shared memory can stay in only one SM

  • → sharing data in the shared memory of one CUDA block with another block needs explicit communication (later CUDA versions support thread block clusters to allow such sharing)

Occupancy

  • How many CUDA blocks can be executed on one SM is decided by following parameters

  • # of registers, shared memory, # of threads

  • Exact H/W configurations are varied by GPU microarchitecture

  • E.g., HW: each SM can execute 256 threads, 64K registers, 32 KB shared memory

  • SW: one CUDA block: 32 threads and 2KB shared memory and each thread has 64 registers

    • Constraint by running threads: 256 / 32 = 8 CUDA blocks
    • Constraint by register file size: (64 × 1024) / (64 × 32) = 32 CUDA blocks
    • Constraint by shared memory: 32 KB / 2 KB = 16 CUDA blocks
  • Final answer: 8 CUDA blocks per SM
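The same kind of calculation can also be queried from the CUDA runtime; a hedged sketch, assuming the vectorAdd kernel from earlier and the 32-thread block of this example (the reported number depends on the actual GPU and the kernel's register/shared-memory usage):

int maxBlocksPerSM = 0;
// How many CUDA blocks of 32 threads (with 0 B of dynamic shared memory) of
// vectorAdd can be resident on one SM?
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vectorAdd, 32, 0);
printf("CUDA blocks per SM: %d\n", maxBlocksPerSM);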

Number of Threads Per CUDA Block

  • Host sets the number of threads per CUDA block. E.g., threadsPerBlock

  • Set up at the kernel launch time

  • # of registers per thread

    • The compiler determines how many architectural registers each thread needs (covered in a later part of the course)
  • In typical CPU ISAs, the number of registers per thread is fixed (e.g., 32); in CUDA it is flexible

  • Shared memory size is determined from the code

Why Do We Care about Occupancy?

  • E.g., __shared__ int sharedMemory[1000]; → 4 B × 1000 = 4000 B of shared memory per block

  • Higher Occupancy means more parallelism.

  • Utilization: Utilizing more hardware resources generally leads to better performance.

  • Exceptional cases will be studied in later lectures.

Managing Global Memory

  • cudaMalloc: allocate memory in the GPU

  • cudaMemcpy: transfer data between CPU and GPU

Unified memory (covered in later lectures) eliminates the need for explicit data copies.
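A compact sketch of the explicit-copy flow this section describes (N, someKernel, and the launch configuration are illustrative placeholders):

const int N = 1 << 20;                                  // illustrative element count

float *h_a = (float *)malloc(N * sizeof(float));        // host buffer
float *d_a = NULL;                                      // device buffer
cudaMalloc((void **)&d_a, N * sizeof(float));           // allocate GPU global memory

cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);   // CPU -> GPU
// someKernel<<<numBlocks, threadsPerBlock>>>(d_a, N);              // compute on the GPU
cudaMemcpy(h_a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);   // GPU -> CPU

cudaFree(d_a);                                          // release GPU memory
free(h_a);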

CUDA Programming with Stencil Operations

// Load data into shared memory
__shared__ float sharedInput[sharedDim][sharedDim];

int sharedX = threadIdx.x + filterSize / 2;
int sharedY = threadIdx.y + filterSize / 2;

// Load different values on boundaries
if (x >= 0 && x < width && y >= 0 && y < height) {
    sharedInput[sharedY][sharedX] = input[y * width + x];
} else {
    sharedInput[sharedY][sharedX] = 0.0f;   // Handle boundary conditions
}

__syncthreads();   // wait until the whole tile is loaded before using it

// Apply the filter to the pixel and its neighbors using shared memory
for (int i = 0; i < filterSize; i++) {
    for (int j = 0; j < filterSize; j++) {
        result += sharedInput[threadIdx.y + i][threadIdx.x + j] * filter[i][j];
    }
}

OpenCL vs CUDA

                   | OpenCL                                   | CUDA
Execution model    | Work-groups / work-items                 | Blocks / threads
Memory model       | Global / constant / local / private      | Global / constant / shared / local + texture
Memory consistency | Weak consistency                         | Weak consistency
Synchronization    | Work-group barrier (between work-items)  | __syncthreads() (between threads)

GPU Architecture

Multithreading

Benefits of Multithreading

  • Hide processor stall time:
    • Cache misses
    • Branch instructions
    • Long latency operations (ALU operations)
  • GPUs use multithreading to hide latency.
    • Out of order processors (OOO) use cache and ILP to hide latency.
  • Longer memory latency requires a greater number of threads to hide latency

Front-end Extension for Multithreading

  • Multiple PC registers for Warps (one PC for each warp)

  • One static instruction for one warp (SPMD programming model)

  • Individual registers for each thread

  • This minimizes context switch overhead

  • The register file is a significant hardware resource

CPU Context Switching

  • CPU context switch: Store PC, architecture registers in stack/memory

  • High overhead of CPU context switching

Hardware Support for Multithreading

  • Front-end needs to have multiple PCs
    • One PC for each warp since all threads in a warp share the same PC
    • Later GPUs have other advanced features
  • Large register file
    • Each thread needs “K” number of architecture registers
    • total register file size requirement = K × (number of threads)
    • “K” varies by applications
  • Remember occupancy calculation?
    • Each SM can execute Y number of threads, Z number of registers, etc.
    • Y is related to # of PC registers
    • Z is related to K

Calculation Exp

  • Hardware example: SM can execute 256 threads, 64K registers, 32 KB shared memory; warp size is 32.

    1. How many PCs are needed in one SM?

       Answer: 256 / 32 = 8 PCs

    2. If a program has 10 instructions, how many times does one SM fetch an instruction?

       Answer: 10 × 8 = 80

CUDA Block/Threads/Warps

  • Multiple blocks can be running on one Multiprocessor.

  • Each block has multiple threads.

  • A group of threads are executed as a warp.

  • Registers are per thread.

  • Execution width × (2 read + 1 write) register accesses per cycle.

Port vs. Bank

  • Port: Hardware interface for data access

  • E.g., each thread requires 2 read and 1 write ports and execution width is 4.

  • → 8 read ports and 4 write ports

  • Bank: a partition (group) of the register file

  • Multiple banks can be accessed simultaneously.

  • More ports mean more hardware wiring and resource usage

Variable Number of Registers per Thread

  • CUDA programs can benefit from using a different number of registers per thread

  • Instruction R3 = R1+R2

  • Case 1: 4 registers per 1 thread

  • Case 2: 2 registers per 1 thread

  • Case 1: reading registers would not cause a bank conflict

  • Case 2: reading R1 and R2 from multiple threads would cause a bank conflict!

  • Remember: GPU executes a group of threads (warp), so multiple threads are reading the same registers

Solution

  • Compiler-driven solution for optimizing the register/code layout

  • Register IDs are known at static (compile) time.

Static vs. Dynamic

  • In this course, static often means before running code. The property is not dependent on input of the program. Dynamic means that the property is dependent on input of the program.

  • E.g., static/dynamic number of instructions

LOOP: ADD R1, R1, #1
      BREQ R1, 10, LOOP
  • Let's say the loop iterates 10 times. The static number of instructions is 2; the dynamic number of instructions is 20.

  • Static time analysis = compile time analysis

  • Complexity beyond a 5-stage pipeline:

  • Register file access takes more than 1 cycle.

  • Source register values are buffered.

Scoreboarding

  • Widely used in CPU to enable out of order execution
  • Dynamic instruction scheduling
  • GPUs: check when all source operands within a warp are ready; the warp is sent to execution units
    • Choose which one to send to the execution unit among multiple warps
    E.g., oldest first

Shared Memory Bank Conflicts

  • Shared memory is an on-chip storage

  • Scratch pad memory

  • Shared memory is also composed with banks

  • Let’s assume the following shared memory

  • 4 banks; the numbers below are memory addresses (word indices)

__shared__ float sharedInput[SIZE];   // SIZE: placeholder for the array length
int index1 = threadIdx.x * 4;         // each thread accesses a stride-4 element
float v = sharedInput[index1];
  • Then thread 0 → mem[0], thread 1 → mem[4], thread 2 → mem[8], thread 3 → mem[12], …

  • With 4 banks, all of these accesses map to the same bank, so every thread generates a bank conflict.

  • Solution: change the software data layout (e.g., padding) …

Global Memory Accesses

  • 1 memory instruction could generate many memory requests to DRAM

  • 1 warp can generate up to 32 memory requests (if warp size is 32)

  • # of requests can easily be a large number

  • E.g., 32 SMs →32 x 32 = 1024 requests in one cycle

  • 1024 × 64 B = 64 KB per cycle; at 1 GHz this would require 64 TB/s of memory bandwidth

Memory Coalescing

  • Even if the memory system can provide high bandwidth, reducing the number of memory requests is critical.

  • GPU cache is very small.

  • DRAM memory request sizes are 64–128 B

  • All memory requests from the first load can be combined into one memory request (mem[0] through mem[28]) → called coalesced.

  • The second load cannot be easily combined → uncoalesced.

Coalesced Memory

  • Combining multiple memory requests into a single, more efficient memory request

  • Consecutive memory requests can be coalesced

  • Reduce the total number of memory requests

  • One of the key software optimization techniques
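A small sketch contrasting the two patterns (the kernel names and the stride parameter are illustrative):

__global__ void coalesced(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;               // consecutive threads touch consecutive addresses
                                   // -> one (or few) memory requests per warp
}

__global__ void uncoalesced(float *data, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i * stride] += 1.0f;      // consecutive threads touch addresses far apart
                                   // -> up to 32 separate memory requests per warp
}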

Program Pattern

Bank Conflict

Matrix Transpose

__global__ void transposeNaive(float *odata, const float *idata, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height) {
        int input_idx = y * width + x;
        int output_idx = x * height + y;
        odata[output_idx] = idata[input_idx];
    }
}

Tile Version

__global__ void transposeNaive(float *odata, const float *idata)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
        odata[x * width + (y + j)] = idata[(y + j) * width + x];
}

It improves memory throughput!

Grid-Stride

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
__global__
void saxpy(int n, float a, float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = a * x[i] + y[i];
    }
}

  • Scalable solution.

  • The array size >> # blocks

  • Still maintain coalesced loads/store

  • Could reduce # of blocks significantly → tunable parameters

+1 Padding Code

__shared__ float tile[TILE_DIM][TILE_DIM + 1];

With the +1 padding (and TILE_DIM = 32), a column access maps to bank (x × (TILE_DIM + 1) + (y + j)) % 32 = (33x + (y + j)) % 32; for a fixed (y + j) this gives a different bank for every x (e.g., (33x) % 32 = x), so the bank conflicts disappear.

__global__ void access(float* data, int width, int height) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = col * height + row;
    data[idx] += 1.0f;
}

What kind of memory access pattern does this kernel use within a warp?

A: column-major, strided memory access pattern

Parallel Reduction

naive version

if (threadIdx.x % (2 * stride) == 0) {
    data[threadIdx.x] += data[threadIdx.x + stride];
}
__syncthreads();
stride *= 2;
  • Thread utilization is very low.

  • How do you communicate across blocks?

  • Option #1: Kernel launch

  • Option #2: Use atomic after the block and then thread-block level reduction

  • Option #3: Thread-block draining

  • Option #4: Cooperative group : Cooperative kernel launch

Sequential Addressing

int tid = threadIdx.x;
for (int stride = 1; stride < blockDim.x; stride *= 2) {
    int index = 2 * stride * tid;
    if (index < blockDim.x) {
        sdata[index] += sdata[index + stride];
    }
    __syncthreads();
}

Defer Synch: Add during load

unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[threadIdx.x] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();

Unroll last warp

if (tid < 32) {
    volatile float* vsmem = sdata;
    vsmem[tid] += vsmem[tid + 32];
    vsmem[tid] += vsmem[tid + 16];
    vsmem[tid] += vsmem[tid + 8];
    vsmem[tid] += vsmem[tid + 4];
    vsmem[tid] += vsmem[tid + 2];
    vsmem[tid] += vsmem[tid + 1];
}

Warp Shuffle Reduction

#define FULL_MASK 0xffffffff

float val = ...;   // each thread's local partial sum

for (int offset = 16; offset > 0; offset /= 2) {
    val += __shfl_down_sync(FULL_MASK, val, offset);
}

AtomicAdd & Cooperative groups

__global__ void reduce_atomic(float* data, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = 0.0f;
    if (i < n) val = data[i];

    // A block-level reduction could be done first, followed by one atomic per block;
    // here we just demonstrate the direct atomic:
    atomicAdd(out, val);
}

Warp shuffle

#define FULL_MASK 0xffffffff
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(FULL_MASK, val, offset);

Atomic in GPU

idx = threadIdx.x;
bid = blockIdx.x;
val = p_sum[idx];
atomicAdd(&b_sum[bid], val);
LD  R2, p_sum[threadIdx.x]
LD  R1, b_sum[R3]
ADD R1, R1, R2
ST  b_sum[R3], R1

RED.E.ADD.F32.FTZ.RN [c], R2;

  • Hardware supports this atomic instruction

  • RED.E.Add : Reduction, Element-wise reduction, operation Add

  • F32: data type 32-bit float

  • FTZ: Flush-to-zero: round down (flush) to zero for very small numbers (denormalized numbers)

  • RN: Round to nearest

  • Done by special Hardware in L2 cache (L1 caches are not coherent!!)

  • Behavior is serialized.

Which of the following best describes the purpose of atomicAdd() in CUDA?

A: "To perform a synchronized, conflict-free addition to a shared or global memory location across many threads."

Programming Optimization

Cooperative Groups

Warp

  • A warp executes in lockstep as the basic unit of execution
  • Warp width → 32 threads.
  • What if we have work that requires only fewer threads?
g.sync();             // synchronize group g
cg::synchronize(g);   // an equivalent way to synchronize g
using namespace cooperative_groups;
__device__ int reduce_sum(thread_group g, int *temp, int val)
{
    int lane = g.thread_rank();   // rank is only unique within the thread group

    // Each iteration halves the number of active threads
    // Each thread adds its partial sum[i] to sum[lane+i]
    for (int i = g.size() / 2; i > 0; i /= 2)
    {
        temp[lane] = val;
        g.sync();                 // wait for all threads to store
        if (lane < i) val += temp[lane + i];
        g.sync();                 // wait for all threads to load
    }
    return val;                   // note: only thread 0 will return the full sum
}

Thread Block

__device__ int thread_sum(int *input, int n)
{
    int sum = 0;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n / 4;
         i += blockDim.x * gridDim.x)
    {
        int4 in = ((int4*)input)[i];
        sum += in.x + in.y + in.z + in.w;
    }
    return sum;
}

__global__ void sum_kernel_block(int *sum, int *input, int n)
{
    int my_sum = thread_sum(input, n);

    extern __shared__ int temp[];
    auto g = this_thread_block();
    int block_sum = reduce_sum(g, temp, my_sum);

    if (g.thread_rank() == 0) atomicAdd(sum, block_sum);
}

Partitioning Groups

  • Tile partition.

  • Allow programmers to decompose block by program

thread_group tile32 = cg::tiled_partition(this_thread_block(), 32);
thread_group tile4 = tiled_partition(tile32, 4);

Tile sizes must be a power of two and at most 32 (e.g., 4, 8, 16, 32).

Synchronization can be done within a tile.


Let's assume that an SM can execute 256 threads, and the width of a warp is 8 threads. How many PCs are at least needed for one SM?

A: 256 / 8 = 32 PCs (one per warp).

What is the primary purpose of mask bits in GPU architecture?

Mask bits indicate which threads (lanes) in a warp are active or inactive during SIMT execution. They allow the GPU to selectively enable or disable lanes when executing instructions, especially during branch divergence.

Measuring Performance

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// ... kernel launches / work to be timed ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // make sure all recorded work has finished
float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);   // elapsed time in milliseconds

Applications Suitable for GPUs

  • Handle massive parallel data processing

  • Low Dominance in Host-Device Communication Costs

  • Coalesced Data Access (Global Memory Coalescing)

Profiling

  • Identify hotspots of applications
  • Measure key performance factors
    • Reported throughput

    • # of Divergent branches

    • # of Divergent memory (coalesced/uncoalesced)

    • Occupancy

    • Memory bandwidth utilizations

  • Use vendor-provided profiler to profile kernels

Optimization Techniques

  • Overall execution time = data transfer time + compute time + memory access time

  • Data transfer time optimizations

  • Memory access pattern optimizations

  • Computation overhead reduction optimizations

  • Utilize libraries

Data Transfer Optimizations

Optimizations for data transfer between Host and Device

  1. Utilize fast transfer methods
    • E.g.) Pinned (page-locked) memory
  2. Overlap Computation and Data Transfers
    • E.g.) cudaMemcpyAsync
  3. Pipeline Transfer and Computation (Concurrent Copy and Execute)
    • E.g.) Use stream
  4. Direct host memory access
    • Zero copy: access CPU data directly in the CPU and GPU integrated memory
    • Unified virtual addressing (UVA)
      • Driver/runtime system hides the physically separated memory spaces and provides an interface as if CPU and GPU can access any memory
      • Ideally no data copy cost but the implementation still requires data copy and the overhead exists

Memory Access Pattern Optimizations

  • Utilize caches

  • Global memory coalescing accesses

  • Aligned global memory accesses

  • Reducing the number of DRAM memory transactions is the key

  • Check the average memory bandwidth

  • Reduce shared memory bank conflicts

Pinned/Page-Locked Host Memory

  • Use cudaHostAlloc()

  • The operating system guarantees that the page stays resident in host physical memory (the host uses virtual memory), so it can be accessed by physical address → the GPU can use DMA to access CPU memory (because the physical addresses are known) → this achieves the peak PCI-E bandwidth

  • Should we use it all the time?

  • No: the total pinned memory must stay below the size of physical memory, since pinned pages cannot be paged out

  • Comparison code using cudaHostAlloc appears later

Reduce Computation Overhead

  • Instruction level optimizations

  • Replace with low-cost instructions

  • E.g.) Use shift operations instead of multiplication or divisions

  • Use lower precision when possible (i.e., fewer bits)

  • Use hardware built-in special functions, e.g., rsqrtf()

  • Utilize Math libraries

  • Reduce branch statements

  • Use predicated execution if possible

  • Avoid atomic operations if possible

  • Use tensor operations and utilize tensor cores
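A couple of hedged examples of such instruction-level substitutions (the kernel is illustrative; whether each substitution actually helps depends on the compiler and data types; rsqrtf() and __fmaf_rn() are CUDA built-ins):

__global__ void lowCostOps(float *x, int *k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        k[i] = k[i] >> 1;                    // shift instead of integer division by 2
        x[i] = rsqrtf(x[i]);                 // special-function unit instead of 1.0f / sqrtf(x[i])
        x[i] = __fmaf_rn(x[i], 2.0f, 1.0f);  // fused multiply-add instead of separate mul + add
    }
}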

New CUDA Features

  • Warp-level operations: warp shuffle/vote/ballot etc.

  • Communication between threads is expensive

  • Register files are unique to threads

  • Need to use shared memory which is expensive

  • Warp-level operations allow data movement within a warp

  • Hardware support for warp level reduction

  • Cooperative groups

  • Instead of fixed warp size, smaller warp-level operations are allowed

  • E.g.) NVIDIA Volta’s independent thread scheduling

  • New features

  • Departure from strict SIMT execution is allowed: each thread has its own PC and can follow a different execution path (different thread-group execution)

    • Shared memory no longer belongs to one SM.
  • Unified memory for shared memory, L1 data cache, and texture memory

  • Hardware acceleration for split arrive/wait barrier

CUDA Libraries

  • Continuously developed

  • Linear Algebra applications: cuBLAS

  • Sparse matrices: cuSPARSE

  • Tensor based linear algebra: cuTENSOR

  • Parallel algorithm libraries: Thrust

  • Communication libraries: target for multi GPUs

  • NCCL, NVSHMEM

  • Deep learning libraries: cuDNN

Summary

GPU programming optimizations focus on:

1. Reducing the amount of computation

2. Reducing data transfer overhead between host and device

3. Ensuring efficient memory access patterns in global and shared memory

Which metric should we examine to decide whether we want to perform data transfer optimizations?

A: The ratio of data-transfer time to total execution time.

Optimization options

  • Different memory

  • Increase parallelism

  • Shared memory vs. cache

Review Cooperative Group

__global__ void simple_reduce_kernel(int *data) {
    // Create thread block group
    auto g1 = this_thread_block();
    // Create 32-thread tile from the thread block
    auto g2 = tiled_partition<32>(g1);
    // Create 4-thread tile from the 32-thread tile
    auto g3 = tiled_partition<4>(g2);
    // Get thread information
    int tid = g1.thread_index().x;
    int gid = g1.group_index().x * nTPB + tid;

    int lane = g.thread_rank();
    if (lane == 0) {
        printf("Group partial sum: %d\n", sdata[0]);
    }
}

// Host-side launch (nTPB threads per block, nTPB * sizeof(int) bytes of shared memory):
simple_reduce_kernel<<<1, nTPB, nTPB * sizeof(int)>>>(data);

This is the host code. How many times will we see the "Group partial sum" statement?

simple_reduce_kernel<<<1, nTPB, nTPB * sizeof(int)>>>(data);

nTPB is 256

__device__ void device_code() {
    // ...
    auto g1 = this_thread_block();
    auto g2 = tiled_partition<32>(g1);
    auto g3 = tiled_partition<16>(g2);

    if (lane == 0) {
        printf("Group partial sum: %d\n", sdata[0]);
    }
}

A:

g1: 256 threads in block

g2: 8 tiles of size 32

g3: each 32-thread tile split into 2 tiles of 16

total 16 tiles of 16

Print 16 times.

Review: Different Memory Managements

  • Pinned memory

  • Asynchronous data transfer

  • Unified Virtual Memory

  • Zero copy

  • Prefetching

Code Examples

  • Basic synchronous transfer
cudaError_t err = cudaMalloc(&d_data, padded_size * sizeof(unsigned short));
cudaMemcpy(d_data, host_data.data(), size * sizeof(unsigned short), cudaMemcpyHostToDevice);
  • Asynchronous transfer
cudaMalloc(&d_data, padded_size * sizeof(unsigned short));
cudaMemcpyAsync(d_data, host_data.data(), size * sizeof(unsigned short), cudaMemcpyHostToDevice);
  • Asynchronous transfer with prefetcher
cudaMalloc(&d_data, padded_size * sizeof(unsigned short));
cudaMemcpyAsync(d_data, host_data.data(), size * sizeof(unsigned short), cudaMemcpyHostToDevice);
  • Pinned Memory
cudaMalloc(&d_data, padded_size * sizeof(unsigned short));
unsigned short* h_pinned_data;
cudaHostAlloc(&h_pinned_data, padded_size * sizeof(unsigned short), cudaHostAllocDefault);
cudaMemcpy(d_data, h_pinned_data, padded_size * sizeof(unsigned short), cudaMemcpyHostToDevice);
  • cudaHostRegister: pin regular memory
memcpy(h_pinned_data, host_data.data(), size * sizeof(unsigned short));
// or
cudaMemcpyAsync(d_data, h_pinned_data, padded_size * sizeof(unsigned short), cudaMemcpyHostToDevice);

In the pinned memory style, which data structure needs to use cudaHostAlloc? e.g.) copy from h_data (host data pointer) to d_data (device data pointer)

A: The host-side buffer (h_data) must use cudaHostAlloc() when using pinned-memory style. The device pointer (d_data) is allocated with cudaMalloc(), not cudaHostAlloc().

  • UVM

Unified memory

unsigned short *a, *b, *c;
cudaMallocManaged(&a, size);
cudaMallocManaged(&b, size);
cudaMallocManaged(&c, size);

vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, N);
cudaDeviceSynchronize();

// ...
// Check the results

  • Zerocopy
cudaHostAlloc(&h_a, size * sizeof(unsigned short), cudaHostAllocMapped);
// ...
// Get device pointers
unsigned short *d_a, *d_b, *d_c;
cudaHostGetDevicePointer(&d_a, h_a, 0);
// ...
// Launch kernel on mapped memory (zero-copy!)
vectorAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, size);
cudaDeviceSynchronize();

Which CUDA memory management scheme allows applications to use more memory than the GPU’s physical memory?

A: Unified Memory

Multi GPUs

Multi-GPUs

  • To accommodate the high demand for computing power, GPUs have expanded into multi-GPU systems

Multi-GPU Connections

  • High-speed connections (e.g., NVLINK)

  • I/O bandwidth needs to be shared

  • Programmer’s viewpoint: many SMs

  • Shared memory space

  • Memory is NUMA (non-uniform memory access)

  • Near memory and far memory from the latency's viewpoint

Thread Block Scheduling and Memory Mapping

  • Thread block scheduling policy can be improved.

  • Memory pages are allocated on the GPU that touches them first (first-touch policy).

NVLink/NVSwitch

  • NVLink: an alternative communication option to PCI-E

  • Nvidia’s communication protocol

  • Provide high speed communication between GPU-GPU

  • E.g., the 4th generation of NVLink provides 900 GB/s per GPU over 18 links per GPU (alongside roughly 3 TB/s of HBM3 memory bandwidth in H100)

The NVSwitch chip provides even more scalability.


Memory Space

  • Across multi-GPU boards, RDMA is used to communicate

  • Distributed memory systems


Backgrounds: RDMA Technology

  • Traditional communication

  • Use TCP/IP like network to go through the network software stack

  • Use CPU resource

  • RDMA: Remote Direct Memory Access

  • Similar to DMA (Direct Memory Access), memory-to-memory communication can be done without using CPU resources.

  • Host-bypass technology

GPUDirect RDMA

  • It can be used to communicate between GPUs and even other 3rd party devices.

Increasing Utilization of GPUs

  • Multi-GPUs provide high computing power.

  • GPUs are well-suited for multi-data processing, but not all jobs utilize them.

  • How can we improve utilization?

  • GPUs for data centers

  • Traditional data centers handle multi-tenant jobs, whereas workloads like LLMs using GPUs involve one tenant using all available resources.

  • AI workload performance is limited by the slowest task.

GPU Concurrency Mechanisms

  • Multi-stream: no resource partitioning

  • Multi-Process Service (MPS)

  • Multi-Instance GPU (MIG)

Multi-Instance of GPUs (MIG)

  • Multi-Instance GPUs (MIG) allow multiple jobs to run concurrently on a single GPU.

  • Efficient way of utilizing large GPUs

  • MIG offers memory and GPU core isolation to host multiple users.

  • It partitions memory, L2 cache ports, DRAM memory bandwidth, and on-chip crossbar ports to provide Quality of Service (QoS).

  • E.g., the A100 can host up to 7 instances, making it crucial for cloud service providers in multi-tenant use cases.


Multi Process Service (MPS) Support

  • Spatial partitioning was supported from V100 GPU

  • Earlier GPUs supported sharing only with time slicing

  • Multiple Jobs can be run concurrently

  • Limitations: no strict resource partitioning, which might cause unpredictable performance but provide limited QoS

  • MPI jobs can be easily ported to CUDA using MPS methods

  • Can provide better utilization than MIG if QoS is less critical

Stream Based Programming

  • Stream allows concurrent execution of multiple kernels

  • It allows multiple CPU threads to submit kernels

  • OpenMP programs can be easily ported with stream based programming

  • Scheduling among multiple streams causes performance overhead

  • → To overcome this, CUDA Graphs were introduced: a dependency chain among kernels is constructed.

Non-stream Version

cudaMemcpy(d_x, h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);
gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x, d_y, 0.0, 1.0, ds);
cudaMemcpy(h_y1, d_y, ds * sizeof(ft), cudaMemcpyDeviceToHost);
cudaCheckErrors("non-streams execution error");

Stream Version

cudaStream_t streams[num_streams];
for (int i = 0; i < num_streams; i++) {
    cudaStreamCreate(&streams[i]);
}

for (int i = 0; i < chunks; i++) {   // depth-first launch
    cudaMemcpyAsync(d_x + i * (ds / chunks),
                    h_x + i * (ds / chunks),
                    (ds / chunks) * sizeof(ft),
                    cudaMemcpyHostToDevice,
                    streams[i % num_streams]);

    gaussian_pdf<<<((ds / chunks) + 255) / 256,
                   256, 0, streams[i % num_streams]>>>(
        d_x + i * (ds / chunks),
        d_y + i * (ds / chunks),
        0.0, 1.0, ds / chunks);

    cudaMemcpyAsync(h_y + i * (ds / chunks),
                    d_y + i * (ds / chunks),
                    (ds / chunks) * sizeof(ft),
                    cudaMemcpyDeviceToHost,
                    streams[i % num_streams]);
}

cudaDeviceSynchronize();

cudaDeviceSynchronize();

Programming for Multi-GPU

int main() {

    ft *h_x, *d_x[num_gpus], *d_y[num_gpus];
    h_x = (ft *)malloc(ds * sizeof(ft));

    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        cudaMalloc(&d_x[i], ds * sizeof(ft));   // Malloc for each GPU
        cudaMalloc(&d_y[i], ds * sizeof(ft));   // Malloc for each GPU
    }
    cudaCheckErrors("allocation error");

    for (int i = 0; i < num_gpus; i++) {
        for (size_t j = 0; j < ds; j++) {
            h_x[j] = rand() / (ft)RAND_MAX;
        }
        cudaSetDevice(i);   // Indicate which device to use from the host side
        cudaMemcpy(d_x[i], h_x, ds * sizeof(ft), cudaMemcpyHostToDevice);
    }
    cudaCheckErrors("copy error");

    unsigned long long et1 = dtime_usec(0);

    for (int i = 0; i < num_gpus; i++) {
        cudaSetDevice(i);
        gaussian_pdf<<<(ds + 255) / 256, 256>>>(d_x[i], d_y[i], 0.0, 1.0, ds);   // Kernel invocation for each GPU
    }
    cudaDeviceSynchronize();   // Wait until all devices finish
    cudaCheckErrors("execution error");
}

GPU Support for Multi-Tenant Computing

Mechanism        | Stream    | MPS            | MIG
Partition type   | No        | Logical        | Physical
Max partitions   | Unlimited | 48             | 7
SM isolation     | No        | By percentage  | Yes
Mem BW isolation | No        | No             | Yes
Performance QoS  | No        | Partial        | Yes
Reconfiguration  | Dynamic   | Process launch | When idle

Stream with Data Copy

cudaMalloc(&d_data, padded_size * sizeof(unsigned short));
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_data, host_data.data(), size * sizeof(unsigned short), cudaMemcpyHostToDevice, stream);

CUDA Graph

Overview

  • The overhead of repeatedly launching GPU kernels is becoming significant

  • To reduce the kernel launch overhead, CUDA Graphs were introduced

  • Node: any asynchronous operation, e.g., a memory operation or a kernel launch

Construction

  • Method 1: Explicit graph construction

  • Pro: explicit control of graph constructions

  • cudaGraphAddNode(graph, kernel_a, {} …)

  • Method 2: Use stream capture feature

  • Pro: convenient

Stream Capture

  • Record a sequence of operations in a CUDA graph

  • Capture and then instantiate

  • The captured operations are not actually launched until the graph is instantiated and launched

  • Capture mode: start with cudaStreamBeginCapture() and end with cudaStreamEndCapture()

  • The captured sequence becomes a graph → a cudaGraph_t object

  • Graph execution: instantiate (cudaGraphInstantiate) once and launch (cudaGraphLaunch) multiple times

Implementation

bool graphCreated = false;
cudaGraph_t graph;
cudaGraphExec_t instance;
for (int istep = 0; istep < NSTEP; istep++) {
    if (!graphCreated) {
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int ikrn = 0; ikrn < NKERNEL; ikrn++) {
            shortKernel<<<blocks, threads, 0, stream>>>(out_d, in_d);
        }
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
        graphCreated = true;
    }
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
}
  • CUDA graph data structure : cudaGraph_t

  • Instance : cudaGraphExec_t

  • Stream begin capture, stream end capture

  • There is overhead in creating the graph, but it is amortized over repeated launches

Graph Memory Nodes

  • Node types to manage memory operations inside a graph (CUDA ≥ 11.4)

  • Memory allocation node (cudaGraphAddMemAllocNode)

  • Memory free node (cudaGraphAddMemFreeNode)

Update

  • A graph can be reused only if its parameters (input arguments) and topology do not change

  • Graph update works if only parameters change and the topology stays the same.

  • Call cudaGraphExecUpdate(…)

  • If parameters and topology change, use graph re-launch

Demonstration

  • Multi-GPU communication/synchronization overhead is reduced with CUDA Graphs

  • In multi-GPU cases, the performance bottleneck can come from CPU scheduling overhead

Conditional Nodes

  • Conditional nodes:

  • IF node: conditionally choose whether to execute another sub-graph

  • WHILE node

  • SWITCH node

  • Can be constructed by calling cudaGraphSetConditional

  • More: please look at the CUDA programming manual

NCCL

  • NVIDIA Collective Communications Library

  • Inter-GPU communication primitives

  • Support both collective communication and point-to-point send/receive primitives

  • NCCL provides fast collectives over multiple GPUs both within and across nodes

Using NCCL with CUDA Graphs

cudaGraph_t graph;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
ncclAllReduce(..., stream);
kernel_C<<< ..., stream >>>(...);
cudaStreamEndCapture(stream, &graph);

cudaGraphExec_t instance;
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);
cudaGraphLaunch(instance, stream);
cudaStreamSynchronize(stream);

Pytorch with CUDA Graph

  • CPU-GPU communication has its own kernels (implemented in NCCL, but essentially memcpy plus operations such as reduction and distribution)

  • The overhead of NCCL can be high with fast iterations


CUDA graph with Pytorch2

  • Is using CUDA graph always good?

  • The source of performance degradation

  • parameter value transfer (pointer to pointer operations)

  • The paper discusses CUDA-graph-aware data placement


What is the main advantage of using CUDA Graph?

A: To significantly reduce CPU overhead from repeatedly launching many small kernels by capturing the whole GPU workload as a single executable graph and replaying it efficiently.

Paper Readings

What kind of applications would get benefits from a large warp?

A: Apps with highly regular, SIMD-friendly, data-parallel workloads, where all threads follow the same control flow and access memory in a uniform pattern.

What is the main advantage of register file virtualization in GPUs?

A: It allows the GPU to support more threads (higher occupancy) than the physical register file would normally allow, improving latency hiding and overall throughput.

Why is it hard to support virtual address translation for GPUs?

A: Because GPUs execute tens of thousands of threads in parallel, and virtual address translation requires page table walks and TLB lookups. Supporting this at GPU scale produces huge latency, high energy cost, massive TLB pressure, and large hardware area overhead.

Which statement explains the meaning of Unified Virtual Memory (UVA) better?

A: UVA provides a single unified virtual address space shared across the host and all GPUs, allowing pointers to be meaningful across devices without explicit address translation.

When does GPU memory oversubscription happen?

A: Oversubscription happens when a GPU uses Unified Memory (Managed Memory) and the application allocates more memory than the physical GPU memory capacity.

What are the main benefits of 2-level warp scheduling?

A: (1) Better latency hiding, (2) better resource utilization, (3) higher fairness among warps / thread blocks, (4) improved energy efficiency.

GPU Simulation

Performance Modeling Techniques

  • Cycle level simulation

  • Event driven simulation

  • Analytical Model

  • Sampling based techniques

  • Data based statistical/ML modeling

  • FPGA based emulation

Cycle Level Simulation

  • Commonly used in many architecture simulators

  • Typically, a global clock exists.

  • Each cycle, events such as instruction fetch and decode are modeled.

  • Multiple clock domains can exist. (E.g. Memory clock, processor clock, NoC clock)

Execution Driven vs. Trace Driven Simulation

  • Execution driven: instructions are executed during the simulation

    • Execute-at-fetch vs execute-at-execute

    • Execute-at-fetch: instruction is executed when an instruction is fetched

    • Execute-at-execute: instruction is executed at the execution stage

  • Trace driven: traces are collected; simulation and execution are decoupled

  • Benefits of execution driven: no need to store traces

  • Trace driven cannot model run-time dependent behaviors: e.g.) lock acquire, barriers

  • Trace-driven simulators are simpler and often lighter and easier to develop

  • E.g.) Memory traces only for memory simulation or cache

Queue Based Modeling

  • Instructions are moving between queues.

  • Scheduler selects instructions that will be sent to the execution stage among ready instructions; not implemented as a queue structure.

  • Other queues are FIFO.

  • When an instruction completes, its dependent instructions become ready. The dependency chain needs to be modeled, and the broadcasting of results also needs to be modeled.

  • Cache and memory are modeled to provide memory instruction latency.

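A minimal C++ sketch of this queue-based style, assuming made-up stage widths and latencies and omitting dependency tracking and the memory hierarchy (it is not from the lecture, only an illustration of instructions moving between FIFO queues each cycle):

#include <deque>
#include <vector>

struct Instr {
    int latency = 1;      // execution latency in cycles (from a config/latency table)
    long doneCycle = 0;   // cycle at which the result would broadcast to dependents
};

struct Pipeline {
    std::deque<Instr> fetchQ, decodeQ, schedQ;   // FIFO queues between stages
    std::vector<Instr> inFlight;                 // instructions in the execute stage
    long cycle = 0;
    int width = 2;                               // pipeline width: moves per cycle

    void run_a_cycle() {
        ++cycle;
        // Writeback: retire instructions whose latency has elapsed
        for (size_t i = 0; i < inFlight.size(); )
            if (inFlight[i].doneCycle <= cycle) inFlight.erase(inFlight.begin() + i);
            else ++i;
        // Execute: the scheduler sends up to 'width' instructions to execution
        for (int n = 0; n < width && !schedQ.empty(); ++n) {
            Instr in = schedQ.front(); schedQ.pop_front();
            in.doneCycle = cycle + in.latency;
            inFlight.push_back(in);
        }
        // Decode -> schedule and fetch -> decode, each bounded by the pipeline width
        for (int n = 0; n < width && !decodeQ.empty(); ++n) { schedQ.push_back(decodeQ.front()); decodeQ.pop_front(); }
        for (int n = 0; n < width && !fetchQ.empty();  ++n) { decodeQ.push_back(fetchQ.front()); fetchQ.pop_front(); }
    }
};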

Modeling Parameters with Queue Based Modeling

  • Number of cycles in each pipeline stage → depth of the queue

  • How many instructions can move between queues represent pipeline width (E.g., Issue/execution bandwidth)

  • Questions: How do you know the latency of each instruction?

  • Instruction latency assumptions:

    • Instruction latency is given as a parameter (e.g., ADD takes 1 cycle, MUL takes 3 cycles).

    • Latency can be obtained from literature or simulators like CACTI or RTL simulation.

5-stage CPU Processor Modeling

macsim::run_a_cycle(…) {
    cycle++;
    Cpu->run_a_cycle();
    Mem->run_a_cycle();
    Noc->run_a_cycle();
}

Cpu->run_a_cycle(…) {
    Wb->run_a_cycle();
    Exec->run_a_cycle();
    Sch->run_a_cycle();
    De->run_a_cycle();
    Fe->run_a_cycle();
}

int main(int argc, char** argv) {
    macsim_c* sim;

    // Instantiate
    sim = new macsim_c();

    // Initialize Simulation State
    sim->initialize(argc, argv);

    // Run simulation
    // report("run core (single threads)");
    while (sim->run_a_cycle())
        ;

    // Finalize Simulation State
    sim->finalize();

    return 0;
}
  • Each modeled instruction has an op data structure.

  • Op data structure tracks instruction progress and cycle.

  • E.g., op->done_cycle = op->schedule_cycle + latency

  • What if the latency is not fixed? E.g., cache misses.

  • After cache simulation (hit/miss through the hierarchy), we know the memory instruction latency.

Scheduler

  • Scheduler checks ready instructions.

  • Scheduler handles resource constraints.

  • E.g.) # of FP instructions, # of load and store queues, execution units, ports in cache etc.

  • Ensuring resources are available for multiple cycles

GPU Cycle-Level Modeling

  • Similar to CPU modeling

  • The simulation modeling unit is a warp.

  • Warp instruction fetched/decoded/scheduled/executed

  • Scheduler chooses instructions from the head of each warp.

  • Differences from CPUs:

  • In-order scheduling within a warp

  • Out-of-order across warps

  • Major differences between CPU vs. GPU

  • Handling divergent warps

  • Warp, thread block, and kernel concepts

  • Scheduler

End of Simulation

  • Entire thread block scheduled to one SM

  • The simulator tracks completed threads

  • When all threads within a CUDA block complete, the corresponding CUDA block completes

  • When all thread blocks are completed, the kernel ends.

  • When all kernels end, the application ends.

In the queue-based simulation, if we want to increase the execution width, what change do we need to make? Please refer to the diagram in the lecture for the module names. Choose the most relevant one?

A: the number of entries / parallel functional units in the Execute stage

Which of the modules cannot be implemented with queue-based modeling?

A: Caches / Memory hierarchy (L1, L2, TLB, DRAM timing model)

Modeling Unit

  • Instructions are modeled at a warp level.

  • Accessing I-cache, PC registers

  • Reflecting microarchitecture access behavior

  • Mask bits are needed to keep track of resource constraints.

  • Question: How to model divergent warps and memory coalescing?

Memory Coalescing Modeling

  • Modeling memory coalescing is critical.

  • Memory requests need to be merged.

  • Typically, this follows cache line sizes.

  • A 64 B cache line size is already assumed.

Modeling Memory Coalescing with Trace

  • The trace should contain all the memory addresses from each warp.

  • The trace generator can insert all memory instructions individually. E.g., va1, va2, va3, etc.

  • Or trace generator already coalesces memory requests → can reduce the trace size: e.g.) 0x0, 0x4, 0x8, 0x12, 0x16 etc. vs. 0x0 and size 28
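A small sketch of how a trace-driven model might merge one warp's addresses into line-sized requests, assuming the 64 B line size mentioned earlier (this helper is illustrative, not from any simulator):

#include <set>
#include <vector>
#include <cstdint>

// Given the addresses touched by one warp instruction, return the set of
// 64 B-aligned line addresses, i.e., the memory requests actually generated.
std::set<uint64_t> coalesce(const std::vector<uint64_t> &warpAddrs,
                            uint64_t lineSize = 64) {
    std::set<uint64_t> lines;
    for (uint64_t a : warpAddrs)
        lines.insert(a / lineSize * lineSize);   // align down to the line boundary
    return lines;                                // fully coalesced warp -> 1 line
}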

Cache Hierarchy Modeling

  • After addresses are coalesced, memory requests access TLB, L1, L2 caches depending on GPU microarchitecture.

Sectored Cache Modeling

  • Modern GPUs adopt sectored cache.

  • Sectored cache allows bringing a sector of the cache block instead of the entire cache block.

  • Benefit: reduces bandwidth

  • Drawback: reduces spatial locality

  • Share the tag

GPU Simulators

  • Several open-source GPU simulators are available.

  • GPU simulators for different ISAs

Name           | ISA                     | Type             | Architecture            | Open source
GPGPU-Sim      | NVIDIA PTX/SASS         | Execution driven | GPGPU only              | http://www.gpgpu-sim.org/
Accel-Sim      | NVIDIA PTX/SASS         | Trace driven     | GPGPU and accelerators  | https://accel-sim.github.io/
MGPU-Sim       | AMD GPU                 | Execution driven | Multi-GPU supported     | [ISCA 2019]
Macsim         | NVIDIA/Intel GPU        | Trace driven     | Heterogeneous computing | https://github.com/gthparch/macsim
Gem5-GPGPU-Sim | AMD GPU or NVIDIA PTX   | Execution driven | Heterogeneous computing | https://cpu-gpu-sim.ece.wisc.edu/
Vortex-SimX    | RISC-V ISA (extensions) | Execution, RTL   | 3D graphics, GPU        | https://vortex.cc.gatech.edu/

Which statement most accurately describes sectored cache and small block size of cache?

A: A sectored cache reduces the data transfer size while keeping the tag array at the large-block granularity (sectors share a tag);

a small block size also reduces the transfer size / miss penalty, but it increases the tag storage overhead.

GPU Analytical Models

  • Analytical models do not require the execution of the entire program.

  • Analytical models are typically simple and capture the first order of performance modeling.

  • Analytical models often provide insights to understand performance behavior.

First Order of GPU Architecture Design

Let’s consider accelerating a vector dot product with a goal of 1T vector dot products per second (sum +=x[i] *y[i] )

For compute units, we need to achieve 2T FLOPS operations (multiply and ADD) or 1T FMA/sec.

If GPU operates at 1GHz, 1000 FMA units are needed; at 2GHz, 500 FMA units are needed.

Memory units need to supply 2 bytes per dot-product operation, i.e., a 2 TB/s memory bandwidth.

500 FMA units are approximately equal to 16 warps (warp with 32).

If each SM can execute 1 warp per cycle at 2GHz and there are 16 SMs, it can compute 1T vector dot products.

Alternatively, 8 SMs with 2 warps per cycle can also achieve this.

Based on what we have discussed, if we want to design a processor for 1T vector dot products per second with a 4GHz GPU frequency, what’s the memory bandwidth requirement?

A: The bandwidth requirement is set by the operation rate, not the clock frequency: 2 bytes per dot-product operation × 1T operations/sec = 2 TB/s.

To have 1000 FMA units with a 32-thread width of warps, how many warps need to be executed in one cycle with a 4GHz processor?

A:

  • Multithreading: How about the total number of active warps in each SM?

  • W_width : the number of threads (warps) that can run in one cycle

  • W_depth: the number of threads (warps) that can be scheduled during one stall cycle

W_depth and W_width H/W Constraints

  • W_width is determined by the number of ALU units (along with the width of the scheduler)

  • W_depth is determined by the number of registers (along with the number of PC registers)

  • A W_depth of 20 means 20 × 32 (W_width) × (# registers per thread) registers are needed

  • A W_depth of 20 also means at least 20 PC registers are needed.


Finding W_depth

  • Strong correlation factor of W_depth is memory latency.

  • In the dot product example, assume memory latency is 200 cycles.

  • Case 1) 1 comp, 1 memory (dot product):

  • To hide 200 cycles, 200/ (1 comp + 1 memory) = 100 warps are needed

  • Case 2) If we have 1 memory instruction per 4 compute instructions

  • To hide 200 cycles 200 / (1+4) = 40 warps are needed

Decision Factors for the Number of SMs

  • Previous example: 500 FMA units

  • 1 warp x 16 SMs vs. 2 warps x 8 SMs

  • Large and fewer SMs vs. small and many SMs

  • Cache and registers also need to be split.

  • Large cache with fewer SMs vs. small cache with many SMs

  • Large cache increases cache access time, but large cache can increase cache hits among multiple CUDA blocks

  • Sub-core can also be a design decision factor

    Many of these decisions require analyzing the trade-off between size and time.

What is most strongly correlated with deciding W_width for the register file sizes

A:

The number of cores

The number of ALUs

The number of active threads

Memory latency

Roofline Model

  • A Visual performance model to determine whether an application (or a processor) is limited by the compute bandwidth or memory bandwidth

  • Vector sum example: 2 bytes per 1 FLOP → arithmetic intensity: 0.5

  • Another example: sum += x[i] * x[i] * y[i] * y[i]; → 2 bytes per 4 FLOPs → arithmetic intensity: 2
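Putting the two sides of the roofline together: attainable FLOP/s = min(peak FLOP/s, arithmetic intensity × peak memory bandwidth). For the vector-sum example above (AI = 0.5) on a hypothetical GPU with 2 TB/s memory bandwidth and 100 TFLOP/s peak compute, the bound is min(100 T, 0.5 × 2 T) = 1 TFLOP/s, i.e., the kernel is memory bound.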

CPI (Cycle per Instruction) Computation

  • CPI = CPI_base + CPI_branch_miss + CPI_cache_miss + …

  • CPI_base: sustainable performance without any miss events

  • Example: 5-stage in-order processor

Assumptions: CPI_base = 1, branch misprediction penalty = 3 cycles, cache miss penalty = 5 cycles.

2% of instructions have branch mispredictions and 5% of instructions have cache misses. Average CPI?

Answer: 1 + 0.02 × 3 + 0.05 × 5 = 1.31

It is easy to compute the average performance this way.

All penalties are assumed to be serialized.

CPI Computation for Multi-threading

  • CPI_stall(effective) = CPI_stall / W_depth

  • W_depth: the number of warps that can be scheduled during the stall cycles

  • CPI = CPI_base + CPI_stall(effective) + CPI_resource_contention

  • Resource contention: MSHRs (# of outstanding memory misses), busy execution units, DRAM bandwidth, etc.

  • In this way the GPU is modeled as a multi-threaded processor

Applying Interval Analysis on GPUs

  • Naïve approach: consider GPU as just a multi-threading processor

  • Major performance differences between GPU and multi-threading processor

  • Branch divergence: not all threads in a warp are active; parts of the branched code are serialized.

  • Memory divergence: memory latency can differ significantly depending on whether accesses are coalesced or uncoalesced

  • Newer models improve the performance models by modeling sub-core, sectored cache, and other resource contentions more accurately.

Using the roofline model, we want to guide the performance optimization directions. After plotting into the roofline model, it turns out that the application is compute bounded. What would be a better approach to improve the application?

Choices:

  • By utilizing shared memory, the amount of memory to bring is reduced.

  • Check whether any of the compute operations can be simplified.

  • Upgrade the GPU with more memory.

  • Apply prefetching operations.

A:

A better approach is to reduce the amount of computation or increase compute throughput, e.g., using a better algorithm (fewer FLOPs), vectorization / tensor cores / lower precision, or other instruction-level optimizations.

Accelerating Simulation

  • Accelerating simulation itself

    • Parallelizing the simulator

    • Event driven simulation

    • Simplifying the model

    • Sampling

    • Statistical Modeling

    • ML based Modeling

  • Reducing the workloads

    • Micro benchmarks

    • Reducing the workload size

    • Create small representative workloads

What techniques are the most helpful if we simulate Project #2's homework solution for GPU architecture simulation? Choose all that apply.

  • Reduce iteration counts

  • Reduce input sizes

  • Identify dominant kernels and only simulate the dominant kernels

  • Apply CUDA-block-level sampling

???

Arithmetic Intensity

Code Example

__global__ void assignClusters(const float *points, const float *centroids, int *assignments,
                               int n_points, int k, int dim) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_points) {
        int best_cluster = 0;
        float min_dist = FLT_MAX;

        for (int c = 0; c < k; c++) {
            float dist = 0.0f;
            for (int d = 0; d < dim; d++) {
                float diff = points[idx * dim + d] - centroids[c * dim + d];
                dist += diff * diff;   // 1 FLOP multiply + 1 FLOP add per dimension
            }

            if (dist < min_dist) {
                min_dist = dist;
                best_cluster = c;
            }
        }
        assignments[idx] = best_cluster;
    }
}

FLOPs = dim × (1 sub + 1 mul + 1 add) = 3 × dim per point per cluster → k × 3 × dim per point → Total FLOPs = n × k × dim × 3

Memory traffic (theoretical minimum, 4 B floats/ints, each value brought from memory once):

Point data loads: n × dim floats

Centroid data loads: k × dim floats

Store: n cluster assignments (ints)

Bytes of memory transferred = 4 × (n × dim + k × dim + n); a small calculator for the resulting arithmetic intensity is sketched below.
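A small sketch of the resulting arithmetic-intensity calculation, assuming 4-byte floats/ints and the theoretical-minimum traffic counting described above (every point and centroid value is brought from memory exactly once); the function name and sizes are illustrative:

```cpp
#include <cstdio>

// Theoretical-minimum arithmetic intensity of the assignClusters kernel:
// FLOPs = n * k * dim * 3             (1 sub + 1 mul + 1 add per dimension per cluster)
// Bytes = 4 * (n*dim + k*dim + n)     (points + centroids loaded once, one int store per point)
double assign_clusters_ai(long long n, long long k, long long dim) {
    double flops = 3.0 * n * k * dim;
    double bytes = 4.0 * (n * dim + k * dim + n);
    return flops / bytes;
}

int main() {
    std::printf("AI = %.2f FLOP/B\n", assign_clusters_ai(1 << 20, 16, 32)); // example sizes
    return 0;
}
```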

What to count for Memory Bytes

  • Roofline model: use theoretical minimal memory bytes

  • What if we count based on transaction size?

  • Then bytes are counted based on what actually has to be brought from memory or the caches

1
2
3
4
5
6
7
8
#define N 4096

__global__ void matAdd(float *A, float *B, float *C) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
C[idx] = A[idx] + B[idx];
}

FLOPs = N

bytes(min) = N × 12B

AI = 1 / 12 FLOP/B

1
2
3
4
5
6
7
8
#define N 4096
#define STRIDE 2

__global__ void matAdd(float *A, float *B, float *C) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N/STRIDE)
C[idx*STRIDE] = A[idx*STRIDE] + B[idx*STRIDE];
}

FLOPs = N/2

bytes(min) = (N/2) × 12B

AI = 1 / 12 FLOP/B

But counting transaction sizes (per warp of 32 threads):

Useful load data = 32 × 8 B = 256 B (each thread loads two 4 B floats)

FLOPs = 32 (one add per thread)

With stride 2, each array access by a warp spans two 128 B transactions, so bytes = 6 × 128 B = 768 B → AI ≈ 32/768 ≈ 0.04 FLOP/B

1
2
3
4
5
__global__ void vectorAdd(float *a, float *b, float *c) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N/10)
c[idx*10] = a[idx*10] + b[idx*10];
}

Assumptions:

  • Each float is 4 bytes.
  • The kernel is launched with enough threads to cover the condition idx < N / 10.
  • Global memory is accessed in 128-byte aligned transactions.
  • There is no caching (every access results in a memory transaction).
  • Memory bandwidth is measured based on data transferred from memory to the core, including reads and writes.

Question: What is the approximate floating-point operations per byte (FLOP/Byte) for this kernel? Here we calculate Bytes for all the data that has to be brought based on memory transaction sizes. (Choose the closest value based on memory traffic, not just raw data accessed.)

A:

When counting bytes based on transaction size, each transaction transfers T = 128 bytes and contains T/E elements (E: bytes per element). With a stride s, only about 1 in s of them are used. Thus, the effective bytes per useful element is min(s × E, T). For three arrays (two loads + one store), bytes per index is 3 × s × E, so the arithmetic intensity is 1 / (3 × s × E). In this kernel, E = 4 bytes and s = 10, hence AI ≈ 1/120 ≈ 0.008 FLOP/B (a small sketch of this counting follows below).
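A hedged sketch of this transaction-based counting; it assumes 128-byte transactions, 4-byte elements, and no caching, as stated in the assumptions above, and the helper name is illustrative:

```cpp
#include <algorithm>
#include <cstdio>

// Bytes moved per useful element when a warp accesses an array with a given
// stride, counting whole 128 B transactions and assuming no caching.
double bytes_per_useful_element(int stride, int elem_bytes, int txn_bytes = 128) {
    // Consecutive useful elements are stride*elem_bytes apart, so each of them
    // costs that many bytes of traffic, capped at one full transaction.
    return static_cast<double>(std::min(stride * elem_bytes, txn_bytes));
}

int main() {
    int stride = 10, elem = 4;                  // this kernel: stride 10, 4 B floats
    double per_array = bytes_per_useful_element(stride, elem);
    double per_index = 3 * per_array;           // two loads + one store
    std::printf("bytes/index = %.0f, AI = %.4f FLOP/B\n", per_index, 1.0 / per_index);
    return 0;                                    // -> 120 B/index, ~0.008 FLOP/B
}
```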

Compiler Intro

Compilation Flow

CPU Compilation Flow

image-20251210192317347

GPU Compilation Flow

image-20251210192344219

NVIDIA

image-20251210192525844

GPU Compiler Pass

image-20251210192616555

Roles of CLANG

  • Front end parser

  • Tool chain for C-family languages

  • Generating the Abstract Syntax Tree (AST)

C++ PreProcessor

  • Performs text substitution before compilation
1
2
3
4
5
6
7
8
9
10
11
#define COURSE_NUMBER 8803
int main()
{
int number = COURSE_NUMBER;
}
// into

int main()
{
int number = 8803;
}

IR Optimizations

  • Intermediate Representation

  • Back-end compiler

  • IR provides a good abstraction level for optimization

  • Many compiler optimizations are done at the IR level

PTX vs. SASS

  • PTX

    • Parallel Thread Execution

    • PTX is a virtual ISA

    • Architecture independent

    • PTX will be translated to machine code

    • PTX does not have register allocation

  • SASS

    • Shader Assembly

    • Low-level assembly language

    • Architecture-dependent assembly code

    • Registers are allocated

Fat Binaries

  • It contains executable code for multiple architectures

  • It supports multiple GPU versions

  • It also includes CPU code

PTX Instruction

  • Zero to four operands

  • Optional predicate information following an @ symbol

  • @p opcode;

  • @p opcode a;

  • @p opcode d, a;

  • @p opcode d, a, b;

  • @p opcode d, a, b, c; // d: destination operand. a,b,c: source operands

  • setp: writes two destination registers. Use “|” to separate multiple destination registers

  • setp.lt.s32 p|q, a, b; // p = (a<b) ; q=!(a<b);

Predicated Execution

  • Predicated registers can be declared as

  • .reg .pred p, q, r;

  • Predicated registers are virtual and declared with .pred type specifier

  • Predicate variables are optional.

  • lt: less than

1
2
if (i < n) 
j = j + 1;
1
2
setp.lt.s32 p, i, n; // p = (i < n) 
@p add.s32 j, j, 1; // if i < n, add 1 to j

Example Code

  • A PTX statement is either a directive or an instruction.

  • Example of directive: .target .address_size, .function etc.

  • Statements begin with an optional label and end with a semicolon.

1
2
3
4
5
6
7
8
    .reg    .b32 r1, r2;
.global .f32 array[N];

start:
mov.b32 r1, %tid.x;
shl.b32 r1, r1, 2; // shift thread id by 2 bits
ld.global.b32 r2, array[r1]; // thread[tid] gets array[tid]
add.f32 r2, r2, 0.5; // add 1/2

Other PTX Instruction Examples

  • Control flow instructions
    • bra targ1; - Branch to target label 'targ1’
    • call func; - Call function 'func’
    • ret; - Return from function call
  • Synchronization instructions
    • membar, fence
  • Atomic instructions
    • atom
    • Example: atom.add.f16
  • Special PTX Registers
    • ntid: Number of threads in a CTA (Cooperative Thread Array)
    • tid: Thread ID
    • sp: Stack Pointer

PTX

IR

  • Intermediate Representation

  • Typical IR uses three-address code

  • A → B op C

  • LLVM IR version: %result = add i32 %a, %b

  • %result: destination register (target variable)

  • add: operation

  • i32: the result is a 32-bit integer

  • %a, %b : source operands

  • PTX version: add.u32 %r1, %r2, %r3 or add.s32 %r1, %r2, %r3

  • add.u32: unsigned integer add.s32: signed integer

Basic Block

  • A maximal sequence of instructions with one entry and one exit

    • Only the first instruction can be reached from outside.

    • Once the program enters a basic block, all instructions inside the basic block need to be executed.

    • All execution needs to be consecutive.

    • The exit instruction is typically a control-flow instruction.

  • Optimizations within a basic block are local code optimizations.

Flow Graph

  • Flow graph: each node represents a basic block, and each edge indicates a possible program execution path.

  • Entry node: the first statement of the program

1
2
3
4
5
if (cond1) // BB #1
do work1 // BB #2
else
do work3 // BB #3
BB #4
image-20251210193919383

Example

1
2
3
4
5
6
7
8
9
10
11
ld.s32 %r1, [src1];
ld.s32 %r2, [src2];
setp.gt.s32 %p1, %r1, %r2;
@%p1 bra is_greater;
bra is_smaller;
is_greater:
mov.s32 %r3, %r1;
bra end_if;
is_smaller:
mov.s32 %r3, %r2;
end_if:

Draw the basic blocks and the control flow graph.

A:

image-20251210194219243

Data Flow Analysis

Global Code Optimizations

  • Local code optimization: optimization within a basic block

  • Global code optimization: optimization across basic blocks

  • Most global code optimization is based on data-flow analyses

  • Data-flow analysis:

    • Analyze the effect of each basic block

    • Analyses differ by examining properties

  • Principal sources of optimization

    • Compiler optimization must preserve the semantics of the original program

Examples of Code Optimizations

  • Removing redundant instructions

  • Copy propagation

  • Dead code eliminations

  • Code motion

  • Induction variable detection

  • Strength reduction

Data-Flow Analysis Abstraction

  • Execution of a program: transformations of the program state

  • Input state: program point before the statement

  • Output state: program point after the statement

Transfer Functions

  • Use Transfer Functions notation

  • OUT[B] = f_B(IN[B])

  • IN[B]: program state immediately before basic block B

  • OUT[B]: program state immediately after basic block B

  • f_s: transfer function of statement s (f_B is the transfer function of a whole basic block B)

  • Predecessors of B: all blocks that can execute immediately before basic block B

  • Successors of B: all blocks that can execute immediately after basic block B

Reaching Definitions

  • Analyze whether a definition reaches

  • A definition d reaches a point p if there is a path from the point immediately following d to p, without being killed (overwritten)

  • Definitions: a variable is defined when it receives a value

  • Use: when its value is read

  • E.g.) a = x + y → definition: a; uses: x, y

Gen and Kill

  • d: u = v + w

  • Generates the definition d of variable u and kills all other definitions of u in the program.

  • gen_s = {d}: the set of definitions generated by the statement

  • kill_s: the set of all other definitions of u in the program

Generalized Transfer Functions
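The generalized form is not spelled out above, so here is a hedged summary of the standard textbook formulation, composing per-statement transfer functions into a per-block one (written for reaching definitions):

```latex
% Transfer function of a block B = s_1; s_2; \dots; s_n
f_B = f_{s_n} \circ \cdots \circ f_{s_2} \circ f_{s_1}, \qquad \mathrm{OUT}[B] = f_B(\mathrm{IN}[B])

% For reaching definitions, each f_s(x) = gen_s \cup (x - kill_s), which composes into
f_B(x) = gen_B \cup (x - kill_B), \quad
kill_B = kill_{s_1} \cup kill_{s_2} \cup \cdots \cup kill_{s_n}, \quad
gen_B  = gen_{s_n} \cup (gen_{s_{n-1}} - kill_{s_n}) \cup \cdots \cup (gen_{s_1} - kill_{s_2} - \cdots - kill_{s_n})
```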

Reaching Definitions

Algorithm

1
2
3
4
5
6
7
OUT[ENTRY] = ∅
for (each basic block B other than ENTRY) OUT[B] = ∅
while (changes to any OUT occur)
for (each basic block B other than ENTRY) {
IN[B] = ∪_(P a predecessor of B) OUT[P]
OUT[B] = gen_B ∪ (IN[B] - kill_B)
}
image-20251210202247100
| BB   | OUT[B]^0 | IN[B]^1  | OUT[B]^1 | IN[B]^2  | OUT[B]^2 |
|------|----------|----------|----------|----------|----------|
| B1   | 000 0000 | 000 0000 | 111 0000 | 000 0000 | 111 0000 |
| B2   | 000 0000 | 111 0000 | 001 1100 | 111 0111 | 001 1110 |
| B3   | 000 0000 | 001 1100 | 000 1110 | 001 1110 | 000 1110 |
| B4   | 000 0000 | 001 1110 | 001 0111 | 001 1110 | 001 0111 |
| EXIT | 000 0000 | 001 0111 | 001 0111 | 001 0111 | 001 0111 |
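A compact C++ sketch of the iterative algorithm above using bit vectors. The CFG and the gen/kill sets are assumptions chosen to mirror a textbook-style example with four blocks, an EXIT node, and seven definitions:

```cpp
#include <bitset>
#include <cstdio>
#include <initializer_list>
#include <string>
#include <vector>

// Iterative reaching-definitions solver over bit vectors (bit i = definition d(i+1)).
// The CFG and gen/kill sets below are illustrative assumptions, not lecture data.
int main() {
    using BV = std::bitset<7>;
    const int NB = 5;                                   // B1=0, B2=1, B3=2, B4=3, EXIT=4
    std::vector<std::vector<int>> pred = {{}, {0, 3}, {1}, {1, 2}, {3}};

    auto defs = [](std::initializer_list<int> ds) { BV b; for (int d : ds) b.set(d - 1); return b; };
    std::vector<BV> gen(NB), kill(NB);
    gen[0] = defs({1, 2, 3}); kill[0] = defs({4, 5, 6, 7});
    gen[1] = defs({4, 5});    kill[1] = defs({1, 2, 7});
    gen[2] = defs({6});       kill[2] = defs({3});
    gen[3] = defs({7});       kill[3] = defs({1, 4});

    std::vector<BV> inSet(NB), outSet(NB);
    bool changed = true;
    while (changed) {                                   // iterate to a fixed point
        changed = false;
        for (int b = 0; b < NB; ++b) {
            inSet[b].reset();
            for (int p : pred[b]) inSet[b] |= outSet[p];     // IN[B] = union of OUT[pred]
            BV out = gen[b] | (inSet[b] & ~kill[b]);         // OUT[B] = gen_B U (IN[B] - kill_B)
            if (out != outSet[b]) { outSet[b] = out; changed = true; }
        }
    }
    for (int b = 0; b < NB; ++b)
        std::printf("block %d: IN=%s OUT=%s\n", b + 1,
                    inSet[b].to_string().c_str(), outSet[b].to_string().c_str());
    return 0;
}
```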

Live-variable Analysis

Live-Variable (Liveness) Analysis

  • Liveness analysis helps determine which variables are live (in use) at various program points.

  • Usage: register allocation; registers are allocated only to live variables.

Data-Flow Equations

  • def_B: the set of variables defined in block B before any use

  • use_B: the set of variables whose values may be used in block B before any definition

  • IN[EXIT] = ∅ : boundary condition

  • IN[B] = use_B ∪ (OUT[B] - def_B)

  • OUT[B] = ∪_(S a successor of B) IN[S]

  • Analysis is done in the backward (opposite to the control flow)

Algorithm

1
2
3
4
5
6
7
IN[EXIT] = ∅
for (each basic block B other than EXIT) IN[B] = ∅
while (changes to any IN occur)
for (each basic block B other than EXIT) {
OUT[B] = ∪_(S a successor of B) IN[S]
IN[B] = use_B ∪ (OUT[B] - def_B)
}

Iterative process

image-20251210203157039
| BB         | First Pass       | Second Pass      |
|------------|------------------|------------------|
| OUT[ENTRY] | m, n, u1, u2, u3 | m, n, u1, u2, u3 |
| IN[B1]     | m, n, u1, u2, u3 | m, n, u1, u2, u3 |
| OUT[B1]    | i, j, u2, u3     | i, j, u2, u3     |
| IN[B2]     | i, j, u2, u3     | i, j, u2, u3     |
| OUT[B2]    | u2, u3           | j, u2, u3        |
| IN[B3]     | u2, u3           | j, u2, u3        |
| OUT[B3]    | u3               | j, u2, u3        |
| IN[B4]     | u3               | j, u2, u3        |
| OUT[B4]    | ∅                | i, j, u2, u3     |
| IN[EXIT]   | ∅                | ∅                |

Register Allocations and Live-Variable Analysis

image-20251210203745003
  • a is dead after BB #1

  • Register for a can be reused after BB #1

  • b is still live in BB #2; if b were dead at BB #2, the register for b could also be reused

Register Allocations

  • Only live variables need to have registers.

  • What if there aren’t enough registers available?

  • Register spill/fill operations to a stack

  • Values that won’t be used for a while are moved to the stack.

  • PTX assumes an infinite number of virtual registers, so spill/fill stack operations are not shown explicitly.

SSA

  • SSA is an enhancement of the def-use chain.

  • Key feature: variables can be defined only once in SSA form.

  • Common usage: Intermediate Representations (IR) in compilers are typically in SSA form.

image-20251210204218924

SSA and Control Flow Graphs

  • What if variable b is defined on both paths?
image-20251210204255961

Φ (Phi) Function

  • A Φ function merges values arriving from different paths.

  • A Φ function can be implemented using moves or other mechanisms at the ISA level.

  • Each definition gets a new version of the variable.

  • Each use refers to the latest version.

  • A Φ function is added at each join point (a node with more than one predecessor) for every variable.

SSA Conversion Example

image-20251210204355989

When to Insert a Φ Function?

  • Adding a Φ function at every join point (more than one predecessor) for every variable can generate too many Φ functions.

  • Φ functions only need to be inserted where multiple reaching values actually exist.

image-20251210204457197

Iterative path-convergence criterion needs to be considered.

Path-Convergence Criterion

  • A Φ function needs to be inserted at node z when all of the following are true:

1. There is a block x containing a definition of a;

2. There is a block y (with y ≠ x) containing a definition of a;

3. There is a nonempty path P_xz of edges from x to z;

4. There is a nonempty path P_yz of edges from y to z;

5. Paths P_xz and P_yz do not have any node in common other than z;

6. The node z does not appear within both P_xz and P_yz prior to the end, though it may appear in one or the other.

Initialization: the start node is assumed to contain a definition of every variable.

Examples of Compiler Optimizations

Loop Unrolling

  • Unroll a loop a small number of times to enable other code optimizations

Benefits:

  • Better instruction scheduling

  • More opportunities for vectorization

  • Reducing the number of branches

1
2
3
4
5
6
7
8
9
10
11
12
13
for (i = 0; i < N; i++){
S(i);
}
//to
for (i = 0; i+4 < N; i+=4){
S(i);
S(i+1);
S(i+2);
S(i+3);
}
for ( ; i < N; i++) {
S(i);
}

Function Inlining

  • Interprocedural analysis

  • Benefits: reduces the overhead of call/stack operations

  • Downside: could increase code size

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
main()
{
t1 = func1(a);
t2 = func1(b);

}
int func1(int arg1){
return func2(arg1);
}
int func2(int arg2){
return arg2*arg2;
}
// to
main()
{
t1 = a*a;
t2 = b*b;

}

Dead Code Elimination

  • Removing code that has no effect on the program
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
.entry main() {
.reg .f32 a, b, c;

// Load values into registers
ld.param.f32 a, [param_a];
ld.param.f32 b, [param_b];

// Multiply a and b, but the result is not used
mul.f32 c, a, b;

add.f32 c, a, 1.0;
div.f32 c, b, 2.0;


// Return
ret;
}
//to
.entry main() {
.reg .f32 a, b, c;

// Load values into registers
ld.param.f32 a, [param_a];
ld.param.f32 b, [param_b];

add.f32 c, a, 1.0;
div.f32 c, b, 2.0;

// Return
ret;
}

Constant Propagation

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
   .entry main() {
.reg .f32 a, b, c;

// Load constant values into registers
mov.f32 a, 5.0;
mov.f32 b, 3.0;

// Perform some calculations
mul.f32 c, a, b;

// Other instructions that use 'c'
add.f32 c, c, 1.0;

// Store the result in memory
st.global.f32 [result], c;

// Return
ret;
}
//to
.entry main() {
.reg .f32 c;

// Perform calculations with constants
mul.f32 c, 5.0, 3.0;

// Other instructions that use 'c'
add.f32 c, c, 1.0;

// Store the result in memory
st.global.f32 [result], c;

// Return
ret;
}
  • Constant propagation often also triggers other optimizations

Strength Reduction

  • Replace expensive operations with cheaper operations

  • E.g.) divide by 2 → right shift

1
2
3
div.f32 b, a, 2.0; 
//to
mul.f32 b, a, 0.5;
  • Compute square operations → multiplication
1
2
3
ex2.approx.f32 b, a;
//to
mul.f32 b, a, a;

Code Motion: Loop-Invariant Instructions

  • Move code that is not dependent on loop iterations
1
2
3
4
5
6
7
8
9
for (i = 0; i < 10; i++) {
cons = 3.14;
sum += cons * i;
}
//to
cons = 3.14;
for (i = 0; i < 10; i++) {
sum += cons * i;
}
  • i variable is the loop induction variable

  • PTX: add.u32 %r1, %r1, 1

Divergent Analysis

Review: Divergent Branches

Within a warp, not all threads need to be active.

E.g.) If-else statements

Not all threads will do the same work

Major differences between SIMD vs. SIMT

The compiler might need to insert a reconvergence point at the immediate post-dominator (IPDOM).

Divergence Analysis and Vectorization

  • Vectorization: converting loops into vector (SIMD) code

  • Vectorization requires checking whether all lanes (loop iterations) will execute in lock-step.

  • Warp divergence analysis shares similar characteristics: the need to check whether all threads within a warp will execute the same path

  • Complications: unstructured CFGs (complex control flow graphs)

Divergent Branches: Thread Dependency

  • An expression is thread dependent at a program point p if it may evaluate to different values by different threads at p

  • Example source code

1
2
3
4
5
6
a = threadIdx;
if (a > 4) {
b = 1;
} else {
b = 0;
}
1
2
3
4
5
6
7
a = threadIdx;
c = a%4;
if (c) {
b = 1;
} else {
b = 0;
}
  • The if-condition depends on the thread ID

  • This is the main source of divergent branches

How to Check Thread Dependency

  • Build the def-use chain of thread IDs

  • Identify all dependences on thread IDs

  • Iterative search process similar to data-flow analysis

  • Reachability of the thread ID needs to be checked.

  • Other examples of divergence?

  • What if the branch condition is coming from memory?

1
2
a=mem[loc];
if(a)...
  • Challenges: complex control-flow graphs; an overly conservative analysis may mark all branches and all loops as divergent.

Compiler-Aided Unrolling

1
2
3
4
#pragma unroll 4
for (int i = 0; i < N; ++i) {
// loop body
}
  • Modern compilers (e.g. NVCC, Clang/LLVM for CUDA/HIP) can auto-unroll loops using heuristics or explicit directives.

  • They estimate register pressure, code size, and warp occupancy, attempting to choose an unroll factor suitable for each kernel and device.

Register Usage and Loop Unrolling

1
2
3
4
5
6
7
8
__global__ void vectorAdd(float *A, float *B, float *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < N) {
float tmp = A[idx] + B[idx]; // Uses 1 register
C[idx] = tmp;
}
}

Registers used per thread: 1-2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
__global__ void vectorAddUnrolled(float *A, float *B, float *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;

int base = idx * 4;
if (base + 3 < N) {
float tmp0 = A[base] + B[base];
float tmp1 = A[base + 1] + B[base + 1];
float tmp2 = A[base + 2] + B[base + 2];
float tmp3 = A[base + 3] + B[base + 3];

// Each tmp variable consumes a register.
C[base] = tmp0;
C[base + 1] = tmp1;
C[base + 2] = tmp2;
C[base + 3] = tmp3;
}
}

Registers used per thread: 4-6

Unrolling factor: 4

High register usage reduces occupancy.

Compiler Algorithms for Loop Unrolling

  • Detection: Identify loops with simple bounds and no non-affine jumps or side effects.

  • Profitability Analysis: Estimate expected reduction in control overhead versus potential increase in register pressure/code size using heuristics and cost models.

  • Unroll Factor Selection: Choose unroll factor via static heuristics, user-provided pragmas, or autotuning/profiling.

  • Transformation:

    • Expand loop body according to unroll factor.
    • Eliminate induction variable updates and conditional branches (when possible).
    • Adjust loop remainder (handle leftover iterations).

Loop Unrolling and Register Optimization

  • Unrolling exposes opportunities to hold temporary variables/arrays in registers by removing induction variables and index computations.

  • Tradeoff: Large unroll factors increase register usage, reducing occupancy (number of concurrent thread blocks).

  • Some compilers offer flags to restrict maximum register usage per thread (e.g., --maxrregcount in NVCC).

Heuristic Pseudocode

1
2
3
4
5
6
7
for each loop L in kernel:
trip_count = estimate_trip_count(L)
max_unroll = hardware_limit()
for unroll_factor in candidate_factors:
profit = profile_simulate(L, unroll_factor)
if profit > threshold and fits_in_register_file:
apply_unroll(L, unroll_factor)

Branch-Divergence Elimination: Code Example

1
2
3
4
5
6
7
8
9
10
if (condition) {
x = a;
} else {
x = b;
}
//to
x = condition ? a : b;

// Or, eliminating branch:
x = a * cond + b * (1 - cond); // cond is 0/1
  • Compilers can convert eligible branches to predicated instructions

Compiler Algorithms for Branch Divergence Reduction

  • Control-Flow Analysis: flag likely divergent branches.

  • Predication: replace simple branches with masked ops.

  • Branch Flattening: merge/move branches;

1
2
3
4
5
6
7
8
9
for (int i = 0; i < N; ++i) {
if (condition(i)) {
// rare, complex path
special_case();
} else {
// common, fast path
normal_case();
}
}
1
2
3
4
5
6
7
8
9
for (int i = 0; i < N; ++i) {
normal_case(); // common, fast path executed for every iteration
}
for (int i = 0; i < N; ++i) {
if (condition(i)) { // rare, complex path
undo_normal_case(i); // undo the normal case's work for this iteration
special_case(i);
}
}

Kernel/Loop Fission

1
2
3
4
5
6
7
8
9
for i = 0 to N-1
A[i] = B[i] * C1
C[i] = B[i] + C2
// up to down: loop fission
// down to up: loop fusion
for i = 0 to N-1
A[i] = B[i] * C1
for i = 0 to N-1
C[i] = B[i] + C2

GPU and ML

Why are GPUs Good for ML?

  • High number of floating-point operations

  • Highly data-parallel operations

  • GPUs provide dense floating-point compute

  • High memory bandwidth (e.g. GDDR, HBM)

  • Flexible data format standards (e.g. TF32, BF16, INT4, etc.)

  • Statistical-computing-based workloads

DNN Operation Categories

  • Elementwise operations

    • E.g.) Activation operations

  • Reduction operations

    • E.g.) Pooling operations

  • Dot-product operations

    • E.g.) Convolution operations, GEMM

Background of GEMM

  • General Matrix Multiplications (GEMM)

    C = α AB + βC

    A, B and C are M × K, K × N, and M × N matrices

  • α = 1 and β = 0 becomes C=AB

  • Popular in fully-connected layers, convolution layers

  • The product of A and B requires M × N × K fused multiply-adds (FMAs) → 2 × M × N × K FLOPs

    E.g.) FP16 inputs with an FP32 accumulator:

    Arithmetic Intensity = (number of FLOPs) / (number of bytes accessed) = 2MNK / (2(MK + KN + MN)) for 2-byte (FP16) elements

  • Compare the arithmetic intensity against the machine's FLOP/s-to-bandwidth ratio (FLOP per byte); a small calculator is sketched below
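A small sketch computing this GEMM arithmetic intensity. The element sizes are parameters so that FP16 and FP8 inputs can be compared; the slide's formula corresponds to 2-byte elements throughout (in_bytes = out_bytes = 2), and the matrix sizes below are only illustrative:

```cpp
#include <cstdio>

// Arithmetic intensity of C = A*B with A (MxK), B (KxN), C (MxN):
// FLOPs = 2*M*N*K, bytes = in_bytes*(M*K + K*N) + out_bytes*(M*N),
// assuming each matrix is read/written once (roofline-style theoretical minimum).
double gemm_ai(long long M, long long N, long long K, int in_bytes, int out_bytes) {
    double flops = 2.0 * M * N * K;
    double bytes = double(in_bytes) * (M * K + K * N) + double(out_bytes) * (M * N);
    return flops / bytes;
}

int main() {
    long long M = 4096, N = 4096, K = 4096;
    std::printf("FP16 in / FP32 out: %.1f FLOP/B\n", gemm_ai(M, N, K, 2, 4));
    std::printf("FP8  in / FP32 out: %.1f FLOP/B\n", gemm_ai(M, N, K, 1, 4));
    return 0;
}
```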

Difference Between SIMD vs. Tensor

  • 4x4 matrix computation

  • 4 threads and each thread repeats a loop

image-20251210214359611
1
2
3
4
5
6
if (row < N && col < N) {
float sum = 0.0f;
for (int i = 0; i < N; i++)
sum += A[row * N + i] * B[i * N + col];
C[row * N + col] = sum;
}

into

1
HMMA.1688.F16 

The sum needs to be stored in a register. Since registers are private to a thread on NVIDIA GPUs, a single thread has to perform the accumulation.

If 16 threads are performing the work in parallel, the sum variable needs a reduction mechanism (a warp-shuffle reduction sketch follows below).
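A hedged CUDA sketch of such a reduction using warp shuffles, so the threads' partial sums are combined without shared memory; the kernel and its thread-to-work mapping are illustrative, not the lecture's exact code:

```cuda
// Sketch: each thread holds a partial sum; combine them within a warp using
// shuffle instructions (no shared memory needed). Assumes full 32-thread warps.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val; // lane 0 ends up holding the warp-wide sum
}

__global__ void rowDotReduce(const float *A, const float *B, float *out, int N) {
    // Each warp computes one dot product of length N (illustrative mapping).
    int lane = threadIdx.x % 32;
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    float partial = 0.0f;
    for (int i = lane; i < N; i += 32)
        partial += A[warp * N + i] * B[warp * N + i];
    float sum = warpReduceSum(partial);
    if (lane == 0) out[warp] = sum;
}
```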

SIMD/SIMT Operations of Matrix Multiplications

image-20251210214513499 - 16 SIMD/SIMT operations are needed for 4x4 matrix multiplications.

Tensor Cores

  • Tensor Cores perform matrix multiply and accumulate (MMA) calculations.

  • Hundreds of Tensor Cores operating in parallel in one NVIDIA GPU enable massive increases in throughput and efficiency.

  • Support sparsity

  • E.g.) Each A100 Tensor Core can execute 256 FP16 FMA operations per cycle.

  • INT8, INT4, and binary (1-bit) precisions were added

Transformer Engine

  • Introduced with the Hopper (GH100) architecture

  • Accelerates transformer layers

  • The Transformer Engine dynamically scales tensor data into a representable range

  • FP8 operations

  • Brings chunks of data efficiently to fully utilize the tensor units

  • TMA (Tensor Memory Accelerator) address generation using a copy descriptor

Floating Point Formats

FP 16 vs. FP 32

  • FP16 uses half the bits of FP32

  • The IEEE standards define 32-bit/64-bit (single/double precision) formats

Recap: the product of A and B → M × N × K fused multiply-adds (FMAs) → 2 × M × N × K FLOPs

E.g.) FP16 inputs with an FP32 accumulator vs. FP8 inputs with an FP32 accumulator: changing the input FP format can change the arithmetic intensity significantly.

Benefits of Quantization

  • Reduce the storage size

  • Increase the arithmetic intensity

  • Reduced-precision floating-point operations also increase throughput

  • E.g.) Ideal performance: FP8 throughput is 2× that of FP16 operations

Different Floating Point Formats

image-20251210215232979

Some More Details

  • Exponent values can be both positive and negative

  • Floating-point formats need to represent values from 2^(-exp1) to 2^(exp2)

  • Negative exponent values represent magnitudes less than 1; positive exponent values represent magnitudes greater than 1.

  • The sign bit indicates whether the actual floating-point value is negative or positive.

  • How can exponent values be represented without a separate exponent sign bit?

  • Solution: offset (bias) the stored exponent

  • Subnormal numbers (denormalized numbers): when all exponent bits are 0 (a small decoding sketch follows below)
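A small sketch of decoding a format with a biased (offset) exponent. The 1/5/10 bit split and bias of 15 match FP16, but the decoder is simplified for illustration (it ignores Inf/NaN handling):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode an FP16-like format (1 sign, 5 exponent, 10 mantissa bits, bias 15).
// Handles normal and subnormal numbers; Inf/NaN are left out of this sketch.
double decode_fp16(uint16_t bits) {
    int sign = (bits >> 15) & 0x1;
    int exp  = (bits >> 10) & 0x1F;        // stored (biased) exponent
    int man  = bits & 0x3FF;               // 10 fraction bits
    const int bias = 15;
    double value;
    if (exp == 0)                          // subnormal: all exponent bits 0, no implicit 1
        value = std::ldexp(man / 1024.0, 1 - bias);
    else                                   // normal: implicit leading 1
        value = std::ldexp(1.0 + man / 1024.0, exp - bias);
    return sign ? -value : value;
}

int main() {
    std::printf("%g\n", decode_fp16(0x3C00)); // 1.0
    std::printf("%g\n", decode_fp16(0xC000)); // -2.0
    std::printf("%g\n", decode_fp16(0x3555)); // ~0.3333
    return 0;
}
```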

Floating Point Number Representations

image-20251210215325227

Quantization

  • Reduce the number of required operations

  • Commonly used to reduce the precision of the inputs while keeping the accumulator in high precision

  • Quantization often ends up being non-uniform

  • Sometimes, value transformations are used to overcome non-uniform quantization.

  • E.g.) Shifted and squeezed 8-bit format (S2FP8); a simple symmetric INT8 sketch follows below
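A minimal sketch of symmetric INT8 quantization with a high-precision accumulator, in the spirit of the bullets above. The per-tensor scaling scheme is an assumption chosen for illustration; S2FP8 itself is not implemented here:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Symmetric per-tensor INT8 quantization: q = round(x / scale), x ~= q * scale.
// The dot product accumulates in int32 and is rescaled back to float at the end.
struct Quantized { std::vector<int8_t> q; float scale; };

Quantized quantize(const std::vector<float> &x) {
    float amax = 0.f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    float scale = amax / 127.f;                        // map the largest magnitude to 127
    Quantized out{std::vector<int8_t>(x.size()), scale};
    for (size_t i = 0; i < x.size(); ++i)
        out.q[i] = static_cast<int8_t>(std::lround(x[i] / scale));
    return out;
}

float quantized_dot(const Quantized &a, const Quantized &b) {
    int32_t acc = 0;                                   // high-precision accumulator
    for (size_t i = 0; i < a.q.size(); ++i)
        acc += int32_t(a.q[i]) * int32_t(b.q[i]);
    return acc * a.scale * b.scale;                    // rescale back to float
}

int main() {
    std::vector<float> x = {0.1f, -0.5f, 0.9f, 0.3f}, y = {0.4f, 0.2f, -0.7f, 1.0f};
    float exact = 0; for (size_t i = 0; i < x.size(); ++i) exact += x[i] * y[i];
    std::printf("exact=%f quantized=%f\n", exact, quantized_dot(quantize(x), quantize(y)));
    return 0;
}
```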

Tensor Cores

Designing ML Accelerator Units on GPUs

  • Consideration factors
  1. What functionality?

  2. Benefits over the existing GPUs

  • Compute unit design and the scale

  • Data storage and movement

Common Steps of Designing Accelerators

  • Step 1. Identify frequently executed operations

  • Step 2. Performance benefit estimation

    • Software approach vs. hardware approach; the software approach uses the existing ISA.

  • Step 3. Design the interface and programmability

    • What ISA to add? What storage? Separate registers or shared registers?

  • Step 4. Consider combining multiple features

    • Any other operations to combine?

Matrix Multiplication Accumulator

  • Most commonly used operation in ML workloads

  • SIMD operations still require row-by-row operations

  • Large matrix multiplication units can be implemented with systolic arrays

  • Design decisions: many small matrix multiplication units vs. a large matrix multiplication unit?

  • Area and programmability choices

  • NVIDIA started with 4x4x4 matrix multiplication units

  • Intel’s AMX (Advanced Matrix Extensions) supports 16x16 matrices

Possible Matrix Multiplication Unit Design

image-20251210215713473
  • Option 1: Parallel units (4 FMA units × 16 elements)

  • Option 2: Pipelining with fewer FMA units

image-20251210215725202

Data Storage and Movement

  • Input matrices have to come from memory

  • Design decisions: dedicated registers or shared memory

  • 4x4x4 matrix operations require register space for at least 3 × 16 elements (the A, B, and C tiles).

  • On NVIDIA GPUs, registers are private to threads

  • Memory space can be used for storing tensor data.

  • Eventually, data needs to come from memory.

  • On NVIDIA GPUs with Tensor Cores, a new asynchronous copy instruction was introduced[1]: it loads data directly from global memory into shared memory, optionally bypassing the L1 cache and eliminating the need for intermediate register file (RF) usage

  • New asynchronous barrier instructions are also introduced.

Supporting Sparse Operations

  • Sparse Matrix Multiplication: some elements are zero values

  • Widely used in high-performance computing programs

  • Software approaches: use compressed data format

  • Instead of storing all values, only store non-zero values

  • E.g.) (row index, column index, value): (1,3,1), (2,1,2), (3,3,3), (4,2,5); a small SpMV sketch using this format follows below

image-20251210215834914
  • With pruning operations, sparsity increases
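A minimal sketch of a sparse matrix-vector multiply using the (row, column, value) format shown above; the example triples are converted to 0-based indices:

```cpp
#include <cstdio>
#include <vector>

// Sparse matrix-vector multiply (y = A*x) with A stored in COO format:
// only non-zero entries are kept as (row, col, value) triples.
struct Coo { int row, col; float val; };

void spmv_coo(const std::vector<Coo> &A, const float *x, float *y, int n_rows) {
    for (int r = 0; r < n_rows; ++r) y[r] = 0.f;
    for (const Coo &e : A)
        y[e.row] += e.val * x[e.col];
}

int main() {
    // The example triples from the text, converted to 0-based indices.
    std::vector<Coo> A = {{0, 2, 1.f}, {1, 0, 2.f}, {2, 2, 3.f}, {3, 1, 5.f}};
    float x[4] = {1.f, 1.f, 1.f, 1.f}, y[4];
    spmv_coo(A, x, y, 4);
    for (float v : y) std::printf("%g ", v);
    std::printf("\n");
    return 0;
}
```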

Supporting Sparsity in Hardware

  • Structured sparsity: assume only a fixed % of elements are non-zeros

  • E.g.) Assumption of 50% sparsity (use 1 bit to indicate left or right position)

image-20251210215912903
  • With a structured sparsity, storing index information is simplified.

  • Accelerating SpMV (Sparse Matrix-Vector Multiplication) is a big research topic.

  • Reduces storage space and increases throughput (a small structured-sparsity sketch follows below)
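A small sketch of the 50% structured-sparsity idea above: keep one value per pair plus a 1-bit position, and use that bit to select the matching dense element during a dot product. The data layout here is an illustrative assumption, not NVIDIA's exact 2:4 sparse format:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// 50% structured sparsity as described above: in every pair of elements exactly
// one is non-zero; we store that value plus 1 metadata bit (0 = left, 1 = right).
struct Structured50 {
    std::vector<float>   vals;   // one kept value per pair
    std::vector<uint8_t> pos;    // 1-bit position per pair (stored in a byte here)
};

Structured50 compress(const std::vector<float> &dense) {
    Structured50 s;
    for (size_t i = 0; i + 1 < dense.size(); i += 2) {
        uint8_t p = (dense[i] != 0.f) ? 0 : 1;   // assumes exactly one non-zero per pair
        s.vals.push_back(dense[i + p]);
        s.pos.push_back(p);
    }
    return s;
}

// Dot product of the compressed operand with a dense vector: only the kept
// element of each pair contributes, selected by the metadata bit.
float dot(const Structured50 &s, const std::vector<float> &dense) {
    float acc = 0.f;
    for (size_t p = 0; p < s.vals.size(); ++p)
        acc += s.vals[p] * dense[2 * p + s.pos[p]];
    return acc;
}

int main() {
    std::vector<float> a = {0.f, 2.f, 3.f, 0.f, 0.f, 1.f, 4.f, 0.f};
    std::vector<float> b(8, 1.f);
    std::printf("dot = %g\n", dot(compress(a), b));   // 2 + 3 + 1 + 4 = 10
    return 0;
}
```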

Exam Practices

In the queue-based simulation, if we want to increase the execution width, what change do we need to make? Please refer to the diagram in the lecture for the module names. Choose the most relevant one.

A. Increase the width of DE queue

B. Increase the width of FE queue

C. Increase the depth of DE queue

D. Increase the number of ops to select in the scheduler

| Option | Why it's NOT the right one |
|--------|----------------------------|
| A. Increase the width of the DE queue | The DE queue stores decoded ops; a wider queue doesn't change how many execute per cycle. |
| B. Increase the width of the FE queue | Same reason; the FE queue affects fetch bandwidth, not execution width. |
| C. Increase the depth of the DE queue | Depth = how many ops it can hold, not how many can issue per cycle. |

Answer: D. Increase the number of ops to select in the scheduler.

Which statement most accurately describes sectored cache and small block size of cache?

Sectored cache is good for reducing cache size.

Sectored cache and small cache block size could reduce the memory bandwidth requirements.

Sectored cache is good for improving spatial locality.

Both sectored cache and small cache block size have the same cache tag storage requirements.

Sectored cache is good for reducing cache size.

  • Wrong. Sectored cache mainly reduces tag storage, not total cache size.
  • The data array size stays the same.

Sectored cache is good for improving spatial locality.

  • No. It actually helps when spatial locality is poor, since you don't want to fetch the whole block.
  • Spatial locality improvement comes from large blocks, not sectored design.

Both sectored cache and small cache block size have the same cache tag storage requirements.

  • Also false.
  • Sectored cache: 1 tag per large block.
  • Small block size cache: 1 tag per small block → many more tags → higher tag storage.

Answer: Sectored cache and small cache block size could reduce the memory bandwidth requirements.
1
2
3
4
5
6
7
#define N 10000

__global__ void vectorAdd(float *a, float *b, float *c) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N/10)
c[idx*10] = a[idx*10] + b[idx*10];
}

Assuming 100 CUDA blocks, each consisting of 100 threads, with a warp width of 16, and a page size of 4 KB, what optimizations would be most helpful in reducing address translation overhead in this code? The baseline machine has a 16-entry L1 TLB per core and a 48-entry L2 TLB shared among all the SMs; there are 4 SMs. One SM can execute one CUDA block at a time. L1 TLB access time is 1 cycle and L2 TLB access time is 10 cycles. A float uses 4 B.

Double L1 TLB size

Double L2 TLB size

Doubling L1 TLB or doubling L2 TLB will have similar performance improvements

Doubling L1 TLB or Doubling L2 TLB will not show any benefits.

A: The L1 TLB is already large enough. Each SM runs one block (100 threads) at a time; with stride 10 and 4 B floats, a block touches only about 100 × 40 B = 4000 B per array, i.e., one to two 4 KB pages per array and at most roughly 6 pages across a, b, and c. That fits comfortably in the 16-entry L1 TLB, so doubling either TLB will not show any benefit.

Which analysis is good for dead code elimination?

Liveness analysis

Strength Reduction

Constant Propagation

A: Liveness analysis (a definition whose result is never live afterwards can be removed).

We want to draw the roofline graph. This is measured data from an A100 and an H100.

A100

| k   | N         | GFLOP/s (A100) | Arithmetic Intensity (FLOP/B) |
|-----|-----------|----------------|-------------------------------|
| 0   | 134217728 | 0              | 0                             |
| 1   | 134217728 | 228.428        | 0.166667                      |
| 2   | 134217728 | 456.617        | 0.333333                      |
| 4   | 134217728 | 911.964        | 0.666667                      |
| 8   | 134217728 | 1810.39        | 1.33333                       |
| 16  | 134217728 | 3256.45        | 2.66667                       |
| 32  | 134217728 | 4579.43        | 5.33333                       |
| 64  | 134217728 | 5504.7         | 10.6667                       |
| 128 | 134217728 | 6055.01        | 21.3333                       |

H100

| k   | N         | GFLOP/s (H100) | Arithmetic Intensity (FLOP/B) |
|-----|-----------|----------------|-------------------------------|
| 0   | 134217728 | 0              | 0                             |
| 1   | 134217728 | 466.77         | 0.166667                      |
| 2   | 134217728 | 932.316        | 0.333333                      |
| 4   | 134217728 | 1860.39        | 0.666667                      |
| 8   | 134217728 | 3706.32        | 1.33333                       |
| 16  | 134217728 | 7318.47        | 2.66667                       |
| 32  | 134217728 | 13136.8        | 5.33333                       |
| 64  | 134217728 | 16910.9        | 10.6667                       |
| 128 | 134217728 | 19802.6        | 21.3333                       |

(step 1) Draw roofline chart

Question: For the same arithmetic intensity (0.333333), the H100 shows better GFLOP/s. Why?

H100 has more FP units

H100 has higher memory bandwidth

H100 is operating with a faster clock frequency

Question 2

From the above measured data, find the memory bandwidth for the A100 and the H100. Choose the closest number.

A100: 500 MB/s

A100: 1 TB/s

A100: 2 TB/s

H100: 500 MB/s

H100: 1 TB/s

H100: 2 TB/s

H100: 3 TB/s

Question 3: From the above measured data, find the peak compute throughput for the A100. Choose the closest number. Assume the measured data reaches the compute-bound zone.

A100: 5000 GFLOP/s

A100: 1 TFLOP/s

A100: 10 TFLOP/s

A100: 100 TFLOP/s

A:

Question 1: H100 has higher memory bandwidth (at AI = 0.33 both GPUs are still memory bound).

Question 2: Use the lowest non-zero data point, where the kernel is clearly memory bound:

A100: BW ≈ 228.428 / 0.166667 ≈ 1370.6 GB/s → closest choice: 1 TB/s

H100: BW ≈ 466.77 / 0.166667 ≈ 2800.6 GB/s → closest choice: 3 TB/s

Question 3: The A100 curve flattens around 6055 GFLOP/s → closest choice: 5000 GFLOP/s.

We want to represent the following decimal numbers in binary floating-point format:

(1, 1.2, 0.75, 1.1, 0.2, 0.1)

Which of the following is the best choice for mantissa and exponent format to represent these numbers with reasonable precision?

Mantissa: 3 bits, Exponent: 2 bits

Mantissa: 5 bits, Exponent: 4 bits

Mantissa: 8 bits, Exponent: 8 bits

Mantissa: 2 bits, Exponent: 3 bits

Question 2: You are performing a 4×4 × 4×4 matrix multiplication using vector operations on a machine with vector registers of width 4. The machine can perform 16 vector operations simultaneously (i.e., SIMD parallelism across 16 vector instructions).

What is the minimum number of vector registers required to hold all intermediate data during the computation?

8

12

16

24

A: 4 (A) + 4 (B) + 4 (C) + 4 (temporaries) = 16

Question 3: You are performing a 4×4 × 4×4 matrix multiplication using tensor operations on a machine where:

Each vector register holds 4 floating-point values, and the tensor core can compute the entire 4×4 × 4×4 multiplication in one fused operation. What is the minimum number of vector registers required to hold all input and output data for the computation? Partial sums that are stored in the tensor units are not counted.

12

16

20

24

A: 4 (A) + 4 (B) + 4 (C) = 12

[GPU-design-part-a]

You are designing a GPU system that should fully utilize the available memory bandwidth.

Here are the system details:

  • The memory system is HBM with a peak bandwidth of 3 TB/s.
  • The data format used is BF16 (bfloat16), which is 2 bytes per value.
  • Each operation is a Fused Multiply-Add (FMA) that uses 2 input values and produces 1 output, meaning 3 memory accesses per operation.
  • Each Streaming Multiprocessor (SM) runs at 1 GHz and can execute 2 warps per cycle.
  • Each warp has 32 threads, and each thread performs 1 FMA per cycle.
  • Assume there is no cache, and all memory accesses go directly to HBM.

What is the minimum number of SMs needed to fully saturate the 3 TB/s memory bandwidth?

A:

3 TB/s → the FMA rate the memory system can sustain (each FMA moves 3 accesses × 2 B = 6 B)

1 SM → 2 warps/cycle × 32 threads × 1 FMA per thread per cycle at 1 GHz

Option: A. 5 B. 25

Choose A since we need to fully utilize resources.

[GPU-design-part-b]

Now let's continue from the previous setup.

Assume all system parameters are the same as in [GPU-design-part-a], except the following change:

The GPU system now includes a cache with a 90% hit rate, meaning that only 10% of the memory accesses go to the HBM.

With this 90% cache hit rate, what is the minimum number of SMs needed to fully saturate the 3 TB/s HBM bandwidth?

A: