Table of Contents
- Executive Summary
- Sidebar: UnsafePointer vs LayoutTensor – Same Logic, Different Abstractions
- Sidebar: Evolving GPU Lingo
- What's Next
- Related Posts
Executive Summary
Mastering GPU grid scaling and shared memory in Mojo turns slow, CPU-bound data pipelines into fast, parallel workflows that handle large-scale datasets without expensive infrastructure upgrades or specialist developers.
This post builds on Part 1 of the GPU Threads Unraveled series. If you've just learnt how to launch your first GPU threads, Puzzles 3–8 take you further: scaling across multiple thread blocks, protecting against out-of-bounds errors, and using shared memory for efficient thread collaboration.
On my M1 MacBook, I worked through the tricky bits: silent data corruption from missing bounds checks (Puzzle 3), memory conflicts that halved performance (Puzzle 7)—and the official YouTube walkthrough series helped smooth the rough edges. Here's what I learnt.
Puzzle 3: "Over-Subscription" – More Threads Than Data
Now we launch more threads than we have data elements. With SIZE=4 but THREADS_PER_BLOCK=(8,1), we have 8 threads but only 4 elements to process. The key learning: guard every array access with bounds checks.
Here's the actual kernel:
```mojo
from gpu import thread_idx
from memory import UnsafePointer

# `dtype` is an alias defined in the puzzle file (e.g. DType.float32)
fn add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = thread_idx.x
    if i < size:  # <- ✅ Solution code
        output[i] = a[i] + 10.0  # <- ✅ Solution code
```
My stumble: Initially omitting the if i < size check. The extra threads (4–7) tried to access memory they shouldn't, causing silent corruption and test failures.
Lesson: Over-subscribe threads for full GPU occupancy, but guard every memory access ruthlessly. The puzzle's diagram visualises this perfectly—threads beyond the data size must idle safely.
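For context, here's how a kernel like this gets launched from the host. This is a minimal sketch following the pattern of the puzzles' test harness; the buffer names and zero-fill values are illustrative, not the harness's actual code:

```mojo
from gpu.host import DeviceContext

alias dtype = DType.float32
alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (8, 1)  # 8 threads for 4 elements: over-subscribed

def main():
    with DeviceContext() as ctx:
        # Illustrative buffer setup (zero-filled placeholders)
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        ctx.enqueue_function[add_10_guard](
            out_buf.unsafe_ptr(),
            a_buf.unsafe_ptr(),
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()
```

The guard inside the kernel is what makes this safe: the launch configuration is free to round the thread count up to whatever is convenient for the hardware.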
Puzzle 4: "2D Indexing" – Working with Grids
Moving to 2D: we launch a grid of threads (3, 3) to process a 2×2 array. Now we use both thread_idx.x (columns) and thread_idx.y (rows), with bounds checking on both dimensions.
```mojo
fn add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[row * size + col] + 10.0  # <- ✅ Solution code
```
Key insight: We flatten 2D coordinates into 1D memory using row-major indexing: row * size + col. Both dimensions need bounds checks since we over-subscribe (9 threads for 4 elements). This pattern underpins all 2D GPU work—images, matrices, tiles.
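To make the flattening concrete, here's a tiny host-side sketch (my own illustration, not puzzle code) that enumerates the row-major mapping for a 2×2 array:

```mojo
fn flat_index(row: Int, col: Int, size: Int) -> Int:
    # Row-major: elements of the same row are adjacent in memory
    return row * size + col

def main():
    # Illustration for size=2: (0,0)->0, (0,1)->1, (1,0)->2, (1,1)->3
    for row in range(2):
        for col in range(2):
            print("(", row, ",", col, ") -> offset", flat_index(row, col, 2))
```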
Alternate Solution: LayoutTensor Approach
Here's the same logic using LayoutTensor, which handles shape and indexing automatically:
```mojo
from layout import Layout, LayoutTensor

fn add_10_2d_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[row, col] + 10.0  # <- ✅ Solution code
```
The LayoutTensor version eliminates manual row-major indexing (row * size + col) by using 2D indexing directly: output[row, col]. This reduces errors and makes the code more readable, especially for higher-dimensional tensors.
Puzzle 5: "Broadcasting" – Shape Mismatch Magic
Now for NumPy-style broadcasting: add a row vector a (size 2) and a column vector b (size 2) to produce a 2×2 output. Each thread computes one output element by indexing into a by column and b by row.
```mojo
fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[col] + b[row]  # <- ✅ Solution code
```
The elegance: a[col] broadcasts across rows, b[row] broadcasts across columns. With a = [1, 2] and b = [0, 10], output becomes [[1, 2], [11, 12]].
This pattern is fundamental for matrix operations—understanding how to map thread indices to different input dimensions sets you up for convolutions and matmuls.
Alternate Solution: LayoutTensor Approach
Here's the same broadcasting using LayoutTensor with different layouts for each tensor:
```mojo
fn broadcast_add_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[0, col] + b[row, 0]  # <- ✅ Solution code
```
The LayoutTensor approach makes broadcasting explicit: a[0, col] shows it's a row vector (1×2), and b[row, 0] shows it's a column vector (2×1). The layout system handles stride calculations, making the broadcasting semantics crystal clear.
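On the host side, the mismatched shapes would be declared as layouts along these lines (a sketch based on my reading of the puzzle; the alias names are illustrative):

```mojo
from layout import Layout

# Hypothetical layout aliases for Puzzle 5's shapes
alias SIZE = 2
alias out_layout = Layout.row_major(SIZE, SIZE)  # 2x2 output
alias a_layout = Layout.row_major(1, SIZE)       # 1x2 row vector
alias b_layout = Layout.row_major(SIZE, 1)       # 2x1 column vector
```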
Puzzle 6: "Multi-Block Grids" – Scaling Across Blocks
Puzzle 6 introduces multi-block grids: BLOCKS_PER_GRID=(3,1) with THREADS_PER_BLOCK=(4,1) gives us 12 threads total to process 9 elements. Now we compute global indices using both block and thread IDs.
```mojo
from gpu import block_dim, block_idx

fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:  # <- ✅ Solution code
        output[i] = a[i] + 10.0  # <- ✅ Solution code
```
The formula block_dim.x * block_idx.x + thread_idx.x gives each thread its unique global index across all blocks. This is how you scale: add blocks (not just threads) to handle massive datasets.
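To see the formula at work, here's a small host-side sketch (my own illustration, using this puzzle's configuration) that enumerates every thread's global index:

```mojo
def main():
    # BLOCKS_PER_GRID=(3,1), THREADS_PER_BLOCK=(4,1), SIZE=9 (from the puzzle)
    alias block_dim_x = 4
    alias size = 9
    for block in range(3):
        for thread in range(4):
            i = block_dim_x * block + thread
            if i < size:
                print("block", block, "thread", thread, "-> element", i)
            else:
                print("block", block, "thread", thread, "-> idle (guarded)")
```

Blocks 0 and 1 are fully busy; only the last block has idle threads, which the bounds check safely sidelines.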
Puzzle 7: "2D Multi-Block Grids"
Combining multi-block grids with 2D indexing: BLOCKS_PER_GRID=(2,2) and THREADS_PER_BLOCK=(3,3) create a 6×6 thread grid for a 5×5 array.
```mojo
fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[row * size + col] + 10.0  # <- ✅ Solution code
```
Now both row and column indices span multiple blocks. This tiling pattern is crucial for large matrices—you can't fit everything in one block, so you break the work into block-sized tiles.
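To picture the tiling, here's a small sketch (my own illustration) of the 3×3 patch each block owns over the 5×5 array:

```mojo
def main():
    # BLOCKS_PER_GRID=(2,2), THREADS_PER_BLOCK=(3,3), SIZE=5 (from the puzzle)
    alias tile = 3
    for block_y in range(2):
        for block_x in range(2):
            first_row = tile * block_y
            first_col = tile * block_x
            print("block (", block_x, ",", block_y, ") covers rows",
                first_row, "to", first_row + tile - 1,
                ", cols", first_col, "to", first_col + tile - 1)
```

Blocks on the right and bottom edges overhang the 5×5 array; the bounds checks trim the overhang.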
Alternate Solution: LayoutTensor Approach
The LayoutTensor version with multi-block 2D grids:
```mojo
fn add_10_blocks_2d_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[row, col] + 10.0  # <- ✅ Solution code
```
Same global index calculation, but LayoutTensor handles the 2D-to-1D memory mapping. This makes tiling patterns clearer and safer for large-scale matrix operations.
Puzzle 8: "Shared Memory" – Fast Thread Communication
Finally, shared memory: per-block fast RAM (think team whiteboard) that's much faster than global memory. Threads in a block can collaborate via shared buffers.
```mojo
from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    # TPB (threads per block) is a compile-time alias from the puzzle file
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load global data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]  # <- ✅ Solution code

    # Wait for all threads to complete loading
    barrier()  # <- ✅ Solution code

    # Process using shared memory
    if global_i < size:
        output[global_i] = shared[local_i] + 10  # <- ✅ Solution code
```
Key insights:
- `stack_allocation` with `AddressSpace.SHARED` creates per-block memory
- `barrier()` synchronises threads—ensures all loads complete before any thread proceeds
- Each thread uses `local_i` (within-block index) for shared access, `global_i` for global memory
My error: Initially forgot the barrier()—threads raced ahead and read uninitialised shared data, causing random failures.
Shared memory is your gateway to high performance: reducing redundant global memory reads, enabling thread cooperation for reductions, and tiling large operations.
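Before the alternate solution, a quick preview of why that matters. Here's a hedged sketch of my own (not a puzzle solution; the kernel name and TPB value are made up) where each input element is needed by two threads. Staging the data in shared memory means one global read per element instead of two:

```mojo
from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias dtype = DType.float32
alias TPB = 8

# Illustrative sketch (not from the puzzles): neighbour sum via shared memory
fn neighbour_sum_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    # One extra slot holds the "halo" element owned by the next block
    shared = stack_allocation[
        TPB + 1,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Each element is read from global memory exactly once...
    if global_i < size:
        shared[local_i] = a[global_i]
    # ...with the block's last thread also fetching the halo element
    if local_i == TPB - 1 and global_i + 1 < size:
        shared[local_i + 1] = a[global_i + 1]

    barrier()

    # ...but consumed twice, from fast shared memory
    if global_i + 1 < size:
        output[global_i] = shared[local_i] + shared[local_i + 1]
```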
Alternate Solution: LayoutTensor Approach
Here's Puzzle 8's shared memory pattern again, this time using LayoutTensor with the shared address space:
```mojo
fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory as a LayoutTensor in the SHARED address space
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]  # <- ✅ Solution code

    barrier()  # <- ✅ Solution code

    if global_i < size:
        output[global_i] = shared[local_i] + 10  # <- ✅ Solution code
```
The LayoutTensor version uses .stack_allocation() to create shared memory with proper layout tracking. The AddressSpace.SHARED parameter tells the compiler to allocate this in fast per-block memory. This approach provides type safety and automatic shape handling whilst maintaining full performance.
Sidebar: UnsafePointer vs LayoutTensor – Same Logic, Different Abstractions
You may have noticed that the alternate LayoutTensor solutions look remarkably similar to the UnsafePointer versions. That's intentional—and reveals an important insight about GPU programming in Mojo.
The solution logic is identical. The computation (+ 10.0), bounds checks (if row < size), and thread indexing patterns are exactly the same in both approaches. What changes is:
| Aspect | UnsafePointer | LayoutTensor |
|---|---|---|
| Data declaration | UnsafePointer[Scalar[dtype], MutAnyOrigin] | LayoutTensor[dtype, layout, MutAnyOrigin] |
| Memory indexing | Manual 1D: row * size + col | Multi-dimensional: [row, col] |
| Layout handling | You calculate strides manually | Compiler handles via Layout parameter |
| Type safety | Minimal (raw pointer) | Strong (shape + stride tracking) |
| When to use | Learning low-level details, maximum control | Production code, reducing indexing errors |
Example from Puzzle 4:
```mojo
# UnsafePointer: manual row-major calculation
output[row * size + col] = a[row * size + col] + 10.0

# LayoutTensor: layout system handles memory mapping
output[row, col] = a[row, col] + 10.0
```
The LayoutTensor approach abstracts away stride calculations—you don't manually compute row * size + col (row-major) or col * size + row (column-major). The Layout system enforces shape/dimension awareness at compile time, eliminating error-prone index arithmetic whilst maintaining identical performance.
Key takeaway: LayoutTensor isn't a different algorithm—it's a safer, more expressive wrapper around the same memory. As you move from learning exercises to production kernels, LayoutTensor helps you focus on the parallel logic whilst the compiler handles the plumbing.
Sidebar: Evolving GPU Lingo
| Term | What it means here |
|---|---|
| Global ID | block_dim.x * block_idx.x + thread_idx.x – your unique spot in the full grid |
| LayoutTensor | Safer array with shape/strides; index with [row, col] instead of manual offset math |
| Broadcasting | Combining mismatched shapes (e.g., 1x2 + 2x1 → 2x2) via index math, no data copies |
| Shared buffer | Fast per-block memory; allocate with stack_allocation[..., AddressSpace.SHARED], sync with barrier() (CUDA's __syncthreads()) |
| Bank Conflict | Multiple threads hitting the same shared memory bank → serialisation; avoid with padding offsets (see the sketch below) |
| Occupancy | Fraction of the GPU's thread capacity in use; over-subscribing threads helps maximise it |
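The padding trick behind "avoid with padding offsets", as a hedged sketch (my own illustration, not puzzle code; TILE and the function name are made up): give each row of a shared tile one spare element so its stride no longer lines up with the bank count.

```mojo
from gpu import thread_idx
from gpu.memory import AddressSpace
from memory import stack_allocation

alias dtype = DType.float32
alias TILE = 8

# Illustrative only: shows the padded indexing, not a full kernel
fn padded_tile_sketch():
    # TILE x TILE tile stored as TILE x (TILE + 1): the spare column staggers
    # bank indices, so threads reading down a column no longer serialise
    shared = stack_allocation[
        TILE * (TILE + 1),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    row = thread_idx.y
    col = thread_idx.x
    # Padded row-major index: the stride is TILE + 1, not TILE
    shared[row * (TILE + 1) + col] = 0.0
```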
What's Next
These puzzles shifted me from "why won't it run?" to "how do I tune it?": grids for breadth, shared for depth. Code tweaks and full solutions live in my GitHub fork.
Up next: GPU Threads Unraveled #3: Parallel Reductions & Scans (Puzzles 9–14), the heart of parallel sums.
Related Posts
- GPU Threads Unraveled #1: First Threads — Setup and Puzzles 1–2
- GPU Threads Unraveled: Series Introduction — Overview of the full series