Executive Summary

Mastering GPU grid scaling and shared memory in Mojo turns slow, CPU-bound data pipelines into fast, parallel workflows that handle large-scale datasets without expensive infrastructure upgrades or specialist developers.

This post builds on Part 1 of the GPU Threads Unraveled series. If you've just learnt how to launch your first GPU threads, Puzzles 3–8 take you further: scaling across multiple thread blocks, protecting against out-of-bounds errors, and using shared memory for efficient thread collaboration.

On my M1 MacBook, I worked through the tricky bits: silent data corruption from missing bounds checks (Puzzle 3) and a race condition from a forgotten barrier() (Puzzle 8). The official YouTube walkthrough series helped smooth the rough edges. Here's what I learnt.


Puzzle 3: "Over-Subscription" – More Threads Than Data

Now we launch more threads than we have data elements. With SIZE=4 but THREADS_PER_BLOCK=(8,1), we have 8 threads but only 4 elements to process. The key learning: guard every array access with bounds checks.

Here's the actual kernel:

fn add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = thread_idx.x
    if i < size:  # <- ✅ Solution code
        output[i] = a[i] + 10.0  # <- ✅ Solution code

My stumble: I initially omitted the if i < size check. The extra threads (4–7) tried to access memory they shouldn't, causing silent corruption and test failures.

Lesson: Over-subscribe threads for full GPU occupancy, but guard every memory access ruthlessly. The puzzle's diagram visualises this perfectly—threads beyond the data size must idle safely.
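
For context, here's roughly how the puzzle harness launches this kernel. This is a sketch based on the mojo-gpu-puzzles host code; the DeviceContext calls are the ones the repo uses, but treat the exact setup as illustrative rather than canonical:

from gpu.host import DeviceContext

alias SIZE = 4
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (8, 1)  # 8 threads for 4 elements: over-subscribed
alias dtype = DType.float32

def main():
    with DeviceContext() as ctx:
        out = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        with a.map_to_host() as a_host:
            for i in range(SIZE):
                a_host[i] = i  # sample input data
        ctx.enqueue_function[add_10_guard](
            out.unsafe_ptr(),
            a.unsafe_ptr(),
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()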


Puzzle 4: "2D Indexing" – Working with Grids

Moving to 2D: we launch a single block of (3, 3) threads to process a 2×2 array. Now we use both thread_idx.x (columns) and thread_idx.y (rows), with bounds checking on both dimensions.

fn add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[row * size + col] + 10.0  # <- ✅ Solution code

Key insight: We flatten 2D coordinates into 1D memory using row-major indexing: row * size + col. Both dimensions need bounds checks since we over-subscribe (9 threads for 4 elements). This pattern underpins all 2D GPU work—images, matrices, tiles.
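
To make the mapping concrete, here's a tiny helper (illustrative only, not part of the puzzle code) with the flattened indices worked out for a 2×2 array:

fn flat_index(row: UInt, col: UInt, size: UInt) -> UInt:
    # Row-major: each full row of `size` elements precedes the next row.
    return row * size + col

# For size = 2:
#   (row=0, col=0) -> 0    (row=0, col=1) -> 1
#   (row=1, col=0) -> 2    (row=1, col=1) -> 3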

Alternate Solution: LayoutTensor Approach

Here's the same logic using LayoutTensor, which handles shape and indexing automatically:

fn add_10_2d_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[row, col] + 10.0  # <- ✅ Solution code

The LayoutTensor version eliminates manual row-major indexing (row * size + col) by using 2D indexing directly: output[row, col]. This reduces errors and makes the code more readable, especially for higher-dimensional tensors.


Puzzle 5: "Broadcasting" – Shape Mismatch Magic

Now for NumPy-style broadcasting: add a row vector a (size 2) and a column vector b (size 2) to produce a 2×2 output. Each thread computes one output element by indexing into a by column and b by row.

fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[col] + b[row]  # <- ✅ Solution code

The elegance: a[col] broadcasts across rows, b[row] broadcasts across columns. With a = [1, 2] and b = [0, 10], output becomes [[1, 2], [11, 12]].

This pattern is fundamental for matrix operations—understanding how to map thread indices to different input dimensions sets you up for convolutions and matmuls.
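
Spelling out the per-thread computation for the example above (a = [1, 2], b = [0, 10]):

# Thread (col = thread_idx.x, row = thread_idx.y) computes a[col] + b[row]:
#   row 0: a[0] + b[0] = 1 + 0  = 1     a[1] + b[0] = 2 + 0  = 2
#   row 1: a[0] + b[1] = 1 + 10 = 11    a[1] + b[1] = 2 + 10 = 12
# Any extra threads fail the bounds check and idle, as in Puzzle 3.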

Alternate Solution: LayoutTensor Approach

Here's the same broadcasting using LayoutTensor with different layouts for each tensor:

fn broadcast_add_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[0, col] + b[row, 0]  # <- ✅ Solution code

The LayoutTensor approach makes broadcasting explicit: a[0, col] shows it's a row vector (1×2), and b[row, 0] shows it's a column vector (2×1). The layout system handles stride calculations, making the broadcasting semantics crystal clear.


Puzzle 6: "Multi-Block Grids" – Scaling Across Blocks

Puzzle 6 introduces multi-block grids: BLOCKS_PER_GRID=(3,1) with THREADS_PER_BLOCK=(4,1) gives us 12 threads total to process 9 elements. Now we compute global indices using both block and thread IDs.

fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:  # <- ✅ Solution code
        output[i] = a[i] + 10.0  # <- ✅ Solution code

The formula block_dim.x * block_idx.x + thread_idx.x gives each thread its unique global index across all blocks. This is how you scale: add blocks (not just threads) to handle massive datasets.
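
Spelled out for this launch (block_dim.x = 4, three blocks, size = 9):

# block 0 covers i = 0..3, block 1 covers i = 4..7, block 2 covers i = 8..11
#   e.g. block_idx.x = 2, thread_idx.x = 1  ->  i = 4 * 2 + 1 = 9
# Indices 9..11 fail the i < size guard (valid indices are 0..8) and idle.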


Puzzle 7: "2D Multi-Block Grids"

Combining multi-block grids with 2D indexing: BLOCKS_PER_GRID=(2,2) and THREADS_PER_BLOCK=(3,3) create a 6×6 thread grid for a 5×5 array.

fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row * size + col] = a[row * size + col] + 10.0  # <- ✅ Solution code

Now both row and column indices span multiple blocks. This tiling pattern is crucial for large matrices—you can't fit everything in one block, so you break the work into block-sized tiles.
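
A worked example for this launch (block_dim = (3, 3), size = 5):

# block_idx = (1, 1), thread_idx = (1, 1):
#   row = 3 * 1 + 1 = 4, col = 3 * 1 + 1 = 4  -> (4, 4), the last valid element
# block_idx = (1, 1), thread_idx = (2, 2):
#   row = 3 * 1 + 2 = 5, col = 5  -> fails row < size, idles
# The 6×6 thread grid covers the 5×5 array; the outer fringe idles safely.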

Alternate Solution: LayoutTensor Approach

The LayoutTensor version with multi-block 2D grids:

fn add_10_blocks_2d_layout_tensor[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:  # <- ✅ Solution code
        output[row, col] = a[row, col] + 10.0  # <- ✅ Solution code

Same global index calculation, but LayoutTensor handles the 2D-to-1D memory mapping. This makes tiling patterns clearer and safer for large-scale matrix operations.


Puzzle 8: "Shared Memory" – Fast Thread Communication

Finally, shared memory: on-chip, per-block memory (think team whiteboard) that's much faster than global memory. Threads in a block can collaborate via shared buffers.

fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    
    # Load global data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]  # <- ✅ Solution code
    
    # Wait for all threads to complete loading
    barrier()  # <- ✅ Solution code
    
    # Process using shared memory
    if global_i < size:
        output[global_i] = shared[local_i] + 10  # <- ✅ Solution code

Key insights:

  • stack_allocation with AddressSpace.SHARED creates per-block memory
  • barrier() synchronises threads—ensures all loads complete before any thread proceeds
  • Each thread uses local_i (within-block index) for shared access, global_i for global memory

My error: I initially forgot the barrier(). Threads raced ahead and read uninitialised shared data, causing random failures.

Shared memory is your gateway to high performance: reducing redundant global memory reads, enabling thread cooperation for reductions, and tiling large operations.
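
As a taste of that reuse, here's a hypothetical variant of the kernel above in which each thread also reads its neighbour's value from shared memory instead of issuing a second global load. This is my own sketch, not one of the puzzle solutions, and it assumes a single block (size <= TPB) as in this puzzle:

fn neighbour_sum_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # One global load per thread; neighbours are then read from shared memory.
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()  # every load must finish before anyone reads a neighbour

    if global_i < size:
        var right: Scalar[dtype] = 0
        if global_i + 1 < size:
            right = shared[local_i + 1]
        output[global_i] = shared[local_i] + right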

Alternate Solution: LayoutTensor Approach

Here's the same shared memory pattern using LayoutTensor with the shared address space:

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory as a LayoutTensor via stack_allocation()
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]  # <- ✅ Solution code

    barrier()  # <- ✅ Solution code

    if global_i < size:
        output[global_i] = shared[local_i] + 10  # <- ✅ Solution code

The LayoutTensor version uses .stack_allocation() to create shared memory with proper layout tracking. The AddressSpace.SHARED parameter tells the compiler to allocate this in fast per-block memory. This approach provides type safety and automatic shape handling whilst maintaining full performance.


You may have noticed that the alternate LayoutTensor solutions look remarkably similar to the UnsafePointer versions. That's intentional—and reveals an important insight about GPU programming in Mojo.

The solution logic is identical. The computation (+ 10.0), bounds checks (if row < size), and thread indexing patterns are exactly the same in both approaches. What changes is:

| Aspect | UnsafePointer | LayoutTensor |
| --- | --- | --- |
| Data declaration | UnsafePointer[Scalar[dtype], MutAnyOrigin] | LayoutTensor[dtype, layout, MutAnyOrigin] |
| Memory indexing | Manual 1D: row * size + col | Multi-dimensional: [row, col] |
| Layout handling | You calculate strides manually | Compiler handles via Layout parameter |
| Type safety | Minimal (raw pointer) | Strong (shape + stride tracking) |
| When to use | Learning low-level details, maximum control | Production code, reducing indexing errors |

Example from Puzzle 4:

# UnsafePointer: manual row-major calculation
output[row * size + col] = a[row * size + col] + 10.0

# LayoutTensor: layout system handles memory mapping
output[row, col] = a[row, col] + 10.0

The LayoutTensor approach abstracts away stride calculations—you don't manually compute row * size + col (row-major) or col * size + row (column-major). The Layout system enforces shape/dimension awareness at compile time, eliminating error-prone index arithmetic whilst maintaining identical performance.

Key takeaway: LayoutTensor isn't a different algorithm—it's a safer, more expressive wrapper around the same memory. As you move from learning exercises to production kernels, LayoutTensor helps you focus on the parallel logic whilst the compiler handles the plumbing.
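
For instance, the layouts themselves are typically declared host-side and passed in as compile-time parameters. A minimal sketch, assuming the puzzle's SIZE alias (Layout.row_major and Layout.col_major come from Mojo's layout package):

from layout import Layout, LayoutTensor

alias out_layout = Layout.row_major(SIZE, SIZE)  # strides (SIZE, 1)
alias alt_layout = Layout.col_major(SIZE, SIZE)  # strides (1, SIZE)

# The same kernel body works with either layout: output[row, col] compiles to
# the correct address arithmetic because the strides live in the type.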


| Term | What it means here |
| --- | --- |
| Global ID | block_dim.x * block_idx.x + thread_idx.x – your unique spot in the full grid |
| LayoutTensor | Safer array with shape/stride tracking; index directly with [row, col] |
| Broadcasting | Replicating dims via index math (e.g., 1×2 + 2×1 → 2×2) |
| Shared memory | Fast per-block memory; allocate with stack_allocation[..., AddressSpace.SHARED], synchronise with barrier() |
| Bank conflict | Multiple threads hitting the same shared-memory bank → serialisation; avoid with offsets |
| Occupancy | Proportion of the GPU's threads kept busy; over-subscription plus shared memory maximises it |

What's Next

These puzzles shifted me from "why won't it run?" to "how do I tune it?": grids for breadth, shared memory for depth. Code tweaks and full solutions live in my GitHub fork.

Up next: GPU Threads Unraveled #3: Parallel Reductions & Scans (Puzzles 9–14), the heart of parallel sums.