Exec Summary

Training and running modern AI models on ever-larger datasets demands massive parallel compute. Mojo’s GPU puzzles teach you the building blocks of delivering that performance directly, which can cut inference times by orders of magnitude and reduce costs, without requiring your team to abandon familiar Python syntax or rewrite your entire stack.

GPUs excel at parallel computation, processing thousands of data points simultaneously: ideal for AI workloads where CPU-bound scripts choke on scale. But programming them directly has always been a low-level slog in C++/CUDA. Mojo, Modular’s Python-superset language, changes that: it gives you full hardware control with Python-like syntax. As introduced in the series Introduction, Modular’s 34 GPU puzzles are a great way to learn this stuff: each puzzle is small, focused, and deliberately designed to build deep understanding, one concept at a time.

I'm running them on my M1 and M5 MacBooks (Apple Silicon). Most early puzzles now work perfectly via Metal, so an NVIDIA card is not really needed. This series — GPU Threads Unraveled — is my public notebook: code, gotchas, and what it actually feels like to write real GPU kernels in Mojo.

Companion Resources

| Resource | Description & Link |
| --- | --- |
| My GPU Puzzles Fork | My evolving fork with solutions: https://github.com/Mjboothaus/mojo-gpu-puzzles |
| Modular's Official YouTube Tutorials | Learn GPU Programming with Mojo 🔥 GPU Puzzles Tutorials: Introduction, Puzzle 01: Map, Puzzle 02: Zip |
| Blog Series Introduction | Overview of the full GPU Threads Unraveled blog series: GPU Threads Unraveled – Series Introduction |
| GPU Glossary | Beginner-friendly explanations of key GPU terms: GPU Glossary |

Step 0: Gear Up

Full setup instructions are found here: https://puzzles.modular.com/howto.html

I forked the official repo on GitHub, then cloned my fork:

git clone https://github.com/Mjboothaus/mojo-gpu-puzzles
cd mojo-gpu-puzzles
pixi run -e apple p01   # Apple Silicon environment (in fact I found the `-e apple` wasn't strictly necessary)

Puzzle 1: "Hello, Threads" – One Thread per Element

Add 10 to every element of a vector of length 4.

fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + 10.0  # <- ✅ Solution code

That’s it. One thread, one element. The add_10 function here is a minimal example of a Mojo GPU kernel: the GPU launches four threads, and each one runs this function body exactly once for its own element.
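
For contrast, here’s the same operation written as a sequential CPU-style loop (my own sketch for illustration, not from the puzzle code; the name add_10_cpu is hypothetical). The GPU version deletes the loop and launches SIZE threads instead:

fn add_10_cpu(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    # One loop, SIZE iterations, strictly one after another
    for i in range(SIZE):
        output[i] = a[i] + 10.0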

Breaking Down the Mojo GPU Code Syntax

Mojo’s GPU code blends Python-like simplicity with explicit control over hardware configuration. Here’s what each new element in Puzzle 1 means:

| Element | Purpose | Why it matters in GPU programming |
| --- | --- | --- |
| `comptime` | Keyword that marks a value as known at compile time rather than runtime | Enables aggressive compiler optimisations and generates specialised GPU code for the exact sizes/types you choose |
| `comptime SIZE = 4` | Compile-time constant defining the vector length (4 elements here) | Fixed at compile time so the compiler can optimise memory access and thread launch |
| `comptime BLOCKS_PER_GRID = 1` | Number of thread blocks launched on the GPU (just 1 for this tiny problem) | Controls how work is divided across the GPU; later puzzles use many blocks for scale |
| `comptime THREADS_PER_BLOCK = SIZE` | Threads per block — here exactly matches the data size (4 threads) | Ensures one thread processes one element; maximises simplicity in early puzzles |
| `comptime dtype = DType.float32` | Specifies the data type (32-bit floating point) for all array elements | Allows precise control over precision and memory usage; float32 is standard for most AI workloads |
| `Scalar[dtype]` | Mojo type representing a single value of the chosen dtype (here float32) | Provides a generic, type-safe way to refer to individual elements regardless of the underlying data type |
| `UnsafePointer[Scalar[dtype]]` | Low-level pointer to GPU memory holding values of type `Scalar[dtype]` | Gives direct, zero-overhead access to data on the GPU — fast, but requires manual care (bounds checks added later) |
| `thread_idx.x` | Built-in variable: the unique ID (0–3 here) of the current thread within its block | The heart of data-parallel programming — each thread uses its ID to select its own work item |
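
Collected from the table, the compile-time declarations at the top of the puzzle file read:

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32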

These comptime declarations are baked into the kernel when it’s compiled, letting Mojo generate highly efficient GPU code tailored exactly to your problem size and data type. The Scalar[dtype] abstraction keeps the code generic and readable while still giving you full performance. Later puzzles will introduce safer tensor types, multi-dimensional grids, and bounds checking — but this minimal setup lets you experience real GPU parallelism from the very first puzzle.

Note: There is also the DeviceContext boilerplate code which executes the kernel — it likely feels cryptic at first, but it’s not the main focus of the GPU puzzles. The puzzles intentionally hide most of this so you can concentrate on the parallel logic inside the kernel. See the Appendix below for details.


Puzzle 2: "Two Inputs" – Element-wise Addition

Same idea, now with two input vectors:

fn add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + b[i]  # <- ✅ Solution code
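
Launching this kernel follows the same pattern as the boilerplate for Puzzle 1 (see the Appendix), just with the extra input buffer passed through — a sketch, assuming the same buffer names as Puzzle 1:

ctx.enqueue_function_checked[add, add](
    out,                          # output buffer
    a,                            # first input buffer
    b,                            # second input buffer, new in Puzzle 2
    grid_dim=BLOCKS_PER_GRID,
    block_dim=THREADS_PER_BLOCK,
)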

Note that no bounds checks are needed as yet because the number of launched threads exactly matches the data size. That changes in Puzzle 3.
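
As a preview, the guarded version looks something like the sketch below, assuming the element count is passed in as a size argument (the name add_guarded is mine; Puzzle 3 introduces the real version):

fn add_guarded(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    i = thread_idx.x
    if i < size:  # guard: threads past the end of the data do nothing
        output[i] = a[i] + b[i]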

Try Puzzles 1 and 2 yourself and share your thoughts in the comments!

Next: GPU Threads Unraveled #2: Grids, Guards, and Shared Scratchpads (Puzzles 3–8), coming soon.


Appendix: The Kernel Launch Boilerplate (High-Level Overview)

Every puzzle includes a main() function that sets up buffers, fills input data, launches your kernel on the GPU, and checks the results. Here’s a simplified walk-through of what that code does in Puzzle 1:

def main():
    with DeviceContext() as ctx:                     # Sets up GPU access and creates a context for all GPU operations
        out = ctx.enqueue_create_buffer[dtype](SIZE) # Allocates the output buffer on the GPU
        out.enqueue_fill(0)                          # Initialises the output buffer to zeros on the GPU
        
        a = ctx.enqueue_create_buffer[dtype](SIZE)   # Allocates the input buffer 'a' on the GPU
        a.enqueue_fill(0)                            # Initialises the input buffer to zeros on the GPU
        
        with a.map_to_host() as a_host:              # Temporarily maps the GPU buffer 'a' to host (CPU) memory for writing
            for i in range(SIZE):
                a_host[i] = i                        # Fill input buffer with values 0, 1, 2, 3 on the CPU side
        
        ctx.enqueue_function_checked[add_10, add_10]( # Launches your GPU kernel (add_10) 
            out,                                     # Output buffer
            a,                                       # Input buffer
            grid_dim=BLOCKS_PER_GRID,                # Number of thread blocks (from comptime constant)
            block_dim=THREADS_PER_BLOCK,             # Threads per block (from comptime constant)
        )
        
        expected = ctx.enqueue_create_host_buffer[dtype](SIZE)  # Creates a buffer on the CPU for the expected result
        expected.enqueue_fill(0)                               # Initialises it to zeros
        
        ctx.synchronize()                                      # Waits for all GPU operations to finish before proceeding
        
        for i in range(SIZE):
            expected[i] = i + 10                               # Computes the expected result directly on the CPU
        
        with out.map_to_host() as out_host:                    # Maps the GPU output buffer back to host memory for reading
            print("out:", out_host)                            # Prints the actual GPU result
            print("expected:", expected)                       # Prints the expected CPU result
            
            for i in range(SIZE):
                assert_equal(out_host[i], expected[i])         # Verifies that GPU output matches the expected values

What each part does (in plain terms)

| Step | Purpose |
| --- | --- |
| `DeviceContext()` | Creates a management context for GPU operations, automatically selecting and initialising the backend to handle memory allocation, data transfers, and kernel launches |
| `enqueue_create_buffer` | Allocates memory on the GPU for inputs and outputs |
| `map_to_host()` | Temporarily maps GPU memory so the CPU can write/read it (used for setup and verification) |
| `enqueue_function_checked` | Launches your kernel (add_10) with the specified grid/block dimensions |
| `grid_dim` / `block_dim` | Matches the comptime values you set earlier — tells the GPU exactly how many threads to start |
| `synchronize()` | Forces the CPU to wait until the GPU finishes (needed before reading results) |
| `assert_equal` | Automated test — fails loudly if your kernel produced wrong results |

You don’t need to write or deeply understand this boilerplate to solve the puzzles — Modular provides it so you can focus 100% on the kernel logic (the part inside fn add_10). Over time, as you progress through the series, these patterns will become familiar, and you’ll appreciate how Mojo keeps the setup concise while still giving full control when you need it.

For now, just know: this code handles the plumbing (memory allocation, launching, verification) so you can experiment freely with real GPU parallelism from the very first puzzle.