Exec Summary
Training and running modern AI models on ever-larger datasets demands massive parallel compute. Mojo's GPU puzzles teach you the building blocks for delivering that performance directly, the kind of GPU acceleration that can cut inference times by orders of magnitude and reduce costs, without requiring your team to abandon familiar Python syntax or rewrite your entire stack.
GPUs excel at parallel computation, processing thousands of data points simultaneously, which makes them ideal for AI workloads where CPU-bound scripts choke on scale. But programming them directly has always been a low-level slog in C++/CUDA. Mojo changes that: Modular's Python-family language gives you full hardware control while keeping Python-like syntax. As introduced in the series Introduction, Modular's 34 GPU puzzles are a great way to learn this stuff: each puzzle is small, focused, and deliberately designed to build deep understanding, one concept at a time.
I'm running them on my M1 and M5 MacBooks (Apple Silicon). Most early puzzles now work perfectly via Metal, so an NVIDIA card is not really needed. This series — GPU Threads Unraveled — is my public notebook: code, gotchas, and what it actually feels like to write real GPU kernels in Mojo.
Companion Resources
| Resource | Description & Link |
|---|---|
| My GPU Puzzles Fork | My evolving fork with solutions. https://github.com/Mjboothaus/mojo-gpu-puzzles |
| Modular's Official YouTube Tutorials | Learn GPU Programming with Mojo 🔥 GPU Puzzles Tutorials playlist: Introduction • Puzzle 01: Map • Puzzle 02: Zip |
| Blog Series Introduction | Overview of the full GPU Threads Unraveled blog series. GPU Threads Unraveled – Series Introduction |
| GPU Glossary | Beginner-friendly explanations of key GPU terms. GPU Glossary |
Step 0: Gear Up
Full setup instructions are found here: https://puzzles.modular.com/howto.html
I forked the official repo on GitHub, then cloned my fork:
git clone https://github.com/Mjboothaus/mojo-gpu-puzzles
cd mojo-gpu-puzzles
pixi run -e apple p01 # Apple Silicon environment (in fact I found the `-e apple` wasn't strictly necessary)
Puzzle 1: "Hello, Threads" – One Thread per Element
Add 10 to every element of a vector of length 4.
fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + 10.0  # <- ✅ Solution code
That’s it: one thread, one element. Each of the four launched threads runs the same kernel body with a different thread_idx.x, so thread 0 writes output[0], thread 1 writes output[1], and so on. The add_10 function is about as simple as a Mojo GPU kernel gets.
Breaking Down the Mojo GPU Code Syntax
Mojo’s GPU code blends Python-like simplicity with explicit control over hardware configuration. Here’s what each new element in Puzzle 1 means:
| Element | Purpose | Why it matters in GPU programming |
|---|---|---|
| `comptime` | Keyword that marks a value as known at compile time rather than runtime | Enables aggressive compiler optimisations and generates specialised GPU code for the exact sizes/types you choose |
| `comptime SIZE = 4` | Compile-time constant defining the vector length (4 elements here) | Fixed at compile time so the compiler can optimise memory access and thread launch |
| `comptime BLOCKS_PER_GRID = 1` | Number of thread blocks launched on the GPU (just 1 for this tiny problem) | Controls how work is divided across the GPU; later puzzles use many blocks for scale |
| `comptime THREADS_PER_BLOCK = SIZE` | Threads per block — here exactly matches the data size (4 threads) | Ensures one thread processes one element; maximises simplicity in early puzzles |
| `comptime dtype = DType.float32` | Specifies the data type (32-bit floating point) for all array elements | Allows precise control over precision and memory usage; float32 is standard for most AI workloads |
| `Scalar[dtype]` | Mojo type representing a single value of the chosen dtype (here float32) | Provides a generic, type-safe way to refer to individual elements regardless of the underlying data type |
| `UnsafePointer[Scalar[dtype]]` | Low-level pointer to GPU memory holding values of type `Scalar[dtype]` | Gives direct, zero-overhead access to data on the GPU — fast, but requires manual care (bounds checks added later) |
| `thread_idx.x` | Built-in variable: the unique ID (0–3 here) of the current thread within its block | The heart of data-parallel programming — each thread uses its ID to select its own work item |
These comptime declarations are baked into the kernel when it’s compiled, letting Mojo generate highly efficient GPU code tailored exactly to your problem size and data type. The Scalar[dtype] abstraction keeps the code generic and readable while still giving you full performance. Later puzzles will introduce safer tensor types, multi-dimensional grids, and bounds checking — but this minimal setup lets you experience real GPU parallelism from the very first puzzle.
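For reference, here is roughly how those declarations sit at the top of the Puzzle 1 file, using the names and values from the table above (a sketch only; check your checked-out copy of the repo for the exact form):

comptime SIZE = 4                   # vector length
comptime BLOCKS_PER_GRID = 1        # a single thread block is enough here
comptime THREADS_PER_BLOCK = SIZE   # one thread per element
comptime dtype = DType.float32      # 32-bit floats for every element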
Note: There is also the `DeviceContext` boilerplate code which executes the kernel — it likely feels cryptic at first, but it’s not the main focus of the GPU puzzles. The puzzles intentionally hide most of this so you can concentrate on the parallel logic inside the kernel. See the Appendix below for details.
Puzzle 2: "Two Inputs" – Element-wise Addition
Same idea, now with two input vectors:
fn add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + b[i]  # <- ✅ Solution code
Note that no bounds checks are needed as yet because the number of launched threads exactly matches the data size. That changes in Puzzle 3.
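As a preview, the guard pattern Puzzle 3 introduces looks roughly like this. It is a sketch only: the kernel name and the size parameter here are illustrative, not copied from the puzzle.

fn add_guarded(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    i = thread_idx.x
    if i < size:  # guard: threads with no element to process simply do nothing
        output[i] = a[i] + b[i]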
Try Puzzles 1 and 2 yourself and share your thoughts in the comments!
Next: GPU Threads Unraveled #2: Grids, Guards, and Shared Scratchpads (Puzzles 3–8): Coming Soon.
Appendix: The Kernel Launch Boilerplate (High-Level Overview)
Every puzzle includes a main() function that sets up buffers, fills input data, launches your kernel on the GPU, and checks the results. Here’s a simplified walk-through of what that code does in Puzzle 1:
def main():
    with DeviceContext() as ctx:                      # Sets up GPU access and creates a context for all GPU operations
        out = ctx.enqueue_create_buffer[dtype](SIZE)  # Allocates the output buffer on the GPU
        out.enqueue_fill(0)                           # Initialises the output buffer to zeros on the GPU
        a = ctx.enqueue_create_buffer[dtype](SIZE)    # Allocates the input buffer 'a' on the GPU
        a.enqueue_fill(0)                             # Initialises the input buffer to zeros on the GPU
        with a.map_to_host() as a_host:               # Temporarily maps the GPU buffer 'a' to host (CPU) memory for writing
            for i in range(SIZE):
                a_host[i] = i                         # Fill input buffer with values 0, 1, 2, 3 on the CPU side
        ctx.enqueue_function_checked[add_10, add_10]( # Launches your GPU kernel (add_10)
            out,                                      # Output buffer
            a,                                        # Input buffer
            grid_dim=BLOCKS_PER_GRID,                 # Number of thread blocks (from comptime constant)
            block_dim=THREADS_PER_BLOCK,              # Threads per block (from comptime constant)
        )
        expected = ctx.enqueue_create_host_buffer[dtype](SIZE)  # Creates a buffer on the CPU for the expected result
        expected.enqueue_fill(0)                      # Initialises it to zeros
        ctx.synchronize()                             # Waits for all GPU operations to finish before proceeding
        for i in range(SIZE):
            expected[i] = i + 10                      # Computes the expected result directly on the CPU
        with out.map_to_host() as out_host:           # Maps the GPU output buffer back to host memory for reading
            print("out:", out_host)                   # Prints the actual GPU result
            print("expected:", expected)              # Prints the expected CPU result
            for i in range(SIZE):
                assert_equal(out_host[i], expected[i])  # Verifies that GPU output matches the expected values
What each part does (in plain terms)
| Step | Purpose |
|---|---|
| `DeviceContext()` | Creates a management context for GPU operations, automatically selecting and initialising the backend to handle memory allocation, data transfers, and kernel launches. |
| `enqueue_create_buffer` | Allocates memory on the GPU for inputs and outputs |
| `map_to_host()` | Temporarily maps GPU memory so the CPU can write/read it (used for setup and verification) |
| `enqueue_function_checked` | Launches your kernel (`add_10`) with the specified grid/block dimensions |
| `grid_dim` / `block_dim` | Matches the comptime values you set earlier — tells the GPU exactly how many threads to start |
| `synchronize()` | Forces the CPU to wait until the GPU finishes (needed before reading results) |
| `assert_equal` | Automated test — fails loudly if your kernel produced wrong results |
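For Puzzle 2 the plumbing is nearly identical; the only change is allocating a second input buffer and passing it to the launch call. Here is a sketch of the relevant lines, following the same pattern as above rather than quoting the puzzle verbatim:

b = ctx.enqueue_create_buffer[dtype](SIZE)  # second input buffer on the GPU
b.enqueue_fill(0)                           # zero it, then fill it via map_to_host just like 'a'
ctx.enqueue_function_checked[add, add](     # launch the two-input kernel
    out,
    a,
    b,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=THREADS_PER_BLOCK,
)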
You don’t need to write or deeply understand this boilerplate to solve the puzzles — Modular provides it so you can focus 100% on the kernel logic (the part inside fn add_10). Over time, as you progress through the series, these patterns will become familiar, and you’ll appreciate how Mojo keeps the setup concise while still giving full control when you need it.
For now, just know: this code handles the plumbing (memory allocation, launching, verification) so you can experiment freely with real GPU parallelism from the very first puzzle.