Exec Summary

Training and running modern AI models on ever-larger datasets demands massive parallel compute. Mojo’s GPU puzzles teach you the building blocks of delivering that performance directly, which can cut inference times by orders of magnitude and reduce costs, without requiring your team to abandon familiar Python syntax or rewrite your entire stack.

GPUs excel at parallel computation, processing thousands of data points simultaneously: ideal for AI workloads where CPU-bound scripts choke on scale. But programming them directly has always been a low-level slog in C++/CUDA. Mojo, Modular’s Python-superset language, changes that: it gives you full hardware control with Python-like syntax. As introduced in the series Introduction, Modular’s 34 GPU puzzles are a great way to learn this stuff: each puzzle is small, focused, and deliberately designed to build deep understanding, one concept at a time.

I'm running them on my M1 and M5 MacBooks (Apple Silicon). Most early puzzles now work perfectly via Metal, so an NVIDIA card is not really needed. This series — GPU Threads Unraveled — is my public notebook: code, gotchas, and what it actually feels like to write real GPU kernels in Mojo.

Companion Resources

| Resource | Description & Link |
| --- | --- |
| My GPU Puzzles Fork | My evolving fork with solutions: https://github.com/Mjboothaus/mojo-gpu-puzzles |
| Modular's Official YouTube Tutorials | Learn GPU Programming with Mojo 🔥 GPU Puzzles Tutorials: Introduction, Puzzle 01: Map, Puzzle 02: Zip |
| Blog Series Introduction | Overview of the full GPU Threads Unraveled blog series: GPU Threads Unraveled – Series Introduction |
| GPU Glossary | Beginner-friendly explanations of key GPU terms: GPU Glossary |

Step 0: Gear Up

Full setup instructions are found here: https://puzzles.modular.com/howto.html

I forked the official repo on GitHub, then cloned my fork:

git clone https://github.com/Mjboothaus/mojo-gpu-puzzles
cd mojo-gpu-puzzles
pixi run -e apple p01   # Apple Silicon environment (in fact I found the `-e apple` wasn't strictly necessary)

Puzzle 1: "Hello, Threads" – One Thread per Element

Add 10 to every element of a vector of length 4.

fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + 10.0  # <- ✅ Solution code

That’s it. One thread, one element. The add_10 function here is a minimal example of a Mojo GPU kernel: the GPU launches four threads, and each one runs this function body exactly once for its own element.
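
For contrast, here’s the same operation written as a sequential CPU-style loop (my own sketch for illustration, not from the puzzle code; the name add_10_cpu is hypothetical). The GPU version deletes the loop and launches SIZE threads instead:

fn add_10_cpu(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    # One loop, SIZE iterations, strictly one after another
    for i in range(SIZE):
        output[i] = a[i] + 10.0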

Breaking Down the Mojo GPU Code Syntax

Mojo’s GPU code blends Python-like simplicity with explicit control over hardware configuration. Here’s what each new element in Puzzle 1 means:

| Element | Purpose | Why it matters in GPU programming |
| --- | --- | --- |
| `comptime` | Keyword that marks a value as known at compile time rather than runtime | Enables aggressive compiler optimisations and generates specialised GPU code for the exact sizes/types you choose |
| `comptime SIZE = 4` | Compile-time constant defining the vector length (4 elements here) | Fixed at compile time so the compiler can optimise memory access and thread launch |
| `comptime BLOCKS_PER_GRID = 1` | Number of thread blocks launched on the GPU (just 1 for this tiny problem) | Controls how work is divided across the GPU; later puzzles use many blocks for scale |
| `comptime THREADS_PER_BLOCK = SIZE` | Threads per block — here exactly matches the data size (4 threads) | Ensures one thread processes one element; maximises simplicity in early puzzles |
| `comptime dtype = DType.float32` | Specifies the data type (32-bit floating point) for all array elements | Allows precise control over precision and memory usage; float32 is standard for most AI workloads |
| `Scalar[dtype]` | Mojo type representing a single value of the chosen dtype (here float32) | Provides a generic, type-safe way to refer to individual elements regardless of the underlying data type |
| `UnsafePointer[Scalar[dtype]]` | Low-level pointer to GPU memory holding values of type `Scalar[dtype]` | Gives direct, zero-overhead access to data on the GPU — fast, but requires manual care (bounds checks added later) |
| `thread_idx.x` | Built-in variable: the unique ID (0–3 here) of the current thread within its block | The heart of data-parallel programming — each thread uses its ID to select its own work item |
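
Collected from the table, the compile-time declarations at the top of the puzzle file read:

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32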

These comptime declarations are baked into the kernel when it’s compiled, letting Mojo generate highly efficient GPU code tailored exactly to your problem size and data type. The Scalar[dtype] abstraction keeps the code generic and readable while still giving you full performance. Later puzzles will introduce safer tensor types, multi-dimensional grids, and bounds checking — but this minimal setup lets you experience real GPU parallelism from the very first puzzle.

Note: There is also the DeviceContext boilerplate code which executes the kernel — it likely feels cryptic at first, but it’s not the main focus of the GPU puzzles. The puzzles intentionally hide most of this so you can concentrate on the parallel logic inside the kernel. See the Appendix below for details.


Puzzle 2: "Two Inputs" – Element-wise Addition

Same idea, now with two input vectors:

fn add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + b[i]  # <- ✅ Solution code
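
Launching this kernel follows the same pattern as the boilerplate for Puzzle 1 (see the Appendix), just with the extra input buffer passed through — a sketch, assuming the same buffer names as Puzzle 1:

ctx.enqueue_function_checked[add, add](
    out,                          # output buffer
    a,                            # first input buffer
    b,                            # second input buffer, new in Puzzle 2
    grid_dim=BLOCKS_PER_GRID,
    block_dim=THREADS_PER_BLOCK,
)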

Note that no bounds checks are needed as yet because the number of launched threads exactly matches the data size. That changes in Puzzle 3.
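
As a preview, the guarded version looks something like the sketch below, assuming the element count is passed in as a size argument (the name add_guarded is mine; Puzzle 3 introduces the real version):

fn add_guarded(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    i = thread_idx.x
    if i < size:  # guard: threads past the end of the data do nothing
        output[i] = a[i] + b[i]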

Try Puzzles 1 and 2 yourself and share your thoughts in the comments!

Next: GPU Threads Unraveled #2: Grids, Guards, and Shared Scratchpads (Puzzles 3–8), coming soon.


Appendix: The Kernel Launch Boilerplate (High-Level Overview)

Every puzzle includes a main() function that sets up buffers, fills input data, launches your kernel on the GPU, and checks the results. Here’s a simplified walk-through of what that code does in Puzzle 1:

def main():
    with DeviceContext() as ctx:                     # Sets up GPU access and creates a context for all GPU operations
        out = ctx.enqueue_create_buffer[dtype](SIZE) # Allocates the output buffer on the GPU
        out.enqueue_fill(0)                          # Initialises the output buffer to zeros on the GPU
        
        a = ctx.enqueue_create_buffer[dtype](SIZE)   # Allocates the input buffer 'a' on the GPU
        a.enqueue_fill(0)                            # Initialises the input buffer to zeros on the GPU
        
        with a.map_to_host() as a_host:              # Temporarily maps the GPU buffer 'a' to host (CPU) memory for writing
            for i in range(SIZE):
                a_host[i] = i                        # Fill input buffer with values 0, 1, 2, 3 on the CPU side
        
        ctx.enqueue_function_checked[add_10, add_10]( # Launches your GPU kernel (add_10) 
            out,                                     # Output buffer
            a,                                       # Input buffer
            grid_dim=BLOCKS_PER_GRID,                # Number of thread blocks (from comptime constant)
            block_dim=THREADS_PER_BLOCK,             # Threads per block (from comptime constant)
        )
        
        expected = ctx.enqueue_create_host_buffer[dtype](SIZE)  # Creates a buffer on the CPU for the expected result
        expected.enqueue_fill(0)                               # Initialises it to zeros
        
        ctx.synchronize()                                      # Waits for all GPU operations to finish before proceeding
        
        for i in range(SIZE):
            expected[i] = i + 10                               # Computes the expected result directly on the CPU
        
        with out.map_to_host() as out_host:                    # Maps the GPU output buffer back to host memory for reading
            print("out:", out_host)                            # Prints the actual GPU result
            print("expected:", expected)                       # Prints the expected CPU result
            
            for i in range(SIZE):
                assert_equal(out_host[i], expected[i])         # Verifies that GPU output matches the expected values

What each part does (in plain terms)

| Step | Purpose |
| --- | --- |
| `DeviceContext()` | Creates a management context for GPU operations, automatically selecting and initialising the backend to handle memory allocation, data transfers, and kernel launches |
| `enqueue_create_buffer` | Allocates memory on the GPU for inputs and outputs |
| `map_to_host()` | Temporarily maps GPU memory so the CPU can write/read it (used for setup and verification) |
| `enqueue_function_checked` | Launches your kernel (add_10) with the specified grid/block dimensions |
| `grid_dim` / `block_dim` | Matches the comptime values you set earlier — tells the GPU exactly how many threads to start |
| `synchronize()` | Forces the CPU to wait until the GPU finishes (needed before reading results) |
| `assert_equal` | Automated test — fails loudly if your kernel produced wrong results |

You don’t need to write or deeply understand this boilerplate to solve the puzzles — Modular provides it so you can focus 100% on the kernel logic (the part inside fn add_10). Over time, as you progress through the series, these patterns will become familiar, and you’ll appreciate how Mojo keeps the setup concise while still giving full control when you need it.

For now, just know: this code handles the plumbing (memory allocation, launching, verification) so you can experiment freely with real GPU parallelism from the very first puzzle.