Exec Summary
Modular’s official GPU puzzles provide a structured, hands-on path to mastering low-level GPU programming in Mojo, a Python-like language that targets CUDA/C++-level performance with far less friction, making serious AI acceleration practical on commodity hardware.
If you’ve ever wanted to understand how GPUs really work under the hood (threads, blocks, shared memory, reductions, tiled matmuls) but found CUDA intimidating or scattered tutorials unhelpful, Modular has built exactly what you need. For quick explanations of key terms, see the Glossary below.
What it is
A collection of 34 progressively more difficult GPU programming puzzles, each a small, self-contained Mojo kernel you complete to pass automated tests. They start with “add 10 to a vector” and end with optimised tiled matrix multiplies that approach cuBLAS-level performance.
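To give a flavour of the format, here’s roughly the kind of kernel the opening puzzle asks for. This is a minimal sketch rather than the puzzle’s exact scaffold (the test harness supplies the pointers and launches one thread per element):

```mojo
from gpu import thread_idx
from memory import UnsafePointer

alias dtype = DType.float32

# Puzzle-1-style kernel: every thread handles exactly one element.
fn add_10(
    output: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
):
    i = thread_idx.x          # this thread's index within the block
    output[i] = a[i] + 10.0   # one element per thread, no loop needed
```

Notice there’s no loop over the data: parallel threads replace iteration, which is the first mental shift the puzzles drill.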
Why it exists
Modular’s goal is to make Mojo the best language for high-performance AI and systems programming. Direct GPU control is essential for that, but the traditional learning curve (C++/CUDA or fragmented OpenCL/Metal) is steep. These puzzles distil decades of parallel programming wisdom into bite-sized, guided exercises that teach real patterns used in Triton, FlashAttention, and production kernels, without requiring an NVIDIA card (Apple Silicon works via Metal for most puzzles).
How it works
Clone the official repo, run a puzzle with pixi, edit the kernel, and watch the tests pass. Each puzzle includes a diagram, hints, and a companion YouTube walkthrough series (at the time of writing, the first four walkthroughs have been published).
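In practice the loop looks something like this. The upstream repo location and the `p01` task name are assumptions based on my fork below; check the repo’s README and `pixi.toml` for the exact task names:

```bash
git clone https://github.com/modular/mojo-gpu-puzzles
cd mojo-gpu-puzzles
pixi run p01   # assumed task name for Puzzle 1; edit the kernel and re-run
```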
This blog series — GPU Threads Unraveled — is my public learning journal as I work through all 34 puzzles on my MacBook. I share the solutions that worked for me, the gotchas I hit, performance insights, and how the concepts translate to real AI acceleration.
All my code lives in my evolving fork of the GPU Puzzles repo:
https://github.com/Mjboothaus/mojo-gpu-puzzles
The series will consist of 8 focused posts, grouping puzzles thematically:
- GPU Threads Unraveled #1: First Threads – Launching into Mojo GPU Basics (Setup + Puzzles 1–2)
- GPU Threads Unraveled #2: Grids, Guards, and Shared Scratchpads (Puzzles 3–8)
- GPU Threads Unraveled #3: Parallel Reductions & Scans (Puzzles 11–16)
- GPU Threads Unraveled #4: Debugging the Black Box – Tools for GPU Sanity (Puzzles 9–11)
- GPU Threads Unraveled #5: Kernel Crafts – Convolutions and Matrix Magic (Puzzles 15–16 + 27)
- GPU Threads Unraveled #6: Warp Wisdom – Fine-Tuning GPU Harmony (Puzzles 24–26)
- GPU Threads Unraveled #7: Beyond Basics – Async Memory and Tensor Cores (Puzzles 28–29 + 33–34)
- GPU Threads Unraveled #8: From Puzzles to Production – Series Wrap-Up
Ready to dive in? Start with the official puzzles at https://puzzles.modular.com; they assume only basic Python knowledge, with no prior GPU experience needed. Follow the upcoming posts (links will appear in the list above as they’re published) over the coming weeks for stories, code, and observations.
Note: Puzzles 9 and 10 focus on NVIDIA-specific GPU debugging tools and do not currently run on Apple Silicon. Since I’m working on a MacBook, I’m skipping them as hands-on exercises and instead treating them as conceptual background between Posts #2 and #3.
Glossary: Key GPU Terms (Simplified)
This appendix provides brief, beginner-friendly explanations of the main GPU-related terms referenced in this post. These are core concepts in parallel computing, explained without deep technical jargon.
| Term | Simple Description |
|---|---|
| GPU | A specialised computer chip (like a graphics card) designed to handle many tasks at once, especially useful for AI and graphics because it processes data in parallel much faster than a regular CPU. The most common GPUs come from NVIDIA (e.g., GeForce, RTX, A100/H100 series), AMD (Radeon and Instinct lines), Intel (Arc and Data Center GPUs), and Apple (integrated GPUs in M-series chips). |
| Thread | The smallest "worker" unit on a GPU that performs a single task, like calculating one number in a big dataset. Thousands of threads run simultaneously to speed things up. |
| Block | A group of threads that work together on the GPU, sharing quick-access memory to collaborate efficiently (typically up to 1,024 threads per block). |
| Grid | The overall arrangement of blocks on the GPU, organising how all the work is divided across the hardware (see the indexing sketch below this table). |
| Shared Memory | Fast, temporary storage shared among threads in the same block, like a team whiteboard for quick data swaps that avoid round trips to slower global memory. |
| Reduction | A way to combine many values (e.g., summing a list of numbers) by breaking the work into parallel steps, turning a long sequential job into a fast group effort (see the reduction sketch below this table). |
| Tiled Matmul | A technique for multiplying large matrices (grids of numbers) by breaking them into smaller "tiles" that fit into fast memory, making the computation quicker and more efficient; a key operation in AI models. |
| Kernel | A small program that runs on the GPU, telling threads what to do (e.g., "add these numbers"). In Mojo, it's written like Python code but executes in parallel. |
| CUDA | NVIDIA's toolkit for programming GPUs, allowing direct control but often complex; Mojo aims to simplify this. |
| Metal | Apple's GPU programming framework for devices like MacBooks; it plays a similar role to CUDA but targets Apple hardware. |
| Warp | A small bundle of threads (usually 32) that execute instructions together on the GPU, like a mini-team that moves in sync for efficiency. |
| Tensor Cores | Specialised GPU hardware units that accelerate matrix math for AI tasks, making operations like training neural networks much faster. |
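To make the thread/block/grid picture concrete, here’s the standard indexing idiom the early puzzles drill. This is a minimal sketch with illustrative names (`double_kernel`, `size`); `thread_idx`, `block_idx`, and `block_dim` come from Mojo’s `gpu` module.

```mojo
from gpu import thread_idx, block_idx, block_dim
from memory import UnsafePointer

# Each thread derives a unique global index from its grid coordinates.
fn double_kernel(
    output: UnsafePointer[Scalar[DType.float32]],
    x: UnsafePointer[Scalar[DType.float32]],
    size: Int,
):
    i = block_idx.x * block_dim.x + thread_idx.x  # global position in the grid
    if i < size:  # guard: the grid is usually rounded up past the data size
        output[i] = x[i] * 2.0
```

The guard matters because you normally launch ceil(size / threads-per-block) blocks, so threads in the last block can land past the end of the data.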
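And here’s the shared-memory reduction pattern in miniature, a sketch assuming a single block of `TPB` threads. `stack_allocation` with `AddressSpace.SHARED` plus `barrier()` are Mojo’s counterparts to CUDA’s `__shared__` and `__syncthreads()`; the kernel name and sizes are illustrative.

```mojo
from gpu import thread_idx, barrier
from gpu.memory import AddressSpace
from memory import UnsafePointer, stack_allocation

alias TPB = 8  # threads per block (illustrative size)

# Tree reduction: sum TPB values within one block via shared memory.
fn sum_kernel(
    output: UnsafePointer[Scalar[DType.float32]],
    a: UnsafePointer[Scalar[DType.float32]],
):
    shared = stack_allocation[
        TPB, Scalar[DType.float32], address_space = AddressSpace.SHARED
    ]()
    i = thread_idx.x
    shared[i] = a[i]  # stage one value per thread on the "whiteboard"
    barrier()         # wait until every thread has written its value

    # Halve the number of active threads each step: 8 -> 4 -> 2 -> 1.
    var stride = TPB // 2
    while stride > 0:
        if i < stride:
            shared[i] += shared[i + stride]
        barrier()     # all partial sums finished before the next step
        stride //= 2

    if i == 0:
        output[0] = shared[0]  # thread 0 writes the final result
```

Each pass halves the active threads, so summing TPB values takes log2(TPB) steps rather than TPB − 1 sequential additions; the same shape scales up to the multi-block reductions and scans covered in Post #3.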