
Exec Summary

The Post-training of LLMs short course from DeepLearning.AI explores the hands-on techniques that transform a general-purpose large language model (LLM) into a genuinely useful assistant, tool, or specialist. This post shares the what, why, and how of post-training, focusing on Supervised Fine-Tuning (SFT), Direct Preference Optimisation (DPO), and Online Reinforcement Learning (RL). I’ve just completed the course (see my certificate below), building on my recent exploration of the ACP: Agent Communication Protocol, and want to highlight why investing in ongoing, cutting-edge learning matters for real-world business impact.

What is Post-training?

Post-training is the process that takes a pre-trained LLM, one that’s learned from vast amounts of raw text, and "teaches" it to follow instructions, align with human-specified preferences, and perform specific tasks. If pre-training is the model’s broad education, post-training is where it "learns" to respond more appropriately and deliver value in the real world.

The course focuses on three main post-training methods; a sketch of the data each expects follows the list:

  • SFT: Training on curated input-output pairs so that the model learns to give ideal responses for specific prompts.
  • DPO: Teaching the model to prefer better outputs by optimising for “chosen” over “rejected” responses.
  • Online RL: Letting the model generate outputs, scoring them with a reward function, and updating the model to improve over time.
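
To make the differences concrete, here is roughly what a single training example looks like for each method. These are toy shapes I’ve written for illustration, not the course’s own datasets.

```python
# Illustrative example shapes for each method (toy examples, not the course's datasets).

# SFT: a prompt paired with an ideal response the model should imitate.
sft_example = {
    "prompt": "Summarise this email thread in one sentence.",
    "response": "The client approved the budget and asked for a kick-off call next week.",
}

# DPO: a prompt with a preferred ("chosen") and a worse ("rejected") answer.
dpo_example = {
    "prompt": "Explain overfitting to a non-technical stakeholder.",
    "chosen": "The model memorises its training examples instead of learning general patterns.",
    "rejected": "Overfitting happens when the GPU runs out of memory.",
}

# Online RL: only a prompt is stored; the model's own answer is scored by a
# reward function at training time (e.g. 1.0 if it contains "51", else 0.0).
online_rl_example = {
    "prompt": "What is 17 * 3?",
}
```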

Why Post-training?

A pre-trained LLM is powerful but often unrefined: it might sound plausible yet miss the mark on accuracy, tone, or usefulness. Post-training:

  • Aligns model behaviour with human intent: Making models safer, more helpful, and reliable.
  • Customises for specific business needs: Turning a generalist model into a specialist one (e.g., coding assistant, customer support agent, or maths tutor).
  • Delivers real-world performance: Boosting accuracy and usefulness for practical deployments.

How does Post-training work?

The course provides practical labs and walkthroughs for each method:

1. Supervised Fine-Tuning (SFT)

  • Curate a dataset of prompts and ideal responses.
  • Train the model to mimic these high-quality outputs (a minimal code sketch follows below).
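
As an illustration of the SFT recipe above, here is a minimal sketch using HuggingFace’s TRL library, one common way to run supervised fine-tuning (not necessarily the course’s exact setup). The model name and the one-row dataset are placeholders, and argument names can differ between TRL versions.

```python
# Minimal SFT sketch with Hugging Face TRL. The model name and the one-row toy
# dataset are placeholders; a real run needs a properly curated dataset.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Curated prompt / ideal-response pairs in chat format.
train_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "What does post-training do?"},
        {"role": "assistant", "content": "It adapts a pre-trained LLM to follow instructions and task-specific behaviour."},
    ]},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small example model from the Hugging Face Hub
    train_dataset=train_dataset,
    args=SFTConfig(output_dir="sft-demo", num_train_epochs=1),
)
trainer.train()
```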

2. Direct Preference Optimisation (DPO)

  • Present the model with pairs of responses (“chosen” vs “rejected”).
  • Optimise so the model prefers the better response (see the sketch below).
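
A comparable DPO sketch, again with TRL and the standard prompt/chosen/rejected preference format. The model name, the toy preference pair, and the beta value are my own illustrative choices.

```python
# Minimal DPO sketch with Hugging Face TRL, using the standard
# "prompt"/"chosen"/"rejected" preference format.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # small example model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One toy preference pair: the model is optimised to prefer "chosen" over "rejected".
train_dataset = Dataset.from_list([
    {
        "prompt": "Explain DPO in one sentence.",
        "chosen": "DPO trains the model to rank preferred answers above rejected ones.",
        "rejected": "DPO is a kind of database.",
    },
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-demo", beta=0.1),  # beta balances preference fit against staying close to the reference model
    train_dataset=train_dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older trl versions
)
trainer.train()
```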

3. Online Reinforcement Learning (RL)

  • The model generates a response.
  • A reward function (human or automated) scores the output.
  • The model updates to maximise future rewards (sketched below).
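
One way to implement this generate, score, update loop is TRL’s GRPOTrainer, an online RL variant; the sketch below uses a deliberately trivial reward plus placeholder model and prompts, so treat it as the shape of the pipeline rather than the course’s recipe.

```python
# Sketch of the generate -> score -> update loop using TRL's GRPOTrainer
# (one online RL variant). Model name, prompts, and reward are illustrative.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Prompts the model will answer during training (toy-sized for illustration).
train_dataset = Dataset.from_list([
    {"prompt": "Summarise in one sentence: post-training adapts a pre-trained LLM to follow instructions."},
    {"prompt": "Summarise in one sentence: a reward function scores model outputs during online RL."},
])

# A trivial automated reward that favours concise completions; in practice the
# reward encodes whatever "better" means for your use case.
def reward_conciseness(completions, **kwargs):
    return [-float(len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # small example model
    reward_funcs=reward_conciseness,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=train_dataset,
)
trainer.train()
```

The reward function is where the behaviour is really specified: swap it for a different heuristic, a learned reward model, or human scores, and the same loop optimises for something else.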

The labs get you hands-on with HuggingFace models: you implement SFT, DPO, and RL pipelines and see how each shapes model behaviour.

Who is the course for?

If you’re building with LLMs and want to go beyond out-of-the-box capabilities, whether for safety, accuracy, or user experience, this course is for you. Some Python and LLM basics help, but the focus is practical and hands-on. As with most methods and techniques in the LLM space, there is a plethora of choices and parameters to understand and explore; the notebooks provide a sensible way to experiment.

From my side, I continue to invest in advanced, hands-on courses like this: staying at the forefront of AI means I can bring the latest, most effective techniques directly into people’s businesses and help separate the value from the noise. In a field moving this fast, "training" isn’t just for LLMs but for experienced practitioners too; it’s how I stay relevant and can deliver real and sustainable value.

Post-training of LLMs - Course Certificate

Appendix – Post-training Methods at a Glance

| Method    | What it does                                       | Typical Use                      |
|-----------|----------------------------------------------------|----------------------------------|
| SFT       | Teaches the model to produce ideal outputs         | Instruction-following, alignment |
| DPO       | Optimises the model to prefer better responses     | Subtle alignment, tone, safety   |
| Online RL | Continuously improves the model via reward signals | Ongoing optimisation, alignment  |