This is Part 4 of a series on uplifting Model Risk Management (MRM) for deployed and agentic AI. If you have not read the prelude on deployer risk, start there: The AI industry's massive blind spot.
This instalment connects practical AI lifecycle controls to what boards, auditors, and regulators tend to ask for.
Executive Summary
- AI regulation differs across jurisdictions, but many approaches converge on lifecycle controls that look a lot like MRM: inventory, tiering, testing, monitoring, accountability, and third-party oversight.
- The common question under board/audit/regulator scrutiny is simple: can you demonstrate control, with evidence that holds up under challenge?
- For externally hosted models, treat time-variance as a practical governance reality: one “validation day” result may not be a stable reference point.
- A reusable governance pack (tiering, minimum control standards, monitoring plan, evidence, vendor oversight) is often more valuable than a one-off compliance document.
A research idea with a very practical consequence
A common (often unstated) assumption in model validation is that, if you hold the model and the test conditions constant, performance is broadly time-invariant.
A recent research paper (Tschisgale and Wulff, 2026) challenges that intuition for externally hosted LLMs: even when you query a fixed model snapshot in a consistent way, average performance may show day- and week-level variation. If that is directionally true in your environment, it has an immediate governance implication: a single “validation day” result is not a stable reference point.
This is one reason the series glossary calls out time-variance as a practical governance concept for externally hosted models.
This is not a reason to panic. It is a reason to treat time as a first-class variable in your monitoring design, and to be explicit about what constitutes evidence.
A scenario you will recognise
Consider the kind of question that appears in a board pack, an internal audit, or a regulator conversation:
- What AI do we use today?
- Which use cases are high impact?
- Who is accountable?
- What controls are in place?
- What evidence do we have that those controls work in practice?
In practice, those questions often land on risk teams because they are really operational questions. They are not primarily about “what the EU AI Act says”. They are about whether you can demonstrate control, using evidence that holds up under challenge.
For externally hosted models, there is an extra complication: the service can change (sometimes subtly) without you deploying anything. That is not just a versioning story; behaviour can vary as an operational property of how the model is served.
The point of comparison
Regulatory approaches differ in legal force and scope, but many converge on the same lifecycle expectations:
- know what AI you are using (inventory)
- understand impact and tier by risk
- test before release (and be clear about what the testing does not prove for externally hosted models; see the research note in the appendix)
- monitor through the lifecycle
- document evidence and accountability
- manage third-party dependencies
In other words, they land close to disciplines mature MRM teams already recognise.
What to put in front of a board, audit, or regulator
If you only produce one “governance pack”, make it something you can keep current and reuse.
A practical set of artefacts:
- Inventory: use cases, owners, vendors, and where AI is embedded in workflows.
- Tiering: a simple risk tier per use case, with an explanation of what makes it high impact.
- Minimum control standard: what validation and monitoring are required for each tier.
- Evidence: the latest validation summary, monitoring results, and incident log for material use cases.
- Third-party oversight: what you asked vendors for (change notices, incident reporting, limitations) and what you received.
A practitioner lens
If you already have MRM, the uplift tends to be less about inventing new governance and more about:
- modernising monitoring for drift and rapid change
- strengthening third-party oversight for vendor models
- treating integration controls as first-class risk controls
- including explicit ethical assessment (Part 2)
- handling autonomy and tool access (Part 3)
A useful output
A high-value deliverable many organisations can produce quickly is a simple mapping:
- AI use case → risk tier → minimum controls → monitoring plan → evidence artefacts
Below is a simple template, shown with a populated example.
Example: AI use case mapping
AI use case
- Name: Customer support chatbot (account enquiries)
- Users: Customers (public-facing)
- Business owner: Head of Customer Service
- Delivery owner: Digital Engineering Lead
Delivery pattern
- Vendor-hosted LLM (third-party) + RAG over approved knowledge base
- Tool access: read-only access to CRM for “order status” and “account profile” (no writes)
Risk tier / materiality
- Tier: High (customer impact + regulatory / conduct sensitivity)
- Why: can influence customer decisions; can provide incorrect guidance; reputational and compliance exposure
Key decisions / actions
- Allowed: answer product/account questions; retrieve customer-specific status; draft next steps
- Not allowed (red lines): account changes; refunds; hardship decisions; advice that implies a binding commitment
Control boundary
- Integration layer: tool permissions, approval gates, retrieval scope, and policy enforcement
Minimum controls (pre-release)
- Ethics assessment (“legal, safe, wise”) completed and signed off
- Black-box testing + red teaming for high-risk failure modes
- Validation summary for the workflow (not just the base model)
- Access control review for retrieval sources and tool permissions (least privilege)
- Change control: documented release process and rollback plan
Minimum controls (runtime)
- Approval gates for any action beyond read-only retrieval
- Observability: prompt/context capture, tool-call logs, model/version identifiers
- Kill-switch: immediate disable of tool access and/or customer exposure
Monitoring plan
- Canary evaluation suite run on a schedule (different times of day / days of week)
- Outcome monitoring: incorrect-answer rate, complaint volume, escalation rate, and near-misses
- Thresholds + escalation paths agreed (including who can trigger rollback)
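The canary element of a monitoring plan like this can be sketched as a small harness. The sketch below assumes a hypothetical `query_model` client standing in for the vendor API call (stubbed here so the example runs); the suite, threshold, and window are illustrative. An external scheduler (e.g. cron at varied times of day) would invoke `run_canary`, and each run records its timestamp so results can later be grouped by time.

```python
import statistics
from datetime import datetime, timezone

# Fixed canary suite: prompts with known expected answers.
# In practice these would exercise your actual workflow, not trivia.
CANARY_SUITE = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of 'hot'?", "cold"),
]

def query_model(prompt: str) -> str:
    """Placeholder for the vendor API call (hypothetical stub)."""
    canned = {"What is 2 + 2?": "4",
              "Capital of France?": "Paris",
              "Opposite of 'hot'?": "warm"}  # one wrong answer, for the demo
    return canned[prompt]

def run_canary(suite, query):
    """Run the fixed suite once; return a timestamped score record."""
    correct = sum(1 for prompt, expected in suite
                  if query(prompt).strip().lower() == expected.lower())
    return {"ts": datetime.now(timezone.utc).isoformat(),
            "score": correct / len(suite)}

def breaches_threshold(history, threshold=0.9, window=3):
    """Escalate when the rolling mean over the last `window` runs falls below threshold."""
    recent = [r["score"] for r in history[-window:]]
    return len(recent) == window and statistics.mean(recent) < threshold

history = [run_canary(CANARY_SUITE, query_model) for _ in range(3)]
print(breaches_threshold(history))  # each run scores 2/3, so True
```

The rolling window matters: a single bad run should prompt a look, not a rollback; a sustained breach should trigger the agreed escalation path.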
Evidence artefacts to keep current
- Inventory entry + tiering rationale
- Ethics assessment + decision custody notes
- Test results (pre-release) + monitoring reports (runtime)
- Incident log + post-incident reviews
- Access control evidence (permissions, approval-gate design)
- Vendor artefacts: model card, change notices, incident reporting commitments
Third-party oversight
- Contract expectations: change notices, incident reporting, retention constraints, support SLAs
- Exit plan: replacement model/service and cutover approach
Review cadence and triggers
- Cadence: monthly for the first three months, then quarterly
- Triggers: vendor model update; workflow/tool change; incident; sustained quality degradation
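One way to keep a mapping like this auditable is to hold each entry as structured data rather than prose, so it can be diffed, queried, and exported for a board pack. A minimal sketch, with field names drawn from the example above (the schema itself is illustrative, not a standard):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class UseCaseEntry:
    """One row of the use case -> tier -> controls -> monitoring -> evidence mapping."""
    name: str
    tier: str
    business_owner: str
    minimum_controls: list = field(default_factory=list)
    monitoring: list = field(default_factory=list)
    evidence: list = field(default_factory=list)

chatbot = UseCaseEntry(
    name="Customer support chatbot (account enquiries)",
    tier="High",
    business_owner="Head of Customer Service",
    minimum_controls=["ethics assessment", "black-box testing + red teaming",
                      "access control review", "rollback plan"],
    monitoring=["scheduled canary suite", "incorrect-answer rate",
                "complaint volume"],
    evidence=["validation summary", "monitoring reports", "incident log"],
)

# asdict() yields a plain dict, ready to serialise to JSON/YAML for the inventory.
record = asdict(chatbot)
print(record["tier"])  # High
```

Storing the mapping this way also makes the review cadence enforceable: a scheduled job can flag entries whose evidence artefacts are older than the agreed cadence.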
If you are not sure where to start, pick one meaningful use case and build the mapping end-to-end. The gaps become obvious when you try to turn “we have governance” into “we have evidence”.
If you want a concrete example of why time belongs in your monitoring design, the appendix below summarises the research paper and its practical implications.
Appendix: Research note - periodic variability in GPT-4o performance
Tschisgale and Wulff (2026) tested a simple but important assumption: that LLM performance is time-invariant under fixed conditions. They queried a fixed snapshot of GPT-4o via the API to solve the same multiple-choice physics task every three hours for nearly three months, with fixed hyperparameters and identical prompting. They generated multiple responses per time point and averaged scores to reduce sampling noise.
Their headline result is evidence of statistically significant periodic structure in average performance, consistent with an interaction of daily and weekly rhythms. They estimate that periodic components account for roughly 20% of the variance in the aggregated time series, and they report that the periodic component alone can produce meaningful swings in average score.
This matters for governance because it reframes what “evidence” looks like for externally hosted models:
- A single validation run is not a stable reference point. Your test day may coincide with a high or low period.
- Monitoring should include temporal coverage. For material use cases, outcome testing should run across different times of day and days of week.
- Canary evaluations become practical. Maintain a small suite of fixed prompts / tasks and run them on a schedule to detect changes.
- Vendor oversight becomes testable. If your monitoring shows variation, you can ask vendors for change notices, incident reporting, and an explanation of what may be driving the behaviour.
Limitations worth keeping in mind: this is one model snapshot, one task type, and one geography. It does not prove a cause (e.g. load-shedding), but it is enough to justify treating time as a first-class variable in your monitoring design.
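A first-pass check for this kind of temporal structure needs no special tooling: group canary scores by hour of day (or day of week) and compare the group means. The sketch below runs on synthetic data with an injected daily cycle, purely to illustrate the mechanics; a real analysis would use your logged canary scores and add significance testing, as the paper does.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Synthetic scores sampled every 3 hours for 4 weeks, with an injected
# daily cycle (illustrative only; real data would come from canary runs).
start = datetime(2025, 1, 6, tzinfo=timezone.utc)
samples = []
for i in range(8 * 28):  # 8 samples/day for 28 days
    ts = start + timedelta(hours=3 * i)
    score = 0.85 + 0.05 * math.sin(2 * math.pi * ts.hour / 24)
    samples.append((ts, score))

def mean_by(samples, key):
    """Average score per time bucket (e.g. hour of day, day of week)."""
    buckets = defaultdict(list)
    for ts, score in samples:
        buckets[key(ts)].append(score)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

by_hour = mean_by(samples, lambda ts: ts.hour)
spread = max(by_hour.values()) - min(by_hour.values())
print(f"hour-of-day spread in mean score: {spread:.3f}")  # 0.100
```

If the spread between the best and worst buckets is material relative to your acceptance threshold, that is exactly the evidence to take to vendor oversight conversations, and a reason to spread validation runs across those buckets rather than relying on a single test day.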
Series: Prelude · Part 1 · Part 2 · Part 3 · Part 4 · Glossary