Production AI needs continuous testing

In many regulated companies, AI is no longer a pilot. It scores fraud, answers customer questions, drafts compliance notes, helps developers write code and acts inside CRM tools. A working demo is no longer enough. Teams need to know whether the system is still reliable today, under real traffic, with current data, current instructions, current tools and current providers.

The demo is not the production system

The Stanford AI Index 2026 counted 362 reported AI incidents in 2025, compared with 233 in 2024. The OECD AI Incidents and Hazards Monitor logged a peak of 435 incidents in January 2026. MIT NANDA's August 2025 report found that 95 percent of enterprise generative AI pilots produced no measurable profit-and-loss impact.

Those numbers point to the same problem. A controlled demo is a narrow test. Production is messy. Users ask unexpected questions. Documents change. Retrieval indexes drift. Vendors update models. Agents get new tools. Security assumptions that looked safe in staging can break once the system sees real data.

Failures now affect data, money and operations

The public cases have moved beyond chatbots inventing policies. In November 2025, Anthropic disclosed GTG-1002, a near-autonomous espionage campaign using Claude Code against roughly thirty organisations. In January 2026, Microsoft patched ShareLeak, a Copilot Studio flaw that let attackers exfiltrate data through a hidden instruction. In April 2026, Salesforce Agentforce leaked CRM data through PipeLeak, an indirect prompt injection flaw in agent pipelines.

These failures matter because they affect live operational state: data, approvals and actions. A model can leak data, cite the wrong source, approve the wrong action or keep working after the world around it has changed. A green infrastructure dashboard will not show that. CPU, latency and error rate can all look normal while the AI gives the wrong answer or takes the wrong action.

Four risks need continuous tests

Grounding failure: the answer contradicts the source it was supposed to use, or cites a source that does not support the claim.
Silent drift: data, labels, instructions, retrieval logic or provider models change, and the system keeps scoring without visible infrastructure errors.
Prompt injection: a hidden instruction in an email, document, webpage, wiki page or tool response changes the system's behaviour.
Agent action failure: the system calls the wrong tool, uses the wrong parameter, repeats an action or performs an irreversible action without approval.

A yearly audit is too slow for these risks. They can appear after a model upgrade, a corpus reindex, a new connector, a guardrail change or a supplier update. The test has to run when the system changes, not only when the audit calendar says so.
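One way to tie test runs to change rather than to the calendar is to fingerprint everything that can alter behaviour and rerun the tests whenever the fingerprint moves. The sketch below is a minimal illustration under assumptions: the component names, version strings and the `needs_retest` helper are hypothetical, not taken from any specific tool.

```python
import hashlib
import json
from typing import Dict, Optional


def config_fingerprint(components: Dict[str, str]) -> str:
    """Hash the versions of every component that can change system behaviour."""
    canonical = json.dumps(components, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_retest(current: Dict[str, str], last_tested: Optional[str]) -> bool:
    """True when the current state differs from the last state that was tested."""
    return config_fingerprint(current) != last_tested


# Hypothetical component versions, for illustration only.
tested = {
    "model": "provider-model-2026-01",
    "instructions": "v14",
    "retrieval_index": "2026-02-01",
    "connectors": "crm-v3",
    "guardrails": "v7",
}
fingerprint = config_fingerprint(tested)

# A corpus reindex changes one entry and therefore the fingerprint.
after_reindex = {**tested, "retrieval_index": "2026-03-15"}

print(needs_retest(tested, fingerprint))         # False
print(needs_retest(after_reindex, fingerprint))  # True
```

A supplier-side model update is harder to detect than a change you make yourself, so in practice the `model` entry would need to come from whatever version identifier the provider actually exposes.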

Monitoring is not evaluation

Monitoring tells you what happened: instructions, retrieved extracts, tool calls, latency, token counts and errors. Evaluation asks whether the system should have done that. Both are useful. Only evaluation tells a board, regulator or audit team whether the AI is still behaving as expected.

A healthy infrastructure dashboard can still hide a failing AI system.
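The difference can be shown on a single request. In the sketch below, the same trace passes a monitoring check and fails an evaluation check. The trace fields, the fee example and the substring comparison are all assumptions made for illustration; a real evaluator would use a stronger comparison than substring matching.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    question: str
    answer: str
    latency_ms: int
    error: bool


def monitoring_ok(trace: Trace, max_latency_ms: int = 2000) -> bool:
    """Monitoring view: did the call finish quickly and without an error?"""
    return not trace.error and trace.latency_ms <= max_latency_ms


def evaluation_ok(trace: Trace, required_facts: List[str]) -> bool:
    """Evaluation view: does the answer contain every fact the reference requires?"""
    answer = trace.answer.lower()
    return all(fact.lower() in answer for fact in required_facts)


# Hypothetical request: fast, error-free and wrong.
trace = Trace(
    question="What is the fee for an international transfer?",
    answer="International transfers are free of charge.",
    latency_ms=420,
    error=False,
)
reference_facts = ["EUR 5"]  # assumed controlled answer from the fee schedule

print(monitoring_ok(trace))                   # True
print(evaluation_ok(trace, reference_facts))  # False
```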

NIST made this distinction explicit in AI 800-4, published in March 2026. A credible programme needs infrastructure telemetry and model-behaviour telemetry. It also needs test sets tied to the obligations that apply to the system: the EU AI Act, NIST AI RMF, ISO/IEC 42001, DORA or sector guidance.

Regulators now expect proof over time

The EU AI Act requires post-market monitoring and serious incident reporting for high-risk systems. The Digital Omnibus political agreement of 7 May 2026 moved some high-risk deadlines, but it did not remove the need to monitor performance and incidents over the life of the system.

Other standards point in the same direction. NIST AI 800-4 separates deployment-time evaluation from post-deployment monitoring. ISO/IEC 42001 requires measurement and evaluation of AI performance and risk across the lifecycle. In banking and insurance, ECB, BaFin, EBA and ACPR materials all connect AI governance with existing operational, ICT and prudential risk controls.

Five tests should stop a release

Regulated answers: run a reference set for policies, fees, eligibility and other controlled answers. Drift beyond a threshold agreed in advance should block the release.
Prompt injection: rerun attacks against emails, documents, webpages, retrieved content and tool responses.
Grounding: reject answers that cite no source, cite an outdated source or cite a source that does not support the answer.
Agent tools: block high-impact tools such as delete, send, deploy or transact unless the required human approval is recorded.
Data leakage: flag any answer that reveals regulated data outside its declared scope, even partially.
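The agent-tools test is the easiest of the five to express as code. The sketch below assumes a hypothetical `ToolCall` record with an `approved_by` field; the set of high-impact tool names comes from the list above, and everything else is illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# Tool names taken from the list above; extend per system.
HIGH_IMPACT_TOOLS = {"delete", "send", "deploy", "transact"}


@dataclass
class ToolCall:
    tool: str
    approved_by: Optional[str] = None  # recorded human approver, if any


def is_allowed(call: ToolCall) -> bool:
    """Allow low-impact tools; require a recorded approver for high-impact ones."""
    if call.tool not in HIGH_IMPACT_TOOLS:
        return True
    return bool(call.approved_by)


print(is_allowed(ToolCall("search")))                       # True
print(is_allowed(ToolCall("delete")))                       # False
print(is_allowed(ToolCall("delete", approved_by="j.doe")))  # True
```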

The exact thresholds depend on the system. A tax chatbot, a fraud model and a coding agent do not fail in the same way. The common point is the release rule: before production, decide which test failures stop the release or revert the change.
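A release rule of this kind can be written down before the first test runs. The following sketch assumes each test suite reports a pass rate between 0 and 1. The suite names mirror the five tests above, while the thresholds and the `release_decision` helper are hypothetical placeholders to be set per system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateRule:
    name: str
    min_pass_rate: float  # assumed threshold, agreed before release


# Hypothetical thresholds; the right values depend on the system.
RULES = [
    GateRule("regulated_answers", 0.98),
    GateRule("prompt_injection", 1.00),
    GateRule("grounding", 0.95),
    GateRule("agent_tools", 1.00),
    GateRule("data_leakage", 1.00),
]


def release_decision(pass_rates: Dict[str, float], rules: List[GateRule]) -> List[str]:
    """Return the suites that block the release.

    A suite with no recorded result counts as a failure, so a skipped
    test can never let a release through.
    """
    blocking = []
    for rule in rules:
        rate = pass_rates.get(rule.name)
        if rate is None or rate < rule.min_pass_rate:
            blocking.append(rule.name)
    return blocking


# Illustrative results: grounding is below threshold, data leakage never ran.
results = {
    "regulated_answers": 0.99,
    "prompt_injection": 1.00,
    "grounding": 0.93,
    "agent_tools": 1.00,
}
blocking = release_decision(results, RULES)
print(blocking)  # ['grounding', 'data_leakage']
```

Treating a missing result as a failure is a deliberate choice in this sketch: the gate fails closed when a suite is skipped.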

Where to start

If you run one high-risk AI system, start small. Pick the two standards or regulations that actually bind you. Write the five tests above against the real system. Run them on every meaningful change. Keep the results with the model, instruction, data and tool versions used at the time.
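Keeping results together with versions can be as simple as writing one structured record per test run. The sketch below is one possible shape, not a prescribed format: the `RunEvidence` fields, the system name and the version strings are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class RunEvidence:
    system: str
    run_at: str                # UTC timestamp in ISO 8601 format
    versions: Dict[str, str]   # model, instruction, data and tool versions
    results: Dict[str, bool]   # test name -> passed


def record_run(system: str, versions: Dict[str, str], results: Dict[str, bool]) -> str:
    """Serialise one test run as a JSON line suitable for an append-only log."""
    evidence = RunEvidence(
        system=system,
        run_at=datetime.now(timezone.utc).isoformat(),
        versions=versions,
        results=results,
    )
    return json.dumps(asdict(evidence), sort_keys=True)


# Hypothetical system and versions.
line = record_run(
    system="claims-assistant",
    versions={"model": "provider-model-2026-01", "instructions": "v14",
              "data": "index-2026-02-01", "tools": "crm-v3"},
    results={"regulated_answers": True, "grounding": False},
)
print(line)
```

Appending one such line per run gives a record of what was tested, when, and against which versions.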

If you run ten or more systems, the manual approach breaks quickly. The goal is not another dashboard. The goal is a standing answer to the question your board and regulators will ask: what did you test, when, with which version, and what changed after the result?

References

Stanford HAI, AI Index Report 2026, 13 April 2026, hai.stanford.edu/ai-index/2026-ai-index-report.
OECD AI Incidents and Hazards Monitor, oecd.ai/en/incidents.
MIT NANDA, The GenAI Divide: State of AI in Business 2025, August 2025.
Anthropic, Disrupting the first reported AI-orchestrated cyber espionage campaign (GTG-1002), 13 November 2025.
Microsoft Security Response Center, CVE-2026-21520 ShareLeak, January 2026.
Salesforce Trust, Agentforce PipeLeak advisory, April 2026.
EU AI Act (Regulation 2024/1689), Articles 72 and 73, eur-lex.europa.eu; Digital Omnibus political agreement, 7 May 2026.
NIST AI 800-4, Challenges to the Monitoring of Deployed AI Systems, March 2026.
ISO/IEC 42001:2023.
ECB Banking Supervision, Supervisory priorities 2026 to 2028, 18 November 2025.
BaFin, Guidance on ICT Risks in the Use of Artificial Intelligence at Financial Entities, 30 January 2026.
EBA, AI Act mapping factsheet, 21 November 2025.
ACPR, 2026 work programme.
