Shipping software is a solved problem. You push a commit, tests run, a pipeline promotes the artifact to production, and a rollback is one command away. The discipline took decades to mature — from fragile deploy scripts to GitHub Actions running in seconds.
AI models break most of those assumptions.
A model is not a function that returns deterministic output. Its behavior is defined by training data, hyperparameters, architecture decisions, and — in the case of large language models — the exact phrasing of a prompt. Change any one of those and you have a different model. The question CI/CD for AI has to answer is: how do you treat something that learns as if it were code?
This post walks through the architecture of a production-grade AI pipeline, from classical ML to LLMs, and the specific tooling and patterns that make it work in 2026.
Why traditional CI/CD breaks for AI
A standard CI/CD pipeline tests behaviour against a known specification. Given input X, the function returns Y. Pass or fail.
AI introduces three problems that invalidate this model entirely.
Non-determinism. With sampling enabled, an LLM can return different text for the same prompt on every call, so you cannot write an assertEqual test for a language model. Classical models have a related problem: a classifier's predictions are repeatable, but its accuracy degrades over time as the real-world distribution of inputs shifts away from the training distribution, even if the code never changes.
Artifacts are large and stateful. A trained model checkpoint is not source code. A fine-tuned LLaMA variant can be 70GB. You cannot practically store it in a plain Git repository, and you cannot diff it meaningfully. And the artifact you deploy today will become stale as the world changes around it.
The pipeline itself is a product. In traditional software, the CI/CD pipeline is infrastructure — a means to an end. In AI, the pipeline is the product. Retraining schedules, evaluation gates, and drift monitors are first-class engineering concerns, not operational afterthoughts.
The three layers of AI CI/CD
A mature AI delivery pipeline operates across three layers. Each extends its predecessor.
Layer 1 (CI): Code + data validation → training → model artifact
Layer 2 (CD): Model evaluation → staging → canary → production
Layer 3 (CT): Drift detection → retraining trigger → loop back to Layer 1
Many teams get as far as Layer 2. Very few run Layer 3 in full automation. That gap is where model quality silently degrades in production.
Layer 1: Continuous Integration for models
The commit hook for an AI system validates more than code.
Data validation gates
Before a training run starts, the pipeline must verify the input data. Schema validation is the minimum — correct column names, expected types, no null explosions. Beyond that, statistical validation catches distribution shifts early: is the mean of feature X within two standard deviations of the training baseline? Has a new category appeared in a categorical column that the model has never seen?
Tools like Great Expectations and Evidently AI run these checks as pipeline steps, failing the build before a single GPU second is spent on corrupt data.
# .github/workflows/train.yml
- name: Validate dataset
  run: |
    python -m great_expectations checkpoint run training_data_checkpoint
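The statistical checks described above can also be sketched in plain Python, without any validation library. This is a minimal illustration, not how Great Expectations or Evidently implement them; the function names, the two-sigma default, and the sample values are all assumptions made for the example.

```python
from statistics import mean, stdev


def mean_within_baseline(baseline: list[float], batch: list[float],
                         max_sigma: float = 2.0) -> bool:
    """True if the batch mean is within max_sigma baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(batch) - mu) <= max_sigma * sigma


def unseen_categories(known: set[str], batch: list[str]) -> set[str]:
    """Categories in the batch that never appeared in the training data."""
    return set(batch) - known


def validate_batch(baseline: list[float], batch: list[float],
                   known: set[str], categories: list[str]) -> list[str]:
    """Collect human-readable failures; an empty list means the gate passes."""
    failures = []
    if not mean_within_baseline(baseline, batch):
        failures.append("feature mean outside baseline range")
    new = unseen_categories(known, categories)
    if new:
        failures.append(f"unseen categories: {sorted(new)}")
    return failures
```

A pipeline step would call validate_batch and exit non-zero when the returned list is not empty, which fails the build before training starts.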
Experiment tracking and reproducibility
Every training run should be reproducible. That means tracking not just the model weights but the full context that produced them: dataset version, code commit SHA, hyperparameters, environment dependencies.
MLflow and Weights & Biases both solve this. The model that reaches production should be traceable back to a specific experiment run with a single ID.
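import mlflow

# `model` and `val_acc` come from the training and validation steps (not shown)
with mlflow.start_run():
    # Hyperparameters that define this run
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("epochs", 10)
    # Validation metric for comparing this run against others
    mlflow.log_metric("val_accuracy", val_acc)
    # Store the trained scikit-learn model as an artifact of this run
    mlflow.sklearn.log_model(model, "model")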
Model evaluation gates
The training run produces a candidate model. Before it touches staging, it must pass evaluation gates — automated tests that mirror the acceptance criteria a data scientist would apply manually.
These gates typically cover:
- Accuracy threshold — the model must meet a minimum performance bar on a held-out test set
- Bias checks — performance must not degrade significantly across demographic slices
- Regression tests — a golden dataset of known inputs must produce outputs within an acceptable range
- Latency benchmarks — inference time must stay within the SLA at peak load
Only after all gates pass does the artifact get promoted to the model registry.
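The four gates can be expressed as one function that returns the names of the gates a candidate failed. This is a sketch under stated assumptions: the CandidateMetrics fields, the threshold defaults, and the use of a best-to-worst slice gap as the bias check are illustrative choices, not values from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class CandidateMetrics:
    accuracy: float                   # on the held-out test set
    slice_accuracy: dict[str, float]  # accuracy per demographic slice
    golden_pass_rate: float           # share of golden examples within tolerance
    p95_latency_ms: float             # measured in a peak-load benchmark


def failed_gates(m: CandidateMetrics,
                 min_accuracy: float = 0.90,
                 max_slice_gap: float = 0.05,
                 min_golden_pass_rate: float = 0.98,
                 max_latency_ms: float = 200.0) -> list[str]:
    """Return the names of the gates the candidate failed (empty means promote)."""
    failures = []
    if m.accuracy < min_accuracy:
        failures.append("accuracy")
    # Bias check: gap between the best and worst performing slice
    if m.slice_accuracy:
        gap = max(m.slice_accuracy.values()) - min(m.slice_accuracy.values())
        if gap > max_slice_gap:
            failures.append("bias")
    if m.golden_pass_rate < min_golden_pass_rate:
        failures.append("regression")
    if m.p95_latency_ms > max_latency_ms:
        failures.append("latency")
    return failures
```

Returning the failed gate names rather than a single boolean keeps the build log useful: the person who triggered the run sees which criterion blocked promotion.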
Layer 2: Continuous Delivery for models
The model registry is the handoff point between training and serving. Think of it as the container registry equivalent for model artifacts — a versioned store with promotion stages.
Model Registry
├── candidate/ ← just trained, not yet validated
├── staging/ ← passed evaluation gates
└── production/ ← live, serving traffic
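The promotion stages above behave like a small state machine. The sketch below is a toy in-memory version to show the idea; ModelRegistry and its methods are hypothetical names, and a real registry such as MLflow's has its own API. Archiving the previous production version is left out for brevity.

```python
STAGES = ("candidate", "staging", "production")


class ModelRegistry:
    """Toy in-memory registry: each model version sits in exactly one stage."""

    def __init__(self) -> None:
        self._stage_of: dict[str, str] = {}

    def register(self, version: str) -> None:
        # Every freshly trained model starts as a candidate
        self._stage_of[version] = "candidate"

    def stage(self, version: str) -> str:
        return self._stage_of[version]

    def promote(self, version: str) -> str:
        """Move a version one stage forward; stages cannot be skipped."""
        index = STAGES.index(self._stage_of[version])
        if index == len(STAGES) - 1:
            raise ValueError(f"{version} is already in production")
        self._stage_of[version] = STAGES[index + 1]
        return self._stage_of[version]
```

Forcing promotion to go one stage at a time means a model cannot reach production without having passed through staging first.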
Champion/challenger deployments
Never replace a production model directly. The pattern that works is champion/challenger: the current production model (champion) continues serving the majority of traffic while the new model (challenger) handles a small slice — typically 5–10%.
Shadow testing is a safer variant: the challenger runs on all production traffic, but its outputs are logged and never returned to users. You collect its predictions against real inputs without affecting anyone, then compare distributions before committing to any promotion.
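import random

# Route 10% of traffic to challenger
# (`features` is the request payload; named to avoid shadowing the built-in input())
if random.random() < 0.10:
    response = challenger_model.predict(features)
    log_challenger_output(response)
else:
    response = champion_model.predict(features)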
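Shadow testing differs from the split above in one respect: the user always gets the champion's answer. A minimal sketch follows, assuming both models are plain callables and the log is an in-memory list; serve_with_shadow and the record fields are hypothetical names. In a real service the shadow call would usually run asynchronously so it adds no latency.

```python
from typing import Any, Callable


def serve_with_shadow(features: Any,
                      champion: Callable[[Any], Any],
                      challenger: Callable[[Any], Any],
                      shadow_log: list[dict]) -> Any:
    """Answer with the champion; record the challenger's output for later comparison."""
    response = champion(features)
    record = {"features": features, "champion": response}
    try:
        record["challenger"] = challenger(features)
    except Exception as exc:
        # A failing challenger must never affect the user-facing response
        record["error"] = repr(exc)
    shadow_log.append(record)
    return response
```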
Blue-green and canary releases
For infrastructure-level deployments, blue-green and canary strategies apply directly. Blue-green keeps two identical environments; the switch is near-instantaneous and the rollback is a load-balancer or DNS flip back to the previous environment. Canary releases increase traffic gradually (1% → 5% → 25% → 100%), with automatic rollback if error rates exceed a threshold.
In 2026, many teams run canary releases with automated rollback rather than manual promotion gates. The speed gain is significant: promotion happens in minutes, not in the next sprint.
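The canary loop with automatic rollback fits in a few lines. In this sketch, error_rate_at is a hypothetical callback standing in for a query against your metrics system, and the 2% threshold is an arbitrary illustrative default.

```python
from typing import Callable

CANARY_STEPS = (1, 5, 25, 100)  # percent of traffic sent to the new model


def run_canary(error_rate_at: Callable[[int], float],
               max_error_rate: float = 0.02) -> tuple[str, int]:
    """Walk through the traffic steps and stop at the first threshold breach.

    Returns the outcome and the traffic percentage at which it was decided.
    """
    for percent in CANARY_STEPS:
        if error_rate_at(percent) > max_error_rate:
            # Breach: send all traffic back to the previous model
            return ("rolled_back", percent)
    return ("promoted", CANARY_STEPS[-1])
```

A production version would also wait at each step long enough to collect a meaningful sample before reading the error rate.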
Layer 3: Continuous Training
This is the layer that most CI/CD guides skip. It is also the layer that determines whether your model is still useful six months after deployment.
Models decay. The world changes. User behaviour shifts. A demand-forecasting model trained before a major economic event will produce systematically wrong predictions after it. A content moderation model trained on 2024 language patterns will miss emerging slang by 2026. When the relationship between inputs and correct outputs changes like this, it is called concept drift, and it is silent by default.
Detecting drift
Two signals matter most:
Data drift — the statistical distribution of incoming inputs diverges from the training distribution. Feature values shift, new categories appear, correlations between features change.
Performance drift — model accuracy on ground-truth labels degrades. This is harder to detect because ground truth often arrives with a lag (you predict loan default at origination; you know if it was correct 12 months later).
Evidently AI and NannyML both provide open-source drift detection that integrates into monitoring pipelines. The output is a signal: drift detected above threshold → trigger retraining.
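# Uses Evidently's older Report API; newer releases changed the import paths
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# train_df and production_df are pandas DataFrames with the same columns
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=production_df)

# The preset's first metric reports whether the dataset as a whole has drifted
if report.as_dict()["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining_pipeline()  # hook into your orchestrator (not shown)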
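To make the idea of a drift score concrete, here is a from-scratch sketch of the Population Stability Index (PSI) for a single numeric feature. PSI is a swapped-in illustration, not necessarily the test Evidently or NannyML applies by default; the bin count and the floor for empty bins are assumptions.

```python
import math


def psi(reference: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference feature

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            # Values outside the reference range fall into the edge bins
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) and division by zero for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples score zero, and the score grows as the two distributions move apart. A commonly cited rule of thumb treats values above roughly 0.25 as significant drift, but the right threshold depends on the feature and should be tuned on your own data.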
Automated retraining
When drift is detected, the pipeline should retrain automatically — not on a calendar schedule, but on a signal. Event-driven retraining means the model stays current with minimal human intervention.
The retraining job is the same Layer 1 pipeline, triggered by the monitoring system rather than a Git push.
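A signal-driven trigger usually needs one guard: a cooldown, so that a drift score hovering above the threshold does not launch a new training run on every monitoring cycle. The sketch below assumes a single scalar drift score; RetrainingTrigger, the 0.25 threshold, and the 24-hour cooldown are illustrative choices.

```python
from datetime import datetime, timedelta
from typing import Optional


class RetrainingTrigger:
    """Fire when drift exceeds a threshold, at most once per cooldown window."""

    def __init__(self, drift_threshold: float = 0.25,
                 cooldown: timedelta = timedelta(hours=24)) -> None:
        self.drift_threshold = drift_threshold
        self.cooldown = cooldown
        self.last_triggered: Optional[datetime] = None

    def should_retrain(self, drift_score: float, now: datetime) -> bool:
        if drift_score <= self.drift_threshold:
            return False
        in_cooldown = (self.last_triggered is not None
                       and now - self.last_triggered < self.cooldown)
        if in_cooldown:
            # A retraining run was started recently; let it finish first
            return False
        self.last_triggered = now
        return True
```

When should_retrain returns True, the monitoring job would start the Layer 1 pipeline, for example through a workflow dispatch event in your CI system.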
LLMOps: where the rules change
Large language models introduce a set of challenges that classical MLOps tooling was not designed for.
Prompts are source code
A prompt is not configuration. It is the primary interface between your application and the model's behaviour. A two-word change to a system prompt can fundamentally alter outputs in ways that affect every downstream user.
Prompts need the same engineering discipline as code:
- Version control — every prompt change tracked with a commit SHA, author, and changelog
- Peer review — prompt PRs reviewed by at least one other engineer or prompt specialist
- Automated testing — every prompt change runs against a golden evaluation dataset before merge
- Rollback — if a deployed prompt causes regressions, revert to the previous version in seconds
Storing prompts as string literals in application code couples your prompt iteration cycle to your release cycle. Use a prompt management or evaluation tool (LangSmith, Agenta, Promptfoo) and decouple them.
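Decoupling means the application asks a registry for the active prompt at runtime instead of embedding it. The toy registry below shows the two operations that matter, publishing a new version and rolling back. PromptRegistry is a hypothetical in-memory stand-in, not the API of LangSmith, Agenta, or Promptfoo.

```python
import hashlib


class PromptRegistry:
    """Toy prompt store: content-addressed versions plus a deployment history."""

    def __init__(self) -> None:
        self._versions: dict[str, str] = {}  # version id -> prompt text
        self._history: list[str] = []        # deployment order, newest last

    def publish(self, text: str) -> str:
        # The version id is derived from the text, so identical prompts share an id
        version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
        self._versions[version] = text
        self._history.append(version)
        return version

    def active(self) -> str:
        return self._versions[self._history[-1]]

    def rollback(self) -> str:
        """Reactivate the previously deployed prompt and return its text."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self.active()
```

Because rollback only moves a pointer, reverting a bad prompt does not require rebuilding or redeploying the application.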
Semantic evaluation, not exact-match testing
You cannot assert that a language model returns the exact string "Paris". You can assert that the response is semantically correct, relevant, non-toxic, and within the expected format.
The emerging standard is LLM-as-judge: a separate evaluation model scores outputs on dimensions like accuracy, relevance, and groundedness. This produces a quantitative signal from a qualitative output.
def evaluate_response(question: str, response: str) -> dict:
    # Build the grading prompt for the judge model
    judge_prompt = f"""
    Question: {question}
    Response: {response}
    Rate this response on:
    - Accuracy (0-10)
    - Relevance (0-10)
    - Groundedness (0-10)
    Return JSON only.
    """
    # judge_model is a placeholder for your evaluation-model client
    return judge_model.evaluate(judge_prompt)
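Judge scores become a merge gate once they are aggregated over a golden dataset. The sketch below assumes the judge returns a dict with accuracy, relevance, and groundedness keys on a 0-10 scale, as in the prompt above. Taking the weakest dimension per example and requiring a mean of 7.0 are illustrative choices, and generate and judge are hypothetical callables you would supply.

```python
from statistics import mean
from typing import Callable


def prompt_passes(golden_set: list[dict],
                  generate: Callable[[str], str],
                  judge: Callable[[str, str], dict],
                  min_mean_score: float = 7.0) -> bool:
    """Run every golden question through the model and gate on the judge's scores."""
    scores = []
    for example in golden_set:
        response = generate(example["question"])
        rating = judge(example["question"], response)
        # Score each example by its weakest dimension, so a fluent but
        # ungrounded answer cannot pass on relevance alone
        scores.append(min(rating["accuracy"],
                          rating["relevance"],
                          rating["groundedness"]))
    return mean(scores) >= min_mean_score
```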
RAG pipeline versioning
Retrieval-Augmented Generation systems add a third component to the versioning problem. You now have:
- The base model (or fine-tuned variant)
- The prompt template
- The retrieval index (vector database + embedding model + chunking strategy)
A change to the embedding model requires full re-indexing. A change to the chunking strategy changes what the retriever returns. A change to the retriever changes what the LLM sees. All three need to be versioned and promoted together.
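One way to promote the three components together is to pin them in a single immutable release record. This is a sketch under stated assumptions: RagRelease, its fields, and the example values are hypothetical, and treating a chunking change as requiring a re-index is an assumption based on the fact that the chunks are what gets embedded.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RagRelease:
    base_model: str
    prompt_version: str
    embedding_model: str
    chunking: str        # for example "512-token windows, 64 overlap"
    index_snapshot: str  # identifier of the built vector index

    def release_id(self) -> str:
        """Stable id derived from every pinned component."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def requires_reindex(old: RagRelease, new: RagRelease) -> bool:
    """Stored vectors are only valid for the embedding model and chunks that built them."""
    return (old.embedding_model, old.chunking) != (new.embedding_model, new.chunking)


current = RagRelease("base-llm-v1", "prompt-3", "embed-v1", "512/64", "index-2026-01")
candidate = replace(current, embedding_model="embed-v2")
```

Here requires_reindex(current, candidate) is true, so the pipeline would rebuild the index and record the new snapshot in the release before promoting it. A prompt-only change produces a new release_id but leaves the index untouched.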
A practical stack in 2026
For teams building from scratch, this stack covers the full pipeline without vendor lock-in:
| Concern | Tool |
|---|---|
| Experiment tracking | MLflow |
| Data versioning | DVC |
| Pipeline orchestration | ZenML or Prefect |
| Model registry | MLflow Registry or Hugging Face Hub |
| CI/CD execution | GitHub Actions |
| Drift monitoring | Evidently AI |
| LLM evaluation | LangSmith or Agenta |
| Serving | FastAPI + Docker + Kubernetes |
| Prompt versioning | Agenta or custom Git-based workflow |
Cloud-managed alternatives (SageMaker Pipelines, Vertex AI, Azure ML) trade flexibility for reduced operational overhead — the right call for teams without dedicated MLOps engineers.
Maturity levels
If you are assessing where your team sits today:
Level 0 — Manual. Models trained in notebooks. Deployed by copying a file. No versioning, no monitoring.
Level 1 — Partial automation. Training scripts in Git. Some experiment tracking. Manual deployment after human review.
Level 2 — CI/CD pipeline. Automated training, evaluation gates, model registry, canary deployments. The target for most product teams.
Level 3 — Continuous training. Drift detection triggers automated retraining. Champion/challenger promotion is fully automated. Human review reserved for model architecture changes.
Most organisations in 2026 are between Level 1 and Level 2. Reaching Level 3 is less about tooling than about trust — the confidence that your evaluation gates are robust enough to promote models without human sign-off.
The non-technical problem
The hardest part of CI/CD for AI is not the tooling. It is the culture.
Data scientists optimise for model performance. Engineers optimise for reliability. Product managers optimise for shipping speed. A mature MLOps practice requires all three to agree on what "done" means for a model: what evaluation threshold justifies promotion, what drift level justifies retraining, and who is on call when a deployed model degrades at 2am.
Those agreements need to exist before the pipeline is built. The pipeline enforces them; it does not create them.
Building models that work in notebooks is the easy part. Building systems that keep them working in production — over months, across traffic spikes, through distribution shifts and model updates — is the actual engineering challenge.
The teams that solve it treat every model as a service: versioned, tested, monitored, and always one rollback away from the last known good state.
Questions or feedback? Reach out at bidekani.com