CI/CD for Machine Learning: Automating Model Validation and Deployment
Building a proper CI/CD pipeline for ML — automated model testing, data validation, performance regression checks, and safe deployment patterns including canary releases and shadow mode.
Rohit Raj · 4 min read
Introduction
Software engineers have decades of CI/CD best practices. ML teams often ignore them. The result: models deployed without validation, silent regressions discovered weeks later, no rollback capability.
This post brings proper software engineering discipline to ML model deployment.
The ML CI/CD Pipeline
Code Push (model code / training script)
│
▼
┌──────────────┐
│ Static │ ← Lint, type-check, unit tests for data/feature code
│ Checks │
└──────┬───────┘
│ pass
▼
┌──────────────┐
│ Data │ ← Schema validation, drift detection vs. last run
│ Validation │
└──────┬───────┘
│ pass
▼
┌──────────────┐
│ Model │ ← Train on CI data, evaluate on holdout
│ Training │
└──────┬───────┘
│ model meets thresholds
▼
┌──────────────┐
│ Regression │ ← Compare vs. production model on benchmark slice
│ Testing │
└──────┬───────┘
│ no regression
▼
┌──────────────┐
│ Shadow Mode │ ← Run alongside production, log predictions (no action)
└──────┬───────┘
│ shadow metrics acceptable
▼
┌──────────────┐
│ Canary (5%) │ ← Route 5% of traffic, monitor KPIs
└──────┬───────┘
│ stable
▼
┌──────────────┐
│ Full Rollout │ ← Gradual ramp to 100%
└──────────────┘
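The gate sequence above can be sketched as a simple orchestrator that runs each stage in order and stops at the first failure. This is a minimal sketch; the stage names and stub callables are illustrative, not part of any real CI framework:

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run CI/CD gates in order; abort on the first failing stage."""
    for name, gate in stages:
        print(f"Running stage: {name}")
        if not gate():
            print(f"Pipeline stopped: '{name}' failed")
            return False
    print("All gates passed — promoting model")
    return True

# Example wiring with stub gates standing in for the real checks above
stages = [
    ("static_checks", lambda: True),
    ("data_validation", lambda: True),
    ("model_training", lambda: True),
    ("regression_testing", lambda: False),  # simulated regression
    ("shadow_mode", lambda: True),
]
run_pipeline(stages)  # stops at regression_testing
```

In a real setup each stub would call out to the corresponding job (linting, Great Expectations, training, evaluation), but the control flow is the same: every gate must pass before the next one runs.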
Step 1: Data Validation with Great Expectations
Catch data quality issues before they corrupt your model:
python
import great_expectations as ge

def validate_training_data(df_path: str) -> bool:
    """Validate training data against defined expectations."""
    df = ge.read_csv(df_path)

    # Register expectations on the dataset (each call records a result)
    df.expect_column_to_exist("customer_id")
    df.expect_column_values_to_not_be_null("label")
    df.expect_column_median_to_be_between("credit_score", 600, 750)
    df.expect_column_values_to_be_between("loan_amount", 1_000, 500_000)
    df.expect_column_values_to_be_in_set("risk_tier", ["LOW", "MEDIUM", "HIGH"])

    results = df.validate()
    if not results["success"]:
        failed = [r for r in results["results"] if not r["success"]]
        print(f"Data validation failed: {failed}")
        return False
    return True
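The data-validation stage in the pipeline diagram also checks drift against the last run. One common approach is the Population Stability Index (PSI) over binned feature values. Below is a minimal, dependency-free sketch; the bin count and the usual 0.1/0.2 interpretation thresholds are common conventions, not something prescribed by this pipeline:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature.
    PSI < 0.1 is usually read as stable; > 0.2 as significant drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1  # clamp values outside the baseline range
        # small epsilon avoids log(0) for empty bins
        return [(c / len(values)) or 1e-6 for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [600 + i % 150 for i in range(1000)]  # last run's credit scores
current = [650 + i % 150 for i in range(1000)]   # shifted distribution
print(f"PSI = {psi(baseline, current):.3f}")     # a large PSI signals drift
```

In CI, a PSI above the alert threshold on any key feature would fail the data-validation gate just like a schema violation.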
Step 2: Model Performance Gates
Enforce hard minimum performance thresholds before any model is promoted:
yaml
# .github/model_thresholds.yaml
thresholds:
  auc_roc: 0.82            # must beat this or the pipeline fails
  precision_at_5pct: 0.65
  max_fpr_at_recall_80: 0.12
  fairness:
    min_disparate_impact: 0.80
    max_equalized_odds_gap: 0.05
python
# In the CI pipeline
def check_model_gates(metrics: dict, thresholds: dict) -> bool:
    failures = []
    for metric, threshold in thresholds.items():
        if isinstance(threshold, dict):
            continue  # nested fairness thresholds are checked in a separate pass
        # "max_"-prefixed metrics are upper bounds (lower is better);
        # everything else is a minimum the model must meet or beat
        if metric.startswith("max_"):
            value = metrics.get(metric, float("inf"))
            if value > threshold:
                failures.append(f"{metric}: {value:.4f} > {threshold}")
        else:
            value = metrics.get(metric, 0.0)
            if value < threshold:
                failures.append(f"{metric}: {value:.4f} < {threshold}")
    if failures:
        print("❌ Model gates FAILED:")
        for f in failures:
            print(f"  {f}")
        return False
    print("✅ All model gates passed")
    return True
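The regression-testing gate in the diagram compares the candidate against the current production model on a fixed benchmark slice. Here is a minimal sketch of that comparison; the metric names and the absolute tolerance are illustrative assumptions, and it assumes higher is better for every reported metric:

```python
def check_regression(candidate: dict, production: dict,
                     tolerance: float = 0.005) -> bool:
    """Fail if the candidate underperforms production on any shared metric
    by more than `tolerance` (absolute)."""
    regressions = []
    for metric, prod_value in production.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            continue  # candidate did not report this metric
        if cand_value < prod_value - tolerance:
            regressions.append(
                f"{metric}: {cand_value:.4f} vs production {prod_value:.4f}")
    if regressions:
        print("Regression detected:")
        for r in regressions:
            print(f"  {r}")
        return False
    return True

prod = {"auc_roc": 0.85, "precision_at_5pct": 0.70}
cand = {"auc_roc": 0.84, "precision_at_5pct": 0.71}
check_regression(cand, prod)  # auc_roc dropped beyond tolerance → fails
```

Running both models on the same benchmark slice (rather than comparing historical numbers) keeps the comparison honest when the evaluation data itself has shifted.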