Data Drift

Definition

Data drift occurs when the statistical properties of production input data change over time compared to the training data, causing model performance to degrade even though the model itself hasn't changed.

Why It Matters

Models learn patterns from training data. When production data no longer matches those patterns, predictions become unreliable, often silently. Your model keeps returning confident predictions, but they’re increasingly wrong.

Real-world examples abound: customer behavior changes seasonally, economic conditions shift, new products launch, competitors change strategies, world events alter patterns. COVID-19 caused massive data drift across virtually every industry, breaking models that had worked for years.

For AI engineers, data drift is a constant threat. LLM applications face it too, as user query patterns change, document collections evolve, and the topics people ask about shift over time. A RAG system optimized for last quarter’s questions may struggle with this quarter’s.

Implementation Basics

Detecting data drift requires comparing production distributions against a reference:

1. Feature-Level Monitoring: Track statistical properties of each input feature (mean, variance, quantiles, cardinality); significant changes signal drift. For text, monitor vocabulary distribution, sequence lengths, and topic distributions.

2. Statistical Tests: Apply the Kolmogorov-Smirnov test to continuous features and the chi-square test to categorical ones; the Population Stability Index (PSI) quantifies drift magnitude. These tests compare production data windows against training data or baseline periods (see the first sketch after this list).

3. Embedding Drift: For LLM applications, monitor embedding distributions. Compute centroid distances, average pairwise similarities, or use dimensionality reduction to visualize the shift. If user queries cluster differently than before, you have drift (see the embedding sketch after this list).

4. Alerting & Response Set thresholds for acceptable drift. Alert when exceeded. Response options: retrain on recent data, collect more labeled examples, adjust feature engineering, or accept degraded performance temporarily.

5. Root Cause Analysis: When drift occurs, understand why. Upstream data changes? New user segments? External events? Understanding the cause informs the response: sometimes drift is expected and the model should simply adapt.
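
For steps 1 and 2, here is a minimal sketch, assuming `reference` holds one numeric feature from the training sample and `current` holds the same feature from a recent production window; the quantile binning scheme and the epsilon clipping are implementation choices, not a fixed standard:

```python
import numpy as np
from scipy import stats

def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    # Bin edges come from reference quantiles so each bin holds roughly equal mass.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip empty bins so the log term stays finite.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Stand-ins for a training sample and a production window of one feature.
reference = np.random.normal(0.0, 1.0, 10_000)
current = np.random.normal(0.4, 1.2, 2_000)

# Step 1: summary statistics per feature.
print(f"mean {reference.mean():.2f} -> {current.mean():.2f}, "
      f"std {reference.std():.2f} -> {current.std():.2f}")

# Step 2: two-sample KS test plus PSI magnitude.
ks_stat, p_value = stats.ks_2samp(reference, current)
print(f"KS={ks_stat:.3f} (p={p_value:.4f}), PSI={psi(reference, current):.3f}")
```

A commonly cited rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant; for categorical features, the same window-versus-reference comparison works on category counts with scipy.stats.chisquare.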
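
For step 3, one simple embedding-drift signal is the cosine distance between window centroids. The array names, the 384-dimensional size, and the synthetic data below are all illustrative; in practice ref_emb and cur_emb would be query embeddings from two time windows:

```python
import numpy as np

def centroid_cosine_distance(ref_emb, cur_emb):
    """Cosine distance between the mean embeddings of two time windows."""
    ref_c = ref_emb.mean(axis=0)
    cur_c = cur_emb.mean(axis=0)
    cos_sim = ref_c @ cur_c / (np.linalg.norm(ref_c) * np.linalg.norm(cur_c))
    return 1.0 - float(cos_sim)

rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(5_000, 384))           # e.g. last quarter's query embeddings
cur_emb = rng.normal(loc=0.1, size=(1_000, 384))  # this quarter's, with a simulated shift

print(f"centroid cosine distance: {centroid_cosine_distance(ref_emb, cur_emb):.4f}")
```

A growing centroid distance says the average query has moved; pairing it with average pairwise similarity within each window helps distinguish a cluster that shifted from one that merely spread out.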
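
For step 4, the alerting layer can be a plain threshold check over whatever drift scores you compute; every metric name and threshold value below is hypothetical:

```python
# Hypothetical alert check; metric names and thresholds are illustrative.
DRIFT_THRESHOLDS = {"psi": 0.25, "centroid_cosine_distance": 0.05}

def drifted_metrics(scores, thresholds):
    """Return the names of metrics whose scores exceed their thresholds."""
    return [name for name, score in scores.items()
            if score > thresholds.get(name, float("inf"))]

alerts = drifted_metrics({"psi": 0.31, "centroid_cosine_distance": 0.02},
                         DRIFT_THRESHOLDS)
if alerts:
    print(f"Drift alert: {alerts}")  # hand off to paging or a retraining job here
```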

Tools: Evidently AI (open-source drift detection), WhyLabs, Arize, or custom monitoring with statistical libraries.

Monitor continuously. Data drift is inevitable, and the question is whether you detect it before your users do.

Source

Dataset shift, including covariate shift (data drift), occurs when the joint distribution of inputs and outputs differs between training and test stages.
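In the standard notation, covariate shift is the case where P(x) changes between training and production while P(y | x) stays the same, which is why drift detection can focus on the input distribution alone.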

https://arxiv.org/abs/2004.03045