
Monitoring Input Drift When Your Inputs Aren't Tabular

Tags: mlops, system-design, embeddings, mmd, kernel-methods
Authorship Note

Visualization scripts, as well as prose and argument refinement, were produced with the assistance of Claude Opus 4.7. The argument, structure, and technical claims are my own.

Input drift monitoring is a solved problem for traditional machine learning. If your model takes tabular features, you could (for example) run a Kolmogorov–Smirnov test or compute Population Stability Index (PSI) per numeric column, a chi-squared test per categorical column, set thresholds, and put it on a dashboard. Recently I've been thinking about how more and more deployed models take unstructured inputs (text, images, audio, or any combination of multimodal inputs) and the standard tools stop working in this setting. Embedding spaces are high dimensional, and running KS on anywhere from hundreds to thousands of individual coordinates loses all joint structure. Chip Huyen's Designing Machine Learning Systems (pretty much the standard reference textbook on this kind of question) is explicit about the gap:

Because two-sample tests often work better on low-dimensional data than on high-dimensional data, it's highly recommended that you reduce the dimensionality of your data before performing a two-sample test on it.

The textbook does not say how to reduce dimensionality for unstructured input. The answer I've come to is that you don't reduce dimensions at all; you switch to two-sample tests built to handle high-dimensional data directly.

My first instinct was to compute the centroid of the training embeddings, compute the centroid of a recent production window, and watch the distance between them. It seems reasonable since the centroid summarizes the distribution, and if it moves something has changed. It turns out that it can work, but only under certain conditions. The question is which conditions, and what to do when they don't hold.


Watching the Centroids

The centroid-monitoring setup:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \bar{y} = \frac{1}{m}\sum_{j=1}^m y_j, \qquad \text{alert when } \|\bar{x} - \bar{y}\|_2 > \text{threshold}.$$

Under the null hypothesis of no shift, the standard error of the empirical centroid is about $\sigma / \sqrt{n}$, where $\sigma$ is the per-coordinate standard deviation and $n$ is the window size. This rate survives in high dimensions via concentration of measure: $D$-dimensional embeddings concentrate on a thin shell of radius $\sqrt{D}$, and independent embeddings are nearly orthogonal, so the centroid's norm fluctuates by $O(\sigma/\sqrt{n})$ regardless of $D$. A movement substantially larger than $\sigma/\sqrt{n}$ is what counts as a "real shift."
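A minimal sketch of this setup, assuming NumPy and synthetic Gaussian vectors standing in for real embeddings. Rather than plugging in the analytic $\sigma/\sqrt{n}$ rate, it estimates the no-shift threshold empirically by repeatedly splitting the training set into a pseudo-window and the remainder. The names `centroid_distance` and `null_threshold`, the 0.99 quantile, and the window size are illustrative assumptions.

```python
import numpy as np

def centroid_distance(x, y):
    """Euclidean distance between the centroids of two embedding sets."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

def null_threshold(train, window_size, n_resamples=200, quantile=0.99, seed=0):
    """Estimate the no-shift distribution of the centroid distance by repeatedly
    splitting the training set into a pseudo-window and the remainder."""
    rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_resamples):
        perm = rng.permutation(len(train))
        window, rest = train[perm[:window_size]], train[perm[window_size:]]
        dists.append(centroid_distance(rest, window))
    return float(np.quantile(dists, quantile))

rng = np.random.default_rng(42)
train = rng.normal(size=(5000, 64))               # synthetic stand-in for training embeddings
prod_same = rng.normal(size=(500, 64))            # window drawn from the same distribution
prod_shifted = rng.normal(size=(500, 64)) + 0.2   # window with a mean shift in every coordinate

threshold = null_threshold(train, window_size=500)
d_same = centroid_distance(train, prod_same)
d_shifted = centroid_distance(train, prod_shifted)
print(f"threshold={threshold:.3f}  same={d_same:.3f}  shifted={d_shifted:.3f}")
```

The resampled threshold sidesteps any assumption about how the null fluctuation scales with dimension, at the cost of a few hundred extra centroid computations done once.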

We see that centroid monitoring can work, and it's cheap, but when does it break down?


What Centroids Miss

Centroid distance captures exactly the first moment of the distribution. Due to this limitation, there are three kinds of shifts it cannot detect:

Three side-by-side scatter plots illustrating distribution shifts that preserve the mean. In each panel, training points are shown in light gray and production points in dark green, with a shared centroid marked by a black cross. Panel 1 (Bimodal split): two modes pulled symmetrically apart. Panel 2 (Variance shift): production blob expands around the same center. Panel 3 (Subpopulation drift): two subpopulations shift in opposite directions, cancelling at the mean.
Three distribution shifts the centroid cannot see. In every panel, training (gray) and production (green) share the same centroid (×) by construction, yet the production distribution is materially different in each case.

Multimodal collapse with stable mean. A bimodal distribution where half the mass shifts left and half shifts right, keeping the centroid in place. $\|\bar{x} - \bar{y}\|_2 = 0$ but the distribution has changed. Text embeddings cluster by topic, and a topic shift within a cluster can preserve the centroid while changing shape.

Variance and spread shifts. The centroid is invariant to changes in spread around it. Production data becoming more spread out or more concentrated than training won't move the centroid, but downstream model performance depends on how inputs are distributed across the embedding space, not just where the center of mass is.

Subpopulation drift in mixtures. One component of a mixture shifts while others compensate, keeping the global centroid nearly fixed.
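The first two failure modes can be reproduced in a few lines. This is a sketch on synthetic Gaussian data, assuming NumPy; the production sets are recentered so that the centroids agree by construction, and `recenter`, `centroid_gap`, and `mean_variance` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(scale=1.0, size=(2000, 32))

def recenter(points, target_mean):
    """Translate points so their centroid equals target_mean exactly."""
    return points - points.mean(axis=0) + target_mean

# Variance shift: same centroid, twice the per-coordinate standard deviation.
prod_wide = recenter(rng.normal(scale=2.0, size=(2000, 32)), train.mean(axis=0))

# Symmetric split: half the points move +3 along axis 0, half move -3.
offsets = np.zeros((2000, 32))
offsets[:1000, 0], offsets[1000:, 0] = 3.0, -3.0
prod_split = recenter(train + offsets, train.mean(axis=0))

def centroid_gap(x, y):
    """Euclidean distance between centroids."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))

def mean_variance(x):
    """Average per-coordinate variance, a crude second-moment summary."""
    return float(x.var(axis=0).mean())

for name, prod in [("wide", prod_wide), ("split", prod_split)]:
    print(name, centroid_gap(train, prod), mean_variance(train), mean_variance(prod))
```

In both cases the centroid gap is zero up to floating-point error while the second-moment summary changes, which is the blind spot described above.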

Therefore, whether centroid monitoring is adequate depends on the geometry of the embedding clusters. Before measuring that geometry, we need to place centroid distance in its proper context: it is not a heuristic, it is the simplest member of a parameterized family of two-sample tests, and the same family contains the test we will want for the cases where centroid is not enough.


From Centroids to MMD

Centroid distance misses distributional shape because it only sees the first moment. To capture the full distribution, we need a two-sample test that is sensitive to higher-order structure. The framework for this is integral probability metrics (IPMs). An IPM measures the discrepancy between two distributions $P$ and $Q$ by finding the function $f$ in a class $\mathcal{F}$ where the expectations disagree most:

$$D_\mathcal{F}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_P[f] - \mathbb{E}_Q[f] \right|$$

The witness function $f^*$ points in the direction where $P$ and $Q$ differ the most. The choice of $\mathcal{F}$ determines the divergence:

MMD uses $\mathcal{F}$ as the unit ball of the RKHS associated with a kernel $k$. Written in terms of kernel mean embeddings $\mu_P = \mathbb{E}_{x \sim P}[\phi(x)]$:

$$\text{MMD}(P, Q) = \left\| \mu_P - \mu_Q \right\|_\mathcal{H}$$

The kernel mean embedding $\mu_P$ is a generalization of the empirical mean. For different choices of the feature map $\phi$:

| Choice of $\phi$ / kernel | What MMD reduces to | What it captures |
| --- | --- | --- |
| $\phi(x) = x$ (linear kernel) | $\lVert\bar{x}_P - \bar{x}_Q\rVert_2$ | Mean only |
| $\phi(x) = [x, x \otimes x]$ (polynomial degree 2) | | Mean + variance |
| Gaussian RBF kernel | Full distributional test | All moments; MMD = 0 iff $P = Q$ |
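The linear-kernel row can be checked numerically. The sketch below, assuming NumPy and synthetic data, computes the plug-in (biased) squared MMD with $k(x, x') = x^\top x'$ and compares it with the squared centroid distance; the two agree up to floating-point error. The helper name `mmd2_biased` is hypothetical.

```python
import numpy as np

def mmd2_biased(x, y, kernel):
    """Plug-in estimate of squared MMD: mean(Kxx) + mean(Kyy) - 2 * mean(Kxy)."""
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()

def linear_kernel(a, b):
    """k(a, b) = <a, b>, i.e. the feature map phi(x) = x."""
    return a @ b.T

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 16))            # stand-in for training embeddings
y = rng.normal(loc=0.3, size=(200, 16))   # stand-in for a production window

lhs = mmd2_biased(x, y, linear_kernel)
rhs = float(np.sum((x.mean(axis=0) - y.mean(axis=0)) ** 2))
print(lhs, rhs)  # the two values agree up to round-off
```

Swapping `linear_kernel` for any other kernel function changes which moments the same three-term expression is sensitive to.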
Kernel Inception Distance

KID (Kernel Inception Distance), used for evaluating image generation (Bińkowski et al. 2018), is squared MMD on Inception features; the paper's default kernel is a cubic polynomial rather than a Gaussian, but the construction is the same. The "Inception" part is a specific choice of feature map, the Inception-v3 classifier. The generalization to whatever encoder your deployment pipeline uses (CLIP, SigLIP, BGE-M3, your fine-tuned SBERT, etc.) is direct. Use the serving encoder as the feature map. No retraining or separate model needed.

Centroid distance is the linear-kernel row of this table. The Gaussian RBF kernel, $K(x, x') = \exp(-\|x - x'\|^2 / (2\sigma^2))$, is a smooth similarity function that decays with squared distance, and it is characteristic: MMD = 0 if and only if $P = Q$. Its unbiased $U$-statistic estimator is:

$$\widehat{\text{MMD}}^2 = \underbrace{\frac{1}{n(n-1)}\sum_{i \neq i'}K(x_i, x_{i'})}_{\text{within training}} + \underbrace{\frac{1}{m(m-1)}\sum_{j \neq j'}K(y_j, y_{j'})}_{\text{within production}} - \underbrace{\frac{2}{nm}\sum_{i,j}K(x_i, y_j)}_{\text{cross}}$$

What is a U-statistic?

A $U$-statistic averages a symmetric function over all distinct tuples of samples; note the $i \neq i'$ and $j \neq j'$ in the sums above. The naive plug-in estimator $\frac{1}{n^2}\sum_{i,i'}K(x_i, x_{i'})$ would include the diagonal terms $K(x_i, x_i)$, which estimate $\mathbb{E}[K(X, X)]$ rather than $\mathbb{E}[K(X, X')]$ for independent $X, X'$, and that mismatch biases the estimator upward. Dropping the diagonal removes the bias. The unbiased version can dip slightly negative under the null because its null distribution is centered at zero, which makes a threshold easy to read; the biased "V-statistic" version is non-negative and carries a positive offset under the null.

The bandwidth $\sigma$ in the kernel is set by the median pairwise distance in the training set (the median heuristic). Under no shift, all three terms estimate the same quantity ($\mathbb{E}[K(X, X')]$ under $P$), and $\widehat{\text{MMD}}^2 \approx 0$ by cancellation. Under a shift, the cross-term shrinks since samples drawn from different distributions sit farther apart in feature space. The cancellation breaks, so the statistic grows.
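A sketch of the unbiased estimator with a median-heuristic bandwidth, assuming NumPy and synthetic Gaussian data in place of real embeddings. The function names are hypothetical, and the example stops at computing the statistic; calibrating an alert threshold (for instance by permuting the pooled sample) is left out.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d2 = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off

def median_bandwidth(train):
    """Median heuristic: sigma = median pairwise distance within the training set."""
    upper = np.triu_indices(len(train), k=1)  # each distinct pair once
    return float(np.sqrt(np.median(sq_dists(train, train)[upper])))

def mmd2_unbiased(x, y, sigma):
    """Unbiased U-statistic estimate of squared MMD with a Gaussian RBF kernel."""
    n, m = len(x), len(y)
    kxx = np.exp(-sq_dists(x, x) / (2.0 * sigma ** 2))
    kyy = np.exp(-sq_dists(y, y) / (2.0 * sigma ** 2))
    kxy = np.exp(-sq_dists(x, y) / (2.0 * sigma ** 2))
    within_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))  # drop the diagonal
    within_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    cross = 2.0 * kxy.mean()
    return float(within_x + within_y - cross)

rng = np.random.default_rng(0)
train = rng.normal(size=(400, 32))
prod_same = rng.normal(size=(400, 32))               # no shift
prod_spread = rng.normal(scale=1.5, size=(400, 32))  # same mean, larger spread

sigma = median_bandwidth(train)
mmd_same = mmd2_unbiased(train, prod_same, sigma)
mmd_spread = mmd2_unbiased(train, prod_spread, sigma)
print(f"sigma={sigma:.2f}  no shift={mmd_same:.5f}  spread shift={mmd_spread:.5f}")
```

The spread-shift window has the same mean as training, so this is a case the centroid cannot see but the Gaussian-kernel statistic can. The full kernel matrices cost memory quadratic in the window size, so larger windows may need subsampling or blocking.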


Why Not Wasserstein?

MMD with a characteristic kernel is a principled test. But given that we are comparing two distributions in embedding space, a natural alternative is Wasserstein distance (optimal transport distance), which has a clean geometric interpretation as the minimum cost of transporting mass from $P$ to $Q$, and a rich literature. The reason to prefer MMD lies in the convergence rates.

Take an example embedding space with $D = 768$. The Wasserstein rate $O(n^{-1/768})$ means effectively no convergence from any realistic monitoring window, whereas MMD needs roughly $n \sim 10^4$ samples regardless of $D$.

Log-log plot of relative estimation error versus sample size for MMD and Wasserstein at d=768. Both curves normalized to 1 at n=100. The MMD curve descends with slope -1/2; the Wasserstein d=768 curve is essentially flat across the plotted range. A vertical reference line at n=10^4 marks a typical monitoring window.
Relative estimation error vs sample size, normalized to 1 at n=100. MMD's n⁻¹ᐟ² rate is dimension-free; Wasserstein's n⁻¹ᐟᵈ rate makes d=768 effectively non-converging across any realistic window. (Wasserstein at d=2 would coincide with the MMD curve. The issue is dimension, not optimal transport itself.)
| Divergence | Convergence rate | Works in $D = 768$? |
| --- | --- | --- |
| Wasserstein $W_p$ | $O(n^{-1/d})$ | No |
| MMD (characteristic kernel) | $O(n^{-1/2})$ | Yes |
| Sinkhorn divergence | Interpolates between the two | Not worth the complexity |

The Wasserstein rate adapts to intrinsic dimension on low-dimensional supports (Weed & Bach, 2019), so on tight-cluster text embeddings where the intrinsic dimension is small, Wasserstein may become tractable. But MMD does not require knowing the intrinsic dimensionality, so it is a safe default.
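To put numbers on these rates, here is a small arithmetic sketch in plain Python. It uses the same normalization as the plot (error equal to 1 at $n = 100$) and treats the rates as exact power laws with no constants, which is an idealization. The intrinsic dimension of 10 is an arbitrary illustrative value.

```python
def relative_error(n, rate_exponent, n_ref=100):
    """Error at n relative to the error at n_ref, for a rate n ** (-rate_exponent)."""
    return (n / n_ref) ** (-rate_exponent)

n = 10_000  # a typical monitoring window

mmd = relative_error(n, 1 / 2)             # 0.1: error shrinks tenfold
wasserstein_768 = relative_error(n, 1 / 768)  # about 0.994: essentially unchanged
wasserstein_10 = relative_error(n, 1 / 10)    # about 0.631 at intrinsic dimension 10

print(f"MMD: {mmd:.3f}  W(d=768): {wasserstein_768:.3f}  W(d=10): {wasserstein_10:.3f}")
```

Going from 100 to 10,000 samples cuts the idealized MMD error by a factor of ten and the $d = 768$ Wasserstein error by less than one percent. At a small intrinsic dimension the Wasserstein rate becomes usable, which is the Weed and Bach caveat above.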


The Classifier Trick

Computing the Gaussian-kernel MMD $U$-statistic for every production window is straightforward but not the cheapest option. There is a simpler deployment path that also produces something MMD alone does not: a per-sample density-ratio estimate. MMD tells you whether the distribution shifted; the per-sample ratio tells you which samples are responsible and how much to upweight or downweight them when retraining.

Label training samples $c = 0$ and production samples $c = 1$, then train a binary classifier $h(x) = p(c = 1 \mid x)$ on the combined set. By Bayes' rule:

$$\frac{q(x)}{p(x)} = \frac{p(c=1 \mid x)}{p(c=0 \mid x)} \cdot \frac{p(c=0)}{p(c=1)}$$

The first factor is the odds output by the classifier, $h(x)/(1-h(x))$. The second is a constant fixed by your training/production batch sizes. So one trained classifier gives you three things at once:

  1. The density ratio itself. $r(x) = h(x)/(1-h(x))$ from above, up to the constant class-prior factor. It provides per-sample weights for importance-weighted retraining and a per-sample novelty score for triage. These weights blow up when $h(x) \to 1$ for production samples far from training support, so clip at some percentile, use log-ratio estimation, or regularize. The bare ratio is not as useful in practice.

  2. AUC as a drift statistic. AUC = 0.5 means no detectable shift. AUC approaching 1 means the classifier separates the two samples almost perfectly.

  3. The witness function. Under the linear loss, the optimal classifier is the IPM witness function, exactly the direction in feature space where $P$ and $Q$ differ most. High-confidence production-side predictions identify which inputs are most novel.

The choice of loss function determines which $f$-divergence the classifier estimates: linear loss gives total variation, exponential loss gives Hellinger, and logistic loss (binary cross-entropy, the standard choice) gives a Jensen-Shannon-type divergence. The AUC is a robust drift statistic regardless of the exact loss.
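A minimal sketch of the domain-classifier path, assuming NumPy only and synthetic data. It uses a hand-rolled logistic regression so the example stays self-contained; in practice any calibrated classifier would do. All function names, the half/half fit and evaluation split, and the 99th-percentile clip are illustrative assumptions. A linear classifier on raw embeddings only sees mean shifts, so a nonlinear model would be needed to pick up shape changes.

```python
import numpy as np

def fit_logistic(x, c, lr=0.1, steps=500, l2=1e-3):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - c) / len(c) + l2 * w)
        b -= lr * float(np.mean(p - c))
    return w, b

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return float((ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def domain_classifier_report(train, prod, clip_pct=99, seed=0):
    """Fit on half of the pooled data; return held-out AUC and clipped density ratios."""
    rng = np.random.default_rng(seed)
    x = np.vstack([train, prod])
    c = np.concatenate([np.zeros(len(train)), np.ones(len(prod))])  # 0 = train, 1 = prod
    perm = rng.permutation(len(c))
    fit_idx, eval_idx = perm[: len(c) // 2], perm[len(c) // 2 :]
    mu, sd = x[fit_idx].mean(axis=0), x[fit_idx].std(axis=0) + 1e-8
    z = (x - mu) / sd  # standardize using fit-split statistics only
    w, b = fit_logistic(z[fit_idx], c[fit_idx])
    logits = z[eval_idx] @ w + b
    # odds h/(1-h) = exp(logit); the prior factor p(c=0)/p(c=1) is n_train/n_prod
    ratio = np.exp(logits) * (len(train) / len(prod))
    ratio = np.minimum(ratio, np.percentile(ratio, clip_pct))  # tame exploding weights
    return auc(logits, c[eval_idx]), ratio

rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 16))
auc_same, _ = domain_classifier_report(train, rng.normal(size=(1000, 16)))
auc_shift, ratio_shift = domain_classifier_report(train, rng.normal(loc=0.5, size=(1000, 16)))
print(f"AUC no shift={auc_same:.3f}  AUC mean shift={auc_shift:.3f}")
```

Evaluating AUC on held-out samples matters: an in-sample AUC drifts above 0.5 even when nothing has changed, because the classifier fits noise.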

So we have three deployment options: centroid distance (cheapest), Gaussian-kernel MMD (full test), and domain classifier (cheapest full test, with density ratio as a bonus). The question is when the cheapest option is enough.


When Is Centroid Enough?

We now have the answer to what centroid monitoring misses — shape. But the empirical question remains: for your data, does shape matter enough to outweigh the simplicity of centroid? This turns out to be decidable from a geometric measurement on the training set, and the answer comes from a seemingly unrelated problem.

Embedding consolidation (compressing a cluster of embeddings to a small number of representatives for efficient retrieval) requires the same geometric measurement. Work on consolidation proves a universal lower bound on identity-retrieval error (Vangara & Gopinath, 2026):

$$\varepsilon_\text{id}(\mathcal{C}, \mathcal{R}) \geq 1 - c_1 m \left(\frac{\theta'}{\bar{d}}\right)^{d_\text{eff}/2}$$

where $\theta' = 1 - \theta$ for similarity threshold $\theta$. The bound is operator-agnostic. It holds for centroid, medoid, importance-weighted routing, adaptive selection, anything. What matters for our purposes is that the same $(d_\text{eff}, \bar{d})$ geometry that determines whether centroid compression is safe for retrieval also determines whether centroid drift detection is adequate for monitoring. Both arguments rest on the same thin-shell concentration of contrastive embeddings.

The parameter space splits into two regimes:

Tight regime ($\bar{d} < \theta'$, low local $d_\text{eff}$): The cluster mean is a near-sufficient statistic for the full distribution. Linear-kernel MMD (= centroid distance) is near-optimal. In the consolidation setting, the fixed centroid Pareto-dominates every adaptive variant (even an in-hindsight oracle) on real text corpora that sit inside this regime.

Spread regime (high $d_\text{eff}$ or $\bar{d} \geq \theta'$): The centroid loses signal and higher-order moments are needed. Gaussian-kernel MMD provides these.

Empirical data from six text corpora and multiple sentence encoders:

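| Corpus | $d_\text{eff,local}$ (median) | $\bar{d}$ | Centroid id accuracy |
| --- | --- | --- | --- |
| DRM templated | 2.3 | 0.05 | 0.942 |
| HotpotQA | 1.5 | 0.48 | 0.487 |
| MS MARCO | 5.5 | 0.33 | 0.905 |
| NQ | 12.6 | 0.38 | 0.761 |
| Wikipedia sections | 30.1 | 0.51 | 0.647 |
| arXiv titles | 107.5 | 0.33 | 0.754 |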
Scatter of six text corpora on (local effective dimension, average intra-cluster distance). Dot size encodes centroid identity-retrieval accuracy. A shaded region in the lower-left marks the tight regime where centroid monitoring is sufficient. DRM templated and MS MARCO sit deep inside the tight region with large dots; HotpotQA sits just above the regime boundary with a tiny dot despite low effective dimension; Wikipedia sections and arXiv titles sit in the spread region.
Each corpus on the (d_eff, d̄) plane. Dot size = centroid id accuracy. Tight regime (shaded) is the joint condition d_eff small AND d̄ < θ′; HotpotQA sits just outside it because d̄ alone exceeds the threshold, despite the lowest effective dimension in the set.

Two things to note:

The joint quantity $(d_\text{eff}, \bar{d})$ orders the corpora, not either number alone. HotpotQA has the lowest $d_\text{eff}$ (1.5) but the worst centroid accuracy (0.487). The reason is $\bar{d} = 0.48 > 0.2 = \theta'$, so the cluster is spread (high average internal distance) and a single vector cannot summarize it regardless of $d_\text{eff}$.

MS MARCO shows the converse direction. Its $\bar{d} = 0.33$ also exceeds $\theta'$, so by the strict joint criterion it is outside the tight regime, yet centroid accuracy is 0.905. With $d_\text{eff} = 5.5$ the cluster is low-dimensional enough that a single vector still captures most of the structure even though the cluster is not particularly tight. The boundary is two-dimensional: corpora just outside the strict box can fall either way depending on which axis is the source of the violation.

The boundary is fuzzy but useful. Single-digit to low-teens $d_\text{eff}$ with $\bar{d} < \theta'$: tight regime, centroid is adequate. Above that (Wikipedia sections at 30.1, arXiv titles at 107.5), you're in the spread regime: use Gaussian-kernel MMD. Importantly, this is an empirical heuristic from one paper's tested corpora, not a theorem; ablate on your own data when near the boundary.

Computing $(d_\text{eff,local}, \bar{d})$ on training embeddings takes seconds. The participation ratio $d_\text{eff} = (\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ comes from the eigenvalues $\lambda_i$ of the per-cluster covariance; $\bar{d}$ is the average cosine distance within the cluster. One measurement, done once on training data, tells you which monitoring detector to deploy, and answers the retrieval question (whether fixed-centroid compression is safe for your index) at the same time.
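A sketch of the two measurements for a single cluster, assuming NumPy. Two choices here are assumptions, not details taken from the consolidation paper: $\bar{d}$ is computed as the mean over distinct pairs (rather than, say, distance to the centroid), and the participation ratio is computed on the whole cluster's covariance. The function names are hypothetical.

```python
import numpy as np

def participation_ratio(cluster):
    """d_eff = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance."""
    cov = np.cov(cluster, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard against tiny negatives
    return float(eig.sum() ** 2 / (eig ** 2).sum())

def mean_cosine_distance(cluster):
    """Average cosine distance (1 - cosine similarity) over distinct pairs."""
    unit = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(cluster)
    mean_sim = (sims.sum() - np.trace(sims)) / (n * (n - 1))  # exclude self-pairs
    return float(1.0 - mean_sim)

rng = np.random.default_rng(0)

# Isotropic cloud in 10 dimensions: d_eff should be close to 10.
isotropic = rng.normal(size=(5000, 10))

# Points along a single direction: d_eff should be close to 1.
line = np.outer(rng.normal(size=500), rng.normal(size=10))

# Tight cluster around one unit vector: small average cosine distance.
base = np.ones(64) / 8.0
tight = base + 0.01 * rng.normal(size=(500, 64))

print(participation_ratio(isotropic), participation_ratio(line))
print(mean_cosine_distance(tight), mean_cosine_distance(isotropic))
```

With several clusters, the per-cluster values would then be summarized (the table above reports a median), which this sketch leaves to the caller.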


A Practical Recipe

Pulling this together into a deployable pipeline:

  1. Feature space. Use the serving encoder $\phi$, the same embeddings your model already produces. No separate encoder for monitoring.

  2. One-time geometric diagnostic. Compute per-cluster $(d_\text{eff,local}, \bar{d})$ on the training embeddings. This determines the primary detector.

  3. Primary detector.

    • Tight regime: centroid distance, $\|\bar{\phi}_\text{train} - \bar{\phi}_\text{prod}\|_2$, alerting when movement substantially exceeds $\sigma/\sqrt{n}$ (the null standard error from §"Watching the Centroids"). This is the linear-kernel row of the MMD table: cheap, interpretable, and statistically adequate in this regime.
    • Spread regime: Gaussian-kernel MMD with the unbiased $U$-statistic, bandwidth $\sigma$ from the median heuristic.

  4. Secondary detector and triage tool. Train a domain classifier on (train, prod). AUC is your drift statistic. High-confidence production-side predictions identify which inputs are most novel. The classifier's odds $h(x)/(1-h(x))$ give you a per-sample density ratio for importance-weighted retraining.

  5. Encoder version tracking. The entire pipeline is conditional on $\phi$ being fixed. When the encoder is updated (model refresh, embedding-model swap, fine-tune), the same inputs map to different vectors and every threshold must be recalibrated. Pin the encoder version, log it with every detector reading, recalibrate on change.

| Scenario | $d_\text{eff,local}$ | $\bar{d}$ vs $\theta'$ | Primary detector |
| --- | --- | --- | --- |
| Tight (DRM, MS MARCO) | 2–6 | $\bar{d} \ll \theta'$ | Centroid distance |
| Borderline (NQ) | 12 | $\bar{d} < \theta'$, close | Centroid + classifier AUC |
| Spread (arXiv titles) | 100+ | small $\bar{d}$, high $d_\text{eff}$ | Gaussian-kernel MMD + classifier |
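The decision table can be written as a small function. This is a sketch of the strict joint criterion only: $\theta' = 0.2$ is the value quoted earlier, while the cutoffs `d_eff_tight`, `d_eff_spread`, and `margin` are illustrative assumptions chosen to mirror "single-digit to low-teens", not values from the post or the paper. The function name and return labels are hypothetical.

```python
def choose_detector(d_eff_local, d_bar, theta_prime=0.2,
                    d_eff_tight=10.0, d_eff_spread=15.0, margin=0.8):
    """Map the one-time geometric diagnostic to a primary detector.

    All cutoffs except theta_prime are illustrative assumptions; tune them
    by ablation on your own data, especially near the regime boundary.
    """
    # Spread regime: either axis violates the tight-regime condition.
    if d_bar >= theta_prime or d_eff_local > d_eff_spread:
        return "gaussian_mmd+classifier"
    # Borderline: inside the box but close to one of its edges.
    if d_eff_local > d_eff_tight or d_bar > margin * theta_prime:
        return "centroid+classifier_auc"
    # Tight regime: the centroid is an adequate summary.
    return "centroid"

print(choose_detector(2.3, 0.05))    # DRM templated values from the corpus table
print(choose_detector(12.0, 0.15))   # hypothetical borderline cluster
print(choose_detector(107.5, 0.33))  # arXiv titles values from the corpus table
```

Because the boundary is fuzzy, a strict rule like this will send some corpora that the centroid handles well (MS MARCO's measured values, for example) to the more expensive detector, which is the conservative direction to err in.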

What Doesn't Work

Several approaches come up in practice that should be avoided:

KS/PSI per embedding coordinate. Each coordinate of an embedding vector has no semantic meaning — the joint structure is what matters. Per-coordinate tests lose all joint structure, produce 768+ simultaneous tests (alert fatigue), and the per-coordinate null is dominated by quasi-orthogonality noise. The same anti-pattern as running KS on raw image pixels.

Wasserstein distance in embedding space. The $O(n^{-1/d})$ convergence rate makes it infeasible for $D = 768$ from any realistic monitoring window. Sinkhorn divergence partially fixes this by interpolating toward MMD as entropic regularization grows, but adds implementation complexity for no statistical benefit when the goal is detection rather than transport.

PCA then KS. Reduces dimension but selects maximum-variance directions, which are not the directions $P$ and $Q$ differ along. The classifier-based approach learns the projection that maximally separates the two distributions.

Centroid as the only detector, everywhere. Centroid is adequate in the tight regime and inadequate in the spread regime. The geometric diagnostic determines which. Centroid is a kernel choice (the linear kernel) that is sometimes sufficient and sometimes not, and the choice is decidable from data.


Open Questions

Kernel bandwidth selection for production. The median heuristic is the default and works in most cases, but Sutherland et al. (2017) note a specific failure mode: it underperforms when the scale on which $P$ and $Q$ differ is different from the scale of their overall variation, since the bandwidth ends up tuned to the wrong scale. Their power-maximization approach (choose $\sigma$ to maximize $\text{MMD}^2 / \sqrt{V_m}$, the ratio of the squared statistic to its standard deviation) addresses this, but it requires labeled drift events to optimize against. Realistic deployments don't have those. What is the right adaptive rule when you cannot assume a specific shift direction?

Conformal prediction for drift detection. Distribution-free two-sample tests based on conformal prediction give finite-sample coverage guarantees without distributional assumptions. Whether these guarantees buy anything over MMD or classifier-AUC in practice is unclear. The distribution-free generality may be paying for coverage you don't need.

Multimodal-specific shifts. When text and image are co-embedded by a CLIP-style encoder, monitoring on the joint embedding may hide modality-specific shifts. The fix is likely per-modality MMD plus joint MMD, with per-modality tests as diagnostics when the joint fires. This has not been systematically studied.

Detector lag versus cluster granularity. Per-cluster centroid monitoring multiplies the alert surface. How this composes with the alert-fatigue problem that the standard literature documents for tabular features is an open practical question.


References

MMD and Kernel Methods

Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13.

Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., & Lanckriet, G.R.G. (2012). On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6.

Bińkowski, M., Sutherland, D.J., Arbel, M., & Gretton, A. (2018). Demystifying MMD GANs. ICLR 2018.

Sutherland, D.J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., & Gretton, A. (2017). Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy. ICLR 2017.

Optimal Transport and Sample Complexity

Peyré, G. & Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning, 11(5-6).

Weed, J. & Bach, F. (2019). Sharp Asymptotic and Finite-Sample Rates of Convergence of Empirical Measures in Wasserstein Distance. Bernoulli, 25(4A).

Density Ratio and Classification-Based Detection

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.

Embedding Consolidation and the Tight/Spread Regime

Vangara, A.B. & Gopinath, A. (2026). The Geometry of Consolidation. Preprint.

Probability in High Dimensions

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.

ML Systems Design

Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media.

Murphy, K.P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press.