You Are Out of Context!
Giancarlo Cobino, Simone Farci
October 2024
Abstract
This research proposes a novel drift detection methodology for machine
learning (ML) models based on the concept of "deformation" in the
vector space representation of data. Recognizing that new data can act
as forces stretching, compressing, or twisting the geometric relationships
learned by a model, we explore various mathematical frameworks to quan-
tify this deformation. We investigate measures such as eigenvalue analy-
sis of covariance matrices to capture global shape changes, local density
estimation using kernel density estimation (KDE), and Kullback-Leibler
divergence to identify subtle shifts in data concentration. Additionally, we
draw inspiration from continuum mechanics by proposing a "strain tensor"
analogy to capture multi-faceted deformations across different data types.
This requires careful estimation of the displacement field, and we delve
into strategies ranging from density-based approaches to manifold learning
and neural network methods. By continuously monitoring these deforma-
tion metrics and correlating them with model performance, we aim to
provide a sensitive, interpretable, and adaptable drift detection system
capable of distinguishing benign data evolution from true drift, enabling
timely interventions and ensuring the reliability of machine learning sys-
tems in dynamic environments. Addressing the computational challenges
of this methodology, we discuss mitigation strategies like dimensional-
ity reduction, approximate algorithms, and parallelization for real-time
and large-scale applications. The method’s effectiveness is demonstrated
through experiments on real-world text data, focusing on detecting con-
text shifts in Generative AI. Our results, supported by publicly available
code, highlight the benefits of this deformation-based approach in captur-
ing subtle drifts that traditional statistical methods often miss. Further-
more, we present a detailed application example within the healthcare
domain, showcasing the methodology’s potential in diverse fields. Future
work will focus on further improving computational efficiency and explor-
ing additional applications across different ML domains.
1 Introduction
Machine learning (ML) and artificial intelligence (AI) are transforming our
world, but their reliance on data creates vulnerabilities. Real-world data is
rarely static; it constantly evolves due to shifting contexts, populations, trends,
and behaviors. This phenomenon, known as "drift," poses a significant threat to
the dependability and accuracy of ML models. Drift takes various forms. Con-
cept drift alters the relationship between input features and target variables,
while data drift changes the distribution of input features themselves. Con-
text drift, particularly relevant in conversational AI, occurs when the under-
lying topic or context of an interaction shifts, potentially leading to incoherent
responses and a decline in model performance.
This research introduces a novel approach, viewing data as points in a high-
dimensional vector space and interpreting new data as forces that deform this
space. We aim to develop a sensitive and interpretable methodology that cap-
tures these deformations, enabling early warnings of performance degradation.
2 Problem Statement
Drift can significantly impact the performance and reliability of machine learn-
ing (ML) models. It leads to a mismatch between the model’s original assump-
tions and the changing realities of the data, resulting in decreased accuracy
and potentially erroneous decision-making, especially in critical domains such
as healthcare, finance, and autonomous systems. Models that are exposed to
drift may produce unreliable outputs, causing users to lose trust. This erosion
of trust hinders the adoption of ML systems and necessitates frequent retraining
or recalibration, increasing operational costs.
Furthermore, the advancement of artificial intelligence (AI) is closely tied
to the models’ ability to detect context, seamlessly shift between different con-
texts, and provide accurate responses accordingly. Without this capability, AI
systems may struggle to maintain coherent, contextually relevant conversations
or decisions, especially in dynamic environments such as conversational AI or
autonomous decision-making systems.
Given these challenges, the need for effective drift detection methodologies
becomes paramount. These methodologies provide early warnings, enabling cor-
rective actions before significant model degradation occurs. By detecting the
specific type of drift, it is possible to focus retraining efforts, ensuring that up-
dates are both efficient and targeted. Additionally, continuous drift monitoring
helps maintain user confidence and ensures that ML models remain adaptable
in dynamic environments.
3 Research Motivation
Drift detection continues to face challenges despite significant research efforts.
Traditional methods effectively detect large distributional shifts but often miss
subtle drifts, which can degrade model performance over time. These gradual
changes accumulate and damage model accuracy before they become apparent
through standard metrics, highlighting the need for more sensitive detection
methods capable of signaling drift earlier for timely intervention.
A major limitation of current techniques is their inability to distinguish
between types of drift, such as concept drift (changes in the relationship be-
tween input and target) and data drift (shifts in input distribution while target
relationships remain constant). Without identifying the specific drift type, cor-
rective actions can be inefficient, reducing their effectiveness.
Additionally, many drift detection methods are computationally intensive,
limiting their use in real-time applications or resource-constrained environments.
This restricts their practicality in systems that require continuous operation.
The vector space deformation approach proposed in this research offers a
solution to these limitations. By analyzing data as points in a high-dimensional
space and examining how new data deforms this space, subtle drifts that may
otherwise go unnoticed can be detected. This method enhances sensitivity, en-
abling earlier detection of gradual changes that may not affect overall statistical
distributions but still impact model performance.
Moreover, this approach is adaptable, as different types of drift manifest as
distinct deformations in the vector space, aiding in the identification of specific
drift types and supporting more targeted corrective actions. The method’s
strong theoretical foundation also paves the way for more robust drift detection
techniques beyond traditional statistical methods.
The vector space deformation approach enhances sensitivity by detecting
small, gradual deformations in the data’s geometric structure, even when overall
statistical distributions remain relatively unchanged. By analyzing distinct de-
formation patterns, our approach helps differentiate between various drift types,
such as concept drift or data drift, allowing for tailored corrective actions.
Generative AI models, particularly those used in conversational AI, are
highly susceptible to context drift, as subtle shifts in user language or topic
can lead to incoherent or irrelevant outputs.
In summary, the vector space deformation approach addresses the key chal-
lenges of traditional methods, offering improved sensitivity, adaptability, and
computational efficiency. It promises to enhance the reliability and adaptabil-
ity of machine learning models in dynamic environments, ensuring long-term
effectiveness despite changing data.
4 Literature Review
4.1 Statistical Drift Detection
Statistical methods form a cornerstone of drift detection, often comparing data
distributions from different periods or between a baseline dataset and new data.
Key techniques include:
• Kullback-Leibler (KL) Divergence: Measures the difference between
two probability distributions. It is a non-symmetric divergence, meaning
that the order in which distributions are compared matters.
• Jensen-Shannon (JS) Divergence: A symmetric variant of KL diver-
gence, addressing some of its limitations. JS divergence is often more
stable when dealing with sparse distributions.
• Other Statistical Distances: Methods such as the Kolmogorov-Smirnov
(KS) test, Wasserstein distance (Earth Mover’s Distance), and others pro-
vide alternative ways to quantify distributional differences.
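As an illustrative sketch (not part of the methodology itself), the KS test and Wasserstein distance above can be computed with SciPy; the simulated shift of 0.3 standard deviations is an assumption chosen only for demonstration:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)
shifted = rng.normal(loc=0.3, scale=1.0, size=5_000)  # simulated data drift

# Two-sample KS test: maximum distance between empirical CDFs, with a p-value.
ks_stat, p_value = ks_2samp(baseline, shifted)
# Wasserstein (Earth Mover's) distance between the two samples.
wd = wasserstein_distance(baseline, shifted)

print(f"KS statistic: {ks_stat:.3f} (p = {p_value:.2e})")
print(f"Wasserstein distance: {wd:.3f}")
```

Both measures grow with the size of the shift, but they summarize the whole distribution and can stay small under the subtle, localized deformations discussed later.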
4.2 Advantages
• Well-Established Foundation: Statistical methods have a strong the-
oretical basis and are widely used in data analysis.
• Interpretability: Measures like the Population Stability Index (PSI)
and KL divergence provide quantifiable metrics that can be monitored
for significant changes.
model and new incoming data. If the discriminator successfully separates the
datasets, it strongly suggests that drift has occurred.
4.6 Advantages
• Leverages Existing Models: Model-based approaches directly utilize
the trained machine learning models, potentially reducing the need for
additional data collection and analysis.
• Sensitivity to Complex Drift: Changes in model behavior can some-
times capture subtler drifts that may not be immediately obvious in sta-
tistical distributions.
4.7 Limitations
• Model Dependence: The effectiveness of model-based approaches is
tied to the quality and representativeness of the original model. A poorly
trained model will not offer reliable drift detection signals.
• Interpretability: Analyzing changes in model outputs may not always
directly pinpoint the specific nature of the drift or the features involved.
5 Approach
5.1 Data Representation in High-Dimensional Space
The foundation of our proposed drift detection approach rests on the concept
of embedding data points as vectors within a high-dimensional space. To trans-
form raw data into these representations, we carefully consider feature selection
and engineering processes. Suitable features might include numerical values
(such as age or income), categorical features converted into numerical represen-
tations, or features extracted using techniques like image processing or natural
language processing (NLP). Each selected feature corresponds to a dimension
in the vector space. While simpler datasets may work with only a few dimen-
sions, complex data often requires a very high-dimensional representation to
fully capture underlying patterns and relationships.
manifolds, which are clusters of data points that exhibit similarities and reside
close together in this space.
• Shifting: If the new data exhibits data drift, the overall distribution of
points in the space might shift. This can change the distances between
clusters or affect the position of decision boundaries learned by the model.
• Stretching or Compressing: Certain areas of the space may stretch or
compress due to changes in the variance or correlations of features in the
new data.
6 Mathematical Formulation
6.1 Data as Constellation (or Cloud)
Data should not only be considered as rows and columns, but as points scattered
in a vast, multidimensional space. Each dimension in this space represents a
feature of the data (e.g., age, income, purchase history, large text, or a generated
image). The initial dataset used to train a model forms a "constellation" or
cloud in this space, depending on the type of data.
When new data arrives, each new data point acts like a small force. Points
that are significantly different from the original data (outliers) exert a stronger
pull than those that blend in with the rest. The goal is to detect not only if
the center of the data constellation is moving but also how the overall shape is
deforming—whether stretching, compressing, or twisting in response to the new
data forces.
• Embedding: The process of converting raw data into vectors is crucial.
For numerical data, we may use the values directly. For textual data,
techniques such as word embeddings (e.g., Word2Vec) can be used to
represent words or documents as vectors.
d = xi − µ
This gives the direction and magnitude of the "pull" exerted by this new
point.
• Force Magnitude: We can scale the length of the deviation vector ∥d∥
to reflect the influence of the new point on the overall shape:
– Simple Distance: Use the length ∥d∥ directly.
– Fading Influence: Use a function that decreases the force as the
distance from the center increases, such as:
∥d∥ · e^(−k∥d∥)
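The deviation vector and the fading-influence weighting above can be sketched in a few lines of NumPy; the function name and the choice k = 0.1 are illustrative assumptions, not part of any released code:

```python
import numpy as np

def fading_influence(x_new, mu, k=0.1):
    """Weight ∥d∥·exp(−k∥d∥) of the deviation d = x_new − mu, so that
    very distant outliers do not dominate the deformation signal."""
    d = x_new - mu
    dist = np.linalg.norm(d)
    return dist * np.exp(-k * dist)

mu = np.zeros(3)  # center of the baseline data (illustrative)
near = fading_influence(np.array([0.5, 0.0, 0.0]), mu)
far = fading_influence(np.array([50.0, 0.0, 0.0]), mu)
print(near, far)  # the far outlier's weight is damped below the near point's
```

The weighting peaks at distance 1/k and decays beyond it, so k controls how aggressively extreme outliers are discounted.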
Figure 1: Density contour and transformation (move) vectors
original data (often represented by the mean). A larger average displacement
suggests a stronger overall pull away from where the model was initially trained.
While average displacement offers a readily interpretable measure of global
shift, it suffers from significant limitations. This approach is akin to tracking
only the movement of a ship’s anchor. While the anchor’s position provides
some information about the ship’s location, it reveals nothing about the ship’s
orientation, its rocking due to waves, or whether it is taking on water.
Similarly, average displacement is blind to changes in the shape of the data
distribution. A cluster of data points might be stretching, rotating, or becoming
more dispersed, and yet the average displacement could remain relatively un-
changed. For instance, imagine a scenario where a few outliers in the new data
move significantly farther from the center. The average displacement might
not change drastically, especially if the dataset is large, because the value is
averaged across all data points.
Furthermore, average displacement ignores the direction of these pulls. Two
new data points could exert equal but opposite forces on the center of mass,
effectively canceling each other out in the average displacement calculation.
However, these opposing forces, rather than indicating stability, might be subtly
twisting or warping the shape of the data in ways that degrade model perfor-
mance.
Another interesting concept related to data shape changes is the convex hull.
Imagine wrapping the data points with the tightest possible elastic sheet. The
convex hull represents the boundary of this sheet, encompassing all the data
points. A shift in the convex hull often indicates a change in the extremities
of the data distribution, where boundary points are pulled outward. While
intuitively appealing, relying solely on the convex hull for drift detection poses
practical challenges. Computing the convex hull, especially as the number of
dimensions increases, becomes computationally expensive, limiting its usefulness
for real-time monitoring. Moreover, the convex hull is highly susceptible to
outliers. A single data point moving far from the main cluster can drastically
alter the convex hull, even if the majority of the data remains relatively stable.
The limitations of average displacement and the computational challenges of
the convex hull underscore the need for more sophisticated approaches. To effec-
tively detect drift, we must move beyond simply measuring the average shift in
data points towards techniques that capture changes in the spread, orientation,
and overall shape of the data distribution within the high-dimensional vector
space.
Rn, where each dimension corresponds to a specific feature of the data after
appropriate preprocessing and embedding. Each data point is then represented
as a vector x = (x1, x2, . . . , xn) ∈ Rn.
To quantify how the new data points influence the original distribution, we
introduce the concept of a force vector. Each new data point xi exerts a pull
on the center of the baseline data, analogous to a gravitational force. This force
can be represented as a vector Fi :
Fi = xi − µ
where:
• Fi is the force vector exerted by the new data point xi .
• xi is the vector representing the new data point.
• µ is the vector representing the center of the baseline data distribution
(calculated as the mean, median, or a robust estimate).
This force vector captures both the direction and magnitude of the pull.
The magnitude, denoted ∥Fi ∥, is simply the Euclidean distance between the
new point and the baseline center.
D = (1/(n − m)) · Σ_{i=m+1}^{n} ∥xi − µ∥

where:
• D is the average displacement.
• n is the total number of data points (baseline + new).
• m is the number of data points in the baseline dataset.
The convex hull provides insights into how the distribution of data is chang-
ing, especially in terms of its outer limits. A shift in the convex hull indicates
that new data is affecting the boundaries of the dataset, possibly signaling drift.
However, calculating the convex hull in high-dimensional spaces is computa-
tionally expensive, making it less practical for real-time drift monitoring on
complex, high-dimensional datasets.
11 Quantifying Data Deformation: A Multifaceted
Approach to Drift Detection
The success of our proposed drift detection methodology relies on robustly quan-
tifying how new data deforms the geometric structure of the original data space.
We model our data within an n-dimensional Euclidean space Rn, where each
data point, represented as a vector x = (x1, x2, . . . , xn) ∈ Rn, corresponds to a
specific combination of features.
Fi = xi − µ
The magnitude of this force ∥Fi ∥ is the Euclidean distance between the
new point and the baseline center.
The average displacement is then

D = (1/(n − m)) · Σ_{i=m+1}^{n} ∥Fi∥

where:
– D is the average displacement.
– n is the total number of data points (baseline + new).
– m is the number of data points in the baseline dataset.
Higher D: A larger value of D indicates that the new data is pulling strongly
on the original data cloud, potentially signaling a significant shift or drift in the
underlying data distribution.
Lower D: A smaller value of D suggests that the new data is relatively
similar to the baseline data, with only minimal shift.
This measure serves as an initial, intuitive step in the drift detection process,
helping to detect if new data points are causing substantial changes in the
overall data distribution. However, as mentioned earlier, while it gives a basic
indication of drift, average displacement doesn’t capture changes in the shape
or orientation of the data distribution, which may require more sophisticated
techniques.
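Assuming D is taken as the mean force magnitude over the new points, the computation can be sketched as:

```python
import numpy as np

def average_displacement(baseline, new_points):
    """Mean Euclidean distance of the new points from the baseline
    center µ — a larger value suggests a stronger overall pull."""
    mu = baseline.mean(axis=0)
    return np.linalg.norm(new_points - mu, axis=1).mean()

rng = np.random.default_rng(1)
baseline = rng.normal(0.0, 1.0, size=(1_000, 5))
similar = rng.normal(0.0, 1.0, size=(200, 5))   # no drift
drifted = rng.normal(2.0, 1.0, size=(200, 5))   # shifted distribution

print(average_displacement(baseline, similar))  # lower D
print(average_displacement(baseline, drifted))  # higher D
```

As noted above, this scalar is only a first screen: two opposing pulls of equal size leave D nearly unchanged while still deforming the data's shape.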
By comparing the eigenvalues and eigenvectors of the baseline and new data
covariance matrices, we can quantify these deformations: eigenvalue changes
capture stretching or compression along the principal axes, while eigenvector
rotations capture changes in orientation.
12 Drawing Insights from Continuum Mechanics: The "Strain" of Data Drift
To further enrich our understanding of data deformation, we turn to the field
of continuum mechanics, which studies the behavior of materials under stress
and strain. Imagine stretching a rubber band. The movement of any point on
the band from its original to its stretched position is its displacement. Strain,
on the other hand, measures the relative deformation within the material—how
much the rubber band has stretched compared to its original length.
The key concept here is the relationship between displacement and strain.
Strain isn’t just about movement; it’s about how that movement changes across
the material. A high strain indicates a rapid change in displacement over a
small distance, like a sharp bend in our rubber band.
We can apply this analogy to our data space. The arrival of new data,
particularly those points significantly different from the baseline, can be seen
as a force acting on this space, causing "stretching" or "compression" along
different dimensions.
data, this space is typically Euclidean, i.e., M = Rn with the standard Euclidean
distance metric. For more complex data types, such as text or images, the metric
space may involve more specialized distance metrics, such as cosine similarity
for text embeddings or structural similarity (SSIM) for images. The choice of
metric depends on the nature of the data and its structure.
new text. For example, if "inflation" and "economy" are appearing together
more frequently in new contexts, the off-diagonal element would highlight this
stronger correlation, reflecting the fact that these two concepts are being dis-
cussed in closer relation in the new data.
Tabular Data: For tabular data, the strain tensor can be calculated by com-
paring differences in means and covariance matrices between the baseline and
new data. The displacement field can be estimated as the difference in feature
values between the two datasets, and the strain tensor is then derived from the
gradients of this displacement field. The eigenvalues and eigenvectors of the
covariance matrix provide insights into how the data is stretched (eigenvalues)
and rotated (eigenvectors). This approach is particularly useful for identifying
feature-wise changes and shifts in the relationships between features.
Image Data: In the case of image data, pre-trained convolutional neural net-
works (CNNs) or autoencoders can be employed as feature extractors to repre-
sent images as vectors in a lower-dimensional latent space. The strain tensor is
then calculated within this feature space, where it captures changes in high-level
visual features, such as shapes, textures, or patterns, extracted by the CNN or
autoencoder. This approach allows us to analyze how the overall "appearance"
of images in the dataset changes over time or between different datasets, by
detecting stretching, compression, and deformation of features in this reduced
space.
Text Data: For text data, changes in word embeddings are analyzed using
metrics like cosine distance or Kullback-Leibler (KL) divergence to quantify
semantic shifts. Word embeddings represent words as vectors in a semantic
space, and the strain tensor is calculated based on changes in pairwise distances
between these embeddings. This captures how the relationships between words
have changed over time or between different datasets. For example, a significant
change in the embedding distances between words like "economy" and
"inflation" could indicate a semantic drift, reflecting a shift in the context in which
these words are used in the new data. Such metrics provide a way to detect
contextual and semantic drift in text data by measuring shifts in the underlying
semantic relationships.
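A minimal sketch of this pairwise-distance comparison, using synthetic embeddings in place of real word vectors (the "drifting word" scenario below is simulated, not measured):

```python
import numpy as np

def pairwise_cosine(embeddings):
    """Pairwise cosine-distance matrix for a (words × dims) array."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

# Hypothetical 4-word vocabulary embedded at two points in time.
rng = np.random.default_rng(2)
old = rng.normal(size=(4, 8))
new = old.copy()
new[0] = new[0] + 0.8 * new[1]  # simulate word 0 drifting toward word 1

shift = np.abs(pairwise_cosine(new) - pairwise_cosine(old))
print(shift.round(3))  # large entries flag the drifting word pair
```

The matrix of distance changes localizes the drift to specific word pairs, which is what makes this view more interpretable than a single global statistic.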
12.5 Relating the Strain Tensor to Specific Types of Drift
The strain tensor provides a powerful framework for detecting different types of
drift in data. By analyzing the diagonal and off-diagonal elements of the strain
tensor, we can capture various shifts in the data, whether they relate to concept
drift, data drift, or context drift. Below, we relate the strain tensor to each of
these drift types.
Context Drift (Text Data): In text data, context drift refers to shifts in
the meaning or usage of words within a corpus. Changes in the relationships
between word embeddings, as captured by the strain tensor, can signal such
contextual shifts. The off-diagonal elements in this case represent changes in the
relationships between words. For example, a significant change in the embedding
relationship between the words ”bank” and ”river” versus ”bank” and ”money”
could indicate that the context in which ”bank” is used has shifted from a
financial meaning to a geographical one, reflecting context drift. This type of
drift is particularly important in applications like natural language processing,
where understanding the evolving meaning of words is critical to maintaining
model performance.
tensor, a mathematical object capable of capturing multi-dimensional changes
in distances, angles, and volumes within the data space.
Defining this strain tensor requires carefully considering the specific char-
acteristics of the data. For tabular data, analyzing changes in the covariance
matrix can provide a basis for calculating strain. For image data, we might need
to work in a lower-dimensional latent space learned by an autoencoder or use
image embeddings to represent images as feature vectors.
The key is to adapt the concepts of strain from continuum mechanics to
provide a more nuanced and interpretable view of how data is deforming, going
beyond simple drift detection to gain a deeper understanding of the nature of
the changes and their potential impact on model performance.
Note: While drawing inspiration from continuum mechanics, it is crucial to
acknowledge that data space, unlike a physical material, is not truly continuous.
There might be scenarios where the direct application of strain-displacement
relationships is not appropriate, especially when dealing with discrete data or
complex dependencies between features. Further research is needed to refine
and validate the applicability of strain-based measures in various data contexts.
These limitations highlight the need for more sophisticated measures that
capture the multi-faceted nature of data deformation in a computationally effi-
cient manner.
high-dimensional spaces, leading to sparse data representation and poor density
estimation performance. This issue makes it difficult to apply KDE directly
in very high-dimensional spaces, such as those often encountered in modern
machine learning applications like image or text data.
Geodesic Distances: Once the manifold is learned, one can compute geodesic
distances between points on the manifold. Geodesic distances represent the
shortest paths along the surface of the manifold, as opposed to Euclidean dis-
tances in the original high-dimensional space. These distances provide a more
meaningful measure of how points relate to each other on the manifold’s struc-
ture.
Dimensionality Reduction Techniques: Methods like Principal Compo-
nent Analysis (PCA) or t-Distributed Stochastic Neighbour Embedding (t-SNE)
can be applied first to reduce the data to a lower-dimensional space before es-
timating the displacement field. This approach helps mitigate the challenges
posed by high dimensionality, making density estimation and displacement field
calculations more feasible.
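A brief sketch of this reduce-then-compare strategy with scikit-learn's PCA; pairing each baseline point with a perturbed copy of itself is a simplifying assumption used here only so that the displacement field is well defined:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
baseline = rng.normal(size=(500, 100))  # high-dimensional baseline data
new = baseline + rng.normal(0.0, 0.2, size=baseline.shape)  # perturbed copy

pca = PCA(n_components=10).fit(baseline)  # fit on the baseline only
base_low = pca.transform(baseline)
new_low = pca.transform(new)

# Displacement field in the reduced space, one vector per point pair.
displacement = new_low - base_low
print(np.linalg.norm(displacement, axis=1).mean())
```

Fitting the projection on the baseline alone is deliberate: if the new data were included, the projection itself would absorb part of the drift being measured.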
13.5 Manifolds
A manifold is a central concept in geometry and topology, representing a gen-
eralized idea of a curve or surface extended to higher dimensions. Formally, a
manifold is a topological space that locally resembles Euclidean space near each
point. This means that for any given point on a manifold, there is a neighbor-
hood around that point which is homeomorphic to an open subset of Euclidean
space. Several key concepts related to manifolds are outlined below:
• Differentiable Manifold: A differentiable manifold is a type of mani-
fold where the transition functions between overlapping charts are differ-
entiable. This structure enables the use of calculus on the manifold and
makes it suitable for studying smooth shapes and continuous spaces.
• Riemannian Manifolds: A Riemannian manifold is a differentiable
manifold equipped with a Riemannian metric, which allows for the mea-
surement of distances and angles. This structure is crucial for analyzing
curvature and other geometric properties, making it essential for studying
smooth and continuous deformations of the data space.
Manifolds are particularly useful when the data lies on or near a lower-
dimensional subspace within a higher-dimensional space. Techniques such as
Isomap or t-SNE can be used to learn the underlying manifold structure. Once
the manifold is learned, geodesic distances can be calculated to estimate the
displacement field between the baseline and new data points.
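A short sketch of manifold learning with scikit-learn's Isomap; the swiss-roll dataset is illustrative only. Isomap exposes the learned geodesic distance matrix directly:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=400, random_state=0)
iso = Isomap(n_neighbors=10, n_components=2).fit(X)

geodesic = iso.dist_matrix_   # pairwise geodesic distances on the manifold
embedded = iso.transform(X)   # 2-D coordinates along the learned manifold
print(geodesic.shape, embedded.shape)
```

In a drift-monitoring setting, new points would be projected with `transform` and their movement measured along the manifold rather than in the raw high-dimensional space.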
d = xi − µ
where µ is the center (mean) of the original data distribution. The magnitude
of the deviation can be scaled to reflect the influence of the new data point. We
can scale the length of the deviation vector using a fading influence function:
Σ = (1/(N − 1)) · Σ_{i=1}^{N} (xi − µ)(xi − µ)^T
where µ is the mean vector of the features, and xi represents a data point in
the feature space.
Eigenvalue Analysis: The eigenvalues of the covariance matrix describe
the variance along the principal axes of the data. If a new batch of data causes
certain eigenvalues to grow or shrink significantly, this corresponds to a
"stretching" or "compressing" of the data along those directions. This provides a precise,
quantifiable measure of deformation:
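As an illustrative sketch (not the paper's released code), the eigenvalue comparison can be carried out as follows; pairing eigenvalues by sorted order is a simplification that ignores axis rotation:

```python
import numpy as np

def eigen_deformation(baseline, new):
    """Relative change in covariance eigenvalues: growth along a
    principal axis reads as stretching, shrinkage as compression.
    Eigenvalues are paired by sorted order (a simplification)."""
    ev_base = np.linalg.eigvalsh(np.cov(baseline, rowvar=False))
    ev_new = np.linalg.eigvalsh(np.cov(new, rowvar=False))
    return (ev_new - ev_base) / ev_base

rng = np.random.default_rng(4)
baseline = rng.normal(0.0, 1.0, size=(2_000, 3))
stretched = baseline * np.array([1.0, 1.0, 2.0])  # stretch the third axis

deformation = eigen_deformation(baseline, stretched)
print(deformation.round(2))  # large last entry: stretching along one axis
```

Comparing the corresponding eigenvectors (e.g., via the angle between them) would additionally capture rotation, which this eigenvalue-only view misses.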
Kullback-Leibler Divergence: To quantify how significantly the local
densities differ, you can use Kullback-Leibler (KL) divergence:
KL(P ∥ Q) = Σ_x P(x) log (P(x) / Q(x))
where P (x) is the PDF of the old data, and Q(x) is the PDF of the new data.
Large KL divergence would signal significant local deformations, which can in-
dicate drift that traditional global metrics might miss.
This connects back to the metaphor by identifying regions of the data space
where clusters become more tightly packed (compression) or dispersed (expan-
sion), echoing the physical concept of deformation.
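A minimal sketch of the KDE-plus-KL comparison in one dimension; evaluating both densities on a shared grid is a discrete approximation, and the small constant added before taking logarithms is a numerical safeguard:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
old = rng.normal(0.0, 1.0, size=2_000)
new = np.concatenate([rng.normal(0.0, 1.0, 1_500),
                      rng.normal(3.0, 0.3, 500)])  # a new local cluster

grid = np.linspace(-5, 6, 400)
p = gaussian_kde(old)(grid) + 1e-12   # small constant avoids log(0)
q = gaussian_kde(new)(grid) + 1e-12
p, q = p / p.sum(), q / q.sum()       # normalize on the shared grid

kl = np.sum(p * np.log(p / q))
print(f"KL(P || Q) = {kl:.3f}")
```

The emerging cluster at x ≈ 3 barely moves the global mean, yet it produces a clearly nonzero KL divergence, which is exactly the kind of local deformation this measure is meant to surface.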
14 Results of Drift Analysis
The custom drift detection and analysis system, which employs both deformation-
based and statistical measures, was tested on real text data.
15 Experimental Results
Our drift detection and analysis system was tested on real text data, focusing
on contextual and semantic shifts. Using deformation-based metrics (Cosine
Distance, L2 Norm) alongside statistical measures (Wasserstein Distance), we
were able to detect context drift that was missed by statistical approaches alone.
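A simplified sketch of how such metrics can be combined. Term-frequency vectors stand in for the embeddings used in our experiments, and treating the two frequency profiles as 1-D samples for the Wasserstein distance is a crude proxy:

```python
import numpy as np
from collections import Counter
from scipy.stats import wasserstein_distance

def drift_metrics(old_text, new_text):
    """Cosine distance and L2 norm between term-frequency vectors,
    plus a Wasserstein distance between the two frequency profiles."""
    old_c, new_c = Counter(old_text.split()), Counter(new_text.split())
    vocab = sorted(set(old_c) | set(new_c))
    p = np.array([old_c[w] for w in vocab], dtype=float)
    q = np.array([new_c[w] for w in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()
    cosine = 1.0 - (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return {"cosine": cosine,
            "l2": float(np.linalg.norm(p - q)),
            "wasserstein": wasserstein_distance(p, q)}

m = drift_metrics("the economy is stable and growth is steady",
                  "the economy is volatile and inflation is rising")
print(m)
```

The deformation-based metrics (cosine, L2) react to which words changed, while the Wasserstein term summarizes the overall frequency profile, mirroring the complementary roles the two families of measures play in our system.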
The results highlighted the practical benefits of the deformation-based ap-
proach, offering a more nuanced understanding of drift in dynamic environments
such as conversational AI.
17 Computational Complexity
The proposed deformation-based drift detection methodology provides enhanced
sensitivity and flexibility in identifying subtle shifts in data distributions. How-
ever, the computational cost of some of the core operations, particularly in
high-dimensional spaces, can become prohibitive, especially for real-time appli-
cations or very large datasets.
17.2 Convex Hull Calculation
Challenge: The convex hull, which provides insights into boundary shifts in the
data distribution, becomes increasingly computationally expensive to compute
as the number of dimensions increases. In high-dimensional spaces, constructing
the convex hull requires evaluating many facets, which scales poorly with data
dimensionality.
Mitigation Strategies:
• Approximation Techniques: Rather than calculating the exact convex
hull, approximate algorithms (e.g., Quickhull or iterative pruning) can
provide a good-enough estimate of the convex hull in significantly less
time. These methods focus on identifying key boundary points rather than
constructing the entire hull, thus reducing computational complexity.
• Sampling: Subsampling the dataset is another potential strategy. By
selecting a representative subset of data points, you can compute an ap-
proximate convex hull that reflects the overall shape of the data without
requiring full-scale computation. This tradeoff between accuracy and com-
putational load can be useful in real-time settings.
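A short sketch of the subsampling strategy with SciPy's ConvexHull (which is backed by the Qhull implementation of Quickhull); the sample size of 500 is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(6)
data = rng.normal(size=(10_000, 3))

# Approximate the hull from a random subsample of the data.
sample = data[rng.choice(len(data), size=500, replace=False)]
hull = ConvexHull(sample)

print(f"approx. hull volume: {hull.volume:.1f}, "
      f"boundary points: {len(hull.vertices)}")
```

Tracking the approximate hull volume over successive batches gives a cheap boundary-shift signal, at the cost of some sensitivity to which points happen to be sampled.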
• Incremental Algorithms: Instead of recalculating deformation metrics
from scratch with each new data batch, incremental methods can be used.
These methods update eigenvalue decompositions, covariance matrices,
and density estimates incrementally as new data arrives, reducing the
need for full re-computation.
• Parallelization and GPU Acceleration: Many of the computations
required for deformation analysis, such as eigenvalue decomposition and
convex hull computation, can be parallelized. Leveraging GPU accelera-
tion or distributed computing frameworks (e.g., Spark or Dask) can im-
prove performance in large-scale, real-time applications.
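The incremental idea above can be sketched with a Welford-style running covariance, which updates the mean and covariance per batch instead of recomputing them from scratch (the class name is ours):

```python
import numpy as np

class IncrementalCovariance:
    """Welford-style running mean and covariance, updated per batch
    instead of being recomputed from scratch."""
    def __init__(self, dim):
        self.n, self.mean = 0, np.zeros(dim)
        self.M2 = np.zeros((dim, dim))  # running sum of outer products

    def update(self, batch):
        for x in batch:
            self.n += 1
            delta = x - self.mean          # deviation from the old mean
            self.mean += delta / self.n
            self.M2 += np.outer(delta, x - self.mean)

    @property
    def cov(self):
        return self.M2 / (self.n - 1)

rng = np.random.default_rng(7)
data = rng.normal(size=(1_000, 4))
inc = IncrementalCovariance(4)
for batch in np.array_split(data, 10):  # data arrives in ten batches
    inc.update(batch)

assert np.allclose(inc.cov, np.cov(data, rowvar=False))
```

Because only the mean and the M2 accumulator are stored, the memory cost is independent of the number of points seen, which suits continuous monitoring.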
18 Practical Results
We have performed several tests, and we invite you to do the same. We used
a publicly available text file and, using different language models, created sev-
eral drifted versions of it, modifying words and entire sentences to simulate
potential drift.
Comparing the original file with one of the synthetic versions, we obtained
the following results:
From a visualization perspective, we showed the change in the dimensionality-reduced
space and examined how it evolved. We captured several snapshots of
the deformation process. Running the code allows you to see all the deformation
steps dynamically. In the appendix you will find the results of the space deformation.
Metric                          Value              Description
------------------------------  -----------------  --------------------------------------------
Original Text Length            16,380 characters  Length of the original text
Drifted Text Length             9,012 characters   Length of the drifted text
Length Change                   45.0%              Percentage reduction in text length
Deformation (Cosine Distance)   0.1516             Overall semantic change (0-1 scale; higher
                                                   values indicate more change)
Shape Change (L2 Norm)          0.5507             Change in specific word usage and frequency
Wasserstein Distance            0.0047             Statistical change in word distribution
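The metrics in the table can be reproduced in spirit with a few lines of Python. The sketch below is a simplified version under stated assumptions: it uses a plain bag-of-words representation from scikit-learn rather than the richer embeddings used in our experiments, so the exact numbers will differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cosine
from scipy.stats import wasserstein_distance

def drift_metrics(original: str, drifted: str) -> dict:
    """Compare two texts with the three metrics used in the table."""
    X = CountVectorizer().fit_transform([original, drifted]).toarray().astype(float)
    p = X[0] / X[0].sum()          # word-frequency distribution of original
    q = X[1] / X[1].sum()          # word-frequency distribution of drifted text
    positions = np.arange(len(p))  # vocabulary indices as support points
    return {
        "cosine_distance": cosine(X[0], X[1]),        # overall semantic change
        "l2_norm": float(np.linalg.norm(p - q)),      # change in word usage
        "wasserstein": wasserstein_distance(positions, positions, p, q),
        "length_change": 1.0 - len(drifted) / len(original),
    }
```

Running this on the original file and a synthetic variant gives a quick, embedding-free baseline for the deformation measurements reported above.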
Figure 2: Spaces at time 0 are identical, as the new text has not yet shifted the original
Figure 3: At 50% of the force applied, the original space is deformed; forces (in green) are strong
Figure 4: With the full force applied, the original space has shifted: compressed
in some regions, expanded in others
19 Fields of Application
The presented model has numerous specific applications in contexts that are
already prepared for its use and integration. Similar models have previously
benefited from machine learning, demonstrating their potential effectiveness.
The idea behind this hypothesis is to demonstrate that the approach is not
confined to a specific type of data (such as text), but can be applied across a wide
range of sectors with different dynamics and logics. This would greatly
increase the value and impact of the methodology, showing its versatility and
adaptability.
20 Healthcare
In medical and healthcare scenarios, there are several successful cases of ML
models trained on large amounts of historical data for predictive purposes. As
we know, however, patient data, symptoms, and pathologies change constantly,
as do patients' responses to treatment: health status, aging, recovery from
surgery, rehabilitation, and the post-operative course.
Patients' lifestyles also change: over time, the behavior of an entire population
can shift (new treatments, demographic and birth-rate changes, newly tested
drugs that did not exist before). These changes can make models trained on
historical data obsolete. Our "deformation" approach can be used to detect when
the data of a patient, or of a group of patients, starts to deviate from the
reference population on which the model was trained. Consider the example we
tested: imagine a predictive model used to diagnose a cardiovascular disease. If
the patient population starts to age (a higher incidence of elderly patients than
when the model was trained), the resulting deformation in the vector space
signals a drift. This allows us to intervene and update the model before it starts
to lose accuracy.
Mathematical formulation of drift and deformation
Data Vector Space: Patient data can be represented as vectors in a high-dimensional
space. Each vector represents a patient with his or her characteristics:
\[ x_i = (x_{i1}, x_{i2}, \dots, x_{in}) \]
Where xi represents a single patient and each xij is a feature of the patient
(e.g. age, cholesterol, blood pressure, etc.). The total data space is then Rn,
where n is the number of features.
We define a mean µ0 and a covariance Σ0 to describe the distribution of the
original model data (patients with a mean age and distribution of other clinical
characteristics):
\[ \mu_0 = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \Sigma_0 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \mu_0)(x_i - \mu_0)^T \]
Where N is the number of patients in the original training set.
From a health perspective, new patients with slightly different characteristics
are added over time, both within the same hospital or medical practice and in
larger samples when analyzed at the population level (e.g. an older population
in a given territorial band). We therefore define the mean and covariance
of the new data:
\[ \mu_t = \frac{1}{M}\sum_{i=1}^{M} x_{i,\mathrm{new}}, \qquad \Sigma_t = \frac{1}{M-1}\sum_{i=1}^{M} (x_{i,\mathrm{new}} - \mu_t)(x_{i,\mathrm{new}} - \mu_t)^T \]
\[ D_\mu = \lVert \mu_t - \mu_0 \rVert \]
A high value of Dµ indicates that the new data is moving away from the
center of the original distribution. This could indicate, for example, an increase
in the average age of the patients, or a different reaction in the same sample
analyzed.
The difference between the covariances can be measured using the Frobenius
norm or by comparing the eigenvalues of the two covariance matrices. The
Frobenius norm satisfies exactly the same properties as a vector norm; this
reflects the fact that the matrix space is isomorphic to a vector space (for
example, via the map that stacks the rows of a matrix into a single vector), so
a matrix norm must have at least the properties of a vector norm. The
Frobenius norm therefore quantifies the global change in the shape of the
distribution:
DΣ = ∥Σt − Σ0 ∥F
Where the Frobenius norm is defined as:
\[ \lVert \Sigma_t - \Sigma_0 \rVert_F = \sqrt{\sum_{i,j} \big( \Sigma_t(i,j) - \Sigma_0(i,j) \big)^2} \]
A high value of DΣ would indicate that the shape of the data distribution
is changing due to factors that were not initially present or were not considered
correctly, suggesting that the new patients being analyzed differ substantially
from the original group (for example, a change in the association between
elevated cholesterol and the age at which it occurs).
In this scenario we can define a composite drift index that combines the shift
of the mean of the sample under analysis with the change in the covariance,
giving an overall measure of drift, for example a weighted combination
Ddrift = α Dµ + β DΣ with weights α, β ≥ 0.
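The mean shift Dµ, the covariance change DΣ, and a weighted composite index can be computed directly from two batches of patient vectors. The sketch below is a minimal version; the function name, feature layout, and default weights are illustrative assumptions, not prescribed by the method.

```python
import numpy as np

def drift_indices(X_ref, X_new, alpha=1.0, beta=1.0):
    """Compute D_mu, D_Sigma, and a weighted composite drift index.

    X_ref: (N, n) reference patients; X_new: (M, n) new patients.
    Covariances use the 1/(N-1) normalization, matching the text.
    """
    mu0 = X_ref.mean(axis=0)
    mut = X_new.mean(axis=0)
    S0 = np.cov(X_ref, rowvar=False)   # unbiased estimator, 1/(N-1)
    St = np.cov(X_new, rowvar=False)
    d_mu = float(np.linalg.norm(mut - mu0))            # mean shift
    d_sigma = float(np.linalg.norm(St - S0, ord="fro"))  # covariance change
    return d_mu, d_sigma, alpha * d_mu + beta * d_sigma
```

In the cardiovascular example, an aging population (a shift in the age column) shows up first in Dµ, while a change in the cholesterol-age association shows up mainly in DΣ.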
21 Conclusion
We introduced a novel approach to drift detection by conceptualizing changes in
data as deformations in a high-dimensional vector space. Using mathematical
tools from linear algebra, topology, and continuum mechanics, we developed a
sensitive and interpretable framework for detecting drift, including the nuances
of context drift. Our approach captures both global and local shifts, distin-
guishing between benign data evolution and performance-degrading drifts. The
method has proven effective in dynamic environments, particularly in applica-
tions such as conversational AI.
Future work will focus on improving computational efficiency in high-dimensional
spaces and expanding the application of this methodology across different ML
domains.
22 Bibliography
• Kullback, S., & Leibler, R. A. (1951). On information and sufficiency.
The Annals of Mathematical Statistics, 22(1), 79-86. (Foundational paper
on KL divergence.)
• Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE
Transactions on Information Theory, 37(1), 145-151. (JS divergence and
its properties.)
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).
Distributed representations of words and phrases and their compositionality.
Advances in Neural Information Processing Systems, 26.
• Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine
Learning, 20(3), 273-297. (Foundational SVM paper.)
• Tenenbaum, J. B., De Silva, V., & Langford, J. C. (2000). A global
geometric framework for nonlinear dimensionality reduction. Science,
290(5500), 2319-2323. (Isomap.)
• van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE.
Journal of Machine Learning Research, 9, 2579-2605. (t-SNE.)
• Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014).
A survey on concept drift adaptation. ACM Computing Surveys, 46(4),
1-37. (A broad survey on concept drift.)