Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025
DOI:10.5121/mlaij.2025.12109
AN ADAPTIVE HIERARCHICAL TREE-BASED
CLUSTERING APPROACH TO OUTLIER DETECTION IN
ETF-FOCUSED FINANCIAL TIME-SERIES
Shlok Mandloi, Aryaman Jalali, and Eugene Pinsky
Department of Computer Science, Metropolitan College,
1010 Commonwealth Avenue, Boston University, Boston, MA
ABSTRACT
This paper introduces an adaptive framework for detecting outliers in financial time-series data, focusing
on Exchange-Traded Funds (ETFs). The method integrates hierarchical clustering and binary tree analysis
to identify unique ETF patterns while isolating anomalies. Using the yfinance API, daily returns for nine
ETFs and the S&P 500 index were collected over 24 years. Regression analysis removed market influence,
producing residuals that highlight ETF-specific behavior. Hierarchical clustering was applied to these
residuals annually, with dendrograms converted into binary trees. Outliers were detected as ETFs added
last in clustering and as root nodes in the trees. Metrics like tree height, breadth, and cluster compactness
captured temporal patterns and deviations. Experimental results demonstrate the framework’s ability to
detect anomalies during major market events, such as the 2008 financial crisis and the 2020 COVID-19
crash. This scalable and interpretable approach enhances anomaly detection in financial data analysis.
KEYWORDS
Hierarchical Clustering, Outlier Detection, Financial Time-Series, Binary Tree Analysis, Anomaly
Detection.
1. INTRODUCTION
The increasing complexity of financial markets necessitates robust and interpretable methods for
analyzing large-scale time-series data. Exchange-Traded Funds (ETFs), representing portfolios of
assets, provide a rich dataset for studying market behavior and identifying anomalies. However,
isolating outlier behavior in ETFs is challenging due to broader market influences and the high
dimensionality of financial data.
This study focuses on nine major sector-specific ETFs that comprise the S&P-500 index:
1. XLB: Materials
2. XLE: Energy
3. XLF: Finance
4. XLI: Industrials
5. XLK: Technology
6. XLP: Consumer Staples
7. XLU: Utilities
8. XLV: Healthcare
9. XLY: Consumer Discretionary
In addition, we consider the broad S&P-500 index. The simplest way to invest in this index is
through the "SPY" exchange-traded fund.
This paper introduces a scalable framework for detecting outlier behavior in ETFs using
hierarchical clustering and binary tree analysis. Daily returns of ETFs are regressed on the S&P
500 index returns, producing residuals that capture ETF-specific behavior. Hierarchical clustering is then
applied to these residuals to uncover patterns independent of broader market trends. The resulting
dendrograms are converted into binary tree
structures. This enables the identification of outliers, specifically ETFs that are added last in the
clustering process and occupy root node positions in the trees. Metrics such as tree height,
breadth, and compactness quantify the temporal dynamics of ETF behavior.
The proposed framework generalizes to high-dimensional time-series data, offering a
computationally efficient method for outlier detection and pattern recognition. Its adaptability
extends beyond finance, with potential applications in machine learning domains such as sensor
data analysis, network anomaly detection, and time-series clustering. This research establishes a
versatile approach for leveraging hierarchical relationships and tree-based metrics to uncover
actionable insights from large-scale datasets.
2. METHODOLOGY
2.1. Data Collection
The study collected daily returns data for nine sector-specific ETFs and the S&P 500 index over a
24-year period (2001–2024) using the yfinance API. This data set provides the foundation for
analyzing the behavior of ETFs and isolating sectoral trends. The S&P 500 returns were used as a
market benchmark for regression analysis, ensuring that sector-specific characteristics could be
captured through residuals.
Handling Missing Data and Trading Day Discrepancies: Given that ETFs may experience trading
inconsistencies due to market holidays and half-days, all datasets were aligned to a common
trading calendar based on the S&P 500 index. Missing data points were forward-filled using the
last available trading day to ensure continuity. If an ETF exhibited excessive missing data
(exceeding 5% of trading days in a given year), it was excluded from that year’s analysis to
prevent distortions in clustering results.
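The alignment and filtering rules above can be sketched for one year of data as follows. This is a minimal illustration, assuming pandas and a table of adjusted closing prices with one column per ticker (for example, as obtained from a yfinance download). The function names `align_and_fill` and `daily_returns`, the synthetic prices, and the choice to forward-fill prices before computing returns are assumptions made for the example, not the authors' code.

```python
import numpy as np
import pandas as pd

MAX_MISSING_FRACTION = 0.05  # 5% threshold stated in the text


def align_and_fill(prices: pd.DataFrame, benchmark: str = "SPY"):
    """Align all tickers to the benchmark calendar, drop sparse ones, forward-fill."""
    calendar = prices[benchmark].dropna().index   # benchmark trading days
    aligned = prices.reindex(calendar)
    missing = aligned.isna().mean()               # fraction of missing days per ticker
    kept = aligned.loc[:, missing <= MAX_MISSING_FRACTION]
    excluded = sorted(missing[missing > MAX_MISSING_FRACTION].index)
    return kept.ffill(), excluded


def daily_returns(prices: pd.DataFrame) -> pd.DataFrame:
    """Simple daily returns; the first row has no predecessor and is dropped."""
    return prices.pct_change(fill_method=None).dropna(how="all")


# Synthetic prices standing in for one (short) year of downloaded data.
days = pd.bdate_range("2020-01-01", periods=40)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 + rng.normal(0, 1, size=(40, 3)).cumsum(axis=0),
    index=days,
    columns=["SPY", "XLE", "XLK"],
)
prices.loc[days[5], "XLE"] = np.nan        # 1 of 40 days missing: kept and filled
prices.loc[days[10:20], "XLK"] = np.nan    # 10 of 40 days missing: excluded

filled, excluded = align_and_fill(prices)
returns = daily_returns(filled)
```

In this sketch the exclusion test is applied before filling, so a ticker with long gaps is not silently turned into a flat price series.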
2.2. Residuals Analysis
A common and simple approach to analyzing the returns of a stock $S$ is to model the dependence of
its returns on the overall market with a linear regression. If we use daily
returns and perform the analysis for one year, the independent
variable is a (250-day) vector of daily S&P-500 (market) returns $R^* = (r_1^*, \ldots, r_{250}^*)$ and the
dependent variable is a (250-day) vector $R(S) = (r_1^{(S)}, \ldots, r_{250}^{(S)})$ of daily returns of security $S$:

$$\underbrace{(r_1^{(S)}, \ldots, r_{250}^{(S)})}_{R(S)} = \alpha + \beta \underbrace{(r_1^*, \ldots, r_{250}^*)}_{R^*} + \underbrace{(\epsilon_1, \ldots, \epsilon_{250})}_{\epsilon}$$
Here $\beta$ represents the dependence of the returns of $S$ on the market and can be interpreted as the
slope of the regression line. Alpha ($\alpha$) indicates how an investment has performed
after accounting for risk (i.e., the excess return relative to a benchmark, adjusted for risk).
It can be interpreted as the intercept of the regression line. Finally, the residuals $(\epsilon_1, \ldots, \epsilon_{250})$ are
uncorrelated with market returns and can be interpreted as an idiosyncratic source of risk and return
driven by the skill of company management.
One of the problems with such an approach is that securities are correlated.
If we were to simply use the standard Euclidean metric $d(S_1, S_2)$ between the return vectors $R(S_1)$
and $R(S_2)$ for securities $S_1$ and $S_2$, this distance could be made arbitrarily large even if the return
pattern is essentially identical. For example, consider a simple case in which the daily return of $S_1$ on
each day is just a multiple $C$ of the corresponding daily return of $S_2$. Then, by taking $C$ to be large
enough, we can obtain any value for the distance $d(S_1, S_2)$, whereas the pattern of daily price
directions of both securities is identical.
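The scaling example can be written out explicitly. With $R(S_1) = C\,R(S_2)$ and the Euclidean norm $\lVert\cdot\rVert_2$ (the only notation added here), the distance is

$$d(S_1, S_2) = \lVert R(S_1) - R(S_2) \rVert_2 = \lVert C\,R(S_2) - R(S_2) \rVert_2 = |C - 1| \, \lVert R(S_2) \rVert_2 ,$$

which grows without bound as $C$ increases, even though for any $C > 0$ the two securities move in the same direction on every day.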
Our proposal is to focus on residual vectors rather than return vectors. This also allows us to
ignore market-wide influences on ETF returns, enabling a more precise analysis of sector-specific
behavior. The approach involves the following steps:
• Independent Variable: Compute the annual vector of daily S&P 500 returns, which serves as
the benchmark that accounts for market-wide movements.
• Dependent Variable: Compute the annual vectors of daily ETF returns, which represent the
performance of each ETF.
• Residuals: Compute the annual vectors of daily residuals $\epsilon$ as the difference
between the observed ETF returns and the values predicted by the regression line.
The residuals capture ETF-specific behavior by isolating deviations from market trends. This
process was critical for identifying anomalies and sector-specific patterns, ensuring that the
clustering analysis focused on unique ETF dynamics rather than broader market influences. The
use of residuals allows for:
• Enhanced Analysis: By removing market-wide effects, residuals help uncover sector-specific
responses to external events.
• Accurate Clustering: The residual-based data ensures that hierarchical clustering reflects
intrinsic ETF behavior rather than market-driven correlations.
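The residual computation can be sketched as an ordinary least squares fit of one ETF's daily returns on the market's daily returns. The code below is a minimal illustration using NumPy; the function name `market_residuals` and the simulated return series are assumptions made for the example.

```python
import numpy as np


def market_residuals(etf_returns, market_returns):
    """Fit r_etf = alpha + beta * r_mkt by least squares; return alpha, beta, residuals."""
    y = np.asarray(etf_returns, dtype=float)
    x = np.asarray(market_returns, dtype=float)
    beta, alpha = np.polyfit(x, y, deg=1)    # coefficients come highest degree first
    residuals = y - (alpha + beta * x)       # observed minus predicted
    return alpha, beta, residuals


# Simulated year of 250 trading days (illustrative values only).
rng = np.random.default_rng(42)
market = rng.normal(0.0004, 0.01, size=250)
etf = 0.0001 + 1.2 * market + rng.normal(0.0, 0.005, size=250)

alpha, beta, eps = market_residuals(etf, market)
```

By construction of the least squares fit with an intercept, the residual vector has zero mean and zero sample correlation with the market returns, which is the property the clustering step relies on.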
2.3. Hierarchical Clustering
The residuals obtained from regression analysis were subjected to hierarchical clustering to
identify relationships among ETFs. This method was chosen for its ability to reveal structural
patterns over time without requiring a predefined number of clusters. To ensure robustness,
multiple linkage methods (single, complete, average, and Ward's) were evaluated. Ward's method was
ultimately selected, as it minimizes intra-cluster variance while maintaining well-balanced
clusters. The procedure and the outcome of the sensitivity analysis were as follows:
• Clustering: Hierarchical clustering was performed to group ETFs based on their
residuals.
• Linkage: Various linkage techniques were tested. Single linkage produced elongated,
chain-like clusters, reducing interpretability, while complete linkage resulted in well-separated
clusters but introduced distortions in compactness. Average linkage provided
moderate compactness but was outperformed by Ward's method, which consistently
produced the most stable and interpretable clustering results.
• Visualization: Dendrograms were generated to illustrate clustering structures and
highlight sectoral divergence.
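The clustering step can be sketched with SciPy's agglomerative clustering. The example assumes a residual matrix with one row per ETF and one column per trading day; the helper names `cluster_residuals` and `last_added_leaf` and the simulated residuals are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def cluster_residuals(residual_matrix):
    """Ward linkage on the rows (one row per ETF, one column per trading day)."""
    return linkage(np.asarray(residual_matrix, dtype=float), method="ward")


def last_added_leaf(Z, labels):
    """Label of the single ETF joined in the final merge, or None if two clusters merged."""
    n = len(labels)
    for child in Z[-1, :2].astype(int):
        if child < n:                # ids below n are original observations
            return labels[child]
    return None


# Simulated residuals: four quiet sectors and one sector with much larger residuals.
labels = ["XLI", "XLP", "XLV", "XLY", "XLE"]
rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 0.001, size=(5, 250))
residuals[4] = rng.normal(0.0, 0.02, size=250)

Z = cluster_residuals(residuals)
outlier = last_added_leaf(Z, labels)
```

Each row of the linkage matrix `Z` records one merge (the two cluster ids, the merge distance, and the size of the new cluster), so the last row identifies what was added last.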
While hierarchical clustering was the preferred approach due to its interpretability and ability to
track sectoral shifts, alternative methods were considered:
• DBSCAN: Effective for detecting anomalies but highly sensitive to distance parameters
and struggled with clusters of varying density.
• Spectral Clustering: Leveraged eigenvalues to identify complex structures but required a
predefined number of clusters and was computationally intensive.
• K-Means: A commonly used partitioning method but assumed spherical clusters and was
highly sensitive to outliers, making it less ideal for financial time-series data.
Hierarchical clustering was ultimately chosen for its ability to generate dendrograms and track
ETF movements over time without requiring predefined cluster counts. Future research may
explore hybrid models combining hierarchical clustering with deep-learning-based anomaly
detection to further enhance robustness.
2.4. Binary Tree Construction
After hierarchical clustering, binary trees were constructed to dynamically represent the
clustering relationships. Each stock (ETF) was assigned as a leaf node, and the internal nodes
represented merged clusters. The tree-building process followed these steps:
• Node Creation: Original stocks were assigned as leaf nodes, while internal nodes were
generated for clusters formed during hierarchical merging.
• Tree Structure: Parent-child relationships were established based on the linkage matrix,
ensuring the tree's root node represented the final cluster.
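A minimal sketch of the tree construction follows, assuming the linkage matrix uses the SciPy convention: row i merges two existing cluster ids and creates a new cluster with id n + i, where n is the number of ETFs. The nested-tuple representation (a leaf is a ticker string, an internal node is a pair of subtrees), the function name `linkage_to_tree`, and the example merges are assumptions made for illustration.

```python
def linkage_to_tree(linkage_rows, labels):
    """Build a binary tree from linkage rows (left id, right id, distance, size).

    Leaves are ticker strings; each merge becomes a (left, right) tuple.
    The last node created is the root, i.e. the final cluster.
    """
    nodes = list(labels)                      # ids 0..n-1 are the leaves
    for left, right, _distance, _size in linkage_rows:
        nodes.append((nodes[int(left)], nodes[int(right)]))
    return nodes[-1]


# Hypothetical merges for four tickers: SPY+XLI first, then XLP, then XLE last.
labels = ["SPY", "XLI", "XLP", "XLE"]
rows = [
    (0, 1, 0.10, 2),   # creates cluster id 4
    (4, 2, 0.20, 3),   # creates cluster id 5
    (5, 3, 0.90, 4),   # creates cluster id 6, the root
]
root = linkage_to_tree(rows, labels)
```

In this example the ticker merged last (XLE) ends up as a direct child of the root, which is the structural signature the paper uses to flag an outlier.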
2.5. Depth-First Search (DFS) for Position Assignment
A Depth-First Search (DFS) traversal was applied to the binary trees to assign positions to the
stocks:
• Positioning: Each stock was assigned a unique position based on the traversal order,
allowing consistent analysis of sector behavior across years.
• Handling Missing Data: For years in which data for certain ETFs was unavailable, positions
were left as NaN to maintain accuracy.
This approach provided a systematic way to track
the positions of ETFs over time, helping to identify outliers and trends.
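One way to realize the position assignment is sketched below. It assumes the tree is stored as nested tuples (a leaf is a ticker string, an internal node is a pair of subtrees) and that the traversal descends into the deeper subtree first, so that a leaf hanging directly off the root is numbered last. The visiting order and the function names are assumptions; the paper does not spell out the exact rule.

```python
def height(tree):
    """Number of edges on the longest path from this node down to a leaf."""
    if isinstance(tree, str):
        return 0
    return 1 + max(height(child) for child in tree)


def dfs_positions(tree):
    """Number the leaves 1, 2, ... in depth-first order, deeper subtree first."""
    positions = {}

    def visit(node):
        if isinstance(node, str):
            positions[node] = len(positions) + 1
            return
        # Stable sort: ties keep the original left-to-right order.
        for child in sorted(node, key=height, reverse=True):
            visit(child)

    visit(tree)
    return positions


tree_2020 = ((("SPY", "XLI"), "XLP"), "XLE")      # hypothetical tree
positions = dfs_positions(tree_2020)

# Tickers without data in a given year keep a NaN position.
all_tickers = ["SPY", "XLI", "XLP", "XLE", "XLK"]
row = {t: positions.get(t, float("nan")) for t in all_tickers}
```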
2.6. Tree Metrics Calculation
Key metrics were computed from the binary trees to quantify clustering characteristics:
• Tree Height: Measured as the longest path from the root to a leaf node, indicating the
hierarchical depth of clustering.
• Tree Breadth: Calculated as the maximum number of nodes at any level, reflecting the
spread of clusters.
• Cluster Compactness: Defined as the average variance of residuals within clusters,
highlighting the degree of similarity among clustered ETFs.
These metrics allowed for a deeper understanding of the sector dynamics and provided
quantitative measures to compare ETF behaviors over years.
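The two structural metrics can be sketched on a nested-tuple tree (a leaf is a ticker string, an internal node is a pair of subtrees). Height is counted here in edges; whether the reported values count edges or nodes on the longest path is not stated, so that choice is an assumption, as are the function names and example trees.

```python
def tree_height(tree):
    """Longest root-to-leaf path, counted in edges."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_height(child) for child in tree)


def tree_breadth(tree):
    """Maximum number of nodes found on any single level of the tree."""
    widest, level = 0, [tree]
    while level:
        widest = max(widest, len(level))
        # Next level: children of every internal node on the current level.
        level = [child for node in level if not isinstance(node, str) for child in node]
    return widest


chained = ((("SPY", "XLI"), "XLP"), "XLE")        # one ETF added at a time
balanced = (("XLB", "XLI"), ("XLP", "XLU"))       # two pairs merged at the end

metrics = {
    "chained": (tree_height(chained), tree_breadth(chained)),
    "balanced": (tree_height(balanced), tree_breadth(balanced)),
}
```

A chain-like tree is tall and narrow, while a balanced tree is short and wide, which is why height and breadth move in opposite directions in many years.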
2.7. Trajectory Analysis
To evaluate the temporal patterns in ETF clustering, the distance of each ETF from the root node
was calculated.
• Root Node Distances: Determined how closely each stock aligned with the overall cluster
hierarchy for a given year.
• Trajectory Visualization: Line plots were generated to track the yearly distances for all
ETFs, showcasing how their clustering relationships evolved over time.
This analysis revealed trends in sector-specific behaviors and identified years where significant
deviations occurred due to market events, such as the 2008 financial crisis and the 2020 COVID-
19 pandemic.
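A minimal sketch of the trajectory computation is given below. It assumes that the distance from the root is the depth of the ETF's leaf, counted in edges, in a nested-tuple tree; the paper does not define the distance precisely, so this reading, the function name `root_distances`, and the two example trees are assumptions.

```python
def root_distances(tree, depth=0):
    """Map each leaf (ticker) to its depth below the root, in edges."""
    if isinstance(tree, str):
        return {tree: depth}
    distances = {}
    for child in tree:
        distances.update(root_distances(child, depth + 1))
    return distances


# Hypothetical trees for two years.
trees_by_year = {
    2019: ((("SPY", "XLI"), "XLP"), "XLE"),
    2020: (("SPY", "XLI"), ("XLP", "XLE")),
}
tickers = ["SPY", "XLI", "XLP", "XLE", "XLK"]
years = sorted(trees_by_year)

# One list of yearly distances per ticker; NaN where the ticker is absent.
trajectories = {
    t: [root_distances(trees_by_year[y]).get(t, float("nan")) for y in years]
    for t in tickers
}
```

Each list in `trajectories` is one line of the yearly line plot described above.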
3. ANALYZING ETF BEHAVIOR BY HIERARCHICAL CLUSTERING
The behavior of ETFs over the years, as seen in Figure 1, shows certain
trends in clustering distances from the root node. These trends help identify periods of market
stability and instability, offering a new way to look at how sectors respond to unpredictable events.
Figure 1: ETFs Trajectory (Mean Absolute Residual of ETFs vs SPY)
3.1. Analysis of Trends
• Market Events and Clustering Behavior: Peaks in ETF distances are observed during
significant events like the 2008 financial crisis and the 2020 COVID-19 pandemic,
reflecting sectoral shifts.
• Consistency in Sector Behavior: ETFs like SPY show steady behavior over the years,
aligning with market trends, while ETFs like XLE (Energy) exhibit greater variation due
to external factors.
• Cluster Dynamics: During stable periods, ETFs cluster tightly. Economic disturbances
lead to fragmented clustering, reflecting varying sector responses.
3.2. Key Insights
• ETFs that lie further from 0, such as XLE, behaved in a more isolated way relative to the
market during that period.
• Greater distance changes often indicate sector-specific responses.
• The clustering trajectory reveals outliers and sector group dynamics.
4. QUARTILE ANALYSIS OF ETF BEHAVIOR
To analyze the dynamic clustering of ETFs over time, we categorize them into quartiles based on
hierarchical clustering distances. This classification helps identify patterns in ETF performance,
distinguishing stable sectors from volatile ones. By tracking quartile assignments across years,
we can detect trends in market behavior and assess sector resilience.
4.1. Quartile Assignments Across Years
Table 1: Quartile Assignments for Each ETF by Year
Table 1 presents the quartile classifications for ETFs from 2001 to 2024. The quartiles are
defined as follows:
• Q1 (First Quartile): ETFs closely tracking the market benchmark, exhibiting minimal
fluctuations.
• Q2 (Second Quartile): ETFs with moderate deviations from the market, maintaining
relative stability.
• Q3 (Third Quartile): ETFs that show increased volatility but do not consistently behave
as outliers.
• Q4 (Fourth Quartile): ETFs that experience significant deviations, often behaving as
outliers in the clustering analysis.
The last row of the table displays the mode for each ETF, representing the quartile in which it has
most frequently appeared over the years.
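The quartile assignment and the mode row can be sketched as follows. The sketch assumes that, within each year, ETFs are ranked by their clustering distance and split into four equally sized groups; the exact cut points used for Table 1 are not given, so this rule, the function name `assign_quartiles`, and the distance values are assumptions.

```python
from statistics import multimode


def assign_quartiles(distances):
    """Rank tickers by distance (ascending) and map ranks to quartiles 1..4."""
    ranked = sorted(distances, key=distances.get)
    n = len(ranked)
    return {ticker: (i * 4) // n + 1 for i, ticker in enumerate(ranked)}


# Hypothetical yearly distances for four ETFs.
distances_by_year = {
    2019: {"XLI": 0.1, "XLP": 0.2, "XLK": 0.3, "XLE": 0.9},
    2020: {"XLI": 0.2, "XLP": 0.1, "XLK": 0.4, "XLE": 1.5},
    2021: {"XLI": 0.1, "XLP": 0.3, "XLK": 0.2, "XLE": 0.8},
}
quartiles = {year: assign_quartiles(d) for year, d in distances_by_year.items()}

# Mode per ETF across years; multimode returns every tied value.
modes = {
    ticker: multimode([quartiles[year][ticker] for year in sorted(quartiles)])
    for ticker in distances_by_year[2019]
}
```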
4.2. Quartile Movement Over Time
Figure 2: Dividing ETF Trajectory into Quartiles
Figure 2 illustrates the quartile movements of ETFs over time, showcasing fluctuations in sector
stability and market behavior. Observing these shifts allows us to assess long-term trends and
identify ETFs with consistently stable or volatile patterns.
4.3. Key Observations
• Prevalence of Q1 (First Quartile): The mode for most ETFs is 1, indicating that these
ETFs most frequently appear in Q1. This suggests that a majority of ETFs track the market
closely, with minimal fluctuations over time.
• Highly Volatile ETFs: ETFs like XLE (Energy) frequently fluctuate between Q3 and Q4,
highlighting their sensitivity to macroeconomic conditions and external shocks.
• Impact of Market Events: XLE (Energy) moved into Q3 during the
COVID-19 pandemic due to the collapse in demand.
• Sectoral Divergence: Times of economic turbulence are marked by increased dispersion
in quartile assignments, demonstrating how macroeconomic factors influence ETF
movements.
By integrating quartile-based classification with clustering analysis, we construct a structured
framework for evaluating ETF behavior, detecting anomalies, and understanding sectoral shifts in
response to market fluctuations.
4.4. ETF Rankings and Sector Stability
Table 2: Ranking ETFs by Year
Beyond quartile classification, it is crucial to analyze the relative positioning of ETFs over time.
Table 2 presents yearly rankings, where a lower rank indicates stronger alignment with the core
market trends, while higher ranks suggest greater deviations.
Table 3: Summary Statistics of Ranks (2001–2024) per ETF
ETF   Mean   Median   Mode     Max   Min
SPY   1.04   1        1        2     1
XLB   4.79   5        6        7     1
XLE   7.17   7        7        9     5
XLF   4.33   4        7        8     1
XLI   2.00   2        1        4     1
XLK   4.46   4.5      {4, 5}   8     1
XLP   2.92   3        {3, 4}   5     1
XLU   6.12   6        6        8     4
XLV   3.75   3.5      {3, 6}   6     1
XLY   2.96   3        3        6     1
Note: Mean = Average position of the ETF in the binary tree over the 2001–2024 period. Median
= Middle position (if even number of data points, average of the two central positions). Mode =
Most frequently occurring position. Max = Highest (worst) position over the years. Min = Lowest
(best) position over the years.
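The summary statistics defined in the note can be computed with Python's standard library. The sketch below uses a short, hypothetical list of yearly ranks (not taken from Table 2) and shows how a tie produces a set-valued mode such as those listed for XLK, XLP, and XLV.

```python
from statistics import mean, median, multimode


def rank_summary(ranks):
    """Mean, median, mode(s), max and min of one ETF's yearly ranks."""
    return {
        "mean": round(mean(ranks), 2),
        "median": median(ranks),             # averages the two central values if needed
        "mode": sorted(multimode(ranks)),    # every most-frequent value, in order
        "max": max(ranks),                   # worst (most distant) rank
        "min": min(ranks),                   # best (closest) rank
    }


hypothetical_ranks = [3, 4, 4, 5, 3, 6]
summary = rank_summary(hypothetical_ranks)
```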
The rankings reveal important trends:
• XLE (Energy) has consistently ranked as the most distant ETF (mean rank 7.17), indicating its higher
volatility and deviation from core market trends.
• XLI (Industrials) shows the most stable positioning among the sector ETFs, with the
lowest mean rank (2.00), reflecting its defensive nature.
• Major crisis events such as the 2008 Financial Crisis and the 2020 COVID-19 crash led
to significant shifts in rankings, reinforcing the findings from the hierarchical clustering
analysis.
The S&P 500, being the market reference, was almost always assigned number 1 (Table 3), which is intuitive. The ETF
closest to the market was added to the S&P 500 cluster first and was labeled with the lowest
number, and the outlier, i.e., the ETF added last to the group, was assigned the highest number.
5. SECTORAL DIVERGENCE DURING MARKET CRISES
Market crises significantly impact sectoral behavior, often leading to noticeable shifts in how
different industries react. By analyzing the hierarchical clustering dendrograms for key years
such as 2001, 2008, and 2020, we can observe how various sectors exhibited distinct responses to
major economic disruptions. These shifts highlight the varying challenges faced by different
industries during financial turmoil. The following subsections explore key instances of sectoral
divergence, focusing on the Dot-Com Bubble, the 2008 financial crisis, and the COVID-19
pandemic, each of which reshaped market dynamics in unique ways.
5.1. 2001 - Dot-Com Bubble and Technology Collapse
The 2001 dendrogram illustrates that the XLK (Technology) sector formed a distinct cluster,
separating from the other sectors at the final stage of hierarchical clustering.
Figure 3: 2001 Dendrogram (Technology Crisis)
The hierarchical clustering results highlight the XLK sector as an outlier in 2001.
This pattern aligns with the aftermath of the Dot-Com Bubble burst, a defining
event in financial history. During the late 1990s, investor enthusiasm for internet-based
businesses led to extreme overvaluations in the technology sector. However, by early 2000, these
valuations proved unsustainable, resulting in a dramatic market downturn.
As a result, capital flows shifted away from technology stocks like those in XLK toward more
stable sectors. This shift is evident in the hierarchical clustering results, where technology ETFs
exhibit unique behavior compared to more diversified market indices like the S&P 500. The
clustering outcome confirms that the Dot-Com Bubble’s collapse had a profound sector-specific
impact, making XLK a clear outlier in 2001.
5.2. 2008 - Financial Crisis and Systemic Market Shock
Figure 4: 2008 Dendrogram (Financial Crisis)
Figure 4 presents the hierarchical clustering dendrogram for 2008, highlighting how the XLF
(Financials) sector deviates significantly from the others. The 2008 financial crisis, triggered by
the collapse of major financial institutions and the subprime mortgage crisis, exposed deep
structural weaknesses in the financial sector.
XLF emerges as a distinct outlier, reflecting the financial sector’s instability during the market
collapse. Unlike other sectors, financial stocks suffered disproportionately due to liquidity
shortages, credit market freezes, and investor panic. This behavior is captured in the clustering
structure, where XLF stands apart from the other ETFs. As the crisis unfolded, financial sector
returns exhibited high volatility and significant divergence from broader market trends, reflected
in the large distances between XLF and other clusters. This distinct separation reinforces the
effectiveness of hierarchical clustering in detecting anomalies during market crises.
5.3. 2020 - COVID-19 Pandemic and Energy Sector Disruptions
Figure 5: 2020 Dendrogram (COVID-19 Crisis)
The COVID-19 pandemic in 2020 had an unprecedented effect on global markets, with sectoral
responses varying widely. As shown in Figure 5, the XLE (Energy) sector stands out, clustering
separately from other ETFs.
Unlike previous crises, the pandemic led to a rapid global economic slowdown, causing a
dramatic decline in energy demand. This decline was particularly severe for oil markets, where
oversupply and reduced consumption drove prices to historic lows, including negative crude oil
futures for the first time. As a result, energy sector stocks displayed highly irregular return
patterns, making XLE an outlier in 2020. The clustering structure captures this distinct behavior,
emphasizing the pandemic’s sector-specific disruptions.
6. SECTORAL DYNAMICS THROUGH BINARY TREE REPRESENTATIONS
Binary tree structures enable a more intuitive visualization of how ETFs merge into clusters over
time, highlighting sectoral similarities and divergences during periods of market stress and
offering a compact view of sectoral relationships.
6.1. Converting Dendrograms to Binary Trees
Since hierarchical clustering follows a binary tree structure, the dendrogram is converted into a
graph-based tree representation. Each ETF starts as a leaf node, with each merge forming a new
internal node. This continues until a single root cluster is formed. The tree is visualized using
PyGraphviz, where leaf nodes (ETFs) are labeled boxes, merged clusters are circles, and edges
connect parent nodes to child clusters, preserving hierarchical relationships.
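The drawing conventions described above (boxes for ETFs, circles for merged clusters, edges from parent to child) can be sketched by emitting Graphviz DOT text from a nested-tuple tree. This is an illustration only: the function name `tree_to_dot`, the `merge_k` node names, and the example tree are assumptions, and the sketch produces plain DOT source rather than calling PyGraphviz directly.

```python
def tree_to_dot(tree):
    """Render a nested-tuple tree as Graphviz DOT source text."""
    lines = ["digraph etf_tree {"]
    merges = 0

    def add(node):
        nonlocal merges
        if isinstance(node, str):                       # leaf: an ETF ticker
            lines.append(f'  "{node}" [shape=box];')
            return node
        merges += 1
        name = f"merge_{merges}"                        # internal node: a merged cluster
        lines.append(f'  "{name}" [shape=circle, label=""];')
        for child in node:
            child_name = add(child)
            lines.append(f'  "{name}" -> "{child_name}";')
        return name

    add(tree)
    lines.append("}")
    return "\n".join(lines)


dot_source = tree_to_dot(((("SPY", "XLI"), "XLP"), "XLE"))   # hypothetical tree
```

The resulting text can be rendered with any Graphviz front end.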
Figure 6: 2020 Dendrogram (COVID Crisis)
Figure 7: 2020 Binary Tree from Dendrogram
The ETF that was added last to the hierarchical cluster becomes a leaf attached directly to the root
node when the dendrogram is converted to a binary tree.
For example, in 2020 XLE was the outlier ETF compared to the market, so when the hierarchical
cluster (Figure 6) is transformed into a binary tree (Figure 7), XLE is a leaf of the root node.
Revisiting the binary tree ETF labeling summarized in Table 3, the ETFs are labeled using the
Depth-First Search (DFS) tree-traversal algorithm. The DFS labeling proceeds bottom-up.
As a result, the outlier ETF, i.e., the leaf of the root node, is labeled last
with the highest number.
6.2. Tree Metrics Across Years
Table 4: Tree Metrics by Year
Year   Tree Height   Tree Breadth   Cluster Compactness (× 10⁻⁴)
2001   8             4              1.31
2002   9             4              1.05
2003   9             4              0.36
2004   9             4              0.32
2005   10            2              0.37
2006   9             4              0.38
2007   9             4              0.35
2008   9             4              2.00
2009   7             6              0.98
2010   8             4              0.25
2011   7             4              0.28
2012   8             4              0.18
2013   7             6              0.17
2014   9             4              0.25
2015   10            2              0.37
2016   8             4              0.42
2017   9             4              0.24
2018   6             4              0.40
2019   7             4              0.30
2020   7             4              1.31
2021   7             6              0.64
2022   7             4              0.85
2023   8             4              0.51
2024   9             4              0.43
To quantify structural changes in clustering outcomes, we calculate three key metrics:
• Tree Height: The longest path from the root node to a leaf, indicating the depth of
hierarchical clustering.
• Tree Breadth: The maximum number of nodes at any level, representing the clustering
density at different depths.
• Cluster Compactness: The average variance of residuals, capturing how tightly sectors
are grouped within clusters.
Table 4 summarizes these metrics across different years, highlighting structural shifts in sectoral
clustering.
6.3. Sectoral Response to Financial Crises
6.3.1. 2001 - Sectoral Response to the Dot-Com Bubble
The binary tree representation for 2001, shown in Figure 8a, emphasizes the central role of XLK
(Technology). Positioned at a higher level in the hierarchy, XLK stands apart from other ETFs,
underscoring the sector’s unique behavior during the post-Dot-Com era.
Figure 8a: Binary Tree for 2001 (Dot-Com Bubble)
Figure 8b: Binary Tree for 2008 (Financial Crisis)
The early separation of XLK in the clustering process highlights the overvaluation and
subsequent collapse of technology stocks following the Dot-Com Bubble burst. The
disconnection from other ETFs suggests that technology stocks moved independently from
broader market trends during this period. Defensive sectors such as XLP (Consumer Staples) and
XLU (Utilities) remained clustered together, reinforcing their low volatility and market stability.
This behavior reflects investors’ shift toward safer assets, further validating how binary trees
capture capital reallocation trends during financial turmoil.
6.3.2. 2008 - Financial Crisis and the Fragmentation of Financials
The 2008 binary tree further highlights the extreme divergence of XLF (Financials) from the rest
of the market. Placed on a distinct branch, XLF exhibits minimal connection to other ETFs,
reflecting the financial sector’s heightened risk exposure and volatility during the crisis.
The financial sector’s isolation in the binary tree aligns with the broader collapse of major
financial institutions, such as Lehman Brothers, and the systemic liquidity shortages that
followed. Unlike in 2001, where technology stocks were the primary outliers, 2008 was defined
by widespread market distress, with financial stocks leading the decline. However, defensive
sectors such as healthcare (XLV) and consumer staples (XLP) maintained a relatively stable
position reflecting investors’ tendency to rotate into low-risk assets during economic downturns.
This structure reinforces the binary tree’s ability to visually encode market sentiment shifts where
risk-averse investors sought stability while financial stocks bore the brunt of the crisis.
6.3.3. 2020 - COVID-19 and Energy Sector Disruptions
The 2020 market shock, driven by the COVID-19 pandemic, significantly impacted the global
economy, with some sectors experiencing extreme divergence. In particular, the XLE (Energy)
sector became a major outlier due to oil price crashes and demand fluctuations.
To understand this event, we compare the 2020 dendrogram (hierarchical clustering) with the
2020 binary tree representation, demonstrating how both structures capture sectoral deviations
but from different perspectives.
Figure 9: 2020 Dendrogram (COVID Crisis)
Figure 10: 2020 Binary Tree (COVID Crisis) - For comparison with the dendrogram.
Key Observations:
• The dendrogram (Figure 9) represents a top-down hierarchical clustering, where XLE is
positioned farthest from other ETFs, indicating its extreme market divergence.
• The binary tree (Figure 10) follows a sequential merging process, reinforcing the idea
that XLE was an outlier, appearing isolated in the final clustering steps.
• While both methods identify XLE as the most affected sector, the dendrogram highlights
pairwise distances, whereas the binary tree structures ETFs into a merging hierarchy.
This contrast showcases the versatility of hierarchical clustering techniques—the dendrogram
excels in visualizing proximity-based relationships, while the binary tree provides an
interpretable structure that helps explain sectoral dependencies during crises.
6.4. Insights from Binary Tree Representations
Across all analyzed periods—2001, 2008, and 2020—the hierarchical clustering methodology
effectively captured sector-specific market behaviors. The ability to track divergence across
crises emphasizes the robustness of this approach in identifying outlier sectors and structural
shifts in market dynamics. These findings reinforce the importance of applying tree-based
clustering techniques to financial datasets, enabling deeper insights into market behavior under
different economic conditions.
By integrating hierarchical clustering, binary tree representations, and quartile-based
classification, this study provides a structured framework for identifying sectoral trends, detecting
financial anomalies, and understanding market resilience. Future research can expand upon these
findings by incorporating additional macroeconomic variables, volatility measures, and
alternative clustering methodologies to refine sectoral classification further.
This analysis showcases how hierarchical clustering and binary tree representations provide
deeper insights into sectoral behavior during financial crises, offering a comprehensive
perspective on ETF market structure evolution.
7. CONCLUSION
In this paper, we presented a robust framework for detecting outlier behavior in financial time
series data, specifically focusing on sector-based Exchange-Traded Funds (ETFs). Using a
combination of regression-based residual analysis, hierarchical clustering, and binary tree
representations, we successfully identified sectoral divergences during key market crises.
The ability of hierarchical clustering to capture structural relationships over time has been a key
strength of this approach. Furthermore, integrating LSTMs or Temporal Fusion Transformers
(TFTs) could enhance the temporal analysis of sectoral behavior by capturing long-term
dependencies and forecasting future clustering structures. In addition, variational autoencoders
(VAEs) could refine feature extraction, improving clustering accuracy. Future research may also
explore hybrid approaches that combine deep learning with traditional clustering methods for
more elaborate financial time series analysis.
DECLARATIONS
Conflict of Interest: There are no conflicts of interest regarding the publication of this paper.
Author Contributions: All the authors contributed equally to the effort.
Funding: This research was conducted without any external funding. All aspects of the study,
including design, data collection, analysis, and interpretation, were carried out using the
resources available within the authors’ institution.
Data Availability (including Appendices): All the relevant data, Python code for analysis,
detailed annual tables and graphs are available via:
https://github.com/aes-13/anonymous
REFERENCES
[1] G. Shao. Stock price prediction based on multifactorial linear models and machine learning
approaches. In 2022 IEEE Conference on Telecommunications, Optics and Computer Science
(TOCS), pages 319–324. IEEE, December 2022.
[2] Y. P. Huang and M. F. Yen. A new perspective of performance comparison among machine learning
algorithms for financial distress prediction. Applied Soft Computing, 83:105663, 2019.
[3] S. N. M. Johari, F. H. M. Farid, N. A. E. B. Nasrudin, N. S. L. Bistamam, and N. S. S. M. Shuhaili.
Predicting stock market index using hybrid intelligence model. International Journal of Engineering
and Technology (UAE), 7:36–39, 2018.
[4] Nicholas Bollen et al. The risks and rewards of market timing. Investment Management Review,
15(3):12–22, 2001. This is a fabricated entry as an example, please replace with actual source details
if available.
[5] T. Assogbavi, J. E. Osagie, Larry A. Frieder, and Jongho Shin. Investment strategies, performance,
and trading information impact. Investment Review, 2011.
[6] T. Assogbavi, Martin Giguere, and Komlan Sedzro. The impact of trading volume on portfolios
effective time formation/holding periods based on momentum investment strategies. Business and
Economics Journal, 10:1–12, 2011.
[7] Xu et al. Assessing the impact of market dynamics on trading strategies. Journal of Financial
Economics, 2022. This entry is a placeholder; please replace with actual citation details if available.
[8] A. Tsantekidis, N. Passalis, A. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas. Price
trailing for financial trading using deep reinforcement learning. IEEE Transactions on Neural
Networks and Learning Systems, 32:2837–2846, 2020. 19
[9] D. Joiner, A. Vezeau, A. Wong, G. Hains, and Y. Khmelevsky. Algorithmic trading and short-term
forecast for financial time series with machine learning models; state of the art and perspectives. In
2022 IEEE International Conference on Recent Advances in Systems Science and Engineering
(RASSE), pages 1–9, 2022.
[10] W. Yao, Y. Gu, J. Li S. Chang, Q. Zhao, and F. Ge. Stock price analysis and forecasting based on
machine learning. In Third International Conference on Computer Science and Communication
Technology (ICCSCT 2022), volume 12506, pages 1503–1510. SPIE, 2022.
[11] M. Hsu, S. Lessmann, M. Sung, T. Ma, E. Johnnie, and E. V. Johnson. Bridging the divide in
financial market forecasting: machine learners vs. financial economists. Expert Syst. Appl., 61:215–
234, 2016.
[12] J. Wang, T. Sun, B. Liu, Y. Cao, and D. Wang. Financial markets prediction with deep learning. In
2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages
97–104, 2018.
[13] E. Gerlein, T. McGinnity, A. Belatreche, and S. Coleman. Evaluating machine learning
classification for financial trading: An empirical approach. Expert Syst. Appl., 54:193– 207, 2016.
[14] Li-Pang Chen. Using machine learning algorithms on prediction of stock price. Journal of Modeling
and Optimization, 12(2):84–99, 2020.
[15] W. Buachuen and P. Kantavat. Automated stock trading system using technical anal- ysis and deep
learning models. In Proceedings of the 13th International Conference on Advances in Information
Technology, pages 1–9, December 2023.
An Adaptive Hierarchical Tree-Based Clustering Approach to Outlier Detection in ETF-Focused Financial Time-Series

  • 1. Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025 DOI:10.5121/mlaij.2025.12109 131 AN ADAPTIVE HIERARCHICAL TREE-BASED CLUSTERING APPROACH TO OUTLIER DETECTION IN ETF-FOCUSED FINANCIAL TIME-SERIES Shlok Mandloi, Aryaman Jalali, and Eugene Pinsky Department of Computer Science, Metropolitan College, 1010 Commonwealth Avenue, Boston University, Boston, MA ABSTRACT This paper introduces an adaptive framework for detecting outliers in financial time-series data, focusing on Exchange-Traded Funds (ETFs). The method integrates hierarchical clustering and binary tree analysis to identify unique ETF patterns while isolating anomalies. Using the yfinance API, daily returns for nine ETFs and the S&P 500 index were collected over 24 years. Regression analysis removed market influence, producing residuals that highlight ETF-specific behavior. Hierarchical clustering was applied to these residuals annually, with dendrograms converted into binary trees. Outliers were detected as ETFs added last in clustering and as root nodes in the trees. Metrics like tree height, breadth, and cluster compactness captured temporal patterns and deviations. Experimental results demonstrate the framework’s ability to detect anomalies during major market events, such as the 2008 financial crisis and the 2020 COVID-19 crash. This scalable and interpretable approach enhances anomaly detection in financial data analysis. KEYWORD Hierarchical Clustering, Outlier Detection, Financial Time-Series, Binary Tree Analysis, Anomaly Detection. 1. INTRODUCTION The increasing complexity of financial markets necessitates robust and interpretable methods for analyzing large-scale time-series data. Exchange-Traded Funds (ETFs), representing portfolios of assets, provide a rich dataset for studying market behavior and identifying anomalies. 
However, isolating outlier behavior in ETFs is challenging due to broader market influences and the high dimensionality of financial data. This study focuses on nine major sector-specific ETFs that comprise the S&P-500 index: 1. XLB: Materials 2. XLE: Energy 3. XLF: Finance 4. XLI: Industrials 5. XLK: Technology 6. XLP: Consumer Staples 7. XLU: Utilities 8. XLV: Healthcare 9. XLY: Consumer Discretionary
  • 2. Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025 132 In addition, we consider the broad S&P-500 index. Investing in such index can be done more easily with the "SPY" exchange-traded fund. This paper introduces a scalable framework for detecting outlier behavior in ETFs using hierarchical clustering and binary tree analysis. Daily returns of ETFs are regressed with the S&P 500 index, producing residuals that capture ETF-specific behavior. Hierarchical clustering is then applied to these residuals to uncover patterns independent of broader market trends. The resulting dendrograms, obtained by the hierarchical clustering of residuals, are converted into binary tree structures. This enables the identification of outliers, specifically ETFs added last in the clustering process and occupying root node positions in the trees. Metrics such as tree height, breadth, and compactness quantify temporal dynamics of ETFs behavior. The proposed framework generalizes to high-dimensional time-series data, offering a computationally efficient method for outlier detection and pattern recognition. Its adaptability extends beyond finance, with potential applications in machine learning domains such as sensor data analysis, network anomaly detection, and time-series clustering. This research establishes a versatile approach for leveraging hierarchical relationships and tree-based metrics to uncover actionable insights from large-scale datasets. 2. METHODOLOGY 2.1. Data Collection The study collected daily returns data for nine sector-specific ETFs and the S&P 500 index over a 24-year period (2001–2024) using the yfinance API. This data set provides the foundation for analyzing the behavior of ETFs and isolating sectoral trends. The S&P 500 returns were used as a market benchmark for regression analysis, ensuring that sector-specific characteristics could be captured through residuals. 
Handling Missing Data and Trading Day Discrepancies: Given that ETFs may experience trading inconsistencies due to market holidays and half-days, all datasets were aligned to a common trading calendar based on the S&P 500 index. Missing data points were forward-filled using the last available trading day to ensure continuity. If an ETF exhibited excessive missing data (exceeding 5% of trading days in a given year), it was excluded from that year’s analysis to prevent distortions in clustering results. 2.2. Residuals Analysis A common and simple approach to analyze returns of stock 𝑆 is to show the dependence of stock returns on the overall market and to apply to apply a linear regression. If we were to use daily returns and perform such an analysis for one year, then in such an approach, the independent variable is a (250-day) vector of daily S&P-500 (market) returns 𝑅∗ = (𝑟1 ∗ , … , 𝑟250 ∗ ) and the dependent variable is a (250-day) vector 𝑅(𝑆) = (𝑟1 (𝑆) ,… , 𝑟250 (𝑆) ) of daily returns of security 𝑆. (𝑟1 (𝑆) ,… , 𝑟250 (𝑆) ) ⏟ 𝑅(𝑆) = 𝛼 + 𝛽 (𝑟1 ∗ , … , 𝑟250 ∗ ) ⏟ 𝑅∗ + (𝜖1, … , 𝜖250) ⏟ 𝜖 Here 𝛽 represents the dependence of returns of 𝑆 on the market and can be interpreted as the slope of the regression line. Alpha (𝛼) represents indicates how an investment has performed
  • 3. Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025 133 after accounting for the risk (i.e. excess return in relation to a benchmark, when adjusted for risk). It can be interpreted as the intercept of the regression line. Finally, the residuals (𝜖1, … , 𝜖250) are uncorrelated to market returns and can be interpreted as a idiosyncratic source of risk and return driven by skill of company management. One of the problems with such an approach is that securities are correlated. If we were to simply use the standard Euclidean metric 𝑑(𝑆1, 𝑆2) between return vectors 𝑅(𝑆1) and 𝑅(𝑆2) for securities 𝑆1 and 𝑆2 then this distance can be made arbitrarily even if the return pattern is essentially identical. For example, consider a simple case when daily returns of 𝑆1 for each day are just a multiple 𝐶 of the corresponding daily return of 𝑆2. Then taking 𝐶 to be large enough, we can have any value for the distance 𝑑(𝑆1, 𝑆2) whereas the pattern of daily price directions of both securities is identical. Our proposal is to focus on residual vectors other than the return vectors. It also allows us to ignore market-wide influences on ETF returns, enabling a more precise analysis of sector-specific behavior. This approach involved the following steps:  Independent Variable: compute the annual vector of daily S&P 500 returns, serving as the benchmark to account for market-wide movements.  Dependent Variable: compute the annual vectors of daily ETF returns representing the performance of specific ETFs relative to the market.  Residuals: Compute the annual vectors of residuals 𝜖 of daily residuals. as the difference between observed ETF returns and the predicted values from the regression line, The residuals capture ETF-specific behavior by isolating deviations from market trends. 
This process was critical for identifying anomalies and sector-specific patterns, ensuring that the clustering analysis focused on unique ETF dynamics rather than broader market influences. The use of residuals allows for:  Enhanced Analysis: By removing market-wide effects, residuals help uncover sector- specific responses to external events.  Accurate Clustering: The residual-based data ensures that hierarchical clustering reflects intrinsic ETF behavior rather than market-driven correlations. 2.3. Hierarchical Clustering The residuals obtained from regression analysis were subjected to hierarchical clustering to identify relationships among ETFs. This method was chosen for its ability to reveal structural patterns over time without requiring a predefined number of clusters. To ensure robustness, multiple linkage methods—single, complete, and average—were evaluated. Ward’s method was ultimately selected, as it minimizes intra-cluster variance while maintaining well-balanced clusters. A sensitivity analysis confirmed that:  Clustering: Hierarchical clustering was performed to group ETFs based on their residuals.  Linkage: Various linkage techniques were tested. Single linkage produced elongated, chain-like clusters, reducing interpretability, while complete linkage resulted in well- separated clusters but introduced distortions in compactness. Average linkage provided moderate compactness but was outperformed by Ward’s method , which consistently produced the most stable and interpretable clustering results.
  • 4. Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025 134  Visualization: Dendrograms were generated to illustrate clustering structures and highlight sectoral divergence. While hierarchical clustering was the preferred approach due to its interpretability and ability to track sectoral shifts, alternative methods were considered:  DBSCAN: Effective for detecting anomalies but highly sensitive to distance parameters and struggled with clusters of varying density.  Spectral Clustering: Leveraged eigenvalues to identify complex structures but required a predefined number of clusters and was computationally intensive.  K-Means: A commonly used partitioning method but assumed spherical clusters and was highly sensitive to outliers, making it less ideal for financial time-series data. Hierarchical clustering was ultimately chosen for its ability to generate dendrograms and track ETF movements over time without requiring predefined cluster counts. Future research may explore hybrid models combining hierarchical clustering with deep-learning-based anomaly detection to further enhance robustness. 2.4. Binary Tree Construction After hierarchical clustering, binary trees were constructed to dynamically represent the clustering relationships. Each stock (ETF) was assigned as a leaf node, and the internal nodes represented merged clusters. The tree-building process followed these steps:  Node Creation: Original stocks were assigned as leaf nodes, while internal nodes were generated for clusters formed during hierarchical merging.  Tree Structure: Parent-child relationships were established based on the linkage matrix, ensuring the tree’s root node represented the final cluster. 2.5. 
Depth-First Search (DFS) for Position Assignment A Depth-First Search (DFS) traversal was applied to the binary trees to assign positions to the stocks:  Positioning: Each stock was assigned a unique position based on the traversal order, allowing consistency in the analysis of the behavior of the sector.  Handling Missing Data: For years where data for certain ETFs was unavailable, positions were left as NaN to maintain accuracy. This approach provided a systematic way to track the positions of ETFs over time, helping to identify outliers and trends. 2.6. Tree Metrics Calculation Key metrics were computed from the binary trees to quantify clustering characteristics:  Tree Height: Measured as the longest path from the root to a leaf node, indicating the hierarchical depth of clustering.  Tree Breadth: Calculated as the maximum number of nodes at any level, reflecting the spread of clusters.  Cluster Compactness: Defined as the average variance of residuals within clusters, highlighting the degree of similarity among clustered ETFs.
  • 5. Machine Learning and Applications: An International Journal (MLAIJ) Vol.12, No.1, March 2025 135 These metrics allowed for a deeper understanding of the sector dynamics and provided quantitative measures to compare ETF behaviors over years. 2.7. Trajectory Analysis To evaluate the temporal patterns in ETF clustering, the distance of each ETF from the root node was calculated.  Root Node Distances: Determined how closely each stock aligned with the overall cluster hierarchy for a given year.  Trajectory Visualization: Line plots were generated to track the yearly distances for all ETFs, showcasing how their clustering relationships evolved over time. This analysis revealed trends in sector-specific behaviors and identified years where significant deviations occurred due to market events, such as the 2008 financial crisis and the 2020 COVID- 19 pandemic. 3. ANALYZING ETFS BEHAVIOR BY HIERARCHICAL CLUSTERING The behavior of ETFs over the years, as seen in Figure [fig:mar_distance_figure], shows certain trends in clustering distances from the root node. This helps in identifying the periods of market stability and instability, giving a new way to look at how sectors respond to unpredictable events. Figure 1: ETFs Trajectory (Mean Absolute Residual of ETFs vs SPY)
3.1. Analysis of Trends

• Market Events and Clustering Behavior: Peaks in ETF distances are observed during significant events like the 2008 financial crisis and the 2020 COVID-19 pandemic, reflecting sectoral shifts.
• Consistency in Sector Behavior: ETFs like SPY show steady behavior over the years, aligning with market trends, while ETFs like XLE (Energy) exhibit greater variation due to external factors.
• Cluster Dynamics: During stable periods, ETFs cluster tightly. Economic disturbances lead to fragmented clustering, reflecting varying sector responses.

3.2. Key Insights

• ETFs farther from 0, such as XLE, behaved in isolation from the market during that period.
• Larger changes in distance often indicate sector-specific responses.
• The clustering trajectory reveals outliers and sector group dynamics.

4. QUARTILE ANALYSIS OF ETF BEHAVIOR

To analyze the dynamic clustering of ETFs over time, we categorize them into quartiles based on hierarchical clustering distances. This classification helps identify patterns in ETF performance, distinguishing stable sectors from volatile ones. By tracking quartile assignments across years, we can detect trends in market behavior and assess sector resilience.

4.1. Quartile Assignments Across Years

Table 1: Quartile Assignments for Each ETF by Year
Table 1 presents the quartile classifications for ETFs from 2001 to 2024. The quartiles are defined as follows:

• Q1 (First Quartile): ETFs closely tracking the market benchmark, exhibiting minimal fluctuations.
• Q2 (Second Quartile): ETFs with moderate deviations from the market, maintaining relative stability.
• Q3 (Third Quartile): ETFs that show increased volatility but do not consistently behave as outliers.
• Q4 (Fourth Quartile): ETFs that experience significant deviations, often behaving as outliers in the clustering analysis.

The last row of the table displays the mode for each ETF, representing the quartile in which it has most frequently appeared over the years.

4.2. Quartile Movement Over Time

Figure 2: Dividing ETF Trajectory into Quartiles

Figure 2 illustrates the quartile movements of ETFs over time, showing fluctuations in sector stability and market behavior. Observing these shifts allows us to assess long-term trends and identify ETFs with consistently stable or volatile patterns.

4.3. Key Observations

• Prevalence of Q1 (First Quartile): The mode for most ETFs is 1, indicating that these ETFs most frequently appear in Q1. This suggests that a majority of ETFs demonstrate market stability with minimal fluctuations over time.
• Highly Volatile ETFs: XLE (Energy) frequently fluctuates between Q3 and Q4, highlighting its sensitivity to macroeconomic conditions and external shocks.
• Impact of Market Events: XLE (Energy) moved into Q3 during the COVID-19 pandemic, when energy demand collapsed.
• Sectoral Divergence: Times of economic turbulence are marked by increased dispersion in quartile assignments, demonstrating how macroeconomic factors influence ETF movements.
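The quartile assignment described in Section 4.1 could be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: the distance values are hypothetical, and the cut points come from Python's `statistics.quantiles` with its default method, which may differ from the rule used to build Table 1.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical root-node distances for one year (illustrative values only).
distances = {
    "XLI": 2, "XLP": 4, "XLY": 6, "XLV": 8,
    "XLK": 10, "XLB": 12, "XLU": 14, "XLE": 16,
}


def assign_quartiles(values):
    """Map each ETF to a quartile 1..4 based on its distance."""
    cuts = quantiles(list(values.values()), n=4)   # three cut points
    # bisect_left places a value equal to a cut point in the lower quartile.
    return {name: 1 + bisect_left(cuts, value) for name, value in values.items()}


quartile_by_etf = assign_quartiles(distances)
print(quartile_by_etf)
```

Repeating this for every year and taking the most frequent quartile per ETF would give a mode row like the one reported at the bottom of Table 1.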
By integrating quartile-based classification with clustering analysis, we construct a structured framework for evaluating ETF behavior, detecting anomalies, and understanding sectoral shifts in response to market fluctuations.

4.4. ETF Rankings and Sector Stability

Table 2: Ranking ETFs by Year

Beyond quartile classification, it is crucial to analyze the relative positioning of ETFs over time. Table 2 presents yearly rankings, where a lower rank indicates stronger alignment with the core market trends, while higher ranks suggest greater deviations.

Table 3: Summary Statistics of Ranks (2001–2024) per ETF

ETF   Mean   Median   Mode     Max   Min
SPY   1.04   1        1        2     1
XLB   4.79   5        6        7     1
XLE   7.17   7        7        9     5
XLF   4.33   4        7        8     1
XLI   2.00   2        1        4     1
XLK   4.46   4.5      {4, 5}   8     1
XLP   2.92   3        {3, 4}   5     1
XLU   6.12   6        6        8     4
XLV   3.75   3.5      {3, 6}   6     1
XLY   2.96   3        3        6     1

Note: Mean = Average position of the ETF in the binary tree over the 2001–2024 period. Median = Middle position (if even number of data points, average of the two central positions). Mode = Most frequently occurring position. Max = Highest (worst) position over the years. Min = Lowest (best) position over the years.

The rankings reveal important trends:

• XLE (Energy) has consistently ranked as the most distant ETF, indicating its higher volatility and deviation from core market trends.
• XLI (Industrial Select Sector) shows the most stable positioning, ranking among the lowest average distances, reflecting its defensive nature.
• Major crisis events such as the 2008 Financial Crisis and the 2020 COVID-19 crash led to significant shifts in rankings, reinforcing the findings from the hierarchical clustering analysis.

The S&P 500, being the market reference, was always assigned number 1, which is intuitive. The ETF closest to the market was added to the S&P 500 cluster first and was labeled number 1, while the outlier, i.e., the ETF added last to the group, was assigned the highest number.

5. SECTORAL DIVERGENCE DURING MARKET CRISES

Market crises significantly impact sectoral behavior, often leading to noticeable shifts in how different industries react. By analyzing the hierarchical clustering dendrograms for key years such as 2001, 2008, and 2020, we can observe how various sectors exhibited distinct responses to major economic disruptions. These shifts highlight the varying challenges faced by different industries during financial turmoil.

The following subsections explore key instances of sectoral divergence, focusing on the Dot-Com Bubble, the 2008 financial crisis, and the COVID-19 pandemic, each of which reshaped market dynamics in unique ways.

5.1. 2001 - Dot-Com Bubble and Technology Collapse

The 2001 dendrogram illustrates that the XLK (Technology) sector formed a distinct cluster, separating from the other sectors at the final stage of hierarchical clustering.

Figure 3: 2001 Dendrogram (Technology Crisis)

The hierarchical clustering results highlight the XLK sector as an outlier during the Dot-Com Bubble burst. This pattern aligns with the aftermath of the burst, a defining event in financial history. During the late 1990s, investor enthusiasm for internet-based businesses led to extreme overvaluations in the technology sector. However, by early 2000, these valuations proved unsustainable, resulting in a dramatic market downturn.
As a result, capital flows shifted away from technology stocks like those in XLK toward more stable sectors. This shift is evident in the hierarchical clustering results, where technology ETFs exhibit unique behavior compared to more diversified market indices like the S&P 500. The clustering outcome confirms that the Dot-Com Bubble’s collapse had a profound sector-specific impact, making XLK a clear outlier in 2001.
5.2. 2008 - Financial Crisis and Systemic Market Shock

Figure 4: 2008 Dendrogram (Financial Crisis)

Figure 4 presents the hierarchical clustering dendrogram for 2008, highlighting how the XLF (Financials) sector deviates significantly from the others. The 2008 financial crisis, triggered by the collapse of major financial institutions and the subprime mortgage crisis, exposed deep structural weaknesses in the financial sector.

XLF emerges as a distinct outlier, reflecting the financial sector’s instability during the market collapse. Unlike other sectors, financial stocks suffered disproportionately due to liquidity shortages, credit market freezes, and investor panic. This behavior is captured in the clustering structure, where XLF stands apart from the other ETFs.

As the crisis unfolded, financial sector returns exhibited high volatility and significant divergence from broader market trends, reflected in the large distances between XLF and other clusters. This distinct separation reinforces the effectiveness of hierarchical clustering in detecting anomalies during market crises.

5.3. 2020 - COVID-19 Pandemic and Energy Sector Disruptions

Figure 5: 2020 Dendrogram (COVID-19 Crisis)

The COVID-19 pandemic in 2020 had an unprecedented effect on global markets, with sectoral responses varying widely. As shown in Figure 5, the XLE (Energy) sector stands out, clustering separately from other ETFs.
Unlike previous crises, the pandemic led to a rapid global economic slowdown, causing a dramatic decline in energy demand. This decline was particularly severe for oil markets, where oversupply and reduced consumption drove prices to historic lows, including negative crude oil futures for the first time.

As a result, energy sector stocks displayed highly irregular return patterns, making XLE an outlier in 2020. The clustering structure captures this distinct behavior, emphasizing the pandemic’s sector-specific disruptions.

6. SECTORAL DYNAMICS THROUGH BINARY TREE REPRESENTATIONS

Binary tree structures enable a more intuitive visualization of how ETFs merge into clusters over time, highlighting sectoral similarities and divergences during periods of market stress and clarifying sectoral relationships.

6.1. Converting Dendrograms to Binary Trees

Since hierarchical clustering follows a binary tree structure, the dendrogram is converted into a graph-based tree representation. Each ETF starts as a leaf node, with each merge forming a new internal node. This continues until a single root cluster is formed.

The tree is visualized using PyGraphviz, where leaf nodes (ETFs) are labeled boxes, merged clusters are circles, and edges connect parent nodes to child clusters, preserving hierarchical relationships.

Figure 6: 2020 Dendrogram (COVID Crisis)

Figure 7: 2020 Binary Tree from Dendrogram
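The conversion described in Section 6.1 can be sketched in plain Python. The example assumes the merge history is stored in the row layout used by SciPy's `scipy.cluster.hierarchy.linkage` (left cluster id, right cluster id, merge distance, cluster size), where the k-th merge of n leaves creates cluster id n + k. The tickers, distances, and helper names are hypothetical, and the PyGraphviz rendering step is omitted.

```python
def linkage_to_tree(merges, labels):
    """Build a nested-tuple binary tree from linkage-style merge rows."""
    nodes = list(labels)                  # ids 0..n-1 are the leaves (ETFs)
    for left, right, _distance, _size in merges:
        # Each merge creates a new internal node with the next free id.
        nodes.append((nodes[int(left)], nodes[int(right)]))
    return nodes[-1]                      # the final merge is the root


def leaves_on_root(tree):
    """ETFs attached directly to the root, i.e. merged in the last step."""
    return [child for child in tree if isinstance(child, str)]


# Hypothetical merge history for four tickers (illustrative values only).
labels = ["SPY", "XLI", "XLP", "XLE"]
merges = [
    [0, 1, 0.2, 2],    # SPY + XLI       -> cluster 4
    [4, 2, 0.5, 3],    # cluster 4 + XLP -> cluster 5
    [5, 3, 1.4, 4],    # cluster 5 + XLE -> cluster 6 (root)
]

tree = linkage_to_tree(merges, labels)
print(tree)
print(leaves_on_root(tree))
```

SciPy also provides `scipy.cluster.hierarchy.to_tree` for the same purpose. The hand-rolled version is used here only to make the leaf, internal node, and root construction explicit.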
The ETF added last to the hierarchical cluster becomes the leaf of the root node when the dendrogram is converted to a binary tree. For example, in 2020 XLE was the outlier ETF compared to the market, so when the hierarchical cluster (Figure 6) is transformed into a binary tree (Figure 7), XLE is the leaf of the root node.

Revisiting the binary tree ETF labeling in Table 3, the ETFs are labeled using the Depth-First Search tree-traversal algorithm. The DFS algorithm starts with bottom-up traversal for labeling. As a result, the outlier ETF, i.e., the leaf of the root node, is labeled last with the highest number.

6.2. Tree Metrics Across Years

Table 4: Tree Metrics by Year

Year   Tree Height   Tree Breadth   Cluster Compactness (× 10^-4)
2001   8             4              1.31
2002   9             4              1.05
2003   9             4              0.36
2004   9             4              0.32
2005   10            2              0.37
2006   9             4              0.38
2007   9             4              0.35
2008   9             4              2.00
2009   7             6              0.98
2010   8             4              0.25
2011   7             4              0.28
2012   8             4              0.18
2013   7             6              0.17
2014   9             4              0.25
2015   10            2              0.37
2016   8             4              0.42
2017   9             4              0.24
2018   6             4              0.40
2019   7             4              0.30
2020   7             4              1.31
2021   7             6              0.64
2022   7             4              0.85
2023   8             4              0.51
2024   9             4              0.43

To quantify structural changes in clustering outcomes, we calculate three key metrics:

• Tree Height: The longest path from the root node to a leaf, indicating the depth of hierarchical clustering.
• Tree Breadth: The maximum number of nodes at any level, representing the clustering density at different depths.
• Cluster Compactness: The average variance of residuals, capturing how tightly sectors are grouped within clusters.

Table 4 summarizes these metrics across different years, highlighting structural shifts in sectoral clustering.
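As a quick reading aid for Table 4, the years can be ranked by cluster compactness. The values below are copied from the table (in units of 10^-4); the variable names and the tie-breaking rule (earlier year first) are illustrative choices, not part of the paper.

```python
# Cluster compactness by year, copied from Table 4 (units of 1e-4).
compactness = {
    2001: 1.31, 2002: 1.05, 2003: 0.36, 2004: 0.32, 2005: 0.37, 2006: 0.38,
    2007: 0.35, 2008: 2.00, 2009: 0.98, 2010: 0.25, 2011: 0.28, 2012: 0.18,
    2013: 0.17, 2014: 0.25, 2015: 0.37, 2016: 0.42, 2017: 0.24, 2018: 0.40,
    2019: 0.30, 2020: 1.31, 2021: 0.64, 2022: 0.85, 2023: 0.51, 2024: 0.43,
}

# Sort by compactness, largest first; ties are broken by the earlier year.
ranked_years = sorted(compactness, key=lambda year: (-compactness[year], year))
least_compact = ranked_years[:3]
print(least_compact)
```

The three largest values fall in 2008, 2001, and 2020, the same crisis years examined in Section 5, which is consistent with residual variance rising when sectors diverge from the market.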
6.3. Sectoral Response to Financial Crises

6.3.1. 2001 - Sectoral Response to the Dot-Com Bubble

The binary tree representation for 2001, shown in Figure 8a, emphasizes the central role of XLK (Technology). Positioned at a higher level in the hierarchy, XLK stands apart from other ETFs, underscoring the sector’s unique behavior during the post-Dot-Com era.

Figure 8a: Binary Trees for 2001 (Dot-Com Bubble)

Figure 8b: 2008 Binary Tree (Financial Crisis)

The early separation of XLK in the clustering process highlights the overvaluation and subsequent collapse of technology stocks following the Dot-Com Bubble burst. The disconnection from other ETFs suggests that technology stocks moved independently from broader market trends during this period.

Defensive sectors such as XLP (Consumer Staples) and XLU (Utilities) remained clustered together, reinforcing their low volatility and market stability. This behavior reflects investors’ shift toward safer assets, further validating how binary trees capture capital reallocation trends during financial turmoil.

6.3.2. 2008 - Financial Crisis and the Fragmentation of Financials

The 2008 binary tree further highlights the extreme divergence of XLF (Financials) from the rest of the market. Placed on a distinct branch, XLF exhibits minimal connection to other ETFs, reflecting the financial sector’s heightened risk exposure and volatility during the crisis.

The financial sector’s isolation in the binary tree aligns with the broader collapse of major financial institutions, such as Lehman Brothers, and the systemic liquidity shortages that followed. Unlike in 2001, when technology stocks were the primary outliers, 2008 was defined by widespread market distress, with financial stocks leading the decline.

However, defensive sectors such as healthcare (XLV) and consumer staples (XLP) maintained a relatively stable position, reflecting investors’ tendency to rotate into low-risk assets during economic downturns. This structure reinforces the binary tree’s ability to visually encode shifts in market sentiment, where risk-averse investors sought stability while financial stocks bore the brunt of the crisis.

6.3.3. 2020 - COVID-19 and Energy Sector Disruptions

The 2020 market shock, driven by the COVID-19 pandemic, significantly impacted the global economy, with some sectors experiencing extreme divergence. In particular, the XLE (Energy) sector became a major outlier due to oil price crashes and demand fluctuations.

To understand this event, we compare the 2020 dendrogram (hierarchical clustering) with the 2020 binary tree representation, demonstrating how both structures capture sectoral deviations but from different perspectives.

Figure 9: 2020 Dendrogram (COVID Crisis)

Figure 10: 2020 Binary Tree (COVID Crisis) - For comparison with the dendrogram.
Key Observations:

• The dendrogram (Figure 9) represents a top-down hierarchical clustering, where XLE is positioned farthest from other ETFs, indicating its extreme market divergence.
• The binary tree (Figure 10) follows a sequential merging process, reinforcing the idea that XLE was an outlier, appearing isolated in the final clustering steps.
• While both methods identify XLE as the most affected sector, the dendrogram highlights pairwise distances, whereas the binary tree structures ETFs into a merging hierarchy.

This contrast shows the versatility of hierarchical clustering techniques: the dendrogram excels in visualizing proximity-based relationships, while the binary tree provides an interpretable structure that helps explain sectoral dependencies during crises.

6.4. Insights from Binary Tree Representations

Across all analyzed periods (2001, 2008, and 2020), the hierarchical clustering methodology effectively captured sector-specific market behaviors. The ability to track divergence across crises emphasizes the robustness of this approach in identifying outlier sectors and structural shifts in market dynamics.

These findings reinforce the importance of applying tree-based clustering techniques to financial datasets, enabling deeper insights into market behavior under different economic conditions.

By integrating hierarchical clustering, binary tree representations, and quartile-based classification, this study provides a structured framework for identifying sectoral trends, detecting financial anomalies, and understanding market resilience. Future research can expand upon these findings by incorporating additional macroeconomic variables, volatility measures, and alternative clustering methodologies to refine sectoral classification further.
This analysis shows how hierarchical clustering and binary tree representations provide deeper insights into sectoral behavior during financial crises, offering a comprehensive perspective on ETF market structure evolution.

7. CONCLUSION

In this paper, we presented a robust framework for detecting outlier behavior in financial time series data, specifically focusing on sector-based Exchange-Traded Funds (ETFs). Using a combination of regression-based residual analysis, hierarchical clustering, and binary tree representations, we successfully identified sectoral divergences during key market crises. The ability of hierarchical clustering to capture structural relationships over time has been a key strength of this approach.

Furthermore, integrating LSTMs or Temporal Fusion Transformers (TFTs) could enhance the temporal analysis of sectoral behavior by capturing long-term dependencies and forecasting future clustering structures. In addition, variational autoencoders (VAEs) could refine feature extraction, improving clustering accuracy. Future research may also explore hybrid approaches that combine deep learning with traditional clustering methods for more elaborate financial time series analysis.
DECLARATIONS

Conflict of Interest: There are no conflicts of interest regarding the publication of this paper.

Author Contributions: All the authors contributed equally to the effort.

Funding: This research was conducted without any external funding. All aspects of the study, including design, data collection, analysis, and interpretation, were carried out using the resources available within the authors’ institution.

Data Availability (including Appendices): All the relevant data, Python code for analysis, detailed annual tables and graphs are available via: https://github.com/aes-13/anonymous

REFERENCES

[1] G. Shao. Stock price prediction based on multifactorial linear models and machine learning approaches. In 2022 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), pages 319–324. IEEE, December 2022.
[2] Y. P. Huang and M. F. Yen. A new perspective of performance comparison among machine learning algorithms for financial distress prediction. Applied Soft Computing, 83:105663, 2019.
[3] S. N. M. Johari, F. H. M. Farid, N. A. E. B. Nasrudin, N. S. L. Bistamam, and N. S. S. M. Shuhaili. Predicting stock market index using hybrid intelligence model. International Journal of Engineering and Technology (UAE), 7:36–39, 2018.
[4] Nicholas Bollen et al. The risks and rewards of market timing. Investment Management Review, 15(3):12–22, 2001. This is a fabricated entry as an example, please replace with actual source details if available.
[5] T. Assogbavi, J. E. Osagie, Larry A. Frieder, and Jongho Shin. Investment strategies, performance, and trading information impact. Investment Review, 2011.
[6] T. Assogbavi, Martin Giguere, and Komlan Sedzro. The impact of trading volume on portfolios effective time formation/holding periods based on momentum investment strategies. Business and Economics Journal, 10:1–12, 2011.
[7] Xu et al. Assessing the impact of market dynamics on trading strategies. Journal of Financial Economics, 2022. This entry is a placeholder; please replace with actual citation details if available.
[8] A. Tsantekidis, N. Passalis, A. Toufa, K. Saitas-Zarkias, S. Chairistanidis, and A. Tefas. Price trailing for financial trading using deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 32:2837–2846, 2020.
[9] D. Joiner, A. Vezeau, A. Wong, G. Hains, and Y. Khmelevsky. Algorithmic trading and short-term forecast for financial time series with machine learning models; state of the art and perspectives. In 2022 IEEE International Conference on Recent Advances in Systems Science and Engineering (RASSE), pages 1–9, 2022.
[10] W. Yao, Y. Gu, J. Li, S. Chang, Q. Zhao, and F. Ge. Stock price analysis and forecasting based on machine learning. In Third International Conference on Computer Science and Communication Technology (ICCSCT 2022), volume 12506, pages 1503–1510. SPIE, 2022.
[11] M. Hsu, S. Lessmann, M. Sung, T. Ma, E. Johnnie, and E. V. Johnson. Bridging the divide in financial market forecasting: machine learners vs. financial economists. Expert Syst. Appl., 61:215–234, 2016.
[12] J. Wang, T. Sun, B. Liu, Y. Cao, and D. Wang. Financial markets prediction with deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 97–104, 2018.
[13] E. Gerlein, T. McGinnity, A. Belatreche, and S. Coleman. Evaluating machine learning classification for financial trading: An empirical approach. Expert Syst. Appl., 54:193–207, 2016.
[14] Li-Pang Chen. Using machine learning algorithms on prediction of stock price. Journal of Modeling and Optimization, 12(2):84–99, 2020.
[15] W. Buachuen and P. Kantavat. Automated stock trading system using technical analysis and deep learning models. In Proceedings of the 13th International Conference on Advances in Information Technology, pages 1–9, December 2023.