Data Warehousing and Mining Module 1
Data warehousing
Data warehousing is the practice of collecting and organizing data from many
sources into a single database, whereas data mining extracts useful information
from that data.
A data warehouse is a large-capacity store in which data is collected for
analysis and mining. Data from an organization's various systems is loaded into
the warehouse, where it can be retrieved as needed.
Data warehouses consolidate data from several sources and help ensure its
accuracy, quality, and consistency. Separating analytical processing from the
operational (transactional) databases also improves system performance. In a
data warehouse, data is organized into a consistent format by type and as
needed, and query tools then examine it along several patterns.
Data warehouses store historical data and answer analytical requests faster,
which supports online analytical processing (OLAP), whereas an operational
database stores the current transactions of a business process, which is called
online transaction processing (OLTP).
Features
1. Centralized Data Repository:
o Integrates data from various sources into a single repository.
o Provides a unified view of business data.
2. Data Integration:
o Combines data from disparate sources such as databases,
applications, and external systems.
o Ensures consistency and accuracy of data through ETL (Extract,
Transform, Load) processes.
3. Historical Data Storage:
o Maintains historical data for trend analysis and long-term reporting.
o Allows for time-based analysis, including historical comparisons and
forecasting.
4. Data Aggregation and Summarization:
o Supports aggregation and summarization of data to provide high-
level insights.
o Facilitates complex queries and reporting.
5. Optimized for Query Performance:
o Designed to handle complex queries and large volumes of data
efficiently.
o Uses indexing, partitioning, and other techniques to enhance
performance.
6. Support for Data Mining and Analytics:
o Provides a foundation for data mining, business intelligence, and
advanced analytics.
o Enables sophisticated analysis and decision-making processes.
7. Data Quality Management:
o Implements data cleaning and validation processes to ensure data
quality.
o Regularly updates and maintains data accuracy and consistency.
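The ETL process named under Data Integration can be sketched in a few lines. This is only a minimal illustration, assuming two hypothetical source systems (`crm_rows` and `pos_rows`) that record sales in different formats; the table name `sales` and all values are invented for the example.

```python
import sqlite3

# Hypothetical source systems with inconsistent formats (illustrative data only).
crm_rows = [{"city": "delhi ", "amount": "1200.50", "date": "2023-01-15"}]
pos_rows = [("MUMBAI", 800, "15/02/2023")]  # (city, amount, dd/mm/yyyy)

def extract():
    """Extract: pull raw records from each source."""
    return crm_rows, pos_rows

def transform(crm, pos):
    """Transform: standardize city names, amounts, and date formats."""
    clean = []
    for r in crm:
        clean.append((r["city"].strip().title(), float(r["amount"]), r["date"]))
    for city, amount, date in pos:
        day, month, year = date.split("/")
        clean.append((city.strip().title(), float(amount), f"{year}-{month}-{day}"))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (city TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(*extract()), conn)
warehouse_rows = conn.execute(
    "SELECT city, amount, sale_date FROM sales ORDER BY sale_date"
).fetchall()
print(warehouse_rows)
```

After the load step, both sources share one format, which is what allows the warehouse to present a unified view.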
Advantages
1. Improved Decision-Making:
o Provides comprehensive and accurate data that supports informed
decision-making.
o Facilitates business intelligence and strategic planning through
detailed reports and analyses.
2. Enhanced Reporting and Analysis:
o Supports complex queries and detailed reporting, providing
valuable insights into business performance.
o Allows for the creation of dashboards and visualizations for easier
interpretation of data.
3. Data Consistency and Accuracy:
o Ensures that data from various sources is integrated and
standardized, reducing inconsistencies.
o Improves data accuracy through rigorous data quality management.
4. Historical Data Access:
o Stores historical data for trend analysis, forecasting, and
longitudinal studies.
o Allows organizations to track changes over time and analyze past
performance.
5. Increased Efficiency:
o Reduces the time and effort required to generate reports and
perform analyses.
o Centralizes data access, making it easier for users to obtain the
information they need.
6. Scalability:
o Supports the growth of data volumes and complexity over time.
o Can be scaled to accommodate increasing data and user demands.
7. Support for Business Intelligence Tools:
o Integrates seamlessly with various BI tools for advanced data
analysis and visualization.
o Enhances the capabilities of data mining and predictive analytics.
Disadvantages
1. High Initial Cost:
o Requires significant investment in hardware, software, and
infrastructure.
o Implementation and setup costs can be substantial, especially for
large organizations.
2. Complex Implementation:
o Building and maintaining a data warehouse can be complex and
time-consuming.
o Requires careful planning and expertise in data modeling, ETL
processes, and database management.
3. Data Latency:
o Data may not be updated in real-time, leading to potential delays in
data availability.
o Latency can affect the timeliness of reporting and analysis.
Consider a shop's sales data, recorded as items sold per quarter in the city of
Delhi. The data is shown in the table. In this 2D representation, the sales for
Delhi are shown along the time dimension (organized in quarters) and the item
dimension (classified by the type of item sold). The fact, or measure, displayed
is rupee_sold (in thousands).
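A 2D view of this kind can be sketched as follows. The item names and figures here are hypothetical placeholders, not the values from the source's table; only the layout (time as rows, item as columns, rupee_sold as the measure) follows the description above.

```python
# Hypothetical sales records for Delhi: (quarter, item, rupee_sold in thousands).
# The figures are invented for illustration; they are not the source's table.
records = [
    ("Q1", "Shirt", 120), ("Q1", "Pant", 90),
    ("Q2", "Shirt", 150), ("Q2", "Pant", 110),
]

# Build the 2D view: rows are the time dimension, columns the item dimension.
quarters = sorted({q for q, _, _ in records})
items = sorted({i for _, i, _ in records})
table = {q: {i: 0 for i in items} for q in quarters}
for quarter, item, rupee_sold in records:
    table[quarter][item] += rupee_sold

# Print the grid with one row per quarter.
print("Quarter", *items, sep="\t")
for q in quarters:
    print(q, *(table[q][i] for i in items), sep="\t")
```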
Key Concepts
1. Dimensions:
o Definition: Dimensions are perspectives or attributes by which
data can be analyzed. They provide context to the measures and
help in slicing and dicing the data.
o Examples: Time, location, product, customer, and sales region.
2. Facts:
o Definition: Facts are quantitative data or metrics that are of
interest and can be measured. They represent the data points that
are analyzed across different dimensions.
o Examples: Sales revenue, number of units sold, profit margins.
3. Measures:
o Definition: Measures are the numerical values or quantities that
are aggregated or analyzed in the fact tables. They are often
calculated from the data stored in the fact tables.
o Examples: Total sales, average revenue per customer.
4. Fact Tables:
o Definition: Fact tables store quantitative data and are typically
large. They contain measures and foreign keys to the dimension
tables.
o Structure: Fact tables include data such as sales transactions,
performance metrics, and other numerical data that are analyzed
over different dimensions.
5. Dimension Tables:
o Definition: Dimension tables contain descriptive attributes or
characteristics related to the dimensions. They provide context and
additional details for the facts.
o Structure: Dimension tables typically include attributes such as
product name, customer address, or time period.
6. Star Schema:
o Definition: A star schema is a type of multidimensional model
where the fact table is at the center, and dimension tables are
connected to it. The structure resembles a star.
o Advantages: Simple design, easy to understand, and efficient for
querying and reporting.
o Example: A sales fact table connected to dimension tables for time,
product, and location.
7. Snowflake Schema:
o Definition: A snowflake schema is a more normalized form of the
star schema, where dimension tables are further broken down into
related sub-dimension tables. This structure resembles a snowflake.
o Advantages: Reduces data redundancy and improves data
integrity.
o Example: A product dimension table broken down into product
category and product sub-category tables.
8. Galaxy Schema (or Fact Constellation Schema):
o Definition: A galaxy schema includes multiple fact tables that
share dimension tables. It is used for complex data models involving
multiple business processes.
o Advantages: Supports complex queries and provides a
comprehensive view of different business processes.
o Example: Sales and inventory fact tables sharing common
dimension tables for time and product.
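The star schema example above (a sales fact table connected to time, product, and location dimensions) can be sketched with SQLite. The table names, column names, and rows are assumptions made for illustration, not part of the original notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold  INTEGER
);
INSERT INTO dim_time     VALUES (1, 'Q1', 2023), (2, 'Q2', 2023);
INSERT INTO dim_product  VALUES (1, 'Shirt'), (2, 'Pant');
INSERT INTO dim_location VALUES (1, 'Delhi'), (2, 'Mumbai');
INSERT INTO fact_sales   VALUES (1, 1, 1, 10), (1, 2, 1, 5), (2, 1, 2, 7), (2, 1, 1, 3);
""")

# A typical star-schema query: join facts to dimensions, then aggregate.
units_by_quarter_and_product = conn.execute("""
    SELECT t.quarter, p.product_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_time t    ON f.time_id = t.time_id
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY t.quarter, p.product_name
    ORDER BY t.quarter, p.product_name
""").fetchall()
print(units_by_quarter_and_product)
```

A snowflake variant would split `dim_product` further (for example into a separate category table), at the cost of one more join per query.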
Benefits of Multidimensional Modeling
1. Enhanced Query Performance:
o Optimized for complex queries and aggregations. Pre-aggregated
data and indexing improve performance.
2. User-Friendly Reporting:
o Facilitates intuitive reporting and analysis by allowing users to view
data from various perspectives and drill down into details.
3. Data Consistency:
o Provides a consistent view of data across different dimensions and
measures, ensuring uniformity in reporting and analysis.
4. Efficient Data Analysis:
o Enables quick slicing, dicing, and pivoting of data, making it easier
to analyze trends, patterns, and insights.
5. Flexibility:
o Allows for flexible and dynamic analysis by enabling users to
explore data across different dimensions and measures.
Challenges
1. Complexity:
o Designing and maintaining a multidimensional model can be
complex, especially for large and intricate data warehouses.
2. Data Redundancy:
o Star and snowflake schemas may involve some data redundancy,
leading to increased storage requirements.
3. Performance Trade-offs:
o While multidimensional models improve query performance, they
may require significant resources and careful tuning.
Types of OLAP
OLAP comes in several types, each serving particular requirements and
preferences for data analysis. The primary kinds are ROLAP, MOLAP, and HOLAP.
Benefits of OLAP
1. Improved Decision-Making:
o By providing fast and interactive access to multidimensional data,
OLAP enhances decision-making processes, allowing organizations
to make data-driven decisions more effectively.
2. Rapid Data Retrieval:
o OLAP systems are optimized for quick data retrieval and complex
queries, which improves the efficiency of data analysis and
reporting.
3. Enhanced Data Analysis:
o OLAP enables in-depth analysis of data by allowing users to explore
data from different angles and levels of granularity, leading to a
better understanding of business dynamics.
4. Real-Time Analysis:
o OLAP systems support real-time or near-real-time analysis,
providing timely insights that are essential for dynamic business
environments.
Characteristics:
1. Multidimensional Analysis:
o Enables users to view data from multiple dimensions (e.g., time,
geography, product) and perform operations like slicing, dicing, and
drilling down.
2. Interactive Query Processing:
o Provides fast response times for analytical queries, allowing users to
explore data and generate reports in real-time.
3. Aggregation and Summarization:
o Supports aggregation of data at different levels of granularity,
helping users to analyze data at various summary levels (e.g., daily,
monthly, yearly).
4. Complex Calculations:
o Allows for advanced calculations, such as ratios, percentages, and
trend analyses, directly within the OLAP system.
5. User-Friendly Interfaces:
o Features intuitive interfaces, often with drag-and-drop capabilities,
making it accessible for non-technical users to perform data
analysis.
Architecture:
1. Data Sources:
o Source Systems: Various data sources such as transactional
databases, ERP systems, and external data feeds from which data is
extracted.
2. ETL (Extract, Transform, Load):
o Extract: Data is extracted from source systems.
o Transform: Data is cleaned, transformed, and structured to fit the
OLAP model.
o Load: Transformed data is loaded into the OLAP system.
3. Data Warehouse:
o Central Repository: Stores integrated and historical data used for
analysis. It acts as the source for OLAP cubes.
4. OLAP Server:
o ROLAP (Relational OLAP): Uses relational databases to store data
and performs multidimensional queries on relational data
structures.
o MOLAP (Multidimensional OLAP): Uses multidimensional
database systems (OLAP cubes) to store pre-aggregated data for
fast retrieval.
o HOLAP (Hybrid OLAP): Combines features of both ROLAP and
MOLAP for flexible and efficient processing.
5. OLAP Cube:
o Multidimensional Structure: Pre-aggregated data organized into
cubes, with dimensions and measures, facilitating fast and efficient
querying.
6. Client Tools:
o Analytical Tools: Interfaces and applications used by end-users to
interact with OLAP cubes and perform data analysis (e.g., reporting
tools, dashboards).
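The difference between the ROLAP and MOLAP server styles listed above can be illustrated roughly as follows. This is a simplified sketch, not how any particular OLAP product works: the `sales` table, the `"ALL"` marker, and the data are assumptions.

```python
import sqlite3
from collections import defaultdict

rows = [("Q1", "Delhi", 100), ("Q1", "Mumbai", 80), ("Q2", "Delhi", 120)]

# ROLAP-style: the detail rows stay in a relational table and each
# multidimensional request is translated into SQL at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, city TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def rolap_total(quarter):
    sql = "SELECT SUM(amount) FROM sales WHERE quarter = ?"
    return conn.execute(sql, (quarter,)).fetchone()[0]

# MOLAP-style: aggregates are computed once, ahead of time, and stored
# in a cube-like structure; a query is then a simple lookup.
cube = defaultdict(int)
for quarter, city, amount in rows:
    cube[(quarter, city)] += amount
    cube[(quarter, "ALL")] += amount  # pre-aggregated over all cities

def molap_total(quarter):
    return cube[(quarter, "ALL")]

print(rolap_total("Q1"), molap_total("Q1"))
```

Both functions return the same totals; the trade-off is that the pre-computed cube must be rebuilt when the underlying rows change, while the SQL path always reads current data.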
Multidimensional View:
1. Dimensions:
o Definition: Perspectives or attributes used to slice and dice data
(e.g., time, location, product).
o Hierarchies: Structures within dimensions (e.g., year > quarter >
month > day) that enable drill-down and roll-up operations.
2. Measures:
o Definition: Quantitative data points that are analyzed across
dimensions (e.g., sales revenue, quantity sold).
o Aggregation: Measures are aggregated along dimensions to
provide summary statistics.
3. Slicing and Dicing:
o Slicing: Extracting a subset of data by selecting a single dimension
value (e.g., sales data for January).
o Dicing: Extracting a sub-cube by selecting multiple dimension
values (e.g., sales data for January and February).
4. Drill-Down and Roll-Up:
o Drill-Down: Navigating from summary data to more detailed data
(e.g., from yearly sales to monthly sales).
o Roll-Up: Aggregating detailed data into summary data (e.g., from
monthly sales to yearly sales).
OLAP Operations
OLAP provides various operations to gain insights from the data stored in
multidimensional hypercubes. These operations include:
Drill Down
Drill down operation allows a user to zoom in on the data cube i.e., the less
detailed data is converted into highly detailed data. It can be implemented by
either stepping down a concept hierarchy for a dimension or adding additional
dimensions to the hypercube.
Example: Consider a cube that represents the annual sales (4 Quarters: Q1, Q2,
Q3, Q4) of various kinds of clothes (Shirt, Pant, Shorts, Tees) of a company in 4
cities (Delhi, Mumbai, Las Vegas, New York) as shown below:
Here, the drill-down operation is applied on the time dimension and the
quarter Q1 is drilled down to January, February, and March. Hence, by applying
the drill-down operation, we can move down from quarterly sales in a year to
monthly or weekly records.
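A drill-down along the time dimension can be sketched like this. The monthly figures are hypothetical; the month-to-quarter mapping plays the role of the concept hierarchy.

```python
from collections import defaultdict

# Hypothetical monthly sales; the month -> quarter mapping is the
# concept hierarchy for the time dimension.
monthly_sales = {"Jan": 40, "Feb": 35, "Mar": 45, "Apr": 50, "May": 55, "Jun": 60}
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
              "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

# Summary level: sales per quarter.
quarterly_sales = defaultdict(int)
for month, amount in monthly_sales.items():
    quarterly_sales[quarter_of[month]] += amount

def drill_down(quarter):
    """Step down the hierarchy: return the monthly figures behind one quarter."""
    return {m: a for m, a in monthly_sales.items() if quarter_of[m] == quarter}

q1_detail = drill_down("Q1")
print(quarterly_sales["Q1"], q1_detail)
```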
Roll up
It is the opposite of the drill-down operation and is also known as a drill-up or
aggregation operation. It performs aggregation on a data cube, making the data
less detailed, either by climbing up a concept hierarchy for a dimension or by
removing a dimension (dimension reduction).
Example: Considering the above-mentioned clothing company sales example:
Here, we are performing the Roll-up operation on the given data cube by
combining and categorizing the sales based on the countries instead of cities.
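The city-to-country roll-up described here might look as follows in code. The sales figures are hypothetical, and the `country_of` mapping is an assumed concept hierarchy for the four cities in the example.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (city, quarter); values are sales figures.
sales_by_city = {
    ("Delhi", "Q1"): 100, ("Mumbai", "Q1"): 80,
    ("Las Vegas", "Q1"): 60, ("New York", "Q1"): 90,
}
# Concept hierarchy for the location dimension: city -> country.
country_of = {"Delhi": "India", "Mumbai": "India",
              "Las Vegas": "USA", "New York": "USA"}

def roll_up(cells, hierarchy):
    """Aggregate city-level cells into country-level cells."""
    rolled = defaultdict(int)
    for (city, quarter), amount in cells.items():
        rolled[(hierarchy[city], quarter)] += amount
    return dict(rolled)

sales_by_country = roll_up(sales_by_city, country_of)
print(sales_by_country)
```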
Dice
Dice operation is used to generate a new sub-cube from the existing hypercube.
It selects values on two or more dimensions of the hypercube to generate a new
sub-cube.
Example: Considering our clothing company sales example:
Here, we are using the dice operation to retrieve the sales made by the company
in the first half of the year, i.e., the sales in the first two quarters.
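As a rough sketch, a dice can be written as a filter over cube cells. The cell layout `(city, quarter, item)` and the values are assumptions for illustration.

```python
# Hypothetical cube cells keyed by (city, quarter, item).
cube = {
    ("Delhi", "Q1", "Shirt"): 10, ("Delhi", "Q2", "Shirt"): 12,
    ("Delhi", "Q3", "Shirt"): 9,  ("Mumbai", "Q1", "Pant"): 7,
    ("Mumbai", "Q2", "Tees"): 5,  ("Mumbai", "Q4", "Pant"): 8,
}

def dice(cells, cities, quarters):
    """Keep only the cells whose city AND quarter fall in the selected sets."""
    return {
        key: value for key, value in cells.items()
        if key[0] in cities and key[1] in quarters
    }

# First half of the year, restricted to two cities.
first_half = dice(cube, cities={"Delhi", "Mumbai"}, quarters={"Q1", "Q2"})
print(first_half)
```

The result keeps all three dimensions; it is simply a smaller cube.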
Slice
Slice operation selects a single value of one dimension of the given cube to
generate a new sub-cube with one fewer dimension. It represents the information
from another point of view.
Example: Considering our clothing company sales example:
Here, the sales done by the company during the first quarter are retrieved by
performing the slice operation on the given hypercube.
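A slice on the time dimension can be sketched as below, again with hypothetical cells keyed by `(city, quarter, item)`. Fixing the quarter removes that dimension from the result.

```python
# Hypothetical cube cells keyed by (city, quarter, item).
cube = {
    ("Delhi", "Q1", "Shirt"): 10, ("Delhi", "Q2", "Shirt"): 12,
    ("Mumbai", "Q1", "Pant"): 7,  ("Mumbai", "Q2", "Tees"): 5,
}

def slice_cube(cells, quarter):
    """Fix the time dimension to one value and drop it from the key."""
    return {
        (city, item): value
        for (city, q, item), value in cells.items()
        if q == quarter
    }

q1_slice = slice_cube(cube, "Q1")
print(q1_slice)
```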
Pivot
It is used to provide an alternate view of the data available to the users. It is also
known as Rotate operation as it rotates the cube’s orientation to view the data
from different perspectives.
Example: Considering our clothing company sales example:
Here, we are using the Pivot operation to view the sub-cube from a different
perspective.
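For a 2D view, a pivot amounts to swapping the row and column axes. A minimal sketch, with hypothetical values:

```python
# A 2D view with cities as rows and quarters as columns (hypothetical values).
view = {
    "Delhi":  {"Q1": 100, "Q2": 120},
    "Mumbai": {"Q1": 80,  "Q2": 95},
}

def pivot(table):
    """Swap the row and column axes; the cell values are unchanged."""
    rotated = {}
    for row_label, columns in table.items():
        for col_label, value in columns.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

by_quarter = pivot(view)
print(by_quarter)
```

Pivoting twice returns the original view, which reflects that the operation changes only the orientation, not the data.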
ROLAP vs MOLAP vs HOLAP vs DATA CUBE
Differences
1. Performance:
o ROLAP: Generally slower due to dynamic query generation.
o MOLAP: Typically faster due to pre-computed data in cubes.
o HOLAP: Balanced performance combining both relational and
multidimensional storage.
o Data Cube: Performance depends on the implementation in MOLAP or HOLAP.
2. Scalability:
o ROLAP: Highly scalable with relational databases.
o MOLAP: Limited by the size of the OLAP cubes and storage capacity.
o HOLAP: Provides a balance, leveraging scalability of relational storage
and performance of cubes.
o Data Cube: Scalability is influenced by MOLAP or HOLAP implementation.
3. Data Granularity:
o ROLAP: Handles detailed data at a granular level.
o MOLAP: Handles aggregated data; less access to detailed data.
o HOLAP: Provides both detailed and aggregated data.
o Data Cube: Contains pre-aggregated data; granularity depends on cube
design.
4. Complexity of Query Processing:
o ROLAP: Queries involve converting multidimensional requests into SQL
queries.
o MOLAP: Queries are processed against pre-computed cubes, simplifying
processing.
o HOLAP: Uses both relational and multidimensional queries.
o Data Cube: Simplifies querying by providing pre-aggregated views.
5. Data Refresh:
o ROLAP: Reflects current data as it is dynamically generated from
relational databases.
o MOLAP: Data may need periodic updates to reflect changes.
o HOLAP: Provides up-to-date detailed data and periodically refreshed
cubes.
o Data Cube: Refresh rate depends on cube update frequency.
6. Flexibility:
o ROLAP: More flexible in querying and handling new dimensions.
o MOLAP: Less flexible due to predefined cube structure.
o HOLAP: Offers flexibility from both relational and multidimensional
perspectives.
o Data Cube: Flexibility depends on MOLAP or HOLAP usage.
7. Storage Requirements:
o ROLAP: Generally requires less storage space.
o MOLAP: Requires more storage for pre-aggregated cubes.
o HOLAP: Requires storage for both relational and multidimensional data.
o Data Cube: Storage depends on the size and complexity of the cube.
8. Ease of Implementation:
o ROLAP: Easier to implement with existing relational databases.
o MOLAP: Requires specialized OLAP cube technology and tools.
o HOLAP: Combines complexities of both ROLAP and MOLAP technologies.
o Data Cube: Implementation complexity depends on underlying OLAP system.
9. Historical Data Handling:
o ROLAP: Handles detailed historical data through relational databases.
o MOLAP: Depends on cube refresh frequency for historical data.
o HOLAP: Integrates historical records from relational databases with
aggregated data from cubes.
o Data Cube: Historical data handling depends on cube maintenance.
Similarities
1. Multidimensional Analysis:
o ROLAP: Yes, supports multidimensional analysis by allowing users to
analyze data across multiple dimensions (e.g., time, location, product).
o MOLAP: Yes, provides multidimensional analysis with pre-aggregated data
stored in cubes.
o HOLAP: Yes, supports multidimensional analysis with a combination of
relational and multidimensional data.
o Data Cube: Yes, organizes data in a multidimensional array, enabling
analysis across multiple dimensions.
2. Support for OLAP Operations:
o ROLAP: Yes, supports core OLAP operations such as slice, dice,
drill-down, roll-up, and pivot.
o MOLAP: Yes, supports OLAP operations using pre-computed cubes,
facilitating various analytical queries.
o HOLAP: Yes, supports OLAP operations through both relational queries and
multidimensional cube queries.
o Data Cube: Yes, facilitates OLAP operations by providing pre-aggregated
data for efficient querying.
3. Facilitate Decision-Making:
o ROLAP: Yes, designed to help users make informed business decisions by
providing insights from complex data sets.
o MOLAP: Yes, helps in decision-making with fast query responses and
pre-aggregated data.
o HOLAP: Yes, combines strengths of relational and multidimensional
storage to aid in decision-making.
o Data Cube: Yes, provides a structured format for analyzing data and
supporting decision-making.
4. Aggregation and Summarization:
o ROLAP: Yes, handles data aggregation and summarization by dynamically
generating views from relational databases.
o MOLAP: Yes, pre-computes and stores aggregated data in cubes,
simplifying summary queries.
o HOLAP: Yes, aggregates data from both relational databases and cubes to
provide summarized views.
o Data Cube: Yes, organizes and stores aggregated data in a
multidimensional format, simplifying summarization.
5. Data Integration:
o ROLAP: Yes, integrates data from various relational sources, presenting
a unified view for analysis.
o MOLAP: Yes, integrates data from various sources into multidimensional
cubes for analysis.
o HOLAP: Yes, integrates data from relational sources and
multidimensional cubes for a comprehensive view.
o Data Cube: Yes, integrates data from various sources into a
multidimensional structure for analysis.
6. User-Friendly Interfaces:
o ROLAP: Yes, typically offers user-friendly interfaces such as reporting
tools and dashboards for querying and analyzing data.
o MOLAP: Yes, provides intuitive interfaces for querying pre-aggregated
cubes and generating reports.
o HOLAP: Yes, offers interfaces that combine querying of relational data
and multidimensional cubes.
o Data Cube: Yes, provides user-friendly interfaces for querying and
exploring multidimensional data.
Data Cube
A data cube is a multidimensional array of data used in OLAP (Online Analytical
Processing) systems to facilitate complex analysis and reporting. It organizes
data in a way that allows users to view and analyze it from multiple dimensions
and hierarchies. Each axis of the cube represents a different dimension, and the
data within the cube can be aggregated along these dimensions.
Key Concepts of a Data Cube
1. Dimensions:
o Definition: Dimensions are perspectives or attributes used to view
and analyze data. They represent different facets of the data and
are typically used to categorize and filter the data.
o Types:
Temporal Dimension: Time-based attributes (e.g., Year,
Quarter, Month, Day).
Geographical Dimension: Location-based attributes (e.g.,
Country, State, City).
Product Dimension: Attributes related to products (e.g.,
Product Category, Brand).
Customer Dimension: Attributes related to customers (e.g.,
Customer ID, Customer Segment).
2. Measures:
o Definition: Measures are quantitative data points that are analyzed
across dimensions. They represent the values or metrics of interest.
o Examples:
Sales Revenue: The total revenue generated from sales.
Quantity Sold: The number of units sold.
Profit Margin: The share of revenue that remains as profit after costs
are subtracted.
3. Cells:
o Definition: Cells in the data cube store aggregated values for the
intersection of dimension values. Each cell represents a unique
combination of dimension values and contains a measure.
o Example: The cell at the intersection of “January,” “New York,” and
“Electronics” might contain the total sales revenue for that
combination.
4. Hierarchies:
o Definition: Hierarchies within dimensions represent different levels
of granularity. They allow users to drill down or roll up data within a
dimension.
o Example:
Time Dimension Hierarchy: Year > Quarter > Month > Day
Geographical Dimension Hierarchy: Country > State >
City
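The idea that each cell holds an aggregated measure for one combination of dimension values can be sketched as follows. This is an illustrative construction, not a production cube engine: the fact rows are hypothetical, and the `"ALL"` marker for a rolled-up dimension is a convention assumed here.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (month, city, category, revenue).
facts = [
    ("January", "New York", "Electronics", 500),
    ("January", "New York", "Clothing", 200),
    ("February", "New York", "Electronics", 300),
    ("January", "Boston", "Electronics", 100),
]

def build_cube(rows, n_dims=3):
    """Aggregate revenue into every cell, using "ALL" for a rolled-up dimension."""
    cells = defaultdict(int)
    for *dims, revenue in rows:
        # Each row contributes to 2**n_dims cells: keep or roll up each dimension.
        for mask in product([False, True], repeat=n_dims):
            key = tuple("ALL" if rolled else d for d, rolled in zip(dims, mask))
            cells[key] += revenue
    return dict(cells)

cube = build_cube(facts)
print(cube[("January", "New York", "Electronics")])
print(cube[("January", "ALL", "ALL")])
```

Storing the `"ALL"` cells up front is what makes later roll-up queries a lookup rather than a scan, at the cost of extra storage.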
2. Dice
Definition:
o The dice operation involves selecting multiple dimension values to
create a smaller, more focused sub-cube. This operation filters the
cube along multiple dimensions.
Example:
o Dicing a cube by selecting “Q1 2023,” “New York,” and
“Electronics” results in a sub-cube that contains sales data only for
the first quarter of 2023, for New York, and for Electronics.
Benefits:
o Provides a targeted view of the data by isolating specific values
across multiple dimensions, enabling more detailed and focused
analysis.
3. Drill-Down
Definition:
o The drill-down operation involves navigating from aggregated data
to more detailed data within a hierarchy. This operation allows
users to explore data at finer levels of granularity.
Example:
o Starting with annual sales data, drilling down might reveal monthly
sales figures, or even daily sales figures, providing more detailed
insights into performance.
Benefits:
o Enables users to gain deeper insights by examining data at lower
levels of granularity, helping to uncover trends and patterns that
might be obscured at higher levels.
4. Roll-Up
Definition:
o The roll-up operation involves aggregating data at higher levels in
the hierarchy. This operation summarizes detailed data into broader
categories.
Example:
o From detailed monthly sales data, rolling up might aggregate the
data to show quarterly or yearly totals, providing an overview of
performance over a longer period.
Benefits:
o Offers a high-level summary of data, making it easier to identify
overall trends and performance across longer time periods.
6. Drill-Across
Definition:
o The drill-across operation involves navigating across different data
cubes or fact tables that share common dimensions. This operation
enables users to correlate data across different analytical contexts.
Example:
o Comparing sales data with inventory levels by drilling across cubes
that both include the time dimension allows users to analyze how
inventory levels impact sales performance.
Benefits:
o Facilitates comprehensive analysis by integrating and comparing
data from multiple sources, providing a more complete view of
related information.
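A drill-across between a sales cube and an inventory cube can be sketched as aggregating each to the shared dimension and then lining the results up. The two fact lists and the field names `units_sold` and `units_in_stock` are hypothetical.

```python
from collections import defaultdict

# Two hypothetical fact tables that share the time dimension (month).
sales_facts = [("Jan", "Shirt", 40), ("Jan", "Pant", 25), ("Feb", "Shirt", 30)]
inventory_facts = [("Jan", "Shirt", 100), ("Jan", "Pant", 60), ("Feb", "Shirt", 70)]

def total_by_month(facts):
    """Aggregate one fact table to the shared dimension."""
    totals = defaultdict(int)
    for month, _item, quantity in facts:
        totals[month] += quantity
    return totals

def drill_across(sales, inventory):
    """Line up both aggregates on the months they have in common."""
    s, inv = total_by_month(sales), total_by_month(inventory)
    return {
        m: {"units_sold": s[m], "units_in_stock": inv[m]}
        for m in sorted(s.keys() & inv.keys())
    }

combined = drill_across(sales_facts, inventory_facts)
print(combined)
```

Aggregating each fact table separately before combining avoids double counting, which can happen if the detail rows of two fact tables are joined directly.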
7. Drill-Through
Definition:
o The drill-through operation allows users to access detailed
transaction-level data underlying the aggregated information in the
cube. This operation provides transparency and deeper insights.
Example:
o From aggregated sales data, drill-through might access individual
sales transactions to investigate specific sales patterns, customer
behavior, or product performance.
Benefits:
o Provides access to raw data for verification and detailed analysis,
enabling users to examine the specifics behind aggregated metrics
and uncover more granular insights.
Data Mining:
In this process, data is extracted and analyzed to obtain useful information.
Data mining discovers hidden patterns in a dataset in order to predict future
behavior, and it is used to identify relationships within the data.
Data mining uses statistics, artificial intelligence, machine learning, and
database systems to find hidden patterns in the data. It helps answer
business-related queries that would otherwise be time-consuming to resolve.
Advantages
1. Insightful Analysis:
o Provides valuable insights into data that are not immediately
obvious, helping organizations make informed decisions.
2. Predictive Power:
o Enables forecasting of future trends and behaviors based on
historical data, aiding in strategic planning.
3. Pattern Recognition:
o Identifies hidden patterns and relationships in data that can lead to
new opportunities or insights.
4. Improved Efficiency:
o Optimizes processes and resources by uncovering inefficiencies and
areas for improvement.
5. Competitive Advantage:
o Helps organizations gain a competitive edge by leveraging data-
driven insights to innovate and respond to market changes.
Disadvantages
1. Complexity:
o Data mining can be complex and requires specialized skills and
tools to effectively analyze and interpret data.
2. Data Quality Issues:
o The accuracy of insights depends on the quality of the data; poor-
quality data can lead to misleading results.
3. Privacy Concerns:
o Handling sensitive or personal data raises privacy and ethical
concerns, necessitating proper data protection measures.
4. High Costs:
o Implementing data mining solutions can be expensive due to the
need for advanced tools, infrastructure, and expertise.
5. Overfitting:
o Models that are too complex may fit the training data too closely
and perform poorly on new data, leading to inaccurate predictions.
6. Data Volume:
o Large volumes of data can be challenging to process and analyze,
requiring significant computational resources.
Applications
1. Customer Relationship Management (CRM):
o Analyzing customer behavior to segment markets, personalize
marketing efforts, and improve customer satisfaction.
2. Fraud Detection:
o Identifying fraudulent transactions and activities by detecting
anomalies and unusual patterns in financial data.
3. Market Basket Analysis:
o Understanding customer purchasing behavior and discovering
associations between products to optimize inventory and
promotions.
4. Healthcare:
o Predicting patient outcomes, identifying disease patterns, and
improving treatment plans through analysis of medical data.
5. Financial Services:
o Assessing credit risk, detecting financial fraud, and analyzing
investment opportunities.
6. Manufacturing:
o Enhancing quality control, optimizing supply chains, and predicting
equipment failures by analyzing production data.
7. Telecommunications:
o Improving customer retention, detecting network faults, and
optimizing service plans through data analysis.
3. Minkowski Distance
Description: A generalization of both Euclidean and Manhattan distances,
Minkowski distance is parameterized by a variable p. When p = 1, it
corresponds to Manhattan distance, and when p = 2, it corresponds to
Euclidean distance.
Formula: $\text{Distance}(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Characteristics:
o Flexible due to the parameter p.
o Provides a range of distance measures depending on the value of p.
Applications: Used in various machine learning algorithms that require
distance metrics, including generalizations of k-NN and clustering
methods.
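The Minkowski distance can be written directly from its definition. This is a minimal sketch; the function name and the input checks are choices made here.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length numeric sequences (p >= 1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0, 0), (3, 4)
manhattan = minkowski_distance(x, y, p=1)  # |3| + |4|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3**2 + 4**2)
print(manhattan, euclidean)
```

For the same pair of points, larger values of p give smaller or equal distances, since the largest coordinate difference increasingly dominates the sum.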
4. Cosine Similarity
Description: Cosine similarity measures the cosine of the angle between
two vectors. It is particularly useful for text data and high-dimensional
sparse data.
Formula: $\text{Similarity}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$
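Cosine similarity, the dot product of two vectors divided by the product of their lengths, can be sketched as follows. The term-count vectors are hypothetical, and raising an error for a zero vector is a choice made here, since the measure is undefined in that case.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical term-count vectors for three short documents.
doc_a = [1, 2, 0]
doc_b = [2, 4, 0]  # same direction as doc_a, twice the length
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```

Because only the angle matters, `doc_a` and `doc_b` are treated as maximally similar even though their magnitudes differ, which is why the measure suits text of varying length.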
5. Jaccard Similarity
Description: Jaccard similarity measures the similarity between two sets
by comparing the size of their intersection to the size of their union.
Formula: $\text{Similarity}(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Characteristics:
o Suitable for binary and categorical data.
o Range between 0 and 1, where 1 indicates complete similarity.
Applications: Used in clustering, classification, and comparing binary
attributes or sets.
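Jaccard similarity maps naturally onto Python sets. In this sketch the basket contents are hypothetical, and treating two empty sets as fully similar is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"bread", "milk", "butter"}
similarity = jaccard_similarity(basket_1, basket_2)  # 2 shared items out of 4 distinct
print(similarity)
```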
8. Mahalanobis Distance
Description: Mahalanobis distance measures the distance between a
point and a distribution. It accounts for correlations between variables and
is useful for identifying outliers.
Formula: $\text{Distance}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})}$,
where $\mathbf{S}$ is the covariance matrix of the distribution.
Characteristics:
o Takes into account the correlation between variables.
o Useful for multivariate data.
Applications: Outlier detection, multivariate anomaly detection, and
classification.
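For two variables, the Mahalanobis distance can be computed by hand with the closed-form inverse of a 2x2 matrix. This sketch is restricted to the 2-D case to stay dependency-free; the covariance matrices below are hypothetical, and a general implementation would use a linear algebra library.

```python
import math

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]].
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T S^{-1} (x - y).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))  # variance 4 along the first axis

print(mahalanobis_2d((3, 4), (0, 0), identity))
print(mahalanobis_2d((2, 0), (0, 0), stretched))
```

With the identity covariance the result equals the Euclidean distance; with larger variance along an axis, the same offset along that axis counts for less, which is what makes the measure useful for spotting outliers in correlated data.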