L4 Data Preprocessing

Industries: Oil and Gas, Petrochemical, Pulp and Paper, Water and Wastewater, Metal Industries

Applications: Data → Data preprocessing → ML model development → AI algorithm development → Optimizer → DSS (with Modelling & Simulation and APC)

Dr. Senthilmurugan Subbiah, Department of Chemical Engineering, IITG.

Data preprocessing
Data visualization, outlier detection, smoothing techniques, & data scaling

Data preprocessing

• Data visualization (line plot, scatter plot, histogram)


• Outlier detection (z-score, IQR)
• Smoothing techniques (moving average, exponential average, SG filter)
• Data scaling (Standardization, Normalization)



Data visualization
Choosing a chart for the purpose

Purpose       | Chart type
Relationship  | scatter plot, bubble plot
Comparisons   | line plot, bar chart
Compositions  | pie chart, stacked bar chart, stacked area chart
Distribution  | histogram, box plot
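A minimal matplotlib sketch of the three basic chart types used in this lecture (a hedged illustration: the data here is synthetic, loosely mimicking the hookload/bit-depth example on the next slides, not the real ONGC dataset):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic process data (illustrative only, not the ONGC dataset)
rng = np.random.default_rng(0)
t = np.arange(200)                                # time index
depth = np.cumsum(rng.uniform(0, 1, 200))         # increasing "bit depth"
load = 50 + 0.1 * depth + rng.normal(0, 2, 200)   # noisy "hookload"

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(t, load)                 # line plot: comparison over time
axes[0].set(title="Line plot", xlabel="time", ylabel="hookload")

axes[1].scatter(depth, load, s=8)     # scatter plot: relationship between features
axes[1].set(title="Scatter plot", xlabel="bit depth", ylabel="hookload")

axes[2].hist(load, bins=20)           # histogram: distribution of one feature
axes[2].set(title="Histogram", xlabel="hookload", ylabel="count")

plt.tight_layout()
plt.show()
```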


Drilling rig components

1. Mud tank, 2. Shale shakers, 3. Suction line, 4. Mud pump, 5. Motor or power source, 6. Vibrating hose, 7. Draw-works, 8. Standpipe, 9. Kelly hose, 10. Goose-neck, 11. Travelling block, 12. Drill line, 13. Crown block, 14. Derrick, 15. Monkey board, 16. Stand of drill pipe, 17. Pipe rack floor, 18. Swivel, 19. Kelly drive, 20. Rotary table, 21. Drill floor, 22. Bell nipple, 23. Blowout preventers, 24. Drill string, 25. Drill bit, 26. Casing head, 27. Flow line

https://dtetechnology.wordpress.com/2014/05/04/components-of-a-land-based-rotary-drilling-platform
Details of the data used in this plot

Process: We use oil-drilling data from ONGC and analyse two parameters, "Hookload" and "Bit depth".

Data source: online sensors on an ONGC oil rig.

Hookload refers to the tension or weight carried by the drill string, the series of pipes and other equipment suspended from the rig. It is measured in units of weight, such as pounds or kilonewtons. Hookload is used to control the weight on the bit (the cutting tool at the bottom of the drill string), to monitor the progress of the drilling operation, and to detect potential problems.

Bit depth, on the other hand, is the distance from the rig's surface to the current position of the bit. It is measured in units of length, such as feet or meters, and is used to track drilling progress and to decide when to add or remove drilling equipment.

Together, hookload and bit depth provide valuable information about the drilling process: the weight on the bit, the progress of the operation, and the presence of potential problems. This information is used to optimize the operation, to decide when to add or remove equipment, and to ensure that drilling proceeds safely and efficiently, by avoiding over-stressing the drill string and by drilling to the desired depth.


Finding relationships between variables/features
Scatter plot, bubble plot


Comparison between features
Line Plot



Details of the data used in these plots

Process: For the comparison plots we use Air Quality Data containing parameters such as the concentrations of NOx, CO, and NO2, relative humidity, etc. This is time-series data.

Bar charts are used to compare the sizes of different categories of data and to show how the data is distributed across categories. They are useful for comparing data across multiple categories and for identifying trends or patterns in the data. The bar chart on the following slide plots time on the x-axis and CO concentration on the y-axis.

For the pie chart, the concentration of NOx over time is represented. For the stacked bar chart, the concentrations of CO, NMHC, and O3 are plotted against time. For the stacked area chart, the concentrations of NO2(GT), PT08.S4 (NO2), and PT08.S5 (O3) are plotted against time.

Attribute information:
1. Date: (DD/MM/YYYY)
2. Time: (HH.MM.SS)
3. CO(GT): true hourly averaged CO concentration in mg/m^3 (reference analyzer)
4. PT08.S1 (tin oxide): hourly averaged sensor response (nominally CO targeted)
5. NMHC(GT): true hourly averaged overall Non-Methanic HydroCarbons (NMHC) concentration in microg/m^3 (reference analyzer)
6. C6H6(GT): true hourly averaged benzene concentration in microg/m^3 (reference analyzer)
7. PT08.S2 (titania): hourly averaged sensor response (nominally NMHC targeted)
8. NOx(GT): true hourly averaged NOx concentration in ppb (reference analyzer)
9. PT08.S3 (tungsten oxide): hourly averaged sensor response (nominally NOx targeted)
10. NO2(GT): true hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
11. PT08.S4 (tungsten oxide): hourly averaged sensor response (nominally NO2 targeted)
12. PT08.S5 (indium oxide): hourly averaged sensor response (nominally O3 targeted)
13. T: temperature in °C
14. RH: relative humidity (%)
15. AH: absolute humidity

Data source: The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multi-sensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year).
Comparison between features
Bar chart: concentration of CO vs time (hrs)
Compositions
Pie chart and Stacked bar chart



Compositions
Stacked area chart



Distribution
Histogram and Box plot
Concentration of CO



Outlier Removal

Purpose: Outlier removal focuses on identifying and eliminating data points that
significantly deviate from the rest of the data distribution. These points can be due to
noise, measurement errors, or genuine but rare events.
Process: This involves detecting outliers using various statistical methods (like Z-score,
IQR) and then deciding whether to remove or adjust these points based on criteria specific
to the dataset or research question.
Impact on Data:
Data Integrity: Removing outliers can help to prevent skewing of statistical analyses and
machine learning model performance. However, it may also lead to loss of valuable
information, especially if the outliers represent important phenomena.
Statistical Analysis: Enhances the accuracy of descriptive statistics (mean, median,
standard deviation) and model predictions by eliminating potential sources of significant
error.



Outlier detection

• Outliers are values that seem excessively different from most of the rest of the values in
the given dataset.
• Sources of Outliers
• new inventions (true outliers),
• development of new patterns/phenomena,
• experimental errors,
• rarely occurring incidents,
• anomalies,
• incorrectly fed data due to typographical mistakes,
• failure of data-recording systems or components such as sensors and cables.
• However, not all outliers are bad; some reveal new information. Inliers are all the data
points that belong to the distribution, excluding outliers.



Methods for outlier detection
Z-score

z = (x − μ) / σ

▪ x = particular value
▪ μ = mean
▪ σ = standard deviation

Z-scores help identify outliers: a data point is flagged if its Z-score is less than −3 or greater than +3.


Methods for outlier detection
Z-score example: red wine production data (Z-score computed for the pH column)
▪ μ = mean = 3.3111
▪ σ = standard deviation = 0.1544

Sr. No. | Acidity | Citric acid | Sulfur dioxide | Density | pH   | Sulphates | Alcohol | Quality | Z-score
1       | 7.4     | 0           | 34             | 0.9978  | 3.51 | 0.56      | 9.4     | 5       | 1.28824
2       | 7.8     | 0           | 67             | 0.9968  | 3.2  | 0.68      | 9.8     | 5       | -0.71971
3       | 7.8     | 0.04        | 54             | 0.997   | 3.26 | 0.65      | 9.8     | 5       | -0.33107
4       | 11.2    | 0.56        | 60             | 0.998   | 3.16 | 0.58      | 9.8     | 6       | -0.9788
5       | 7.4     | 0           | 34             | 0.9978  | 3.51 | 0.56      | 9.4     | 5       | 1.28824
6       | 7.4     | 0           | 40             | 0.9978  | 3.51 | 0.56      | 9.4     | 5       | 1.28824
...     | ...     | ...         | ...            | ...     | ...  | ...       | ...     | ...     | ...
1599    | 6       | 0.47        | 42             | 0.99549 | 3.39 | 0.66      | 11      | 6       | 0.51096

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
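A short numpy sketch of Z-score outlier detection (a hedged illustration: it uses only the few pH rows shown above, so its mean and standard deviation differ from the full-dataset values μ = 3.3111 and σ = 0.1544):

```python
import numpy as np

# pH values from the first few rows of the red wine dataset shown above
ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.39])

mu = ph.mean()          # sample mean
sigma = ph.std(ddof=0)  # standard deviation (population form)

z = (ph - mu) / sigma   # Z-score of every point

# Flag points whose Z-score falls outside [-3, +3]
outliers = ph[np.abs(z) > 3]
print("Z-scores:", np.round(z, 3))
print("Outliers:", outliers)   # empty here: no pH value deviates that far
```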
Pros and cons of the Z-score method

Assumption of normality: while the Z-score method does not require the data to be
normally distributed, its effectiveness can vary with the distribution of the data. For
distributions far from normal, other techniques might be more appropriate.
Outlier sensitivity: since the mean and standard deviation are themselves influenced by
outliers, Z-scores can be distorted by extreme values. In such cases, exploring the
data and possibly addressing extreme values beforehand may be advisable.


Methods for outlier detection
IQR (interquartile range)

IQR = Q3 − Q1

▪ Q1 = 25th percentile, Q3 = 75th percentile
▪ "Minimum" fence = Q1 − 1.5 × IQR
▪ "Maximum" fence = Q3 + 1.5 × IQR
▪ Points below the minimum fence or above the maximum fence are treated as outliers.

Sr. No. | pH
1       | 3.51
2       | 3.2
3       | 3.26
4       | 3.16
5       | 3.51
6       | 3.51
...     | ...
1599    | 3.39

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
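A minimal numpy sketch of the IQR fences (again computed on the small pH sample above rather than the full 1599 rows):

```python
import numpy as np

ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.39])

q1, q3 = np.percentile(ph, [25, 75])  # 25th and 75th percentiles
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # "minimum" fence
upper = q3 + 1.5 * iqr  # "maximum" fence

mask = (ph < lower) | (ph > upper)
print(f"Q1={q1:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print("Outliers:", ph[mask])
print("Inliers:", ph[~mask])
```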
Pros and cons of IQR

Advantages of Using IQR for Outlier Detection


• Robustness: The IQR is less affected by outliers or non-normal distribution of data
compared to methods relying on mean and standard deviation, making it more reliable
for outlier detection in real-world, skewed datasets.
• Simplicity and Versatility: It's straightforward to calculate and can be applied to any
dataset, regardless of its distribution.
• Useful for Skewed Data: Particularly beneficial for datasets that are not symmetric or
have skewed distributions.
Limitations
• Potential for Over/Under Detection: The choice of 1.5 times the IQR as a cutoff for
outlier detection is somewhat arbitrary and may not be suitable for all datasets.
Adjusting this multiplier can change the sensitivity of outlier detection.
• Does Not Account for Distribution Shape: While robust to outliers, this method does not
take into account the overall shape of the data distribution, potentially leading to the
misclassification of outliers in distributions that are inherently long-tailed.
Smoothing Techniques

Purpose: Smoothing techniques are used to reduce noise and fluctuations in data to reveal underlying
trends or patterns. Unlike outlier removal, smoothing is about filtering the data rather than removing
any data points.

Process: This involves applying mathematical filters or functions to the data, such as moving
averages, exponential smoothing, or more complex algorithms like SG filter. The choice of technique
depends on the data characteristics and the analysis goals.

Impact on Data:

Data Trends: Smoothing helps in highlighting trends and patterns by averaging out short-term
fluctuations. This is particularly useful in time series analysis or signal processing.

Preservation of Data Points: All original data points are retained in the dataset, albeit modified by the
smoothing process, thus maintaining the integrity and structure of the data.
Smoothing techniques to remove outliers
Moving average

▪ Construct a 3-period (window size = 3) centered moving average:

ȳ₂ = (y₁ + y₂ + y₃) / 3 = (3.51 + 3.2 + 3.26) / 3 = 3.32

Sr. No. | pH (y) | ȳ
1       | 3.51   | -
2       | 3.2    | 3.32
3       | 3.26   | 3.21
4       | 3.16   | 3.31
5       | 3.51   | 3.39
6       | 3.51   | 3.44
...     | ...    | ...
...     | ...    | 3.46
1599    | 3.39   | -

▪ The choice of window size is arbitrary.

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
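A short pandas sketch of the centered 3-point moving average (a hedged example using the first six pH values above; it reproduces the tabulated smoothed values):

```python
import pandas as pd

ph = pd.Series([3.51, 3.20, 3.26, 3.16, 3.51, 3.51])

# Centered moving average with window size 3:
# the first and last points have no complete window and stay NaN.
smoothed = ph.rolling(window=3, center=True).mean()
print(smoothed.round(2).tolist())
# [nan, 3.32, 3.21, 3.31, 3.39, nan]
```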
Pros and cons of Moving average technique

Pros:
• Simplicity: Easy to understand and implement.
• Effectiveness: Efficient at reducing random noise while keeping the signal intact.
• Customizable Window: The window size can be adjusted to control the smoothing level.

Cons:
• Lag: Introduces a lag, especially noticeable with larger window sizes, which can be
problematic for real-time analysis.
• Edge Artifacts: The beginning and end of the series might have distortions since fewer
data points are used in the calculation.
• Uniform Weighting: Treats all points within the window equally, which might not be ideal
for all applications.



Smoothing techniques to remove outliers
Exponential average

ŷ(t+1) = β·y(t) + (1 − β)·ŷ(t)  ⇒  ŷ(t+1) = ŷ(t) + β·(y(t) − ŷ(t))

▪ Assume ŷ₁ = y₁ and β = 0.9 (smoothing factor), where 0 < β < 1.

Sr. No. | pH (y) | Formula                            | ŷ
1       | 3.51   | ŷ₁ = y₁ (assumed)                  | 3.51
2       | 3.2    | ŷ₂ = ŷ₁ + β(y₁ − ŷ₁)               | 3.51
3       | 3.26   | ŷ₃ = ŷ₂ + β(y₂ − ŷ₂)               | 3.23
4       | 3.16   | ŷ₄ = ŷ₃ + β(y₃ − ŷ₃)               | 3.26
5       | 3.51   | ŷ₅ = ŷ₄ + β(y₄ − ŷ₄)               | 3.17
6       | 3.51   | ŷ₆ = ŷ₅ + β(y₅ − ŷ₅)               | 3.48
...     | ...    | ...                                | ...
1599    | 3.39   | ŷ₁₅₉₉ = ŷ₁₅₉₈ + β(y₁₅₉₈ − ŷ₁₅₉₈)   | 3.56

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Smoothing techniques to remove outliers
Exponential average

From the equation we can say that:

• β near 0 ⇒ more weight is given to historic values
• β near 1 ⇒ more weight is given to recent values

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
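A minimal Python sketch of the recursion ŷ(t+1) = ŷ(t) + β·(y(t) − ŷ(t)) (a hedged illustration; note that β = 0.9 is what reproduces the tabulated values above):

```python
def exponential_average(y, beta):
    """Simple exponential smoothing: y_hat[0] = y[0],
    y_hat[t+1] = y_hat[t] + beta * (y[t] - y_hat[t])."""
    y_hat = [y[0]]                 # assume the first smoothed value equals y1
    for value in y[:-1]:
        prev = y_hat[-1]
        y_hat.append(prev + beta * (value - prev))
    return y_hat

ph = [3.51, 3.20, 3.26, 3.16, 3.51, 3.51]
print([round(v, 2) for v in exponential_average(ph, beta=0.9)])
# [3.51, 3.51, 3.23, 3.26, 3.17, 3.48]  -- matches the table above
```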
Pros and cons for Exponential average

Pros:
• Responsiveness: More responsive to recent changes in the data compared to MA.
• Flexible Weighting: Allows for finer control over the smoothing process through the
smoothing constant.
• Suitable for Trends: Can adapt to data with trends and seasonality through variations
like Double Exponential Smoothing and Triple Exponential Smoothing (Holt-Winters
method).

Cons:
• More Complex: Slightly more complex to understand and implement than MA.
• Parameter Sensitivity: The choice of smoothing constant significantly affects the results,
requiring careful tuning.
• Initial Value Sensitivity: The initial smoothed value can impact the smoothing quality,
especially in short series.



Smoothing techniques to remove outliers
Savitzky–Golay (SG) filter

The Savitzky–Golay (SG) filter is a digital filtering technique used to smooth data,
reduce noise, and enhance signal fidelity without greatly distorting the signal.
Unlike simple moving averages or exponential smoothing, which apply uniform or
exponentially decreasing weights to observations within a window, the SG filter
fits a polynomial to the observations within the window.

ŷ_j = Σ_{i = (1−m)/2}^{(m−1)/2} C_i · y_{j+i}

▪ n = number of data points
▪ m = number of points considered in the window
▪ C_i = convolution coefficients
How the SG Filter Works

Selection of Window Size and Polynomial Order:


• Window Size: The number of data points in each subset over which the polynomial is fitted. It's typically
chosen as an odd number so that the window is symmetric about the point being smoothed.
• Polynomial Order: The degree of the polynomial used to fit the data points within the window. A higher-
order polynomial can fit more complex data patterns but may also fit noise rather than the underlying signal.
Polynomial Fitting:
• For each point in the dataset that needs to be smoothed, a low-degree polynomial is fitted to the data
points within its window.
• This fitting is done using the method of least squares, which minimizes the sum of the squares of the
differences between the observed values and the values provided by the polynomial function.
Computation:
• The value of the fitted polynomial at the central point of the window is used as the new smoothed value for
that data point.
Moving the Window:
• The window is then moved one data point forward, and the process is repeated until all points (or the
desired points) in the data series have been smoothed.

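The procedure above can be sketched directly with numpy's least-squares polynomial fit. This is a hedged from-scratch illustration (the function name, sample data, and NaN edge handling are our own choices), not a production implementation:

```python
import numpy as np

def savitzky_golay(y, window, order):
    """Smooth y by fitting a least-squares polynomial of the given
    order inside each (odd-sized) window and evaluating it at the
    window's central point. Edge points are left unsmoothed (NaN)."""
    assert window % 2 == 1 and order < window
    half = window // 2
    x = np.arange(-half, half + 1)   # symmetric abscissa about the center
    out = np.full(len(y), np.nan)
    for j in range(half, len(y) - half):
        # Least-squares polynomial fit over the current window
        coeffs = np.polyfit(x, y[j - half:j + half + 1], order)
        out[j] = np.polyval(coeffs, 0)  # fitted polynomial at the center point
    return out

ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.39])
print(np.round(savitzky_golay(ph, window=5, order=3), 3))
```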


Smoothing techniques to remove outliers
Savitzky–Golay (SG) filter

▪ Savitzky and Golay computed the convolution coefficients for common configurations and tabulated them.
▪ For window size = 5 and a cubic polynomial (order M = 3), we get the following smoothed value:

ŷ_j = a₀ = (1/35)(−3y_{j−2} + 12y_{j−1} + 17y_j + 12y_{j+1} − 3y_{j+2})

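These tabulated coefficients can be cross-checked with scipy, which provides both the convolution coefficients and a ready-made filter (a short hedged sketch; the pH sample is illustrative):

```python
import numpy as np
from scipy.signal import savgol_coeffs, savgol_filter

# Convolution coefficients for window size 5, polynomial order 3:
# scaled by 35 they are approximately (-3, 12, 17, 12, -3), as tabulated above.
print(savgol_coeffs(5, 3) * 35)

# Applying the ready-made filter to a short pH series:
ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.39])
print(np.round(savgol_filter(ph, window_length=5, polyorder=3), 3))
```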


Smoothing techniques to remove outliers
SG filter

Smoothing by a 5-point quadratic (equivalently cubic) polynomial:

ŷ_j = (1/35)(−3y_{j−2} + 12y_{j−1} + 17y_j + 12y_{j+1} − 3y_{j+2})

▪ With n = 1599 data points, smoothed values exist for 3 ≤ j ≤ 1597; the first two and last two points have no complete window.

Sr. No. | pH (y) | Formula                                                          | ŷ
1       | 3.51   | --                                                               | --
2       | 3.2    | --                                                               | --
3       | 3.26   | ŷ₃ = (1/35)(−3×3.51 + 12×3.2 + 17×3.26 + 12×3.16 − 3×3.51)       | 3.30
4       | 3.16   | ŷ₄ = (1/35)(−3y₂ + 12y₃ + 17y₄ + 12y₅ − 3y₆)                     | 3.39
5       | 3.51   | ŷ₅ = (1/35)(−3y₃ + 12y₄ + 17y₅ + 12y₆ − 3y₇)                     | 3.42
...     | ...    | ...                                                              | ...
1597    | 3.42   | ŷ₁₅₉₇ = (1/35)(−3y₁₅₉₅ + 12y₁₅₉₆ + 17y₁₅₉₇ + 12y₁₅₉₈ − 3y₁₅₉₉)   | 3.49
1598    | 3.57   | --                                                               | --
1599    | 3.39   | --                                                               | --


Smoothing techniques to remove outliers
SG filter (plot of raw vs. smoothed pH values)


Pros and Cons of SG filter

Pros:
• Preservation of Signal Features: Better at preserving the shape of the dataset's peaks
and valleys compared to MA and ES.
• Reduction of Noise: Effective at reducing noise while maintaining the signal's structural
integrity.
• Differentiation Capability: Can be used to calculate the derivative of a signal while
smoothing.
Cons:
• Complexity: More complex to understand and implement compared to MA and ES.
• Polynomial Order Selection: Requires choosing an appropriate polynomial order, which
can affect smoothing quality.
• Edge Effects: Like MA, it can suffer from edge effects, although less so than MA due to
polynomial fitting.
Summary

• Moving Average is best for simple, quick smoothing with easily adjustable
smoothing levels but suffers from lag and edge artifacts.
• Exponential Smoothing offers flexibility and responsiveness, particularly suitable for
data with trends or seasonality, but requires careful tuning of parameters.
• Savitzky-Golay Filter excels in preserving signal features while smoothing, making
it ideal for analytical applications where maintaining signal integrity is crucial, but it
is more complex and sensitive to the choice of polynomial order.



Data scaling

Data scaling is a preprocessing step used in data analysis, machine learning, and
statistics to standardize the range of features or data points.
The primary goal of data scaling is to normalize the scale of the data without distorting
differences in the ranges of values or losing information.
Scaling is essential when the data set includes attributes of varying scales and units,
as these differences can negatively impact the performance of many machine learning
algorithms.


Data scaling

Advantages of data scaling
• Improved algorithm performance: many algorithms perform better or converge faster when the features are on a relatively similar scale.
• Consistency in units: scaling removes the units, allowing for a more direct comparison and combination of different variables.
• Enhanced gradient descent: in algorithms that use gradient descent as an optimization technique, scaling can help avoid issues with the learning rate and improve the convergence speed.

Considerations
• Choice of method: the choice of scaling method depends on the dataset and the specific requirements of the application, including the presence of outliers and the nature of the algorithms being used.
• Application scope: scaling should be applied consistently across training and test datasets to avoid introducing systematic biases.
• Data distribution: some scaling methods assume a specific distribution (e.g., standardization assumes a normal distribution), which might not be appropriate for all datasets.
Data scaling

▪ Data scaling is an important step prior to training machine learning models, ensuring that variables are on the same scale.

Sr. No. | Acidity | pH
1       | 7.4     | 3.51
2       | 7.8     | 3.2
3       | 7.8     | 3.26
4       | 11.2    | 3.16
5       | 7.4     | 3.51
6       | 7.4     | 3.51
...     | ...     | ...
1599    | 6       | 3.39

(Quick-stats of Acidity and pH shown in the original slide.)

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Data scaling
Normalization: conducted to make variable values range from 0 to 1

x′ = (x − min(x)) / (max(x) − min(x))

Sr. No. | Acidity | pH
1       | 7.4     | 3.51
2       | 7.8     | 3.2
3       | 7.8     | 3.26
4       | 11.2    | 3.16
5       | 7.4     | 3.51
6       | 7.4     | 3.51
...     | ...     | ...
1599    | 6       | 3.39

(Quick-stats of Acidity and pH before and after normalization shown in the original slide.)

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
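A minimal numpy sketch of min-max normalization (a hedged example on the sample rows above; the same transform is available as sklearn.preprocessing.MinMaxScaler):

```python
import numpy as np

acidity = np.array([7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 6.0])
ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.39])

def min_max(x):
    """Rescale x linearly so that min(x) -> 0 and max(x) -> 1."""
    return (x - x.min()) / (x.max() - x.min())

print(np.round(min_max(acidity), 3))  # acidity mapped into [0, 1]
print(np.round(min_max(ph), 3))       # pH mapped into [0, 1]
```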
Pros and Cons of Min-Max Scaling

Pros:
• Fixed Range: Scales the features to a specific range (0 to 1 or any other specified
range), which can be useful when we need to maintain a positive scale or a specific
scaling range.
• Preservation of Shape: Maintains the distribution shape of the feature, which can be
beneficial for visualization.
• Simple Interpretation: Easy to understand and implement, and the scaled data retains
interpretability in terms of the original range.

Cons:
• Sensitivity to Outliers: Extreme values (outliers) can skew the scaling, compressing the
majority of the data into a small range.
• Not Centered: Does not center the data around a mean of 0; the scaled data might still
have a bias away from 0, which can affect some algorithms.



Data scaling
Standardization: transforms the data to have a mean of zero and a standard deviation of 1

z = (x − μ) / σ

Sr. No. | Acidity | pH
1       | 7.4     | 3.51
2       | 7.8     | 3.2
3       | 7.8     | 3.26
4       | 11.2    | 3.16
5       | 7.4     | 3.51
6       | 7.4     | 3.51
...     | ...     | ...
1599    | 6       | 3.39

(Quick-stats of Acidity and pH before and after standardization shown in the original slide.)

Data source: P. Cortez et al., Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009.
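The corresponding numpy sketch for standardization (hedged; equivalent to sklearn.preprocessing.StandardScaler):

```python
import numpy as np

acidity = np.array([7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 6.0])

def standardize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

z = standardize(acidity)
print(np.round(z, 3))
print("mean:", round(z.mean(), 6), "std:", round(z.std(), 6))  # ~0 and 1 by construction
```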
Pros and Cons of Standardization

Pros:
• Handling Differing Scales: Very effective in cases where the data encompasses attributes with
different scales, enhancing the performance of algorithms sensitive to variance in scales.
• Less Sensitive to Outliers: Compared to Min-Max Scaling, it is less influenced by outliers since it
uses the mean and standard deviation, which, while still affected by outliers, are more robust than
the min and max values.
• Suitability for Algorithms: Particularly beneficial for algorithms that assume data is centered
around 0 and has variance on the same scale.

Cons:
• No Fixed Range: The transformed data does not have a specific bounded range, which might be a
requirement for certain algorithms or applications.
• Assumption of Normality: Works best if the original data has a Gaussian (normal) distribution,
although it can still be applied to data without it.

