L4 Data Preprocessing
[Workflow diagram: Data → Data preprocessing → ML model development → AI algorithm development → Optimizer → DSS, with applications such as Modelling & Simulation and APC. Data preprocessing covers data visualization, outlier detection, smoothing techniques, and data scaling.]
In our case, we use oil-drilling data from ONGC (data source: online sensors on an ONGC oil rig), where we analyse the parameters "Hookload" and "Bit depth".

Hookload refers to the tension or weight applied to the drill string, the series of pipes and other equipment that make up the drilling rig. It is measured in units of weight, such as pounds or kilonewtons. The hookload is used to control the weight on the bit, the cutting or drilling tool at the bottom of the drill string, and to monitor the progress of the drilling operation and detect any potential problems.
Bit depth, on the other hand, refers to the distance from the drilling rig's surface to the point where
the bit is located. It is measured in units of length, such as feet or meters. The bit depth is used to
track the progress of the drilling operation and to determine when to add or remove drilling
equipment.
Together, hookload and bit depth provide valuable information about the drilling process, including
the weight on the bit, the progress of the drilling operation, and the presence of any potential
problems. This information is used to optimize the drilling operation and to make decisions about
when to add or remove drilling equipment. It is also used to make sure that the drilling is done safely
and efficiently, by avoiding over-stressing the drill string, and by ensuring that the drilling is done at
the desired depth.
Process: For comparison plots we use Air Quality Data containing parameters such as the concentrations of NOx, CO and NO2, relative humidity, etc. This is time-series data.

Bar charts are used to compare the sizes of different categories of data and to show how the data is distributed across categories. They are useful for comparing data across multiple categories and for identifying trends or patterns in the data.

The bar chart in the following slide is created by plotting time on the x-axis and the concentration of CO on the y-axis. For the pie chart, the concentration of NOx over time is represented. For the stacked bar chart, the concentrations of CO, NMHC and O3 are plotted against time. For the stacked area chart, the concentrations of NO2(GT), PT08.S4(NO2) and PT08.S5(O3) are plotted against time.

Attribute Information:
1. Date (DD/MM/YYYY)
2. Time (HH.MM.SS)
3. CO(GT): true hourly averaged concentration of carbon monoxide in mg/m^3 (reference analyzer)
4. PT08.S1 (tin oxide): hourly averaged sensor response (nominally CO targeted)
5. NMHC(GT): true hourly averaged overall non-methanic hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
6. C6H6(GT): true hourly averaged benzene concentration in microg/m^3 (reference analyzer)
7. PT08.S2 (titania): hourly averaged sensor response (nominally NMHC targeted)
8. NOx(GT): true hourly averaged NOx concentration in ppb (reference analyzer)
9. PT08.S3 (tungsten oxide): hourly averaged sensor response (nominally NOx targeted)
10. NO2(GT): true hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
11. PT08.S4 (tungsten oxide): hourly averaged sensor response (nominally NO2 targeted)
12. PT08.S5 (indium oxide): hourly averaged sensor response (nominally O3 targeted)
13. T: temperature in °C
14. RH: relative humidity (%)
15. AH: absolute humidity
Data source: The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multi-sensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year).
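As a rough illustration of these comparison plots, the sketch below assumes the Air Quality data has already been loaded into a pandas DataFrame `df` indexed by timestamp, with column names such as 'CO(GT)', 'NMHC(GT)', 'NO2(GT)', 'PT08.S4(NO2)' and 'PT08.S5(O3)' (the exact names depend on how the CSV is read, so treat them as assumptions).

```python
import pandas as pd
import matplotlib.pyplot as plt

def comparison_plots(df: pd.DataFrame, hours: int = 24) -> None:
    """Draw the bar, stacked-bar and stacked-area charts for the first `hours` rows."""
    window = df.head(hours)

    # Bar chart: CO concentration vs time
    window['CO(GT)'].plot(kind='bar', title='Concentration of CO vs time')
    plt.ylabel('CO (mg/m^3)')
    plt.show()

    # Stacked bar chart: CO, NMHC and the O3-targeted sensor response vs time
    window[['CO(GT)', 'NMHC(GT)', 'PT08.S5(O3)']].plot(kind='bar', stacked=True)
    plt.show()

    # Stacked area chart: NO2(GT), PT08.S4(NO2) and PT08.S5(O3) vs time
    window[['NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)']].plot(kind='area', stacked=True)
    plt.show()
```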
Comparison between features

[Figure: bar chart of the concentration of CO vs time.]
Purpose: Outlier removal focuses on identifying and eliminating data points that
significantly deviate from the rest of the data distribution. These points can be due to
noise, measurement errors, or genuine but rare events.
Process: This involves detecting outliers using various statistical methods (like Z-score,
IQR) and then deciding whether to remove or adjust these points based on criteria specific
to the dataset or research question.
Impact on Data:
Data Integrity: Removing outliers can help to prevent skewing of statistical analyses and
machine learning model performance. However, it may also lead to loss of valuable
information, especially if the outliers represent important phenomena.
Statistical Analysis: Enhances the accuracy of descriptive statistics (mean, median,
standard deviation) and model predictions by eliminating potential sources of significant
error.
• Outliers are values that seem excessively different from most of the rest of the values in
the given dataset.
• Sources of Outliers
• new inventions (true outliers),
• development of new patterns/phenomena,
• experimental errors,
• rarely occurring incidents,
• anomalies,
• incorrectly fed data due to typographical mistakes,
• failure of data recording systems / components like sensors, cables, etc.
• However, not all outliers are bad; some reveal new information. Inliers are all the data points that belong to the distribution, i.e. everything except the outliers.
Data source: P. Cortez et al., "Modeling wine preferences by data mining from physicochemical properties," Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Pros and Cons of Z-score
Assumption of Normality: While Z-score normalization does not require the data to
be normally distributed, its effectiveness can vary with the distribution of the data. For
distributions far from normal, other normalization techniques might be more
appropriate.
Outlier Sensitivity: Since the mean and standard deviation are influenced by outliers,
Z-score normalization can be affected by extreme values. In such cases, exploring the
data and possibly addressing outliers before normalization may be advisable.
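A minimal sketch of Z-score based outlier flagging, assuming the pH column is available as a NumPy array; the threshold of 3 standard deviations is a common convention, not a value fixed by the slides.

```python
import numpy as np

def zscore_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Example: keep only the inliers
# ph_clean = ph[~zscore_outliers(ph)]
```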
[Figure: box plot of the pH values, marking Q1 (25th percentile), the median, and Q3 (75th percentile), with outliers lying beyond the whiskers on both sides.]
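The box plot above follows the IQR rule; a minimal sketch, assuming the conventional 1.5 × IQR whisker factor:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)
```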
Pros and cons of IQR
Purpose: Smoothing techniques are used to reduce noise and fluctuations in data to reveal underlying
trends or patterns. Unlike outlier removal, smoothing is about filtering the data rather than removing
any data points.
Process: This involves applying mathematical filters or functions to the data, such as moving averages, exponential smoothing, or more complex algorithms like the Savitzky-Golay (SG) filter. The choice of technique depends on the data characteristics and the analysis goals.
Impact on Data:
Data Trends: Smoothing helps in highlighting trends and patterns by averaging out short-term
fluctuations. This is particularly useful in time series analysis or signal processing.
Preservation of Data Points: All original data points are retained in the dataset, albeit modified by the
smoothing process, thus maintaining the integrity and structure of the data.
Smoothing techniques for outlier removal: Moving average

Let us construct a 3-period moving average (window size = 3) of the pH values:

Sr. No. | pH   | 3-period moving average
1       | 3.51 | --
2       | 3.20 | 3.32
3       | 3.26 | 3.21
4       | 3.16 | 3.31
5       | 3.51 | 3.39
6       | 3.51 | 3.44
...     | ...  | ...
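The tabulated values can be reproduced with a centred rolling mean; a minimal sketch, assuming the pH column is available as a pandas Series `ph`:

```python
import pandas as pd

def moving_average(ph: pd.Series, window: int = 3) -> pd.Series:
    """Centred moving average; the first and last positions remain NaN."""
    return ph.rolling(window=window, center=True).mean()

# moving_average(pd.Series([3.51, 3.20, 3.26, 3.16, 3.51, 3.51]))
# -> NaN, 3.32, 3.21, 3.31, 3.39, NaN (rounded to two decimals)
```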
Pros:
• Simplicity: Easy to understand and implement.
• Effectiveness: Efficient at reducing random noise while keeping the signal intact.
• Customizable Window: The window size can be adjusted to control the smoothing level.
Cons:
• Lag: Introduces a lag, especially noticeable with larger window sizes, which can be
problematic for real-time analysis.
• Edge Artifacts: The beginning and end of the series might have distortions since fewer
data points are used in the calculation.
• Uniform Weighting: Treats all points within the window equally, which might not be ideal
for all applications.
Smoothing techniques for outlier removal: Exponential average

• β near 1 ⇒ more weightage is given to recent values.
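A minimal sketch of exponential averaging consistent with the note above, assuming the standard single-exponential recurrence s_t = β·x_t + (1 − β)·s_{t−1} (so β near 1 weights recent values more heavily):

```python
import pandas as pd

def exponential_average(ph: pd.Series, beta: float = 0.3) -> pd.Series:
    """Single exponential smoothing; pandas calls the weight `alpha`."""
    return ph.ewm(alpha=beta, adjust=False).mean()
```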
Pros and cons of Exponential average
Pros:
• Responsiveness: More responsive to recent changes in the data compared to MA.
• Flexible Weighting: Allows for finer control over the smoothing process through the
smoothing constant.
• Suitable for Trends: Can adapt to data with trends and seasonality through variations
like Double Exponential Smoothing and Triple Exponential Smoothing (Holt-Winters
method).
Cons:
• More Complex: Slightly more complex to understand and implement than MA.
• Parameter Sensitivity: The choice of smoothing constant significantly affects the results,
requiring careful tuning.
• Initial Value Sensitivity: The initial smoothed value can impact the smoothing quality,
especially in short series.
Savitzky-Golay (SG) filter

Savitzky and Golay computed the filter coefficients for various configurations and tabulated them. For a window size of 5 and a cubic polynomial (M = 3), the smoothed value is

ŷ_j = (1/35) (−3·y_{j−2} + 12·y_{j−1} + 17·y_j + 12·y_{j+1} − 3·y_{j+2})

With n = 1599 samples, the filter applies for 3 ≤ j ≤ 1597:

Sr. No. | pH   | Computation                                                           | Smoothed value
1       | 3.51 | --                                                                    | --
2       | 3.20 | --                                                                    | --
3       | 3.26 | ŷ_3 = (1/35)(−3×3.51 + 12×3.20 + 17×3.26 + 12×3.16 − 3×3.51)          | 3.30
4       | 3.16 | ŷ_4 = (1/35)(−3y_2 + 12y_3 + 17y_4 + 12y_5 − 3y_6)                    | 3.39
5       | 3.51 | ŷ_5 = (1/35)(−3y_3 + 12y_4 + 17y_5 + 12y_6 − 3y_7)                    | 3.42
...     | ...  | ...                                                                   | ...
1597    | 3.42 | ŷ_1597 = (1/35)(−3y_1595 + 12y_1596 + 17y_1597 + 12y_1598 − 3y_1599)  | 3.49
1598    | 3.57 | --                                                                    | --
1599    | 3.39 | --                                                                    | --
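A minimal sketch using SciPy's Savitzky-Golay implementation with the same configuration (window size 5, cubic polynomial); in the interior it applies the same (−3, 12, 17, 12, −3)/35 weights, while SciPy fills the edge points by polynomial interpolation instead of leaving them blank as in the table.

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_smooth(ph: np.ndarray) -> np.ndarray:
    """Savitzky-Golay smoothing with window_length=5 and polyorder=3."""
    return savgol_filter(ph, window_length=5, polyorder=3)
```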
Pros:
• Preservation of Signal Features: Better at preserving the shape of the dataset's peaks
and valleys compared to MA and ES.
• Reduction of Noise: Effective at reducing noise while maintaining the signal's structural
integrity.
• Differentiation Capability: Can be used to calculate the derivative of a signal while
smoothing.
Cons:
• Complexity: More complex to understand and implement compared to MA and ES.
• Polynomial Order Selection: Requires choosing an appropriate polynomial order, which
can affect smoothing quality.
• Edge Effects: Like MA, it can suffer from edge effects, although less so than MA due to
polynomial fitting.
Summary
• Moving Average is best for simple, quick smoothing with easily adjustable
smoothing levels but suffers from lag and edge artifacts.
• Exponential Smoothing offers flexibility and responsiveness, particularly suitable for
data with trends or seasonality, but requires careful tuning of parameters.
• Savitzky-Golay Filter excels in preserving signal features while smoothing, making
it ideal for analytical applications where maintaining signal integrity is crucial, but it
is more complex and sensitive to the choice of polynomial order.
Data scaling

Data scaling is a preprocessing step used in data analysis, machine learning, and statistics to standardize the range of features or data points. The primary goal of data scaling is to normalize the scale of the data without distorting differences in the ranges of values or losing information. Scaling is essential when the data set includes attributes of varying scales and units, as these differences can negatively impact the performance of many machine learning algorithms.
A sample of the wine-quality data (fixed acidity and pH) before scaling:

Sr. No. | Acidity | pH
1       | 7.4     | 3.51
2       | 7.8     | 3.20
3       | 7.8     | 3.26
4       | 11.2    | 3.16
5       | 7.4     | 3.51
6       | 7.4     | 3.51
...     | ...     | ...
1599    | 6       | 3.39
Data scaling: Normalization (Min-Max)

Normalization is conducted to make variable values range from 0 to 1:

x' = (x − min(x)) / (max(x) − min(x))

[The slide shows the acidity and pH columns from the table above before and after normalization, together with quick stats; the after-normalization values are only available graphically.]
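A minimal sketch of min-max normalization applied column-wise, assuming the wine data is in a DataFrame with columns named 'fixed acidity' and 'pH' (the names are assumptions based on the Cortez dataset):

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, cols=('fixed acidity', 'pH')) -> pd.DataFrame:
    """Rescale the chosen columns to the [0, 1] range."""
    out = df.copy()
    for c in cols:
        out[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
    return out
```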
Pros and Cons of Min-Max Scaling
Pros:
• Fixed Range: Scales the features to a specific range (0 to 1 or any other specified
range), which can be useful when we need to maintain a positive scale or a specific
scaling range.
• Preservation of Shape: Maintains the distribution shape of the feature, which can be
beneficial for visualization.
• Simple Interpretation: Easy to understand and implement, and the scaled data retains
interpretability in terms of the original range.
Cons:
• Sensitivity to Outliers: Extreme values (outliers) can skew the scaling, compressing the
majority of the data into a small range.
• Not Centered: Does not center the data around a mean of 0; the scaled data might still
have a bias away from 0, which can affect some algorithms.
Data scaling: Standardization

[The slide shows the same acidity and pH columns before and after standardization; the after-standardization values are only available graphically.]
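A minimal sketch of standardization (Z-score scaling), x' = (x − mean) / std, under the same column-name assumptions as the normalization example:

```python
import pandas as pd

def standardize(df: pd.DataFrame, cols=('fixed acidity', 'pH')) -> pd.DataFrame:
    """Centre each chosen column at 0 with unit standard deviation."""
    out = df.copy()
    for c in cols:
        out[c] = (df[c] - df[c].mean()) / df[c].std()
    return out
```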
Pros and Cons of Standardization
Pros:
• Handling Differing Scales: Very effective in cases where the data encompasses attributes with
different scales, enhancing the performance of algorithms sensitive to variance in scales.
• Less Sensitive to Outliers: Compared to Min-Max Scaling, it is less influenced by outliers since it
uses the mean and standard deviation, which, while still affected by outliers, are more robust than
the min and max values.
• Suitability for Algorithms: Particularly beneficial for algorithms that assume the data is centered around 0 and has variance on the same scale.
Cons:
• No Fixed Range: The transformed data does not have a specific bounded range, which might be a
requirement for certain algorithms or applications.
• Assumption of Normality: Works best if the original data has a Gaussian (normal) distribution,
although it can still be applied to data without it.