L4 Data Preprocessing
[Workflow diagram: Data → Data preprocessing → ML model development → AI algorithm development → Optimizer → DSS, with applications such as Modelling & Simulation and APC. Data preprocessing covers data visualization, outlier detection, smoothing techniques, and data scaling.]
In our case, we use oil-drilling data from ONGC (data source: online sensors on an ONGC oil rig), where we analyse the parameters "Hookload" and "Bit depth".

Hookload refers to the tension or weight applied to the drill string, the series of pipes and other equipment that make up the drilling rig. It is measured in units of weight, such as pounds or kilonewtons. The hookload is used to control the weight on the bit, the cutting or drilling tool at the bottom of the drill string, and to monitor the progress of the drilling operation and detect any potential problems.
Bit depth, on the other hand, refers to the distance from the drilling rig's surface to the point where
the bit is located. It is measured in units of length, such as feet or meters. The bit depth is used to
track the progress of the drilling operation and to determine when to add or remove drilling
equipment.
Together, hookload and bit depth provide valuable information about the drilling process, including
the weight on the bit, the progress of the drilling operation, and the presence of any potential
problems. This information is used to optimize the drilling operation and to make decisions about
when to add or remove drilling equipment. It is also used to make sure that the drilling is done safely
and efficiently, by avoiding over-stressing the drill string, and by ensuring that the drilling is done at
the desired depth.
Process: For comparison plots we use Air Quality Data containing parameters such as the concentrations of NOx, CO and NO2, relative humidity, etc. This is time-series data.

Bar charts are used to compare the sizes of different categories of data and to show how the data is distributed across categories. They are useful for comparing data across multiple categories and for identifying trends or patterns in the data.

The bar chart in the following slide is created by plotting time on the x-axis and the concentration of CO on the y-axis. For the pie chart, the concentration of NOx over time is represented. For the stacked bar chart, the concentrations of CO, NMHC and O3 are plotted against time. For the stacked area chart, the concentrations of NO2(GT), PT08.S4(NO2) and PT08.S5(O3) are plotted against time.

Attribute Information:
1. Date (DD/MM/YYYY)
2. Time (HH.MM.SS)
3. CO(GT): true hourly averaged concentration of carbon monoxide in mg/m^3 (reference analyzer)
4. PT08.S1 (tin oxide): hourly averaged sensor response (nominally CO targeted)
5. NMHC(GT): true hourly averaged overall non-methanic hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
6. C6H6(GT): true hourly averaged benzene concentration in microg/m^3 (reference analyzer)
7. PT08.S2 (titania): hourly averaged sensor response (nominally NMHC targeted)
8. NOx(GT): true hourly averaged NOx concentration in ppb (reference analyzer)
9. PT08.S3 (tungsten oxide): hourly averaged sensor response (nominally NOx targeted)
10. NO2(GT): true hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
11. PT08.S4 (tungsten oxide): hourly averaged sensor response (nominally NO2 targeted)
12. PT08.S5 (indium oxide): hourly averaged sensor response (nominally O3 targeted)
13. T: temperature in °C
14. RH: relative humidity (%)
15. AH: absolute humidity
Data source: The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multi-sensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year).
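As a rough illustration of these comparison plots, the sketch below assumes the Air Quality data has already been loaded into a pandas DataFrame `df` indexed by timestamp, with column names such as 'CO(GT)', 'NMHC(GT)', 'NO2(GT)', 'PT08.S4(NO2)' and 'PT08.S5(O3)' (the exact names depend on how the CSV is read, so treat them as assumptions).

```python
import pandas as pd
import matplotlib.pyplot as plt

def comparison_plots(df: pd.DataFrame, hours: int = 24) -> None:
    """Draw the bar, stacked-bar and stacked-area charts for the first `hours` rows."""
    window = df.head(hours)

    # Bar chart: CO concentration vs time
    window['CO(GT)'].plot(kind='bar', title='Concentration of CO vs time')
    plt.ylabel('CO (mg/m^3)')
    plt.show()

    # Stacked bar chart: CO, NMHC and the O3-targeted sensor response vs time
    window[['CO(GT)', 'NMHC(GT)', 'PT08.S5(O3)']].plot(kind='bar', stacked=True)
    plt.show()

    # Stacked area chart: NO2(GT), PT08.S4(NO2) and PT08.S5(O3) vs time
    window[['NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)']].plot(kind='area', stacked=True)
    plt.show()
```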
Comparison between features

[Figure: bar chart of the concentration of CO vs time.]
Purpose: Outlier removal focuses on identifying and eliminating data points that
significantly deviate from the rest of the data distribution. These points can be due to
noise, measurement errors, or genuine but rare events.
Process: This involves detecting outliers using various statistical methods (like Z-score,
IQR) and then deciding whether to remove or adjust these points based on criteria specific
to the dataset or research question.
Impact on Data:
Data Integrity: Removing outliers can help to prevent skewing of statistical analyses and
machine learning model performance. However, it may also lead to loss of valuable
information, especially if the outliers represent important phenomena.
Statistical Analysis: Enhances the accuracy of descriptive statistics (mean, median,
standard deviation) and model predictions by eliminating potential sources of significant
error.
• Outliers are values that seem excessively different from most of the rest of the values in
the given dataset.
• Sources of Outliers
• new inventions (true outliers),
• development of new patterns/phenomena,
• experimental errors,
• rarely occurring incidents,
• anomalies,
• incorrectly fed data due to typographical mistakes,
• failure of data recording systems / components like sensors, cables, etc.
• However, not all outliers are bad; some reveal new information. Inliers are all the data points that belong to the distribution, i.e. everything except the outliers.
Data source: P. Cortez et al., "Modeling wine preferences by data mining from physicochemical properties," Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Pros and Cons of Z-score
Assumption of Normality: While Z-score normalization does not require the data to
be normally distributed, its effectiveness can vary with the distribution of the data. For
distributions far from normal, other normalization techniques might be more
appropriate.
Outlier Sensitivity: Since the mean and standard deviation are influenced by outliers,
Z-score normalization can be affected by extreme values. In such cases, exploring the
data and possibly addressing outliers before normalization may be advisable.
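A minimal sketch of Z-score based outlier flagging, assuming the pH column is available as a NumPy array; the threshold of 3 standard deviations is a common convention, not a value fixed by the slides.

```python
import numpy as np

def zscore_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Example: keep only the inliers
# ph_clean = ph[~zscore_outliers(ph)]
```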
[Figure: box plot of the pH values, marking Q1 (25th percentile), the median, and Q3 (75th percentile), with outliers lying beyond the whiskers on both sides.]
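The box plot above follows the IQR rule; a minimal sketch, assuming the conventional 1.5 × IQR whisker factor:

```python
import numpy as np

def iqr_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)
```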
Pros and cons of IQR
Purpose: Smoothing techniques are used to reduce noise and fluctuations in data to reveal underlying
trends or patterns. Unlike outlier removal, smoothing is about filtering the data rather than removing
any data points.
Process: This involves applying mathematical filters or functions to the data, such as moving averages, exponential smoothing, or more complex algorithms like the Savitzky-Golay (SG) filter. The choice of technique depends on the data characteristics and the analysis goals.
Impact on Data:
Data Trends: Smoothing helps in highlighting trends and patterns by averaging out short-term
fluctuations. This is particularly useful in time series analysis or signal processing.
Preservation of Data Points: All original data points are retained in the dataset, albeit modified by the
smoothing process, thus maintaining the integrity and structure of the data.
Smoothing techniques for outlier removal: Moving average

Let us construct a 3-period moving average (window size = 3) of the pH values:

Sr. No. | pH   | 3-period moving average
1       | 3.51 | --
2       | 3.20 | 3.32
3       | 3.26 | 3.21
4       | 3.16 | 3.31
5       | 3.51 | 3.39
6       | 3.51 | 3.44
...     | ...  | ...
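The tabulated values can be reproduced with a centred rolling mean; a minimal sketch, assuming the pH column is available as a pandas Series `ph`:

```python
import pandas as pd

def moving_average(ph: pd.Series, window: int = 3) -> pd.Series:
    """Centred moving average; the first and last positions remain NaN."""
    return ph.rolling(window=window, center=True).mean()

# moving_average(pd.Series([3.51, 3.20, 3.26, 3.16, 3.51, 3.51]))
# -> NaN, 3.32, 3.21, 3.31, 3.39, NaN (rounded to two decimals)
```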
Pros:
• Simplicity: Easy to understand and implement.
• Effectiveness: Efficient at reducing random noise while keeping the signal intact.
• Customizable Window: The window size can be adjusted to control the smoothing level.
Cons:
• Lag: Introduces a lag, especially noticeable with larger window sizes, which can be
problematic for real-time analysis.
• Edge Artifacts: The beginning and end of the series might have distortions since fewer
data points are used in the calculation.
• Uniform Weighting: Treats all points within the window equally, which might not be ideal
for all applications.
Smoothing techniques for outlier removal: Exponential average

• β near 1 ⇒ more weightage is given to recent values.
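A minimal sketch of exponential averaging consistent with the note above, assuming the standard single-exponential recurrence s_t = β·x_t + (1 − β)·s_{t−1} (so β near 1 weights recent values more heavily):

```python
import pandas as pd

def exponential_average(ph: pd.Series, beta: float = 0.3) -> pd.Series:
    """Single exponential smoothing; pandas calls the weight `alpha`."""
    return ph.ewm(alpha=beta, adjust=False).mean()
```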
Pros and cons of Exponential average
Pros:
• Responsiveness: More responsive to recent changes in the data compared to MA.
• Flexible Weighting: Allows for finer control over the smoothing process through the
smoothing constant.
• Suitable for Trends: Can adapt to data with trends and seasonality through variations
like Double Exponential Smoothing and Triple Exponential Smoothing (Holt-Winters
method).
Cons:
• More Complex: Slightly more complex to understand and implement than MA.
• Parameter Sensitivity: The choice of smoothing constant significantly affects the results,
requiring careful tuning.
• Initial Value Sensitivity: The initial smoothed value can impact the smoothing quality,
especially in short series.
Savitzky-Golay (SG) filter

Savitzky and Golay computed the filter coefficients for various configurations and tabulated them. For a window size of 5 and a cubic polynomial (M = 3), the smoothed value is

ŷ_j = (1/35) (−3·y_{j−2} + 12·y_{j−1} + 17·y_j + 12·y_{j+1} − 3·y_{j+2})

With n = 1599 samples, the filter applies for 3 ≤ j ≤ 1597:

Sr. No. | pH   | Computation                                                           | Smoothed value
1       | 3.51 | --                                                                    | --
2       | 3.20 | --                                                                    | --
3       | 3.26 | ŷ_3 = (1/35)(−3×3.51 + 12×3.20 + 17×3.26 + 12×3.16 − 3×3.51)          | 3.30
4       | 3.16 | ŷ_4 = (1/35)(−3y_2 + 12y_3 + 17y_4 + 12y_5 − 3y_6)                    | 3.39
5       | 3.51 | ŷ_5 = (1/35)(−3y_3 + 12y_4 + 17y_5 + 12y_6 − 3y_7)                    | 3.42
...     | ...  | ...                                                                   | ...
1597    | 3.42 | ŷ_1597 = (1/35)(−3y_1595 + 12y_1596 + 17y_1597 + 12y_1598 − 3y_1599)  | 3.49
1598    | 3.57 | --                                                                    | --
1599    | 3.39 | --                                                                    | --
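A minimal sketch using SciPy's Savitzky-Golay implementation with the same configuration (window size 5, cubic polynomial); in the interior it applies the same (−3, 12, 17, 12, −3)/35 weights, while SciPy fills the edge points by polynomial interpolation instead of leaving them blank as in the table.

```python
import numpy as np
from scipy.signal import savgol_filter

def sg_smooth(ph: np.ndarray) -> np.ndarray:
    """Savitzky-Golay smoothing with window_length=5 and polyorder=3."""
    return savgol_filter(ph, window_length=5, polyorder=3)
```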
Pros:
• Preservation of Signal Features: Better at preserving the shape of the dataset's peaks
and valleys compared to MA and ES.
• Reduction of Noise: Effective at reducing noise while maintaining the signal's structural
integrity.
• Differentiation Capability: Can be used to calculate the derivative of a signal while
smoothing.
Cons:
• Complexity: More complex to understand and implement compared to MA and ES.
• Polynomial Order Selection: Requires choosing an appropriate polynomial order, which
can affect smoothing quality.
• Edge Effects: Like MA, it can suffer from edge effects, although less so than MA due to
polynomial fitting.
Summary
• Moving Average is best for simple, quick smoothing with easily adjustable
smoothing levels but suffers from lag and edge artifacts.
• Exponential Smoothing offers flexibility and responsiveness, particularly suitable for
data with trends or seasonality, but requires careful tuning of parameters.
• Savitzky-Golay Filter excels in preserving signal features while smoothing, making
it ideal for analytical applications where maintaining signal integrity is crucial, but it
is more complex and sensitive to the choice of polynomial order.
Data scaling

Data scaling is a preprocessing step used in data analysis, machine learning, and statistics to standardize the range of features or data points. The primary goal of data scaling is to normalize the scale of the data without distorting differences in the ranges of values or losing information. Scaling is essential when the data set includes attributes of varying scales and units, as these differences can negatively impact the performance of many machine learning algorithms.
A sample of the wine-quality data (fixed acidity and pH) before scaling:

Sr. No. | Acidity | pH
1       | 7.4     | 3.51
2       | 7.8     | 3.20
3       | 7.8     | 3.26
4       | 11.2    | 3.16
5       | 7.4     | 3.51
6       | 7.4     | 3.51
...     | ...     | ...
1599    | 6       | 3.39
Data scaling: Normalization (Min-Max)

Normalization is conducted to make variable values range from 0 to 1:

x' = (x − min(x)) / (max(x) − min(x))

[The slide shows the acidity and pH columns from the table above before and after normalization, together with quick stats; the after-normalization values are only available graphically.]
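A minimal sketch of min-max normalization applied column-wise, assuming the wine data is in a DataFrame with columns named 'fixed acidity' and 'pH' (the names are assumptions based on the Cortez dataset):

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, cols=('fixed acidity', 'pH')) -> pd.DataFrame:
    """Rescale the chosen columns to the [0, 1] range."""
    out = df.copy()
    for c in cols:
        out[c] = (df[c] - df[c].min()) / (df[c].max() - df[c].min())
    return out
```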
Pros and Cons of Min-Max Scaling
Pros:
• Fixed Range: Scales the features to a specific range (0 to 1 or any other specified
range), which can be useful when we need to maintain a positive scale or a specific
scaling range.
• Preservation of Shape: Maintains the distribution shape of the feature, which can be
beneficial for visualization.
• Simple Interpretation: Easy to understand and implement, and the scaled data retains
interpretability in terms of the original range.
Cons:
• Sensitivity to Outliers: Extreme values (outliers) can skew the scaling, compressing the
majority of the data into a small range.
• Not Centered: Does not center the data around a mean of 0; the scaled data might still
have a bias away from 0, which can affect some algorithms.
Data scaling: Standardization

[The slide shows the same acidity and pH columns before and after standardization; the after-standardization values are only available graphically.]
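A minimal sketch of standardization (Z-score scaling), x' = (x − mean) / std, under the same column-name assumptions as the normalization example:

```python
import pandas as pd

def standardize(df: pd.DataFrame, cols=('fixed acidity', 'pH')) -> pd.DataFrame:
    """Centre each chosen column at 0 with unit standard deviation."""
    out = df.copy()
    for c in cols:
        out[c] = (df[c] - df[c].mean()) / df[c].std()
    return out
```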
Pros and Cons of Standardization
Pros:
• Handling Differing Scales: Very effective in cases where the data encompasses attributes with
different scales, enhancing the performance of algorithms sensitive to variance in scales.
• Less Sensitive to Outliers: Compared to Min-Max Scaling, it is less influenced by outliers since it
uses the mean and standard deviation, which, while still affected by outliers, are more robust than
the min and max values.
• Suitability for Algorithms: Particularly beneficial for algorithms that assume the data is centered around 0 and has variance on the same scale.
Cons:
• No Fixed Range: The transformed data does not have a specific bounded range, which might be a
requirement for certain algorithms or applications.
• Assumption of Normality: Works best if the original data has a Gaussian (normal) distribution,
although it can still be applied to data without it.