Lecture 2
CHE F315
Outline
16 January 2024 4
BITS Pilani, Pilani Campus
CHE F315 Machine Learning for Chemical Engineers
Thebelt, A., Wiebe, J., Kronqvist, J., Tsay, C., & Misener, R. (2022). Maximizing information from chemical
engineering data sets: Applications to machine learning. Chemical Engineering Science, 252, 117469.
Data preprocessing
Missing data imputation
In process industries, missing values are entries in the data set
that have no connection with the real state of the process. They
typically appear as placeholder values such as ±∞, 0, or NaN
(not a number).
There are generally three missing-data patterns:
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
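As a minimal sketch of how such placeholder entries could be flagged before imputation, the snippet below marks ±∞, NaN, and (optionally) 0 as missing. The function name `is_missing`, the `zero_is_missing` switch, and the sample readings are illustrative assumptions, not part of the lecture. Treating 0 as missing is only sensible for variables that cannot physically be zero.

```python
import math


def is_missing(value, zero_is_missing=True):
    """Flag a reading as missing if it is NaN, +/-inf, or (optionally) exactly 0."""
    if math.isnan(value) or math.isinf(value):
        return True
    # Assumption: a zero reading is a placeholder, not a real process value.
    return zero_is_missing and value == 0


# Hypothetical temperature readings containing placeholder values.
temperatures = [351.2, float("nan"), 350.8, 0.0, float("inf"), 351.0]

missing_mask = [is_missing(v) for v in temperatures]
missing_indices = [i for i, flag in enumerate(missing_mask) if flag]
print(missing_indices)
```

The resulting mask can then be inspected over time to judge which of the three patterns above is plausible before choosing an imputation method.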
A and C – missing values for single/multiple variables, due to sensor failure
B – values of some variables missing at the same time instances, due to a fault
D – single variable showing regular missing values, due to multirate sampling
Xu, S., Lu, B., Baldea, M., Edgar, T. F., Wojsznis, W., Blevins, T., & Nixon, M. (2015). Data cleaning in the process
industries. Reviews in Chemical Engineering, 31(5), 453-490.
Outlier detection and removal
• Outliers are observations, or subsets of observations, that are
statistically inconsistent with the rest of the data set
• Causes: malfunction of sensors and inappropriate treatment of
missing data
Pani, A. K., & Mohanta, H. K. (2016). Online monitoring of cement clinker quality using multivariate statistics and
Takagi-Sugeno fuzzy-inference technique. Control Engineering Practice, 57, 1-17.
Quartile-based identifier and boxplots:
Uses the interquartile distance $Q$ as the scale parameter:
$Q = Q_3 - Q_1$
where $Q_1$ is the lower quartile, $x_{0.25}$, and $Q_3$ is the upper quartile, $x_{0.75}$.
$\mathrm{med} = (Q_1 + Q_3)/2$
For a symmetric data distribution, the following condition is used to detect
outliers:
$|x_k - \mathrm{med}| > 2Q$
A boxplot is a graphical demonstration of the quartile-based detector.
In the plot, any point that lies outside the upper or lower fence
is considered an outlier.
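The quartile-based rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's implementation: the helper names `quantile` and `quartile_outliers`, the linear-interpolation convention for the quartiles, and the sample readings are all assumptions. Other quartile conventions give slightly different values of Q1 and Q3.

```python
def quantile(sorted_values, p):
    """Linearly interpolated p-quantile of an already sorted list (0 <= p <= 1)."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def quartile_outliers(values, factor=2.0):
    """Return the points x_k that satisfy |x_k - med| > factor * Q."""
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)   # lower quartile, x_0.25
    q3 = quantile(ordered, 0.75)   # upper quartile, x_0.75
    q = q3 - q1                    # interquartile distance Q
    med = (q1 + q3) / 2            # centre for a symmetric distribution
    return [x for x in values if abs(x - med) > factor * q]


# Hypothetical sensor readings with one obvious spike.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 25.0]
outliers = quartile_outliers(readings)
print(outliers)
```

Because the quartiles are barely affected by a single extreme reading, the spike does not inflate the scale parameter Q, which is the main reason for preferring this rule over one based on the mean and standard deviation.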