Chapter 1: Data Preparation
• Work with the main Python libraries for data science: NumPy, SciPy, pandas, and scikit-learn.
• Data visualization
• Data cleaning
• Data mining
• Data transformation
• Data storage…
• Machines can learn without being explicitly programmed, but we need to provide a mathematical model, a dataset, and a set of features.
• A computer program learns from examples with respect to a given mathematical model.
[Diagram: training data and training labels are fed into the model during training.]
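As a minimal illustration of "learning from examples" (an assumed sketch using scikit-learn's built-in iris dataset and a decision tree; not material from the course):

# Training data and training labels are used to fit a model (assumed example).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)     # training data and training labels
model = DecisionTreeClassifier()      # the mathematical model
model.fit(X, y)                       # learning from examples
print(model.predict(X[:5]))           # the fitted model makes predictions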
• Dimension Reduction
• Association rules
Overview of the data preparation steps:
• Data Cleaning: identifying and correcting mistakes or errors in the data.
• Data Transforms: changing the scale or distribution of variables.
• Feature Engineering: deriving new variables from available data.
• Feature Selection: identifying the input variables that are most relevant to the task.
• Principal Component Analysis (PCA): creating compact projections of the data.
“Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data.”
“Cleaning up your data is not the most glamourous of tasks, but it’s an essential part of data wrangling.
[…] Knowing how to properly clean and assemble your data will set you miles apart from others in your
field.”
Quick examples (a minimal pandas sketch on an assumed small DataFrame with duplicates and missing values; not course code):
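import pandas as pd

# Assumed toy DataFrame with a duplicate row and missing values.
df = pd.DataFrame({"age": [25, 25, None, 40],
                   "city": ["Tunis", "Tunis", "Sfax", None]})
df = df.drop_duplicates()   # remove duplicate rows
df = df.dropna()            # drop rows with missing values
# or impute instead of dropping: df["age"] = df["age"].fillna(df["age"].mean())
print(df)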
“We will generally define outliers as samples that are exceptionally far from the mainstream of the
data.”
Causes of outliers:
• Measurement or input error.
• Data corruption.
• True outlier observation…
Outlier detection with visualization: boxplot, histogram, scatter plot.
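Consistent with the boxplot view, a common way to flag such points is the interquartile-range (IQR) rule; the following sketch on an assumed toy Series is illustrative only:

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])        # 95 is far from the mainstream
q1, q3 = s.quantile(0.25), s.quantile(0.75)    # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # boxplot whisker limits
outliers = s[(s < lower) | (s > upper)]
print(outliers)                                # -> 95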
• Missing data are values that are not available but would be meaningful if observed.
• Missing data can take many forms: missing sequences, incomplete features, missing files, incomplete information, data entry errors, etc.
"mean" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, copy=True, add_indicator=False)
Method 2: df.fillna(df.mean())

"median" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='median', fill_value=None, copy=True, add_indicator=False)
Method 2: df.fillna(df.median())

"most_frequent" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='most_frequent', fill_value=None, copy=True, add_indicator=False)
Method 2: df['columnName'] = df['columnName'].fillna(df['columnName'].mode()[0])
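For illustration, a minimal runnable use of SimpleImputer with the "mean" strategy on an assumed toy array:

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = imputer.fit_transform(X)   # nan in column 0 -> 4.0, in column 1 -> 2.5
print(X_imputed)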
Data Transforms: changing the scale or distribution of variables.
• Transforming categorical data is an essential step during data preprocessing. scikit-learn's estimators require the input dataset to contain only numeric values; they do not support categorical data directly.
• Before you start transforming your data, it is important to figure out whether the feature you are working on is ordinal (as opposed to nominal). An ordinal feature is best described as a feature with ordered categories, as in the sketch below.
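As a sketch of both cases (the example columns "size" and "color" are assumptions), OrdinalEncoder can encode ordered categories with an explicit order, while OneHotEncoder handles nominal ones:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"size": ["small", "large", "medium"],   # ordinal feature
                   "color": ["red", "blue", "red"]})       # nominal feature

# Ordinal feature: encode with an explicit category order.
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = ord_enc.fit_transform(df[["size"]]).ravel()

# Nominal feature: one-hot encode (no order implied).
color_onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()
print(df)
print(color_onehot)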
o Min-max normalization
o Decimal scaling
o Z-score normalization…
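As a hedged sketch on assumed toy data: min-max normalization maps x to (x - min) / (max - min), z-score normalization maps x to (x - mean) / std, and decimal scaling divides x by a power of 10 so that all values fall below 1 in absolute value:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[120.0], [350.0], [980.0]])

x_minmax = MinMaxScaler().fit_transform(X)     # min-max normalization -> [0, 1]
x_zscore = StandardScaler().fit_transform(X)   # z-score normalization -> mean 0, std 1
j = int(np.ceil(np.log10(np.abs(X).max())))    # decimal scaling: smallest j with |x| / 10^j < 1
x_decimal = X / 10 ** j
print(x_minmax.ravel(), x_zscore.ravel(), x_decimal.ravel())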
Feature Engineering: deriving new variables from available data.
Feature Selection: identifying the input variables that are most relevant to the task.
• Reduces overfitting.
• Removes redundant or poorly performing features.
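One common filter approach, shown here as an assumed illustration rather than the course's specific method, scores each input variable against the target and keeps only the top k:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 most relevant features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())                       # mask of the selected input variables
print(X_selected.shape)                             # (150, 2)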
Principal Component Analysis: creating compact projections of the data.
• A lower number of dimensions means less training time, fewer computational resources, and often better overall performance of machine learning algorithms.
• Identify the top k eigenvectors, those with the k largest eigenvalues, to obtain the k best principal components.
• For the eigenvalue 𝜆2, the corresponding eigenvector has components 0.735176 and 0.677873.
• So in this case:
o The eigenvector of 𝜆1 is the first principal component.
• The projected data are obtained as Z = X V, where X is the (centered) data matrix and V is the matrix of the selected eigenvectors.
• Method 4: Follow Kaiser's rule. According to Kaiser's rule, it is recommended to keep all the components with eigenvalues greater than 1.
• Method 5: Use a performance evaluation metric such as RMSE (for regression) or accuracy score (for classification).
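A minimal sketch combining these ideas with scikit-learn's PCA on an assumed example dataset: standardize, fit PCA, then inspect the eigenvalues and the cumulative explained variance to choose k:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)      # PCA is sensitive to scale

pca = PCA()
Z = pca.fit_transform(X_std)                   # projections on the principal components

eigenvalues = pca.explained_variance_          # one eigenvalue per component
print(eigenvalues)
print(pca.explained_variance_ratio_.cumsum())  # cumulative explained variance
k = int(np.sum(eigenvalues > 1))               # Kaiser's rule: keep eigenvalues > 1
print("components kept by Kaiser's rule:", k)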
Interpretation of PCA results: understanding the variables (correlation circle)
● The closer the variables are to the edge of the circle, the better they are represented by the factorial plane, i.e. the variable is well correlated with the two factors constituting this plane.
● A point is said to be well represented on an axis or a factorial plane if it is close to its projection on the axis or the plane. If it is far away, it is said to be poorly represented.
● → The angle formed between the point and its projection on the axis: the closer it is to 90 degrees, the less well the point is represented.
Interpretation: Individual factor map