UNIT-2: Data Preparation


Working with Real Data:

Working with real data in data preparation for machine learning involves
several steps to ensure the data is properly formatted, cleaned, and
preprocessed for use in training a machine learning model. Here's a general
guide to the data preparation process:

1. Data Collection: Obtain the dataset from reliable sources. This could
be from databases, APIs, CSV files, Excel spreadsheets, or any other
structured data format.
2. Exploratory Data Analysis (EDA): Perform EDA to understand the
structure, distribution, and characteristics of the data. This involves:
 Checking for missing values.
 Summarizing statistics (mean, median, min, max, etc.).
 Creating visualizations (histograms, box plots, scatter plots, etc.) to understand relationships and distributions.
 Identifying outliers and anomalies.
3. Data Cleaning:
 Handle missing values: Impute missing values (using mean,
median, mode, or more sophisticated methods), or remove
rows/columns with missing data depending on the amount of
missingness and the nature of the problem.
 Deal with outliers: Decide whether to remove outliers or
transform them to mitigate their impact on the model.
 Address inconsistencies and errors in data: This might involve
correcting typos, standardizing formats, or resolving
inconsistencies in categorical variables.
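For example, a few of these cleaning steps expressed as a minimal pandas sketch; the DataFrame df and the column names are hypothetical placeholders:

import pandas as pd

# Remove exact duplicate rows
df = df.drop_duplicates()

# Impute a numerical column with its median
df["income"] = df["income"].fillna(df["income"].median())

# Drop rows with impossible values (a simple outlier rule)
df = df[df["age"].between(0, 120)]

# Standardize the format of a categorical column
df["state"] = df["state"].str.strip().str.upper()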
4. Feature Engineering:
 Create new features: Combine existing features or derive new
ones that might be more informative for the model.
 Encode categorical variables: Convert categorical variables into
numerical representations using techniques like one-hot
encoding, label encoding, or embeddings.
 Feature scaling: Scale numerical features to a similar range (e.g.,
using min-max scaling or standardization) to prevent features
with large values from dominating the model.
5. Data Transformation:
 Standardize the data: Scale the features to have a mean of 0 and a
standard deviation of 1 to improve convergence during training.
 Dimensionality reduction: If dealing with high-dimensional data,
use techniques like Principal Component Analysis (PCA) or
feature selection to reduce the number of features while
preserving most of the variance.
6. Data Splitting:
 Split the data into training, validation, and test sets to assess
model performance and prevent overfitting.
7. Data Preprocessing Pipeline:
 Create a preprocessing pipeline that encapsulates all the data
preparation steps. This ensures consistency and allows easy
application to new data.
8. Iterative Process: Data preparation is often an iterative process. You
may need to revisit previous steps based on insights gained during
model training and evaluation.
9. Documentation: Document all the steps taken during data
preparation, including any assumptions made or decisions taken. This
documentation is crucial for reproducibility and collaboration.

Look at the Big Picture:

Looking at the big picture in data preparation for machine learning involves
understanding the overarching goals, challenges, and best practices that
guide the entire process. Here's an overview:

1. Understanding Business Objectives: Data preparation starts with a clear understanding of the business problem or objectives that the machine learning model aims to address. This understanding helps in defining the scope of data collection, the choice of features, and the evaluation metrics for the model.
2. Data Collection: Acquiring relevant and high-quality data is crucial for
the success of any machine learning project. This involves identifying
data sources, collecting data, and ensuring its integrity, accuracy, and
completeness. Data may come from various internal or external
sources, such as databases, APIs, sensor data, or web scraping.
3. Data Cleaning and Preprocessing: Raw data often contains errors,
missing values, outliers, and inconsistencies that need to be addressed
before feeding it into a machine learning model. Data cleaning involves
tasks like imputation of missing values, handling outliers, removing
duplicates, and standardizing formats. Preprocessing includes feature
scaling, encoding categorical variables, and handling skewness in
distributions.
4. Feature Engineering: Feature engineering is the process of creating
new features or transforming existing ones to enhance the
performance of machine learning models. This step requires domain
knowledge and creativity to extract meaningful information from the
data. Feature engineering aims to capture relevant patterns, reduce
dimensionality, and improve the model's ability to generalize.
5. Exploratory Data Analysis (EDA): EDA is an essential step in understanding the underlying structure and relationships within the data. It involves visualizations, statistical summaries, and hypothesis testing to gain insights into the data distribution, correlations between variables, and potential patterns or trends.
6. Data Transformation and Scaling: Data transformation techniques
like normalization or standardization are applied to ensure that all
features have a similar scale. This prevents features with larger
magnitudes from dominating the model and helps in achieving faster
convergence during training.
7. Data Splitting: Before training a machine learning model, the dataset
is split into training, validation, and test sets. This ensures that the
model is trained on one set, validated on another set for
hyperparameter tuning, and tested on a separate set to evaluate its
generalization performance.
8. Documentation and Reproducibility: Documenting the entire data
preparation process is crucial for reproducibility and transparency. This
includes recording data sources, preprocessing steps, feature
engineering techniques, and any assumptions or decisions made
during the process.
9. Iterative Process: Data preparation is often an iterative process that
involves refining data cleaning procedures, experimenting with
different feature engineering techniques, and optimizing preprocessing
steps based on model performance and feedback.
10. Continuous Monitoring and Maintenance: Once a machine
learning model is deployed, it's essential to monitor its performance
over time and update the data preparation pipeline accordingly. This
ensures that the model remains effective in real-world scenarios and
adapts to changing data patterns or business requirements.

Get the Data:

Getting the data means acquiring the dataset from its sources (files, databases, APIs, or web scraping, as described under Data Collection above) and loading it into your analysis environment so it can be explored and prepared.

Discover and Visualize the Data to Gain Insights:

Discovering and visualizing the data is a crucial step in data preparation for
machine learning. Here's a guide on how to perform exploratory data
analysis (EDA) to gain insights:

1. Load the Data: Start by loading your dataset into your preferred data
analysis environment such as Python with libraries like Pandas, NumPy,
and Matplotlib/Seaborn for visualization.
2. Basic Data Exploration:
 Check the first few rows of the dataset using the .head() function
to understand its structure.
 Check the dimensions of the dataset (number of rows and
columns) using the .shape attribute.
 Use the .info() function to get a concise summary of the dataset, including data types and missing values.
3. Summary Statistics:
 Compute summary statistics such as mean, median, standard
deviation, minimum, and maximum values for numerical features
using the .describe() function.
 For categorical features, you can use the .value_counts() function
to get the frequency distribution of unique values.
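As a concrete illustration of steps 1-3, here is a minimal pandas sketch; the file name housing.csv and the column name ocean_proximity are hypothetical placeholders:

import pandas as pd

# Load the dataset
df = pd.read_csv("housing.csv")

# Basic exploration
print(df.head())   # first few rows
print(df.shape)    # (number of rows, number of columns)
df.info()          # column types and non-null counts

# Summary statistics
print(df.describe())                         # numerical features
print(df["ocean_proximity"].value_counts())  # a categorical feature
print(df.isna().sum())                       # missing values per column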
4. Data Visualization:
 Histograms: Plot histograms to visualize the distribution of
numerical features. This helps in understanding the range and
spread of values and identifying potential outliers.
 Box plots: Use box plots to visualize the distribution of numerical
features, identify outliers, and compare distributions across
different categories.
 Scatter plots: Plot scatter plots to visualize the relationship
between pairs of numerical features. This helps in identifying
patterns, correlations, and potential trends in the data.
 Bar plots: Use bar plots to visualize the frequency distribution of
categorical features. This helps in understanding the distribution
of categories and identifying dominant categories.
 Heatmaps: Plot heatmaps to visualize the correlation matrix
between numerical features. This helps in identifying
multicollinearity and understanding the strength and direction of
correlations.
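A minimal sketch of these plots using Matplotlib and Seaborn, assuming the DataFrame df from above; the column name median_income is a hypothetical placeholder:

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram: distribution of a numerical feature
df["median_income"].hist(bins=50)
plt.xlabel("median_income")
plt.show()

# Box plot: spot outliers in the same feature
sns.boxplot(x=df["median_income"])
plt.show()

# Heatmap: correlation matrix of the numerical features
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()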
5. Feature Relationships:
 Explore relationships between features using scatter plots, pair
plots (for multiple numerical features), and categorical plots (for
categorical features).
 Look for patterns, trends, and correlations between features,
which can provide valuable insights for feature selection and
engineering.
6. Missing Values and Outliers:
 Visualize missing values using heatmaps or bar plots to identify
patterns of missingness across features.
 Plot box plots or scatter plots to identify outliers in numerical
features. Decide whether to remove or impute outliers based on
domain knowledge and the impact on the model.
7. Interactive Visualizations:
 Consider using interactive visualization libraries like Plotly or
Bokeh for more interactive and dynamic exploration of the data.
8. Iterative Exploration:
 Perform iterative data exploration and visualization based on initial insights and hypotheses generated. This may involve drilling down into specific subsets of the data or focusing on particular features of interest.

Prepare the Data for Machine Learning Algorithms:

Preparing data for machine learning algorithms involves several steps to ensure that the dataset is formatted correctly, features are appropriately scaled, and the data is ready to be used for training a model. Here's a comprehensive guide to preparing data for machine learning:

1. Handling Missing Values:
 Identify and handle missing values in the dataset. Options include:
 Removing rows or columns with missing values if they are
insignificant.
 Imputing missing values using methods like mean, median,
mode, or more sophisticated techniques such as K-nearest
neighbors (KNN) imputation or predictive modeling.
 For categorical variables, consider adding a new category
to represent missing values if they carry meaningful
information.
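A minimal sketch of these options with scikit-learn and pandas, assuming a DataFrame df whose column names (age, income, city) are hypothetical:

from sklearn.impute import SimpleImputer

# Median imputation for numerical columns
num_cols = ["age", "income"]
imputer = SimpleImputer(strategy="median")
df[num_cols] = imputer.fit_transform(df[num_cols])

# For a categorical column, treat missingness as its own category
df["city"] = df["city"].fillna("Missing")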
2. Encoding Categorical Variables:
 Convert categorical variables into a numerical format suitable for
machine learning algorithms. Common techniques include:
 One-hot encoding: Create binary columns for each
category, where 1 indicates the presence of the category
and 0 indicates absence.
 Label encoding: Map each category to a unique integer.
This is suitable for ordinal categorical variables with a
natural order.
 Target encoding: Encode categorical variables based on
the target variable's mean or other aggregated metrics.
This can be useful for high-cardinality categorical variables.
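A minimal sketch of one-hot and ordinal (label) encoding, again with hypothetical column names:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["city"])

# Ordinal encoding for a category with a natural order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df[["size"]] = ord_enc.fit_transform(df[["size"]])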
3. Feature Scaling:
 Scale numerical features to a similar range to prevent features
with larger magnitudes from dominating the model. Common
scaling techniques include:
 Min-max scaling (Normalization): Scale features to a range
between 0 and 1.
 Standardization: Transform features to have a mean of 0
and a standard deviation of 1.
 Robust scaling: Scale features using median and
interquartile range to handle outliers.
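A minimal sketch of the three scaling options with scikit-learn; num_cols is a hypothetical list of numerical columns:

from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

num_cols = ["age", "income"]

# Min-max scaling to the range [0, 1]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# Alternatives: standardization (mean 0, std 1) or robust scaling
# df[num_cols] = StandardScaler().fit_transform(df[num_cols])
# df[num_cols] = RobustScaler().fit_transform(df[num_cols])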
4. Feature Engineering:
 Create new features or transform existing ones to capture meaningful information and improve the model's performance. Feature engineering techniques include:
 Polynomial features: Generate polynomial combinations of
features to capture nonlinear relationships.
 Interaction terms: Create new features by taking the
product or ratio of existing features.
 Domain-specific transformations: Apply domain knowledge
to create features relevant to the problem.
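A minimal sketch of an interaction feature and polynomial features; the column names are hypothetical:

from sklearn.preprocessing import PolynomialFeatures

# A hand-crafted ratio feature
df["rooms_per_household"] = df["total_rooms"] / df["households"]

# Degree-2 polynomial combinations of two selected features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[["age", "income"]])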
5. Dimensionality Reduction:
 Reduce the number of features to alleviate the curse of
dimensionality and improve computational efficiency. Techniques
include:
 Principal Component Analysis (PCA): Project data onto a
lower-dimensional subspace while preserving the
maximum variance.
 Feature selection: Select a subset of relevant features
based on statistical tests, feature importance scores, or
domain knowledge.
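A minimal PCA sketch, assuming X_scaled is an already-scaled feature matrix:

from sklearn.decomposition import PCA

# Keep as many components as needed to preserve 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())  # variance actually preserved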
6. Data Splitting:
 Split the dataset into training, validation, and test sets to
evaluate the model's performance. Common splits include:
 Training set: Used to train the model.
 Validation set: Used to tune hyperparameters and assess
model performance during training.
 Test set: Held out for final evaluation to estimate the
model's generalization performance on unseen data.
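A minimal sketch producing a 60/20/20 split, assuming a feature matrix X and target vector y:

from sklearn.model_selection import train_test_split

# First carve out the test set, then split the rest into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)  # 0.25 of 80% = 20%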
7. Data Pipeline:
 Create a data preprocessing pipeline that encapsulates all the
data preparation steps. This ensures consistency and facilitates
reproducibility when applying the preprocessing steps to new
data.
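A minimal sketch of such a pipeline with scikit-learn's Pipeline and ColumnTransformer; the column lists are hypothetical:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["age", "income"]
cat_cols = ["city"]

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Fit on training data only, then reuse the fitted pipeline on new data
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)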
8. Documentation and Versioning:
 Document all data preparation steps, including assumptions,
transformations, and preprocessing techniques applied. Version
control the data preprocessing pipeline to track changes and
ensure reproducibility.
Select and Train a Model:

Selecting and training a model involves choosing an appropriate algorithm, training it on the prepared dataset, and evaluating its performance. Here's a step-by-step guide:

1. Choose a Model:
 Select a machine learning algorithm suitable for your problem based on factors such as the nature of the task (e.g., classification, regression), the size of the dataset, interpretability requirements, and computational resources.
 Common algorithms include linear regression, logistic regression,
decision trees, random forests, support vector machines (SVM),
k-nearest neighbors (KNN), and neural networks.
2. Prepare the Data:
 Ensure that the dataset is properly cleaned, preprocessed, and
split into training and testing sets as described earlier in the data
preparation process.
3. Train the Model:
 Fit the selected model to the training data using the fit() method
or equivalent in your chosen machine learning library (e.g.,
scikit-learn in Python).
 Provide the training features (X_train) and the corresponding target labels (y_train) as input to the fit() method.
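A minimal training sketch using one of the algorithms listed above, assuming X_train and y_train from the data preparation step:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)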
4. Model Evaluation:
 Evaluate the trained model's performance using appropriate
evaluation metrics based on the problem type (e.g., accuracy,
precision, recall, F1-score for classification; mean squared error,
R-squared for regression).
 Calculate the performance metrics on the test set using the predict() method to generate predictions and compare them with the actual target labels (y_test).
 Visualize the model's performance using relevant plots such as
confusion matrices, ROC curves (for binary classification), or
calibration plots.
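A minimal evaluation sketch for a classification task, continuing from the model fitted above:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))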
5. Hyperparameter Tuning:
 Fine-tune the model's hyperparameters to improve its
performance. This involves searching over a predefined
hyperparameter space using techniques like grid search or
random search.
 Use cross-validation to estimate the model's performance on
different subsets of the training data and avoid overfitting.
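A minimal grid-search sketch for the random forest above; the parameter grid is a hypothetical example:

from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)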
6. Model Selection:
 Compare the performance of different models using cross-validation or a separate validation set.
 Select the model with the best performance based on the
evaluation metrics and your specific requirements (e.g.,
accuracy, interpretability, computational efficiency).
7. Training Pipeline:
 Create a training pipeline that encapsulates the data preparation, model training, and evaluation steps. This ensures reproducibility and facilitates experimentation with different algorithms and hyperparameters.
8. Documentation and Reporting:
 Document the model selection process, including the chosen
algorithm, hyperparameters, and evaluation results.
 Provide insights into the model's strengths, weaknesses, and
potential areas for improvement.
