UNIT - 2 ML
Working with real data in data preparation for machine learning involves
several steps to ensure the data is properly formatted, cleaned, and
preprocessed for use in training a machine learning model. Here's a general
guide to the data preparation process:
1. Data Collection: Obtain the dataset from reliable sources. This could
be from databases, APIs, CSV files, Excel spreadsheets, or any other
structured data format.
2. Exploratory Data Analysis (EDA): Perform EDA to understand the
structure, distribution, and characteristics of the data. This involves:
Checking for missing values.
Computing summary statistics (mean, median, min, max, etc.).
Plotting visualizations (histograms, box plots, scatter plots, etc.) to
understand relationships and distributions.
Identifying outliers and anomalies.
3. Data Cleaning:
Handle missing values: Impute missing values (using mean,
median, mode, or more sophisticated methods), or remove
rows/columns with missing data depending on the amount of
missingness and the nature of the problem.
Deal with outliers: Decide whether to remove outliers or
transform them to mitigate their impact on the model.
Address inconsistencies and errors in data: This might involve
correcting typos, standardizing formats, or resolving
inconsistencies in categorical variables.
4. Feature Engineering:
Create new features: Combine existing features or derive new
ones that might be more informative for the model.
Encode categorical variables: Convert categorical variables into
numerical representations using techniques like one-hot
encoding, label encoding, or embeddings.
Feature scaling: Scale numerical features to a similar range (e.g.,
using min-max scaling or standardization) to prevent features
with large values from dominating the model.
5. Data Transformation:
Standardize the data: Rescale the features to have a mean of 0 and a
standard deviation of 1 (standardization, often loosely called
normalization), which can improve convergence during training.
Dimensionality reduction: If dealing with high-dimensional data,
use Principal Component Analysis (PCA) to project onto fewer
components while preserving most of the variance, or use feature
selection to keep only the most informative original features.
6. Data Splitting:
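Split the data into training, validation, and test sets so that model
performance is measured on data the model has not seen, which makes
overfitting detectable. Fit imputation and scaling on the training set
only and reuse those fitted values on the other sets to avoid data leakage.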
7. Data Preprocessing Pipeline:
Create a preprocessing pipeline that encapsulates all the data
preparation steps. This ensures consistency and allows easy
application to new data.
8. Iterative Process: Data preparation is often an iterative process. You
may need to revisit previous steps based on insights gained during
model training and evaluation.
9. Documentation: Document all the steps taken during data
preparation, including any assumptions made or decisions taken. This
documentation is crucial for reproducibility and collaboration.
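Steps 3 to 7 above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes pandas and scikit-learn are installed, and the dataset, the column names (age, income, city, bought), and the choice of median imputation, standardization, and one-hot encoding are all hypothetical, picked only to show the idea.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset; the numeric columns contain missing values.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29, 50, np.nan],
    "income": [30000, 52000, 47000, np.nan, 61000, 39000, 80000, 45000],
    "city": ["Pune", "Delhi", "Pune", "Pune", "Mumbai", "Delhi", "Mumbai", "Pune"],
    "bought": [0, 1, 0, 1, 1, 0, 1, 0],
})

X = df.drop(columns="bought")  # features
y = df["bought"]               # target

# Data splitting: split first so that imputation and scaling statistics
# are learned from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Data cleaning + scaling for numeric columns:
# fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Preprocessing pipeline: one object that applies the right steps
# to each group of columns. sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        # Encoding: one-hot encode categories; ignore unseen ones later.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

# Fit on the training set, then reuse the fitted pipeline on the test set.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Because the pipeline is fitted once and then only applied, the same transformation can later be reused on new data without recomputing medians or scaling factors.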
Looking at the big picture in data preparation for machine learning involves
understanding the overarching goals, challenges, and best practices that
guide the entire process. Here's an overview:
Discovering and visualizing the data is a crucial step in data preparation for
machine learning. Here's a guide on how to perform exploratory data
analysis (EDA) to gain insights:
1. Load the Data: Start by loading your dataset into your preferred data
analysis environment, such as Python with Pandas and NumPy for data
handling and Matplotlib/Seaborn for visualization.
2. Basic Data Exploration:
Check the first few rows of the dataset using the .head() function
to understand its structure.
Check the dimensions of the dataset (number of rows and
columns) using the .shape attribute.
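The exploration steps listed so far might look like the sketch below. The small DataFrame and its column names are hypothetical and built in memory so the example runs as-is; with a real dataset you would typically load a file instead, for example with pd.read_csv. The missing-value and summary-statistics checks are included because they are part of EDA as described earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset built in memory; a real one would usually be
# loaded from a file, e.g. df = pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "area_sqft": [850, 1200, np.nan, 1500, 980, 2100],
    "bedrooms": [2, 3, 2, 4, 2, 5],
    "price": [45.0, 72.5, 50.0, 95.0, np.nan, 150.0],
})

# First few rows (5 by default) to see the structure of the data.
print(df.head())

# Dimensions as a (rows, columns) tuple.
print(df.shape)

# Number of missing values in each column.
print(df.isna().sum())

# Summary statistics (count, mean, std, min, quartiles, max).
print(df.describe())
```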
1. Choose a Model: