0% found this document useful (0 votes)

27 views

Unit 1

Uploaded by

binokad912

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

27 views

Unit 1

Uploaded by

binokad912

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Unit I Feature Engineering

What is a Feature?

In the context of machine learning, a feature is an individual measurable property or

characteristic of a data point that is used as input for a machine learning algorithm.

Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model.

Data pre-processing

Data preprocessing is a step in the data mining and data analysis process that takes raw data
and transforms it into a format that can be understood and analyzed by computers and
machine learning.

Raw, real-world data in the form of text, images, video, etc., is messy. Not only may it
contain errors and inconsistencies, but it is often incomplete, and doesn’t have a regular,
uniform design.

Machines like to process nice and tidy information – they read data as 1s and 0s. So
calculating structured data, like whole numbers and percentages is easy.
However, unstructured data, in the form of text and images must first be cleaned and
formatted before analysis.

Data Preprocessing Importance

Data pre-processing is an important step in the data mining process that involves cleaning
and transforming raw data to make it suitable for analysis.

When using data sets to train machine learning models, you’ll often hear the phrase “garbage
in, garbage out” This means that if you use bad or “dirty” data to train your model, you’ll
end up with a bad, improperly trained model that won’t actually be relevant to your analysis.

Good, preprocessed data is even more important than the most powerful algorithms, to the
point that machine learning models trained with bad data could actually be harmful to the
analysis you’re trying to do – giving you “garbage” results.

It is an important step in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data pre-processing is
to improve the quality of the data and to make it more suitable for the specific data mining
task.
Some common steps in data pre-processing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation(bad quality), removal, and train Missing data

There are a number of ways to correct for missing data, but the two most common are:

Ignore the tuples: A tuple is an ordered list or sequence of numbers. If multiple values are
missing within tuples, you may simply discard the tuples with that missing information. This
is only recommended for large data sets, when a few ignored tuples won’t harm further
analysis.

Manually fill in missing data: This can be tedious, but is definitely necessary when working
with smaller data sets.

Noisy data: Data cleaning also includes fixing “noisy” data. This is data that includes
unnecessary data points, irrelevant data, and data that’s more difficult to group together.

Binning: Binning sorts data of a wide data set into smaller groups of more similar data.
It’s often used when analyzing demographics. Income, for example, could be grouped:
$35,000-$50,000, $50,000-$75,000, etc.

Regression: Regression is used to decide which variables will actually apply to your analysis.
Regression analysis is used to smooth large amounts of data. This will help you get a
handle on your data, so you’re not overburdened with unnecessary data.

Clustering: Clustering algorithms are used to properly group data, so that it can be
analyzed with like data. They’re generally used in unsupervised learning, when not a lot is
known about the relationships within your data.

If you’re working with text data, for example, some things you should consider when
cleaning your data are:

 Remove URLs, symbols, emojis, etc., that aren’t relevant to your analysis
 Translate all text into the language you’ll be working in
 Remove HTML tags
 Remove boilerplate email text
 Remove unnecessary blank text between words
 Remove duplicate data

After data cleaning, you may realize you have insufficient data for the task at hand. At this
point you can also perform data wrangling or data enrichment to add new data sets and
run them through quality assessment and cleaning again before adding them to your original
data.
Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can be
used for data integration.

Normalization: The main purpose of data normalization is to minimize or even exclude

duplicated data.

It is the process of developing clean data. This includes eliminating redundant and
unstructured data and making the data appear similar across all records and fields.

Normalization changes the observations so that they can be described as normal distribution

Scaling: In scaling, you're changing the range of the data

By scaling the variables, one can help compare different variables on an equal footing. The
subplots below can give us a better insight.

Normal distribution:Also known as the “bell curve”, this is a specific statistical distribution
where roughly equal observations fall above and below the mean, the mean and the
median are the same, and there are more observations closer to the mean. The normal
distribution is also known as the Gaussian distribution.

The shape of our data has changed. Before normalizing it was almost L-shaped. But after
normalizing it looks more like a “bell curve”.
The shape of the data doesn’t change, but the scale on the X-axis changes (0 to 1) instead
of ranging from 0 to 8.

Standardization:

Standardization is a technique that transforms data to have a mean of 0 and a standard

deviation of 1, making it easier to compare and combine features. Standardization is useful

when:

1. Features have different distributions: Standardization helps to transform features

with different distributions into a common distribution, making it easier to compare

and combine them.

2. Model assumes normality: Some models, like linear regression, assume normality of the

features. Standardization helps to transform features into a normal distribution,

making it easier to meet this assumption.

3. Feature importance needs to be evaluated: Standardization helps to evaluate the

importance of each feature by scaling them to a common range, making it easier to

compare their coefficients.

The main difference between normalization and standardization is that normalization scales
the data to a common range, while standardization transforms the data to have a mean of
0 and a standard deviation of 1.
When to choose normalization:

 When features have different units or scales.

 When the model is sensitive to feature scales.
 When the goal is to prevent features with large ranges from dominating the model.

When to choose standardization:

 When features have different distributions.

 When the model assumes normality of the features.
 When the goal is to evaluate the importance of each feature.

PRINCIPAL COMPONENT ANALYSIS(PCA)

It is a tool which is used to reduce the dimension of the data. It allows us to reduce the
dimension of the data without much loss of information. PCA reduces the dimension by
finding a few orthogonal linear combinations (principal components) of the original variables
with the largest variance. The first principal component captures most of the variance in the
data. The second principal component is orthogonal to the first principal component and
captures the remaining variance, which is left of first principal component and so on.

KERNEL PCA: PCA is a linear method. That is it can only be applied to datasets which are
linearly separable. It does an excellent job for datasets, which are linearly separable. But, if
we use it to non-linear datasets, we might get a result which may not be the optimal
dimensionality reduction. Kernel PCA uses a kernel function to project dataset into a
higher dimensional feature space, where it is linearly separable. It is similar to the idea of
Support Vector Machines. There are various kernel methods like linear, polynomial, and
gaussian.

Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for
nonlinear dimensionality reduction. It is an extension of the classical Principal Component
Analysis (PCA) algorithm, which is a linear method that identifies the most significant
features or components of a dataset. KPCA applies a nonlinear mapping function to the data
before applying PCA, allowing it to capture more complex and nonlinear relationships
between the data points.

Objectives of PCA:

It is basically a non-dependent procedure in which it reduces attribute space from a large

number of variables to a smaller number of factors.

PCA is basically a dimension reduction process but there is no guarantee that the dimension
is interpretable.

The main task in this PCA is to select a subset of variables from a larger set, based on which
original variables have the highest correlation with the principal amount.
Identifying patterns: PCA can help identify patterns or relationships between variables that
may not be apparent in the original data. By reducing the dimensionality of the data, PCA can
reveal underlying structures that can be useful in understanding and interpreting the
data.

Feature extraction: PCA can be used to extract features from a set of variables that are more
informative or relevant than the original variables. These features can then be used in
modeling or other analysis tasks.

Data compression: PCA can be used to compress large datasets by reducing the number of
variables needed to represent the data, while retaining as much information as possible.

Noise reduction: PCA can be used to reduce the noise in a dataset by identifying and
removing the principal components that correspond to the noisy parts of the data.

Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional

space, making it easier to interpret and understand. By projecting the data onto the principal
components, patterns and relationships between variables can be more easily visualized.

Uses of PCA:

 It is used to find interrelations between variables in the data.

 It is used to interpret and visualize data.
 The number of variables is decreasing which makes further analysis simpler.
 It’s often used to visualize genetic distance and relatedness between populations.

Local Binary Patterns (LBP): is a type of visual descriptor used for classification in CV.

LBP summarizes texture information by analyzing pixel patterns.

It’s invariant to illumination, translation, and scaling.

Originally proposed in 1994, LBP is a particular case of the Texture Spectrum model
introduced in 1990.

How It Works:

LBP describes the neighbourhood of image elements using binary codes.

It’s commonly used for texture classification, face recognition, and gender recognition.

By studying local properties, LBP identifies characteristics of individual parts of an image2.

Image Representation:

Images are represented as matrices of pixels.

Each pixel has three color channels (red, green, blue), and their values range from 0 to 255.

LBP operates on local neighborhoods within the image, comparing pixel intensities to create
binary patterns

Local Binary Pattern (LBP) has several advantages when applied in DS & CV

1. Texture Analysis:

LBP is excellent for texture analysis. It captures local patterns in an image, making it useful
for tasks like texture classification and segmentation.

2.Robustness to Illumination Changes:

LBP is invariant to monotonic illumination changes. It focuses on relative pixel differences,

which makes it robust in varying lighting conditions.

3.Simple and Efficient:

LBP is computationally efficient and straightforward to implement.

It doesn’t require complex mathematical operations, making it suitable for real-time

applications.

4.Face Recognition:

LBP is commonly used in face recognition systems.

It extracts discriminative features from facial images, allowing accurate recognition even
with variations in pose and lighting.

5.Histogram Representation:

LBP generates histograms of local patterns, which can be used as features for machine
learning models.

These histograms capture texture information effectively.

6.Spatial Information:

LBP considers the spatial arrangement of pixels, which helps in capturing local context.

It’s useful for tasks like object detection and tracking.

Remember that LBP is just one of many techniques in data science, but its simplicity and
effectiveness make it a valuable tool in various applications

Sequential Forward Selection (SFS) is a feature selection technique used in data science
and big data analysis. It is a type of greedy algorithm that iteratively adds features to a dataset
to improve the performance of a predictive model. Here’s how it works:
1. Initialization: The selector is initialized with a predictive model, the number of
features to select, the scoring metric, and the tolerance for improvement.
2. Model Fitting: The selector fits the predictive model on the full set of features.
3. Model Evaluation: The model is evaluated on the training set using the scoring
metric.
4. Feature Addition: The feature that most improves the model’s cross-validation score
is added to the selected features set.
5. Iteration: The selector repeats steps 2-4 until the desired number of features has been
selected.

This method is particularly useful when dealing with a large number of features, as it
incrementally builds the model based on the most informative features. It involves assessing
new features, evaluating combinations of features, and selecting the optimal subset of
features that best contribute to model accuracy.

Sequential Backward Selection (SBS) is a feature selection technique used in machine

learning and data science. It is a type of greedy algorithm that iteratively removes features
from a dataset to improve the performance of a predictive model. Here’s how it works:

1. Initialization: The selector is initialized with a predictive model, the number of

features to select, the scoring metric, and the tolerance for improvement.
2. Model Fitting: The selector fits the predictive model on the full set of features.
3. Model Evaluation: The model is evaluated on the training set using the scoring
metric.
4. Feature Removal: The feature that least reduces the model’s cross-validation score is
removed from the selected features set2.
5. Iteration: The selector repeats steps 2-4 until the desired number of features has been
selected.

Image Segmentation Using K Mean Algorithm
No ratings yet
Image Segmentation Using K Mean Algorithm
5 pages
Microsoft Excel Statistical and Advanced Functions for Decision Making
From Everand
Microsoft Excel Statistical and Advanced Functions for Decision Making
Palani Murugappan
4/5 (2)
Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
From Everand
Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
Artem Kovera
No ratings yet
253777
No ratings yet
253777
66 pages
CH1-data Preprocessing
No ratings yet
CH1-data Preprocessing
49 pages
Down 2
No ratings yet
Down 2
61 pages
JAVA Advanced 3
No ratings yet
JAVA Advanced 3
19 pages
Data Mining
No ratings yet
Data Mining
5 pages
Data Preprocessing Techniques Cleaning Transformation and Integration
No ratings yet
Data Preprocessing Techniques Cleaning Transformation and Integration
6 pages
Data Preprocessing (Sagar)
No ratings yet
Data Preprocessing (Sagar)
31 pages
Data Warehouse and Data Mining- Definition and Concepts
No ratings yet
Data Warehouse and Data Mining- Definition and Concepts
20 pages
Data Mining and Knowledge Discovery
No ratings yet
Data Mining and Knowledge Discovery
65 pages
DAI101 4 Data Preparation (1)
No ratings yet
DAI101 4 Data Preparation (1)
45 pages
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
No ratings yet
Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU
21 pages
Screenshot 2025-04-09 at 10.35.12 AM
No ratings yet
Screenshot 2025-04-09 at 10.35.12 AM
31 pages
03 Pre Processing
No ratings yet
03 Pre Processing
89 pages
Unit 3
No ratings yet
Unit 3
164 pages
Week 2
No ratings yet
Week 2
96 pages
Preprocessing Techniques
No ratings yet
Preprocessing Techniques
63 pages
Mod1 DM Part2
No ratings yet
Mod1 DM Part2
34 pages
Data Preprocessing
No ratings yet
Data Preprocessing
63 pages
Bana Reviewer
No ratings yet
Bana Reviewer
4 pages
03 Preprocessing
No ratings yet
03 Preprocessing
63 pages
Data Preprocessing
No ratings yet
Data Preprocessing
77 pages
Data Mining and Business Intelligence
No ratings yet
Data Mining and Business Intelligence
52 pages
Chapter 3
No ratings yet
Chapter 3
63 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
52 pages
02 Data Warehouse
No ratings yet
02 Data Warehouse
18 pages
Unit 3.2
No ratings yet
Unit 3.2
45 pages
DM_merged
No ratings yet
DM_merged
169 pages
Unit2 Part2
No ratings yet
Unit2 Part2
67 pages
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
No ratings yet
IT446 Wk03.2 HanKamberPei 03preprocessing PDF
64 pages
DM Lect3
No ratings yet
DM Lect3
41 pages
Chapter 3: Data Preprocessing
No ratings yet
Chapter 3: Data Preprocessing
30 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
54 pages
03Preprocessing
No ratings yet
03Preprocessing
59 pages
Pre Processing
No ratings yet
Pre Processing
68 pages
Lec7
No ratings yet
Lec7
45 pages
AI351 Lecture 1
No ratings yet
AI351 Lecture 1
32 pages
21BCAD5C01 IDA Module 2 Notes
No ratings yet
21BCAD5C01 IDA Module 2 Notes
16 pages
03Preprocessing
No ratings yet
03Preprocessing
65 pages
03 Pre Processing
No ratings yet
03 Pre Processing
63 pages
Q.1. Why Is Data Preprocessing Required?
100% (1)
Q.1. Why Is Data Preprocessing Required?
26 pages
HIT391-week 3-New
No ratings yet
HIT391-week 3-New
43 pages
Chapter 3 - Data Pre-Processing Notes
No ratings yet
Chapter 3 - Data Pre-Processing Notes
8 pages
Unit 2 Data Preprocessing
No ratings yet
Unit 2 Data Preprocessing
40 pages
Chapter 3 - Tagged
No ratings yet
Chapter 3 - Tagged
63 pages
Spatial and Temporal Data Mining
No ratings yet
Spatial and Temporal Data Mining
52 pages
Data preprocessing (1)
No ratings yet
Data preprocessing (1)
77 pages
Preprocessing 935
No ratings yet
Preprocessing 935
68 pages
Module 5 03preprocessing
No ratings yet
Module 5 03preprocessing
63 pages
Wk6 Preprocessing
No ratings yet
Wk6 Preprocessing
64 pages
Insy662 - f23 - Week 1
No ratings yet
Insy662 - f23 - Week 1
21 pages
Data Binning
No ratings yet
Data Binning
9 pages
Week2-2
No ratings yet
Week2-2
25 pages
Data Preprocessing
No ratings yet
Data Preprocessing
48 pages
What Is Data Preprocessing
No ratings yet
What Is Data Preprocessing
4 pages
Unit 3 Data Warehousing and Data Mining
No ratings yet
Unit 3 Data Warehousing and Data Mining
7 pages
03 Data Preparation
No ratings yet
03 Data Preparation
28 pages
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet
Algorithms and Data Structures: An Easy Guide to Programming Skills
From Everand
Algorithms and Data Structures: An Easy Guide to Programming Skills
Rigdon Jonathan
No ratings yet
Complete Download Post Mining of Association Rules Techniques for Effective Knowledge Extraction 1st Edition Yanchang Zhao PDF All Chapters
100% (6)
Complete Download Post Mining of Association Rules Techniques for Effective Knowledge Extraction 1st Edition Yanchang Zhao PDF All Chapters
71 pages
Tutorial 2: Answer The Following Question. 1. Name Some Logisticregression Problems and Deeply Explain One of Them
No ratings yet
Tutorial 2: Answer The Following Question. 1. Name Some Logisticregression Problems and Deeply Explain One of Them
2 pages
CC-Test Answers
100% (1)
CC-Test Answers
3 pages
Library-Syllabus-Skill-Test (1)
No ratings yet
Library-Syllabus-Skill-Test (1)
8 pages
Assignment 2 Task Sheet
No ratings yet
Assignment 2 Task Sheet
3 pages
Fahad Assignment
No ratings yet
Fahad Assignment
4 pages
Deep Learning Assignment
No ratings yet
Deep Learning Assignment
8 pages
Resume Mysql
No ratings yet
Resume Mysql
3 pages
281510lecture - 1 Introduction To MongoDB-1718181125331
No ratings yet
281510lecture - 1 Introduction To MongoDB-1718181125331
22 pages
Sarir Gas Utilization Project J-22 OKP037: GC2 Isometric
No ratings yet
Sarir Gas Utilization Project J-22 OKP037: GC2 Isometric
6 pages
Ads&a Mid-1 QB
No ratings yet
Ads&a Mid-1 QB
3 pages
Library Managment System
No ratings yet
Library Managment System
186 pages
Chapter A1
No ratings yet
Chapter A1
32 pages
Resume 2023
No ratings yet
Resume 2023
1 page
Where can buy Security and Privacy in Communication Networks 16th EAI International Conference SecureComm 2020 Washington DC USA October 21 23 2020 Proceedings Part II Noseong Park ebook with cheap price
100% (2)
Where can buy Security and Privacy in Communication Networks 16th EAI International Conference SecureComm 2020 Washington DC USA October 21 23 2020 Proceedings Part II Noseong Park ebook with cheap price
54 pages
datamining1
No ratings yet
datamining1
7 pages
Module 3
No ratings yet
Module 3
3 pages
NLP QB
No ratings yet
NLP QB
5 pages
Oracle Cloud Data Management 2024 Foundations Associate-DUMP
No ratings yet
Oracle Cloud Data Management 2024 Foundations Associate-DUMP
10 pages
Install Software Application _Final Exam_Answer
No ratings yet
Install Software Application _Final Exam_Answer
3 pages
AI-Artificial-Intelligence-for-Conflict-Resolution-and-Negotiation_-Enhancing-Mediation-and-Collaboration-Through-Intelligent-Technology
No ratings yet
AI-Artificial-Intelligence-for-Conflict-Resolution-and-Negotiation_-Enhancing-Mediation-and-Collaboration-Through-Intelligent-Technology
4 pages
Image To Text Encryption and Decryption Using Modified RSA Algorithm
No ratings yet
Image To Text Encryption and Decryption Using Modified RSA Algorithm
5 pages
ETL Total
No ratings yet
ETL Total
2 pages
Unit-Iv Schema Refinement and Normalisation: Unit 4 Contents at A Glance
No ratings yet
Unit-Iv Schema Refinement and Normalisation: Unit 4 Contents at A Glance
26 pages
Natural Language Processing(Peiii)
No ratings yet
Natural Language Processing(Peiii)
2 pages
Hoang CV
No ratings yet
Hoang CV
2 pages
Position Paper
No ratings yet
Position Paper
3 pages
BZMC -JAIZ BANK -DATA ANALYTICS ARTIFILCIAL INTELLIGENCE AND MACHINE LEARNING IN FINANCE (1)
No ratings yet
BZMC -JAIZ BANK -DATA ANALYTICS ARTIFILCIAL INTELLIGENCE AND MACHINE LEARNING IN FINANCE (1)
79 pages
OCS353_Review Questions
No ratings yet
OCS353_Review Questions
3 pages

Unit 1

Uploaded by

Unit 1

Uploaded by

Unit I Feature Engineering

In the context of machine learning, a feature is an individual measurable property or

Data Preprocessing Importance

Normalization: The main purpose of data normalization is to minimize or even exclude

Scaling: In scaling, you're changing the range of the data

Standardization is a technique that transforms data to have a mean of 0 and a standard

deviation of 1, making it easier to compare and combine features. Standardization is useful

1. Features have different distributions: Standardization helps to transform features

with different distributions into a common distribution, making it easier to compare

and combine them.

features. Standardization helps to transform features into a normal distribution,

making it easier to meet this assumption.

3. Feature importance needs to be evaluated: Standardization helps to evaluate the

importance of each feature by scaling them to a common range, making it easier to

compare their coefficients.

 When features have different units or scales.

When to choose standardization:

 When features have different distributions.

PRINCIPAL COMPONENT ANALYSIS(PCA)

It is basically a non-dependent procedure in which it reduces attribute space from a large

Visualization: PCA can be used to visualize high-dimensional data in a lower-dimensional

 It is used to find interrelations between variables in the data.

LBP summarizes texture information by analyzing pixel patterns.

It’s invariant to illumination, translation, and scaling.

LBP describes the neighbourhood of image elements using binary codes.

By studying local properties, LBP identifies characteristics of individual parts of an image2.

Images are represented as matrices of pixels.

2.Robustness to Illumination Changes:

LBP is invariant to monotonic illumination changes. It focuses on relative pixel differences,

3.Simple and Efficient:

LBP is computationally efficient and straightforward to implement.

It doesn’t require complex mathematical operations, making it suitable for real-time

LBP is commonly used in face recognition systems.

These histograms capture texture information effectively.

It’s useful for tasks like object detection and tracking.

Sequential Backward Selection (SBS) is a feature selection technique used in machine

1. Initialization: The selector is initialized with a predictive model, the number of

You might also like