Updated Lecture 13 Zainab
Outline
• Feature Engineering
• Feature Transformation
• Feature Subset Selection
• Feature Scaling
• Transforming Data to be Gaussian: an Introduction to
Power Transformations
• Principal Component Analysis
Cross-Validation Performance
Cross-validation may not perform well or may not be suitable
in certain situations:
• Small Datasets: e.g., a dataset with only 50 samples
• Imbalanced Datasets: e.g., a dataset for fraud detection
• Temporal Data: e.g., predicting stock prices using historical data
• Model Evaluation for Deployment: e.g., estimating the real-world
performance of a final model trained on all available data
Splitters suited to the imbalanced and temporal cases are sketched below.
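For the imbalanced and temporal cases, scikit-learn provides purpose-built splitters. The following is a minimal sketch; the data here is hypothetical, not from the lecture:

import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)          # hypothetical feature matrix
y = np.array([0] * 8 + [1] * 2)           # imbalanced labels (e.g., fraud)

# StratifiedKFold keeps the class ratio in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    print("stratified test labels:", y[test_idx])

# TimeSeriesSplit only trains on the past and tests on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to index", train_idx[-1], "-> test", test_idx)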
• Feature subset selection: Continuing with the housing dataset, suppose it also
contains features like the color of the house, the make of the appliances, and the
type of flooring. Through feature selection techniques, we might discover that these
features do not significantly contribute to predicting the house price. Thus, we
decide to exclude them from the model, focusing only on the most important
features such as square footage, number of bedrooms, and location.
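As an illustration of this idea, the sketch below scores hypothetical housing features with scikit-learn's SelectKBest; the column names and data are invented for the example:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, 100),
    "bedrooms": rng.integers(1, 6, 100),
    "color_code": rng.integers(0, 10, 100),   # likely irrelevant to price
})
y = 100 * X["sqft"] + 20000 * X["bedrooms"] + rng.normal(0, 1e4, 100)

# Keep the k features with the strongest linear relationship to price.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(dict(zip(X.columns, selector.scores_.round(1))))
print("selected:", list(X.columns[selector.get_support()]))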
Principal Component Analysis
• In the modern era of machine learning, data scientists grapple with a substantial
number of variables, particularly in fields like computer vision.
• Challenge of High Dimensionality:
• Example: In computer vision, images are represented in terms of pixels.
• 4K Image: Resolution of 3840 x 2160 pixels.
• Challenge: Processing such an image involves dealing with 24,883,200 variables
(3840 x 2160 = 8,294,400 pixels, multiplied by three color channels: red, green, and blue).
• Issues with High Dimensionality:
• Computational Complexity: High dimensionality increases
computational complexity.
• Overfitting Risk: Greater risk of overfitting due to an abundance of
features.
• Solution: Dimensionality Reduction Techniques
Principal Component Analysis
• To address these challenges, it's essential to reduce dimensionality.
• Objective: Project the data into a lower-dimensional space, mitigating computational
complexity and overfitting risks.
After applying PCA to our housing dataset, we might find that the first principal
component is strongly correlated with the overall size of the house (e.g., square
footage, number of bedrooms, number of bathrooms), while the second principal
component is related to the price of the house.
Suppose that after PCA, we find that the first three principal components explain
95% of the total variance in the dataset. This means that we can represent the
original dataset using just these three principal components, reducing the
dimensionality from, say, 10 original features down to 3 principal components.
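The 95% figure can be checked directly from PCA's explained_variance_ratio_ attribute. The following is a minimal sketch on synthetic 10-feature data (an assumption, not the lecture's dataset):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                 # 3 underlying factors
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(cumulative.round(3))
print("components needed for 95% variance:", n_components)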
Principal Component Analysis
Benefits of PCA:
• Collinearity Prevention: In regression problems, PCA is
employed to prevent or reduce collinearity among
independent variables.
• Efficient Representation: Offers an efficient representation of
the original data with reduced dimensionality.
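A minimal sketch of the collinearity use case, assuming an invented pair of nearly identical features, might look like this:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])  # collinear pair
y = 3 * x1 + rng.normal(size=200)

# PCA hands the regressor uncorrelated components instead of collinear inputs.
model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", round(model.score(X, y), 3))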
Principal Component Analysis
• Algorithm Workflow:
• Identify Dominant Direction: Find the vector carrying the most
information, i.e., the direction of maximum variance in the data.
• Find Orthogonal Directions: Locate subsequent directions orthogonal
to the first, each capturing as much of the remaining variance as possible.
• Dimensionality Reduction: The principal axes become the principal
components, defining the new feature space (see the NumPy sketch below).
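The workflow can be traced step by step with plain NumPy. This is an illustrative sketch on random data, not the lecture's code:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_centered = X - X.mean(axis=0)            # center each feature
cov = np.cov(X_centered, rowvar=False)     # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]          # sort axes by variance, descending
components = eigvecs[:, order[:2]]         # two leading principal axes

X_reduced = X_centered @ components        # project into the 2-D space
print(X_reduced.shape)                     # (100, 2)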
Principal Component Analysis (PCA) on Breast Cancer Data
We will explore the application of PCA for dimensionality reduction and visualization on
breast cancer data.
Step 1: Data Loading. Import the necessary libraries, including scikit-learn for machine
learning operations and matplotlib for plotting, and use scikit-learn's built-in dataset,
load_breast_cancer, to obtain the breast cancer data.

from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load breast cancer data
df = load_breast_cancer()

Using make_pipeline simplifies the creation of pipelines, especially when you have
multiple preprocessing steps and an estimator. It reduces the need for manually naming
each step, making your code more concise and easier to understand. StandardScaler is
used for standardizing features by removing the mean and scaling to unit variance,
which is a common preprocessing step in many machine learning algorithms.
Principal Component Analysis (PCA) on Breast Cancer Data

# Step 2: Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.data)

StandardScaler is a preprocessing step that standardizes the features by removing the
mean and scaling to unit variance. The fit_transform method computes the mean and
standard deviation from the data and then scales the data accordingly.

# Step 3: Apply PCA
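The slide ends at Step 3. A plausible continuation (an assumption, not the original code) would keep two components and plot the projection:

# Keep two components and visualize the classes in the reduced space.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df.target, cmap="coolwarm", s=10)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Breast cancer data projected onto two principal components")
plt.show()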