
Feature Engineering

Outline
• Feature Engineering
• Feature Transformation
• Feature Subset Selection
• Feature Scaling
• Making Data Gaussian: an Introduction to
Power Transformations
• Principal Component Analysis
Cross-Validation Performance
Cross-validation may not perform well or may not be suitable
in certain situations:
• Small Datasets: e.g., a dataset with only 50 samples
• Imbalanced Datasets: e.g., a fraud-detection dataset in which positive cases are rare
• Temporal Data: e.g., predicting stock prices from historical data, where random splits let the model train on the future (a time-aware splitting sketch follows this list)
• Model Evaluation for Deployment: when a final model must be assessed on a single held-out test set that mirrors production conditions
• Complex Model Architectures: e.g., training a deep learning model with millions of parameters on a large image dataset, where retraining for every fold is too costly
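For the temporal-data case, the sketch below shows one common alternative, scikit-learn's TimeSeriesSplit, which always trains on the past and tests on the future; the ten-point series is a synthetic placeholder, not real price data.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten ordered observations standing in for a historical price series
X = np.arange(10).reshape(-1, 1)

# Each split trains only on earlier observations and tests on later ones,
# unlike ordinary k-fold splits, which can let "future" samples leak into training
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)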
Feature Engineering

Feature engineering refers to the process of translating a data set into features such that these features are able to represent the data set more effectively and result in better learning performance.
Feature Engineering
• Feature engineering is an important pre-processing step for
machine learning. It has two major elements:
• Feature transformation: involves changing the representation of
the features in a dataset to make them more suitable for a
machine learning algorithm. This can include scaling,
normalization, or creating new features through mathematical
operations.

• Feature subset selection: involves selecting a subset of the original features in the dataset that are most relevant to the problem at hand. This can help improve model performance by reducing overfitting, speeding up training, and improving interpretability.
Example: Feature Engineering
• Feature transformation: Consider a dataset containing information about
houses, including features like square footage, number of bedrooms, and number of
bathrooms. One common transformation is to create a new feature representing the
total area of the house by adding the square footage of all rooms together.

• Feature subset selection: Continuing with the housing dataset, suppose it also
contains features like the color of the house, the make of the appliances, and the
type of flooring. Through feature selection techniques, we might discover that these
features do not significantly contribute to predicting the house price. Thus, we
decide to exclude them from the model, focusing only on the most important
features such as square footage, number of bedrooms, and location.
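A minimal pandas sketch of both steps, using a tiny made-up housing table; the column names (living_sqft, basement_sqft, house_color, and so on) are illustrative assumptions rather than a real dataset.

import pandas as pd

# Tiny illustrative housing table; values and column names are hypothetical
houses = pd.DataFrame({
    "living_sqft": [1200, 1500, 900],
    "basement_sqft": [400, 0, 300],
    "bedrooms": [3, 4, 2],
    "house_color": ["blue", "white", "grey"],
    "price": [250000, 320000, 180000],
})

# Feature transformation: derive a total-area feature from existing columns
houses["total_sqft"] = houses["living_sqft"] + houses["basement_sqft"]

# Feature subset selection: drop a feature judged irrelevant to the price target
X = houses.drop(columns=["house_color", "price"])
y = houses["price"]
print(X.columns.tolist())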
Principal Component Analysis
• In the modern era of machine learning, data scientists grapple with a substantial
number of variables, particularly in fields like computer vision.
• Challenge of High Dimensionality:
• Example: In computer vision, images are represented in terms of pixels.
• 4K Image: Resolution of 3840 x 2160 pixels.
• Challenge: Processing such an image involves dealing with 24,883,200 variables
(pixels multiplied by three color channels: blue, red, and green).
• Issues with High Dimensionality:
• Computational Complexity: High dimensionality increases
computational complexity.
• Overfitting Risk: Greater risk of overfitting due to an abundance of
features.
• Solution: Dimensionality Reduction Techniques
Principal Component Analysis
• To address these challenges, it's essential to reduce dimensionality.
• Objective: Project the data into a lower-dimensional space, mitigating computational
complexity and overfitting risks.

• Principal Component Analysis (PCA): a renowned dimensionality reduction technique.
• Identify a few principal components that capture as much
information as the original set of predictors.
• Original variables transformed into uncorrelated linear combinations
called principal components (PC).
• PCs ordered so that the first PC accounts for the largest proportion of
the variation in the original features.
Principal Component Analysis
Example
Consider a dataset with information about houses, where some of the features are square footage, number of bedrooms, number of bathrooms, and price. We want to reduce the dimensionality of this dataset using PCA while still capturing as much information as possible.

After applying PCA to our housing dataset, we might find that the first principal
component is strongly correlated with the overall size of the house (e.g., square
footage, number of bedrooms, number of bathrooms), while the second principal
component is related to the price of the house.
Suppose that after PCA, we find that the first three principal components explain
95% of the total variance in the dataset. This means that we can represent the
original dataset using just these three principal components, reducing the
dimensionality from, say, 10 original features down to 3 principal components.
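A minimal sketch of how such a check could be done with scikit-learn's explained_variance_ratio_; the 200-row, 10-feature matrix below is random placeholder data, so the printed numbers will not match the 95% figure from the example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 samples with 10 numeric housing-style features
X = np.random.rand(200, 10)

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Cumulative share of the total variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative)
print("Components needed for 95% of the variance:", n_components_95)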
Principal Component Analysis
Benefits of PCA:
• Collinearity Prevention: In regression problems, PCA is
employed to prevent or reduce collinearity among
independent variables.
• Efficient Representation: Offers an efficient representation of
the original data with reduced dimensionality.
Principal Component Analysis
• Algorithm Workflow:
• Identify Dominant Direction: Look for the vector with the most
information, indicating the direction of maximum correlation among
features.
• Find Orthogonal Directions: Locate subsequent directions orthogonal
to the first, capturing the most information in each.
• Dimensionality Reduction: Principal axes represent the principal
components, defining the new feature space.
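A compact NumPy sketch of this workflow (centering, covariance, eigendecomposition, projection); it illustrates the idea rather than scikit-learn's internal implementation, and the toy data matrix is an assumption.

import numpy as np

def pca_numpy(X, n_components=2):
    # Center the data so each feature has zero mean
    X_centered = X - X.mean(axis=0)
    # The covariance matrix captures how features vary together
    cov = np.cov(X_centered, rowvar=False)
    # Eigenvectors of the covariance matrix are the principal axes
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Order the axes by explained variance, largest first
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # Project the data onto the new, lower-dimensional feature space
    return X_centered @ components

X = np.random.rand(100, 5)            # toy data matrix
X_reduced = pca_numpy(X, n_components=2)
print(X_reduced.shape)                # (100, 2)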
Principal Component Analysis (PCA) on Breast Cancer Data
We will explore the application of PCA for dimensionality reduction and visualization on breast cancer data.

Step 1: Data loading. Import the necessary libraries, including scikit-learn for machine learning operations and matplotlib for plotting. Use scikit-learn's built-in dataset, load_breast_cancer, to obtain the breast cancer data.

from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load breast cancer data
df = load_breast_cancer()

Notes:
• make_pipeline simplifies the creation of pipelines, especially when you have multiple preprocessing steps and an estimator. It reduces the need to manually name each step, making your code more concise and easier to understand.
• StandardScaler is used for standardizing features by removing the mean and scaling to unit variance, which is a common preprocessing step in many machine learning algorithms.
Principal Component Analysis (PCA) on Breast Cancer Data

# Step 2: Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.data)

# Step 3: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Notes:
• StandardScaler standardizes the features by removing the mean and scaling to unit variance. Its fit_transform method computes the mean and standard deviation from the data and then scales the data accordingly.
• PCA is used for dimensionality reduction: it identifies the principal components that capture the most variance in the data. n_components=2 specifies that we want to reduce the dimensionality to 2 components, and fit_transform computes the principal components from the scaled data and transforms the data into the new lower-dimensional space.

The next code snippet visualizes the results after applying PCA to the breast cancer dataset. It creates a scatter plot of the first two principal components, colored by the target class, and then displays an image plot of the principal components.

# Scatter plot of the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df.target)
Principal Component Analysis (PCA) on Breast Cancer Data

# Display the principal components
components = pca.components_
plt.imshow(components.T)
plt.yticks(range(len(df.feature_names)), df.feature_names)
plt.colorbar()
plt.show()

Notes:
• pca.components_ extracts the principal components from the fitted PCA model (when the scaler and PCA are combined with make_pipeline, the same matrix is reached via pipe.named_steps['pca'].components_).
• plt.imshow(components.T) creates an image plot of the transposed principal components matrix.
• plt.yticks(range(len(df.feature_names)), df.feature_names) sets the y-axis ticks and labels to the feature names.
• plt.colorbar() adds a colorbar to the side of the image plot.
Principal Component Analysis (PCA) on Breast
Cancer Data
Contribution of Features to Principal Components:
Observation: The image plot (plt.imshow(components.T)) represents the weights or contributions of
each original feature to the first two principal components.
Interpretation: Darker regions indicate lower contributions, while lighter regions indicate higher
contributions. Each row in the image plot corresponds to a feature, and each column corresponds to a
principal component.
Scaling Impact on First Principal Component:
Observation: All features now contribute to the first principal component after scaling.
Interpretation: Scaling ensures that the magnitude of features does not dominate the first component.
Without scaling, features with larger magnitudes could have disproportionately influenced the first
component.
Correlation Among Features in the First Component:
Observation: All features in the first component have the same sign.
Interpretation: This indicates a positive correlation among all features in the first principal
component. When one feature has a high value, others are likely to have high values as well. It suggests
a shared pattern or tendency among the features.
Mixed Signs in the Second Principal Component:
Observation: The second principal component has mixed signs.
Interpretation: Unlike the first component, the second one captures variations with mixed directions
among features. Some features may have positive contributions, while others have negative
contributions, indicating a more complex relationship.
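One way to verify these sign observations on the same data is to print the sign pattern of the fitted component weights; the snippet below refits the scaler and PCA so it can run on its own.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(data.data))

# Sign pattern of each feature's weight in the two components
print(np.sign(pca.components_[0]))  # first component: weights expected to share one sign
print(np.sign(pca.components_[1]))  # second component: mixed signs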
Principal Component Analysis (PCA) on Breast Cancer Data
After performing Principal Component Analysis (PCA) and obtaining the principal components, you can use these components for building a new model or conducting further analyses. The principal components can serve as a reduced set of features that capture most of the important information in the original data.
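As one illustration of this idea (not part of the original lecture), the two components could feed a simple classifier; the LogisticRegression model and train/test split below are assumptions made for the sketch.

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Recreate the two-component representation of the breast cancer data
data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Fit a simple classifier on the principal components only
X_train, X_test, y_train, y_test = train_test_split(X_pca, data.target, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("Test accuracy on two components:", clf.score(X_test, y_test))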
