Updated Lecture 13 Zainab
Outline
• Feature Engineering
• Feature Transformation
• Feature Subset Selection
• Feature Scaling
• Transforming Data to be Gaussian: an Introduction to
Power Transformations
• Principal Component Analysis
Cross-Validation Performance
Cross-validation may not perform well or may not be suitable
in certain situations:
• Small Datasets: e.g., a dataset with only 50 samples
• Imbalanced Datasets: e.g., a dataset for fraud detection
• Temporal Data: e.g., predicting stock prices using historical data
• Model Evaluation for Deployment: e.g., estimating the real-world
performance of a final model trained on all available data
Splitters suited to the imbalanced and temporal cases are sketched below.
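For the imbalanced and temporal cases, scikit-learn provides purpose-built splitters. The following is a minimal sketch; the data here is hypothetical, not from the lecture:

import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(10, 2)          # hypothetical feature matrix
y = np.array([0] * 8 + [1] * 2)           # imbalanced labels (e.g., fraud)

# StratifiedKFold keeps the class ratio in every fold.
for train_idx, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    print("stratified test labels:", y[test_idx])

# TimeSeriesSplit only trains on the past and tests on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to index", train_idx[-1], "-> test", test_idx)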
• Feature subset selection: Continuing with the housing dataset, suppose it also
contains features like the color of the house, the make of the appliances, and the
type of flooring. Through feature selection techniques, we might discover that these
features do not significantly contribute to predicting the house price. Thus, we
decide to exclude them from the model, focusing only on the most important
features such as square footage, number of bedrooms, and location.
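As an illustration of this idea, the sketch below scores hypothetical housing features with scikit-learn's SelectKBest; the column names and data are invented for the example:

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "sqft": rng.uniform(500, 4000, 100),
    "bedrooms": rng.integers(1, 6, 100),
    "color_code": rng.integers(0, 10, 100),   # likely irrelevant to price
})
y = 100 * X["sqft"] + 20000 * X["bedrooms"] + rng.normal(0, 1e4, 100)

# Keep the k features with the strongest linear relationship to price.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(dict(zip(X.columns, selector.scores_.round(1))))
print("selected:", list(X.columns[selector.get_support()]))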
Principal Component Analysis
• In the modern era of machine learning, data scientists grapple with a substantial
number of variables, particularly in fields like computer vision.
• Challenge of High Dimensionality:
• Example: In computer vision, images are represented in terms of pixels.
• 4K Image: Resolution of 3840 x 2160 pixels.
• Challenge: Processing such an image involves dealing with 24,883,200 variables
(3840 x 2160 = 8,294,400 pixels, multiplied by three color channels: red, green, and blue).
• Issues with High Dimensionality:
• Computational Complexity: High dimensionality increases
computational complexity.
• Overfitting Risk: Greater risk of overfitting due to an abundance of
features.
• Solution: Dimensionality Reduction Techniques
Principal Component Analysis
• To address these challenges, it's essential to reduce dimensionality.
• Objective: Project the data into a lower-dimensional space, mitigating computational
complexity and overfitting risks.
After applying PCA to our housing dataset, we might find that the first principal
component is strongly correlated with the overall size of the house (e.g., square
footage, number of bedrooms, number of bathrooms), while the second principal
component is related to the price of the house.
Suppose that after PCA, we find that the first three principal components explain
95% of the total variance in the dataset. This means that we can represent the
original dataset using just these three principal components, reducing the
dimensionality from, say, 10 original features down to 3 principal components.
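The 95% figure can be checked directly from PCA's explained_variance_ratio_ attribute. The following is a minimal sketch on synthetic 10-feature data (an assumption, not the lecture's dataset):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))                 # 3 underlying factors
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(cumulative.round(3))
print("components needed for 95% variance:", n_components)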
Principal Component Analysis
Benefits of PCA:
• Collinearity Prevention: In regression problems, PCA is
employed to prevent or reduce collinearity among
independent variables.
• Efficient Representation: Offers an efficient representation of
the original data with reduced dimensionality.
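A minimal sketch of the collinearity use case, assuming an invented pair of nearly identical features, might look like this:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200)])  # collinear pair
y = 3 * x1 + rng.normal(size=200)

# PCA hands the regressor uncorrelated components instead of collinear inputs.
model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
model.fit(X, y)
print("R^2 on training data:", round(model.score(X, y), 3))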
Principal Component Analysis
• Algorithm Workflow:
• Identify Dominant Direction: Find the vector carrying the most
information, i.e., the direction of maximum variance in the data.
• Find Orthogonal Directions: Locate subsequent directions orthogonal
to the first, each capturing as much of the remaining variance as possible.
• Dimensionality Reduction: The principal axes become the principal
components, defining the new feature space (see the NumPy sketch below).
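The workflow can be traced step by step with plain NumPy. This is an illustrative sketch on random data, not the lecture's code:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_centered = X - X.mean(axis=0)            # center each feature
cov = np.cov(X_centered, rowvar=False)     # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]          # sort axes by variance, descending
components = eigvecs[:, order[:2]]         # two leading principal axes

X_reduced = X_centered @ components        # project into the 2-D space
print(X_reduced.shape)                     # (100, 2)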
Principal Component Analysis (PCA) on Breast Cancer Data
We will explore the application of PCA for dimensionality reduction and visualization on
breast cancer data.
Step 1: Data Loading. Import the necessary libraries, including scikit-learn for machine
learning operations and matplotlib for plotting, and use scikit-learn's built-in dataset,
load_breast_cancer, to obtain the breast cancer data.

from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load breast cancer data
df = load_breast_cancer()

Using make_pipeline simplifies the creation of pipelines, especially when you have
multiple preprocessing steps and an estimator. It reduces the need for manually naming
each step, making your code more concise and easier to understand. StandardScaler is
used for standardizing features by removing the mean and scaling to unit variance,
which is a common preprocessing step in many machine learning algorithms.
Principal Component Analysis (PCA) on Breast Cancer Data

# Step 2: Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df.data)

StandardScaler is a preprocessing step that standardizes the features by removing the
mean and scaling to unit variance. The fit_transform method computes the mean and
standard deviation from the data and then scales the data accordingly.

# Step 3: Apply PCA
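The slide ends at Step 3. A plausible continuation (an assumption, not the original code) would keep two components and plot the projection:

# Keep two components and visualize the classes in the reduced space.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df.target, cmap="coolwarm", s=10)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Breast cancer data projected onto two principal components")
plt.show()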