
Data Science Project

[Diagram: the data science project lifecycle: Business Understanding, Data Understanding, Data Preparation, Exploratory Data Analysis, Modeling, Evaluation, Deployment]
Feature Engineering
● Feature engineering is the process of creating new features or modifying existing
ones in your dataset to improve the performance of machine learning models.
● Effective feature engineering can lead to better model accuracy and more informative
representations of your data.
● Feature engineering requires experimentation and domain expertise to identify which
features will be most informative for your specific problem.

Benefits of Feature Engineering
● Improved Model Performance
− Well-engineered features can lead to better model accuracy and generalization.
● Reduced Dimensionality
− Feature engineering can reduce the dimensionality of your data by selecting or creating the
most informative features, which can lead to faster training and reduced overfitting.
● Enhanced Interpretability
− Engineered features can make the model's predictions more interpretable and understandable.
● Better Handling of Missing Data
− Features engineered from other variables can help fill in missing values more effectively (see the sketch after this list).
● Incorporation of Prior Knowledge
− Feature engineering allows you to include domain-specific knowledge into your model.
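A minimal sketch of the missing-data benefit above, using hypothetical 'Occupation' and 'Income' columns:

import numpy as np
import pandas as pd

# Hypothetical data: one missing 'Income' value
df = pd.DataFrame({'Occupation': ['A', 'A', 'B', 'B'],
                   'Income': [50000, np.nan, 62000, 60000]})

# Fill the gap with the median income of the same 'Occupation' group,
# which is usually closer to the truth than a single global median
df['Income'] = df['Income'].fillna(
    df.groupby('Occupation')['Income'].transform('median'))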

Feature Engineering Techniques
● Creating Interaction Features
● Binning
● One-Hot Encoding
● Target Encoding (Mean Encoding)
● Time-Based Features
● Text Features
● Feature Scaling
● Feature Selection
● Log Transformations
● Domain-Specific Features
● Aggregations

Creating Interaction Features
● Creating interaction features involves combining two or more existing features in a dataset
to capture relationships or interactions between them.
● This can help machine learning models understand complex dependencies that individual
features might not reveal on their own.
● Interaction features are particularly useful when the relationship between features and the
target variable is not linear and when interactions play a significant role in the data.

# Create an interaction feature between price and advertising spend
df['Price_Advertising_Interaction'] = df['Price'] * df['Advertising Spend']

Binning
● Binning, also known as discretization, is a technique used in feature engineering to
divide a continuous feature into a set of discrete intervals or bins.
● This can be useful when the relationship between a continuous variable and the target
variable is not linear, and you want to capture non-linear patterns.
● Binning can also be used for creating categorical features from numerical data.
● Bins are essentially a way to group data points into categories based on their values.

# Define bin edges and labels
# right=False makes the intervals [0, 20), [20, 30), [30, 40), [40, 50)
bin_edges = [0, 20, 30, 40, 50]
bin_labels = ['<20', '20-29', '30-39', '40-49']

# Apply binning to the "Age" feature
df['Age Group'] = pd.cut(df['Age'], bins=bin_edges,
                         labels=bin_labels, right=False)

One-hot encoding
● One-hot encoding is a technique used to convert categorical variables (features) into a
numerical format so that machine learning algorithms can work with them.
● Categorical variables are those that represent categories, such as "red," "blue," "green"
for colors or "cat," "dog," "bird" for animal types.
● One-hot encoding transforms these categorical variables into a binary format, where each
category becomes a separate binary feature (0 or 1).
● For each category, one new binary feature is created, and it is set to 1 if the original
feature had that category, or 0 if it didn’t.
● This allows algorithms to treat each category as a separate entity without assuming any
ordinal relationship between them.

One-hot encoding

# Perform one-hot encoding on the "Color" column
df_encoded = pd.get_dummies(df, columns=['Color'], prefix=['Color'])

Target encoding
● Target encoding, also known as mean encoding, is a technique used to transform
categorical variables into numerical values.
● It is based on the mean of the target variable (usually a continuous variable) for each
category.
● It is particularly useful in predictive modeling tasks when dealing with categorical features
and regression or binary classification problems.
● The process involves the following steps:
1. For each category in the categorical feature, calculate the mean (or any other aggregate
measure) of the target variable for data points with that category.
2. Replace the original categorical feature with the calculated mean values for each category.
● Target encoding leverages the relationship between the categorical variable and the target
variable, making it more informative for machine learning models to work with categorical
data.

Target encoding
● Example with a dataset containing a categorical feature, "City," and a continuous target
variable, "Salary."
● We'll calculate the mean salary for each city and encode the "City" feature using the
calculated means.
# Calculate mean salary for each city
city_means = df.groupby('City')['Salary'].mean()

# Perform target encoding by mapping each city to its mean salary
df['City_Mean_Encoded'] = df['City'].map(city_means)

Time-based features
● Time-based features are a type of feature engineering that involves extracting
information from date and time data.
● When working with time series data or datasets containing temporal information, creating
time-based features can help capture patterns and dependencies related to time.
● These features can be used to improve the performance of machine learning models and
gain insights from the data.

Time-based features
● Year, Month, Day
− Extracting the year, month, and day components from a date allows you to analyze how data
varies over different time periods, such as seasons or days of the week.
● Quarter, Week, Day of Week
− Similar to year, month, and day, these features provide more granular information about time
patterns, like quarterly trends and weekday/weekend distinctions.
● Time Lags
− Creating lag features by shifting values from previous time points can capture trends and
autocorrelations in time series data.
● Holiday/Event Indicators
− Indicating whether a specific date corresponds to a holiday or a significant event can help
model how external factors affect the data.
● Time Since a Reference Date
− Calculating the time elapsed since a reference date can be useful, for example, to determine
how long a customer has been active.

Time-based features

# Convert "Transaction Date" to a datetime type


df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])

# Extract time-based features


df['Year'] = df['Transaction Date'].dt.year
df['Month'] = df['Transaction Date'].dt.month
df['Day'] = df['Transaction Date'].dt.day
df['Day of Week'] = df['Transaction Date'].dt.dayofweek # Monday=0, Sunday=6
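A lag-feature sketch for the Time Lags item above, assuming df is sorted by 'Transaction Date' and has a hypothetical numeric 'Sales' column:

# Create lag features by shifting values from previous time points
df = df.sort_values('Transaction Date')
df['Sales_Lag_1'] = df['Sales'].shift(1)  # value from the previous row
df['Sales_Lag_7'] = df['Sales'].shift(7)  # value from 7 rows earlier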

Text features
● Text features are a type of feature engineering that involves extracting valuable
information from text data.
● Text data can come from a wide range of sources, including social media posts, customer
reviews, news articles, or any content that contains textual information.
● Text features are crucial for natural language processing (NLP) tasks and text-based
machine learning applications.

Text features
● Bag of Words (BoW):
− This technique represents text documents as a collection of words or tokens. Each unique
word becomes a feature, and the presence or frequency of each word is used as a feature
value.
● Term Frequency-Inverse Document Frequency (TF-IDF):
− TF-IDF is a numerical statistic that reflects the importance of a word within a document relative
to a collection of documents (corpus). It is used to weigh the importance of words in text data.
● Word Embeddings:
− Word embeddings, such as Word2Vec and GloVe, are dense vector representations of words
in a continuous vector space. These embeddings capture semantic information about words.
● Text Length:
− Features related to the length of text, such as the number of words, characters, or sentences.
● N-grams:
− N-grams are contiguous sequences of N items (words or characters) in text. They capture
local patterns and can be used as features.

Text features


from sklearn.feature_extraction.text import CountVectorizer

# Sample text documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

Text features
# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the text documents
X = vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame from the BoW representation
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=feature_names)

# Display the BoW DataFrame
print(df)
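For comparison, a minimal TF-IDF sketch reusing the documents list from the example above; TfidfVectorizer weights each word by its importance in the corpus rather than its raw count:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit and transform the same documents with TF-IDF weighting
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Display the weighted matrix
df_tfidf = pd.DataFrame(X_tfidf.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())
print(df_tfidf.round(2))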

Feature selection
● Feature selection is a step in the machine learning pipeline that involves choosing the
most relevant and informative features (variables) from a dataset.
● The goal of feature selection is to improve model performance, reduce overfitting, and
enhance the interpretability of models.
● Feature selection is particularly important when working with high-dimensional data or
when you suspect that many features are irrelevant or redundant.

Feature selection
● Filter Methods:
− Filter methods select features based on statistical properties or scoring criteria without
involving a machine learning model.
− Examples include correlation-based and mutual information-based feature selection (see the
sketch after this list).
● Wrapper Methods:
− Wrapper methods use a machine learning model's performance (e.g., accuracy) as a criterion
to select features.
− Common wrapper methods include forward selection, backward elimination, and recursive
feature elimination.
● Embedded Methods:
− Embedded methods incorporate feature selection as part of the model training process.
− Some machine learning algorithms have built-in feature selection, and regularization
techniques like L1 regularization (Lasso) can encourage feature sparsity.
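A minimal filter-method sketch using mutual information on the Iris dataset (chosen here only for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 2 features with the highest mutual information with the
# target; no model is trained, which is what makes this a filter method
data = load_iris()
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(data.data, data.target)

selected = [name for name, keep
            in zip(data.feature_names, selector.get_support()) if keep]
print(selected)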

Feature selection
● Example:
− Perform feature selection using feature importance scores from a tree ensemble classifier
(an embedded, model-based approach).

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset as an example
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

Feature selection
from sklearn.ensemble import ExtraTreesClassifier

# Fit a tree ensemble classifier to calculate feature importance scores
classifier = ExtraTreesClassifier()
classifier.fit(X, y)

# Get feature importance scores
feature_importance = classifier.feature_importances_

# Create a DataFrame to store feature names and their importance scores
feature_importance_df = pd.DataFrame(
    {'Feature': X.columns, 'Importance': feature_importance})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False)
Feature selection

# Select the top k most important features (e.g., top 2)
k = 2
selected_features = feature_importance_df.head(k)['Feature'].tolist()

# Display the selected features
print("Selected Features:")
print(selected_features)

Domain-Specific Features
● Domain-specific features are a category of engineered features that are created based
on domain knowledge or expertise in a specific field or industry.
● These features are designed to capture information that is highly relevant to the problem
at hand and can provide a significant boost in predictive power.
● Domain-specific features leverage the understanding of the domain and the problem's
nuances to create new variables that enhance the performance of machine learning
models.

Domain-Specific Features
● Aggregated statistics
− Calculating statistics such as means, medians, or variances of certain variables within specific
categories or groups (see the sketch after this list).
● Time-based features
− Extracting time-related information, such as day of the week, month, or year, from date or
timestamp data.
● Geospatial features
− Creating features based on geographical data, such as distance to a landmark or the density
of nearby points of interest.
● Text-based features
− Extracting information from text data, such as sentiment scores, word counts, or specific
keywords.
● Interaction features
− Combining existing features to capture interactions or relationships between them
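A minimal sketch for the aggregated-statistics item above, using a hypothetical transactions table:

import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({'City': ['Paris', 'Paris', 'Lyon', 'Lyon'],
                   'OrderValue': [120, 80, 60, 90]})

# Attach each row's group-level statistics as new features
df['City_Mean_Order'] = df.groupby('City')['OrderValue'].transform('mean')
df['City_Max_Order'] = df.groupby('City')['OrderValue'].transform('max')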

Domain-Specific Features
● Consider a hypothetical scenario where we have data on customer transactions for an
e-commerce website, and we want to predict customer churn.

Domain-Specific Features

# Convert 'LastPurchaseDate' to a datetime object
df['LastPurchaseDate'] = pd.to_datetime(df['LastPurchaseDate'])

# Calculate the average spending per purchase
df['AvgSpendingPerPurchase'] = df['TotalSpent'] / df['PurchaseCount']

# Extract the month of the last purchase
df['LastPurchaseMonth'] = df['LastPurchaseDate'].dt.month

# Calculate the recency as the number of days since the last purchase
df['Recency'] = (pd.to_datetime('2022-04-01') - df['LastPurchaseDate']).dt.days

Lab 2: DS

Feature Engineering

Objective: In this lab, you will focus on feature selection for building a predictive
model to forecast on-time product delivery. The primary tasks include loading and
exploring the dataset, performing data preprocessing, and selecting relevant features
for the prediction model.

Data Science Project

[Diagram: the data science project lifecycle again, with the Exploratory Data Analysis stage highlighted: Business Understanding, Data Understanding, Data Preparation, Exploratory Data Analysis, Modeling, Evaluation, Deployment]
Exploratory Data Analysis (EDA)
● EDA involves the analysis and visualization of data to understand its structure,
patterns, and relationships.
● EDA is performed before building predictive models or drawing conclusions from the data.
● Common techniques and examples of EDA:
− Summary Statistics
− Data Visualization
− Box Plots
− Scatter Plots
− Correlation Analysis
− Categorical Variable Analysis
− Time Series Analysis

Summary statistics
● Summary statistics are numerical measures that provide a high-level overview of the
characteristics of a dataset.
● They help in summarizing key properties of data, such as central tendency, variability, and
distribution.
● Summary statistics include measures like mean, median, standard deviation, minimum,
maximum, and quartiles.
● Example:
Suppose we have a dataset of exam scores for a class, and we want to compute summary
statistics for the scores.

Summary statistics
# Calculate summary statistics for 'ExamScore'
summary_stats = df['ExamScore'].describe()

# Print the summary statistics
print(summary_stats)

Data visualization
● Data visualization is the graphical representation of data to help users understand
information in a more accessible and interpretable form.
● It's a crucial aspect of data analysis and communication, as it allows you to explore data,
identify patterns, and present insights effectively.
● Data visualization employs various visual elements like charts, graphs, and plots to
represent data.
● Visualizations can be used to illustrate relationships, trends, distributions, and
comparisons in data.
● Example:
Visualizing the sales performance of a business.

Data visualization
● Line Chart (Time Series):
− A line chart is suitable for showing trends in sales over time.
− It's ideal when you want to visualize how sales have changed from month to month.
import matplotlib.pyplot as plt
# Create a line chart
plt.plot(df['months'], df['sales'], marker='o', linestyle='-')
plt.title('Monthly Sales Performance')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')

# Show the chart
plt.grid(True)  # Add grid lines
plt.show()

Data visualization
● Bar Chart (Comparative):
− A bar chart can be used to compare sales across different months and see which months had
higher or lower sales.

# Create a bar chart
plt.bar(df['months'], df['sales'], color='blue')
plt.title('Monthly Sales Comparison')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')
plt.show()

Data visualization
● Area Chart (Cumulative):
− An area chart can show cumulative sales over time, which is useful for understanding the
overall growth in sales.

# Create an area chart of cumulative sales
# Accumulate monthly sales so the chart shows overall growth
cumulative_sales = df['sales'].cumsum()
plt.fill_between(df['months'], cumulative_sales, color='blue', alpha=0.3)
plt.plot(df['months'], cumulative_sales, marker='o', linestyle='-', color='blue')
plt.title('Cumulative Sales Over Time')
plt.xlabel('Month')
plt.ylabel('Cumulative Sales (in dollars)')
plt.grid(True)
plt.show()

Data visualization
● Pie Chart (Composition):
− A pie chart can be used to show the composition of total sales for different months.
# Create a pie chart
plt.pie(df['sales'], labels=df['months'], autopct='%1.1f%%', startangle=140)
plt.title('Sales Composition by Month')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Box plots
● Box plots, also known as box-and-whisker plots, are a graphical representation of the
distribution of a dataset.
● They display the median, quartiles, and potential outliers in the data.
● Box plots are used to visualize the spread and central tendency of data.
● Example:
Visualizing the sales performance of a business.

Box plots
# Create a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(df['sales'], labels=['Sales'])
plt.title('Sales Performances by Month')
plt.ylabel('Sales (in dollars)')
plt.grid(True)

# Show the box plot
plt.show()

[Figure annotations: the whiskers mark the min and max, the line inside the rectangular "box" marks the median, and the box spans the interquartile range (IQR)]

Box plots
# Create a box plot
plt.figure(figsize=(8, 6))

# showfliers=True to display outliers
plt.boxplot(df['sales'], labels=['Sales'], showfliers=True)
plt.title('Sales Performances by Month (with Outlier)')
plt.ylabel('Sales (in dollars)')
plt.grid(True)

# Show the box plot
plt.show()

[Figure annotations: a point beyond the whisker is flagged as an outlier, while the median line sits inside the box containing the majority of the sales data]

Scatter plots
● Scatter plots are used to visualize the relationship between two variables, making them
suitable for assessing correlations and identifying patterns in data.
● Scatter plots display individual data points as dots on a two-dimensional plane.
● Each dot represents a data point, with one variable on the x-axis and another variable on
the y-axis.
● They are useful for identifying trends, clusters, and outliers in the data.

Scatter plots
# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['months'], df['sales'], color='blue', label='Sales', marker='o')
plt.title('Sales Performances by Month (Scatter Plot)')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')
plt.grid(True)
plt.legend()

# Show the scatter plot
plt.show()

Correlation Analysis
● Correlation analysis is a statistical technique used to evaluate the strength and
direction of the relationship between two or more variables or features in a dataset.
● The goal of correlation analysis is to determine whether there is a statistical association
between the variables and to what degree they are related.
● Correlation is quantified using a Correlation Coefficient, which measures the degree
and direction of the relationship.
● The most common correlation coefficient is the Pearson correlation coefficient, denoted as
"r," which ranges from -1 to 1.
● The formula to calculate the Pearson correlation coefficient (r) between two variables, X
and Y, is as follows:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

where $X_i$ and $Y_i$ are the individual data points (observations) for the variables X and Y, respectively, and $\bar{X}$ and $\bar{Y}$ are the mean (average) values of X and Y.

Correlation Analysis
● The sign of the correlation coefficient indicates the direction of the relationship:
− A positive correlation (r > 0) implies that as one variable increases, the other tends to
increase as well.
− A negative correlation (r < 0) implies that as one variable increases, the other tends to
decrease.
− Zero correlation (r ≈ 0) suggests that there is no systematic relationship between the
variables.
● The absolute value of the correlation coefficient reflects the strength of the relationship.
● An r-value closer to 1 (positive or negative) indicates a stronger relationship, while an
r-value closer to 0 indicates a weaker or no relationship.

Correlation Analysis
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Display the correlation matrix
print(correlation_matrix)

# Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True,
            cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix Heatmap')
plt.show()

Categorical Variable Analysis
● Categorical variable analysis involves the examination and interpretation of categorical or
qualitative data.
● Categorical variables represent data that can be divided into distinct categories or groups, such as
colors, cities, or product types.
● Analyzing categorical variables is essential in various fields, including statistics, data science, and
social sciences.
● Techniques
− Frequency Distribution: The first step in analyzing categorical data is to create a frequency
distribution, which shows the number or count of data points within each category. This helps you
understand the distribution of data across categories.
− Bar Charts: Bar charts or bar graphs are commonly used to visualize categorical data. Each
category is represented on the x-axis, and the frequency or count is shown on the y-axis. Bar
charts are useful for visual comparisons between categories.
− Pie Charts: Pie charts are another visualization tool for categorical data. They represent each
category as a slice of the pie, with the size of each slice proportional to the frequency or proportion
of data in that category.
− Measures of Central Tendency: While categorical data cannot be averaged like numerical data,
you can calculate the mode, which is the category with the highest frequency. The mode represents
the most common category.

Categorical Variable Analysis
● Example of analyzing categorical data using a dataset of students and their favorite colors:
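A minimal sketch, assuming a hypothetical 'FavoriteColor' column:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data
df = pd.DataFrame({'FavoriteColor': ['Red', 'Blue', 'Blue', 'Green',
                                     'Red', 'Blue', 'Green', 'Blue']})

# Frequency distribution and mode (the most common category)
color_counts = df['FavoriteColor'].value_counts()
print(color_counts)
print("Mode:", df['FavoriteColor'].mode()[0])

# Bar chart of the distribution
color_counts.plot(kind='bar', color='skyblue')
plt.title('Favorite Colors of Students')
plt.xlabel('Color')
plt.ylabel('Count')
plt.show()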

Time Series Analysis
● Time series analysis is a statistical technique used to analyze and interpret data points
collected or recorded over time at regular intervals.
● It is particularly useful for understanding patterns, trends, and forecasting future values in
time-ordered datasets.
● Visualization:
− Visualizing time series data is essential.
● Components of Time Series:
− Time series data can often be decomposed into components like trend, seasonality, and
noise.
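A minimal decomposition sketch with statsmodels, assuming df holds a monthly 'sales' series indexed by date with at least two full years of observations:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend, seasonal, and residual (noise) components
result = seasonal_decompose(df['sales'], model='additive', period=12)
result.plot()
plt.show()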

Road Map 1/2

https://activewizards.com/blog/how-to-choose-the-right-chart-type-infographic/

Road Map 2/2

https://www.linkedin.com/pulse/picking-correct-visualization-drive-change-brett-bonner/

Lab 3: DS

Exploratory Data Analysis (EDA)


Objective: The objective of this lab is to gain insights into the data's structure, patterns, and
relationships.
