
Data Science Project

[Diagram: the data science project lifecycle: Business Understanding, Data Understanding, Data Preparation, Exploratory Data Analysis, Modeling, Evaluation, Deployment]
Feature Engineering
● Feature engineering is the process of creating new features or modifying existing
ones in your dataset to improve the performance of machine learning models.
● Effective feature engineering can lead to better model accuracy and more informative
representations of your data.
● Feature engineering requires experimentation and domain expertise to identify which
features will be most informative for your specific problem.

Benefits of Feature Engineering
● Improved Model Performance
− Well-engineered features can lead to better model accuracy and generalization.
● Reduced Dimensionality
− Feature engineering can reduce the dimensionality of your data by selecting or creating the
most informative features, which can lead to faster training and reduced overfitting.
● Enhanced Interpretability
− Engineered features can make the model's predictions more interpretable and understandable.
● Better Handling of Missing Data
− Features engineered from other variables can help fill in missing values more effectively (see the sketch after this list).
● Incorporation of Prior Knowledge
− Feature engineering allows you to include domain-specific knowledge into your model.
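A minimal sketch of the missing-data benefit above, using hypothetical 'Occupation' and 'Income' columns:

import numpy as np
import pandas as pd

# Hypothetical data: one missing 'Income' value
df = pd.DataFrame({'Occupation': ['A', 'A', 'B', 'B'],
                   'Income': [50000, np.nan, 62000, 60000]})

# Fill the gap with the median income of the same 'Occupation' group,
# which is usually closer to the truth than a single global median
df['Income'] = df['Income'].fillna(
    df.groupby('Occupation')['Income'].transform('median'))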

Feature Engineering Techniques
● Creating Interaction Features
● Binning
● One-Hot Encoding
● Target Encoding (Mean Encoding)
● Time-Based Features
● Text Features
● Feature Scaling
● Feature Selection
● Log Transformations
● Domain-Specific Features
● Aggregations

Creating Interaction Features
● Creating interaction features involves combining two or more existing features in a dataset
to capture relationships or interactions between them.
● This can help machine learning models understand complex dependencies that individual
features might not reveal on their own.
● Interaction features are particularly useful when the relationship between features and the
target variable is not linear and when interactions play a significant role in the data.

# Create an interaction feature between price and advertising spend
df['Price_Advertising_Interaction'] = df['Price'] * df['Advertising Spend']

Binning
● Binning, also known as discretization, is a technique used in feature engineering to
divide a continuous feature into a set of discrete intervals or bins.
● This can be useful when the relationship between a continuous variable and the target
variable is not linear, and you want to capture non-linear patterns.
● Binning can also be used for creating categorical features from numerical data.
● Bins are essentially a way to group data points into categories based on their values.

# Define bin edges and labels
# right=False makes the intervals [0, 20), [20, 30), [30, 40), [40, 50)
bin_edges = [0, 20, 30, 40, 50]
bin_labels = ['<20', '20-29', '30-39', '40-49']

# Apply binning to the "Age" feature
df['Age Group'] = pd.cut(df['Age'], bins=bin_edges,
                         labels=bin_labels, right=False)

One-hot encoding
● One-hot encoding is a technique used to convert categorical variables (features) into a
numerical format so that machine learning algorithms can work with them.
● Categorical variables are those that represent categories, such as "red," "blue," "green"
for colors or "cat," "dog," "bird" for animal types.
● One-hot encoding transforms these categorical variables into a binary format, where each
category becomes a separate binary feature (0 or 1).
● For each category, one new binary feature is created, and it is set to 1 if the original
feature had that category, or 0 if it didn’t.
● This allows algorithms to treat each category as a separate entity without assuming any
ordinal relationship between them.

One-hot encoding

# Perform one-hot encoding on the "Color" column
df_encoded = pd.get_dummies(df, columns=['Color'], prefix=['Color'])

Target encoding
● Target encoding, also known as mean encoding, is a technique used to transform
categorical variables into numerical values.
● It is based on the mean of the target variable (usually a continuous variable) for each
category.
● It is particularly useful in predictive modeling tasks when dealing with categorical features
and regression or binary classification problems.
● The process involves the following steps:
1. For each category in the categorical feature, calculate the mean (or any other aggregate
measure) of the target variable for data points with that category.
2. Replace the original categorical feature with the calculated mean values for each category.
● Target encoding leverages the relationship between the categorical variable and the target
variable, making it more informative for machine learning models to work with categorical
data.

Target encoding
● Example with a dataset containing a categorical feature, "City," and a continuous target
variable, "Salary."
● We'll calculate the mean salary for each city and encode the "City" feature using the
calculated means.
# Calculate mean salary for each city
city_means = df.groupby('City')['Salary'].mean()

# Perform target encoding by mapping each city to its mean salary
df['City_Mean_Encoded'] = df['City'].map(city_means)

Time-based features
● Time-based features are a type of feature engineering that involves extracting
information from date and time data.
● When working with time series data or datasets containing temporal information, creating
time-based features can help capture patterns and dependencies related to time.
● These features can be used to improve the performance of machine learning models and
gain insights from the data.

Time-based features
● Year, Month, Day
− Extracting the year, month, and day components from a date allows you to analyze how data
varies over different time periods, such as seasons or days of the week.
● Quarter, Week, Day of Week
− Similar to year, month, and day, these features provide more granular information about time
patterns, like quarterly trends and weekday/weekend distinctions.
● Time Lags
− Creating lag features by shifting values from previous time points can capture trends and
autocorrelations in time series data.
● Holiday/Event Indicators
− Indicating whether a specific date corresponds to a holiday or a significant event can help
model how external factors affect the data.
● Time Since a Reference Date
− Calculating the time elapsed since a reference date can be useful, for example, to determine
how long a customer has been active.

Time-based features

# Convert "Transaction Date" to a datetime type


df['Transaction Date'] = pd.to_datetime(df['Transaction Date'])

# Extract time-based features


df['Year'] = df['Transaction Date'].dt.year
df['Month'] = df['Transaction Date'].dt.month
df['Day'] = df['Transaction Date'].dt.day
df['Day of Week'] = df['Transaction Date'].dt.dayofweek # Monday=0, Sunday=6
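A lag-feature sketch for the Time Lags item above, assuming df is sorted by 'Transaction Date' and has a hypothetical numeric 'Sales' column:

# Create lag features by shifting values from previous time points
df = df.sort_values('Transaction Date')
df['Sales_Lag_1'] = df['Sales'].shift(1)  # value from the previous row
df['Sales_Lag_7'] = df['Sales'].shift(7)  # value from 7 rows earlier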

Text features
● Text features are a type of feature engineering that involves extracting valuable
information from text data.
● Text data can come from a wide range of sources, including social media posts, customer
reviews, news articles, or any content that contains textual information.
● Text features are crucial for natural language processing (NLP) tasks and text-based
machine learning applications.

Text features
● Bag of Words (BoW):
− This technique represents text documents as a collection of words or tokens. Each unique
word becomes a feature, and the presence or frequency of each word is used as a feature
value.
● Term Frequency-Inverse Document Frequency (TF-IDF):
− TF-IDF is a numerical statistic that reflects the importance of a word within a document relative
to a collection of documents (corpus). It is used to weigh the importance of words in text data.
● Word Embeddings:
− Word embeddings, such as Word2Vec and GloVe, are dense vector representations of words
in a continuous vector space. These embeddings capture semantic information about words.
● Text Length:
− Features related to the length of text, such as the number of words, characters, or sentences.
● N-grams:
− N-grams are contiguous sequences of N items (words or characters) in text. They capture
local patterns and can be used as features.

Text features


from sklearn.feature_extraction.text import CountVectorizer

# Sample text documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

Text features
# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the text documents
X = vectorizer.fit_transform(documents)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame from the BoW representation
import pandas as pd
df = pd.DataFrame(X.toarray(), columns=feature_names)

# Display the BoW DataFrame
print(df)
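For comparison, a minimal TF-IDF sketch reusing the documents list from the example above; TfidfVectorizer weights each word by its importance in the corpus rather than its raw count:

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit and transform the same documents with TF-IDF weighting
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Display the weighted matrix
df_tfidf = pd.DataFrame(X_tfidf.toarray(),
                        columns=tfidf_vectorizer.get_feature_names_out())
print(df_tfidf.round(2))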

Feature selection
● Feature selection is a step in the machine learning pipeline that involves choosing the
most relevant and informative features (variables) from a dataset.
● The goal of feature selection is to improve model performance, reduce overfitting, and
enhance the interpretability of models.
● Feature selection is particularly important when working with high-dimensional data or
when you suspect that many features are irrelevant or redundant.

Feature selection
● Filter Methods:
− Filter methods select features based on statistical properties or scoring criteria without
involving a machine learning model.
− Examples include correlation-based and mutual information-based feature selection (see the
sketch after this list).
● Wrapper Methods:
− Wrapper methods use a machine learning model's performance (e.g., accuracy) as a criterion
to select features.
− Common wrapper methods include forward selection, backward elimination, and recursive
feature elimination.
● Embedded Methods:
− Embedded methods incorporate feature selection as part of the model training process.
− Some machine learning algorithms have built-in feature selection, and regularization
techniques like L1 regularization (Lasso) can encourage feature sparsity.
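A minimal filter-method sketch using mutual information on the Iris dataset (chosen here only for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 2 features with the highest mutual information with the
# target; no model is trained, which is what makes this a filter method
data = load_iris()
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(data.data, data.target)

selected = [name for name, keep
            in zip(data.feature_names, selector.get_support()) if keep]
print(selected)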

Feature selection
● Example:
− Perform feature selection using feature importance scores from a tree ensemble classifier
(an embedded, model-based approach).

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset as an example
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

Feature selection
from sklearn.ensemble import ExtraTreesClassifier

# Fit a tree ensemble classifier to calculate feature importance scores
classifier = ExtraTreesClassifier()
classifier.fit(X, y)

# Get feature importance scores
feature_importance = classifier.feature_importances_

# Create a DataFrame to store feature names and their importance scores
feature_importance_df = pd.DataFrame(
    {'Feature': X.columns, 'Importance': feature_importance})

# Sort features by importance in descending order
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False)
Feature selection

# Select the top k most important features (e.g., top 2)
k = 2
selected_features = feature_importance_df.head(k)['Feature'].tolist()

# Display the selected features
print("Selected Features:")
print(selected_features)

Domain-Specific Features
● Domain-specific features are a category of engineered features that are created based
on domain knowledge or expertise in a specific field or industry.
● These features are designed to capture information that is highly relevant to the problem
at hand and can provide a significant boost in predictive power.
● Domain-specific features leverage the understanding of the domain and the problem's
nuances to create new variables that enhance the performance of machine learning
models.

Domain-Specific Features
● Aggregated statistics
− Calculating statistics such as means, medians, or variances of certain variables within specific
categories or groups (see the sketch after this list).
● Time-based features
− Extracting time-related information, such as day of the week, month, or year, from date or
timestamp data.
● Geospatial features
− Creating features based on geographical data, such as distance to a landmark or the density
of nearby points of interest.
● Text-based features
− Extracting information from text data, such as sentiment scores, word counts, or specific
keywords.
● Interaction features
− Combining existing features to capture interactions or relationships between them
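A minimal sketch for the aggregated-statistics item above, using a hypothetical transactions table:

import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({'City': ['Paris', 'Paris', 'Lyon', 'Lyon'],
                   'OrderValue': [120, 80, 60, 90]})

# Attach each row's group-level statistics as new features
df['City_Mean_Order'] = df.groupby('City')['OrderValue'].transform('mean')
df['City_Max_Order'] = df.groupby('City')['OrderValue'].transform('max')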

Domain-Specific Features
● Consider a hypothetical scenario where we have data on customer transactions for an
e-commerce website, and we want to predict customer churn.

Domain-Specific Features

# Convert 'LastPurchaseDate' to a datetime object
df['LastPurchaseDate'] = pd.to_datetime(df['LastPurchaseDate'])

# Calculate the average spending per purchase
df['AvgSpendingPerPurchase'] = df['TotalSpent'] / df['PurchaseCount']

# Extract the month of the last purchase
df['LastPurchaseMonth'] = df['LastPurchaseDate'].dt.month

# Calculate the recency as the number of days since the last purchase
df['Recency'] = (pd.to_datetime('2022-04-01') - df['LastPurchaseDate']).dt.days

Lab 2: DS

Feature Engineering

Objective: In this lab, you will focus on feature selection for building a predictive
model to forecast on-time product delivery. The primary tasks include loading and
exploring the dataset, performing data preprocessing, and selecting relevant features
for the prediction model.

Data Science Project

[Diagram: the data science project lifecycle again, with the Exploratory Data Analysis stage highlighted: Business Understanding, Data Understanding, Data Preparation, Exploratory Data Analysis, Modeling, Evaluation, Deployment]
Exploratory Data Analysis (EDA)
● EDA involves the analysis and visualization of data to understand its structure,
patterns, and relationships.
● EDA is performed before building predictive models or drawing conclusions from the data.
● Common techniques and examples of EDA:
− Summary Statistics
− Data Visualization
− Box Plots
− Scatter Plots
− Correlation Analysis
− Categorical Variable Analysis
− Time Series Analysis

Summary statistics
● Summary statistics are numerical measures that provide a high-level overview of the
characteristics of a dataset.
● They help in summarizing key properties of data, such as central tendency, variability, and
distribution.
● Summary statistics include measures like mean, median, standard deviation, minimum,
maximum, and quartiles.
● Example:
Suppose we have a dataset of exam scores for a class, and we want to compute summary
statistics for the scores.

Summary statistics
# Calculate summary statistics for 'ExamScore'
summary_stats = df['ExamScore'].describe()

# Print the summary statistics
print(summary_stats)

Data visualization
● Data visualization is the graphical representation of data to help users understand
information in a more accessible and interpretable form.
● It's a crucial aspect of data analysis and communication, as it allows you to explore data,
identify patterns, and present insights effectively.
● Data visualization employs various visual elements like charts, graphs, and plots to
represent data.
● Visualizations can be used to illustrate relationships, trends, distributions, and
comparisons in data.
● Example:
Visualizing the sales performance of a business.

Data visualization
● Line Chart (Time Series):
− A line chart is suitable for showing trends in sales over time.
− It's ideal when you want to visualize how sales have changed from month to month.
import matplotlib.pyplot as plt
# Create a line chart
plt.plot(df['months'], df['sales'], marker='o', linestyle='-')
plt.title('Monthly Sales Performance')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')

# Show the chart
plt.grid(True)  # Add grid lines
plt.show()

Data visualization
● Bar Chart (Comparative):
− A bar chart can be used to compare sales across different months and see which months had
higher or lower sales.

# Create a bar chart
plt.bar(df['months'], df['sales'], color='blue')
plt.title('Monthly Sales Comparison')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')
plt.show()

Data visualization
● Area Chart (Cumulative):
− An area chart can show cumulative sales over time, which is useful for understanding the
overall growth in sales.

# Create an area chart of cumulative sales
# Accumulate monthly sales so the chart shows overall growth
cumulative_sales = df['sales'].cumsum()
plt.fill_between(df['months'], cumulative_sales, color='blue', alpha=0.3)
plt.plot(df['months'], cumulative_sales, marker='o', linestyle='-', color='blue')
plt.title('Cumulative Sales Over Time')
plt.xlabel('Month')
plt.ylabel('Cumulative Sales (in dollars)')
plt.grid(True)
plt.show()

Data visualization
● Pie Chart (Composition):
− A pie chart can be used to show the composition of total sales for different months.
# Create a pie chart
plt.pie(df['sales'], labels=df['months'], autopct='%1.1f%%', startangle=140)
plt.title('Sales Composition by Month')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

Box plots
● Box plots, also known as box-and-whisker plots, are a graphical representation of the
distribution of a dataset.
● They display the median, quartiles, and potential outliers in the data.
● Box plots are used to visualize the spread and central tendency of data.
● Example:
Visualizing the sales performance of a business.

Box plots
# Create a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(df['sales'], labels=['Sales'])
plt.title('Sales Performances by Month')
plt.ylabel('Sales (in dollars)')
plt.grid(True)

# Show the box plot
plt.show()

[Figure annotations: the whiskers mark the min and max, the line inside the rectangular "box" marks the median, and the box spans the interquartile range (IQR)]

Box plots
# Create a box plot
plt.figure(figsize=(8, 6))

# showfliers=True to display outliers
plt.boxplot(df['sales'], labels=['Sales'], showfliers=True)
plt.title('Sales Performances by Month (with Outlier)')
plt.ylabel('Sales (in dollars)')
plt.grid(True)

# Show the box plot
plt.show()

[Figure annotations: a point beyond the whisker is flagged as an outlier, while the median line sits inside the box containing the majority of the sales data]

Scatter plots
● Scatter plots are used to visualize the relationship between two variables, making them
suitable for assessing correlations and identifying patterns in data.
● Scatter plots display individual data points as dots on a two-dimensional plane.
● Each dot represents a data point, with one variable on the x-axis and another variable on
the y-axis.
● They are useful for identifying trends, clusters, and outliers in the data.

Scatter plots
# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['months'], df['sales'], color='blue', label='Sales', marker='o')
plt.title('Sales Performances by Month (Scatter Plot)')
plt.xlabel('Month')
plt.ylabel('Sales (in dollars)')
plt.grid(True)
plt.legend()

# Show the scatter plot
plt.show()

Correlation Analysis
● Correlation analysis is a statistical technique used to evaluate the strength and
direction of the relationship between two or more variables or features in a dataset.
● The goal of correlation analysis is to determine whether there is a statistical association
between the variables and to what degree they are related.
● Correlation is quantified using a Correlation Coefficient, which measures the degree
and direction of the relationship.
● The most common correlation coefficient is the Pearson correlation coefficient, denoted as
"r," which ranges from -1 to 1.
● The formula to calculate the Pearson correlation coefficient (r) between two variables, X
and Y, is as follows:

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

where $X_i$ and $Y_i$ are the individual data points (observations) for the variables X and Y, respectively, and $\bar{X}$ and $\bar{Y}$ are the mean (average) values of X and Y.

Correlation Analysis
● The sign of the correlation coefficient indicates the direction of the relationship:
− A positive correlation (r > 0) implies that as one variable increases, the other tends to
increase as well.
− A negative correlation (r < 0) implies that as one variable increases, the other tends to
decrease.
− Zero correlation (r ≈ 0) suggests that there is no systematic relationship between the
variables.
● The absolute value of the correlation coefficient reflects the strength of the relationship.
● An r-value closer to 1 (positive or negative) indicates a stronger relationship, while an
r-value closer to 0 indicates a weaker or no relationship.

Correlation Analysis
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate the correlation matrix
correlation_matrix = df.corr()

# Display the correlation matrix
print(correlation_matrix)

# Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True,
            cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix Heatmap')
plt.show()

Categorical Variable Analysis
● Categorical variable analysis involves the examination and interpretation of categorical or
qualitative data.
● Categorical variables represent data that can be divided into distinct categories or groups, such as
colors, cities, or product types.
● Analyzing categorical variables is essential in various fields, including statistics, data science, and
social sciences.
● Techniques
− Frequency Distribution: The first step in analyzing categorical data is to create a frequency
distribution, which shows the number or count of data points within each category. This helps you
understand the distribution of data across categories.
− Bar Charts: Bar charts or bar graphs are commonly used to visualize categorical data. Each
category is represented on the x-axis, and the frequency or count is shown on the y-axis. Bar
charts are useful for visual comparisons between categories.
− Pie Charts: Pie charts are another visualization tool for categorical data. They represent each
category as a slice of the pie, with the size of each slice proportional to the frequency or proportion
of data in that category.
− Measures of Central Tendency: While categorical data cannot be averaged like numerical data,
you can calculate the mode, which is the category with the highest frequency. The mode represents
the most common category.

Categorical Variable Analysis
● Example of analyzing categorical data using a dataset of students and their favorite colors:
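A minimal sketch, assuming a hypothetical 'FavoriteColor' column:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical survey data
df = pd.DataFrame({'FavoriteColor': ['Red', 'Blue', 'Blue', 'Green',
                                     'Red', 'Blue', 'Green', 'Blue']})

# Frequency distribution and mode (the most common category)
color_counts = df['FavoriteColor'].value_counts()
print(color_counts)
print("Mode:", df['FavoriteColor'].mode()[0])

# Bar chart of the distribution
color_counts.plot(kind='bar', color='skyblue')
plt.title('Favorite Colors of Students')
plt.xlabel('Color')
plt.ylabel('Count')
plt.show()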

Time Series Analysis
● Time series analysis is a statistical technique used to analyze and interpret data points
collected or recorded over time at regular intervals.
● It is particularly useful for understanding patterns, trends, and forecasting future values in
time-ordered datasets.
● Visualization:
− Visualizing time series data is essential.
● Components of Time Series:
− Time series data can often be decomposed into components like trend, seasonality, and
noise.
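A minimal decomposition sketch with statsmodels, assuming df holds a monthly 'sales' series indexed by date with at least two full years of observations:

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend, seasonal, and residual (noise) components
result = seasonal_decompose(df['sales'], model='additive', period=12)
result.plot()
plt.show()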

Road Map 1/2

https://activewizards.com/blog/how-to-choose-the-right-chart-type-infographic/

Road Map 2/2

https://www.linkedin.com/pulse/picking-correct-visualization-drive-change-brett-bonner/

Lab 3: DS

Exploratory Data Analysis (EDA)


Objective: The objective of this lab is to gain insights into the data's structure, patterns, and
relationships.
