Data Science in IoT
Data: Data is an individual unit of raw material - facts or figures that do not carry any specific meaning on their own.
Information:
When data is processed, organized, structured, or presented in a particular context, it is called information.
Example: The average test score of a class is the information derived from the given data.
[Diagram: the data science workflow - applying techniques, visualizing and communicating results to the user, and deploying and maintaining models.]
What is Data Science
● Data science is an interdisciplinary field that combines techniques, tools, and
methodologies to extract useful insights and knowledge from data.
● It involves various processes, including collecting, cleaning, analyzing,
visualizing, interpreting, and modeling data to solve real-world problems.
● The goal of data science is to transform raw data into actionable information
that can drive informed decisions, solve complex problems, and generate value
across diverse industries and domains.
● Data science is the study of data to extract meaningful insights for
business.
● It is a multidisciplinary approach that combines principles and practices
from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.
● This analysis helps data scientists ask and answer questions like:
what happened, why it happened, what will happen, and what can be
done with the results.
The term first appeared in the ’60s as an alternative name for statistics. In the
late ’90s, computer science professionals formalized it. Data science works on
three aspects: data design, collection, and analysis.
Various fields of Data Science
• Data Pre-processing (Data Analysis, Data Visualization)
• Statistical Computing
• Statistical Modeling
• Machine Learning
• Pattern Recognition
• Real world applications
Data Science Life Cycle
● Business Understanding: At the start, you immerse yourself in understanding the
business problem you're aiming to solve. Collaborate with stakeholders to define the
problem, its goals, and requirements. This step ensures that your data science efforts are
aligned with the business's objectives and deliver value.
● Data Mining: Once the problem is defined, you gather relevant data from various
sources, such as databases, APIs, or files. This is where you collect the raw material
needed for analysis. The data collected should ideally cover the aspects necessary to
address the problem effectively.
● Data Cleaning: The collected data often comes with imperfections like missing values,
outliers, or inconsistencies. In this step, you clean and preprocess the data to ensure its
quality. This includes filling in missing values, removing outliers, and correcting errors
to ensure accurate results.
● Data Exploration: After cleaning, you explore the data to understand its characteristics
and potential insights. You create visualizations and conduct statistical analyses to
reveal patterns, relationships, and trends within the data. This step helps you form
hypotheses and identify potential areas for further analysis.
● Feature Engineering: Feature engineering involves selecting and transforming the
right features (variables) from the data to enhance model performance. This step
could include creating new features, applying transformations, or combining
existing features to provide the most relevant and impactful input for your models.
● Predictive Modeling: Now you build predictive models using machine learning
algorithms. These models learn patterns from the data to make predictions or
classifications. You split the data into training and testing sets, train the model on the
training set, and then evaluate its performance on the testing set (see the sketch after this list).
● Data Visualization: Data visualization is crucial for conveying insights to both
technical and non-technical audiences. You create graphs, charts, and other visual
representations to showcase trends, patterns, and relationships within the data.
Visualizations help you communicate findings effectively.
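A minimal sketch of the train/evaluate split described in the Predictive Modeling step above. The dataset (scikit-learn's bundled Iris data) and the logistic-regression model are illustrative choices, not part of the slides:

# A minimal sketch of the predictive-modeling step: split, train, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # max_iter raised so the solver converges
model.fit(X_train, y_train)                 # learn patterns from the training set

# Evaluate on data the model has never seen.
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))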
Data Analysis
● Data Analysis is the process of finding insights from information.
https://www.youtube.com/watch?v=017B07EHe2M
https://www.youtube.com/watch?v=PFhFdziYeB4
Time series Data
● Time series data is data that is recorded over consistent intervals of time. These
data points are often ordered chronologically, and the time dimension is a critical
component in the analysis.
● Common sources of time series data: Financial Markets, Healthcare Records, Social
Media Activity
Companies can use these techniques in IoT to reduce labor costs, automate
processes, and improve operational efficiency, intelligently gathering insights
from huge volumes of data and taking meaningful action immediately. [IIoT]
• Prepare effective data for Machine Learning, perform prediction and classification on data.
Data lakes
A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs (raw data) or files.
A data lake is usually a single store of data including raw copies of source system
data, sensor data, social data etc. and transformed data used for tasks such as
reporting, visualization, advanced analytics, and machine learning.
A data lake can include structured data from relational databases (rows and
columns), semi-structured data (CSV, logs, XML, JSON), unstructured data
(emails, documents, PDFs), and binary data (images, audio, video).
A data lake can be established "on premises" (within an organization's data
centers) or "in the cloud" (using cloud services).
Data Lakes
Data lakes are platforms that offer storage and compute capabilities for data, and can be used to help
organizations make data-driven decisions. Here are some examples of data lakes:
Amazon Web Services (AWS): Offers a data lake architecture that includes Amazon S3, a storage service with
low latency and high availability
Azure Data Lake Storage: A Microsoft Azure data lake solution that includes built-in data encryption, access
control policies, and auditing capabilities
Google Cloud Storage: A flexible and scalable storage option for building data lakes
Vantage Cloud: A platform with analytics capabilities that include machine learning, graph analytics, and
advanced SQL
Data lakes can be used in many industries, including streaming media, finance, healthcare, and more.
For example, streaming companies can use data lakes to collect and process insights on customer behavior to
improve their recommendation algorithms.
Data Retention Strategy
A data retention strategy, or policy, is a system of rules that organizations use to manage the data they
collect and generate. A data retention policy should include the following:
Data inventory: Identify the data the organization has
Data categories: Define what types of data the organization has
Retention periods: Specify how long to keep different types of data
Data handling procedures: Describe how the data should be stored, where, and in what format
Data destruction procedures: Develop safe ways to destroy data when the retention period ends
Policy training: Educate employees about the policy and who can dispose of data
Compliance auditing: Regularly monitor and audit compliance with the policy
Policy updates: Periodically review and update the policy
Data Retention Strategy
When creating a data retention policy, organizations should consider the following factors:
Business requirements: How the data will be used
Storage costs: How much it costs to store the data
Regulatory and compliance concerns: Any laws or regulations that may apply to the data
Version Controlling
● Version control in data science is like keeping a detailed diary for your project.
● Imagine you're working on a team project where everyone writes their part of a
story. Version control ensures that everyone's work is organized, and changes are
tracked over time.
● It's like having a time machine for your project - if something goes wrong or you
want to see how things looked before, you can easily go back to previous
versions.
● This is crucial in data science, where you're not just dealing with code but also
with datasets, models, and various experiments.
● Types - Local Version Control System, Centralized Version Control System,
Distributed Version Control System.
Skill Required
Data Engineer v/s Data Analyst v/s Data Scientist
● Data Engineer - The role of a data engineer is to design, develop, and maintain the data infrastructure (data
warehouse) and the systems necessary for efficiently handling and processing large volumes of data. Data
engineers play a crucial role in ensuring that data is collected, stored, and made accessible in a
structured and reliable manner, supporting the needs of data analysts, data scientists, and other
stakeholders within an organization.
Skills - Python, SQL, knowledge of building and maintaining data warehouses, etc.
● Data Analyst - Data analysts collect data from different sources, clean it to remove errors, and then
analyze it to find useful patterns and trends. These insights are presented in a way that is easy for
others to understand.
Skills - Python, Statistics, Data visualization.
● Data Scientist - The role of a data scientist is to use advanced analytics and machine learning
techniques to analyze complex data sets and derive valuable insights from them. Data scientists play a
crucial role in turning data into useful insights that can benefit the organization.
[Diagram: Data → Information → Useful Insights.
Data stage - skills: programming, databases, big data tools (Hadoop).
Information stage - skills: programming, analysis (statistics), visualization, some form of analytics.
Useful-insights stage - data science: prediction, detection, and similar tasks.]
Concepts
Discrete Data: can hold a finite number of possible values (e.g., the number of students in a class).
Continuous Data: can hold an infinite number of possible values (e.g., the weight of a person).
Variable
A variable is a characteristic, attribute, or property that can take different values. Variables are
fundamental components of datasets and play a crucial role in statistical analysis.
Eg: Height: {178, 168} cm; Weight: {78, 67} kg
Variables
Two kinds of variables:
1. Quantitative / numerical variable - a property measured numerically (it can be added,
subtracted, multiplied, divided). Eg: age, weight, height.
2. Qualitative / categorical variable - takes categories (based on some
characteristic we can derive a categorical variable).
Eg: Gender (male, female)
Eg: IQ bands (0-10 → low IQ; 10-50 → medium IQ; 50-100 → good IQ)
Blood group: A+ve, A-ve
T-shirt size: S, M, L, XL
Note: Ordinal variables are categorical variables with an inherent order, e.g., rating/feedback types.
Quantitative
Discrete variable (whole numbers only):
● Number of bank accounts of a person: 3, 2, 1 (1.5 not possible)
● Number of children in a family: 2, 1 (1.5 not possible)
● Population of a state
Continuous variable (any value):
● Height: 172.5 cm, 162.5 cm
● River length
● Song length
Quick check: what kind of variable is gender? Categorical. What kind of variable is marital status? Categorical.
Ratio data: [slide table of example values omitted]
Example
https://www.youtube.com/watch?v=FqB5Es1HXI4
Data Representation
Frequency Distribution:
Sample data: Rose, Lilly, Sunflower, Rose, Lilly, Sunflower, Rose, Lilly, Lilly

Flower      Frequency   Cumulative frequency
Rose        3           3
Lilly       4           3 + 4 = 7
Sunflower   2           7 + 2 = 9

For continuous data, a histogram is used instead of a frequency table.
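A pandas sketch of building the frequency table above (same sample data as the slide):

# Build frequency and cumulative-frequency columns with pandas.
import pandas as pd

flowers = pd.Series(["Rose", "Lilly", "Sunflower", "Rose", "Lilly",
                     "Sunflower", "Rose", "Lilly", "Lilly"])

freq = flowers.value_counts()    # frequency of each category
cum_freq = freq.cumsum()         # running total of the frequencies
print(pd.DataFrame({"Frequency": freq, "Cumulative": cum_freq}))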
Classification of Digital Data
● Structured: Structured data is organized in a specific format with a well-defined schema.
It is typically tabular and can be easily stored in databases
○ Organized in a table or relational database.
○ Well-defined schema with fixed data types.
○ Easily queryable using standard database queries. Eg. spreadsheet, relational database
● Semi-Structured: falls between structured and unstructured data. It exhibits some level of
structure but does not adhere to the rigid schema found in structured data. Semi-structured
data is often more flexible than structured data, allowing for variations in the representation
of information. E.g. XML data.
● Unstructured: lacks a predefined data model or schema. It does not fit neatly into
traditional relational databases and is often more challenging to analyze.
○ No fixed format or structure.
○ Varied types of data, often human-generated. E.g. Word documents, PDFs, images,
videos, social media posts.
Structured data
Example - tracking all transactions: use an RDBMS (MySQL, Oracle, etc.) and
store the data in a structured format.
Point of Sale Software: POS software also has helpful tools like sales reporting,
inventory management, and integrated loyalty programs.
● Secondary Data (data that has already been collected by someone else, e.g., pre-filled form data)
○ Internal sources:
■ Company records, employee records, sales records
■ Financial records
○ External sources:
■ Publications by government and private organizations
■ Books and magazines
■ Journals
■ Newspapers, online websites, etc.
Qualitative data collection methods
Quantitative data collection methods
https://www.youtube.com/watch?v=mUfYXr4VKgI
Statistics
Statistics is the science of collecting, organizing and analyzing data. (Better Decision Making)
Data: Facts or pieces of Information that can be measured.
Example:
The mid-semester marks of a class: {18, 14, 16, 17, 18}
Types of statistics:
Descriptive statistics: measures of central tendency [mean, median, mode]; measures of dispersion [variance, standard deviation, Z-score]; different types
of data distributions [histogram, PDF, PMF, CDF, Gaussian distribution, log-normal distribution, exponential, binomial, Bernoulli
distribution, Poisson distribution]; power law; standard normal distribution; box plot; histogram; distribution plot; EDA; feature
engineering; transformation and standardization.
Inferential statistics: techniques wherein we use the data we have measured to form conclusions.
Z-test, t-test, chi-square test, ANOVA test, F-test, hypothesis testing, null hypothesis, alternative hypothesis, p-value, confidence level,
significance value.
Statistics
● Any raw data, when collected and organized in the form of numbers or tables, is
known as statistics. Statistics is also the mathematical study of the probability of
events occurring based on known quantitative data or a collection of data.
Example: my class (sample) vs. all NEC classes of the college (population).
Example
Let's say there are 10 sports camps in MITS and you have collected the heights of players from one of
the camps.
Heights recorded: [175 cm, 180 cm, 140 cm, 141 cm, 135 cm, 160 cm]
What type of question comes under descriptive statistics?
What is the average height of players in one camp? (mean, median, mode, pass %, %ile)
What is the distribution of the data? (how many standard deviations a value is away from the mean - the Z-score)
What type of question comes under inferential statistics? (forming conclusions from data)
Are the heights of the players in camp 1 similar to the heights of the players across all 10 camps?
1 camp (sample) vs. all 10 camps (population).
Population and samples
Random sampling - election exit polls: the sample is selected randomly.
Population (N): a population is the group, or superset, of data that you are interested in
studying.
Systematic sampling: survey every nth individual - e.g., survey every 8th person entering a mall.
Convenience Sampling:
A survey on data science that allows only those people who have knowledge of data
science.
Example
Exit poll: random sampling.
A drug that needs to be tested: give the drug to anyone (random sampling) or to a specific age group
(stratified sampling) - it depends on the use case.
Measure of Central Tendency
Mean, Median, Mode
Refers to the measure used to determine the centre of the distribution of data.
Measures of Center
● Mean (Average): The sum of all values divided by the number of observations.
● Median: The middle value of a sorted dataset. If the dataset has an even number
of observations, the median is the average of the two middle values.
● Mode: The most frequently occurring value in the dataset.
● Interquartile Range (IQR): The range between the first quartile (Q1) and the
third quartile (Q3). Strictly a measure of spread rather than of center, it covers the middle 50% of the data.
IQR = Q3−Q1
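A NumPy sketch of these measures, using the mid-semester marks example from earlier:

# Compute mean, median, and IQR for the marks {18, 14, 16, 17, 18}.
import numpy as np

data = np.array([18, 14, 16, 17, 18])

print("Mean:", np.mean(data))
print("Median:", np.median(data))

q1 = np.percentile(data, 25)   # first quartile
q3 = np.percentile(data, 75)   # third quartile
print("IQR:", q3 - q1)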
Measure of Central Tendency
Arithmetic mean for population and sample:
Population mean (size N): μ = (Σ xᵢ) / N
Sample mean (size n): x̄ = (Σ xᵢ) / n
Example
X = {1, 1, 2, 2, 3, 3, 4, 5, 5, 6}
Mean = 32/10 = 3.2 (before adding the outlier); median = 3.
Now add an outlier, 100: X = {1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100}
Mean = (32 + 100)/11 = 12 - a huge movement (difference) in the mean, because the
outlier is completely different from the rest of the distribution.
Median = 3 (the 6th of the 11 sorted values) - the median barely moves, so the median
works well with outliers.
To find the median:
1. Sort the numbers.
2. Take the middle value, or the average of the two middle values for an even count.
E.g., with a second outlier, X = {1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100, 112}, the median is
(3 + 4)/2 = 3.5, still close to the bulk of the data.
Mode: the most frequent value; it too is unaffected by an outlier such as 100 that occurs only once.
Imputation questions:
Type of flowers: rose, lilly, sunflower, with 10% missing values - what technique
can we use? (Mode, since the data is categorical.)
Age: 25, 26, _, _, _, 32, 34, 38 (with missing values) - impute with mean, median, or mode?
https://byjus.com/maths/mode/
Measure of Dispersion (spread)
Variance, Standard Deviation [concept of measure of dispersion]
{1, 1, 2, 2, 4}: mean = 10/5 = 2
{2, 2, 2, 2, 2}: mean = 10/5 = 2 - the same mean, but a very different spread.
Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Variance is a statistical measurement
of the spread between numbers in a
data set. It measures how far each
number in the set is from the mean
(average), and thus from every other
number in the set.
https://www.cuemath.com/data
/variance/
Higher variance means the spread is higher.
Standard Deviation
Standard deviation is the square root of variance, and both are measures of how data
is spread out.
Variance
A measure of how spread out all data points are in a
data set. It's the average of the squared deviations
from the mean. Variance is expressed in larger units
than standard deviation, such as meters squared.
Standard deviation
A measure of how far apart data points are from the
mean. It's expressed in the same units as the original
data values, such as meters or minutes. A small
standard deviation means the data is tightly grouped
around the mean, while a larger standard deviation
means the data is more spread out.
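A quick NumPy check using the two equal-mean datasets from the dispersion example above:

# Same mean, different spread: variance and standard deviation tell them apart.
import numpy as np

a = np.array([1, 1, 2, 2, 4])
b = np.array([2, 2, 2, 2, 2])

print(np.mean(a), np.mean(b))     # both 2.0
print(np.var(a), np.var(b))       # population variance (ddof=0): 1.2 vs 0.0
print(np.var(a, ddof=1))          # sample variance divides by n-1 instead of n
print(np.std(a))                  # standard deviation = square root of variance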
Skewness
https://www.geeksforgeeks.org/how-to-calculate-skewness-and-kurtosis-in-pyth
on/?ref=lbp
https://www.geeksforgeeks.org/difference-between-skewness-and-kurtosis/?ref
=lbp
Types of Skewness:
● Right (positive) skew: the tail extends to the right, most data is clustered on the left (mean > median).
Skewness > 0: more weight in the right tail of the distribution.
● Left (negative) skew: the tail extends to the left, most data is clustered on the right (mean < median).
Skewness < 0: more weight in the left tail of the distribution.
● Zero skew: a perfectly symmetrical distribution, with mean, median, and mode all equal.
Skewness = 0: normally distributed.
Normal Distribution
● The distribution is symmetric about the mean - half the values fall below the mean and half above the mean.
● The distribution can be described by two values: the mean and the standard deviation.
Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal distribution:
● Around 68% of values are within 1 standard deviation from the mean.
● Around 95% of values are within 2 standard deviations from the mean.
● Around 99.7% of values are within 3 standard deviations from the mean.
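A quick simulation check of the rule - an illustrative sketch using NumPy's random generator, not part of the slides:

# Verify the 68-95-99.7 rule on simulated standard-normal data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=100_000)

for k in (1, 2, 3):
    pct = np.mean(np.abs(x) <= k) * 100   # share of values within k std devs
    print(f"Within {k} standard deviation(s): {pct:.1f}%")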
Kurtosis
https://www.slideshare.net/AmbaPant/introduction-to-kurtosis
https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-ske
wness-types-of-distributions/
https://www.vedantu.com/maths/types-of-statistics
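In the spirit of the linked articles, a SciPy sketch for computing skewness and kurtosis; the exponential sample is just an illustrative right-skewed dataset:

# Compute skewness and kurtosis for a right-skewed sample.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2, size=10_000)   # long right tail

print("Skewness:", skew(right_skewed))      # > 0: more weight in the right tail
print("Kurtosis:", kurtosis(right_skewed))  # excess kurtosis (normal distribution = 0)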
Study Python
• SciPy
–Mathematical computations
– Provide a collection of sub-packages
– Advanced linear algebra functions
– Support for signal processing
• XGBoost
– eXtreme Gradient Boosting
– Fast data processing
– Support parallel computation
– Provides internal parameters for evaluation
– Higher Accuracy
• Eli5
Python libraries for deep learning
• TensorFlow
– Build and train multiple neural networks
– Statistical analysis
– In-build functions to improve the accuracy
– Comes with an in-build visualizer
• Keras
– Support various type of neural networks
– Advanced neural networks computations
– Provide pre-processed dataset
– Easily extensible
• Pytorch
– APIs to integrate with ML/Data science frameworks
Numpy
● It stands for Numerical Python (Core library for numeric & scientific
computing ).
● NumPy is a Python library used for working with arrays.
● The array object in NumPy is called ndarray.
● NumPy arrays are stored at one continuous place in memory unlike
lists, so processes can access and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
● NumPy arrays are used to store Homogeneous data.
Numpy Array v/s Python List
NumPy array:
● All elements of an array are of the same data type.
● Elements of an array are stored in contiguous memory locations.
● Arrays are static and cannot be resized once they are created.
● Arrays support element-wise operations.
● Arrays take less space in memory.
Python list:
● A list can have elements of different data types.
● List elements are not stored contiguously in memory.
● A list can be resized and modified easily.
● Lists do not support element-wise operations.
● Lists take more space in memory.
Array Creation
● np.zeros((rows, cols)): initialises a NumPy array (1- or 2-dimensional) with 0s. [Code and output screenshots omitted.]
Slicing in Array
arr[start-index : end-index : step]
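A sketch of these creation and slicing operations:

# Array creation and slicing with NumPy.
import numpy as np

zeros = np.zeros((2, 3))                  # 2 rows, 3 columns of 0.0
arr = np.array([10, 20, 30, 40, 50, 60])

print(arr[1:5:2])                         # start=1, end=5 (exclusive), step=2 -> [20 40]
print(arr[::-1])                          # reversed copy

mat = np.array([[1, 2, 3], [4, 5, 6]])    # 2-dimensional array
print(mat[0, 1:])                         # row 0, columns 1 onward -> [2 3]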
Operations on Array
Sorting an Array
● np.sort(array-name): makes a copy of the original array and returns
the sorted copy without changing the original array.
● np.argsort(array-name): sorts a copy of the original array and
returns the indices of the sorted list, without changing the original array.
● array-name.sort(): sorts the original array in place and returns nothing.
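A sketch contrasting the three sorting variants just listed:

# np.sort vs np.argsort vs in-place sort.
import numpy as np

arr = np.array([30, 10, 20])

print(np.sort(arr))      # [10 20 30] -- sorted copy; arr is unchanged
print(np.argsort(arr))   # [1 2 0]    -- indices that would sort arr
arr.sort()               # sorts in place and returns None
print(arr)               # [10 20 30]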
● The name "Pandas" has a reference to both "Panel Data", and "Python
Data Analysis" and was created by Wes McKinney in 2008.
Pandas data structures: Series (single-dimensional) and DataFrame (multi-dimensional).
Pandas Series Object
Slide topics (code screenshots omitted): creating a Series, changing the value of the
index, creating a Series object from a dictionary, and extracting an individual value.

Example DataFrame:
    Name   Marks
0   Bob    76
1   Sam    25
2   Anne   92

Axis convention: axis = 0 refers to rows, axis = 1 refers to columns.
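A sketch of these Series/DataFrame operations, reusing the Name/Marks values from the table above:

# Series creation, dictionary construction, value extraction, and axis use.
import pandas as pd

s = pd.Series([76, 25, 92], index=["Bob", "Sam", "Anne"])   # custom index
s2 = pd.Series({"Bob": 76, "Sam": 25, "Anne": 92})          # Series from a dictionary
print(s["Anne"])                                            # extract one value -> 92

df = pd.DataFrame({"Name": ["Bob", "Sam", "Anne"],
                   "Marks": [76, 25, 92]})
print(df.sum(axis=0, numeric_only=True))   # axis=0 aggregates down the rows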
More Pandas Functions
● apply(): we can perform any operation on dataset values by using the apply function and passing it a function.
● value_counts(): counts the total number of occurrences of each value in the specified column.
● sort_values(by='col'): returns the data sorted by the given column in ascending order.
Matplotlib Library
Part of UNIT 4
Matplotlib
● Matplotlib is a Python library used for data visualization.
● You can create bar plots, scatter plots, histograms, and a lot
more with matplotlib.
● Line plots are created with matplotlib's pyplot submodule.
Slide topics (code screenshots omitted): plotting a linear relationship, adding a title and labels, changing line aesthetics and attributes, plotting two lines on the same plot, and adding subplots with subplot(row, col, index). A sketch follows.
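A minimal matplotlib sketch covering these line-plot topics; the data is made up for illustration:

# Titles, labels, line aesthetics, two lines on one plot, and subplots.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y1 = [2, 4, 6, 8]
y2 = [1, 4, 9, 16]

plt.subplot(1, 2, 1)                          # subplot(row, col, index)
plt.plot(x, y1, color="red", linestyle="--", label="linear")
plt.plot(x, y2, color="blue", marker="o", label="quadratic")
plt.title("Two lines on one plot")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(x, y1)
plt.title("Second subplot")
plt.show()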
● Bar Plot: used to understand the distribution of a categorical variable.
● Horizontal Bar Plot: plt.barh.
● Scatter Plot: shows the relationship between two numeric variables.
Box Plot
A box plot summarises a dataset with five values: the minimum, 25% (Q1), 50% (median), 75% (Q3), and the maximum.
Applications - it is used to know:
● The outliers and their values.
● Symmetry of data.
● Tight grouping of data.
● Data skewness - if skewed, in which direction and how much.
Box Plot Example
Example:
Find the maximum, minimum, median, first quartile,
third quartile for the given data set: 23, 42, 12, 10, 15, 14, 9.
Solution:
Given: 23, 42, 12, 10, 15, 14, 9.
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 42
Hence,
Minimum = 9
Maximum = 42
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
Box plots are often used in exploratory data analysis. They visually show the
distribution of numerical data and skewness by displaying the
data quartiles (or percentiles) and averages.
● Pie Chart: shows the proportion of each category. [Slide code screenshots for creating the data and making the plot omitted.] A combined sketch of these plot types follows.
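A combined sketch of bar, horizontal bar, scatter, box, and pie plots; the category counts are made up, and the box-plot values reuse the example dataset above:

# Bar, horizontal bar, scatter, box, and pie plots in one sketch.
import matplotlib.pyplot as plt

cats = ["A", "B", "C"]
counts = [5, 3, 7]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].bar(cats, counts)                       # distribution of categories
axes[0, 1].barh(cats, counts)                      # horizontal bar plot
axes[1, 0].scatter([1, 2, 3, 4], [2, 4, 1, 3])     # relationship of two variables
axes[1, 1].boxplot([9, 10, 12, 14, 15, 23, 42])    # dataset from the example above
plt.show()

plt.pie(counts, labels=cats, autopct="%1.0f%%")    # pie chart of proportions
plt.show()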
Residual Plot
A residual plot is a scatter plot that shows the difference between the predicted and actual
values of a variable in a data set. It's used to analyze the relationship between the data and
the regression line, and to determine if a linear model is appropriate for the data:
https://www.geeksforgeeks.org/how-to-create-a-residual-plot-in-python/
Distribution Plot
A distribution plot is a data visualization tool that shows the distribution of data points along an
axis. It's used to compare the range and distribution of numerical data, and to visually assess
the distribution of sample data.
https://www.geeksforgeeks.org/pivot-tables-in-pandas/
Heat Map and Correlation matrix
https://www.shiksha.com/online-courses/articles/heatmap-in-seaborn/
https://www.geeksforgeeks.org/what-is-heatmap-data-visualization-and-how-to-use-it/
https://www.questionpro.com/blog/correlation-matrix/
https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
Unit - III
Data Acquisition and Data Wrangling
https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/
https://www.geeksforgeeks.org/ml-handling-missing-values/
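A minimal pandas sketch of the common missing-value strategies covered in these articles; the column names and values are illustrative:

# Fill numerical gaps with the median and categorical gaps with the mode.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, 26, np.nan, 32, np.nan, 38],
                   "Flower": ["rose", "lilly", None, "rose", "rose", None]})

print(df.isnull().sum())                                     # missing count per column
df["Age"] = df["Age"].fillna(df["Age"].median())             # median is robust to outliers
df["Flower"] = df["Flower"].fillna(df["Flower"].mode()[0])   # most frequent category
# Alternative: df.dropna() removes rows with any missing value.
print(df)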
Outlier Analysis
● Outlier refers to an observation that stands out from the majority of the data, exhibiting
unusually high or low values.
● An outlier in data analysis might be a rotten apple in a dataset of quality apples. While the vast
majority of apples in the dataset may have a high-quality rating, the presence of a single rotten
apple can significantly impact the overall average quality rating for the entire dataset.
● In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes
through the box at the median. The whiskers go from each quartile to the minimum or maximum.
● The five-number summary divides the data into sections that each contain approximately 25% of
the data in that set.
● Outliers: Individual data points that are outside the whiskers are often marked as individual
points to draw attention to them.
● Boxplot: It provides a visual way to understand the central tendency, spread, and presence of
outliers in the data. A box plot consists of several key components:
Box: The central box in the plot represents the interquartile range (IQR), which is the range between
the 25th and 75th percentiles of the data. The box spans from the first quartile (Q1, or the 25th
percentile) to the third quartile (Q3, or the 75th percentile). This box contains 50% of the data.
Median: A line inside the box represents the median,
which is the middle value of the dataset when it is ordered.
Whiskers: Lines extending from the box, referred to as whiskers, show the range of the data.
They typically extend to the minimum and maximum values within a certain range. Any data
points outside this range are considered outliers and are marked as individual points to
draw attention to them.
Percentile and quartiles [used to find outliers]
A percentile is not the same as a percentage: if a value x is at the 25th percentile, it means 25% of the entire distribution is less than that particular value x.
Dataset: 2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12
1. What are the percentile rankings of 10 and 11? (16 of 20 values are below 10 → 80th percentile; 17 of 20 are below 11 → 85th percentile.)
Example: {1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 8, 9, 27}
With Q1 = 3 and Q3 = 7:
IQR = Q3 − Q1 = 7 − 3 = 4, so the fences are Q1 − 1.5·IQR = −3 and Q3 + 1.5·IQR = 13, and 27 is flagged as an outlier.
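A quick NumPy check of this arithmetic; note that np.percentile interpolates, so its quartiles can differ slightly from the hand method used on the slide:

# Quartiles, IQR, and percentile rank = (values below x) / n * 100.
import numpy as np

data = np.array([1, 2, 2, 2, 3, 3, 4, 5, 5, 5,
                 6, 6, 6, 6, 7, 8, 8, 8, 9, 27])

q1, q3 = np.percentile(data, [25, 75])
print("Q1:", q1, "Q3:", q3, "IQR:", q3 - q1)

x = 8
rank = np.mean(data < x) * 100      # share of values strictly below x
print(f"Percentile rank of {x}: {rank:.0f}%")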
How should a dataset be visualized - what kind of graphs are required to visualize the data? For a normally distributed variable (e.g., weight), values one or two standard deviations to the right of the mean relate to the empirical rule, and a variable Y can be converted to the standard normal distribution (SND) using the Z-score.
[Slide table of sample records omitted; rows such as (24, 10000, 60), (26, 20000, 65), (27, 30000, 68).]
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('test.csv')
age = data['Age']

# Visualize the distribution and spot potential outliers.
sns.boxplot(x=age)
plt.show()

mean = age.mean()
standard_deviation = age.std()
threshold = 3

outliers = []
for value in age:                    # renamed so the Series isn't shadowed
    if abs(value - mean) > threshold * standard_deviation:
        outliers.append(value)
print(outliers)
Outlier Detection Techniques
● Using Z-Score Method
It uses a dataset’s standard deviation and its mean to identify data points that are significantly
different from the majority of the other data points.
z = (x - μ)/σ. The Z-score is zero when x = μ, and it is ±1, ±2, or ±3 when x lies
1, 2, or 3 standard deviations from the mean, respectively.
data = pd.read_csv('test.csv')
age = data['Age']
mean = age.mean()
standard_deviation = age.std()
threshold = 3

outliers = []
for i in age:
    z = (i - mean) / standard_deviation
    if abs(z) > threshold:           # abs() catches outliers in both tails
        outliers.append(i)
print(outliers)

Data: https://drive.google.com/file/d/1REeHwEGXHhjO6R1kaTzni2XXLodmqEqG/view?usp=drive_link
Outlier Detection Techniques Using IQR
● The Interquartile Range (IQR) outlier detection method involves calculating the first and third
quartiles (Q1 and Q3) of a dataset and then identifying any data points that fall beyond the
range Q1 - 1.5 * IQR to Q3 + 1.5 * IQR, where IQR is the difference between Q3 and Q1.
● Data points that fall outside of this range are considered outliers.
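A sketch of this IQR rule, reusing the same test.csv Age column as the Z-score example above (assumes the file is present):

# IQR-based outlier detection on the Age column.
import pandas as pd

data = pd.read_csv('test.csv')
age = data['Age'].dropna()

q1 = age.quantile(0.25)
q3 = age.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = age[(age < lower) | (age > upper)]   # points beyond the fences
print(outliers)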
Feature Engineering
● Feature engineering is a crucial and creative process in machine learning
and data analysis. It involves selecting, transforming, and creating new
features (variables) from your raw data to improve the performance of
your machine learning models.
https://www.google.com/amp/s/www.geeksforgeeks.org/what-is-feature-engineering/amp/
Process Involved in Feature Engineering
● Feature Creation : process of generating new features based on domain knowledge or by
observing patterns in the data.
Types of Feature Creation:
Domain-Specific: Creating new features based on domain knowledge, such as creating features
based on business rules or industry standards.
Data-Driven: Creating new features by observing patterns in the data, such as calculating
aggregations or creating interaction features.
Synthetic: Generating new features by combining existing features or synthesizing new data
points.
● Feature Transformation : process of transforming the features into a more suitable representation
for the machine learning model. Types of feature transformations
Normalization: Scaling features to a similar range (e.g., between 0 and 1) to prevent some features
from dominating others.
Standardization: Scaling features to have a mean of 0 and a standard deviation of 1.
Logarithmic or Exponential Transformation: Useful for data with a skewed distribution.
Encoding: Converting categorical variables into numerical form using techniques like one-hot
encoding or label encoding.
● Feature Extraction: is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. Types of Feature Extraction-
Dimensionality Reduction: Reducing the number of features by transforming the data into a
lower-dimensional space while retaining important information. Examples are PCA and t-SNE.
Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
Feature Aggregation: Aggregating features to create a new one. For example, calculating the
mean, sum, or count of a set of features.
Feature Transformation: Transforming existing features into a new representation. For example,
log transformation of a feature with a skewed distribution.
● Feature Selection: is the process of selecting a subset of relevant features from the dataset to be
used in a machine-learning model. Types-Filter Method, Wrapper Method, Embedded Method
● Feature Scaling: is the process of transforming the features so that they have a similar scale.
Types- Min-Max Scaling, Standard Scaling, Robust Scaling.
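A short sketch of the encoding and scaling types described above, using pandas and scikit-learn; the column names and values are illustrative:

# One-hot encoding plus min-max and standard scaling.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"Size": ["S", "M", "L", "XL"],
                   "Height": [150.0, 165.0, 178.0, 190.0]})

# Encoding: categorical column -> numerical dummy columns.
df = pd.get_dummies(df, columns=["Size"])

# Normalization: scale Height into the [0, 1] range.
df["Height_minmax"] = MinMaxScaler().fit_transform(df[["Height"]])

# Standardization: mean 0, standard deviation 1.
df["Height_std"] = StandardScaler().fit_transform(df[["Height"]])
print(df)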
Data Wrangling
[Diagram overview:]
● Exploratory Data Analysis / Data Cleaning: row deletion and duplicate analysis, missing value analysis, outlier detection.
● Feature Engineering: 1. feature construction, 2. feature improvement, 3. feature selection, 4. feature extraction.
● Data Transformation: encoding, scaling, normalization, standardization.
Data Wrangling: Data wrangling, also known as data munging, is the process of
gathering, selecting, and organizing data from various sources into a usable format.
It involves activities like data extraction, merging datasets, dealing with missing
values, and handling outliers. Data wrangling is about getting the data into a
consistent structure for further processing.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
data = pd.read_csv('test.csv')

# Standardize Age: the mean becomes ~0 and the standard deviation ~1.
data['Age'] = ss.fit_transform(data[['Age']])
print(data['Age'].mean())
print(data['Age'].std())

sns.histplot(data['Age'], bins=50)
plt.show()
Unit - IV
Read slide no 150 to 173 only
Unit - V
Important Web links
https://www.geeksforgeeks.org/types-of-machine-learning/
https://www.javatpoint.com/types-of-machine-learning
Data : https://drive.google.com/drive/folders/1g2RWTLLJ60CUlT_aymdILqvf0S2igAst
Histogram
You count how many students got a score in each group. This is like counting how many
students got a score between 0 and 10, how many got a score between 11 and 20, and so on.
Then, you draw a bar for each group (bin) on a chart. The height of the bar represents the
number of students who got scores in that group. So, if more students got scores between 11
and 20, that bar will be taller.
● Using the seaborn library:
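A minimal seaborn sketch of the histogram just described; the scores are made up for illustration:

# Student scores grouped into bins of 10 marks each.
import seaborn as sns
import matplotlib.pyplot as plt

scores = [5, 12, 15, 18, 22, 35, 41, 44, 47, 52, 58, 63, 71, 74, 88, 93]

sns.histplot(scores, bins=list(range(0, 101, 10)))  # one bar per 10-mark group
plt.xlabel("Score")
plt.ylabel("Number of students")
plt.show()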
Correlation Matrix
● It is a table that shows the correlation coefficients between many variables. Each cell in the
table represents the correlation between two variables. The values range from -1 to 1,
indicating the strength and direction of the relationship between variables.
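A sketch of computing and visualizing a correlation matrix; the DataFrame is illustrative, loosely based on the age/salary/weight rows shown earlier:

# Correlation matrix and heat map.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({"Age": [24, 26, 27, 30],
                   "Salary": [10000, 20000, 30000, 45000],
                   "Weight": [60, 65, 68, 70]})

corr = df.corr()   # pairwise correlation coefficients in [-1, 1]
print(corr)

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()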
Case Study
https://drive.google.com/file/d/162GLBndnJyY8iDj025ApgeFb6-yfGtK6/view?usp=drive_link
https://www.kaggle.com/datasets/sudhanshu432/algerian-forest-fires-cleaned-dataset
https://www.kaggle.com/code/mehulnayak10/algerian-forest-fire-prediction-logistic-reg
https://muhammaddawoodaslam.medium.com/exploratory-data-analysis-eda-on-titanic-dataset-804034f394e6