
Madhav Institute of Technology & Science, Gwalior

Deemed to be University

NAAC Accredited with A++ Grade

2230522
Data Science in IOT

By - Dr. Dhananjay Bisen


Unit - I
About Data

● Data is just a collection of facts, like IoT sensor data.
● Data can be structured, unstructured, or semi-structured.
● Data can be numeric, string, document, audio, or video.
● It is used to extract important insights and to make business decisions.
Data (10-15 years ago):
● Small amounts, in KB
● Stored in tables (rows and columns)

Data (at present):
● Source: Internet and social media
● Very huge and difficult to handle, so we apply techniques like data science
● Implementation tools: Python, R programming
● Libraries (pre-built tools and functions): NumPy, Pandas, Matplotlib, etc.
Data Vs Information
Data:
Data is raw, unorganized facts that need to be processed. Data can be something simple and useless until it is organized.

Data is an individual unit that contains raw materials which do not carry any specific meaning.

Example: Data is a student’s test score.

Information:
When data is processed, organized, structured, or presented in a particular context, it is called information.

Information is a group of data that collectively carries a logical meaning.

Example: The average test score of a class is the information derived from the given data.

(Data science techniques extract important “Information” from “Data”)


Who is generating Data (Source)
• Satellites working for weather prediction and remote sensing generate data.
• Scientific and engineering practices generate huge data on a regular basis. [E.g., in IoT applications, different kinds of sensors generate data for research purposes.]
• Shopping stores handle millions of transactions per week; customer records in shopping malls.
• Share market (stocks, trading, business).
• Telecommunication networks carry vast amounts of data traffic.
• The medical and health industry generates massive amounts of data.
• Search engines and extensive use of social media.
Data Science and analytics Challenges
What are we doing with data now?
❖ We can apply data science techniques to data for the following purposes:
● Understand the importance of data
● Find important insights in data
● Analyze data to find its patterns and trends
❖ Data can be used to build ML models for decision making, prediction, detection, and classification.
❖ Employed for personalization, healthcare, finance, and more.
❖ Data can be applied in AI, NLP, and computer vision.
❖ Enhancing research, optimization, and policy-making.
Why Data Science is required/needed?
● Data science is required because it plays a crucial role in addressing the
challenges posed by the ever-increasing volume of data in our modern world.
● It enables organizations to extract valuable insights and patterns from large
amounts of data, leading to better decision-making, improved products and
services, and enhanced understanding of complex phenomena.
● Data science is needed to turn vast data into insights that drive decisions,
solve problems, personalize experiences, and innovate across industries.
Data Science Life Cycle

Data understanding (collecting all data) → EDA / data modelling (remove or manage garbage and raw data) → Model building (techniques: Artificial Intelligence, Machine Learning, Deep Learning) → Model evaluation (recognizing the pattern) → Model deployment (final results).
Data Acquisition
Sources: Web servers, logs, Databases, API’s, Online repositories.
Data pre-processing
Data cleaning: Inconsistent
Data type, misspelled attributes,
missing and duplicate values
Data Transformation
Exploratory data analysis:
Defines and refines the selection of the feature variables that will be required in model development.
Feature Engineering
Feature Selection
Machine Learning
● Identify the model that best fits the system requirements.
● Train the models on the training dataset and test the model.
● Select the best-performing model.
● Typically done with Python.
Pattern Evaluation
Knowledge Representation

Visualization and
communication to User
Deploys and maintains
What is Data Science
● Data science is an interdisciplinary field that combines techniques, tools, and
methodologies to extract useful insights and knowledge from data.
● It involves various processes, including data collection, cleaning, analysis, visualization, interpretation, and modeling of data to solve real-world problems.
● The goal of data science is to transform raw data into actionable information
that can drive informed decisions, solve complex problems, and generate value
across diverse industries and domains.
● Data science is the study of data to extract meaningful insights for
business.
● It is a multidisciplinary approach that combines principles and practices
from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.
● This analysis helps data scientists to ask and answer questions like
what happened, why it happened, what will happen, and what can be
done with the results.

The word first appeared in the ’60s as an alternative name for statistics. In the
late ’90s, computer science professionals formalized the term. It works on
three aspects: data design, collection, and analysis.
Various fields of Data Science
• Data Pre-processing (Data Analysis, Data Visualization)
• Statistical Computing
• Statistical Modeling
• Machine Learning
• Pattern Recognition
• Real world applications
Data Science Life Cycle
● Business Understanding: At the start, you immerse yourself in understanding the
business problem you're aiming to solve. Collaborate with stakeholders to define the
problem, its goals, and requirements. This step ensures that your data science efforts are
aligned with the business's objectives and deliver value.
● Data Mining: Once the problem is defined, you gather relevant data from various
sources, such as databases, APIs, or files. This is where you collect the raw material
needed for analysis. The data collected should ideally cover the aspects necessary to
address the problem effectively.
● Data Cleaning: The collected data often comes with imperfections like missing values,
outliers, or inconsistencies. In this step, you clean and preprocess the data to ensure its
quality. This includes filling in missing values, removing outliers, and correcting errors
to ensure accurate results.
● Data Exploration: After cleaning, you explore the data to understand its characteristics
and potential insights. You create visualizations and conduct statistical analyses to
reveal patterns, relationships, and trends within the data. This step helps you form
hypotheses and identify potential areas for further analysis.
● Feature Engineering: Feature engineering involves selecting and transforming the right features (variables) from the data to enhance model performance. This step could include creating new features, applying transformations, or combining existing features to provide the most relevant and impactful input for your models.
● Predictive Modeling: Now you build predictive models using machine learning
algorithms. These models learn patterns from the data to make predictions or
classifications. You split the data into training and testing sets, train the model on the
training set, and then evaluate its performance on the testing set.
● Data Visualization: Data visualization is crucial for conveying insights to both
technical and non-technical audiences. You create graphs, charts, and other visual
representations to showcase trends, patterns, and relationships within the data.
Visualizations help you communicate findings effectively.
In short:

Data Analysis
● Data analysis is the process of finding insights from information.
Data → Information → Insights


About Data analytics
● Data analytics is the application of data analysis tools and procedures to realize
value from the huge volumes of data generated by connected Internet of
Things devices.
● The potential of IoT analytics is often discussed in relation to the Industrial IoT.
● The IIoT makes it possible for organizations to collect and analyze data from
sensors on manufacturing equipment, pipelines, weather stations, smart meters,
delivery trucks and other types of machinery.
● IoT analytics offers similar benefits for the management of data centers and other
facilities, as well as retail and healthcare applications.
Data Analysis life cycle
Phase 1: Discovery
The data science team learns about and investigates the problem, develops context and understanding, and comes to know about the data sources needed and available for the project.
Phase 2: Data Preparation
Steps to explore, preprocess, and condition data prior to modeling and analysis. It requires the presence of an analytic sandbox; the team executes extract, load, and transform to get data into the sandbox. Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine
Phase 3: Model Planning
The team explores the data to learn about relationships between variables and subsequently selects the key variables and the most suitable models.
In this phase, the data science team develops datasets for training, testing, and production purposes. Several tools commonly used for this phase are Matlab and STATISTICA.
Phase 4: Model Building
The team develops datasets for testing, training, and production purposes.
Free or open-source tools: R and PL/R, Octave, WEKA.
Phase 5: Communicate Results
After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure. The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account caveats and assumptions.
The team should identify key findings, quantify business value, and develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment. The team delivers final reports, briefings, and code.
Free or open-source tools: Octave, WEKA, SQL, MADlib.
Data Science ToolKit
● Python: NumPy, Pandas, MatplotLib, Seaborn, sklearn
● Tableau: the most popular data visualization tool in the market, from an American interactive data visualization software company founded in January 2003 and recently acquired by Salesforce.
● It provides facilities to break down raw, unformatted data into a processable and understandable format. It can visualize geographical data and plot longitudes and latitudes on maps. Latest version: Tableau 2020.2.
● TensorFlow: developed by the Google Brain team, TensorFlow is a free and open-source software library for dataflow.
● It provides an environment for building and training models and deploying them on platforms such as computers, smartphones, and servers, to achieve maximum potential with finite resources. Latest version: TensorFlow 2.2.0.
● Jupyter: developed by Project Jupyter (first released in February 2015), it provides open-source software, open standards, and services for interactive computing across dozens of programming languages.
● It is a web-based application tool running on a kernel, used for writing live code, visualizations, and presentations.
● Latest version: Jupyter Notebook 6.0.3.
Type of data analytics
● Descriptive Analytics: shows what is going on in the business.
● Diagnostic Analytics: investigates why there is a sudden dip in sales, or why profit is going down.
● Predictive Analytics: shows what will happen in the future based on past trends (e.g., what will the sales be in the next quarter?).
● Prescriptive Analytics: makes conclusions based on all the previous analyses and advises what to do next (e.g., should a new product be launched?).

IOT Data Analytics:


https://www.upsolver.com/blog/iot-analytics-challenges-applications-innovations
https://www.youtube.com/watch?v=q9oAZwhuUy4

https://www.youtube.com/watch?v=017B07EHe2M

https://www.youtube.com/watch?v=PFhFdziYeB4
Time series Data
● Time series data is data that is recorded over consistent intervals of time. These
data points are often ordered chronologically, and the time dimension is a critical
component in the analysis.

● Common sources of time series data: Financial Markets, Healthcare Records, Social
Media Activity

Time series data analytics:
Read content from the given link:
https://www.tableau.com/learn/articles/time-series-analysis


Transactional Data
Transactional data refers to records of activities or events that occur as part of business
operations or transactions within an organization. These data capture the details of
individual transactions and are essential for tracking and managing various business
processes.
E.g. Product Sales data, Customer service data, Product Order data

Transactional Data Analytics


● Many businesses use transactional data to track their financial performance. This
includes tracking revenue, expenses, and profits. This information can help you
understand where a business is making and spending money, so you can make
informed decisions about financial planning and budgeting
Biological Data
● Biological data refers to compounds or information derived from living organisms and their products.
● A medicinal compound made from living organisms, such as a serum or a vaccine, could be characterized as biological data.
● Biological data is highly complex compared with other forms of data.
Spatial data
● Spatial data refers to any data that has a geographic or spatial component,
meaning it is associated with a specific location on the Earth's surface.
● Spatial data can be referred to as geographic data or geospatial data. Spatial
data provides the information that identifies the location of features and
boundaries on Earth. Spatial data can be processed and analysed using
Geographical Information Systems (GIS) or Image Processing packages.
● Spatial data analytics
○ Geoanalytics (location-based data). Read content from the given link: https://www.qlik.com/us/data-analytics/spatial-analysis

Streaming data analytics:
Read content from the given link:
https://streamsets.com/blog/what-is-streaming-analytics/
Social Network Data
● Social network data comes from connections between people or
entities. These connections could be friendships on Facebook,
followers on Twitter, collaborations between researchers, or any
other relationships.
● Examples: comments, images.
Data Evolution
● Data evolution refers to the dynamic and continuous process of the
creation, transformation, and utilization of data over time.
● It encompasses the entire lifecycle of data, from its initial generation
through various sources to its storage, processing, and eventual analysis.
● Data evolution involves adapting to the changing nature of information,
including shifts in data formats, the growing volume of data,
advancements in storage and processing technologies, and the
continuous development of tools and techniques for extracting
meaningful insights from data.
Impact of Data science in IOT
Data Science plays a crucial role in developing IoT applications in various
domains such as Predictive Maintenance, Retail Analysis, Healthcare, traffic
management and many more.

Companies can utilize these techniques in IoT to reduce labor costs, automate
processes, improve operational efficiency and thereby intelligently gather insights
from huge data and take meaningful actions immediately. [IIOT]

• Prepare effective data for Machine Learning, perform prediction and classification on data.
Data lakes
A data lake is a system or repository of data stored in its natural/raw format,
usually object blobs (raw data) or files.
A data lake is usually a single store of data including raw copies of source system
data, sensor data, social data etc. and transformed data used for tasks such as
reporting, visualization, advanced analytics, and machine learning.
A data lake can include structured data from relational databases (rows and
columns), semi-structured data (CSV, logs, XML, JSON), unstructured data
(emails, documents, PDFs), and binary data (images, audio, video).
A data lake can be established "on premises" (within an organization's data
centers) or "in the cloud" (using cloud services).
Data Lakes
Data lakes are platforms that offer storage and compute capabilities for data, and can be used to help
organizations make data-driven decisions. Here are some examples of data lakes:

Amazon Web Services (AWS): Offers a data lake architecture that includes Amazon S3, a storage service with
low latency and high availability

Azure Data Lake Storage: A Microsoft Azure data lake solution that includes built-in data encryption, access
control policies, and auditing capabilities

Google Cloud Storage: A flexible and scalable storage option for building data lakes

Apache Hadoop: A popular data lake

Vantage Cloud: A platform with analytics capabilities that include machine learning, graph analytics, and
advanced SQL

Data lakes can be used in many industries, including streaming media, finance, healthcare, and more.

For example, streaming companies can use data lakes to collect and process insights on customer behavior to
improve their recommendation algorithms.
Data Retention Strategy
A data retention strategy, or policy, is a system of rules that organizations use to manage the data they
collect and generate. A data retention policy should include the following:
Data inventory: Identify the data the organization has
Data categories: Define what types of data the organization has
Retention periods: Specify how long to keep different types of data
Data handling procedures: Describe how the data should be stored, where, and in what format
Data destruction procedures: Develop safe ways to destroy data when the retention period ends
Policy training: Educate employees about the policy and who can dispose of data
Compliance auditing: Regularly monitor and audit compliance with the policy
Policy updates: Periodically review and update the policy
Data Retention Strategy
When creating a data retention policy, organizations should consider the following factors:
Business requirements: How the data will be used
Storage costs: How much it costs to store the data
Regulatory and compliance concerns: Any laws or regulations that may apply to the data
Version Controlling
● Version control in data science is like keeping a detailed diary for your project.
● Imagine you're working on a team project where everyone writes their part of a
story. Version control ensures that everyone's work is organized, and changes are
tracked over time.
● It's like having a time machine for your project - if something goes wrong or you
want to see how things looked before, you can easily go back to previous
versions.
● This is crucial in data science, where you're not just dealing with code but also
with datasets, models, and various experiments.
● Types - Local Version Control System, Centralized Version Control System,
Distributed Version Control System.
Skill Required
Data Engineer v/s Data Analyst v/s Data Scientist
● Data Engineer - The role is to design, develop, and maintain the data infrastructure (data warehouse) and the systems necessary for efficiently handling and processing large volumes of data. Data engineers play a crucial role in ensuring that data is collected, stored, and made accessible in a structured and reliable manner, supporting the needs of data analysts, data scientists, and other stakeholders within an organization.

Skills - Python, SQL, Knowledge of building and maintaining data warehouse etc.

● Data Analyst - Data Analyst collect data from different sources, clean it to remove errors, and then
analyze it to find useful patterns and trends. These insights are presented in a way that is easy for
others to understand.
Skills - Python, Statistics, Data visualization.

● Data Scientist - The role of a data scientist is to use advanced analytics and machine learning
techniques to analyze complex data sets and derive valuable insights from them. Data scientists play a
crucial role in turning data into useful insights that can benefit the organization.
Data → Information → Useful Insights

Data Engineer: skills - programming, databases, big data handling (Hadoop). Develops applications, works on big data (extracts data) and stores it in a warehouse in an efficient way; provides the raw material to the data analyst.
Data Analyst: skills - programming, databases, some form of analytics. Does analysis (statistics), visualizes data, and finds insights. (Data Analytics)
Data Scientist: builds on data analytics to do things like prediction and detection. (Data Science)
Concepts

● Features: number of columns = number of features. One dimension means one feature; similarly, n dimensions means n features.
● Supervised learning: when you know the output along with the input.
● Unsupervised learning: when you do not know the output.
● Reinforcement learning: supervised + unsupervised. E.g., a robot learns from its mistakes.
Unit - II
Understanding Data
● Data refers to facts and statistics collected together for reference or analysis
Types of Data
❖ Numeric data
❖ Categorical data
❖ Graphical data
❖ High Dimensional data

● Numeric Data: numerical data, also known as quantitative or continuous data, consists of measurable quantities represented with real numbers. It includes both discrete and continuous data. E.g., height, weight, etc.
● Categorical data : Categorical data, also known as qualitative data,
represents categories or labels and cannot be measured in a meaningful way.
It can be nominal or ordinal. E.g. Gender, color, feedback etc.
● Graphical Data : Graphical data, also known as image data or visual data,
represents information in a visual form. It can include images, charts, graphs, and
other visual representations. E.g. Photographs, graphs, charts, diagrams.
What is High Dimensional Data?
(Definition & Examples)
● High dimensional data refers to a dataset in which the number of features
p is larger than the number of observations N, often written as p >> N.
● For example, a dataset that has p = 6 features and only N = 3 observations
would be considered high dimensional data because the number of features is
larger than the number of observations.
Example
Example 1: Healthcare Data
High dimensional data is common in healthcare datasets where the number of features
for a given individual can be massive (i.e. blood pressure, resting heart rate, immune
system status, surgery history, height, weight, existing conditions, etc.).
In these datasets, it’s common for the number of features to be larger than the number
of observations.
How to Handle High Dimensional Data
● Drop features with many missing values
● Drop features with low variance
● Drop features with low correlation with the response variable
● Use a dimensionality-reduction or regularization method: Principal Components Analysis, Principal Components Regression, Ridge Regression, Lasso Regression (a PCA sketch follows below)
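A minimal PCA sketch with scikit-learn; the synthetic p = 6, N = 3 dataset mirrors the example above, and the random data is an assumption for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 6))           # N = 3 observations, p = 6 features (p >> N)

pca = PCA(n_components=2)             # keep only 2 principal components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (3, 2): observations now live in 2 dimensions
print(pca.explained_variance_ratio_)  # share of variance captured per component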
Qualitative Data
● Qualitative data deals with characteristics and descriptors that can’t be easily measured but
can be observed.
● Also called Categorical Data
Quantitative Data
● Quantitative data deals with numbers & things that you can measure.
● Also called Numerical Data.

Discrete Data: can hold a finite number of possible values (e.g., the number of students in a class).
Continuous Data: can hold an infinite number of possible values (e.g., the weight of a person).
Variable
A variable is a characteristic or attribute that can take different values. Variables are
fundamental components in datasets and play a crucial role in statistical analysis,
A variable is a property that can take on any value
Eg: Height {178, 168 } CM
Weight: {78,67} KG
Variables
Two kinds of variables:
1. Quantitative / numerical variable: a property measured numerically (add, subtract, multiply, divide). E.g., age, weight, height.
2. Qualitative / categorical variable: we have categories (based on some characteristic we can derive a categorical variable).
E.g., Gender (male, female)
E.g., IQ (0-10 → low IQ) (10-50 → medium IQ) (50-100 → good IQ)
Blood group: A+ve, A-ve
T-shirt size: L, XL, M, S
(Ordinal means ordered, e.g., a rating/feedback type.)

Quantitative variables:
Discrete variable (whole numbers):
● Number of bank accounts of a person: 3, 2, 1 (1.5 is not possible)
● Number of children in a family: 2, 1 (1.5 is not possible)
Continuous variable (any value):
● Height: 172.5 cm, 162.5 cm
● Weight: 90 kg, 99.5 kg
● Rainfall: 1.1 in, 1.5 in

Quick classification:
● What kind of variable is gender? Categorical
● What kind of variable is marital status? Categorical
● River length: continuous
● Population of a state: discrete
● Song length: continuous
● Blood pressure: continuous

Variable measurement scales:
Four types of measured variables:
● Nominal, Ordinal (qualitative data)
● Interval, Ratio (quantitative data)

Nominal data: categorical data consisting of classes, e.g., colors, gender, type of flower.

Ordinal data: the order of the data matters but the value does not.
E.g., students' marks converted to ranks (the ranks are ordinal data):
Marks 100 → rank 1, 96 → rank 2, 85 → rank 3, 57 → rank 4, 44 → rank 5.

Interval data: order matters and value also matters, but a natural zero is not present.
E.g., temperature intervals in Fahrenheit: (70-80), (80-90), (90-100); here 0 is not a useful "nothing" point.
Interval data is a type of numerical data measured on a scale with equal distances between values, called intervals. It is used in many quantitative studies, such as calculating demographic information, testing scores, and credit ratings.

Ratio data: order matters, value matters, and a natural zero is present.
Example

https://www.youtube.com/watch?v=FqB5Es1HXI4
Data Representation
Frequency Distribution:

Sample data: Rose, Lilly, Sunflower, Rose, Lilly, Sunflower, Rose, Lilly, Lilly

Flower    | Frequency | Cumulative Frequency
Rose      | 3         | 3
Lilly     | 4         | 3 + 4 = 7
Sunflower | 2         | 7 + 2 = 9 (total number of flowers)

Bar graph (bar chart): used for discrete or categorical values.
Histogram: used for continuous values; bins = 10 by default. Smoothing a histogram gives the PDF (probability density function).
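As a small sketch, the frequency table above can be reproduced with pandas, and the counts drawn as a bar graph:

import pandas as pd
import matplotlib.pyplot as plt

flowers = pd.Series(['Rose', 'Lilly', 'Sunflower', 'Rose', 'Lilly',
                     'Sunflower', 'Rose', 'Lilly', 'Lilly'])
freq = flowers.value_counts()   # frequency distribution
print(freq)
print(freq.cumsum())            # cumulative frequency (total = 9)
freq.plot(kind='bar')           # bar graph for categorical data
plt.show()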
Classification of Digital Data
● Structured: Structured data is organized in a specific format with a well-defined schema.
It is typically tabular and can be easily stored in databases
○ Organized in a table or relational database.
○ Well-defined schema with fixed data types.
○ Easily queryable using standard database queries. Eg. spreadsheet, relational database
● Semi-Structured: falls between structured and unstructured data. It exhibits some level of
structure but does not adhere to the rigid schema found in structured data. Semi-structured
data is often more flexible than structured data, allowing for variations in the representation
of information. E.g. XML data.
● Un-structured: lacks a predefined data model or schema. It does not fit neatly into
traditional relational databases and is often more challenging to analyze.
○ No fixed format or structure.
○ Varied types of data, often human-generated. E.g. Word documents, PDFs, images,
videos, social media posts.
Structured data
Tracking all transactions: use a relational database (RDBMS - MySQL, Oracle, etc.) and store the data in a structured format.

Point-of-sale software: POS software also has helpful tools like sales reporting, inventory management, and integrated loyalty programs.

Unstructured data: survey forms, images, video, etc.
Semi-structured data: Extensible Markup Language (XML).

Source of Data/Collection of data
● Primary data (collected afresh and first time) (Ex. Filling any form)
○ Observation (structured and unstructured)
○ Interview (personal /telephonic)
○ Questionnaire
○ Schedule (Enumerator)

● Secondary data (has already been collected by someone else) (e.g., using filled-form data)
○ Internal sources: company records, employee records, sales records, financial records
○ External sources: publications by government and private organizations, books and magazines, journals, newspapers, online websites, etc.
Qualitative data collection methods
Quantitative data collection methods
https://www.youtube.com/watch?v=mUfYXr4VKgI
Statistics
Statistics is the science of collecting, organizing and analyzing data. (Better Decision Making)
Data: Facts or pieces of Information that can be measured.

Example:

The mid sem marks of a class {18, 14, 16, 17, 18}

Age of students of a class {30,25,23,24,20,21}

Type of statistics:

Descriptive statistics: It consists of organizing and summarizing data.

Measure of Central Tendency [Mean, Medium and mode], Measure of Dispersion [Variance, Standard Deviation, Z-Score], Different Type
of Distribution of data [Histogram, PDF, PMF, CDF, Gaussian Distribution, Log Normal Distribution, Exponential, Binomial, Bernoulli
Distribution, Poisson Distribution], power law, standard normal distribution, Boxplot, Histogram, Distribution Plot, EDA, Feature
Engineering, transformation and standardization.

Inferential statistics: techniques wherein we use the data we have measured to form conclusions.

Population data → sample → conclusion

Z-test, t-test, Chi Square test, Anova Test, f-test Hypothesis Testing, null hypothesis, alternative hypothesis, P value, confidence level,
Significance Value.
Statistics
● Any raw Data, when collected and organized in the form of numerical or tables, is
known as Statistics. Statistics is also the mathematical study of the probability of
events occurring based on known quantitative Data or a Collection of Data.

● Types of Statistics - Descriptive Statistics Inferential Statistics

● Descriptive Statistics: In the descriptive Statistics, the Data is described in a


summarized way. The summarization is done from the sample of the population using
different parameters like Mean or standard deviation.
● Inferential Statistics: In the Inferential Statistics, we try to interpret the Meaning of
descriptive Statistics. After the Data has been collected, analyzed, and summarised
we use Inferential Statistics to describe the Meaning of the collected Data.
Example
NEC students (50) in my Classroom, Marks of the 1st assignment
{23,34,34,10,12,14,.....}
What type of questions come under descriptive statistics?
What is the average mark of the students in my class? (mean, median, mode, pass %, percentile)
What type of questions come under inferential statistics? (Form conclusions from data.)
Are the marks of the students of my classroom similar to the marks of the college? (All NEC class students, suppose five classes.)

My class (sample) → all NEC classes of the college (population)
Example
Let's say there are 10 sports camps in MITS and you have collected the heights of players from one of the camps.
Heights recorded: [175 cm, 180 cm, 140 cm, 141 cm, 135 cm, 160 cm]
What type of questions come under descriptive statistics?
What is the average height of players in one camp? (mean, median, mode, pass %, percentile)
Distribution of the data (how many standard deviations a value is away from the mean: Z-score)
What type of questions come under inferential statistics? (Form conclusions from data.)
Are the heights of the players of camp 1 similar to the heights of the players of all 10 camps?

Camp 1 (sample) → all 10 camps (population)
Population and samples
Election: (Exit Poll Result)-sample select randomly

Population (N): population is a group or a superset of data that you are interested in
studying.

Sample(n): A sample is a subset of population data.

Why select the sample randomly?
How can we select a sample? (This requires sampling techniques.)

Simple random sampling: pick data points at random (e.g., an exit poll).
When performing simple random sampling, every member of the population (N) has an equal chance of being selected for the sample (n).
Stratified sampling: the population (N) is split into non-overlapping groups (strata/layers).
Example: when performing a survey, males may give one kind of response and females another (gender → male / female).
A survey based on age groups: (0-10) (10-20) (20-40) (40-100) (here there is no chance of overlapping).
Based on profession, can I do stratified sampling?
PHP developer, data science developer, Python developer: does not apply (the groups overlap).
Doctor, engineer: applies.
Systematic sampling: select every nth individual from the population (N).
E.g., in a mall survey, select every 8th person.

Convenience sampling: e.g., a survey on data science that allows only those people who have knowledge of data science.
Example
Exit poll: random sampling.
RBI household survey: consider only women in the survey (stratified).
A drug needs to be tested: the drug is for everyone (random) or for a specific age group (strata); it depends on the use case.
Measure of Central Tendency
Mean, Median, Mode

Refers to the measure used to determine the centre of the distribution of data.
Measures of Center
● Mean (Average): The sum of all values divided by the number of observations.

● Median: The middle value of a sorted dataset. If the dataset has an even number
of observations, the median is the average of the two middle values.

● Mode: The value(s) that occur most frequently in the dataset.


Note: A dataset may have no mode (no value repeats), one mode (unimodal), or
multiple modes (multimodal).
Measures of Spread (Variability or Dispersion):
● Range: The difference between the maximum and minimum values in the
dataset. Range = Max−Min

● Variance: The average of the squared differences from the mean.

● Standard Deviation:The square root of the variance. It measures the average


distance of data points from the mean.

● Interquartile Range (IQR): The range between the first quartile (Q1) and the
third quartile (Q3). It measures the spread of the middle 50% of the data.
IQR = Q3−Q1
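A short sketch computing these measures with NumPy (note that np.var and np.std use the population formulas by default; pass ddof=1 for the sample versions):

import numpy as np

data = np.array([18, 14, 16, 17, 18])   # mid-sem marks from the earlier example
print(np.mean(data))                     # mean
print(np.median(data))                   # median
print(np.max(data) - np.min(data))       # range = max - min
print(np.var(data))                      # population variance
print(np.std(data))                      # population standard deviation
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)                           # interquartile range (IQR)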
Measure of Central Tendency
Arithmetic mean for population and sample:
Population (N) mean: μ = (Σxᵢ) / N
Sample (n) mean: x̄ = (Σxᵢ) / n
E.g., X = {1,1,2,2,3,4,5,5,6}
Example
X = {1,1,2,2,3,4,5,5,6}: mean = 29/9 ≈ 3.2, median = 3
X = {1,1,2,2,3,4,5,5,6,100}: mean = 129/10 = 12.9, median = 3.5
The outlier 100 is completely different from the rest of the distribution; adding it causes a huge movement (difference) in the mean, while the median barely changes. The median works well with outliers.

Median:
1. Sort the numbers.
Odd count: X = {1,1,2,2,3,3,4,5,5,6,100} has 11 elements → the median is the middle (6th) element = 3.
Even count: X = {1,1,2,2,3,3,4,5,5,6,100,112} has 12 elements → median = (3+4)/2 = 3.5.
There is little difference in the median if an outlier is added.

Mode
Mode = the most frequent element (a measure of central tendency).
Mode of {1,2,2,3,4,5,6,6,6,7,8,100,200} = 6.
The mode works for both numerical and categorical variables, but it works especially well for categorical variables.
Mode of {1,2,2,3,4,5,6,6,6,7,8,100,100,100,100} = 100 (a repeated outlier can become the mode).

Type of flowers: rose, lilly, sunflower, with 10% missing values. What technique can we use?
The missing values can be replaced with the most frequent element (the mode).

Age: 25, 26, _, _, _, 32, 34, 38 (with missing values): mean, median, or mode?

https://byjus.com/maths/mode/
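A hedged sketch of mode imputation for the flower example (the small Series here is made up for illustration); for a numeric column like age, the median is often preferred because it is robust to outliers:

import numpy as np
import pandas as pd

flowers = pd.Series(['rose', 'lilly', 'sunflower', np.nan, 'lilly', np.nan])
print(flowers.fillna(flowers.mode()[0]).tolist())   # mode()[0] = most frequent value

ages = pd.Series([25, 26, np.nan, np.nan, np.nan, 32, 34, 38])
print(ages.fillna(ages.median()).tolist())          # median imputation for numeric data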
Measure of Dispersion (spread)
Variance, standard deviation [the concept of a measure of dispersion]

{1,1,2,2,4}: mean = 10/5 = 2
{2,2,2,2,2}: mean = 10/5 = 2

The means are the same, so how can we identify that their distributions are different?
We need variance and standard deviation.

Variance
Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Variance is a statistical measurement of the spread between numbers in a data set. It measures how far each number in the set is from the mean (average), and thus from every other number in the set. Higher variance means higher spread.

https://www.cuemath.com/data/variance/
Standard Deviation
Standard deviation is the square root of variance; both are measures of how spread out the data is. A data point can be located by how many standard deviations it lies to the left or right of the mean.
Differences…..

Variance
A measure of how spread out all data points are in a
data set. It's the average of the squared deviations
from the mean. Variance is expressed in larger units
than standard deviation, such as meters squared.

Standard deviation
A measure of how far apart data points are from the
mean. It's expressed in the same units as the original
data values, such as meters or minutes. A small
standard deviation means the data is tightly grouped
around the mean, while a larger standard deviation
means the data is more spread out.
Skewness
https://www.geeksforgeeks.org/how-to-calculate-skewness-and-kurtosis-in-python/?ref=lbp
https://www.geeksforgeeks.org/difference-between-skewness-and-kurtosis/?ref=lbp

Types of skewness:
● Left skew: the tail extends to the left, most data is clustered on the right (mean < median). Skewness < 0: more weight in the left tail of the distribution.
● Zero skew: a perfectly symmetrical distribution, with mean, median, and mode all equal. Skewness = 0: normally distributed.
● Right skew: the tail extends to the right, most data is clustered on the left (mean > median). Skewness > 0: more weight in the right tail of the distribution.

What are the properties of normal distributions?
Normal distributions have key characteristics that are easy to spot in graphs:

● The mean, median and mode are exactly the same.

● The distribution is symmetric about the mean—half the values fall below the mean and half above the mean.

● The distribution can be described by two values: the mean and the standard deviation.
Empirical rule
The empirical rule, or the 68-95-99.7 rule, tells you where most of your values lie in a normal distribution:

● Around 68% of values are within 1 standard deviation from the mean.

● Around 95% of values are within 2 standard deviations from the mean.

● Around 99.7% of values are within 3 standard deviations from the mean.
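A quick empirical check of the 68-95-99.7 rule on simulated standard normal data:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100_000)
for k in (1, 2, 3):
    share = np.mean(np.abs(x) <= k)       # fraction within k standard deviations
    print(f"within {k} SD: {share:.3f}")  # roughly 0.683, 0.954, 0.997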
Kurtosis
https://www.slideshare.net/AmbaPant/introduction-to-kurtosis

Example questions and formula:
https://analystprep.com/cfa-level-1-exam/quantitative-methods/kurtosis-and-skewness-types-of-distributions/

A kurtosis of 3 corresponds to a normal distribution.
There are many methods available for skewness and kurtosis, therefore the formula may vary.
Sampling Distribution
● A sampling distribution refers to a probability distribution of a statistic that comes
from choosing random samples of a given population. Also known as a finite-sample
distribution, it represents the distribution of frequencies on how spread apart various
outcomes will be for a specific population. Types-
1. Sampling distribution of mean: you can calculate the mean of every sample group
chosen from the population and plot out all the data points. The graph will show a
normal distribution, and the center will be the mean of the sampling distribution,
which is the mean of the entire population.
2. Sampling distribution of proportion: It gives you information about proportions
in a population. You would select samples from the population and get the sample
proportion. The mean of all the sample proportions that you calculate from each
sample group would become the proportion of the entire population.
3. T-distribution: the t-distribution is used when the sample size is very small or not much is known about the population. It is used to estimate the mean of the population, confidence intervals, statistical differences, and linear regression.
Hypothesis Testing
● Hypothesis Testing is a type of statistical analysis in which you put your assumptions
about a population parameter to the test. It is used to estimate the relationship between
2 statistical variables.
Probability Theory
● Probability theory is a foundational concept that deals with uncertainty and
randomness.
● It provides a mathematical framework for modeling and analyzing random events.
● Probability spaces, random variables, and probability distributions are essential
components.
● Common distributions like the binomial, normal, Poisson, and exponential help
model various scenarios
Conditional Probability
● Conditional probability is defined as the likelihood of an event or outcome occurring,
based on the occurrence of a previous event or outcome. Conditional probability is
calculated by multiplying the probability of the preceding event by the updated
probability of the succeeding, or conditional, event.

https://www.vedantu.com/maths/types-of-statistics
Study Python
Python libraries: NumPy, Pandas, Matplotlib, Seaborn

Example: Measures of Central Tendency (Descriptive Statistics - Titanic Dataset)
https://www.kaggle.com/code/thabresh/descriptive-statistics-titanic-dataset

Numpy Library
Python Library
● Python library is a collection of functions and methods that
allows you/user to perform many action without writing
complex code
Python libraries for statistical analysis
• NumPy (numerical computing / complex mathematical computation)
– Scientific Computations
– Multi-dimensional array objects
– Data manipulation

• SciPy
–Mathematical computations
– Provide a collection of sub-packages
– Advanced linear algebra functions
– Support for signal processing

• Pandas (data manipulation with pandas)


– Dataframe Objects
– Process large data sets
– Complex Data Analysis
– Time Series Data
Python libraries for data visualization
• Matplotlib (bar plots, graphs, charts, scatter plots)
– Plot a variety of graphs
– Extract quantitative info
– Pyplot module (similar to MATLAB)
– Integrates with tools
• Plotly
– In-built API
• Bokeh
– Create complex statistical graphs
– Integration with Flask and Django
• Seaborn
– Compatible with various data formats
– Support for automated statistical estimation
– High-level abstractions
– Support for built-in themes (graph themes)
Python libraries for machine learning
• Scikit-learn
– Provides a set of standard datasets
– Machine learning algorithm
– In-build functions for feature extraction and selection
– Model evaluation

• XGBoost
– eXtreme Gradient Boosting
– Fast data processing
– Support parallel computation
– Provides internal parameters for evaluation
– Higher Accuracy

• Eli5
Python library for Deep learning
• TensorFlow
– Build and train multiple neural networks
– Statistical analysis
– In-build functions to improve the accuracy
– Comes with an in-build visualizer

• Keras
– Support various type of neural networks
– Advanced neural networks computations
– Provide pre-processed dataset
– Easily extensible

• Pytorch
– APIs to integrate with ML/Data science frameworks
Numpy
● It stands for Numerical Python (Core library for numeric & scientific
computing ).
● NumPy is a Python library used for working with arrays.
● The array object in NumPy is called ndarray.
● NumPy arrays are stored at one continuous place in memory unlike
lists, so processes can access and manipulate them very efficiently.
This behavior is called locality of reference in computer science.
● NumPy arrays are used to store Homogeneous data.
Numpy Array v/s Python List

NumPy array:
● All elements of an array are of the same data type.
● Elements of an array are stored in contiguous memory locations.
● Arrays are static and cannot be resized once they are created.
● Arrays support element-wise operations.
● Arrays take less space in memory.

Python list:
● A list can have elements of different data types.
● List elements are not stored contiguously in memory.
● A list can be resized and modified easily.
● Lists do not support element-wise operations.
● Lists take more space in memory.
Array Creation
● np.zeros((rows, cols)): initialise a NumPy array with 0.
● np.full((dim), value): initialise an array with any value.
● np.ones((rows, cols)): initialise an array with 1.
● np.arange(initial-val, final-val, gap): create an array within a range.
● np.random.randint(initial-val, final-val, no-of-random-ints): initialise an array with random numbers. random is a submodule inside NumPy and randint() is a method of random.
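A short sketch of these creation routines:

import numpy as np

print(np.zeros((2, 3)))              # 2x3 array of zeros
print(np.full((2, 2), 7))            # 2x2 array filled with 7
print(np.ones((3, 2)))               # 3x2 array of ones
print(np.arange(0, 10, 2))           # [0 2 4 6 8]
print(np.random.randint(1, 100, 5))  # 5 random integers in [1, 100)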
Array Attributes
1. ndim : to get dimension of array. E.g. 1-Dimension, 2-D…
2. shape : to get shape. E.g. 2x3 , 3x2
3. size : size means number of elements.
4. dtype : data type of elements stored in array. E.g. int, float…
5. itemsize : size of items stored. E.g. for int size = 4 bytes
Changing array shape
● Three attributes/methods: shape = (a, b), reshape(a, b), resize(a, b).
All are used directly on the array object.

Indexing in Array
Works the same way for 1-dimensional and 2-dimensional arrays.

Slicing in Array
arr[start-index : end-index : step]

Operations on Array
Sorting an Array
● np.sort( array-name ) : make a copy of original array and returns
the sorted copy without changing original array.
● np.argsort( array-name ) : sort the copy of original array and
returns the Index of sorted list without changing original array.
● array-name.sort( ) : sort the original array and returns nothing.

By default sorting is done in ascending


order, to reverse the order simply reverse
the answer with slicing as a[ : :-1]
By default sorting is done Row wise. To do it column wise do axis = 0
axis = 0 means “Column wise”
axis = 1 means “Row wise”
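A sketch of the three sorting idioms, including axis control and reversal:

import numpy as np

a = np.array([3, 1, 2])
print(np.sort(a))           # sorted copy -> [1 2 3]; a is unchanged
print(np.argsort(a))        # indices of the sorted order -> [1 2 0]
print(np.sort(a)[::-1])     # descending order via slicing

m = np.array([[3, 1], [2, 4]])
print(np.sort(m, axis=0))   # column-wise sort
print(np.sort(m, axis=1))   # row-wise sort (the default)

a.sort()                    # sorts a in place and returns nothing
print(a)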
Statistical Operations
● np.max( array-name )
● np.min( array-name )
● np.sum( array-name )
● np.mean( array-name )
● np.median( array-name )
● np.prod( array-name )
● np.var( array-name )
● np.std( array-name )
Joining Two Arrays
● np.vstack ((arr1,arr2))
● np.hstack ((arr1,arr2))
● np.column_stack ((arr1,arr2))
Intersection, Union, Set-Difference
● np.intersect1d(arr1,arr2)
● np.union1d(arr1,arr2)
● np.setdiff1d(arr1,arr2)
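A sketch of the stacking and set operations on two small arrays:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 4, 5])
print(np.vstack((a, b)))        # stack vertically -> shape (2, 3)
print(np.hstack((a, b)))        # stack horizontally -> length 6
print(np.column_stack((a, b)))  # pair elements as columns -> shape (3, 2)
print(np.intersect1d(a, b))     # [3]
print(np.union1d(a, b))         # [1 2 3 4 5]
print(np.setdiff1d(a, b))       # [1 2] (in a but not in b)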
Array Mathematics
Save & Load Array
● np.save(‘save-name’, arr-name)
● np.load(‘ saved-name ’)
https://saturncloud.io/blog/converting-a-2d-numpy-array-to-dataframe-rows-a-comprehensive-guide/
Practice
Questions
Solutions:
https://humarikaksha.blogspot.com/2021/04/chapter-6-introduction-to-numpy.html
Pandas Library
Pandas
● It is a Python library used for working with data sets.

● It has functions for analyzing, cleaning, exploring, and manipulating


data.

● The name "Pandas" has a reference to both "Panel Data", and "Python
Data Analysis" and was created by Wes McKinney in 2008.

Pandas data structures:
● Series (single-dimensional)
● DataFrame (multi-dimensional)
Pandas Series Object
A Series is a labelled list of values; the index values can be changed.
Series objects can also be created from a dictionary.

Extracting individual values: a single item, an item from the back, or multiple items.

Basic maths operations on a Series: adding a scalar value to the elements, adding two Series objects, multiplying the elements by a scalar value.
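A sketch of these Series operations (the values are made up for illustration):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])  # labelled list of values
print(s['a'])            # extracting an individual value by label
print(s.iloc[-1])        # extracting an item from the back by position
print(s[['a', 'c']])     # extracting multiple items

d = pd.Series({'x': 1, 'y': 2})   # Series object from a dictionary
print(d)

print(s + 5)             # adding a scalar value to each element
print(s * 2)             # multiplying each element by a scalar value
print(s + s)             # adding two Series objects (aligned by index)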
Pandas Dataframe

   Name  Marks
0  Bob   76
1  Sam   25
2  Anne  92

A DataFrame can be built from a dictionary: keys (Name, Marks) and their values in lists.
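A sketch building that DataFrame from a dictionary:

import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Sam', 'Anne'],
                   'Marks': [76, 25, 92]})
print(df)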
Dataframe In-Built Functions
● head() → first five rows of the dataframe.
● tail() → last five rows of the dataframe.
● shape → number of rows and columns in the dataframe (an attribute, not a method).
● describe() → summary information about the dataframe.
Extract values from a Dataset
● iloc: iris.iloc[0:3, 0:4] → extracts the first three rows and first four columns (position-based; the end index is excluded).
● loc: label-based selection; with loc, a slice includes both endpoints, so all labels between 0 and 3 (including 3) are returned, i.e., 4 values in total.

Dropping Columns and Rows
axis = 1 → columns; axis = 0 → rows.
More Pandas Functions
● apply(): we can run any operation on dataset values by passing a function to apply().
● value_counts() → counts the total number of occurrences of each value in the specified column.
● sort_values(by='col') → returns the rows sorted by the column in ascending order.
Matplotlib Library
Part of UNIT 4
Matplotlib
● Matplotlib is a Python library used for Data Visualization.
● You can create bar-plot, scatter-plots, histograms and a lot
more with matplotlib
● Line Plot (pyplot submodule of matplotlib): shows a linear relationship; supports adding a title and labels, changing line aesthetics via attributes, and plotting two lines on the same plot.
● subplot(row, col, index): adds subplots.
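A sketch of a line plot with a title, labels, two lines, and subplots:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y1 = [1, 2, 3, 4]    # linear relationship
y2 = [1, 4, 9, 16]

plt.subplot(1, 2, 1)                              # subplot(row, col, index)
plt.plot(x, y1, color='red', linestyle='--', label='linear')
plt.plot(x, y2, color='blue', label='quadratic')  # two lines on the same plot
plt.title('Line Plot')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(x, y2)
plt.show()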
● Bar Plot: understand the distribution of a categorical value.
● Horizontal Bar Plot: plt.barh.
● Scatter Plot: creating a basic scatter plot, changing marker aesthetics, adding two subplots.
● Histogram: used for numerical values (working on a dataset).
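A sketch of the bar, horizontal-bar, scatter, and histogram plots (the data is made up for illustration):

import numpy as np
import matplotlib.pyplot as plt

cats, counts = ['A', 'B', 'C'], [5, 3, 7]
plt.bar(cats, counts)      # distribution of a categorical value
plt.show()
plt.barh(cats, counts)     # horizontal bar plot
plt.show()

x = np.random.rand(50)
y = np.random.rand(50)
plt.scatter(x, y, color='green', marker='x')   # marker aesthetics
plt.show()

values = np.random.randn(1000)
plt.hist(values, bins=10)  # histogram for numerical values (10 bins by default)
plt.show()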
● Box Plot
The box marks the minimum value, 25% (Q1), 50% (median), 75% (Q3), and maximum value of the data.
Applications: it is used to know
● The outliers and their values.
● The symmetry of the data.
● How tightly the data is grouped.
● Data skewness: if skewed, in which direction and by how much.
Box Plot Example

Example:
Find the maximum, minimum, median, first quartile,
third quartile for the given data set: 23, 42, 12, 10, 15, 14, 9.

Solution:
Given: 23, 42, 12, 10, 15, 14, 9.
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 42
Hence,
Minimum = 9
Maximum = 42
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
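A quick check of this worked example with NumPy (note that np.percentile interpolates by default, so its quartiles can differ slightly from the hand method above):

import numpy as np
import matplotlib.pyplot as plt

data = [23, 42, 12, 10, 15, 14, 9]
print(np.min(data), np.max(data), np.median(data))  # 9, 42, 14.0
print(np.percentile(data, [25, 75]))                # quartiles (convention-dependent)
plt.boxplot(data)
plt.show()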

In descriptive statistics, a box plot or boxplot (also known as a box-and-whisker plot) is often used in exploratory data analysis. Box plots visually show the distribution of numerical data and skewness by displaying the data quartiles (or percentiles) and averages.
● Violin Plot: creating data, making the plot.
● Pie Chart: creating data, making the plot, changing aesthetics.
● Doughnut Chart: an outer circle (pie) with an inner circle; creating data.
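A sketch of a pie chart and a doughnut chart (the doughnut here is drawn as a pie with a white inner circle, one common approach):

import matplotlib.pyplot as plt

labels = ['A', 'B', 'C']
sizes = [45, 30, 25]
plt.pie(sizes, labels=labels, autopct='%1.1f%%')  # pie with percentage labels
plt.show()

plt.pie(sizes, labels=labels)                     # outer circle
centre = plt.Circle((0, 0), 0.6, color='white')   # inner circle
plt.gca().add_artist(centre)
plt.show()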
Residual Plot
A residual plot is a scatter plot that shows the difference between the predicted and actual
values of a variable in a data set. It's used to analyze the relationship between the data and
the regression line, and to determine if a linear model is appropriate for the data:

https://www.geeksforgeeks.org/how-to-create-a-residual-plot-in-python/
Distribution Plot

A distribution plot is a data visualization tool that shows the distribution of data points along an
axis. It's used to compare the range and distribution of numerical data, and to visually assess
the distribution of sample data.

4 types of distribution plots, namely:
1. jointplot
2. distplot
3. pairplot
4. rugplot
https://medium.com/@frankonyango.w/visualizing-distribution-plots-in-python-using-seaborn-7d23a9585a99
Pivot Table

https://www.geeksforgeeks.org/pivot-tables-in-pandas/
Heat Map and Correlation matrix

https://www.shiksha.com/online-courses/articles/heatmap-in-seaborn/

https://www.geeksforgeeks.org/what-is-heatmap-data-visualization-and-how-to-use-it/

https://www.questionpro.com/blog/correlation-matrix/

https://www.geeksforgeeks.org/create-a-correlation-matrix-using-python/
Unit - III
Data Acquisition and Data Wrangling

● Data acquisition involves learning to extract information from diverse


sources, such as databases and APIs. (UNIT-2)
● Data wrangling focuses on the crucial task of cleaning and transforming
raw data to make it suitable for analysis. This process includes handling
missing values, dealing with outliers, and reshaping the data (UNIT-3)
CSV Data
● Comma-Separated Values is a plain text format where data is organized in rows
and columns, with each row representing a record and columns representing
attributes.
● Values are typically separated by commas, but other delimiters like tabs or
semicolons may also be used.
● CSV is a widely used format for storing and exchanging structured data due to
its simplicity and compatibility with various applications.
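A minimal sketch of reading and writing CSV with pandas ('data.csv' and the semicolon delimiter are assumptions for illustration):

import pandas as pd

df = pd.read_csv('data.csv')             # comma-separated by default
df2 = pd.read_csv('data.csv', sep=';')   # other delimiters via sep=
df.to_csv('copy.csv', index=False)       # write back without the index column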
JSON Data
● It exist in key-value pairs and supports nested structures.
● It is commonly used for representing structured data and is easy for both
humans to read and machines to parse.
● JSON is prevalent in web development and API responses, making it crucial
for data scientists working with diverse data sources.
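A minimal sketch of parsing JSON with the standard library (the payload is a made-up API-style response with nested key-value pairs):

import json

payload = '{"device": "sensor-1", "readings": {"temp": 22.5, "humidity": 40}}'
record = json.loads(payload)          # parse JSON text into a dict
print(record['readings']['temp'])     # access a nested value -> 22.5
print(json.dumps(record, indent=2))   # serialize back to readable JSON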
Missing Value Analysis

https://www.analyticsvidhya.com/blog/2021/10/handling-missing-value/

https://www.geeksforgeeks.org/ml-handling-missing-values/
Outlier Analysis
● Outlier refers to an observation that stands out from the majority of the data, exhibiting
unusually high or low values.

● An outlier in data analysis might be a rotten apple in a dataset of quality apples. While the vast
majority of apples in the dataset may have a high-quality rating, the presence of a single rotten
apple can significantly impact the overall average quality rating for the entire dataset.

You can also refer to:
https://www.almabetter.com/bytes/articles/outlier-detection-methods-and-techniques-in-machine-learning-with-examples
Boxplot
● It displays the five-number summary of a set of data and presence of outliers in the data. The
five-number summary is the minimum, first quartile, median, third quartile, and maximum.

● In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes
through the box at the median. The whiskers go from each quartile to the minimum or maximum.

● The five-number summary divides the data into sections that each contain approximately 25% of
the data in that set.

● Outliers: Individual data points that are outside the whiskers are often marked as individual
points to draw attention to them.
● Boxplot: It provides a visual way to understand the central tendency, spread, and presence of
outliers in the data. A box plot consists of several key components:
Box: The central box in the plot represents the interquartile range (IQR), which is the range between
the 25th and 75th percentiles of the data. The box spans from the first quartile (Q1, or the 25th
percentile) to the third quartile (Q3, or the 75th percentile). This box contains 50% of the data.
Median: a line inside the box represents the median, which is the middle value of the dataset when it is ordered.
Whiskers: lines extending from the box, referred to as whiskers, show the range of the data. They typically extend to the minimum and maximum values within a certain range; any data points outside this range are considered outliers.
Outliers: individual data points that fall outside the whiskers are often marked as individual points to draw attention to them.
Percentile and quartiles [used to find outliers]

Percentage: 1, 2, 3, 4, 5
Q1: Find the percentage of the numbers that are odd.
% = (number of odd numbers) / (total numbers) = 3/5 = 0.6 = 60%

Percentile: [GATE/CAT/GMAT etc.]
A percentile is a value below which a certain percentage of observations lie.
E.g., if a value x is at the 25th percentile, then 25% of the entire distribution is less than x.

Dataset: 2,2,3,4,5,5,5,6,7,8,8,8,8,8,9,9,10,11,11,12
1. What is the percentile ranking of 10 and 11?
Percentile rank of 10 = (number of values below 10 / total observations) × 100 = (16/20) × 100 = 80th percentile.
That means 80% of the entire distribution is less than 10.
Percentile rank of 11 = (number of values below 11 / total observations) × 100 = (17/20) × 100 = 85th percentile.
That means 85% of the entire distribution is less than 11.

2. What value lies at the 25th percentile?
Index = (percentile/100) × (n+1) = (25/100) × 21 = 5.25 → between the 5th and 6th values = (5+5)/2 = 5.

3. What value lies at the 75th percentile?
Index = (percentile/100) × (n+1) = (75/100) × 21 = 15.75 → between the 15th and 16th values = (9+9)/2 = 9.


Five-number summary [used to remove outliers]: minimum, first quartile, median, third quartile, maximum.

Removing outliers [inter-quartile range]:

Example: {1,2,2,2,3,3,4,5,5,5,6,6,6,6,7,8,8,8,9,27} (n = 20)

Remove outliers outside the lower fence (below) ↔ higher fence (above):
Lower fence: Q1 − 1.5(IQR); higher fence: Q3 + 1.5(IQR); IQR = Q3 − Q1

Q1 = 25th percentile: index = (25/100) × (20+1) = 5.25 → the 5th and 6th values are both 3 → Q1 = 3
Q3 = 75th percentile: index = (75/100) × (20+1) = 15.75 → taking the 15th value (conventions vary) → Q3 = 7

IQR = 7 − 3 = 4
LF = 3 − 1.5(4) = −3 (low range); HF = 7 + 1.5(4) = 13 (high range)

−3 ↔ 13, so the outlier is 27 (it should be removed or replaced by the LF/HF).

Remaining data: {1,2,2,2,3,3,4,5,5,5,6,6,6,6,7,8,8,8,9}
Min = 1, Max = 9, Q1 = 3, Median = 5, Q3 = 7


Distribution:

Age: {24, 26, 27, 28, 30, 32, …} dataset
How should this dataset be visualized? What kind of graph is required to visualize the data?

Gaussian (normal) distribution:
One standard deviation towards the right, two standard deviations towards the right, and so on.

Dataset {100 data points}: empirical rule
Within 1 SD: 68% of the entire data
Within 2 SD: 95%
Within 3 SD: 99.7% of the entire distribution

E.g., height → normally distributed (per domain experts, e.g., doctors)
Weight → follows a normal distribution
The Iris dataset → follows a normal distribution
Z-Score

A Z-score tells how many SDs (either to the right or to the left of the mean) a data point falls: z = (x − mean)/SD.

Eg: given mean = 4 and SD = 1:

x = {1,2,3,4,5,6,7} (normal/Gaussian distribution) — Z-score — y = {−3,−2,−1,0,1,2,3} (standard normal distribution) [mean = 0, SD = 1]

For example, x = 1 maps to z = (1 − 4)/1 = −3.

Y belongs to the SND (standard normal distribution).

Practical Application: (units are completely different)

Age (years) | Salary (Rs) | Weight (kg)
24 | 10000 | 60
26 | 20000 | 65
27 | 30000 | 68

[Convert all features into the same scale]

Option 1: Standardization [target mean = 0, SD = 1]
Age — Z-score → SND — this is called standardization.

Option 2: Normalization [e.g. 0 to 1]
Min-Max Scaler: (0 to 1) or (−1 to 1). Normalization is the process where you define a lower bound and an upper bound and convert the data into that range.

Application: Image classification — pixel values range from 0 to 255 (use min-max scaling to convert them to 0 to 1).
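
A minimal sketch of both options on the toy table above, assuming scikit-learn is available:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# The toy table above: three features on completely different scales
df = pd.DataFrame({'Age': [24, 26, 27],
                   'Salary': [10000, 20000, 30000],
                   'Weight': [60, 65, 68]})

# Option 1: standardization -- every column ends up with mean 0, SD 1
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Option 2: normalization -- every column rescaled to [0, 1]
# (the same idea used to map pixel values 0-255 into 0-1)
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(standardized.round(2))
print(normalized.round(2))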
Outlier Detection Techniques
● Using Standard Deviation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('test.csv')
age = data['Age']

# Visualize the spread first
sns.boxplot(x=age)
plt.show()

mean = age.mean()
standard_deviation = age.std()
threshold = 3

# A value more than 3 SDs away from the mean is flagged as an outlier
outliers = []
for value in age:
    if abs(value - mean) > threshold * standard_deviation:
        outliers.append(value)
print(outliers)
Outlier Detection Techniques
● Using Z-Score Method

It uses a dataset's standard deviation and its mean to identify data points that are significantly
different from the majority of the other data points.

z = (x − μ)/σ. The Z-score is equal to zero when x = μ, and it is ±1, ±2, or ±3 when x lies 1, 2, or 3 standard deviations above or below the mean, respectively.

● A data point with a Z-score (the number of standard deviations the data point is away from
the mean) of more than 3 or less than −3 is typically considered to be an outlier. This method
assumes that the data follows a normal distribution. It is a simple and widely used method for
outlier detection, but it may not always be appropriate for data that is not normally distributed.
Outlier detection using Z-score method:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('test.csv')
age = data['Age']

mean = age.mean()
standard_deviation = age.std()
threshold = 3

outliers = []
for i in age:
    z = (i - mean) / standard_deviation
    # Flag points more than 3 SDs from the mean on either side
    if abs(z) > threshold:
        outliers.append(i)
print(outliers)

Data: https://drive.google.com/file/d/1REeHwEGXHhjO6R1kaTzni2XXLodmqEqG/view?usp=drive_link
Outlier Detection Techniques Using IQR
● The Interquartile Range (IQR) outlier detection method involves calculating the first and third
quartiles (Q1 and Q3) of a dataset and then identifying any data points that fall beyond the
range of Q1 - 1.5 * IQR to Q3 + 1.5 * IQR, where IQR is the difference between Q3 and Q1.
● Data points that fall outside of this range are considered outliers.
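The slides give code for the SD and Z-score methods; for completeness, here is a hedged sketch of the IQR method in the same style, assuming the same test.csv with an 'Age' column used in the earlier examples:

import pandas as pd

data = pd.read_csv('test.csv')
age = data['Age']

# Quartiles and interquartile range
q1 = age.quantile(0.25)
q3 = age.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Everything outside the fences is an outlier
outliers = age[(age < lower_fence) | (age > upper_fence)]
print(outliers.tolist())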
Feature Engineering
● Feature engineering is a crucial and creative process in machine learning
and data analysis. It involves selecting, transforming, and creating new
features (variables) from your raw data to improve the performance of
your machine learning models.

https://www.google.com/amp/s/www.geeksforgeeks.org/what-is-feature-engineering/amp/
Process Involved in Feature Engineering
● Feature Creation : process of generating new features based on domain knowledge or by
observing patterns in the data.
Types of Feature Creation:
Domain-Specific: Creating new features based on domain knowledge, such as creating features
based on business rules or industry standards.
Data-Driven: Creating new features by observing patterns in the data, such as calculating
aggregations or creating interaction features.
Synthetic: Generating new features by combining existing features or synthesizing new data
points.

● Feature Transformation : process of transforming the features into a more suitable representation
for the machine learning model. Types of feature transformations
Normalization: Scaling features to a similar range (e.g., between 0 and 1) to prevent some features
from dominating others.
Standardization: Scaling features to have a mean of 0 and a standard deviation of 1.
Logarithmic or Exponential Transformation: Useful for data with a skewed distribution.
Encoding: Converting categorical variables into numerical form using techniques like one-hot
encoding or label encoding.
● Feature Extraction: is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. Types of Feature Extraction-
Dimensionality Reduction: Reducing the number of features by transforming the data into a
lower-dimensional space while retaining important information. Examples are PCA and t-SNE.
Feature Combination: Combining two or more existing features to create a new one. For
example, the interaction between two features.
Feature Aggregation: Aggregating features to create a new one. For example, calculating the
mean, sum, or count of a set of features.
Feature Transformation: Transforming existing features into a new representation. For example,
log transformation of a feature with a skewed distribution.

● Feature Selection: the process of selecting a subset of relevant features from the dataset to be
used in a machine-learning model. Types: Filter Method, Wrapper Method, Embedded Method.

● Feature Scaling: the process of transforming the features so that they have a similar scale.
Types: Min-Max Scaling, Standard Scaling, Robust Scaling. (A small sketch of feature creation and transformation follows this list.)
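A minimal sketch of feature creation and transformation; the sensor table and all column names are illustrative assumptions, not a real dataset:

import numpy as np
import pandas as pd

# Hypothetical IoT sensor readings -- column names are made up for illustration
df = pd.DataFrame({'temperature': [21.0, 35.0, 28.0, 40.0],
                   'humidity':    [30.0, 70.0, 55.0, 90.0],
                   'power':       [10.0, 100.0, 40.0, 1000.0]})

# Feature creation (data-driven): an interaction feature
df['temp_x_humidity'] = df['temperature'] * df['humidity']

# Feature transformation: log transform for the skewed 'power' column
df['log_power'] = np.log1p(df['power'])

# Feature transformation: normalization of temperature to [0, 1]
t = df['temperature']
df['temperature_norm'] = (t - t.min()) / (t.max() - t.min())

print(df)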
[Overview diagram: the data preparation workflow]
● Data Wrangling
● Exploratory Data Analysis / Data Cleaning: row deletion & duplicate analysis, missing value analysis, outlier detection
● Feature Engineering: 1. Feature Construction, 2. Feature Improvement, 3. Feature Selection, 4. Feature Extraction
● Data Transformation: encoding, scaling, normalization, standardization
Data Wrangling: Data wrangling, also known as data munging, is the process of
gathering, selecting, and organizing data from various sources into a usable format.
It involves activities like data extraction, merging datasets, dealing with missing
values, and handling outliers. Data wrangling is about getting the data into a
consistent structure for further processing.

Data Acquisition: Data acquisition is the process of converting real-world signals into digital
numeric values that can be manipulated by a computer. This process is typically automated or
semi-automated.

Data Visualization: the process of representing data in a visual context. It's used in data science
to make it easier to analyze and interpret data. Data visualization can be done using charts, plots,
animations, and infographics. The goal of data visualization is to make it easier to identify trends,
outliers, and patterns.
Data Cleaning: Data cleaning is a specific part of data wrangling that focuses on
identifying and correcting errors or inconsistencies in the dataset. This may include
addressing missing data, handling duplicates, and dealing with outliers. The goal is to
ensure the data is accurate and reliable.

Data Transformation: Data transformation involves changing the format, structure, or values of
the data to make it suitable for analysis or modeling. Common transformations include
normalization, standardization, encoding categorical variables, and creating new features through
mathematical operations or aggregations. Data transformation helps in making data more
compatible with the algorithms you plan to use.
[Overview diagram: technique cheat-sheet]
● Scaling: Normalization, Standardization, Min-max Scaling, Mean Scaling, Robust Scaling
● Transformation (Encoding): Binary Encoding, One-hot Encoding, Label Encoding
● Detecting Outliers: Box-plots, IQR, Z-score
● What to do with outliers: remove/delete them; replace them (with the mean etc.); mean-median imputation; quartile-based flooring; capping
● Feature Selection: Filter Method, Wrapper Technique, Embedded Technique
● Feature Extraction: PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis)
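
A minimal sketch of the three outlier-handling options (remove, replace with the mean, cap at the fences) on a made-up series:

import pandas as pd

# Made-up data with one obvious outlier (27)
s = pd.Series([1, 2, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 8, 9, 27])

q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove/delete the outliers
removed = s[(s >= lower) & (s <= upper)]

# Option 2: replace outliers with the mean of the non-outlier values
mean_imputed = s.where((s >= lower) & (s <= upper), removed.mean())

# Option 3: capping -- floor at the lower fence, cap at the upper fence
capped = s.clip(lower=lower, upper=upper)

print(removed.tolist(), mean_imputed.tolist(), capped.tolist(), sep='\n')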
Standardizing (note: StandardScaler performs standardization, i.e. it transforms a column to mean = 0 and SD = 1)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

data = pd.read_csv('test.csv')

# Replace the Age column with its standardized (Z-scored) version
data['Age'] = ss.fit_transform(data[['Age']])

print(data['Age'].mean())  # approximately 0
print(data['Age'].std())   # approximately 1

sns.histplot(data['Age'], bins=50)
plt.show()
Unit - IV
Read slide no 150 to 173 only
Unit - V
Important Web links

https://www.geeksforgeeks.org/types-of-machine-learning/
https://www.javatpoint.com/types-of-machine-learning
Data: https://drive.google.com/drive/folders/1g2RWTLLJ60CUlT_aymdILqvf0S2igAst

You can also refer to:

https://drive.google.com/file/d/162GLBndnJyY8iDj025ApgeFb6-yfGtK6/view

Case Study only

● .info( ) : The info() method in Pandas is used to provide a concise summary or information
about a DataFrame. It provides - No. of rows, columns, column-name, data-type, not-null
count, memory-usage.
● .concat( ) : method in Pandas used to combine two or
more data frames
● .describe( ) : used to generate various summary statistics for numerical (or quantifiable)
columns in the DataFrame. It provides information such as the count, mean, standard
deviation, minimum, and maximum values for each numerical column.
● .isnull( ) or .isna( ): used to check for missing or NaN (Not a Number) values in a
DataFrame or Series. It returns a DataFrame or Series of the same shape as the original,
where each element is a boolean value indicating whether the corresponding element in the
original DataFrame or Series is missing (True) or not missing (False).
● .sum( ): used to calculate the sum of values. Chained as .isnull().sum(), it counts the missing values in each column.
● .fillna( ): used to replace missing or NaN (Not-a-Number) values in a DataFrame or Series
with specified values.
● .unique( ): used to obtain an array of the unique (non-repeated) values present in a specified
column.
● .nunique( ): counts the number of unique values.
● .value_counts( ): counts the occurrences (frequency) of each unique value in the original
Series. The resulting Series is typically sorted in descending order based on the value counts, so
the most frequent value appears first.
● .head( ): used to retrieve the first n rows of a DataFrame or Series. By default, it returns the
first 5 rows if no argument is provided.
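A hedged walkthrough of these methods, assuming the Titanic-style test.csv used earlier in this unit (the 'Sex' column is an assumption about that file):

import pandas as pd

data = pd.read_csv('test.csv')

data.info()                        # rows, columns, dtypes, non-null counts
print(data.describe())             # count, mean, std, min, quartiles, max
print(data.isnull().sum())         # missing values per column
data['Age'] = data['Age'].fillna(data['Age'].mean())   # impute missing ages
print(data['Sex'].unique())        # the distinct values in a column (assumed column)
print(data['Sex'].nunique())       # how many distinct values there are
print(data['Sex'].value_counts())  # frequency of each value, most frequent first
print(data.head())                 # first 5 rows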
● LabelEncoder: LabelEncoder is a class from the scikit-learn (sklearn) library in Python. It is
commonly used for encoding categorical (non-numeric) data into a numeric format, which is
often required for machine learning algorithms since they typically work with numeric data.

● OneHotEncoder: a data preprocessing tool in scikit-learn (sklearn) used to convert
categorical data into a one-hot encoded format. It's particularly useful when dealing with
categorical variables that have no inherent order or when you want to represent each category
as a binary column.
● .get_dummies( ): the pandas equivalent — it converts a categorical column into a set of 0/1
indicator ("dummy") columns, one per category.
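
A minimal encoding sketch (the 'City' column is a made-up example; the sparse_output argument needs scikit-learn >= 1.2 — older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A made-up categorical column
df = pd.DataFrame({'City': ['Gwalior', 'Bhopal', 'Indore', 'Gwalior']})

# LabelEncoder: each category becomes one integer
df['City_label'] = LabelEncoder().fit_transform(df['City'])

# OneHotEncoder: each category becomes its own binary column
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[['City']])

# pd.get_dummies(): the pandas shortcut for one-hot encoding
dummies = pd.get_dummies(df['City'], prefix='City')

print(df)
print(onehot)
print(dummies)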
● Histogram : A histogram is a graphical representation of the distribution of data. It's a way to
visualize the frequency or the number of occurrences of values within a dataset, typically in a
numeric format. In a histogram, data is divided into intervals or "bins," and the height of each
bar in the histogram represents the number of data points that fall within that bin.

Here's how it works, using students' test scores as an example:

You make groups for the scores. For example, you might have one group for scores between 0
and 10, another for scores between 11 and 20, and so on. These groups are called "bins."

You count how many students got a score in each group. This is like counting how many
students got a score between 0 and 10, how many got a score between 11 and 20, and so on.

Then, you draw a bar for each group (bin) on a chart. The height of the bar represents the
number of students who got scores in that group. So, if more students got scores between 11
and 20, that bar will be taller.
● Using the seaborn library (a minimal sketch below):
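A minimal seaborn sketch, assuming the same test.csv with an 'Age' column:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('test.csv')

# Ages are grouped into 20 bins; bar height = number of rows in each bin
sns.histplot(data['Age'], bins=20)
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()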
Correlation Matrix
● It is a table that shows the correlation coefficients between many variables. Each cell in the
table represents the correlation between two variables. The values range from -1 to 1,
indicating the strength and direction of the relationship between variables.

● A correlation coefficient of 1 implies a perfect positive correlation (as one variable
increases, the other also increases).

● A correlation coefficient of -1 implies a perfect negative correlation (as one variable
increases, the other decreases).

● A correlation coefficient of 0 implies no linear correlation.
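
A minimal sketch that computes and plots a correlation matrix, assuming the same test.csv:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('test.csv')

# Correlation matrix of the numeric columns only
corr = data.select_dtypes(include='number').corr()

# An annotated heatmap makes strong/weak correlations easy to spot
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()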


Exploratory Data Analysis
Feature Engineering

Case Study

https://drive.google.com/file/d/162GLBndnJyY8iDj025ApgeFb6-yfGtK6/view?usp=drive_link

EDA and Linear Regression: https://github.com/HARSHharsh123/ML-project/tree/main/notebooks

https://www.kaggle.com/datasets/sudhanshu432/algerian-forest-fires-cleaned-dataset

https://www.kaggle.com/code/mehulnayak10/algerian-forest-fire-prediction-logistic-reg

https://muhammaddawoodaslam.medium.com/exploratory-data-analysis-eda-on-titanic-dataset-804034f394e6
