
NOIDA INSTITUTE OF ENGINEERING AND TECHNOLOGY, GREATER NOIDA

Subject Name: Foundation of Machine Learning (ACSE 0515)
Unit: 1
Course Details: CSE 5th Sem, CSE Department

Faculty:
Dr. Hitesh Singh, Associate Professor & Deputy Head
Prof. Vivek Kumar, Professor & Deputy Head

Dr. Hitesh Singh & Dr. Vivek Kumar | Machine Learning Unit 1 | 9/3/2024
Profile

Dr. Hitesh Singh (B.Tech, M.Tech, Ph.D, Post Doc)
Associate Professor, Department of Information Technology,
NIET, Greater Noida 201306.

• Experience: 13+ years overall – Teaching & Research
• Areas of Interest: Machine Learning, Data Analytics, Wireless Communication
• Honors, Awards & Achievements:
❑ PhD from Technical University of Sofia, Bulgaria
❑ Post Doc from Aarhus University, Denmark
❑ 5 patents published and 1 patent granted
❑ Reviewer for reputed journals and conferences
❑ Guided more than 30 projects at UG level and more than 15 at PG level
❑ Working on different projects with Technical University of Sofia, Bulgaria, and Aarhus University, Denmark
❑ More than 35 papers published in international journals and conferences
Profile

Prof. Vivek Kumar (Ph.D, Post Doc)
Professor, Department of Information Technology,
NIET, Greater Noida 201306.

• Experience: 22+ years overall – Teaching & Research
• Areas of Interest: Machine Learning, Data Analytics, Wireless Communication
• Honors, Awards & Achievements:
❑ 5 patents published and 1 patent granted
❑ Reviewer for reputed journals and conferences
❑ Guided more than 30 projects at UG level and more than 15 at PG level
❑ Working on different projects with Technical University of Sofia, Bulgaria, and Aarhus University, Denmark
❑ More than 35 papers published in international journals and conferences
Course Scheme
(course scheme table not reproduced)

Departmental Elective - II
(elective table not reproduced)
CONTENT
• Course Objective
• Unit Objective
• Course Outcomes
• CO-PO Mapping
• CO PSO Mapping
• Data Mining:
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Course Objectives
➢This course will serve as a comprehensive introduction to various topics in machine learning.
➢To introduce students to the basic concepts and techniques of Machine Learning.
➢To become familiar with regression methods, classification methods, and clustering methods.
➢To become familiar with Artificial Neural Networks and Deep Learning.
➢To introduce the concepts of Reinforcement Learning and Genetic Algorithms.
➢To focus on the implementation of machine learning for solving practical problems.
Objectives of Unit

The unit's main objectives are:

➢Conceptualization and summarization of machine learning: to introduce students to the basic concepts and techniques of Machine Learning.

➢Machine learning techniques: to become familiar with regression methods, classification methods, and clustering methods.

➢Scaling up machine learning approaches.
Course Outcomes
At the end of the course, the student should be able to:

CO1: Understand the need for machine learning for various problem solving.
CO2: Understand a wide variety of learning algorithms and how to evaluate models generated from data.
CO3: Understand the latest trends in machine learning.
CO4: Design appropriate machine learning algorithms and apply them to real-world problems.
CO5: Optimize the models learned and report on the expected accuracy that can be achieved by applying the models.
CO-PO and PSO Mapping

CO MAPPING WITH PO

CO No. PO1 PO2 PO3


CO1 3 3 1
CO2 3 3 2
CO3 2 3 3
CO4 2 2 1
CO5 3 2 1

CO-PO and PSO Mapping

CO MAPPING WITH PSO

CO. NO. PSO1 PSO2 PSO3 PSO4

1 1 2

2 1 2 1 1

3 2 1 1 2

4 1 1 1 2

5 1 1 1

6 1 1 1

7 2 1 1 1

Syllabus

Unit-I : Introduction-
What is Machine Learning?, Fundamental of Machine Learning, Key Concepts
and an example of ML, Basics of Python for machine learning, Machine Learning
Libraries, Data Pre-processing, Handling Missing Values, Handling Outliers, One-Hot Encoding & Feature Scaling

Unit-II : Supervised Learning


Linear regression (Hands on lab), Multiple Regression, Problem visualization,
Polynomial regression, Distance Metrics (Euclidean, Manhattan), Regression
and Classification, Clustering, Gradient Descent, Logistic Regression,
Regularization: Overfitting and under fitting, Cost Function for Logistic
Regression, house price prediction (Hands on)

Syllabus
Unit-III : Unsupervised Learning and Classification
Logistic regression (Classification), Defining cost, Gradient descent (Hands-on
lab), Other Techniques - Naïve Bayes, SVM, KNN, Unsupervised Learning:
Nearest Neighbor, Cosine Similarity, Decision Trees - Intuition, Multiclass
classification, Overfitting & Regularization - Ridge regression, Lasso regression
for feature selection, Bagging - Random Forest for regression, Random Forest
for classification, Knowledge, Logic and Reasoning, Planning, Reasoning Under
Uncertainty, Visualizing decision boundaries, early stopping to prevent
overfitting, Fraud detection problem (Hands-on), probabilities in classification.

Unit-IV : Semi-supervised Learning and PCA –


Reinforcement Learning –Introduction to Reinforcement Learning, Learning Task,
Example of Reinforcement Learning in Practice, Machine Learning Tools -
Engineering applications, Dimensionality Reduction - principal component
analysis (Hands on)

Syllabus

Unit-V : Boosting and Recommendation System


Boosting – XGBoost, Boosting – LightGBM, Collaborative Recommender
System, Content-based Recommender System, Knowledge-based
Recommender System, Creating a Recommendation System (such as a Movie
Recommendation System) using Python.

UNIT-WISE OBJECTIVES

At the end of the unit, the student will be able to:

➢Understand the functionality of the various data mining components.
➢Appreciate the strengths and limitations of various data mining models.
➢Explain techniques for analyzing various kinds of data.
➢Describe the different data processing forms used in data mining.
PREREQUISITE AND RECAP

➢ A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

➢ A data warehouse has a three-level architecture that includes a Bottom Tier, a Middle Tier, and a Top Tier.

➢ A data mart is a subset of data stored within the overall data warehouse, for the needs of a specific team, section or department.

➢ A data cube is a multi-dimensional structure.

➢ A data warehouse is maintained in the form of star, snowflake and fact constellation schemas.
Prerequisite and Recap

Machine Learning is a mathematical discipline, and students will benefit from a good background in:
• probability
• linear algebra
• calculus
Programming experience is essential.
Top Artificial Intelligence Stats You
Should Know About in 2024

• AI-enabled devices are everywhere. Nearly 77 percent of devices today use AI technology in one form or another.
• The growth of AI startups has accelerated 14-fold since 2000, and we'd bet more of them are coming up every year.
• Business leaders trust AI's power in driving growth: 84% of C-level executives believe that they need to adopt and leverage Artificial Intelligence to drive growth objectives.

Top Artificial Intelligence Stats You
Should Know About in 2024
• The global AI market is booming. It will reach 190.61 billion
dollars by 2025, at a compound annual growth rate of 36.62
percent.
• By 2030, Artificial Intelligence will add 15.7 trillion dollars to the
world's GDP, boosting it by 14 percent.
• There will be more AI assistants than people in this world.
Forecasts indicate that there will be 8.4 billion AI-powered digital
voice assistant units in the world by 2024, which surpasses the
total global population.

Top 10 People in Artificial Intelligence

1. Elon Musk

Elon Musk (born June 28, 1971, Pretoria, South Africa) is a South African-born American entrepreneur who co-founded the electronic-payment firm PayPal and formed SpaceX, a maker of launch vehicles and spacecraft. He was also one of the first significant investors in, as well as chief executive officer of, the electric car manufacturer Tesla. In addition, Musk acquired Twitter in 2022.

Elon Musk is undoubtedly one of the most famous personalities in the field of AI. He co-founded OpenAI in 2015 with the vision of developing friendly AI that benefits all of humanity. OpenAI conducts groundbreaking research in AI and has developed open-source tools such as OpenAI Universe.

2. Jurgen Schmidhuber

Popularly known as the "Father of Self-Aware Robots", Jurgen Schmidhuber is a German computer scientist. His primary research areas are artificial intelligence, deep learning, and artificial neural networks. Apart from Nnaisense, he is a co-director of the Dalle Molle Institute for Artificial Intelligence in Switzerland.

3. Andrew Ng

Andrew Ng is undoubtedly one of the most influential personalities in the field of AI. He is an AI researcher, instructor, and Adjunct Professor at Stanford University. He was one of the co-founders of Google Brain, a deep learning AI research team at Google dedicated to research in AI, machine learning, information systems, and large-scale computing.

4. Anita Schjoll Brede

Anita Schjoll Brede is a Norwegian entrepreneur. She co-founded Iris.ai to speed up research with AI. At Iris.ai, they have developed a world-leading AI engine for scientific text understanding that finds relevant articles using sophisticated AI algorithms.

5. Cassie Kozyrkov

Cassie Kozyrkov is the Head of Decision Intelligence at Google. Her aim is to democratize decision intelligence and build safe, reliable AI for everyone. She also has a blog, podcast, and newsletter named Decision Intelligence, which discusses data-driven decision making. Cassie has also published some very popular courses, such as a complete minicourse on statistics, a minicourse on analytics, and Making Friends with Machine Learning.
6. Demis Hassabis

Demis Hassabis is another famous personality in AI, and the founder and CEO of DeepMind. DeepMind is an AI research firm that focuses mostly on deep learning, AI robotics, neuroscience, unsupervised learning and generative models, and reinforcement learning. The company is known for developing AlphaGo, the first AI system to defeat a professional human Go player.
7. Fei-Fei Li

Fei-Fei Li is an AI researcher and professor at Stanford University. She is a co-director of the Stanford Institute for Human-Centered Artificial Intelligence and of the Stanford Vision and Learning Lab. She was also the director of the Stanford Artificial Intelligence Laboratory (SAIL) from 2013 to 2018.
8. Geoffrey Hinton

Geoffrey Hinton is a Canada-based cognitive psychologist and computer scientist, best known for his work on artificial neural networks. He worked simultaneously for Google Brain and the University of Toronto between 2013 and 2017. He then co-founded the Vector Institute and became its Chief Scientific Advisor in 2017.
9. Ian Goodfellow

Ian Goodfellow is currently the director of a special ML group at Apple Inc. Born in 1985, he has also worked at Google Brain as a research scientist and made several contributions to the field of deep learning, including co-authoring the widely used textbook Deep Learning.

10. Yann LeCun

Yann LeCun is a French computer scientist and Vice President, Chief AI Scientist at Facebook. He also serves as a Professor at the Courant Institute of Mathematical Sciences at New York University. His primary research areas are machine learning, computer vision, mobile robotics, and computational neuroscience.
ML/Artificial Intelligence Job
Requirements
The need for skilled AI professionals spans nearly every industry, including:
• Financial services
• Healthcare
• Technology
• Media
• Marketing
• Government and military
• National security
• IoT-enabled systems
• Agriculture
• Gaming
• Retail

Professional AI Skills in Demand for 2024
• Python
• C/C++
• MATLAB

According to ZipRecruiter, these are the top 5 skills required for AI jobs:
• Communication skills
• Knowledge and experience with Python specifically (and proficiency in programming languages in general)
• Digital marketing goals and strategies
• Collaborating effectively with others
• Analytical skills


The Intellipaat blog also recommends these additional skills for AI professionals:

• Solid knowledge of applied mathematics and algorithms
• Problem-solving skills
• Industry knowledge
• Management and leadership skills
• Machine learning

14 Career Paths in Artificial Intelligence
(career-path and job-requirement figures not reproduced)

Introduction (CO1)

What is Machine Learning?
(introductory figure slides not reproduced)

ML Applications (CO1)
(application figure slides not reproduced)

Introduction (CO1)

• As its name suggests, supervised machine learning is based on supervision.
• In the supervised learning technique, we train the machine using a "labelled" dataset, and based on this training, the machine predicts the output.
• Here, "labelled" data means that some of the inputs are already mapped to outputs.
• More precisely: first, we train the machine with the input and corresponding output, and then we ask the machine to predict the output for a test dataset.

Introduction (CO1)
• Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images.
• First, we train the machine to understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc.
• After training, we input a picture of a cat and ask the machine to identify the object and predict the output.
• Since the machine is now well trained, it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, and determine that it is a cat.
• So it will put it in the Cat category.
• This is how a machine identifies objects in supervised learning.
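The train-then-predict workflow above can be sketched in a few lines of Python. The numeric features (height and tail length) and the tiny dataset below are invented for illustration, and scikit-learn's DecisionTreeClassifier stands in for whichever algorithm is used; it is a sketch, not the course's reference implementation.

```python
# Supervised learning sketch: learn from labelled examples, then
# predict the label of an unseen example.
# Features [height_cm, tail_length_cm] and labels are invented toy data.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30], [23, 28], [60, 35], [55, 40], [22, 25], [65, 38]]
y_train = ["cat", "cat", "dog", "dog", "cat", "dog"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)           # training phase: learn input -> output mapping

# Test phase: ask the trained model to label an unseen animal
print(model.predict([[24, 27]])[0])   # a small animal, so the model answers "cat"
```

The same fit/predict pattern applies to every supervised algorithm discussed later in the unit.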

Introduction (CO1)

Categories of Supervised Machine Learning

Supervised machine learning can be classified into two types of problems:

•Classification
•Regression

Introduction (CO1)
a) Classification

• Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc.

• Classification algorithms predict the categories present in the dataset. Some real-world examples of classification are spam detection, email filtering, etc.

Some popular classification algorithms are given below:


•Random Forest Algorithm
•Decision Tree Algorithm
•Logistic Regression Algorithm
•Support Vector Machine Algorithm
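One of the listed algorithms, Logistic Regression, can illustrate categorical prediction. The features below (number of links and number of ALL-CAPS words per email) and the labels are invented for illustration, echoing the spam-detection example; this is a sketch assuming scikit-learn is available.

```python
# Classification sketch: predict a categorical label ("spam" / "not spam").
# Toy features: [number_of_links, number_of_all_caps_words] per email.
from sklearn.linear_model import LogisticRegression

X = [[8, 10], [7, 12], [9, 9],    # link- and caps-heavy emails -> spam
     [0, 1], [1, 0], [0, 2]]      # plain emails -> not spam
y = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[6, 11], [1, 1]]))   # outputs are categories, not numbers
```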

Introduction (CO1)
b) Regression
• Regression algorithms are used to solve regression problems, in which the output variable is continuous (in the simplest case, linearly related to the input variables).
• They are used to predict continuous output variables, such as market trends, weather, etc.

• Some popular Regression algorithms are given below:

•Simple Linear Regression Algorithm


•Multivariate Regression Algorithm
•Decision Tree Algorithm
•Lasso Regression
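Simple linear regression, the first algorithm in the list, can be sketched as follows. The data points are invented and follow y = 2x + 1 exactly, so the fitted line recovers that slope and intercept; scikit-learn's LinearRegression is assumed as the implementation.

```python
# Regression sketch: fit a line to toy points on y = 2x + 1,
# then predict a continuous value for a new input.
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])         # exactly 2x + 1

reg = LinearRegression().fit(X, y)
pred = reg.predict([[5.0]])[0]              # continuous output: 2*5 + 1 = 11.0
print(round(reg.coef_[0], 2), round(reg.intercept_, 2), round(pred, 2))
```

Unlike the classifier above, the output here is a number on a continuous scale rather than a category.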

Introduction (CO1)
Advantages and Disadvantages of Supervised Learning
Advantages:

•Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
•These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

•These algorithms are not able to solve complex tasks.
•They may predict the wrong output if the test data differs from the training data.
•Training the algorithm requires a lot of computational time.

Introduction (CO1)
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
•Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
•Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by using medical images together with past data labelled for disease conditions. With such a process, the machine can identify diseases for new patients.
•Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic data
to identify the patterns that can lead to possible fraud.
•Spam detection - In spam detection & filtering, classification algorithms are used.
These algorithms classify an email as spam or not spam. The spam emails are sent to
the spam folder.
•Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications can
be done using the same, such as voice-activated passwords, voice commands, etc.
Introduction (CO1)

2. Unsupervised Machine Learning

• Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision.
• In unsupervised machine learning, the machine is trained using an unlabeled dataset and predicts the output without any supervision.
• The models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.

Introduction (CO1)
• The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences.
• Machines are instructed to find the hidden patterns in the input dataset.
• Let's take an example to understand this more precisely: suppose there is a basket of fruit images, and we input it into the machine learning model.
• The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
• So the machine will discover its own patterns and differences, such as colour and shape differences, and predict the output when tested with the test dataset.
Introduction (CO1)

Categories of Unsupervised Machine Learning

Unsupervised learning can be further classified into two types:

•Clustering
•Association

Introduction (CO1)
1) Clustering

The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by
their purchasing behaviour.
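Grouping by similarity can be sketched with K-Means. The 2-D points below are invented to form two obvious groups; note that no labels are given to the algorithm, which is what makes this unsupervised.

```python
# Clustering sketch: group unlabelled points by similarity with K-Means.
# The toy points form two obvious groups, near (1, 1) and near (8, 8).
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
# Points within the same group receive the same cluster label
print(labels[:3], labels[3:])
```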

Some of the popular clustering algorithms are given below:

•K-Means Clustering algorithm


•Mean-shift algorithm
•DBSCAN Algorithm
•Principal Component Analysis and Independent Component Analysis (dimensionality-reduction techniques often used alongside clustering)

Introduction (CO1)

2) Association
• Association rule learning is an unsupervised learning
technique, which finds interesting relations among variables
within a large dataset.
• The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum
profit.
• This algorithm is mainly applied in Market Basket analysis,
Web usage mining, continuous production, etc.
• Some popular algorithms of Association rule learning are
Apriori Algorithm, Eclat, FP-growth algorithm.
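The first step of algorithms like Apriori, counting how often items co-occur (their support), can be sketched with plain Python. The market-basket transactions below are invented for illustration.

```python
# Association sketch: compute the support of item pairs in toy
# market-basket transactions (the counting step of Apriori).
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Support of a pair = co-occurrence count / number of transactions
support = {p: c / len(transactions) for p, c in pair_counts.items()}
print(support[("bread", "milk")])   # bread & milk co-occur in 2 of 4 baskets -> 0.5
```

Apriori then keeps only the itemsets whose support clears a chosen threshold and extends them to larger itemsets.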

Introduction (CO1)
Advantages and Disadvantages of Unsupervised Learning

Advantages:
•These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled datasets.
•Unsupervised algorithms are preferable for various tasks, as obtaining an unlabeled dataset is easier than obtaining a labelled one.

Disadvantages:
•The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output in advance.
•Working with unsupervised learning is more difficult, as it works with unlabeled data that does not map to a known output.

Introduction (CO1)
Applications of Unsupervised Learning

•Network Analysis: Unsupervised learning is used for identifying plagiarism and copyright issues in document network analysis of text data for scholarly articles.

•Recommendation Systems: Recommendation systems widely use unsupervised learning techniques for building recommendation applications for different web applications and e-commerce websites.

•Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which can identify unusual data points within the dataset. It is used to discover fraudulent transactions.

•Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract particular information from a database, for example, extracting information on each user located at a particular location.
Introduction (CO1)
3. Semi-Supervised Learning

• Semi-supervised learning is a type of machine learning that lies between supervised and unsupervised machine learning.
• It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and uses a combination of labelled and unlabeled datasets during the training period.
• Although semi-supervised learning operates on data that includes a few labels, the data mostly consists of unlabeled examples.
• Labels are costly to obtain, but for corporate purposes a few labels may be available.
• It differs from supervised and unsupervised learning, which are defined by the presence or absence of labels.

Introduction (CO1)

• The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning algorithms.
• The main aim of semi-supervised learning is to effectively use all the available data, rather than only the labelled data as in supervised learning.
• Initially, similar data is clustered with an unsupervised learning algorithm; this then helps to label the unlabeled data.
• This is because labelled data is comparatively more expensive to acquire than unlabeled data.
• We can picture these algorithms with an example: supervised learning is a student under the supervision of an instructor at home and college. If the student self-analyses the same concept without any help from an instructor, that is unsupervised learning. Under semi-supervised learning, the student revises on their own after studying the concept under the guidance of an instructor at college.
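One common semi-supervised recipe, self-training, follows the idea above: fit on the few labelled points, pseudo-label the unlabelled ones the model is confident about, then refit on everything. The 1-D data and the 0.8 confidence threshold below are invented for illustration, and scikit-learn's LogisticRegression is assumed as the base learner.

```python
# Semi-supervised sketch (self-training): combine a few labelled
# points with pseudo-labelled unlabelled points. Toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_labelled = np.array([[0.0], [1.0], [9.0], [10.0]])
y_labelled = np.array([0, 0, 1, 1])
X_unlabelled = np.array([[0.5], [9.5], [8.8], [1.2]])

model = LogisticRegression().fit(X_labelled, y_labelled)

# Pseudo-label only the unlabelled points the model is confident about
proba = model.predict_proba(X_unlabelled)
confident = proba.max(axis=1) > 0.8
pseudo_y = model.predict(X_unlabelled)[confident]

# Refit on labelled + confidently pseudo-labelled data
X_all = np.vstack([X_labelled, X_unlabelled[confident]])
y_all = np.concatenate([y_labelled, pseudo_y])
model = LogisticRegression().fit(X_all, y_all)
print(model.predict([[0.3], [9.9]]))
```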

Introduction (CO1)

Advantages and Disadvantages of Semi-supervised Learning

Advantages:
•The algorithms are simple and easy to understand.
•It is highly efficient.
•It addresses drawbacks of both supervised and unsupervised learning algorithms.

Disadvantages:
•Iteration results may not be stable.
•These algorithms cannot be applied to network-level data.
•Accuracy can be low.
Introduction (CO1)

4. Reinforcement Learning

• Reinforcement learning works on a feedback-based process, in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance.
• The agent is rewarded for each good action and punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the rewards.
• In reinforcement learning there is no labelled data, unlike supervised learning; agents learn from their experience only.
Introduction (CO1)

• The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life.
• An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score.
• The agent receives feedback in the form of punishments and rewards.
• Due to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
• A reinforcement learning problem can be formalized using a Markov Decision Process (MDP).
• In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.
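The agent-environment loop can be sketched with tabular Q-learning on a tiny invented environment: a 1-D corridor of five states with a reward at the last state. The state space, learning rate, and discount factor are all illustrative choices, not from the course material.

```python
# Reinforcement-learning sketch: tabular Q-learning on a 5-state corridor.
# The agent starts at state 0; reaching state 4 yields reward 1.
import random

n_states, actions = 5, [-1, +1]          # actions: move left or right
alpha, gamma, episodes = 0.5, 0.9, 500   # learning rate, discount, episodes
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

random.seed(0)
for _ in range(episodes):
    s = 0
    while s != n_states - 1:
        a = random.choice(actions)                    # explore randomly
        s2 = min(max(s + a, 0), n_states - 1)         # environment responds with a new state
        r = 1.0 if s2 == n_states - 1 else 0.0        # reward only at the goal
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# After training, the greedy policy at state 0 moves right (+1), toward the reward
best = max(actions, key=lambda a: Q[(0, a)])
print(best)
```

Rewarded moves toward the goal accumulate higher Q-values, so the learned greedy policy heads right, which is the "maximize the rewards" behaviour described above.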

Introduction (CO1)

Categories of Reinforcement Learning

Reinforcement learning is categorized mainly into two types of methods/algorithms:

• Positive Reinforcement Learning: increases the tendency that the required behaviour will occur again by adding something. It strengthens the agent's behaviour and positively impacts it.

• Negative Reinforcement Learning: works exactly opposite to positive RL. It increases the tendency that a specific behaviour will occur again by avoiding a negative condition.

Introduction (CO1)
Real-world Use cases of Reinforcement Learning
•Video Games:

RL algorithms are much popular in gaming applications. It is used to gain super-human


performance. Some popular games that use RL algorithms are AlphaGO and AlphaGO
Zero.

•Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed that how
to use RL in computer to automatically learn and schedule resources to wait for different
jobs in order to minimize average job slowdown.

• Robotics:
RL is widely used in robotics. Robots deployed in industrial and manufacturing
settings are made more capable with reinforcement learning, and many industries
envision building intelligent robots using AI and machine learning technology.

• Text Mining:
Text mining, one of the major applications of NLP, is now being implemented with
the help of reinforcement learning at Salesforce.

Advantages and Disadvantages of Reinforcement Learning

Advantages

• It helps solve complex real-world problems that are difficult to handle with
conventional techniques.
• The learning model of RL resembles human learning, so it can produce highly
accurate results.
• It helps in achieving long-term results.

Disadvantages

• RL algorithms are not preferred for simple problems.
• RL algorithms require huge amounts of data and computation.
• Too much reinforcement can lead to an overload of states, which can
weaken the results.



Data Pre Processing (CO3)

Data Preprocessing
• Data preprocessing is the process of transforming
raw data into an understandable format.
• It is also an important step in data analytics as we
cannot work with raw data.
• The quality of the data should be checked before
applying machine learning or data mining
algorithms.




1. Data Cleaning:

• Handling Missing Values: This involves identifying and


dealing with missing data, either by imputation
(replacing missing values with estimated ones) or
deletion (removing instances with missing data).
• Noise Removal: Noise refers to irrelevant or erroneous
data. Techniques like smoothing, binning, or outlier
detection/removal can help clean noisy data.
• Normalization/Standardization: Scaling numerical
data to a common range (e.g., 0 to 1) or standardizing it
(e.g., mean of 0 and standard deviation of 1) can
improve the performance of some algorithms.
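Normalization and standardization can be written out by hand; the sample values below are illustrative (in practice scikit-learn's MinMaxScaler and StandardScaler do the same job):

```python
from statistics import mean, pstdev

values = [10.0, 20.0, 30.0, 40.0, 50.0]

# Min-max normalization: map values linearly onto [0, 1].
lo, hi = min(values), max(values)
minmax = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-score): mean 0 and (population) standard deviation 1.
mu, sigma = mean(values), pstdev(values)
zscores = [(v - mu) / sigma for v in values]

print(minmax)    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(zscores)
```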


2. Data Transformation:

• Encoding Categorical Variables: Converting categorical variables into numerical


representations using techniques like one-hot encoding or label encoding.
• Feature Scaling: Scaling numerical features to a specific range to ensure they
contribute equally to the analysis (e.g., using Min-Max scaling or Z-score
normalization).
• Feature Engineering: Creating new features or transforming existing ones to
improve the model's predictive power. This can include polynomial features, log
transformations, etc.
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) or
t-distributed Stochastic Neighbor Embedding (t-SNE) can reduce the number of
features while preserving important information.


3. Data Integration:

• Combining data from multiple sources into a unified dataset, resolving any
inconsistencies or conflicts in data formats, naming conventions, or units.

4. Data Reduction:

• Sampling: Selecting a representative subset of the data for analysis, such as


random sampling or stratified sampling.
• Feature Selection: Choosing the most relevant features for the analysis,
either manually or using automated techniques like recursive feature
elimination or feature importance rankings.


5. Data Discretization:

• Converting continuous data into discrete intervals or categories,


which can simplify analysis or improve interpretability.

6. Data Normalization:

• Ensuring that data follows a specific distribution or statistical


property, which can be important for certain algorithms like neural
networks.

7. Data Augmentation:

• Generating synthetic data points to supplement the original dataset,


often used in machine learning for tasks like image or text
classification.

8. Data Balancing:

• Addressing class imbalances in the dataset by oversampling minority


classes, undersampling majority classes, or using techniques like
SMOTE (Synthetic Minority Over-sampling Technique).
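Random oversampling, the simplest balancing technique, can be sketched as follows. The tiny dataset is made up; note that this only duplicates minority samples, whereas SMOTE goes further and synthesizes new minority points by interpolating between neighbours.

```python
import random

# Imbalanced binary dataset: 9 majority-class samples, 1 minority-class sample.
random.seed(42)
data = [([x], 0) for x in range(9)] + [([100], 1)]

majority = [d for d in data if d[1] == 0]
minority = [d for d in data if d[1] == 1]

# Duplicate randomly chosen minority samples until the classes match.
oversampled = minority + [random.choice(minority)
                          for _ in range(len(majority) - len(minority))]
balanced = majority + oversampled

counts = {0: 0, 1: 0}
for _, label in balanced:
    counts[label] += 1
print(counts)   # {0: 9, 1: 9}
```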


• Understanding and Extracting Useful Variables


• Understanding and extracting useful variables is a fundamental aspect of data preprocessing and
feature engineering in data analysis and machine learning. Here's a detailed explanation of how you
can approach this process:

1. Domain Knowledge:
• Start by gaining a deep understanding of the domain you're working in. This includes understanding
the business context, the problem you're trying to solve, and the relevant factors that might
influence the outcomes.

2. Data Exploration:
• Perform exploratory data analysis (EDA) to get a comprehensive view of the dataset. This involves
techniques like summary statistics, data visualization (histograms, scatter plots, etc.), and
correlation analysis to understand relationships between variables.


3. Identifying Relevant Variables:

• Based on domain knowledge and EDA results, identify variables that are likely
to be relevant to the problem at hand. Look for variables that have a strong
impact on the target variable or exhibit interesting patterns and relationships.

4. Handling Redundancy:

• Identify and handle redundant variables, i.e., variables that provide similar
information. Redundant variables can increase model complexity without
adding meaningful insights. Techniques like correlation analysis or variance
inflation factor (VIF) can help identify and address redundancy.


5. Feature Engineering:

Create new features or transform existing ones to capture important


information or improve model performance. This can include:

• Polynomial Features: Create polynomial combinations of features to


capture nonlinear relationships.
• Interaction Terms: Create interaction terms between variables to
capture synergistic effects.
• Derived Features: Create features based on domain knowledge or
transformations (e.g., logarithmic transformations, scaling, etc.).
• Time-Based Features: Extract features from timestamps (e.g., day of
week, hour of day) that might be relevant in time-series analysis or
forecasting.
• Text Features: Extract features from text data using techniques like
bag-of-words, TF-IDF, or word embeddings for natural language
processing tasks.


6. Dimensionality Reduction:

• If dealing with high-dimensional data, consider techniques like Principal


Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-
SNE) to reduce the dimensionality while preserving important information.

7. Feature Importance:

• Use model-based techniques (e.g., decision trees, random forests) or statistical


tests (e.g., ANOVA, chi-squared test) to evaluate the importance of features and
prioritize them accordingly.

8. Iterative Process:

• Feature selection and engineering are often iterative processes. Continuously


evaluate the impact of features on model performance and refine the feature
set based on feedback from modeling results.
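As one concrete instance of the dimensionality-reduction step (item 6), here is a hand-rolled 2-D PCA on made-up points; `sklearn.decomposition.PCA` is the practical choice for real data. Because the points lie almost on the line y = x, the first principal component should point along (1, 1)/√2 and explain nearly all the variance.

```python
from math import sqrt, hypot

points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

# Center the data.
n = len(points)
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# Covariance matrix [[a, b], [b, c]] of the centered data.
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2
root = sqrt(((a - c) / 2) ** 2 + b * b)
lam1, lam2 = half_trace + root, half_trace - root

# Unit eigenvector for the largest eigenvalue = first principal component.
vx, vy = b, lam1 - a
norm = hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

explained = lam1 / (lam1 + lam2)   # fraction of variance kept by PC1
print(pc1, round(explained, 3))
```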



KDD Process (CO3)
What is the KDD Process?
• The term Knowledge Discovery in Databases, or KDD for short,
refers to the broad process of finding knowledge in data, and
emphasizes the "high-level" application of particular data mining
methods.
• It is of interest to researchers in machine learning, pattern
recognition, databases, statistics, artificial intelligence,
knowledge acquisition for expert systems, and data visualization.
• The unifying goal of the KDD process is to extract knowledge from
data in the context of large databases.
• It does this by using data mining methods (algorithms) to extract
(identify) what is deemed knowledge, according to the
specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and
transformations of that database.


• Data Selection:
• The process starts with selecting the relevant data
from one or more databases or data sources.
• This involves identifying the data sources,
understanding their structure, and determining
which data subsets are necessary for the analysis.


• Data Preprocessing:

• Before analysis, the selected data undergoes


preprocessing to clean, transform, and integrate it into a
suitable format for analysis.
• This includes tasks like handling missing values, removing
noise, normalizing or standardizing data, encoding
categorical variables, and reducing dimensionality.


• Data Transformation:

• Transforming the preprocessed data into a format that


facilitates knowledge discovery. This may involve
feature engineering, creating new variables,
aggregating data, and converting data into a suitable
representation for analysis (e.g., numerical,
categorical, text, time-series).

• Data Mining:

• Data mining is the core step in KDD where algorithms and techniques are applied to the
transformed data to extract patterns, relationships, and insights. Common data mining
techniques include:
• Classification: Predicting categorical outcomes or classes based on input variables.
• Regression: Predicting continuous numerical values based on input variables.
• Clustering: Grouping similar data points together based on their attributes.
• Association Rule Mining: Discovering relationships and associations between variables in
large datasets (e.g., market basket analysis).
• Anomaly Detection: Identifying outliers or unusual patterns that deviate from the norm.
• Sequential Pattern Mining: Identifying sequences or patterns in time-series or sequential
data.


• Pattern Evaluation:
• Once patterns and insights are extracted, they are evaluated based on their
significance, reliability, and relevance to the problem domain. This involves
statistical analysis, validation techniques, and domain expert feedback to
assess the quality of discovered patterns.

• Knowledge Representation:
• The discovered patterns and insights are represented in a meaningful and
interpretable format that can be used for decision-making. This may involve
visualizations, rules, graphs, or other forms of representation that facilitate
understanding and utilization of the knowledge.


• Knowledge Utilization:

• Finally, the extracted knowledge is utilized for various


purposes such as decision support, predictive modeling,
optimization, trend analysis, risk assessment, and strategic
planning.
• The insights gained from KDD can drive actionable
recommendations and improvements in business
processes, research, healthcare, finance, marketing, and
other domains.

Data Cleaning: Missing Values:


• Data cleaning is a crucial step in data preprocessing,
and handling missing values is a significant part of
this process.
• Missing values occur when data is not available for
certain observations or attributes in a dataset.
• Here’s a detailed explanation of how missing values
can be dealt with during data cleaning:

Identifying Missing Values

• The first step is to identify missing values in your dataset.


These missing values can be represented in various ways,
such as:

1.Blank Cells: Empty cells in a spreadsheet or dataset.


2.NaN (Not a Number): A special value used in programming
languages like Python and R to denote missing or undefined
data.
3.Null Values: Database systems often use NULL to
represent missing data.
4.Placeholders: Sometimes, missing values are represented
using placeholders like "N/A," "Unknown," "-999," etc.

• Reasons for Missing Values

1.Data Entry Errors: Human errors during data


collection or entry can lead to missing values.
2.Non-Response: In surveys or questionnaires,
respondents may choose not to answer certain
questions.
3.Data Corruption: Technical issues or corruption in
data storage systems can cause missing values.
4.Data Privacy: Missing values may occur due to privacy
concerns or data masking.

Dealing with Missing Values

1.Deleting Rows/Columns: If the missing values are few and random,


deleting the corresponding rows or columns may be an option.
However, this should be done cautiously, as it can lead to loss of
information.
2.Imputation: Imputation involves replacing missing values with
estimated values. Common imputation techniques include:
1. Mean/Median Imputation: Replace missing values with the
mean or median of the column.
2. Mode Imputation: Replace missing categorical values with
the mode (most frequent value) of the column.
3. Regression Imputation: Predict missing values using
regression models based on other variables.
4. K-Nearest Neighbors (KNN) Imputation: Replace missing
values with values from similar observations using KNN
algorithm.
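The mean/median/mode imputation strategies above can be sketched in a few lines; the small dataset is invented, and `None` marks a missing value (pandas' `fillna` or sklearn's `SimpleImputer` are the usual tools):

```python
from statistics import mean, median, mode

# Made-up columns with missing entries.
ages = [25, 30, None, 22, None, 28]
cities = ["Delhi", "Pune", None, "Delhi", "Delhi"]

# Numerical column: fill gaps with the mean or median of the observed values.
observed = [v for v in ages if v is not None]
mean_filled = [v if v is not None else mean(observed) for v in ages]
median_filled = [v if v is not None else median(observed) for v in ages]

# Categorical column: fill gaps with the mode (most frequent value).
observed_cities = [c for c in cities if c is not None]
mode_filled = [c if c is not None else mode(observed_cities) for c in cities]

print(mean_filled)
print(mode_filled)
```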

3. Using Default Values: For categorical data, you
can replace missing values with a default category
or label.

4. Flagging Missing Values: Create a new indicator


variable to flag missing values, preserving the
information that data was missing at certain points.

Choosing the Right Strategy
• The choice of strategy depends on several factors:

• Amount of Missing Data: If a large portion of data is


missing, deleting rows/columns may not be suitable.
• Data Type: Different strategies may be applicable for
numerical and categorical data.
• Data Distribution: Consider the distribution and
nature of the data before choosing an imputation
method.
• Impact on Analysis: The chosen strategy should not
introduce bias or distort the analysis results.

Best Practices
• Here are some best practices for handling missing
values:

• Understand Data Domain: Gain domain knowledge to


interpret missing values correctly.
• Document Changes: Document the methods used for
handling missing values to ensure transparency.
• Evaluate Impact: Assess how handling missing values
affects the data distribution and analysis outcomes.
• Iterative Process: Data cleaning is often iterative, so
revisit and refine strategies as needed.

Handling Missing Values (worked examples shown as figures in the original slides): 1. Deleting Rows/Columns; 2. Imputation; 3. Flagging Missing Values.


Noisy data refers to data that contains errors, outliers, or
inconsistencies, which can negatively impact the accuracy and
reliability of analyses and models. Dealing with noisy data is an
essential part of data cleaning and preprocessing. Here’s an
explanation of noisy data and how it can be addressed:

Types of Noisy Data:


1.Errors: Errors in data can occur due to various reasons such as
data entry mistakes, sensor malfunctions, or measurement
inaccuracies.
2.Outliers: Outliers are data points that deviate significantly from
the rest of the data. They can occur naturally or due to errors.
3.Inconsistencies: Inconsistent data occurs when values
contradict each other or violate logical constraints within the
dataset.

Impact of Noisy Data:
• Bias: Noisy data can introduce bias in analyses
and predictions, leading to incorrect conclusions.
• Reduced Accuracy: Models trained on noisy data
may have lower accuracy and predictive power.
• Misinterpretation: Noisy data can cause
misinterpretation of trends and patterns in the
data.
• Increased Variability: Noisy data can increase the
variability of results, making it challenging to draw
reliable insights.

Dealing with Noisy Data:
1.Data Cleaning:
1. Error Correction: Identify and correct errors in the data
through manual review or automated algorithms.
2. Outlier Detection and Removal: Use statistical methods or
machine learning techniques to detect and remove outliers.
3. Inconsistency Resolution: Resolve inconsistencies by
validating data against business rules or domain knowledge.
2.Data Transformation:
1. Normalization: Scale numerical data to a standard range to
reduce the impact of outliers.
2. Transformation: Apply mathematical transformations (e.g.,
logarithmic, square root) to make the data more uniform and
reduce skewness.


• Data Smoothing:
• Moving Average: Smooth time-series data by calculating moving
averages to reduce noise and highlight trends.
• Filtering Techniques: Apply filters (e.g., median filter, Gaussian filter)
to remove noise from signals or images.

• Feature Engineering:
• Feature Selection: Choose relevant features and exclude noisy or
irrelevant features from the analysis.
• Feature Creation: Create new features or combine existing features
to capture important information and reduce noise.
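The moving-average smoothing mentioned above can be sketched as follows (pandas' `Series.rolling(window).mean()` offers the same idea); the noisy series is made up:

```python
# A noisy series with two spikes (at positions 2 and 7).
signal = [2.0, 2.1, 9.0, 2.0, 1.9, 2.2, 2.1, 8.5, 2.0, 2.1]

def moving_average(xs, window=3):
    # Average each value with its neighbours; shrink the window at the edges.
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window // 2)
        hi = min(len(xs), i + window // 2 + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

smoothed = moving_average(signal)
# The spikes are pulled back toward the baseline, highlighting the trend.
print([round(v, 2) for v in smoothed])
```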


• Modeling Techniques:

• Robust Models: Use robust machine learning models


that are less sensitive to outliers and noisy data (e.g.,
decision trees, random forests).
• Ensemble Methods: Combine multiple models to
reduce the impact of noise and improve predictive
performance.

IQR (Interquartile Range) Method:
• Explanation: The IQR method defines outliers as data points that fall
below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where Q1 is the first
quartile, Q3 is the third quartile, and IQR = Q3 − Q1.
• Handling: Filter or remove data points outside the defined range.
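The IQR rule can be computed directly with the standard library; the sample data below is invented, and note that `statistics.quantiles` (its default "exclusive" method) may give slightly different quartiles than other tools:

```python
from statistics import quantiles

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13]

# Quartiles: n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
cleaned = [x for x in data if lower <= x <= upper]
print(outliers)   # the extreme value 102 is flagged
```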

Discretization:
• Discretization is the process of converting continuous data into discrete intervals or
bins. This transformation is useful for handling numerical data that is too granular or
for creating categorical variables from numerical data. Discretization can be
achieved through techniques such as binning, clustering, and histogram-based
methods.
• 1. Binning:
• Explanation: Binning involves dividing the range of continuous values into intervals
or bins. Each bin represents a range of values, and data points falling within a bin are
assigned the corresponding bin label.
• Types of Binning:
• Equal Width Binning: Divides the range into equal-width bins, where each bin has the same
width.
• Equal Frequency Binning: Divides the data into bins such that each bin contains approximately
the same number of data points.
• Custom Binning: Defines bins based on domain knowledge or specific requirements.
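Equal-width and equal-frequency binning can be sketched in plain Python (`pandas.cut` and `pandas.qcut` are the usual one-liners); the marks list is illustrative:

```python
marks = [5, 15, 22, 35, 48, 60, 71, 80, 92, 99]

def equal_width(xs, k):
    # Each bin covers the same width of the value range.
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    # Clamp so the maximum falls in the last bin instead of bin k.
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equal_frequency(xs, k):
    # Each bin receives (approximately) the same number of points,
    # assigned by rank order.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    labels = [0] * len(xs)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(xs)
    return labels

print(equal_width(marks, 3))
print(equal_frequency(marks, 2))   # lower half -> 0, upper half -> 1
```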

2. Clustering-based Discretization:
• Explanation: Clustering algorithms like k-means or
DBSCAN can be used to cluster similar data points
together. The cluster labels can then be used as
discrete categories for data.
• Example:
Cluster 1: Low,
Cluster 2: Medium,
Cluster 3: High
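A minimal 1-D k-means, with made-up values and starting centers, shows how cluster labels become the discrete categories Low/Medium/High (`sklearn.cluster.KMeans` is the practical tool):

```python
values = [3, 4, 5, 40, 42, 45, 90, 95, 100]
k = 3
centers = [4.0, 50.0, 95.0]   # deliberately spread initial centers

for _ in range(20):
    # Assignment step: each point joins its nearest center.
    clusters = [[] for _ in range(k)]
    for v in values:
        nearest = min(range(k), key=lambda j: abs(v - centers[j]))
        clusters[nearest].append(v)
    # Update step: move each center to the mean of its points.
    centers = [sum(c) / len(c) if c else centers[j]
               for j, c in enumerate(clusters)]

# Discretize: label each value by its final cluster.
names = ["Low", "Medium", "High"]
labels = [names[min(range(k), key=lambda j: abs(v - centers[j]))]
          for v in values]
print(labels)
```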

3. Histogram-based Discretization:
• Explanation: Histogram-based methods analyze
the data distribution and create bins based on
histogram peaks or density. Peaks in the histogram
represent potential bin boundaries.
• Example:
[0-20], [21-50], [51-100], ...

Concept Hierarchy Generation:

• Concept hierarchy generation involves organizing categorical data


into hierarchical structures based on the relationships and levels
of abstraction within the data. This process helps in understanding
and representing data in a more structured and meaningful way.

1. Categorical Binning:
• Explanation: Similar to numerical binning, categorical binning
involves grouping categorical values into broader categories or
bins. This can be based on similarity, frequency, or domain
knowledge.
• Example:
Category A: {Apple, Banana, Cherry}
Category B: {Orange, Mango}
2. Clustering-based Hierarchy:
• Explanation: Clustering algorithms can be applied
to categorical data as well to group similar
categories together. The resulting clusters form
levels in the concept hierarchy.
• Example:
Level 1: Fruits {Apple, Banana, Cherry}
Level 2: Citrus Fruits {Orange, Mango}

3. Manual Concept Hierarchy:
• Explanation: Domain experts can define a hierarchical
structure based on the relationships and attributes of
categorical data. This manual approach ensures that
the hierarchy reflects domain-specific knowledge.
• Example:
Level 1: Animals
- Level 2: Mammals
- Level 2: Birds
- Level 2: Reptiles



Identifying Outliers (CO4)



Exploratory Data Analysis (CO4)
• Here's a step-by-step explanation of the techniques used to remove
outliers:
• Interquartile Range (IQR) Method:
• Calculate the first quartile (Q1) and third quartile (Q3) of the data.
• Compute the IQR as IQR = Q3 - Q1.
• Define the lower bound as Q1 - 1.5 * IQR and the upper bound as Q3 +
1.5 * IQR.
• Identify outliers as data points that fall below the lower bound or above
the upper bound.
• Z-Score Method:
• Calculate the z-score for each data point, which measures how many
standard deviations an observation is from the mean.
• Define a threshold z-score (commonly 2 or 3) beyond which data points
are considered outliers.
• Identify outliers as data points with z-scores above the threshold.

• Modified Z-Score Method (MAD):
• Calculate the Median Absolute Deviation (MAD), which is the median of the
absolute deviations from the median of the data.
• Define a threshold (commonly 2 or 3) as a multiple of the MAD.
• Identify outliers as data points with absolute deviations from the median
above the threshold.
• Tukey's Fences:
• Calculate the lower fence as Q1 - k * IQR and the upper fence as Q3 + k *
IQR, where k is a multiplier (commonly 1.5 or 3).
• Identify outliers as data points that fall below the lower fence or above the
upper fence.
• Visualization Techniques:
• Box Plot: Visualize the distribution of the data using a box plot and identify
outliers as data points beyond the whiskers.
• Scatter Plot: Plot the data points on a scatter plot and identify outliers
based on their position relative to other data points.
• Machine Learning-Based Approaches:
• Use outlier detection algorithms such as Isolation Forest, Local Outlier
Factor (LOF), or One-Class SVM to automatically detect and remove
outliers.
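The z-score and modified z-score (MAD) rules above can be applied to the same made-up sample; the z-threshold of 2 matches the slide, while the 3.5 cutoff for the modified z-score follows a common convention (`scipy.stats.zscore` is a frequent shortcut for the first computation):

```python
from statistics import mean, pstdev, median

data = [10.0, 11.0, 9.5, 10.5, 10.0, 11.5, 9.0, 10.2, 45.0]

# Z-score method: flag points more than 2 standard deviations from the mean.
mu, sigma = mean(data), pstdev(data)
z_outliers = [x for x in data if abs((x - mu) / sigma) > 2]

# Modified z-score (MAD) method: more robust, since the median and MAD
# are barely affected by the outlier itself.
med = median(data)
mad = median(abs(x - med) for x in data)
# 0.6745 rescales the MAD so it is comparable to a standard deviation.
mod_z_outliers = [x for x in data if abs(0.6745 * (x - med) / mad) > 3.5]

print(z_outliers, mod_z_outliers)   # 45.0 stands out under both rules
```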


One-Hot Encoding
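The original slides illustrate one-hot encoding with figures; here is a minimal, self-contained sketch on an illustrative colour column (`pandas.get_dummies` or sklearn's `OneHotEncoder` are the usual tools):

```python
colors = ["red", "green", "blue", "green", "red"]

# One binary indicator column per category, in a fixed order.
categories = sorted(set(colors))            # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

encoded = [one_hot(c) for c in colors]
print(categories)
print(encoded[0])   # 'red' -> [0, 0, 1]
```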
Feature Scaling

• Feature scaling is a crucial step in data preprocessing for


machine learning models.

• It involves adjusting the values of features in the dataset to a


common scale without distorting differences in the ranges of
values.
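A small made-up example shows why feature scaling matters for distance-based methods: with raw units the wide-range feature (income) dominates the Euclidean distance, so the nearest neighbour of point "a" flips once both features are min-max scaled. The people and their (age, income) values are invented for illustration.

```python
from math import dist   # Euclidean distance (Python 3.8+)

people = {
    "a": (25.0, 50_000.0),
    "b": (55.0, 51_000.0),
    "c": (26.0, 80_000.0),
    "d": (80.0, 200_000.0),
}

raw_ab = dist(people["a"], people["b"])   # ~1000: income gap dominates
raw_ac = dist(people["a"], people["c"])   # ~30000
print(raw_ab < raw_ac)   # True: raw units say b is closer to a than c is

def min_max_scale(rows):
    # Scale each column independently onto [0, 1].
    cols = list(zip(*rows))
    los, his = [min(c) for c in cols], [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(r, los, his))
            for r in rows]

scaled = dict(zip(people, min_max_scale(list(people.values()))))
scaled_ab = dist(scaled["a"], scaled["b"])
scaled_ac = dist(scaled["a"], scaled["c"])
print(scaled_ab > scaled_ac)   # True: after scaling, c is the closer neighbour
```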
Faculty Video Links, Youtube & NPTEL
Video Links and Online Courses Details

• Youtube/other Video Links

➢https://nptel.ac.in/courses/106106093/

➢https://www.youtube.com/watch?v=m-aKj5ovDfg

➢https://www.youtube.com/watch?v=G4NYQox4n2g

➢https://nptel.ac.in/courses/106/105/106105174/

DAILY QUIZ

1. The output of KDD is __________.


o Data.
o Information.
o Query.
o Useful information

2. _________ is the input to KDD.


o Data.
o Information.
o Query.
o Process.


3. Extreme values that occur infrequently are called as _________


o outliers.
o rare values.
o dimensionality reduction.
o All of the above.

4. Treating incorrect or missing data is called as ___________.


o selection.
o preprocessing.
o transformation.
o interpretation.


5. Box plot and scatter diagram techniques are _______.

o Graphical.
o Geometric.
o Icon-based.
o Pixel-based.

6. ___________ data are noisy and have many missing attribute


values.

o Preprocessed.
o Cleaned.
o Real-world
o Transformed.


7. The term that is not associated with data cleaning process is ______.
o domain consistency
o deduplication.
o disambiguation.
o segmentation.

8. Data scrubbing can be defined as


o Check field overloading
o Delete redundant tuples
o Use simple domain knowledge (e.g., postal code, spell-check) to detect
errors and make corrections
o Analyzing data to discover rules and relationship to detect violator

WEEKLY ASSIGNMENT

Q1:Explain the data cleaning process in data pre-processing.


[CO1]
Q2:Explain the need for data mining with suitable examples.
Differentiate between database management system and data
mining.
[CO1]
Q3: What are the research challenges to data mining? Explain with
suitable examples. Also explain performance evaluation
measures to evaluate a data mining system.
[CO1]

Q4: Explain parametric and non-parametric methods of


Numerosity reduction with suitable examples?
[CO1]

Q5: Explain the steps of knowledge discovery in databases? [CO1]
WEEKLY ASSIGNMENT (CONT’D)
Q6: There are various Data Reduction techniques, which one is
having minimum loss of information content? Brief on it.
[CO1]

Q7: Explain 5 different methods to fill in missing values while doing


data cleaning. [CO1]

Q8:Discuss the approaches for mining multi level association rules


from the transactional databases. Give relevant examples.
[CO1]
Q9:Write short notes on – [CO1]
• Data Generalizations
• Class Comparisons

Q10: Differentiate between Knowledge Discovery and Data Mining.


[CO1]
MCQs
1. The full form of KDD is _________.
o Knowledge database.
o Knowledge discovery in database.
o Knowledge data house.
o Knowledge data definition.

2. Various visualization techniques are used in ___________ step of


KDD.
o selection.
o transformation.
o data mining.
o interpretation.


3. Treating incorrect or missing data is called as ___________.

o selection.
o preprocessing.
o transformation.
o interpretation.

4. The KDD process consists of ________ steps.


o three.
o four.
o five.
o six.


5. The output of KDD is __________.

o Data.
o Information.
o Query.
o Useful information.

6. _________ is a the input to KDD.

o Data.
o Information.
o Query.
o Process.


7. Box plot and scatter diagram techniques are _______.

o Graphical.
o Geometric.
o Icon-based.
o Pixel-based.

8. __________ is used to proceed from very specific knowledge to


more general information.

o Induction.
o Compression.
o Approximation.
o Substitution.


9. Reducing the number of attributes to solve the high dimensionality


problem is called as ________.

o dimensionality curse.
o dimensionality reduction.
o cleaning.
o Overfitting.

10. The term that is not associated with data cleaning process is
______.

o domain consistency.
o deduplication.
o disambiguation.
o segmentation.


11. Which of the following is not a data pre-processing methods


o Data Visualization
o Data Discretization
o Data Cleaning
o Data Reduction

12. Synonym for data mining is


o Data Warehouse
o Knowledge discovery in database
o Business intelligence
o OLAP


13. In Binning, we first sort data and partition into (equal-


frequency) bins and then which of the following is not a valid step
o smooth by bin boundaries
o smooth by bin median
o smooth by bin means
o smooth by bin values

14. Data set {brown, black, blue, green, red} is example of Select
one:
o Continuous attribute
o Ordinal attribute
o Numeric attribute
o Nominal attribute

OLD QUESTION PAPERS
B.Tech (SEM VI) THEORY EXAMINATION 2017-18
DATA WAREHOUSING AND DATA MINING
Time: 3 Hours                                   Total Marks: 100
Note: 1. Attempt all Sections. 2. If any data is missing, choose suitably.
SECTION A
1. Attempt all questions in brief. 2 x 10 = 20
a. Draw the diagram for key steps of data mining.
b. Define the term Support and Confidence.
c. What are attribute selection measures? What is the
drawback of information gain?
d. Differentiate between classification and clustering
e. Write the statement for Apriori Algorithm.

f. What are the drawbacks of the k-means algorithm?
g. What is Chi Square test?
h. Compare the roll-up and drill-down operations.
i. What are the hierarchical methods for clustering?
j. Name the main features of the Genetic Algorithm.

SECTION B
Attempt any three of the following: 10 x 3 = 30

a. Explain the data mining / knowledge extraction process in detail.
b. Differentiate between OLAP and OLTP.


c. Find the frequent patterns and the association rules by using the Apriori
Algorithm for the following transactional database:

TID     ITEMS
T100    M, O, N, K, E, Y
T200    D, O, N, K, E, Y
T300    M, A, K, E
T400    M, U, C, K, Y
T500    C, O, O, K, I, E

Let Minimum Support = 60% and Minimum Confidence = 80%
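As a quick way to check the frequent-itemset half of this exercise, the support counts can be computed level-wise in Python. This is a simplified sketch of the Apriori idea (it joins frequent itemsets but skips the subset-pruning step); the transactions are taken from the question, with a minimum support count of 3 (60% of 5 transactions):

```python
# Transactions from the question; T500's duplicate "O" counts once per set
transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
min_count = 3  # minimum support = 60% of 5 transactions

def support(itemset):
    # number of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions)

frequent = {}
candidates = {frozenset([i]) for t in transactions for i in t}
k = 1
while candidates:
    # keep only the candidates meeting the minimum support count
    level = {c: support(c) for c in candidates}
    level = {c: s for c, s in level.items() if s >= min_count}
    frequent.update(level)
    # generate (k+1)-item candidates by joining frequent k-itemsets
    keys = list(level)
    candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
    k += 1

for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With these transactions, the largest frequent itemset works out to {O, K, E} with support 3; the association rules then follow from its subsets by checking confidence >= 80%.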

d. What are the different database schemas? Show with an example.

e. How are data backup and data recovery managed in a data warehouse?


3. Attempt any one part of the following: 10 x 1 = 10


a. Draw the 3-tier data warehouse architecture. Explain the ETL process.
b. Elaborate on the different strategies for data cleaning.

4. Attempt any one part of the following: 10 x 1 = 10


a. What are the different clustering methods? Explain STING in detail.
b. What are the applications of data warehousing? Explain web
mining and spatial mining.

5. Attempt any one part of the following: 10 x 1 = 10


a. Define data warehouse. What strategies should be taken care of while designing a warehouse?
b. Write short notes on the following:
(i) Concept Hierarchy
(ii) ROLAP vs MOLAP
(iii) Gain Ratio
(iv) Classification vs Clustering

6. Attempt any one part of the following: 10 x 1 = 10


a. Write the k-means algorithm. Suppose that the data mining task is to
cluster points (with (x, y) representing location) into three clusters,
where the points are:
A1(2, 10), A2(2, 5), A3(8, 4)
B1(5, 8), B2(7, 5), B3(6, 4)
C1(1, 2), C2(4, 9)
The distance function is Euclidean distance. Suppose initially we assign
A1, B1, and C1 as the center of each cluster, respectively. Use the
k-means algorithm to show only the three cluster centers after the first
round of execution.
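One round of this exercise can be sketched in a few lines of Python, assuming the initial centers A1, B1 and C1 given in the question:

```python
import math

# Points and initial cluster centers from the question
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
centers = [points["A1"], points["B1"], points["C1"]]

def dist(p, q):
    # Euclidean distance between two (x, y) points
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Step 1: assign each point to its nearest center
clusters = [[] for _ in centers]
for p in points.values():
    nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
    clusters[nearest].append(p)

# Step 2: recompute each center as the mean of its cluster
new_centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
print(new_centers)  # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
```

After the first assignment-and-update round, the three centers move to (2, 10), (6, 6) and (1.5, 3.5).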

b. What is the hierarchical method for clustering? Explain the BIRCH method.

EXPECTED QUESTIONS FOR UNIVERSITY EXAM
1. Discuss the steps involved in the KDD process.

2. Define data discretization. Explain the various approaches in data discretization.

3. Differentiate between lossless and lossy data transformation.

4. In real-world data, tuples with missing values for some attributes are a
common occurrence. Describe the various methods for handling this problem.

5. What are the issues in data integration?


6. Suppose a group of 12 sales price records has been sorted as
follows: 5,10,11,13,15,35,50,55,72,92,204,215. Partition them
into three bins by each of the following methods:
a) equal-frequency(equal-depth) partitioning
b) equal-width partitioning
c) Clustering
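Parts (a) and (b) can be checked with a short Python sketch; part (c), clustering-based partitioning, would instead group values by natural gaps and is left out here:

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]  # already sorted
k = 3

# (a) equal-frequency (equal-depth): same number of values per bin
depth = len(data) // k
eq_freq = [data[i * depth:(i + 1) * depth] for i in range(k)]

# (b) equal-width: split the value range [min, max] into k equal intervals
width = (data[-1] - data[0]) / k  # (215 - 5) / 3 = 70
eq_width = [[] for _ in range(k)]
for v in data:
    idx = min(int((v - data[0]) / width), k - 1)  # clamp max value into last bin
    eq_width[idx].append(v)

print(eq_freq)   # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(eq_width)  # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
```

Note how equal-width partitioning is skewed by the outliers 204 and 215: one bin holds nine values while another holds only one.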

7. Explain data integration, transformation and loading.

8. Explain the different methods to fill in missing values while performing
data cleaning.
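The standard options (ignore the tuple, fill with a global constant, fill with the attribute mean, and so on) can be sketched on a toy column; the values below are purely illustrative:

```python
# Toy attribute column with missing entries marked as None
ages = [23, None, 35, None, 41, 30]

# Strategy 1: ignore (drop) tuples with missing values
dropped = [a for a in ages if a is not None]

# Strategy 2: fill with a global constant (e.g. a sentinel value)
const_filled = [a if a is not None else -1 for a in ages]

# Strategy 3: fill with the attribute mean of the observed values
mean = sum(dropped) / len(dropped)
mean_filled = [a if a is not None else mean for a in ages]

print(dropped)       # [23, 35, 41, 30]
print(const_filled)  # [23, -1, 35, -1, 41, 30]
print(mean_filled)   # [23, 32.25, 35, 32.25, 41, 30]
```

More refined variants fill with the mean of the same class, or predict the most probable value with a learned model (e.g. regression or a decision tree).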

9. Differentiate between Knowledge Discovery and Data Mining.

SUMMARY
➢ The major pre-processing tasks in data warehousing are data cleaning, integration, reduction and transformation.

➢ Data cleaning is a method of correcting errors and mistakes (for example, those made during manual data entry) and also taking care of missing values.

➢ Data integration is a method of integrating data coming from various sources, such as databases, flat files, etc.

➢ Data reduction minimizes the data volume in the data warehouse without affecting the original data content.

➢ Data transformation develops a standardized data format that can be utilized by the data warehouse.

REFERENCES

➢ Alex Berson, Stephen J. Smith, "Data Warehousing, Data Mining & OLAP", TMH.

➢ Mark Humphries, Michael W. Hawkins, Michelle C. Dy, "Data Warehousing: Architecture and Implementation", Pearson.

➢ https://www.oreilly.com/library/view/datawarehousingarchitecture/0130809020/ch07.html

➢ https://www.slideshare.net/2cdude/data-warehousing-3292359
Thank You

