
UNIT I

INTRODUCTION TO MACHINE LEARNING &


PREPARING TO MODEL

1.1. HUMAN LEARNING


Human learning is a multifaceted process that involves acquiring new knowledge, skills,
values, attitudes, and behaviors. It is a continuous journey that starts from the moment we are
born and continues throughout our lives.
Types of Human Learning
There are various types of human learning, each with its unique characteristics and
mechanisms:
1. Classical Conditioning: This type of learning, pioneered by Ivan Pavlov, involves
associating two stimuli. For example, if you repeatedly pair the sound of a bell with
the presentation of food, you can eventually condition a dog to salivate at the sound of
the bell alone.

Fig: Classical Conditioning
2. Operant Conditioning: Developed by B.F. Skinner, operant conditioning focuses on
the consequences of behavior. Behaviors that are followed by positive consequences
(reinforcement) are more likely to be repeated, while those followed by negative
consequences (punishment) are less likely to be repeated.

Fig: Operant Conditioning
3. Observational Learning: Also known as social learning, this type of learning
involves acquiring new behaviors by observing and imitating others. It is a powerful
mechanism for learning social norms, skills, and behaviors.

Fig: Observational Learning
4. Cognitive Learning: This type of learning emphasizes the mental processes involved
in acquiring knowledge, such as attention, memory, perception, and thinking. It
focuses on how people process information and construct meaning.

Fig: Cognitive Learning
Early Learning and Development
Early childhood is a critical period for learning and development. During this time, children
acquire fundamental skills such as language, motor skills, and social-emotional skills. These
early experiences lay the foundation for future learning and development.
1.2. MACHINE LEARNING-TYPES
Machine Learning is a subfield of Artificial Intelligence that enables machines to improve at a given task with experience. It is important to note that all machine learning techniques are classified as Artificial Intelligence; however, not all Artificial Intelligence counts as Machine Learning, since basic rule-based engines can be classified as AI but do not learn from experience and therefore do not belong to the machine learning category.
Definition: Arthur Samuel, a pioneer in the field of artificial intelligence and computer
gaming, coined the term “Machine Learning”. He defined machine learning as – “Field of
study that gives computers the capability to learn without being explicitly programmed”.
In layman's terms, Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences, without their being explicitly programmed, i.e. without any human assistance. The process starts with feeding good-quality data and then training our machines (computers) by building machine learning models using the data and different algorithms. The choice of algorithm depends on what type of data we have and what kind of task we are trying to automate.

Example: Training of students during exams.
While preparing for exams, students do not actually cram the subject but try to learn it with complete understanding. Before the examination, they feed their machine (brain) with a good amount of high-quality data (questions and answers from different books, teachers' notes, or online video lectures). In effect, they are training their brain with both input and output, i.e. the kind of approach or logic they have to use to solve different kinds of questions. Each time they solve practice test papers, they measure their performance (accuracy/score) by comparing their answers with the given answer key. Gradually, the performance keeps increasing and they gain more confidence in the adopted approach. That is how models are actually built: train the machine with data (both inputs and outputs are given to the model), and when the time comes, test it on data (with input only) and obtain the model's score by comparing its answers with the actual output, which was not fed in during training. Researchers are working assiduously to improve algorithms and techniques so that these models perform even better.

Difference in ML and Traditional Programming:


 Traditional Programming: We feed in DATA (input) + PROGRAM (logic), run it on the machine and get the output.
 Machine Learning: We feed in DATA (input) + OUTPUT, run it on the machine during training, and the machine creates its own program (logic), which can be evaluated during testing.
The basic machine learning process can be divided into three parts.
1. Data Input: Past data or information is utilized as a basis for future decision-making
2. Abstraction: The input data is represented in a broader way through the underlying
algorithm
3. Generalization: The abstracted representation is generalized to form a framework for
making decisions
The below Figure is a schematic representation of the machine learning process.

1.3. TYPES OF MACHINE LEARNING
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
i. Supervised Machine Learning
ii. Unsupervised Machine Learning
iii. Reinforcement Learning

i). Supervised Learning
Supervised learning is defined as a setting in which a model gets trained on a “labelled dataset”. Labelled datasets have both input and output parameters. In supervised learning, algorithms learn to map inputs to their correct outputs. Both the training and validation datasets are labelled.
Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed labelled images of dogs and cats to the algorithm, the machine will learn to classify a dog or a cat from these labelled images. When we input new dog or cat images that it has never seen before, it will use the learned function and predict whether it is a dog or a cat. This is how supervised learning works, and this particular case is an image classification task.

Supervised learning is effective for various business purposes, including sales
forecasting, inventory optimization, and fraud detection. Some examples of use cases include:
 Predicting real estate prices
 Classifying whether bank transactions are fraudulent or not
 Finding disease risk factors
 Determining whether loan applicants are low-risk or high-risk
 Predicting the failure of industrial equipment's mechanical parts

a). Classification: Classification is the process of organizing objects, data, or information into groups or categories based on shared characteristics or properties. It is a fundamental cognitive skill used in various fields, from everyday life to scientific research.
b).Regression: Regression is a statistical method used to model the relationship between a
dependent variable (also known as the response or outcome variable) and one or more
independent variables (also known as predictor or explanatory variables). The primary goal
of regression analysis is to understand how the dependent variable changes as the
independent variables change.
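To make the two tasks concrete, here is a minimal sketch with scikit-learn, assuming Python with scikit-learn and NumPy available; the dataset choice (the built-in Iris data) and the toy house-price numbers are purely illustrative, not part of the original notes.

```python
# Classification vs. regression: a minimal, illustrative scikit-learn sketch.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
import numpy as np

# Classification: predict a discrete class label (here, an iris species).
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Regression: predict a continuous value from one predictor (hypothetical data).
sizes = np.array([[50], [80], [120], [160]])    # e.g. house size in sq. m
prices = np.array([100, 150, 210, 270])         # price in thousands
reg = LinearRegression().fit(sizes, prices)
print("Predicted price for 100 sq. m:", reg.predict([[100]]))
```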
Advantages of Supervised Machine Learning
 Supervised Learning models can have high accuracy as they are trained on labelled
data.
 The process of decision-making in supervised learning models is often interpretable.

 Pre-trained models can often be reused, which saves time and resources compared with developing new models from scratch.

Disadvantages of Supervised Machine Learning


 It may struggle with unseen or unexpected patterns that are not present in the training data.
 It can be time-consuming and costly as it relies on labeled data only.
 It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Gaming: Recognize characters, analyze player behavior, and create NPCs.
 Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.

ii). Unsupervised Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.

a). Clustering: Clustering is a crucial technique in unsupervised machine learning. It involves grouping similar data points together without any prior knowledge of their labels or categories. The goal is to discover underlying patterns and structures within the data.
b) Association Analysis: Association analysis is a data mining technique used to discover
interesting relationships or associations between variables in large datasets. It's particularly
well-suited for transactional data, such as customer purchases, web browsing history, or
medical records.
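As an illustration of clustering, here is a minimal sketch using scikit-learn's KMeans; the six 2-D points and the choice of k = 2 clusters are assumptions made only for the demo.

```python
# Toy clustering example: group unlabelled 2-D points into two clusters.
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled points forming two loose groups (e.g. customer spend vs. visits).
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels:", kmeans.labels_)          # which group each point fell into
print("Cluster centres:", kmeans.cluster_centers_)
```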
Advantages of Unsupervised Machine Learning
 It helps to discover hidden patterns and various relationships between the data.
 Used for tasks such as customer segmentation, anomaly detection, and data
exploration.
 It does not require labeled data and reduces the effort of data labeling.
Disadvantages of Unsupervised Machine Learning
 Without using labels, it may be difficult to predict the quality of the model’s output.
 Cluster Interpretability may not be clear and may not have meaningful
interpretations.
 Techniques such as autoencoders and dimensionality reduction are often needed to extract meaningful features from raw data, which adds complexity.

Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
 Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
 Topic modeling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for
multimedia content.
 Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Image segmentation: Segment images into meaningful regions.
 Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
 Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
iii). Reinforcement Learning
Reinforcement Learning operates on the principle of learning optimal behavior through trial
and error. The agent takes actions within the environment, receives rewards or penalties,
and adjusts its behavior to maximize the cumulative reward. This learning process is
characterized by the following elements:
 Policy: A strategy used by the agent to determine the next action based on the
current state.
 Reward Function: A function that provides a scalar feedback signal based on the
state and action.
 Value Function: A function that estimates the expected cumulative reward from a
given state.
 Model of the Environment: A representation of the environment that helps in
planning by predicting future states and rewards.

Example: The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following scenario explains the problem more easily.

The above image shows a robot, a diamond, and fire. The goal of the robot is to get the reward, that is the diamond, and to avoid the hurdles, that is the fire. The robot learns by trying all the possible paths and then choosing the path which gives it the reward with the least hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, that is the diamond.
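The robot-and-diamond idea can be sketched as tabular Q-learning on a tiny one-dimensional grid. Everything here (five states, two actions, the reward values, and the hyperparameters alpha, gamma, epsilon) is an illustrative assumption, not part of the original example.

```python
# Toy Q-learning sketch: the agent starts at cell 0, the "diamond" is at cell 4.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # Q-table, initially all zeros
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    s = 0
    while s != n_states - 1:                                  # until the goal cell is reached
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        a = np.random.randint(n_actions) if np.random.rand() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else -0.1           # reward at the goal, small penalty otherwise
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("Greedy action per state:", Q.argmax(axis=1))           # should mostly point right (toward the diamond)
```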
Advantages of Reinforcement Machine Learning
 It supports autonomous decision-making and is well-suited for tasks that require learning a sequence of decisions, such as robotics and game-playing.
 This technique is preferred to achieve long-term results that are very difficult to
achieve.
 It is used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement Machine Learning
 Training Reinforcement Learning agents can be computationally expensive and
time-consuming.
 Reinforcement learning is not preferable for solving simple problems.
 It needs a lot of data and a lot of computation, which makes it impractical and
costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize supply
chain operations.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in video
games.
 Adaptive Personal Assistants: RL can be used to improve personal assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create
immersive and interactive experiences.
Comparison – supervised, unsupervised, and reinforcement learning:
Criteria | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Input Data | Input data is labelled. | Input data is not labelled. | Input data is not predefined.
Problem | Learn the pattern of inputs and their labels. | Divide data into classes. | Find the best reward between a start and an end state.
Solution | Finds a mapping equation on input data and its labels. | Finds similar features in input data to classify it into classes. | Maximizes reward by assessing the results of state-action pairs.
Model Building | Model is built and trained prior to testing. | Model is built and trained prior to testing. | The model is trained and tested simultaneously.
Applications | Deals with regression and classification problems. | Deals with clustering and associative rule mining problems. | Deals with exploration and exploitation problems.
Algorithms Used | Decision trees, linear regression, K-nearest neighbors | K-means clustering, k-medoids clustering, agglomerative clustering | Q-learning, SARSA, Deep Q Network
Examples | Image detection, population growth prediction | Customer segmentation, feature elicitation, targeted marketing, etc. | Driverless cars, self-navigating vacuum cleaners, etc.

1.4. PROBLEMS NOT TO BE SOLVED USING MACHINE LEARNING


Machine learning should not be applied to tasks in which humans are very effective or
frequent human intervention is needed. For example, air traffic control is a very complex task
needing intense human involvement. At the same time, for very simple tasks which can be
implemented using traditional programming paradigms, there is no sense in using machine learning. For example, simple rule-driven or formula-based applications like a price calculator engine or a dispute-tracking application do not need machine learning techniques.
Machine learning should be used only when the business process has some lapses. If
the task is already optimized, incorporating machine learning will not serve to justify the
return on investment. For situations where training data is not sufficient, machine learning
cannot be used effectively. This is because, with small training data sets, the impact of bad
data is exponentially worse. For the quality of prediction or recommendation to be good, the
training data should be sizeable.
1.5. APPLICATIONS OF MACHINE LEARNING
Machine Learning is a buzzword in today's technology, and it is growing very rapidly day by day. Machine learning is used in daily life, often without us knowing it, in applications such as Google Maps, Google Assistant, Alexa, etc. Below are some of the most trending real-world applications of Machine Learning:
1. Image Recognition: Image recognition is one of the most common applications of
machine learning. It is used to identify objects, persons, places, digital images, etc. The
popular use case of image recognition and face detection is, Automatic friend tagging
suggestion:
Facebook provides a feature of automatic friend tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm. It is based on the Facebook project named "Deep Face," which is responsible for face recognition and person identification in the picture.
2. Speech Recognition: While using Google, we get an option of "Search by voice"; this comes under speech recognition, and it is a popular application of machine learning. Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.
3. Traffic prediction: If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions. It predicts traffic conditions such as whether traffic is clear, slow-moving, or heavily congested with the help of two inputs: the real-time location of vehicles from the Google Maps app and sensors, and the average time taken on past days at the same time. Everyone who uses Google Maps is helping to make the app better: it takes information from the user and sends it back to its database to improve performance.

4. Product recommendations: Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet on the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests products according to those interests. Similarly, when we use Netflix, we get recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars: One of the most exciting applications of machine learning is self-driving cars. Machine learning plays a significant role in them. Tesla, a popular car manufacturing company, is working on self-driving cars and uses machine learning methods to train the car models to detect people and objects while driving.
6. Email Spam and Malware Filtering: Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
 Content Filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant: We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using our voice instructions. These assistants can help us in various ways just through our voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc. Machine learning algorithms are an important part of these assistants. They record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection: Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, there are various ways a fraudulent transaction can take place, such as fake accounts, fake IDs, or money stolen in the middle of a transaction. To detect this, a feed-forward neural network helps us by checking whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern which changes for fraudulent transactions; hence the network detects them and makes our online transactions more secure.
9. Stock Market trading: Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural network is used for the prediction of stock market trends.
10. Medical Diagnosis: In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing very fast and is able to build 3D models that can predict the exact position of lesions in the brain. It helps in finding brain tumors and other brain-related diseases easily.
11. Automatic Language Translation: Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all, as machine learning helps us here too by converting the text into languages we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation. The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is used together with image recognition to translate text from one language to another.
1.6. STATE-OF-THE-ART LANGUAGES/TOOLS IN
MACHINE LEARNING
The algorithms related to different machine learning tasks are well known and can be implemented using any language/platform. They can be implemented on a Java platform, in C/C++, or in .NET. However, there are certain languages and tools which have been developed with a focus on implementing machine learning. A few of them, which are the most widely used, are covered below.
Languages
 Python:
o Dominates the field: Extensive libraries (Scikit-learn, TensorFlow, PyTorch),
large community, versatility.
o Ideal for: General-purpose ML, deep learning, data science.
 R:
o Strong in statistical computing and data visualization: Excellent for
exploratory data analysis and statistical modeling.
o Ideal for: Statistical analysis, data visualization, niche areas.
 Java:
o Robust and scalable: Suitable for large-scale, production-level ML systems.
o Ideal for: Enterprise applications, big data processing.
 C++:
o High performance: Used for computationally intensive tasks and building
high-performance libraries.
o Ideal for: Performance-critical applications, low-level optimizations.
Tools
 Scikit-learn (Python):
o Comprehensive library: Offers a wide range of algorithms for classification,
regression, clustering, and more.
o User-friendly: Easy to use and well-documented.
 TensorFlow (Python/C++):

o Developed by Google: Powerful framework for deep learning, especially for
building and deploying large-scale neural networks.
 PyTorch (Python):
o Dynamic computation graphs: Provides flexibility and ease of use for
research and prototyping.
o Strong in deep learning research: Popular for natural language processing
and computer vision.
 Keras (Python):
o High-level API: Simplifies building and experimenting with neural networks
on top of TensorFlow or other backends.
 Jupyter Notebook:
o Interactive environment: Allows you to write and execute code, visualize
data, and share your work easily.
 AWS SageMaker:
o Cloud-based platform: Provides a suite of tools for building, training, and
deploying machine learning models.
1.7. ISSUES IN MACHINE LEARNING
Machine learning, while a powerful tool, faces several challenges that researchers and
practitioners are actively working to address. Here are some of the key issues:
1. Data Quality and Availability
 Data Scarcity: Many real-world problems lack sufficient labeled data for training
effective models, especially in niche domains or those with limited data collection
resources.
 Data Bias: Training data often reflects existing biases in society, leading to models
that perpetuate and even amplify these biases. This can have serious consequences in
areas like loan applications, hiring processes, and criminal justice.
 Data Privacy: Collecting and using personal data raises significant privacy concerns,
requiring careful consideration of ethical and legal implications.
2. Model Interpretability and Explainability
 Black Box Models: Many complex models, such as deep neural networks, are often
referred to as "black boxes" because their decision-making processes are opaque. This
lack of transparency can hinder trust and make it difficult to understand and debug
model errors.

 Explainable AI (XAI): This emerging field aims to develop techniques that make
machine learning models more interpretable and understandable to humans.
3. Overfitting and Underfitting
 Overfitting: Occurs when a model performs well on the training data but poorly on
unseen data. This happens when the model has learned the training data too well,
including noise and irrelevant details.
 Underfitting: Occurs when a model is too simple to capture the underlying patterns
in the data. This results in poor performance on both training and test data.
4. Computational Cost and Resource Requirements
 Training Time: Training complex models, especially deep learning models, can be
computationally expensive, requiring significant time and resources.
 Hardware Requirements: Advanced hardware, such as GPUs and TPUs, is often
necessary to train large-scale models efficiently.
5. Ethical Considerations
 Job Displacement: Automation powered by machine learning raises concerns about
job displacement in various sectors.
 Misuse of Technology: Machine learning can be used for malicious purposes, such as
creating deepfakes or developing autonomous weapons systems.
 Algorithmic Bias: As mentioned earlier, biased data can lead to biased models,
which can have discriminatory impacts on certain groups.
6. Continuous Learning and Adaptation
 Evolving Data: Real-world data is constantly changing. Models need to be able to
adapt to new data and changing conditions to maintain their effectiveness.
 Concept Drift: The underlying relationships between features and targets may
change over time, requiring models to be retrained or updated periodically.
PREPARING TO MODEL: INTRODUCTION
1.8. MACHINE LEARNING ACTIVITIES
The first step in machine learning activity starts with data. In case of supervised
learning, it is the labelled training data set followed by test data which is not labelled. In case
of unsupervised learning, there is no question of labelled data but the task is to find patterns
in the input data. A thorough review and exploration of the data is needed to understand the
type of the data, the quality of the data and the relationship between the different data elements. Based on that, multiple pre-processing activities may need to be done on the input data before we can go ahead with core machine learning activities. Following are the typical preparation
we can go ahead with core machine learning activities. Following are the typical preparation
activities done once the input data comes into the machine learning system:
 Understand the type of data in the given input data set. Explore the data to understand
the nature and quality.
 Explore the relationships amongst the data elements, e.g. inter-feature relationship.
Find potential issues in data.
 Do the necessary remediation, e.g. impute missing data values, etc., if needed. Apply
pre-processing steps, as necessary.
 Once the data is prepared for modelling, then the learning tasks start off. As a part of
it, do the following activities:
 The input data is first divided into parts – the training data and the test data (called
holdout). This step is applicable for supervised learning only.
Consider different models or learning algorithms for selection. Train the model on the training data for a supervised learning problem and apply it to unknown data; directly apply the chosen unsupervised model to the input data for an unsupervised learning problem. After the model is selected, trained (for supervised learning), and applied to the input data, the performance of the model is evaluated. Based on the options available, specific actions can be taken to improve the performance of the model, if possible. The below figure depicts the four-step process of machine learning.

Fig: Detailed process of Machine Learning

Table: Summary of the steps and activities involved in the machine learning process.

1.9. BASIC TYPES OF DATA IN MACHINE LEARNING


Data is a collection of facts, information, and statistics that can be in various forms,
such as numbers, text, sound, images, or any other format. It is the raw material that, when processed and analyzed, becomes information.
Data is essential for decision-making, problem-solving, and innovation. It helps us
understand trends, identify patterns, and make informed choices. In today's data-driven
world, organizations that can effectively collect, analyze, and utilize data have a significant
competitive advantage. The two main types of data are:
 Qualitative Data
 Quantitative Data

1. Qualitative or Categorical Data
Qualitative or Categorical Data is a type of data that can’t be measured or counted in the form
of numbers. These types of data are sorted by category, not by number. That’s why it is also
known as Categorical Data. These data consist of audio, images, symbols, or text. The gender
of a person, i.e., male, female, or others, is qualitative data. Qualitative data tells about the
perception of people. This data helps market researchers understand the customers’ tastes and
then design their ideas and strategies accordingly.
A. Nominal Data
Nominal Data is used to label variables without any order or quantitative value. The color of
hair can be considered nominal data, as one color can’t be compared with another color.
The name “nominal” comes from the Latin word “nomen,” which means “name.” With nominal data, we cannot perform any numerical operations or impose any order to sort the data. These data do not have any meaningful order; their values are distributed across distinct categories.
Examples of Nominal Data:
 Colour of hair (Blonde, red, Brown, Black, etc.)
 Marital status (Single, Widowed, Married)
 Nationality (Indian, German, American)
 Gender (Male, Female, Others)
 Eye Color (Black, Brown, etc.)
B. Ordinal Data
Ordinal data have natural ordering where a number is present in some kind of order by their
position on the scale. These data are used for observation like customer satisfaction,
happiness, etc., but we can’t do any arithmetical tasks on them. Ordinal data is qualitative
data for which their values have some kind of relative position. These kinds of data can be
considered “in-between” qualitative and quantitative data. Ordinal data only shows sequences and cannot be used for arithmetic or statistical analysis. Compared to nominal data, ordinal data has some kind of order that is not present in nominal data.
Examples of Ordinal Data:
 When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
 Letter grades in the exam (A, B, C, D, etc.)
 Ranking of people in a competition (First, Second, Third, etc.)
 Economic Status (High, Medium, and Low)
 Education Level (Higher, Secondary, Primary)

2. Quantitative Data
Quantitative data is a type of data that can be expressed in numerical values, making it countable and suitable for statistical analysis. These kinds of data are also known as numerical data. It answers questions like “how much,” “how many,” and “how often.”
For example, the price of a phone, the computer’s ram, the height or weight of a person, etc.,
falls under quantitative data.
Quantitative data can be used for statistical manipulation. These data can be represented on a
wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie
charts, line graphs, etc.
Examples of Quantitative Data:
 Height or weight of a person or object
 Room Temperature
 Scores and Marks (Ex: 59, 80, 60, etc.)
 Time
The Quantitative data are further classified into two parts:
A. Discrete Data
The term discrete means distinct or separate. The discrete data contain the values that fall
under integers or whole numbers. The total number of students in a class is an example of
discrete data. These data can’t be broken into decimal or fraction values.
The discrete data are countable and have finite values; their subdivision is not possible. These
data are represented mainly by a bar graph, number line, or frequency table.
Examples of Discrete Data:
 Total numbers of students present in a class
 Cost of a cell phone
 Numbers of employees in a company
 The total number of players who participated in a competition
 Days in a week
B. Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an android
phone, the height of a person, the length of an object, etc. Continuous data represents
information that can be divided into smaller levels. The continuous variable can take any
value within a range.
The key difference between discrete and continuous data is that discrete data contains integer or whole-number values, while continuous data stores fractional numbers to record different types of data such as temperature, height, width, time, speed, etc.
Examples of Continuous Data:
 Height of a person
 Speed of a vehicle
 “Time-taken” to finish the work
 Wi-Fi Frequency
 Market share price
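A short pandas sketch can make these four data types concrete; the column names and values below are invented purely for illustration.

```python
# Representing nominal, ordinal, discrete and continuous data in pandas.
import pandas as pd

df = pd.DataFrame({
    "eye_colour":   ["Brown", "Black", "Brown"],      # nominal (no order)
    "grade":        ["B", "A", "C"],                   # ordinal (ordered categories)
    "num_students": [45, 52, 38],                      # discrete (counts)
    "height_cm":    [162.5, 171.2, 158.9],             # continuous (measurements)
})

df["eye_colour"] = df["eye_colour"].astype("category")
df["grade"] = pd.Categorical(df["grade"], categories=["C", "B", "A"], ordered=True)

print(df.dtypes)           # category, category, int64, float64
print(df["grade"].min())   # ordering makes comparisons such as min() meaningful
```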
1.10. EXPLORING STRUCTURE OF DATA
The approach of exploring numeric data is different than the approach of exploring
categorical data. In case of a standard data set, we may have the data dictionary available for
reference. Data dictionary is a meta data repository, i.e. the repository of all information
related to the structure of each data element contained in the data set. The data dictionary
gives detailed information on each of the attributes– the description as well as the data type
and other relevant details. In case the data dictionary is not available, we need to use standard
library functions of the machine learning tool that we are using to get the details.
Exploring numerical data: Numerical data represents quantitative information that can be
measured and counted. Understanding its characteristics is crucial for effective data analysis
and decision-making. Here's a breakdown of key aspects:
1. Types of Numerical Data
 Continuous: Data that can take on any value within a given range.
o Examples: Height, weight, temperature, time
 Discrete: Data that can only take on specific, distinct values.
o Examples: Number of students, number of cars, count of occurrences
2. Key Characteristics
 Central Tendency: Measures of central tendency describe the "center" or typical
value of the data.
o Mean: The average of all data points.
o Median: The middle value when the data is sorted.
o Mode: The most frequently occurring value.
 Dispersion: Measures of dispersion describe the spread or variability of the data.
o Range: The difference between the maximum and minimum values.
o Variance: The average squared deviation from the mean.
o Standard Deviation: The square root of the variance, providing a measure of
spread in the same units as the data.
o Interquartile Range (IQR): The range of the middle 50% of the data.

 Distribution: The way data values are distributed across the range of possible values.
o Normal Distribution (Gaussian): A bell-shaped curve, with most data points
clustered around the mean.
o Skewed Distribution: Data is not symmetrical, with a longer tail on one side.
o Uniform Distribution: Data is evenly distributed across the range.
3. Exploratory Data Analysis (EDA) Techniques
 Summary Statistics: Calculating measures of central tendency and dispersion.
 Data Visualization:
o Histograms: Visualize the distribution of data.
o Box Plots: Show the median, quartiles, and outliers.
o Scatter Plots: Visualize relationships between two numerical variables.
Example: Consider a dataset of student exam scores.
 Type: Continuous (can take on any value within a certain range)
 Central Tendency:
o Mean: Average score of all students
o Median: Score of the middle-ranking student
o Mode: Most frequent score (if any)
 Dispersion:
o Range: Difference between the highest and lowest scores
o Standard Deviation: How much scores typically deviate from the mean
o Interquartile Range (IQR): Represents the middle 50% of the data; it is less sensitive to outliers than the range.
o Mean Absolute Deviation (MAD): The average of the absolute differences between each data point and the mean.
Box Plots and Histograms: Both box plots and histograms are valuable tools for visualizing
the distribution of numerical data. They provide different insights into the data's central
tendency, spread, and shape.
Histograms
 What they are: Histograms divide the data into bins (intervals) and show the
frequency (or count) of data points falling within each bin.
 What they reveal:
o Shape of the distribution: Symmetrical, skewed (left or right), bimodal, etc.
o Central tendency: Approximate location of the mean and median.
o Spread: How widely the data is distributed.
o Outliers: Extreme values that deviate significantly from the rest of the data.
Box Plots
 What they are: Box plots summarize the distribution of data using five key points:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
 What they reveal:
o Median: The middle value of the dataset.
o Quartiles: Divide the data into four equal parts.
o Interquartile Range (IQR): The range between the first and third quartiles,
representing the middle 50% of the data.
o Outliers: Data points that fall beyond 1.5 times the IQR from the nearest
quartile.
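A minimal EDA sketch tying the pieces above together, assuming Python with pandas and matplotlib; the exam-score values are hypothetical.

```python
# Summary statistics plus the two plots discussed above for a numeric column.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.Series([35, 48, 52, 55, 58, 60, 61, 63, 67, 72, 75, 81, 95])

print(scores.describe())        # count, mean, std, min, quartiles, max
print("Median:", scores.median(),
      "IQR:", scores.quantile(0.75) - scores.quantile(0.25))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(scores, bins=6)             # histogram: shape of the distribution
axes[0].set_title("Histogram")
axes[1].boxplot(scores, vert=False)      # box plot: median, quartiles, outliers
axes[1].set_title("Box plot")
plt.tight_layout()
plt.show()
```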
1.11. DATA QUALITY AND REMEDIATION
Data Quality in machine learning refers to the fitness of data for its intended use. It
encompasses various aspects that determine how suitable the data is for building accurate and
reliable machine learning models.
Dimensions of Data Quality:
 Accuracy: The data is correct and free from errors.
 Completeness: All necessary information is present, with minimal missing values.
 Consistency: Data is formatted and represented uniformly across the dataset.
 Validity: Data conforms to predefined rules and constraints.
 Timeliness: Data is up-to-date and reflects current conditions.
 Uniqueness: There are no duplicate records.
Common Data Quality Issues:
 Missing Values: Empty cells or fields lacking information.
 Inconsistent Data: Variations in formatting, spelling, or units.
 Outliers: Extreme values that deviate significantly from the norm.
 Duplicate Records: Repeated entries that can skew analysis.
 Incorrect Data Types: Data stored in the wrong format (e.g., text instead of
numbers).
Data Remediation in machine learning refers to the process of identifying and correcting
errors, inconsistencies, and inaccuracies within your dataset to improve its quality.
Data Remediation Techniques:
1. Handling Missing Values:
o Imputation: Replacing missing values with estimated values.

 Mean/Median Imputation: Replacing with the average or median
value of the column.
 K-Nearest Neighbors (KNN) Imputation: Predicting missing values
based on similar data points.
 Regression Imputation: Using regression models to predict missing
values.
o Deletion: Removing rows or columns with missing values (use with caution,
as it can lead to data loss).
2. Addressing Inconsistent Data:
o Standardization: Transforming data to a common format (e.g., converting all
units to metric).
o Normalization: Scaling data to a specific range (e.g., between 0 and 1).
o Data Cleaning: Correcting spelling errors, handling typos, and ensuring
consistency in data representation.
3. Identifying and Handling Outliers:
o Visualization: Using box plots, scatter plots, and other visualizations to
identify outliers.
o Statistical Methods: Calculating z-scores or using interquartile range (IQR)
to detect outliers.
o Removal: Removing outliers if they are deemed to be errors or have a
significant impact on the model.
o Transformation: Transforming data using techniques like log transformation
to reduce the impact of outliers.
4. Removing Duplicate Records:
o Using deduplication techniques to identify and remove duplicate rows.
Tools for Data Remediation:
 Data Profiling Tools: Automatically identify and analyse data quality issues.
 Data Cleaning Libraries: Libraries like Pandas (Python) offer functions for handling
missing values, cleaning data, and transforming data.
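The remediation steps above can be sketched with pandas roughly as follows; the small DataFrame, the chosen imputation strategy, and the 1.5 × IQR outlier threshold are illustrative assumptions, not a prescribed recipe.

```python
# Sketch of basic data remediation with pandas on a tiny made-up DataFrame.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, 32, np.nan, 41, 32, 300],                       # a missing value and an outlier
    "city": ["Delhi", "delhi", "Mumbai", "Pune", "delhi", "Pune"],
})

# 1. Handle missing values (median imputation).
df["age"] = df["age"].fillna(df["age"].median())

# 2. Address inconsistent data (standardise the text representation).
df["city"] = df["city"].str.strip().str.title()

# 3. Flag outliers with the IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# 4. Remove duplicate records.
df = df.drop_duplicates()
print(df)
```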

1.12. DATA PRE-PROCESSING


Data pre-processing is the crucial step of transforming raw data into a clean,
organized, and usable format for training machine learning models.

Dimensionality Reduction: In machine learning, dimensionality reduction is a crucial
technique used to reduce the number of features (or dimensions) in a dataset while preserving
as much information as possible. This simplification offers several significant advantages:
Benefits of Dimensionality Reduction:
 Improved Model Performance:
o Reduced Overfitting: By reducing the number of features, we can mitigate
overfitting, where a model performs well on training data but poorly on
unseen data.
o Faster Training: With fewer features, training algorithms can converge more
quickly and efficiently.
 Enhanced Visualization:
o Data Visualization: High-dimensional data is difficult to visualize.
Dimensionality reduction allows us to project data onto lower-dimensional
spaces (often 2D or 3D), making it easier to understand relationships and
patterns.
 Reduced Storage and Computational Costs:
o Storage Efficiency: Storing fewer features requires less storage space.
o Computational Efficiency: Processing fewer features reduces computational
complexity, making models faster to train and deploy.
Common Dimensionality Reduction Techniques:
 Principal Component Analysis (PCA):
o Identifies the principal components (new features) that capture the most
variance in the data.
o Projects the data onto these principal components, reducing the dimensionality
while preserving most of the information.
 Linear Discriminant Analysis (LDA):
o Specifically designed for classification problems.
o Finds linear combinations of features that best separate different classes.
 t-SNE (t-Distributed Stochastic Neighbor Embedding):
o A non-linear technique that excels at visualizing high-dimensional data in low-
dimensional spaces (often 2D).
o Particularly useful for exploring complex data structures and identifying
clusters.
 Feature Selection:

o Involves selecting a subset of the original features based on their importance
or relevance.
o Methods include:
 Filter Methods: Rank features based on their individual scores (e.g.,
correlation with the target variable).
 Wrapper Methods: Evaluate the performance of different feature
subsets using a machine learning model.
 Embedded Methods: Select features during the model training
process itself (e.g., using L1 regularization).
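As a brief illustration of the first technique listed above (PCA), the following scikit-learn sketch reduces the four Iris features to two components; the choice of two components and the use of the Iris data are assumptions made for the demo.

```python
# PCA sketch: project 4-dimensional Iris data onto its 2 principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("Variance explained by the 2 components:", pca.explained_variance_ratio_.sum())
```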

UNIT II
MODELLING AND EVALUATION &
BASICS OF FEATURE ENGINEERING

2.1. MODEL SELECTION


In machine learning, the process of selecting the top model or algorithm from a list of potential
models to address a certain issue is referred to as model selection. It entails assessing and
contrasting various models according to how well they function and choosing the one that
reaches the highest level of accuracy or prediction power.
Because different models have varied levels of complexity, underlying assumptions, and
capabilities, model selection is a crucial stage in the machine-learning pipeline. Finding a model
that fits the training set of data well and generalizes well to new data is the objective. While a
model that is too complex may overfit the data and be unable to generalize, a model that is too
simple could underfit the data and do poorly in terms of prediction.

The following steps are frequently included in the model selection process:
 Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example, classification,
regression, or clustering).
 Candidate model selection: Pick a group of models that are appropriate for the issue at
hand. These models can include straightforward methods like decision trees or linear
regression as well as more sophisticated ones like deep neural networks, random forests,
or support vector machines.
 Performance evaluation: Establish metrics for measuring how well each model performs. Common measurements include accuracy, precision, recall, F1-score, mean squared error, and the area under the receiver operating characteristic curve (AUC-ROC). The type of problem and the particular requirements will determine which metrics are used.
 Training and evaluation: Each candidate model should be trained using a subset of the
available data (the training set), and its performance should be assessed using a different
subset (the validation set or via cross-validation). The established evaluation measures
are used to gauge the model's effectiveness.

 Model comparison: Evaluate the performance of various models and determine which
one performs best on the validation set. Take into account elements like data handling
capabilities, interpretability, computational difficulty, and accuracy.
 Hyperparameter tuning: Before training, many models require that certain hyperparameters, such as the learning rate, regularisation strength, or the number of hidden layers in a neural network, be configured. Use methods like grid search, random search, and Bayesian optimization to identify the ideal values of these hyperparameters.
 Final model selection: After the models have been analyzed and fine-tuned, pick the
model that performs the best. Then, this model can be used to make predictions based on
fresh, unforeseen data.
Model Selection Techniques
Model selection in machine learning can be done using a variety of methods and tactics. These
methods assist in comparing and assessing many models to determine which is best suited to
solve a certain issue. Here are some methods for selecting models that are frequently used:
 Train-Test Split: With this strategy, the available data is divided into two sets: a training set and a separate test set. The models are trained on the training set and evaluated on the test set using a predetermined evaluation metric. This method offers a quick and easy way to evaluate a model's performance on unseen data.
 Cross-Validation: Cross-validation is a resampling approach that divides the data into multiple groups or folds. Each fold is used in turn as the test set while the remaining folds form the training set, and the model is trained and evaluated on each split separately. Lowering the variance of the evaluation makes it easier to obtain an accurate assessment of the model's performance. Frequently used cross-validation techniques include k-fold, stratified, and leave-one-out cross-validation.
 Grid Search: Hyperparameter tuning is done using the grid search technique. In order
to do this, a grid containing hyperparameter values must be defined, and all potential
hyperparameter combinations must be thoroughly searched. For each combination, the
models are trained, assessed, and their performances are contrasted. Finding the ideal
hyperparameter settings to optimize the model's performance is made easier by grid
search.
 Random Search: A set distribution for hyperparameter values is sampled at random as
part of the random search hyperparameter tuning technique. In contrast to grid search,
which considers every potential combination, random search only investigates a portion of the hyperparameter space. This strategy can be helpful when an exhaustive search is not feasible due to the size of the search space.
 Bayesian optimization: Bayesian optimization is a more sophisticated method of hyperparameter tuning. It models the relationship between the hyperparameters and the model's performance using a probabilistic model. It intelligently chooses which set of hyperparameters to investigate next by updating the probabilistic model and iteratively assessing the model's performance. Bayesian optimization is especially effective when the search space is large and expensive to explore.
 Model averaging: This technique combines forecasts from various models to get a
single prediction. For regression issues, this can be accomplished by averaging the
predictions, while for classification problems, voting or weighted voting systems can be
used. Model averaging can increase overall prediction accuracy by lowering the bias and
variation of individual models.
 Information Criteria: Information criteria offer a numerical assessment of the trade-off
between model complexity and goodness of fit. Examples include the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These
criteria discourage the use of too complicated models and encourage the adoption of
simpler models that adequately explain the data.
 Domain Expertise & Prior Knowledge: Prior understanding of the problem and the
data, as well as domain expertise, can have a significant impact on model choice. The
models that are more suitable given the specifics of the problem and the details of the
data may be known by subject matter experts.
 Model Performance Comparison: Using the right assessment measures, it is vital to
evaluate the performance of various models. Depending on the issue at hand, these
measurements could include F1-score, mean squared error, accuracy, precision, recall, or
the area beneath the receiver's operating characteristic curve (AUC-ROC). The best-
performing model can be found by comparing many models.
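Two of the techniques above, cross-validation and grid search, can be sketched with scikit-learn as follows; the SVC model, the Iris data, and the parameter-grid values are arbitrary illustrative choices.

```python
# Cross-validation of one candidate model, then grid search over its hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation of a single candidate model.
print("CV accuracy:", cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())

# Grid search: exhaustively try hyperparameter combinations with inner cross-validation.
grid = GridSearchCV(SVC(),
                    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print("Best parameters:", grid.best_params_, "Best CV score:", grid.best_score_)
```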
2.2. TRAINING A MODEL (FOR SUPERVISED LEARNING)
Holdout method
In case of supervised learning, a model is trained using the labelled input data. However, how
can we understand the performance of the model? The test data may not be available
immediately. Also, the label value of the test data is not known. That is the reason why a part of
the input data is held back (that is how the name holdout originates) for evaluation of the model.

This subset of the input data is used as the test data for evaluating the performance of a trained
model. In general 70%–80% of the input data (which is obviously labelled) is used for model
training. The remaining 20%–30% is used as test data for validation of the performance of the
model. However, a different proportion of dividing the input data into training and test data is
also acceptable. To make sure that the data in both the buckets are similar in nature, the division
is done randomly. Random numbers are used to assign data items to the partitions. This method
of partitioning the input data into two parts–training and test data (depicted in below figure),
which is by holding back a part of the input data for validating the trained model is known as
holdout method.

Fig: Holdout method


Once the model is trained using the training data, the labels of the test data are predicted
using the model’s target function. Then the predicted value is compared with the actual value of
the label. This is possible because the test data is a part of the input data with known labels. The
performance of the model is in general measured by the accuracy of prediction of the label
value.
In certain cases, the input data is partitioned into three portions – training and a test data,
and a third validation data. The validation data is used in place of test data, for measuring the
model performance. It is used in iterations and to refine the model in each iteration. The test
data is used only for once, after the model is refined and finalized, to measure and report the
final performance of the model as a reference for future learning efforts.
An obvious problem in this method is that the division of data of different classes into
the training and test data may not be proportionate. This situation is worse if the overall
percentage of data related to certain classes is much less compared to other classes. This may happen despite the fact that random sampling is employed for test data selection. This problem can be addressed to some extent by applying stratified random sampling in place of simple random sampling. In
case of stratified random sampling, the whole data is broken into several homogenous groups or
strata and a random sample is selected from each such stratum. This ensures that the generated
random partitions have equal proportions of each class.
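A minimal holdout sketch with scikit-learn, assuming a 70/30 split and stratified sampling as discussed above; the Iris dataset is only a stand-in for real labelled input data.

```python
# Holdout method: hold back 30% of the labelled data as test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both partitions
# (stratified random sampling).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print("Training size:", len(X_train), "Test size:", len(X_test))
```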
K-Fold Cross-Validation: K-fold cross-validation is a powerful technique used to evaluate the
performance of machine learning models. It helps to assess how well a model will generalize to
unseen data and provides a more robust estimate of its performance compared to a simple train-
test split.
How it works:
1. Split the dataset:
o Divide the dataset into k equal-sized subsets or "folds."
o The value of k is typically between 5 and 10.
2. Iterate:
o For each fold:
 Use k-1 folds as the training set.
 Use the remaining fold as the test set.
 Train the model on the training set and evaluate its performance on the
test set.
3. Calculate average performance:
o Calculate the average performance metric (e.g., accuracy, F1-score, RMSE)
across all k iterations.
o This average provides a more reliable estimate of the model's true performance.
Advantages of K-Fold Cross-Validation:
 Reduced Bias: By using different subsets for training and testing, K-fold cross-
validation helps to reduce the bias introduced by a single train-test split.
 Improved Generalization: Provides a more accurate estimate of how the model will
perform on unseen data.
 More Efficient Use of Data: All data points are eventually used for both training and
testing.
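The rotation of folds can be written out explicitly with scikit-learn's KFold, as in the sketch below; k = 5, the logistic-regression model, and the Iris data are illustrative assumptions.

```python
# Explicit k-fold cross-validation loop (k = 5).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                    # train on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))     # test on the held-out fold

print("Fold accuracies:", np.round(scores, 3))
print("Average accuracy:", np.mean(scores))
```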

Fig: Standard k-fold cross-validation
Bootstrap sampling
Bootstrap sampling is a powerful statistical technique that involves repeatedly drawing samples
with replacement from an original dataset to estimate the sampling distribution of a statistic.
How it Works:
1. Create Bootstrap Samples:
o Repeatedly draw samples (with replacement) from the original dataset.
o The size of each bootstrap sample is typically the same as the size of the original
dataset.
o Create a large number of bootstrap samples (e.g., 1,000 or 10,000).
2. Calculate Statistic for Each Sample:
o Calculate the desired statistic (e.g., mean, standard deviation, model accuracy)
for each bootstrap sample.
3. Estimate Sampling Distribution:
o The distribution of the statistic across all bootstrap samples provides an estimate
of the sampling distribution of that statistic.
Applications of Bootstrap Sampling:
 Estimating Standard Errors: Estimating the standard error of a statistic.
 Constructing Confidence Intervals: Building confidence intervals for population
parameters.
 Model Evaluation: Assessing the variability and uncertainty of model predictions.
 Feature Importance: Estimating the importance of different features in a model.
Advantages of Bootstrap Sampling:
 Versatility: Can be applied to a wide range of statistics and models.
 Simplicity: Relatively easy to implement.
 No Assumptions: Does not require assumptions about the underlying data distribution.
Limitations:
 Computational Cost: Can be computationally expensive for large datasets.
 Potential Bias: In some cases, bootstrap estimates may have slight biases.

Fig: Bootstrap sampling
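The procedure can be sketched with NumPy alone; the normally distributed original sample, the 10,000 bootstrap replicates and the mean as the statistic of interest are assumptions made for this illustration.

# Bootstrap estimate of the sampling distribution of the mean (illustrative sketch).
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)    # original sample (assumed)

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a sample of the same size as the original data, with replacement.
    sample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = sample.mean()

print("bootstrap standard error of the mean:", boot_means.std(ddof=1))
print("95% confidence interval:", np.percentile(boot_means, [2.5, 97.5]))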

Lazy vs. Eager learner

1. Model Construction: Lazy learning defers model construction until a prediction is needed, whereas eager learning builds a generalized model during the training phase.
2. Training Phase: In lazy learning, the training phase primarily involves storing the training data; in eager learning, it involves analysing the entire training dataset to create a model that captures the underlying patterns.
3. Prediction Phase: A lazy learner analyses the stored data to find the most relevant information for making a prediction on the new, unseen data point; an eager learner uses the pre-built model to make predictions on new data.
4. Examples: Lazy learning – k-Nearest Neighbours (k-NN), Case-Based Reasoning (CBR); Eager learning – Decision Trees, Support Vector Machines (SVM), Neural Networks.

2.3. MODEL REPRESENTATION AND INTERPRETABILITY
The main goal of each machine learning model is to generalize well. Here
generalization defines the ability of an ML model to provide a suitable output by adapting the
given set of unknown input. It means after providing training on the dataset, it can produce
reliable and accurate output. Hence, the underfitting and overfitting are the two terms that need
to be checked for the performance of the model and whether the model is generalizing well or
not.
Underfitting: The model is said to be underfitting when it cannot capture the underlying trend of the
data; it performs poorly on the training data itself and, consequently, on the test data as well.
Ex: Fitting a straight line (a linear model) to data that clearly follows a non-linear pattern.
Reasons:
1. The model is too simple.
2. The training data contains noise.
3. The size of the training data is not enough.
Techniques to reduce underfitting:
 Increase model complexity.
 Remove noise from the data.
 Increase the duration of training (or the amount of training data).
Overfitting: Overfitting occurs when our machine learning model tries to cover all the data
points, or more than the required data points, present in the given dataset. Because of this, the
model starts capturing the noise and inaccurate values present in the dataset, and all these factors
reduce the efficiency and accuracy of the model. The overfitted model has low bias and high
variance. The chances of overfitting increase the more we train our model.
How to avoid the Overfitting in Model: Both overfitting and underfitting cause the degraded
performance of the machine learning model. But the main cause is overfitting, so there are some
ways by which we can reduce the occurrence of overfitting in our model.
 Cross-Validation
 Training with more data
 Removing features
 Early stopping the training
 Regularization
 Ensembling

Bias: [Training Error]
 It is essentially the error rate on the training data.
 When the training error rate is high, the bias is high.
 When the training error rate is low, the bias is low.
Error due to bias: a high-bias model underfits the data.
Variance: [Testing Error]
 It is the difference between the error rate on the training data and the error rate on the testing data.
 If the difference between the errors is high, it is called high variance.
 If the difference between the errors is low, it is called low variance.
Error due to variance: a high-variance model overfits the data.

2.4. EVALUATING PERFORMANCE OF A MODEL
i).Supervised Learning-classification
In supervised learning, one major task is classification. The responsibility of the
classification model is to assign class label to the target feature based on the value of the
predictor features. For example, in the problem of predicting the win/loss in a cricket match, the
classifier will assign a class value win/loss to target feature based on the values of other features
like whether the team won the toss, number of spinners in the team, number of wins the team
had in the tournament, etc. To evaluate the performance of the model, the number of correct
classifications or predictions made by the model has to be recorded. The following four
possibilities are used to evaluate the performance:
1. True Positive (TP): The model predicted win and the team won
2. False Positive (FP): The model predicted win and the team lost
3. False Negative (FN): The model predicted loss and the team won
4. True Negative (TN): The model predicted loss and the team lost

Fig: Details of model classification

A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs and
TNs is known as confusion matrix.
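As a small sketch (assuming scikit-learn and a hypothetical set of win/loss labels encoded as 1 for win and 0 for loss), the four counts and the resulting accuracy can be computed as follows.

# Confusion-matrix counts and accuracy for hypothetical win/loss predictions.
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]   # 1 = win, 0 = loss (made-up labels)
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
print("accuracy:", accuracy_score(actual, predicted))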
Receiver operating characteristic (ROC) curves:
As we have seen till now, though accuracy is the most popular measure, there are quite a
number of other measures to evaluate the performance of a supervised learning model.
However, visualization is an easier and more effective way to understand the model
performance. It also helps in comparing the efficiency of two models. Receiver Operating
Characteristic (ROC) curve helps in visualizing the performance of a classification model. It
shows the efficiency of a model in the detection of true positives while avoiding the occurrence
of false positives.
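A minimal sketch of computing the points of a ROC curve and the area under it (AUC), assuming scikit-learn; the breast-cancer data set, the scaling step and the logistic regression model are illustrative choices only.

# ROC curve points and AUC for a probabilistic classifier (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))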

ii). Supervised Learning – Regression:


A well-fitted regression model churns out predicted values close to actual values. Hence, a
regression model which ensures that the difference between predicted and actual values is low
can be considered as a good model. Consider a very simple problem of real estate value prediction
solved using a linear regression model. If ‘area’ is the predictor variable (say x) and ‘value’ is the
target variable (say y), the linear regression model can be represented in the form:

ŷ = a + b·x

where a is the intercept and b is the slope of the fitted straight line.
For a certain value of x, say x̂, the value of y is predicted as ŷ, whereas the actual value of
y is Y (say). The distance between the actual value Y and the fitted or predicted value ŷ is
known as the residual. The regression model can be considered to be well fitted if the difference
between the actual and predicted values, i.e. the residual, is small.
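A minimal sketch of fitting such a model and inspecting its residuals, assuming scikit-learn and a small made-up area/value data set.

# Linear regression on a toy area-vs-value data set; residual = actual - predicted.
import numpy as np
from sklearn.linear_model import LinearRegression

area  = np.array([[1000], [1500], [1800], [2400], [3000]])   # predictor x (assumed values)
value = np.array([200, 290, 350, 480, 600])                  # target y (assumed values)

model = LinearRegression().fit(area, value)
predicted = model.predict(area)
residuals = value - predicted

print("residuals:", residuals)
print("RMSE:", np.sqrt(np.mean(residuals ** 2)))   # smaller means a better fit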
iii).Unsupervised learning - clustering
Clustering algorithms try to reveal natural groupings amongst the data sets. However, it is quite
tricky to evaluate the performance of a clustering algorithm. Clustering, by nature, is very
subjective and whether a cluster is good or bad is open to interpretation. It has been noted that
‘clustering is in the eye of the beholder’. There are a couple of popular approaches adopted for
cluster quality evaluation.
a). Internal evaluation: It measures cluster quality based on the homogeneity of data belonging
to the same cluster and the heterogeneity of data belonging to different clusters. For a data set
clustered into ‘k’ clusters, the silhouette width of a data instance x is calculated as:

silhouette(x) = (b(x) − a(x)) / max(a(x), b(x))

where a(x) is the average distance of x from the other points in its own cluster and b(x) is the
average distance of x from the points of the nearest neighbouring cluster.

b). External evaluation: In this approach, the class label is known for the data set subjected to
clustering. However, quite obviously, the known class labels are not a part of the data used in
clustering. The clustering algorithm is assessed based on how close its results are to those known
class labels. For example, purity is one of the most popular measures for cluster algorithms – it
evaluates the extent to which clusters contain a single class. For a data set having ‘n’ data
instances and ‘c’ known class labels which generates ‘k’ clusters, purity is measured as:

purity = (1/n) × Σ over the k clusters of (the number of data instances of the dominant class in that cluster)

Fig: Silhouette width calculation
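An internal-evaluation sketch using the average silhouette width, assuming scikit-learn; the synthetic blob data and k-means with k = 3 are illustrative choices.

# Average silhouette width of a k-means clustering (illustrative sketch).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # synthetic data (assumed)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("average silhouette width:", silhouette_score(X, labels))   # closer to 1 is better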

2.5. IMPROVING PERFORMANCE OF A MODEL


Model parameter tuning is the process of adjusting the model fitting options.
Ensembling helps in averaging out the biases of the different underlying models and also in reducing
the variance. Ensemble methods combine weaker learners to create stronger ones. A performance
boost can be expected even if the models are built as usual and then ensembled.
An ensemble is created by first building a number of models based on the training data.
For diversifying the models generated, the training data subset can be varied using the allocation
function. Sampling techniques like bootstrapping may be used to generate unique training data
sets. Alternatively, the same training data may be used but the models combined are quite
varying, e.g, SVM, neural network, k-NN, etc.
The outputs from the different models are combined using a combination function. A
very simple combination strategy, say in the case of a prediction task using an ensemble, can be
majority voting of the different models combined. For example, if 3 out of 5 models predict ‘win’
and 2 predict ‘loss’, then the final outcome of the ensemble using majority vote would be
‘win’.

One of the earliest and most popular ensemble models is bootstrap aggregating or
bagging. Bagging uses bootstrap sampling method to generate multiple training data sets. These
training data sets are used to generate (or train) a set of models using the same learning
algorithm. Then the outcomes of the models are combined by majority voting (classification) or
by average (regression). Bagging is a very simple ensemble technique which can perform really
well for unstable learners like a decision tree, in which a slight change in data can impact the
outcome of a model significantly. Just like bagging, boosting is another key ensemble-based
technique.
In this type of ensemble, weaker learning models are trained on resampled data and the
outcomes are combined using a weighted voting approach based on the performance of the different
models. Adaptive boosting, or AdaBoost, is a special variant of the boosting algorithm. It is based on
the idea of generating weak learners sequentially, with each new learner concentrating on the
training examples that the previous learners found difficult. Random forest is another ensemble-
based technique. It is an ensemble of decision trees – hence the name random forest, to indicate a
forest of decision trees.
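A small sketch comparing a single decision tree with bagging and boosting (AdaBoost) ensembles, assuming scikit-learn; the breast-cancer data set, 50 estimators, 5-fold scoring and the default base learners of the two ensemble classes are illustrative choices.

# Single tree vs. bagging vs. AdaBoost (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

models = [
    ("single decision tree", DecisionTreeClassifier(random_state=42)),
    ("bagging (default tree base learner)", BaggingClassifier(n_estimators=50, random_state=42)),
    ("AdaBoost (default stump base learner)", AdaBoostClassifier(n_estimators=50, random_state=42)),
]

for name, model in models:
    print(name, cross_val_score(model, X, y, cv=5).mean())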

2.6. BASICS OF FEATURE ENGINEERING – INTRODUCTION
Generally, all machine learning algorithms take input data to generate an output. The input
data is usually in a tabular form consisting of rows (instances or observations) and columns
(variables or attributes), and these attributes are often known as features. For example, an image
is an instance in computer vision, but a line in the image could be a feature. Similarly, in NLP,
a document can be an observation, and the word count could be a feature. So, we can say a
feature is an attribute that impacts a problem or is useful for the problem.
What is Feature Engineering?
Feature engineering is the pre-processing step of machine learning, which extracts
features from raw data. It helps to represent an underlying problem to predictive models in
better ways, which, as a result, improves the accuracy of the model on unseen data. The
predictive model contains predictor variables and an outcome variable, and the feature
engineering process selects the most useful predictor variables for the model.

Since 2016, automated feature engineering is also used in different machine learning
software that helps in automatically extracting features from raw data. Feature engineering in
ML contains mainly four processes: Feature Creation, Transformations, Feature Extraction, and
Feature Selection.
These processes are described as below:
Feature Creation: Feature creation is finding the most useful variables to be used in a
predictive model. The process is subjective, and it requires human creativity and intervention.
New features are created by combining existing features using operations such as addition,
subtraction, and ratio, and these new features offer great flexibility.
Transformations: The transformation step of feature engineering involves adjusting the
predictor variable to improve the accuracy and performance of the model. For example, it

ensures that the model is flexible to take input of the variety of data; it ensures that all the
variables are on the same scale, making the model easier to understand. It improves the model's
accuracy and ensures that all the features are within the acceptable range to avoid any
computational error.
Feature Extraction: Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data. The main aim of this step is to
reduce the volume of data so that it can be easily used and managed for data modelling. Feature
extraction methods include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
Feature Selection: While developing the machine learning model, only a few variables in the
dataset are useful for building the model, and the rest features are either redundant or irrelevant.
If we input the dataset with all these redundant and irrelevant features, it may negatively impact
and reduce the overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove the irrelevant or less
important features, which is done with the help of feature selection in machine
learning. "Feature selection is a way of selecting the subset of the most relevant features from
the original features set by removing the redundant, irrelevant, or noisy features."
Below are some benefits of using feature selection in machine learning:
 It helps in avoiding the curse of dimensionality.
 It helps in the simplification of the model so that the researchers can easily interpret it.
 It reduces the training time.
 It reduces overfitting hence enhancing the generalization.
Need for Feature Engineering in Machine Learning
In machine learning, the performance of the model depends on data pre-processing and data
handling. But if we create a model without pre-processing or data handling, then it may not give
good accuracy. Whereas, if we apply feature engineering on the same model, then the accuracy
of the model is enhanced. Hence, feature engineering in machine learning improves the model's
performance. Below are some points that explain the need for feature engineering:
 Better features mean flexibility.
 Better features mean simpler models.
 Better features mean better results.

2.7. FEATURE TRANSFORMATION
Feature Transformation: It transforms structured or unstructured data into a new set of
features which can represent the underlying problem that machine learning is trying to solve.
Feature Transformation Types:
1. Feature construction.
2. Feature extraction.
1). Feature construction: It expands the feature space by creating additional features from the existing ones.
Ex: If the data set has n original features and m new features are constructed, the final feature set contains n + m features.

Feature construction types:

1. Encoding categorical (nominal) values: changing nominal categorical values into numerical values (e.g. dummy/one-hot encoding).

2. Encoding ordinal values: changing ordered categorical values (e.g. low, medium, high) into numeric values that preserve the order.

3. Transforming numeric to categorical: changing numerical values into categorical values (e.g. binning a numeric ‘age’ attribute into age groups).

2). Feature Extraction: New features are created from combinations of the original features.
Types:
1. PCA (Principal Component Analysis)
2. SVD (Singular Value Decomposition)
3. LDA (Linear Discriminant Analysis)
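A small sketch of both ideas, assuming pandas and scikit-learn; the three-row city/area/floors data frame is made up for illustration. One-hot encoding of the categorical column is a feature construction step, while PCA is a feature extraction step.

# Feature construction (one-hot encoding) and feature extraction (PCA) - illustrative sketch.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({"city": ["Hyderabad", "Chennai", "Hyderabad"],   # assumed sample data
                   "area": [1200, 1500, 1800],
                   "floors": [2, 3, 2]})

# Feature construction: encode the categorical column as numeric dummy features.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)

# Feature extraction: project the numeric feature set onto 2 principal components.
components = PCA(n_components=2).fit_transform(encoded)
print(components.shape)   # (3, 2) - new features built from combinations of old ones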
2.8. FEATURE SUBSET SELECTION
Feature selection is a way of selecting the subset of the most relevant features from the original
features set by removing the redundant, irrelevant, or noisy features.
Steps for selection:
 Generation of possible feature subsets
 Subset evaluation
 Stopping the search based on some stopping criterion
 Validation of the result

Fig: Feature selection process


Types of Feature selection
1. Filter approach: Features are evaluated and selected using statistical measures computed from the data (e.g. correlation, chi-square), independent of any learning algorithm.

Fig: Filter approach

2. Wrapper Approach:
 It works by using an induction (learning) algorithm as a black box to evaluate candidate feature subsets.

Fig: Wrapper approach


3. Hybrid Approach: It is a combination of the filter approach and the wrapper approach.
4. Embedded Approach: It is similar to the wrapper approach, but it performs feature selection and
model building (classification) simultaneously.

Fig: Embedded approach


Benefits of using feature selection in machine learning:
 It helps in avoiding the curse of dimensionality.
 It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
 It reduces the training time.
 It reduces overfitting, hence enhancing generalization.
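A minimal sketch contrasting the filter and wrapper approaches, assuming scikit-learn; the Iris data, the chi-square score for the filter, and recursive feature elimination (RFE) wrapped around logistic regression are illustrative choices.

# Filter (statistical scores) vs. wrapper (induction algorithm as a black box).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter approach: score each feature independently of any learning algorithm.
filter_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("filter keeps features:", filter_selector.get_support())

# Wrapper approach: use a learner as a black box to search for a good feature subset.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("wrapper keeps features:", wrapper_selector.get_support())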

UNIT III
BAYESIAN CONCEPT LEARNING &
SUPERVISED LEARNING: CLASSIFICATION
3.1. INTRODUCTION
The technique was derived from the work of the 18th century mathematician Thomas Bayes. He
developed the foundational mathematical principles, known as Bayesian methods, which describe the
probability of events. Bayes' ideas underpin decision theory, which is used extensively along with
important mathematical concepts such as probability. Bayes' theorem is also widely used in Machine Learning
where we need to predict classes precisely and accurately. The Bayesian method, an important concept
built on Bayes' theorem, is used to calculate conditional probability in Machine Learning applications that
include classification tasks.
Why Bayesian methods are important?
Bayesian learning algorithms, like the naive Bayes classifier, are highly practical approaches to certain
types of learning problems as they can calculate explicit probabilities for hypotheses. Bayesian classifiers
are as follows:
 Text-based classification such as spam or junk mail filtering, author identification, or topic
categorization
 Medical diagnosis such as given the presence of a set of observed symptoms during a disease,
identifying the probability of new patients having the disease
 Network security such as detecting illegal intrusion or anomaly in computer networks
One of the strengths of Bayesian classifiers is that they utilize all available parameters to subtly
change the predictions, while many other algorithms tend to ignore features that have weak effects.
Bayesian classifiers assume that even if individual parameters have only a small effect on the outcome, the
collective effect of those parameters could be quite large. For such learning tasks, the naive Bayes
classifier is most effective.
3.2. BAYES THEOREM
Bayes' theorem is one of the most popular machine learning concepts. It helps to calculate the
probability of one event occurring, under uncertain knowledge, given that another event has already
occurred. Bayes' theorem can be derived using the product rule and the conditional probability of event
X given event Y:
According to the product rule, we can express the joint probability of events X and Y using the
probability of X given Y as follows:
P(X ∩ Y) = P(X|Y) P(Y) {equation 1}
Similarly, using the probability of Y given X:
P(X ∩ Y) = P(Y|X) P(X) {equation 2}

Mathematically, Bayes' theorem can be expressed by equating the right-hand sides of the two
equations and dividing by P(Y). We get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, P(Y) must be greater than zero; the theorem relates the conditional probability P(X|Y) to the
reverse conditional probability P(Y|X).
The above equation is called Bayes' rule or Bayes' theorem.
 P(X|Y) is called as posterior, which we need to calculate. It is defined as updated
probability after considering the evidence.
 P(Y|X) is called the likelihood. It is the probability of evidence when hypothesis is true.
 P(X) is called the prior probability, probability of hypothesis before considering the
evidence
 P(Y) is called marginal probability. It is defined as the probability of evidence under any
consideration.
Hence, Bayes Theorem can be written as:
Posterior = likelihood * prior / evidence
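As a tiny numeric illustration (the disease prevalence and test accuracy figures below are made-up numbers, not taken from the text), the posterior can be computed directly from the rule.

# Bayes' theorem on hypothetical numbers: P(disease | positive test).
p_disease = 0.01             # prior P(X) (assumed)
p_pos_given_disease = 0.95   # likelihood P(Y|X) (assumed)
p_pos_given_healthy = 0.10   # false-positive rate (assumed)

# Marginal probability of the evidence, P(Y)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior P(X|Y) = likelihood * prior / evidence
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))   # about 0.088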

3.3. BAYES’ THEOREM AND CONCEPT LEARNING


One simplistic view of concept learning can be that if we feed the machine with the
training data, then it can calculate the posterior probability of the hypotheses and outputs the
most probable hypothesis. This is also called brute-force Bayesian learning algorithm, and it is
also observed that consistency in providing the right probable hypothesis by this algorithm is
very comparable to the other algorithms.

Brute-force Bayesian algorithm

Consistent Learners
A learning algorithm is a consistent learner if it commits zero errors over the training examples.
Every consistent learner outputs a MAP hypothesis if we:
 Assume a uniform prior probability distribution over H, and
 Assume deterministic, noise-free training data.
Bayesian perspective can be used to characterize learning algorithms even if they do not
explicitly manipulate probabilities
Bayes optimal classifier
The most probable classification of the new instance can be obtained by combining the
predictions of all hypotheses, weighted by their corresponding posterior probabilities. Denoting
the possible classification of the new instance as ci from the set C, the probability P(ci|T) that the
correct classification for the new instance is ci is

P(ci|T) = Σ over hj in H of P(ci|hj) P(hj|T)

The optimal (Bayes optimal) classification is the class ci for which P(ci|T) is maximum:

argmax over ci in C of Σ over hj in H of P(ci|hj) P(hj|T)

Naïve Bayes classifier:


Naïve Bayes is a simple technique for building classifiers: models that assign class labels to
problem instances. The basic idea of Bayes rule is that the outcome of a hypothesis can be
predicted on the basis of some evidence (E) that can be observed.
From Bayes rule, it is observed that
A prior probability of hypothesis h or P(h): This is the probability of an event or hypothesis
before the evidence is observed.
A posterior probability of h or P(h|D): This is the probability of an event after the evidence is
observed within the population D.

Algorithm:
Step 1: Convert the given data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities of the given features.
Step 3: Now use Bayes' theorem to calculate the posterior probability.

An Illustrative Example
 Let us apply the naive Bayes classifier to a concept learning problem i.e.,classifying days
according to whether someone will play tennis.
 The below table provides a set of 14 training examples of the target concept PlayTennis, where
each day is described by the attributes Outlook, Temperature, Humidity, and Wind.

Step 1: Finding frequency table for the given data set.
Weather YES NO
Rainy 2 2
Sunny 3 2
Overcast 5 0
Total 10 4

Step 2: Finding the likelihood table.


Weather YES NO Probability
Rainy 2 2 4/14=0.29
Sunny 3 2 5/14=0.35
Overcast 5 0 5/14=0.35
Total 10 4 14/14=1
probability 10/14=0.71 4/14=0.29

Step 3: Applying Bayes' Theorem

P(Sunny|Yes) = 3/10 = 0.3; P(Yes) = 0.71; P(Sunny) = 0.35.
P(Yes|Sunny) = (0.3 × 0.71)/(0.35) = 0.61.

P(Sunny|No) = 2/4 = 0.5; P(No) = 0.29; P(Sunny) = 0.35.
P(No|Sunny) = (0.5 × 0.29)/(0.35) = 0.414.
Therefore P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day the player is predicted to play the game.
Advantages:
 It is a fast and easy ML algorithm for predicting the class of a data set.
 It can be used for binary as well as multi-class classification.
 It is a popular choice for text classification problems.
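The example can be reproduced with scikit-learn's CategoricalNB; this is a sketch under the assumptions that the weather values are integer-encoded as below and that an almost-zero smoothing constant is used so the probabilities match the hand calculation.

# Naive Bayes on the 14-day weather data from the example (illustrative sketch).
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Encoding (assumed): Weather 0 = Rainy, 1 = Sunny, 2 = Overcast; Play 0 = No, 1 = Yes.
weather = np.array([[0]] * 4 + [[1]] * 5 + [[2]] * 5)            # 4 Rainy, 5 Sunny, 5 Overcast
play    = np.array([1, 1, 0, 0,  1, 1, 1, 0, 0,  1, 1, 1, 1, 1]) # matches the frequency table

model = CategoricalNB(alpha=1e-10).fit(weather, play)   # near-zero smoothing (assumed)
print(model.predict([[1]]))         # class for a Sunny day -> [1], i.e. 'Yes'
print(model.predict_proba([[1]]))   # posterior probabilities for No and Yes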

3.4. BAYESIAN BELIEF NETWORK


A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph. A Bayesian network graph is
made up of nodes and Arcs (directed links), where:

Each node corresponds to the random variables, and a variable can be continuous or discrete. Arc
or directed arrows represent the causal relationship or conditional probabilities between random
variables. These directed links or arrows connect the pair of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed link
between two nodes, they are independent of each other.
Diverging Connection: In this type of connection, the evidence can be transmitted between two
child nodes of the same parent provided that the parent is not instantiated.

Serial Connection: In this type of connection, any evidence entered at the beginning of the
connection can be transmitted through the directed path provided that no intermediate node on
the path is instantiated.
Converging Connection: In this type of connection, the evidence can only be transmitted
between two parents when the child (converging) node has received some evidence and that
evidence can be soft or hard.

Applications of Bayesian Belief Networks:


 Medical Diagnosis: Bayesian Belief Networks (BBNs) are widely used in medical
diagnosis to model relationships between symptoms, diseases, and risk factors. By
combining expert knowledge with patient data, BBNs help doctors predict the likelihood
of diseases and suggest optimal treatments. For example, a BBN can estimate the
probability of a heart attack based on factors like chest pain, age, and blood pressure. As
new symptoms are observed, the network dynamically updates probabilities to improve
diagnostic accuracy.
 Risk Assessment and Decision-Making: In finance and insurance, BBNs assist in
assessing risks by modeling dependencies among factors such as market volatility,
economic indicators, and credit scores. For example, insurance companies use BBNs to
estimate the probability of policy claims based on customer behavior and risk profiles.
These networks also play a role in investment strategies, helping organizations make
data-driven decisions under uncertainty.
 Machine Learning and Data Mining: BBNs are valuable tools in machine learning and
data mining for discovering hidden patterns within datasets. They are used to predict
outcomes, such as fraud detection in banking, by analyzing dependencies between
variables. In data mining, BBNs help uncover relationships between features, enabling

more effective predictive models. Their ability to learn from both historical data and
expert input makes them essential in fields where decisions depend on complex,
uncertain systems.
3.5. SUPERVISED LEARNING
In supervised learning, the labelled training data provide the basis for learning. According
to the definition of machine learning, this labelled training data is the experience or prior
knowledge or belief. It is called supervised learning because the process of learning from the
training data by a machine can be related to a teacher supervising the learning process of a
student who is new to the subject. Here, the teacher is the training data. Training data is the past
information with known value of class field or ‘label’. Hence, we say that the ‘training data is
labelled’ in the case of supervised learning.
Examples of Supervised Learning:
 Training a model to distinguish between images of cats and dogs.
 Training a model to filter out unwanted emails.
 Predicting the price of a particular stock tomorrow.
 Predicting the likelihood of a patient having a certain disease based on their medical
history and test results.
 Predicting which customers are at risk of cancelling their subscription.
3.6. CLASSIFICATION MODEL
Classification is a type of supervised learning where a target feature, which is of
categorical type, is predicted for test data on the basis of the information imparted by the training
data. The target categorical feature is known as class.
Some typical classification problems include the following:
 Hand writing recognition
 Image classification
 Disease prediction
 Win–loss prediction of games
 Prediction of natural calamities such as earthquake, flood, etc.
In classification, the whole problem centres on assigning a label or category or class to a
test data on the basis of the label or category or class information that is imparted by the training
data. Because the target objective is to assign a class label, we call this type of problem as a
classification problem. Below figure depicts the typical process of classification, where a
classification model is obtained from the labelled training data by a classifier algorithm. On the

basis of the model, a class label (e.g. ‘Intel’ as in the case of the test data referred in below fig) is
assigned to the test data.

Classification Model
Classification Learning Steps:

Problem Identification: Identifying the problem is the first step in the supervised learning
model. The problem needs to be a well-formed problem, i.e. a problem with well-defined goals
and benefits, and one that has a long-term impact.
Identification of Required Data: On the basis of the problem identified above, the required
data set that precisely represents the identified problem needs to be identified/ evaluated. For
example: If the problem is to predict whether a tumour is malignant or benign, then the
corresponding patient data sets related to malignant tumour and benign tumours are to be
identified.
Data Pre-processing: This is related to the cleaning/ transforming the data set. This step ensures
that all the unnecessary/ irrelevant data elements are removed. Data pre-processing refers to the
transformations applied to the identified data before feeding the same into the algorithm.
Because the data is gathered from different sources, it is usually collected in a raw format and is
not ready for immediate analysis. This step ensures that the data is ready to be fed into the
machine learning algorithm.
Definition of Training Data Set: Before starting the analysis, the user should decide what kind
of data set is to be used as a training set. In the case of signature analysis, for example, the
training data set might be a single handwritten alphabet, an entire handwritten word (i.e. a group
of the alphabets) or an entire line of handwriting (i.e. sentences or a group of words).
Thus, a set of ‘input meta-objects’ and corresponding ‘output meta-objects’ is also
gathered. The training set needs to be representative of the real-world use of the given
scenario. Thus, a set of data inputs (X) and corresponding outputs (Y) is gathered either from
human experts or experiments.
Algorithm Selection: This involves determining the structure of the learning function and the
corresponding learning algorithm. This is the most critical step of supervised learning model. On
the basis of various parameters, the best algorithm for a given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the gathered training
set for further fine tuning. Some supervised learning algorithms require the user to determine
specific control parameters (which are given as inputs to the algorithm). These parameters
(inputs given to algorithm) may also be adjusted by optimizing performance on a subset (called
as validation set) of the training set.
Evaluation with the Test Data Set: The test data is run on the trained model, and its performance
is measured here. If a suitable result is not obtained, further training with adjusted parameters may be
required.

3.7. K-Nearest Neighbor (KNN) Algorithm
K-Nearest Neighbor (KNN) Algorithm is a type of supervised machine learning
algorithm. It is mainly used for classification predictive problems. It is also called a lazy learner or
non-parametric learner.
KNN Working:
It uses feature similarity to predict the values of new data points, which means that a new
data point is assigned a value based on how closely it matches the points in the training set.
Step 1: Load the training and testing data set.
Step 2: Choose the values of K ie; Nearest neighbour point, K can be any Integer.
Step 3: For each data point in the test data do the following.
I. Calculate the distance between test data and each row of training data with the
help of Euclidian distance.
The Euclidean distance formula is given by:

d = √[(x₂ – x₁)² + (y₂ – y₁)²]

Where, “d” is the Euclidean distance

(x₁, y₁) is the coordinate of the first point

(x₂, y₂) is the coordinate of the second point.


II. Sort the training points in ascending order based on the distance values.
III. Choose the top K rows from the sorted array.
IV. Assign a class to the test point based on the most frequent class among these K rows.
Step 4: end.

KNN Algorithm:
Step 1: Input the training and testing data set.
Step 2: Initialize K-Value
Step 3: Calculate the Euclidean distance.
Step 4: Assign class of nearest sample.
Why is KNN called a Lazy Learner?
KNN skips the abstraction step: it simply stores the training data and directly applies the philosophy of
nearest-neighbourhood finding to arrive at the classification. Hence, KNN does not build (learn) a model
during the training phase; the work is deferred until prediction time. So it is called a lazy learner.
Advantages of KNN Algorithm:
 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
 Always needs to determine the value of K which may be complex some time.
 The computation cost is high because of calculating the distance between the data points
for all the training samples.
Applications of the KNN Algorithm
Here are some real life applications of KNN Algorithm.
 Recommendation Systems: Many recommendation systems, such as those used by Netflix
or Amazon, rely on KNN to suggest products or content. KNN looks at user behaviour
and finds similar users. If user A and user B have similar preferences, KNN might
recommend movies that user A liked to user B.
 Spam Detection: KNN is widely used in filtering spam emails. By comparing the features
of a new email with those of previously labeled spam and non-spam emails, KNN can
predict whether a new email is spam or not.
 Customer Segmentation: In marketing firms, KNN is used to segment customers based on
their purchasing behavior. By comparing new customers to existing customers, KNN can
easily group customers into segments with similar choices and preferences. This helps
businesses target the right customers with right products or advertisements.
 Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken
words into text. The algorithm compares the features of the spoken input with those of
known speech patterns. It then predicts the most likely word or command based on the
closest matches.

Example: Given input data set
S.No Maths Science Result
1 4 3 F
2 6 7 P
3 7 8 P
4 5 5 F
5 8 8 P

Problem: Find if Maths=6 and Science=8; then classification should be Pass or Fail, using KNN
Algorithm.
Sol:
Step 1: Assume K = 3 (nearest neighbours).
Step 2: Compute the Euclidean distance of the test point (6, 8) from each training point:
D1 = √[(6−4)² + (8−3)²] = √29 ≈ 5.39, D2 = 1.00, D3 = 1.00, D4 = √10 ≈ 3.16, D5 = 2.00.
Step 3: Choose the K = 3 nearest neighbours, i.e. D2, D3 and D5.
Step 4: D2, D3 and D5 all have the classification ‘Pass’.
Hence, if Maths = 6 and Science = 8, the classification result using KNN is ‘Pass’.
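The same result can be checked with scikit-learn's KNeighborsClassifier, assuming plain Euclidean distance and k = 3 as in the hand calculation.

# KNN on the Maths/Science example with k = 3 (illustrative sketch).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]]   # (Maths, Science) pairs from the table
y_train = ["F", "P", "P", "F", "P"]

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance is the default metric
knn.fit(X_train, y_train)

print(knn.predict([[6, 8]]))   # -> ['P'], matching the hand calculation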

3.8. DECISION TREE


Decision tree learning is one of the most widely adopted algorithms for classification. As
the name indicates, it builds a model in the form of a tree structure. Its classification accuracy is
competitive with other methods, and it is very efficient. A decision tree is used for
multi-dimensional analysis with multiple classes. It is characterized by fast execution time and
ease of interpretation of the rules. The goal of decision tree learning is to create a model that
predicts the value of the output variable based on the input variables in the feature vector.
Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector.
From every node, there are edges to children, wherein there is an edge for each of the possible
values (or range of values) of the feature associated with the node. The tree terminates at
different leaf nodes (or terminal nodes), where each leaf node represents a possible value for the
output variable. The output variable is determined by following a path that starts at the root and
is guided by the values of the input variables.
A decision tree is usually represented in the format depicted in below Figure.

Fig: Decision tree structure


In the Decision Tree, the major challenge is the identification of the attribute for the root
node at each level. This process is known as attribute selection. We have two popular attribute
selection measures:
 Information Gain
 Gini Index
a). Information Gain: When we use a node in a decision tree to partition the training instances
into smaller subsets the entropy changes. Information gain is a measure of this change in
entropy.
Suppose S is a set of instances, A is an attribute, Sv is the subset of S for which attribute A takes the
value v, and Values(A) is the set of all possible values of A. Then

Gain(S, A) = Entropy(S) − Σ over v in Values(A) of (|Sv| / |S|) × Entropy(Sv)
Entropy: Entropy is the measure of uncertainty of a random variable; it characterizes the


impurity of an arbitrary collection of examples. The higher the entropy, the more the information
content.
If S is a set of instances containing c classes and pi is the proportion of instances in S belonging to
class i, then

Entropy(S) = Σ from i = 1 to c of −pi log₂ pi
Example:
For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(X) = −(3/8) log₂(3/8) − (5/8) log₂(5/8) ≈ 0.954
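The same calculation as a small Python sketch (standard library only).

# Entropy of the example set X = {a, a, a, b, b, b, b, b}.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([3, 5]))   # about 0.954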

b). Gini Index: Gini index is a measure of impurity or purity used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm. An attribute with the low Gini
index should be preferred as compared to the high Gini index. It only creates binary splits, and
the CART algorithm uses the Gini index to create binary splits. Gini index can be calculated
using the below formula:
Gini Index = 1 − Σj pj²
Avoiding over fitting in decision tree–pruning
The decision tree algorithm, unless a stopping criterion is applied, may keep growing indefinitely,
splitting on every feature and dividing the data into smaller partitions till the point that the data is
perfectly classified. This, quite evidently, results in an overfitting problem. To prevent a decision
tree from getting overfitted to the training data, pruning of the decision tree is essential. Pruning a
decision tree reduces the size of the tree such that the model is more generalized and can classify
unknown and unlabelled data in a better way. There are two approaches to pruning:
Pre-pruning: Stop growing the tree before it reaches perfection.
Post-pruning: Allow the tree to grow entirely and then post-prune some of the branches from it.
Advantages of the Decision Tree
 It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
 It can be very useful for solving decision-related problems.
 It helps to think about all the possible outcomes for a problem.
 There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
 The decision tree contains lots of layers, which makes it complex.
 It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.

 For more class labels, the computational complexity of the decision tree may increase.

3.9. RANDOM FOREST MODEL


Random forest is an ensemble classifier, i.e. a combining classifier that uses and
combines many decision tree classifiers. Ensembling is usually done using the concept of
bagging with different feature sets. The reason for using large number of trees in random forest
is to train the trees enough such that contribution from each feature comes in a number of
models. After the random forest is generated by combining the trees, majority vote is applied to
combine the output of the different trees. A simplified random forest model is depicted in below
Figure. The result from the ensemble model is usually better than that from the individual
decision tree models.

Fig: Random forest model


How does random forest work?
Step 1: Select K random data points from the training data set.
Step 2: Build a decision tree associated with the randomly selected data points.
Step 3: Choose the number N of decision trees that we want to build.
Step 4: Repeat Step 1 and Step 2 until N trees are built.
Step 5: For a new data point, find the prediction of each decision tree and assign the new data point
to the category that wins the majority vote.

Out-of-bag (OOB) error in random forest


In random forests, each tree is constructed using a different bootstrap sample drawn from
the original data. The samples left out of the bootstrap, and hence not used in the construction of the
i-th tree, can be used to measure the performance of that tree. At the end of the run, the predictions
made for each such left-out sample are tallied, and the final prediction for that sample is
obtained by taking a vote. The total error rate of predictions for such samples is termed the out-of-
bag (OOB) error rate.
The error rate shown in the confusion matrix reflects the OOB error rate. Because of
this, the error rate displayed is often surprisingly high.
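A minimal sketch of a random forest with the OOB estimate enabled, assuming scikit-learn; the breast-cancer data set and the choice of 200 trees are illustrative.

# Random forest with the out-of-bag (OOB) error estimate (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("OOB error rate:", 1 - forest.oob_score_)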
Strengths of random forest:
 It runs efficiently on large and expansive datasets.
 It has a robust method for estimating missing data and maintains precision when a large
proportion of the data is absent.
 It has powerful techniques for balancing errors in a class population of unbalanced
datasets.
 It gives estimates (or assessments) about which features are the most important ones in
the overall classification.
 It generates an internal unbiased estimate (gauge) of the generalization error as the forest
generation progresses. Generated forests can be saved for future use on other data.
 Lastly, the random forest algorithm can be used to solve both classification and
regression problems.
Weaknesses of random forest
 This model, because it combines a number of decision tree models, is not as easy to
understand as a decision tree model.
 It is computationally much more expensive than a simple model like decision tree.
Application of random forest: Random forest is a very powerful classifier which combines the
versatility of many decision tree models into a single model. Because of the superior results, this
ensemble model is gaining wide adoption and popularity amongst the Machine Learning
practitioners to solve a wide range of classification problems.

3.10. SUPPORT VECTOR MACHINES
Support Vector Machines is a model, which can do linear classification as well as
regression. SVM is based on the concept of a surface, called a hyper-plane, which draws a
boundary between data instances plotted in the multi-dimensional feature space. The output
prediction of an SVM is one of two conceivable classes which are already defined in the training
data. In summary, the SVM algorithm builds an N-dimensional hyper-plane model that assigns
future instances into one of the two possible output classes.
Support Vectors: SVM chooses the extreme points (vectors) that help to create the hyper-plane;
these extreme cases are called support vectors.
How SVM works:
Step 1: Classification is done using a hyper-plane.
Step 2: Identify the correct hyper-plane that separates the classes.
Step 3: Find the maximum-margin hyper-plane based on the support vectors.
Step 4: Apply the kernel trick when the data is not linearly separable.

Types of SVM:
 Linear SVM: It separates the classes with a single straight line (linear hyper-plane); it is used when the data is linearly separable.
 Non-Linear SVM: It is used when the data cannot be separated by a straight line; the kernel trick maps the data into a higher-dimensional space where a separating hyper-plane can be found.
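A small sketch of a linear and a non-linear (RBF kernel) SVM, assuming scikit-learn; the breast-cancer data set, the feature-scaling step and 5-fold scoring are illustrative choices.

# Linear vs. non-linear (kernel trick) SVM classifiers (illustrative sketch).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))  # straight hyper-plane
rbf_svm    = make_pipeline(StandardScaler(), SVC(kernel="rbf"))     # kernel trick for non-linear boundaries

print("linear SVM accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("RBF SVM accuracy:", cross_val_score(rbf_svm, X, y, cv=5).mean())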
Strengths of SVM
 SVM can be used for both classification and regression.
 It is robust, i.e. not much impacted by data with noise or outliers. The prediction results
using this model are very promising.
Weaknesses of SVM

SVM is applicable only for binary classification, i.e. when there are only two classes in
the problem domain. The SVM model is very complex – almost like a black box when it deals
with a high-dimensional data set. Hence, it is very difficult, and close to impossible, to understand
the model in such cases. It is also slow to train on a large data set, i.e. a data set with either a large
number of features or a large number of instances, and it is quite memory intensive.
Application of SVM
SVM is most effective when it is used for binary classification, i.e. for solving a machine
learning problem with two classes. One common problem on which SVM can be applied is in the
field of bioinformatics – more specifically, in detecting cancer and other genetic disorders. It can
also be used in detecting the image of a face by binary classification of images into face and non-
face components. More such applications can be described.

Questions:
1).Explain the concept of Bayes theorem with an example.
2). Explain Bayesian belief network and conditional independence with example.
3). Explain Classification model? What are the classification learning steps?
4). What are Bayesian Belief nets? Where are they used?
5). Explain Brute force MAP hypothesis learner? What is minimum description length principle?
6). Explain Naïve Bayes Classifier with an Example.
7). Develop the concepts of K- Nearest Neighbours.
8). What are the benefits of the K-NN algorithm?
9). How is the structure of a decision tree built? What are the advantages of a decision tree?
10). Explain SVM classifier with suitable example?

Prepared By:
Dr.K.Venkata Nagendra,
Dept.of CSE, SVCN.
