1. Classical Conditioning: First described by Ivan Pavlov, classical conditioning is learning through association, in which a neutral stimulus comes to elicit a response after being repeatedly paired with a stimulus that naturally produces that response.
2. Operant Conditioning: Developed by B.F. Skinner, operant conditioning focuses on
the consequences of behavior. Behaviors that are followed by positive consequences
(reinforcement) are more likely to be repeated, while those followed by negative
consequences (punishment) are less likely to be repeated.
3. Observational Learning: Also known as social learning, this type of learning
involves acquiring new behaviors by observing and imitating others. It is a powerful
mechanism for learning social norms, skills, and behaviors.
4. Cognitive Learning: This type of learning emphasizes the mental processes involved
in acquiring knowledge, such as attention, memory, perception, and thinking. It
focuses on how people process information and construct meaning.
Early Learning and Development
Early childhood is a critical period for learning and development. During this time, children
acquire fundamental skills such as language, motor skills, and social-emotional skills. These
early experiences lay the foundation for future learning and development.
1.2. MACHINE LEARNING-TYPES
Machine Learning is a subfield of Artificial Intelligence that enables machines to improve at a given task with experience. It is important to note that all machine learning techniques count as Artificial Intelligence; however, not all Artificial Intelligence counts as Machine Learning. Basic rule-based engines, for example, can be classified as AI, but they do not learn from experience and therefore do not belong to the machine learning category.
Definition: Arthur Samuel, a pioneer in the field of artificial intelligence and computer
gaming, coined the term “Machine Learning”. He defined machine learning as – “Field of
study that gives computers the capability to learn without being explicitly programmed”.
In layman's terms, Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences, without being explicitly programmed, i.e. without direct human assistance. The process starts with feeding good-quality data and then training the machines (computers) by building machine learning models using that data and different algorithms. The choice of algorithm depends on the type of data we have and the kind of task we are trying to automate.
Example: How students prepare for an exam.
While preparing for exams, students don't simply cram the subject but try to learn it with complete understanding. Before the examination, they feed their machine (the brain) with a good amount of high-quality data (questions and answers from books, teachers' notes, or online video lectures). In effect, they are training their brain with both inputs and outputs, i.e. what kind of approach or logic to use to solve different kinds of questions. Each time they solve a practice test paper, they measure their performance (accuracy/score) by comparing their answers with the answer key. Gradually the performance improves and they gain confidence in the adopted approach. That is how models are actually built: train the machine with data (both inputs and outputs are given to the model), and when the time comes, test it on data with inputs only and score the model by comparing its answers with the actual outputs, which were not fed during training. Researchers are working assiduously to improve algorithms and techniques so that these models perform even better.
1.3. TYPES OF MACHINE LEARNING
There are several types of machine learning, each with its own characteristics and applications. The main types of machine learning algorithms are as follows:
i. Supervised Machine Learning
ii. Unsupervised Machine Learning
iii. Reinforcement Learning
i). Supervised Learning
Supervised learning is defined as training a model on a "labelled dataset". Labelled datasets contain both input and output parameters. In supervised learning, algorithms learn to map inputs to their correct outputs. Both the training and validation datasets are labelled.
Example: Consider a scenario where you have to build an image classifier to differentiate between cats and dogs. If you feed a dataset of labelled dog and cat images to the algorithm, the machine will learn to classify dogs versus cats from these labelled images. When we input new dog or cat images that it has never seen before, it will use what it has learned to predict whether the image is a dog or a cat. This is how supervised learning works, and this particular task is image classification; a minimal sketch of the workflow is shown below.
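A minimal sketch of this supervised workflow, assuming scikit-learn is available and using its built-in Iris dataset as a stand-in for any labelled dataset (not actual cat/dog images):

```python
# Minimal supervised-learning sketch: train on labelled data, predict on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # X: input features, y: labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)  # any classifier could stand in here
model.fit(X_train, y_train)                  # learn the input-to-label mapping
y_pred = model.predict(X_test)               # predict labels for inputs never seen before
print("Accuracy:", accuracy_score(y_test, y_pred))
```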
Supervised learning is effective for various business purposes, including sales
forecasting, inventory optimization, and fraud detection. Some examples of use cases include:
Predicting real estate prices
Classifying whether bank transactions are fraudulent or not
Finding disease risk factors
Determining whether loan applicants are low-risk or high-risk
Predicting the failure of industrial equipment's mechanical parts
Supervised learning can also build on pre-trained models, which saves time and resources compared with developing new models from scratch.
ii). Unsupervised Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target outputs.
The primary goal of unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.
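A minimal sketch of this idea, assuming scikit-learn is available: k-means is asked to group synthetic, unlabelled points into clusters without ever seeing any target outputs.

```python
# Minimal unsupervised-learning sketch: k-means clustering on unlabelled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # true labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # discover groupings without any labelled output
print(labels[:10])
print(kmeans.cluster_centers_)        # centres of the discovered clusters
```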
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
Clustering: Group similar data points into clusters.
Anomaly detection: Identify outliers or anomalies in data.
Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
Topic modeling: Discover latent topics within a collection of documents.
Density estimation: Estimate the probability density function of data.
Image and video compression: Reduce the amount of storage required for
multimedia content.
Data preprocessing: Help with data preprocessing tasks such as data cleaning,
imputation of missing values, and data scaling.
Market basket analysis: Discover associations between products.
Image segmentation: Segment images into meaningful regions.
Community detection in social networks: Identify communities or groups of
individuals with similar interests or connections.
Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
iii). Reinforcement Learning
Reinforcement Learning operates on the principle of learning optimal behavior through trial
and error. The agent takes actions within the environment, receives rewards or penalties,
and adjusts its behavior to maximize the cumulative reward. This learning process is
characterized by the following elements:
Policy: A strategy used by the agent to determine the next action based on the
current state.
Reward Function: A function that provides a scalar feedback signal based on the
state and action.
Value Function: A function that estimates the expected cumulative reward from a
given state.
Model of the Environment: A representation of the environment that helps in
planning by predicting future states and rewards.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following scenario illustrates the problem.
Consider a grid containing a robot, a diamond, and fire. The goal of the robot is to get the reward, which is the diamond, while avoiding the hurdles, which are the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.
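A toy sketch of this trial-and-error idea, using tabular Q-learning on a hypothetical one-dimensional corridor (state 4 holds the "diamond" reward, state 2 carries a "fire" penalty); it illustrates the update rule rather than the exact grid described above.

```python
# Tiny tabular Q-learning sketch on a 5-state corridor (states 0..4).
import random

n_states, actions = 5, [-1, +1]            # the agent can move left or right
Q = [[0.0, 0.0] for _ in range(n_states)]  # Q-table: one value per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate

def step(state, action):
    nxt = max(0, min(n_states - 1, state + action))
    if nxt == n_states - 1:
        return nxt, 10.0, True             # reached the diamond: big reward, episode ends
    return nxt, (-5.0 if nxt == 2 else -1.0), False   # fire penalty or small step cost

for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2, r, done = step(s, actions[a])
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])   # Q-learning update
        s = s2

print([round(max(q), 2) for q in Q])       # learned state values along the corridor
```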
Advantages of Reinforcement Machine Learning
Its autonomous decision-making is well-suited for tasks that require learning a sequence of decisions, such as robotics and game-playing.
This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
It can be used to solve complex problems that cannot be solved by conventional techniques.
Disadvantages of Reinforcement Machine Learning
Training reinforcement learning agents can be computationally expensive and time-consuming.
Reinforcement learning is not preferable for solving simple problems.
It needs a lot of data and a lot of computation, which can make it impractical and costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
Game Playing: RL can teach agents to play games, even complex ones.
Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
Recommendation Systems: RL can enhance recommendation algorithms by
learning user preferences.
Healthcare: RL can be used to optimize treatment plans and drug discovery.
Natural Language Processing (NLP): RL can be used in dialogue systems and
chatbots.
Finance and Trading: RL can be used for algorithmic trading.
Supply Chain and Inventory Management: RL can be used to optimize supply
chain operations.
Game AI: RL can be used to create more intelligent and adaptive NPCs in video
games.
Adaptive Personal Assistants: RL can be used to improve personal assistants.
Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create
immersive and interactive experiences.
Comparison – supervised, unsupervised, and reinforcement learning:
| Criteria | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Input Data | Input data is labelled. | Input data is not labelled. | Input data is not predefined. |
| Problem | Learn the pattern of inputs and their labels. | Divide data into classes. | Find the best reward between a start and an end state. |
| Solution | Finds a mapping equation on input data and its labels. | Finds similar features in input data to classify it into classes. | Maximizes reward by assessing the results of state-action pairs. |
| Model Building | Model is built and trained prior to testing. | Model is built and trained prior to testing. | The model is trained and tested simultaneously. |
| Applications | Deals with regression and classification problems. | Deals with clustering and associative rule mining problems. | Deals with exploration and exploitation problems. |
| Algorithms Used | Decision trees, linear regression, k-nearest neighbours | K-means clustering, k-medoids clustering, agglomerative clustering | Q-learning, SARSA, Deep Q Network |
| Examples | Image detection, population growth prediction | Customer segmentation, feature elicitation, targeted marketing, etc. | Driverless cars, self-navigating vacuum cleaners, etc. |
Google Maps is continually improved by its users: the app takes information from each user and sends it back to its database to improve its performance.
Content Filter
Header filter
General blacklists filter
Rules-based filters
Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant: We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just through voice instructions, such as playing music, calling someone, opening an email, scheduling an appointment, etc. Machine learning algorithms are an important part of these virtual assistants: they record our voice instructions, send them to a server in the cloud, decode them using ML algorithms, and act accordingly.
8. Online Fraud Detection: Machine learning is making our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can occur in various ways, such as fake accounts, fake IDs, or money being stolen in the middle of a transaction. To detect this, a feed-forward neural network can check whether a transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern that changes for fraudulent transactions; the network detects this change and makes our online transactions more secure.
9. Stock Market Trading: Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so long short-term memory (LSTM) neural networks are used for the prediction of stock market trends.
10. Medical Diagnosis: In medical science, machine learning is used for disease diagnosis. With it, medical technology is advancing rapidly and can build 3D models that predict the exact position of lesions in the brain, which helps in finding brain tumours and other brain-related diseases more easily.
11. Automatic Language Translation: Nowadays, if we visit a new place and do not know the language, it is not a problem at all, because machine learning can convert the text into a language we know. Google's GNMT (Google Neural Machine Translation) provides this feature: a neural machine translation system that translates text into our familiar language, known as automatic translation. The technology behind automatic translation is a sequence-to-sequence learning algorithm, which is also used with image recognition to translate text from one language to another.
1.6. STATE-OF-THE-ART LANGUAGES/TOOLS IN
MACHINE LEARNING
The algorithms for the different machine learning tasks are well known and can be implemented using any language or platform, whether Java, C/C++, or .NET. However, certain languages and tools have been developed with a specific focus on implementing machine learning. A few of the most widely used are covered below.
Languages
Python:
o Dominates the field: Extensive libraries (Scikit-learn, TensorFlow, PyTorch),
large community, versatility.
o Ideal for: General-purpose ML, deep learning, data science.
R:
o Strong in statistical computing and data visualization: Excellent for
exploratory data analysis and statistical modeling.
o Ideal for: Statistical analysis, data visualization, niche areas.
Java:
o Robust and scalable: Suitable for large-scale, production-level ML systems.
o Ideal for: Enterprise applications, big data processing.
C++:
o High performance: Used for computationally intensive tasks and building
high-performance libraries.
o Ideal for: Performance-critical applications, low-level optimizations.
Tools
Scikit-learn (Python):
o Comprehensive library: Offers a wide range of algorithms for classification,
regression, clustering, and more.
o User-friendly: Easy to use and well-documented.
TensorFlow (Python/C++):
o Developed by Google: Powerful framework for deep learning, especially for
building and deploying large-scale neural networks.
PyTorch (Python):
o Dynamic computation graphs: Provides flexibility and ease of use for
research and prototyping.
o Strong in deep learning research: Popular for natural language processing
and computer vision.
Keras (Python):
o High-level API: Simplifies building and experimenting with neural networks
on top of TensorFlow or other backends.
Jupyter Notebook:
o Interactive environment: Allows you to write and execute code, visualize
data, and share your work easily.
AWS SageMaker:
o Cloud-based platform: Provides a suite of tools for building, training, and
deploying machine learning models.
1.7. ISSUES IN MACHINE LEARNING
Machine learning, while a powerful tool, faces several challenges that researchers and
practitioners are actively working to address. Here are some of the key issues:
1. Data Quality and Availability
Data Scarcity: Many real-world problems lack sufficient labeled data for training
effective models, especially in niche domains or those with limited data collection
resources.
Data Bias: Training data often reflects existing biases in society, leading to models
that perpetuate and even amplify these biases. This can have serious consequences in
areas like loan applications, hiring processes, and criminal justice.
Data Privacy: Collecting and using personal data raises significant privacy concerns,
requiring careful consideration of ethical and legal implications.
2. Model Interpretability and Explainability
Black Box Models: Many complex models, such as deep neural networks, are often
referred to as "black boxes" because their decision-making processes are opaque. This
lack of transparency can hinder trust and make it difficult to understand and debug
model errors.
Explainable AI (XAI): This emerging field aims to develop techniques that make
machine learning models more interpretable and understandable to humans.
3. Overfitting and Underfitting
Overfitting: Occurs when a model performs well on the training data but poorly on
unseen data. This happens when the model has learned the training data too well,
including noise and irrelevant details.
Underfitting: Occurs when a model is too simple to capture the underlying patterns
in the data. This results in poor performance on both training and test data.
4. Computational Cost and Resource Requirements
Training Time: Training complex models, especially deep learning models, can be
computationally expensive, requiring significant time and resources.
Hardware Requirements: Advanced hardware, such as GPUs and TPUs, is often
necessary to train large-scale models efficiently.
5. Ethical Considerations
Job Displacement: Automation powered by machine learning raises concerns about
job displacement in various sectors.
Misuse of Technology: Machine learning can be used for malicious purposes, such as
creating deepfakes or developing autonomous weapons systems.
Algorithmic Bias: As mentioned earlier, biased data can lead to biased models,
which can have discriminatory impacts on certain groups.
6. Continuous Learning and Adaptation
Evolving Data: Real-world data is constantly changing. Models need to be able to
adapt to new data and changing conditions to maintain their effectiveness.
Concept Drift: The underlying relationships between features and targets may
change over time, requiring models to be retrained or updated periodically.
PREPARING TO MODEL: INTRODUCTION
1.8. MACHINE LEARNING ACTIVITIES
The first step in any machine learning activity starts with data. In the case of supervised learning, it is the labelled training data set, followed by test data which is not labelled. In the case of unsupervised learning, there is no labelled data; the task is to find patterns in the input data. A thorough review and exploration of the data is needed to understand the type of the data, the quality of the data, and the relationships between the different data elements. Based on that, multiple pre-processing activities may need to be done on the input data before we can go ahead with the core machine learning activities. The following are the typical preparation activities done once the input data comes into the machine learning system:
Understand the type of data in the given input data set. Explore the data to understand
the nature and quality.
Explore the relationships amongst the data elements, e.g. inter-feature relationship.
Find potential issues in data.
Do the necessary remediation, e.g. impute missing data values, etc., if needed. Apply
pre-processing steps, as necessary.
Once the data is prepared for modelling, the learning tasks start. As a part of this, the following activities are done:
The input data is first divided into two parts – the training data and the test data (called the holdout). This step is applicable to supervised learning only.
Different models or learning algorithms are considered for selection. For a supervised learning problem, the model is trained on the training data and then applied to unknown data; for an unsupervised learning problem, the chosen unsupervised model is applied directly to the input data.
After the model is selected, trained (for supervised learning), and applied to the input data, the performance of the model is evaluated. Based on the options available, specific actions can be taken to improve its performance, if possible. The figure below depicts the four-step process of machine learning.
Table: Summary of the steps and activities involved.
1.9. TYPES OF DATA
1. Qualitative or Categorical Data
Qualitative or Categorical Data is a type of data that can’t be measured or counted in the form
of numbers. These types of data are sorted by category, not by number. That’s why it is also
known as Categorical Data. These data consist of audio, images, symbols, or text. The gender
of a person, i.e., male, female, or others, is qualitative data. Qualitative data tells about the
perception of people. This data helps market researchers understand the customers’ tastes and
then design their ideas and strategies accordingly.
A. Nominal Data
Nominal Data is used to label variables without any order or quantitative value. The color of
hair can be considered nominal data, as one color can’t be compared with another color.
The name “nominal” comes from the Latin name “nomen,” which means “name.” With the
help of nominal data, we can’t do any numerical tasks or can’t give any order to sort the data.
These data don’t have any meaningful order; their values are distributed into distinct
categories.
Examples of Nominal Data:
Colour of hair (Blonde, red, Brown, Black, etc.)
Marital status (Single, Widowed, Married)
Nationality (Indian, German, American)
Gender (Male, Female, Others)
Eye Color (Black, Brown, etc.)
B. Ordinal Data
Ordinal data have a natural ordering, in which the values are arranged in some order by their position on a scale. These data are used for observations like customer satisfaction, happiness, etc., but we can't do arithmetic on them. Ordinal data is qualitative data whose values have some kind of relative position. These kinds of data can be considered "in-between" qualitative and quantitative data. Ordinal data only shows sequences and cannot be used for standard arithmetic calculations. Compared to nominal data, ordinal data have a kind of order that nominal data lack.
Examples of Ordinal Data:
When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
Letter grades in the exam (A, B, C, D, etc.)
Ranking of people in a competition (First, Second, Third, etc.)
Economic Status (High, Medium, and Low)
Education Level (Higher, Secondary, Primary)
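As a small illustration, assuming pandas is available, nominal data can be one-hot encoded while ordinal data can be mapped to ordered codes; the column names and values here are hypothetical.

```python
# Sketch: encoding nominal vs ordinal qualitative data with pandas.
import pandas as pd

df = pd.DataFrame({
    "eye_color": ["Black", "Brown", "Black"],          # nominal: no order between categories
    "education": ["Primary", "Higher", "Secondary"],   # ordinal: has a natural order
})

# Nominal data: one-hot encode, since the categories cannot be ranked.
nominal_encoded = pd.get_dummies(df["eye_color"], prefix="eye")

# Ordinal data: map to an ordered categorical so the relative position is preserved.
order = ["Primary", "Secondary", "Higher"]
df["education"] = pd.Categorical(df["education"], categories=order, ordered=True)
df["education_code"] = df["education"].cat.codes        # 0 < 1 < 2 follows the order

print(nominal_encoded)
print(df[["education", "education_code"]])
```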
2. Quantitative Data
Quantitative data is data that can be expressed in numerical values, which makes it countable and suitable for statistical analysis. This kind of data is also known as numerical data. It answers questions like "how much," "how many," and "how often." For example, the price of a phone, a computer's RAM, and the height or weight of a person all fall under quantitative data.
Quantitative data can be used for statistical manipulation. These data can be represented on a
wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie
charts, line graphs, etc.
Examples of Quantitative Data:
Height or weight of a person or object
Room Temperature
Scores and Marks (Ex: 59, 80, 60, etc.)
Time
The Quantitative data are further classified into two parts:
A. Discrete Data
The term discrete means distinct or separate. The discrete data contain the values that fall
under integers or whole numbers. The total number of students in a class is an example of
discrete data. These data can’t be broken into decimal or fraction values.
The discrete data are countable and have finite values; their subdivision is not possible. These
data are represented mainly by a bar graph, number line, or frequency table.
Examples of Discrete Data:
Total numbers of students present in a class
Cost of a cell phone
Numbers of employees in a company
The total number of players who participated in a competition
Days in a week
B. Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an android
phone, the height of a person, the length of an object, etc. Continuous data represents
information that can be divided into smaller levels. The continuous variable can take any
value within a range.
The key difference between discrete and continuous data is that discrete data contains integers or whole numbers, while continuous data stores fractional numbers and can record quantities such as temperature, height, width, time, speed, etc.
Examples of Continuous Data:
Height of a person
Speed of a vehicle
“Time-taken” to finish the work
Wi-Fi Frequency
Market share price
1.10. EXPLORING STRUCTURE OF DATA
The approach to exploring numeric data is different from the approach to exploring categorical data. For a standard data set, we may have the data dictionary available for reference. A data dictionary is a metadata repository, i.e. a repository of all information related to the structure of each data element contained in the data set. The data dictionary gives detailed information on each of the attributes – the description as well as the data type and other relevant details. If the data dictionary is not available, we need to use the standard library functions of the machine learning tool we are using to get these details.
Exploring numerical data: Numerical data represents quantitative information that can be
measured and counted. Understanding its characteristics is crucial for effective data analysis
and decision-making. Here's a breakdown of key aspects:
1. Types of Numerical Data
Continuous: Data that can take on any value within a given range.
o Examples: Height, weight, temperature, time
Discrete: Data that can only take on specific, distinct values.
o Examples: Number of students, number of cars, count of occurrences
2. Key Characteristics
Central Tendency: Measures of central tendency describe the "center" or typical
value of the data.
o Mean: The average of all data points.
o Median: The middle value when the data is sorted.
o Mode: The most frequently occurring value.
Dispersion: Measures of dispersion describe the spread or variability of the data.
o Range: The difference between the maximum and minimum values.
o Variance: The average squared deviation from the mean.
o Standard Deviation: The square root of the variance, providing a measure of
spread in the same units as the data.
o Interquartile Range (IQR): The range of the middle 50% of the data.
Distribution: The way data values are distributed across the range of possible values.
o Normal Distribution (Gaussian): A bell-shaped curve, with most data points
clustered around the mean.
o Skewed Distribution: Data is not symmetrical, with a longer tail on one side.
o Uniform Distribution: Data is evenly distributed across the range.
3. Exploratory Data Analysis (EDA) Techniques
Summary Statistics: Calculating measures of central tendency and dispersion.
Data Visualization:
o Histograms: Visualize the distribution of data.
o Box Plots: Show the median, quartiles, and outliers.
o Scatter Plots: Visualize relationships between two numerical variables.
Example: Consider a dataset of student exam scores.
Type: Continuous (can take on any value within a certain range)
Central Tendency:
o Mean: Average score of all students
o Median: Score of the middle-ranking student
o Mode: Most frequent score (if any)
Dispersion:
o Range: Difference between the highest and lowest scores
o Standard Deviation: How much scores typically deviate from the mean
Interquartile Range (IQR): Represents the middle 50% of the data. It is less sensitive to outliers than the range.
Mean Absolute Deviation (MAD):
Definition: The average of the absolute differences between each data point and the mean.
Box Plots and Histograms: Both box plots and histograms are valuable tools for visualizing
the distribution of numerical data. They provide different insights into the data's central
tendency, spread, and shape.
Histograms
What they are: Histograms divide the data into bins (intervals) and show the
frequency (or count) of data points falling within each bin.
What they reveal:
o Shape of the distribution: Symmetrical, skewed (left or right), bimodal, etc.
o Central tendency: Approximate location of the mean and median.
o Spread: How widely the data is distributed.
o Outliers: Extreme values that deviate significantly from the rest of the data.
Box Plots
What they are: Box plots summarize the distribution of data using five key points:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
What they reveal:
o Median: The middle value of the dataset.
o Quartiles: Divide the data into four equal parts.
o Interquartile Range (IQR): The range between the first and third quartiles,
representing the middle 50% of the data.
o Outliers: Data points that fall beyond 1.5 times the IQR from the nearest
quartile.
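A short sketch of these EDA steps on hypothetical exam scores, assuming pandas, NumPy, and matplotlib are available:

```python
# Sketch: summary statistics, a histogram, and a box plot for a numeric attribute.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.Series(np.random.default_rng(0).normal(loc=65, scale=12, size=200),
                   name="exam_score")                      # hypothetical exam scores

print(scores.describe())                                   # mean, std, quartiles, min/max
print("IQR:", scores.quantile(0.75) - scores.quantile(0.25))

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(scores, bins=20)                              # shape of the distribution
axes[0].set_title("Histogram")
axes[1].boxplot(scores, vert=False)                        # median, quartiles, outliers
axes[1].set_title("Box plot")
plt.show()
```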
1.11. DATA QUALITY AND REMEDIATION
Data Quality in machine learning refers to the fitness of data for its intended use. It
encompasses various aspects that determine how suitable the data is for building accurate and
reliable machine learning models.
Dimensions of Data Quality:
Accuracy: The data is correct and free from errors.
Completeness: All necessary information is present, with minimal missing values.
Consistency: Data is formatted and represented uniformly across the dataset.
Validity: Data conforms to predefined rules and constraints.
Timeliness: Data is up-to-date and reflects current conditions.
Uniqueness: There are no duplicate records.
Common Data Quality Issues:
Missing Values: Empty cells or fields lacking information.
Inconsistent Data: Variations in formatting, spelling, or units.
Outliers: Extreme values that deviate significantly from the norm.
Duplicate Records: Repeated entries that can skew analysis.
Incorrect Data Types: Data stored in the wrong format (e.g., text instead of
numbers).
Data Remediation in machine learning refers to the process of identifying and correcting
errors, inconsistencies, and inaccuracies within your dataset to improve its quality.
Data Remediation Techniques:
1. Handling Missing Values:
o Imputation: Replacing missing values with estimated values.
Mean/Median Imputation: Replacing with the average or median
value of the column.
K-Nearest Neighbors (KNN) Imputation: Predicting missing values
based on similar data points.
Regression Imputation: Using regression models to predict missing
values.
o Deletion: Removing rows or columns with missing values (use with caution,
as it can lead to data loss).
2. Addressing Inconsistent Data:
o Standardization: Transforming data to a common format (e.g., converting all
units to metric).
o Normalization: Scaling data to a specific range (e.g., between 0 and 1).
o Data Cleaning: Correcting spelling errors, handling typos, and ensuring
consistency in data representation.
3. Identifying and Handling Outliers:
o Visualization: Using box plots, scatter plots, and other visualizations to
identify outliers.
o Statistical Methods: Calculating z-scores or using interquartile range (IQR)
to detect outliers.
o Removal: Removing outliers if they are deemed to be errors or have a
significant impact on the model.
o Transformation: Transforming data using techniques like log transformation
to reduce the impact of outliers.
4. Removing Duplicate Records:
o Using deduplication techniques to identify and remove duplicate rows.
Tools for Data Remediation:
Data Profiling Tools: Automatically identify and analyse data quality issues.
Data Cleaning Libraries: Libraries like Pandas (Python) offer functions for handling
missing values, cleaning data, and transforming data.
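A minimal sketch of several of these remediation steps with pandas, on a small hypothetical data frame (column names and values are invented for illustration):

```python
# Sketch: imputation, standardization, IQR-based outlier detection, and de-duplication.
import pandas as pd

df = pd.DataFrame({"age": [25, 32, None, 29, 250, 32],
                   "city": ["Delhi", "delhi", "Mumbai", None, "Mumbai", "delhi"]})

# 1. Handle missing values: mean imputation for numbers, mode imputation for categories.
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2. Address inconsistent data: standardize the text representation.
df["city"] = df["city"].str.title()

# 3. Identify outliers with the IQR rule (beyond 1.5 * IQR from the quartiles).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# 4. Remove duplicate records.
df = df.drop_duplicates()
print(df)
print("Outliers:\n", outliers)
```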
Dimensionality Reduction: In machine learning, dimensionality reduction is a crucial
technique used to reduce the number of features (or dimensions) in a dataset while preserving
as much information as possible. This simplification offers several significant advantages:
Benefits of Dimensionality Reduction:
Improved Model Performance:
o Reduced Overfitting: By reducing the number of features, we can mitigate
overfitting, where a model performs well on training data but poorly on
unseen data.
o Faster Training: With fewer features, training algorithms can converge more
quickly and efficiently.
Enhanced Visualization:
o Data Visualization: High-dimensional data is difficult to visualize.
Dimensionality reduction allows us to project data onto lower-dimensional
spaces (often 2D or 3D), making it easier to understand relationships and
patterns.
Reduced Storage and Computational Costs:
o Storage Efficiency: Storing fewer features requires less storage space.
o Computational Efficiency: Processing fewer features reduces computational
complexity, making models faster to train and deploy.
Common Dimensionality Reduction Techniques:
Principal Component Analysis (PCA):
o Identifies the principal components (new features) that capture the most
variance in the data.
o Projects the data onto these principal components, reducing the dimensionality
while preserving most of the information.
Linear Discriminant Analysis (LDA):
o Specifically designed for classification problems.
o Finds linear combinations of features that best separate different classes.
t-SNE (t-Distributed Stochastic Neighbor Embedding):
o A non-linear technique that excels at visualizing high-dimensional data in low-
dimensional spaces (often 2D).
o Particularly useful for exploring complex data structures and identifying
clusters.
Feature Selection:
o Involves selecting a subset of the original features based on their importance
or relevance.
o Methods include:
Filter Methods: Rank features based on their individual scores (e.g.,
correlation with the target variable).
Wrapper Methods: Evaluate the performance of different feature
subsets using a machine learning model.
Embedded Methods: Select features during the model training
process itself (e.g., using L1 regularization).
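A minimal sketch of the PCA technique listed above, assuming scikit-learn is available and using its built-in Iris data (4 features reduced to 2 principal components):

```python
# Sketch: dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)             # project onto the top-2 components
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)            # (150, 2)
```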
UNIT II
MODELLING AND EVALUATION &
BASICS OF FEATURE ENGINEERING
2.1. MODEL SELECTION
Model selection is the task of choosing, from a set of candidates, the model that is best suited to the problem. The following steps are frequently included in the model selection process:
Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example, classification,
regression, or clustering).
Candidate model selection: Pick a group of models that are appropriate for the issue at
hand. These models can include straightforward methods like decision trees or linear
regression as well as more sophisticated ones like deep neural networks, random forests,
or support vector machines.
Performance evaluation: Establish metrics for measuring how well each model performs. Common metrics include accuracy, precision, recall, F1-score, mean squared error, and the area under the receiver operating characteristic curve (AUC-ROC). The type of problem and the particular requirements determine which metrics are used.
Training and evaluation: Each candidate model should be trained using a subset of the
available data (the training set), and its performance should be assessed using a different
subset (the validation set or via cross-validation). The established evaluation measures
are used to gauge the model's effectiveness.
Model comparison: Evaluate the performance of various models and determine which
one performs best on the validation set. Take into account elements like data handling
capabilities, interpretability, computational difficulty, and accuracy.
Hyperparameter tuning: Before training, many models require that certain
hyperparameters, such as the learning rate, regularisation strength, or the number of
layers that are hidden in a neural network, be configured. Use methods like grid search,
random search, and Bayesian optimization to identify these hyperparameters' ideal
values.
Final model selection: After the models have been analyzed and fine-tuned, pick the
model that performs the best. Then, this model can be used to make predictions based on
fresh, unforeseen data.
Model Selection Techniques
Model selection in machine learning can be done using a variety of methods and tactics. These
methods assist in comparing and assessing many models to determine which is best suited to
solve a certain issue. Here are some methods for selecting models that are frequently used:
Train-Test Split: With this strategy, the available data is divided into two sets: a
training set & a separate test set. The models are evaluated using a predetermined
evaluation metric on the test set after being trained on the training set. This method
offers a quick and easy way to evaluate a model's performance using hypothetical data.
Cross-Validation: Cross-validation is a resampling approach that divides the data into several groups or folds. Each fold is used in turn as the test set while the remaining folds form the training set, and the model is trained and evaluated on each split separately. Averaging over the folds lowers the variance of the evaluation and makes it easier to obtain an accurate assessment of the model's performance. Frequently used cross-validation schemes include k-fold, stratified, and leave-one-out cross-validation.
Grid Search: Hyperparameter tuning is done using the grid search technique. In order
to do this, a grid containing hyperparameter values must be defined, and all potential
hyperparameter combinations must be thoroughly searched. For each combination, the
models are trained, assessed, and their performances are contrasted. Finding the ideal
hyperparameter settings to optimize the model's performance is made easier by grid
search.
Random Search: A set distribution for hyperparameter values is sampled at random as
part of the random search hyperparameter tuning technique. In contrast to grid search,
which considers every potential combination, random search only investigates a portion
of the hyperparameter field. When a thorough search is not possible due to the size of
the search space, this strategy can be helpful.
Bayesian optimization: Bayesian optimization is a more sophisticated method of hyperparameter tuning. It models the relationship between the hyperparameters and the performance of the model using a probabilistic model, and intelligently chooses which set of hyperparameters to investigate next by updating the probabilistic model and iteratively assessing the model's performance. Bayesian optimization is especially effective when the search space is large and expensive to evaluate.
Model averaging: This technique combines forecasts from various models to get a
single prediction. For regression issues, this can be accomplished by averaging the
predictions, while for classification problems, voting or weighted voting systems can be
used. Model averaging can increase overall prediction accuracy by lowering the bias and
variation of individual models.
Information Criteria: Information criteria offer a numerical assessment of the trade-off
between model complexity and goodness of fit. Examples include the Akaike
Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These
criteria discourage the use of too complicated models and encourage the adoption of
simpler models that adequately explain the data.
Domain Expertise & Prior Knowledge: Prior understanding of the problem and the
data, as well as domain expertise, can have a significant impact on model choice. The
models that are more suitable given the specifics of the problem and the details of the
data may be known by subject matter experts.
Model Performance Comparison: Using the right assessment measures, it is vital to
evaluate the performance of various models. Depending on the issue at hand, these
measurements could include F1-score, mean squared error, accuracy, precision, recall, or
the area beneath the receiver's operating characteristic curve (AUC-ROC). The best-
performing model can be found by comparing many models.
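A brief sketch combining two of the techniques above, k-fold cross-validation and grid search, assuming scikit-learn and its built-in Iris dataset:

```python
# Sketch: cross-validation of one candidate, then grid search over hyperparameters.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation of a single candidate model.
print("CV accuracy:", cross_val_score(SVC(C=1.0), X, y, cv=5).mean())

# Grid search: exhaustively evaluate hyperparameter combinations with cross-validation.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV score:", round(search.best_score_, 3))
```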
2.2. TRAINING A MODEL (FOR SUPERVISED LEARNING)
Holdout method
In case of supervised learning, a model is trained using the labelled input data. However, how
can we understand the performance of the model? The test data may not be available
immediately. Also, the label value of the test data is not known. That is the reason why a part of
the input data is held back (that is how the name holdout originates) for evaluation of the model.
This subset of the input data is used as the test data for evaluating the performance of a trained
model. In general 70%–80% of the input data (which is obviously labelled) is used for model
training. The remaining 20%–30% is used as test data for validation of the performance of the
model. However, a different proportion of dividing the input data into training and test data is
also acceptable. To make sure that the data in both the buckets are similar in nature, the division
is done randomly: random numbers are used to assign data items to the partitions. This method of partitioning the input data into two parts – training and test data (depicted in the figure below) – by holding back part of the input data for validating the trained model is known as the holdout method.
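A minimal sketch of the holdout method, assuming scikit-learn is available and using its built-in breast-cancer dataset as the labelled input data:

```python
# Sketch of the holdout method: hold back 30% of the labelled data as test data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)     # random partition, 70% train / 30% test

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```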
Standard k-fold Cross-validation: the input data is divided into k folds; each fold is used once as the test set while the remaining k−1 folds are used for training, and the k performance estimates are averaged to obtain the overall estimate.
Bootstrap sampling
Bootstrap sampling is a powerful statistical technique that involves repeatedly drawing samples
with replacement from an original dataset to estimate the sampling distribution of a statistic.
How it Works:
1. Create Bootstrap Samples:
o Repeatedly draw samples (with replacement) from the original dataset.
o The size of each bootstrap sample is typically the same as the size of the original
dataset.
o Create a large number of bootstrap samples (e.g., 1,000 or 10,000).
2. Calculate Statistic for Each Sample:
o Calculate the desired statistic (e.g., mean, standard deviation, model accuracy)
for each bootstrap sample.
3. Estimate Sampling Distribution:
o The distribution of the statistic across all bootstrap samples provides an estimate
of the sampling distribution of that statistic.
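A small sketch of bootstrap sampling with NumPy, estimating the standard error of the mean of a hypothetical dataset:

```python
# Sketch: bootstrap estimate of the standard error and confidence interval of the mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100)     # hypothetical original dataset

boot_means = []
for _ in range(1000):                             # 1,000 bootstrap samples
    sample = rng.choice(data, size=data.size, replace=True)   # sample WITH replacement
    boot_means.append(sample.mean())

boot_means = np.array(boot_means)
print("Bootstrap standard error of the mean:", boot_means.std(ddof=1))
print("95% confidence interval:", np.percentile(boot_means, [2.5, 97.5]))
```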
Applications of Bootstrap Sampling:
Estimating Standard Errors: Estimating the standard error of a statistic.
Constructing Confidence Intervals: Building confidence intervals for population
parameters.
Model Evaluation: Assessing the variability and uncertainty of model predictions.
Feature Importance: Estimating the importance of different features in a model.
Advantages of Bootstrap Sampling:
Versatility: Can be applied to a wide range of statistics and models.
Simplicity: Relatively easy to implement.
No Assumptions: Does not require assumptions about the underlying data distribution.
Limitations:
Computational Cost: Can be computationally expensive for large datasets.
Potential Bias: In some cases, bootstrap estimates may have slight biases.
2.3. MODEL REPRESENTATION AND INTERPRETABILITY
The main goal of every machine learning model is to generalize well. Here, generalization is the ability of an ML model to provide a suitable output for a given set of unknown inputs. It means that, after being trained on the dataset, the model can produce reliable and accurate output. Hence, underfitting and overfitting are the two conditions that need to be checked to judge whether the model is performing and generalizing well or not.
Underfitting: A model is said to underfit when it cannot capture the underlying pattern in the data; it performs poorly on both the training data and the test data.
Example: fitting a linear model to data with a strongly non-linear pattern.
Reasons:
1. The model is too simple.
2. The training data contains noise.
3. The size of the training data is not enough.
Techniques to reduce underfitting:
Increase model complexity.
Remove noise from the data.
Increase the duration of training.
Overfitting: Overfitting occurs when our machine learning model tries to cover all the data points, or more data points than required, in the given dataset. Because of this, the model starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. An overfitted model has low bias and high variance. The chances of overfitting increase the more we train our model.
How to avoid overfitting in a model: Both overfitting and underfitting degrade the performance of a machine learning model, but overfitting is the more common cause of poor generalization, so there are several ways by which we can reduce its occurrence:
Cross-Validation
Training with more data
Removing features
Early stopping the training
Regularization
Ensembling
Bias [training error]:
Bias reflects the error rate on the training data.
When the training error rate is high, the bias is high.
When the training error rate is low, the bias is low.
Error due to bias: it arises when the model underfits the data.
Variance [testing error]:
Variance reflects the difference between the error rate on the training data and on the testing data.
If the difference between the errors is high, the model has high variance.
If the difference between the errors is low, the model has low variance.
Error due to variance: it arises when the model overfits the data.
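A short sketch of this trade-off, assuming scikit-learn: polynomial models of increasing degree are fitted to noisy data, and the gap between training and test error exposes underfitting and overfitting.

```python
# Sketch: an overly simple model underfits, an overly complex model overfits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # noisy non-linear data

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):          # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree={degree:2d}"
          f"  train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}"
          f"  test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}")
```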
2.4. EVALUATING PERFORMANCE OF A MODEL
i).Supervised Learning-classification
In supervised learning, one major task is classification. The responsibility of the
classification model is to assign class label to the target feature based on the value of the
predictor features. For example, in the problem of predicting the win/loss in a cricket match, the
classifier will assign a class value win/loss to target feature based on the values of other features
like whether the team won the toss, number of spinners in the team, number of wins the team
had in the tournament, etc. To evaluate the performance of the model, the number of correct classifications or predictions made by the model has to be recorded. The following four possibilities are used to evaluate the performance:
1. True Positive (TP): the model predicted win and the team won.
2. False Positive (FP): the model predicted win and the team lost.
3. False Negative (FN): the model predicted loss and the team won.
4. True Negative (TN): the model predicted loss and the team lost.
A matrix containing correct and incorrect predictions in the form of TPs, FPs, FNs and
TNs is known as confusion matrix.
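A minimal sketch of a confusion matrix and the derived metrics for hypothetical win/loss predictions, assuming scikit-learn is available:

```python
# Sketch: building a confusion matrix and basic metrics for win/loss predictions.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = ["win", "win", "loss", "win", "loss", "loss", "win", "loss"]   # actual outcomes
y_pred = ["win", "loss", "loss", "win", "win", "loss", "win", "loss"]   # model predictions

# With "win" as the positive class, the matrix rows/columns contain TN, FP, FN, TP.
cm = confusion_matrix(y_true, y_pred, labels=["loss", "win"])
print(cm)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, pos_label="win"))
print("Recall   :", recall_score(y_true, y_pred, pos_label="win"))
```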
Receiver operating characteristic (ROC) curves:
As we have seen till now, though accuracy is the most popular measure, there are quite a
number of other measures to evaluate the performance of a supervised learning model.
However, visualization is an easier and more effective way to understand the model
performance. It also helps in comparing the efficiency of two models. Receiver Operating
Characteristic (ROC) curve helps in visualizing the performance of a classification model. It
shows the efficiency of a model in the detection of true positives while avoiding the occurrence
of false positives.
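A short sketch of plotting a ROC curve, assuming scikit-learn and matplotlib and using the built-in breast-cancer dataset:

```python
# Sketch: ROC curve from the predicted probabilities of a binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]            # probability of the positive class

fpr, tpr, _ = roc_curve(y_te, scores)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_te, scores):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")          # a random classifier as the baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```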
ii). Supervised Learning - Regression
For a certain value of x, say x̂, the value of y is predicted as ŷ, whereas the actual value of y is Y (say). The distance between the actual value and the fitted or predicted value ŷ is known as the residual. The regression model can be considered well fitted if the difference between the actual and predicted values, i.e. the residual, is small.
iii).Unsupervised learning - clustering
Clustering algorithms try to reveal natural groupings amongst the data. However, it is quite tricky to evaluate the performance of a clustering algorithm. Clustering, by nature, is very subjective, and whether a clustering is good or bad is open to interpretation; it has been noted that 'clustering is in the eye of the beholder'. There are a couple of popular approaches adopted for cluster quality evaluation.
a). Internal evaluation: It measures cluster quality based on the homogeneity of data belonging to the same cluster and the heterogeneity of data belonging to different clusters. For a data set clustered into 'k' clusters, the silhouette width is calculated as:
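The commonly used form of this measure, assuming a(i) denotes the mean distance from point i to the other points of its own cluster and b(i) the mean distance from point i to the points of the nearest other cluster, is:

s(i) = (b(i) - a(i)) / max{a(i), b(i)}

and the silhouette width of the clustering is the average of s(i) over all n data points.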
b). External evaluation: In this approach, the class labels are known for the data set subjected to clustering, although, quite obviously, the known class labels are not used in the clustering itself. The clustering algorithm is assessed based on how close its results are to those known class labels. For example, purity is one of the most popular measures of clustering quality; it evaluates the extent to which clusters contain a single class. For a data set having 'n' data instances and 'c' known class labels, which is clustered into 'k' clusters, purity is measured as:
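The commonly used form, where Ci denotes the set of instances in cluster i and Lj the set of instances with class label j, is:

purity = (1/n) * sum over i = 1..k of ( max over j = 1..c of |Ci ∩ Lj| )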
2.5. IMPROVING PERFORMANCE OF A MODEL
When multiple models are combined into an ensemble, their individual predictions are merged into a final prediction. For example, if three models predict 'win' and 2 predict 'loss', then the final outcome of the ensemble using majority vote would be a 'win'.
One of the earliest and most popular ensemble models is bootstrap aggregating or
bagging. Bagging uses bootstrap sampling method to generate multiple training data sets. These
training data sets are used to generate (or train) a set of models using the same learning
algorithm. Then the outcomes of the models are combined by majority voting (classification) or
by average (regression). Bagging is a very simple ensemble technique which can perform really
well for unstable learners like a decision tree, in which a slight change in data can impact the
outcome of a model significantly. Just like bagging, boosting is another key ensemble-based
technique.
In this type of ensemble, weaker learning models are trained on resampled data and the
outcomes are combined using a weighted voting approach based on the performance of different
models. Adaptive boosting, or AdaBoost, is a special variant of the boosting algorithm; it is based on the idea of generating weak learners sequentially, with each learner focusing on the mistakes of the previous ones. Random forest is another ensemble-based technique: it is an ensemble of decision trees – hence the name random forest, indicating a forest of decision trees.
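A brief sketch comparing a single decision tree with bagging, random forest, and AdaBoost, assuming scikit-learn and its built-in breast-cancer dataset:

```python
# Sketch: cross-validated accuracy of a single tree vs three ensemble techniques.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    "bagging (bootstrap aggregating)": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(f"{name:32s} CV accuracy = {cross_val_score(model, X, y, cv=5).mean():.3f}")
```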
2.6. BASICS OF FEATURE ENGINEERING – INTRODUCTION
Generally, all machine learning algorithms take input data to generate the output. The input
data remains in a tabular form consisting of rows (instances or observations) and columns
(variable or attributes), and these attributes are often known as features. For example, an image
is an instance in computer vision, but a line in the image could be the feature. Similarly, in NLP,
a document can be an observation, and the word count could be the feature. So, we can say a
feature is an attribute that impacts a problem or is useful for the problem.
What is Feature Engineering?
Feature engineering is a pre-processing step of machine learning that extracts features from raw data. It helps to represent the underlying problem to predictive models in a better way, which, as a result, improves the accuracy of the model on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.
Since 2016, automated feature engineering is also used in different machine learning
software that helps in automatically extracting features from raw data. Feature engineering in
ML contains mainly four processes: Feature Creation, Transformations, Feature Extraction, and
Feature Selection.
These processes are described as below:
Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using operations such as addition, subtraction, and ratio, and these new features offer great flexibility.
Transformations: The transformation step of feature engineering involves adjusting the
predictor variable to improve the accuracy and performance of the model. For example, it
ensures that the model is flexible to take input of the variety of data; it ensures that all the
variables are on the same scale, making the model easier to understand. It improves the model's
accuracy and ensures that all the features are within the acceptable range to avoid any
computational error.
Feature Extraction: Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data. The main aim of this step is to
reduce the volume of data so that it can be easily used and managed for data modelling. Feature
extraction methods include cluster analysis, text analytics, edge detection algorithms, and
principal components analysis (PCA).
Feature Selection: While developing the machine learning model, only a few variables in the
dataset are useful for building the model, and the rest features are either redundant or irrelevant.
If we input the dataset with all these redundant and irrelevant features, it may negatively impact
and reduce the overall performance and accuracy of the model. Hence it is very important to
identify and select the most appropriate features from the data and remove the irrelevant or less
important features, which is done with the help of feature selection in machine
learning. "Feature selection is a way of selecting the subset of the most relevant features from
the original features set by removing the redundant, irrelevant, or noisy features."
Below are some benefits of using feature selection in machine learning:
It helps in avoiding the curse of dimensionality.
It helps in the simplification of the model so that the researchers can easily interpret it.
It reduces the training time.
It reduces overfitting hence enhancing the generalization.
Need for Feature Engineering in Machine Learning
In machine learning, the performance of the model depends on data pre-processing and data
handling. But if we create a model without pre-processing or data handling, then it may not give
good accuracy. Whereas, if we apply feature engineering on the same model, then the accuracy
of the model is enhanced. Hence, feature engineering in machine learning improves the model's
performance. Below are some points that explain the need for feature engineering:
Better features mean flexibility.
Better features mean simpler models.
Better features mean better results.
2.7. FEATURE TRANSFORMATION
Feature Transformation: Feature transformation converts the data, whether structured or unstructured, into a new set of features that can represent the underlying problem which machine learning is trying to solve.
Feature Transformation Types:
1. Feature construction.
2. Feature extraction.
1). Feature construction: Feature construction expands the feature space by creating additional features from the existing ones.
Ex: if a data set has n original features and m new features are constructed, the final data set has n + m features.
2). Feature extraction: New features are created from combinations of the original features.
Types:
1. PCA (Principal Component Analysis)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)
2.8. FEATURE SUBSET SELECTION
Feature selection is a way of selecting the subset of the most relevant features from the original
features set by removing the redundant, irrelevant, or noisy features.
Steps for selection:
Generation of possible subsets
Subset evaluation
Stop searching based on some stopping criterion
Validation of the result
1. Filter Approach: Features are scored and selected independently of any learning algorithm, using measures such as correlation with the target variable.
2. Wrapper Approach: It works by using an induction (learning) algorithm as a black box to evaluate candidate feature subsets. A short sketch of both approaches follows.
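A minimal sketch, assuming scikit-learn: SelectKBest acts as a filter method and RFE as a wrapper method; keeping the top five features is an arbitrary choice for illustration.

```python
# Sketch: a filter approach (SelectKBest) and a wrapper approach (RFE) to feature selection.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Filter: score each feature independently (ANOVA F-test) and keep the top 5.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Filter-selected feature indices :", filter_sel.get_support(indices=True))

# Wrapper: use a learning algorithm as a black box and recursively eliminate features.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("Wrapper-selected feature indices:", wrapper_sel.get_support(indices=True))
```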
UNIT III
BAYESIAN CONCEPT LEARNING &
SUPERVISED LEARNING: CLASSIFICATION
3.1. INTRODUCTION
The technique is derived from the work of the 18th-century mathematician Thomas Bayes. He developed the foundational mathematical principles, known as Bayesian methods, which describe the probability of events. Bayes' ideas underpin decision theory and are used extensively in probability, an important area of mathematics. Bayes' theorem is also widely used in machine learning, where we need to predict classes precisely and accurately. The Bayesian method based on Bayes' theorem is used to calculate conditional probabilities in machine learning applications, including classification tasks.
Why Bayesian methods are important?
Bayesian learning algorithms, like the naive Bayes classifier, are highly practical approaches to certain types of learning problems, as they can calculate explicit probabilities for hypotheses. Typical applications of Bayesian classifiers are as follows:
Text-based classification, such as spam or junk mail filtering, author identification, or topic categorization
Medical diagnosis, such as identifying the probability that a new patient has a disease, given the presence of a set of observed symptoms
Network security, such as detecting illegal intrusions or anomalies in computer networks
One of the strengths of Bayesian classifiers is that they utilize all available parameters to subtly
change the predictions, while many other algorithms tend to ignore the features that have weak effects.
Bayesian classifiers assume that even if few individual parameters have small effect on theoutcome, the
collective effect of those parameters could be quite large. For such learning tasks, the naive Bayes
classifier is most effective.
3.2. BAYES THEOREM
Bayes' theorem is one of the most popular machine learning concepts; it helps to calculate the
probability of one event occurring, under uncertain knowledge, given that another event has already
occurred. Bayes' theorem can be derived using the product rule and the conditional probability of event
X with known event Y.
According to the product rule, we can express the probability of event X occurring together with event Y
as follows:
P(X ∩ Y) = P(X|Y) P(Y)   ... (equation 1)
Further, the probability of event Y occurring together with event X is:
P(X ∩ Y) = P(Y|X) P(X)   ... (equation 2)
Mathematically, Bayes' theorem can be obtained by equating the right-hand sides of the two
equations and dividing by P(Y). We get:
P(X|Y) = [ P(Y|X) P(X) ] / P(Y)
Here, X plays the role of the hypothesis and Y the role of the observed evidence; the two events need
not be independent, the theorem only requires P(Y) > 0.
The above equation is called Bayes' Rule or Bayes' Theorem.
P(X|Y) is called the posterior, which we need to calculate. It is defined as the updated
probability of the hypothesis after considering the evidence.
P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
P(X) is called the prior probability, the probability of the hypothesis before considering the
evidence.
P(Y) is called the marginal probability. It is defined as the probability of the evidence
irrespective of the hypothesis.
Hence, Bayes Theorem can be written as:
Posterior = likelihood * prior / evidence
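As a small numerical illustration of this rule, the sketch below plugs in likelihood, prior and evidence values; they match the rounded figures used in the PlayTennis example later in this unit.

# A minimal numerical sketch of Bayes' rule: Posterior = likelihood * prior / evidence.
prior      = 0.71   # P(X)   : prior probability of the hypothesis X
likelihood = 0.30   # P(Y|X) : probability of the evidence Y when X is true
evidence   = 0.35   # P(Y)   : marginal probability of the evidence

posterior = likelihood * prior / evidence   # P(X|Y)
print(round(posterior, 2))                  # ~0.61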
Brute-force Bayesian (MAP learning) algorithm
For each hypothesis h in H, calculate the posterior probability P(h|T) = P(T|h) P(h) / P(T), where T is the training data.
Output the hypothesis hMAP with the highest posterior probability, hMAP = argmax over h in H of P(h|T).
Consistent Learners
A learning algorithm is a consistent learner if it commits zero errors over the training examples.
Every consistent learner outputs a MAP hypothesis if:
we assume a uniform prior probability distribution over H, and
we assume deterministic, noise-free training data.
A Bayesian perspective can therefore be used to characterize learning algorithms even if they do not
explicitly manipulate probabilities.
Bayes optimal classifier
The most probable classification of a new instance can be obtained by combining the
predictions of all hypotheses, weighted by their corresponding posterior probabilities. Denoting
the possible classification of the new instance as ci from the set C, the probability P(ci|T) that the
correct classification for the new instance is ci is
P(ci|T) = Σ over hj in H of P(ci|hj) P(hj|T)
The optimal (Bayes optimal) classification is the ci for which P(ci|T) is maximum:
argmax over ci in C of Σ over hj in H of P(ci|hj) P(hj|T)
Naive Bayes algorithm:
Step 1: Convert the given data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities of the given features.
Step 3: Use Bayes' theorem to calculate the posterior probability of each class; the class with the highest posterior probability is the prediction.
An Illustrative Example
Let us apply the naive Bayes classifier to a concept learning problem, i.e., classifying days
according to whether someone will play tennis.
The below table provides a set of 14 training examples of the target concept PlayTennis, where
each day is described by the attributes Outlook, Temperature, Humidity, and Wind.
Step 1: Find the frequency table for the given data set.
Weather     YES   NO
Rainy        2     2
Sunny        3     2
Overcast     5     0
Total       10     4
P(Sunny|Yes) = 3/10 = 0.3; P(Yes) = 10/14 = 0.71; P(Sunny) = 5/14 = 0.35.
P(Yes|Sunny) = (0.3 * 0.71) / 0.35 = 0.61.
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 0.35.
P(No|Sunny) = (0.5 * 0.29) / 0.35 = 0.41.
Therefore P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day the player can play the game.
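The short sketch below recomputes the same result directly from the frequency table above, assuming only plain Python.

# Recomputing the PlayTennis example above from its frequency table.
freq = {"Rainy": (2, 2), "Sunny": (3, 2), "Overcast": (5, 0)}  # (Yes, No) counts
total_yes = sum(y for y, n in freq.values())   # 10
total_no  = sum(n for y, n in freq.values())   # 4
total     = total_yes + total_no               # 14

p_yes, p_no = total_yes / total, total_no / total          # P(Yes), P(No)
p_sunny = sum(freq["Sunny"]) / total                       # P(Sunny)

p_sunny_given_yes = freq["Sunny"][0] / total_yes           # P(Sunny|Yes) = 0.3
p_sunny_given_no  = freq["Sunny"][1] / total_no            # P(Sunny|No)  = 0.5

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny    # ~0.60 (≈0.61 with rounded values)
p_no_given_sunny  = p_sunny_given_no  * p_no  / p_sunny    # ~0.40

print("Play" if p_yes_given_sunny > p_no_given_sunny else "Do not play")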
Advantages:
It is a fast and easy ML algorithm for predicting the class of a data set.
It can be used for binary as well as multi-class classification.
It is the most popular choice for text classification problems.
Each node corresponds to a random variable, and a variable can be continuous or discrete. Arcs
or directed arrows represent the causal relationships or conditional probabilities between random
variables. These directed links or arrows connect pairs of nodes in the graph.
These links represent that one node directly influences the other node; if there is no directed
link, the nodes are independent of each other.
Diverging Connection: In this type of connection, the evidence can be transmitted between two
child nodes of the same parent provided that the parent is not instantiated.
Serial Connection: In this type of connection, any evidence entered at the beginning of the
connection can be transmitted through the directed path provided that no intermediate node on
the path is instantiated.
Converging Connection: In this type of connection, the evidence can only be transmitted
between two parents when the child (converging) node has received some evidence and that
evidence can be soft or hard.
Bayesian networks thereby support more effective predictive models. Their ability to learn from both
historical data and expert input makes them essential in fields where decisions depend on complex,
uncertain systems.
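As a rough illustration of how directed links encode conditional probabilities, the sketch below hand-rolls a tiny serial connection A -> B -> C with hypothetical probability tables (no library is assumed) and marginalizes to obtain P(C = true).

# A tiny hand-rolled sketch of a serial connection A -> B -> C in a Bayesian
# network; the probability tables are hypothetical and only for illustration.
p_a = {True: 0.3, False: 0.7}                       # P(A)
p_b_given_a = {True: 0.9, False: 0.2}               # P(B = true | A)
p_c_given_b = {True: 0.8, False: 0.1}               # P(C = true | B)

# Marginalize over A and B to get P(C = true).
p_c_true = 0.0
for a, pa in p_a.items():
    for b in (True, False):
        pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
        p_c_true += pa * pb * p_c_given_b[b]

print(round(p_c_true, 3))   # 0.387 with these illustrative tables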
3.5. SUPERVISED LEARNING
In supervised learning, the labelled training data provide the basis for learning. According
to the definition of machine learning, this labelled training data is the experience or prior
knowledge or belief. It is called supervised learning because the process of learning from the
training data by a machine can be related to a teacher supervising the learning process of a
student who is new to the subject. Here, the teacher is the training data. Training data is the past
information with known value of class field or ‘label’. Hence, we say that the ‘training data is
labelled’ in the case of supervised learning.
Examples of Supervised Learning:
Training a model to distinguish between images of cats and dogs.
Training a model to filter out unwanted emails.
Predicting the price of a particular stock tomorrow.
Predicting the likelihood of a patient having a certain disease based on their medical
history and test results.
Predicting which customers are at risk of cancelling their subscription.
3.6. CLASSIFICATION MODEL
Classification is a type of supervised learning where a target feature, which is of
categorical type, is predicted for test data on the basis of the information imparted by the training
data. The target categorical feature is known as class.
Some typical classification problems include the following:
Hand writing recognition
Image classification
Disease prediction
Win–loss prediction of games
Prediction of natural calamities such as earthquake, flood, etc.
In classification, the whole problem centres on assigning a label or category or class to a
test data point on the basis of the label or category or class information imparted by the training
data. Because the target objective is to assign a class label, we call this type of problem a
classification problem. The below figure depicts the typical process of classification, where a
classification model is obtained from the labelled training data by a classifier algorithm. On the
basis of the model, a class label (e.g. 'Intel', as in the case of the test data referred to in the below
figure) is assigned to the test data.
Classification Model
Classification Learning Steps:
Problem Identification: Identifying the problem is the first step in the supervised learning
model. The problem needs to be a well-formed problem, i.e. a problem with well-defined goals
and benefits, which has a long-term impact.
Identification of Required Data: On the basis of the problem identified above, the required
data set that precisely represents the identified problem needs to be identified/ evaluated. For
example: If the problem is to predict whether a tumour is malignant or benign, then the
corresponding patient data sets related to malignant tumour and benign tumours are to be
identified.
Data Pre-processing: This is related to cleaning/transforming the data set. This step ensures
that all the unnecessary/irrelevant data elements are removed. Data pre-processing refers to the
transformations applied to the identified data before feeding the same into the algorithm.
Because the data is gathered from different sources, it is usually collected in a raw format and is
not ready for immediate analysis. This step ensures that the data is ready to be fed into the
machine learning algorithm.
Definition of Training Data Set: Before starting the analysis, the user should decide what kind
of data set is to be used as a training set. In the case of signature analysis, for example, the
training data set might be a single handwritten alphabet, an entire handwritten word (i.e. a group
of the alphabets) or an entire line of handwriting (i.e. sentences or a group of words).
Thus, a set of 'input meta-objects' and corresponding 'output meta-objects' is also
gathered. The training set needs to be actively representative of the real-world use of the given
scenario. Thus, a set of data inputs (X) and corresponding outputs (Y) is gathered either from
human experts or from experiments.
Algorithm Selection: This involves determining the structure of the learning function and the
corresponding learning algorithm. This is the most critical step of the supervised learning model. On
the basis of various parameters, the best algorithm for a given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the gathered training
set for further fine-tuning. Some supervised learning algorithms require the user to determine
specific control parameters (which are given as inputs to the algorithm). These parameters
(inputs given to the algorithm) may also be adjusted by optimizing performance on a subset (called
a validation set) of the training set.
Evaluation with the Test Data Set: The test data is run through the trained model, and its
performance is measured here. If a suitable result is not obtained, further training or tuning of the
parameters may be required.
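The sketch below ties these learning steps together using scikit-learn; the breast-cancer data set (malignant vs. benign tumours, as in the example above) and the decision-tree classifier are illustrative choices, not ones prescribed by these notes.

# A compact sketch of the classification learning steps above using scikit-learn.
from sklearn.datasets import load_breast_cancer       # malignant vs. benign tumours
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)             # identified / pre-processed data

# Definition of the training (and test) data set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier(max_depth=3)             # algorithm selection
model.fit(X_train, y_train)                             # training

print(model.score(X_test, y_test))                      # evaluation with the test data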
3.7. K-Nearest Neighbor (KNN) Algorithm
The K-Nearest Neighbor (KNN) algorithm is a type of supervised machine learning
algorithm. It is mainly used for classification problems. It is also called a lazy learner or a
non-parametric learner.
KNN Working:
It uses feature similarity to predict the values of new data points, which means that a new
data point is assigned a value based on how closely it matches the points in the training set.
Step 1: Load the training and testing data set.
Step 2: Choose the value of K, i.e. the number of nearest neighbour points; K can be any integer.
Step 3: For each data point in the test data, do the following:
I. Calculate the distance between the test data point and each row of the training data with the
help of the Euclidean distance.
II. Sort the distances in ascending order and take the top K rows.
III. Assign the test data point the class that is most frequent among these K rows.
The Euclidean distance between two points p = (p1, ..., pn) and q = (q1, ..., qn) is given by:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
KNN Algorithm:
Step 1: Input the training and testing data set.
Step 2: Initialize K-Value
Step 3: Calculate the Euclidean distance.
Step 4: Assign the class that is most common among the K nearest samples.
Why is KNN called a lazy learner?
KNN skips the abstraction step: it simply stores the training data and directly applies the
philosophy of nearest-neighbour finding at prediction time to arrive at the classification. Because
it builds no model in advance and defers all computation until a query arrives, it is called a lazy learner.
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
The value of K always needs to be determined, which may be complex at times.
The computation cost is high because the distance between the new data point and all the
training samples must be calculated.
Applications of the KNN Algorithm
Here are some real life applications of KNN Algorithm.
Recommendation Systems: Many recommendation systems, such as those used by Netflix
or Amazon, rely on KNN to suggest products or content. KNN looks at user behavior
and finds similar users. If user A and user B have similar preferences, KNN might
recommend movies that user A liked to user B.
Spam Detection: KNN is widely used in filtering spam emails. By comparing the features
of a new email with those of previously labeled spam and non-spam emails, KNN can
predict whether a new email is spam or not.
Customer Segmentation: In marketing firms, KNN is used to segment customers based on
their purchasing behavior. By comparing new customers to existing customers, KNN can
easily group customers into segments with similar choices and preferences. This helps
businesses target the right customers with the right products or advertisements.
Speech Recognition: KNN is often used in speech recognition systems to transcribe spoken
words into text. The algorithm compares the features of the spoken input with those of
known speech patterns. It then predicts the most likely word or command based on the
closest matches.
Example: Given input data set
S.No Maths Science Result
1 4 3 F
2 6 7 P
3 7 8 P
4 5 5 F
5 8 8 P
Problem: If Maths = 6 and Science = 8, find whether the classification should be Pass or Fail,
using the KNN algorithm.
Sol:
Step 1: Assume K = 3 (nearest neighbours).
Step 2: Compute the Euclidean distance of the query point (6, 8) from each training row:
D1 = sqrt((6-4)^2 + (8-3)^2) = sqrt(29) ≈ 5.39
D2 = sqrt((6-6)^2 + (8-7)^2) = 1
D3 = sqrt((6-7)^2 + (8-8)^2) = 1
D4 = sqrt((6-5)^2 + (8-5)^2) = sqrt(10) ≈ 3.16
D5 = sqrt((6-8)^2 + (8-8)^2) = 2
Step 3: Choose the K = 3 nearest neighbours, i.e. D2, D3 and D5.
Step 4: D2, D3 and D5 all carry the classification 'Pass'.
Hence, for Maths = 6 and Science = 8, the classification result using KNN is 'Pass'.
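The sketch below recomputes this example from scratch in plain Python: it measures the Euclidean distance from (6, 8) to every training row and takes a majority vote among the K = 3 nearest neighbours.

# Recomputing the KNN example above from scratch.
import math
from collections import Counter

train = [((4, 3), "F"), ((6, 7), "P"), ((7, 8), "P"),
         ((5, 5), "F"), ((8, 8), "P")]
query = (6, 8)
k = 3

# Euclidean distance of the query to every training row.
dists = sorted((math.dist(x, query), label) for x, label in train)

# Majority vote among the K nearest neighbours.
votes = Counter(label for _, label in dists[:k])
print(votes.most_common(1)[0][0])   # -> 'P' (Pass)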
3.8. DECISION TREE
A decision tree maps the values of the input variables to an output variable. The output variable is
determined by following a path that starts at the root and is guided by the values of the input variables.
A decision tree is usually represented in the format depicted in the figure below.
a). Entropy: Entropy is a measure of the impurity or randomness of a set of instances; for class
proportions Pj, Entropy = -∑j Pj log2(Pj).
Example:
For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of b: 5
Instances of a: 3
Entropy(X) = -(3/8) log2(3/8) - (5/8) log2(5/8) ≈ 0.954
b). Gini Index: The Gini index is a measure of impurity or purity used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index
should be preferred over one with a high Gini index. The CART algorithm creates only binary
splits, and it uses the Gini index to choose them. The Gini index can be calculated using the formula:
Gini Index = 1 - ∑j Pj²
For the set X above, Gini Index = 1 - ((3/8)² + (5/8)²) ≈ 0.469.
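The sketch below recomputes both measures for the set X = {a, a, a, b, b, b, b, b} used in the example above, assuming only plain Python.

# Entropy and Gini index for the set X = {a, a, a, b, b, b, b, b}.
import math

counts = {"a": 3, "b": 5}
total = sum(counts.values())
probs = [c / total for c in counts.values()]

entropy = -sum(p * math.log2(p) for p in probs)   # ~0.954 bits
gini    = 1 - sum(p ** 2 for p in probs)          # ~0.469

print(round(entropy, 3), round(gini, 3))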
Avoiding overfitting in decision trees – pruning
The decision tree algorithm, unless a stopping criterion is applied, may keep growing indefinitely,
splitting on every feature and dividing the data into smaller and smaller partitions until every point is
perfectly classified. This, as is quite evident, results in an overfitting problem. To prevent a decision
tree from getting overfitted to the training data, pruning of the decision tree is essential. Pruning a
decision tree reduces the size of the tree such that the model is more generalized and can classify
unknown and unlabelled data in a better way. There are two approaches to pruning:
Pre-pruning: Stop growing the tree before it reaches perfection.
Post-pruning: Allow the tree to grow entirely and then post-prune some of the branches from it.
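The sketch below contrasts the two approaches using scikit-learn's DecisionTreeClassifier: pre-pruning via a depth limit and post-pruning via cost-complexity pruning (ccp_alpha); the data set and parameter values are illustrative only.

# A sketch of pre-pruning and post-pruning with scikit-learn's decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early by limiting depth / minimum samples per split.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X_train, y_train)

# Post-pruning: grow fully, then prune back with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))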
Advantages of the Decision Tree
It is simple to understand, as it follows the same process a human follows while making a
decision in real life.
It can be very useful for solving decision-related problems.
It helps to think about all the possible outcomes of a problem.
There is less requirement for data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
The decision tree can contain many layers, which makes it complex.
It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
For more class labels, the computational complexity of the decision tree may increase.
Step 5: For a new data point, obtain the prediction of each decision tree in the forest and assign the
new data point to the category that receives the majority of the votes.
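A minimal sketch of this majority-vote idea, assuming scikit-learn and a synthetic data set, is given below.

# A minimal sketch of the random forest majority vote described in Step 5;
# the synthetic data is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_point = X[:1]                 # stand-in for a new data point
print(forest.predict(new_point))  # class chosen by the majority of the trees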
3.10. SUPPORT VECTOR MACHINES
Support Vector Machines (SVM) is a model which can do linear classification as well as
regression. SVM is based on the concept of a surface, called a hyper-plane, which draws a
boundary between data instances plotted in the multi-dimensional feature space. The output
prediction of an SVM is one of two conceivable classes which are already defined in the training
data. In summary, the SVM algorithm builds an N-dimensional hyper-plane model that assigns
future instances to one of the two possible output classes.
Support Vectors: SVM chooses the extreme points (vectors) that help to create the hyper-plane;
these extreme cases are called support vectors.
How SVM works:
Step 1: Classification is done using a hyper-plane.
Step 2: Identify the correct hyper-plane that separates the classes.
Step 3: Find the maximum-margin hyper-plane based on the support vectors.
Step 4: Apply the kernel trick when the data is not linearly separable.
Types of SVM:
Linear SVM: It separates the classes with a single straight line (hyper-plane).
Non-Linear SVM: It is used when the classes cannot be separated by a straight line; the kernel trick maps the data into a higher-dimensional space where a linear separator exists (see the sketch below).
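The sketch below contrasts a linear SVM with a non-linear (RBF kernel) SVM using scikit-learn's SVC; the two-moons data set is an illustrative choice, not one taken from these notes.

# A sketch of a linear and a non-linear (RBF kernel) SVM with scikit-learn.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm    = SVC(kernel="rbf").fit(X_train, y_train)       # kernel trick

print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))
print("support vectors:", rbf_svm.support_vectors_.shape)  # the extreme points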
Strengths of SVM
SVM can be used for both classification and regression.
It is robust, i.e. not much impacted by data with noise or outliers. The prediction results
using this model are very promising.
Weaknesses of SVM
SVM is applicable only for binary classification, i.e. when there are only two classes in
the problem domain. The SVM model is very complex, almost like a black box, when it deals
with a high-dimensional data set; hence, it is very difficult, and close to impossible, to understand
the model in such cases. It is also slow for a large data set, i.e. a data set with either a large number of
features or a large number of instances, and it is quite memory intensive.
Application of SVM
SVM is most effective when it is used for binary classification, i.e. for solving a machine
learning problem with two classes. One common problem on which SVM can be applied is in the
field of bioinformatics – more specifically, in detecting cancer and other genetic disorders. It can
also be used in detecting the image of a face by binary classification of images into face and non-
face components. More such applications can be described.
Questions:
1). Explain the concept of Bayes' theorem with an example.
2). Explain the Bayesian belief network and conditional independence with an example.
3). Explain the classification model. What are the classification learning steps?
4). What are Bayesian belief nets? Where are they used?
5). Explain the brute-force MAP hypothesis learner. What is the minimum description length principle?
6). Explain the Naïve Bayes classifier with an example.
7). Develop the concepts of K-Nearest Neighbours.
8). What are the benefits of the K-NN algorithm?
9). How is the structure of a decision tree built? What are the advantages of a decision tree?
10). Explain the SVM classifier with a suitable example.
Prepared By:
Dr.K.Venkata Nagendra,
Dept.of CSE, SVCN.