MACHINE LEARNING
TECHNIQUES
Prepared by,
Dr.J.Preetha,
Prof./IT
UNIT – I
MACHINE LEARNING BASICS
Introduction to Machine Learning (ML)
Essential concepts of ML
Types of learning
Machine learning methods based on Time
Dimensionality
Linearity and Non linearity
Early trends in Machine learning
Data Understanding Representation and visualization
2
INTRODUCTION TO MACHINE
LEARNING
What is Machine Learning?
 Machine Learning: the study of algorithms that improve
their performance at some task with experience.
 Optimize a performance criterion using example
data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer Science: Efficient algorithms to
 Solve the optimization problem
 Represent and evaluate the model for
inference
MACHINE LEARNING
Machine learning is a branch of
artificial intelligence that enables computers
to “self-learn” from training data and
improve over time without being explicitly
programmed.
ML algorithms are able to detect patterns
in data and learn from them in order to
make their own predictions.
5
What is Machine Learning?
“Learning is any process by which a system
improves performance from experience.”
- Herbert Simon
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Slide credit: Pedro Domingos
When Do We Use Machine Learning?
ML is used when:
•Human expertise does not exist (navigating on Mars)
•Humans can’t explain their expertise (speech recognition)
•Models must be customized (personalized medicine, routing on
a computer network)
•Models are based on huge amounts of data (user biometrics)
Learning isn’t always useful:
•There is no need to “learn” to calculate payroll
Based on slide by E. Alpaydin
 Retail: Market basket analysis, Customer
relationship management (CRM)
 Finance: Credit scoring, fraud detection
 Manufacturing: Optimization, troubleshooting
 Medicine: Medical diagnosis
 Telecommunications: Quality of service
optimization
 Bioinformatics: Motifs, alignment
 Web mining: Search engines
Applications:
Growth of Machine Learning
 Machine learning is preferred approach to
 Speech recognition, Natural language processing
 Computer vision
 Medical outcomes analysis
 Robot control
 Computational biology
 This trend is accelerating
 Improved machine learning algorithms
 Improved data capture, networking, faster computers
 Software too complex to write by hand
 New sensors / IO devices
 Demand for self-customization to user, environment
 It turns out to be difficult to extract knowledge from human experts:
the failure of expert systems in the 1980s.
10
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” -Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded
while observing a human driver.
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
Well Defined Learning Problem
Learning = Improving with experience at some task
Improve over task T ,
with respect to performance measure P ,
based on experience E.
E.g., Learn to play checkers
T : Play checkers
P : % of games won in world tournament
E: opportunity to play against self
13
Learning to Play Checkers
T : Play checkers
P : Percent of games won in world tournament
What experience?
What exactly should be learned?
How shall it be represented?
What specific algorithm to learn it?
14
Type of Training Experience
Direct or indirect?
Teacher or not?
A problem: is training experience representative of performance
goal?
15
Representation for Target
Choose
Function
collection of rules?
neural network ?
polynomial function of board features?
...
16
A Representation for the Learned Function
V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b)
bp(b): number of black pieces on board b
rp(b): number of red pieces on b
bk(b): number of black kings on b
rk(b): number of red kings on b
bt(b): number of red pieces threatened by black (i.e., which can
be taken on black's next turn)
rt(b): number of black pieces threatened by red
Obtaining Training Examples
V(b): the true target function
V̂(b): the learned function
V_train(b): the training value
One rule for estimating training values:
V_train(b) ← V̂(Successor(b))
Choose Weight Tuning Rule
LMS weight update rule. Do repeatedly:
Select a training example b at random
1. Compute error(b):
error(b) = V_train(b) − V̂(b)
2. For each board feature f_i, update weight w_i:
w_i ← w_i + c · f_i · error(b)
where c is some small constant, say 0.1, to moderate the
rate of learning
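A minimal Python sketch (not part of the original slides) of this LMS update for the checkers evaluation function; the board feature values and the training value below are hypothetical numbers used only for illustration.

# LMS weight update for V̂(b) = w0 + w1*bp(b) + ... + w6*rt(b)
def v_hat(weights, features):
    # features[0] is a constant 1 so that weights[0] acts as the intercept w0
    return sum(w * f for w, f in zip(weights, features))

def lms_update(weights, features, v_train, c=0.1):
    # w_i <- w_i + c * f_i * error(b), where error(b) = V_train(b) - V_hat(b)
    error = v_train - v_hat(weights, features)
    return [w + c * f * error for w, f in zip(weights, features)]

weights = [0.0] * 7                        # w0 .. w6
board_features = [1, 12, 12, 0, 0, 1, 2]   # [1, bp, rp, bk, rk, bt, rt] (hypothetical)
weights = lms_update(weights, board_features, v_train=100.0)
print(weights)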
Completed Design: Design Choices
Determine type of training experience:
games against self, games against experts, or a table of correct moves
Determine target function:
Board → value, or Board → move
Determine representation of learned function:
linear function of six features, artificial neural network, or polynomial
Determine learning algorithm:
gradient descent or linear programming
ESSENTIAL CONCEPTS OF ML
TYPES OF LEARNING
 Supervised (inductive) learning
 Training data includes desired outputs
 Unsupervised learning
 Training data does not include desired outputs
 Semi-supervised learning
 Training data includes a few desired outputs
 Reinforcement learning
 Rewards from sequence of actions
Learning Associations
 Basket analysis:
P(Y | X): the probability that somebody who buys X also
buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
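A minimal sketch of estimating a conditional probability P(Y | X) from the sample transactions above; the item names simply mirror the table.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional_support(x, y, baskets):
    # P(Y | X) = (# baskets containing both X and Y) / (# baskets containing X)
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_support("Beer", "Diaper", transactions))  # P(Diaper | Beer) = 1.0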
27
Supervised Learning: Uses
 Prediction of future cases: Use the rule to predict
the output for future inputs
 Knowledge extraction: The rule is easy to
understand
 Compression: The rule is simpler than the data it
explains
 Outlier detection: Exceptions that are not covered
by the rule, e.g., fraud
Example: decision trees are tools that create such rules
28
Inductive Learning
 Given examples of a function (X, F(X))
 Predict function F(X) for new examples X
 Discrete F(X): Classification
 Continuous F(X): Regression
 F(X) = Probability(X): Probability estimation
Prediction: Regression
 Example: Price of a
used car
 x: car attributes
y: price
y = g(x | θ)
g(·): the model, θ: the parameters
For a linear model: y = w x + w0
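A minimal sketch of fitting the linear model y = w·x + w0 for used-car price prediction, assuming scikit-learn is available; the data points are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one car attribute, e.g. age in years
y = np.array([9.0, 7.8, 6.9, 6.1, 5.2])             # price (arbitrary units)

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)             # the learned w and w0
print(model.predict([[6.0]]))                        # predicted price for a 6-year-old car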
31
Regression Applications
 Navigating a car: Angle of the steering wheel (CMU
NavLab)
 Kinematics of a robot arm: given a target position (x, y),
predict the joint angles α1 = g1(x, y) and α2 = g2(x, y)
Classification
 Example: Credit
scoring
 Differentiating
between low-risk
and high-risk
customers from
their income and
savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
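A direct translation of the discriminant rule above into code; the threshold values θ1 and θ2 are arbitrary placeholders, not values from the slides.

THETA1, THETA2 = 30000, 10000   # hypothetical income and savings thresholds

def credit_risk(income, savings):
    # IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 15000))  # low-risk
print(credit_risk(25000, 20000))  # high-risk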
Classification: Applications
 Pattern recognition
 Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency.
 Use of a dictionary or the syntax of the language.
 Sensor fusion: Combine multiple modalities; e.g., visual (lip
image) and acoustic for speech
 Medical diagnosis: From symptoms to illnesses
 Web Advertising: Predict if a user clicks on an ad on
the Internet.
43
Face Recognition
Training examples of a person
Test images
AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html
Unsupervised Learning
 Learning “what normally happens”; there is no target output
 Clustering: Grouping similar instances
 Other applications: Summarization, Association
Analysis
 Example applications
 Customer segmentation in CRM
 Image compression: Color quantization
 Bioinformatics: Learning motifs
Reinforcement Learning
 Topics:
 Policies: what actions should an agent take in a particular
situation
 Utility estimation: how good is a state (used by policy)
 No supervised output but delayed reward
 Credit assignment problem (what was responsible for the
outcome)
 Applications:
 Game playing
 Robot in a maze
 Multiple agents, partial observability, ...
MACHINE LEARNING
METHODS BASED ON TIME
 A static model is trained offline.
 That is, we train the model exactly once and then
use that trained model for a while.
 A dynamic model is trained online.
 That is, data is continually entering the system and
we're incorporating that data into the model
through continuous updates.
Curse of Dimensionality
 The Curse of Dimensionality in Machine
Learning arises when working with high-
dimensional data, leading to increased
computational complexity, overfitting, and
spurious correlations.
 Techniques like dimensionality reduction, feature
selection, and careful model design are essential
for mitigating its effects and improving algorithm
performance.
 Navigating this challenge is crucial for unlocking
the potential of high-dimensional datasets and
ensuring robust machine-learning solutions.
LINEARITY AND NON-LINEARITY
 Linearity refers to the property of a system or
model where the output is directly proportional
to the input.
 Nonlinearity implies that the relationship
between input and output is more complex and
cannot be expressed as a simple linear function.
63
Linearity vs Nonlinearity
64
Linearity
 Linear models are often the simplest and most effective
approach.
 A linear model essentially fits a straight line to the data,
allowing it to make predictions based on a linear relationship
between the input features and the output variable.
 In a regression problem, a linear model is used to predict a
continuous target variable based on one or more input
features, such as the size and age of a tree.
65
Nonlinearity
 Nonlinear models can take many forms, from polynomial
models that fit curves to the data, to neural networks that
can learn complex patterns in high-dimensional data.
 These models are often more powerful than linear models
because they can capture more complex relationships
between variables.
 In a classification problem, nonlinear models are able to
identify complex decision boundaries that separate
different classes, especially when those boundaries are not
linear.
67
Occam’s Razor
 If you have two competing ideas to explain the same
phenomenon, you should prefer the simpler one.
68
Techniques for Applying Occam’s Razor in Machine
Learning:
69
No Free Lunch Theorem:
 NFL Theorem: if an algorithm performs better on a
certain class of problems, then it pays for it in the
form of degraded performance on other classes of
problems.
 In other words, you cannot have a single optimal
solution for all classes of problems.
70
Why Understanding Occam’s Razor is
Important for Data Scientists
1. Enhancing Interpretability:
Simpler models are often more interpretable, which
means it’s easier to understand how they’re making
predictions.
2. Avoiding Overfitting:
The goal of machine learning is to make accurate
predictions on new, unseen data. By keeping models
simpler, data scientists can reduce the risk of overfitting.
3. Improving Generalizability:
Simpler models are less likely to fit the noise in the training
data and more likely to capture the underlying trend or
relationship.
4. Reducing Computational Resources:
Simpler models require less computation, which matters
where resources might be limited or expensive.
71
EARLY TRENDS IN MACHINE LEARNING
DATA UNDERSTANDING,
REPRESENTATION, AND
VISUALIZATION
DATA UNDERSTANDING:
Data understanding basically involves analyzing and
exploring the data to identify any patterns or trends that
may be present.
Data exploration provides a high-level overview of
each attribute (also called variable) in the dataset and the
interaction between the attributes.
The step of understanding the data can be broken down
into three parts:
1. Understanding entities
2. Understanding attributes
3. Understanding data types
77
Understanding Entities
 In the fields of data science, machine learning, and
artificial intelligence, entities represent groups of
data separated based on conceptual themes and/or
data acquisition methods.
 An entity typically represents a table in a database,
or a flat file.
 EXAMPLE:
comma separated variable (csv) file, or tab
separated variable (tsv) file.
Sometimes it is more efficient to represent the
entities using a more structured format like
svmlight.
Each entity can contain multiple attributes.
78
Understanding Attributes
 Each attribute can be thought of as a column in the
file or table.
 In the case of the Iris data, the attributes of the single
given entity are sepal length in cm, sepal width in
cm, petal length in cm, petal width in cm, and the
class label.
 Structured formats like svmlight are more
useful in the case of sparse data, as they add
significant overhead when the data is fully
populated.
 Sparse data is data with high dimensionality, but
with many samples missing values for multiple
attributes.
79
Sepal-length Sepal-width Petal-length Petal-width Class label
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
5.0 3.0 1.6 0.2 Iris-setosa
5.0 3.4 1.6 0.4 Iris-setosa
5.2 3.5 1.5 0.2 Iris-setosa
5.2 3.4 1.4 0.2 Iris-setosa
7.0 3.2 4.7 1.4 Iris-versicolor
6.4 3.2 4.5 1.5 Iris-versicolor
6.9 3.1 4.9 1.5 Iris-versicolor
5.5 2.3 4.0 1.3 Iris-versicolor
6.5 2.8 4.6 1.5 Iris-versicolor
6.7 3.1 4.7 1.5 Iris-versicolor
6.3 2.3 4.4 1.3 Iris-versicolor
5.6 3.0 4.1 1.3 Iris-versicolor
5.5 2.5 4.0 1.3 Iris-versicolor
5.5 2.6 4.4 1.2 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica
80
Understanding Data Types
 Attributes in each entity can be of various different
types from the storage and processing perspective, e.g.,
string, integer valued, datetime, binary (“true”/“false”,
or “1”/“0”), etc.
 Each type needs to be handled separately for generating
a feature vector that will be consumed by the machine
learning algorithm.
 We can also come across sparse data, in which case
some attributes will have missing values.
 This missing data is typically replaced with special
characters, which should not be confused with any of
the real values.
 In order to process data with missing values, one can
either replace them with some default values, or use an
algorithm that can work with missing data.
81
Representation and Visualization
of the Data
Data visualization is the representation of data through the
use of common graphics, such as charts, plots, infographics and
even animations.
These visual displays of information communicate
complex data relationships and data-driven insights in a way that
is easy to understand.
When the data has more than two or three dimensions, there are a
couple of options:
 Draw multiple plots taking 2 or 3 dimensions at a
time.
 Reduce the dimensionality of the data and plot up to
3 dimensions.
The most common methods used to reduce the dimensionality are:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
This can include a variety of visual tools such as:
Charts: Bar charts, line charts, pie charts, etc.
Graphs: Scatter plots, histograms, etc.
Maps: Geographic maps, heat maps, etc.
Dashboards: Interactive platforms that combine
multiple visualizations.
83
84
Tools for Visualization of Data
 Tableau
 Looker
 Zoho Analytics
 Sisense
 IBM Cognos Analytics
 Qlik Sense
 Domo
 Microsoft Power BI
 Klipfolio
 SAP Analytics Cloud
Principal Component Analysis
 We can only visualize the data in a maximum of 2 or
3 dimensions. However, it is common practice to
have the dimensionality of the data in the tens or even
hundreds.
 If we can find the exact coordinates (Xr, Yr) of the
paper’s orientation as a linear combination of the X, Y,
and Z coordinates of the 3-dimensional space,
we can reduce the dimensionality of the data from
3 to 2.
87
2-dimensional data where PCA dimensionality
is same but along different axes
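A minimal PCA sketch, assuming scikit-learn: project synthetic 3-dimensional points that lie roughly on a plane down to 2 dimensions, as described above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xy = rng.normal(size=(100, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 0.01 * rng.normal(size=100)  # nearly planar
data_3d = np.column_stack([xy, z])

pca = PCA(n_components=2)
data_2d = pca.fit_transform(data_3d)
print(data_2d.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # almost all variance kept by the first 2 components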
LINEAR DISCRIMINANT
ANALYSIS
 Where PCA tries to find the dimensions that
maximize the variance of the data, linear discriminant
analysis (LDA) tries to maximize the separation
between the classes of data.
 Thus LDA can only work effectively when we are
dealing with a classification type of problem, and the
data intrinsically contains multiple classes.
89
3-dimensional data with 2 classes in another
perspective, with LDA representation.
LDA reduces the effective dimensionality of
data into 1 dimension where the two classes
can be best separated
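A minimal LDA sketch, assuming scikit-learn and the Iris data: reduce the 4-dimensional features to a single dimension that best separates two of the classes, in the spirit of the figure above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
mask = y < 2                      # keep only two classes for this illustration
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X[mask], y[mask])
print(X_1d.shape)                 # (100, 1): one discriminant dimension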
UNIT – II
MACHINE LEARNING METHODS
Linear methods
Regression
Classification
Perceptron and Neural networks
Decision trees
Support vector machines
Probabilistic models
Unsupervised learning
Featurization.
91
LINEAR METHODS
Machine Learning Algorithms are Divided Into Two
Types:
1.Supervised learning algorithms
2.Unsupervised learning algorithms
92
Supervised learning algorithms
93
 Where the model is trained on a labelled
dataset.
 A labelled dataset is one that has both
input and output parameters.
 In this type of learning, both the training and
validation datasets are labelled.
 The input features are the attributes or
characteristics of the data that are used to
make predictions
 while the output labels are the desired
outcomes or targets that the algorithm tries
to predict.
94
EXAMPLE:
Consider a dataset from a shopping store that is useful
in predicting whether a customer will purchase a
particular product under consideration or not, based
on his/her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased, i.e. 0 or 1; 1 means that the
customer will purchase and 0 means that the
customer won’t purchase it.
95
Supervised learning can be further classified
into two categories:
96
Unsupervised learning algorithms
Unsupervised learning is a type of machine
learning technique in which an algorithm discovers
patterns and relationships using unlabelled data.
97
 There are two main categories of unsupervised
learning that are mentioned below:
1. Clustering
2. Association
 Clustering is the process of grouping data points
into clusters based on their similarity.
 This technique is useful for identifying patterns
and relationships in data without the need for
labelled examples.
98
 Here are some clustering algorithms:
i. K-Means Clustering algorithm
ii. Mean-shift algorithm
iii. DBSCAN Algorithm
iv. Principal Component Analysis
v. Independent Component Analysis
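A minimal clustering sketch with k-means, assuming scikit-learn; the points are synthetic and the number of clusters (k = 2) is chosen only for illustration.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
                   [8, 8], [8.3, 7.9], [7.8, 8.2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned cluster centres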
Association
Association rule learning is a technique for
discovering relationships between items in a dataset.
It identifies rules that indicate the presence of
one item implies the presence of another item with a
specific probability.
Here are some association rule learning
algorithms:
i.Apriori Algorithm
ii.Eclat
iii.FP-growth Algorithm
100
LINEAR REGRESSION
Regression is a supervised machine learning
technique which is used to predict continuous
values.
The ultimate goal of a regression algorithm is
to plot a best-fit line or curve through the data.
Linear regression is a type of supervised
machine learning algorithm that computes the linear
relationship between the dependent variable and one
or more independent features by fitting a linear
equation to observed data.
101
Types:
i. Linear Regression
ii. Polynomial Regression
iii. Ridge Regression
iv. Lasso Regression
v. Logistic Regression
vi. Support Vector Regression (SVR)
vii. Decision Tree Regression
viii. Random Forest Regression
ix. Gradient Boosting Regression
102
Simple Linear Regression
This is the simplest form of linear regression,
and it involves only one independent variable and one
dependent variable. The equation for simple linear
regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
MULTIPLE LINEAR REGRESSION
Linear regression is a classic example of a strictly
linear model. It is also called polynomial fitting.
Let us consider a problem of linear regression
where the training data contains p samples.
The input is n-dimensional, (xi, i = 1, ..., p) with xi ∈ Rⁿ.
The output is single dimensional, (yi, i = 1, ..., p) with yi ∈ R.
The model assumes a linear relationship between the
independent and dependent variables:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ
where β0 is the intercept, β1, β2, ..., βn are the coefficients, and ϵ is the
error term.
Ridge Regression
In machine learning, ridge regression helps
reduce the overfitting that results from model
complexity. Model complexity can be due to a model
possessing too many features.
In Ridge Regression, a penalty term is added to
the loss function to constrain the coefficients. The
penalty is proportional to the square of the
magnitude of the coefficients.
Loss function:
Minimize ∑_{i=1}^{n} (yi − ŷi)² + λ ∑_{j=1}^{p} βj²
Here, λ is a regularization parameter that
controls the strength of the penalty, βj are the
coefficients, and p is the number of features.
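A minimal ridge regression sketch, assuming scikit-learn; its alpha argument plays the role of the regularization parameter λ in the loss function above, and the data is invented for illustration.

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([3.1, 2.9, 7.2, 6.8, 11.1])

ridge = Ridge(alpha=1.0).fit(X, y)      # larger alpha -> stronger shrinkage of coefficients
print(ridge.coef_, ridge.intercept_)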
CLASSIFICATION
The Classification algorithm is a Supervised
Learning technique that is used to identify the
category of new observations on the basis of training
data.
In Classification, a program learns from the
given dataset or observations and then classifies new
observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam,
cat or dog, etc.
Classes can be called targets, labels, or
categories.
In a classification algorithm, a discrete output
function (y) is mapped to the input variable (x):
y = f(x)
where y is the categorical output.
Example :
Email Spam Detector.
There are two classes, class A and Class B. These classes
have features that are similar to each other and dissimilar
to other classes.
107
Types:
1. Linear Models
1. Logistic Regression
2. Support Vector Machines
2. Non-linear Models
1. K-Nearest Neighbours
2. Kernel SVM
3. Naïve Bayes
4. Decision Tree Classification
5. Random Forest Classification
108
K-Nearest Neighbor(KNN)
Algorithm for Machine Learning
K-Nearest Neighbour is one of the simplest
Machine Learning algorithms based on Supervised
Learning technique.
The k-nearest neighbors (KNN) algorithm is a
non-parametric, supervised learning classifier, which
uses proximity to make classifications or predictions
about the grouping of an individual data point.
109
The weighted K-Nearest Neighbors
(KNN) algorithm is a supervised learning
technique used for classification and
regression tasks.
It makes predictions based on the
similarity between a new data point and its
closest neighbors in the training dataset.
110
Classification
For classification, the target variable is categorical.
Predicting the class of a new data point based on
the majority class among its k-nearest neighbors.
Regression
while for regression, the target variable is
continuous.
Predicting the value of a continuous target variable
based on the average value of the target variable
among its k-nearest neighbors.
111
How does K-NN work?
The K-NN working can be explained on the basis
of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number
of neighbors
Step-3: Identify the K nearest neighbors as per the
calculated Euclidean distance.
Step-4: Among these k neighbors, count the number
of data points in each category.
Step-5: Assign the new data point to the category
for which the number of neighbors is maximum.
Step-6: Our model is ready.
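A minimal K-NN classification sketch, assuming scikit-learn; it follows the steps above (choose K, measure Euclidean distance, vote among the K nearest neighbors) on made-up training points.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.5], [6.5, 6.5]]))   # -> [0 1]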
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to the noisy training data
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
Always needs to determine the value of K, which may
be complex at times.
The computation cost is high because of calculating
the distance between the data points for all the
training samples.
114
Real-World Applications
Weighted KNN is used in a wide range of applications, including:
1 Recommendation Systems
Recommending products or content based on the preferences of similar users.
2 Image Recognition
Classifying images based on their visual features.
3 Financial Forecasting
Predicting future stock prices or other financial indicators based on historical data.
4 Medical Diagnosis
Assisting in the diagnosis of diseases based on patient data and medical records.
PERCEPTRON AND NEURAL
NETWORKS
 A weight is assigned to each input node of a
perceptron, indicating the significance of that input
to the output.
 The perceptron’s output is a weighted sum of the
inputs that has been run through an activation
function to decide whether or not the perceptron
will fire. It computes the weighted sum of its inputs
as:
z = w1x1 + w2x2 + ... + wnxn = xᵀw
116
The activation function that perceptrons utilize most
frequently is a step function that compares this weighted
sum to a threshold: it outputs 1 if the input is larger
than the threshold value and 0 otherwise.
The most common step function used in
perceptrons is the Heaviside step function:
H(z) = 0 if z < 0, and H(z) = 1 if z ≥ 0
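A minimal sketch of a single perceptron's forward pass with the Heaviside step activation; the weights, bias, and inputs are illustrative numbers only.

import numpy as np

def heaviside(z):
    # H(z) = 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def perceptron(x, w, b):
    z = np.dot(x, w) + b          # z = w1*x1 + w2*x2 + ... + wn*xn + b
    return heaviside(z)

w = np.array([0.5, -0.6])
b = 0.1
print(perceptron(np.array([1.0, 0.2]), w, b))   # fires (1)
print(perceptron(np.array([0.1, 1.0]), w, b))   # does not fire (0)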
The perceptron, a foundational building block in the realm of
artificial intelligence, serves as the fundamental unit of neural
networks.
This simple yet powerful computational model, conceived by
Frank Rosenblatt in the 1950s, simulates the behavior of a
single neuron in the human brain.
It takes multiple input signals, applies weights to each signal,
and generates a single output signal based on a threshold
function.
These interconnected perceptrons, forming layers, create
neural networks capable of solving complex tasks like image
recognition, natural language processing, and machine
translation.
The Perceptron: A Simple Model
Input Layer
The input layer receives information from the external environment.
These inputs can be numerical values, representing features or
characteristics of the data being processed. Each input is assigned a
weight, reflecting its importance or influence on the overall decision.
Activation Function
The activation function determines the output of the perceptron. It
introduces nonlinearity, allowing the perceptron to learn complex
relationships in the data. Common activation functions include the
sigmoid, ReLU, and tanh functions.
Output Layer
The output layer produces a single output value, representing the final
decision or prediction made by the perceptron. This output can be a
binary value (0 or 1) for classification tasks or a continuous value for
regression tasks.
Neural Networks: Layers of Perceptrons
Input Layer
The input layer receives raw data, which is then transformed into a
representation suitable for processing by the network. This representation
is often in the form of numerical values, representing features or
characteristics of the data.
Hidden Layers
Hidden layers process the input information through a series of non-linear
transformations. These layers extract complex features and relationships
from the data, enabling the network to learn intricate patterns and make
accurate predictions.
Output Layer
The output layer generates the network's final prediction or decision. The
structure of the output layer depends on the specific task being solved.
For classification tasks, it may produce a probability distribution over
different classes. For regression tasks, it may output a continuous value.
Learning and Optimization
Forward Propagation
Information flows from the input layer through the hidden layers
to the output layer. During forward propagation, the network
calculates the output based on the current weights and biases.
Backpropagation
The difference between the predicted output and the actual target
output is calculated. This error signal is then propagated backward
through the network, updating the weights and biases to minimize
the error.
Gradient Descent
Gradient descent is an optimization algorithm that iteratively
adjusts the weights and biases in the direction that minimizes the
error. It uses the gradients of the error function to guide the
search for the optimal parameters.
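A minimal gradient descent sketch (not taken from the slides), assuming a simple linear model y = w·x + b and a mean-squared-error loss; it iteratively steps against the gradient to find the optimal parameters.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])          # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    y_pred = w * x + b
    grad_w = 2 * np.mean((y_pred - y) * x)  # d(MSE)/dw
    grad_b = 2 * np.mean(y_pred - y)        # d(MSE)/db
    w -= lr * grad_w                        # step against the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))             # close to 2.0 and 1.0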
Types of Neural Networks
1. Convolutional Neural Networks (CNNs)
CNNs are specialized for image recognition tasks. They use convolutional filters to
extract features from images, making them highly effective in identifying objects,
scenes, and patterns.
2. Recurrent Neural Networks (RNNs)
RNNs are designed to process sequential data, such as text or time series. They use
feedback loops to maintain a memory of previous inputs, allowing them to
understand context and dependencies in sequential data.
3. Long Short-Term Memory (LSTM) Networks
LSTMs are a type of RNN that are particularly well-suited for learning long-term
dependencies. They have special memory cells that can store information over
extended periods, enabling them to capture complex relationships in sequential
data.
4. Generative Adversarial Networks (GANs)
GANs are composed of two competing networks: a generator and a discriminator.
The generator creates new data samples, while the discriminator tries to distinguish
between real and generated data. This adversarial process results in the generation
of highly realistic and diverse data.
Applications of Neural Networks
Image Recognition, Object Detection, Image Classification,
Natural Language Processing, Machine Translation,
Text Summarization, Speech Recognition, Time Series
Forecasting, Drug Discovery
PROBABILISTIC MODELS
Machine learning algorithms today rely heavily
on probabilistic models, which take into consideration
the uncertainty inherent in real-world data.
These models make predictions based on
probability distributions, rather than absolute values,
allowing for a more nuanced and accurate
understanding of complex systems.
One common approach is Bayesian inference,
where prior knowledge is combined with observed
data to make predictions.
Another approach is maximum likelihood
estimation, which seeks to find the model that best
fits observational data.
126
Probabilistic modeling is a statistical approach
that uses the effect of random occurrences or actions
to forecast the possibility of future results.
It is a quantitative modeling method that
projects several possible outcomes that might even
go beyond what has happened recently.
Categories Of Probabilistic Models
1.Generative models
2.Discriminative models.
127
Generative models:
Generative models aim to model the joint
distribution of the input and output variables.
These models generate new data based on the
probability distribution of the original dataset.
Generative models are powerful because they can
generate new data that resembles the training data.
They can be used for tasks such as image and
speech synthesis, language translation, and text
generation.
128
Discriminative models:
The discriminative model aims to model the
conditional distribution of the output variable given
the input variable.
They learn a decision boundary that separates
the different classes of the output variable.
Discriminative models are useful when the focus is on
making accurate predictions rather than generating
new data.
They can be used for tasks such as image
recognition, speech recognition, and sentiment
analysis.
129
Maximum Likelihood Estimation
The maximum likelihood estimation (MLE)
approach deals with the problem at face value
and parameterizes the information into variables.
The values of the variables that maximize the
probability of the observed variables lead to the
solution of the problem.
Let us define the problem using formal
notation. Let there be a function fθ(x) that produces
the observed output y.
x ∈ ℜⁿ represents the input, over which we don’t
have any control, and θ ∈ Θ represents a parameter
vector that can be single or multidimensional.
The MLE method defines a likelihood function
denoted as L(y|θ).
Typically the likelihood function is the joint
probability of the parameters and observed variables,
L(y|θ) = P(y; θ).
The objective is to find the optimal values for θ
that maximize the likelihood function, as given by
θ_MLE = arg max_{θ∈Θ} L(y|θ)
or,
θ_MLE = arg max_{θ∈Θ} P(y; θ)
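A minimal MLE sketch for a concrete case (not from the slides): coin tosses y modelled as Bernoulli(θ); the likelihood L(y|θ) is maximized over a grid of candidate θ values. For this model the closed-form answer is simply the sample mean.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])                 # observed outcomes

thetas = np.linspace(0.01, 0.99, 99)
log_likelihood = [np.sum(y * np.log(t) + (1 - y) * np.log(1 - t)) for t in thetas]
theta_mle = thetas[int(np.argmax(log_likelihood))]

print(theta_mle, y.mean())                             # both approximately 0.75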
Bayesian Approach
In the Bayesian approach, all the unknowns
are modelled as random variables with known prior
probability distributions.
Let us denote the conditional probability
of observing the output y for parameter vector θ as
P(y|θ).
The marginal probabilities of these variables
are denoted as P(y) and P(θ).
The joint probability of the variables can be
written in terms of conditional and marginal
probabilities as
P(y; θ) = P(y|θ) · P(θ) ………(1)
The same joint probability can also be given as
P(y; θ) = P(θ|y) · P(y) ………(2)
Here the probability P(θ|y) is called the posterior
probability.
Combining Eqs. 1 and 2,
P(θ|y) · P(y) = P(y|θ) · P(θ) ……..(3)
Rearranging the terms, we get
P(θ|y) = P(y|θ) · P(θ) / P(y)
This theorem (Bayes’ theorem) gives the relationship between the
posterior probability and the prior probability in a simple and elegant
manner.
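A minimal numerical sketch of Bayes' theorem, P(θ|y) = P(y|θ)·P(θ)/P(y), for a parameter θ taking two discrete values; all the probabilities below are invented examples.

priors = {"theta1": 0.7, "theta2": 0.3}          # P(theta)
likelihoods = {"theta1": 0.2, "theta2": 0.9}     # P(y | theta) for the observed y

evidence = sum(likelihoods[t] * priors[t] for t in priors)                 # P(y)
posterior = {t: likelihoods[t] * priors[t] / evidence for t in priors}     # P(theta | y)

print(posterior)   # {'theta1': ~0.34, 'theta2': ~0.66}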
UNIT – III
MACHINE LEARNING IN PRACTICE
Ranking
Recommendation System
Designing and Tuning model pipelines
Performance measurement
Azure Machine Learning
Open-source Machine Learning libraries
Amazon’s Machine Learning Toolkit: SageMaker.
134
RANKING
 Ranking is a machine learning technique to rank items.
 A ranking algorithm is a procedure used to rank items in
a dataset according to some criterion.
 It can be divided into two categories:
1.Deterministic
2.Probabilistic
 Ranking at heart is essentially sorting of information.
 It is useful for many applications in information retrieval
such as e-commerce, social networks, recommendation
systems, and so on.
 Example:
A user searches for an article or an item to buy online.
Measuring Ranking Performance
 The goal is an algorithm that can rank the items in strictly
non-increasing order of relevance.
 Let there be n items that need to be ranked,
each with a relevance score ri, i = 1, 2, ..., n.
 A simple measure, called cumulative gain (CG), is
defined on these relevances as
CG = ∑_{i=1}^{n} ri
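A minimal sketch of cumulative gain (CG) and discounted cumulative gain (DCG) for the relevance scores in the tables below; the discount 1/log2(i + 1) matches the "discounted relevance score" column, and the NDCG line (DCG divided by the ideal DCG) is a standard extension, not something stated on the slides.

import math

relevances = [0.4, 0.25, 0.1, 0.0, 0.3, 0.13, 0.6, 0.0, 0.56, 0.22]

cg = sum(relevances)
dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances))   # i is 0-based
ideal_dcg = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)))

print(round(cg, 2), round(dcg, 2), round(dcg / ideal_dcg, 2))       # NDCG = DCG / IDCG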
Sample set of items to be ranked with relevance score
Item No. Relevance Score Discounted relevance score
1 0.4 0.4
2 0.25 0.16
3 0.1 0.05
4 0.0 0.0
5 0.3 0.12
6 0.13 0.05
7 0.6 0.2
8 0.0 0.0
9 0.56 0.17
10 0.22 0.06
Sample set of items ideally ranked with non-increasing
relevance score
139
Item No. Relevance Score Discounted relevance score
1 0.6 0.6
2 0.56 0.35
3 0.4 0.2
4 0.3 0.13
5 0.25 0.1
6 0.22 0.08
7 0.13 0.04
8 0.1 0.032
9 0.0 0.0
10 0.0 0.0
140
Ranking Search Results and Google’s
Pagerank
 The concept of PageRank was at the heart of Google’s
rankings.
 Google’s ranking algorithm is a secret, but we know that it
is a probabilistic ranking algorithm. Google uses a
variety of factors to rank webpages, including the number
of links to a page, the page’s PageRank, and the relevance
of the search query to the page.
 Google’s PageRank algorithm is a probabilistic ranking
algorithm that uses the number of links to a webpage as a
measure of its importance.
 The higher the PageRank of a webpage, the more likely it
is to be ranked higher in the search results.
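A minimal PageRank sketch via power iteration on a tiny 3-page link graph; the link structure and the damping factor d = 0.85 are illustrative choices, not Google's actual configuration.

import numpy as np

# links[i][j] = 1 if page j links to page i
links = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)
out_degree = links.sum(axis=0)
M = links / out_degree                     # column-stochastic transition matrix

n, d = 3, 0.85
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank      # power iteration with damping

print(rank / rank.sum())                   # relative importance of the 3 pages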
Techniques Used in Ranking Systems
 Information retrieval and text mining are the core concepts
that are important in the ranking system that relate to
searching websites or movies, etc.
 We will look specifically at a technique called as keyword
identification/extraction and word cloud generation.
142
Keyword Identification / Extraction
A word cloud is a graphical representation of the
keywords identified from a document or set of documents.
The size of each keyword represents the relative
importance of it.
Steps in Building a Word Cloud
1.Cleaning up the data. This step involves removing all the
formatting characters as well as punctuation marks.
2.Normalization of the data.
This step involves making all the characters lower
case, unless it is required to distinguish the upper case
letters from lower case.
This step involves applying grammar to find the
root word for every word form.
143
3. Removal of stop words. This step involves removing all the
common words that are typically used in any text and have no
meaning of their own, for example: a, the, for, to, and, etc.
4.Compute the frequency of occurrence of each of the
remaining words.
5.Sort the words in descending order of frequency.
6.Plot the top-n keywords graphically to generate the word
cloud.
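A minimal sketch of the keyword-extraction steps above: clean, normalize, remove stop words, count frequencies, and sort. The sample text is the opening of the Gettysburg Address, and the stop-word list is tiny and purely illustrative.

import re
from collections import Counter

text = ("Four score and seven years ago our fathers brought forth "
        "on this continent a new nation")
stop_words = {"and", "a", "on", "this", "the", "of", "to", "for", "our"}

words = re.findall(r"[a-z]+", text.lower())          # clean and normalize
words = [w for w in words if w not in stop_words]    # remove stop words
top_keywords = Counter(words).most_common(10)        # frequencies, sorted descending

print(top_keywords)   # feed these (word, count) pairs to a word-cloud plotter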
Word cloud from famous Gettysburg address by
Abraham Lincoln
146
RECOMMENDATION SYSTEM
 The technique behind modern recommendation systems is
commonly called collaborative filtering.
 Companies like Amazon and Netflix have influenced
this genre of machine learning significantly for
personalizing shopping and movie watching experiences.
147
148
Collaborative Filtering
 A mathematical process of predicting the interests or
preferences of a given user based on a database of
interests or preferences of other users.
 Users with similar interests in one aspect tend to share
similar interests in other similar aspects.
 Collaborative filters always deal with 2-dimensional data
in the form of a matrix.
 One dimension is the list of users and the other
dimension is the entity that is being liked or watched
or purchased.
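A minimal user-based collaborative filtering sketch on a small user × item rating matrix (0 = not rated); the users, items, and ratings are made up for illustration, and cosine similarity is one common but not the only choice.

import numpy as np

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                    # predict user 0's rating for item 2
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
others = [u for u in range(len(ratings)) if u != target and ratings[u, 2] > 0]
prediction = (sum(sims[u] * ratings[u, 2] for u in others)
              / sum(sims[u] for u in others))

print(round(prediction, 2))   # fairly low, since the most similar user disliked item 2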
Sample training data for building a recommendation
system
151
152
Solution Approaches:
Information Types
1.Information about the users in the form of their profiles.
The profiles can have key aspects like age, gender,
location, employment type, number of kids, etc.
2.Information about the interests. For movies, it can be in
the form of languages, genres, lead actors and actresses,
release dates, etc.
3.Joint information of users’ liking or rating their interests
153
 Algorithm Types
 Algorithms exploiting the topological or neighborhood
information. These algorithms are primarily based on the joint
historical information about the users’ ratings.
 Algorithms that exploit the structure of the relationships. These
methods assume a structural relationship between users and
their ratings on the interests.
 These relationships are modelled using probabilistic networks
or latent variable methods like component analysis (principal
or independent) or singular value decomposition.
 Hybrid approach. When little rating history is available, the
neighborhood based algorithms simply cannot operate.
 Hence these hybrid systems start with algorithms that are more
influenced by the structural models in the early stages, and move
to neighborhood based models later.
 Amazon’s Personal Shopping Experience
 First and foremost, Amazon is a shopping platform,
where Amazon itself sells products and services and also
lets third party sellers sell their products.
 Each shopper that comes to Amazon comes with some idea
about the product that he/she wants to purchase along with
some budget for the cost of the product that he/she is ready
to spend on the product as well as some expectation of date
by which the product must be delivered.
155
Context Based Recommendation
Suggesting other similar products that are cheaper than
the product selected.
Content-based filtering uses item features to recommend
other items similar to what the user likes, based on their
previous actions or explicit feedback.
So, if the cost is the only aspect stopping the user from
buying the selected item, the recommended item can solve
the problem.
Suggesting a similar product from a more popular brand.
This might attract user to buy a potentially more
expensive product that is coming from more popular
brand, so more reliable or better quality.
156
 Suggesting a product that has better customer reviews.
 Suggesting a set of products that are typically bundled
with the selected product.
 For example suggesting carry bag or battery charger
or a memory card when selected item is a digital
camera.
157
DESIGNING AND TUNING MODEL
PIPELINES
Model tuning is the experimental process of finding
the optimal values of hyperparameters to maximize model
performance.
Hyperparameters are the set of variables whose
values cannot be estimated by the model from the training
data. These values control the training process.
Then the available data is split into two or three sets
 Training - used for training the model
 Validation - optional validation set is used to tune the
parameters
 Testing - is used to predict the performance metrics of the
algorithm
158
Designing and Tuning Model Pipelines involves the
following steps:
1. Choosing the Technique or Algorithm
2. Choosing the Technique for Adult Salary Classification
3. Splitting the Data
4. Stratified Sampling
5. Training
6. Tuning the Hyperparameters
7. Accuracy Measurement
8. Explainability of Features
9. Practical Considerations
10. Data Leakage
11. Coincidence and Causality
12. Unknown Categories
Choosing the Technique or Algorithm
Let’s consider the problem of binary classification of
the adult salary data.
Example:
If we are dealing with a regression problem then we
are restricted to all the regression type techniques, which
would eliminate the clustering or recommendation
algorithms , etc.
Choosing Technique for Adult Salary Classification
Regression algorithms can always be used in classification
applications with the addition of a threshold, but they are less
preferred.
160
 Decision tree based methods are typically better suited
for problems that have categorical features as these
algorithms inherently use the categorical information.
 Most other algorithms like logistic regression, support
vector machines, neural networks, or probabilistic
approaches are better suited for numerical features.
 Single decision tree can be used as one of the simplest
possible starting algorithm.
 A random forest (an ensemble of decision trees) is also commonly
used as a better-performing method.
161
Splitting the Data
The division into the three sets, in percentage terms, is done as 60–
20–20 or 70–15–15.
Stratified Sampling
Stratified sampling ensures a certain known distribution of classes
in the split parts.
When we are dealing with, say, n classes, the number of
samples per class is not always the same.
When the original distribution is unbalanced, there are two choices
(a minimal stratified-split sketch follows below):
1. Ignore a number of samples from the classes that have more
samples, to match the classes that have fewer samples.
2. Use the samples from the classes that have fewer samples
repeatedly, to match the number of samples from the classes that
have more samples.
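A minimal sketch of a stratified train/test split, assuming scikit-learn; stratify=y keeps the class proportions the same in both parts of the unbalanced toy dataset.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)                 # unbalanced: 75% class 0, 25% class 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))      # [12 4] and [3 1]: same 3:1 ratio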
Training
Once the data is split into training and test sets, the training set
is used to train the algorithm and build the model. Each algorithm
has its own specific method associated with it for training.
The goal of all the different techniques is to find the right set of parameters
of the model that can map the input to the output as far as the training
data is concerned.
The parameters of the model are also classified into two main
types:
1.The parameters that can be computed using the training process to
minimize the error in the prediction.
2.The parameter that cannot be directly computed using the training
process. These parameters are also called as hyperparameters.
163
Tuning the Hyperparameters
 The hyperparameters are typically a set of parameters that
can be unbounded, theoretically choose any number between
1 and ∞.
 These bounds are created based on multiple constraints like
computation requirements, dimensionality, and size of data,
etc
 A single set of training set can be used to get results with
one set of hyperparameters.
 The training set is the only data available for training and the
trained model is applied on the validation set to compute the
accuracy metrics.
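A minimal hyperparameter-tuning sketch using grid search with cross-validation, assuming scikit-learn (GridSearchCV is one possible tool, not the one prescribed by the slides); the parameter grid is illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))   # best hyperparameters found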
164
Data Leakage
If the data is static then there is a set of static algorithms that one
can choose from.
Examples of static problem could be classification of images into ones
that contain cars and ones that don’t.
If the data is changing over time, it creates additional layer of
complexity.
Example
Consider a business of auto mechanic. Here are some of the columns in
the data:
Number of visits
Method of payment
Category of customer
Amount of sale
Year of manufacture of the car
Make of the car
Model of the car
Miles on the odometer
PERFORMANCE MEASUREMENT
To evaluate the performance or quality of the model,
different metrics are used, and these metrics are known as
performance metrics or evaluation metrics
Performance Metrics for Regression / Numerical error
 Mean Absolute Error
 Mean Squared Error
 Normalized Error
Performance Metrics for Classification / Categorical error
 Accuracy
 Precision and Recall
 F-Score
 Confusion Matrix
 Receiver Operating Characteristics (ROC) Curve Analysis
Steps in Hypothesis Testing
 A/B Testing
166
Performance Metrics for Regression / Numerical
error
167
Mean Absolute Error (MAE) is a regression loss measure
looking at the absolute value of the difference between a model's
predictions and the ground truth, averaged out across the
dataset.
MAE weighs all errors equally regardless of their magnitude;
to overcome this limitation, another metric can be used, Mean
Squared Error (MSE), which penalizes larger errors more heavily.
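A minimal sketch of the regression and classification metrics listed above, assuming scikit-learn; the true and predicted values are toy examples.

from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Regression / numerical error
y_true_r, y_pred_r = [3.0, 5.0, 2.5], [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true_r, y_pred_r), mean_squared_error(y_true_r, y_pred_r))

# Classification / categorical error
y_true_c, y_pred_c = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true_c, y_pred_c), precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c), f1_score(y_true_c, y_pred_c))
print(confusion_matrix(y_true_c, y_pred_c))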
Normalized Error
In many cases, all the above error metrics can produce
some arbitrary number between -∞ and ∞.
Typically the bounds used are (-1to+1), (0–1), or (0–
100). This way, even a single instance of normalized error
can make sense on its own.
All the above error definitions can have their own
normalized counterpart.
171
172
AZURE MACHINE LEARNING- AML
Azure Machine Learning is a comprehensive
machine learning platform that supports language
model fine-tuning and deployment.
Using the Azure Machine Learning model
catalog, users can create an endpoint for Azure OpenAI
Service and use RESI APIs to integrate models into
applications.
173
Sign in with a Microsoft account, e.g., hotmail.com or
outlook.com, etc.
UNIT – IV
MACHINE LEARNING AND DATA ANALYTICS
Machine Learning for Predictive Data Analytics
Data to Insights to Decisions
Data Exploration
Information based Learning
Similarity based learning
Probability based learning
Error based learning
Evaluation
The art of Machine Learning for Predictive Data
Analytics.
MACHINE LEARNING FOR PREDICTIVE
DATA ANALYTICS
Predictive data analytics is the art of building and
using models that make predictions based on historical
data to predict future outcomes.
178
179
Applications:
Price Prediction: Predictive analytics models can be
trained to predict optimal prices based on historical sales
records.
Dosage Prediction: Doctors and scientists frequently
decide how much of a medicine or other chemical to
include in a treatment.
Risk Assessment: Predictive analytics models can be used
to predict the risk associated with decisions such as issuing
a loan or underwriting an insurance policy.
Diagnosis: Doctors, engineers and scientists regularly make
diagnoses as part of their work. Typically, these diagnoses
are based on their extensive training, expertise, and
experience. 180
Document Classification: Predictive data analytics can be
used to automatically classify documents into different
categories.
Examples:
Email spam filtering, News sentiment analysis, Customer
complaint redirection, and Medical decision making.
How is machine learning used in predictive analytics?
Predictive analytics using machine learning algorithms
can provide more accurate and precise predictions, automate
decision-making processes, and scale up to handle large
datasets and complex problems.
Machine learning algorithms can provide more
accurate predictions than traditional statistical models.
181
Machine Learning
Machine learning is defined as an automated process
that extracts patterns from data.
To build the models used in predictive data analytics
applications, we use supervised machine learning.
Supervised machine learning techniques automatically
learn a model of the relationship between a set of descriptive
features and a target feature based on a set of historical
examples, or instances.
184
185
Restriction bias constrains the set of models that the
algorithm will consider during the learning process
Preference bias guides the learning algorithm to prefer
certain models over others
Inductive bias is necessary for learning (beyond the
dataset)
There are two sources of information that guide this search:
The training data
The inductive bias of the algorithm
Underfitting occurs when the prediction model selected by
the algorithm is too simplistic to represent the underlying
relationship in the dataset between the descriptive features and
the target feature.
Overfitting Occurs when the prediction model selected by the
algorithm is so complex that the model fits to the data set too
closely and becomes sensitive to noise in the data. 186
The Predictive Data Analytics Project Lifecycle: Crisp-DM
187
Cross Industry Standard Process for Data Mining
(CRISP-DM).
Key features of the CRISP-DM process that make
it attractive to data analytics practitioners are that it is
non-proprietary; it is application, industry, and tool
neutral; and it explicitly views the data analytics process
from both an application-focused and a Technical
perspective.
Business Understanding:
Predictive data analytics projects never start out
with the goal of building a prediction model.
Instead, they are focused on things like gaining
new customers, selling more products, or adding
efficiencies to a process. 188
Data Understanding:
Once the manner in which predictive data
analytics will be used to address a business problem has
been decided, it is important that the data analyst fully
understands the available data sources and the kinds of
data they contain.
Data Preparation:
Building predictive data analytics models requires
specific kinds of data, organized in a specific kind of
structure known as analytics base table (ABT).
Modeling:
Different machine learning algorithms are used to
build a range of prediction models from which the best
model will be selected for deployment.
189
Evaluation:
This phase Of CRISP-DM covers all the
evaluation tasks required to show that a prediction model
will be able to make accurate predictions after being
deployed and that it does not suffer from overfitting or
underfitting.
Deployment:
Machine learning models are built to serve a
purpose within an organization, and the last phase of
CRISP-DM covers all the work that must be done to
successfully integrate a machine learning model into the
processes within an organization.
190
Predictive Analytics Tools
Predictive Analytics Software Tools have advanced
analytical capabilities like Text Analysis, Real-Time Analysis,
Statistical Analysis, Data Mining, Machine Learning modeling
and Optimization.
Libraries for Statistical Modeling and Analysis
Scikit-learn
Pandas
Statsmodels
NLTK (Natural Language Toolkit)
GraphLab
Neural Designer
Open-Source Analytical Tools
SAP Business Objects
IBM SPSS
Halo Business Intelligence
Daiku-DSS
Weka 191
DATA TO INSIGHTS TO DECISIONS
Designing ABTs that properly represent the
characteristics of a prediction subject is a key skill for
analytics practitioners.
An approach to first develop a set of domain concepts
that describe the prediction subject, and then expand these
into concrete descriptive features.
Converting a business problem into an analytics solution
Involves answering the following key questions:
1.What is the business problem?
2.What are the goals that the business wants to achieve?
3.How does the business currently work?
4.In what ways could a predictive analytics model help to
address the business problem? 192
Designing the Analytics Base Table (ABT)
The basic structure in which we capture historical datasets.
The different data sources typically combined to create an
analytics base table.
The basic structure in which we capture historical datasets
is the analytics base table (ABT)
The general structure of an analytics base table:
 Descriptive features
 Target feature
193
The different data sources typically combined to create an
analytics base table.
194
Prediction subject defines the basic level at which
predictions are made, and each row in the ABT will
represent one instance of the prediction subject
One-row-per-subject is often used to describe this
structure
Each row in an ABT is composed of a set of
descriptive features and a target feature
A good way to define features is to identify the key
domain concepts and then to base the features on
these concepts.
195
The hierarchical relationship between an analytics
solution, domain concepts, and descriptive features.
196
 Features in an ABT:
 Raw features
 Derived features: requires data from multiple sources to be
combined into a set of single feature values
Common derived feature types:
 Aggregates
 Flags
 Ratios
 Mappings
There are a number of general domain concepts that are often useful:
 Prediction Subject Details
 Demographics
 Usage
 Changes in Usage
 Special Usage
 Lifecycle Phase
 Network Links
197
Example domain concepts for a motor insurance fraud
claim prediction analytics solution
198
Designing & Implementing Features
Three key data considerations are particularly
important when we are designing features. The first is data
availability: we must have data available to
implement any feature we would like to use.
Example:
In an online payments service scenario, we might define a
feature that calculates the average of a customer’s account
balance over the past six months.
199
Timing:
Timing with which data becomes available for
inclusion in a feature.
With the exception of the definition of the target
feature, data that will be used to define a feature must be
available before the event around which we are trying to
make predictions occurs.
Example:
If we were building a model to predict the Outcomes of
soccer matches, we might consider including the attendance
at the match as a descriptive feature.
200
Longevity:
There is potential for features to go stale if something
about the environment from which they are generated
changes.
Example, to make predictions of the outcome of loans
granted by a bank, we might use the borrower’s salary as
a descriptive feature.
201
Propensity models:
Many of the predictive models that we build are
propensity models, which inherently have a temporal
element
Two key periods of propensity modeling:
Observation period
Outcome period
202
203
Observation and outcome periods defined by an event
rather than by a fixed point in time (each line
represents a prediction subject and stars signify
events).
204
Case Study: Motor Insurance Fraud
What are the observation period and outcome period for
the motor insurance claim prediction scenario?
The observation period and outcome period are measured
over different dates for each insurance claim, defined
relative to the specific date of that claim.
The observation period is the time prior to the claim
event, over which the descriptive features capturing the
claimant’s behavior are calculated.
The outcome period is the time immediately after the
claim event, during which it will emerge whether the claim
is fraudulent or genuine.
What features could you use to capture the Claim
Frequency domain concept?
Example domain concepts for a motor insurance fraud
prediction analytics solution
206
DATA EXPLORATION
Data exploration refers to the initial step in data
analysis in which data analysts use data visualization
and statistical techniques to describe dataset
characterizations, such as size, quantity, and accuracy, in
order to better understand the nature of the data.
It involves scrutinizing datasets to uncover
hidden patterns, outliers, and insights.
Whether in business, healthcare, or research, data
exploration serves as the compass guiding decision-
makers.
207
The Data Quality Report
A data quality report includes tabular reports that
describe the characteristics of each feature in an ABT using
standard statistical measures of central tendency and
variation.
The tabular reports are accompanied by data visualizations:
A histogram for each continuous feature in an ABT
A bar plot for each categorical feature in an ABT.
208
Purposes of Data Exploration:
Understanding Data Structure:
Data exploration helps in understanding the overall
structure of the dataset, including the number of variables,
data types, and the presence of missing values or outliers.
Identifying Patterns and Relationships:
Data exploration involves uncovering patterns and
relationships within the data.
This may include identifying correlations between
variables, grouping data based on common characteristics,
and visualizing data to reveal trends.
209
Informing Further Analysis:
The insights gained from data exploration inform
further analysis, such as model building, hypothesis
testing, and decision-making.
Common Data Exploration Techniques:
Descriptive Statistics:
Summarizing data using measures of central
tendency (mean, median, mode) and measures of
dispersion (variance, standard deviation, range) provides
an overview of the data distribution.
Data Visualization:
Creating visualizations, such as histograms, scatter
plots, and box plots, helps visualize data distribution,
identify patterns, and detect outliers. 210
Data Profiling:
Data profiling involves examining data quality
metrics, such as data completeness, accuracy,
consistency, and timeliness, to assess the overall quality
of the dataset.
Data Transformation:
Data transformation techniques, such as
normalization, scaling, and imputation, may be applied
to prepare the data for further analysis.
Feature Engineering:
Feature engineering involves creating new features
from existing data to improve the performance of
machine learning models.
211
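As a hedged illustration of the transformations listed above (not from the slides; the feature names and values are made up), scikit-learn offers standard implementations:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [35000, 52000, None, 61000],
                   "age": [23, 41, 35, 58]})

# Imputation: fill the missing income with the column mean
df[["income"]] = SimpleImputer(strategy="mean").fit_transform(df[["income"]])

# Normalization: rescale each feature to the [0, 1] range
normalized = MinMaxScaler().fit_transform(df)

# Scaling (standardization): zero mean and unit variance
standardized = StandardScaler().fit_transform(df)

# Simple feature engineering: derive a new feature from existing ones
df["income_per_year_of_age"] = df["income"] / df["age"]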
Benefits of Data Exploration:
Improves Data Understanding:
Data exploration deepens the understanding of the
data, its characteristics, and limitations, enabling better
decision-making.
Uncovers Hidden Patterns:
Data exploration reveals patterns and trends that
may not be evident by simply looking at raw data,
informing further analysis and hypothesis generation.
Identifies Data Quality Issues:
Data exploration helps identify data quality issues,
such as missing values, outliers, and inconsistencies,
allowing for data cleaning and improvement. 212
Facilitates Feature Selection:
Data exploration guides feature selection, identifying
relevant features for analysis and model building.
Informs Model Development:
Insights from data exploration inform the choice of
appropriate machine learning algorithms and model
selection.
Data exploration is an essential step in the data
analysis process, providing a foundation for making
informed decisions, uncovering hidden insights, and
building effective models. By thoroughly exploring the
data, data scientists and analysts can extract valuable
knowledge and drive meaningful outcomes.
213
Histograms for different sets of data, each of which exhibits
well-known, common characteristics.
(a) Uniform (b) Normal (unimodal) (c) Unimodal (skewed right)
(d) Unimodal (skewed left) (e) Exponential (f) Multimodal
214
Three normal distributions with different means but
identical standard deviations.
215
Three normal distributions with identical means but
different standard deviations.
216
Identifying Data Quality Issues
A data quality issue is loosely defined as anything
unusual about the data in an ABT.
The most common data quality issues are:
 missing values
 irregular cardinality
 Outliers
The data quality issues we identify from a data quality
report will be of two types:
 Data quality issues due to invalid data
 Data quality issues due to valid data.
217
INFORMATION BASED LEARNING
Information-based learning (IBL) is a framework for
learning that emphasizes the importance of information and
its role in shaping knowledge and understanding.
It is a constructivist approach that suggests that
learners actively construct their own knowledge by making
connections between new information and their existing
knowledge base.
218
(a) Brian (b) John (c) Aphra (d) Aoife
Cards showing character faces and names for the Guess-Who game
(Table columns: Man, Long Hair, Glasses, Name)
219
220
Fundamentals
A decision tree consists of:
 root node (or starting node),
 interior nodes
 leaf nodes (or terminating nodes).
 each of the non-leaf nodes (root and interior) in the tree
specifies a test to be carried out on one of the query’s
descriptive features.
 each of the leaf nodes specifies a predicted classification
for the query.
221
An email spam prediction dataset
222
223
Shannon’s Entropy Model
 Claude Shannon’s entropy model defines a computational
measure of the impurity of the elements of a set.
 An easy way to understand the entropy of a set is to think
in terms of the uncertainty associated with guessing the
result if you were to make a random selection from the
set.
 Entropy is related to the probability of an outcome.
 High probability → Low entropy
 Low probability → High entropy
 If we take the log of a probability and multiply it by -1 we
get this mapping!
224
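To make this mapping concrete, the following is a small illustrative sketch (not from the slides) of Shannon's entropy over a set of target labels:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of target labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["spam"] * 6))                 # 0.0 -> pure set, no uncertainty
print(entropy(["spam"] * 3 + ["ham"] * 3))   # 1.0 -> maximum uncertainty for 2 levels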
Information Gain
The measure of informativeness that we will use is
known as information gain and is a measure of the
reduction in the overall entropy of a set of instances that
is achieved by testing on a descriptive feature.
1.Compute the entropy of the original dataset with respect
to the target feature.
2.For each descriptive feature, create the sets that result by
partitioning the instances in the dataset using their feature
values, and then compute the weighted sum of the entropy scores
of these sets (the remaining entropy), where each weight is the
fraction of instances that fall into that set.
3.Subtract the remaining entropy value (computed in step 2)
from the original entropy value (computed in step 1) to give
the information gain. 225
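These three steps can be sketched directly in code. The following is an illustrative example (the dataset and feature names are assumptions, not from the slides), reusing the entropy function sketched above:

def information_gain(dataset, feature, target):
    """dataset: list of dicts; feature/target: keys into each dict."""
    # Step 1: entropy of the whole dataset with respect to the target feature
    total_entropy = entropy([row[target] for row in dataset])

    # Step 2: remaining entropy after partitioning on the descriptive feature
    remaining = 0.0
    for v in set(row[feature] for row in dataset):
        part = [row for row in dataset if row[feature] == v]
        weight = len(part) / len(dataset)
        remaining += weight * entropy([row[target] for row in part])

    # Step 3: information gain = original entropy minus remaining entropy
    return total_entropy - remaining

data = [{"suspicious_words": True,  "spam": True},
        {"suspicious_words": True,  "spam": True},
        {"suspicious_words": False, "spam": False},
        {"suspicious_words": False, "spam": True}]
print(information_gain(data, "suspicious_words", "spam"))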
SIMILARITY BASED LEARNING
 Feature space: each descriptive feature has its own dimensional
axis. It is an abstract m-dimensional space that is
created by making each descriptive feature in a dataset
an axis of an m-dimensional coordinate system and
mapping each instance in the dataset to a point in this
coordinate space based on the values of its descriptive
features
 Working of feature space: if the values of the descriptive
features of two or more instances in a dataset are the
same, then these instances will be mapped to the same
point in the feature space and vice versa
 Distance between two points in the feature space is a
useful measure of the similarity of the descriptive feature
values of the corresponding instances. 226
Metric(a,b): a real-valued function that returns the
distance between two points a and b in the feature space. It
has the following properties:
Non-negativity
Identity
Symmetry
Triangular inequality
Two examples of distance metric
Euclidean distance
Manhattan distance (taxi-cab distance) = sum of absolute
differences.
Minkowski distance: a family of distance metrics based
on differences between features. 227
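A brief illustrative sketch (the feature values are made up) of these three distance metrics:

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (5.00, 2.50), (2.75, 7.50)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))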
Nearest Neighbor algorithm:
When this model is used to make a prediction for a new
instance, the distance in the feature space between the query
instance and each instance in the dataset is computed,
and the prediction returned by the model is the target feature
level of the dataset instance that is nearest to the query in the
feature space.
The algorithm stores the entire training dataset in memory, which has
a negative effect on the time complexity of the algorithm.
Algorithm : Nearest neighbor algorithm.
Require: a set of training instances
Require: a query instance
228
Iterate across the instances in memory to find the nearest
neighbor—this is the instance with the shortest distance
across the feature space to the query instance.
Make a prediction for the query instance that is equal to
the value of the target feature of the nearest neighbor.
Decision boundary:
It is the boundary between regions of the feature
space in which different target levels will be predicted.
It is generated by aggregating the neighboring local
models (Voronoi regions) that make the same prediction
229
Noise effects:
Using Kronecker delta approach:
The nearest neighbor algorithm is sensitive to noise because any errors
in the description or labeling of training data result in
erroneous local models and incorrect predictions.
One way to mitigate against noise is to modify the
algorithm to return the majority target level within the set of
k nearest neighbors to the query q.
Using weighted k nearest neighbor approach:
Efficient Memory search:
Assuming that the training dataset will remain
relatively stable, the time issue can be offset by investing in
a one-off computation to create an index of the instances that
enables efficient retrieval of the nearest neighbors without
doing an exhaustive search of the entire training dataset. 230
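As an illustrative sketch (toy data, not from the slides), scikit-learn's KNeighborsClassifier supports both the plain majority-vote and the distance-weighted variants described above:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[5.3, 3.1], [5.1, 2.9], [6.8, 3.0], [7.0, 3.2], [6.9, 3.1]]
y_train = ["setosa", "setosa", "virginica", "virginica", "virginica"]

# Majority vote among the k nearest neighbors (mitigates noisy labels)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Distance-weighted k nearest neighbors: closer neighbors count for more
wknn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

query = [[6.5, 3.0]]
print(knn.predict(query), wknn.predict(query))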
k-d tree:
It stands for k-dimensional tree, which is a balanced
binary tree in which each node in the tree indexes one
of the instances in the training dataset.
This tree is constructed so that the nodes that are
nearby in the tree index training instances that are nearby in
the feature space
231
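A minimal sketch (not from the slides) using SciPy's k-d tree implementation to retrieve nearest neighbors without an exhaustive search:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 2))              # toy 2-D training instances

tree = cKDTree(X_train)                      # one-off index construction
query = np.array([0.5, 0.5])
distances, indices = tree.query(query, k=3)  # 3 nearest neighbors of the query
print(indices, distances)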
PROBABILITY BASED LEARNING
It is a machine learning technique that uses
probability theory to make predictions or decisions. It's a
statistical approach that models uncertainty in data using
probability distributions.
A probability function, P(), returns the probability of a
feature taking a specific value.
A joint probability refers to the probability of an
assignment of specific values to multiple different features
A conditional probability refers to the probability of
one feature taking a specific value given that we already
know the value of a different feature
232
 A probability distribution is a data structure that
describes the probability of each possible value a feature can
take. The sum of a probability distribution must equal 1.0.
 A joint probability distribution is a probability
distribution over more than one feature assignment and is
written as a multi-dimensional matrix in which each cell
lists the probability of a particular combination of feature
values being assigned
 The sum of all the cells in a joint probability distribution
must be 1.0.
233
Bayes’ Theorem:
Bayes’ Theorem defines the conditional probability of an event, X,
given some evidence, Y, in terms of the product of the inverse
conditional probability, P(Y|X), and the prior probability
of the event, P(X):
P(X|Y) = P(Y|X) x P(X) / P(Y)
234
235
Bayesian Prediction
We generate the probability of the event that a
target feature, t,
takes a specific level, l,
given the assignment of values to a set of descriptive
features, q, from a query instance.
We can restate Bayes’ Theorem using this terminology and
generalize the definition of Bayes’ Theorem so that it can
take into account more than one piece of evidence(each
descriptive feature value is a separate piece of evidence).
236
237
1. P(t=l),the prior probability of the target feature t taking the
level l
2. P(q[1], …, q[m]), the joint probability of the descriptive
features of a query instance taking a specific set of values
3. P(q[1], …, q[m] | t = l), the conditional probability of the
descriptive features of a query instance taking a specific set
of values given that the target feature takes the level l
Conditional Independence and Factorization
If knowledge of one event has no effect on the probability
of another event, and vice versa, then the two events are
independent of each other.
If two events X and Y are independent then:
P(X|Y) = P(X)
P(X, Y) = P(X) x P(Y) 238
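A short worked sketch of Bayes' Theorem and the independence identities above, using made-up probabilities:

# Illustrative (made-up) probabilities
p_X = 0.01            # prior probability of the event X
p_Y_given_X = 0.90    # probability of the evidence Y given X
p_Y = 0.10            # overall probability of the evidence Y

# Bayes' Theorem: P(X | Y) = P(Y | X) * P(X) / P(Y)
p_X_given_Y = p_Y_given_X * p_X / p_Y
print(p_X_given_Y)    # 0.09

# If X and Y are independent, the joint probability factorizes
p_joint_if_independent = p_X * p_Y
print(p_joint_if_independent)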
ERROR BASED LEARNING
We perform a search for a set of parameters for a
parameterized model that minimizes the total error across
the predictions made by that model with respect to a set of
training instances.
The key ideas are those of a parameterized model, measuring error,
and an error surface.
Simple Linear Regression
It is a type of Regression algorithms that models the
relationship between a dependent variable and a single
independent variable.
The relationship shown by a Simple Linear
Regression model is linear or a sloped straight line, hence
it is called Simple Linear Regression. 239
The key point in Simple Linear Regression is that the
dependent variable must be a continuous value.
However, the independent variable can be measured on
continuous or categorical values.
Measuring Error
There are many different kinds of error functions, but for
measuring the fit of simple linear regression models, the
most commonly used is the sum of squared errors error
function, or L2.
To calculate L2, we use our candidate model to make a
prediction for each member of the training dataset and then
calculate the error (or residual) between these predictions
and the actual target feature values in the training set.
240
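An illustrative sketch (toy numbers standing in for the office rentals data, which is not reproduced here) of fitting a simple linear model and measuring its sum of squared errors:

import numpy as np

size = np.array([500, 550, 620, 630, 665], dtype=float)   # hypothetical SIZE values
rent = np.array([320, 380, 400, 390, 385], dtype=float)   # hypothetical RENTAL PRICE values

# Least-squares estimates of the weights w[0] (intercept) and w[1] (slope)
w1, w0 = np.polyfit(size, rent, deg=1)

# Sum of squared errors (L2) of the fitted model on the training data
predictions = w0 + w1 * size
sse = np.sum((rent - predictions) ** 2)
print(w0, w1, sse)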
(a) A scatter plot of the SIZE and RENTAL PRICE features
from the office rentals dataset; (b) the scatter plot from
(a) with a linear model relating RENTAL PRICE to SIZE
overlaid
241
Error Surface:
Here, each pair of weights w[0] and w[1] defines a
point on the x-y plane, and the sum of squared errors for the
model using these weights determines the height of the error
surface above the x-y plane for that pair of weights.
The x-y plane is known as a weight space, and the surface is
known as an error surface.
242
EVALUATION
When evaluating machine learning (ML)
models, the question that arises is whether the model is
the best model available from the model’s hypothesis
space in terms of generalization error on the unseen /
future data set.
Hold-out method for Model Evaluation
The hold-out method for model evaluation
represents the mechanism of splitting the dataset into
training and test datasets.
The model is trained on the training set and then
tested on the testing set to get the most optimal model.
243
The hold-out method for model evaluation
Generally, a 70-30% split is used for splitting the dataset,
where 70% of the dataset is used for training and 30%
is used for testing the model.
244
The following is the process of using the hold-out method
for model evaluation:
Split the dataset into two parts (preferably based on a 70-
30% split; however, the percentage split can vary).
Train the model on the training dataset; while training the
model, some fixed set of hyperparameters is selected.
Test or evaluate the model on the held-out test dataset.
Train the final model on the entire dataset to get a model
which can generalize better on the unseen or future dataset.
245
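A minimal sketch of this hold-out process with scikit-learn, using the built-in Iris dataset as a stand-in for a real project dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 70-30 hold-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train with a fixed set of hyperparameters
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test set
print("hold-out accuracy:", model.score(X_test, y_test))

# Optionally retrain the final model on the entire dataset before deployment
final_model = LogisticRegression(max_iter=1000).fit(X, y)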
THE ART OF MACHINE LEARNING FOR
PREDICTIVE DATA ANALYTICS
Predictive data analytics projects use machine
learning to build models that capture the relationships in
large datasets between descriptive features and a target
feature.
A specific type of learning, called inductive learning,
is used, where learning entails inducing a general rule from
a set of specific instances.
A predictive analytics project can use the CRISP-DM
process to manage the project through its lifecycle.
246
The CRoss Industry Standard Process for Data Mining
(CRISP-DM) is a process model that serves as the base for a
data science process.
It has six sequential phases:
1.Business understanding – What does the business need?
2.Data understanding – What data do we have / need? Is it
clean?
3.Data preparation – How do we organize the data for
modeling?
4.Modeling – What modeling techniques should we apply?
5.Evaluation – Which model best meets the business objectives?
6.Deployment – How do stakeholders access the results?
247
UNIT – V
APPLICATIONS OF MACHINE LEARNING
Image Recognition
Speech Recognition
Email spam and Malware Filtering
Online fraud detection
Medical Diagnosis.
248
IMAGE RECOGNITION
Image recognition is the capability of a system to
understand and interpret visual information from images or
videos.
Image recognition in machine learning often involves
the use of deep learning techniques, particularly convolutional
neural networks (CNNs), which have been highly successful in
this domain.
Digital Images contain pixels, which are the smallest
units of a screen that help in image formation. There are
different types of image formats like JPG, JPEG, GIF, PNG,
etc.
Images have their own role in a digital world that
includes various fields like communication, science, art, and
many others. 249
250
Facial expression
Artificial intelligence and machine learning (AI/ML) are
used in computer vision applications to handle this data
properly for monitoring, detection, categorization, object
identification, and facial recognition.
Learning Based Image Recognition
Data Collection:
A large dataset of labeled images is collected.
Each image is associated with a label that describes
what's in the image (e.g., "cat," "car," "beach").
Preprocessing:
Images are typically resized and normalized to a
standard size, and any color information is converted into a
numerical format.
251
Convolutional Neural Network (CNN):
A CNN is used as the primary architecture for image
recognition. CNNs consist of multiple layers, including
convolutional layers, pooling layers, and fully connected
layers. These layers extract features from the input images.
Training:
The CNN is trained on the labeled dataset. During
training, the network adjusts its internal parameters (weights
and biases) to learn to recognize patterns and features in the
images that are associated with the correct labels.
Testing and Inference:
After training, the CNN can be used to make predictions on
new, unlabeled images. It produces a probability distribution
over possible labels for each image. 252
Post-processing:
The raw predicted scores (logits) can be converted into
probabilities using softmax, and the most likely class is taken
as the final classification or label. Top-k classification is
often used to provide multiple likely labels.
The field of image recognition in machine learning
has seen significant advancements, and it's applied in a
wide range of applications, including object detection,
facial recognition, medical image analysis, and
autonomous vehicles.
Example code in Python (TensorFlow/Keras):
import tensorflow as tf
from tensorflow import keras
253
# Load and preprocess the dataset (e.g., CIFAR-10)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Normalize pixel values to between 0 and 1
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the CNN model
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    # Output layer: 10 units, one per CIFAR-10 class; these are logits,
    # which matches from_logits=True in the loss below
    keras.layers.Dense(10)
])
254
# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

This code loads the CIFAR-10 dataset, creates a simple CNN model, compiles it, and then trains and evaluates the model. You can replace the dataset with your own image data and adjust the model architecture for your specific image recognition task.
255
Application of Image Recognition
Identifying Fraudulent Accounts
Examining fake social media profiles is among the
most significant applications of image recognition.
Facial Recognition and Security Systems
Image recognition is also regarded as important since
it is one of the most critical components in the security
business.
Image recognition algorithms could help marketers
learn about a person’s identity, gender, and mood.
Reverse Image Search
You may have heard of internet reverse image
searches. Reverse photo search is a strategy that allows you
to search by image for free.
256
Help Police Officials to Solve Cases
You might be shocked to learn that government
agencies use image recognition. Today, police and other secret
organizations frequently use image recognition technology to
identify persons in recordings or photographs.
Empowers e-commerce Businesses
Today, image recognition is commonly used in the e-
commerce business. Historically, the visual search industry
has grown significantly.
257
SPEECH RECOGNITION
Speech recognition is a machine learning
technology that uses AI to interpret spoken words and
convert them into text.
It works by breaking down speech into sounds,
analyzing each sound, and then using an algorithm to find
the most likely word for each sound.
You can do speech recognition in Python with the
help of computer programs that take input from the
microphone, process it, and convert it into a suitable
form.
258
Speech Recognition
Working of Speech Recognition
259
How Does Speech Recognition work?
Acoustic modeling is used to recognize phonemes/phonetics
in our speech to get the more significant parts of speech, such
as words and sentences.
Speech recognition starts by taking the sound energy
produced by the person speaking and converting it into
electrical energy with the help of a microphone.
 It then converts this electrical energy from analog to
digital, and finally to text.
It breaks the audio data down into sounds, and it analyzes the
sounds using algorithms to find the most probable word that
fits that audio. All of this is done using Natural Language
Processing and Neural Networks. Hidden Markov models can
be used to find temporal patterns in speech and improve accuracy. 260
Picking and Installing a Speech Recognition Package
261
Package | Functionality | Installation
apiai | Includes natural language processing for identifying a speaker’s intent | $ pip install apiai
google-cloud-speech | Offers basic speech-to-text conversion | $ pip install virtualenv; virtualenv <your-env>; <your-env>\Scripts\activate; <your-env>\Scripts\pip.exe install google-cloud-speech
SpeechRecognition | Offers easy audio processing and microphone accessibility | $ pip install SpeechRecognition
watson-developer-cloud | Watson Developer Cloud is an artificial intelligence API that makes creating, debugging, running and deploying APIs easy | $ pip install --upgrade watson-developer-cloud
It allows:
•Easy speech recognition from the microphone.
•Makes it easy to transcribe an audio file.
•It also lets us save audio data into an audio file.
•It also shows us recognition results in an easy-to-understand
format.
Speech Recognition in Python: Converting Speech to Text
create a program that takes in the audio as input and
converts it to text.
262
Importing necessary modules
263
Converting speech to text
Now, use the microphone to get audio input from the user in
real-time, recognize it, and print it in text
264
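Since the code from these two slides is not reproduced here, the following is a minimal sketch of what such a program typically looks like, assuming the SpeechRecognition package and a working microphone:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Take real-time audio input from the microphone
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # reduce background noise
    print("Speak now...")
    audio = recognizer.listen(source)

# Convert speech to text using Google's free web recognizer
try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Recognition service error:", e)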
EMAIL SPAM AND MALWARE
FILTERING
 What do we know about spam?
Annoying emails?
Yahoo Mail!
 Spam-based advertising is a huge business
 Email spam is a collection of unwanted, unsolicited
messages sent in bulk to a large number of recipients. It
can be sent for commercial purposes or for other
reasons, and can be sent by humans or by botnets,
which are networks of infected computers
 It is a complex system
Technical: name server, email server, webpages,
etc.
Business: payment processing, merchant bank
accounts, customer service, and fulfillment
 Previous work studied each of the elements in isolation
dynamics of botnets, DNS fast-flux networks, Web
site hosting, spam filtering, URL blacklisting, site
takedown
 This work: quantify the full set of resources employed
to monetize spam email— including naming, hosting,
payment and fulfillment
Main Parts
 Advertising
 Click support
 Redirect sites
 Third-party DNS, spammers’ DNS
 Webservers
 Affiliate programs
 Realization
 Payment services
 Fulfillment
267
Big Picture
268
Data Collection
and
Processing
269
Discussion
 With this big picture, what do you think are the
most effective mechanisms to defeat spam?
270
Malware, short for malicious software, is software
designed to damage or disrupt a computer, server, or
network. Cybercriminals, also known as hackers, create
malware to steal data, gain access to systems, or sabotage
devices.
271
A bot is a software application that can perform
repetitive tasks automatically, without the need for human
intervention. Bots can be designed to imitate human behavior,
but they are often faster and more accurate than humans.
272
273
 Spam Detector is used to detect unwanted, malicious,
and virus-infected texts and helps to separate them from
non-spam texts.
 It uses binary classification with the
labels ‘ham’ (non-spam) and ‘spam’. An application
of this can be seen in Google Mail (GMAIL) where it
segregates the spam emails in order to prevent them
from getting into the user’s inbox.
 In this Machine Learning Spam Filtering application, we
will develop a Spam Detector app using Support Vector
Machine (SVM) technique for classification and Natural
Language Processing.
 Support Vector Machine (SVM) is a supervised learning
algorithm used for classification and regression problems.
The main objective of SVM is to find a hyperplane in an
N-dimensional space (where N is the total number of features)
that best separates the data points into classes.
Hyperplane and Support Vectors
 Hyperplanes are nothing but a boundary that helps to
separate and group the data into particular classes.
274
SVM works very well on the linearly separable data
275
Natural Language Processing (NLP) is an Artificial
Intelligence (AI) field that enables computer programs to
recognize, interpret, and manipulate human languages.
Application Structure:
spam.csv: Dataset for our project. It contains Labels as
“ham” or “spam” and Email Text.
spamdetector.py: This file is used to load the dataset and
train our classifier.
training_data.pkl: This file contains a trained classifier
in binary format which will be used to predict the output.
SpamGui.py: GUI file for our project where we load the
trained classifier and predict the output for a given
message.
276
Steps for developing a Spam Detector:
1.Import Libraries and initialize variables
2.Preprocessing the data
3.Bag of Words (vectorization)
4.Training the model
5.Prediction using Graphical User Interface
277
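A hedged sketch of steps 1 through 4 (the GUI step is omitted), assuming spam.csv has a label column and a text column; the exact column names used here are an assumption:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Import libraries and load the dataset; column names 'label' and 'text' are assumed
data = pd.read_csv("spam.csv", encoding="latin-1")[["label", "text"]]

# 2. Preprocessing: lower-case the messages and map labels to 0/1
data["text"] = data["text"].str.lower()
y = (data["label"] == "spam").astype(int)

# 3. Bag of Words vectorization
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(data["text"])

# 4. Train an SVM classifier and check its accuracy on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

# Predict a new message
msg = vectorizer.transform(["congratulations, you have won a free prize"])
print("spam" if clf.predict(msg)[0] == 1 else "ham")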
278
279
ONLINE FRAUD DETECTION
Credit card fraud has become easier to commit, as
e-commerce and many other online sites have
increased the use of online payment modes, which in turn
increases the risk of online fraud.
280
281
Feature | Description
step | tells about the unit of time
type | type of transaction done
amount | the total amount of the transaction
nameOrg | account that starts the transaction
oldbalanceOrg | balance of the sender’s account before the transaction
newbalanceOrg | balance of the sender’s account after the transaction
nameDest | account that receives the transaction
oldbalanceDest | balance of the receiver’s account before the transaction
newbalanceDest | balance of the receiver’s account after the transaction
isFraud | the value to be predicted, i.e. 0 or 1
 The libraries used are :
 Pandas: This library helps to load the data frame in a
2D array format and has multiple functions to
perform analysis tasks in one go.
 Seaborn/Matplotlib: For data visualization.
 Numpy: Numpy arrays are very fast and can perform
large computations in a very short time.
282
 The Importance of Fraud Detection
 Financial Losses
 Customer Trust
 Legal and Regulatory Compliance
 Operational Efficiency
 The Role of Machine Learning Algorithms
Traditional Fraud Detection Algorithms
 Machine learning algorithms, traditional ones that have
been commonly used: logistic regression, decision trees,
and random forests.
 These machine learning models have had a significant
impact on fraud detection processes, providing
organizations with practical predictive capabilities. 283
Logistic regression, for example, is a statistical model
that has been widely used for its simplicity and
efficiency.
The decision trees method, on the other hand, makes
predictions based on a set of decision rules. And finally,
the random forest algorithm combines multiple decision
trees to generate a final outcome.
Complex fraud patterns involve non-linear
relationships, high-dimensional data, and evolving
tactics.
As a result:
 Feature Engineering Challenges
 Imbalanced Data
 Dynamic Fraud Behavior 284
New Approaches: Machine Learning Algorithms
1. Gradient Boosting Machines (GBM)
How GBM Works
Gradient Boosting Machines (GBM) is an ensemble
learning technique that combines multiple weak learners
(typically decision trees) to create a strong predictive model.
Here’s how it works:
GBM assigns weights to each tree based on its
performance in reducing residuals.
The final prediction is the weighted sum of predictions
from all trees.
285
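An illustrative sketch (synthetic data, not a real fraud dataset) of a gradient boosting classifier built with scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for fraud / non-fraud transactions
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# An ensemble of shallow trees fitted sequentially to the residual errors
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))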
 2. Neural Networks
 Application of Neural Networks (Deep Learning) in
Fraud Detection
 One of the most advanced tools in the machine learning
arsenal is neural networks, an algorithmic structure
modeled after the human brain.
 Neural networks, in essence, are computing systems
made up of interconnecting artificial neurons, or nodes.
 In the landscape of fraud detection, neural networks —
particularly deep learning neural networks — hold
significant potential.
286
 Transaction Sequences: Recurrent neural networks
(RNNs) can analyze sequences of transactions over
time, capturing temporal dependencies.
 Image-Based Fraud Detection: Convolutional
neural networks (CNNs) process images (e.g.,
scanned checks, ID cards) to detect anomalies.
 Ensemble Approaches: Stacking neural networks
with other models (e.g., GBM) can enhance overall
fraud detection accuracy.
287
3. XGBoost: Extreme Gradient Boosting
XGBoost (Extreme Gradient Boosting) is another
advanced machine learning algorithm that is creating
ripples in the field of fraud detection.
This algorithm stands out due to its flexibility, high
accuracy, and effectiveness in dealing with large and
complex datasets.
288
 Mobile Payment Fraud Detection:
 Researchers proposed an XGBoost-based framework for
mobile payment fraud detection.
 By integrating unsupervised outlier detection algorithms
and an XGBoost classifier, they achieved excellent results
on a large dataset of over 6 million mobile transactions.
 Credit Card Fraud Detection:
 Another study applied XGBoost to credit card transaction
data.
 They used outlier detection based on distance sum to
identify fraudulent transactions effectively.
 State Grid Corporation of China (SGCC) Dataset:
 Researchers explored integrating Genetic Algorithms with
XGBoost for fraud detection in the SGCC dataset.
 This integration showed promising results and opens
avenues for further research.
289
4. Isolation Forest: Unleashing Anomaly Detection
What is Isolation Forest?
The Isolation Forest or iForest is a unique and
compelling machine learning algorithm designed for
anomaly detection.
Effectiveness in Identifying Fraudulent Transactions
High-Dimensional Data
290
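A minimal sketch (synthetic data) of anomaly detection with scikit-learn's IsolationForest; a prediction of -1 flags a transaction as anomalous:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=10, size=(1000, 2))      # typical transactions
anomalies = rng.uniform(low=200, high=500, size=(10, 2))   # unusually large transactions
X = np.vstack([normal, anomalies])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)            # +1 = normal, -1 = anomaly
print("flagged as anomalous:", int((labels == -1).sum()))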
5. Autoencoders: Unleashing Latent Representations
Autoencoders are essentially data compression
algorithms where the compression and decompression
functions are data-specific, lossy, and learned
automatically from examples.
They are a type of self-organizing system that
can learn to represent (encode) the input data in a
manner that highlights its key features and patterns.
Key Features
Smart Adaptive Machine Learning
Smart Surveillance and Automation
Flexible Customizable Rules
Compliance Assurance 291
 The Power of Advanced Algorithms
 Accuracy: Machine learning algorithms such
as XGBoost, neural networks, and autoencoders
offer unparalleled accuracy in identifying
fraudulent activities.
 Adaptability: These algorithms adapt to changing
fraud patterns, ensuring continuous protection
292
Machine Learning Techniques all units .ppt

  • 2. UNIT – I MACHINE LEARNING BASICS Introduction to Machine Learning (ML) Essential concepts of ML Types of learning Machine learning methods based on Time Dimensionality Linearity and Non linearity Early trends in Machine learning Data Understanding Representation and visualization 2
  • 4. 4 What is Machine Learning?  Machine Learning Study of algorithms that improve their performance at some task with experience  Optimize a performance criterion using example data or past experience.  Role of Statistics: Inference from a sample  Role of Computer science: Efficient algorithms to  Solve the optimization problem  Representing and evaluating the model for inference
  • 5. MACHINE LEARNING Machine learning is branch of artificial intelligence that enables computers to “self learn” from training data and improve over time without being explicitly programmed. ML algorithms are able to detect patterns in data and learn from them in order to make their own predictions. 5
  • 6. What is Machine Learning? “Learning is any process by which a system improves performance from experience.” - Herbert Simon Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that • improve their performance P • at some task T • with experience E. A well-defined learning task is given by <P, T, E>. 3 6
  • 8. When Do We Use Machine Learning? ML is used when: •Human expertise does not exist (navigating on Mars) •Humans can’t explain their expertise (speech recognition) •Models must be customized (personalized medicine-routing on a computer network) •Models are based on huge amounts of data (user biometrics)) Learning isn’t always useful: •There is no need to “learn” to calculate payroll Based on slide by E. Alpaydin 5 8
  • 9. 9  Retail: Market basket analysis, Customer relationship management (CRM)  Finance: Credit scoring, fraud detection  Manufacturing: Optimization, troubleshooting  Medicine: Medical diagnosis  Telecommunications: Quality of service optimization  Bioinformatics: Motifs, alignment  Web mining: Search engines Applications:
  • 10. Growth of Machine Learning  Machine learning is preferred approach to  Speech recognition, Natural language processing  Computer vision  Medical outcomes analysis  Robot control  Computational biology  This trend is accelerating  Improved machine learning algorithms  Improved data capture, networking, faster computers  Software too complex to write by hand  New sensors / IO devices  Demand for self-customization to user, environment  It turns out to be difficult to extract knowledge from human expertsfailure of expert systems in the 1980’s. 10
  • 11. Samuel’s Checkers-Player “Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.” -Arthur Samuel (1959) 9 11
  • 12. Defining the Learning Task Improve on task T, with respect to performance metric P, based on experience E T: Playing checkers P: Percentage of games won against an arbitrary opponent E: Playing practice games against itself T: Recognizing hand-written words P: Percentage of words correctly classified E: Database of human-labeled images of handwritten words T: Driving on four-lane highways using vision sensors P: Average distance traveled before a human-judged error E: A sequence of images and steering commands recorded while observing a human driver. T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels 10 12
  • 13. Well Defined Learning Problem Learning = Improving with experience at some task Improve over task T , with respect to performance measure P , based on experience E. E.g., Learn to play checkers T : Play checkers P : % of games won in world tournament E: opportunity to play against self 13
  • 14. Learning to Play Checkers T : Play checkers P : Percent of games won in world tournament What experience? What exactly should be learned? How shall it be represented? What speci c algorithm to learn it? 14
  • 15. Type of Training Experience Direct or indirect? Teacher or not? A problem: is training experience representative of performance goal? 15
  • 16. Representation for Target Choose Function collection of rules? neural network ? polynomial function of board features? ... 16
  • 17. A Representation for Learned Func- tion w +w bp(b)+w rp(b)+w bk(b)+w rk(b)+w bt(b)+w rt(b 0 1 2 3 4 5 6 bp(b): number of black pieces on board b rp(b): number of red pieces on b bk(b): number of black kings on b rk(b): number of red kings on b bt(b): number of red pieces threatened by black (i.e., which can be taken on black's next turn) rt(b): number of black pieces threatened by red 17
  • 18. Obtaining Training Examples ^ 18 V (b): the true target function V (b) : the learned function train V (b): the training value One rule for estimating training values: train ^ V (b) V (Successor(b))
  • 19. Choose Weight Tuning Rule LMS Weight update rule: Do repeatedly: Select a training example b at random 1. Compute error(b): 19 train ^ error(b) = V (b) V (b) 2. For each board feature f , update weight w : i i w w + c f error(b) i i i c is some small constant, say 0.1, to moderate the rate of learning
  • 20. Design Choices Determine Target Function Determine Representation of Learned Function Determine Type of Training Experience Determine Learning Algorithm Games against self Games against experts Table of correct moves Linear function of six features Artificial neural network Polynomial Gradient descent Board ➝ value Board ➝ move Completed Design ... ... Linear programming ... ... 20
  • 22. 22
  • 23. 23
  • 24. TYPES OF LEARNING  Supervised (inductive) learning  Training data includes desired outputs  Unsupervised learning  Training data does not include desired outputs  Semi-supervised learning  Training data includes a few desired outputs  Reinforcement learning  Rewards from sequence of actions 24
  • 25. 25
  • 26. Learning Associations  Basket analysis: P (Y | X ) probability that somebody who buys X also buys Y where X and Y are products/services. Example: P ( chips | beer ) = 0.7 Market-Basket transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 26
  • 27. 27 Supervised Learning: Uses  Prediction of future cases: Use the rule to predict the output for future inputs  Knowledge extraction: The rule is easy to understand  Compression: The rule is simpler than the data it explains  Outlier detection: Exceptions that are not covered by the rule, e.g., fraud Example: decision trees tools that create rules
  • 28. 28
  • 29. Inductive Learning  Given examples of a function (X, F(X))  Predict function F(X) for new examples X  Discrete F(X): Classification  Continuous F(X): Regression  F(X) = Probability(X): Probability estimation 29
  • 30. 30 Prediction: Regression  Example: Price of a used car  x : car attributes y : price y = g (x | θ) g ( ) model, θ parameters y = wx+w0
  • 31. 31 Regression Applications  Navigating a car: Angle of the steering wheel (CMU NavLab)  Kinematics of a robot arm α1= g1(x,y) α2= g2(x,y) α1 α2 (x,y)
  • 32. 32
  • 33. 33
  • 34. 34
  • 35. 35
  • 36. 36
  • 37. 37
  • 38. 38
  • 39. 39
  • 40. 40
  • 41. 41 Classification  Example: Credit scoring  Differentiating between low-risk and high-risk customers from their income and savings Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk Model
  • 42. 42 Classification: Applications  Pattern recognition  Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style  Character recognition: Different handwriting styles.  Speech recognition: Temporal dependency.  Use of a dictionary or the syntax of the language.  Sensor fusion: Combine multiple modalities; eg, visual (lip image) and acoustic for speech  Medical diagnosis: From symptoms to illnesses  Web Advertising: Predict if a user clicks on an ad on the Internet.
  • 43. 43 Face Recognition Training examples of a person Test images AT&T Laboratories, Cambridge UK http://www.uk.research.att.com/facedatabase.html
  • 44. 44
  • 45. 45
  • 46. 46
  • 47. 47
  • 48. 48 Unsupervised Learning  Learning “what normally happens” No output  Clustering: Grouping similar instances  Other applications: Summarization, Association Analysis  Example applications  Customer segmentation in CRM  Image compression: Color quantization  Bioinformatics: Learning motifs
  • 49. 49
  • 50. 50
  • 51. 51
  • 52. 52
  • 53. 53
  • 54. 54 Reinforcement Learning  Topics:  Policies: what actions should an agent take in a particular situation  Utility estimation: how good is a state (used by policy)  No supervised output but delayed reward  Credit assignment problem (what was responsible for the outcome)  Applications:  Game playing  Robot in a maze  Multiple agents, partial observability, ...
  • 55. MACHINE LEARNING METHODS BASED ON TIME  A static model is trained offline.  That is, we train the model exactly once and then use that trained model for a while.  A dynamic model is trained online.  That is, data is continually entering the system and we're incorporating that data into the model through continuous updates. 55
  • 56. 56
  • 57. 57
  • 58. 58
  • 59. 59
  • 60. 60
  • 61. Curse of Dimensionality  The Curse of Dimensionality in Machine Learning arises when working with high- dimensional data, leading to increased computational complexity, overfitting, and spurious correlations.  Techniques like dimensionality reduction, feature selection, and careful model design are essential for mitigating its effects and improving algorithm performance.  Navigating this challenge is crucial for unlocking the potential of high-dimensional datasets and ensuring robust machine-learning solutions. 61
  • 62. 62
  • 63. LINEARITY AND NON LINEARITY  Linearity refers to the property of a system or model where the output is directly proportional to the input.  While nonlinearity implies that the relationship between input and output is more complex and cannot be expressed as a simple linear function. 63
  • 65. Linearity  Linear models are often the simplest and most effective approach.  A linear model essentially fits a straight line to the data, allowing it to make predictions based on a linear relationship between the input features and the output variable.  In a regression problem, a linear model is used to predict a continuous target variable based on one or more input features.  such as the size and age of a tree. 65
  • 66. Nonlinearity  Nonlinear models can take many forms, from polynomial models that fit curves to the data, to neural networks that can learn complex patterns in high-dimensional data.  These models are often more powerful than linear models because they can capture more complex relationships between variables.  In a classification problem, nonlinear models are able to identify complex decision boundaries that separate different classes, especially when those boundaries are not linear. 66
  • 67. 67
  • 68. Occam’s Razor  If you have two competing ideas to explain the same phenomenon, you should prefer the simpler one. 68
  • 69. Techniques for Applying Occam’s Razor in Machine Learning: 69
  • 70. No Free Lunch Theorem:  NFL Theorem If an algorithm performs better on certain class of problems, then it pays for it in the form of degraded performance in other classes of problems.  In other words, you cannot have single optimal solution for all classes of problems. 70
  • 71. Why Understanding Occam’s Razor is Important for Data Scientists 1. Enhancing Interpretability: Simpler models are often more interpretable, which means it’s easier to understand how they’re making predictions. 2. Avoiding Overfitting: The goal of machine learning is to make accurate predictions on new, unseen data. By keeping models simpler, data scientists can reduce the risk of overfitting. 3. Improving Generalizability: They are less likely to fit the noise in the training data and more likely to capture the underlying trend or relationship. 4. Reducing Computational Resources: where resources might be limited or expensive. 71
  • 72. EARLY TRENDS IN MACHINE LEARNING 72
  • 73. 73
  • 74. 74
  • 75. 75
  • 76. 76
  • 77. DATA UNDERSTANDING, REPRESENTATION, AND VISUALIZATION DATA UNDERSTANDING: Data understanding basically involves analyzing and exploring the data to identify any patterns or trends that may be present. Data exploration provides a high-level overview of each attribute (also called variable) in the dataset and the interaction between the attributes. The step of understanding the data can be broken down into three parts: 1. Understanding entities 2. Understanding attributes 3. Understanding data types 77
  • 78. Understanding Entities  In the field of data science or machine learning and artificial intelligence, entities represent groups of data separated based on conceptual themes and/or data acquisition methods.  An entity typically represents a table in a database, or a flat file.  EXAMPLE: comma separated variable (csv) file, or tab separated variable (tsv) file. Sometimes it is more efficient to represent the entities using a more structured formats like svmlight. Each entity can contain multiple attributes. 78
  • 79. Understanding Attributes  Each attribute can be thought of as a column in the file or table.  In case of Iris data, the attributes from the single given entity are sepal length in cm, sepal width in cm.  The structured formats like svmlight are more useful in case of sparse data, as they add significant overhead when the data is fully populated.  A sparse data is data with high dimensionality, but with many samples missing values for multiple attributes. 79
  • 80. Sepal-length Sepal-width Petal-length Petal-width Class label 5.1 3.5 1.4 0.2 Iris-setosa 4.9 3.0 1.4 0.2 Iris-setosa 4.7 3.2 1.3 0.2 Iris-setosa 4.6 3.1 1.5 0.2 Iris-setosa 5.0 3.6 1.4 0.2 Iris-setosa 4.8 3.4 1.9 0.2 Iris-setosa 5.0 3.0 1.6 0.2 Iris-setosa 5.0 3.4 1.6 0.4 Iris-setosa 5.2 3.5 1.5 0.2 Iris-setosa 5.2 3.4 1.4 0.2 Iris-setosa 7.0 3.2 4.7 1.4 Iris-versicolor 6.4 3.2 4.5 1.5 Iris-versicolor 6.9 3.1 4.9 1.5 Iris-versicolor 5.5 2.3 4.0 1.3 Iris-versicolor 6.5 2.8 4.6 1.5 Iris-versicolor 6.7 3.1 4.7 1.5 Iris-versicolor 6.3 2.3 4.4 1.3 Iris-versicolor 5.6 3.0 4.1 1.3 Iris-versicolor 5.5 2.5 4.0 1.3 Iris-versicolor 5.5 2.6 4.4 1.2 Iris-versicolor 6.3 3.3 6.0 2.5 Iris-virginica 80
  • 81. Understanding Data Types  Attributes in each entity can be of various different types from the storage and processing perspective, e.g., string, integer valued, datetime, binary (“true”/“false”, or “1”/“0”), etc.  Each type needs to be handled separately for generating a feature vector that will be consumed by the machine learning algorithm.  we can also come across sparse data, in which case, some attributes will have missing values.  This missing data is typically replaced with special characters, which should not be confused with any of the real values.  In order to process data with missing values, with some default values, or use an algorithm that can work with missing data. 81
  • 82. Representation and Visualization of the Data Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand. There are couple of options in such cases:  Draw multiple plots taking 2 or three dimensions at a time.  Reduce the dimensionality of the data and plot upto 3 dimensions Most common methods used to reduce the dimensionality are: 1.Principal Component Analysis or PCA 2.Linear Discriminant Analysis or LDA 82
  • 83. This can include a variety of visual tools such as: Charts: Bar charts, line charts, pie charts, etc. Graphs: Scatter plots, histograms, etc. Maps: Geographic maps, heat maps, etc. Dashboards: Interactive platforms that combine multiple visualizations. 83
  • 84. 84
  • 85. Tools for Visualization of Data  Tableau  Looker  Zoho Analytics  Sisense  IBM Cognos Analytics  Qlik Sense  Domo  Microsoft Power BI  Klipfolio  SAP Analytics Cloud 85
  • 86. 86
  • 87. Principal Component Analysis  We can only visualize the data in maximum of 2 or 3 dimensions. However, it is common practice to have the dimensionality of the data in tens or even hundreds.  If we can find exact coordinates (Xr ,Y r ) of the paper’s orientation as linear combination of X, Y , and Z coordinates of the 3-dimensional space, we can reduce the dimensionality of the data from 3 to 2. 87
  • 88. 2-dimensional data where PCA dimensionality is same but along different axes 88
  • 89. LINEAR DISCRIMINANT ANALYSIS  The way PCA tries to find the dimensions that maximize variance of the data, linear discriminant analysis or LDA tries to maximize the separation between the classes of data.  Thus LDA can only effectively work when we are dealing with classification type of problem, and the data intrinsically contains multiple classes. 89
  • 90. 3-dimensional data with 2 classes in another perspective, with LDA representation. LDA reduces the effective dimensionality of data into 1 dimension where the two classes can be best separated 90
  • 91. UNIT – II MACHINE LEARNING METHODS Linear methods Regression Classification Perceptron and Neural networks Decision trees Support vector machines Probabilistic models Unsupervised learning Featurization. 91
  • 92. LINEAR METHODS Machine Learning Algorithms are Divided Into Two Types: 1.Supervised learning algorithms 2.Unsupervised learning algorithms 92
  • 93. Supervised learning algorithms 93  Where the model is trained on a labelled dataset.  A labelled dataset is one that has both input and output parameters.  In this type of learning both training and validation, datasets are called labelled.  The input features are the attributes or characteristics of the data that are used to make predictions  while the output labels are the desired outcomes or targets that the algorithm tries to predict.
  • 94. 94
  • 95. EXAMPLE: It is a dataset of a shopping store that is useful in predicting whether a customer will purchase a particular product under consideration or not based on his/ her gender, age, and salary. Input: Gender, Age, Salary Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that the customer won’t purchase it. 95
  • 96. Supervised learning can be further classified into two categories: 96
  • 97. Unsupervised learning algorithms Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabelled data. 97
  • 98.  There are two main categories of unsupervised learning that are mentioned below: 1. Clustering 2. Association  Clustering is the process of grouping data points into clusters based on their similarity.  This technique is useful for identifying patterns and relationships in data without the need for labelled examples. 98
  • 99.  Here are some clustering algorithms: i. K-Means Clustering algorithm ii. Mean-shift algorithm iii. DBSCAN Algorithm iv. Principal Component Analysis v. Independent Component Analysis 99
  • 100. Association Association rule learning is a technique for discovering relationships between items in a dataset. It identifies rules that indicate the presence of one item implies the presence of another item with a specific probability. Here are some association rule learning algorithms: i.Apriori Algorithm ii.Eclat iii.FP-growth Algorithm 100
  • 101. LINEAR REGRESSION Regression is a supervised machine learning technique which is used to predict continuous values. The ultimate goal of a regression algorithm is to fit a best-fit line or curve through the data. Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to observed data. 101
  • 102. Types: i. Linear Regression ii. Polynomial Regression iii. Ridge Regression iv. Lasso Regression v. Logistic Regression vi. Support Vector Regression (SVR) vii. Decision Tree Regression viii. Random Forest Regression ix. Gradient Boosting Regression 102
  • 103. Simple Linear Regression This is the simplest form of linear regression, and it involves only one independent variable and one dependent variable. The equation for simple linear regression is: y = β0 + β1X, where: y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope. 103
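A minimal sketch (not part of the slides) of fitting y = β0 + β1X with scikit-learn; the toy data here is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one independent variable X and one dependent variable y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)
print("slope (beta1):", model.coef_[0])
print("prediction for X = 6:", model.predict([[6.0]])[0])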
  • 104. MULTIPLE LINEAR REGRESSION Linear regression is a classic example of strictly linear models. It is also called polynomial fitting. Let us consider a problem of linear regression where the training data contains p samples. The input is n-dimensional, (xi, i = 1, ..., p) with xi ∈ ℝⁿ, and the output is single dimensional, (yi, i = 1, ..., p) with yi ∈ ℝ. Multiple linear regression assumes a linear relationship between the independent and dependent variables: Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ, where β0 is the intercept, β1, β2, ..., βn are the coefficients, and ϵ is the error term. 104
  • 105. Ridge Regression In machine learning, ridge regression helps reduce overfitting that results from model complexity. Model complexity can be due to a model possessing too many features. In ridge regression, a penalty term is added to the loss function to constrain the coefficients. The penalty is proportional to the square of the magnitude of the coefficients. Loss function: Minimize ∑_{i=1}^{n} (yi − ŷi)² + λ ∑_{j=1}^{p} βj². Here, λ is a regularization parameter that controls the strength of the penalty, βj are the coefficients, and p is the number of features. 105
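A brief, hedged sketch of ridge regression with scikit-learn, where the alpha argument plays the role of the regularization parameter λ above; the data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                   # 100 samples, 5 features
y = X @ np.array([1.5, 0.0, -2.0, 0.3, 0.0]) + rng.normal(scale=0.1, size=100)

# alpha corresponds to the penalty strength lambda in the loss function above
ridge = Ridge(alpha=1.0).fit(X, y)
print("coefficients:", ridge.coef_)                             # shrunk toward zero relative to plain least squares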
  • 106. CLASSIFICATION The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observation into a number of classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called as targets/labels or categories. 106
  • 107. In a classification algorithm, a discrete output function (y) is mapped to an input variable (x). y = f(x) Where, y = categorical output Example: Email Spam Detector. There are two classes, class A and class B. Each class has features that are similar within the class and dissimilar to the other class. 107
  • 108. Types: 1. Linear Models 1. Logistic Regression 2. Support Vector Machines 2. Non-linear Models 1. K-Nearest Neighbours 2. Kernel SVM 3. Naïve Bayes 4. Decision Tree Classification 5. Random Forest Classification 108
  • 109. K-Nearest Neighbor(KNN) Algorithm for Machine Learning K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique. The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. 109
  • 110. The weighted K-Nearest Neighbors (KNN) algorithm is a supervised learning technique used for classification and regression tasks. It makes predictions based on the similarity between a new data point and its closest neighbors in the training dataset. 110
  • 111. Classification For classification, the target variable is categorical. Predicting the class of a new data point based on the majority class among its k-nearest neighbors. Regression while for regression, the target variable is continuous. Predicting the value of a continuous target variable based on the average value of the target variable among its k-nearest neighbors. 111
  • 112. How does K-NN work? The K-NN working can be explained on the basis of the below algorithm: Step-1: Select the number K of the neighbors Step-2: Calculate the Euclidean distance of K number of neighbors Step-3: Identify the K nearest neighbors as per the calculated Euclidean distance. Step-4: Among these k neighbors, count the number of the data points in each category. Step-5: Assign the new data points to that category for which the number of the neighbour is maximum. Step-6: Our model is ready. 112
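The six steps above can be sketched directly in Python with NumPy; this is a toy, from-scratch illustration rather than a library implementation.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Steps 2-3: Euclidean distances from the query to every training point,
    # then pick the indices of the K nearest neighbors
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the categories among the neighbors and return the majority
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # 'A' — two of the three neighbors are class A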
  • 113. 113
  • 114. Advantages of KNN Algorithm: It is simple to implement. It is robust to noisy training data. It can be more effective if the training data is large. Disadvantages of KNN Algorithm: It always needs to determine the value of K, which may be complex at times. The computation cost is high because of calculating the distance between the data points for all the training samples. 114
  • 115. Real-World Applications Weighted KNN is used in a wide range of applications, including: 1 Recommendation Systems Recommending products or content based on the preferences of similar users. 2 Image Recognition Classifying images based on their visual features. 3 Financial Forecasting Predicting future stock prices or other financial indicators based on historical data. 4 Medical Diagnosis Assisting in the diagnosis of diseases based on patient data and medical records.
  • 116. PERCEPTRON AND NEURAL NETWORKS  A weight is assigned to each input node of a perceptron, indicating the significance of that input to the output.  The perceptron's output is a weighted sum of the inputs that has been run through an activation function to decide whether or not the perceptron will fire. It computes the weighted sum of its inputs as: z = w1x1 + w2x2 + ... + wnxn = xᵀw 116
  • 117. The step function, which compares this weighted sum to a threshold and outputs 1 if the input is larger than the threshold value and 0 otherwise, is the activation function that perceptrons utilize most frequently. The most common step function used in perceptrons is the Heaviside step function: 117
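A small sketch of the forward computation just described; the inputs, weights, and threshold are made-up values.
import numpy as np

def heaviside(z, threshold=0.0):
    # Step activation: output 1 if the weighted sum exceeds the threshold, else 0
    return 1 if z > threshold else 0

def perceptron_output(x, w):
    # z = w1*x1 + w2*x2 + ... + wn*xn, then apply the step function
    z = np.dot(w, x)
    return heaviside(z)

x = np.array([1.0, 0.5, -0.2])   # example inputs
w = np.array([0.4, 0.6, 1.0])    # example weights
print(perceptron_output(x, w))   # 1, because 0.4 + 0.3 - 0.2 = 0.5 > 0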
  • 118. The perceptron, a foundational building block in the realm of artificial intelligence, serves as the fundamental unit of neural networks. This simple yet powerful computational model, conceived by Frank Rosenblatt in the 1950s, simulates the behavior of a single neuron in the human brain. It takes multiple input signals, applies weights to each signal, and generates a single output signal based on a threshold function. These interconnected perceptrons, forming layers, create neural networks capable of solving complex tasks like image recognition, natural language processing, and machine translation.
  • 119. The Perceptron: A Simple Model Input Layer The input layer receives information from the external environment. These inputs can be numerical values, representing features or characteristics of the data being processed. Each input is assigned a weight, reflecting its importance or influence on the overall decision. Activation Function The activation function determines the output of the perceptron. It introduces nonlinearity, allowing the perceptron to learn complex relationships in the data. Common activation functions include the sigmoid, ReLU, and tanh functions. Output Layer The output layer produces a single output value, representing the final decision or prediction made by the perceptron. This output can be a binary value (0 or 1) for classification tasks or a continuous value for regression tasks.
  • 120. Neural Networks: Layers of Perceptrons Input Layer The input layer receives raw data, which is then transformed into a representation suitable for processing by the network. This representation is often in the form of numerical values, representing features or characteristics of the data. Hidden Layers Hidden layers process the input information through a series of non-linear transformations. These layers extract complex features and relationships from the data, enabling the network to learn intricate patterns and make accurate predictions. Output Layer The output layer generates the network's final prediction or decision. The structure of the output layer depends on the specific task being solved. For classification tasks, it may produce a probability distribution over different classes. For regression tasks, it may output a continuous value.
  • 122. Learning and Optimization Forward Propagation Information flows from the input layer through the hidden layers to the output layer. During forward propagation, the network calculates the output based on the current weights and biases. Backpropagation The difference between the predicted output and the actual target output is calculated. This error signal is then propagated backward through the network, updating the weights and biases to minimize the error. Gradient Descent Gradient descent is an optimization algorithm that iteratively adjusts the weights and biases in the direction that minimizes the error. It uses the gradients of the error function to guide the search for the optimal parameters.
  • 124. Types of Neural Networks 1 Convolutional Neural Networks (CNNs) CNNs are specialized for image recognition tasks. They use convolutional filters to extract features from images, making them highly effective in identifying objects, scenes, and patterns. 2 Recurrent Neural Networks (RNNs) RNNs are designed to process sequential data, such as text or time series. They use feedback loops to maintain a memory of previous inputs, allowing them to understand context and dependencies in sequential data. 3 Long Short-Term Memory (LSTM) Networks LSTMs are a type of RNN that are particularly well-suited for learning long-term dependencies. They have special memory cells that can store information over extended periods, enabling them to capture complex relationships in sequential data. 4 Generative Adversarial Networks (GANs) GANs are composed of two competing networks: a generator and a discriminator. The generator creates new data samples, while the discriminator tries to distinguish between real and generated data. This adversarial process results in the generation of highly realistic and diverse data.
  • 125. Applications of Neural Networks Image Recognition Object Detection Image Classification Natural Language Processing Machine Translation Text Summarization Speech Recognition Time Series Forecasting Drug Discovery
  • 126. PROBABILISTIC MODELS Machine learning algorithms today rely heavily on probabilistic models, which take into consideration the uncertainty inherent in real-world data. These models make predictions based on probability distributions, rather than absolute values, allowing for a more nuanced and accurate understanding of complex systems. One common approach is Bayesian inference, where prior knowledge is combined with observed data to make predictions. Another approach is maximum likelihood estimation, which seeks to find the model that best fits observational data. 126
  • 127. Probabilistic modeling is a statistical approach that uses the effect of random occurrences or actions to forecast the possibility of future results. It is a quantitative modeling method that projects several possible outcomes that might even go beyond what has happened recently. Categories Of Probabilistic Models 1.Generative models 2.Discriminative models. 127
  • 128. Generative models: Generative models aim to model the joint distribution of the input and output variables. These models generate new data based on the probability distribution of the original dataset. Generative models are powerful because they can generate new data that resembles the training data. They can be used for tasks such as image and speech synthesis, language translation, and text generation. 128
  • 129. Discriminative models: The discriminative model aims to model the conditional distribution of the output variable given the input variable. They learn a decision boundary that separates the different classes of the output variable. Discriminative models are useful when the focus is on making accurate predictions rather than generating new data. They can be used for tasks such as image recognition, speech recognition, and sentiment analysis. 129
  • 130. Maximum Likelihood Estimation The maximum likelihood estimation or MLE approach deals with the problem at face value and parameterizes the information into variables. The values of the variables that maximize the probability of the observed variables lead to the solution of the problem. Let us define the problem using formal notations. Let there be a function f_θ(x) that produces the observed output y. Here x ∈ ℝⁿ represents the input, over which we don't have any control, and θ ∈ Θ represents a parameter vector that can be single or multidimensional. 130
  • 131. The MLE method defines a likelihood function denoted as L(y|θ). Typically the likelihood function is the joint probability of the parameters and observed variables, L(y|θ) = P(y; θ). The objective is to find the optimal values for θ that maximize the likelihood function, as given by θ_MLE = arg max_{θ∈Θ} L(y|θ), or equivalently θ_MLE = arg max_{θ∈Θ} P(y; θ). 131
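As a concrete, illustrative example (not from the slides): for a biased coin observed to come up heads 7 times in 10 tosses, the likelihood is binomial and the θ that maximizes it is the sample proportion; a simple brute-force search confirms this.
import numpy as np
from math import comb

heads, tosses = 7, 10

def likelihood(theta):
    # Binomial likelihood P(y; theta) of observing `heads` successes in `tosses` trials
    return comb(tosses, heads) * theta**heads * (1 - theta)**(tosses - heads)

thetas = np.linspace(0.01, 0.99, 981)
theta_mle = thetas[np.argmax([likelihood(t) for t in thetas])]
print(theta_mle)  # approximately 0.7 = 7/10, the analytical MLE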
  • 132. Bayesian Approach In the Bayesian approach, all the unknowns are modelled as random variables with known prior probability distributions. Let us denote the conditional prior probability of observing the output y for parameter vector θ as P(y|θ). The marginal probabilities of these variables are denoted as P(y) and P(θ). The joint probability of the variables can be written in terms of conditional and marginal probabilities as P(y; θ) = P(y|θ) · P(θ) ………(1) 132
  • 133. The same joint probability can also be given as P(y; θ) = P(θ|y) · P(y) ………(2) Here the probability P(θ|y) is called the posterior probability. Combining Eqs. 1 and 2, P(θ|y) · P(y) = P(y|θ) · P(θ) ……..(3) Rearranging the terms we get P(θ|y) = P(y|θ) · P(θ) / P(y). This theorem (Bayes' theorem) gives the relationship between the posterior probability and the prior probability in a simple and elegant manner. 133
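A tiny numeric illustration of the rearranged formula P(θ|y) = P(y|θ) · P(θ) / P(y); the probabilities below are invented (a screening test for a rare defect).
# Invented numbers purely to illustrate Bayes' theorem
p_theta = 0.01          # prior P(theta): 1% of items are defective
p_y_given_theta = 0.95  # likelihood P(y|theta): the test flags 95% of defective items
p_y_given_not = 0.05    # the test also flags 5% of good items

# Marginal P(y) via the law of total probability
p_y = p_y_given_theta * p_theta + p_y_given_not * (1 - p_theta)

posterior = p_y_given_theta * p_theta / p_y
print(round(posterior, 3))  # ~0.161: a positive flag still leaves only ~16% posterior probability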
  • 134. UNIT – III MACHINE LEARNING IN PRACTICE Ranking Recommendation System Designing and Tuning model pipelines Performance measurement Azure Machine Learning Open-source Machine Learning libraries Amazons Machine Learning Tool Kit: Sagemaker. 134
  • 135. RANKING  Ranking is a machine learning technique to rank items.  A ranking algorithm is a procedure used to rank items in a dataset according to some criterion.  It can be divided into two categories: 1.Deterministic 2.Probabilistic  Ranking at heart is essentially sorting of information.  It is useful for many applications in information retrieval such as e-commerce, social networks, recommendation systems, and so on.  Example: A user searches for an article or an item to buy online. 135
  • 136. Measuring Ranking Performance  An ideal algorithm would rank the items in strictly non-increasing order of relevance.  Let there be n items that need to be ranked, each with a relevance score ri, i = 1, 2, ..., n.  A simple measure, called cumulative gain (CG), is defined on these relevances as CG = ∑_{i=1}^{n} ri. 136
  • 137. 137
  • 138. Sample set of items to be ranked with relevance score 138
Item No. | Relevance Score | Discounted relevance score
1 | 0.4 | 0.4
2 | 0.25 | 0.16
3 | 0.1 | 0.05
4 | 0.0 | 0.0
5 | 0.3 | 0.12
6 | 0.13 | 0.05
7 | 0.6 | 0.2
8 | 0.0 | 0.0
9 | 0.56 | 0.17
10 | 0.22 | 0.06
  • 139. Sample set of items ideally ranked with non-increasing relevance score 139
Item No. | Relevance Score | Discounted relevance score
1 | 0.6 | 0.6
2 | 0.56 | 0.35
3 | 0.4 | 0.2
4 | 0.3 | 0.13
5 | 0.25 | 0.1
6 | 0.22 | 0.08
7 | 0.13 | 0.04
8 | 0.1 | 0.032
9 | 0.0 | 0.0
10 | 0.0 | 0.0
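Using the relevance scores from the tables above, cumulative gain and its discounted and normalized variants can be computed as below; the r_i / log2(i + 1) discount used here reproduces the "Discounted relevance score" columns shown on the slides.
import numpy as np

def dcg(relevances):
    # Discounted cumulative gain: sum of r_i / log2(i + 1), with 1-based positions i
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return float((rel / np.log2(positions + 1)).sum())

scores = [0.4, 0.25, 0.1, 0.0, 0.3, 0.13, 0.6, 0.0, 0.56, 0.22]  # ranked order from the first table
cg = sum(scores)                            # plain cumulative gain
idcg = dcg(sorted(scores, reverse=True))    # DCG of the ideal ordering (second table)
ndcg = dcg(scores) / idcg                   # normalized DCG, in [0, 1]
print(round(cg, 2), round(dcg(scores), 2), round(ndcg, 2))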
  • 140. 140
  • 141. Ranking Search Results and Google’s Pagerank  The concept of PageRank was at the heart of Google’s rankings.  Google’s ranking algorithm is a secret, but we know that it is a probabilistic ranking algorithm. Google uses a variety of factors to rank webpages, including the number of links to a page, the page’s PageRank, and the relevance of the search query to the page.  Google’s PageRank algorithm is a probabilistic ranking algorithm that uses the number of links to a webpage as a measure of its importance.  The higher the PageRank of a webpage, the more likely it is to be ranked higher in the search results. 141
  • 142. Techniques Used in Ranking Systems  Information retrieval and text mining are the core concepts that are important in the ranking system that relate to searching websites or movies, etc.  We will look specifically at a technique called as keyword identification/extraction and word cloud generation. 142
  • 143. Keyword Identification / Extraction A word cloud is a graphical representation of the keywords identified from a document or set of documents. The size of each keyword represents its relative importance. Steps in Building a Word Cloud 1. Cleaning up the data. This step involves removing all the formatting characters as well as punctuation marks. 2. Normalization of the data. This step involves making all the characters lower case, unless it is required to distinguish upper case letters from lower case. It also involves applying grammar rules to find the root word for every word form. 143
  • 144. 3. Removal of stop words. This step involves removing all the common words that are typically used in any text and have no meaning of their own, for example: a, the, for, to, and, etc. 4. Compute the frequency of occurrence of each of the remaining words. 5. Sort the words in descending order of frequency. 6. Plot the top-n keywords graphically to generate the word cloud. 144
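A minimal sketch of steps 1 to 5 using only the Python standard library; the stop-word list here is a tiny stand-in and no stemming is applied.
import re
from collections import Counter

text = "Four score and seven years ago our fathers brought forth on this continent a new nation"
stop_words = {"a", "and", "the", "on", "this", "our", "of", "to", "for"}  # small illustrative list

# Steps 1-3: strip punctuation, lower-case, and drop stop words
words = re.findall(r"[a-z']+", text.lower())
words = [w for w in words if w not in stop_words]

# Steps 4-5: count frequencies and sort in descending order
top_keywords = Counter(words).most_common(10)
print(top_keywords)  # step 6 would plot these top-n keywords as a word cloud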
  • 145. 145 Word cloud from famous Gettysburg address by Abraham Lincoln
  • 146. 146
  • 147. RECOMMENDATION SYSTEM  The technique behind modern recommendation systems is commonly called collaborative filtering.  Companies like Amazon and Netflix have influenced this genre of machine learning significantly for personalizing shopping and movie watching experiences. 147
  • 148. 148
  • 149. Collaborative Filtering  A mathematical process of predicting the interests or preferences of a given user based on a database of interests or preferences of other users.  Users with similar interests in one aspect tend to share similar interests in other aspects.  Collaborative filters always deal with 2-dimensional data in the form of a matrix, one dimension being the list of users and the other being the entity that is liked, watched, or purchased. 149
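A rough sketch of the idea with an invented user-item rating matrix: users are compared by the cosine similarity of their rating vectors, and a missing rating is predicted as a similarity-weighted average of other users' ratings.
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated" (made-up data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2            # predict user 0's rating for item 2
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in range(len(ratings))])
sims[target_user] = 0.0                    # ignore self-similarity

# Weighted average of the other users' ratings for the target item
rated = ratings[:, target_item] > 0
pred = np.dot(sims[rated], ratings[rated, target_item]) / sims[rated].sum()
print(round(pred, 2))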
  • 150. 150
  • 151. Sample training data for building a recommendation system 151
  • 152. 152
  • 153. Solution Approaches: Information Types 1.Information about the users in the form of their profiles. The profiles can have key aspects like age, gender, location, employment type, number of kids, etc. 2.Information about the interests. For movies, it can be in the form of languages, genres, lead actors and actresses, release dates, etc. 3.Joint information of users’ liking or rating their interests 153
  • 154.  Algorithm Types  Algorithms exploiting the topological or neighborhood information. These algorithms are primarily based on the joint historical information about the users' ratings.  Algorithms that exploit the structure of the relationships. These methods assume a structural relationship between users and their ratings on the interests.  These relationships are modelled using probabilistic networks or latent variable methods like component analysis (principal or independent) or singular value decomposition.  Hybrid approach. When little rating history is available, the neighborhood-based algorithms simply cannot operate.  Hence these hybrid systems start with algorithms that are more influenced by the structural models in the early stages, and move toward neighborhood-based models as more data accumulates. 154
  • 155.  Amazon's Personal Shopping Experience  First and foremost, Amazon is a shopping platform, where Amazon itself sells products and services and also lets third-party sellers sell their products.  Each shopper that comes to Amazon has some idea about the product that he/she wants to purchase, along with a budget for the cost of the product that he/she is ready to spend, as well as an expectation of the date by which the product must be delivered. 155
  • 156. Context Based Recommendation Suggesting other similar products that are cheaper than the product selected. Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. So, if the cost is the only aspect stopping the user from buying the selected item, the recommended item can solve the problem. Suggesting a similar product from a more popular brand. This might attract the user to buy a potentially more expensive product that comes from a more popular brand, and so is more reliable or of better quality. 156
  • 157.  Suggesting a product that has better customer reviews.  Suggesting a set of products that are typically bundled with the selected product.  For example suggesting carry bag or battery charger or a memory card when selected item is a digital camera. 157
  • 158. DESIGNING AND TUNING MODEL PIPELINES Model tuning is the experimental process of finding the optimal values of hyperparameters to maximize model performance. Hyperparameters are the set of variables whose values cannot be estimated by the model from the training data. These values control the training process. Then the available data is split into two or three sets  Training - used for training the model  Validation - optional validation set is used to tune the parameters  Testing - is used to predict the performance metrics of the algorithm 158
  • 159. Designing and Tuning Model Pipelines involves the following steps 1. Choosing the Technique or Algorithm 2. Choosing the Technique for Adult Salary Classification 3. Splitting the Data 4. Stratified Sampling 5. Training 6. Tuning the Hyperparameters 7. Accuracy Measurement 8. Explainability of Features 9. Practical Considerations 10. Data Leakage 11. Coincidence and Causality 12. Unknown Categories 159
  • 160. Choosing the Technique or Algorithm Let's consider the problem of binary classification of the adult salary data. Example: If we are dealing with a regression problem, then we are restricted to all the regression-type techniques, which would eliminate the clustering or recommendation algorithms, etc. Choosing the Technique for Adult Salary Classification Regression algorithms can always be used in classification applications with the addition of a threshold, but are less preferred. 160
  • 161.  Decision tree based methods are typically better suited for problems that have categorical features, as these algorithms inherently use the categorical information.  Most other algorithms like logistic regression, support vector machines, neural networks, or probabilistic approaches are better suited for numerical features.  A single decision tree can be used as one of the simplest possible starting algorithms.  A random forest of decision trees is an even better method. 161
  • 162. Splitting the Data The division into the three sets is typically done in percentages such as 60–20–20 or 70–15–15. Stratified Sampling Stratified sampling ensures a certain known distribution of classes in the split parts. When we are dealing with, say, n classes, the number of samples per class is not always the same. When the original distribution is unbalanced, there are two choices: 1. Ignore a number of samples from the classes that have more samples, to match the classes that have fewer samples. 2. Use the samples from the classes that have fewer samples repeatedly, to match the number of samples from the classes that have more samples. 162
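A short sketch of a stratified 70-15-15 split with scikit-learn; the synthetic, imbalanced dataset below stands in for real labelled data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for a real labelled dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# 70% training, stratified so every split keeps roughly the same class proportions
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42)

# Split the remaining 30% evenly into validation (15%) and test (15%) sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150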
  • 163. Training Once the data is split into training and test sets, the training set is used to train the algorithm and build the model. Each algorithm has its own specific method associated with it for training. The aim of all the different techniques is to find the right set of parameters of the model that can map the input to the output, as far as the training data is concerned. The parameters of the model are also classified into two main types: 1. The parameters that can be computed using the training process to minimize the error in the prediction. 2. The parameters that cannot be directly computed using the training process. These parameters are also called hyperparameters. 163
  • 164. Tuning the Hyperparameters  The hyperparameters are typically a set of parameters that can be unbounded; theoretically one could choose any value between 1 and ∞.  In practice, bounds are created based on multiple constraints like computation requirements, dimensionality, and size of data, etc.  A single training set can be used to get results with one set of hyperparameters.  The training set is the only data available for training, and the trained model is applied on the validation set to compute the accuracy metrics. 164
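One common way to tune hyperparameters is a grid search with cross-validation on the training data; this is a hedged scikit-learn sketch with synthetic data and an illustrative, not exhaustive, grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate hyperparameter values; the bounds are a practical choice, not exhaustive
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))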
  • 165. Data Leakage If the data is static then there is a set of static algorithms that one can choose from. An example of a static problem could be classification of images into ones that contain cars and ones that don't. If the data is changing over time, it creates an additional layer of complexity. Example: Consider the business of an auto mechanic. Here are some of the columns in the data: Number of visits Method of payment Category of customer Amount of sale Year of manufacture of the car Make of the car Model of the car Miles on the odometer 165
  • 166. PERFORMANCE MEASUREMENT To evaluate the performance or quality of the model, different metrics are used, and these metrics are known as performance metrics or evaluation metrics Performance Metrics for Regression / Numerical error  Mean Absolute Error  Mean Squared Error  Normalized Error Performance Metrics for Classification / Categorical error  Accuracy  Precision and Recall  F-Score  Confusion Matrix  Receiver Operating Characteristics (ROC) Curve Analysis Steps in Hypothesis Testing  A/B Testing 166
  • 167. Performance Metrics for Regression / Numerical error 167
  • 168. Mean Absolute Error is a regressive loss measure looking at the absolute value difference between a model's predictions and ground truth, averaged out across the dataset. However, MAE weighs all errors equally and does not emphasize large deviations; to overcome this limitation, another metric can be used, which is Mean Squared Error or MSE. 168
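A small example computing both metrics with scikit-learn on made-up predictions, to show how MSE penalizes large errors more heavily than MAE.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average of |error|
mse = mean_squared_error(y_true, y_pred)    # average of error^2, emphasizes large errors
print(mae, mse)  # 0.75 and 0.875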
  • 169. 169
  • 170. 170
  • 171. Normalized Error In many cases, the above error metrics can produce an arbitrary number between -∞ and ∞. Normalizing the error maps it to a fixed range; typically the bounds used are (-1 to +1), (0 to 1), or (0 to 100). This way, even a single instance of normalized error can make sense on its own. All the above error definitions can have their own normalized counterpart. 171
  • 172. 172
  • 173. AZURE MACHINE LEARNING - AML Azure Machine Learning is a comprehensive machine learning platform that supports language model fine-tuning and deployment. Using the Azure Machine Learning model catalog, users can create an endpoint for Azure OpenAI Service and use REST APIs to integrate models into applications. 173
  • 174. Sign in with a Microsoft account (e.g., hotmail.com or outlook.com, etc.). 174
  • 175. 175
  • 176. 176
  • 177. UNIT – IV MACHINE LEARNING AND DATA ANALYTICS Machine Learning for Predictive Data Analytics Data to Insights to Decisions Data Exploration Information based Learning Similarity based learning Probability based learning Error based learning Evaluation The art of Machine learning to Predictive Data Analytics. 177
  • 178. MACHINE LEARNING FOR PREDICTIVE DATA ANALYTICS Predictive data analytics is the art of building and using models that make predictions about future outcomes based on patterns extracted from historical data. 178
  • 179. 179
  • 180. Applications: Price Prediction: Predictive analytics models can be trained to predict optimal prices based on historical sales records. Dosage Prediction: Doctors and scientists frequently decide how much of a medicine or other chemical to include in a treatment. Risk Assessment: Predictive analytics models can be used to predict the risk associated with decisions such as issuing a loan or underwriting an insurance policy. Diagnosis: Doctors, engineers and scientists regularly make diagnoses as part of their work. Typically, these diagnoses are based on their extensive training, expertise, and experience. 180
  • 181. Document Classification: Predictive data analytics can be used to automatically classify documents into different categories. Examples: Email spam filtering, News sentiment analysis, Customer complaint redirection, and Medical decision making. How is machine learning used in predictive analytics? Predictive analytics using machine learning algorithms can provide more accurate and precise predictions, automate decision-making processes, and scale up to handle large datasets and complex problems. Machine learning algorithms can provide more accurate predictions than traditional statistical models. 181
  • 182. 182
  • 183. 183
  • 184. Machine Learning Machine learning is defined as an automated process that extracts patterns from data. To build the models used in predictive data analytics applications, we use supervised machine learning. Supervised machine learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature based on a set of historical examples, or instances. 184
  • 185. 185
  • 186. Restriction bias constrains the set of models that the algorithm will consider during the learning process. Preference bias guides the learning algorithm to prefer certain models over others. Inductive bias is necessary for learning (beyond the dataset). There are two sources of information that guide this search: the training data and the inductive bias of the algorithm. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting occurs when the prediction model selected by the algorithm is so complex that the model fits the data set too closely and becomes sensitive to noise in the data. 186
  • 187. The Predictive Data Analytics Project Lifecycle: Crisp-DM 187
  • 188. Cross Industry Standard Process for Data Mining (CRISP-DM). Key features of the CRISP-DM process that make it attractive to data analytics practitioners are that it is non-proprietary; it is application, industry, and tool neutral; and it explicitly views the data analytics process from both an application-focused and a technical perspective. Business Understanding: Predictive data analytics projects never start out with the goal of building a prediction model. Instead, they are focused on things like gaining new customers, selling more products, or adding efficiencies to a process. 188
  • 189. Data Understanding: Once the manner in which predictive data analytics will be used to address a business problem has been decided, it is important that the data analyst fully understands the different data sources available and the data they contain. Data Preparation: Building predictive data analytics models requires specific kinds of data, organized in a specific kind of structure known as an analytics base table (ABT). Modeling: Different machine learning algorithms are used to build a range of prediction models from which the best model will be selected for deployment. 189
  • 190. Evaluation: This phase Of CRISP-DM covers all the evaluation tasks required to show that a prediction model will be able to make accurate predictions after being deployed and that it does not suffer from overfitting or underfitting. Deployment: Machine learning models are built to serve a purpose within an organization, and the last phase of CRISP-DM covers all the work that must be done to successfully integrate a machine learning model into the processes within an organization. 190
  • 191. Predictive Analytics Tools Predictive Analytics Software Tools have advanced analytical capabilities like Text Analysis, Real-Time Analysis, Statistical Analysis, Data Mining, Machine Learning modeling and Optimization. Libraries for Statistical Modeling and Analysis Scikit-learn Pandas Statsmodels NLTK (Natural Language Processing Tool Kit) GraphLab Neural Designer Open-Source Analytical Tools SAP BusinessObjects IBM SPSS Halo Business Intelligence Dataiku DSS Weka 191
  • 192. DATA TO INSIGHTS TO DECISIONS Designing ABTs that properly represent the characteristics of a prediction subject is a key skill for analytics practitioners. An approach to first develop a set of domain concepts that describe the prediction subject, and then expand these into concrete descriptive features. Converting a business problem into an analytics solution Involves answering the following key questions: 1.What is the business problem? 2.What are the goals that the business wants to achieve? 3.How does the business currently work? 4.In what ways could a predictive analytics model help to address the business problem? 192
  • 193. Designing the Analytics Base Table (ABT) The basic structure in which we capture historical datasets is the analytics base table (ABT). Different data sources are typically combined to create an analytics base table. The general structure of an analytics base table:  Descriptive features  Target feature 193
  • 194. The different data sources typically combined to create an analytics base table. 194
  • 195. Prediction subject defines the basic level at which predictions are made, and each row in the ABT will represent one instance of the prediction subject One-row-per-subject is often used to describe this structure Each row in an ABT is composed of a set of descriptive features and a target feature A good way to define features is to identify the key domain concepts and then to base the features on these concepts. 195
  • 196. The hierarchical relationship between an analytics solution, domain concepts, and descriptive features. 196
  • 197.  Features in an ABT:  Raw features  Derived features: requires data from multiple sources to be combined into a set of single feature values Common derived feature types:  Aggregates  Flags  Ratios  Mappings There are a number of general domain concepts that are often useful:  Prediction Subject Details  Demographics  Usage  Changes in Usage  Special Usage  Lifecycle Phase  Network Links 197
  • 198. Example domain concepts for a motor insurance fraud claim prediction analytics solution 198
  • 199. Designing & Implementing Features Three key data considerations are particularly important when we are designing features Data availability, because we must have data available to implement any feature we would like to use. Example: In an online payments service scenario, we might define a feature that calculates the average of a customer’s account balance over the past six months. 199
  • 200. Timing: Timing with which data becomes available for inclusion in a feature. With the exception of the definition of the target feature, data that will be used to define a feature must be available before the event around which we are trying to make predictions occurs. Example: If we were building a model to predict the outcomes of soccer matches, we might consider including the attendance at the match as a descriptive feature; however, the actual attendance is not known until the match takes place, so it would not be available at the time a prediction needs to be made. 200
  • 201. Longevity: There is potential for features to go stale if something about the environment from which they are generated changes. Example, to make predictions of the outcome of loans granted by a bank, we might use the borrower’s salary as a descriptive feature. 201
  • 202. Propensity models: Many of the predictive models that we build are propensity models, which inherently have a temporal element Two key periods of propensity modeling: Observation period Outcome period 202
  • 203. 203
  • 204. Observation and outcome periods defined by an event rather than by a fixed point in time (each line represents a prediction subject and stars signify events). 204
  • 205. Case Study: Motor Insurance Fraud What are the observation period and outcome period for the motor insurance claim prediction scenario? The observation period and outcome period are measured over different dates for each insurance claim, defined relative to the specific date of that claim. The observation period is the time prior to the claim event, over which the descriptive features capturing the claimant’s behavior are calculated. The outcome period is the time immediately after the claim event, during which it will emerge whether the claim is fraudulent or genuine. What features could you use to capture the Claim Frequency domain concept? 205
  • 206. Example domain concepts for a motor insurance fraud prediction analytics solution 206
  • 207. DATA EXPLORATION Data exploration refers to the initial step in data analysis in which data analysts use data visualization and statistical techniques to describe dataset characterizations, such as size, quantity, and accuracy, in order to better understand the nature of the data. It involves scrutinizing datasets to uncover hidden patterns, outliers, and insights. Whether in business, healthcare, or research, data exploration serves as the compass guiding decision- makers. 207
  • 208. The Data Quality Report A data quality report includes tabular reports that describe the characteristics of each feature in an ABT using standard statistical measures of central tendency and variation. The tabular reports are accompanied by data visualizations: A histogram for each continuous feature in an ABT A bar plot for each categorical feature in an ABT. 208
  • 209. Purposes of Data Exploration: Understanding Data Structure: Data exploration helps in understanding the overall structure of the dataset, including the number of variables, data types, and the presence of missing values or outliers. Identifying Patterns and Relationships: Data exploration involves uncovering patterns and relationships within the data. This may include identifying correlations between variables, grouping data based on common characteristics, and visualizing data to reveal trends. 209
  • 210. Informing Further Analysis: The insights gained from data exploration inform further analysis, such as model building, hypothesis testing, and decision-making. Common Data Exploration Techniques: Descriptive Statistics: Summarizing data using measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation, range) provides an overview of the data distribution. Data Visualization: Creating visualizations, such as histograms, scatter plots, and box plots, helps visualize data distribution, identify patterns, and detect outliers. 210
  • 211. Data Profiling: Data profiling involves examining data quality metrics, such as data completeness, accuracy, consistency, and timeliness, to assess the overall quality of the dataset. Data Transformation: Data transformation techniques, such as normalization, scaling, and imputation, may be applied to prepare the data for further analysis. Feature Engineering: Feature engineering involves creating new features from existing data to improve the performance of machine learning models. 211
  • 212. Benefits of Data Exploration: Improves Data Understanding: Data exploration deepens the understanding of the data, its characteristics, and limitations, enabling better decision-making. Uncovers Hidden Patterns: Data exploration reveals patterns and trends that may not be evident by simply looking at raw data, informing further analysis and hypothesis generation. Identifies Data Quality Issues: Data exploration helps identify data quality issues, such as missing values, outliers, and inconsistencies, allowing for data cleaning and improvement. 212
  • 213. Facilitates Feature Selection: Data exploration guides feature selection, identifying relevant features for analysis and model building. Informs Model Development: Insights from data exploration inform the choice of appropriate machine learning algorithms and model selection. Data exploration is an essential step in the data analysis process, providing a foundation for making informed decisions, uncovering hidden insights, and building effective models. By thoroughly exploring the data, data scientists and analysts can extract valuable knowledge and drive meaningful outcomes. 213
  • 214. Histograms for different sets of data, each of which exhibits well-known, common characteristics: (a) Uniform (b) Normal (unimodal) (c) Unimodal (skewed right) (d) Unimodal (skewed left) (e) Exponential (f) Multimodal 214
  • 215. Three normal distributions with different means but identical standard deviations. 215
  • 216. Three normal distributions with identical means but different standard deviations. 216
  • 217. Identifying Data Quality Issues A data quality issue is loosely defined as anything unusual about the data in an ABT. The most common data quality issues are:  missing values  irregular cardinality  Outliers The data quality issues we identify from a data quality report will be of two types:  Data quality issues due to invalid data  Data quality issues due to valid data. 217
  • 218. INFORMATION BASED LEARNING Information-based learning (IBL) is a framework for learning that emphasizes the importance of information and its role in shaping knowledge and understanding. It is a constructivist approach that suggests that learners actively construct their own knowledge by making connections between new information and their existing knowledge base. 218
  • 219. Cards showing character faces and names for the Guess-Who game: (a) Brian (b) John (c) Aphra (d) Aoife. The descriptive features for each card are Man, Long Hair, Glasses, and Name. 219
  • 220. 220
  • 221. Fundamentals A decision tree consists of:  root node (or starting node),  interior nodes  leaf nodes (or terminating nodes).  each of the non-leaf nodes (root and interior) in the tree specifies a test to be carried out on one of the query’s descriptive features.  each of the leaf nodes specifies a predicted classification for the query. 221
  • 222. An email spam prediction dataset 222
  • 223. 223
  • 224. Shannon's Entropy Model  Claude Shannon's entropy model defines a computational measure of the impurity of the elements of a set.  An easy way to understand the entropy of a set is to think in terms of the uncertainty associated with guessing the result if you were to make a random selection from the set.  Entropy is related to the probability of an outcome.  High probability → Low entropy  Low probability → High entropy  If we take the log of a probability and multiply it by -1, we get this mapping! 224
  • 225. Information Gain The measure of informativeness that we will use is known as information gain and is a measure of the reduction in the overall entropy of a set of instances that is achieved by testing on a descriptive feature. 1. Compute the entropy of the original dataset with respect to the target feature. 2. For each descriptive feature, create the sets that result by partitioning the instances in the dataset using their feature values, and then sum the entropy scores of each of these sets. 3. Subtract the remaining entropy value (computed in step 2) from the original entropy value (computed in step 1) to give the information gain. 225
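A compact sketch of Shannon entropy and the three information-gain steps above, using a made-up categorical feature and a binary target.
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the target levels present in the set
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Step 1: entropy of the whole set; Step 2: weighted entropy after partitioning
    total = len(labels)
    remaining = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remaining += (len(subset) / total) * entropy(subset)
    # Step 3: information gain = original entropy - remaining entropy
    return entropy(labels) - remaining

# Made-up data: does a "suspicious words" flag predict spam?
suspicious = ['yes', 'yes', 'no', 'no', 'yes', 'no']
spam       = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']
print(information_gain(suspicious, spam))  # 1.0 — the feature perfectly separates the classes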
  • 226. SIMILARITY BASED LEARNING  Feature space: each descriptive feature has its own dimensional axis. It is an abstract m-dimensional space that is created by making each descriptive feature in a dataset an axis of an m-dimensional coordinate system and mapping each instance in the dataset to a point in this coordinate space based on the values of its descriptive features.  Working of feature space: if the values of the descriptive features of two or more instances in a dataset are the same, then these instances will be mapped to the same point in the feature space, and vice versa.  The distance between two points in the feature space is a useful measure of the similarity of the descriptive features of the two instances. 226
  • 227. Metric(a,b): is a real-valued function that returns the distance between two points a and b in the feature space. It has the following properties: Non-negativity Identity Symmetry Triangular inequality Two examples of distance metrics: Euclidean distance Manhattan distance (taxi-cab distance) = sum of absolute differences. Minkowski distance: a family of distance metrics based on differences between features. 227
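The family of metrics can be written directly; with p = 2 Minkowski reduces to Euclidean distance and with p = 1 to Manhattan distance (the points are toy values).
import numpy as np

def minkowski(a, b, p=2):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

a, b = [1.0, 2.0], [4.0, 6.0]
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4| = 7.0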
  • 228. Nearest Neighbor algorithm: When this model is used to make a prediction for a new instance, the distance in the feature space between the query instance and each instance in the dataset is computed, and the prediction returned by the model is the target feature level of the dataset instance that is nearest to the query in the feature space. It stores the entire training dataset in memory, which has a negative effect on the time complexity of the algorithm. Algorithm: Nearest neighbor algorithm. Require: a set of training instances Require: a query instance 228
  • 229. Iterate across the instances in memory to find the nearest neighbor—this is the instance with the shortest distance across the feature space to the query instance. Make a prediction for the query instance that is equal to the value of the target feature of the nearest neighbor. Decision boundary: It is the boundary between regions of the feature space in which different target levels will be predicted. It is generated by aggregating the neighboring local models (Voronoi regions) that make the same prediction 229
  • 230. Noise effects: Using the Kronecker delta approach: the nearest neighbor algorithm is sensitive to noise because any errors in the description or labeling of training data result in erroneous local models and incorrect predictions. One way to mitigate against noise is to modify the algorithm to return the majority target level within the set of k nearest neighbors to the query q. Using the weighted k nearest neighbor approach: each neighbor's contribution to the prediction is weighted, typically in inverse proportion to its distance from the query. Efficient Memory Search: Assuming that the training dataset will remain relatively stable, the time issue can be offset by investing in a one-off computation to create an index of the instances that enables efficient retrieval of the nearest neighbors without doing an exhaustive search of the entire training dataset. 230
  • 231. k-d tree: It stands for k-dimensional tree, which is a balanced binary tree in which each of the nodes in the tree indexes one of the instances in the training dataset. The tree is constructed so that nodes that are nearby in the tree index training instances that are nearby in the feature space. 231
  • 232. PROBABILITY BASED LEARNING It is a machine learning technique that uses probability theory to make predictions or decisions. It's a statistical approach that models uncertainty in data using probability distributions. A probability function, P(), returns the probability of a feature taking a specific value. A joint probability refers to the probability of an assignment of specific values to multiple different features A conditional probability refers to the probability of one feature taking a specific value given that we already know the value of a different feature 232
  • 233.  A probability distribution is a data structure that describes the probability of each  possible value a feature can take. The sum of a probability distribution must equal 1.0  A joint probability distribution is a probability distribution over more than one feature assignment and is written as a multi-dimensional matrix in which each cell lists the probability of a particular combination of feature values being assigned  The sum of all the cells in a joint probability distribution must be 1.0. 233
  • 234. Bayes' Theorem: The conditional probability of an event, X, given some evidence, Y, is defined in terms of the product of the inverse conditional probability, P(Y|X), and the prior probability of the event, P(X). 234
  • 235. 235
  • 236. Bayesian Prediction We generate the probability of the event that a target feature, t, takes a specific level, l, given the assignment of values to a set of descriptive features, q, from a query instance. We can restate Bayes’ Theorem using this terminology and generalize the definition of Bayes’ Theorem so that it can take into account more than one piece of evidence(each descriptive feature value is a separate piece of evidence). 236
  • 237. 237
  • 238. 1. P(t = l), the prior probability of the target feature t taking the level l 2. P(q[1], …, q[m]), the joint probability of the descriptive features of a query instance taking a specific set of values 3. P(q[1], …, q[m] | t = l), the conditional probability of the descriptive features of a query instance taking a specific set of values given that the target feature takes the level l Conditional Independence and Factorization If knowledge of one event has no effect on the probability of another event, and vice versa, then the two events are independent of each other. If two events X and Y are independent then: P(X|Y) = P(X) and P(X, Y) = P(X) × P(Y) 238
  • 239. ERROR BASED LEARNING We perform a search for a set of parameters for a parameterized model that minimizes the total error across the predictions made by that model with respect to a set of training instances. The key ideas are a parameterized model, measuring error, and an error surface. Simple Linear Regression It is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, or a sloped straight line, hence it is called Simple Linear Regression. 239
  • 240. The key point in Simple Linear Regression is that the dependent variable must be a continuous value. However, the independent variable can be measured on continuous or categorical values. Measuring Error There are many different kinds of error functions, but for measuring the fit of simple linear regression models, the most commonly used is the sum of squared errors error function, or L2. To calculate L2 we use our candidate model to make a prediction for each member of the training dataset and then calculate the error (or residual) between these predictions and the actual target feature values in the training set. 240
  • 241. (a) A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset; (b) the scatter plot from (a) with a linear model relating RENTAL PRICE to SIZE overlaid 241
  • 242. Error Surface: Here, each pair of weights w[0] and w[1] defines a point on the x-y plane, and the sum of squared errors for the model using these weights determines the height of the error surface above the x-y plane for that pair of weights. The x-y plane is known as a weight space, and the surface is known as an error surface. 242
  • 243. EVALUATION When evaluating machine learning (ML) models, the question that arises is whether the model is the best model available from the model’s hypothesis space in terms of generalization error on the unseen / future data set. Hold-out method for Model Evaluation The hold-out method for model evaluation represents the mechanism of splitting the dataset into training and test datasets. The model is trained on the training set and then tested on the testing set to get the most optimal model. 243
  • 244. The hold-out method for model evaluation Generally, a 70-30% split is used for splitting the dataset, where 70% of the dataset is used for training and 30% of the dataset is used for testing the model. 244
  • 245. The following is the process of using the hold-out method for model evaluation: Split the dataset into two parts (preferably based on a 70-30% split; however, the percentage split will vary) Train the model on the training dataset; while training the model, some fixed set of hyperparameters is selected. Test or evaluate the model on the held-out test dataset Train the final model on the entire dataset to get a model which can generalize better on the unseen or future dataset. 245
  • 246. THE ART OF MACHINE LEARNING TO PREDICTIVE DATAANALYTICS Predictive data analytics projects use machine learning to build models that capture the relationships in large datasets between descriptive features and a target feature. A specific type of learning, called inductive learning, is used, where learning entails inducing a general rule from a set of specific instances. Predictive analytics project can use CRISP-DM process to manage a project through its lifecycle. 246
  • 247. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases: 1.Business understanding – What does the business need? 2.Data understanding – What data do we have / need? Is it clean? 3.Data preparation – How do we organize the data for modeling? 4.Modeling – What modeling techniques should we apply? 5.Evaluation – Which model best meets the business objectives? 6.Deployment – How do stakeholders access the results? 247
  • 248. UNIT – V APPLICATIONS OF MACHINE LEARNING Image Recognition Speech Recognition Email spam and Malware Filtering Online fraud detection Medical Diagnosis. 248
  • 249. IMAGE RECOGNITION Image recognition is the capability of a system to understand and interpret visual information from images or videos. Image recognition in machine learning often involves the use of deep learning techniques, particularly convolutional neural networks (CNNs), which have been highly successful in this domain. Digital images contain pixels, which are the smallest units of a screen that help in image formation. There are different types of image formats like JPG, JPEG, GIF, PNG, etc. Images have their own role in a digital world that includes various fields like communication, science, art, and entertainment. 249
  • 251. Artificial intelligence and machine learning (AI/ML) are used in computer vision applications to handle this data properly for monitoring, detection, categorization, object identification, and facial recognition. Learning Based Image Recognition Data Collection: A large dataset of labeled images is collected. Each image is associated with a label that describes what's in the image (e.g., "cat," "car," "beach"). Preprocessing: Images are typically resized and normalized to a standard size, and any color information is converted into a numerical format. 251
  • 252. Convolutional Neural Network (CNN): A CNN is used as the primary architecture for image recognition. CNNs consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. These layers extract features from the input images. Training: The CNN is trained on the labeled dataset. During training, the network adjusts its internal parameters (weights and biases) to learn to recognize patterns and features in the images that are associated with the correct labels. Testing and Inference: After training, the CNN can be used to make predictions on new, unlabeled images. It produces a probability distribution over possible labels for each image. 252
  • 253. Post-processing: The predicted probabilities can be converted into a final classification or label using techniques like softmax. Top-k classification is often used to provide multiple likely labels. The field of image recognition in machine learning has seen significant advancements, and it's applied in a wide range of applications, including object detection, facial recognition, medical image analysis, and autonomous vehicles. A simple example in Python:
import tensorflow as tf
from tensorflow import keras
253
  • 254.
# Load and preprocess the dataset (e.g., CIFAR-10)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to between 0 and 1

# Build the CNN model
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10)  # output layer (10 CIFAR-10 classes); added to complete the truncated slide
])
254
  • 255. # Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

This code loads the CIFAR-10 dataset, creates a simple CNN model, compiles it, and then trains and evaluates the model. You can replace the dataset with your own image data and adjust the model architecture for your specific image recognition task. 255
  • 256. Application of Image Recognition
Identifying Fraudulent Accounts: Examining fake social media profiles is among the most significant applications of image recognition.
Facial Recognition and Security Systems: Image recognition is one of the most critical components in the security industry. Image recognition algorithms can also help marketers learn about a person's identity, gender, and mood.
Reverse Image Search: Reverse image search is a technique that allows you to search by image instead of by text, often for free. 256
  • 257. Help Police Officials to Solve Cases: You might be surprised to learn that government agencies use image recognition. Today, police and other security agencies frequently use image recognition technology to identify persons in recordings or photographs.
Empowers E-commerce Businesses: Image recognition is now commonly used in the e-commerce business, and the visual search industry has grown significantly in recent years. 257
  • 258. SPEECH RECOGNITION Speech recognition is a machine learning technology that uses AI to interpret spoken words and convert them into text. It works by breaking speech down into sounds, analyzing each sound, and then using an algorithm to find the most likely word for each sound. In Python, speech recognition can be done with programs that take input from the microphone, process it, and convert it into text. 258
  • 259. Working of Speech Recognition (diagram on the original slide) 259
  • 260. How Does Speech Recognition work? Acoustic modeling is used to recognize the phonemes in our speech, which are then assembled into larger units of speech such as words and sentences. Speech recognition starts by taking the sound energy produced by the person speaking and converting it into electrical energy with the help of a microphone. It then converts this signal from analog to digital, and finally into text. The audio data is broken down into sounds, and algorithms analyze those sounds to find the most probable word that fits the audio. All of this is done using Natural Language Processing and Neural Networks. Hidden Markov Models can be used to find temporal patterns in speech and improve recognition accuracy. 260
  • 261. Picking and Installing a Speech Recognition Package 261

Package: apiai
Functionality: Includes natural language processing for identifying a speaker's intent
Installation: $ pip install apiai

Package: google-cloud-speech
Functionality: Offers basic speech-to-text conversion
Installation: $ pip install virtualenv
              virtualenv <your-env>
              <your-env>\Scripts\activate
              <your-env>\Scripts\pip.exe install google-cloud-speech

Package: SpeechRecognition
Functionality: Offers easy audio processing and microphone accessibility
Installation: $ pip install SpeechRecognition

Package: watson-developer-cloud
Functionality: An artificial intelligence API that makes creating, debugging, running, and deploying APIs easy
Installation: $ pip install --upgrade watson-developer-cloud
  • 262. The SpeechRecognition library allows:
• Easy speech recognition from the microphone.
• Easy transcription of an audio file.
• Saving audio data into an audio file.
• Displaying recognition results in an easy-to-understand format.
Speech Recognition in Python: Converting Speech to Text
The goal is to create a program that takes audio as input and converts it to text. 262
  • 264. Converting speech to text: now use the microphone to get audio input from the user in real time, recognize it, and print it as text, as in the sketch below. 264
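A minimal sketch of this step using the third-party SpeechRecognition package (pip install SpeechRecognition, plus PyAudio for microphone access). The code below is an illustration, not the original slide code; sending the audio to Google's free web API via recognize_google is one common choice among several recognizers the package supports.

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Compensate for background noise before listening
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    # Send the captured audio to Google's free web speech API
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("API request failed:", e)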
  • 265. EMAIL SPAM AND MALWARE FILTERING
What do we know about spam? Annoying emails? Yahoo Mail!
Spam-based advertising is a huge business.
Email spam is a collection of unwanted, unsolicited messages sent in bulk to a large number of recipients. It can be sent for commercial purposes or for other reasons, and it can be sent by humans or by botnets, which are networks of infected computers.
  • 266. Spam is a complex ecosystem.
Technical side: name servers, email servers, web pages, etc.
Business side: payment processing, merchant bank accounts, customer service, and fulfillment.
Previous work studied each of these elements in isolation: dynamics of botnets, DNS fast-flux networks, web site hosting, spam filtering, URL blacklisting, and site takedown.
This work quantifies the full set of resources employed to monetize spam email, including naming, hosting, payment, and fulfillment.
  • 267. Main Parts
• Advertising
• Click support
• Redirect sites
• Third-party DNS, spammers' DNS
• Webservers
• Affiliate programs
• Realization
• Payment services
• Fulfillment 267
  • 270. Discussion: With this big picture, what do you think are the most effective mechanisms to defeat spam? 270
  • 271. Malware, short for malicious software, is software designed to damage or disrupt a computer, server, or network. Cybercriminals, also known as hackers, create malware to steal data, gain access to systems, or sabotage devices. 271
  • 272. A bot is a software application that can perform repetitive tasks automatically, without the need for human intervention. Bots can be designed to imitate human behavior, but they are often faster and more accurate than humans. 272
  • 273. 273
• A spam detector is used to detect unwanted, malicious, and virus-infected texts and helps to separate them from non-spam texts.
• It uses binary classification with the labels 'ham' (non-spam) and 'spam'. An application of this can be seen in Google Mail (Gmail), which segregates spam emails to prevent them from reaching the user's inbox.
• In this machine learning spam filtering application, we will develop a spam detector app using the Support Vector Machine (SVM) technique for classification and Natural Language Processing.
  • 274. Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression problems. The main objective of SVM is to find a hyperplane in an N-dimensional space (where N is the total number of features) that separates the data points.
Hyperplane and Support Vectors
• A hyperplane is a boundary that helps to separate and group the data points into particular classes. 274
  • 275. SVM works very well on linearly separable data. 275
  • 276. Natural Language Processing (NLP) is an Artificial Intelligence (AI) field that enables computer programs to recognize, interpret, and manipulate human languages.
Application Structure:
• spam.csv: the dataset for our project. It contains a label ("ham" or "spam") and the email text.
• spamdetector.py: loads the dataset and trains our classifier.
• training_data.pkl: contains the trained classifier in binary format, which is used to predict the output.
• SpamGui.py: the GUI file for our project, where we load the trained classifier and predict the output for a given message. 276
  • 277. Steps for developing a Spam Detector:
1. Import libraries and initialize variables
2. Preprocessing the data
3. Bag of Words (vectorization)
4. Training the model
5. Prediction using a Graphical User Interface
A minimal sketch of steps 1–4 is shown below. 277
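The sketch below assumes spam.csv has two columns, called here "Label" ("ham"/"spam") and "EmailText" (the actual column names may differ). It uses scikit-learn's CountVectorizer for the bag-of-words step and an SVM classifier, and saves the fitted objects to training_data.pkl for the GUI to load; it is an illustration of the approach, not the original project code.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv("spam.csv")

# Bag of Words: turn each email into a vector of word counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(data["EmailText"])
y = data["Label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
model = SVC(kernel="linear")
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Save the fitted vectorizer and classifier for SpamGui.py to load later
with open("training_data.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)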
  • 278. 278
  • 279. 279
  • 280. ONLINE FRAUD DETECTION Credit card fraud has become easier to commit as e-commerce and many other online services have expanded online payment options, which in turn increases the risk of online fraud. 280
  • 281. 281
step: the unit of time of the transaction
type: the type of transaction performed
amount: the total amount of the transaction
nameOrg: the account that starts the transaction
oldbalanceOrg: the balance of the sender's account before the transaction
newbalanceOrg: the balance of the sender's account after the transaction
nameDest: the account that receives the transaction
oldbalanceDest: the balance of the receiver's account before the transaction
newbalanceDest: the balance of the receiver's account after the transaction
isFraud: the value to be predicted, i.e. 0 or 1
  • 282. The libraries used are:
• Pandas: helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
• Seaborn/Matplotlib: for data visualization.
• NumPy: NumPy arrays are very fast and can perform large computations in a very short time. 282
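A minimal loading-and-exploration sketch using these libraries; the file name "onlinefraud.csv" is a placeholder, and the column names follow the feature table above.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("onlinefraud.csv")
print(df.shape)
print(df["type"].value_counts())

# Fraud cases are typically a tiny fraction of all transactions
print(df["isFraud"].value_counts(normalize=True))

# Visualize how transaction amounts differ between the two classes
sns.boxplot(x="isFraud", y="amount", data=df)
plt.show()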
  • 283. The Importance of Fraud Detection
• Financial Losses
• Customer Trust
• Legal and Regulatory Compliance
• Operational Efficiency
The Role of Machine Learning Algorithms: Traditional Fraud Detection Algorithms
• The traditional machine learning algorithms most commonly used for fraud detection are logistic regression, decision trees, and random forests.
• These models have had a significant impact on fraud detection processes, providing organizations with practical predictive capabilities. 283
  • 284. Logistic regression, for example, is a statistical model that has been widely used for its simplicity and efficiency. Decision trees make predictions based on a set of decision rules, and random forests combine multiple decision trees to produce a final prediction. Complex fraud patterns, however, involve non-linear relationships, high-dimensional data, and evolving tactics. As a result, traditional approaches face:
• Feature Engineering Challenges
• Imbalanced Data
• Dynamic Fraud Behavior 284
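A short sketch comparing the three traditional baselines mentioned above on a synthetic, heavily imbalanced dataset (a stand-in for real transaction features).

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# ~2% positive (fraud) class to mimic the imbalance seen in practice
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=6),
    "Random Forest": RandomForestClassifier(n_estimators=200),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    print(name, "ROC AUC:", round(roc_auc_score(y_test, scores), 3))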
  • 285. New Approaches: Machine Learning Algorithms
1. Gradient Boosting Machines (GBM)
How GBM Works: Gradient Boosting Machines (GBM) is an ensemble learning technique that combines multiple weak learners (typically shallow decision trees) to create a strong predictive model. Each new tree is fitted to the residual errors of the ensemble built so far, and its contribution is scaled by a learning rate; the final prediction is the sum of the contributions of all trees. 285
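A minimal GBM sketch using scikit-learn's GradientBoostingClassifier on synthetic imbalanced data; a real system would be trained on actual transaction features.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each new tree is fitted to the residuals of the ensemble built so far,
# scaled by the learning rate
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print(classification_report(y_test, gbm.predict(X_test)))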
  • 286. 2. Neural Networks
Application of Neural Networks (Deep Learning) in Fraud Detection
• One of the most advanced tools in the machine learning arsenal is the neural network, an algorithmic structure modeled after the human brain.
• Neural networks, in essence, are computing systems made up of interconnected artificial neurons, or nodes.
• In the landscape of fraud detection, neural networks, particularly deep learning neural networks, hold significant potential. 286
  • 287.
• Transaction Sequences: Recurrent neural networks (RNNs) can analyze sequences of transactions over time, capturing temporal dependencies.
• Image-Based Fraud Detection: Convolutional neural networks (CNNs) process images (e.g., scanned checks, ID cards) to detect anomalies.
• Ensemble Approaches: Stacking neural networks with other models (e.g., GBM) can enhance overall fraud detection accuracy. 287
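As a minimal illustration, the sketch below trains a small feed-forward network on synthetic transaction-style features; the sequence and image models mentioned above (RNNs, CNNs) follow the same pattern with different layer types.

from sklearn.datasets import make_classification
from tensorflow import keras

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of fraud
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])

# class_weight makes the rare fraud class count more during training
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.2,
          class_weight={0: 1.0, 1: 50.0})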
  • 288. 3. XGBoost: Extreme Gradient Boosting XGBoost (Extreme Gradient Boosting) is another advanced machine learning algorithm that is creating ripples in the field of fraud detection. This algorithm stands out due to its flexibility, high accuracy, and effectiveness in dealing with large and complex datasets. 288
  • 289.
• Mobile Payment Fraud Detection: Researchers proposed an XGBoost-based framework for mobile payment fraud detection. By integrating unsupervised outlier detection algorithms and an XGBoost classifier, they achieved excellent results on a large dataset of over 6 million mobile transactions.
• Credit Card Fraud Detection: Another study applied XGBoost to credit card transaction data, using outlier detection based on distance sum to identify fraudulent transactions effectively.
• State Grid Corporation of China (SGCC) Dataset: Researchers explored integrating Genetic Algorithms with XGBoost for fraud detection in the SGCC dataset. This integration showed promising results and opens avenues for further research. 289
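A minimal sketch with the xgboost package (pip install xgboost) on synthetic imbalanced data; scale_pos_weight is one common way to compensate for the rarity of fraud cases.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Weight the positive (fraud) class by the ratio of negatives to positives
clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())
clf.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))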
  • 290. 4. Isolation Forest: Unleashing Anomaly Detection
What is Isolation Forest? The Isolation Forest (iForest) is a unique and compelling machine learning algorithm designed for anomaly detection. It isolates observations by randomly partitioning the feature space; anomalies are easier to isolate and therefore need fewer splits. This makes it effective at identifying fraudulent transactions and at handling high-dimensional data. 290
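A minimal Isolation Forest sketch with scikit-learn on synthetic data; the model is unsupervised, so it never sees the fraud labels and simply flags the observations that are easiest to isolate.

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

# contamination is the expected fraction of anomalies in the data
iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
preds = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
print("Flagged as anomalous:", (preds == -1).sum())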
  • 291. 5. Autoencoders: Unleashing Latent Representations
Autoencoders are essentially data compression algorithms where the compression and decompression functions are data-specific, lossy, and learned automatically from examples. They are a type of self-organizing system that learns to represent (encode) the input data in a way that highlights its key features and patterns.
Key Features
• Smart Adaptive Machine Learning
• Smart Surveillance and Automation
• Flexible Customizable Rules
• Compliance Assurance 291
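A minimal Keras autoencoder sketch on synthetic "normal" transactions; in practice the model is trained on legitimate transactions only, and new transactions with unusually high reconstruction error are flagged for review.

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
normal = rng.normal(size=(10000, 20)).astype("float32")   # stand-in for normal transactions

autoencoder = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),   # compressed (latent) representation
    keras.layers.Dense(20, activation=None),    # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal, normal, epochs=10, batch_size=128, verbose=0)

# Score new transactions by their reconstruction error
new_tx = rng.normal(size=(5, 20)).astype("float32")
errors = np.mean((autoencoder.predict(new_tx) - new_tx) ** 2, axis=1)
print("Reconstruction errors:", errors)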
  • 292. The Power of Advanced Algorithms
• Accuracy: Machine learning algorithms such as XGBoost, neural networks, and autoencoders offer high accuracy in identifying fraudulent activities.
• Adaptability: These algorithms adapt to changing fraud patterns, ensuring continuous protection. 292