MACHINE LEARNING
TECHNIQUES
Prepared by,
Dr.J.Preetha,
Prof./IT
UNIT – I
MACHINE LEARNING BASICS
Introduction to Machine Learning (ML)
Essential concepts of ML
Types of learning
Machine learning methods based on Time
Dimensionality
Linearity and Non linearity
Early trends in Machine learning
Data Understanding Representation and visualization
2
INTRODUCTION TO MACHINE
LEARNING
What is Machine Learning?
 Machine Learning: the study of algorithms that improve
their performance at some task with experience.
 Optimize a performance criterion using example
data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer Science: Efficient algorithms to
 Solve the optimization problem
 Represent and evaluate the model for
inference
MACHINE LEARNING
Machine learning is a branch of
artificial intelligence that enables computers
to “self-learn” from training data and
improve over time without being explicitly
programmed.
ML algorithms are able to detect patterns
in data and learn from them in order to
make their own predictions.
5
What is Machine Learning?
“Learning is any process by which a system
improves performance from experience.”
- Herbert Simon
Definition by Tom Mitchell (1998):
Machine Learning is the study of algorithms that
• improve their performance P
• at some task T
• with experience E.
A well-defined learning task is given by <P, T, E>.
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Slide credit: Pedro Domingos
When Do We Use Machine Learning?
ML is used when:
•Human expertise does not exist (navigating on Mars)
•Humans can’t explain their expertise (speech recognition)
•Models must be customized (personalized medicine, routing on
a computer network)
•Models are based on huge amounts of data (user biometrics)
Learning isn’t always useful:
•There is no need to “learn” to calculate payroll
Based on slide by E. Alpaydin
 Retail: Market basket analysis, Customer
relationship management (CRM)
 Finance: Credit scoring, fraud detection
 Manufacturing: Optimization, troubleshooting
 Medicine: Medical diagnosis
 Telecommunications: Quality of service
optimization
 Bioinformatics: Motifs, alignment
 Web mining: Search engines
Applications:
Growth of Machine Learning
 Machine learning is preferred approach to
 Speech recognition, Natural language processing
 Computer vision
 Medical outcomes analysis
 Robot control
 Computational biology
 This trend is accelerating
 Improved machine learning algorithms
 Improved data capture, networking, faster computers
 Software too complex to write by hand
 New sensors / IO devices
 Demand for self-customization to user, environment
 It turns out to be difficult to extract knowledge from human experts:
the failure of expert systems in the 1980s.
10
Samuel’s Checkers-Player
“Machine Learning: Field of study that gives
computers the ability to learn without being
explicitly programmed.” -Arthur Samuel (1959)
Defining the Learning Task
Improve on task T, with respect to
performance metric P, based on experience E
T: Playing checkers
P: Percentage of games won against an arbitrary opponent
E: Playing practice games against itself
T: Recognizing hand-written words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words
T: Driving on four-lane highways using vision sensors
P: Average distance traveled before a human-judged error
E: A sequence of images and steering commands recorded
while observing a human driver.
T: Categorize email messages as spam or legitimate.
P: Percentage of email messages correctly classified.
E: Database of emails, some with human-given labels
Well Defined Learning Problem
Learning = Improving with experience at some task
Improve over task T ,
with respect to performance measure P ,
based on experience E.
E.g., Learn to play checkers
T : Play checkers
P : % of games won in world tournament
E: opportunity to play against self
13
Learning to Play Checkers
T : Play checkers
P : Percent of games won in world tournament
What experience?
What exactly should be learned?
How shall it be represented?
What specific algorithm to learn it?
14
Type of Training Experience
Direct or indirect?
Teacher or not?
A problem: is training experience representative of performance
goal?
15
Representation for Target
Choose
Function
collection of rules?
neural network ?
polynomial function of board features?
...
16
A Representation for the Learned Function
V̂(b) = w0 + w1·bp(b) + w2·rp(b) + w3·bk(b) + w4·rk(b) + w5·bt(b) + w6·rt(b)
bp(b): number of black pieces on board b
rp(b): number of red pieces on b
bk(b): number of black kings on b
rk(b): number of red kings on b
bt(b): number of red pieces threatened by black (i.e., which can
be taken on black's next turn)
rt(b): number of black pieces threatened by red
Obtaining Training Examples
V(b): the true target function
V̂(b): the learned function
V_train(b): the training value
One rule for estimating training values:
V_train(b) ← V̂(Successor(b))
Choose Weight Tuning Rule
LMS weight update rule. Do repeatedly:
Select a training example b at random
1. Compute error(b):
error(b) = V_train(b) − V̂(b)
2. For each board feature f_i, update weight w_i:
w_i ← w_i + c · f_i · error(b)
where c is some small constant, say 0.1, to moderate the
rate of learning
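A minimal Python sketch (not part of the original slides) of this LMS update for the checkers evaluation function; the board feature values and the training value below are hypothetical numbers used only for illustration.

# LMS weight update for V̂(b) = w0 + w1*bp(b) + ... + w6*rt(b)
def v_hat(weights, features):
    # features[0] is a constant 1 so that weights[0] acts as the intercept w0
    return sum(w * f for w, f in zip(weights, features))

def lms_update(weights, features, v_train, c=0.1):
    # w_i <- w_i + c * f_i * error(b), where error(b) = V_train(b) - V_hat(b)
    error = v_train - v_hat(weights, features)
    return [w + c * f * error for w, f in zip(weights, features)]

weights = [0.0] * 7                        # w0 .. w6
board_features = [1, 12, 12, 0, 0, 1, 2]   # [1, bp, rp, bk, rk, bt, rt] (hypothetical)
weights = lms_update(weights, board_features, v_train=100.0)
print(weights)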
Completed Design: Design Choices
Determine type of training experience:
games against self, games against experts, or a table of correct moves
Determine target function:
Board → value, or Board → move
Determine representation of learned function:
linear function of six features, artificial neural network, or polynomial
Determine learning algorithm:
gradient descent or linear programming
ESSENTIAL CONCEPTS OF ML
TYPES OF LEARNING
 Supervised (inductive) learning
 Training data includes desired outputs
 Unsupervised learning
 Training data does not include desired outputs
 Semi-supervised learning
 Training data includes a few desired outputs
 Reinforcement learning
 Rewards from sequence of actions
Learning Associations
 Basket analysis:
P(Y | X): the probability that somebody who buys X also
buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
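A minimal sketch of estimating a conditional probability P(Y | X) from the sample transactions above; the item names simply mirror the table.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional_support(x, y, baskets):
    # P(Y | X) = (# baskets containing both X and Y) / (# baskets containing X)
    with_x = [b for b in baskets if x in b]
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_support("Beer", "Diaper", transactions))  # P(Diaper | Beer) = 1.0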
27
Supervised Learning: Uses
 Prediction of future cases: Use the rule to predict
the output for future inputs
 Knowledge extraction: The rule is easy to
understand
 Compression: The rule is simpler than the data it
explains
 Outlier detection: Exceptions that are not covered
by the rule, e.g., fraud
Example: decision trees are tools that create such rules
28
Inductive Learning
 Given examples of a function (X, F(X))
 Predict function F(X) for new examples X
 Discrete F(X): Classification
 Continuous F(X): Regression
 F(X) = Probability(X): Probability estimation
Prediction: Regression
 Example: Price of a
used car
 x: car attributes
y: price
y = g(x | θ)
g(·): the model, θ: the parameters
For a linear model: y = w x + w0
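A minimal sketch of fitting the linear model y = w·x + w0 for used-car price prediction, assuming scikit-learn is available; the data points are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one car attribute, e.g. age in years
y = np.array([9.0, 7.8, 6.9, 6.1, 5.2])             # price (arbitrary units)

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)             # the learned w and w0
print(model.predict([[6.0]]))                        # predicted price for a 6-year-old car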
31
Regression Applications
 Navigating a car: Angle of the steering wheel (CMU
NavLab)
 Kinematics of a robot arm: given a target position (x, y),
predict the joint angles α1 = g1(x, y) and α2 = g2(x, y)
Classification
 Example: Credit
scoring
 Differentiating
between low-risk
and high-risk
customers from
their income and
savings
Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
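A direct translation of the discriminant rule above into code; the threshold values θ1 and θ2 are arbitrary placeholders, not values from the slides.

THETA1, THETA2 = 30000, 10000   # hypothetical income and savings thresholds

def credit_risk(income, savings):
    # IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 15000))  # low-risk
print(credit_risk(25000, 20000))  # high-risk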
Classification: Applications
 Pattern recognition
 Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
 Character recognition: Different handwriting styles.
 Speech recognition: Temporal dependency.
 Use of a dictionary or the syntax of the language.
 Sensor fusion: Combine multiple modalities; e.g., visual (lip
image) and acoustic for speech
 Medical diagnosis: From symptoms to illnesses
 Web Advertising: Predict if a user clicks on an ad on
the Internet.
43
Face Recognition
Training examples of a person
Test images
AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html
Unsupervised Learning
 Learning “what normally happens”; there is no target output
 Clustering: Grouping similar instances
 Other applications: Summarization, Association
Analysis
 Example applications
 Customer segmentation in CRM
 Image compression: Color quantization
 Bioinformatics: Learning motifs
Reinforcement Learning
 Topics:
 Policies: what actions should an agent take in a particular
situation
 Utility estimation: how good is a state (used by policy)
 No supervised output but delayed reward
 Credit assignment problem (what was responsible for the
outcome)
 Applications:
 Game playing
 Robot in a maze
 Multiple agents, partial observability, ...
MACHINE LEARNING
METHODS BASED ON TIME
 A static model is trained offline.
 That is, we train the model exactly once and then
use that trained model for a while.
 A dynamic model is trained online.
 That is, data is continually entering the system and
we're incorporating that data into the model
through continuous updates.
Curse of Dimensionality
 The Curse of Dimensionality in Machine
Learning arises when working with high-
dimensional data, leading to increased
computational complexity, overfitting, and
spurious correlations.
 Techniques like dimensionality reduction, feature
selection, and careful model design are essential
for mitigating its effects and improving algorithm
performance.
 Navigating this challenge is crucial for unlocking
the potential of high-dimensional datasets and
ensuring robust machine-learning solutions.
LINEARITY AND NON-LINEARITY
 Linearity refers to the property of a system or
model where the output is directly proportional
to the input.
 Nonlinearity implies that the relationship
between input and output is more complex and
cannot be expressed as a simple linear function.
63
Linearity vs Nonlinearity
64
Linearity
 Linear models are often the simplest and most effective
approach.
 A linear model essentially fits a straight line to the data,
allowing it to make predictions based on a linear relationship
between the input features and the output variable.
 In a regression problem, a linear model is used to predict a
continuous target variable based on one or more input
features, such as the size and age of a tree.
65
Nonlinearity
 Nonlinear models can take many forms, from polynomial
models that fit curves to the data, to neural networks that
can learn complex patterns in high-dimensional data.
 These models are often more powerful than linear models
because they can capture more complex relationships
between variables.
 In a classification problem, nonlinear models are able to
identify complex decision boundaries that separate
different classes, especially when those boundaries are not
linear.
67
Occam’s Razor
 If you have two competing ideas to explain the same
phenomenon, you should prefer the simpler one.
68
Techniques for Applying Occam’s Razor in Machine
Learning:
69
No Free Lunch Theorem:
 NFL Theorem: if an algorithm performs better on a
certain class of problems, then it pays for it in the
form of degraded performance on other classes of
problems.
 In other words, you cannot have a single optimal
solution for all classes of problems.
70
Why Understanding Occam’s Razor is
Important for Data Scientists
1. Enhancing Interpretability:
Simpler models are often more interpretable, which
means it’s easier to understand how they’re making
predictions.
2. Avoiding Overfitting:
The goal of machine learning is to make accurate
predictions on new, unseen data. By keeping models
simpler, data scientists can reduce the risk of overfitting.
3. Improving Generalizability:
Simpler models are less likely to fit the noise in the training
data and more likely to capture the underlying trend or
relationship.
4. Reducing Computational Resources:
Simpler models require less computation, which matters
where resources might be limited or expensive.
71
EARLY TRENDS IN MACHINE LEARNING
DATA UNDERSTANDING,
REPRESENTATION, AND
VISUALIZATION
DATA UNDERSTANDING:
Data understanding basically involves analyzing and
exploring the data to identify any patterns or trends that
may be present.
Data exploration provides a high-level overview of
each attribute (also called variable) in the dataset and the
interaction between the attributes.
The step of understanding the data can be broken down
into three parts:
1. Understanding entities
2. Understanding attributes
3. Understanding data types
77
Understanding Entities
 In the fields of data science, machine learning, and
artificial intelligence, entities represent groups of
data separated based on conceptual themes and/or
data acquisition methods.
 An entity typically represents a table in a database,
or a flat file.
 EXAMPLE:
comma separated variable (csv) file, or tab
separated variable (tsv) file.
Sometimes it is more efficient to represent the
entities using a more structured format like
svmlight.
Each entity can contain multiple attributes.
78
Understanding Attributes
 Each attribute can be thought of as a column in the
file or table.
 In the case of the Iris data, the attributes of the single
given entity are sepal length in cm, sepal width in
cm, petal length in cm, petal width in cm, and the
class label.
 Structured formats like svmlight are more
useful in the case of sparse data, as they add
significant overhead when the data is fully
populated.
 Sparse data is data with high dimensionality, but
with many samples missing values for multiple
attributes.
79
Sepal-length Sepal-width Petal-length Petal-width Class label
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
4.8 3.4 1.9 0.2 Iris-setosa
5.0 3.0 1.6 0.2 Iris-setosa
5.0 3.4 1.6 0.4 Iris-setosa
5.2 3.5 1.5 0.2 Iris-setosa
5.2 3.4 1.4 0.2 Iris-setosa
7.0 3.2 4.7 1.4 Iris-versicolor
6.4 3.2 4.5 1.5 Iris-versicolor
6.9 3.1 4.9 1.5 Iris-versicolor
5.5 2.3 4.0 1.3 Iris-versicolor
6.5 2.8 4.6 1.5 Iris-versicolor
6.7 3.1 4.7 1.5 Iris-versicolor
6.3 2.3 4.4 1.3 Iris-versicolor
5.6 3.0 4.1 1.3 Iris-versicolor
5.5 2.5 4.0 1.3 Iris-versicolor
5.5 2.6 4.4 1.2 Iris-versicolor
6.3 3.3 6.0 2.5 Iris-virginica
80
Understanding Data Types
 Attributes in each entity can be of various different
types from the storage and processing perspective, e.g.,
string, integer valued, datetime, binary (“true”/“false”,
or “1”/“0”), etc.
 Each type needs to be handled separately for generating
a feature vector that will be consumed by the machine
learning algorithm.
 We can also come across sparse data, in which case
some attributes will have missing values.
 This missing data is typically replaced with special
characters, which should not be confused with any of
the real values.
 In order to process data with missing values, one can
either replace them with some default values, or use an
algorithm that can work with missing data.
81
Representation and Visualization
of the Data
Data visualization is the representation of data through the
use of common graphics, such as charts, plots, infographics and
even animations.
These visual displays of information communicate
complex data relationships and data-driven insights in a way that
is easy to understand.
When the data has more than two or three dimensions, there are a
couple of options:
 Draw multiple plots taking 2 or 3 dimensions at a
time.
 Reduce the dimensionality of the data and plot up to
3 dimensions.
The most common methods used to reduce the dimensionality are:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
This can include a variety of visual tools such as:
Charts: Bar charts, line charts, pie charts, etc.
Graphs: Scatter plots, histograms, etc.
Maps: Geographic maps, heat maps, etc.
Dashboards: Interactive platforms that combine
multiple visualizations.
83
84
Tools for Visualization of Data
 Tableau
 Looker
 Zoho Analytics
 Sisense
 IBM Cognos Analytics
 Qlik Sense
 Domo
 Microsoft Power BI
 Klipfolio
 SAP Analytics Cloud
Principal Component Analysis
 We can only visualize the data in a maximum of 2 or
3 dimensions. However, it is common practice to
have the dimensionality of the data in the tens or even
hundreds.
 If we can find the exact coordinates (Xr, Yr) of the
paper’s orientation as a linear combination of the X, Y,
and Z coordinates of the 3-dimensional space,
we can reduce the dimensionality of the data from
3 to 2.
87
2-dimensional data where PCA dimensionality
is same but along different axes
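A minimal PCA sketch, assuming scikit-learn: project synthetic 3-dimensional points that lie roughly on a plane down to 2 dimensions, as described above.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
xy = rng.normal(size=(100, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 0.01 * rng.normal(size=100)  # nearly planar
data_3d = np.column_stack([xy, z])

pca = PCA(n_components=2)
data_2d = pca.fit_transform(data_3d)
print(data_2d.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # almost all variance kept by the first 2 components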
LINEAR DISCRIMINANT
ANALYSIS
 Where PCA tries to find the dimensions that
maximize the variance of the data, linear discriminant
analysis (LDA) tries to maximize the separation
between the classes of data.
 Thus LDA can only work effectively when we are
dealing with a classification type of problem, and the
data intrinsically contains multiple classes.
89
3-dimensional data with 2 classes in another
perspective, with LDA representation.
LDA reduces the effective dimensionality of
data into 1 dimension where the two classes
can be best separated
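A minimal LDA sketch, assuming scikit-learn and the Iris data: reduce the 4-dimensional features to a single dimension that best separates two of the classes, in the spirit of the figure above.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
mask = y < 2                      # keep only two classes for this illustration
lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X[mask], y[mask])
print(X_1d.shape)                 # (100, 1): one discriminant dimension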
UNIT – II
MACHINE LEARNING METHODS
Linear methods
Regression
Classification
Perceptron and Neural networks
Decision trees
Support vector machines
Probabilistic models
Unsupervised learning
Featurization.
91
LINEAR METHODS
Machine Learning Algorithms are Divided Into Two
Types:
1.Supervised learning algorithms
2.Unsupervised learning algorithms
92
Supervised learning algorithms
93
 Where the model is trained on a labelled
dataset.
 A labelled dataset is one that has both
input and output parameters.
 In this type of learning, both the training and
validation datasets are labelled.
 The input features are the attributes or
characteristics of the data that are used to
make predictions
 while the output labels are the desired
outcomes or targets that the algorithm tries
to predict.
94
EXAMPLE:
Consider a dataset from a shopping store that is useful
in predicting whether a customer will purchase a
particular product under consideration or not, based
on his/her gender, age, and salary.
Input: Gender, Age, Salary
Output: Purchased, i.e. 0 or 1; 1 means that the
customer will purchase and 0 means that the
customer won’t purchase it.
95
Supervised learning can be further classified
into two categories:
96
Unsupervised learning algorithms
Unsupervised learning is a type of machine
learning technique in which an algorithm discovers
patterns and relationships using unlabelled data.
97
 There are two main categories of unsupervised
learning that are mentioned below:
1. Clustering
2. Association
 Clustering is the process of grouping data points
into clusters based on their similarity.
 This technique is useful for identifying patterns
and relationships in data without the need for
labelled examples.
98
 Here are some clustering algorithms:
i. K-Means Clustering algorithm
ii. Mean-shift algorithm
iii. DBSCAN Algorithm
iv. Principal Component Analysis
v. Independent Component Analysis
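A minimal clustering sketch with k-means, assuming scikit-learn; the points are synthetic and the number of clusters (k = 2) is chosen only for illustration.

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1],
                   [8, 8], [8.3, 7.9], [7.8, 8.2]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned cluster centres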
Association
Association rule learning is a technique for
discovering relationships between items in a dataset.
It identifies rules that indicate the presence of
one item implies the presence of another item with a
specific probability.
Here are some association rule learning
algorithms:
i.Apriori Algorithm
ii.Eclat
iii.FP-growth Algorithm
100
LINEAR REGRESSION
Regression is a supervised machine learning
technique which is used to predict continuous
values.
The ultimate goal of a regression algorithm is
to plot a best-fit line or curve through the data.
Linear regression is a type of supervised
machine learning algorithm that computes the linear
relationship between the dependent variable and one
or more independent features by fitting a linear
equation to observed data.
101
Types:
i. Linear Regression
ii. Polynomial Regression
iii. Ridge Regression
iv. Lasso Regression
v. Logistic Regression
vi. Support Vector Regression (SVR)
vii. Decision Tree Regression
viii. Random Forest Regression
ix. Gradient Boosting Regression
102
Simple Linear Regression
This is the simplest form of linear regression,
and it involves only one independent variable and one
dependent variable. The equation for simple linear
regression is:
Y = β0 + β1X
where:
Y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope
MULTIPLE LINEAR REGRESSION
Linear regression is a classic example of a strictly
linear model. It is also called polynomial fitting.
Let us consider a problem of linear regression
where the training data contains p samples.
The input is n-dimensional, (xi, i = 1, ..., p) with xi ∈ Rⁿ.
The output is single dimensional, (yi, i = 1, ..., p) with yi ∈ R.
The model assumes a linear relationship between the
independent and dependent variables:
Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ
where β0 is the intercept, β1, β2, ..., βn are the coefficients, and ϵ is the
error term.
Ridge Regression
In machine learning, ridge regression helps
reduce the overfitting that results from model
complexity. Model complexity can be due to a model
possessing too many features.
In Ridge Regression, a penalty term is added to
the loss function to constrain the coefficients. The
penalty is proportional to the square of the
magnitude of the coefficients.
Loss function:
Minimize ∑_{i=1}^{n} (yi − ŷi)² + λ ∑_{j=1}^{p} βj²
Here, λ is a regularization parameter that
controls the strength of the penalty, βj are the
coefficients, and p is the number of features.
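A minimal ridge regression sketch, assuming scikit-learn; its alpha argument plays the role of the regularization parameter λ in the loss function above, and the data is invented for illustration.

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([3.1, 2.9, 7.2, 6.8, 11.1])

ridge = Ridge(alpha=1.0).fit(X, y)      # larger alpha -> stronger shrinkage of coefficients
print(ridge.coef_, ridge.intercept_)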
CLASSIFICATION
The Classification algorithm is a Supervised
Learning technique that is used to identify the
category of new observations on the basis of training
data.
In Classification, a program learns from the
given dataset or observations and then classifies new
observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam,
cat or dog, etc.
Classes can be called targets, labels, or
categories.
In a classification algorithm, a discrete output
function (y) is mapped to the input variable (x):
y = f(x)
where y is the categorical output.
Example :
Email Spam Detector.
There are two classes, class A and Class B. These classes
have features that are similar to each other and dissimilar
to other classes.
107
Types:
1. Linear Models
1. Logistic Regression
2. Support Vector Machines
2. Non-linear Models
1. K-Nearest Neighbours
2. Kernel SVM
3. Naïve Bayes
4. Decision Tree Classification
5. Random Forest Classification
108
K-Nearest Neighbor(KNN)
Algorithm for Machine Learning
K-Nearest Neighbour is one of the simplest
Machine Learning algorithms based on Supervised
Learning technique.
The k-nearest neighbors (KNN) algorithm is a
non-parametric, supervised learning classifier, which
uses proximity to make classifications or predictions
about the grouping of an individual data point.
109
The weighted K-Nearest Neighbors
(KNN) algorithm is a supervised learning
technique used for classification and
regression tasks.
It makes predictions based on the
similarity between a new data point and its
closest neighbors in the training dataset.
110
Classification
For classification, the target variable is categorical.
Predicting the class of a new data point based on
the majority class among its k-nearest neighbors.
Regression
while for regression, the target variable is
continuous.
Predicting the value of a continuous target variable
based on the average value of the target variable
among its k-nearest neighbors.
111
How does K-NN work?
The K-NN working can be explained on the basis
of the below algorithm:
Step-1: Select the number K of the neighbors
Step-2: Calculate the Euclidean distance of K number
of neighbors
Step-3: Identify the K nearest neighbors as per the
calculated Euclidean distance.
Step-4: Among these k neighbors, count the number
of data points in each category.
Step-5: Assign the new data point to the category
for which the number of neighbors is maximum.
Step-6: Our model is ready.
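A minimal K-NN classification sketch, assuming scikit-learn; it follows the steps above (choose K, measure Euclidean distance, vote among the K nearest neighbors) on made-up training points.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.predict([[1.5, 1.5], [6.5, 6.5]]))   # -> [0 1]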
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to the noisy training data
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
Always needs to determine the value of K, which may
be complex at times.
The computation cost is high because of calculating
the distance between the data points for all the
training samples.
114
Real-World Applications
Weighted KNN is used in a wide range of applications, including:
1 Recommendation Systems
Recommending products or content based on the preferences of similar users.
2 Image Recognition
Classifying images based on their visual features.
3 Financial Forecasting
Predicting future stock prices or other financial indicators based on historical data.
4 Medical Diagnosis
Assisting in the diagnosis of diseases based on patient data and medical records.
PERCEPTRON AND NEURAL
NETWORKS
 A weight is assigned to each input node of a
perceptron, indicating the significance of that input
to the output.
 The perceptron’s output is a weighted sum of the
inputs that has been run through an activation
function to decide whether or not the perceptron
will fire. It computes the weighted sum of its inputs
as:
z = w1x1 + w2x2 + ... + wnxn = xᵀw
116
The activation function that perceptrons utilize most
frequently is a step function that compares this weighted
sum to a threshold: it outputs 1 if the input is larger
than the threshold value and 0 otherwise.
The most common step function used in
perceptrons is the Heaviside step function:
H(z) = 0 if z < 0, and H(z) = 1 if z ≥ 0
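A minimal sketch of a single perceptron's forward pass with the Heaviside step activation; the weights, bias, and inputs are illustrative numbers only.

import numpy as np

def heaviside(z):
    # H(z) = 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

def perceptron(x, w, b):
    z = np.dot(x, w) + b          # z = w1*x1 + w2*x2 + ... + wn*xn + b
    return heaviside(z)

w = np.array([0.5, -0.6])
b = 0.1
print(perceptron(np.array([1.0, 0.2]), w, b))   # fires (1)
print(perceptron(np.array([0.1, 1.0]), w, b))   # does not fire (0)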
The perceptron, a foundational building block in the realm of
artificial intelligence, serves as the fundamental unit of neural
networks.
This simple yet powerful computational model, conceived by
Frank Rosenblatt in the 1950s, simulates the behavior of a
single neuron in the human brain.
It takes multiple input signals, applies weights to each signal,
and generates a single output signal based on a threshold
function.
These interconnected perceptrons, forming layers, create
neural networks capable of solving complex tasks like image
recognition, natural language processing, and machine
translation.
The Perceptron: A Simple Model
Input Layer
The input layer receives information from the external environment.
These inputs can be numerical values, representing features or
characteristics of the data being processed. Each input is assigned a
weight, reflecting its importance or influence on the overall decision.
Activation Function
The activation function determines the output of the perceptron. It
introduces nonlinearity, allowing the perceptron to learn complex
relationships in the data. Common activation functions include the
sigmoid, ReLU, and tanh functions.
Output Layer
The output layer produces a single output value, representing the final
decision or prediction made by the perceptron. This output can be a
binary value (0 or 1) for classification tasks or a continuous value for
regression tasks.
Neural Networks: Layers of Perceptrons
Input Layer
The input layer receives raw data, which is then transformed into a
representation suitable for processing by the network. This representation
is often in the form of numerical values, representing features or
characteristics of the data.
Hidden Layers
Hidden layers process the input information through a series of non-linear
transformations. These layers extract complex features and relationships
from the data, enabling the network to learn intricate patterns and make
accurate predictions.
Output Layer
The output layer generates the network's final prediction or decision. The
structure of the output layer depends on the specific task being solved.
For classification tasks, it may produce a probability distribution over
different classes. For regression tasks, it may output a continuous value.
Learning and Optimization
Forward Propagation
Information flows from the input layer through the hidden layers
to the output layer. During forward propagation, the network
calculates the output based on the current weights and biases.
Backpropagation
The difference between the predicted output and the actual target
output is calculated. This error signal is then propagated backward
through the network, updating the weights and biases to minimize
the error.
Gradient Descent
Gradient descent is an optimization algorithm that iteratively
adjusts the weights and biases in the direction that minimizes the
error. It uses the gradients of the error function to guide the
search for the optimal parameters.
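A minimal gradient descent sketch (not taken from the slides), assuming a simple linear model y = w·x + b and a mean-squared-error loss; it iteratively steps against the gradient to find the optimal parameters.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])          # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    y_pred = w * x + b
    grad_w = 2 * np.mean((y_pred - y) * x)  # d(MSE)/dw
    grad_b = 2 * np.mean(y_pred - y)        # d(MSE)/db
    w -= lr * grad_w                        # step against the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))             # close to 2.0 and 1.0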
Types of Neural Networks
1. Convolutional Neural Networks (CNNs)
CNNs are specialized for image recognition tasks. They use convolutional filters to
extract features from images, making them highly effective in identifying objects,
scenes, and patterns.
2. Recurrent Neural Networks (RNNs)
RNNs are designed to process sequential data, such as text or time series. They use
feedback loops to maintain a memory of previous inputs, allowing them to
understand context and dependencies in sequential data.
3. Long Short-Term Memory (LSTM) Networks
LSTMs are a type of RNN that are particularly well-suited for learning long-term
dependencies. They have special memory cells that can store information over
extended periods, enabling them to capture complex relationships in sequential
data.
4. Generative Adversarial Networks (GANs)
GANs are composed of two competing networks: a generator and a discriminator.
The generator creates new data samples, while the discriminator tries to distinguish
between real and generated data. This adversarial process results in the generation
of highly realistic and diverse data.
Applications of Neural Networks
Image Recognition, Object Detection, Image Classification,
Natural Language Processing, Machine Translation,
Text Summarization, Speech Recognition, Time Series
Forecasting, Drug Discovery
PROBABILISTIC MODELS
Machine learning algorithms today rely heavily
on probabilistic models, which take into consideration
the uncertainty inherent in real-world data.
These models make predictions based on
probability distributions, rather than absolute values,
allowing for a more nuanced and accurate
understanding of complex systems.
One common approach is Bayesian inference,
where prior knowledge is combined with observed
data to make predictions.
Another approach is maximum likelihood
estimation, which seeks to find the model that best
fits observational data.
126
Probabilistic modeling is a statistical approach
that uses the effect of random occurrences or actions
to forecast the possibility of future results.
It is a quantitative modeling method that
projects several possible outcomes that might even
go beyond what has happened recently.
Categories Of Probabilistic Models
1.Generative models
2.Discriminative models.
127
Generative models:
Generative models aim to model the joint
distribution of the input and output variables.
These models generate new data based on the
probability distribution of the original dataset.
Generative models are powerful because they can
generate new data that resembles the training data.
They can be used for tasks such as image and
speech synthesis, language translation, and text
generation.
128
Discriminative models:
The discriminative model aims to model the
conditional distribution of the output variable given
the input variable.
They learn a decision boundary that separates
the different classes of the output variable.
Discriminative models are useful when the focus is on
making accurate predictions rather than generating
new data.
They can be used for tasks such as image
recognition, speech recognition, and sentiment
analysis.
129
Maximum Likelihood Estimation
The maximum likelihood estimation (MLE)
approach deals with the problem at face value
and parameterizes the information into variables.
The values of the variables that maximize the
probability of the observed variables lead to the
solution of the problem.
Let us define the problem using formal
notation. Let there be a function fθ(x) that produces
the observed output y.
x ∈ ℜⁿ represents the input, over which we don’t
have any control, and θ ∈ Θ represents a parameter
vector that can be single or multidimensional.
The MLE method defines a likelihood function
denoted as L(y|θ).
Typically the likelihood function is the joint
probability of the parameters and observed variables,
L(y|θ) = P(y; θ).
The objective is to find the optimal values for θ
that maximize the likelihood function, as given by
θ_MLE = arg max_{θ∈Θ} L(y|θ)
or,
θ_MLE = arg max_{θ∈Θ} P(y; θ)
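A minimal MLE sketch for a concrete case (not from the slides): coin tosses y modelled as Bernoulli(θ); the likelihood L(y|θ) is maximized over a grid of candidate θ values. For this model the closed-form answer is simply the sample mean.

import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])                 # observed outcomes

thetas = np.linspace(0.01, 0.99, 99)
log_likelihood = [np.sum(y * np.log(t) + (1 - y) * np.log(1 - t)) for t in thetas]
theta_mle = thetas[int(np.argmax(log_likelihood))]

print(theta_mle, y.mean())                             # both approximately 0.75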
Bayesian Approach
In the Bayesian approach, all the unknowns
are modelled as random variables with known prior
probability distributions.
Let us denote the conditional probability
of observing the output y for parameter vector θ as
P(y|θ).
The marginal probabilities of these variables
are denoted as P(y) and P(θ).
The joint probability of the variables can be
written in terms of conditional and marginal
probabilities as
P(y; θ) = P(y|θ) · P(θ) ………(1)
The same joint probability can also be given as
P(y; θ) = P(θ|y) · P(y) ………(2)
Here the probability P(θ|y) is called the posterior
probability.
Combining Eqs. 1 and 2,
P(θ|y) · P(y) = P(y|θ) · P(θ) ……..(3)
Rearranging the terms, we get
P(θ|y) = P(y|θ) · P(θ) / P(y)
This theorem (Bayes’ theorem) gives the relationship between the
posterior probability and the prior probability in a simple and elegant
manner.
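A minimal numerical sketch of Bayes' theorem, P(θ|y) = P(y|θ)·P(θ)/P(y), for a parameter θ taking two discrete values; all the probabilities below are invented examples.

priors = {"theta1": 0.7, "theta2": 0.3}          # P(theta)
likelihoods = {"theta1": 0.2, "theta2": 0.9}     # P(y | theta) for the observed y

evidence = sum(likelihoods[t] * priors[t] for t in priors)                 # P(y)
posterior = {t: likelihoods[t] * priors[t] / evidence for t in priors}     # P(theta | y)

print(posterior)   # {'theta1': ~0.34, 'theta2': ~0.66}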
UNIT – III
MACHINE LEARNING IN PRACTICE
Ranking
Recommendation System
Designing and Tuning model pipelines
Performance measurement
Azure Machine Learning
Open-source Machine Learning libraries
Amazon’s Machine Learning Toolkit: SageMaker.
134
RANKING
 Ranking is a machine learning technique to rank items.
 A ranking algorithm is a procedure used to rank items in
a dataset according to some criterion.
 It can be divided into two categories:
1.Deterministic
2.Probabilistic
 Ranking at heart is essentially sorting of information.
 It is useful for many applications in information retrieval
such as e-commerce, social networks, recommendation
systems, and so on.
 Example:
A user searches for an article or an item to buy online.
Measuring Ranking Performance
 The goal is an algorithm that can rank the items in strictly
non-increasing order of relevance.
 Let there be n items that need to be ranked,
each with a relevance score ri, i = 1, 2, ..., n.
 A simple measure, called cumulative gain (CG), is
defined on these relevances as
CG = ∑_{i=1}^{n} ri
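A minimal sketch of cumulative gain (CG) and discounted cumulative gain (DCG) for the relevance scores in the tables below; the discount 1/log2(i + 1) matches the "discounted relevance score" column, and the NDCG line (DCG divided by the ideal DCG) is a standard extension, not something stated on the slides.

import math

relevances = [0.4, 0.25, 0.1, 0.0, 0.3, 0.13, 0.6, 0.0, 0.56, 0.22]

cg = sum(relevances)
dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances))   # i is 0-based
ideal_dcg = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(relevances, reverse=True)))

print(round(cg, 2), round(dcg, 2), round(dcg / ideal_dcg, 2))       # NDCG = DCG / IDCG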
Sample set of items to be ranked with relevance score
Item No. Relevance Score Discounted relevance score
1 0.4 0.4
2 0.25 0.16
3 0.1 0.05
4 0.0 0.0
5 0.3 0.12
6 0.13 0.05
7 0.6 0.2
8 0.0 0.0
9 0.56 0.17
10 0.22 0.06
Sample set of items ideally ranked with non-increasing
relevance score
139
Item No. Relevance Score Discounted relevance score
1 0.6 0.6
2 0.56 0.35
3 0.4 0.2
4 0.3 0.13
5 0.25 0.1
6 0.22 0.08
7 0.13 0.04
8 0.1 0.032
9 0.0 0.0
10 0.0 0.0
140
Ranking Search Results and Google’s
Pagerank
 The concept of PageRank was at the heart of Google’s
rankings.
 Google’s ranking algorithm is a secret, but we know that it
is a probabilistic ranking algorithm. Google uses a
variety of factors to rank webpages, including the number
of links to a page, the page’s PageRank, and the relevance
of the search query to the page.
 Google’s PageRank algorithm is a probabilistic ranking
algorithm that uses the number of links to a webpage as a
measure of its importance.
 The higher the PageRank of a webpage, the more likely it
is to be ranked higher in the search results.
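A minimal PageRank sketch via power iteration on a tiny 3-page link graph; the link structure and the damping factor d = 0.85 are illustrative choices, not Google's actual configuration.

import numpy as np

# links[i][j] = 1 if page j links to page i
links = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0]], dtype=float)
out_degree = links.sum(axis=0)
M = links / out_degree                     # column-stochastic transition matrix

n, d = 3, 0.85
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank      # power iteration with damping

print(rank / rank.sum())                   # relative importance of the 3 pages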
Techniques Used in Ranking Systems
 Information retrieval and text mining are the core concepts
that are important in the ranking system that relate to
searching websites or movies, etc.
 We will look specifically at a technique called as keyword
identification/extraction and word cloud generation.
142
Keyword Identification / Extraction
A word cloud is a graphical representation of the
keywords identified from a document or set of documents.
The size of each keyword represents the relative
importance of it.
Steps in Building a Word Cloud
1.Cleaning up the data. This step involves removing all the
formatting characters as well as punctuation marks.
2.Normalization of the data.
This step involves making all the characters lower
case, unless it is required to distinguish the upper case
letters from lower case.
This step involves applying grammar to find the
root word for every word form.
143
3. Removal of stop words. This step involves removing all the
common words that are typically used in any text and have no
meaning of their own, for example: a, the, for, to, and, etc.
4.Compute the frequency of occurrence of each of the
remaining words.
5.Sort the words in descending order of frequency.
6.Plot the top-n keywords graphically to generate the word
cloud.
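A minimal sketch of the keyword-extraction steps above: clean, normalize, remove stop words, count frequencies, and sort. The sample text is the opening of the Gettysburg Address, and the stop-word list is tiny and purely illustrative.

import re
from collections import Counter

text = ("Four score and seven years ago our fathers brought forth "
        "on this continent a new nation")
stop_words = {"and", "a", "on", "this", "the", "of", "to", "for", "our"}

words = re.findall(r"[a-z]+", text.lower())          # clean and normalize
words = [w for w in words if w not in stop_words]    # remove stop words
top_keywords = Counter(words).most_common(10)        # frequencies, sorted descending

print(top_keywords)   # feed these (word, count) pairs to a word-cloud plotter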
Word cloud from famous Gettysburg address by
Abraham Lincoln
146
RECOMMENDATION SYSTEM
 The technique behind modern recommendation systems is
commonly called collaborative filtering.
 Companies like Amazon and Netflix have influenced
this genre of machine learning significantly for
personalizing shopping and movie watching experiences.
147
148
Collaborative Filtering
 A mathematical process of predicting the interests or
preferences of a given user based on a database of
interests or preferences of other users.
 Users with similar interests in one aspect tend to share
similar interests in other similar aspects.
 Collaborative filters always deal with 2-dimensional data
in the form of a matrix.
 One dimension is the list of users and the other
dimension is the entity that is being liked or watched
or purchased.
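A minimal user-based collaborative filtering sketch on a small user × item rating matrix (0 = not rated); the users, items, and ratings are made up for illustration, and cosine similarity is one common but not the only choice.

import numpy as np

ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                    # predict user 0's rating for item 2
sims = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
others = [u for u in range(len(ratings)) if u != target and ratings[u, 2] > 0]
prediction = (sum(sims[u] * ratings[u, 2] for u in others)
              / sum(sims[u] for u in others))

print(round(prediction, 2))   # fairly low, since the most similar user disliked item 2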
Sample training data for building a recommendation
system
151
152
Solution Approaches:
Information Types
1.Information about the users in the form of their profiles.
The profiles can have key aspects like age, gender,
location, employment type, number of kids, etc.
2.Information about the interests. For movies, it can be in
the form of languages, genres, lead actors and actresses,
release dates, etc.
3.Joint information of users’ liking or rating their interests
153
 Algorithm Types
 Algorithms exploiting the topological or neighborhood
information. These algorithms are primarily based on the joint
historical information about the users’ ratings.
 Algorithms that exploit the structure of the relationships. These
methods assume a structural relationship between users and
their ratings on the interests.
 These relationships are modelled using probabilistic networks
or latent variable methods like component analysis (principal
or independent) or singular value decomposition.
 Hybrid approach. When little rating history is available, the
neighborhood based algorithms simply cannot operate.
 Hence these hybrid systems start with algorithms that are more
influenced by the structural models in the early stages, and move
to neighborhood based models later.
 Amazon’s Personal Shopping Experience
 First and foremost, Amazon is a shopping platform,
where Amazon itself sells products and services and also
lets third party sellers sell their products.
 Each shopper that comes to Amazon comes with some idea
about the product that he/she wants to purchase along with
some budget for the cost of the product that he/she is ready
to spend on the product as well as some expectation of date
by which the product must be delivered.
155
Context Based Recommendation
Suggesting other similar products that are cheaper than
the product selected.
Content-based filtering uses item features to recommend
other items similar to what the user likes, based on their
previous actions or explicit feedback.
So, if the cost is the only aspect stopping the user from
buying the selected item, the recommended item can solve
the problem.
Suggesting a similar product from a more popular brand.
This might attract user to buy a potentially more
expensive product that is coming from more popular
brand, so more reliable or better quality.
156
 Suggesting a product that has better customer reviews.
 Suggesting a set of products that are typically bundled
with the selected product.
 For example suggesting carry bag or battery charger
or a memory card when selected item is a digital
camera.
157
DESIGNING AND TUNING MODEL
PIPELINES
Model tuning is the experimental process of finding
the optimal values of hyperparameters to maximize model
performance.
Hyperparameters are the set of variables whose
values cannot be estimated by the model from the training
data. These values control the training process.
Then the available data is split into two or three sets
 Training - used for training the model
 Validation - optional validation set is used to tune the
parameters
 Testing - is used to predict the performance metrics of the
algorithm
158
Designing and Tuning Model Pipelines involves the
following steps:
1. Choosing the Technique or Algorithm
2. Choosing the Technique for Adult Salary Classification
3. Splitting the Data
4. Stratified Sampling
5. Training
6. Tuning the Hyperparameters
7. Accuracy Measurement
8. Explainability of Features
9. Practical Considerations
10. Data Leakage
11. Coincidence and Causality
12. Unknown Categories
Choosing the Technique or Algorithm
Let’s consider the problem of binary classification of
the adult salary data.
Example:
If we are dealing with a regression problem then we
are restricted to all the regression type techniques, which
would eliminate the clustering or recommendation
algorithms , etc.
Choosing Technique for Adult Salary Classification
Regression algorithms can always be used in classification
applications with the addition of a threshold, but they are less
preferred.
160
 Decision tree based methods are typically better suited
for problems that have categorical features as these
algorithms inherently use the categorical information.
 Most other algorithms like logistic regression, support
vector machines, neural networks, or probabilistic
approaches are better suited for numerical features.
 Single decision tree can be used as one of the simplest
possible starting algorithm.
 A random forest (an ensemble of decision trees) is also commonly
used as a better-performing method.
161
Splitting the Data
The division into the three sets, in percentage terms, is done as 60–
20–20 or 70–15–15.
Stratified Sampling
Stratified sampling ensures a certain known distribution of classes
in the split parts.
When we are dealing with, say, n classes, the number of
samples per class is not always the same.
When the original distribution is unbalanced, there are two choices
(a minimal stratified-split sketch follows below):
1. Ignore a number of samples from the classes that have more
samples, to match the classes that have fewer samples.
2. Use the samples from the classes that have fewer samples
repeatedly, to match the number of samples from the classes that
have more samples.
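A minimal sketch of a stratified train/test split, assuming scikit-learn; stratify=y keeps the class proportions the same in both parts of the unbalanced toy dataset.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)                 # unbalanced: 75% class 0, 25% class 1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))      # [12 4] and [3 1]: same 3:1 ratio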
Training
Once the data is split into training and test sets, the training set
is used to train the algorithm and build the model. Each algorithm
has its own specific method associated with it for training.
The goal of all the different techniques is to find the right set of parameters
of the model that can map the input to the output as far as the training
data is concerned.
The parameters of the model are also classified into two main
types:
1.The parameters that can be computed using the training process to
minimize the error in the prediction.
2.The parameter that cannot be directly computed using the training
process. These parameters are also called as hyperparameters.
163
Tuning the Hyperparameters
 The hyperparameters are typically a set of parameters that
can be unbounded, theoretically choose any number between
1 and ∞.
 These bounds are created based on multiple constraints like
computation requirements, dimensionality, and size of data,
etc
 A single set of training set can be used to get results with
one set of hyperparameters.
 The training set is the only data available for training and the
trained model is applied on the validation set to compute the
accuracy metrics.
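A minimal hyperparameter-tuning sketch using grid search with cross-validation, assuming scikit-learn (GridSearchCV is one possible tool, not the one prescribed by the slides); the parameter grid is illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [2, 4, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))   # best hyperparameters found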
164
Data Leakage
If the data is static then there is a set of static algorithms that one
can choose from.
Examples of static problem could be classification of images into ones
that contain cars and ones that don’t.
If the data is changing over time, it creates additional layer of
complexity.
Example
Consider a business of auto mechanic. Here are some of the columns in
the data:
Number of visits
Method of payment
Category of customer
Amount of sale
Year of manufacture of the car
Make of the car
Model of the car
Miles on the odometer
PERFORMANCE MEASUREMENT
To evaluate the performance or quality of the model,
different metrics are used, and these metrics are known as
performance metrics or evaluation metrics
Performance Metrics for Regression / Numerical error
 Mean Absolute Error
 Mean Squared Error
 Normalized Error
Performance Metrics for Classification / Categorical error
 Accuracy
 Precision and Recall
 F-Score
 Confusion Matrix
 Receiver Operating Characteristics (ROC) Curve Analysis
Steps in Hypothesis Testing
 A/B Testing
166
Performance Metrics for Regression / Numerical
error
167
Mean Absolute Error (MAE) is a regression loss measure
looking at the absolute value of the difference between a model's
predictions and the ground truth, averaged out across the
dataset.
MAE weighs all errors equally regardless of their magnitude;
to overcome this limitation, another metric can be used, Mean
Squared Error (MSE), which penalizes larger errors more heavily.
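A minimal sketch of the regression and classification metrics listed above, assuming scikit-learn; the true and predicted values are toy examples.

from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Regression / numerical error
y_true_r, y_pred_r = [3.0, 5.0, 2.5], [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true_r, y_pred_r), mean_squared_error(y_true_r, y_pred_r))

# Classification / categorical error
y_true_c, y_pred_c = [1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true_c, y_pred_c), precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c), f1_score(y_true_c, y_pred_c))
print(confusion_matrix(y_true_c, y_pred_c))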
Normalized Error
In many cases, all the above error metrics can produce
some arbitrary number between -∞ and ∞.
Typically the bounds used are (-1to+1), (0–1), or (0–
100). This way, even a single instance of normalized error
can make sense on its own.
All the above error definitions can have their own
normalized counterpart.
171
172
AZURE MACHINE LEARNING- AML
Azure Machine Learning is a comprehensive
machine learning platform that supports language
model fine-tuning and deployment.
Using the Azure Machine Learning model
catalog, users can create an endpoint for Azure OpenAI
Service and use RESI APIs to integrate models into
applications.
173
Sign in with a Microsoft account, e.g., hotmail.com or
outlook.com, etc.
UNIT – IV
MACHINE LEARNING AND DATA ANALYTICS
Machine Learning for Predictive Data Analytics
Data to Insights to Decisions
Data Exploration
Information based Learning
Similarity based learning
Probability based learning
Error based learning
Evaluation
The art of Machine Learning for Predictive Data
Analytics.
MACHINE LEARNING FOR PREDICTIVE
DATA ANALYTICS
Predictive data analytics is the art of building and
using models that make predictions based on historical
data to predict future outcomes.
178
179
Applications:
Price Prediction: Predictive analytics models can be
trained to predict optimal prices based on historical sales
records.
Dosage Prediction: Doctors and scientists frequently
decide how much of a medicine or other chemical to
include in a treatment.
Risk Assessment: Predictive analytics models can be used
to predict the risk associated with decisions such as issuing
a loan or underwriting an insurance policy.
Diagnosis: Doctors, engineers and scientists regularly make
diagnoses as part of their work. Typically, these diagnoses
are based on their extensive training, expertise, and
experience. 180
Document Classification: Predictive data analytics can be
used to automatically classify documents into different
categories.
Examples:
Email spam filtering, News sentiment analysis, Customer
complaint redirection, and Medical decision making.
How is machine learning used in predictive analytics?
Predictive analytics using machine learning algorithms
can provide more accurate and precise predictions, automate
decision-making processes, and scale up to handle large
datasets and complex problems.
Machine learning algorithms can provide more
accurate predictions than traditional statistical models.
181
Machine Learning
Machine learning is defined as an automated process
that extracts patterns from data.
To build the models used in predictive data analytics
applications, we use supervised machine learning.
Supervised machine learning techniques automatically
learn a model of the relationship between a set of descriptive
features and a target feature based on a set of historical
examples, or instances.
184
185
Restriction bias constrains the set of models that the
algorithm will consider during the learning process
Preference bias guides the learning algorithm to prefer
certain models over others
Inductive bias is necessary for learning (beyond the
dataset)
There are two sources of information that guide this search:
The training data
The inductive bias of the algorithm
Underfitting occurs when the prediction model selected by
the algorithm is too simplistic to represent the underlying
relationship in the dataset between the descriptive features and
the target feature.
Overfitting Occurs when the prediction model selected by the
algorithm is so complex that the model fits to the data set too
closely and becomes sensitive to noise in the data. 186
The Predictive Data Analytics Project Lifecycle: Crisp-DM
187
Cross Industry Standard Process for Data Mining
(CRISP-DM).
Key features of the CRISP-DM process that make
it attractive to data analytics practitioners are that it is
non-proprietary; it is application, industry, and tool
neutral; and it explicitly views the data analytics process
from both an application-focused and a Technical
perspective.
Business Understanding:
Predictive data analytics projects never start out
with the goal of building a prediction model.
Instead, they are focused on things like gaining
new customers, selling more products, or adding
efficiencies to a process. 188
Data Understanding:
Once the manner in which predictive data
analytics will be used to address a business problem has
been decided, it is important that the data analyst fully
understands the available data sources and the kinds of
data they contain.
Data Preparation:
Building predictive data analytics models requires
specific kinds of data, organized in a specific kind of
structure known as analytics base table (ABT).
Modeling:
Different machine learning algorithms are used to
build a range of prediction models from which the best
model will be selected for deployment.
189
Evaluation:
This phase Of CRISP-DM covers all the
evaluation tasks required to show that a prediction model
will be able to make accurate predictions after being
deployed and that it does not suffer from overfitting or
underfitting.
Deployment:
Machine learning models are built to serve a
purpose within an organization, and the last phase of
CRISP-DM covers all the work that must be done to
successfully integrate a machine learning model into the
processes within an organization.
190
Predictive Analytics Tools
Predictive Analytics Software Tools have advanced
analytical capabilities like Text Analysis, Real-Time Analysis,
Statistical Analysis, Data Mining, Machine Learning modeling
and Optimization.
Libraries for Statistical Modeling and Analysis
Scikit-learn
Pandas
Statsmodels
NLTK (Natural Language Toolkit)
GraphLab
Neural Designer
Open-Source Analytical Tools
SAP Business Objects
IBM SPSS
Halo Business Intelligence
Daiku-DSS
Weka 191
DATA TO INSIGHTS TO DECISIONS
Designing ABTs that properly represent the
characteristics of a prediction subject is a key skill for
analytics practitioners.
An approach to first develop a set of domain concepts
that describe the prediction subject, and then expand these
into concrete descriptive features.
Converting a business problem into an analytics solution
Involves answering the following key questions:
1.What is the business problem?
2.What are the goals that the business wants to achieve?
3.How does the business currently work?
4.In what ways could a predictive analytics model help to
address the business problem? 192
Designing the Analytics Base Table (ABT)
The basic structure in which we capture historical datasets.
The different data sources typically combined to create an
analytics base table.
The basic structure in which we capture historical datasets
is the analytics base table (ABT)
The general structure of an analytics base table:
 Descriptive features
 Target feature
193
The different data sources typically combined to create an
analytics base table.
194
Prediction subject defines the basic level at which
predictions are made, and each row in the ABT will
represent one instance of the prediction subject
One-row-per-subject is often used to describe this
structure
Each row in an ABT is composed of a set of
descriptive features and a target feature
A good way to define features is to identify the key
domain concepts and then to base the features on
these concepts.
195
The hierarchical relationship between an analytics
solution, domain concepts, and descriptive features.
196
 Features in an ABT:
 Raw features
 Derived features: requires data from multiple sources to be
combined into a set of single feature values
Common derived feature types:
 Aggregates
 Flags
 Ratios
 Mappings
There are a number of general domain concepts that are often useful:
 Prediction Subject Details
 Demographics
 Usage
 Changes in Usage
 Special Usage
 Lifecycle Phase
 Network Links
197
Example domain concepts for a motor insurance fraud
claim prediction analytics solution
198
Designing & Implementing Features
Three key data considerations are particularly
important when we are designing features. The first is data
availability: we must have data available to
implement any feature we would like to use.
Example:
In an online payments service scenario, we might define a
feature that calculates the average of a customer’s account
balance over the past six months.
199
Timing:
Timing with which data becomes available for
inclusion in a feature.
With the exception of the definition of the target
feature, data that will be used to define a feature must be
available before the event around which we are trying to
make predictions occurs.
Example:
If we were building a model to predict the Outcomes of
soccer matches, we might consider including the attendance
at the match as a descriptive feature.
200
Longevity:
There is potential for features to go stale if something
about the environment from which they are generated
changes.
Example, to make predictions of the outcome of loans
granted by a bank, we might use the borrower’s salary as
a descriptive feature.
201
Propensity models:
Many of the predictive models that we build are
propensity models, which inherently have a temporal
element
Two key periods of propensity modeling:
Observation period
Outcome period
202
203
Observation and outcome periods defined by an event
rather than by a fixed point in time (each line
represents a prediction subject and stars signify
events).
204
Case Study: Motor Insurance Fraud
What are the observation period and outcome period for
the motor insurance claim prediction scenario?
The observation period and outcome period are measured
over different dates for each insurance claim, defined
relative to the specific date of that claim.
The observation period is the time prior to the claim
event, over which the descriptive features capturing the
claimant’s behavior are calculated.
The outcome period is the time immediately after the
claim event, during which it will emerge whether the claim
is fraudulent or genuine.
What features could you use to capture the Claim
Frequency domain concept?
Example domain concepts for a motor insurance fraud
prediction analytics solution
206
DATA EXPLORATION
Data exploration refers to the initial step in data
analysis in which data analysts use data visualization
and statistical techniques to describe dataset
characterizations, such as size, quantity, and accuracy, in
order to better understand the nature of the data.
It involves scrutinizing datasets to uncover
hidden patterns, outliers, and insights.
Whether in business, healthcare, or research, data
exploration serves as the compass guiding decision-
makers.
207
The Data Quality Report
A data quality report includes tabular reports that
describe the characteristics of each feature in an ABT using
standard statistical measures of central tendency and
variation.
The tabular reports are accompanied by data visualizations:
A histogram for each continuous feature in an ABT
A bar plot for each categorical feature in an ABT.
208
Purposes of Data Exploration:
Understanding Data Structure:
Data exploration helps in understanding the overall
structure of the dataset, including the number of variables,
data types, and the presence of missing values or outliers.
Identifying Patterns and Relationships:
Data exploration involves uncovering patterns and
relationships within the data.
This may include identifying correlations between
variables, grouping data based on common characteristics,
and visualizing data to reveal trends.
209
Informing Further Analysis:
The insights gained from data exploration inform
further analysis, such as model building, hypothesis
testing, and decision-making.
Common Data Exploration Techniques:
Descriptive Statistics:
Summarizing data using measures of central
tendency (mean, median, mode) and measures of
dispersion (variance, standard deviation, range) provides
an overview of the data distribution.
Data Visualization:
Creating visualizations, such as histograms, scatter
plots, and box plots, helps visualize data distribution,
identify patterns, and detect outliers. 210
Data Profiling:
Data profiling involves examining data quality
metrics, such as data completeness, accuracy,
consistency, and timeliness, to assess the overall quality
of the dataset.
Data Transformation:
Data transformation techniques, such as
normalization, scaling, and imputation, may be applied
to prepare the data for further analysis.
Feature Engineering:
Feature engineering involves creating new features
from existing data to improve the performance of
machine learning models.
211
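As a hedged illustration of the transformations listed above (not from the slides; the feature names and values are made up), scikit-learn offers standard implementations:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [35000, 52000, None, 61000],
                   "age": [23, 41, 35, 58]})

# Imputation: fill the missing income with the column mean
df[["income"]] = SimpleImputer(strategy="mean").fit_transform(df[["income"]])

# Normalization: rescale each feature to the [0, 1] range
normalized = MinMaxScaler().fit_transform(df)

# Scaling (standardization): zero mean and unit variance
standardized = StandardScaler().fit_transform(df)

# Simple feature engineering: derive a new feature from existing ones
df["income_per_year_of_age"] = df["income"] / df["age"]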
Benefits of Data Exploration:
Improves Data Understanding:
Data exploration deepens the understanding of the
data, its characteristics, and limitations, enabling better
decision-making.
Uncovers Hidden Patterns:
Data exploration reveals patterns and trends that
may not be evident by simply looking at raw data,
informing further analysis and hypothesis generation.
Identifies Data Quality Issues:
Data exploration helps identify data quality issues,
such as missing values, outliers, and inconsistencies,
allowing for data cleaning and improvement. 212
Facilitates Feature Selection:
Data exploration guides feature selection, identifying
relevant features for analysis and model building.
Informs Model Development:
Insights from data exploration inform the choice of
appropriate machine learning algorithms and model
selection.
Data exploration is an essential step in the data
analysis process, providing a foundation for making
informed decisions, uncovering hidden insights, and
building effective models. By thoroughly exploring the
data, data scientists and analysts can extract valuable
knowledge and drive meaningful outcomes.
213
Histograms for different sets of data, each of which exhibits
well-known, common characteristics.
(a) Uniform (b) Normal (unimodal) (c) Unimodal (skewed right)
(d) Unimodal (skewed left) (e) Exponential (f) Multimodal
214
Three normal distributions with different means but
identical standard deviations.
215
Three normal distributions with identical means but
different standard deviations.
216
Identifying Data Quality Issues
A data quality issue is loosely defined as anything
unusual about the data in an ABT.
The most common data quality issues are:
 missing values
 irregular cardinality
 Outliers
The data quality issues we identify from a data quality
report will be of two types:
 Data quality issues due to invalid data
 Data quality issues due to valid data.
217
INFORMATION BASED LEARNING
Information-based learning (IBL) is a framework for
learning that emphasizes the importance of information and
its role in shaping knowledge and understanding.
It is a constructivist approach that suggests that
learners actively construct their own knowledge by making
connections between new information and their existing
knowledge base.
218
(a) Brian (b) John (c) Aphra (d) Aoife
Cards showing character faces and names for the Guess-Who game
(Table columns: Man, Long Hair, Glasses, Name)
219
220
Fundamentals
A decision tree consists of:
 root node (or starting node),
 interior nodes
 leaf nodes (or terminating nodes).
 each of the non-leaf nodes (root and interior) in the tree
specifies a test to be carried out on one of the query’s
descriptive features.
 each of the leaf nodes specifies a predicted classification
for the query.
221
An email spam prediction dataset
222
223
Shannon’s Entropy Model
 Claude Shannon’s entropy model defines a computational
measure of the impurity of the elements of a set.
 An easy way to understand the entropy of a set is to think
in terms of the uncertainty associated with guessing the
result if you were to make a random selection from the
set.
 Entropy is related to the probability of an outcome.
 High probability → Low entropy
 Low probability → High entropy
 If we take the log of a probability and multiply it by -1 we
get this mapping!
224
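To make this mapping concrete, the following is a small illustrative sketch (not from the slides) of Shannon's entropy over a set of target labels:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of target labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["spam"] * 6))                 # 0.0 -> pure set, no uncertainty
print(entropy(["spam"] * 3 + ["ham"] * 3))   # 1.0 -> maximum uncertainty for 2 levels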
Information Gain
The measure of informativeness that we will use is
known as information gain and is a measure of the
reduction in the overall entropy of a set of instances that
is achieved by testing on a descriptive feature.
1.Compute the entropy of the original dataset with respect
to the target feature.
2.For each descriptive feature, create the sets that result by
partitioning the instances in the dataset using their feature
values, and then compute the weighted sum of the entropy scores
of these sets (the remaining entropy), where each weight is the
fraction of instances that fall into that set.
3.Subtract the remaining entropy value (computed in step 2)
from the original entropy value (computed in step 1) to give
the information gain. 225
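These three steps can be sketched directly in code. The following is an illustrative example (the dataset and feature names are assumptions, not from the slides), reusing the entropy function sketched above:

def information_gain(dataset, feature, target):
    """dataset: list of dicts; feature/target: keys into each dict."""
    # Step 1: entropy of the whole dataset with respect to the target feature
    total_entropy = entropy([row[target] for row in dataset])

    # Step 2: remaining entropy after partitioning on the descriptive feature
    remaining = 0.0
    for v in set(row[feature] for row in dataset):
        part = [row for row in dataset if row[feature] == v]
        weight = len(part) / len(dataset)
        remaining += weight * entropy([row[target] for row in part])

    # Step 3: information gain = original entropy minus remaining entropy
    return total_entropy - remaining

data = [{"suspicious_words": True,  "spam": True},
        {"suspicious_words": True,  "spam": True},
        {"suspicious_words": False, "spam": False},
        {"suspicious_words": False, "spam": True}]
print(information_gain(data, "suspicious_words", "spam"))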
SIMILARITY BASED LEARNING
 Feature space: each descriptive feature has its own dimensional
axis. It is an abstract m-dimensional space that is
created by making each descriptive feature in a dataset
an axis of an m-dimensional coordinate system and
mapping each instance in the dataset to a point in this
coordinate space based on the values of its descriptive
features
 Working of feature space: if the values of the descriptive
features of two or more instances in a dataset are the
same, then these instances will be mapped to the same
point in the feature space and vice versa
 Distance between two points in the feature space is a
useful measure of the similarity of the descriptive feature
values of the corresponding instances. 226
Metric(a,b): a real-valued function that returns the
distance between two points a and b in the feature space. It
has the following properties:
Non-negativity
Identity
Symmetry
Triangular inequality
Two examples of distance metric
Euclidean distance
Manhattan distance (taxi-cab distance) = sum of absolute
differences.
Minkowski distance: a family of distance metrics based
on differences between features. 227
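A brief illustrative sketch (the feature values are made up) of these three distance metrics:

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (5.00, 2.50), (2.75, 7.50)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))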
Nearest Neighbor algorithm:
When this model is used to make a prediction for a new
instance, the distance in the feature space between the query
instance and each instance in the dataset is computed,
and the prediction returned by the model is the target feature
level of the dataset instance that is nearest to the query in the
feature space.
The algorithm stores the entire training dataset in memory, which has
a negative effect on the time complexity of the algorithm.
Algorithm : Nearest neighbor algorithm.
Require: a set of training instances
Require: a query instance
228
Iterate across the instances in memory to find the nearest
neighbor—this is the instance with the shortest distance
across the feature space to the query instance.
Make a prediction for the query instance that is equal to
the value of the target feature of the nearest neighbor.
Decision boundary:
It is the boundary between regions of the feature
space in which different target levels will be predicted.
It is generated by aggregating the neighboring local
models (Voronoi regions) that make the same prediction
229
Noise effects:
Using Kronecker delta approach:
The nearest neighbor algorithm is sensitive to noise because any errors
in the description or labeling of training data result in
erroneous local models and incorrect predictions.
One way to mitigate against noise is to modify the
algorithm to return the majority target level within the set of
k nearest neighbors to the query q.
Using weighted k nearest neighbor approach:
Efficient Memory search:
Assuming that the training dataset will remain
relatively stable, the time issue can be offset by investing in
a one-off computation to create an index of the instances that
enables efficient retrieval of the nearest neighbors without
doing an exhaustive search of the entire training dataset. 230
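As an illustrative sketch (toy data, not from the slides), scikit-learn's KNeighborsClassifier supports both the plain majority-vote and the distance-weighted variants described above:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[5.3, 3.1], [5.1, 2.9], [6.8, 3.0], [7.0, 3.2], [6.9, 3.1]]
y_train = ["setosa", "setosa", "virginica", "virginica", "virginica"]

# Majority vote among the k nearest neighbors (mitigates noisy labels)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Distance-weighted k nearest neighbors: closer neighbors count for more
wknn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

query = [[6.5, 3.0]]
print(knn.predict(query), wknn.predict(query))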
k-d tree:
It stands for k-dimensional tree, which is a balanced
binary tree in which each node in the tree indexes one
of the instances in the training dataset.
This tree is constructed so that the nodes that are
nearby in the tree index training instances that are nearby in
the feature space
231
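A minimal sketch (not from the slides) using SciPy's k-d tree implementation to retrieve nearest neighbors without an exhaustive search:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 2))              # toy 2-D training instances

tree = cKDTree(X_train)                      # one-off index construction
query = np.array([0.5, 0.5])
distances, indices = tree.query(query, k=3)  # 3 nearest neighbors of the query
print(indices, distances)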
PROBABILITY BASED LEARNING
It is a machine learning technique that uses
probability theory to make predictions or decisions. It's a
statistical approach that models uncertainty in data using
probability distributions.
A probability function, P(), returns the probability of a
feature taking a specific value.
A joint probability refers to the probability of an
assignment of specific values to multiple different features
A conditional probability refers to the probability of
one feature taking a specific value given that we already
know the value of a different feature
232
 A probability distribution is a data structure that
describes the probability of each possible value a feature can
take. The sum of a probability distribution must equal 1.0.
 A joint probability distribution is a probability
distribution over more than one feature assignment and is
written as a multi-dimensional matrix in which each cell
lists the probability of a particular combination of feature
values being assigned
 The sum of all the cells in a joint probability distribution
must be 1.0.
233
Bayes’ Theorem:
Bayes’ Theorem defines the conditional probability of an event, X,
given some evidence, Y, in terms of the product of the inverse
conditional probability, P(Y|X), and the prior probability
of the event, P(X):
P(X|Y) = P(Y|X) x P(X) / P(Y)
234
235
Bayesian Prediction
We generate the probability of the event that a
target feature, t,
takes a specific level, l,
given the assignment of values to a set of descriptive
features, q, from a query instance.
We can restate Bayes’ Theorem using this terminology and
generalize the definition of Bayes’ Theorem so that it can
take into account more than one piece of evidence(each
descriptive feature value is a separate piece of evidence).
236
237
1. P(t=l),the prior probability of the target feature t taking the
level l
2. P(q[1], …, q[m]), the joint probability of the descriptive
features of a query instance taking a specific set of values
3. P(q[1], …, q[m] | t = l), the conditional probability of the
descriptive features of a query instance taking a specific set
of values given that the target feature takes the level l
Conditional Independence and Factorization
If knowledge of one event has no effect on the probability
of another event, and vice versa, then the two events are
independent of each other.
If two events X and Y are independent then:
P(X|Y) = P(X)
P(X, Y) = P(X) x P(Y) 238
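A short worked sketch of Bayes' Theorem and the independence identities above, using made-up probabilities:

# Illustrative (made-up) probabilities
p_X = 0.01            # prior probability of the event X
p_Y_given_X = 0.90    # probability of the evidence Y given X
p_Y = 0.10            # overall probability of the evidence Y

# Bayes' Theorem: P(X | Y) = P(Y | X) * P(X) / P(Y)
p_X_given_Y = p_Y_given_X * p_X / p_Y
print(p_X_given_Y)    # 0.09

# If X and Y are independent, the joint probability factorizes
p_joint_if_independent = p_X * p_Y
print(p_joint_if_independent)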
ERROR BASED LEARNING
We perform a search for a set of parameters for a
parameterized model that minimizes the total error across
the predictions made by that model with respect to a set of
training instances.
The key ideas are those of a parameterized model, measuring error,
and an error surface.
Simple Linear Regression
It is a type of Regression algorithms that models the
relationship between a dependent variable and a single
independent variable.
The relationship shown by a Simple Linear
Regression model is linear or a sloped straight line, hence
it is called Simple Linear Regression. 239
The key point in Simple Linear Regression is that the
dependent variable must be a continuous value.
However, the independent variable can be measured on
continuous or categorical values.
Measuring Error
There are many different kinds of error functions, but for
measuring the fit of simple linear regression models, the
most commonly used is the sum of squared errors error
function, or L2.
To calculate L2, we use our candidate model to make a
prediction for each member of the training dataset and then
calculate the error (or residual) between these predictions
and the actual target feature values in the training set.
240
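An illustrative sketch (toy numbers standing in for the office rentals data, which is not reproduced here) of fitting a simple linear model and measuring its sum of squared errors:

import numpy as np

size = np.array([500, 550, 620, 630, 665], dtype=float)   # hypothetical SIZE values
rent = np.array([320, 380, 400, 390, 385], dtype=float)   # hypothetical RENTAL PRICE values

# Least-squares estimates of the weights w[0] (intercept) and w[1] (slope)
w1, w0 = np.polyfit(size, rent, deg=1)

# Sum of squared errors (L2) of the fitted model on the training data
predictions = w0 + w1 * size
sse = np.sum((rent - predictions) ** 2)
print(w0, w1, sse)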
(a) A scatter plot of the SIZE and RENTAL PRICE features
from the office rentals dataset; (b) the scatter plot from
(a) with a linear model relating RENTAL PRICE to SIZE
overlaid
241
Error Surface:
Here, each pair of weights w[0] and w[1] defines a
point on the x-y plane, and the sum of squared errors for the
model using these weights determines the height of the error
surface above the x-y plane for that pair of weights.
The x-y plane is known as a weight space, and the surface is
known as an error surface.
242
EVALUATION
When evaluating machine learning (ML)
models, the question that arises is whether the model is
the best model available from the model’s hypothesis
space in terms of generalization error on the unseen /
future data set.
Hold-out method for Model Evaluation
The hold-out method for model evaluation
represents the mechanism of splitting the dataset into
training and test datasets.
The model is trained on the training set and then
tested on the testing set to get the most optimal model.
243
The hold-out method for model evaluation
Generally, a 70-30% split is used for splitting the dataset,
where 70% of the dataset is used for training and 30%
is used for testing the model.
244
The following is the process of using the hold-out method
for model evaluation:
Split the dataset into two parts (preferably based on a 70-
30% split; however, the percentage split can vary).
Train the model on the training dataset; while training the
model, some fixed set of hyperparameters is selected.
Test or evaluate the model on the held-out test dataset.
Train the final model on the entire dataset to get a model
which can generalize better on the unseen or future dataset.
245
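A minimal sketch of this hold-out process with scikit-learn, using the built-in Iris dataset as a stand-in for a real project dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 70-30 hold-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train with a fixed set of hyperparameters
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test set
print("hold-out accuracy:", model.score(X_test, y_test))

# Optionally retrain the final model on the entire dataset before deployment
final_model = LogisticRegression(max_iter=1000).fit(X, y)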
THE ART OF MACHINE LEARNING FOR
PREDICTIVE DATA ANALYTICS
Predictive data analytics projects use machine
learning to build models that capture the relationships in
large datasets between descriptive features and a target
feature.
A specific type of learning, called inductive learning,
is used, where learning entails inducing a general rule from
a set of specific instances.
A predictive analytics project can use the CRISP-DM
process to manage the project through its lifecycle.
246
The CRoss Industry Standard Process for Data Mining
(CRISP-DM) is a process model that serves as the base for a
data science process.
It has six sequential phases:
1.Business understanding – What does the business need?
2.Data understanding – What data do we have / need? Is it
clean?
3.Data preparation – How do we organize the data for
modeling?
4.Modeling – What modeling techniques should we apply?
5.Evaluation – Which model best meets the business objectives?
6.Deployment – How do stakeholders access the results?
247
UNIT – V
APPLICATIONS OF MACHINE LEARNING
Image Recognition
Speech Recognition
Email spam and Malware Filtering
Online fraud detection
Medical Diagnosis.
248
IMAGE RECOGNITION
Image recognition is the capability of a system to
understand and interpret visual information from images or
videos.
Image recognition in machine learning often involves
the use of deep learning techniques, particularly convolutional
neural networks (CNNs), which have been highly successful in
this domain.
Digital Images contain pixels, which are the smallest
units of a screen that help in image formation. There are
different types of image formats like JPG, JPEG, GIF, PNG,
etc.
Images have their own role in a digital world that
includes various fields like communication, science, art, and
many others. 249
250
Facial expression
Artificial intelligence and machine learning (AI/ML) are
used in computer vision applications to handle this data
properly for monitoring, detection, categorization, object
identification, and facial recognition.
Learning Based Image Recognition
Data Collection:
A large dataset of labeled images is collected.
Each image is associated with a label that describes
what's in the image (e.g., "cat," "car," "beach").
Preprocessing:
Images are typically resized and normalized to a
standard size, and any color information is converted into a
numerical format.
251
Convolutional Neural Network (CNN):
A CNN is used as the primary architecture for image
recognition. CNNs consist of multiple layers, including
convolutional layers, pooling layers, and fully connected
layers. These layers extract features from the input images.
Training:
The CNN is trained on the labeled dataset. During
training, the network adjusts its internal parameters (weights
and biases) to learn to recognize patterns and features in the
images that are associated with the correct labels.
Testing and Inference:
After training, the CNN can be used to make predictions on
new, unlabeled images. It produces a probability distribution
over possible labels for each image. 252
Post-processing:
The raw predicted scores (logits) can be converted into
probabilities using softmax, and the most likely class is taken
as the final classification or label. Top-k classification is
often used to provide multiple likely labels.
The field of image recognition in machine learning
has seen significant advancements, and it's applied in a
wide range of applications, including object detection,
facial recognition, medical image analysis, and
autonomous vehicles.
Example code in Python (TensorFlow/Keras):
import tensorflow as tf
from tensorflow import keras
253
# Load and preprocess the dataset (e.g., CIFAR-10)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# Normalize pixel values to between 0 and 1
x_train, x_test = x_train / 255.0, x_test / 255.0

# Build the CNN model
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    # Output layer: 10 units, one per CIFAR-10 class; these are logits,
    # which matches from_logits=True in the loss below
    keras.layers.Dense(10)
])
254
# Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

This code loads the CIFAR-10 dataset, creates a simple CNN model, compiles it, and then trains and evaluates the model. You can replace the dataset with your own image data and adjust the model architecture for your specific image recognition task.
255
Application of Image Recognition
Identifying Fraudulent Accounts
Examining fake social media profiles is among the
most significant applications of image recognition.
Facial Recognition and Security Systems
Image recognition is also regarded as important since
it is one of the most critical components in the security
business.
Image recognition algorithms could help marketers
learn about a person’s identity, gender, and mood.
Reverse Image Search
You may have heard of internet reverse image
searches. Reverse photo search is a strategy that allows you
to search by image for free.
256
Help Police Officials to Solve Cases
You might be shocked to learn that government
agencies use image recognition. Today, police and other secret
organizations frequently use image recognition technology to
identify persons in recordings or photographs.
Empowers e-commerce Businesses
Today, image recognition is commonly used in the e-
commerce business. Historically, the visual search industry
has grown significantly.
257
SPEECH RECOGNITION
Speech recognition is a machine learning
technology that uses AI to interpret spoken words and
convert them into text.
It works by breaking down speech into sounds,
analyzing each sound, and then using an algorithm to find
the most likely word for each sound.
You can do speech recognition in Python with the
help of computer programs that take input from the
microphone, process it, and convert it into a suitable
form.
258
Speech Recognition
Working of Speech Recognition
259
How Does Speech Recognition work?
Acoustic modeling is used to recognize phonemes/phonetics
in our speech to get the more significant parts of speech, such
as words and sentences.
Speech recognition starts by taking the sound energy
produced by the person speaking and converting it into
electrical energy with the help of a microphone.
 It then converts this electrical energy from analog to
digital, and finally to text.
It breaks the audio data down into sounds, and it analyzes the
sounds using algorithms to find the most probable word that
fits that audio. All of this is done using Natural Language
Processing and Neural Networks. Hidden Markov models can
be used to find temporal patterns in speech and improve accuracy. 260
Picking and Installing a Speech Recognition Package
261
Package | Functionality | Installation
apiai | Includes natural language processing for identifying a speaker’s intent | $ pip install apiai
google-cloud-speech | Offers basic speech-to-text conversion | $ pip install virtualenv; virtualenv <your-env>; <your-env>\Scripts\activate; <your-env>\Scripts\pip.exe install google-cloud-speech
SpeechRecognition | Offers easy audio processing and microphone accessibility | $ pip install SpeechRecognition
watson-developer-cloud | Watson Developer Cloud is an artificial intelligence API that makes creating, debugging, running and deploying APIs easy | $ pip install --upgrade watson-developer-cloud
It allows:
•Easy speech recognition from the microphone.
•Makes it easy to transcribe an audio file.
•It also lets us save audio data into an audio file.
•It also shows us recognition results in an easy-to-understand
format.
Speech Recognition in Python: Converting Speech to Text
create a program that takes in the audio as input and
converts it to text.
262
Importing necessary modules
263
Converting speech to text
Now, use the microphone to get audio input from the user in
real-time, recognize it, and print it in text
264
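Since the code from these two slides is not reproduced here, the following is a minimal sketch of what such a program typically looks like, assuming the SpeechRecognition package and a working microphone:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Take real-time audio input from the microphone
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # reduce background noise
    print("Speak now...")
    audio = recognizer.listen(source)

# Convert speech to text using Google's free web recognizer
try:
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Recognition service error:", e)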
EMAIL SPAM AND MALWARE
FILTERING
 What do we know about spam?
Annoying emails?
Yahoo Mail!
 Spam-based advertising is a huge business
 Email spam is a collection of unwanted, unsolicited
messages sent in bulk to a large number of recipients. It
can be sent for commercial purposes or for other
reasons, and can be sent by humans or by botnets,
which are networks of infected computers
 It is a complex system
Technical: name server, email server, webpages,
etc.
Business: payment processing, merchant bank
accounts, customer service, and fulfillment
 Previous work studied each of the elements in isolation
dynamics of botnets, DNS fast-flux networks, Web
site hosting, spam filtering, URL blacklisting, site
takedown
 This work: quantify the full set of resources employed
to monetize spam email— including naming, hosting,
payment and fulfillment
Main Parts
 Advertising
 Click support
 Redirect sites
 Third-party DNS, spammers’ DNS
 Webservers
 Affiliate programs
 Realization
 Payment services
 Fulfillment
267
Big Picture
268
Data Collection
and
Processing
269
Discussion
 With this big picture, what do you think are the
most effective mechanisms to defeat spam?
270
Malware, short for malicious software, is software
designed to damage or disrupt a computer, server, or
network. Cybercriminals, also known as hackers, create
malware to steal data, gain access to systems, or sabotage
devices.
271
A bot is a software application that can perform
repetitive tasks automatically, without the need for human
intervention. Bots can be designed to imitate human behavior,
but they are often faster and more accurate than humans.
272
273
 Spam Detector is used to detect unwanted, malicious,
and virus-infected texts and helps to separate them from
non-spam texts.
 It uses binary classification with the
labels ‘ham’ (non-spam) and ‘spam’. An application
of this can be seen in Google Mail (GMAIL) where it
segregates the spam emails in order to prevent them
from getting into the user’s inbox.
 In this Machine Learning Spam Filtering application, we
will develop a Spam Detector app using Support Vector
Machine (SVM) technique for classification and Natural
Language Processing.
 Support Vector Machine (SVM) is a supervised learning
algorithm used for classification and regression problems.
The main objective of SVM is to find a hyperplane in an
N-dimensional space (where N is the total number of features)
that best separates the data points into classes.
Hyperplane and Support Vectors
 Hyperplanes are nothing but a boundary that helps to
separate and group the data into particular classes.
274
SVM works very well on the linearly separable data
275
Natural Language Processing (NLP) is an Artificial
Intelligence (AI) field that enables computer programs to
recognize, interpret, and manipulate human languages.
Application Structure:
spam.csv: Dataset for our project. It contains Labels as
“ham” or “spam” and Email Text.
spamdetector.py: This file is used to load the dataset and
train our classifier.
training_data.pkl: This file contains a trained classifier
in binary format which will be used to predict the output.
SpamGui.py: GUI file for our project where we load the
trained classifier and predict the output for a given
message.
276
Steps for developing a Spam Detector:
1.Import Libraries and initialize variables
2.Preprocessing the data
3.Bag of Words (vectorization)
4.Training the model
5.Prediction using Graphical User Interface
277
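A hedged sketch of steps 1 through 4 (the GUI step is omitted), assuming spam.csv has a label column and a text column; the exact column names used here are an assumption:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Import libraries and load the dataset; column names 'label' and 'text' are assumed
data = pd.read_csv("spam.csv", encoding="latin-1")[["label", "text"]]

# 2. Preprocessing: lower-case the messages and map labels to 0/1
data["text"] = data["text"].str.lower()
y = (data["label"] == "spam").astype(int)

# 3. Bag of Words vectorization
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(data["text"])

# 4. Train an SVM classifier and check its accuracy on a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

# Predict a new message
msg = vectorizer.transform(["congratulations, you have won a free prize"])
print("spam" if clf.predict(msg)[0] == 1 else "ham")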
278
279
ONLINE FRAUD DETECTION
Credit card fraud has become easier to commit, as
e-commerce and many other online sites have
increased the use of online payment modes, which in turn
increases the risk of online fraud.
280
281
Feature | Description
step | tells about the unit of time
type | type of transaction done
amount | the total amount of the transaction
nameOrg | account that starts the transaction
oldbalanceOrg | balance of the sender’s account before the transaction
newbalanceOrg | balance of the sender’s account after the transaction
nameDest | account that receives the transaction
oldbalanceDest | balance of the receiver’s account before the transaction
newbalanceDest | balance of the receiver’s account after the transaction
isFraud | the value to be predicted, i.e. 0 or 1
 The libraries used are :
 Pandas: This library helps to load the data frame in a
2D array format and has multiple functions to
perform analysis tasks in one go.
 Seaborn/Matplotlib: For data visualization.
 Numpy: Numpy arrays are very fast and can perform
large computations in a very short time.
282
 The Importance of Fraud Detection
 Financial Losses
 Customer Trust
 Legal and Regulatory Compliance
 Operational Efficiency
 The Role of Machine Learning Algorithms
Traditional Fraud Detection Algorithms
 Machine learning algorithms, traditional ones that have
been commonly used: logistic regression, decision trees,
and random forests.
 These machine learning models have had a significant
impact on fraud detection processes, providing
organizations with practical predictive capabilities. 283
Logistic regression, for example, is a statistical model
that has been widely used for its simplicity and
efficiency.
The decision trees method, on the other hand, makes
predictions based on a set of decision rules. And finally,
the random forest algorithm combines multiple decision
trees to generate a final outcome.
Complex fraud patterns involve non-linear
relationships, high-dimensional data, and evolving
tactics.
As a result:
 Feature Engineering Challenges
 Imbalanced Data
 Dynamic Fraud Behavior 284
New Approaches: Machine Learning Algorithms
1. Gradient Boosting Machines (GBM)
How GBM Works
Gradient Boosting Machines (GBM) is an ensemble
learning technique that combines multiple weak learners
(typically decision trees) to create a strong predictive model.
Here’s how it works:
GBM assigns weights to each tree based on its
performance in reducing residuals.
The final prediction is the weighted sum of predictions
from all trees.
285
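An illustrative sketch (synthetic data, not a real fraud dataset) of a gradient boosting classifier built with scikit-learn:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for fraud / non-fraud transactions
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# An ensemble of shallow trees fitted sequentially to the residual errors
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))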
 2. Neural Networks
 Application of Neural Networks (Deep Learning) in
Fraud Detection
 One of the most advanced tools in the machine learning
arsenal is neural networks, an algorithmic structure
modeled after the human brain.
 Neural networks, in essence, are computing systems
made up of interconnecting artificial neurons, or nodes.
 In the landscape of fraud detection, neural networks —
particularly deep learning neural networks — hold
significant potential.
286
 Transaction Sequences: Recurrent neural networks
(RNNs) can analyze sequences of transactions over
time, capturing temporal dependencies.
 Image-Based Fraud Detection: Convolutional
neural networks (CNNs) process images (e.g.,
scanned checks, ID cards) to detect anomalies.
 Ensemble Approaches: Stacking neural networks
with other models (e.g., GBM) can enhance overall
fraud detection accuracy.
287
3. XGBoost: Extreme Gradient Boosting
XGBoost (Extreme Gradient Boosting) is another
advanced machine learning algorithm that is creating
ripples in the field of fraud detection.
This algorithm stands out due to its flexibility, high
accuracy, and effectiveness in dealing with large and
complex datasets.
288
 Mobile Payment Fraud Detection:
 Researchers proposed an XGBoost-based framework for
mobile payment fraud detection.
 By integrating unsupervised outlier detection algorithms
and an XGBoost classifier, they achieved excellent results
on a large dataset of over 6 million mobile transactions.
 Credit Card Fraud Detection:
 Another study applied XGBoost to credit card transaction
data.
 They used outlier detection based on distance sum to
identify fraudulent transactions effectively.
 State Grid Corporation of China (SGCC) Dataset:
 Researchers explored integrating Genetic Algorithms with
XGBoost for fraud detection in the SGCC dataset.
 This integration showed promising results and opens
avenues for further research.
289
4. Isolation Forest: Unleashing Anomaly Detection
What is Isolation Forest?
The Isolation Forest or iForest is a unique and
compelling machine learning algorithm designed for
anomaly detection.
Effectiveness in Identifying Fraudulent Transactions
High-Dimensional Data
290
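A minimal sketch (synthetic data) of anomaly detection with scikit-learn's IsolationForest; a prediction of -1 flags a transaction as anomalous:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=10, size=(1000, 2))      # typical transactions
anomalies = rng.uniform(low=200, high=500, size=(10, 2))   # unusually large transactions
X = np.vstack([normal, anomalies])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = iso.predict(X)            # +1 = normal, -1 = anomaly
print("flagged as anomalous:", int((labels == -1).sum()))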
5. Autoencoders: Unleashing Latent Representations
Autoencoders are essentially data compression
algorithms where the compression and decompression
functions are data-specific, lossy, and learned
automatically from examples.
They are a type of self-organizing system that
can learn to represent (encode) the input data in a
manner that highlights its key features and patterns.
Key Features
Smart Adaptive Machine Learning
Smart Surveillance and Automation
Flexible Customizable Rules
Compliance Assurance 291
 The Power of Advanced Algorithms
 Accuracy: Machine learning algorithms such
as XGBoost, neural networks, and autoencoders
offer unparalleled accuracy in identifying
fraudulent activities.
 Adaptability: These algorithms adapt to changing
fraud patterns, ensuring continuous protection
292
Machine Learning Techniques all units .ppt

  • 2. UNIT – I MACHINE LEARNING BASICS Introduction to Machine Learning (ML) Essential concepts of ML Types of learning Machine learning methods based on Time Dimensionality Linearity and Non linearity Early trends in Machine learning Data Understanding Representation and visualization 2
  • 4. 4 What is Machine Learning?  Machine Learning Study of algorithms that improve their performance at some task with experience  Optimize a performance criterion using example data or past experience.  Role of Statistics: Inference from a sample  Role of Computer science: Efficient algorithms to  Solve the optimization problem  Representing and evaluating the model for inference
  • 5. MACHINE LEARNING Machine learning is branch of artificial intelligence that enables computers to “self learn” from training data and improve over time without being explicitly programmed. ML algorithms are able to detect patterns in data and learn from them in order to make their own predictions. 5
  • 6. What is Machine Learning? “Learning is any process by which a system improves performance from experience.” - Herbert Simon Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that • improve their performance P • at some task T • with experience E. A well-defined learning task is given by <P, T, E>. 3 6
  • 8. When Do We Use Machine Learning? ML is used when: •Human expertise does not exist (navigating on Mars) •Humans can’t explain their expertise (speech recognition) •Models must be customized (personalized medicine-routing on a computer network) •Models are based on huge amounts of data (user biometrics)) Learning isn’t always useful: •There is no need to “learn” to calculate payroll Based on slide by E. Alpaydin 5 8
  • 9. 9  Retail: Market basket analysis, Customer relationship management (CRM)  Finance: Credit scoring, fraud detection  Manufacturing: Optimization, troubleshooting  Medicine: Medical diagnosis  Telecommunications: Quality of service optimization  Bioinformatics: Motifs, alignment  Web mining: Search engines Applications:
  • 10. Growth of Machine Learning  Machine learning is preferred approach to  Speech recognition, Natural language processing  Computer vision  Medical outcomes analysis  Robot control  Computational biology  This trend is accelerating  Improved machine learning algorithms  Improved data capture, networking, faster computers  Software too complex to write by hand  New sensors / IO devices  Demand for self-customization to user, environment  It turns out to be difficult to extract knowledge from human expertsfailure of expert systems in the 1980’s. 10
  • 11. Samuel’s Checkers-Player “Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.” -Arthur Samuel (1959) 9 11
  • 12. Defining the Learning Task Improve on task T, with respect to performance metric P, based on experience E T: Playing checkers P: Percentage of games won against an arbitrary opponent E: Playing practice games against itself T: Recognizing hand-written words P: Percentage of words correctly classified E: Database of human-labeled images of handwritten words T: Driving on four-lane highways using vision sensors P: Average distance traveled before a human-judged error E: A sequence of images and steering commands recorded while observing a human driver. T: Categorize email messages as spam or legitimate. P: Percentage of email messages correctly classified. E: Database of emails, some with human-given labels 10 12
  • 13. Well Defined Learning Problem Learning = Improving with experience at some task Improve over task T , with respect to performance measure P , based on experience E. E.g., Learn to play checkers T : Play checkers P : % of games won in world tournament E: opportunity to play against self 13
  • 14. Learning to Play Checkers T : Play checkers P : Percent of games won in world tournament What experience? What exactly should be learned? How shall it be represented? What speci c algorithm to learn it? 14
  • 15. Type of Training Experience Direct or indirect? Teacher or not? A problem: is training experience representative of performance goal? 15
  • 16. Representation for Target Choose Function collection of rules? neural network ? polynomial function of board features? ... 16
  • 17. A Representation for Learned Func- tion w +w bp(b)+w rp(b)+w bk(b)+w rk(b)+w bt(b)+w rt(b 0 1 2 3 4 5 6 bp(b): number of black pieces on board b rp(b): number of red pieces on b bk(b): number of black kings on b rk(b): number of red kings on b bt(b): number of red pieces threatened by black (i.e., which can be taken on black's next turn) rt(b): number of black pieces threatened by red 17
  • 18. Obtaining Training Examples ^ 18 V (b): the true target function V (b) : the learned function train V (b): the training value One rule for estimating training values: train ^ V (b) V (Successor(b))
  • 19. Choose Weight Tuning Rule LMS Weight update rule: Do repeatedly: Select a training example b at random 1. Compute error(b): 19 train ^ error(b) = V (b) V (b) 2. For each board feature f , update weight w : i i w w + c f error(b) i i i c is some small constant, say 0.1, to moderate the rate of learning
  • 20. Design Choices Determine Target Function Determine Representation of Learned Function Determine Type of Training Experience Determine Learning Algorithm Games against self Games against experts Table of correct moves Linear function of six features Artificial neural network Polynomial Gradient descent Board ➝ value Board ➝ move Completed Design ... ... Linear programming ... ... 20
  • 22. 22
  • 23. 23
  • 24. TYPES OF LEARNING  Supervised (inductive) learning  Training data includes desired outputs  Unsupervised learning  Training data does not include desired outputs  Semi-supervised learning  Training data includes a few desired outputs  Reinforcement learning  Rewards from sequence of actions 24
  • 25. 25
  • 26. Learning Associations  Basket analysis: P (Y | X ) probability that somebody who buys X also buys Y where X and Y are products/services. Example: P ( chips | beer ) = 0.7 Market-Basket transactions TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke 26
  • 27. 27 Supervised Learning: Uses  Prediction of future cases: Use the rule to predict the output for future inputs  Knowledge extraction: The rule is easy to understand  Compression: The rule is simpler than the data it explains  Outlier detection: Exceptions that are not covered by the rule, e.g., fraud Example: decision trees tools that create rules
  • 28. 28
  • 29. Inductive Learning  Given examples of a function (X, F(X))  Predict function F(X) for new examples X  Discrete F(X): Classification  Continuous F(X): Regression  F(X) = Probability(X): Probability estimation 29
  • 30. 30 Prediction: Regression  Example: Price of a used car  x : car attributes y : price y = g (x | θ) g ( ) model, θ parameters y = wx+w0
  • 31. 31 Regression Applications  Navigating a car: Angle of the steering wheel (CMU NavLab)  Kinematics of a robot arm α1= g1(x,y) α2= g2(x,y) α1 α2 (x,y)
  • 32. 32
  • 33. 33
  • 34. 34
  • 35. 35
  • 36. 36
  • 37. 37
  • 38. 38
  • 39. 39
  • 40. 40
  • 41. 41 Classification  Example: Credit scoring  Differentiating between low-risk and high-risk customers from their income and savings Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk Model
  • 42. 42 Classification: Applications  Pattern recognition  Face recognition: Pose, lighting, occlusion (glasses, beard), make-up, hair style  Character recognition: Different handwriting styles.  Speech recognition: Temporal dependency.  Use of a dictionary or the syntax of the language.  Sensor fusion: Combine multiple modalities; eg, visual (lip image) and acoustic for speech  Medical diagnosis: From symptoms to illnesses  Web Advertising: Predict if a user clicks on an ad on the Internet.
  • 43. 43 Face Recognition Training examples of a person Test images AT&T Laboratories, Cambridge UK http://www.uk.research.att.com/facedatabase.html
  • 44. 44
  • 45. 45
  • 46. 46
  • 47. 47
  • 48. 48 Unsupervised Learning  Learning “what normally happens” No output  Clustering: Grouping similar instances  Other applications: Summarization, Association Analysis  Example applications  Customer segmentation in CRM  Image compression: Color quantization  Bioinformatics: Learning motifs
  • 49. 49
  • 50. 50
  • 51. 51
  • 52. 52
  • 53. 53
  • 54. 54 Reinforcement Learning  Topics:  Policies: what actions should an agent take in a particular situation  Utility estimation: how good is a state (used by policy)  No supervised output but delayed reward  Credit assignment problem (what was responsible for the outcome)  Applications:  Game playing  Robot in a maze  Multiple agents, partial observability, ...
  • 55. MACHINE LEARNING METHODS BASED ON TIME  A static model is trained offline.  That is, we train the model exactly once and then use that trained model for a while.  A dynamic model is trained online.  That is, data is continually entering the system and we're incorporating that data into the model through continuous updates. 55
  • 56. 56
  • 57. 57
  • 58. 58
  • 59. 59
  • 60. 60
  • 61. Curse of Dimensionality  The Curse of Dimensionality in Machine Learning arises when working with high- dimensional data, leading to increased computational complexity, overfitting, and spurious correlations.  Techniques like dimensionality reduction, feature selection, and careful model design are essential for mitigating its effects and improving algorithm performance.  Navigating this challenge is crucial for unlocking the potential of high-dimensional datasets and ensuring robust machine-learning solutions. 61
  • 62. 62
  • 63. LINEARITY AND NON LINEARITY  Linearity refers to the property of a system or model where the output is directly proportional to the input.  While nonlinearity implies that the relationship between input and output is more complex and cannot be expressed as a simple linear function. 63
  • 65. Linearity  Linear models are often the simplest and most effective approach.  A linear model essentially fits a straight line to the data, allowing it to make predictions based on a linear relationship between the input features and the output variable.  In a regression problem, a linear model is used to predict a continuous target variable based on one or more input features.  such as the size and age of a tree. 65
  • 66. Nonlinearity  Nonlinear models can take many forms, from polynomial models that fit curves to the data, to neural networks that can learn complex patterns in high-dimensional data.  These models are often more powerful than linear models because they can capture more complex relationships between variables.  In a classification problem, nonlinear models are able to identify complex decision boundaries that separate different classes, especially when those boundaries are not linear. 66
  • 67. 67
  • 68. Occam’s Razor  If you have two competing ideas to explain the same phenomenon, you should prefer the simpler one. 68
  • 69. Techniques for Applying Occam’s Razor in Machine Learning: 69
  • 70. No Free Lunch Theorem:  NFL Theorem If an algorithm performs better on certain class of problems, then it pays for it in the form of degraded performance in other classes of problems.  In other words, you cannot have single optimal solution for all classes of problems. 70
  • 71. Why Understanding Occam’s Razor is Important for Data Scientists 1. Enhancing Interpretability: Simpler models are often more interpretable, which means it’s easier to understand how they’re making predictions. 2. Avoiding Overfitting: The goal of machine learning is to make accurate predictions on new, unseen data. By keeping models simpler, data scientists can reduce the risk of overfitting. 3. Improving Generalizability: They are less likely to fit the noise in the training data and more likely to capture the underlying trend or relationship. 4. Reducing Computational Resources: where resources might be limited or expensive. 71
  • 72. EARLY TRENDS IN MACHINE LEARNING 72
  • 73. 73
  • 74. 74
  • 75. 75
  • 76. 76
  • 77. DATA UNDERSTANDING, REPRESENTATION, AND VISUALIZATION DATA UNDERSTANDING: Data understanding basically involves analyzing and exploring the data to identify any patterns or trends that may be present. Data exploration provides a high-level overview of each attribute (also called variable) in the dataset and the interaction between the attributes. The step of understanding the data can be broken down into three parts: 1. Understanding entities 2. Understanding attributes 3. Understanding data types 77
  • 78. Understanding Entities  In the field of data science or machine learning and artificial intelligence, entities represent groups of data separated based on conceptual themes and/or data acquisition methods.  An entity typically represents a table in a database, or a flat file.  EXAMPLE: comma separated variable (csv) file, or tab separated variable (tsv) file. Sometimes it is more efficient to represent the entities using a more structured formats like svmlight. Each entity can contain multiple attributes. 78
  • 79. Understanding Attributes  Each attribute can be thought of as a column in the file or table.  In case of Iris data, the attributes from the single given entity are sepal length in cm, sepal width in cm.  The structured formats like svmlight are more useful in case of sparse data, as they add significant overhead when the data is fully populated.  A sparse data is data with high dimensionality, but with many samples missing values for multiple attributes. 79
  • 80. Sepal-length Sepal-width Petal-length Petal-width Class label 5.1 3.5 1.4 0.2 Iris-setosa 4.9 3.0 1.4 0.2 Iris-setosa 4.7 3.2 1.3 0.2 Iris-setosa 4.6 3.1 1.5 0.2 Iris-setosa 5.0 3.6 1.4 0.2 Iris-setosa 4.8 3.4 1.9 0.2 Iris-setosa 5.0 3.0 1.6 0.2 Iris-setosa 5.0 3.4 1.6 0.4 Iris-setosa 5.2 3.5 1.5 0.2 Iris-setosa 5.2 3.4 1.4 0.2 Iris-setosa 7.0 3.2 4.7 1.4 Iris-versicolor 6.4 3.2 4.5 1.5 Iris-versicolor 6.9 3.1 4.9 1.5 Iris-versicolor 5.5 2.3 4.0 1.3 Iris-versicolor 6.5 2.8 4.6 1.5 Iris-versicolor 6.7 3.1 4.7 1.5 Iris-versicolor 6.3 2.3 4.4 1.3 Iris-versicolor 5.6 3.0 4.1 1.3 Iris-versicolor 5.5 2.5 4.0 1.3 Iris-versicolor 5.5 2.6 4.4 1.2 Iris-versicolor 6.3 3.3 6.0 2.5 Iris-virginica 80
  • 81. Understanding Data Types  Attributes in each entity can be of various different types from the storage and processing perspective, e.g., string, integer valued, datetime, binary (“true”/“false”, or “1”/“0”), etc.  Each type needs to be handled separately for generating a feature vector that will be consumed by the machine learning algorithm.  we can also come across sparse data, in which case, some attributes will have missing values.  This missing data is typically replaced with special characters, which should not be confused with any of the real values.  In order to process data with missing values, with some default values, or use an algorithm that can work with missing data. 81
  • 82. Representation and Visualization of the Data Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand. There are couple of options in such cases:  Draw multiple plots taking 2 or three dimensions at a time.  Reduce the dimensionality of the data and plot upto 3 dimensions Most common methods used to reduce the dimensionality are: 1.Principal Component Analysis or PCA 2.Linear Discriminant Analysis or LDA 82
  • 83. This can include a variety of visual tools such as: Charts: Bar charts, line charts, pie charts, etc. Graphs: Scatter plots, histograms, etc. Maps: Geographic maps, heat maps, etc. Dashboards: Interactive platforms that combine multiple visualizations. 83
  • 84. 84
  • 85. Tools for Visualization of Data  Tableau  Looker  Zoho Analytics  Sisense  IBM Cognos Analytics  Qlik Sense  Domo  Microsoft Power BI  Klipfolio  SAP Analytics Cloud 85
  • 86. 86
  • 87. Principal Component Analysis  We can only visualize the data in maximum of 2 or 3 dimensions. However, it is common practice to have the dimensionality of the data in tens or even hundreds.  If we can find exact coordinates (Xr ,Y r ) of the paper’s orientation as linear combination of X, Y , and Z coordinates of the 3-dimensional space, we can reduce the dimensionality of the data from 3 to 2. 87
  • 88. 2-dimensional data where PCA dimensionality is same but along different axes 88
  • 89. LINEAR DISCRIMINANT ANALYSIS  The way PCA tries to find the dimensions that maximize variance of the data, linear discriminant analysis or LDA tries to maximize the separation between the classes of data.  Thus LDA can only effectively work when we are dealing with classification type of problem, and the data intrinsically contains multiple classes. 89
  • 90. 3-dimensional data with 2 classes in another perspective, with LDA representation. LDA reduces the effective dimensionality of data into 1 dimension where the two classes can be best separated 90
  • 91. UNIT – II MACHINE LEARNING METHODS Linear methods Regression Classification Perceptron and Neural networks Decision trees Support vector machines Probabilistic models Unsupervised learning Featurization. 91
  • 92. LINEAR METHODS Machine Learning Algorithms are Divided Into Two Types: 1.Supervised learning algorithms 2.Unsupervised learning algorithms 92
  • 93. Supervised learning algorithms 93  Where the model is trained on a labelled dataset.  A labelled dataset is one that has both input and output parameters.  In this type of learning both training and validation, datasets are called labelled.  The input features are the attributes or characteristics of the data that are used to make predictions  while the output labels are the desired outcomes or targets that the algorithm tries to predict.
  • 94. 94
  • 95. EXAMPLE: It is a dataset of a shopping store that is useful in predicting whether a customer will purchase a particular product under consideration or not based on his/ her gender, age, and salary. Input: Gender, Age, Salary Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that the customer won’t purchase it. 95
  • 96. Supervised learning can be further classified into two categories: 96
  • 97. Unsupervised learning algorithms Unsupervised learning is a type of machine learning technique in which an algorithm discovers patterns and relationships using unlabelled data. 97
  • 98.  There are two main categories of unsupervised learning that are mentioned below: 1. Clustering 2. Association  Clustering is the process of grouping data points into clusters based on their similarity.  This technique is useful for identifying patterns and relationships in data without the need for labelled examples. 98
  • 99.  Here are some clustering algorithms: i. K-Means Clustering algorithm ii. Mean-shift algorithm iii. DBSCAN Algorithm iv. Principal Component Analysis v. Independent Component Analysis 99
  • 100. Association Association rule learning is a technique for discovering relationships between items in a dataset. It identifies rules that indicate the presence of one item implies the presence of another item with a specific probability. Here are some association rule learning algorithms: i.Apriori Algorithm ii.Eclat iii.FP-growth Algorithm 100
  • 101. LINEAR REGRESSION Regression is a supervised machine learning technique which is used to predict continuous values. The ultimate goal of a regression algorithm is to fit a best-fit line or curve through the data. Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to observed data. 101
  • 102. Types: i. Linear Regression ii. Polynomial Regression iii. Ridge Regression iv. Lasso Regression v. Logistic Regression vi. Support Vector Regression (SVR) vii. Decision Tree Regression viii. Random Forest Regression ix. Gradient Boosting Regression 102
  • 103. Simple Linear Regression This is the simplest form of linear regression, and it involves only one independent variable and one dependent variable. The equation for simple linear regression is: y = β0 + β1X, where: y is the dependent variable, X is the independent variable, β0 is the intercept, and β1 is the slope. 103
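A minimal sketch (not part of the slides) of fitting y = β0 + β1X with scikit-learn; the toy data here is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one independent variable X and one dependent variable y
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

model = LinearRegression().fit(X, y)
print("intercept (beta0):", model.intercept_)
print("slope (beta1):", model.coef_[0])
print("prediction for X = 6:", model.predict([[6.0]])[0])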
  • 104. MULTIPLE LINEAR REGRESSION Linear regression is a classic example of strictly linear models. It is also called polynomial fitting. Let us consider a problem of linear regression where the training data contains p samples. The input is n-dimensional, (xi, i = 1, ..., p) with xi ∈ ℝⁿ, and the output is single dimensional, (yi, i = 1, ..., p) with yi ∈ ℝ. Multiple linear regression assumes a linear relationship between the independent and dependent variables: Y = β0 + β1X1 + β2X2 + ... + βnXn + ϵ, where β0 is the intercept, β1, β2, ..., βn are the coefficients, and ϵ is the error term. 104
  • 105. Ridge Regression In machine learning, ridge regression helps reduce overfitting that results from model complexity. Model complexity can be due to a model possessing too many features. In ridge regression, a penalty term is added to the loss function to constrain the coefficients. The penalty is proportional to the square of the magnitude of the coefficients. Loss function: Minimize ∑_{i=1}^{n} (yi − ŷi)² + λ ∑_{j=1}^{p} βj². Here, λ is a regularization parameter that controls the strength of the penalty, βj are the coefficients, and p is the number of features. 105
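A brief, hedged sketch of ridge regression with scikit-learn, where the alpha argument plays the role of the regularization parameter λ above; the data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                   # 100 samples, 5 features
y = X @ np.array([1.5, 0.0, -2.0, 0.3, 0.0]) + rng.normal(scale=0.1, size=100)

# alpha corresponds to the penalty strength lambda in the loss function above
ridge = Ridge(alpha=1.0).fit(X, y)
print("coefficients:", ridge.coef_)                             # shrunk toward zero relative to plain least squares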
  • 106. CLASSIFICATION The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data. In Classification, a program learns from the given dataset or observations and then classifies new observation into a number of classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called as targets/labels or categories. 106
  • 107. In a classification algorithm, a discrete output function (y) is mapped to an input variable (x). y = f(x) Where, y = categorical output Example: Email Spam Detector. There are two classes, class A and class B. Each class has features that are similar within the class and dissimilar to the other class. 107
  • 108. Types: 1. Linear Models 1. Logistic Regression 2. Support Vector Machines 2. Non-linear Models 1. K-Nearest Neighbours 2. Kernel SVM 3. Naïve Bayes 4. Decision Tree Classification 5. Random Forest Classification 108
  • 109. K-Nearest Neighbor(KNN) Algorithm for Machine Learning K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on Supervised Learning technique. The k-nearest neighbors (KNN) algorithm is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. 109
  • 110. The weighted K-Nearest Neighbors (KNN) algorithm is a supervised learning technique used for classification and regression tasks. It makes predictions based on the similarity between a new data point and its closest neighbors in the training dataset. 110
  • 111. Classification For classification, the target variable is categorical. Predicting the class of a new data point based on the majority class among its k-nearest neighbors. Regression while for regression, the target variable is continuous. Predicting the value of a continuous target variable based on the average value of the target variable among its k-nearest neighbors. 111
  • 112. How does K-NN work? The K-NN working can be explained on the basis of the below algorithm: Step-1: Select the number K of the neighbors Step-2: Calculate the Euclidean distance of K number of neighbors Step-3: Identify the K nearest neighbors as per the calculated Euclidean distance. Step-4: Among these k neighbors, count the number of the data points in each category. Step-5: Assign the new data points to that category for which the number of the neighbour is maximum. Step-6: Our model is ready. 112
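The six steps above can be sketched directly in Python with NumPy; this is a toy, from-scratch illustration rather than a library implementation.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Steps 2-3: Euclidean distances from the query to every training point,
    # then pick the indices of the K nearest neighbors
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the categories among the neighbors and return the majority
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5]), k=3))  # 'A' — two of the three neighbors are class A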
  • 113. 113
  • 114. Advantages of KNN Algorithm: It is simple to implement. It is robust to noisy training data. It can be more effective if the training data is large. Disadvantages of KNN Algorithm: It always needs to determine the value of K, which may be complex at times. The computation cost is high because of calculating the distance between the data points for all the training samples. 114
  • 115. Real-World Applications Weighted KNN is used in a wide range of applications, including: 1 Recommendation Systems Recommending products or content based on the preferences of similar users. 2 Image Recognition Classifying images based on their visual features. 3 Financial Forecasting Predicting future stock prices or other financial indicators based on historical data. 4 Medical Diagnosis Assisting in the diagnosis of diseases based on patient data and medical records.
  • 116. PERCEPTRON AND NEURAL NETWORKS  A weight is assigned to each input node of a perceptron, indicating the significance of that input to the output.  The perceptron's output is a weighted sum of the inputs that has been run through an activation function to decide whether or not the perceptron will fire. It computes the weighted sum of its inputs as: z = w1x1 + w2x2 + ... + wnxn = xᵀw 116
  • 117. The step function, which compares this weighted sum to a threshold and outputs 1 if the input is larger than the threshold value and 0 otherwise, is the activation function that perceptrons utilize most frequently. The most common step function used in perceptrons is the Heaviside step function: 117
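A small sketch of the forward computation just described; the inputs, weights, and threshold are made-up values.
import numpy as np

def heaviside(z, threshold=0.0):
    # Step activation: output 1 if the weighted sum exceeds the threshold, else 0
    return 1 if z > threshold else 0

def perceptron_output(x, w):
    # z = w1*x1 + w2*x2 + ... + wn*xn, then apply the step function
    z = np.dot(w, x)
    return heaviside(z)

x = np.array([1.0, 0.5, -0.2])   # example inputs
w = np.array([0.4, 0.6, 1.0])    # example weights
print(perceptron_output(x, w))   # 1, because 0.4 + 0.3 - 0.2 = 0.5 > 0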
  • 118. The perceptron, a foundational building block in the realm of artificial intelligence, serves as the fundamental unit of neural networks. This simple yet powerful computational model, conceived by Frank Rosenblatt in the 1950s, simulates the behavior of a single neuron in the human brain. It takes multiple input signals, applies weights to each signal, and generates a single output signal based on a threshold function. These interconnected perceptrons, forming layers, create neural networks capable of solving complex tasks like image recognition, natural language processing, and machine translation.
  • 119. The Perceptron: A Simple Model Input Layer The input layer receives information from the external environment. These inputs can be numerical values, representing features or characteristics of the data being processed. Each input is assigned a weight, reflecting its importance or influence on the overall decision. Activation Function The activation function determines the output of the perceptron. It introduces nonlinearity, allowing the perceptron to learn complex relationships in the data. Common activation functions include the sigmoid, ReLU, and tanh functions. Output Layer The output layer produces a single output value, representing the final decision or prediction made by the perceptron. This output can be a binary value (0 or 1) for classification tasks or a continuous value for regression tasks.
  • 120. Neural Networks: Layers of Perceptrons Input Layer The input layer receives raw data, which is then transformed into a representation suitable for processing by the network. This representation is often in the form of numerical values, representing features or characteristics of the data. Hidden Layers Hidden layers process the input information through a series of non-linear transformations. These layers extract complex features and relationships from the data, enabling the network to learn intricate patterns and make accurate predictions. Output Layer The output layer generates the network's final prediction or decision. The structure of the output layer depends on the specific task being solved. For classification tasks, it may produce a probability distribution over different classes. For regression tasks, it may output a continuous value.
  • 122. Learning and Optimization Forward Propagation Information flows from the input layer through the hidden layers to the output layer. During forward propagation, the network calculates the output based on the current weights and biases. Backpropagation The difference between the predicted output and the actual target output is calculated. This error signal is then propagated backward through the network, updating the weights and biases to minimize the error. Gradient Descent Gradient descent is an optimization algorithm that iteratively adjusts the weights and biases in the direction that minimizes the error. It uses the gradients of the error function to guide the search for the optimal parameters.
  • 124. Types of Neural Networks 1 Convolutional Neural Networks (CNNs) CNNs are specialized for image recognition tasks. They use convolutional filters to extract features from images, making them highly effective in identifying objects, scenes, and patterns. 2 Recurrent Neural Networks (RNNs) RNNs are designed to process sequential data, such as text or time series. They use feedback loops to maintain a memory of previous inputs, allowing them to understand context and dependencies in sequential data. 3 Long Short-Term Memory (LSTM) Networks LSTMs are a type of RNN that are particularly well-suited for learning long-term dependencies. They have special memory cells that can store information over extended periods, enabling them to capture complex relationships in sequential data. 4 Generative Adversarial Networks (GANs) GANs are composed of two competing networks: a generator and a discriminator. The generator creates new data samples, while the discriminator tries to distinguish between real and generated data. This adversarial process results in the generation of highly realistic and diverse data.
  • 125. Applications of Neural Networks Image Recognition Object Detection Image Classification Natural Language Processing Machine Translation Text Summarization Speech Recognition Time Series Forecasting Drug Discovery
  • 126. PROBABILISTIC MODELS Machine learning algorithms today rely heavily on probabilistic models, which take into consideration the uncertainty inherent in real-world data. These models make predictions based on probability distributions, rather than absolute values, allowing for a more nuanced and accurate understanding of complex systems. One common approach is Bayesian inference, where prior knowledge is combined with observed data to make predictions. Another approach is maximum likelihood estimation, which seeks to find the model that best fits observational data. 126
  • 127. Probabilistic modeling is a statistical approach that uses the effect of random occurrences or actions to forecast the possibility of future results. It is a quantitative modeling method that projects several possible outcomes that might even go beyond what has happened recently. Categories Of Probabilistic Models 1.Generative models 2.Discriminative models. 127
  • 128. Generative models: Generative models aim to model the joint distribution of the input and output variables. These models generate new data based on the probability distribution of the original dataset. Generative models are powerful because they can generate new data that resembles the training data. They can be used for tasks such as image and speech synthesis, language translation, and text generation. 128
  • 129. Discriminative models: The discriminative model aims to model the conditional distribution of the output variable given the input variable. They learn a decision boundary that separates the different classes of the output variable. Discriminative models are useful when the focus is on making accurate predictions rather than generating new data. They can be used for tasks such as image recognition, speech recognition, and sentiment analysis. 129
  • 130. Maximum Likelihood Estimation The maximum likelihood estimation or MLE approach deals with the problem at face value and parameterizes the information into variables. The values of the variables that maximize the probability of the observed variables lead to the solution of the problem. Let us define the problem using formal notations. Let there be a function f_θ(x) that produces the observed output y. Here x ∈ ℝⁿ represents the input, over which we don't have any control, and θ ∈ Θ represents a parameter vector that can be single or multidimensional. 130
  • 131. The MLE method defines a likelihood function denoted as L(y|θ). Typically the likelihood function is the joint probability of the parameters and observed variables, L(y|θ) = P(y; θ). The objective is to find the optimal values for θ that maximize the likelihood function, as given by θ_MLE = arg max_{θ∈Θ} L(y|θ), or equivalently θ_MLE = arg max_{θ∈Θ} P(y; θ). 131
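As a concrete, illustrative example (not from the slides): for a biased coin observed to come up heads 7 times in 10 tosses, the likelihood is binomial and the θ that maximizes it is the sample proportion; a simple brute-force search confirms this.
import numpy as np
from math import comb

heads, tosses = 7, 10

def likelihood(theta):
    # Binomial likelihood P(y; theta) of observing `heads` successes in `tosses` trials
    return comb(tosses, heads) * theta**heads * (1 - theta)**(tosses - heads)

thetas = np.linspace(0.01, 0.99, 981)
theta_mle = thetas[np.argmax([likelihood(t) for t in thetas])]
print(theta_mle)  # approximately 0.7 = 7/10, the analytical MLE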
  • 132. Bayesian Approach In the Bayesian approach, all the unknowns are modelled as random variables with known prior probability distributions. Let us denote the conditional prior probability of observing the output y for parameter vector θ as P(y|θ). The marginal probabilities of these variables are denoted as P(y) and P(θ). The joint probability of the variables can be written in terms of conditional and marginal probabilities as P(y; θ) = P(y|θ) · P(θ) ………(1) 132
  • 133. The same joint probability can also be given as P(y; θ) = P(θ|y) · P(y) ………(2) Here the probability P(θ|y) is called the posterior probability. Combining Eqs. 1 and 2, P(θ|y) · P(y) = P(y|θ) · P(θ) ……..(3) Rearranging the terms we get P(θ|y) = P(y|θ) · P(θ) / P(y). This theorem (Bayes' theorem) gives the relationship between the posterior probability and the prior probability in a simple and elegant manner. 133
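A tiny numeric illustration of the rearranged formula P(θ|y) = P(y|θ) · P(θ) / P(y); the probabilities below are invented (a screening test for a rare defect).
# Invented numbers purely to illustrate Bayes' theorem
p_theta = 0.01          # prior P(theta): 1% of items are defective
p_y_given_theta = 0.95  # likelihood P(y|theta): the test flags 95% of defective items
p_y_given_not = 0.05    # the test also flags 5% of good items

# Marginal P(y) via the law of total probability
p_y = p_y_given_theta * p_theta + p_y_given_not * (1 - p_theta)

posterior = p_y_given_theta * p_theta / p_y
print(round(posterior, 3))  # ~0.161: a positive flag still leaves only ~16% posterior probability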
  • 134. UNIT – III MACHINE LEARNING IN PRACTICE Ranking Recommendation System Designing and Tuning model pipelines Performance measurement Azure Machine Learning Open-source Machine Learning libraries Amazons Machine Learning Tool Kit: Sagemaker. 134
  • 135. RANKING  Ranking is a machine learning technique to rank items.  A ranking algorithm is a procedure used to rank items in a dataset according to some criterion.  It can be divided into two categories: 1.Deterministic 2.Probabilistic  Ranking at heart is essentially sorting of information.  It is useful for many applications in information retrieval such as e-commerce, social networks, recommendation systems, and so on.  Example: A user searches for an article or an item to buy online. 135
  • 136. Measuring Ranking Performance  An ideal algorithm would rank the items in strictly non-increasing order of relevance.  Let there be n items that need to be ranked, each with a relevance score ri, i = 1, 2, ..., n.  A simple measure, called cumulative gain (CG), is defined on these relevances as CG = ∑_{i=1}^{n} ri. 136
  • 137. 137
  • 138. Sample set of items to be ranked with relevance score 138
Item No. | Relevance Score | Discounted relevance score
1 | 0.4 | 0.4
2 | 0.25 | 0.16
3 | 0.1 | 0.05
4 | 0.0 | 0.0
5 | 0.3 | 0.12
6 | 0.13 | 0.05
7 | 0.6 | 0.2
8 | 0.0 | 0.0
9 | 0.56 | 0.17
10 | 0.22 | 0.06
  • 139. Sample set of items ideally ranked with non-increasing relevance score 139
Item No. | Relevance Score | Discounted relevance score
1 | 0.6 | 0.6
2 | 0.56 | 0.35
3 | 0.4 | 0.2
4 | 0.3 | 0.13
5 | 0.25 | 0.1
6 | 0.22 | 0.08
7 | 0.13 | 0.04
8 | 0.1 | 0.032
9 | 0.0 | 0.0
10 | 0.0 | 0.0
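Using the relevance scores from the tables above, cumulative gain and its discounted and normalized variants can be computed as below; the r_i / log2(i + 1) discount used here reproduces the "Discounted relevance score" columns shown on the slides.
import numpy as np

def dcg(relevances):
    # Discounted cumulative gain: sum of r_i / log2(i + 1), with 1-based positions i
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, len(rel) + 1)
    return float((rel / np.log2(positions + 1)).sum())

scores = [0.4, 0.25, 0.1, 0.0, 0.3, 0.13, 0.6, 0.0, 0.56, 0.22]  # ranked order from the first table
cg = sum(scores)                            # plain cumulative gain
idcg = dcg(sorted(scores, reverse=True))    # DCG of the ideal ordering (second table)
ndcg = dcg(scores) / idcg                   # normalized DCG, in [0, 1]
print(round(cg, 2), round(dcg(scores), 2), round(ndcg, 2))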
  • 140. 140
  • 141. Ranking Search Results and Google’s Pagerank  The concept of PageRank was at the heart of Google’s rankings.  Google’s ranking algorithm is a secret, but we know that it is a probabilistic ranking algorithm. Google uses a variety of factors to rank webpages, including the number of links to a page, the page’s PageRank, and the relevance of the search query to the page.  Google’s PageRank algorithm is a probabilistic ranking algorithm that uses the number of links to a webpage as a measure of its importance.  The higher the PageRank of a webpage, the more likely it is to be ranked higher in the search results. 141
  • 142. Techniques Used in Ranking Systems  Information retrieval and text mining are the core concepts that are important in the ranking system that relate to searching websites or movies, etc.  We will look specifically at a technique called as keyword identification/extraction and word cloud generation. 142
  • 143. Keyword Identification / Extraction A word cloud is a graphical representation of the keywords identified from a document or set of documents. The size of each keyword represents its relative importance. Steps in Building a Word Cloud 1. Cleaning up the data. This step involves removing all the formatting characters as well as punctuation marks. 2. Normalization of the data. This step involves making all the characters lower case, unless it is required to distinguish upper case letters from lower case. It also involves applying grammar rules to find the root word for every word form. 143
  • 144. 3. Removal of stop words. This step involves removing all the common words that are typically used in any text and have no meaning of their own, for example: a, the, for, to, and, etc. 4. Compute the frequency of occurrence of each of the remaining words. 5. Sort the words in descending order of frequency. 6. Plot the top-n keywords graphically to generate the word cloud. 144
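A minimal sketch of steps 1 to 5 using only the Python standard library; the stop-word list here is a tiny stand-in and no stemming is applied.
import re
from collections import Counter

text = "Four score and seven years ago our fathers brought forth on this continent a new nation"
stop_words = {"a", "and", "the", "on", "this", "our", "of", "to", "for"}  # small illustrative list

# Steps 1-3: strip punctuation, lower-case, and drop stop words
words = re.findall(r"[a-z']+", text.lower())
words = [w for w in words if w not in stop_words]

# Steps 4-5: count frequencies and sort in descending order
top_keywords = Counter(words).most_common(10)
print(top_keywords)  # step 6 would plot these top-n keywords as a word cloud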
  • 145. 145 Word cloud from famous Gettysburg address by Abraham Lincoln
  • 146. 146
  • 147. RECOMMENDATION SYSTEM  The technique behind modern recommendation systems is commonly called collaborative filtering.  Companies like Amazon and Netflix have influenced this genre of machine learning significantly for personalizing shopping and movie watching experiences. 147
  • 148. 148
  • 149. Collaborative Filtering  A mathematical process of predicting the interests or preferences of a given user based on a database of interests or preferences of other users.  Users with similar interests in one aspect tend to share similar interests in other aspects.  Collaborative filters always deal with 2-dimensional data in the form of a matrix, one dimension being the list of users and the other being the entity that is liked, watched, or purchased. 149
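A rough sketch of the idea with an invented user-item rating matrix: users are compared by the cosine similarity of their rating vectors, and a missing rating is predicted as a similarity-weighted average of other users' ratings.
import numpy as np

# Rows = users, columns = items; 0 means "not yet rated" (made-up data)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target_user, target_item = 0, 2            # predict user 0's rating for item 2
sims = np.array([cosine(ratings[target_user], ratings[u]) for u in range(len(ratings))])
sims[target_user] = 0.0                    # ignore self-similarity

# Weighted average of the other users' ratings for the target item
rated = ratings[:, target_item] > 0
pred = np.dot(sims[rated], ratings[rated, target_item]) / sims[rated].sum()
print(round(pred, 2))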
  • 150. 150
  • 151. Sample training data for building a recommendation system 151
  • 152. 152
  • 153. Solution Approaches: Information Types 1.Information about the users in the form of their profiles. The profiles can have key aspects like age, gender, location, employment type, number of kids, etc. 2.Information about the interests. For movies, it can be in the form of languages, genres, lead actors and actresses, release dates, etc. 3.Joint information of users’ liking or rating their interests 153
  • 154.  Algorithm Types  Algorithms exploiting the topological or neighborhood information. These algorithms are primarily based on the joint historical information about the users' ratings.  Algorithms that exploit the structure of the relationships. These methods assume a structural relationship between users and their ratings on the interests.  These relationships are modelled using probabilistic networks or latent variable methods like component analysis (principal or independent) or singular value decomposition.  Hybrid approach. When little rating history is available, the neighborhood-based algorithms simply cannot operate.  Hence these hybrid systems start with algorithms that are more influenced by the structural models in the early stages, and move toward neighborhood-based models as more data accumulates. 154
  • 155.  Amazon's Personal Shopping Experience  First and foremost, Amazon is a shopping platform, where Amazon itself sells products and services and also lets third-party sellers sell their products.  Each shopper that comes to Amazon has some idea about the product that he/she wants to purchase, along with a budget for the cost of the product that he/she is ready to spend, as well as an expectation of the date by which the product must be delivered. 155
  • 156. Context Based Recommendation Suggesting other similar products that are cheaper than the product selected. Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. So, if the cost is the only aspect stopping the user from buying the selected item, the recommended item can solve the problem. Suggesting a similar product from a more popular brand. This might attract the user to buy a potentially more expensive product that comes from a more popular brand, and so is more reliable or of better quality. 156
  • 157.  Suggesting a product that has better customer reviews.  Suggesting a set of products that are typically bundled with the selected product.  For example suggesting carry bag or battery charger or a memory card when selected item is a digital camera. 157
  • 158. DESIGNING AND TUNING MODEL PIPELINES Model tuning is the experimental process of finding the optimal values of hyperparameters to maximize model performance. Hyperparameters are the set of variables whose values cannot be estimated by the model from the training data. These values control the training process. Then the available data is split into two or three sets  Training - used for training the model  Validation - optional validation set is used to tune the parameters  Testing - is used to predict the performance metrics of the algorithm 158
  • 159. Designing and Tuning Model Pipelines involves the following steps 1. Choosing the Technique or Algorithm 2. Choosing the Technique for Adult Salary Classification 3. Splitting the Data 4. Stratified Sampling 5. Training 6. Tuning the Hyperparameters 7. Accuracy Measurement 8. Explainability of Features 9. Practical Considerations 10. Data Leakage 11. Coincidence and Causality 12. Unknown Categories 159
  • 160. Choosing the Technique or Algorithm Let's consider the problem of binary classification of the adult salary data. Example: If we are dealing with a regression problem, then we are restricted to all the regression-type techniques, which would eliminate the clustering or recommendation algorithms, etc. Choosing the Technique for Adult Salary Classification Regression algorithms can always be used in classification applications with the addition of a threshold, but are less preferred. 160
  • 161.  Decision tree based methods are typically better suited for problems that have categorical features, as these algorithms inherently use the categorical information.  Most other algorithms like logistic regression, support vector machines, neural networks, or probabilistic approaches are better suited for numerical features.  A single decision tree can be used as one of the simplest possible starting algorithms.  A random forest of decision trees is an even better method. 161
  • 162. Splitting the Data The division into the three sets is typically done in percentages such as 60–20–20 or 70–15–15. Stratified Sampling Stratified sampling ensures a certain known distribution of classes in the split parts. When we are dealing with, say, n classes, the number of samples per class is not always the same. When the original distribution is unbalanced, there are two choices: 1. Ignore a number of samples from the classes that have more samples, to match the classes that have fewer samples. 2. Use the samples from the classes that have fewer samples repeatedly, to match the number of samples from the classes that have more samples. 162
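A short sketch of a stratified 70-15-15 split with scikit-learn; the synthetic, imbalanced dataset below stands in for real labelled data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for a real labelled dataset
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# 70% training, stratified so every split keeps roughly the same class proportions
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42)

# Split the remaining 30% evenly into validation (15%) and test (15%) sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150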
  • 163. Training Once the data is split into training and test sets, the training set is used to train the algorithm and build the model. Each algorithm has its own specific method associated with it for training. The aim of all the different techniques is to find the right set of parameters of the model that can map the input to the output, as far as the training data is concerned. The parameters of the model are also classified into two main types: 1. The parameters that can be computed using the training process to minimize the error in the prediction. 2. The parameters that cannot be directly computed using the training process. These parameters are also called hyperparameters. 163
  • 164. Tuning the Hyperparameters  The hyperparameters are typically a set of parameters that can be unbounded; theoretically one could choose any value between 1 and ∞.  In practice, bounds are created based on multiple constraints like computation requirements, dimensionality, and size of data, etc.  A single training set can be used to get results with one set of hyperparameters.  The training set is the only data available for training, and the trained model is applied on the validation set to compute the accuracy metrics. 164
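One common way to tune hyperparameters is a grid search with cross-validation on the training data; this is a hedged scikit-learn sketch with synthetic data and an illustrative, not exhaustive, grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate hyperparameter values; the bounds are a practical choice, not exhaustive
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))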
  • 165. Data Leakage If the data is static then there is a set of static algorithms that one can choose from. An example of a static problem could be classification of images into ones that contain cars and ones that don't. If the data is changing over time, it creates an additional layer of complexity. Example: Consider the business of an auto mechanic. Here are some of the columns in the data: Number of visits Method of payment Category of customer Amount of sale Year of manufacture of the car Make of the car Model of the car Miles on the odometer 165
  • 166. PERFORMANCE MEASUREMENT To evaluate the performance or quality of the model, different metrics are used, and these metrics are known as performance metrics or evaluation metrics Performance Metrics for Regression / Numerical error  Mean Absolute Error  Mean Squared Error  Normalized Error Performance Metrics for Classification / Categorical error  Accuracy  Precision and Recall  F-Score  Confusion Matrix  Receiver Operating Characteristics (ROC) Curve Analysis Steps in Hypothesis Testing  A/B Testing 166
  • 167. Performance Metrics for Regression / Numerical error 167
  • 168. Mean Absolute Error is a regressive loss measure looking at the absolute value difference between a model's predictions and ground truth, averaged out across the dataset. However, MAE weighs all errors equally and does not emphasize large deviations; to overcome this limitation, another metric can be used, which is Mean Squared Error or MSE. 168
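A small example computing both metrics with scikit-learn on made-up predictions, to show how MSE penalizes large errors more heavily than MAE.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average of |error|
mse = mean_squared_error(y_true, y_pred)    # average of error^2, emphasizes large errors
print(mae, mse)  # 0.75 and 0.875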
  • 169. 169
  • 170. 170
  • 171. Normalized Error In many cases, the above error metrics can produce an arbitrary number between -∞ and ∞. Normalizing the error maps it to a fixed range; typically the bounds used are (-1 to +1), (0 to 1), or (0 to 100). This way, even a single instance of normalized error can make sense on its own. All the above error definitions can have their own normalized counterpart. 171
  • 172. 172
  • 173. AZURE MACHINE LEARNING - AML Azure Machine Learning is a comprehensive machine learning platform that supports language model fine-tuning and deployment. Using the Azure Machine Learning model catalog, users can create an endpoint for Azure OpenAI Service and use REST APIs to integrate models into applications. 173
  • 174. Sign in with a Microsoft account (e.g., hotmail.com or outlook.com, etc.). 174
  • 175. 175
  • 176. 176
  • 177. UNIT – IV MACHINE LEARNING AND DATA ANALYTICS Machine Learning for Predictive Data Analytics Data to Insights to Decisions Data Exploration Information based Learning Similarity based learning Probability based learning Error based learning Evaluation The art of Machine learning to Predictive Data Analytics. 177
  • 178. MACHINE LEARNING FOR PREDICTIVE DATA ANALYTICS Predictive data analytics is the art of building and using models that make predictions about future outcomes based on patterns extracted from historical data. 178
  • 179. 179
  • 180. Applications: Price Prediction: Predictive analytics models can be trained to predict optimal prices based on historical sales records. Dosage Prediction: Doctors and scientists frequently decide how much of a medicine or other chemical to include in a treatment. Risk Assessment: Predictive analytics models can be used to predict the risk associated with decisions such as issuing a loan or underwriting an insurance policy. Diagnosis: Doctors, engineers and scientists regularly make diagnoses as part of their work. Typically, these diagnoses are based on their extensive training, expertise, and experience. 180
  • 181. Document Classification: Predictive data analytics can be used to automatically classify documents into different categories. Examples: Email spam filtering, News sentiment analysis, Customer complaint redirection, and Medical decision making. How is machine learning used in predictive analytics? Predictive analytics using machine learning algorithms can provide more accurate and precise predictions, automate decision-making processes, and scale up to handle large datasets and complex problems. Machine learning algorithms can provide more accurate predictions than traditional statistical models. 181
  • 182. 182
  • 183. 183
  • 184. Machine Learning Machine learning is defined as an automated process that extracts patterns from data. To build the models used in predictive data analytics applications, we use supervised machine learning. Supervised machine learning techniques automatically learn a model of the relationship between a set of descriptive features and a target feature based on a set of historical examples, or instances. 184
  • 185. 185
  • 186. Restriction bias constrains the set of models that the algorithm will consider during the learning process. Preference bias guides the learning algorithm to prefer certain models over others. Inductive bias is necessary for learning (beyond the dataset). There are two sources of information that guide this search: the training data and the inductive bias of the algorithm. Underfitting occurs when the prediction model selected by the algorithm is too simplistic to represent the underlying relationship in the dataset between the descriptive features and the target feature. Overfitting occurs when the prediction model selected by the algorithm is so complex that the model fits the data set too closely and becomes sensitive to noise in the data. 186
  • 187. The Predictive Data Analytics Project Lifecycle: Crisp-DM 187
  • 188. Cross Industry Standard Process for Data Mining (CRISP-DM). Key features of the CRISP-DM process that make it attractive to data analytics practitioners are that it is non-proprietary; it is application, industry, and tool neutral; and it explicitly views the data analytics process from both an application-focused and a technical perspective. Business Understanding: Predictive data analytics projects never start out with the goal of building a prediction model. Instead, they are focused on things like gaining new customers, selling more products, or adding efficiencies to a process. 188
  • 189. Data Understanding: Once the manner in which predictive data analytics will be used to address a business problem has been decided, it is important that the data analyst fully understands the different data sources available and the data they contain. Data Preparation: Building predictive data analytics models requires specific kinds of data, organized in a specific kind of structure known as an analytics base table (ABT). Modeling: Different machine learning algorithms are used to build a range of prediction models from which the best model will be selected for deployment. 189
  • 190. Evaluation: This phase Of CRISP-DM covers all the evaluation tasks required to show that a prediction model will be able to make accurate predictions after being deployed and that it does not suffer from overfitting or underfitting. Deployment: Machine learning models are built to serve a purpose within an organization, and the last phase of CRISP-DM covers all the work that must be done to successfully integrate a machine learning model into the processes within an organization. 190
  • 191. Predictive Analytics Tools Predictive Analytics Software Tools have advanced analytical capabilities like Text Analysis, Real-Time Analysis, Statistical Analysis, Data Mining, Machine Learning modeling and Optimization. Libraries for Statistical Modeling and Analysis Scikit-learn Pandas Statsmodels NLTK (Natural Language Processing Tool Kit) GraphLab Neural Designer Open-Source Analytical Tools SAP BusinessObjects IBM SPSS Halo Business Intelligence Dataiku DSS Weka 191
  • 192. DATA TO INSIGHTS TO DECISIONS Designing ABTs that properly represent the characteristics of a prediction subject is a key skill for analytics practitioners. An approach to first develop a set of domain concepts that describe the prediction subject, and then expand these into concrete descriptive features. Converting a business problem into an analytics solution Involves answering the following key questions: 1.What is the business problem? 2.What are the goals that the business wants to achieve? 3.How does the business currently work? 4.In what ways could a predictive analytics model help to address the business problem? 192
  • 193. Designing the Analytics Base Table (ABT) The basic structure in which we capture historical datasets is the analytics base table (ABT). Different data sources are typically combined to create an analytics base table. The general structure of an analytics base table:  Descriptive features  Target feature 193
  • 194. The different data sources typically combined to create an analytics base table. 194
  • 195. Prediction subject defines the basic level at which predictions are made, and each row in the ABT will represent one instance of the prediction subject One-row-per-subject is often used to describe this structure Each row in an ABT is composed of a set of descriptive features and a target feature A good way to define features is to identify the key domain concepts and then to base the features on these concepts. 195
  • 196. The hierarchical relationship between an analytics solution, domain concepts, and descriptive features. 196
  • 197.  Features in an ABT:  Raw features  Derived features: requires data from multiple sources to be combined into a set of single feature values Common derived feature types:  Aggregates  Flags  Ratios  Mappings There are a number of general domain concepts that are often useful:  Prediction Subject Details  Demographics  Usage  Changes in Usage  Special Usage  Lifecycle Phase  Network Links 197
  • 198. Example domain concepts for a motor insurance fraud claim prediction analytics solution 198
  • 199. Designing & Implementing Features Three key data considerations are particularly important when we are designing features Data availability, because we must have data available to implement any feature we would like to use. Example: In an online payments service scenario, we might define a feature that calculates the average of a customer’s account balance over the past six months. 199
  • 200. Timing: Timing with which data becomes available for inclusion in a feature. With the exception of the definition of the target feature, data that will be used to define a feature must be available before the event around which we are trying to make predictions occurs. Example: If we were building a model to predict the outcomes of soccer matches, we might consider including the attendance at the match as a descriptive feature; however, the actual attendance is not known until the match takes place, so it would not be available at the time a prediction needs to be made. 200
  • 201. Longevity: There is potential for features to go stale if something about the environment from which they are generated changes. Example, to make predictions of the outcome of loans granted by a bank, we might use the borrower’s salary as a descriptive feature. 201
  • 202. Propensity models: Many of the predictive models that we build are propensity models, which inherently have a temporal element Two key periods of propensity modeling: Observation period Outcome period 202
  • 203. 203
  • 204. Observation and outcome periods defined by an event rather than by a fixed point in time (each line represents a prediction subject and stars signify events). 204
  • 205. Case Study: Motor Insurance Fraud What are the observation period and outcome period for the motor insurance claim prediction scenario? The observation period and outcome period are measured over different dates for each insurance claim, defined relative to the specific date of that claim. The observation period is the time prior to the claim event, over which the descriptive features capturing the claimant’s behavior are calculated. The outcome period is the time immediately after the claim event, during which it will emerge whether the claim is fraudulent or genuine. What features could you use to capture the Claim Frequency domain concept? 205
  • 206. Example domain concepts for a motor insurance fraud prediction analytics solution 206
  • 207. DATA EXPLORATION Data exploration refers to the initial step in data analysis in which data analysts use data visualization and statistical techniques to describe dataset characterizations, such as size, quantity, and accuracy, in order to better understand the nature of the data. It involves scrutinizing datasets to uncover hidden patterns, outliers, and insights. Whether in business, healthcare, or research, data exploration serves as the compass guiding decision- makers. 207
  • 208. The Data Quality Report A data quality report includes tabular reports that describe the characteristics of each feature in an ABT using standard statistical measures of central tendency and variation. The tabular reports are accompanied by data visualizations: A histogram for each continuous feature in an ABT A bar plot for each categorical feature in an ABT. 208
  • 209. Purposes of Data Exploration: Understanding Data Structure: Data exploration helps in understanding the overall structure of the dataset, including the number of variables, data types, and the presence of missing values or outliers. Identifying Patterns and Relationships: Data exploration involves uncovering patterns and relationships within the data. This may include identifying correlations between variables, grouping data based on common characteristics, and visualizing data to reveal trends. 209
  • 210. Informing Further Analysis: The insights gained from data exploration inform further analysis, such as model building, hypothesis testing, and decision-making. Common Data Exploration Techniques: Descriptive Statistics: Summarizing data using measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation, range) provides an overview of the data distribution. Data Visualization: Creating visualizations, such as histograms, scatter plots, and box plots, helps visualize data distribution, identify patterns, and detect outliers. 210
  • 211. Data Profiling: Data profiling involves examining data quality metrics, such as data completeness, accuracy, consistency, and timeliness, to assess the overall quality of the dataset. Data Transformation: Data transformation techniques, such as normalization, scaling, and imputation, may be applied to prepare the data for further analysis. Feature Engineering: Feature engineering involves creating new features from existing data to improve the performance of machine learning models. 211
  • 212. Benefits of Data Exploration: Improves Data Understanding: Data exploration deepens the understanding of the data, its characteristics, and limitations, enabling better decision-making. Uncovers Hidden Patterns: Data exploration reveals patterns and trends that may not be evident by simply looking at raw data, informing further analysis and hypothesis generation. Identifies Data Quality Issues: Data exploration helps identify data quality issues, such as missing values, outliers, and inconsistencies, allowing for data cleaning and improvement. 212
  • 213. Facilitates Feature Selection: Data exploration guides feature selection, identifying relevant features for analysis and model building. Informs Model Development: Insights from data exploration inform the choice of appropriate machine learning algorithms and model selection. Data exploration is an essential step in the data analysis process, providing a foundation for making informed decisions, uncovering hidden insights, and building effective models. By thoroughly exploring the data, data scientists and analysts can extract valuable knowledge and drive meaningful outcomes. 213
  • 214. Histograms for different sets of data, each of which exhibits well-known, common characteristics: (a) Uniform (b) Normal (unimodal) (c) Unimodal (skewed right) (d) Unimodal (skewed left) (e) Exponential (f) Multimodal 214
  • 215. Three normal distributions with different means but identical standard deviations. 215
  • 216. Three normal distributions with identical means but different standard deviations. 216
  • 217. Identifying Data Quality Issues A data quality issue is loosely defined as anything unusual about the data in an ABT. The most common data quality issues are:  missing values  irregular cardinality  Outliers The data quality issues we identify from a data quality report will be of two types:  Data quality issues due to invalid data  Data quality issues due to valid data. 217
  • 218. INFORMATION BASED LEARNING Information-based learning (IBL) is a framework for learning that emphasizes the importance of information and its role in shaping knowledge and understanding. It is a constructivist approach that suggests that learners actively construct their own knowledge by making connections between new information and their existing knowledge base. 218
  • 219. Cards showing character faces and names for the Guess-Who game: (a) Brian (b) John (c) Aphra (d) Aoife. The descriptive features for each card are Man, Long Hair, Glasses, and Name. 219
  • 220. 220
  • 221. Fundamentals A decision tree consists of:  root node (or starting node),  interior nodes  leaf nodes (or terminating nodes).  each of the non-leaf nodes (root and interior) in the tree specifies a test to be carried out on one of the query’s descriptive features.  each of the leaf nodes specifies a predicted classification for the query. 221
  • 222. An email spam prediction dataset 222
  • 223. 223
  • 224. Shannon's Entropy Model  Claude Shannon's entropy model defines a computational measure of the impurity of the elements of a set.  An easy way to understand the entropy of a set is to think in terms of the uncertainty associated with guessing the result if you were to make a random selection from the set.  Entropy is related to the probability of an outcome.  High probability → Low entropy  Low probability → High entropy  If we take the log of a probability and multiply it by -1, we get this mapping! 224
  • 225. Information Gain The measure of informativeness that we will use is known as information gain and is a measure of the reduction in the overall entropy of a set of instances that is achieved by testing on a descriptive feature. 1. Compute the entropy of the original dataset with respect to the target feature. 2. For each descriptive feature, create the sets that result by partitioning the instances in the dataset using their feature values, and then sum the entropy scores of each of these sets. 3. Subtract the remaining entropy value (computed in step 2) from the original entropy value (computed in step 1) to give the information gain. 225
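A compact sketch of Shannon entropy and the three information-gain steps above, using a made-up categorical feature and a binary target.
import math
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the target levels present in the set
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Step 1: entropy of the whole set; Step 2: weighted entropy after partitioning
    total = len(labels)
    remaining = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remaining += (len(subset) / total) * entropy(subset)
    # Step 3: information gain = original entropy - remaining entropy
    return entropy(labels) - remaining

# Made-up data: does a "suspicious words" flag predict spam?
suspicious = ['yes', 'yes', 'no', 'no', 'yes', 'no']
spam       = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']
print(information_gain(suspicious, spam))  # 1.0 — the feature perfectly separates the classes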
  • 226. SIMILARITY BASED LEARNING  Feature space: each descriptive feature has its own dimensional axis. It is an abstract m-dimensional space that is created by making each descriptive feature in a dataset an axis of an m-dimensional coordinate system and mapping each instance in the dataset to a point in this coordinate space based on the values of its descriptive features.  Working of feature space: if the values of the descriptive features of two or more instances in a dataset are the same, then these instances will be mapped to the same point in the feature space, and vice versa.  The distance between two points in the feature space is a useful measure of the similarity of the descriptive features of the two instances. 226
  • 227. Metric(a,b): is a real-valued function that returns the distance between two points a and b in the feature space. It has the following properties: Non-negativity Identity Symmetry Triangular inequality Two examples of distance metrics: Euclidean distance Manhattan distance (taxi-cab distance) = sum of absolute differences. Minkowski distance: a family of distance metrics based on differences between features. 227
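The family of metrics can be written directly; with p = 2 Minkowski reduces to Euclidean distance and with p = 1 to Manhattan distance (the points are toy values).
import numpy as np

def minkowski(a, b, p=2):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

a, b = [1.0, 2.0], [4.0, 6.0]
print(minkowski(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(minkowski(a, b, p=1))  # Manhattan: |3| + |4| = 7.0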
  • 228. Nearest Neighbor algorithm: When this model is used to make a prediction for a new instance, the distance in the feature space between the query instance and each instance in the dataset is computed, and the prediction returned by the model is the target feature level of the dataset instance that is nearest to the query in the feature space. It stores the entire training dataset in memory, which has a negative effect on the time complexity of the algorithm. Algorithm: Nearest neighbor algorithm. Require: a set of training instances Require: a query instance 228
  • 229. Iterate across the instances in memory to find the nearest neighbor—this is the instance with the shortest distance across the feature space to the query instance. Make a prediction for the query instance that is equal to the value of the target feature of the nearest neighbor. Decision boundary: It is the boundary between regions of the feature space in which different target levels will be predicted. It is generated by aggregating the neighboring local models (Voronoi regions) that make the same prediction 229
  • 230. Noise effects: Using the Kronecker delta approach: the nearest neighbor algorithm is sensitive to noise because any errors in the description or labeling of training data result in erroneous local models and incorrect predictions. One way to mitigate against noise is to modify the algorithm to return the majority target level within the set of k nearest neighbors to the query q. Using the weighted k nearest neighbor approach: each neighbor's contribution to the prediction is weighted, typically in inverse proportion to its distance from the query. Efficient Memory Search: Assuming that the training dataset will remain relatively stable, the time issue can be offset by investing in a one-off computation to create an index of the instances that enables efficient retrieval of the nearest neighbors without doing an exhaustive search of the entire training dataset. 230
  • 231. k-d tree: It stands for k-dimensional tree, which is a balanced binary tree in which each of the nodes in the tree indexes one of the instances in the training dataset. The tree is constructed so that nodes that are nearby in the tree index training instances that are nearby in the feature space. 231
  • 232. PROBABILITY BASED LEARNING It is a machine learning technique that uses probability theory to make predictions or decisions. It's a statistical approach that models uncertainty in data using probability distributions. A probability function, P(), returns the probability of a feature taking a specific value. A joint probability refers to the probability of an assignment of specific values to multiple different features A conditional probability refers to the probability of one feature taking a specific value given that we already know the value of a different feature 232
  • 233.  A probability distribution is a data structure that describes the probability of each  possible value a feature can take. The sum of a probability distribution must equal 1.0  A joint probability distribution is a probability distribution over more than one feature assignment and is written as a multi-dimensional matrix in which each cell lists the probability of a particular combination of feature values being assigned  The sum of all the cells in a joint probability distribution must be 1.0. 233
  • 234. Bayes' Theorem: The conditional probability of an event, X, given some evidence, Y, is defined in terms of the product of the inverse conditional probability, P(Y|X), and the prior probability of the event, P(X). 234
  • 235. 235
  • 236. Bayesian Prediction We generate the probability of the event that a target feature, t, takes a specific level, l, given the assignment of values to a set of descriptive features, q, from a query instance. We can restate Bayes’ Theorem using this terminology and generalize the definition of Bayes’ Theorem so that it can take into account more than one piece of evidence(each descriptive feature value is a separate piece of evidence). 236
  • 237. 237
  • 238. 1. P(t = l), the prior probability of the target feature t taking the level l 2. P(q[1], …, q[m]), the joint probability of the descriptive features of a query instance taking a specific set of values 3. P(q[1], …, q[m] | t = l), the conditional probability of the descriptive features of a query instance taking a specific set of values given that the target feature takes the level l Conditional Independence and Factorization If knowledge of one event has no effect on the probability of another event, and vice versa, then the two events are independent of each other. If two events X and Y are independent then: P(X|Y) = P(X) and P(X, Y) = P(X) × P(Y) 238
  • 239. ERROR BASED LEARNING We perform a search for a set of parameters for a parameterized model that minimizes the total error across the predictions made by that model with respect to a set of training instances. The key ideas are a parameterized model, measuring error, and an error surface. Simple Linear Regression It is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear, or a sloped straight line, hence it is called Simple Linear Regression. 239
  • 240. The key point in Simple Linear Regression is that the dependent variable must be a continuous value. However, the independent variable can be measured on continuous or categorical values. Measuring Error There are many different kinds of error functions, but for measuring the fit of simple linear regression models, the most commonly used is the sum of squared errors error function, or L2. To calculate L2 we use our candidate model to make a prediction for each member of the training dataset and then calculate the error (or residual) between these predictions and the actual target feature values in the training set. 240
  • 241. (a) A scatter plot of the SIZE and RENTAL PRICE features from the office rentals dataset; (b) the scatter plot from (a) with a linear model relating RENTAL PRICE to SIZE overlaid 241
  • 242. Error Surface: Here, each pair of weights w[0] and w[1] defines a point on the x-y plane, and the sum of squared errors for the model using these weights determines the height of the error surface above the x-y plane for that pair of weights. The x-y plane is known as a weight space, and the surface is known as an error surface. 242
  • 243. EVALUATION When evaluating machine learning (ML) models, the question that arises is whether the model is the best model available from the model’s hypothesis space in terms of generalization error on the unseen / future data set. Hold-out method for Model Evaluation The hold-out method for model evaluation represents the mechanism of splitting the dataset into training and test datasets. The model is trained on the training set and then tested on the testing set to get the most optimal model. 243
  • 244. The hold-out method for model evaluation Generally, a 70-30% split is used for splitting the dataset, where 70% of the dataset is used for training and 30% of the dataset is used for testing the model. 244
  • 245. The following is the process of using the hold-out method for model evaluation: Split the dataset into two parts (preferably based on a 70-30% split; however, the percentage split will vary) Train the model on the training dataset; while training the model, some fixed set of hyperparameters is selected. Test or evaluate the model on the held-out test dataset Train the final model on the entire dataset to get a model which can generalize better on the unseen or future dataset. 245
  • 246. THE ART OF MACHINE LEARNING TO PREDICTIVE DATAANALYTICS Predictive data analytics projects use machine learning to build models that capture the relationships in large datasets between descriptive features and a target feature. A specific type of learning, called inductive learning, is used, where learning entails inducing a general rule from a set of specific instances. Predictive analytics project can use CRISP-DM process to manage a project through its lifecycle. 246
  • 247. The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases: 1.Business understanding – What does the business need? 2.Data understanding – What data do we have / need? Is it clean? 3.Data preparation – How do we organize the data for modeling? 4.Modeling – What modeling techniques should we apply? 5.Evaluation – Which model best meets the business objectives? 6.Deployment – How do stakeholders access the results? 247
  • 248. UNIT – V APPLICATIONS OF MACHINE LEARNING Image Recognition Speech Recognition Email spam and Malware Filtering Online fraud detection Medical Diagnosis. 248
  • 249. IMAGE RECOGNITION Image recognition is the capability of a system to understand and interpret visual information from images or videos. Image recognition in machine learning often involves the use of deep learning techniques, particularly convolutional neural networks (CNNs), which have been highly successful in this domain. Digital images contain pixels, which are the smallest units of a screen that help in image formation. There are different types of image formats like JPG, JPEG, GIF, PNG, etc. Images have their own role in a digital world that includes various fields like communication, science, art, and entertainment. 249
  • 251. Artificial intelligence and machine learning (AI/ML) are used in computer vision applications to handle this data properly for monitoring, detection, categorization, object identification, and facial recognition. Learning Based Image Recognition Data Collection: A large dataset of labeled images is collected. Each image is associated with a label that describes what's in the image (e.g., "cat," "car," "beach"). Preprocessing: Images are typically resized and normalized to a standard size, and any color information is converted into a numerical format. 251
  • 252. Convolutional Neural Network (CNN): A CNN is used as the primary architecture for image recognition. CNNs consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers. These layers extract features from the input images. Training: The CNN is trained on the labeled dataset. During training, the network adjusts its internal parameters (weights and biases) to learn to recognize patterns and features in the images that are associated with the correct labels. Testing and Inference: After training, the CNN can be used to make predictions on new, unlabeled images. It produces a probability distribution over possible labels for each image. 252
  • 253. Post-processing: The predicted probabilities can be converted into a final classification or label using techniques like softmax. Top-k classification is often used to provide multiple likely labels. The field of image recognition in machine learning has seen significant advancements, and it's applied in a wide range of applications, including object detection, facial recognition, medical image analysis, and autonomous vehicles. A simple example in Python:
import tensorflow as tf
from tensorflow import keras
253
  • 254.
# Load and preprocess the dataset (e.g., CIFAR-10)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to between 0 and 1

# Build the CNN model
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(10)  # output layer (10 CIFAR-10 classes); added to complete the truncated slide
])
254
  • 255. # Compile the model
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")

This code loads the CIFAR-10 dataset, creates a simple CNN model, compiles it, and then trains and evaluates the model. You can replace the dataset with your own image data and adjust the model architecture for your specific image recognition task. 255
  • 256. Application of Image Recognition
Identifying Fraudulent Accounts: Examining fake social media profiles is among the most significant applications of image recognition.
Facial Recognition and Security Systems: Image recognition is one of the most critical components in the security industry. Image recognition algorithms can also help marketers learn about a person's identity, gender, and mood.
Reverse Image Search: Reverse image search is a technique that allows you to search by image instead of by text, often for free. 256
  • 257. Help Police Officials to Solve Cases: You might be surprised to learn that government agencies use image recognition. Today, police and other security agencies frequently use image recognition technology to identify persons in recordings or photographs.
Empowers E-commerce Businesses: Image recognition is now commonly used in the e-commerce business, and the visual search industry has grown significantly in recent years. 257
  • 258. SPEECH RECOGNITION Speech recognition is a machine learning technology that uses AI to interpret spoken words and convert them into text. It works by breaking speech down into sounds, analyzing each sound, and then using an algorithm to find the most likely word for each sound. In Python, speech recognition can be done with programs that take input from the microphone, process it, and convert it into text. 258
  • 259. Working of Speech Recognition (diagram on the original slide) 259
  • 260. How Does Speech Recognition work? Acoustic modeling is used to recognize the phonemes in our speech, which are then assembled into larger units of speech such as words and sentences. Speech recognition starts by taking the sound energy produced by the person speaking and converting it into electrical energy with the help of a microphone. It then converts this signal from analog to digital, and finally into text. The audio data is broken down into sounds, and algorithms analyze those sounds to find the most probable word that fits the audio. All of this is done using Natural Language Processing and Neural Networks. Hidden Markov Models can be used to find temporal patterns in speech and improve recognition accuracy. 260
  • 261. Picking and Installing a Speech Recognition Package 261

Package: apiai
Functionality: Includes natural language processing for identifying a speaker's intent
Installation: $ pip install apiai

Package: google-cloud-speech
Functionality: Offers basic speech-to-text conversion
Installation: $ pip install virtualenv
              virtualenv <your-env>
              <your-env>\Scripts\activate
              <your-env>\Scripts\pip.exe install google-cloud-speech

Package: SpeechRecognition
Functionality: Offers easy audio processing and microphone accessibility
Installation: $ pip install SpeechRecognition

Package: watson-developer-cloud
Functionality: An artificial intelligence API that makes creating, debugging, running, and deploying APIs easy
Installation: $ pip install --upgrade watson-developer-cloud
  • 262. The SpeechRecognition library allows:
• Easy speech recognition from the microphone.
• Easy transcription of an audio file.
• Saving audio data into an audio file.
• Displaying recognition results in an easy-to-understand format.
Speech Recognition in Python: Converting Speech to Text
The goal is to create a program that takes audio as input and converts it to text. 262
  • 264. Converting speech to text: now use the microphone to get audio input from the user in real time, recognize it, and print it as text, as in the sketch below. 264
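A minimal sketch of this step using the third-party SpeechRecognition package (pip install SpeechRecognition, plus PyAudio for microphone access). The code below is an illustration, not the original slide code; sending the audio to Google's free web API via recognize_google is one common choice among several recognizers the package supports.

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Compensate for background noise before listening
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Speak now...")
    audio = recognizer.listen(source)

try:
    # Send the captured audio to Google's free web speech API
    text = recognizer.recognize_google(audio)
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("API request failed:", e)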
  • 265. EMAIL SPAM AND MALWARE FILTERING
What do we know about spam? Annoying emails? Yahoo Mail!
Spam-based advertising is a huge business.
Email spam is a collection of unwanted, unsolicited messages sent in bulk to a large number of recipients. It can be sent for commercial purposes or for other reasons, and it can be sent by humans or by botnets, which are networks of infected computers.
  • 266. Spam is a complex ecosystem.
Technical side: name servers, email servers, web pages, etc.
Business side: payment processing, merchant bank accounts, customer service, and fulfillment.
Previous work studied each of these elements in isolation: dynamics of botnets, DNS fast-flux networks, web site hosting, spam filtering, URL blacklisting, and site takedown.
This work quantifies the full set of resources employed to monetize spam email, including naming, hosting, payment, and fulfillment.
  • 267. Main Parts
• Advertising
• Click support
• Redirect sites
• Third-party DNS, spammers' DNS
• Webservers
• Affiliate programs
• Realization
• Payment services
• Fulfillment 267
  • 270. Discussion: With this big picture, what do you think are the most effective mechanisms to defeat spam? 270
  • 271. Malware, short for malicious software, is software designed to damage or disrupt a computer, server, or network. Cybercriminals, also known as hackers, create malware to steal data, gain access to systems, or sabotage devices. 271
  • 272. A bot is a software application that can perform repetitive tasks automatically, without the need for human intervention. Bots can be designed to imitate human behavior, but they are often faster and more accurate than humans. 272
  • 273. 273
• A spam detector is used to detect unwanted, malicious, and virus-infected texts and helps to separate them from non-spam texts.
• It uses binary classification with the labels 'ham' (non-spam) and 'spam'. An application of this can be seen in Google Mail (Gmail), which segregates spam emails to prevent them from reaching the user's inbox.
• In this machine learning spam filtering application, we will develop a spam detector app using the Support Vector Machine (SVM) technique for classification and Natural Language Processing.
  • 274. Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression problems. The main objective of SVM is to find a hyperplane in an N-dimensional space (where N is the total number of features) that separates the data points.
Hyperplane and Support Vectors
• A hyperplane is a boundary that helps to separate and group the data points into particular classes. 274
  • 275. SVM works very well on linearly separable data. 275
  • 276. Natural Language Processing (NLP) is an Artificial Intelligence (AI) field that enables computer programs to recognize, interpret, and manipulate human languages.
Application Structure:
• spam.csv: the dataset for our project. It contains a label ("ham" or "spam") and the email text.
• spamdetector.py: loads the dataset and trains our classifier.
• training_data.pkl: contains the trained classifier in binary format, which is used to predict the output.
• SpamGui.py: the GUI file for our project, where we load the trained classifier and predict the output for a given message. 276
  • 277. Steps for developing a Spam Detector:
1. Import libraries and initialize variables
2. Preprocessing the data
3. Bag of Words (vectorization)
4. Training the model
5. Prediction using a Graphical User Interface
A minimal sketch of steps 1–4 is shown below. 277
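The sketch below assumes spam.csv has two columns, called here "Label" ("ham"/"spam") and "EmailText" (the actual column names may differ). It uses scikit-learn's CountVectorizer for the bag-of-words step and an SVM classifier, and saves the fitted objects to training_data.pkl for the GUI to load; it is an illustration of the approach, not the original project code.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv("spam.csv")

# Bag of Words: turn each email into a vector of word counts
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(data["EmailText"])
y = data["Label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
model = SVC(kernel="linear")
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Save the fitted vectorizer and classifier for SpamGui.py to load later
with open("training_data.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)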
  • 278. 278
  • 279. 279
  • 280. ONLINE FRAUD DETECTION Credit card fraud has become easier to commit as e-commerce and many other online services have expanded online payment options, which in turn increases the risk of online fraud. 280
  • 281. 281
step: the unit of time of the transaction
type: the type of transaction performed
amount: the total amount of the transaction
nameOrg: the account that starts the transaction
oldbalanceOrg: the balance of the sender's account before the transaction
newbalanceOrg: the balance of the sender's account after the transaction
nameDest: the account that receives the transaction
oldbalanceDest: the balance of the receiver's account before the transaction
newbalanceDest: the balance of the receiver's account after the transaction
isFraud: the value to be predicted, i.e. 0 or 1
  • 282. The libraries used are:
• Pandas: helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
• Seaborn/Matplotlib: for data visualization.
• NumPy: NumPy arrays are very fast and can perform large computations in a very short time. 282
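A minimal loading-and-exploration sketch using these libraries; the file name "onlinefraud.csv" is a placeholder, and the column names follow the feature table above.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("onlinefraud.csv")
print(df.shape)
print(df["type"].value_counts())

# Fraud cases are typically a tiny fraction of all transactions
print(df["isFraud"].value_counts(normalize=True))

# Visualize how transaction amounts differ between the two classes
sns.boxplot(x="isFraud", y="amount", data=df)
plt.show()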
  • 283. The Importance of Fraud Detection
• Financial Losses
• Customer Trust
• Legal and Regulatory Compliance
• Operational Efficiency
The Role of Machine Learning Algorithms: Traditional Fraud Detection Algorithms
• The traditional machine learning algorithms most commonly used for fraud detection are logistic regression, decision trees, and random forests.
• These models have had a significant impact on fraud detection processes, providing organizations with practical predictive capabilities. 283
  • 284. Logistic regression, for example, is a statistical model that has been widely used for its simplicity and efficiency. Decision trees make predictions based on a set of decision rules, and random forests combine multiple decision trees to produce a final prediction. Complex fraud patterns, however, involve non-linear relationships, high-dimensional data, and evolving tactics. As a result, traditional approaches face:
• Feature Engineering Challenges
• Imbalanced Data
• Dynamic Fraud Behavior 284
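A short sketch comparing the three traditional baselines mentioned above on a synthetic, heavily imbalanced dataset (a stand-in for real transaction features).

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# ~2% positive (fraud) class to mimic the imbalance seen in practice
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=6),
    "Random Forest": RandomForestClassifier(n_estimators=200),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    print(name, "ROC AUC:", round(roc_auc_score(y_test, scores), 3))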
  • 285. New Approaches: Machine Learning Algorithms
1. Gradient Boosting Machines (GBM)
How GBM Works: Gradient Boosting Machines (GBM) is an ensemble learning technique that combines multiple weak learners (typically shallow decision trees) to create a strong predictive model. Each new tree is fitted to the residual errors of the ensemble built so far, and its contribution is scaled by a learning rate; the final prediction is the sum of the contributions of all trees. 285
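A minimal GBM sketch using scikit-learn's GradientBoostingClassifier on synthetic imbalanced data; a real system would be trained on actual transaction features.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Each new tree is fitted to the residuals of the ensemble built so far,
# scaled by the learning rate
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbm.fit(X_train, y_train)
print(classification_report(y_test, gbm.predict(X_test)))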
  • 286. 2. Neural Networks
Application of Neural Networks (Deep Learning) in Fraud Detection
• One of the most advanced tools in the machine learning arsenal is the neural network, an algorithmic structure modeled after the human brain.
• Neural networks, in essence, are computing systems made up of interconnected artificial neurons, or nodes.
• In the landscape of fraud detection, neural networks, particularly deep learning neural networks, hold significant potential. 286
  • 287.
• Transaction Sequences: Recurrent neural networks (RNNs) can analyze sequences of transactions over time, capturing temporal dependencies.
• Image-Based Fraud Detection: Convolutional neural networks (CNNs) process images (e.g., scanned checks, ID cards) to detect anomalies.
• Ensemble Approaches: Stacking neural networks with other models (e.g., GBM) can enhance overall fraud detection accuracy. 287
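As a minimal illustration, the sketch below trains a small feed-forward network on synthetic transaction-style features; the sequence and image models mentioned above (RNNs, CNNs) follow the same pattern with different layer types.

from sklearn.datasets import make_classification
from tensorflow import keras

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of fraud
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])

# class_weight makes the rare fraud class count more during training
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.2,
          class_weight={0: 1.0, 1: 50.0})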
  • 288. 3. XGBoost: Extreme Gradient Boosting XGBoost (Extreme Gradient Boosting) is another advanced machine learning algorithm that is creating ripples in the field of fraud detection. This algorithm stands out due to its flexibility, high accuracy, and effectiveness in dealing with large and complex datasets. 288
  • 289.
• Mobile Payment Fraud Detection: Researchers proposed an XGBoost-based framework for mobile payment fraud detection. By integrating unsupervised outlier detection algorithms and an XGBoost classifier, they achieved excellent results on a large dataset of over 6 million mobile transactions.
• Credit Card Fraud Detection: Another study applied XGBoost to credit card transaction data, using outlier detection based on distance sum to identify fraudulent transactions effectively.
• State Grid Corporation of China (SGCC) Dataset: Researchers explored integrating Genetic Algorithms with XGBoost for fraud detection in the SGCC dataset. This integration showed promising results and opens avenues for further research. 289
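A minimal sketch with the xgboost package (pip install xgboost) on synthetic imbalanced data; scale_pos_weight is one common way to compensate for the rarity of fraud cases.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Weight the positive (fraud) class by the ratio of negatives to positives
clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum())
clf.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))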
  • 290. 4. Isolation Forest: Unleashing Anomaly Detection
What is Isolation Forest? The Isolation Forest (iForest) is a unique and compelling machine learning algorithm designed for anomaly detection. It isolates observations by randomly partitioning the feature space; anomalies are easier to isolate and therefore need fewer splits. This makes it effective at identifying fraudulent transactions and at handling high-dimensional data. 290
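A minimal Isolation Forest sketch with scikit-learn on synthetic data; the model is unsupervised, so it never sees the fraud labels and simply flags the observations that are easiest to isolate.

from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)

# contamination is the expected fraction of anomalies in the data
iso = IsolationForest(n_estimators=200, contamination=0.02, random_state=0)
preds = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
print("Flagged as anomalous:", (preds == -1).sum())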
  • 291. 5. Autoencoders: Unleashing Latent Representations
Autoencoders are essentially data compression algorithms where the compression and decompression functions are data-specific, lossy, and learned automatically from examples. They are a type of self-organizing system that learns to represent (encode) the input data in a way that highlights its key features and patterns.
Key Features
• Smart Adaptive Machine Learning
• Smart Surveillance and Automation
• Flexible Customizable Rules
• Compliance Assurance 291
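A minimal Keras autoencoder sketch on synthetic "normal" transactions; in practice the model is trained on legitimate transactions only, and new transactions with unusually high reconstruction error are flagged for review.

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
normal = rng.normal(size=(10000, 20)).astype("float32")   # stand-in for normal transactions

autoencoder = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(8, activation="relu"),   # compressed (latent) representation
    keras.layers.Dense(20, activation=None),    # reconstruction of the input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(normal, normal, epochs=10, batch_size=128, verbose=0)

# Score new transactions by their reconstruction error
new_tx = rng.normal(size=(5, 20)).astype("float32")
errors = np.mean((autoencoder.predict(new_tx) - new_tx) ** 2, axis=1)
print("Reconstruction errors:", errors)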
  • 292. The Power of Advanced Algorithms
• Accuracy: Machine learning algorithms such as XGBoost, neural networks, and autoencoders offer high accuracy in identifying fraudulent activities.
• Adaptability: These algorithms adapt to changing fraud patterns, ensuring continuous protection. 292