Introduction to
Machine Learning
Girish Gore
Introducing the Speaker
• Girish Gore : 10+ Years of Experience in Data Analytics / Data Science
• B.E. Computer Science from VIT Pune , M.S. from BITS Pilani
• Spent Time on Data Products Mainly In companies like
• Cognizant (Innovations Group)
• SAS (Pricing & Revenue Management)
• VuClip (Video Entertainment)
• Shoptimize (E-Commerce)
• Worked in fields like
• Text Mining
• Forecasting and Optimization
• Recommender Systems
Knowing the Audience
Average Experience in Industry ?
Average ML Experience ?
Understanding Terminologies
Artificial Intelligence
AI involves machines that can perform tasks that are characteristic of human
intelligence.
Machine Learning
Machine learning is an application of artificial intelligence (AI) that provides
systems the ability to automatically learn and improve from experience without
being explicitly programmed.
Deep Learning
Deep Learning is one of many approaches to machine learning. It uses neural
networks with many layers, loosely inspired by the workings of the brain.
The Hierarchy
Traditional Programming vs Machine Learning
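• If Programming automates processes,
Machine Learning automates Program
generation, i.e. it automates the Automation.
• Traditional programming: data and a program are run on the computer to
produce output.
• Machine learning: data and the desired output are run on the computer to
create a program. This program can then be used
in traditional programming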
What is Machine Learning ?
• Machine Learning is
• study of algorithms that
• improve their performance at a particular task
• with experience ( previous data , output)
• Optimize a performance criterion using example data or past experience
• Role of Computer Science : Efficient Algorithms
• Solve the optimization problem
• Represent and Evaluate the model for inference
Why are we here Now !!! Google Trends !!
• Exponential increase in Data generation , accumulation
• Increasing computational power
• Growing progress in available algorithms and Research
• Software becoming too complex to write by hand
Common Applications of Machine Learning
• Web search: ranking pages based on what you are most likely to click on.
• Finance: deciding who to send which credit card offers to, evaluating risk on credit
offers, and deciding where to invest money.
• E-commerce: predicting customer churn; detecting whether or not a transaction is fraudulent.
• Robotics: handling uncertainty in new environments, e.g. autonomous, self-driving cars.
• Information extraction: asking questions over databases across the web.
• Social networks: data on relationships and preferences; machine learning extracts value
from this data.
• Debugging: use in computer science, especially in labor-intensive processes like
debugging, where a model could suggest where the bug might be
• Gaming, IBM Watson
Types Of Machine Learning
• Learning Associations
• Supervised Learning
• Regression
• Classification
• Unsupervised Learning
• Reinforcement Learning
• Semi-supervised Learning
• Training data includes a few desired outputs. Sits between supervised and unsupervised learning
Learning Associations
• Market Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y
are products/services.
Example: P ( diaper| beer ) = 0.7
TransactionID   BasketItems
1               Bread, Milk
2               Bread, Diaper, Beer, Eggs
3               Milk, Diaper, Beer, Coke
4               Bread, Milk, Diaper, Beer
5               Bread, Milk, Diaper, Coke
Learning Associations
• Support : The probability that a customer buys diaper and beer together,
among all sales transactions (the higher the support, the better)
• Confidence : Given that a customer picks up diaper, how likely is he/she
to also buy beer? (The closer to 1, the better)
• Lift : Compares our model against a naive model that assumes the two purchases
are independent: how much more likely is a customer to buy both together than
would be expected if they were bought separately? (Lift > 1 indicates a positive association)
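As a rough illustration, the three measures can be computed for the rule diaper → beer on the five baskets listed in the transaction table. This is only a sketch in base R; the helper names `support` and `rule_metrics` are made up here and do not come from any package.

```r
# The five baskets from the transaction table
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Support: share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Metrics for the rule x -> y (hypothetical helper for this sketch)
rule_metrics <- function(x, y) {
  supp <- support(c(x, y))      # P(X and Y)
  conf <- supp / support(x)     # P(Y | X)
  lift <- conf / support(y)     # P(Y | X) / P(Y)
  c(support = supp, confidence = conf, lift = lift)
}

rule_metrics("Diaper", "Beer")
# support 0.60, confidence 0.75, lift 1.25
```

On these five baskets the lift is above 1, so diaper buyers pick up beer more often than the overall beer rate would suggest.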
Supervised Learning
• Supervised Learning is a Machine Learning task of inferring a generalized function
from labelled training data. Training data includes desired outputs.
Example: Spam Detection , Credit Scoring , Face Detection
• In Supervised Learning for spam detection we have
• Email Contents with Labels marking Spam or Non Spam
• Task is to label newer emails
• Two main types of Supervised Learning Problems
• Regression
• Classification
Supervised Learning
• Regression Problems
• Maps input data to a continuous prediction variable
• Example: Predicting Retail house prices (Price as continuous variable)
• Classification Problems
• Maps input data to a set of predefined classes
• Example: Benign or Malignant Tumours
Regression : House Price Prediction
• We have historic data about size of house and the price for last 1 year
• Task is to predict the Price of House given its size
• Model Derivation:
Price = Slope of Line * Size + Constant
Classification : Credit Scoring
We have labelled data of low and high risk customers.
Task is differentiating between low-risk and high-risk customers from their
income and savings.
Model Derivation:
IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
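The rule above could be written as a small R function. This is a minimal sketch: the function name `classify_risk` and the threshold and customer values are hypothetical, chosen only to show the shape of the rule. In practice θ1 and θ2 would be learned from the labelled data.

```r
# IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
classify_risk <- function(income, savings, theta1, theta2) {
  ifelse(income > theta1 & savings > theta2, "low-risk", "high-risk")
}

# Two made-up customers and made-up thresholds, for illustration only
classify_risk(income  = c(60000, 20000),
              savings = c(15000, 30000),
              theta1  = 30000,
              theta2  = 10000)
# "low-risk" "high-risk"
```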
Unsupervised Learning
• Training data does not include desired output.
Task is to find hidden structure in unlabeled data
• Common Approaches to Unsupervised Learning
• Clustering or Segmentation (Customer Segmentation)
• Dimensionality Reduction (PCA (Principal Component Analysis), SVD
(Singular Value Decomposition))
• Summarization
Unsupervised Learning
• Customer Segmentation: Helps marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs.
• The clustering algorithm
forms 3 different groups of
customers to target.
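A segmentation like this could be sketched with k-means from base R. The slides do not say which clustering algorithm was used, so k-means is an assumption. The customer data below is synthetic, and the columns `spend` and `visits` are invented purely for illustration.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic customers drawn around three made-up centres (not real data)
centres <- data.frame(spend = c(200, 600, 1000), visits = c(2, 6, 10))
customers <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  data.frame(spend  = rnorm(20, mean = centres$spend[i],  sd = 30),
             visits = rnorm(20, mean = centres$visits[i], sd = 0.5))
}))

# Scale the features so neither dominates the distance, then ask for 3 groups
scaled <- scale(customers)
fit <- kmeans(scaled, centers = 3, nstart = 10)

table(fit$cluster)                                       # customers per segment
aggregate(customers, by = list(segment = fit$cluster), FUN = mean)
```

No labels are supplied anywhere; the groups come only from the structure of the data.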
Reinforcement Learning
• Learning from interaction with the environment to achieve a goal.
Rewards from a sequence of actions.
• Every Action yields feedback from the environment:
• a Reward and/or
• an Observation
• Examples
• Self Driving Cars
• Recommender Systems
• Stanford Research Link
https://www.cs.utexas.edu/~eladlieb/RLRG.html
ML – Data Science Relationship
Supervised Learning
Linear Regression
Linear Regression
• In statistics, linear regression is an approach for modeling the
relationship between a scalar dependent variable y and one or more
explanatory variables (or independent variables) denoted X
• The case of one explanatory variable is called simple linear
regression
• For more than one explanatory
variable, the process is
called multiple linear regression
https://en.wikipedia.org/wiki/Linear_regression
From School Book :
Linear Equations
Y = mX + b
m = Slope = Change in Y / Change in X
b = Y-intercept
Linear Regression : A Common Example
Ohm’s Law:
• In physics, it is observed that the relationship between Voltage (V), Current (I)
and Resistance (R) is a linear relationship expressed as
V = I * R
I = V / R
• In a circuit, for a given Resistance R,
as you increase the Voltage V,
the Current I increases proportionately
http://www.electronics-tutorials.ws/dccircuits/dcp_1.html
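To make the proportionality concrete, here is a small sketch that applies I = V / R for a fixed resistance and checks that the slope of current against voltage is constant at 1 / R. The resistance and voltage values are hypothetical.

```r
resistance <- 10                    # hypothetical resistance in ohms
voltage    <- c(1, 2, 5, 10, 20)    # hypothetical applied voltages in volts
current    <- voltage / resistance  # Ohm's law: I = V / R, in amperes

# Slope between consecutive points: change in I / change in V
slope <- diff(current) / diff(voltage)

data.frame(voltage, current)
slope   # every entry equals 1 / resistance, i.e. the line I = (1/R) * V
```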
Sample Monthly Income-Expense Data of a Household
Monthly Income (in Rs.)   Monthly Expense (in Rs.)
5,000                     8,000
6,000                     7,000
10,000                    4,500
10,000                    2,000
12,500                    12,000
14,000                    8,000
15,000                    16,000
18,000                    20,000
19,000                    9,000
20,000                    9,000
20,000                    18,000
22,000                    25,000
23,400                    5,000
24,000                    10,500
24,000                    10,000
We have to find the relationship between Income and Expenses
of a household
[Figure: scatter plot "Income Vs. Expense", Monthly Income (x-axis) against Monthly Expense (y-axis), with fitted trend line y = 0.3008x + 6319.1, R² = 0.4215]
Line of Best Fit
[Figure: scatter plot "Income Vs. Expense", Monthly Income against Monthly Expense, with several candidate lines drawn through the points]
Which of these lines best describes the relationship
between Household Income and Expenses ?
[Figure: scatter plot "Income Vs. Expense", Monthly Income against Monthly Expense, with the fitted line and the errors e_m and e_n marked for two points]
The Line of Best Fit will be the one where the Sum of Square of Error (SSE) term
is minimum (OLS Technique)
Error: $e_m = y_m - \hat{y}_m$
$\hat{Y}_i = b_0 + b_1 X_i$ is the sample regression equation
$$SSE = \sum_i \hat{e}_i^2 \quad (1)$$
$$= \sum_i (Y_i - \hat{Y}_i)^2 \quad (2)$$
$$= \sum_i (Y_i - b_0 - b_1 X_i)^2 \quad (3)$$
Using calculus we get
$$b_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - (\sum X_i)^2}$$
$$b_0 = \frac{\sum Y_i - b_1 \sum X_i}{n}$$
Line of Best Fit
Least Squares
• ‘Best Fit’ Means Difference Between Actual Y Values & Predicted Y Values is
a Minimum. But Positive Differences Off-Set Negative ones. So square
errors!
• LS Minimizes the Sum of the Squared Differences (errors) (SSE)
$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{e}_i^2$$
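The least squares estimates can be worked through by hand on R's built-in `cars` data set (speed and stopping distance), the same data used in the R slides that follow. This sketch uses the standard closed-form OLS solution for one predictor and compares the result with `lm()`. The variable names are chosen for this example.

```r
x <- cars$speed   # predictor X
y <- cars$dist    # response Y
n <- length(x)

# Closed-form OLS estimates for slope (b1) and intercept (b0)
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

# Sum of squared errors at the fitted line
sse <- sum((y - (b0 + b1 * x))^2)

c(intercept = b0, slope = b1, SSE = sse)

# lm() minimises the same SSE, so its coefficients should agree
coef(lm(dist ~ speed, data = cars))
```

Any other choice of intercept and slope gives a larger SSE on the same data, which is what "best fit" means here.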
Simple Linear Regression in R
### CODE SNIPPET ###
?cars
# Investigating the basics of the data set
str(cars)
attributes(cars)
Examining the data
### CODE SNIPPET ###
# Summaries of the speed and distance values; are there any NAs?
summary(cars)
# Is there a correlation between speed and distance to stop?
cor(cars$speed, cars$dist)
Plotting the data
### CODE SNIPPET ###
plot(cars, main="Relationship between Speed and Distance to Stop")
scatter.smooth(cars, lpars = list(col = "red", lwd = 3, lty = 3))
boxplot(cars$dist, main="Outliers for Distance")
plot(density(cars$speed), main="Density Distribution of Speed",
type="h", col="blue")
Basic Linear Model
### CODE SNIPPET ###
linear_model <- lm(dist ~ speed, data=cars)
summary(linear_model)
Coefficient Analysis
• Coefficient - Estimate
• The Y intercept given is -17.5791
• For every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324 feet.
• Coefficient - Standard Error
• The coefficient Standard Error estimates how much the coefficient estimate would vary from
sample to sample. We’d ideally want a lower number relative to its
coefficient.
• Coefficient - t value
• The coefficient t-value is a measure of how many standard errors our coefficient estimate is
away from 0. We want it to be far away from zero as this would indicate we could reject the null
hypothesis - that is, we could declare that a relationship between speed and distance exists. In general,
t-values are also used to compute p-values.
• Coefficient - Pr(>|t|)
• A small p-value for the slope indicates that we can reject the null hypothesis, which
allows us to conclude that there is a relationship between speed and distance.
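The four columns discussed above can be pulled out of the model summary and partly rebuilt by hand. This sketch refits the same `dist ~ speed` model so that it runs on its own. `t_manual` and `p_manual` are names chosen for this example.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(linear_model))

# t value = estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = df.residual(linear_model))

cbind(coef_table, t_manual, p_manual)
```

The hand-computed columns should match the "t value" and "Pr(>|t|)" columns that `summary()` reports.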
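Residual Analysis
### CODE SNIPPET ###
pred_dist <- predict(linear_model, newdata=cars)
# Residual = observed distance - predicted distance
# (named model_residuals to avoid masking the built-in residuals() function)
model_residuals <- cars$dist - pred_dist
summary(model_residuals)
plot(pred_dist, model_residuals,
xlab="Predicted Values",
ylab="Residuals",
main="Residual Plot", col="blue")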
Which residual plot suggests a good
fit ? : Poll
Residual Standard Error
• Residual Standard Error is a measure of the quality of a linear
regression fit.
• The Residual Standard Error is the average amount that the response
(dist) will deviate from the fitted regression line.
• In our example, the actual distance required to stop can deviate from
the regression line by approximately 15.3795867 feet, on
average (roughly 4 times the slope of 3.93).
• The Residual Standard Error was calculated with 48 degrees of
freedom. Simplistically, degrees of freedom are the number of data
points that went into the estimation of the parameters, minus the number
of parameters estimated (here 50 observations - 2 coefficients = 48)
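The Residual Standard Error and its degrees of freedom can be reproduced from the residuals. This sketch refits the `dist ~ speed` model so that it runs on its own. The variable names are chosen for this example.

```r
linear_model <- lm(dist ~ speed, data = cars)

res <- resid(linear_model)          # observed - fitted
df  <- df.residual(linear_model)    # 50 observations - 2 coefficients = 48

# Residual Standard Error: square root of SSE divided by the degrees of freedom
rse <- sqrt(sum(res^2) / df)

c(rse = rse, df = df, from_summary = summary(linear_model)$sigma)
```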
Coefficient of Determination
• In statistics, the coefficient of determination, denoted R² or r² and pronounced
"R squared", is a number that indicates the proportion of the variance in the
dependent variable that is predictable from the independent variable(s)
• The R² we get is 0.6511. Roughly 65% of the variance found in the response
variable (distance) can be explained by the predictor variable (speed)
• What counts as a good R² value is relative to the domain; Adjusted R² is used for multiple linear regression
https://en.wikipedia.org/wiki/Coefficient_of_determination
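As a quick check, R² and Adjusted R² can be computed from the sums of squares and compared with what `summary()` reports. This sketch refits the `dist ~ speed` model. The variable names are chosen for this example.

```r
linear_model <- lm(dist ~ speed, data = cars)

sse <- sum(resid(linear_model)^2)              # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)    # total variation in dist

r_squared <- 1 - sse / sst

# Adjusted R squared penalises the number of predictors p
n <- nrow(cars)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(r_squared = r_squared, adj_r_squared = adj_r_squared)
```

With a single predictor the two values are close. The adjustment matters more as predictors are added.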
F Statistic & P Value
• Indicator of whether there is a relationship between our predictor and the
response variables
• An F statistic much greater than 1 suggests we can reject the null hypothesis : No relation between
speed and distance exists
• We can consider a linear model to be statistically significant only when both
the model p-Value and the predictor p-Value are less than the pre-determined statistical significance level,
which is conventionally 0.05
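The F statistic and the model p-value can be read from the model summary. `summary()` prints the p-value but does not store it directly, so this sketch derives it from the F distribution. It refits the `dist ~ speed` model so that it runs on its own.

```r
linear_model <- lm(dist ~ speed, data = cars)

# Named vector: value (F statistic), numdf and dendf (degrees of freedom)
fstat <- summary(linear_model)$fstatistic

# Model p-value: upper tail of the F distribution
p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
              lower.tail = FALSE)

fstat
p_value < 0.05   # compare against the chosen significance level
```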
Summary
What did we do ?
• Examined the data
• Plotted the data
• Simple Linear Regression Model Creation
• Coefficient Analysis
• Residual Analysis
• R² Analysis
• F Statistic
Is the current state of the model good enough to be deployed /
used live ?
Evaluation of Model : Split Train / Test
### CODE SNIPPET ###
## 80% of the sample size
sample_size <- floor(0.80 * nrow(cars))
## set the seed to make your partition reproducible
set.seed(123)
train_index <- sample(seq_len(nrow(cars)), size = sample_size)
train <- cars[ train_index, ]
test <- cars[-train_index, ]
## fit on the training rows only, then predict the held-out test rows
linear_model_subset <- lm(dist ~ speed, data=train)
distPred <- predict(linear_model_subset, test)
summary(linear_model_subset)
## predicted vs. actual stopping distance on the test set
plot(distPred, test$dist)
RMSE : To compare between models
### CODE SNIPPET ###
rmse <- function(error)
{
sqrt(mean(error^2))
}
## distPred holds the test-set predictions from the train / test split
print(rmse(test$dist - distPred))
• RMSE : Root Mean Squared Error
• Typical distance between the observed values and the model predictions
OR
• How far the residuals are from zero
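Written out as a formula, the `rmse` function above computes the following, where y_i is the observed test distance, ŷ_i is the model prediction for the same row and n is the number of test rows (symbols chosen here for illustration):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is in the same units as the response (feet).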
Food for thought !!!
Is the model from a single test / train split the best
estimate of generalization we have ??
.. Covered in Upcoming Sessions
Introduction to machine learning and model building using linear regression

  • 2. Introducing the Speaker • Girish Gore: 10+ years of experience in Data Analytics / Data Science • B.E. Computer Science from VIT Pune, M.S. from BITS Pilani • Spent time on data products, mainly in companies like • Cognizant (Innovations Group) • SAS (Pricing & Revenue Management) • VuClip (Video Entertainment) • Shoptimize (E-Commerce) • Worked in fields like • Text Mining • Forecasting and Optimization • Recommender Systems
  • 3. Knowing the Audience • Average experience in industry? • Average ML experience?
  • 4. Understanding Terminologies • Artificial Intelligence: AI involves machines that can perform tasks that are characteristic of human intelligence. • Machine Learning: Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. • Deep Learning: Deep Learning is an attempt to mimic the workings of the brain. Deep Learning is one of many approaches to machine learning.
  • 6. Traditional Programming vs Machine Learning • If programming automates processes, machine learning automates program generation, i.e. the automation itself. • Data and the desired output are run on the computer to create a program. This program can then be used in traditional programming.
  • 7. What is Machine Learning? • Machine Learning is • the study of algorithms that • improve their performance at a particular task • with experience (previous data, output) • Optimize a performance criterion using example data or past experience • Role of Computer Science: Efficient Algorithms • Solve the optimization problem • Represent and evaluate the model for inference
  • 8. Why are we here now? Google Trends! • Exponential increase in data generation and accumulation • Increasing computational power • Growing progress in available algorithms and research • Software becoming too complex to write by hand
  • 9. Common Applications of Machine Learning • Web search: ranking pages based on what you are most likely to click on. • Finance: deciding who to send which credit card offers to. Evaluating risk on credit offers. Deciding where to invest money. • E-commerce: predicting customer churn. Deciding whether or not a transaction is fraudulent. • Robotics: handling uncertainty in new environments. Autonomous, self-driving cars. • Information extraction: asking questions over databases across the web. • Social networks: data on relationships and preferences. Machine learning to extract value from data. • Debugging: use in computer science, especially in labor-intensive processes like debugging. Could suggest where the bug could be. • Gaming, IBM Watson
  • 10. Types of Machine Learning • Learning Associations • Supervised Learning • Regression • Classification • Unsupervised Learning • Reinforcement Learning • Semi-supervised Learning • Training data includes a few desired outputs; it sits between supervised and unsupervised learning
  • 11. Learning Associations • Market Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products/services. Example: P(diaper | beer) = 0.7
    TransactionID   BasketItems
    1               Bread, Milk
    2               Bread, Diaper, Beer, Eggs
    3               Milk, Diaper, Beer, Coke
    4               Bread, Milk, Diaper, Beer
    5               Bread, Milk, Diaper, Coke
  • 12. Learning Associations • Support: the probability that a customer buys diapers and beer together, among all sales transactions (the higher the support, the better) • Confidence: given that a customer picks up diapers, how likely is he/she to also buy beer? (the closer to 1, the better) • Lift: a comparison between a naive model, which assumes the two purchases are independent, and our model, i.e. how much more likely is a customer to buy both together than if they were bought separately? (we want lift > 1)
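The three measures can be checked by hand on the five example baskets from the Market Basket slide. The following is a minimal sketch in base R for the rule Diaper -> Beer; the helper name `support` and the variable names are illustrative assumptions, not part of the original deck.

```r
# The five baskets from the Market Basket slide
baskets <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke")
)

# Fraction of baskets that contain every item in `items`
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Rule: Diaper -> Beer
support_both <- support(c("Diaper", "Beer"))      # P(Diaper and Beer)
confidence   <- support_both / support("Diaper")  # P(Beer | Diaper)
lift         <- confidence / support("Beer")      # confidence relative to P(Beer)

print(c(support = support_both, confidence = confidence, lift = lift))
```

On these five baskets the rule has support 0.6, confidence 0.75 and lift 1.25, so diaper buyers are somewhat more likely than the average customer to buy beer.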
  • 13. Supervised Learning • Supervised Learning is the Machine Learning task of inferring a generalized function from labelled training data. Training data includes desired outputs. Examples: Spam Detection, Credit Scoring, Face Detection • In Supervised Learning for spam detection we have • Email contents with labels marking Spam or Non-Spam • Task is to label newer emails • The two main types of Supervised Learning problems • Regression • Classification
  • 14. Supervised Learning • Regression Problems • Map input data to a continuous prediction variable • Example: Predicting retail house prices (price as a continuous variable) • Classification Problems • Map input data to a set of predefined classes • Example: Benign or Malignant Tumours
  • 15. Regression: House Price Prediction • We have historic data about the size of houses and their prices for the last year • Task is to predict the price of a house given its size • Model Derivation: Price = Slope of Line * Size + Constant
  • 16. Classification : Credit Scoring We have labelled data of low and high risk customers. Task is differentiating between low-risk and high-risk customers from their income and savings. Model Derivation: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
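The IF/THEN rule above can be written as a short R function. This is only a sketch: the slide gives no values for θ1 and θ2, so the thresholds below (`theta1`, `theta2`) and the three customers are made up for illustration.

```r
# Hypothetical thresholds; the slide does not give values for theta1 and theta2
classify_risk <- function(income, savings, theta1 = 30000, theta2 = 10000) {
  ifelse(income > theta1 & savings > theta2, "low-risk", "high-risk")
}

# Made-up customers, for illustration only
customers <- data.frame(
  income  = c(45000, 25000, 60000),
  savings = c(15000, 20000,  5000)
)
customers$risk <- classify_risk(customers$income, customers$savings)
print(customers)
```

Only the first customer clears both thresholds, so only that customer is labelled low-risk. In a learned model, the thresholds would be estimated from the labelled data rather than fixed by hand.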
  • 17. Unsupervised Learning • Training data does not include desired output. Task is to find hidden structure in unlabeled data • Common Approaches to Unsupervised Learning • Clustering or Segmentation (Customer Segmentation) • Dimensionality Reduction (PCA (Principal Component Analysis), SVD (Singular Value Decomposition)) • Summarization
  • 18. Unsupervised Learning • Customer Segmentation: helps marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. • The clustering algorithm forms 3 different groups of customers to target.
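As a rough illustration of forming three customer groups, the sketch below runs k-means (one common clustering algorithm; the slide does not say which algorithm was used) from base R's `stats` package. The customer data are synthetic, and the column names `spend` and `visits` are assumptions made for the example.

```r
set.seed(42)  # k-means starts from random centers, so fix the seed

# Synthetic customers drawn around three hypothetical centers (illustration only)
customers <- data.frame(
  spend  = c(rnorm(20, 200, 30), rnorm(20, 600, 50), rnorm(20, 1200, 80)),
  visits = c(rnorm(20, 2, 0.5),  rnorm(20, 6, 1),    rnorm(20, 12, 1.5))
)

# Scale the features so neither dominates the distance, then form 3 clusters
fit <- kmeans(scale(customers), centers = 3, nstart = 10)

print(table(fit$cluster))  # number of customers in each segment
plot(customers, col = fit$cluster, main = "Customer Segments (synthetic data)")
```

No labels are supplied anywhere: the algorithm groups customers purely by how close they are to each other in the scaled feature space.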
  • 19. Reinforcement Learning • Learning from interaction with the environment to achieve a goal. Rewards come from a sequence of actions. • Every action has either a • Reward OR • Observation • Examples • Self-Driving Cars • Recommender Systems • Research Link https://www.cs.utexas.edu/~eladlieb/RLRG.html
  • 20. ML – Data Science Relationship
  • 22. Linear Regression • In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X • The case of one explanatory variable is called simple linear regression • For more than one explanatory variable, the process is called multiple linear regression https://en.wikipedia.org/wiki/Linear_regression
  • 23. From School Book: Linear Equations • Y = mX + b • m = Slope = Change in Y / Change in X • b = Y-intercept
  • 24. Linear Regression: A Common Example • Ohm's Law: In physics, it is observed that the relationship between Voltage (V), Current (I) and Resistance (R) is a linear relationship expressed as V = I * R, or I = V / R • In a circuit board with a given Resistance R, as you increase the Voltage V, the Current I increases proportionately http://www.electronics-tutorials.ws/dccircuits/dcp_1.html
  • 25. Sample Monthly Income-Expense Data of a Household • We have to find the relationship between Income and Expenses of a household
    Monthly Income (in Rs.)   Monthly Expense (in Rs.)
    5,000                     8,000
    6,000                     7,000
    10,000                    4,500
    10,000                    2,000
    12,500                    12,000
    14,000                    8,000
    15,000                    16,000
    18,000                    20,000
    19,000                    9,000
    20,000                    9,000
    20,000                    18,000
    22,000                    25,000
    23,400                    5,000
    24,000                    10,500
    24,000                    10,000
    [Chart: Income Vs. Expense; x-axis: Monthly Income, y-axis: Monthly Expense; fitted line y = 0.3008x + 6319.1, R² = 0.4215]
  • 26. Line of Best Fit • Which of these lines best describes the relationship between Household Income and Expenses? [Chart: Income Vs. Expense; x-axis: Monthly Income, y-axis: Monthly Expense]
  • 27. Line of Best Fit • The Line of Best Fit is the one for which the Sum of Squared Errors (SSE) is minimum (OLS technique) • $\hat{Y}_i = b_0 + b_1 X_i$ is the sample regression equation • Error for observation $m$: $e_m = y_m - \hat{y}_m$
    $SSE = \sum_i \hat{e}_i^2$   (1)
    $\phantom{SSE} = \sum_i (Y_i - \hat{Y}_i)^2$   (2)
    $\phantom{SSE} = \sum_i (Y_i - b_0 - b_1 X_i)^2$   (3)
    Using calculus we get
    $b_1 = \dfrac{n \sum_i X_i Y_i - \sum_i X_i \sum_i Y_i}{n \sum_i X_i^2 - \left(\sum_i X_i\right)^2}$
    $b_0 = \dfrac{\sum_i Y_i - b_1 \sum_i X_i}{n}$
    [Chart: Income Vs. Expense; x-axis: Monthly Income, y-axis: Monthly Expense; Line of Best Fit with errors $e_m$, $e_n$ marked]
  • 28. Least Squares • 'Best Fit' means the difference between actual Y values and predicted Y values is a minimum. But positive differences offset negative ones, so square the errors! • LS minimizes the Sum of the Squared Differences (errors) (SSE): $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \hat{e}_i^2$
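To make the least-squares idea concrete, the sketch below computes the slope and intercept from the standard closed-form OLS expressions and compares them with what `lm` returns. It uses R's built-in `cars` data set (the same one the R slides that follow use); the variable names `b0`, `b1` and `sse` are chosen here for illustration.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Closed-form least-squares estimates of slope (b1) and intercept (b0)
b1 <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b0 <- (sum(y) - b1 * sum(x)) / n

# Sum of squared errors at the fitted line
sse <- sum((y - (b0 + b1 * x))^2)

print(c(b0 = b0, b1 = b1, sse = sse))
print(coef(lm(dist ~ speed, data = cars)))  # should agree with b0 and b1
```

Any other choice of slope and intercept gives a larger SSE on the same data, which is what "best fit" means here.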
  • 29. Simple Linear Regression in R ### CODE SNIPPET ###
    # Open the help page for the built-in cars data set
    ?cars
    # Investigate the basics of the data set
    str(cars)
    attributes(cars)
  • 30. Examining the data ### CODE SNIPPET ###
    # How do the speed and distance value summaries look? Any NAs?
    summary(cars)
    # Is there a correlation between speed and distance to stop?
    cor(cars$speed, cars$dist)
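  • 31. Plotting the data ### CODE SNIPPET ###
    # Scatter plot of stopping distance against speed
    plot(cars, main = "Relationship between Speed and Distance to Stop")
    # Scatter plot with a smoothed trend line
    scatter.smooth(cars, lpars = list(col = "red", lwd = 3, lty = 3))
    # Box plot to spot outliers in stopping distance
    boxplot(cars$dist, main = "Outliers for Distance")
    # Density of speed
    plot(density(cars$speed), main = "Density Distribution of Speed", type = "h", col = "blue")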
  • 32. Basic Linear Model ### CODE SNIPPET ###
    # Fit stopping distance as a linear function of speed
    linear_model <- lm(dist ~ speed, data = cars)
    summary(linear_model)
  • 33. Coefficient Analysis • Coefficient - Estimate • The Y-intercept is -17.5791 • For every 1 mph increase in the speed of a car, the required distance to stop goes up by 3.9324 feet. • Coefficient - Standard Error • The coefficient Standard Error estimates how much the coefficient estimate would vary if the model were refitted on different samples. We'd ideally want a number that is small relative to its coefficient. • Coefficient - t value • The coefficient t-value measures how many standard errors our coefficient estimate is away from 0. We want it to be far from zero, as this would indicate we could reject the null hypothesis, that is, we could declare that a relationship between speed and distance exists. In general, t-values are also used to compute p-values. • Coefficient - Pr(>|t|) • A small p-value for the slope indicates that we can reject the null hypothesis, which allows us to conclude that there is a relationship between speed and distance. A small p-value for the intercept only indicates that the intercept differs from zero.
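The four columns discussed above can be pulled out of the fitted model and partly recomputed by hand. This is a minimal sketch on the `cars` data; the names `coef_table`, `t_manual` and `p_manual` are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
print(coef_table)

# The t value is the estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

The recomputed t and p values should match the last two columns of `coef_table`, which shows that these columns are derived entirely from the estimate and its standard error.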
  • 34. Residual Analysis ### CODE SNIPPET ###
    # Predict stopping distance for every row of the data
    pred_dist <- predict(linear_model, newdata = cars)
    # Residual = observed value - predicted value
    residuals <- cars$dist - pred_dist
    summary(residuals)
    plot(pred_dist, residuals, xlab = "Predicted Values", ylab = "Residuals", main = "Residual Plot", col = "blue")
  • 35. Which residual plot suggests a good fit? : Poll
  • 36. Residual Standard Error • Residual Standard Error is a measure of the quality of a linear regression fit. • The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line. • In our example, the actual distance required to stop can deviate from the true regression line by approximately 15.3795867 feet, on average (roughly 4 times the slope of 3.93). • The Residual Standard Error was calculated with 48 degrees of freedom. Simplistically, degrees of freedom are the number of data points minus the number of estimated parameters: 50 observations - 2 coefficients (intercept and slope) = 48.
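The Residual Standard Error and its degrees of freedom can be reproduced directly from the residuals. A minimal sketch on the `cars` data, with illustrative variable names:

```r
fit <- lm(dist ~ speed, data = cars)

res <- resid(fit)                      # observed - fitted
dof <- nrow(cars) - length(coef(fit))  # 50 observations - 2 coefficients

# Square root of the sum of squared residuals per degree of freedom
rse_manual <- sqrt(sum(res^2) / dof)

print(dof)
print(rse_manual)
print(summary(fit)$sigma)  # the value lm reports as "Residual standard error"
```

Dividing by the degrees of freedom rather than by the number of observations accounts for the two parameters that were estimated from the same data.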
  • 37. Coefficient of Determination • In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s) • The R² we get is 0.6511. Roughly 65% of the variance found in the response variable (distance) can be explained by the predictor variable (speed) • Whether an R² value is good depends on the domain; Adjusted R² is used for multiple linear regression https://en.wikipedia.org/wiki/Coefficient_of_determination
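R² can be recomputed as one minus the ratio of unexplained to total variation, and Adjusted R² follows from it with the usual correction for the number of predictors. A minimal sketch on the `cars` data; the variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

sse <- sum(resid(fit)^2)                     # unexplained variation
sst <- sum((cars$dist - mean(cars$dist))^2)  # total variation in dist

r2_manual <- 1 - sse / sst

# Adjusted R2 penalizes extra predictors: n observations, p predictors
n <- nrow(cars)
p <- 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(r2 = r2_manual, adj_r2 = adj_r2_manual))
```

With a single predictor the adjustment is small; it matters more when comparing models with different numbers of predictors.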
  • 38. F Statistic & P Value • Indicator of whether there is a relationship between our predictor and the response variables • An F statistic much greater than 1 suggests we can reject the null hypothesis: no relation between speed and distance exists • We can consider a linear model to be statistically significant only when both p-values (for the predictor's coefficient and for the model as a whole) are less than the pre-determined statistical significance level, which is commonly 0.05
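The F statistic and the model p-value reported by `summary` can be extracted and checked as follows. A minimal sketch on the `cars` data; the variable names are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Named vector: value (the F statistic), numdf, dendf
f <- summary(fit)$fstatistic
print(f)

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_value))

# With a single predictor, F equals the square of the slope's t value
t_slope <- coef(summary(fit))["speed", "t value"]
print(c(F = unname(f["value"]), t_squared = t_slope^2))
```

Because this model has only one predictor, the model p-value and the slope's p-value coincide; they differ once more predictors are added.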
  • 40. What All We Did • Examined the data • Plotted the data • Simple Linear Regression Model Creation • Coefficient Analysis • Residual Analysis • R² Analysis • F Statistic • Is the model in its current state good enough to be deployed / used on live data?
  • 41. Evaluation of Model: Split Train / Test ### CODE SNIPPET ###
    ## 80% of the sample size
    sample_size <- floor(0.80 * nrow(cars))
    ## Set the seed to make your partition reproducible
    set.seed(123)
    train_index <- sample(seq_len(nrow(cars)), size = sample_size)
    train <- cars[train_index, ]
    test <- cars[-train_index, ]
    ## Fit on the training rows only, then predict the held-out rows
    linear_model_subset <- lm(dist ~ speed, data = train)
    distPred <- predict(linear_model_subset, test)
    summary(linear_model_subset)
    plot(distPred, test$dist)
  • 42. RMSE: To compare between models ### CODE SNIPPET ###
    # Root mean squared error of a vector of errors
    rmse <- function(error) {
      sqrt(mean(error^2))
    }
    # distPred holds the test-set predictions from the train / test split
    print(rmse(test$dist - distPred))
    • RMSE: Root Mean Squared Error • Average distance between the observed values and the model predictions OR • How far the residuals are from zero
  • 43. Food for thought !!! Is the test / train split model the best generalization we have ?? .. Covered in Upcoming Sessions