Week 4 - Intro to ML

Lecture 1: Introduction to Predictive Analytics & Machine Learning


Introduction: What is Machine Learning?
Machine learning is a set of methods for automatically detecting patterns in data and using them to predict future data and guide decision making. In other words, learning from data.

Why use Machine Learning?
• Machine learning outperforms traditional solutions that require a lot of fine-tuning or hand-written rules, e.g. ML spam filters vs. traditional rule-based logic.
• Machine learning adapts to changing environments and new data, which reduces the time required compared with traditional approaches.
• Machine learning can solve problems that are highly complex and/or involve large amounts of data.
Applications of Machine Learning

• Analyzing images to detect anomalies or faults on a production line
• Detecting brain tumors in brain scans
• Classifying topics using NLP
• Forecasting sales
• Chatbots for customized and quick interactions with customers
• Recommending products based on buyer behavior
Types of machine learning
• Supervised learning: the objective is to learn a function to predict an output variable Y based on observed input variables (also called features) x1, . . . , xp. We develop methods that learn this function based on labelled data, which we call the training data.
• Unsupervised learning: we are given only inputs, and the goal is to find "interesting" patterns in this data. It is used for clustering.
Supervised learning
In supervised learning, the output or response variable can be of any type. However, most methods address two main classes of supervised learning problems:
• In regression, the response is a quantitative scalar (such as the income of a worker).
• In classification, the response is a nominal or categorical variable Y ∈ {1, . . . , C}, where C is the number of classes. When C = 2, this is called binary classification; when C > 2, it is called multiclass classification.
Supervised Learning
• Some important algorithms:
  • Linear regression
  • Logistic regression
  • Support Vector Machines (SVMs)
  • Decision Trees & Random Forests
  • K-Nearest Neighbors (KNN)
Examples

• Regression:
  • Predicting the income of an individual
  • Number of Covid-19 patients in the next 2 months
  • House prices
  • Sales forecasting (units / value)

• Classification:
  • Cancer vs. no cancer
  • Fraud vs. secure
  • Churn vs. no churn
  • Good customer vs. bad customer
Supervised Learning
Predict house prices:
• Y = house price
• Y = f(X1, X2, X3, …, Xn)
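A minimal sketch of this setup with scikit-learn; the feature names (size, bedrooms, age) and the numbers are illustrative assumptions, not data from the lecture:

```python
# Minimal sketch: learn f so that price ≈ f(X1, ..., Xn).
# Features and prices below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: size (sqm), bedrooms, age (years) -- hypothetical features
X = np.array([[80, 2, 30],
              [120, 3, 10],
              [95, 2, 5],
              [150, 4, 20]])
y = np.array([250_000, 410_000, 330_000, 520_000])  # labels: observed prices

model = LinearRegression().fit(X, y)   # estimate f from labelled training data
print(model.predict([[100, 3, 15]]))   # predicted price for an unseen house
```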
Unsupervised Learning
• The training data is unlabelled.
• Some important algorithms:
  • Principal Component Analysis (PCA)
  • K-Means
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Isolation Forests
  • Apriori
Unsupervised learning
• Example: finding similar passengers (clustering), as sketched below.
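A minimal sketch of this clustering task with K-Means; the passenger features (age, fare paid) are illustrative assumptions:

```python
# Minimal sketch: group unlabelled passengers into "similar" clusters.
# Features (age, fare) are made up; note that no labels are used.
import numpy as np
from sklearn.cluster import KMeans

passengers = np.array([[22, 7.3],
                       [38, 71.3],
                       [26, 7.9],
                       [35, 53.1],
                       [54, 51.9],
                       [28, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(passengers)
print(kmeans.labels_)           # cluster assignment for each passenger
print(kmeans.cluster_centers_)  # the "typical" passenger of each cluster
```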
Data mining
• Data mining is the process of extracting interesting and previously unknown patterns and relations from large databases, drawing on the fields of machine learning, statistics, and database technology.
• In data mining, the analysis should be both useful and understandable to the data owner.
Data science
• Data science is a multidisciplinary field that combines knowledge and skills from statistics, machine learning, software engineering, data visualization, and domain expertise (in our case, business expertise) to uncover value from large and diverse data sets.
• Data scientists often work directly with stakeholders (say, product managers) to translate data analysis results into action.
Data analysis process

1. Problem formulation.

2. Data collection and preparation.

3. Exploratory data analysis (EDA).

4. Model building, estimation, and selection.

5. Model evaluation.

6. Communicate results.
Evaluating model performance
• Training set: for exploratory data analysis, model building, model
estimation, model selection, etc.

• Test set: for model evaluation


Training and test data
• Because we are interested in estimating how well a model will predict future data, the test set should be kept in a "vault" and brought out strictly at the end of the analysis. The test set does not lead to model revisions.
• We generally allocate 70–80% of the data to the training sample.
• A higher proportion of training data leads to more accurate model estimation, but higher variance in estimating the expected loss.
• The split of the data into the training and test sets is often random, but sometimes there are reasons to consider alternative schemes.
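A minimal sketch of such a split, using scikit-learn's train_test_split on placeholder data:

```python
# Minimal sketch: hold out a test set and touch it only at the very end.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)   # placeholder features
y = np.random.rand(1000)      # placeholder response

# 80% for training (EDA, model building, selection); 20% stays in the "vault"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```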
Data mismatch
• The validation or test set should be as representative as possible of the data that the model will "see" in production.
• Don't test it on apples and productionize it to predict oranges.
Key concepts

• The bias-variance trade-off and model selection.

• Overfitting.

• Parametric vs non-parametric models.

• No-free-lunch theorem.

• Accuracy vs interpretability.
Simple vs complex model
[Figure: two fits compared. Left: high bias / low variance (underfitting). Right: high variance / low bias (overfitting).]
Another example
[Figure: a second underfitting vs overfitting comparison.]
Bias-variance tradeoff
• An important decision for a data scientist is choosing the model complexity.
• A very complex model can give accurate predictions on the training set but fail miserably on the test set.
• A very simple model can give bad predictions on both the training set and the test set.
• How do we balance the two?
The bias-variance trade-off
• Mathematically, we can show that:
  Error = Bias² + Variance + Irreducible error
• We would like our model to be flexible enough to approximate (possibly) complex relationships between Y and X.
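Written out for squared-error loss at a point x, with f̂ the fitted model and σ² the noise variance, the standard form of this decomposition is:

```latex
\mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\bigl[\hat{f}(x)\bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```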
Bias-variance tradeoff
• Typically, the more complex we make the model, the better its approximation capabilities, which translates into lower bias.
• On the other hand, increasing model complexity leads to higher variance. This is due to the larger (effective) number of parameters to estimate.
• Hence, we would like to find the optimal (problem-specific) model complexity, the one that minimises our expected loss at the bottom of the validation curve.
Bias-variance
• Increasing model complexity will always reduce the training error, but there is an optimal level of complexity that minimises the test error, as the sketch below illustrates.
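A minimal sketch of that pattern on simulated data: training error keeps falling as the polynomial degree grows, while test error falls and then rises again.

```python
# Minimal sketch: training error falls monotonically with complexity,
# while test error is U-shaped. All data here is simulated.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)  # nonlinear truth + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in [1, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # keeps decreasing
          mean_squared_error(y_te, model.predict(X_te)))  # U-shaped
```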
Model selection
• Model selection is a set of methods (such as cross-validation) that allow us to choose the right model among options of different complexity. It will be a fundamental part of our methodology.
• We conduct model selection on the training data.
• Model selection also includes hyperparameter tuning, which we will cover in detail later.
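A minimal sketch of model selection by cross-validation, comparing candidate complexities (here, the hyperparameter k of a KNN regressor) on the training data only; the data is a placeholder:

```python
# Minimal sketch: choose among models of different complexity by 5-fold CV,
# using only the training data. Data below is a placeholder.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X_train = np.random.rand(300, 4)   # placeholder training features
y_train = np.random.rand(300)      # placeholder training response

for k in [1, 5, 15, 50]:           # candidate complexities to compare
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k),
                             X_train, y_train, cv=5)
    print(k, scores.mean())        # pick the k with the best CV score
```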
Overfitting
• We say that there is overfitting when an estimated model is excessively flexible, incorporating minor variations in the training data that are likely to be noise rather than predictive patterns.
• An overfit model has small training errors but may predict poorly. In essence, it has memorized the training set.
• Not being misled by overfitting is an important reason why we use a test set.
Illustration
• This example uses data extracted from the fueleconomy.gov website run by the US government, which lists different estimates of fuel economy for passenger cars and trucks.
• For each vehicle in the dataset, we have information on various characteristics such as engine displacement and number of cylinders, along with laboratory measurements for the city and highway miles per gallon (MPG) of the car.
• We here consider the unadjusted highway MPG for 2010 cars as the response variable, and a single predictor, engine displacement.
Example
• A scatter plot reveals a nonlinear association between the two variables. We therefore need a model that is sufficiently flexible to capture this nonlinearity.
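A hedged sketch of fitting such a flexible model. The file name vehicles.csv and the column names year, displ, and hwy are assumptions about how a fueleconomy.gov extract might be stored, not the lecture's actual code:

```python
# Hedged sketch: flexible (cubic polynomial) fit of highway MPG on engine
# displacement. File and column names are assumptions, not from the lecture.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

cars = pd.read_csv("vehicles.csv")       # hypothetical fueleconomy.gov extract
cars_2010 = cars[cars["year"] == 2010]

X = cars_2010[["displ"]]                 # single predictor: displacement
y = cars_2010["hwy"]                     # response: unadjusted highway MPG

# Degree-3 polynomial: flexible enough for the curvature in the scatter plot
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
```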
Parametric vs non-parametric approach
• Parametric models assume a specific underlying form for the relationship we want to estimate; for example, we decide in advance whether to fit a linear regression or a polynomial regression. You know what the regression line will look like.
• Non-parametric models do not assume any underlying statistical distribution for the dataset. Hence, you let the data decide what the function will be.
• Decision makers mostly prefer parametric models because a parametric model is easier to estimate, predictions are easier to make, a story can be told around it, and its estimates have better statistical properties than those of non-parametric regression.
Parametric vs non-parametric
• Here is a picture showing both parametric and non-parametric regression results. OLS (the linear regression line) predicts a negative relationship between X and Y, while the non-parametric estimate fits a "highly wiggly" function to the data (most of the time you can choose the smoothness of the function). A code sketch contrasting the two follows below.
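A minimal sketch contrasting the two approaches on simulated data: OLS commits to a straight line, while a non-parametric KNN regressor lets the data shape the function.

```python
# Minimal sketch: parametric (OLS line) vs non-parametric (KNN) regression
# on the same simulated data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=150)  # wiggly truth + noise

ols = LinearRegression().fit(X, y)                   # assumes a straight line
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)  # no functional form assumed

print(ols.predict([[2.5]]), knn.predict([[2.5]]))
# n_neighbors acts as the smoothness knob mentioned in the slide above
```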
Challenges in Machine Learning
"Bad Data" and "Bad Algorithm"
• Insufficient training data
• Nonrepresentative training data
• Poor quality of data: missing values, errors, and noise!
• Overfitting/underfitting the data