Join the data conversation and see how analytics drives decision making across industries. Learn to understand, analyze, and interpret data as you walk through the fundamentals of data analysis, learn introductory analytic functionality in Google Sheets to distill actionable insights from data sets, see how data analysts translate their findings into compelling business narratives, and perform an exploratory analysis using real-world data.
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env... - Intel® Software
This session explains what the solutions sought by IT, Internet, and Silicon Valley companies can look like, how they may differ from those of the more “classical” consumers of machine learning and analytics, and the resulting challenges that current and future HPC development may have to cope with.
Top 10 Data Science Practitioner Pitfalls - Sri Ambati
Top 10 Data Science Practitioner Pitfalls Meetup with Erin LeDell and Mark Landry on 09.09.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Machine learning involves using data and algorithms to enable computers to learn without being explicitly programmed. There are three main types of machine learning problems: supervised learning, unsupervised learning, and reinforcement learning. The machine learning process typically involves 5 steps: data gathering, data preprocessing, feature engineering, algorithm selection and training, and making predictions. Generalization is important in machine learning and involves balancing bias and variance - models with high bias may underfit while those with high variance may overfit.
This document provides an overview of machine learning techniques for classification with imbalanced data. It discusses challenges with imbalanced datasets, such as most classifiers being biased towards the majority class. It then summarizes techniques for dealing with imbalanced data, including random over/under sampling, SMOTE, cost-sensitive classification, and collecting more data.
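Of the techniques listed above, random oversampling is the simplest to illustrate. The following is a minimal sketch in plain Python, not taken from the summarized document: the function name `random_oversample` and the toy data are assumptions made for illustration, and SMOTE (which synthesizes new points rather than duplicating existing ones) is deliberately not shown.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data, made up for illustration: 6 majority rows and 2 minority rows.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.1]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Oversampling should normally be applied to the training portion only, so that duplicated rows do not leak into the data used for evaluation.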
Machine learning involves using data to allow computers to learn without being explicitly programmed. There are three main types of machine learning problems: supervised learning, unsupervised learning, and reinforcement learning. The typical machine learning process involves five steps: 1) data gathering, 2) data preprocessing, 3) feature engineering, 4) algorithm selection and training, and 5) making predictions. Generalization is an important concept that relates to how well a model trained on one dataset can predict outcomes on an unseen dataset. Both underfitting and overfitting can lead to poor generalization by introducing bias or variance errors.
The document discusses machine learning concepts including:
1) Machine learning is an application of artificial intelligence that allows systems to automatically learn and improve from experience without being explicitly programmed.
2) There are different types of machine learning including supervised learning, unsupervised learning, and reinforcement learning.
3) The machine learning process involves learning tasks, performance metrics, experience, and optimizing models using techniques like gradient descent.
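Gradient descent, mentioned in the last point, can be sketched for the simplest case of fitting a straight line. This is an illustrative example only: the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions, and the loss used is mean squared error.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The learning rate has to be small enough for the updates to converge; the value above was chosen for this toy data and would need tuning elsewhere.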
Defect models trained on class-imbalanced datasets (i.e., datasets in which defective and clean modules are not equally represented) are highly susceptible to producing inaccurate predictions. Prior research compares the impact of class rebalancing techniques on the performance of defect models but arrives at contradictory conclusions due to the use of different datasets, classification techniques, and performance measures. Such contradictory conclusions make it hard to derive practical guidelines for whether class rebalancing techniques should be applied in the context of defect models. In this paper, we investigate the impact of class rebalancing techniques on performance measures and the interpretation of defect models. We also investigate the experimental settings in which class rebalancing techniques are beneficial for defect models. Through a case study of 101 datasets that span proprietary and open-source systems, we conclude that the impact of class rebalancing techniques on the performance of defect prediction models depends on the performance measure and the classification techniques used. We observe that the optimized SMOTE technique and the under-sampling technique are beneficial when quality assurance teams wish to increase AUC and Recall, respectively, but they should be avoided when deriving knowledge and understanding from defect models.
In the rapidly evolving field of machine learning (ML), the focus is often placed on developing sophisticated algorithms and models that can learn patterns, make predictions, and generate insights from data. However, one of the most critical challenges in building effective machine learning systems lies in ensuring the quality of the data used for training, testing, and validating these models. Data quality directly influences the model's performance, accuracy, and ability to generalize to unseen examples. Unfortunately, in real-world applications, data is rarely perfect, and it is often riddled with various types of errors that can lead to misleading conclusions, flawed predictions, and potentially harmful outcomes. These errors in experimental observations, also referred to as data errors or measurement errors, can significantly compromise the effectiveness of machine learning systems. The sources of these errors are diverse, ranging from technical failures, such as malfunctioning sensors or corrupted datasets, to human errors in data collection, labeling, or interpretation. Furthermore, errors may emerge during the data preprocessing stages, such as incorrect normalization, improper handling of missing data, or the introduction of noise through faulty sampling techniques. These errors can manifest in several ways, including outliers, missing values, mislabeled instances, noisy data, or data imbalances, each of which can influence how well a machine learning model performs. Understanding the nature of these errors and developing strategies to mitigate their impact is crucial for building robust and reliable machine learning models that can operate in real-world environments. Moreover, the impact of errors is not only a technical issue; it also raises significant ethical concerns, particularly when the models are used to inform high-stakes decisions, such as in healthcare, criminal justice, or finance. 
If errors are not properly addressed, models may inadvertently perpetuate biases, amplify inequalities, or produce inaccurate predictions that negatively affect individuals and communities. Therefore, a thorough understanding of errors in experimental observations is essential for improving the reliability, fairness, and ethical standards of machine learning applications. This introductory discussion provides the foundation for exploring the various types of errors that arise in machine learning datasets, examining their origins, their effects on model performance, and the various methods and techniques available for detecting, correcting, and mitigating these errors. By delving into the challenges posed by errors in experimental observations, we aim to provide a comprehensive framework for addressing data quality issues in machine learning and to highlight the importance of maintaining data integrity in the development and deployment of machine learning systems. This exploration of errors will also touch upon the broader implications for research
Statistical Learning and Model Selection (1).pptx - rajalakshmi5921
This document discusses statistical learning and model selection. It introduces statistical learning problems, statistical models, the need for statistical modeling, and issues around evaluating models. Key points include: statistical learning involves using data to build a predictive model; a good model balances bias and variance to minimize prediction error; cross-validation is described as the ideal procedure for evaluating models without overfitting to the test data.
Machine learning algorithms can adapt and learn from experience. The three main machine learning methods are supervised learning (using labeled training data), unsupervised learning (using unlabeled data), and semi-supervised learning (using some labeled and some unlabeled data). Supervised learning includes classification and regression tasks, while unsupervised learning includes cluster analysis.
This document discusses various methods for evaluating machine learning models. It describes splitting data into training, validation, and test sets to evaluate models on large datasets. For small or unbalanced datasets, it recommends cross-validation techniques like k-fold cross-validation and stratified sampling. The document also covers evaluating classifier performance using metrics like accuracy, confidence intervals, and lift charts, as well as addressing issues that can impact evaluation like overfitting and class imbalance.
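A stratified k-fold split, as recommended above for small or unbalanced datasets, can be sketched in a few lines. This is an assumption-laden illustration rather than the summarized document's method: the helper `stratified_k_fold` is hypothetical, it does not shuffle, and it simply deals each class's indices round-robin across the folds.

```python
from collections import defaultdict


def stratified_k_fold(labels, k):
    """Return k lists of row indices; each class is dealt round-robin
    across the folds so every fold keeps roughly the overall class mix."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


# Toy labels, made up for illustration: 8 rows of class 0, 4 rows of class 1.
labels = [0] * 8 + [1] * 4
folds = stratified_k_fold(labels, k=4)
for test_idx in folds:
    # Each fold serves once as the test set; the rest is the training set.
    train_idx = [i for i in range(len(labels)) if i not in test_idx]
    print(sorted(test_idx), len(train_idx))
```

In practice the indices within each class would usually be shuffled first; that step is left out here to keep the sketch deterministic.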
This document summarizes a presentation about machine learning and predictive analytics. It discusses formal definitions of machine learning, the differences between supervised and unsupervised learning, examples of machine learning applications, and evaluation metrics for predictive models like lift, sensitivity, and accuracy. Key machine learning algorithms mentioned include logistic regression and different types of modeling. The presentation provides an overview of concepts in machine learning and predictive analytics.
This document discusses various techniques for machine learning when labeled training data is limited, including semi-supervised learning approaches that make use of unlabeled data. It describes assumptions like the clustering assumption, low density assumption, and manifold assumption that allow algorithms to learn from unlabeled data. Specific techniques covered include clustering algorithms, mixture models, self-training, and semi-supervised support vector machines.
Application of Machine Learning in Agriculture - Aman Vasisht
With the growing trend of machine learning, it is needless to say that machine learning can help reap benefits in agriculture. It will be a boon for farmer welfare.
Making Netflix Machine Learning Algorithms Reliable - Justin Basilico
This document discusses making Netflix machine learning algorithms reliable. It describes how Netflix uses machine learning for tasks like personalized ranking and recommendation. The goals are to maximize member satisfaction and retention. The models and algorithms used include regression, matrix factorization, neural networks, and bandits. The key aspects of making the models reliable discussed are: automated retraining of models, testing training pipelines, checking models and inputs online for anomalies, responding gracefully to failures, and training models to be resilient to different conditions and failures.
Week 2 Sentiment Analysis Using Machine Learning SARCCOM
This document provides an overview of sentiment analysis using machine learning. It defines sentiment analysis as detecting polarity within text. It discusses the main tasks as classification of sentiment at the text, token, or aspect level. Supervised learning is most common. The document outlines types of sentiment analysis and gives examples. It also summarizes the machine learning process from data gathering and preprocessing to feature engineering, experimentation, and deployment. Hands-on examples are provided for simple sentiment analysis using a dictionary approach and using machine learning.
This document provides an overview of machine learning algorithms and their applications in the financial industry. It begins with brief introductions of the authors and their backgrounds in applying artificial intelligence to retail. It then covers key machine learning concepts like supervised and unsupervised learning as well as algorithms like logistic regression, decision trees, boosting and time series analysis. Examples are provided for how these techniques can be used for applications like predicting loan risk and intelligent loan applications. Overall, the document aims to give a high-level view of machine learning in finance through discussing algorithms and their uses in areas like risk analysis.
This document provides an overview of machine learning, including examples of applications, how machine learning works, and some common algorithms. It discusses how machine learning can augment human intelligence by analyzing large amounts of data. Key machine learning algorithms covered include decision trees, neural networks, support vector machines, and regression models. The document emphasizes the importance of proper testing and evaluation of machine learning models.
The document provides an overview of machine learning concepts including supervised and unsupervised learning algorithms. It discusses splitting data into training and test sets, training algorithms on the training set, testing algorithms on the test set, and measuring performance. For supervised learning, it describes classification and regression tasks, the bias-variance tradeoff, and how supervised algorithms learn by minimizing a loss function. For unsupervised learning, it discusses clustering, representation learning, dimensionality reduction, and exploratory analysis use cases.
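The split, train, test, and measure workflow described above can be sketched as follows. Everything here is an assumption made for illustration: the helper names, the toy one-feature dataset, and the deliberately simple "model" (a threshold halfway between the two class means) are not from the summarized document.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train(train_rows):
    """Learn a threshold halfway between the mean feature of each class."""
    xs0 = [x for x, y in train_rows if y == 0]
    xs1 = [x for x, y in train_rows if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose predicted class matches the true class."""
    correct = sum((1 if x > threshold else 0) == y for x, y in rows)
    return correct / len(rows)


# Toy data, made up for illustration: class 0 near 0-2, class 1 near 3-5.
data = [(i / 10, 0) for i in range(20)] + [(3 + i / 10, 1) for i in range(20)]
train_rows, test_rows = train_test_split(data)
threshold = train(train_rows)              # fit on the training set only
test_accuracy = accuracy(threshold, test_rows)  # measure on unseen rows
print(len(train_rows), len(test_rows), test_accuracy)
```

The key point is that the test rows play no part in choosing the threshold, so the measured accuracy estimates performance on unseen data.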
This document provides an overview of machine learning concepts and techniques. It discusses supervised learning methods like classification and regression using algorithms such as naive Bayes, K-nearest neighbors, logistic regression, support vector machines, decision trees, and random forests. Unsupervised learning techniques like clustering and association are also covered. The document contrasts traditional programming with machine learning and describes typical machine learning processes like training, validation, testing, and parameter tuning. Common applications and examples of machine learning are also summarized.
ELectronics Boards & Product Testing_Shiju.pdf - Shiju Jacob
This presentation provides high-level insight into DFT analysis and test coverage calculation, finalizing the test strategy, and the types of tests at different levels of the product.
The passenger car unit (PCU) of a vehicle type depends on vehicular characteristics, stream characteristics, roadway characteristics, environmental factors, climate conditions and control conditions. In view of the various factors affecting PCU, a model was developed that takes the volume-to-capacity ratio and the percentage share of a particular vehicle type as independent parameters. The microscopic traffic simulation model VISSIM was used in the present study to generate traffic flow data, which is sometimes very difficult to obtain from field surveys. A comparison study was carried out to verify when the adaptive neuro-fuzzy inference system (ANFIS), artificial neural network (ANN) and multiple linear regression (MLR) models are appropriate for predicting the PCUs of different vehicle types. The results show that the ANFIS model estimates were closer to the corresponding simulated PCU values than those of the MLR and ANN models. It is concluded that the ANFIS model showed greater potential in predicting PCUs from the v/c ratio and proportional share for all types of vehicles, whereas the MLR and ANN models did not perform well.
We introduce the Gaussian process (GP) modeling module developed within the UQLab software framework. The novel design of the GP-module aims at providing seamless integration of GP modeling into any uncertainty quantification workflow, as well as a standalone surrogate modeling tool. We first briefly present the key mathematical tools on the basis of GP modeling (a.k.a. Kriging), as well as the associated theoretical and computational framework. We then provide an extensive overview of the available features of the software and demonstrate its flexibility and user-friendliness. Finally, we showcase the usage and the performance of the software on several applications borrowed from different fields of engineering. These include a basic surrogate of a well-known analytical benchmark function; a hierarchical Kriging example applied to wind turbine aero-servo-elastic simulations and a more complex geotechnical example that requires a non-stationary, user-defined correlation function. The GP-module, like the rest of the scientific code that is shipped with UQLab, is open source (BSD license).
The analysis of reinforced concrete deep beams is based on simplified approximate methods because of the complexity of exact analysis, which stems from the number of parameters affecting the response. To evaluate some of these parameters, a finite element study of the structural behavior of reinforced self-compacting concrete deep beams was carried out using the Abaqus finite element modeling tool. The model was validated against experimental data from the literature. The parametric effects of varied concrete compressive strength, vertical web reinforcement ratio and horizontal web reinforcement ratio were tested on eight (8) different specimens under four-point loading. The results of the validation work showed good agreement with the experimental studies. The parametric study revealed that the concrete compressive strength most significantly influenced the specimens’ response, with average increases of 41.1% and 49% in the diagonal cracking and ultimate loads, respectively, when the concrete compressive strength was doubled. Although increasing the horizontal web reinforcement ratio from 0.31% to 0.63% led to an average increase of 6.24% in the diagonal cracking load, it did not influence the ultimate strength or the load-deflection response of the beams. A similar variation in the vertical web reinforcement ratio led to average increases of 2.4% and 15% in the cracking and ultimate loads, respectively, with no appreciable effect on the load-deflection response.
This paper proposes a shoulder inverse kinematics (IK) technique. The shoulder complex comprises the sternum, clavicle, ribs, scapula, humerus, and four joints.
Data Structures_Linear data structures Linked Lists.pptx - RushaliDeshmukh2
Concept of Linear Data Structures, Array as an ADT, Merging of two arrays, Storage Representation, Linear list – singly linked list implementation, insertion, deletion and searching operations on linear list, circularly linked lists – operations for circularly linked lists, doubly linked list implementation, insertion, deletion and searching operations, applications of linked lists.
It's all about Artificial Intelligence (AI) and Machine Learning, not at an advanced level; you can study it before an exam or check it for information on AI for a project.
Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E... - Infopitaara
A Boiler Feed Pump (BFP) is a critical component in thermal power plants. It supplies high-pressure water (feedwater) to the boiler, ensuring continuous steam generation.
⚙️ How a Boiler Feed Pump Works
Water Collection:
Feedwater is collected from the deaerator or feedwater tank.
Pressurization:
The pump increases water pressure using multiple impellers/stages in centrifugal types.
Discharge to Boiler:
Pressurized water is then supplied to the boiler drum or economizer section, depending on design.
🌀 Types of Boiler Feed Pumps
Centrifugal Pumps (most common):
Multistage for higher pressure.
Used in large thermal power stations.
Positive Displacement Pumps (less common):
For smaller or specific applications.
Precise flow control but less efficient for large volumes.
🛠️ Key Operations and Controls
Recirculation Line: Protects the pump from overheating at low flow.
Throttle Valve: Regulates flow based on boiler demand.
Control System: Often automated via DCS/PLC for variable load conditions.
Sealing & Cooling Systems: Prevent leakage and maintain pump health.
⚠️ Common BFP Issues
Cavitation due to low NPSH (Net Positive Suction Head).
Seal or bearing failure.
Overheating from improper flow or recirculation.
machine learning types methods classification regression decision tree
1. Role of Machine Learning in Telecommunication
Dr. Mohamad Abou Taam
2. WHAT IS MACHINE LEARNING?
Machine learning is a subfield of computer science that studies and develops algorithms that can learn from data without being explicitly programmed.
[Diagram: Computer Science, Artificial Intelligence, Machine Learning and Deep Learning shown as nested fields]
Machine learning algorithms can detect patterns in data and use them to predict future data.
3. Machine learning – how is it different?
Traditional software: applying given rules to data
Traditional software: Rules + Data → Answers / Actions
Machine learning: Data → Rules / Model
4. Model design, training and testing (model building, feature engineering)
Step 1: Historical Data → Machine Learning → Model
Model application (model scoring)
Step 2: New Data → Model → Predictions
5. TRIAD OF ALGORITHMS, DATA AND TRAINING
[Diagram: machine learning at the centre of a triad of Data, Algorithms and Training]
"Learning" is the process of estimating an unknown dependency or structure of a system (building a model) from a limited number of observations (data points), together with the ability to generalize it to previously unseen data.
6. THE "CENTRAL DOGMA" OF STATISTICS
Machine learning == statistical learning
• Sample should be representative of population
• Generalization – extrapolation to entire population
• Watch for population drift!
[Diagram labels: Population, Sample, Sampling principle, Probability, Descriptive Statistics, Inferential Statistics, Inference, Learning on sample]
7. THREE TYPES OF MACHINE LEARNING
Reinforcement Learning
The goal is to optimise actions in a way that maximises cumulative reward. No explicitly labeled data is given, but "reward" and "punishment" signals are provided.
Unsupervised Learning
The goal is to learn patterns and structure in data given only inputs X (no output Y information is given at all).
X – input data / independent variable
Supervised Learning
The goal is to learn a mapping from given inputs X to outputs Y, given a labeled set of input-output (X-Y) pairs.
X – input data / independent variable
Y – response / dependent variable
9. SUPERVISED LEARNING: REGRESSION
Response variable Y – real valued
[Figure, univariate: Sales plotted against TV]
[Figure, multivariate: Income plotted against Years of Education and Seniority]
11. REGRESSION AND CLASSIFICATION ARE SIMILAR
Regression
Predict a numeric variable
Classification
Predict a binary (or categorical) outcome
[Figure, regression: Y plotted against X]
[Figure, classification: Probability of event (0.0 to 1.0) plotted against X]
Data are 1s and 0s – event either happens or doesn't happen
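The contrast between the two tasks can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's method: the slopes and intercepts are made-up values rather than fitted ones, and the logistic (sigmoid) function is assumed as the way a linear score becomes a "probability of event".

```python
import math

def predict_value(x, slope, intercept):
    """Regression: the output is an unbounded numeric value."""
    return slope * x + intercept

def predict_probability(x, slope, intercept):
    """Classification: squash the linear score into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

def predict_class(x, slope, intercept, threshold=0.5):
    """Turn the probability into a 1/0 decision (the event happens or not)."""
    return 1 if predict_probability(x, slope, intercept) >= threshold else 0

# Hypothetical coefficients, chosen only for illustration.
for x in (-2, -1, 0, 1, 2):
    value = predict_value(x, slope=2.0, intercept=5.0)
    prob = predict_probability(x, slope=3.0, intercept=0.0)
    label = predict_class(x, slope=3.0, intercept=0.0)
    print(f"x={x:+d}  regression={value:5.1f}  probability={prob:.3f}  class={label}")
```

Both predictors share the same linear core; only the final step differs, which is one way to read the slide's claim that the two tasks are similar.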
12. MODEL OVERFITTING
Regression
• Too simple: predictions will have high "bias" – from inadequate assumptions
• Too complex: predictions will have high "variance" – driven by noise in the training data
• Just right: model complexity is appropriate given the noise
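The three cases can be reproduced on synthetic data. The sketch below assumes a true relationship y = 2x + 1 with Gaussian noise (an invented example, not data from the slides) and compares a constant model (too simple), a straight line (just right) and a 1-nearest-neighbour lookup that memorises the training points (too complex).

```python
import random

def make_data(n, rng, noise=1.0):
    """Noisy samples from the assumed true relationship y = 2x + 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Too simple: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Just right for this data: ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_nearest_neighbour(xs, ys):
    """Too complex: memorise every training point, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(30, rng)

results = {}
for name, fit in [("too simple", fit_constant),
                  ("just right", fit_line),
                  ("too complex", fit_nearest_neighbour)]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:12s} train MSE={results[name][0]:7.3f}  test MSE={results[name][1]:7.3f}")
```

The memorising model reaches zero training error by construction, so a low training error on its own says little about how a model will behave on unseen data.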
14. PREDICTION ACCURACY VS EXPLAINABILITY
Model explainability vs. prediction accuracy
White box models
Model properties:
• Interpretable by design
• Easy to explain
• Quick to run
• Limited tuning needed
Algorithm examples:
• Linear / logistic regression
• Decision trees
Black box models
Model properties:
• Lots of work to get insights
• Better predictive performance
• Potential for overfitting
• Often a lot of tuning required
Algorithm examples:
• Random forests
• Gradient boosting
• Neural networks
• Deep learning
20. CLASSIFICATION EVALUATION
Quality metrics
                        Actual: Yes (or 1)      Actual: No (or 0)
Predicted: Yes (or 1)   True positives (TP)     False positives (FP)
Predicted: No (or 0)    False negatives (FN)    True negatives (TN)
True positive = Predict event and event happens
True negative = Predict event does not happen, nothing happens
False positive = Predict event and event does not happen (false alarm)
False negative = Fail to predict event that does happen (missed alarm)
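The four cells can be counted directly from paired lists of actual and predicted labels. The sketch below uses invented labels; the accuracy, precision and recall formulas are standard metrics derived from the four counts and are added here as an assumption about what the slide's "quality metrics" refer to.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels encoded as 1 (event) and 0 (no event)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1   # predicted event, event happened
        elif p == 1 and a == 0:
            counts["FP"] += 1   # false alarm
        elif p == 0 and a == 1:
            counts["FN"] += 1   # missed alarm
        else:
            counts["TN"] += 1   # predicted no event, nothing happened
    return counts

# Hypothetical labels, for illustration only.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

c = confusion_counts(actual, predicted)
accuracy = (c["TP"] + c["TN"]) / len(actual)
precision = c["TP"] / (c["TP"] + c["FP"])   # share of predicted events that were real
recall = c["TP"] / (c["TP"] + c["FN"])      # share of real events that were caught

print(c)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```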
21. TRAINING AND TESTING
Train-test split
• Training set: 70%-90% of the data, used to build the model
• Test set: 10%-30% of the data, used to check the performance of the model on unseen data
Train & Test split
• Measure algorithm performance on both train and test sets!
• Performance will usually be worse on the test set
• Algorithm hyperparameter tuning can be used to improve test set performance
• Avoid overfitting!
• Actual performance of the algorithm in production is unlikely to be better than on the test set!
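A random train-test split takes only a few lines. This is a minimal sketch using the standard library; the function name `train_test_split`, the 20% test fraction and the fixed seed are illustrative choices, not something prescribed by the slides.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))                        # stand-in for 100 records
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting matters when the records arrive in a meaningful order (for example sorted by date or by class); for time-ordered data a chronological split may be more appropriate.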
22. TRAINING AND TESTING
Cross-validation
• Makes best use of the data
• Data split into N "folds" at random
• N models built. For each model, N-1 folds are used for training and one is used for testing
• Evaluation criteria averaged across folds
• Allows use of e.g. 90% training data / 10% test data splits for 10-fold cross-validation
• More data for training increases predictive power
• Reduces the chance of getting lucky/unlucky just due to the way a single train/test split is done
• More time- and compute-intensive
[Figure: 5-fold cross-validation, with results averaged across the folds]
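The fold logic described above can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`) and the toy mean-predictor model are hypothetical and serve only to show the mechanics of building N models and averaging their scores.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)                   # folds are formed at random
    folds = [indices[i::n_folds] for i in range(n_folds)]  # near-equal fold sizes
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

def cross_validate(xs, ys, fit, score, n_folds=5):
    """Train one model per fold and average the evaluation scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores), scores

# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(20))
ys = [3.0] * 20                                # constant target, so the toy model is exact
average, per_fold = cross_validate(xs, ys, fit_mean, mean_squared_error, n_folds=5)
print("per-fold MSE:", per_fold, "average:", average)
```

Every record is used for testing exactly once and for training N-1 times, which is what lets the averaged score depend less on one lucky or unlucky split.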
23. TYPICAL SUPERVISED LEARNING PIPELINE
Model training and testing
Model application
[Diagram: pipeline from model training and testing to model application; labels: regression model, value]
24. A SUPERVISED MACHINE LEARNING WORKFLOW
Phases: Prepare data; Model and predict; Impact business
Define problem and potential solution
Get the data
Understand the data
Clean the data
Feature engineering
Build and test model
Understand the model
What does it mean for the business?
What are we going to change?
Productionise
Iterate
Ongoing monitoring and improvements