Predictive Maintenance

This document discusses predictive maintenance using machine learning algorithms. It explains the difference between conditional and predictive maintenance, the components of predictive maintenance including sensors, IoT technology and predictive models. It also discusses applying predictive algorithms, how to establish a predictive maintenance program and different predictive maintenance approaches.

Uploaded by

RAZIQ YOUSSEF
Copyright
© All Rights Reserved

PREDICTIVE MAINTENANCE
Application of Machine Learning Algorithms
2 CONDITION-BASED MAINTENANCE VS PREDICTIVE MAINTENANCE

DIFFERENCE

Condition-based maintenance: preventive maintenance based on monitoring how the equipment operates and on the data that indicate degradation.
Predictive maintenance: maintenance based on forecasts extrapolated from the analysis and evaluation of the data that indicate degradation.

SIMILARITY

Principle: detect the onset of anomalies on machines before they become too serious. The aim is therefore to anticipate failures.
3 COMPONENTS OF PREDICTIVE MAINTENANCE

There are three main components that allow PdM to track asset condition and warn technicians about upcoming equipment failures:

1. Installed condition-monitoring sensors
2. IoT technology
3. Predictive data models


4 APPLYING PREDICTIVE ALGORITHMS

The most important part of predictive maintenance (and arguably the hardest one) is building
predictive (a.k.a prognostic) algorithms. In essence, you must build a model that will consider
many different variables and how they interconnect and impact one another – with the ultimate
goal being able to predict machine failures.
5 HOW TO ESTABLISH A PREDICTIVE MAINTENANCE PROGRAM

Step #0: Secure the budget


Before making any plans, you need to get approval from top
management and the commitment that this project will be properly
funded.
Step #1: Identify critical assets
Start by identifying critical assets to be included in the PdM program.
Step #2: Establish a database
Another factor to consider is whether there is sufficient information to offer actionable
insights into machine behavior.
Step #3: Analyze and establish failure modes
At this point, the organization will need to perform an analysis on the
previously identified critical assets to establish their failure modes.
6 HOW TO ESTABLISH A PREDICTIVE MAINTENANCE PROGRAM

Step #3: Analyze and establish failure modes (FMEA)


7 HOW TO ESTABLISH A PREDICTIVE MAINTENANCE PROGRAM

Step #4: Implement condition monitoring sensors and equipment


Knowing which failure modes they need to watch out for, the organization
can buy appropriate sensors and technology to monitor parts that are
most likely to fail.
Step #5: Develop predictive algorithms
With everything else in place, the next step is designing the right
modeling approach that will form the basis for failure predictions.
Step #6: Deploy to pilot equipment
This is where the predictive modeling is put to test and validated by
deploying the technology to a selected group of pilot equipment.
8 PREDICTIVE MAINTENANCE APPROACHES
1) Problem definition: Classification or regression approach
– Classification: Will it fail?
Multi-class classification: Will it fail for reason X?
– Regression: After how long will it fail?
2) Methods:
– Traditional machine learning:
• Decision trees, Random forests, gradient boosting trees, isolation forest
• SVM (Support Vector Machines)
– Deep learning approach:
• CNN (Convolutional Neural Network)
• RNN (Recurrent Neural Network)/LSTM (Long Short-Term Memory)/GRU (Gated Recurrent Unit)
– Hybrid of deep learning and Physics-Based Modeling (PBM):
• Use PBM to generate training data where lacking
PART I:

Linear Regression, Decision Trees, Random Forest & SVM for Predictive Maintenance
10 DATA PREPROCESSING STEP 1
11 DATA PREPROCESSING STEP 1

STEP 1-1 : IMPORTING THE DATASET


During the dataset importing process, there’s another essential thing you must do – extracting
dependent and independent variables. For every Machine Learning model, it is necessary to
separate the independent variables (matrix of features) and dependent variables in a dataset.
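
As a minimal sketch of this step, the snippet below reads a small table and separates the matrix of features X from the dependent variable y. The column names, the values and the helper `load_features_and_target` are hypothetical, chosen only for illustration; a real project would read its own file instead of the inline text.

```python
import csv
import io

# Hypothetical sensor log (assumption); in practice this would be a CSV file on disk.
RAW_CSV = """temperature,vibration,pressure,failure
71.2,0.31,101.5,0
88.9,0.87,99.8,1
69.5,0.28,102.1,0
"""

def load_features_and_target(csv_text, target_column):
    """Return (feature_names, X, y): X is the matrix of features, y the dependent variable."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Every column except the target is treated as an independent variable.
    feature_names = [name for name in rows[0] if name != target_column]
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [int(row[target_column]) for row in rows]
    return feature_names, X, y

feature_names, X, y = load_features_and_target(RAW_CSV, "failure")
```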
12 DATA PREPROCESSING STEP 1

STEP 1-2 : HANDLING NULL VALUES


In any real-world dataset, there are almost always a few null values. Most models cannot handle these NULL
or NaN values on their own, so we need to intervene.
Basically, there are two ways to handle missing data:
•Deleting rows or columns – In this method, you remove a specific row that has a null value for a
feature, or a particular column where more than 75% of the values are missing.
•Calculating the mean – This method is useful for features having numeric data like age, salary,
year, etc. Here, you calculate the mean of the feature or column that contains a
missing value and substitute it for the missing value. Because no rows are discarded, no data is
lost, which often yields better results than the first method. Note that filling in the mean
reduces the variance of the feature rather than adding to it.
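
The two options can be sketched as follows. This is only an illustration under simple assumptions: missing values are represented by `None`, every column is numeric, and the readings are made up.

```python
def drop_rows_with_missing(rows):
    """Option 1: keep only the rows that have no missing value."""
    return [row for row in rows if all(value is not None for value in row)]

def impute_column_means(rows):
    """Option 2: replace each missing value by the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

# Hypothetical readings: [temperature, vibration], None marks a missing value.
readings = [[70.0, 0.25], [None, 0.75], [90.0, None]]
dropped = drop_rows_with_missing(readings)   # only the first row survives
imputed = impute_column_means(readings)      # all three rows are kept
```

With deletion only one of the three rows remains, while imputation keeps all of them, which is the trade-off described above.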
13 DATA PREPROCESSING STEP 1

STEP 1-3 : HANDLING CATEGORICAL VARIABLES

Categorical data refers to information that has specific categories within the dataset, such as country and color.

Categorical variables are further divided into 2 types:

•Ordinal categorical variables: these variables can be ordered. Ex: size of a T-shirt. We can say that M<L<XL.

•Nominal categorical variables: these variables can’t be ordered. Ex: color of a T-shirt. We can’t say that Blue<Green, as it doesn’t make sense to compare colors that have no ordering relationship.
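
A common way to handle the two types, sketched here as an assumption rather than as the method used in the slides, is to map ordinal categories to integers that preserve their order and to one-hot encode nominal categories so that no artificial order is introduced.

```python
# Ordinal variable: the mapping encodes the known order M < L < XL.
SIZE_ORDER = {"M": 0, "L": 1, "XL": 2}

def ordinal_encode(values, order):
    """Replace each category by its rank in the given order."""
    return [order[value] for value in values]

def one_hot_encode(values):
    """Replace each category by a 0/1 indicator vector, one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, encoded

sizes = ordinal_encode(["L", "M", "XL"], SIZE_ORDER)
categories, colors = one_hot_encode(["Blue", "Green", "Blue"])
```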


14 DATA PREPROCESSING STEP 1

STEP 1-4 : SPLITTING THE DATASET


Splitting the dataset is the next step in data preprocessing. Every dataset for a
machine learning model must be split into two separate sets: a training set and a test set.

A training set denotes the subset of a dataset that is used for training the model.
A test set is the subset of the dataset that is used for testing the trained model.
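
A minimal sketch of such a split is shown below. The function name, the 20% test fraction and the seed are assumptions made for the example; with `shuffle=False` the last rows become the test set, which is the usual choice for time-ordered data.

```python
import random

def train_test_split(X, y, test_fraction=0.2, seed=0, shuffle=True):
    """Split X and y into training and test parts, keeping rows and labels aligned."""
    indices = list(range(len(X)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(X) * test_fraction))
    # The test set is taken from the end of the (possibly shuffled) index list.
    train_idx, test_idx = indices[:-n_test], indices[-n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

X = [[float(i)] for i in range(10)]
y = [2 * i for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(X, y)
```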
15 DATA PREPROCESSING STEP 1

STEP 1-5 : STANDARDIZATION


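To make learning easier for the model, the data should take small values and be
homogeneous, meaning all features should take values in the same range. Standardization rescales each feature to zero mean and unit variance; min-max normalization instead maps each feature to the 0-1 range.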
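
Both rescalings can be sketched for a single feature column as follows. The temperature values are hypothetical; in practice the mean, standard deviation, minimum and maximum would be computed on the training set and then reused for the test set.

```python
import statistics

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

def min_max_scale(column):
    """Rescale a column linearly so that its values fall in the 0-1 range."""
    lo, hi = min(column), max(column)
    return [(value - lo) / (hi - lo) for value in column]

temperatures = [10.0, 20.0, 30.0]   # hypothetical sensor readings
z_scores = standardize(temperatures)
scaled = min_max_scale(temperatures)
```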
16 LINEAR REGRESSION STEP 2

Linear regression is used for finding a linear relationship between a target and one or more
predictors. There are two types of linear regression: simple and multiple.
The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total
prediction error (over all data points) is as small as possible. Error is the distance between a point
and the regression line.
17 LINEAR REGRESSION
1. Simple Linear Regression
This method uses a single independent variable to predict a dependent variable by fitting the best
linear relationship:

$$y = m x + c$$

where $c$ is the intercept and $m$ is the slope.
2. Multiple Linear Regression
This method uses more than one independent variable to predict a dependent variable by
fitting the best linear relationship.
18 LINEAR REGRESSION – SOME QUESTIONS

What is a linear regression ?

Linear regression is used for finding a linear relationship between a target and one or more
predictors. There are two types of linear regression: simple and multiple.

The simple linear regression model can be represented graphically as a best-fit line between the data points, while the multiple linear regression model can be represented as a plane (with two predictors) or a hyperplane (with more than two predictors).
19 LINEAR REGRESSION
GRADIENT DESCENT

Gradient descent is an iterative optimization algorithm to find the minimum of a function. Here
that function is our loss function. Gradient descent starts with an initial set of parameter
values (c and m) and iteratively moves towards a set of values that minimizes the loss
function.
20 LINEAR REGRESSION
Loss Function

The loss measures the error of the predictions made with the current values of c and m. Our goal is to minimize this error to obtain the most accurate values of c and m.

• Mean Squared Error (MSE)

Mean Squared Error (also called L2 loss) is the average of the squared differences between the actual and the predicted values. For a data point Yi and its predicted value Ŷi, where n is the total number of data points in the dataset, the mean squared error is defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$
21 LINEAR REGRESSION
Loss Function

• Root Mean Squared Error (RMSE)

RMSE takes the MSE value and applies a square root to it. Because RMSE is expressed in the same units as the target, it can be interpreted directly as the ‘average error’ that our prediction model makes. For a data point Yi and its predicted value Ŷi, where n is the total number of data points in the dataset, RMSE is defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2}$$
22 LINEAR REGRESSION

• Mean Absolute Error (MAE)

Mean Absolute Error (also called L1 loss) takes the average of the absolute differences between the actual and the predicted values:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$

Similarities

Both MAE and RMSE express average model prediction error in units of the variable of interest. Both metrics can range from 0 to ∞ and are indifferent to the direction of errors.

Differences

In RMSE, since the errors are squared before they are averaged, a relatively high weight is given to large errors. This means the RMSE should be more useful when large errors are particularly undesirable.
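
The three metrics can be computed in a few lines. The sketch below uses made-up values with a single large error, to illustrate that RMSE penalizes it more than MAE does.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values: three perfect predictions and one error of 4 units.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.0, 5.0, 7.0, 13.0]
scores = (mse(actual, predicted), rmse(actual, predicted), mae(actual, predicted))
```

Here MAE is 1.0 while RMSE is 2.0, because squaring gives the single large error more weight.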
23 LINEAR REGRESSION
STEPS OF GRADIENT DESCENT

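1. Initially let c = 0 and m = 0. Let L be our learning rate.

2. Calculate the partial derivative of the loss function with respect to m, and plug in the current
values of x, y, m and c to obtain the derivative value Dₘ. Taking MSE as the loss, with Ŷi = m·xi + c:

$$D_m = -\frac{2}{n}\sum_{i=1}^{n} x_i\left(Y_i - \hat{Y}_i\right)$$

Dₘ is the value of the partial derivative with respect to m. Similarly, let’s find the partial
derivative with respect to c, D_c:

$$D_c = -\frac{2}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)$$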
24 LINEAR REGRESSION
STEPS OF GRADIENT DESCENT

3. Now we update the current values of m and c using the following equations:

$$m \leftarrow m - L \cdot D_m \qquad c \leftarrow c - L \cdot D_c$$

4. We repeat this process until our loss function is a very small value or ideally 0 (which means 0 error). The values of m and c that we are left with will then be the optimum values.
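
The four steps can be sketched as a short loop. This is a minimal illustration that assumes MSE as the loss; the learning rate, the number of iterations and the toy data (generated from y = 2x + 1) are arbitrary choices for the example, and a fixed iteration count stands in for the "loss is small enough" stopping test.

```python
def gradient_descent(x, y, learning_rate=0.05, iterations=2000):
    """Fit y ≈ m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                      # step 1: start from m = 0, c = 0
    n = len(x)
    for _ in range(iterations):          # step 4: repeat
        predictions = [m * xi + c for xi in x]
        # step 2: partial derivatives of the MSE with respect to m and c
        d_m = (-2 / n) * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, predictions))
        d_c = (-2 / n) * sum(yi - pi for yi, pi in zip(y, predictions))
        # step 3: move against the gradient, scaled by the learning rate
        m -= learning_rate * d_m
        c -= learning_rate * d_c
    return m, c

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]            # toy data lying exactly on y = 2x + 1
m, c = gradient_descent(x, y)
```

On this toy data the loop recovers a slope close to 2 and an intercept close to 1.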
25 LINEAR REGRESSION
STEPS OF GRADIENT DESCENT

The choice of the learning rate (L above, written α here) is very important, as it determines whether Gradient Descent
converges in a reasonable time:
1. If we choose α to be very large, Gradient Descent can overshoot the minimum. It may fail to
converge or even diverge.
2. If we choose α to be very small, Gradient Descent will take small steps and will need
a longer time to reach the minimum.

[Figure: gradient descent steps with a very large α and with a very small α]


26 LINEAR REGRESSION – SOME QUESTIONS

Evaluating our Model


How do we evaluate the accuracy of our model?

First of all, you need to make sure that you train the model on the training dataset and compute
evaluation metrics on the test set, so that overfitting can be detected. Afterwards, you can check several
evaluation metrics to determine how well your model performed.

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)


27 LINEAR REGRESSION – SOME QUESTIONS

What is overfitting?
Overfitting occurs when the model is useful in reference only to its training
data set, and not to any other new data sets.
For example, it would be a big red flag if our model
saw 99% accuracy on the training set but only 55%
accuracy on the test set.
To find out …

Batch, Mini-Batch and Stochastic Gradient Descent for Linear Regression


29 LINEAR REGRESSION – SOME QUESTIONS

What is Batch Gradient Descent ?

In Batch Gradient Descent, all the training data is taken into consideration to take a single
step. We take the average of the gradients of all the training examples and then use that mean
gradient to update our parameters. So one epoch consists of just one step of gradient descent.

What is Stochastic Gradient Descent ?

In Stochastic Gradient Descent (SGD), we consider just one example at a time to take a single
step. We do the following steps in one epoch for SGD:
1. Take an example.
2. Feed it to the regression model.
3. Calculate its gradient.
4. Use the gradient to update the weights.
5. Repeat steps 1–4 for all the examples in the training dataset.
30 LINEAR REGRESSION – SOME QUESTIONS

What is Stochastic Gradient Descent ?

In the case of SGD, there will be ‘m’ iterations per epoch, where ‘m’ here is the number of
observations in the dataset (not the slope m used earlier).
31 LINEAR REGRESSION – SOME QUESTIONS
What is Mini-Batch Gradient Descent ?

Mini-Batch Gradient Descent is a mixture of Batch Gradient Descent and SGD. The batch size is the number of samples to process before the model is updated.

After creating the mini-batches of fixed size, we do the following steps in one epoch:
1. Pick a mini-batch.
2. Feed it to the regression model.
3. Calculate the mean gradient of the mini-batch.
4. Use the mean gradient we calculated in step 3 to update the weights.
5. Repeat steps 1–4 for all the mini-batches we created.
32 LINEAR REGRESSION – SOME QUESTIONS

What is Mini-Batch Gradient Descent ?


EXAMPLE:

Suppose we have a dataset with 1280 samples and we choose a batch size of 32 and 1000 epochs.

Then we’ll have 40 batches, each with 32 samples. The model will be updated after each batch of 32 samples. One epoch will involve 40 batches, or 40 updates to the model (iterations).

With 1000 epochs, the model will be exposed to the whole dataset 1000 times, for a total of 40,000 batches during the entire training process.

The data is shuffled prior to each epoch (for time series, we don’t shuffle the data).
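
The batch arithmetic of this example can be checked with a short sketch. The helper `make_batches` is hypothetical; it only builds the index batches for one epoch, and the same function covers Batch Gradient Descent (batch size equal to the dataset size) and SGD (batch size 1).

```python
import random

def make_batches(n_samples, batch_size, shuffle=True, rng=None):
    """Return the list of index batches for one epoch."""
    indices = list(range(n_samples))
    if shuffle:
        # Shuffle before each epoch; skip this for time series.
        (rng or random).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

n_samples, batch_size, epochs = 1280, 32, 1000
batches = make_batches(n_samples, batch_size, rng=random.Random(0))
updates_per_epoch = len(batches)              # one model update per mini-batch
total_updates = updates_per_epoch * epochs

batch_gd_updates = len(make_batches(n_samples, n_samples))   # whole dataset per step
sgd_updates = len(make_batches(n_samples, 1))                 # one sample per step
time_series_batches = make_batches(n_samples, batch_size, shuffle=False)
```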
33 LINEAR REGRESSION – APPLICATION 1

Amazon_cloths sells clothes online. Customers can order the clothes they want either on a mobile app or on the website.
The company is trying to decide whether to focus its efforts on the mobile app experience or on the website.
34 LINEAR REGRESSION – APPLICATION 2

APPLICATION OF LINEAR REGRESSION ON HOUSE PRICE PREDICTION
35 LINEAR REGRESSION - APPLICATION

SOME APPLICATIONS OF LINEAR REGRESSION:

1) Analyzing the influence of raw materials on production quality.

2) Predicting the rainfall of the coming days based on the increase in temperature.

3) Agricultural scientists often use linear regression to measure the effect of fertilizer and water on crop yields.

4) Data scientists for professional sports teams often use linear regression to measure the effect that different training regimes have on player performance.

5) Crime data mining: predicting the crime rate of a state based on drug usage, number of gangs, human trafficking …


36 DECISION TREE

The decision tree is one of the most powerful and popular tools for classification and prediction. Decision
tree algorithms are often referred to as CART, or Classification and Regression Trees.
A decision tree typically starts with a single node, which branches into possible outcomes. Each
of those outcomes leads to additional nodes, which branch off into other possibilities. This gives
it a tree-like shape.
37 DECISION TREE
38 DECISION TREE
EXAMPLE OF A DECISION TREE
39 DECISION TREE
What is Entropy?

It is a measure of the impurity of a node. By impurity, we mean the heterogeneity at a particular node.
For example:
Assume that we have 50 red balls and 50 blue balls in a set. In this case, the proportions of the
balls of both colors are equal. Hence, the entropy would be 1. But if the set has 98 red balls
and 2 blue balls, then the entropy would be low (about 0.14). This is because the set is now mostly pure, as it
mostly contains balls belonging to one category. Because of this, the heterogeneity is reduced.
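
The ball example can be reproduced with a few lines, assuming the usual Shannon entropy with a base-2 logarithm, E = −Σ p·log₂(p), summed over the classes present at the node.

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a node, given the count of each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:                        # an empty class contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

balanced = entropy([50, 50])     # 50 red, 50 blue
mostly_pure = entropy([98, 2])   # 98 red, 2 blue
pure = entropy([100, 0])         # a single class
```

The balanced set gives an entropy of 1, the 98/2 set about 0.14, and a completely pure set 0.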
40 DECISION TREE - CLASSIFICATION
Information Gain

Information gain measures the reduction of uncertainty given some feature and it is also a
deciding factor for which attribute should be selected as a decision node or root node.

It is simply the entropy of the full dataset minus the entropy of the dataset given some feature: IG = E(Parent) − E(Parent|Feature).

Example:
Suppose our entire population has a total of 30 instances. The objective is to predict whether
the person will go to the gym or not. Let’s say 16 people go to the gym and 14 people don’t.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral” and “Highly
motivated”.
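
The computation for this example can be sketched as follows. Only the parent counts (16 go to the gym, 14 don’t) come from the example; the way the 30 people are distributed between “high” and “low” energy is a hypothetical split invented for illustration.

```python
import math

def entropy(class_counts):
    """Shannon entropy (base 2) of a node, given the count of each class."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [16, 14]                 # [go to the gym, don't go], from the example
# Hypothetical split on "Energy" (assumption): counts are [go, don't go] per value.
energy_children = [[12, 1],       # Energy = high
                   [4, 13]]       # Energy = low

parent_entropy = entropy(parent)
gain_energy = information_gain(parent, energy_children)
```

The feature with the largest gain is the one selected for the decision node; the same function can be applied to the three-valued “Motivation” feature by passing three child nodes.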
41 DECISION TREE - CLASSIFICATION
Information Gain

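To get the weighted average of the entropy of each child node, we weight each node's entropy by its share of the instances:

$$E(\text{Parent}\mid\text{Energy}) = \sum_{k} \frac{n_k}{n}\,E(\text{node}_k)$$

Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:

$$IG = E(\text{Parent}) - E(\text{Parent}\mid\text{Energy})$$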
42 DECISION TREE - CLASSIFICATION
Information Gain

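To get the weighted average of the entropy of each child node, we again weight each node's entropy by its share of the instances:

$$E(\text{Parent}\mid\text{Motivation}) = \sum_{k} \frac{n_k}{n}\,E(\text{node}_k)$$

Now that we have the values of E(Parent) and E(Parent|Motivation), the information gain will be:

$$IG = E(\text{Parent}) - E(\text{Parent}\mid\text{Motivation})$$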
To find out …

PERFORMANCE METRICS FOR CLASSIFIERS: CONFUSION MATRIX, PRECISION, RECALL, & F1-SCORE
44 CLASSIFICATION METRICS
Confusion Matrix

Make the Confusion Matrix Less Confusing.


A confusion matrix is a technique for summarizing the performance of a classification
algorithm.
45 CLASSIFICATION METRICS
Confusion Matrix

True Positive (TP)


•The predicted value matches the actual value
•The actual value was positive and the model predicted a positive value
True Negative (TN)
•The predicted value matches the actual value
•The actual value was negative and the model predicted a negative value
False Positive (FP)
•The predicted value was falsely predicted
•The actual value was negative but the model predicted a positive value
False Negative (FN)
•The predicted value was falsely predicted
•The actual value was positive but the model predicted a negative value
46 CLASSIFICATION METRICS

Accuracy
Accuracy is one metric for evaluating classification models. Informally, accuracy is the
fraction of predictions our model got right. Formally, accuracy has the following definition:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

It works well only if there are roughly equal numbers of samples belonging to each class.
For example, consider that there are 98% samples of class A and 2% samples of class B in our
training set. Then our model can easily get 98% training accuracy by simply predicting that every
training sample belongs to class A.
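
The 98%/2% example can be checked by counting the four confusion-matrix outcomes for a model that always predicts class A. In this sketch class B is treated as the positive class, which is an assumption made for the illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count true positives, true negatives, false positives and false negatives."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that match the actual class."""
    return (tp + tn) / (tp + tn + fp + fn)

actual = ["A"] * 98 + ["B"] * 2
predicted = ["A"] * 100                     # the model always answers "A"
tp, tn, fp, fn = confusion_counts(actual, predicted, positive="B")
acc = accuracy(tp, tn, fp, fn)
```

The accuracy is 0.98 even though the model never identifies a single sample of class B (TP = 0, FN = 2).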
47 CLASSIFICATION METRICS
Accuracy

Imagine someone claimed to create a model to identify terrorists trying to board flights with greater than 99 percent accuracy ... entirely in their head. Would you believe them?

Well, here’s the model: simply label every single person flying from a U.S. airport as “not a terrorist.” Given the 800 million average passengers on U.S. flights per year and the 19 (confirmed) terrorists who boarded U.S. flights from 2000–2017, this model achieves an astounding accuracy of 99.9999999 percent!

While this solution has nearly perfect accuracy, this problem is one in which accuracy is clearly not an adequate metric.


48 CLASSIFICATION METRICS

Accuracy

The terrorist detection task is an imbalanced classification problem: we have two classes we need to identify (terrorists and not terrorists), with one category (non-terrorists) representing most of the data points. Another imbalanced classification problem occurs in disease detection when the rate of the disease in the public is very low. In both these cases, the positive class (disease or terrorist) is greatly outnumbered by the negative class.


49 CLASSIFICATION METRICS

Recall
Recall attempts to answer the following question: What proportion of actual positives was
identified correctly?
Mathematically, we define recall as the number of true positives divided by the number of true
positives plus the number of false negatives:

$$\text{Recall} = \frac{TP}{TP + FN}$$
50 CLASSIFICATION METRICS

Recall

Recall is a good measure to use when the cost of a false negative is high. For instance, in sick patient detection, if a sick patient (actual positive) goes through the test and is predicted as not sick (predicted negative), the cost associated with the false negative will be extremely high if the sickness is contagious or fatal.


51 CLASSIFICATION METRICS

Precision
Precision attempts to answer the following question: What proportion of positive identifications
was actually correct?
Mathematically, precision is the number of true positives divided by the number of true positives plus
the number of false positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$
52 CLASSIFICATION METRICS

Precision

Precision is a good measure to use when the cost of a false positive is high. For instance, in email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted positive). The email user might lose important emails if the precision of the spam detection model is not high.
53 CLASSIFICATION METRICS
EXAMPLE
54 CLASSIFICATION METRICS
Precision & Recall

Example:

If we label all individuals as terrorists, then our recall goes to 1.0! We have a perfect classifier, right?

Well, not exactly. A model labelling 100 percent of passengers as terrorists is probably not useful because we would have to ban every single person from flying.
55 CLASSIFICATION METRICS
Precision & Recall

In some situations, we might know we want to maximize either recall or precision at the expense of the other metric. For example, in preliminary disease screening of patients for follow-up examinations, we would probably want a recall near 1.0 (we want to find all patients who actually have the disease), and we can accept a low precision (we accidentally flag some patients who don’t actually have the disease) if the cost of the follow-up examination isn’t high. However, in cases where we want to find an optimal blend of precision and recall, we can combine the two metrics using the F1 score.


56 CLASSIFICATION METRICS

F1 Score
F1 Score is needed when you want to seek a balance between Precision and Recall.
The F1 score is the harmonic mean of precision and recall, taking both metrics into
account in the following equation:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1 Score might be a better measure to use if [we need to seek a balance between Precision
and Recall] AND [there is an uneven class distribution (large number of Actual Negatives)].
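
The three metrics can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts for a model that labels everyone as positive when only 2 of 100 cases are actually positive, to illustrate why a recall of 1.0 alone is not enough.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: everyone is labelled positive, 2 of 100 are actually positive.
precision, recall, f1 = precision_recall_f1(tp=2, fp=98, fn=0)
```

Recall is perfect, but precision is only 0.02, and the harmonic mean pulls the F1 score down to roughly 0.04.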
57 CLASSIFICATION METRICS - RECAP
Four Outcomes of Binary Classification

•True positives: data points labeled as positive that are actually positive
•False positives: data points labeled as positive that are actually negative
•True negatives: data points labeled as negative that are actually negative
•False negatives: data points labeled as negative that are actually positive

Recall, Precision & F1 Score

- Recall: What percent of the positive cases did you catch?

- Precision: What percent of your positive predictions were correct?
- F1 score: a single metric that combines recall and precision using the harmonic mean
58 DECISION TREE- CLASSIFICATION

Splitting categorical features


When we have more than one feature, we need to look at the feature which provides the
maximum information gain after the split.

Starting with feature F₁:

The information gain (IG₁) would be :


59 DECISION TREE- CLASSIFICATION

Splitting categorical features

Now checking for feature F₂:

and the information gain (IG₂) would be :

Conclusion: Since IG₂ > IG₁, we would first split the data using feature F₂
60 DECISION TREE- CLASSIFICATION

Splitting categorical features


Note that the node after F₂=IND can further be broken down using
feature F₁. The final tree will look like-
61 DECISION TREE- CLASSIFICATION

Splitting numerical features


1. Sort by values of F₁.
2. If we split on F₁ using each distinct value as its own branch, that will lead to overfitting.

Solution: The way we deal with this problem is by
defining a threshold value and splitting the data
into two parts: one with values less than or
equal to the threshold, and the other with values
greater than that.
62 DECISION TREE- CLASSIFICATION

Splitting numerical features


How to determine the threshold?
63 DECISION TREE- CLASSIFICATION

Splitting numerical features


How to determine the threshold?

Step 1: Sort the data (ascending order) based on the continuous independent variable.

Step 2: Take the average of each pair of consecutive values.

Step 3: For quantitative data, there will be only two branches, either ≤ or >. The check is applied to all the average values computed. Whichever threshold results in the lowest impurity (lowest entropy) will be considered final.
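
The three steps can be sketched as follows. The glucose values and labels are hypothetical, and the impurity of a candidate threshold is taken here to be the size-weighted entropy of the two resulting groups, which is one reasonable reading of “lowest entropy”.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted entropy) for the best '<= threshold' split."""
    distinct = sorted(set(values))                                        # step 1
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]    # step 2
    best = None
    for threshold in candidates:                                          # step 3
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if best is None or weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical glucose readings with a diabetic (1) / non-diabetic (0) label.
glucose = [85, 90, 110, 140, 150, 160]
diabetic = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(glucose, diabetic)
```

On this toy data the candidate thresholds are 87.5, 100, 125, 145 and 155, and the split at 125 separates the two classes perfectly.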


64 DECISION TREE- CLASSIFICATION

The final tree will look like :

New data point [Glucose, BMI, Age]: [120.21, 23.1, 35]
65 DECISION TREE

APPLICATION OF DECISION TREE


CLASSIFICATION ON DIABETES DATASET
66 DECISION TREE - CLASSIFICATION

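random-state :

It is important to note that the random-state value can have a significant effect on the measured quality of your
model (by quality I essentially mean prediction accuracy), because it controls the randomness of the train/test split and of the tree construction. So it is important to fix the
random-state value to get reproducible results. Keep in mind that picking whichever value gives the highest test accuracy tends to overstate how well the model will perform on new data; comparing the scores obtained with several values gives a more reliable picture.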
