Algorithm Current Situation
Context of our clients
1. Two types of clients:
a. Tertiary sector
i. Shopping centers (majority)
ii. Recreation center…
b. Industries
2. Data Source
a. The data come from sensors and meters installed at each customer's site.
b. These sensors and meters communicate with our server every hour or every fifteen minutes, depending on the type of device (mainly every fifteen minutes).
3. Data Structure
a. The data are divided into two categories for each customer:
i. The data to be predicted, composed of multiple meters: electricity, gas, water, cold, heat.
ii. The data that will help with the prediction, referred to as 'explanatory variables'. Some examples: the day, the opening hours, temperature, CO2, lighting, sensors in different areas of the buildings.
b. In general, each customer has several CO2, temperature, etc. sensors placed at different locations in the building.
c. The data structure differs between customers in terms of the number of meters. However, shopping centers share the same types of meters and explanatory variables. In other words, two customers that are both shopping centers will have different numbers of meters and explanatory variables (the number depends on the size of the site), but their consumption habits and the general structure of the data should be quite similar.
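As an illustration of this layout, here is a minimal pandas sketch of one day of 15-minute data; all column names and values are hypothetical, and the hourly resample shows how hourly devices can be aligned to the 15-minute grid:

```python
import numpy as np
import pandas as pd

# 15-minute index over one day (column names and values are hypothetical)
idx = pd.date_range("2022-01-01", periods=96, freq="15min")
rng = np.random.default_rng(0)

df = pd.DataFrame(
    {
        # meters to be predicted (outputs)
        "electricity_kwh": rng.uniform(40, 60, 96),
        "gas_kwh": rng.uniform(10, 20, 96),
        # explanatory variables (inputs)
        "temperature_c": rng.uniform(5, 15, 96),
        "co2_ppm": rng.uniform(400, 800, 96),
        "is_open": ((idx.hour >= 9) & (idx.hour < 20)).astype(int),
    },
    index=idx,
)

# devices reporting hourly can be aligned to the same grid by resampling
hourly = df["electricity_kwh"].resample("h").sum()
```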
Goal of the algorithm
1. Goal
The goal of the algorithm is twofold:
- to detect current abnormalities in consumption based on a reference period (note: the goal is not to predict the current or future consumption exactly);
- to normalize the consumption based on a reference period.
2. Steps
a. Training on a past reference period
i. The customer will choose a past period covering a whole calendar year.
ii. The algorithm will then have to train itself on this period and explain the consumption as well as possible from the explanatory variables. The explanatory variables are the inputs and the meters are the outputs.
b. Apply the model for current consumption:
i. The model will then be applied to real-time explanatory variables in order to calculate a normalized
current consumption (this value depends on the model and therefore will be different according to the
chosen reference period).
ii. The real-time consumption will then be compared to the normalized consumption.
iii. If the difference is too large, the algorithm must conclude that there is an abnormality. To decide this, we would like you to build a confidence interval.
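The steps above can be sketched as follows on synthetic data; scikit-learn's GradientBoostingRegressor stands in for XGBoost, and the empirical residual quantiles are one simple (assumed) way to build the confidence interval:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
coef = np.array([3.0, 1.0, 0.5, 0.0])

# a. reference period: explanatory variables (inputs) and meter readings (outputs)
X_ref = rng.normal(size=(1000, 4))
y_ref = X_ref @ coef + rng.normal(0, 0.3, 1000)
model = GradientBoostingRegressor().fit(X_ref, y_ref)

# residuals on the reference period give an empirical ~95% confidence interval
resid = y_ref - model.predict(X_ref)
lo, hi = np.quantile(resid, [0.025, 0.975])

# b. apply the model to real-time explanatory variables
X_now = rng.normal(size=(10, 4))
y_now = X_now @ coef + rng.normal(0, 0.3, 10)
normalized = model.predict(X_now)  # normalized current consumption
# flag an abnormality when the real-time reading leaves the interval
anomaly = (y_now - normalized < lo) | (y_now - normalized > hi)
```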
Our needs
The difficulties come from what we want to detect and from the reference periods.
What we want to know:
- Savings/losses compared to a reference period (being able to say: in 2023, if I had consumed the same way as in 2022, I would have consumed x kWh; I actually consumed y kWh, so a savings of z%).
To do this, we need a complete calendar year as the reference period, without processing/cleaning it. Do not worry about outliers in the reference period; they are considered normal.
- Detect problems and drifts in consumption (to save energy and warn the customer that there is a problem).
For this, the reference period should be the one that is most optimized in terms of consumption. That period becomes the benchmark, and we must do better than it. The goal is to find everything that deviates from the best period.
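The savings calculation in the first point reduces to comparing the model's counterfactual prediction against the actual total; a sketch with made-up numbers:

```python
# counterfactual from the reference-year model vs. actual consumption (made-up numbers)
predicted_2023_kwh = 1_200_000.0  # "if I had consumed the same way as in 2022"
actual_2023_kwh = 1_080_000.0

savings_pct = 100.0 * (predicted_2023_kwh - actual_2023_kwh) / predicted_2023_kwh
print(f"savings of {savings_pct:.1f}%")  # prints "savings of 10.0%"
```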
What we already had as an idea:
- "Clean a past period" by removing abnormal values (we "clean" 2022, for example)
- Either manually: there is a lot of data and it is sometimes difficult to quickly identify problems ⇒ complicated
- Or find a way to analyze the reference profile, detect what does not seem normal, remove those values, and use the result as a reference ⇒ perhaps it is better to directly analyze the profile and see whether it is normal or not ⇒ an isolation forest, or something else?
- The problem (or not, we don't know) with applying an isolation forest to the entire profile is that it does not remove abnormal past consumption.
- Have a 'rolling' reference period that sticks as closely as possible to today's date: reference period = the last month or the last 6 months,
for example
- The concern with this technique is that if consumption increases, the reference period would become bad, and high consumption would therefore become normal.
The difficulty is therefore: "How to detect overconsumption? Based on what?" We want to see whether a model with a one-year reference (XGBoost, for example) combined with another method (an isolation forest, for example) would meet our needs.
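A minimal sketch of the isolation-forest cleaning idea on synthetic data; the contamination rate is an assumed tuning parameter, not a recommendation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
consumption = rng.normal(50, 5, size=(1000, 1))  # synthetic reference profile
consumption[::100] = 200.0                       # injected abnormal readings

# flag suspect readings in the candidate reference period
iso = IsolationForest(contamination=0.02, random_state=0).fit(consumption)
keep = iso.predict(consumption) == 1  # predict returns 1 = inlier, -1 = outlier
cleaned = consumption[keep]           # cleaned candidate reference period
```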
Stages - Plan of action
Already done
1. Exploration of Data and Analysis: We computed correlations between our data and the explanatory variables, as we first wanted an idea of the feasibility of the project.
a. We then decided to create variables (see slide x, "Explanatory Variables")
b. We recomputed the correlations to check whether this made sense, and for most variables it did
c. We aggregated the consumption of the meters per usage to see if there were correlations. There were a few, but not as many as expected
2. Building the ML model
a. We tried 3 different models: XGBoost, SVR, and Elastic Net. We selected the best-performing one using the grid-search score of each.
b. We then focused on XGBoost and ran the code for all the meters of one site.
c. We are getting relatively good results; it depends on the meter.
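The model-selection step in 2.a can be sketched as follows on synthetic data; GradientBoostingRegressor stands in for XGBoost, and the parameter grids are purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

# candidate models and illustrative parameter grids
candidates = {
    "gbr": (GradientBoostingRegressor(random_state=0), {"n_estimators": [50, 100]}),
    "svr": (SVR(), {"C": [1.0, 10.0]}),
    "enet": (ElasticNet(), {"alpha": [0.1, 1.0]}),
}

# compare models by their best cross-validated score (R^2 by default)
scores = {}
for name, (est, grid) in candidates.items():
    gs = GridSearchCV(est, grid, cv=3).fit(X, y)
    scores[name] = gs.best_score_
best = max(scores, key=scores.get)
```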
Next to do
3. Come up with new feature-engineering ideas to feed the model. Think about ways of improving the model ⇒ feature selection, ... You might also want to test out different models.
a. We think feature engineering is an important part of our model's performance. To that end, we expect you to come up with ideas.
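As a starting point for step 3, a sketch of simple calendar features derived from the 15-minute timestamp index; the 09:00-20:00, Monday-Saturday opening schedule is hypothetical:

```python
import numpy as np
import pandas as pd

# one week of 15-minute timestamps
idx = pd.date_range("2022-01-01", periods=7 * 96, freq="15min")
feat = pd.DataFrame(index=idx)

# calendar features (the 09:00-20:00, Monday-Saturday schedule is hypothetical)
feat["hour"] = idx.hour
feat["day_of_week"] = idx.dayofweek
feat["is_weekend"] = (idx.dayofweek >= 5).astype(int)
feat["is_open"] = ((idx.hour >= 9) & (idx.hour < 20) & (idx.dayofweek < 6)).astype(int)

# cyclical encoding so 23:00 and 00:00 end up close in feature space
feat["hour_sin"] = np.sin(2 * np.pi * idx.hour / 24)
feat["hour_cos"] = np.cos(2 * np.pi * idx.hour / 24)
```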
4. Develop a statistical model that would flag anomalies based on the prediction of the ML model (XGBoost).
5. Explore and propose other ways to detect anomalies (unsupervised models such as isolation forest)
6. If different solutions are used, find a way to combine them with one another.
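One possible way to nest the two approaches from steps 4-6 would be to run an unsupervised detector on the residuals of the regression model rather than on the raw profile; a sketch on synthetic data, with an assumed contamination rate:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, IsolationForest

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 3))
y = X.sum(axis=1) + rng.normal(0, 0.2, 800)
y[::80] += 5.0  # occasional drifts in consumption

# stage 1: regression model explains consumption from explanatory variables
model = GradientBoostingRegressor().fit(X, y)
resid = (y - model.predict(X)).reshape(-1, 1)

# stage 2: unsupervised detector on the residuals instead of the raw profile
iso = IsolationForest(contamination=0.05, random_state=0).fit(resid)
flags = iso.predict(resid) == -1  # True where the residual looks abnormal
```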
Remarks
General:
- Each customer can have different ways of consuming. We would like the model to be as versatile as possible: it should work for different customers, and the algorithm should not vary between customers. You should anticipate this by testing it on the different datasets you will be provided with and, for example, by automating the fine-tuning process of finding adequate parameters for the model.
- It may be that one single model does not work for every customer. If this is the case, we will discuss alternatives. It is still possible to create one algorithm that tests different models, for example.
Data
- Not all the data should be taken into account in your calculations, as we sometimes have outliers. Most outliers are due to communication problems; this mostly translates into an absence of data during a certain period, followed by a huge peak.
- The list of explanatory variables may change according to the customer. If you have any ideas on what kind of variables we can
add please feel free to share your thoughts. We will decide together if this can be done.
- Some meters are less important than others, as their consumption is much lower. A list of the importance level of each meter will be included with the files you will be given.
- You should analyze the importance of each explanatory variable per customer. The goal is to have a list of important variables.
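The communication-outlier pattern described above (an absence of data followed by a catch-up peak) can be flagged with a simple rule; a sketch on synthetic data, where the 3x-median threshold is an assumption to tune:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", periods=96, freq="15min")
series = pd.Series(50.0, index=idx)
series.iloc[40:48] = np.nan  # communication gap: no data received
series.iloc[48] = 400.0      # catch-up peak once communication resumes

# a reading is suspect if it follows missing data and is far above the typical level
gap_before = series.shift(1).isna()
typical = series.median()           # NaNs are ignored by median
suspect = gap_before & (series > 3 * typical)
```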
Reference period
- By default, the reference period will be the previous calendar year. It could change, or the model could be retrained automatically at regular intervals.
Generating alarms:
- An alarm is created when there is abnormal consumption.
- We would rather create fewer alarms that are all true (true positives) than have too many false alarms (false positives). (Rather false negatives than false positives.)
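Favoring precision over recall can be encoded as a conservative alarm threshold; a sketch on synthetic residuals, where the 99.9th-percentile cut-off is an illustrative choice, not a prescribed value:

```python
import numpy as np

rng = np.random.default_rng(4)
resid = rng.normal(0, 1, 10_000)  # residuals observed under normal operation

# a high percentile trades missed anomalies (false negatives) for fewer false alarms
threshold = np.quantile(np.abs(resid), 0.999)

def alarm(residual: float) -> bool:
    """Raise an alarm only for residuals well outside normal operation."""
    return abs(residual) > threshold
```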