The Big Book of Machine Learning Use Cases
A collection of technical blogs,
including code samples and notebooks
Contents

CHAPTER 1: Introduction
CHAPTER 2: Using Dynamic Time Warping and MLflow to Detect Sales Trends
CHAPTER 3: Fine-Grained Time Series Forecasting at Scale With Prophet and Apache Spark
CHAPTER 4: Doing Multivariate Time Series Forecasting With Recurrent Neural Networks
CHAPTER 5: Detecting Financial Fraud at Scale With Decision Trees and MLflow on Databricks
CHAPTER 6: Automating Digital Pathology Image Analysis With Machine Learning on Databricks
CHAPTER 7: A Convolutional Neural Network Implementation for Car Classification
CHAPTER 8: Processing Geospatial Data at Scale With Databricks
CHAPTER 1: Introduction

The world of machine learning is evolving so fast that it's not easy to find real-world use cases that are relevant to what you're working on. That's why we've collected these blogs from industry thought leaders with practical use cases you can put to work right now. This how-to reference guide provides everything you need — including code samples — so you can get your hands dirty exploring machine learning on the Databricks platform.
CHAPTER 2: Understanding Dynamic Time Warping
Part 1 of our Using Dynamic Time Warping and MLflow to Detect Sales Trends series
by RICARDO PORTILLA, BRENNER HEINTZ and DENNY LEE
April 30, 2019

Introduction

The phrase "dynamic time warping," at first read, might evoke images of Marty McFly driving his DeLorean at 88 MPH in the "Back to the Future" series. Alas, dynamic time warping does not involve time travel; instead, it's a technique used to dynamically compare time series data when the time indices between comparison data points do not sync up perfectly.

As we'll explore below, one of the most salient uses of dynamic time warping is in speech recognition — determining whether one phrase matches another, even if the phrase is spoken faster or slower than its comparison. You can imagine that this comes in handy to identify the "wake words" used to activate your Google Home or Amazon Alexa device — even if your speech is slow because you haven't yet had your daily cup(s) of coffee.

Dynamic time warping is a useful, powerful technique that can be applied across many different domains. Once you understand the concept of dynamic time warping, it's easy to see examples of its applications in daily life, and its exciting future applications.
Data scientists, data analysts and anyone working with time series data should become familiar with this technique,
given that perfectly aligned time-series comparison data can be as rare to see in the wild as perfectly “tidy” data.
In this blog, we will explore:

• The basic principles of dynamic time warping
• Running dynamic time warping on sample audio data
• Running dynamic time warping on sample sales data using MLflow
Background

Dynamic time warping is a seminal time series comparison technique that has been used for speech and word recognition since the 1970s with sound waves as the source; an often cited paper is Dynamic time warping for isolated word recognition based on ordered graph searching techniques.

This technique can be used not only for pattern matching, but also anomaly detection (e.g., overlap time series between two disjoint time periods to understand if the shape has changed significantly or to examine outliers). For example, when looking at the red and blue lines in the following graph, note that the traditional time series matching (i.e., Euclidean matching) is extremely restrictive. On the other hand, dynamic time warping allows the two curves to match up evenly even though the X-axes (i.e., time) are not necessarily in sync. Another way to think of this is as a robust dissimilarity score, where a lower number means the series is more similar.

[Figure: Euclidean matching vs. dynamic time warping]

Two time series (the base time series and new time series) are considered similar when it is possible to map with function f(x) according to the following rules so as to match the magnitudes using an optimal (warping) path.
Three of the clips (clips 1, 2 and 4) are based on the quote:

"Doors and corners, kid. That's where they get you."

And one clip (clip 3) is the quote:

"You walk into a room too fast, the room eats you."

• CLIP 3: This is another time series that's based on the quote "You walk into a room too fast, the room eats you." with the same intonation and speed as clip 1.
• CLIP 4: This is a new time series [v3] based on clip 1 where the intonation and speech pattern is similar to clip 1.
[Figure: waveform plots of the four clips. Clip 1: "Doors and corners, kid. That's where they get you." [v1]; Clip 2: same quote [v2]; Clip 3: "You walk into a room too fast, the room eats you."; Clip 4: same quote as clip 1 [v3]]
As noted in the plots above, the two clips (in this case, clips 1 and 4) have different intonations (amplitude) and latencies for the same quote.

The code to read these audio clips and visualize them using matplotlib can be summarized in the following code snippet.
import matplotlib.pyplot as plt

# Create subplots
ax = plt.subplot(2, 2, 1)
ax.plot(data1, color='#67A0DA')
...
The full code base can be found in the notebook Dynamic Time Warping Background.
If we were to follow a traditional Euclidean matching (per the following graph), even if we were to discount the amplitudes, the timings between the original clip (blue) and the new clip (yellow) do not match. With dynamic time warping, we can shift time to allow for a time series comparison between these two clips.

[Figure: Euclidean matching vs. dynamic time warping of the two clips]
For our time series comparison, we will use the fastdtw PyPI library; the instructions to install PyPI libraries within your Databricks workspace can be found here: Azure | AWS. By using fastdtw, we can quickly calculate the distance between the different time series.
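Although the complete implementation lives in the linked notebook, a minimal sketch of a fastdtw call looks like the following; the two short arrays are illustrative stand-ins for the audio samples.

import numpy as np
from fastdtw import fastdtw

# Two toy series standing in for the base clip and the new clip
ts_base = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0])
ts_new = np.array([1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])

# fastdtw returns the DTW distance and the optimal warping path between the series
distance, path = fastdtw(ts_base, ts_new)
print(distance)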
The full code base can be found in the notebook Dynamic Time Warping Background.
As you can see, with dynamic time warping, one can ascertain the similarity of two
different time series.
Next
Now that we have discussed dynamic time warping, let’s apply this use case to detect
sales trends.
CHAPTER 2: Using Dynamic Time Warping and MLflow to Detect Sales Trends
Part 2 of our Using Dynamic Time Warping and MLflow to Detect Sales Trends series
by RICARDO PORTILLA, BRENNER HEINTZ and DENNY LEE

Background

Imagine that you own a company that creates 3D-printed products. Last year, you knew that drone propellers were showing very consistent demand, so you produced and sold those, and the year before you sold phone cases. The new year is arriving very soon, and you're sitting down with your manufacturing team to figure out what your company should produce for next year. Buying the 3D printers for your warehouse put you deep into debt, so you have to make sure that your printers are running at or near 100% capacity at all times in order to make the payments on them.

Since you're a wise CEO, you know that your production capacity over the next year will ebb and flow — there will be some weeks when your production capacity is higher than others. For example, your capacity might be higher during the summer (when you hire seasonal workers), and lower during the third week of every month (because of issues with the 3D printer filament supply chain). Take a look at the chart below to see your company's production capacity estimate:

Dynamic time warping comes into play here because sometimes supply and demand for the product you choose will be slightly out of sync. There will be some weeks when you simply don't have enough capacity to meet all of your demand, but as long as you're very close and you can make up for it by producing more products in the week or two before or after, your customers won't mind. If we limited ourselves to comparing the sales data with our production capacity using Euclidean matching, we might choose a product that didn't account for this, and leave money on the table. Instead, we'll use dynamic time warping to choose the product that's right for your company this year.
Load the product sales data set

We will use the weekly sales transaction data set found in the UCI Data Set Repository to perform our sales-based time series analysis. (Source attribution: James Tan, [email protected], Singapore University of Social Sciences)

import pandas as pd

# Use Pandas to read this data
sales_pdf = pd.read_csv(sales_dbfspath, header='infer')

# Review data
display(spark.createDataFrame(sales_pdf))

Each product is represented by a row, and each week in the year is represented by a column. Values represent the number of units of each product sold per week. There are 811 products in the data set.

Calculate distance to optimal time series by product code

# Calculate distance via dynamic time warping between product code and optimal time series
import numpy as np
import _ucrdtw

def get_keyed_values(s):
    return(s[0], s[1:])

def compute_distance(row):
    return(row[0], _ucrdtw.ucrdtw(list(row[1][0:52]), list(optimal_pattern), 0.05, True)[1])

ts_values = pd.DataFrame(np.apply_along_axis(get_keyed_values, 1, sales_pdf.values))
distances = pd.DataFrame(np.apply_along_axis(compute_distance, 1, ts_values.values))
distances.columns = ['pcode', 'dtw_dist']

Using the calculated dynamic time warping 'distances' column, we can view the distribution of DTW distances in a histogram.
From there, we can identify the product codes closest to the optimal sales trend (i.e., those that have the smallest calculated DTW distance). Since we're using Databricks, we can easily make this selection using a SQL query. Let's display those that are closest.

%sql
-- Top 10 product codes closest to the optimal sales trend
select pcode, cast(dtw_dist as float) as dtw_dist from distances order by cast(dtw_dist as float) limit 10

After running this query, along with the corresponding query for the product codes that are furthest from the optimal sales trend, we were able to identify the two products that are closest and furthest from the trend. Let's plot both of those products and see how they differ.

As you can see, Product #675 (shown in the orange triangles) represents the best match to the optimal sales trend, although the absolute weekly sales are lower than we'd like (we'll remedy that later). This result makes sense since we'd expect the product with the closest DTW distance to have peaks and valleys that somewhat mirror the metric we're comparing it to. (Of course, the exact time index for the product would vary on a week-by-week basis due to dynamic time warping.) Conversely, Product #716 (shown in the green stars) is the product with the worst match, showing almost no variability.
Finding the optimal product: Small DTW distance and similar absolute sales numbers

Now that we've developed a list of products that are closest to our factory's projected output (our "optimal sales trend"), we can filter them down to those that have small DTW distances as well as similar absolute sales numbers. One good candidate would be Product #202, which has a DTW distance of 6.86 versus the population median distance of 7.89 and tracks our optimal trend very closely.

# Review P202 weekly sales
y_p202 = sales_pdf[sales_pdf['Product_Code'] == 'P202'].values[0][1:53]

Using MLflow to track best and worst products, along with artifacts

MLflow is an open-source platform for managing the machine learning lifecycle, including experimentation, reproducibility and deployment. Databricks notebooks offer a fully integrated MLflow environment, allowing you to create experiments, log parameters and metrics, and save results. For more information about getting started with MLflow, take a look at the excellent documentation.

MLflow's design is centered around the ability to log all of the inputs and outputs of each experiment we do in a systematic, reproducible way. On every pass through the data, known as a "Run," we're able to log our experiment's:

• PARAMETERS: The inputs to our model
• METRICS: The output of our model, or measures of our model's success
• ARTIFACTS: Any files created by our model — for example, PNG plots or CSV data output
• MODELS: The model itself, which we can later reload and use to serve predictions
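As a quick sketch of what those log calls look like in practice (the parameter, metric and artifact names below are examples, not the exact ones used in the accompanying notebook):

import mlflow

# Log a parameter, a metric and an artifact for a single run
with mlflow.start_run():
    mlflow.log_param("stretch_factor", 0.05)
    mlflow.log_metric("median_dtw_dist", 7.89)
    mlflow.log_artifact("dtw_histogram.png")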
In our case, we can use it to run the dynamic time warping algorithm several times over our data while changing the "stretch factor," the maximum amount of warp that can be applied to our time series data. To initiate an MLflow experiment, and allow for easy logging using mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact() and mlflow.log_model(), we wrap our main function using:

with mlflow.start_run() as run:
    ...

import mlflow

def run_DTW(ts_stretch_factor):
    # calculate DTW distance and Z-score for each product
    with mlflow.start_run() as run:
        ...

        # Log Model using Custom Flavor
        dtw_model = {'stretch_factor' : float(ts_stretch_factor), 'pattern' : optimal_pattern}
        mlflow_custom_flavor.log_model(dtw_model, artifact_path="model")

        return run.info
With each run through the data, we’ve created a log of the “stretch factor” parameter
being used, and a log of products we classified as being outliers based upon the Z-score
of the DTW distance metric. We were even able to save an artifact (file) of a histogram
of the DTW distances. These experimental runs are saved locally on Databricks and
remain accessible in the future if you decide to view the results of your experiment at
a later date.
Now that MLflow has saved the logs of each experiment, we can go back through and examine the results. From your Databricks notebook, select the "Runs" icon in the upper right-hand corner to view and compare the results of each of our runs.

Not surprisingly, as we increase our "stretch factor," our distance metric decreases. Intuitively, this makes sense: As we give the algorithm more flexibility to warp the time indices forward or backward, it will find a closer fit for the data. In essence, we've traded some bias for variance.

Logging models in MLflow

MLflow has the ability to not only log experiment parameters, metrics and artifacts (like plots or CSV files), but also to log machine learning models. An MLflow model is simply a folder that is structured to conform to a consistent API, ensuring compatibility with other MLflow tools and features. This interoperability is very powerful, allowing any Python model to be rapidly deployed to many different types of production environments.

MLflow comes preloaded with a number of common model "flavors" for many of the most popular machine learning libraries, including scikit-learn, Spark MLlib, PyTorch, TensorFlow and others. These model flavors make it trivial to log and reload models after they are initially constructed, as demonstrated in this blog post. For example, when using MLflow with scikit-learn, logging a model is as easy as running the following code from within an experiment:

mlflow.sklearn.log_model(model=sk_model, artifact_path="sk_model_path")

MLflow also offers a "Python function" flavor, which allows you to save any model from a third-party library (such as XGBoost or spaCy), or even a simple Python function itself, as an MLflow model. Models created using the Python function flavor live within the same ecosystem and are able to interact with other MLflow tools through the Inference API. Although it's impossible to plan for every use case, the Python function model flavor was designed to be as universal and flexible as possible. It allows for custom processing and logic evaluation, which can come in handy for ETL applications. Even as more "official" model flavors come online, the generic Python function flavor will still serve as an important "catchall," providing a bridge between Python code of any kind and MLflow's robust tracking toolkit.

Logging a model using the Python function flavor is a straightforward process. Any model or function can be saved as a model, with one requirement: It must take in a pandas DataFrame as input, and return a DataFrame or NumPy array. Once that requirement is met, saving your function as an MLflow model involves defining a Python class that inherits from PythonModel, and overriding the .predict() method with your custom function, as described here.
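As a rough sketch (the class, its fields and the scoring logic here are illustrative, not the custom flavor used in the notebook), a Python function model can be defined and logged like this:

import mlflow
import mlflow.pyfunc
import numpy as np

class DTWDistanceModel(mlflow.pyfunc.PythonModel):
    """Illustrative pyfunc wrapper that scores each row of a pandas DataFrame."""

    def __init__(self, optimal_pattern, stretch_factor):
        self.optimal_pattern = optimal_pattern
        self.stretch_factor = stretch_factor

    def predict(self, context, model_input):
        # Placeholder scoring logic; a real model would compute DTW distances here
        return np.zeros(len(model_input))

# Log the custom model as part of an MLflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=DTWDistanceModel(optimal_pattern=[1, 2, 3], stretch_factor=0.05),
    )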
To show that our model is working as intended, we can now load the model and use it to measure DTW distances on two new products that we've created within the variable new_sales_units:

loaded_model = mlflow_custom_flavor.load_model(artifact_path='model', run_id='e26961b25c4d4402a9a5a7a679fc8052')
CHAPTER 3: Fine-Grained Time Series Forecasting at Scale With Prophet and Apache Spark™
by BILAL OBEIDAT, BRYAN SMITH and BRENNER HEINTZ
January 27, 2020

Advances in time series forecasting are enabling retailers to generate more reliable demand forecasts. The challenge now is to produce these forecasts in a timely manner and at a level of granularity that allows the business to make precise adjustments to product inventories. Leveraging Apache Spark™ and Facebook Prophet, more and more enterprises facing these challenges are finding they can overcome the scalability and accuracy limits of past solutions.

In this post, we'll discuss the importance of time series forecasting, visualize some sample time series data, then build a simple model to show the use of Facebook Prophet. Once you're comfortable building a single model, we'll combine Prophet with the magic of Apache Spark™ to show you how to train hundreds of models at once, allowing us to create precise forecasts for each individual product-store combination at a level of granularity rarely achieved until now.
New expectations require more precise time series forecasting methods and models
For some time, enterprise resource planning (ERP) systems and third-party solutions have provided retailers with demand
forecasting capabilities based upon simple time series models. But with advances in technology and increased pressure
in the sector, many retailers are looking to move beyond the linear models and more traditional algorithms historically
available to them.
New capabilities, such as those provided by Facebook Prophet, are emerging from the data science community, and companies are seeking the flexibility to apply these machine learning models to their time series forecasting needs.
This movement away from traditional forecasting solutions requires retailers and the like to develop in-house expertise not only in the complexities of demand forecasting but also in the efficient distribution of the work required to generate hundreds of thousands or even millions of machine learning models in a timely manner. Luckily, we can use Spark to distribute the training of these models, making it possible to predict not just overall demand for products and services, but the unique demand for each product in each location.

Next, by viewing the same data on a monthly basis, we can see that the year-over-year upward trend doesn't progress steadily each month. Instead, we see a clear seasonal pattern of peaks in the summer months, and troughs in the winter months. Using the built-in data visualization feature of Databricks Collaborative Notebooks, we can see the value of our data during each month by mousing over the chart.
Getting started with a simple time series forecasting model on Facebook Prophet

As illustrated in the charts above, our data shows a clear year-over-year upward trend in sales, along with both annual and weekly seasonal patterns. It's these overlapping patterns in the data that Prophet is designed to address.

Facebook Prophet follows the scikit-learn API, so it should be easy to pick up for anyone with experience with sklearn. We need to pass in a 2-column pandas DataFrame as input: the first column is the date, and the second is the value to predict (in our case, sales). Once our data is in the proper format, building a model is easy:

import pandas as pd
from fbprophet import Prophet

# instantiate the model and set parameters
model = Prophet(
    interval_width=0.95,
    growth='linear',
    daily_seasonality=False,
    weekly_seasonality=True,
    yearly_seasonality=True,
    seasonality_mode='multiplicative'
)

Now that we have fit our model to the data, let's use it to build a 90-day forecast. In the code below, we define a data set that includes both historical dates and 90 days beyond, using Prophet's make_future_dataframe method:

future_pd = model.make_future_dataframe(
    periods=90,
    freq='d',
    include_history=True
)

# predict over the dataset
forecast_pd = model.predict(future_pd)

That's it! We can now visualize how our actual and predicted data line up as well as a forecast for the future using Prophet's built-in .plot method. As you can see, the weekly and seasonal demand patterns we illustrated earlier are in fact reflected in the forecasted results.

predict_fig = model.plot(forecast_pd, xlabel='date', ylabel='sales')
display(predict_fig)
[Figure: forecast plot showing historical data and forecasted data]

This visualization is a bit busy. Bartosz Mikulski provides an excellent breakdown of it that is well worth checking out. In a nutshell, the black dots represent our actuals, with the darker blue line representing our predictions and the lighter blue band representing our (95%) uncertainty interval.
How to use Spark DataFrames to distribute the processing of time series data

Data scientists frequently tackle the challenge of training large numbers of models using a distributed data processing engine such as Apache Spark. By leveraging a Spark cluster, individual worker nodes in the cluster can train a subset of models in parallel with other worker nodes, greatly reducing the overall time required to train the entire collection of time series models.

Of course, training models on a cluster of worker nodes (computers) requires more cloud infrastructure, and this comes at a price. But with the easy availability of on-demand cloud resources, companies can quickly provision the resources they need, train their models, and release those resources just as quickly, allowing them to achieve massive scalability without long-term commitments to physical assets.

store_item_history
    .groupBy('store', 'item')
    # . . .

We share the groupBy code here to underscore how it enables us to train many models in parallel efficiently, although it will not actually come into play until we set up and apply a UDF to our data in the next section.
# . . .
# return predictions
return results_pd
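Only the tail of the forecasting function survives in the excerpt above. As a rough sketch, assuming a grouped-map pandas UDF (the schema, Prophet settings and column names here are assumptions rather than the notebook's exact code), the function might look like:

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DateType, IntegerType, FloatType
from fbprophet import Prophet

# Schema of the forecast rows returned for each store-item group (assumed)
result_schema = StructType([
    StructField('ds', DateType()),
    StructField('store', IntegerType()),
    StructField('item', IntegerType()),
    StructField('yhat', FloatType()),
    StructField('yhat_lower', FloatType()),
    StructField('yhat_upper', FloatType()),
])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def forecast_store_item(history_pd):
    # train a Prophet model on this store-item group's history
    model = Prophet(interval_width=0.95, weekly_seasonality=True, yearly_seasonality=True)
    model.fit(history_pd.rename(columns={'date': 'ds', 'sales': 'y'}))

    # build a 90-day forecast and keep only the columns we need
    future_pd = model.make_future_dataframe(periods=90, freq='d', include_history=True)
    results_pd = model.predict(future_pd)[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]

    # carry the group keys through so forecasts can be traced back to a store-item pair
    results_pd['store'] = history_pd['store'].iloc[0]
    results_pd['item'] = history_pd['item'].iloc[0]

    # return predictions
    return results_pd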
Now, to bring it all together, we use the groupBy command we discussed earlier to ensure our data set is properly partitioned into groups representing specific store and item combinations. We then simply apply the UDF to our DataFrame, allowing the UDF to fit a model and make predictions on each grouping of data.

The data set returned by the application of the function to each group is updated to reflect the date on which we generated our predictions. This will help us keep track of data generated during different model runs as we eventually take our functionality into production.

results = (
    store_item_history
        .groupBy('store', 'item')
        .apply(forecast_store_item)
        .withColumn('training_date', current_date())
    )

Next steps

We have now constructed a time series forecasting model for each store-item combination. Using a SQL query, analysts can view the tailored forecasts for each product. In the chart below, we've plotted the projected demand for product #1 across 10 stores. As you can see, the demand forecasts vary from store to store, but the general pattern is consistent across all of the stores, as we would expect.

As new sales data arrives, we can efficiently generate new forecasts and append these to our existing table structures, allowing analysts to update the business's expectations as conditions evolve.

To learn more, watch the on-demand webinar entitled How Starbucks Forecasts Demand at Scale With Facebook Prophet and Azure Databricks.
CHAPTER 4: Doing Multivariate Time Series Forecasting With Recurrent Neural Networks
Using Keras' implementation of Long Short-Term Memory (LSTM) for time series forecasting
by VEDANT JAIN
September 10, 2019
Try this notebook in Databricks

Time series forecasting is an important area in machine learning. It can be difficult to build accurate models because of the nature of the time-series data. With recent developments in the neural networks aspect of machine learning, we can tackle a wide variety of problems that were either out of scope or difficult to do with classical time series predictive approaches. In this post, we will demonstrate how to use Keras' implementation of Long Short-Term Memory (LSTM) for time series forecasting and MLflow for tracking model runs.

What are LSTMs?

LSTM is a type of Recurrent Neural Network (RNN) that allows the network to retain long-term dependencies at a given time from many timesteps before. RNNs were designed to that effect using a simple feedback approach for neurons where the output sequence of data serves as one of the inputs. However, long-term dependencies can make the network untrainable due to the vanishing gradient problem. LSTM is designed precisely to solve that problem.

Sometimes accurate time series predictions depend on a combination of both bits of old and recent data. We have to efficiently learn even what to pay attention to, accepting that there may be a long history of data to learn from. LSTMs combine simple DNN architectures with clever mechanisms to learn what parts of history to "remember" and what to "forget" over long periods. The ability of LSTMs to learn patterns in data over long sequences makes them suitable for time series forecasting.

For the theoretical foundation of LSTM's architecture, see here (Chapter 4).
Experiment
The data set we chose for this experiment is perfect for building regression models of appliances' energy use. The house temperature and humidity conditions were monitored with a ZigBee wireless sensor network, with readings taken at 10-minute intervals for about 4.5 months.
The energy data was logged with m-bus energy meters. Weather from the nearest airport
weather station (Chievres Airport, Belgium) was downloaded from a public data set
from Reliable Prognosis (rp5.ru), and merged together with the experimental data sets
using the date and time column. The data set can be downloaded from the UCI Machine
Learning repository.
We’ll use this to train a model that predicts the energy consumed by household
appliances for the next day.
Data modeling
Before we can train a neural network, we need to model the data in a way the network
can learn from a sequence of past values. Specifically, LSTM expects the input data
in a specific 3D tensor format of test sample size by timesteps by the number of input
features. As a supervised learning approach, LSTM requires both features and labels in
order to learn. In the context of time series forecasting, it is important to provide the past
values as features and future values as labels, so LSTMs can learn how to predict the
future. Thus, we explode the time series data into a 2D array of features called ‘X’, where
the input data consists of overlapping lagged values at the desired number of timesteps
in batches. We generate a 1D array called ‘y’ consisting of only the labels or future values
that we are trying to predict for every batch of input features. The input data also should
include lagged values of ‘y’ so the network can also learn from past values of the labels.
See the following image for further explanation:
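In code form, the same windowing and reshaping can be sketched with a toy example (the series values and window length below are made up purely for illustration):

import numpy as np

# A toy univariate series and a window length of 3 timesteps
series = np.array([10, 20, 30, 40, 50, 60], dtype=float)
timesteps = 3

# Overlapping lagged windows become the features X; the value that follows
# each window becomes the label y
X = np.array([series[i:i + timesteps] for i in range(len(series) - timesteps)])
y = series[timesteps:]

# LSTM expects a 3D tensor of shape (samples, timesteps, input_dim)
X = X.reshape((X.shape[0], timesteps, 1))
print(X.shape, y.shape)  # (3, 3, 1) (3,)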
Our data set has 10-minute samples. In the image above, we have chosen length = 3, which implies we have 30 minutes of data in every sequence (at 10-minute intervals). By that logic, features 'X' should be a tensor of values [X(t), X(t+1), X(t+2)], [X(t+2), X(t+3), X(t+4)], [X(t+3), X(t+4), X(t+5)]… and so on. And our target variable 'y' should be [y(t+3), y(t+4), y(t+5)…y(t+10)] because the number of timesteps or length is equal to 3, so we will ignore values y(t), y(t+1), y(t+2). Also, in the graph, it's apparent that for every input row, we're only predicting one value out in the future, i.e., y(t+n+1); however, for more realistic scenarios you can choose to predict further out in the future, i.e., y(t+n+L), as you will see in our example below.

The Keras API has a built-in class called TimeseriesGenerator that generates batches of overlapping temporal data. This class takes in a sequence of data points gathered at equal intervals, along with time series parameters such as stride, length of history, etc., to produce batches for training/validation. For a full list of tuning parameters, see here.

So, let's say for our use case, we want to learn to predict from 6 days' worth of past data and predict values some time out in the future, let's say, 1 day. In that case, length is equal to 864, which is the number of 10-minute timesteps in 6 days (24x6x6). Similarly, we also want to learn from past values of humidity, temperature, pressure, etc., which means that for every label we will have 864 values per feature. Our data set has a total of 28 features.

When generating the temporal sequences, the generator is configured to return batches consisting of 6 days' worth of data every time. To make it a more realistic scenario, we choose to predict the usage 1 day out in the future (as opposed to the next 10-minute interval), so we prepare the test and train data sets in a manner that the target vector is a set of values 144 timesteps (24x6x1) out in the future. For details, see the notebook, section 2: Normalize and prepare the data set.

The shape of the input set should be (samples, timesteps, input_dim) (see https://keras.io/layers/recurrent/). For every batch, we will have all 6 days' worth of data, which is 864 rows. The batch size determines the number of samples before a gradient update takes place.

from keras.preprocessing.sequence import TimeseriesGenerator

# Create overlapping windows of lagged values for training and testing datasets
timesteps = 864
train_generator = TimeseriesGenerator(trainX, trainY, length=timesteps, sampling_rate=1, batch_size=timesteps)
test_generator = TimeseriesGenerator(testX, testY, length=timesteps, sampling_rate=1, batch_size=timesteps)

Model training

LSTMs are able to tackle the long-term dependency problems in neural networks, using a concept known as Backpropagation through time (BPTT). To read more about BPTT, see here.

Before we train an LSTM network, we need to understand a few key parameters provided in Keras that will determine the quality of the network.

1. EPOCHS: Number of times the data will be passed to the neural network
2. STEPS PER EPOCH: The number of batch iterations before a training epoch is considered finished
3. ACTIVATIONS: Layer describing which activation function to use
4. OPTIMIZER: Keras provides built-in optimizers
units = 128
num_epoch = 5000
learning_rate = 0.00144

with mlflow.start_run(experiment_id=3133492, nested=True):

    model = Sequential()
    model.add(CuDNNLSTM(units, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.add(LeakyReLU(alpha=0.5))
    model.add(Dropout(0.1))
    model.add(Dense(1))

    adam = Adam(lr=learning_rate)

In order to take advantage of the speed and performance of GPUs, we use the CuDNN implementation of LSTM. We have also chosen an arbitrarily high number of epochs. This is because we want to make sure that the data undergoes as many iterations as possible to find the best model fit. As for the number of units, we have 28 features, so we start with 32. After a few iterations, we found that using 128 gave us decent results.

For choosing the number of epochs, it's a good approach to choose a high number to avoid underfitting. In order to circumvent the problem of overfitting, you can use built-in callbacks in the Keras API, specifically EarlyStopping. EarlyStopping stops the model training when the monitored quantity has stopped improving. In our case, we use loss as the monitored quantity and the model will stop training when there's no decrease of 1e-5 for 50 epochs. Keras has built-in regularizers (weighted, dropout) that penalize the network to ensure a smoother distribution of parameters, so the network does not rely too much on the context passed between the units. See the image below for the layers in the network.

# Stop training when a monitored quantity has stopped improving.
callback = [EarlyStopping(monitor="loss", min_delta=0.00001, patience=50, mode='auto', restore_best_weights=True), tensorboard]
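The compile-and-fit step is not shown in this excerpt. A minimal sketch of what it might look like, assuming the model, optimizer, callback list and generators defined above (the loss choice here is an assumption):

# Compile the network and train it on the overlapping windows produced by the
# TimeseriesGenerator; num_epoch, adam, callback and train_generator are defined above
model.compile(loss='mean_squared_error', optimizer=adam)
model.fit_generator(train_generator, epochs=num_epoch,
                    steps_per_epoch=len(train_generator),
                    callbacks=callback)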
In order to send the output of one layer to the other, we need an activation function. In this case, we use LeakyReLU, which is an improved variant of its predecessor, the Rectified Linear Unit, or ReLU for short.

Keras provides many different optimizers for reducing loss and updating weights iteratively over epochs. For a full list of optimizers, see here. We choose the Adam version of stochastic gradient descent.
Summary

• LSTM can be used to learn from past values in order to predict future occurrences. LSTMs for time series don't make certain assumptions that are made in classical approaches, which makes it easier to model time series problems and learn nonlinear dependencies among multiple inputs.

• When creating a sequence of events before feeding into the LSTM network, it is important to lag the labels from inputs, so the LSTM network can learn from past data. The TimeseriesGenerator class in Keras allows users to prepare and transform the time series data set with various parameters before feeding the time-lagged data set to the neural network.

• LSTM has a series of tunable hyperparameters, such as epochs, batch size, etc., which are imperative to determining the quality of the predictions. Learning rate is an important hyperparameter that controls how the model weights get updated and the speed at which the model learns. It is very important to determine an optimal value for the learning rate in order to get the best model performance. Consider using the LearningRateScheduler callback in order to tweak the learning rate to the optimal value.

• Keras provides a choice of different optimizers to use with respect to the type of problem you're solving. Generally, Adam tends to do well. Using the MLflow UI, the user can compare model runs side by side to choose the best model.

• For time series, it's important to maintain temporality in the data so the LSTM network can learn patterns from the correct sequence of events. Therefore, it is important not to shuffle the data when creating test and validation sets and also when fitting the model.

• Like all machine learning approaches, LSTM is not immune to bad fitting, which is why Keras has the EarlyStopping callback. With some degree of intuition and the right callback parameters, you can get decent model performance without putting too much effort into tuning hyperparameters.

• RNNs, specifically LSTMs, work best when given large amounts of data. So, when little data is available, it is preferable to start with a smaller network with a few hidden layers. Smaller data also allows users to provide a larger batch of data to every epoch, which can yield better results.
CHAPTER 5: Detecting Financial Fraud at Scale With Decision Trees and MLflow on Databricks
May 2, 2019
To build these detection patterns, a team of domain experts comes up with a set of rules based on how fraudsters typically
behave. A workflow may include a subject matter expert in the financial fraud detection space putting together a set of
requirements for a particular behavior. A data scientist may then take a subsample of the available data and select a set of
deep learning or machine learning algorithms using these requirements and possibly some known fraud cases. To put the
pattern in production, a data engineer may convert the resulting model to a set of rules with thresholds, often implemented
using SQL.
This approach allows the financial institution to present a clear set of characteristics
that led to the identification of a fraudulent transaction that is compliant with the
General Data Protection Regulation (GDPR). However, this approach also poses numerous
difficulties. The implementation of a fraud detection system using a hardcoded set of
rules is very brittle. Any changes to the fraud patterns would take a very long time to
update. This, in turn, makes it difficult to keep up with and adapt to the shift in fraudulent
activities that are happening in the current marketplace.
Additionally, the systems in the workflow described above are often siloed, with the domain experts, data scientists and data engineers all compartmentalized. The data engineer is responsible for maintaining massive amounts of data and translating the work of the domain experts and data scientists into production level code. Due to a lack of a common platform, the domain experts and data scientists have to rely on sampled down data that fits on a single machine for analysis. This leads to difficulty in communication and ultimately a lack of collaboration.

In this blog, we will showcase how to convert several such rule-based detection use cases to machine learning use cases on the Databricks platform, unifying the key players in fraud detection: domain experts, data scientists and data engineers. We will learn how to create a machine learning fraud detection data pipeline and visualize the data in real-time, leveraging a framework for building modular features from large data sets. We will also learn how to detect fraud using decision trees and Apache Spark MLlib. We will then use MLflow to iterate and refine the model to improve its accuracy.
Training a supervised machine learning model to detect financial fraud is very difficult due to the low number of actual confirmed examples of fraudulent behavior. However, the presence of a known set of rules that identify a particular type of fraud can help create a set of synthetic labels and an initial set of features. The output of the detection pattern that has been developed by the domain experts in the field has likely gone through the appropriate approval process to be put in production. It produces the expected fraudulent behavior flags and may, therefore, be used as a starting point to train a machine learning model. This simultaneously mitigates three concerns:

Furthermore, the concern with machine learning models being difficult to interpret may be further assuaged if a decision tree model is used as the initial machine learning model. Because the model is being trained to a set of rules, the decision tree is likely to outperform any other machine learning model. The additional benefit is, of course, the utmost transparency of the model, which will essentially show the decision-making process for fraud, but without human intervention and the need to hard code any rules or thresholds. Of course, it must be understood that the future iterations of the model may utilize a different algorithm altogether to achieve maximum accuracy. The transparency of the model is ultimately achieved by understanding the features that went into the algorithm. Having interpretable features will yield interpretable and defensible model results.

The biggest benefit of the machine learning approach is that after the initial modeling effort, future iterations are modular and updating the set of labels, features or model type is very easy and seamless, reducing the time to production. This is further facilitated on the Databricks Collaborative Notebooks where the domain experts, data scientists, and data engineers may work off the same data set at scale and collaborate directly in the notebook environment. So let's get started!
%sql
-- Organize by Type
select type, count(1) from financials group by type

To get an idea of how much money we are talking about, let's also visualize the data based on the types of transactions and on their contribution to the amount of cash transferred (i.e., sum(amount)).

Rules-based model

We are not likely to start with a large data set of known fraud cases to train our model. In most practical applications, fraudulent detection patterns are identified by a set of rules established by the domain experts. Here, we create a column called "label" based on these rules.

# Rules to Identify Known Fraud-based
df = df.withColumn("label",
    F.when(
        ((df.oldbalanceOrg <= 56900) & (df.type == "TRANSFER") & (df.newbalanceDest <= 105)) |
        ((df.oldbalanceOrg > 56900) & (df.newbalanceOrig <= 12)) |
        ((df.oldbalanceOrg > 56900) & (df.newbalanceOrig > 12) & (df.amount > 1160000)),
        1
    ).otherwise(0))
Visualizing data flagged by rules

These rules often flag quite a large number of fraudulent cases. Let's visualize the number of flagged transactions. We can see that the rules flag about 4% of the cases and 11% of the total dollar amount as fraudulent.

%sql
select label, count(1) as `Transactions`, sum(amount) as `Total Amount` from financials_labeled group by label

Selecting the appropriate machine learning models

In many cases, a black box approach to fraud detection cannot be used. First, the domain experts need to be able to understand why a transaction was identified as fraudulent. Then, if action is to be taken, the evidence has to be presented in court. The decision tree is an easily interpretable model and is a great starting point for this use case. Read this blog "The wise old tree" on decision trees to learn more.

# View the Decision Tree model (prior to CrossValidator)
dt_model = pipeline.fit(train)

[Figure: visual representation of the Decision Tree model]
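The pipeline object that is fit above is not defined in this excerpt. A rough sketch of how such a Spark MLlib pipeline might be assembled (the stages, feature columns and maxDepth are illustrative assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Encode the categorical transaction type and assemble the numeric features
indexer = StringIndexer(inputCol="type", outputCol="typeIndexed")
assembler = VectorAssembler(
    inputCols=["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig",
               "oldbalanceDest", "newbalanceDest"],
    outputCol="features")

# Decision tree trained against the rule-generated "label" column
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)

pipeline = Pipeline(stages=[indexer, assembler, dt])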
# Build the best model (balanced training and full test datasets)
train_pred_b = cvModel_b.transform(train_b)
test_pred_b = cvModel_b.transform(test)
---
# Output:
# PR train: 0.999629161563572
# AUC train: 0.9998071389056655
# PR test: 0.9904709171789063
# AUC test: 0.9997903902204509
Review the results

Now let's look at the results of our new confusion matrix. The model misidentified only one fraudulent case. Balancing the classes seems to have improved the model.

[Figure: confusion matrix (balanced test), true label vs. predicted label for Fraud / No Fraud]

Model feedback and using MLflow

Once a model is chosen for production, we want to continuously collect feedback to ensure that the model is still identifying the behavior of interest. Since we are starting with a rule-based label, we want to supply future models with verified true labels based on human feedback. This stage is crucial for maintaining confidence and trust in the machine learning process. Since analysts are not able to review every single case, we want to ensure we are presenting them with carefully chosen cases to validate the model output. For example, predictions where the model has low certainty are good candidates for analysts to review. The addition of this type of feedback will ensure the models will continue to improve and evolve with the changing landscape.

MLflow helps us throughout this cycle as we train different model versions. We can keep track of our experiments, comparing the results of different model configurations and parameters. For example, here we can compare the PR and AUC of the models trained on balanced and unbalanced data sets using the MLflow UI. Data scientists can use MLflow to keep track of the various model metrics and any additional visualizations and artifacts to help make the decision of which model should be deployed in production. The data engineers will then be able to easily retrieve the chosen model along with the library versions used for training as a .jar file to be deployed on new data in production. Thus, the collaboration between the domain experts who review the model results, the data scientists who update the models, and the data engineers who deploy the models in production is strengthened throughout this process.

Conclusion

We have reviewed an example of how to use a rule-based fraud detection label and convert it to a machine learning model using Databricks with MLflow. This approach allows us to build a scalable, modular solution that will help us keep up with ever-changing fraudulent behavior patterns. Building a machine learning model to identify fraud allows us to create a feedback loop that allows the model to evolve and identify new potential fraudulent patterns. We have seen how a decision tree model, in particular, is a great starting point to introduce machine learning to a fraud detection program due to its interpretability and excellent accuracy.

A major benefit of using the Databricks platform for this effort is that it allows for data scientists, engineers and business users to seamlessly work together throughout the process. Preparing the data, building models, sharing the results, and putting the models into production can now happen on the same platform, allowing for unprecedented collaboration. This approach builds trust across the previously siloed teams, leading to an effective and dynamic fraud detection program.

Try this notebook by signing up for a free trial in just a few minutes and get started creating your own models.
CHAPTER 6: Automating Digital Pathology Image Analysis With Machine Learning on Databricks

1. SLOW AND COSTLY DATA INGEST AND ENGINEERING PIPELINES: WSI images are usually very large (typically 0.5–2 GB per slide) and can require extensive image preprocessing.

2. TROUBLE SCALING DEEP LEARNING TO TERABYTES OF IMAGES: Training a deep learning model across a modestly sized data set with hundreds of WSIs can take days to weeks on a single node. These latencies prevent rapid experimentation on large data sets. While latency can be reduced by parallelizing deep learning workloads across multiple nodes, this is an advanced technique that is out of the reach of a typical biological data scientist.

3. ENSURING REPRODUCIBILITY OF THE WSI WORKFLOW: When it comes to novel insights based on patient data, it is very important to be able to reproduce results. Current solutions are mostly ad hoc and do not allow efficient ways of keeping track of experiments and versions of the data used during machine learning model training.
In this blog, we discuss how the Databricks Unified Data Analytics Platform can be used to address these challenges and deploy end-to-end scalable deep learning workflows on WSI image data. We will focus on a workflow that trains an image segmentation model that identifies regions of metastases on a slide. In this example, we will use Apache Spark to parallelize data preparation across our collection of images, use pandas UDF to extract features based on pretrained models (transfer learning) across many nodes, and use MLflow to reproducibly track our model training.

1. PATCH GENERATION: Using coordinates annotated by a pathologist, we crop slide images into equally sized patches. Each image can generate thousands of patches and is labeled as tumor or normal.

2. DEEP LEARNING: We use transfer learning to use a pretrained model to extract features from image patches and then use Apache Spark to train a binary classifier to predict tumor vs. normal patches.

3. SCORING: We then use the trained model that is logged using MLflow to project a probability heat map on a given slide.

Figure 1: Implementing an end-to-end solution for training and deployment of a DL model based on WSI data

Similar to the workflow Human Longevity used to preprocess radiology images, we will use Apache Spark to manipulate both our slides and their annotations. For model training, we will start by extracting features using a pretrained InceptionV3 model from Keras. To this end, we leverage pandas UDFs to parallelize feature extraction. For more information on this technique see Featurization for Transfer Learning (AWS | Azure). Note that this technique is not specific to InceptionV3 and can be applied to any other pretrained model.
Although this workflow commonly uses annotations stored in an XML file, for simplicity, we are using the preprocessed annotations made by the Baidu Research team that built the NCRF classifier on the Camelyon16 data set. These annotations are stored as CSV encoded text files, which Apache Spark will load into a DataFrame. In the following notebook cell, we load the annotations for both tumor and normal patches, and assign the label 0 to normal slices and 1 to tumor slices. We then union the coordinates and labels into a single DataFrame.

We then use the OpenSlide library to load the images from cloud storage, and to slice out the given coordinate range. While OpenSlide doesn't natively understand how to read data from Amazon S3 or Azure Data Lake Storage, the Databricks File System (DBFS) FUSE layer allows OpenSlide to directly access data stored in these blob stores without any complex code changes. Finally, our function writes the patch back using the DBFS FUSE layer.

It takes approximately 10 minutes for this command to generate ~174,000 patches from the Camelyon16 data set in databricks-datasets. Once our command has completed, we can load our patches back up and display them directly in-line in our notebook.
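A rough sketch of that annotation-loading step is below; the file paths and schema are illustrative assumptions, not the exact layout used in the notebook.

from pyspark.sql import functions as F

# Load Baidu Research's preprocessed coordinate CSVs and attach labels
# (0 = normal, 1 = tumor); paths and column names are hypothetical
schema = "slide_id STRING, x_center INT, y_center INT"

normal_df = (spark.read.csv("/databricks-datasets/med-images/camelyon16/normal_coords/*.csv", schema=schema)
             .withColumn("label", F.lit(0)))
tumor_df = (spark.read.csv("/databricks-datasets/med-images/camelyon16/tumor_coords/*.csv", schema=schema)
            .withColumn("label", F.lit(1)))

# Union the coordinates and labels into a single DataFrame of patch centers
patch_coords_df = normal_df.union(tumor_df)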
We use transfer learning to extract features from each patch using a pretrained deep neural network and then use spark.ml for the classification task. This technique frequently outperforms training from scratch for many image processing applications. We will start with the InceptionV3 architecture, using pretrained weights from Keras.

Apache Spark's DataFrames provide a built-in Image schema, and we can directly load all patches into a DataFrame. We then use Pandas UDFs to transform the images into features based on InceptionV3 using Keras. Once we have featurized each image, we use spark.ml to fit a logistic regression between the features and the label for each patch. We log the logistic regression model with MLflow so that we can access the model later for serving.

When running ML workflows on Databricks, users can take advantage of managed MLflow. With every run of the notebook and every training round, MLflow automatically logs parameters, metrics and any specified artifact. In addition, it stores the trained model that can later be used for predicting labels on data. We refer interested readers to these docs for more information on how MLflow can be leveraged to manage a full cycle of the ML workflow on Databricks.

Feature Engineering and Training: 25 min
Scoring (per single slide): 15 sec

Table 1: Runtime for different steps of the workflow using 2-10 r4.4xlarge workers on Databricks ML Runtime 6.2, on 170,000 patches extracted from slides included in databricks-datasets

Table 1 shows the time spent on different parts of the workflow. We notice that the model training on ~170K samples takes less than 25 minutes with an accuracy of 87%.

Since there can be many more patches in practice, using deep neural networks for classification can significantly improve accuracy. In such cases, we can use distributed training techniques to scale the training process. On the Databricks platform, we have packaged up the HorovodRunner toolkit, which distributes the training task across a large cluster with very minor modifications to your ML code. This blog post provides a great background on how to scale ML workflows on Databricks.
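A rough sketch of that featurization step is shown below, written against the Spark 3-style scalar pandas UDF API; the column names and image handling are illustrative assumptions rather than the notebook's exact code.

import io
import numpy as np
import pandas as pd
from PIL import Image
from pyspark.sql.functions import pandas_udf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

@pandas_udf("array<float>")
def featurize_udf(content_series: pd.Series) -> pd.Series:
    # Load the pretrained network once per batch; include_top=False with average
    # pooling yields a feature vector instead of ImageNet class probabilities
    model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

    def featurize(content):
        img = Image.open(io.BytesIO(content)).convert("RGB").resize((299, 299))
        arr = preprocess_input(np.expand_dims(np.asarray(img, dtype=np.float32), axis=0))
        return model.predict(arr)[0].tolist()

    return content_series.apply(featurize)

# features_df = patches_df.withColumn("features", featurize_udf("content"))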
CHAPTER 7: A Convolutional Neural Network Implementation for Car Classification
by DR. EVAN EAMES and HENNING KROPP

Convolutional Neural Networks (CNN) are state-of-the-art neural network architectures that are primarily used for computer vision tasks. CNN can be applied to a number of different tasks, such as image recognition, object localization, and change detection. Recently, our partner Data Insights received a challenging request from a major car company: Develop a Computer Vision application that could identify the car model in a given image. Considering that different car models can appear quite similar and any car can look very different depending on its surroundings and the angle at which it is photographed, such a task was, until quite recently, simply impossible.
However, starting around 2012, the Deep Learning Revolution made it possible to handle such a problem. Instead of
being explained the concept of a car, computers could instead repeatedly study pictures and learn such concepts
themselves. In the past few years, additional artificial neural network innovations have resulted in AI that can perform
image classification tasks with human-level accuracy. Building on such developments, we were able to train a Deep CNN
to classify cars by their model. The neural network was trained on the Stanford Cars Data Set, which contains over 16,000
pictures of cars, comprising 196 different models. Over time, we could see the accuracy of predictions begin to improve, as
the neural network learned the concept of a car and how to distinguish among different models.
For the first step of data preprocessing, the images are compressed into HDF5 files (one for training and one for testing). These can then be read in by the neural network. This step can be omitted completely, if you like, as the HDF5 files are part of the ADLS Gen2 storage provided along with these notebooks:

• Load Stanford Cars data set into HDF5 files
• Use Koalas for image augmentation
• Train the CNN with Keras
• Deploy model as REST service to Azure ML
A CNN is a series of both Identity Blocks and Convolution Blocks (or ConvBlocks) that
reduce an input image to a compact group of numbers. Each of these resulting numbers
(if trained correctly) should eventually tell you something useful toward classifying the
image. A Residual CNN adds an additional step for each block. The data is saved as
a temporary variable before the operations that constitute the block are applied, and
then this temporary data is added to the output data. Generally, this additional step is
applied to each block. As an example, the below figure demonstrates a simplified CNN for
detecting handwritten numbers:
Applying augmentation to a large corpus of training data can be very expensive, especially when comparing the results of different approaches. With Koalas, it becomes easy to try existing frameworks for image augmentation in Python and to scale the process out on a cluster with multiple nodes using the pandas API that data scientists are already familiar with.
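With that in mind, a minimal sketch of distributing a simple augmentation with Koalas might look like the following; the file paths, column names and the flip-only augmentation are hypothetical.

import databricks.koalas as ks
from PIL import Image

# Hypothetical table of training-image paths for the Stanford Cars data set
kdf = ks.DataFrame({'path': ['/dbfs/tmp/cars/00001.jpg', '/dbfs/tmp/cars/00002.jpg']})

def augment(path: str) -> str:
    # One simple augmentation: mirror the image and save a new copy alongside it
    out_path = path.replace('.jpg', '_flipped.jpg')
    Image.open(path).transpose(Image.FLIP_LEFT_RIGHT).save(out_path)
    return out_path

# The pandas-like API lets the per-image work be distributed across the cluster
kdf['augmented_path'] = kdf['path'].apply(augment)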
In order to automatically track results, an existing or new Azure ML workspace can be linked to your Azure Databricks workspace. Additionally, MLflow supports auto-logging for Keras models (mlflow.keras.autolog()), making the experience almost effortless.

In this scenario we want to make sure we can use a model inference engine that supports serving requests from a REST API client. For this we are using a custom model based on the previously built Keras model to accept a JSON DataFrame object that has a Base64-encoded image inside.
self.results = []
with open(context.artifacts["cars_meta"], "rb") as file:
    # load the car classes file
    cars_meta = scipy.io.loadmat(file)
    self.class_names = cars_meta['class_names']
    self.class_names = np.transpose(self.class_names)

preds = self.keras_model.predict(rgb_img)
prob = np.max(preds)
class_id = np.argmax(preds)
return {"label": self.class_names[class_id][0][0], "prob": "{:.4}".format(prob)}
webservice_deployment_config = AciWebservice.deploy_configuration()
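The deployment configuration above is then passed to the Azure ML SDK when the model is deployed. A hedged sketch of the remaining steps might look like the following, where ws, registered_model, environment and score.py are illustrative placeholders rather than the notebook's actual names:

%python
# A hedged sketch of deploying the registered model as a REST service on
# Azure Container Instances; names below are placeholders.
from azureml.core.model import InferenceConfig, Model

inference_config = InferenceConfig(entry_script="score.py", environment=environment)

service = Model.deploy(workspace=ws,
                       name="car-classifier",
                       models=[registered_model],
                       inference_config=inference_config,
                       deployment_config=webservice_deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)  # REST endpoint that accepts the Base64-encoded image JSON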
Developer Resources

• N O T E B O O K S :
  — Load Stanford Cars data set into HDF5 files
  — Use Koalas for image augmentation
  — Train the CNN with Keras
  — Deploy model as REST service to Azure ML
• V I D E O : AI Car Classification With Deep Convolutional Neural Networks on Databricks
• G I T H U B : EvanEames | Cars
• S L I D E S : A Convolutional Neural Network Implementation for Car Classification
• P D F : Convolutional Neural Network Implementation on Databricks
December 5, 2019
Geospatial data powers use cases across many industries, for example:

• Detect patterns of fraud and collusion (e.g., claims fraud, credit card fraud)
• Site selection, urban planning, foot-traffic analysis
• Economic distribution, loan risk analysis, predicting sales at retail, investments
• Identifying disease epicenters, environmental impact on health, planning care
• Flood surveys, earthquake mapping, response planning
• Reconnaissance, threat detection, damage assessment
• Transportation planning, agriculture management, housing development
• Climate change analysis, energy asset inspection, oil discovery
For example, numerous companies provide localized drone-based services such as mapping and site inspection
(reference Developing for the Intelligent Cloud and Intelligent Edge). Another rapidly growing industry for geospatial data
is autonomous vehicles. Startups and established companies alike are amassing large corpuses of highly contextualized
geodata from vehicle sensors to deliver the next innovation in self-driving cars (reference Databricks fuels wejo’s ambition
to create a mobility data ecosystem). Retailers and government agencies are also looking to make use of their geospatial
data. For example, foot-traffic analysis (reference Building Foot-Traffic Insights Data Set) can help determine the best
location to open a new store or, in the public sector, improve urban planning. Despite all these investments in geospatial
data, a number of challenges exist.
• Unstructured data with location references

3. Indexing the data with grid systems and leveraging the generated index to perform spatial operations is a common approach for dealing with very large scale or computationally restricted workloads. S2, GeoHex and Uber's H3 are examples of such grid systems. Grids approximate geo features such as polygons or points with a fixed set of identifiable cells, thus avoiding expensive geospatial operations altogether and therefore offering much better scaling behavior. Implementers can decide between grids fixed to a single accuracy, which can be somewhat lossy yet more performant, or grids with multiple accuracies, which can be less performant but mitigate against lossiness.

In this blog post, we give an overview of general approaches to deal with the two main challenges listed above using the Databricks Unified Data Analytics Platform. This is the first part of a series of blog posts on working with large volumes of geospatial data.
The following examples are generally oriented around a NYC taxi pickup / drop-off data set found here. NYC Taxi Zone data with geometries will also be used as the set of polygons. This data contains polygons for the five boroughs of NYC as well as the neighborhoods. This notebook will walk you through the preparations and cleaning done to convert the initial CSV files into Delta Lake tables as a reliable and performant data source.

Our base DataFrame is the taxi pickup / drop-off data read from a Delta Lake table using Databricks.

%scala
val dfRaw = spark.read.format("delta").load("/ml/blogs/geospatial/delta/nyc-green")
display(dfRaw) // showing first 10 columns

Example: Geospatial data read from a Delta Lake table using Databricks

Geospatial operations using geospatial libraries for Apache Spark

Over the last few years, several libraries have been developed to extend the capabilities of Apache Spark for geospatial analysis. These frameworks bear the brunt of registering commonly applied user-defined types (UDT) and functions (UDF) in a consistent manner, lifting the burden otherwise placed on users and teams to write ad hoc spatial logic. Please note that in this blog post we use several different spatial frameworks chosen to highlight various capabilities. We understand that other frameworks exist beyond those highlighted, which you might also want to use with Databricks to process your spatial workloads.

Earlier, we loaded our base data into a DataFrame. Now we need to turn the latitude/longitude attributes into point geometries. To accomplish this, we will use UDFs to perform operations on DataFrames in a distributed fashion. Please refer to the provided notebooks at the end of the blog for details on adding these frameworks to a cluster and the initialization calls to register UDFs and UDTs. For starters, we have added GeoMesa to our cluster, a framework especially adept at handling vector data. For ingestion, we are mainly leveraging its integration of JTS with Spark SQL, which allows us to easily convert to and use registered JTS geometry classes. We will be using the function st_makePoint that, given a latitude and longitude, creates a Point geometry object. Since the function is a UDF, we can apply it to columns directly.

%scala
val df = dfRaw
  .withColumn("pickup_point", st_makePoint(col("pickup_longitude"), col("pickup_latitude")))
  .withColumn("dropoff_point", st_makePoint(col("dropoff_longitude"), col("dropoff_latitude")))
display(df.select("dropoff_point", "dropoff_datetime"))
%scala
val joinedDF = wktDF.join(df, st_contains($"the_geom", $"pickup_point"))
display(joinedDF.select("zone", "borough", "pickup_point", "pickup_datetime"))

Example: Using GeoMesa's provided st_contains UDF, for example, to produce the resulting join of all polygons against pickup points

%python
import geopandas as gdp
from shapely.geometry import Point
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# read the boroughs polygons with geopandas
gdf = gdp.read_file("/dbfs/ml/blogs/geospatial/nyc_boroughs.geojson")

b_gdf = sc.broadcast(gdf)  # broadcast the geopandas dataframe to all nodes of the cluster

def find_borough(latitude, longitude):
    # return the name of the first borough polygon containing the point, if any
    mgdf = b_gdf.value.apply(
        lambda x: x["boro_name"]
        if x["geometry"].intersects(Point(longitude, latitude))
        else None,
        axis=1)
    idx = mgdf.first_valid_index()
    return mgdf.loc[idx] if idx is not None else None

find_borough_udf = udf(find_borough, StringType())  # the UDF applied below
Now we can apply the UDF to add a column to our Spark DataFrame, which assigns a borough name to each pickup point.

%python
# read the coordinates from delta
df = spark.read.format("delta").load("/ml/blogs/geospatial/delta/nyc-green")
df_with_boroughs = df.withColumn(
    "pickup_borough",
    find_borough_udf(col("pickup_latitude"), col("pickup_longitude")))
display(df_with_boroughs.select(
    "pickup_datetime", "pickup_latitude", "pickup_longitude", "pickup_borough"))

Grid systems for spatial indexing

Geospatial operations are inherently computationally expensive. Point-in-polygon, spatial joins, nearest neighbor or snapping to routes all involve complex operations. By indexing with grid systems, the aim is to avoid geospatial operations altogether. This approach leads to the most scalable implementations, with the caveat of approximate operations. Here is a brief example with H3.

Scaling spatial operations with H3 is essentially a two-step process. The first step is to compute an H3 index for each feature (points, polygons, …), defined as the UDF geoToH3(…). The second step is to use these indices for spatial operations such as spatial join (point in polygon, k-nearest neighbors, etc.), in this case defined as the UDF multiPolygonToH3(…). We can now apply these two UDFs to the NYC taxi data as well as the set of borough polygons to generate the H3 index.

%scala
val res = 7 // the resolution of the H3 index, 1.2km
val dfH3 = df.withColumn(
  "h3index",
  geoToH3(col("pickup_latitude"), col("pickup_longitude"), lit(res))
)
val wktDFH3 = wktDF
  .withColumn("h3index", multiPolygonToH3(col("the_geom"), lit(res)))
  .withColumn("h3index", explode($"h3index"))

Given a set of lat/lon points and a set of polygon geometries, it is now possible to perform the spatial join using the h3index field as the join condition. These assignments can be used, for instance, to aggregate the number of points that fall within each polygon. There are usually millions or billions of points that have to be matched to thousands or millions of polygons, which necessitates a scalable approach.

%scala
val dfWithBoroughH3 = dfH3.join(wktDFH3, "h3index")

display(dfWithBoroughH3.select("zone", "borough", "pickup_point", "pickup_datetime", "h3index"))

Example: DataFrame table representing the spatial join of a set of lat/lon points and polygon geometries, using a specific field as the join condition

There are other techniques not covered in this blog that can be used for indexing in support of spatial operations when an approximation is insufficient.
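For readers working in Python rather than Scala, here is a hedged sketch of the first indexing step, a geoToH3-style UDF built on the open-source h3 package; the package call and column names are assumptions, since the original notebooks define this UDF in Scala.

%python
# A minimal sketch: assign each pickup point to an H3 cell at resolution 7,
# mirroring the geoToH3(...) step described above.
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType
import h3

@udf(returnType=StringType())
def geo_to_h3(latitude, longitude, resolution):
    # returns the id of the H3 cell containing the point at the given resolution
    return h3.geo_to_h3(latitude, longitude, resolution)

df_h3 = df.withColumn(
    "h3index",
    geo_to_h3(col("pickup_latitude"), col("pickup_longitude"), lit(7)))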
Here is a visualization of taxi drop-off locations, with latitude and longitude binned at a resolution of 7 (1.22km edge length) and colored by aggregated counts within each bin.

Example: Geospatial visualization of taxi drop-off locations, with latitude and longitude binned at a resolution of 7 (1.22km edge length) and colored by aggregated counts within each bin

Handling spatial formats with Databricks

Geospatial data involves reference points, such as latitude and longitude, to physical locations or extents on the Earth along with features described by attributes. While there are many file formats to choose from, we have picked out a handful of representative vector and raster formats to demonstrate reading with Databricks.

Vector data

Vector data is a representation of the world stored in x (longitude), y (latitude) coordinates in degrees, also z (altitude in meters) if elevation is considered. The three basic symbol types for vector data are points, lines and polygons. Well-known text (WKT), GeoJSON and Shapefile are some popular formats for storing vector data that we highlight below.

Let's read NYC Taxi Zone data with geometries stored as WKT. The data structure we want to get back is a DataFrame that will allow us to standardize with other APIs and available data sources, such as those used elsewhere in the blog. We are able to easily convert the WKT text content found in field the_geom into its corresponding JTS Geometry class through the st_geomFromWKT(…) UDF call.

%scala
val wktDFText = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/ml/blogs/geospatial/nyc_taxi_zones.wkt.csv")
val wktDF = wktDFText.withColumn("the_geom", st_geomFromWKT(col("the_geom"))) // convert WKT text to a JTS Geometry; wktDF is the polygon DataFrame used in the join examples
GeoJSON is used by many open-source GIS packages for encoding a variety of geographic data structures, including their features, properties and spatial extents. For this example, we will read NYC Borough Boundaries, with the approach taken depending on the workflow. Since the data conforms to JSON, we could use the Databricks built-in JSON reader with .option("multiline", "true") to load the data with the nested schema.

%python
json_df = spark.read.option("multiline", "true").json("nyc_boroughs.geojson")

From there, we could choose to hoist any of the fields up to top-level columns using Spark's built-in explode function. For example, we might want to bring up geometry, properties and type and then convert geometry to its corresponding JTS class, as was shown with the WKT example.

%python
from pyspark.sql import functions as F
json_explode_df = (json_df.select(
    "features",
    "type",
    F.explode(F.col("features.properties")).alias("properties")
).select("*", F.explode(F.col("features.geometry")).alias("geometry")).drop("features"))

display(json_explode_df)

Example: Using Spark's built-in explode function to raise a field to the top level, displayed within a DataFrame table
We can also visualize the NYC Taxi Zone data within a notebook using an existing
DataFrame or directly rendering the data with a library such as Folium, a Python library
for rendering spatial data. Databricks File System (DBFS) runs over a distributed
storage layer, which allows code to work with data formats using familiar file system
standards. DBFS has a FUSE Mount to allow local API calls that perform file read and
write operations, which makes it very easy to load data with non-distributed APIs for
interactive rendering. In the Python open(…) command below, the “/dbfs/…” prefix
enables the use of FUSE Mount.
%python
import folium
import json

with open("/dbfs/ml/blogs/geospatial/nyc_boroughs.geojson", "r") as myfile:
    boro_data = myfile.read()  # read GeoJSON from DBFS using the FUSE Mount

m = folium.Map(
    location=[40.7128, -74.0060],
    tiles='Stamen Terrain',
    zoom_start=12
)
folium.GeoJson(json.loads(boro_data)).add_to(m)
m  # to display, also could use displayHTML(...) variants

Example: We can also visualize the NYC Taxi Zone data, for example, within a notebook using an existing DataFrame or directly rendering the data with a library such as Folium, a Python library for rendering geospatial data

Shapefile is a popular vector format developed by ESRI that stores the geometric location and attribute information of geographic features. The format consists of a collection of files with a common filename prefix (*.shp, *.shx and *.dbf are mandatory) stored in the same directory. An alternative to shapefile is KML, also used by our customers but not shown for brevity. For this example, let's use NYC Building shapefiles. While there are many ways to demonstrate reading shapefiles, we will give an example using GeoSpark. The built-in ShapefileReader is used to generate the rawSpatialDf DataFrame.

%scala
var spatialRDD = new SpatialRDD[Geometry]
spatialRDD = ShapefileReader.readToGeometryRDD(sc, "/ml/blogs/geospatial/shapefiles/nyc")
var rawSpatialDf = Adapter.toDf(spatialRDD, spark) // convert to a DataFrame
rawSpatialDf.createOrReplaceTempView("rawspatialdf") // register the temp view used in the SQL queries below
By registering rawSpatialDf as a temp view, we can easily drop into pure Spark SQL syntax to work with the DataFrame, including applying a UDF to convert the shapefile WKT into Geometry.

%sql
SELECT *,
  ST_GeomFromWKT(geometry) AS geometry -- GeoSpark UDF to convert WKT to Geometry
FROM rawspatialdf

Additionally, we can use Databricks' built-in visualization for in-line analytics, such as charting the tallest buildings in NYC.

%sql
SELECT name,
  round(Cast(num_floors AS DOUBLE), 0) AS num_floors -- String to Number
FROM rawspatialdf
WHERE name <> ''
ORDER BY num_floors DESC LIMIT 5

Example: A Databricks built-in visualization for in-line analytics charting, for example, the tallest buildings in NYC

Raster data

Raster data stores information of features in a matrix of cells (or pixels) organized into rows and columns (either discrete or continuous). Satellite images, photogrammetry and scanned maps are all types of raster-based Earth Observation (EO) data.

The following Python example uses RasterFrames, a DataFrame-centric spatial analytics framework, to read two bands of GeoTIFF Landsat-8 imagery (red and near-infrared) and combine them into Normalized Difference Vegetation Index. We can use this data to assess plant health around NYC. The rf_ipython module is used to manipulate RasterFrame contents into a variety of visually useful forms, such as below where the red, NIR and NDVI tile columns are rendered with color ramps, using the Databricks built-in displayHTML(…) command to show the results within the notebook.

%python
# construct a CSV "catalog" for RasterFrames `raster` reader
# catalogs can also be Spark or Pandas DataFrames
bands = [f'B{b}' for b in [4, 5]]
uris = [f'https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/014/032/LC08_L1TP_014032_20190720_20190731_01_T1/LC08_L1TP_014032_20190720_20190731_01_T1_{b}.TIF' for b in bands]
catalog = ','.join(bands) + '\n' + ','.join(uris)

# read red and NIR bands from Landsat 8 dataset over NYC
rf = spark.read.raster(catalog, bands) \
  .withColumnRenamed('B4', 'red').withColumnRenamed('B5', 'NIR') \
  .withColumn('longitude_latitude', st_reproject(st_centroid(rf_geometry('red')), rf_crs('red'), lit('EPSG:4326'))) \
  .withColumn('NDVI', rf_normalized_difference('NIR', 'red')) \
  .where(rf_tile_sum('NDVI') > 10000)

results = rf.select('longitude_latitude', rf_tile('red'), rf_tile('NIR'), rf_tile('NDVI'))
displayHTML(rf_ipython.spark_df_to_html(results))
Through its custom Spark DataSource, RasterFrames can read various raster formats,
including GeoTIFF, JP2000, MRF and HDF, from an array of services. It also supports
reading the vector formats GeoJSON and WKT/WKB. RasterFrame contents can be
filtered, transformed, summarized, resampled and rasterized through over 200 raster and
vector functions, such as st_reproject(…) and st_centroid(…) used in the example above.
It provides APIs for Python, SQL and Scala as well as interoperability with Spark ML.
Geo databases
Geo databases can be file-based for smaller-scale data or accessible via JDBC / ODBC
connections for medium-scale data. You can use Databricks to query many SQL
databases with the built-in JDBC / ODBC Data Source. Connecting to PostgreSQL is
shown below, which is commonly used for smaller-scale workloads by applying PostGIS
extensions. This pattern of connectivity allows customers to maintain as-is access to
existing databases.
%scala
display(
  sqlContext.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("driver", "org.postgresql.Driver")
    .option("dbtable",
      """(SELECT * FROM yellow_tripdata_staging
      OFFSET 5 LIMIT 10) AS t""") // predicate pushdown
    .option("user", jdbcUsername)
    .option("password", jdbcPassword)
    .load)
In an upcoming blog, we will take a deep dive into more advanced topics for geospatial
processing at scale with Databricks. You will find additional details about the spatial
formats and highlighted frameworks by reviewing Data Prep Notebook, GeoMesa + H3
Notebook, GeoSpark Notebook, GeoPandas Notebook and Rasterframes Notebook.
Also, stay tuned for a new section in our documentation specifically for geospatial
topics of interest.
C H A P T E R 9: CUSTOMER CASE STUDY

As a global technology and media company that connects millions of customers to personalized experiences, Comcast
struggled with massive data, fragile data pipelines and poor data science collaboration. By using Databricks — including
Delta Lake and MLflow — they were able to build performant data pipelines for petabytes of data and easily manage the
lifecycle of hundreds of models, creating a highly innovative, unique and award-winning viewer experience that leverages
voice recognition and machine learning.
U S E C A S E : In the intensely competitive entertainment industry, there’s no time to press the Pause button. Comcast
realized they needed to modernize their entire approach to analytics from data ingest to the deployment of machine
learning models that deliver new features to delight their customers.
S O L U T I O N A N D B E N E F I T S : Armed with a unified approach to analytics, Comcast can now fast-forward into the future
of AI-powered entertainment — keeping viewers engaged and delighted with competition-beating customer experiences.
LEARN MORE
C H A P T E R 9: CUSTOMER CASE STUDY

Regeneron's mission is to tap into the power of genomic data to bring new medicines to patients in need. Yet, transforming
this data into life-changing discovery and targeted treatments has never been more challenging. With poor processing
performance and scalability limitations, their data teams lacked what they needed to analyze petabytes of genomic
and clinical data. Databricks now empowers them to quickly analyze entire genomic data sets to accelerate the
discovery of new therapeutics.
U S E C A S E : More than 95% of all experimental medicines that are currently in the drug development pipeline are expected
to fail. To improve these efforts, the Regeneron Genetics Center built one of the most comprehensive genetics databases
by pairing the sequenced exomes and electronic health records of more than 400,000 people. However, they faced
numerous challenges analyzing this massive set of data:
• Genomic and clinical data is highly decentralized, making it very difficult to analyze and train models against their entire 10TB data set.
LEARN MORE
C H A P T E R 9: CUSTOMER CASE STUDY

The explosive growth in data availability and increasing market competition are challenging insurance providers to
provide better pricing to their customers. With hundreds of millions of insurance records to analyze for downstream ML,
Nationwide realized their legacy batch analysis process was slow and inaccurate, providing limited insight to predict the
frequency and severity of claims. With Databricks, they have been able to employ deep learning models at scale to provide
more accurate pricing predictions, resulting in more revenue from claims.
U S E C A S E : The key to providing accurate insurance pricing lies in leveraging information from insurance claims.
However, this was difficult: the insurance records they had to analyze were volatile, since claims were infrequent and unpredictable, resulting in inaccurate pricing.
S O L U T I O N A N D B E N E F I T S : Nationwide leverages the Databricks Unified Data Analytics Platform to manage the
entire analytics process from data ingestion to the deployment of deep learning models. The fully managed platform has
simplified IT operations and unlocked new data-driven opportunities for their data science teams.
LEARN MORE
C H A P T E R 9: CUSTOMER CASE STUDY

Condé Nast is one of the world's leading media companies, counting some of the most iconic magazine titles in its
portfolio, including The New Yorker, Wired and Vogue. The company uses data to reach over 1 billion people in print, online,
video and social media.
U S E C A S E : As a leading media publisher, Condé Nast manages over 20 brands in their portfolio. On a monthly basis,
their web properties garner 100 million-plus visits and 800 million-plus page views, producing a tremendous amount of
data. The data team is focused on improving user engagement by using machine learning to provide personalized content
recommendations and targeted ads.
S O L U T I O N A N D B E N E F I T S : Databricks provides Condé Nast with a fully managed cloud platform that simplifies
operations, delivers superior performance and enables data science innovation.
• I M P R O V E D C U S T O M E R E N G A G E M E N T: With an improved data pipeline, Condé Nast can make better, faster and
more accurate content recommendations, improving the user experience.
• B U I L T F O R S C A L E : Data sets can no longer outgrow Condé Nast's capacity to process and glean insights.

• M O R E M O D E L S I N P R O D U C T I O N : With MLflow, Condé Nast's data science teams can innovate their products faster. They have deployed over 1,200 models in production.

"Databricks has been an incredibly powerful end-to-end solution for us. It's allowed a variety of different team members from different backgrounds to quickly get in and utilize large volumes of data to make actionable business decisions."
— PAUL FRYZEL, Principal Engineer of AI Infrastructure at Condé Nast

LEARN MORE
C H A P T E R 9: CUSTOMER CASE STUDY

SHOWTIME® is a premium television network and streaming service, featuring award-winning original series and original
limited series like “Shameless,” “Homeland,” “Billions,” “The Chi,” “Ray Donovan,” “SMILF,” “The Affair,” “Patrick Melrose,”
“Our Cartoon President,” “Twin Peaks” and more.
U S E C A S E : The Data Strategy team at SHOWTIME is focused on democratizing data and analytics across the
organization. They collect huge volumes of subscriber data (e.g., shows watched, time of day, devices used, subscription
history, etc.) and use machine learning to predict subscriber behavior and improve scheduling and programming.
S O L U T I O N A N D B E N E F I T S : Databricks has helped SHOWTIME democratize data and machine learning across the
organization, creating a more data-driven culture.
• 6 X FA S T E R P I P E L I N E S : Data pipelines that took over 24 hours are now run in less than 4 hours, enabling teams
to make decisions faster.
— JOSH McNUTT, Senior Vice President of Data Strategy and Consumer Analytics at SHOWTIME
LEARN MORE
C H A P T E R 9: CUSTOMER CASE STUDY

Shell is a recognized pioneer in oil and gas exploration and production technology and is one of the world's leading oil
and natural gas producers, gasoline and natural gas marketers and petrochemical manufacturers.
U S E C A S E : To maintain production, Shell stocks over 3,000 different spare parts across their global facilities. It’s crucial
the right parts are available at the right time to avoid outages, but equally important is not overstocking, which can be
cost-prohibitive.
S O L U T I O N A N D B E N E F I T S : Databricks provides Shell with a cloud-native unified analytics platform that helps with
improved inventory and supply chain management.
• P R E D I C T I V E M O D E L I N G : Scalable predictive model is developed and deployed across more than 3,000 types of materials at 50-plus locations.

• H I S T O R I C A L A N A L Y S E S : Each material model involves simulating 10,000 Markov Chain Monte Carlo iterations to capture historical distribution of issues.
C H A P T E R 9: CUSTOMER CASE STUDY

Riot Games' goal is to be the world's most player-focused gaming company. Founded in 2006 and based in LA,
Riot Games is best known for the League of Legends game. Over 100 million gamers play every month.
U S E C A S E : Improving gaming experience through network performance monitoring and combating in-game
abusive language.
S O L U T I O N A N D B E N E F I T S : Databricks allows Riot Games to improve the gaming experience of their players by
providing scalable, fast analytics.
• R E D U C E D G A M E L A G : Built ML model that detects network issues in real time, enabling Riot Games to avoid outages before they adversely impact players.

"We wanted to free data scientists from …"
— COLIN BORYS, Data Scientist at Riot Games
C H A P T E R 9: CUSTOMER CASE STUDY

Quby is the technology company behind Toon, the smart energy management device that gives people control over
their energy usage, their comfort, the security of their homes and much more. Quby’s smart devices are in hundreds
of thousands of homes across Europe. As such, they maintain Europe’s largest energy data set, consisting of petabytes
of IoT data, collected from sensors on appliances throughout the home. With this data, they are on a mission to help
their customers live more comfortable lives while reducing energy consumption through personalized energy usage
recommendations.
U S E C A S E : Personalized energy use recommendations: Leverage machine learning and IoT data to power their Waste
Checker app, which provides personalized recommendations to reduce in-home energy consumption.
S O L U T I O N A N D B E N E F I T S : Databricks provides Quby with a Unified Data Analytics Platform that has fostered a
scalable and collaborative environment across data science and engineering, allowing data teams to more quickly
innovate and deliver ML-powered services to Quby’s customers.
— STEPHEN GALSWORTHY, Head of Data Science at Quby
LEARN MORE
About us
Databricks is the data and AI company. Thousands of organizations worldwide —
including Comcast, Condé Nast, Nationwide and H&M — rely on Databricks’ open and
unified platform for data engineering, machine learning and analytics. Databricks is
venture-backed and headquartered in San Francisco, with offices around the globe.
Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is
on a mission to help data teams solve the world’s toughest problems.
© Databricks 2020. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
Privacy Policy | Terms of Use