Bridging the Completeness of Big Data on Databricks

Yanyan Wu
VP of Data
Wood Mackenzie, Verisk
Chao Yang
Director of Data
Wood Mackenzie, Verisk

Acknowledgement
• This work is based on US patent application number 63/142,551, filed on 01/28/2021
• Thanks to our co-inventors for their support: Bernard Ajiboye, Hugh Hopewell, and Rhodri Thomas at Wood Mackenzie (Verisk)

Agenda
• Introduction & use cases
• Limitation of existing approaches
• Null filling processes
• Similarity discovery
• Collaborative AI
• AI model management
• Our application
• Application tips

Why?
• Null values exist in almost every data set
• Limited or key data can't be thrown away just because some attributes contain nulls
• Machine learning models do not handle null values well
The importance of data completeness

Our Data Platform for Clients - LENS
Energy data powerhouse augmented by world-class platform
• Upstream conventional Oil & Gas: Discover, model and value upstream data worldwide
• Unconventional Oil & Gas: Operational analysis for improved business performance
• Subsurface: Analytics-ready, global subsurface data to optimise your resource portfolios with confidence
• Power & Renewables: Navigate the energy transition by connecting the dots across the electricity value chain
Data Directly Integrated Into Clients' System

Issues with existing null filling methods
• Low accuracy: backward or forward filling, or filling with a fixed value or a statistical metric (min, max, mean)
• Time consuming: machine learning or regression methods
• Isolated: filling nulls for one attribute does not take the other attributes into account
We need a new method that can fill nulls with better speed & accuracy

Lens Data Platform
Built with Spark
• Stack: Databricks (unified platform), Apache Sedona, Spark MLlib, MLflow, Parquet files on AWS S3
• Input: Parquet data files with null values in S3
• 01. Neighbor discovery
1. Spatial RDD partitioned by KDB tree
2. Distance-based spatial join
3. Replace null values with neighbor information
4. Save data in Delta Lake
• 02. Collaborative AI model
1. Label encoding
2. Remove noise
3. Bin to create userID groups
4. Reformat for ALS model
• 03. AI model management
1. Use ML pipeline & cross validation
2. Save model hyperparameters with MLflow
3. Set model to production stage
• Output: Enriched data with high completeness

01. Neighbor discovery
• Discover the neighbors of every entity (oil well) within a defined distance limit
• Challenges:
• Large data size
• Long compute time
• Limited compute power on a single machine
• Apache Sedona:
• Distributed framework for processing large-scale spatial data
• KDB-Tree:
• Geometric approach
• Successively divides n-dimensional space into smaller regions
• Tree structure enables fast query processing
Distributed spatial data partitioning on Spark
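
To illustrate the partitioning idea behind the KDB-Tree bullet above, here is a minimal plain-Python sketch. It is not Apache Sedona's implementation: the function name `kd_partition`, the `max_points` leaf capacity, and the sample coordinates are all assumptions made for illustration. It splits points at the median of alternating axes until each region is small enough to be handled as one partition.

```python
from statistics import median

def kd_partition(points, max_points=4, depth=0):
    """Recursively split 2-D points, aiming for at most max_points per leaf."""
    if len(points) <= max_points:
        return [points]
    axis = depth % 2  # alternate between x (0) and y (1)
    split = median(p[axis] for p in points)
    left = [p for p in points if p[axis] < split]
    right = [p for p in points if p[axis] >= split]
    if not left or not right:
        # The median cannot separate the points on this axis;
        # keep them together in one leaf.
        return [points]
    return (kd_partition(left, max_points, depth + 1)
            + kd_partition(right, max_points, depth + 1))

# Hypothetical well locations on a 4 x 4 grid.
wells = [(x, y) for x in range(4) for y in range(4)]
leaves = kd_partition(wells, max_points=4)
```

Each leaf can then be processed independently, which is what makes a distributed spatial join practical on data that does not fit on a single machine.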

• Import libraries
• Set up Spark context
• Load data
• Create geometry object column

• Create Spatial RDD
• Create Circle RDD with defined range
• Partition data by KDB tree
• Distance join
• Convert to DataFrame
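
The distance join and the neighbor-based null filling can be sketched in plain Python as follows. This is a stand-in for the Sedona steps, not the Sedona API: it uses a simple uniform grid instead of a KDB tree, the names `distance_join` and `fill_from_neighbors` are hypothetical, and filling with the mean of the neighbors' known values is an assumption, since the slides only say that nulls are replaced with neighbor information.

```python
from collections import defaultdict
from math import floor, hypot
from statistics import mean

def distance_join(wells, radius):
    """Map each well id to the ids of all other wells within `radius`."""
    grid = defaultdict(list)  # grid cell -> well ids located in that cell
    for wid, (x, y, _) in wells.items():
        grid[(floor(x / radius), floor(y / radius))].append(wid)
    neighbors = defaultdict(list)
    for wid, (x, y, _) in wells.items():
        cx, cy = floor(x / radius), floor(y / radius)
        # Only the 3 x 3 block of cells around a well can hold its neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in grid.get((cx + dx, cy + dy), []):
                    ox, oy, _ = wells[other]
                    if other != wid and hypot(x - ox, y - oy) <= radius:
                        neighbors[wid].append(other)
    return neighbors

def fill_from_neighbors(wells, radius):
    """Fill a missing depth with the mean depth of neighbors (assumed rule)."""
    neighbors = distance_join(wells, radius)
    filled = {}
    for wid, (x, y, depth) in wells.items():
        if depth is None:
            known = [wells[n][2] for n in neighbors[wid]
                     if wells[n][2] is not None]
            depth = mean(known) if known else None
        filled[wid] = (x, y, depth)
    return filled

# Hypothetical wells: id -> (x, y, vertical_depth); None marks a null.
wells = {
    "A": (0.0, 0.0, 1000.0),
    "B": (0.5, 0.0, None),
    "C": (1.0, 0.0, 1200.0),
    "D": (10.0, 10.0, None),
}
filled = fill_from_neighbors(wells, radius=1.0)
```

Well B has two neighbors with known depths and gets their mean, while well D has no neighbor within range and stays null, which is why a second method is still needed for the remaining gaps.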

02. Collaborative AI
• Similar to the popular methods used for movie recommendation
• Leverages the ALS (Alternating Least Squares) model from Spark MLlib
• Code example: https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
• Mapping:
• UserID: each object or object group (groups work better because of noise in the data)
• Item: an attribute of the object
• Rating: the attribute's value
Leverage Spark MLlib
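
To show how the user/item/rating mapping lets ALS predict a missing attribute value, here is a minimal rank-1 ALS sketch in plain Python. It is not Spark MLlib's ALS: the function names, the single latent factor, the `reg` value, and the toy data are assumptions chosen to keep the example short. With one factor, each alternating least-squares step has a closed-form solution.

```python
def als_rank1(ratings, n_iters=50, reg=0.01):
    """Fit one latent factor per user and per item by alternating least squares.

    ratings: list of (user, item, value) triples for the known cells only.
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Hold item factors fixed; solve each user's ridge regression.
        for u in users:
            rows = [(i, r) for uu, i, r in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rows)
            den = reg + sum(item_f[i] ** 2 for i, _ in rows)
            user_f[u] = num / den
        # Hold user factors fixed; solve each item's ridge regression.
        for i in items:
            rows = [(u, r) for u, ii, r in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in rows)
            den = reg + sum(user_f[u] ** 2 for u, _ in rows)
            item_f[i] = num / den
    return user_f, item_f

def predict(user_f, item_f, user, item):
    """Predicted rating is the product of the two factors."""
    return user_f[user] * item_f[item]

# Hypothetical data: two object groups (users), two attributes (items).
# The value of attribute "b" for group "g2" is the null to fill.
known = [("g1", "a", 3.0), ("g1", "b", 5.0), ("g2", "a", 6.0)]
user_f, item_f = als_rank1(known)
filled_value = predict(user_f, item_f, "g2", "b")
```

Because group g2 has twice the value of g1 on attribute "a", the model predicts roughly twice g1's value on attribute "b" as well. This is the sense in which the other attributes inform the fill.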

Spark MLlib: ALS and Pipeline
• ML pipeline
• Grid search
• Cross Validation

Transform data to fit the format required by ALS
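
One possible shape for this transformation is sketched below in plain Python. The slides do not say how the userID groups are formed, so binning on a numeric key (here a hypothetical `x` coordinate with a hypothetical `bin_width`) and averaging values within a group are assumptions. Attribute names are label-encoded to integer item ids, and nulls are simply left out of the triples.

```python
from collections import defaultdict
from statistics import mean

def to_als_triples(records, attributes, group_key, bin_width):
    """Turn wide records into (user_id, item_id, rating) triples.

    user_id: bin index of `group_key` (assumed grouping rule).
    item_id: integer label assigned to each attribute name.
    rating:  mean of the known attribute values within the group.
    """
    item_ids = {name: idx for idx, name in enumerate(attributes)}
    buckets = defaultdict(list)
    for rec in records:
        user_id = int(rec[group_key] // bin_width)
        for name in attributes:
            value = rec.get(name)
            if value is not None:  # nulls are what the model will predict
                buckets[(user_id, item_ids[name])].append(float(value))
    triples = sorted((u, i, mean(vals)) for (u, i), vals in buckets.items())
    return triples, item_ids

# Hypothetical well records; None marks a null.
records = [
    {"x": 1.0, "vertical_depth": 1000.0, "lateral_length": None},
    {"x": 3.0, "vertical_depth": 1200.0, "lateral_length": 500.0},
    {"x": 12.0, "vertical_depth": None, "lateral_length": 700.0},
]
triples, item_ids = to_als_triples(
    records, ["vertical_depth", "lateral_length"], group_key="x", bin_width=10
)
```

Averaging within a group is one way to dampen noise before training; the missing (user_id, item_id) pairs are then the cells the trained model is asked to predict.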

03. AI model management
Use MLflow to manage model revisions and stages

Our Application Result
• 314,000+ well objects with >20 attributes that have missing values
• Neighbor discovery
• <10 minutes to generate 144,000,000+ neighbor combinations
• Fill nulls with similarity
• Null reduction: Vertical_depth 36% -> 9.5% and lateral_length 46% -> 14%
• Fill nulls with collaborative AI
• 3.7 million training records (80% training, 20% testing)
• 5 minutes to train with grid search and cross validation on Databricks
• Null reduction: down to 0% null values
• Accuracy: percentage error of 7% to 18% for key attributes
On Oil & Gas Unconventional Well Data
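
The two kinds of figures reported above, null rate and percentage error, could be computed along the following lines. The slides do not define "error%", so treating it as a mean absolute percentage error on held-out values is an assumption, and the function names and sample numbers are hypothetical.

```python
from statistics import mean

def null_rate(values):
    """Fraction of entries that are null (None)."""
    return sum(v is None for v in values) / len(values)

def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error, in percent (assumed definition)."""
    return 100 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))

# Hypothetical column before and after filling.
before = [1000.0, None, 1200.0, None]
after = [1000.0, 1100.0, 1200.0, None]
rate_before = null_rate(before)
rate_after = null_rate(after)

# Hypothetical held-out test values and model predictions.
error_pct = mean_abs_pct_error([100.0, 200.0], [110.0, 180.0])
```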

Tips for Applications
• Remove outliers from the training data for the AI model
• No need to normalize the values
• Form object UserID groups to deal with noise in the data for the AI model
• More attributes and more data lead to higher accuracy
• Accuracy is higher for non-derived attributes (and for attributes with less noise)
Attention to details
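
As an illustration of the first tip, the sketch below drops outliers from one attribute before training. The slides do not say which outlier rule was used; the interquartile-range (IQR) fence with a factor of 1.5 is a common convention and is an assumption here, as are the function name and sample values.

```python
from statistics import quantiles

def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical attribute values with one obvious outlier.
lateral_lengths = [10, 11, 12, 13, 14, 15, 16, 1000]
cleaned = remove_outliers_iqr(lateral_lengths)
```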

Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
