Geospatial Options in Apache Spark
Geospatial Analytics in Spark
Dan Corbiani
Data Scientist, Pacific Northwest National Lab
Goal:
Provide practical examples of preprocessing
and analyzing vector data at scale.
Agenda
Housekeeping
Info about PNNL and our team.
Challenges with Geospatial
Analytics
Why is this so hard?
Case Studies
Practical use cases and functional examples.
About PNNL
▪ Part of the Department of
Energy’s National Lab Complex
▪ ~4,400 staff
▪ ~$1B in business
▪ Interdisciplinary and highly
matrixed
Our Team
▪ Mike Giardinelli
▪ Jenny Webster
▪ Rich Buractaon
▪ Gideon Juve
▪ Lucas Tate
▪ Justin Almquist
▪ Ralph Perko
▪ Karl Pazdernik
▪ Mark Jensen
▪ Wenwei Xu
▪ Tim McPherson
▪ Patrick Royer
Disclaimer
▪ I am not a geospatial oracle. This presentation documents my
knowledge and experience, and the landscape is constantly changing.
▪ I assume you are familiar with:
▪ UDFs
▪ Window Functions
Challenges with Geospatial Analytics
▪ Projections
▪ Indexing
▪ Finding and curating data
▪ Error Trapping
▪ System Libraries
▪ Easiest way to do geospatial analysis is for it to not be geospatial.
Projections
▪ The Earth is not flat, but most analytics run in 2D space.
▪ Using Latitude, Longitude, and Elevation can be misleading.
▪ It’s not that simple…
▪ Datum
▪ Important when working on flooding problems (NAD83, NAVD88, WGS84)
▪ Datums may be local and datasets cannot be directly compared.
▪ Projection (https://www.wikiwand.com/en/Map_projection)
▪ All projections are wrong. Just pick one that makes sense, record it, and move on.
▪ Protip: if something is projected, add the projection ID (for example, the EPSG code) to the name of the geometry column.
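A minimal sketch of the "record it" advice, assuming Web Mercator (EPSG:3857) as the chosen projection and a hypothetical row layout where each geometry column name ends in its EPSG code. The forward formula is the standard spherical Mercator one; a real pipeline would more likely call a projection library.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project lon/lat in degrees (EPSG:4326) to Web Mercator metres (EPSG:3857)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Hypothetical record: the EPSG code is part of every geometry column name,
# so nobody has to guess which projection a column is in.
row = {"name": "example_point", "geom_4326": (-119.28, 46.28)}
row["geom_3857"] = lonlat_to_web_mercator(*row["geom_4326"])
```

Keeping both columns side by side makes it harder to accidentally compare coordinates from different projections.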
Indexing
▪ We generally want to join or search geospatially.
▪ Point in Polygon searches are expensive! Especially when the search
space is not limited.
▪ It may not be clear when the data is indexed.
▪ Lots of options
▪ Geohash (Elastic)
▪ Quadtree
▪ H3 (Uber)
▪ Possible to index base RDD in advanced use cases.
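To make the idea of limiting the search space concrete, here is a small pure-Python sketch of geohash encoding. The function names and the bucketing helper are illustrative, not taken from any of the libraries named in this deck.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a lat/lon pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    refine_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if refine_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        refine_lon = not refine_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def bucket_by_geohash(points, precision=5):
    """Group (point_id, lat, lon) tuples by geohash cell."""
    buckets = defaultdict(list)
    for point_id, lat, lon in points:
        buckets[geohash_encode(lat, lon, precision)].append(point_id)
    return dict(buckets)
```

Points that share a cell become cheap join candidates. Two points close to a cell boundary can still land in different cells, so a real search also has to look at neighbouring cells.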
Finding and Curating Data
▪ Incoming formats
▪ JSON
▪ OSM
▪ Shapefile
▪ GDB
▪ Raster
▪ CSV / Parquet / SQL
▪ Our process:
Read Raw Data → Validate Geometry → Convert to WKT → Parquet / CSV (Long Term Storage)
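The validate-then-convert steps can be sketched in plain Python. This is an illustration only: it assumes the raw input is a 2D GeoJSON Polygon, and the validation rule (closed rings with at least four positions) is a deliberately minimal stand-in for a full geometry check.

```python
import json


def ring_is_valid(ring):
    """A linear ring needs at least 4 positions and must be closed."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def geojson_polygon_to_wkt(geometry):
    """Return WKT for a GeoJSON Polygon, or None if validation fails."""
    if geometry.get("type") != "Polygon":
        return None
    rings = geometry.get("coordinates") or []
    if not rings or not all(ring_is_valid(ring) for ring in rings):
        return None
    parts = []
    for ring in rings:
        # Only x and y are written; any z value is dropped in this sketch.
        parts.append("(" + ", ".join(f"{p[0]} {p[1]}" for p in ring) + ")")
    return "POLYGON (" + ", ".join(parts) + ")"


raw = '{"type": "Polygon", "coordinates": [[[0, 0], [4, 0], [4, 3], [0, 3], [0, 0]]]}'
wkt = geojson_polygon_to_wkt(json.loads(raw))
```

The resulting WKT string is an ordinary text column, which is what makes Parquet or CSV workable as long-term storage.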
Error Trapping
▪ Working with geospatial data in Spark will cause errors. These must
be handled gracefully.
▪ Incoming files can have a variety of projections
▪ Examples of errors:
▪ Points not in correct order
▪ Malformed WKT Strings
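One way to handle both example errors gracefully is a parser that never raises and returns the error as data, so one bad row cannot fail a whole job. The sketch below is plain Python that could be wrapped in a UDF. It only handles single-ring polygons, and it reads "points not in correct order" as ring winding order, which is an assumption.

```python
import re

_POLYGON_RE = re.compile(r"^\s*POLYGON\s*\(\((.*?)\)\)\s*$", re.IGNORECASE)


def parse_simple_polygon(wkt):
    """Parse a single-ring POLYGON WKT string into a list of (x, y) tuples."""
    match = _POLYGON_RE.match(wkt or "")
    if match is None:
        raise ValueError("not a single-ring POLYGON")
    ring = []
    for pair in match.group(1).split(","):
        x_text, y_text = pair.split()
        ring.append((float(x_text), float(y_text)))
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("ring is not closed")
    return ring


def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings."""
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))


def safe_parse(wkt):
    """Return (ring, error). Exactly one of the two is None."""
    try:
        ring = parse_simple_polygon(wkt)
    except (ValueError, TypeError) as exc:
        return None, str(exc)
    if signed_area(ring) < 0:
        ring = ring[::-1]  # normalise clockwise rings to counter-clockwise
    return ring, None
```

Keeping the error message in its own column makes it possible to count and inspect failures after the run instead of losing them.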
System Libraries
▪ Use cluster init scripts or docker containers to install low level
libraries when necessary. Avoid this when possible.
▪ Packaging wheels of common libraries specifically for Databricks can
help.
▪ Useful libraries
▪ GeoPandas (https://github.com/geopandas/geopandas)
▪ Scikit-mobility (https://github.com/scikit-mobility/scikit-mobility)
▪ MovingPandas (https://github.com/anitagraser/movingpandas)
▪ RasterFrames (https://rasterframes.io/index.html)
▪ GeoSpark (https://github.com/DataSystemsLab/GeoSpark)
▪ Finding the balance between user knowledge and compute time.
Examples - Indexing
▪ SQL – Speed Differences
▪ Index on / off
▪ GeoSpark
▪ Example
▪ H3 Hash Example
Case Studies
Large Scale Geospatial Joins
▪ As an analyst, I’m given many buildings and regions that must be
joined for analytics.
▪ Open source example:
▪ Join the Microsoft buildings dataset with the US Census Blocks to get statistics on average square foot density.
Large Scale Geospatial Joins
▪ Issues:
▪ Data formats…
▪ Buildings are in JSON
▪ Blocks are in shapefile format
▪ Indexing
▪ Join performance
▪ Output storage
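The indexing and join-performance issues can be illustrated with a toy, single-machine version of an indexed join. This is a sketch under stated assumptions, not the demo code: blocks are bucketed into a hypothetical uniform grid, each building is reduced to a centroid, and the exact point-in-polygon test runs only against blocks that share the building's cell.

```python
import math
from collections import defaultdict


def point_in_polygon(x, y, ring):
    """Ray casting test against a closed ring of (x, y) vertices."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cell_of(x, y, size):
    return (math.floor(x / size), math.floor(y / size))


def build_grid_index(blocks, size):
    """Map each grid cell to the ids of blocks whose bounding box touches it."""
    index = defaultdict(list)
    for block_id, ring in blocks.items():
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        cx0, cy0 = cell_of(min(xs), min(ys), size)
        cx1, cy1 = cell_of(max(xs), max(ys), size)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                index[(cx, cy)].append(block_id)
    return index


def join_buildings_to_blocks(buildings, blocks, size=1.0):
    """Sum building footprint area per block, testing only same-cell candidates."""
    index = build_grid_index(blocks, size)
    footprint_by_block = defaultdict(float)
    for _building_id, (x, y), area in buildings:
        for block_id in index.get(cell_of(x, y, size), []):
            if point_in_polygon(x, y, blocks[block_id]):
                footprint_by_block[block_id] += area
                break
    return dict(footprint_by_block)


# Made-up data in arbitrary planar units.
blocks = {
    "block_a": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "block_b": [(2, 0), (4, 0), (4, 2), (2, 2), (2, 0)],
}
buildings = [
    ("b1", (0.5, 0.5), 1200.0),
    ("b2", (1.5, 1.5), 800.0),
    ("b3", (3.0, 1.0), 950.0),
    ("b4", (9.0, 9.0), 500.0),  # falls outside every block
]
totals = join_buildings_to_blocks(buildings, blocks)
```

In a distributed setting the same idea would typically become an equi-join on the cell key followed by an exact spatial filter. Dividing each total by its block area would give a density figure.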
Large Scale Geospatial Joins
▪ DEMO
Case Study
Spatial Disaggregation
Spatial Disaggregation
▪ As an analyst, I want to see a map of where PPE is needed across the
US at a specific resolution.
▪ Input data is PPE intensity by NAICS code; output should be an “eye
candy” map.
Spatial Disaggregation
▪ Challenge
NAICS Intensity → MAGIC? → Map
Spatial Disaggregation
▪ Input data is PPE intensity by NAICS code; output should be an “eye
candy” map.
NAICS Intensity + County Business Practices Data → County Level Data → Block Level Workforce Intensity → H3 Grid / Summation → Map
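The county-to-block-to-grid flow can be sketched as a proportional allocation. This is an assumption about the method, since the slides do not state the weighting: each county total is split across its blocks by workforce share, and block values are then summed per grid cell. The cell ids are hypothetical stand-ins for H3 indexes.

```python
from collections import defaultdict


def disaggregate(county_totals, block_workforce):
    """Split each county total across its blocks in proportion to block workforce."""
    county_workforce = defaultdict(float)
    for county, _block, workers in block_workforce:
        county_workforce[county] += workers
    block_values = {}
    for county, block, workers in block_workforce:
        total = county_workforce[county]
        share = workers / total if total else 0.0
        block_values[block] = county_totals.get(county, 0.0) * share
    return block_values


def sum_to_cells(block_values, block_to_cell):
    """Aggregate block-level values onto grid cells."""
    cell_values = defaultdict(float)
    for block, value in block_values.items():
        cell_values[block_to_cell[block]] += value
    return dict(cell_values)


# Made-up inputs for illustration.
county_totals = {"county_1": 100.0}
block_workforce = [
    ("county_1", "blk_a", 30),
    ("county_1", "blk_b", 10),
    ("county_1", "blk_c", 60),
]
block_to_cell = {"blk_a": "cell_x", "blk_b": "cell_x", "blk_c": "cell_y"}

block_values = disaggregate(county_totals, block_workforce)
cell_values = sum_to_cells(block_values, block_to_cell)
```

Because the shares within a county sum to one, the county total is preserved after disaggregation, which is a useful sanity check at every resolution.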
Spatial Disaggregation
▪ DEMO
Case Study
Pattern of Life
Pattern of Life
▪ As a research and operations team, I would like to understand patterns
in geospatial data.
▪ Researchers develop new algorithms.
▪ Operations team leverages algorithms on new data.
▪ Challenge
▪ How do you connect these two groups?
Entity Dataframe
▪ A domain class that has a known schema and a series of spatial
transformations.
▪ Entity_id, Lat, Lon, Timestamp
▪ Transformations:
▪ Stops
▪ Path Identification
▪ Spawn Maps
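As an illustration of the "Stops" transformation, the sketch below detects stops for a single entity from (timestamp, lat, lon) pings. The definition is an assumption: a stop is a run of consecutive pings that stay within `radius_m` of the run's first ping for at least `min_seconds`. Both thresholds are hypothetical defaults.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a spherical Earth."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def find_stops(pings, radius_m=100.0, min_seconds=300):
    """pings: list of (timestamp_s, lat, lon) for one entity, in any order."""
    pings = sorted(pings)
    stops = []
    i = 0
    while i < len(pings):
        t0, lat0, lon0 = pings[i]
        j = i
        # Extend the run while the next ping stays near the run's first ping.
        while j + 1 < len(pings) and haversine_m(lat0, lon0, pings[j + 1][1], pings[j + 1][2]) <= radius_m:
            j += 1
        if pings[j][0] - t0 >= min_seconds:
            stops.append({"start": t0, "end": pings[j][0], "lat": lat0, "lon": lon0})
            i = j + 1
        else:
            i += 1
    return stops


# Made-up pings: about 400 s in one place, then movement north.
pings = [
    (0, 46.0, -119.0),
    (200, 46.0, -119.0),
    (400, 46.0, -119.0),
    (500, 46.01, -119.0),
    (600, 46.02, -119.0),
]
stops = find_stops(pings)
```

In Spark this logic would normally run per entity, for example through a window partitioned by Entity_id and ordered by Timestamp, or through a grouped UDF.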
Polygon Dataframe
▪ A domain class for polygons that facilitates indexing and comparison.
▪ WKT, Name, Description, Source, etc…
▪ Transformations:
▪ Geohash Indexing (with buffers)
▪ H3 Indexing (with buffers)
Pattern of Life
▪ A domain class for combining Polygons and Entities
▪ Transformations:
▪ Spawn locations
▪ Most visited places
▪ Similar users
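Two of these transformations can be sketched once entity stops have been matched to named polygons. The definitions are assumptions made for illustration: "most visited" is a simple visit count per entity, and "similar users" is the Jaccard similarity of the sets of places each entity visited.

```python
from collections import Counter, defaultdict


def most_visited(visits, top_n=1):
    """visits: iterable of (entity_id, place_name). Returns top places per entity."""
    counts = defaultdict(Counter)
    for entity_id, place in visits:
        counts[entity_id][place] += 1
    return {entity: [p for p, _ in c.most_common(top_n)] for entity, c in counts.items()}


def jaccard(a, b):
    """Jaccard similarity of two collections of place names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up visits for two entities.
visits = [
    ("e1", "cafe"), ("e1", "cafe"), ("e1", "gym"),
    ("e2", "cafe"), ("e2", "library"), ("e2", "library"),
]
top = most_visited(visits)

places_by_entity = defaultdict(set)
for entity_id, place in visits:
    places_by_entity[entity_id].add(place)
similarity = jaccard(places_by_entity["e1"], places_by_entity["e2"])
```

Keeping such functions in a shared package is one way researchers and operations staff can run the same transformation on different data.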
Pattern of Life
▪ DEMO
▪ Simple use case of showing spawn locations for the users.
Lessons Learned
Lessons Learned
▪ Standardizing on data formats is important
▪ Data lakes are helpful
▪ Domain Driven Design and common packages can save headaches.
▪ Notebooks are useful but not a panacea
▪ Test scaling often!
▪ Landscape is always changing and a lot of time must be spent to keep
up.
▪ Your problem is unlikely to be a unicorn. Leverage talents of others to
deliver real impact.
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Geospatial Options in Apache Spark
Ad

More Related Content

What's hot (20)

Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
Kernel TLV
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
 
Oracle RAC 19c: Best Practices and Secret Internals
Oracle RAC 19c: Best Practices and Secret InternalsOracle RAC 19c: Best Practices and Secret Internals
Oracle RAC 19c: Best Practices and Secret Internals
Anil Nair
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Oracle RAC 19c and Later - Best Practices #OOWLON
Oracle RAC 19c and Later - Best Practices #OOWLONOracle RAC 19c and Later - Best Practices #OOWLON
Oracle RAC 19c and Later - Best Practices #OOWLON
Markus Michalewicz
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
DataWorks Summit
 
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Yury Velikanov
 
RAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature EngineeringRAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature Engineering
Keith Kraus
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
InfluxData
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
 
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache SparkKoalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
Julien Le Dem
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
Kernel TLV
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Oracle RAC 19c: Best Practices and Secret Internals
Oracle RAC 19c: Best Practices and Secret InternalsOracle RAC 19c: Best Practices and Secret Internals
Oracle RAC 19c: Best Practices and Secret Internals
Anil Nair
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Oracle RAC 19c and Later - Best Practices #OOWLON
Oracle RAC 19c and Later - Best Practices #OOWLONOracle RAC 19c and Later - Best Practices #OOWLON
Oracle RAC 19c and Later - Best Practices #OOWLON
Markus Michalewicz
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
DataWorks Summit
 
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Oracle 12c RAC On your laptop Step by Step Implementation Guide 1.0
Yury Velikanov
 
RAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature EngineeringRAPIDS: GPU-Accelerated ETL and Feature Engineering
RAPIDS: GPU-Accelerated ETL and Feature Engineering
Keith Kraus
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
InfluxData
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
Amita Mirajkar
 

Similar to Geospatial Options in Apache Spark (20)

Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Databricks
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
Steve Min
 
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesA Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
Luiz Henrique Zambom Santana
 
A middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQLA middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQL
Luiz Henrique Zambom Santana
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
Sofian Hadiwijaya
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
Datio Big Data
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
MapR Technologies
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
cdmaxime
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
Geoffrey Fox
 
Is hadoop for you
Is hadoop for youIs hadoop for you
Is hadoop for you
Gwen (Chen) Shapira
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
SpazioDati
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Deep Learning - system view
Deep Learning - system viewDeep Learning - system view
Deep Learning - system view
Yoss Cohen
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
DataWorks Summit
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
Handaru Sakti
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
Sky Yin
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
Data Con LA
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Geosp.AI.tial: Applying Big Data and Machine Learning to Solve the World's To...
Databricks
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
Steve Min
 
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL DatabasesA Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
A Workload-Aware Middleware for Storing Massive RDF Graphs into NoSQL Databases
Luiz Henrique Zambom Santana
 
A middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQLA middleware for storing massive RDF graphs into NoSQL
A middleware for storing massive RDF graphs into NoSQL
Luiz Henrique Zambom Santana
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
Datio Big Data
 
Swiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache DrillSwiss Big Data User Group - Introduction to Apache Drill
Swiss Big Data User Group - Introduction to Apache Drill
MapR Technologies
 
Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
cdmaxime
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
Geoffrey Fox
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developersISWC 2014 - Dandelion: from raw data to dataGEMs for developers
ISWC 2014 - Dandelion: from raw data to dataGEMs for developers
SpazioDati
 
[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리[214]유연하고 확장성 있는 빅데이터 처리
[214]유연하고 확장성 있는 빅데이터 처리
NAVER D2
 
Deep Learning - system view
Deep Learning - system viewDeep Learning - system view
Deep Learning - system view
Yoss Cohen
 
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015Apache Spark - San Diego Big Data Meetup Jan 14th 2015
Apache Spark - San Diego Big Data Meetup Jan 14th 2015
cdmaxime
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
DataWorks Summit
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
Handaru Sakti
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
Sky Yin
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
Data Con LA
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Ad

Geospatial Options in Apache Spark

  • 2. Geospatial Analytics in Spark Dan Corbiani Data Scientist, Pacific Northwest National Lab
  • 3. Goal: Provide practical examples of preprocessing and analyzing vector data at scale.
  • 4. Agenda Housekeeping Info about PNNL and our team. Challenges with Geospatial Analytics Why is this so hard? Case Studies Practical use cases and functional examples.
  • 5. About PNNL ▪ Part of the Department of Energy’s National Lab Complex ▪ ~4,400 staff ▪ ~$1B in business ▪ Interdisciplinary and highly matrixed
  • 6. Our Team ▪ Mike Giardinelli ▪ Jenny Webster ▪ Rich Buractaon ▪ Gideon Juve ▪ Lucas Tate ▪ Justin Almquist ▪ Ralph Perko ▪ Karl Pazdernik ▪ Mark Jensen ▪ Wenwei Xu ▪ Tim McPherson ▪ Patrick Royer
  • 7. Disclaimer ▪ I am not a geospatial oracle. This presentation documents my knowledge and experience, and the landscape is constantly changing. ▪ I assume you are familiar with: ▪ UDFs ▪ Window Functions
  • 8. Challenges with Geospatial Analytics ▪ Projections ▪ Indexing ▪ Finding and curating data ▪ Error Trapping ▪ System Libraries ▪ Easiest way to do geospatial analysis is for it to not be geospatial.
  • 9. Projections ▪ Earth is not flat, but most analytics run in 2D space. ▪ Using Latitude, Longitude, and Elevation can be misleading. ▪ It’s not that simple… ▪ Datum ▪ Important when working on flooding problems (NAD83, NAVD88, WGS84) ▪ Datums may be local and datasets cannot be directly compared. ▪ Projection (https://www.wikiwand.com/en/Map_projection) ▪ All projections are wrong. Just pick one that makes sense, record it, and move on. ▪ Protip: if something is projected, add the id in the column name for the geometry.
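
To make the "latitude and longitude can be misleading" point concrete, here is a minimal sketch comparing a naive distance in degrees with a great-circle distance. It assumes a spherical Earth with a mean radius, which is itself a simplification; the function names and the `geometry_wkt_epsg4326` column name are hypothetical illustrations, not something taken from the talk.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def naive_degree_distance(lat1, lon1, lat2, lon2):
    """Treats degrees as flat x/y units, which ignores the shape of the Earth."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude measured at the equator and at 60 degrees north.
at_equator_m = haversine_m(0.0, 0.0, 0.0, 1.0)
at_60_north_m = haversine_m(60.0, 0.0, 60.0, 1.0)

# Hypothetical record following the protip: the column name carries the
# coordinate reference system id (EPSG:4326 here is an assumed example).
record = {"name": "example", "geometry_wkt_epsg4326": "POINT (0 60)"}
```

Both pairs of points are exactly one "degree unit" apart in the naive calculation, yet the ground distance at 60 degrees north is roughly half the distance at the equator.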
  • 10. Indexing ▪ We generally want to join or search geospatially. ▪ Point in Polygon searches are expensive! Especially when the search space is not limited. ▪ It may not be clear when the data is indexed. ▪ Lots of options ▪ Geohash (Elastic) ▪ Quadtree ▪ H3 (Uber) ▪ Possible to index base RDD in advanced use cases.
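
As an illustration of one of the indexing options listed above, the following is a minimal pure-Python sketch of geohash encoding. It is not the code used in the talk; in practice a maintained library would normally be used, and the function name `geohash_encode` is a hypothetical choice.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=7):
    """Encode a lat/lon pair (degrees) as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


cell = geohash_encode(57.64911, 10.40744, precision=5)
```

Because points in the same cell share the same string, a spatial search can often be narrowed with a plain equality join on the hash before any exact geometry test is run. Points on opposite sides of a cell boundary get different hashes, so neighbouring cells usually need to be checked as well.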
  • 11. Finding and Curating Data ▪ Incoming formats ▪ JSON ▪ OSM ▪ Shapefile ▪ GDB ▪ Raster ▪ CSV / Parquet / SQL ▪ Our process: Read Raw Data → Validate Geometry → Convert to WKT → Parquet / CSV → Long Term Storage
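
The "validate geometry, then convert to WKT" steps above might look roughly like the following sketch. It assumes polygons arrive as GeoJSON-style lists of rings and only checks ring closure and length, which is far less than a geometry library would check; the function names are hypothetical.

```python
def ring_is_valid(ring):
    """A linear ring needs at least 4 positions and must end where it starts."""
    return len(ring) >= 4 and tuple(ring[0]) == tuple(ring[-1])


def polygon_to_wkt(rings):
    """Convert a list of rings (exterior first), each a list of (x, y), to WKT."""
    if not rings:
        raise ValueError("polygon has no rings")
    for ring in rings:
        if not ring_is_valid(ring):
            raise ValueError("ring is not closed or has fewer than 4 positions")
    parts = []
    for ring in rings:
        coords = ", ".join(f"{x} {y}" for x, y in ring)
        parts.append(f"({coords})")
    return "POLYGON (" + ", ".join(parts) + ")"


square = [[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]
square_wkt = polygon_to_wkt(square)
```

Storing the resulting WKT string in a plain text column keeps the data readable from Parquet or CSV without any geospatial library on the reading side.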
  • 12. Error Trapping ▪ Working with geospatial data in Spark will cause errors. These must be handled gracefully. ▪ Incoming files can have a variety of projections ▪ Examples of errors: ▪ Points not in correct order ▪ Malformed WKT Strings
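
One way to handle such errors gracefully is to have the parsing function return an error reason instead of raising, so that a single bad row does not fail a whole job. The sketch below is a hypothetical helper that could be wrapped in a UDF; it only handles `POINT` strings, and the axis-order check is an assumption about one kind of ordering error, not necessarily the one meant on the slide.

```python
import re

_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_point_wkt(text):
    """Return ((lon, lat), None) on success, or (None, reason) on failure."""
    if not isinstance(text, str):
        return None, "not a string"
    match = _POINT_RE.match(text)
    if match is None:
        return None, "malformed WKT"
    lon, lat = float(match.group(1)), float(match.group(2))
    # Out-of-range values often mean swapped axes or projected coordinates.
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        return None, "coordinates out of range"
    return (lon, lat), None


good = parse_point_wkt("POINT (-119.3 46.3)")
truncated = parse_point_wkt("POINT (-119.3")
swapped = parse_point_wkt("POINT (46.3 -119.3)")
```

Keeping the reason alongside the value makes it possible to route failed rows to a separate table for review rather than dropping them silently.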
  • 13. System Libraries ▪ Use cluster init scripts or Docker containers to install low-level libraries when necessary. Avoid this when possible. ▪ Packaging wheels specifically for Databricks for common libraries can help. ▪ Useful libraries ▪ GeoPandas (https://github.com/geopandas/geopandas) ▪ Scikit-mobility (https://github.com/scikit-mobility/scikit-mobility) ▪ Moving-pandas (https://github.com/anitagraser/movingpandas) ▪ RasterFrames (https://rasterframes.io/index.html) ▪ GeoSpark (https://github.com/DataSystemsLab/GeoSpark) ▪ Finding the balance between user knowledge and compute time.
  • 14. Examples - Indexing ▪ SQL – Speed Differences ▪ Index on / off ▪ GeoSpark ▪ Example ▪ H3 Hash Example
  • 16. Large Scale Geospatial Joins ▪ As an analyst, I’m given many buildings and regions that must be joined for analytics. ▪ Open source example: ▪ Join the Microsoft buildings dataset with the US Census Blocks to get statistics on average square foot density.
  • 17. Large Scale Geospatial Joins ▪ Issues: ▪ Data formats… ▪ Buildings are in JSON ▪ Blocks are in Shapefile format ▪ Indexing ▪ Join performance ▪ Output storage
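
The indexing and join-performance issues above can be illustrated with a small sketch: bucket polygons into grid cells by bounding box, then run the expensive point-in-polygon test only against polygons in the point's own cell. This is a simplified stand-in, assuming buildings are reduced to single points and blocks are simple rings without holes; it is not the demo code, and all names are hypothetical.

```python
import math
from collections import defaultdict


def point_in_polygon(x, y, ring):
    """Ray-casting test for a point against a simple ring of (x, y) vertices."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        if (yi > y) != (yj > y):  # edge straddles the horizontal ray
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def cells_for_bbox(ring, size):
    """Yield every grid cell touched by the ring's bounding box."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    for cx in range(math.floor(min(xs) / size), math.floor(max(xs) / size) + 1):
        for cy in range(math.floor(min(ys) / size), math.floor(max(ys) / size) + 1):
            yield (cx, cy)


def grid_join(points, polygons, size=1.0):
    """points: {id: (x, y)}; polygons: {id: ring}. Returns (point_id, polygon_id) pairs."""
    index = defaultdict(list)
    for poly_id, ring in polygons.items():
        for cell in cells_for_bbox(ring, size):
            index[cell].append(poly_id)
    matches = []
    for point_id, (x, y) in points.items():
        cell = (math.floor(x / size), math.floor(y / size))
        for poly_id in index.get(cell, []):  # only candidates from the same cell
            if point_in_polygon(x, y, polygons[poly_id]):
                matches.append((point_id, poly_id))
    return matches


blocks = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2), (0, 0)],
    "B": [(5, 5), (6, 5), (6, 6), (5, 6), (5, 5)],
}
buildings = {"p1": (1, 1), "p2": (5.5, 5.5), "p3": (10, 10)}
pairs = grid_join(buildings, blocks)
```

In a distributed setting the cell id plays the role of the join key, so the exact geometry test runs only on rows that already share a cell.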
  • 18. Large Scale Geospatial Joins ▪ DEMO
  • 20. Spatial Disaggregation ▪ As an analyst, I want to see a map of where PPE is needed across the US at a specific resolution. ▪ Input data is PPE intensity by NAICS code; output should be an “eye candy” map.
  • 22. Spatial Disaggregation ▪ Input data is PPE intensity by NAICS code; output should be an “eye candy” map. ▪ Diagram: NAICS Intensity Map, County Business Practices Data, County Level Data, Block Level Workforce Intensity, H3 Grid / Summation
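
The county-to-block step in this pipeline can be sketched as a proportional allocation: each block receives a share of its county's value equal to its share of the county's workforce. This weighting rule is an assumption made for illustration, since the slides do not spell out the exact method, and the function and field names are hypothetical.

```python
def disaggregate(county_totals, block_workforce):
    """Split county-level values across blocks in proportion to workforce.

    county_totals:   {county_id: value}
    block_workforce: {block_id: (county_id, workers)}
    """
    workers_per_county = {}
    for county_id, workers in block_workforce.values():
        workers_per_county[county_id] = workers_per_county.get(county_id, 0) + workers

    block_values = {}
    for block_id, (county_id, workers) in block_workforce.items():
        total = workers_per_county[county_id]
        share = workers / total if total else 0.0  # guard against empty counties
        block_values[block_id] = county_totals.get(county_id, 0.0) * share
    return block_values


county_totals = {"c1": 100.0, "c2": 40.0}
block_workforce = {
    "b1": ("c1", 25),
    "b2": ("c1", 75),
    "b3": ("c2", 10),
}
block_values = disaggregate(county_totals, block_workforce)
```

The block values can then be assigned to grid cells and summed per cell to produce the final map layer.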
  • 25. Pattern of Life ▪ As a research and operations team, I would like to understand patterns in geospatial data. ▪ Researchers develop new algorithms. ▪ Operations team leverages algorithms on new data. ▪ Challenge ▪ How do you connect these two groups?
  • 26. Entity Dataframe ▪ A domain class that has a known schema and a series of spatial transformations. ▪ Entity_id, Lat, Lon, Timestamp ▪ Transformations: ▪ Stops ▪ Path Identification ▪ Spawn Maps
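
As a sketch of what a "Stops" transformation could look like for one entity, the following groups consecutive observations that stay within a radius of an anchor point for a minimum duration. The definition of a stop, the thresholds, and the function names are assumptions for illustration; the talk does not describe its algorithm. In Spark the same logic would typically run per entity over records ordered by timestamp.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_stops(track, radius_m=50.0, min_duration_s=300.0):
    """track: list of (timestamp_s, lat, lon) for one entity, sorted by time."""
    stops = []
    i = 0
    while i < len(track):
        t0, lat0, lon0 = track[i]
        j = i + 1
        # Extend the group while points stay within radius_m of the anchor.
        while j < len(track) and haversine_m(lat0, lon0, track[j][1], track[j][2]) <= radius_m:
            j += 1
        group = track[i:j]
        duration = group[-1][0] - t0
        if duration >= min_duration_s:
            stops.append({
                "start": t0,
                "end": group[-1][0],
                "lat": sum(p[1] for p in group) / len(group),
                "lon": sum(p[2] for p in group) / len(group),
            })
            i = j  # continue after the stop
        else:
            i += 1  # no stop anchored here; try the next point
    return stops


track = [
    (0, 46.0, -119.0),
    (120, 46.0, -119.0),
    (240, 46.0, -119.0),
    (360, 46.0, -119.0),
    (420, 46.01, -119.0),
    (480, 46.02, -119.0),
]
stops = find_stops(track)
```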
  • 27. Polygon Dataframe ▪ A domain class for polygons that facilitates indexing and comparison. ▪ WKT, Name, Description, Source, etc… ▪ Transformations: ▪ Geohash Indexing (with buffers) ▪ H3 Indexing (with buffers)
  • 28. Pattern of Life ▪ A domain class for combining Polygons and Entities ▪ Transformations: ▪ Spawn locations ▪ Most visited places ▪ Similar users
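
Once entities have been matched to named polygons, a "most visited places" transformation reduces to counting visits per entity. The sketch below assumes the join has already produced `(entity_id, place_name)` pairs; that input shape and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def most_visited(visits, top_n=1):
    """visits: iterable of (entity_id, place_name). Returns {entity_id: [place, ...]}."""
    counts = defaultdict(Counter)
    for entity_id, place in visits:
        counts[entity_id][place] += 1
    return {
        entity_id: [place for place, _ in counter.most_common(top_n)]
        for entity_id, counter in counts.items()
    }


visits = [
    ("u1", "cafe"),
    ("u1", "office"),
    ("u1", "office"),
    ("u2", "gym"),
]
favourites = most_visited(visits)
```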
  • 29. Pattern of Life ▪ DEMO ▪ Simple use case of showing spawn locations for the users.
  • 31. Lessons Learned ▪ Standardizing on data formats is important ▪ Datalakes are helpful ▪ Domain Driven Design and common packages can save headache. ▪ Notebooks are useful but not a panacea ▪ Test scaling often! ▪ Landscape is always changing and a lot of time must be spent to keep up. ▪ Your problem is unlikely to be a unicorn. Leverage talents of others to deliver real impact.
  • 32. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.