Geospatial Options in Apache Spark

Geospatial Analytics in Spark
Dan Corbiani
Data Scientist, Pacific Northwest National Lab

Goal:
Provide practical examples of preprocessing
and analyzing vector data at scale.

Agenda
Housekeeping
Info about PNNL and our team.
Challenges with Geospatial Analytics
Why is this so hard?
Case Studies
Practical use cases and functional examples.

About PNNL
▪ Part of the Department of
Energy’s National Lab Complex
▪ ~4,400 staff
▪ ~$1B in business
▪ Interdisciplinary and highly
matrixed

Our Team
▪ Mike Giardinelli
▪ Jenny Webster
▪ Rich Buractaon
▪ Gideon Juve
▪ Lucas Tate
▪ Justin Almquist
▪ Ralph Perko
▪ Karl Pazdernik
▪ Mark Jensen
▪ Wenwei Xu
▪ Tim McPherson
▪ Patrick Royer

Disclaimer
▪ I am not a geospatial oracle. This presentation documents my own
knowledge and experience, and the landscape is constantly changing.
▪ I assume you are familiar with:
▪ UDFs
▪ Window Functions

Challenges with Geospatial Analytics
▪ Projections
▪ Indexing
▪ Finding and curating data
▪ Error Trapping
▪ System Libraries
▪ The easiest way to do geospatial analysis is to make the problem non-geospatial.

Projections
▪ The Earth is not flat, but most analytics run in 2D space.
▪ Using Latitude, Longitude, and Elevation can be misleading.
▪ It’s not that simple…
▪ Datum
▪ Important when working on flooding problems (NAD83, NAVD88, WGS84)
▪ Datums may be local, so datasets on different datums cannot be directly compared.
▪ Projection (https://www.wikiwand.com/en/Map_projection)
▪ All projections are wrong. Just pick one that makes sense, record it, and move on.
▪ Protip: if a geometry is projected, add the projection ID (e.g. its EPSG code) to the geometry column name.
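
To illustrate why raw latitude and longitude can mislead, the sketch below compares a naive flat distance in degrees with a great-circle (haversine) distance. It assumes a spherical Earth with a mean radius of 6371 km, which is itself a simplification, and the function names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def degree_distance(lat1, lon1, lat2, lon2):
    """Naive Euclidean distance in degrees (treats lat/lon as flat x/y)."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


# One degree of longitude at the equator vs. at 60 degrees north.
equator_km = haversine_km(0.0, 0.0, 0.0, 1.0)
north_km = haversine_km(60.0, 0.0, 60.0, 1.0)
```

Both pairs of points are exactly one degree apart, yet the ground distance at 60 degrees north is about half of that at the equator.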

Indexing
▪ We generally want to join or search geospatially.
▪ Point in Polygon searches are expensive! Especially when the search
space is not limited.
▪ It may not be clear when the data is indexed.
▪ Lots of options
▪ Geohash (Elastic)
▪ Quadtree
▪ H3 (Uber)
▪ It is possible to index the base RDD in advanced use cases.
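
As a concrete picture of what a spatial index key looks like, here is a minimal pure-Python sketch of standard geohash encoding. It is for illustration only; in practice an existing library would normally be used, and the function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a lat/lon pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

A longer hash of the same point always starts with the shorter hash, so truncating the string widens the cell. That prefix property is what makes geohashes usable as join or partition keys.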

Finding and Curating Data
▪ Incoming formats
▪ JSON
▪ OSM
▪ Shapefile
▪ GDB
▪ Raster
▪ CSV / Parquet / SQL
▪ Our process:
Read Raw Data → Validate Geometry → Convert to WKT → Parquet / CSV → Long Term Storage
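
The validate and convert steps can be sketched as follows. This is a simplified stand-in, not the team's actual pipeline: it only checks that a polygon ring is closed, long enough, and finite, and the function name `ring_to_wkt` is hypothetical. A real workflow would more likely lean on a geometry library for validation.

```python
import math


def ring_to_wkt(ring):
    """Validate a polygon exterior ring and return it as a WKT string.

    ring: list of (x, y) pairs. Raises ValueError on invalid input.
    """
    if len(ring) < 4:
        raise ValueError("a closed ring needs at least 4 points")
    if tuple(ring[0]) != tuple(ring[-1]):
        raise ValueError("ring is not closed (first point != last point)")
    for x, y in ring:
        if not (math.isfinite(x) and math.isfinite(y)):
            raise ValueError("non-finite coordinate")
    # repr() keeps full float precision in the text output
    coords = ", ".join(f"{float(x)!r} {float(y)!r}" for x, y in ring)
    return f"POLYGON (({coords}))"


wkt = ring_to_wkt([(0, 0), (1, 0), (1, 1), (0, 0)])
```

Storing the geometry as a WKT string means the result fits in an ordinary text column in Parquet or CSV.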

Error Trapping
▪ Working with geospatial data in Spark will cause errors. These must
be handled gracefully.
▪ Incoming files can have a variety of projections
▪ Examples of errors:
▪ Points not in correct order
▪ Malformed WKT Strings
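
One way to handle such errors gracefully is to have the parsing function return an error value instead of raising, so a single bad row cannot fail a whole job. The sketch below is an assumption about how that might look for WKT points; the name `safe_parse_point`, the lon/lat axis order, and the simplified regular expression (no scientific notation, no Z values) are all illustrative.

```python
import re

# Simplified pattern: matches "POINT (x y)" with plain decimal numbers only.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def safe_parse_point(wkt):
    """Parse a WKT point. Returns (lon, lat, error) and never raises."""
    if wkt is None:
        return (None, None, "null input")
    if not isinstance(wkt, str):
        return (None, None, "not a string")
    match = _POINT_RE.match(wkt)
    if match is None:
        return (None, None, "malformed WKT")
    lon, lat = float(match.group(1)), float(match.group(2))
    # Assumption: coordinates are unprojected degrees in lon/lat order.
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        return (None, None, "out of range (swapped axes or projected data?)")
    return (lon, lat, None)
```

Wrapped in a UDF, the third field becomes an error column that can be filtered and inspected after the run.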

System Libraries
▪ Use cluster init scripts or Docker containers to install low-level
libraries when necessary. Avoid this when possible.
▪ Packaging wheels of common libraries specifically for Databricks can
help.
▪ Useful libraries
▪ GeoPandas (https://github.com/geopandas/geopandas)
▪ Scikit-mobility (https://github.com/scikit-mobility/scikit-mobility)
▪ MovingPandas (https://github.com/anitagraser/movingpandas)
▪ RasterFrames (https://rasterframes.io/index.html)
▪ GeoSpark (https://github.com/DataSystemsLab/GeoSpark)
▪ Finding the balance between user knowledge and compute time.

Examples - Indexing
▪ SQL – Speed Differences
▪ Index on / off
▪ GeoSpark
▪ Example
▪ H3 Hash Example

Large Scale Geospatial Joins
▪ As an analyst, I’m given many buildings and regions that must be
joined for analytics.
▪ Open source example:
▪ Join the Microsoft buildings dataset with the US Census Blocks to get statistics on average square foot density.
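
The general shape of an indexed join can be sketched in plain Python: bucket both sides by a coarse grid cell, match only within a shared cell, then run the exact point-in-polygon test on the surviving candidates. This is not the demo code. The 0.1-degree cell size, the function names, and the use of one representative point per building are assumptions made for illustration.

```python
import math
from collections import defaultdict

CELL_DEG = 0.1  # hypothetical grid cell size, in degrees


def cell_of(x, y):
    """Grid cell key for a point."""
    return (math.floor(x / CELL_DEG), math.floor(y / CELL_DEG))


def cells_for_bbox(ring):
    """All grid cells overlapped by the bounding box of a ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    x0, y0 = cell_of(min(xs), min(ys))
    x1, y1 = cell_of(max(xs), max(ys))
    return [(i, j) for i in range(x0, x1 + 1) for j in range(y0, y1 + 1)]


def point_in_ring(x, y, ring):
    """Ray-casting point-in-polygon test for a closed ring."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_points_to_polygons(points, polygons):
    """points: {id: (x, y)}, polygons: {id: closed ring} -> {point_id: polygon_id}."""
    index = defaultdict(list)
    for poly_id, ring in polygons.items():
        for cell in cells_for_bbox(ring):
            index[cell].append(poly_id)
    matches = {}
    for pt_id, (x, y) in points.items():
        # Only polygons sharing the point's cell are tested exactly.
        for poly_id in index.get(cell_of(x, y), []):
            if point_in_ring(x, y, polygons[poly_id]):
                matches[pt_id] = poly_id
                break
    return matches
```

In Spark, the cell key would typically become an ordinary equi-join column, with the exact geometric test applied as a filter on the joined rows.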

▪ Issues:
▪ Data formats…
▪ Buildings are in JSON
▪ Blocks are in shapefile format
▪ Indexing
▪ Join performance
▪ Output storage

▪ DEMO

Case Study
Spatial Disaggregation

▪ As an analyst, I want to see a map of where PPE is needed across the
US at a specific resolution.
▪ Input data is PPE intensity by NAICS code; the output should be an “eye
candy” map.

▪ Challenge
NAICS Intensity → MAGIC? → Map

NAICS Intensity + County Business Practices Data → County Level Data → Block Level Workforce Intensity → H3 Grid / Summation → Map
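
The disaggregation and summation steps can be sketched as a proportional allocation followed by a group-by. The slides do not state the allocation rule, so spreading each county total across its blocks in proportion to a workforce weight is an assumption here. The function names and the string cell IDs (stand-ins for H3 indexes) are hypothetical.

```python
from collections import defaultdict


def disaggregate(county_totals, block_weights):
    """Spread each county total across its blocks in proportion to weight.

    county_totals: {county_id: value}
    block_weights: {block_id: (county_id, weight)}
    Returns {block_id: allocated value}.
    """
    weight_sum = defaultdict(float)
    for county_id, weight in block_weights.values():
        weight_sum[county_id] += weight
    allocated = {}
    for block_id, (county_id, weight) in block_weights.items():
        total = county_totals.get(county_id, 0.0)
        denom = weight_sum[county_id]
        allocated[block_id] = total * weight / denom if denom > 0 else 0.0
    return allocated


def sum_by_cell(block_values, block_cells):
    """Aggregate block values into grid cells (for example H3 cell IDs)."""
    cells = defaultdict(float)
    for block_id, value in block_values.items():
        cells[block_cells[block_id]] += value
    return dict(cells)
```

Because each county's allocations sum back to the county total, the gridded map preserves the input totals while showing them at a finer resolution.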

▪ DEMO

Pattern of Life
▪ As a research and operations team, I would like to understand patterns
in geospatial data.
▪ Researchers develop new algorithms.
▪ Operations team leverages algorithms on new data.
▪ Challenge
▪ How do you connect these two groups?

Entity Dataframe
▪ A domain class that has a known schema and a series of spatial
transformations.
▪ Entity_id, Lat, Lon, Timestamp
▪ Transformations:
▪ Stops
▪ Path Identification
▪ Spawn Maps
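
As an illustration of one such transformation, here is a minimal stop-detection sketch over rows in the Entity_id, Lat, Lon, Timestamp schema. The definition of a stop used here (staying within a radius of an anchor point for a minimum duration) and the default thresholds are assumptions, not the method from the talk.

```python
import math
from itertools import groupby
from operator import itemgetter

EARTH_RADIUS_M = 6371000.0  # assumption: spherical Earth, mean radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_stops(rows, radius_m=100.0, min_seconds=300):
    """rows: iterable of (entity_id, lat, lon, timestamp_seconds).

    Returns a list of (entity_id, lat, lon, start_ts, end_ts) stops.
    """
    stops = []
    ordered = sorted(rows, key=itemgetter(0, 3))  # by entity, then time
    for entity_id, group in groupby(ordered, key=itemgetter(0)):
        pts = list(group)
        i = 0
        while i < len(pts):
            _, lat0, lon0, t0 = pts[i]
            j = i
            # Extend the run while later points stay near the anchor point.
            while (j + 1 < len(pts)
                   and haversine_m(lat0, lon0, pts[j + 1][1], pts[j + 1][2]) <= radius_m):
                j += 1
            if pts[j][3] - t0 >= min_seconds:
                stops.append((entity_id, lat0, lon0, t0, pts[j][3]))
                i = j + 1
            else:
                i += 1
    return stops
```

In Spark, the per-entity ordering would typically come from a window partitioned by entity and ordered by timestamp.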

Polygon Dataframe
▪ A domain class for polygons that facilitates indexing and comparison.
▪ WKT, Name, Description, Source, etc…
▪ Geohash Indexing (with buffers)
▪ H3 Indexing (with buffers)

Pattern of Life
▪ A domain class for combining Polygons and Entities
▪ Spawn locations
▪ Most visited places
▪ Similar users
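
Once entity stops have been matched to named polygons, a summary such as most visited places reduces to counting. The sketch below assumes the input is already a list of (entity_id, place_name) visit records; that input shape and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def most_visited(visits):
    """visits: iterable of (entity_id, place_name).

    Returns {entity_id: (place_name, visit_count)} for each entity's top place.
    """
    counts = defaultdict(Counter)
    for entity_id, place in visits:
        counts[entity_id][place] += 1
    return {entity: c.most_common(1)[0] for entity, c in counts.items()}
```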

Pattern of Life
▪ DEMO
▪ Simple use case of showing spawn locations for the users.

Lessons Learned
▪ Standardizing on data formats is important
▪ Data lakes are helpful
▪ Domain-Driven Design and common packages can save headaches.
▪ Notebooks are useful but not a panacea
▪ Test scaling often!
▪ Landscape is always changing and a lot of time must be spent to keep
up.
▪ Your problem is unlikely to be a unicorn. Leverage the talents of others to
deliver real impact.

