Creating Reusable Geospatial Pipelines

Dan Corbiani
Data Scientist, Pacific Northwest National Lab

Goals:
- Understand pipeline options and pitfalls.

Agenda
§ Value of pipelines
§ Available pipeline technologies
§ Implementations
§ Demonstrations

Explainable
Notebooks are great, but the flow
can be challenging to understand.

Configurable
Configuration allows the
pipeline to be reused / tested.
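
As a rough illustration of the "Configurable" point, the sketch below drives a small pipeline from a plain config list, so the same steps can be reused or tested with different parameters. It is plain Python, not any particular framework; the step names, the `run_pipeline` helper, and the sample points are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, float]


def clip_latitude(records: List[Record], *, min_lat: float, max_lat: float) -> List[Record]:
    """Keep only records whose latitude falls inside [min_lat, max_lat]."""
    return [r for r in records if min_lat <= r["lat"] <= max_lat]


def round_coords(records: List[Record], *, digits: int) -> List[Record]:
    """Round every coordinate value to the given number of decimal places."""
    return [{k: round(v, digits) for k, v in r.items()} for r in records]


# Hypothetical registry that maps a step name in the config to a function.
STEPS: Dict[str, Callable[..., List[Record]]] = {
    "clip_latitude": clip_latitude,
    "round_coords": round_coords,
}


def run_pipeline(records: List[Record], config: List[Dict[str, Any]]) -> List[Record]:
    """Apply each configured step, in order, to the records."""
    for step in config:
        records = STEPS[step["name"]](records, **step.get("params", {}))
    return records


# The config is plain data, so a test can swap in different parameters.
config = [
    {"name": "clip_latitude", "params": {"min_lat": 45.0, "max_lat": 49.0}},
    {"name": "round_coords", "params": {"digits": 2}},
]
points = [{"lat": 46.28042, "lon": -119.27518}, {"lat": 30.0, "lon": -97.0}]
result = run_pipeline(points, config)
```

Because the parameters live in the config rather than in the functions, a unit test can call `run_pipeline` with a tiny input and a one-step config.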

Traditional Process
[Diagram: Analytic Question, Development Spiral (Trials / Notebooks), Demonstration, Answer, Presentation]
Analysis / work often ends at a presentation.

Leverage Pipelines / Functions
[Diagram: Analytic Question, Development Spiral (Trials / Notebooks), Demonstration, Answer, Presentation]
Analysis / work often ends at a presentation.
Added steps: Decomposition, Automated Testing, Linting, Documentation, Peer Review, Deployed Library

Pipeline Library Requirements
• Working documentation
• Hundreds of stars on GitHub
• Active development
• Works within a Databricks notebook environment
• Operates on GraphFrames, the SQL API, and core DataFrames

Pipeline Technologies Shortlist
• Spark ML Pipelines
• Prefect
• Dagster
• Airflow

Spark ML Pipelines
• Works well for functions applied to dataframes.
• These functions are called “Transformers”
• Originally designed to prepare text for NLP analysis.
• All workflows are linear.
https://spark.apache.org/docs/latest/ml-pipeline.html

Spark ML Pipeline Transformer Example
Goal: Custom transformer that accepts parameters

Spark ML Pipeline Transformer Example
Step 1: Basic Transformer

Spark ML Pipeline Transformer Example
Step 2: Use Parameters
- Names must be the same.
- Linters do not like the function name.

Spark ML Pipeline Transformer Example
Step 2.2: Add setParams function
- Linters do not like the function name.

Spark ML Pipeline Transformer Example
Goal: Custom transformer that accepts parameters
- __init__ and _transform are the required methods.
- setParams allows us to change parameters at instantiation.
- keyword_only ensures this function always uses kwargs.
- Linters do not like the function name.

Spark ML Pipeline Example
• Creating steps and pipelines

Dagster
• Supports a wide variety of workflows / functions.
• Config can be complicated.
• The display function can be extremely helpful.
• Everything must be in the pipeline.
• Classes / typing can be interesting.

Hello World
Step 1: Create the pipeline and basic solids
- Pipeline: set of solids to be executed
- Lambda solid: no context = no logging

Hello World
Step 2: Get data from Spark
A Spark DataFrame cannot be passed in as part of the Dagster config!

Hello World
Step 3: Run the pipeline
A Spark DataFrame cannot be passed in as part of the Dagster config!

Custom Types
Goal:
- Create a class that contains parameterized functionality.

Dagit UI / Databricks
• Getting pipeline output can be confusing. Must grab the result from a specific solid.
• It is possible to orchestrate a pipeline locally.
• https://dagster.io/blog/pyspark
• Visualizing pipeline in Databricks
• Visualization function removed in 0.8.0

Comparison
• Dagster
▪ Complicated config
▪ Supports fan-in, fan-out, merging
▪ Works with any* objects
▪ Everything must be within the pipeline
▪ Use this when:
▪ Need merging / fanning patterns.
▪ Leveraging things outside of dataframes such as SQL.
• Spark ML Pipelines
▪ Complicated implementation
▪ Only linear processes
▪ Functions on Spark DataFrames
▪ Easy to move between pipeline and dataframes
▪ Use this when:
▪ Adding / removing / modifying a column in a dataframe.
Lesson learned: Do as little as possible within the workflow framework.

Hotspot Example with H3 Hashing

Merging Geospatial Data using Dagster

Join our Big Geospatial Data Meetup!
First Thursdays at 1 PM EST.
Dan Corbiani (Dan.Corbiani@pnnl.gov)
Data Scientist, Pacific Northwest National Lab
Thanks to our team:
Jenny Webster (Co-lead)
Mark Jensen
Paige Maxwell
Kate Miller
Elise Saxon
Lucas Tate
Nile Wynar

