Cloud Native Geospatial Analytics with Apache Sedona
Spatial Data Catalog: Deliver spatial data products to your customers, users,
and community through APIs or data lakehouse-friendly data formats.
Embedded AI & ML: Empower your teams with GPU-backed raster inference
on aerial imagery, run map matching on billions of vehicle trips, and more.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Cloud Native Geospatial Analytics with
Apache Sedona, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views or
the views of the authors’ current or former employers. While the publisher and the authors have used
good faith efforts to ensure that the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code samples or other technology this work
contains or describes is subject to open source licenses or the intellectual property rights of others, it is
your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Wherobots. See our statement of editorial
independence.
978-1-098-17399-9
[LSI]
Table of Contents
The Spatial DataFrame
Introduction to Spatial SQL
Working with the DataFrame API
Visualizing Data
Conclusion
Resources
Exercises
CHAPTER 1
Introduction to Apache Sedona
The open-source Apache Sedona project grew out of the need for a scalable geospatial
analytics framework capable of working with large-scale spatial data. There’s a
common saying in the data world that “spatial is special”. In other words, because of
the unique characteristics and complexities of spatial data, specialized techniques,
tooling, and knowledge are required for its effective analysis and interpretation.
While there is some validity to this perspective,
it misses the more nuanced truth that many traditional best practices, techniques,
tooling, and data formats from the data engineering and data science world are
still perfectly relevant when working with geospatial data. However, there are some
unique challenges and considerations that arise when working with spatial data.
In this chapter we will discuss some of the challenges that commonly arise when
working with geospatial data and explore an overview of the geospatial data
ecosystem, including some of the gaps in tooling that led to the need for a scalable
geospatial analytics framework like Apache Sedona.
We will also introduce how Apache Sedona addresses the challenges of working with
geospatial data at scale and take a look behind the scenes at the basic architecture and
components of Apache Sedona. By the end of this chapter you should have a clearer
understanding of the idea that “spatial is special” and be able to evaluate whether
there is truth to this common phrase.
Spatial vs Geospatial
The terms “spatial data” and “geospatial data” are often used inter‐
changeably, but they have slightly different meanings.
Spatial data refers to any data that has a spatial or geographic
component and can describe the location, shape, and relationships
of objects in space.
Geospatial data is a subset of spatial data that specifically pertains
to the Earth’s surface and features.
Because of the types of insights that can be attained and their relevance
to common business challenges, the focus of this book is specifically
on geospatial data. However, it is worth noting that many of
the techniques discussed can be applied to spatial data in general,
and Apache Sedona can work with both spatial and geospatial data.
Geospatial data analysis, or simply geospatial analysis, involves the techniques and tools used
to interpret and visualize geospatial data. By applying spatial analysis methods we
can uncover patterns, relationships, and trends that are not immediately apparent.
This process empowers us to make more informed decisions, optimize allocation of
resources, and predict future outcomes.
Geospatial analysis is crucial because it provides a spatial context to data, which can
reveal insights that traditional, non-spatial analysis cannot. Spatial data analysis helps us understand
complex spatial relationships and dynamics, leading to better decision-making in
fields ranging from environmental conservation to business intelligence. As our
world continues to become more data-driven, the ability to interpret spatial data
will be key to addressing global challenges. In this section we will explore the foun‐
dational concepts, tools, and applications of geospatial analysis used to harness the
power of location-based data.
Let’s examine some of the complexities that arise when working with geospatial data
for analysis, which can help us evaluate whether the idea that “spatial is special” is
really true.
The first complexity that often arises is in handling the two common types of data
representation used to store geospatial information: vector data and raster data.
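To make the distinction concrete, here is a minimal sketch in Spatial SQL (the file path is illustrative, and the sedona session is created as shown in the next chapter). Vector data is represented as geometry values, while raster data is represented as gridded cell values such as the pixels of a GeoTIFF:
# Vector: a geometry value constructed from well-known text (WKT).
vector_df = sedona.sql("SELECT ST_GeomFromWKT('POINT (-122.4191 37.7749)') AS geom")

# Raster: gridded cell values, here loaded from GeoTIFF files.
tiffs = sedona.read.format("binaryFile").load("data/rasters/*.tif")
tiffs.createOrReplaceTempView("tiffs")
raster_df = sedona.sql("SELECT RS_FromGeoTiff(content) AS raster FROM tiffs")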
Apache Sedona introduces data types, operations, and indexing techniques optimized
for spatial workloads on top of Apache Spark.
Let’s take a look at the workflow for analyzing spatial data with Apache Sedona.
1 Jia Yu, Zongsi Zhang, and Mohamed Sarwat. “Spatial Data Management in Apache Spark: The GeoSpark
Perspective and Beyond.” Geoinformatica Journal, 2019.
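As a preview, a minimal sketch of that workflow might look like the following (the file path, column names, and query are illustrative, not the book’s examples):
from sedona.spark import SedonaContext

# Create a Sedona-enabled Spark session.
config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Load tabular data, construct point geometries, and filter by distance.
df = sedona.read.option("header", "true").csv("data/places.csv")
df.createOrReplaceTempView("places")
nearby = sedona.sql("""
    SELECT name
    FROM places
    WHERE ST_Distance(
        ST_Point(CAST(lon AS double), CAST(lat AS double)),
        ST_Point(-122.4194, 37.7749)) < 0.1
""")
nearby.show()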
Spatial SQL
Apache Sedona’s Spatial DataFrames support Spatial SQL, an extension of SQL with
over 200 spatial-specific functions that enable manipulating and processing both
vector and raster spatial data, in addition to the functionality of Spark SQL.
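For example, a spatial join that matches city points to the neighborhood polygons that contain them can be expressed directly in SQL (the view names here are illustrative):
result = sedona.sql("""
    SELECT c.city, n.neighborhood
    FROM cities c
    JOIN neighborhoods n
      ON ST_Contains(n.geometry, c.geometry)
""")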
Many of these use cases can be described as geospatial ETL operations. ETL (extract,
transform, load) is a data integration process that involves retrieving data from
various sources, transforming and combining these datasets, then loading the trans‐
formed data into a target system or format for reporting or further analysis. Geospa‐
tial ETL shares many of the same challenges and requirements of traditional ETL
processes with the additional complexities of managing the geospatial component of
the data, as discussed earlier in the chapter: working with geospatial data sources
and formats, spatial data types and transformations, as well as the scalability and
performance demands of processing large volumes of spatial data.
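As a hedged sketch of what such a pipeline can look like in Sedona (the paths and column names are illustrative), the extract, transform, and load steps map naturally onto reading, spatial SQL, and writing:
# Extract: read raw records from a landing zone.
raw = sedona.read.option("header", "true").csv("s3://landing/trips.csv")
raw.createOrReplaceTempView("trips")

# Transform: build geometry values and filter out invalid rows.
trips = sedona.sql("""
    SELECT trip_id,
           ST_Point(CAST(pickup_lon AS double), CAST(pickup_lat AS double)) AS pickup_geom
    FROM trips
    WHERE pickup_lon IS NOT NULL AND pickup_lat IS NOT NULL
""")

# Load: write the transformed data in a spatial format for further analysis.
trips.write.format("geoparquet").save("s3://warehouse/trips_geoparquet/")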
Community Adoption
Apache Sedona has gained significant community adoption and has become a popu‐
lar geospatial analytics library within the Apache Spark ecosystem. As an Apache
Software Foundation (ASF) top-level project, Apache Sedona’s governance, licensing,
and community participation align with ASF principles.
Apache Sedona has an active and growing developer community, with contributors
from a number of different organizations and over 100 individuals interested in
geospatial analytics and distributed computing. At the time of writing, Sedona has
reached 35 million total downloads, with 1.5 million downloads per month and usage
growing 150% year over year.
Apache Sedona has been adopted by organizations in industries including transporta‐
tion, urban planning, environmental monitoring, logistics, insurance and risk analysis,
and more. Organizations leverage Apache Sedona’s capabilities to perform large-scale
geospatial analysis, extract insights from geospatial data and build geospatial analyt‐
ical applications at scale. The industry adoption of Apache Sedona showcases its
practical relevance and real-world use cases.
Apache Sedona has been featured in conferences, workshops, and research publica‐
tions related to geospatial analytics, distributed computing, and big data processing.
These presentations and publications contribute to the awareness, visibility, and
adoption both within the enterprise and within the research and academic communi‐
ties.
Resources
Throughout this book we will refer to specific documentation and resources relevant
for topics covered in each section. However, there are some important resources that
will be useful throughout your journey working with Apache Sedona:
• The documentation for Apache Sedona can be found online at: https://sedona.apache.org/latest/
• Join the community Discord server: https://sedona.apache.org/latest/community/contact/
• The community forums: https://community.wherobots.com/
• Find and contribute to Apache Sedona on GitHub: https://github.com/apache/
sedona
Conclusion
In this chapter we introduced Apache Sedona and the cloud native geospatial data
ecosystem. We discussed how Apache Sedona evolved out of the need for a scalable,
geospatial-focused analytics framework and how Apache Sedona is architected to take
advantage of distributed computation for processing large-scale geospatial data. We
also reviewed the architecture of Apache Sedona and discussed the APIs for working
with data in Apache Sedona.
Now that we have an understanding of what Apache Sedona is and what it is used for,
in the next chapter we’re ready to get hands on with Apache Sedona as we get started
using Apache Sedona and Spatial SQL.
In this chapter we will cover:
• How to get started with Apache Sedona, using Docker and Wherobots Cloud
• How to use the Spatial DataFrame data structure to work with data in Apache
Sedona
• How to use Spatial SQL to query and manipulate geospatial data
• How to visualize spatial data using SedonaKepler
Throughout the book we will primarily make use of Sedona in one of two ways:
using the Apache Sedona Docker image or using Wherobots Cloud. We’ll explore
getting started with both options in this chapter. Regardless of which option we use,
the developer experience of using Apache Sedona is similar and will center on a
Jupyter Notebook environment and Spatial SQL queries. Later in the book, in Chapter
9, we will explore how to work with Sedona in cloud environments such as AWS Glue,
Databricks, and Microsoft Fabric.
The Apache Sedona Docker Image
Running Apache Sedona using the official Docker image is the first option we
will explore. If you’re not familiar with Docker, it’s an open-source tool that
automates the deployment, scaling, and management of applications by packaging
them and their dependencies into lightweight, portable “containers” that can run
consistently across operating systems and environments.
Docker images are read-only templates used to create containers. They contain the
application and its dependencies and are built from a set of instructions written in a
Dockerfile.
You can learn more about Docker, including installation instructions, at https://docker.com
The maintainers of Apache Sedona publish an official Apache Sedona Docker image
which bundles Apache Sedona, Apache Spark, Jupyter, Python, and other dependen‐
cies of Sedona.
The benefits of using the official Apache Sedona Docker image include:
Ease of Setup
The image comes pre-configured with all necessary dependencies, reducing the
complexity of setting up a Sedona environment manually.
Consistency
The Docker image provides the same environment everywhere it runs, minimizing
the risk of configuration issues across different machines.
Isolation
Docker containers provide isolation from the host system and other containers,
ensuring Sedona runs in a clean environment without conflicts.
Portability
The Docker image can run on any system that supports Docker, including local
machines, on-premises servers, and cloud platforms, facilitating easy deploy‐
ment.
The official Apache Sedona Docker image is a convenient way to get started with
Apache Sedona and is suitable for development and testing on a single machine,
but it is not designed to take advantage of the scalability benefits of running
spatial operations across a distributed cluster of machines. To leverage the distributed
benefits of Sedona we will take advantage of cloud services such as Wherobots Cloud,
AWS EMR, or Microsoft Fabric.
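First, we pull the image from Docker Hub (the version tag shown here matches the run command that follows; newer tags may be available):
docker pull apache/sedona:1.6.0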
Next, to run the Docker image we use the docker run command, specifying configu‐
ration options for binding ports from the local machine to the container as well as
memory allocation.
docker run -e DRIVER_MEM=6g -e EXECUTOR_MEM=8g \
-p 8888:8888 -p 8080:8080 -p 8081:8081 \
-p 4040:4040 apache/sedona:1.6.0
Let’s break down this command to see what each piece is doing.
docker run
This is the Docker command used to create and start a new container from a
specified image.
-e DRIVER_MEM=6g
This flag sets an environment variable inside the container which specifies the
amount of memory allocated to the driver process.
-e EXECUTOR_MEM=8g
This flag indicates the executor process should use 8 gigabytes of memory.
-p 8888:8888
This flag maps port 8888 on the host machine to port 8888 on the container. This
is used to expose the Jupyter notebook environment running in the container to
the host machine.
-p 8080:8080 and -p 8081:8081
These flags map the web UI ports for the Spark master and worker processes
running inside the container, following Spark’s default port conventions.
-p 4040:4040
This flag maps port 4040 on the host to port 4040 on the container, which
exposes the Spark UI, a web-based interface for monitoring and managing Spark
applications.
apache/sedona:1.6.0
This is the image name and tag.
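Once the container is running, the Jupyter notebook environment is available in a
browser at http://localhost:8888 and the Spark UI at http://localhost:4040.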
The Sedona Tiny instance will be sufficient for running the examples in this book.
Scaling up to a larger runtime with more cluster resources requires upgrading to the
Professional Tier in Wherobots Cloud.
Figure 2-2. The Wherobots Cloud console after starting a notebook runtime
from sedona.spark import SedonaContext

config = SedonaContext.builder().master("spark://localhost:7077").getOrCreate()
sedona = SedonaContext.create(config)
This will initialize the Sedona cluster. Other common configuration options at this
step include configuring access to cloud object storage such as AWS S3. We’ll see how
this works in the next chapter when we cover working with files.
In the previous chapter we introduced the concept of the Spatial DataFrame. Let’s
take a deeper look at this important data structure.
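The examples in this section use a small DataFrame of city coordinates. As a minimal sketch (the book’s exact construction code may differ), such a DataFrame can be created directly from a list of tuples:
cities_df = sedona.createDataFrame(
    [
        ("San Francisco", -122.4191, 37.7749),
        ("New York", -74.006, 40.7128),
        ("Austin", -97.7431, 30.2672),
    ],
    ["city", "longitude", "latitude"],
)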
cities_df.show()
+-------------+---------+--------+
| city|longitude|latitude|
+-------------+---------+--------+
|San Francisco|-122.4191| 37.7749|
| New York| -74.006| 40.7128|
| Austin| -97.7431| 30.2672|
+-------------+---------+--------+
We can view the schema of this DataFrame with the printSchema() method.
cities_df.printSchema()
root
|-- city: string (nullable = true)
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
Note that the longitude and latitude columns are doubles, not a geometry or “point”
type. To take advantage of Sedona’s functionality for working with spatial data we’ll
need to convert these coordinate values into a geometry type, which we can do with
spatial SQL functions.
Spatial SQL functions can be grouped as belonging to one of the following four
categories:
• Constructors, which create geometry values from raw coordinates, WKT text, or
other formats (for example, ST_Point)
• Functions, which transform, measure, or otherwise operate on geometries (for
example, ST_Buffer)
• Predicates, which test spatial relationships between geometries and return true
or false (for example, ST_Contains)
• Aggregate functions, which combine geometries from many rows into a single
result
1 The official specification for spatial SQL is known as ISO/IEC 13249-3 SQL/MM Part 3: Spatial and was
originally derived from the Open Geospatial Consortium Simple Features Specification for SQL.
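To query the DataFrame with SQL, we first register it as a temporary view:
cities_df.createOrReplaceTempView("cities")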
This will allow us to reference the cities_df as a view in our SQL queries. To run
a spatial SQL query we use the sedona.sql method, passing in our query. A new
Spatial DataFrame will be returned.
As noted previously, so far our cities_df DataFrame is using double types to
represent longitude and latitude. To take advantage of the spatial functionality of this
data we want the location information of each row to be represented as a geometry
type. Our first spatial SQL statement will use the ST_Point function to create point
geometries from the latitude and longitude representations.
cities_df = sedona.sql("""
SELECT *, ST_Point(longitude, latitude) AS geometry
FROM cities
""")
cities_df.show(truncate=False)
+-------------+---------+--------+-------------------------+
|city |longitude|latitude|geometry |
+-------------+---------+--------+-------------------------+
|San Francisco|-122.4191|37.7749 |POINT (-122.4191 37.7749)|
|New York |-74.006 |40.7128 |POINT (-74.006 40.7128) |
|Austin |-97.7431 |30.2672 |POINT (-97.7431 30.2672) |
+-------------+---------+--------+-------------------------+
So far we’ve used the ST_Point constructor spatial SQL function to create a point
geometry type column. Let’s explore other ways to use spatial SQL to manipulate and
create new geometries. We’ve seen the point geometry type, but we can also work
with more complex geometries like polygons.
The ST_Buffer function returns a polygon covering all points within a given distance
of the input geometry, creating a buffer around it. Using the city point geometries
as inputs, we’ll use the ST_Buffer function to create a buffer around each point with
a radius of 1 km (1,000 meters); the third argument (true) tells Sedona to interpret
the distance in meters using spheroid-based calculations.
First, because we modified the cities_df DataFrame we’ll need to replace the tempo‐
rary view cities that we defined earlier.
cities_df.createOrReplaceTempView("cities")
Now we create a new DataFrame that will contain the name of each city and a
polygon geometry that represents the buffer.
buffer_df = sedona.sql("""
SELECT city, ST_Buffer(geometry, 1000, true) AS geometry
FROM cities
""")
buffer_df.show()
+-------------+--------------------+
| city| geometry|
+-------------+--------------------+
|San Francisco|POLYGON ((-122.40...|
| New York|POLYGON ((-73.994...|
| Austin|POLYGON ((-97.732...|
+-------------+--------------------+
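Sedona also exposes its spatial functions through the DataFrame API. As a sketch (the import path below follows recent Sedona releases), we can combine the three city points into a single route linestring using ST_MakeLine together with Spark’s collect_list aggregate:
from pyspark.sql.functions import col, collect_list
from sedona.sql.st_functions import ST_MakeLine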
route_df = cities_df.select(ST_MakeLine(collect_list(col("geometry"))).alias("geometry"))
route_df.show(truncate=False)
+-----------------------------------------------------------------+
|geometry |
+-----------------------------------------------------------------+
|LINESTRING (-122.4191 37.7749, -74.006 40.7128, -97.7431 30.2672)|
+-----------------------------------------------------------------+
As the above example demonstrates, we can accomplish the same spatial operations
with both spatial SQL and the DataFrame API. While choosing which form to use
can sometimes be a matter of personal preference, there are advantages and
disadvantages to each approach.
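Finally, results like these can be visualized with SedonaKepler. A brief sketch, following the import path in the Sedona documentation:
from sedona.maps.SedonaKepler import SedonaKepler

# Render the buffer polygons on an interactive Kepler.gl map in a notebook.
map_viz = SedonaKepler.create_map(df=buffer_df, name="city_buffers")
map_viz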
Conclusion
In this chapter we took a deeper hands-on look at two important pieces of Apache
Sedona: spatial SQL and the Spatial DataFrame. Spatial SQL allows us to create,
manipulate, and analyze spatial data using Sedona’s Spatial DataFrame. The Spatial
DataFrame is a distributed data structure that supports spatial data types and scales
to massive datasets. We also saw how to get started with Sedona using both the
Apache Sedona Docker image and Wherobots Cloud.
So far we’ve limited our usage of Sedona to small, manually created examples, but in
real-world analysis we typically encounter data in many different formats
and sources. In the next chapter we’ll see how to use Sedona to load, manipulate, and
analyze spatial data in many different formats including CSV, GeoJSON, Shapefile,
and Parquet. We’ll also learn about the benefits of cloud-native geospatial file formats
like GeoParquet and see how to use Sedona to create and query spatial datasets using
GeoParquet.
Resources
• Spatial SQL function documentation: https://sedona.apache.org/latest/api/sql/Overview/
• SedonaKepler documentation: https://sedona.apache.org/latest/api/sql/Visualization_SedonaKepler/
• Sedona Docker documentation: https://hub.docker.com/r/apache/sedona
• Wherobots Cloud registration: https://cloud.wherobots.com/
• Wherobots Cloud documentation: https://docs.wherobots.com/latest/
Exercises
1. Create a free Wherobots Cloud account. After signing in, create a tiny Sedona
runtime and open the notebook environment. Run all cells in the example “First
Wherobots Cloud Notebook”.
2. Read the documentation for spatial SQL functions supported in Apache Sedona.
Choose a function to manipulate the geometry column of the cities_df Data‐
Frame. What is the input to the function? What is the output?
3. Using SedonaKepler, visualize the results of Exercise 2. Can you visualize this
data along with the original cities_df DataFrame?