
The Spatial Intelligence Cloud

Developed by the original creators of


Apache Sedona, Wherobots enables data
engineering teams to create spatial data
products up to 60x faster at a fraction
of the cost of existing solutions.

No overhead, serverless, fully managed.
Instant global geospatial ETL, analytics, and AI.

Spatial Intelligence Cloud Benefits for Data Teams:

Serverless: Compatible with Apache Sedona APIs; run geospatial joins across
global-scale point, polygon, trip, and raster data, and more, without worrying
about infrastructure overhead.

Rapid Design: Discover geospatial data relationships at planetary scale and speed.

SQL and Python ready: 180+ ST functions, 90+ raster functions.

Spatial Data Catalog: Deliver spatial data products to your customers, users,
and community through APIs or data lakehouse-friendly data formats.

Embedded AI & ML: Empower your teams with GPU backed raster inference
on aerial imagery, run map matching on billions of vehicle trips, and more.

Get started for free on Wherobots Cloud at www.wherobots.com, or contact us at [email protected]

*Wherobots is 100% committed to being a carbon-neutral company, enabling every
organization to analyze our planet without impacting it.
Cloud Native Geospatial Analytics
With Apache Sedona
A Hands-On Guide For Working With Large-
Scale Spatial Data

With Early Release ebooks, you get books in their earliest


form—the author’s raw and unedited content as they write—
so you can take advantage of these technologies long before
the official release of these titles.

William Lyon, Jia Yu, and Mo Sarwat

Beijing Boston Farnham Sebastopol Tokyo


Cloud Native Geospatial Analytics with Apache Sedona
by William Lyon, Jia Yu, and Mo Sarwat
Copyright © 2025 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional
sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Aaron Black
Development Editor: Gary O’Brien
Production Editor: Clare Laylock
Interior Designer: David Futato
Cover Designer: Karen Montgomery

June 2025: First Edition

Revision History for the Early Release


2024-08-21: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098173999 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Cloud Native Geospatial Analytics with
Apache Sedona, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views or
the views of the authors’ current or former employers. While the publisher and the authors have used
good faith efforts to ensure that the information and instructions contained in this work are accurate, the
publisher and the authors disclaim all responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work. Use of the information and
instructions contained in this work is at your own risk. If any code samples or other technology this work
contains or describes is subject to open source licenses or the intellectual property rights of others, it is
your responsibility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Wherobots. See our statement of editorial
independence.

978-1-098-17399-9
[LSI]
Table of Contents

Brief Table of Contents (Not Yet Final). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1. Introduction to Apache Sedona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


Introduction To Cloud Native Geospatial Analysis And Its Challenges 2
The Geospatial Analytics Ecosystem 4
Leveraging Cloud Native Architecture 6
Apache Sedona Overview 7
Spatial Query Processing 7
A Brief Overview Of Apache Spark 8
Understanding Apache Sedona’s Architecture & Components 9
Apache Sedona Data Structures 9
Spatial SQL 10
Spatial Query Optimizations 10
Support For Spatial File Formats 10
Visualization 11
Integration With PyData Ecosystem 11
Benefits of Apache Sedona 12
The Developer Experience 13
Who Uses Apache Sedona 13
Common Apache Sedona Use Cases 14
Community Adoption 15
Resources 15
Conclusion 16

2. Getting Started with Apache Sedona. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


The Apache Sedona Docker Image 18
Using Wherobots Cloud 20
Overview Of The Notebook Environment 22

The Spatial DataFrame 23
Introduction To Spatial SQL 24
Working With The DataFrame API 27
Visualizing Data 28
Conclusion 29
Resources 29
Exercises 30

Brief Table of Contents (Not Yet Final)

Chapter 1: Introduction to Apache Sedona (available)


Chapter 2: Getting Started with Apache Sedona (available)
Chapter 3: Working with Geospatial Data At Scale (unavailable)
Chapter 4: Points, Lines, and Polygons: Vector Data Analysis Spatial SQL (unavailable)
Chapter 5: Raster Data Analysis (unavailable)
Chapter 6: Apache Sedona and The PyData Ecosystem (unavailable)
Chapter 7: Geospatial Data Science and Machine Learning (unavailable)
Chapter 8: Building a Geospatial Data Lakehouse with GeoParquet and Apache Iceberg
(unavailable)
Chapter 9: Using Apache Sedona with Cloud Data Providers (unavailable)

CHAPTER 1
Introduction to Apache Sedona

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 1st chapter of the final book.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at [email protected].

The open-source Apache Sedona project grew out of the need for a scalable geospatial
analytics framework capable of working with large-scale spatial data. There’s a com‐
mon saying in the data world that “spatial is special”. In other words, working with
spatial data implies that due to the unique characteristics and complexities of spatial
data, specialized techniques, tooling, and knowledge is required for effective analysis
and interpretation of spatial data. While there is some validity to this perspective,
it misses the more nuanced truth that many traditional best practices, techniques,
tooling, and data formats from the data engineering and data science world are
still perfectly relevant when working with geospatial data. However, there are some
unique challenges and considerations that arise when working with spatial data.
In this chapter we will discuss some of the challenges that commonly arise when
working with geospatial data and explore an overview of the geospatial data ecosys‐
tem including some of the gaps in tooling that led to the need for a scalable geospatial
analytics framework like Apache Sedona.

We will also introduce how Apache Sedona addresses the challenges of working with
geospatial data at scale and take a look behind the scenes at the basic architecture and
components of Apache Sedona. At the end of this chapter we should have a clearer
understanding of the idea that “spatial is special” and be able to evaluate whether
there is truth to this common phrase.

Introduction To Cloud Native Geospatial Analysis And Its


Challenges
In our increasingly interconnected world geospatial data and analysis have become
essential tools for understanding the complexities of our environment, societies,
and economies. Geospatial data shapes our decision-making and problem-solving
processes.
Geospatial data refers to information that is associated with specific locations on
the Earth’s surface and can be represented in forms such as points, lines, polygons,
and rasters which capture features such as roads, rivers, buildings, and terrain. The
richness of geospatial data lies in its ability to provide both location information and
additional attributes which enable a multidimensional view of our world.
Geospatial data can come from a variety of sources, such as:
Satellite imagery
Satellite imagery provides detailed views of the Earth’s surface which is useful for
monitoring environmental changes and urban development.
GPS data
Captured from devices and sensors this type of telemetry data offers precise
location tracking for navigation and logistics.
Census and survey data
Demographic and socio-economic information tied to specific locations
describes a large amount of geospatial data often managed by governments.
Aerial photography
Captured from aircraft, aerial imagery provides high-resolution images for map‐
ping and analysis.
Remote sensing
Utilizing sensors to detect and measure physical characteristics of an area from a
distance, often from satellites, remote sensing is a large and data-rich technique.
Crowd-sourced datasets
Crowd-sourced datasets such as OpenStreetMap which contains a global scale
collection of points of interest, road network, land cover, and administrative
boundary data provides a rich input for geospatial analysis.



These are just some of the sources of geospatial data which are commonly found in
geospatial analysis projects. It is also useful to note that many enterprises generate
massive amounts of data which have a geospatial component through normal busi‐
ness operations, such as retail transactions, inventory management, and customer
interactions. It is also common to find data derived from the above sources such
as those enriched by a commercial dataset provider or as the result of a machine
learning process, in geospatial analytic workflows.

Spatial vs Geospatial
The terms “spatial data” and “geospatial data” are often used inter‐
changeably, but they have slightly different meanings.
Spatial data refers to any data that has a spatial or geographic
component and can describe the location, shape, and relationships
of objects in space.
Geospatial data is a subset of spatial data that specifically pertains
to the Earth’s surface and features.
Because of the types of insights that can be attained and their relevance
to common business challenges, the focus of this book is specifi‐
cally on geospatial data, however it is worth noting that many of
the techniques discussed can be applied to spatial data in general
and Apache Sedona can work with both spatial and geospatial data.

Geospatial data analysis or geospatial analysis involves the techniques and tools used
to interpret and visualize geospatial data. By applying spatial analysis methods we
can uncover patterns, relationships, and trends that are not immediately apparent.
This process empowers us to make more informed decisions, optimize allocation of
resources, and predict future outcomes.
Geospatial analysis is crucial because it provides a spatial context to data which can
reveal insights that traditional data cannot. Spatial data analysis helps us understand
complex spatial relationships and dynamics, leading to better decision-making in
fields ranging from environmental conservation to business intelligence. As our
world continues to become more data-driven, the ability to interpret spatial data
will be key to addressing global challenges. In this section we will explore the foun‐
dational concepts, tools, and applications of geospatial analysis used to harness the
power of location-based data.
Let’s examine some of the complexities that arise when working with geospatial data
for analysis that can help us evaluate if this idea that “spatial is special” is really true or
not.
The first complexity that often arises is in handling the two common types of data
representation used to store geospatial information: vector data and raster data.



Vector data represents spatial features as discrete geometry objects (such as points,
lines, and polygons) which are defined by their coordinates and can also include
topological data such as connectivity between lines or adjacency between polygons.
Vector data typically also includes attributes associated with the geometries, such as
a unique identifier, name of the feature, or other relevant data. Working with vector
data is discussed in more detail in Chapter 4.
Raster data represents spatial information as a grid of cells (or pixels), where each
pixel represents a specific geographic location and has an associated value (or values)
known as band(s). The raster data representation is typically used to describe contin‐
uous data such as aerial imagery, elevation, or temperature. Raster data analysis will
be covered in depth in Chapter 5.
Coordinate systems are another complexity of working with geospatial data. Coordi‐
nate systems use a model of the Earth’s surface to map coordinates to a point on the
Earth’s surface. Geospatial data can be represented using various coordinate systems
(including geographic coordinate systems and projected coordinate systems), which
may use different units of measure. Projected coordinate systems introduce distortion
and often require assessing tradeoffs when choosing which projected coordinate
system is appropriate for the intended analysis.
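
To make the difference in units concrete, here is a minimal sketch (independent of Apache Sedona) that uses the pyproj library to reproject a longitude/latitude pair from the geographic WGS 84 system (EPSG:4326, measured in degrees) to the projected Web Mercator system (EPSG:3857, measured in meters); the coordinates are illustrative.

from pyproj import Transformer

# Build a transformer from geographic WGS 84 (degrees) to Web Mercator (meters).
# always_xy=True keeps coordinates in (longitude, latitude) / (x, y) order.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)

lon, lat = -122.4191, 37.7749  # a point in San Francisco, in degrees
x, y = transformer.transform(lon, lat)
print(x, y)  # roughly -1.36e7, 4.55e6: the same point expressed in meters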
Geospatial indexes are used to improve the efficiency of spatial queries and opera‐
tions on geospatial data by organizing the storage and retrieval of spatial objects
based on their spatial properties and relationships. Because spatial data is multi-
dimensional, indexing structures used for geospatial data such as R-trees, Quad
trees, and Geohashes are specialized for spatial data and differ from index structures
commonly used for non-spatial data such as B-trees or hash indexes.
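
As a small, single-machine illustration of the idea (this is plain Shapely, not Apache Sedona, and assumes Shapely 2.x), the following sketch builds an STRtree, a packed R-tree, over a handful of points and retrieves only the candidates that fall inside a query rectangle rather than comparing the rectangle against every point.

from shapely import STRtree, Point, box

# A handful of point geometries standing in for a much larger dataset.
points = [Point(x, y) for x, y in [(0, 0), (1, 2), (5, 5), (9, 9), (2, 1)]]

# Build a packed R-tree over the points.
tree = STRtree(points)

# Query with a rectangle; only points whose bounding boxes intersect it are returned.
window = box(0, 0, 3, 3)
hits = [points[i] for i in tree.query(window)]
print(hits)  # the points near the lower-left corner, not the full list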
Spatial partitioning, especially in a distributed system, adds complexity and introduces
challenges when we factor in “hot spots”: observations are likely not evenly
distributed by geographic area. For example, are we likely to find more taxi pickups
in Manhattan or Montana?
Now that we have a sense of some of the complexities of analyzing geospatial data,
let’s take a look at the geospatial data analytics ecosystem to see how tools and
libraries have evolved to address these challenges.

The Geospatial Analytics Ecosystem


When evaluating geospatial analytics tooling it is important to keep in mind the
requirements of your project. What type of data will you be working with? Will you
be working with vector data, raster data, or both? Are you building an interactive
application with transactional data access requirements or will you be defining data
transformation pipelines and analytics workflows? Will you be combining data from
multiple sources? What scale of data will you be working with? How often is the
data changing? What type of spatial querying functionality does your project require?
Understanding these requirements will inform how you evaluate, prioritize, and
select the tools used to implement your project.
A typical geospatial analytics use case will involve combining multiple datasets or
enriching an existing dataset with additional values or aggregates based on spatial
relationships. Some component of vector data is typically involved and often both
vector and raster data is involved. The scale and velocity of the data (how often the
data changes) will depend on the specific use case but will often involve national or
planetary scale data and therefore will be large-scale, especially if we are working with
high resolution raster data.
Based on this description of the typical geospatial analytics project we can identify
some important characteristics to look for when evaluating geospatial data tooling.
Efficiency is important
Because of the scale of the data we will be working with it is important to con‐
sider how efficiently we can process geospatial data. This is a relevant considera‐
tion both in terms of the cost of computing resources but also when considering
the productivity of our engineering and data teams. Simply put, if an analytics
task takes 48 hours to run our team’s product velocity will be slower than if the
same task runs in 10 minutes.
Minimize data transfer
Related to the efficient use of computing resources is the efficient use of data
transfer. Because of the costs incurred when moving large amounts of data and
transforming the data into a format preferred by a different system, we should
prefer tools and data formats that don’t require this transformation step and
can work with data stored in a cloud object storage service such as AWS S3.
This concept is often called “bringing compute to the data instead of data to the
compute”.
Optimize for analytics over transactional workloads
Transactional systems are designed to handle day-to-day operational data with a
focus on data integrity and speed of online transactions such as inserting, updat‐
ing, and deleting data. Analytical systems are instead optimized for handling
large volumes of data and complex queries for data analysis.
Support for complex geospatial operations
Since our data processing tasks will often involve combining datasets based on
spatial relationships our tooling must be able to define and efficiently determine
these spatial relationships. These relationships can be based on distance but also
more complex spatial relationships such as intersecting, touching, or fully con‐
tained geometries and will often involve complex geometries such as polygons.



The geospatial analytics ecosystem has seen significant evolution in previous decades.
Specialized Geographic Information Systems (GIS) software were perhaps the first
tools to enable geospatial analysis. These tools were largely desktop-driven GUI
applications used by analysts and geospatial domain experts. These tools typically
run on a single machine and the scale of the data being analyzed is limited by the
resources of the machine on which they are running.
The open-source ecosystem has played an important role in advancing the ability
to work with geospatial data. Tools such as PostGIS, GDAL, QGIS, and GeoServer
have addressed the need for storage, retrieval, processing, sharing, and visualizing
geospatial data. The Python data ecosystem is especially rich in the geospatial domain
with libraries such as GeoPandas, Rasterio, and Shapely that enable working with
geospatial data in data structures idiomatic to Python. Similar to the specialized
GIS software described above, much of this tooling runs on a single machine with
scalability typically limited by the resources of that machine, although there are
notable exceptions for specialized use cases such as scientific computing.
The tooling that makes up this GIS-specific ecosystem enables very sophisticated
analysis and has been hardened over decades of use by the community. However, as
data volumes have soared in recent years, especially in the geospatial domain, the
need for geospatial data processing at massive scale has arisen as a challenge for many
of these tools due to their single machine design and limitations.
For example, one commonly used geospatial tool is the PostGIS extension to the
open-source PostgreSQL database. PostGIS is an excellent tool for geospatial data
storage and spatial queries and analysis. However, since PostGIS is based on the
PostgreSQL RDBMS, a transactional database, challenges can arise when using PostGIS
for large-scale analytic workloads. Scaling PostGIS horizontally for high-volume
analytics workloads can be challenging when compared to distributed systems specif‐
ically designed for analytics. PostGIS stores data in a row-based format, which may
not be as efficient for analytic workloads that benefit from columnar storage. Its
reliance on row-based storage and transactional design also makes it difficult for
PostGIS to leverage cloud object storage systems like AWS’s S3 service.

Leveraging Cloud Native Architecture


In more recent years the data analytics ecosystem has largely moved to the cloud,
taking advantage of the major benefits of cloud native architecture such as distributed
compute, elastic scalability, and cloud object storage.
While many cloud-based analytics services such as data warehouses offer scalability
and the ability to efficiently work with massive amounts of data, these tools often are
not optimized for working with geospatial data resulting in poor performance when
their techniques are applied to spatial data or users are faced with missing features
and functionality required for working with geospatial data at scale.



The previous section can be summarized as follows: on one hand, there are
geospatial tools and libraries that treat geospatial data as a first class citizen and
have extensive geospatial support. However, these tools often have challenges with
scalability and may not be able to take advantage of modern cloud native practices
like distributed compute, elastic scaling, and cloud native storage.
On the other hand, tools like cloud data warehouses that are best in class at
leveraging cloud native practices that enable scaling to work with massive datasets
often do not treat geospatial data as a first class citizen and often lack functionality or
optimizations for working with geospatial data.
This clearly points out a gap in the ecosystem for tools that are capable of handling
the complexities of geospatial workloads with the scale of cloud native analytics
tooling like cloud data warehouses, which was exactly the motivation for the creation
of the Apache Sedona project.

Apache Sedona Overview


Apache Sedona is a cluster computing system for processing large-scale spatial
data. It treats spatial data as a first class citizen by extending the functionality of
distributed compute frameworks like Apache Spark, Apache Flink, and Snowflake.
Apache Sedona was created at Arizona State University under the name GeoSpark.1

Apache Sedona And Distributed Compute Frameworks


While Apache Sedona can work with several distributed compute
frameworks the focus of this book will be using Apache Sedona
with Apache Spark. See Chapter 9 for examples of using Apache
Sedona with other cloud services.

Apache Sedona introduces data types, operations, and indexing techniques optimized
for spatial workloads on top of Apache Spark.
Let’s take a look at the workflow for analyzing spatial data with Apache Sedona.

Spatial Query Processing


The first step in spatial query processing is to ingest geospatial data into Apache
Sedona. Data can be loaded from various sources such as files (Shapefiles, GeoJSON,
Parquet, GeoTiff, etc) or databases into Apache Sedona’s spatial data structures (typi‐
cally the Spatial DataFrame).

1 “Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond”. Jia Yu, Zongsi Zhang,
Mohamed Sarwat. Geoinformatica Journal 2019.



Next, Apache Sedona makes use of spatial indexing techniques to accelerate query
processing, such as R-trees or Quad trees. The spatial index is used to partition the
data into smaller, manageable units, enabling efficient data retrieval during query
processing.
Once the data is loaded and indexed, spatial queries can be executed using Apache
Sedona’s query execution engine. Sedona supports a wide range of spatial operations,
such as spatial joins, distance calculations, and spatial aggregations.
Apache Sedona optimizes spatial queries to improve performance. The query opti‐
mizer determines an efficient query plan by considering the spatial predicates, avail‐
able indexes, and the distribution of data across the cluster.
Spatial queries are executed in a distributed manner using Apache Spark’s computa‐
tional capabilities. The query execution engine distributes the query workload across
the cluster, with each node processing a portion of the data. Intermediate results
are combined to produce the final result set. Since spatial objects are very complex
with many coordinates, Apache Sedona implements a custom serializer for efficiently
moving spatial data throughout the cluster.
After query execution, the results are aggregated and presented to the user, which can
be further processed or visualized using Sedona’s integration with other geospatial
tools, libraries, and visualization frameworks such as Kepler.gl.
Apache Sedona leverages various optimization techniques to improve query perfor‐
mance. These optimizations include predicate pushdown which pushes down spatial
predicates to the index or file storage layer, reducing the amount of data to be pro‐
cessed. Sedona also supports indexing strategies like indexing on multiple attributes
or using advanced indexing structures for specific query patterns.
Apache Sedona enables efficient spatial query processing through spatial indexing,
distributed query execution, query optimization, and spatial partitioning. By lever‐
aging these techniques, Sedona provides scalable and high-performance geospatial
data processing capabilities, allowing users to perform complex spatial queries and
analysis on large-scale datasets.
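
Put together, a typical workflow might look like the following sketch. It assumes a running SedonaContext named sedona and two hypothetical GeoParquet datasets of pickup points and neighborhood polygons; the paths and column names are illustrative rather than taken from a real dataset.

# Load two spatial datasets into Spatial DataFrames (the paths are hypothetical).
pickups = sedona.read.format("geoparquet").load("s3://example-bucket/pickups/")
neighborhoods = sedona.read.format("geoparquet").load("s3://example-bucket/neighborhoods/")

pickups.createOrReplaceTempView("pickups")
neighborhoods.createOrReplaceTempView("neighborhoods")

# Spatial join: count the pickups that fall within each neighborhood polygon.
# Sedona plans this as a distributed spatial join using its partitioning and indexes.
counts = sedona.sql("""
    SELECT n.name, COUNT(*) AS pickup_count
    FROM neighborhoods n
    JOIN pickups p ON ST_Contains(n.geometry, p.geometry)
    GROUP BY n.name
""")
counts.show()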
We’ve mentioned several times now that Apache Sedona leverages Apache Spark as
a distributed compute layer. Understanding a bit about Spark can be helpful when
working with Apache Sedona.

A Brief Overview Of Apache Spark


Apache Spark is an open-source distributed computing system for big data process‐
ing, analytics, and machine learning.
The fundamental data structure when working with Apache Spark is the Resilient
Distributed Dataset (RDD). An RDD represents an immutable distributed collection
of objects that can be processed in parallel across a cluster of machines. RDDs are
lazily evaluated which means that transformations on RDDs are not immediately
executed, rather the actual computation is triggered when an action is called such as
collecting data or writing results to file storage. RDDs can handle both structured and
unstructured data.
A higher level data structure used with Apache Spark is Spark’s DataFrame. Similar to
RDDs, DataFrames are immutable and distributed across a cluster of machines, how‐
ever an important distinction is that DataFrames are structured and schema-based.
By organizing data into named columns, similar to a table in a relational database or
spreadsheet, DataFrames offer a familiar, structured way for developers to work with
data in Spark. DataFrames can be manipulated and queried using Spark SQL, as well
as via the imperative DataFrame API using Java, Scala, Python, and R.
Spark SQL supports a wide range of SQL operations, including aggregations, filtering,
joins, and window functions. By offering a familiar SQL interface developers can
leverage SQL’s declarative querying capabilities with the power and scalability of
Spark’s distributed computing engine. For further flexibility users can seamlessly
switch between SQL queries and the imperative DataFrame API.
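
As a quick illustration of this flexibility, the sketch below (plain PySpark, no spatial functions yet) expresses the same filter twice, once as a SQL query and once with the DataFrame API; the SparkSession named spark and the small inline dataset are assumptions made for the example.

# Assumes an existing SparkSession named spark.
df = spark.createDataFrame(
    [("Austin", 979000), ("New York", 8468000), ("Boise", 236000)],
    ["city", "population"])

# Declarative: register a temporary view and query it with SQL.
df.createOrReplaceTempView("cities")
spark.sql("SELECT city FROM cities WHERE population > 500000").show()

# Imperative: the equivalent expressed with the DataFrame API.
df.filter(df.population > 500000).select("city").show()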
Like RDDs, DataFrames are lazily evaluated using Spark’s directed acyclic graph
(DAG) scheduler. The DAG scheduler is responsible for transforming the high-level
operations defined on RDDs and DataFrames into an optimized execution plan that
can be efficiently executed in a distributed cluster.
Spark provides several mechanisms to extend its functionality with custom data
types and optimizations. These mechanisms allow users to work with specialized
data formats, implement custom libraries, and optimize Spark’s execution for specific
use cases. Apache Sedona takes advantage of these extension mechanisms to add
spatial data types, spatial indexes, spatial query operations, and related performance
optimizations.

Understanding Apache Sedona’s Architecture &


Components
Now that we understand some fundamental components of Apache Spark, let’s see
how Apache Sedona extends them to enable spatial functionality while leveraging
Spark’s distributed compute framework.

Apache Sedona Data Structures


Apache Sedona extends Spark’s RDD and DataFrame data structures with Spatial
RDDs and Spatial DataFrames by adding support for spatial data types (Point,
Polygon, LineString, etc), spatial operations, and spatial indexing. We will work
extensively with these data structures throughout the book, with a focus on Spatial
DataFrames, beginning in Chapter 2.

Spatial SQL
Apache Sedona’s Spatial DataFrames support Spatial SQL, an extension of SQL with
over 200 spatial specific functions that enable manipulating and processing both
vector and raster spatial data, in addition to the functionality of Spark SQL.

Spatial Query Optimizations


Apache Sedona implements optimizations for distributed spatial queries for large-
scale spatial data. These queries interface with Spark’s query optimizer and ensure
the use of Sedona’s custom spatial indexes and data types to enable performance and
scalability.
Some of the types of spatial queries supported by Apache Sedona include:
Spatial Range Query
Also known as a spatial window query or a bounding box query, the spatial range
query retrieves all spatial objects within a specified region.
Spatial Range Join Query
Similar to a spatial range query, the spatial range join query combines two
datasets based on their spatial relationships within a specified range by retrieving
pairs of spatial objects from two datasets that overlap within the specified region.
The spatial relationship can vary depending on the use case but can include
intersection, containment, or overlap of the geometries.
Spatial Distance Join Query
This query combines two datasets based on their spatial proximity. It identifies
pairs of geometries from different datasets that are within a specified distance of
each other. The result of a spatial distance join query is a set of matching pairs of
geometries that satisfy the distance condition.
Spatial kNN
The spatial K-Nearest Neighbors (kNN) query finds the k nearest neighbors to a
given point or geometry in a spatial dataset, where k is a user defined parameter.
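
To make these query types concrete, here is a sketch of a spatial range query and a spatial distance join expressed in Spatial SQL. It assumes two registered views, places (points) and zones (polygons), each with a geometry column named geom; the coordinates and distance threshold are purely illustrative.

# Spatial range query: all places inside a rectangular window defined in WKT.
in_window = sedona.sql("""
    SELECT *
    FROM places
    WHERE ST_Contains(
        ST_GeomFromWKT('POLYGON((-74.05 40.68, -73.90 40.68, -73.90 40.80, -74.05 40.80, -74.05 40.68))'),
        places.geom)
""")

# Spatial distance join: pairs of places and zones within a distance threshold.
near_pairs = sedona.sql("""
    SELECT p.*, z.*
    FROM places p JOIN zones z
    ON ST_Distance(p.geom, z.geom) < 0.01
""")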

Support For Spatial File Formats


Apache Sedona supports dataset readers and writers for many spatial file types
including GeoJSON, Shapefile, GeoTIFF, ArcGrid, NetCDF/HDF and support for
reading directly from databases such as PostGIS. We will cover working with spatial
files in depth in Chapter 3.
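
As a hedged preview of the database case (Chapter 3 covers readers and writers in depth), one way to pull a PostGIS table into a Spatial DataFrame is plain Spark JDBC plus a Sedona constructor function. The connection details below are placeholders, and the PostgreSQL JDBC driver is assumed to be available on the cluster’s classpath.

from pyspark.sql.functions import expr

# Ship each geometry over JDBC as WKT text, then rebuild a Sedona geometry column.
parcels = (sedona.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/gis")
    .option("dbtable", "(SELECT id, ST_AsText(geom) AS wkt FROM parcels) AS t")
    .option("user", "analyst")
    .option("password", "secret")
    .load())

parcels = parcels.withColumn("geom", expr("ST_GeomFromWKT(wkt)"))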



Visualization
Apache Sedona offers integrations with Kepler.GL and Deck.GL. These visualization
options allow for creating interactive geospatial visualizations using GPU accelerated
rendering. Both visualization options offer custom styling and rich visualization
options including point data, complex geometries, heatmaps, choropleths, and 3D
extrusions.
Other options for visualization with Apache Sedona include integration with the
Python data ecosystem using tools such as Matplotlib and GeoPandas.
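
For example, a sketch of the Kepler.gl integration might look like the following; it assumes a Spatial DataFrame named cities_geo_df with a geometry column and the SedonaKepler helper that ships with recent Sedona releases.

from sedona.maps.SedonaKepler import SedonaKepler

# Build an interactive Kepler.gl map in the notebook with one layer.
map_view = SedonaKepler.create_map(df=cities_geo_df, name="Cities")

# Additional DataFrames can be added as further layers:
# SedonaKepler.add_df(map_view, df=other_df, name="Other layer")
map_view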

Integration With PyData Ecosystem


Apache Sedona seamlessly integrates with many Python data libraries and tools for
working with geospatial data including:
Jupyter
The Jupyter notebook based development environment is the most common way
to develop using Apache Sedona and will be the predominate interface used in
this book. Jupyter provides an integrated environment for data exploration and
analysis by integrating code, visualizations, and documentation in a single inter‐
face. By integrating Apache Sedona into a Jupyter Notebook users can write code
in Python, Scala, or SQL using Apache Sedona’s geospatial functions, perform
data processing, and visualize the results in the same notebook.
GeoPandas
GeoPandas is a popular Python library for working with geospatial data. The
integration with Apache Sedona enables users to manipulate geospatial data
using GeoPandas data structures and operations, while leveraging Apache Sedo‐
na’s spatial indexing and analytical functions. GeoPandas also integrates well with
popular visualization libraries in the Python ecosystem. Users can seamlessly
switch between GeoPandas for data manipulation and Apache Sedona for geo‐
spatial analytics without the need for extensive data conversions or complex
integration steps.
Rasterio
Rasterio is a popular Python library for reading, writing, and manipulating raster
data. Apache Sedona provides a wide range of spatial operations that can be
applied to raster data and the integration with Rasterio further extends the raster
processing functionality available to Apache Sedona users.
Shapely
Shapely is a widely used Python library for geometric operations and manipula‐
tions. By integrating Shapely with Apache Sedona, users can leverage Shapely’s
extensive geometry manipulation capabilities to create, modify, and analyze geo‐
metric objects, enhancing the processing capabilities of Apache Sedona.
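
As one illustration of these integrations, a Spatial DataFrame that is small enough to collect to the driver can be handed to GeoPandas for plotting or further manipulation. This sketch assumes a DataFrame named cities_geo_df whose geometry column is called geometry; Sedona geometry values deserialize to Shapely objects, which GeoPandas understands.

import geopandas as gpd

# Collect a (small) Spatial DataFrame to the driver and wrap it as a GeoDataFrame.
gdf = gpd.GeoDataFrame(cities_geo_df.toPandas(), geometry="geometry")
gdf.plot()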

Benefits of Apache Sedona


Apache Sedona is a spatial-first framework for working with large-scale data that
leverages the scalability of distributed compute frameworks.
What then are the benefits of using Apache Sedona over other tools in the geospatial
data analytics ecosystem?
Scalability
Apache Sedona is designed to handle large-scale geospatial datasets. Leveraging
the distributed compute infrastructure of Apache Spark enables parallel process‐
ing allowing query execution to scale horizontally across multiple machines.
High Performance
Geospatial-specific indexing structures such as R-trees and Quad trees, optimized
query planning, and data partitioning and custom serialization methods purpose-built
for geospatial data result in efficient data retrieval, minimized data transfer, and
maximized query parallelism.
Reduced Cost
Somewhat related to high performance is the benefit of reduced costs when
running large scale data processing operations in a large cluster. The improved
performance of Apache Sedona results in less time spent running the expensive
compute resources for a large cluster.
Rich Spatial Operations
Apache Sedona provides a comprehensive set of spatial operations, including
point-in-polygon, distance calculation, spatial joins, spatial aggregations, raster
data functionality, and more. These operations are optimized for performance
and integrated with Apache Spark’s DataFrame and SQL APIs, making it easy to
perform complex geospatial analytics tasks.
Integration With Apache Spark Ecosystem
Apache Sedona seamlessly integrates with the Spark ecosystem, enabling users
to leverage its powerful data processing capabilities. Apache Sedona can be used
alongside other Spark libraries and tools to perform end-to-end data analysis
workflows, combining geospatial data processing with machine learning.
Community Support and Active Development
Apache Sedona benefits from an active and growing community of contributors
and users. The community provides support, shares best practices, and contributes
to the development and enhancement of the Apache Sedona project, ensur‐
ing ongoing improvements, bug fixes, and the availability of new features.
Open Standards and Interoperability
Apache Sedona adheres to open geospatial standards such as Open Geospatial
Consortium (OGC) standards for Spatial SQL, supporting the GeoParquet speci‐
fication, and the ability to read and write many common geospatial data formats
like Shapefiles and GeoJSON, enabling data exchange and integration with exist‐
ing geospatial workflows.
Seamless Integration With The PyData Ecosystem
Apache Sedona seamlessly integrates with common Python geospatial libraries
such as GeoPandas, Shapely, and Rasterio. Through the use of User Defined
Functions (UDFs), these libraries can be leveraged to define custom logic which
can then leverage the scalability of Apache Sedona’s parallel execution.
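
For instance, a sketch of wrapping Shapely logic in a PySpark user defined function might look like the following, applied to a hypothetical DataFrame df with a WKT column named geom_wkt. The column name and buffer distance are illustrative, and a built-in function such as ST_Buffer would normally be preferred where one exists.

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from shapely import wkt

@udf(returnType=DoubleType())
def buffered_area(geom_wkt):
    # Custom per-row logic written with Shapely, executed in parallel by Spark.
    return wkt.loads(geom_wkt).buffer(0.01).area

df = df.withColumn("buffered_area", buffered_area("geom_wkt"))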

The Developer Experience


The developer experience of Apache Sedona is designed to be user-friendly and
accessible, leveraging the familiarity, capabilities, and principles of Apache Spark.
This includes APIs and language support that align with Apache Spark, including
support for Scala, Java, Python, and R allowing developers to choose their preferred
programming language as well as Jupyter notebook compatibility enabling interactive
data exploration and easy sharing of geospatial analyses.
Apache Sedona’s support for spatial SQL allows developers to express geospatial oper‐
ations and queries using familiar SQL syntax. This feature simplifies the development
process for developers who are already proficient in SQL and reduces the learning
curve for working with Apache Sedona. For developers who prefer the imperative
DataFrame API, this approach is available to them as well.

Python, Spatial SQL, and Jupyter


The focus of this book will be using Apache Sedona using Python
and Spatial SQL, mostly within a Jupyter notebook environment.
While we will touch on other approaches of interacting with
Apache Sedona, these tools will be our focus.

Who Uses Apache Sedona


While Apache Sedona is used by many different types of data practitioners, there are
a few common types of users and use cases that emerge.
Data scientists working with large-scale geospatial data use Sedona to perform
advanced geospatial analysis on large geospatial datasets to gain insights and make
data-driven decisions. By integrating with machine learning frameworks, Apache
Sedona allows data scientists to incorporate geospatial features into their predictive
models. This integration enables the development of geospatial machine learning
algorithms and predictive analytics on spatial data.
Data engineers use Apache Sedona to process and transform geospatial data at scale,
leveraging the distributed processing power of Apache Spark and Sedona’s unique
focus on extensive geospatial operations and manipulation capabilities to transform
geospatial data into desired formats. By leveraging the connectors and utilities pro‐
vided by Sedona to ingest geospatial data from various sources, data engineers can
efficiently integrate geospatial data into their data pipelines, perform ETL operations,
and prepare the data for downstream analytics.
Data analysts performing ad-hoc analysis use Apache Sedona to explore and visual‐
ize geospatial data. They can apply spatial queries, aggregations and visualizations to
understand patterns, relationships, and trends in the data. Apache Sedona’s capabili‐
ties enable data analysts to answer location-based questions and generate meaningful
geospatial insights.

Common Apache Sedona Use Cases


So what exactly are users doing with Apache Sedona? Here are some common
examples of what users are doing with Apache Sedona.

• Creating custom weather, climate, and environmental quality assessment reports


at national scale by combining vector parcel data with environmental raster data
products.
• Generating planetary scale GeoParquet files for public dissemination via cloud
storage by combining, cleaning, and indexing multiple public datasets.
• Converting billions of daily point telemetry observations into routes traveled by
vehicles.
• Enriching parcel level data with demographic and environmental data at the
national level to feed into a real estate investment suitability analysis.

Many of these use cases can be described as geospatial ETL operations. ETL (extract,
transform, load) is a data integration process that involves retrieving data from
various sources, transforming and combining these datasets, then loading the trans‐
formed data into a target system or format for reporting or further analysis. Geospa‐
tial ETL shares many of the same challenges and requirements of traditional ETL
processes with the additional complexities of managing the geospatial component of
the data, as discussed earlier in the chapter: working with geospatial data sources
and formats, spatial data types and transformations, as well as the scalability and
performance considerations required for spatial operations such as joins based on
spatial relationships.

Community Adoption
Apache Sedona has gained significant community adoption and has become a popu‐
lar geospatial analytics library within the Apache Spark ecosystem. As an Apache
Software Foundation (ASF) incubator project, Apache Sedona’s governance, licensing,
and community participation align with ASF principles.
Apache Sedona has an active and growing developer community, with contributors
from a number of different organizations and over 100 individuals interested in
geospatial analytics and distributed computing. At the time of writing, Sedona has
reached 35 million total downloads, with 1.5 million downloads per month and
usage growing at 150% year over year.
Apache Sedona has been adopted by organizations in industries including transporta‐
tion, urban planning, environment monitoring, logistics, insurance and risk analysis,
and more. Organizations leverage Apache Sedona’s capabilities to perform large-scale
geospatial analysis, extract insights from geospatial data and build geospatial analyt‐
ical applications at scale. The industry adoption of Apache Sedona showcases its
practical relevance and real-world use cases.
Apache Sedona has been featured in conferences, workshops, and research publica‐
tions related to geospatial analytics, distributed computing, and big data processing.
These presentations and publications contribute to the awareness, visibility, and
adoption both within the enterprise and within the research and academic communi‐
ties.

Resources
Throughout this book we will refer to specific documentation and resources relevant
for topics covered in each section. However, there are some important resources that
will be useful throughout your journey working with Apache Sedona:

• The documentation for Apache Sedona can be found online at https://sedona.apache.org/latest/
• Join the community Discord server: https://sedona.apache.org/latest/community/contact/
• The community forums: https://community.wherobots.com/
• Find and contribute to Apache Sedona on GitHub: https://github.com/apache/sedona

Conclusion
In this chapter we introduced Apache Sedona and the cloud native geospatial data
ecosystem. We discussed how Apache Sedona evolved out of the need for a scalable
geospatial focused analytics framework and how Apache Sedona is architected to take
advantage of distributed computation for processing large scale geospatial data. We
also reviewed the architecture of Apache Sedona and discussed the APIs for working
with data in Apache Sedona.
Now that we have an understanding of what Apache Sedona is and what it is used for,
in the next chapter we’re ready to get hands on with Apache Sedona as we get started
using Apache Sedona and Spatial SQL.



CHAPTER 2
Getting Started with Apache Sedona

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—the author’s raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 2nd chapter of the final book.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at [email protected].

In this chapter we will learn:

• How to get started with Apache Sedona, using Docker and Wherobots Cloud
• How to use the Spatial DataFrame data structure to work with data in Apache
Sedona
• How to use Spatial SQL to query and manipulate geospatial data
• How to visualize spatial data using SedonaKepler

Throughout the book we will primarily make use of Sedona in one of two ways:
using the Apache Sedona Docker image or using Wherobots Cloud. We’ll explore
getting started with both options in this chapter. Regardless of which option we use,
the developer experience of using Apache Sedona is similar and will focus around a
Jupyter Notebook environment and Spatial SQL queries. Later in the book in Chapter
9 we will explore how to work with Sedona in cloud environments such as AWS Glue,
Databricks, and Microsoft Fabric.

The Apache Sedona Docker Image
Running Apache Sedona using the official Docker image is the first option we
will explore for running Apache Sedona. If you’re not familiar with Docker, it’s an
open-source tool designed to automate the deployment, scaling, and management of
applications by “containerizing” applications and their dependencies into lightweight
and portable “containers” that can run consistently across operating systems and
environments.
Docker images are read-only templates used to create containers. They contain the
application and its dependencies and are built from a set of instructions written in a
Dockerfile.
You can learn more about Docker, including installation instructions, at https://docker.com.
The maintainers of Apache Sedona publish an official Apache Sedona Docker image
which bundles Apache Sedona, Apache Spark, Jupyter, Python, and other dependen‐
cies of Sedona.
The benefits of using the official Apache Sedona Docker image include:
Ease of Setup
The image comes pre-configured with all necessary dependencies, reducing the
complexity of setting up a Sedona environment manually.
Consistency
The Docker image ensures a consistent environment across different environ‐
ments, minimizing the risk of configuration issues in different machines.
Isolation
Docker containers provide isolation from the host system and other containers,
ensuring Sedona runs in a clean environment without conflicts.
Portability
The Docker image can run on any system that supports Docker, including local
machines, on-premises servers, and cloud platforms, facilitating easy deploy‐
ment.
The official Apache Sedona Docker image is a convenient way to get started with
Apache Sedona and is suitable for development and testing on a single machine,
but is not designed to take advantage of the highly scalable benefits of running
spatial operations across a distributed cluster of machines. To leverage the distributed
benefits of Sedona we will take advantage of cloud services such as Wherobots Cloud,
AWS EMR, or Microsoft Fabric.



To get started with the Apache Sedona Docker image first pull the image from
DockerHub using the docker pull command, optionally specifying a version. This
will download the Docker images from the DockerHub remote repository to the local
machine.
For example to pull the latest image use the latest tag.
docker pull apache/sedona:latest
As of the time of writing version 1.6.0 is the latest Apache Sedona release so we will
use this version by specifying it in the docker pull command.
docker pull apache/sedona:1.6.0

Next, to run the Docker image we use the docker run command, specifying configu‐
ration options for binding ports from the local machine to the container as well as
memory allocation.
docker run -e DRIVER_MEM=6g -e EXECUTOR_MEM=8g \
-p 8888:8888 -p 8080:8080 -p 8081:8081 \
-p 4040:4040 apache/sedona:1.6.0
Let’s break down this command to see what each piece is doing.
docker run
This is the Docker command used to create and start a new container from a
specified image.
-e DRIVER_MEM=6g
This flag sets an environment variable inside the container which specifies the
amount of memory allocated to the driver process.
-e EXECUTOR_MEM=8g
This flag indicates the executor process should use 8 gigabytes of memory.
-p 8888:8888
This flag maps port 8888 on the host machine to port 8888 on the container. This
is used to expose the Jupyter notebook environment running in the container to
the host machine.
-p 4040:4040
This flag maps port 4040 on the host to port 4040 on the container, which
exposes Spark UI, a web-based interface for monitoring and managing Spark
applications.
apache/sedona:1.6.0
This is the image name and tag.



After starting the Docker container, open a web browser and navigate to
http://localhost:8888. This will open the Jupyter notebook environment which will be our
main development interface for Apache Sedona.

Using Wherobots Cloud


Wherobots Cloud leverages the capabilities of Apache Sedona in a managed cloud
environment which enables data practitioners to focus on working with their data
rather than managing the infrastructure necessary to run Apache Sedona at scale.
Wherobots Cloud offers a free tier for all users which we will also make use of
throughout this book.
The benefits of using Apache Sedona with Wherobots Cloud include:
Reduced complexity
By providing a managed solution for spatial data processing, Wherobots Cloud
reduces the complexity of setting up and maintaining a spatial data infrastruc‐
ture.
Improved performance
Wherobots Cloud is designed to handle large-scale spatial data efficiently,
improving the performance of spatial queries and analysis.
Cost-efficiency
Cloud-based deployment allows users to pay for only the resources they use,
making Wherobots Cloud a cost-effective solution for spatial data processing.
Flexibility
Support for multiple data formats and integration with various tools and libraries
provides flexibility in how users can use and analyze their geospatial data.
After registering and signing in to Wherobots Cloud at https://cloud.wherobots.com,
you’ll be prompted to start a notebook runtime. This will provision the
resources required to run your Sedona cluster. By default free tier users have access
to the “Sedona Tiny” runtime which is a cluster of 4 executors with a total of 15 CPU
cores and 60GB total RAM.



Figure 2-1. The Wherobots Cloud console

The Sedona Tiny instance will be sufficient for running the examples in this book.
To scale up to larger runtime with larger cluster resources requires upgrading to the
Professional Tier in Wherobots Cloud.

Figure 2-2. The Wherobots Cloud console after starting a notebook runtime

Once the notebook runtime is provisioned, a hosted Jupyter notebook environment
will be available and will be our main development environment. In addition to the
hosted Sedona environment, Wherobots Cloud includes storage, access to a spatial
data catalog, performance optimizations, and other features that we will touch on
later throughout the book.

Overview Of The Notebook Environment


Whether using Apache Sedona via the official Docker image or via Wherobots Cloud
the main development environment is through Jupyter notebook.
Jupyter notebook is an open-source web application that enables users to create and
share documents that can contain code, visualization, narrative text, and interactive
widget elements. It is widely used in data science, scientific computing, and machine
learning environments for its ability to mix code execution with explanatory text.
Notebooks can be checked into version control such as git and published to the web
to be shared with others.
Jupyter notebooks serve as a powerful development interface for Apache Sedona
enabling data practitioners to interactively develop, execute, and visualize spatial data
processing workflows.
Within the Jupyter notebook we can work with Apache Sedona using either Python
or Scala. Python is our language of choice used throughout this book. Using Python
allows us to take advantage of Sedona’s integration with Python tooling both in the
geospatial realm and the general Python data ecosystem.
Let’s take a hands-on approach to introduce some of the basic getting started steps
when working with Sedona. If you haven’t already, follow the steps in “The Apache
Sedona Docker Image” on page 18 to launch the official Sedona Docker container.
Then open a web browser and navigate to http://localhost:8888.
First, create a new Python notebook in Jupyter and in the first cell add the code below
to initialize your Sedona environment.
from sedona.spark import *

# Build a Spark session pointed at the container's local Spark master, then
# register Sedona's spatial types, functions, and optimizations on top of it.
config = SedonaContext.builder().master("spark://localhost:7077").getOrCreate()
sedona = SedonaContext.create(config)
This will initialize the Sedona cluster. Other common configuration options at this
step include configuring access to cloud object storage such as AWS S3. We’ll see how
this works in the next chapter when we cover working with files.
In the previous chapter we introduced the concept of the Spatial DataFrame. Let’s
take a deeper look at this important data structure.



The Spatial DataFrame
While there are a few different data structures available for working with Sedona, the
most performant and robust is the Spatial DataFrame, which is a two dimensional
tabular data structure that organizes data into rows and columns similar to a table in
a relational database. Spatial DataFrames are distributed data structures, meaning the
underlying data is distributed across machines in the cluster, enabling operations on
massive datasets. The Spatial DataFrame is an extension of the standard DataFrame
you may be familiar with but adds additional functionality for working with spatial
data, notably support for spatial data types and spatial operations.
Spatial DataFrames support both spatial SQL based operations for querying and
manipulating data and a more imperative interface known as the DataFrame API.
Typically we construct Spatial DataFrames by loading data from the storage layer, but
let’s create a simple example to familiarize ourselves with the basic concepts. In the
next chapter we’ll introduce how to load data from files.
Now we are ready to create a new DataFrame, cities_df, which will contain a city
name and its longitude and latitude.
cities_df = sedona.createDataFrame(
    [
        ("San Francisco", -122.4191, 37.7749),
        ("New York", -74.0060, 40.7128),
        ("Austin", -97.7431, 30.2672)
    ],
    ["city", "longitude", "latitude"])

cities_df.show()
+-------------+---------+--------+
|         city|longitude|latitude|
+-------------+---------+--------+
|San Francisco|-122.4191| 37.7749|
|     New York|  -74.006| 40.7128|
|       Austin| -97.7431| 30.2672|
+-------------+---------+--------+

We can view the schema of this DataFrame with the printSchema() method.
cities_df.printSchema()
root
|-- city: string (nullable = true)
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)

Note that the longitude and latitude columns are doubles, not a geometry or “point”
type. To take advantage of Sedona’s functionality for working with spatial data we’ll
make use of a spatial SQL function to add a new column to the DataFrame, leveraging
Sedona’s geometry type.
There are two APIs for working with data in a Spatial DataFrame: using spatial
SQL and using the DataFrame API. In this chapter we will explore both options for
working with data in Sedona’s Spatial DataFrame, first using spatial SQL.

Introduction To Spatial SQL


Spatial SQL is a set of extensions to the Structured Query Language (SQL) that allows
for the processing and analysis of spatial data within a table-based data environment,
such as a relational database management system (RDBMS).1
Spatial SQL builds upon the core SQL language by introducing new data types,
functions, and operators to handle the unique requirements of spatial data. This
enables data practitioners and applications to perform a wide range of spatial queries,
spatial joins, spatial aggregations, and spatial analyses with a familiar SQL syntax and
environment.
Since spatial SQL implementations are based on the same standard, applications
leveraging spatial SQL are portable and minimal updates are needed to migrate from
one system to another. The spatial SQL standard was first published in 1999 and
since then has matured with multiple updates and has been adopted by a number of
implementations, including PostGIS and Apache Sedona. The spatial SQL standard
consistently uses the prefix ST_ for all function names. This prefix originally stood
for "Spatial and Temporal," as the early version of the standard intended to cover
temporal functionality as well; however, that aspect was dropped in favor of the
SQL/Temporal standard.

ST_ and RS_ Functions In Apache Sedona


Unfortunately, the scope of the spatial SQL standard does not
include raster data. Apache Sedona uses the prefix RS_ for spatial
SQL functions that operate on raster data, a convention that may
not be shared with other systems.

Spatial SQL functions can be grouped as belonging to one of the following four
categories:

1 The official specification for spatial SQL is known as ISO/IEC 13249-3 SQL/MM Part 3: Spatial and was
originally derived from the Open Geospatial Consortium Simple Features Specification for SQL.



Constructors
Convert between geometries and external data formats. For example,
ST_GeomFromWKT(string) constructs a geometry from a Well Known Text (WKT)
string.
Functions
Retrieve properties or measurements from a geometry, or compare two geometries
with respect to their spatial relationship. For example, ST_Distance(A, B)
computes the distance between geometries A and B.
Aggregations
Return an aggregated value computed over a geometry column. For example,
ST_Union_Aggr(A) returns the union of all polygons in column A.
Predicates
Return true or false by evaluating a spatial relationship. For example,
ST_Contains(A, B) checks whether geometry A fully contains geometry B.
Spatial predicates are often used in spatial joins and spatial range queries.
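To make these categories concrete, here is a small standalone example (not part of the chapter's running cities example) that evaluates a constructor, a measurement function, and a predicate against literal WKT geometries; an aggregation such as ST_Union_Aggr would instead operate over a geometry column.

# Illustrative only: literal geometries, no table required.
sedona.sql("""
SELECT
    ST_GeomFromWKT('POINT (1 1)') AS constructed_point,          -- constructor
    ST_Distance(ST_GeomFromWKT('POINT (0 0)'),
                ST_GeomFromWKT('POINT (3 4)')) AS distance,      -- function
    ST_Contains(ST_GeomFromWKT('POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))'),
                ST_GeomFromWKT('POINT (1 1)')) AS contains_point -- predicate
""").show(truncate=False)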
To execute a spatial SQL statement that references our cities_df we will first create a
temporary view of the DataFrame.
cities_df.createOrReplaceTempView("cities")

This allows us to reference the DataFrame as the cities view in our SQL queries. To run
a spatial SQL query we use the sedona.sql method, passing in our query. A new
Spatial DataFrame will be returned.
As noted previously, so far our cities_df DataFrame is using double types to
represent longitude and latitude. To take advantage of the spatial functionality of this
data we want the location information of each row to be represented as a geometry
type. Our first spatial SQL statement will use the ST_Point function to create point
geometries from the latitude and longitude representations.
cities_df = sedona.sql("""
SELECT *, ST_Point(longitude, latitude) AS geometry
FROM cities
""")

cities_df.show(truncate=False)
+-------------+---------+--------+-------------------------+
|city |longitude|latitude|geometry |
+-------------+---------+--------+-------------------------+
|San Francisco|-122.4191|37.7749 |POINT (-122.4191 37.7749)|
|New York |-74.006 |40.7128 |POINT (-74.006 40.7128) |
|Austin |-97.7431 |30.2672 |POINT (-97.7431 30.2672) |
+-------------+---------+--------+-------------------------+



Note that this query returns a new DataFrame and that we've overwritten our previous
cities_df variable. Now, if we inspect the schema of our new DataFrame we can
see that the geometry column is of type geometry. The geometry type supports point,
linestring, and polygon geometries, the multi versions of each, and mixed geometry
types in a single column (illustrated in the sketch following the schema below). Spatial
DataFrames can also contain multiple geometry-typed columns.
We can view the schema of the new cities_df DataFrame, which will now include
the geometry column.
cities_df.printSchema()
root
|-- city: string (nullable = true)
|-- longitude: double (nullable = true)
|-- latitude: double (nullable = true)
|-- geometry: geometry (nullable = true)
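As a quick illustration of mixed geometry types, here is a standalone sketch (separate from the cities example) showing that a single geometry column can hold points, linestrings, and polygons side by side.

mixed_df = sedona.sql("""
SELECT ST_GeomFromWKT('POINT (0 0)') AS geometry
UNION ALL
SELECT ST_GeomFromWKT('LINESTRING (0 0, 1 1)')
UNION ALL
SELECT ST_GeomFromWKT('POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))')
""")

mixed_df.printSchema()  # the geometry column still has a single geometry type
mixed_df.show(truncate=False)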

So far we’ve used the ST_Point constructor spatial SQL function to create a point
geometry type column. Let’s explore other ways to use spatial SQL to manipulate and
create new geometries. We’ve seen the point geometry type, but we can also work
with more complex geometries like polygons.
The ST_Buffer function returns a polygon covering all points within a given distance
of the input geometry, in other words a buffer around the input geometry. Using the
city point geometries as inputs, we'll use the ST_Buffer function to create a buffer
around each point with a radius of 1 km; the third argument, true, tells Sedona to
compute the buffer using spheroidal distance, so the radius is interpreted in meters.
First, because we modified the cities_df DataFrame, we'll need to replace the
temporary view cities that we defined earlier.
cities_df.createOrReplaceTempView("cities")
Now we create a new DataFrame that will contain the name of each city and a
polygon geometry that represents the buffer.
buffer_df = sedona.sql("""
SELECT city, ST_Buffer(geometry, 1000, true) AS geometry
FROM cities
""")

buffer_df.show()
+-------------+--------------------+
| city| geometry|
+-------------+--------------------+
|San Francisco|POLYGON ((-122.40...|
| New York|POLYGON ((-73.994...|
| Austin|POLYGON ((-97.732...|
+-------------+--------------------+



Working With The DataFrame API
In addition to the SQL API, we can work with Spatial DataFrames using an imperative
programmatic API known as the DataFrame API. Both methods allow us to
manipulate and analyze data, but the DataFrame API can be more intuitive for
developers familiar with programming languages like Python. The DataFrame API
supports chaining multiple operations together in a readable manner and provides
many built-in functions. On the other hand, it may have a steeper learning curve
for users not familiar with functional programming paradigms, and writing complex
queries can become cumbersome compared to SQL.
Let’s explore using the DataFrame API to create a linestring geometry connecting
each city, such as a flight or train route between the cities.
In SQL this statement would look like the following, making use of the ST_MakeLine
function and the collect_list SQL function, which creates an array of the individual
point geometries from each row in the cities view.
SELECT ST_MakeLine(collect_list(geometry)) AS geometry
FROM cities
Now let’s see how this would be accomplished using the programmatic DataFrame
API. When using the DataFrame API we first import the necessary Spark functions
from the pyspark.sql.functions module and any spatial SQL functions from the
sedona.sql.st_functions module.
Then we can chain together these functions in an imperative way to define the spatial
operations we want to execute.
from sedona.sql.st_functions import ST_MakeLine
from pyspark.sql.functions import collect_list, col

route_df = cities_df.select(
    ST_MakeLine(collect_list(col("geometry"))).alias("geometry")
)

route_df.show(truncate=False)
+-----------------------------------------------------------------+
|geometry |
+-----------------------------------------------------------------+
|LINESTRING (-122.4191 37.7749, -74.006 40.7128, -97.7431 30.2672)|
+-----------------------------------------------------------------+
As the above example demonstrates, we can accomplish the same spatial operations
with both spatial SQL and the DataFrame API. While the choice between the two is
often a matter of personal preference, there are advantages and disadvantages to
each approach.
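For example, here is a sketch of the earlier ST_Buffer query expressed with the DataFrame API rather than spatial SQL; the expr function lets us invoke a spatial SQL function directly on DataFrame columns, and the variable name buffer_df_api is ours.

from pyspark.sql.functions import col, expr

# Same buffer operation as the earlier spatial SQL query, written with the DataFrame API.
buffer_df_api = cities_df.select(
    col("city"),
    expr("ST_Buffer(geometry, 1000, true)").alias("geometry"),
)
buffer_df_api.show()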



Visualizing Data
When working with spatial data it can be important to visualize the data, whether to
help interpret the results of our analysis or as a quality control check. There are
several options for visualizing spatial data with Sedona, including Python-based tools
such as matplotlib or Leafmap. We will explore some of these options later in the
book; the first spatial visualization tool we'll make use of is SedonaKepler.
SedonaKepler is an integration between Apache Sedona and the Kepler.gl visualization
library that enables visualizing vector geospatial data overlaid on a basemap in an
interactive map.
The simplest way to visualize data with SedonaKepler is to visualize a single DataFrame
by passing that DataFrame and a name for the layer to the SedonaKepler.create_map
method.
SedonaKepler.create_map(cities_df, "Cities")
This will render the Kepler.gl map in the Jupyter environment and allow the user to
configure styling manually.

Figure 2-3. Visualizing cities using SedonaKepler
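If you want to share the interactive map outside the notebook, the map object returned by SedonaKepler.create_map can typically be exported to a standalone HTML file. This is a sketch assuming the underlying Kepler.gl save_to_html API; the file name is a placeholder.

mapview = SedonaKepler.create_map(cities_df, "Cities")
# Export the interactive map to a standalone HTML file (hypothetical file name).
mapview.save_to_html(file_name="cities_map.html")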

We can also visualize multiple DataFrames as multiple layers in the SedonaKepler
map view. To add additional DataFrames we use the SedonaKepler.add_df method,
passing the DataFrame and a name for each layer.
mapview = SedonaKepler.create_map(cities_df, "Cities")
SedonaKepler.add_df(mapview, buffer_df, "City Buffer")
SedonaKepler.add_df(mapview, route_df, "City Route")
mapview



Figure 2-4. Visualizing multiple DataFrames using SedonaKepler

Conclusion
In this chapter we took a deeper hands-on look at two important pieces of Apache
Sedona: spatial SQL and the Spatial DataFrame. Spatial SQL allows us to create,
manipulate, and analyze spatial data using Sedona’s Spatial DataFrame. The Spatial
DataFrame is a distributed data structure that supports spatial data types and working
with massive datasets. We also saw how to get started with Sedona using both the
Apache Sedona Docker image and Wherobots Cloud.
So far we've limited our usage of Sedona to small, manually created examples, but
in real-world analysis we typically encounter data in many different formats
and sources. In the next chapter we’ll see how to use Sedona to load, manipulate and
analyze spatial data in many different formats including CSV, GeoJSON, Shapefile,
and Parquet. We’ll also learn about the benefits of cloud-native geospatial file formats
like GeoParquet and see how to use Sedona to create and query spatial datasets using
GeoParquet.

Resources
• Spatial SQL function documentation: https://sedona.apache.org/latest/api/sql/Overview/
• SedonaKepler documentation: https://sedona.apache.org/latest/api/sql/Visualization_SedonaKepler/
• Sedona Docker documentation: https://hub.docker.com/r/apache/sedona

• Wherobots Cloud registration: https://cloud.wherobots.com/
• Wherobots Cloud documentation: https://docs.wherobots.com/latest/

Exercises
1. Create a free Wherobots Cloud account. After signing in, create a tiny Sedona
runtime and open the notebook environment. Run all cells in the example “First
Wherobots Cloud Notebook”.
2. Read the documentation for spatial SQL functions supported in Apache Sedona.
Choose a function to manipulate the geometry column of the cities_df DataFrame.
What is the input to the function? What is the output?
3. Using SedonaKepler, visualize the results of Exercise 2. Can you visualize this
data along with the original cities_df DataFrame?



About the Authors
William Lyon is a Developer Relations Engineer at Wherobots, the creators of the
open-source Apache Sedona geospatial analytics platform, where he helps developers
and data scientists make sense of spatial data. Previously he worked at Neo4j and
other startups as a software engineer. He has a master's degree in Computer Science
from the University of Montana and publishes a blog at lyonwj.com.
Jia Yu is a co-founder of Wherobots, a venture-backed company helping businesses
drive insights from spatiotemporal data. He was a Tenure-Track Assistant
Professor of Computer Science at Washington State University from 2020 to 2023.
He obtained his Ph.D. in Computer Science from Arizona State University. His
research focuses on large-scale database systems and geospatial data management. In
particular, he worked on distributed geospatial data management systems, database
indexing, and geospatial data visualization. Jia’s research outcomes have appeared in
the most prestigious database and GIS conferences and journals, including SIGMOD,
VLDB, ICDE, SIGSPATIAL, and the VLDB Journal. He is the main contributor of several
open-source research projects, including Apache Sedona, a cluster computing
framework for processing big spatial data that receives 1 million downloads per month
and has users and contributors from major companies.
Mo Sarwat is the CEO of Wherobots, co-creator of Apache Sedona, and an associate
professor at Arizona State University. At Wherobots he is spearheading a team
developing a cloud data platform equipped with a brain and memory for our planet
to solve the world's most pressing issues. Wherobots was founded by the creators of
Apache Sedona, an open-source framework designed for large-scale spatial data
processing in cloud and on-prem deployments. At Arizona State University Mo teaches
and conducts research in the fields of large-scale data processing, databases, data
analytics, and AI data infrastructure. With over a decade of experience in academia
and industry, Mo has published more than 60 peer-reviewed papers, received two
best research paper awards, been named an Early Career Distinguished Lecturer by
the IEEE Mobile Data Management community, and is also a recipient of the 2019
National Science Foundation CAREER award, one of the most prestigious honors
for young faculty members. His mission is to advance the state of the art in data
management and AI, to empower data-driven decision making for a wide range of
applications, such as transportation, mobility, and environmental monitoring. He is
passionate about developing robust and scalable data systems that can handle complex
and massive datasets, and leverage artificial intelligence and machine learning
techniques to extract valuable insights and patterns.
