Massively Scalable Real-time
Geospatial Data Processing with
Apache Kafka and Cassandra
Paul Brebner
instaclustr.com Technology Evangelist
Sydney Data Engineering Meetup
14 May 2020 (online)
© Instaclustr Pty Limited, 2020
Instaclustr Overview
Reliability
@ Scale
Managed &
Supported
Open Source
Extensive &
Unmatched
Expertise
Exclusively open-source
Expert-level support and
consulting services
Reliable, secure and
performant operation
at scale
Available in the cloud
or on-prem
Influential community
engagement
Solving the complexities of designing, deploying
and managing complex data layer technologies.
Delivering managed solutions for the most
powerful open source technologies through our
integrated data layer platform.
More than 50 million node hours and 3 petabytes
under management.
Founded in 2012
CBR Australia
100+ Employees
in 4 offices
HQ in Palo
Alto USA
24x7 TechOps &
Dev in Australia
300+ Customers
Globally
Global Presence
Elasticsearch
Overview
• In the News (location)
• Anomaly Detection: baseline throughput
• Spatial Anomaly Detection problem
• Solutions: location representation and querying/indexing
o Bounding boxes and secondary indexes
o Geohashes
o Lucene index
o Results
o 3D
In the
News
John Conway
Legendary Polymath
Passed away from
Covid-19
Game of
Life
Next state of each cell depends on state of
immediate neighbors
Simple rules but complex patterns
Game of
Life
Also In
the News
Social distancing and COVID-19 tracing
Uncle Ron’s social
distancing 3000
invention
Or COVIDSAFE App?
Also In
the News
“UFO” photos
declassified by USA
And “planet-killer”
asteroid missed the
Earth in April
(16x moon orbit)
Previously…
Anomaly
Detection
Spot the difference
At speed (< 1s RT) and scale
(high throughput, lots of data)
How
Does It
Work?
• CUSUM (Cumulative Sum Control Chart)
• Statistical analysis of historical data
• Data for a single variable/key at a time
• Potentially billions of keys
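The CUSUM check for one key can be sketched roughly as below. This is a minimal illustration, not the implementation used in the talk: the function name `cusum_anomaly`, the slack `k`, the threshold `h` (both in standard deviations) and the choice of the window's own mean and standard deviation as the baseline are all assumptions.

```python
from statistics import mean, pstdev

def cusum_anomaly(values, k=0.5, h=5.0):
    """Return True if a two-sided CUSUM over `values` crosses the threshold.

    values: readings for one key, oldest first (e.g. the most recent 50).
    k: slack in standard deviations (assumed default).
    h: decision threshold in standard deviations (assumed default).
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return False  # constant series: nothing to flag
    high = low = 0.0
    for v in values:
        z = (v - mu) / sigma
        high = max(0.0, high + z - k)  # accumulates upward drift
        low = max(0.0, low - z - k)    # accumulates downward drift
        if high > h or low > h:
            return True
    return False
```

A steady series stays below the threshold, while a series that shifts level part-way through accumulates drift and is flagged.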
Pipeline
Design
• Interaction with Kafka and Cassandra and Kubernetes Clusters –
Kafka handles streaming, Cassandra for data storage,
Kubernetes for application scaling
• Efficient Cassandra data writes and reads by key:
a unique “account ID” or similar
Cassandra
Data
Model
• Events are timeseries
• Id is Partition Key - Time is clustering key (order)
• Read gets most recent 50 values for id, very fast
create table event_stream (
id bigint,
time timestamp,
value double,
primary key (id, time)
) with clustering order by (time desc);
select value from event_stream where id=314159265 limit 50;
Baseline
Throughput
19 Billion Anomaly Checks/Day = 100%
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID)]
Harder
Problem:
Spot the
Differences
in Space
Space is big. Really big. You just won’t believe how vastly,
hugely, mind-bogglingly big it is. I mean, you may think it’s a
long way down the road to the chemist, but that’s just
peanuts to space.
Douglas Adams, The Hitchhiker’s Guide to the Galaxy
Spatial
Anomalies
Many and varied
Real
Example:
John Snow
No, not this one!
John Snow’s
1854 Cholera Map
• Deaths per household + location
• Used to identify a polluted pump (X)
• Some outliers: brewers drank beer, not water!
But…
First you have to
know where you are:
Location
To usefully represent location you need:
• Coordinate system
• Map
• Scale
Better
• <lat, long> coordinates
• Scale
• Interesting locations “bulk
of treasure here”
Geospatial Anomaly Detection
South Atlantic Geomagnetic Anomaly
New problem…
• Rather than a single ID, events now have a location (and a value)
• The problem now is to
o find the nearest 50 events to each new event
o Quickly (< 1s RT)
• Can’t make any assumptions about geospatial
properties of events
o including location, density or distribution – i.e. where, or how many
o Need to search from smallest to increasingly larger areas
o E.g. South Atlantic Geomagnetic Anomaly is BIG
• Uber uses similar technologies to
o forecast demand
o Increase area until they have sufficient data for predictions
• Can we use <lat, long> as Cassandra partition key?
o Yes, compound partition keys are allowed.
o But can only select the exact locations.
How to Compute Nearness
To compute distance between locations
you need a coordinate system
e.g. Mercator map
Flat earth, distortion nearer poles
World is (approx.) Spherical
Calculation of distance between two
latitudinal/longitudinal points is non-trivial
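One common way to compute this distance is the haversine formula, sketched below. It treats the Earth as a perfect sphere; the radius constant and the function name `haversine_km` are choices made for this illustration, and the talk does not say which formula it used.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean radius; the Earth is only approximately spherical

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two <lat, long> points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

For example, one degree of longitude along the equator comes out at roughly 111 km.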
Bounding Box
Approximation of distance using inequalities
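A bounding box replaces the distance calculation with four inequalities on latitude and longitude. The sketch below computes the box corners for a given centre and distance; the names `bounding_box` and `in_box` are hypothetical, and it deliberately ignores the poles and the ±180° longitude wrap, which a real implementation would have to handle.

```python
from math import cos, degrees, radians

EARTH_RADIUS_KM = 6371.0  # spherical-Earth assumption

def bounding_box(lat, lon, distance_km):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing a circle of distance_km."""
    dlat = degrees(distance_km / EARTH_RADIUS_KM)
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = degrees(distance_km / (EARTH_RADIUS_KM * cos(radians(lat))))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)

def in_box(lat, lon, box):
    """Nearness test using inequalities only."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

The box is a superset of the circle, so points in the corners are slightly further away than `distance_km`.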
Bounding Boxes and Cassandra?
• Use “country” partition key, Lat/long/time clustering keys
• But can’t run the query with multiple inequalities
CREATE TABLE latlong (
country text,
lat double,
long double,
time timestamp,
PRIMARY KEY (country, lat, long, time)
) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC);
select * from latlong where country='nz' and lat>= -39.58 and lat <= -38.67
and long >= 175.18 and long <= 176.08 limit 50;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column
"long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)"
Secondary
Indexes to
the
Rescue?
Secondary Indexes
o create index i1 on latlong (lat);
o create index i2 on latlong (long);
• But same restrictions as clustering columns.
SASI - SSTable Attached Secondary Index
• Supports more complex queries more efficiently
o create custom index i1 on latlong (long) using
'org.apache.cassandra.index.sasi.SASIIndex';
o create custom index i2 on latlong (lat) using
'org.apache.cassandra.index.sasi.SASIIndex';
• select * from latlong where country='nz' and lat>= -39.58 and
lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50
allow filtering;
• “allow filtering” may be inefficient (if many rows have to be
retrieved prior to filtering) and isn’t suitable for production.
• But SASI docs say
o even though “allow filtering” must be used with 2 or more column
inequalities, there is actually no filtering taking place
Results
Very slow (< 1%)
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI]
Geohashes
to the
Rescue?
• Divide maps into named and hierarchical areas
• We’ve been doing something similar already: “country” partition key
E.g. plate tectonics
Geohashes
Rectangular areas
Variable length base-32 string
Single char regions 5,000km x 5,000km
Each extra letter gives 32 sub-areas
8 chars is 40m x 20m
En/de-code lat/long to/from geohash
But: edge cases, non-linear near poles
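Encoding works by repeatedly halving the longitude and latitude ranges and packing the resulting bits, five at a time, into base-32 characters. The sketch below follows the standard public geohash scheme; the function name `geohash_encode` is chosen for this illustration and is not taken from the talk's code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet: no a, i, l, o

def geohash_encode(lat, lon, length=8):
    """Encode a <lat, long> pair (degrees) as a geohash of `length` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0         # bits collected for the current character
    bit_count = 0
    use_lon = True   # bits alternate: longitude first, then latitude
    while len(chars) < length:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # 5 bits = one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, 5))  # prints "u4pru"
```

Because each extra character only refines the previous cell, a shorter geohash is always a prefix of the longer one for the same point. That prefix property is what makes geohashes usable as hierarchical keys.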
Some
Geohashes
Are Words
“ketchup” is in Africa
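Decoding reverses the process: each character contributes five bits that alternately narrow the longitude and latitude ranges. The sketch below uses the standard geohash alphabet; `geohash_bounds` is a hypothetical helper name. The first character “k” alone already restricts the cell to latitudes -45 to 0 and longitudes 0 to 45, which is consistent with the slide.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet: no a, i, l, o

def geohash_bounds(geohash):
    """Return (min_lat, max_lat, min_lon, max_lon) of the cell named by `geohash`."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    use_lon = True  # bits alternate: longitude first, then latitude
    for ch in geohash:
        value = BASE32.index(ch)  # raises ValueError for letters outside the alphabet
        for shift in (4, 3, 2, 1, 0):
            rng = lon_range if use_lon else lat_range
            mid = (rng[0] + rng[1]) / 2
            if (value >> shift) & 1:
                rng[0] = mid  # bit 1: upper half
            else:
                rng[1] = mid  # bit 0: lower half
            use_lon = not use_lon
    return lat_range[0], lat_range[1], lon_range[0], lon_range[1]

print(geohash_bounds("k"))  # prints (-45.0, 0.0, 0.0, 45.0)
```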
Some
Geohashes
Are Words
153m x 153m
“Trump”
Is in
Kazakhstan!
Not to scale
5km x 5km
Modifications for Geohashes
• Lat/long encoded as geohash
• Geohash is new key
• Geohash used to query Cassandra
Geohashes and Cassandra
In theory Geohashes work well for
database indexes
CREATE TABLE geohash1to8 (
geohash1 text,
time timestamp,
geohash2 text,
geohash3 text,
geohash4 text,
geohash5 text,
geohash6 text,
geohash7 text,
geohash8 text,
value double,
PRIMARY KEY (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE INDEX i8 ON geohash1to8 (geohash8);
CREATE INDEX i7 ON geohash1to8 (geohash7);
CREATE INDEX i6 ON geohash1to8 (geohash6);
CREATE INDEX i5 ON geohash1to8 (geohash5);
CREATE INDEX i4 ON geohash1to8 (geohash4);
CREATE INDEX i3 ON geohash1to8 (geohash3);
CREATE INDEX i2 ON geohash1to8 (geohash2);
Option 1:
Multiple Indexed Geohash Columns
• Query from smallest to largest areas
select * from geohash1to8 where geohash1='e' and geohash7='everywh' limit 50;
select * from geohash1to8 where geohash1='e' and geohash6='everyw' limit 50;
select * from geohash1to8 where geohash1='e' and geohash5='every' limit 50;
select * from geohash1to8 where geohash1='e' and geohash4='ever' limit 50;
select * from geohash1to8 where geohash1='e' and geohash3='eve' limit 50;
select * from geohash1to8 where geohash1='e' and geohash2='ev' limit 50;
select * from geohash1to8 where geohash1='e' limit 50;
Tradeoffs?
Multiple secondary columns/indexes, multiple queries, accuracy and
number of queries depends on spatial distribution and density
• Stop when 50 rows found
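The "smallest to largest area, stop at 50 rows" loop can be sketched against an in-memory list instead of Cassandra. Everything here is illustrative: `search_outwards` is a hypothetical name, and the list scan stands in for one CQL query per geohash length.

```python
def search_outwards(events, geohash, wanted=50):
    """Widen the search area one geohash character at a time.

    events: list of (geohash, value) tuples standing in for table rows.
    Returns (prefix_used, matches) where matches has at most `wanted` items.
    """
    prefix = geohash
    matches = []
    for length in range(len(geohash), 0, -1):
        prefix = geohash[:length]  # a shorter prefix names a larger area
        matches = [e for e in events if e[0].startswith(prefix)]
        if len(matches) >= wanted:
            break                  # enough rows found: stop widening
    return prefix, matches[:wanted]
```

In sparse regions the loop runs all the way down to the single-character area, so the number of queries per event depends on how densely events are distributed.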
Results
Option 1 = 10%
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI, Geohash Option 1]
Geohashes and Cassandra
Denormalization is “Normal” in Cassandra
Create 8 tables, one for each geohash length
CREATE TABLE geohash1 (
geohash text,
time timestamp,
value double,
PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);
…
CREATE TABLE geohash8 (
geohash text,
time timestamp,
value double,
PRIMARY KEY (geohash, time)
) WITH CLUSTERING ORDER BY (time DESC);
Option 2:
Denormalized Multiple Tables
• Select from smallest to largest areas using corresponding table
select * from geohash8 where geohash='everywhe' limit 50;
select * from geohash7 where geohash='everywh' limit 50;
select * from geohash6 where geohash='everyw' limit 50;
select * from geohash5 where geohash='every' limit 50;
select * from geohash4 where geohash='ever' limit 50;
select * from geohash3 where geohash='eve' limit 50;
select * from geohash2 where geohash='ev' limit 50;
select * from geohash1 where geohash='e' limit 50;
Tradeoffs?
Multiple tables and writes, multiple queries
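The write-side cost of the denormalized design is easy to see in a sketch: every event becomes one row per table. The helper name `denormalized_writes` is hypothetical; the table names follow the geohash1 to geohash8 tables defined for this option.

```python
def denormalized_writes(geohash8, time, value):
    """Return one (table_name, row) pair per geohash length for a single event."""
    return [
        (f"geohash{n}", (geohash8[:n], time, value))  # row keyed by the n-char prefix
        for n in range(1, len(geohash8) + 1)
    ]

for table, row in denormalized_writes("everywhe", "2020-05-14 10:00:00", 1.5):
    print(table, row)
```

One incoming event therefore costs eight writes, which is the price paid for being able to read each area size with a simple partition-key lookup.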
Results
Option 2 = 20%
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2]
Geohashes and Cassandra
Similar to Option 1 but using
clustering columns
CREATE TABLE geohash1to8_clustering (
geohash1 text,
time timestamp,
geohash2 text,
geohash3 text,
geohash4 text,
geohash5 text,
geohash6 text,
geohash7 text,
geohash8 text,
value double,
PRIMARY KEY (geohash1, geohash2, geohash3, geohash4,
geohash5, geohash6, geohash7, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC,
geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC,
geohash8 DESC, time DESC);
Option 3:
Clustering Column(s)
How Do Clustering Columns Work?
• Clustering columns are good for modelling and efficient querying of
hierarchical/nested data
• Query must include higher level columns with equality operator, ranges are
only allowed on last column in query, lower level columns don’t have to be
included.
For example:
o select * from geohash1to8_clustering where geohash1='e' and
geohash2='ev' and geohash3 >= 'ev0' and geohash3 <= 'evz' limit 50;
• But why have multiple clustering columns when one is actually enough…
Good for Hierarchical Data
Better: Single Geohash
Clustering Column
Geohash8 and time are
clustering keys
CREATE TABLE geohash_clustering (
geohash1 text,
time timestamp,
geohash8 text,
lat double,
long double,
PRIMARY KEY (geohash1, geohash8, time)
) WITH CLUSTERING ORDER BY (geohash8 DESC,
time DESC);
Inequality Range Query
• With decreasing length
geohashes
• Stop when result has 50 rows
select * from geohash_clustering where geohash1='e' and
geohash8='everywhe' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='everywh0' and geohash8<='everywhz' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='everyw00' and geohash8<='everywzz' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='every000' and geohash8<='everyzzz' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='ever0000' and geohash8<='everzzzz' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='eve00000' and geohash8<='evezzzzz' limit 50;
select * from geohash_clustering where geohash1='e' and
geohash8>='ev000000' and geohash8<='evzzzzzz' limit 50;
select * from geohash_clustering where geohash1='e' limit 50;
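The range bounds for each prefix length can be generated rather than written by hand. In the sketch below, `prefix_range` and `range_queries` are hypothetical helpers. The bounds are padded to the full geohash length with '0' and 'z' (the lowest and highest characters of the geohash alphabet), because in string ordering a longer value such as 'everywz5' sorts after the bare prefix 'everywz' and would otherwise fall outside the range.

```python
def prefix_range(prefix, full_length=8):
    """Return (lower, upper) bounds covering every geohash that starts with `prefix`."""
    lower = prefix.ljust(full_length, "0")  # '0' is the smallest geohash character
    upper = prefix.ljust(full_length, "z")  # 'z' is the largest geohash character
    return lower, upper

def range_queries(geohash, table="geohash_clustering"):
    """Build the CQL text for each search step, smallest area first.

    String formatting is used only to show the query shape; real code
    should use prepared statements with bind parameters.
    """
    partition = geohash[0]
    queries = []
    for length in range(len(geohash), 1, -1):
        lower, upper = prefix_range(geohash[:length], len(geohash))
        queries.append(
            f"select * from {table} where geohash1='{partition}' "
            f"and geohash8>='{lower}' and geohash8<='{upper}' limit 50;"
        )
    # Widest step: the whole single-character partition.
    queries.append(f"select * from {table} where geohash1='{partition}' limit 50;")
    return queries
```

For the full-length geohash the lower and upper bounds coincide, so the first generated query behaves like the equality query.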
Geohash
Results
Option 3 is best = 34%
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2, Geohash Option 3]
Another
Option:
Cassandra
Lucene
Index Plugin
A Concordance
Cassandra Lucene Index Plugin
• The Cassandra Lucene Index is a plugin for Apache Cassandra:
o That extends its index functionality to provide near real-time search, including full-text search capabilities and free
multivariable, geospatial and bitemporal search
o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of
the cluster indexes its own data.
• Instaclustr supports the plugin
o Optional add-on to managed Cassandra service
o And code support
https://github.com/instaclustr/cassandra-lucene-index
• How does this help for Geospatial queries?
o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search,
geo shape search, multiple distance units, geo transformations, and complex geo shapes.
Cassandra Table and
Lucene Indexes
• Geopoint Example
• Under the hood indexing is
done using a tree structure
with geohashes
(configurable precision)
CREATE TABLE latlong_lucene (
geohash1 text,
value double,
time timestamp,
latitude double,
longitude double,
Primary key (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE CUSTOM INDEX latlong_index ON latlong_lucene ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
geohash1: {type: "string"},
value: {type: "double"},
time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"},
place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
}
}'
};
Search Options
• Sort
• Sophisticated but complex
semantics (see the docs)
SELECT value FROM latlong_lucene
WHERE expr(latlong_index,
'{ sort: [ {field: "place", type:
"geo_distance", latitude: <lat>,
longitude: <long>}, {field: "time",
reverse: true} ] }') and
geohash1=<geohash> limit 50;
Search Options
• Bounding Box Filter
• Need to compute box
corners
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: {
type: "geo_bbox", field: "place",
min_latitude: <minLat>,
max_latitude: <maxLat>,
min_longitude: <minLon>,
max_longitude: <maxLon> }}')
limit 50;
Search Options
• Geo Distance Filter
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: { type:
"geo_distance", field: "place", latitude: <lat>,
longitude: <long>, max_distance: "<distance>km" } }') and
geohash1='<hash1>' limit 50;
Search Options: Prefix Filter
Prefix search is useful for searching larger areas over a single geohash column
as you can search for a substring
SELECT value FROM latlong_lucene
WHERE expr(latlong_index, '{ filter: [ {type:
"prefix", field: "geohash1", value:
<geohash>} ] }') limit 50;
Similar to inequality over clustering column
Lucene
Results
Options = 2-25%
Best is prefix filter (25%)
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2, Geohash Option 3, Lucene sort, Lucene filter bounded box, Lucene filter geo distance, Lucene filter prefix over geohash]
Overall
Geohash options are
faster (25%, 34%)
Lucene bounded box/geo
distance most accurate but only
5% of baseline performance
[Bar chart: Normalised (%) throughput. Series: Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2, Geohash Option 3, Lucene sort, Lucene filter bounded box, Lucene filter geo distance, Lucene filter prefix over geohash. Geohash and Lucene groups are labelled on the chart.]
3D
(Up and
Down)
Who needs it?
Location, Altitude and Volume
• 3D Geohashes
represent 2D location, altitude,
and volume
• A 3D geohash is a cube
• Demo 3D Geohash java code
o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762
o produces valid 3D geohashes for altitudes from 13km below sea level to
geostationary satellite orbit
o Can be used for any of the geohash options
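One possible way to extend a geohash with altitude is to interleave a third bit stream, as sketched below. This is not the algorithm in the linked Java gist: the bit order (longitude, latitude, altitude), the base-32 packing, the function name `geohash3d_encode` and the altitude bounds (13 km below sea level up to roughly geostationary height, taken here as 35,786 km) are all assumptions made for illustration. Cells produced this way are not true cubes.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet
ALT_MIN_M = -13_000.0       # assumed lower bound: 13 km below sea level
ALT_MAX_M = 35_786_000.0    # assumed upper bound: about geostationary altitude

def geohash3d_encode(lat, lon, alt_m, length=9):
    """Encode latitude, longitude (degrees) and altitude (metres) as one string."""
    ranges = [[-180.0, 180.0], [-90.0, 90.0], [ALT_MIN_M, ALT_MAX_M]]
    coords = [lon, lat, alt_m]
    chars = []
    bits = 0
    bit_count = 0
    axis = 0  # cycles through longitude, latitude, altitude
    while len(chars) < length:
        lo, hi = ranges[axis]
        mid = (lo + hi) / 2
        if coords[axis] >= mid:
            bits = (bits << 1) | 1
            ranges[axis][0] = mid  # keep the upper half
        else:
            bits = bits << 1
            ranges[axis][1] = mid  # keep the lower half
        axis = (axis + 1) % 3
        bit_count += 1
        if bit_count == 5:         # 5 bits = one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

As with 2D geohashes, a shorter code is a prefix of the longer code for the same point, so the same prefix and range query patterns apply.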
More Information?
• https://www.instaclustr.com/paul-brebner/
• Latest Blog Series – Globally distributed Streaming, Storage and Search
o Application is deployed in multiple locations, data is replicated, or sent where/when it’s needed
o “Around the World” series: Part 3 introduces a Stock Trading application
o https://www.instaclustr.com/building-a-low-latency-distributed-stock-broker-application-part-3/
Blogs
The End
Try out the Instaclustr Managed Platform for Open Source
• https://www.instaclustr.com/platform/
• Free Trial:
https://console.instaclustr.com/user/signup?coupon-code=WORKSHOP
© Instaclustr Pty Limited, 2020
https://www.instaclustr.com/company/policies/terms-conditions/
Except as permitted by the copyright law applicable to you, you may
not reproduce, distribute, publish, display, communicate or transmit any
of the content of this document, in any form, by any means, without
the prior written permission of Instaclustr Pty Limited.

Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and Cassandra - Sydney Data Engineering Meetup 14 May 2020

  • 1. Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and Cassandra Paul Brebner instaclustr.com Technology Evangelist Sydney Data Engineering Meetup 14 May 2020 (online) © Instaclustr Pty Limited, 2020
  • 2. Instaclustr Overview Reliability @ Scale Managed & Supported Open Source Extensive & Unmatched Expertise Exclusively open-source Expert-level support and consulting services Reliable, secure and performance operation at scale Available in the cloud or on-prem Influential community engagement Solving the complexities of designing, deploying and managing complex data layer technologies. Delivering managed solutions for the most powerful open source technologies through our integrated data layer platform. Over 50+ million node hours and 3 Petabytes under management. Founded in 2012 CBR Australia 100+ Employees in 4 offices HQ in Palo Alto USA 24x7 TechOps & Dev in Australia 300+ Customers Globally Global Presence Elasticsearch
  • 3. Overview • In the News (location) • Anomaly Detection: baseline throughput • Spatial Anomaly Detection problem • Solutions: location representation and querying/indexing o Bounding boxes and secondary indexes o Geohashes o Lucene index o Results o 3D © 2020 Instaclustr Pty Limited
  • 4. In the News John Conway Legendary Polymath Passed away from Covid-19
  • 5. Game of Life Next state of each cell depends on state of immediate neighbors
  • 6. Game of Life Simple rules but complex patterns
  • 7. Also In the News Social distancing and COVID-19 tracing Uncle Ron’s social distancing 3000 invention Or COVIDSAFE App?
  • 8. Also In the News “UFO” photos declassified by USA And “planet-killer” asteroid missed the Earth in April (16x moon orbit)
  • 9. Previously… Anomaly Detection Spot the difference At speed (< 1s RT) and scale (high throughput, lots of data)
  • 10. How Does It Work? • CUSUM (Cumulative Sum Control Chart) • Statistical analysis of historical data • Data for a single variable/key at a time • Potentially billions of keys
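The slide names CUSUM but does not show the calculation. As a rough illustration only, here is a textbook two-sided CUSUM check in Python over the recent history of one key. The function name, the slack `k` and the threshold `h` are assumptions chosen for this sketch; they are not taken from the talk or from the speaker's code.

```python
def cusum_is_anomaly(history, new_value, k=0.5, h=5.0):
    """Two-sided CUSUM over history plus the new value.

    history: earlier values for a single key (e.g. the most recent 50).
    k: slack in standard deviations (assumed value).
    h: decision threshold in standard deviations (assumed value).
    """
    n = len(history)
    if n < 2:
        return False  # not enough data to estimate mean and spread
    mean = sum(history) / n
    variance = sum((x - mean) ** 2 for x in history) / (n - 1)
    std = variance ** 0.5
    if std == 0:
        # Constant history: any different value counts as a change.
        return new_value != mean
    pos = neg = 0.0
    for x in list(history) + [new_value]:
        z = (x - mean) / std
        pos = max(0.0, pos + z - k)  # accumulates upward drift
        neg = max(0.0, neg - z - k)  # accumulates downward drift
    return pos > h or neg > h
```

Because the check only needs the values for one key, it fits the "single variable/key at a time" constraint on the slide.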
  • 11. Pipeline Design • Interaction with Kafka, Cassandra and Kubernetes clusters – Kafka handles streaming, Cassandra stores the data, Kubernetes scales the application • Efficient Cassandra data writes and reads by key, a unique “account ID” or similar
  • 12. Cassandra Data Model • Events are time series • Id is the partition key, time is the clustering key (order) • Read gets the most recent 50 values for an id, very fast create table event_stream ( id bigint, time timestamp, value double, primary key (id, time) ) with clustering order by (time desc); select value from event_stream where id=314159265 limit 50;
  • 13. Baseline Throughput 19 Billion Anomaly Checks/Day = 100% (Bar chart: normalised throughput (%) for Baseline (single transaction ID))
  • 14. Harder Problem: Spot the Differences in Space “Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space.” Douglas Adams, The Hitchhiker’s Guide to the Galaxy
  • 15. Spatial Anomalies Many and varied
  • 16. Real Example: John Snow No, not this one!
  • 17. John Snow’s 1854 Cholera Map • Deaths per household + location • Used to identify a polluted pump (X) • Some outliers: brewers drank beer, not water!
  • 18. But… First you have to know where you are: Location To usefully represent location you need: • Coordinate system • Map • Scale
  • 19. Better • <lat, long> coordinates • Scale • Interesting locations “bulk of treasure here”
  • 20. Geospatial Anomaly Detection South Atlantic Geomagnetic Anomaly New problem… • Rather than a single ID, events now have a location (and a value) • The problem now is to o find the nearest 50 events to each new event o quickly (< 1s RT) • Can’t make any assumptions about geospatial properties of events o including location, density or distribution, i.e. where, or how many o need to search from the smallest to increasingly larger areas o e.g. the South Atlantic Geomagnetic Anomaly is BIG • Uber uses similar technologies to o forecast demand o increase the area until they have sufficient data for predictions • Can we use <lat, long> as the Cassandra partition key? o Yes, compound partition keys are allowed. o But then queries can only select exact locations.
  • 21. How to Compute Nearness To compute the distance between locations you need a coordinate system, e.g. a Mercator map: flat earth, distortion nearer the poles
  • 22. World is (approx.) Spherical Calculation of distance between two latitudinal/longitudinal points is non-trivial
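One common way to compute the distance between two latitude/longitude points on a sphere is the haversine formula. The sketch below is illustrative and is not taken from the talk; it assumes a perfectly spherical Earth with a mean radius of 6371 km, so results are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the real Earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    a = min(1.0, a)  # guard against rounding slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

For example, `haversine_km(0, 0, 1, 0)` gives the length of one degree of latitude under this spherical assumption.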
  • 23. Bounding Box Approximation of distance using inequalities
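To make the bounding-box idea concrete, here is a minimal Python sketch that turns a centre point and a radius into four latitude/longitude bounds, so that "near" becomes a set of simple inequalities. The constant and function names are assumptions for illustration. The sketch deliberately ignores the poles and the ±180° meridian, where a single box does not work.

```python
import math

KM_PER_DEGREE_LAT = 111.195  # approximate, assuming a spherical Earth of radius 6371 km


def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing a circle of radius_km.

    Not valid near the poles or across the +/-180 degree meridian.
    """
    dlat = radius_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    dlon = radius_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


def in_box(box, lat, lon):
    """Nearness test expressed purely as inequalities."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

The box is a square in kilometres but not in degrees: at 60° latitude it spans twice as many degrees of longitude as of latitude.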
  • 24. Bounding Boxes and Cassandra? • Use “country” partition key, lat/long/time clustering keys • But can’t run the query with multiple inequalities CREATE TABLE latlong ( country text, lat double, long double, time timestamp, PRIMARY KEY (country, lat, long, time) ) WITH CLUSTERING ORDER BY (lat ASC, long ASC, time DESC); select * from latlong where country='nz' and lat >= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50; InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "long" cannot be restricted (preceding column "lat" is restricted by a non-EQ relation)"
  • 25. Secondary Indexes to the Rescue? Secondary Indexes o create index i1 on latlong (lat); o create index i2 on latlong (long); • But same restrictions as clustering columns. SASI - SSTable Attached Secondary Index • Supports more complex queries more efficiently o create custom index i1 on latlong (long) using 'org.apache.cassandra.index.sasi.SASIIndex'; o create custom index i2 on latlong (lat) using 'org.apache.cassandra.index.sasi.SASIIndex'; • select * from latlong where country='nz' and lat >= -39.58 and lat <= -38.67 and long >= 175.18 and long <= 176.08 limit 50 allow filtering; • “allow filtering” may be inefficient (if many rows have to be retrieved prior to filtering) and isn’t suitable for production. • But SASI docs say o even though “allow filtering” must be used with 2 or more column inequalities, there is actually no filtering taking place
  • 26. Results SASI is very slow (< 1% of baseline) (Bar chart: normalised throughput (%) for Baseline (single transaction ID) and SASI)
  • 27. Geohashes to the Rescue? • Divide maps into named and hierarchical areas • We’ve been using something similar already: “country” partition key E.g. plate tectonics
  • 28. Geohashes Rectangular areas Variable length base-32 string Single char regions 5,000km x 5,000km Each extra letter gives 32 sub-areas 8 chars is 40m x 20m En/de-code lat/long to/from geohash But: Edge cases, non-linear near poles
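The encoding step can be sketched with the standard public geohash algorithm: alternately halve the longitude and latitude ranges, emit one bit per halving, and map each group of five bits to a base-32 character. This is a minimal Python sketch of that general algorithm, not the code used in the talk; the function name is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, length=8):
    """Encode a latitude/longitude in degrees as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

A shorter geohash is always a prefix of a longer one for the same point, which is what makes the hierarchical "each extra letter gives 32 sub-areas" property usable as a database key.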
  • 29. Some Geohashes Are Words “ketchup” is in Africa
  • 30. Some Geohashes Are Words 153m x 153m
  • 31. “Trump” Is in Kazakhstan! Not to scale 5km x 5km
  • 32. Modifications for Geohashes • Lat/long encoded as geohash • Geohash is new key • Geohash used to query Cassandra
  • 33. Geohashes and Cassandra Option 1: Multiple Indexed Geohash Columns In theory, geohashes work well for database indexes CREATE TABLE geohash1to8 ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE INDEX i8 ON geohash1to8 (geohash8); CREATE INDEX i7 ON geohash1to8 (geohash7); CREATE INDEX i6 ON geohash1to8 (geohash6); CREATE INDEX i5 ON geohash1to8 (geohash5); CREATE INDEX i4 ON geohash1to8 (geohash4); CREATE INDEX i3 ON geohash1to8 (geohash3); CREATE INDEX i2 ON geohash1to8 (geohash2);
  • 34. • Query from smallest to largest areas • Stop when 50 rows found select * from geohash1to8 where geohash1='e' and geohash7='everywh' limit 50; select * from geohash1to8 where geohash1='e' and geohash6='everyw' limit 50; select * from geohash1to8 where geohash1='e' and geohash5='every' limit 50; select * from geohash1to8 where geohash1='e' and geohash4='ever' limit 50; select * from geohash1to8 where geohash1='e' and geohash3='eve' limit 50; select * from geohash1to8 where geohash1='e' and geohash2='ev' limit 50; select * from geohash1to8 where geohash1='e' limit 50; Tradeoffs? Multiple secondary columns/indexes, multiple queries; accuracy and number of queries depend on spatial distribution and density
  • 35. Results Option 1 = 10% (Bar chart: normalised throughput (%) for Baseline (single transaction ID), SASI, Geohash Option 1)
  • 36. Geohashes and Cassandra Option 2: Denormalized Multiple Tables Denormalization is “Normal” in Cassandra Create 8 tables, one for each geohash length CREATE TABLE geohash1 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC); … CREATE TABLE geohash8 ( geohash text, time timestamp, value double, PRIMARY KEY (geohash, time) ) WITH CLUSTERING ORDER BY (time DESC);
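With one table per geohash length, each incoming event has to be written once per table, keyed by the matching prefix. The Python sketch below only builds those rows as plain dictionaries; the function name is hypothetical, the table names `geohash1` to `geohash8` follow the slide, and no Cassandra driver calls are shown.

```python
def denormalized_writes(geohash8, time, value):
    """Return one (table_name, row) pair per geohash length, shortest first.

    Table names follow the geohash1 .. geohash8 naming on the slide;
    the row layout mirrors the (geohash, time, value) columns.
    """
    return [
        (f"geohash{n}", {"geohash": geohash8[:n], "time": time, "value": value})
        for n in range(1, len(geohash8) + 1)
    ]
```

This makes the write amplification explicit: one event becomes eight writes, in exchange for single-partition reads at every area size.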
  • 37. • Select from smallest to largest areas using corresponding table select * from geohash8 where geohash='everywhe' limit 50; select * from geohash7 where geohash='everywh' limit 50; select * from geohash6 where geohash='everyw' limit 50; select * from geohash5 where geohash='every' limit 50; select * from geohash4 where geohash='ever' limit 50; select * from geohash3 where geohash='eve' limit 50; select * from geohash2 where geohash='ev' limit 50; select * from geohash1 where geohash='e' limit 50; Tradeoffs? Multiple tables and writes, multiple queries
  • 38. Results Option 2 = 20% (Bar chart: normalised throughput (%) for Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2)
  • 39. Geohashes and Cassandra Option 3: Clustering Column(s) Similar to Option 1 but using clustering columns CREATE TABLE geohash1to8_clustering ( geohash1 text, time timestamp, geohash2 text, geohash3 text, geohash4 text, geohash5 text, geohash6 text, geohash7 text, geohash8 text, value double, PRIMARY KEY (geohash1, geohash2, geohash3, geohash4, geohash5, geohash6, geohash7, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash2 DESC, geohash3 DESC, geohash4 DESC, geohash5 DESC, geohash6 DESC, geohash7 DESC, geohash8 DESC, time DESC);
  • 40. How Do Clustering Columns Work? Good for Hierarchical Data • Clustering columns are good for modelling and efficient querying of hierarchical/nested data • A query must restrict the higher-level columns with the equality operator; a range is only allowed on the last column in the query; lower-level columns don’t have to be included. For example: o select * from geohash1to8_clustering where geohash1='e' and geohash2='ev' and geohash3 >= 'ev0' and geohash3 <= 'evz' limit 50; • But why have multiple clustering columns when one is actually enough…
  • 41. Better: Single Geohash Clustering Column Geohash8 and time are clustering keys CREATE TABLE geohash_clustering ( geohash1 text, time timestamp, geohash8 text, lat double, long double, PRIMARY KEY (geohash1, geohash8, time) ) WITH CLUSTERING ORDER BY (geohash8 DESC, time DESC);
  • 42. Inequality Range Query • With decreasing length geohashes • Stop when result has 50 rows select * from geohash_clustering where geohash1='e' and geohash8='everywhe' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='everywh0' and geohash8<='everywhz' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='everyw00' and geohash8<='everywzz' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='every000' and geohash8<='everyzzz' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='ever0000' and geohash8<='everzzzz' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='eve00000' and geohash8<='evezzzzz' limit 50; select * from geohash_clustering where geohash1='e' and geohash8>='ev000000' and geohash8<='evzzzzzz' limit 50; select * from geohash_clustering where geohash1='e' limit 50;
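The widening search can be simulated in memory to show the control flow: build a lower and upper bound for each shorter prefix, run the range lookup, and stop as soon as enough rows come back. In this Python sketch a sorted list stands in for the `geohash8` clustering column, and the function names are assumptions. The bounds are padded to the full eight characters with `0` and `z` (the smallest and largest geohash characters) so that values such as `everywzz` still fall inside the `everyw` range. Partitioning by `geohash1` and ordering by time are left out.

```python
import bisect


def prefix_range(geohash8, prefix_len):
    """Lower and upper string bounds covering every geohash with this prefix."""
    prefix = geohash8[:prefix_len]
    pad = len(geohash8) - prefix_len
    return prefix + "0" * pad, prefix + "z" * pad


def nearest_rows(sorted_hashes, geohash8, limit=50):
    """Search from the smallest area outwards; stop when `limit` rows are found.

    sorted_hashes: sorted list of full-length geohashes (stand-in for the table).
    """
    rows = []
    for prefix_len in range(len(geohash8), 0, -1):
        lo, hi = prefix_range(geohash8, prefix_len)
        start = bisect.bisect_left(sorted_hashes, lo)
        end = bisect.bisect_right(sorted_hashes, hi)
        rows = sorted_hashes[start:end][:limit]
        if len(rows) >= limit:
            break
    return rows
```

Dense areas stop after one or two lookups, while sparse areas fall through to the single-character prefix, which is why the number of queries depends on the spatial distribution of the data.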
  • 43. Geohash Results Option 3 is best = 34% (Bar chart: normalised throughput (%) for Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2, Geohash Option 3)
  • 45. Cassandra Lucene Index Plugin • The Cassandra Lucene Index is a plugin for Apache Cassandra: o That extends its index functionality to provide near real-time search, including full-text search capabilities and free multivariable, geospatial and bitemporal search o It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. • Instaclustr supports the plugin o Optional add-on to managed Cassandra service o And code support https://github.com/instaclustr/cassandra-lucene-index • How does this help for Geospatial queries? o Has very rich geospatial semantics including geo points, geo shapes, geo distance search, geo bounding box search, geo shape search, multiple distance units, geo transformations, and complex geo shapes.
  • 46. Cassandra Table and Lucene Indexes • Geopoint Example • Under the hood indexing is done using a tree structure with geohashes (configurable precision) CREATE TABLE latlong_lucene ( geohash1 text, value double, time timestamp, latitude double, longitude double, PRIMARY KEY (geohash1, time) ) WITH CLUSTERING ORDER BY (time DESC); CREATE CUSTOM INDEX latlong_index ON latlong_lucene () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = { 'refresh_seconds': '1', 'schema': '{ fields: { geohash1: {type: "string"}, value: {type: "double"}, time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"}, place: {type: "geo_point", latitude: "latitude", longitude: "longitude"} } }' };
  • 47. Search Options • Sort • Sophisticated but complex semantics (see the docs) SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ sort: [ {field: "place", type: "geo_distance", latitude: " + <lat> + ", longitude: " + <long> + "}, {field: "time", reverse: true} ] }') and geohash1=<geohash> limit 50;
  • 48. Search Options • Bounding Box Filter • Need to compute box corners SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_bbox", field: "place", min_latitude: " + <minLat> + ", max_latitude: " + <maxLat> + ", min_longitude: " + <minLon> + ", max_longitude: " + <maxLon> + " }}') limit 50;
  • 49. Search Options • Geo Distance Filter SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: { type: "geo_distance", field: "place", latitude: " + <lat> + ", longitude: " + <long> + ", max_distance: " + <distance> + "km" } }') and geohash1=' + <hash1> + ' limit 50;
  • 50. Search Options: Prefix Filter Prefix search is useful for searching larger areas over a single geohash column, as you can search for a substring SELECT value FROM latlong_lucene WHERE expr(latlong_index, '{ filter: [ {type: "prefix", field: "geohash1", value: <geohash>} ] }') limit 50; Similar to inequality over clustering column
  • 51. Lucene Results Options = 2-25% Best is prefix filter (25%) (Bar chart: normalised throughput (%) for Baseline (single transaction ID), SASI, Geohash Option 1, Geohash Option 2, Geohash Option 3, Lucene sort, Lucene filter bounded box, Lucene filter geo distance, Lucene filter prefix over geohash)
  • 52. Overall Geohash options are faster (25%, 34%) (Same bar chart, geohash options highlighted)
  • 53. Overall Geohash options are faster (25%, 34%) Lucene bounded box/geo distance most accurate but only 5% of baseline performance (Same bar chart, Lucene options highlighted)
  • 54. 3D (Up and Down) Who needs it?
  • 55. Location, Altitude and Volume • 3D Geohashes represent 2D location, altitude, and volume • A 3D geohash is a cube
  • 56. More Information? • Demo 3D Geohash Java code o https://gist.github.com/paulbrebner/a67243859d2cf38bd9038a12a7b14762 o Produces valid 3D geohashes for altitudes from 13km below sea level to geostationary satellite orbit o Can be used for any of the geohash options
  • 57. Blogs • https://www.instaclustr.com/paul-brebner/ • Latest Blog Series – Globally distributed Streaming, Storage and Search o Application is deployed in multiple locations, data is replicated, or sent where/when it’s needed o “Around the World” series: Part 3 introduces a Stock Trading application o https://www.instaclustr.com/building-a-low-latency-distributed-stock-broker-application-part-3/
  • 58. The End Try out the Instaclustr Managed Platform for Open Source • https://www.instaclustr.com/platform/ • Free Trial: https://console.instaclustr.com/user/signup?coupon-code=WORKSHOP © Instaclustr Pty Limited, 2020 https://www.instaclustr.com/company/policies/terms-conditions/ Except as permitted by the copyright law applicable to you, you may not reproduce, distribute, publish, display, communicate or transmit any of the content of this document, in any form, by any means, without the prior written permission of Instaclustr Pty Limited