Using Spatial Data in Elasticsearch

Table of Contents

1. Introduction
2. Requirements
3. Exercise Data
4. Exercise 1 - Installation
5. Exercise 2 - Indexing
i. Create Index
ii. Insert Into Index
iii. Update Mapping
iv. Delete & Recreate Index
v. Load Accidents
6. Exercise 3 - Query
i. Geo Shape Filter
ii. Geo Shape Filter - Pre-Indexed
iii. Geo Distance Filter
iv. Geo Distance Aggregation
v. Geo Bounds Aggregation
7. Exercise 4 - Managing Data
Introduction
This book provides an introduction to the use of spatial data and functions in Elasticsearch.

It is organised as a series of exercises which can be worked through in order.

The book assumes some basic knowledge of Elasticsearch. For more comprehensive documentation, see the official Elasticsearch reference.
Requirements
To work through the exercises you'll need to install VirtualBox and Vagrant.

We also recommend the Sense extension for Chrome, which provides a JSON-aware interface to Elasticsearch.

Interactions with the Elasticsearch API are illustrated using cURL.

Grab the repo

$ git clone https://github.com/geoplex/elasticsearch-spatial

Python
A Python loader is provided in the repository which helps load shapefiles into Elasticsearch. To use it you'll need Python installed.

We recommend using virtualenv, a tool which provides isolated Python environments. With virtualenv you can install Python packages without affecting the packages used by other Python applications.

You can install virtualenv using pip (pip is a tool for installing and managing Python packages).

Note: if you are an OS X user, installing Python via Homebrew will include pip.

$ pip install virtualenv


Exercise Data
The repository contains some sample data which we'll use for the exercises. All the data can be found in elasticsearch-spatial/exercise_data.

The directory contains two archives which will be unpacked when the virtual machine is created:

Melbourne_accident.zip
Melbourne-Localities.zip

The data we're going to use is information on road accidents and suburb boundaries. Both datasets are available for
download via the Victorian (Australia) State Government Data Directory.

The road accident information is a point dataset and the suburb boundary dataset is a polygon dataset.
Exercise 1 - Installation
First, let's get the exercise environment ready by changing into the cloned repository:

$ cd elasticsearch-spatial

Now we need to create a virtual machine and install Elasticsearch. With Vagrant this is easy. The Vagrant configuration provided with this tutorial uses a base box named precise32 (a Vagrant box is a pre-packaged machine environment). The first time you do this, Vagrant needs to do two things, so it takes a little longer:

pull down the base machine (i.e., the Vagrant box)
provision the base machine with Elasticsearch

To do these things, run:

$ vagrant box add precise32 http://files.vagrantup.com/precise32.box


$ vagrant up

During the installation Vagrant provisions the machine with Elasticsearch, Kibana and some useful plugins:

Paramedic
Head
BigDesk

Vagrant also sets up port forwarding, which allows us to access the machine over SSH as well as to reach the Elasticsearch API from the host.

If you want to SSH into the machine you can run:

$ vagrant ssh

Paramedic, Head and BigDesk should be accessible at the following URLs:

http://127.0.0.1:9200/_plugin/paramedic/
http://127.0.0.1:9200/_plugin/head/
http://127.0.0.1:9200/_plugin/bigdesk/
Exercise 2 - Indexing
You can think of an index in Elasticsearch as a database. Indexes contain documents and a document is of a given
type. For example we might have an index called 'suburbs' and a type called 'suburb'. The index would contain
documents which define suburbs and conform to the type 'suburb'.

In Elasticsearch each index has a mapping, which is like a schema definition for all the types within the index.

There's a whole lot more information in the Elasticsearch documentation.

Elasticsearch supports two spatial types: geo_point and geo_shape. Elasticsearch accepts GeoJSON as its input geometry format. Because our source data is in Shapefile format, we need to translate it into GeoJSON before we can insert it into an index. You can think of each record within our source datasets as a document which we'll insert into an Elasticsearch index.

Elasticsearch will automatically create a mapping for a document type when the first document of that type is indexed. When creating the mapping, Elasticsearch infers the field data types from the document. For example, if we have a field called Name in our suburb dataset and it contains the suburb name 'Melbourne', then Elasticsearch will infer that the type of the Name field is 'string'. Unfortunately Elasticsearch does not automatically infer spatial field types, so we need to update the mapping manually.
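As a preview, here is a minimal sketch of what explicit spatial declarations look like in a type mapping; the field names point_location and geometry match the datasets used in this tutorial, and we'll build the real mappings step by step below:

"point_location" : {
    "type" : "geo_point"
}

"geometry" : {
    "type" : "geo_shape"
}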

In this exercise we'll run the following steps using the suburb data:

1. Create a new index.
2. Insert a single document and get Elasticsearch to automatically create the mapping based upon that document.
3. Retrieve the mapping and modify the field type for the geometry.
4. Delete and recreate the index.
5. Insert the modified mapping.
6. Index our suburb data into the index.
7. Run a single script to load the accident data.

At the end of the exercise all the suburb data will be loaded into a new index called 'suburbs' and all the accident data
will be loaded into an index called 'accidents'.
Create Index
Let's get started and create our empty index, ready for our suburb data.

$ curl -XPUT 'http://localhost:9200/suburbs/'

Elasticsearch should tell you that the index has been created:

$ {"acknowledged":true}

You can also check using Paramedic or Head, e.g.:

http://127.0.0.1:9200/_plugin/head/
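You can also check from the command line with the _cat API (available since Elasticsearch 1.0), which prints a summary of all indexes:

$ curl 'http://127.0.0.1:9200/_cat/indices?v'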
Insert Into Index
Now that we have our index we can insert a document into it. To do this we're going to use the Python loader which is in the loader directory of the repository.

Set up
Before running the loader, let's make sure we have the dependencies we need.

$ virtualenv LOADER
$ source LOADER/bin/activate
$ pip install numpy
$ brew create http://download.savannah.gnu.org/releases/lzip/lzlib-1.5.tar.gz
$ brew install lzlib
$ brew install gdal
$ pip install pyes
$ pip install fiona
$ pip install shapely

Load a record
To load a record we'll use the loader. You can find out the parameters it expects using:

$ python data-loader.py -h

To load a single suburb record you can use:

python data-loader.py '127.0.0.1:9200' 'suburbs' 'suburb' '../exercise_data/Melbourne-Localities/melbourne_locality_polygon.shp' 'id' --limit 1

The optional --limit argument specifies the number of records to load; in this case we only want to load one record.

Now we can test that our record has been loaded into Elasticsearch. Below we ask for the document indexed with the key
0 in the suburbs index.

$ curl -s -XGET http://127.0.0.1:9200/suburbs/suburb/0 | python -m json.tool

This will print the JSON response from Elasticsearch.
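The exact fields depend on the shapefile attributes, but the response should look roughly like this sketch:

{
    "_index" : "suburbs",
    "_type" : "suburb",
    "_id" : "0",
    "_version" : 1,
    "found" : true,
    "_source" : {
        "geometry" : { "type" : "Polygon", "coordinates" : [...] },
        ...
    }
}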


Retrieve the Type Mapping
We now have a single index called 'suburbs' with a single document loaded into it. Elasticsearch has created a mapping for our suburb type based upon that first document.

To retrieve the mapping:

curl -s -XGET "http://127.0.0.1:9200/suburbs/suburb/_mapping?pretty=true" > mapping.json

You will now have a file called mapping.json which contains the mapping for the suburbs index and suburb type. Open
the mapping.json file. Elasticsearch includes the index in the mapping:

{
    "suburbs" : {
        "mappings" : { ... }
    }
}

The put mapping API allows you to define a mapping for a specific type, so we need to remove the index wrapper from the mapping.json file, leaving the suburb type only:

{
    "suburb": { ... }
}

You'll notice that by default Elasticsearch has mapped the geometry field as an object with coordinates and type properties.

"geometry" : {
"properties" : {
"coordinates" : {
"type" : "double"
},
"type" : {
"type" : "string"
}
}
}

We need to update the mapping to tell Elasticsearch that the geometry field is actually a geo_shape type. Update the file as follows:

"geometry" : {
"type": "geo_shape"
}
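After the edit, the whole mapping.json should look roughly like the sketch below; the name field here is just a stand-in for whatever attribute fields Elasticsearch inferred from your shapefile:

{
    "suburb" : {
        "properties" : {
            "geometry" : {
                "type" : "geo_shape"
            },
            "name" : {
                "type" : "string"
            }
        }
    }
}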
Delete & Recreate Index
OK, now we want to remove our suburbs index and recreate it, this time using our modified mapping. This will let us load the suburb shapefile with Elasticsearch recognising and indexing the geometry correctly.

Delete the index

curl -XDELETE 'http://127.0.0.1:9200/suburbs'

Recreate the index

curl -XPUT 'http://127.0.0.1:9200/suburbs/'

Add the updated mapping

curl -XPUT '127.0.0.1:9200/suburbs/_mapping/suburb' --data @mapping.json

Finally, load in all the suburbs:

python data-loader.py '127.0.0.1:9200' 'suburbs' 'suburb' '../exercise_data/Melbourne-Localities/melbourne_locality_polygon.shp' 'id'

While the load is running you can monitor the application's behaviour using the Head, Paramedic or BigDesk plugins.
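Once the load has finished you can sanity-check the number of documents with the _count API:

curl 'http://127.0.0.1:9200/suburbs/_count?pretty'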
Load Accidents
Now that we've run through each of the individual steps for the suburb data, you can execute the commands below to load the accidents into a new index.

Create the new accident index and load in the pre-configured accident mapping:

curl -XPUT 'http://127.0.0.1:9200/accidents/'


curl -XPUT '127.0.0.1:9200/accidents/_mapping/accident' --data @/mappings/accident_mapping.json

Load in the accident data using the Python data loader (remember to activate your virtualenv if you haven't done so already):

python data-loader.py '127.0.0.1:9200' 'accidents' 'accident' '../exercise_data/Melbourne_accident/melbourne_accident.shp' 'id'


Exercise 3 - Query
We now have suburb and accident data loaded into our Elasticsearch instance. This exercise covers the basics of querying data in Elasticsearch, focusing on the geospatial query features.

There are two ways to search for data in Elasticsearch:

REST Request URI
REST Request Body

In this exercise we'll be using the REST Request Body method, but for completeness here is an example of a simple
query using the REST Request URI method:

curl 'localhost:9200/accidents/_search?q=pedestrian&pretty'

In the above example we're looking for any documents in the accidents index which involve pedestrians. We can run the same query using the REST Request Body method.

The URI search above queries the special _all field, so a close equivalent using the REST Request Body method is a match query against _all:

curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
    "query": { "match": { "_all": "pedestrian" } }
}'

Now that we have a basic idea of the query syntax and approach, let's move on to some specific geospatial queries.

A few things worth knowing before we continue: the size option controls how many hits are returned in the resultset (the default is 10); the hits.total field in the response tells you how many documents matched in total; and the query DSL distinguishes queries, which score documents, from filters, which simply include or exclude them. Filters don't compute relevance scores and can be cached, so they generally perform better, which is why the geospatial examples below wrap a match_all query with a filter, as shown in the sketch after this paragraph.
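As an illustration, here is a sketch of a filtered query combining size with a term filter; the field name severity is hypothetical, standing in for whichever attribute fields your accident shapefile actually contains:

curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '{
    "size": 5,
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": { "term": { "severity": "serious" } }
        }
    }
}'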
Geo Shape Filter
Elasticsearch allows us to query data using a Geo Shape filter. For example, if we want to identify all accidents which occurred within a particular area, we can make the following query:

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_shape": {
                    "geometry": {
                        "shape": {
                            "type": "Polygon",
                            "coordinates": [
                                [
                                    [144.9400520324707, -37.82158204850761],
                                    [144.9400520324707, -37.79391457604158],
                                    [145.0059700012207, -37.79391457604158],
                                    [145.0059700012207, -37.82158204850761],
                                    [144.9400520324707, -37.82158204850761]
                                ]
                            ]
                        }
                    }
                }
            }
        }
    }
}' | python -m json.tool

In the above example we're asking for any accidents which intersect with a basic polygon. This yields 3401 accidents and
Elasticsearch took 30ms to process the request:

{
    "took": 30,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3401,
        "max_score": 1,
        "hits": [....]
    }
}

Using this type of filter is useful if you want to pass arbitrary geometries to Elasticsearch. For example, this filter could be used to return all accidents within the current bounding box of a map view. For simple rectangles, another way of doing this would be to use the Geo Bounding Box filter, sketched below.
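A sketch of that bounding-box variant: it operates on the geo_point field (point_location, as used elsewhere in this tutorial) rather than the geo_shape field, and takes just the two corner points:

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "query": {
        "filtered": {
            "query": { "match_all": {} },
            "filter": {
                "geo_bounding_box": {
                    "point_location": {
                        "top_left": { "lat": -37.79391457604158, "lon": 144.9400520324707 },
                        "bottom_right": { "lat": -37.82158204850761, "lon": 145.0059700012207 }
                    }
                }
            }
        }
    }
}' | python -m json.tool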
Geo Shape Filter - Pre-Indexed
Elasticsearch also allows spatial data that is already indexed to be used as a filter:

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_shape": {
                    "geometry": {
                        "indexed_shape": {
                            "id": "293",
                            "type": "suburb",
                            "index": "suburbs",
                            "path": "geometry"
                        }
                    }
                }
            }
        }
    }
}' | python -m json.tool

In the above example we're asking for any accidents which intersect with a pre-indexed suburb with the id 293. This
yields:

{
    "took": 17,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 413,
        "max_score": 1,
        "hits": [....]
    }
}
Geo Distance Filter
The Geo Distance filter matches documents whose geo_point field lies within a given distance of an origin point. For example, to find all accidents within 1km of a point (note that for Melbourne the latitude is around -37.8 and the longitude around 144.98):

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "query": {
        "filtered": {
            "query": {
                "match_all": {}
            },
            "filter": {
                "geo_distance": {
                    "distance": "1km",
                    "accident.point_location": {
                        "lat": -37.79845,
                        "lon": 144.97959
                    }
                }
            }
        }
    }
}' | python -m json.tool

We can also sort results by distance using the _geo_distance sort. Each hit in the response will then carry a sort value containing its distance from the origin, in the requested unit:

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "sort": [
        {
            "_geo_distance": {
                "accident.point_location": {
                    "lat": -37.79845,
                    "lon": 144.97959
                },
                "order": "asc",
                "unit": "m"
            }
        }
    ],
    "query": {
        "match_all": {}
    }
}' | python -m json.tool
Geo Distance Aggregation
The Geo Distance aggregation buckets documents into distance rings around an origin point. Here we count accidents within 100m of the origin, and between 100m and 300m (the origin string is in "lat, lon" order):

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "size": 0,
    "aggs": {
        "rings": {
            "geo_distance": {
                "field": "point_location",
                "origin": "-37.79845, 144.97959",
                "unit": "m",
                "ranges": [
                    { "to": 100 },
                    { "from": 100, "to": 300 }
                ]
            }
        }
    }
}' | python -m json.tool
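The aggregations part of the response contains one bucket per ring, roughly like this sketch (the doc_count values depend on your data):

"aggregations" : {
    "rings" : {
        "buckets" : [
            { "key" : "*-100.0", "from" : 0, "to" : 100, "doc_count" : ... },
            { "key" : "100.0-300.0", "from" : 100, "to" : 300, "doc_count" : ... }
        ]
    }
}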

This time we add a nested aggregation which breaks each distance bucket down by day, using the day_of_w_1 field:

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "size": 0,
    "aggs": {
        "rings": {
            "geo_distance": {
                "field": "point_location",
                "origin": "-37.79845, 144.97959",
                "unit": "m",
                "ranges": [
                    { "to": 100 },
                    { "from": 100, "to": 300 }
                ]
            },
            "aggs": {
                "days": {
                    "terms": {
                        "field": "day_of_w_1"
                    }
                }
            }
        }
    }
}' | python -m json.tool
Geo Bounds Aggregation
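The Geo Bounds aggregation computes the bounding box that encloses all matching geo_point values, which is useful, for example, for zooming a map to fit a resultset. A minimal sketch against our accident data (geo_bounds requires Elasticsearch 1.3 or later):

curl -XPOST 'http://127.0.0.1:9200/accidents/_search' -d '{
    "size": 0,
    "aggs": {
        "extent": {
            "geo_bounds": {
                "field": "point_location"
            }
        }
    }
}' | python -m json.tool

The response contains a bounds object with top_left and bottom_right coordinates.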
Exercise 4 - Managing Data
In this exercise we'll cover the basic ways in which we can manage spatial data in our index.

The key operations are the usual CRUD set: create (index), read (get), update and delete documents. It's also worth knowing that geo_shape fields are indexed using a spatial prefix tree, either a geohash or a quadtree, whose precision can be tuned in the mapping; sketches of both follow.
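A hedged sketch of basic CRUD against the suburbs index; the id 999 and the name field are made up for illustration. Index (create) a document with a chosen id:

curl -XPUT 'http://127.0.0.1:9200/suburbs/suburb/999' -d '{
    "name": "Example Suburb",
    "geometry": { "type": "Point", "coordinates": [144.96, -37.81] }
}'

Read it back, then delete it:

curl -XGET 'http://127.0.0.1:9200/suburbs/suburb/999?pretty'
curl -XDELETE 'http://127.0.0.1:9200/suburbs/suburb/999'

And a sketch of tuning the spatial index on a geo_shape mapping: the tree option selects geohash or quadtree, and precision trades index size against accuracy:

"geometry" : {
    "type" : "geo_shape",
    "tree" : "geohash",
    "precision" : "100m"
}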
