Using Spatial Data in Elasticsearch
1. Introduction
2. Requirements
3. Exercise Data
4. Exercise 1 - Installation
5. Exercise 2 - Indexing
i. Create Index
ii. Insert Into Index
iii. Update Mapping
iv. Delete & Recreate Index
v. Load Accidents
6. Exercise 3 - Query
i. Geo Shape Filter
ii. Geo Shape Filter - Pre-Indexed
iii. Geo Distance Filter
iv. Geo Distance Aggregation
v. Geo Bounds Aggregation
7. Exercise 4 - Managing Data
Introduction
This book is aimed at providing an introduction to the use of spatial data and functions in Elasticsearch.
The book assumes some basic knowledge of Elasticsearch. For more comprehensive documentation, refer to the
official Elasticsearch reference documentation.
Requirements
To work through the exercises you'll need to install VirtualBox and Vagrant.
We also recommend using the Sense extension for Chrome, which provides a JSON-aware interface to Elasticsearch.
Python
A python loader is provided in the repository which helps load shapefiles into Elasticsearch. To use this you'll need
python installed.
We recommend using virtualenv, a tool which provides isolated environments of python. By using virtualenv you can
install python packages without affecting packages installed by other python applications.
You can install virtualenv using pip (pip is a tool for easily installing and managing python packages):
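$ pip install virtualenv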
Note: If you are an osx user, installing python via brew will include pip.
Exercise Data
The directory contains two archives which will be unpacked when the virtual machine is created:
Melbourne_accident.zip
Melbourne-Localities.zip
The data we're going to use is information on road accidents and suburb boundaries. Both datasets are available for
download via the Victorian (Australia) State Government Data Directory.
The road accident information is a point dataset and the suburb boundary dataset is a polygon dataset.
Exercise 1 - Installation
First, let's get the initial exercise ready:
$ cd el-spatial-tutorial
OK, now we need to create a virtual machine and install Elasticsearch. With Vagrant this is easy. The Vagrant script
provided with this tutorial uses a base box named precise32 (a Vagrant box is a pre-packaged environment). The
first time you do this, Vagrant needs to do two things: download the precise32 base box, and then create and
provision the virtual machine, so the first run can take a while.
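Both steps happen automatically when you bring the machine up:
$ vagrant up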
During the installation Vagrant provisions the machine with Elasticsearch, Kibana and some useful plugins:
Paramedic
Head
BigDesk
Vagrant also sets up some port forwarding which allows us to access the machine using ssh as well as the Elasticsearch
API.
$ vagrant ssh
Paramedic, Head and BigDesk should be accessible at the following URLs:
http://127.0.0.1:9200/_plugin/paramedic/
http://127.0.0.1:9200/_plugin/head/
http://127.0.0.1:9200/_plugin/bigdesk/
Exercise 2 - Indexing
You can think of an index in Elasticsearch as a database. Indexes contain documents and a document is of a given
type. For example we might have an index called 'suburbs' and a type called 'suburb'. The index would contain
documents which define suburbs and conform to the type 'suburb'.
In Elasticsearch each index has a mapping, which is like a schema definition for all the types within the index.
Elasticsearch supports two spatial types: geo_point and geo_shape. Elasticsearch accepts GeoJSON as its input
geometry format. Because our source data is in Shapefile format, we need to translate it into GeoJSON so we can insert it
into an index. You can think of each record within our source datasets as a document which we'll insert into an
Elasticsearch index.
Elasticsearch will automatically create a mapping for a document type when a new document type is indexed. When
creating the mapping, Elasticsearch infers the field data types from the document. For example if we have a field called
Name in our suburb dataset and it contains the suburb name 'Melbourne' then Elasticsearch will infer that the field type
for the Name field is 'string'. Unfortunately Elasticsearch does not automatically infer the field types for spatial data,
so we need to update the mapping manually.
In this exercise we'll run through the following steps using the suburb data: create an index, insert a document into it, update the mapping, delete and recreate the index, and finally load the full datasets.
At the end of the exercise all the suburb data will be loaded into a new index called 'suburbs' and all the accident data
will be loaded into an index called 'accidents'.
Create Index
Let's get started and create our empty index, ready for our suburb data.
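Using Sense or curl, a simple PUT request creates the index:
$ curl -XPUT 'localhost:9200/suburbs'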
Elasticsearch should tell you that the index has been created:
$ {"acknowledged":true}
You can also confirm the index exists using the head plugin: http://127.0.0.1:9200/_plugin/head/
Insert Into Index
Now that we have our index we can insert a document into it. To do this we're going to use a python loader which is in the
loader directory of the repository.
Set up
Before running the loader, let's make sure we have the dependencies we need:
$ virtualenv LOADER
$ source LOADER/bin/activate
$ pip install numpy
$ brew create http://download.savannah.gnu.org/releases/lzip/lzlib-1.5.tar.gz
$ brew install lzlib
$ brew install gdal
$ pip install pyes
$ pip install fiona
$ pip install shapely
Load a record
To load a record we'll use the loader. You can find out the parameters it expects using:
$ python data-loader.py -h
The --limit optional argument specifies the number of records to load, in this case we only want to load one record.
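An invocation will then look something like the following (the positional arguments shown here are hypothetical; check the -h output above for the loader's actual parameters):
# hypothetical invocation - argument names are assumptions
$ python data-loader.py --limit 1 Melbourne-Localities.shp suburbs suburb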
Now we can test that our record has been loaded into Elasticsearch. Below we ask for the document indexed with the key
0 in the suburbs index.
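We can fetch the document with a simple GET against the index, type and id:
$ curl -XGET 'localhost:9200/suburbs/suburb/0?pretty'
Update Mapping
Next we need a copy of the mapping Elasticsearch generated for us. One way to get it (the output redirection here is our own addition, not a step from the original tutorial) is to save the response of the _mapping endpoint to a file:
$ curl -XGET 'localhost:9200/suburbs/_mapping?pretty' > mapping.json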
You will now have a file called mapping.json which contains the mapping for the suburbs index and suburb type. Open
the mapping.json file. Elasticsearch includes the index in the mapping:
{
"suburbs" : {
"mappings" : {...
}
}
}
The PUT mapping endpoint on the Elasticsearch API allows you to define a mapping for a specific type, so we need to
remove the index definition from the mapping.json file. Remove the above section, leaving the suburb type only:
{
"suburb": {.....
}
}
You'll notice that by default Elasticsearch has defined the geometry field on the suburb type as an object containing
coordinates and type properties:
"geometry" : {
"properties" : {
"coordinates" : {
"type" : "double"
},
"type" : {
"type" : "string"
}
}
}
We need to update the mapping to tell Elasticsearch that the geometry field is actually a geo shape type. Update the file
as follows:
"geometry" : {
"type": "geo_shape"
}
Delete & Recreate Index
Ok now we want to remove our suburbs index and recreate it, this time using our modified mapping. This will enable us to
load the suburb shapefile and Elasticsearch will recognise and index the geometry correctly.
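A sketch of the required calls using curl, applying the mapping from the mapping.json file we edited above:
$ curl -XDELETE 'localhost:9200/suburbs'
$ curl -XPUT 'localhost:9200/suburbs'
$ curl -XPUT 'localhost:9200/suburbs/_mapping/suburb' -d @mapping.json
With the mapping in place, run the loader again without the --limit argument to load the full suburb dataset.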
While the load is running you can monitor the application behaviour by using the head, paramedic or bigdesk plugins.
Load Accidents
Now we've run through each of the individual steps to load the suburb information, you can follow the same process to load
the accidents into a new index.
Load in the accident data using the python data loader (remember to activate your virtualenv if you haven't done so
already):
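The exact invocation depends on the loader's parameters; something like the following (the arguments shown are hypothetical):
# hypothetical invocation - check -h for the actual arguments
$ python data-loader.py Melbourne_accident.shp accidents accident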
Exercise 3 - Query
In this exercise we'll be using the REST Request Body method, but for completeness here is an example of a simple
query using the REST Request URI method:
curl 'localhost:9200/accidents/_search?q=pedestrian&pretty'
In the above example we're looking for any documents in the accidents index which involve pedestrians. We can run the
same query using the REST Request Body method:
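A sketch of the equivalent query, using a query_string query to reproduce the behaviour of the q parameter:
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "query": {
    "query_string": { "query": "pedestrian" }
  }
}'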
A couple of things are worth knowing before we go further. The size option controls the number of hits returned in the
response (Elasticsearch returns 10 by default). The query DSL also distinguishes between queries and filters: filters don't
calculate relevance scores and can be cached, so for simple yes/no matching such as a term lookup, filters generally
perform better, as in the sketch below.
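A minimal sketch combining size with a term filter (the SEVERITY field and its value are hypothetical; substitute a real field from the accident data):
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "size": 5,
  "query": {
    "filtered": {
      "filter": {
        "term": { "SEVERITY": "1" }
      }
    }
  }
}'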
Now that we've got some basic idea of the query syntax and approach, let's move on to look at some specific geospatial
queries.
Geo Shape Filter
Elasticsearch allows us to query data using a Geo Shape filter. For example, if we want to identify all accidents which
occurred within a particular area, we can make the following query:
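A sketch of such a query, assuming the accident geometry field was mapped as geo_shape in the same way as the suburbs (the polygon coordinates here are an arbitrary area of inner Melbourne, not necessarily the shape that produced the result below):
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "filter": {
        "geo_shape": {
          "geometry": {
            "shape": {
              "type": "polygon",
              "coordinates": [[
                [144.90, -37.75], [145.05, -37.75], [145.05, -37.90],
                [144.90, -37.90], [144.90, -37.75]
              ]]
            }
          }
        }
      }
    }
  }
}'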
In the above example we're asking for any accidents which intersect with a basic polygon. This yields 3401 accidents and
Elasticsearch took 30ms to process the request:
{
"took": 30,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3401,
"max_score": 1,
"hits": [....]
}
}
Using this type of filter is useful if you want to pass arbitrary geometries to Elasticsearch. For example, this filter could be
used to return all accidents within the current bounding box of a map. Another way of doing this would be to use the Geo
Bounding Box filter, which operates on geo_point fields.
Pre-Indexed Query
Elasticsearch also provides a feature whereby indexed spatial data can be used as a filter.
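A sketch using the indexed_shape option of the geo shape filter, referencing the suburb document with id 293 (the path parameter names the field on that document which holds the shape):
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "filter": {
        "geo_shape": {
          "geometry": {
            "indexed_shape": {
              "index": "suburbs",
              "type": "suburb",
              "id": "293",
              "path": "geometry"
            }
          }
        }
      }
    }
  }
}'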
In the above example we're asking for any accidents which intersect with a pre-indexed suburb with the id 293. This
yields:
{
"took": 17,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 413,
"max_score": 1,
"hits": [....]
}
}
Geo Distance Filter
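The geo distance filter matches documents whose point lies within a given distance of a central point. It operates on geo_point fields rather than geo_shape, so the sketch below assumes the accident location was also indexed as a geo_point field called location (a hypothetical field name), with an origin in central Melbourne:
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "2km",
          "location": { "lat": -37.81, "lon": 144.96 }
        }
      }
    }
  }
}'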
Geo Distance Aggregation
The geo distance aggregation buckets documents by their distance from an origin point, and it can be combined with a nested aggregation which breaks each distance bucket down by day.
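A sketch of such an aggregation (the location and ACCIDENT_DATE field names are hypothetical; the nested date_histogram with a day interval provides the per-day breakdown):
curl -XPOST 'localhost:9200/accidents/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "accidents_by_distance": {
      "geo_distance": {
        "field": "location",
        "origin": { "lat": -37.81, "lon": 144.96 },
        "unit": "km",
        "ranges": [
          { "to": 1 },
          { "from": 1, "to": 5 },
          { "from": 5 }
        ]
      },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "ACCIDENT_DATE", "interval": "day" }
        }
      }
    }
  }
}'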
Exercise 4 - Managing Data
In this exercise we cover the basic CRUD (create, read, update and delete) operations for managing documents in an index.
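A minimal sketch of each operation against the suburbs index (the document id 9999 and the Name field contents are hypothetical):
# create (index) a document
curl -XPUT 'localhost:9200/suburbs/suburb/9999' -d '{ "Name": "Example" }'
# read it back
curl -XGET 'localhost:9200/suburbs/suburb/9999?pretty'
# partially update it
curl -XPOST 'localhost:9200/suburbs/suburb/9999/_update' -d '{ "doc": { "Name": "Renamed" } }'
# delete it
curl -XDELETE 'localhost:9200/suburbs/suburb/9999'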