Data modeling for Elasticsearch

Florian Hopf - @fhopf
GOTO nights Berlin
22.10.2015

What are we talking about?
●
Storing and querying data
●
String
●
Numeric
●
Date
●
Embedding documents
●
Types and Mapping
●
Updating data
●
Time stamped data

A relational view
●
Different aspects are stored in different tables
●
Traversal of tables via join operations
●
High degree of normalization

Documents
[Figure: one document { } that contains the Book together with its Author and Publisher]

Documents
●
Often more natural
●
Flexible schema
●
Fields can be queried
●
Duplicate storage of document parts

Documents
POST /library/book
{
"title": "Elasticsearch in Action",
"author": [ "Radu Gheorghe",
"Matthew Lee Hinman",
"Roy Russo" ],
"pages": 400,
"published": "2015-06-30T00:00:00.000Z",
"publisher": {
"name": "Manning",
"country": "USA"
}
}

Text
POST /library/book
{
"title": "Elasticsearch in Action",
"author": [ "Radu Gheorghe",
"Matthew Lee Hinman",
"Roy Russo" ],
"pages": 400,
"published": "2015-06-30T00:00:00.000Z",
"publisher": {
"name": "Manning",
"country": "USA"
}
}

Searching data
GET /library/book/_search?q=elasticsearch
{
"took": 75,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.067124054,
"hits": [
[...]
]
}
}

Searching data
GET /library/book/_search
{
"query": {
"match": {
"title": "elasticsearch"
}
}
}

Understand index storage
●
Data is stored in the inverted index
●
Analyzing process determines storage and
query characteristics
●
Important for designing data storage

Analyzing
Term Document Id
Action 1
Ein 2
Einstieg 2
Elasticsearch 1,2
in 1
praktischer 2
1. Tokenization
Document 1: Elasticsearch in Action
Document 2: Elasticsearch: Ein praktischer Einstieg

Analyzing
Term Document Id
action 1
ein 2
einstieg 2
elasticsearch 1,2
in 1
praktischer 2
1. Tokenization
2. Lowercasing
Document 1: Elasticsearch in Action
Document 2: Elasticsearch: Ein praktischer Einstieg

Search
Term Document Id
action 1
ein 2
einstieg 2
elasticsearch 1,2
in 1
praktischer 2
1. Tokenization
2. Lowercasing
Query: Elasticsearch → elasticsearch

Inverted Index
●
Terms are deduplicated
●
Original content is lost
●
Elasticsearch stores the original content in a
special field: _source
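
The analyzing steps and the term lookup can be sketched in a few lines of Python. This is only a toy model of an inverted index, not how Lucene stores it; the function names analyze, build_index and search are made up for this illustration.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: tokenize on non-letters, then lowercase."""
    tokens = re.findall(r"[^\W\d_]+", text)   # step 1: tokenization
    return [t.lower() for t in tokens]         # step 2: lowercasing

def build_index(docs):
    """Map every term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)            # sets deduplicate the terms
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Elasticsearch in Action",
    2: "Elasticsearch: Ein praktischer Einstieg",
}
index = build_index(docs)

def search(query):
    """Analyze the query the same way, then look up each term."""
    hits = set()
    for term in analyze(query):
        hits.update(index.get(term, []))
    return sorted(hits)

print(search("Elasticsearch"))   # both documents
print(search("praktisch"))       # nothing: only "praktischer" is indexed
```

The index only keeps terms and document ids, which is why the original text has to be kept separately if it should be returned.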

Inverted Index
●
New requirement: search for German content
●
praktischer → praktisch

Search
Term Document Id
action 1
ein 2
einstieg 2
elasticsearch 1,2
in 1
praktischer 2
1. Tokenization
2. Lowercasing
Query: praktisch → praktisch (no match, the index only contains praktischer)

Analyzing
Term Document Id
action 1
ein 2
einstieg 2
elasticsearch 1,2
in 1
praktisch 2
1. Tokenization
2. Lowercasing
3. Stemming
Document 1: Elasticsearch in Action
Document 2: Elasticsearch: Ein praktischer Einstieg

Search
Term Document Id
action 1
ein 2
einstieg 2
elasticsearch 1,2
in 1
praktisch 2
1. Tokenization
2. Lowercasing
3. Stemming
Query: praktisch → praktisch (matches the stemmed term of document 2)
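
Why the additional stemming step makes the query match can be shown with a small Python sketch. The function toy_stem is a hypothetical suffix stripper written for this example only; it is not the algorithm used by the german analyzer.

```python
def toy_stem(term):
    """Hypothetical stemmer: strips a few German endings from longer words."""
    for suffix in ("er", "en", "es", "em", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 4:
            return term[: -len(suffix)]
    return term

def analyze(text, stem=False):
    """Tokenize on whitespace, lowercase, and optionally stem."""
    terms = [t.lower() for t in text.replace(":", " ").split()]
    return [toy_stem(t) for t in terms] if stem else terms

title = "Elasticsearch: Ein praktischer Einstieg"
query = "praktisch"

# Without stemming the query term is not among the indexed terms.
without_stemming = set(analyze(query)) & set(analyze(title))
# With stemming on both sides, index and query agree on "praktisch".
with_stemming = set(analyze(query, stem=True)) & set(analyze(title, stem=True))

print(without_stemming)
print(with_stemming)
```

The important part is that the same analysis is applied when indexing and when searching.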

Mapping
curl -XPUT "http://localhost:9200/library/book/_mapping" -d'
{
"book": {
"properties": {
"title": {
"type": "string",
"analyzer": "german"
}
}
}
}'

Understand index storage
●
For every indexed document Elasticsearch
builds a mapping from the fields in the
documents
●
Sane defaults for lots of use cases
●
But: understand and control it and your data

_all
●
Default search field _all
"book": {
"_all": {
"enabled": false
}
}

Partial Word Matches
●
New requirement: Search for parts of words
●
elastic → elasticsearch

●
Common option: Using wildcards
POST /library/book/_search
{
"query": {
"wildcard": {
"title": {
"value": "elastic*"
}
}
}
}

●
Wildcards
●
Query time option
●
Scalability?

●
Alternative: Index Time preprocessing
●
Terms are stored in the index in a special way
●
Search is then a normal lookup
●
For partial words: N-Grams

N-Grams
●
Configuring an N-Gram analyzer
●
Builds N-Grams
●
elas
●
elast
●
elasti
●
elastic
●
elastics
●
...
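
A minimal Python sketch of how such prefixes could be generated for a single token, assuming lengths from 4 to 8 characters. The function edge_ngrams is a hypothetical helper for illustration, not part of any Elasticsearch client.

```python
def edge_ngrams(token, min_gram=4, max_gram=8):
    """Return the lowercased prefixes of token with min_gram..max_gram characters."""
    token = token.lower()
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]

grams = edge_ngrams("Elasticsearch")
print(grams)

# At search time "elastic" is a plain term lookup among the stored prefixes.
print("elastic" in grams)
```

Tokens shorter than min_gram produce no prefixes at all, so very short words can't be found through such a field.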

Index Settings for N-Grams
PUT /library-ngram
{
"settings": {
"analysis": {
"analyzer": {
"prefix_analyzer": {
"type": "custom",
"tokenizer": "prefix_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"prefix_tokenizer": {
"type": "edgeNGram",
"min_gram" : "4",
"max_gram" : "8",
"token_chars": [ "letter", "digit" ]
}
}
}}}

Mapping for N-Grams
PUT /library-ngram/book/_mapping
{
"book": {
"properties": {
"title": {
"type": "string",
"analyzer": "german",
"fields": {
"prefix": {
"type": "string",
"index_analyzer": "prefix_analyzer",
"search_analyzer": "lowercase"
}
}
}
}
}
}

Additional Field
●
Indexed Document stays the same
●
Additional index field title.prefix
●
Can be queried like any field

Querying additional Field
GET /library-ngram/book/_search
{
"query": {
"match": {
"title.prefix": "elastic"
}
}
}

Querying additional Field
GET /library-ngram/book/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"title": "elastic"
}
},
{
"match": {
"title.prefix": "elastic"
}
}
]
}
}
}

Additional Field
●
Increased storage requirements
●
Increased scalability (and performance) during
search
●
Trade storage against search performance

Storing data
POST /library/book
{
"title": "Elasticsearch in Action",
"author": [ "Radu Gheorghe",
"Matthew Lee Hinman",
"Roy Russo" ],
"pages": 400,
"published": "2015-06-30T00:00:00.000Z",
"publisher": {
"name": "Manning",
"country": "USA"
}
}

Querying
{
"query": {
"term": {
"pages": "400"
}
}
}
●
Numeric term is in index

Querying
{
"query": {
"range": {
"pages": {
"gte": 300
}
}
}
}
●
Ranges

Numeric values
●
Numeric values are stored in a Trie structure
●
Makes range queries very efficient

Numeric values
●
Simplified view: 250, 290 and 400

Numeric values
●
Precision influences depth of tree
●
Lower precision_step → higher number of
terms
●
Most of the time defaults are fine
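
The effect of precision_step can be approximated with a short Python sketch. It is a simplified model of prefix terms for non-negative integers, assuming 32 bit values; the real Lucene encoding differs in detail and trie_terms is a made-up name.

```python
def trie_terms(value, precision_step, bits=32):
    """Return (shift, prefix) pairs: the value at decreasing precision."""
    return [(shift, value >> shift) for shift in range(0, bits, precision_step)]

for pages in (250, 290, 400):
    print(pages, trie_terms(pages, precision_step=8))

# A lower precision_step produces more terms per value.
coarse = len(trie_terms(400, precision_step=16))
fine = len(trie_terms(400, precision_step=4))
print(coarse, fine)
```

Values that share a prefix term at a coarser level can be matched together, so a range query has to visit fewer terms.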

Date
●
Default: ISO8601 format
●
Joda Time patterns
●
Internally stored as long

Date
PUT /library-date/book/_mapping
{
"book": {
"properties": {
"published": {
"type": "date",
"format": "dd.MM.yyyy"
}
}
}
}

Date
POST /library-date/book
{
"title": "Elasticsearch in Action",
"author": [ "Radu Gheorghe",
"Matthew Lee Hinman",
"Roy Russo" ],
"pages": 400,
"published": "30.06.2015",
"publisher": {
"name": "Manning",
"country": "USA"
}
}
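
What "internally stored as long" means for such a date can be sketched in Python. The strptime pattern "%d.%m.%Y" is assumed to be the equivalent of the Joda pattern dd.MM.yyyy, the value is interpreted as UTC, and to_epoch_millis is a hypothetical helper.

```python
from datetime import datetime, timezone

def to_epoch_millis(text, pattern="%d.%m.%Y"):
    """Parse a date string and return milliseconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

custom = to_epoch_millis("30.06.2015")
iso = to_epoch_millis("2015-06-30T00:00:00.000Z", "%Y-%m-%dT%H:%M:%S.%fZ")

# Both notations describe the same instant and end up as the same number.
print(custom, iso)
```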

Date
●
Common: Filtering on date range
●
from and/or to

Date
"query": {
"filtered": {
"filter": {
"range": {
"published": {
"to": "30.06.2015"
}
}
}
}
}

Date
"query": {
"filtered": {
"filter": {
"range": {
"published": {
"to": "now-3M"
}
}
}
}
}

Date
●
Filter is not cached with 'now'
●
Only cached with rounded value
"range": {
"published": {
"to": "now-3M/d"
}
}
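
Why rounding helps with caching can be illustrated in Python. This is a sketch of the idea behind "now-3M/d" under simplified assumptions (no time zones), not the date math implementation of Elasticsearch; minus_months and round_to_day are made-up helpers.

```python
import calendar
from datetime import datetime

def minus_months(moment, months):
    """Subtract calendar months, clamping the day to the target month."""
    total = moment.year * 12 + (moment.month - 1) - months
    year, month = divmod(total, 12)
    month += 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)

def round_to_day(moment):
    """Drop the time of day, like the /d suffix."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)

morning = datetime(2015, 10, 22, 9, 15, 3)
evening = datetime(2015, 10, 22, 18, 40, 27)

# "now-3M": a different boundary for every request, nothing to reuse.
exact = (minus_months(morning, 3), minus_months(evening, 3))
# "now-3M/d": the same boundary for the whole day.
rounded = (round_to_day(exact[0]), round_to_day(exact[1]))

print(exact[0] == exact[1])
print(rounded[0] == rounded[1])
```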

Date
●
Exact values needed → Combine filters

Embedded Documents
POST /library/book
{
"title": "Elasticsearch in Action",
"author": [ "Radu Gheorghe",
"Matthew Lee Hinman",
"Roy Russo" ],
"pages": 400,
"published": "2015-06-30T00:00:00.000Z",
"publisher": {
"name": "Manning",
"country": "USA"
}
}

Embedded Documents
●
Default: Flat structure
●
Good for 1:1 relation
"publisher": {
"name": "Manning",
"country": "USA"
}
"publisher.name": "Manning",
"publisher.country": "USA"

Embedded documents
●
1:N relations are problematic
{
"ratings": [
{
"source": "Amazon",
"stars": 5
},
{
"source": "Goodreads",
"stars": 4
}
]
}

Embedded documents
●
1:N relations are problematic
"query": {
"bool": {
"must": [
{ "match": { "ratings.source": "Goodreads" }},
{ "match": { "ratings.stars": 5 }}
]
}
}
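
The reason for the problem is the flat storage: the association between source and stars of one rating is lost. A Python sketch of the effect, ignoring analysis; flatten is a hypothetical helper that mimics the flattening.

```python
book = {
    "ratings": [
        {"source": "Amazon", "stars": 5},
        {"source": "Goodreads", "stars": 4},
    ]
}

def flatten(doc, prefix=""):
    """Collect all values per dotted field name."""
    fields = {}
    for key, value in doc.items():
        name = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_name, sub_values in flatten(item, name + ".").items():
                    fields.setdefault(sub_name, []).extend(sub_values)
            else:
                fields.setdefault(name, []).append(item)
    return fields

flat = flatten(book)
print(flat)

# Both conditions hold, but for different ratings: a false positive.
matches = "Goodreads" in flat["ratings.source"] and 5 in flat["ratings.stars"]
print(matches)
```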

Nested
●
Solution: Nested documents
●
Lucene internal: Separate documents,
connected via Block-Join
●
Accessing documents via specialized query

Nested
●
Explicit mapping
"book": {
"properties": {
"ratings": {
"type": "nested",
"properties": {
"source": {
"type": "string"
},
"stars": {
"type": "integer"
}
}
}
}
}

Nested
●
Nested-Query
"query": {
"nested": {
"path": "ratings",
"query": {
"bool": {
"must": [
{ "match": { "ratings.source": "Goodreads" }},
{ "match": { "ratings.stars": 5 }}
]
}
}
}
}
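
Conceptually the nested query evaluates the conditions against each rating on its own. A Python sketch of that semantics only, with the made-up function nested_match; it does not model the Lucene block join.

```python
book = {
    "ratings": [
        {"source": "Amazon", "stars": 5},
        {"source": "Goodreads", "stars": 4},
    ]
}

def nested_match(doc, path, conditions):
    """True if one single object under path satisfies all conditions."""
    return any(
        all(obj.get(field) == expected for field, expected in conditions.items())
        for obj in doc.get(path, [])
    )

# No single rating is from Goodreads and has 5 stars.
print(nested_match(book, "ratings", {"source": "Goodreads", "stars": 5}))
# The Goodreads rating with 4 stars exists.
print(nested_match(book, "ratings", {"source": "Goodreads", "stars": 4}))
```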

Nested
●
Additional flat storage
●
include_in_parent
●
include_in_root

Parent-Child
●
Alternative storage
●
Indexing separate types
●
Connection via parent parameter

Parent-Child
●
Book is stored without ratings
POST /library-parent-child/book/
{
"publisher": {
"name": "Manning"
}
}

Parent-Child
●
Ratings reference books
PUT /library-parent-child/rating/_mapping
{
"rating": {
"_parent": {
"type": "book"
}
}
}

Parent-Child
●
Ratings reference book
POST /library-parent-child/rating?parent=AU_smK5FYK634dNiekGr
{
"source": "Amazon",
"stars": 5
}
POST /library-parent-child/rating?parent=AU_smK5FYK634dNiekGr
{
"source": "Goodreads",
"stars": 4
}

Parent-Child
●
has_child/has_parent
POST /library-parent-child/book/_search
{
"query": {
"has_child": {
"type": "rating",
"query": {
"bool": {
"must": [
{ "match": {"source": "Goodreads" }},
{ "match": {"stars": 5 }}
]
}
}
}
}
}

Parent-Child
●
Stored on same shard
●
Only suitable for smaller amounts of docs
●
Requires different types
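
The semantics of has_child can be sketched in Python as a join over separately stored ratings. The ids book-1 and book-2 and the function has_child are hypothetical and only serve the illustration.

```python
books = {
    "book-1": {"title": "Elasticsearch in Action"},
    "book-2": {"title": "Elasticsearch: Ein praktischer Einstieg"},
}

# Ratings are separate documents that only reference their parent.
ratings = [
    {"parent": "book-1", "source": "Amazon", "stars": 5},
    {"parent": "book-1", "source": "Goodreads", "stars": 4},
]

def has_child(parents, children, **conditions):
    """Return ids of parents with at least one child matching all conditions."""
    parent_ids = {
        child["parent"]
        for child in children
        if all(child.get(field) == value for field, value in conditions.items())
    }
    return [parent_id for parent_id in parents if parent_id in parent_ids]

print(has_child(books, ratings, source="Goodreads", stars=5))
print(has_child(books, ratings, source="Goodreads", stars=4))
```

Each rating is checked on its own, so a new rating can be added without reindexing the book.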

Querying Elasticsearch
●
Ad-hoc queries
●
But queries behave better when the storage is
designed for them
●
Flexible Schema
●
But the mapping is better defined upfront

Mapping
●
The mapping of an existing field can't be changed
●
Think about how you will be querying your
data
●
Think about defining a static mapping upfront

Disable dynamic mapping
PUT /library/book/_mapping
{
"book": {
"dynamic": "strict"
}
}

Disable dynamic mapping
POST /library/book
{
"titel": "Falsch"
}
{
"error" : "StrictDynamicMappingException[mapping set to
strict! dynamic introduction of [titel] within [book]
is not allowed]",
"status" : 400
}
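
The behaviour of a strict mapping can be mimicked on the client side with a few lines of Python. This is only a sketch for top-level fields; StrictMappingError, check_strict and the set known_fields are assumptions for this example and not an Elasticsearch API.

```python
class StrictMappingError(Exception):
    """Raised when a document introduces a field that is not mapped."""

known_fields = {"title", "author", "pages", "published", "publisher"}

def check_strict(doc, mapped_fields, type_name="book"):
    """Reject documents with top-level fields missing from the mapping."""
    for field in doc:
        if field not in mapped_fields:
            raise StrictMappingError(
                f"dynamic introduction of [{field}] within [{type_name}] is not allowed"
            )

check_strict({"title": "Elasticsearch in Action"}, known_fields)  # accepted

try:
    check_strict({"titel": "Falsch"}, known_fields)  # misspelled field name
    error = None
except StrictMappingError as exc:
    error = str(exc)

print(error)
```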

Types
●
Types determine mapping
●
Lucene doesn't know about types

Types
●
Fields with same names need to be mapped
the same way
●
Relevance can be influenced
●
Index settings: shards, replicas per type?

Key-Value-Store
●
Careful when using ES as key-value-store
●
Mapping is part of cluster state

Updating Data
●
Primary Datastore
●
Full indexing
●
Incremental indexing

Updating Data
●
Elasticsearch stores data in segment files
●
Immutable files
●
Segment is a mini inverted index

Segments
●
Building inverted index is expensive
●
Add documents → add new segments

Segments
●
Doc deletion is only a marker
●
Deleted documents are automatically filtered

Updating Data
●
Documents can be updated
●
Full Update
●
Partial Update

Updating data
●
Full update: Replaces a document
PUT /library/book/AVBDusjh0tduyhTzZqTC
{
"author": [
"Radu Gheorghe",
"Matthew L. Hinman",
"Roy Russo"
],
"published": "2015-06-30T00:00:00.000Z",
"publisher": {
"name": "Manning",
"country": "USA"
}
}

Updating data
●
Partial update: Uses source of document
POST /library/book/AVBDusjh0tduyhTzZqTC/_update
{
"doc": {
"title": "Elasticsearch In Action"
}
}

Updating data
●
Update = Delete + Add
●
Expensive operation
●
Design documents as events if possible
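
Why an update is a delete plus an add can be shown with a toy model of immutable segments in Python. The classes Segment and Index are made up for this sketch; buffering, refresh and merging of real segments are left out.

```python
class Segment:
    """A batch of documents that is never modified after creation."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc id -> source, written once
        self.deleted = set()     # deletion markers are the only mutable part

class Index:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        """New documents always go into a new segment."""
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        """Only mark the document, the segment content stays untouched."""
        for segment in self.segments:
            if doc_id in segment.docs:
                segment.deleted.add(doc_id)

    def update(self, doc_id, source):
        """Update = delete the old version + add the new one."""
        self.delete(doc_id)
        self.add({doc_id: source})

    def live_docs(self):
        """Deleted documents are filtered when reading."""
        return {
            doc_id: source
            for segment in self.segments
            for doc_id, source in segment.docs.items()
            if doc_id not in segment.deleted
        }

idx = Index()
idx.add({"1": {"title": "Elasticsearch in Action"}})
idx.update("1", {"title": "Elasticsearch In Action"})
print(len(idx.segments), idx.live_docs())
```

The old version still occupies space in the first segment, which is one reason why frequent updates are expensive.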

Working with timestamps
●
Timestamped data
●
Write events
●
Common: Log events

Index Design
●
Use date aware index name
●
library-221015
●
Create a new index every day

Index Design
●
Index templates for custom settings
PUT /_template/library-template
{
"template": "library-*",
"mappings": {
"book": {
"properties": {
"title": {
"type": "string",
"analyzer": "german"
}
}
}
}
}

Index Design
●
Search multiple indices
GET /library-221015,library-211015/_search
GET /library-*/_search
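
How an application could derive the daily index name and which indices a wildcard selects, sketched in Python. The name format ddMMyy is read off the example library-221015; index_name is a hypothetical helper and fnmatch only stands in for the wildcard resolution done by Elasticsearch.

```python
from datetime import date, timedelta
from fnmatch import fnmatch

def index_name(day, prefix="library-"):
    """Build the date aware index name, e.g. library-221015."""
    return prefix + day.strftime("%d%m%y")

today = date(2015, 10, 22)
names = [index_name(today - timedelta(days=n)) for n in range(3)]
print(names)

# A wildcard selects all daily indices, but no unrelated index.
candidates = names + ["other-221015"]
selected = [name for name in candidates if fnmatch(name, "library-*")]
print(selected)
```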

Index Design
●
Combining indices with Index-Aliases
POST /_aliases
{
"actions" : [
{ "add" : {
"index" : "library-*15",
"alias" : "thisyear"
}},
{ "add" : {
"index" : "library-*1015",
"alias" : "thismonth"
}}
]
}

Index Design
●
Implicit date selection
GET /thisyear/_search
GET /thismonth/_search

Index Design
●
Filtered Alias
"actions" : [{
"add" : {
"index" : "library",
"alias" : "buecher",
"filter" : {
"term" : { "publisher.country" : "de" }
}
}
}]

What is missing?
●
Distributed data and Routing
●
Field Data and Doc Values
●
Index-Options
●
Geo-Data

More Info
●
http://elastic.co
●
Elasticsearch: The Definitive Guide
●
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
●
Elasticsearch in Action
●
https://www.manning.com/books/elasticsearch-in-action
●
http://blog.florian-hopf.de

Resources
●
http://blog.parsely.com/post/1691/lucene/
●
http://de.slideshare.net/VadimKirilchuk/numeric-rangequeries
●
https://www.elastic.co/blog/found-optimizing-elasticsearch-searches

Images
●
http://www.morguefile.com/archive/display/48456
