Elasticsearch Essentials - Sample Chapter
Bharvi Dixit
Preface
With constantly evolving and growing datasets, organizations have the need to
find actionable insights for their business. Elasticsearch, which is the world's most
advanced search and analytics engine, brings the ability to make massive amounts
of data usable in a matter of milliseconds. It not only gives you the power to build
blazingly fast search solutions over a massive amount of data, but can also serve as
a NoSQL data store.
Elasticsearch Essentials will guide you to become a competent developer quickly
with a solid knowledge and understanding of the Elasticsearch core concepts.
In the beginning, this book will cover the fundamental concepts required to start
working with Elasticsearch and then it will take you through more advanced
concepts of search techniques and data analytics.
This book provides complete coverage of working with Elasticsearch using
Python and Java APIs to perform CRUD operations, aggregation-based analytics,
handling document relationships, working with geospatial data, and controlling
search relevancy.
In the end, you will not only learn about scaling Elasticsearch clusters in
production, but also how to secure Elasticsearch clusters and take data backups
using best practices.
Buckets: Buckets are simply groupings of documents that meet certain
criteria. They are used to categorize documents. For example, a loan can
fall into either the home loan bucket or the personal loan bucket.
Aggregation syntax
Aggregation follows the following syntax:
"aggregations" : {
"<aggregation_name>" : {
"<aggregation_type>" : {
<aggregation_body>
}
[,"aggregations" : { [<sub_aggregation>]+ } ]?
}
[,"<aggregation_name_2>" : { ... } ]*
}
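As a concrete sketch of the preceding skeleton, a nested aggregation can be written as a plain Python dictionary; the aggregation names top_tags and avg_followers here are purely illustrative:

```python
# Sketch of the aggregation skeleton: one top-level aggregation
# with a single nested sub-aggregation (names are illustrative).
query = {
    "aggs": {
        "top_tags": {                          # <aggregation_name>
            "terms": {                         # <aggregation_type>
                "field": "entities.hashtags.text"
            },
            "aggs": {                          # optional sub-aggregations
                "avg_followers": {
                    "avg": {"field": "user.followers_count"}
                }
            }
        }
    }
}

# The top-level aggregation carries both its type and its sub-aggregations.
print(sorted(query["aggs"]["top_tags"].keys()))  # → ['aggs', 'terms']
```

A dictionary of this shape is what gets passed as the request body in the search examples that follow.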
Chapter 4
aggregations: The aggregations object (which can also be abbreviated as aggs)
in the preceding structure holds the aggregations to be computed. There can be
more than one aggregation inside this object.
Extracting values
Aggregations typically work on the values extracted from the aggregated document
set. These values can be extracted either from a specific field using the field key
inside the aggregation body or can also be extracted using a script.
While it's easy to define a field to be used to aggregate data, the syntax of using
scripts needs some special understanding. The benefit of using scripts is that one
can combine the values from more than one field to use as a single value inside
an aggregation.
Using scripts requires much more computational power and
slows down performance on bigger datasets.
Scripts also support parameters via the params keyword. For example:
{
"avg": {
"field": "price",
"script": {
"inline": "_value * correction",
"params": {
"correction": 1.5
}
}
}
}
The preceding aggregation calculates the average price after multiplying each value
of the price field with 1.5, which is used as an inline function parameter.
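To make the effect of the value script concrete, the same calculation can be reproduced in plain Python; the price values here are made-up samples:

```python
# The aggregation above averages each price after applying the
# script "_value * correction" with correction = 1.5.
prices = [10.0, 20.0, 30.0]  # hypothetical values of the price field
correction = 1.5

corrected_avg = sum(p * correction for p in prices) / len(prices)
print(corrected_avg)  # → 30.0
```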
Metric aggregations
As explained in the previous sections, metric aggregations allow you to compute
statistical measurements of the data, which include the following:
Combined stats
All the stats mentioned previously can be calculated with a single aggregation query.
Python example
query = {
"aggs": {
"follower_counts_stats": {
"stats": {
"field": "user.followers_count"
}
}
}
}
res = es.search(index='twitter', doc_type='tweets', body=query)
print res
In the preceding response, count is the total number of values on which the
aggregation is executed.
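For reference, a stats response carries count, min, max, avg, and sum in a single object; the following sketch parses a hypothetical response of that shape:

```python
# Hypothetical response fragment for the stats aggregation above.
res = {
    "aggregations": {
        "follower_counts_stats": {
            "count": 4,
            "min": 10.0,
            "max": 200.0,
            "avg": 80.0,
            "sum": 320.0
        }
    }
}

stats = res["aggregations"]["follower_counts_stats"]
print(stats["count"], stats["avg"])  # → 4 80.0
```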
Java example
In Java, all the metric aggregations can be created using the
MetricsAggregationBuilder and AggregationBuilders
classes. However, you need to import a specific package into
your code to parse the results.
To build and execute a stats aggregation in Java, first do the following imports in
the code:
import org.elasticsearch.search.aggregations.metrics.stats.Stats;
value_count: This counts the number of values that are extracted from the
aggregated documents
min: This finds the minimum value among the numeric values extracted from the
aggregated documents
max: This finds the maximum value among the numeric values extracted from the
aggregated documents
avg: This finds the average of the numeric values extracted from the
aggregated documents
sum: This finds the sum of all the numeric values extracted from the
aggregated documents
To perform these aggregations, you just need to use the following syntax:
{
"aggs": {
"aggregation_name": {
"aggregation_type": {
"field": "name_of_the_field"
}
}
}
}
Python example
query = {
"aggs": {
"follower_counts_stats": {
"sum": {
"field": "user.followers_count"
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
We used the sum aggregation type in the preceding query; for other aggregations
such as min, max, avg, and value_count, just replace the type of aggregation in
the query.
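Since only the aggregation type changes between these queries, they can be generated with a small helper; the helper name metric_query is ours, not part of the Elasticsearch API:

```python
def metric_query(agg_type, field):
    """Build a single-metric aggregation query of the given type."""
    return {
        "aggs": {
            "follower_counts_stats": {agg_type: {"field": field}}
        },
        "size": 0
    }

# Generate one query per metric aggregation type.
for agg_type in ("min", "max", "avg", "sum", "value_count"):
    q = metric_query(agg_type, "user.followers_count")
    print(list(q["aggs"]["follower_counts_stats"])[0])
```

Each generated dictionary can be passed to es.search() exactly like the hand-written query above.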
Java example
To perform these aggregations using the Java client, you need to follow this syntax:
MetricsAggregationBuilder aggregation =
AggregationBuilders
.sum("follower_counts_stats")
.field("user.followers_count");
Note that in the preceding aggregation, instead of sum, you just need to call the
corresponding aggregation type to build other types of metric aggregations, such
as min, max, count, and avg. The rest of the syntax remains the same.
For parsing the responses, you need to import the correct package according to the
aggregation type. The following are the imports that you will need:
Python example
query = {
"aggs": {
"follower_counts_stats": {
"extended_stats": {
"field": "user.followers_count"
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
Java example
An extended_stats aggregation is built using the Java client in the following way:
MetricsAggregationBuilder aggregation =
AggregationBuilders
.extendedStats("agg_name")
.field("user.followers_count");
To parse the response of the extended_stats aggregation in Java, you need to have
the following import statement:
import org.elasticsearch.search.aggregations.metrics.stats.extended.ExtendedStats;
Java example
Cardinality aggregation is built using the Java client in the following way:
MetricsAggregationBuilder aggregation =
AggregationBuilders
.cardinality("unique_users")
.field("user.screen_name");
To parse the response of the cardinality aggregation in Java, you need to have the
following import statement:
import org.elasticsearch.search.aggregations.metrics.cardinality.Cardinality;
Bucket aggregations
Similar to metric aggregations, bucket aggregations are also categorized into two
forms: single-bucket aggregations, which contain only a single bucket in the
response, and multi-bucket aggregations, which contain more than one bucket.
The following are the most important aggregations that are used to create buckets:
Terms aggregation
Range aggregation
Histogram aggregation
Filter-based aggregation
We will cover a few more aggregations such as nested and
geo aggregations in subsequent chapters.
Bucket aggregation response formats differ from the response formats of
metric aggregations. The response of a bucket aggregation usually comes in the
following format:
"aggregations": {
"aggregation_name": {
"buckets": [
{
"key": value,
"doc_count": value
},
......
]
}
}
Terms aggregation
Terms aggregation is the most widely used aggregation type. It returns buckets
that are dynamically built, one per unique value of the field.
Let's see how to find 10 hashtags used in our Twitter index, sorted in descending
order of the hashtag text.
Python example
query = {
"aggs": {
"top_hashtags": {
"terms": {
"field": "entities.hashtags.text",
"size": 10,
"order": {
"_term": "desc"
}
}
}
}
}
In the preceding example, the size parameter controls how many buckets are to be
returned (defaults to 10) and the order parameter controls the sorting of the
buckets (by default, buckets are ordered by doc_count in descending order):
res = es.search(index='twitter', doc_type='tweets', body=query)
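The terms response comes back as a list of buckets, each carrying a key and a doc_count; the following sketch iterates a hypothetical response of that shape:

```python
# Hypothetical terms-aggregation response fragment.
res = {
    "aggregations": {
        "top_hashtags": {
            "buckets": [
                {"key": "elasticsearch", "doc_count": 42},
                {"key": "bigdata", "doc_count": 17},
            ]
        }
    }
}

# Print one line per bucket: the term and its document count.
for bucket in res["aggregations"]["top_hashtags"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```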
Java example
Terms aggregation can be built as follows:
AggregationBuilder aggregation =
AggregationBuilders.terms("agg").field(fieldName)
.size(10);
Here, agg is the aggregation bucket name and fieldName is the field on which the
aggregation is performed.
To parse the terms aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
Then, the response can be parsed with the following code snippet:
Terms screen_names = response.getAggregations().get("agg");
for (Terms.Bucket entry : screen_names.getBuckets()) {
    entry.getKey();      // Term
    entry.getDocCount(); // Doc count
}
Range aggregation
With range aggregation, a user can specify a set of ranges, where each range
represents a bucket. Elasticsearch will put the document sets into the correct
buckets by extracting the value from each document and matching it against
the specified ranges.
Python example
query = {
"aggs": {
"status_count_ranges": {
"range": {
"field": "user.statuses_count",
"ranges": [
{
"to": 50
},
{
"from": 50,
"to": 100
}
]
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
For each range, the range aggregation includes the from value and
excludes the to value.
The response for the preceding query request would look like this:
"aggregations": {
"status_count_ranges": {
"buckets": [
{
"key": "*-50.0",
"to": 50,
"to_as_string": "50.0",
"doc_count": 3
},
{
"key": "50.0-100.0",
"from": 50,
"from_as_string": "50.0",
"to": 100,
"to_as_string": "100.0",
"doc_count": 3
}
]
}
}
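The from-inclusive, to-exclusive semantics can be checked locally; this sketch reproduces range bucketing over made-up statuses_count values:

```python
counts = [10, 20, 49, 50, 75, 99]  # made-up user.statuses_count values
ranges = [(None, 50), (50, 100)]   # (from, to): from inclusive, to exclusive

doc_counts = []
for lo, hi in ranges:
    # A value lands in a bucket when it is >= from and < to.
    n = sum(1 for c in counts
            if (lo is None or c >= lo) and (hi is None or c < hi))
    doc_counts.append(n)

print(doc_counts)  # → [3, 3]
```

These counts match the doc_count values in the response shown above: 50 falls into the second bucket, not the first.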
Java example
Building range aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .range("agg")
        .field(fieldName)
        .addUnboundedTo(1)      // from -infinity to 1 (excluded)
        .addRange(1, 100)       // from 1 to 100 (excluded)
        .addUnboundedFrom(100); // from 100 to +infinity
Here, agg is the aggregation bucket name and fieldName is the field on which the
aggregation is performed. The addUnboundedTo method is used when you do not
specify the from parameter and the addUnboundedFrom method is used when you
don't specify the to parameter.
Parsing the response
To parse the range aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.range.Range;
Then, the response can be parsed with the following code snippet:
Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
    String key = entry.getKeyAsString();    // Range as key
    Number from = (Number) entry.getFrom(); // Bucket from
    Number to = (Number) entry.getTo();     // Bucket to
    long docCount = entry.getDocCount();    // Doc count
}
Date range aggregations support date math expressions, as shown in the
following examples:
Expression          Description
now                 Current time
now+1h              Current time plus 1 hour
now-1M              Current time minus 1 month
now+1h+1m           Current time plus 1 hour and 1 minute
now+1h/d            Current time plus 1 hour, rounded down to the nearest day
2016-01-01||+1M/d   2016-01-01 plus 1 month, rounded down to the nearest day
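A rounding expression such as now+1h/d first applies the offset and then truncates to the interval; here is a plain-Python sketch of that date math using the standard library (the anchor time is arbitrary):

```python
from datetime import datetime, timedelta

now = datetime(2016, 1, 1, 23, 30)  # arbitrary anchor standing in for "now"

plus_1h = now + timedelta(hours=1)  # now+1h → 2016-01-02 00:30
# The trailing /d rounds the result down to the start of the day.
rounded = plus_1h.replace(hour=0, minute=0, second=0, microsecond=0)

print(rounded)  # → 2016-01-02 00:00:00
```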
Python example
query = {
"aggs": {
"tweets_creation_interval": {
"date_range": {
"field": "created_at",
"format": "yyyy",
"ranges": [
{ "to": "2000" },
{ "from": "2000", "to": "2005" },
{ "from": "2005" }
]
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
Java example
Building date range aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .dateRange("agg")
        .field(fieldName)
        .format("yyyy")
        .addUnboundedTo("2000")    // from -infinity to 2000 (excluded)
        .addRange("2000", "2005")  // from 2000 to 2005 (excluded)
        .addUnboundedFrom("2005"); // from 2005 to +infinity
Here, agg is the aggregation bucket name and fieldName is the field on which the
aggregation is performed. The addUnboundedTo method is used when you do not
specify the from parameter and the addUnboundedFrom method is used when you
don't specify the to parameter.
Parsing the response:
To parse the date range aggregation response, you need to import the
following class:
import org.elasticsearch.search.aggregations.bucket.range.Range;
import org.joda.time.DateTime;
Then, the response can be parsed with the following code snippet:
Range agg = response.getAggregations().get("agg");
for (Range.Bucket entry : agg.getBuckets()) {
    String key = entry.getKeyAsString();              // Date range as key
    DateTime fromAsDate = (DateTime) entry.getFrom(); // Bucket from, as a date
    DateTime toAsDate = (DateTime) entry.getTo();     // Bucket to, as a date
    long docCount = entry.getDocCount();              // Doc count
}
Histogram aggregation
A histogram aggregation works on numeric values extracted from documents and
creates fixed-size buckets based on those values. Let's see an example of creating
buckets of users' favorite tweet counts:
Python example
query = {
"aggs": {
"favorite_tweets": {
"histogram": {
"field": "user.favourites_count",
"interval": 20000
}
}
},"size": 0
}
res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['favorite_tweets']['buckets']:
print bucket['key'], bucket['doc_count']
The response for the preceding query will look like the following. It shows that
114 users have favorite tweet counts between 0 and 20,000, and 8 users have more
than 20,000:
"aggregations": {
"favorite_tweets": {
"buckets": [
{
"key": 0,
"doc_count": 114
},
{
"key": 20000,
"doc_count": 8
}
]
}
}
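Each histogram key is simply the value rounded down to the nearest multiple of the interval; here is a plain-Python sketch of the same bucketing over made-up favourites counts:

```python
from collections import Counter

interval = 20000
favourites = [5000, 12000, 19999, 25000]  # made-up favourites_count values

# Round each value down to its bucket key and count per bucket.
keys = Counter((v // interval) * interval for v in favourites)
print(dict(keys))  # → {0: 3, 20000: 1}
```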
Java example
Building histogram aggregation:
AggregationBuilder aggregation =
AggregationBuilders
.histogram("agg")
.field(fieldName)
.interval(5);
Here, agg is the aggregation bucket name and fieldName is the field on which
aggregation is performed. The interval method is used to pass the interval for
generating the buckets.
Parsing the response:
To parse the histogram aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.histogram.
Histogram;
Then, the response can be parsed with the following code snippet:
Histogram agg = response.getAggregations().get("agg");
for (Histogram.Bucket entry : agg.getBuckets()) {
    Long key = (Long) entry.getKey();    // Key
    long docCount = entry.getDocCount(); // Doc count
}
Date histogram aggregation
A date histogram aggregation works like the histogram aggregation, but on date
fields. You can specify intervals such as hour, day, month, or year, as well as
fractional values such as 1.5h (1 hour and 30 minutes). Date histograms are mostly
used to generate time-series graphs in many applications.
Python example
query = {
"aggs": {
"tweet_histogram": {
"date_histogram": {
"field": "created_at",
"interval": "hour"
}
}
}, "size": 0
}
The preceding aggregation will generate an hourly tweet timeline on the
created_at field:
res = es.search(index='twitter', doc_type='tweets', body=query)
for bucket in res['aggregations']['tweet_histogram']['buckets']:
print bucket['key'], bucket['key_as_string'], bucket['doc_count']
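The same hourly timeline can be mimicked locally by truncating each timestamp to its hour; the creation times below are made up:

```python
from collections import Counter
from datetime import datetime

created_at = [
    datetime(2016, 1, 1, 10, 5),
    datetime(2016, 1, 1, 10, 42),
    datetime(2016, 1, 1, 11, 7),
]  # made-up tweet creation times

# Truncate each timestamp to the start of its hour and count per hour.
timeline = Counter(t.replace(minute=0, second=0, microsecond=0)
                   for t in created_at)
for hour, count in sorted(timeline.items()):
    print(hour, count)
```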
Java example
Building date histogram aggregation:
AggregationBuilder aggregation =
    AggregationBuilders
        .dateHistogram("agg")
        .field(fieldName)
        .interval(DateHistogramInterval.YEAR);
Here, agg is the aggregation bucket name and fieldName is the field
on which the aggregation is performed. The interval method is used to
pass the interval for generating buckets. For an interval in days, you can use:
DateHistogramInterval.days(10)
Filter-based aggregation
Elasticsearch allows filters to be used as aggregations too. Filters preserve their
behavior in the aggregation context as well and are usually used to narrow down
the current aggregation context to a specific set of documents. You can use any filter
such as range, term, geo, and so on.
To get the count of all the tweets posted by the user d_bharvi, use the following code:
Python example
query = {
"aggs": {
"screename_filter": {
"filter": {
"term": {
"user.screen_name": "d_bharvi"
}
}
}
},"size": 0
}
In the preceding request, we have used a term filter to narrow down the bucket to
tweets posted by a particular user. Since a filter aggregation is a single-bucket
aggregation, the response contains a doc_count directly rather than a buckets list:
res = es.search(index='twitter', doc_type='tweets', body=query)
print res['aggregations']['screename_filter']['doc_count']
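Because a filter aggregation produces a single bucket, its doc_count is simply the number of matching documents; here is an equivalent local sketch over made-up tweets:

```python
tweets = [
    {"user": {"screen_name": "d_bharvi"}},
    {"user": {"screen_name": "someone_else"}},
    {"user": {"screen_name": "d_bharvi"}},
]  # made-up documents

# Count the documents matching the term filter.
doc_count = sum(1 for t in tweets
                if t["user"]["screen_name"] == "d_bharvi")
print(doc_count)  # → 2
```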
Java example
Building filter-based aggregation:
AggregationBuilder aggregation =
AggregationBuilders
.filter("agg")
.filter(QueryBuilders.termQuery("user.screen_name", "d_bharvi"));
Here, agg is the aggregation bucket name under the first filter method and the
second filter method takes a query to apply the filter.
Parsing the response:
To parse a filter-based aggregation response, you need to import the following class:
import org.elasticsearch.search.aggregations.bucket.filter.Filter;
res = es.search(index='twitter', doc_type='tweets', body=query)
The response (truncated here) contains a time bucket of 1563 tweets, which
includes values such as hashtag_key: crime, hashtag_count: 42, screen_name:
andresenior, count: 2, and average_tweets: 9239.0.
Understanding the response in the context of our search of the term crime in a
text field:
hashtag_key: The name of the hashtag used by users within the specified
time bucket
hashtag_count: The count of each hashtag within the specified time bucket
screen_name: The screen name of the user who has tweeted using that hashtag
count: The number of times that user tweeted using a corresponding hashtag
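Structurally, each level of a multilevel response nests a buckets list inside its parent bucket; the following sketch drills down through a hypothetical response shaped like the output above:

```python
# Hypothetical fragment of a multilevel aggregation response.
res = {
    "aggregations": {
        "hourly_timeline": {"buckets": [
            {"key_as_string": "2016-01-01T10:00:00", "doc_count": 1563,
             "top_hashtags": {"buckets": [
                 {"key": "crime", "doc_count": 42,
                  "top_users": {"buckets": [
                      {"key": "andresenior", "doc_count": 2,
                       "average_tweets": {"value": 9239.0}}
                  ]}}
             ]}}
        ]}
    }
}

# One nested loop per aggregation level: hour -> hashtag -> user.
for hour in res["aggregations"]["hourly_timeline"]["buckets"]:
    for tag in hour["top_hashtags"]["buckets"]:
        for user in tag["top_users"]["buckets"]:
            print(tag["key"], user["key"], user["average_tweets"]["value"])
```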
Java example
Writing multilevel aggregation queries (as we just saw) in Java seems quite complex,
but once you learn the basics of structuring aggregations, it becomes fun.
Let's see how we write the previous query in Java:
Building the query using QueryBuilder:
QueryBuilder query = QueryBuilders.matchQuery("text", "crime");
You can relate the preceding syntax with the aggregation syntax you learned in the
beginning of this chapter.
The exact aggregation for our Python example will be as follows:
AggregationBuilder aggregation =
AggregationBuilders
.dateHistogram("hourly_timeline")
.field("@timestamp")
.interval(DateHistogramInterval.HOUR)
.subAggregation(AggregationBuilders
.terms("top_hashtags")
.field("entities.hashtags.text")
.subAggregation(AggregationBuilders
.terms("top_users")
.field("user.screen_name")
.subAggregation(AggregationBuilders
.avg("average_status_count")
.field("user.statuses_count"))));
Let's execute the request by combining the query and aggregation we have built:
SearchResponse response = client.prepareSearch(indexName)
    .setTypes(docType)
    .setQuery(query)
    .addAggregation(aggregation)
    .setSize(0)
    .execute().actionGet();
As you saw, building these types of aggregations and drilling down into datasets
to do complex analytics can be fun. However, one has to keep in mind the pressure
on memory that Elasticsearch bears while doing these complex calculations.
The next section covers how we can avoid these memory implications.
You are no longer limited by the amount of fielddata that can fit into a given
amount of heap; instead, the file system caches can make use of all the
available RAM.
The other important consideration to keep in mind is not to have a huge number
of buckets in a nested aggregation. For example, finding the total order value
for each of 100 countries during a year with an interval of one week will
generate 100*51 buckets with the sum value. This is a big overhead that is
computed not only on the data nodes, but also on the coordinating node that
aggregates the partial results. A large JSON response also causes problems when
parsing and loading it on the frontend. Wide aggregations can easily kill
a server.
Summary
In this chapter, we learned about one of the most powerful features of
Elasticsearch: the aggregation framework. We went through the most important
metric and bucket aggregations, along with examples of doing analytics on our
Twitter dataset with the Python and Java APIs.
This chapter covered many fundamental as well as complex examples of the
different facets of analytics that can be built using a combination of full-text
searches, term-based searches, and multilevel aggregations. Elasticsearch is
awesome for analytics, but one should always keep in mind the memory
implications, which we covered in the last section of this chapter, to avoid
overloading the nodes.
In the next chapter, we will learn to work with geospatial data in Elasticsearch,
and we will also cover analytics with geo aggregations.