Subject: A Glance at Elasticsearch in the Era of Analytics and Machine Learning
What is Elasticsearch?
Think of a situation where I have a huge amount of data, terabytes in size, and I need to search for a
specific term in it. I definitely need a tool for this, but unfortunately most of the search engines
available in the market are not open source. Elasticsearch, an open-source, distributed search and
analytics engine, fills exactly this gap. Before looking at how it is used, let us go through its core concepts.
Cluster: A cluster is a collection of one or more nodes (servers) that together hold the
entire data and provide federated indexing and search capabilities across all nodes. A cluster
is identified by a unique name, which is 'elasticsearch' by default. Naming the cluster properly
matters, because a node can only join a cluster by referring to that name. We
should not reuse the same cluster name in different environments, otherwise we might
end up with nodes joining the wrong cluster. For instance, we can name the clusters
logging-dev, logging-stage, and logging-prod for the development, staging, and
production environments respectively.
Node: A node is a single server that is part of our cluster; it stores data and
participates in the cluster's indexing and search capabilities. A node is likewise identified
by a name, and we can define any node name we want if we do not want the default.
Unique node naming is important for administration purposes, where we want to identify
which servers in our network correspond to which nodes in our Elasticsearch cluster. A
master node manages the entire cluster.
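As a quick illustration of cluster and node naming (my own sketch, not from the original text), the cluster health and cat nodes APIs report the names we configured. It assumes a cluster reachable at http://localhost:9200 with security disabled; adjust the URL and authentication for your environment.

import requests

ES = "http://localhost:9200"

# The cluster health API reports the configured cluster name (default: 'elasticsearch').
health = requests.get(f"{ES}/_cluster/health").json()
print("cluster:", health["cluster_name"], "status:", health["status"])

# The cat nodes API lists each node's name, roles, and whether it is the elected master ('*').
print(requests.get(f"{ES}/_cat/nodes?v&h=name,node.role,master").text)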
Type: Type is the Elasticsearch meta object where the mapping for an index is stored.
Alias: An alias is an alternative name used to refer to an Elasticsearch index; a single alias
can be mapped to more than one index.
Document: A document is the basic unit of information that can be indexed, and it is
expressed in JavaScript Object Notation (JSON) format. A connected (join-style) query returns
the related parent and child rows.
Shards and Replicas: Elasticsearch provides the ability to subdivide an index into multiple
pieces called shards. When we create an index, we define the number of shards
that we want. Each shard is a fully functional and independent 'index' that can be
hosted on any node in the cluster. Elasticsearch also allows us to make one or more copies of
an index's shards, called replica shards, or replicas for short. After the index
is created, we can change the number of replicas dynamically at any time, but we cannot
change the number of shards once they are configured.
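A minimal sketch of these two rules (my own example, not from the article), assuming a local cluster at http://localhost:9200 and an illustrative index name:

import requests

ES = "http://localhost:9200"

# Create an index with 3 primary shards and 1 replica per shard.
requests.put(f"{ES}/logs-demo", json={
    "settings": {"number_of_shards": 3, "number_of_replicas": 1}
})

# The replica count can be changed at any time through the index settings API.
requests.put(f"{ES}/logs-demo/_settings", json={
    "index": {"number_of_replicas": 2}
})

# Changing number_of_shards on an existing index is rejected; it is fixed at creation time.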
REST API: Clients interact with Elasticsearch through its REST API, using the standard HTTP
methods (GET, POST, PUT, DELETE).
NRT (Near Real Time): Elasticsearch is a near-real-time search platform: once a
document is indexed, it typically becomes searchable in less than one second.
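The following sketch (mine, with made-up index and field names) ties these concepts together: a JSON document is indexed with PUT, becomes searchable after roughly a second, and is queried and removed with the other HTTP methods. It assumes a local cluster at http://localhost:9200.

import time

import requests

ES = "http://localhost:9200"

# Index a JSON document (PUT).
doc = {"title": "Elasticsearch for log analytics", "views": 42}
requests.put(f"{ES}/articles/_doc/1", json=doc)

# Near real time: by default a refresh happens about once per second,
# after which the document is searchable.
time.sleep(1)

# Search for it (POST with a query body).
hits = requests.post(f"{ES}/articles/_search", json={
    "query": {"match": {"title": "analytics"}}
}).json()["hits"]["hits"]
print([h["_source"]["title"] for h in hits])

# Remove it again (DELETE).
requests.delete(f"{ES}/articles/_doc/1")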
Mapping is the process of defining how a document and its fields are stored and indexed. When
mapping our data, we create a mapping definition, which contains a list of fields that are pertinent
to the document. In Elasticsearch, an index may store documents of different "mapping types"; a
mapping type describes how the documents in an index are separated into logical groups. To
create a mapping we use the Put Mapping API, or we can add multiple mappings when we
create an index. For more information, please visit:
https://elastic.co/guide/en/elasticsearch/reference/6.8/indices-put-mapping.html.
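A small sketch of both approaches (my own; the referenced docs page is for 6.8, whereas this sketch uses the typeless form of recent versions, and the index and field names are illustrative):

import requests

ES = "http://localhost:9200"

# Define an initial mapping when the index is created.
requests.put(f"{ES}/tickets", json={
    "mappings": {
        "properties": {
            "subject": {"type": "text"},
            "created": {"type": "date"}
        }
    }
})

# Add a new field later with the put mapping API
# (the mapping of an existing field cannot be changed in place).
requests.put(f"{ES}/tickets/_mapping", json={
    "properties": {"priority": {"type": "keyword"}}
})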
"ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana.
Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline
that ingests data from multiple sources simultaneously, transforms it, and then sends it to a
"stash" like Elasticsearch. Kibana lets users to visualize data through charts and graphs in
Elasticsearch.
Together, these different components are most commonly used for monitoring,
troubleshooting and securing IT environments. Beats and Logstash take care of data
collection and processing, Elasticsearch indexes and stores the data, and Kibana provides a
user interface for querying the data and visualizing it.
Kibana dashboards: Once we have a collection of visualizations ready, we can add them
all into one comprehensive visualization called a dashboard, which makes it easier
to monitor an environment, correlate events, and analyse trends. Dashboards are
highly dynamic: they can be edited, shared, played around with, opened in different
display modes, and more.
For a small development environment, the ELK stack pipeline looks as follows:
Beats (data collection) -> Redis/Kafka/RabbitMQ (buffering) -> Logstash (data aggregation and
processing) -> Elasticsearch (indexing and storage) -> Kibana (analysis and visualization)
Generally, for a production environment that has to scale out without limit, two bottlenecks exist:
Logstash needs to process logs with pipelines and filters, which costs considerable time;
it may become a bottleneck when log bursts occur.
Elasticsearch needs to index the logs, which also costs time; it becomes a bottleneck when
log bursts happen.
The bottlenecks mentioned above can be smoothed by deploying more Logstash instances and
scaling out the Elasticsearch cluster; as in many other IT solutions, they can also be smoothed by
introducing a buffer layer in the middle. One of the most popular ways to add such a layer is to
integrate Kafka into the ELK stack.
Process Flow:
Data gets collected by Beats and shipped to Kafka, which serves as a data hub: Beats persist
events to it, and Logstash nodes consume them from there for log processing. The common ways
to feed data into Logstash are the HTTP, TCP and UDP protocols; Logstash can expose endpoint
listeners with the respective TCP, UDP and HTTP input plugins (a small HTTP example is sketched
below). After processing, the logs are stored in Elasticsearch and consumed by Kibana for metric
visualisation.
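As a minimal illustration of the HTTP route (my own sketch): it assumes a Logstash pipeline has been configured with the http input plugin listening on localhost:8080; that host, port and the event fields are assumptions, not part of the original text.

from datetime import datetime, timezone

import requests

# A made-up log event; downstream Logstash filters and the elasticsearch
# output plugin take care of processing and indexing it.
event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
}

# The Logstash http input plugin accepts JSON request bodies.
resp = requests.post("http://localhost:8080", json=event)
print(resp.status_code)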
Besides log analysis, the following are a few real-world use cases where Elasticsearch is used
heavily:
1. Text Mining and Natural Language Processing (NLP): Elasticsearch is widely used as
a search and analytics engine. The following are a few use cases:
A. PREPROCESSING (NORMALIZATION)
Have you ever used the '_analyze' endpoint? Elasticsearch has over 20 language analyzers
built in. What does an analyzer do? Tokenization, stemming and stopword removal, which is
very often all the preprocessing we need for higher-level tasks such as machine learning,
language modelling, etc. We basically just need a running instance of Elasticsearch, without
any configuration or setup; the _analyze endpoint can then be used as a REST API for NLP
preprocessing. For more information, please visit:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html.
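For example (my own sketch, assuming a local cluster at http://localhost:9200), the built-in 'english' analyzer tokenizes, lowercases, strips stopwords and stems in one call:

import requests

resp = requests.post("http://localhost:9200/_analyze", json={
    "analyzer": "english",
    "text": "The quick brown foxes were jumping over the lazy dogs"
})

# Each entry in 'tokens' carries the normalized token plus offsets and positions;
# here we only keep the token text (stemmed forms such as 'fox', 'jump', 'lazi').
tokens = [t["token"] for t in resp.json()["tokens"]]
print(tokens)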
B. LANGUAGE DETECTION
'Language detection' is a major challenge in NLP. It can be addressed by installing the
'langdetect' plugin for Elasticsearch. For more information, please visit:
https://github.com/jprante/elasticsearch-langdetect. It uses character 3-grams and a Bayesian
filter supporting various normalization and feature-sampling techniques, and reports a precision
of over 99% for 53 languages, which is quite good. The plugin offers a 'mapping type' to specify
the fields on which we want to enable language detection, and a REST endpoint to which we can
post a short UTF-8 text; the plugin responds with a list of recognized languages.
What happens when such a query is fired?
It analyses our input text, which comes either from documents in the index or directly
from the 'like' text, extracts the most important keywords from that text, and runs a
'Boolean' query with all those keywords.
How does it know what a keyword is?
Keywords are determined with a formula that is applied to a set of documents and can be used
to compare a subset of the documents against all documents based on word probabilities. This
formula is tf-idf, which is very important in text mining: it assigns a score to each term in the
subset relative to the entire corpus of documents. A high score indicates that the term is likely
to identify or characterize the current subset of documents and to distinguish it clearly from all
other documents.
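A tiny, self-contained sketch of the idea (plain Python, not Elasticsearch's internal scoring, and the toy corpus is made up): terms that are frequent in a document but rare across the corpus score highest.

import math

corpus = [
    "elasticsearch stores json documents in an index",
    "kibana visualizes data stored in elasticsearch",
    "the cat sat on the mat",
]
docs = [doc.split() for doc in corpus]

def tf_idf(term, doc_tokens, all_docs):
    # Term frequency: how often the term occurs in this document.
    tf = doc_tokens.count(term) / len(doc_tokens)
    # Inverse document frequency: terms that are rare across the corpus get a larger weight.
    df = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / df) if df else 0.0
    return tf * idf

# 'json' appears in one document only, 'in' in two, 'the' not at all in the first document.
for term in ("json", "in", "the"):
    print(term, round(tf_idf(term, docs[0], docs), 3))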
C. RECOMMENDATION ENGINE
Basically, recommendation engines come in two types: social and content-based. A social
recommendation engine, like the one on the Amazon e-commerce site, is referred to as
"collaborative filtering", where the recommendation takes the form "people who bought this
product also bought...". The other type is the content-based (item-based) recommendation
engine, which groups items by the properties of the entries and is used to answer questions
like "Is there any novel or scientific paper similar to the one I read recently?"
We just configure the 'MLT' query template based on our data: we use the actual item ID
as a starting point and recommend the most similar documents from our index. We can add
custom logic by running a bool query that combines a function score query (to boost by
popularity or recency) on top of the more like this query. The 'More Like This' (MLT) query finds
documents that are "like" a given set of documents. To do so, MLT selects a set of
representative terms from these input documents, forms a query using those terms, executes the
query and returns the results. The user controls the input documents, how the terms are
selected and how the query is formed. For more information, please visit:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html.
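A minimal MLT sketch (my own; the index name 'papers', the fields and the document id are assumptions) that starts from an item the user already read and returns the most similar documents, assuming a local cluster at http://localhost:9200:

import requests

ES = "http://localhost:9200"

query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "abstract"],
            # 'like' may reference documents already in the index instead of free text.
            "like": [{"_index": "papers", "_id": "42"}],
            "min_term_freq": 1,
            "max_query_terms": 25
        }
    }
}

hits = requests.post(f"{ES}/papers/_search", json=query).json()["hits"]["hits"]
for hit in hits:
    print(hit["_score"], hit["_source"]["title"])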
D. DUPLICATE DETECTION
If we have data from several sources (news, affiliate ads, etc.), there is a real possibility
that we are running our model on a dataset containing many duplicates, which is unwanted
behaviour for most end-user applications.
We need to compare all documents pairwise. The objective is to retain the first inspected
element and discard all others, so we need a fair amount of custom logic to choose the first
document to look at. As the complexity is very high, it is quite difficult to detect the duplicates
offline; an online tool is much needed for this. The industry-standard algorithms for duplicate
detection are SimHash and MinHash (used by Google and Twitter). They generate hashes for all
documents, store them in an extra datastore and apply a similarity function; documents that
exceed a certain threshold are considered duplicates. For very short documents we can work
with the Levenshtein (minimum edit) distance, but for longer documents we might want to rely
on a token-based solution, as sketched after the reference below.
For more information, please visit:
https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch.
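A tiny offline illustration of the token-based idea (my own sketch, not the approach from the blog post above, and the sample texts are made up): shingle each document into word 3-grams, compare the sets with Jaccard similarity, and keep only the first-seen copy of near-identical pairs.

def shingles(text, n=3):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

docs = {
    "news-1": "central bank raises interest rates by a quarter point",
    "ad-7":   "central bank raises interest rates by a quarter point today",
    "news-3": "local team wins the championship after extra time",
}

THRESHOLD = 0.6
kept, duplicates = [], []
for doc_id, text in docs.items():
    sig = shingles(text)
    # Retain the first inspected element; later near-identical copies are discarded.
    if any(jaccard(sig, prev) >= THRESHOLD for _, prev in kept):
        duplicates.append(doc_id)
    else:
        kept.append((doc_id, sig))

print("duplicates:", duplicates)   # -> ['ad-7']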
2. Image Processing:
Can you imagine how useful it would be if there were a tool with an image search facility?
Enterprise Search:
What differentiates StormCrawler from other web crawlers is that it uses Elasticsearch as a
back end for storage as well. Elasticsearch is an excellent resource for doing this and
provides visibility into the data as well as great performance. The Elasticsearch module
contains a number of spout implementations, which query the status index to get the URLs
for StormCrawler to fetch. For more information, please visit:
https://www.elastic.co/blog/stormcrawler-open-source-web-crawler-strengthened-by-
elasticsearch-kibana.
5. Multitenancy:
Often, we have multiple customers or users with separate collections of documents, and a user
should never be able to search documents that do not belong to them. This tends to end in a
design where each user has their own index, and more often than not it leads to far too many
indices. In almost every case where index-per-user is implemented, we could instead use one
larger Elasticsearch index, which addresses the main downside of having a huge number of
small indices:
The memory overhead can be controlled, because thousands of small indices consume
a lot of heap space.
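A minimal sketch of the shared-index alternative (my own; index name, field names and tenant ids are assumptions): a filtered alias per tenant on one larger index, so each user only ever searches their own documents. It assumes a local cluster at http://localhost:9200.

import requests

ES = "http://localhost:9200"

# One shared index; every document carries its owner in a 'tenant_id' keyword field.
requests.put(f"{ES}/documents", json={
    "mappings": {"properties": {"tenant_id": {"type": "keyword"},
                                "body": {"type": "text"}}}
})

# A filtered alias per tenant instead of an index per tenant.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"add": {"index": "documents", "alias": "docs-acme",
                 "filter": {"term": {"tenant_id": "acme"}}}}
    ]
})

# Searching through the alias only returns that tenant's documents.
requests.post(f"{ES}/docs-acme/_search", json={"query": {"match": {"body": "invoice"}}})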
Conclusion:
There is a lot to learn about Elasticsearch, and sometimes it can be hard to know what you need to
learn. In this article, I have covered quite a few common use cases and some important things to
be aware of for each of them.