
Subject: A Glance to Elasticsearch in the Era of Analytics and Machine Learning

What is Elasticsearch?

Think of a situation where I have a huge amount of data, terabytes in size, and I need to
search for a specific term in it.

I definitely need a tool for this. But unfortunately, most of the search engines available in
the market are not open source.

So, this is where Elasticsearch comes into the picture.

 Elasticsearch is a full-text search, readily scalable, enterprise-grade analytics engine for
all types of data: textual, numerical, geospatial, structured, and unstructured. It is
accessible through a RESTful web service interface and uses schema-less JSON (JavaScript
Object Notation) documents to store data. It is platform independent and enables users
to search very large amounts of data efficiently and at very high speed. It supports a
variety of use cases, such as letting users easily search through any portal, collecting and
analysing log data, and building business intelligence dashboards to quickly analyse and
visualize data.
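As a minimal sketch of that RESTful interface, the following indexes one schema-less JSON document and then searches it; the http://localhost:9200 endpoint and the "articles" index name are assumptions for illustration, not anything defined in this article:

```python
# Index a JSON document and run a full-text search over Elasticsearch's
# REST interface.
import requests

ES = "http://localhost:9200"

# No schema needed up front: Elasticsearch infers the field types.
doc = {"title": "A glance at Elasticsearch", "tags": ["search", "analytics"]}
resp = requests.post(f"{ES}/articles/_doc", json=doc)
print(resp.json()["result"])  # e.g. "created"

# Full-text search for a term across the indexed documents.
query = {"query": {"match": {"title": "elasticsearch"}}}
resp = requests.post(f"{ES}/articles/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```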

Concept and Components:

Fig. 1: Elasticsearch concept (Source: W3Schools)

 Cluster: A cluster is a collection of one or more nodes (servers) that together hold the
entire data and provide federated indexing and search capabilities across all nodes. A
cluster is identified by a unique name, which is 'elasticsearch' by default. Cluster naming
is mandatory, because a node can only join a cluster by referring to that name. We should
not reuse the same cluster name in different environments, otherwise we might end up
with nodes joining the wrong cluster. For instance, we can name the clusters logging-dev,
logging-stage, and logging-prod for the development, staging, and production environments
respectively.

 Node: A node is a single server that is part of our cluster; it stores data and participates
in the cluster's indexing and search capabilities. A node is likewise identified by a name,
and we can define any node name we want if we do not want the default. Unique node
naming is important for administration purposes, where we want to identify which servers
in our network correspond to which nodes in our Elasticsearch cluster. A master node
manages the entire cluster.

 Index: An index is a collection of documents that have somewhat similar characteristics.
In a single cluster, we can define as many indexes as we want, as per our requirements. An
index is roughly the equivalent of a schema in a relational database. Instead of searching
the text directly, Elasticsearch searches an index, which is what makes its search responses
fast. This is similar to retrieving the pages in a book related to a keyword by scanning the
index at the back of the book, as opposed to searching every word of every page. This type
of index is called an 'inverted index', because it inverts a page-centric data structure
(page -> words) into a keyword-centric data structure (word -> pages). Elasticsearch uses
Apache Lucene to create and manage this inverted index.

 Type: The Elasticsearch meta object in which the mapping for an index is stored.

 Alias: A reference to an Elasticsearch index; a single alias can be mapped to more than
one index.

 Document: A document is the basic unit of information that can be indexed, and it is
expressed in JavaScript Object Notation (JSON) format. With parent-child mappings, a
single join query can return the related parent and child documents.

 Shards and Replicas: Elasticsearch provides the ability to subdivide an index into multiple
pieces called shards. When we create an index, we define the number of shards that we
want. Each shard is a fully functional and independent 'index' that can be hosted on any
node in the cluster. Elasticsearch also allows us to make one or more copies of an index's
shards, called replica shards, or replicas for short. After the index is created, we can
change the number of replicas dynamically at any time, but we cannot change the number
of shards once it is configured. (A short sketch of these settings follows this list.)

 REST API: REST APIs are used by clients to interact with Elasticsearch through HTTP
methods (GET, POST, PUT, DELETE).
 NRT (Near Real Time): Elasticsearch is a near-real-time search platform: once a document
is indexed, it typically becomes searchable within about one second.
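Picking up the shards and replicas point above, a minimal sketch of those settings, assuming a local cluster and an illustrative "logs-demo" index name:

```python
# Create an index with explicit shard and replica settings, then change
# only the replica count later (the shard count is fixed at creation).
import requests

ES = "http://localhost:9200"

# 3 primary shards, each with 1 replica.
settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
requests.put(f"{ES}/logs-demo", json=settings)

# Later: raise the replica count dynamically; shards cannot be changed.
requests.put(f"{ES}/logs-demo/_settings",
             json={"index": {"number_of_replicas": 2}})
```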

How does Elasticsearch represent data?

In Elasticsearch, we search documents. An index consists of one or more documents, and a
document consists of one or more fields. We do not need to specify a schema before indexing
documents, but it is necessary to add mapping declarations if we require anything but the
most basic fields and operations. In database terminology, a document corresponds to a table
row and a field corresponds to a table column.

Mapping is the process of defining how a document and its fields are stored and indexed.
When mapping our data, we create a mapping definition, which contains a list of fields that
are pertinent to the document. In Elasticsearch, an index may store documents of different
"mapping types". A mapping type describes the way the documents in an index are separated
into logical groups. To create a mapping, we use the 'Put Mapping API', or we can add
multiple mappings when we create an index. For more information, please visit:
https://elastic.co/guide/en/elasticsearch/reference/6.8/indices-put-mapping.html.
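A minimal sketch of both approaches, assuming a local instance and a hypothetical "users" index; note that in 6.x the Put Mapping URL also carries a mapping type name, while the form below follows the current API:

```python
# Define a mapping at index creation, then extend it with the Put
# Mapping API.
import requests

ES = "http://localhost:9200"

# 1) Declare field mappings when the index is created.
body = {
    "mappings": {
        "properties": {
            "name":      {"type": "text"},
            "email":     {"type": "keyword"},
            "joined_at": {"type": "date"},
        }
    }
}
requests.put(f"{ES}/users", json=body)

# 2) Add a new field to the existing index afterwards.
new_field = {"properties": {"age": {"type": "integer"}}}
requests.put(f"{ES}/users/_mapping", json=new_field)
```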

What is ELK Stack and its Components?

"ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana.
Elasticsearch is a search and analytics engine. Logstash is a server-side data processing
pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends
it to a "stash" like Elasticsearch. Kibana lets users visualize the data in Elasticsearch through
charts and graphs.

 Logs: Server logs that need to be analyzed.
 Logstash: Collects logs and event data; it also parses and transforms the data.
 Elasticsearch: Stores, searches, and indexes the transformed data coming from Logstash.
 Kibana: Uses Elasticsearch to explore, visualize, and share the stored data.
Visualizations in Kibana can be categorized into the following five types:
o Basic charts (Area, Heat Map, Horizontal Bar, Line, Pie, Vertical Bar)
o Data (Data Table, Gauge, Goal, Metric)
o Maps (Coordinate Map, Region Map)
o Time series (Timelion, Visual Builder)
o Other (Controls, Markdown, Tag Cloud)
 Beats: Lightweight agents, central to modern architectures, that are installed on edge
hosts to collect different types of data for forwarding into the stack. The data collected
varies by Beat: log files in the case of Filebeat, system and service metrics in the case of
Metricbeat, network data in the case of Packetbeat, Windows event logs in the case of
Winlogbeat, etc. Once the data is collected, we can configure our Beat to ship it either
directly into Elasticsearch or to Logstash for additional processing.

Together, these different components are most commonly used for monitoring,
troubleshooting and securing IT environments. Beats and Logstash take care of data
collection and processing, Elasticsearch indexes and stores the data, and Kibana provides a
user interface for querying the data and visualizing it.

 Kibana dashboards: Once we have a collection of visualizations ready, we can add them
all into one comprehensive visualization called a dashboard, which gives us the ability to
monitor an environment and makes event correlation and trend analysis easier. Dashboards
are highly dynamic: they can be edited, shared, played around with, opened in different
display modes, and more.

Log management and analysis include the following key capabilities (a small end-to-end
sketch follows the list):

 Aggregation – to collect and ship logs from multiple data sources.
 Processing – to transform log messages into meaningful data for easier analysis.
 Storage – to store data for extended time periods, allowing for monitoring, trend analysis,
and security use cases.
 Analysis – to dissect the data by querying it and creating visualizations and dashboards on
top of it.
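A minimal sketch of the processing and storage steps; the Apache-style log line, the "access-logs" index name, and the local endpoint are all assumptions:

```python
# Parse a raw access-log line into structured fields, then index it.
import re

import requests

ES = "http://localhost:9200"

raw = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Processing: turn the raw message into meaningful, structured data.
pattern = (r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
           r'"(?P<method>\S+) (?P<path>\S+) \S+" '
           r'(?P<status>\d+) (?P<bytes>\d+)')
event = re.match(pattern, raw).groupdict()
event["status"] = int(event["status"])
event["bytes"] = int(event["bytes"])

# Storage: ship the structured event for later monitoring and analysis.
requests.post(f"{ES}/access-logs/_doc", json=event)
```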

Architecture of the ELK Stack:

For a small development environment, the ELK stack pipeline looks as follows:

Beats (Data Collection) -> Logstash (Data Aggregation and Processing) -> Elasticsearch
(Indexing and Storage) -> Kibana (Analysis and Visualization)

But for more complex scenarios, a buffering layer is added in front of Logstash:

Beats (Data Collection) -> Redis, Kafka, or RabbitMQ (Buffering) -> Logstash (Data
Aggregation and Processing) -> Elasticsearch (Indexing and Storage) -> Kibana (Analysis
and Visualization)

Fig. 2: ELK stack architecture with Kafka (Source: https://elastic-stack.readthedocs.io)

Generally, a production environment cannot scale out unlimitedly, and two bottlenecks appear:

 Logstash needs to process logs with pipelines and filters, which costs considerable time;
it can become a bottleneck when log bursts occur.
 Elasticsearch needs to index logs, which also costs time, and it becomes a bottleneck
when log bursts happen.

The above-mentioned bottlenecks can be smoothed by deploying more Logstash instances and
scaling out the Elasticsearch cluster, and, as in many other IT solutions, by introducing a
cache layer in the middle. One of the most popular ways to add such a layer is integrating
Kafka into the ELK stack.
Process Flow:
Data is collected by Beats and published to Kafka, which serves as a data hub: Beats persist
the data to Kafka, and Logstash nodes consume it from there for log processing. The common
ways to feed data into Logstash are the HTTP, TCP, and UDP protocols, and Logstash can
expose endpoint listeners with the respective TCP, UDP, and HTTP input plugins. After
processing, the logs are stored in Elasticsearch and consumed by Kibana for metric
visualisation.
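In a real deployment, Logstash's Kafka input plugin handles the consuming stage, but as a conceptual sketch of the hub pattern (using the kafka-python package, with an illustrative "logs" topic and index name):

```python
# Consume log events from Kafka and forward them to Elasticsearch,
# mimicking what the Logstash stage does in the pipeline above.
import json

import requests
from kafka import KafkaConsumer  # pip install kafka-python

ES = "http://localhost:9200"

consumer = KafkaConsumer("logs", bootstrap_servers="localhost:9092")
for message in consumer:
    event = json.loads(message.value)             # processing (trivial here)
    requests.post(f"{ES}/logs/_doc", json=event)  # storage
```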

Real-Time Uses:

Besides log analysis, the following are a few real-time use cases where Elasticsearch is
heavily used:

1. Text Mining and Natural Language Processing (NLP): Elasticsearch is widely used as
a search and analytics engine. The following are a few use cases.

Most NLP tasks start with a standard preprocessing pipeline, such as:

1. Gathering the data
2. Extracting raw text
3. Sentence splitting
4. Tokenization
5. Normalizing (stemming, lemmatization, etc.)
6. Stopword removal
7. Part-of-speech (POS) tagging

A. PREPROCESSING (NORMALIZATION)
Have you ever used the '_analyze' endpoint? Elasticsearch has over 20 language analyzers
built in. What does an analyzer do? Tokenization, stemming, and stopword removal. That is
very often all the preprocessing we need for higher-level tasks such as machine learning,
language modelling, etc. We basically just need a running instance of Elasticsearch, without
any configuration or setup, and the analyze endpoint can then be used as a REST API for
NLP preprocessing. For more information, please visit:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html.
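A minimal sketch of that usage, assuming only a local instance; the built-in 'english' analyzer tokenizes, stems, and removes stopwords in a single call:

```python
# Use the _analyze endpoint as a lightweight NLP preprocessor.
import requests

ES = "http://localhost:9200"

body = {
    "analyzer": "english",
    "text": "The quick foxes were jumping over the lazy dogs",
}
resp = requests.post(f"{ES}/_analyze", json=body)

# Each token comes back with its text, position, and offsets; here we
# keep just the normalized terms, roughly ['quick', 'fox', ..., 'dog']
# (the exact output depends on the analyzer version).
print([t["token"] for t in resp.json()["tokens"]])
```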

B. LANGUAGE DETECTION
Language detection is a major challenge in NLP. It can be solved by installing another
Elasticsearch plugin, 'langdetect'. For more information, please visit:
https://github.com/jprante/elasticsearch-langdetect. It uses character 3-grams and a Bayesian
filter supporting various normalization and feature-sampling techniques. The precision is over
99% for 53 languages, which is quite good. The plugin offers a 'mapping type' to specify the
fields where we want to enable language detection, and a REST endpoint where we can post
a short text in UTF-8 format; the plugin responds with a list of recognized languages.

What happens when a keyword-based query (such as the More Like This query discussed
below) is fired?

Elasticsearch analyses our input text, which comes either from documents in the index or
directly from the supplied 'like' text, extracts the most important keywords from that text,
and runs a Boolean query with all those keywords.

How does it know what a keyword is?

Keywords are determined with a formula that is applied to a set of documents and can be
used to compare a subset of the documents to all documents based on word probabilities.
It is called TF-IDF, a very important formula in text mining. It assigns a score to each term
in the subset relative to the entire corpus of documents. A high score indicates that the term
is likely to identify or characterize the current subset of documents and distinguish it clearly
from all other documents.
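As a sketch, the standard formulation (Elasticsearch's actual scoring adds further
normalizations on top of this idea) is:

tf-idf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of times term t occurs in document (or document subset) d, N is
the total number of documents in the corpus, and df(t) is the number of documents containing
t. A term scores high when it is frequent in d but rare across the corpus, which is exactly the
"distinguishes it clearly from all other documents" property described above.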

C. RECOMMENDATION ENGINE
Basically, recommendation engines can be of two types: social and content-based. A social
recommendation engine, like the one on the Amazon e-commerce site, performs what is
called "collaborative filtering": it recommends what people who bought this product also
bought. The other type is called an "item-based recommendation engine"; it tries to group
the dataset based on the properties of the entries and is used to answer questions like "is
there any novel or scientific paper similar to the one I read in the recent past?"

With Elasticsearch, we can easily build an item-based recommendation engine.

We just configure the 'MLT' query template based on our data, use the actual item ID as a
starting point, and recommend the most similar documents from our index. We can add
custom logic by running a bool query that combines a function score query, to boost by
popularity or recency, on top of the more-like-this query. The 'More Like This' (MLT) query
finds documents that are "like" a given set of documents. In order to do so, MLT selects a set
of representative terms from these input documents, forms a query using those terms,
executes the query, and returns the results. The user controls the input documents, how the
terms are selected, and how the query is formed. For more information, please visit:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html.
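A minimal sketch of such a recommender, assuming a hypothetical "articles" index in which document "42" is the item the user just read (all names here are illustrative):

```python
# Recommend documents similar to a seed document with the MLT query.
import requests

ES = "http://localhost:9200"

query = {
    "query": {
        "more_like_this": {
            "fields": ["title", "body"],
            # The seed: the item the user actually read.
            "like": [{"_index": "articles", "_id": "42"}],
            "min_term_freq": 1,     # keep terms even if rare in the seed
            "max_query_terms": 25,  # cap the representative terms used
        }
    }
}
resp = requests.post(f"{ES}/articles/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```

To boost by popularity or recency as described above, this more_like_this clause would be wrapped in a bool query together with a function_score query.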

D. DUPLICATE DETECTION
If we have data from several sources (news, affiliate ads, etc.), there is a possibility that our
model runs on a dataset containing many duplicates, which is unwanted behaviour for most
end-user applications.

How does it work?

Duplicate detection comes with a challenge: naively, we need to compare all documents
pairwise. The objective is to retain the first inspected element and discard all others, so we
need a lot of custom logic to choose which document to look at first. As the complexity is
very high, it is quite difficult to detect duplicates offline; an online tool is much better suited
for this. The industry-standard algorithms for duplicate detection are SimHash and MinHash
(used by Google and Twitter). They generate hashes for all documents, store them in an
extra datastore, and compare them with a similarity function; the documents that exceed a
certain threshold are considered duplicates. For very short documents, we can work with the
Levenshtein (minimum edit) distance, but for longer documents, we might want to rely on a
token-based solution. For more information, please visit:
https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch.
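A minimal sketch in the spirit of that blog post: group documents by a hash of their content and flag any hash shared by more than one document. The "news" index and "body" field are assumptions, and an exact hash only catches identical texts; SimHash/MinHash would also tolerate near-duplicates:

```python
# Find exact-duplicate documents by hashing their content field.
import hashlib
from collections import defaultdict

import requests

ES = "http://localhost:9200"

# Fetch a batch of documents (a real job would scroll through the index).
resp = requests.post(
    f"{ES}/news/_search",
    json={"size": 1000, "query": {"match_all": {}}},
)

groups = defaultdict(list)
for hit in resp.json()["hits"]["hits"]:
    digest = hashlib.md5(hit["_source"]["body"].encode("utf-8")).hexdigest()
    groups[digest].append(hit["_id"])

# Retain the first inspected document per group; the rest are duplicates.
for digest, ids in groups.items():
    if len(ids) > 1:
        print(f"duplicates of {ids[0]}: {ids[1:]}")
```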

2. Image Processing:
Can you imagine how nice it would be if there were a tool with an image search facility?

This can be addressed with DeepDetect (https://www.deepdetect.com/). We send images to
DeepDetect, the images get annotated, and the annotations and the image URL are indexed
into Elasticsearch directly, without any glue code. DeepDetect is a classification service that
distinguishes among 1000 different image categories, from 'ambulance' to 'padlock' to
'hedgehog', and indexes images with their categories into an instance of Elasticsearch. For
every image, the DeepDetect server directly indexes the predicted categories into
Elasticsearch, avoiding any glue code between the deep learning server and Elasticsearch.
DeepDetect supports output templates, which allow transforming the standard output of the
DeepDetect server into any custom format; this makes it possible to search images with text
even when they have no caption. It is also scalable, as prediction works over batches of
images, and multiple prediction servers can be set up to work in parallel. A sketch of this
flow follows the list below. The following are a few use cases that can be addressed using
DeepDetect:
 Signatureless malware detection from binaries
 Anomaly detection from raw traffic logs
 False-positive filtering of SOC alerts
 Domain Generation Algorithm detection
 URL filtering and clustering on GPUs
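A rough sketch of the predict-then-index flow, assuming a DeepDetect server on localhost:8080 with an image-classification service already created under the hypothetical name "imageserv"; the request and response shapes follow DeepDetect's predict API as I understand it, so check the documentation for your version:

```python
# Classify an image with DeepDetect, then index the categories into
# Elasticsearch so the image becomes searchable by text.
import requests

DD = "http://localhost:8080"  # DeepDetect server (assumed)
ES = "http://localhost:9200"  # Elasticsearch (assumed)

image_url = "http://example.com/photos/hedgehog.jpg"  # placeholder

# Ask DeepDetect for the top 3 categories of the image.
pred = requests.post(f"{DD}/predict", json={
    "service": "imageserv",
    "parameters": {"output": {"best": 3}},
    "data": [image_url],
}).json()
classes = pred["body"]["predictions"][0]["classes"]

# Store the image URL together with its predicted categories.
doc = {"url": image_url, "categories": [c["cat"] for c in classes]}
requests.post(f"{ES}/images/_doc", json=doc)
```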
3. Additional Applications:
The Elastic Stack, along with custom-built Elasticsearch plugins, helps to drive the
following content search experiences:

 Search based on computer vision and metadata

 Deep textual and hybrid content search

 Video and richer format search

 Enterprise search

 Discovery and recommendations

4. Crawling and Document Processing:


StormCrawler is a popular and mature open source web crawler, typically used to provide
documents for a search engine to index; with Elastic being an open source tool for search
and analytics, StormCrawler needed a resource to achieve this. The IndexBolt in its
Elasticsearch module takes a web page fetched and parsed by StormCrawler and sends it to
Elasticsearch for indexing. It builds a representation of the document containing its URL,
the text extracted by the parser, and any relevant metadata extracted during parsing, such
as the title of the page, keywords, summary, language, hostname, etc. StormCrawler comes
with various resources for data extraction, which can be easily configured or extended.

What differentiates StormCrawler from other web crawlers is that it uses Elasticsearch as a
back end for storage as well. Elasticsearch is an excellent resource for this and provides
visibility into the data as well as great performance. The Elasticsearch module contains a
number of spout implementations, which query the status index to get the URLs for
StormCrawler to fetch. For more information, please visit:
https://www.elastic.co/blog/stormcrawler-open-source-web-crawler-strengthened-by-elasticsearch-kibana.

5. Multitenancy:

Often, we have multiple customers or users with separate collections of documents, and a
user should never be able to search documents that do not belong to them. This usually ends
in a design where each user has their own index, and more often than not, this leads to way
too many indexes. In almost every case where index-per-user is implemented, one larger
shared Elasticsearch index is the better choice (see the sketch below), because a huge number
of small indexes has the following downsides:

 Memory overhead: thousands of small indexes consume a lot of heap space.

 There can be a lot of duplication.
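A minimal sketch of single-index multitenancy using filtered aliases, assuming a shared "documents" index with a "tenant" keyword field (all names are illustrative). Each user searches through their own alias and can never see another tenant's documents:

```python
# One filtered alias per user on top of a single shared index.
import requests

ES = "http://localhost:9200"

actions = {"actions": [{
    "add": {
        "index": "documents",
        "alias": "documents-alice",
        "filter": {"term": {"tenant": "alice"}},
    }
}]}
requests.post(f"{ES}/_aliases", json=actions)

# Alice's searches go through her alias, so the tenant filter is implicit.
query = {"query": {"match": {"body": "quarterly report"}}}
resp = requests.post(f"{ES}/documents-alice/_search", json=query)
print(resp.json()["hits"]["total"])
```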

Conclusion:
There is a lot to learn about Elasticsearch, and sometimes it can be hard to know what you
need to learn. In this article, I have covered quite a few common use cases and some
important things to be aware of for each of them.
