Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyendu Bhattacharya, Pearson

Near Real Time Indexing Kafka Messages into Apache Blur
Dibyendu Bhattacharya
Big Data Architect, Pearson

Pearson: What We Do
We are building a scalable, reliable cloud-based learning platform providing services to power the next generation of products for Higher Education.
With a common data platform, we build up student analytics across product and institution boundaries that deliver efficacy insights to learners and institutions not possible before.
Pearson is building
● The world's greatest collection of educational content
● The world's most advanced data, analytics, adaptive, and personalization capabilities for education

Pearson Learning Platform: GRID
[Diagram of the GRID platform. Labels: Assessment, Course Structure, Grades, Foundational Services, Recommendations, Assignments, Achievements, Mobile Enablers, Entitlements Authorization, Identity, Data Channels, Analytics, Media, Activity Eventing Integration, eCommerce, API Management (Apigee), Catalog, Content, Customer (CRM), Adaptive, Learner, Behavioral Profile, IaaS (AWS), Learning Services]

Pearson Search Services: Why Blur
We are presently evaluating Apache Blur, a distributed search engine built on top of Hadoop and Lucene.
The primary reasons for using Blur are:
• It is a distributed search platform that stores its indexes in HDFS.
• It leverages the strengths already built into the Hadoop and Lucene stack.
Benefit          Description
Scalable         Store, index, and search massive amounts of data from HDFS
Fast             Performance similar to a standard Lucene implementation
Durable          Provided by a built-in WAL-like store
Fault Tolerant   Auto-detects node failure and reassigns indexes to surviving nodes
Query            Supports all standard Lucene queries and join queries
“HDFS-based indexing is valuable when folks are also using Hadoop for other purposes (MapReduce, SQL queries, HBase, etc.). There are
considerable operational efficiencies to a shared storage system. For example, disk space, users, etc. can be centrally managed.”
- Doug Cutting

Blur Features
Fast Data Ingestion: Blur’s massively parallel processing (MPP) architecture uses MapReduce to
bulk load data at incredible speeds.
Near Real-Time Updates: Blur’s remote API allows new data to be indexed and searchable right
away.
Powerful and Accurate Search: Blur allows you to get precise search results against terabytes of
data at Google-like speed.
Actionable Data: Blur allows you to build rich data models and search them in a semi-relational
manner -- similar to joins while querying a relational database -- allowing you to uncover new
relationships in your data.
Secure: Blur provides record-level access control to your data. This ensures that users only see the
information that they are authorized to view.
Scalable & Fault Tolerant: Blur provides linear scaling using commodity server hardware and
provides automatic and immediate failover when a node in the cluster goes down.
Open Source: Blur is part of the Apache Incubator program.

Blur Architecture
Components   Purpose
Lucene       Performs the actual search duties
HDFS         Stores the Lucene index
MapReduce    Uses Hadoop MR for batch indexing
Thrift       Inter-process communication
ZooKeeper    Manages system state and stores metadata
Blur uses two types of server processes:
• Controller Server: orchestrates communication between all Shard Servers.
• Shard Server: responsible for performing searches on all of its shards and returning the results to the controller.
[Diagram: a Controller Server and a Shard Server, each with its own cache; the Shard Server hosts multiple shards.]
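The division of labour between the two server types can be illustrated with a small scatter-gather sketch. This is only a toy model under stated assumptions: the class names `ShardServer` and `Controller`, the `(score, doc_id)` hit format, and the `min_score` filter are hypothetical and are not Blur's Thrift API. It shows a controller fanning a query out to every shard server and merging the partial results.

```python
import heapq


class ShardServer:
    """Hypothetical shard server: searches every shard it hosts."""

    def __init__(self, shards):
        # Each shard is modelled as a list of (score, doc_id) hits.
        self.shards = shards

    def search(self, min_score, limit):
        hits = [hit for shard in self.shards for hit in shard if hit[0] >= min_score]
        return heapq.nlargest(limit, hits)


class Controller:
    """Hypothetical controller: fans out to shard servers and merges results."""

    def __init__(self, shard_servers):
        self.shard_servers = shard_servers

    def search(self, min_score, limit):
        partial = [
            hit
            for server in self.shard_servers
            for hit in server.search(min_score, limit)
        ]
        return heapq.nlargest(limit, partial)


servers = [
    ShardServer([[(0.9, "a"), (0.2, "b")], [(0.5, "c")]]),
    ShardServer([[(0.7, "d")], [(0.95, "e")]]),
]
controller = Controller(servers)
top_hits = controller.search(min_score=0.3, limit=2)
print(top_hits)  # [(0.95, 'e'), (0.9, 'a')]
```

Each shard server only returns its own top hits, so the controller merges at most `limit` hits per server rather than every matching document.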

Blur Architecture
[Diagram: two Controller Servers and three Shard Servers, each with its own cache; every Shard Server hosts multiple shards.]

Major Challenges Blur Solved
Random Access Latency w/HDFS
Problem :
HDFS is a great file system for streaming large amounts of data across large-scale clusters. However,
its random access latency is typically what you would get reading from a local drive when the data
you are trying to access is not in the operating system's file cache. In other words,
every access to HDFS is similar to a local read with a cache miss. Lucene relies on file system
caching or MMAP of the index for performance when executing queries on a single machine with a
normal OS file system. Most of the time the Lucene index files are cached by the operating system's file
system cache.
Solution:
Blur has a Lucene Directory-level block cache that stores the hot blocks of the files that Lucene
uses for searching. A concurrent LRU map stores the location of the blocks in preallocated slabs of
memory. The slabs of memory are allocated at start-up and in essence are used in place of the OS file
system cache.
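The idea of an LRU map pointing into preallocated slabs can be sketched in a few lines. This is a minimal, single-threaded illustration and not Blur's implementation: the `BlockCache` class, the tiny block and slab sizes, and the `(file name, block id)` key are all assumptions, and the real cache is concurrent.

```python
from collections import OrderedDict

BLOCK_SIZE = 8        # bytes per block (tiny, for illustration only)
BLOCKS_PER_SLAB = 2   # blocks held by each preallocated slab
NUM_SLABS = 2         # slabs allocated once at start-up


class BlockCache:
    """LRU map from (file name, block id) to a slot inside preallocated slabs."""

    def __init__(self):
        # Slabs are allocated up front and reused; nothing is allocated per block.
        self.slabs = [bytearray(BLOCK_SIZE * BLOCKS_PER_SLAB) for _ in range(NUM_SLABS)]
        self.free_slots = [(s, b) for s in range(NUM_SLABS) for b in range(BLOCKS_PER_SLAB)]
        self.locations = OrderedDict()  # key -> (slab, slot), oldest entry first

    def get(self, key):
        slot = self.locations.get(key)
        if slot is None:
            return None  # cache miss: the caller would read the block from HDFS
        self.locations.move_to_end(key)  # mark as most recently used
        slab, index = slot
        start = index * BLOCK_SIZE
        return bytes(self.slabs[slab][start:start + BLOCK_SIZE])

    def put(self, key, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        slot = self.locations.pop(key, None)
        if slot is None:
            if self.free_slots:
                slot = self.free_slots.pop()
            else:
                # Evict the least recently used block and reuse its slot.
                _, slot = self.locations.popitem(last=False)
        slab, index = slot
        start = index * BLOCK_SIZE
        self.slabs[slab][start:start + BLOCK_SIZE] = data
        self.locations[key] = slot
```

Because evicted slots are reused in place, memory use stays fixed at the size chosen at start-up, which is the property the slide attributes to the slab design.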

Blur Index Directory Structure
-----CacheDirectory : Write Through Cache
|
------Use JoinDirectory ( Long Term and Short Term Storage)
|
------ Long Term : HDFSDirectory ( Files Written here After Merge)
------ Short Term : FastHdfsKeyValueDirectory (Small or Recently Added Files)
|
-- Uses HdfsKeyValueStore, which acts like a WAL
The Blur V1 BlockCache and HDFSDirectory were contributed to Lucene/Solr.
The present Blur V2 cache is more advanced and tuned for performance.

Blur Data Structure
Blur is a table-based query system. Within a single cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (Lucene StringField internally) and many Records. A Record has a record id (Lucene StringField internally), a family (Lucene StringField internally), and many Columns. A Column contains a name and a value; both are Strings in the API, but the value can be interpreted as different types. All base Lucene Field types are supported: Text, String, Long, Int, Double, and Float.
Row Query:
• Executes queries across Records within the same Row.
• Similar in idea to an inner join.
For example, find all the Rows that contain a Record with the family "author" whose "name" Column contains the term "Jon", and another Record with the family "docs" whose "body" Column contains the term "Hadoop":
+<author.name:Jon>
+<docs.body:Hadoop>
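The Row, Record, and Column model and the semantics of a Row Query can be mimicked with plain data classes. The sketch below is an in-memory illustration only: the `Row`, `Record`, `record_matches`, and `row_query` names and the sample data are hypothetical, terms are matched by simple whitespace tokenization, and none of this is Blur's API.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    record_id: str
    family: str
    columns: dict = field(default_factory=dict)  # column name -> string value


@dataclass
class Row:
    row_id: str
    records: list = field(default_factory=list)


def record_matches(record, family, column, term):
    """True if the record is in `family` and `column` contains `term` as a word."""
    value = record.columns.get(column, "")
    return record.family == family and term.lower() in value.lower().split()


def row_query(rows, clauses):
    """Return ids of rows in which every clause is met by at least one record."""
    return [
        row.row_id
        for row in rows
        if all(any(record_matches(r, *clause) for r in row.records) for clause in clauses)
    ]


rows = [
    Row("row-1", [
        Record("r1", "author", {"name": "Jon Smith"}),
        Record("r2", "docs", {"body": "Intro to Hadoop"}),
    ]),
    Row("row-2", [
        Record("r3", "author", {"name": "Jon Doe"}),
        Record("r4", "docs", {"body": "Cooking basics"}),
    ]),
]

# Mirrors the two required clauses: author.name:Jon and docs.body:Hadoop
matches = row_query(rows, [("author", "name", "Jon"), ("docs", "body", "Hadoop")])
print(matches)  # ['row-1']
```

The two clauses are satisfied by different Records of the same Row, which is why the slide compares a Row Query to an inner join.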

Kafka Stream Indexing to Blur
Major Challenges:
• No reliable Kafka consumer for Spark Streaming exists.
• The present high-level Kafka consumer has possible data loss.
• No Spark-to-Blur connector is available.

Spark RDD
An RDD in Spark is simply a distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
There is a lot more: different types of RDD (e.g. PairRDD), RDD persistence, RDD checkpointing, ....

Spark in a Slide
• RDD (Resilient Distributed Dataset): a logically centralized entity that is physically partitioned across multiple machines inside a cluster, based on some notion of key.
• Driver program: launches various parallel operations on a cluster. Driver programs access Spark through a SparkContext object, which helps to create RDDs.
• An RDD can optionally be cached in memory, providing fast access. An RDD can also be checkpointed to disk.
• Application logic is expressed as a sequence of Transformations and Actions. A "Transformation" specifies the processing dependency among RDDs, and an "Action" specifies what the output will be.
Spark Streaming and Kafka
Spark Streaming has a Kafka high-level consumer, which has a data loss problem.
Possible data loss scenarios:
1. Receiver failure (SPARK-4062)
2. Driver failure (SPARK-3129)
We have implemented a low-level Kafka-Spark consumer to solve the receiver failure problem.
https://github.com/dibbhatt/kafka-spark-consumer
[Diagram: Kafka topic]

Low Level Kafka Consumer Challenges
The consumer is implemented as a custom Spark Receiver, which needs to handle the following:
• The consumer needs to know the leader of a partition.
• The consumer should be aware of leader changes.
• The consumer should handle ZK timeouts.
• The consumer needs to manage Kafka message offsets.
• The consumer needs to handle fetches from the topic.
All of the Kafka-related challenges are already solved in the low-level Storm-Kafka spout. That has been modified to run as a Spark Receiver.
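Of the responsibilities listed above, offset management is the one most directly tied to data loss. The sketch below shows one common approach, committing an offset only after a batch has been handled, using in-memory stand-ins. `PartitionLog`, `OffsetStore`, and `consume` are hypothetical names; the real consumer talks to Kafka brokers, and where it persists offsets is an assumption here.

```python
class PartitionLog:
    """In-memory stand-in for one Kafka topic partition."""

    def __init__(self, messages):
        self.messages = list(messages)

    def fetch(self, offset, max_messages):
        return self.messages[offset:offset + max_messages]


class OffsetStore:
    """Stand-in for the durable place where the consumer records its progress."""

    def __init__(self):
        self.committed = {}

    def load(self, partition):
        return self.committed.get(partition, 0)

    def commit(self, partition, offset):
        self.committed[partition] = offset


def consume(partition, log, store, handle, batch_size=2):
    """Fetch from the last committed offset; commit only after a batch is handled."""
    offset = store.load(partition)
    while True:
        batch = log.fetch(offset, batch_size)
        if not batch:
            return offset
        handle(batch)  # may raise; the offset is then left uncommitted
        offset += len(batch)
        store.commit(partition, offset)
```

If the receiver fails while handling a batch, a restarted consumer resumes from the last committed offset, so messages may be handled again but are not skipped.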

Kafka to Blur Integration Using Spark
https://github.com/dibbhatt/spark-blur-connector
1. Create the Spark conf. Specify the serializer.
2. Create the streaming context. Set the checkpoint directory.
3. Create a receiver for every topic partition.
4. Create a union of all receiver streams.
5. First transformation: union stream to pair stream. Create the BlurMutate object.
6. For every RDD of this pair stream, perform some action. Specify the Blur table details.
7. Make the number of partitions for this RDD the same as the Blur table shard count, using a custom partitioner. Finally, checkpoint the RDD.
8. Blur Hadoop job-specific properties.
9. Finally, save the RDD as a Hadoop file. This uses the Blur HDFSDirectory to write indexes. This does not go through CacheDirectory.
10. Start stream execution.
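The custom partitioner step exists so that each RDD partition lines up with exactly one Blur shard. A minimal sketch of that idea follows. The function names are hypothetical and the CRC32 hash is only an assumption chosen for illustration; it is not Blur's own partitioning logic, which a real connector would need to reuse so that rows land in the shard Blur expects.

```python
import zlib


def shard_for(row_id, shard_count):
    """Map a row id to a shard index in [0, shard_count). CRC32 is an assumption."""
    return zlib.crc32(row_id.encode("utf-8")) % shard_count


def partition_by_shard(pairs, shard_count):
    """Group (row_id, mutation) pairs into one partition per shard."""
    partitions = [[] for _ in range(shard_count)]
    for row_id, mutation in pairs:
        partitions[shard_for(row_id, shard_count)].append((row_id, mutation))
    return partitions


pairs = [("row-%d" % i, {"n": i}) for i in range(20)]
parts = partition_by_shard(pairs, shard_count=4)
print([len(p) for p in parts])
```

With the partition count equal to the shard count, each task can write the index files for a single shard without coordinating with other tasks.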

Thank You
Questions?
We are Hiring @ jobs.pearson.com
