OCTOBER 11-14, 2016 • BOSTON, MA
Leveraging the Power of Solr with Spark
JOHANNES WEIGEND
CTO, QAware GmbH / Germany
Agenda
Introduction to Solr Cloud and Spark
Importing
Searching and Aggregating
Scaling Up
It is Hard to Scale Horizontally!
■ Functions
- Trivial
  - Load balancing of stateless services (macro- / microservices)
  - More users -> more machines
- Nontrivial
  - More machines -> faster response times
■ Data
- Trivial
  - Linear distribution of data on multiple machines
  - More machines -> more data
- Nontrivial
  - Constant response times with growing datasets
Solr Cloud
- Document based NoSQL database with outstanding search capabilities
  - A document is a collection of fields (string, number, date, …)
  - Single and multi-valued fields (fields can be arrays)
  - Nested documents
  - Static and dynamic schema
  - Powerful query language (Lucene)
- Horizontally scalable with Solr Cloud
  - Distributed data in separate shards
  - Resilience by combination of ZooKeeper and replication
- Powerful aggregations (aka facets)
The Architecture of Solr Cloud
Two Levels of Distribution
[Diagram: a Solr Cloud made of Solr servers and a ZooKeeper ensemble. A collection is divided into shards (Shard1 to Shard9); the shards have replicas (Replica1 to Replica9) and a leader. Scale out by adding Solr servers.]
Combining Solr + Spark
[Diagram: three Search nodes (Solr: search, index, store) and three Map nodes (Spark: calculate, cache, join, combine), with a Reduce step in the business layer serving the frontend.]
READ THIS: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
■ Distributed computing (up to 100x faster than Hadoop M/R)
■ Distributed Map/Reduce on distributed data can be done in-memory
■ Supports online and batch workloads
■ Written in Scala, with Java/Scala/Python APIs
■ Processes data from distributed and local sources
- Text files (accessible from all nodes)
- Hadoop File System (HDFS)
- Databases (JDBC)
- Solr via the Lucidworks API
Apache Spark
[Diagram: a driver and two executors, each executing parallel tasks.]
The Cloud in a Box
CPU: 6th generation Intel® Core™ i5-6260U processor with Intel® Iris™ graphics (1.9 GHz up to 2.8 GHz Turbo, Dual Core, 4 MB Cache, 15W TDP)
RAM: 32 GB Dual-channel DDR4 SODIMMs, 1.2V, 2133 MHz
DISK: 256 GB Samsung M.2 internal SSD
Total: 10 Cores, 20 HT Units, 160 GB RAM, 1.25 TB Disk
! Used for all benchmarks in this talk
Importing
Apache Big Data North America | Vancouver | 05.05.2016 | Johannes Weigend | © QAware GmbH
Monitoring Sample Data
■ Single CSV per process, host, metric type
wls1_lpapp18_jmx.csv
Datetime            CPU % Usage   Heap % Usage   #GC Invocations
1/10/16 9:00,000    50            50             1000
1/10/16 10:00,000   60            60             1100
1/10/16 11:00,000   70            70             1300
1/10/16 12:00,000   80            80             1800
CSV -> one Solr document per cell
Importing and Indexing into Solr can be slow
[Diagram: a single client reads the input data, creates a batch, and adds the batch to Solr with CloudSolrClient add(List<document> batch), which sends it to the shards on SOLR1, SOLR2 and SOLR3. Bottlenecks: processing and network.]
Some Options to Speed Things Up
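The three client steps in the diagram (read input data, create batch, add batch to Solr) can be sketched as a small batching loop. This is a minimal Python sketch, not the importer used in the talk: `send_batch` is a hypothetical stand-in for the Solr client's add call, and the batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def import_docs(docs: Iterable[dict],
                send_batch: Callable[[List[dict]], None],
                size: int = 1000) -> int:
    """Send documents batch by batch; return the number of batches sent.

    `send_batch` is hypothetical: a real importer would wrap the Solr
    client's add call here, so each batch costs one network round trip.
    """
    sent = 0
    for batch in batches(docs, size):
        send_batch(batch)
        sent += 1
    return sent


# Usage with an in-memory sink instead of a Solr server.
received: List[List[dict]] = []
docs = ({"id": str(i), "value": i} for i in range(2500))
n_batches = import_docs(docs, received.append, size=1000)
```

Larger batches reduce the number of round trips (the network bottleneck), while creating the batches stays on a single client (the processing bottleneck).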
Parallel Import with Spark makes Import Scalable
[Diagram: Parallel Cloud Importer with distributed input data. On each node (Node1, Node2, Node3, ... Node n) a Spark executor reads input data, creates a batch, and adds the batch to Solr through its own CloudSolrClient add(List<document> batch), next to a Solr server (Solr Server 1 to 3) holding the shards. Scale up by adding nodes.]
How to Import Multiple (HDFS) Files
Solr UUID-Field
Indexing 14 million docs in 1:20 min
Import takes 78411 ms
-> about 180,000 docs per second
SolrJ and Spark have Different Transitive Dependencies Depending on the Software Version
■ Adding both libraries to your classpath pulls in conflicting transitive dependencies, which leads to serious problems at runtime (serialization errors / ClassNotFoundExceptions…)
■ Pinning / exclusion helps, but can produce strange errors. There is currently no satisfying solution for the Big Data classpath hell.
Searching and Aggregating
Using Solr Facet Queries for Aggregation
# Grouping per sub query
curl $SOLR/$COLLECTION/select -d '
q=process:wls1 AND metric:*.HeapMemoryUsage.used&
rows=0&
json.facet={
  Hosts: {
    type: terms,
    field: host,
    facet: {
      Off      : { query : "value: [* TO 0]" },
      Idle     : { query : "value: [0 TO 1000000000]" },
      Busy     : { query : "value: [1000000001 TO 10000000000]" },
      Overload : { query : "value: [10000000001 TO *]" }
    }
  }
}'
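The same facet request can also be assembled programmatically, which avoids quoting mistakes in a hand-written shell string. The sketch below only builds the form-encoded request body and does not contact a Solr server; the helper name `range_query` is an assumption for illustration, while the field names and ranges are copied from the slide.

```python
import json
from urllib.parse import urlencode


def range_query(field: str, low, high) -> dict:
    """Build one sub-facet that counts documents in a value range."""
    return {"query": f"{field}: [{low} TO {high}]"}


# Terms facet on `host`, with four range sub-facets per host bucket.
facet = {
    "Hosts": {
        "type": "terms",
        "field": "host",
        "facet": {
            "Off": range_query("value", "*", 0),
            "Idle": range_query("value", 0, 1000000000),
            "Busy": range_query("value", 1000000001, 10000000000),
            "Overload": range_query("value", 10000000001, "*"),
        },
    }
}

params = {
    "q": "process:wls1 AND metric:*.HeapMemoryUsage.used",
    "rows": 0,  # only the facet counts are needed, no documents
    "json.facet": json.dumps(facet),
}

# Form-encoded body, the same content type that `curl -d` sends.
body = urlencode(params)
```

The resulting `body` could be sent as a POST to `$SOLR/$COLLECTION/select`, mirroring the curl call.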
Why Do We Need Even More?
■ Data-centric applications need a scalable way of
- Post processing search results or facets (business logic, ML, data analytics)
- Post filtering search results
- Processing denormalized data (if you store a one-to-many relation in a single Solr document)
Accessing Solr from Spark with SolrRDD
■ https://github.com/lucidworks/spark-solr
■ You have to build the library locally. There is no released version on Maven Central.
■ Make sure to adjust the versions depending on your environment
Streaming from Solr into Spark
Not bad! 14 million docs in 1:27 minutes
You Can Speed up Spark / Solr by a Factor of 10 Using the Export Handler
Using SolrRDD with Java
Reading 14 million docs in 10 seconds
Streaming 14 million Solr documents into Spark takes 10 seconds
-> 1,400,000 docs per second
RDDs using the /export handler rock!
Scaling up
Recap: Monitoring Sample Data
■ Single CSV per process, host, metric type
wls1_lpapp18_jmx.csv
Date                CPU % Usage   Heap % Usage   #GC Invocations
1/10/16 9:00,000    50            50             1000
1/10/16 10:00,000   60            60             1100
1/10/16 11:00,000   70            70             1300
1/10/16 12:00,000   80            80             1800
CSV -> Solr
1000 lines with 10,000 columns = 3 MB gzipped
1000 x 10,000 docs = 1 million Solr docs
A Naive Solr Data Model
A single Solr document per CSV cell
‣ Advantage
You can use Solr for aggregation, sorting and searching for values or time intervals
‣ Disadvantage
Data explosion (a single compressed CSV file of 3 MB produces 1 million Solr documents)
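The naive model can be sketched in a few lines: every CSV cell becomes its own document, tagged with the process, host and metric type taken from the file name. This is an illustration only; the `;` delimiter, the shortened timestamps and the field name `date` are assumptions, while `process`, `host`, `type`, `metric` and `value` follow the names used elsewhere in the slides.

```python
import csv
import io

# Sample file content; the ';' delimiter is an assumption.
CSV_TEXT = "\n".join([
    "Datetime;CPU % Usage;Heap % Usage;#GC Invocations",
    "1/10/16 9:00;50;50;1000",
    "1/10/16 10:00;60;60;1100",
    "1/10/16 11:00;70;70;1300",
    "1/10/16 12:00;80;80;1800",
])


def cell_documents(filename: str, text: str) -> list:
    """Create one document per CSV cell (the naive data model)."""
    # File name convention from the slides: <process>_<host>_<type>.csv
    process, host, metric_type = filename.rsplit(".", 1)[0].split("_")
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    docs = []
    for row in reader:
        timestamp = row[0]
        for metric, value in zip(header[1:], row[1:]):
            docs.append({
                "process": process,
                "host": host,
                "type": metric_type,
                "metric": metric,
                "date": timestamp,
                "value": int(value),
            })
    return docs


docs = cell_documents("wls1_lpapp18_jmx.csv", CSV_TEXT)
```

Four rows with three metric columns already yield twelve documents, which shows where the data explosion comes from.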
Column Based Denormalization
wls1_lpapp18_jmx.csv
Date                CPU % Usage   Heap % Usage   #GC Invocations
1/10/16 9:00,000    50            50             1000
1/10/16 10:00,000   60            60             1100
1/10/16 11:00,000   70            70             1300
1/10/16 12:00,000   80            80             1800
CSV -> one Solr document per column (n values : 1 document)
SolrDocument {
  process: wls1
  host: lpapp18
  type: jmx
  mindate: 1/10/16 9:00
  maxdate: 1/10/16 12:00
  metric: CPU % Usage
  values: [BINARY (Date, Long)]
  max: 80
  min: 50
  avg: 65
}
Store 1000-10000 events in a single document
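A minimal sketch of the column-based model, assuming the rows are already parsed and sorted by time: each metric column becomes one document that carries all its (date, value) pairs plus precomputed min, max and avg. The function name and the plain list used for `values` are assumptions; the slide stores the values in a binary field.

```python
from statistics import mean

HEADER = ["Date", "CPU % Usage", "Heap % Usage", "#GC Invocations"]
ROWS = [
    ["1/10/16 9:00", "50", "50", "1000"],
    ["1/10/16 10:00", "60", "60", "1100"],
    ["1/10/16 11:00", "70", "70", "1300"],
    ["1/10/16 12:00", "80", "80", "1800"],
]


def column_documents(filename: str, header: list, rows: list) -> list:
    """Create one document per metric column (rows sorted by time)."""
    process, host, metric_type = filename.rsplit(".", 1)[0].split("_")
    docs = []
    for col, metric in enumerate(header[1:], start=1):
        values = [(row[0], int(row[col])) for row in rows]
        numbers = [number for _, number in values]
        docs.append({
            "process": process,
            "host": host,
            "type": metric_type,
            "metric": metric,
            "mindate": values[0][0],   # first timestamp of the series
            "maxdate": values[-1][0],  # last timestamp of the series
            "values": values,          # kept as a plain list in this sketch
            "min": min(numbers),
            "max": max(numbers),
            "avg": mean(numbers),
        })
    return docs


docs = column_documents("wls1_lpapp18_jmx.csv", HEADER, ROWS)
```

The precomputed `min`, `max` and `avg` fields keep simple range searches possible in Solr without unpacking `values`.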
Storing a 1-to-1400 Relation in a Single Document
Base64 encoded and gzipped
values: [{date: …, value:}, … ]
32k limit for DocValues
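The encoding step can be illustrated with the Python standard library. This is a sketch under stated assumptions: the points are serialized as JSON before compression (the slides describe a binary (Date, Long) layout, which is not reproduced here), and the 1400 sample points are synthetic.

```python
import base64
import gzip
import json


def encode_values(values: list) -> str:
    """Serialize, gzip and Base64-encode a list of data points."""
    raw = json.dumps(values).encode("utf-8")
    return base64.b64encode(gzip.compress(raw)).decode("ascii")


def decode_values(blob: str) -> list:
    """Reverse of encode_values: Base64-decode, gunzip, deserialize."""
    raw = gzip.decompress(base64.b64decode(blob))
    return json.loads(raw.decode("utf-8"))


# 1400 synthetic points, one per minute (timestamps in milliseconds).
points = [{"date": 1452416400000 + i * 60000, "value": 50 + i % 30}
          for i in range(1400)]

blob = encode_values(points)
restored = decode_values(blob)
plain_size = len(json.dumps(points))
```

Whether an encoded series stays under the field size limit depends on the data, so a real importer would need to check the length of `blob` before indexing.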
Benefits of Denormalization
‣ Benefits
- You can scale from xxx million documents in a Solr Cloud up to trillions of searchable events
- Import is vastly faster
‣ Drawbacks
- Searching on single values requires additional logic
- Counting and faceting requires additional logic
‣ Spark can solve these problems by parallel post processing
- Decompressing, aggregating, joining, grouping
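The post-processing idea can be sketched without a cluster: a map step decompresses each document and counts matching values, and a reduce step combines the partial counts. A thread pool stands in for Spark's parallel map here, which is an assumption made for illustration only; the JSON serialization, the second host `lpapp19` and the threshold are likewise hypothetical.

```python
import base64
import gzip
import json
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def pack(values: list) -> str:
    """Gzip and Base64-encode a JSON list of [timestamp, value] pairs."""
    return base64.b64encode(gzip.compress(json.dumps(values).encode())).decode()


def unpack(blob: str) -> list:
    """Reverse of pack."""
    return json.loads(gzip.decompress(base64.b64decode(blob)).decode())


def count_above(doc: dict, threshold: int) -> int:
    """Map step: decompress one document, count values above threshold."""
    return sum(1 for _, value in unpack(doc["values"]) if value > threshold)


# Two denormalized documents; host lpapp19 is made up for the example.
docs = [
    {"host": "lpapp18", "values": pack([[0, 50], [1, 60], [2, 70], [3, 80]])},
    {"host": "lpapp19", "values": pack([[0, 10], [1, 90]])},
]

# Parallel map (a thread pool instead of Spark executors), then reduce.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(lambda d: count_above(d, 65), docs))
total = reduce(lambda a, b: a + b, partial_counts, 0)
```

In Spark the same shape would be a map over the documents followed by a reduce, with the decompression running on the executors.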
Accessing Compressed Data within Spark
Indexing 19 million CSV values in 13,500 Solr documents
now takes 24 seconds (before: 1:20 min)
-> 800,000 values per second
Streaming one billion Solr values into Spark
now takes 34 seconds (before: 700 s)
-> 29,000,000 values per second
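The throughput figures quoted in the benchmarks are rounded. As a quick check, the rates and speedups can be recomputed from the counts and durations stated on the slides (no new measurements are involved; the variable names are only labels for this sketch):

```python
def per_second(count: int, seconds: float) -> float:
    """Throughput as items per second."""
    return count / seconds


# Naive model: one Solr document per CSV cell.
naive_import = per_second(14_000_000, 78.411)    # import took 78411 ms
solrrdd_read = per_second(14_000_000, 87)        # 1:27 min
export_read = per_second(14_000_000, 10)         # /export handler

# Denormalized model: one Solr document per column.
denorm_import = per_second(19_000_000, 24)
denorm_stream = per_second(1_000_000_000, 34)

# Speedups relative to the earlier measurements.
export_speedup = 87 / 10    # reading via /export vs. the first read
stream_speedup = 700 / 34   # streaming one billion values, 700 s -> 34 s

for name, value in [
    ("naive import (docs/s)", naive_import),
    ("SolrRDD read (docs/s)", solrrdd_read),
    ("/export read (docs/s)", export_read),
    ("denormalized import (values/s)", denorm_import),
    ("denormalized stream (values/s)", denorm_stream),
]:
    print(f"{name}: {value:,.0f}")
```

The exact values are slightly below the rounded slide figures for the import rates and slightly above for the streaming rate; the /export speedup works out to roughly 9x.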
Summary
■ The combination of Solr Cloud and Spark gives you the power to deal with Big Data workloads in real time
■ Denormalization can make your Solr application vastly faster
■ Make use of the /export handler when using the SolrRDD
■ Parallel post processing is mandatory for nontrivial applications
■ If you want to learn more: come to the Chronix talk on Friday
Learn More
■ https://github.com/lucidworks/spark-solr
■ https://github.com/jweigend/solr-spark
■ http://chronix.io
■ https://github.com/ChronixDB/chronix.spark/
■ http://qaware.blogspot.de
