SlideShare a Scribd company logo
O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
Time Series Processing with Solr and Spark
Josef Adersberger (@adersberger)
CTO, QAware
TIME SERIES 101
4
01
WE’RE SURROUNDED BY TIME SERIES
▸ Operational data: Monitoring data, performance
metrics, log events, …
▸ Data Warehouse: Dimension time
▸ Measured Me: Activity tracking, ECG, …
▸ Sensor telemetry: Sensor data, …
▸ Financial data: Stock charts, …
▸ Climate data: Temperature, …
▸ Web tracking: Clickstreams, …
▸ …
@adersberger
5
WE’RE SURROUNDED BY TIME SERIES (Pt. 2)
▸ Oktoberfest: Visitor and beer consumption trend
the singularity
6
01
TIME SERIES: BASIC TERMS
univariate time series multivariate time series multi-dimensional time
series (time series tensor)
time series setobservation
@adersberger
7
01
ILLUSTRATIVE OPERATIONS ON TIME SERIES
align
Time series => Time series
diff downsampling outlier
min/max avg/med slope std-dev
Time series => Scalar
@adersberger
OUR USE CASE
Monitoring Data Analysis 

of a business-critical,

worldwide distributed 

software system. Enable

root cause analysis and

anomaly detection.

> 1,000 nodes worldwide
> 10 processes per node
> 20 metrics per process

(OS, JVM, App-spec.)
Measured every second.
= about 6.3 trillions observations p.a.

Data retention: 5 yrs.
Time Series Processing with Solr and Spark
11
01
USE CASE: EXPLORING
Drill-down
host
process
measurements
counters (metrics)
Query time series metadata
Superimpose time series
@adersberger
12
01
USE CASE: STATISTICS
@adersberger
13
01
USE CASE: ANOMALY DETECTION
Featuring Twitter Anomaly Detection (https://github.com/twitter/AnomalyDetection

and Yahoo EGDAS https://github.com/yahoo/egads
@adersberger
14
01
USE CASE: SQL AND ZEPPELIN
@adersberger
CHRONIX SPARK
https://github.com/ChronixDB/chronix.spark
Time Series Processing with Solr and Spark
http://www.datasciencecentral.com
Time Series Processing with Solr and Spark
19
01
AVAILABLE TIME SERIES DATABASES
https://github.com/qaware/big-data-landscape
EASY-TO-USE BIG TIME
SERIES DATA STORAGE &
PROCESSING ON SPARK
21
01
THE CHRONIX STACK chronix.io
Big time series database
Scale-out
Storage-efficient
Interactive queries

No separate servers: Drop-in 

to existing Solr and Spark 

installations

Integrated into the relevant

open source ecosystem
@adersberger
Core
Chronix Storage
Chronix Server
Chronix Spark
ChronixFormat
GrafanaChronix Analytics
Collection
Analytics Frontends
Logstash fluentd collectd
Zeppelin
Prometheus Ingestion Bridge
KairosDB OpenTSDBInfluxDB Graphite
22
node
Distributed Data &

Data Retrieval
‣ Data sharding
‣ Fast index-based queries
‣ Efficient storage format
Distributed Processing
‣ Heavy lifting distributed
processing
‣ Efficient integration of Spark
and Solr
Result Processing
Post-processing on a
smaller set of time series
data flow
icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (takslist)
@adersberger
23
TIME SERIES MODEL
Set of univariate multi-dimensional numeric time series
▸ set … because it’s more flexible and better to parallelise if operations can input and
output multiple time series.
▸ univariate … because multivariate will introduce too much complexity (and we have our
set to bundle multiple time series).
▸ multi-dimensional … because the ability to slice & dice in the set of time series is very
convenient for a lot of use cases.
▸ numeric … because it’s the most common use case.
A single time series is identified by a combination of its non-temporal
dimensional values (e.g. unit “mem usage” + host “aws42” + process “tomcat”)
@adersberger
24
01
CHRONIX SPARK API: ENTRY POINTS
CHRONIX SPARK


ChronixRDD
ChronixSparkContext
‣ Represents a set of time series
‣ Distributed operations on sets of time series
‣ Creates ChronixRDDs
‣ Speaks with the Chronix Server (Solr)
@adersberger
25
01
CHRONIX SPARK API: DATA MODEL
MetricTimeSeries
MetricObservationDataFrame
+ toDataFrame()
@adersberger
Dataset<MetricTimeSeries>
Dataset<MetricObservation>
+ toDataset()
+ toObservationsDataset()
ChronixRDD
26
01
SPARK APIs FOR DATA PROCESSING
RDD DataFrame Dataset
typed yes no yes
optimized medium highly highly
mature yes yes medium
SQL no yes no
@adersberger
27
01
CHRONIX RDD
Statistical operations
the set characteristic: 

a JavaRDD of 

MetricTimeSeries
Filter the set (esp. by

dimensions)
@adersberger
28
01
METRICTIMESERIES DATA TYPE
access all timestamps
the multi-dimensionality:

get/set dimensions

(attributes)
access all observations as
stream
access all numeric values
@adersberger
Time Series Processing with Solr and Spark
30
01
//Create Chronix Spark context from a SparkContext / JavaSparkContext

ChronixSparkContext csc = new ChronixSparkContext(sc);



//Read data into ChronixRDD

SolrQuery query = new SolrQuery(

"metric:"java.lang:type=Memory/HeapMemoryUsage/used"");



ChronixRDD rdd = csc.query(query,

"localhost:9983", //ZooKeeper host

"chronix", //Solr collection for Chronix

new ChronixSolrCloudStorage());



//Calculate the overall min/max/mean of all time series in the RDD

double min = rdd.min();

double max = rdd.max();

double mean = rdd.mean();
DataFrame df = rdd.toDataFrame(sqlContext);

DataFrame res = df

.select("time", "value", "process", "metric")

.where("process='jenkins-jolokia'")

.orderBy("time");

res.show();
@adersberger
CHRONIX SPARK INTERNALS
32
Distributed Data &

Data Retrieval
‣ Data sharding (OK)
‣ Fast index-based queries (OK)
‣ Efficient storage format
@adersberger
33
01
CHRONIX FORMAT: CHUNKING TIME SERIES
TIME SERIES
‣ start: TimeStamp
‣ end: TimeStamp
‣ dimensions: Map<String, String>
‣ observations: byte[]
TIME SERIES
‣ start: TimeStamp
‣ end: TimeStamp
‣ dimensions: Map<String, String>
‣ observations: byte[]
Logical
TIME SERIES
‣ start: TimeStamp
‣ end: TimeStamp
‣ dimensions: Map<String, String>
‣ observations: byte[]
Physical
Chunking:
1 logical time series = 

n physical time series (chunks)
1 chunk = fixed amount of
observations
1 chunk = 1 Solr document
@adersberger
34
01
CHRONIX FORMAT: ENCODING OF OBSERVATIONS
Binary encoding of all timestamp/value pairs (observations) with ProtoBuf incl.
binary compression.
Delta encoding leading to more effective binary compression
… of time stamps (DCC, Date-Delta-Compaction)













… of values: diff
chunck
• timespan
• nbr. of observations
periodic distributed time stamps (pts): timespan / nbr. of observations
real time stamps (rts) if |pts(x) - rts(x)| < threshold : rts(x) = pts(x)
value_to_store = pts(x) - rts(x)
value_to_store = value(x) - value(x-1)
@adersberger
35
01
CHRONIX FORMAT: TUNING CHUNK SIZE AND CODEC
GZIP +
128
kBytes
Florian Lautenschlager, Michael Philippsen, Andreas Kumlehn, Josef Adersberger

Chronix: Efficient Storage and Query of Operational Time Series
International Conference on Software Maintenance and Evolution 2016 (submitted)
@adersberger
storage 

demand access

time
36
01
CHRONIX FORMAT: STORAGE EFFICIENCY BENCHMARK
@adersberger
37
01
CHRONIX FORMAT: PERFORMANCE BENCHMARK
unit: secondsnbr of queries query
@adersberger
38
Distributed Processing
‣ Heavy lifting distributed
processing
‣ Efficient integration of Spark

and Solr
@adersberger
39
01
SPARK AND SOLR BEST PRACTICES: ALIGN PARALLELISM
SolrDocument

(Chunk)
Solr Shard Solr Shard
TimeSeries TimeSeries TimeSeries TimeSeries TimeSeries
Partition Partition
ChronixRDD
• Unit of parallelism in Spark: Partition
• Unit of parallelism in Solr: Shard
• 1 Spark Partition = 1 Solr Shard
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
SolrDocument

(Chunk)
@adersberger
40
01
ALIGN THE PARALLELISM WITHIN CHRONIXRDD
public ChronixRDD queryChronixChunks(

final SolrQuery query,

final String zkHost,

final String collection,

final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage)
throws SolrServerException, IOException {



// first get a list of replicas to query for this collection

List<String> shards = chronixStorage.getShardList(zkHost, collection);



// parallelize the requests to the shards

JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(

(FlatMapFunction<String, MetricTimeSeries>) shardUrl -> chronixStorage.streamFromSingleNode(

new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);

return new ChronixRDD(docs);

}
Figure out all Solr
shards (using
CloudSolrClient in
the background)
Query each shard in parallel and convert
SolrDocuments to MetricTimeSeries
@adersberger
41
01
SPARK AND SOLR BEST PRACTICES: PUSHDOWN
SolrQuery query = new SolrQuery(

“<Solr query containing filters and aggregations>");



ChronixRDD rdd = csc.query(query, …
@adersberger
Predicate pushdown
• Pre-filter time series based on their 

metadata (dimensions, start, end)

with Solr.

Aggregation pushdown
• Perform pre-aggregations (min/max/avg/…) at
ingestion time and store it as metadata.
• (to come) Perform aggregations on Solr-level at
query time by enabling Solr to decode
observations
42
01
SPARK AND SOLR BEST PRACTICES: EFFICIENT DATA TRANSFER
Reduce volume: Pushdown & compression

Use efficient protocols: 

Low-overhead, bulk, stream

Avoid remote transfer: Place Spark

tasks (processes 1 partition) on the 

Solr node with the appropriate shard.

(to come by using SolrRDD)
@adersberger
Export 

Handler
Chronix

RDD
CloudSolr

Stream
Format
Decoder
bulk of 

JSON tuples
Chronix Spark
Solr / SolrJ
43
private Stream<MetricTimeSeries> 

streamWithCloudSolrStream(String zkHost, String collection, String shardUrl, SolrQuery query,

TimeSeriesConverter<MetricTimeSeries> converter)
throws IOException {

Map params = new HashMap();

params.put("q", query.getQuery());

params.put("sort", "id asc");

params.put("shards", extractShardIdFromShardUrl(shardUrl));

params.put("fl",

Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +

", metric, host, measurement, process, ag, group");

params.put("qt", "/export");

params.put("distrib", false);



CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);

solrStream.open();

SolrTupleStreamingService tupStream = new SolrTupleStreamingService(solrStream, converter);

return StreamSupport.stream(
Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);

}
Pin query to one shard
Use export request handler
Boilerplate code to stream response
@adersberger
Time Series Databases should be first-class citizens.
Chronix leverages Solr and Spark to 

be storage efficient and to allow interactive 

queries for big time series data.
THANK YOU! QUESTIONS?
Mail: josef.adersberger@qaware.de
Twitter: @adersberger
TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
BONUS SLIDES
PERFORMANCE
codingvoding.tumblr.com
PREMATURE OPTIMIZATION IS
NOT EVIL IF YOU HANDLE BIG
DATA
Josef Adersberger
PERFORMANCE
USING A JAVA PROFILER WITH A LOCAL CLUSTER
PERFORMANCE
HIGH-PERFORMANCE, LOW-OVERHEAD COLLECTIONS
PERFORMANCE
830 MB -> 360 MB

(- 57%)
unveiled wrong Jackson 

handling inside of SolrClient
53
01
THE SECRETS OF DISTRIBUTED PROCESSING PERFORMANCE


Rule 1: Be as close to the data as possible!

(CPU cache > memory > local disk > network)

Rule 2: Reduce data volume as early as possible! 

(as long as you don’t sacrifice parallelization)

Rule 3: Parallelize as much as possible! 

(max = #cores * x)
PERFORMANCE
THE RULES APPLIED
‣ Rule 1: Be as close to the data as possible!
1. Solr caching

2. Spark in-memory processing with activated RDD compression

3. Binary protocol between Solr and Spark

‣ Rule 2: Reduce data volume as early as possible!
‣ Efficient storage format (Chronix Format)

‣ Predicate pushdown to Solr (query)

‣ Group-by & aggregation pushdown to Solr (faceting within a query)

‣ Rule 3: Parallelize as much as possible!
‣ Scale-out on data-level with SolrCloud

‣ Scale-out on processing-level with Spark
APACHE SPARK 101
CHRONIX SPARK WONDERLAND
ARCHITECTURE
APACHE SPARK
SPARK TERMINOLOGY (1/2)
▸ RDD: Has transformations and actions. Hides data partitioning &
distributed computation. References a set of partitions (“output
partitions”) - materialized or not - and has dependencies to
another RDD (“input partitions”). RDD operations are evaluated as
late as possible (when an action is called). As long as not being the
root RDD the partitions of an RDD are in memory but they can be
persisted by request.
▸ Partitions: (Logical) chunks of data. Default unit and level of
parallelism - inside of a partition everything is a sequential
operation on records. Has to fit into memory. Can have different
representations (in-memory, on disk, off heap, …)
APACHE SPARK
SPARK TERMINOLOGY (2/2)
▸ Job: A computation job which is launched when an action is called on a
RDD.
▸ Task: The atomic unit of work (function). Bound to exactly one partition.
▸ Stage: Set of Task pipelines which can be executed in parallel on one
executor.
▸ Shuffling: If partitions need to be transferred between executors. Shuffle
write = outbound partition transfer. Shuffle read = inbound partition
transfer.
▸ DAG Scheduler: Computes DAG of stages from RDD DAG. Determines
the preferred location for each task.
THE COMPETITORS / ALTERNATIVES
CHRONIX RDD VS. SPARK-TS
▸ Spark-TS provides no specific time series storage it uses the Spark persistence
mechanisms instead. This leads to a less efficient storage usage and less possibilities
to perform performance optimizations via predicate pushdown.
▸ In contrast to Spark-TS Chronix does not align all time series values on one vector of
timestamps. This leads to greater flexibility in time series aggregation
▸ Chronix provides multi-dimensional time series as this is very useful for data
warehousing and APM.
▸ Chronix has support for Datasets as this will be an important Spark API in the near
future. But Chronix currently doesn’t support an IndexedRowMatrix for SparkML.
▸ Chronix is purely written in Java. There is no explicit support for Python and Scala yet.
▸ Chronix doesn not support a ZonedTime as this makes it way more complicated.
CHRONIX SPARK INTERNALS
61
01
CHRONIXRDD: GET THE CHUNKS FROM SOLR
public ChronixRDD queryChronixChunks(

final SolrQuery query,

final String zkHost,

final String collection,

final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage)
throws SolrServerException, IOException {



// first get a list of replicas to query for this collection

List<String> shards = chronixStorage.getShardList(zkHost, collection);



// parallelize the requests to the shards

JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(

(FlatMapFunction<String, MetricTimeSeries>) shardUrl -> chronixStorage.streamFromSingleNode(

new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);

return new ChronixRDD(docs);

}
Figure out all Solr
shards (using
CloudSolrClient in
the background)
Query each shard in parallel and convert
SolrDocuments to MetricTimeSeries
62
01
BINARY PROTOCOL WITH STANDARD SOLR CLIENT
private Stream<MetricTimeSeries> streamWithHttpSolrClient(String shardUrl,

SolrQuery query,

TimeSeriesConverter<MetricTimeSeries>
converter) {

HttpSolrClient solrClient = getSingleNodeSolrClient(shardUrl);

solrClient.setRequestWriter(new BinaryRequestWriter());

query.set("distrib", false);

SolrStreamingService<MetricTimeSeries> solrStreamingService = 

new SolrStreamingService<>(converter, query, solrClient, nrOfDocumentPerBatch);

return StreamSupport.stream(

Spliterators.spliteratorUnknownSize(solrStreamingService, Spliterator.SIZED), false);

}
Use HttpSolrClient pinned to one shard
Use binary (request)

protocol
Boilerplate code to stream response
63
private Stream<MetricTimeSeries> 

streamWithCloudSolrStream(String zkHost, String collection, String shardUrl, SolrQuery query,

TimeSeriesConverter<MetricTimeSeries> converter)
throws IOException {

Map params = new HashMap();

params.put("q", query.getQuery());

params.put("sort", "id asc");

params.put("shards", extractShardIdFromShardUrl(shardUrl));

params.put("fl",

Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +

", metric, host, measurement, process, ag, group");

params.put("qt", "/export");

params.put("distrib", false);



CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);

solrStream.open();

SolrTupleStreamingService tupStream = new SolrTupleStreamingService(solrStream, converter);

return StreamSupport.stream(
Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);

}
EXPORT HANDLER PROTOCOL
Pin query to one shard
Use export request handler
Boilerplate code to stream response
64
01
CHRONIXRDD: FROM CHUNKS TO TIME SERIES
public ChronixRDD joinChunks() {

JavaPairRDD<MetricTimeSeriesKey, Iterable<MetricTimeSeries>> groupRdd

= this.groupBy(MetricTimeSeriesKey::new);



JavaPairRDD<MetricTimeSeriesKey, MetricTimeSeries> joinedRdd

= groupRdd.mapValues((Function<Iterable<MetricTimeSeries>, MetricTimeSeries>) mtsIt -> {

MetricTimeSeriesOrdering ordering = new MetricTimeSeriesOrdering();

List<MetricTimeSeries> orderedChunks = ordering.immutableSortedCopy(mtsIt);

MetricTimeSeries result = null;

for (MetricTimeSeries mts : orderedChunks) {

if (result == null) {

result = new MetricTimeSeries

.Builder(mts.getMetric())

.attributes(mts.attributes()).build();

}

result.addAll(mts.getTimestampsAsArray(), mts.getValuesAsArray());

}

return result;

});



JavaRDD<MetricTimeSeries> resultJavaRdd =

joinedRdd.map((Tuple2<MetricTimeSeriesKey, MetricTimeSeries> mtTuple) -> mtTuple._2);



return new ChronixRDD(resultJavaRdd); }
group chunks
according
identity
join chunks to

logical time 

series

More Related Content

What's hot (18)

PDF
Streams processing with Storm
Mariusz Gil
 
PPTX
Sharded cluster tutorial
Antonios Giannopoulos
 
PPTX
Michael Häusler – Everyday flink
Flink Forward
 
PDF
Triggers in MongoDB
Antonios Giannopoulos
 
PPTX
MongoDB - Sharded Cluster Tutorial
Jason Terpko
 
PPTX
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
Antonios Giannopoulos
 
PPTX
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
Redis Labs
 
PPTX
Triggers In MongoDB
Jason Terpko
 
PDF
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
DataStax
 
PDF
Apache Solr as a compressed, scalable, and high performance time series database
Florian Lautenschlager
 
PDF
Real time and reliable processing with Apache Storm
Andrea Iacono
 
PPTX
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Cody Ray
 
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
PDF
Using MongoDB with Kafka - Use Cases and Best Practices
Antonios Giannopoulos
 
PPTX
RedisConf17- durable_rules
Redis Labs
 
PPTX
Stress test data pipeline
Marina Grechuhin
 
PDF
Stratio platform overview v4.1
Stratio
 
PDF
R and cpp
Romain Francois
 
Streams processing with Storm
Mariusz Gil
 
Sharded cluster tutorial
Antonios Giannopoulos
 
Michael Häusler – Everyday flink
Flink Forward
 
Triggers in MongoDB
Antonios Giannopoulos
 
MongoDB - Sharded Cluster Tutorial
Jason Terpko
 
New Indexing and Aggregation Pipeline Capabilities in MongoDB 4.2
Antonios Giannopoulos
 
RedisConf17 - Internet Archive - Preventing Cache Stampede with Redis and XFetch
Redis Labs
 
Triggers In MongoDB
Jason Terpko
 
Building a Fast, Resilient Time Series Store with Cassandra (Alex Petrov, Dat...
DataStax
 
Apache Solr as a compressed, scalable, and high performance time series database
Florian Lautenschlager
 
Real time and reliable processing with Apache Storm
Andrea Iacono
 
Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB
Cody Ray
 
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Paul Brebner
 
Using MongoDB with Kafka - Use Cases and Best Practices
Antonios Giannopoulos
 
RedisConf17- durable_rules
Redis Labs
 
Stress test data pipeline
Marina Grechuhin
 
Stratio platform overview v4.1
Stratio
 
R and cpp
Romain Francois
 

Viewers also liked (20)

PDF
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Lucidworks
 
PDF
BIG DATA サービス と ツール
Ngoc Dao
 
PDF
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Romeo Kienzler
 
PDF
Using docker for data science - part 2
Calvin Giles
 
PDF
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Roberto Hashioka
 
PDF
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
PDF
Growing the Mesos Ecosystem
Mesosphere Inc.
 
PDF
Overview of DataStax OpsCenter
DataStax
 
PPTX
High Performance Processing of Streaming Data
Geoffrey Fox
 
PPTX
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Data Con LA
 
PDF
Data analysis with Pandas and Spark
Felix Crisan
 
PDF
The basics of fluentd
Treasure Data, Inc.
 
PDF
Fluentd and Kafka
N Masahiro
 
PPTX
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
PPTX
Hadoop on Docker
Rakesh Saha
 
PDF
Clickstream Analysis with Apache Spark
QAware GmbH
 
PPTX
How we solved Real-time User Segmentation using HBase
DataWorks Summit
 
PDF
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Real-time Aggregations, Ap...
Data Con LA
 
PPTX
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
 
PDF
Data Engineering with Solr and Spark
Lucidworks
 
Time Series Processing with Solr and Spark: Presented by Josef Adersberger, Q...
Lucidworks
 
BIG DATA サービス と ツール
Ngoc Dao
 
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Romeo Kienzler
 
Using docker for data science - part 2
Calvin Giles
 
Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka ...
Roberto Hashioka
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
DataStax Academy
 
Growing the Mesos Ecosystem
Mesosphere Inc.
 
Overview of DataStax OpsCenter
DataStax
 
High Performance Processing of Streaming Data
Geoffrey Fox
 
Big Data Day LA 2016/ Big Data Track - Fluentd and Embulk: Collect More Data,...
Data Con LA
 
Data analysis with Pandas and Spark
Felix Crisan
 
The basics of fluentd
Treasure Data, Inc.
 
Fluentd and Kafka
N Masahiro
 
Being Ready for Apache Kafka - Apache: Big Data Europe 2015
Michael Noll
 
Hadoop on Docker
Rakesh Saha
 
Clickstream Analysis with Apache Spark
QAware GmbH
 
How we solved Real-time User Segmentation using HBase
DataWorks Summit
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Real-time Aggregations, Ap...
Data Con LA
 
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
 
Data Engineering with Solr and Spark
Lucidworks
 
Ad

Similar to Time Series Processing with Solr and Spark (20)

PDF
Time Series Processing with Apache Spark
QAware GmbH
 
PDF
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
NETWAYS
 
PDF
Chronix: A fast and efficient time series storage based on Apache Solr
Florian Lautenschlager
 
PDF
Chronix Time Series Database - The New Time Series Kid on the Block
QAware GmbH
 
PDF
Chronix as Long-Term Storage for Prometheus
QAware GmbH
 
PPTX
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
PDF
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
PDF
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
Codemotion Tel Aviv
 
PDF
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
PDF
Time Series With OrientDB - Fosdem 2015
wolf4ood
 
PDF
Webinar: Solr & Spark for Real Time Big Data Analytics
Lucidworks
 
PPTX
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
Luigi Dell'Aquila
 
PDF
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
PPTX
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
PDF
Elasticsearch as a time series database
felixbarny
 
PPTX
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Codemotion
 
PDF
Spark meets Telemetry
Roberto Agostino Vitillo
 
PDF
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
PDF
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
PPTX
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Time Series Processing with Apache Spark
QAware GmbH
 
OSDC 2016 - Chronix - A fast and efficient time series storage based on Apach...
NETWAYS
 
Chronix: A fast and efficient time series storage based on Apache Solr
Florian Lautenschlager
 
Chronix Time Series Database - The New Time Series Kid on the Block
QAware GmbH
 
Chronix as Long-Term Storage for Prometheus
QAware GmbH
 
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016
StampedeCon
 
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
Codemotion Tel Aviv
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
Time Series With OrientDB - Fosdem 2015
wolf4ood
 
Webinar: Solr & Spark for Real Time Big Data Analytics
Lucidworks
 
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
Luigi Dell'Aquila
 
Owning time series with team apache Strata San Jose 2015
Patrick McFadin
 
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library
Ilya Ganelin
 
Elasticsearch as a time series database
felixbarny
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Codemotion
 
Spark meets Telemetry
Roberto Agostino Vitillo
 
Solr and Spark for Real-Time Big Data Analytics: Presented by Tim Potter, Luc...
Lucidworks
 
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale
Evan Chan
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Ad

More from Josef Adersberger (13)

PDF
Into the cloud, you better fly by sight
Josef Adersberger
 
PDF
Serverless containers … with source-to-image
Josef Adersberger
 
PDF
The need for speed – transforming insurance into a cloud-native industry
Josef Adersberger
 
PDF
The good, the bad, and the ugly of migrating hundreds of legacy applications ...
Josef Adersberger
 
PDF
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Josef Adersberger
 
PDF
Istio By Example (extended version)
Josef Adersberger
 
PDF
Docker und Kubernetes Patterns & Anti-Patterns
Josef Adersberger
 
PDF
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
Josef Adersberger
 
PDF
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
PDF
Cloud Native und Java EE: Freund oder Feind?
Josef Adersberger
 
PDF
Big Data Landscape 2016
Josef Adersberger
 
PDF
Clickstream Analysis with Spark
Josef Adersberger
 
PDF
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Josef Adersberger
 
Into the cloud, you better fly by sight
Josef Adersberger
 
Serverless containers … with source-to-image
Josef Adersberger
 
The need for speed – transforming insurance into a cloud-native industry
Josef Adersberger
 
The good, the bad, and the ugly of migrating hundreds of legacy applications ...
Josef Adersberger
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Josef Adersberger
 
Istio By Example (extended version)
Josef Adersberger
 
Docker und Kubernetes Patterns & Anti-Patterns
Josef Adersberger
 
The Good, the Bad and the Ugly of Migrating Hundreds of Legacy Applications ...
Josef Adersberger
 
Dataservices - Processing Big Data The Microservice Way
Josef Adersberger
 
Cloud Native und Java EE: Freund oder Feind?
Josef Adersberger
 
Big Data Landscape 2016
Josef Adersberger
 
Clickstream Analysis with Spark
Josef Adersberger
 
Software-Sanierung: Wie man kranke Systeme wieder gesund macht.
Josef Adersberger
 

Recently uploaded (20)

PDF
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
PDF
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
PDF
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
PDF
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
PDF
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
DOCX
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
PPTX
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
PDF
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
PPTX
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
DOCX
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PDF
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
PDF
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
PDF
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
Automating Feature Enrichment and Station Creation in Natural Gas Utility Net...
Safe Software
 
UPDF - AI PDF Editor & Converter Key Features
DealFuel
 
The 2025 InfraRed Report - Redpoint Ventures
Razin Mustafiz
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
Kit-Works Team Study_20250627_한달만에만든사내서비스키링(양다윗).pdf
Wonjun Hwang
 
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
Python coding for beginners !! Start now!#
Rajni Bhardwaj Grover
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
Seamless Tech Experiences Showcasing Cross-Platform App Design.pptx
presentifyai
 
Book industry state of the nation 2025 - Tech Forum 2025
BookNet Canada
 
The Project Compass - GDG on Campus MSIT
dscmsitkol
 
Cryptography Quiz: test your knowledge of this important security concept.
Rajni Bhardwaj Grover
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
Digital Circuits, important subject in CS
contactparinay1
 
Mastering Financial Management in Direct Selling
Epixel MLM Software
 
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
CIFDAQ Market Wrap for the week of 4th July 2025
CIFDAQ
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 

Time Series Processing with Solr and Spark

  • 1. O C T O B E R 1 1 - 1 4 , 2 0 1 6 • B O S T O N , M A
  • 2. Time Series Processing with Solr and Spark Josef Adersberger (@adersberger) CTO, QAware
  • 4. 4 01 WE’RE SURROUNDED BY TIME SERIES ▸ Operational data: Monitoring data, performance metrics, log events, … ▸ Data Warehouse: Dimension time ▸ Measured Me: Activity tracking, ECG, … ▸ Sensor telemetry: Sensor data, … ▸ Financial data: Stock charts, … ▸ Climate data: Temperature, … ▸ Web tracking: Clickstreams, … ▸ … @adersberger
  • 5. 5 WE’RE SURROUNDED BY TIME SERIES (Pt. 2) ▸ Oktoberfest: Visitor and beer consumption trend the singularity
  • 6. 6 01 TIME SERIES: BASIC TERMS univariate time series multivariate time series multi-dimensional time series (time series tensor) time series setobservation @adersberger
  • 7. 7 01 ILLUSTRATIVE OPERATIONS ON TIME SERIES align Time series => Time series diff downsampling outlier min/max avg/med slope std-dev Time series => Scalar @adersberger
  • 9. Monitoring Data Analysis 
 of a business-critical,
 worldwide distributed 
 software system. Enable
 root cause analysis and
 anomaly detection.
 > 1,000 nodes worldwide > 10 processes per node > 20 metrics per process
 (OS, JVM, App-spec.) Measured every second. = about 6.3 trillions observations p.a.
 Data retention: 5 yrs.
  • 11. 11 01 USE CASE: EXPLORING Drill-down host process measurements counters (metrics) Query time series metadata Superimpose time series @adersberger
  • 13. 13 01 USE CASE: ANOMALY DETECTION Featuring Twitter Anomaly Detection (https://github.com/twitter/AnomalyDetection
 and Yahoo EGDAS https://github.com/yahoo/egads @adersberger
  • 14. 14 01 USE CASE: SQL AND ZEPPELIN @adersberger
  • 19. 19 01 AVAILABLE TIME SERIES DATABASES https://github.com/qaware/big-data-landscape
  • 20. EASY-TO-USE BIG TIME SERIES DATA STORAGE & PROCESSING ON SPARK
  • 21. 21 01 THE CHRONIX STACK chronix.io Big time series database Scale-out Storage-efficient Interactive queries
 No separate servers: Drop-in 
 to existing Solr and Spark 
 installations
 Integrated into the relevant
 open source ecosystem @adersberger Core Chronix Storage Chronix Server Chronix Spark ChronixFormat GrafanaChronix Analytics Collection Analytics Frontends Logstash fluentd collectd Zeppelin Prometheus Ingestion Bridge KairosDB OpenTSDBInfluxDB Graphite
  • 22. 22 node Distributed Data &
 Data Retrieval ‣ Data sharding ‣ Fast index-based queries ‣ Efficient storage format Distributed Processing ‣ Heavy lifting distributed processing ‣ Efficient integration of Spark and Solr Result Processing Post-processing on a smaller set of time series data flow icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (takslist) @adersberger
  • 23. 23 TIME SERIES MODEL Set of univariate multi-dimensional numeric time series ▸ set … because it’s more flexible and better to parallelise if operations can input and output multiple time series. ▸ univariate … because multivariate will introduce too much complexity (and we have our set to bundle multiple time series). ▸ multi-dimensional … because the ability to slice & dice in the set of time series is very convenient for a lot of use cases. ▸ numeric … because it’s the most common use case. A single time series is identified by a combination of its non-temporal dimensional values (e.g. unit “mem usage” + host “aws42” + process “tomcat”) @adersberger
  • 24. 24 01 CHRONIX SPARK API: ENTRY POINTS CHRONIX SPARK 
 ChronixRDD ChronixSparkContext ‣ Represents a set of time series ‣ Distributed operations on sets of time series ‣ Creates ChronixRDDs ‣ Speaks with the Chronix Server (Solr) @adersberger
  • 25. 25 01 CHRONIX SPARK API: DATA MODEL MetricTimeSeries MetricObservationDataFrame + toDataFrame() @adersberger Dataset<MetricTimeSeries> Dataset<MetricObservation> + toDataset() + toObservationsDataset() ChronixRDD
  • 26. 26 01 SPARK APIs FOR DATA PROCESSING RDD DataFrame Dataset typed yes no yes optimized medium highly highly mature yes yes medium SQL no yes no @adersberger
  • 27. 27 01 CHRONIX RDD Statistical operations the set characteristic: 
 a JavaRDD of 
 MetricTimeSeries Filter the set (esp. by
 dimensions) @adersberger
  • 28. 28 01 METRICTIMESERIES DATA TYPE access all timestamps the multi-dimensionality:
 get/set dimensions
 (attributes) access all observations as stream access all numeric values @adersberger
  • 30. 30 01 //Create Chronix Spark context from a SparkContext / JavaSparkContext
 ChronixSparkContext csc = new ChronixSparkContext(sc);
 
 //Read data into ChronixRDD
 SolrQuery query = new SolrQuery(
 "metric:"java.lang:type=Memory/HeapMemoryUsage/used"");
 
 ChronixRDD rdd = csc.query(query,
 "localhost:9983", //ZooKeeper host
 "chronix", //Solr collection for Chronix
 new ChronixSolrCloudStorage());
 
 //Calculate the overall min/max/mean of all time series in the RDD
 double min = rdd.min();
 double max = rdd.max();
 double mean = rdd.mean(); DataFrame df = rdd.toDataFrame(sqlContext);
 DataFrame res = df
 .select("time", "value", "process", "metric")
 .where("process='jenkins-jolokia'")
 .orderBy("time");
 res.show(); @adersberger
  • 32. 32 Distributed Data &
 Data Retrieval ‣ Data sharding (OK) ‣ Fast index-based queries (OK) ‣ Efficient storage format @adersberger
  • 33. 33 01 CHRONIX FORMAT: CHUNKING TIME SERIES TIME SERIES ‣ start: TimeStamp ‣ end: TimeStamp ‣ dimensions: Map<String, String> ‣ observations: byte[] TIME SERIES ‣ start: TimeStamp ‣ end: TimeStamp ‣ dimensions: Map<String, String> ‣ observations: byte[] Logical TIME SERIES ‣ start: TimeStamp ‣ end: TimeStamp ‣ dimensions: Map<String, String> ‣ observations: byte[] Physical Chunking: 1 logical time series = 
 n physical time series (chunks) 1 chunk = fixed amount of observations 1 chunk = 1 Solr document @adersberger
  • 34. 34 01 CHRONIX FORMAT: ENCODING OF OBSERVATIONS Binary encoding of all timestamp/value pairs (observations) with ProtoBuf incl. binary compression. Delta encoding leading to more effective binary compression … of time stamps (DCC, Date-Delta-Compaction)
 
 
 
 
 
 
 … of values: diff chunck • timespan • nbr. of observations periodic distributed time stamps (pts): timespan / nbr. of observations real time stamps (rts) if |pts(x) - rts(x)| < threshold : rts(x) = pts(x) value_to_store = pts(x) - rts(x) value_to_store = value(x) - value(x-1) @adersberger
  • 35. 35 01 CHRONIX FORMAT: TUNING CHUNK SIZE AND CODEC GZIP + 128 kBytes Florian Lautenschlager, Michael Philippsen, Andreas Kumlehn, Josef Adersberger
 Chronix: Efficient Storage and Query of Operational Time Series International Conference on Software Maintenance and Evolution 2016 (submitted) @adersberger storage 
 demand access
 time
  • 36. 36 01 CHRONIX FORMAT: STORAGE EFFICIENCY BENCHMARK @adersberger
  • 37. 37 01 CHRONIX FORMAT: PERFORMANCE BENCHMARK unit: secondsnbr of queries query @adersberger
  • 38. 38 Distributed Processing ‣ Heavy lifting distributed processing ‣ Efficient integration of Spark
 and Solr @adersberger
  • 39. 39 01 SPARK AND SOLR BEST PRACTICES: ALIGN PARALLELISM SolrDocument
 (Chunk) Solr Shard Solr Shard TimeSeries TimeSeries TimeSeries TimeSeries TimeSeries Partition Partition ChronixRDD • Unit of parallelism in Spark: Partition • Unit of parallelism in Solr: Shard • 1 Spark Partition = 1 Solr Shard SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) SolrDocument
 (Chunk) @adersberger
  • 40. 40 01 ALIGN THE PARALLELISM WITHIN CHRONIXRDD public ChronixRDD queryChronixChunks(
 final SolrQuery query,
 final String zkHost,
 final String collection,
 final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage) throws SolrServerException, IOException {
 
 // first get a list of replicas to query for this collection
 List<String> shards = chronixStorage.getShardList(zkHost, collection);
 
 // parallelize the requests to the shards
 JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(
 (FlatMapFunction<String, MetricTimeSeries>) shardUrl -> chronixStorage.streamFromSingleNode(
 new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);
 return new ChronixRDD(docs);
 } Figure out all Solr shards (using CloudSolrClient in the background) Query each shard in parallel and convert SolrDocuments to MetricTimeSeries @adersberger
  • 41. 41 01 SPARK AND SOLR BEST PRACTICES: PUSHDOWN SolrQuery query = new SolrQuery(
 “<Solr query containing filters and aggregations>");
 
 ChronixRDD rdd = csc.query(query, … @adersberger Predicate pushdown • Pre-filter time series based on their 
 metadata (dimensions, start, end)
 with Solr.
 Aggregation pushdown • Perform pre-aggregations (min/max/avg/…) at ingestion time and store it as metadata. • (to come) Perform aggregations on Solr-level at query time by enabling Solr to decode observations
  • 42. 42 01 SPARK AND SOLR BEST PRACTICES: EFFICIENT DATA TRANSFER Reduce volume: Pushdown & compression
 Use efficient protocols: 
 Low-overhead, bulk, stream
 Avoid remote transfer: Place Spark
 tasks (processes 1 partition) on the 
 Solr node with the appropriate shard.
 (to come by using SolrRDD) @adersberger Export 
 Handler Chronix
 RDD CloudSolr
 Stream Format Decoder bulk of 
 JSON tuples Chronix Spark Solr / SolrJ
  • 43. 43 private Stream<MetricTimeSeries> 
 streamWithCloudSolrStream(String zkHost, String collection, String shardUrl, SolrQuery query,
 TimeSeriesConverter<MetricTimeSeries> converter) throws IOException {
 Map params = new HashMap();
 params.put("q", query.getQuery());
 params.put("sort", "id asc");
 params.put("shards", extractShardIdFromShardUrl(shardUrl));
 params.put("fl",
 Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +
 ", metric, host, measurement, process, ag, group");
 params.put("qt", "/export");
 params.put("distrib", false);
 
 CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);
 solrStream.open();
 SolrTupleStreamingService tupStream = new SolrTupleStreamingService(solrStream, converter);
 return StreamSupport.stream( Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);
 } Pin query to one shard Use export request handler Boilerplate code to stream response @adersberger
  • 44. Time Series Databases should be first-class citizens. Chronix leverages Solr and Spark to 
 be storage efficient and to allow interactive 
 queries for big time series data.
  • 45. THANK YOU! QUESTIONS? Mail: [email protected] Twitter: @adersberger TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
  • 49. PREMATURE OPTIMIZATION IS NOT EVIL IF YOU HANDLE BIG DATA Josef Adersberger
  • 50. PERFORMANCE USING A JAVA PROFILER WITH A LOCAL CLUSTER
  • 52. PERFORMANCE 830 MB -> 360 MB
 (- 57%) unveiled wrong Jackson 
 handling inside of SolrClient
  • 53. 53 01 THE SECRETS OF DISTRIBUTED PROCESSING PERFORMANCE 
 Rule 1: Be as close to the data as possible!
 (CPU cache > memory > local disk > network)
 Rule 2: Reduce data volume as early as possible! 
 (as long as you don’t sacrifice parallelization)
 Rule 3: Parallelize as much as possible! 
 (max = #cores * x)
  • 54. PERFORMANCE THE RULES APPLIED ‣ Rule 1: Be as close to the data as possible! 1. Solr caching 2. Spark in-memory processing with activated RDD compression 3. Binary protocol between Solr and Spark
 ‣ Rule 2: Reduce data volume as early as possible! ‣ Efficient storage format (Chronix Format) ‣ Predicate pushdown to Solr (query) ‣ Group-by & aggregation pushdown to Solr (faceting within a query)
 ‣ Rule 3: Parallelize as much as possible! ‣ Scale-out on data-level with SolrCloud ‣ Scale-out on processing-level with Spark
  • 57. APACHE SPARK SPARK TERMINOLOGY (1/2) ▸ RDD: Has transformations and actions. Hides data partitioning & distributed computation. References a set of partitions (“output partitions”) - materialized or not - and has dependencies to another RDD (“input partitions”). RDD operations are evaluated as late as possible (when an action is called). As long as not being the root RDD the partitions of an RDD are in memory but they can be persisted by request. ▸ Partitions: (Logical) chunks of data. Default unit and level of parallelism - inside of a partition everything is a sequential operation on records. Has to fit into memory. Can have different representations (in-memory, on disk, off heap, …)
  • 58. APACHE SPARK SPARK TERMINOLOGY (2/2) ▸ Job: A computation job which is launched when an action is called on a RDD. ▸ Task: The atomic unit of work (function). Bound to exactly one partition. ▸ Stage: Set of Task pipelines which can be executed in parallel on one executor. ▸ Shuffling: If partitions need to be transferred between executors. Shuffle write = outbound partition transfer. Shuffle read = inbound partition transfer. ▸ DAG Scheduler: Computes DAG of stages from RDD DAG. Determines the preferred location for each task.
  • 59. THE COMPETITORS / ALTERNATIVES CHRONIX RDD VS. SPARK-TS ▸ Spark-TS provides no specific time series storage it uses the Spark persistence mechanisms instead. This leads to a less efficient storage usage and less possibilities to perform performance optimizations via predicate pushdown. ▸ In contrast to Spark-TS Chronix does not align all time series values on one vector of timestamps. This leads to greater flexibility in time series aggregation ▸ Chronix provides multi-dimensional time series as this is very useful for data warehousing and APM. ▸ Chronix has support for Datasets as this will be an important Spark API in the near future. But Chronix currently doesn’t support an IndexedRowMatrix for SparkML. ▸ Chronix is purely written in Java. There is no explicit support for Python and Scala yet. ▸ Chronix doesn not support a ZonedTime as this makes it way more complicated.
  • 61. 61 01 CHRONIXRDD: GET THE CHUNKS FROM SOLR public ChronixRDD queryChronixChunks(
 final SolrQuery query,
 final String zkHost,
 final String collection,
 final ChronixSolrCloudStorage<MetricTimeSeries> chronixStorage) throws SolrServerException, IOException {
 
 // first get a list of replicas to query for this collection
 List<String> shards = chronixStorage.getShardList(zkHost, collection);
 
 // parallelize the requests to the shards
 JavaRDD<MetricTimeSeries> docs = jsc.parallelize(shards, shards.size()).flatMap(
 (FlatMapFunction<String, MetricTimeSeries>) shardUrl -> chronixStorage.streamFromSingleNode(
 new KassiopeiaSimpleConverter(), shardUrl, query)::iterator);
 return new ChronixRDD(docs);
 } Figure out all Solr shards (using CloudSolrClient in the background) Query each shard in parallel and convert SolrDocuments to MetricTimeSeries
  • 62. 62 01 BINARY PROTOCOL WITH STANDARD SOLR CLIENT private Stream<MetricTimeSeries> streamWithHttpSolrClient(String shardUrl,
 SolrQuery query,
 TimeSeriesConverter<MetricTimeSeries> converter) {
 HttpSolrClient solrClient = getSingleNodeSolrClient(shardUrl);
 solrClient.setRequestWriter(new BinaryRequestWriter());
 query.set("distrib", false);
 SolrStreamingService<MetricTimeSeries> solrStreamingService = 
 new SolrStreamingService<>(converter, query, solrClient, nrOfDocumentPerBatch);
 return StreamSupport.stream(
 Spliterators.spliteratorUnknownSize(solrStreamingService, Spliterator.SIZED), false);
 } Use HttpSolrClient pinned to one shard Use binary (request)
 protocol Boilerplate code to stream response
  • 63. 63 private Stream<MetricTimeSeries> 
 streamWithCloudSolrStream(String zkHost, String collection, String shardUrl, SolrQuery query,
 TimeSeriesConverter<MetricTimeSeries> converter) throws IOException {
 Map params = new HashMap();
 params.put("q", query.getQuery());
 params.put("sort", "id asc");
 params.put("shards", extractShardIdFromShardUrl(shardUrl));
 params.put("fl",
 Schema.DATA + ", " + Schema.ID + ", " + Schema.START + ", " + Schema.END +
 ", metric, host, measurement, process, ag, group");
 params.put("qt", "/export");
 params.put("distrib", false);
 
 CloudSolrStream solrStream = new CloudSolrStream(zkHost, collection, params);
 solrStream.open();
 SolrTupleStreamingService tupStream = new SolrTupleStreamingService(solrStream, converter);
 return StreamSupport.stream( Spliterators.spliteratorUnknownSize(tupStream, Spliterator.SIZED), false);
 } EXPORT HANDLER PROTOCOL Pin query to one shard Use export request handler Boilerplate code to stream response
  • 64. 64 01 CHRONIXRDD: FROM CHUNKS TO TIME SERIES public ChronixRDD joinChunks() {
 JavaPairRDD<MetricTimeSeriesKey, Iterable<MetricTimeSeries>> groupRdd
 = this.groupBy(MetricTimeSeriesKey::new);
 
 JavaPairRDD<MetricTimeSeriesKey, MetricTimeSeries> joinedRdd
 = groupRdd.mapValues((Function<Iterable<MetricTimeSeries>, MetricTimeSeries>) mtsIt -> {
 MetricTimeSeriesOrdering ordering = new MetricTimeSeriesOrdering();
 List<MetricTimeSeries> orderedChunks = ordering.immutableSortedCopy(mtsIt);
 MetricTimeSeries result = null;
 for (MetricTimeSeries mts : orderedChunks) {
 if (result == null) {
 result = new MetricTimeSeries
 .Builder(mts.getMetric())
 .attributes(mts.attributes()).build();
 }
 result.addAll(mts.getTimestampsAsArray(), mts.getValuesAsArray());
 }
 return result;
 });
 
 JavaRDD<MetricTimeSeries> resultJavaRdd =
 joinedRdd.map((Tuple2<MetricTimeSeriesKey, MetricTimeSeries> mtTuple) -> mtTuple._2);
 
 return new ChronixRDD(resultJavaRdd); } group chunks according identity join chunks to
 logical time 
 series