Building a High-Performance Database with Scala, Akka, and Spark
Evan Chan
Who am I
User and contributor to Spark since 0.9, Cassandra since 0.6
Created Spark Job Server and FiloDB
Talks at Spark Summit, Cassandra Summit, Strata, Scala Days, etc.
http://velvia.github.io/
Streaming is now King
[Diagram: Message Queue, Events, Stream Processing Layer, State / Database, Happy Users]
Why are Updates Important?
Appends
Streaming workloads. Add new data continuously.
Real data is *always* changing. Queries on live real-time
data have business benefits.
Updates
Idempotency = really simple ingestion pipelines
Simpler streaming later
update late events (See Spark 2.0 Structured Streaming)
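A minimal sketch of why idempotent updates simplify ingestion, assuming a hypothetical `Event` keyed by `(partition, rowKey)` and an in-memory `Map` standing in for the table (neither is FiloDB's API). Because the last write for a key wins, replaying a batch after a retry leaves the state unchanged, and a late event simply overwrites the earlier value.

```scala
object IdempotentIngest {
  // Hypothetical event shape: (partition, rowKey) identifies one row.
  final case class Event(partition: String, rowKey: Long, value: Double)

  // Stand-in for a table: one value per (partition, rowKey).
  type Table = Map[(String, Long), Double]

  // Upsert semantics: the last write for a key wins.
  def ingest(table: Table, events: Seq[Event]): Table =
    events.foldLeft(table) { (t, e) => t.updated((e.partition, e.rowKey), e.value) }

  val batch: Seq[Event] = Seq(Event("sensor-1", 1L, 10.0), Event("sensor-1", 2L, 11.5))
  val once: Table = ingest(Map.empty, batch)
  // Replaying the same batch (for example after an at-least-once retry) changes nothing.
  val twice: Table = ingest(once, batch)
  // A late correction for an existing key overwrites the old value.
  val corrected: Table = ingest(twice, Seq(Event("sensor-1", 1L, 9.0)))
}
```

With append-only storage the replay would have produced duplicate rows, so the pipeline would need its own deduplication step.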
Introducing FiloDB
A distributed, versioned, columnar analytics
database. With updates. Built for streaming.
http://www.github.com/filodb/FiloDB
Fast Analytics Storage
• Scan speeds competitive with Apache Parquet
• In-memory version significantly faster
• Flexible filtering along two dimensions
• Much more efficient and flexible partition key filtering
• Efficient columnar storage using dictionary encoding and other techniques
• Updatable
• Spark SQL for easy BI integration
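To illustrate the dictionary encoding mentioned above, here is a small sketch in plain Scala. The `Encoded` type and `encode` function are hypothetical names for illustration and do not reflect Filo's actual binary layout: each distinct string is stored once, and the column itself becomes an array of small integer codes.

```scala
object DictionaryEncoding {
  // Hypothetical encoded column: distinct values plus one integer code per row.
  final case class Encoded(dictionary: Vector[String], codes: Array[Int]) {
    def apply(i: Int): String = dictionary(codes(i))
    def length: Int = codes.length
  }

  def encode(values: Seq[String]): Encoded = {
    // distinct keeps first-occurrence order, so codes are stable for a given input.
    val dictionary = values.distinct.toVector
    val index = dictionary.zipWithIndex.toMap
    Encoded(dictionary, values.map(index).toArray)
  }

  val column: Seq[String] = Seq("US", "DE", "US", "US", "FR", "DE")
  val encoded: Encoded = encode(column)
}
```

For low-cardinality columns the dictionary stays tiny, and equality filters can compare integer codes instead of strings.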
[Diagram: Message Queue; Events; Spark Streaming; Short term storage, K-V; Adhoc, SQL, ML; Cassandra; FiloDB: Events, ad-hoc, batch; Spark; Dashboards, maps]
100% Reactive
• Scala
• Akka Cluster
• Spark
• Typesafe Config for all configuration
• Scodec, Ficus, Enumeratum, Scalactic, etc.
• Even most of the performance critical parts are written in Scala :)
Scala, Akka, and Spark
• Akka - eliminate shared mutable state
• Remote and cluster makes building distributed client-server architectures easy
• Backpressure, at-least-once is easy to build
• Failure handling and supervision are critical for databases
• Spark for SQL, DataFrames, ML, interfacing
One FiloDB Node
[Diagram: NodeCoordinatorActor (NCA); two DatasetCoordinatorActor (DsCA) instances; Active MemTable; Flushing MemTable; Reprojector; ColumnStore. Labels: Data, commands.]
Akka vs Futures
[Diagram: NodeCoordinatorActor (NCA); two DatasetCoordinatorActor (DsCA) instances; Active MemTable; Flushing MemTable; Reprojector; ColumnStore. Labels: Data, commands. Annotations: "Akka - control flow" and "Core I/O - Futures".]
Akka vs Futures
• Akka Actors:
• External FiloDB node API (remote + cluster)
• Async messaging with clients
• State management and scheduling (flushing)
• Futures:
• Core I/O
• Columnar data processing / ingestion
• Type-safe processing stages
Akka for Control Flow
[Diagram: Driver (Client); two Executors, each showing NCA, DsCA1, and DsCA2. Labels: Flush(), NodeClusterActor, SingletonClusterProxy.]
Yes, Akka in Spark
• Columnar ingestion is stateful - need stickiness of state. This is inherently difficult in Spark.
• Akka (cluster) gives us a separate, asynchronous control channel to talk to FiloDB ingestors
• Spark only gives data flow primitives, not async messaging
• We need to route incoming records to the correct ingestion node. Sorting data is inefficient and forces all nodes to wait for sorting to be done.
• On failure, can control state recovery and moving state
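The routing point above can be sketched without any sorting step. This is only an illustration under stated assumptions: `PartitionMap`, `Record`, and the hash-based assignment are hypothetical and are not FiloDB's actual partition map. The idea is that each record's partition key alone determines its ingestion node, so records can be forwarded as they arrive.

```scala
object PartitionRouting {
  final case class Record(partitionKey: String, payload: String)

  // Hypothetical partition map: a fixed list of ingestion node names.
  final case class PartitionMap(nodes: Vector[String]) {
    // floorMod keeps the index non-negative even for negative hash codes.
    def nodeFor(partitionKey: String): String =
      nodes(java.lang.Math.floorMod(partitionKey.hashCode, nodes.size))
  }

  // Group records by destination node; no global sort is needed.
  def route(map: PartitionMap, records: Seq[Record]): Map[String, Seq[Record]] =
    records.groupBy(r => map.nodeFor(r.partitionKey))

  val map: PartitionMap = PartitionMap(Vector("node-a", "node-b", "node-c"))
  val records: Seq[Record] = Seq(
    Record("sensor-1", "t=1"),
    Record("sensor-2", "t=1"),
    Record("sensor-1", "t=2"))
  val routed: Map[String, Seq[Record]] = route(map, records)
}
```

Every record for `sensor-1` lands on the same node, which gives the state stickiness that columnar ingestion needs.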
Data Ingestion Setup
[Diagram: two Executors, each showing NCA, DsCA1, DsCA2, task0, task1, and two Row Source Actors; a Node Cluster Actor; Partition Map.]
FiloDB separate nodes
[Diagram labels: FiloDB Node (x2), Executor (x2), NCA, DsCA1, DsCA2, task0, task1, Row Source Actor, Node Cluster Actor, Partition Map, Akka wire protocol.]
Backpressure
• Assumes receiver is OK, starts sending rows
• Allows a configurable number of unacked messages before it stops sending
• Acking is receiver’s way of rate-limiting
• Automatic retries for at-least-once
• NACK for when receiver must stop (out of memory or MemTable full)
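The protocol described above can be sketched as a small state machine in plain Scala, without Akka. All names here (`Sender`, `poll`, `ack`, `nack`, `retries`) are hypothetical, and resuming on the next ack after a NACK is an assumption made for the sketch, not something the slides specify.

```scala
import scala.collection.mutable

object AckWindow {
  // Hypothetical sender: tracks rows sent but not yet acknowledged.
  final class Sender(rows: Seq[String], maxUnacked: Int) {
    private var nextSeq = 0
    private var paused = false
    private val unacked = mutable.LinkedHashMap.empty[Int, String]

    /** Rows that may be sent now: fills the window unless a NACK paused the sender. */
    def poll(): Seq[(Int, String)] = {
      val out = mutable.ArrayBuffer.empty[(Int, String)]
      while (!paused && unacked.size < maxUnacked && nextSeq < rows.length) {
        unacked(nextSeq) = rows(nextSeq)
        out += (nextSeq -> rows(nextSeq))
        nextSeq += 1
      }
      out.toList
    }

    /** An ack frees a window slot; assumed here to also resume a paused sender. */
    def ack(seq: Int): Unit = { unacked.remove(seq); paused = false }

    /** The receiver asks the sender to stop, e.g. because its MemTable is full. */
    def nack(): Unit = { paused = true }

    /** At-least-once: everything still unacked is a candidate for resending. */
    def retries(): Seq[(Int, String)] = unacked.toList

    def done: Boolean = nextSeq == rows.length && unacked.isEmpty
  }
}
```

The receiver controls the rate purely by when it acks: a slow receiver leaves the window full, and the sender stops on its own.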
Testing Akka Cluster
• MultiNodeSpec / sbt-multi-jvm
• AWESOME
• Test multi-node message routing
• Test cluster membership and subscription
• Inject network failures
Core: All Futures
/**
 * Clears all data from the column store for that given projection, for all versions.
 * More like a truncation, not a drop.
 * NOTE: please make sure there are no reprojections or writes going on before calling this
 */
def clearProjectionData(projection: Projection): Future[Response]

/**
 * Completely and permanently drops the dataset from the column store.
 * @param dataset the DatasetRef for the dataset to drop.
 */
def dropDataset(dataset: DatasetRef): Future[Response]

/**
 * Appends the ChunkSets and incremental indices in the segment to the column store.
 * @param segment the ChunkSetSegment to write / merge to the columnar store
 * @param version the version # to write the segment to
 * @return Success. Future.failure(exception) otherwise.
 */
def appendSegment(projection: RichProjection,
                  segment: ChunkSetSegment,
                  version: Int): Future[Response]
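A minimal sketch of how an all-`Future` storage API composes, using only the Scala standard library. `InMemoryStore`, the `String` chunk type, and the local `Response` hierarchy are hypothetical stand-ins, not FiloDB's `ColumnStore`. Each chunk write is its own `Future`, and `Future.traverse` combines them into one result without blocking a thread per write.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureStages {
  sealed trait Response
  case object Success extends Response
  case object NotApplied extends Response

  // Hypothetical in-memory stand-in for a column store.
  final class InMemoryStore {
    private var chunks = Map.empty[(String, Int), Vector[String]]

    def writeChunk(dataset: String, version: Int, chunk: String): Future[Response] = Future {
      synchronized {
        val key = (dataset, version)
        chunks = chunks.updated(key, chunks.getOrElse(key, Vector.empty) :+ chunk)
      }
      Success
    }

    // Empty segments short-circuit; otherwise all chunk writes run as Futures.
    def appendSegment(dataset: String, version: Int, segment: Seq[String]): Future[Response] =
      if (segment.isEmpty) Future.successful(NotApplied)
      else Future.traverse(segment)(writeChunk(dataset, version, _)).map { responses =>
        if (responses.forall(_ == Success)) Success else NotApplied
      }

    def read(dataset: String, version: Int): Seq[String] =
      synchronized(chunks.getOrElse((dataset, version), Vector.empty))
  }

  // Blocking helper for demos and tests only.
  def await[T](f: Future[T]): T = Await.result(f, 5.seconds)
}
```

Chunk writes may complete in any order, so callers of this sketch should not rely on the order returned by `read`.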
Kamon Tracing
def appendSegment(projection: RichProjection,
                  segment: ChunkSetSegment,
                  version: Int): Future[Response] = Tracer.withNewContext("append-segment") {
  val ctx = Tracer.currentContext
  stats.segmentAppend()
  if (segment.chunkSets.isEmpty) {
    stats.segmentEmpty()
    // if/else instead of `return`, which would be a non-local return inside this block
    Future.successful(NotApplied)
  } else {
    for { writeChunksResp <- writeChunks(projection.datasetRef, version, segment, ctx)
          writeIndexResp  <- writeIndices(projection, version, segment, ctx)
          if writeChunksResp == Success
    } yield {
      ctx.finish()
      writeIndexResp
    }
  }
}

private def writeChunks(dataset: DatasetRef,
                        version: Int,
                        segment: ChunkSetSegment,
                        ctx: TraceContext): Future[Response] = {
  asyncSubtrace(ctx, "write-chunks", "ingestion") {
    val binPartition = segment.binaryPartition
    val segmentId = segment.segmentId
    val chunkTable = getOrCreateChunkTable(dataset)
    Future.traverse(segment.chunkSets) { chunkSet =>
      chunkTable.writeChunks(binPartition, version, segmentId, chunkSet.info.id, chunkSet.chunks, stats)
    }.map { responses => responses.head }
  }
}
Kamon Tracing
• http://kamon.io
• One trace can encapsulate multiple Future steps all executing on different threads
• Tunable tracing levels
• Summary stats and histograms for segments
• Super useful for production debugging of reactive stack
Kamon Metrics
• Uses HDRHistogram for much finer and more accurate buckets
• Built-in metrics for Akka actors, Spray, Akka-Http, Play, etc. etc.
KAMON trace name=append-segment n=2863 min=765952 p50=2113536 p90=3211264 p95=3981312 p99=9895936 p999=16121856 max=19529728
KAMON trace-segment name=write-chunks n=2864 min=436224 p50=1597440 p90=2637824 p95=3424256 p99=9109504 p999=15335424 max=18874368
KAMON trace-segment name=write-index n=2863 min=278528 p50=432128 p90=544768 p95=598016 p99=888832 p999=2260992 max=8355840
Validation: Scalactic
private def getColumnsFromNames(allColumns: Seq[Column],
                                columnNames: Seq[String]): Seq[Column] Or BadSchema = {
  if (columnNames.isEmpty) {
    Good(allColumns)
  } else {
    val columnMap = allColumns.map { c => c.name -> c }.toMap
    val missing = columnNames.toSet -- columnMap.keySet
    if (missing.nonEmpty) { Bad(MissingColumnNames(missing.toSeq, "projection")) }
    else                  { Good(columnNames.map(columnMap)) }
  }
}

for { computedColumns <- getComputedColumns(dataset.name, allColIds, columns)
      dataColumns <- getColumnsFromNames(columns, normProjection.columns)
      richColumns = dataColumns ++ computedColumns
      // scalac has problems dealing with (a, b, c) <- getColIndicesAndType... apparently
      segStuff <- getColIndicesAndType(richColumns, Seq(normProjection.segmentColId), "segment")
      keyStuff <- getColIndicesAndType(richColumns, normProjection.keyColIds, "row")
      partStuff <- getColIndicesAndType(richColumns, dataset.partitionColumns, "partition") }
yield {
• Notice how multiple validations compose!
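The same composition can be tried with the standard library alone. This sketch swaps Scalactic's `Or` for Scala's built-in `Either` (right-biased since 2.12), so it is an analogue rather than the code from the slides; `Column`, `columnsFromNames`, and `validate` are simplified hypothetical versions. The for-comprehension stops at the first `Left` and returns that error.

```scala
object ComposedValidation {
  final case class Column(name: String, colType: String)

  sealed trait BadSchema
  final case class MissingColumnNames(missing: Seq[String], kind: String) extends BadSchema

  // One validation step: resolve names to columns or report which are missing.
  def columnsFromNames(all: Seq[Column],
                       names: Seq[String],
                       kind: String): Either[BadSchema, Seq[Column]] = {
    val byName = all.map(c => c.name -> c).toMap
    val missing = names.filterNot(byName.contains)
    if (missing.nonEmpty) Left(MissingColumnNames(missing, kind))
    else Right(names.map(byName))
  }

  // Several steps chained: the first failure short-circuits the rest.
  def validate(all: Seq[Column],
               rowKeys: Seq[String],
               partitionKeys: Seq[String]): Either[BadSchema, (Seq[Column], Seq[Column])] =
    for {
      keyCols  <- columnsFromNames(all, rowKeys, "row")
      partCols <- columnsFromNames(all, partitionKeys, "partition")
    } yield (keyCols, partCols)

  val schema: Seq[Column] = Seq(Column("ts", "long"), Column("sensor", "string"))
}
```

Note that `Either` in a for-comprehension reports only the first error; accumulating all errors needs a different combinator.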
Machine-Speed Scala
http://github.com/velvia/filo
https://github.com/filodb/FiloDB/blob/new-storage-format/core/src/main/scala/filodb.core/binaryrecord/BinaryRecord.scala
Filo: High Performance Binary Vectors
• Designed for NoSQL, not a file format
• random or linear access
• on or off heap
• missing value support
• Scala only, but cross-platform support possible
http://github.com/velvia/filo is a binary data vector
library designed for extreme read performance with
minimal deserialization costs.
Billions of Ops / Sec
• JMH benchmark: 0.5ns per FiloVector element access / add
• 2 Billion adds per second - single threaded
• Who said Scala cannot be fast?
• Spark API (row-based) limits performance significantly
val randomInts = (0 until numValues).map(i => util.Random.nextInt)
val randomIntsArray = randomInts.toArray
val filoBuffer = VectorBuilder(randomInts).toFiloBuffer
val sc = FiloVector[Int](filoBuffer)

@Benchmark
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
def sumAllIntsFiloApply(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}
Thank you Scala OSS!
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Engineering Chemistry First Year Fullerenes
Engineering Chemistry First Year FullerenesEngineering Chemistry First Year Fullerenes
Engineering Chemistry First Year Fullerenes
5g2jpd9sp4
 
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
DATA-DRIVEN SHOULDER INVERSE KINEMATICS YoungBeom Kim1 , Byung-Ha Park1 , Kwa...
charlesdick1345
 
Gas Power Plant for Power Generation System
Gas Power Plant for Power Generation SystemGas Power Plant for Power Generation System
Gas Power Plant for Power Generation System
JourneyWithMe1
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
Level 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical SafetyLevel 1-Safety.pptx Presentation of Electrical Safety
Level 1-Safety.pptx Presentation of Electrical Safety
JoseAlbertoCariasDel
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Crack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By VivekCrack the Domain with Event Storming By Vivek
Crack the Domain with Event Storming By Vivek
Vivek Srivastava
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........BTech_CSE_LPU_Presentation.pptx.........
BTech_CSE_LPU_Presentation.pptx.........
jinny kaur
 
Elevate Your Workflow
Elevate Your WorkflowElevate Your Workflow
Elevate Your Workflow
NickHuld
 
Mirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdfMirada a 12 proyectos desarrollados con BIM.pdf
Mirada a 12 proyectos desarrollados con BIM.pdf
topitodosmasdos
 
Unit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatioUnit III.pptx IT3401 web essentials presentatio
Unit III.pptx IT3401 web essentials presentatio
lakshitakumar291
 
Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.Fort night presentation new0903 pdf.pdf.
Fort night presentation new0903 pdf.pdf.
anuragmk56
 
"Heaters in Power Plants: Types, Functions, and Performance Analysis"
"Heaters in Power Plants: Types, Functions, and Performance Analysis""Heaters in Power Plants: Types, Functions, and Performance Analysis"
"Heaters in Power Plants: Types, Functions, and Performance Analysis"
Infopitaara
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
aset and manufacturing optimization and connecting edge
aset and manufacturing optimization and connecting edgeaset and manufacturing optimization and connecting edge
aset and manufacturing optimization and connecting edge
alilamisse
 
IntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdfIntroSlides-April-BuildWithAI-VertexAI.pdf
IntroSlides-April-BuildWithAI-VertexAI.pdf
Luiz Carneiro
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 

Building a High-Performance Database with Scala, Akka, and Spark

  • 1. Building a High-Performance Database with Scala, Akka, and Spark. Evan Chan
  • 2. Who am I: User and contributor to Spark since 0.9, Cassandra since 0.6. Created Spark Job Server and FiloDB. Talks at Spark Summit, Cassandra Summit, Strata, Scala Days, etc. http://velvia.github.io/
  • 5. Why are Updates Important? Appends: Streaming workloads. Add new data continuously. Real data is *always* changing. Queries on live real-time data have business benefits. Updates: Idempotency = really simple ingestion pipelines. Simpler streaming layer. Update late events (see Spark 2.0 Structured Streaming).
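The idempotency point can be illustrated with a minimal sketch. The `Event` type, the `(key, timestamp)` primary key, and the in-memory `Map` are assumptions made for illustration and are not FiloDB's API. Because each write is an upsert on the key, replaying a batch after a failure leaves the state unchanged, and a late correction simply overwrites the earlier value.

```scala
// Hypothetical event type, for illustration only (not part of FiloDB).
final case class Event(key: String, timestamp: Long, value: Double)

object IdempotentIngest {
  type State = Map[(String, Long), Double]

  // Upsert keyed on (key, timestamp): writing the same event twice is a no-op,
  // and a later event with the same key replaces the earlier value.
  def ingest(state: State, events: Seq[Event]): State =
    events.foldLeft(state) { (acc, e) => acc.updated((e.key, e.timestamp), e.value) }
}
```

With this shape, an ingestion pipeline can retry a whole batch blindly instead of tracking exactly which rows were already written.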
  • 6. Introducing FiloDB: A distributed, versioned, columnar analytics database. With updates. Built for streaming. http://www.github.com/filodb/FiloDB
  • 7. Fast Analytics Storage • Scan speeds competitive with Apache Parquet • In-memory version significantly faster • Flexible filtering along two dimensions • Much more efficient and flexible partition key filtering • Efficient columnar storage using dictionary encoding and other techniques • Updatable • Spark SQL for easy BI integration
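As a rough illustration of dictionary encoding for a columnar store, the sketch below stores each distinct string once and keeps one small integer code per row. The `Encoded` type and `encode` function are hypothetical names; the slides do not describe FiloDB's actual on-disk layout, so this shows only the general technique.

```scala
object DictionaryEncoding {
  // Encoded column: each distinct value stored once, plus one integer code per row.
  final case class Encoded(dictionary: Vector[String], codes: Vector[Int]) {
    def apply(row: Int): String = dictionary(codes(row))
    def length: Int = codes.length
  }

  def encode(column: Seq[String]): Encoded = {
    // LinkedHashMap keeps insertion order, so a value's code equals its dictionary position.
    val index = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    val codes = column.map(v => index.getOrElseUpdate(v, index.size)).toVector
    Encoded(index.keys.toVector, codes)
  }
}
```

Low-cardinality columns (city names, status codes) shrink the most, since the repeated strings collapse into small integers.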
  • 8. [Diagram labels: Message Queue; Events; Spark Streaming; Short term storage, K-V; Adhoc, SQL, ML; Cassandra; FiloDB: Events, ad-hoc, batch; Spark; Dashboards, maps]
  • 9. 100% Reactive • Scala • Akka Cluster • Spark • Typesafe Config for all configuration • Scodec, Ficus, Enumeratum, Scalactic, etc. • Even most of the performance critical parts are written in Scala :)
  • 10. Scala, Akka, and Spark • Akka - eliminate shared mutable state • Remote and cluster make building distributed client-server architectures easy • Backpressure and at-least-once delivery are easy to build • Failure handling and supervision are critical for databases • Spark for SQL, DataFrames, ML, interfacing
  • 12. Akka vs Futures [Diagram labels: NodeCoordinatorActor (NCA); DatasetCoordinatorActor (DsCA) ×2; Active MemTable; Flushing MemTable; Reprojector; ColumnStore; Data, commands; Akka - control flow; Core I/O - Futures]
  • 13. Akka vs Futures • Akka Actors: • External FiloDB node API (remote + cluster) • Async messaging with clients • State management and scheduling (flushing) • Futures: • Core I/O • Columnar data processing / ingestion • Type-safe processing stages
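A minimal sketch of what "type-safe processing stages" built from Futures might look like. All names here (`Rows`, `Chunk`, `encode`, `write`, `ingest`) are hypothetical stand-ins rather than FiloDB code. Each stage accepts only the previous stage's output type, so a mis-ordered pipeline fails to compile, and the for-comprehension sequences the asynchronous steps without blocking.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FutureStages {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stage types; each stage only accepts the previous stage's output.
  final case class Rows(values: Seq[Int])
  final case class Chunk(bytes: Array[Byte])
  sealed trait Response
  case object Success extends Response
  case object NotApplied extends Response

  // Stage 1: encode rows into a (toy) binary chunk.
  def encode(rows: Rows): Future[Chunk] =
    Future(Chunk(rows.values.map(_.toByte).toArray))

  // Stage 2: "write" the chunk; an empty chunk is reported as NotApplied.
  def write(chunk: Chunk): Future[Response] =
    Future(if (chunk.bytes.isEmpty) NotApplied else Success)

  // Stages compose with a for-comprehension; a failed Future short-circuits the rest.
  def ingest(rows: Rows): Future[Response] =
    for {
      chunk <- encode(rows)
      resp  <- write(chunk)
    } yield resp

  // Blocking helper for demos and tests only.
  def ingestBlocking(rows: Rows): Response = Await.result(ingest(rows), 5.seconds)
}
```

An actor would typically sit in front of such a pipeline and own the mutable state (for example, which MemTable is active), while the stateless I/O work runs inside the Futures.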
  • 14. Akka for Control Flow [Diagram labels: Driver; Client; two Executors, each with NCA, DsCA1, DsCA2; Flush(); NodeClusterActor; SingletonClusterProxy]
  • 15. Yes, Akka in Spark • Columnar ingestion is stateful - need stickiness of state. This is inherently difficult in Spark. • Akka (cluster) gives us a separate, asynchronous control channel to talk to FiloDB ingestors • Spark only gives data flow primitives, not async messaging • We need to route incoming records to the correct ingestion node. Sorting data is inefficient and forces all nodes to wait for sorting to be done. • On failure, can control state recovery and moving state
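The routing idea can be sketched without a global sort: hash each record's partition key to pick a node, then group the batch by destination. The `PartitionMap` type, the hash-modulo scheme, and the node names are assumptions for illustration. The slides mention a partition map but do not say how FiloDB assigns partitions to nodes.

```scala
object PartitionRouting {
  // Hypothetical partition map: position in the vector = shard number, value = node address.
  final case class PartitionMap(nodes: Vector[String]) {
    def nodeFor(partitionKey: String): String = {
      // floorMod keeps the shard non-negative even when hashCode is negative.
      val shard = java.lang.Math.floorMod(partitionKey.hashCode, nodes.length)
      nodes(shard)
    }
  }

  // Group a batch of (partitionKey, value) records by destination node.
  // A single pass over the batch; no node waits on a cluster-wide sort.
  def route(map: PartitionMap,
            records: Seq[(String, Double)]): Map[String, Seq[(String, Double)]] =
    records.groupBy { case (key, _) => map.nodeFor(key) }
}
```

Because the same key always maps to the same node, the stateful columnar ingestion for a partition stays on one node.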
  • 16. Data Ingestion Setup [Diagram labels: two Executors, each with NCA, DsCA1, DsCA2, task0, task1, and two Row Source Actors; Node Cluster Actor; Partition Map]
  • 17. FiloDB separate nodes [Diagram labels: two FiloDB Nodes; two Executors; NCA, DsCA1, DsCA2, task0, task1, and two Row Source Actors on each side; Node Cluster Actor; Partition Map]
  • 19. Backpressure • Sender assumes the receiver is OK and starts sending rows • Allows a configurable number of unacked messages before it stops sending • Acking is the receiver's way of rate-limiting • Automatic retries for at-least-once delivery • NACK for when the receiver must stop (out of memory or MemTable full)
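The ack-window scheme above can be modeled as a small pure state machine. This is a sketch under assumptions: the `Sender` type and its fields are hypothetical, NACK handling is left out, and the real FiloDB implementation uses Akka actors and messages rather than this immutable model.

```scala
object AckWindow {
  // Hypothetical sender state: rows not yet sent, rows sent but not yet acked,
  // the next sequence number, and the window size.
  final case class Sender(pending: Vector[String],
                          inFlight: Map[Long, String],
                          nextSeq: Long,
                          maxUnacked: Int) {

    // Send as many rows as the window allows; returns the new state and the messages emitted.
    def sendAvailable: (Sender, Vector[(Long, String)]) = {
      val room = math.max(maxUnacked - inFlight.size, 0)
      val (toSend, rest) = pending.splitAt(room)
      val msgs = toSend.zipWithIndex.map { case (row, i) => (nextSeq + i, row) }
      (copy(pending = rest, inFlight = inFlight ++ msgs, nextSeq = nextSeq + msgs.size), msgs)
    }

    // An ack frees one slot in the window, which is how the receiver controls the rate.
    def ack(seq: Long): Sender = copy(inFlight = inFlight - seq)

    // Unacked rows to replay after a timeout. At-least-once: the receiver may see duplicates.
    def retries: Vector[(Long, String)] = inFlight.toVector.sortBy(_._1)
  }

  def start(rows: Seq[String], maxUnacked: Int): Sender =
    Sender(rows.toVector, Map.empty, 0L, maxUnacked)
}
```

A slow receiver simply acks less often; the sender stalls once `maxUnacked` messages are outstanding, so no explicit "slow down" message is needed in the common case.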
  • 20. Testing Akka Cluster • MultiNodeSpec / sbt-multi-jvm • AWESOME • Test multi-node message routing • Test cluster membership and subscription • Inject network failures
  • 21. Core: All Futures

      /**
       * Clears all data from the column store for that given projection, for all versions.
       * More like a truncation, not a drop.
       * NOTE: please make sure there are no reprojections or writes going on before calling this
       */
      def clearProjectionData(projection: Projection): Future[Response]

      /**
       * Completely and permanently drops the dataset from the column store.
       * @param dataset the DatasetRef for the dataset to drop.
       */
      def dropDataset(dataset: DatasetRef): Future[Response]

      /**
       * Appends the ChunkSets and incremental indices in the segment to the column store.
       * @param segment the ChunkSetSegment to write / merge to the columnar store
       * @param version the version # to write the segment to
       * @return Success. Future.failure(exception) otherwise.
       */
      def appendSegment(projection: RichProjection,
                        segment: ChunkSetSegment,
                        version: Int): Future[Response]
  • 22. Kamon Tracing

      def appendSegment(projection: RichProjection,
                        segment: ChunkSetSegment,
                        version: Int): Future[Response] = Tracer.withNewContext("append-segment") {
        val ctx = Tracer.currentContext
        stats.segmentAppend()
        if (segment.chunkSets.isEmpty) {
          stats.segmentEmpty()
          return(Future.successful(NotApplied))
        }
        for { writeChunksResp <- writeChunks(projection.datasetRef, version, segment, ctx)
              writeIndexResp  <- writeIndices(projection, version, segment, ctx)
                                 if writeChunksResp == Success
        } yield {
          ctx.finish()
          writeIndexResp
        }
      }

      private def writeChunks(dataset: DatasetRef,
                              version: Int,
                              segment: ChunkSetSegment,
                              ctx: TraceContext): Future[Response] = {
        asyncSubtrace(ctx, "write-chunks", "ingestion") {
          val binPartition = segment.binaryPartition
          val segmentId = segment.segmentId
          val chunkTable = getOrCreateChunkTable(dataset)
          Future.traverse(segment.chunkSets) { chunkSet =>
            chunkTable.writeChunks(binPartition, version, segmentId, chunkSet.info.id, chunkSet.chunks, stats)
          }.map { responses => responses.head }
        }
      }
  • 23. Kamon Tracing • http://kamon.io • One trace can encapsulate multiple Future steps all executing on different threads • Tunable tracing levels • Summary stats and histograms for segments • Super useful for production debugging of reactive stack
  • 24. Kamon Metrics • Uses HDRHistogram for much finer and more accurate buckets • Built-in metrics for Akka actors, Spray, Akka-Http, Play, etc.

      KAMON trace name=append-segment n=2863 min=765952 p50=2113536 p90=3211264 p95=3981312 p99=9895936 p999=16121856 max=19529728
      KAMON trace-segment name=write-chunks n=2864 min=436224 p50=1597440 p90=2637824 p95=3424256 p99=9109504 p999=15335424 max=18874368
      KAMON trace-segment name=write-index n=2863 min=278528 p50=432128 p90=544768 p95=598016 p99=888832 p999=2260992 max=8355840
  • 25. Validation: Scalactic

      private def getColumnsFromNames(allColumns: Seq[Column],
                                      columnNames: Seq[String]): Seq[Column] Or BadSchema = {
        if (columnNames.isEmpty) {
          Good(allColumns)
        } else {
          val columnMap = allColumns.map { c => c.name -> c }.toMap
          val missing = columnNames.toSet -- columnMap.keySet
          if (missing.nonEmpty) { Bad(MissingColumnNames(missing.toSeq, "projection")) }
          else                  { Good(columnNames.map(columnMap)) }
        }
      }

      for { computedColumns <- getComputedColumns(dataset.name, allColIds, columns)
            dataColumns <- getColumnsFromNames(columns, normProjection.columns)
            richColumns = dataColumns ++ computedColumns
            // scalac has problems dealing with (a, b, c) <- getColIndicesAndType... apparently
            segStuff <- getColIndicesAndType(richColumns, Seq(normProjection.segmentColId), "segment")
            keyStuff <- getColIndicesAndType(richColumns, normProjection.keyColIds, "row")
            partStuff <- getColIndicesAndType(richColumns, dataset.partitionColumns, "partition") }
      yield {

    • Notice how multiple validations compose!
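To show the composition pattern in a form that runs without extra dependencies, the sketch below swaps Scalactic's `Or` for the standard library's `Either` (right-biased since Scala 2.12). This is a deliberate substitution, not the slide's code: `Column`, `columnsFromNames`, and `validate` are simplified, hypothetical stand-ins. The first failing validation short-circuits the for-comprehension and its error is returned unchanged.

```scala
object ComposedValidation {
  // Simplified, hypothetical schema types for illustration.
  final case class Column(name: String, colType: String)
  sealed trait BadSchema
  final case class MissingColumnNames(missing: Seq[String], kind: String) extends BadSchema

  // One validation step: resolve names to columns or report which names are missing.
  def columnsFromNames(all: Seq[Column],
                       names: Seq[String],
                       kind: String): Either[BadSchema, Seq[Column]] = {
    val byName = all.map(c => c.name -> c).toMap
    val missing = names.filterNot(byName.contains)
    if (missing.nonEmpty) Left(MissingColumnNames(missing, kind))
    else Right(names.map(byName))
  }

  // Steps compose in a for-comprehension; the first Left stops the chain.
  def validate(all: Seq[Column],
               rowKeys: Seq[String],
               partitionKeys: Seq[String]): Either[BadSchema, (Seq[Column], Seq[Column])] =
    for {
      keyCols  <- columnsFromNames(all, rowKeys, "row")
      partCols <- columnsFromNames(all, partitionKeys, "partition")
    } yield (keyCols, partCols)
}
```

The caller gets either the fully validated result or a single typed error describing what was wrong, with no exceptions and no nested `if` checks.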
  • 27. Filo: High Performance Binary Vectors • Designed for NoSQL, not a file format • random or linear access • on or off heap • missing value support • Scala only, but cross-platform support possible. http://github.com/velvia/filo is a binary data vector library designed for extreme read performance with minimal deserialization costs.
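One common way to combine random access with missing-value support is a dense primitive array plus a bitset marking the missing slots. The sketch below illustrates that general idea only. The `IntVectorWithNA` class and its layout are assumptions and do not reflect Filo's actual binary format or API.

```scala
object NaVector {
  // Hypothetical layout: a dense Int array plus a bitset marking which slots are missing.
  final class IntVectorWithNA(values: Array[Int], missing: java.util.BitSet) {
    def length: Int = values.length
    def isAvailable(i: Int): Boolean = !missing.get(i)
    // Raw read with no boxing; the caller is expected to check isAvailable first.
    def apply(i: Int): Int = values(i)
    // Convenience accessor that boxes into an Option.
    def get(i: Int): Option[Int] = if (isAvailable(i)) Some(values(i)) else None
  }

  def build(input: Seq[Option[Int]]): IntVectorWithNA = {
    val values = new Array[Int](input.length)
    val missing = new java.util.BitSet(input.length)
    input.zipWithIndex.foreach {
      case (Some(v), i) => values(i) = v
      case (None, i)    => missing.set(i)
    }
    new IntVectorWithNA(values, missing)
  }
}
```

Keeping the values in a primitive array is what lets a scan avoid per-element allocation; the `Option`-returning accessor is for convenience, not for hot loops.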
  • 28. Billions of Ops / Sec • JMH benchmark: 0.5ns per FiloVector element access / add • 2 Billion adds per second - single threaded • Who said Scala cannot be fast? • Spark API (row-based) limits performance significantly

      val randomInts = (0 until numValues).map(i => util.Random.nextInt)
      val randomIntsAray = randomInts.toArray
      val filoBuffer = VectorBuilder(randomInts).toFiloBuffer
      val sc = FiloVector[Int](filoBuffer)

      @Benchmark
      @BenchmarkMode(Array(Mode.AverageTime))
      @OutputTimeUnit(TimeUnit.MICROSECONDS)
      def sumAllIntsFiloApply(): Int = {
        var total = 0
        for { i <- 0 until numValues optimized } {
          total += sc(i)
        }
        total
      }