Databricks: Building and Operating A Big Data Service Based On Apache Spark
Databricks is an End-to-End Solution
[Diagram: ETL, Data Dashboards, Warehousing & Reports]
• Interactive Workspace
– Notebook environment, Collaboration, Visualization, Versioning, ACLs
• Lessons
– Lessons from building a large-scale distributed system in the cloud
PART I:
Apache Spark
What we added to Spark
Apache Spark
RDDs continued
• Streaming (Spark Streaming)
– Streaming of real-time data
– How: a series of RDDs, each containing a few seconds of real-time data (see the sketch below)
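A minimal sketch of this micro-batch model using the public Spark Streaming API; the local socket source, host, port, and batch interval are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
// Each 2-second batch interval becomes one RDD in the stream.
val ssc = new StreamingContext(conf, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)
// Per-batch word count: the same RDD operations, applied to each micro-batch.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()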
Unifying Libraries
• Early user feedback
– Different use cases for R, Python, Scala, Java, SQL
– How to intermix them and move across these languages?
• Explosion of R Data Frames and Python Pandas
– DataFrame is a table
– Many procedural operations
– Ideal for dealing with semi-structured data
• Common problem: performance in Spark

val pairs = words.map(word => (word, 1))
val grouped = pairs.groupByKey()
val counts = grouped.map { case (key, values) => (key, values.sum) }
• Problem
– Not declarative, hard to optimize
– Eagerly executes command by command
– Language specific (R dataframes, Pandas)
Spark Data Frames
• Procedural DataFrames vs declarative SQL
– Two different approaches (contrasted in the sketch below)
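To make the contrast concrete, here is a minimal sketch of the same aggregation written both ways, using the standard Spark SQL API (shown with the modern SparkSession entry point); the words.json file and its word/n fields are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
val df = spark.read.json("words.json")

// Procedural DataFrame style: chained operators, optimized lazily.
val procedural = df.groupBy("word").agg(sum("n").as("total"))

// Declarative SQL style over the same data.
df.createOrReplaceTempView("words")
val declarative = spark.sql("SELECT word, SUM(n) AS total FROM words GROUP BY word")

Both forms compile to the same optimized plan, which is what lets Spark offer the two approaches over one engine.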
Proliferation of Data Solutions
• Customers already run a slew of data management systems
– Systems in the MySQL, Cassandra, S3, and HDFS categories
– ETL all data over to Databricks?
• We added the Spark Data Source API
– Open APIs for implementing your own data source
– Examples: CSV, JDBC, Parquet/Avro, ElasticSearch, RedShift, Cassandra
• Features
– Pushdown of predicates, aggregations, column pruning
– Locality information
– User Defined Types (UDTs), e.g. vectors (see the PointUDT example below)

class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)))
  def serialize(p: Point) = Row(p.x, p.y)
  def deserialize(r: Row) =
    Point(r.getDouble(0), r.getDouble(1))
}
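The payoff of the Data Source API is that very different backends sit behind one DataFrame interface. A minimal consumption sketch; the S3 path, JDBC URL, and table and column names are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("source-sketch").master("local[*]").getOrCreate()

// Parquet source: column pruning and predicate pushdown happen in the source.
val events = spark.read.parquet("s3a://bucket/events")
  .select("user", "ts")
  .filter("ts > '2015-01-01'")

// JDBC source: the same API fronts an external database.
val users = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db")
  .option("dbtable", "users")
  .load()

events.join(users, "user").show()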
Modern Spark Architecture
[Diagram: Spark Streaming, Spark SQL, MLlib, and GraphX on top of Spark Core]

Modern Spark Architecture
[Diagram: a DataFrames layer spanning Spark Streaming, Spark SQL, MLlib, and GraphX, all on Spark Core, backed by Data Sources such as {JSON}]
Databricks as a Just-in-Time Data Warehouse
• Traditional data warehouse
– Every night, ETL all relevant data into the warehouse
– Precompute cubes of fact tables
– Slow, costly, poor recency
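In contrast with the nightly-ETL pattern above, a minimal sketch of the just-in-time style: read the live source on demand, cache the hot data in memory, and aggregate at query time; the JDBC URL, table, and column names are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jit-sketch").master("local[*]").getOrCreate()

// Pull from the live operational store on demand instead of via nightly ETL.
val sales = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://host:3306/shop")
  .option("dbtable", "sales")
  .load()
  .cache() // keep the hot working set in memory; no precomputed cubes

sales.createOrReplaceTempView("sales")
// Aggregations run over fresh data at query time.
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()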
PART II:
Cluster Management
Spark as a Service in the Cloud
• Problems
– Existing cluster managers were not cloud-aware
Cloud-Aware Cluster Management
• Instance manager
– Responsible for acquiring machines from cloud provider
• Resource manager
– Schedule and configure isolated containers on machine instances
• Fault-handling
– Terminated or slow instances, spot price hikes
– Seamlessly replace machines (see the sketch below)
• Payment management
– Bid for spot instances, monitor their prices
– Record cluster usage for the payment system
[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager coordinating each Spark cluster]
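A minimal, hypothetical sketch of the replace-on-failure loop described above; CloudProvider, Instance, and all method names are illustrative assumptions, not Databricks' actual interfaces:

case class Instance(id: String, healthy: Boolean)

trait CloudProvider {
  def listInstances(): Seq[Instance]
  def requestInstance(spotBid: Option[Double]): Instance
  def terminate(instance: Instance): Unit
}

class InstanceManager(cloud: CloudProvider, targetSize: Int, spotBid: Option[Double]) {
  // Reconcile the pool against the desired size: request replacements for
  // terminated or unhealthy machines, then retire the unhealthy ones.
  def reconcile(): Unit = {
    val live = cloud.listInstances().filter(_.healthy)
    (live.size until targetSize).foreach(_ => cloud.requestInstance(spotBid))
    cloud.listInstances().filterNot(_.healthy).foreach(cloud.terminate)
  }
}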
PART III:
Interactive Workspace

Interactive Workspace
• Problem
– Real-time collaboration on notebooks
– Version control of notebooks
– Access control on notebooks
Pub/sub-based TreeStore
• Web application server
– Stores an in-memory representation of the Databricks workspace
• Usage
– Subscribe to a notebook, see its edits live as they happen
– Used to create a collaborative environment (see the sketch below)
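A minimal, hypothetical sketch of such a pub/sub store; the class shape and method names are illustrative assumptions, not Databricks' implementation:

import scala.collection.mutable

class TreeStore {
  private val values = mutable.Map[String, String]() // path -> content
  private val subs = mutable.Map[String, List[String => Unit]]() // path -> callbacks

  // A client subscribing to a notebook path sees every subsequent edit live.
  def subscribe(path: String)(callback: String => Unit): Unit =
    subs(path) = callback :: subs.getOrElse(path, Nil)

  // Publishing an edit updates the in-memory tree and fans out to subscribers.
  def publish(path: String, content: String): Unit = {
    values(path) = content
    subs.getOrElse(path, Nil).foreach(cb => cb(content))
  }
}

// Usage: a second session subscribed to the same notebook sees the edit live.
val store = new TreeStore
store.subscribe("/workspace/notebook-1")(edit => println(s"live edit: $edit"))
store.publish("/workspace/notebook-1", "cell 1: display(df)")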
PART IV:
Lessons
Lessons
• Loose coupling is necessary but hard
– Narrow, well-defined APIs, backwards compatibility, upgrades (one technique sketched below)
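One way to read the backwards-compatibility lesson in code; this is an illustrative sketch, and the trait and method names are assumptions, not Databricks' actual interfaces:

trait ClusterApiV1 {
  def launch(name: String, workers: Int): String // returns a cluster id
}

// An upgrade widens the contract without breaking existing callers:
// the narrow v1 call delegates to v2 with a safe default.
trait ClusterApiV2 extends ClusterApiV1 {
  def launch(name: String, workers: Int, useSpot: Boolean): String
  override def launch(name: String, workers: Int): String =
    launch(name, workers, useSpot = false)
}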
E-mail <[email protected]>