
Databricks
Building and Operating a Big Data Service Based on Apache Spark

Ali Ghodsi <[email protected]>


Cloud Computing and Big Data

• Three major trends


– Computers not getting any faster
– More people connected to the Internet
– More devices collecting data

• Computation moving to the cloud


The Dawn of Big Data

• Most companies collect lots of data


– Cheap storage (hardware, software)

• Everyone is hoping to extract insights


– Great examples (Netflix, Uber, Ebay)

• Big Data is Hard!


Big Data is Hard

• Compute the average of 1,000 integers

• Compute the average of 10 terabytes of integers


Goal: Make Big Data Simple
The Challenges of Data Science

• Build a cluster
• Import and explore data with different tools
• Build and deploy data applications

[Diagram: data pipeline with stages ETL, Data Warehousing, Data Exploration, Advanced Analytics, Production Deployment, and Dashboards & Reports]
Databricks is an End-to-End Solution

• Automatically managed clusters
• Single tool for ingest, exploration, advanced analytics, production, visualization
– Notebooks & visualization
– Built-in libraries
– Diverse data source connectors
– Job scheduler
– Real-time query engine
– 3rd party apps
• Short time to value

[Diagram: the same pipeline stages (ETL, Data Warehousing, Data Exploration, Advanced Analytics, Production Deployment, Dashboards & Reports) covered end-to-end by Databricks]
Databricks in a nutshell
Talk outline
• Apache Spark
– ETL, interactive queries, streaming, machine learning

• Cluster and Cloud Management


– Operating thousands of machines in the cloud

• Interactive Workspace
– Notebook environment, Collaboration, Visualization, Versioning, ACLs

• Lessons
– Lessons in building a large scale distributed system in the cloud
PART I:
Apache Spark
What we added to Spark
Apache Spark

• Resilient Distributed Datasets (RDDs) as core abstraction
– Collection of objects
– Like a LinkedList<MyObject>

[Diagram: a single collection holding elements 1 … 12]

• Spark RDDs are distributed
– RDD collections are partitioned
– RDD partitions can be cached
– RDD partitions can be recomputed

[Diagram: the same elements 1 … 12 split across partitions on different machines]
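
A minimal sketch of this (not from the original slides; assumes a live SparkContext named sc):

// Split a 12-element collection into 3 partitions; cache() keeps the
// partitions in memory, and a lost partition can be recomputed from lineage.
val rdd = sc.parallelize(1 to 12, numSlices = 3)
rdd.cache()
println(rdd.partitions.length)   // 3
println(rdd.sum())               // 78.0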
RDDs continued

• RDDs can be composed
– All RDDs are initially derived from a data source
– RDDs can be created from other RDDs (e.g. mapping elements 1 … 12 to 2 … 24)
– Two basic operations: map & reduce
– Many other operators: join, filter, union, etc.

val text = sc.textFile("s3://my-bucket/wikipedia")
val words = text.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val result = pairs.reduceByKey((a, b) => a + b)
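
For example, assuming the snippet above has run, a few of the resulting (word, count) pairs can be printed with:

result.take(3).foreach(println)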
Spark Libraries on top of RDDs
• SQL (Spark SQL)
– Full Hive SQL support with UDFs, UDAFs, etc.
– how: internally keep RDDs of row objects (or RDDs of column segments)

• Machine Learning (MLlib)
– Library of machine learning algorithms
– how: cache an RDD, repeatedly iterate over it

• Streaming (Spark Streaming)
– Streaming of real-time data
– how: series of RDDs, each containing a few seconds of real-time data

• Graph Processing (GraphX)
– Iterative computation on graphs (e.g. social networks)
– how: RDD of Tuple<Vertex, Edge, Vertex> and perform self joins

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX layered on top of Spark Core]
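
As a rough sketch of the "series of RDDs" idea (not from the talk; assumes a text stream on localhost:9999 and an existing SparkContext sc), a Spark Streaming word count looks like:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))     // one RDD per 2-second batch
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                      // runs on each batch RDD
ssc.start()
ssc.awaitTermination()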
Unifying Libraries
• Early user feedback
– Different use cases for R, Python, Scala, Java, SQL
– How to intermix and go across these?

• Explosion of R Data Frames and Python Pandas


– DataFrame is a table
– Many procedural operations
– Ideal for dealing with semi-structured data

• Problem
– Not declarative, hard to optimize
– Eagerly executes command by command
– Language specific (R dataframes, Pandas)
Unifying Libraries
• Early user feedback
– Different use cases for R, Python, Scala, Java, SQL
– How to intermix and go across these?

• Explosion of R Data Frames and Python Pandas
– DataFrame is a table
– Many procedural operations
– Ideal for dealing with semi-structured data

• Problem
– Not declarative, hard to optimize
– Eagerly executes command by command
– Language specific (R dataframes, Pandas)

Common performance problem in Spark:
val pairs = words.map(word => (word, 1))
val grouped = pairs.groupByKey()
val counts = grouped.map { case (key, values) => (key, values.sum) }
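
The usual remedy, already used in the earlier word-count example, is to let Spark combine values per key before shuffling:

// reduceByKey aggregates map-side, so the full per-key value lists
// that groupByKey materializes are never built
val counts = pairs.reduceByKey(_ + _)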
Spark Data Frames
• Procedural DataFrames vs declarative SQL
– Two different approaches

• Developed DataFrames for Spark


– DataFrames situated above the SQL optimizer
– DataFrame operations available in R, Python, Scala, Java
– SQL operations return DataFrames
users = context.sql("select * from users")    # SQL
young = users.filter(users.age < 21)          # Python
young.groupBy("gender").count()

tokenizer = Tokenizer(inputCol="name", outputCol="words")    # ML
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(young)                   # model
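
For comparison, a sketch of the same DataFrame operations in Scala (assumes a SQLContext named sqlContext and a registered users table; not from the original slides):

val users = sqlContext.sql("select * from users")
val young = users.filter(users("age") < 21)
young.groupBy("gender").count().show()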
Proliferation of Data Solutions
• Customers already run a slew of data management systems
– MySQL category, Cassandra category, S3 category, HDFS category
– ETL all data over to Databricks?

• We added Spark Data Source API


– Open APIs for implementing your own data source
– Examples: CSV, JDBC, Parquet/Avro, ElasticSearch, RedShift, Cassandra

• Features
– Pushdown of predicates, aggregations, column pruning
– Locality information
– User Defined Types (UDTs), e.g. vectors
Proliferation of Data Solutions
• Customers already run a slew of data management systems
– MySQL category, Cassandra category, S3 category, HDFS category
– ETL all data over to Databricks?

• We added Spark Data Source API
– Open APIs for implementing your own data source
– Examples: CSV, JDBC, Parquet/Avro, ElasticSearch, RedShift, Cassandra

• Features
– Pushdown of predicates, aggregations, column pruning
– Locality information
– User Defined Types (UDTs), e.g. vectors

UDT example:
class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)))

  def serialize(p: Point) = Row(p.x, p.y)

  def deserialize(r: Row) =
    Point(r.getDouble(0), r.getDouble(1))
}
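
To illustrate what a custom data source looks like against the Spark 1.x sources API, here is a minimal, hypothetical relation exposing a two-row table; the class names DefaultSource and DummyRelation are made up for the example:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point resolved from the package name passed to .format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A full-table scan returning an RDD[Row]; predicate pushdown would
// implement PrunedFilteredScan instead of TableScan.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
}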
Modern Spark Architecture

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX layered on top of Spark Core]

Modern Spark Architecture

[Diagram: a DataFrames layer above Spark SQL, Spark Streaming, MLlib, and GraphX, all on Spark Core, reading from pluggable Data Sources such as {JSON}]
Databricks as just-in-time Datawarehouse

• Traditional datawarehouse
– Every night ETL all relevant data to a warehouse
– Precompute cubes of fact tables
– Slow, costly, poor recency

• Spark JIT datawarehouse


– Switzerland of storage: NoSQL, SQL, cloud, …
– Storage remains the source of truth
– Spark used to directly read and cache data

[Diagram: DataFrames over Spark SQL, Spark Streaming, MLlib, and GraphX on Spark Core, reading from Data Sources such as {JSON}]
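
A sketch of the just-in-time idea in Spark 1.x (the path and table name are made up): read directly from the source of truth, cache it, and query it in place.

val events = sqlContext.read.format("json").load("s3://my-bucket/events/")
events.cache()                                   // data stays at the source; Spark caches on demand
events.registerTempTable("events")
sqlContext.sql("select count(*) from events").show()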
PART II:
Cluster Management
Spark as a Service in the Cloud

• Experience with Mesos, YARN, …


– Use off-the-shelf cluster manager?

• Problems
– Existing cluster managers were not cloud-aware
Cloud-Aware Cluster Management
• Instance manager
– Responsible for acquiring machines from cloud provider

• Resource manager
– Schedule and configure isolated containers on machine instances

• Spark cluster manager


– Monitor and set up Spark clusters

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


Databricks Instance Manager
Instance manager’s job is to manage machine instances

• Pluggable cloud providers


– General interface that can be plugged in with AWS, …
– Availability management (AZ, 1h), configuration management (VPCs)

• Fault-handling
– Terminated or slow instances, spot price hikes
– Seamlessly replace machines

• Payment management
– Bid for spot instances, monitor their price
– Record cluster usage for the payment system

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


Databricks Resource Manager

Resource manager’s job is to multiplex tenants on instances

• Isolates tenants using container technology


– Manages multiple versions of Spark
– Configures firewall rules, filters traffic

• Provides fast SSD/in-memory caching across containers


– ramdisk for a fast in-memory cache, mmap to access from Spark JVM
– Bind-mount into containers for shared in-memory cache

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]
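
One way to picture the ramdisk/mmap sharing (a sketch, not the Databricks code; the /dev/shm path is illustrative): a cache file on a ramdisk that is bind-mounted into the container can be memory-mapped from the Spark JVM, so reads hit memory rather than disk.

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Map a cached block placed on the shared ramdisk and read it in place
val file = new RandomAccessFile("/dev/shm/spark-cache/block_0", "r")
val buf  = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())
val firstByte = buf.get(0)
file.close()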


Databricks Spark Cluster Manager

Spark CM’s job is to set up Spark clusters and multiplex REPLs

• Setting up Spark clusters


– Currently using Spark Standalone mode
– Dynamic resizing of clusters based on load (wip)

• Multiplexing of multiple REPLs


– Many interactive REPLs/notebooks on the same Spark cluster
– ClassLoader isolation and library management

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


PART III:
Interactive Workspace
Collaborative Workspace

• Problem
– Real time collaboration on notebooks
– Version control of notebooks
– Access control on notebooks
Pub/sub-based TreeStore
• Web application server
– Stores an in-memory representation of Databricks workspace

• TreeStore is a directory service + a pub-sub service


– In-memory tree structure representing:
directories, notebooks, commands, results
– Browsers subscribe to subtrees and get notifications on updates
– Special handler sends delta-updates over web sockets

• Usage
– Subscribe to a notebook, see live edits of notebook
– Used to create a collaborative environment
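
A minimal sketch of the pub/sub-tree idea (illustrative only, not the Databricks implementation): subscribers register on a subtree path and receive delta updates for any write under it.

import scala.collection.mutable

object TreeStoreSketch {
  type Path = List[String]                        // e.g. List("workspace", "notebook1", "cmd3")
  case class Delta(path: Path, newValue: String)  // simplified delta update

  class TreeStore {
    private val values      = mutable.Map.empty[Path, String]
    private val subscribers = mutable.Map.empty[Path, mutable.Buffer[Delta => Unit]]

    // A browser session subscribes to the subtree rooted at `root`
    def subscribe(root: Path)(handler: Delta => Unit): Unit =
      subscribers.getOrElseUpdate(root, mutable.Buffer.empty) += handler

    // A write updates the tree and notifies every subscriber whose root prefixes the path
    def update(path: Path, value: String): Unit = {
      values(path) = value
      for ((root, handlers) <- subscribers if path.startsWith(root); h <- handlers)
        h(Delta(path, value))
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new TreeStore
    store.subscribe(List("workspace", "notebook1")) { d =>
      println(s"live edit at ${d.path.mkString("/")}: ${d.newValue}")
    }
    store.update(List("workspace", "notebook1", "cmd3"), "select * from users")
  }
}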
PART IV:
Lessons
Lessons
• Loose coupling necessary but hard
– Narrow well-defined APIs, backwards compatibility, upgrades

• State management very hard at scale


– Legacy state: databases, configurations, machines, data formats…

• Cloud software development is superior


– Two week sprints, two week releases, SCRUM …

• Testing is key for evolution and scale


– Step-wise refinement for extension, testing pyramid 70/20/10

• Combine bottom-up with top-down approach


– Top-down for quick results, bottom-up for modularity/reuse
Thank you & Questions
Databricks is hiring, taking interns, …

E-mail <[email protected]>
