
Databricks
Building and Operating a Big Data Service Based on Apache Spark

Ali Ghodsi <[email protected]>


Cloud Computing and Big Data

• Three major trends


– Computers not getting any faster
– More people connected to the Internet
– More devices collecting data

• Computation moving to the cloud


The Dawn of Big Data

• Most companies collect lots of data


– Cheap storage (hardware, software)

• Everyone is hoping to extract insights


– Great examples (Netflix, Uber, Ebay)

• Big Data is Hard!


Big Data is Hard

• Compute the average of 1,000 integers

• Compute the average of 10 terabytes of integers


Goal: Make Big Data Simple
The Challenges of Data Science

• Build a cluster
• Import and explore data with different tools
• Build and deploy data applications

[Diagram: data pipeline with stages ETL, Data Warehousing, Data Exploration, Advanced Analytics, Production Deployment, and Dashboards & Reports]
Databricks is an End-to-End Solution

• Automatically managed clusters
• Single tool for ingest, exploration, advanced analytics, production, visualization
– Notebooks & visualization
– Built-in libraries
– Diverse data source connectors
– Job scheduler
– Real-time query engine
– 3rd party apps
• Short time to value

[Diagram: the same pipeline stages (ETL, Data Warehousing, Data Exploration, Advanced Analytics, Production Deployment, Dashboards & Reports) covered end-to-end by Databricks]
Databricks in a nutshell
Talk outline
• Apache Spark
– ETL, interactive queries, streaming, machine learning

• Cluster and Cloud Management


– Operating thousands of machines in the cloud

• Interactive Workspace
– Notebook environment, Collaboration, Visualization, Versioning, ACLs

• Lessons
– Lessons in building a large scale distributed system in the cloud
PART I:
Apache Spark
What we added to Spark
Apache Spark

• Resilient Distributed Datasets (RDDs) as core abstraction
– Collection of objects
– Like a LinkedList<MyObject>

[Diagram: a single collection holding elements 1 … 12]

• Spark RDDs are distributed
– RDD collections are partitioned
– RDD partitions can be cached
– RDD partitions can be recomputed

[Diagram: the same elements 1 … 12 split across partitions on different machines]
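
A minimal sketch of this (not from the original slides; assumes a live SparkContext named sc):

// Split a 12-element collection into 3 partitions; cache() keeps the
// partitions in memory, and a lost partition can be recomputed from lineage.
val rdd = sc.parallelize(1 to 12, numSlices = 3)
rdd.cache()
println(rdd.partitions.length)   // 3
println(rdd.sum())               // 78.0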
RDDs continued

• RDDs can be composed
– All RDDs are initially derived from a data source
– RDDs can be created from other RDDs (e.g. mapping elements 1 … 12 to 2 … 24)
– Two basic operations: map & reduce
– Many other operators: join, filter, union, etc.

val text = sc.textFile("s3://my-bucket/wikipedia")
val words = text.flatMap(line => line.split(" "))
val pairs = words.map(word => (word, 1))
val result = pairs.reduceByKey((a, b) => a + b)
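
For example, assuming the snippet above has run, a few of the resulting (word, count) pairs can be printed with:

result.take(3).foreach(println)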
Spark Libraries on top of RDDs
• SQL (Spark SQL)
– Full Hive SQL support with UDFs, UDAFs, etc.
– how: internally keep RDDs of row objects (or RDDs of column segments)

• Machine Learning (MLlib)
– Library of machine learning algorithms
– how: cache an RDD, repeatedly iterate over it

• Streaming (Spark Streaming)
– Streaming of real-time data
– how: series of RDDs, each containing a few seconds of real-time data

• Graph Processing (GraphX)
– Iterative computation on graphs (e.g. social networks)
– how: RDD of Tuple<Vertex, Edge, Vertex> and perform self joins

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX layered on top of Spark Core]
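
As a rough sketch of the "series of RDDs" idea (not from the talk; assumes a text stream on localhost:9999 and an existing SparkContext sc), a Spark Streaming word count looks like:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))     // one RDD per 2-second batch
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                      // runs on each batch RDD
ssc.start()
ssc.awaitTermination()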
Unifying Libraries
• Early user feedback
– Different use cases for R, Python, Scala, Java, SQL
– How to intermix and go across these?

• Explosion of R Data Frames and Python Pandas


– DataFrame is a table
– Many procedural operations
– Ideal for dealing with semi-structured data

• Problem
– Not declarative, hard to optimize
– Eagerly executes command by command
– Language specific (R dataframes, Pandas)
Unifying Libraries
• Early user feedback
– Different use cases for R, Python, Scala, Java, SQL
– How to intermix and go across these?

• Explosion of R Data Frames and Python Pandas
– DataFrame is a table
– Many procedural operations
– Ideal for dealing with semi-structured data

• Problem
– Not declarative, hard to optimize
– Eagerly executes command by command
– Language specific (R dataframes, Pandas)

Common performance problem in Spark:
val pairs = words.map(word => (word, 1))
val grouped = pairs.groupByKey()
val counts = grouped.map { case (key, values) => (key, values.sum) }
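
The usual remedy, already used in the earlier word-count example, is to let Spark combine values per key before shuffling:

// reduceByKey aggregates map-side, so the full per-key value lists
// that groupByKey materializes are never built
val counts = pairs.reduceByKey(_ + _)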
Spark Data Frames
• Procedural DataFrames vs declarative SQL
– Two different approaches

• Developed DataFrames for Spark


– DataFrames situated above the SQL optimizer
– DataFrame operations available in R, Python, Scala, Java
– SQL operations return DataFrames
users = context.sql("select * from users")    # SQL
young = users.filter(users.age < 21)          # Python
young.groupBy("gender").count()

tokenizer = Tokenizer(inputCol="name", outputCol="words")    # ML
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(young)                   # model
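
For comparison, a sketch of the same DataFrame operations in Scala (assumes a SQLContext named sqlContext and a registered users table; not from the original slides):

val users = sqlContext.sql("select * from users")
val young = users.filter(users("age") < 21)
young.groupBy("gender").count().show()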
Proliferation of Data Solutions
• Customers already run a slew of data management systems
– MySQL category, Cassandra category, S3 category, HDFS category
– ETL all data over to Databricks?

• We added Spark Data Source API


– Open APIs for implementing your own data source
– Examples: CSV, JDBC, Parquet/Avro, ElasticSearch, RedShift, Cassandra

• Features
– Pushdown of predicates, aggregations, column pruning
– Locality information
– User Defined Types (UDTs), e.g. vectors
Proliferation of Data Solutions
• Customers already run a slew of data management systems
– MySQL category, Cassandra category, S3 category, HDFS category
– ETL all data over to Databricks?

• We added Spark Data Source API
– Open APIs for implementing your own data source
– Examples: CSV, JDBC, Parquet/Avro, ElasticSearch, RedShift, Cassandra

• Features
– Pushdown of predicates, aggregations, column pruning
– Locality information
– User Defined Types (UDTs), e.g. vectors

UDT example:
class PointUDT extends UserDefinedType[Point] {
  def dataType = StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)))

  def serialize(p: Point) = Row(p.x, p.y)

  def deserialize(r: Row) =
    Point(r.getDouble(0), r.getDouble(1))
}
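
To illustrate what a custom data source looks like against the Spark 1.x sources API, here is a minimal, hypothetical relation exposing a two-row table; the class names DefaultSource and DummyRelation are made up for the example:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point resolved from the package name passed to .format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A full-table scan returning an RDD[Row]; predicate pushdown would
// implement PrunedFilteredScan instead of TableScan.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
}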
Modern Spark Architecture

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX layered on top of Spark Core]

Modern Spark Architecture

[Diagram: a DataFrames layer above Spark SQL, Spark Streaming, MLlib, and GraphX, all on Spark Core, reading from pluggable Data Sources such as {JSON}]
Databricks as just-in-time Datawarehouse

• Traditional datawarehouse
– Every night ETL all relevant data to a warehouse
– Precompute cubes of fact tables
– Slow, costly, poor recency

• Spark JIT datawarehouse


– Switzerland of storage: NoSQL, SQL, cloud, …
– Storage remains the source of truth
– Spark used to directly read and cache data

[Diagram: DataFrames over Spark SQL, Spark Streaming, MLlib, and GraphX on Spark Core, reading from Data Sources such as {JSON}]
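
A sketch of the just-in-time idea in Spark 1.x (the path and table name are made up): read directly from the source of truth, cache it, and query it in place.

val events = sqlContext.read.format("json").load("s3://my-bucket/events/")
events.cache()                                   // data stays at the source; Spark caches on demand
events.registerTempTable("events")
sqlContext.sql("select count(*) from events").show()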
PART II:
Cluster Management
Spark as a Service in the Cloud

• Experience with Mesos, YARN, …


– Use off-the-shelf cluster manager?

• Problems
– Existing cluster managers were not cloud-aware
Cloud-Aware Cluster Management
• Instance manager
– Responsible for acquiring machines from cloud provider

• Resource manager
– Schedule and configure isolated containers on machine instances

• Spark cluster manager


– Monitor and set up Spark clusters

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


Databricks Instance Manager
Instance manager’s job is to manage machine instances

• Pluggable cloud providers


– General interface that can be plugged in with AWS, …
– Availability management (AZ, 1h), configuration management (VPCs)

• Fault-handling
– Terminated or slow instances, spot price hikes
– Seamlessly replace machines

• Payment management
– Bid for spot instances, monitor their price
– Record cluster usage for the payment system

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


Databricks Resource Manager

Resource manager’s job is to multiplex tenants on instances

• Isolates tenants using container technology


– Manages multiple versions of Spark
– Configures firewall rules, filters traffic

• Provides fast SSD/in-memory caching across containers


– ramdisk for a fast in-memory cache, mmap to access from Spark JVM
– Bind-mount into containers for shared in-memory cache

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]
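
One way to picture the ramdisk/mmap sharing (a sketch, not the Databricks code; the /dev/shm path is illustrative): a cache file on a ramdisk that is bind-mounted into the container can be memory-mapped from the Spark JVM, so reads hit memory rather than disk.

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Map a cached block placed on the shared ramdisk and read it in place
val file = new RandomAccessFile("/dev/shm/spark-cache/block_0", "r")
val buf  = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())
val firstByte = buf.get(0)
file.close()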


Databricks Spark Cluster Manager

Spark CM’s job is to set up Spark clusters and multiplex REPLs

• Setting up Spark clusters


– Currently using Spark Standalone mode
– Dynamic resizing of clusters based on load (wip)

• Multiplexing of multiple REPLs


– Many interactive REPLs/notebooks on the same Spark cluster
– ClassLoader isolation and library management

[Diagram: Instance Manager, Resource Manager, and Spark Cluster Manager, all inside the Databricks Cluster Manager]


PART III:
Interactive Workspace
Collaborative Workspace

• Problem
– Real time collaboration on notebooks
– Version control of notebooks
– Access control on notebooks
Pub/sub-based TreeStore
• Web application server
– Stores an in-memory representation of Databricks workspace

• TreeStore is a directory service + a pub-sub service


– In-memory tree structure representing:
directories, notebooks, commands, results
– Browsers subscribe to subtrees and get notifications on updates
– Special handler sends delta-updates over web sockets

• Usage
– Subscribe to a notebook, see live edits of notebook
– Used to create a collaborative environment
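
A minimal sketch of the pub/sub-tree idea (illustrative only, not the Databricks implementation): subscribers register on a subtree path and receive delta updates for any write under it.

import scala.collection.mutable

object TreeStoreSketch {
  type Path = List[String]                        // e.g. List("workspace", "notebook1", "cmd3")
  case class Delta(path: Path, newValue: String)  // simplified delta update

  class TreeStore {
    private val values      = mutable.Map.empty[Path, String]
    private val subscribers = mutable.Map.empty[Path, mutable.Buffer[Delta => Unit]]

    // A browser session subscribes to the subtree rooted at `root`
    def subscribe(root: Path)(handler: Delta => Unit): Unit =
      subscribers.getOrElseUpdate(root, mutable.Buffer.empty) += handler

    // A write updates the tree and notifies every subscriber whose root prefixes the path
    def update(path: Path, value: String): Unit = {
      values(path) = value
      for ((root, handlers) <- subscribers if path.startsWith(root); h <- handlers)
        h(Delta(path, value))
    }
  }

  def main(args: Array[String]): Unit = {
    val store = new TreeStore
    store.subscribe(List("workspace", "notebook1")) { d =>
      println(s"live edit at ${d.path.mkString("/")}: ${d.newValue}")
    }
    store.update(List("workspace", "notebook1", "cmd3"), "select * from users")
  }
}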
PART IV:
Lessons
Lessons
• Loose coupling necessary but hard
– Narrow well-defined APIs, backwards compatibility, upgrades

• State management very hard at scale


– Legacy state: databases, configurations, machines, data formats…

• Cloud software development is superior


– Two week sprints, two week releases, SCRUM …

• Testing is key for evolution and scale


– Step-wise refinement for extension, testing pyramid 70/20/10

• Combine bottom-up with top-down approach


– Top-down for quick results, bottom-up for modularity/reuse
Thank you & Questions
Databricks is hiring, taking interns, …

E-mail <[email protected]>
