Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.

Spark’s Role in the Big Data Ecosystem
Matei Zaharia

An Exciting Year for Spark
Very fast community growth
1.0 release in May
7+ distributors, 20+ apps

Project Activity
                          June 2013    June 2014
total contributors               68          255
companies contributing           17           50
total lines of code          63,000      175,000
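
The year-over-year figures above can be turned into growth multiples with a little arithmetic. The sketch below uses only the numbers from the slide; the `growth_factor` helper and the dictionary layout are illustrative names, not part of the talk.

```python
# Figures copied from the Project Activity slide (June 2013 vs. June 2014).
activity = {
    "total contributors": (68, 255),
    "companies contributing": (17, 50),
    "total lines of code": (63_000, 175_000),
}


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


growth = {metric: growth_factor(*values) for metric, values in activity.items()}

for metric, factor in growth.items():
    print(f"{metric}: {factor:.2f}x")
```

By this calculation contributors grew 3.75x, contributing companies about 2.94x, and lines of code about 2.78x over the year.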

Compared to Other Projects
[Charts: commits and lines of code changed in the past 6 months for MapReduce, YARN, HDFS, Storm, and Spark]
Spark is now the most active project in the Hadoop ecosystem

Compared to Other Projects
Spark is one of the top 3 most active projects at Apache
More active than “general” data processing projects like NumPy, matplotlib, scikit-learn

Continuing Growth
[Chart: contributors per month to Spark; source: ohloh.net]

Major new additions

Last Summit
Last Summit we said we’d focus on two things:
• Standard libraries
• Enterprise features
New libraries: Spark SQL, MLlib (machine learning), GraphX (graph processing)
Enterprise features: security, monitoring, HA

Spark SQL
Enables loading & querying structured data in Spark
From Hive:
    from pyspark.sql import HiveContext
    c = HiveContext(sc)  # sc is an existing SparkContext
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()
From JSON:
    c.jsonFile("tweets.json").registerAsTable("tweets")
    c.sql("select text, user.name from tweets")
tweets.json:
    {"text": "hi",
     "user": {
       "name": "matei",
       "id": 123
     }}
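
To make the nested-field query concrete, here is a minimal plain-Python sketch of what selecting text and user.name from the sample record computes. It is not how Spark SQL is implemented, and the helper names `get_path` and `select` are hypothetical; it only illustrates resolving a dotted column name against nested JSON.

```python
import json

# The sample record shown on the slide as the contents of tweets.json.
TWEETS_JSON = '{"text": "hi", "user": {"name": "matei", "id": 123}}'


def get_path(record, dotted_path):
    """Follow a dotted path such as "user.name" through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


def select(records, columns):
    """Project each record onto the requested (possibly nested) columns."""
    return [tuple(get_path(r, c) for c in columns) for r in records]


tweets = [json.loads(TWEETS_JSON)]
result = select(tweets, ["text", "user.name"])
print(result)  # [('hi', 'matei')]
```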

Spark SQL
Integrates closely with Spark’s language APIs
    c.registerFunction("hasSpark", lambda text: "Spark" in text)
    c.sql("select * from tweets where hasSpark(text)")
Uniform interface for data access
Data sources: Hive, Parquet, JSON, Cassandra, …
Languages: SQL, Python, Scala, Java
44 contributors in past year
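
The hasSpark example registers a Python function under a name and then uses that name inside a query. A rough plain-Python analogy, assuming a hypothetical in-memory table and a dictionary standing in for the function registry, might look like this. It is a sketch of the idea only, not Spark's mechanism.

```python
# Hypothetical in-memory stand-ins for the "tweets" table and the function registry.
tweets = [
    {"text": "Spark 1.0 is out"},
    {"text": "hello world"},
]
functions = {}


def register_function(name, func):
    """Store a Python callable under a name so a query can refer to it."""
    functions[name] = func


def where(records, func_name, column):
    """Keep the records for which the named function returns a truthy value."""
    predicate = functions[func_name]
    return [r for r in records if predicate(r[column])]


register_function("hasSpark", lambda text: "Spark" in text)
matches = where(tweets, "hasSpark", "text")
print(matches)  # [{'text': 'Spark 1.0 is out'}]
```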

Machine Learning Library (MLlib)
Standard library of machine learning algorithms
Now includes 15+ algorithms
• New in 1.0: decision trees, SVD, PCA, L-BFGS
• In development: non-negative matrix factorization, LDA, Lanczos, multiclass trees, ADMM
    from pyspark.mllib.clustering import KMeans
    points = context.sql("select latitude, longitude from tweets")
    model = KMeans.train(points, 10)  # cluster the points into 10 groups
40 contributors in past year
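
For readers unfamiliar with what KMeans.train(points, 10) computes, the following is a small single-machine sketch of Lloyd's algorithm in plain Python. The sample points and the choice to seed the centers with the first k points are assumptions made for a deterministic illustration; MLlib's actual initialization and distributed execution are not reproduced here.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm on 2-D points; returns the k cluster centers.

    Assumption for illustration: the first k points seed the centers,
    which keeps the run deterministic.
    """
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points.
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers


# Made-up (latitude, longitude) pairs forming two obvious groups.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers = kmeans(points, 2)
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```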

Java 8 API
Enables concise programming in Java similar to Scala and Python
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);
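
Since the slide compares the Java 8 lambdas with Python, here is the same map-then-reduce shape in plain Python. It runs over a local list rather than an RDD, and the sample lines are a hypothetical stand-in for data.txt.

```python
from functools import reduce

# Hypothetical stand-in for the lines of data.txt.
lines = ["spark", "summit", "2014"]

# map: one length per line; reduce: add the lengths pairwise.
line_lengths = list(map(lambda s: len(s), lines))
total_length = reduce(lambda a, b: a + b, line_lengths)

print(line_lengths, total_length)  # [5, 6, 4] 15
```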

What is our vision for Spark?

1. Unified Platform for Big Data Apps
Uniform API for diverse workloads over diverse storage systems and runtimes
Workloads: Batch, Interactive, Streaming, …
Storage systems and runtimes: Hadoop, Cassandra, Mesos, cloud providers, …

Why a Platform Matters
Good for developers: one system to learn
Good for users: take apps anywhere
Good for distributors: more applications

2. Standard Library for Big Data
Big data apps lack libraries of common algorithms
Spark’s generality + support for multiple languages make it suitable to offer this
Stack: languages (Python, Scala, Java, R) over libraries (SQL, ML, graph, …) over Core
Much of future activity will be in these libraries

Databricks & Spark
At Databricks, we are working to keep Spark 100% open source and compatible across vendors
All our work on Spark is at Apache
Check out project-specific talks to see what’s next!
