Scaling Big Data, a Spark perspective
Reynold Xin
@rxin
2016-07-09 Big Data LA
Scaling Big Data
Early adopters
Data Scientists
Statisticians
Physicists
R users
PyData
…
Citizen data scientists
Sophisticated engineering teams
Spark Philosophy
1. Unified engine
Support end-to-end applications
2. High-level APIs
Easy to use, rich optimizations
3. Integrate broadly
Storage systems, libraries, etc.
SQL, Streaming, ML, Graph, …
Apache Spark 2.0
Next major release, coming out in the next few weeks
• Unstable preview release at spark.apache.org
• 2.0.0-rc2 available on the dev@spark mailing list
Remains highly compatible with Apache Spark 1.x
17k patches (2500 for 2.0) from 1200+ contributors
New in 2.0
Structured API improvements
(DataFrame, Dataset, SparkSession)
Structured Streaming
MLlib model export
R bindings
SQL 2003
Performance improvements
Broader Community
Deep learning libraries
(Baidu, Yahoo!, Berkeley, Databricks)
GraphFrames
PyData integration
Reactive streams
C# bindings: Mobius
JS bindings: EclairJS
Growing the Community
New initiatives from Databricks
The largest challenge in applying big data is the skills gap.
StackOverflow Developer Survey 2016
Massive Open Online Courses
Free 5-course series on big data with Apache Spark
dbricks.co/mooc16
• Introduction to Apache Spark™
• Distributed Machine Learning with Apache Spark™
• Big Data Analysis with Apache Spark™
• Advanced Apache Spark™ for Data Science and Data Engineering
• Advanced Machine Learning with Apache Spark™
Databricks Community Edition
Free version of Databricks with:
• Interactive tutorials
• Apache Spark and popular data science libraries
• Visualization & debug tools
databricks.com/ce
Demo
Link to demo: http://tinyurl.com/big-data-la-2016-demo
2016 Apache Spark Survey
http://tinyurl.com/spark2016survey
Thank you.

Big Data Day LA 2016 Keynote - Reynold Xin/ Databricks
