Disrupting Big Data with
Apache Spark in the Cloud
Ali Ghodsi
Co-Founder and CEO
The Dawn of Advanced Analytics
Watson, Siri/assistants, self-driving cars
Not just sci-fi, important applications for businesses
Analytics Transforming Industries
Predictive Analytics
• Predict Product Revenue
• Customer Assessment
• Targeted Advertising
Anomaly Detection
• Fraud Detection
• Risk Assessment
• Equipment Failure
Data-Driven Real-time Analytics Applications
Today’s Data Reality
HADOOP
DATA LAKES
DATA HUBS
CLOUD STORAGE
DATA WAREHOUSES
Siloed, Fast-Growing Size, Cost
The Analytics Gap
Industrial, Media, Pharma
HADOOP
DATA LAKES
DATA HUBS
CLOUD STORAGE
DATA WAREHOUSES
Siloed, Fast-Growing Size, Cost
Real-time Data-Driven Analytics Applications
Why is there a gap?
Real-time Data-Driven Analytics Applications
1. Manage data infrastructure
• Create, tune, monitor compute clusters.
• Securely access silos of disparate data sources.
• Enforce proper data governance.
2. Empower teams to be productive
• Securely share big data clusters among analysts.
• Interactively explore data and prototype ideas.
• Debug, troubleshoot, version-control big data applications.
3. Establish production-ready applications
• Set up robust data pipelines for ETL/ELT.
• Productionize real-time applications with high availability (HA) and fault tolerance (FT).
• Build, serve, maintain advanced machine learning models.
Siloed, Fast-Growing Size, Cost
Databricks Cloud-Hosted Platform
1. Just-in-Time Data Platform (Agile)
• Separate compute & storage
• Integrate existing data stores
• Efficient cache on first access
2. Integrated Workspace (Democratize Big Data)
• Interactive notebooks, dashboards, reports
• Real-time exploration, machine learning, graph use cases
3. Automated Apache Spark Management (Production-Ready)
• Workflow scheduler for ML, streaming, SQL, ETL
• High-availability, fault-tolerant, performance-optimized
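To make "efficient cache on first access" concrete, here is a minimal read-through cache sketch in plain Python. It is an illustration of the general idea only, not Databricks' implementation: the `ReadThroughCache` class, the `remote_store` dictionary, and the file path are all hypothetical stand-ins for a compute layer reading from separately hosted storage.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Keeps a local copy of each object the first time it is read."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # hypothetical loader for the remote store
        self._cache: Dict[str, bytes] = {}   # local copies, keyed by path
        self.remote_reads = 0                # number of trips to the remote store

    def read(self, path: str) -> bytes:
        # Only the first read of a path goes to the remote store.
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


# Stand-in for an existing data store (assumed, for illustration only).
remote_store = {"sales/2016.csv": b"region,revenue\nwest,10\n"}

cache = ReadThroughCache(lambda path: remote_store[path])
first = cache.read("sales/2016.csv")    # fetched from the remote store
second = cache.read("sales/2016.csv")   # served from the local cache
```

Because the data stays in its original store and is copied only when touched, compute can be started and stopped independently of storage.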
Databricks Just-in-Time Data Platform
YOUR STORAGE: HADOOP / DATA LAKES, DATA WAREHOUSES, CLOUD STORAGE
INTEGRATED WORKSPACE
• DASHBOARDS: reports
• NOTEBOOKS: GitHub, viz, collaboration
• BI TOOLS
JUST-IN-TIME PROCESSING, POWERED BY APACHE SPARK
• CLUSTERS: auto-scaled, resilient, multi-tenant
• DATA INTEGRATION: secure and fast data source integrations
• INTERFACES: REST APIs & BI tools
DATABRICKS SERVICES + YOUR CUSTOM SPARK APPS
PRODUCTION JOBS
DATA LAKE
DATA HUB
The Challenge of Securing Analytics
End-to-end security is a challenge for enterprises
• Secure file management
• Secure table management
• Secure cluster management
• Secure job workflows
• Secure dashboard, report, and notebook management
Today there are piecemeal solutions, but no comprehensive solution
Databricks Enterprise Security (DBES)
Holistic end-to-end security for Data Analytics
Files, Tables, Clusters, Workflows, Notebooks, Dashboards, Reports
DBES provides:
• Role-based access control
• Auditing and governance
• Integrated identity management
• Encryption on disk and on the wire
The First End-to-End Security Solution for Apache Spark
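As a rough illustration of how role-based access control and auditing fit together, the sketch below checks a user's roles against a permission table and records every decision. It is a generic pattern, not the DBES API: the role names, the `ROLE_PERMISSIONS` table, and the `AccessController` class are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical role-to-permission table: (resource type, action) pairs.
ROLE_PERMISSIONS: Dict[str, Set[Tuple[str, str]]] = {
    "analyst": {("table", "read"), ("notebook", "read"), ("notebook", "edit")},
    "admin": {("table", "read"), ("table", "write"), ("cluster", "manage")},
}


@dataclass
class AccessController:
    user_roles: Dict[str, Set[str]]
    # Every decision is appended here: (user, resource, action, allowed).
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def is_allowed(self, user: str, resource: str, action: str) -> bool:
        roles = self.user_roles.get(user, set())
        allowed = any(
            (resource, action) in ROLE_PERMISSIONS.get(role, set())
            for role in roles
        )
        # Log both grants and denials so the trail is complete.
        self.audit_log.append((user, resource, action, allowed))
        return allowed


acl = AccessController(user_roles={"dana": {"analyst"}})
can_read = acl.is_allowed("dana", "table", "read")        # analysts may read tables
can_manage = acl.is_allowed("dana", "cluster", "manage")  # but not manage clusters
```

Applying the same check to files, tables, clusters, workflows, and notebooks is what gives one consistent policy across the resource types listed above.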
Enterprise use-cases
Preventing credit card fraud
Predicting energy demand based on massive weather data
Predicting player churn and network outages
Natural language processing to extract author graph
Generating tailored programs based on big data
Thank you.
Try Apache Spark with Databricks
http://databricks.com/try
Try the latest version of Apache Spark and a preview of Spark 2.0

Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
