Azure Databricks is a collaborative analytics platform powered by Apache Spark, designed for big data processing and integrated with Azure services. It offers features such as secure collaboration, fine-grained access control, and a unified workspace for data engineers, scientists, and analysts. The platform enhances productivity with one-click setup, serverless infrastructure, and optimized performance for large-scale data processing.


Azure Databricks

An Introduction

Bryan Cafferky
Technical Solutions Professional
BIG DATA & ADVANCED ANALYTICS AT A GLANCE

[Architecture diagram] Pipeline stages: Ingest → Store → Prep & Train → Model & Serve → Intelligence

 Ingest: business apps, custom apps, sensors and devices; Event Hub, IoT Hub, Kafka on HDInsight; Data Factory (data movement, pipelines & orchestration)
 Store: Blob Storage, Data Lake Store
 Prep & Train: Databricks, HDInsight, Data Lake Analytics, Machine Learning (ML Workbench and Services)
 Model & Serve: Cosmos DB, SQL Database, SQL Data Warehouse, Analysis Services
 Intelligence: predictive apps, operational reports, analytical dashboards; collaboration portal
Azure Databricks
Powered by Apache Spark
APACHE SPARK
A unified, open-source, parallel data processing framework for Big Data analytics

Spark unifies:
 Batch processing
 Interactive SQL
 Real-time processing
 Machine learning
 Deep learning
 Graph processing

[Stack diagram] Libraries on top of the Spark Core Engine:
• Spark SQL (interactive queries)
• Spark MLlib (machine learning)
• Spark Structured Streaming (stream processing)
• Spark GraphX (graph computation)
Cluster managers: Standalone Scheduler, YARN, Mesos
DATABRICKS – COMPANY OVERVIEW

 Founded in late 2013
 By the creators of Apache Spark, the original team from UC Berkeley AMPLab
 Largest code contributor to Apache Spark
 Level 2/3 support partnership with
• Hortonworks
• MapR
• DataStax
 Provides certifications such as Databricks Certified Application, Databricks Certified Distribution, and Databricks Certified Developer
 Main product: the Unified Analytics Platform
 In Oct 2017, introduced Databricks Delta (currently in private preview)
AZURE DATABRICKS

 Azure Databricks is a first-party service on Azure.
• Unlike on other clouds, it is not an Azure Marketplace or a 3rd-party hosted service.
 Azure Databricks is integrated seamlessly with Azure services:
• Azure Portal: the service can be launched directly from the Azure Portal
• Azure Storage services: directly access data in Azure Blob Storage and Azure Data Lake Store
• Azure Active Directory: for user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure
• Azure SQL DW and Azure Cosmos DB: enable you to combine structured and unstructured data for analytics
• Apache Kafka for HDInsight: enables you to use Kafka as a streaming data source or sink
• Azure billing: you get a single bill from Azure
• Power BI: for rich data visualization
 Eliminates the need to create a separate account with Databricks.
AZURE DATABRICKS

[Platform diagram] Azure Databricks:
 Collaborative Workspace: data engineers, data scientists, and business analysts share machine learning models
 Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notifications & logs
 Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless
Inputs and outputs: IoT / streaming data, cloud storage, Hadoop storage; BI tools, data warehouses, data exports, REST APIs
Enhance productivity · Build on secure & trusted cloud · Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE

 The Driver runs the user's 'main' function and executes the various parallel operations on the worker nodes.
 The results of the operations are collected by the driver.
 The worker nodes read and write data from/to data sources, including HDFS.
 Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
 Worker nodes and the driver node execute as VMs in public clouds (AWS, Google, and Azure).

[Diagram] Driver Program (SparkContext) → Cluster Manager → Worker Nodes (cache, tasks) → Data Sources (HDFS, SQL, NoSQL, …)


SECURE COLLABORATION
Azure Databricks enables secure collaboration between colleagues

• With Azure Databricks, colleagues can securely share key artifacts such as clusters, notebooks, jobs, and workspaces.
• Secure collaboration is enabled through a combination of:
 Fine-grained permissions: define who can do what on which artifacts (access control)
 AAD-based authentication: ensures that users are actually who they claim to be
AZURE DATABRICKS INTEGRATION WITH AAD
Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users

 There is no need to define users, and their access control, separately in Databricks.
 AAD users can be used directly in Azure Databricks for all user-based access control (clusters, jobs, notebooks, etc.).
 Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
 Notebooks, and their outputs, are stored in the Databricks account. However, AAD-based access control ensures that only authorized users can access them.
DATABRICKS ACCESS CONTROL
Access control can be defined at the user level via the Admin Console.
Access control can be defined for Workspaces, Clusters, Jobs, and REST APIs:

 Workspace Access Control: defines who can view, edit, and run notebooks in their workspace
 Cluster Access Control: allows users to control who can attach to, restart, and manage (resize/delete) clusters; allows admins to specify which users have permission to create clusters
 Jobs Access Control: allows owners of a job to control who can view job results or manage runs of a job (run now/cancel)
 REST API Tokens: allow users to use personal access tokens instead of passwords to access the Databricks REST API
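As a sketch of the token-based access mentioned above, a personal access token is sent as a Bearer credential on each REST call. The workspace URL and token below are placeholders, not real values:

```python
def databricks_auth_headers(token: str) -> dict:
    """Build the headers for a Databricks REST API call using a personal access token."""
    return {"Authorization": f"Bearer {token}"}

# Hypothetical workspace URL and token, for illustration only.
workspace_url = "https://adb-example.azuredatabricks.net"
headers = databricks_auth_headers("dapi-EXAMPLE-TOKEN")

# A real call would then be made with an HTTP client, e.g. with requests:
# requests.get(f"{workspace_url}/api/2.0/clusters/list", headers=headers)
print(headers["Authorization"])  # Bearer dapi-EXAMPLE-TOKEN
```

Tokens can be scoped and revoked independently of the user's password, which is why they are preferred for API access.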
AZURE DATABRICKS CORE ARTIFACTS

[Diagram] Azure Databricks artifacts: Clusters, Libraries, Workspaces, Jobs, Notebooks
Why Spark?

• Open-source data processing engine built around speed, ease of use, and sophisticated analytics

• In-memory engine that is up to 100 times faster than Hadoop

• Largest open-source data project, with 1000+ contributors

• Highly extensible, with support for Scala, Java, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib)
What is Azure Databricks?
A fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure

Best of Databricks + Best of Microsoft:

Designed in collaboration with the founders of Apache Spark

One-click setup; streamlined workflows

Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts

Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)

Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)


Differentiated experience on Azure

ENHANCE PRODUCTIVITY
Get started quickly by launching your new Spark environment with one click.
Share your insights in powerful ways through rich integration with Power BI.
Improve collaboration amongst your analytics team through a unified workspace.
Innovate faster with native integration with the rest of the Azure platform.

BUILD ON THE MOST COMPLIANT CLOUD
Simplify security and identity control with built-in integration with Active Directory.
Regulate access with fine-grained user permissions to Azure Databricks' notebooks, clusters, jobs, and data.
Build with confidence on the trusted cloud backed by unmatched support, compliance, and SLAs.

SCALE WITHOUT LIMITS
Operate at massive scale without limits globally.
Accelerate data processing with the fastest Spark engine.
Collaborative Workspace

GET STARTED IN SECONDS
Single click to launch your new Spark environment.

INTERACTIVE EXPLORATION
Explore data using interactive notebooks with support for multiple programming languages including R, Python, Scala, and SQL.

COLLABORATION
Work on the same notebook in real-time while tracking changes with detailed revision history, GitHub, or Bitbucket.

VISUALIZATIONS
Visualize insights through a wide assortment of point-and-click visualizations, or use powerful scriptable options like matplotlib, ggplot, and D3.

DASHBOARDS
Rich integration with Power BI to discover and share your insights in powerful new ways.
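As a small illustration of the scriptable option mentioned above, a matplotlib chart like those rendered inline in a notebook takes only a few lines. This is a generic sketch, not Databricks-specific, and the sample data is invented:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; a notebook would render inline instead
import matplotlib.pyplot as plt

# Invented sample data for illustration.
days = [1, 2, 3, 4, 5]
events = [120, 180, 150, 210, 260]

fig, ax = plt.subplots()
ax.plot(days, events, marker="o")
ax.set_xlabel("Day")
ax.set_ylabel("Events processed")
ax.set_title("Sample pipeline throughput")
fig.savefig("throughput.png")
```

In a Databricks notebook the same figure object is displayed directly in the cell output, alongside the built-in point-and-click chart types.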
Deploy Production Jobs & Workflows

JOBS SCHEDULER
Execute jobs for production pipelines on a specific schedule.

NOTEBOOK WORKFLOWS
Create multi-stage pipelines with the control structures of the source programming language.

RUN NOTEBOOKS AS JOBS
Turn notebooks or JARs into resilient Spark jobs with a click or an API call.

NOTIFICATIONS AND LOGS
Set up alerts and quickly access audit logs for easy monitoring and troubleshooting.

INTEGRATE NATIVELY WITH AZURE SERVICES
Deep integration with Azure SQL Data Warehouse, Cosmos DB, Azure Data Lake Store, Azure Blob Storage, and Azure Event Hub.
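As a sketch of the "API call" path above, triggering a job run through the Databricks Jobs REST API is just a small JSON payload POSTed to the workspace. The job ID and parameter names here are hypothetical:

```python
import json

# Hypothetical job ID and notebook parameters, for illustration only.
run_request = {
    "job_id": 42,
    "notebook_params": {"input_path": "/mnt/raw/events", "mode": "full"},
}

# A client would POST this body to <workspace-url>/api/2.0/jobs/run-now
# with a personal-access-token Authorization header.
payload = json.dumps(run_request)
print(payload)
```

The same payload shape works from any scheduler or CI system that can issue an authenticated HTTP request, which is what makes notebooks usable as production jobs.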
Optimized Databricks Runtime Engine

OPTIMIZED I/O PERFORMANCE
The Databricks I/O module (DBIO) takes processing speeds to the next level, significantly improving the performance of Spark in the cloud.

FULLY-MANAGED PLATFORM ON AZURE
Reap the benefits of a fully managed service and remove the complexity of big data and machine learning.

SERVERLESS INFRASTRUCTURE
Databricks' serverless and highly elastic cloud service is designed to remove operational complexity while ensuring reliability and cost efficiency at scale.

OPERATE AT MASSIVE SCALE
Operate at massive scale, without limits, globally.
Advanced Analytics on Big Data

[Pipeline diagram] Ingest → Store → Prep & Train → Model & Serve → Intelligence
 Ingest: logs, files and media (unstructured) and business/custom apps (structured), moved by Data Factory
 Store: Azure Storage
 Prep & Train: Azure Databricks (Spark MLlib, SparkR, sparklyr)
 Model & Serve: Azure Cosmos DB; Azure SQL Data Warehouse (via PolyBase)
 Intelligence: web & mobile apps, analytical dashboards

Spark Context Demo
Pricing

Launch Stage             Start Date             DBU Pricing   VM Pricing
Gated public preview     11/15                  50%           100%
Ungated public preview   TBD (~January 2018)    50%           100%
GA                       TBD (~March 2018)      100%          100%
Roadmap

Timeline: Private Preview (Oct 2017) → Gated Public Preview (Nov 2017, Connect()) → Ungated Public Preview (Q1 2018) → General Availability (H1 2018) → GA+/GA++ (2018-19)

Private Preview (Oct 2017):
• End-to-end experience: create first-party Databricks workspace and cluster from the Azure portal
• Enterprise-grade security with AAD authentication & single sign-on
• Integration with external stores like Blob Store, ADLS, SQL DW, Cosmos DB, HDI Kafka
• DBFS mount points
• Power BI integration via JDBC/ODBC endpoint
• REST APIs
• 3 regions: West US, East US 2, West Europe

Add-ons to Private Preview (Gated Public Preview):
• Billing
• Event Hub integration
• Clusters: tagging, disk storage, SSH access
• 8 regions: West US, East US, Central US, West Europe, North Europe, West US 2

Add-ons to public preview:
• .NET integration
• GPU clusters
• Deep integration with SQL DW
• Easy data import UI for Blob Store & Azure Data Lake
• Audit logs, Spark logs to storage, log history, log encryption at rest
• ISO 27001; stretch goal: SOC2 & HIPAA
• Reserved Instances & commit pricing
• Deep integration with Power BI
• Jobs: email alerts
• Free community edition

Add-ons to general availability:
• OMS for service monitoring
• Azure ML integration
• ADF integration (TBD)
• All Azure regions
• PCI & other certifications

Provided by Microsoft and Databricks under NDA
How to get started

 Sign up for preview at http://databricks.azurewebsites.net
 Engage Microsoft experts for a workshop to help identify high-impact scenarios
 Learn more about Azure Databricks: www.azure.com/databricks
Appendix

Help All Along the Way
 Quick Start
 Documentation

Azure Databricks – workspace home page
Azure Databricks – service home page
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment

Important Technical Details
CLUSTERS

 Azure Databricks clusters are the set of Azure Linux VMs that host the Spark worker and driver nodes.
 Your Spark application code (i.e., jobs) runs on the provisioned clusters.
 Azure Databricks clusters are launched in your subscription, but are managed through the Azure Databricks portal.
 Azure Databricks provides a comprehensive set of graphical wizards to manage the complete lifecycle of clusters, from creation to termination.
CLUSTER CREATION

 You can create two types of clusters: Standard and Serverless Pool (see next slide).
 While creating a cluster you can specify:
• Number of nodes
• Autoscaling and Auto Termination policy
• Spark configuration details
• The Azure VM instance types for the driver and worker nodes

Graphical wizard in the Azure Databricks portal to create a Standard Cluster


CLUSTERS: AUTO SCALING AND AUTO TERMINATION
Simplifies cluster management and reduces costs by eliminating waste

 When creating Azure Databricks clusters you can choose Autoscaling and Auto Termination options.

 Autoscaling: just specify the min and max number of nodes; Azure Databricks automatically scales up or down based on load.

 Auto Termination: after the specified minutes of inactivity, the cluster is automatically terminated.

Benefits:
 You do not have to guess, or determine by trial and error, the correct number of nodes for the cluster.
 As the workload changes, you do not have to manually tweak the number of nodes.
 You do not have to worry about wasting resources when the cluster is idle; you only pay for resources when they are actually being used.
 You do not have to wait and watch for jobs to complete just so you can shut down the clusters.
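The options above map onto a cluster specification like the following sketch of a cluster-creation request body. The node type and Spark version strings are illustrative placeholders, not recommendations:

```python
import json

# Illustrative cluster spec; field values are placeholders.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "latest",          # placeholder runtime version string
    "node_type_id": "Standard_DS3_v2",  # example Azure VM instance type
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale within this range based on load
    "autotermination_minutes": 30,      # terminate after 30 idle minutes
}
print(json.dumps(cluster_spec, indent=2))
```

Setting `autoscale` instead of a fixed worker count, plus `autotermination_minutes`, is what delivers the pay-only-for-what-you-use benefits listed above.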
SERVERLESS POOL (BETA)
A self-managed pool of cloud resources, auto-configured for interactive Spark workloads

 You specify only the minimum and maximum number of nodes in the cluster; Azure Databricks provisions and adjusts the compute and local storage based on your usage.
 Limitation: currently works only for SQL and Python.

Benefits of Serverless Pool:

Auto-configuration
 Databricks chooses the best configuration for Spark to get the best performance.
 Users don't need to worry about providing the Databricks runtime version or any Spark configuration.
 Databricks also chooses the best cluster parameters to save cost on infrastructure.

Elasticity
 Automatically scales the compute and local storage, independently, based on usage.
 Offers maximum resource utilization and minimum query latencies.

Fine-grained sharing
• Preemption: Databricks proactively preempts Spark tasks from over-committed users to ensure all users get their fair share of cluster time and their jobs complete in a timely manner, even when contending with dozens of other users. Uses the "Task Preemption for High Concurrency" feature of Spark in Databricks.
• Fault isolation: Databricks sandboxes the environments belonging to different notebooks from one another.
CLUSTER ACCESS CONTROL
• There are two configurable types of permissions for Cluster Access Control:
• Individual Cluster Permissions: controls a user's ability to attach notebooks to a cluster, as well as to restart/resize/terminate/start clusters.
• Cluster Creation Permissions: controls a user's ability to create clusters.

• Individual permissions can be configured on the Clusters page by clicking Permissions under the 'More Actions' icon of an existing cluster.
• There are 4 individual cluster permission levels: No Permissions, Can Attach To, Can Restart, and Can Manage. Privileges are shown below.

Abilities                        No Permissions   Can Attach To   Can Restart   Can Manage
Attach notebooks to cluster                       x               x             x
View Spark UI                                     x               x             x
View cluster metrics (Ganglia)                    x               x             x
Terminate cluster                                                 x             x
Start cluster                                                     x             x
Restart cluster                                                   x             x
Resize cluster                                                                  x
Modify permissions                                                              x