Azure Databricks - An Introduction
Bryan Cafferky
Technical Solutions Professional
BIG DATA & ADVANCED ANALYTICS AT A GLANCE
[Diagram: Azure big data and advanced analytics services. Labels: Sensors and devices, Custom apps, Kafka, Event Hub, IoT Hub, Blobs, Data Lake, Data Lake Analytics, Databricks, HDInsight, Machine Learning, ML Workbench, Cosmos DB, SQL Database, SQL Data Warehouse, Analysis Services, Collaboration, Portal, Predictive apps, Operational reports, Analytical dashboards]
Azure Databricks
Powered by Apache Spark
APACHE SPARK
A unified, open source, parallel data processing framework for Big Data Analytics
Azure Databricks
[Diagram: Collaborative Workspace; Databricks I/O, Apache Spark, Serverless, REST APIs; Hadoop storage; Data warehouses]
Enhance Productivity | Build on secure & trusted cloud | Scale without limits
GENERAL SPARK CLUSTER ARCHITECTURE
[Diagram: Driver Program (SparkContext), Cluster Manager, and three Worker Nodes, each with a Cache and Tasks]
The ‘Driver’ runs the user’s ‘main’ function and executes the various parallel operations on the worker nodes.
The results of the operations are collected by the driver.
The worker nodes read and write data from/to data sources including HDFS.
Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
Worker nodes and the Driver node execute as VMs in public clouds (AWS, Google and Azure).
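The driver/worker flow described above can be illustrated with a toy model in plain Python. This is only a sketch of the pattern, not the Spark API: the `Worker` and `Driver` classes, their methods, and the use of threads in place of cluster nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Worker:
    """Toy stand-in for a worker node: runs tasks and caches their output."""

    def __init__(self, name):
        self.name = name
        self.cache = {}  # partition id -> transformed partition held in memory

    def run_task(self, partition_id, partition, transform):
        # Compute the transformed partition once, then serve it from the cache.
        if partition_id not in self.cache:
            self.cache[partition_id] = [transform(x) for x in partition]
        return self.cache[partition_id]


class Driver:
    """Toy stand-in for the driver: splits work, schedules tasks, collects results."""

    def __init__(self, workers):
        self.workers = workers

    def split(self, data):
        # At most one contiguous partition per worker (ceiling division for chunk size).
        size = max(1, -(-len(data) // len(self.workers)))
        return [data[i:i + size] for i in range(0, len(data), size)]

    def map_and_collect(self, data, transform):
        partitions = self.split(data)
        with ThreadPoolExecutor(max_workers=len(self.workers)) as pool:
            futures = [
                pool.submit(self.workers[i].run_task, i, part, transform)
                for i, part in enumerate(partitions)
            ]
            # The driver gathers the partial results in partition order.
            parts = [f.result() for f in futures]
        return [x for part in parts for x in part]


# Hypothetical usage: three "worker nodes" square the numbers 0..9.
workers = [Worker(f"worker-{i}") for i in range(3)]
driver = Driver(workers)
squares = driver.map_and_collect(list(range(10)), lambda x: x * x)
print(squares)
```

In real Spark the cluster manager assigns tasks to worker nodes and the cached data lives in each node's memory; here a thread pool and a dictionary play those roles.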
Access Control can be defined for Workspaces, Clusters, Jobs and REST APIs
Workspace Access Control: defines who can view, edit, and run notebooks in their workspace
[Diagram: Azure Databricks with Clusters, Libraries, Workspaces, Jobs and Notebooks]
Why Spark?
• Open-source data processing engine built around speed, ease of use, and
sophisticated analytics
• Highly extensible with support for Scala, Java and Python alongside Spark SQL,
GraphX, Streaming and Machine Learning Library (MLlib)
What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized
for Azure
Best of Databricks | Best of Microsoft
Interactive workspace that enables collaboration between data scientists, data engineers, and
business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
DASHBOARDS
Rich integration with Power BI to discover and share
your insights in powerful new ways
Deploy Production Jobs & Workflows
JOBS SCHEDULER
Execute jobs for production pipelines on a specific schedule
NOTEBOOK WORKFLOWS
Create multi-stage pipelines with the control structures of the source programming language
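A minimal sketch of a notebook workflow that uses ordinary Python control structures (a retry loop and an if branch) to chain two stages. It assumes the notebook runner is passed in as a callable with the same shape as `dbutils.notebook.run(path, timeout_seconds, arguments)`. The notebook paths, argument names, and return values are hypothetical.

```python
from typing import Callable, Dict

# Signature modeled on dbutils.notebook.run(path, timeout_seconds, arguments).
RunNotebook = Callable[[str, int, Dict[str, str]], str]


def run_pipeline(run_notebook: RunNotebook, max_retries: int = 2) -> str:
    """Run a two-stage pipeline; notebook paths and return values are hypothetical."""
    status = ""
    # Stage 1: retry the ingest notebook with an ordinary for/try loop.
    for attempt in range(max_retries + 1):
        try:
            status = run_notebook("./ingest", 600, {"run_date": "2018-01-01"})
            break
        except Exception:
            if attempt == max_retries:
                raise
    # Stage 2: branch on the value returned by the first notebook.
    if status == "no_new_data":
        return "skipped"
    return run_notebook("./transform", 1200, {"input_table": status})
```

Inside a Databricks notebook this would be called as `run_pipeline(dbutils.notebook.run)`. Passing the runner as a parameter also lets the control flow be exercised locally with a stub.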
Set up alerts and quickly access audit logs for easy monitoring and troubleshooting
INTEGRATE NATIVELY WITH AZURE SERVICES
Deep integration with Azure SQL Data Warehouse, Cosmos DB, Azure Data Lake Store, Azure Blob Storage, and Azure Event Hub
Optimized Databricks Runtime Engine
OPTIMIZED I/O PERFORMANCE
The Databricks I/O module (DBIO) takes processing speeds to the next level, significantly improving the performance of Spark in the cloud
ensuring reliability and cost efficiency at scale
Quick Start
Documentation
Azure Databricks – workspace home page
Azure Databricks – service home page
Azure Databricks – creating a workspace
Azure Databricks – workspace deployment
Important Technical Details
CLUSTERS
Terminate cluster x x
Start cluster x x
Restart cluster x x
Resize cluster x
Modify permissions x
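The rows above can be modeled as a lookup from action to the minimum permission level it requires. The column headers are missing from the table above, so the level names are an assumption: this sketch takes the two marked columns to be the Can Restart and Can Manage cluster permission levels, with each level including the abilities of the levels below it. The `ClusterPermission` enum and the `is_allowed` helper are hypothetical names, not part of any Databricks API.

```python
from enum import IntEnum


class ClusterPermission(IntEnum):
    # Assumed level names, ordered from least to most privileged.
    NO_PERMISSIONS = 0
    CAN_ATTACH_TO = 1
    CAN_RESTART = 2
    CAN_MANAGE = 3


# Minimum level per action, read off the rows above under the stated assumption:
# two marks -> Can Restart and Can Manage; one mark -> Can Manage only.
MINIMUM_LEVEL = {
    "terminate cluster": ClusterPermission.CAN_RESTART,
    "start cluster": ClusterPermission.CAN_RESTART,
    "restart cluster": ClusterPermission.CAN_RESTART,
    "resize cluster": ClusterPermission.CAN_MANAGE,
    "modify permissions": ClusterPermission.CAN_MANAGE,
}


def is_allowed(level: ClusterPermission, action: str) -> bool:
    """Return True if the given permission level covers the action."""
    return level >= MINIMUM_LEVEL[action]


print(is_allowed(ClusterPermission.CAN_RESTART, "restart cluster"))
print(is_allowed(ClusterPermission.CAN_RESTART, "resize cluster"))
```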