The Databricks Platform: Introduction
All your data, analytics, and AI on one platform
Alex Ivanichev
March 2022
What is Databricks?
Databricks is a unified, open data and analytics platform.
Modern Data Teams
Data engineers, data scientists, and data analysts
What does data management look like today?
Data management complexity
Siloed stacks increase data architecture complexity and decrease productivity, and disconnected systems and proprietary data formats make integration difficult. Each workload lives in its own stack with its own users: Data Warehousing (data analysts), Data Engineering (data engineers), Streaming (data engineers), and Data Science and ML (data scientists).
Typical tools in each silo:
Data warehousing and analytics: Amazon Redshift, Azure Synapse, Snowflake, SAP, Teradata, Google BigQuery, IBM Db2, Oracle Autonomous Data Warehouse, TIBCO Spotfire
Data engineering: Hadoop, Apache Airflow, Apache Spark, Amazon EMR, Google Dataproc, Cloudera
Streaming: Apache Kafka, Apache Flink, Amazon Kinesis, Azure Stream Analytics, Google Dataflow, Confluent
Data science and ML: Jupyter, Amazon SageMaker, Azure ML Studio, MATLAB, Domino Data Labs, SAS, TensorFlow, PyTorch
[Diagram: four siloed architectures side by side. Data warehousing: extract/load/transform into a data warehouse and data marts for analytics and BI, over structured data. Data engineering: data prep over a data lake holding structured, semi-structured, and unstructured data. Streaming: a real-time streaming engine and database fed by streaming data sources. Data science and ML: machine learning and data science over a data lake holding structured, semi-structured, and unstructured data.]
Data Warehouse vs. Data Lake
Warehouses and lakes create complexity
Two separate copies of the data: warehouses are proprietary, lakes are open.
Incompatible interfaces: warehouses speak SQL, lakes speak Python.
Incompatible security and governance models: warehouses govern tables, lakes govern files.
Data Lakehouse
One platform to unify all of your data, analytics, and AI workloads: data warehouse and data lake combined, supporting streaming, analytics, BI, data science, and machine learning over structured, semi-structured, and unstructured data.
Why choose Databricks?
The data lakehouse offers a better path
Integrated and collaborative role-based experiences with open APIs for modern data engineering, analytics and data warehousing, and data science and ML
Data processing and management built on open source and open standards
Common security, governance, and administration
A cloud data lake for structured, semi-structured, and unstructured data
A lake-first approach that builds upon where the freshest, most complete data resides
AI/ML from the ground up
High reliability and performance
A single approach to managing data
Support for all use cases on a single platform:
• Data engineering
• Data warehousing
• Real-time streaming
• Data science and ML
Built on open source and open standards
Multi-cloud: work with your cloud of choice
The Data Lakehouse Foundation
An open approach to bringing data management and governance to data lakes, combining the best of the data warehouse and the data lake:
• Better reliability with transactions
• 48x faster data processing with indexing
• Data governance at scale with fine-grained access control lists
What is Delta Lake?
● An open-source project that enables building a lakehouse architecture on top of data lakes.
● A storage layer that brings scalable ACID transactions to Apache Spark and other big-data engines.
● Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.
Key features (a sketch of several of them follows below):
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Change Data Feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Deletes
● 100% Compatible with the Apache Spark APIs
● Data Clean-up
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
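As a quick illustration of a few of these features, here is a minimal, hedged Scala sketch. It assumes a Databricks notebook, where a SparkSession named spark is provided; the table path and column names are hypothetical, not from the deck.

import io.delta.tables.DeltaTable

val path = "/tmp/delta/events"   // hypothetical table path

// ACID write: create or overwrite a Delta table.
spark.range(0, 100).toDF("id")
  .write.format("delta").mode("overwrite").save(path)

// Updates and deletes through the DeltaTable API.
val table = DeltaTable.forPath(spark, path)
table.delete("id < 10")   // transactional delete

// Time travel: read the table as of an earlier version.
val v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

// Audit history: inspect the commit history as a DataFrame.
table.history().select("version", "timestamp", "operation").show()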
Delta Lake solves challenges with data lakes
Reliability & quality: ACID transactions
Performance & latency: advanced indexing & caching
Governance: governance with data catalogs
Delta Lake key feature: ACID transactions
Every commit to the transaction log is made up of actions (inspected in the sketch below):
● Add File: adds a data file to the table
● Remove File: removes a data file from the table
● Update Metadata: updates the table metadata
● Set Transaction: records that a Structured Streaming job committed a micro-batch with a given ID
● Change Protocol: upgrades the table to a newer version of the Delta protocol
● Commit Info: contains information about the commit
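Each commit is stored as a JSON file under the table's _delta_log directory, one action per line. A hedged way to peek at those actions from a notebook (the table path is hypothetical):

// Read the first commit file of a hypothetical Delta table. Top-level
// fields correspond to the actions above: add, remove, metaData,
// protocol, txn (Set Transaction), and commitInfo.
val log = spark.read.json("/tmp/delta/events/_delta_log/00000000000000000000.json")
log.printSchema()
log.show(false)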
State Recomputation with Checkpoint Files
● Delta Lake automatically generates a checkpoint every 10 commits.
● Each checkpoint is saved as a Parquet file in the same _delta_log subdirectory (see below).
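Since a checkpoint is just a Parquet snapshot of the table state, it can be inspected the same way; a small sketch, assuming a hypothetical table that already has at least 10 commits:

// Read the checkpoint written after commit 10 of a hypothetical table.
// It holds the consolidated add/remove/metaData/protocol state, so a
// reader does not have to replay every JSON commit from the beginning.
val cp = spark.read.parquet("/tmp/delta/events/_delta_log/00000000000000000010.checkpoint.parquet")
cp.printSchema()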
Building the foundation of a lakehouse
Greatly improve the quality of your data for end users as it moves through the layers of the data lake (a pipeline sketch follows below):
BRONZE: raw ingestion and history (from sources such as Kinesis and CSV, JSON, TXT files)
SILVER: filtered, cleaned, augmented
GOLD: business-level aggregates, serving BI & reporting, streaming analytics, and data science & ML
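A minimal Scala sketch of such a bronze/silver/gold pipeline, assuming a Databricks notebook with spark available; the paths, table layout, and column names are hypothetical:

import org.apache.spark.sql.functions._

// BRONZE: raw ingestion and history; land the files as-is.
val bronze = spark.read.option("header", "true").csv("/mnt/raw/orders/")
bronze.write.format("delta").mode("append").save("/mnt/bronze/orders")

// SILVER: filtered, cleaned, augmented.
val silver = spark.read.format("delta").load("/mnt/bronze/orders")
  .filter(col("order_id").isNotNull)
  .withColumn("amount", col("amount").cast("double"))
silver.write.format("delta").mode("overwrite").save("/mnt/silver/orders")

// GOLD: business-level aggregates for BI and reporting.
val gold = silver.groupBy("customer_id")
  .agg(count(lit(1)).as("orders"), sum("amount").as("revenue"))
gold.write.format("delta").mode("overwrite").save("/mnt/gold/orders_by_customer")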
But the reality is not so simple
Maintaining data quality and reliability at scale is complex and brittle: raw CSV, JSON, and TXT files and Kinesis streams land in the data lake, while BI & reporting, streaming analytics, and data science & ML all consume from it.
Modern data engineering on the lakehouse
Data Engineering on the Databricks Lakehouse Platform connects data sources to data consumers:
Data sources: databases, streaming sources, cloud object stores, SaaS applications, NoSQL, on-premises systems
Platform capabilities: data ingestion, open format storage, data transformation, scheduling & orchestration, automatic deployment & operations, data quality management, and observability, lineage, and end-to-end pipeline visibility
Data consumers: BI / reporting, dashboarding, machine learning / data science, data & ML sharing, data products
Data Science & Engineering Workspace
Databricks Workspaces: Clusters
A cluster is a set of computation resources on which a developer can run data analytics, data science, or data engineering workloads. Workloads are executed as a set of commands written in a notebook.
Databricks Workspaces: Notebooks
A notebook is a web interface where a developer can write and execute code. A notebook contains a sequence of runnable cells that let a developer work with files, manipulate tables, create visualizations, and add narrative text.
Databricks Workspaces: Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. It can load data files from Google Cloud Storage (GCS, gs://) in addition to the Databricks File System (DBFS, dbfs:/).
Supported formats: JSON, CSV, PARQUET, AVRO, ORC, TEXT, and BINARYFILE.
// Path where incoming files land (this value is assumed; the original
// snippet used upload_path without defining it).
val upload_path = "/tmp/delta/population_data/upload"
val checkpoint_path = "/tmp/delta/population_data/_checkpoints"
val write_path = "/tmp/delta/population_data"

// Set up the stream to begin reading incoming CSV files from the
// upload_path location, with an explicit schema.
val df = spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .schema("city string, year int, population long")
  .load(upload_path)

// Start the stream.
// Use the checkpoint_path location to keep a record of all files that
// have already been processed from the upload_path location, and write
// the newly arrived files' data to the write_path location as a Delta table.
df.writeStream.format("delta")
  .option("checkpointLocation", checkpoint_path)
  .start(write_path)
https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html
Databricks Workspaces: Jobs
Jobs let a user run notebooks on a schedule. A job is a way to execute or automate specific tasks such as ETL, model building, and more.
The steps of an ML workflow can be organized into a job so that they run sequentially, one after another (a sketch follows below).
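One hedged way to chain such steps sequentially from a driver notebook is dbutils.notebook.run, which is available in Databricks notebooks. The notebook paths below are hypothetical, and in practice you would more often define each step as a task of a single job:

// Run each workflow step notebook in order; dbutils.notebook.run takes
// the notebook path, a timeout in seconds, and a map of arguments.
val steps = Seq("/Repos/ml/etl", "/Repos/ml/train", "/Repos/ml/deploy")
steps.foreach { nb =>
  val result = dbutils.notebook.run(nb, 3600, Map("env" -> "dev"))
  println(s"$nb finished with result: $result")
}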
Databricks Workspaces: Delta Live Tables
Delta Live Tables is a framework that lets you declaratively define, deploy, test, and upgrade data pipelines, eliminating the operational burden of managing such pipelines.
Databricks Workspaces: Repos
To support the ML application development process, Repos provide repository-level integration with Git-based hosting providers such as GitHub, GitLab, Bitbucket, and Azure DevOps.
Developers can write code in a notebook and sync it with the hosting provider, allowing them to clone repositories, manage branches, push changes, and pull changes.
Databricks Workspaces: Models
A model refers to a developer's ML model registered in the MLflow Model Registry, a centralized model store that manages the entire lifecycle of MLflow models.
The MLflow Model Registry provides model lineage, model versioning, current stage, workflow, and stage transitions (whether a model has been promoted to production or archived).
Governance requirements for data are quickly evolving
Governance is hard to enforce on data lakes: structured, semi-structured, unstructured, and streaming data end up spread across multiple clouds (Cloud 1, Cloud 2, Cloud 3).
The problem is getting bigger
Enterprises need a way to share and govern a wide variety of data products: files, dashboards, models, and tables.
Unity Catalog for Lakehouse Governance
• Centrally catalog, search, and discover data and AI assets
• Simplify governance with a unified cross-cloud governance model
• Easily integrate with your existing enterprise data catalogs
• Securely share live data across platforms with Delta Sharing
Delta Sharing on Databricks
A data provider shares a Delta Lake table through a Delta Sharing server, which enforces access permissions; a data recipient then reads the data over the open Delta Sharing protocol from any sharing client (a recipient-side sketch follows below).
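On the recipient side, a shared table can be read with the open-source Delta Sharing Spark connector. A minimal sketch, assuming the delta-sharing-spark library is attached to the cluster; the profile path and the share, schema, and table names are hypothetical:

// The provider sends a profile file containing the sharing endpoint and
// a bearer token; the load path is <profile>#<share>.<schema>.<table>.
val profile = "/dbfs/tmp/config.share"
val shared = spark.read
  .format("deltaSharing")
  .load(s"$profile#my_share.my_schema.my_table")
shared.show(5)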
Machine Learning Workspace
ML Architecture: Data Warehouse vs. Data Lakehouse
Data Science and Machine Learning
A data-native and collaborative solution for the full ML lifecycle: collaborative multi-language notebooks on an open, multi-cloud data lakehouse and feature store, covering model training and tuning, model tracking and registry, model serving and monitoring, and automation and governance.
What Does ML Need from a Lakehouse?
Access to unstructured data
• Images, text, audio, custom formats
• Libraries understand files, not tables
• Must scale to petabytes
Open-source libraries
• OSS dominates ML tooling (TensorFlow, scikit-learn, XGBoost, R, etc.)
• Must be able to apply these in Python and R
Specialized hardware, distributed compute
• Scalability of algorithms
• GPUs for deep learning
• Cloud elasticity to manage that cost
Model lifecycle management
• Outputs are model artifacts
• Artifact lineage
• Productionization of models
Three Data Users
Business Intelligence
• SQL and BI tools
• Prepare and run reports
• Summarize data
• Visualize data
• (Sometimes) big data
• Data warehouse data store
Data Science
• R, SAS, some Python
• Statistical analysis
• Explain data
• Visualize data
• Often small data sets
• Database or data warehouse data store; local files
Machine Learning
• Python
• Deep learning and specialized GPU hardware
• Create predictive models
• Deploy models to prod
• Often big data sets
• Unstructured data in files
How Is ML Different?
• Operates on unstructured data like text and images
• Can require learning from massive data sets, not just analysis of a sample
• Uses open source tooling to manipulate data as “DataFrames” rather than with SQL
• Outputs are models rather than data or reports
• Sometimes needs special hardware
MLOps and the Lakehouse
• Applying open tools in-place to data in the lakehouse is a win for training
• Applying them for operating models is important too!
• "Models are data too"
• Need to apply models to data
• MLflow for MLOps on the lakehouse
• Track and manage model data, lineage, inputs
• Deploy models as lakehouse "services"
Feature Stores for Model Inputs
• Tables are OK for managing model input
• Input is often structured
• Well understood, easy to access
• … but not quite enough
• Upstream lineage: how were features computed?
• Downstream lineage: where is the feature used?
• The model caller has to read and feed inputs
• How do we also access features in real time?
SQL Analytics Workspace
Query data lake data using familiar ANSI SQL, and find and share new insights faster with the built-in SQL query editor, alerts, visualizations, and interactive dashboards.
Databricks Workspaces: Queries
Queries provide a simplified, SQL-only interface for querying the data.
Databricks Workspaces: Dashboards
A Databricks SQL dashboard lets you combine visualizations and text boxes that provide context with your data.
Databricks Workspaces: Alerts
Alerts notify you when a field returned by a scheduled query meets a threshold.
Alerts complement scheduled queries: their criteria are checked after every execution.
Databricks Workspaces: Query History
The query history shows SQL queries performed using SQL endpoints.
Thank you