Data Analysis With Databricks Version 2

This document provides an overview of a class on Data Analysis with Databricks SQL. The class introduces Databricks SQL services and capabilities over several modules. It covers topics like setting up a catalog and schema, importing data, running simple queries and creating visualizations and dashboards. The class uses lectures, demos, labs and knowledge checks to help students learn how to use Databricks SQL to ingest, query and analyze data in the lakehouse architecture.

Data Analysis with Databricks SQL

Databricks Academy
2023
©2023 Databricks Inc. — All rights reserved 1
Meet your classmates
• Where is everyone joining us from today (city, country)?
• How long have you been working with Databricks SQL?
• What data analysis tools have you worked with in the past?
• What are you hoping to get out of this class?


Getting Started with
the Course



Course goals

1. Describe how Databricks SQL:
   ○ Works in the Lakehouse architecture
   ○ Implements data security
2. Use Databricks SQL to:
   ○ Ingest data into the Lakehouse
   ○ Use SQL commands specific to Databricks
   ○ Query data
   ○ Create visualizations and dashboards
   ○ Share queries and dashboards with others
   ○ Use automation and integration capabilities

Agenda
Module Name Duration

Databricks SQL Services and Capabilities 1 hour, 50 min

Data Management in Databricks SQL 1 hour, 40 min

Data Visualization and Dashboarding 3 hours, 30 min

● We will take 10-minute breaks about every hour



Activities used in this course

Lectures Demos Labs Knowledge checks



Technical Environment Overview
The Databricks SQL workspace
• Everyone:
• is in the same workspace
• has their own catalog
• is using the same SQL warehouse
• Only the instructor has administrator privileges in the workspace
• only a select few tasks in this course require admin privileges
• you will see these tasks in the slides in order to provide context
• the labs do not require admin privileges



Databricks SQL
Services and
Capabilities
Databricks Academy 2023



Databricks SQL Services and Capabilities

Lesson Name Duration

Lecture: Getting Started with Databricks SQL 15 min

Follow Along Demo: Setting Up a Catalog and Schema 20 min

Follow Along Demo: Data Importing 20 min

Follow Along Demo: A Simple (but Quick) Query, Visualization, and Dashboard 15 min

Lecture: Unity Catalog in Databricks SQL 20 min

Lecture: Lakehouse Architecture 20 min

Lecture: Integrations 10 min



Getting Started with
Databricks SQL



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe what Databricks SQL is.
2. Describe the benefits of Databricks SQL.


Databricks SQL
Delivering analytics on the freshest data with data warehouse performance and data lake economics

■ Better price/performance than other cloud data warehouses
■ Simplified discovery and sharing of new insights
■ Connect to familiar BI tools, like Tableau or Power BI
■ Simplified administration and governance


Better price / performance
Run SQL queries on your lakehouse and analyze your freshest data with up to 6x better price/performance than traditional cloud data warehouses.

Source: Performance Benchmark with Barcelona Supercomputing Center


Better together | Broad integration with BI tools
Connect your preferred BI tools with optimized connectors that provide fast performance, low latency, and high user concurrency to your data lake.


A new home for data analysts
Enable data analysts to quickly perform ad-hoc and exploratory data analysis with a new SQL query editor, visualizations, and dashboards. Automatic alerts can be triggered for critical changes, allowing teams to respond to business needs faster.


Simple administration and governance
Quickly set up SQL/BI-optimized compute with SQL warehouses. Databricks automatically determines instance types and configuration for the best price/performance. Then, easily manage usage, perform quick audits, and troubleshoot with query history.


Use Cases

■ Connect existing BI tools to one source of truth for all your data
■ Collaboratively explore the latest and freshest data
■ Build data-enhanced applications


Follow Along Demo:
Setting Up a Catalog and
Schema



Follow Along Demo
Setting Up a Catalog and Schema
• Get your username for the course
• Create a catalog
• Create a schema



Follow Along Demo:
Data Importing

DAWD 01-5



Follow Along Demo
Data Importing
• Upload a .csv file
• Use the Data Explorer
• Create a table with data from object store



Follow Along Demo:
A Simple (but Quick)
Query, Visualization, and
Dashboard



Follow Along Demo
A Simple (but Quick) Query, Visualization, and
Dashboard
• Query data
• Create a visualization
• Create a dashboard



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements about Databricks SQL is true? Select one response.
A. With Databricks SQL, queries deliver up to 2x better price/performance than other cloud data warehouses.
B. Delta Live Tables can be created in Databricks SQL.
C. Databricks SQL automatically configures scaling when creating SQL warehouses.
D. Databricks SQL clusters are powered by Photon.

Knowledge check
Think about this question and volunteer an answer

Which of the following features is used by Databricks SQL to ensure your data is secure? Select one response.
A. Built-in data governance
B. Delta sharing
C. Integration with 3rd party tools
D. Automatically scalable cloud infrastructure


Knowledge check
Think about this question and volunteer an answer

Which of the following is the landing point to run a query in Databricks SQL? Select one response.
A. Dashboards
B. Job scheduler
C. Query editor
D. SQL warehouses


Knowledge check
Think about this question and volunteer an answer

Which of the following SQL keywords can be used to ingest data in-place? Select one response.
A. CREATE TABLE AS
B. WHERE
C. LOCATION
D. INSERT


Unity Catalog in
Databricks SQL

DAWD 02-1



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the three-level namespacing system provided by Unity Catalog.
2. Describe persistence and scope of catalogs, schemas (databases), tables, and views on Databricks.
3. Compare and contrast the behavior of managed and unmanaged tables.


Learning Objectives, cont.
By the end of this lesson, you should be able to:

4. Identify the legacy hive_metastore as appearing as the default catalog to be compatible with Unity Catalog.
5. Describe how and where Unity Catalog stores the data behind its catalogs, schemas, and granular data objects.
6. Explain the impact of Unity Catalog on existing external storage locations.


Unity Catalog
Architecture

● Implements access control on data
● Access control is always enabled
● Works across multiple workspaces
● Grants permissions to users at the account level

[Diagram: an identity provider feeds user/group management in the Unity Catalog metastore, which governs compute resources across multiple workspaces.]


Unity Catalog
Comparison to workspace security model

Workspace security model | Unity Catalog security model
Object/privilege/principal access control model | Object/privilege/principal access control model
Open by default | Secure by default
Local to workspace | Works across multiple workspaces
Grants privileges to workspace-level principals | Grants privileges to account-level principals


Data Explorer UI
Single pane of glass for all of your data

● UI-driven access control to simplify secure data permissioning
● Browse and understand data assets stored in your Lakehouse
● Data lineage: end-to-end table and column lineage (Public Preview)

Data Lineage
Mapping the flow of data in the lakehouse

● Auto-capture runtime data lineage across all languages
● Track lineage down to the table and column level
● Leverage the common permission model from Unity Catalog


Object Model



Metastore

• Stores data assets and permissions
• Created with a default storage location (external object store)
• Managed by a Metastore Admin


Catalog

• First level of organization
• Users can see all catalogs where USAGE is granted


Schema

• aka, Database
• Second level of organization
• Users can see all schemas where USAGE is granted on both the schema and the catalog


Managed Table

• Third level of organization
• Supported format: Delta
• Data is written to a new directory in the metastore’s default location
• Created using a CREATE TABLE statement with no LOCATION clause
• Example:
CREATE TABLE table1 …
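As a concrete sketch of the pattern above (the catalog, schema, table, and column names here are hypothetical, not from the course materials), a managed table is simply a CREATE TABLE with no LOCATION clause:

```sql
-- Hypothetical example of a managed Delta table.
-- With no LOCATION clause, the data is written to the
-- metastore's default storage location.
CREATE TABLE main.default.customers (
  customer_id INT,
  customer_name STRING
);
```

Because the metastore manages the underlying files, dropping a managed table also removes its data.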


External Table

• Third level of organization
• Data stored in a location outside the managed storage location
• DROP TABLE does not delete data
• Can easily clone a table to a new schema or table name without moving data
• Supported formats: Delta, CSV, JSON, Avro, Parquet, ORC, text


Creating External Tables

• Two credential types: Storage Credential or External Location
• Use the LOCATION clause
• Example using an External Location only:
CREATE TABLE table2
LOCATION 's3://<bucket_path>/<table_directory>'
...

• Example using a Storage Credential:
CREATE TABLE table2
LOCATION 's3://<bucket_path>/<table_directory>'
...
WITH CREDENTIAL <credential-name>;


View

• Third level of organization
• Can be composed from tables and views in multiple schemas or catalogs
• Created using CREATE VIEW:

CREATE VIEW view1 AS
SELECT column1, column2
FROM table1 ...


Three-Level Namespace Notation

• Data objects must be specified with three elements, depending on granularity required: Catalog, Schema, and Table
• Example:
CREATE TABLE main.default.department
(
  deptcode INT,
  deptname STRING,
  location STRING
);
• Or, with a USE statement:
USE main.default;
SELECT * FROM department;


Lakehouse
Architecture

DAWD 01-5



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the benefits of using Databricks SQL for in-Lakehouse platform data processing.
2. Describe the medallion architecture as a sequential data organization and pipeline system of progressively cleaner data.
3. Identify that bronze and silver layer data requires additional processing and cleaning.
4. Describe the data in the gold layer of the medallion architecture.


Learning Objectives, cont.
By the end of this lesson, you should be able to:

5. Describe last-mile ETL workflows fully within the gold layer for specific use cases.
6. Identify the gold layer as the most common layer for data analysts using Databricks SQL.
7. Describe the benefits of working with streaming data.


The Lakehouse Architecture

● Full ACID transactions. Focus on your data flow, instead of worrying about failures.
● Open standards, open source. Store petabytes of data without worries of lock-in. Growing community including Presto, Spark, and more.
● Powered by Delta Lake.
● Unifies streaming and batch. Convert existing jobs with minimal modifications.

[Diagram: data stream and batch sources (CSV, JSON, TXT…) flow into the Data Lake, which feeds streaming analytics and AI & reporting.]


The Delta Lake Architecture
Data Quality Levels

[Diagram: raw sources (CSV, JSON, TXT…) land in Bronze (raw ingestion), are refined into Silver (filtered, cleaned, augmented), then into Gold (business-level aggregates), with quality increasing left to right; the Data Lake feeds streaming analytics and AI & reporting.]




The Delta Lake Architecture
Data Quality Levels

[Diagram: the same Bronze → Silver → Gold flow, annotated with the SQL operations used between layers: OVERWRITE, INSERT, DELETE, UPDATE, and MERGE.]


Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements about the lakehouse medallion architecture is true? Select one response.
A. The data in a single upstream table could be used to generate multiple downstream tables.
B. The silver layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins.
C. The gold layer provides an "Enterprise view" of all its key business entities, concepts and transactions.
D. Only minimal or "just-enough" transformations and data cleansing rules are applied to each layer in the medallion architecture.


Knowledge check
Think about this question and volunteer an answer

Which of the following describes the data quality of the gold layer of data in the lakehouse medallion architecture? Select one response.
A. The gold layer brings the data from different sources into an Enterprise view.
B. The gold layer is comprised of aggregated data, ready for a downstream user.
C. The table structures in the gold layer correspond to the source system table structures "as-is".
D. The focus of the gold layer is quick Change Data Capture and the ability to provide a historical archive if needed without rereading the data from the source system.

Knowledge check
Think about this question and volunteer an answer

Which of the following statements about working with streaming data is true? Select one response.
A. Stream processing is typically used when there are loose latency requirements.
B. The total amount of money a particular movie made in the year 2020 is an example of stream processing data.
C. Streaming data can be optimized for cloud-provider object storage.
D. Stream processes cannot be stopped or restarted once they have begun.


Knowledge check
Think about this question and volunteer an answer

Which of the following statements describes the relationship between the silver and gold layers of data? Select one response.
A. The gold layer has less clean data than the silver layer.
B. Project-specific business rules are applied from the silver to gold layer.
C. Self-service analytics are enabled for the gold layer for ad-hoc reporting in the silver layer.
D. The gold layer is where we land all the data from external source systems, which are represented by the silver layer.


Integrations

DAWD 01-6



Learning Objectives
By the end of this lesson, you should be able to:

1. Identify Databricks SQL as a complementary tool for BI partner tool workflows.
2. Identify Databricks SQL as a quick opportunity to create queries, visualizations, and dashboards from within the Lakehouse.
3. Identify Partner Connect as a tool for implementing simple integrations with a number of other data products.


Databricks Partner Connect
Databricks Partner Connect is a dedicated ecosystem of integrations that allows users to easily connect with popular data ingestion, transformation, and BI partner products.

This helps data analysts get useful data into their lakehouse faster, without the need to manually configure each product, so they can get data-driven insights.


Databricks Partner Connect
Helps Data Analysts who:
● Struggle to connect to their choice of BI tools
● Struggle to bring data from SaaS apps (Google Analytics,
Facebook, etc.) to run SQL queries
● Have to wait on eng/ops to generate data for analysis



Partner Connect Makes it Easy

Typical analyst questions: "How do I get the data from SFDC into Delta Lake?" "What tools can I use to ingest data into Delta?" "I heard Fivetran is great! How do I connect it to Databricks?"

With Databricks Partner Connect:
■ Many partner integrations take as few as 6 clicks
■ No context or page switches required
■ Automatically launches a cluster, calls the Partner API to pass on the PAT token and the cluster configuration details
■ Sets up all the necessary configs for an optimized user experience
■ Creates a trial account in the partner product if an account doesn’t exist

Databricks Partner Connect

Partner categories: Data Ingestion · BI and Visualization · Data Prep and Transformation · Machine Learning


Databricks Partner Connect

Partner categories: Data Governance · Data Quality · Reverse ETL · Semantic Layer


Other Partners

Partner categories: Data Governance · Data Quality · Reverse ETL · Semantic Layer


Follow Along Demo:
Integrations

DAWD 01-5



Follow Along Demo
Integrations
• Connecting to outside data
• Connecting to BI tools
• Partner Connect



Data
Management in
Databricks SQL

Databricks Academy 2023



Data Management in Databricks SQL

Lesson Name Duration

Lecture: Databricks SQL Warehouses 15 min

Follow Along Demo: Delta Lake in Databricks SQL 20 min

Lecture: Data Security 20 min



Databricks SQL
Warehouses



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the purpose of Databricks SQL warehouses.
2. Compare and contrast Classic, Pro, and Serverless Databricks SQL warehouses.
3. Identify Serverless Databricks SQL warehouses as a quick-starting option.


Learning Objectives, cont.
By the end of this lesson, you should be able to:

4. Describe basic Databricks SQL warehouse sizing and scaling guidelines in response to slow-running single-user queries and multi-user environments.
5. Describe the impact of Databricks SQL warehouse permissions on query history availability.


A successful Lakehouse platform relies on efficient infrastructure

[Diagram: all users — ML engineers, data scientists, data engineers, and data analysts — share one infrastructure spanning the data lake and data warehouse, covering all the data.]

Problems with Managing Infrastructure

● Users — lost productivity: waiting for results while clusters start up and scale up.
● Admins — admin effort: manually configuring versions, cluster sizes, and instance types.
● Finance — cost: the need to reduce costs.


Databricks SQL Serverless

● Managed servers: an always-running server fleet, patched and upgraded automatically
● Optimized capacity: idle clusters removed 10 minutes after the last query (configurable)
● Instant compute: <10 s allocation to users; no waiting for clusters to start up or scale up
● Secure: 3-layer isolation with data encryption


Industry leading security architecture supporting production workloads

1. Container isolation
   ● Hardened container images per industry best practice
   ● Privileged access disabled in the container
2. VM isolation
   ● Workloads separated by VM boundaries
   ● Reusing VMs among customers is blocked
3. Network isolation
   ● All node egress is blocked except to nodes in the same cluster
   ● Federated access through temporary security tokens
   ● Ingress traffic from other customers is blocked


Databricks SQL Serverless Benefits

● Higher user productivity: user queries start instantly, with no waiting for cluster start-up; add more concurrent users with instant cluster scaling
● Zero management: no configuration, no performance tuning, no capacity management; automatic upgrades and patching
● Lower cost: pay for what you consume and eliminate idle cluster time; no over-provisioning of resources; idle capacity removed 10 minutes after the last query


Serverless Compute Architecture

[Diagram: workspaces run in the Databricks control plane within the Databricks account; serverless compute runs in a VPC/VNet in the Databricks account, while customer storage stays in the customer account.]

Benefits:
● Production-ready environment
● Robust security foundation: data isolation and encryption

Warehouse Configuration
[Screenshot: SQL warehouse configuration UI on AWS and Azure]


Warehouse Configuration
[Screenshot: SQL warehouse configuration UI on GCP]


Warehouse Configuration

In this course, SQL warehouses have the following settings:
• Cluster size: 2X-Small
• Scaling: Min 1, Max 1
• Auto-stop: after ten minutes


Delta Lake



Follow Along Demo:
Delta Lake in
Databricks SQL
DAWD 01-5



Follow Along Demo
Delta Lake in Databricks SQL
• Create a schema
• Create views that withhold data from
unauthorized groups
• Optimize delta tables
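The "Optimize delta tables" step in this demo is typically done with the OPTIMIZE command. A minimal sketch, assuming a hypothetical table and column that are not part of the course materials:

```sql
-- Hypothetical example: compact small files in a Delta table
OPTIMIZE main.default.sales;

-- Optionally co-locate related data to speed up selective queries
OPTIMIZE main.default.sales ZORDER BY (customer_id);
```

ZORDER BY is most useful on columns that appear frequently in query filters.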



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following statements describes the purpose of Databricks SQL warehouses? Select one response.
A. SQL warehouses enable data analysts to find and share dashboards.
B. SQL warehouses are a declarative framework for building data processing pipelines.
C. SQL warehouses provide data discovery capabilities across Databricks workspaces.
D. SQL warehouses allow you to run SQL commands on data objects within Databricks SQL.


Knowledge check
Think about this question and volunteer an answer

Which of the following resources can be used to programmatically interact with Databricks SQL? Select one response.
A. SQL warehouses and job clusters
B. Databricks SQL REST API and Databricks CLI
C. Fivetran and Tableau
D. All of the above


Knowledge check
Think about this question and volunteer an answer

Which of the following statements about SQL warehouse sizing and scaling is true? Select all responses that apply.
A. Increasing maximum scaling allows for multiple users to use the same warehouse at the same time.
B. Scaling is set to a minimum of 1 and a maximum of 1 by default.
C. The higher the cluster size, the higher the latency in your queries.
D. The auto-stop feature will restart the warehouse if it remains idle during the auto-stop period.


Knowledge check
Think about this question and volunteer an answer

A data analyst is trying to connect their database to Tableau in order to visualize the results of their data. Which of the following tabs do they need to go to in order to implement this? Select one response.
A. SQL Query
B. Partner Connect
C. Workflows
D. Fivetran

Lab:
Lakehouse SQL

DAWD 01-5



Lab
Lakehouse SQL
• Time Travel
• COPY INTO
• MERGE INTO
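For orientation before the lab, here is a minimal sketch of the two ingestion commands it covers. The table names, paths, and join keys are hypothetical, not from the lab materials:

```sql
-- Hypothetical example: idempotently load new files into a Delta table.
-- COPY INTO skips files it has already loaded.
COPY INTO main.default.sales
FROM 's3://<bucket_path>/sales_raw/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

-- Hypothetical example: upsert a batch of updates into the target table.
MERGE INTO main.default.sales AS t
USING main.default.sales_updates AS u
ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

COPY INTO suits incremental file-based ingestion, while MERGE INTO handles change data that must update existing rows.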



Data Security

DAWD 03-1



Learning Objectives
By the end of this lesson, you should be able to:

1. Describe the different levels of data object access available with Unity Catalog.
2. Identify that catalogs, schemas, and tables can all have unique owners.
3. Describe how to organize owned data objects for the purposes of security.
4. Identify that the creator of a data object becomes the owner of that data object.

Learning Objectives, cont.
By the end of this lesson, you should be able to:

5. Identify the responsibilities of data object ownership.
6. Update data object permissions to address user access needs in a variety of common scenarios.
7. Identify PII data objects as needing additional, organization-specific considerations.


The Life of a Query (Without Unity Catalog)
Per workspace:

1. The user submits SELECT * FROM Sales2020; to a cluster or SQL warehouse.
2. The cluster checks grants against the Table ACL.
3. The cluster looks up the table location.
4. The Hive Metastore returns the path to the table (s3://sales/sales2020).
5. The data is read from cloud storage using an instance profile, service principal, or service account.
6. The cluster filters out unauthorized data before returning results.

Unity Catalog Overview

[Diagram: a cluster or SQL warehouse enforces SQL access controls through Unity Catalog, which works across workspaces and writes an audit log. Unity Catalog reaches managed data sources with defined credentials, external tables via user identity passthrough, and other existing data.]

Databricks Unity Catalog

● Fine-grained permissions on tables, fields, and views: not files
● Industry standard interface: ANSI SQL grants
● Uniform permission model for all data assets: data files on S3/ADLS/GCS, SQL databases, ML models, and Delta Shares
● Centrally audited

[Diagram: users access tables, views, and models through Unity Catalog, which maps them to files in cloud storage, SQL databases, ML models, and Delta Shares, with an audit log.]


Attribute-Based Access Control
True policy-driven ABAC that flows from your metadata

A record of all sales transactions is ingested from several data sources, and only the finance team should be able to see payment information. The PII policy flows down from the column being tagged as PII: the pii policy leaves tagged columns visible to Finance, redacted for Data Eng, and masked for Analysts.

[Diagram: a Sales 2020 table with columns userID, purchase, purchaseAMT, and paymentDetails; the paymentDetails column is tagged pii.]


Attribute-Based Access Control

CREATE ATTRIBUTE pii

ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone
...

-- Set permissions on all columns tagged pii together
GRANT SELECT ON SCHEMA iot_data
HAVING ATTRIBUTE NOT IN (pii)
TO product_managers


Unity Catalog: External Table with Defined Credentials

CREATE CREDENTIAL iot_role TYPE AWS_ROLE ...

CREATE TABLE iot_data LOCATION s3:/...
WITH CREDENTIAL iot_role

[Diagram: Unity Catalog holds the credential and uses it to access the external tables.]


Unity Catalog: External Files with Passthrough

SELECT * FROM csv.`adls:/.../myfolder`
-- if a direct file path is specified, we perform passthrough
-- with the user’s cloud credentials

CREATE VIEW v AS SELECT * FROM csv.`adls:/.../myfolder`

[Diagram: the user's ADLS credentials are passed through to read the external files directly.]


Data
Visualization and
Dashboarding
Databricks Academy 2023



Data Visualization and Dashboarding

Lesson Name Duration

Follow Along Demo: Data Visualizations 40 min

Lab: Data Visualizations 30 min

Follow Along Demo: Dashboarding Basics 40 min

Lab: Dashboards 30 min

Follow Along Demo: Databricks SQL in Production 40 min

Lab: Databricks SQL in Production 30 min



Follow Along Demo:
Data Visualizations

DAWD 01-5



Follow Along Demo
Data Visualizations
• Create common visualizations
• Create map visualizations



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

A data analyst is worried that the mean value of their data is being affected by a few outliers. Which of the following data visualizations is best suited to show if this is the case? Select one response.
A. Boxplot
B. Funnel
C. Map - markers
D. Counter

Knowledge check
Think about this question and volunteer an answer

A data analyst needs to create a table that uses Unity Catalog based on data stored in the location path. Which of the following lines of code accomplishes this? Select one response.
A. CREATE MANAGED TABLE sales CREDENTIAL iam_role LOCATION path
B. CREATE TABLE sales LOCATION path WHERE CREDENTIAL NOT IN iam_role
C. CREATE EXTERNAL TABLE sales LOCATION path WITH CREDENTIAL iam_role
D. CREATE TABLE sales WITH CREDENTIAL iam_role LOCATION path


Knowledge check
Think about this question and volunteer an answer

A data analyst needs to visualize the number of


recorded cases of COVID-19 within Europe. Which
of the following data visualizations is best suited to
accomplish this? Select one response.
A. Stacked bar chart
B. Map - choropleth
C. Map - markers
D. Counter



Knowledge check
Think about this question and volunteer an answer

Which of the following SQL keywords can be


used to time travel back in a table’s history?
Select one response.
A. RETURN VERSION AS OF
B. REVERT TABLE
C. DESCRIBE DETAIL
D. DESCRIBE HISTORY



Lab:
Data Visualizations

DAWD 01-5



Lab
Data Visualizations
• Create a boxplot
• Create a funnel



Follow Along Demo:
Dashboarding Basics

DAWD 01-5



Follow Along Demo
Dashboarding Basics
• Create a dashboard
• Parameterize queries
• Organize Databricks SQL assets
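
Parameterized queries in Databricks SQL use double-curly-brace tokens, which surface as input widgets in the query editor; the table and column names below are illustrative assumptions:

```sql
-- Sketch: {{ start_date }} and {{ end_date }} become parameter widgets
-- in the Databricks SQL editor; values are substituted at run time,
-- so the same query can be rerun with different date ranges.
SELECT order_date,
       sum(total_price) AS revenue
FROM sales
WHERE order_date BETWEEN '{{ start_date }}' AND '{{ end_date }}'
GROUP BY order_date;
```

A dashboard built on this query can expose the same parameters at the dashboard level, so one date-range selection drives every widget.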



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following can be added to a query


so that the code can be rerun with different
variable inputs? Select one response.
A. User-defined functions
B. Parameters
C. Vectors
D. SQL warehouses



Knowledge check
Think about this question and volunteer an answer

A data analyst needs to create a visualization out of the following query:


SELECT order_date
FROM sales
WHERE order_date >= to_date('2020-01-01')
AND order_date <= to_date('2021-01-01');
Which of the following visualization types is best suited to depict the
results of this query? Select one response.
A. Funnel
B. Stacked bar chart
C. Bar chart
D. Boxplot



Knowledge check
Think about this question and volunteer an answer

A team of stakeholders needs to be notified of changes


in a dashboard’s statistics on a daily basis. Which of the
following actions can be taken to ensure they always
have the newest information? Select one response.
A. A refresh schedule can be configured and stakeholders
can be subscribed to the dashboard's output.
B. A trigger alert can be created for the dashboard and
stakeholders can be added to the alert notification list.
C. A webhook can be created and shared with
stakeholders.
D. None of the above
Knowledge check
Think about this question and volunteer an answer

Which of the following data visualizations


displays a single number by default? Select one
response.
A. Bar chart
B. Counter
C. Map - markers
D. Funnel



Lab:
Dashboards

DAWD 01-5



Lab
Dashboards
• Create your own dashboard



Follow Along Demo:
Databricks SQL in
Production
DAWD 01-5



Follow Along Demo
Databricks SQL in Production
• Automation in Databricks SQL
• Sharing Databricks SQL assets
• Creating gold-level tables
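
A gold-level table is typically a pre-aggregated summary built from cleaner silver-level data; a minimal sketch, assuming a hypothetical silver_sales source table:

```sql
-- Sketch only: table and column names are illustrative, not from the course.
-- CREATE OR REPLACE TABLE lets a scheduled query rebuild the gold
-- table on each run, keeping dashboards fast and consistent.
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT order_date,
       sum(total_price) AS daily_revenue,
       count(*)         AS order_count
FROM silver_sales
GROUP BY order_date;
```

Scheduling this statement as a query refresh is one way to automate gold-table maintenance entirely within Databricks SQL.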



Knowledge Check



Knowledge check
Think about this question and volunteer an answer

Which of the following automations are


available in Databricks SQL? Select one
response.
A. Query refresh schedules
B. Dashboard refresh schedules
C. Alerts
D. All of the above



Knowledge check
Think about this question and volunteer an answer

Which of the following steps is optional when


sharing the most recent results of a dashboard?
Select one response.
A. Selecting a refresh interval
B. Selecting a SQL warehouse to use to refresh
the dashboard
C. Setting subscribers to be notified of dashboard
results
D. Setting the dashboard refresh schedule to Enabled
if it isn’t already
Knowledge check
Think about this question and volunteer an answer

If no destination is specified for an alert, where


will that alert be sent? Select one response.
A. The email address of the workspace
administrator
B. The email addresses of any subscribed
stakeholders
C. The email address of the creator of the alert
D. Nowhere; an error will be thrown when creating
the alert.
Knowledge check
Think about this question and volunteer an answer

What level of permissions is the owner of a


query granted on their query? Select one
response.
A. Can View
B. Can Run
C. Can Edit
D. Can Manage



Lab:
Databricks SQL in
Production
DAWD 01-5



Lab
Databricks SQL in Production
• Create automations
• Share a dashboard



Summary and Next
Steps



Earn a Databricks certification!
Certification helps you gain industry recognition, competitive
differentiation, greater productivity, and results.

• This course helps you prepare for the


Databricks Certified Data Analyst
Associate exam
• Recommended Self-Paced Courses
• Ingesting Data for Databricks SQL
• Integrating BI Tools with Databricks SQL
• Please see the Databricks Academy for
additional prep materials
For more information visit:
databricks.com/learn/certification



Thank you!
