AZURE DATABRICKS
• Databricks Introduction
• Databricks in Azure
o Databricks SQL
o Databricks Data Science and Engineering
o Databricks Machine Learning
• Pros and Cons of Azure Databricks
o Pros
o Cons
• Databricks SQL
o Data Management
o Computation Management
o Authorization
• Databricks Data Science & Engineering
o Workspace
o Interface
o Data Management
o Computation Management
o Databricks Runtime
o Job
o Model Management
o Authentication and Authorization
• Databricks Machine Learning
• Conclusion
Databricks Introduction
Databricks is a software company founded by the creators of Apache Spark. The company
has also created well-known software such as Delta Lake, MLflow, and Koalas, popular
open-source projects that span data engineering, data science, and machine learning.
Databricks develops a web-based platform for working with Spark that provides automated
cluster management and Python-style notebooks.
Databricks in Azure
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud
services platform. Azure Databricks offers three environments:
• Databricks SQL
• Databricks data science and engineering
• Databricks machine learning
Databricks SQL
Databricks SQL provides a user-friendly platform that helps analysts who work with SQL
queries to run queries against Azure Data Lake, create multiple visualizations, and build
and share dashboards.
Databricks Data Science and Engineering provides an interactive working environment for
data engineers, data scientists, and machine learning engineers. Data can be sent through
the big data pipeline in two ways: ingested in batches or streamed in near real time.
Pros and Cons of Azure Databricks
Pros
• It can process large amounts of data with Databricks, and since it is part of Azure, the
data is cloud-native.
• The clusters are easy to set up and configure.
• It has an Azure Synapse Analytics connector as well as the ability to connect to Azure
DB.
• It is integrated with Active Directory.
• It supports multiple languages. Scala is the main language, but it also works well with
Python, SQL, and R.
Cons
Databricks SQL
Databricks SQL allows you to run quick ad-hoc SQL queries on your data lake. Integration
with Azure Active Directory enables you to run complete Azure-based solutions by using
Databricks SQL. Through integration with Azure databases, Databricks SQL can query data
stored in Azure Synapse Analytics, Azure Cosmos DB, Data Lake Store, and Blob Storage.
Integration with Power BI allows users to discover and share insights more easily. BI
tools, such as Tableau Software, can also be used to access Databricks.
Databricks SQL objects can be automated through the REST API interface.
Data Management
• Visualization: A graphical presentation of the result of running a query.
• Dashboard: A presentation of query visualizations and commentary.
• Alert: A notification that a field returned by a query has reached a threshold.
Computation Management
The following terms are helpful to know for running SQL queries in Databricks SQL.
• Query: A valid SQL statement.
• SQL endpoint: A computation resource on which you execute SQL queries.
• Query history: A list of executed queries and their performance characteristics.
Authorization
• User and group: A user is an individual who has access to the system. A set of users is
known as a group.
• Personal access token: An opaque string used to authenticate to the REST API.
• Access control list (ACL): A set of permissions attached to a principal that requires
access to an object. The ACL specifies the object and the actions allowed on it.
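As a concrete illustration of the personal access token, it is passed to the Databricks REST API as a bearer token in the Authorization header. The sketch below builds, but does not send, an authenticated request; the workspace URL and token are placeholder values, not real credentials.

```python
import urllib.request

# Placeholder values -- substitute your own workspace URL and token.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi0123456789abcdef"  # a personal access token (an opaque string)

def build_request(endpoint: str) -> urllib.request.Request:
    """Build an authenticated request against the Databricks REST API."""
    return urllib.request.Request(
        url=f"{WORKSPACE_URL}{endpoint}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

# The clusters/list endpoint returns the clusters the caller may access,
# subject to the access control lists described above.
req = build_request("/api/2.0/clusters/list")
print(req.get_header("Authorization"))  # -> Bearer dapi0123456789abcdef
```

Sending the request (for example with `urllib.request.urlopen(req)`) would return JSON describing the clusters the token's owner is permitted to see.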
Databricks Data Science & Engineering
Databricks Data Science & Engineering comprises the complete open-source Apache Spark
cluster technologies and capabilities. Spark in Databricks Data Science & Engineering
includes the following components:
• Spark SQL and DataFrames: This is the Spark module for working with structured
data. A DataFrame is a distributed collection of data that is organized into named
columns. It is very similar to a table in a relational database or a data frame in R or
Python.
• Streaming: Real-time data processing and analysis for analytical and interactive
applications. It integrates with HDFS, Flume, and Kafka.
• MLlib: Short for Machine Learning Library, it consists of common learning algorithms
and utilities, including classification, regression, clustering, collaborative filtering,
and dimensionality reduction, as well as underlying optimization primitives.
• GraphX: Graphs and graph computation for a broad scope of use cases from cognitive
analytics to data exploration.
• Spark Core API: This has the support for R, SQL, Python, Scala, and Java.
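To make the DataFrame analogy above concrete, the stdlib sketch below uses sqlite3 to show the "named columns" shape that a DataFrame shares with a relational table. This is a local illustration of the concept, not Spark itself; in a Databricks notebook the equivalent would be a Spark SQL query over a registered DataFrame, and the table name and values here are invented for the example.

```python
import sqlite3

# A DataFrame is a distributed collection organized into named columns --
# conceptually the same shape as this relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Oslo", 12.5), ("Oslo", 3.0), ("Bergen", 8.2)],
)

# In Spark SQL this would be spark.sql(...) over a DataFrame; the SQL itself
# (grouping, aggregation, ordering) reads the same way.
rows = conn.execute(
    "SELECT city, SUM(distance_km) FROM trips GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # -> [('Bergen', 8.2), ('Oslo', 15.5)]
```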
Workspace
The workspace is the place for accessing all of your Azure Databricks assets. It organizes
objects into folders and provides access to data objects and computational resources.
Interface
Azure Databricks supports three interfaces for accessing its assets: the web UI, the REST
API, and the command-line interface (CLI).
Data Management
• Databricks File System (DBFS): It is an abstraction layer over the Blob store. It
contains directories that can contain files or more directories.
• Database: It is a collection of information that can be managed and updated.
• Table: Tables can be queried with Apache Spark SQL and Apache Spark APIs.
• Metastore: It stores information about various tables and partitions in the data
warehouse.
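One practical detail about DBFS worth illustrating: on Databricks cluster nodes, the Blob-store-backed DBFS is also exposed as a local filesystem mounted under /dbfs, so a `dbfs:/` URI has a local-path twin. The helper below is our own sketch of that mapping (the function name is not a Databricks API), with an invented file path as the example.

```python
def dbfs_to_local(path: str) -> str:
    """Translate a dbfs:/ URI to the local FUSE mount path.

    On cluster nodes, DBFS is mounted at /dbfs, so dbfs:/FileStore/x
    maps to /dbfs/FileStore/x. (Helper name is ours, for illustration.)
    """
    prefix = "dbfs:/"
    if not path.startswith(prefix):
        raise ValueError(f"not a DBFS URI: {path}")
    return "/dbfs/" + path[len(prefix):]

print(dbfs_to_local("dbfs:/FileStore/tables/sales.csv"))
# -> /dbfs/FileStore/tables/sales.csv
```

The local mount lets ordinary Python file APIs read and write DBFS files from a notebook, while Spark itself addresses the same data via the `dbfs:/` URI.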
Computation Management
• Cluster: A set of computation resources and configurations on which you run notebooks
and jobs.
• Pool: A set of idle, ready-to-use instances that reduce cluster start and auto-scaling
times.
Databricks Runtime
Databricks Runtime is the set of core components that run on the clusters managed by
Azure Databricks. Azure Databricks offers several runtimes:
• Databricks Runtime includes Apache Spark but also adds numerous other features to
improve big data analytics.
• Databricks Runtime for machine learning is built on Databricks runtime and provides a
ready environment for machine learning and data science.
• Databricks Runtime for genomics is a version of Databricks runtime that is optimized
for working with genomic and biomedical data.
• Databricks Light is the Azure Databricks packaging of the open-source Apache Spark
runtime.
Job
A job is a non-interactive mechanism for running a notebook or library, either immediately
or on a scheduled basis.
• Workload: There are two types of workloads with respect to the pricing schemes:
o Data engineering workload: This workload works on a job cluster.
o Data analytics workload: This workload runs on an all-purpose cluster.
• Execution context: The state for a REPL environment. Azure Databricks supports REPLs
for Python, R, Scala, and SQL.
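To tie the job and workload concepts together: a notebook job running on a job cluster (the data engineering workload above) is typically described by a JSON payload sent to the Jobs REST API. The sketch below only constructs such a payload; the notebook path, node type, worker count, and runtime version are placeholder values.

```python
import json

# Sketch of a Jobs API payload; all concrete values are placeholders.
job_spec = {
    "name": "nightly-etl",
    "new_cluster": {
        # A job cluster: created for the run, terminated when it finishes.
        "spark_version": "7.3.x-scala2.12",  # selects the Databricks Runtime
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
}

payload = json.dumps(job_spec, indent=2)
print(payload)
```

Note how `spark_version` is where the choice among the runtimes listed earlier (standard, ML, Light) is made, and `new_cluster` is what makes this a data engineering workload rather than one on an all-purpose cluster.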
Model Management
The following concepts are needed to understand how machine learning models are built:
• Model: A mathematical function that represents the relationship between inputs and
outputs. Machine learning consists of training and inference steps: a model is trained on
an existing data set and then used to predict the outcomes of new data.
• Run: It is a collection of parameters, metrics, and tags that are related to training a
machine learning model.
• Experiment: The primary unit of organization and access control for runs. Every MLflow
run belongs to an experiment.
• User and group: A user is an individual who has access to the system. A set of users is
a group.
• Access control list: Access control list (ACL) is a set of permissions that are attached to
a principal, which requires access to an object. ACL specifies the object and the actions
allowed on it.
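The train-then-infer split described under Model can be illustrated with a deliberately tiny stdlib example: fit a one-parameter linear model on an existing data set, then use the learned function to predict outcomes for new inputs. This stands in conceptually for what MLlib or an MLflow-tracked training run would do at scale; the data and function names are invented for the sketch.

```python
# Training: estimate w in y ~ w * x from an existing data set,
# using least squares through the origin: w = sum(x*y) / sum(x*x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so w should come out as 2.0

def train(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict(w, x):
    # Inference: apply the learned function to new data.
    return w * x

w = train(xs, ys)
print(w, predict(w, 10.0))  # -> 2.0 20.0
```

In MLflow terms, one execution of `train` with its parameters and resulting metrics would be recorded as a run, and repeated runs would be grouped under an experiment.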
Databricks Machine Learning
With Databricks Machine Learning, we also have access to all of the capabilities of the
Azure Databricks workspace, such as notebooks, clusters, jobs, data, Delta tables,
security and admin controls, and many more.
Conclusion
Azure Databricks is an easy, fast, and collaborative Apache Spark-based analytics platform.
It accelerates innovation by bringing together data science, data engineering, and the
business. This takes collaboration a step further and makes the process of data analytics
more productive, secure, scalable, and optimized for Azure.