AZURE DATABRICKS

Azure Databricks is a cloud-based analytics platform optimized for Microsoft Azure, offering environments for SQL, data science, and machine learning. It provides tools for data management, computation management, and model management, while also integrating with Azure services and BI tools. The platform has pros such as ease of use and cloud-native capabilities, but also cons like limited version control integration.



• Databricks Introduction
• Databricks in Azure
o Databricks SQL
o Databricks Data Science and Engineering
o Databricks Machine Learning
• Pros and Cons of Azure Databricks
o Pros
o Cons
• Databricks SQL
o Data Management
o Computation Management
o Authorization
• Databricks Data Science & Engineering
o Workspace
o Interface
o Data Management
o Computation Management
o Databricks Runtime
o Job
o Model Management
o Authentication and Authorization
• Databricks Machine Learning
• Conclusion

Databricks Introduction
Databricks is a software company founded by the creators of Apache Spark. The company
has also created well-known software such as Delta Lake, MLflow, and Koalas, popular
open-source projects that span data engineering, data science, and machine learning.
Databricks develops a web-based platform for working with Spark that provides automated
cluster management and Python-style notebooks.
Databricks in Azure
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud
services platform. Azure Databricks offers three environments:

• Databricks SQL
• Databricks data science and engineering
• Databricks machine learning

Databricks SQL

Databricks SQL provides a user-friendly platform. It helps analysts who work with SQL
to run queries on Azure Data Lake, create multiple visualizations, and build and
share dashboards.

Databricks Data Science and Engineering

Databricks data science and engineering provides an interactive working environment for data
engineers, data scientists, and machine learning engineers. There are two ways to send data
through the big data pipeline:

• Ingest it into Azure in batches through Azure Data Factory
• Stream it in real time by using Apache Kafka, Event Hubs, or IoT Hub

Databricks Machine Learning

Databricks machine learning is a complete machine learning environment. It provides managed
services for experiment tracking, model training, feature development and management, and
model serving.

Pros and Cons of Azure Databricks


We will discuss the pros and cons of Azure Databricks and understand how good it really is.
Pros

• Databricks can process large amounts of data, and since it is part of Azure, the
data is cloud-native.
• The clusters are easy to set up and configure.
• It has an Azure Synapse Analytics connector as well as the ability to connect to Azure
DB.
• It is integrated with Active Directory.
• It supports multiple languages. Scala is the main language, but it also works well with
Python, SQL, and R.

Cons

• Version control integration is limited; it does not integrate with Git or any other
versioning tool.
• It currently supports only HDInsight, not Azure Batch or AZTK.

Databricks SQL
Databricks SQL allows you to run quick ad-hoc SQL queries on your data lake. Integration
with Azure Active Directory enables you to run complete Azure-based solutions by using
Databricks SQL. By integrating with Azure data services, Databricks SQL can query data
in Azure Synapse Analytics, Azure Cosmos DB, Data Lake Store, and Blob Storage. Integration
with Power BI lets users discover and share insights more easily, and BI tools, such as
Tableau Software, can also be used to access Databricks.

The REST API is the interface that allows automation of Databricks SQL objects.
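As a sketch of that automation, the snippet below builds an authenticated request for Databricks SQL query objects using only the Python standard library. The workspace URL, token, and the exact endpoint path are placeholders for illustration; consult the Databricks REST API reference for the current paths.

```python
import json
import urllib.request

# Hypothetical values -- substitute your own workspace URL and personal
# access token; neither comes from this document.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapi-example-token"

def build_list_queries_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated GET request for Databricks SQL query objects.

    The endpoint path below is an assumption for illustration; check the
    Databricks REST API reference for the current path.
    """
    return urllib.request.Request(
        url=f"{workspace_url}/api/2.0/preview/sql/queries",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )

req = build_list_queries_request(WORKSPACE_URL, TOKEN)
# response = urllib.request.urlopen(req)   # network call, not executed here
# queries = json.loads(response.read())
print(req.full_url)
```

The same pattern (bearer token in the Authorization header, JSON request and response bodies) applies to the other Databricks REST API endpoints mentioned later in this article.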

Data Management

It has three parts:

• Visualization: A graphical presentation of the result of running a query
• Dashboard: A presentation of query visualizations and commentary
• Alert: A notification that a field returned by a query has reached a threshold

Computation Management
The following terms describe how SQL queries run in Databricks SQL:

• Query: A valid SQL statement
• SQL endpoint: A resource where SQL queries are executed
• Query history: A list of previously executed queries and their characteristics

Authorization

• User and group: A user is an individual who has access to the system; a group is a
set of users.
• Personal access token: An opaque string used to authenticate to the REST API.
• Access control list (ACL): A set of permissions attached to a principal that requires
access to an object. The ACL specifies the object and the actions allowed on it.
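These concepts can be illustrated with a minimal, plain-Python model of an ACL. This is a conceptual sketch only, not the actual Azure Databricks permission implementation; all names in it are invented for illustration.

```python
# A minimal model of the authorization concepts above: a principal (user
# or group) is granted actions on an object, and access checks consult
# the ACL. Purely illustrative, not Databricks code.

class ACL:
    def __init__(self):
        # Maps (principal, object) -> set of allowed actions.
        self._entries = {}

    def grant(self, principal: str, obj: str, action: str) -> None:
        self._entries.setdefault((principal, obj), set()).add(action)

    def is_allowed(self, principal: str, obj: str, action: str) -> bool:
        return action in self._entries.get((principal, obj), set())

acl = ACL()
acl.grant("analysts", "sales_dashboard", "view")   # group-level permission
print(acl.is_allowed("analysts", "sales_dashboard", "view"))   # True
print(acl.is_allowed("analysts", "sales_dashboard", "edit"))   # False
```

Granting to a group rather than individual users, as shown, is the usual way to keep permissions manageable.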

Databricks Data Science & Engineering


Databricks Data Science & Engineering is sometimes also called the Workspace. It is an
analytics platform based on Apache Spark.

Databricks Data Science & Engineering comprises complete open-source Apache Spark
cluster technologies and capabilities. Spark in Databricks Data Science & Engineering
includes the following components:

• Spark SQL and DataFrames: The Spark module for working with structured data. A
DataFrame is a distributed collection of data organized into named columns. It is
very similar to a table in a relational database or a data frame in R or Python.
• Streaming: Real-time data processing and analysis for analytical and interactive
applications. It integrates with HDFS, Flume, and Kafka.
• MLlib: Short for Machine Learning Library, it consists of common learning
algorithms and utilities, including classification, regression, clustering,
collaborative filtering, and dimensionality reduction, as well as underlying
optimization primitives.
• GraphX: Graphs and graph computation for a broad scope of use cases, from cognitive
analytics to data exploration.
• Spark Core API: Support for R, SQL, Python, Scala, and Java.
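The DataFrame idea above can be conveyed with plain Python, without a Spark cluster. In the sketch below, "partitions" are just lists standing in for data held on different nodes; real Spark distributes the same per-partition work across the cluster. The data and helper names are invented for illustration.

```python
# Plain-Python intuition for a DataFrame: a distributed collection of
# rows organized into named columns. Here each "partition" is a list of
# rows that, in Spark, would live on a different cluster node.

partitions = [
    [{"name": "a", "amount": 10}, {"name": "b", "amount": 25}],  # "node" 1
    [{"name": "c", "amount": 40}],                               # "node" 2
]

def select(parts, column):
    # Column projection runs independently on each partition, then combines.
    return [row[column] for part in parts for row in part]

def filter_rows(parts, predicate):
    # Filtering also runs per partition; no data needs to move between nodes.
    return [[row for row in part if predicate(row)] for part in parts]

big = filter_rows(partitions, lambda r: r["amount"] > 20)
print(select(big, "name"))   # ['b', 'c']
```

Operations that stay within partitions, like these, are what let Spark scale out; operations that must combine rows across partitions (joins, groupings) are the expensive ones.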

Workspace

Workspace is the place for accessing all Azure Databricks assets. It organizes objects into
folders and provides access to data objects and computational resources.

The workspace contains:

• Dashboard: Provides access to visualizations.
• Library: A package available to a notebook or job running on a cluster. We can also
add our own libraries.
• Repo: A folder whose contents are co-versioned together by syncing them to a remote
Git repository.
• Experiment: A collection of MLflow runs for training an ML model.

Interface

It supports three interfaces: UI, API, and command-line interface (CLI).

• UI: Provides a user-friendly interface to workspace folders and their resources.
• REST API: There are two versions, REST API 2.0 and REST API 1.2. REST API 2.0
supports the features of REST API 1.2 along with additional features, so it is the
preferred version.
• CLI: An open-source project available on GitHub, built on REST API 2.0.

Data Management

• Databricks File System (DBFS): An abstraction layer over Azure Blob storage. It
contains directories that can contain files or further directories.
• Database: It is a collection of information that can be managed and updated.
• Table: Tables can be queried with Apache Spark SQL and Apache Spark APIs.
• Metastore: It stores information about various tables and partitions in the data
warehouse.

Computation Management

To run computations in Azure Databricks, we need to know about the following:

• Cluster: A set of computation resources and configurations on which we can run
notebooks and jobs. Clusters are of two types:
o All-purpose: We create an all-purpose cluster by using the UI, CLI, or REST API.
We can manually terminate and restart an all-purpose cluster, and multiple users
can share such clusters for collaborative, interactive analysis.
o Job: The Azure Databricks job scheduler creates a job cluster when we run a job
on a new job cluster and terminates the cluster when the job is complete. We
cannot restart a job cluster.
• Pool: A set of idle, ready-to-use instances that reduce cluster start and auto-
scaling times. If the pool does not have enough resources, it expands itself. When an
attached cluster is terminated, the instances it used are returned to the pool and
can be reused by a different cluster.
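Creating a cluster programmatically means sending a JSON payload like the one sketched below to the Clusters REST API (e.g. POST {workspace}/api/2.0/clusters/create). The field names follow the Databricks Clusters API; the concrete values (runtime version, Azure VM size) are assumptions for illustration only.

```python
import json

# Sketch of a create-cluster payload for an all-purpose cluster. The
# specific spark_version and node_type_id values are placeholders; list
# valid ones with the API or the cluster-creation UI in your workspace.
payload = {
    "cluster_name": "shared-analysis",
    "spark_version": "10.4.x-scala2.12",   # example Databricks Runtime version
    "node_type_id": "Standard_DS3_v2",     # example Azure VM size
    "num_workers": 2,
    "autotermination_minutes": 60,         # auto-terminate when idle
}
body = json.dumps(payload)
print(body)
```

Setting autotermination_minutes is worth the habit: an idle all-purpose cluster otherwise keeps billing until someone terminates it manually.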

Databricks Runtime

Databricks Runtime comprises the core components that run on clusters managed by Azure
Databricks. Azure Databricks offers several runtimes:

• Databricks Runtime includes Apache Spark but also adds numerous other features that
improve big data analytics.
• Databricks Runtime for machine learning is built on Databricks runtime and provides a
ready environment for machine learning and data science.
• Databricks Runtime for genomics is a version of Databricks runtime that is optimized
for working with genomic and biomedical data.
• Databricks Light is the Azure Databricks packaging of the open-source Apache Spark
runtime.
Job

• Workload: There are two types of workloads with respect to the pricing schemes:
o Data engineering workload: This workload works on a job cluster.
o Data analytics workload: This workload runs on an all-purpose cluster.
• Execution context: It is the state of a REPL environment. It supports Python, R, Scala,
and SQL.
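The two workload types map directly to how a job's cluster is specified in the Jobs API. The sketch below contrasts a data engineering workload on a new job cluster with a data analytics workload on an existing all-purpose cluster; field names follow the Databricks Jobs API, and all values are invented placeholders.

```python
import json

# Data engineering workload: the job scheduler creates a fresh job
# cluster from "new_cluster", runs the task, then terminates the cluster.
engineering_job = {
    "name": "nightly-etl",
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/ETL/nightly"},
}

# Data analytics workload: the task runs on an already-running
# all-purpose cluster, referenced by its ID.
analytics_job = {
    "name": "adhoc-report",
    "existing_cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
    "notebook_task": {"notebook_path": "/Reports/adhoc"},
}

print(json.dumps(engineering_job)["" == "" and 0:60])
```

Because the two payloads differ only in the cluster block, the same notebook can move between pricing tiers just by switching how its cluster is specified.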

Model Management

The concepts needed to understand how machine learning models are built are:

• Model: A mathematical function that represents the relationship between inputs and
outputs. Machine learning consists of training and inference steps: we train a model
on an existing data set and then use it to predict the outcomes of new data.
• Run: A collection of parameters, metrics, and tags related to training a machine
learning model.
• Experiment: The primary unit of organization and access control for runs. All
MLflow runs belong to an experiment.
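A plain-Python sketch makes the run/experiment relationship concrete. MLflow provides the real implementation; the classes below only illustrate what a run records (parameters, metrics, tags) and how runs group under an experiment, and every name in them is invented.

```python
# Conceptual sketch of MLflow-style model management, not MLflow itself.

class Run:
    """One training attempt: its parameters, metrics, and tags."""
    def __init__(self):
        self.params, self.metrics, self.tags = {}, {}, {}

class Experiment:
    """Primary unit of organization and access control for runs."""
    def __init__(self, name):
        self.name = name
        self.runs = []

    def start_run(self):
        run = Run()
        self.runs.append(run)   # every run belongs to an experiment
        return run

exp = Experiment("churn-model")
run = exp.start_run()
run.params["learning_rate"] = 0.1    # training parameter
run.metrics["accuracy"] = 0.92       # evaluation metric
run.tags["team"] = "data-science"
print(len(exp.runs))   # 1
```

Comparing runs within an experiment, each with its own parameters and metrics, is how a team picks which trained model to promote.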

Authentication and Authorization

• User and group: A user is an individual who has access to the system. A set of users is
a group.
• Access control list: Access control list (ACL) is a set of permissions that are attached to
a principal, which requires access to an object. ACL specifies the object and the actions
allowed on it.

Databricks Machine Learning


Databricks machine learning is an integrated end-to-end machine learning platform
incorporating managed services for experiment tracking, model training, feature development
and management, and feature and model serving. Databricks machine learning automates the
creation of a cluster that is optimized for machine learning. Databricks Runtime ML clusters
include the most popular machine learning libraries such as TensorFlow, PyTorch, Keras, and
XGBoost. It also includes libraries, such as Horovod, that are required for distributed
training.

With Databricks machine learning, we can:

• Train models either manually or with AutoML
• Track training parameters and models by using experiments with MLflow tracking
• Create feature tables and access them for model training and inference
• Share, manage, and serve models by using Model Registry

We also have access to all of the capabilities of Azure Databricks workspace such as
notebooks, clusters, jobs, data, Delta tables, security and admin controls, and many more.

Conclusion
Azure Databricks is an easy, fast, and collaborative Apache Spark-based analytics platform. It
accelerates innovation by bringing together data science, data engineering, and the business.
This takes collaboration a step further and makes the process of data analytics more
productive, secure, scalable, and optimized for Azure.
