Cosmos DB Real-time Advanced Analytics Workshop


Sri Chintala, Microsoft
#UnifiedDataAnalytics #SparkAISummit

Today’s customer scenario
• Woodgrove Bank provides payment processing services
for commerce.
• They want to build a PoC of an innovative online fraud detection
solution.
• Goal: Monitor fraud in real time across millions of
transactions to prevent financial loss and detect
widespread attacks.

Part 1: Customer Scenario
• Woodgrove Bank’s customers, the end merchants,
are all around the world.
• The right solution would minimize the latency
customers experience when using the service by
deploying the solution as close as possible to
the regions where those customers operate.

Part 1: Customer scenario
• Woodgrove has decades' worth of historical transaction data, including
transactions identified as fraudulent.
• The data is in tabular format and can be exported to CSV files.
• The analysts are very interested in the recent notebook-driven approach
to data science & data engineering tasks.
• They would prefer a solution that features notebooks to explore and
prepare data, model, & define the logic for scheduled processing.

Part 1: Customer needs
• Provide fraud detection services to merchant customers, using incoming
payment transaction data to provide early warning of fraudulent activity.
• Schedule offline scoring of “suspicious activity” using a trained model, and make
the results globally available.
• Store data from streaming sources into long-term storage without interfering
with read jobs.
• Use a standard platform that supports near-term data pipeline needs and serves as
the long-term standard for data science, data engineering, & development.

Part 2: Design the solution (10 min)
• Design a solution and prepare to present the
solution to the target customer audience in a
chalk-talk format.

Part 3: Discuss preferred solution

Preferred solution – Data Ingest
• Payment transactions can be ingested in real time using Event Hubs
or Azure Cosmos DB.
• Factors to consider are:
  • rate of flow (how many transactions/second)
  • data source and compatibility
  • level of effort to implement
  • long-term storage needs

• Cosmos DB:
  • Is optimized for high write throughput.
  • Provides streaming through its change feed.
  • Supports TTL (time to live): items expire automatically, which saves on storage cost.
• Event Hubs:
  • Data streams through and can be persisted (Capture) in Blob storage or ADLS.
• Both guarantee event ordering per partition, so how you partition
your data matters with either service.
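
To make the per-partition ordering point concrete, here is a minimal sketch in plain Python. It is not how Event Hubs or Cosmos DB hash keys internally; the hash function, the `accountId` field, and the partition count are all assumptions chosen for illustration. It shows that events sharing a partition key land in one partition and keep their relative order, while nothing is promised across partitions.

```python
import hashlib
from collections import defaultdict


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def route(events, partition_count):
    """Append each event to its partition's log, preserving arrival order."""
    logs = defaultdict(list)
    for event in events:
        # "accountId" is a hypothetical partition key for payment events.
        logs[partition_for(event["accountId"], partition_count)].append(event)
    return logs


events = [
    {"accountId": "A-1", "seq": 1},
    {"accountId": "B-7", "seq": 1},
    {"accountId": "A-1", "seq": 2},
    {"accountId": "A-1", "seq": 3},
    {"accountId": "B-7", "seq": 2},
]
logs = route(events, partition_count=4)
```

If fraud rules depend on seeing one account's transactions in order, partitioning by that account's identifier is one way to get it; a key with few distinct values would instead concentrate traffic on a few partitions.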

• Cosmos DB is likely easier for Woodgrove to integrate because they
are already writing payment transactions to a database.
• Cosmos DB multi-master accepts writes from any region (failover
automatically redirects to the next available region).
• Event Hubs requires multiple instances in different geographies
(failover requires more planning).
• Recommendation: Cosmos DB. Think of it as a “persistent event store”.

Preferred solution – Data pipeline processing
• Azure Databricks:
  • Managed Spark environment that can process streaming & batch data.
  • Supports data science, data engineering, and development needs.
• Features it provides on top of standard Apache Spark include:
  • AAD integration and RBAC
  • Collaborative features such as workspace and git integration
  • Scheduled jobs for automatic notebook/library execution
  • Integration with Azure Key Vault
  • Training and evaluating machine learning models at scale

• Azure Databricks can connect to both Event Hubs and Cosmos DB, using
Spark connectors for both.
• Use Spark Structured Streaming to process real-time payment transactions into
Databricks Delta tables.
• Be sure to set a checkpoint directory on your streams. This allows you to
restart stream processing from where it left off if the job is stopped at any point.
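
A minimal sketch of the checkpoint advice, assuming `transactions` is already a streaming DataFrame (for example, one read through the Event Hubs or Cosmos DB connector). The function name and both paths are hypothetical; `writeStream`, `format("delta")`, `outputMode`, and the `checkpointLocation` option are standard Structured Streaming and Delta calls.

```python
def write_transactions_to_delta(transactions, table_path, checkpoint_path):
    """Start a streaming write of `transactions` into a Delta table.

    `transactions` is assumed to be a Spark streaming DataFrame.
    Returns the handle that `start()` gives back (a StreamingQuery in Spark).
    """
    return (
        transactions.writeStream
        .format("delta")
        .outputMode("append")
        # Spark records stream progress under this path, so a restarted
        # query can resume from where the previous run stopped.
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

Each stream should get its own checkpoint directory; reusing one directory for two different queries would mix up their progress information.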

• Store secrets such as account keys and connection strings centrally in
Azure Key Vault.
• Set Key Vault as the source for secret scopes in Azure Databricks. Secrets
are [REDACTED].
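
As a rough illustration of reading from a Key Vault-backed secret scope inside a notebook, the sketch below builds a connector configuration from secrets. `dbutils.secrets.get(scope=..., key=...)` is the Databricks utility call; the secret names `cosmos-uri` and `cosmos-key` are hypothetical, and the option keys (`Endpoint`, `Masterkey`, `Database`, `Collection`) are an assumption that depends on which Cosmos DB Spark connector version is installed.

```python
def load_cosmos_config(dbutils, scope, database, collection):
    """Build Cosmos DB connector options from a secret scope.

    `dbutils` is the object Databricks provides in notebooks. The secret
    names and option keys below are assumptions for illustration.
    """
    return {
        "Endpoint": dbutils.secrets.get(scope=scope, key="cosmos-uri"),
        "Masterkey": dbutils.secrets.get(scope=scope, key="cosmos-key"),
        "Database": database,
        "Collection": collection,
    }
```

Passing `dbutils` in as a parameter keeps the function easy to exercise outside a notebook with a stand-in object.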

• Databricks Delta tables are Spark tables with built-in reliability
and performance optimizations.
• They support batch & streaming with additional features:
  • ACID transactions: Multiple writers can simultaneously modify data
    without interfering with jobs reading the data set.
  • DELETES/UPDATES/UPSERTS:
  • Automatic file management: Data access speeds up by organizing data into
    large files that can be read efficiently.
  • Statistics and data skipping: Reads are 10-100x faster when statistics are
    tracked about the data in each file, which lets queries skip irrelevant
    data.
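
The upsert capability listed above can be sketched with a Delta `MERGE INTO` statement issued from a notebook. This is an illustration under stated assumptions: the function name, table name, view name, and key column are hypothetical, and `spark` is the session a Databricks notebook provides.

```python
def upsert_transactions(spark, target_table, updates_view, key_column):
    """Run a Delta MERGE that updates matching rows and inserts new ones.

    Identifiers are interpolated directly into the SQL text, so pass only
    trusted table and column names.
    """
    statement = (
        f"MERGE INTO {target_table} AS t "
        f"USING {updates_view} AS s "
        f"ON t.{key_column} = s.{key_column} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )
    return spark.sql(statement)
```

A call might look like `upsert_transactions(spark, "scored_transactions", "updates", "transactionId")`, where all three names are placeholders.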

Preferred solution – Model training & deployment
• Azure Databricks supports machine learning training at scale.
• Train the model using historical payment transaction data.
Preferred solution – Model training & deployment
• Use Azure Machine Learning service (AML) to:
  • Register the trained model.
  • Deploy it to an Azure Kubernetes Service (AKS) cluster for easy web
    accessibility and high availability.
• For scheduled batch scoring, access the model from a notebook and
write the results to Cosmos DB via the Cosmos DB Spark connector.

Preferred solution – serving pre-scored data
Use Cosmos DB for storing offline suspicious transaction data globally.
• Add applicable customer regions.
• Estimate RU/s needed. Cosmos DB can scale up & down to handle the workload.
• Consistency: Session consistency.
• Partition key: Choose one that gives an even distribution of request volume & storage.
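
For the RU/s estimate, a back-of-the-envelope sketch may help. The per-operation costs are assumptions based on the common rule of thumb for small (about 1 KB) items: roughly 1 RU per point read and roughly 5 RU per write. The 20% headroom and the rounding to 100 RU/s steps are also assumptions; real costs depend on item size, indexing, and query shape, so measure request charges on actual operations before provisioning.

```python
def estimate_rus(reads_per_sec, writes_per_sec,
                 read_cost_ru=1, write_cost_ru=5, headroom_pct=20):
    """Estimate provisioned RU/s, rounded up to the next multiple of 100.

    All cost figures are rule-of-thumb assumptions, not measured values.
    """
    base = reads_per_sec * read_cost_ru + writes_per_sec * write_cost_ru
    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = base * (100 + headroom_pct)   # equals 100 * padded estimate
    hundreds = -(-scaled // 10000)         # ceiling division
    return int(hundreds * 100)


# 2,000 point reads/s and 500 writes/s of small items:
# base = 2000*1 + 500*5 = 4500 RU/s; with 20% headroom = 5400 RU/s.
estimate = estimate_rus(2000, 500)
```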

Preferred solution – Long-term storage
• Use Azure Data Lake Storage Gen2 (ADLS Gen2) as the underlying
long-term file store for Databricks Delta tables.
• Databricks Delta can compact small files together into larger files
up to 1 GB in size using the OPTIMIZE command. This can improve
query performance over time.
• Define file paths in ADLS for query, dimension, and summary
tables. Point to those paths when saving to Delta.
• Delta tables can be accessed by Power BI through a JDBC
connector.
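
A short sketch of the compaction step, assuming a Databricks notebook where `spark` is available and the Delta tables were saved to known paths. The function name and the example paths are hypothetical; ``OPTIMIZE delta.`<path>` `` is the Databricks Delta SQL form for path-based tables.

```python
def compact_delta_tables(spark, table_paths):
    """Run OPTIMIZE on each path-based Delta table and return the statements issued."""
    statements = []
    for path in table_paths:
        statement = f"OPTIMIZE delta.`{path}`"
        spark.sql(statement)
        statements.append(statement)
    return statements
```

This could run as a scheduled Databricks job, for example over hypothetical paths such as `/mnt/woodgrove/transactions` and `/mnt/woodgrove/summary`.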

Preferred solution – Dashboards & Reporting
• Connect to Databricks Delta tables from Power BI to allow
analysts to build reports and dashboards.
• The connection can be made using a JDBC connection string to
an Azure Databricks cluster. Querying the tables is similar to
querying a more traditional relational database.
• Data scientists and data engineers can use Azure Databricks
notebooks to craft complex queries and data visualizations.

Preferred solution – Dashboards & Reporting
• A more cost-effective option for serving summary data to
business analysts in Power BI is Azure Analysis
Services.
• It eliminates the need to keep a dedicated Databricks cluster running
at all times for reporting and analysis.
• Data is stored in a tabular semantic data model.
• Write to it during stream processing (using rolling aggregates).
• Schedule batch writes via a Databricks job or ADF.

Participant Guide
• https://aka.ms/cosmos-mcw

