#sqlsatParma #sqlsat462 | November 28, 2015
Azure Stream Analytics
Marco Parenzan
@marco_parenzan
Sponsors
Organizers
getlatestversion.it
Meet Marco Parenzan | @marco_parenzan
 Microsoft MVP 2015 for Azure
 Develop modern distributed and cloud solutions
 marco.parenzan@1nn0va.it
 Passion for speaking and inspiring programmers, students, people
 www.innovazionefvg.net
 I’m a developer!
Agenda
 Why a developer talks about analytics
 Analytics in a modern world
 Introduction to Azure Stream Analytics
 Stream Analytics Query Language (SAQL)
 Handling time in Azure Stream Analytics
 Scaling Analytics
 Conclusions
ANALYTICS
IN A MODERN WORLD
What is Analytics
 From Wikipedia:
 Analytics is the discovery and communication of meaningful patterns in data.
 Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.
 Analytics often favors data visualization to communicate insight.
IoT proof of concept
Event-based systems
 An event is “something happened…
 …somewhere…
 …sometime!”
 Events arrive at different times, i.e. have unique timestamps
 Events arrive at different rates (events/sec)
 In any given period of time there may be 0, 1 or more events
Azure Service Bus
 Relay: NAT and firewall traversal service; request/response services; unbuffered, with TCP throttling
 Queue: transactional cloud AMQP/HTTP broker; high-scale, high-reliability messaging; sessions, scheduled delivery, etc.
 Topic: transactional message distribution; up to 2,000 subscriptions per topic; up to 2K/100K filter rules per subscription
 Notification Hub: high-scale notification distribution; most mobile push notification services; millions of notification targets
 Event Hub: hyper scale; a million concurrent clients
Azure Event Hubs
 Event producers: > 1M producers, > 1 GB/sec aggregate throughput
 Events are routed to partitions directly or by hash
 Throughput Units:
 • 1 ≤ TUs ≤ Partition Count
 • TU: 1 MB/s writes, 2 MB/s reads
Microsoft Azure IoT Services
 Device Connectivity: Event Hubs, Service Bus, External Data Sources
 Storage: SQL Database, Table/Blob Storage, DocumentDB
 Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory
 Presentation & Action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services
ANALYTICS
IN A MODERN WORLD
Traditional analytics
 Everything around us produces data
 From devices, sensors, infrastructures and applications
 Traditional Business Intelligence first collects data and analyzes it afterwards
 Typically 1 day latency: the analysis arrives the day after
 But we live in a fast-paced world
 Social media
 Internet of Things
 Just-in-time production
 Offline data quickly loses its value
 For many organizations, capturing and storing event data for later analysis is no longer enough
Data at Rest
Analytics in a modern world
 We work with streaming data
 We want to monitor and analyze data in near real time
 Typically a few seconds up to a few minutes latency
 So we don’t have time to stop, copy and analyze the data; we have to work with streams of data
Data in motion
Scenarios
 Real-time ingestion, processing and archiving of data
 Real-time Analytics
 Connected devices (Internet of Things)
Why Stream Analytics in the Cloud?
 Not all data is local
 Event data is already in the Cloud
 Event data is globally distributed
 Bring the processing to the data, not the data to the processing
Apply cloud principles
 Focus on building solutions (PaaS or SaaS)
 Without having to manage complex infrastructure and software
 No hardware or other up-front costs, and no time-consuming installation or setup
 Elastic scale: resources are efficiently allocated and paid for as requested
 Scale to any volume of data while still achieving high throughput, low latency, and guaranteed resiliency
 Up and running in minutes
SCENARIO
demo
An API can be a “thing”
 API Apps, Logic Apps
 World-wide distributed (REST) APIs
 Resource consuming (CPU, storage, network bandwidth)
 Each request is logged
 With Event Hub or in log files
 Evaluate how the API is performing
 “Real time” statistics (see the sketch below)
 E.g. ASP.NET apps logging directly to Event Hub
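A sketch of the kind of query such a scenario might run; none of these names are from the deck — the input ApiRequestStream, its fields ApiPath, DurationMs and RequestTime, and the output name are all illustrative:

-- Hypothetical: request count and average latency per API path,
-- every 20 seconds, from an Event Hub input of request logs
SELECT ApiPath, COUNT(*) AS Requests, AVG(DurationMs) AS AvgDurationMs
INTO Output
FROM ApiRequestStream TIMESTAMP BY RequestTime
GROUP BY ApiPath, TumblingWindow(second, 20)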
INTRODUCTION TO
AZURE STREAM ANALYTICS
What is Azure Stream Analytics?
 Azure Stream Analytics is a cost-effective event processing engine
 Developers describe their desired transformations in a SQL-like syntax
 It is a stream processing engine integrated with a scalable event queuing system like Azure Event Hubs
Canonical Stream Analytics Pattern
Real-time analytics
 Intake millions of events per second (up to 1 GB/s)
 Scale that accommodates variable loads
 Low processing latency, auto adaptive (sub-second to seconds)
 Transform, augment, correlate, temporal operations
 Correlate between different streams, or with reference data
 Find patterns, or lack of patterns, in data in real time
No challenges with scale
 Elasticity of the cloud for scale out
 Spin up any number of resources on demand
 Scale from small to large when required
 Distributed, scale-out architecture
Fully managed
 No hardware (PaaS offering)
 Bypasses deployment expertise
 No software provisioning and maintaining
 No performance tuning
 Spin up any number of resources on demand
 Expand your business globally leveraging Azure regions
Mission critical availability
 Guaranteed event delivery
 Guaranteed not to lose events or produce incorrect output
 Guaranteed “once and only once” delivery of events
 Ability to replay events
 Guaranteed business continuity
 Guaranteed uptime (three nines of availability)
 Auto-recovery from failures
 Built-in state management for fast recovery
 Effective audits
 Privacy and security properties of solutions are evident
 Azure integration for monitoring and ops alerting
Lower costs
 Efficiently pay only for usage
 Architected for multi-tenancy
 Not paying for idle resources
 Typical cloud expense model
 Low startup costs
 Ability to incrementally add resources
 Reduce costs when business needs change
Rapid development
 SQL-like language
 High-level: focus on the stream analytics solution
 Concise: less code to maintain
 First-class support for event streams and reference data
 Built-in temporal semantics
 Built-in temporal windowing and joining
 Simple policy configuration to manage out-of-order events and late arrivals
Azure Stream Analytics
Collect → Process → Deliver → Consume
 Event Inputs: Event Hub, Azure Blob
 Reference Data: Azure Blob
 Transform: temporal joins, filters, aggregates, projections, windows, etc.; enrich and correlate
 Outputs: SQL Azure, Azure Blobs, Event Hub, Service Bus Queue, Service Bus Topics, Table storage, PowerBI
 Temporal semantics
 Guaranteed delivery
 Guaranteed up time
Input sources for a Stream Analytics Job
• Currently supported input Data Streams are Azure Event Hub, Azure IoT Hub and Azure Blob Storage. Multiple input Data Streams are supported.
• Advanced options let you configure how the Job will read data from the input blob (which folders to read from, when a blob is ready to be read, etc.).
• Reference data is usually static or changes very slowly over time.
• Must be stored in Azure Blob Storage.
• Cached for performance.
Defining Event Schema
• The serialization format and the encoding for the input data sources (both Data Streams and Reference Data) must be defined.
• Currently three formats are supported: CSV, JSON and Avro (binary JSON; see https://avro.apache.org/docs/1.7.7/spec.html)
• For the CSV format a number of common delimiters are supported: comma (,), semicolon (;), colon (:), tab and space.
• For CSV and Avro you can optionally provide the schema for the input data.
Output for Stream Analytics Jobs
Data stores currently supported as outputs:
 Azure Blob storage: creates log files with temporal query results; ideal for archiving
 Azure Table storage: more structured than blob storage, easier to set up than SQL database, and durable (in contrast to event hub)
 SQL database: stores results in an Azure SQL Database table; ideal as a source for traditional reporting and analysis
 Event hub: sends an event to an event hub; ideal to generate actionable events such as alerts or notifications
 Service Bus Queue: sends an event to a queue; ideal for sending events sequentially
 Service Bus Topics: sends an event to subscribers; ideal for sending events to many consumers
 PowerBI.com: ideal for near real-time reporting!
 DocumentDb: ideal if you work with JSON and object graphs
PREPARATION
demo
STREAM ANALYTICS
QUERY LANGUAGE (SAQL)
SAQL – Language & Library
 Query language: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC/DESC, WITH, PARTITION BY, OVER
 Date/time functions: DateName, DatePart, Day, Month, Year, DateTimeFromParts, DateDiff, DateAdd
 Windowing: TumblingWindow, HoppingWindow, SlidingWindow, Duration
 Aggregates: Sum, Count, Avg, Min, Max, StDev, StDevP, Var, VarP
 String functions: Len, Concat, CharIndex, Substring, PatIndex
 Analytic functions: Lag, IsFirst, CollectTop
Supported types
 bigint: integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
 float: floating point numbers in the range -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308
 nvarchar(max): text values, comprised of Unicode characters (note: a value other than max is not supported)
 datetime: a date combined with a time of day with fractional seconds, based on a 24-hour clock and relative to UTC (time zone offset 0)
Inputs will be cast into one of these types
We can control these types with a CREATE TABLE statement; this does not create a table, but just a data type mapping for the inputs (see the sketch below)
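A minimal sketch of such a mapping, assuming a toll-booth input named EntryStream (the column names here are illustrative):

-- Does not create a table: it only fixes the data types
-- used when reading events from the input 'EntryStream'
CREATE TABLE EntryStream (
    TollId bigint,
    EntryTime datetime,
    LicensePlate nvarchar(max),
    Weight float
)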
INTO clause
 Pipelining data from input to output
 Without an INTO clause we write to the destination named ‘output’
 We can have multiple outputs
 With the INTO clause we can choose the appropriate destination for every select
 E.g. send events to blob storage for big data analysis, but send special events to an event hub for alerting (see the sketch below)
SELECT UserName, TimeZone
INTO Output
FROM InputStream
WHERE Topic = 'XBox'
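A sketch of that split, assuming two outputs configured on the job and named BlobOutput and AlertEventHub, and a hypothetical Priority field in the events:

-- Every event goes to blob storage for later big data analysis
SELECT * INTO BlobOutput FROM InputStream

-- Only alert-worthy events go to an event hub
SELECT UserName, Topic INTO AlertEventHub
FROM InputStream
WHERE Priority = 'High'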
WHERE clause
 Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery
 There is no limit to the number of predicates that can be included in a search condition
SELECT UserName, TimeZone
FROM InputStream
WHERE Topic = 'XBox'
JOIN
 We can combine multiple event streams, or an event stream with reference data, via a join (inner join) or a left outer join
 In the join clause we can specify the time window in which we want the join to take place
 We use a special version of DateDiff for this (see the Temporal Joins slide for an example)
Reference Data
 Seamless correlation of event streams with reference data
 Static or slowly-changing data stored in blobs
 CSV and JSON files in Azure Blobs
 Scanned for new snapshots on a settable cadence
 JOIN (INNER or LEFT OUTER) between streams and reference data sources
 Reference data appears like another input:
SELECT myRefData.Name, myStream.Value
FROM myStream
JOIN myRefData
ON myStream.myKey = myRefData.myKey
Reference data tips
 Currently reference data cannot be refreshed automatically
 You need to stop the job and specify a new snapshot with the reference data
 Reference data lives only in Blob storage
 In practice you use services like Azure Data Factory to move data from Azure data sources to Azure Blob Storage
 Have you followed Francesco Diaz’s session?
UNION
SELECT TollId, ENTime AS Time, LicensePlate FROM EntryStream TIMESTAMP BY ENTime
UNION
SELECT TollId, EXTime AS Time, LicensePlate FROM ExitStream TIMESTAMP BY EXTime

EntryStream:
TollId | EntryTime               | LicensePlate | …
1      | 2014-09-10 12:01:00.000 | JNB7001      | …
1      | 2014-09-10 12:02:00.000 | YXZ1001      | …
3      | 2014-09-10 12:02:00.000 | ABC1004      | …

ExitStream:
TollId | ExitTime                | LicensePlate
1      | 2009-06-25 12:03:00.000 | JNB7001
1      | 2009-06-25 12:03:00.000 | YXZ1001
3      | 2009-06-25 12:04:00.000 | ABC1004

Result:
TollId | Time                    | LicensePlate
1      | 2014-09-10 12:01:00.000 | JNB7001
1      | 2014-09-10 12:02:00.000 | YXZ1001
3      | 2014-09-10 12:02:00.000 | ABC1004
1      | 2009-06-25 12:03:00.000 | JNB7001
1      | 2009-06-25 12:03:00.000 | YXZ1001
3      | 2009-06-25 12:04:00.000 | ABC1004
STORING, FILTERING
AND DECODING
demo
HANDLING TIME IN AZURE
STREAM ANALYTICS
Traditional queries
 Traditional querying assumes the data doesn’t change while you are querying it:
 We query a fixed state
 If the data is changing, snapshots and transactions ‘freeze’ the data while we query it
 Since we query a finite state, our query should finish in a finite amount of time
[diagram: table → query → result table]
A different kind of query
 When analyzing a stream of data, we deal with a potentially infinite amount of data
 As a consequence our query will never end!
 To solve this problem most queries will use time windows
[diagram: stream → temporal query → result stream]
Arrival Time vs Application Time
 Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp
 This timestamp can either be the arrival time or an application time which the user can specify in the query
 A record can have multiple timestamps associated with it
 The arrival time has different meanings based on the input source
 For events from Azure Service Bus Event Hub, the arrival time is the timestamp given by the Event Hub
 For Blob storage, it is the blob’s last modified time
 If the user wants to use an application time, they can do so using the TIMESTAMP BY keyword
 Data are sorted by the timestamp column
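A minimal sketch of the difference, assuming a SensorStream input whose payload carries a ReadingTime field (all names here are illustrative):

-- Without TIMESTAMP BY, System.Timestamp is the arrival time
-- (Event Hub enqueue time, or the blob's last modified time);
-- with TIMESTAMP BY, the application time in the payload is used
SELECT DeviceId, AVG(Temperature) AS AvgTemperature,
       System.Timestamp AS WindowEnd
FROM SensorStream TIMESTAMP BY ReadingTime
GROUP BY DeviceId, TumblingWindow(minute, 1)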
Temporal Joins
SELECT Make
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
ON ES.Make = EX.Make
AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10
[timeline diagram: Toll Entry and Toll Exit event streams over 0–25 seconds; entries such as {“Mazda”,6} are matched to exits within a 0–10 second window]
Windowing Concepts
 Common requirement: perform some set-based operation (count, aggregation, etc.) over events that arrive within a specified period of time
 Group By returns data aggregated over a certain subset of data
 How to define a subset in a stream?
 Windowing functions!
 Each Group By requires a windowing function
Three types of windows
 Every window operation outputs events at the end of the window
 The output of the window will be a single event based on the aggregate function used; the event will have the timestamp of the window
 All windows have a fixed length
 Tumbling window: aggregate per time interval
 Hopping window: scheduled overlapping windows
 Sliding window: window constantly re-evaluated
Tumbling Window
[diagram: an event stream split into consecutive 20-second tumbling windows]
Tumbling windows:
• Repeat
• Are non-overlapping
• An event can belong to only one tumbling window
Query: count the total number of vehicles entering each toll booth every interval of 20 seconds.
SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 20)
Hopping Window
[diagram: 20-second hopping windows with a 10-second “hop”]
Hopping windows:
• Repeat
• Can overlap
• Hop forward in time by a fixed period
• Same as a tumbling window if hop size = window size
• Events can belong to more than one hopping window
Query: count the number of vehicles entering each toll booth every interval of 20 seconds; update results every 10 seconds.
SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 20, 10)
Sliding Window
[diagram: a 20-second sliding window moving over the event stream]
Sliding window:
• Continuously moves forward by an ε (epsilon)
• Produces an output only during the occurrence of an event
• Every window will have at least one event
• Events can belong to more than one sliding window
Query: find all the toll booths which have served more than 10 vehicles in the last 20 seconds.
SELECT TollId, Count(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 20)
HAVING Count(*) > 10
[animation: a 20-second sliding window re-evaluated as events «5», «1», «9» and «8» enter and leave the window]
TEMPORAL TASKS
demo
SCALING ANALYTICS
Streaming Unit
 A measure of the computing resources available for processing a Job
 A streaming unit can process up to 1 MB/second
 By default every job consists of 1 streaming unit. The total number of streaming units needed depends on:
 the rate of incoming events
 the complexity of the query
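As a rough capacity estimate from the 1 MB/s figure above: one streaming unit can sustain at most about 1 MB/s × 86,400 s ≈ 86 GB of input per day; a job ingesting more than that, or running complex temporal logic, needs additional units.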
Multiple steps, multiple outputs
 A query can have multiple steps to enable pipeline execution
 A step is a sub-query defined using WITH (“common table expression”)
 The only query outside of the WITH keyword is also counted as a step
 Can be used to develop complex queries more elegantly by creating an intermediary named result
 Each step’s output can be sent to multiple output targets using INTO

WITH Step1 AS (
    SELECT Count(*) AS CountTweets, Topic
    FROM TwitterStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(second, 3), Topic, PartitionId
),
Step2 AS (
    SELECT Avg(CountTweets)
    FROM Step1
    GROUP BY TumblingWindow(minute, 3)
)
SELECT * INTO Output1 FROM Step1
SELECT * INTO Output2 FROM Step2
SELECT * INTO Output3 FROM Step2
Scaling Concepts – Partitions
 When a query is partitioned, input events are processed and aggregated in separate partition groups
 Output events are produced for each partition group
 To read from Event Hubs, ensure that the number of partitions matches
 The query within the step must have the PARTITION BY keyword
 If your input is a partitioned event hub, we can write partitioned queries and partitioned subqueries (WITH clause)
 A non-partitioned query with a 3-fold partitioned subquery can have up to (1 + 3) × 6 = 24 streaming units!
SELECT Count(*) AS Count, Topic
FROM TwitterStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), Topic, PartitionId
[diagram: a partitioned Event Hub feeding one query instance, and one result stream, per partition]
Out of order inputs
 Event Hub guarantees monotonicity of the timestamp on each partition of the Event Hub
 All events from all partitions are merged by timestamp order; there will be no out-of-order events
 When it’s important for you to use the sender’s timestamp, a timestamp from the event payload is chosen using TIMESTAMP BY; then several sources of disorder can be introduced:
 Producers of the events have clock skews
 Network delay from the producers sending the events to Event Hub
 Clock skews between Event Hub partitions
 Do we skip out-of-order events (drop) or do we pretend they happened just now (adjust)?
Handling out of order events
 On the configuration tab, you will find the defaults
 Using 0 seconds as the out-of-order tolerance window means you assert all events are in order all the time
 To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window size
 ASA will buffer events up to that window and reorder them, using the user-chosen timestamp, before applying the temporal transformation
 Because of the buffering, the side effect is that the output is delayed by the same amount of time
 As a result, you will need to tune the value to reduce the number of out-of-order events while keeping the latency low
STRUCTURING
AND SCALING QUERY
demo
CONCLUSIONS
Summary
 Azure Stream Analytics is the PaaS solution for Analytics on streaming data
 It is programmable with a SQL-like language
 Handling time is a special and central feature
 Scale with cloud principles: elastic, self-service, multitenant, pay per use
 More questions:
 Other solutions
 Pricing
 What to do with that data?
 Futures
Microsoft real-time stream processing options
Apache Storm (in HDInsight)
 Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution.
 Storm was originally used by Twitter to process massive streams of data from the Twitter firehose.
 Today, Storm is a top-level project of the Apache Software Foundation.
 Typically, Storm is integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs.
Stream Analytics vs Apache Storm
 Storm:
 Data transformation
 Can handle more dynamic data (if you’re willing to program)
 Requires programming
 Stream Analytics:
 Ease of setup
 JSON and CSV formats only
 Can change queries within 4 minutes
 Only takes inputs from Event Hub, Blob Storage
 Only outputs to Azure Blob, Azure Tables, Azure SQL, PowerBI
Pricing
 Pricing is based on volume per job:
 Volume of data processed
 Streaming units required to process the data stream
 Volume of data processed: data processed by the streaming job (in GB), at €0.0009 per GB
 Streaming unit: a blended measure of CPU, memory and throughput, at €0.0262 per hour (≈ €18.86 per month)
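A worked example with these list prices (the workload is illustrative): a job with 1 streaming unit running for a 720-hour month costs 720 × €0.0262 ≈ €18.86; if it processes 100 GB of data per day, the data charge adds 3,000 GB × €0.0009 = €2.70, for a total of roughly €21.56 per month.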
Azure Machine Learning
 Understand the “sequence” of data in the history to predict the future
 Azure can ‘learn’ which values preceded issues
Futures
 [started]
 Native integration with Azure Machine Learning
 Provide better ways to debug.
 [planned]
 Call to a REST endpoint to invoke custom code
 [under review]
 Take input from DocumentDb
Thanks
 Don’t forget to fill in the evaluation form here:
 http://speakerscore.com/SqlSatParma2015
 Marco Parenzan
 http://twitter.com/marco_parenzan
 http://www.slideshare.net/marcoparenzan
 http://www.github.com/marcoparenzan
More Related Content

What's hot (20)

PPTX
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Microsoft Tech Community
 
PPTX
Overview on Azure Machine Learning
James Serra
 
PPTX
Super charged prototyping
Michael Stephenson
 
PPTX
The Power of Now! Azure Stream Analytics - Microsoft ITPro AirLift
Rui Quintino
 
PDF
Azure Synapse 101 Webinar Presentation
Matthew W. Bowers
 
PPTX
Modern data warehouse
Rakesh Jayaram
 
PPTX
Real time big data stream processing
Luay AL-Assadi
 
PDF
Introduction to Azure Synapse Webinar
Peter Ward
 
PDF
Azure Synapse Analytics
WinWire Technologies Inc
 
PDF
Part 3 - Modern Data Warehouse with Azure Synapse
Nilesh Gule
 
PPTX
Azure Synapse Analytics Overview (r2)
James Serra
 
PDF
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
MSAdvAnalytics
 
PPTX
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
PPTX
Microsoft Azure BI Solutions in the Cloud
Mark Kromer
 
PPTX
BTUG - Dec 2014 - Hybrid Connectivity Options
Michael Stephenson
 
PPTX
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
PDF
Databricks: A Tool That Empowers You To Do More With Data
Databricks
 
PDF
Modern data warehouse with Azure
Nilesh Gule
 
PDF
Microsoft Build 2020: Data Science Recap
Mark Tabladillo
 
PPTX
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Microsoft Tech Community
 
Overview on Azure Machine Learning
James Serra
 
Super charged prototyping
Michael Stephenson
 
The Power of Now! Azure Stream Analytics - Microsoft ITPro AirLift
Rui Quintino
 
Azure Synapse 101 Webinar Presentation
Matthew W. Bowers
 
Modern data warehouse
Rakesh Jayaram
 
Real time big data stream processing
Luay AL-Assadi
 
Introduction to Azure Synapse Webinar
Peter Ward
 
Azure Synapse Analytics
WinWire Technologies Inc
 
Part 3 - Modern Data Warehouse with Azure Synapse
Nilesh Gule
 
Azure Synapse Analytics Overview (r2)
James Serra
 
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution
MSAdvAnalytics
 
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Microsoft Tech Community
 
Microsoft Azure BI Solutions in the Cloud
Mark Kromer
 
BTUG - Dec 2014 - Hybrid Connectivity Options
Michael Stephenson
 
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks
 
Modern data warehouse with Azure
Nilesh Gule
 
Microsoft Build 2020: Data Science Recap
Mark Tabladillo
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Michael Rys
 

Viewers also liked (20)

PPTX
Azure stream analytics by Nico Jacobs
ITProceed
 
PPTX
Cortana Analytics Suite
James Serra
 
PPTX
Azure Document Db
Marco Parenzan
 
PDF
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
NoSQLmatters
 
PDF
AIMeetup #3: Cortana intelligence suite - tchnij życie w swoje dane
2040.io
 
PDF
Qubole hadoop-summit-2013-europe
Joydeep Sen Sarma
 
PPTX
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Qubole
 
PDF
RDO-Packstack Workshop
Thamrongtawal Hashim
 
PDF
5 Crucial Considerations for Big data adoption
Qubole
 
PPTX
Atlanta Data Science Meetup | Qubole slides
Qubole
 
PPTX
Qubole @ AWS Meetup Bangalore - July 2015
Joydeep Sen Sarma
 
PPTX
Nw qubole overview_033015
Michael Mersch
 
PPTX
Comparison of various streaming technologies
Sachin Aggarwal
 
PPTX
Azure iot
書廷 林
 
PPTX
Azure Stream Analytics
James Serra
 
PDF
Building Streaming And Fast Data Applications With Spark, Mesos, Akka, Cassan...
Lightbend
 
PDF
The Power of the Log
Ben Stopford
 
PPTX
Azure Cloud PPT
Aniket Kanitkar
 
PPTX
Apache Beam
Adil Oulghard
 
PPTX
Kafka & Couchbase Integration Patterns
Manuel Hurtado
 
Azure stream analytics by Nico Jacobs
ITProceed
 
Cortana Analytics Suite
James Serra
 
Azure Document Db
Marco Parenzan
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
NoSQLmatters
 
AIMeetup #3: Cortana intelligence suite - tchnij życie w swoje dane
2040.io
 
Qubole hadoop-summit-2013-europe
Joydeep Sen Sarma
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Qubole
 
RDO-Packstack Workshop
Thamrongtawal Hashim
 
5 Crucial Considerations for Big data adoption
Qubole
 
Atlanta Data Science Meetup | Qubole slides
Qubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Joydeep Sen Sarma
 
Nw qubole overview_033015
Michael Mersch
 
Comparison of various streaming technologies
Sachin Aggarwal
 
Azure iot
書廷 林
 
Azure Stream Analytics
James Serra
 
Building Streaming And Fast Data Applications With Spark, Mesos, Akka, Cassan...
Lightbend
 
The Power of the Log
Ben Stopford
 
Azure Cloud PPT
Aniket Kanitkar
 
Apache Beam
Adil Oulghard
 
Kafka & Couchbase Integration Patterns
Manuel Hurtado
 
Ad

Similar to Azure Stream Analytics (20)

PPTX
2014.11.22 Azure for Sql Server Developer - SQLSAT355 Parma
Marco Parenzan
 
PDF
In-memory ColumnStore Index
SolidQ
 
PPTX
Introduction to Azure Stream Analytics
Slava Kokaev
 
PPTX
[ENG] Sql Saturday 355 in Parma - New "SQL Server databases under source cont...
Alessandro Alpi
 
PPTX
いそがしいひとのための Microsoft Ignite 2018 最新情報 Data 編
Miho Yamamoto
 
PPTX
Azure satpn19 time series analytics with azure adx
Riccardo Zamana
 
PPTX
Cepta The Future of Data with Power BI
Kellyn Pot'Vin-Gorman
 
PDF
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
PDF
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Cathrine Wilhelmsen
 
PDF
IoT & Azure
Mirco Vanini
 
PPTX
How Totango uses Apache Spark
Oren Raboy
 
PDF
Azure saturday pn 2018
Marco Pozzan
 
PDF
Microsoft SQL Server 2016 - Everything Built In
David J Rosenthal
 
PDF
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
James Anderson
 
PDF
DATA @ NFLX (Tableau Conference 2014 Presentation)
Blake Irvine
 
PPTX
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
PDF
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
PPTX
Spark Streaming with Azure Databricks
Dustin Vannoy
 
PDF
Serverless SQL
Torsten Steinbach
 
PPTX
Data & analytics challenges in a microservice architecture
Niels Naglé
 
2014.11.22 Azure for Sql Server Developer - SQLSAT355 Parma
Marco Parenzan
 
In-memory ColumnStore Index
SolidQ
 
Introduction to Azure Stream Analytics
Slava Kokaev
 
[ENG] Sql Saturday 355 in Parma - New "SQL Server databases under source cont...
Alessandro Alpi
 
いそがしいひとのための Microsoft Ignite 2018 最新情報 Data 編
Miho Yamamoto
 
Azure satpn19 time series analytics with azure adx
Riccardo Zamana
 
Cepta The Future of Data with Power BI
Kellyn Pot'Vin-Gorman
 
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics
Informatica
 
Visually Transform Data in Azure Data Factory or Azure Synapse Analytics (PAS...
Cathrine Wilhelmsen
 
IoT & Azure
Mirco Vanini
 
How Totango uses Apache Spark
Oren Raboy
 
Azure saturday pn 2018
Marco Pozzan
 
Microsoft SQL Server 2016 - Everything Built In
David J Rosenthal
 
GDG Cloud Southlake #16: Priyanka Vergadia: Scalable Data Analytics in Google...
James Anderson
 
DATA @ NFLX (Tableau Conference 2014 Presentation)
Blake Irvine
 
Azure Data Explorer deep dive - review 04.2020
Riccardo Zamana
 
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Serverless SQL
Torsten Steinbach
 
Data & analytics challenges in a microservice architecture
Niels Naglé
 
Ad

More from Marco Parenzan (20)

PPTX
Azure IoT Central per lo SCADA engineer
Marco Parenzan
 
PPTX
Azure Hybrid @ Home
Marco Parenzan
 
PPTX
Static abstract members nelle interfacce di C# 11 e dintorni di .NET 7.pptx
Marco Parenzan
 
PPTX
Azure Synapse Analytics for your IoT Solutions
Marco Parenzan
 
PPTX
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
PPTX
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
PPTX
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
PPTX
Developing Actors in Azure with .net
Marco Parenzan
 
PPTX
Math with .NET for you and Azure
Marco Parenzan
 
PPTX
Power BI data flow and Azure IoT Central
Marco Parenzan
 
PPTX
.net for fun: write a Christmas videogame
Marco Parenzan
 
PPTX
Building IoT infrastructure on edge with .net, Raspberry PI and ESP32 to conn...
Marco Parenzan
 
PPTX
Anomaly Detection with Azure and .NET
Marco Parenzan
 
PPTX
Deploy Microsoft Azure Data Solutions
Marco Parenzan
 
PPTX
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Marco Parenzan
 
PPTX
Azure IoT Central
Marco Parenzan
 
PPTX
Anomaly Detection with Azure and .net
Marco Parenzan
 
PPTX
Code Generation for Azure with .net
Marco Parenzan
 
PPTX
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
Marco Parenzan
 
PPTX
Time Series Anomaly Detection with Azure and .NETT
Marco Parenzan
 
Azure IoT Central per lo SCADA engineer
Marco Parenzan
 
Azure Hybrid @ Home
Marco Parenzan
 
Static abstract members nelle interfacce di C# 11 e dintorni di .NET 7.pptx
Marco Parenzan
 
Azure Synapse Analytics for your IoT Solutions
Marco Parenzan
 
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
Power BI Streaming Data Flow e Azure IoT Central
Marco Parenzan
 
Developing Actors in Azure with .net
Marco Parenzan
 
Math with .NET for you and Azure
Marco Parenzan
 
Power BI data flow and Azure IoT Central
Marco Parenzan
 
.net for fun: write a Christmas videogame
Marco Parenzan
 
Building IoT infrastructure on edge with .net, Raspberry PI and ESP32 to conn...
Marco Parenzan
 
Anomaly Detection with Azure and .NET
Marco Parenzan
 
Deploy Microsoft Azure Data Solutions
Marco Parenzan
 
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Marco Parenzan
 
Azure IoT Central
Marco Parenzan
 
Anomaly Detection with Azure and .net
Marco Parenzan
 
Code Generation for Azure with .net
Marco Parenzan
 
Running Kafka and Spark on Raspberry PI with Azure and some .net magic
Marco Parenzan
 
Time Series Anomaly Detection with Azure and .NETT
Marco Parenzan
 

Recently uploaded (20)

PDF
Optimizing Business Operations with IT Infrastructure
VRS Technologies
 
PPTX
DEALING WITH INTOXICATED GUEST BSHM 3B, GROUP 7.pptx
MindaOlvido4
 
PDF
How AI Staffing Helps You Find the Right Tech Talent | Rubixe
Rubixe AI-Company
 
PDF
WP - BI MV GAN - Upload Budgqqqqqetqqqqq
vanessabeumont1
 
PDF
A falling object incident- TECH EHS Solution
TECH EHS Solution
 
PPTX
Future of Fulfillment: Inside Amazon’s $500M Oregon Warehouse
Stock and Ship
 
PDF
Vacant property Security OpalstoneYour Protection Partner.pdf
Opalstone Group Ltd
 
PPTX
E2-Visa-powerpoint describe the details about E-2
boringmccarthy
 
PDF
Top 10 Medical Coding Courses in Noida With Placements
ca99579957
 
PDF
Digital marketing company in Punjab for ROI driven results
webcooks Digital Academy
 
PDF
The Science of Sales Funnels How to Guide Visitors to Buy from You.pdf
FOME Agency
 
PPTX
Is it profitable to run 3D printing business?
Lakshay Gandhi
 
PDF
_Carbon Offsetting ESG Strategy Carbon Neutral Group.pdf
carbonneutralgroup07
 
PDF
Key Salesforce Managed Services Benefits
TechForce Services
 
PDF
Types of SEO course - NSIM(national school of internet marketing)
vanshSarrsar
 
PDF
Challenges of Accessing Clean Drinking Water
BikramKhutia
 
PDF
Upgrading Your Lighting System What to Know Before You Start.pdf
Tactik Lighting
 
PPTX
What are the Benefits of Obtaining Portuguese Citizenship for Goans
PNFG CONSULTANCY
 
PDF
OSHA Electric Forklift Truck Inspection with eAuditor Audits & Inspections
eAuditor Audits & Inspections
 
PDF
BP - Manage Plant Logistics Inqbound.pdf
vanessabeumont1
 
Optimizing Business Operations with IT Infrastructure
VRS Technologies
 
DEALING WITH INTOXICATED GUEST BSHM 3B, GROUP 7.pptx
MindaOlvido4
 
How AI Staffing Helps You Find the Right Tech Talent | Rubixe
Rubixe AI-Company
 
WP - BI MV GAN - Upload Budgqqqqqetqqqqq
vanessabeumont1
 
A falling object incident- TECH EHS Solution
TECH EHS Solution
 
Future of Fulfillment: Inside Amazon’s $500M Oregon Warehouse
Stock and Ship
 
Vacant property Security OpalstoneYour Protection Partner.pdf
Opalstone Group Ltd
 
E2-Visa-powerpoint describe the details about E-2
boringmccarthy
 
Top 10 Medical Coding Courses in Noida With Placements
ca99579957
 
Digital marketing company in Punjab for ROI driven results
webcooks Digital Academy
 
The Science of Sales Funnels How to Guide Visitors to Buy from You.pdf
FOME Agency
 
Is it profitable to run 3D printing business?
Lakshay Gandhi
 
_Carbon Offsetting ESG Strategy Carbon Neutral Group.pdf
carbonneutralgroup07
 
Key Salesforce Managed Services Benefits
TechForce Services
 
Types of SEO course - NSIM(national school of internet marketing)
vanshSarrsar
 
Challenges of Accessing Clean Drinking Water
BikramKhutia
 
Upgrading Your Lighting System What to Know Before You Start.pdf
Tactik Lighting
 
What are the Benefits of Obtaining Portuguese Citizenship for Goans
PNFG CONSULTANCY
 
OSHA Electric Forklift Truck Inspection with eAuditor Audits & Inspections
eAuditor Audits & Inspections
 
BP - Manage Plant Logistics Inqbound.pdf
vanessabeumont1
 

Azure Stream Analytics

  • 1. #sqlsatParma #sqlsat462November 28°, 2015 Azure Stream Analytics Marco Parenzan @marco_parenzan
  • 4. #sqlsatParma #sqlsat462November 28°, 2015 Meet Marco Parenzan | @marco_parenzan  Microsoft MVP 2015 for Azure  Develop modern distributed and cloud solutions  [email protected]  Passion for speaking and inspiring programmers, students, people  www.innovazionefvg.net  I’m a developer!
  • 5. #sqlsatParma #sqlsat462November 28°, 2015 Agenda  Why a developer talks about analytics  Analytics in a modern world  Introduction to Azure Stream Analytics  Stream Analytics Query Language (SAQL)  Handling time in Azure Stream Analytics  Scaling Analytics  Conclusions
  • 7. #sqlsatParma #sqlsat462November 28°, 2015 What is Analytics  From Wikipedia  Analytics is the discovery and communication of meaningful patterns in data.  Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.  Analytics often favors data visualization to communicate insight.
  • 9. #sqlsatParma #sqlsat462November 28°, 2015 Event-based systems  Event I “something happened…  …somewhere…  …sometime!  Event arrive at different times i.e. have unique timestamps  Events arrive at different rates (events/sec).  In any given period of time there may be 0, 1 or more events
  • 10. #sqlsatParma #sqlsat462November 28°, 2015 Azure Service Bus Azure Service Bus Relay Queue Topic Notification Hub Event Hub NAT and Firewall Traversal Service Request/Response Services Unbuffered with TCP Throttling Transactional Cloud AMQP/HTTP Broker High-Scale, High-Reliability Messaging Sessions, Scheduled Delivery, etc. Transactional Message Distribution Up to 2000 subscriptions per Topic Up to 2K/100K filter rules per subscription High-scale notification distribution Most mobile push notification services Millions of notification targets Hyper Scale. A Million Clients. Concurrent.
  • 11. #sqlsatParma #sqlsat462November 28°, 2015 Azure Event Hubs Event Producers > 1M Producers > 1GB/sec Aggregate Throughput Direct Hash Throughput Units: • 1 ≤ TUs ≤ Partition Count • TU: 1 MB/s writes, 2 MB/s reads
  • 12. #sqlsatParma #sqlsat462November 28°, 2015 Microsoft Azure IoT Services Devices Device Connectivity Storage Analytics Presentation & Action Event Hubs SQL Database Machine Learning App Service Service Bus Table/Blob Storage Stream Analytics Power BI External Data Sources DocumentDB HDInsight Notification Hubs External Data Sources Data Factory Mobile Services BizTalk Services { }
  • 14. #sqlsatParma #sqlsat462November 28°, 2015 Traditional analytics  Everything around us produce data  From devices, sensors, infrastructures and applications  Traditional Business Intelligence first collects data and analyzes it afterwards  Typically 1 day latency, the day after  But we live in a fast paced world  Social media  Internet of Things  Just-in-time production  Offline data is unuseful  For many organizations, capturing and storing event data for later analysis is no longer enough Data at Rest
  • 15. #sqlsatParma #sqlsat462November 28°, 2015 Analytics in a modern world  We work with streaming data  We want to monitor and analyze data in near real time  Typically a few seconds up to a few minutes latency  So we don’t have the time to stop, copy data and analyze, but we have to work with streams of data Data in motion
  • 16. #sqlsatParma #sqlsat462November 28°, 2015 Scenarios  Real-time ingestion, processing and archiving of data  Real-time Analytics  Connected devices (Internet of Things)
  • 17. #sqlsatParma #sqlsat462November 28°, 2015 Why Stream Analytics in the Cloud?  Not all data is local  Event data is already in the Cloud  Event data is globally distributed  Bring the processing to the data, not the data to the processing 1 7
  • 18. #sqlsatParma #sqlsat462November 28°, 2015 Apply cloud principles  Focus on building solutions (PAAS or SAAS)  Without having to manage complex infrastructure and software  no hardware or other up-front costs and no time-consuming installation or setup  has elastic scale where resources are efficiently allocated and paid for as requested  Scale to any volume of data while still achieving high throughput, low-latency, and guaranteed resiliency  Up and running in minutes
  • 20. #sqlsatParma #sqlsat462November 28°, 2015 An API can be a “thing”  Api Apps, Logic Apps,  World-wide distributed API (Rest)  Resource consuming (CPU, storage, network bandwidth)  Each request is logged  With Event Hub or in log files  Evaluate how API is going on  “real time” statistics  Ex.  ASP.NET apps logs directly on EventHub
  • 22. #sqlsatParma #sqlsat462November 28°, 2015 What is Azure Stream Analytics?  Azure Stream Analytics is a cost effective event processing engine  Describe their desired transformations in SQL- like syntax  Is a stream processing engine that is integrated with a scalable event queuing system like Azure Event Hubs
  • 24. #sqlsatParma #sqlsat462November 28°, 2015 Real-time analytics  Intake millions of events per second  Intake millions of events per second (up to 1 GB/s)  At variable loads  Scale that accommodates variable loads  Low processing latency, auto adaptive (sub-second to seconds)  Transform, augment, correlate, temporal operations  Correlate between different streams, or with reference data  Find patterns or lack of patterns in data in real-time
  • 25. #sqlsatParma #sqlsat462November 28°, 2015 No challenges with scale  Elasticity of the cloud for scale out  Spin up any number of resources on demand  Scale from small to large when required  Distributed, scale-out architecture
  • 26. #sqlsatParma #sqlsat462November 28°, 2015 Fully managed  No hardware (PaaS offering)  Bypasses deployment expertise  No software provisioning and maintaining  No performance tuning  Spin up any number of resources on demand  Expand your business globally leveraging Azure regions
  • 27. #sqlsatParma #sqlsat462November 28°, 2015 Mission critical availability  Guaranteed events delivery  Guaranteed not to lose events or incorrect output  Guaranteed “once and only once” delivery of event  Ability to replay events  Guaranteed business continuity  Guaranteed uptime (three nines of availability)  Auto-recovery from failures  Built in state management for fast recovery  Effective Audits  Privacy and security properties of solutions are evident  Azure integration for monitoring and ops alerting
  • 28. #sqlsatParma #sqlsat462November 28°, 2015 Lower costs  Efficiently pay only for usage  Architected for multi-tenancy  Not paying for idle resources  Typical cloud expense model  Low startup costs  Ability to incrementally add resources  Reduce costs when business needs changes
  • 29. #sqlsatParma #sqlsat462November 28°, 2015 Rapid development  SQL like language  High-level: focus on stream analytics solution  Concise: less code to maintain  First-class support for event streams and reference data  Built in temporal semantics  Built-in temporal windowing and joining  Simple policy configuration to manage out-of-order events and late arrivals
  • 30. #sqlsatParma #sqlsat462November 28°, 2015 Azure Stream Analytics Data Source Collect Process ConsumeDeliver Event Inputs - Event Hub - Azure Blob Transform - Temporal joins - Filter - Aggregates - Projections - Windows - Etc. Enrich Correlate Outputs - SQL Azure - Azure Blobs - Event Hub - Service Bus Queue - Service Bus Topics - Table storage - PowerBI Azure Storage • Temporal Semantics • Guaranteed delivery • Guaranteed up time Reference Data - Azure Blob
  • 31. #sqlsatParma #sqlsat462November 28°, 2015 Inputs sources for a Stream Analytics Job • Currently supported input Data Streams are Azure Event Hub , Azure IoT Hub and Azure Blob Storage. Multiple input Data Streams are supported. • Advanced options lets you configure how the Job will read data from the input blob (which folders to read from, when a blob is ready to be read, etc). • Reference data is usually static or changes very slowly over time. • Must be stored in Azure Blob Storage. • Cached for performance
  • 32. #sqlsatParma #sqlsat462November 28°, 2015 Defining Event Schema • The serialization format and the encoding for the for the input data sources (both Data Streams and Reference Data) must be defined. • Currently three formats are supported: CSV, JSON and Avro (binary JSON - https://avro.apache.org/docs/1.7.7/spec.ht ml) • For CSV format a number of common delimiters are supported: (comma (,), semi-colon(;), colon(:), tab and space. • For CSV and Avro optionally you can provide the schema for the input data.
  • 33. #sqlsatParma #sqlsat462November 28°, 2015 Output for Stream Analytics Jobs Currently data stores supported as outputs Azure Blob storage: creates log files with temporal query results Ideal for archiving Azure Table storage: More structured than blob storage, easier to setup than SQL database and durable (in contrast to event hub) SQL database: Stores results in Azure SQL Database table Ideal as source for traditional reporting and analysis Event hub: Sends an event to an event hub Ideal to generate actionable events such as alerts or notifications Service Bus Queue: sends an event on a queue Ideal for sending events sequentially Service Bus Topics: sends an event to subscribers Ideal for sending events to many consumers PowerBI.com: Ideal for near real time reporting! DocumentDb: Ideal if you work with json and object graphs
  • 35. #sqlsatParma #sqlsat462November 28°, 2015 STREAM ANALYTICS QUERY LANGUAGE (SAQL)
  • 36. #sqlsatParma #sqlsat462November 28°, 2015 SAQL – Language & Library SELECT FROM WHERE GROUP BY HAVING CASE WHEN THEN ELSE INNER/LEFT OUTER JOIN UNION CROSS/OUTER APPLY CAST INTO ORDER BY ASC, DSC WITH PARTITION BY OVER DateName DatePart Day Month Year DateTimeFromParts DateDiff DateAdd TumblingWindow HoppingWindow SlidingWindow Duration Sum Count Avg Min Max StDev StDevP Var VarP Len Concat CharIndex Substring PatIndex Lag, IsFirst CollectTop
  • 37. #sqlsatParma #sqlsat462November 28°, 2015 Supported types Type Description bigint Integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807). float Floating point numbers in the range - 1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308. nvarchar(max) Text values, comprised of Unicode characters. Note: A value other than max is not supported. datetime Defines a date that is combined with a time of day with fractional seconds that is based on a 24-hour clock and relative to UTC (time zone offset 0). Inputs will be casted into one of these types We can control these types with a CREATE TABLE statement: This does not create a table, but just a data type mapping for the inputs
  • 38. #sqlsatParma #sqlsat462November 28°, 2015 INTO clause  Pipelining data from input to output  Without INTO clause we write to destination named ‘output’  We can have multiple outputs  With INTO clause we can choose for every select the appropriate destination  E.g. send events to blob storage for big data analysis, but send special events to event hub for alerting SELECT UserName, TimeZone INTO Output FROM InputStream WHERE Topic = 'XBox'
  • 39. #sqlsatParma #sqlsat462November 28°, 2015 WHERE clause  Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery  There is no limit to the number of predicates that can be included in a search condition. SELECT UserName, TimeZone FROM InputStream WHERE Topic = 'XBox'
  • 40. #sqlsatParma #sqlsat462November 28°, 2015 JOIN  We can combine multiple event streams or an event stream with reference data via a join (inner join) or a left outer join  In the join clause we can specify the time window in which we want the join to take place  We use a special version of DateDiff for this
  • 41. #sqlsatParma #sqlsat462November 28°, 2015 Reference Data  Seamless correlation of event streams with reference data  Static or slowly-changing data stored in blobs  CSV and JSON files in Azure Blobs  scanned for new snapshots on a settable cadence JOIN (INNER or LEFT OUTER) between streams and reference data sources  Reference data appears like another input: SELECT myRefData.Name, myStream.Value FROM myStream JOIN myRefData ON myStream.myKey = myRefData.myKey
  • 42. #sqlsatParma #sqlsat462November 28°, 2015 Reference data tips  Currently reference data cannot be refreshed automatically.  You need to stop the job and specify new snapshot with reference data  Reference Data are only in Blog  Practice says that you use services like Azure Data Factory to move data from Azure Data Sources to Azure Blob Storage  Have you followed Francesco Diaz’s session?
  • 43. #sqlsatParma #sqlsat462November 28°, 2015 UNION SELECT TollId, ENTime AS Time , LicensePlate FROM EntryStream TIMESTAMP BY ENTime UNION SELECT TollId, EXTime AS Time , LicensePlateFROM ExitStream TIMESTAMP BY EXTime TollId EntryTime LicensePlate … 1 2014-09-1012:01:00.000 JNB7001 … 1 2014-09-1012:02:00.000 YXZ1001 … 3 2014-09-1012:02:00.000 ABC1004 … TollId ExitTime LicensePlate 1 2009-06-2512:03:00.000 JNB7001 1 2009-06-2512:03:00.000 YXZ1001 3 2009-06-2512:04:00.000 ABC1004 TollId Time LicensePlate 1 2014-09-1012:01:00.000 JNB7001 1 2014-09-1012:02:00.000 YXZ1001 3 2014-09-1012:02:00.000 ABC1004 1 2009-06-2512:03:00.000 JNB7001 1 2009-06-2512:03:00.000 YXZ1001 3 2009-06-2512:04:00.000 ABC1004
  • 44. #sqlsatParma #sqlsat462November 28°, 2015 STORING, FILTERING AND DECODING demo 4 4
  • 45. #sqlsatParma #sqlsat462November 28°, 2015 HANDLING TIME IN AZURE STREAM ANALYTICS
  • 46. #sqlsatParma #sqlsat462November 28°, 2015 Traditional queries  Traditional querying assumes the data doesn’t change while you are querying it:  query a fixed state  If the data is changing: snapshots and transactions ‘freeze’ the data while we query it  Since we query a finite state, our query should finish in a finite amount of time table query result table
  • 47. #sqlsatParma #sqlsat462November 28°, 2015 A different kind of query  When analyzing a stream of data, we deal with a potential infinite amount of data  As a consequence our query will never end!  To solve this problem most queries will use time windows stream temporal query result strea m
  • 48. #sqlsatParma #sqlsat462November 28°, 2015 Arrival Time Vs Application Time  Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp  This timestamp can either be an application time which the user can specify in the query  A record can have multiple timestamps associated with it  The arrival time has different meanings based on the input sources.  For the events from Azure Service Bus Event Hub, the arrival time is the timestamp given by the Event Hub  For Blob storage, it is the blob’s last modified time.  If the user wants to use an application time, they can do so using the TIMESTAMP BY keyword  Data are sorted by timestamp column
  • 49. #sqlsatParma #sqlsat462November 28°, 2015 Temporal Joins SELECT Make FROM EntryStream ES TIMESTAMP BY EntryTime JOIN ExitStream EX TIMESTAMP BY ExitTime ON ES.Make= EX.Make AND DATEDIFF(second,ES,EX) BETWEEN 0 AND 10 Time (Seconds) {“Mazda”,6} {“BMW”,7} {“Honda”,2} {“Volvo”,3}Toll Entry : {“Mazda”,3} {“BMW”,7}{“Honda”,2} {“Volvo”,3} Toll Exit : 0 5 10 15 20 25
  • 50. #sqlsatParma #sqlsat462November 28°, 2015 Windowing Concepts  Common requirement to perform some set-based operation (count, aggregation etc) over events that arrive within a specified period of time  Group by returns data aggregated over a certain subset of data  How to define a subset in a stream?  Windowing functions!  Each Group By requires a windowing function
  • 51. #sqlsatParma #sqlsat462November 28°, 2015 Three types of windows  Every window operation outputs events at the end of the window  The output of the window will be single event based on the aggregate function used. The event will have the time stamp of the window  All windows have a fixed length 5 1 Tumbling window Aggregate per time interval Hopping window Schedule overlapping windows Sliding window Windows constant re-evaluated
  • 52. #sqlsatParma #sqlsat462November 28°, 2015 Tumbling Window 1 5 4 26 8 6 5 Time (secs) 1 5 4 26 8 6 A 20-second Tumbling Window 3 6 1 5 3 6 1 Tumbling windows: • Repeat • Are non-overlapping SELECT TollId, COUNT(*) FROM EntryStream TIMESTAMP BY EntryTime GROUP BY TollId, TumblingWindow(second, 20) Query: Count the total number of vehicles entering each toll booth every interval of 20 seconds. An event can belong to only one tumbling window
  • 53. #sqlsatParma #sqlsat462November 28°, 2015 Hopping Window 1 5 4 26 8 6 A 20-second Hopping Window with a10 second “Hop” Hopping windows: • Repeat • Can overlap • Hop forward in time by a fixed period Same as tumbling window if hop size = window size Events can belong to more than one hopping window SELECT COUNT(*), TollId FROM EntryStream TIMESTAMP BY EntryTime GROUP BY TollId, HoppingWindow (second, 20,10) 4 26 8 6 5 3 6 1 1 5 4 26 8 6 5 3 6 15 3 QUERY: Count the number of vehicles entering each toll booth every interval of 20 seconds; update results every 10 seconds
  • 54. #sqlsatParma #sqlsat462November 28°, 2015 Sliding Window 1 5 A 20-secondSliding Window Sliding window: • Continuously moves forward by an ε (epsilon) • Produces an output only during the occurrence of an event • Every windows will have at least one event Events can belong to more than one sliding window SELECT TollId, Count(*) FROM EntryStream ES GROUP BY TollId, SlidingWindow (second, 20) HAVING Count(*) > 10 Query: Find all the toll booths which have served more than 10 vehicles in the last 20 seconds 1 8 8 5 1 9 5 1 9
  • 55. #sqlsatParma #sqlsat462November 28°, 2015 1 5 A 20-secondSliding Window 1 8 8 51 9 51 9 5 9 «5» enter «1» enter «9» enter «1» exit «5» exit 9 «9» exit «8» enter
  • 58. #sqlsatParma #sqlsat462November 28°, 2015 Steaming Unit  Is a measure of the computing resource available for processing a Job  A streaming unit can process up to 1 Mb / second  By default every job consists of 1 streaming unit. Total number of streaming units that can be used depends on :  rate of incoming events  complexity of the query
  • 59. #sqlsatParma #sqlsat462November 28°, 2015 Multiple steps, multiple outputs  A query can have multiple steps to enable pipeline execution  A step is a sub-query defined using WITH (“common table expression”)  The only query outside of the WITH keyword is also counted as a step  Can be used to develop complex queries more elegantly by creating a intermediary named result  Each step’s output can be sent to multiple output targets using INTO WITH Step1 AS ( SELECT Count(*) AS CountTweets, Topic FROM TwitterStream PARTITION BY PartitionId GROUP BY TumblingWindow(second, 3), Topic, PartitionId ), Step2 AS ( SELECT Avg(CountTweets) FROM Step1 GROUP BY TumblingWindow(minute, 3) ) SELECT * INTO Output1 FROM Step1 SELECT * INTO Output2 FROM Step2 SELECT * INTO Output3 FROM Step2
  • 60. #sqlsatParma #sqlsat462November 28°, 2015 Scaling Concepts – Partitions  When a query is partitioned, input events are processed and aggregated in separate partition groups  Output events are produced for each partition group  To read from Event Hubs, ensure that the number of partitions matches  The query within the step must use the PARTITION BY keyword  If your input is a partitioned event hub, we can write partitioned queries and partitioned subqueries (WITH clause)  At up to 6 streaming units per step, a non-partitioned query with a 3-fold partitioned subquery can have (1 + 3) × 6 = 24 streaming units! SELECT Count(*) AS Count, Topic FROM TwitterStream PARTITION BY PartitionId GROUP BY TumblingWindow(minute, 3), Topic, PartitionId [diagram: an Event Hub's partitions feeding per-partition query results]
  • 61. #sqlsatParma #sqlsat462November 28°, 2015 Out of order inputs  Event Hub guarantees monotonicity of the timestamp on each partition of the Event Hub  When all events from all partitions are merged by timestamp order, there will be no out-of-order events  When it is important to use the sender's timestamp, so a timestamp from the event payload is chosen using TIMESTAMP BY, several sources of disorder can be introduced:  Producers of the events have clock skews  Network delay from the producers sending the events to Event Hub  Clock skews between Event Hub partitions  Do we skip out-of-order events (drop) or do we pretend they happened just now (adjust)?
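A minimal sketch of opting into the sender's timestamp with TIMESTAMP BY (SensorStream, ReadingTime and Temperature are hypothetical names for illustration, not from the slides):

  SELECT DeviceId, AVG(Temperature) AS AvgTemperature
  FROM SensorStream TIMESTAMP BY ReadingTime
  GROUP BY DeviceId, TumblingWindow(second, 30)

Here windowing is driven by the payload's ReadingTime rather than the Event Hub arrival time, which is exactly where the clock-skew and network-delay disorder listed above comes in.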
  • 62. #sqlsatParma #sqlsat462November 28°, 2015 Handling out of order events  On the configuration tab, you will find the following defaults  Using 0 seconds as the out-of-order tolerance window means you assert all events are in order all the time  To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window size  ASA buffers events up to that window and reorders them by the user-chosen timestamp before applying the temporal transformation  Because of the buffering, the side effect is that the output is delayed by the same amount of time  As a result, you need to tune the value to reduce the number of out-of-order events while keeping latency low
  • 65. #sqlsatParma #sqlsat462November 28°, 2015 Summary  Azure Stream Analytics is the PaaS solution for Analytics on streaming data  It is programmable with a SQL-like language  Handling time is a special and central feature  Scale with cloud principles: elastic, self service, multitenant, pay per use  More questions:  Other solutions  Pricing  What to do with that data?  Futures
  • 66. #sqlsatParma #sqlsat462November 28°, 2015 Microsoft real-time stream processing options
  • 67. #sqlsatParma #sqlsat462November 28°, 2015 Apache Storm (in HDInsight)  Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution.  Storm was originally used by Twitter to process massive streams of data from the Twitter firehose.  Today, Storm is a top-level project of the Apache Software Foundation.  Typically, Storm is integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs.
  • 68. #sqlsatParma #sqlsat462November 28°, 2015 Stream Analytics vs Apache Storm  Storm:  Data transformation  Can handle more dynamic data (if you're willing to program)  Requires programming  Stream Analytics:  Ease of setup  JSON, CSV and Avro formats only  Can change queries within 4 minutes  Only takes inputs from Event Hubs and Blob Storage  Only outputs to Azure Blobs, Azure Tables, Azure SQL Database, and Power BI
  • 69. #sqlsatParma #sqlsat462November 28°, 2015 Pricing  Pricing is based on volume per job:  volume of data processed  streaming units required to process the data stream  Volume of data processed: the volume of data processed by the streaming job (in GB), at €0.0009 per GB  Streaming unit (a blended measure of CPU, memory and throughput): €0.0262 per hour, i.e. €0.0262 × 24 × 30 ≈ €18.86 per month
  • 70. #sqlsatParma #sqlsat462November 28°, 2015 Azure Machine Learning  Understand the “sequence” of data in the history to predict the future  Azure Machine Learning can “learn” which values preceded issues
  • 72. #sqlsatParma #sqlsat462November 28°, 2015 Futures  [started]  Native integration with Azure Machine Learning  Provide better ways to debug.  [planned]  Call to a REST endpoint to invoke custom code  [under review]  Take input from DocumentDb
  • 73. #sqlsatParma #sqlsat462November 28°, 2015 Thanks  Don’t forget to fill out the evaluation form here  http://speakerscore.com/SqlSatParma2015  Marco Parenzan  http://twitter.com/marco_parenzan  http://www.slideshare.net/marcoparenzan  http://www.github.com/marcoparenzan

Editor's Notes

  • #16: https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-get-started/?WT.mc_id=Blog_SQL_Announce_DI
  • #17: Typical scenarios: ingest a continuous stream of data and do in-flight processing like scrubbing PII information, adding geo-tagging, and doing IP lookups before the data is sent to a data store. Fraud detection: monitor financial transactions in real time to detect fraudulent activity. Business operations: analyze real-time data to respond to dynamic environments and take immediate action; provide real-time dashboarding where customers can see trends the moment they occur; collect real-time metrics to gain immediate insight into a website’s usage patterns or application performance. Connected devices: get real-time information from connected devices like machines, buildings, or cars so that relevant action can be taken, such as scheduling a repair technician, pushing down software updates, or performing a specific automated action; monitor and diagnose real-time data from connected devices such as vehicles, buildings, or machinery in order to generate alerts, respond to events, or optimize operations.
  • #25: Key Points: Stream Analytics provides processing events at scale – millions per second – with variable loads analyzing the data in real-time – event correlating with reference data. Talk track: Processes millions of events per second Scale accommodates variable loads and preserves even order on a per-device basis Performs continuous real-time analytics for transforming, augmenting, correlating using temporal operations. This allows pattern and anomaly detection Correlates streaming data with reference – more static – data Think of augmenting events containing IPs with geo-location data or real-time stock market trading events with stock information.
  • #28: Key Points: Stream Analytics has built-in guaranteed event delivery and business continuity which is critical for providing reliability and resiliency. Talk track: You will not lose any events. The service provides exactly once delivery of events. You don’t have to write any code for this and you can use it to replay events on failures or from a particular time based on the retention policy you have setup with Event Hubs. 3 9’s availability built into the service. Recovery from failures does not need to start at the beginning of a window. It can start from when the failure occurred in the window. This enables businesses to be as real-time as possible.
  • #30: Stream Analytics gives developers the fastest productivity experience by abstracting the complexities of writing code for scale out over distributed systems and for custom analytics. Instead, developers need only describe the desired transformations using a declarative SQL language and the system will handle everything else. Key Points: Normally, event processing solutions are arduous to implement because of the amount of custom code that needs to be written. Developers have to write code that reflects distributed systems taking into account coding for parallelization, deployment over a distributed platform, scheduling and monitoring. Furthermore, code for the analytical functions also must be written. While other cloud services for the most part have solutions that handle programming over the distributed platform, likely their code still is procedural and thus lower level and more complex to write (as compared to SQL commands). On-premises software may not even be designed to scale to data of high volumes through distributed scale out architectures. Talk track: Developers focus on using a SQL-like language to construct stream processing logic and not worrying about accounting for parallelization, deployment to a distributed platform or creating temporal operators. Use the SQL-like language across streams to filter, project, aggregate, compare reference data, and perform temporal operations. Development, maintenance, and debugging can be done entirely through the Azure Management Portal. For public preview, support: Input: Azure Event Hubs, Azure Blobs Output: Azure Event Hubs, Azure Blobs, Azure SQL Database, Azure Tables
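To make “filter, project” concrete, a minimal declarative sketch (SensorStream, ReadingTime, Temperature and the threshold are hypothetical names and values, not from the deck):

  SELECT DeviceId, Temperature
  FROM SensorStream TIMESTAMP BY ReadingTime
  WHERE Temperature > 75

The same job could add a GROUP BY with a window for aggregation, or a JOIN against reference data, with no code for parallelization, deployment, or scheduling.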
  • #32: The Azure portal provides wizards to guide the user through the process of adding inputs. Every job must have at least one data stream source, and it can have multiple data streams. Currently supported data stream sources are Event Hubs and Blob Storage. Reference data is optional. Reference data is usually data that changes infrequently: a product catalog, data that maps city names to zip codes, customer profile data, etc. Reference data is cached in memory for improved performance. Currently reference data must be in Blob Storage. The wizard collects all the information needed to read events from the input data. Blob Storage advanced options let you specify additional details such as: Blob Serialization boundary: This setting determines when a blob is ready for reading. Stream Analytics supports Blob Boundary (the blob can be uploaded as a single piece or in blocks, but every block must be committed before the blob is read and can't be appended) and Block Boundary (blocks can be continuously added, and each block is individually readable and can be read as it's committed). Path Pattern: The file path used to locate your blobs within the specified container. Within the path, you may choose to specify one or more instances of the following 3 variables: {date}, {time}, {partition}. Ex 1: cluster1/logs/{date}/{time}/{partition} or Ex 2: cluster1/logs/{date} You can test whether you entered the correct information by testing for connectivity. Every data stream must have a name (Input Alias). You use this name in the query to refer to a specific data stream. [It is the name of the ‘table’ you select from; more on this later].
  • #33: In addition to the connectivity information for the data sources, you must also specify the serialization format for the events coming from the source. Currently 3 serialization formats are supported: JSON, CSV and Avro. For the CSV format you can specify the delimiter (comma, semi-colon, colon, tab or space). Only UTF-8 encoding is supported for now.
  • #34: The Azure portal provides wizards to guide the user through the process of adding an output. The process for adding outputs to a job is similar to that of adding inputs. The wizard collects all the information required to connect and store the results in the output. In addition to Blob Storage and Event Hubs, ASA also supports storing the results in an Azure SQL Database. Note that when you use an Azure SQL Database, the schema of the result event and the Azure SQL Database table must be compatible. Just as with inputs, you have to define the serialization formats for Blob Storage and Event Hubs. The three supported formats are CSV, JSON and Avro. UTF-8 is the supported encoding format.
  • #42: Currently reference data cannot be refreshed automatically. You need to stop the job and specify new snapshot with reference data. We are working on reference data refresh functionality, stay tuned for updates.
  • #44: UNION combines the results of two or more queries into a single result set that includes all the rows that belong to all queries in the union. The UNION operation is different from using joins, which combine columns from two tables. The following are basic rules for combining the result sets of two queries by using UNION: the number and the order of the columns must be the same in all queries, and the data types must be compatible. ALL keyword: incorporates all rows into the results, including duplicates. If not specified, duplicate rows are removed.
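A minimal UNION sketch under the deck's toll scenario (the shared column list is an assumption for illustration):

  SELECT TollId, LicensePlate, EntryTime AS EventTime
  FROM EntryStream TIMESTAMP BY EntryTime
  UNION
  SELECT TollId, LicensePlate, ExitTime AS EventTime
  FROM ExitStream TIMESTAMP BY ExitTime

Both branches project the same columns, in the same order, with compatible types, as the rules above require; without ALL, duplicate rows are removed.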
  • #50: Like standard T-SQL, JOINs in the Azure Stream Analytics query language are used to combine records from two or more input sources. JOINs in Azure Stream Analytics are temporal in nature, meaning that each JOIN must provide some limits on how far the matching rows can be separated in time. For instance, saying “join EntryStream events with ExitStream events when they occur on the same LicensePlate and TollId and within 5 minutes of each other” is legitimate; but “join EntryStream events with ExitStream events when they occur on the same LicensePlate and TollId” is not – it would match each EntryStream event with an unbounded and potentially infinite collection of all ExitStream events with the same LicensePlate and TollId. The time bounds for the relationship are specified inside the ON clause of the JOIN, using the DATEDIFF function. The query in this slide joins events in the EntryStream and ExitStream only if they are less than 10 seconds apart. The two “Mazda” events in the EntryStream and ExitStream will NOT be joined because they are more than 10 seconds apart. The two “Honda” events will not be joined because, even though they are less than 10 seconds apart, the event in the ExitStream has a timestamp earlier than the event in the EntryStream; DATEDIFF(second, ES, EX) for these two events will be a negative number. Note: DATEDIFF used in the SELECT statement uses the general syntax where we pass a datetime column or expression as the second and third parameter, but when we use the DATEDIFF function inside the JOIN condition, we pass the input source name or its alias; internally the timestamp associated with each event in that source is picked. You cannot use SELECT * in JOINs.
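A sketch of the temporal JOIN this note describes, assuming EntryStream/ExitStream carry EntryTime/ExitTime payload fields (stream and column names follow the slides; the 10-second bound follows the note):

  SELECT ES.TollId, ES.LicensePlate, DATEDIFF(second, ES, EX) AS TransitSeconds
  FROM EntryStream ES TIMESTAMP BY EntryTime
  JOIN ExitStream EX TIMESTAMP BY ExitTime
    ON ES.TollId = EX.TollId
   AND ES.LicensePlate = EX.LicensePlate
   AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10

The BETWEEN 0 AND 10 bound keeps the match window finite and also rejects exits that precede entries (a negative DATEDIFF, as in the Honda example above).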
  • #51: Windowing (extensions to T-SQL) In applications that process real-time events, a common requirement is to perform some set-based computation (aggregation) or other operations over subsets of events that fall within some period of time. Because the concept of time is a fundamental necessity to complex event-processing systems, it’s important to have a simple way to work with the time component of query logic in the system. In ASA, these subsets of events are defined through windows to represent groupings by time. A window contains event data along a timeline and enables you to perform various operations against the events within that window. For example, you may want to sum the values of a payload field. Every window operation outputs an event at the end of the window. The windows of ASA are open at the window start time and closed at the window end time. For example, if you have a 5-minute window from 12:00 AM to 12:05 AM, all events with a timestamp greater than 12:00 AM and up to 12:05 AM inclusive will be included within this window. The output of the window will be a single event based on the aggregate function used, with a timestamp equal to the window end time. The timestamp of the output event of the window can be projected in the SELECT statement using the System.Timestamp property with an alias. Every window automatically aligns itself to the zeroth hour. For example, a 5-minute tumbling window will align itself to (12:00-12:05], (12:05-12:10], … Note: All windows should be used in a GROUP BY clause. In the example, the SUM of the events in the first window = 1+5+4+6+2 = 18. Currently all window types are of fixed width (fixed interval).
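A minimal sketch of projecting the window's output timestamp, assuming a numeric Toll payload field (a hypothetical name, not from the slides):

  SELECT System.Timestamp AS WindowEnd, TollId, SUM(Toll) AS TotalToll
  FROM EntryStream TIMESTAMP BY EntryTime
  GROUP BY TollId, TumblingWindow(minute, 5)

Each output event carries a WindowEnd equal to the window end time: one event per toll booth per closed window.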
  • #53: Tumbling windows specify a repeating, non-overlapping time interval of a fixed size. Syntax: TUMBLINGWINDOW(timeunit, windowsize) Timeunit – day, hour, minute, second, millisecond, microsecond, nanosecond. Windowsize – a bigint that describes the size (width) of the window. Note that because tumbling windows are non-overlapping, each event can belong to only one tumbling window. The query just counts the number of vehicles passing the toll station every 20 seconds, grouped by TollId.
  • #54: To get a finer granularity of time, we can use a generalized version of the tumbling window, called the hopping window. Hopping windows are windows that "hop" forward in time by a fixed period. The window is defined by two time spans: the hop size H and the window size S. For every H time units, a new window of size S is created. The tumbling window is a special case of a hopping window where the hop size is equal to the window size. Syntax: HOPPINGWINDOW ( timeunit , windowsize , hopsize ) or HOPPINGWINDOW ( Duration( timeunit , windowsize ) , Hop( timeunit , hopsize ) ) Note: the hopping window can be used in either of these two ways. If the windowsize and the hopsize have the same timeunit, you can use the short form without the Duration and Hop functions. The Duration function can also be used with other types of windows to specify the window size; a sketch of the two forms follows.
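A short sketch of the two equivalent forms, reusing the deck's toll scenario:

  -- short form: a 20-second window hopping every 10 seconds
  SELECT TollId, COUNT(*)
  FROM EntryStream TIMESTAMP BY EntryTime
  GROUP BY TollId, HoppingWindow(second, 20, 10)

  -- explicit Duration/Hop form, equivalent to the query above
  SELECT TollId, COUNT(*)
  FROM EntryStream TIMESTAMP BY EntryTime
  GROUP BY TollId, HoppingWindow(Duration(second, 20), Hop(second, 10))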
  • #55: A Sliding window is a fixed length window which moves forward by an (€) epsilon and produces an output only during the occurrence of an event. An epsilon is one hundredth of a nanosecond. Syntax SLIDINGWINDOW ( timeunit , windowsize ) SLIDINGWINDOW(DURATION(timeunit, windowsize), Hop(timeunit, windowsize))
  • #59: The number of streaming units that a job can utilize depends on the partition configuration for the inputs and the query defined for the job. Note also that a valid value for the streaming units must be used. The valid values are 1, 3, 6, and then upward in increments of 6.
  • #61: Partitioning a step enables more streaming units to be allocated to a job, as there is a limit on the number of units that can be assigned to an un-partitioned step. Partitioning requires that all three conditions listed in the slide be satisfied. When a query is partitioned, the input events are processed and aggregated in separate partition groups, and output events are generated for each of the groups. If a combined aggregate is desirable, you must create a second non-partitioned step to aggregate, as in the sketch below. The preview release of Azure Stream Analytics doesn't support partitioning by column names; you can only partition by the PartitionId field, which is a built-in field in your query. The PartitionId field indicates which partition of the source data stream the event comes from. Since Event Hubs supports partitioning, you can easily develop partitioned queries that read data from Event Hubs.
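A sketch of that two-step pattern, assuming a partitioned Event Hub input named EntryStream (the window size is illustrative):

  WITH PerPartition AS (
    SELECT TollId, COUNT(*) AS PartialCount
    FROM EntryStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(minute, 1), TollId, PartitionId
  )
  SELECT TollId, SUM(PartialCount) AS TotalCount
  FROM PerPartition
  GROUP BY TumblingWindow(minute, 1), TollId

The partitioned step scales out per PartitionId; the non-partitioned outer step combines the partial counts into a single aggregate per toll booth.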
  • #67: Microsoft offers both on-premises and cloud-based real-time stream processing options. StreamInsight is offered as part of SQL Server and should be used for on-premises deployments. The Microsoft Azure platform offers a vast set of data services, and while it’s a luxury to have such a broad array of capabilities to select from, it can also present a challenge. Designing a solution requires that you evaluate which offerings are best suited to your requirements as part of the planning and design project phases. There are a number of instances where Azure provides similar platforms for a given task. For example, Storm for Azure HDInsight and Azure Stream Analytics are both platform-as-a-service (PaaS) offerings providing real-time event stream processing. Both of these services are highly capable engines suitable for a range of solution deployments, however, some of the differences will influence the decision for which services is best suited to a project. Storm for Azure HDInsight is an Apache open-source stream analytics platform running on Microsoft Azure to do real-time data processing. Storm is highly flexible with contributions from the Hadoop community and highly customizable through any development language like Java and .NET (deep Visual Studio IDE integration). Azure Stream Analytics is a fully managed Microsoft first party event processing engine that provides real-time analytics in a SQL-based query language to speed time of development. Stream Analytics makes it easy to operationalize event processing with a small number of resources and drives a low price point with its multi-tenancy architecture.
  • #69: http://stackoverflow.com/questions/31130025/azure-storm-vs-azure-stream-analytics http://blogs.technet.com/b/dataplatforminsider/archive/2014/10/16/the-ins-and-outs-of-apache-storm-real-time-processing-for-hadoop.aspx?WT.mc_id=Blog_SQL_Announce_DI
  • #70: https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-comparison-storm/
  • #73: (http://feedback.azure.com/forums/270577-azure-stream-analytics)