BDA Unit 4

Real-time data (RTD) is information processed immediately after generation, contrasting with traditional batch processing. Real-time analytics enables organizations to gain actionable insights quickly, enhancing decision-making and user interactions. Key characteristics of real-time systems include time constraints, correctness, safety, and scalability, with applications in various sectors such as healthcare, finance, and e-commerce.


UNIT IV

What is Real-Time Data?


Real-time data (RTD) refers to information that is processed, consumed, and/or acted upon
immediately after it's generated. While data processing is not new, real-time data is a newer
paradigm that changes how businesses run.

Batch vs Real-Time Data Processing


In previous years, batch data processing was the norm. Systems had to collect, process, and
store large volumes of data as separate functions before data could be utilized for further
action. In situations where real-time data or analytics are not needed, batch processing is still
a viable process.

In contrast, real-time data processing (or streaming data) can collect, store, and analyze
continuously, making data readily available to the end-user as soon as it's generated with no
delay.

While databases and offline data analysis remain valid tools, the need for real-time data has
increased exponentially with the advent of modern applications. After all, the world isn’t a
batch process - it runs in real-time.

Real-Time Analytics

Real-time analytics allows users to view, analyse, and understand data as soon as it enters the system. Mathematical reasoning and logic are applied to the data as it arrives, giving users an up-to-the-moment picture on which to base decisions.

Overview

Real-time analytics allows organizations to gain awareness and actionable information immediately, or as soon as data enters their systems. Real-time analytics responses are completed within seconds to minutes: a huge amount of data can be processed in a short time with high speed and a low response time. For instance, real-time big-data analytics makes use of financial databases to inform traders' decisions. Analytics may be performed on-demand or continuously: on-demand analytics delivers results when the user requests them, while continuous analytics updates results as events occur. Real-time analytics can also be programmed to respond to specific circumstances automatically; for instance, real-time web analytics could alert the administrator if page-load performance falls outside preset boundaries.

Examples
Examples of real-time customer analytics include the following.
o Monitoring orders as they take place, to trace them better and to identify trends, such as which types of clothing are selling.
o Continuously updating measures of customer interaction, such as the number of page views and shopping cart usage, to better understand user behaviour.
o Identifying customers who are further along in their shopping journey in a store, and influencing decisions in real time.

The Operation of Real-time Analytics


Real-time analytics tools can either push or pull data. Streaming demands the capacity to push huge amounts of fast-moving data. If streaming consumes too many resources or is not practical, data can instead be pulled at intervals ranging from a couple of seconds to hours; the pulls can be scheduled around business requirements so that they do not interrupt the workflow. The time to react for real-time analysis can vary from nearly instantaneous to a few seconds or minutes. The key components of real-time analytics comprise the following.
o Aggregator
o Broker
o Analytics engine
o Stream processor

Benefits of Real-time Analytics


Speed is the primary benefit of real-time data analysis. The less time a business has to wait between data arriving and data being processed, the sooner it can use those insights to make changes and act on crucial decisions.
In the same way, real-time analytics tools allow companies to see how users interact with a product right after its release, so there is no delay in understanding user behaviour and making the necessary adjustments.
Advantages of Real-time Analytics:
Real-time analytics provides the following benefits over traditional analytics.
o Create custom interactive analytics tools.
o Share information through transparent dashboards.
o Monitor behaviour in a customized way.
o Perform immediate adjustments if necessary.
o Make use of machine learning.

Real-Time Data Analytics with Apache Kafka

 GPS Data: GPS-enabled devices, including mobile phones, produce streams of geographical
data. Using real-time location data, businesses can track delivery fleets. Air traffic controllers
can land planes safely. Commuters can use live traffic data to choose the fastest route. Social
networks can use GPS data streams to build a more accurate model of our social
relationships. Real-time data streams allow cars to ingest, store, and integrate live GPS data
with self-driving software to form the backbone of autonomous cars, delivery drones, and the
internet of things (IoT).
 Ride Share Applications: Uber relies on real-time data to match customers to drivers. Real-
time data is also collected to forecast demand, compute performance metrics, and extract
patterns of human behavior from event streams. Not only would real-time data streams allow
for seamless customer experience, they'd also provide real-time fraud detection, anomaly
detection, marketing campaigns, visualization, and customer feedback. The company
uses Apache Kafka to achieve real-time data at this scale, processing over 30 billion
messages per day.
 Streaming Platforms: Netflix embraces event streams to achieve speed and scalability in all
aspects of its business. Streaming is the communication mechanism for the entire Netflix
ecosystem. The company uses Apache Kafka to support a variety of microservices, ranging
from studio financing to real-time data on the service levels within its infrastructure.
 Walmart: Walmart operates thousands of stores and hundreds of distribution centers across
the world. The company also makes millions of online transactions. Walmart uses Apache
Kafka to drive its real-time inventory management system. The system ingests 500 million
events per day and ensures that the company has an accurate view of its entire inventory in
real-time. The system also supports Walmart's telemetry, alerting, and auditing requirements.
 Medical data: Real-time data on heart rate, blood pressure and oxygen saturation enables
hospitals to identify patients whose health is at risk of deteriorating. In the case of Covid-19,
when hospitals were short on equipment and personnel and at patient capacity, real-time data analytics enabled hospitals to optimize the use of intensive care units, ventilators, and patient health data in real time, increasing efficiency and streamlining processes.
 Another example is heart attacks. Approximately 10% of patients suffer heart attacks while
they are already in a hospital. By using real-time data analytics, heart attacks could be
predicted before they happen. Electronic monitoring and predictive analytics are vital in
many clinical areas where patient safety is at stake.
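
To make the Kafka pattern in these examples concrete, the following Python sketch shows a consumer reading a live event stream and applying a simple alerting rule. It is only an illustration, not code from any of the systems above: it assumes the kafka-python package, a broker on localhost:9092, and a hypothetical topic named patient-vitals carrying JSON vital-sign events.

# Minimal sketch: consuming a real-time event stream with kafka-python.
# Assumptions: kafka-python is installed, a broker runs on localhost:9092,
# and producers publish JSON vital-sign events to a hypothetical topic
# named "patient-vitals".
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "patient-vitals",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

for message in consumer:
    event = message.value  # e.g. {"patient_id": "p17", "spo2": 91, "heart_rate": 118}
    # A trivial rule standing in for a real clinical model:
    if event.get("spo2", 100) < 92 or event.get("heart_rate", 0) > 120:
        print(f"ALERT: patient {event.get('patient_id', '?')} may be deteriorating: {event}")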

Characteristics of Real-time System:


Following are some of the characteristics of a real-time system:

1. Time Constraints: A time constraint in a real-time system is the time interval allotted for the response of the ongoing program. This deadline means that the task should be completed within that interval. The real-time system is responsible for completing all tasks within their time intervals.
2. Correctness: Correctness is one of the prominent features of real-time systems. A real-time system must produce the correct result within the given time interval; if the correct result arrives after the deadline, it is still not considered correct. In real-time systems, correctness therefore means obtaining the right result within the time constraint.
3. Embedded: Most real-time systems today are embedded. An embedded system is a combination of hardware and software designed for a specific purpose. Real-time systems collect data from the environment and pass it to other components of the system for processing.
4. Safety: Safety is necessary for any system, but real-time systems are often safety-critical. They can operate for long periods without failure, and when a failure does occur they recover quickly without causing any harm to data and information.
5. Concurrency: Real-time systems are concurrent, meaning they can respond to several processes at a time. Several different tasks run within the system, and it responds to every task within short intervals. This makes real-time systems concurrent systems.
6. Distributed: In many real-time systems, the components are connected in a distributed way, with different components at different geographical locations. The operations of such real-time systems are therefore carried out in a distributed manner.
7. Stability: Even when the load is very heavy, real-time systems respond within the time constraint; they do not delay the results of tasks even when several tasks are running at the same time. This gives real-time systems their stability.
8. Fault tolerance: Real-time systems must be designed to tolerate and recover from faults
or errors. The system should be able to detect errors and recover from them without
affecting the system’s performance or output.
9. Determinism: Real-time systems must exhibit deterministic behavior, which means that
the system’s behavior must be predictable and repeatable for a given input. The system
must always produce the same output for a given input, regardless of the load or other
factors.
10. Real-time communication: Real-time systems often require real-time communication
between different components or devices. The system must ensure that communication
is reliable, fast, and secure.
11. Resource management: Real-time systems must manage their resources efficiently,
including processing power, memory, and input/output devices. The system must ensure
that resources are used optimally to meet the time constraints and produce correct
results.
12. Heterogeneous environment: Real-time systems may operate in a heterogeneous
environment, where different components or devices have different characteristics or
capabilities. The system must be designed to handle these differences and ensure that all
components work together seamlessly.
13. Scalability: Real-time systems must be scalable, which means that the system must be
able to handle varying workloads and increase or decrease its resources as needed.
14. Security: Real-time systems may handle sensitive data or operate in critical
environments, which makes security a crucial aspect. The system must ensure that data
is protected and access is restricted to authorized users only.
Scalability, High Availability, and Performance
The terms scalability, high availability, performance, and mission-critical can mean different things to different organizations, or to different departments within an organization. They are often interchanged and create confusion that results in poorly managed expectations, implementation delays, or unrealistic metrics. This section provides the tools to define these terms so that your team can implement mission-critical systems with well understood performance goals.
Scalability
It's the property of a system or application to handle bigger amounts of work, or to be easily
expanded, in response to increased demand for network, processing, database access or file system resources.
Horizontal scalability
A system scales horizontally, or out, when it's expanded by adding new nodes with identical
functionality to existing ones, redistributing the load among all of them. SOA systems and
web servers scale out by adding more servers to a load-balanced network so that incoming requests may be distributed among all of them. Cluster is a common term for describing a scaled out processing system.

Figure: Clustering
Vertical scalability
A system scales vertically, or up, when it's expanded by adding processing, main memory,
storage, or network interfaces to a node to satisfy more requests per system. Hosting services
companies scale up by increasing the number of processors or the amount of main memory to
host more virtual servers in the same hardware.

Figure: Virtualization
High Availability
Availability describes how well a system provides useful resources over a set period of time.
High availability guarantees an absolute degree of functional continuity within a time
window expressed as the relationship between uptime and downtime.
A = 100 – (100*D/U), D ::= unplanned downtime, U ::= uptime; D, U expressed in minutes
Uptime and availability don't mean the same thing. A system may be up for a complete
measuring period, but may be unavailable due to network outages or downtime in related
support systems. Downtime and unavailability are synonymous.
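
As a quick worked example of the availability formula above (a sketch only; the downtime figures match the vendor table that follows):

# Worked example of A = 100 - (100 * D / U), with D and U in minutes.
def availability(downtime_minutes: float, uptime_minutes: float = 525_600) -> float:
    """Availability as a percentage, taking U as a full 365-day year (525,600 minutes)."""
    return 100 - (100 * downtime_minutes / uptime_minutes)

# 525.6 minutes of unplanned downtime in a year gives "three nines":
print(availability(525.6))   # approximately 99.9
print(availability(52.56))   # approximately 99.99 ("four nines")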
Measuring Availability
Vendors define availability as a given number of "nines" like in Table 1, which also describes
the number of minutes or seconds of estimated downtime in relation to the number of minutes
in a 365-day year, or 525,600, making U a constant for their marketing purposes.
Availability %   Downtime in Minutes   Downtime per Year   Vendor Jargon
90               52,560.00             36.5 days           one nine
99               5,256.00              4 days              two nines
99.9             525.60                8.8 hours           three nines
99.99            52.56                 53 minutes          four nines
99.999           5.26                  5.3 minutes         five nines
99.9999          0.53                  32 seconds          six nines

Table: Availability as a Percentage of Total Yearly Uptime


Analysis
High availability depends on the expected uptime defined for system requirements; don't be
misled by vendor figures. The meaning of having a highly available system and its
measurable uptime are a direct function of a Service Level Agreement. Availability goes up
when factoring planned downtime, such as a monthly 8-hour maintenance window. The cost
of each additional nine of availability can grow exponentially. Availability is a function of
scaling the systems up or out and implementing system, network, and storage redundancy.
Service Level Agreement (SLA)
SLAs are the negotiated terms that outline the obligations of the two parties involved in
delivering and using a system, like:
 System type (virtual or dedicated servers, shared hosting)
 Levels of availability
o Minimum
o Target
 Uptime
o Network
o Power
o Maintenance windows
 Serviceability
 Performance and Metrics
 Billing
SLAs can bind obligations between two internal organizations (e.g. the IT and e-commerce
departments), or between the organization and an outsourced services provider. The SLA
establishes the metrics for evaluating the system performance, and provides the definitions
for availability and the scalability targets. It makes no sense to talk about any of these topics
unless an SLA is being drawn or one already exists.
Elasticity
Elasticity is the ability to dynamically add and remove resources in a system in response to
demand, and is a specialized implementation of scaling horizontally or vertically.
As requests increase during a busy period, more nodes can be automatically added to a cluster
to scale out and removed when the demand has faded – similar to seasonal hiring at brick and
mortar retailers. Additionally, system resources can be re-allocated to better support a system
for scaling up dynamically.
Implementing Scalable Systems
SLAs determine whether systems must scale up or out. They also drive the growth timeline.
A stock trading system must scale in real-time within minimum and maximum availability
levels. An e-commerce system, in contrast, may scale in during the "slow" months of the
year, and scale out during the retail holiday season to satisfy much larger demand.
Load Balancing
Load balancing is a technique for minimizing response time and maximizing throughput by
spreading requests among two or more resources. Load balancers may be implemented in
dedicated hardware devices, or in software. Figure 3 shows how load-balanced systems
appear to the resource consumers as a single resource exposed through a well-known address.
The load balancer is responsible for routing requests to available systems based on a
scheduling rule.
Figure: Load Balancing
Scheduling rules are algorithms for determining which server must service a request. Web applications and services are typically balanced by following round robin scheduling rules, but can also balance based on least-connected, IP-hash, or a number of other options. Caching pools are balanced by applying frequency rules and expiration algorithms. Applications where stateless requests arrive with a uniform probability for any number of servers may use a pseudo-random scheduler. Applications like music stores, where some content is statistically more popular, may use asymmetric load balancers to shift the larger number of popular requests to higher performance systems, serving the rest of the requests from less powerful systems or clusters.
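
To illustrate the idea of a scheduling rule, here is a minimal round-robin sketch in Python. The server names are hypothetical, and a real load balancer would add health checks, weighting, and session persistence.

# Minimal round-robin scheduling rule: each request goes to the next server in the pool.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._servers = cycle(servers)  # endless iterator over the pool

    def route(self, request):
        server = next(self._servers)
        # A real balancer would forward the request; here we just report the choice.
        return f"{request} -> {server}"

balancer = RoundRobinBalancer(["app-node-1", "app-node-2", "app-node-3"])
for i in range(5):
    print(balancer.route(f"request-{i}"))
# request-0 -> app-node-1, request-1 -> app-node-2, request-2 -> app-node-3, request-3 -> app-node-1, ...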
Persistent Load Balancers
Stateful applications require persistent or sticky load balancing, where a consumer is
guaranteed to maintain a session with a specific server from the pool. Figure 4 shows a sticky
balancer that maintains sessions from multiple clients. Figure 5 shows how the cluster
maintains sessions by sharing data using a database.
Figure: Sticky Load Balancer
Common Features of a Load Balancer
 Asymmetric load distribution: Assigns some servers to handle a bigger load than others.
 Content filtering: Inbound or outbound.
 Distributed Denial of Service (DDoS) attack protection.
 Firewall.
 Payload switching: Sends requests to different servers based on URI, port, and/or protocol.
 Priority activation: Adds standby servers to the pool.
 Rate shaping: Ability to give different priority to different traffic.
 Scripting: Reduces human interaction by implementing programming rules or actions.
 SSL termination: Hardware-assisted encryption frees web server resources.
 TCP buffering and offloading: Throttle requests to servers in the pool.
 GZIP compression: Decreases transfer bandwidth utilization.
Figure: Database Sessions
Caching Strategies
Stateful load balancing techniques require data sharing among the service providers. Caching is a technique for sharing, among multiple consumers or servers, data that are expensive to either compute or fetch. Data are stored and retrieved in a subsystem that provides quick access to a copy of the frequently accessed data.
Caches are implemented as an indexed table where a unique key is used for referencing some datum. Consumers access data by checking (hitting) the cache first and retrieving the datum from it. If it's not there (cache miss), then the costlier retrieval operation takes place and the consumer or a subsystem inserts the datum into the cache.
Write Policy
The cache may become stale if the backing store changes without updating the cache. A write
policy for the cache defines how cached data are refreshed. Some common write policies
include:
 Write-through: Every write to the cache follows a synchronous write to the backing store (a minimal sketch appears after this list).
 Write-behind: Updated entries are marked in the cache table as dirty and it's updated
only when a dirty datum is requested.
 No-write allocation: Only read requests are cached under the assumption that the data
won't change over time but it's expensive to retrieve.
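
Below is a minimal sketch of the write-through policy mentioned above. It assumes an in-memory dict as the cache and a hypothetical backing_store object exposing get and put methods; real caches add eviction, expiration, and concurrency control.

# Minimal write-through cache: every write updates the cache and, synchronously,
# the backing store; reads hit the cache first and fall back to the store on a miss.
class WriteThroughCache:
    def __init__(self, backing_store):
        self._store = backing_store   # hypothetical object exposing get(key) / put(key, value)
        self._cache = {}

    def get(self, key):
        if key in self._cache:        # cache hit
            return self._cache[key]
        value = self._store.get(key)  # cache miss: costlier retrieval
        self._cache[key] = value
        return value

    def put(self, key, value):
        self._cache[key] = value      # update the cache...
        self._store.put(key, value)   # ...and synchronously write through to the backing store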
Application Caching
 Implicit caching happens when there is little or no programmer participation in
implementing the caching. The program executes queries and updates using its native
API and the caching layer automatically caches the requests independently of the
application. Example: Terracotta (https://www.terracotta.org/).
 Explicit caching happens when the programmer participates in implementing the
caching API and may also implement the caching policies. The program must import
the caching API into its flow in order to use it. Examples: memcached
(http://www.danga.com/memcached), Redis (https://redis.io), and Oracle Coherence
(http://coherence.oracle.com).
In general, implicit caching systems are specific to a platform or language. Terracotta, for
example, only works with Java and JVM-hosted languages like Groovy or Kotlin. Explicit
caching systems may be used with many programming languages and across multiple
platforms at the same time. Memcached and Redis work with every major programming
language, and Coherence works with Java, .Net, and native C++ applications.
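
As a sketch of explicit caching with one of the tools named above, the snippet below uses the redis-py client against a Redis server assumed to be running on localhost; load_user_from_db is a hypothetical stand-in for an expensive database query.

# Explicit caching sketch with Redis: the application imports the caching API
# and decides itself when to check and populate the cache.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def load_user_from_db(user_id):
    # Stand-in for an expensive database lookup (hypothetical).
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                      # cache hit
        return json.loads(cached)
    user = load_user_from_db(user_id)           # cache miss: costly retrieval
    cache.setex(key, 300, json.dumps(user))     # store with a 5-minute expiration
    return user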
Web Caching
Web caching is used for storing documents or portions of documents (‘particles') to reduce
server load, bandwidth usage and lag for web applications. Web caching can exist on the
browser (user cache) or on the server, the topic of this section. Web caches are invisible to
the client and may be classified in any of these categories:
 Web accelerators: they operate on behalf of the server of origin. Used for expediting
access to heavy resources, like media files, and are often geolocated closer to intended
recipients. Content distribution networks (CDNs) are an example of web acceleration
caches; Akamai, Amazon S3, Nirvanix are examples of this technology.
 Proxy caches: they serve requests to a group of clients that may all have access to the
same resources. They can be used for content filtering and for reducing bandwidth usage. Squid, Apache, Amazon CloudFront, ISA Server are examples of this technology.
Distributed Caching
Caching techniques can be implemented across multiple systems that serve requests for
multiple consumers and from multiple resources. These are known as distributed caches, like
the setup in Figure 6. Akamai is an example of a distributed web cache, and memcached is an
example of a distributed application cache.

Figure: Distributed Cache


Clustering
A cluster is a group of computer systems that work together to form what appears to the user
as a single system. Clusters are deployed to improve services availability or to increase
computational or data manipulation performance. In terms of equivalent computing power, a cluster is more cost-effective than a monolithic system with the same performance characteristics.
The systems in a cluster are interconnected over high-speed local area networks like gigabit Ethernet, fiber distributed data interface (FDDI), Infiniband, Myrinet, or other technologies.

Figure: Load Balancing Cluster


Load-balancing cluster (active/active): Distributes the load among multiple back-end, redundant nodes. All nodes in the cluster offer full-service capabilities to the consumers and are active at the same time.
High availability cluster (active/passive): Improves services availability by providing uninterrupted service through redundant clusters that eliminate single points of failure. High availability clusters require two nodes at a minimum, a "heartbeat" to detect that all nodes are ready, and a routing mechanism that will automatically switch traffic, or fail over, if the main cluster fails.

Figure: Cluster Failover


Grid: Processes workloads defined as independent jobs that don't require data sharing among processes. Storage or network may be shared across all nodes of the grid, but intermediate results have no bearing on other jobs' progress or on other nodes in the grid, such as a Cloudera MapReduce cluster (http://www.cloudera.com).

Figure: Computational Clusters


Computational clusters: Execute processes that require raw computational power instead of
executing transactional operations like web or database clusters. The nodes are tightly
coupled, homogeneous, and in close physical proximity. They often replace supercomputers.
Redundancy and Fault Tolerance
Redundant system design depends on the expectation that any system component failure is
independent of failure in the other components.
Fault tolerant systems continue to operate in the event of component or subsystem failure; throughput may decrease but overall system availability remains constant. Faults in hardware or software are handled through component redundancy or safe fallbacks, if one can be made in software. Fault tolerance in software is often implemented as a fallback method if a dependent system is unavailable. Fault tolerance requirements are derived from SLAs. The implementation depends on the hardware and software components, and on the rules by which they interact.
Fault Tolerance SLA Requirements
 No single point of failure: Redundant components ensure continuous operation and allow repairs without disruption of service.
 Fault isolation: Problem detection must pinpoint the specific faulty component.
 Fault propagation containment: Faults in one component must not cascade to others.
 Reversion mode: Set the system back to a known state.
Redundant clustered systems can provide higher availability, better throughput, and fault
tolerance. The A/A cluster in Figure 10 provides uninterrupted service for a scalable, stateless
application.

Figure: A/A fault tolerance and recovery


Some stateful applications may only scale up; the A/P cluster in Figure 11 provides
uninterrupted service and disaster recovery for such an application. Active/Active
configurations provide failure transparency. Active/Passive configurations may provide
failure transparency at a much higher cost because automatic failure detection and
reconfiguration are implemented through a feedback control system, which is more expensive
and trickier to implement.

Figure: A/P fault tolerance and recovery


Enterprise systems most commonly implement A/P fault tolerance and recovery through fault
transparency by diverting services to the passive system and bringing it on-line as soon as
possible. Robotics and life-critical systems may implement probabilistic, linear model, fault
hiding, and optimization control systems instead.
Multi-Region
Redundant systems often span multiple regions in order to isolate geographic phenomena,
provide failover capabilities, and deliver content as close to the consumer as possible. These
redundancies cascade down through the system into all services, and a single scalable system
may have a number of load balanced clusters throughout.
Cloud Computing
Cloud computing describes applications running on distributed computing resources owned and operated by a third party.
End-user apps are the most common examples. They utilize the Software as a Service (SaaS)
and Platform as a Service (PaaS) computing models.

Figure: Cloud computing configuration


Cloud Services Types
 Web services: Salesforce.com, USPS, Google Maps.
 Service platforms: Google App Engine, Amazon Web Services (EC2, S3, Cloud
Front), Nirvanix, Akamai, MuleSource.
Fault Detection Methods
Fault detection methods must provide enough information to isolate the fault and execute
automatic or assisted failover action. Some of the most common fault detection methods
include:
 Built-in diagnostics.
 Protocol sniffers.
 Sanity checks.
 Watchdog checks.
Criticality is defined as the number of consecutive faults reported by two or more detection
mechanisms over a fixed time period. A fault detection mechanism is useless if it reports
every single glitch (noise) or if it fails to report a real fault over a number of monitoring
periods.
System Performance
Performance refers to the system throughput and latency under a particular workload for a
defined period of time. Performance testing validates implementation decisions about the
system throughput, scalability, reliability, and resource usage. Performance engineers work
with the development and deployment teams to ensure that the system's non-functional
requirements like SLAs are implemented as part of the system development lifecycle. System
performance encompasses hardware, software, and networking optimizations.
Tip: Performance testing efforts must begin at the same time as the development project and
continue through deployment. Testing should be performed against a mirror of the production
environment, if possible.
The performance engineer's objective is to detect bottlenecks early and to collaborate with the
development and deployment teams on eliminating them.
System Performance Tests
Performance specifications are documented along with the SLA and with the system design.
Performance troubleshooting includes these types of testing:
 Endurance testing: Identifies resource leaks under the continuous, expected load.
 Load testing: Determines the system behavior under a specific load.
 Spike testing: Shows how the system operates in response to dramatic changes in
load.
 Stress testing: Identifies the breaking point for the application under dramatic load
changes for extended periods of time.
What are events?
An event is a happening of interest. An event can also be defined as a meaningful change of
state, which can be further used for sending notifications or in order to generate desired
results/outputs.
Events can be either simple or complex. Simple events are directly detectable and not
composed with other events. A complex event is described as an event that is produced by
composing two or more simple events through operators of an event algebra and/or enriching
an event with external information.

How are events triggered from sensors?


Let us take the example of a car's tire pressure: imagine that a car has several sensors—one that measures tire pressure, one that measures speed, and so on. Let us imagine that the sensor for tire pressure during a drive notes a change from 45 psi (pounds per square inch) to 41 psi within 15 minutes.

As the pressure in the tire decreases, a series of events concerning the tire pressure is
generated. In addition, a series of events containing the speed of the car is generated. The
car’s event processor may detect a situation whereby a loss of tire pressure over a relatively
long period of time results in the creation of the “LossOfTirePressure” event.

This new event may trigger a reaction process to note the pressure loss into the car’s
maintenance log, and alert the driver via the car’s portal that the tire pressure has reduced.
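
A hedged sketch of how such a complex event could be derived from the simple pressure readings described above; the class name, thresholds, and reaction are illustrative only, not an actual automotive implementation.

# Sketch: deriving a complex "LossOfTirePressure" event from a stream of simple
# tire-pressure readings (psi values sampled over time).
from collections import deque

class TirePressureMonitor:
    def __init__(self, window_size=15, drop_threshold=3.0):
        self.readings = deque(maxlen=window_size)  # last N per-minute readings
        self.drop_threshold = drop_threshold       # psi drop that counts as a loss

    def on_reading(self, psi):
        self.readings.append(psi)
        if len(self.readings) == self.readings.maxlen:
            drop = self.readings[0] - self.readings[-1]
            if drop >= self.drop_threshold:
                return {"event": "LossOfTirePressure", "drop_psi": drop}
        return None

monitor = TirePressureMonitor()
for psi in [45, 45, 44.5, 44, 43.5, 43, 43, 42.5, 42, 42, 41.8, 41.5, 41.3, 41.1, 41]:
    complex_event = monitor.on_reading(psi)
    if complex_event:
        print("Notify driver and write to maintenance log:", complex_event)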

What is Complex Event Processing (CEP)?


Event processing is a method of tracking and analyzing (processing) streams of information
(data) about things that happen (events) and deriving a conclusion from them. Complex
event processing (CEP) combines data from multiple sources to infer events or patterns that
suggest more complicated circumstances. The following figure shows how CEP is
aggregating event streams and detecting patterns.
Major Application Areas for Complex Event Processing [CEP]:
Business Activity Monitoring aims at identifying problems and opportunities in early stages
by monitoring business processes and other critical resources.
Sensor Networks that are used for monitoring of industrial facilities. These are usually
derived from raw numerical measurements [e.g., temperature, smoke].
Market data such as stock or commodity prices; they need to be derived from several events
and their relationships through CEP.
The Most Common Tools Used for Complex Event Processing Are:
Apache Spark Streaming used by Databricks
Apache Flink used by data Artisans
Apache Samza used by LinkedIn
Apache Storm used by Twitter
Hadoop/MapReduce.
Amazon Kinesis Analytics
Microsoft Azure Stream Analytics, Stream Insight
Fujitsu Software Interstage Big Data Complex Event Processing Server
IBM Streams, Operational Decision Manager [ODM]
Oracle Stream Analytics and Stream Explorer
Data Stream Analytics Platforms
EPS stands for "Event Processing System." These systems are designed to detect, analyze,
and respond to events or patterns within data streams in real-time or near real-time. There are
several types of EPSs based on their approach to processing events:
1. Query-Based EPSs:
 These EPSs operate by continuously monitoring data streams for events that match
predefined queries or conditions.
 They typically use query languages or query builders to define the conditions for
event detection.
 Examples include systems that use SQL-like queries to monitor databases or data
streams for specific patterns or conditions.
2. Rule-Oriented EPSs:
 Rule-oriented EPSs use a set of predefined rules or rulesets to detect events or
patterns in data streams.
 Rules are typically expressed in a rule language or a rule definition format.
 These systems evaluate incoming events against the defined rules and trigger actions
based on matches.
 Rules can be simple conditions or complex combinations of conditions and actions.
 Examples include Complex Event Processing (CEP) systems, which are commonly used in financial trading, fraud detection, and network monitoring (a minimal rule-engine sketch appears after this list).
3. Programmatic EPSs:
 Programmatic EPSs provide a more flexible and customizable approach to event
processing by allowing developers to write custom code or scripts to define event
detection and response logic.
 Developers can use programming languages or scripting languages to implement
event processing algorithms tailored to specific requirements.
 These systems offer greater flexibility but may require more development effort
compared to query-based or rule-oriented approaches.
 Examples include custom-built event processing systems implemented using
programming languages like Java, Python, or Scala, often leveraging libraries or
frameworks for stream processing such as Apache Flink, Apache Storm, or Apache
Kafka Streams.
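
To make the rule-oriented style described in item 2 concrete, here is a minimal Python sketch in which each rule is a condition/action pair evaluated against every incoming event; the event fields and rules are hypothetical.

# Minimal rule-oriented event processing sketch: each rule is a predicate plus an action,
# evaluated against every event that arrives on the stream.
rules = [
    (lambda e: e.get("type") == "trade" and e.get("amount", 0) > 1_000_000,
     lambda e: print("Large trade flagged for review:", e)),
    (lambda e: e.get("type") == "login" and e.get("failed_attempts", 0) >= 5,
     lambda e: print("Possible brute-force attack:", e)),
]

def process(event):
    for condition, action in rules:
        if condition(event):   # rule matched: trigger its action
            action(event)

# Hypothetical event stream:
for event in [
    {"type": "trade", "amount": 2_500_000, "account": "A-17"},
    {"type": "login", "failed_attempts": 6, "user": "bob"},
    {"type": "trade", "amount": 500, "account": "A-02"},
]:
    process(event)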

Each type of EPS has its advantages and use cases, and the choice depends on factors such as
the complexity of event processing logic, performance requirements, and ease of
development and maintenance.
The Difference Between Real-Time, Near Real-Time, and Batch Processing
in Big Data

When it comes to data processing, there are more ways to do it than ever. Your choices
include real-time, near real-time, and batch processing. How you do it and the tools you
choose depend largely on what your purposes are for processing the data in the first place.
In many cases, you’re processing historical and archived data and time isn’t so critical. You
can wait a few hours for your answer, and if necessary, a few days. Conversely, other
processing tasks are crucial, and the answers need to be delivered within seconds to be of
value.
Real-time, near real-time, and batch processing

Type of data processing | When do you need it?
Real-time               | When you need information processed immediately (such as at a bank ATM)
Near real-time          | When speed is important, but you don't need it immediately (such as producing operational intelligence)
Batch                   | When you can wait for days (or longer) for processing (payroll is a good example)

What is real-time processing and when do you need it?


Real-time processing requires a continual input, constant processing, and steady output of
data.
Great examples of real-time processing are data streaming, radar systems, customer service systems, and bank ATMs, where immediate processing is crucial to make the system work properly. Spark is a great tool to use for real-time processing.
Examples of real-time processing:
 Data streaming
 Radar systems
 Customer service systems
 Bank ATMs
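
Since Spark is mentioned above as a tool for real-time processing, here is a hedged PySpark sketch of a Structured Streaming word count over a socket source. It assumes a local Spark installation and a process writing text lines to localhost:9999 (for example, nc -lk 9999); strictly speaking, Spark processes the stream in micro-batches.

# Sketch: Spark Structured Streaming word count over a socket source.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")        # rewrite the full aggregate each micro-batch
         .format("console")
         .start())
query.awaitTermination()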
What is near real-time processing and when do you need it?
This processing is when speed is important, but processing time in minutes is acceptable in
lieu of seconds.
An example of this processing is the production of operational intelligence, which is a
combination of data processing and Complex Event Processing (CEP). CEP involves
combining data from multiple sources in order to detect patterns. It’s useful for identifying
opportunities in the data sets (such as sales leads) as well as threats (detecting an intruder in
the network).
Operational intelligence, or OI, should not be confused with Operational business
intelligence, or OBI, which involves the analysis of historical and archived data for strategic
and planning purposes. It is not necessary to process OBI in real time or near-real time.
Examples of near real-time processing:
 Processing sensor data
 IT systems monitoring
 Financial transaction processing
What is batch processing and when do you need it?
Batch processing is even less time-sensitive than near real-time. In fact, batch processing jobs
can take hours, or perhaps even days.
Batch processing involves three separate processes. First, data is collected, usually over a
period of time. Second, the data is processed by a separate program. Thirdly, the data is
output. Examples of data entered in for analysis can include operational data, historical and
archived data, data from social media, service data, etc.
MapReduce is a useful and incredibly powerful tool for batch processing and analytics that don't need to be real-time or near real-time.
Examples of uses for batch processing include payroll and billing activities, which usually
occur on monthly cycles, and deep analytics that are not essential for fast intelligence
necessary for immediate decision making.
Examples of batch processing:
 Payroll
 Billing
 Orders from customers

Popular Stream Processing Frameworks Compared

Types of Stream Processing Engines


There are three major types of processing engines.

1. Open Source Compositional Engines


In compositional stream processing engines, developers define the Directed Acyclic Graph
(DAG) in advance and then process the data. This may simplify code, but also means
developers need to plan their data stream architecture carefully to avoid inefficient
processing.
Challenges: Compositional stream processing engines are considered the “first generation” of stream
processing and can be complex and difficult to manage.
Examples: Compositional engines include Samza, Apex, and Apache Storm.

2. Managed Declarative Engines


Developers use declarative engines to chain stream processing functions. The engine
calculates the DAG as it ingests the data. Developers can specify the DAG explicitly in their
code, and the engine optimizes it on the fly.
Challenges: While declarative engines are easier to manage, and have readily-available
managed service options, they still require major investments in data engineering to set up the
data pipeline, from source to eventual storage and analysis.
Examples: Declarative engines include Apache Spark Streaming and Flink, both of which
are provided as a managed offering.

3. Fully Managed Self-Service Engines


A new category of stream processing engines is emerging, which not only manages the DAG
but offers an end-to-end solution including ingestion of streaming data into storage
infrastructure, organizing the data, and facilitating streaming analytics.
Examples: Upsolver SQLake is a fully managed declarative data pipeline platform for
streaming and batch data. SQLake handles huge volumes of streaming data, stores it in a
high-performance cloud data lake architecture, and enables real-time access to data and SQL-
based analytics. SQLake can be used to build declarative, self-orchestrating end-to-end data pipelines.

Comparing Popular Stream Processing Frameworks


Apache Spark
Spark is an open-source distributed general-purpose cluster computing framework. Spark’s
in-memory data processing engine conducts analytics, ETL, machine learning and graph
processing on data in motion or at rest. It offers high-level APIs for the programming
languages: Python, Java, Scala, R, and SQL.
The Apache Spark Architecture is founded on Resilient Distributed Datasets (RDDs). These
are distributed immutable tables of data, which are split up and allocated to workers. The
worker executors process the data. The RDD is immutable, so the worker nodes cannot
make alterations; they process information and output results.
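
A small hedged PySpark sketch of the RDD model just described: data is partitioned across workers, transformed without mutation into new RDDs, and results are returned to the driver by actions. It assumes a local PySpark installation.

# Sketch: immutable RDD transformations in PySpark (local mode).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11), numSlices=4)   # split across 4 partitions
squares = numbers.map(lambda x: x * x)                # transformation: builds a new RDD
even_squares = squares.filter(lambda x: x % 2 == 0)   # another new RDD; nothing mutated

print(even_squares.collect())              # [4, 16, 36, 64, 100]
print(squares.reduce(lambda a, b: a + b))  # 385; reduce is another action

spark.stop()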
Pros: Apache Spark is a mature product with a large community, proven in production for
many use cases, and readily supports SQL querying.
Cons:
 Spark can be complex to set up and implement
 It is not a true streaming engine (it performs very fast batch processing)
 Limited language support
 Latency of a few seconds, which eliminates some real-time analytics use cases
Apache Storm
Apache Storm has very low latency and is suitable for near real time processing workloads. It
processes large quantities of data and provides results with lower latency than most other
solutions.
The Apache Storm Architecture is founded on spouts and bolts. Spouts are origins of
information and transfer information to one or more bolts. This information is linked to other
bolts, and the entire topology forms a DAG. Developers define how the spouts and bolts are
connected.

Source: Apache Storm


Pros:
 Probably the best technical solution for true real-time processing
 Use of micro-batches provides flexibility in adapting the tool for different use cases
 Very wide language support
Cons:
 Does not guarantee ordering of messages, may compromise reliability
 Highly complex to implement
Apache Samza
Apache Samza uses a publish/subscribe task, which observes the data stream, processes
messages, and outputs its findings to another stream. Samza can divide a stream into multiple
partitions and spawn a replica of the task for every partition.
Apache Samza uses the Apache Kafka messaging system, architecture, and guarantees, to
offer buffering, fault tolerance, and state storage. Samza relies on YARN for resource
negotiation. However, a Hadoop cluster is needed (at least HDFS and YARN).
Samza has a callback-based process message API. It works with YARN to provide fault tolerance, and migrates your tasks to another machine if a machine in the cluster fails. Samza
processes messages in the order they were written and ensures that no message is lost. It is
also scalable as it is partitioned and distributed at all levels.
Pros:
 Offers replicated storage that provides reliable persistency with low latency.
 Easy and inexpensive multi-subscriber model
 Can eliminate backpressure, allowing data to be persisted and processed later
Cons:
 Only supports JVM languages
 Does not support very low latency
 Does not support exactly-once semantics
Apache Flink
Flink is based on the concept of streams and transformations. Data comes into the system via
a source and leaves via a sink. To produce a Flink job Apache Maven is used. Maven has a
skeleton project where the packing requirements and dependencies are ready, so the
developer can add custom code.
Apache Flink is a stream processing framework that also handles batch tasks. Flink
approaches batches as data streams with finite boundaries.
Pros:
 Stream-first approach offers low latency, high throughput
 Real entry-by-entry processing
 Does not require manual optimization and adjustment to data it processes
 Dynamically analyzes and optimizes tasks
Cons:
 Some scaling limitations
 A relatively new project with less production deployments than other frameworks
Amazon Kinesis Streams
Amazon Kinesis Streams is a durable and scalable real time service. It can collect gigabytes
of data per seconds from hundreds of thousands of sources, including database event streams,
website clickstreams, financial transactions, IT logs, social media feeds, and location-tracking
events. The data captured is provided in milliseconds for real time analytics use cases,
including real time anomaly detection, real time dashboards, and dynamic pricing.
You can build data-processing applications, called Kinesis Data Stream (KDS) applications.
Typically, a kinesis data stream application interprets data from a data stream as data records.
The application can run on Amazon EC2 and can use the kinesis client library.
Source: Amazon
Pros:
 A robust managed service that is easy to set up and maintain
 Integrates with Amazon’s extensive big data toolset
Cons:
 Commercial cloud service, priced per hour per shard (see pricing)
Apache Apex
Apex offers a platform for batch and stream processing using Hadoop’s data-in-motion
architecture by YARN. The platform provides integration with different data platforms. Apex
also provides a framework that is easy to use.
Operationally, Apex utilizes native HDFS for persisting state and the YARN features found
in Hadoop such as scheduling, resource management, jobs, security, multi-tenancy, and fault-
tolerance. Functionally, developers can integrate Apex APIs with other data processing
systems.
Apex allows for high throughput, low latency, reliability, and unified architecture, for batch
and streaming use cases. It can process unbound data sets, which can grow infinitely.
Pros:
 Design focuses on enterprise readiness
 Strong processing guarantees (end-to-end exactly once)
 Highly scalable, high throughput with low latency
 Secure, supports fault-tolerance and multi-tenancy
Cons:
 Apex is no longer widely used and no vendor is currently supporting this framework at scale
(see article)
 Limited support for SQL
 Difficult to find skilled users
Apache Flume
Flume is a reliable, distributed service for aggregating, collecting and moving massive
amounts of log data. It has a flexible and basic architecture. It is fault-tolerant and hardy, with failover and recovery features and tunable reliability. It operates an extensible data model that allows for online analytic application.
The key concept behind the design of Flume is to capture streaming data from web servers to
Hadoop Distributed File System (HDFS).

Source: https://flume.apache.org/FlumeUserGuide.html
Pros:
 Central master server controls all nodes
 Fault tolerance, failover and advanced recovery and reliability features
Cons:
 Difficult to understand and configure with complex logical/physical mapping
 Big footprint, over 50,000 lines of Java code
Data Analysis and Analytic Techniques: Data Analysis in General. Data
Analysis for Stream Applications
Data analysis, in general, refers to the process of inspecting, cleaning, transforming, and
modeling data to uncover insights, make informed decisions, and solve problems. It involves
various techniques and methodologies depending on the nature of the data, the objectives of
the analysis, and the desired outcomes. Here's an overview of data analysis and its application
in stream processing:
1. Data Analysis in General:
 Exploratory Data Analysis (EDA): This involves summarizing the main
characteristics of the data using statistical graphics and other data visualization
techniques to understand its underlying structure, patterns, and relationships.
 Descriptive Statistics: These techniques help in summarizing and describing the
main features of the data through numerical summaries, such as mean, median, mode,
variance, and standard deviation.
 Inferential Statistics: Inferential techniques are used to make predictions or
inferences about a population based on a sample of data. This includes hypothesis
testing, confidence intervals, and regression analysis.
 Machine Learning: Machine learning algorithms are used to build predictive models
and make data-driven decisions. This includes supervised learning, unsupervised
learning, and reinforcement learning techniques.
2. Data Analysis for Stream Applications:
 Real-time Data Visualization: In stream processing applications, it's crucial to
visualize data as it arrives to monitor system performance, detect anomalies, and gain
insights. Real-time dashboards and visualizations help in understanding the current
state of the data stream.
 Streaming Analytics: Streaming analytics involves analyzing data in motion, as it is
generated and processed in real-time. Techniques such as windowing, aggregation,
filtering, and pattern matching are used to extract meaningful insights from data
streams.
 Complex Event Processing (CEP): CEP systems analyze and correlate events from
multiple sources in real-time to identify complex patterns and detect actionable
events. These systems use rule-based or pattern-based approaches to process
continuous data streams and trigger responses based on predefined rules or conditions.
 Online Machine Learning: In stream processing, online machine learning techniques
are used to continuously update and adapt predictive models as new data arrives.
Algorithms such as online linear regression, online clustering, and online
classification are employed to handle data streams and make real-time predictions or
decisions.
 Anomaly Detection: Anomaly detection techniques are used to identify unusual
patterns or outliers in data streams that may indicate potential issues or anomalies.
Statistical methods, machine learning algorithms, and pattern recognition techniques
are applied to detect and flag anomalies in real-time.
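
As a hedged illustration of windowed anomaly detection on a stream, the sketch below flags readings that fall more than three standard deviations from the mean of a sliding window; the window size, threshold, and sensor values are arbitrary choices, not a recommended model.

# Sketch: simple z-score anomaly detection over a sliding window of a data stream.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(stream, window_size=30, z_threshold=3.0):
    window = deque(maxlen=window_size)
    for value in stream:
        if len(window) == window.maxlen:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                yield value            # flag the anomalous reading
        window.append(value)

# Hypothetical sensor stream: steady readings around 20 with one spike.
readings = [20 + (i % 3) * 0.1 for i in range(60)]
readings[45] = 35.0
print(list(detect_anomalies(readings)))   # [35.0]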

Overall, data analysis for stream applications requires specialized techniques and algorithms
to handle the continuous flow of data and extract actionable insights in real-time. These
techniques enable organizations to make timely decisions, respond to events promptly, and
derive value from streaming data sources.

System Components: System Back-End Architecture, System Front-End Architecture, Software: Apache Spark. A Gentle Introduction to Spark. A Tour of Spark's Toolset. Overview, Basic Structured Operations, Data Sources. Spark SQL.

System Components:

1. System Back-End Architecture:


 The back-end architecture of a system typically refers to the components responsible
for processing and managing data behind the scenes. In the context of a data
processing system like Apache Spark, the back-end architecture includes the
computational engine, distributed storage system, and resource management
framework.
 Apache Spark's back-end architecture consists of a distributed computing engine that
enables parallel data processing across a cluster of machines. It includes components
such as the Spark Core, which provides basic functionality for distributed processing,
along with additional modules for tasks like SQL processing (Spark SQL), streaming
analytics (Spark Streaming), machine learning (Spark MLlib), and graph processing
(GraphX).
 Spark relies on a distributed storage system, commonly Hadoop Distributed File
System (HDFS) or cloud storage solutions like Amazon S3 or Azure Data Lake
Storage, to store and manage large datasets across the cluster.
 Resource management frameworks like Apache YARN, Apache Mesos, or Spark's
standalone cluster manager are used to allocate and manage computational resources
across the cluster efficiently.
2. System Front-End Architecture:
 The front-end architecture of a system typically refers to the components that interact
directly with users or external systems. In the context of Spark, the front-end
architecture includes interfaces, APIs, and tools used for submitting jobs, monitoring
job execution, and interacting with data.
 Users interact with Spark through various interfaces, including command-line tools,
interactive shells (such as the Spark Shell or SparkR), programming APIs (such as
Scala, Java, Python, or R APIs), and higher-level libraries and frameworks.
 Front-end tools also include web-based interfaces and dashboards for monitoring
Spark job execution, managing clusters, and visualizing performance metrics.
3. Software: Apache Spark:
 Apache Spark is an open-source distributed computing framework that provides an
interface for programming entire clusters with implicit data parallelism and fault
tolerance.
 It offers a unified computing engine for processing large-scale data across a cluster of
machines, with support for various workloads including batch processing, streaming
analytics, iterative algorithms, and interactive queries.
 Spark's key features include in-memory computing, lazy evaluation, fault tolerance,
and support for multiple programming languages and libraries.
 Spark's toolset includes several components for different types of data processing
tasks, including Spark SQL, Spark Streaming, Spark MLlib, and GraphX.

A Gentle Introduction to Spark:

Apache Spark is a fast and general-purpose cluster computing system that provides APIs in
Scala, Java, Python, and R. It aims to make distributed computing accessible and easy to use
by providing high-level APIs for various tasks such as batch processing, real-time analytics,
machine learning, and graph processing. Spark achieves high performance through in-
memory computing and efficient execution planning.

A Tour of Spark's Toolset:

1. Overview:
 Spark Core: The foundational component of Spark that provides distributed task
dispatching, scheduling, and basic I/O functionalities.
 Spark SQL: A module for working with structured data, providing a DataFrame API
and support for SQL queries.
 Spark Streaming: An extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams.
 Spark MLlib: A library for scalable machine learning algorithms and utilities.
 GraphX: A distributed graph processing framework built on top of Spark's core API.
2. Basic Structured Operations:
 Spark provides a DataFrame API for performing structured data operations, similar to
a relational database or a data frame in R or Python's pandas library.
 Basic operations include filtering, selecting, aggregating, grouping, joining, and sorting data (a combined sketch with Spark SQL follows this list).
 Spark leverages Catalyst optimizer to optimize query plans for better performance.
3. Data Sources:
 Spark supports reading and writing data from various sources including HDFS,
Apache Hive, Apache HBase, JSON, CSV, Parquet, Avro, ORC, JDBC, and more.
 Users can define custom data sources by implementing the DataSource API.
4. Spark SQL:
 Spark SQL is a component of Spark that provides a programming abstraction called
DataFrame, which behaves like a table in a relational database.
 It allows users to run SQL queries as well as perform complex data manipulations
using the DataFrame API.
 Spark SQL supports ANSI SQL as well as HiveQL, enabling seamless integration
with existing Hive deployments.
 Spark SQL leverages Catalyst optimizer to optimize query plans for better
performance.
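
Tying together the structured operations (item 2) and Spark SQL (item 4) above, here is a hedged PySpark sketch; the orders.csv file and its columns (order_id, category, amount) are hypothetical.

# Sketch: basic structured operations and Spark SQL on the same DataFrame.
# Assumptions: PySpark is installed and a hypothetical orders.csv exists with
# columns order_id, category, amount.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StructuredOpsDemo").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

# DataFrame API: filter, group, aggregate, sort.
top_categories = (orders
                  .filter(F.col("amount") > 100)
                  .groupBy("category")
                  .agg(F.sum("amount").alias("total"))
                  .orderBy(F.desc("total")))
top_categories.show()

# The same data queried with SQL via a temporary view.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT category, SUM(amount) AS total
    FROM orders
    WHERE amount > 100
    GROUP BY category
    ORDER BY total DESC
""").show()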
Overall, Apache Spark provides a powerful and versatile platform for distributed data
processing, with a rich set of APIs and tools for various data processing tasks. Its unified
architecture and scalable design make it suitable for a wide range of use cases, from simple
batch processing to complex analytics and machine learning workflows.
