BDA Unit 4
In contrast, real-time data processing (or stream processing) collects, stores, and analyzes
data continuously, making it available to the end user as soon as it is generated, with no
delay.
While databases and offline data analysis remain valid tools, the need for real-time data has
increased exponentially with the advent of modern applications. After all, the world isn’t a
batch process - it runs in real-time.
Real-Time Analytics
Real-time analytics allows users to view, analyse, and understand data as soon as it enters
the system. Mathematical reasoning and logic are applied to the data as it arrives, giving
users an up-to-the-moment picture on which to base decisions.
Examples
Examples of real-time customer analytics include the following.
o Monitoring orders as they take place, both to trace them better and to determine the
types of clothing being ordered.
o Continuously updating customer interaction metrics, such as page views and shopping
cart usage, to better understand user behavior.
o Identifying customers who are further along in their shopping journey, so that
decisions can be influenced in real time.
GPS Data: GPS-enabled devices, including mobile phones, produce streams of geographical
data. Using real-time location data, businesses can track delivery fleets. Air traffic controllers
can land planes safely. Commuters can use live traffic data to choose the fastest route. Social
networks can use GPS data streams to build a more accurate model of our social
relationships. Real-time data streams allow cars to ingest, store, and integrate live GPS data
with self-driving software to form the backbone of autonomous cars, delivery drones, and the
internet of things (IoT).
Ride Share Applications: Uber relies on real-time data to match customers to drivers. Real-
time data is also collected to forecast demand, compute performance metrics, and extract
patterns of human behavior from event streams. Not only do real-time data streams allow
for a seamless customer experience, they also enable real-time fraud detection, anomaly
detection, marketing campaigns, visualization, and customer feedback. The company
uses Apache Kafka to achieve real-time data at this scale, processing over 30 billion
messages per day.
Streaming Platforms: Netflix embraces event streams to achieve speed and scalability in all
aspects of its business. Streaming is the communication mechanism for the entire Netflix
ecosystem. The company uses Apache Kafka to support a variety of microservices, ranging
from studio financing to real-time data on the service levels within its infrastructure.
Walmart: Walmart operates thousands of stores and hundreds of distribution centers across
the world. The company also makes millions of online transactions. Walmart uses Apache
Kafka to drive its real-time inventory management system. The system ingests 500 million
events per day and ensures that the company has an accurate view of its entire inventory in
real-time. The system also supports Walmart's telemetry, alerting, and auditing requirements.
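As a rough illustration of how an application consumes such an event stream, here is a
minimal sketch using the kafka-python client. The broker address, topic name, and message
format are hypothetical placeholders, not the actual configuration of any company named
above:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; real deployments use their own names and security settings.
consumer = KafkaConsumer(
    "inventory-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Each message is processed as soon as it arrives, e.g. to update a live stock view.
for message in consumer:
    event = message.value
    print(f"store={event.get('store_id')} sku={event.get('sku')} delta={event.get('delta')}")
```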
Medical data: Real-time data on heart rate, blood pressure, and oxygen saturation enables
hospitals to identify patients whose health is at risk of deteriorating. During Covid-19,
when hospitals were short on equipment and personnel and at patient capacity, real-time
analytics would have enabled them to optimize the use of intensive care units, ventilators,
and patient health data, increasing efficiency and streamlining processes.
Another example is heart attacks. Approximately 10% of patients suffer heart attacks while
they are already in a hospital. By using real-time data analytics, heart attacks could be
predicted before they happen. Electronic monitoring and predictive analytics are vital in
many clinical areas where patient safety is at stake.
Characteristics of Real-Time Systems
1. Time Constraints: A time constraint in a real-time system is the time interval allotted
for the response of a task. This deadline means the task must be completed within that
interval, and the system is responsible for completing all tasks within their deadlines
(a small sketch after this list illustrates deadline checking).
2. Correctness: Correctness is one of the prominent requirements of real-time systems. A
result is considered correct only if it is both logically correct and produced within the
given time interval; a result that arrives late is treated as incorrect.
3. Embedded: Most real-time systems today are embedded. An embedded system is a
combination of hardware and software designed for a specific purpose. Real-time systems
collect data from the environment and pass it to other components of the system for
processing.
4. Safety: Safety matters for any system, but many real-time systems are safety-critical.
A real-time system should run for long periods without failure, recover quickly when a
failure does occur, and do so without harming its data and information.
5. Concurrency: Real-time systems are concurrent, meaning they can respond to several
processes at a time. Several different tasks run within the system, and it responds to
each of them within short intervals.
6. Distributed: In many real-time systems, the components are connected in a distributed
fashion, with different components at different geographical locations; the system's
operations are therefore carried out in a distributed way.
7. Stability: Even under heavy load, a real-time system must respond within its time
constraints; it must not delay the results of tasks even when several tasks are running
at the same time. This gives real-time systems their stability.
8. Fault tolerance: Real-time systems must be designed to tolerate and recover from faults
or errors. The system should be able to detect errors and recover from them without
affecting the system’s performance or output.
9. Determinism: Real-time systems must exhibit deterministic behavior, which means that
the system’s behavior must be predictable and repeatable for a given input. The system
must always produce the same output for a given input, regardless of the load or other
factors.
10. Real-time communication: Real-time systems often require real-time communication
between different components or devices. The system must ensure that communication
is reliable, fast, and secure.
11. Resource management: Real-time systems must manage their resources efficiently,
including processing power, memory, and input/output devices. The system must ensure
that resources are used optimally to meet the time constraints and produce correct
results.
12. Heterogeneous environment: Real-time systems may operate in a heterogeneous
environment, where different components or devices have different characteristics or
capabilities. The system must be designed to handle these differences and ensure that all
components work together seamlessly.
13. Scalability: Real-time systems must be scalable, which means that the system must be
able to handle varying workloads and increase or decrease its resources as needed.
14. Security: Real-time systems may handle sensitive data or operate in critical
environments, which makes security a crucial aspect. The system must ensure that data
is protected and access is restricted to authorized users only.
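As a minimal illustration of the time-constraint and correctness characteristics above,
the following Python sketch treats any result that misses its deadline as incorrect. The
100 ms deadline and the process_event function are assumptions for illustration:

```python
import time

DEADLINE_SECONDS = 0.1  # assumed 100 ms deadline for illustration

def process_event(event):
    # Hypothetical placeholder for the real task logic.
    return sum(event) / len(event)

def run_with_deadline(event):
    """Run a task and treat a late result as incorrect (characteristics 1 and 2)."""
    start = time.monotonic()
    result = process_event(event)
    elapsed = time.monotonic() - start
    if elapsed > DEADLINE_SECONDS:
        # In a real-time system, a late answer counts as a wrong answer.
        return None, elapsed
    return result, elapsed

result, elapsed = run_with_deadline([1, 2, 3, 4])
print(f"result={result}, elapsed={elapsed:.6f}s")
```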
Scalability, High Availability, and Performance
The terms scalability, high availability, performance, and mission-critical can mean different
things to different organizations, or to different departments within an organization. They are
often interchanged and create confusion that results in poorly managed expectations,
implementation delays, or unrealistic metrics. This Refcard provides you with the tools to
define these terms so that your team can implement mission-critical systems with well-
understood performance goals.
Scalability
It's the property of a system or application to handle bigger amounts of work, or to be easily
expanded, in response to increased demand for network, processing, database access, or file
system resources.
Horizontal scalability
A system scales horizontally, or out, when it's expanded by adding new nodes with identical
functionality to existing ones, redistributing the load among all of them. SOA systems and
web servers scale out by adding more servers to a load-balanced network so that incoming
requests may be distributed among all of them. Cluster is a common term for describing a
scaled-out processing system.
Figure: Clustering
Vertical scalability
A system scales vertically, or up, when it's expanded by adding processing, main memory,
storage, or network interfaces to a node to satisfy more requests per system. Hosting services
companies scale up by increasing the number of processors or the amount of main memory to
host more virtual servers in the same hardware.
Figure: Virtualization
High Availability
Availability describes how well a system provides useful resources over a set period of time.
High availability guarantees an absolute degree of functional continuity within a time
window expressed as the relationship between uptime and downtime.
A = 100 – (100*D/U), D ::= unplanned downtime, U ::= uptime; D, U expressed in minutes
Uptime and availability don't mean the same thing. A system may be up for a complete
measuring period, but may be unavailable due to network outages or downtime in related
support systems. Downtime and unavailability are synonymous.
Measuring Availability
Vendors define availability as a given number of "nines," as in Table 1, which also describes
the estimated downtime in relation to the number of minutes in a 365-day year, or 525,600,
making U a constant for their marketing purposes.

Table 1: Availability expressed in "nines" (downtime computed as D = 525,600 × (1 − A/100))

Availability %   Downtime in Minutes   Downtime per Year   Vendor Jargon
99               5,256                 ~3.7 days           "two nines"
99.9             525.6                 ~8.8 hours          "three nines"
99.99            52.56                 ~53 minutes         "four nines"
99.999           5.256                 ~5.3 minutes        "five nines"
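A short Python sketch makes this relationship concrete: it simply rearranges the availability
formula above (A = 100 − 100·D/U, with U fixed at 525,600 minutes per year) to compute the
downtime budget for each level of "nines":

```python
MINUTES_PER_YEAR = 525_600  # U in the availability formula, held constant

def downtime_minutes(availability_percent: float) -> float:
    """Solve A = 100 - (100 * D / U) for D, the allowed downtime in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% availability -> {downtime_minutes(nines):.2f} minutes of downtime per year")
```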
Complex Event Processing Example
As the pressure in the tire decreases, a series of events concerning the tire pressure is
generated. In addition, a series of events containing the speed of the car is generated. The
car’s event processor may detect a situation whereby a loss of tire pressure over a relatively
long period of time results in the creation of the “LossOfTirePressure” event.
This new event may trigger a reaction process to note the pressure loss into the car’s
maintenance log, and alert the driver via the car’s portal that the tire pressure has reduced.
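The following Python sketch shows, under simplifying assumptions, how such a derived event
might be produced: it keeps a sliding window of tire-pressure readings and emits a
hypothetical LossOfTirePressure event when pressure has fallen steadily across the window.
The window length and drop threshold are illustrative, not from the source:

```python
from collections import deque

WINDOW_SIZE = 10        # number of recent readings to consider (assumed)
DROP_THRESHOLD = 3.0    # total PSI drop that counts as a sustained loss (assumed)

readings = deque(maxlen=WINDOW_SIZE)

def on_pressure_event(psi: float):
    """Consume one tire-pressure event; emit a derived event on sustained loss."""
    readings.append(psi)
    if len(readings) == WINDOW_SIZE and readings[0] - readings[-1] >= DROP_THRESHOLD:
        emit_loss_of_tire_pressure(readings[0], readings[-1])

def emit_loss_of_tire_pressure(start_psi: float, end_psi: float):
    # A real system would write to the maintenance log and alert the driver.
    print(f"LossOfTirePressure: {start_psi} -> {end_psi} PSI")

for psi in [32.0, 31.8, 31.5, 31.2, 30.9, 30.5, 30.1, 29.7, 29.3, 28.9]:
    on_pressure_event(psi)
```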
Each type of event processing system (EPS) has its advantages and use cases; the choice
depends on factors such as the complexity of the event processing logic, performance
requirements, and ease of development and maintenance.
The Difference Between Real-Time, Near Real-Time, and Batch Processing
in Big Data
When it comes to data processing, there are more ways to do it than ever. Your choices
include real-time, near real-time, and batch processing. How you do it and the tools you
choose depend largely on what your purposes are for processing the data in the first place.
In many cases, you’re processing historical and archived data and time isn’t so critical. You
can wait a few hours for your answer, and if necessary, a few days. Conversely, other
processing tasks are crucial, and the answers need to be delivered within seconds to be of
value.
Table: Real-time, near real-time, and batch processing

Type of data processing   When do you need it?
Real-time                 When you need information immediately (for example, detecting
                          fraud as transactions occur)
Near real-time            When speed is important, but you don't need it immediately (such
                          as producing operational intelligence)
Batch                     When you can wait for days (or longer) for processing (payroll is
                          a good example)
Apache Flume
Source: https://flume.apache.org/FlumeUserGuide.html
Pros:
Central master server controls all nodes
Fault tolerance, failover, and advanced recovery and reliability features
Cons:
Difficult to understand and configure, with complex logical/physical mapping
Big footprint, over 50,000 lines of Java code
Data Analysis and Analytic Techniques: Data Analysis in General. Data
Analysis for Stream Applications
Data analysis, in general, refers to the process of inspecting, cleaning, transforming, and
modeling data to uncover insights, make informed decisions, and solve problems. It involves
various techniques and methodologies depending on the nature of the data, the objectives of
the analysis, and the desired outcomes. Here's an overview of data analysis and its application
in stream processing:
1. Data Analysis in General:
Exploratory Data Analysis (EDA): This involves summarizing the main
characteristics of the data using statistical graphics and other data visualization
techniques to understand its underlying structure, patterns, and relationships.
Descriptive Statistics: These techniques help in summarizing and describing the
main features of the data through numerical summaries, such as mean, median, mode,
variance, and standard deviation.
Inferential Statistics: Inferential techniques are used to make predictions or
inferences about a population based on a sample of data. This includes hypothesis
testing, confidence intervals, and regression analysis.
Machine Learning: Machine learning algorithms are used to build predictive models
and make data-driven decisions. This includes supervised learning, unsupervised
learning, and reinforcement learning techniques.
2. Data Analysis for Stream Applications:
Real-time Data Visualization: In stream processing applications, it's crucial to
visualize data as it arrives to monitor system performance, detect anomalies, and gain
insights. Real-time dashboards and visualizations help in understanding the current
state of the data stream.
Streaming Analytics: Streaming analytics involves analyzing data in motion, as it is
generated and processed in real-time. Techniques such as windowing, aggregation,
filtering, and pattern matching are used to extract meaningful insights from data
streams (see the windowing sketch after this list).
Complex Event Processing (CEP): CEP systems analyze and correlate events from
multiple sources in real-time to identify complex patterns and detect actionable
events. These systems use rule-based or pattern-based approaches to process
continuous data streams and trigger responses based on predefined rules or conditions.
Online Machine Learning: In stream processing, online machine learning techniques
are used to continuously update and adapt predictive models as new data arrives.
Algorithms such as online linear regression, online clustering, and online
classification are employed to handle data streams and make real-time predictions or
decisions.
Anomaly Detection: Anomaly detection techniques are used to identify unusual
patterns or outliers in data streams that may indicate potential issues or anomalies.
Statistical methods, machine learning algorithms, and pattern recognition techniques
are applied to detect and flag anomalies in real-time (a minimal online-statistics
sketch follows this list).
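To make the windowing idea concrete, here is a minimal tumbling-window aggregation sketch in
Python. The one-second window size and the (timestamp, value) event shape are assumptions
for illustration, not any specific framework's API:

```python
from collections import defaultdict

WINDOW_SECONDS = 1.0  # tumbling window length (assumed)

def tumbling_window_stats(events):
    """Group (timestamp, value) events into fixed windows and aggregate per window."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(value)
    return {start: (len(vals), sum(vals) / len(vals))
            for start, vals in sorted(windows.items())}

events = [(0.1, 10), (0.4, 12), (0.9, 11), (1.2, 30), (1.8, 28)]
for start, (count, mean) in tumbling_window_stats(events).items():
    print(f"window starting at {start:.1f}s: count={count}, mean={mean:.1f}")
```

And in the same spirit, a simple online anomaly detector: it maintains a running mean and
variance with Welford's algorithm and flags values more than three standard deviations from
the mean. The threshold is an assumption, and real deployments would tune it:

```python
import math

class OnlineAnomalyDetector:
    """Welford's online mean/variance; flags points beyond `threshold` std deviations."""
    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2, self.threshold = 0, 0.0, 0.0, threshold

    def update(self, x: float) -> bool:
        is_anomaly = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) > self.threshold * std
        # Update the running statistics after scoring the new point.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = OnlineAnomalyDetector()
for value in [10, 11, 9, 10, 12, 10, 50]:
    if detector.update(value):
        print(f"anomaly detected: {value}")
```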
Overall, data analysis for stream applications requires specialized techniques and algorithms
to handle the continuous flow of data and extract actionable insights in real-time. These
techniques enable organizations to make timely decisions, respond to events promptly, and
derive value from streaming data sources.
Apache Spark: System Components
Apache Spark is a fast and general-purpose cluster computing system that provides APIs in
Scala, Java, Python, and R. It aims to make distributed computing accessible and easy to use
by providing high-level APIs for various tasks such as batch processing, real-time analytics,
machine learning, and graph processing. Spark achieves high performance through in-
memory computing and efficient execution planning.
1. Overview:
Spark Core: The foundational component of Spark that provides distributed task
dispatching, scheduling, and basic I/O functionalities.
Spark SQL: A module for working with structured data, providing a DataFrame API
and support for SQL queries.
Spark Streaming: An extension of the core Spark API that enables scalable, high-
throughput, fault-tolerant stream processing of live data streams.
Spark MLlib: A library for scalable machine learning algorithms and utilities.
GraphX: A distributed graph processing framework built on top of Spark's core API.
2. Basic Structured Operations:
Spark provides a DataFrame API for performing structured data operations, similar to
a relational database or a data frame in R or Python's pandas library.
Basic operations include filtering, selecting, aggregating, grouping, joining, and
sorting data.
Spark leverages the Catalyst optimizer to optimize query plans for better performance
(a short PySpark sketch of these operations appears after this list).
3. Data Sources:
Spark supports reading and writing data from various sources including HDFS,
Apache Hive, Apache HBase, JSON, CSV, Parquet, Avro, ORC, JDBC, and more.
Users can define custom data sources by implementing the DataSource API.
4. Spark SQL:
Spark SQL is a component of Spark that provides a programming abstraction called
DataFrame, which behaves like a table in a relational database.
It allows users to run SQL queries as well as perform complex data manipulations
using the DataFrame API.
Spark SQL supports ANSI SQL as well as HiveQL, enabling seamless integration
with existing Hive deployments.
Spark SQL leverages Catalyst optimizer to optimize query plans for better
performance.
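A brief PySpark sketch ties these pieces together: it reads a CSV file, applies basic
structured operations, and then runs the same query through Spark SQL against a temporary
view. This is a minimal sketch assuming a local Spark installation; the file path and the
category/amount columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical input file with columns: category (string), amount (double).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Basic structured operations: filter, group, aggregate, sort.
summary = (
    df.filter(F.col("amount") > 0)
      .groupBy("category")
      .agg(F.count("*").alias("orders"), F.avg("amount").alias("avg_amount"))
      .orderBy(F.col("orders").desc())
)
summary.show()

# The same query expressed in SQL; both paths go through the Catalyst optimizer.
df.createOrReplaceTempView("sales")
spark.sql("""
    SELECT category, COUNT(*) AS orders, AVG(amount) AS avg_amount
    FROM sales
    WHERE amount > 0
    GROUP BY category
    ORDER BY orders DESC
""").show()
```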
Overall, Apache Spark provides a powerful and versatile platform for distributed data
processing, with a rich set of APIs and tools for various data processing tasks. Its unified
architecture and scalable design make it suitable for a wide range of use cases, from simple
batch processing to complex analytics and machine learning workflows.