Question Bank

Here are the problems along with their solutions:

1. Pedestrian Count Over Time

A city monitors the number of pedestrians crossing a street every 10 minutes.

Time (minutes) 00:00 00:10 00:20 00:30 00:40 00:50 01:00

Count of People 8 12 15 10 9 14 11

Solution:

• The peak pedestrian count occurs at 00:20 (15 pedestrians).

• The lowest count is at 00:00 (8 pedestrians).

• A visual timeline representation would show fluctuations, with the highest activity around
the 20-minute mark.

2. Sensor Readings in a Factory

A factory records temperature readings from a sensor every 5 minutes.

Time (minutes) 00:00 00:05 00:10 00:15 00:20 00:25 00:30

Temperature (°C) 22 23 24 22 25 26 24

Solution:

• The highest temperature recorded is 26°C at 00:25.

• The lowest temperature is 22°C at 00:00 and 00:15.

• The temperature rises until 00:25, then slightly drops.

3. Vehicles Passing a Toll Booth

A toll booth records the number of vehicles passing every 15 minutes.

Time (minutes) 00:00 00:15 00:30 00:45 01:00 01:15 01:30

Vehicle Count 20 25 30 22 27 18 24

Solution:

• The peak traffic occurs at 00:30 (30 vehicles).

• The lowest traffic is at 01:15 (18 vehicles).


• Traffic generally increases at first but fluctuates in later intervals.

4. Heart Rate Monitoring

A fitness tracker records heart rate every 10 minutes.

Time (minutes) 00:00 00:10 00:20 00:30 00:40 00:50 01:00

Heart Rate (bpm) 72 75 80 78 82 79 77

Solution:

• The highest heart rate is 82 bpm at 00:40.

• The lowest heart rate is 72 bpm at 00:00.

• The heart rate fluctuates but generally increases before slightly dropping.

5. Problem: Web Traffic Processing with Tumbling Window and Watermarking

A web traffic streaming application processes incoming requests, where each request has an event
time and a (latitude, longitude) location. The system uses:

• Tumbling window of 4 minutes (non-overlapping fixed time windows).

• Watermarking period of 7 minutes (late-arriving events within 7 minutes can still be processed).

Given the event timestamps, compute the count of requests in each 4-minute window from 03:00
to 03:32 and demonstrate step-by-step calculations.

Step 1: Define Tumbling Windows

Each window spans 4 minutes, meaning our window boundaries are:

Window ID Time Range

W1 03:00 - 03:03

W2 03:04 - 03:07

W3 03:08 - 03:11

W4 03:12 - 03:15

W5 03:16 - 03:19

W6 03:20 - 03:23

W7 03:24 - 03:27

W8 03:28 - 03:31

Step 2: Assign Events to Windows

Since the tuples arrive in an unordered fashion, we assign them to windows based on event time:

Event Time Location (Latitude, Longitude) Window

03:07 (43.6510, -79.3470) W2 (03:04 - 03:07)

03:09 (35.6895, 139.6917) W3 (03:08 - 03:11)

03:10 (-33.9249, 18.4241) W3 (03:08 - 03:11)

03:11 (55.7558, 37.6173) W3 (03:08 - 03:11)

03:12 (19.0760, 72.8777) W4 (03:12 - 03:15)

03:15 (40.7128, -74.0060) W4 (03:12 - 03:15)

03:16 (48.8566, 2.3522) W5 (03:16 - 03:19)

03:21 (-22.9068, -43.1729) W6 (03:20 - 03:23)

03:27 (-33.8688, 151.2093) W7 (03:24 - 03:27)

03:29 (51.5074, -0.1278) W8 (03:28 - 03:31)

Step 3: Applying the Watermark

• Watermarking allows late events arriving within 7 minutes of the latest observed event time to be included in their windows.

• The latest event (03:29) sets the watermark at 03:29 - 7 minutes = 03:22.

• Late-arriving events with timestamps after 03:22 are still processed; events with timestamps at or before 03:22 that arrive after this point would be dropped.

No Late Events Beyond Watermark

• In this dataset, no event arrives later than its watermark allows, so every event is counted in its correct window.
Step 4: Count Requests Per Window

Now, we compute the count of requests in each window.

Window ID Time Range Count of Requests

W1 03:00 - 03:03 0

W2 03:04 - 03:07 1

W3 03:08 - 03:11 3

W4 03:12 - 03:15 2

W5 03:16 - 03:19 1

W6 03:20 - 03:23 1

W7 03:24 - 03:27 1

W8 03:28 - 03:31 1

Step 5: Graphical Representation

A bar chart representing requests per window:

3| ███

2| ███ ██

1| █ ███ ██ █ █ █ █

0 |--------------------------------------

03:00 03:04 03:08 03:12 03:16 03:20 03:24 03:28

• X-axis: Time windows

• Y-axis: Request count

Step 6: Final Answer

The request count for each 4-minute window is:

• W1 (03:00-03:03): 0

• W2 (03:04-03:07): 1
• W3 (03:08-03:11): 3

• W4 (03:12-03:15): 2

• W5 (03:16-03:19): 1

• W6 (03:20-03:23): 1

• W7 (03:24-03:27): 1

• W8 (03:28-03:31): 1

Conclusion

• The system correctly categorized requests into their windows.

• Since no late events beyond the watermark period occurred, all data was processed
successfully.

• The highest request count occurred in W3 (03:08-03:11) with 3 requests.
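
A minimal PySpark sketch of this pipeline, assuming a streaming DataFrame named requests with an event_time timestamp column (the source and column names are illustrative, not taken from the problem statement):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("WebTrafficWindows").getOrCreate()

# `requests` is assumed to be a streaming DataFrame with an `event_time`
# column, e.g. parsed from a Kafka or socket source.
windowCounts = requests \
    .withWatermark("event_time", "7 minutes") \
    .groupBy(window("event_time", "4 minutes")) \
    .count()

query = windowCounts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()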

6. Compare and contrast batch processing, microbatching, and real-time stream processing.

Solution:

Feature            | Batch Processing                         | Microbatching                                | Real-Time Stream Processing
Processing Mode    | Processes large amounts of data at once  | Processes small batches at fixed intervals   | Processes each event as it arrives
Latency            | High (minutes to hours)                  | Medium (seconds to minutes)                  | Very low (milliseconds)
Example Frameworks | Hadoop, Apache Spark (batch mode)        | Apache Spark Streaming, Structured Streaming | Apache Flink, Apache Kafka Streams
Use Case           | Data warehousing, periodic reporting     | Log analysis, near real-time dashboards      | Fraud detection, stock price monitoring
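
To make the contrast concrete, here is a small sketch comparing a batch read with a micro-batch streaming read of the same directory (the path and schema are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchVsStream").getOrCreate()

# Batch: the whole dataset is read once and processed in one job.
batch_df = spark.read.json("/data/events")  # hypothetical input path
batch_df.groupBy("eventType").count().show()

# Microbatching: the same aggregation, but new files are discovered
# continuously and processed in small batches as they arrive.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")
query = (stream_df.groupBy("eventType").count()
         .writeStream.outputMode("complete").format("console").start())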

7. What are the advantages and disadvantages of Spark Structured Streaming over traditional
Spark Streaming?

Solution:
Feature          | Traditional Spark Streaming | Spark Structured Streaming
API Type         | DStream API                 | DataFrame / Dataset API
Processing Model | Microbatching               | Microbatching (default) + continuous processing
Fault Tolerance  | RDD-based                   | Checkpointing & write-ahead logs (WAL)
Performance      | Moderate                    | Better, due to the Catalyst optimizer
Integration      | Limited SQL support         | Full SQL support
Use Case         | Simple event processing     | Real-time analytics, ML integration

Conclusion:

• Structured Streaming is more flexible, scalable, and optimized than the older DStream-based approach.

• It allows using SQL queries, ML models, and better recovery mechanisms.
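
For comparison, a word count in the legacy DStream API looks like the sketch below (the Structured Streaming equivalent appears in question 11):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# RDD-style transformations applied to each micro-batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()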

8. Explain the concept of watermarking in stream processing and its role in handling late-
arriving data.

Solution:

• Definition: Watermarking in stream processing handles late-arriving data by defining a threshold time after which old events are discarded from processing.

• Use Case: If a stream window runs from 3:00 to 3:05 and events arrive late, a watermark of 7 minutes allows events to be accepted until 3:12 before the results are finalized.

• Importance:

1. Prevents infinite waiting for late events.

2. Balances accuracy vs. latency.

3. Ensures better fault tolerance.

• Supported Frameworks: Apache Spark Structured Streaming, Flink, Kafka Streams.
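
In Spark Structured Streaming, the threshold from the example above is declared with withWatermark; a minimal sketch, assuming df is a streaming DataFrame with an event-time column named timestamp:

from pyspark.sql.functions import window

# Accept events up to 7 minutes late, relative to the latest event time seen.
late_tolerant = df \
    .withWatermark("timestamp", "7 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()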

9. Differentiate between stateless and stateful stream processing with examples.

Solution:
Feature         | Stateless Processing               | Stateful Processing
Definition      | Processes each event independently | Maintains historical context
Memory Usage    | Low                                | High (stores state data)
Fault Tolerance | Easier to recover                  | Requires checkpointing
Examples        | Filtering, transformations         | Aggregations, session windows
Use Cases       | Removing bad data, logging         | Counting occurrences, detecting anomalies

Example:

• Stateless: Removing logs with errors (filter()).

• Stateful: Counting how many requests a user made in the last 10 minutes.
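
A short PySpark sketch of both styles (logs and requests are assumed streaming DataFrames with the columns shown):

from pyspark.sql.functions import window

# Stateless: each record is filtered on its own; no history is kept.
clean = logs.filter(logs.level != "ERROR")

# Stateful: counting a user's requests over the last 10 minutes requires
# Spark to keep window state between micro-batches (bounded by the watermark).
perUser = requests \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "10 minutes"), "user_id") \
    .count()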

10. What are the different types of joins in Spark Structured Streaming, and how do they work?

Solution:

Join Type            | Description                                                                     | Example Use Case
Inner Join           | Matches records present in both streams                                         | Joining website clicks with purchases
Left Outer Join      | Keeps all records from the left stream; missing values from the right are NULL | Matching users with their most recent activity
Right Outer Join     | Keeps all records from the right stream; missing values from the left are NULL | Matching transactions with user profiles
Full Outer Join      | Keeps all records from both streams, filling missing values with NULL          | Combining logs from two data sources
Watermarking + Joins | Helps join events arriving at different times by setting a time threshold      | Combining late-arriving sensor data with alerts

Example Scenario:

• A user click stream (Stream A) is joined with a purchase stream (Stream B).

• If a purchase occurs 5 minutes after a click, a 10-minute watermark ensures late clicks are still matched, as shown in the sketch below.
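
A sketch of this click/purchase join, following the standard stream-stream join pattern (the column names click_user_id, purchase_user_id, clickTime, and purchaseTime are assumptions):

from pyspark.sql.functions import expr

# Watermarks on both sides let Spark discard old join state.
clicksWm = clicks.withWatermark("clickTime", "10 minutes")
purchasesWm = purchases.withWatermark("purchaseTime", "10 minutes")

# Match a purchase to a click by user, within 5 minutes of the click.
joined = clicksWm.join(
    purchasesWm,
    expr("""
        click_user_id = purchase_user_id AND
        purchaseTime >= clickTime AND
        purchaseTime <= clickTime + interval 5 minutes
    """)
)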
11. How do you perform word count on streaming data using Spark Structured Streaming?
Provide a sample code snippet.

Solution:

Steps:

1. Setup Streaming Source (Netcat server)

2. Process Streaming Data (Tokenize words)

3. Apply Aggregation (Count occurrences)

4. Start Streaming Query (Output results to console)

Sample Code:

from pyspark.sql import SparkSession

from pyspark.sql.functions import explode, split

# Initialize Spark Session

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Read streaming data from netcat

lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split lines into words

words = lines.select(explode(split(lines.value, " ")).alias("word"))

# Count occurrences of each word

wordCounts = words.groupBy("word").count()

# Output to console

query = wordCounts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()
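
To try this locally, start a Netcat server in a separate terminal (nc -lk 9999) and type text into it; each micro-batch of updated counts is printed to the console.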

Explanation:
1. The application listens to a Netcat server (port 9999) for real-time text input.

2. Incoming lines are split into words and flattened.

3. Words are grouped and counted.

4. Results are printed to the console in real-time.

12. How can you use windowed aggregation to count events in a fixed time interval in Spark
Structured Streaming?

Solution:

Steps:

1. Define a sliding event window

2. Aggregate data within the window

3. Apply watermarking to handle late events

Sample Code:

from pyspark.sql.functions import window

# `words` is the streaming DataFrame from the previous example; here we
# assume it also carries an event-time column named `timestamp`.
wordCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), "word") \
    .count()

query = wordCounts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()

Explanation:

• Uses time-based windowing to count words appearing in 10-minute windows, sliding every 5 minutes.

• The withWatermark call ensures late events are considered within limits (here, up to 10 minutes late).

13. What is the difference between Complete, Append, and Update output modes in Spark
Structured Streaming?

Solution:
Output Mode | Description                                             | Use Case
Complete    | Outputs the entire aggregated result on every trigger   | Aggregations such as word count
Append      | Outputs only new rows added since the last trigger      | Event logs where old rows don't change
Update      | Outputs modified rows but does not reprint all results  | Stateful processing (e.g., session tracking)

Example:

wordCounts.writeStream.outputMode("append").format("console").start()

• Complete Mode → Best for aggregations like word count.

• Append Mode → Best for log-based streaming.

• Update Mode → Best for stateful transformations.

14. How do you join two real-time data streams in Spark Structured Streaming? Provide an
example.

Solution:

Steps:

1. Read two input streams (Example: Click Stream & Purchase Stream)

2. Perform a join on a common key (e.g., user_id)

3. Use watermarking to handle late events

Sample Code:

from pyspark.sql.functions import split

# Socket sources yield a single string column named `value`; here we assume
# each line is a comma-separated record whose first field is the user id.
clicksRaw = spark.readStream.format("socket").option("host", "localhost").option("port", 9998).load()
purchasesRaw = spark.readStream.format("socket").option("host", "localhost").option("port", 9997).load()

clicks = clicksRaw.select(split(clicksRaw.value, ",").getItem(0).alias("user_id"))
purchases = purchasesRaw.select(split(purchasesRaw.value, ",").getItem(0).alias("user_id"))

joinedStream = clicks.join(purchases, "user_id", "inner")

query = joinedStream.writeStream.outputMode("append").format("console").start()
query.awaitTermination()

Explanation:

• Merges two real-time streams on user_id.

• Uses an inner join to match click events with purchases.

• Note: this minimal example omits watermarks, so Spark must retain join state indefinitely; in practice, add withWatermark and a time-range join condition (see question 10) to bound the state.

15. How do you write streaming data from Spark Structured Streaming to a Kafka topic?

Solution:

Steps:

1. Read streaming data

2. Transform it (e.g., JSON format)

3. Write to Kafka

Sample Code:

# Read data from socket

lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Write data to Kafka (the Kafka sink requires a string or binary `value`
# column; the socket source already provides one)

query = lines.writeStream \

.format("kafka") \

.option("kafka.bootstrap.servers", "localhost:9092") \

.option("topic", "wordCountTopic") \

.option("checkpointLocation", "/tmp/kafka-checkpoint") \

.start()

query.awaitTermination()

Explanation:

• Kafka Topic: "wordCountTopic" receives real-time messages.

• Checkpointing: Stores metadata to handle failures.

• Kafka Broker (localhost:9092) ensures high-throughput data ingestion.
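
To verify the writes, one can read the topic back with the Kafka source; a sketch assuming the same broker and topic (both directions require the Spark-Kafka connector package on the classpath):

# Read the topic back and print the message payloads.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "wordCountTopic") \
    .load()

df.selectExpr("CAST(value AS STRING)") \
    .writeStream.format("console").start()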


16. Compare and contrast the three stream processing semantics in distributed stream
processing systems.

Solution:

Stream processing systems provide different delivery guarantees to ensure correctness. The three
main semantics are:

Processing Semantics | Definition                                                          | Example Scenario
At-most-once         | Events are processed at most once but may be lost                   | Logging where occasional data loss is acceptable
At-least-once        | Events are processed at least once but may be duplicated            | Financial transactions where duplication is handled via deduplication logic
Exactly-once         | Events are processed exactly once, ensuring no loss or duplication  | Payment processing, inventory management

Comparison:

1. At-most-once: Fast but unreliable.

2. At-least-once: More reliable, but duplicates need to be handled.

3. Exactly-once: Most reliable but has higher processing overhead.

Example Frameworks:

• Apache Kafka Streams supports exactly-once processing using transactional writes.

• Spark Structured Streaming provides exactly-once guarantees with checkpointing.
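
In Spark, the exactly-once guarantee rests on a replayable source plus checkpointing; a sketch, with wordCounts as in question 11 and a hypothetical checkpoint path:

# The checkpoint directory persists source offsets and aggregation state,
# so a restarted query resumes without losing or double-counting events.
query = wordCounts.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("checkpointLocation", "/tmp/wc-checkpoint") \
    .start()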

17. Compute the competitive ratio for renting vs. buying a swimming gear under worst-case
scenarios.

Given Costs:

• Renting cost: $20 per session

• Buying cost: $200 (one-time cost)

Case 1: Swimming Once

• Renting: $20

• Buying: $200

• Competitive Ratio = (Cost of Renting) / (Optimal Cost) = $20 / $20 = 1


Case 2: Swimming 10 Times

• Renting: $20 × 10 = $200

• Buying: $200

• Competitive Ratio = $200 / $200 = 1

• The competitive ratio is 1 in both cases because, up to the break-even point, the total cost of always renting never exceeds the optimal choice.

• Break-even point: for 10 or fewer sessions, renting is cheaper or equal; beyond 10 sessions, buying is better. For n > 10 sessions, always renting costs $20 × n against an optimal $200, so the ratio n/10 grows without bound; the classic strategy of renting until the rentals equal the purchase price and then buying caps the worst-case ratio at 2.

18. Explain and compare event time, processing time, and ingestion time in stream
processing.

Solution:

Time Type       | Definition                                                        | Example
Event Time      | The actual timestamp when an event occurred                       | A user clicks a webpage at 12:05 PM, recorded in the event metadata
Processing Time | The timestamp when the event is processed by the system           | The event is processed at 12:07 PM due to network delay
Ingestion Time  | The timestamp when the event enters the stream processing system  | Kafka receives the event at 12:06 PM, before Spark processes it

Comparison:

1. Event Time is ideal for accurate analytics, but requires watermarking to handle late-
arriving data.

2. Processing Time is simple but not reliable when event arrival is delayed.

3. Ingestion Time is useful for approximate analytics but does not represent actual event
occurrence.

19. What is the competitive ratio in online algorithms? Explain with an example.

Solution:

• Competitive ratio = (Cost of online algorithm) / (Cost of optimal offline solution)

• It measures the worst-case performance of an online decision-making algorithm compared to the best offline decision.

Example: Renting vs. Buying a Car

• Renting a car: $50 per day

• Buying a car: $1000 (one-time cost)

If a user rents for 20 days, the cost is $50 × 20 = $1000.

• Competitive Ratio = $1000 / $1000 = 1 (break-even)

• If the user rents for 10 days, the cost is $500, and the optimal offline cost is also $500.

• Competitive Ratio = $500 / $500 = 1 (still optimal)

• If the user rents for 40 days, always renting costs $2000 while the optimal offline choice is to buy for $1000, so the ratio is $2000 / $1000 = 2; the longer the horizon, the worse always-renting performs.

Competitive ratio helps analyze online decision-making when future knowledge is unavailable.
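
A tiny script that reproduces these numbers and shows the ratio growing past the break-even point (costs taken from the example above):

# Rent $50/day vs. buy once for $1000 (the "ski rental" setup above).
RENT, BUY = 50, 1000

def always_rent_cost(days):
    return RENT * days

def optimal_cost(days):
    # The offline optimum knows `days` in advance and picks the cheaper option.
    return min(RENT * days, BUY)

for days in (10, 20, 40):
    ratio = always_rent_cost(days) / optimal_cost(days)
    print(f"{days} days: ratio = {ratio}")  # 10 -> 1.0, 20 -> 1.0, 40 -> 2.0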

20. How does watermarking help in handling late-arriving data in stream processing? Provide
an example.

Solution:

Watermarking is used in event-time processing to handle late events by setting a threshold for
considering late data.

Example:

• Assume a 10-minute window (3:00 - 3:10) with a 5-minute watermark.

• Events for that window are accepted until the watermark passes its end, i.e., until an event with a timestamp after 3:15 has been seen; events arriving after that are dropped.

Sample Code (Spark Structured Streaming):

from pyspark.sql.functions import window

# `events` is assumed to be a streaming DataFrame with an event-time
# column named `timestamp` and an `eventType` column.
# Group events by a 10-minute window with a 5-minute watermark.
eventCounts = events \
    .withWatermark("timestamp", "5 minutes") \
    .groupBy(window("timestamp", "10 minutes"), "eventType") \
    .count()

query = eventCounts.writeStream.outputMode("append").format("console").start()

query.awaitTermination()
Why is Watermarking Important?

1. Ensures bounded state size (prevents infinite waiting).

2. Allows handling late-arriving but still relevant data.

3. Helps in real-time analytics where delays are common (e.g., IoT sensor data).
