Question Bank (1)
Count of People: 8, 12, 15, 10, 9, 14, 11
Solution:
• A visual timeline representation would show fluctuations, with the highest activity around the 20-minute mark.
Temperature (°C): 22, 23, 24, 22, 25, 26, 24
Solution:
Vehicle Count: 20, 25, 30, 22, 27, 18, 24
Solution:
Solution:
• The heart rate fluctuates but generally increases before slightly dropping.
A web traffic streaming application processes incoming requests, where each request has an event time and a (latitude, longitude) location. The system uses 4-minute tumbling windows based on event time, together with a watermark to handle late-arriving requests.
Given the event timestamps, compute the count of requests in each 4-minute window from 03:00 to 03:32 and demonstrate step-by-step calculations.
Window ID   Time Range
W1          03:00 - 03:03
W2          03:04 - 03:07
W3          03:08 - 03:11
W4          03:12 - 03:15
W5          03:16 - 03:19
W6          03:20 - 03:23
W7          03:24 - 03:27
W8          03:28 - 03:31
Since the tuples may arrive in an unordered fashion, we assign them to windows based on event time:
• Events with an event time older than 03:22 that arrive later are still processed, because they fall within the watermark period.
• In this example, all events arrived within the watermark period, so all are included in their correct windows.
Step 4: Count Requests Per Window
Window ID   Time Range      Request Count
W1          03:00 - 03:03   0
W2          03:04 - 03:07   1
W3          03:08 - 03:11   3
W4          03:12 - 03:15   2
W5          03:16 - 03:19   1
W6          03:20 - 03:23   1
W7          03:24 - 03:27   1
W8          03:28 - 03:31   1
(Bar chart: request counts per 4-minute window W1-W8, as tabulated above.)
Conclusion
• Since no late events beyond the watermark period occurred, all data was processed successfully; a code sketch of the windowed count is given below.
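A minimal PySpark Structured Streaming sketch of this kind of 4-minute tumbling-window count. The source, the event_time column name, and the 4-minute watermark are all assumptions for illustration, not part of the original exercise; the built-in rate source merely stands in for the incoming requests.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("RequestWindowCount").getOrCreate()

# Hypothetical source: the rate source's "timestamp" column stands in for the request event time
requests = spark.readStream.format("rate").option("rowsPerSecond", 1).load() \
    .withColumnRenamed("timestamp", "event_time")

# Tumbling 4-minute windows keyed on event time, tolerating late arrivals up to the watermark
windowCounts = requests \
    .withWatermark("event_time", "4 minutes") \
    .groupBy(window("event_time", "4 minutes")) \
    .count()

query = windowCounts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()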
6. Compare and contrast batch processing, microbatching, and real-time stream processing.
Solution:
Feature            | Batch Processing                        | Microbatching                                 | Real-Time Stream Processing
Processing Mode    | Processes large amounts of data at once | Processes small batches at fixed intervals    | Processes each event as it arrives
Latency            | High (minutes to hours)                 | Medium (seconds to minutes)                   | Very Low (milliseconds)
Example Frameworks | Hadoop, Apache Spark (Batch Mode)       | Apache Spark Streaming, Structured Streaming  | Apache Flink, Apache Kafka Streams
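To make the contrast concrete, here is a minimal PySpark sketch of the same count written as a one-off batch job and as a micro-batch streaming query. The input path, the event_type column, and the 10-second trigger interval are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BatchVsMicrobatch").getOrCreate()

# Batch processing: read whatever data already exists and process it once
batch_df = spark.read.json("/data/events")           # hypothetical input path
batch_df.groupBy("event_type").count().show()        # hypothetical column

# Microbatching: the same aggregation re-runs on small batches of newly arrived files
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events")
query = stream_df.groupBy("event_type").count() \
    .writeStream \
    .outputMode("complete") \
    .trigger(processingTime="10 seconds") \
    .format("console") \
    .start()
query.awaitTermination()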
7. What are the advantages and disadvantages of Spark Structured Streaming over traditional
Spark Streaming?
Solution:
Feature            | Traditional Spark Streaming (DStreams)          | Spark Structured Streaming
API                | Low-level, RDD-based DStream API                | High-level DataFrame/Dataset API
Processing Model   | Micro-batches only                              | Micro-batches (plus an experimental continuous mode)
Event-Time Support | Processing-time based; no built-in watermarking | Built-in event-time windows and watermarking
Optimization       | No Catalyst query optimization                  | Optimized by the Catalyst engine
Conclusion:
• Structured Streaming is more flexible, scalable, and optimized than the older DStream-based approach (the API contrast is sketched below).
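For contrast with the DataFrame-based word count shown in Question 11, here is a minimal sketch of the same job in the older DStream API. The host, port, and 5-second batch interval are assumptions for illustration.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="DStreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)          # fixed 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)

# RDD-style transformations applied to each micro-batch
counts = lines.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()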
8. Explain the concept of watermarking in stream processing and its role in handling late-
arriving data.
Solution:
• Use Case: If a stream window runs from 3:00 to 3:05, but events arrive late, a watermark of
7 minutes allows processing events until 3:12 before finalizing results.
• Importance: the watermark tells the engine how long to keep waiting for late events, so it can bound the state it must retain and decide when a window's result can be finalized.
9. What is the difference between stateless and stateful stream processing?
Solution:
Feature    | Stateless Processing                                          | Stateful Processing
Definition | Each event is processed independently, with no memory of earlier events | Results depend on state accumulated across events
Use Cases  | Removing bad data, logging                                    | Counting occurrences, detecting anomalies
Example:
• Stateless: Filtering out malformed records from a log stream.
• Stateful: Counting how many requests a user made in the last 10 minutes (both cases are sketched below).
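A minimal PySpark sketch contrasting the two cases. The rate source with synthetic user_id and status columns is an assumption standing in for a real event stream; the column names and durations are likewise illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, window

spark = SparkSession.builder.appName("StatelessVsStateful").getOrCreate()

# Hypothetical input: rate source with synthetic columns standing in for real events
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load() \
    .withColumn("user_id", col("value") % 10) \
    .withColumn("status", when(col("value") % 7 == 0, "error").otherwise("ok")) \
    .withColumnRenamed("timestamp", "event_time")

# Stateless: each record is kept or dropped on its own; nothing is remembered between events
clean_events = events.filter(col("status") != "error")

# Stateful: the engine keeps per-user counts across micro-batches;
# the watermark bounds how long that state is retained
requests_per_user = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "10 minutes"), col("user_id")) \
    .count()

query = requests_per_user.writeStream.outputMode("update").format("console").start()
query.awaitTermination()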
10. What are the different types of joins in Spark Structured Streaming, and how do they work?
Solution:
Join Type            | Description                                                                      | Example Use Case
Inner Join           | Emits only records whose keys match in both streams                             | Matching clicks with purchases
Left Outer Join      | Keeps all records from the left stream; missing values from the right are NULL  | Matching users with their most recent activity
Right Outer Join     | Keeps all records from the right stream; missing values from the left are NULL  | Matching transactions with user profiles
Full Outer Join      | Keeps all records from both streams, filling missing values with NULL           | Combining logs from two data sources
Watermarking + Joins | Helps join events arriving at different times by setting a time threshold       | Combining late-arriving sensor data with alerts
Example Scenario:
• A user click stream (Stream A) is joined with a purchase stream (Stream B).
• If a purchase occurs 5 minutes after a click, using a watermark of 10 minutes ensures late
clicks are considered.
11. How do you perform word count on streaming data using Spark Structured Streaming?
Provide a sample code snippet.
Solution:
Steps:
1. Create a SparkSession.
2. Read a stream of lines from a socket source (Netcat on port 9999).
3. Split each line into words and group by word to count occurrences.
4. Write the running counts to the console and wait for termination.
Sample Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Split each line into words, then count occurrences of each word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
wordCounts = words.groupBy("word").count()
# Output the running counts to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
Explanation:
1. The application listens to a Netcat server (port 9999) for real-time text input.
12. How can you use windowed aggregation to count events in a fixed time interval in Spark
Structured Streaming?
Solution:
Steps:
1. Read the stream and attach an event-time timestamp to each record.
2. Apply a watermark and group by a time window (and any other key).
3. Write the windowed counts to a sink.
Sample Code:
query = wordCounts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
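The wordCounts DataFrame consumed by the query above would come from a windowed aggregation along these lines. The host, port, and the 10-minute window with a 5-minute slide are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, window

spark = SparkSession.builder.appName("WindowedWordCount").getOrCreate()

# Socket source with includeTimestamp so each line carries an arrival timestamp
lines = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .option("includeTimestamp", "true") \
    .load()
words = lines.select(explode(split(lines.value, " ")).alias("word"), lines.timestamp)

# Count words in 10-minute windows sliding every 5 minutes,
# accepting late data up to a 10-minute watermark
wordCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(words.timestamp, "10 minutes", "5 minutes"), words.word) \
    .count()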
Explanation:
13. What is the difference between Complete, Append, and Update output modes in Spark
Structured Streaming?
Solution:
Output Mode | Description                                                         | Use Case
Complete    | Outputs the entire result table after every trigger                 | Aggregations where the full, up-to-date result is needed
Append      | Only outputs new rows added since the last trigger                  | Event logs where old rows don't change
Update      | Outputs only rows modified since the last trigger, not all results  | Stateful processing (e.g., session tracking)
Example:
wordCounts.writeStream.outputMode("append").format("console").start()
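For contrast, the other two modes differ only in the outputMode argument. A sketch assuming the wordCounts aggregation from Question 11; each line would start its own query.
# Complete: the whole result table is rewritten to the sink on every trigger (requires an aggregation)
wordCounts.writeStream.outputMode("complete").format("console").start()
# Update: only the rows that changed since the last trigger are written
wordCounts.writeStream.outputMode("update").format("console").start()
# Append: only rows that are finalized and will never change again are written
# (for aggregations this requires a watermark so windows can be closed)
wordCounts.writeStream.outputMode("append").format("console").start()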
14. How do you join two real-time data streams in Spark Structured Streaming? Provide an
example.
Solution:
Steps:
1. Read two input streams (Example: Click Stream & Purchase Stream).
2. Apply a watermark to the event-time column of each stream.
3. Join the streams on a common key (and, typically, an event-time range condition), then write the joined result to a sink.
Sample Code:
query = joinedStream.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
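The joinedStream used above could be built along these lines. The ports, column names, watermark values, and the 10-minute time-range condition are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("StreamStreamJoin").getOrCreate()

# Hypothetical sources: two socket streams carrying click and purchase events with timestamps
clicks = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999) \
    .option("includeTimestamp", "true").load() \
    .selectExpr("value AS click_user", "timestamp AS click_time") \
    .withWatermark("click_time", "10 minutes")

purchases = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9998) \
    .option("includeTimestamp", "true").load() \
    .selectExpr("value AS purchase_user", "timestamp AS purchase_time") \
    .withWatermark("purchase_time", "10 minutes")

# Join each purchase to a click by the same user that happened up to 10 minutes earlier
joinedStream = clicks.join(
    purchases,
    expr("""
        click_user = purchase_user AND
        purchase_time BETWEEN click_time AND click_time + interval 10 minutes
    """)
)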
Explanation:
15. How do you write streaming data from Spark Structured Streaming to a Kafka topic?
Solution:
Steps:
1. Obtain a streaming DataFrame whose output column is named value (string or binary).
2. Configure the Kafka sink with the broker address and target topic.
3. Write to Kafka, providing a checkpoint location for fault tolerance.
Sample Code:
# The DataFrame written to Kafka must have a string or binary column named "value"
# (and optionally "key"); `lines` is assumed to provide one.
query = lines.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "wordCountTopic") \
.option("checkpointLocation", "/tmp/kafka-checkpoint") \
.start()
query.awaitTermination()
Explanation:
16. What are the different processing (delivery) semantics in stream processing systems?
Solution:
Stream processing systems provide different delivery guarantees to ensure correctness. The three
main semantics are:
Processing Semantics | Definition                                                                       | Example Scenario
At-most-once         | Events are processed at most once, but may be lost                              | Logging where occasional data loss is acceptable
At-least-once        | Events are processed at least once, but may be duplicated                       | Metrics pipelines where duplicates can be tolerated or de-duplicated downstream
Exactly-once         | Each event affects the final result exactly once, with no loss or duplication   | Billing and financial processing where correctness is critical
Comparison: stronger guarantees cost more coordination; at-most-once is cheapest but may lose data, at-least-once requires retries and may produce duplicates, and exactly-once needs checkpointing or transactions and carries the most overhead.
Example Frameworks: Apache Flink, Apache Kafka Streams, and Spark Structured Streaming can provide exactly-once guarantees when used with replayable sources and idempotent or transactional sinks.
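As a concrete illustration, a minimal sketch of how Spark Structured Streaming approaches end-to-end exactly-once output through checkpointing; the DataFrame name and paths are assumptions for illustration.
# Exactly-once in Structured Streaming relies on a replayable source (e.g., Kafka),
# checkpointed offsets and state, and an idempotent or transactional sink such as the file sink.
# `parsed` is a hypothetical streaming DataFrame defined elsewhere.
query = parsed.writeStream \
    .format("parquet") \
    .option("path", "/data/output") \
    .option("checkpointLocation", "/data/checkpoints") \
    .start()
query.awaitTermination()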
17. Compute the competitive ratio for renting vs. buying a swimming gear under worst-case
scenarios.
Given Costs:
• Renting: $20
• Buying: $200
• The competitive ratio is 1 in both cases because, in the worst case considered, the total cost of renting does not exceed the cost of buying.
• Break-even point: $200 / $20 = 10 sessions, so renting up to 10 times costs at most as much as buying. If swimming ≤10 times, renting is cheaper or equal; if >10 times, buying is better.
18. Explain and compare event time, processing time, and ingestion time in stream
processing.
Solution:
Time Type       | Definition                                                        | Example
Event Time      | The actual timestamp when an event occurred                       | A user clicks a webpage at 12:05 PM, recorded in the event metadata
Processing Time | The timestamp when the event is processed by the system           | The event is processed at 12:07 PM due to network delay
Ingestion Time  | The timestamp when the event enters the stream processing system  | Kafka receives the event at 12:06 PM, before Spark processes it
Comparison:
1. Event Time is ideal for accurate analytics, but requires watermarking to handle late-
arriving data.
2. Processing Time is simple but not reliable when event arrival is delayed.
3. Ingestion Time is useful for approximate analytics but does not represent the actual event occurrence (the difference in code is sketched below).
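A minimal PySpark sketch of how the difference shows up in practice, assuming a streaming DataFrame events with an event-time column event_time; the names and durations are assumptions for illustration.
from pyspark.sql.functions import current_timestamp, window

# Assumed: `events` is a streaming DataFrame with an event-time column "event_time" (hypothetical name)

# Event time: window on the timestamp carried inside the event itself
by_event_time = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "5 minutes")) \
    .count()

# Processing time: stamp each record as it is processed, then window on that timestamp
by_processing_time = events \
    .withColumn("proc_time", current_timestamp()) \
    .groupBy(window("proc_time", "5 minutes")) \
    .count()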
19. What is the competitive ratio in online algorithms? Explain with an example.
Solution:
The competitive ratio of an online algorithm is the worst-case ratio between the cost the algorithm incurs and the cost of the optimal offline algorithm that knows the entire input in advance; an algorithm is c-competitive if its cost is at most c times the offline optimum on every input. It helps analyze online decision-making when future knowledge is unavailable. The rent-vs-buy decision in Question 17 is a classic example analyzed with this ratio.
20. How does watermarking help in handling late-arriving data in stream processing? Provide
an example.
Solution:
Watermarking is used in event-time processing to handle late events by setting a threshold for
considering late data.
Example:
• With a 5-minute window from 3:00 to 3:05 and a 10-minute watermark, events for that window arriving with event time up to 3:15 are accepted, but later ones are dropped.
from pyspark.sql.functions import window

# Assumed: `events` is a streaming DataFrame with an event-time column "timestamp" (hypothetical name)
eventCounts = events \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes")) \
    .count()
query = eventCounts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
Why is Watermarking Important?
1. Lets the engine finalize windowed results instead of waiting indefinitely for stragglers.
2. Bounds the amount of state that must be kept for aggregations and joins.
3. Helps in real-time analytics where delays are common (e.g., IoT sensor data).