Apache Spark - Practices 2nd
Contents
General Knowledge
Distributed System Questions
Spark Cores Questions
Spark DataFrames Questions
Spark SQL Questions
Best Practices Questions
Apache Spark Machine Learning
Graph Analytics with Apache Spark
Streaming Analytics with Apache Spark – Spark Structured Streaming
General Knowledge
A) Strong consistency
B) Eventual consistency. Eventual consistency allows data to be out of sync for a short period of time.
C) Causal consistency
D) Linearizability
3. What is the CAP theorem in the context of distributed systems?
A) It states that you cannot achieve Consistency, Availability, and Partition Tolerance simultaneously. The CAP theorem says a distributed system cannot provide all three of these properties at the same time.
B) It describes the three types of distributed databases
B) To agree on a single value among distributed processes. Consensus algorithms help distributed processes reach a common agreement.
A) Data sharding
B) Load balancing. Load balancing helps reduce latency when processing data.
C) Centralized processing
D) Synchronous communication
A) filter()
C) reduceByKey()
D) flatMap()
B) It allows Spark to recompute lost data efficiently. Lineage lets Spark rebuild lost partitions when data is lost.
A) Executors are responsible for running tasks and storing data. Executors run the application's tasks and keep its cached data.
B) Executors can only run on the driver node
B) Delaying execution until an action is called. Lazy evaluation defers execution until an action is invoked.
7. Which method would you use to group data in a DataFrame and perform
aggregations?
A) aggregateByKey()
B) groupBy(). groupBy() is used to group data and perform aggregations.
C) collect()
D) summarize()
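A minimal PySpark sketch of the answer, using a small made-up DataFrame (column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales data: (department, amount)
df = spark.createDataFrame(
    [("books", 12.0), ("books", 20.0), ("toys", 7.5)],
    ["department", "amount"],
)

# groupBy() groups rows by key; agg() applies the aggregations
result = df.groupBy("department").agg(
    avg("amount").alias("avg_amount"),
    count("*").alias("n_orders"),
)
result.show()
```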
2. Which function allows you to define window specifications in Spark SQL for
analytics?
A) window()
B) over(). The over() function applies an analytic function over a window specification.
C) partitionBy()
D) groupBy()
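In PySpark the window specification is built with Window.partitionBy(...).orderBy(...) and attached to an analytic function via over(). A small sketch with made-up columns:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("sales", "ann", 100), ("sales", "bob", 90), ("hr", "eve", 80)],
    ["dept", "name", "salary"],
)

# Define the window specification, then attach it with over()
w = Window.partitionBy("dept").orderBy(df["salary"].desc())
df.withColumn("rank_in_dept", row_number().over(w)).show()
```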
3. In Spark SQL, how can you optimize join operations between large tables?
A) By increasing executor memory
B) By using broadcast joins when one table is small enough. Broadcasting ships the small table to every executor and avoids shuffling the large one.
C) By performing joins sequentially
D) By avoiding joins altogether
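A hedged sketch of a broadcast join hint; the table contents are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "payload"])
small = spark.createDataFrame([(1, "US"), (2, "VN")], ["id", "country"])

# broadcast() hints Spark to ship the small table to every executor,
# so the large table does not need to be shuffled for the join
joined = large.join(broadcast(small), on="id", how="left")
joined.show()
```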
5. How does Spark SQL handle schema inference when reading JSON files?
A) It requires explicit schema definition
B) It infers schema based on sample data. Spark SQL infers the schema by sampling the JSON records.
C) It uses default data types
D) It ignores schema entirely
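Sketch of schema inference when reading JSON; the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; Spark samples the JSON records to infer column types
df = spark.read.json("/data/events.json")
df.printSchema()

# An explicit schema can also be supplied to skip the inference pass:
# df = spark.read.schema("user STRING, ts TIMESTAMP, amount DOUBLE").json("/data/events.json")
```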
7. Which command would you use to drop a column from a DataFrame before
executing SQL queries on it?
A) dropColumn()
B) removeColumn()
C) drop(). drop() removes a column from a DataFrame.
D) deleteColumn()
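Illustrative sketch of drop() before running SQL on the result:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "ann", "secret"), (2, "bob", "secret")],
    ["id", "name", "internal_note"],
)

# drop() returns a new DataFrame without the given column(s)
trimmed = df.drop("internal_note")
trimmed.createOrReplaceTempView("people")
spark.sql("SELECT id, name FROM people").show()
```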
8. What type of join returns all records from both tables, matching where
possible, and filling with NULLs where there are no matches?
A) Inner join
B) Left outer join
C) Full outer join. A full outer join returns all records from both tables and fills in NULLs where there is no match.
D) Right outer join
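Small sketch of a full outer join; the "full_outer" (or "outer") join type keeps unmatched rows from both sides:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "ann"), (2, "bob")], ["id", "name"])
right = spark.createDataFrame([(2, "HR"), (3, "IT")], ["id", "dept"])

# Unmatched ids (1 and 3) appear with NULLs in the columns from the other side
left.join(right, on="id", how="full_outer").show()
```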
3. When should you consider using persist() over cache() for an RDD or
DataFrame?
A) When you want to store it only temporarily
B) When you need to specify different storage levels. persist() is more flexible because it lets you choose the storage level (for example memory, disk, or both).
C) When using small datasets
D) When you want automatic eviction
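Sketch contrasting persist() with an explicit storage level against the default cache():

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000)

# cache() always uses the default storage level; persist() lets you pick one
# explicitly, e.g. keep the data only on disk when executor memory is tight
df.persist(StorageLevel.DISK_ONLY)
print(df.count())   # the first action materializes the persisted data
df.unpersist()
```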
4. What is one best practice when writing UDFs (User Defined Functions)?
A) Write UDFs that operate on entire datasets at once
B) Ensure UDFs are stateless and avoid side effects. UDFs should be stateless and should not modify external state.
C) Use UDFs for all transformations regardless of built-in functions
D) Avoid testing UDFs before deployment
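A minimal sketch of a stateless UDF; for this particular transform the built-in lower()/trim() functions would normally be preferred over a UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# A stateless UDF: its output depends only on its input, with no shared
# mutable state and no side effects such as writing to external systems
@udf(returnType=StringType())
def normalize_name(name):
    return name.strip().lower() if name is not None else None

df = spark.createDataFrame([(" Ann ",), ("BOB",)], ["name"])
df.withColumn("name_clean", normalize_name("name")).show()
```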
10. Which method can be used to handle missing values in a DataFrame? Select
many
A) Fill with mean or median
B) Drop rows with missing values
C) Ignore missing values entirely
D) Replace with zeroes.
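Sketch of two of the listed approaches (fill with the mean, drop rows with nulls) using the DataFrame na functions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, None), (2, 10.0), (3, None), (4, 30.0)],
    "id INT, amount DOUBLE",
)

# Fill nulls with the column mean
mean_amount = df.selectExpr("avg(amount)").first()[0]
filled = df.na.fill({"amount": mean_amount})

# Or drop rows that contain nulls in that column
dropped = df.na.drop(subset=["amount"])

filled.show()
dropped.show()
```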
13. In PySpark, which function would you use to create a DataFrame from an
existing RDD?
A) createDataFrame()
B) toDF()
C) fromRDD()
D) buildDataFrame()
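Both createDataFrame() and toDF() work on an existing RDD; a short sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([(1, "ann"), (2, "bob")])

# Either spark.createDataFrame(rdd, column_names) ...
df1 = spark.createDataFrame(rdd, ["id", "name"])

# ... or rdd.toDF(column_names), once a SparkSession exists
df2 = rdd.toDF(["id", "name"])

df1.show()
df2.printSchema()
```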
15. Which evaluation metric is commonly used for binary classification models?
Select many
A) Precision
B) Recall
C) Mean Squared Error
D) R-squared
17. Which method would you use to save a trained model in Spark MLlib?
A) saveModel()
B) write().save()
C) exportModel()
D) persistModel()
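In the DataFrame-based MLlib API a fitted model is saved through its writer; a sketch with LogisticRegression and a hypothetical path:

```python
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])), (1.0, Vectors.dense([2.0, 1.0]))],
    ["label", "features"],
)
model = LogisticRegression(maxIter=5).fit(train)

# write() returns an MLWriter; save() persists the fitted model
model.write().overwrite().save("/tmp/lr_model")   # hypothetical path

# Reload later with the matching *Model class
reloaded = LogisticRegressionModel.load("/tmp/lr_model")
```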
20. Which algorithm would you use for multi-class classification problems in Spark
MLlib? Select many
A) Decision Trees
B) Logistic Regression (with One-vs-Rest strategy)
C) K-Means Clustering
D) Linear Regression
23. In PySpark, which function allows you to split your dataset into training and test
sets?
A) split()
B) randomSplit()
C) trainTestSplit()
D) partition()
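Sketch of randomSplit() with an 80/20 split and a fixed seed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000)

# randomSplit takes a list of weights (normalized if they do not sum to 1) and a seed
train, test = df.randomSplit([0.8, 0.2], seed=42)
print(train.count(), test.count())
```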
26. Which of the following methods can be used for feature selection in PySpark's MLlib? Select many
A) Chi-Squared Test. Used to assess the importance of categorical features.
B) Recursive Feature Elimination
C) Feature Importance from Tree Models. Tree-based models (such as Random Forest) can rank feature importance.
D) Lasso Regression. Eliminates unimportant features through regularization.
27. In which scenario would you prefer using DataFrame over RDD in Apache
Spark? Select many
A) When working with structured data. DataFrames are better suited to structured data and use the Catalyst optimizer to improve performance.
B) When requiring complex transformations
C) When needing optimized execution plans through Catalyst optimizer
D) When handling unstructured data
28. Which function would you use to convert categorical features into numerical
format in PySpark's MLlib? Select many
A) StringIndexer. StringIndexer converts string values into numeric indices.
B) OneHotEncoder. OneHotEncoder creates binary vectors that represent each categorical value.
C) VectorAssembler
D) IndexToString.
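Sketch chaining StringIndexer and OneHotEncoder on a made-up categorical column:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red",), ("blue",), ("red",), ("green",)], ["color"]
)

# StringIndexer: string labels -> numeric indices
indexer = StringIndexer(inputCol="color", outputCol="color_idx")
indexed = indexer.fit(df).transform(df)

# OneHotEncoder: numeric indices -> sparse binary vectors
encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
encoder.fit(indexed).transform(indexed).show(truncate=False)
```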
34. Which of the following metrics is NOT typically used for evaluating regression
models?
A) Mean Absolute Error (MAE)
B) Root Mean Squared Error (RMSE)
C) F1 Score
D) R-squared
36. Which method would you use to perform hyperparameter tuning with cross-
validation in Spark MLlib?
A) CrossValidator
B) ParamGridBuilder
C) TrainValidationSplit
D) HyperparameterTuner
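Sketch of CrossValidator driving a grid built with ParamGridBuilder; the tiny synthetic dataset is only for illustration:

```python
import random

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Synthetic data: class-1 rows are shifted so the classes are separable
rows = [(float(i % 2), Vectors.dense([random.random() + (i % 2), random.random()]))
        for i in range(40)]
train = spark.createDataFrame(rows, ["label", "features"])

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# CrossValidator evaluates every grid point with k-fold cross-validation
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
best_model = cv.fit(train).bestModel
print(best_model.coefficients)
```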
39. Which function can be used to convert a DataFrame column to a feature vector
in PySpark?
A) toVector()
B) assemble()
C) VectorAssembler().transform()
D) createVector()
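Sketch of VectorAssembler().transform() packing numeric columns into a feature vector:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1.0, 0.5, 3.0), (2.0, 1.5, 0.0)], ["x1", "x2", "x3"])

# VectorAssembler packs the listed columns into a single "features" vector column
assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
assembler.transform(df).show(truncate=False)
```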
41. Which algorithm is commonly used for anomaly detection in Spark MLlib?
A) K-Means Clustering
B) Isolation Forest
C) Decision Trees
D) Linear Regression
45. Which algorithm would you use for collaborative filtering in Spark MLlib?
A) K-Means Clustering
B) Alternating Least Squares (ALS)
C) Logistic Regression
D) Decision Trees
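Sketch of ALS collaborative filtering on a few made-up ratings:

```python
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (1, 12, 2.0)],
    ["userId", "itemId", "rating"],
)

# ALS factorizes the user-item rating matrix for collaborative filtering
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)
```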
49. Which method would you use to visualize decision boundaries of a trained
model in PySpark?
A) plotDecisionBoundary()
B) visualizeModel()
C) plot() with Matplotlib or Seaborn after collecting predictions from the model
D) drawBoundary()
50. Which type of neural network is commonly used for image classification tasks?
A) Recurrent Neural Network
B) Convolutional Neural Network (CNN)
C) Feedforward Neural Network
D) Generative Adversarial Network
52. Which function would you use to compute the confusion matrix for a
classification model in PySpark?
A) computeConfusionMatrix()
B) MulticlassMetrics()
C) evaluateConfusionMatrix()
D) getConfusionMatrix()
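MulticlassMetrics lives in the RDD-based pyspark.mllib.evaluation module and expects (prediction, label) pairs; the pairs below are made up:

```python
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical (prediction, label) pairs collected from a fitted classifier
pred_df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)],
    ["prediction", "label"],
)

# MulticlassMetrics exposes the confusion matrix directly
metrics = MulticlassMetrics(pred_df.rdd.map(lambda r: (r.prediction, r.label)))
print(metrics.confusionMatrix().toArray())
```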
56. In which scenario would you use Random Forest over Decision Trees in Spark
MLlib? Select many
A) When needing better generalization and reduced overfitting risk
B) When requiring faster training times
C) When dealing with high-dimensional datasets
D) When interpreting individual decision paths is crucial
57. What does the term "bagging" refer to in ensemble learning methods like
Random Forests?
A) Combining predictions from different algorithms
B) Training multiple models on different subsets of data and averaging their
predictions
C) Using boosting techniques to improve weak learners
D) Selecting features randomly for each model
58. How can you handle imbalanced datasets when training models in Spark MLlib?
Select many
A) Use oversampling techniques like SMOTE (Synthetic Minority Over-sampling
Technique)
B) Use undersampling techniques to reduce majority class samples
C) Ignore class imbalance entirely
D) Use algorithms that are robust to class imbalance
59. Which method would you use to evaluate multi-class classification models
effectively?
A) MulticlassClassificationEvaluator().evaluate()
B) BinaryClassificationEvaluator().evaluate()
C) accuracy(), precision(), recall(), F1 score metrics calculations directly
D) confusionMatrix().
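Sketch of MulticlassClassificationEvaluator on made-up predictions; metricName can be "accuracy", "f1", "weightedPrecision", or "weightedRecall":

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical (prediction, label) pairs from a multi-class model
pred_df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0)],
    ["prediction", "label"],
)

evaluator = MulticlassClassificationEvaluator(metricName="f1")
print(evaluator.evaluate(pred_df))
```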
Graph Analytics with Apache Spark
14. Which property can you set when creating a GraphFrame to define edge
weights?
A) weight
B) score
C) value
D) strength
17. Which Spark library provides functionalities for working with graphs?
A) Spark SQL
B) MLlib
C) GraphX
D) Both GraphX and GraphFrames.
18. How do you convert a DataFrame into a format suitable for creating a
GraphFrame? Select many
A) Ensure it has appropriate columns for vertices and edges
B) Convert it into an RDD first
C) Use schema definitions that match vertex and edge requirements
D) Flatten nested structures into single columns.
21. Which function would you use to find number of triangles that pass through
each vertex in a graph using GraphFrames?
A) findTriangles()
B) triangleCount()
C) countTriangles()
D) detectTriangles()
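Sketch of triangleCount() on a three-vertex cycle; GraphFrames is an external package that must be added to the Spark session (for example via --packages when launching Spark):

```python
# Assumes the graphframes package is available on the classpath
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"]
)

g = GraphFrame(vertices, edges)
# triangleCount() returns the vertex DataFrame with a triangle "count" column
g.triangleCount().show()
```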
23. Which method allows you to run SQL queries on a GraphFrame? Select many
A) sqlQuery()
B) runSQL()
C) createOrReplaceTempView() followed by spark.sql()
D) queryGraph()
24. How can you efficiently handle large graphs in Spark using GraphFrames?
Select many
A) By partitioning data appropriately before creating graphs
B) By using small datasets only
C) By leveraging distributed computing capabilities of Spark
D) By avoiding complex queries on large graphs.
25. What is one limitation when using traditional RDDs compared to DataFrames
for graph analytics in Spark?
A) RDDs cannot store structured data
B) RDDs lack optimization features like Catalyst and Tungsten found in DataFrames
C) RDDs cannot be used with machine learning algorithms
D) RDDs cannot handle large datasets
26. Which type of query would be best suited for analyzing relationships in graphs
using PySpark's Graph APIs?
A) JOIN operations on edge attributes
B) GROUP BY operations on vertex properties
C) Pattern matching queries using motifs API
D) Aggregation queries on edge weights
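Sketch of motif finding with find(); the graph is the same toy triangle as in the earlier GraphFrames sketch:

```python
# Assumes the graphframes package is available on the classpath
from graphframes import GraphFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Motif matching: chains x -> y -> z where the path does not loop back to x
paths = g.find("(x)-[e1]->(y); (y)-[e2]->(z)").filter("x.id != z.id")
paths.show(truncate=False)
```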
27. In PySpark, how would you visualize relationships between nodes after
performing analysis with GraphFrames? Select many
A) Use built-in visualization tools provided by Databricks or third-party libraries like
NetworkX or Matplotlib
B) Directly visualize using Spark's native plotting functions
C) Export results as CSV and use external software like Gephi or Cytoscape for
visualization
D) Visualization is not possible with GraphFrames.
29. Which algorithm would you use for community detection in large graphs with
GraphFrames?
A) K-Means Clustering
B) Label Propagation algorithm
C) Decision Trees
D) Logistic Regression
Streaming Analytics with Apache Spark – Spark Structured Streaming
4. Which of the following sources are suitable for ingesting data into Structured
Streaming? Select many
A) Kafka
B) Flume
C) HDFS
D) Kinesis
6. Which output mode in Structured Streaming only outputs new rows as they
arrive?
A) Complete mode
B) Append mode
C) Update mode
D) Batch mode
11. What is the default output mode for streaming queries in Spark?
A) Complete
B) Append
C) Update
D) Batch
12. Which method would you use to stop a running streaming query gracefully?
A) stopQuery()
B) terminate()
C) awaitTermination()
D) stop().
16. In Spark Structured Streaming, what does selectExpr() allow you to do?
A) Perform complex transformations on DataFrames
B) Execute SQL expressions directly on DataFrames
C) Filter out specific records from a stream
D) Aggregate results over time
17. Which function would you use to read from a Kafka topic in Spark Structured
Streaming?
A) readKafka()
B) readStream().format("kafka")
C) streamFromKafka()
D) loadKafkaStream()
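Sketch of reading a Kafka topic as a stream; the broker address and topic name are hypothetical, and the spark-sql-kafka connector must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
          .option("subscribe", "events")                      # hypothetical topic
          .load())

# Kafka records arrive as binary key/value columns; cast them before parsing
events = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```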
19. How can you handle late-arriving data in Structured Streaming? Select many
A) By using watermarks
B) By ignoring late data entirely
C) By adjusting batch intervals
D) By using stateful operations with event-time processing.
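Sketch of a watermark on an event-time column, using the built-in rate source as a stand-in for a real stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, window

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("rate")              # built-in test source
          .option("rowsPerSecond", 5)
          .load()
          .withColumnRenamed("timestamp", "event_time"))

# withWatermark tells Spark how late data may arrive before window state is
# dropped; events more than 10 minutes late are ignored
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"))
          .agg(count("*").alias("n")))

# counts.writeStream.outputMode("update").format("console").start()
```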
21. Which operation would you use to join two streaming DataFrames?
A) join()
B) merge()
C) combine()
D) union()
23. In which scenario would you use Continuous Processing mode in Spark
Structured Streaming? Select many
A) When low latency is critical and you need response times under 1 millisecond
B) When processing large batches of historical data
C) When high throughput is required
D) When you want at-least-once guarantees rather than exactly-once guarantees.
24. Which command would you use to monitor active streaming queries in Spark?
A) spark.streams.active()
B) monitorQueries()
C) listQueries()
D) showActiveStreams()
26. Which method allows you to apply transformations on each RDD generated
from a DStream?
A) foreachRDD()
B) mapRDD()
C) transformRDD()
D) applyToRDD().
29. Which function would you use to aggregate counts over a sliding window in
structured streaming?
A) countByWindow()
B) window() followed by groupByKey().agg(count())
C) slidingWindowCount()
D) aggregateWindow().
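Sketch of a sliding-window count; the idiomatic form is groupBy(window(...)).agg(count(...)), shown here on the built-in rate source with an update-mode console sink:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, window

spark = SparkSession.builder.getOrCreate()

events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Sliding window: 10-minute windows that advance every 5 minutes
windowed = (events
            .groupBy(window("timestamp", "10 minutes", "5 minutes"))
            .agg(count("*").alias("events_in_window")))

query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .start())
# query.awaitTermination()
```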