Big Data Integration and Processing Final
Which of the following SQL clauses is used to filter the result set based on a specified condition?
*A: WHERE
Feedback: Correct! The WHERE clause is used to filter records based on a specified condition.
B: SELECT
Feedback: Incorrect. The SELECT clause is used to select data from a database.
C: JOIN
Feedback: Incorrect. The JOIN clause is used to combine rows from two or more tables, based on a
related column between them.
D: ORDER BY
Feedback: Incorrect. The ORDER BY clause is used to sort the result set in ascending or descending
order.
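A minimal sketch of the WHERE clause in action, using Python's built-in sqlite3 module; the table name and data here are purely illustrative:

    import sqlite3

    # In-memory database with a small illustrative table
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)",
                     [("Ana", 60000), ("Bo", 45000), ("Cy", 70000)])

    # WHERE filters the result set to rows matching the condition
    rows = conn.execute(
        "SELECT name FROM employees WHERE salary > 50000").fetchall()
    print(rows)  # [('Ana',), ('Cy',)]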
Which of the following are aggregate functions in SQL? Select all that apply.
*A: COUNT
Feedback: Correct! COUNT is an aggregate function in SQL that returns the number of input rows that
match a specific condition.
*B: MAX
Feedback: Correct! MAX is an aggregate function in SQL that returns the maximum value in a set of
values.
C: JOIN
Feedback: Incorrect. JOIN is used to combine rows from two or more tables based on a related column
between them, but it is not an aggregate function.
*D: SUM
Feedback: Correct! SUM is an aggregate function in SQL that returns the sum of a set of values.
E: SELECT
Feedback: Incorrect. SELECT is used to specify the columns to be returned by the query, but it is not an
aggregate function.
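A short illustration of the aggregate functions marked correct above, again using sqlite3 with made-up data:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?)", [(10,), (25,), (40,)])

    # COUNT, MAX, and SUM each collapse many rows into a single value
    count, max_, sum_ = conn.execute(
        "SELECT COUNT(*), MAX(amount), SUM(amount) FROM sales").fetchone()
    print(count, max_, sum_)  # 3 40 75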
If a table has 100 rows and a query returns 20% of them, how many rows will be returned?
*A: 20.0
Feedback: Correct! 20% of 100 rows is 20 rows.
Default Feedback: Incorrect. Review the calculation for a percentage of total rows.
What is the name of the SQL clause used to combine rows from two or more tables based on a related
column between them? Please answer in all lowercase.
*A: join
Feedback: Correct! The JOIN clause is used to combine rows from two or more tables based on a related
column.
Default Feedback: Remember the clause that allows combining rows from multiple tables based on
related columns.
What tool can be used to visualize SQL query results in this course? Please answer in all lowercase.
*A: pgadmin
Feedback: Correct! PgAdmin is the tool used in this course to run queries and view their results.
Default Feedback: Incorrect. The tool used to visualize SQL query results in this course is PgAdmin.
What SQL clause is used to sort the result set of a query? Please answer in all lowercase.
*A: order by
Feedback: Correct! The ORDER BY clause is used to sort the result set.
Default Feedback: Incorrect. Remember that the clause used to sort the result set of a query is ORDER
BY.
What SQL keyword is used to remove a table from a database? Please answer in all lowercase.
*A: drop
Feedback: Correct! The DROP statement is used to remove a table from a database.
Default Feedback: Incorrect. Review the SQL commands for removing database objects.
If a table has 500 rows and you execute the SQL query SELECT COUNT(*) FROM table_name WHERE column_name = 'value'; and 100 rows match the condition, what count value will the query return?
*A: 100.0
Feedback: Correct! The query returns the count of rows that match the condition.
Default Feedback: Incorrect. Consider how the COUNT function works with the WHERE clause.
How many rows will be returned by the following query if the Employees table has 50 rows? SELECT * FROM Employees LIMIT 10;
*A: 10.0
Feedback: Correct! The LIMIT clause restricts the number of rows returned by the query to 10.
Default Feedback: Incorrect. Review how the LIMIT clause works in SQL.
Which SQL function is used to return the number of rows that match a specified condition? Please answer in all lowercase.
*A: count
Feedback: Correct! The COUNT function returns the number of rows that match a specified condition.
Default Feedback: Incorrect. Review the SQL aggregate function used to count rows.
How many rows are returned by the SQL query SELECT COUNT(*) FROM table_name if the table
table_name has 100 rows?
*A: 1.0
Feedback: Correct! COUNT(*) collapses the table into a single row containing the count, so the query returns 1 row regardless of how many rows the table has.
Default Feedback: Incorrect. Review the COUNT(*) function to understand its behavior.
What is the command to remove all records from a table without deleting the table itself? Please answer
in all lowercase.
*A: truncate
Feedback: Correct! The TRUNCATE command is used to remove all records from a table without
deleting the table itself.
B: delete
Feedback: Incorrect. The DELETE command removes records but can be used with a WHERE clause to
delete specific records.
C: drop
Feedback: Incorrect. The DROP command deletes the table itself along with all its records.
Default Feedback: Incorrect. Please review the SQL commands for removing records from a table.
If a table named Products contains 200 rows and you execute the following SQL query, how many rows will be returned? SELECT * FROM Products LIMIT 50 OFFSET 100;
*A: 50.0
Feedback: Correct! The LIMIT clause specifies the number of rows to return, and the OFFSET clause
specifies the number of rows to skip before starting to return rows.
Default Feedback: Incorrect. Review the usage of the LIMIT and OFFSET clauses in SQL.
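A quick sketch of LIMIT and OFFSET working together, using sqlite3; the table contents are illustrative:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER)")
    conn.executemany("INSERT INTO products VALUES (?)",
                     [(i,) for i in range(200)])

    # Skip the first 100 rows, then return at most 50
    rows = conn.execute(
        "SELECT id FROM products LIMIT 50 OFFSET 100").fetchall()
    print(len(rows))  # 50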
You have executed a query in PgAdmin to retrieve data from a table called sales where the revenue is
greater than 1000. How many rows will be returned if the sales table contains 5000 rows with revenue
values uniformly distributed between 500 and 1500?
*A: 2500.0
Feedback: Well done! You correctly calculated the number of rows with revenue greater than 1000.
Default Feedback: Try calculating the proportion of rows that meet the condition from the provided
range.
Which of the following are benefits of data partitioning in a distributed database system?
*A: Improves query performance by dividing data into smaller, more manageable pieces
Feedback: Correct! Data partitioning helps in distributing the load across various nodes.
B: Replicates data across nodes for durability
Feedback: Incorrect. Data partitioning involves dividing data, not replicating it.
C: Simplifies data replication
Feedback: Incorrect. Data partitioning is about dividing data, not about simplifying replication.
D: Removes duplicate data automatically
Feedback: Incorrect. Data partitioning does not inherently remove duplicates; it organizes data across partitions.
What keyword is used in SQL to rename a column in the result set? Please answer in all lowercase.
*A: as
Feedback: Correct! The AS keyword is used to give an alias to a column in the result set.
Default Feedback: Incorrect. Please review the SQL syntax for renaming columns in the result set.
Which PgAdmin feature allows you to view table and column definitions? Please answer in all lowercase.
*A: properties
Feedback: Correct! The Properties feature in PgAdmin allows you to view table and column definitions.
*B: structure
Feedback: Correct! The Structure feature in PgAdmin allows you to view table and column definitions.
*C: columns
Feedback: Correct! The Columns feature in PgAdmin allows you to view table and column definitions.
Default Feedback: Incorrect. Please review the course material on viewing table and column definitions
in PgAdmin.
If a partitioned table contains 100,000 rows and is divided into 4 partitions, what is the average number
of rows per partition?
*A: 25000.0
Default Feedback: Incorrect. Please review the calculation for average number of rows per partition in a
partitioned table.
What command would you use to view the column definitions of a table in PgAdmin?
*A: SELECT * FROM information_schema.columns WHERE table_name = 'your_table_name';
Feedback: Correct! This command queries the information_schema to show column definitions.
B: SHOW COLUMNS FROM your_table_name;
Feedback: Not quite. This syntax is typically used in MySQL, not PostgreSQL.
C: DESCRIBE TABLE your_table_name;
Feedback: Incorrect. DESCRIBE is not supported by PostgreSQL; it belongs to other database systems.
Select all the tasks that can be performed using PgAdmin in conjunction with PostgreSQL and Docker Desktop.
*A: Visualize query results
Feedback: Correct! Query results can be visualized when Docker Desktop is integrated with PgAdmin and PostgreSQL.
*B: Filter table rows and columns
Feedback: Correct! PgAdmin allows you to filter table rows and columns.
C: Deploy databases to the cloud
Feedback: Incorrect. While PgAdmin can manage databases, deploying to the cloud requires additional tools and configurations.
Which SQL clause is used to combine rows from two or more tables based on a related column?
*A: JOIN
Feedback: Correct! The JOIN clause is used to combine rows from two or more tables based on a related column.
B: ORDER BY
Feedback: Incorrect. ORDER BY is used to sort the result set, not to combine tables.
C: WHERE
Feedback: Incorrect. The WHERE clause is used to filter records, not to join tables.
How can you view the data within a table using PgAdmin?
*A: Right-click the table and choose the 'View Data' option
Feedback: Correct! You can view data by selecting the 'View Data' option in PgAdmin.
B: Run an ALTER TABLE command
Feedback: Incorrect. ALTER TABLE is used to modify the structure of a table, not to view data.
C: Run a DELETE FROM command
Feedback: Incorrect. DELETE FROM removes records from a table; it doesn't display data.
D: Query the system catalogs directly
Feedback: Incorrect. Accessing system catalogs is complex and not typically used for simple data viewing.
What does SQL stand for?
*A: Structured Query Language
Feedback: Correct! SQL stands for Structured Query Language.
B: Standard Query Language
Feedback: Incorrect. It's a common misconception, but SQL stands for Structured Query Language.
C: Simple Query Language
Feedback: Incorrect. While SQL is designed to simplify database queries, it stands for Structured Query Language.
D: Sequential Query Language
Feedback: Incorrect. SQL is not about sequence; it stands for Structured Query Language.
Which SQL clause is used to combine rows from two or more tables?
*A: JOIN
Feedback: Correct! The JOIN clause is used to combine rows from two or more tables.
B: MERGE
Feedback: Incorrect. MERGE is used for merging data, not specifically for joining tables.
Which of the following are benefits of data partitioning in a distributed database system?
*A: Improved query performance
Feedback: Correct! Partitioning can enhance query performance by limiting the amount of data that needs to be scanned.
B: Increased data redundancy
Feedback: Incorrect. Partitioning does not inherently increase data redundancy; it focuses on performance and manageability.
C: Reduced storage requirements
Feedback: Incorrect. While partitioning may optimize data access, it does not necessarily reduce storage needs.
Which SQL clause allows you to specify the condition under which rows are returned?
*A: WHERE
Feedback: Correct! The WHERE clause is used to filter records based on specific conditions.
B: JOIN
Feedback: Incorrect. The JOIN clause is used to combine rows from two or more tables based on a
related column.
C: GROUP BY
Feedback: Incorrect. The GROUP BY clause is used to arrange identical data into groups.
D: ORDER BY
Feedback: Incorrect. The ORDER BY clause is used to sort the result set in ascending or descending
order.
If a partitioned table in a distributed database system is split into 10 partitions with an even distribution,
what is the maximum possible size of each partition if the total table size is 1,000,000 records?
*A: 100000.0
Feedback: Correct! Each partition would ideally hold an equal portion of the total records.
Default Feedback: Incorrect. Consider how the total records are distributed across partitions evenly.
If a table in PostgreSQL has 1500 rows and you execute a query that filters these rows down to 25% of
the original, how many rows are in the result?
*A: 375.0
Feedback: Correct! 25% of 1500 rows is 375 rows.
Default Feedback: Incorrect. Multiply the original row count by the filter percentage.
What is the keyword used in SQL to remove duplicate records from a query result? Please answer in all
lowercase.
*A: distinct
Feedback: Correct! The DISTINCT keyword is used to remove duplicates from query results.
Default Feedback: Remember to review SQL keywords related to query result modifications.
Which of the following are differences between a database management system (DBMS) and a big data
management system (BDMS)?
*A: BDMS can handle unstructured data, while DBMS primarily handles structured data
Feedback: Correct! BDMS are designed to handle a variety of data types, including unstructured data.
*B: DBMS is optimized for transaction processing, while BDMS is optimized for analytical processing
Feedback: Correct! DBMS are often optimized for transactions, whereas BDMS are optimized for
analytics.
C: ETL processes are always more complex in a BDMS than in a DBMS
Feedback: Incorrect. Both systems can have complex ETL processes, but the complexity depends on the specific use case and data.
*D: BDMS can scale out horizontally more efficiently than DBMS
Feedback: Correct! BDMS are designed to scale horizontally across distributed systems.
E: DBMS cannot process large volumes of data at all
Feedback: Incorrect. DBMS can process large volumes of data, but they are not as efficient as BDMS in handling extremely large datasets.
Which of the following are challenges associated with streaming data in big data systems?
C: Data Redundancy
Feedback: Not quite. Data redundancy is not a primary challenge of streaming data.
D: Batch Processing
Feedback: Incorrect. Batch processing is not typically a challenge for streaming data systems.
Which of the following are necessary steps to organize downloaded files for use in the Hands-On Modules?
*A: Name files according to their content or purpose
Feedback: Correct! Naming files according to their content or purpose makes it easier to identify and access them.
B: Leave all files in the Downloads folder
Feedback: Incorrect. Leaving files in the Downloads folder can lead to disorganization.
*C: Delete unnecessary files
Feedback: Correct! Deleting unnecessary files helps in keeping the workspace clean and organized.
D: Rename files to random names
Feedback: Incorrect. Renaming files to random names will make it difficult to identify and access them.
What term is used to describe the speed at which data is generated and processed in a big data context?
Please answer in all lowercase.
*A: velocity
Feedback: Correct! Velocity refers to the speed at which data is generated and processed in big data.
Default Feedback: Incorrect. Remember, the speed at which data is generated and processed is a crucial
factor in big data.
In a big data context, a certain process handles between 1,000 and 2,000 transactions per second. What
is the range of transactions per second?
Feedback: Correct! The process handles between 1,000 and 2,000 transactions per second.
Default Feedback: Incorrect. Please review the handling capacity in transactions per second for big data
processes.
What is the name of the interactive computing environment that JupyterLab extends? Please answer in all lowercase.
*A: notebook
Feedback: Correct! JupyterLab extends the classic Jupyter Notebook environment.
Default Feedback: Incorrect. Please review the basic features and functionalities of JupyterLab.
In a big data system, a particular data processing job takes between 5 and 10 seconds to complete. What
is the range of time in seconds?
Default Feedback: Incorrect. Consider the range of time it takes for the job to complete.
Which of the following are requirements of programming models for big data?
*A: Scalability
Feedback: Correct! Scalability is a crucial requirement for programming models in big data.
*B: Fault tolerance
Feedback: Correct! Fault tolerance ensures the system can handle failures gracefully.
C: Portability
Feedback: Incorrect. While useful, portability is not a primary requirement of programming models for
big data.
*D: Efficiency
E: User-friendliness
Feedback: Incorrect. User-friendliness is advantageous but not a core requirement for big data
programming models.
What is the name of the interface used to interact with Jupyter Notebooks? Please answer in all
lowercase.
*A: jupyterlab
Feedback: Correct! JupyterLab is the interface used to interact with Jupyter Notebooks.
*B: notebook
Feedback: Correct! Notebook is another term for interacting with Jupyter Notebooks.
Default Feedback: Incorrect. Review the section on Jupyter Notebooks to find the interface name.
Which tool is used for real-time data processing and can handle large streams of data efficiently? Please
answer in all lowercase.
*A: spark
Feedback: Correct! Apache Spark is widely used for real-time data processing.
*B: splunk
Feedback: Correct! Splunk is also used for real-time data processing and can handle large data streams
efficiently.
Default Feedback: Incorrect. Please review the lesson material on tools for real-time data processing.
If a big data system processes 2 terabytes of data in 30 minutes, what is the data processing rate in gigabytes per minute?
*A: 66.67
Feedback: Correct! 2 terabytes is roughly 2000 gigabytes, and 2000 / 30 ≈ 66.67 gigabytes per minute.
Default Feedback: Incorrect. Please review the lesson material on calculating data processing rates.
Which of the following are aggregate functions in SQL? Select all that apply.
*A: SUM
Feedback: Correct! SUM is an aggregate function that calculates the total sum of a numeric column.
*B: MAX
Feedback: Correct! MAX is an aggregate function that returns the maximum value in a set.
*C: MIN
Feedback: Correct! MIN is an aggregate function that returns the minimum value in a set.
D: JOIN
Feedback: Incorrect. JOIN is not an aggregate function; it is used to combine rows from two or more
tables based on a related column.
*E: COUNT
Feedback: Correct! COUNT is an aggregate function that returns the number of rows that match a
specified criterion.
F: DISTINCT
Feedback: Incorrect. DISTINCT is not an aggregate function; it is used to return only distinct (different)
values.
Which of the following Pandas operations can be used to filter data in a DataFrame?
*A: loc
Feedback: Correct! The loc function is used for label-based indexing to filter data.
*B: iloc
Feedback: Correct! The iloc function is used for positional indexing to filter data.
C: groupby
Feedback: Incorrect. groupby is used for grouping data, not directly for filtering.
D: pivot_table
Feedback: Incorrect. pivot_table is used to create pivot tables, not directly for filtering.
*E: filter
Feedback: Correct! The filter function is used to filter data based on column names.
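A brief sketch of the three filtering-related Pandas operations marked correct above; the column names and data are invented for illustration:

    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b", "c"], "score": [10, 20, 30]})

    print(df.loc[df["score"] > 15])   # label/boolean-based row selection
    print(df.iloc[0:2])               # position-based row selection
    print(df.filter(items=["name"]))  # select columns by name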
Which of the following operations can be performed using the Pandas library in Python?
*A: Reading data from a CSV file into a DataFrame
Feedback: Correct! Pandas can read data from a CSV file into a DataFrame.
Which of the following are valid types of SQL JOIN? Select all that apply.
*A: INNER JOIN
Feedback: Correct! INNER JOIN selects records that have matching values in both tables.
*B: LEFT JOIN
Feedback: Correct! LEFT JOIN returns all records from the left table, and the matched records from the right table.
C: UPPER JOIN
Feedback: Incorrect. UPPER JOIN is not a valid SQL join type.
*D: RIGHT JOIN
Feedback: Correct! RIGHT JOIN returns all records from the right table, and the matched records from the left table.
*E: FULL JOIN
Feedback: Correct! FULL JOIN returns all records when there is a match in either the left or right table.
What method is used to count the number of documents in a MongoDB collection that match a specified filter? Please answer in all lowercase.
*A: count_documents
Feedback: Correct! The count_documents method is used to count the number of documents in a
MongoDB collection that match a specified filter.
B: count
Feedback: Incorrect. The count method is deprecated in MongoDB. Use count_documents instead.
C: countdocs
Feedback: Incorrect. countdocs is not a valid MongoDB method.
Default Feedback: Incorrect. Please review the MongoDB documentation to find the correct method for
counting documents that match a filter.
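A minimal pymongo sketch of count_documents, assuming a MongoDB server is running locally; the database, collection, and field names are illustrative:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    collection = client["shop"]["orders"]

    # Count only the documents matching the filter
    n = collection.count_documents({"status": "shipped"})
    print(n)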
How many rows will be returned by the SQL query if there are 50 rows in the students table and the
condition is WHERE age > 20? Assume 30 students are older than 20.
*A: 30.0
Feedback: Correct! There are 30 students older than 20 in the students table.
Default Feedback: Incorrect. Please review the lesson on SQL query conditions and row counts.
What is the keyword used in SQL to extract data from a database? Please answer in all lowercase.
*A: select
Feedback: Correct! SELECT is the keyword used to extract data from a database.
B: query
Feedback: Incorrect. QUERY is not a valid keyword in SQL for extracting data.
C: fetch
Feedback: Incorrect. FETCH is not a valid keyword in SQL for extracting data.
D: retrieve
Feedback: Incorrect. RETRIEVE is not a valid keyword in SQL for extracting data.
Default Feedback: Incorrect. Please review the SQL keywords for extracting data.
If a DataFrame has 500 rows and 30 columns, how many elements does it contain?
*A: 15000.0
Default Feedback: Incorrect. Please review how to calculate the number of elements in a DataFrame.
What is the term for a subset of a DataFrame that meets a specified condition? Please answer in all lowercase.
*A: filter
Feedback: Correct! A subset of a DataFrame that meets a specified condition is called a filter.
*B: subset
Feedback: Correct! A subset of a DataFrame that meets a specified condition can also be called a subset.
Default Feedback: Incorrect. Please review the lesson materials and try again.
What Python library is commonly used for data manipulation and analysis, especially with DataFrames?
Please answer in all lowercase.
*A: pandas
Feedback: Correct! Pandas is widely used for data manipulation and analysis with DataFrames.
Default Feedback: Incorrect. The commonly used library for data manipulation and analysis with
DataFrames is Pandas.
If a DataFrame has 1000 rows and 20 columns, how many elements does it contain?
*A: 20000.0
Default Feedback: Incorrect. Please review how to calculate the number of elements in a DataFrame.
How many rows will be returned by the following SQL query, assuming 5 rows satisfy both conditions? SELECT * FROM employees WHERE salary > 50000 AND department_id = 10;
*A: 5.0
Feedback: Correct! 5 rows match the criteria specified in the SQL query.
Default Feedback: Incorrect. Please revisit the SQL query and consider the criteria specified for filtering
the rows.
What file format is commonly used for storing tabular data and can be read into a Pandas DataFrame?
Please answer in all lowercase.
*A: csv
Feedback: Correct! CSV is a common file format used for storing tabular data.
B: tsv
Feedback: Incorrect. While TSV is also used for tabular data, it is less common than CSV.
Default Feedback: Incorrect. Please review the common file formats for storing tabular data.
What SQL keyword is used to return only distinct values in a query result? Please answer in all lowercase.
*A: distinct
Feedback: Correct! The DISTINCT keyword is used to return only distinct (different) values.
Default Feedback: Incorrect. Please review the SQL keywords that help in filtering records.
If a table has 1000 rows and you execute a query that selects 10% of the rows, how many rows will be
returned?
*A: 100.0
Default Feedback: Incorrect. Please check your calculation of the percentage of rows.
How many documents will be returned if a MongoDB query matches 15 documents in a collection?
*A: 15.0
Default Feedback: Incorrect. Revise the MongoDB query syntax and ensure you understand the filtering
process.
Which of the following operations can be performed using the Pandas library?
*A: Reading data from a CSV file
Feedback: Correct! You can read data from a CSV file using Pandas.
B: Building a Docker container
Feedback: Incorrect. Building a Docker container is not an operation performed using Pandas.
Which of the following are valid MongoDB aggregation stages? Select all that apply.
*A: $match
Feedback: Correct! The $match stage filters documents to pass only those that match the specified
condition(s).
B: $sum
Feedback: Incorrect. $sum is an accumulator used inside the $group stage, not a standalone pipeline stage.
*C: $group
Feedback: Correct! The $group stage groups input documents by a specified identifier expression and
applies the accumulator expressions.
*D: $sort
Feedback: Correct! The $sort stage sorts all input documents and returns them in the specified order.
E: $filter
Feedback: Incorrect. $filter is an array expression operator, not an aggregation pipeline stage.
What is the Docker command used to pull an image from a registry? Please answer in all lowercase.
*A: docker pull
Feedback: Correct! docker pull is the command used to pull a Docker image from a registry.
B: dockerpull
C: pull docker
Feedback: Incorrect. The command should start with docker followed by pull.
Default Feedback: Incorrect. Make sure you are using the correct Docker command to pull an image.
If you have a DataFrame df with 500 rows, how many rows will be displayed when you use the
command df.head(10)?
*A: 10.0
Feedback: Correct! The df.head(10) command displays the first 10 rows of the DataFrame.
Default Feedback: Incorrect. Review how the head method in Pandas determines the number of rows displayed.
Which of the following operations can be performed using the Pandas library in Python?
*A: Merging DataFrames in a way similar to SQL joins
Feedback: Correct! Pandas merge operations combine DataFrames much like SQL joins.
B: Executing SQL operations directly on a database
Feedback: Incorrect. While Pandas has merge operations similar to SQL joins, actual SQL operations are distinct from Pandas capabilities.
What is the Pandas function used to read a CSV file into a DataFrame?
*A: read_csv()
Feedback: Correct! The read_csv() function is specifically designed for reading CSV files into
DataFrames.
B: read_excel()
Feedback: Not quite. The read_excel() function is used for reading Excel files, not CSV files.
C: read_json()
Feedback: Incorrect. The read_json() function is used for reading JSON formatted data, not CSV files.
D: read_html()
Feedback: Incorrect. The read_html() function is used for reading HTML tables, not CSV files.
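A one-step sketch of read_csv; the file name is a placeholder:

    import pandas as pd

    # Parse a CSV file into a DataFrame; "data.csv" is illustrative
    df = pd.read_csv("data.csv")
    print(df.head())  # first 5 rows by default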
Which MongoDB command is used to find documents with specific field values?
*A: find()
Feedback: Correct! The find() command is used to retrieve documents that match specific criteria.
B: search()
Feedback: Incorrect. search() is not a MongoDB query method.
C: query()
Feedback: Incorrect. While querying is the process, query() is not a MongoDB command for finding
documents.
D: filter()
Feedback: Incorrect. Filtering is part of querying, but filter() is not a MongoDB command.
Which MongoDB operator is used to select documents that match the values of multiple fields?
*A: $and
Feedback: Correct! The $and operator is used to match documents that fulfill all specified conditions.
B: $or
Feedback: Not quite. The $or operator is used to match documents that satisfy at least one of the
specified conditions.
C: $nor
Feedback: Incorrect. The $nor operator is the opposite of $or. It selects documents that fail all the
specified conditions.
D: $not
Feedback: No, the $not operator is used to invert the effect of a query expression.
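A small pymongo sketch using $and, with a local server assumed and illustrative field names. Note that listing both fields in one filter document already gives an implicit AND; the explicit $and form shown here is mainly needed when the same field appears in multiple conditions:

    from pymongo import MongoClient

    collection = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    # Match documents satisfying both conditions
    query = {"$and": [{"status": "shipped"}, {"total": {"$gt": 100}}]}
    for doc in collection.find(query):
        print(doc)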
Which MongoDB function retrieves distinct values from a specified field across a single collection?
*A: distinct
Feedback: Correct! The distinct function retrieves unique values from a specified field.
B: aggregate
Feedback: Not quite. The aggregate function is used to process data records and return computed results.
C: find
Feedback: Incorrect. The find function is used for querying documents in a collection.
D: mapReduce
Feedback: No, the mapReduce function is used for processing and aggregating collections.
What command would you use to count the number of documents in a MongoDB collection?
*A: db.collection.countDocuments()
Feedback: Correct! This command is used to count the number of documents in a MongoDB collection.
B: db.collection.find().count()
Feedback: Incorrect. While this was previously used, the correct method now is countDocuments().
C: db.collection.totalDocuments()
Feedback: Incorrect. totalDocuments() is not a valid MongoDB method.
D: db.collection.countFiles()
Feedback: Incorrect. This is not a valid MongoDB command for counting documents.
In MongoDB, which operator would you use to combine multiple conditions with a logical AND within
a query?
*A: $and
Feedback: Correct! The $and operator is used to combine multiple conditions with a logical AND.
B: $or
Feedback: The $or operator is used for combining conditions with a logical OR, not AND.
C: $set
Feedback: $set is used for updating fields in a document, not for combining conditions.
D: $match
Feedback: $match is used to filter documents from the collection, not for combining conditions with
AND.
Which of the following are valid MongoDB aggregation framework stages used to transform
documents?
*A: $group
Feedback: Correct! $group groups input documents by a specified expression and applies accumulator expressions.
*B: $lookup
Feedback: Correct! $lookup is used for performing a left outer join to another collection in the same
database.
C: $concatArrays
Feedback: Incorrect. $concatArrays is an array expression operator, not a pipeline stage.
*D: $sort
Feedback: Correct! $sort orders the documents flowing through the pipeline.
E: $filter
Feedback: Incorrect. $filter is an array expression operator, not a pipeline stage.
*F: $limit
Feedback: Correct! $limit is used to restrict the number of documents in the aggregation pipeline.
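A pymongo sketch combining several of the stages marked correct above into one pipeline; the collection and field names are invented, and a local server is assumed:

    from pymongo import MongoClient

    orders = MongoClient("mongodb://localhost:27017")["shop"]["orders"]

    pipeline = [
        {"$match": {"status": "shipped"}},          # filter documents
        {"$group": {"_id": "$customer",             # group by customer
                    "total": {"$sum": "$amount"}}}, # $sum as an accumulator
        {"$sort": {"total": -1}},                   # order by total, descending
        {"$limit": 5},                              # keep the top five
    ]
    for doc in orders.aggregate(pipeline):
        print(doc)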
If a MongoDB collection contains 1000 documents and a query with a specific condition returns 250
documents, what percentage of the collection does this represent?
*A: 25.0
Feedback: Correct! The query returns 25% of the documents in the collection.
Default Feedback: Consider the ratio of documents returned by the query to the total number of
documents in the collection.
What is the term for a Docker image that is used to create a running instance? Please answer in all
lowercase.
*A: container
*B: containers
Default Feedback: Remember that a Docker image becomes a specific term when it is run.
Which of the following best describes sliding windows in stream processing?
*A: They allow for continuous processing of streaming data over a specific interval.
Feedback: Correct! Sliding windows process data in overlapping intervals, enabling continuous data
analysis.
B: They process data only once at the end of the window interval.
Feedback: Not quite. Sliding windows process data continuously, not just once at the end.
C: They are only applicable to batch data.
Feedback: Incorrect. Sliding windows are specifically designed for streaming data.
D: They are primarily used for storing data.
Feedback: No, sliding windows are not used for data storage but for continuous data processing.
When working with sensor data using Spark streaming, which of the following is a common operation?
*A: Applying windowed computations
Feedback: Correct! Applying windowed computations is a common operation when working with sensor data using Spark streaming.
B: Ignoring duplicate data
Feedback: Incorrect. Ignoring duplicate data is not a common operation in this context. Review the lecture on sensor data processing.
C: Joining tables
Feedback: Incorrect. While table joins are useful, they are not a primary operation in sensor data processing using Spark streaming.
D: Converting data types
Feedback: Incorrect. Converting data types is not specific to sensor data processing in Spark streaming. Check the relevant materials on this topic.
Which of the following Spark transformations is a narrow transformation that does not require shuffling data across nodes?
*A: map
B: reduceByKey
C: groupByKey
D: join
Select all of the transformations in Spark that are considered wide transformations.
*A: groupByKey
Feedback: Correct! groupByKey is a wide transformation because it requires data shuffling across the
nodes.
*B: reduceByKey
Feedback: Correct! reduceByKey also requires shuffling of data, making it a wide transformation.
C: filter
Feedback: Incorrect. filter is a narrow transformation because each partition can be processed independently.
D: map
Feedback: Incorrect. map is considered a narrow transformation as it does not require data shuffling.
*E: sortBy
Feedback: Correct! sortBy is a wide transformation because it can involve shuffling of data across the
partitions.
Which of the following can serve as sources of streaming data for Spark? Select all that apply.
*A: Kafka
Feedback: Correct! Kafka is a widely used source of streaming data for Spark.
*B: Flume
Feedback: Correct! Flume is a supported source of streaming data for Spark.
C: AWS Lambda
Feedback: Incorrect. AWS Lambda is a serverless compute service, not a streaming data source for
Spark.
*D: HDFS
Feedback: Correct! HDFS can also be used as a source of streaming data in Spark.
E: SQLite
Feedback: Incorrect. SQLite is a database engine, not a streaming data source for Spark.
Which of the following storage systems can Spark use as sources of streaming data? Select all that apply.
*A: Google Cloud Storage
Feedback: Correct! Google Cloud Storage is supported by Spark for streaming data.
*B: Amazon S3
Feedback: Correct! Amazon S3 is supported by Spark for streaming data.
D: Microsoft Excel
Feedback: Incorrect. Microsoft Excel is not a source of streaming data supported by Spark.
E: PostgreSQL
Feedback: Incorrect. PostgreSQL is a relational database, typically accessed in batch through JDBC rather than as a streaming source.
What method is used to access Postgres database tables with SparkSQL? Please answer in all lowercase.
*A: jdbc
Feedback: Correct! The jdbc method is used to access Postgres database tables with SparkSQL.
Default Feedback: Incorrect. Please review the methods used to access Postgres database tables with
SparkSQL.
What is the range of possible values if the result of a computation over a sliding window is expected to
be between 10 and 20 (inclusive)?
Feedback: Correct! The range of possible values is between 10 and 20, inclusive.
Default Feedback: Incorrect. Please review how computations over sliding windows work.
What is the name of the machine learning library in Spark? Please answer in all lowercase.
*A: mllib
*B: sparkmllib
Default Feedback: Incorrect. Remember to review the name of Spark's machine learning library.
If a sliding window computation on sensor data results in values between 5 and 15 (inclusive), what is
the range of possible values?
Default Feedback: Incorrect. Please review the concepts of sliding window computations.
What is the fundamental programming model for Spark? Please answer in all lowercase.
*A: rdd
Feedback: Correct! The Resilient Distributed Dataset (RDD) is the fundamental programming model for Spark.
*B: resilientdistributeddataset
Feedback: Correct! The Resilient Distributed Dataset (RDD) is the fundamental programming model for Spark.
Default Feedback: Incorrect. Please review the fundamental programming model for Spark.
If a Spark job creates an RDD and splits it into 4 partitions, and then a coalesce transformation is applied
with a value of 2, how many partitions will the resulting RDD have?
*A: 2.0
Feedback: Correct! The coalesce transformation reduces the number of partitions in the RDD to the
specified number.
Default Feedback: Incorrect. Please review how the coalesce transformation works in Spark.
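A PySpark sketch of coalesce reducing partitions, assuming a local Spark installation; the data is illustrative:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "coalesce-demo")
    rdd = sc.parallelize(range(100), 4)   # start with 4 partitions

    smaller = rdd.coalesce(2)             # merge down to 2 partitions
    print(rdd.getNumPartitions(), smaller.getNumPartitions())  # 4 2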
Which of the following operations can be performed on a Spark DataFrame when working with streaming data?
*A: Filtering rows
Feedback: Correct! Filtering rows is a common operation that can be performed on a Spark DataFrame when working with streaming data.
*B: Joining with static DataFrames
Feedback: Correct! Joining streaming DataFrames with static DataFrames is a supported operation in Spark.
C: Directly modifying the source data
Feedback: Incorrect. Directly modifying source data is not an operation performed on Spark DataFrames when working with streaming data.
What is the term for the concept in Spark where RDDs cannot be modified after their creation? Please
answer in all lowercase.
*A: immutability
Feedback: Correct! Immutability refers to the concept where RDDs cannot be modified after their
creation.
*B: immutable
Feedback: Correct! Immutability refers to the concept where RDDs cannot be modified after their
creation.
Default Feedback: Remember to review the concept of RDD immutability in Spark, where RDDs cannot
be modified after their creation.
What is the name of Spark's component designed for graph processing? Please answer in all lowercase.
*A: graphx
Default Feedback: Incorrect. Please review the components of Spark and try again.
What is the term used to describe a continuous sequence of data elements in Spark? Please answer in all
lowercase.
*A: stream
*B: streaming
Feedback: Correct! The continuous sequence of data elements can also be described with the term
streaming.
Default Feedback: Incorrect. Please review the concept of continuous data processing in Spark.
Which of the following statements best describes how to filter rows and columns of a Spark DataFrame?
*A: Use the filter() function to select specific rows and the select() function to choose columns.
Feedback: Correct! The filter() function is used for row selection while the select() function is used for
column selection in Spark DataFrames.
B: Use the groupBy() function to filter rows and the orderBy() function to select columns.
Feedback: Almost! groupBy() and orderBy() are used for grouping and ordering, not filtering.
C: Use the drop() function to filter out unwanted rows and the aggregate() function for columns.
Feedback: Not quite. The drop() function is used to remove columns, not filter rows, and aggregate()
isn't used for column selection.
D: Use the join() function to filter rows and the distinct() function to filter columns.
Feedback: Incorrect. join() is for combining DataFrames and distinct() is for removing duplicate rows,
not specifically for filtering.
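A short sketch of filter() for rows and select() for columns on a Spark DataFrame; the column names and data are invented:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("filter-select-demo").getOrCreate()
    df = spark.createDataFrame(
        [("a", 10), ("b", 60), ("c", 75)], ["name", "score"])

    # filter() picks rows; select() picks columns
    df.filter(df.score > 50).select("name").show()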
Which function in Spark SQL allows you to access Postgres database tables using a specified JDBC
URL?
*A: jdbc
Feedback: "jdbc" is the correct function used to access Postgres database tables via a JDBC URL in
Spark SQL. Well done!
B: readTable
Feedback: "readTable" is not a valid function in Spark SQL for accessing Postgres tables. You might be
confusing it with other data access methods.
C: loadTable
Feedback: "loadTable" is not the correct function for accessing Postgres tables in Spark SQL. Consider
revisiting the function names.
D: getConnection
Feedback: "getConnection" is used differently and not specifically in accessing Postgres tables in Spark
SQL. Try focusing on JDBC-specific functions.
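A hedged sketch of reading a Postgres table through the jdbc method in Spark SQL; the URL, table name, and credentials below are placeholders, and the Postgres JDBC driver must be available on the Spark classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-demo").getOrCreate()

    # All connection details here are illustrative placeholders
    df = spark.read.jdbc(
        url="jdbc:postgresql://localhost:5432/mydb",
        table="sales",
        properties={"user": "postgres", "password": "secret",
                    "driver": "org.postgresql.Driver"})
    df.show()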
Which Spark transformation is used to apply a function to each element of an RDD and flatten the
results into a new RDD?
*A: FlatMap
Feedback: Correct! FlatMap applies a function to each element and flattens the result.
B: Map
Feedback: Incorrect. Map applies a function to each element but does not flatten the results.
C: Filter
Feedback: Incorrect. Filter selects elements based on a condition but does not apply a function to each
element.
D: Reduce
Feedback: Incorrect. Reduce aggregates elements using a binary operator but does not apply a function
to each element.
Which Spark transformation is used to reduce the number of partitions in an RDD?
*A: Coalesce
Feedback: Correct! Coalesce reduces the number of partitions, avoiding a full shuffle where possible.
B: Map
Feedback: Incorrect. Map is used for element-wise transformation and does not change the number of
partitions.
C: Filter
Feedback: Incorrect. Filter selects elements but does not alter the number of partitions.
D: FlatMap
Feedback: Incorrect. FlatMap can increase the number of elements but not decrease partitions.
What is the primary purpose of sliding windows in Spark streaming?
*A: They break continuous data streams into manageable time-based batches.
Feedback: Correct! Sliding windows allow Spark to handle continuous data streams by breaking them into manageable time-based batches.
B: They compress the data stream to save space.
Feedback: Not quite. While data compression is important, sliding windows are focused on processing time-based data streams.
C: They encrypt the data stream.
Feedback: This is not correct. Sliding windows do not deal with data encryption or security features.
What is one of the benefits of using Spark SQL for data processing?
*A: It integrates seamlessly with Hadoop.
Feedback: Correct! Spark SQL is designed to integrate seamlessly with Hadoop, enhancing its processing capabilities.
B: It introduces an entirely new query language.
Feedback: Not exactly. Spark SQL uses SQL-like syntax but isn't a new language.
C: It replaces data storage systems.
Feedback: Incorrect. Spark SQL does not replace data storage systems.
D: It eliminates the need for data cleaning.
Feedback: This is not correct. While Spark SQL aids in querying, data cleaning still requires attention.
When working with sensor data using Spark streaming, which of the following operations can be applied to the data?
*A: Apply sliding window computations
Feedback: Correct! Sliding window computations are often used in streaming data analysis to process recent data efficiently.
B: Use collect() to gather the entire stream
Feedback: Incorrect. collect() is not typically used in streaming due to memory constraints and continuous data flow.
*C: Filter the data in real time with filter()
Feedback: Correct! You can filter sensor data in real-time using Spark's filter() transformation.
*D: Aggregate the stream with reduceByKey()
Feedback: Correct! Aggregating data streams using transformations like reduceByKey() is common in stream processing.
E: Use cache() to store the entire data stream.
Feedback: Incorrect. Caching is not practical for entire streams due to continuous data flow and size limitations.
When creating computations over a sliding window of streaming data in Spark, which of the following
operations can be utilized?
*A: Aggregation
Feedback: Correct! Aggregations such as counts and averages are commonly computed over sliding windows.
B: Joining static data
Feedback: Joining static data can be performed but is not specific to sliding window computations in streaming data.
D: Checkpointing
Feedback: Checkpointing is a mechanism for fault tolerance in streaming but not specific to sliding
window computations.
E: Filtering
Feedback: Filtering can be used in data preparation but doesn't specifically pertain to sliding window
computations.
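One way to sketch a sliding-window aggregation with Spark Structured Streaming. The built-in rate source stands in for real sensor data because it already carries an event-time column; the window and slide durations are illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window

    spark = SparkSession.builder.appName("window-demo").getOrCreate()

    # The rate source emits (timestamp, value) rows; it is a stand-in
    # for a real sensor stream here
    events = (spark.readStream.format("rate")
              .option("rowsPerSecond", 10).load())

    # Aggregate over a 10-second window that slides every 5 seconds
    counts = events.groupBy(
        window(events.timestamp, "10 seconds", "5 seconds")).count()

    query = (counts.writeStream.outputMode("complete")
             .format("console").start())
    query.awaitTermination(30)  # run for about 30 seconds, then return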
Select the steps involved in a typical Spark pipeline ending with a collect action.
*A: Create an RDD from a data source
Feedback: This is correct. The first step in any Spark pipeline is to create an RDD from a data source.
*B: Apply one or more transformations
*C: Call the collect action to return results
Feedback: Correct! Actions trigger Spark to execute the transformations and return results.
D: Start a new Spark session before each transformation
Feedback: Incorrect. A Spark session is typically started once at the beginning of the pipeline.
E: Convert the RDD to a DataFrame
Feedback: Incorrect. Conversion to a DataFrame is optional and not part of every pipeline.
Which of the following does Spark's GraphX use to represent a graph? Select all that apply.
*A: Vertices
*B: Edges
C: Layers
D: Nodes
Feedback: Incorrect. GraphX uses vertices, not nodes, to denote points in a graph.
E: Frames
If a coalesce transformation with a value of 1 is applied to an RDD, how many partitions will the resulting RDD have?
*A: 1.0
Default Feedback: Incorrect. Consider how coalesce affects the number of partitions.
What is the term for a data structure in Spark that is distributed and immutable? Please answer in all
lowercase.
*A: rdd
Feedback: Correct! RDD stands for Resilient Distributed Dataset, which is immutable in Spark.
*B: rdds
Feedback: Correct! RDDs, or Resilient Distributed Datasets, are immutable data structures in Spark.
Default Feedback: Incorrect. Remember that this data structure is distributed and immutable in Spark.
What is the name of Apache Spark's machine learning library? Please answer in all lowercase.
*A: mllib
Feedback: Correct! MLlib is the designated machine learning library for Spark.
*B: ml-lib
Feedback: Correct! While this variant isn't standard, it's accepted here. MLlib is the actual name.
Which of the following techniques are involved in integrating data from multiple sources in a medical
setting?
*A: Schema mapping
Feedback: Correct! Schema mapping is essential for aligning data structures from different sources.
*B: Data fusion
Feedback: Correct! Data fusion helps in combining data from various sources to create a unified view.
C: Data anonymization
Feedback: Incorrect. While important, data anonymization is not a technique for integrating data from
multiple sources.
*D: Data warehousing
Feedback: Correct! Data warehousing involves collecting and managing data from varied sources in a centralized repository.
E: Predictive analytics
Feedback: Incorrect. Predictive analytics is used for forecasting and trends, not for data integration.
Which of the following are features of Splunk? Select all that apply.
*A: Data visualization tools
Feedback: Correct! Splunk includes data visualization tools to represent data in various formats.
B: Extensive manual configuration
Feedback: Incorrect. Splunk is designed to minimize manual configuration through its automated features.
*C: Machine learning integration
Feedback: Correct! Splunk integrates with machine learning for advanced data analysis.
D: Cloud-only deployment
Feedback: Incorrect. Splunk can be deployed both on-premises and in the cloud.
What is the name of the software platform that collects, indexes, and analyzes machine data developed
by Splunk? Please answer in all lowercase.
*A: splunk
Feedback: Correct! Splunk is the software platform developed to handle machine data.
Default Feedback: Incorrect. The platform developed by Splunk to handle machine data is called
Splunk.
In a public health system, what percentage range of data accuracy is generally required for effective disease information exchange?
*A: 95-99
Feedback: Correct! High data accuracy, typically in the range of 95-99%, is crucial for effective disease information exchange.
Default Feedback: Incorrect. Data accuracy is crucial for effective disease information exchange. Please
review the required accuracy levels.
What is the default port number for accessing Splunk Web? Please answer in all lowercase.
*A: 8000
Feedback: Correct! The default port number for accessing Splunk Web is 8000.
B: 8001
C: 8080
D: 8081
Default Feedback: Incorrect. Please refer to the Splunk documentation for the correct default port
number.
What is the name of the platform that provides customizable and modular car data? Please answer in all lowercase.
*A: openxc
Feedback: Correct! Open XC is the platform that provides customizable and modular car data.
B: open xc
C: open-xc
D: open_xc
Default Feedback: Incorrect. Please review the lesson material on Open XC platforms.
What is Data Fusion in the context of customer analytics?
*A: The process of combining data from multiple sources to provide a comprehensive view of customer information.
Feedback: Correct! Data Fusion involves combining data from various sources to get a complete view of the customer.
B: The process of encrypting customer data
Feedback: Incorrect. Data Fusion is about combining and integrating data, not encrypting it.
C: A method for storing customer data
Feedback: Incorrect. Data Fusion is not about data storage but about integrating and analyzing data.
D: A technique for filtering customer data
Feedback: Incorrect. Data Fusion aims to combine data from multiple sources, not to filter it.
Which of the following are challenges involved in integrating data from multiple sources in a medical
setting?
*A: Data variety
Feedback: Correct! Data variety is a significant challenge in integrating data from multiple sources.
B: Data velocity
Feedback: Incorrect. While data velocity can be a challenge in some contexts, it is not typically a
primary concern in medical data integration.
*C: Schema mapping
Feedback: Correct! Schema mapping is essential for integrating data from different sources.
D: Data encryption
Feedback: Incorrect. Data encryption is related to data security, not specifically to integrating data from
multiple sources.
Which features are offered by both Splunk and Datameer? Select all that apply.
*A: Data visualization
Feedback: Correct! Both Splunk and Datameer offer data visualization capabilities.
*B: Real-time data analysis
Feedback: Correct! Real-time data analysis is a key feature of both Splunk and Datameer.
C: Data entry automation
Feedback: Incorrect. Data entry automation is not a primary feature of either Splunk or Datameer.
*D: Machine learning integration
Feedback: Correct! Both Splunk and Datameer support machine learning integration.
Which of the following actions can you perform in Splunk to manage your data?
*A: Query data
Feedback: Correct! Querying data is a fundamental action you can perform in Splunk to manage your data.
*B: Filter data
Feedback: Correct! Filtering data is essential for managing and refining the data you work with in Splunk.
C: Install a new operating system
Feedback: Incorrect. Installing a new operating system is not related to managing data in Splunk.
*D: Plot data
Feedback: Correct! Plotting data is a crucial step in visualizing and analyzing data in Splunk.
E: Create user accounts
Feedback: Incorrect. While creating user accounts is an administrative task in Splunk, it is not directly related to managing data.
What Splunk web interface component allows you to create a visual representation of your data? Please
answer in all lowercase.
*A: pivot
Feedback: Correct! The Pivot component allows you to create visual representations of your data in
Splunk.
Default Feedback: Incorrect. Please review the components of the Splunk web interface that allow for
data visualization.
What is the first step in constructing a report in Splunk?
*A: Access the Pivot interface and select a prebuilt data model.
Feedback: Correct! Accessing the Pivot interface is the initial step to start creating a report in Splunk.
B: Create a bar chart.
Feedback: Not quite. Before you create a bar chart, you need to access the Pivot interface and select a data model.
C: Segment the data.
Feedback: Segmenting data comes after you've accessed the Pivot interface and selected a data model.
D: Perform statistical calculations.
Feedback: Performing statistical calculations is part of data analysis, but not the first step in constructing a report.
Which key component is essential for the personalized music recommendation project?
Feedback: This isn't correct. Think about how automation influences personalized recommendations.
Feedback: Incorrect. Reflect on the dynamic nature of data used in personalized recommendations.
Which of the following describes how Splunk software collects machine data?
*A: Splunk uses input methods to collect data from various sources including logs and metrics.
Feedback: Great job! Splunk indeed collects data via diverse input methods to handle logs and metrics.
B: Splunk requires all data to be entered manually.
Feedback: This isn't correct. Think about how automation plays a role in data collection for big data tools like Splunk.
C: Splunk depends on third-party tools to collect data.
Feedback: This is not correct. Consider the independence of big data tools in their data collection processes.
D: Splunk is designed to work only with structured data collected from databases.
Feedback: Not quite. Remember the flexibility and scope of data types that Splunk can handle.
What is the primary importance of information integration in creating a unified view of customer data in financial services?
*A: It provides a single, accessible view of customer data across the organization.
Feedback: Yes, information integration is crucial for a unified customer view, especially in financial services.
B: Increasing data storage capacity
Feedback: Not quite. Information integration is more about data usage and less about storage capacity.
C: Improving network speed
Feedback: Incorrect. Network speed is not the focus of information integration.
D: Enhancing data security
Feedback: Incorrect. While security is important, information integration focuses on unification and accessibility.
What is the primary function of Data Fusion in the context of customer analytics?
*A: Integrating data from various sources to provide a comprehensive view
Feedback: This is correct. Data Fusion involves integrating data from various sources to provide a comprehensive view, especially useful in customer analytics.
B: Storing data in a central repository
Feedback: This option is incorrect. Data Fusion is more about integrating data from different sources rather than just storing it centrally.
C: Cleaning the data
Feedback: This is not the main focus of Data Fusion. While data cleaning might be involved, fusion emphasizes integration from diverse sources.
D: Visualizing the data
Feedback: Data Fusion is primarily about integration, not visualization, though the latter may utilize fused data.
Which of the following tasks are necessary to successfully install Splunk on a Windows server?
A: Import CSV files
Feedback: Importing CSV files is related to data management post-installation, not during the installation process.
*B: Customize the installation options
Feedback: Correct! Customizing the installation is often necessary to meet specific system requirements.
C: Evaluate the data model
Feedback: Evaluating the data model is a separate task related to data analysis, not installation.
Which of the following are crucial for handling data variety and model transformation in exchanging disease information in public health systems?
*A: Data model transformation techniques
Feedback: Correct! Public health systems deal with varied data formats which require transformation.
B: Real-time data streaming
Feedback: Incorrect. Although important, real-time streaming is not the focus of data integration techniques for model transformation.
C: Automated data cleaning
Feedback: Incorrect. Automated cleaning is useful but not a primary concern in model transformation for public health data.
E: Data anonymization
Feedback: Incorrect. While anonymization is important, it's not directly related to data variety and
model transformation.
At least how many different data sources are typically involved in a complex medical data integration
scenario?
*A: 5.0
Feedback: Typically, integrating data from multiple sources involves at least this number.
Default Feedback: Think about the complexity and variety of sources in a medical setting.
How many steps are typically involved in the process of installing Splunk on a Windows server?
*A: 5.0
Feedback: Correct! Installing Splunk on a Windows server typically involves five main steps.
Default Feedback: Consider reviewing the installation guide for Splunk on Windows.
What is the term used to describe the vehicle data platform developed by Open XC? Please answer in all
lowercase.
*A: openxc
Feedback: Correct! OpenXC is the platform for creating customizable car data solutions.
Default Feedback: Remember the specific platform name associated with vehicle data in the lesson.
What Splunk interface allows users to create reports using a prebuilt data model? Please answer in all
lowercase.
*A: pivot
Feedback: Correct! The Pivot interface in Splunk is used for this purpose.
*B: pivotinterface
Feedback: Correct! The Pivot interface in Splunk is used for this purpose.
Default Feedback: Revisit the section on Splunk interfaces for creating reports.
Identify the range of years during which significant development in personalized music recommendation
projects using DataMirror, Hadoop, and Spark occurred.
Feedback: This isn't quite right. Consider when these technologies became prominent in data processing.
Default Feedback: Reflect on the timeline of when these technologies gained prominence in the
industry.
Which big data processing engine performs in-memory processing using Resilient Distributed Datasets
(RDDs)?
A: Hadoop MapReduce
Feedback: Incorrect. Hadoop MapReduce does not perform in-memory processing using RDDs.
B: Flink
Feedback: Incorrect. Flink is not the engine that performs in-memory processing using RDDs.
*C: Spark
Feedback: Correct! Spark is the engine that performs in-memory processing using RDDs.
D: Storm
Feedback: Incorrect. Storm does not perform in-memory processing using RDDs.
What does the term 'dataflow' refer to in the context of big data pipelines?
A: A sequence of manual tasks carried out by analysts
Feedback: This choice describes a workflow, not dataflow. Review the material on dataflow to understand its role.
*B: It refers to the movement and transformation of data through a series of processing steps.
Feedback: Correct! Dataflow involves the movement and transformation of data through different stages in a pipeline.
C: The process of collecting data from different sources
Feedback: This choice is about data collection, not dataflow. Make sure you understand the distinct stages of data handling.
D: Storing data in a structured format
Feedback: Storing data in a structured format relates to data storage, not dataflow. Review how data flows and transforms in a pipeline.
Which of the following are components of the Spark stack? Select all that apply.
B: HDFS
Feedback: Incorrect. HDFS is part of the Hadoop ecosystem, not the Spark stack.
D: Storm
Feedback: Incorrect. Storm is a separate big data processing engine, not part of the Spark stack.
*E: MLlib
Feedback: Correct! MLlib is a machine learning library that is part of the Spark stack.
Which of the following are common analytical operations within big data pipelines?
*A: Filtering
B: Data replication
Feedback: Data replication is more of a data management technique than an analytical operation.
*C: Aggregation
*D: Sorting
E: Compression
Which of the following actions are involved in performing WordCount with Spark Python (PySpark)?
*A: Reading a text file into an RDD
Feedback: Correct! Reading a text file into an RDD is a crucial step in performing WordCount.
*B: Initiating a Spark context
Feedback: Correct! Initiating a Spark context is necessary before performing any operations.
C: Connecting to a SQL database
Feedback: Incorrect. Connecting to a SQL database is not related to performing WordCount with PySpark.
D: Starting a Hadoop cluster
Feedback: Incorrect. Starting a Hadoop cluster is not needed for WordCount with PySpark.
Which of the following are common data transformation processes in big data pipelines? Select all that apply.
*A: Filtering
*B: Sorting
C: Replication
Feedback: Replication is not typically a data transformation process in pipelines. Review the common
transformations used.
*D: Aggregation
E: Compression
Feedback: Compression is more about data storage and transmission efficiency rather than a
transformation. Revisit the transformation processes.
Which of the following steps are necessary to perform WordCount using PySpark in JupyterLab?
*A: Read the text file into an RDD
Feedback: Correct! Reading the text file into an RDD is one of the first steps.
*B: Create a SparkConf object
C: Use map() to split lines into words
Feedback: This is incorrect. While map() can be used, it is not specifically for splitting lines into words; flatMap() is the usual choice.
*D: Use reduceByKey() to count word occurrences
Feedback: Correct! Using reduceByKey() helps to count the occurrences of each word.
E: Save the results to a CSV file
Feedback: This is incorrect. While saving results is important, a text file is typically used rather than a CSV file.
What is the main data structure used by Spark for in-memory computation? Please answer in all
lowercase.
*A: rdd
Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.
*B: rdds
Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.
*C: resilientdistributeddataset
Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.
Default Feedback: Incorrect. Please review the section on Spark's in-memory computation data
structures.
What PySpark function is commonly used to count the occurrences of each word in WordCount? Please answer in all lowercase.
*A: reducebykey
Feedback: Correct! The reduceByKey function is commonly used for this purpose.
*B: reduce_by_key
Feedback: Correct! The reduce_by_key function is commonly used for this purpose.
Default Feedback: Incorrect. Try reviewing the PySpark documentation on counting word occurrences.
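The classic WordCount sketch using flatMap, map, and reduceByKey; the input file path is a placeholder, and a local Spark installation is assumed:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "wordcount")

    counts = (sc.textFile("input.txt")              # read lines into an RDD
              .flatMap(lambda line: line.split())   # split lines into words
              .map(lambda word: (word, 1))          # pair each word with 1
              .reduceByKey(lambda a, b: a + b))     # sum counts per word

    print(counts.take(10))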
If you have a text file with 5000 words, and perform WordCount using PySpark, assuming 10% of the
words are unique, how many unique key-value pairs would you expect?
*A: 500.0
Feedback: Correct! With 10% of the words being unique out of 5000 words, you would expect 500
unique key-value pairs.
Default Feedback: Incorrect. Consider the percentage of unique words in the total word count.
If an initial dataset contains 50,000 entries and aggregation reduces it to 5,000 entries, what is the
reduction factor?
*A: 10.0
Default Feedback: Remember to divide the original number of entries by the reduced number of entries
to find the reduction factor.
What is the term for summarizing data to make it more manageable? Please answer in all lowercase.
*A: aggregation
B: compaction
Feedback: Compaction is not the term we are looking for. Remember the specific term for summarizing
data.
Default Feedback: Review the term that best describes summarizing data to make it more manageable.
How many major components sit on top of Spark Core in the Spark stack?
*A: 4.0
Default Feedback: Incorrect. Please review the components of the Spark stack.
In the context of PySpark, what is the term used for the distributed collection of data? Please answer in
all lowercase.
*A: rdd
Feedback: Correct! RDD stands for Resilient Distributed Dataset and is a fundamental concept in
PySpark.
*B: resilientdistributeddataset
Default Feedback: Incorrect. Please review the concept of distributed data collections in PySpark.
If a dataset is reduced from 10,000 entries to 1,000 entries through aggregation, what is the reduction
factor?
*A: 10.0
Default Feedback: Incorrect. Remember, the reduction factor is the ratio of the original size to the
reduced size.
If a text file contains 2000 words and you perform WordCount using PySpark, how many key-value
pairs would you expect to have in the result assuming all words are unique?
*A: 2000.0
Feedback: Correct! If all words are unique, you would have 2000 key-value pairs.
Default Feedback: Incorrect. Remember that each unique word in the text file will produce a unique
key-value pair.
Which of the following aggregation operations is best suited for counting the number of unique elements
in a dataset?
A: Count
Feedback: Count operation can be used to count the total number of elements, but not unique elements.
B: Sum
Feedback: Sum operation aggregates the total value, but doesn't count unique elements.
*C: Distinct Count
Feedback: Correct! Distinct Count operation is used to count the number of unique elements in a dataset.
D: Average
Feedback: Average operation calculates the mean of values, but doesn't count unique elements.
Which of the following are common analytical operations in big data pipelines? Select all that apply.
*A: Filtering
*B: Joining
Feedback: Correct! Joining is also a common analytical operation in big data pipelines.
C: Replication
*D: Aggregation
E: Load Balancing
Select the steps involved in performing WordCount with Spark Python (PySpark).
*A: Initiate a Spark context
Feedback: Correct! Initiating a Spark context is the first step in performing WordCount with PySpark.
B: Load the data into a DataFrame
Feedback: Incorrect. WordCount with PySpark typically involves loading data into an RDD, not a DataFrame.
*C: Read a text file into an RDD
Feedback: Correct! Reading a text file into an RDD is a crucial step in WordCount with PySpark.
*D: Use the map and reduceByKey functions
Feedback: Correct! The map and reduceByKey functions are used in WordCount with PySpark to
process the data.
E: Save the results to a database
Feedback: Incorrect. In WordCount with PySpark, the results are usually saved to a text file, not a database.
What is the main motivation behind the development of Apache Spark? Please answer in all lowercase.
*A: speed
Feedback: Correct! Speed is one of the main motivations behind the development of Apache Spark.
*B: efficiency
Feedback: Correct! Efficiency is one of the main motivations behind the development of Apache Spark.
Default Feedback: Incorrect. Please review the main motivations behind the development of Apache
Spark.
What is the primary purpose of a Spark context?
*A: To connect to a cluster and execute operations
Feedback: Correct! The Spark context is used to connect to a cluster and execute operations.
B: To read and write HDFS files
Feedback: Not quite. While Spark can interact with HDFS, the Spark context is specifically for connecting to a cluster.
C: To compile Spark code
Feedback: This is incorrect. Spark context is not used for compiling code; it's for executing operations over a cluster.
D: To visualize data analytics
Feedback: Visualization is not the main purpose of a Spark context. It's primarily for executing operations.
What does the term 'dataflow' refer to in the context of big data pipelines?
*A: The movement of data through processing stages
Feedback: Correct! Dataflow describes how data moves through processing stages.
B: The duplication of data across systems
Feedback: Duplication is not the same as dataflow. Consider how data moves.
C: The backup and recovery of data
Feedback: Backup and recovery are important, but they aren't what dataflow describes.
Which of the following operations can be used to compact datasets and reduce their volume in big data
processing?
*A: Aggregation
Feedback: Correct! Aggregation is often used to summarize and reduce the size of datasets.
B: Replication
Feedback: Replication increases data volume by duplicating data, not reducing it.
C: Encryption
Feedback: Encryption secures data but does not reduce its volume.
D: Indexing
Feedback: Indexing improves data retrieval speed but doesn't compact the data.
Which of the following is a major motivation for the development of Apache Spark?
*A: To offer a unified engine for both batch and stream processing
Feedback: Correct! Spark was developed to offer a unified engine for both batch and stream processing.
B: To replace traditional databases
Feedback: Incorrect. Spark complements databases by providing powerful processing capabilities, not replacing them.
C: To increase data storage capacity
Feedback: Incorrect. Spark focuses on data processing speed and efficiency rather than storage capacity.
Which of the following big data processing engines is known for its capability to perform in-memory
computations using Resilient Distributed Datasets (RDDs)?
*A: Spark
Feedback: Correct! Spark is well-known for utilizing RDDs for in-memory computations, which
enhances processing speed.
B: Hadoop MapReduce
Feedback: Incorrect. Hadoop MapReduce is primarily used for batch processing and does not utilize in-
memory computations.
C: Storm
Feedback: Incorrect. While Storm offers real-time processing, it doesn't leverage RDDs for in-memory
computation.
D: Beam
Feedback: Incorrect. Beam is designed for data processing pipelines but does not focus on in-memory
computation with RDDs like Spark.
Which of the following are common analytical operations in big data pipelines?
*A: Join
*B: Sorting
C: Encryption
*D: Filtering
Feedback: Correct! Filtering is used to select specific data that meets certain criteria.
E: Replication
If a text file contains 1,000,000 words and the WordCount operation takes 10 seconds to complete using
a PySpark job, what is the average number of words processed per second?
*A: 100000.0
Feedback: Correct! The job processes an average of 100,000 words per second.
Default Feedback: Try calculating the number of words processed per second by dividing the total
words by the time taken.
If a dataset originally contains 1000 records and an aggregation operation reduces it to 200 records, by
what percentage has the dataset been reduced?
*A: 80.0
Feedback: Correct! Going from 1000 records to 200 removes 800 of the 1000 records, an 80% reduction.
Default Feedback: Incorrect. Divide the number of removed records by the original count.
If a Spark job execution time ranges between 5.5 and 7.5 minutes on average, what is a reasonable time
(in minutes) to expect it to complete?
Feedback: Good estimation! This range represents the typical execution time for a Spark job.
Default Feedback: Consider the given range of typical execution times to estimate a reasonable
completion time.
In PySpark, what data structure is typically used to hold key-value pairs for operations? Please answer in
all lowercase.
*A: rdd
Feedback: Correct! RDDs are used to hold data in key-value pairs for operations.
B: dataframe
Feedback: Not quite. DataFrames are used for structured data rather than key-value pairs in Spark.
Default Feedback: Remember to review the data structures used in PySpark for handling key-value
pairs.
What is the primary abstraction used by Apache Spark for distributed data processing? Please answer in
all lowercase.
*A: rdd
Feedback: That's right! RDDs are the core abstraction in Spark for distributed data processing.
*B: rdds
Feedback: Correct! RDDs (Resilient Distributed Datasets) are indeed the primary abstraction.
Default Feedback: Remember, Spark uses a specific abstraction to manage distributed data processing
efficiently.