Big Data Integration and Processing Final

The document contains a series of multiple-choice and text match questions related to SQL and data retrieval concepts, including clauses like WHERE, JOIN, and ORDER BY. It also covers SQL functions, commands, and tools used for managing and visualizing data in PostgreSQL, such as PgAdmin and Docker Desktop. The questions vary in format and difficulty, assessing knowledge of SQL syntax, aggregate functions, and database management practices.

Question 1 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which of the following SQL clauses is used to filter the result set based on a specified condition?

*A: WHERE

Feedback: Correct! The WHERE clause is used to filter records based on a specified condition.

B: SELECT

Feedback: Incorrect. The SELECT clause is used to select data from a database.

C: JOIN

Feedback: Incorrect. The JOIN clause is used to combine rows from two or more tables, based on a
related column between them.

D: ORDER BY

Feedback: Incorrect. The ORDER BY clause is used to sort the result set in ascending or descending
order.
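
As a quick illustration of the WHERE clause from this question, here is a minimal, self-contained sketch using Python's built-in sqlite3 module; the course itself uses PostgreSQL, and the employees table and its values are made up for the example.

import sqlite3

# In-memory database with a hypothetical "employees" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "Sales"), ("Bo", "IT"), ("Cy", "Sales")],
)

# WHERE filters the result set to rows matching the condition.
rows = conn.execute(
    "SELECT name FROM employees WHERE department = 'Sales'"
).fetchall()
print(rows)  # [('Ana',), ('Cy',)]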

Question 2 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Select the SQL aggregate functions from the list below.

*A: COUNT

Feedback: Correct! COUNT is an aggregate function in SQL that returns the number of input rows that
match a specific condition.

*B: MAX

Feedback: Correct! MAX is an aggregate function in SQL that returns the maximum value in a set of
values.

C: JOIN

Feedback: Incorrect. JOIN is used to combine rows from two or more tables based on a related column
between them, but it is not an aggregate function.
*D: SUM

Feedback: Correct! SUM is an aggregate function in SQL that returns the sum of a set of values.

E: SELECT

Feedback: Incorrect. SELECT is used to specify the columns to be returned by the query, but it is not an
aggregate function.
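
A small sketch of the aggregate functions marked correct above, again using sqlite3 with a hypothetical orders table; COUNT, MAX, and SUM each collapse a set of rows into a single value.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (25.5,), (7.0,)])

# Aggregate functions return one value computed over all matching rows.
count_, max_, sum_ = conn.execute(
    "SELECT COUNT(*), MAX(amount), SUM(amount) FROM orders"
).fetchone()
print(count_, max_, sum_)  # 3 25.5 42.5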

Question 3 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a table has 100 rows and a query returns 20% of them, how many rows will be returned?

*A: 20.0

Feedback: Correct! 20% of 100 rows is 20 rows.

Default Feedback: Incorrect. Review the calculation for percentage of total rows.

Question 4 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What is the name of the SQL clause used to combine rows from two or more tables based on a related
column between them? Please answer in all lowercase.

*A: join

Feedback: Correct! The JOIN clause is used to combine rows from two or more tables based on a related
column.

Default Feedback: Remember the clause that allows combining rows from multiple tables based on
related columns.

Question 5 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What tool can be used to visualize SQL query results in this course? Please answer in all lowercase.

*A: pgadmin

Feedback: Correct! PgAdmin is used to visualize SQL query results.


*B: dockerdesktop

Feedback: Correct! Docker Desktop is used to visualize SQL query results.

Default Feedback: Incorrect. The correct tools are PgAdmin and Docker Desktop.

Question 6 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What SQL clause is used to sort the result set of a query? Please answer in all lowercase.

*A: order by

Feedback: Correct! The ORDER BY clause is used to sort the result set.

Default Feedback: Incorrect. Remember that the clause used to sort the result set of a query is ORDER
BY.

Question 7 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What SQL keyword is used to remove a table from a database? Please answer in all lowercase.

*A: drop

Feedback: Correct! The DROP statement is used to remove a table from a database.

Default Feedback: Incorrect. Review the SQL commands for removing database objects.

Question 8 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a table has 500 rows and you execute the following SQL query: SELECT COUNT(*) FROM
table_name WHERE column_name = 'value'; and 100 rows match the condition, how many rows will be
returned?

*A: 100.0

Feedback: Correct! The query returns the count of rows that match the condition.

Default Feedback: Incorrect. Consider how the COUNT function works with the WHERE clause.
Question 9 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

How many rows will be returned by the following query if the Employees table has 50 rows?
SELECT * FROM Employees LIMIT 10;

*A: 10.0

Feedback: Correct! The LIMIT clause restricts the number of rows returned by the query to 10.

Default Feedback: Incorrect. Review how the LIMIT clause works in SQL.

Question 10 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which SQL function is used to return the number of rows that match a specified condition? Please answer in all lowercase.

*A: count

Feedback: Correct! The COUNT function returns the number of rows that match a specified condition.

*B: cnt

Feedback: Correct! Another acceptable answer is 'cnt'.

Default Feedback: Incorrect. Review SQL functions related to counting rows.

Question 11 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

How many rows are returned by the SQL query SELECT COUNT(*) FROM table_name if the table
table_name has 100 rows?

*A: 1.0

Feedback: Correct! SELECT COUNT(*) returns a single row containing the row count, so this query returns
exactly 1 row.

Default Feedback: Incorrect. Review the COUNT(*) function to understand its behavior.

Question 12 - text match, easy difficulty


Question category: Module: Retrieving Big Data (Part 1)

What is the command to remove all records from a table without deleting the table itself? Please answer
in all lowercase.

*A: truncate

Feedback: Correct! The TRUNCATE command is used to remove all records from a table without
deleting the table itself.

B: delete

Feedback: Incorrect. The DELETE command removes records but can be used with a WHERE clause to
delete specific records.

C: drop

Feedback: Incorrect. The DROP command deletes the table itself along with all its records.

Default Feedback: Incorrect. Please review the SQL commands for removing records from a table.

Question 13 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a table named Products contains 200 rows and you execute the following SQL query, how many rows
will be returned? SELECT * FROM Products LIMIT 50 OFFSET 100;

*A: 50.0

Feedback: Correct! The LIMIT clause specifies the number of rows to return, and the OFFSET clause
specifies the number of rows to skip before starting to return rows.

Default Feedback: Incorrect. Review the usage of the LIMIT and OFFSET clauses in SQL.
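
The LIMIT/OFFSET behaviour in Question 13 can be checked with a short sqlite3 sketch (hypothetical Products table; PostgreSQL uses the same LIMIT ... OFFSET syntax).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER)")
conn.executemany("INSERT INTO products VALUES (?)", [(i,) for i in range(200)])

# OFFSET 100 skips the first 100 rows, LIMIT 50 then returns the next 50.
rows = conn.execute("SELECT * FROM products LIMIT 50 OFFSET 100").fetchall()
print(len(rows))  # 50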

Question 14 - numeric, medium

Question category: Module: Retrieving Big Data (Part 1)

You have executed a query in PgAdmin to retrieve data from a table called sales where the revenue is
greater than 1000. How many rows will be returned if the sales table contains 5000 rows with revenue
values uniformly distributed between 500 and 1500?

*A: 2500.0

Feedback: Well done! You correctly calculated the number of rows with revenue greater than 1000.
Default Feedback: Try calculating the proportion of rows that meet the condition from the provided
range.

Question 15 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which of the following are benefits of data partitioning in a distributed database system?

*A: Improves query performance by dividing data into smaller, more manageable pieces

Feedback: Correct! Data partitioning enhances query performance and manageability.

B: Ensures data redundancy by storing multiple copies of the data

Feedback: Incorrect. Data partitioning involves dividing data, not replicating it.

*C: Facilitates load balancing by distributing data across different nodes

Feedback: Correct! Data partitioning helps in distributing the load across various nodes.

D: Simplifies the process of data replication

Feedback: Incorrect. Data partitioning is about dividing data, not about simplifying replication.

E: Helps in optimizing storage by removing duplicate records

Feedback: Incorrect. Data partitioning does not inherently remove duplicates; it organizes data across
partitions.

Question 16 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What keyword is used in SQL to rename a column in the result set? Please answer in all lowercase.

*A: as

Feedback: Correct! The AS keyword is used to give an alias to a column in the result set.

Default Feedback: Incorrect. Please review the SQL syntax for renaming columns in the result set.

Question 17 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)


In PgAdmin, which feature can be used to view table and column definitions? Please answer in all
lowercase.

*A: properties

Feedback: Correct! The Properties feature in PgAdmin allows you to view table and column definitions.

*B: structure

Feedback: Correct! The Structure feature in PgAdmin allows you to view table and column definitions.

*C: columns

Feedback: Correct! The Columns feature in PgAdmin allows you to view table and column definitions.

Default Feedback: Incorrect. Please review the course material on viewing table and column definitions
in PgAdmin.

Question 18 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a partitioned table contains 100,000 rows and is divided into 4 partitions, what is the average number
of rows per partition?

*A: 25000.0

Feedback: Correct! Each partition would have an average of 25,000 rows.

Default Feedback: Incorrect. Please review the calculation for average number of rows per partition in a
partitioned table.

Question 19 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What command would you use to view the column definitions of a table in PgAdmin?

*A: SELECT * FROM information_schema.columns WHERE table_name='your_table_name';

Feedback: Correct! This command queries the information_schema to show column definitions.

B: SHOW COLUMNS FROM your_table_name;

Feedback: Not quite. This syntax is typically used in MySQL, not PostgreSQL.
C: DESCRIBE TABLE your_table_name;

Feedback: Incorrect. This command is not used in PostgreSQL.

D: SELECT COLUMN_NAME, DATA_TYPE FROM your_table_name;

Feedback: Close, but this will only work if you manually store column names and data types in a table.
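
For reference, this is roughly how the information_schema query from the correct option could be run from Python with the psycopg2 driver; the connection parameters and table name below are placeholders, not values from the course.

import psycopg2  # PostgreSQL driver; assumes a reachable server

# Placeholder connection details for illustration only.
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="postgres", password="secret")
cur = conn.cursor()
cur.execute(
    "SELECT column_name, data_type "
    "FROM information_schema.columns "
    "WHERE table_name = %s",
    ("your_table_name",),
)
for column_name, data_type in cur.fetchall():
    print(column_name, data_type)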

Question 20 - checkbox, shuffle, partial credit, medium

Question category: Module: Retrieving Big Data (Part 1)

Select all the tasks that can be performed using PgAdmin in conjunction with PostgreSQL and Docker
Desktop.

*A: Visualizing query results in Docker Desktop

Feedback: Correct! Docker Desktop can be used to visualize query results when integrated with
PgAdmin and PostgreSQL.

B: Using PgAdmin to modify PostgreSQL source code

Feedback: PgAdmin is used to manage databases, not to modify source code.

*C: Filtering table rows using PgAdmin

Feedback: Correct! PgAdmin allows you to filter table rows and columns.

D: Deploying PostgreSQL databases to cloud services using PgAdmin

Feedback: While PgAdmin can manage databases, deploying to cloud requires additional tools and
configurations.

Question 21 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which SQL clause is used to combine rows from two or more tables based on a related column?

*A: Using the SQL JOIN clause

Feedback: Correct! The JOIN clause is used to combine rows from two or more tables based on a related
column.

B: Using the SQL SELECT clause


Feedback: The SELECT clause is used for selecting data, not for joining tables.

C: Using the SQL ORDER BY clause

Feedback: ORDER BY is used to sort the result set, not to combine tables.

D: Using the SQL WHERE clause

Feedback: The WHERE clause is used to filter records, not to join tables.

Question 22 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

How can you view the data within a table using PgAdmin?

*A: By selecting 'View Data' option

Feedback: Correct! You can view data by selecting the 'View Data' option in PgAdmin.

B: By using the ALTER TABLE command

Feedback: ALTER TABLE is used to modify the structure of a table, not to view data.

C: By using the DELETE FROM command

Feedback: DELETE FROM removes records from a table, it doesn't display data.

D: By accessing the system catalogs directly

Feedback: Accessing system catalogs is complex and not typically used for simple data viewing.

Question 23 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What does SQL stand for?

*A: Structured Query Language

Feedback: Correct! SQL stands for Structured Query Language.

B: Standard Query Language

Feedback: Incorrect. It's a common misconception, but SQL stands for Structured Query Language.
C: Simple Query Language

Feedback: Incorrect. While SQL is designed to simplify database queries, it stands for Structured Query
Language.

D: Sequential Query Language

Feedback: Incorrect. SQL is not about sequence; it stands for Structured Query Language.

Question 24 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What are the purposes of using indexes in SQL databases?

*A: Optimize query performance

Feedback: Correct! Indexes help in optimizing the performance of queries.

B: Ensure data integrity

Feedback: Incorrect. Although indexes help in data retrieval, they do not ensure data integrity.

*C: Speed up data retrieval

Feedback: Correct! Indexes are used to speed up data retrieval processes.

D: Control user access

Feedback: Incorrect. Indexes do not control user access.

Question 25 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which SQL clause is used to join tables together?

*A: Using the JOIN clause

Feedback: Correct! The JOIN clause is used to combine rows from two or more tables.

B: Using the AGGREGATE clause

Feedback: Incorrect. AGGREGATE is not used for joining tables.

C: Using the CONNECT clause


Feedback: Incorrect. CONNECT is not a valid clause for joining tables.

D: Using the MERGE clause

Feedback: Incorrect. MERGE is used for merging data, not specifically for joining tables.

Question 26 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which of the following are benefits of data partitioning in a distributed database system?

*A: Improved query performance

Feedback: Correct! Partitioning can enhance query performance by limiting the amount of data that
needs to be scanned.

*B: Simplified data management

Feedback: Correct! Partitioning helps in organizing data, making it easier to manage.

C: Increased data redundancy

Feedback: Incorrect. Partitioning does not inherently increase data redundancy; it focuses on
performance and manageability.

D: Reduced storage requirements

Feedback: Incorrect. While partitioning may optimize data access, it does not necessarily reduce storage
needs.

Question 27 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

Which SQL clause allows you to specify the condition under which rows are returned?

*A: WHERE

Feedback: Correct! The WHERE clause is used to filter records based on specific conditions.

B: JOIN

Feedback: Incorrect. The JOIN clause is used to combine rows from two or more tables based on a
related column.
C: GROUP BY

Feedback: Incorrect. The GROUP BY clause is used to arrange identical data into groups.

D: ORDER BY

Feedback: Incorrect. The ORDER BY clause is used to sort the result set in ascending or descending
order.

Question 28 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a partitioned table in a distributed database system is split into 10 partitions with an even distribution,
what is the maximum possible size of each partition if the total table size is 1,000,000 records?

*A: 100000.0

Feedback: Correct! Each partition would ideally hold an equal portion of the total records.

Default Feedback: Incorrect. Consider how the total records are distributed across partitions evenly.

Question 29 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

If a table in PostgreSQL has 1500 rows and you execute a query that filters these rows down to 25% of
the original, how many rows are in the result?

*A: 375.0

Feedback: Correct! 25% of 1500 rows is 375 rows.

Default Feedback: Revisit the concept of filtering and percentage calculations.

Question 30 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 1)

What is the keyword used in SQL to remove duplicate records from a query result? Please answer in all
lowercase.

*A: distinct

Feedback: Correct! The DISTINCT keyword is used to remove duplicates from query results.
Default Feedback: Remember to review SQL keywords related to query result modifications.

Question 31 - checkbox, shuffle, partial credit, medium

Question category: Module: Welcome to Big Data Integration and Processing

Which of the following are differences between a database management system (DBMS) and a big data
management system (BDMS)?

*A: BDMS can handle unstructured data, while DBMS primarily handles structured data

Feedback: Correct! BDMS are designed to handle a variety of data types, including unstructured data.

*B: DBMS is optimized for transaction processing, while BDMS is optimized for analytical processing

Feedback: Correct! DBMS are often optimized for transactions, whereas BDMS are optimized for
analytics.

C: BDMS typically requires more complex ETL processes compared to DBMS

Feedback: Incorrect. Both systems can have complex ETL processes, but the complexity depends on the
specific use case and data.

*D: BDMS can scale out horizontally more efficiently than DBMS

Feedback: Correct! BDMS are designed to scale horizontally across distributed systems.

E: DBMS cannot process large volumes of data

Feedback: Incorrect. DBMS can process large volumes of data, but they are not as efficient as BDMS in
handling extremely large datasets.

Question 32 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

Which of the following are basic features of JupyterLab?

*A: File browser

Feedback: Correct! File browser is a basic feature of JupyterLab.

*B: Integrated development environment (IDE)

Feedback: Correct! JupyterLab serves as an integrated development environment (IDE).


C: Version control system

Feedback: Incorrect. JupyterLab does not include a version control system.

*D: Web-based interface

Feedback: Correct! JupyterLab has a web-based interface.

E: Database management system

Feedback: Incorrect. JupyterLab does not have database management features.

Question 33 - checkbox, shuffle, partial credit, medium

Question category: Module: Welcome to Big Data Integration and Processing

Which of the following are challenges associated with streaming data in big data systems?

*A: High Velocity

Feedback: Correct! High velocity is a significant challenge in streaming data.

*B: Data Variety

Feedback: Correct! Data variety is another challenge in handling streaming data.

C: Data Redundancy

Feedback: Not quite. Data redundancy is not a primary challenge of streaming data.

D: Batch Processing

Feedback: Incorrect. Batch processing is not typically a challenge for streaming data systems.

*E: Real-time Processing

Feedback: Correct! Real-time processing is a key challenge in streaming data.

Question 34 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

Which of the following are necessary steps to organize downloaded files for use in the Hands-On
Modules?

*A: Create a dedicated project directory.


Feedback: Correct! Creating a dedicated project directory helps in organizing files efficiently.

*B: Name files according to their content or purpose.

Feedback: Correct! Naming files according to their content or purpose makes it easier to identify and
access them.

C: Leave the files in the Downloads folder.

Feedback: Incorrect. Leaving files in the Downloads folder can lead to disorganization.

*D: Delete unnecessary files.

Feedback: Correct! Deleting unnecessary files helps in keeping the workspace clean and organized.

E: Rename all files to random names.

Feedback: Incorrect. Renaming files to random names will make it difficult to identify and access them.

Question 35 - text match, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

What term is used to describe the speed at which data is generated and processed in a big data context?
Please answer in all lowercase.

*A: velocity

Feedback: Correct! Velocity refers to the speed at which data is generated and processed in big data.

Default Feedback: Incorrect. Remember, the speed at which data is generated and processed is a crucial
factor in big data.

Question 36 - numeric, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

In a big data context, a certain process handles between 1,000 and 2,000 transactions per second. What
is the range of transactions per second?

*A: [1000, 2000)

Feedback: Correct! The process handles between 1,000 and 2,000 transactions per second.

Default Feedback: Incorrect. Please review the handling capacity in transactions per second for big data
processes.
Question 37 - text match, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

What is the name of the interactive computing environment that JupyterLab extends? Please answer in
all lowercase.

*A: jupyter notebook

Feedback: Correct! JupyterLab extends the functionality of Jupyter Notebooks.

*B: notebook

Feedback: Correct! JupyterLab extends the functionality of Jupyter Notebooks.

Default Feedback: Incorrect. Please review the basic features and functionalities of JupyterLab.

Question 38 - numeric, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

In a big data system, a particular data processing job takes between 5 and 10 seconds to complete. What
is the range of time in seconds?

*A: [5, 10]

Feedback: Correct! The job takes between 5 and 10 seconds to complete.

Default Feedback: Incorrect. Consider the range of time it takes for the job to complete.

Question 39 - checkbox, shuffle, partial credit, medium

Question category: Module: Welcome to Big Data Integration and Processing

Which of the following are requirements of programming models for big data?

*A: Scalability

Feedback: Correct! Scalability is a crucial requirement for programming models in big data.

*B: Fault tolerance

Feedback: Correct! Fault tolerance ensures the system can handle failures gracefully.

C: Portability
Feedback: Incorrect. While useful, portability is not a primary requirement of programming models for
big data.

*D: Efficiency

Feedback: Correct! Efficiency is essential to process large datasets effectively.

E: User-friendliness

Feedback: Incorrect. User-friendliness is advantageous but not a core requirement for big data
programming models.

Question 40 - text match, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

What is the name of the interface used to interact with Jupyter Notebooks? Please answer in all
lowercase.

*A: jupyterlab

Feedback: Correct! JupyterLab is the interface used to interact with Jupyter Notebooks.

*B: notebook

Feedback: Correct! Notebook is another term for interacting with Jupyter Notebooks.

Default Feedback: Incorrect. Review the section on Jupyter Notebooks to find the interface name.

Question 41 - text match, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

Which tool is used for real-time data processing and can handle large streams of data efficiently? Please
answer in all lowercase.

*A: spark

Feedback: Correct! Apache Spark is widely used for real-time data processing.

*B: splunk

Feedback: Correct! Splunk is also used for real-time data processing and can handle large data streams
efficiently.

Default Feedback: Incorrect. Please review the lesson material on tools for real-time data processing.
Question 42 - numeric, easy difficulty

Question category: Module: Welcome to Big Data Integration and Processing

If a big data system processes 2 terabytes of data in 30 minutes, what is the data processing rate in
gigabytes per minute?

*A: [66.67, 67)

Feedback: Correct! The data processing rate is approximately 66.67 gigabytes per minute.

Default Feedback: Incorrect. Please review the lesson material on calculating data processing rates.

Question 43 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following are aggregate functions in SQL?

*A: SUM

Feedback: Correct! SUM is an aggregate function that calculates the total sum of a numeric column.

*B: MAX

Feedback: Correct! MAX is an aggregate function that returns the maximum value in a set.

*C: MIN

Feedback: Correct! MIN is an aggregate function that returns the minimum value in a set.

D: JOIN

Feedback: Incorrect. JOIN is not an aggregate function; it is used to combine rows from two or more
tables based on a related column.

*E: COUNT

Feedback: Correct! COUNT is an aggregate function that returns the number of rows that match a
specified criterion.

F: DISTINCT

Feedback: Incorrect. DISTINCT is not an aggregate function; it is used to return only distinct (different)
values.
Question 44 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following Pandas operations can be used to filter data in a DataFrame?

*A: loc

Feedback: Correct! The loc function is used for label-based indexing to filter data.

*B: iloc

Feedback: Correct! The iloc function is used for positional indexing to filter data.

C: groupby

Feedback: Incorrect. groupby is used for grouping data, not directly for filtering.

D: pivot_table

Feedback: Incorrect. pivot_table is used to create pivot tables, not directly for filtering.

*E: filter

Feedback: Correct! The filter function is used to filter data based on column names.
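
A short Pandas sketch of the three operations marked correct above (loc, iloc, and filter), using a small made-up DataFrame.

import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "score": [85, 42, 90]})

# loc: label-based selection, here combined with a boolean condition.
print(df.loc[df["score"] > 80])

# iloc: positional selection, e.g. the first two rows.
print(df.iloc[:2])

# filter: selects columns (or index labels) by name.
print(df.filter(items=["score"]))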

Question 45 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following operations can be performed using the Pandas library in Python?

*A: Reading data from a CSV file

Feedback: Correct! Pandas can read data from a CSV file into a DataFrame.

*B: Filtering data in a DataFrame

Feedback: Correct! Pandas provides various methods to filter data in a DataFrame.

C: Counting documents in a MongoDB collection

Feedback: Incorrect. Counting documents in a MongoDB collection is not a Pandas operation.

*D: Calculating summary statistics


Feedback: Correct! Pandas can calculate summary statistics like mean, median, sum, etc.

E: Pulling a Docker image

Feedback: Incorrect. Pulling a Docker image is not related to Pandas.

Question 46 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following are SQL JOIN types?

*A: INNER JOIN

Feedback: Correct! INNER JOIN selects records that have matching values in both tables.

*B: LEFT JOIN

Feedback: Correct! LEFT JOIN returns all records from the left table, and the matched records from the
right table.

C: UPPER JOIN

Feedback: Incorrect. UPPER JOIN is not a valid SQL JOIN type.

*D: RIGHT JOIN

Feedback: Correct! RIGHT JOIN returns all records from the right table, and the matched records from
the left table.

*E: FULL JOIN

Feedback: Correct! FULL JOIN returns all records when there is a match in either left or right table.
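
To make the JOIN types concrete, here is a small sqlite3 sketch of INNER and LEFT joins over two hypothetical tables; PostgreSQL supports all four types listed as correct above.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

# INNER JOIN keeps only rows with a match in both tables.
print(conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id").fetchall())  # [('Ana', 9.99)]

# LEFT JOIN keeps every customer, with NULL where no order matches.
print(conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id").fetchall())   # [('Ana', 9.99), ('Bo', None)]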

Question 47 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What method is used to count the number of documents in a MongoDB collection that match a specified
filter? Please answer in all lowercase.

*A: count_documents

Feedback: Correct! The count_documents method is used to count the number of documents in a
MongoDB collection that match a specified filter.
B: count

Feedback: Incorrect. The count method is deprecated in MongoDB. Use count_documents instead.

C: countdocs

Feedback: Incorrect. Double-check the method name. It's count_documents.

Default Feedback: Incorrect. Please review the MongoDB documentation to find the correct method for
counting documents that match a filter.
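
With the pymongo driver, the count_documents method from this question looks roughly like the following; the connection URI, database, collection, and field names are placeholders.

from pymongo import MongoClient

# Placeholder connection details for illustration.
client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["users"]

# count_documents takes a filter document and returns the number of matches.
n = collection.count_documents({"age": {"$gt": 20}})
print(n)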

Question 48 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

How many rows will be returned by the SQL query if there are 50 rows in the students table and the
condition is WHERE age > 20? Assume 30 students are older than 20.

*A: 30.0

Feedback: Correct! There are 30 students older than 20 in the students table.

Default Feedback: Incorrect. Please review the lesson on SQL query conditions and row counts.

Question 49 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What is the keyword used in SQL to extract data from a database? Please answer in all lowercase.

*A: select

Feedback: Correct! SELECT is the keyword used to extract data from a database.

B: query

Feedback: Incorrect. QUERY is not a valid keyword in SQL for extracting data.

C: fetch

Feedback: Incorrect. FETCH is not a valid keyword in SQL for extracting data.

D: retrieve

Feedback: Incorrect. RETRIEVE is not a valid keyword in SQL for extracting data.
Default Feedback: Incorrect. Please review the SQL keywords for extracting data.

Question 50 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

If a DataFrame has 500 rows and 30 columns, how many elements does it contain?

*A: 15000.0

Feedback: Correct! The DataFrame contains 500 * 30 = 15000 elements.

Default Feedback: Incorrect. Please review how to calculate the number of elements in a DataFrame.

Question 51 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What is the term for a subset of a DataFrame that meets a specified condition? Please answer in all
lowercase.

*A: filter

Feedback: Correct! A subset of a DataFrame that meets a specified condition is called a filter.

*B: subset

Feedback: Correct! A subset of a DataFrame that meets a specified condition can also be called a subset.

Default Feedback: Incorrect. Please review the lesson materials and try again.

Question 52 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What Python library is commonly used for data manipulation and analysis, especially with DataFrames?
Please answer in all lowercase.

*A: pandas

Feedback: Correct! Pandas is widely used for data manipulation and analysis with DataFrames.

Default Feedback: Incorrect. The commonly used library for data manipulation and analysis with
DataFrames is Pandas.

Question 53 - numeric, easy difficulty


Question category: Module: Retrieving Big Data (Part 2)

If a DataFrame has 1000 rows and 20 columns, how many elements does it contain?

*A: 20000.0

Feedback: Correct! The DataFrame contains 1000 * 20 = 20000 elements.

Default Feedback: Incorrect. Please review how to calculate the number of elements in a DataFrame.

Question 54 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

How many rows will be returned by the following SQL query? SELECT * FROM employees WHERE
salary > 50000 AND department_id = 10;

*A: 5.0

Feedback: Correct! 5 rows match the criteria specified in the SQL query.

Default Feedback: Incorrect. Please revisit the SQL query and consider the criteria specified for filtering
the rows.

Question 55 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What file format is commonly used for storing tabular data and can be read into a Pandas DataFrame?
Please answer in all lowercase.

*A: csv

Feedback: Correct! CSV is a common file format used for storing tabular data.

B: tsv

Feedback: Incorrect. While TSV is also used for tabular data, it is less common than CSV.

Default Feedback: Incorrect. Please review the common file formats for storing tabular data.

Question 56 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)


What keyword is used in SQL to remove duplicate records from a result set? Please answer in all
lowercase.

*A: distinct

Feedback: Correct! The DISTINCT keyword is used to return only distinct (different) values.

Default Feedback: Incorrect. Please review the SQL keywords that help in filtering records.

Question 57 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

If a table has 1000 rows and you execute a query that selects 10% of the rows, how many rows will be
returned?

*A: 100.0

Feedback: Correct! 10% of 1000 is 100.

Default Feedback: Incorrect. Please check your calculation of the percentage of rows.

Question 58 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

How many documents will be returned if a MongoDB query matches 15 documents in a collection?

*A: 15.0

Feedback: Correct! The query will return exactly 15 documents.

Default Feedback: Incorrect. Revise the MongoDB query syntax and ensure you understand the filtering
process.

Question 59 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following operations can be performed using the Pandas library?

*A: Reading data from a CSV file

Feedback: Correct! You can read data from a CSV file using Pandas.

*B: Filtering data in a DataFrame


Feedback: Correct! You can filter data in a DataFrame using Pandas.

C: Counting documents in a MongoDB collection

Feedback: Incorrect. Counting documents in a MongoDB collection is a task performed using MongoDB,
not Pandas.

*D: Calculating summary statistics

Feedback: Correct! You can calculate summary statistics using Pandas.

E: Building a Docker container

Feedback: Incorrect. Building a Docker container is not an operation performed using Pandas.

Question 60 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following are valid MongoDB aggregation stages? Select all that apply.

*A: $match

Feedback: Correct! The $match stage filters documents to pass only those that match the specified
condition(s).

B: $sum

Feedback: Incorrect. $sum is an accumulator operator, not an aggregation stage.

*C: $group

Feedback: Correct! The $group stage groups input documents by a specified identifier expression and
applies the accumulator expressions.

*D: $sort

Feedback: Correct! The $sort stage sorts all input documents and returns them in the specified order.

E: $filter

Feedback: Incorrect. $filter is an array operator, not an aggregation stage.
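
A sketch of an aggregation pipeline built from the stages marked correct above ($match, $group, $sort); pymongo's aggregate method takes the pipeline as a list of stage documents. The collection and field names are made up.

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["sales"]

pipeline = [
    {"$match": {"region": "EU"}},                 # stage: filter documents
    {"$group": {"_id": "$product",                # stage: group by product
                "total": {"$sum": "$amount"}}},   # $sum is an accumulator inside $group
    {"$sort": {"total": -1}},                     # stage: order by the computed total
]
for doc in collection.aggregate(pipeline):
    print(doc)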

Question 61 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)


What is the command to pull a Docker image from a registry? Please answer in all lowercase.

*A: docker pull

Feedback: Correct! docker pull is the command used to pull a Docker image from a registry.

B: dockerpull

Feedback: Incorrect. Remember to include a space between docker and pull.

C: pull docker

Feedback: Incorrect. The command should start with docker followed by pull.

Default Feedback: Incorrect. Make sure you are using the correct Docker command to pull an image.

Question 62 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

If you have a DataFrame df with 500 rows, how many rows will be displayed when you use the
command df.head(10)?

*A: 10.0

Feedback: Correct! The df.head(10) command displays the first 10 rows of the DataFrame.

Default Feedback: Incorrect. Review the head method in Pandas to understand how many rows it
displays by default.

Question 63 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following operations can be performed using the Pandas library in Python?

*A: Filtering data in a DataFrame

Feedback: Correct! Pandas provides robust filtering capabilities for DataFrames.

B: Counting documents in a MongoDB collection

Feedback: Incorrect. This operation is handled by MongoDB, not Pandas.

*C: Calculating summary statistics of a DataFrame


Feedback: Correct! Pandas can easily calculate summary statistics like mean, median, and standard
deviation.

D: Joining tables in SQL

Feedback: Incorrect. While Pandas has merge operations similar to SQL joins, actual SQL operations
are distinct from Pandas capabilities.

Question 64 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What is the Pandas function used to read a CSV file into a DataFrame?

*A: read_csv()

Feedback: Correct! The read_csv() function is specifically designed for reading CSV files into
DataFrames.

B: read_excel()

Feedback: Not quite. The read_excel() function is used for reading Excel files, not CSV files.

C: read_json()

Feedback: Incorrect. The read_json() function is used for reading JSON formatted data, not CSV files.

D: read_html()

Feedback: Incorrect. The read_html() function is used for reading HTML tables, not CSV files.
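
A minimal read_csv example, assuming a local file named data.csv with a header row exists; describe() then gives quick summary statistics.

import pandas as pd

# Assumes a CSV file such as:
# name,score
# Ana,85
# Bo,42
df = pd.read_csv("data.csv")

print(df.head())       # first rows of the DataFrame
print(df.describe())   # summary statistics for numeric columns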

Question 65 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which MongoDB command is used to find documents with specific field values?

*A: find()

Feedback: Correct! The find() command is used to retrieve documents that match specific criteria.

B: search()

Feedback: Incorrect. There is no search() command in MongoDB for this purpose.

C: query()
Feedback: Incorrect. While querying is the process, query() is not a MongoDB command for finding
documents.

D: filter()

Feedback: Incorrect. Filtering is part of querying, but filter() is not a MongoDB command.

Question 66 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which MongoDB operator is used to select documents that match the values of multiple fields?

*A: $and

Feedback: Correct! The $and operator is used to match documents that fulfill all specified conditions.

B: $or

Feedback: Not quite. The $or operator is used to match documents that satisfy at least one of the
specified conditions.

C: $nor

Feedback: Incorrect. The $nor operator is the opposite of $or. It selects documents that fail all the
specified conditions.

D: $not

Feedback: No, the $not operator is used to invert the effect of a query expression.

Question 67 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which MongoDB function retrieves distinct values from a specified field across a single collection?

*A: distinct

Feedback: Correct! The distinct function retrieves unique values from a specified field.

B: aggregate

Feedback: Not quite. The aggregate function is used to process data records and return computed results.

C: find
Feedback: Incorrect. The find function is used for querying documents in a collection.

D: mapReduce

Feedback: No, the mapReduce function is used for processing and aggregating collections.

Question 68 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What command would you use to count the number of documents in a MongoDB collection?

*A: db.collection.countDocuments()

Feedback: Correct! This command is used to count the number of documents in a MongoDB collection.

B: db.collection.find().count()

Feedback: Incorrect. While this was previously used, the correct method now is countDocuments().

C: db.collection.totalDocuments()

Feedback: Incorrect. This command does not exist in MongoDB.

D: db.collection.countFiles()

Feedback: Incorrect. This is not a valid MongoDB command for counting documents.

Question 69 - multiple choice, shuffle, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

In MongoDB, which operator would you use to combine multiple conditions with a logical AND within
a query?

*A: $and

Feedback: Correct! The $and operator is used to combine multiple conditions with a logical AND.

B: $or

Feedback: The $or operator is used for combining conditions with a logical OR, not AND.

C: $set

Feedback: $set is used for updating fields in a document, not for combining conditions.
D: $match

Feedback: $match is used to filter documents from the collection, not for combining conditions with
AND.
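
A small pymongo sketch of combining conditions with $and in a find() query; the collection and field values are hypothetical. Listing several fields in one filter document also implies AND, $and just makes it explicit.

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["employees"]

# Explicit $and: both conditions must hold for a document to match.
query = {"$and": [{"department": "Sales"}, {"salary": {"$gt": 50000}}]}
for doc in collection.find(query):
    print(doc)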

Question 70 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

Which of the following are valid MongoDB aggregation framework stages used to transform
documents?

*A: $group

Feedback: Correct! $group is used to group documents by a specified expression.

*B: $lookup

Feedback: Correct! $lookup is used for performing a left outer join to another collection in the same
database.

C: $concatArrays

Feedback: Incorrect. $concatArrays is an operator, not an aggregation stage.

*D: $sort

Feedback: Correct! $sort is used to order the documents in a pipeline.

E: $filter

Feedback: Incorrect. $filter is an array operator, not an aggregation stage.

*F: $limit

Feedback: Correct! $limit is used to restrict the number of documents in the aggregation pipeline.

Question 71 - numeric, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

If a MongoDB collection contains 1000 documents and a query with a specific condition returns 250
documents, what percentage of the collection does this represent?

*A: 25.0
Feedback: Correct! The query returns 25% of the documents in the collection.

Default Feedback: Consider the ratio of documents returned by the query to the total number of
documents in the collection.

Question 72 - text match, easy difficulty

Question category: Module: Retrieving Big Data (Part 2)

What is the term for a Docker image that is used to create a running instance? Please answer in all
lowercase.

*A: container

Feedback: Correct! A Docker image, when instantiated, becomes a container.

*B: containers

Feedback: Correct! A Docker image, when instantiated, becomes a container.

Default Feedback: Remember that a Docker image becomes a specific term when it is run.

Question 73 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following is a characteristic of Spark's sliding windows?

*A: They allow for continuous processing of streaming data over a specific interval.

Feedback: Correct! Sliding windows process data in overlapping intervals, enabling continuous data
analysis.

B: They process data only once at the end of the window interval.

Feedback: Not quite. Sliding windows process data continuously, not just once at the end.

C: They only work with batch data, not streaming data.

Feedback: Incorrect. Sliding windows are specifically designed for streaming data.

D: They are used to store data in Spark's memory.

Feedback: No, sliding windows are not used for data storage but for continuous data processing.

Question 74 - multiple choice, shuffle, easy difficulty


Question category: Module: Big Data Analytics using Spark

When working with sensor data using Spark streaming, which of the following is a common operation?

*A: Applying windowed computations

Feedback: Correct! Applying windowed computations is a common operation when working with sensor
data using Spark streaming.

B: Ignoring duplicate data

Feedback: Incorrect. Ignoring duplicate data is not a common operation in this context. Review the
lecture on sensor data processing.

C: Performing table joins

Feedback: Incorrect. While table joins are useful, they are not a primary operation in sensor data
processing using Spark streaming.

D: Converting data types

Feedback: Incorrect. Converting data types is not specific to sensor data processing in Spark streaming.
Check the relevant materials on this topic.

Question 75 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following is a narrow transformation in Spark?

*A: map

Feedback: Correct! The map transformation is indeed a narrow transformation.

B: reduceByKey

Feedback: This is incorrect. The reduceByKey transformation is a wide transformation.

C: groupByKey

Feedback: This is incorrect. The groupByKey transformation is also a wide transformation.

D: join

Feedback: Incorrect. The join transformation is considered a wide transformation.


Question 76 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Analytics using Spark

Select all of the transformations in Spark that are considered wide transformations.

*A: groupByKey

Feedback: Correct! groupByKey is a wide transformation because it requires data shuffling across the
nodes.

*B: reduceByKey

Feedback: Correct! reduceByKey also requires shuffling of data, making it a wide transformation.

C: filter

Feedback: Incorrect. filter is a narrow transformation as it operates on a single partition.

D: map

Feedback: Incorrect. map is considered a narrow transformation as it does not require data shuffling.

*E: sortBy

Feedback: Correct! sortBy is a wide transformation because it can involve shuffling of data across the
partitions.
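
A short PySpark sketch contrasting a narrow transformation (map) with a wide one (reduceByKey), which triggers a shuffle; the data is made up.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("transforms").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

# map: narrow transformation, each input partition produces one output partition.
doubled = rdd.map(lambda kv: (kv[0], kv[1] * 2))

# reduceByKey: wide transformation, values for the same key are shuffled together.
totals = doubled.reduceByKey(lambda x, y: x + y)
print(totals.collect())  # e.g. [('a', 8), ('b', 4)]

spark.stop()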

Question 77 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following are supported sources of streaming data in Spark?

*A: Kafka

Feedback: Correct! Kafka is a commonly used source of streaming data in Spark.

*B: Flume

Feedback: Correct! Flume is another source of streaming data supported by Spark.

C: AWS Lambda

Feedback: Incorrect. AWS Lambda is a serverless compute service, not a streaming data source for
Spark.
*D: HDFS

Feedback: Correct! HDFS can also be used as a source of streaming data in Spark.

E: SQLite

Feedback: Incorrect. SQLite is a database engine, not a streaming data source for Spark.

Question 78 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following are sources of streaming data supported by Spark?

*A: Apache Kafka

Feedback: Correct! Apache Kafka is supported by Spark for streaming data.

*B: Amazon S3

Feedback: Correct! Amazon S3 is supported by Spark for streaming data.

*C: Google Cloud Storage

Feedback: Correct! Google Cloud Storage is supported by Spark for streaming data.

D: Microsoft Excel

Feedback: Incorrect. Microsoft Excel is not a source of streaming data supported by Spark.

E: PostgreSQL

Feedback: Incorrect. PostgreSQL is not a source of streaming data supported by Spark.

Question 79 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What method is used to access Postgres database tables with SparkSQL? Please answer in all lowercase.

*A: jdbc

Feedback: Correct! The jdbc method is used to access Postgres database tables with SparkSQL.

Default Feedback: Incorrect. Please review the methods used to access Postgres database tables with
SparkSQL.
Question 80 - numeric, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the range of possible values if the result of a computation over a sliding window is expected to
be between 10 and 20 (inclusive)?

*A: [10, 20]

Feedback: Correct! The range of possible values is between 10 and 20, inclusive.

Default Feedback: Incorrect. Please review how computations over sliding windows work.

Question 81 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the name of the machine learning library in Spark? Please answer in all lowercase.

*A: mllib

Feedback: Correct! MLlib is Spark's machine learning library.

*B: sparkmllib

Feedback: Correct! MLlib is also known as Spark MLlib.

Default Feedback: Incorrect. Remember to review the name of Spark's machine learning library.

Question 82 - numeric, easy difficulty

Question category: Module: Big Data Analytics using Spark

If a sliding window computation on sensor data results in values between 5 and 15 (inclusive), what is
the range of possible values?

*A: [5, 15]

Feedback: Correct! The range is [5, 15] inclusive.

Default Feedback: Incorrect. Please review the concepts of sliding window computations.

Question 83 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark


What is the programming model used in Spark? Please answer in all lowercase.

*A: rdd

Feedback: Correct! The Resilient Distributed Dataset (RDD) is the fundamental programming model for
Spark.

*B: resilient distributed dataset

Feedback: Correct! The Resilient Distributed Dataset (RDD) is the fundamental programming model for
Spark.

Default Feedback: Incorrect. Please review the fundamental programming model for Spark.

Question 84 - numeric, easy difficulty

Question category: Module: Big Data Analytics using Spark

If a Spark job creates an RDD and splits it into 4 partitions, and then a coalesce transformation is applied
with a value of 2, how many partitions will the resulting RDD have?

*A: 2.0

Feedback: Correct! The coalesce transformation reduces the number of partitions in the RDD to the
specified number.

Default Feedback: Incorrect. Please review how the coalesce transformation works in Spark.
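
The coalesce behaviour from Question 84 can be checked directly with getNumPartitions(); a minimal local sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("coalesce-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)   # RDD with 4 partitions
print(rdd.getNumPartitions())                   # 4

# coalesce reduces the number of partitions without a full shuffle.
print(rdd.coalesce(2).getNumPartitions())       # 2

spark.stop()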

Question 85 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following operations can be performed on a Spark DataFrame when working with
streaming data?

*A: Filtering rows

Feedback: Correct! Filtering rows is a common operation that can be performed on a Spark DataFrame
when working with streaming data.

*B: Joining with static DataFrames

Feedback: Correct! Joining streaming DataFrames with static DataFrames is a supported operation in
Spark.

*C: Reading from multiple streams


Feedback: Correct! Spark allows reading from multiple streams simultaneously.

D: Directly modifying source data

Feedback: Incorrect. Directly modifying source data is not an operation performed on Spark DataFrames
when working with streaming data.

Question 86 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the term for the concept in Spark where RDDs cannot be modified after their creation? Please
answer in all lowercase.

*A: immutability

Feedback: Correct! Immutability refers to the concept where RDDs cannot be modified after their
creation.

*B: immutable

Feedback: Correct! Immutability refers to the concept where RDDs cannot be modified after their
creation.

Default Feedback: Remember to review the concept of RDD immutability in Spark, where RDDs cannot
be modified after their creation.

Question 87 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the name of Spark's component designed for graph processing? Please answer in all lowercase.

*A: graphx

Feedback: Correct! GraphX is Spark's component designed for graph processing.

Default Feedback: Incorrect. Please review the components of Spark and try again.

Question 88 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the term used to describe a continuous sequence of data elements in Spark? Please answer in all
lowercase.
*A: stream

Feedback: Correct! In Spark, a continuous sequence of data elements is referred to as a stream.

*B: streaming

Feedback: Correct! The continuous sequence of data elements can also be described with the term
streaming.

Default Feedback: Incorrect. Please review the concept of continuous data processing in Spark.

Question 89 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following statements best describes how to filter rows and columns of a Spark DataFrame?

*A: Use the filter() function to select specific rows and the select() function to choose columns.

Feedback: Correct! The filter() function is used for row selection while the select() function is used for
column selection in Spark DataFrames.

B: Use the groupBy() function to filter rows and the orderBy() function to select columns.

Feedback: Almost! groupBy() and orderBy() are used for grouping and ordering, not filtering.

C: Use the drop() function to filter out unwanted rows and the aggregate() function for columns.

Feedback: Not quite. The drop() function is used to remove columns, not filter rows, and aggregate()
isn't used for column selection.

D: Use the join() function to filter rows and the distinct() function to filter columns.

Feedback: Incorrect. join() is for combining DataFrames and distinct() is for removing duplicate rows,
not specifically for filtering.
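
A minimal Spark DataFrame sketch of the correct option: filter() for rows and select() for columns (the column names are made up).

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-filter-select").getOrCreate()

df = spark.createDataFrame(
    [("Ana", 85), ("Bo", 42), ("Cy", 90)], ["name", "score"]
)

# filter() keeps rows matching the condition; select() picks the columns to return.
df.filter(df.score > 80).select("name").show()

spark.stop()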

Question 90 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which function in Spark SQL allows you to access Postgres database tables using a specified JDBC
URL?

*A: jdbc
Feedback: "jdbc" is the correct function used to access Postgres database tables via a JDBC URL in
Spark SQL. Well done!

B: readTable

Feedback: "readTable" is not a valid function in Spark SQL for accessing Postgres tables. You might be
confusing it with other data access methods.

C: loadTable

Feedback: "loadTable" is not the correct function for accessing Postgres tables in Spark SQL. Consider
revisiting the function names.

D: getConnection

Feedback: "getConnection" is used differently and not specifically in accessing Postgres tables in Spark
SQL. Try focusing on JDBC-specific functions.
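
A hedged sketch of reading a Postgres table through Spark's jdbc reader; the URL, table name, credentials, and the assumption that the PostgreSQL JDBC driver jar is on the Spark classpath are all placeholders for this example.

from pyspark.sql import SparkSession

# Assumes the PostgreSQL JDBC driver jar is available to Spark.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("jdbc-demo")
         .getOrCreate())

df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/mydb",   # placeholder connection URL
    table="public.sales",                          # placeholder table
    properties={"user": "postgres", "password": "secret",
                "driver": "org.postgresql.Driver"},
)
df.show(5)

spark.stop()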

Question 91 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which Spark transformation is used to apply a function to each element of an RDD and flatten the
results into a new RDD?

*A: FlatMap

Feedback: Correct! FlatMap applies a function to each element and flattens the result.

B: Map

Feedback: Incorrect. Map applies a function to each element but does not flatten the results.

C: Filter

Feedback: Incorrect. Filter selects elements based on a condition but does not apply a function to each
element.

D: Reduce

Feedback: Incorrect. Reduce aggregates elements using a binary operator but does not apply a function
to each element.

Question 92 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark


Which of the following transformations in Spark results in a new RDD with fewer partitions than the
original RDD?

*A: Coalesce

Feedback: Correct! Coalesce is used to reduce the number of partitions in an RDD.

B: Map

Feedback: Incorrect. Map is used for element-wise transformation and does not change the number of
partitions.

C: Filter

Feedback: Incorrect. Filter selects elements but does not alter the number of partitions.

D: FlatMap

Feedback: Incorrect. FlatMap can increase the number of elements but not decrease partitions.
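
As a quick illustration of coalesce reducing partitions (the data and partition counts are arbitrary):

from pyspark import SparkContext

sc = SparkContext(appName="coalesce-demo")
rdd = sc.parallelize(range(100), 10)       # create an RDD with 10 partitions
print(rdd.getNumPartitions())              # 10

smaller = rdd.coalesce(2)                  # reduce to 2 partitions without a full shuffle
print(smaller.getNumPartitions())          # 2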

Question 93 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the primary function of Spark's sliding windows?

*A: To process time-based data streams in chunks.

Feedback: Correct! Sliding windows allow Spark to handle continuous data streams by breaking them
into manageable time-based batches.

B: To compress data for efficient storage.

Feedback: Not quite. While data compression is important, sliding windows are focused on processing
time-based data streams.

C: To provide a graphical interface for data processing.

Feedback: Incorrect. Sliding windows are not related to graphical interfaces.

D: To enhance security by encrypting data streams.

Feedback: This is not correct. Sliding windows do not deal with data encryption or security features.
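
A rough sketch of a sliding-window word count using the DStream API (the host, port, and checkpoint path
are assumptions; window and slide durations are given in seconds):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="window-demo")
ssc = StreamingContext(sc, 1)                    # 1-second micro-batches
ssc.checkpoint("/tmp/checkpoint")                # windowed state requires a checkpoint directory

lines = ssc.socketTextStream("localhost", 9999)  # hypothetical streaming source
pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))

# count words over the last 30 seconds of data, recomputed every 10 seconds
counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10)
counts.pprint()

ssc.start()
ssc.awaitTermination()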

Question 94 - multiple choice, shuffle, easy difficulty


Question category: Module: Big Data Analytics using Spark

What is one of the benefits of using Spark SQL for data processing?

*A: Integration with the Hadoop ecosystem.

Feedback: Correct! Spark SQL is designed to integrate seamlessly with Hadoop, enhancing its
processing capabilities.

B: Provides a new programming language for data analysis.

Feedback: Not exactly. Spark SQL uses SQL-like syntax but isn't a new language.

C: Replaces the need for data storage systems.

Feedback: Incorrect. Spark SQL does not replace data storage systems.

D: Reduces the need for data cleaning.

Feedback: This is not correct. While Spark SQL aids in querying, data cleaning still requires attention.
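
A minimal Spark SQL sketch (the DataFrame and view name are illustrative assumptions); the same engine can
also read data that lives in the Hadoop ecosystem, such as HDFS files or Hive tables:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-demo").getOrCreate()

# Expose a DataFrame as a temporary view and query it with SQL syntax
people = spark.createDataFrame([("alice", 34), ("bob", 19)], ["name", "age"])
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 21").show()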

Question 95 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Analytics using Spark

When working with sensor data using Spark streaming, which of the following operations can be applied
to the data?

*A: Apply computations over a sliding window.

Feedback: Correct! Sliding window computations are often used in streaming data analysis to process
recent data efficiently.

B: Use collect() to bring all data into memory for processing.

Feedback: Incorrect. collect() is not typically used in streaming due to memory constraints and
continuous data flow.

*C: Filter data in real-time using the filter() transformation.

Feedback: Correct! You can filter sensor data in real-time using Spark's filter() transformation.

*D: Aggregate data streams using reduceByKey().

Feedback: Correct! Aggregating data streams using transformations like reduceByKey() is common in
stream processing.
E: Use cache() to store the entire data stream.

Feedback: Incorrect. Caching is not practical for entire streams due to continuous data flow and size
limitations.

Question 96 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Analytics using Spark

When creating computations over a sliding window of streaming data in Spark, which of the following
operations can be utilized?

*A: Aggregation

Feedback: Aggregation is indeed used in sliding window computations. Great choice!

B: Joining static data

Feedback: Joining static data can be performed but is not specific to sliding window computations in
streaming data.

*C: Windowed join

Feedback: Windowed join is applicable in sliding window computations, especially in streaming contexts.
Correct!

D: Checkpointing

Feedback: Checkpointing is a mechanism for fault tolerance in streaming but not specific to sliding
window computations.

E: Filtering

Feedback: Filtering can be used in data preparation but doesn't specifically pertain to sliding window
computations.

Question 97 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Analytics using Spark

Select the steps involved in a typical Spark pipeline ending with a collect action.

*A: Create an RDD from a data source

Feedback: This is correct. The first step in any Spark pipeline is to create an RDD from a data source.
*B: Apply one or more transformations

Feedback: Correct! Transformations are applied to RDDs to create new RDDs.

*C: Execute an action to produce a result

Feedback: Correct! Actions trigger Spark to execute the transformations and return results.

D: Start a Spark session every time a transformation is applied

Feedback: Incorrect. A Spark session is typically started once at the beginning of the pipeline.

E: Convert each RDD to a DataFrame before any action

Feedback: Incorrect. Conversion to DataFrame is optional and not part of every pipeline.
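
A minimal sketch of such a pipeline, ending with the collect action (the input data is an illustrative
in-memory list):

from pyspark import SparkContext

sc = SparkContext(appName="pipeline-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5])        # 1. create an RDD from a data source
squares = rdd.map(lambda x: x * x)           # 2. apply one or more transformations (lazy)
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())                       # 3. the collect action triggers execution -> [4, 16]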

Question 98 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Analytics using Spark

Which of the following are components of Spark's GraphX?

*A: Vertices

Feedback: Correct! Vertices are a fundamental part of GraphX's graph structure.

*B: Edges

Feedback: Correct! Edges connect vertices and are essential to GraphX.

C: Layers

Feedback: Incorrect. GraphX does not use the concept of layers.

D: Nodes

Feedback: Incorrect. GraphX uses vertices, not nodes, to denote points in a graph.

E: Frames

Feedback: Incorrect. Frames are not part of GraphX's architecture.

Question 99 - numeric, easy difficulty

Question category: Module: Big Data Analytics using Spark


If you have an RDD with 10 partitions and you use the coalesce transformation to reduce it, what is the
minimum number of partitions you can reduce it to?

*A: 1.0

Feedback: Correct! Coalesce can reduce the number of partitions to as few as 1.

Default Feedback: Incorrect. Consider how coalesce affects the number of partitions.

Question 100 - numeric, easy difficulty

Question category: Module: Big Data Analytics using Spark

How many categories of techniques does MLlib offer?

*A: 4.0

Feedback: Correct! MLlib offers four main categories of techniques.

Default Feedback: Review the MLlib section on categories of techniques.

Question 101 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the term for a data structure in Spark that is distributed and immutable? Please answer in all
lowercase.

*A: rdd

Feedback: Correct! RDD stands for Resilient Distributed Dataset, which is immutable in Spark.

*B: rdds

Feedback: Correct! RDDs, or Resilient Distributed Datasets, are immutable data structures in Spark.

Default Feedback: Incorrect. Remember that this data structure is distributed and immutable in Spark.

Question 102 - text match, easy difficulty

Question category: Module: Big Data Analytics using Spark

What is the name of Apache Spark's machine learning library? Please answer in all lowercase.

*A: mllib
Feedback: Correct! MLlib is the designated machine learning library for Spark.

*B: ml-lib

Feedback: Correct! While this variant isn't standard, it's accepted here. MLlib is the actual name.

Default Feedback: Refer back to Spark's documentation on machine learning libraries.
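
As a rough illustration, a tiny clustering example using the DataFrame-based API that ships with Spark's
machine learning library (the data points and parameters are arbitrary assumptions):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny illustrative dataset with two obvious clusters
points = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),), (Vectors.dense([9.0, 8.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(points)    # fit a clustering model
print(model.clusterCenters())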

Question 103 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Integration

Which of the following techniques are involved in integrating data from multiple sources in a medical
setting?

*A: Schema mapping

Feedback: Correct! Schema mapping is essential for aligning data structures from different sources.

*B: Data fusion

Feedback: Correct! Data fusion helps in combining data from various sources to create a unified view.

C: Data anonymization

Feedback: Incorrect. While important, data anonymization is not a technique for integrating data from
multiple sources.

*D: Data warehousing

Feedback: Correct! Data warehousing involves collecting and managing data from varied sources in a
centralized repository.

E: Predictive analytics

Feedback: Incorrect. Predictive analytics is used for forecasting and trends, not for data integration.

Question 104 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Integration

Which of the following are features of Splunk?

*A: Real-time data analysis

Feedback: Correct! Splunk provides real-time data analysis capabilities.


*B: Data visualization tools

Feedback: Correct! Splunk includes data visualization tools to represent data in various formats.

C: Requires extensive manual configuration

Feedback: Incorrect. Splunk is designed to minimize manual configuration through its automated
features.

*D: Machine learning integration

Feedback: Correct! Splunk integrates with machine learning for advanced data analysis.

E: Limited to on-premises deployment

Feedback: Incorrect. Splunk can be deployed both on-premises and in the cloud.

Question 105 - text match, easy difficulty

Question category: Module: Big Data Integration

What is the name of the software platform that collects, indexes, and analyzes machine data developed
by Splunk? Please answer in all lowercase.

*A: splunk

Feedback: Correct! Splunk is the software platform developed to handle machine data.

Default Feedback: Incorrect. The platform developed by Splunk to handle machine data is called
Splunk.

Question 106 - numeric, easy difficulty

Question category: Module: Big Data Integration

In a public health system, what percentage range of data accuracy is generally required for effective
disease information exchange?

*A: [95, 100)

Feedback: Correct! High data accuracy, typically in the range of 95-99%, is crucial for effective disease
information exchange.

Default Feedback: Incorrect. Data accuracy is crucial for effective disease information exchange. Please
review the required accuracy levels.
Question 107 - text match, easy difficulty

Question category: Module: Big Data Integration

What is the default port number for accessing Splunk Web? Please answer in all lowercase.

*A: 8000

Feedback: Correct! The default port number for accessing Splunk Web is 8000.

B: 8001

Feedback: Incorrect. The default port number is not 8001.

C: 8080

Feedback: Incorrect. The default port number is not 8080.

D: 8081

Feedback: Incorrect. The default port number is not 8081.

Default Feedback: Incorrect. Please refer to the Splunk documentation for the correct default port
number.

Question 108 - numeric, easy difficulty

Question category: Module: Big Data Integration

In which year was Datameer founded?

*A: 2009.0

Feedback: Correct! Datameer was founded in 2009.

Default Feedback: Incorrect. Please review the history of Datameer.

Question 109 - text match, easy difficulty

Question category: Module: Big Data Integration

What is the name of the Ford-developed platform that provides customizable and modular car data? Please
answer in all lowercase.

*A: openxc
Feedback: Correct! Open XC is the platform that provides customizable and modular car data.

B: open xc

Feedback: Incorrect. The correct name of the platform is 'openxc'.

C: open-xc

Feedback: Incorrect. The correct name of the platform is 'openxc'.

D: open_xc

Feedback: Incorrect. The correct name of the platform is 'openxc'.

Default Feedback: Incorrect. Please review the lesson material on Open XC platforms.

Question 110 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

What is 'Data Fusion' in the context of customer analytics?

*A: The process of combining data from multiple sources to provide a comprehensive view of customer
information.

Feedback: Correct! Data Fusion involves combining data from various sources to get a complete view of
the customer.

B: A technique used to encrypt customer data before analysis.

Feedback: Incorrect. Data Fusion is about combining and integrating data, not encrypting it.

C: A method for storing large amounts of customer data in a data warehouse.

Feedback: Incorrect. Data Fusion is not about data storage but about integrating and analyzing data.

D: An approach to filter out unimportant customer data from large datasets.

Feedback: Incorrect. Data Fusion aims to combine data from multiple sources, not to filter it.

Question 111 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Integration

Which of the following are challenges involved in integrating data from multiple sources in a medical
setting?
*A: Data variety

Feedback: Correct! Data variety is a significant challenge in integrating data from multiple sources.

B: Data velocity

Feedback: Incorrect. While data velocity can be a challenge in some contexts, it is not typically a
primary concern in medical data integration.

*C: Schema mapping

Feedback: Correct! Schema mapping is essential for integrating data from different sources.

D: Data encryption

Feedback: Incorrect. Data encryption is related to data security, not specifically to integrating data from
multiple sources.

*E: Integrated views

Feedback: Correct! Creating integrated views is a key challenge in data integration.

Question 112 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Integration

Which of the following are features or applications of Splunk and Datameer?

*A: Data visualization

Feedback: Correct! Both Splunk and Datameer offer data visualization capabilities.

*B: Real-time data analysis

Feedback: Correct! Real-time data analysis is a key feature of both Splunk and Datameer.

C: Data entry automation

Feedback: Incorrect. Data entry automation is not a primary feature of either Splunk or Datameer.

*D: Machine learning integration

Feedback: Correct! Both Splunk and Datameer support machine learning integration.

E: Only usable for financial data


Feedback: Incorrect. Splunk and Datameer can be used for a wide range of data types, not just financial
data.

Question 113 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Big Data Integration

Which of the following actions can you perform in Splunk to manage your data?

*A: Query data

Feedback: Correct! Querying data is a fundamental action you can perform in Splunk to manage your
data.

*B: Filter data

Feedback: Correct! Filtering data is essential for managing and refining the data you work with in
Splunk.

C: Install a new operating system

Feedback: Incorrect. Installing a new operating system is not related to managing data in Splunk.

*D: Plot data

Feedback: Correct! Plotting data is a crucial step in visualizing and analyzing data in Splunk.

E: Create user accounts

Feedback: Incorrect. While creating user accounts is an administrative task in Splunk, it is not directly
related to managing data.

Question 114 - text match, easy difficulty

Question category: Module: Big Data Integration

What Splunk web interface component allows you to create a visual representation of your data? Please
answer in all lowercase.

*A: pivot

Feedback: Correct! The Pivot component allows you to create visual representations of your data in
Splunk.

*B: pivot interface


Feedback: Correct! The Pivot interface component allows you to create visual representations of your
data in Splunk.

Default Feedback: Incorrect. Please review the components of the Splunk web interface that allow for
data visualization.

Question 115 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

What is the first step to construct a report using Pivot in Splunk?

*A: Access the Pivot interface and select a prebuilt data model.

Feedback: Correct! Accessing the Pivot interface is the initial step to start creating a report in Splunk.

B: Create a bar chart.

Feedback: Not quite. Before you create a bar chart, you need to access the Pivot interface and select a
data model.

C: Segment data by host.

Feedback: Segmenting data comes after you've accessed the Pivot interface and selected a data model.

D: Perform statistical calculations.

Feedback: Performing statistical calculations is part of data analysis, but not the first step in constructing
a report.

Question 116 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

Which key component is essential for the personalized music recommendation project?

*A: Collaborative filtering algorithms

Feedback: Great choice! Collaborative filtering is central to personalized recommendations.

B: Manual playlist curation by music experts

Feedback: This isn't correct. Think about how automation influences personalized recommendations.

C: Exclusive use of SQL databases


Feedback: Not quite. Consider the diverse data processing technologies used in modern projects.

D: Use of only static data sets

Feedback: Incorrect. Reflect on the dynamic nature of data used in personalized recommendations.

Question 117 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

Which of the following describes how Splunk software collects machine data?

*A: Splunk uses input methods to collect data from various sources including logs and metrics.

Feedback: Great job! Splunk indeed collects data via diverse input methods to handle logs and metrics.

B: Splunk requires manual data entry for processing machine data.

Feedback: This isn't correct. Think about how automation plays a role in data collection for big data
tools like Splunk.

C: Splunk relies on third-party applications exclusively to gather machine data.

Feedback: This is not correct. Consider the independence of big data tools in their data collection
processes.

D: Splunk is designed to work only with structured data collected from databases.

Feedback: Not quite. Remember the flexibility and scope of data types that Splunk can handle.

Question 118 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

Why is information integration particularly important for managing customer data in financial services?

*A: Creating a unified customer view

Feedback: Yes, information integration is crucial for a unified customer view, especially in financial
services.

B: Increasing data storage capacity

Feedback: Not quite. Information integration is more about data usage and less about storage capacity.
C: Improving network speed

Feedback: Incorrect. Network speed is not directly related to information integration.

D: Enhancing data encryption methods

Feedback: Incorrect. While security is important, information integration focuses on unification and
accessibility.

Question 119 - multiple choice, shuffle, easy difficulty

Question category: Module: Big Data Integration

What is the primary function of Data Fusion in the context of customer analytics?

*A: Data integration from multiple customer touchpoints

Feedback: This is correct. Data Fusion involves integrating data from various sources to provide a
comprehensive view, especially useful in customer analytics.

B: Storing data in a single database

Feedback: This option is incorrect. Data Fusion is more about integrating data from different sources
rather than just storing it centrally.

C: Cleaning and preparing data for analysis

Feedback: This is not the main focus of Data Fusion. While data cleaning might be involved, fusion
emphasizes integration from diverse sources.

D: Visualizing customer data in dashboards

Feedback: Data Fusion is primarily about integration, not visualization, though the latter may utilize
fused data.

Question 120 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Integration

Which of the following tasks are necessary to successfully install Splunk on a Windows server?

*A: Download the Splunk installer from the official website.

Feedback: Correct! The installer is needed to begin the installation process.

*B: Create a username and password for accessing Splunk Enterprise.


Feedback: Correct! Setting up access credentials is part of the installation process.

C: Import CSV files into Splunk.

Feedback: Importing CSV files is related to data management post-installation, not during the
installation process.

*D: Customize the installation options according to specific requirements.

Feedback: Correct! Customizing the installation is often necessary to meet specific system requirements.

E: Evaluate the relevance of the data model.

Feedback: Evaluating the data model is a separate task related to data analysis, not installation.

Question 121 - checkbox, shuffle, partial credit, medium

Question category: Module: Big Data Integration

Which of the following are crucial for handling data variety and model transformation in exchanging
disease information in public health systems?

*A: Heterogeneous data representation

Feedback: Correct! Public health systems deal with varied data formats which require transformation.

B: Real-time data streaming

Feedback: Incorrect. Although important, real-time streaming is not the focus of data integration
techniques for model transformation.

*C: Semantic data integration

Feedback: Correct! It helps in aligning data from different sources semantically.

D: Automated data cleaning

Feedback: Incorrect. Automated cleaning is useful but not a primary concern in model transformation
for public health data.

E: Data anonymization

Feedback: Incorrect. While anonymization is important, it's not directly related to data variety and
model transformation.

Question 122 - numeric, easy difficulty


Question category: Module: Big Data Integration

At least how many different data sources are typically involved in a complex medical data integration
scenario?

*A: 5.0

Feedback: Correct! A complex medical integration scenario typically involves at least this many different
data sources.

Default Feedback: Think about the complexity and variety of sources in a medical setting.

Question 123 - numeric, easy difficulty

Question category: Module: Big Data Integration

How many steps are typically involved in the process of installing Splunk on a Windows server?

*A: 5.0

Feedback: Correct! Installing Splunk on a Windows server typically involves five main steps.

Default Feedback: Consider reviewing the installation guide for Splunk on Windows.

Question 124 - text match, easy difficulty

Question category: Module: Big Data Integration

What is the term used to describe the open vehicle data platform developed by Ford? Please answer in all
lowercase.

*A: openxc

Feedback: Correct! OpenXC is the platform for creating customizable car data solutions.

Default Feedback: Remember the specific platform name associated with vehicle data in the lesson.

Question 125 - text match, easy difficulty

Question category: Module: Big Data Integration

What Splunk interface allows users to create reports using a prebuilt data model? Please answer in all
lowercase.

*A: pivot

Feedback: Correct! The Pivot interface in Splunk is used for this purpose.
*B: pivotinterface

Feedback: Correct! The Pivot interface in Splunk is used for this purpose.

Default Feedback: Revisit the section on Splunk interfaces for creating reports.

Question 126 - numeric, hard

Question category: Module: Big Data Integration

Identify the range of years during which significant development in personalized music recommendation
projects using DataMirror, Hadoop, and Spark occurred.

*A: [2010, 2020]

Feedback: Correct! These technologies rose to prominence and were applied to personalized music
recommendation projects during this period.

Default Feedback: Reflect on the timeline of when these technologies gained prominence in the
industry.

Question 127 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

Which big data processing engine performs in-memory processing using Resilient Distributed Datasets
(RDDs)?

A: Hadoop MapReduce

Feedback: Incorrect. Hadoop MapReduce does not perform in-memory processing using RDDs.

B: Flink

Feedback: Incorrect. Flink is not the engine that performs in-memory processing using RDDs.

*C: Spark

Feedback: Correct! Spark is the engine that performs in-memory processing using RDDs.

D: Storm

Feedback: Incorrect. Storm does not perform in-memory processing using RDDs.
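
A small sketch of Spark's in-memory processing with RDDs (the input path "data.txt" is an assumed
placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")
words = sc.textFile("data.txt").flatMap(lambda l: l.split(" "))

words.cache()                      # keep the RDD partitions in memory after the first computation
print(words.count())               # first action materializes and caches the RDD
print(words.distinct().count())    # later actions reuse the in-memory data instead of rereading the file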

Question 128 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data


Which of the following best explains the role of dataflow in data science?

A: It defines the sequential execution of data processing tasks.

Feedback: This choice describes a workflow, not dataflow. Review the material on dataflow to
understand its role.

*B: It refers to the movement and transformation of data through a series of processing steps.

Feedback: Correct! Dataflow involves the movement and transformation of data through different stages
in a pipeline.

C: It is the process of collecting raw data from various sources.

Feedback: This choice is about data collection, not dataflow. Make sure you understand the distinct
stages of data handling.

D: It involves storing data in a structured format for easy access.

Feedback: Storing data in a structured format relates to data storage, not dataflow. Review how data
flows and transforms in a pipeline.

Question 129 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Processing Big Data

Which of the following are components of the Spark stack?

*A: Spark SQL

Feedback: Correct! Spark SQL is a component of the Spark stack.

B: HDFS

Feedback: Incorrect. HDFS is part of the Hadoop ecosystem, not the Spark stack.

*C: Spark Streaming

Feedback: Correct! Spark Streaming is a component of the Spark stack.

D: Storm

Feedback: Incorrect. Storm is a separate big data processing engine, not part of the Spark stack.

*E: MLlib
Feedback: Correct! MLlib is a machine learning library that is part of the Spark stack.

Question 130 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Processing Big Data

Which of the following are common analytical operations within big data pipelines?

*A: Filtering

Feedback: Correct! Filtering is a common analytical operation.

B: Data replication

Feedback: Data replication is more of a data management technique than an analytical operation.

*C: Aggregation

Feedback: Correct! Aggregation is a common analytical operation.

*D: Sorting

Feedback: Correct! Sorting is a common analytical operation.

E: Compression

Feedback: Compression is used to reduce data size, not for analysis.
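
A compact sketch showing filtering, aggregation, and sorting together on a toy DataFrame (the table and
column names are illustrative assumptions):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("analytics-demo").getOrCreate()
sales = spark.createDataFrame(
    [("east", 100), ("west", 250), ("east", 75)], ["region", "amount"])

filtered = sales.filter(F.col("amount") > 50)                              # filtering
totals = filtered.groupBy("region").agg(F.sum("amount").alias("total"))    # aggregation
totals.orderBy(F.col("total").desc()).show()                               # sorting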

Question 131 - checkbox, shuffle, partial credit, medium

Question category: Module: Processing Big Data

Which of the following actions are involved in performing WordCount with Spark Python (PySpark)?

*A: Reading a text file into an RDD

Feedback: Correct! Reading a text file into an RDD is a crucial step in performing WordCount.

*B: Initiating a Spark context

Feedback: Correct! Initiating a Spark context is necessary before performing any operations.

*C: Creating key-value pairs

Feedback: Correct! Creating key-value pairs is part of the WordCount implementation.


D: Connecting to a SQL database

Feedback: Incorrect. Connecting to a SQL database is not related to performing WordCount with
PySpark.

E: Starting a Hadoop cluster

Feedback: Incorrect. Starting a Hadoop cluster is not needed for WordCount with PySpark.
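
Putting those steps together, a minimal WordCount sketch in PySpark (the input and output paths are
assumed placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="wordcount")                  # initiate a Spark context
lines = sc.textFile("input.txt")                        # read a text file into an RDD

counts = (lines.flatMap(lambda line: line.split(" "))   # split lines into words
               .map(lambda word: (word, 1))             # create (word, 1) key-value pairs
               .reduceByKey(lambda a, b: a + b))        # sum the counts per word

counts.saveAsTextFile("wordcount_output")               # write the results back out as text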

Question 132 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Processing Big Data

Identify the common data transformations within big data pipelines.

*A: Filtering

Feedback: Correct! Filtering is a common data transformation used to refine datasets.

*B: Sorting

Feedback: Correct! Sorting is often used to organize data within pipelines.

C: Replication

Feedback: Replication is not typically a data transformation process in pipelines. Review the common
transformations used.

*D: Aggregation

Feedback: Correct! Aggregation is a key transformation in reducing and summarizing data.

E: Compression

Feedback: Compression is more about data storage and transmission efficiency rather than a
transformation. Revisit the transformation processes.

Question 133 - checkbox, shuffle, partial credit, medium

Question category: Module: Processing Big Data

Which of the following steps are necessary to perform WordCount using PySpark in JupyterLab?

*A: Read the text file into an RDD

Feedback: Correct! Reading the text file into an RDD is one of the first steps.
*B: Create a SparkConf object

Feedback: Correct! Creating a SparkConf object is necessary to configure the SparkContext.

*C: Start a Spark session using SparkSession.builder()

Feedback: Correct! Starting a Spark session using SparkSession.builder() is a necessary step.

D: Use map() to split lines into words

Feedback: This is incorrect. Splitting lines into words is typically done with flatMap(), which flattens the
per-line word lists into a single RDD of words.

*E: Use reduceByKey() to count occurrences of each word

Feedback: Correct! Using reduceByKey() helps to count the occurrences of each word.

F: Save the results in a CSV file

Feedback: This is incorrect. While saving results is important, a text file is typically used rather than a
CSV file.

Question 134 - text match, easy difficulty

Question category: Module: Processing Big Data

What is the main data structure used by Spark for in-memory computation? Please answer in all
lowercase.

*A: rdd

Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.

*B: rdds

Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.

*C: resilientdistributeddataset

Feedback: Correct! Resilient Distributed Datasets (RDDs) are the main data structure used by Spark.

Default Feedback: Incorrect. Please review the section on Spark's in-memory computation data
structures.

Question 135 - text match, easy difficulty

Question category: Module: Processing Big Data


In PySpark, what function is commonly used to count the occurrences of each word in an RDD? Please
answer in all lowercase.

*A: reducebykey

Feedback: Correct! The reduceByKey function is commonly used for this purpose.

*B: reduce_by_key

Feedback: Correct! This is accepted here as a variant spelling of reduceByKey, the function commonly used
for this purpose.

Default Feedback: Incorrect. Try reviewing the PySpark documentation on counting word occurrences.

Question 136 - numeric, easy difficulty

Question category: Module: Processing Big Data

If you have a text file with 5000 words, and perform WordCount using PySpark, assuming 10% of the
words are unique, how many unique key-value pairs would you expect?

*A: 500.0

Feedback: Correct! With 10% of the words being unique out of 5000 words, you would expect 500
unique key-value pairs.

Default Feedback: Incorrect. Consider the percentage of unique words in the total word count.

Question 137 - numeric, easy difficulty

Question category: Module: Processing Big Data

If an initial dataset contains 50,000 entries and aggregation reduces it to 5,000 entries, what is the
reduction factor?

*A: 10.0

Feedback: Correct! The reduction factor is 10.

Default Feedback: Remember to divide the original number of entries by the reduced number of entries
to find the reduction factor.

Question 138 - text match, easy difficulty

Question category: Module: Processing Big Data


What term describes the practice of summarizing data to make it more manageable? Please answer in all
lowercase.

*A: aggregation

Feedback: Correct! Aggregation helps in summarizing data to make it more manageable.

B: compaction

Feedback: Compaction is not the term we are looking for. Remember the specific term for summarizing
data.

Default Feedback: Review the term that best describes summarizing data to make it more manageable.

Question 139 - numeric, easy difficulty

Question category: Module: Processing Big Data

How many main components are there in the Spark stack?

*A: 4.0

Feedback: Correct! The Spark stack consists of four main components.

Default Feedback: Incorrect. Please review the components of the Spark stack.

Question 140 - text match, easy difficulty

Question category: Module: Processing Big Data

In the context of PySpark, what is the term used for the distributed collection of data? Please answer in
all lowercase.

*A: rdd

Feedback: Correct! RDD stands for Resilient Distributed Dataset and is a fundamental concept in
PySpark.

*B: resilientdistributeddataset

Feedback: Correct! Resilient Distributed Dataset is abbreviated as RDD in PySpark.

Default Feedback: Incorrect. Please review the concept of distributed data collections in PySpark.

Question 141 - numeric, easy difficulty


Question category: Module: Processing Big Data

If a dataset is reduced from 10,000 entries to 1,000 entries through aggregation, what is the reduction
factor?

*A: 10.0

Feedback: Correct! The reduction factor is 10.

Default Feedback: Incorrect. Remember, the reduction factor is the ratio of the original size to the
reduced size.

Question 142 - numeric, easy difficulty

Question category: Module: Processing Big Data

If a text file contains 2000 words and you perform WordCount using PySpark, how many key-value
pairs would you expect to have in the result assuming all words are unique?

*A: 2000.0

Feedback: Correct! If all words are unique, you would have 2000 key-value pairs.

Default Feedback: Incorrect. Remember that each unique word in the text file will produce a unique
key-value pair.

Question 143 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

Which of the following aggregation operations is best suited for counting the number of unique elements
in a dataset?

A: Count

Feedback: Count operation can be used to count the total number of elements, but not unique elements.

B: Sum

Feedback: Sum operation aggregates the total value, but doesn't count unique elements.

*C: Distinct Count

Feedback: Correct! Distinct Count operation is used to count the number of unique elements in a dataset.

D: Average
Feedback: Average operation calculates the mean of values, but doesn't count unique elements.
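
A tiny sketch of a distinct count in PySpark (the sample data is illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="distinct-count-demo")
rdd = sc.parallelize(["a", "b", "a", "c", "b"])

print(rdd.count())              # total number of elements: 5
print(rdd.distinct().count())   # number of unique elements: 3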

Question 144 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Processing Big Data

Select the common analytical operations in big data pipelines:

*A: Filtering

Feedback: Correct! Filtering is a common analytical operation in big data pipelines.

*B: Joining

Feedback: Correct! Joining is also a common analytical operation in big data pipelines.

C: Replication

Feedback: Replication is not typically considered an analytical operation.

*D: Aggregation

Feedback: Correct! Aggregation is a common analytical operation in big data pipelines.

E: Load Balancing

Feedback: Load balancing is related to system performance, not an analytical operation.

Question 145 - checkbox, shuffle, partial credit, medium

Question category: Module: Processing Big Data

Select the steps involved in performing WordCount with Spark Python (PySpark).

*A: Initiate a Spark context

Feedback: Correct! Initiating a Spark context is the first step in performing WordCount with PySpark.

B: Load data into a DataFrame

Feedback: Incorrect. WordCount with PySpark typically involves loading data into an RDD, not a
DataFrame.

*C: Read a text file into an RDD

Feedback: Correct! Reading a text file into an RDD is a crucial step in WordCount with PySpark.
*D: Use the map and reduceByKey functions

Feedback: Correct! The map and reduceByKey functions are used in WordCount with PySpark to
process the data.

E: Save the results to a database

Feedback: Incorrect. In WordCount with PySpark, the results are usually saved to a text file, not a
database.

Question 146 - text match, easy difficulty

Question category: Module: Processing Big Data

What is the main motivation behind the development of Apache Spark? Please answer in all lowercase.

*A: speed

Feedback: Correct! Speed is one of the main motivations behind the development of Apache Spark.

*B: efficiency

Feedback: Correct! Efficiency is one of the main motivations behind the development of Apache Spark.

Default Feedback: Incorrect. Please review the main motivations behind the development of Apache
Spark.

Question 147 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

What is the main purpose of initiating a Spark context in a PySpark application?

*A: To execute job operations over a cluster

Feedback: Correct! The Spark context is used to connect to a cluster and execute operations.

B: To store data in HDFS

Feedback: Not quite. While Spark can interact with HDFS, the Spark context is specifically for
connecting to a cluster.

C: To compile Python code

Feedback: This is incorrect. Spark context is not used for compiling code; it's for executing operations
over a cluster.
D: To visualize data analytics

Feedback: Visualization is not the main purpose of a Spark context. It's primarily for executing
operations.

Question 148 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

What does the term 'dataflow' refer to in the context of big data pipelines?

*A: The movement of data through various processing stages

Feedback: Correct! Dataflow describes how data moves through processing stages.

B: The duplication of data across storage nodes

Feedback: Duplication is not the same as dataflow. Consider how data moves.

C: The security protocols applied to data

Feedback: Dataflow refers to movement, not security measures.

D: The backup and recovery procedures for data

Feedback: Backup and recovery are important, but they aren't what dataflow describes.

Question 149 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

Which of the following operations can be used to compact datasets and reduce their volume in big data
processing?

*A: Aggregation

Feedback: Correct! Aggregation is often used to summarize and reduce the size of datasets.

B: Replication

Feedback: Replication increases data volume by duplicating data, not reducing it.

C: Encryption

Feedback: Encryption secures data but does not reduce its volume.
D: Indexing

Feedback: Indexing improves data retrieval speed but doesn't compact the data.

Question 150 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

Which of the following is a major motivation for the development of Apache Spark?

*A: To provide a unified analytics engine for big data processing

Feedback: Correct! Spark was developed to offer a unified engine for both batch and stream processing.

B: To replace traditional databases entirely

Feedback: Incorrect. Spark complements databases by providing powerful processing capabilities, not
replacing them.

C: To improve storage capacity in data centers

Feedback: Incorrect. Spark focuses on data processing speed and efficiency rather than storage capacity.

D: To offer a proprietary alternative to open-source solutions

Feedback: Incorrect. Spark itself is an open-source project, not a proprietary alternative.

Question 151 - multiple choice, shuffle, easy difficulty

Question category: Module: Processing Big Data

Which of the following big data processing engines is known for its capability to perform in-memory
computations using Resilient Distributed Datasets (RDDs)?

*A: Spark

Feedback: Correct! Spark is well-known for utilizing RDDs for in-memory computations, which
enhances processing speed.

B: Hadoop MapReduce

Feedback: Incorrect. Hadoop MapReduce is primarily used for batch processing and does not utilize in-
memory computations.

C: Storm
Feedback: Incorrect. While Storm offers real-time processing, it doesn't leverage RDDs for in-memory
computation.

D: Beam

Feedback: Incorrect. Beam is designed for data processing pipelines but does not focus on in-memory
computation with RDDs like Spark.

Question 152 - checkbox, shuffle, partial credit, easy difficulty

Question category: Module: Processing Big Data

Which of the following are common analytical operations in big data pipelines?

*A: Join

Feedback: Correct! Joining datasets is a common operation in data analysis.

*B: Sorting

Feedback: Correct! Sorting is frequently used in data analysis to organize data.

C: Encryption

Feedback: Encryption is not an analytical operation but a security measure.

*D: Filtering

Feedback: Correct! Filtering is used to select specific data that meets certain criteria.

E: Replication

Feedback: Replication involves duplicating data, not analyzing it.

Question 153 - numeric, easy difficulty

Question category: Module: Processing Big Data

If a text file contains 1,000,000 words and the WordCount operation takes 10 seconds to complete using
a PySpark job, what is the average number of words processed per second?

*A: 100000.0

Feedback: Correct! The job processes an average of 100,000 words per second.
Default Feedback: Try calculating the number of words processed per second by dividing the total
words by the time taken.

Question 154 - numeric, easy difficulty

Question category: Module: Processing Big Data

If a dataset originally contains 1000 records and an aggregation operation reduces it to 200 records, by
what percentage has the dataset been reduced?

*A: 80.0

Feedback: Correct! Aggregation reduced the dataset size by 80%.

Default Feedback: Consider the reduction formula: \[(\text{original} - \text{new}) / \text{original} \times 100\]

Question 155 - numeric, easy difficulty

Question category: Module: Processing Big Data

If a Spark job's execution time typically ranges between 5.5 and 7.5 minutes, what is a reasonable time (in
minutes) to expect it to complete?

*A: [5.5, 7.5]

Feedback: Good estimation! This range represents the typical execution time for a Spark job.

Default Feedback: Consider the given range of typical execution times to estimate a reasonable
completion time.

Question 156 - text match, easy difficulty

Question category: Module: Processing Big Data

In PySpark, what data structure is typically used to hold key-value pairs for operations? Please answer in
all lowercase.

*A: rdd

Feedback: Correct! RDDs are used to hold data in key-value pairs for operations.

B: dataframe

Feedback: Not quite. DataFrames are used for structured data rather than key-value pairs in Spark.
Default Feedback: Remember to review the data structures used in PySpark for handling key-value
pairs.

Question 157 - text match, easy difficulty

Question category: Module: Processing Big Data

What is the primary abstraction used by Apache Spark for distributed data processing? Please answer in
all lowercase.

*A: rdd

Feedback: That's right! RDDs are the core abstraction in Spark for distributed data processing.

*B: rdds

Feedback: Correct! RDDs (Resilient Distributed Datasets) are indeed the primary abstraction.

Default Feedback: Remember, Spark uses a specific abstraction to manage distributed data processing
efficiently.
