BDA Regular Paper Solution
Solution:
Data Sources and their Likely Structures:
(a) Detailed customer profiles stored in internal systems
Data Structure: Structured Data
Explanation: Customer profiles typically follow a well-defined schema, stored in relational
databases (e.g., SQL), with clear columns such as name, address, purchase history, etc.
Challenges:
i. Scalability: As the company grows, managing large volumes of structured data may
become cumbersome.
ii. Integration: Integrating structured customer data with unstructured data from other
sources may be challenging.
iii. Privacy & Security: Ensuring customer data protection and compliance with data
privacy laws (e.g., GDPR) is critical. 1 Mark
(b) Customer feedback from social media platforms
Data Structure: Unstructured Data
Explanation: Social media data (e.g., tweets, comments) is unstructured and does not
follow a predefined schema. It may contain text, images, videos, etc.
Challenges:
i. Data Volume: Handling large volumes of unstructured data is difficult, especially
for sentiment analysis.
ii. Data Cleaning: Extracting relevant information from noisy, informal text.
iii. Real-time Analysis: Processing feedback in real-time for timely decision-making
can be resource-intensive. 1 Mark
(c) Daily financial transactions recorded by the sales department
Data Structure: Structured Data
Explanation: Financial transactions are typically well-structured and stored in transactional
databases, with clear fields like transaction ID, date, amount, and customer ID.
Challenges:
i. Data Consistency: Ensuring data consistency across multiple branches or
departments.
ii. Real-time Processing: Processing financial data in real-time for monitoring and
fraud detection.
iii. Compliance: Adhering to financial regulations and auditing standards. 1 Mark
b) How many days of data collection are required to fully utilize the hospital's monthly processing capacity?
Days required = Processing capacity per month ÷ Data collected per day
= 100 TB ÷ 5 TB/day = 20 days
Answer: The hospital will need 20 days of data collection to fully utilize its monthly
processing capacity. 2 Marks
c) How many actionable insights are generated per month based on the current modeling
plan?
Actionable insights per model = 200
Number of models trained per month = 15
Total actionable insights generated per month:
Total insights = Actionable insights per model × Number of models
= 200 × 15 = 3,000
Answer: 3,000 actionable insights are generated per month based on the current modeling
plan. 2 Marks
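Both figures can be sanity-checked with a few lines of Python (an illustrative aside, not part of the required answer; the variable names are assumptions):
# Quick check of the two calculations above
processing_capacity_tb = 100  # monthly processing capacity (TB)
data_per_day_tb = 5           # data collected per day (TB)
insights_per_model = 200
models_per_month = 15
print(processing_capacity_tb / data_per_day_tb)  # 20.0 -> 20 days
print(insights_per_model * models_per_month)     # 3000 insights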
Q.4 State whether the following statements are True or False with proper justification.
Answers without proper justification will not be given any marks. [3 Marks]
a) In a MapReduce job, the 'Reduce' tasks are responsible for dividing the input data into
smaller chunks before the 'Map' tasks process them.
b) MapReduce lacks fault tolerance, so if a node fails, the entire job will fail without any task
reassignment.
c) YARN statically allocates resources to applications, which can lead to inefficient cluster
utilization and job execution.
Solution:
a) In a MapReduce job, the 'Reduce' tasks are responsible for dividing the input data into
smaller chunks before the 'Map' tasks process them.
• Answer: False
• Justification: In a MapReduce job, the framework divides the input data into smaller
chunks (input splits) before the 'Map' tasks run; each 'Map' task then processes one split.
The 'Reduce' tasks execute after the 'Map' phase, and their job is to aggregate and
summarize the intermediate output produced by the 'Map' tasks. Splitting is therefore
handled prior to 'Map' execution, not by the 'Reduce' tasks. 1 Mark
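To make this division of labour concrete, below is a minimal word-count sketch in the Hadoop Streaming style; the script names (mapper.py, reducer.py) and the choice of Python streaming scripts are illustrative assumptions, not part of the question.
#!/usr/bin/env python3
# mapper.py (illustrative): each map task receives one input split on stdin
# and only emits intermediate key/value pairs; it does NOT split the input itself.
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py (illustrative): receives the mappers' intermediate output,
# sorted and grouped by key, and aggregates it into final counts.
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
Such scripts would typically be wired up via the hadoop-streaming jar; the exact jar path depends on the installation.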
b) MapReduce lacks fault tolerance, so if a node fails, the entire job will fail without any task
reassignment.
• Answer: False
• Justification: MapReduce provides built-in fault tolerance. If a node fails during the
execution of a job, the task that was running on that node is automatically reassigned to
another available node. This ensures that a node failure does not result in the failure of the
entire job, and MapReduce can continue to execute by rerunning failed tasks on other nodes
in the cluster. 1 Mark
c) YARN statically allocates resources to applications, which can lead to inefficient cluster
utilization and job execution.
• Answer: False
• Justification: YARN (Yet Another Resource Negotiator) dynamically allocates resources
to applications based on their requirements and the availability of resources in the cluster.
This dynamic allocation enables efficient cluster utilization by adjusting resource
allocation as needed during job execution, ensuring that resources are used effectively and
jobs are completed in a timely manner. 1 Mark
Q.5 You are a Hadoop administrator at a large organization that has recently adopted Hadoop for storing
and analysing its big data. The organization is planning to conduct a data science competition among
its employees to encourage innovative data analysis ideas. As the Hadoop administrator, your task
is to set up the necessary infrastructure for the competition. The competition requires participants to
store their data on HDFS, run MapReduce jobs to process the data, and present their results to the
judges. You need to create a directory on HDFS for each participant, set appropriate permissions to
ensure privacy, and provide the necessary commands for uploading, downloading, and managing
the data.
Your task is to write commands for the following:
a) Verify the Hadoop version and that the Hadoop daemon components are active. [0.5 Mark]
b) Create a new directory named RollNumber in the root directory of the HDFS file system. [0.5 Mark]
c) Create 3 different files named sample.txt, sample2.csv, sample3.tsv. [1 Mark]
d) Upload all created files from the local file system to HDFS. [1 Mark]
e) List the contents of a directory in HDFS. [0.5 Mark]
f) Download a file named sample3.tsv from HDFS to the local file system. [1 Mark]
g) Remove a file named sample.txt from HDFS. [0.5 Mark]
h) Change the permissions of a file named sample2.csv in HDFS. [1 Mark]
Solution:
a) Verify the Hadoop version and that the Hadoop daemon components are active. [0.5 Mark]
# Command to verify Hadoop version
hadoop version
# Command to verify that Hadoop daemons (NameNode, DataNode, etc.) are active
jps
b) Create a new directory named RollNumber in the root directory of the HDFS file system. [0.5 Mark]
hadoop fs -mkdir /RollNumber
c) Create 3 different files named sample.txt, sample2.csv, sample3.tsv. [1 Mark]
# Create three empty files on the local file system
touch sample.txt
touch sample2.csv
touch sample3.tsv
d) Upload all created files from the local file system to HDFS. [1 Mark]
# -put copies the local files into the given HDFS directory
hadoop fs -put sample.txt sample2.csv sample3.tsv /RollNumber/
e) List the contents of a directory in HDFS. [0.5 Mark]
hadoop fs -ls /RollNumber
f) Download a file named sample3.tsv from HDFS to the local file system. [1 Mark]
hadoop fs -get /RollNumber/sample3.tsv ./
g) Remove a file named sample.txt from HDFS. [0.5 Mark]
hadoop fs -rm /RollNumber/sample.txt
h) Change the permissions of a file named sample2.csv in HDFS. [1 Mark]
# 700 = read/write/execute for the owner only, keeping the participant's file private
hadoop fs -chmod 700 /RollNumber/sample2.csv
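To confirm that the new mode took effect, the file can be listed again (an optional check, not part of the marked answer):
# Should now show -rwx------ for sample2.csv
hadoop fs -ls /RollNumber/sample2.csv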
Q.6 Scenario:
A bank named XYZ Bank wants to create a new table in their Hive data warehouse to store customer
transaction details. The table should include the following fields: transaction_id (INT), customer_id
(INT), transaction_amount (DOUBLE), and transaction_date (STRING). After creating the table,
the bank needs to insert the following transaction data into the table:
• Transaction 1: transaction_id = 101, customer_id = 501, transaction_amount = 1500.75,
transaction_date = '2024-09-01'
• Transaction 2: transaction_id = 102, customer_id = 502, transaction_amount = 2500.00,
transaction_date = '2024-09-02'
Write the HiveQL commands to:
a) Create the customer_transactions table with the specified fields. [2 Marks]
b) Insert the two transactions into the customer_transactions table. [2 Marks]
Solution:
a) Create the customer_transactions table with the specified fields. [2 Marks]
CREATE TABLE customer_transactions (
    transaction_id INT,
    customer_id INT,
    transaction_amount DOUBLE,
    transaction_date STRING
);
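b) Insert the two transactions into the customer_transactions table. [2 Marks]
-- INSERT ... VALUES is supported from Hive 0.14 onwards; on older versions
-- the rows would have to be loaded from a file or a staging table instead.
INSERT INTO TABLE customer_transactions VALUES
    (101, 501, 1500.75, '2024-09-01'),
    (102, 502, 2500.00, '2024-09-02');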
Q.7 Scenario:
A university named BITS Pilani University maintains a MongoDB collection called students to store
information about their students. Each document in the collection contains the following fields:
student_id, name, major, and year_of_enrollment. The university wants to:
• Insert a new student record into the collection.
• Update the major of an existing student.
• Retrieve the details of a specific student by their student_id.
• Delete a student record from the collection.
Write the MongoDB commands to:
a) Insert a new student with student_id = 1001, name = "John Doe", major = "Computer Science",
and year_of_enrollment = 2022. [1.5 Marks]
b) Update the major of the student with student_id = 1001 to "Data Science". [1.5 Marks]
Solution:
a) Insert a new student with student_id = 1001, name = "John Doe", major = "Computer Science", and year_of_enrollment = 2022. [1.5 Marks]
db.students.insertOne({
student_id: 1001,
name: "John Doe",
major: "Computer Science",
year_of_enrollment: 2022
});
b) Update the major of the student with student_id = 1001 to "Data Science". [1.5 Marks]
db.students.updateOne(
{ student_id: 1001 },
{ $set: { major: "Data Science" } } );
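The scenario also mentions retrieving and deleting a student record. Those operations are not marked above, but for completeness they would look like the following sketch, using the standard findOne and deleteOne methods:
// Retrieve the details of the student with student_id = 1001
db.students.findOne({ student_id: 1001 });
// Delete the student record with student_id = 1001
db.students.deleteOne({ student_id: 1001 });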
******