BDA Regular Paper Solution


Birla Institute of Technology & Science, Pilani

Work-Integrated Learning Programmes Division


Semester 1- 2024-2025
Mid-Semester Regular (EC-2)

Course No. : BA ZG525


Course Title : BIG DATA ANALYTICS
Nature of Exam : Closed Book
Weightage : 30% No. of Pages = 2
Duration : 2 Hours No. of Questions = 7
Date of Exam : 22 September, 2024 - 01:00 PM
Note:
1. Please follow all the Instructions to Candidates given on the cover page of the answer book.
2. All parts of a question should be answered consecutively. Each answer should start from a fresh page.
3. Assumptions made if any, should be stated clearly at the beginning of your answer.

Q.1 Case Study:


XYZ Corporation, a multinational company, handles various types of data across its operations. The
data management team has identified several key data sources:
(a) Detailed customer profiles stored in their internal systems.
(b) A large volume of customer feedback gathered from social media platforms.
(c) Daily financial transactions recorded by the sales department.
(d) Data exchanged between systems using XML files.
Given the nature of the data sources above, identify the type of structure each data source likely
follows. Discuss the specific challenges XYZ Corporation might encounter when managing and
analyzing each type of data. [4 Marks]

Solution:
Data Sources and their Likely Structures:
(a) Detailed customer profiles stored in internal systems
Data Structure: Structured Data
Explanation: Customer profiles typically follow a well-defined schema, stored in relational
databases (e.g., SQL), with clear columns such as name, address, purchase history, etc.
Challenges:
i. Scalability: As the company grows, managing large volumes of structured data may
become cumbersome.
ii. Integration: Integrating structured customer data with unstructured data from other
sources may be challenging.
iii. Privacy & Security: Ensuring customer data protection and compliance with data
privacy laws (e.g., GDPR) is critical. 1 Mark
(b) Customer feedback from social media platforms
Data Structure: Unstructured Data
Explanation: Social media data (e.g., tweets, comments) is unstructured and does not
follow a predefined schema. It may contain text, images, videos, etc.
Challenges:
i. Data Volume: Handling large volumes of unstructured data is difficult, especially
for sentiment analysis.
ii. Data Cleaning: Extracting relevant information from noisy, informal text.
iii. Real-time Analysis: Processing feedback in real-time for timely decision-making
can be resource-intensive. 1 Mark
(c) Daily financial transactions recorded by the sales department
Data Structure: Structured Data
Explanation: Financial transactions are typically well-structured and stored in transactional
databases, with clear fields like transaction ID, date, amount, and customer ID.
Challenges:
i. Data Consistency: Ensuring data consistency across multiple branches or
departments.
ii. Real-time Processing: Processing financial data in real-time for monitoring and
fraud detection.
iii. Compliance: Adhering to financial regulations and auditing standards. 1 Mark

(d) Data exchanged between systems using XML files


Data Structure: Semi-structured Data
Explanation: XML data is semi-structured as it follows a loose schema defined by tags,
making it more flexible than fully structured data but more organized than unstructured
data.
Challenges:
i. Parsing Complexity: Parsing large XML files can be computationally expensive.
ii. Integration: Combining semi-structured XML data with structured databases or
unstructured sources may require complex transformations.
iii. Data Validation: Ensuring data accuracy and integrity when transferring data
between systems. 1 Mark
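
As an illustration of the parsing challenge in (d), here is a minimal sketch in Python using the standard-library xml.etree module. The customer record shown is a hypothetical example of the kind of data XYZ Corporation might exchange between systems, not taken from the case study.

import xml.etree.ElementTree as ET

# Hypothetical customer record exchanged between systems; the fields are illustrative only.
xml_doc = """
<customer id="501">
    <name>John Doe</name>
    <feedback>Great service</feedback>
</customer>
"""

# The tags act as a loose, self-describing schema, so the record can be parsed
# without a fixed relational layout.
root = ET.fromstring(xml_doc)
print(root.get("id"), root.findtext("name"), root.findtext("feedback"))

For very large XML files, an incremental parser such as ET.iterparse avoids loading the whole document into memory, which is one way to address the parsing-complexity challenge noted above.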

Q.2 Case Study:


ABC Financial Services, a leader in the financial sector, relies heavily on big data to enhance its
operations and customer service. The company focuses on the 5 V's of big data—Volume, Velocity,
Variety, Veracity, and Value—to manage and analyze the vast amounts of data generated daily.
Each of these aspects presents unique challenges that must be addressed to successfully leverage
big data for better decision-making and service delivery.
Discuss how ABC Financial Services can effectively manage the challenges associated with each
of the 5 V's of big data. Provide specific strategies or technologies that the company might employ
to overcome these challenges and maximize the value derived from its big data initiatives.
[4 marks]
Solution:
ABC Financial Services can effectively manage the challenges associated with the 5 V's of big data
by employing specific strategies and technologies:
1. Volume:
o Challenge: The large scale of data generated daily by financial transactions, customer
interactions, and market fluctuations is overwhelming for traditional systems.
o Strategy: Implement distributed storage systems like Hadoop Distributed File System
(HDFS) and cloud-based solutions (e.g., AWS S3) to store massive datasets. Leverage
NoSQL databases (e.g., Cassandra, MongoDB) to handle the scalability of data storage.
o Benefit: This ensures efficient storage and retrieval of vast datasets, enabling better
analysis without performance bottlenecks.
2. Velocity:
o Challenge: Financial data is generated at high speeds (real-time trading, transactions,
and customer interactions). Handling real-time data streaming is crucial to providing
timely insights and decisions.
o Strategy: Use stream processing platforms like Apache Kafka and Apache Flink to
process data in real time. Implement real-time analytics tools like Apache Spark for
instantaneous decision-making and fraud detection (a minimal streaming sketch follows
this answer).
o Benefit: These technologies allow ABC Financial to manage high-speed data
effectively and respond in real-time, improving customer service and fraud prevention.
3. Variety:
o Challenge: Financial data comes in different formats, such as structured data
(transaction records), semi-structured data (XML files), and unstructured data
(customer feedback from social media).
o Strategy: Utilize data integration platforms and tools like Apache NiFi or Talend to
ingest, process, and transform data from multiple formats into a unified system. For
unstructured data, deploy Natural Language Processing (NLP) tools and text mining
techniques to analyze social media data.
o Benefit: Managing multiple data types enables holistic analysis, including customer
sentiment analysis and pattern recognition from structured financial data.
4. Veracity:
o Challenge: Ensuring the accuracy and reliability of data is a critical issue in the
financial sector, as incorrect data can lead to poor decision-making.
o Strategy: Implement data quality management tools, such as Apache Griffin or Talend
Data Quality, to monitor and improve data accuracy. Use data validation techniques
and establish governance policies for data entry and integrity.
o Benefit: Enhanced data veracity improves trust in analytical outcomes, reducing errors
in financial reporting, risk assessment, and customer service.
5. Value:
o Challenge: Extracting actionable insights from big data is essential to gain a
competitive advantage, but it requires the right tools and expertise.
o Strategy: Use machine learning platforms like TensorFlow or H2O.ai to derive
predictive insights from data. Implement BI tools (e.g., Tableau, Power BI) to visualize
data and make it accessible to decision-makers.
o Benefit: Maximizing data value leads to better decision-making, personalized customer
experiences, and optimized operational efficiency.
1 X 4 = 4 Marks
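
To make the Velocity strategy concrete, the following is a minimal sketch of real-time processing with Spark Structured Streaming reading from Kafka, as mentioned in point 2 above. The topic name "transactions", the broker address, and the one-minute window are hypothetical, and the Kafka source additionally assumes the spark-sql-kafka connector package is available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Hypothetical sketch: count incoming transaction events per one-minute window.
spark = SparkSession.builder.appName("TxnVelocitySketch").getOrCreate()

# Read a stream from a Kafka topic (topic name and broker address are placeholders).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load())

# Each Kafka record arrives with a binary value and a timestamp column.
counts = (events.selectExpr("CAST(value AS STRING) AS txn", "timestamp")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Write running counts to the console; in production this could feed a fraud-detection sink.
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()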
Q.3 In a healthcare analytics project, a hospital is using big data analytics to improve patient outcomes.
The project involves four core components: Data Collection, Processing, Modeling, and Decision-
Making. Here are the details:
1. Data Collection: The hospital collects patient data from various sources, including
electronic health records (EHRs), wearable devices, and laboratory results. Each day, the
hospital collects 5 terabytes (TB) of data.
2. Processing: The processing system can handle 100 terabytes of data per month, efficiently
cleaning and transforming it for analysis.
3. Modeling: After processing, the data is used to build predictive models. Each model
requires 2 terabytes of processed data to train effectively. The hospital plans to train 15
models per month.
4. Decision-Making: The insights from the models are used for decision-making. On average,
each model generates 200 actionable insights.
Given these parameters, calculate the following:
a) How many days of data collection will the hospital need to fully utilize its monthly processing
capacity? [ 2Marks]
b) How much processed data is used per month for modeling? [ 2Marks]
c) How many actionable insights are generated per month based on the current modeling plan?
[ 2Marks]
Solution:
a) How many days of data collection will the hospital need to fully utilize its monthly
processing capacity?

Data collected per day = 5 TB


Processing capacity per month = 100 TB
To find the number of days required to collect 100 TB:

Days required = Processing capacity per month / Data collected per day
= 100 TB / 5 TB per day = 20 days
Answer: The hospital will need 20 days of data collection to fully utilize its monthly
processing capacity. 2 Marks

b) How much processed data is used per month for modeling?

Data required per model = 2 TB


Number of models trained per month = 15
Total data used for modeling per month:
Total data used = Data required per model × Number of models trained per month
= 2 TB × 15 = 30 TB
Answer: 30 TB of processed data is used per month for modeling. 2 Marks

c) How many actionable insights are generated per month based on the current modeling
plan?
Actionable insights per model = 200
Number of models trained per month = 15
Total actionable insights generated per month:
Total insights = Actionable insights per model × Number of models
= 200 × 15 = 3,000
Answer: 3,000 actionable insights are generated per month based on the current modeling
plan. 2 Marks
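
The three results above can be reproduced with a short calculation; this is just a sketch of the arithmetic, with all figures taken directly from the question.

data_per_day_tb = 5          # TB collected per day
capacity_tb = 100            # TB processed per month
data_per_model_tb = 2        # TB of processed data needed per model
models_per_month = 15
insights_per_model = 200

days_needed = capacity_tb / data_per_day_tb                  # (a) 20 days
modeling_data_tb = data_per_model_tb * models_per_month      # (b) 30 TB
insights_per_month = insights_per_model * models_per_month   # (c) 3,000 insights
print(days_needed, modeling_data_tb, insights_per_month)     # 20.0 30 3000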
Q.4 State whether the following statements are True or False with proper justification.
Answers without proper justification will not be given any marks. [ 3Marks]
a) In a MapReduce job, the 'Reduce' tasks are responsible for dividing the input data into
smaller chunks before the 'Map' tasks process them.
b) MapReduce lacks fault tolerance, so if a node fails, the entire job will fail without any task
reassignment.
c) YARN statically allocates resources to applications, which can lead to inefficient cluster
utilization and job execution.
Solution:
a) In a MapReduce job, the 'Reduce' tasks are responsible for dividing the input data into
smaller chunks before the 'Map' tasks process them.
• Answer: False
• Justification: In a MapReduce job, the 'Map' tasks are responsible for processing the input
data by dividing it into smaller chunks. The 'Reduce' tasks occur after the 'Map' phase, and
their job is to aggregate and summarize the intermediate output produced by the 'Map'
tasks. The division of input data into smaller chunks (splitting) is handled before the 'Map'
tasks execute, not by the 'Reduce' tasks. 1 Mark
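
To make the division of labour between the two phases concrete, here is a minimal word-count sketch written for Hadoop Streaming in Python; the file names mapper.py and reducer.py are illustrative. The framework splits the input and feeds records to the mapper, while the reducer only aggregates the sorted intermediate pairs.

# mapper.py (illustrative): reads records from its input split on stdin, emits (word, 1) pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (illustrative): receives pairs sorted by key and aggregates the counts.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

On a cluster, these two scripts would typically be submitted through the Hadoop Streaming jar with the -mapper and -reducer options; the exact jar path depends on the installation.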
b) MapReduce lacks fault tolerance, so if a node fails, the entire job will fail without any task
reassignment.
• Answer: False
• Justification: MapReduce provides built-in fault tolerance. If a node fails during the
execution of a job, the task that was running on that node is automatically reassigned to
another available node. This ensures that a node failure does not result in the failure of the
entire job, and MapReduce can continue to execute by rerunning failed tasks on other nodes
in the cluster. 1 Mark
c) YARN statically allocates resources to applications, which can lead to inefficient cluster
utilization and job execution.
• Answer: False
• Justification: YARN (Yet Another Resource Negotiator) dynamically allocates resources
to applications based on their requirements and the availability of resources in the cluster.
This dynamic allocation enables efficient cluster utilization by adjusting resource
allocation as needed during job execution, ensuring that resources are used effectively and
jobs are completed in a timely manner. 1 Mark
Q.5 You are a Hadoop administrator at a large organization that has recently adopted Hadoop for storing
and analysing its big data. The organization is planning to conduct a data science competition among
its employees to encourage innovative data analysis ideas. As the Hadoop administrator, your task
is to set up the necessary infrastructure for the competition. The competition requires participants to
store their data on HDFS, run MapReduce jobs to process the data, and present their results to the
judges. You need to create a directory on HDFS for each participant, set appropriate permissions to
ensure privacy, and provide the necessary commands for uploading, downloading, and managing
the data.
Your task is to write commands for following:
a) Verify Hadoop version and Hadoop daemon components are active. [0.5Mark]
b) Create a new directory with RollNumber as name in the root directory of the HDFS file system.
[0.5Mark]
c) Create 3 different files named as sample.txt, sample2.csv, sample3.tsv [1Mark]
d) Uploads all created files from the local file system to HDFS. [1Mark]
e) Lists the contents of a directory in HDFS. [0.5Mark]
f) Downloads a file named sample3.tsv from HDFS to the local file system. [1Marks]
g) Removes a file named sample.txt from HDFS. [0.5Mark]
h) Change the permissions of a file named sample2.csv in HDFS. [1Marks]

Solution
a) Verify Hadoop version and Hadoop daemon components are active. [0.5Mark]
# Command to verify Hadoop version
hadoop version

# Command to verify that Hadoop daemons (NameNode, DataNode, etc.) are active
jps
b) Create a new directory with RollNumber as name in the root directory of the HDFS file system.
[0.5Mark]
hadoop fs -mkdir /RollNumber
c) Create 3 different files named as sample.txt, sample2.csv, sample3.tsv [1Mark]
touch sample.txt
touch sample2.csv
touch sample3.tsv
d) Uploads all created files from the local file system to HDFS. [1Mark]
hadoop fs -put sample.txt sample2.csv sample3.tsv /RollNumber/
e) Lists the contents of a directory in HDFS. [0.5Mark]
hadoop fs -ls /RollNumber
f) Downloads a file named sample3.tsv from HDFS to the local file system. [1Marks]
hadoop fs -get /RollNumber/sample3.tsv ./
g) Removes a file named sample.txt from HDFS. [0.5Mark]
hadoop fs -rm /RollNumber/sample.txt
h) Change the permissions of a file named sample2.csv in HDFS. [1Marks]
hadoop fs -chmod 700 /RollNumber/sample2.csv
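
Since the scenario calls for one private directory per participant, the individual commands above could also be scripted. The following is a minimal sketch in Python; the roll numbers are placeholders, and it assumes the hadoop client is on the PATH and that the script runs with sufficient HDFS privileges.

import subprocess

# Hypothetical list of participant roll numbers.
participants = ["2024A0001", "2024A0002", "2024A0003"]

for roll in participants:
    path = f"/{roll}"
    # Create the participant's directory in the HDFS root (-p: no error if it already exists).
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", path], check=True)
    # Restrict the directory to its owner only, for privacy during the competition.
    subprocess.run(["hadoop", "fs", "-chmod", "700", path], check=True)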
Q.6 Scenario:
A bank named XYZ Bank wants to create a new table in their Hive data warehouse to store customer
transaction details. The table should include the following fields: transaction_id (INT), customer_id
(INT), transaction_amount (DOUBLE), and transaction_date (STRING). After creating the table,
the bank needs to insert the following transaction data into the table:
• Transaction 1: transaction_id = 101, customer_id = 501, transaction_amount = 1500.75,
transaction_date = '2024-09-01'
• Transaction 2: transaction_id = 102, customer_id = 502, transaction_amount = 2500.00,
transaction_date = '2024-09-02'
Write the HiveQL commands to:
a) Create the customer_transactions table with the specified fields. [2Marks]
b) Insert the two transactions into the customer_transactions table. [2Marks]

Solution
a) Create the customer_transactions table with the specified fields. [2Marks]
CREATE TABLE customer_transactions (
transaction_id INT,
customer_id INT,
transaction_amount DOUBLE,
transaction_date STRING );

b) Insert the two transactions into the customer_transactions table. [2Marks]


INSERT INTO customer_transactions VALUES
(101, 501, 1500.75, '2024-09-01'),
(102, 502, 2500.00, '2024-09-02');

Q.7 Scenario:
A university named BITS Pilani University maintains a MongoDB collection called students to store
information about their students. Each document in the collection contains the following fields:
student_id, name, major, and year_of_enrollment. The university wants to:
• Insert a new student record into the collection.
• Update the major of an existing student.
• Retrieve the details of a specific student by their student_id.
• Delete a student record from the collection.
Write the MongoDB commands to:
a) Insert a new student with student_id = 1001, name = "John Doe", major = "Computer Science",
and year_of_enrollment = 2022. [1.5Marks]
b) Update the major of the student with student_id = 1001 to "Data Science". [1.5Marks]

Solution:
a) Insert a new student with student_id = 1001, name = "John Doe", major =
"Computer Science", and year_of_enrollment = 2022.
[1.5Marks]
db.students.insertOne({
student_id: 1001,
name: "John Doe",
major: "Computer Science",
year_of_enrollment: 2022
});

b) Update the major of the student with student_id = 1001 to "Data Science".
[1.5Marks]
db.students.updateOne(
{ student_id: 1001 },
{ $set: { major: "Data Science" } } );

******
