Big Data Visualization
Big Data and Visualisation Question Papers

1. What are the 5 V’s of Big Data?

Answer:
The 5 V's of Big Data are critical characteristics that define its complexity and challenges:

 Volume: Refers to the vast amounts of data generated every second from various
sources, including social media, sensors, and transactions. Organizations must manage
and analyze this massive volume effectively.
 Velocity: This denotes the speed at which data is generated and needs to be processed.
For example, real-time data processing is essential for applications like fraud detection
and social media analytics.
 Variety: Big Data comes in different formats, such as structured data (e.g., databases),
semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images).
Handling this variety requires flexible storage and processing systems.
 Veracity: Refers to the quality and accuracy of the data. High veracity data is crucial for
making reliable business decisions, while low veracity can lead to incorrect conclusions.
 Value: This indicates the potential insights and benefits derived from analyzing Big
Data. The goal is to extract actionable insights that can drive business strategies and
decisions.

2. Explain HDFS.

Answer:
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop,
designed to handle large files and provide high throughput access to application data.

 Key Features:
o Scalability: HDFS can scale horizontally by adding more nodes to the cluster.
o Fault Tolerance: Data is divided into blocks (default size 128 MB) and
replicated across multiple DataNodes. This replication ensures that data remains
accessible even if some nodes fail.
o High Throughput: HDFS is optimized for high throughput, making it suitable
for batch processing of large datasets.
 Architecture Components:
o NameNode: The master server that manages the metadata and namespace of the
file system. It keeps track of which blocks are stored on which DataNodes.
o DataNodes: These are the worker nodes that store the actual data blocks. They
periodically send heartbeat signals to the NameNode to confirm their status.
o Secondary NameNode: Periodically merges the NameNode's namespace image with the edit log (checkpointing) so the edit log does not grow unbounded; it is not a hot standby for the NameNode.
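
To make the block size and replication numbers concrete, here is a small pure-Python illustration (not HDFS code) of how a file maps onto 128 MB blocks and how many block replicas the cluster ends up storing, assuming the default block size and a replication factor of 3:

import math

BLOCK_SIZE_MB = 128      # HDFS default block size
REPLICATION_FACTOR = 3   # HDFS default replication factor

def hdfs_block_count(file_size_mb):
    # A file is split into fixed-size blocks; the last block may be smaller.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

file_size_mb = 1000  # example: a 1000 MB file
blocks = hdfs_block_count(file_size_mb)
replicas = blocks * REPLICATION_FACTOR
print(f"{file_size_mb} MB -> {blocks} blocks -> {replicas} stored block replicas")
# 1000 MB -> 8 blocks -> 24 stored block replicas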

3. What is NoSQL?
Answer:
NoSQL refers to a broad class of database management systems that do not adhere strictly to the
traditional relational database model. They are designed for scalability and flexibility, especially
for large volumes of structured and unstructured data.

 Key Types of NoSQL Databases:


o Key-Value Stores: Store data as key-value pairs (e.g., Redis). Ideal for session
management and caching.
o Document Stores: Store data in documents, typically in JSON or BSON format
(e.g., MongoDB). Suitable for applications with varying data structures.
o Column-Family Stores: Organize data into columns and rows, optimized for
reading and writing large volumes of data (e.g., Cassandra).
o Graph Databases: Focus on the relationships between data points (e.g., Neo4j).
Used in social networks and recommendation engines.
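
To make the key-value model above concrete, the following is a minimal in-memory sketch in Python. It only mimics basic get/put/delete semantics; real stores such as Redis add persistence, expiry, and distribution across nodes.

class KeyValueStore:
    """Toy in-memory key-value store illustrating the NoSQL key-value model."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value            # overwrite semantics, like SET

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Session-management / caching style usage (keys and values are invented):
store = KeyValueStore()
store.put("session:42", {"user": "alice", "logged_in": True})
print(store.get("session:42"))
store.delete("session:42")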

4. What is Hive and its architecture?

Answer:
Hive is a data warehousing infrastructure built on top of Hadoop, designed to facilitate querying
and managing large datasets using HiveQL, a SQL-like language.

 Architecture Components:
o Metastore: A centralized repository that stores metadata about tables, partitions,
and schemas. This metadata is essential for query optimization and execution.
o Driver: Manages the lifecycle of a HiveQL query, from compilation to execution.
o Compiler: Converts HiveQL queries into a series of MapReduce jobs or other
execution plans, optimizing for performance.
o Execution Engine: Executes the generated execution plans using Hadoop
MapReduce, Tez, or Spark.
 Example:

CREATE TABLE employees (id INT, name STRING, salary FLOAT);

This command creates a new table in Hive, which is stored in HDFS, with metadata
managed in the Metastore.

5. What is Apache Pig, and how does it work?

Answer:
Apache Pig is a high-level platform for creating programs that run on Hadoop, primarily using a
language called Pig Latin.
 Key Components:
o Pig Latin: A scripting language that simplifies writing MapReduce programs. It
allows users to express data transformations without dealing with the complexities
of Java code.
o Execution Engine: Converts Pig Latin scripts into a series of MapReduce jobs
for execution on the Hadoop cluster.
o Grunt Shell: An interactive shell for running Pig commands and testing Pig Latin
scripts.
 How it Works:

1. Users write Pig Latin scripts describing data transformations.


2. The Pig interpreter parses the script and generates a logical plan.
3. The optimizer optimizes this plan, producing an efficient execution plan.
4. Finally, the execution engine runs the MapReduce jobs on the Hadoop cluster.

6. Explain the significance of Apache Kafka.

Answer:
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and
streaming applications.

 Key Features:
o High Throughput: Kafka can handle thousands of messages per second, making
it suitable for high-volume data streams.
o Durability: Messages are stored on disk and replicated across multiple brokers,
ensuring data persistence and fault tolerance.
o Scalability: Kafka can be scaled horizontally by adding more brokers to the
cluster.
o Real-time Processing: It allows for processing streams of data in real time,
enabling immediate insights and actions based on incoming data.
 Architecture Components:
o Producers: Applications that publish messages to Kafka topics.
o Topics: Categories for messages, which can be partitioned for load balancing.
o Consumers: Applications that subscribe to topics and process the messages.
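
As an illustration of the producer/consumer roles above, the sketch below uses the third-party kafka-python client; the broker address (localhost:9092) and topic name ('events') are assumptions made for the example, and a running Kafka broker is required.

from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes messages to the 'events' topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user_clicked:42")
producer.flush()  # ensure the message is actually sent before exiting

# Consumer: subscribes to the same topic and processes incoming messages.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the topic
)
for message in consumer:
    print(message.value)            # raw bytes published by the producer
    break                           # a real pipeline would keep consuming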

7. What is YARN, and what are its components?

Answer:
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop,
enabling multiple data processing engines to run on a single cluster.

 Key Components:
o ResourceManager: The master daemon responsible for managing resources
across the cluster and scheduling applications.
o NodeManager: A per-node daemon that monitors resource usage (CPU,
memory) and manages the execution of tasks on that node.
o ApplicationMaster: A per-application daemon that negotiates resources from the
ResourceManager and coordinates the execution of tasks.
 How it Works:

1. Users submit applications to the ResourceManager.


2. The ResourceManager allocates resources and launches the ApplicationMaster for
each application.
3. The ApplicationMaster negotiates with the ResourceManager for resources and
monitors the execution of tasks.

8. Explain the CAP Theorem in the context of NoSQL databases.

Answer:
The CAP Theorem states that in a distributed data store, it is impossible to simultaneously
guarantee all three of the following properties:

 Consistency: Every read receives the most recent write for a given piece of data.
 Availability: Every request receives a response, either success or failure.
 Partition Tolerance: The system continues to operate despite network partitions.

Example:

 HBase prioritizes consistency and partition tolerance (CP) at the expense of availability
during a network partition.
 Cassandra, on the other hand, emphasizes availability and partition tolerance (AP),
potentially sacrificing consistency.

9. Discuss the differences between Structured, Semi-Structured, and Unstructured data.

Answer:

 Structured Data:
o Organized in a fixed format, typically in rows and columns (e.g., relational
databases).
o Easy to enter, store, query, and analyze.
o Examples include SQL databases like MySQL and Oracle.
 Semi-Structured Data:
o Does not conform to a fixed schema but contains tags or markers to separate data
elements (e.g., XML, JSON).
o More flexible than structured data but still allows some level of organization.
o Examples include web services that use JSON for data interchange.
 Unstructured Data:
o Lacks a predefined format or structure, making it difficult to organize and
analyze.
o Examples include text documents, emails, images, audio, and video files.
o Requires advanced analytics tools to extract insights.
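
A small illustration of the structured vs. semi-structured distinction (the field names are invented for the example) is to contrast a fixed-schema row with a JSON document parsed by Python's standard json module:

import json

# Structured: every record has the same fixed columns, like a relational row.
structured_row = (101, "Alice", 55000.0)          # (id, name, salary)

# Semi-structured: the JSON document carries its own field tags, and records may differ.
semi_structured = '{"id": 101, "name": "Alice", "skills": ["SQL", "Spark"]}'
record = json.loads(semi_structured)

print(structured_row[1])        # access by position; schema known in advance
print(record.get("skills"))     # access by key; the field may be absent elsewhere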

10. Explain Bulk Synchronous Processing (BSP) in relation to Apache Spark.

Answer:
Bulk Synchronous Processing (BSP) is a parallel computing model that consists of a series of
supersteps, where computation is performed, followed by a synchronization phase.

 Key Features:
o In each superstep, nodes perform local computation and exchange messages with other nodes, allowing for data exchange and aggregation.
o After computation in each superstep, all nodes synchronize to ensure consistency
before moving to the next superstep.
 Relation to Spark:
o Spark leverages a similar model for iterative processing, allowing data to be
processed in parallel across a cluster while efficiently managing communication
and synchronization between tasks.
o This model enhances performance for iterative algorithms, such as those used in
machine learning.
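
The superstep pattern can be sketched in PySpark as below (an illustrative sketch, assuming PySpark is installed and a local session can be created): each loop iteration applies a parallel transformation across partitions and then synchronizes through an action before the next round.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bsp-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

values = sc.parallelize([1, 2, 3, 4])
for step in range(3):
    # "Superstep": a parallel computation on every partition.
    values = values.map(lambda v: v * 2)
    # Synchronization point: an action gathers results before the next superstep.
    print("after superstep", step, values.collect())

spark.stop()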

11. Describe the process of Matrix Multiplication using MapReduce.

Answer:
Matrix multiplication using MapReduce involves several steps:

1. Mapper Function:
o The input matrices are distributed across the cluster. Each mapper processes rows
from matrix A and columns from matrix B.
o For each row of A and each column of B, the mapper emits key-value pairs where
the key is a tuple representing the resulting matrix indices and the value is the
product of the corresponding elements.

Example: For Matrix A:

1 2
3 4

And Matrix B:

5 6
2 3

The mapper for row (1, 2) of A emits one partial product per cell of the result:

(0, 0) -> 1*5
(0, 1) -> 1*6
(0, 0) -> 2*2
(0, 1) -> 2*3

(The reducer for key (0, 0), for example, later sums 1*5 + 2*2 = 9.)

2. Reducer Function:
o The reducer receives the emitted key-value pairs and sums the values for each
unique key, producing the final result for the corresponding cell in the resulting
matrix.
3. Final Output:
o The output is a matrix where each cell represents the sum of the products for the
corresponding row from A and column from B.
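
To verify the worked example above, here is a small pure-Python simulation of the two phases (not a real Hadoop job): the map phase emits partial products keyed by output cell, and the reduce phase sums the values that share a key.

from collections import defaultdict

A = [[1, 2], [3, 4]]
B = [[5, 6], [2, 3]]

# Map phase: emit ((i, j), partial product) for every contributing element pair.
emitted = []
for i, row in enumerate(A):
    for k, a_val in enumerate(row):
        for j in range(len(B[0])):
            emitted.append(((i, j), a_val * B[k][j]))

# Reduce phase: sum all partial products that share the same (i, j) key.
C = defaultdict(int)
for key, value in emitted:
    C[key] += value

print(dict(C))   # {(0, 0): 9, (0, 1): 12, (1, 0): 23, (1, 1): 30}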

12. Illustrate the differences between HDFS and HBase.

Answer:

 HDFS:
o Designed for storing large files with high throughput.
o Optimized for batch processing; data is typically read in large blocks.
o Data is immutable, and once written, it cannot be changed.
o Works best for workloads requiring high data reliability and fault tolerance.
 HBase:
o A NoSQL database built on top of HDFS designed for real-time access to data.
o Supports random read/write access to large datasets.
o Data is stored in tables with rows and columns, allowing for dynamic schema
design.
o Suitable for applications needing low-latency access to big data.

13. What are the advantages of Apache Spark over MapReduce?


Answer:

 Speed: Spark performs in-memory processing, reducing the time taken for data
read/write operations compared to disk-based MapReduce.
 Ease of Use: Spark offers high-level APIs in languages like Python, Java, and Scala,
making it more accessible for developers.
 Unified Framework: Spark supports various data processing tasks (batch, streaming,
machine learning) within a single framework.
 Advanced Analytics: It provides built-in libraries for machine learning (MLlib), graph
processing (GraphX), and stream processing (Spark Streaming).
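
To illustrate the ease-of-use point, the word-count job written as a MapReduce program later in this document reduces to a few lines in PySpark (a sketch, assuming PySpark is installed; the input path is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()

lines = spark.sparkContext.textFile("input.txt")      # hypothetical input file
counts = (lines.flatMap(lambda line: line.split())    # split lines into words
               .map(lambda word: (word, 1))           # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))      # sum counts per word

print(counts.collect())
spark.stop()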

14. What is the architecture of Apache Pig?

Answer:
Apache Pig’s architecture is designed to simplify the process of writing and executing
MapReduce programs.

 Key Components:
o Pig Latin: The language used to express data flows and transformations.
o Parser: Validates and translates Pig Latin scripts into a logical plan.
o Optimizer: Improves the logical plan for performance, converting it into a
physical plan.
o Execution Engine: Executes the physical plan on Hadoop, generating
MapReduce jobs.

Diagram:

User -> Pig Latin Script -> Parser -> Logical Plan -> Optimizer -> Physical Plan -> Execution Engine -> Hadoop Cluster

15. Explain the role of Data Visualization in Big Data.

Answer:
Data visualization is a critical component of data analysis that transforms complex data into
graphical representations, making it easier to identify trends, patterns, and insights.

 Key Benefits:
o Enhanced Understanding: Visuals help to clarify complex data and make it
more accessible to non-technical stakeholders.
o Improved Decision-Making: Data visualizations enable quicker and more
informed decisions based on data insights.
o Trend Identification: Visualizations highlight trends and anomalies that may be
overlooked in raw data.
o Communication: Visuals convey information more effectively than text or
tables, facilitating better storytelling with data.
 Popular Tools:
o Tableau: Offers interactive dashboards and visualizations.
o D3.js: A JavaScript library for producing dynamic, interactive data visualizations
in web browsers.
o Power BI: A Microsoft tool for visual analytics that integrates with various data
sources.

Q1

(a) BASE properties of NoSQL databases (5 marks)


NoSQL databases are designed to handle large volumes of unstructured data. They follow the
BASE model, which stands for:

1. Basically Available: The system guarantees availability of data even in the face of failures. This availability is achieved by relaxing strict consistency, which may not hold at all times.
2. Soft State: The state of the system may change over time, even without new input. This
is due to the eventual consistency model, where updates propagate through the system
over time.
3. Eventual Consistency: While immediate consistency is not guaranteed, the system will
become consistent eventually, as long as no new updates are made to the data.

(b) 3 Vs of Big Data (5 marks)


Big Data is characterized by three fundamental dimensions, commonly referred to as the "3 Vs":

1. Volume: Refers to the enormous amounts of data generated every second from various
sources (social media, IoT devices, transactions, etc.). This scale of data requires special
techniques for storage and processing.
2. Velocity: The speed at which new data is generated and processed is crucial. High-
velocity data needs to be analyzed in real time or near real time to derive timely insights.
3. Variety: Data comes in multiple formats and types, such as structured (databases), semi-
structured (XML, JSON), and unstructured (text, images). Handling this variety poses
challenges for data management and analysis.

(c) HDFS Federation (5 marks)


HDFS Federation enhances the scalability of the Hadoop Distributed File System. Key aspects
include:
 Multiple NameNodes: Instead of a single NameNode managing the metadata for all files
in HDFS, Federation allows multiple NameNodes to exist, each managing its own
namespace. This reduces bottlenecks.
 Improved Scalability: By distributing the load across several NameNodes, HDFS can
handle more files and directories, allowing clusters to scale horizontally.
 High Availability: Each NameNode can work independently, which enhances system
resilience. If one NameNode fails, others can continue operating without interruption.

(d) Hive SerDe (5 marks)


Hive uses SerDe (Serializer/Deserializer) for data serialization and deserialization:

 Serialization: The process of converting an object (data structure) into a byte stream for
storage or transmission. For example, when saving data to HDFS.
 Deserialization: The reverse process of converting a byte stream back into an object.
When reading data from HDFS into Hive tables, the data needs to be deserialized.
 Customization: Users can implement custom SerDes for different data formats (like
JSON, XML) to enable Hive to read and write data in formats beyond its default
capabilities.
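
The serialize/deserialize idea can be illustrated generically with plain Python and the standard json module (this is an analogy, not Hive's SerDe API): serialization turns an in-memory record into bytes for storage, and deserialization turns those bytes back into a record when reading.

import json

record = {"id": 1, "name": "Alice", "salary": 55000.0}

# Serialize: convert the in-memory object into bytes suitable for storage.
serialized = json.dumps(record).encode("utf-8")

# Deserialize: convert the stored bytes back into an object when reading.
deserialized = json.loads(serialized.decode("utf-8"))

print(serialized)
print(deserialized)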

Q2

(a) Main components of HDFS with the diagram (10 marks)


HDFS is built around a master-slave architecture and comprises:

1. NameNode: The master server that stores metadata about the files in HDFS, such as file
locations, directory structure, and permissions. It does not store the actual data.
2. DataNode: The worker nodes that store the actual data blocks. They are responsible for
serving read and write requests from clients and periodically sending heartbeats and
block reports to the NameNode.
3. Secondary NameNode: Periodically merges the namespace image and edit log (checkpointing) so the edit log does not grow unbounded; it aids recovery but is not a failover replacement for the NameNode.

Diagram:

            +-----------+
            | NameNode  |
            +-----------+
             /         \
 +-----------+         +-----------+
 | DataNode  |         | DataNode  |
 +-----------+         +-----------+

(b) Compare HBase with conventional RDBMS (10 marks)

Feature        | HBase                                                        | RDBMS
---------------|--------------------------------------------------------------|--------------------------------------------------
Data Model     | Column-oriented, sparse tables                               | Row-oriented, fixed schema
Scalability    | Horizontal scalability (add more nodes)                      | Vertical scalability (larger servers)
Schema         | Dynamic schema for columns                                   | Fixed schema (tables must be defined beforehand)
Transactions   | Limited support for ACID transactions (single row)           | Full ACID compliance
Query Language | HBase API or Apache Phoenix for SQL-like queries             | SQL for querying
Use Case       | Designed for large datasets and real-time read/write access  | Best suited for complex queries with structured data

Q3

(a) Hadoop ecosystem with the diagram (10 marks)


The Hadoop ecosystem is a collection of tools and frameworks that work with Hadoop to process
large data sets efficiently. Key components include:

 HDFS: The distributed file system that stores data across multiple nodes.
 MapReduce: The programming model for processing data in parallel.
 YARN: The resource management layer that manages compute resources in the cluster.
 Hive: A data warehouse tool for querying and managing large datasets using SQL-like
syntax.
 Pig: A platform for analyzing large datasets using a high-level scripting language (Pig
Latin).
 HBase: A distributed NoSQL database that runs on top of HDFS.

Diagram:

              +------------------+
              |       YARN       |
              +------------------+
              /        |         \
   +---------+   +-----------+   +---------+
   |  HDFS   |   | MapReduce |   |  HBase  |
   +---------+   +-----------+   +---------+
        |              |
   +---------+   +-----------+
   |  Hive   |   |    Pig    |
   +---------+   +-----------+

Three components elaborated:

1. HDFS: Manages storage across nodes, providing fault tolerance by replicating data
blocks.
2. YARN: Acts as a resource manager, scheduling jobs and allocating resources to various
applications running in the Hadoop ecosystem.
3. Hive: Provides a SQL-like interface to query and analyze data stored in HDFS, allowing
users to work with data without needing to know MapReduce.

(b) InputFormat and RecordReader in Hadoop MapReduce (10 marks)

 InputFormat:
o Defines how input data is split and read. It controls how data is divided into splits
for processing.
o Examples include TextInputFormat, which reads lines of text, and
SequenceFileInputFormat, which reads key-value pairs from sequence files.
 RecordReader:
o A key component of the MapReduce framework, it reads the data from the
InputFormat and converts it into key-value pairs that can be processed by the
Mapper.
o It works in conjunction with InputFormat to facilitate the reading of input data
efficiently.

The combination of InputFormat and RecordReader allows MapReduce jobs to be agnostic of the underlying data storage format, enabling flexibility in data processing.
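
A rough pure-Python analogy of a line-oriented reader (an illustration, not the Hadoop API) walks through a text file and hands the Mapper (byte offset, line) key-value pairs, which mirrors how TextInputFormat's reader presents records:

def line_record_reader(path):
    """Yield (byte_offset, line) pairs, similar in spirit to a text RecordReader."""
    offset = 0
    with open(path, "rb") as f:
        for raw in f:
            line = raw.decode("utf-8").rstrip("\n")
            yield offset, line          # key = byte offset, value = line of text
            offset += len(raw)

# A Mapper would then consume these key-value pairs, e.g.:
# for key, value in line_record_reader("part-00000"):   # hypothetical split file
#     ...  # emit intermediate pairs derived from `value`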

Q4

(a) Types of NoSQL databases with examples (10 marks)

1. Key-Value Stores: Store data as a collection of key-value pairs.


o Example: Redis, which offers fast access and high performance.
2. Document Stores: Store semi-structured data in formats like JSON or XML, allowing
for complex data types.
o Example: MongoDB, which provides powerful querying capabilities on JSON-
like documents.
3. Column Family Stores: Organize data into columns rather than rows, which allows for
efficient data retrieval.
o Example: Cassandra, designed for high availability and scalability across
multiple nodes.
4. Graph Databases: Store data in graph structures, using nodes, edges, and properties to
represent relationships.
o Example: Neo4j, which excels at managing interconnected data.

(b) What is HIVE? Explain HIVE architecture in detail (10 marks)


Hive is a data warehousing infrastructure built on top of Hadoop that provides a SQL-like
interface for querying large datasets stored in HDFS.

Architecture Components:

1. Hive Metastore: A central repository for storing metadata, including information about
tables, columns, and data types. It helps manage schema and enables query optimization.
2. Hive Driver: The component that manages the lifecycle of a Hive query. It parses the
HiveQL statement, compiles it, and translates it into a series of MapReduce jobs.
3. Execution Engine: Executes the tasks as defined by the driver. It converts HiveQL into
MapReduce jobs and coordinates their execution.
4. User Interface: Hive provides several interfaces, including:
o Command Line Interface (CLI): For direct interaction with Hive.
o Web Interface: For GUI-based access.
o Thrift Server: For programmatic access via JDBC/ODBC.

Overall, Hive abstracts the complexities of MapReduce, allowing users to work with data using
familiar SQL-like commands.

Q5

(a) Relational Algebra operations using MapReduce (10 marks)


Relational algebra operations can be expressed using MapReduce as follows:

1. Selection (σ): Filter rows based on a predicate.


o Mapper: Reads each row, emits the row if it meets the criteria.
o Reducer: Not necessary as no aggregation is needed.
2. Projection (π): Retrieve specific columns from a table.
o Mapper: Reads each row, emits only the required columns.
o Reducer: Not needed; output can be directly written.
3. Join (⨝): Combine records from two datasets based on a common key.
o Mappers: Read records from both datasets and emit (join key, tagged record) pairs, tagging each record with its source dataset.
o Reducer: Collects the records that share a key and combines matching pairs from the two sources.

By structuring MapReduce jobs in this way, you can efficiently perform complex operations on
large datasets.
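
These three operations can be sketched in plain Python in a map/reduce style (illustrative only, not Hadoop code; the sample tuples and field positions are invented):

from collections import defaultdict

employees = [(1, "Alice", 10), (2, "Bob", 20), (3, "Carol", 10)]   # (id, name, dept_id)
departments = [(10, "Sales"), (20, "Engineering")]                 # (dept_id, dept_name)

# Selection (map-only): keep rows matching the predicate dept_id == 10.
selected = [row for row in employees if row[2] == 10]

# Projection (map-only): keep only the name column.
projected = [(row[1],) for row in employees]

# Join: "mappers" tag each record with its source and group by the join key;
# the "reducer" combines records that share the same key.
grouped = defaultdict(lambda: {"E": [], "D": []})
for row in employees:
    grouped[row[2]]["E"].append(row)
for row in departments:
    grouped[row[0]]["D"].append(row)
joined = [(e, d) for g in grouped.values() for e in g["E"] for d in g["D"]]

print(selected, projected, joined, sep="\n")
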
(b) What is Pig? Discuss the Load() & Store() commands in Pig framework (10 marks)
Apache Pig is a high-level platform for processing large datasets on Hadoop. It uses a scripting
language called Pig Latin, which simplifies the process of writing complex MapReduce
programs.

 Load() Command:
o This command is used to read data into Pig from various sources (e.g., HDFS,
local file system).
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o It specifies the format of the input data and how it should be parsed.
 Store() Command:
o This command is used to write the processed data back to HDFS or other storage
systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o It defines the format of the output data and where to store it.

These commands enable Pig to efficiently handle the data flow within the Hadoop ecosystem,
allowing users to focus on data processing rather than the underlying complexity of MapReduce.

Q6 Short Notes (20 marks)

(a) HBASE data models (5 marks):


HBase utilizes a sparse, distributed, column-oriented data model. Key features include:

 Tables: Data is stored in tables, similar to RDBMS, but can have a dynamic number of
columns.
 Column Families: Columns are grouped into families, allowing related data to be stored
together. This enhances read/write efficiency.
 Rows: Each row is identified by a unique key, and the data within can be sparsely
populated.
 Scalability: The model is designed to handle massive datasets by distributing data across
multiple nodes.
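
The sparse, column-family layout can be pictured as nested maps. The sketch below is plain Python for illustration (the row keys, families, and qualifiers are invented), not the HBase client API:

# row key -> column family -> column qualifier -> value
hbase_like_table = {
    "user#1001": {
        "info":    {"name": "Alice", "city": "Pune"},
        "metrics": {"logins": 42},
    },
    "user#1002": {
        "info": {"name": "Bob"},      # sparse: no 'city' column, no 'metrics' family
    },
}

# A read addresses a cell by (row key, family, qualifier), much like an HBase Get.
print(hbase_like_table["user#1001"]["info"]["name"])
print(hbase_like_table["user#1002"]["info"].get("city", "<missing>"))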

(b) Speculative Execution (5 marks):


Speculative Execution in Hadoop is a performance optimization technique that tackles straggler
tasks (tasks that take significantly longer than others). Key points include:

 Redundancy: If a task is running slower than expected, Hadoop will launch another
instance of the same task on a different node.
 Improved Job Completion: This redundancy helps ensure that the overall job does not
get delayed by slow tasks.
 Configuration: Users can enable or disable speculative execution through configuration
settings, depending on their workload requirements.

(c) YARN (5 marks):


YARN (Yet Another Resource Negotiator) is a core component of the Hadoop ecosystem that
manages resources across the cluster. Key aspects include:

 Resource Management: It allocates system resources to various applications running in the Hadoop environment.
 Job Scheduling: YARN enables scheduling of jobs, ensuring that resources are used
efficiently.
 Separation of Concerns: By separating resource management from data processing,
YARN allows multiple data processing frameworks (like MapReduce, Spark) to run on
the same cluster.

(d) HDFS Commands (5 marks):


HDFS commands are used for managing files in the Hadoop file system. Common commands
include:

 hdfs dfs -ls <path>: Lists files and directories in a specified path.
 hdfs dfs -put <local_path> <hdfs_path>: Uploads a file from the local filesystem
to HDFS.
 hdfs dfs -get <hdfs_path> <local_path>: Downloads a file from HDFS to the
local filesystem.
 hdfs dfs -rm <path>: Deletes a file or directory in HDFS.

These commands are essential for navigating and manipulating data within HDFS, facilitating
effective data management.
Q1

(a) CAP Theorem (5 marks)


The CAP theorem, proposed by Eric Brewer, states that a distributed data store can only
guarantee two of the following three properties at any given time:

1. Consistency (C): Every read receives the most recent write or an error. All nodes return
the same data when queried at the same time.
2. Availability (A): Every request receives a response, regardless of whether it was
successful or not. The system remains operational even if some nodes are down.
3. Partition Tolerance (P): The system continues to operate despite network partitions that
prevent nodes from communicating.

In practice, systems can achieve two of these three properties, leading to trade-offs depending on
the requirements of the application. For example, a system might favor consistency and partition
tolerance, sacrificing availability during a network failure.

(b) HiveQL (5 marks)


HiveQL is a SQL-like query language used with Apache Hive to query and manage large
datasets stored in Hadoop. Key features include:

 SQL-like Syntax: Allows users familiar with SQL to write queries without needing to
understand MapReduce.
 Data Manipulation: Supports operations like SELECT, JOIN, GROUP BY, and
ORDER BY to retrieve and manipulate data.
 Hive Metastore Integration: Queries can leverage metadata stored in the Hive
Metastore for better optimization and schema management.
 Extensibility: Users can define User-Defined Functions (UDFs) to extend HiveQL
capabilities, allowing custom operations on data.

HiveQL abstracts the complexity of MapReduce, making it easier for analysts and developers to
work with big data.
(c) Speculative Execution (5 marks)
Speculative Execution is a technique used in Hadoop to improve the performance of jobs that
experience slow-running tasks, commonly referred to as stragglers. Key points include:

 Redundant Task Execution: When a task is running slower than a predefined threshold,
Hadoop launches a duplicate of that task on another node.
 Task Completion: The first task to finish successfully will report its result, while the
slower task is killed, improving overall job completion time.
 Configuration: Speculative execution can be enabled or disabled based on the specific
needs of the workload, balancing between resource usage and job performance.

This approach helps in ensuring that the overall job does not get delayed due to a few straggler
tasks.

(d) HDFS High Availability (5 marks)


HDFS High Availability (HA) is a feature that enables HDFS to continue functioning even in the
event of a NameNode failure. Key components include:

 Active/Standby Configuration: Two NameNodes operate in an active/standby mode, ensuring that if the active NameNode fails, the standby can take over with minimal disruption.
 Shared Storage: Both NameNodes share a common storage system for the edit logs and
namespace image, allowing for quick recovery and state consistency.
 Automatic Failover: In the event of a failure, the system automatically switches to the
standby NameNode, ensuring continuous availability of the file system.

HDFS HA enhances the reliability and fault tolerance of the Hadoop ecosystem.

Q2

(a) What is Big Data? Characteristics? (10 marks)


Big Data refers to the vast volumes of data generated at high velocity from various sources.
Characteristics include:

1. Volume: The sheer size of data, often measured in terabytes or petabytes, requires
specialized tools for storage and processing.
2. Velocity: The speed at which data is generated and processed. Real-time or near-real-
time processing is often necessary for timely insights.
3. Variety: Data comes in different formats (structured, semi-structured, unstructured) from
diverse sources, such as social media, sensors, and transaction logs.
4. Veracity: The accuracy and reliability of the data, highlighting the challenges of data
quality and consistency.
5. Value: The potential insights and benefits that can be derived from analyzing big data,
emphasizing the importance of effective data management and analytics.

These characteristics necessitate innovative approaches and technologies for effective data
management and analysis.

(b) Hadoop ecosystem with the diagram and three components (10 marks)
The Hadoop ecosystem consists of various tools and frameworks that work together to process
and analyze big data efficiently. Key components include:

 HDFS: The distributed file system that stores data across multiple nodes.
 MapReduce: The programming model for processing data in parallel across a cluster.
 YARN: The resource management layer that schedules jobs and allocates resources.
 Hive: A data warehouse tool for querying large datasets using SQL-like syntax.
 Pig: A platform for analyzing large datasets using a high-level scripting language.

Diagram:

              +------------------+
              |       YARN       |
              +------------------+
              /        |         \
   +---------+   +-----------+   +---------+
   |  HDFS   |   | MapReduce |   |  HBase  |
   +---------+   +-----------+   +---------+
        |              |
   +---------+   +-----------+
   |  Hive   |   |    Pig    |
   +---------+   +-----------+

Three components elaborated:

1. HDFS: Manages storage, providing fault tolerance through data replication.


2. YARN: Manages resources and job scheduling, enabling multiple data processing
frameworks to run concurrently.
3. Hive: Allows users to query data using HiveQL, making it accessible for users familiar
with SQL.

Q3
(a) MapReduce Architecture and tasks (10 marks)
MapReduce is a programming model for processing large datasets in a distributed environment.
Its architecture consists of:

1. Client: Submits jobs to the Hadoop cluster.


2. Job Tracker (in MapReduce v1): Coordinates the job execution, scheduling tasks, and
monitoring their progress. In YARN (MapReduce v2), this is replaced by the Resource
Manager and Application Master.
3. Task Tracker (in MapReduce v1): Executes tasks on worker nodes. In YARN, this role
is taken by Node Managers.

Map Task:

 Input Splits: The input data is divided into splits for parallel processing.
 Mapper: Each mapper processes its assigned split, reading data, and emitting key-value
pairs as output.

Reduce Task:

 Shuffle and Sort: The framework collects outputs from mappers and sorts them by key.
 Reducer: Each reducer processes the sorted data, performing aggregation or other
computations, and produces final output.

Example: For a word count program, the mapper reads text lines and emits (word, 1) pairs,
while the reducer sums up counts for each word.

(b) HBase architecture (10 marks)


HBase is a distributed, scalable NoSQL database built on top of HDFS. Its architecture includes:

1. HMaster: The master server that manages region servers, handles schema changes, and
coordinates load balancing.
2. Region Server: Responsible for serving read and write requests for regions (horizontal
slices of a table). Each region contains rows sorted by row key.
3. Regions: Tables are divided into multiple regions, enabling parallel processing and
scalability.
4. HFile: The underlying storage format used by HBase to store data on HDFS. It supports
efficient read/write operations.
5. Zookeeper: Coordinates between HMaster and Region Servers, ensuring high
availability and consistency.

This architecture allows HBase to handle large volumes of data while providing fast access and
scalability.
Q4

(a) Hadoop Architecture YARN 2.0 (10 marks)


YARN (Yet Another Resource Negotiator) separates resource management from data processing
in Hadoop 2.0. Its architecture consists of:

1. Resource Manager (RM): The master daemon that manages resources across the cluster
and schedules jobs.
2. Node Manager (NM): Each worker node runs a Node Manager, responsible for
managing containers (units of resource allocation) and monitoring resource usage.
3. Application Master (AM): Each application (job) has its own Application Master that
negotiates resources from the Resource Manager and coordinates execution.
4. Containers: Resource allocations in YARN that run tasks (mappers/reducers) for
applications.

Diagram:

        +--------------------+
        |  Resource Manager  |
        +--------------------+
          /                \
+----------------+   +----------------+
|  Node Manager  |   |  Node Manager  |
+----------------+   +----------------+
        |                    |
+----------------+   +----------------+
|   Container    |   |   Container    |
|    (Mapper)    |   |   (Reducer)    |
+----------------+   +----------------+

This architecture enables improved scalability and resource utilization compared to earlier
Hadoop versions.

(b) MapReduce algorithm for word count (10 marks)


The MapReduce algorithm for counting the occurrences of words in a text file consists of two
main tasks: Map and Reduce.

1. Map Function:
o Input: A text line.
o Output: (word, 1) for each word in the line.

def map(line):
    words = line.split()
    for word in words:
        emit(word, 1)   # emit(key, value) is supplied by the MapReduce framework

2. Reduce Function:
o Input: (word, list of counts).
o Output: (word, total count).

def reduce(word, counts):
    total_count = sum(counts)
    emit(word, total_count)

3. Job Submission:
o Configure the job with the input and output paths, set the Mapper and Reducer
classes, and submit the job to the Hadoop cluster.

This approach allows for distributed processing of large text files, providing the count of each
word efficiently.

Q5

(a) What is HIVE? HIVE architecture (10 marks)


Hive is a data warehousing solution built on top of Hadoop, allowing for data summarization,
querying, and analysis using a SQL-like language called HiveQL.

Hive Architecture Components:

1. Hive Metastore: Stores metadata about tables, partitions, and schemas. It is crucial for
query optimization and schema management.
2. Driver: Manages the execution of HiveQL queries. It parses the query, plans its
execution, and communicates with the execution engine.
3. Execution Engine: Converts HiveQL queries into MapReduce jobs (or Spark jobs) and
manages their execution.
4. User Interface: Provides several interfaces for users, including:
o CLI (Command Line Interface)
o Web UI
o Thrift Server for programmatic access.

Hive abstracts the complexities of Hadoop and enables users to perform data analysis using
familiar SQL constructs.
(b) What is Pig? Load() & Store() commands (10 marks)
Apache Pig is a high-level platform for processing large datasets on Hadoop, using a scripting
language called Pig Latin.

 Load() Command:
o Used to read data into Pig from various sources.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o It specifies the format of the input data and how it should be parsed.
 Store() Command:
o Used to write processed data back to HDFS or other storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o It defines the format of the output data and the target location.

These commands facilitate effective data handling within the Pig framework, allowing for
flexible data processing workflows.

Q6 Short Notes (20 marks)

(a) Row-oriented vs. Column-oriented storage (5 marks):

 Row-oriented Storage: Stores data in rows, which is ideal for transactional applications.
Each row is stored together, making it efficient for retrieving entire records. Examples
include traditional RDBMS systems like MySQL.
 Column-oriented Storage: Stores data in columns, which is better suited for analytical
applications. This format allows for efficient aggregation and retrieval of specific
attributes across many records. Examples include HBase and Google Bigtable.

(b) RecordReader in Hadoop MapReduce (5 marks):

 Role: RecordReader bridges the gap between the input data and the Mapper. It reads
input splits (as defined by InputFormat) and converts the data into key-value pairs that
can be processed by the Mapper.
 Functionality: It handles the conversion of data formats, managing tasks such as
splitting lines in text files or reading binary files. Each RecordReader is responsible for a
single input split.

(c) NameNode vs. DataNode (5 marks):

 NameNode: The master server in HDFS responsible for managing metadata, such as file
and directory structures, and the location of data blocks. It does not store actual data but
tracks which DataNodes hold which blocks.
 DataNode: Worker nodes in HDFS that store the actual data blocks. They handle read
and write requests from clients and periodically send heartbeats and block reports to the
NameNode.

(d) 5 HDFS Commands (5 marks):

1. hdfs dfs -ls <path>: Lists files and directories in the specified path.
2. hdfs dfs -put <local_path> <hdfs_path>: Uploads a file from the local filesystem
to HDFS.
3. hdfs dfs -get <hdfs_path> <local_path>: Downloads a file from HDFS to the
local filesystem.
4. hdfs dfs -rm <path>: Deletes a file or directory in HDFS.
5. hdfs dfs -mkdir <path>: Creates a new directory in HDFS.

(e) Key-value databases (5 marks):


Key-value databases are NoSQL databases that store data as a collection of key-value pairs.
Characteristics include:

 Simplicity: They provide a simple model for storing and retrieving data, where each
unique key is associated with a value.
 Performance: Designed for high-speed operations, making them ideal for applications
that require quick lookups.
 Scalability: They can be scaled horizontally by adding more nodes to handle increased
load.

Examples: Redis and Amazon DynamoDB are popular key-value databases used for caching
and session management.
Q1

(a) Requirements of NoSQL Databases (5 marks)


NoSQL databases were developed to address limitations of traditional relational databases,
especially for big data applications. Key requirements include:

1. Scalability: Must easily scale horizontally to accommodate increasing data loads without
significant performance degradation.
2. Flexibility: Support various data models (e.g., document, key-value, column-family,
graph) to handle diverse data types and structures.
3. High Availability: Should provide high availability and fault tolerance through data
replication and distribution across multiple nodes.
4. Schema-less Design: Allow for a dynamic schema, enabling applications to evolve
without needing complex migrations.
5. Performance: Ensure low latency for read and write operations, catering to high-velocity
data processing.

(b) Industry Applications of Big Data (5 marks)


Big Data is utilized across various industries to derive insights and drive decision-making.
Examples include:

1. Healthcare: Analyzing patient data and treatment outcomes to improve healthcare services and reduce costs.
2. Retail: Personalizing customer experiences through analysis of purchasing behavior,
preferences, and inventory management.
3. Finance: Detecting fraudulent activities and assessing credit risk by analyzing
transaction patterns and customer data.
4. Telecommunications: Optimizing network performance and predicting customer churn
by analyzing call records and service usage.
5. Manufacturing: Enhancing supply chain efficiency and predictive maintenance by
analyzing operational data and sensor information.
(c) Compare MapReduce and YARN (5 marks)

Feature             | MapReduce                                      | YARN
--------------------|------------------------------------------------|---------------------------------------------------------------------
Purpose             | Programming model for processing large data    | Resource management and job scheduling
Architecture        | Single-layer, with JobTracker and TaskTracker  | Two-layer, separating resource management (RM) from processing (AM)
Scalability         | Limited, as it relies on JobTracker            | Highly scalable, allows multiple applications to run simultaneously
Resource Allocation | Static allocation of resources for each job    | Dynamic resource allocation across the cluster
Fault Tolerance     | Handles task failures at the job level         | Manages node failures and reassigns tasks at the container level

(d) The 3 Vs of Big Data (5 marks)


The 3 Vs of Big Data refer to three fundamental characteristics that define its nature:

1. Volume: Refers to the vast amounts of data generated every second, often measured in
terabytes or petabytes.
2. Velocity: Describes the speed at which data is generated, processed, and analyzed. Real-
time processing is often required to derive timely insights.
3. Variety: Represents the different types of data (structured, semi-structured, unstructured)
and formats (text, images, videos) that organizations deal with.

Q2

(a) Main Components of HDFS with Diagram (10 marks)


HDFS (Hadoop Distributed File System) is designed for storing large datasets reliably. Its main
components are:

1. NameNode: The master server that manages metadata and regulates access to files by
clients. It does not store actual data.
2. DataNode: The worker nodes that store actual data blocks. They regularly send heartbeat
signals and block reports to the NameNode.
3. Secondary NameNode: An auxiliary service that periodically saves the state of the
filesystem metadata. It helps in recovering from NameNode failures but does not replace
the NameNode.

Diagram:

              +-----------+
              | NameNode  |
              +-----------+
                    |
       +------------+------------+
       |            |            |
+------------+ +------------+ +------------+
|  DataNode  | |  DataNode  | |  DataNode  |
+------------+ +------------+ +------------+

Task of Secondary NameNode:


The Secondary NameNode periodically merges the namespace image and the edit logs, helping
to reduce the amount of data the NameNode needs to process for recovery. However, it is not a
failover node.

(b) Compare HBase with Conventional RDBMS (10 marks)

Feature        | HBase                                           | Conventional RDBMS
---------------|-------------------------------------------------|------------------------------------------------
Data Model     | Column-oriented, schema-less                    | Row-oriented, fixed schema
Scalability    | Horizontally scalable                           | Vertically scalable, limited horizontal scale
Data Types     | Supports unstructured and semi-structured data  | Primarily structured data
Query Language | Uses HBase API for CRUD operations              | Uses SQL for querying
Transactions   | Limited ACID properties                         | Full ACID compliance
Use Cases      | Ideal for large-scale, real-time data access    | Best for transactional applications

Q3

(a) Hadoop Ecosystem with Diagram (10 marks)


The Hadoop ecosystem comprises various tools and technologies that work together to manage
big data. Key components include:

1. Hadoop Common: The common utilities and libraries that support other Hadoop
modules.
2. HDFS: The distributed file system for storing data.
3. YARN: The resource management layer that schedules jobs.
4. MapReduce: The programming model for data processing.
5. Hive: A data warehouse tool for querying large datasets using HiveQL.
6. Pig: A platform for analyzing large datasets using Pig Latin.
7. HBase: A NoSQL database for real-time read/write access to large datasets.
8. Sqoop: A tool for transferring data between Hadoop and relational databases.
9. Flume: A service for collecting and aggregating log data from various sources.

Diagram:

              +------------------+
              |       YARN       |
              +------------------+
              /        |         \
   +---------+   +-----------+   +---------+
   |  HDFS   |   | MapReduce |   |  HBase  |
   +---------+   +-----------+   +---------+
        |              |
   +---------+   +-----------+
   |  Hive   |   |    Pig    |
   +---------+   +-----------+
        |
   +---------+
   |  Sqoop  |
   +---------+
        |
   +---------+
   |  Flume  |
   +---------+

Difference between Flume & Sqoop:

 Flume: Primarily used for collecting and aggregating log data from various sources into
HDFS. It is designed for streaming data and real-time ingestion.
 Sqoop: Used for transferring bulk data between Hadoop and relational databases. It is
optimized for importing/exporting data in large volumes and works with structured data.

(b) What is Pig? Load() & Store() commands (10 marks)


Apache Pig is a high-level platform for creating programs that run on Hadoop. It uses a language
called Pig Latin, which simplifies the process of writing complex MapReduce programs.

 Load() Command:
o Reads data into Pig from various sources.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o Specifies the format of the input data.
 Store() Command:
o Writes processed data back to HDFS or other storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o Defines the format of the output data.
These commands facilitate effective data handling within Pig, allowing for flexible data
processing workflows.

Q4

(a) Types of NoSQL Databases with Examples (10 marks)

1. Key-Value Stores: Store data as a collection of key-value pairs.


Example: Redis, Amazon DynamoDB.
2. Document Stores: Store data in document formats (e.g., JSON, BSON).
Example: MongoDB, CouchDB.
3. Column Family Stores: Store data in columns rather than rows, suitable for wide
datasets.
Example: HBase, Cassandra.
4. Graph Databases: Designed for data whose relationships are best represented as a
graph.
Example: Neo4j, Amazon Neptune.

(b) Enunciate the Word Count Algorithm Using MapReduce (10 marks)
The Word Count algorithm is a classic MapReduce example that counts the occurrences of each
word in a text file.

1. Map Function:
o Input: A line of text.
o Output: Key-value pairs where the key is the word and the value is 1.

def map(line):
    words = line.split()
    for word in words:
        emit(word, 1)

2. Reduce Function:
o Input: Key-value pairs where the key is a word and the value is a list of counts.
o Output: Key-value pairs where the key is the word and the value is the total count.

def reduce(word, counts):
    total_count = sum(counts)
    emit(word, total_count)

3. Job Submission:
o The MapReduce job is configured with input and output paths and the Mapper
and Reducer classes, then submitted to the Hadoop cluster.

This method effectively processes large volumes of text data, providing a count for each unique
word.

Q5

(a) How Map Task & Reduce Task Work in MapReduce (10 marks)
Map Task:

 Input Splits: Input data is divided into splits, with each split processed by a separate
Map task.
 Processing: Each Mapper reads its assigned input split, processes it, and emits
intermediate key-value pairs.
 Shuffle & Sort: The framework sorts and groups these intermediate pairs by key,
preparing them for the Reduce tasks.

Reduce Task:

 Input: Receives sorted intermediate key-value pairs from all Map tasks.
 Aggregation: Each Reducer processes the grouped data, performing aggregation or other
computations.
 Output: The final results are written to the output path specified in the job configuration.

Example: In a Word Count program, Mappers emit (word, 1) pairs, and Reducers sum these
pairs to produce (word, total count).

(b) What is HIVE? Explain HIVE Architecture in Detail (10 marks)


Hive is a data warehouse infrastructure built on top of Hadoop, enabling data summarization,
querying, and analysis through a SQL-like language called HiveQL.

Hive Architecture Components:

1. Hive Metastore: Stores metadata about the data, including schemas and table definitions,
enabling query optimization and efficient data retrieval.
2. Driver: Manages the execution of HiveQL queries. It parses the queries and creates an
execution plan.
3. Execution Engine: Transforms HiveQL queries into MapReduce jobs (or Spark jobs)
and executes them on the Hadoop cluster.
4. User Interface: Offers several ways to interact with Hive, including:
o CLI (Command Line Interface)
o Web UI
o Thrift Server for programmatic access.
5. Storage: Data is stored in HDFS, with Hive handling the structure and metadata.

Hive simplifies the process of querying and analyzing large datasets, making it accessible to
users familiar with SQL.

Q6 Short Notes (20 marks)

(a) HBase Data Models (5 marks):


HBase uses a sparse, distributed, and persistent data model based on tables. Key features include:

 Column Families: Data is stored in column families, enabling efficient read and write
operations. Each row can have a different number of columns.
 Rows and Columns: Each row is identified by a unique row key, while columns can be
dynamically added. Data is stored in key-value pairs.
 Versioning: Supports multiple versions of data, allowing for time-based access to
historical records.

(b) Relational Operators in MapReduce (5 marks):


Relational operators are used in MapReduce to perform operations similar to SQL. Key
operations include:

 Selection: Filtering records based on specific criteria.


 Projection: Selecting specific columns from the dataset.
 Join: Combining records from two or more datasets based on a common key.
 Aggregation: Performing calculations (e.g., SUM, COUNT) on groups of records.

These operations can be implemented using custom Mapper and Reducer functions.

(c) Functions of JobTracker & TaskTracker (5 marks):

 JobTracker: The master server in MapReduce v1 that manages jobs submitted by clients. It schedules tasks, monitors their progress, and handles task failures.
 TaskTracker: A worker node that executes tasks assigned by the JobTracker. It reports
the status of tasks and handles data locality to optimize performance.

(d) HDFS Commands (5 marks):

1. hdfs dfs -ls <path>: Lists files in the specified directory.


2. hdfs dfs -put <local_path> <hdfs_path>: Uploads a file to HDFS.
3. hdfs dfs -get <hdfs_path> <local_path>: Downloads a file from HDFS.
4. hdfs dfs -rm <path>: Deletes a file or directory in HDFS.
5. hdfs dfs -mkdir <path>: Creates a new directory in HDFS.
(e) Pig Latin (5 marks):
Pig Latin is a high-level scripting language used with Apache Pig for processing large datasets.
Key features include:

 Ease of Use: Provides a simpler alternative to writing complex MapReduce code.


 Data Flow Language: Allows users to describe data transformations as a series of steps
in a data flow model.
 Built-in Functions: Offers a variety of built-in functions for common tasks like filtering,
grouping, and joining data.
 Extensibility: Users can define User-Defined Functions (UDFs) for custom processing.

Pig Latin is especially useful for data analysts and engineers working with large datasets in
Hadoop.

Q1

(a) Need of Big Data Analytics (5 marks)


Big Data Analytics is essential due to the following reasons:

1. Decision Making: It enables organizations to make data-driven decisions, improving operational efficiency and strategic planning.
2. Customer Insights: Businesses can analyze customer behavior and preferences, allowing
for personalized marketing and enhanced customer experiences.
3. Predictive Analytics: Organizations can anticipate trends and customer needs by
analyzing historical data, leading to better product development.
4. Operational Efficiency: Identifying inefficiencies in processes can lead to cost reduction
and optimization of resources.
5. Competitive Advantage: Companies leveraging Big Data analytics can outperform
competitors by quickly adapting to market changes and consumer demands.

(b) D3 and Big Data (5 marks)


D3.js (Data-Driven Documents) is a JavaScript library for producing dynamic, interactive data
visualizations in web browsers. Its connection to Big Data includes:

1. Data Binding: D3 allows developers to bind data to DOM elements, facilitating the
visualization of large datasets interactively.
2. Scalability: It can handle large volumes of data, making it suitable for visualizing Big
Data.
3. Custom Visualizations: D3 enables the creation of custom visual representations,
accommodating diverse data types and structures.
4. Interactivity: Users can explore data visually through interactive elements, enhancing
comprehension and engagement.
5. Integration: D3 can easily integrate with other web technologies, enabling real-time data
visualization from Big Data sources.

(c) CAP Theorem (5 marks)


The CAP theorem, proposed by Eric Brewer, states that in a distributed data store, only two of
the following three guarantees can be achieved simultaneously:

1. Consistency (C): Every read receives the most recent write or an error. All nodes in the
system see the same data at the same time.
2. Availability (A): Every request receives a response, either with the requested data or an
error. The system is operational and responsive.
3. Partition Tolerance (P): The system continues to operate despite arbitrary message loss
or failure of part of the system.

In practical terms, a distributed system can be CP (Consistent and Partition Tolerant) or AP (Available and Partition Tolerant), but not both.

(d) Features of Apache Spark (5 marks)


Apache Spark is a fast and general-purpose cluster computing system. Key features include:

1. Speed: In-memory data processing makes Spark significantly faster than Hadoop
MapReduce for certain workloads.
2. Ease of Use: Provides high-level APIs in Java, Scala, Python, and R, making it
accessible to data scientists and engineers.
3. Unified Engine: Supports diverse data processing tasks, including batch processing,
streaming, machine learning, and graph processing.
4. Rich Ecosystem: Integrates with various tools like Hadoop, Hive, and Flume, facilitating
the processing of Big Data.
5. Resilient Distributed Datasets (RDDs): Provides a fault-tolerant data structure for in-
memory computations, enabling parallel processing.

(e) Load and Store Functions of Apache Pig (5 marks)


Apache Pig uses the following functions for data handling:

1. Load():
o Reads data into Pig from various sources, such as HDFS or local file systems.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o Defines the format of the input data, which can be customized.
2. Store():
o Writes processed data back to storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o Defines the format of the output data, allowing users to specify how to store
results.

These functions enable efficient data ingestion and output in the Pig framework.

(f) Big Data Characteristics (any 5) (5 marks)


Key characteristics of Big Data include:

1. Volume: Refers to the vast amounts of data generated every second, often in terabytes or
petabytes.
2. Velocity: The speed at which data is created, processed, and analyzed, often in real-time.
3. Variety: The diverse types of data (structured, semi-structured, unstructured) and formats
(text, images, videos).
4. Veracity: The quality and accuracy of the data, which can affect insights and decision-
making.
5. Value: The potential insights and benefits that can be derived from analyzing Big Data.

Q2

(a) HDFS Architecture in Detail (10 marks)


Hadoop Distributed File System (HDFS) is designed for storing large datasets reliably. Key
components include:

1. NameNode: The master server that manages metadata and directory structure. It tracks
the location of data blocks but does not store them.
2. DataNode: Worker nodes that store actual data blocks. They handle read/write requests
and periodically send heartbeats and block reports to the NameNode.
3. Secondary NameNode: An auxiliary service that periodically saves the state of the
filesystem metadata, helping with recovery.
4. Blocks: Files in HDFS are split into fixed-size blocks (default 128 MB), distributed
across DataNodes.
5. Replication: Each block is replicated (default replication factor is 3) to ensure data
durability and availability.

Diagram:

          +-------------------+
          |     NameNode      |
          +-------------------+
           /        |        \
+-----------+ +-----------+ +-----------+
| DataNode  | | DataNode  | | DataNode  |
+-----------+ +-----------+ +-----------+

(b) Workflow using Resource Manager, Application Master & Node Manager of YARN (10
marks)
YARN (Yet Another Resource Negotiator) manages resources in Hadoop. Key components
include:

1. Resource Manager (RM): The master daemon that allocates resources to various
applications.
2. Application Master (AM): A per-application master that negotiates resources with the
Resource Manager and works with Node Managers to execute tasks.
3. Node Manager (NM): A per-node daemon that manages containers, monitors resource
usage, and reports to the Resource Manager.

Workflow:

1. The client submits an application to the Resource Manager.


2. The Resource Manager allocates resources and starts an Application Master for the
submitted application.
3. The Application Master requests resources from the Resource Manager and negotiates
containers with Node Managers.
4. Node Managers launch the tasks in containers and monitor their execution.
5. The Application Master manages task execution and monitors their status, reporting back
to the Resource Manager.

Diagram:

          +--------------------+
          |  Resource Manager  |
          +--------------------+
                    |
          +--------------------+
          | Application Master |
          +--------------------+
               /         \
+--------------+     +--------------+
| Node Manager |     | Node Manager |
+--------------+     +--------------+

Q3

(a) Rack Awareness in HDFS and Fault Tolerance (10 marks)


Rack Awareness: HDFS can be configured to be aware of the physical location of nodes (rack
topology). This helps in optimizing data storage and retrieval.

Fault Tolerance with Rack Awareness:

1. Data Replication: HDFS can replicate blocks across different racks. For example, if a
block is stored on two DataNodes in one rack, the third replica might be stored on a
DataNode in a different rack.
2. Example:
o Suppose we have two racks, Rack A and Rack B, and the default replication factor of 3. For a given block, HDFS typically places two replicas on DataNodes in Rack A and the third replica on a DataNode in Rack B. If Rack A (or any DataNode in it) fails, the block can still be read from its replica in Rack B, ensuring availability.

This design enhances fault tolerance, as it reduces the likelihood of data loss due to rack-level
failures.

(b) Matrix Multiplication Mapper and Reducer Function for MapReduce (10 marks)
Matrix Multiplication Overview: We multiply matrix A (m x n) by matrix B (n x p) to produce matrix C (m x p), where C[i][j] = sum over k of A[i][k] * B[k][j].

Mapper Function:

 Each mapper processes the entries of one matrix (A or B) and, for every cell of C that an entry contributes to, emits a key-value pair keyed by that cell's (row, column) index.

python
# Pseudocode sketch: emit() and the matrix dimensions (num_cols_B, num_rows_A)
# are assumed to be supplied by the MapReduce framework/driver.
def mapper(matrix_name, row, col, value):
    if matrix_name == "A":
        for k in range(num_cols_B):      # pair A[row][col] with every column k of B
            emit((row, k), ("A", col, value))
    elif matrix_name == "B":
        for k in range(num_rows_A):      # pair B[row][col] with every row k of A
            emit((k, col), ("B", row, value))
Reducer Function:

 The reducer receives the key (i, j) and combines the contributions from matrix A and B to
compute the final value of C[i][j].

python
# Pseudocode sketch: the reducer receives key (i, j) together with all
# contributions from A's row i and B's column j, then sums the matching products.
def reducer(i, j, values):
    total = 0
    A_values = {}
    B_values = {}

    for value in values:
        if value[0] == "A":
            A_values[value[1]] = value[2]   # A contributions keyed by column index
        elif value[0] == "B":
            B_values[value[1]] = value[2]   # B contributions keyed by row index

    for a_col, a_value in A_values.items():
        for b_row, b_value in B_values.items():
            if a_col == b_row:              # multiply matching A-column / B-row pairs
                total += a_value * b_value

    emit((i, j), total)                     # final value of C[i][j]

Q4

(a) Types of NoSQL Databases with Examples (10 marks)

1. Key-Value Stores:
o Example: Redis, DynamoDB
o Stores data as key-value pairs, allowing for high-performance retrieval.
2. Document Stores:
o Example: MongoDB, CouchDB
o Stores data in document formats (e.g., JSON), enabling flexible schemas.
3. Column-Family Stores:
o Example: Apache Cassandra, HBase
o Data is stored in columns rather than rows, optimizing read and write operations.
4. Graph Databases:
o Example: Neo4j, Amazon Neptune
o Optimized for managing and querying data represented as graphs (nodes and
edges).
5. Wide-Column Stores:
o Example: Google Bigtable, ScyllaDB
o Similar to column-family stores but designed for massive scalability and high
performance.
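As a rough illustration of the first two patterns above, the sketch below uses the redis and pymongo Python clients against assumed local Redis and MongoDB instances; the key 'session:42', database 'shop', and collection 'products' are hypothetical names.

python
import redis                      # key-value store client
from pymongo import MongoClient   # document store client

# Key-value store: cache a user session under an assumed key
kv = redis.Redis(host="localhost", port=6379, decode_responses=True)
kv.set("session:42", "alice")
print(kv.get("session:42"))       # -> "alice"

# Document store: insert and query a flexible-schema product document
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]
db.products.insert_one({"sku": "A1", "name": "Laptop", "tags": ["electronics"]})
print(db.products.find_one({"sku": "A1"}))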
(b) HBase Architecture and Components (10 marks)
HBase is a distributed, scalable NoSQL database built on top of HDFS. Its architecture includes:

1. HMaster: The master server that manages region servers, handles metadata, and
performs load balancing.
2. Region Server: Stores data in regions, which are horizontal partitions of tables. Each
region is a sorted map from row keys to values.
3. Regions: A table is divided into multiple regions, and each region can be hosted on
different Region Servers for scalability.
4. HFile: The file format used by HBase to store data on HDFS. HFiles are immutable and
stored in the DataNodes.
5. Zookeeper: Used for coordinating distributed processes and maintaining configuration
information.

HBase allows real-time read/write access to large datasets while leveraging the fault tolerance of
HDFS.
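As a rough sketch of this real-time read/write access from Python, the example below uses the third-party happybase client and assumes a running HBase Thrift server on localhost plus a hypothetical 'users' table with an 'info' column family.

python
import happybase  # third-party HBase client; assumes the HBase Thrift server is running

connection = happybase.Connection("localhost")   # assumed host
table = connection.table("users")                # hypothetical table with family 'info'

# Write one row: row key -> {column family:qualifier: value}
table.put(b"user-001", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Random, low-latency read of a single row
print(table.row(b"user-001"))

connection.close()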

Q5

(a) Architecture of Apache Pig (10 marks)


Apache Pig is a platform for analyzing large datasets through a high-level language known as
Pig Latin.

Architecture Components:

1. Pig Latin: The scripting language used to express data flows.


2. Parser: Parses the Pig Latin scripts and generates a logical plan.
3. Optimizer: Optimizes the logical plan to improve execution efficiency.
4. Execution Engine: Converts the optimized plan into a series of MapReduce jobs that are
executed on Hadoop.

Diagram:

+-----------------+
| Pig Latin |
+-----------------+
|
[Parser]
|
+-------------+
| Logical |
| Plan |
+-------------+
|
[Optimizer]
|
+-------------+
| Optimized |
| Plan |
+-------------+
|
[Execution Engine]
|
+-------------+
| MapReduce |
| Jobs |
+-------------+

(b) Apache Kafka: Benefits and Need (10 marks)


Apache Kafka is a distributed event streaming platform designed for high-throughput and fault-
tolerant data streaming.

Benefits:

1. Scalability: Kafka can scale horizontally, allowing it to handle increased loads without
performance degradation.
2. Durability: Data is stored reliably with replication, ensuring fault tolerance and high
availability.
3. Performance: Capable of handling millions of messages per second, making it suitable
for real-time data processing.
4. Flexibility: Supports various messaging patterns, including publish-subscribe and
message queuing.
5. Ecosystem Integration: Integrates seamlessly with various big data tools, including
Hadoop, Spark, and Flink.

Need:

1. Real-time Data Processing: Businesses require immediate insights from data, which
Kafka facilitates through event streaming.
2. Decoupled Systems: Kafka allows microservices and applications to communicate
asynchronously, reducing dependency.
3. Data Pipeline: Acts as a central hub for data flow, enabling easier data ingestion and
distribution across systems.
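A minimal sketch of the publish-subscribe flow using the kafka-python client; the broker address localhost:9092 and the topic name 'user-activity' are assumptions made for illustration.

python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: publish an event to an assumed topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("user-activity", {"user": "alice", "action": "like"})
producer.flush()

# Consumer: subscribe to the same topic and process events as they arrive
consumer = KafkaConsumer(
    "user-activity",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # e.g. {'user': 'alice', 'action': 'like'}
    break                  # stop after one message in this sketch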

Q6

(a) What is Big Data and Its Types? Differences between Traditional Data vs Big Data
Approach (10 marks)
Big Data: Refers to datasets that are so large or complex that traditional data processing
applications are inadequate. It encompasses the three Vs: Volume, Velocity, and Variety.
Types of Big Data:

1. Structured Data: Organized in a predefined format (e.g., relational databases).


2. Semi-Structured Data: Lacks a strict structure but contains tags or markers (e.g., JSON,
XML).
3. Unstructured Data: No predefined format (e.g., text documents, images, videos).

Differences between Traditional Data and Big Data Approach:

Feature            Traditional Data             Big Data Approach
Volume             Small, manageable datasets   Massive datasets (terabytes/petabytes)
Velocity           Slow, batch processing       Real-time processing
Variety            Homogeneous data types       Diverse data types and formats
Storage            Relational databases         Distributed storage systems
Processing Tools   SQL-based querying           Hadoop, Spark, NoSQL databases

(b) Tools and Benefits of Data Visualization; Challenges of Big Data Visualization (10
marks)
Tools for Data Visualization:

1. Tableau: Offers powerful visualization capabilities with an intuitive drag-and-drop


interface.
2. D3.js: A JavaScript library for creating dynamic and interactive visualizations on the
web.
3. Power BI: A business analytics tool that provides interactive visualizations and business
intelligence capabilities.
4. Matplotlib: A Python library for creating static, animated, and interactive visualizations.
5. Google Data Studio: A web-based tool for creating reports and dashboards from various
data sources.

Benefits of Data Visualization:

1. Insight Discovery: Helps in identifying trends, patterns, and correlations in data quickly.
2. Enhanced Communication: Simplifies complex data, making it easier to convey
information to stakeholders.
3. Faster Decision-Making: Provides real-time insights, facilitating quicker decision-
making processes.
4. Engagement: Interactive visualizations engage users, allowing them to explore data
intuitively.
5. Accessibility: Makes data accessible to non-technical users, democratizing data insights.
Challenges of Big Data Visualization:

1. Data Volume: Large datasets can be overwhelming and may require summarization for
effective visualization.
2. Data Variety: Different data formats can complicate integration and visualization
efforts.
3. Real-time Processing: Visualizing data in real-time requires robust infrastructure and
tools.
4. Interpretation Issues: Users may misinterpret complex visualizations if not designed
clearly.
5. Scalability: Tools must handle increasing data sizes and complexities without
performance loss.

(a) 5 V’s of Big Data (5 marks)


The 5 V's of Big Data are:

1. Volume: Refers to the massive amounts of data generated from various sources (e.g.,
social media, sensors, transactions). Organizations must manage and analyze this data
efficiently.
2. Velocity: The speed at which data is generated and processed. Real-time data streams
(e.g., financial transactions, social media feeds) require rapid analysis and response.
3. Variety: The different types and formats of data, including structured (databases), semi-
structured (XML, JSON), and unstructured data (text, images, videos).
4. Veracity: The reliability and accuracy of data. With large datasets, ensuring data quality
and consistency becomes challenging.
5. Value: The insights and benefits derived from analyzing big data. Organizations seek to
extract valuable information to drive decision-making and business strategies.

(b) HDFS (5 marks)


Hadoop Distributed File System (HDFS) is designed for storing large datasets across clusters of
computers. Key characteristics include:

1. Scalability: HDFS can store petabytes of data by distributing it across multiple nodes.
2. Fault Tolerance: Data is replicated (default replication factor is 3) across different nodes
to ensure durability and availability.
3. High Throughput: Optimized for batch processing, HDFS allows efficient read/write
operations.
4. Block Storage: Files are divided into blocks (default size is 128 MB) stored in different
DataNodes, enhancing data access speed.
5. Master-Slave Architecture: HDFS has a NameNode (master) that manages metadata
and DataNodes (slaves) that store actual data.

(c) NoSQL (5 marks)


NoSQL databases are designed to handle unstructured and semi-structured data at scale. Key
features include:

1. Schema Flexibility: Unlike traditional RDBMS, NoSQL databases can store data
without a fixed schema, allowing for agile development.
2. Scalability: They can scale horizontally, handling large volumes of data by adding more
servers.
3. Variety of Data Models: Supports various data models, including key-value, document,
column-family, and graph databases.
4. High Performance: Optimized for high-speed read and write operations, making them
suitable for real-time applications.
5. Distributed Architecture: Data is distributed across multiple nodes, providing fault
tolerance and high availability.

(d) Hive (5 marks)


Apache Hive is a data warehousing tool built on top of Hadoop that facilitates data
summarization, querying, and analysis. Key aspects include:

1. HiveQL: A SQL-like language used for querying data, making it accessible to users
familiar with SQL.
2. Metastore: Stores metadata about tables, schemas, and partitions, enabling efficient
query processing.
3. Data Storage: Data is stored in HDFS, allowing for scalable storage of large datasets.
4. Extensibility: Supports user-defined functions (UDFs) for custom processing logic.
5. Integration: Can work with various data processing tools in the Hadoop ecosystem, such
as MapReduce and Spark.

(e) Pig (5 marks)


Apache Pig is a platform for analyzing large datasets using a high-level scripting language called
Pig Latin. Key features include:

1. Ease of Use: Pig Latin abstracts complex MapReduce programming, allowing users to
focus on data processing logic.
2. Data Flow Language: Users can express data transformations as a series of steps,
making it intuitive to write data processing scripts.
3. Execution Engine: Converts Pig Latin scripts into MapReduce jobs, optimizing
execution based on the data flow.
4. Support for UDFs: Users can create custom functions for specific data processing tasks.
5. Integration with Hadoop: Pig runs on top of Hadoop, leveraging HDFS for storage and
distributed computing capabilities.

(f) Apache Kafka (5 marks)


Apache Kafka is a distributed event streaming platform designed for high-throughput data
pipelines. Key characteristics include:

1. Publish-Subscribe Model: Kafka allows producers to publish messages to topics and


consumers to subscribe to those topics, facilitating real-time data distribution.
2. Scalability: Kafka can handle large volumes of messages by distributing them across
multiple brokers.
3. Durability: Data is stored on disk with replication across brokers, ensuring fault
tolerance.
4. High Performance: Capable of processing millions of messages per second, making it
suitable for real-time analytics.
5. Stream Processing: Integrates with stream processing frameworks like Apache Spark
and Apache Flink for real-time data processing.

(a) Explain Hadoop Ecosystem with Core Components and Architecture (10 marks)
The Hadoop ecosystem comprises various components that facilitate data storage, processing,
and analysis. Core components include:

1. Hadoop Common: Contains libraries and utilities required by other Hadoop modules.
2. HDFS: The distributed file system for storing large datasets across multiple nodes.
3. YARN (Yet Another Resource Negotiator): Manages resources and job scheduling
across the Hadoop cluster.
4. MapReduce: The programming model for processing large datasets in parallel.
5. Hadoop Ecosystem Tools:
o Hive: For data warehousing and SQL-like queries.
o Pig: For data processing using Pig Latin.
o HBase: A NoSQL database built on top of HDFS.
o Spark: A fast, in-memory data processing framework.

Architecture:

+---------------------+
| Hadoop |
| Ecosystem |
+---------------------+
| +---------+ +------+|
| | HDFS | | YARN ||
| +---------+ +------+|
| +---------+ +------+|
| | MapReduce| | Hive ||
| +---------+ +------+|
| +---------+ +------+|
| | Pig | | HBase ||
| +---------+ +------+|
| +---------+ +------+|
| | Spark | | Kafka ||
| +---------+ +------+|
+---------------------+

(b) Frameworks Running Under YARN and YARN Daemons (10 marks)
YARN supports various data processing frameworks, including:

1. Apache Spark: In-memory data processing framework for batch and stream processing.
2. Apache Flink: Stream processing framework for real-time analytics.
3. Apache Storm: Real-time computation framework for processing data streams.
4. Apache Tez: Framework for building complex data processing workflows.

YARN Daemons:

1. ResourceManager (RM): The master daemon that manages resources and schedules
applications across the cluster.
2. NodeManager (NM): A per-node daemon that manages containers and monitors
resource usage on each node.
3. ApplicationMaster (AM): A per-application daemon that negotiates resources from the
ResourceManager and manages the execution of tasks.

(a) What is NoSQL? Explain Various NoSQL Data Architecture Patterns (10 marks)
NoSQL: A category of database management systems that provide a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations used in relational
databases.

NoSQL Data Architecture Patterns:

1. Key-Value Store: Data is stored as a collection of key-value pairs.


o Example: Redis, DynamoDB.
2. Document Store: Data is stored in document formats (e.g., JSON).
o Example: MongoDB, CouchDB.
3. Column-Family Store: Data is stored in columns rather than rows, optimizing read and
write operations.
o Example: Apache Cassandra, HBase.
4. Graph Database: Optimized for storing and querying relationships between data.
o Example: Neo4j, Amazon Neptune.
5. Time-Series Database: Designed for handling time-stamped data efficiently.
o Example: InfluxDB, TimescaleDB.

(b) What is RDD? How is Data Partitioned in RDD? (10 marks)


RDD (Resilient Distributed Dataset): A fundamental data structure in Apache Spark that
represents an immutable distributed collection of objects. RDDs can be processed in parallel
across a cluster.

Data Partitioning in RDD:

1. Logical Partitions: RDDs are divided into logical partitions, allowing Spark to execute
tasks on different partitions simultaneously.
2. Data Locality: Data is partitioned based on the location of the underlying data (e.g.,
HDFS blocks), optimizing data access and reducing network overhead.
3. Custom Partitioning: Users can define custom partitioning strategies based on specific
use cases, improving performance for certain workloads.
4. Fault Tolerance: If a partition is lost, Spark can recompute it using the lineage
information, ensuring data durability.
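A short PySpark sketch showing how an RDD's logical partitions can be inspected and customised; the partition counts and the key % 2 partitioning rule are arbitrary choices for illustration.

python
from pyspark import SparkContext

sc = SparkContext("local[*]", "PartitionDemo")

# Create an RDD with an explicit number of logical partitions (4 is arbitrary)
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())                    # -> 4

# glom() groups elements by partition so the layout can be inspected
print([len(p) for p in rdd.glom().collect()])    # e.g. [25, 25, 25, 25]

# Custom partitioning for key-value data: route keys to 2 partitions by key % 2
pairs = rdd.map(lambda x: (x, x * x)).partitionBy(2, partitionFunc=lambda k: k % 2)
print(pairs.getNumPartitions())                  # -> 2

sc.stop()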

(a) (i) Advantages of Apache Spark over MapReduce (5 marks)

1. Speed: Spark processes data in-memory, resulting in faster performance compared to the
disk-based processing of MapReduce.
2. Ease of Use: Provides high-level APIs and supports multiple programming languages
(Java, Scala, Python), making it more accessible.
3. Unified Engine: Supports batch processing, stream processing, machine learning, and
graph processing, all in one platform.
4. Rich Libraries: Includes built-in libraries for machine learning (MLlib), graph
processing (GraphX), and streaming (Spark Streaming).
5. Interactive Shell: Allows for interactive data analysis with a REPL environment,
facilitating rapid prototyping.

(ii) D3 and Big Data (5 marks)


D3.js (Data-Driven Documents) is a JavaScript library used for creating dynamic and interactive
visualizations in web browsers. It leverages HTML, SVG, and CSS to bind arbitrary data to the
Document Object Model (DOM).

 Integration with Big Data: D3 can visualize large datasets effectively by using data-
binding techniques, allowing for real-time updates and interactions.
 Interactivity: D3 allows users to explore complex datasets through interactive
visualizations, facilitating better understanding and insights from big data.
 Customizability: Offers flexibility to create bespoke visualizations tailored to specific
data characteristics, enhancing data storytelling.

(b) Explain Bulk Synchronous Processing (BSP) and Graph Processing with respect to
Apache Spark (10 marks)
Bulk Synchronous Processing (BSP): A parallel computing model where processes
communicate in a series of supersteps. Each superstep consists of computation, communication,
and synchronization.

 In Spark: The BSP model is utilized in graph processing frameworks like GraphX,
where vertices and edges are processed in bulk synchronously, ensuring consistency
across computations.

Graph Processing in Spark:

 GraphX: Spark’s API for graph processing, allowing for scalable graph computations
and analytics.
 Resilient Distributed Property Graph: Represents a graph as a pair of RDDs (one for vertices, one for edges), where both vertices and edges can hold user-defined properties.
 Pregel API: Implements the BSP model, enabling users to define iterative graph
algorithms efficiently.
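The following single-machine Python sketch only mimics the BSP idea (compute, exchange messages, synchronise at the end of each superstep) for a toy shortest-path problem; it is a conceptual illustration, not the actual GraphX or Pregel API.

python
# Toy BSP-style single-source shortest path over an assumed small graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dist = {v: float("inf") for v in graph}
messages = {"A": 0}                       # initial message: source distance 0

while messages:                           # each loop iteration is one superstep
    # Computation phase: every vertex processes its incoming messages
    updated = {}
    for vertex, proposed in messages.items():
        if proposed < dist[vertex]:
            dist[vertex] = proposed
            updated[vertex] = proposed

    # Communication phase: updated vertices send messages along their edges
    messages = {}
    for vertex, d in updated.items():
        for neighbour in graph[vertex]:
            messages[neighbour] = min(messages.get(neighbour, float("inf")), d + 1)
    # Implicit barrier: the next superstep starts only after all messages are sent

print(dist)   # {'A': 0, 'B': 1, 'C': 1, 'D': 2}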

(a) Significance of Apache Pig in Hadoop Context (10 marks)


Apache Pig is significant in the Hadoop ecosystem as it simplifies the process of analyzing large
datasets.

Main Components:

1. Pig Latin: A high-level language used to write data analysis programs.


2. Parser: Converts Pig Latin scripts into logical plans.
3. Optimizer: Optimizes the logical plan for efficient execution.
4. Execution Engine: Transforms the optimized plan into MapReduce jobs that run on
Hadoop.
Working with Diagram:

+-----------------+
| Pig Latin |
+-----------------+
|
[Parser]
|
+-------------+
| Logical |
| Plan |
+-------------+
|
[Optimizer]
|
+-------------+
| Optimized |
| Plan |
+-------------+
|
[Execution Engine]
|
+-------------+
| MapReduce |
| Jobs |
+-------------+

Pig provides an abstraction over the complexities of MapReduce, allowing data analysts and
developers to work with big data more intuitively.

(b) What is Big Data and Its Types? Differences between Traditional Data vs. Big Data
with Examples (10 marks)
Big Data: Refers to datasets that are so large or complex that traditional data processing
applications cannot manage them efficiently.

Types of Big Data:

1. Structured Data: Organized in fixed fields, typically in relational databases (e.g.,


customer information in a SQL database).
2. Semi-Structured Data: Does not conform to a rigid structure but contains tags or
markers (e.g., JSON files).
3. Unstructured Data: Lacks a predefined format (e.g., text documents, images, videos).

Differences between Traditional Data and Big Data:

Feature            Traditional Data           Big Data
Volume             Small to medium datasets   Petabytes to exabytes of data
Velocity           Slow, batch processing     Real-time processing
Variety            Homogeneous data types     Diverse data types (structured, semi-structured, unstructured)
Storage            Relational databases       Distributed file systems (HDFS)
Processing Tools   SQL-based querying         Hadoop, Spark, NoSQL databases

(a) Discuss the Apache Kafka Fundamentals and Kafka Cluster Architecture (10 marks)
Apache Kafka Fundamentals:

 Messaging System: Kafka is designed for high-throughput, fault-tolerant messaging.


 Publish-Subscribe Model: Producers send messages to topics; consumers subscribe to
those topics for data retrieval.
 Scalability: Kafka can scale horizontally by adding more brokers to handle increased
loads.
 Durability: Messages are replicated across multiple brokers, ensuring data durability and
availability.

Kafka Cluster Architecture:

 Brokers: Kafka servers that store data and serve clients.


 Topics: Categories to which messages are published. Each topic can have multiple
partitions for scalability.
 Producers: Applications that send data to Kafka topics.
 Consumers: Applications that read data from Kafka topics.

Diagram:

+---------------------------+
| Kafka Cluster |
+---------------------------+
| +---------+ +---------+ |
| | Broker 1| | Broker 2| |
| +---------+ +---------+ |
| | |
| +----------------------+ |
| | Topics | |
| +----------------------+ |
| +---------+ +---------+ |
| |Producer | |Consumer | |
| +---------+ +---------+ |
+---------------------------+

(b) Write Short Note on:

(a) Master-Slave vs. Peer-to-Peer Architecture in NoSQL (10 marks)


Master-Slave Architecture:

 In this architecture, one node (master) acts as the primary source of data and controls the
data distribution to multiple subordinate nodes (slaves).
 Pros: Simplifies data consistency and management; efficient for read-heavy workloads.
 Cons: The master node can become a bottleneck; single point of failure.

Peer-to-Peer Architecture:

 All nodes (peers) have equal status and can act as both producers and consumers, sharing
data without a central controller.
 Pros: High availability and fault tolerance; no single point of failure.
 Cons: Data consistency can be challenging; more complex management.

Use Cases:

 Master-Slave: Used in systems requiring strong consistency (e.g., Redis with replicas).
 Peer-to-Peer: Common in distributed file systems and decentralized applications (e.g.,
Cassandra).

(b) Partitioner and Combiner in MapReduce (10 marks)


Partitioner:

 A component in MapReduce that determines how the output from the Map tasks is
distributed to Reducers.
 Function: It hashes the key and assigns it to a specific reducer, ensuring that all values
for a particular key end up at the same reducer.
 Benefit: Allows for balanced load distribution among reducers.

Combiner:

 An optional optimization that runs after the Map phase and before the Reduce phase.
 Function: Acts as a mini reducer to aggregate output from mappers, reducing the amount
of data transferred to reducers.
 Benefit: Decreases network bandwidth usage and speeds up the reduce process by
minimizing the amount of data sent across the network.
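A rough Python sketch of both ideas for a word-count style job: a hash partitioner that routes keys to reducers, and a combiner that pre-aggregates mapper output locally. The function names and the reducer count are illustrative, not Hadoop's actual API.

python
from collections import Counter

NUM_REDUCERS = 3   # assumed number of reduce tasks

def partition(key, num_reducers=NUM_REDUCERS):
    """Hash the key so that all values for the same key go to the same reducer."""
    return hash(key) % num_reducers

def combine(mapper_output):
    """Mini-reducer: locally sum (word, 1) pairs before they leave the mapper node."""
    combined = Counter()
    for word, count in mapper_output:
        combined[word] += count
    return list(combined.items())

mapper_output = [("big", 1), ("data", 1), ("big", 1)]
local = combine(mapper_output)                         # [('big', 2), ('data', 1)] -- less data shuffled
routes = {word: partition(word) for word, _ in local}  # reducer index chosen per key
print(local, routes)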
Q1

(a) YARN Daemon (5 marks)


YARN (Yet Another Resource Negotiator) is a resource management layer of Hadoop that
enables multiple data processing frameworks to run on a Hadoop cluster. The main daemons in
YARN include:

1. ResourceManager (RM): The master daemon responsible for managing resources


across the cluster. It allocates resources to applications based on their requirements and
monitors their progress.
2. NodeManager (NM): A per-node daemon that manages the lifecycle of containers (the
execution units of YARN). It monitors resource usage (CPU, memory) and reports back
to the ResourceManager.
3. ApplicationMaster (AM): A per-application daemon that negotiates resources from the
ResourceManager and works with the NodeManagers to execute the application tasks.

(b) Visualization using Bar Chart (5 marks)


A bar chart is a graphical representation of data using bars of different heights or lengths to show
the values of different categories. Key characteristics include:

1. Axes: The x-axis represents categories, while the y-axis represents values.
2. Bars: Each bar's length corresponds to the value it represents, allowing for easy
comparison between different categories.
3. Color Coding: Different colors can be used to enhance clarity or denote categories.
4. Labels: Axes and bars are often labeled for easier interpretation.

Example: Visualizing sales data for different products can quickly show which product has the
highest sales.
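A short matplotlib sketch of the product-sales example above; the product names and sales figures are made up for illustration.

python
import matplotlib.pyplot as plt

products = ["Phone", "Laptop", "Tablet", "Watch"]   # assumed categories (x-axis)
sales = [120, 95, 60, 30]                            # assumed values (y-axis)

plt.bar(products, sales, color="steelblue")          # bar length encodes each value
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.title("Sales by product")
plt.show()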

(c) Dataset vs. DataFrames (5 marks)

 Dataset:
o A distributed collection of data, which can be strongly typed (e.g., Spark
Datasets).
o Supports compile-time type safety and is optimized for structured data.
 DataFrame:
o A distributed collection of data organized into named columns, similar to a table
in a relational database.
o Provides a higher-level abstraction than RDDs, making it easier to use for data
manipulation and analysis.
o Supports SQL-like operations, allowing users to perform complex queries.

Key Difference: Datasets provide type safety and are often more efficient for complex
transformations, while DataFrames provide a more user-friendly interface for data manipulation.
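A brief PySpark DataFrame sketch (typed Datasets are available only in the Scala and Java APIs; in Python one works with DataFrames). The column names and rows below are illustrative.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# Create a DataFrame with named columns, similar to a relational table
df = spark.createDataFrame(
    [(1, "Alice", 50000.0), (2, "Bob", 62000.0)],
    ["id", "name", "salary"],
)

# SQL-like operations on named columns
df.filter(df.salary > 55000).select("name", "salary").show()

spark.stop()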

(d) Joins in Hive (5 marks)


Hive supports several types of joins to combine data from multiple tables:

1. Inner Join: Returns only the rows that have matching values in both tables.

sql
SELECT a.*, b.*
FROM tableA a
INNER JOIN tableB b ON a.id = b.id;

2. Left Join (Outer Join): Returns all rows from the left table and matched rows from the
right table. If no match, NULLs are returned for columns from the right table.

sql
SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b ON a.id = b.id;

3. Right Join (Outer Join): Returns all rows from the right table and matched rows from
the left table.

sql
SELECT a.*, b.*
FROM tableA a
RIGHT JOIN tableB b ON a.id = b.id;

4. Full Outer Join: Returns all rows when there is a match in one of the tables.

sql
SELECT a.*, b.*
FROM tableA a
FULL OUTER JOIN tableB b ON a.id = b.id;

(e) HDFS vs. HBase (5 marks)

 HDFS (Hadoop Distributed File System):


o Designed for storing large files in a distributed manner across a cluster.
o Optimized for high throughput of data access.
o Suitable for batch processing with a write-once, read-many model.
 HBase:
o A NoSQL database that runs on top of HDFS.
o Supports random, real-time read/write access to large datasets.
o Suitable for scenarios requiring low-latency access to sparse data.

Key Difference: HDFS is primarily for storing large files, while HBase allows for real-time
access and manipulation of structured data.

(f) Data Types of Big Data (5 marks)

1. Structured Data: Organized in a fixed format, such as rows and columns (e.g., SQL
databases).
2. Semi-Structured Data: Does not have a strict structure but contains tags or markers
(e.g., JSON, XML).
3. Unstructured Data: Lacks a predefined format (e.g., text documents, images, videos).
4. Time-Series Data: Data indexed in time order, often used in financial and sensor data.
5. Geospatial Data: Data related to geographic locations (e.g., maps, GPS data).

(a) Characteristics of Social Media for Big Data Analytics (10 marks)

1. User-Generated Content: Social media platforms generate massive amounts of content


from users, providing diverse data for analysis.
2. Real-Time Data Streams: Social media provides instantaneous data updates (tweets,
posts), enabling real-time analytics.
3. Rich Data Formats: Data includes text, images, videos, and metadata, enriching the
dataset for comprehensive analysis.
4. High Engagement Levels: Social interactions and engagement metrics provide valuable
insights into consumer behavior and trends.
5. Network Effects: Social connections and relationships can be analyzed to understand
influence and information dissemination.

(b) Data Visualization and Tools (10 marks)


Data Visualization: The graphical representation of information and data, making complex data
more accessible and understandable through visual context.

Key Tools:
1. Tableau: A powerful tool for creating interactive and shareable dashboards, suitable for
business intelligence.
2. Power BI: A Microsoft tool that enables data visualization and sharing insights across an
organization.
3. D3.js: A JavaScript library for producing dynamic and interactive data visualizations in
web browsers.
4. Matplotlib: A Python library for creating static, animated, and interactive visualizations
in Python.
5. Google Charts: A web service that creates interactive charts and graphs, integrating well
with Google services.

Example: Tableau can visualize sales data over time, enabling stakeholders to identify trends and
make informed decisions.

(a) Significance of Pig in Hadoop Ecosystem (10 marks)


Apache Pig is significant for several reasons:

1. Simplifies MapReduce: Provides a higher-level scripting language (Pig Latin) that


abstracts the complexity of MapReduce programming.
2. Data Processing Pipeline: Allows users to build complex data processing workflows
easily.
3. Extensibility: Supports user-defined functions (UDFs) for custom processing needs.
4. Optimization: The Pig execution engine optimizes the execution plan, improving
performance.
5. Integration with Hadoop: Works seamlessly with HDFS and other Hadoop ecosystem
components.

Data Model in Pig Latin:

 Pig uses a data model based on tuples, bags, and maps.


 Tuple: An ordered set of fields (e.g., (1, "Alice")).
 Bag: A collection of tuples (e.g., { (1, "Alice"), (2, "Bob") }).
 Map: A set of key-value pairs (e.g., [ "name" # "Alice", "age" # 30 ]).

Example:

pig
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
(b) CAP Theorem and NoSQL Data Architectural Pattern (10 marks)
CAP Theorem: States that in a distributed data store, you can only guarantee two of the
following three properties at the same time:

1. Consistency: All nodes see the same data at the same time.
2. Availability: Every request receives a response, whether successful or failed.
3. Partition Tolerance: The system continues to function despite network partitions.

Example:

 CP Systems: HBase prioritizes consistency and partition tolerance but may sacrifice
availability during network issues.
 AP Systems: Cassandra emphasizes availability and partition tolerance, potentially
allowing inconsistencies.

NoSQL Data Architectural Patterns:

1. Key-Value Store: Simplistic, key-based access (e.g., Redis).


2. Document Store: Stores data in document formats (e.g., MongoDB).
3. Column-Family Store: Stores data in columns (e.g., Cassandra).
4. Graph Database: Optimized for relationships (e.g., Neo4j).

(a) Matrix Multiplication by MapReduce (10 marks)


To perform matrix multiplication using MapReduce, we can break it down into two main phases:
Mapper and Reducer.

Matrices:

 Matrix A (2x2):

1 2
3 4

 Matrix B (2x2):

5 6
2 3

Map Function:

 Emit intermediate key-value pairs for matrix A and B.


 Example: For each element in A, emit a key representing its column index.

Reduce Function:

 Calculate the final matrix product by summing up the products of the corresponding
elements.

Example:

Mapper Output for Matrix A:
(A, row: 0, col: 0, value: 1)
(A, row: 0, col: 1, value: 2)
(A, row: 1, col: 0, value: 3)
(A, row: 1, col: 1, value: 4)

Mapper Output for Matrix B:


(B, row: 0, col: 0, value: 5)
(B, row: 0, col: 1, value: 6)
(B, row: 1, col: 0, value: 2)
(B, row: 1, col: 1, value: 3)

Reducer:

 Combine and compute the sum to form the resulting matrix C.
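For the matrices above, the reducer produces C[0][0] = 1*5 + 2*2 = 9, C[0][1] = 1*6 + 2*3 = 12, C[1][0] = 3*5 + 4*2 = 23, and C[1][1] = 3*6 + 4*3 = 30, so the resulting matrix C is [[9, 12], [23, 30]].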

(b) Hive Architecture with Example (10 marks)


Hive Architecture: Hive is a data warehousing infrastructure built on top of Hadoop, designed
to facilitate querying and managing large datasets.

Components:

1. Hive Metastore: Stores metadata about the tables and partitions.


2. Driver: Manages the execution of the queries.
3. Compiler: Converts HiveQL queries into execution plans.
4. Execution Engine: Executes the plan using Hadoop MapReduce.

Example:

sql
CREATE TABLE employees (id INT, name STRING, salary FLOAT);

This command creates a table in Hive, which will be stored in HDFS, with metadata maintained
in the Metastore.
Q5

(a) Apache Spark and Advantages Over MapReduce (10 marks)


Apache Spark: An open-source unified analytics engine for large-scale data processing, known
for its speed and ease of use.

Advantages over MapReduce:

1. Speed: Spark performs in-memory data processing, making it significantly faster than
MapReduce, which relies on disk storage.
2. Ease of Use: Supports high-level APIs in Java, Scala, Python, and R, making it more
accessible for developers.
3. Advanced Analytics: Provides built-in libraries for streaming, machine learning, and
graph processing.
4. Unified Engine: Supports batch processing, streaming, and interactive queries with a
single platform.

(b) NoSQL Data Architecture Patterns (10 marks)


NoSQL databases are designed to handle various types of data models. Key architecture patterns
include:

1. Key-Value Store: Simple pairs of keys and values (e.g., Redis).


o Example: User session storage.
2. Document Store: Stores data in documents (JSON, XML) (e.g., MongoDB).
o Example: E-commerce product catalogs.
3. Column-Family Store: Organizes data into columns and rows (e.g., Cassandra).
o Example: Time-series data storage.
4. Graph Database: Focuses on relationships between data points (e.g., Neo4j).
o Example: Social network analysis.

(a) Distributed Storage System of Hadoop (10 marks)


Hadoop employs a distributed storage system primarily through HDFS (Hadoop Distributed File
System).

Key Features:

1. Data Replication: Data is split into blocks (default 128 MB) and replicated across
multiple nodes to ensure fault tolerance.
2. Scalability: Easily scales horizontally by adding more nodes to the cluster.
3. High Throughput: Optimized for large data sets and high throughput of data access.

Diagram:

     +---------------+
     |    Client     |
     +---------------+
             |
     +---------------+
     |   NameNode    |
     +---------------+
        /         \
+-----------+ +-----------+
| DataNode  | | DataNode  |
+-----------+ +-----------+

(b) Apache Kafka and Real-Time Data Streaming (10 marks)


Apache Kafka: A distributed streaming platform that provides a framework for handling real-
time data feeds.

Streaming Real-Time Data:

1. Producers: Applications that publish messages to Kafka topics.


2. Topics: Categories for messages; can be partitioned for scalability.
3. Consumers: Applications that subscribe to topics and process messages.

Example:

 A social media application can publish user activity data (e.g., likes, comments) to a
Kafka topic. This data can then be consumed by analytics applications for real-time
insights.

Diagram:

+----------+ +---------+
| Producer | ---> | Topic |
+----------+ +---------+
|
|
+---------+
| Consumer |
+---------+
