Big Data Visualization
1. Explain the 5 V's of Big Data.
Answer:
The 5 V's of Big Data are critical characteristics that define its complexity and challenges:
Volume: Refers to the vast amounts of data generated every second from various
sources, including social media, sensors, and transactions. Organizations must manage
and analyze this massive volume effectively.
Velocity: This denotes the speed at which data is generated and needs to be processed.
For example, real-time data processing is essential for applications like fraud detection
and social media analytics.
Variety: Big Data comes in different formats, such as structured data (e.g., databases),
semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images).
Handling this variety requires flexible storage and processing systems.
Veracity: Refers to the quality and accuracy of the data. High veracity data is crucial for
making reliable business decisions, while low veracity can lead to incorrect conclusions.
Value: This indicates the potential insights and benefits derived from analyzing Big
Data. The goal is to extract actionable insights that can drive business strategies and
decisions.
2. Explain HDFS.
Answer:
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop,
designed to handle large files and provide high throughput access to application data.
Key Features:
o Scalability: HDFS can scale horizontally by adding more nodes to the cluster.
o Fault Tolerance: Data is divided into blocks (default size 128 MB) and
replicated across multiple DataNodes. This replication ensures that data remains
accessible even if some nodes fail.
o High Throughput: HDFS is optimized for high throughput, making it suitable
for batch processing of large datasets.
Architecture Components:
o NameNode: The master server that manages the metadata and namespace of the
file system. It keeps track of which blocks are stored on which DataNodes.
o DataNodes: These are the worker nodes that store the actual data blocks. They
periodically send heartbeat signals to the NameNode to confirm their status.
o Secondary NameNode: Periodically merges the NameNode's namespace image
with the edit log (checkpointing). It is not a hot standby, but its checkpoints help
the NameNode recover more quickly after a failure.
3. What is NoSQL?
Answer:
NoSQL refers to a broad class of database management systems that do not adhere strictly to the
traditional relational database model. They are designed for scalability and flexibility, especially
for large volumes of structured and unstructured data.
Answer:
Hive is a data warehousing infrastructure built on top of Hadoop, designed to facilitate querying
and managing large datasets using HiveQL, a SQL-like language.
Architecture Components:
o Metastore: A centralized repository that stores metadata about tables, partitions,
and schemas. This metadata is essential for query optimization and execution.
o Driver: Manages the lifecycle of a HiveQL query, from compilation to execution.
o Compiler: Converts HiveQL queries into a series of MapReduce jobs or other
execution plans, optimizing for performance.
o Execution Engine: Executes the generated execution plans using Hadoop
MapReduce, Tez, or Spark.
Example:
CREATE TABLE employees (id INT, name STRING, salary FLOAT);
This command creates a new table in Hive, which is stored in HDFS, with metadata
managed in the Metastore.
Answer:
Apache Pig is a high-level platform for creating programs that run on Hadoop, primarily using a
language called Pig Latin.
Key Components:
o Pig Latin: A scripting language that simplifies writing MapReduce programs. It
allows users to express data transformations without dealing with the complexities
of Java code.
o Execution Engine: Converts Pig Latin scripts into a series of MapReduce jobs
for execution on the Hadoop cluster.
o Grunt Shell: An interactive shell for running Pig commands and testing Pig Latin
scripts.
How it Works:
o A Pig Latin script is parsed into a logical plan, optimized, compiled into one or
more MapReduce jobs, and executed on the Hadoop cluster, with results written
back to HDFS.
Answer:
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and
streaming applications.
Key Features:
o High Throughput: Kafka can handle thousands of messages per second, making
it suitable for high-volume data streams.
o Durability: Messages are stored on disk and replicated across multiple brokers,
ensuring data persistence and fault tolerance.
o Scalability: Kafka can be scaled horizontally by adding more brokers to the
cluster.
o Real-time Processing: It allows for processing streams of data in real time,
enabling immediate insights and actions based on incoming data.
Architecture Components:
o Producers: Applications that publish messages to Kafka topics.
o Topics: Categories for messages, which can be partitioned for load balancing.
o Consumers: Applications that subscribe to topics and process the messages.
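To make the producer/consumer roles above concrete, here is a minimal sketch using the third-party kafka-python package; the broker address localhost:9092 and the topic name "user-activity" are assumptions, not part of any specific deployment.

import sys
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish one message to an assumed topic on a local broker
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-activity", b"user123 liked post 456")
producer.flush()  # make sure the message is actually sent before exiting

# Consumer: subscribe to the same topic and process incoming messages
consumer = KafkaConsumer("user-activity",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)  # raw bytes as published by the producer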
Answer:
YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop,
enabling multiple data processing engines to run on a single cluster.
Key Components:
o ResourceManager: The master daemon responsible for managing resources
across the cluster and scheduling applications.
o NodeManager: A per-node daemon that monitors resource usage (CPU,
memory) and manages the execution of tasks on that node.
o ApplicationMaster: A per-application daemon that negotiates resources from the
ResourceManager and coordinates the execution of tasks.
How it Works:
o A client submits an application to the ResourceManager, which allocates a
container for that application's ApplicationMaster. The ApplicationMaster then
negotiates further containers from the ResourceManager, and the NodeManagers
launch and monitor the tasks running in those containers until the application
completes.
Answer:
The CAP Theorem states that in a distributed data store, it is impossible to simultaneously
guarantee all three of the following properties:
Consistency: Every read receives the most recent write for a given piece of data.
Availability: Every request receives a response, either success or failure.
Partition Tolerance: The system continues to operate despite network partitions.
Example:
HBase prioritizes consistency and partition tolerance (CP) at the expense of availability
during a network partition.
Cassandra, on the other hand, emphasizes availability and partition tolerance (AP),
potentially sacrificing consistency.
Answer:
Structured Data:
o Organized in a fixed format, typically in rows and columns (e.g., relational
databases).
o Easy to enter, store, query, and analyze.
o Examples include SQL databases like MySQL and Oracle.
Semi-Structured Data:
o Does not conform to a fixed schema but contains tags or markers to separate data
elements (e.g., XML, JSON).
o More flexible than structured data but still allows some level of organization.
o Examples include web services that use JSON for data interchange.
Unstructured Data:
o Lacks a predefined format or structure, making it difficult to organize and
analyze.
o Examples include text documents, emails, images, audio, and video files.
o Requires advanced analytics tools to extract insights.
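To make the structured/semi-structured distinction concrete, the short Python sketch below parses one fixed-schema CSV record and one self-describing JSON record; the field names and values are illustrative only.

import csv
import json
from io import StringIO

# Structured: fixed columns, every record follows the same schema
structured_row = next(csv.reader(StringIO("101,Alice,55000.0")))
print(structured_row)  # ['101', 'Alice', '55000.0']

# Semi-structured: tagged fields that can vary from record to record
semi_structured = json.loads('{"id": 101, "name": "Alice", "skills": ["SQL", "Python"]}')
print(semi_structured["skills"])  # ['SQL', 'Python']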
Answer:
Bulk Synchronous Processing (BSP) is a parallel computing model that consists of a series of
supersteps, where computation is performed, followed by a synchronization phase.
Key Features:
o Each superstep can communicate with other nodes, allowing for data exchange
and aggregation.
o After computation in each superstep, all nodes synchronize to ensure consistency
before moving to the next superstep.
Relation to Spark:
o Spark leverages a similar model for iterative processing, allowing data to be
processed in parallel across a cluster while efficiently managing communication
and synchronization between tasks.
o This model enhances performance for iterative algorithms, such as those used in
machine learning.
Answer:
Matrix multiplication using MapReduce involves several steps:
1. Mapper Function:
o The input matrices are distributed across the cluster. Each mapper processes rows
from matrix A and columns from matrix B.
o For each row of A and each column of B, the mapper emits key-value pairs where
the key is a tuple representing the resulting matrix indices and the value is the
product of the corresponding elements.
Example: Consider Matrix A:
1 2
3 4
and Matrix B:
5 6
2 3
For row 0 of A, the mapper emits (key = cell of the result matrix, value = element product):
(0, 0) -> 1*5
(0, 0) -> 2*2
(0, 1) -> 1*6
(0, 1) -> 2*3
2. Reducer Function:
o The reducer receives the emitted key-value pairs and sums the values for each
unique key, producing the final result for the corresponding cell in the resulting
matrix.
3. Final Output:
o The output is a matrix where each cell represents the sum of the products for the
corresponding row from A and column from B.
Answer:
HDFS:
o Designed for storing large files with high throughput.
o Optimized for batch processing; data is typically read in large blocks.
o Data is immutable, and once written, it cannot be changed.
o Works best for workloads requiring high data reliability and fault tolerance.
HBase:
o A NoSQL database built on top of HDFS designed for real-time access to data.
o Supports random read/write access to large datasets.
o Data is stored in tables with rows and columns, allowing for dynamic schema
design.
o Suitable for applications needing low-latency access to big data.
Advantages of Apache Spark over MapReduce:
Speed: Spark performs in-memory processing, reducing the time taken for data
read/write operations compared to disk-based MapReduce.
Ease of Use: Spark offers high-level APIs in languages like Python, Java, and Scala,
making it more accessible for developers.
Unified Framework: Spark supports various data processing tasks (batch, streaming,
machine learning) within a single framework.
Advanced Analytics: It provides built-in libraries for machine learning (MLlib), graph
processing (GraphX), and stream processing (Spark Streaming).
Answer:
Apache Pig’s architecture is designed to simplify the process of writing and executing
MapReduce programs.
Key Components:
o Pig Latin: The language used to express data flows and transformations.
o Parser: Validates and translates Pig Latin scripts into a logical plan.
o Optimizer: Improves the logical plan for performance, converting it into a
physical plan.
o Execution Engine: Executes the physical plan on Hadoop, generating
MapReduce jobs.
Diagram:
User -> Pig Latin Script -> Parser -> Logical Plan -> Optimizer -> Physical
Plan -> Execution Engine -> Hadoop Cluster
Answer:
Data visualization is a critical component of data analysis that transforms complex data into
graphical representations, making it easier to identify trends, patterns, and insights.
Key Benefits:
o Enhanced Understanding: Visuals help to clarify complex data and make it
more accessible to non-technical stakeholders.
o Improved Decision-Making: Data visualizations enable quicker and more
informed decisions based on data insights.
o Trend Identification: Visualizations highlight trends and anomalies that may be
overlooked in raw data.
o Communication: Visuals convey information more effectively than text or
tables, facilitating better storytelling with data.
Popular Tools:
o Tableau: Offers interactive dashboards and visualizations.
o D3.js: A JavaScript library for producing dynamic, interactive data visualizations
in web browsers.
o Power BI: A Microsoft tool for visual analytics that integrates with various data
sources.
Q1
1. Basically Available: The system guarantees availability of data even in the face of
failures. This is a trade-off for consistency, which may not always be available.
2. Soft State: The state of the system may change over time, even without new input. This
is due to the eventual consistency model, where updates propagate through the system
over time.
3. Eventual Consistency: While immediate consistency is not guaranteed, the system will
become consistent eventually, as long as no new updates are made to the data.
1. Volume: Refers to the enormous amounts of data generated every second from various
sources (social media, IoT devices, transactions, etc.). This scale of data requires special
techniques for storage and processing.
2. Velocity: The speed at which new data is generated and processed is crucial. High-
velocity data needs to be analyzed in real time or near real time to derive timely insights.
3. Variety: Data comes in multiple formats and types, such as structured (databases), semi-
structured (XML, JSON), and unstructured (text, images). Handling this variety poses
challenges for data management and analysis.
Serialization: The process of converting an object (data structure) into a byte stream for
storage or transmission. For example, when saving data to HDFS.
Deserialization: The reverse process of converting a byte stream back into an object.
When reading data from HDFS into Hive tables, the data needs to be deserialized.
Customization: Users can implement custom SerDes for different data formats (like
JSON, XML) to enable Hive to read and write data in formats beyond its default
capabilities.
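As a rough analogy for what a SerDe does for Hive table rows (not an actual Hive SerDe implementation), the Python sketch below serializes a record to a byte stream and deserializes it back; the record fields are invented.

import json

record = {"id": 1, "name": "Alice", "salary": 55000.0}

# Serialization: object -> byte stream suitable for storage or transmission
serialized = json.dumps(record).encode("utf-8")

# Deserialization: byte stream -> object the application can work with again
restored = json.loads(serialized.decode("utf-8"))
assert restored == record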
Q2
1. NameNode: The master server that stores metadata about the files in HDFS, such as file
locations, directory structure, and permissions. It does not store the actual data.
2. DataNode: The worker nodes that store the actual data blocks. They are responsible for
serving read and write requests from clients and periodically sending heartbeats and
block reports to the NameNode.
3. Secondary NameNode: Performs checkpointing for the NameNode by periodically merging the
namespace image and edit log, which keeps the edit log from growing without bound; it is not a
live backup of the NameNode.
Diagram:
+-----------+
| NameNode |
+-----------+
/ \
/ \
/ \
+---------+ +---------+
| DataNode| | DataNode|
+---------+ +---------+
Q3
HDFS: The distributed file system that stores data across multiple nodes.
MapReduce: The programming model for processing data in parallel.
YARN: The resource management layer that manages compute resources in the cluster.
Hive: A data warehouse tool for querying and managing large datasets using SQL-like
syntax.
Pig: A platform for analyzing large datasets using a high-level scripting language (Pig
Latin).
HBase: A distributed NoSQL database that runs on top of HDFS.
Diagram:
+------------------+
| YARN |
+------------------+
/ | \
/ | \
+---------+ +---------+ +---------+
| HDFS | | MapReduce| | HBase |
+---------+ +---------+ +---------+
| |
+---------+ +---------+
| Hive | | Pig |
+---------+ +---------+
1. HDFS: Manages storage across nodes, providing fault tolerance by replicating data
blocks.
2. YARN: Acts as a resource manager, scheduling jobs and allocating resources to various
applications running in the Hadoop ecosystem.
3. Hive: Provides a SQL-like interface to query and analyze data stored in HDFS, allowing
users to work with data without needing to know MapReduce.
InputFormat:
o Defines how input data is split and read. It controls how data is divided into splits
for processing.
o Examples include TextInputFormat, which reads lines of text, and
SequenceFileInputFormat, which reads key-value pairs from sequence files.
RecordReader:
o A key component of the MapReduce framework, it reads the data from the
InputFormat and converts it into key-value pairs that can be processed by the
Mapper.
o It works in conjunction with InputFormat to facilitate the reading of input data
efficiently.
Q4
Architecture Components:
1. Hive Metastore: A central repository for storing metadata, including information about
tables, columns, and data types. It helps manage schema and enables query optimization.
2. Hive Driver: The component that manages the lifecycle of a Hive query. It parses the
HiveQL statement, compiles it, and translates it into a series of MapReduce jobs.
3. Execution Engine: Executes the tasks as defined by the driver. It converts HiveQL into
MapReduce jobs and coordinates their execution.
4. User Interface: Hive provides several interfaces, including:
o Command Line Interface (CLI): For direct interaction with Hive.
o Web Interface: For GUI-based access.
o Thrift Server: For programmatic access via JDBC/ODBC.
Overall, Hive abstracts the complexities of MapReduce, allowing users to work with data using
familiar SQL-like commands.
Q5
By structuring MapReduce jobs in this way, you can efficiently perform complex operations on
large datasets.
(b) What is Pig? Discuss the Load() & Store() commands in Pig framework (10 marks)
Apache Pig is a high-level platform for processing large datasets on Hadoop. It uses a scripting
language called Pig Latin, which simplifies the process of writing complex MapReduce
programs.
Load() Command:
o This command is used to read data into Pig from various sources (e.g., HDFS,
local file system).
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o It specifies the format of the input data and how it should be parsed.
Store() Command:
o This command is used to write the processed data back to HDFS or other storage
systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o It defines the format of the output data and where to store it.
These commands enable Pig to efficiently handle the data flow within the Hadoop ecosystem,
allowing users to focus on data processing rather than the underlying complexity of MapReduce.
Tables: Data is stored in tables, similar to RDBMS, but can have a dynamic number of
columns.
Column Families: Columns are grouped into families, allowing related data to be stored
together. This enhances read/write efficiency.
Rows: Each row is identified by a unique key, and the data within can be sparsely
populated.
Scalability: The model is designed to handle massive datasets by distributing data across
multiple nodes.
Redundancy: If a task is running slower than expected, Hadoop will launch another
instance of the same task on a different node.
Improved Job Completion: This redundancy helps ensure that the overall job does not
get delayed by slow tasks.
Configuration: Users can enable or disable speculative execution through configuration
settings, depending on their workload requirements.
hdfs dfs -ls <path>: Lists files and directories in a specified path.
hdfs dfs -put <local_path> <hdfs_path>: Uploads a file from the local filesystem
to HDFS.
hdfs dfs -get <hdfs_path> <local_path>: Downloads a file from HDFS to the
local filesystem.
hdfs dfs -rm <path>: Deletes a file or directory in HDFS.
These commands are essential for navigating and manipulating data within HDFS, facilitating
effective data management.
Q1
1. Consistency (C): Every read receives the most recent write or an error. All nodes return
the same data when queried at the same time.
2. Availability (A): Every request receives a response, regardless of whether it was
successful or not. The system remains operational even if some nodes are down.
3. Partition Tolerance (P): The system continues to operate despite network partitions that
prevent nodes from communicating.
In practice, systems can achieve two of these three properties, leading to trade-offs depending on
the requirements of the application. For example, a system might favor consistency and partition
tolerance, sacrificing availability during a network failure.
SQL-like Syntax: Allows users familiar with SQL to write queries without needing to
understand MapReduce.
Data Manipulation: Supports operations like SELECT, JOIN, GROUP BY, and
ORDER BY to retrieve and manipulate data.
Hive Metastore Integration: Queries can leverage metadata stored in the Hive
Metastore for better optimization and schema management.
Extensibility: Users can define User-Defined Functions (UDFs) to extend HiveQL
capabilities, allowing custom operations on data.
HiveQL abstracts the complexity of MapReduce, making it easier for analysts and developers to
work with big data.
(c) Speculative Execution (5 marks)
Speculative Execution is a technique used in Hadoop to improve the performance of jobs that
experience slow-running tasks, commonly referred to as stragglers. Key points include:
Redundant Task Execution: When a task is running slower than a predefined threshold,
Hadoop launches a duplicate of that task on another node.
Task Completion: The first task to finish successfully will report its result, while the
slower task is killed, improving overall job completion time.
Configuration: Speculative execution can be enabled or disabled based on the specific
needs of the workload, balancing between resource usage and job performance.
This approach helps in ensuring that the overall job does not get delayed due to a few straggler
tasks.
HDFS HA enhances the reliability and fault tolerance of the Hadoop ecosystem.
Q2
1. Volume: The sheer size of data, often measured in terabytes or petabytes, requires
specialized tools for storage and processing.
2. Velocity: The speed at which data is generated and processed. Real-time or near-real-
time processing is often necessary for timely insights.
3. Variety: Data comes in different formats (structured, semi-structured, unstructured) from
diverse sources, such as social media, sensors, and transaction logs.
4. Veracity: The accuracy and reliability of the data, highlighting the challenges of data
quality and consistency.
5. Value: The potential insights and benefits that can be derived from analyzing big data,
emphasizing the importance of effective data management and analytics.
These characteristics necessitate innovative approaches and technologies for effective data
management and analysis.
(b) Hadoop ecosystem with the diagram and three components (10 marks)
The Hadoop ecosystem consists of various tools and frameworks that work together to process
and analyze big data efficiently. Key components include:
HDFS: The distributed file system that stores data across multiple nodes.
MapReduce: The programming model for processing data in parallel across a cluster.
YARN: The resource management layer that schedules jobs and allocates resources.
Hive: A data warehouse tool for querying large datasets using SQL-like syntax.
Pig: A platform for analyzing large datasets using a high-level scripting language.
Diagram:
+------------------+
| YARN |
+------------------+
/ | \
/ | \
+---------+ +---------+ +---------+
| HDFS | | MapReduce| | HBase |
+---------+ +---------+ +---------+
| |
+---------+ +---------+
| Hive | | Pig |
+---------+ +---------+
Q3
(a) MapReduce Architecture and tasks (10 marks)
MapReduce is a programming model for processing large datasets in a distributed environment.
Its architecture consists of:
Map Task:
Input Splits: The input data is divided into splits for parallel processing.
Mapper: Each mapper processes its assigned split, reading data, and emitting key-value
pairs as output.
Reduce Task:
Shuffle and Sort: The framework collects outputs from mappers and sorts them by key.
Reducer: Each reducer processes the sorted data, performing aggregation or other
computations, and produces final output.
Example: For a word count program, the mapper reads text lines and emits (word, 1) pairs,
while the reducer sums up counts for each word.
1. HMaster: The master server that manages region servers, handles schema changes, and
coordinates load balancing.
2. Region Server: Responsible for serving read and write requests for regions (horizontal
slices of a table). Each region contains rows sorted by row key.
3. Regions: Tables are divided into multiple regions, enabling parallel processing and
scalability.
4. HFile: The underlying storage format used by HBase to store data on HDFS. It supports
efficient read/write operations.
5. Zookeeper: Coordinates between HMaster and Region Servers, ensuring high
availability and consistency.
This architecture allows HBase to handle large volumes of data while providing fast access and
scalability.
Q4
1. Resource Manager (RM): The master daemon that manages resources across the cluster
and schedules jobs.
2. Node Manager (NM): Each worker node runs a Node Manager, responsible for
managing containers (units of resource allocation) and monitoring resource usage.
3. Application Master (AM): Each application (job) has its own Application Master that
negotiates resources from the Resource Manager and coordinates execution.
4. Containers: Resource allocations in YARN that run tasks (mappers/reducers) for
applications.
Diagram:
         +--------------------+
         |  Resource Manager  |
         +--------------------+
            /              \
+----------------+    +----------------+
|  Node Manager  |    |  Node Manager  |
+----------------+    +----------------+
        |                      |
+----------------+    +----------------+
|   Container    |    |   Container    |
|   (Mapper)     |    |   (Reducer)    |
+----------------+    +----------------+
This architecture enables improved scalability and resource utilization compared to earlier
Hadoop versions.
1. Map Function:
o Input: A text line.
o Output: (word, 1) for each word in the line.
def map(line):
    # Output the pair (word, 1) for every word in the input line;
    # emit(key, value) stands for the framework's output call (pseudocode)
    words = line.split()
    for word in words:
        emit(word, 1)
2. Reduce Function:
o Input: (word, list of counts).
o Output: (word, total count).
def reduce(word, counts):
    # Sum all the 1's emitted for this word to obtain its total count
    total_count = sum(counts)
    emit(word, total_count)
3. Job Submission:
o Configure the job with the input and output paths, set the Mapper and Reducer
classes, and submit the job to the Hadoop cluster.
This approach allows for distributed processing of large text files, providing the count of each
word efficiently.
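The emit()-based pseudocode above can be turned into a runnable pair of Hadoop Streaming scripts; the sketch below assumes the files are saved as mapper.py and reducer.py, and that the reducer receives its input sorted by key, as Hadoop Streaming guarantees between the map and reduce phases.

# mapper.py -- read lines from stdin, emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")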
Q5
1. Hive Metastore: Stores metadata about tables, partitions, and schemas. It is crucial for
query optimization and schema management.
2. Driver: Manages the execution of HiveQL queries. It parses the query, plans its
execution, and communicates with the execution engine.
3. Execution Engine: Converts HiveQL queries into MapReduce jobs (or Spark jobs) and
manages their execution.
4. User Interface: Provides several interfaces for users, including:
o CLI (Command Line Interface)
o Web UI
o Thrift Server for programmatic access.
Hive abstracts the complexities of Hadoop and enables users to perform data analysis using
familiar SQL constructs.
(b) What is Pig? Load() & Store() commands (10 marks)
Apache Pig is a high-level platform for processing large datasets on Hadoop, using a scripting
language called Pig Latin.
Load() Command:
o Used to read data into Pig from various sources.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o It specifies the format of the input data and how it should be parsed.
Store() Command:
o Used to write processed data back to HDFS or other storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o It defines the format of the output data and the target location.
These commands facilitate effective data handling within the Pig framework, allowing for
flexible data processing workflows.
Row-oriented Storage: Stores data in rows, which is ideal for transactional applications.
Each row is stored together, making it efficient for retrieving entire records. Examples
include traditional RDBMS systems like MySQL.
Column-oriented Storage: Stores data in columns, which is better suited for analytical
applications. This format allows for efficient aggregation and retrieval of specific
attributes across many records. Examples include HBase and Google Bigtable.
Role: RecordReader bridges the gap between the input data and the Mapper. It reads
input splits (as defined by InputFormat) and converts the data into key-value pairs that
can be processed by the Mapper.
Functionality: It handles the conversion of data formats, managing tasks such as
splitting lines in text files or reading binary files. Each RecordReader is responsible for a
single input split.
NameNode: The master server in HDFS responsible for managing metadata, such as file
and directory structures, and the location of data blocks. It does not store actual data but
tracks which DataNodes hold which blocks.
DataNode: Worker nodes in HDFS that store the actual data blocks. They handle read
and write requests from clients and periodically send heartbeats and block reports to the
NameNode.
1. hdfs dfs -ls <path>: Lists files and directories in the specified path.
2. hdfs dfs -put <local_path> <hdfs_path>: Uploads a file from the local filesystem
to HDFS.
3. hdfs dfs -get <hdfs_path> <local_path>: Downloads a file from HDFS to the
local filesystem.
4. hdfs dfs -rm <path>: Deletes a file or directory in HDFS.
5. hdfs dfs -mkdir <path>: Creates a new directory in HDFS.
Simplicity: They provide a simple model for storing and retrieving data, where each
unique key is associated with a value.
Performance: Designed for high-speed operations, making them ideal for applications
that require quick lookups.
Scalability: They can be scaled horizontally by adding more nodes to handle increased
load.
Examples: Redis and Amazon DynamoDB are popular key-value databases used for caching
and session management.
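As a small illustration of the key-value model described above, the sketch below uses the redis-py client; it assumes a Redis server running on localhost:6379, and the key names are invented.

import redis

# Connect to a local Redis server (assumed to be running on the default port)
r = redis.Redis(host="localhost", port=6379, db=0)

# Each unique key maps to a value; lookups happen directly by key
r.set("session:user123", "logged_in")
r.set("cache:homepage", "<html>...</html>", ex=60)  # expire after 60 seconds

print(r.get("session:user123"))  # b'logged_in'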
Q1
1. Scalability: Must easily scale horizontally to accommodate increasing data loads without
significant performance degradation.
2. Flexibility: Support various data models (e.g., document, key-value, column-family,
graph) to handle diverse data types and structures.
3. High Availability: Should provide high availability and fault tolerance through data
replication and distribution across multiple nodes.
4. Schema-less Design: Allow for a dynamic schema, enabling applications to evolve
without needing complex migrations.
5. Performance: Ensure low latency for read and write operations, catering to high-velocity
data processing.
1. Volume: Refers to the vast amounts of data generated every second, often measured in
terabytes or petabytes.
2. Velocity: Describes the speed at which data is generated, processed, and analyzed. Real-
time processing is often required to derive timely insights.
3. Variety: Represents the different types of data (structured, semi-structured, unstructured)
and formats (text, images, videos) that organizations deal with.
Q2
1. NameNode: The master server that manages metadata and regulates access to files by
clients. It does not store actual data.
2. DataNode: The worker nodes that store actual data blocks. They regularly send heartbeat
signals and block reports to the NameNode.
3. Secondary NameNode: An auxiliary service that periodically saves the state of the
filesystem metadata. It helps in recovering from NameNode failures but does not replace
the NameNode.
Diagram:
          +---------------------+
          |      NameNode       |
          +---------------------+
             /       |       \
+----------+   +----------+   +----------+
| DataNode |   | DataNode |   | DataNode |
+----------+   +----------+   +----------+
Q3
1. Hadoop Common: The common utilities and libraries that support other Hadoop
modules.
2. HDFS: The distributed file system for storing data.
3. YARN: The resource management layer that schedules jobs.
4. MapReduce: The programming model for data processing.
5. Hive: A data warehouse tool for querying large datasets using HiveQL.
6. Pig: A platform for analyzing large datasets using Pig Latin.
7. HBase: A NoSQL database for real-time read/write access to large datasets.
8. Sqoop: A tool for transferring data between Hadoop and relational databases.
9. Flume: A service for collecting and aggregating log data from various sources.
Diagram:
+------------------+
| YARN |
+------------------+
/ | \
/ | \
+---------+ +---------+ +---------+
| HDFS | | MapReduce| | HBase |
+---------+ +---------+ +---------+
| |
+---------+ +---------+
| Hive | | Pig |
+---------+ +---------+
|
+---------+
| Sqoop |
+---------+
|
+---------+
| Flume |
+---------+
Flume: Primarily used for collecting and aggregating log data from various sources into
HDFS. It is designed for streaming data and real-time ingestion.
Sqoop: Used for transferring bulk data between Hadoop and relational databases. It is
optimized for importing/exporting data in large volumes and works with structured data.
Load() Command:
o Reads data into Pig from various sources.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o Specifies the format of the input data.
Store() Command:
o Writes processed data back to HDFS or other storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o Defines the format of the output data.
These commands facilitate effective data handling within Pig, allowing for flexible data
processing workflows.
Q4
(b) Enunciate the Word Count Algorithm Using MapReduce (10 marks)
The Word Count algorithm is a classic MapReduce example that counts the occurrences of each
word in a text file.
1. Map Function:
o Input: A line of text.
o Output: Key-value pairs where the key is the word and the value is 1.
def map(line):
    # Output (word, 1) for every word; emit is the framework's output call (pseudocode)
    words = line.split()
    for word in words:
        emit(word, 1)
2. Reduce Function:
o Input: Key-value pairs where the key is a word and the value is a list of counts.
o Output: Key-value pairs where the key is the word and the value is the total count.
def reduce(word, counts):
    # Sum the counts emitted for this word to get the total
    total_count = sum(counts)
    emit(word, total_count)
3. Job Submission:
o The MapReduce job is configured with input and output paths and the Mapper
and Reducer classes, then submitted to the Hadoop cluster.
This method effectively processes large volumes of text data, providing a count for each unique
word.
Q5
(a) How Map Task & Reduce Task Work in MapReduce (10 marks)
Map Task:
Input Splits: Input data is divided into splits, with each split processed by a separate
Map task.
Processing: Each Mapper reads its assigned input split, processes it, and emits
intermediate key-value pairs.
Shuffle & Sort: The framework sorts and groups these intermediate pairs by key,
preparing them for the Reduce tasks.
Reduce Task:
Input: Receives sorted intermediate key-value pairs from all Map tasks.
Aggregation: Each Reducer processes the grouped data, performing aggregation or other
computations.
Output: The final results are written to the output path specified in the job configuration.
Example: In a Word Count program, Mappers emit (word, 1) pairs, and Reducers sum these
pairs to produce (word, total count).
1. Hive Metastore: Stores metadata about the data, including schemas and table definitions,
enabling query optimization and efficient data retrieval.
2. Driver: Manages the execution of HiveQL queries. It parses the queries and creates an
execution plan.
3. Execution Engine: Transforms HiveQL queries into MapReduce jobs (or Spark jobs)
and executes them on the Hadoop cluster.
4. User Interface: Offers several ways to interact with Hive, including:
o CLI (Command Line Interface)
o Web UI
o Thrift Server for programmatic access.
5. Storage: Data is stored in HDFS, with Hive handling the structure and metadata.
Hive simplifies the process of querying and analyzing large datasets, making it accessible to
users familiar with SQL.
Column Families: Data is stored in column families, enabling efficient read and write
operations. Each row can have a different number of columns.
Rows and Columns: Each row is identified by a unique row key, while columns can be
dynamically added. Data is stored in key-value pairs.
Versioning: Supports multiple versions of data, allowing for time-based access to
historical records.
These operations can be implemented using custom Mapper and Reducer functions.
Pig Latin is especially useful for data analysts and engineers working with large datasets in
Hadoop.
Q1
1. Data Binding: D3 allows developers to bind data to DOM elements, facilitating the
visualization of large datasets interactively.
2. Scalability: It can handle large volumes of data, making it suitable for visualizing Big
Data.
3. Custom Visualizations: D3 enables the creation of custom visual representations,
accommodating diverse data types and structures.
4. Interactivity: Users can explore data visually through interactive elements, enhancing
comprehension and engagement.
5. Integration: D3 can easily integrate with other web technologies, enabling real-time data
visualization from Big Data sources.
1. Consistency (C): Every read receives the most recent write or an error. All nodes in the
system see the same data at the same time.
2. Availability (A): Every request receives a response, either with the requested data or an
error. The system is operational and responsive.
3. Partition Tolerance (P): The system continues to operate despite arbitrary message loss
or failure of part of the system.
1. Speed: In-memory data processing makes Spark significantly faster than Hadoop
MapReduce for certain workloads.
2. Ease of Use: Provides high-level APIs in Java, Scala, Python, and R, making it
accessible to data scientists and engineers.
3. Unified Engine: Supports diverse data processing tasks, including batch processing,
streaming, machine learning, and graph processing.
4. Rich Ecosystem: Integrates with various tools like Hadoop, Hive, and Flume, facilitating
the processing of Big Data.
5. Resilient Distributed Datasets (RDDs): Provides a fault-tolerant data structure for in-
memory computations, enabling parallel processing.
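To illustrate the in-memory, high-level style that these points describe, here is a hedged PySpark sketch of a word count expressed through RDD transformations; the input path is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Read a text file, split it into words, and count occurrences of each word
counts = (sc.textFile("hdfs:///path/to/input.txt")   # placeholder path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, total in counts.take(10):
    print(word, total)

spark.stop()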
1. Load():
o Reads data into Pig from various sources, such as HDFS or local file systems.
o Syntax: data = LOAD 'path/to/data' USING PigStorage(',') AS
(field1:type, field2:type);
o Defines the format of the input data, which can be customized.
2. Store():
o Writes processed data back to storage systems.
o Syntax: STORE data INTO 'path/to/output' USING PigStorage(',');
o Defines the format of the output data, allowing users to specify how to store
results.
These functions enable efficient data ingestion and output in the Pig framework.
1. Volume: Refers to the vast amounts of data generated every second, often in terabytes or
petabytes.
2. Velocity: The speed at which data is created, processed, and analyzed, often in real-time.
3. Variety: The diverse types of data (structured, semi-structured, unstructured) and formats
(text, images, videos).
4. Veracity: The quality and accuracy of the data, which can affect insights and decision-
making.
5. Value: The potential insights and benefits that can be derived from analyzing Big Data.
Q2
1. NameNode: The master server that manages metadata and directory structure. It tracks
the location of data blocks but does not store them.
2. DataNode: Worker nodes that store actual data blocks. They handle read/write requests
and periodically send heartbeats and block reports to the NameNode.
3. Secondary NameNode: An auxiliary service that periodically saves the state of the
filesystem metadata, helping with recovery.
4. Blocks: Files in HDFS are split into fixed-size blocks (default 128 MB), distributed
across DataNodes.
5. Replication: Each block is replicated (default replication factor is 3) to ensure data
durability and availability.
Diagram:
          +-------------------+
          |     NameNode      |
          +-------------------+
             /      |      \
+----------+  +----------+  +----------+
| DataNode |  | DataNode |  | DataNode |
+----------+  +----------+  +----------+
(b) Workflow using Resource Manager, Application Master & Node Manager of YARN (10
marks)
YARN (Yet Another Resource Negotiator) manages resources in Hadoop. Key components
include:
1. Resource Manager (RM): The master daemon that allocates resources to various
applications.
2. Application Master (AM): A per-application master that negotiates resources with the
Resource Manager and works with Node Managers to execute tasks.
3. Node Manager (NM): A per-node daemon that manages containers, monitors resource
usage, and reports to the Resource Manager.
Workflow:
1. A client submits an application to the Resource Manager.
2. The Resource Manager allocates a container on a node and starts the Application Master in it.
3. The Application Master requests further containers from the Resource Manager, and the Node
Managers launch the application's tasks inside those containers.
4. The Application Master monitors progress and reports back to the Resource Manager until the
application completes.
Diagram:
+--------------------+
|  Resource Manager  |
+--------------------+
          |
+--------------------+
| Application Master |
+--------------------+
      |          |
+--------------+  +--------------+
| Node Manager |  | Node Manager |
+--------------+  +--------------+
Q3
1. Data Replication: HDFS can replicate blocks across different racks. For example, if a
block is stored on two DataNodes in one rack, the third replica might be stored on a
DataNode in a different rack.
2. Example:
o Suppose we have two racks, Rack A and Rack B. With a replication factor of 3,
one replica of a block might be stored on a DataNode in Rack A and the other two
replicas on DataNodes in Rack B. If the DataNode in Rack A fails, or Rack A loses
connectivity, the block can still be read from Rack B, ensuring availability.
This design enhances fault tolerance, as it reduces the likelihood of data loss due to rack-level
failures.
(b) Matrix Multiplication Mapper and Reducer Function for MapReduce (10 marks)
Matrix Multiplication Overview: To multiply two matrices A (m x n) and B (n x p) to produce
matrix C (m x p).
Mapper Function:
Each mapper processes one matrix (A or B) and emits its entries along with their
respective row/column indices.
def mapper(matrix_name, row, col, value):
    # num_cols_B and num_rows_A are the known dimensions of B and A
    if matrix_name == "A":
        for k in range(num_cols_B):      # A[row][col] contributes to C[row][k]
            emit((row, k), ("A", col, value))
    elif matrix_name == "B":
        for k in range(num_rows_A):      # B[row][col] contributes to C[k][col]
            emit((k, col), ("B", row, value))
Reducer Function:
The reducer receives the key (i, j) and combines the contributions from matrix A and B to
compute the final value of C[i][j].
def reducer(i, j, values):
    # values holds ("A", k, A[i][k]) and ("B", k, B[k][j]) entries for cell (i, j)
    vals = list(values)
    A_values = {k: v for name, k, v in vals if name == "A"}
    B_values = {k: v for name, k, v in vals if name == "B"}
    # C[i][j] = sum over k of A[i][k] * B[k][j]
    emit((i, j), sum(A_values[k] * B_values[k] for k in A_values if k in B_values))
Q4
1. Key-Value Stores:
o Example: Redis, DynamoDB
o Stores data as key-value pairs, allowing for high-performance retrieval.
2. Document Stores:
o Example: MongoDB, CouchDB
o Stores data in document formats (e.g., JSON), enabling flexible schemas.
3. Column-Family Stores:
o Example: Apache Cassandra, HBase
o Data is stored in columns rather than rows, optimizing read and write operations.
4. Graph Databases:
o Example: Neo4j, Amazon Neptune
o Optimized for managing and querying data represented as graphs (nodes and
edges).
5. Wide-Column Stores:
o Example: Google Bigtable, ScyllaDB
o Similar to column-family stores but designed for massive scalability and high
performance.
(b) HBase Architecture and Components (10 marks)
HBase is a distributed, scalable NoSQL database built on top of HDFS. Its architecture includes:
1. HMaster: The master server that manages region servers, handles metadata, and
performs load balancing.
2. Region Server: Stores data in regions, which are horizontal partitions of tables. Each
region is a sorted map from row keys to values.
3. Regions: A table is divided into multiple regions, and each region can be hosted on
different Region Servers for scalability.
4. HFile: The file format used by HBase to store data on HDFS. HFiles are immutable and
stored in the DataNodes.
5. Zookeeper: Used for coordinating distributed processes and maintaining configuration
information.
HBase allows real-time read/write access to large datasets while leveraging the fault tolerance of
HDFS.
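As a minimal sketch of the random read/write access described above, the snippet below uses the third-party happybase client, which talks to HBase through its Thrift server; the host, table name 'users', and column family 'info' are assumptions for illustration.

import happybase

# Assumes an HBase Thrift server reachable on localhost and an existing
# table 'users' with a column family 'info'
connection = happybase.Connection("localhost")
table = connection.table("users")

# Random write: store a few cells under row key 'user123'
table.put(b"user123", {b"info:name": b"Alice", b"info:city": b"Pune"})

# Random read: fetch the row back by its key
row = table.row(b"user123")
print(row[b"info:name"])  # b'Alice'

connection.close()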
Q5
Architecture Components:
Diagram:
+-----------------+
| Pig Latin |
+-----------------+
|
[Parser]
|
+-------------+
| Logical |
| Plan |
+-------------+
|
[Optimizer]
|
+-------------+
| Optimized |
| Plan |
+-------------+
|
[Execution Engine]
|
+-------------+
| MapReduce |
| Jobs |
+-------------+
Benefits:
1. Scalability: Kafka can scale horizontally, allowing it to handle increased loads without
performance degradation.
2. Durability: Data is stored reliably with replication, ensuring fault tolerance and high
availability.
3. Performance: Capable of handling millions of messages per second, making it suitable
for real-time data processing.
4. Flexibility: Supports various messaging patterns, including publish-subscribe and
message queuing.
5. Ecosystem Integration: Integrates seamlessly with various big data tools, including
Hadoop, Spark, and Flink.
Need:
1. Real-time Data Processing: Businesses require immediate insights from data, which
Kafka facilitates through event streaming.
2. Decoupled Systems: Kafka allows microservices and applications to communicate
asynchronously, reducing dependency.
3. Data Pipeline: Acts as a central hub for data flow, enabling easier data ingestion and
distribution across systems.
Q6
(a) What is Big Data and Its Types? Differences between Traditional Data vs Big Data
Approach (10 marks)
Big Data: Refers to datasets that are so large or complex that traditional data processing
applications are inadequate. It encompasses the three Vs: Volume, Velocity, and Variety.
Types of Big Data: structured data (e.g., relational tables), semi-structured data (e.g., XML,
JSON), and unstructured data (e.g., text, images, video).
(b) Tools and Benefits of Data Visualization; Challenges of Big Data Visualization (10
marks)
Benefits of Data Visualization:
1. Insight Discovery: Helps in identifying trends, patterns, and correlations in data quickly.
2. Enhanced Communication: Simplifies complex data, making it easier to convey
information to stakeholders.
3. Faster Decision-Making: Provides real-time insights, facilitating quicker decision-
making processes.
4. Engagement: Interactive visualizations engage users, allowing them to explore data
intuitively.
5. Accessibility: Makes data accessible to non-technical users, democratizing data insights.
Challenges of Big Data Visualization:
1. Data Volume: Large datasets can be overwhelming and may require summarization for
effective visualization.
2. Data Variety: Different data formats can complicate integration and visualization
efforts.
3. Real-time Processing: Visualizing data in real-time requires robust infrastructure and
tools.
4. Interpretation Issues: Users may misinterpret complex visualizations if not designed
clearly.
5. Scalability: Tools must handle increasing data sizes and complexities without
performance loss.
1. Volume: Refers to the massive amounts of data generated from various sources (e.g.,
social media, sensors, transactions). Organizations must manage and analyze this data
efficiently.
2. Velocity: The speed at which data is generated and processed. Real-time data streams
(e.g., financial transactions, social media feeds) require rapid analysis and response.
3. Variety: The different types and formats of data, including structured (databases), semi-
structured (XML, JSON), and unstructured data (text, images, videos).
4. Veracity: The reliability and accuracy of data. With large datasets, ensuring data quality
and consistency becomes challenging.
5. Value: The insights and benefits derived from analyzing big data. Organizations seek to
extract valuable information to drive decision-making and business strategies.
1. Scalability: HDFS can store petabytes of data by distributing it across multiple nodes.
2. Fault Tolerance: Data is replicated (default replication factor is 3) across different nodes
to ensure durability and availability.
3. High Throughput: Optimized for batch processing, HDFS allows efficient read/write
operations.
4. Block Storage: Files are divided into blocks (default size is 128 MB) stored in different
DataNodes, enhancing data access speed.
5. Master-Slave Architecture: HDFS has a NameNode (master) that manages metadata
and DataNodes (slaves) that store actual data.
1. Schema Flexibility: Unlike traditional RDBMS, NoSQL databases can store data
without a fixed schema, allowing for agile development.
2. Scalability: They can scale horizontally, handling large volumes of data by adding more
servers.
3. Variety of Data Models: Supports various data models, including key-value, document,
column-family, and graph databases.
4. High Performance: Optimized for high-speed read and write operations, making them
suitable for real-time applications.
5. Distributed Architecture: Data is distributed across multiple nodes, providing fault
tolerance and high availability.
1. HiveQL: A SQL-like language used for querying data, making it accessible to users
familiar with SQL.
2. Metastore: Stores metadata about tables, schemas, and partitions, enabling efficient
query processing.
3. Data Storage: Data is stored in HDFS, allowing for scalable storage of large datasets.
4. Extensibility: Supports user-defined functions (UDFs) for custom processing logic.
5. Integration: Can work with various data processing tools in the Hadoop ecosystem, such
as MapReduce and Spark.
1. Ease of Use: Pig Latin abstracts complex MapReduce programming, allowing users to
focus on data processing logic.
2. Data Flow Language: Users can express data transformations as a series of steps,
making it intuitive to write data processing scripts.
3. Execution Engine: Converts Pig Latin scripts into MapReduce jobs, optimizing
execution based on the data flow.
4. Support for UDFs: Users can create custom functions for specific data processing tasks.
5. Integration with Hadoop: Pig runs on top of Hadoop, leveraging HDFS for storage and
distributed computing capabilities.
(a) Explain Hadoop Ecosystem with Core Components and Architecture (10 marks)
The Hadoop ecosystem comprises various components that facilitate data storage, processing,
and analysis. Core components include:
1. Hadoop Common: Contains libraries and utilities required by other Hadoop modules.
2. HDFS: The distributed file system for storing large datasets across multiple nodes.
3. YARN (Yet Another Resource Negotiator): Manages resources and job scheduling
across the Hadoop cluster.
4. MapReduce: The programming model for processing large datasets in parallel.
5. Hadoop Ecosystem Tools:
o Hive: For data warehousing and SQL-like queries.
o Pig: For data processing using Pig Latin.
o HBase: A NoSQL database built on top of HDFS.
o Spark: A fast, in-memory data processing framework.
Architecture:
+---------------------+
| Hadoop |
| Ecosystem |
+---------------------+
| +---------+ +------+|
| | HDFS | | YARN ||
| +---------+ +------+|
| +---------+ +------+|
| | MapReduce| | Hive ||
| +---------+ +------+|
| +---------+ +------+|
| | Pig | | HBase ||
| +---------+ +------+|
| +---------+ +------+|
| | Spark | | Kafka ||
| +---------+ +------+|
+---------------------+
(b) Frameworks Running Under YARN and YARN Daemons (10 marks)
YARN supports various data processing frameworks, including:
1. Apache Spark: In-memory data processing framework for batch and stream processing.
2. Apache Flink: Stream processing framework for real-time analytics.
3. Apache Storm: Real-time computation framework for processing data streams.
4. Apache Tez: Framework for building complex data processing workflows.
YARN Daemons:
1. ResourceManager (RM): The master daemon that manages resources and schedules
applications across the cluster.
2. NodeManager (NM): A per-node daemon that manages containers and monitors
resource usage on each node.
3. ApplicationMaster (AM): A per-application daemon that negotiates resources from the
ResourceManager and manages the execution of tasks.
(a) What is NoSQL? Explain Various NoSQL Data Architecture Patterns (10 marks)
NoSQL: A category of database management systems that provide a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations used in relational
databases.
1. Logical Partitions: RDDs are divided into logical partitions, allowing Spark to execute
tasks on different partitions simultaneously.
2. Data Locality: Data is partitioned based on the location of the underlying data (e.g.,
HDFS blocks), optimizing data access and reducing network overhead.
3. Custom Partitioning: Users can define custom partitioning strategies based on specific
use cases, improving performance for certain workloads.
4. Fault Tolerance: If a partition is lost, Spark can recompute it using the lineage
information, ensuring data durability.
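A short PySpark sketch of logical partitions and a per-partition operation follows; the partition count and the numbers are arbitrary.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionsDemo").getOrCreate()
sc = spark.sparkContext

# Create an RDD with 4 logical partitions; each partition is processed by its own task
rdd = sc.parallelize(range(1, 101), numSlices=4)
print(rdd.getNumPartitions())  # 4

# mapPartitions runs once per partition, e.g. summing each partition locally
partial_sums = rdd.mapPartitions(lambda it: [sum(it)]).collect()
print(partial_sums)  # four partial sums, one per partition

spark.stop()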
1. Speed: Spark processes data in-memory, resulting in faster performance compared to the
disk-based processing of MapReduce.
2. Ease of Use: Provides high-level APIs and supports multiple programming languages
(Java, Scala, Python), making it more accessible.
3. Unified Engine: Supports batch processing, stream processing, machine learning, and
graph processing, all in one platform.
4. Rich Libraries: Includes built-in libraries for machine learning (MLlib), graph
processing (GraphX), and streaming (Spark Streaming).
5. Interactive Shell: Allows for interactive data analysis with a REPL environment,
facilitating rapid prototyping.
Integration with Big Data: D3 can visualize large datasets effectively by using data-
binding techniques, allowing for real-time updates and interactions.
Interactivity: D3 allows users to explore complex datasets through interactive
visualizations, facilitating better understanding and insights from big data.
Customizability: Offers flexibility to create bespoke visualizations tailored to specific
data characteristics, enhancing data storytelling.
(b) Explain Bulk Synchronous Processing (BSP) and Graph Processing with respect to
Apache Spark (10 marks)
Bulk Synchronous Processing (BSP): A parallel computing model where processes
communicate in a series of supersteps. Each superstep consists of computation, communication,
and synchronization.
In Spark: The BSP model is utilized in graph processing frameworks like GraphX,
where vertices and edges are processed in bulk synchronously, ensuring consistency
across computations.
GraphX: Spark’s API for graph processing, allowing for scalable graph computations
and analytics.
Resilient Distributed Property Graph: Represents graphs as RDDs, where each vertex
can hold properties.
Pregel API: Implements the BSP model, enabling users to define iterative graph
algorithms efficiently.
Main Components:
+-----------------+
| Pig Latin |
+-----------------+
|
[Parser]
|
+-------------+
| Logical |
| Plan |
+-------------+
|
[Optimizer]
|
+-------------+
| Optimized |
| Plan |
+-------------+
|
[Execution Engine]
|
+-------------+
| MapReduce |
| Jobs |
+-------------+
Pig provides an abstraction over the complexities of MapReduce, allowing data analysts and
developers to work with big data more intuitively.
(b) What is Big Data and Its Types? Differences between Traditional Data vs. Big Data
with Examples (10 marks)
Big Data: Refers to datasets that are so large or complex that traditional data processing
applications cannot manage them efficiently.
(a) Discuss the Apache Kafka Fundamentals and Kafka Cluster Architecture (10 marks)
Apache Kafka Fundamentals:
Diagram:
+---------------------------+
| Kafka Cluster |
+---------------------------+
| +---------+ +---------+ |
| | Broker 1| | Broker 2| |
| +---------+ +---------+ |
| | |
| +----------------------+ |
| | Topics | |
| +----------------------+ |
| +---------+ +---------+ |
| |Producer | |Consumer | |
| +---------+ +---------+ |
+---------------------------+
Master-Slave Architecture:
In this architecture, one node (master) acts as the primary source of data and controls the
data distribution to multiple subordinate nodes (slaves).
Pros: Simplifies data consistency and management; efficient for read-heavy workloads.
Cons: The master node can become a bottleneck; single point of failure.
Peer-to-Peer Architecture:
All nodes (peers) have equal status and can act as both producers and consumers, sharing
data without a central controller.
Pros: High availability and fault tolerance; no single point of failure.
Cons: Data consistency can be challenging; more complex management.
Use Cases:
Master-Slave: Used in systems requiring strong consistency (e.g., Redis with replicas).
Peer-to-Peer: Common in distributed file systems and decentralized applications (e.g.,
Cassandra).
Partitioner:
A component in MapReduce that determines how the output from the Map tasks is
distributed to Reducers.
Function: It hashes the key and assigns it to a specific reducer, ensuring that all values
for a particular key end up at the same reducer.
Benefit: Allows for balanced load distribution among reducers.
Combiner:
An optional optimization that runs after the Map phase and before the Reduce phase.
Function: Acts as a mini reducer to aggregate output from mappers, reducing the amount
of data transferred to reducers.
Benefit: Decreases network bandwidth usage and speeds up the reduce process by
minimizing the amount of data sent across the network.
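The default hash-partitioning behaviour and the data reduction a combiner provides can be sketched in a few lines of Python; the reducer count and the sample pairs are invented.

from collections import Counter

NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    # Hash partitioner: within one run, the same key always goes to the same reducer
    return hash(key) % num_reducers

# Map output before the combiner: many repeated (word, 1) pairs
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1), ("data", 1)]

# Combiner: pre-aggregate locally so fewer pairs cross the network
combined = Counter()
for word, count in map_output:
    combined[word] += count

for word, count in combined.items():
    print(word, count, "-> reducer", partition(word))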
Key components of a bar chart:
1. Axes: The x-axis represents categories, while the y-axis represents values.
2. Bars: Each bar's length corresponds to the value it represents, allowing for easy
comparison between different categories.
3. Color Coding: Different colors can be used to enhance clarity or denote categories.
4. Labels: Axes and bars are often labeled for easier interpretation.
Example: Visualizing sales data for different products can quickly show which product has the
highest sales.
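A minimal matplotlib sketch of that sales example follows; the product names and figures are made up.

import matplotlib.pyplot as plt

products = ["Laptop", "Phone", "Tablet", "Monitor"]   # x-axis: categories
sales = [120, 300, 80, 150]                           # y-axis: values (units sold)

plt.bar(products, sales, color="steelblue")
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.title("Sales by product")
plt.show()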
Dataset:
o A distributed collection of data, which can be strongly typed (e.g., Spark
Datasets).
o Supports compile-time type safety and is optimized for structured data.
DataFrame:
o A distributed collection of data organized into named columns, similar to a table
in a relational database.
o Provides a higher-level abstraction than RDDs, making it easier to use for data
manipulation and analysis.
o Supports SQL-like operations, allowing users to perform complex queries.
Key Difference: Datasets provide type safety and are often more efficient for complex
transformations, while DataFrames provide a more user-friendly interface for data manipulation.
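A short PySpark sketch of the DataFrame's named-column, SQL-like interface follows; the sample rows are invented (the typed Dataset API is available in Scala and Java rather than Python).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# A DataFrame: distributed rows organised into named columns
df = spark.createDataFrame(
    [(1, "Alice", 55000.0), (2, "Bob", 48000.0)],
    ["id", "name", "salary"],
)

# SQL-like operations without writing any MapReduce code
df.filter(df.salary > 50000).select("name", "salary").show()

spark.stop()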
1. Inner Join: Returns only the rows that have matching values in both tables.
SELECT a.*, b.*
FROM tableA a
INNER JOIN tableB b ON a.id = b.id;
2. Left Join (Outer Join): Returns all rows from the left table and matched rows from the
right table. If no match, NULLs are returned for columns from the right table.
SELECT a.*, b.*
FROM tableA a
LEFT JOIN tableB b ON a.id = b.id;
3. Right Join (Outer Join): Returns all rows from the right table and matched rows from
the left table.
SELECT a.*, b.*
FROM tableA a
RIGHT JOIN tableB b ON a.id = b.id;
4. Full Outer Join: Returns all rows when there is a match in one of the tables.
SELECT a.*, b.*
FROM tableA a
FULL OUTER JOIN tableB b ON a.id = b.id;
Key Difference: HDFS is primarily for storing large files, while HBase allows for real-time
access and manipulation of structured data.
1. Structured Data: Organized in a fixed format, such as rows and columns (e.g., SQL
databases).
2. Semi-Structured Data: Does not have a strict structure but contains tags or markers
(e.g., JSON, XML).
3. Unstructured Data: Lacks a predefined format (e.g., text documents, images, videos).
4. Time-Series Data: Data indexed in time order, often used in financial and sensor data.
5. Geospatial Data: Data related to geographic locations (e.g., maps, GPS data).
(a) Characteristics of Social Media for Big Data Analytics (10 marks)
Key Tools:
1. Tableau: A powerful tool for creating interactive and shareable dashboards, suitable for
business intelligence.
2. Power BI: A Microsoft tool that enables data visualization and sharing insights across an
organization.
3. D3.js: A JavaScript library for producing dynamic and interactive data visualizations in
web browsers.
4. Matplotlib: A Python library for creating static, animated, and interactive visualizations
in Python.
5. Google Charts: A web service that creates interactive charts and graphs, integrating well
with Google services.
Example: Tableau can visualize sales data over time, enabling stakeholders to identify trends and
make informed decisions.
Example:
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray);
(b) CAP Theorem and NoSQL Data Architectural Pattern (10 marks)
CAP Theorem: States that in a distributed data store, you can only guarantee two of the
following three properties at the same time:
1. Consistency: All nodes see the same data at the same time.
2. Availability: Every request receives a response, whether successful or failed.
3. Partition Tolerance: The system continues to function despite network partitions.
Example:
CP Systems: HBase prioritizes consistency and partition tolerance but may sacrifice
availability during network issues.
AP Systems: Cassandra emphasizes availability and partition tolerance, potentially
allowing inconsistencies.
Matrices:
Matrix A (2x2):
1 2
3 4
Matrix B (2x2):
5 6
2 3
Map Function:
For each element of A and B, emit it keyed by the cell (i, j) of the result matrix it
contributes to, tagged with the matrix it came from.
Reduce Function:
Calculate the final matrix product by summing up the products of the corresponding
elements.
Example:
Mapper Output for Matrix A:
(A, row: 0, col: 0, value: 1)
(A, row: 0, col: 1, value: 2)
(A, row: 1, col: 0, value: 3)
(A, row: 1, col: 1, value: 4)
Reducer:
For each result cell (i, j), multiply the matching contributions from A and B and sum them:
C[0][0] = 1*5 + 2*2 = 9, C[0][1] = 1*6 + 2*3 = 12, C[1][0] = 3*5 + 4*2 = 23,
C[1][1] = 3*6 + 4*3 = 30.
Components:
Example:
CREATE TABLE employees (id INT, name STRING, salary FLOAT);
This command creates a table in Hive, which will be stored in HDFS, with metadata maintained
in the Metastore.
1. Speed: Spark performs in-memory data processing, making it significantly faster than
MapReduce, which relies on disk storage.
2. Ease of Use: Supports high-level APIs in Java, Scala, Python, and R, making it more
accessible for developers.
3. Advanced Analytics: Provides built-in libraries for streaming, machine learning, and
graph processing.
4. Unified Engine: Supports batch processing, streaming, and interactive queries with a
single platform.
Key Features:
1. Data Replication: Data is split into blocks (default 128 MB) and replicated across
multiple nodes to ensure fault tolerance.
2. Scalability: Easily scales horizontally by adding more nodes to the cluster.
3. High Throughput: Optimized for large data sets and high throughput of data access.
Diagram:
     +---------------+
     |    Client     |
     +---------------+
             |
     +---------------+
     |   NameNode    |
     +---------------+
        /         \
+----------+  +----------+
| DataNode |  | DataNode |
+----------+  +----------+
Example:
A social media application can publish user activity data (e.g., likes, comments) to a
Kafka topic. This data can then be consumed by analytics applications for real-time
insights.
Diagram:
+----------+        +---------+
| Producer |  --->  |  Topic  |
+----------+        +---------+
                         |
                         v
                   +----------+
                   | Consumer |
                   +----------+