BDA Ass 3(1)

Harsh Shah TY-IT-1-A 220410116025

1) Describe iterative and interactive operations on MapReduce and Spark RDD.


MapReduce is suitable for batch processing but struggles with iterative and interactive
computations. Iterative operations require multiple passes over the same data, such as
machine learning algorithms, where intermediate results must be stored and read from disk
repeatedly. This leads to inefficiency in MapReduce. Interactive operations, which allow
users to explore data and receive real-time feedback, are also not well-supported in
MapReduce due to its high latency. Apache Spark, with its in-memory data storage using
RDDs (Resilient Distributed Datasets), supports both iterative and interactive processing
efficiently. RDDs cache data in memory, avoiding redundant disk I/O, and enable low-latency
computations.
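
A minimal Scala sketch of the iterative case, assuming a spark-shell session (where sc is predefined) and a hypothetical comma-separated file points.txt; the update rule is purely illustrative. Caching the parsed RDD keeps it in memory across passes, which MapReduce would have to emulate with repeated disk reads and writes:

// Toy iterative job: ten passes over the same cached dataset.
val points = sc.textFile("points.txt")                  // hypothetical input file
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                              // keep parsed data in memory for reuse

var weight = 0.0
for (_ <- 1 to 10) {
  // Each pass reads the in-memory RDD instead of re-reading from disk.
  val gradient = points.map(p => p(0) * weight - p(1)).sum()
  weight -= 0.01 * gradient
}
println(weight)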
2) Describe important features of Apache Spark. Also explain transformations and actions
in Spark.
Apache Spark is an in-memory distributed computing framework known for its speed, ease
of use, and flexibility. It supports batch processing, real-time analytics, machine learning,
and graph processing. Key features include fault-tolerant RDDs, support for multiple
languages (Java, Scala, Python), and integration with Hadoop and various data sources.
Transformations are lazy operations that define a new RDD from an existing one, such as
map(), filter(), and flatMap(). Actions trigger execution and return results, such as collect(),
count(), and reduce(). This separation allows Spark to optimize execution plans before
processing.
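
A short spark-shell sketch (sc predefined) showing that transformations are recorded lazily and only an action runs the pipeline:

val nums = sc.parallelize(1 to 10)        // base RDD
val squares = nums.map(n => n * n)        // transformation: lazy, nothing runs yet
val evens = squares.filter(_ % 2 == 0)    // transformation: still lazy
println(evens.count())                    // action: triggers execution of the pipeline
println(evens.collect().mkString(", "))   // action: returns results to the driver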
3) What is NoSQL? Differentiate NoSQL with SQL.
NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data
storage and real-time web applications. Unlike SQL (relational) databases that store
structured data in tables with predefined schemas, NoSQL supports flexible schema designs
and stores data in forms like key-value pairs, documents, graphs, or wide-columns. NoSQL
provides high scalability, availability, and performance for unstructured or semi-structured
data. SQL databases are better suited for complex transactions and consistency, whereas
NoSQL is optimized for scalability and handling diverse data types.
Differences between NoSQL and SQL:

Feature        | SQL (Relational DB)                | NoSQL (Non-relational DB)
Data Model     | Tables with rows and columns       | Key-Value, Document, Column, or Graph
Schema         | Fixed schema                       | Dynamic schema
Scalability    | Vertical scaling                   | Horizontal scaling
Transactions   | Supports ACID                      | May follow BASE (eventual consistency)
Query Language | SQL                                | Varies by database (MQL, CQL, etc.)
Use Case       | Structured data with relationships | Large-scale, semi-structured/unstructured data

4) Categorize types of NoSQL databases with examples.
NoSQL databases are categorized into four main types:
1. Key-Value Stores: Store data as key-value pairs (e.g., Redis, Riak).
2. Document Stores: Store semi-structured data in JSON/XML documents (e.g.,
MongoDB, CouchDB).
3. Column-Family Stores: Store data in columns instead of rows (e.g., Apache
Cassandra, HBase).
4. Graph Databases: Store relationships between entities using graph structures (e.g.,
Neo4j, OrientDB).
Each type is optimized for specific use cases and offers flexible scalability.
5) Explain important components of Spark with necessary diagram.
Apache Spark's architecture includes the following components:
 Driver Program: Coordinates all activities and initiates the SparkContext.
 Cluster Manager: Allocates resources (e.g., YARN, Mesos).
 Executors: Run computations and store data for applications.
 Tasks: Units of work executed on each partition.
[Diagram: Spark Architecture]
Driver Program --> Cluster Manager --> Executors --> Tasks
This architecture enables distributed data processing with in-memory performance and fault
tolerance.
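A minimal driver-program sketch in Scala; the master URL "local[*]" is an assumption for a local run (on a cluster it would be, e.g., "yarn"):

import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext, which asks the cluster manager for
// executors; the work below is split into tasks that run on those executors.
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Seq(1, 2, 3))
println(rdd.reduce(_ + _))                // result is returned to the driver
sc.stop()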
6) What is MongoDB? Explain important features.
MongoDB is a NoSQL, document-oriented database that stores data in flexible, JSON-like
documents. It supports dynamic schemas, allowing different documents in a collection to
have different structures. Key features include high scalability, indexing, replication, and


sharding for horizontal scaling. MongoDB allows developers to store and retrieve complex
data types efficiently and supports aggregation, ad hoc queries, and rich indexing.
7) Differentiate MongoDB with RDBMS. Compare advantages and drawbacks.
MongoDB differs from RDBMS in data model, schema, and performance. While RDBMS uses
tables with fixed schemas, MongoDB stores data in collections of dynamic JSON-like
documents. RDBMS is ideal for transactions and complex queries, whereas MongoDB is
suited for scalability and agile development. MongoDB excels in handling large volumes of
unstructured data, though it may lack strict ACID compliance. RDBMS offers strong
consistency but struggles with horizontal scalability.
8) Mention advantages of using NoSQL databases.
NoSQL databases provide several advantages such as flexible schema design, horizontal
scalability, high performance for large data volumes, and ease of replication. They are ideal
for applications requiring fast access to unstructured or semi-structured data, real-time
analytics, and distributed environments. NoSQL also supports agile development by allowing
quick changes to data structures.
9) MongoDB Terms: Database, Collection, Document, Datatypes
In MongoDB, a Database is a container for collections. A Collection is a group of documents,
similar to a table in RDBMS. A Document is a JSON-like data structure that contains fields
and values. MongoDB supports data types like strings, numbers, arrays, objects, dates, and
binary data, allowing flexibility in data modeling.
10) MongoDB CRUD operations with syntax
 Create Database: use myDatabase
 Drop Database: db.dropDatabase()
 Create Collection: db.createCollection("users")
 Insert Document: db.users.insertOne({name: "John", age: 30})
 Find Document: db.users.find({name: "John"})
 Update Document: db.users.updateOne({name: "John"}, {$set: {age: 31}})
 Delete Document: db.users.deleteOne({name: "John"})
11) What is RDD? Explain RDD operations in detail.
RDD (Resilient Distributed Dataset) is Spark's fundamental data structure, representing an
immutable, distributed collection of objects partitioned across nodes. RDDs are fault-
tolerant and support parallel processing. Operations on RDDs are categorized into
transformations (e.g., map, filter, flatMap) and actions (e.g., collect, count, reduce).
Transformations are lazy and build a lineage graph, which is recomputed in case of failures.
Actions trigger the computation.
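
A sketch (spark-shell, hypothetical input.txt) of transformations building a lineage that an action then executes; toDebugString prints the lineage Spark would replay to recover a lost partition:

val lines = sc.textFile("input.txt")      // hypothetical input file
val words = lines.flatMap(_.split(" "))   // transformation
val pairs = words.map((_, 1))             // transformation
val counts = pairs.reduceByKey(_ + _)     // transformation
println(counts.toDebugString)             // the lineage graph used for fault recovery
counts.take(5).foreach(println)           // action: triggers the computation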


12) Why is RDD better than MapReduce for data storage?


RDDs outperform MapReduce by enabling in-memory computation, reducing the overhead
of writing intermediate data to disk. RDDs support iterative algorithms efficiently and
provide fault tolerance through lineage, making them ideal for machine learning and real-
time data processing. In contrast, MapReduce writes data to HDFS after each operation,
resulting in higher latency.
13) Justify: “SPARK is faster than MapReduce”
Spark is significantly faster than MapReduce due to its in-memory processing, DAG-based
execution engine, and support for advanced analytics. While MapReduce stores
intermediate results on disk after each job, Spark keeps data in memory using RDDs,
drastically reducing I/O operations. This architectural difference makes Spark up to 100
times faster in certain use cases, particularly for iterative algorithms.
14) Word Count program in Scala using Spark
val input = sc.textFile("input.txt")                              // read file as an RDD of lines
val words = input.flatMap(line => line.split(" "))                // split each line into words
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)  // count occurrences per word
wordCounts.collect().foreach(println)                             // action: print (word, count) pairs
This simple program splits input text into words, maps each word to a count of 1, and
aggregates them using reduceByKey.
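The program runs as-is in the spark-shell, where sc is predefined; in a standalone application a SparkContext must be created first, and the counts could also be written back to storage with wordCounts.saveAsTextFile("output").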
15) "Moving Computation is Cheaper than Moving Data" – Justify
In distributed systems, moving large volumes of data across the network is costly in terms of
time and resources. Thus, it is more efficient to move computation to where the data
resides, reducing latency and bandwidth usage. Hadoop and Spark follow this principle by
running tasks on nodes that store the required data blocks. This data locality enhances
system performance and scalability.

1) Mention usefulness of Pig. What are key features of Pig?


Apache Pig is a high-level platform for processing large data sets. It uses a scripting language
called Pig Latin, which simplifies the development of MapReduce programs. Pig is useful
because it abstracts the complexities of writing low-level MapReduce code and allows for
faster development and prototyping. Key features include ease of use, support for both
structured and semi-structured data, fault tolerance, and extensibility. Pig is ideal for ETL
(Extract, Transform, Load) operations, data preparation, and research analysis.


2) Explain components of Hive architecture. Also describe working of Hive with suitable
diagram.
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis using HiveQL, a SQL-like language. The main components
of Hive architecture include:
 Metastore: Stores metadata about tables and partitions.
 Driver: Manages the lifecycle of HiveQL statements.
 Compiler: Translates HiveQL into execution plans.
 Execution Engine: Executes the plans using Hadoop or Spark.
 Hive Server: Accepts client connections and handles queries.
[Diagram: Hive Architecture]
Clients --> Hive Server --> Compiler --> Execution Engine --> Hadoop/YARN
                               |
                               --> Metastore
Hive processes user queries by converting them into DAGs of MapReduce jobs, which are
executed in the Hadoop ecosystem.

3) Differentiate:
(i) Pig vs. MapReduce

Feature           | Pig                                 | MapReduce
Language          | Pig Latin                           | Java (programming required)
Ease of Use       | High-level, user-friendly scripting | Low-level programming complexity
Development Speed | Faster                              | Slower
Data Types        | Supports complex and nested types   | Mostly primitive types
Execution Engine  | Converts scripts to MapReduce       | Native MapReduce

(ii) HDFS vs. HBase

Feature        | HDFS                             | HBase
Data Model     | File system for batch processing | NoSQL database (column-oriented)
Access Pattern | Sequential                       | Random read/write
Latency        | High (not ideal for real-time)   | Low (supports real-time)
Schema         | Schema-less                      | Flexible schema
Integration    | Works with Hive, Pig, MapReduce  | Works with Spark, MapReduce

4) What is the role of ZooKeeper? How does it help in monitoring a cluster?


Apache ZooKeeper is a centralized service for maintaining configuration information,
naming, and providing distributed synchronization. In a Hadoop ecosystem, it helps in
managing and coordinating distributed components. ZooKeeper ensures that various nodes
in a cluster can work together in a synchronized manner. It maintains a hierarchical tree of
nodes (znodes) which store data and provide a way to coordinate distributed processes. It is
essential in managing failover for Hadoop components like HBase, HDFS, and YARN.

5) Data model and implementation of HBase


HBase is a distributed, column-oriented database built on top of HDFS. Its data model
resembles Google's BigTable and includes tables, rows, column families, and columns. Each
row has a unique row key, and columns are grouped into families. HBase stores data in
HFiles and uses a Write-Ahead Log (WAL) for durability. RegionServers manage regions
(subsets of tables), and the HBase Master oversees load balancing and failover. HBase is
ideal for sparse datasets and supports real-time read/write access.
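
A minimal client-side sketch in Scala using the standard HBase Java API (the table, row key, and values here are hypothetical; assumes an HBase cluster reachable with the default configuration):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("students"))   // hypothetical table

// Write: row key "s001", column family "personal", qualifier "name".
val put = new Put(Bytes.toBytes("s001"))
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Harsh"))
table.put(put)

// Read the same cell back by row key (random access, unlike raw HDFS).
val result = table.get(new Get(Bytes.toBytes("s001")))
println(Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))))
table.close(); conn.close()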

6) HiveQL data manipulation queries in detail


HiveQL supports data manipulation operations like INSERT, UPDATE, DELETE, and SELECT. For
example:
 INSERT INTO table_name VALUES (...) adds new records.
 UPDATE table_name SET column=value WHERE condition modifies existing records.
 DELETE FROM table_name WHERE condition removes records.
 SELECT column FROM table WHERE condition queries data.
Hive transforms these SQL-like statements into MapReduce jobs, enabling large-scale data
processing over Hadoop. Note that UPDATE and DELETE are supported only on transactional
(ACID-enabled) Hive tables.

7) What is HBase? Write a query to create a table in HBase.


HBase is a distributed NoSQL database that stores structured and semi-structured data in a
fault-tolerant way. It supports random access to large datasets and is suitable for sparse
tables. To create a table in HBase:
create 'students', 'personal', 'academic'
This creates a table named 'students' with two column families: 'personal' and 'academic'.

8) Draw architecture of Apache Pig and explain in short.


Apache Pig architecture includes:
 Pig Latin Scripts: Input by the user.
 Parser: Checks syntax and generates logical plans.
 Optimizer: Optimizes execution plans.
 Compiler: Converts plans into MapReduce jobs.
 Execution Engine: Executes jobs on Hadoop.
[Diagram: Pig Architecture]
Pig Latin Script --> Parser --> Logical Plan --> Optimizer --> Physical Plan --> MapReduce --> HDFS

9) Benefits of Zookeeper, znodes and their types


ZooKeeper’s benefits include coordination, configuration management, and fault tolerance. A
znode is a data node in ZooKeeper’s hierarchy. There are three types (see the sketch after
this list):
 Persistent: Remains after the client disconnects.
 Ephemeral: Deleted once the client session ends.
 Sequential: Created with a unique, monotonically increasing counter appended to the name.
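
A minimal sketch using the standard ZooKeeper Java client from Scala (the connect string and paths are hypothetical):

import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, null)   // hypothetical connect string
val acl = ZooDefs.Ids.OPEN_ACL_UNSAFE

// Persistent: survives after this client disconnects.
zk.create("/config", "v1".getBytes, acl, CreateMode.PERSISTENT)
// Ephemeral: removed automatically when this session ends.
zk.create("/worker-1", Array[Byte](), acl, CreateMode.EPHEMERAL)
// Sequential: a unique increasing counter is appended, e.g. /lock-0000000001.
zk.create("/lock-", Array[Byte](), acl, CreateMode.PERSISTENT_SEQUENTIAL)
zk.close()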

10) RDBMS vs. HBase; HiveQL Data Definition Language (DDL)

Feature      | RDBMS                     | HBase
Data Model   | Relational (tables, rows) | Column-oriented NoSQL
Schema       | Fixed schema              | Schema-less, flexible
Transactions | Full ACID support         | Limited ACID
Scalability  | Vertical                  | Horizontal

HiveQL DDL statements define and modify table structure. Examples:


 CREATE TABLE students (id INT, name STRING)
 ALTER TABLE students ADD COLUMNS (age INT)
 DROP TABLE students removes the table structure.

11) What is Big Data streaming? Stream data architecture


Big Data Streaming involves processing continuous flows of data in real time. Examples
include logs, sensor data, and social media feeds. A typical stream data architecture (see
the sketch after the list) consists of:
 Data Sources: Devices or applications generating data.
 Stream Processing Engine: e.g., Apache Storm, Spark Streaming.
 Data Storage: HDFS, HBase.
 Data Sink: Dashboards or databases for analysis.
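
A minimal sketch of such a pipeline with Spark Streaming as the processing engine (assumes a text source on localhost:9999, e.g. started with nc -lk 9999; the console stands in for the sink):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)   // data source
val counts = lines.flatMap(_.split(" "))              // stream processing
  .map((_, 1))
  .reduceByKey(_ + _)
counts.print()                                        // data sink (console here)

ssc.start()
ssc.awaitTermination()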

12) Write about Apache Kafka


Apache Kafka is a distributed messaging system used for building real-time data pipelines
and streaming applications. It uses a publish-subscribe model and is designed for high
throughput, fault tolerance, and durability. Kafka stores messages in topics and distributes
them across brokers. Producers send messages to topics, and consumers read them
asynchronously. Kafka is widely used for real-time analytics, log aggregation, and event
sourcing.
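
A minimal producer sketch in Scala using the standard Kafka client API (broker address and topic name are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")      // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one message to the "events" topic; subscribed consumers read it asynchronously.
producer.send(new ProducerRecord[String, String]("events", "key1", "hello kafka"))
producer.close()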
