BDA Ass 3(1)

Harsh Shah TY-IT-1-A 220410116025

1) Describe iterative and interactive operations on MapReduce and Spark RDD.


MapReduce is suitable for batch processing but struggles with iterative and interactive
computations. Iterative operations require multiple passes over the same data, such as
machine learning algorithms, where intermediate results must be stored and read from disk
repeatedly. This leads to inefficiency in MapReduce. Interactive operations, which allow
users to explore data and receive real-time feedback, are also not well-supported in
MapReduce due to its high latency. Apache Spark, with its in-memory data storage using
RDDs (Resilient Distributed Datasets), supports both iterative and interactive processing
efficiently. RDDs cache data in memory, avoiding redundant disk I/O, and enable low-latency
computations.
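
A minimal Scala sketch of the iterative case, assuming a spark-shell session (where sc is predefined) and a hypothetical comma-separated file points.txt; the update rule is purely illustrative. Caching the parsed RDD keeps it in memory across passes, which MapReduce would have to emulate with repeated disk reads and writes:

// Toy iterative job: ten passes over the same cached dataset.
val points = sc.textFile("points.txt")                  // hypothetical input file
  .map(line => line.split(",").map(_.toDouble))
  .cache()                                              // keep parsed data in memory for reuse

var weight = 0.0
for (_ <- 1 to 10) {
  // Each pass reads the in-memory RDD instead of re-reading from disk.
  val gradient = points.map(p => p(0) * weight - p(1)).sum()
  weight -= 0.01 * gradient
}
println(weight)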
2) Describe important features of Apache Spark. Also explain transformations and actions
in Spark.
Apache Spark is an in-memory distributed computing framework known for its speed, ease
of use, and flexibility. It supports batch processing, real-time analytics, machine learning,
and graph processing. Key features include fault-tolerant RDDs, support for multiple
languages (Java, Scala, Python), and integration with Hadoop and various data sources.
Transformations are lazy operations that define a new RDD from an existing one, such as
map(), filter(), and flatMap(). Actions trigger execution and return results, such as collect(),
count(), and reduce(). This separation allows Spark to optimize execution plans before
processing.
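
A short spark-shell sketch (sc predefined) showing that transformations are recorded lazily and only an action runs the pipeline:

val nums = sc.parallelize(1 to 10)        // base RDD
val squares = nums.map(n => n * n)        // transformation: lazy, nothing runs yet
val evens = squares.filter(_ % 2 == 0)    // transformation: still lazy
println(evens.count())                    // action: triggers execution of the pipeline
println(evens.collect().mkString(", "))   // action: returns results to the driver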
3) What is NoSQL? Differentiate NoSQL with SQL.
NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data
storage and real-time web applications. Unlike SQL (relational) databases that store
structured data in tables with predefined schemas, NoSQL supports flexible schema designs
and stores data in forms like key-value pairs, documents, graphs, or wide-columns. NoSQL
provides high scalability, availability, and performance for unstructured or semi-structured
data. SQL databases are better suited for complex transactions and consistency, whereas
NoSQL is optimized for scalability and handling diverse data types.
Differences between NoSQL and SQL:

Feature        | SQL (Relational DB)                | NoSQL (Non-relational DB)
Data Model     | Tables with rows and columns       | Key-Value, Document, Column, or Graph
Schema         | Fixed schema                       | Dynamic schema
Scalability    | Vertical scaling                   | Horizontal scaling
Transactions   | Supports ACID                      | May follow BASE (eventual consistency)
Query Language | SQL                                | Varies by database (MQL, CQL, etc.)
Use Case       | Structured data with relationships | Large-scale, semi-structured/unstructured data

4) Categorize types of NoSQL databases with examples.
NoSQL databases are categorized into four main types:
1. Key-Value Stores: Store data as key-value pairs (e.g., Redis, Riak).
2. Document Stores: Store semi-structured data in JSON/XML documents (e.g.,
MongoDB, CouchDB).
3. Column-Family Stores: Store data in columns instead of rows (e.g., Apache
Cassandra, HBase).
4. Graph Databases: Store relationships between entities using graph structures (e.g.,
Neo4j, OrientDB).
Each type is optimized for specific use cases and offers flexible scalability.
5) Explain important components of Spark with necessary diagram.
Apache Spark's architecture includes the following components:
 Driver Program: Coordinates all activities and initiates the SparkContext.
 Cluster Manager: Allocates resources (e.g., YARN, Mesos).
 Executors: Run computations and store data for applications.
 Tasks: Units of work executed on each partition.
[Diagram: Spark Architecture]
Driver Program --> Cluster Manager --> Executors --> Tasks
This architecture enables distributed data processing with in-memory performance and fault
tolerance.
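A minimal driver-program sketch in Scala; the master URL "local[*]" is an assumption for a local run (on a cluster it would be, e.g., "yarn"):

import org.apache.spark.{SparkConf, SparkContext}

// The driver creates the SparkContext, which asks the cluster manager for
// executors; the work below is split into tasks that run on those executors.
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd = sc.parallelize(Seq(1, 2, 3))
println(rdd.reduce(_ + _))                // result is returned to the driver
sc.stop()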
6) What is MongoDB? Explain important features.
MongoDB is a NoSQL, document-oriented database that stores data in flexible, JSON-like
documents. It supports dynamic schemas, allowing different documents in a collection to
have different structures. Key features include high scalability, indexing, replication, and


sharding for horizontal scaling. MongoDB allows developers to store and retrieve complex
data types efficiently and supports aggregation, ad hoc queries, and rich indexing.
7) Differentiate MongoDB with RDBMS. Compare advantages and drawbacks.
MongoDB differs from RDBMS in data model, schema, and performance. While RDBMS uses
tables with fixed schemas, MongoDB stores data in collections of dynamic JSON-like
documents. RDBMS is ideal for transactions and complex queries, whereas MongoDB is
suited for scalability and agile development. MongoDB excels in handling large volumes of
unstructured data, though it may lack strict ACID compliance. RDBMS offers strong
consistency but struggles with horizontal scalability.
8) Mention advantages of using NoSQL databases.
NoSQL databases provide several advantages such as flexible schema design, horizontal
scalability, high performance for large data volumes, and ease of replication. They are ideal
for applications requiring fast access to unstructured or semi-structured data, real-time
analytics, and distributed environments. NoSQL also supports agile development by allowing
quick changes to data structures.
9) MongoDB Terms: Database, Collection, Document, Datatypes
In MongoDB, a Database is a container for collections. A Collection is a group of documents,
similar to a table in RDBMS. A Document is a JSON-like data structure that contains fields
and values. MongoDB supports data types like strings, numbers, arrays, objects, dates, and
binary data, allowing flexibility in data modeling.
10) MongoDB CRUD operations with syntax
 Create Database: use myDatabase
 Drop Database: db.dropDatabase()
 Create Collection: db.createCollection("users")
 Insert Document: db.users.insertOne({name: "John", age: 30})
 Find Document: db.users.find({name: "John"})
 Update Document: db.users.updateOne({name: "John"}, {$set: {age: 31}})
 Delete Document: db.users.deleteOne({name: "John"})
11) What is RDD? Explain RDD operations in detail.
RDD (Resilient Distributed Dataset) is Spark's fundamental data structure, representing an
immutable, distributed collection of objects partitioned across nodes. RDDs are fault-
tolerant and support parallel processing. Operations on RDDs are categorized into
transformations (e.g., map, filter, flatMap) and actions (e.g., collect, count, reduce).
Transformations are lazy and build a lineage graph, which is recomputed in case of failures.
Actions trigger the computation.
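
A sketch (spark-shell, hypothetical input.txt) of transformations building a lineage that an action then executes; toDebugString prints the lineage Spark would replay to recover a lost partition:

val lines = sc.textFile("input.txt")      // hypothetical input file
val words = lines.flatMap(_.split(" "))   // transformation
val pairs = words.map((_, 1))             // transformation
val counts = pairs.reduceByKey(_ + _)     // transformation
println(counts.toDebugString)             // the lineage graph used for fault recovery
counts.take(5).foreach(println)           // action: triggers the computation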


12) Why is RDD better than MapReduce for data storage?


RDDs outperform MapReduce by enabling in-memory computation, reducing the overhead
of writing intermediate data to disk. RDDs support iterative algorithms efficiently and
provide fault tolerance through lineage, making them ideal for machine learning and real-
time data processing. In contrast, MapReduce writes data to HDFS after each operation,
resulting in higher latency.
13) Justify: “SPARK is faster than MapReduce”
Spark is significantly faster than MapReduce due to its in-memory processing, DAG-based
execution engine, and support for advanced analytics. While MapReduce stores
intermediate results on disk after each job, Spark keeps data in memory using RDDs,
drastically reducing I/O operations. This architectural difference makes Spark up to 100
times faster in certain use cases, particularly for iterative algorithms.
14) Word Count program in Scala using Spark
val input = sc.textFile("input.txt")                              // read file as an RDD of lines
val words = input.flatMap(line => line.split(" "))                // split each line into words
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)  // count occurrences per word
wordCounts.collect().foreach(println)                             // action: print (word, count) pairs
This simple program splits input text into words, maps each word to a count of 1, and
aggregates them using reduceByKey.
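The program runs as-is in the spark-shell, where sc is predefined; in a standalone application a SparkContext must be created first, and the counts could also be written back to storage with wordCounts.saveAsTextFile("output").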
15) "Moving Computation is Cheaper than Moving Data" – Justify
In distributed systems, moving large volumes of data across the network is costly in terms of
time and resources. Thus, it is more efficient to move computation to where the data
resides, reducing latency and bandwidth usage. Hadoop and Spark follow this principle by
running tasks on nodes that store the required data blocks. This data locality enhances
system performance and scalability.

1) Mention usefulness of Pig. What are key features of Pig?


Apache Pig is a high-level platform for processing large data sets. It uses a scripting language
called Pig Latin, which simplifies the development of MapReduce programs. Pig is useful
because it abstracts the complexities of writing low-level MapReduce code and allows for
faster development and prototyping. Key features include ease of use, support for both
structured and semi-structured data, fault tolerance, and extensibility. Pig is ideal for ETL
(Extract, Transform, Load) operations, data preparation, and research analysis.


2) Explain components of Hive architecture. Also describe working of Hive with suitable
diagram.
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis using HiveQL, a SQL-like language. The main components
of Hive architecture include:
 Metastore: Stores metadata about tables and partitions.
 Driver: Manages the lifecycle of HiveQL statements.
 Compiler: Translates HiveQL into execution plans.
 Execution Engine: Executes the plans using Hadoop or Spark.
 Hive Server: Accepts client connections and handles queries.
[Diagram: Hive Architecture]
Clients --> Hive Server --> Compiler --> Execution Engine --> Hadoop/YARN
                               |
                               --> Metastore
Hive processes user queries by converting them into DAGs of MapReduce jobs, which are
executed in the Hadoop ecosystem.

3) Differentiate:
(i) Pig vs. MapReduce

Feature           | Pig                                 | MapReduce
Language          | Pig Latin                           | Java (programming required)
Ease of Use       | High-level, user-friendly scripting | Low-level programming complexity
Development Speed | Faster                              | Slower
Data Types        | Supports complex and nested types   | Mostly primitive types
Execution Engine  | Converts scripts to MapReduce       | Native MapReduce

(ii) HDFS vs. HBase

Feature        | HDFS                             | HBase
Data Model     | File system for batch processing | NoSQL database (column-oriented)
Access Pattern | Sequential                       | Random read/write
Latency        | High (not ideal for real-time)   | Low (supports real-time)
Schema         | Schema-less                      | Flexible schema
Integration    | Works with Hive, Pig, MapReduce  | Works with Spark, MapReduce

4) What is the role of ZooKeeper? How does it help in monitoring a cluster?


Apache ZooKeeper is a centralized service for maintaining configuration information,
naming, and providing distributed synchronization. In a Hadoop ecosystem, it helps in
managing and coordinating distributed components. ZooKeeper ensures that various nodes
in a cluster can work together in a synchronized manner. It maintains a hierarchical tree of
nodes (znodes) which store data and provide a way to coordinate distributed processes. It is
essential in managing failover for Hadoop components like HBase, HDFS, and YARN.

5) Data model and implementation of HBase


HBase is a distributed, column-oriented database built on top of HDFS. Its data model
resembles Google's BigTable and includes tables, rows, column families, and columns. Each
row has a unique row key, and columns are grouped into families. HBase stores data in
HFiles and uses a Write-Ahead Log (WAL) for durability. RegionServers manage regions
(subsets of tables), and the HBase Master oversees load balancing and failover. HBase is
ideal for sparse datasets and supports real-time read/write access.
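
A minimal client-side sketch in Scala using the standard HBase Java API (the table, row key, and values here are hypothetical; assumes an HBase cluster reachable with the default configuration):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("students"))   // hypothetical table

// Write: row key "s001", column family "personal", qualifier "name".
val put = new Put(Bytes.toBytes("s001"))
put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Harsh"))
table.put(put)

// Read the same cell back by row key (random access, unlike raw HDFS).
val result = table.get(new Get(Bytes.toBytes("s001")))
println(Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))))
table.close(); conn.close()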

6) HiveQL data manipulation queries in detail


HiveQL supports data manipulation operations like INSERT, UPDATE, DELETE, and SELECT. For
example:
 INSERT INTO table_name VALUES (...) adds new records.
 UPDATE table_name SET column=value WHERE condition modifies existing records.
 DELETE FROM table_name WHERE condition removes records.
 SELECT column FROM table WHERE condition queries data.
Hive transforms these SQL-like statements into MapReduce jobs, enabling large-scale data
processing over Hadoop. Note that UPDATE and DELETE are supported only on transactional
(ACID-enabled) Hive tables.

7) What is HBase? Write a query to create a table in HBase.


HBase is a distributed NoSQL database that stores structured and semi-structured data in a
fault-tolerant way. It supports random access to large datasets and is suitable for sparse
tables. To create a table in HBase:
create 'students', 'personal', 'academic'
This creates a table named 'students' with two column families: 'personal' and 'academic'.

8) Draw architecture of Apache Pig and explain in short.


Apache Pig architecture includes:
 Pig Latin Scripts: Input by the user.
 Parser: Checks syntax and generates logical plans.
 Optimizer: Optimizes execution plans.
 Compiler: Converts plans into MapReduce jobs.
 Execution Engine: Executes jobs on Hadoop.
[Diagram: Pig Architecture]
Pig Latin Script --> Parser --> Logical Plan --> Optimizer --> Physical Plan --> MapReduce --> HDFS

9) Benefits of Zookeeper, znodes and their types


ZooKeeper’s benefits include coordination, configuration management, and fault tolerance. A
znode is a data node in ZooKeeper’s hierarchy. There are three types (see the sketch after
this list):
 Persistent: Remains after the client disconnects.
 Ephemeral: Deleted once the client session ends.
 Sequential: Created with a unique, monotonically increasing counter appended to the name.
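
A minimal sketch using the standard ZooKeeper Java client from Scala (the connect string and paths are hypothetical):

import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

val zk = new ZooKeeper("localhost:2181", 5000, null)   // hypothetical connect string
val acl = ZooDefs.Ids.OPEN_ACL_UNSAFE

// Persistent: survives after this client disconnects.
zk.create("/config", "v1".getBytes, acl, CreateMode.PERSISTENT)
// Ephemeral: removed automatically when this session ends.
zk.create("/worker-1", Array[Byte](), acl, CreateMode.EPHEMERAL)
// Sequential: a unique increasing counter is appended, e.g. /lock-0000000001.
zk.create("/lock-", Array[Byte](), acl, CreateMode.PERSISTENT_SEQUENTIAL)
zk.close()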

10) RDBMS vs. HBase; HiveQL Data Definition Language (DDL)

Feature      | RDBMS                     | HBase
Data Model   | Relational (tables, rows) | Column-oriented NoSQL
Schema       | Fixed schema              | Schema-less, flexible
Transactions | Full ACID support         | Limited ACID
Scalability  | Vertical                  | Horizontal

HiveQL DDL statements define and modify table structure. Examples:


 CREATE TABLE students (id INT, name STRING)
 ALTER TABLE students ADD COLUMNS (age INT)
 DROP TABLE students removes the table structure.

11) What is Big Data streaming? Stream data architecture


Big Data Streaming involves processing continuous flows of data in real time. Examples
include logs, sensor data, and social media feeds. A typical stream data architecture (see
the sketch after the list) consists of:
 Data Sources: Devices or applications generating data.
 Stream Processing Engine: e.g., Apache Storm, Spark Streaming.
 Data Storage: HDFS, HBase.
 Data Sink: Dashboards or databases for analysis.
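
A minimal sketch of such a pipeline with Spark Streaming as the processing engine (assumes a text source on localhost:9999, e.g. started with nc -lk 9999; the console stands in for the sink):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)   // data source
val counts = lines.flatMap(_.split(" "))              // stream processing
  .map((_, 1))
  .reduceByKey(_ + _)
counts.print()                                        // data sink (console here)

ssc.start()
ssc.awaitTermination()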

12) Write about Apache Kafka


Apache Kafka is a distributed messaging system used for building real-time data pipelines
and streaming applications. It uses a publish-subscribe model and is designed for high
throughput, fault tolerance, and durability. Kafka stores messages in topics and distributes
them across brokers. Producers send messages to topics, and consumers read them
asynchronously. Kafka is widely used for real-time analytics, log aggregation, and event
sourcing.
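
A minimal producer sketch in Scala using the standard Kafka client API (broker address and topic name are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")      // hypothetical broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Publish one message to the "events" topic; subscribed consumers read it asynchronously.
producer.send(new ProducerRecord[String, String]("events", "key1", "hello kafka"))
producer.close()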
