BDA Ass 3(1)
Feature        | SQL                          | NoSQL
Data Model     | Tables with rows and columns | Key-Value, Document, Column, or Graph
Query Language | SQL                          | Varies by database (e.g., MQL for MongoDB, CQL for Cassandra)
Harsh Shah TY-IT-1-A 220410116025
NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data
storage and real-time web applications. Unlike SQL (relational) databases that store
structured data in tables with predefined schemas, NoSQL supports flexible schema designs
and stores data in forms like key-value pairs, documents, graphs, or wide columns. NoSQL
provides high scalability, availability, and performance for unstructured or semi-structured
data. SQL databases are better suited for complex transactions and consistency, whereas
NoSQL is optimized for scalability and handling diverse data types.
4) Categorize types of NoSQL databases with examples.
NoSQL databases are categorized into four main types:
1. Key-Value Stores: Store data as key-value pairs (e.g., Redis, Riak).
2. Document Stores: Store semi-structured data in JSON/XML documents (e.g.,
MongoDB, CouchDB).
3. Column-Family Stores: Store data in column families, grouping related columns under
each row key instead of storing whole rows together (e.g., Apache Cassandra, HBase).
4. Graph Databases: Store relationships between entities using graph structures (e.g.,
Neo4j, OrientDB).
Each type is optimized for specific use cases and offers flexible scalability.
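To make the four categories concrete, the sketch below represents a small piece of data in each model using plain Python structures. The names used here (`user:1`, `profile`, `stats`, `FOLLOWS`) are made up for illustration, and no real database client is involved.

```python
# Illustrative only: plain Python structures standing in for each NoSQL model.

# 1. Key-value store: an opaque value looked up by a single key.
key_value = {"user:1": "John"}

# 2. Document store: a self-describing, nested document.
document = {"_id": 1, "name": "John", "skills": ["python", "sql"]}

# 3. Column-family store: row key -> column family -> column -> value.
column_family = {
    "row1": {
        "profile": {"name": "John"},
        "stats": {"logins": 12},
    }
}

# 4. Graph: nodes plus labeled edges between them.
nodes = {1: "John", 2: "Asha"}
edges = [(1, "FOLLOWS", 2)]


def followed_by(node_id):
    """Return the names of the nodes that node_id follows."""
    return [nodes[dst] for src, label, dst in edges
            if src == node_id and label == "FOLLOWS"]


print(key_value["user:1"])
print(document["skills"][0])
print(column_family["row1"]["stats"]["logins"])
print(followed_by(1))
```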
5) Explain important components of Spark with necessary diagram.
Apache Spark's architecture includes the following components:
Driver Program: Coordinates all activities and initiates the SparkContext.
Cluster Manager: Allocates resources (e.g., YARN, Mesos).
Executors: Run computations and store data for applications.
Tasks: Units of work executed on each partition.
[Diagram: Spark Architecture]
Driver Program --> Cluster Manager --> Executors --> Tasks
This architecture enables distributed data processing with in-memory performance and fault
tolerance.
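The division of labour between these components can be imitated on one machine. The sketch below is not Spark code: a thread pool stands in for the executors that a cluster manager would allocate, and the function names (`split_into_partitions`, `task`, `run_job`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: split the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """A task: the unit of work applied to a single partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=2, num_partitions=4):
    """Driver program: plan tasks, hand them to executors, merge results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool stands in for executors granted by a cluster manager.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(task, partitions))
    return sum(partial_sums)


total = run_job(list(range(1, 11)))
print(total)
```

Each partition yields exactly one task, so the number of partitions, not the number of executors, sets the degree of parallelism available to the job.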
6) What is MongoDB? Explain important features.
MongoDB is a NoSQL, document-oriented database that stores data in flexible, JSON-like
documents. It supports dynamic schemas, allowing different documents in a collection to
have different structures. Key features include high scalability, indexing, replication, and
sharding for horizontal scaling. MongoDB allows developers to store and retrieve complex
data types efficiently and supports aggregation, ad hoc queries, and rich indexing.
7) Differentiate MongoDB with RDBMS. Compare advantages and drawbacks.
MongoDB differs from RDBMS in data model, schema, and performance. While RDBMS uses
tables with fixed schemas, MongoDB stores data in collections of dynamic JSON-like
documents. RDBMS is ideal for transactions and complex queries such as multi-table joins,
whereas MongoDB is suited for scalability and agile development. MongoDB excels in handling
large volumes of unstructured data. Its single-document operations are atomic, and
multi-document ACID transactions have been available since version 4.0, though they cost
more than single-document writes. RDBMS offers strong consistency but is harder to scale
horizontally.
8) Mention advantages of using NoSQL databases.
NoSQL databases provide several advantages such as flexible schema design, horizontal
scalability, high performance for large data volumes, and ease of replication. They are ideal
for applications requiring fast access to unstructured or semi-structured data, real-time
analytics, and distributed environments. NoSQL also supports agile development by allowing
quick changes to data structures.
9) MongoDB Terms: Database, Collection, Document, Datatypes
In MongoDB, a Database is a container for collections. A Collection is a group of documents,
similar to a table in RDBMS. A Document is a JSON-like data structure that contains fields
and values. MongoDB supports data types like strings, numbers, arrays, objects, dates, and
binary data, allowing flexibility in data modeling.
10) MongoDB CRUD operations with syntax
Create Database: use myDatabase (the database is actually created on the first write)
Drop Database: db.dropDatabase()
Create Collection: db.createCollection("users")
Insert Document: db.users.insertOne({name: "John", age: 30})
Find Document: db.users.find({name: "John"})
Update Document: db.users.updateOne({name: "John"}, {$set: {age: 31}})
Delete Document: db.users.deleteOne({name: "John"})
The older forms insert(), update(), and remove() are deprecated in the current MongoDB shell.
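The semantics of these commands can be mimicked in a few lines of Python. The sketch below is an in-memory stand-in, not the MongoDB driver: the `Collection` class and its method names are hypothetical, it supports only exact-match filters, and the `set_fields` argument plays the role of `$set`.

```python
class Collection:
    """Hypothetical in-memory stand-in for a MongoDB collection."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        # Store a copy so later changes to the caller's dict do not leak in.
        self.docs.append(dict(doc))

    def find(self, query):
        # A document matches when every field in the query is equal.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query, set_fields):
        # Update only the first match, like a $set on a single document.
        for d in self.find(query):
            d.update(set_fields)
            return 1
        return 0

    def delete_one(self, query):
        # Remove only the first matching document.
        for d in self.find(query):
            self.docs.remove(d)
            return 1
        return 0


users = Collection()
users.insert_one({"name": "John", "age": 30})
users.insert_one({"name": "Asha", "city": "Surat"})  # different fields are fine
users.update_one({"name": "John"}, {"age": 31})
print(users.find({"name": "John"}))
users.delete_one({"name": "John"})
print(len(users.docs))
```

The second insert shows the dynamic schema in action: the two documents share a collection but not a set of fields.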
11) What is RDD? Explain RDD operations in detail.
RDD (Resilient Distributed Dataset) is Spark's fundamental data structure, representing an
immutable, distributed collection of objects partitioned across nodes. RDDs are fault-
tolerant and support parallel processing. Operations on RDDs are categorized into
transformations (e.g., map, filter, flatMap) and actions (e.g., collect, count, reduce).
Transformations are lazy and build a lineage graph, which is recomputed in case of failures.
Actions trigger the computation.
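The split between lazy transformations and eager actions can be sketched without Spark. The `LazyDataset` class below is a hypothetical single-machine imitation, not the Spark API: transformations only record a step in the lineage, and actions replay that lineage from the source data.

```python
from functools import reduce


class LazyDataset:
    """Hypothetical single-machine imitation of an RDD (not the Spark API)."""

    def __init__(self, source, lineage=()):
        self.source = source      # original data, never modified
        self.lineage = lineage    # recorded transformations, not yet run

    # --- transformations: return a new dataset, compute nothing ---
    def map(self, fn):
        return LazyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage from the source; the same replay would rebuild
        # lost data after a failure.
        data = list(self.source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    # --- actions: trigger the computation ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, fn):
        return reduce(fn, self._compute())


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(len(evens_squared.lineage))   # steps recorded so far, nothing computed
print(evens_squared.collect())
print(evens_squared.reduce(lambda a, b: a + b))
```

Because every transformation returns a new object and leaves `numbers` untouched, the sketch also reflects the immutability of RDDs.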
2) Explain components of Hive architecture. Also describe working of Hive with suitable
diagram.
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis using HiveQL, a SQL-like language. The main components
of Hive architecture include:
Metastore: Stores metadata about tables and partitions.
Driver: Manages the lifecycle of HiveQL statements.
Compiler: Translates HiveQL into execution plans.
Execution Engine: Executes the plans using Hadoop or Spark.
Hive Server: Accepts client connections and handles queries.
[Diagram: Hive Architecture]
Clients --> Hive Server --> Compiler --> Execution Engine --> Hadoop/YARN
                               |
                               --> Metastore
Hive processes a user query by compiling it into a DAG of stages, classically MapReduce
jobs, which the execution engine then runs on the Hadoop cluster.
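To illustrate what such a translation amounts to, the sketch below hand-executes the map, shuffle, and reduce phases that a grouped count would need. The query, table name, and rows are hypothetical, and the code is a plain Python imitation rather than anything Hive generates.

```python
from collections import defaultdict

# Hypothetical query: SELECT dept, COUNT(*) FROM employees GROUP BY dept;
employees = [
    {"name": "A", "dept": "IT"},
    {"name": "B", "dept": "HR"},
    {"name": "C", "dept": "IT"},
]


def map_phase(rows):
    """Emit one (dept, 1) pair per input row."""
    return [(row["dept"], 1) for row in rows]


def shuffle_phase(pairs):
    """Group the emitted values by key, as the shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key to get the per-group count."""
    return {key: sum(values) for key, values in grouped.items()}


result = reduce_phase(shuffle_phase(map_phase(employees)))
print(result)
```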
3) Differentiate:
(i) Pig vs. MapReduce
Feature    | Pig                               | MapReduce
Data Types | Supports complex and nested types | Mostly primitive types
(ii) HDFS vs. HBase
Feature     | HDFS                             | HBase
Data Model  | File system for batch processing | NoSQL database (column-oriented)
Integration | Works with Hive, Pig, MapReduce  | Works with Spark, MapReduce
HBase is a distributed NoSQL database that stores structured and semi-structured data in a
fault-tolerant way. It supports random access to large datasets and is suitable for sparse
tables. To create a table in HBase:
create 'students', 'personal', 'academic'
This creates a table named 'students' with two column families: 'personal' and 'academic'.
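The resulting layout can be pictured as nested maps: row key, then column family, then column qualifier. The sketch below models that with Python dictionaries. The `put` and `get` helpers, the row keys, and the column names are hypothetical; they only imitate the shape of the HBase shell commands, not the HBase client API, and cell versions (timestamps) are left out.

```python
# Column families are fixed when the table is created.
COLUMN_FAMILIES = {"personal", "academic"}
students = {}  # row key -> family -> qualifier -> value


def put(table, row_key, column, value):
    """Store a cell; column is written as 'family:qualifier'."""
    family, qualifier = column.split(":", 1)
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get(table, row_key):
    """Return all cells stored for one row (random access by row key)."""
    return table.get(row_key, {})


put(students, "s001", "personal:name", "John")
put(students, "s001", "academic:branch", "IT")
put(students, "s002", "personal:name", "Asha")  # sparse: no academic cells
print(get(students, "s001"))
print(get(students, "s002"))
```

Row `s002` stores nothing for the 'academic' family, which is why this layout suits sparse tables: absent cells take no space.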