BDA IA1 QB Solved complete
1. Explain CAP theorem. How is CAP different from ACID properties in databases?
• The CAP theorem, originally introduced as the CAP principle, can be used to explain
some of the competing requirements in a distributed system with replication.
• It is a tool used to make system designers aware of the trade-offs involved while
designing networked shared-data systems.
• CAP stands for Consistency, Availability, and Partition tolerance.
• Consistency: Consistency means that the nodes will have the same copies of a
replicated data item visible to various transactions; that is, every client has the
same view of the data.
• Availability: Availability means that each read and write request for a data item
will either be processed successfully or will receive a message that the operation
cannot be completed.
• Partition Tolerance: It means that the system can continue operating even if the
network connecting the nodes has a fault that results in two or more partitions,
where the nodes in each partition can only communicate among themselves.
• Both CAP and ACID use the term consistency, but they do not mean the same
thing by it.
• In CAP, consistency means that all nodes store and provide the same data, while
for ACID, consistency means that the internal rules (integrity constraints) within an
individual node's database must apply across the board for every transaction.
2. What is Big Data? What is Hadoop? How are Big Data and Hadoop linked?
• Big Data refers to the large volumes of data that are complex and grow rapidly, often
characterized by the three Vs: Volume (massive amounts of data), Velocity (rapid
generation and processing), and Variety (different types of data such as structured,
unstructured, and semi-structured). Examples of Big Data include social media data,
sensor data, and transaction records.
• Hadoop is an open-source framework designed to store and process large datasets
efficiently. It consists of several components: HDFS (Hadoop Distributed File
System) for storing data across multiple machines, MapReduce for processing data in
parallel across clusters, YARN (Yet Another Resource Negotiator) for managing
resources and scheduling, and Hadoop Common, which includes common utilities and
libraries. Hadoop is primarily written in Java.
• Big Data and Hadoop are closely linked because Hadoop is specifically designed to
handle Big Data. Hadoop’s HDFS component stores large datasets efficiently, while
MapReduce processes these datasets in parallel, making it possible to manage and
analyze Big Data effectively. Hadoop is also highly scalable, allowing for the addition
of more nodes to the cluster to handle increasing amounts of data. Common use cases
for Hadoop include data warehousing, business intelligence, machine learning, and
data mining.
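To make the link concrete, here is a minimal sketch of a MapReduce job written as Hadoop Streaming scripts in Python. The word-count task, the script names mapper.py and reducer.py, and the tab-separated output format are illustrative assumptions for this sketch, not part of the question.

    # mapper.py -- reads raw text from standard input and emits (word, 1) pairs.
    # Hadoop Streaming runs one copy of this per HDFS input split, in parallel.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")   # tab-separated key/value expected by Streaming

    # reducer.py -- receives the pairs sorted by key and sums the counts per word.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

HDFS stores the input file in blocks across DataNodes, and the framework schedules one mapper per block close to the data, which is exactly the "store with HDFS, process in parallel with MapReduce" link described above.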
4. Why is HDFS more suited for applications with large datasets than for applications
with many small files? Elaborate.
Reasons HDFS is Suited for Large Datasets:
1. Large Block Size: HDFS uses large block sizes (128 MB or 256 MB), reducing the
overhead of managing metadata.
2. High Throughput: HDFS is optimized for high-throughput access, making it ideal for
reading and writing large files sequentially.
3. Fault Tolerance: Data blocks are replicated across multiple nodes, ensuring data
availability even if some nodes fail.
4. Scalability: HDFS easily scales by adding more nodes to the cluster, distributing large
datasets efficiently.
Why HDFS is Not Suited for Many Small Files:
1. Every file, directory, and block is held as an object in the NameNode's memory, so
millions of small files exhaust NameNode memory long before the disks fill up (see
the rough calculation below).
2. Each small file occupies its own block, so map tasks receive tiny inputs and a job
spends more time on task startup and seeks than on useful processing.
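A rough back-of-the-envelope calculation makes the small-file problem visible. The figure of roughly 150 bytes of NameNode heap per namespace object is a commonly quoted rule of thumb, not an exact number, and the file sizes below are illustrative.

    # Estimate NameNode heap needed to track 1 TB stored as large vs. small files,
    # assuming ~150 bytes of heap per file/block object (rule of thumb).
    BYTES_PER_OBJECT = 150
    ONE_TB = 1024 ** 4

    def namenode_objects(total_bytes, file_size, block_size=128 * 1024 ** 2):
        files = total_bytes // file_size
        blocks_per_file = max(1, -(-file_size // block_size))  # ceiling division
        return files * (1 + blocks_per_file)  # one file object plus its block objects

    large = namenode_objects(ONE_TB, file_size=1024 ** 3)   # 1 GB files
    small = namenode_objects(ONE_TB, file_size=100 * 1024)  # 100 KB files

    print("1 TB as 1 GB files  :", large, "objects ~", large * BYTES_PER_OBJECT // 1024 ** 2, "MB of heap")
    print("1 TB as 100 KB files:", small, "objects ~", small * BYTES_PER_OBJECT // 1024 ** 2, "MB of heap")

The same terabyte costs the NameNode only about a megabyte of heap when stored as 1 GB files, but roughly 3 GB of heap when stored as 100 KB files, which is why HDFS documentation warns against large numbers of small files.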
8. Explain the Hadoop ecosystem with its core components. Explain the physical architecture
of Hadoop. State its limitations.
HDFS (Hadoop Distributed File System)
• Purpose: HDFS is designed to store large datasets reliably and to stream those
datasets at high bandwidth to user applications.
• Structure: It consists of two main components:
o NameNode: Manages the metadata (data about data) and keeps track of which
blocks are stored on which DataNodes.
o DataNode: Stores the actual data. Data is split into blocks and distributed
across multiple DataNodes.
• Fault Tolerance: Data is replicated across multiple DataNodes to ensure fault
tolerance and high availability.
MapReduce
• Purpose: MapReduce is the data-processing layer of Hadoop. A job is split into map
tasks that process the data blocks in parallel on the nodes where they are stored, and
reduce tasks that aggregate the intermediate results.
• Other core components: YARN manages cluster resources and schedules jobs, while
Hadoop Common provides the shared utilities and libraries used by the other components.
Physical Architecture of Hadoop
A Hadoop cluster consists of a single master node and multiple slave nodes. The master node
includes the Job Tracker, Task Tracker, NameNode, and DataNode, whereas a slave node
includes a DataNode and a Task Tracker.
The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It follows
a master/slave architecture: a single NameNode performs the role of master, and multiple
DataNodes perform the role of slaves.
Both the NameNode and the DataNodes are capable of running on commodity machines. HDFS
is developed in the Java language, so any machine that supports Java can easily run the
NameNode and DataNode software.
NameNode
o It is the master daemon of HDFS. It maintains the file-system namespace and metadata
(the mapping of files to blocks and of blocks to DataNodes) and regulates clients' access
to files.
DataNode
o It is the slave daemon. It stores the actual data blocks on its local disks, serves read and
write requests from clients, and periodically sends heartbeats and block reports to the
NameNode.
Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from clients and to process the
data by using the NameNode.
o In response, the NameNode provides the required metadata to the Job Tracker.
Task Tracker
o It works as a slave to the Job Tracker. It receives the map and reduce tasks from the Job
Tracker, executes them on its node, and reports progress back through periodic heartbeats.
MapReduce Layer
MapReduce processing comes into play when a client application submits a MapReduce job
to the Job Tracker. In response, the Job Tracker forwards the request to the appropriate Task
Trackers. Sometimes a Task Tracker fails or times out; in such a case, that part of the job is
rescheduled on another node.
Limitations of Hadoop:
• It is built for batch processing, so real-time and low-latency workloads are a poor fit.
• It handles a large number of small files inefficiently, because every file adds metadata
load on the NameNode.
• Its batch-oriented design leads to high latency, which is problematic for time-sensitive
applications.
9. Explain how Hadoop's goals are covered in the Hadoop Distributed File System.
Fault Tolerance:
• Replication: HDFS automatically replicates each block of data across
multiple nodes.
• Heartbeats and block reports: HDFS continuously monitors the health of
DataNodes through periodic heartbeats and block reports.
High Throughput: HDFS moves the computation closer to where the data is stored,
minimizing data movement across the network and reducing bottlenecks. HDFS is
optimized for batch processing rather than random reads and writes.
Reliability: HDFS separates metadata management (NameNode) from actual data storage
(DataNodes). For data integrity, HDFS checksums data blocks and verifies the integrity of
data during storage and retrieval (illustrated conceptually below).
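The checksum idea can be sketched as follows. HDFS itself stores CRC checksums for fixed-size chunks of each block; the helper names and chunk size here are illustrative and not the actual HDFS implementation.

    # Conceptual sketch of block checksumming, loosely modelled on how HDFS verifies
    # data: compute a CRC per chunk on write, recompute and compare on read.
    import zlib

    CHUNK_SIZE = 512  # bytes per checksummed chunk (illustrative value)

    def checksum_block(block: bytes) -> list:
        return [zlib.crc32(block[i:i + CHUNK_SIZE])
                for i in range(0, len(block), CHUNK_SIZE)]

    def verify_block(block: bytes, stored_sums: list) -> bool:
        return checksum_block(block) == stored_sums

    data = b"some block contents " * 100
    sums = checksum_block(data)           # computed when the block is written
    print(verify_block(data, sums))       # True  -> block is intact
    corrupted = b"X" + data[1:]
    print(verify_block(corrupted, sums))  # False -> client reads another replica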
10. Differentiate between RDBMS and a NOSQL Database. Illustrate with example.
An RDBMS stores data in fixed-schema tables with ACID transactions and relationships
enforced through foreign keys, and it typically scales vertically. A NoSQL database stores
schema-flexible data (documents, key-value pairs, columns, or graphs), relaxes consistency
to the BASE model, and scales horizontally across commodity servers.
RDBMS example:
• Use Case: A banking system where data integrity and complex transactions are crucial.
• Structure: Tables for customers, accounts, transactions, etc., with relationships
defined by foreign keys.
NoSQL example (document database such as MongoDB):
• Use Case: A social media platform where data is unstructured and rapidly changing.
• Structure: Collections of documents, each document being a JSON-like object.
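The contrast can be sketched in a few lines of Python. The first part uses only the standard-library sqlite3 module; the second models a NoSQL document as a plain nested dictionary. All table, field, and key names are illustrative.

    # RDBMS style: fixed schema, rows linked by a foreign key, queried with a JOIN.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
                 "customer_id INTEGER REFERENCES customers(id), balance REAL)")
    conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
    conn.execute("INSERT INTO accounts VALUES (10, 1, 2500.0)")
    print(conn.execute("SELECT c.name, a.balance FROM customers c "
                       "JOIN accounts a ON a.customer_id = c.id").fetchone())

    # NoSQL document style: everything about one user lives in a single,
    # schema-free document, so no JOIN is needed to read it back.
    post = {"text": "hello", "likes": 3}
    user_doc = {"_id": "u1", "name": "Alice", "followers": 120, "posts": [post]}
    print(user_doc["posts"][0]["text"])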
12. Demonstrate how business problems have been solved faster, cheaper, and more
effectively with NoSQL, considering Google's MapReduce case study. Also illustrate
the business drivers in it.
Google’s MapReduce framework, coupled with NoSQL databases, allowed for the parallel
processing of massive datasets across distributed clusters of servers. Traditionally, processing
such vast amounts of data with relational databases (RDBMS) would have been time-
consuming and inefficient. However, with MapReduce, Google was able to break down large
computational tasks into smaller, manageable chunks that could be processed simultaneously
across multiple nodes. This approach dramatically accelerated data processing times,
enabling Google to solve complex data problems, such as indexing the entire web, much
faster than before.
The use of NoSQL databases and the MapReduce framework made it possible to handle
enormous volumes of unstructured data without the need for expensive, high-end hardware.
Instead, Google leveraged clusters of commodity servers, which were far less costly than
traditional enterprise-grade systems. By distributing the workload across many inexpensive
servers, Google reduced the need for costly hardware investments. Additionally, the
scalability of the MapReduce framework meant that as data volumes grew, Google could
simply add more servers to the cluster rather than investing in entirely new infrastructure,
keeping costs down.
MapReduce and NoSQL databases allowed Google to handle a diverse range of data types,
including structured, semi-structured, and unstructured data, more effectively than traditional
relational databases. This flexibility enabled Google to store and process data in a way that
was most appropriate for the task at hand, improving the accuracy and efficiency of their data
processing operations. For example, in web search indexing, handling the vast and varied
types of data from the internet required a system that could process different formats quickly
and accurately. MapReduce provided this capability, allowing Google to deliver more
relevant search results to users.
The key business drivers behind this approach were:
1. Scalability: The need to efficiently manage and process rapidly growing amounts of
data, particularly with the rise of the internet and web search demands.
2. Cost Efficiency: The desire to reduce operational costs by using commodity hardware
and avoiding the high costs associated with traditional RDBMS and high-end servers.
3. Data Variety: The increasing need to process and store diverse data types, beyond
what traditional relational databases could handle effectively.
4. Performance: The need for faster data processing to support real-time and near-real-
time applications, such as web search and ad targeting.
By adopting the MapReduce framework and NoSQL databases, Google was able to solve
critical business problems faster, cheaper, and more effectively, ensuring they could maintain
their competitive edge in the rapidly evolving digital landscape.
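The indexing workload described above maps naturally onto the map/shuffle/reduce pattern. Below is a toy, in-memory Python simulation of building an inverted index; it is a sketch of the idea only, not Google's implementation, and the page contents are made up.

    # Toy simulation of MapReduce building an inverted index (term -> pages).
    from collections import defaultdict

    pages = {
        "page1": "big data needs distributed processing",
        "page2": "mapreduce processes big data in parallel",
    }

    def map_phase(doc_id, text):
        # Each mapper emits (term, doc_id) pairs for its own input split.
        return [(word, doc_id) for word in text.split()]

    def reduce_phase(term, doc_ids):
        # Each reducer collapses all postings for one term.
        return term, sorted(set(doc_ids))

    # Shuffle: group mapper output by key, as the framework does between phases.
    grouped = defaultdict(list)
    for doc_id, text in pages.items():
        for term, d in map_phase(doc_id, text):
            grouped[term].append(d)

    index = dict(reduce_phase(t, ds) for t, ds in grouped.items())
    print(index["big"])       # ['page1', 'page2']
    print(index["parallel"])  # ['page2']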
ADVANTAGES OF HADOOP
Scalability: Hadoop can easily scale horizontally by adding more nodes to the cluster,
allowing it to handle vast amounts of data.
Fault Tolerance: Hadoop automatically replicates data across multiple nodes, ensuring
data availability even if some nodes fail.
Flexibility: Hadoop can process a wide variety of data types, including structured, semi-
structured, and unstructured data, from multiple sources.
LIMITATIONS OF HADOOP
Real-Time Processing: Hadoop is designed for batch processing and struggles with real-
time data processing tasks.
Small File Handling: Hadoop is inefficient at managing a large number of small files,
leading to performance issues and increased overhead.
High Latency: Due to its batch processing nature, Hadoop often exhibits higher latency,
which can be problematic for time-sensitive applications
16. Explain the CAP theorem and explain how NoSQL systems guarantee the BASE properties.
The CAP theorem (explained in Q1 above) says that when a network partition occurs, a distributed
system must choose between strict consistency and availability. NoSQL systems favour availability
and therefore guarantee the BASE properties, which stand for Basically Available, Soft State, and
Eventual Consistency. Here is a breakdown of each property:
1. Basically Available: This means the system guarantees availability of the data, ensuring that
requests will receive a response, even if it’s a failure. This is achieved through redundancy
and replication, which help maintain availability even during partial system failures.
2. Soft State: The state of the system can change over time, even without input from the user.
This is due to the ongoing background processes that update the data. NoSQL systems handle
these changes gracefully, ensuring that the system remains operational and data is not
corrupted.
3. Eventual Consistency: Instead of requiring immediate consistency like traditional ACID-
compliant databases, NoSQL systems ensure that data will eventually become consistent. This
means that after some time, all updates will propagate through the system, and all nodes will
have the same data.
These properties allow NoSQL databases to be highly scalable and available, making them suitable
for large-scale applications like social media platforms and online shopping websites.
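The following is a tiny sketch of what eventual consistency looks like from a client's point of view: an in-memory toy model with two replicas and delayed replication, in which every name is illustrative.

    # Toy model of eventual consistency: writes are acknowledged by one replica
    # and reach the other asynchronously, so reads can be briefly stale.
    replica_a, replica_b = {}, {}
    pending = []  # replication log waiting to be applied to replica_b

    def write(key, value):
        replica_a[key] = value        # acknowledged immediately (basically available)
        pending.append((key, value))  # will reach replica_b later (soft state)

    def read_from_b(key):
        return replica_b.get(key, "<not yet visible>")

    def replicate():
        # Background process that eventually brings the replicas into agreement.
        while pending:
            key, value = pending.pop(0)
            replica_b[key] = value

    write("profile:42", "new avatar")
    print(read_from_b("profile:42"))  # "<not yet visible>"  (stale read)
    replicate()
    print(read_from_b("profile:42"))  # "new avatar"         (eventually consistent)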
17. Write a short note on NoSQL data stores with examples.
NoSQL stands for “Not Only SQL” and refers to a variety of database technologies that were
developed to address the limitations of traditional relational databases. NoSQL databases are
particularly useful for handling big data and real-time web applications.
Document Databases
Example: MongoDB
Use Case: Ideal for content management systems, real-time analytics, and applications
requiring flexible schemas.
Example Document:
{
  "_id": "12345",
  "email": "[email protected]",
  "address": {
    "city": "Anytown",
    "state": "CA",
    "zip": "12345"
  }
}
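A minimal sketch of storing and querying such a document with the pymongo driver follows; it assumes a MongoDB server on the default local port, and the database and collection names are illustrative.

    # Store and query a JSON-like document in MongoDB
    # (requires pymongo and a MongoDB server on localhost:27017).
    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017")["cms"]["users"]

    users.insert_one({
        "_id": "12345",
        "address": {"city": "Anytown", "state": "CA", "zip": "12345"},
    })

    # Query by a nested field -- no schema definition or JOIN is needed.
    print(users.find_one({"address.city": "Anytown"}))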
Key-Value Stores
Example: Redis
Use Case: Suitable for caching, session management, and real-time analytics.
Example: a session entry stored under the key "session:12345", whose value is an opaque
string or serialized object looked up directly by that key.
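A minimal sketch with the redis-py client (assumes a Redis server on localhost; the key name and expiry time are illustrative):

    # Cache a session entry in Redis with a one-hour expiry
    # (requires the redis package and a Redis server on localhost:6379).
    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    r.set("session:12345", "user=12345;role=member", ex=3600)  # write with a TTL
    print(r.get("session:12345"))  # O(1) lookup by key
    print(r.ttl("session:12345"))  # seconds remaining before the entry expires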
Graph Databases
Example: Neo4j
Use Case: Well suited to highly connected data such as social networks, fraud detection,
and recommendation engines.
Example: nodes such as (Person {name: "Alice"}) linked to (Person {name: "Bob"}) by a
FRIENDS_WITH relationship, queried by traversing relationships rather than joining tables.
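A minimal sketch with the official neo4j Python driver follows; it assumes a local Neo4j instance and credentials, and the node label, relationship type, and names are illustrative.

    # Create two people and a FRIENDS_WITH relationship, then traverse it
    # (requires the neo4j driver and a Neo4j server on bolt://localhost:7687).
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        session.run(
            "MERGE (a:Person {name: $a}) "
            "MERGE (b:Person {name: $b}) "
            "MERGE (a)-[:FRIENDS_WITH]->(b)",
            a="Alice", b="Bob",
        )
        result = session.run(
            "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
            a="Alice",
        )
        print([record["friend"] for record in result])  # ['Bob']

    driver.close()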
Column Stores
Example: Cassandra
Use Case: Excellent for handling large-scale, distributed data across many servers.
Example (one row in a users column family, identified by a row key):
Columns:
  email: [email protected]
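A minimal sketch with the DataStax cassandra-driver follows; it assumes a Cassandra node on localhost and an existing keyspace, and the keyspace, table, and values are illustrative.

    # Create a table, insert one row, and read it back from Cassandra
    # (requires the cassandra-driver package and a node on 127.0.0.1).
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("app")  # keyspace "app" is assumed to already exist

    session.execute(
        "CREATE TABLE IF NOT EXISTS users (user_id text PRIMARY KEY, email text)"
    )
    session.execute(
        "INSERT INTO users (user_id, email) VALUES (%s, %s)",
        ("12345", "user@example.com"),
    )
    row = session.execute(
        "SELECT email FROM users WHERE user_id = %s", ("12345",)
    ).one()
    print(row.email)

    cluster.shutdown()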
Advantages of NoSQL data stores:
1. Scalability: Scale out horizontally across many commodity servers as data volumes grow.
2. Flexibility: Handle various data types and structures without predefined schemas.
3. Performance: Optimized for specific data models and access patterns, often resulting in
faster performance for certain queries.
18. What are the three Vs of Big Data? Give two examples of big data case studies. Indicate
which Vs are satisfied by these case studies.
1. Volume: Refers to the vast amounts of data generated every second. Big data involves
handling terabytes, petabytes, or even exabytes of data.
2. Velocity: The speed at which data is generated and processed. This includes real-time
data streams and the need for quick processing to derive insights.
3. Variety: The different forms the data takes, including structured, semi-structured, and
unstructured data such as text, images, logs, and sensor readings.
1. Netflix
Netflix uses big data to analyze viewer behavior and preferences. By collecting data on
what users watch, how long they watch, and their interactions with the platform, Netflix
can recommend personalized content to each user. This recommendation system is
responsible for over 80% of the content watched on Netflix.
Volume: Netflix handles massive amounts of data from millions of users worldwide.
Variety: Data includes viewing history, ratings, search queries, and interaction logs.
2. Starbucks
Starbucks leverages big data through its loyalty card program and mobile app. By analyzing
purchase data, Starbucks can predict customer preferences and send personalized offers
to increase sales and customer engagement.
Variety: Data includes purchase history, customer preferences, and location data.
These case studies illustrate how big data’s three Vs—Volume, Velocity, and Variety—are
utilized to enhance customer experiences and drive business success.
19. Explain the distributed storage system of Hadoop with the help of a neat diagram.
20. Explain the Shared Disk System and Shared Nothing System with the help of diagrams.
21. Demonstrate how LiveJournal's use of Memcache has successfully enhanced the
scalability, performance, and cost-effectiveness of its platform. Also, illustrate the
key business drivers that led to the adoption of Memcache in LiveJournal's
architecture.
Scalability: By caching frequently requested objects (user profiles, rendered page fragments,
session data) in the spare RAM of its web servers, LiveJournal could serve far more read
traffic without adding database servers, and additional cache nodes could be added as the
user base grew.
Performance: Requests served from memory avoid repeated database queries and disk I/O,
so page response times dropped and the databases were left free to handle the writes and
queries that genuinely needed them.
Cost-Effectiveness: Memcache is open source and ran on the commodity hardware LiveJournal
already owned, so the gains came from pooling existing RAM rather than from buying
expensive, high-end database machines.
The decision to implement Memcache at LiveJournal was driven by several key business
needs:
1. Rising Traffic and User Demand: The continuous increase in the number of users
necessitated a solution that could handle high traffic without degrading performance.
2. Resource Optimization: With RAM being a precious resource, the need to maximize
its usage across web servers became paramount.
3. Database Overload: The existing database infrastructure was becoming
overwhelmed with repetitive queries, leading to slower response times and a potential
bottleneck.
4. Cost Management: As scaling up the database infrastructure would have been
prohibitively expensive, LiveJournal sought a more cost-effective solution to manage
growing demand.
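The pattern that addressed these drivers is cache-aside caching. Below is a minimal sketch with the pymemcache client; it assumes a memcached server on the default port, and the key naming and the stand-in database function are illustrative.

    # Cache-aside pattern: check Memcache first, fall back to the database on a
    # miss, then populate the cache so repeated requests never hit the database.
    # (Requires the pymemcache package and memcached on localhost:11211.)
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))

    def load_profile_from_db(user_id):
        # Stand-in for the expensive SQL query that Memcache is shielding.
        return f"profile data for user {user_id}"

    def fetch_user_profile(user_id):
        key = f"profile:{user_id}"
        cached = cache.get(key)
        if cached is not None:
            return cached.decode()               # cache hit: no database work
        profile = load_profile_from_db(user_id)  # cache miss: query the database
        cache.set(key, profile, expire=300)      # keep it hot for five minutes
        return profile

    print(fetch_user_profile(42))  # first call reaches the "database"
    print(fetch_user_profile(42))  # second call is served from Memcache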