
Big Data IA1 QB Solution

1. Explain CAP theorem. How is CAP different from ACID properties in databases?
• The CAP theorem, originally introduced as the CAP principle, explains some of the
competing requirements in a distributed system with replication.
• It is a tool used to make system designers aware of the trade-offs involved in designing
networked shared-data systems.
• CAP stands for Consistency, Availability, and Partition tolerance.
• Consistency: All nodes hold the same copies of a replicated data item, visible to the
various transactions. Consistency means that every client has the same view of the data.
• Availability: Each read and write request for a data item is either processed
successfully or returns a message that the operation cannot be completed.
• Partition Tolerance: The system can continue operating even if a fault splits the nodes
into two or more partitions, where the nodes in each partition can communicate only
among themselves.
• Both CAP and ACID use the term consistency, but they do not mean the same thing by it.
• In CAP, consistency means that all nodes store and provide the same data, while in
ACID, consistency means that the internal integrity rules of a single database must hold
across the board.

2. What is Big Data? What is Hadoop? How are Big Data and Hadoop linked?
• Big Data refers to the large volumes of data that are complex and grow rapidly, often
characterized by the three Vs: Volume (massive amounts of data), Velocity (rapid
generation and processing), and Variety (different types of data such as structured,
unstructured, and semi-structured). Examples of Big Data include social media data,
sensor data, and transaction records.
• Hadoop is an open-source framework designed to store and process large datasets
efficiently. It consists of several components: HDFS (Hadoop Distributed File
System) for storing data across multiple machines, MapReduce for processing data in
parallel across clusters, YARN (Yet Another Resource Negotiator) for managing
resources and scheduling, and Hadoop Common, which includes common utilities and
libraries. Hadoop is primarily written in Java.
• Big Data and Hadoop are closely linked because Hadoop is specifically designed to
handle Big Data. Hadoop’s HDFS component stores large datasets efficiently, while
MapReduce processes these datasets in parallel, making it possible to manage and
analyze Big Data effectively. Hadoop is also highly scalable, allowing for the addition
of more nodes to the cluster to handle increasing amounts of data. Common use cases
for Hadoop include data warehousing, business intelligence, machine learning, and
data mining.
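
To make the link concrete, here is a minimal word-count sketch written for Hadoop Streaming in Python (the file names mapper.py and reducer.py are illustrative, and the snippet assumes Hadoop Streaming feeds input on standard input and sorts mapper output by key before the reduce phase):

# mapper.py - emit a (word, 1) pair for every word read from standard input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - sum the counts for each word; identical keys arrive on consecutive lines
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

A job like this would normally be launched through the Hadoop Streaming jar with HDFS paths for its input and output directories, and YARN would schedule the mapper and reducer tasks across the cluster.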

3. Differentiate between Traditional Data and Big Data.

4. Why is HDFS more suited for applications having large datasets and not when there
are small files? Elaborate.
Reasons HDFS is Suited for Large Datasets
1. Large Block Size: HDFS uses large block sizes (128 MB or 256 MB), reducing the
overhead of managing metadata.
2. High Throughput: Optimized for high-throughput access, making it ideal for reading
and writing large files sequentially.
3. Fault Tolerance: Data blocks are replicated across multiple nodes, ensuring data
availability even if some nodes fail.
4. Scalability: Easily scales by adding more nodes to the cluster, distributing large
datasets efficiently.

Challenges with Small Files

1. Metadata Overhead: Each small file requires an inode in the NameNode’s memory,
leading to excessive memory usage.
2. Inefficient Storage: Small files do not fully utilize the large block size, resulting in
wasted storage space.
3. High Latency: Accessing many small files incurs high latency due to the overhead of
opening and closing files.
4. Resource Management: Managing numerous small files increases the load on the
NameNode, affecting overall cluster performance.
5. Not Optimized for Random Access: HDFS is designed for sequential access, making
it inefficient for the random access patterns typical of small files.
6. Complexity in Handling Small Files: The overhead of handling many small files can
degrade the performance and efficiency of the HDFS cluster.
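
The metadata cost can be made concrete with a rough back-of-envelope estimate in Python (the ~150 bytes per namespace object is a commonly quoted approximation, not an exact figure):

# Rough NameNode heap estimate for file metadata (illustrative figures only)
BYTES_PER_OBJECT = 150            # approximate cost of one file or block object in NameNode memory
BLOCK_SIZE = 128 * 1024 * 1024    # default HDFS block size (128 MB)

def namenode_memory_bytes(num_files, avg_file_size):
    blocks_per_file = max(1, -(-avg_file_size // BLOCK_SIZE))   # ceiling division
    objects = num_files * (1 + blocks_per_file)                 # one inode plus its blocks
    return objects * BYTES_PER_OBJECT

print(namenode_memory_bytes(1, 1024 ** 3))          # one 1 GB file -> about 1.3 KB of heap
print(namenode_memory_bytes(10_000, 100 * 1024))    # same data as 10,000 small files -> about 3 MB of heap

Scaled to hundreds of millions of small files, this overhead quickly exhausts the NameNode's heap, which is why HDFS favours a smaller number of large files.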

5. Describe any Five characteristics of Big Data Analytics.


1. Volume: Volume means “How much data is generated”. Nowadays, organizations,
people, and systems generate or collect vast amounts of data, from terabytes (TB) to
petabytes (PB) to exabytes (EB) and beyond.
2. Velocity: Velocity means “How fast the data is produced”. Organizations, people, and
systems now generate huge amounts of data at a very fast rate.
3. Variety: Variety means “The different forms of data”. Data is produced in huge
volumes, at a fast rate, and in many different formats: structured, semi-structured, and
unstructured.
4. Veracity: Veracity means “The quality, correctness, or accuracy of the captured data”.
Of the 5 Vs, it is the most important for any Big Data solution, because without correct
data there is no point in storing large amounts of it at high speed and in many formats;
the data should deliver correct business value.
5. Value: After taking the first four Vs into account, there is one more V, which stands for
Value. Bulk data with no value is of no use to the company unless it is turned into
something useful; data in itself has no importance until it is converted into valuable
information.

6. Describe characteristics of Pig and Mahout.


Characteristics of Apache Pig

1. High-Level Abstraction: Provides a high-level scripting language (Pig Latin) for
data analysis, abstracting the complexity of MapReduce.
2. Ease of Use: Easy to learn, read, and write, especially for SQL programmers,
reducing the development effort.
3. Extensibility: Allows users to create their own processes and user-defined functions
(UDFs) in languages like Python and Java.
4. Rich Set of Operators: Offers built-in operators for filtering, joining, sorting, and
aggregation, simplifying data operations.
5. Nested Data Types: Supports complex data types such as tuples, bags, and maps,
enabling more sophisticated data handling.
6. Efficient Code: Reduces the length of code significantly compared to writing in Java
for MapReduce.
7. Prototyping and Ad-Hoc Queries: Useful for exploring large datasets, prototyping
data processing algorithms, and running ad-hoc queries.

Characteristics of Apache Mahout

1. Scalability: Designed to handle large-scale data processing by leveraging Hadoop and
Spark, making it suitable for big data machine learning projects.
2. Versatility: Offers a wide range of machine learning algorithms, including
classification, clustering, recommendation, and pattern mining.
3. Integration: Seamlessly integrates with other Hadoop ecosystem components like
HDFS and HBase, simplifying data storage and retrieval.
4. Distributed Processing: Utilizes Hadoop’s MapReduce and Spark for distributed
data processing, ensuring efficient handling of large datasets.
5. Extensibility: Easily extensible, allowing users to add custom algorithms and
processing steps to meet specific requirements.

7. Describe BASE properties in NOSQL Database.


• NoSQL systems guarantee the BASE properties, which stand for Basically
Available, Soft State, and Eventual Consistency.
1. Basically Available: This means the system guarantees availability of the data,
ensuring that requests will receive a response, even if it’s a failure. This is
achieved through redundancy and replication, which help maintain availability
even during partial system failures
2. Soft State: The state of the system can change over time, even without input from
the user. This is due to the ongoing background processes that update the data.
NoSQL systems handle these changes gracefully, ensuring that the system
remains operational, and data is not corrupted
3. Eventual Consistency: Instead of requiring immediate consistency like
traditional ACID-compliant databases, NoSQL systems ensure that data will
eventually become consistent. This means that after some time, all updates will
propagate through the system, and all nodes will have the same data.
• These properties allow NoSQL databases to be highly scalable and available,
making them suitable for large-scale applications like social media platforms and
online shopping websites.
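
As an illustration only (not the implementation of any particular NoSQL product), the toy Python sketch below shows the idea behind these properties: a write is acknowledged after updating a single replica, the replicas temporarily disagree (soft state), and a background process later propagates the update until every replica converges:

# Toy model of eventual consistency across three replicas
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]
pending = []                            # updates waiting for background propagation

def write(key, value):
    replicas[0][key] = value            # acknowledged immediately ("basically available")
    pending.append((key, value))

def anti_entropy():
    # background process that pushes pending updates to every replica
    while pending:
        key, value = pending.pop(0)
        for replica in replicas:
            replica[key] = value

write("x", 2)
print(replicas)      # [{'x': 2}, {'x': 1}, {'x': 1}] -> temporarily inconsistent (soft state)
anti_entropy()
print(replicas)      # [{'x': 2}, {'x': 2}, {'x': 2}] -> eventually consistent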

8. Explain the Hadoop ecosystem with its core components. Explain the physical architecture
of Hadoop. State its limitations.

Core Components of the Hadoop Ecosystem


HDFS (Hadoop Distributed File System)

• Purpose: HDFS is designed to store large datasets reliably and to stream those
datasets at high bandwidth to user applications.
• Structure: It consists of two main components:
o NameNode: Manages the metadata (data about data) and keeps track of which
blocks are stored on which DataNodes.
o DataNode: Stores the actual data. Data is split into blocks and distributed
across multiple DataNodes.
• Fault Tolerance: Data is replicated across multiple DataNodes to ensure fault
tolerance and high availability.

YARN (Yet Another Resource Negotiator)

• Purpose: YARN is the resource management layer of Hadoop, responsible for
managing and scheduling resources across the cluster.
• Components:
o Resource Manager: Allocates resources to various applications running in the
cluster.
o Node Manager: Manages resources on a single node and reports to the
Resource Manager.
o Application Master: Acts as an interface between the Resource Manager and the
Node Managers, negotiating resources for each application and monitoring its tasks.
• Functionality: YARN allows multiple data processing engines to run and share
resources, improving the utilization and efficiency of the cluster.

MapReduce

• Purpose: MapReduce is a programming model used for processing large datasets in a
distributed and parallel manner.
• Process:
o Map Function: Takes input data and converts it into a set of key-value pairs.
It performs sorting and filtering of data.
o Reduce Function: Takes the output from the Map function and aggregates the
data, producing the final result.
• Execution: The MapReduce framework handles the distribution of tasks, manages
data transfer between nodes, and ensures fault tolerance.
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node
runs the Job Tracker and Name Node, whereas each slave node runs a Data Node and
Task Tracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the distributed file system of Hadoop. It
follows a master/slave architecture, in which a single NameNode performs the role of master
and multiple DataNodes perform the role of slaves.

Both the NameNode and DataNodes can run on commodity machines. HDFS is developed in
Java, so any machine that supports Java can run the NameNode and DataNode software.

NameNode

o It is the single master server in the HDFS cluster.
o Because it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening,
renaming, and closing files.
o It simplifies the architecture of the system.

DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode holds multiple data blocks.
o These data blocks are used to store the data.
o The DataNode serves read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.

Job Tracker

o The role of the Job Tracker is to accept MapReduce jobs from clients and process the
data with the help of the NameNode.
o In response, the NameNode provides the required metadata to the Job Tracker.

Task Tracker

o It works as a slave node for the Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the data.
This process can also be called a Mapper.

MapReduce Layer

MapReduce processing begins when the client application submits a MapReduce job to the
Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes a Task Tracker fails or times out; in such a case, that part of the job is rescheduled.

Limitations of Hadoop:

• Complexity: Setting up, managing, and optimizing Hadoop requires specialized
knowledge, making it challenging for non-experts.
• Real-Time Processing: Hadoop is designed for batch processing and struggles with
real time data processing tasks.
• Small File Handling: Hadoop is inefficient at managing a large number of small
files, leading to performance issues and increased overhead.
• High Latency: Due to its batch processing nature, Hadoop often exhibits higher
latency, which can be problematic for time-sensitive applications

9. Explain how the goals of Hadoop are covered in the Hadoop Distributed File System.

HDFS aligns with the main goals of Hadoop:

Scalability: HDFS is designed to handle large amounts of data by distributing it
across multiple nodes in a cluster.

Fault Tolerance:
• Replication: HDFS automatically replicates each block of data across
multiple nodes.
• Heartbeats and block reports: HDFS continuously monitors the health of
DataNodes.

High Throughput: HDFS moves the computation closer to where the data is stored,
minimizing data movement across the network and reducing bottlenecks. HDFS is
optimized for batch processing rather than random reads and writes.

Reliability: HDFS separates metadata management from actual data storage. For data
integrity, HDFS checksums data blocks and verifies the integrity of the data during
storage and retrieval.
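
As a rough illustration of the data-integrity point (HDFS actually maintains its own per-block checksum metadata rather than the code below), a block's checksum can be computed when it is stored and re-verified when it is read back:

import zlib

def store_block(data):
    # keep a CRC32 checksum alongside the block (HDFS stores checksums in separate metadata)
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(block):
    # verify integrity on retrieval; on a mismatch HDFS would read a healthy replica instead
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("block corrupted - fetch another replica")
    return block["data"]

block = store_block(b"contents of one data block")
print(read_block(block))                # passes the integrity check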

10. Differentiate between RDBMS and a NOSQL Database. Illustrate with example.

RDBMS Example: MySQL

• Use Case: Banking system where data integrity and complex transactions are crucial.
• Structure: Tables for customers, accounts, transactions, etc., with relationships
defined by foreign keys.

NoSQL Example: MongoDB

• Use Case: Social media platform where data is unstructured and rapidly changing.
• Structure: Collections of documents, each document being a JSON-like object.
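
A minimal sketch of the contrast, assuming an in-memory SQLite database on the relational side and a MongoDB server reachable through pymongo on the NoSQL side (the connection string, database, and collection names are placeholders):

import sqlite3
from pymongo import MongoClient

# RDBMS: a fixed schema is declared up front and every row must conform to it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'John Doe', 'john@example.com')")

# NoSQL document store: each document is a flexible JSON-like object, no schema required
client = MongoClient("mongodb://localhost:27017")        # placeholder connection string
posts = client["social_app"]["posts"]
posts.insert_one({"user": "John Doe",
                  "text": "Hello world",
                  "tags": ["intro", "first-post"],        # nested and array fields are allowed
                  "likes": 0})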
12 Demonstrate how business problems have been successfully solved faster,
cheaper, and more effectively considering NoSQL and Google’s MapReduce case study.
Also illustrate the business drivers in it.

Solving Business Problems Faster

Google’s MapReduce framework, coupled with NoSQL databases, allowed for the parallel
processing of massive datasets across distributed clusters of servers. Traditionally, processing
such vast amounts of data with relational databases (RDBMS) would have been time-
consuming and inefficient. However, with MapReduce, Google was able to break down large
computational tasks into smaller, manageable chunks that could be processed simultaneously
across multiple nodes. This approach dramatically accelerated data processing times,
enabling Google to solve complex data problems, such as indexing the entire web, much
faster than before.

Solving Business Problems Cheaper

The use of NoSQL databases and the MapReduce framework made it possible to handle
enormous volumes of unstructured data without the need for expensive, high-end hardware.
Instead, Google leveraged clusters of commodity servers, which were far less costly than
traditional enterprise-grade systems. By distributing the workload across many inexpensive
servers, Google reduced the need for costly hardware investments. Additionally, the
scalability of the MapReduce framework meant that as data volumes grew, Google could
simply add more servers to the cluster rather than investing in entirely new infrastructure,
keeping costs down.

Solving Business Problems More Effectively

MapReduce and NoSQL databases allowed Google to handle a diverse range of data types,
including structured, semi-structured, and unstructured data, more effectively than traditional
relational databases. This flexibility enabled Google to store and process data in a way that
was most appropriate for the task at hand, improving the accuracy and efficiency of their data
processing operations. For example, in web search indexing, handling the vast and varied
types of data from the internet required a system that could process different formats quickly
and accurately. MapReduce provided this capability, allowing Google to deliver more
relevant search results to users.

Business Drivers for Adopting MapReduce and NoSQL

1. Scalability: The need to efficiently manage and process rapidly growing amounts of
data, particularly with the rise of the internet and web search demands.
2. Cost Efficiency: The desire to reduce operational costs by using commodity hardware
and avoiding the high costs associated with traditional RDBMS and high-end servers.
3. Data Variety: The increasing need to process and store diverse data types, beyond
what traditional relational databases could handle effectively.
4. Performance: The need for faster data processing to support real-time and near-real-
time applications, such as web search and ad targeting.

By adopting the MapReduce framework and NoSQL databases, Google was able to solve
critical business problems faster, cheaper, and more effectively, ensuring they could maintain
their competitive edge in the rapidly evolving digital landscape.

13 List the uses of Big Data.

Business Analytics: Analyzes customer behavior to improve products and services.


Healthcare: Enables personalized medicine and predictive healthcare analytics.
Finance and Banking: Detects fraud and manages financial risk through data analysis.
Retail: Offers personalized recommendations and optimizes inventory management.
Supply Chain and Logistics: Predicts demand and optimizes transportation routes.
Telecommunications: Optimizes network traffic and predicts customer churn.
Manufacturing: Anticipates equipment failures with predictive maintenance.
Energy and Utilities: Manages smart grids and predicts energy production from renewables.
Government and Public Sector: Enhances public safety and disaster response through predictive
analytics.
Education: Personalizes learning experiences and monitors student performance.
14 What is Big Data? Give the types of Big Data.
15 What are the advantages and limitations of Hadoop?

ADVANTAGES

• Scalability: Hadoop can easily scale horizontally by adding more nodes to the cluster,
allowing it to handle vast amounts of data.

• Cost-Effective: It uses commodity hardware, making it a cost-effective solution for storing
and processing large datasets.

• Fault Tolerance: Hadoop automatically replicates data across multiple nodes, ensuring
data availability even if some nodes fail.

• Flexibility: Hadoop can process a wide variety of data types, including structured, semi-
structured, and unstructured data, from multiple sources.
LIMITATIONS

• Complexity: Setting up, managing, and optimizing Hadoop requires specialized
knowledge, making it challenging for non-experts.

• Real-Time Processing: Hadoop is designed for batch processing and struggles with real-
time data processing tasks.

• Small File Handling: Hadoop is inefficient at managing a large number of small files,
leading to performance issues and increased overhead.

• High Latency: Due to its batch processing nature, Hadoop often exhibits higher latency,
which can be problematic for time-sensitive applications.

16 Explain CAP theorem and explain how NoSQL systems guarantee the BASE properties.
NoSQL systems guarantee the BASE properties, which stand for Basically Available, Soft State, and
Eventual Consistency. Here’s a breakdown of each property:
1. Basically Available: This means the system guarantees availability of the data, ensuring that
requests will receive a response, even if it’s a failure. This is achieved through redundancy
and replication, which help maintain availability even during partial system failures
2. Soft State: The state of the system can change over time, even without input from the user.
This is due to the ongoing background processes that update the data. NoSQL systems handle
these changes gracefully, ensuring that the system remains operational and data is not
corrupted
3. Eventual Consistency: Instead of requiring immediate consistency like traditional ACID-
compliant databases, NoSQL systems ensure that data will eventually become consistent. This
means that after some time, all updates will propagate through the system, and all nodes will
have the same data.
These properties allow NoSQL databases to be highly scalable and available, making them suitable
for large-scale applications like social media platforms and online shopping websites.
17 Write a short note on: NoSQL data stores with example.

NoSQL stands for “Not Only SQL” and refers to a variety of database technologies that were
developed to address the limitations of traditional relational databases. NoSQL databases are
particularly useful for handling big data and real-time web applications.

Document Databases

Example: MongoDB

• Structure: Stores data in documents similar to JSON objects.

• Use Case: Ideal for content management systems, real-time analytics, and applications
requiring flexible schemas.

• Example Document:

{
  "_id": "12345",
  "name": "John Doe",
  "email": "[email protected]",
  "address": {
    "street": "123 Main St",
    "city": "Anytown",
    "state": "CA",
    "zip": "12345"
  },
  "hobbies": ["reading", "gaming", "hiking"]
}

Key-Value Stores

Example: Redis

• Structure: Stores data as key-value pairs.

• Use Case: Suitable for caching, session management, and real-time analytics.

• Example:

"user:1000" -> {"name": "John Doe", "email": "[email protected]"}

Graph Databases

Example: Neo4j

• Structure: Stores data in nodes, edges, and properties.

• Use Case: Perfect for applications involving complex relationships, like social networks,
recommendation engines, and fraud detection.

• Example:

(User: John Doe)-[FRIEND]->(User: Jane Smith)

Column Stores

Example: Cassandra

• Structure: Stores data in tables with rows and dynamic columns.

• Use Case: Excellent for handling large-scale, distributed data across many servers.

• Example:

Row Key: user123

Columns:

name: John Doe

email: [email protected]

address: 123 Main St, Anytown, CA, 12345
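
For completeness, a sketch of inserting and reading the same row with the DataStax Python driver (the contact point, keyspace, and table are assumptions, and the users table would have to be created beforehand):

from cassandra.cluster import Cluster            # DataStax Python driver for Cassandra

cluster = Cluster(["127.0.0.1"])                 # assumed contact point
session = cluster.connect("demo_keyspace")       # assumed keyspace

# insert one row; in a wide-column model different rows may carry different columns
session.execute(
    "INSERT INTO users (user_id, name, email, address) VALUES (%s, %s, %s, %s)",
    ("user123", "John Doe", "john@example.com", "123 Main St, Anytown, CA, 12345"),
)

row = session.execute("SELECT name FROM users WHERE user_id = %s", ("user123",)).one()
print(row.name)                                  # John Doe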

Advantages of NoSQL Databases

1. Scalability: Easily scale out by adding more servers.

2. Flexibility: Handle various data types and structures without predefined schemas.

3. Performance: Optimized for specific data models and access patterns, often resulting in
faster performance for certain queries.

4. Availability: Designed to ensure high availability and fault tolerance.

18 What are the three Vs of Big Data? Give two examples of big data case studies. Indicate
which Vs are satisfied by these case studies.

The Three Vs of Big Data

1. Volume: Refers to the vast amounts of data generated every second. Big data involves
handling terabytes, petabytes, or even exabytes of data.

2. Velocity: The speed at which data is generated and processed. This includes real-time
data streams and the need for quick processing to derive insights.

3. Variety: The different types of data, including structured, semi-structured, and
unstructured data from various sources like social media, sensors, and logs.
Example Case Studies

1. Netflix

Netflix uses big data to analyze viewer behavior and preferences. By collecting data on
what users watch, how long they watch, and their interactions with the platform, Netflix
can recommend personalized content to each user. This recommendation system is
responsible for over 80% of the content watched on Netflix.

• Volume: Netflix handles massive amounts of data from millions of users worldwide.

• Velocity: Data is processed in real-time to provide instant recommendations.

• Variety: Data includes viewing history, ratings, search queries, and interaction logs.

2. Starbucks

Starbucks leverages big data through its loyalty card program and mobile app. By analyzing
purchase data, Starbucks can predict customer preferences and send personalized offers
to increase sales and customer engagement.

• Volume: Starbucks processes data from millions of transactions weekly.

• Velocity: Data is analyzed quickly to send timely offers and recommendations.

• Variety: Data includes purchase history, customer preferences, and location data.

These case studies illustrate how big data’s three Vs—Volume, Velocity, and Variety—are
utilized to enhance customer experiences and drive business success.
19 Explain distributed storage system of Hadoop with the help of neat diagram.
20 Explain Shared Disk System and Shared Nothing System with the help of diagram.
21 Demonstrate how LiveJournal's use of Memcache has successfully enhanced the
scalability, performance, and cost-effectiveness of its platform. Also, illustrate the
key business drivers that led to the adoption of Memcache in LiveJournal's
architecture.

Scalability

LiveJournal enhanced its platform's scalability by deploying Memcache across multiple
servers. This distributed caching system allowed the platform to handle an ever-growing user
base by spreading the load evenly across servers. Each server cached a portion of the most
frequently used database query results, reducing the demand on the central database and
enabling horizontal scaling. As traffic increased, LiveJournal could simply add more servers
to the Memcache cluster, seamlessly scaling its capacity to handle more users and data
without overloading individual servers.

Performance

Memcache significantly improved LiveJournal's performance by reducing the need for
repeated, time-consuming database queries. By caching frequently accessed data in RAM, the
platform was able to serve content much faster. When a user request was made, the system
first checked Memcache. If the requested data was in the cache (a cache hit), it was returned
almost instantly, bypassing the need to query the database. This approach reduced latency
and improved response times, providing a smoother and faster user experience even during
peak traffic periods.
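
The read path described above is the classic cache-aside pattern; the sketch below uses a plain Python dictionary to stand in for the Memcache cluster and a stub for the database call (both are illustrative stand-ins, not LiveJournal's actual code):

cache = {}                              # stands in for a Memcache cluster keyed by query

def query_database(sql):
    # stub for the slow, expensive database operation
    return "rows for: " + sql

def get(sql):
    if sql in cache:                    # cache hit: served from RAM, database untouched
        return cache[sql]
    result = query_database(sql)        # cache miss: fall through to the database ...
    cache[sql] = result                 # ... and populate the cache for the next request
    return result

get("SELECT * FROM posts WHERE user_id = 42")   # miss -> hits the database
get("SELECT * FROM posts WHERE user_id = 42")   # hit  -> returned almost instantly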

Cost-Effectiveness

Implementing Memcache made LiveJournal’s platform more cost-effective by minimizing
the strain on its database servers. With Memcache handling a large portion of the read
requests, the frequency of expensive database operations was reduced. This decreased the
need for additional, costly database infrastructure and hardware. Furthermore, by avoiding
constant database scaling, LiveJournal was able to control operational costs while still
accommodating increased user traffic and data demands.

In summary, by integrating Memcache, LiveJournal successfully enhanced the scalability,
performance, and cost-effectiveness of its platform, ensuring it could continue to grow and
serve its expanding user base efficiently.

Key Business Drivers for Adopting Memcache

The decision to implement Memcache at LiveJournal was driven by several key business
needs:

1. Rising Traffic and User Demand: The continuous increase in the number of users
necessitated a solution that could handle high traffic without degrading performance.
2. Resource Optimization: With RAM being a precious resource, the need to maximize
its usage across web servers became paramount.
3. Database Overload: The existing database infrastructure was becoming
overwhelmed with repetitive queries, leading to slower response times and a potential
bottleneck.
4. Cost Management: As scaling up the database infrastructure would have been
prohibitively expensive, LiveJournal sought a more cost-effective solution to manage
growing demand.

By adopting Memcache, LiveJournal successfully addressed these challenges, ensuring that
their platform remained scalable, performant, and cost-efficient as it continued to grow.
