q and a44_1
The "5 V's of Big Data" is a concept that highlights the key characteristics of
big data, helping to understand its nature and challenges. These five V's are:
These five V's provide a framework for understanding the complexity and
challenges associated with big data and guide organizations in developing
strategies for managing, analyzing, and leveraging data effectively.
Explain the 5 Vs of big data, relating them to real-world use cases. (Write for 10 marks.)
The "5 Vs of Big Data" - Volume, Velocity, Variety, Veracity, and Value -
provide a framework for understanding the characteristics and challenges
associated with big data. Here's how these Vs relate to real-world use cases:
1. Structured Data:
Definition: Structured data is highly organized and follows a fixed schema, with preformatted data typically stored in tabular formats with rows and columns. Each data element has a clear and defined meaning.
Examples:
Transactional Data: Records of sales transactions,
invoices, payments, and orders in databases.
Employee Information: HR databases containing
employee details such as name, ID, department, and
salary.
Financial Data: Stock market data, accounting records,
balance sheets, and income statements.
Sources: Structured data primarily originates from operational
systems, transactional databases, and enterprise applications.
Common sources include:
Relational Databases: Systems like MySQL, PostgreSQL,
Oracle, and Microsoft SQL Server store structured data in
tables with predefined schemas.
Data Warehouses: Central repositories that consolidate
structured data from multiple sources for reporting and
analysis.
Enterprise Applications: ERP systems, CRM systems,
and financial software generate structured data as part of
routine business operations.
2. Unstructured Data:
Definition: Unstructured data lacks a predefined structure and
does not fit neatly into rows and columns. It often comprises
text, images, audio, and video files.
Examples:
Textual Content: Documents, emails, social media posts,
blogs, articles, and web pages.
Multimedia Content: Images, videos, audio recordings,
and presentations.
Sensor Data: Raw sensor readings from IoT devices,
telemetry data, and satellite imagery.
Sources: Unstructured data is generated from diverse sources
across various domains. Some common sources include:
Social Media Platforms: Facebook, Twitter, Instagram,
and LinkedIn generate vast amounts of unstructured
textual and multimedia content.
File Systems: Documents, images, and multimedia files
stored on local drives, network shares, and cloud storage
platforms.
Surveillance Systems: Video feeds from security
cameras and CCTV systems capture unstructured data for
surveillance and monitoring purposes.
3. Semi-Structured Data:
Definition: Semi-structured data exhibits some level of
organization but does not adhere to a rigid schema like
structured data. It may have a flexible structure with nested or
repeating elements.
Examples:
JSON (JavaScript Object Notation): Data interchange
format commonly used for web APIs, configuration files,
and NoSQL databases.
XML (eXtensible Markup Language): Document format
for representing hierarchical data with nested elements
and attributes.
Log Files: Records of system events, application logs, and
web server logs that contain structured fields but variable
formats.
Sources: Semi-structured data arises from various sources
where flexibility in data representation is necessary. Key sources
include:
Web APIs: Many web services expose data in JSON or XML
format through APIs for integration with other systems.
NoSQL Databases: Document-oriented databases like
MongoDB and Couchbase store semi-structured data in
JSON or BSON format.
Logging Systems: Applications, servers, and network
devices generate log files in text-based formats like JSON,
XML, or plain text with structured fields.
Tools: MySQL, PostgreSQL, MongoDB, Cassandra
Differences between Structured, Semi-structured and Unstructured data:

| Properties | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data |
| Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency |
| Flexibility | Schema-dependent, less flexible | More flexible than structured data but less flexible than unstructured data | Very flexible; absence of schema |
| Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | It is more scalable |
1. Problem:
Data Overload: Managing and analyzing large volumes of data
can be overwhelming for organizations, leading to challenges in
data storage, processing, and analysis.
Data Quality: Big data often includes diverse data sources with
varying levels of quality, accuracy, and reliability. Poor data
quality can lead to incorrect insights and decisions.
Privacy and Security Concerns: Collecting and storing vast
amounts of data can raise concerns about privacy, security, and
compliance with data protection regulations. Mishandling
sensitive data can lead to legal and reputational risks.
Costs and Infrastructure: Building and maintaining the
infrastructure required for big data analytics, including storage,
processing, and analytics tools, can be expensive and resource-
intensive.
2. Opportunity:
Insights and Innovation: Big data provides organizations with
access to vast amounts of valuable information that can be
analyzed to uncover insights, trends, and patterns. These
insights can drive innovation, improve decision-making, and
create new business opportunities.
Competitive Advantage: Organizations that effectively
harness big data analytics can gain a competitive edge by better
understanding customer needs, optimizing operations, and
anticipating market trends.
Personalization and Customer Experience: Big data
analytics enables organizations to personalize products, services,
and experiences based on individual customer preferences and
behavior, leading to improved customer satisfaction and loyalty.
Predictive Analytics: By analyzing historical data and
identifying predictive patterns, organizations can anticipate
future events, trends, and behaviors, enabling proactive
decision-making and risk mitigation.
Social and Economic Impact: Big data has the potential to
address societal challenges in areas such as healthcare,
education, transportation, and environmental sustainability. By
leveraging big data analytics, organizations and governments
can drive positive social and economic outcomes.
Time in DS
- Each machine in a distributed system has its own clock providing the
physical time.
- A distributed system does not have a global physical time.
- Time synchronization is essential to know at what time of day a particular
event occurred at a particular computer within a system.
Physical Clock
Cristian’s Method
- It makes use of a time server to get the current time and helps in
synchronization of computer externally.
- Upon request, the server process S provides the time according to its clock
to the requesting process p.
- This method achieves synchronization only if the round-trip times between client and time server are sufficiently short compared to the required accuracy.
Algorithm:
- A process p requests time in a message mr and receives time value t in a
message mt. Process p records total round trip time T(round) taken to send
request mr and receive reply mt.
- Assuming the elapsed time is split equally before and after S placed t in mt, the time estimate to which p should set its clock is t + T(round)/2.
- Assuming min is the earliest point at which S could have placed the time in mt after p dispatched mr, then:
a) The time by S's clock when the reply arrives is in the range [t + min, t + T(round) - min]
b) The width of this range is T(round) - 2*min
c) The accuracy is ±(T(round)/2 - min)
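A minimal Python sketch of this estimate, assuming a hypothetical request_server_time() callable that stands in for the RPC to the time server S and returns t:

```python
import time

def cristian_sync(request_server_time, min_delay=0.0):
    """Estimate the value p should set its clock to, using Cristian's method."""
    start = time.monotonic()             # p dispatches request m_r
    t = request_server_time()            # S's clock value carried back in m_t
    t_round = time.monotonic() - start   # round-trip time T(round)

    estimate = t + t_round / 2           # t + T(round)/2
    accuracy = t_round / 2 - min_delay   # +/- (T(round)/2 - min)
    return estimate, accuracy
```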
Discussion:
- If a time server fails, the synchronization is impossible.
- To remove this drawback, time should be provided by a group of
synchronized time servers.
Berkeley's Algorithm
- In Berkeley's algorithm, a coordinator (master) periodically polls the clocks of the other machines, averages the readings (discarding outliers), and sends each machine the adjustment it should apply to its clock; it is used for internal synchronization when no external time source is available.
Network Time Protocol (NTP)
Design Aims:
- Adjust system clock close to UTC over Internet.
- Handle bad connectivity
- Enable frequent resynchronization
- Security
Hierarchical Structure of NTP:
Logical Clock
- Logical clock is a virtual clock that records the relative ordering of events in
a process.
- It is a monotonically increasing software counter.
- It is realized whenever relative ordering of events is more important than
the physical time.
- Physical clocks do not need to be synchronized.
- The value of the logical clock is used to assign timestamps to events.
- According to Lamport (the happened-before relation -->):
1. If a and b are events in the same process pi and a occurs before b, then a --> b.
2. For any message m, send(m) --> receive(m).
3. If a, b and c are events such that a --> b and b --> c, then a --> c (transitivity).
4. If a --> b, event a causally affects event b.
5. If neither a --> e nor e --> a holds, then a and e are concurrent events, written as a || e.
Implementation Rules:
1. CPi is incremented before each event is issued at process Pi:
CPi := CPi + 1
2. a) When send(m) is an event of process Pi, the timestamp tm = CPi(a) is included in m.
b) On receiving message m, process Pj updates its clock CPj as:
CPj := max [ CPj, tm ]
c) The new value of CPj is used to timestamp the event receive(m) at Pj.
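A minimal Python sketch of these rules; the "message" here is just an in-memory (tm, payload) tuple rather than a real network message:

```python
class LamportClock:
    def __init__(self):
        self.time = 0  # CPi: monotonically increasing software counter

    def local_event(self):
        self.time += 1                 # rule 1: increment before the event
        return self.time

    def send(self, payload):
        self.time += 1                 # sending is itself an event
        return (self.time, payload)    # piggyback timestamp tm on m

    def receive(self, message):
        tm, payload = message
        self.time = max(self.time, tm) + 1  # rule 2b, then count receive(m)
        return self.time, payload
```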
Problems:
1. Lamport's logical clocks impose only a partial order on the set of events; pairs of distinct events at different processes can have identical timestamps.
2. A total ordering can be enforced by using a global logical timestamp (e.g. pairing the clock value with the process identifier to break ties).
Vector Clock
- A vector clock is a clock that gives the ability to decide whether two events are causally related or not by looking at their timestamps.
- A vector clock for a system of N processes is an array of N integers.
- Each process keeps its own vector clock Vi, which it uses to timestamp local events.
- The disadvantage is that it requires storage and message payload proportional to the number of processes.
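A minimal Python sketch of a vector clock for a system of N processes, with a helper that checks whether one timestamp causally precedes another:

```python
class VectorClock:
    def __init__(self, n, pid):
        self.v = [0] * n   # one integer per process
        self.pid = pid     # index of the local process

    def local_event(self):
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        self.v[self.pid] += 1
        return list(self.v)   # timestamp attached to the outgoing message

    def receive(self, other):
        # element-wise maximum with the incoming timestamp, then count the receive event
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.pid] += 1

def happened_before(a, b):
    """a --> b iff a <= b element-wise and a != b; otherwise they may be concurrent."""
    return all(x <= y for x, y in zip(a, b)) and a != b
```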
Chapter 2
In Google File System (GFS), the master node plays a crucial role in
coordinating and managing the file system operations, including read and
write processes. Its main roles during read and write processes are as
follows:
The use of large and fixed-size chunks in the Google File System (GFS) offers several advantages and disadvantages:
Merits:
Demerits:
Why was the Google File System created? What are the assumptions made by GFS?
The Google File System (GFS) was created to address the unique storage
challenges faced by Google's massive-scale distributed computing
infrastructure. Some of the primary reasons for its creation include:
1. Master (Controller):
The master node is responsible for coordinating and managing
the activities of the system.
It typically maintains the global state of the system, handles
resource allocation, and distributes tasks to the slave nodes.
The master node often serves as a central point of
communication, receiving requests from clients or users and
dispatching them to the appropriate slaves.
In fault-tolerant systems, there may be multiple master nodes
operating in a redundant or leader-follower configuration to
ensure high availability and reliability.
2. Slave (Worker):
The slave nodes are responsible for executing tasks assigned by
the master node.
They perform computational or processing tasks, such as data
processing, calculations, or data storage operations.
Slaves typically execute tasks in parallel, enabling distributed
and concurrent processing, which can improve overall system
performance and scalability.
Slaves may report task status and results back to the master
node, allowing the master to monitor progress and manage
workload distribution dynamically.
1. Client:
The client is a device, application, or system that initiates
requests for services or resources from servers.
Clients are typically end-user devices such as computers,
smartphones, tablets, or IoT devices, as well as software
applications that interact with servers over a network.
Clients may include user interfaces or user-facing components
that allow users to interact with the system and initiate requests.
2. Server:
The server is a device, application, or system that provides
services, resources, or data in response to client requests.
Servers are typically more powerful and have more
computational resources than clients, enabling them to handle
multiple client requests simultaneously.
Servers may be dedicated physical machines or virtualized
instances running on cloud infrastructure, providing various
services such as web hosting, database management, file
storage, application processing, or network services.
In Google File System (GFS), the master node plays a crucial role in
implementing garbage collection and detecting stale replicas to maintain the
integrity and efficiency of the file system. Here's how the master
accomplishes these tasks:
1. Garbage Collection:
Garbage collection in GFS involves reclaiming storage space occupied
by stale or obsolete data chunks. The master implements garbage
collection through the following steps:
a. Identifying Stale Chunks: The master periodically scans the
namespace metadata to identify stale or orphaned data chunks. Stale
chunks are those that are no longer referenced by any file or have
become obsolete due to file deletions or modifications.
b. Marking Stale Chunks: Once identified, the master marks stale
chunks as garbage, indicating that they can be safely reclaimed. The
master maintains a list of garbage chunks in its metadata.
c. Notifying Chunk Servers: After marking chunks as garbage, the
master sends notifications to the respective chunk servers, instructing
them to delete the garbage chunks and reclaim the associated storage
space.
d. Data Reclamation: Chunk servers follow the master's instructions
to delete the garbage chunks and free up storage space for future use.
e. Verification and Confirmation: Once garbage collection is
completed, the master verifies that the garbage chunks have been
successfully deleted by communicating with the chunk servers. If any
inconsistencies are detected, the master takes appropriate corrective
actions.
2. Stale Replica Detection:
Stale replica detection is essential for maintaining data consistency
and ensuring fault tolerance in GFS. The master node continuously
monitors the health and status of data replicas to detect stale or
outdated replicas. Here's how it works:
a. Heartbeat Mechanism: Chunk servers regularly send heartbeat
signals to the master to indicate their availability and status. The
master monitors these heartbeats to track the liveness of chunk
servers.
b. Replica Status Tracking: The master maintains metadata about
the location and status of data replicas across chunk servers. It keeps
track of the version numbers and timestamps associated with each
replica.
c. Comparing Replica Versions: When a client requests data from
GFS, the master determines the appropriate replicas to serve the
request. Before serving the data, the master compares the version
numbers of the replicas to ensure consistency.
d. Detecting Stale Replicas: If the master detects a replica with an
outdated version or timestamp, it marks the replica as stale. Stale
replicas may occur due to chunk server failures, network partitions, or
data corruption.
e. Replica Reclamation: Once a replica is marked as stale, the
master initiates replica reclamation by instructing the affected chunk
servers to delete the stale replicas and initiate replica re-replication to
maintain the desired replication factor.
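A toy Python sketch of the version-number comparison described above; the metadata fields and structures here are illustrative stand-ins, not GFS's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class ReplicaInfo:
    chunk_handle: str
    server: str
    version: int              # version number reported by the chunk server

def find_stale_replicas(current_versions, replicas):
    """Return replicas whose version lags the master's latest known version.

    current_versions: dict mapping chunk_handle -> latest version at the master
    replicas: ReplicaInfo records reported by chunk servers (e.g. in heartbeats)
    """
    stale = []
    for r in replicas:
        if r.version < current_versions.get(r.chunk_handle, r.version):
            stale.append(r)   # candidate for deletion and re-replication
    return stale
```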
1. Namespace Information:
The master maintains a hierarchical namespace that organizes
files and directories within the file system. Each file and directory
is represented by a unique identifier (inode) and associated
metadata.
For each file or directory, the master stores metadata such as
the file/directory name, inode number, permissions, creation
time, and last modified time.
The namespace metadata allows the master to efficiently handle
file operations such as file creation, deletion, renaming, and
directory traversal.
2. File-to-Chunk Mapping:
GFS breaks large files into fixed-size data chunks (typically 64
MB in size). The master maintains mappings that associate each
file with its constituent data chunks.
For each file, the master stores information about the sequence
of data chunks that comprise the file, along with their respective
chunk handles (unique identifiers).
The file-to-chunk mapping allows the master to determine the
location of data chunks across chunk servers and coordinate
data access and retrieval operations.
3. Chunk Locations:
GFS replicates data chunks across multiple chunk servers for
fault tolerance and data reliability. The master maintains
information about the locations of data chunks and their replicas.
For each data chunk, the master stores metadata indicating the
chunk handle, the locations of replica copies, and other relevant
details such as version numbers and timestamps.
The chunk location metadata enables the master to ensure that
data chunks are appropriately replicated and distributed across
the cluster, facilitating efficient data access and fault tolerance.
4. Lease Information:
GFS employs leases to manage concurrent access to data and
ensure consistency. The master maintains lease information for
files and data chunks to track which clients hold leases for
reading or writing data.
Lease information includes details such as lease holder, lease
expiration time, and lease duration.
By managing leases, the master coordinates access to data and
enforces consistency semantics, such as ensuring that only one
client holds a write lease at a time for a given chunk.
5. Heartbeat and Health Information:
The master receives periodic heartbeat signals from chunk
servers, indicating their availability and status. The master stores
metadata about chunk server health, including heartbeat
timestamps and server statuses.
Heartbeat and health information allow the master to monitor
the liveness of chunk servers, detect failures, and take
appropriate actions such as reallocating data or initiating replica
re-replication.
Chapter 3
Student Name College Name Final Marks
Ram ABC 70
Sita ABC 80
Hari ABC 60
Gita XYZ 90
Rita AYZ 80
Shyam PQR 90
Laxmi PQR 70
Gopal PQR 60
The task is to find the maximum marks of each college using MapReduce.
To achieve the task of finding the maximum marks of each college using
MapReduce, we need to design the map and reduce functions appropriately.
Here's how we can structure them:
Map Function:
Input: (Student Name, College Name, Final Marks)
Output: (College Name, Final Marks)
Reduce Function:
Input: (College Name, [List of Final Marks])
Output: (College Name, Maximum Marks)
Let's list the input to/output from both the map and reduce functions:
Map Function:
Input:
(Ram, ABC, 70)
(Sita, ABC, 80)
(Hari, ABC, 60)
(Gita, XYZ, 90)
(Rita, AYZ, 80)
(Shyam, PQR, 90)
(Laxmi, PQR, 70)
(Gopal, PQR, 60)
Output:
(ABC, 70)
(ABC, 80)
(ABC, 60)
(XYZ, 90)
(AYZ, 80)
(PQR, 90)
(PQR, 70)
(PQR, 60)
Reduce Function:
Input:
(ABC, [70, 80, 60])
(XYZ, [90])
(AYZ, [80])
(PQR, [90, 70, 60])
Output:
(ABC, 80)
(XYZ, 90)
(AYZ, 80)
(PQR, 90)
In the reduce function, for each college name, we find the maximum value
from the list of final marks associated with that college and output the
college name along with its maximum marks.
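The same map and reduce logic as a small Python sketch (a plain simulation of the job, not actual Hadoop code):

```python
from collections import defaultdict

def map_fn(record):
    # record = (student_name, college_name, final_marks)
    _, college, marks = record
    yield (college, marks)

def reduce_fn(college, marks_list):
    return (college, max(marks_list))

records = [("Ram", "ABC", 70), ("Sita", "ABC", 80), ("Hari", "ABC", 60),
           ("Gita", "XYZ", 90), ("Rita", "AYZ", 80), ("Shyam", "PQR", 90),
           ("Laxmi", "PQR", 70), ("Gopal", "PQR", 60)]

# Mimic the shuffle-and-sort step by grouping map output on the key.
grouped = defaultdict(list)
for rec in records:
    for college, marks in map_fn(rec):
        grouped[college].append(marks)

print([reduce_fn(c, m) for c, m in grouped.items()])
# [('ABC', 80), ('XYZ', 90), ('AYZ', 80), ('PQR', 90)]
```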
1. Fault Tolerance:
If a node in the cluster fails while executing a map task, the following steps
occur:
2. Speculative Execution:
3. Automatic Recovery:
Example: If a node failure occurs during the execution of a reduce task, the
framework redistributes the incomplete intermediate data and reruns the
affected reduce task on another node. This process ensures that the overall
job progresses towards completion despite the failure.
Here's how the combiner function works within the MapReduce framework:
Here's how the MapReduce process would work without using a combiner
function:
1. Mapper Phase:
Mapper tasks read input records and emit intermediate key-
value pairs, where the key is the product ID and the value is the
sales amount associated with that product.
For each input record, a mapper emits a key-value pair like this:
(product_id, sales_amount).
2. Shuffle and Sort:
Intermediate key-value pairs generated by mappers are shuffled
and sorted based on the keys to group together records with the
same key.
3. Reducer Phase:
Reducer tasks receive sorted intermediate key-value pairs and
process them to calculate the total sales amount for each
product.
For each unique product ID, a reducer iterates over all associated
sales amounts and calculates the sum to determine the total
sales amount.
1. Mapper Phase:
Same as before, mappers read input records and emit
intermediate key-value pairs (product_id, sales_amount).
2. Combiner Function: (combiner before shuffle and sort)
The combiner function is applied locally on each map node to
perform partial aggregation of intermediate key-value pairs.
For each unique product ID, the combiner aggregates the sales
amounts associated with that product, reducing the number of
key-value pairs outputted by the mapper.
3. Shuffle and Sort:
Intermediate key-value pairs, after local aggregation by the
combiner, are still shuffled and sorted based on the keys to
prepare for the reducer phase.
4. Reducer Phase:
Reducers receive sorted intermediate key-value pairs, but with
fewer key-value pairs to process due to the local aggregation
performed by the combiner.
For each unique product ID, reducers calculate the sum of sales
amounts associated with that product, as before.
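A small Python sketch of the combiner idea for this product-sales example; in real Hadoop the combiner is typically a Reducer class applied locally on each map node, so this only illustrates the pre-aggregation step:

```python
from collections import defaultdict

def mapper(records):
    # Emit (product_id, sales_amount) for every input record.
    for product_id, sales_amount in records:
        yield (product_id, sales_amount)

def combiner(mapper_output):
    # Local, per-map-node partial aggregation applied before shuffle and sort.
    partial = defaultdict(float)
    for product_id, amount in mapper_output:
        partial[product_id] += amount
    return list(partial.items())   # far fewer key-value pairs leave the map node

def reducer(product_id, amounts):
    # Summing partial sums is safe because addition is associative and commutative.
    return (product_id, sum(amounts))

sales = [("p1", 10.0), ("p2", 5.0), ("p1", 7.5)]
print(combiner(mapper(sales)))     # [('p1', 17.5), ('p2', 5.0)]
```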
Hadoop – Architecture
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
Let’s understand the role of each one of this component in detail.
1. MapReduce
Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as Shuffling. Using the shuffling process, the system can sort the data by its key.
Shuffling begins as soon as some of the map tasks are done; it does not wait for all Mapper tasks to complete, which makes the overall process faster.
Reduce: The main task of the Reducer is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs depending on their key element.
OutputFormat: Once all the operations are performed, the key-
value pairs are written into the file with the help of record writer,
each record in a new line, and the key and value in a space-
separated manner.
2. HDFS
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. Metadata can be the transaction logs that keep track of the user's activity in a Hadoop cluster.
Metadata can also be the name of a file, its size, and information about the location (block number, block IDs) of the DataNodes, which the NameNode stores to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. It is therefore advised that DataNodes have high storage capacity in order to store a large number of file blocks.
3. YARN(Yet Another Resource Negotiator)
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop common or Common Utilities
Hadoop Common, or the common utilities, is nothing but the Java library and Java files that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failures in a Hadoop cluster are common, so they need to be handled automatically in software by the Hadoop framework.
Components of YARN
YARN (Yet Another Resource Negotiator) is the resource management layer
in Hadoop 2.x that decouples the resource management and job scheduling
functionalities from the MapReduce programming paradigm. YARN allows
multiple data processing engines to run on top of a shared resource pool
efficiently. The key components of YARN are as follows:
1. ResourceManager (RM):
ResourceManager is the master daemon responsible for
managing and allocating cluster resources.
It consists of two main components:
Scheduler: The scheduler is responsible for allocating
resources to various applications based on policies such as
fair sharing, capacity, or custom-defined rules.
ApplicationManager: The ApplicationManager manages the
lifecycle of applications submitted to the cluster. It
negotiates resources with NodeManagers, tracks
application status, and handles application failures and
restarts.
2. NodeManager (NM):
NodeManager is a per-node daemon responsible for managing
resources (CPU, memory, disk, network) on individual nodes in
the cluster.
It monitors resource utilization, container execution, and node
health.
NodeManager launches and manages containers, which are
isolated execution environments for running application tasks.
NodeManager reports container status, resource utilization, and
node health back to the ResourceManager.
3. ApplicationMaster (AM):
ApplicationMaster is a per-application component responsible for
negotiating resources with the ResourceManager, tracking
application status, and managing application execution.
Each application running on YARN has its own ApplicationMaster,
which coordinates the execution of application tasks and
monitors their progress.
ApplicationMaster communicates with the ResourceManager to
request and release resources, handle failures, and update
application status.
4. Container:
A container is an encapsulation of resources (CPU, memory)
allocated to an application task by the NodeManager.
Containers provide an isolated execution environment for
running application tasks without interference from other tasks
or applications.
Each container hosts one or more application tasks, such as map
or reduce tasks in the case of MapReduce jobs.
5. Resource Scheduler:
YARN supports pluggable resource schedulers that determine
how cluster resources are allocated among competing
applications.
The resource scheduler is responsible for making allocation
decisions based on resource availability, application
requirements, and scheduling policies.
YARN provides built-in schedulers like the CapacityScheduler and
the FairScheduler, and custom schedulers can be developed to
meet specific requirements.
6. Job Scheduler
For a Hadoop cluster with a 128 MB block size, how many mappers will Hadoop MapReduce form while performing the Map function on 1 GB of data? Justify with an example.
To determine the number of map tasks that Hadoop MapReduce will form
when processing 1GB of data with a block size of 128MB, we need to
consider the data block size and the total size of the input data.
Given: input data size = 1 GB; HDFS block size = 128 MB.
To calculate the number of map tasks, we'll first determine the number of
data blocks for the input data and then assign one map task per data block.
Calculations:
1. Determine the Number of Data Blocks:
Input data size: 1 GB
Block size: 128 MB
Number of data blocks = Input data size / Block size = 1024 MB / 128 MB = 8 data blocks
2. Assign Map Tasks:
One map task is assigned per data block.
Number of map tasks = Number of data blocks = 8 map tasks
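The same calculation as a short Python sketch (using ceiling division, since a final partial block would still get its own map task):

```python
import math

input_size_mb = 1024    # 1 GB of input data
block_size_mb = 128     # HDFS block size

num_blocks = math.ceil(input_size_mb / block_size_mb)
num_map_tasks = num_blocks    # one map task per data block (input split)
print(num_map_tasks)          # 8
```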
Justification:
In this scenario, Hadoop MapReduce will form 8 map tasks when performing
the Map function on 1GB of data with a block size of 128MB. Each map task
will process one data block (128MB) of the input data independently in
parallel.
For example, let's consider a file of 1GB divided into 8 data blocks (each
128MB in size) named Block1, Block2, ..., Block8. Hadoop MapReduce will
form 8 map tasks, each responsible for processing one data block. The map
tasks will execute in parallel across the cluster, processing their assigned
data blocks concurrently. This parallel processing of data blocks enables
efficient and scalable data processing in Hadoop clusters.
Amazon Cloud
Amazon Web Services (AWS) is a cloud computing platform offered by
Amazon.com. It provides a wide range of cloud services, including computing
power, storage, databases, machine learning, analytics, and more, allowing
businesses and developers to build and deploy scalable, flexible, and cost-
effective solutions.
1. Compute Services:
Amazon Elastic Compute Cloud (EC2): Virtual servers in the
cloud, allowing users to run applications and workloads of
various sizes and complexities.
AWS Lambda: Serverless compute service that runs code in
response to events without provisioning or managing servers.
2. Storage Services:
Amazon Simple Storage Service (S3): Object storage service
designed to store and retrieve any amount of data from
anywhere on the web.
Amazon Elastic Block Store (EBS): Persistent block storage
volumes for use with EC2 instances.
3. Database Services:
Amazon Relational Database Service (RDS): Managed
relational database service supporting multiple database engines
such as MySQL, PostgreSQL, Oracle, and SQL Server.
Amazon DynamoDB: Fully managed NoSQL database service
for key-value and document data.
4. Networking Services:
Amazon Virtual Private Cloud (VPC): Virtual network service
for logically isolating resources within the AWS cloud.
Amazon Route 53: Scalable domain name system (DNS) web
service for routing traffic to resources.
5. Machine Learning and AI Services:
Amazon SageMaker: Fully managed service for building,
training, and deploying machine learning models.
Amazon Rekognition: Deep learning-based image and video
analysis service for object and scene detection, facial
recognition, and text extraction.
6. Security and Identity Services:
AWS Identity and Access Management (IAM): Identity
management service for controlling access to AWS resources.
Amazon Inspector: Automated security assessment service for
identifying security vulnerabilities and compliance issues.
Interrelationship between Hadoop and Amazon Cloud
The interrelationship between Hadoop and Amazon Web Services (AWS)
cloud is significant, as AWS offers a variety of services that complement and
enhance the capabilities of the Hadoop ecosystem. Here's how Hadoop and
AWS cloud interact:
1. Distributed Architecture:
Both HDFS and GFS are designed as distributed file systems,
intended to store and manage large amounts of data across
multiple machines in a cluster.
They distribute data across multiple nodes to achieve scalability,
fault tolerance, and high throughput.
2. Master-Slave Architecture:
Both systems employ a master-slave architecture where there
are one or more master nodes responsible for metadata
management and coordination, and multiple slave nodes
responsible for storing and serving data.
In GFS, the master node is called the "Master," while in HDFS, it's
called the "NameNode."
3. Data Replication:
Both HDFS and GFS replicate data across multiple nodes to
ensure fault tolerance and high availability.
They use replication to store multiple copies of data blocks on
different nodes, typically three replicas in both systems.
4. Large Block Sizes:
Both file systems use large block sizes for efficient data storage
and processing.
GFS typically uses a default block size of 64 MB, while HDFS
commonly uses a default block size of 128 MB, although both are
configurable.
5. Write-Once, Read-Many (WORM) Model:
Both systems follow a Write-Once, Read-Many (WORM) model,
where files are typically written once and then read multiple
times.
This model simplifies data consistency and replication, as
updates are infrequent and mainly append-only.
6. Checksums and Data Integrity:
Both systems use checksums to ensure data integrity during
storage and transfer.
Checksums are calculated for each data block and verified to
detect any data corruption or errors.
7. Data Locality:
Both HDFS and GFS emphasize data locality, aiming to process
data where it resides to minimize data transfer over the network.
They schedule data processing tasks close to the data blocks
they operate on, reducing network overhead and improving
performance.
8. Scalability and Fault Tolerance:
Both systems are designed for scalability and fault tolerance,
allowing them to handle large-scale data storage and processing
in distributed environments.
They can scale horizontally by adding more nodes to the cluster
and handle node failures gracefully through data replication and
recovery mechanisms.
Big data is nothing but a collection of data sets that are large, complex, and
which are difficult to store and process using available data management
tools or traditional data processing applications. Hadoop is a framework
(open source) for writing, running, storing, and processing large datasets in a
parallel and distributed manner. It is a solution that is used to overcome the
challenges faced by big data.
The main Hadoop daemons are:
1. Name Node
2. Data Node
3. Secondary Name Node (SNN)
4. Job Tracker
5. Task Tracker
Let’s get an idea of how data flows between the client interacting with HDFS,
the name node, and the data nodes with the help of a diagram. Consider the
figure:
Step 1: The client opens the file it wishes to read by calling open() on the
File System Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few blocks in
the file. For each block, the name node returns the addresses of the data
nodes that have a copy of that block. The DFS returns an FSDataInputStream
to the client for it to read data from. FSDataInputStream in turn wraps a
DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to the data node and then finds the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get
a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we
cannot edit the files which are already stored in HDFS, but we can append
data by reopening the files.
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the
file system’s namespace, with no blocks associated with it. The name node
performs various checks to make sure the file doesn’t already exist and that
the client has the right permissions to create the file. If these checks pass,
the name node prepares a record of the new file; otherwise, the file can’t be
created and therefore the client is thrown an error i.e. IOException. The DFS
returns an FSDataOutputStream for the client to start out writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline, and here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.
Chapter 5
Data Replication in Cassandra
Partitions are automatically replicated across nodes.
In Cassandra, one or more of the nodes in a cluster act as replicas for a given
piece of data. If it is detected that some of the nodes responded with an out-
of-date value, Cassandra will return the most recent value to the client. After
returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data
replication among the nodes in a cluster to ensure no single point of failure.
Components of Cassandra
Note − Cassandra uses the Gossip Protocol in the background to allow the
nodes to communicate with each other and detect any faulty nodes in the
cluster.
Mongo DB
Key Features of MongoDB
Document-oriented Database
Tables in MongoDB are known as collections.
Stores data in flexible, JSON-like documents (BSON).
Schema-less database.
It provides horizontal scalability with the help of sharding.
It provides high availability and redundancy with the help
of replication.
It allows one to perform operations on the grouped data and get a
single result or computed result.
It has very high performance.
MongoDB Vs RDBMS
| MongoDB | RDBMS |
|---|---|
| It is a non-relational, document-oriented database management system and works on document-based databases. | It is a relational database management system and works on relational databases. |
Drivers
Drivers are client libraries that offer interfaces and methods for applications to communicate with MongoDB databases. Drivers handle the translation of documents between BSON objects and the application's own data structures.
.NET, Java, JavaScript, Node.js, Python, etc. are some of the widely used drivers supported by MongoDB.
Storage Engine
The storage engine significantly influences the performance of applications,
serving as an intermediary between the MongoDB database and persistent
storage, typically disks. MongoDB supports different storage engines:
MMAPv1 – A traditional storage engine based on memory-mapped files. This storage engine is optimized for workloads with high volumes of read operations, insertions, and in-place updates. It uses B-trees to store indexes. The storage engine works with a multiple-reader, single-writer lock, so two write calls cannot be processed in parallel on the same collection. It is fast for reads and slow for writes.
WiredTiger – The default storage engine starting from MongoDB 3.2. It uses document-level concurrency with lock-free techniques (such as hazard pointers) rather than collection-level locking. It yields 7x-10x better write performance and up to 80% file-system compression compared to MMAPv1.
InMemory – Instead of storing documents on disk, this engine keeps them in memory for more predictable data latencies. By default it uses 50% of physical RAM (minimum 1 GB). It requires all of its data to fit in memory, so when dealing with large datasets the in-memory engine may not be the most suitable choice.
3. MongoDB Server
It serves as the central element and is in charge
of maintaining, storing, and retrieving data from the database through
a number of interfaces. The system’s heart is the MongoDB server. Each
mongod server instance is in charge of handling client requests,
maintaining data storage, and performing database operations. Several
mongod instances work together to form a cluster in a typical MongoDB
setup.
4. MongoDB Shell
For dealing with MongoDB databases, MongoDB provides the MongoDB
Shell command-line interface (CLI) tool. The ability to handle and query
MongoDB data straight from the terminal is robust and flexible. After
installing MongoDB, you may access the MongoDB Shell, often known as
mongo. It interacts with the database using JavaScript-based syntax.
Additionally, it has built-in help that shows details about possible
commands and how to use them.
6. Indexes
Indexes are data structures that make it simple to navigate across the
collection’s data set. They help to execute queries and find documents that
match the query criteria without a collection scan.
7. Replication
Within a MongoDB cluster, data replication entails keeping several copies of
the same data on various servers or nodes. Enhancing data
availability and dependability is the main objective of data replication. A
replica may seamlessly replace a failing server in the cluster to maintain
service continuity and data integrity.
Primary Node (Primary Replica): In a replica set, the primary
node serves as the main source for all write operations. It’s the
only node that accepts write requests. The main node is where all
data modifications begin and are implemented initially.
Secondary Nodes: Secondary nodes duplicate data from the
primary node (also known as secondary replicas). They are useful
for dispersing read workloads and load balancing since they are
read-only and mostly utilized for read activities.
8. Sharding
Sharding is basically horizontal scaling of databases, as compared to the traditional vertical scaling of adding more CPUs and RAM to the current system.
For example, if you have a huge set of files, you might segregate it into smaller sets for ease of handling. Similarly, MongoDB segregates its data into smaller chunks to improve efficiency.
Suppose you have a machine with a given configuration and a MongoDB instance running on it storing 100 million documents.
With time, the data in your MongoDB instance will grow; suppose 100 million extra documents get added. To manage the processing of these extra records, you might need to add extra RAM, extra storage, and extra CPU to the server. This type of scaling is called vertical scaling.
Now consider another situation: you have 4 small machines with small configurations. You can divide the 200 million documents across the servers such that each server holds around 50 million documents. By dividing the data across multiple servers you reduce the computation requirements per server; this kind of scaling is known as horizontal scaling, which in MongoDB is called sharding, and each of the servers S1, S2, S3, S4 is a shard.
The partitioning of data in a sharded environment is done on a range basis by choosing a field as the shard key, as sketched below.
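A toy Python sketch of range-based partitioning on a shard key; the shard names, key ranges, and the user_id field are made up purely for illustration:

```python
# Hypothetical shards keyed on a numeric "user_id" shard key.
SHARD_RANGES = [
    ("S1", 0, 50_000_000),
    ("S2", 50_000_000, 100_000_000),
    ("S3", 100_000_000, 150_000_000),
    ("S4", 150_000_000, 200_000_001),
]

def route_to_shard(doc):
    """Pick the shard whose key range contains the document's shard key."""
    key = doc["user_id"]
    for shard, lo, hi in SHARD_RANGES:
        if lo <= key < hi:
            return shard
    raise ValueError("shard key outside all configured ranges")

print(route_to_shard({"user_id": 123_456_789, "name": "Ram"}))  # S3
```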
Document-based databases
Key-value stores
Column-oriented databases
Graph-based databases
Document-Based Database:
Key-Value Stores:
Graph-Based databases:
Architecture of HBase
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions to Region Servers and handles DDL (create, drop table) operations. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. HMaster has many features like controlling load balancing, failover, etc.
2. Region Server –
HBase tables are divided horizontally by row-key range into Regions. Regions are the basic building elements of an HBase cluster; they contain a portion of a table's data and are comprised of column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster. The regions of a Region Server are responsible for several things, like handling, managing, and executing read and write HBase operations on that set of regions; DML (insert, update, delete) operations are handled at the Region Server. The default size of a region is 256 MB.
3. Zookeeper –
It is like a coordinator in HBase. It provides services like maintaining
configuration information, naming, providing distributed
synchronization, server failure notification etc. Clients communicate
with region servers via zookeeper.
Advantages of HBase –
Disadvantages of HBase –
2. No transaction support
HBase supports random reads and writes, while HDFS supports write-once, read-many access.
HBase, Cassandra and MongoDB are called column-oriented NoSQL databases. How do row-oriented databases differ from them? Write the differences in table form.
Yes, HBase, Cassandra, and MongoDB are often categorized as NoSQL column-oriented
databases. However, it's important to note that MongoDB can also store data in a
document-oriented manner. When comparing column-oriented databases like HBase
and Cassandra with row-oriented databases, such as traditional relational databases like
MySQL or PostgreSQL, there are several differences. Here's a table outlining some of the
key differences:
Eventual Consistency:
Eventual consistency is a consistency model where all updates to the
database propagate through the system asynchronously. This means
that after a certain period of time, all replicas will converge to the
same state.
In Cassandra, eventual consistency is achieved by allowing read
operations to return data even if some replicas have not yet been
updated with the latest write.
Eventual consistency prioritizes availability and partition tolerance over
strong consistency, making it suitable for scenarios where immediate
consistency is not required, such as distributed caching or systems
with high availability requirements.
For example, inserting a cell with put:
put 'example_table', 'row1', 'cf1:column1', 'value1'
For example, reading data with get and scan:
get 'example_table', 'row1'
scan 'example_table'
For example, deleting data (deleteall removes an entire row, delete removes a single cell):
deleteall 'example_table', 'row1'
delete 'example_table', 'row1', 'cf1:column1'
Explain the reason why some NoSQL databases like Cassandra sacrifice absolute consistency for absolute availability.
NoSQL databases like Cassandra prioritize availability over absolute consistency due to
the requirements of distributed systems and the CAP theorem. The CAP theorem,
proposed by computer scientist Eric Brewer, states that in a distributed system, it's
impossible to simultaneously achieve all three of the following guarantees:
1. Consistency (C): Every read receives the most recent write or an error.
2. Availability (A): Every request receives a response, without the guarantee that
it contains the most recent write.
3. Partition tolerance (P): The system continues to operate despite network
partitions (communication failures) between nodes.
Given that network partitions are inevitable in distributed systems, databases must
choose between consistency and availability during network partitions. Cassandra,
along with many other NoSQL databases, opts to prioritize availability over consistency.
Here's why:
This command inserts a single document into the "posts" collection with the
specified fields.
This query searches for documents in the "posts" collection where the "by" field
equals "Tom" and projects only the "title" field in the result.
var mapFunction = function() { emit(this.by, 1); };
var reduceFunction = function(key, values) { return Array.sum(values); };
db.posts.mapReduce(
  mapFunction,
  reduceFunction,
  { out: "post_count_by_user" }
);
This mapReduce function maps each document to the username ("by" field)
as the key and emits a value of 1. The reduce function then sums up the values for
each key. The result is stored in a collection named "post_count_by_user".
After executing these commands, you can verify the results by querying the
"posts" collection for title or checking the "post_count_by_user" collection for the
count of posts by each user.
HBase is called a column-oriented NoSQL database because of its design and storage
model, which is optimized for efficient column-wise operations and scalability. Here's
why it's called "column-oriented":
1. Storage Model:
In HBase, data is stored in a column-oriented fashion, meaning that
columns of data for each row are stored together on disk. This is different
from row-oriented databases where entire rows are stored contiguously.
Storing data in a column-oriented manner allows for efficient read and
write operations on specific columns, making HBase suitable for analytical
queries and applications with wide and sparse datasets.
2. Column Families:
HBase organizes data into column families, which are groups of columns
stored together physically. Each column family can contain multiple
columns, and all columns within a column family are stored together on
disk.
Column families provide a way to group related columns and apply
configurations (such as compression or compaction) at the column family
level, offering flexibility in managing different types of data.
3. Sparse Data Support:
HBase is designed to efficiently handle sparse data, where rows may have
a varying number of columns, and many columns may have missing or
null values.
By storing data in a column-oriented fashion, HBase can handle sparse
datasets effectively, as it only stores data that exists, rather than
allocating space for missing values.
Regarding HDFS (Hadoop Distributed File System), HBase is often deployed on top of
HDFS to leverage its distributed storage capabilities. HDFS provides the underlying
storage layer for HBase, enabling HBase to store large volumes of data across a
distributed cluster of commodity hardware. HBase uses HDFS for storing its data files
and relies on Hadoop for distributed data management and fault tolerance.
JSON
JSON stands for JavaScript Object Notation, which is a lightweight data
interchange format used to represent data in a structured format. JSON is often
used for transmitting data between a server and a web application, as it is easy for
humans to read and write and easy for machines to parse and generate.
Here's a simple example of JSON data representing information about a
person:
{
  "name": "John Doe",
  "age": 30,
  "email": "[email protected]",
  "address": {
    "street": "123 Main Street",
    "city": "Anytown",
    "state": "California",
    "postal_code": "12345"
  },
  "phone_numbers": [
    {
      "type": "home",
      "number": "555-1234"
    },
    {
      "type": "work",
      "number": "555-5678"
    }
  ]
}
In this JSON example:
The outermost curly braces {} represent an object.
Inside the object, there are key-value pairs separated by colons (:). The keys
are strings, and the values can be strings, numbers, arrays, objects, boolean values,
or null.
Arrays are enclosed in square brackets [], and elements inside the array are
separated by commas.
Nested objects and arrays can be used to represent complex data structures.
JSON is widely used in web development for transmitting and storing data,
and it's supported by most programming languages through built-in libraries or
third-party packages. It's commonly used in APIs for exchanging data between
servers and clients, configuration files, and data storage formats.
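A short Python sketch showing how such a JSON document is parsed and accessed programmatically (the field names follow the example above):

```python
import json

raw = '{"name": "John Doe", "age": 30, "phone_numbers": [{"type": "home", "number": "555-1234"}]}'

person = json.loads(raw)                      # parse JSON text into Python dicts/lists
print(person["name"])                         # John Doe
print(person["phone_numbers"][0]["number"])   # 555-1234

print(json.dumps(person, indent=2))           # serialize back to formatted JSON text
```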
Chapter 6
Why is the Lucene index called TF-IDF?
The Lucene index is not directly called "TF-IDF." Rather, Lucene uses the TF-IDF (Term
Frequency-Inverse Document Frequency) algorithm as one of the key components for
scoring and ranking search results. TF-IDF is a statistical measure used to evaluate the
importance of a term within a document relative to a collection of documents.
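A minimal Python sketch of the classic TF-IDF weighting (one common variant; Lucene's actual scoring adds further normalization and boosting factors on top of this idea):

```python
import math

def tf_idf(term, doc, corpus):
    """Term frequency in the document times inverse document frequency over the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # number of documents containing the term
    idf = math.log(len(corpus) / (1 + df))     # +1 avoids division by zero
    return tf * idf

docs = [["big", "data", "hadoop"], ["big", "data", "spark"], ["lucene", "index"]]
print(tf_idf("lucene", docs[2], docs))   # higher score: rare term in the collection
print(tf_idf("big", docs[0], docs))      # lower score: term appears in most documents
```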
1. Segmented Indexing: Lucene divides the index into smaller segments, which
are individually optimized for efficient updates and searches. This segmentation
helps alleviate congestion by distributing the indexing load across multiple
segments. It also allows for incremental updates to the index, reducing the
impact of large indexing operations on search performance.
2. Merge Policy: Lucene employs a merge policy to periodically consolidate
smaller index segments into larger segments. This process helps maintain
optimal index performance by reducing the number of segments and optimizing
disk access patterns. The merge policy can be configured based on factors like
index size, update frequency, and system resources to balance indexing
throughput and search performance.
3. Buffered Indexing: Lucene buffers incoming indexing operations in memory
before flushing them to disk. This buffering reduces the frequency of disk writes,
improving indexing throughput and minimizing congestion caused by frequent
disk access. Buffered indexing also allows for batched updates, which can be
more efficient than individual updates.
4. Concurrent Indexing: Lucene supports concurrent indexing, allowing multiple
threads or processes to index documents simultaneously. This concurrency
increases indexing throughput by leveraging multicore processors and
parallelism. However, care must be taken to ensure thread safety and proper
synchronization to avoid data corruption or inconsistencies.
5. Caching: Lucene incorporates caching mechanisms to cache frequently
accessed data structures and search results. Caching helps reduce the
computational overhead of repetitive search operations and improves overall
search performance. Lucene provides various caching options, such as filter
caching, query result caching, and field value caching, which can be tuned based
on specific use cases and resource constraints.
Keyword Analyzer: This analyzer indexes the entire input text as a single term.
It's useful for scenarios where exact matches are required, such as indexing IDs
or codes.
Simple Analyzer: This analyzer breaks text into tokens based on non-letter
characters, such as whitespace and punctuation, and converts tokens to
lowercase. It's less aggressive than the standard analyzer and doesn't remove
stop words.