
Chapter 1

The "5 V's of Big Data" is a concept that highlights the key characteristics of
big data, helping to understand its nature and challenges. These five V's are:

1. Volume: Refers to the vast amount of data generated from various
sources such as social media, sensors, devices, and business
applications. Big data involves processing and analyzing large volumes
of data, often reaching terabytes, petabytes, or even exabytes in scale.
2. Velocity: Describes the speed at which data is generated, collected,
and processed. With the proliferation of real-time data sources like
social media feeds, sensor data, and financial transactions, big data
systems must handle data streams at high velocity, requiring fast
processing and analysis capabilities (processing modes include batch,
real-time, and stream processing).
3. Variety: Indicates the diverse types of data that exist, including
structured, semi-structured, and unstructured data. Structured data
refers to organized data stored in databases, while unstructured data
encompasses text, images, videos, and social media posts. Big data
systems must be able to handle and analyze this variety of data types
efficiently.
4. Veracity: Relates to the quality and reliability of the data. Big data
often involves dealing with data of varying accuracy, completeness,
and trustworthiness. Veracity challenges arise from data inconsistency,
errors, biases, and noise, requiring data cleaning, validation, and
quality assurance techniques.
5. Value: Represents the ultimate goal of big data initiatives, which is to
derive meaningful insights, knowledge, and value from data. Extracting
value from big data involves analyzing and interpreting the data to
uncover patterns, trends, correlations, and actionable insights that can
lead to better decision-making, innovation, and competitive
advantage.

These five V's provide a framework for understanding the complexity and
challenges associated with big data and guide organizations in developing
strategies for managing, analyzing, and leveraging data effectively.

Ten common big data problems:

1. Modelling True Risk: Utilizing big data to accurately assess and
model risks in financial markets, insurance, healthcare, or other
sectors where risk assessment is critical for decision-making.
2. Customer Churn Analysis: Predicting and understanding why
customers leave a service or product and developing strategies to
retain them. Big data can help analyze customer behavior,
preferences, and interactions to identify churn indicators.
3. Recommendation Engine: Building sophisticated recommendation
systems that leverage big data to provide personalized
recommendations for products, content, or services based on user
preferences, past behavior, and contextual data.
4. Ad Targeting: Leveraging big data analytics to target advertisements
more effectively by analyzing user demographics, behavior, and
preferences to deliver personalized and relevant ads, thereby
maximizing ad engagement and ROI.
5. Transaction Analysis: Analyzing large volumes of transactional data
to detect fraudulent activities, monitor financial transactions for
compliance purposes, and identify patterns or anomalies that may
indicate potential fraud or risk.
6. Analyzing Network Data to Predict Failure: Using big data
analytics to monitor and analyze network data to predict equipment or
system failures in industries such as telecommunications,
manufacturing, or utilities. Predictive maintenance techniques can help
reduce downtime and maintenance costs.
7. Threat Analysis: Employing big data analytics to detect and mitigate
cybersecurity threats by analyzing network traffic, system logs, user
behavior, and other data sources to identify potential security
breaches, malware, or unauthorized access attempts.
8. Trade Surveillance: Monitoring financial markets and transactions
using big data analytics to detect insider trading, market manipulation,
or other illegal activities, ensuring compliance with regulatory
requirements and maintaining market integrity.
9. Search Quality: Improving the quality and relevance of search results
in search engines or e-commerce platforms by analyzing user queries,
click-through rates, and other user interactions with search results to
optimize search algorithms and enhance user experience (e.g., the
PageRank algorithm).
10. Data Sandbox: Creating secure and isolated environments for
data exploration, experimentation, and testing. Data sandboxes allow
data scientists, analysts, and developers to work with large datasets
without affecting production systems, facilitating innovation and
discovery.

These big data problems represent common challenges across various
industries and domains, highlighting the diverse applications and
opportunities for leveraging big data analytics.

What is a Distributed System?




A distributed system is a collection of autonomous computer systems that are
physically separated but connected by a computer network, equipped with
distributed system software, and appearing to its end users as a single
coherent system. The autonomous computers communicate with one another by
sharing resources and files and by performing the tasks assigned to them.
Types of Distributed Systems:
There are many models and architectures of distributed systems in use
today.
 Client-server systems, the most traditional and simple type of
distributed system, involve a multitude of networked computers
that interact with a central server for data storage, processing or
other common goal.
 Peer-to-peer networks distribute workloads among hundreds or
thousands of computers all running the same software.

Example of Distributed System:


Any social media platform can be viewed as a distributed system: its
headquarters hosts the central computer network, while the computer systems
that users access to consume its services act as the autonomous systems in
the distributed system architecture.
Distributed systems play a crucial role in big data for several reasons:

1. Scalability: Big data systems often need to handle massive amounts
of data that exceed the capacity of a single machine. Distributed
systems allow data to be distributed across multiple machines,
enabling horizontal scalability. As the data volume grows, additional
machines can be added to the cluster to accommodate the increased
workload, ensuring that the system can handle the scale of big data.
2. Fault Tolerance and Reliability: Distributed systems are designed
to be fault-tolerant, meaning they can continue operating even if
individual components or machines fail. By replicating data and tasks
across multiple nodes in the cluster, distributed systems can tolerate
failures without compromising the availability or integrity of data. This
reliability is crucial for big data applications where data loss or system
downtime can have significant consequences.
3. Parallel Processing and Performance: Big data processing tasks
often involve analyzing large datasets or executing complex
computations that can be parallelized across multiple machines.
Distributed systems provide the infrastructure for parallel processing
by distributing data and computation tasks across the nodes in the
cluster. This parallelism allows big data processing jobs to be
completed faster and more efficiently than on a single machine.
4. Data Locality: Distributed systems enable data locality, which refers
to the ability to process data where it resides, minimizing data
movement across the network. By co-locating data and computation
tasks on the same machines or nearby nodes, distributed systems
reduce network overhead and latency, improving overall performance
and efficiency.
5. Flexibility and Adaptability: Distributed systems are inherently
flexible and adaptable, allowing them to support various big data
processing frameworks, storage systems, and workload types. Whether
it's batch processing, real-time stream processing, interactive
querying, or machine learning, distributed systems can accommodate
diverse workloads and scale dynamically to meet changing demands.
6. Resource Utilization and Cost Efficiency: By distributing workloads
across multiple machines, distributed systems can better utilize
hardware resources and achieve higher levels of resource efficiency.
This efficient resource utilization translates to cost savings, as
organizations can maximize the value of their infrastructure
investments by leveraging distributed systems to process big data
workloads more effectively.
 Why do we need a data analytics process? (This can also be answered by
generalizing from the ten big data problems above.)
 Benefits of using big data
 Applications of big data analytics
Data analytics processes are crucial for organizations for several reasons:

1. Insight Generation: Data analytics processes help organizations
extract valuable insights and knowledge from their data. By analyzing
large volumes of structured and unstructured data, organizations can
uncover patterns, trends, correlations, and hidden relationships that
provide valuable business insights.
2. Informed Decision-Making: Data analytics empowers organizations
to make data-driven decisions based on evidence and analysis rather
than intuition or guesswork. By providing actionable insights derived
from data, analytics processes enable decision-makers to identify
opportunities, mitigate risks, and optimize outcomes.
3. Performance Improvement: Through data analytics, organizations
can assess and optimize their performance across various business
functions, such as marketing, sales, operations, finance, and customer
service. By analyzing key performance indicators (KPIs) and
performance metrics, organizations can identify areas for improvement
and implement data-driven strategies to enhance performance and
efficiency.
4. Customer Understanding: Data analytics processes help
organizations gain a deeper understanding of their customers by
analyzing customer behavior, preferences, needs, and sentiment. By
segmenting customers, predicting customer churn, and personalizing
marketing efforts, organizations can improve customer satisfaction,
loyalty, and retention.
5. Innovation and Competitive Advantage: Data analytics fosters
innovation by enabling organizations to identify emerging trends,
market opportunities, and new business models. By leveraging data
analytics, organizations can innovate products, services, and
processes, gaining a competitive edge in the marketplace.
6. Risk Management: Data analytics processes help organizations
identify and mitigate risks by analyzing historical data, predicting
future outcomes, and detecting anomalies or patterns indicative of
potential risks. Whether it's financial risk, operational risk,
cybersecurity risk, or regulatory compliance risk, data analytics
provides organizations with the insights needed to manage and
mitigate risks effectively.
7. Resource Optimization: By analyzing data on resource utilization,
supply chain dynamics, inventory levels, and demand forecasts,
organizations can optimize resource allocation, minimize waste, and
improve operational efficiency. Data analytics processes enable
organizations to make informed decisions about resource allocation,
capacity planning, and inventory management, thereby reducing costs
and improving profitability.

Explain the 5 V's of big data in relation to real-world use cases. (Write
for 10 marks.)
The "5 Vs of Big Data" - Volume, Velocity, Variety, Veracity, and Value -
provide a framework for understanding the characteristics and challenges
associated with big data. Here's how these Vs relate to real-world use cases:

1. Volume: This refers to the sheer amount of data generated and
collected. Real-world examples include:
 Social Media Analytics: Analyzing millions of social media posts,
comments, and interactions to understand customer sentiment,
trends, and brand perception.
 Sensor Data in IoT: Collecting and analyzing large volumes of
sensor data from smart devices, vehicles, and industrial
equipment to monitor performance, detect anomalies, and
optimize operations.
2. Velocity: Velocity refers to the speed at which data is generated,
processed, and analyzed. Real-world examples include:
 Financial Trading: Processing high-frequency trading data in real-
time to make split-second investment decisions and capitalize on
market opportunities.
 Network Security: Analyzing network traffic in real-time to detect
and respond to cyber threats, such as DDoS attacks, malware
infections, and unauthorized access attempts.
3. Variety: Variety refers to the diversity of data types and sources.
Real-world examples include:
 Healthcare Data: Integrating and analyzing diverse healthcare
data sources, such as electronic health records (EHRs), medical
images, genomic data, and wearable device data, to personalize
treatments, improve patient outcomes, and advance medical
research.
 Retail Data: Combining structured transaction data with
unstructured customer feedback, social media mentions, and
clickstream data to understand customer behavior, optimize
product assortments, and personalize marketing campaigns.
4. Veracity: Veracity refers to the quality, accuracy, and reliability of
data. Real-world examples include:
 Predictive Maintenance: Analyzing sensor data from industrial
equipment to predict equipment failures and schedule
maintenance proactively. Ensuring data accuracy is crucial to
avoid false alarms and unnecessary downtime.
 Financial Risk Management: Assessing the veracity of financial
data to identify and mitigate risks such as fraud, market
volatility, and regulatory compliance issues.
5. Value: Value refers to the insights and actionable outcomes derived
from analyzing big data. Real-world examples include:
 Personalized Recommendations: Analyzing customer behavior
and preferences to deliver personalized product
recommendations in e-commerce platforms, streaming services,
and content websites, leading to increased sales and customer
satisfaction.
 Smart Cities: Leveraging big data analytics to optimize urban
planning, transportation systems, energy usage, and public
services, resulting in improved efficiency, sustainability, and
quality of life for citizens.

Structured, unstructured and semi-structured data with their sources in
the real world
Structured, unstructured, and semi-structured data are three types of data
that differ in their organization and format. Here are examples of each type
of data and their sources in the real world:

1. Structured Data:
 Definition: Structured data is highly organized and follows a
fixed, predefined schema, typically stored in tabular formats with
rows and columns. Each data element has a
clear and defined meaning.
 Examples:
 Transactional Data: Records of sales transactions,
invoices, payments, and orders in databases.
 Employee Information: HR databases containing
employee details such as name, ID, department, and
salary.
 Financial Data: Stock market data, accounting records,
balance sheets, and income statements.
 Sources: Structured data primarily originates from operational
systems, transactional databases, and enterprise applications.
Common sources include:
 Relational Databases: Systems like MySQL, PostgreSQL,
Oracle, and Microsoft SQL Server store structured data in
tables with predefined schemas.
 Data Warehouses: Central repositories that consolidate
structured data from multiple sources for reporting and
analysis.
 Enterprise Applications: ERP systems, CRM systems,
and financial software generate structured data as part of
routine business operations.
2. Unstructured Data:
 Definition: Unstructured data lacks a predefined structure and
does not fit neatly into rows and columns. It often comprises
text, images, audio, and video files.
 Examples:
 Textual Content: Documents, emails, social media posts,
blogs, articles, and web pages.
 Multimedia Content: Images, videos, audio recordings,
and presentations.
 Sensor Data: Raw sensor readings from IoT devices,
telemetry data, and satellite imagery.
 Sources: Unstructured data is generated from diverse sources
across various domains. Some common sources include:
 Social Media Platforms: Facebook, Twitter, Instagram,
and LinkedIn generate vast amounts of unstructured
textual and multimedia content.
 File Systems: Documents, images, and multimedia files
stored on local drives, network shares, and cloud storage
platforms.
 Surveillance Systems: Video feeds from security
cameras and CCTV systems capture unstructured data for
surveillance and monitoring purposes.
3. Semi-Structured Data:
 Definition: Semi-structured data exhibits some level of
organization but does not adhere to a rigid schema like
structured data. It may have a flexible structure with nested or
repeating elements.
 Examples:
 JSON (JavaScript Object Notation): Data interchange
format commonly used for web APIs, configuration files,
and NoSQL databases.
 XML (eXtensible Markup Language): Document format
for representing hierarchical data with nested elements
and attributes.
 Log Files: Records of system events, application logs, and
web server logs that contain structured fields but variable
formats.
 Sources: Semi-structured data arises from various sources
where flexibility in data representation is necessary. Key sources
include:
 Web APIs: Many web services expose data in JSON or XML
format through APIs for integration with other systems.
 NoSQL Databases: Document-oriented databases like
MongoDB and Couchbase store semi-structured data in
JSON or BSON format.
 Logging Systems: Applications, servers, and network
devices generate log files in text-based formats like JSON,
XML, or plain text with structured fields.
Tools: MySQL, PostgreSQL, MongoDB, Cassandra
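To make the distinction concrete, here is a small illustrative Python sketch (the record values are invented for the example): the same information is shown once as a fixed-schema structured row and once as a semi-structured JSON document with nested and repeating fields that would not fit a rigid table.

import json

# Structured: a fixed-schema row, as it might appear in a relational table.
structured_row = ("Ram", "ABC", 70)   # (student_name, college_name, final_marks)

# Semi-structured: a JSON document with nested and repeating elements.
semi_structured = '''
{
  "student": "Ram",
  "college": "ABC",
  "marks": {"final": 70, "internal": 25},
  "tags": ["scholarship", "hostel"]
}
'''

record = json.loads(semi_structured)                 # parse JSON text into a Python dict
print(record["student"], record["marks"]["final"])   # fields are reached by key, not by column position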
Differences between Structured, Semi-structured and Unstructured
data:
Technology:
 Structured data: based on relational database tables.
 Semi-structured data: based on XML/RDF (Resource Description Framework).
 Unstructured data: based on character and binary data.

Transaction management:
 Structured data: matured transactions and various concurrency techniques.
 Semi-structured data: transaction support adapted from the DBMS, not matured.
 Unstructured data: no transaction management and no concurrency.

Version management:
 Structured data: versioning over tuples, rows, and tables.
 Semi-structured data: versioning over tuples or graphs is possible.
 Unstructured data: versioned as a whole.

Flexibility:
 Structured data: schema dependent and less flexible.
 Semi-structured data: more flexible than structured data but less flexible than unstructured data.
 Unstructured data: more flexible, since there is no schema.

Scalability:
 Structured data: very difficult to scale the database schema.
 Semi-structured data: scaling is simpler than for structured data.
 Unstructured data: more scalable.

Robustness:
 Structured data: very robust.
 Semi-structured data: newer technology, not very widespread.

Query performance:
 Structured data: structured queries allow complex joining.
 Semi-structured data: queries over anonymous nodes are possible.
 Unstructured data: only textual queries are possible.

A further comparison by aspect:

Definition:
 Structured data: data organized in a predefined format with a clear schema, typically stored in tables with rows and columns.
 Semi-structured data: data that does not adhere to a strict schema but has some level of organization.
 Unstructured data: data that lacks a predefined structure and does not fit neatly into rows and columns.

Format:
 Structured data: tabular format with well-defined fields and relationships.
 Semi-structured data: may have some organizational properties but may also contain flexible elements and variable attributes.
 Unstructured data: no fixed format or structure; can include text, images, audio, video, etc.

Examples:
 Structured data: transactional data, customer information, financial records.
 Semi-structured data: JSON and XML files, log files, NoSQL databases.
 Unstructured data: text documents, social media posts, multimedia files.

Sources:
 Structured data: relational databases, data warehouses, spreadsheets.
 Semi-structured data: web APIs, log files, NoSQL databases, web scraping data.
 Unstructured data: social media platforms, file systems, sensor data.

Schema:
 Structured data: clearly defined schema with fixed data types and relationships.
 Semi-structured data: may have a flexible schema with nested or repeating elements.
 Unstructured data: no predefined schema; data may vary in content and format.

Processing:
 Structured data: well-suited for relational database management systems (RDBMS) and SQL queries.
 Semi-structured data: requires more flexible data processing tools and techniques.
 Unstructured data: requires specialized tools for text mining, natural language processing, and multimedia analysis.

Use cases:
 Structured data: business applications, financial transactions, inventory management.
 Semi-structured data: web APIs, IoT data, log analysis, document databases.
 Unstructured data: text analytics, sentiment analysis, image recognition, audio processing.

Is big data a problem or an opportunity?


Big data can be both a problem and an opportunity, depending on how
organizations approach it and utilize the data:

1. Problem:
 Data Overload: Managing and analyzing large volumes of data
can be overwhelming for organizations, leading to challenges in
data storage, processing, and analysis.
 Data Quality: Big data often includes diverse data sources with
varying levels of quality, accuracy, and reliability. Poor data
quality can lead to incorrect insights and decisions.
 Privacy and Security Concerns: Collecting and storing vast
amounts of data can raise concerns about privacy, security, and
compliance with data protection regulations. Mishandling
sensitive data can lead to legal and reputational risks.
 Costs and Infrastructure: Building and maintaining the
infrastructure required for big data analytics, including storage,
processing, and analytics tools, can be expensive and resource-
intensive.
2. Opportunity:
 Insights and Innovation: Big data provides organizations with
access to vast amounts of valuable information that can be
analyzed to uncover insights, trends, and patterns. These
insights can drive innovation, improve decision-making, and
create new business opportunities.
 Competitive Advantage: Organizations that effectively
harness big data analytics can gain a competitive edge by better
understanding customer needs, optimizing operations, and
anticipating market trends.
 Personalization and Customer Experience: Big data
analytics enables organizations to personalize products, services,
and experiences based on individual customer preferences and
behavior, leading to improved customer satisfaction and loyalty.
 Predictive Analytics: By analyzing historical data and
identifying predictive patterns, organizations can anticipate
future events, trends, and behaviors, enabling proactive
decision-making and risk mitigation.
 Social and Economic Impact: Big data has the potential to
address societal challenges in areas such as healthcare,
education, transportation, and environmental sustainability. By
leveraging big data analytics, organizations and governments
can drive positive social and economic outcomes.

In summary, while big data presents challenges such as data overload,
quality issues, and privacy concerns, it also offers significant opportunities
for organizations to gain insights, innovate, and create value. Effectively
managing and leveraging big data can turn these challenges into
opportunities for growth, competitiveness, and societal impact.

Clock Synchronization in Distributed System



In the world of distributed computing, where multiple systems collaborate to
accomplish tasks, ensuring that all the clocks are synchronized plays a crucial
role. Clock synchronization involves aligning the clocks of computers or
nodes, which enables efficient data transfer, smooth communication, and
coordinated task execution. This section explores the importance of clock
synchronization in distributed systems, discusses the challenges it
addresses, and outlines the approaches used to achieve synchronization.
Types of Clock Synchronization
 Physical clock synchronization
 Logical clock synchronization
 Mutual exclusion synchronization

Clock synchronization in a distributed file system (DFS) may be a big
challenge. How can it be solved?
Clock synchronization is indeed a critical challenge in distributed file systems
(DFS) because accurate timekeeping is essential for ensuring consistency,
coordination, and coherence across distributed nodes. Here are some
approaches to address this challenge:

1. Network Time Protocol (NTP): NTP is a widely used protocol for
synchronizing the clocks of computer systems over a network. It allows
distributed nodes to adjust their clocks by communicating with time
servers that provide accurate time information. Implementing NTP in a
DFS ensures that all nodes have synchronized clocks, reducing the risk
of data inconsistencies due to timestamp mismatches.
2. Vector Clocks: Vector clocks are logical clocks used in distributed
systems to establish a partial ordering of events based on causality.
Each node maintains a vector clock that tracks the local event history,
and during communication, nodes exchange and merge their vector
clocks to reconcile differences in event ordering. By incorporating
vector clocks into the DFS protocol, it's possible to track and manage
causal relationships between file operations, even in the absence of
perfectly synchronized physical clocks.
3. Quorum-Based Timestamping: In quorum-based distributed
systems, timestamps are assigned based on the agreement of a
sufficient number of nodes, known as a quorum. By requiring a quorum
of nodes to reach a consensus on timestamps for file operations, the
DFS can ensure that conflicting operations are properly ordered,
regardless of minor clock discrepancies between nodes.
4. Physical Clock Synchronization: In addition to software-based
approaches, ensuring physical clock synchronization among distributed
nodes can help minimize clock drift and skew. Using hardware-based
solutions such as precision time protocol (PTP) or GPS receivers,
organizations can achieve higher levels of clock accuracy and
consistency, which are particularly important for time-sensitive
applications or systems with stringent consistency requirements.
5. Clock Skew Detection and Correction: Implementing mechanisms
to detect and correct clock skew between distributed nodes can help
mitigate the impact of unsynchronized clocks. Techniques such as
periodically recalibrating clocks, estimating and compensating for clock
drift, or adjusting timestamps based on observed skew can help
maintain synchronization and consistency in the DFS environment.

Time in DS

- Each machine in a distributed system has its own clock providing the
physical time.
- A distributed system does not have a global physical time.
- Time synchronization is essential to know at what time of day a particular
event occurred at a particular computer within a system.

Physical Clock

- Each computer contains an electronic device that counts oscillations in a
crystal at a definite frequency and stores the division of the count by the
frequency in a register to provide the time. Such a device is called a
physical clock and the time shown is physical time.
- Since different computers in a distributed system have different crystals
that run at different rates, the physical clocks gradually get out of
synchronization and provide different time values.
- Due to this, it is very difficult to handle and maintain time critical real time
systems.
- Consistency of distributed data during any modification is based on time
factor.

Synchronization of Physical Clocks

The algorithms for synchronization of physical clocks are as follows:


1. Cristian's method
2. Berkeley's method
3. Network time protocol

Cristian’s Method

- It makes use of a time server to get the current time and helps to
synchronize computers externally.
- Upon request, the server process S provides the time according to its clock
to the requesting process p.
- This method achieves synchronization only if the round-trip times between
the client and the time server are sufficiently short compared to the required
accuracy.

Algorithm:
- A process p requests the time in a message mr and receives a time value t in
a message mt. Process p records the total round-trip time T(round) taken to
send request mr and receive reply mt.
- Assuming the elapsed time is split evenly before and after S placed t in mt,
the time estimate to which p should set its clock is t + T(round)/2.
- Let min be the minimum message transmission time, so that the earliest point
at which S could have placed the time in mt is min after p dispatched mr. Then:
a) The time by S's clock when the reply arrives is in the range
[t + min, t + T(round) - min]
b) The width of this range is T(round) - 2*min
c) The accuracy is ±(T(round)/2 - min)
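A minimal Python sketch of the client-side calculation described above; request_time_from_server is a hypothetical helper standing in for the actual message exchange with S:

import time

def cristian_sync(request_time_from_server, min_delay=0.0):
    # Estimate the server's current time using Cristian's method.
    t0 = time.monotonic()
    t = request_time_from_server()      # server S returns its clock value t in message mt
    t1 = time.monotonic()

    t_round = t1 - t0                   # total round-trip time T(round)
    estimate = t + t_round / 2          # p sets its clock to t + T(round)/2
    accuracy = t_round / 2 - min_delay  # accuracy bound: +/- (T(round)/2 - min)
    return estimate, accuracy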

Discussion:
- If a time server fails, the synchronization is impossible.
- To remove this drawback, time should be provided by a group of
synchronized time servers.
Berkeley’s Algorithm

- It is an algorithm for internal synchronization.


- A computer is chosen as a master.
- All other computers are slaves.
- Master periodically polls for the time of slaves and the slaves send back
their clock values to master.
- The master estimates local time of each slave by observing the round-trip
times.
- Master calculates average of obtained time including its own time.
- While calculating average, it eliminates faulty clocks by choosing a subset
of clocks that do not differ from one another by more than a specified
amount.
- The master then sends the amount by which each slave should adjust their
clock which may be positive or negative.
- If the master fails, one of the slaves can be elected to take the place of
master.
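A minimal Python sketch of the master's averaging step, assuming the master has already estimated each slave's clock value (round-trip compensation is omitted) and using an illustrative fault threshold:

def berkeley_adjustments(master_time, slave_times, max_deviation=10.0):
    # Return the clock adjustment (positive or negative) each node should apply.
    all_times = {"master": master_time, **slave_times}

    # Eliminate faulty clocks: here, simply drop clocks too far from the master's.
    usable = {name: t for name, t in all_times.items()
              if abs(t - master_time) <= max_deviation}

    average = sum(usable.values()) / len(usable)
    return {name: average - t for name, t in all_times.items()}

# Example: master at 100.0 polls three slaves; the third clock is clearly faulty.
print(berkeley_adjustments(100.0, {"s1": 102.0, "s2": 99.0, "s3": 180.0}))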

Network Time Protocol (NTP)

- It defines an architecture to enable clients across the Internet to be
synchronized accurately to UTC.
- It synchronizes against many time servers.

Design Aims:
- Adjust system clock close to UTC over Internet.
- Handle bad connectivity
- Enable frequent resynchronization
- Security
Hierarchical Structure of NTP:

- NTP is provided by a network of servers located across the Internet.


- Primary servers are connected directly to a UTC time source (e.g., a radio clock).
- Secondary servers are synchronized with primary servers.
- The logical hierarchy of server connection is called synchronization subnet.
- Each level of synchronization subnet is called stratum.
- Lowest level executes in user's workstation.
- Servers with higher stratum numbers are liable to have less accurate clocks.

Logical Clock

- Logical clock is a virtual clock that records the relative ordering of events in
a process.
- It is a monotonically increasing software counter.
- It is realized whenever relative ordering of events is more important than
the physical time.
- Physical clocks are not needed to be synchronized.
- The value of logical clock is used to assign time stamps to the events.

Lamport Logical Clock
(reference: youtube.com/watch?v=taRbDdlAK7s&ab_channel=PerfectComputerEngineer)
- The ordering of events is based on two situations:
1. If two events occurred within the same process, they occurred in the order
in which that process observes them.
2. Whenever a message is sent between processes, the event of sending
message occurred before the event of receiving the message.

- Lamport generalizes the two conditions to form happened-before relation,


denoted by -->
i.e a --> b ; meaning that event a happened before event b

- According to Lamport:
1. If a and b are events in the same process Pi and a occurs before b, then a --> b.
2. For any message m, send(m) --> receive(m).
3. If a, b and c are events such that a --> b and b --> c, then a --> c.
4. If a --> b, event a causally affects event b.
5. If neither a --> e nor e --> a holds, then a and e are concurrent events, which
can be written as a || e.

Implementation Rules:
1. CPi is incremented before each event is issued at process Pi:
CPi := CPi + 1
2. a) When send(m) is an event of process Pi, the timestamp tm = CPi(a) is
included in m.
b) On receiving message m, process Pj updates its clock CPj as:
CPj := max [ CPj, tm ]
c) The new value of CPj is used to timestamp the event receive(m) by Pj.
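A minimal Python sketch of these rules (process identity and message transport are abstracted away; this is illustrative only):

class LamportClock:
    def __init__(self):
        self.counter = 0                      # CPi, a monotonically increasing counter

    def local_event(self):
        self.counter += 1                     # Rule 1: CPi := CPi + 1 before each event
        return self.counter

    def send_event(self):
        return self.local_event()             # Rule 2a: timestamp tm = CPi is included in m

    def receive_event(self, tm):
        self.counter = max(self.counter, tm)  # Rule 2b: CPj := max[CPj, tm]
        return self.local_event()             # Rule 2c: the new value timestamps receive(m)

# Example: P1 sends a message to P2.
p1, p2 = LamportClock(), LamportClock()
tm = p1.send_event()          # P1's clock becomes 1; tm = 1
print(p2.receive_event(tm))   # P2's clock becomes max(0, 1) + 1 = 2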

Problems:
1. Lamport's logical clock imposes only a partial order on the set of events;
pairs of distinct events of different processes can have identical timestamps.
2. Total ordering can be enforced by using a global logical timestamp (e.g.,
appending the process identifier to break ties).
Vector Clock

- A vector clock gives the ability to decide whether two events are causally
related or not by looking at their timestamps.
- A vector clock for a system of N processes is an array of N integers.
- Each process keeps its own vector clock Vi, used to timestamp local events.
- The disadvantage is that it requires storage and message payload
proportional to the number of processes.

Rules for clock update

1. Initially Vi[j] = 0 for i, j = 1, 2, 3, ..., N.
2. Just before Pi timestamps an event, it sets Vi[i] := Vi[i] + 1.
3. Pi includes the value t = Vi in every message it sends.
4. When Pi receives a timestamp t in a message, it sets Vi[j] := max(Vi[j],
t[j]) for j = 1, 2, 3, ..., N.
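A minimal Python sketch of these update rules for a fixed number of processes N (indices start at 0 here instead of 1):

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n          # Rule 1: initially Vi[j] = 0 for all j
        self.i = i                # index of this process Pi

    def event(self):
        self.v[self.i] += 1       # Rule 2: Vi[i] := Vi[i] + 1 before timestamping an event
        return list(self.v)

    def send(self):
        return self.event()       # Rule 3: Pi includes t = Vi in every message it sends

    def receive(self, t):
        # Rule 4: merge the incoming timestamp component-wise, then timestamp the receive event.
        self.v = [max(a, b) for a, b in zip(self.v, t)]
        return self.event()

# Example with three processes: P0 sends a message to P1.
p0, p1 = VectorClock(3, 0), VectorClock(3, 1)
t = p0.send()                 # P0's clock: [1, 0, 0]
print(p1.receive(t))          # P1's clock: [1, 1, 0]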

Chapter 2

Key differences between Google File System (GFS) and traditional file
systems:

Purpose:
 GFS: designed for large-scale distributed data-intensive applications like web indexing, crawling, and processing.
 Traditional file systems: generally used for conventional computing tasks on single machines or small clusters.

Scalability:
 GFS: highly scalable, capable of handling petabytes of data across thousands of commodity servers.
 Traditional file systems: limited scalability compared to GFS, typically suitable for smaller-scale deployments.

Fault tolerance:
 GFS: built-in fault tolerance with automatic replication of data blocks across multiple servers to ensure reliability.
 Traditional file systems: rely more on hardware redundancy (e.g., RAID) and backup systems for fault tolerance.

Consistency model:
 GFS: provides a relaxed consistency model, prioritizing availability and partition tolerance over strong consistency.
 Traditional file systems: often provide strong consistency guarantees for data integrity.

Access patterns:
 GFS: optimized for sequential read/write access patterns common in data-intensive applications like MapReduce.
 Traditional file systems: designed to handle various access patterns, including random access commonly seen in general-purpose computing.

Metadata management:
 GFS: separates metadata from data and employs a master server to manage metadata operations efficiently.
 Traditional file systems: metadata operations are typically handled alongside data operations, potentially leading to scalability challenges.

File size limitations:
 GFS: supports extremely large file sizes, suitable for storing massive datasets efficiently.
 Traditional file systems: may have limitations on file size due to underlying filesystem design or operating system constraints.

Data locality:
 GFS: prioritizes data locality, ensuring that computation is performed as close to the data as possible to minimize network overhead.
 Traditional file systems: data locality may vary depending on the storage architecture but may not be as optimized as in GFS.

Concurrent access control:
 GFS: supports concurrent access control mechanisms for multiple clients accessing the same files simultaneously.
 Traditional file systems: also support concurrent access but may have limitations or require additional configurations for scalability.

Main role of the GFS master during the read/write process

In Google File System (GFS), the master node plays a crucial role in
coordinating and managing the file system operations, including read and
write processes. Its main roles during read and write processes are as
follows:

1. Metadata Management: The master node maintains metadata about
the file system, including information about file locations, file size,
permissions, and replication factors. During write processes, the
master determines where data blocks will be stored and ensures that
replicas are created across multiple data nodes for fault tolerance.
2. Namespace Operations: It handles namespace operations such as
file creation, deletion, and renaming. When a write process occurs, the
master node allocates a unique identifier for the file and records its
metadata.
3. Chunk Location Management: The master node keeps track of the
locations of data chunks across the cluster. When a write process
occurs, the master node selects suitable data nodes to store the data
chunks and updates the chunk locations accordingly.
4. Replication Management: It manages the replication of data chunks
to ensure fault tolerance and data reliability. During write processes,
the master node decides the number of replicas required and
coordinates their creation on different data nodes.
5. Load Balancing: The master node may perform load balancing by
distributing read and write requests across data nodes to optimize
resource utilization and reduce latency.
6. Access Control: It enforces access control policies, ensuring that only
authorized users or processes can read from or write to the file system.
This involves authentication and authorization checks during read and
write processes.
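A minimal, illustrative Python sketch of how a client-side read could use this division of labour; master_lookup and chunkserver_read are hypothetical helpers standing in for the actual RPCs, and the 64 MB chunk size is the value GFS typically uses:

CHUNK_SIZE = 64 * 1024 * 1024        # GFS chunks are typically 64 MB

def gfs_read(filename, offset, length, master_lookup, chunkserver_read):
    # Sketch of a GFS-style read: ask the master for chunk metadata,
    # then fetch the data directly from a chunkserver replica.
    chunk_index = offset // CHUNK_SIZE                       # which chunk of the file holds the offset
    handle, replicas = master_lookup(filename, chunk_index)  # master returns chunk handle + replica locations

    chunk_offset = offset % CHUNK_SIZE
    # The file data itself never flows through the master.
    return chunkserver_read(replicas[0], handle, chunk_offset, length)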

The use of large, fixed-size chunks in Google File System (GFS) offers
several advantages and disadvantages:

Merits:

1. Reduced Metadata Overhead: Large chunks reduce clients' need to interact
with the master because reads and writes on the same chunk require only one
initial request to the master for chunk location information. The reduction
is especially significant for GFS workloads because applications mostly
read and write large files sequentially. Even for small random reads, the
client can comfortably cache all the chunk location information for a
multi-TB working set.
2. Optimized for Streaming Access: Large chunk sizes are well-suited
for data-intensive applications like MapReduce, where data is
processed in a streaming fashion. With larger chunks, the system can
efficiently stream data to processing nodes, reducing the overhead
associated with managing smaller data blocks and improving overall
processing speed.
3. Enhanced Fault Tolerance: Large chunks enable efficient replication
of data across multiple servers. By replicating entire chunks rather
than individual blocks, GFS can ensure fault tolerance and data
reliability without excessive duplication or overhead.
4. Persistent TCP Connection to Chunkserver: Since a client is more
likely to perform multiple operations on a given chunk due to its large
size, maintaining a persistent TCP connection to the chunkserver can
significantly reduce network overhead. With a persistent connection,
the client can send multiple read and write requests to the
chunkserver without incurring the overhead of establishing a new
connection for each operation.
5. Reduction of Metadata Size on the Master: By using large chunks,
the total number of chunks in the system decreases, thereby reducing
the size of metadata stored on the master node. Since metadata
management can become a bottleneck in distributed file systems,
reducing the metadata size allows the master node to efficiently
handle metadata operations such as namespace operations, chunk
location management, and access control.

Demerits:

1. Increased Space Wastage: Large, fixed-size chunks can lead to
space wastage, especially when storing small files. If a file's size is
smaller than the chunk size, the remaining space within the chunk may
go unused, resulting in inefficient storage utilization.
2. Limited Random Access Performance: While large chunks are
optimized for streaming access patterns, they may not perform as well
for random access workloads. Applications that require frequent
random access to small portions of data within large files may
experience degraded performance compared to file systems with
smaller block sizes.
3. Difficulty Handling Variable-Sized Files: GFS may face challenges
when dealing with files of varying sizes. Files significantly smaller than
the chunk size may incur additional overhead due to space wastage,
while very large files may require multiple chunks, potentially
complicating data management and access patterns.
4. Increased Latency for Small Operations: Small read or write
operations may incur higher latency compared to file systems with
smaller block sizes. This latency overhead is due to the need to read or
write entire chunks even for small data requests, resulting in increased
I/O overhead.

Why is the single master not a bottleneck in a GFS cluster?


In Google File System (GFS), having a single master node might seem like a
potential bottleneck, but GFS is designed in a way to mitigate this concern
effectively. Several design choices and optimizations help alleviate the risk
of the master node becoming a bottleneck:
1. Metadata is cached on the client side, so there are fewer requests to the
master and less overhead for it.
2. Metadata is replicated across multiple machines for fault tolerance.
3. The master node carries a light load (it handles only metadata, not file data).
4. High availability and fault tolerance are provided by shadow masters.
5. Replica load balancing and re-replication.
6. Deletion of orphaned chunks saves space and keeps performance efficient.

Why was the Google File System created? What are the
assumptions made by the GFS?
The Google File System (GFS) was created to address the unique storage
challenges faced by Google's massive-scale distributed computing
infrastructure. Some of the primary reasons for its creation include:

1. Scalability: Google's infrastructure needed a file system that could
scale to handle petabytes of data spread across thousands of
commodity servers. Traditional file systems were not designed for such
massive scale.
2. Reliability: Google required a highly reliable and fault-tolerant file
system to ensure the availability and integrity of data despite
hardware failures, which are common in large-scale distributed
systems.
3. Performance: The file system needed to provide high throughput and
low latency for both read and write operations, supporting the
demands of Google's diverse workloads, including web indexing,
search, and other data-intensive applications.
4. Simplicity: Google aimed to create a file system that was easy to
manage and operate at scale. This involved streamlining the
architecture and minimizing complexity wherever possible.
5. Cost-effectiveness: Leveraging commodity hardware and open-
source software allowed Google to build a cost-effective storage
solution that could meet its requirements without relying on expensive
proprietary hardware or software.
6. Data locality
7. Availability

Assumptions made by the Google File System:

1. Component Failures are Common: GFS assumes that component
failures, including disk failures, server failures, and network failures,
are common in large-scale distributed systems. As such, it is designed
to tolerate these failures gracefully without compromising data
availability or integrity.
2. Reads are Dominant: GFS assumes that most of the operations
performed on the file system are reads, rather than writes. This
influences the design decisions regarding data replication, caching,
and data placement to optimize read performance.
3. Large Sequential Writes: GFS assumes that applications often
perform large sequential writes rather than small random writes. This
influences the design of the file system's chunk size, which is
optimized for efficient streaming write operations.
4. Cooperative Environment: GFS assumes a cooperative environment
where applications are aware of the underlying file system and can
interact with it efficiently. This allows for optimizations such as data
locality-aware scheduling and placement.
5. Simplicity over Richness: GFS prioritizes simplicity and robustness
over a rich feature set. This means that certain advanced features
found in traditional file systems may be omitted in favor of a
streamlined design that is easier to manage and operate at Google's
scale.

What is master-slave architecture?


Master-slave architecture, also known as master-worker architecture or
master-node architecture, is a design pattern commonly used in distributed
computing systems. In this architecture, one node, known as the master or
controller, controls and coordinates the overall operation of the system,
while one or more nodes, known as slaves or workers, perform tasks
assigned by the master.

Here's a breakdown of the roles and responsibilities in a master-slave
architecture:

1. Master (Controller):
 The master node is responsible for coordinating and managing
the activities of the system.
 It typically maintains the global state of the system, handles
resource allocation, and distributes tasks to the slave nodes.
 The master node often serves as a central point of
communication, receiving requests from clients or users and
dispatching them to the appropriate slaves.
 In fault-tolerant systems, there may be multiple master nodes
operating in a redundant or leader-follower configuration to
ensure high availability and reliability.
2. Slave (Worker):
 The slave nodes are responsible for executing tasks assigned by
the master node.
 They perform computational or processing tasks, such as data
processing, calculations, or data storage operations.
 Slaves typically execute tasks in parallel, enabling distributed
and concurrent processing, which can improve overall system
performance and scalability.
 Slaves may report task status and results back to the master
node, allowing the master to monitor progress and manage
workload distribution dynamically.
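A minimal single-machine Python sketch of the pattern using threads and a shared queue (purely illustrative; in a real distributed system the workers would run on separate machines):

import queue
import threading

tasks, results = queue.Queue(), queue.Queue()

def worker():
    # Slave/worker: repeatedly take a task assigned by the master and execute it.
    while True:
        item = tasks.get()
        if item is None:            # sentinel from the master: stop
            break
        results.put(item * item)    # the "work": square the number

# Master: starts workers, distributes tasks, collects results.
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    tasks.put(n)
for _ in workers:
    tasks.put(None)                 # one stop signal per worker
for w in workers:
    w.join()

print(sorted(results.get() for _ in range(10)))   # [0, 1, 4, 9, ..., 81]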

What is client-server architecture?


Client-server architecture is a computing model that divides tasks or
processes between clients and servers, where clients request services or
resources from servers, which then fulfill those requests. This architecture is
prevalent in networked systems and applications, facilitating efficient
distribution of computational tasks and resources. Here's an overview of the
client-server architecture:

1. Client:
 The client is a device, application, or system that initiates
requests for services or resources from servers.
 Clients are typically end-user devices such as computers,
smartphones, tablets, or IoT devices, as well as software
applications that interact with servers over a network.
 Clients may include user interfaces or user-facing components
that allow users to interact with the system and initiate requests.
2. Server:
 The server is a device, application, or system that provides
services, resources, or data in response to client requests.
 Servers are typically more powerful and have more
computational resources than clients, enabling them to handle
multiple client requests simultaneously.
 Servers may be dedicated physical machines or virtualized
instances running on cloud infrastructure, providing various
services such as web hosting, database management, file
storage, application processing, or network services.

Why do we have a single master in GFS and millions of chunk servers?

The design choice of having a single master and millions of chunk servers in
GFS enables the system to achieve:
 scalability
 fault tolerance
 efficient data storage and retrieval across large-scale distributed
infrastructures.
 centralized metadata management

Explain how the master implements garbage collection and detects stale
replicas in GFS

In Google File System (GFS), the master node plays a crucial role in
implementing garbage collection and detecting stale replicas to maintain the
integrity and efficiency of the file system. Here's how the master
accomplishes these tasks:

1. Garbage Collection:
Garbage collection in GFS involves reclaiming storage space occupied
by stale or obsolete data chunks. The master implements garbage
collection through the following steps:
a. Identifying Stale Chunks: The master periodically scans the
namespace metadata to identify stale or orphaned data chunks. Stale
chunks are those that are no longer referenced by any file or have
become obsolete due to file deletions or modifications.
b. Marking Stale Chunks: Once identified, the master marks stale
chunks as garbage, indicating that they can be safely reclaimed. The
master maintains a list of garbage chunks in its metadata.
c. Notifying Chunk Servers: After marking chunks as garbage, the
master sends notifications to the respective chunk servers, instructing
them to delete the garbage chunks and reclaim the associated storage
space.
d. Data Reclamation: Chunk servers follow the master's instructions
to delete the garbage chunks and free up storage space for future use.
e. Verification and Confirmation: Once garbage collection is
completed, the master verifies that the garbage chunks have been
successfully deleted by communicating with the chunk servers. If any
inconsistencies are detected, the master takes appropriate corrective
actions.
2. Stale Replica Detection:
Stale replica detection is essential for maintaining data consistency
and ensuring fault tolerance in GFS. The master node continuously
monitors the health and status of data replicas to detect stale or
outdated replicas. Here's how it works:
a. Heartbeat Mechanism: Chunk servers regularly send heartbeat
signals to the master to indicate their availability and status. The
master monitors these heartbeats to track the liveness of chunk
servers.
b. Replica Status Tracking: The master maintains metadata about
the location and status of data replicas across chunk servers. It keeps
track of the version numbers and timestamps associated with each
replica.
c. Comparing Replica Versions: When a client requests data from
GFS, the master determines the appropriate replicas to serve the
request. Before serving the data, the master compares the version
numbers of the replicas to ensure consistency.
d. Detecting Stale Replicas: If the master detects a replica with an
outdated version or timestamp, it marks the replica as stale. Stale
replicas may occur due to chunk server failures, network partitions, or
data corruption.
e. Replica Reclamation: Once a replica is marked as stale, the
master initiates replica reclamation by instructing the affected chunk
servers to delete the stale replicas and initiate replica re-replication to
maintain the desired replication factor.

By implementing garbage collection and stale replica detection, the master
node in GFS ensures the integrity, reliability, and efficiency of the distributed
file system, allowing it to handle large-scale data processing tasks reliably
and efficiently.

Explain the metadata stored by the GFS master


In Google File System (GFS), the master node is responsible for managing
metadata, which includes information about the file system's namespace,
file-to-chunk mappings, chunk locations, and other essential details needed
for the system's operation. Here's an overview of the metadata stored by the
GFS master:

1. Namespace Information:
 The master maintains a hierarchical namespace that organizes
files and directories within the file system. Each file and directory
is represented by a unique identifier (inode) and associated
metadata.
 For each file or directory, the master stores metadata such as
the file/directory name, inode number, permissions, creation
time, and last modified time.
 The namespace metadata allows the master to efficiently handle
file operations such as file creation, deletion, renaming, and
directory traversal.
2. File-to-Chunk Mapping:
 GFS breaks large files into fixed-size data chunks (typically 64
MB in size). The master maintains mappings that associate each
file with its constituent data chunks.
 For each file, the master stores information about the sequence
of data chunks that comprise the file, along with their respective
chunk handles (unique identifiers).
 The file-to-chunk mapping allows the master to determine the
location of data chunks across chunk servers and coordinate
data access and retrieval operations.
3. Chunk Locations:
 GFS replicates data chunks across multiple chunk servers for
fault tolerance and data reliability. The master maintains
information about the locations of data chunks and their replicas.
 For each data chunk, the master stores metadata indicating the
chunk handle, the locations of replica copies, and other relevant
details such as version numbers and timestamps.
 The chunk location metadata enables the master to ensure that
data chunks are appropriately replicated and distributed across
the cluster, facilitating efficient data access and fault tolerance.
4. Lease Information:
 GFS employs leases to manage concurrent access to data and
ensure consistency. The master maintains lease information for
files and data chunks to track which clients hold leases for
reading or writing data.
 Lease information includes details such as lease holder, lease
expiration time, and lease duration.
 By managing leases, the master coordinates access to data and
enforces consistency semantics, such as ensuring that only one
client holds a write lease at a time for a given chunk.
5. Heartbeat and Health Information:
 The master receives periodic heartbeat signals from chunk
servers, indicating their availability and status. The master stores
metadata about chunk server health, including heartbeat
timestamps and server statuses.
 Heartbeat and health information allow the master to monitor
the liveness of chunk servers, detect failures, and take
appropriate actions such as reallocating data or initiating replica
re-replication.
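Purely as an illustration of the kinds of tables described above (simplified, not the actual GFS data structures), the master's in-memory metadata can be pictured like this in Python:

# Simplified, illustrative picture of the master's in-memory metadata.
master_metadata = {
    # Namespace: hierarchical file/directory information.
    "namespace": {
        "/logs/web-00": {"permissions": "rw-r--r--", "created": 1700000000},
    },
    # File-to-chunk mapping: ordered chunk handles that make up each file.
    "file_to_chunks": {
        "/logs/web-00": ["chunk-0001", "chunk-0002"],
    },
    # Chunk locations: replica chunkservers and version number per chunk handle.
    "chunk_locations": {
        "chunk-0001": {"replicas": ["cs-a", "cs-b", "cs-c"], "version": 3},
    },
    # Leases: which replica currently acts as primary for writes, and until when.
    "leases": {
        "chunk-0001": {"primary": "cs-a", "expires": 1700000060},
    },
    # Heartbeats: last heartbeat timestamp received from each chunkserver.
    "heartbeats": {"cs-a": 1700000050, "cs-b": 1700000049, "cs-c": 1700000051},
}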

Explain the purpose of GFS garbage collection as opposed to implementing
eager deletion for storage reallocation
In Google File System (GFS), garbage collection serves the purpose of
reclaiming storage space occupied by stale or obsolete data chunks, thereby
optimizing storage utilization and ensuring efficient resource allocation
within the distributed file system. Garbage collection is particularly important
in systems like GFS where data is stored in large, fixed-size chunks and
where storage resources are shared among multiple clients and applications.

Here's why GFS employs garbage collection instead of implementing eager
deletion for storage reallocation:

1. Optimization of Storage Utilization: Eager deletion, where data is
immediately removed upon file deletion or modification, can lead to
fragmentation and inefficient storage utilization. In a distributed file
system like GFS, where data is stored in large, fixed-size chunks, eager
deletion may result in small, unused gaps between chunks, leading to
wasted space and suboptimal storage efficiency. Garbage collection
allows GFS to reclaim such wasted space by identifying and
consolidating unused or obsolete data chunks, thereby optimizing
storage utilization and maximizing available capacity.
2. Minimization of Data Movement: Eager deletion may require
frequent data movement and reallocation operations, especially in
systems with dynamic workloads and frequent file modifications. This
can lead to increased overhead and latency, as well as unnecessary
wear and tear on storage devices. Garbage collection in GFS operates
periodically and selectively, minimizing the frequency of data
movement and reallocation while still ensuring efficient space
reclamation. By consolidating unused data chunks during garbage
collection, GFS reduces the need for frequent data movement and
optimizes storage performance.
3. Maintenance of Data Consistency and Integrity: Garbage
collection in GFS is designed to maintain data consistency and integrity
while reclaiming storage space. The master node carefully manages
garbage collection operations to ensure that only stale or obsolete
data chunks are removed, while active and valid data remains intact.
This helps prevent data loss, corruption, or inconsistencies that may
occur with aggressive or indiscriminate deletion strategies. By
implementing garbage collection, GFS can safely reclaim storage space
without compromising data integrity or risking accidental data loss.
4. Scalability and Fault Tolerance: Garbage collection in GFS is
designed to scale efficiently and operate reliably in large-scale
distributed environments. The master node coordinates garbage
collection operations across the cluster, ensuring consistent and
coordinated space reclamation while maintaining fault tolerance and
system reliability. This scalable and fault-tolerant approach enables
GFS to manage storage resources effectively across thousands of
machines and petabytes of data, without sacrificing performance or
availability.
5. Synchronization with Background Activity: Garbage collection lets the
master fold storage reclamation into its regular background scans of the
namespace, amortizing the cost instead of paying it eagerly on every
deletion.
6. Avoidance of Internal Fragmentation: Because data lives in large,
fixed-size chunks, eager deletion would leave internal fragmentation inside
partially freed chunks; lazy garbage collection instead reclaims whole
chunks once they are no longer referenced.
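The following is a minimal Python sketch of the lazy-deletion idea described
above (hide the file on delete, periodically scan with a grace period, then
reclaim chunks that no live file references). The names and the three-day
grace period are illustrative assumptions, not GFS's actual implementation.

import time

GRACE_PERIOD = 3 * 24 * 3600           # assumed grace period before real removal

def delete_file(namespace, path):
    # Eager deletion would free chunks here; lazy deletion just hides the file.
    namespace[path]["hidden_since"] = time.time()

def garbage_collect(namespace, chunk_table):
    now = time.time()
    # 1. Drop hidden files whose grace period has expired.
    for path, meta in list(namespace.items()):
        if meta.get("hidden_since") and now - meta["hidden_since"] > GRACE_PERIOD:
            del namespace[path]
    # 2. Any chunk no longer referenced by a live file is garbage.
    live_chunks = {c for meta in namespace.values() for c in meta["chunks"]}
    for handle in list(chunk_table):
        if handle not in live_chunks:
            del chunk_table[handle]     # chunk servers are later told to drop these replicas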

Chapter 3
Student Name College Name Final Marks
Ram ABC 70
Sita ABC 80
Hari ABC 60
Gita XYZ 90
Rita AYZ 80
Shyam PQR 90
Laxmi PQR 70
Gopal PQR 60
The task is to find the maximum marks of each college using MapReduce.

To achieve the task of finding the maximum marks of each college using
MapReduce, we need to design the map and reduce functions appropriately.
Here's how we can structure them:

Map Function:
 Input: (Student Name, College Name, Final Marks)
 Output: (College Name, Final Marks)
Reduce Function:
 Input: (College Name, [List of Final Marks])
 Output: (College Name, Maximum Marks)

Let's list the input to/output from both the map and reduce functions:

Map Function:
 Input:
 (Ram, ABC, 70)
 (Sita, ABC, 80)
 (Hari, ABC, 60)
 (Gita, XYZ, 90)
 (Rita, AYZ, 80)
 (Shyam, PQR, 90)
 (Laxmi, PQR, 70)
 (Gopal, PQR, 60)
 Output:
 (ABC, 70)
 (ABC, 80)
 (ABC, 60)
 (XYZ, 90)
 (AYZ, 80)
 (PQR, 90)
 (PQR, 70)
 (PQR, 60)
Reduce Function:
 Input:
 (ABC, [70, 80, 60])
 (XYZ, [90])
 (AYZ, [80])
 (PQR, [90, 70, 60])
 Output:
 (ABC, 80)
 (XYZ, 90)
 (AYZ, 80)
 (PQR, 90)

In the reduce function, for each college name, we find the maximum value
from the list of final marks associated with that college and output the
college name along with its maximum marks.
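Below is a small, self-contained Python simulation of this job's map and
reduce logic (not an actual Hadoop program), so the data flow above can be
run and checked locally.

from collections import defaultdict

records = [
    ("Ram", "ABC", 70), ("Sita", "ABC", 80), ("Hari", "ABC", 60),
    ("Gita", "XYZ", 90), ("Rita", "AYZ", 80),
    ("Shyam", "PQR", 90), ("Laxmi", "PQR", 70), ("Gopal", "PQR", 60),
]

def map_fn(student, college, marks):
    yield (college, marks)                    # emit (College Name, Final Marks)

def reduce_fn(college, marks_list):
    return (college, max(marks_list))         # emit (College Name, Maximum Marks)

# Shuffle and sort phase: group the mapper output by key.
grouped = defaultdict(list)
for record in records:
    for college, marks in map_fn(*record):
        grouped[college].append(marks)

for college, marks_list in grouped.items():
    print(reduce_fn(college, marks_list))     # ('ABC', 80), ('XYZ', 90), ('AYZ', 80), ('PQR', 90)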

Why do we require the MapReduce framework?

MapReduce is a programming model and processing framework designed to
handle large-scale data processing tasks across clusters of computers. It was
popularized by Google and later implemented in open-source projects like
Apache Hadoop. Here are several reasons why MapReduce frameworks are
required:
1. Scalability: MapReduce allows processing of massive datasets by
distributing the workload across a large number of commodity
hardware nodes in a cluster. This enables horizontal scalability,
meaning you can add more machines to the cluster to handle
increased data volumes.
2. Fault tolerance: MapReduce frameworks like Hadoop provide fault
tolerance by automatically handling node failures. When a node fails
during processing, tasks are automatically rerouted to other nodes,
ensuring that processing continues without interruption.
3. Parallel processing: MapReduce breaks down tasks into smaller,
independent units (map and reduce tasks) that can be executed in
parallel across multiple nodes. This parallelism enables faster
processing of data compared to traditional sequential processing.
4. Ease of programming: MapReduce abstracts away many of the
complexities of distributed computing, making it easier for developers
to write distributed applications. Developers only need to implement
the map and reduce functions, and the framework takes care of
distributing data, scheduling tasks, and managing cluster resources.
5. Data locality: MapReduce frameworks are designed to optimize data
locality, meaning that processing tasks are scheduled on nodes where
the data they require is already stored. This reduces network traffic
and improves performance by minimizing data transfer across the
cluster.
6. Support for diverse data types and file formats: MapReduce
frameworks can handle a wide variety of data types, including
structured, semi-structured, and unstructured data. This flexibility
makes them suitable for processing different types of data, such as
text, logs, images, and more.
7. Cost-effectiveness: MapReduce frameworks utilize commodity
hardware and open-source software, making them a cost-effective
solution for large-scale data processing compared to proprietary
solutions.

Overall, MapReduce frameworks are essential for handling the challenges of
big data processing, including scalability, fault tolerance, parallelism, and
ease of programming. They enable organizations to efficiently process and
analyze vast amounts of data to extract valuable insights and make
data-driven decisions.
Explain in detail how failures are handled in the MapReduce framework, along
with an example

Failures in a MapReduce framework are handled through mechanisms such as
fault tolerance, speculative execution, and automatic recovery. These
mechanisms ensure that data processing tasks can continue despite failures
occurring in the cluster. Let's delve into each of these aspects with an
example:

1. Fault Tolerance:

 In a MapReduce framework like Apache Hadoop, fault tolerance is
achieved primarily through data replication and task redundancy.
 Data replication: Input data in Hadoop Distributed File System (HDFS)
is replicated across multiple nodes in the cluster. By default, HDFS
replicates each data block three times. If a node containing a data
block fails, the framework can retrieve the data from a replica stored
on another node.
 Task redundancy: MapReduce tasks are rerun on other nodes if a node
fails during task execution. The framework keeps track of task progress
and automatically reschedules failed tasks on available nodes.

Example: Let's consider a scenario where a Hadoop cluster is processing a
large dataset consisting of log files. The dataset is divided into multiple input
splits, with each split processed by a map task. Additionally, there are
several reduce tasks for aggregating the intermediate outputs generated by
the map tasks.

If a node in the cluster fails while executing a map task, the following steps
occur:

 The framework detects the failure of the node.
 The map task being executed on the failed node is marked as failed.
 Any intermediate output produced by the failed map task is also
invalidated.
 The framework reschedules the failed map task on another node in the
cluster.
 If necessary, the framework re-replicates any lost intermediate data
blocks to ensure fault tolerance.

2. Speculative Execution:

 Speculative execution is a technique used to mitigate the impact of
slow-running tasks by launching duplicate instances of tasks on
different nodes.
 The framework monitors the progress of tasks and identifies tasks that
are progressing slower than expected.
 Duplicate instances of slow tasks, called speculative tasks, are
launched on other nodes. The first instance to complete successfully is
used, while the others are terminated.
Example: Suppose a reduce task is running slower than expected due to
data skew or node resource contention. The framework identifies this slow-
running task and launches a speculative instance of the same task on
another node. Once any instance of the task completes successfully, the
other speculative instances are terminated to avoid unnecessary resource
consumption.

3. Automatic Recovery:

 MapReduce frameworks are designed to automatically recover from
various types of failures, including node failures, task failures, and
communication failures.
 Upon detecting a failure, the framework takes necessary actions to
restore the system to a consistent state and continue processing
without manual intervention.

Example: If a node failure occurs during the execution of a reduce task, the
framework redistributes the incomplete intermediate data and reruns the
affected reduce task on another node. This process ensures that the overall
job progresses towards completion despite the failure.

What is Optimization and Locality in MapReduce?

In the context of MapReduce, optimization and locality are important
concepts for improving the performance and efficiency of data processing
tasks on distributed systems. Let's break down each concept:

1. Optimization: Optimization in MapReduce refers to the techniques
used to enhance the performance of data processing tasks by
minimizing resource consumption, reducing execution time, and
improving overall efficiency. Optimization techniques can be applied at
various stages of the MapReduce workflow, including data partitioning,
task scheduling, data shuffling, and task execution.
Some common optimization techniques in MapReduce include:
 Combiners: Combiners are mini-reducers that operate on the
output of mappers before sending data over the network to
reducers. They help reduce the amount of data transferred over
the network and improve overall performance by performing
partial aggregation locally.
 Partitioning: Partitioning data evenly across nodes helps
balance the workload and ensures that each node processes a
roughly equal amount of data, thus preventing stragglers and
improving overall performance.
 Speculative Execution: Speculative execution involves
launching redundant tasks on different nodes for tasks that are
running slower than expected. This helps mitigate the impact of
stragglers and ensures timely completion of jobs.
 Data Compression: Compressing intermediate data between
map and reduce phases reduces the amount of data transferred
over the network, thus improving performance.
 Data Skew Handling: Techniques for handling data skew, such
as data skew detection, skew mitigation strategies, and skewed
data partitioning, help prevent performance bottlenecks caused
by uneven data distribution.

2. Locality: Locality in MapReduce refers to the principle of moving
computation close to the data it operates on, thereby minimizing data
movement across the network. This principle is crucial for improving
performance and efficiency in distributed systems, where data is often
distributed across multiple nodes.
There are three types of locality in MapReduce:
 Data Locality: Data locality refers to the principle of scheduling
tasks on nodes where the input data resides. By processing data
locally, MapReduce avoids the overhead of transferring large
volumes of data over the network, thereby improving
performance.
 Compute Locality: Compute locality refers to the principle of
scheduling tasks close to the nodes where computational
resources are available. By minimizing task scheduling overhead
and reducing resource contention, compute locality improves
overall efficiency.
 Network Locality: Network locality refers to the principle of
reducing network communication overhead by scheduling tasks
on nodes that are connected by high-speed, low-latency network
links. By minimizing data transfer latency and network
congestion, network locality improves performance and
scalability.
MapReduce frameworks such as Hadoop and Apache Spark employ
various strategies to achieve locality, including task scheduling
policies, data placement optimizations, and network-aware job
scheduling algorithms.
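As a concrete illustration of data locality, here is a minimal Python sketch
of a scheduler that prefers nodes already holding a replica of the input
block. The structures and names are assumptions for illustration, not
Hadoop's or Spark's actual scheduler code.

def schedule_task(block_id, block_locations, free_slots):
    """Pick a node for a map task, preferring nodes that already store the block."""
    local_nodes = [n for n in block_locations.get(block_id, []) if free_slots.get(n, 0) > 0]
    if local_nodes:
        return local_nodes[0]                    # data-local: no block transfer needed
    remote = [n for n, slots in free_slots.items() if slots > 0]
    return remote[0] if remote else None         # fall back to any free node

# Example: block "b1" is replicated on node1 and node3; node1 has no free slots.
block_locations = {"b1": ["node1", "node3"]}
free_slots = {"node1": 0, "node2": 2, "node3": 1}
print(schedule_task("b1", block_locations, free_slots))   # -> "node3" (data-local choice)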

How is the MapReduce library designed to tolerate failures of different
machines (map/reduce nodes) while executing a MapReduce job?
The MapReduce framework is designed to be fault-tolerant, ensuring that it
can continue executing jobs even in the presence of failures in individual
machines (map/reduce nodes). Here are some key design principles and
mechanisms that enable fault tolerance in MapReduce:

1. Task Redundancy: When a task (map or reduce) fails to complete
within a specified timeout period, the framework automatically re-
executes the task on another available node. This redundancy ensures
that tasks are resilient to failures and can eventually complete
successfully.
2. Task Tracking and Monitoring: The MapReduce framework
maintains detailed information about the progress of each task,
including its execution status, input/output data locations, and
resource utilization. By continuously monitoring task progress, the
framework can detect failures and take appropriate actions, such as re-
executing failed tasks or rescheduling tasks on different nodes.
3. Data Replication: Input data is typically replicated across multiple
nodes in a distributed file system (e.g., Hadoop Distributed File System
- HDFS). This replication ensures that even if a node containing a copy
of the input data fails, the framework can retrieve the data from
another replica and continue processing without interruption.
4. Speculative Execution: Speculative execution is a mechanism
whereby the framework launches redundant tasks for tasks that are
running significantly slower than expected. By running duplicate tasks
in parallel, the framework can mitigate the impact of stragglers and
ensure timely completion of jobs, even in the presence of slow or
failing nodes.
5. Task Isolation: Each task in the MapReduce framework runs in a
separate process or container, isolated from other tasks. This isolation
ensures that failures or resource contention in one task do not affect
the execution of other tasks, thereby enhancing fault tolerance and job
stability.
6. Checkpointing: Checkpointing is a technique used to periodically
save the intermediate state of tasks to durable storage (e.g.,
distributed file system) during task execution. In the event of a failure,
the framework can resume task execution from the most recent
checkpoint, reducing the amount of work that needs to be recomputed
and improving job recovery time.
7. Job Monitoring and Recovery: The MapReduce framework provides
mechanisms for monitoring job progress and detecting failures at the
job level. In the event of a job failure, the framework can automatically
restart the job from the last successful checkpoint or from the
beginning, depending on the configured recovery strategy.

What is the combiner function?

In MapReduce, a combiner function is an optional component that can be
used to perform local aggregation of intermediate key-value pairs generated
by the mapper phase before they are sent over the network to the reducer
phase. The primary purpose of the combiner function is to reduce the
volume of data transferred over the network and improve overall
performance by aggregating or summarizing data locally on each map node.

Here's how the combiner function works within the MapReduce framework:

1. Mapper Output: After processing input data, each mapper generates
a set of intermediate key-value pairs as output. These intermediate
key-value pairs typically represent partial results of computations
performed on different portions of the input data.
2. Combiner Function: The combiner function is applied to the
intermediate key-value pairs outputted by the mappers on each map
node. The combiner function takes these intermediate pairs as input
and performs local aggregation or reduction operations based on the
keys. It aggregates values associated with the same key, combining
them into fewer key-value pairs.
3. Local Aggregation: By performing local aggregation, the combiner
function reduces the volume of data that needs to be transferred over
the network to the reducer nodes. This is particularly beneficial when
there is a significant amount of data with the same key emitted by the
mappers. The aggregated output from the combiner function is then
sent to the reducer phase.
4. Reduced Network Traffic: By reducing the volume of intermediate
data transferred over the network, the combiner function helps
alleviate network congestion and improves the overall performance of
the MapReduce job. It also reduces the computational load on reducer
nodes by pre-aggregating data, leading to faster processing during the
reduce phase.
5. Optional Component: It's important to note that the combiner
function is optional in MapReduce. While it can significantly improve
performance in certain scenarios, its use depends on the specific
requirements of the data processing task and the characteristics of the
data being processed. Additionally, the combiner function must be
idempotent and associative since it may be applied multiple times to
the same intermediate data during execution.

Explain its purpose with a suitable example.

Let's illustrate the purpose of the combiner function in MapReduce with a
simple example.
Suppose we have a large dataset consisting of records representing sales
transactions in a retail store. Each record contains information about the
product ID and the corresponding sales amount. The objective is to calculate
the total sales amount for each product.

Here's how the MapReduce process would work without using a combiner
function:

1. Mapper Phase:
 Mapper tasks read input records and emit intermediate key-
value pairs, where the key is the product ID and the value is the
sales amount associated with that product.
 For each input record, a mapper emits a key-value pair like this:
(product_id, sales_amount).
2. Shuffle and Sort:
 Intermediate key-value pairs generated by mappers are shuffled
and sorted based on the keys to group together records with the
same key.
3. Reducer Phase:
 Reducer tasks receive sorted intermediate key-value pairs and
process them to calculate the total sales amount for each
product.
 For each unique product ID, a reducer iterates over all associated
sales amounts and calculates the sum to determine the total
sales amount.

Now, let's introduce the combiner function into the process:

1. Mapper Phase:
 Same as before, mappers read input records and emit
intermediate key-value pairs (product_id, sales_amount).
2. Combiner Function (applied before shuffle and sort):
 The combiner function is applied locally on each map node to
perform partial aggregation of intermediate key-value pairs.
 For each unique product ID, the combiner aggregates the sales
amounts associated with that product, reducing the number of
key-value pairs outputted by the mapper.
3. Shuffle and Sort:
 Intermediate key-value pairs, after local aggregation by the
combiner, are still shuffled and sorted based on the keys to
prepare for the reducer phase.
4. Reducer Phase:
 Reducers receive sorted intermediate key-value pairs, but with
fewer key-value pairs to process due to the local aggregation
performed by the combiner.
 For each unique product ID, reducers calculate the sum of sales
amounts associated with that product, as before.

The purpose of the combiner function in this example is to reduce the
volume of data transferred over the network during the shuffle and sort
phase. By aggregating sales amounts locally on each map node, the
combiner reduces the number of key-value pairs that need to be shuffled
and sorted, thereby improving the efficiency of the MapReduce job and
reducing network traffic.
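Here is a small, self-contained Python simulation of the mapper, combiner,
and reducer logic for this sales example (not an actual Hadoop job). It shows
how local aggregation on each map node shrinks the data that reaches the
shuffle phase; the product IDs and amounts are made up for illustration.

from collections import defaultdict

# Two map nodes, each processing its own split of (product_id, sales_amount) records.
splits = [
    [("p1", 10), ("p1", 20), ("p2", 5)],       # records seen by map node 1
    [("p1", 7), ("p2", 3), ("p2", 4)],         # records seen by map node 2
]

def combine(pairs):
    """Combiner: sum values per key locally, before anything crosses the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

combined = [combine(split) for split in splits]  # [[('p1', 30), ('p2', 5)], [('p1', 7), ('p2', 7)]]

# Shuffle: group the already-aggregated pairs by key, then reduce with the same sum.
grouped = defaultdict(list)
for node_output in combined:
    for key, value in node_output:
        grouped[key].append(value)

totals = {key: sum(values) for key, values in grouped.items()}
print(totals)                                    # {'p1': 37, 'p2': 12}

Because summation is associative and commutative, applying it in the combiner
and again in the reducer gives the same result as reducing the raw pairs,
while sending far fewer key-value pairs across the network.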
Chapter 4

Hadoop – Architecture



As we all know, Hadoop is a framework written in Java that utilizes a large
cluster of commodity hardware to maintain and store big data. Hadoop works
on the MapReduce programming model that was introduced by Google. Today many
big-brand companies use Hadoop in their organizations to deal with big data,
e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly
consists of 4 components:

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common
Let’s understand the role of each one of this component in detail.

1. MapReduce

MapReduce is essentially a programming model that runs on top of the YARN
framework. Its major feature is to perform distributed processing in parallel
across a Hadoop cluster, which is what makes Hadoop so fast; when you are
dealing with Big Data, serial processing is no longer practical. MapReduce
has mainly 2 tasks, which are divided phase-wise: in the first phase Map is
utilized, and in the next phase Reduce is utilized.
Here, the input is provided to the Map() function, its output is used as the
input to the Reduce() function, and after that we receive our final output.
Let's understand what Map() and Reduce() do.
Since we are using Big Data, the input to Map() is a set of data blocks. The
Map() function breaks these blocks into tuples, which are nothing but
key-value pairs. These key-value pairs are then sent as input to Reduce().
The Reduce() function combines the tuples based on their key value, forms a
smaller set of tuples, and performs operations such as sorting and
summation, which are then sent to the final output node. Finally, the output
is obtained.
The data processing performed in the Reducer depends upon the business
requirement of that industry. This is how Map() and then Reduce() are
utilized one after the other.
Let’s understand the Map Task and Reduce Task in detail.
Map Task:

 RecordReader: The purpose of the RecordReader is to break the input
into records. It is responsible for providing key-value pairs to the
Map() function. The key is its locational information (the record's
offset) and the value is the data associated with it.
 Map: A map is nothing but a user-defined function whose work is to
process the tuples obtained from the RecordReader. The Map() function
may generate zero, one, or multiple key-value pairs from these tuples.
 Combiner: The combiner is used for grouping the data in the Map
workflow. It is similar to a local reducer. The intermediate key-value
pairs generated in the Map phase are combined with the help of this
combiner. Using a combiner is not necessary, as it is optional.
 Partitioner: The partitioner is responsible for fetching the key-value
pairs generated in the Mapper phase. It generates the shards
corresponding to each reducer: the hashcode of each key is taken and
its modulus with the number of reducers is computed
(key.hashCode() % (number of reducers)); see the sketch after the
Reduce Task list below.
Reduce Task

 Shuffle and Sort: The task of the Reducer starts with this step. The
process in which the Mapper generates the intermediate key-value pairs
and transfers them to the Reducer task is known as Shuffling. Using the
shuffling process, the system can sort the data by its key value.
Shuffling begins once some of the Map tasks are done, which is why it
is a faster process: it does not wait for all the Map tasks to
complete.
 Reduce: The main task of Reduce is to gather the tuples generated by
Map and then perform sorting and aggregation operations on those
key-value pairs, depending on their key element.
 OutputFormat: Once all the operations are performed, the key-
value pairs are written into the file with the help of record writer,
each record in a new line, and the key and value in a space-
separated manner.
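As a rough illustration of the partitioning step mentioned above, here is a
minimal Python sketch of hash partitioning (analogous in spirit to Hadoop's
default HashPartitioner); the keys and reducer count are assumptions for
illustration.

def partition(key, num_reducers):
    """Assign an intermediate key to a reducer via hashcode modulus."""
    return hash(key) % num_reducers

num_reducers = 3
intermediate = [("ABC", 70), ("XYZ", 90), ("PQR", 60), ("ABC", 80)]

shards = {r: [] for r in range(num_reducers)}
for key, value in intermediate:
    shards[partition(key, num_reducers)].append((key, value))

# All pairs sharing a key land in the same shard, so a single reducer sees
# every value for that key.
print(shards)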
2. HDFS

HDFS (Hadoop Distributed File System) is utilized for storage. It is mainly
designed to work on commodity hardware devices (inexpensive devices),
following a distributed file system design. HDFS is designed in such a way
that it favors storing data in large blocks rather than in many small data
blocks.
HDFS in Hadoop provides fault tolerance and high availability to the storage
layer and the other devices present in that Hadoop cluster. The data storage
nodes in HDFS are:

 NameNode(Master)
 DataNode(Slave)
NameNode: The NameNode works as the Master in a Hadoop cluster and guides
the DataNodes (Slaves). The NameNode is mainly used for storing the metadata,
i.e. the data about the data. The metadata can be the transaction logs that
keep track of the user's activity in a Hadoop cluster.
The metadata can also be the name of the file, its size, and information
about the location (block number, block IDs) of the DataNodes, which the
NameNode stores to find the closest DataNode for faster communication. The
NameNode instructs the DataNodes with operations like delete, create,
replicate, etc.
DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing
the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500
or even more. The more DataNodes there are, the more data the Hadoop cluster
will be able to store. It is therefore advised that DataNodes have a high
storage capacity to store a large number of file blocks.

In earlier versions of Hadoop (prior to Hadoop 2.0), JobTracker and
TaskTracker were indeed important components of the Hadoop ecosystem.
However, with the introduction of YARN (Yet Another Resource Negotiator) in
Hadoop 2.0, the architecture was redesigned, and JobTracker and
TaskTracker were replaced with ResourceManager and NodeManager,
respectively.

High Level Architecture Of Hadoop


File Block In HDFS: Data in HDFS is always stored in terms of blocks. A
single file is divided into multiple blocks of size 128 MB, which is the
default, and you can also change it manually.

Let's understand this concept of breaking a file down into blocks with an
example. Suppose you have uploaded a file of 400 MB to your HDFS. This file
gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, which
means 4 blocks are created, each of 128 MB except the last one. Hadoop does
not know or care about what data is stored in these blocks, so it simply
treats the final, smaller block as a partial block. In the Linux file system,
the size of a file block is about 4 KB, which is very much less than the
default block size in the Hadoop file system. As we all know, Hadoop is
mainly configured for storing large-scale data, measured in petabytes; this
is what makes the Hadoop file system different from other file systems, as
it can be scaled. Nowadays, block sizes of 128 MB to 256 MB are commonly used
in Hadoop.
Replication In HDFS: Replication ensures the availability of the data.
Replication means making a copy of something, and the number of times you
make a copy of that particular thing can be expressed as its Replication
Factor. As we have seen with file blocks, HDFS stores the data in the form
of various blocks, and at the same time Hadoop is also configured to make
copies of those file blocks.
By default, the Replication Factor for Hadoop is set to 3, which is
configurable, meaning you can change it manually as per your requirement. In
the above example we made 4 file blocks, and 3 replicas (copies) of each
file block are kept, so a total of 4 × 3 = 12 blocks are stored for backup
purposes.
This is because for running Hadoop we are using commodity hardware
(inexpensive system hardware), which can crash at any time; we are not using
supercomputers for our Hadoop setup. That is why we need a feature in HDFS
that can make copies of the file blocks for backup purposes; this is known
as fault tolerance.
One thing we also need to notice is that making so many replicas of our file
blocks consumes a lot of extra storage, but for large organizations the data
is far more important than the storage cost, so this overhead is accepted.
You can configure the Replication Factor in your hdfs-site.xml file.
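For example, the default replication factor is controlled by the
dfs.replication property in hdfs-site.xml; a minimal, illustrative entry with
the default value of 3 looks like this:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>   <!-- number of copies kept for each block -->
  </property>
</configuration>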
Rack Awareness: A rack is nothing but the physical collection of nodes in
our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many
racks. With the help of this rack information, the NameNode chooses the
closest DataNode to achieve maximum performance while performing reads and
writes, which reduces the network traffic.
HDFS Architecture
3. YARN(Yet Another Resource Negotiator)

YARN is a framework on which MapReduce works. YARN performs 2 operations:
Job Scheduling and Resource Management. The purpose of the Job Scheduler is
to divide a big task into small jobs so that each job can be
assigned to various slaves in a Hadoop cluster and Processing can be
Maximized. Job Scheduler also keeps track of which job is important, which
job has more priority, dependencies between the jobs and all the other
information like job timing, etc. And the use of Resource Manager is to
manage all the resources that are made available for running a Hadoop
cluster.
Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility
4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the Java libraries
and Java files that are needed by all the other components present in a
Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for
running the cluster. Hadoop Common recognizes that hardware failure in a
Hadoop cluster is common, so failures need to be handled automatically in
software by the Hadoop framework.

View content of a file in HDFS:

hdfs dfs -cat <hdfs_file_path>

hdfs dfs -cat /user/hadoop/file.txt

Explain the role of HDFS in Hadoop


1. Storage: HDFS provides a distributed and scalable storage platform
for big data. It stores large files in a distributed manner across multiple
machines (nodes) in a Hadoop cluster. This enables Hadoop to handle
datasets that are too large to be stored on a single machine.
2. Fault Tolerance: HDFS ensures high fault tolerance by replicating
data across multiple nodes in the cluster. By default, HDFS replicates
each data block three times, storing each copy on a different machine.
If one copy of the data becomes unavailable due to a hardware failure
or other issue, HDFS can retrieve the data from one of the other
replicas, ensuring data availability and reliability.
3. Scalability: HDFS is designed to scale horizontally by adding more
nodes to the cluster. As the amount of data grows, additional storage
capacity and processing power can be added to the cluster without
significant changes to the underlying infrastructure. This makes it
suitable for handling large-scale data processing tasks.
4. Data Processing Support: HDFS is integrated with other
components of the Hadoop ecosystem, such as MapReduce and
Apache Spark, to enable distributed data processing. These
frameworks can directly access data stored in HDFS, allowing for
parallel processing of large datasets across the cluster.
5. Data Accessibility: HDFS provides a filesystem interface that allows
users to interact with data stored in Hadoop using familiar commands
and APIs. Users can upload, download, and manipulate files stored in
HDFS using command-line tools or programming interfaces.
6. Parallel Processing: Because file blocks are spread across many nodes,
processing tasks can run on them in parallel, which improves overall
performance.

Components of YARN
YARN (Yet Another Resource Negotiator) is the resource management layer
in Hadoop 2.x that decouples the resource management and job scheduling
functionalities from the MapReduce programming paradigm. YARN allows
multiple data processing engines to run on top of a shared resource pool
efficiently. The key components of YARN are as follows:

1. ResourceManager (RM):
 ResourceManager is the master daemon responsible for
managing and allocating cluster resources.
 It consists of two main components:
 Scheduler: The scheduler is responsible for allocating
resources to various applications based on policies such as
fair sharing, capacity, or custom-defined rules.
 ApplicationManager: The ApplicationManager manages the
lifecycle of applications submitted to the cluster. It
negotiates resources with NodeManagers, tracks
application status, and handles application failures and
restarts.
2. NodeManager (NM):
 NodeManager is a per-node daemon responsible for managing
resources (CPU, memory, disk, network) on individual nodes in
the cluster.
 It monitors resource utilization, container execution, and node
health.
 NodeManager launches and manages containers, which are
isolated execution environments for running application tasks.
 NodeManager reports container status, resource utilization, and
node health back to the ResourceManager.
3. ApplicationMaster (AM):
 ApplicationMaster is a per-application component responsible for
negotiating resources with the ResourceManager, tracking
application status, and managing application execution.
 Each application running on YARN has its own ApplicationMaster,
which coordinates the execution of application tasks and
monitors their progress.
 ApplicationMaster communicates with the ResourceManager to
request and release resources, handle failures, and update
application status.
4. Container:
 A container is an encapsulation of resources (CPU, memory)
allocated to an application task by the NodeManager.
 Containers provide an isolated execution environment for
running application tasks without interference from other tasks
or applications.
 Each container hosts one or more application tasks, such as map
or reduce tasks in the case of MapReduce jobs.
5. Resource Scheduler:
 YARN supports pluggable resource schedulers that determine
how cluster resources are allocated among competing
applications.
 The resource scheduler is responsible for making allocation
decisions based on resource availability, application
requirements, and scheduling policies.
 YARN provides built-in schedulers like the CapacityScheduler and
the FairScheduler, and custom schedulers can be developed to
meet specific requirements.
6. Job Scheduler: Job-level scheduling (ordering and prioritizing the
applications submitted to the cluster) is handled by the ResourceManager's
scheduler component according to the configured scheduling policy.

How are fault tolerance and scalability handled by a Hadoop cluster?

Fault Tolerance in Hadoop Cluster:


1. Data Replication:
 Hadoop Distributed File System (HDFS) replicates data blocks
across multiple DataNodes by default.
 Each data block is replicated to multiple DataNodes, typically
three, ensuring fault tolerance.
 If a DataNode fails, the NameNode can retrieve the data block
from its replicas stored on other nodes, ensuring continuous data
availability.
2. High Availability NameNode:
 In Hadoop clusters with HA (High Availability) enabled, standby
NameNodes maintain synchronized copies of the namespace and
metadata.
 If the active NameNode fails, the standby NameNode can
seamlessly take over as the active NameNode, ensuring
continuous availability and fault tolerance.
 HA NameNode architecture eliminates the single point of failure
associated with the NameNode, improving fault tolerance in
Hadoop clusters.
3. Automatic Data Replication and Rebalancing:
 HDFS automatically replicates data blocks and rebalances data
across the cluster as nodes are added or removed.
 When a new DataNode joins the cluster, HDFS automatically
replicates existing data blocks to maintain the desired replication
factor.
 Similarly, when a DataNode is decommissioned or fails, HDFS
replicates its data blocks to other nodes to maintain fault
tolerance.
4. Task Redundancy and Job Retry:
 Hadoop frameworks such as MapReduce and YARN support
automatic job retry and task redundancy to handle transient
failures.
 If a task fails due to a temporary issue (e.g., network timeout),
Hadoop retries the task on another node or launches redundant
tasks to ensure completion.
 Redundant task execution provides fault tolerance against
transient failures and ensures job completion even in the
presence of occasional node failures.
Scalability in Hadoop Cluster:
1. Horizontal Scalability:
 Hadoop clusters can scale horizontally by adding more nodes to
the cluster as data and processing requirements grow.
 New nodes can be added incrementally to the cluster, and
Hadoop automatically rebalances data and workload distribution
to accommodate the increased capacity.
 Horizontal scalability allows Hadoop clusters to handle large
volumes of data and processing tasks efficiently without
significant changes to the system configuration or application
code.
2. Resource Elasticity:
 Hadoop's distributed architecture allows for resource elasticity,
enabling clusters to dynamically scale up or down based on
workload demands.
 Hadoop clusters can dynamically allocate and deallocate
resources to match the changing demands of applications and
users.
 Resource elasticity ensures optimal resource utilization and
performance in Hadoop clusters, maximizing scalability and
efficiency in distributed computing environments.
3. Efficient Resource Utilization:
 Hadoop's distributed processing model enables efficient resource
utilization by distributing processing tasks across multiple nodes
in the cluster.
 Each node in the cluster contributes computing resources, such
as CPU and memory, to parallelize data processing tasks.
By leveraging distributed processing, Hadoop maximizes
resource utilization and throughput, allowing clusters to scale to
handle large workloads efficiently.
4. Load Distribution: Work and data are spread across all nodes in the
cluster, so no single node becomes a bottleneck as the cluster grows.

For a Hadoop cluster with a 128 MB block size, how many mappers will Hadoop
MapReduce form while performing the Map function on 1 GB of data? Justify
with an example.
To determine the number of map tasks that Hadoop MapReduce will form
when processing 1GB of data with a block size of 128MB, we need to
consider the data block size and the total size of the input data.

Given:

 Block size: 128 MB
 Input data size: 1 GB

To calculate the number of map tasks, we'll first determine the number of
data blocks for the input data and then assign one map task per data block.

Calculations:
1. Determine the Number of Data Blocks:
 Input data size: 1 GB
 Block size: 128 MB
 Number of data blocks = Input data size / Block size = 1024 MB / 128
MB = 8 data blocks
2. Assign Map Tasks:
 One map task is assigned per data block.
 Number of map tasks = Number of data blocks = 8 map tasks
Justification:
In this scenario, Hadoop MapReduce will form 8 map tasks when performing
the Map function on 1GB of data with a block size of 128MB. Each map task
will process one data block (128MB) of the input data independently in
parallel.

For example, let's consider a file of 1GB divided into 8 data blocks (each
128MB in size) named Block1, Block2, ..., Block8. Hadoop MapReduce will
form 8 map tasks, each responsible for processing one data block. The map
tasks will execute in parallel across the cluster, processing their assigned
data blocks concurrently. This parallel processing of data blocks enables
efficient and scalable data processing in Hadoop clusters.
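A quick way to check this arithmetic is the small Python sketch below, which
computes the number of 128 MB blocks (and hence map tasks, assuming the usual
one mapper per block) for a given input size.

import math

def num_map_tasks(input_size_mb, block_size_mb=128):
    """One map task per HDFS block (the common default for splittable input)."""
    return math.ceil(input_size_mb / block_size_mb)

print(num_map_tasks(1024))   # 1 GB    -> 8 map tasks
print(num_map_tasks(400))    # 400 MB  -> 4 map tasks (3 full blocks + one 16 MB block)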

Amazon Cloud
Amazon Web Services (AWS) is a cloud computing platform offered by
Amazon.com. It provides a wide range of cloud services, including computing
power, storage, databases, machine learning, analytics, and more, allowing
businesses and developers to build and deploy scalable, flexible, and cost-
effective solutions.

Here's an overview of some key services and features offered by Amazon Web
Services:

1. Compute Services:
 Amazon Elastic Compute Cloud (EC2): Virtual servers in the
cloud, allowing users to run applications and workloads of
various sizes and complexities.
 AWS Lambda: Serverless compute service that runs code in
response to events without provisioning or managing servers.
2. Storage Services:
 Amazon Simple Storage Service (S3): Object storage service
designed to store and retrieve any amount of data from
anywhere on the web.
 Amazon Elastic Block Store (EBS): Persistent block storage
volumes for use with EC2 instances.
3. Database Services:
 Amazon Relational Database Service (RDS): Managed
relational database service supporting multiple database engines
such as MySQL, PostgreSQL, Oracle, and SQL Server.
 Amazon DynamoDB: Fully managed NoSQL database service
for key-value and document data.
4. Networking Services:
 Amazon Virtual Private Cloud (VPC): Virtual network service
for logically isolating resources within the AWS cloud.
 Amazon Route 53: Scalable domain name system (DNS) web
service for routing traffic to resources.
5. Machine Learning and AI Services:
 Amazon SageMaker: Fully managed service for building,
training, and deploying machine learning models.
 Amazon Rekognition: Deep learning-based image and video
analysis service for object and scene detection, facial
recognition, and text extraction.
6. Security and Identity Services:
 AWS Identity and Access Management (IAM): Identity
management service for controlling access to AWS resources.
 Amazon Inspector: Automated security assessment service for
identifying security vulnerabilities and compliance issues.
Interrelationship between Hadoop and Amazon Cloud
The interrelationship between Hadoop and Amazon Web Services (AWS)
cloud is significant, as AWS offers a variety of services that complement and
enhance the capabilities of the Hadoop ecosystem. Here's how Hadoop and
AWS cloud interact:

1. Infrastructure as a Service (IaaS):


 AWS provides scalable and flexible infrastructure services such
as Amazon EC2 (Elastic Compute Cloud) and Amazon S3 (Simple
Storage Service), which are commonly used to deploy Hadoop
clusters.
 Hadoop clusters can be easily provisioned on EC2 instances,
providing on-demand compute capacity without the need for
upfront hardware investments.
 Amazon S3 can serve as a highly available and durable storage
backend for Hadoop data, enabling seamless integration with
Hadoop Distributed File System (HDFS) and serving as a cost-
effective storage solution for big data analytics workloads.
2. Managed Hadoop Services:
 AWS offers managed Hadoop services such as Amazon EMR
(Elastic MapReduce), a fully managed big data platform that
simplifies the deployment, management, and scaling of Hadoop
clusters on AWS cloud infrastructure.
 Amazon EMR provides pre-configured Hadoop distributions,
including Apache Hadoop, Apache Spark, Apache HBase, and
other ecosystem tools, making it easy to launch and run big data
applications without managing the underlying infrastructure.
3. Data Processing and Analytics:
 AWS cloud services complement Hadoop's data processing and
analytics capabilities by providing additional tools and services
for data ingestion, processing, and visualization.
 Services like Amazon Kinesis for real-time data streaming,
Amazon Redshift for data warehousing, and Amazon Athena for
interactive query analysis can be integrated with Hadoop to build
end-to-end data pipelines and analytics solutions.
4. Scalability and Flexibility:
 AWS cloud infrastructure offers elastic scalability, allowing
Hadoop clusters to dynamically scale up or down based on
workload demands.
 With AWS Auto Scaling and Elastic Load Balancing, Hadoop
clusters can automatically adjust their capacity to handle varying
workloads, ensuring optimal resource utilization and
performance.
5. Cost Optimization:
 AWS offers a pay-as-you-go pricing model, allowing organizations
to optimize costs by only paying for the compute and storage
resources they use.
 Hadoop clusters on AWS can leverage Spot Instances and
Reserved Instances to further reduce costs and optimize
resource utilization.

How are Hadoop and GFS similar in terms of design architecture?


Hadoop's Hadoop Distributed File System (HDFS) and Google's Google File
System (GFS) share several similarities in terms of design architecture,
owing to HDFS being inspired by GFS. Here are some key similarities:

1. Distributed Architecture:
 Both HDFS and GFS are designed as distributed file systems,
intended to store and manage large amounts of data across
multiple machines in a cluster.
 They distribute data across multiple nodes to achieve scalability,
fault tolerance, and high throughput.
2. Master-Slave Architecture:
 Both systems employ a master-slave architecture where there
are one or more master nodes responsible for metadata
management and coordination, and multiple slave nodes
responsible for storing and serving data.
 In GFS, the master node is called the "Master," while in HDFS, it's
called the "NameNode."
3. Data Replication:
 Both HDFS and GFS replicate data across multiple nodes to
ensure fault tolerance and high availability.
 They use replication to store multiple copies of data blocks on
different nodes, typically three replicas in both systems.
4. Large Block Sizes:
 Both file systems use large block sizes for efficient data storage
and processing.
 GFS typically uses a default block size of 64 MB, while HDFS
commonly uses a default block size of 128 MB, although both are
configurable.
5. Write-Once, Read-Many (WORM) Model:
 Both systems follow a Write-Once, Read-Many (WORM) model,
where files are typically written once and then read multiple
times.
 This model simplifies data consistency and replication, as
updates are infrequent and mainly append-only.
6. Checksums and Data Integrity:
 Both systems use checksums to ensure data integrity during
storage and transfer.
 Checksums are calculated for each data block and verified to
detect any data corruption or errors.
7. Data Locality:
 Both HDFS and GFS emphasize data locality, aiming to process
data where it resides to minimize data transfer over the network.
 They schedule data processing tasks close to the data blocks
they operate on, reducing network overhead and improving
performance.
8. Scalability and Fault Tolerance:
 Both systems are designed for scalability and fault tolerance,
allowing them to handle large-scale data storage and processing
in distributed environments.
 They can scale horizontally by adding more nodes to the cluster
and handle node failures gracefully through data replication and
recovery mechanisms.

Give an overview of the Hadoop ecosystem


The Hadoop ecosystem is a comprehensive suite of open-source software
tools and frameworks designed to facilitate the storage, processing, and
analysis of large-scale data sets across distributed computing environments.
It comprises various components that work together to address different
aspects of data management, processing, and analytics. Here's an overview
of the key components and technologies within the Hadoop ecosystem:
1. Hadoop Distributed File System (HDFS):
 HDFS is a distributed file system designed to store large volumes
of data across multiple machines in a Hadoop cluster.
 It provides high throughput and fault tolerance by replicating
data blocks across multiple DataNodes.
 HDFS is the primary storage layer for Hadoop, enabling reliable
data storage and access for distributed processing frameworks.
2. MapReduce:
 MapReduce is a programming model and processing engine for
distributed data processing in Hadoop.
 It divides large data sets into smaller chunks, processes them in
parallel across multiple nodes in the cluster, and aggregates the
results to produce final output.
 MapReduce is well-suited for batch processing of large-scale data
and is commonly used for tasks like data transformation,
aggregation, and analysis.
3. Yet Another Resource Negotiator (YARN):
 YARN is a resource management and job scheduling framework
introduced in Hadoop 2.x.
 It decouples resource management and job scheduling from the
MapReduce programming paradigm, allowing multiple data
processing engines to run on a shared resource pool.
 YARN supports a variety of processing models, including
MapReduce, Apache Spark, Apache Flink, and Apache Tez,
among others.
4. Apache Hive:
 Hive is a data warehouse infrastructure built on top of Hadoop
for querying and analyzing large datasets stored in HDFS.
 It provides a SQL-like query language called HiveQL, which
translates SQL queries into MapReduce jobs or other execution
engines compatible with YARN.
 Hive is commonly used for batch processing and interactive
querying of structured data in Hadoop.
5. Apache Pig:
 Pig is a high-level data flow scripting language and execution
framework for analyzing large datasets in Hadoop.
 It provides a scripting language called Pig Latin, which allows
users to express data transformations and processing pipelines
in a concise and flexible manner.
 Pig automatically translates Pig Latin scripts into MapReduce jobs
or other execution engines compatible with YARN.
6. Apache Spark:
 Spark is a fast and general-purpose cluster computing framework
for processing large-scale data sets.
 It provides high-level APIs in languages like Scala, Java, Python,
and R, as well as a rich set of libraries for data analysis, machine
learning, and graph processing.
 Spark supports in-memory processing and iterative algorithms,
making it significantly faster than MapReduce for many
workloads.
7. Apache HBase:
 HBase is a distributed, scalable, and column-oriented NoSQL
database built on top of Hadoop.
 It provides real-time read/write access to large volumes of data
stored in HDFS and is optimized for random, low-latency access
patterns.
 HBase is commonly used for applications requiring fast and
scalable data storage, such as web analytics, social media, and
time series data.
8. Apache Kafka:
 Kafka is a distributed event streaming platform designed for
building real-time data pipelines and streaming applications.
 It provides high-throughput, fault-tolerant messaging, allowing
producers to publish data streams and consumers to subscribe to
and process them in real-time.
 Kafka is often used as a messaging backbone for real-time data
processing and analytics in Hadoop environments.
9. Apache Sqoop:
 Sqoop is a tool designed for efficiently transferring bulk data
between Hadoop and structured data stores such as relational
databases.
 It supports importing data from databases into Hadoop (HDFS or
Hive) and exporting data from Hadoop to databases.
 Sqoop simplifies the process of integrating Hadoop with existing
data warehouses and data lakes, enabling seamless data
ingestion and extraction.
10. Apache Flume:
 Flume is a distributed, reliable, and scalable service for
collecting, aggregating, and moving large volumes of log data
from various sources to Hadoop.
 It provides a flexible architecture for building data ingestion
pipelines, supporting a wide range of data sources, such as web
servers, application logs, and sensors.
 Flume is commonly used for centralized log collection, data
streaming, and real-time event processing in Hadoop
environments.
11. Apache Oozie:
 Oozie is a workflow scheduler system for managing and
orchestrating Hadoop jobs and data processing workflows.
 It allows users to define, schedule, and execute complex
workflows consisting of multiple Hadoop jobs and other actions
(e.g., Pig scripts, Hive queries) in a coordinated manner.
 Oozie provides a web-based interface and a command-line
interface for workflow management and monitoring, enabling
users to schedule and monitor job executions in Hadoop.
12. Apache Mahout:
 Mahout is a scalable machine learning library built on top of
Hadoop for building and deploying machine learning models at
scale.
 It provides a set of algorithms and tools for clustering,
classification, recommendation, and other machine learning
tasks.
 Mahout integrates with other components of the Hadoop
ecosystem, such as HDFS, MapReduce, and Spark, enabling
distributed machine learning on large datasets.
13. Apache Zeppelin:
 Zeppelin is a web-based notebook interface for interactive data
analysis, visualization, and collaboration in Hadoop
environments.
 It supports multiple programming languages (e.g., Scala, Python,
SQL) and provides built-in integrations with various data
processing frameworks, including Spark, Hive, and HBase.
 Zeppelin allows users to write, execute, and share data analysis
and visualization code in a collaborative notebook environment,
making it easier to explore and analyze data in Hadoop.
14. Apache ZooKeeper:
 ZooKeeper is a distributed coordination service for building and
managing distributed systems.
 It provides primitives such as distributed locks, barriers, and
configuration management, which are essential for coordinating
and synchronizing processes in distributed environments.
 ZooKeeper is often used as a centralized coordination service for
maintaining configuration information, managing distributed
locks, and ensuring consistency in Hadoop clusters.
15. Apache Drill:
 Drill is a distributed SQL query engine designed for interactive
analysis of large-scale datasets in Hadoop and other data
sources.
 It provides ANSI SQL support and allows users to run SQL queries
against diverse data sources, including HDFS, Hive, HBase, and
relational databases.
 Drill supports schema-free data exploration, nested data
structures, and complex data types, making it suitable for ad-hoc
querying and exploration of semi-structured and nested data in
Hadoop.
16. Apache Storm:
 Storm is a distributed stream processing framework for real-time
event processing and analytics.
 It provides low-latency, fault-tolerant stream processing
capabilities for processing continuous streams of data in real-
time.
 Storm is commonly used for real-time analytics, event-driven
applications, and streaming data processing in Hadoop and other
distributed environments.
17. Cloudera
Cloudera's platform, the Cloudera Data Platform (CDP), is built on top of
open-source technologies, including Hadoop. Cloudera integrates and
enhances various open-source components, including Hadoop's core
components like HDFS (Hadoop Distributed File System) and YARN (Yet
Another Resource Negotiator), along with other technologies such as Apache
Spark, Apache Hive, Apache HBase, etc., to provide a comprehensive data
management and analytics solution for enterprises.
Anatomy of File Read and Write in HDFS



Big data is nothing but a collection of data sets that are large, complex, and
which are difficult to store and process using available data management
tools or traditional data processing applications. Hadoop is a framework
(open source) for writing, running, storing, and processing large datasets in a
parallel and distributed manner. It is a solution that is used to overcome the
challenges faced by big data.
Hadoop has two components:

 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
In this article, we focus on one of the components of Hadoop i.e., HDFS and
the anatomy of file reading and file writing in HDFS. HDFS is a file system
designed for storing very large files (files that are hundreds of megabytes,
gigabytes, or terabytes in size) with streaming data access, running on
clusters of commodity hardware(commonly available hardware that can be
obtained from various vendors). In simple terms, the storage unit of Hadoop
is called HDFS.
Some of the characteristics of HDFS are:
 Fault-Tolerance
 Scalability
 Distributed Storage
 Reliability
 High availability
 Cost-effective
 High throughput
Building Blocks of Hadoop:

1. Name Node
2. Data Node
3. Secondary Name Node (SNN)
4. Job Tracker
5. Task Tracker

Anatomy of File Read in HDFS

Let’s get an idea of how data flows between the client interacting with HDFS,
the name node, and the data nodes with the help of a diagram. Consider the
figure:

Step 1: The client opens the file it wishes to read by calling open() on the
File System Object(which for HDFS is an instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote
procedure calls (RPCs), to determine the locations of the first few blocks in
the file. For each block, the name node returns the addresses of the data
nodes that have a copy of that block. The DFS returns an FSDataInputStream
to the client for it to read data from. FSDataInputStream in turn wraps a
DFSInputStream, which manages the data node and name node I/O.

Step 3: The client then calls read() on the stream. DFSInputStream, which
has stored the data node addresses for the first few blocks of the file,
then connects to the first (closest) data node for the first block in the
file.

Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.

Step 5: When the end of a block is reached, DFSInputStream closes the
connection to the data node and then finds the best data node for the next
block. This happens transparently to the client, which from its point of
view is simply reading a continuous stream. Blocks are read in order, with
the DFSInputStream opening new connections to data nodes as the client reads
through the stream. It will also call the name node to retrieve the data
node locations for the next batch of blocks as needed.

Step 6: When the client has finished reading the file, it calls close() on
the FSDataInputStream.
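From the client's point of view, the open/read/close sequence above looks
roughly like the following Python sketch using the pyarrow HDFS binding; the
host, port, and path are illustrative assumptions, and a reachable cluster
with libhdfs available is required.

from pyarrow import fs

# Steps 1-2: connecting and opening resolve block locations via the name node.
hdfs = fs.HadoopFileSystem("namenode-host", port=8020)   # assumed host/port

# Steps 3-6: read() streams data from the closest data nodes; close() ends the stream.
with hdfs.open_input_stream("/user/hadoop/file.txt") as stream:
    data = stream.read()

print(len(data), "bytes read")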

Anatomy of File Write in HDFS

Next, we’ll check out how files are written to HDFS. Consider figure 1.2 to get
a better understanding of the concept.
Note: HDFS follows the Write once Read many times model. In HDFS we
cannot edit the files which are already stored in HDFS, but we can append
data by reopening the files.
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).

Step 2: DFS makes an RPC call to the name node to create a new file in the
file system’s namespace, with no blocks associated with it. The name node
performs various checks to make sure the file doesn’t already exist and that
the client has the right permissions to create the file. If these checks pass,
the name node makes a record of the new file; otherwise, the file can't be
created and the client is thrown an IOException. The DFS then
returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, the DFSOutputStream splits it into
packets, which it writes to an internal queue called the data queue. The data
queue is consumed by the DataStreamer, which is responsible for asking the
name node to allocate new blocks by picking a list of suitable data nodes to
store the replicas. The list of data nodes forms a pipeline, and here we'll
assume the replication level is three, so there are three nodes in the
pipeline. The DataStreamer streams the packets to the first data node
in the pipeline, which stores each packet and forwards it to the second
data node in the pipeline.

Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.

Step 5: The DFSOutputStream also maintains an internal queue of packets that
are waiting to be acknowledged by data nodes, called the ack queue.
Step 6: When the client has finished writing data, it calls close() on the
stream. This flushes all the remaining packets to the data node pipeline and
waits for acknowledgments before contacting the name node to signal that the
file is complete.
HDFS follows the Write Once Read Many model, so we cannot edit files that are
already stored in HDFS, but we can append to them by reopening the file.
This design allows HDFS to scale to a large number of concurrent clients
because the data traffic is spread across all the data nodes in the cluster.
Thus, it increases the availability, scalability, and throughput of the system.
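
A matching write-path sketch with the same Hadoop Java API (the path and the
sample content are assumptions for illustration): create() performs the name
node RPC from Step 2, writes are packetized into the data and ack queues
behind the scenes, and close() flushes the pipeline and signals completion.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt"); // assumed example path

        // create() asks the name node to add the file to the namespace (Steps 1-2)
        try (FSDataOutputStream out = fs.create(file)) {
            // writes are split into packets and pushed through the data node pipeline (Steps 3-5)
            out.writeBytes("hello hdfs\n");
            out.hflush(); // optional: make the data visible to readers before close()
        } // close() flushes remaining packets and signals file completion (Step 6)
    }
}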

Chapter 5
Data Replication in Cassandra
Partitions are automatically replicated across nodes.
In Cassandra, one or more of the nodes in a cluster act as replicas for a given
piece of data. If it is detected that some of the nodes responded with an out-
of-date value, Cassandra will return the most recent value to the client. After
returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.

The following figure shows a schematic view of how Cassandra uses data
replication among the nodes in a cluster to ensure no single point of failure.
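
Replication in Cassandra is configured per keyspace. As a minimal sketch (the
keyspace name demo_ks, the SimpleStrategy class, and a replication factor of
3 are illustrative assumptions), the CQL below, executed here through the
DataStax Java driver, asks Cassandra to keep three replicas of every
partition:

import com.datastax.oss.driver.api.core.CqlSession;

public class ReplicationExample {
    public static void main(String[] args) {
        // with no explicit contact point the driver connects to 127.0.0.1:9042 by default
        try (CqlSession session = CqlSession.builder().build()) {
            // SimpleStrategy with replication_factor 3: three nodes hold a copy of each partition
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS demo_ks "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
        }
    }
}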

Components of Cassandra

The key components of Cassandra are as follows −

 Node − It is the place where data is stored.


 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data
centers.
 Commit log − The commit log is a crash-recovery mechanism in
Cassandra. Every write operation is written to the commit log.
 Mem-table − A mem-table is a memory-resident data structure. After
commit log, the data will be written to the mem-table. Sometimes, for
a single-column family, there will be multiple mem-tables.
 SSTable − It is a disk file to which the data is flushed from the mem-
table when its contents reach a threshold value.
 Bloom filter − A Bloom filter is a quick, probabilistic data structure
for testing whether an element is a member of a set; it can report false
positives but never false negatives. It acts as a special kind of cache:
Bloom filters are checked on reads to decide whether an SSTable may contain
the requested data.

Note − Cassandra uses the Gossip Protocol in the background to allow the
nodes to communicate with each other and detect any faulty nodes in the
cluster.

Mongo DB
Key Features of MongoDB
 Document-oriented database
 Tables in MongoDB are known as collections
 Stores data in JSON-like documents (serialized as BSON)
 Schema-less database
 It provides horizontal scalability with the help of sharding.
 It provides high availability and redundancy with the help
of replication.
 It allows one to perform operations on the grouped data and get a
single result or computed result.
 It has very high performance.
MongoDB Vs RDBMS
 Type: MongoDB is a non-relational, document-oriented database management
system; an RDBMS is a relational database management system that works on
relational databases.
 Storage: MongoDB uses a document structure to store data; an RDBMS uses a
table structure.
 Schema: MongoDB is schema-less; an RDBMS uses a fixed schema structure.
 Query language: MongoDB supports the MongoDB Query Language (MQL) over
BSON documents; an RDBMS supports the SQL query language.
 Guarantees: MongoDB is designed around the CAP theorem trade-offs; an
RDBMS follows ACID properties.
 Scaling: a MongoDB database can be scaled horizontally; an RDBMS is
typically scaled vertically.
MongoDB Architecture and its Components

1. Drivers & Storage Engine


MongoDB stores the data on the server, while applications retrieve it through
a driver, so it is worth seeing how the communication happens between an
application and the MongoDB server.
An application written in Python, .NET, Java, or any kind of front-end
technology does not access the physical storage on the server directly. It
first interacts with a driver, which communicates with the MongoDB server.
When a request goes from the application through the driver, the driver
translates it into an appropriate query using the query engine, and the
query is then executed against the MongoDB data model.

Drivers
Drivers are client libraries that offer interfaces and methods for
applications to communicate with MongoDB databases. Drivers will handle
the translation of documents between BSON objects and mapping
application structures.
.NET, Java, JavaScript, Node.js, Python, etc. are some of the widely used
drivers supported by MongoDB.
Storage Engine
The storage engine significantly influences the performance of applications,
serving as an intermediary between the MongoDB database and persistent
storage, typically disks. MongoDB supports different storage engines:
 MMAPv1 – A traditional storage engine based on memory-mapped files. It
is optimized for workloads with high volumes of read operations,
insertions, and in-place updates. It uses B-trees to store indexes and
works with a multiple-reader, single-writer lock, so two write calls
cannot be processed in parallel on the same collection. It is fast for
reads and slow for writes.
 WiredTiger – The default storage engine since MongoDB 3.x. It uses
document-level concurrency control instead of collection-level locking,
and it yields roughly 7x-10x better write performance and up to 80%
better compression than MMAPv1.
 In-Memory – Instead of storing documents on disk, this engine keeps
them in memory for more predictable data latencies. By default it uses
50% of physical RAM (minimum 1 GB) and requires all of its data to fit
in memory, so it may not be the most suitable choice for large
datasets.

3. MongoDB Server
It serves as the central element and is in charge
of maintaining, storing, and retrieving data from the database through
a number of interfaces. The system’s heart is the MongoDB server. Each
mongod server instance is in charge of handling client requests,
maintaining data storage, and performing database operations. Several
mongod instances work together to form a cluster in a typical MongoDB
setup.
4. MongoDB Shell
For dealing with MongoDB databases, MongoDB provides the MongoDB
Shell command-line interface (CLI) tool. The ability to handle and query
MongoDB data straight from the terminal is robust and flexible. After
installing MongoDB, you may access the MongoDB Shell, often known as
mongo. It interacts with the database using JavaScript-based syntax.
Additionally, it has built-in help that shows details about possible
commands and how to use them.

5. Data Storage in MongoDB


5.1 Collections
A database can contain as many collections as it wishes, and MongoDB
stores data inside collections.
As an example, a database might contain three collections a user’s
collection, a blog post collection, and a comments collection. The
user collection would hold user data and documents, the blog post
collection would hold blog posts and documents, and the comments
collection would hold documents related to comments. This would allow for
the easy retrieval of all the documents from a single collection.
5.2 Documents
Documents represent the individual records in a specific collection.
For example, inside the blog posts collection we would store many blog post
documents, each representing a single blog post. The way data is structured
inside a document looks very much like a JSON object with key-value pairs,
but it is actually stored as BSON, which is binary JSON.

6. Indexes
Indexes are data structures that make it simple to navigate across the
collection’s data set. They help to execute queries and find documents that
match the query criteria without a collection scan.

7. Replication
Within a MongoDB cluster, data replication entails keeping several copies of
the same data on various servers or nodes. Enhancing data
availability and dependability is the main objective of data replication. A
replica may seamlessly replace a failing server in the cluster to maintain
service continuity and data integrity.
 Primary Node (Primary Replica): In a replica set, the primary
node serves as the main source for all write operations. It’s the
only node that accepts write requests. The main node is where all
data modifications begin and are implemented initially.
 Secondary Nodes: Secondary nodes duplicate data from the
primary node (also known as secondary replicas). They are useful
for dispersing read workloads and load balancing since they are
read-only and mostly utilized for read activities.

8. Sharding
Sharding is horizontal scaling of a database, as opposed to the traditional
vertical scaling of adding more CPUs and RAM to the current system.
For example, if you have a huge set of files, you might segregate them into
smaller sets for ease of handling. Similarly, MongoDB segregates its data
into smaller chunks to improve efficiency.
Suppose you have a machine running a MongoDB instance that stores 100 million
documents. Over time the data grows and another 100 million documents are
added. To manage the processing of these extra records you might need to add
extra RAM, extra storage, and extra CPU to the server. This type of scaling
is called vertical scaling.
Now consider another situation in which you have 4 smaller machines with
modest configurations. You can divide the 200 million documents across the
servers so that each holds around 50 million documents. By dividing the data
across multiple servers you reduce the computation requirements on each one.
This kind of scaling is known as horizontal scaling, and in MongoDB it is
called sharding; each of the servers S1, S2, S3, and S4 is a shard.
The partitioning of data in a sharded environment is done on a range basis
by choosing a field as the shard key.
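
As a hedged sketch of how sharding is switched on (the database name blog,
the collection posts, the shard key user_id, and the mongos address are all
assumptions), sharding is enabled through admin commands sent to a mongos
query router, here via the MongoDB Java driver:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ShardingExample {
    public static void main(String[] args) {
        // connect to a mongos query router of the sharded cluster (address is an assumption)
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // allow the "blog" database to be sharded
            client.getDatabase("admin")
                  .runCommand(new Document("enableSharding", "blog"));

            // range-based shard key on user_id, as described above
            client.getDatabase("admin")
                  .runCommand(new Document("shardCollection", "blog.posts")
                          .append("key", new Document("user_id", 1)));
        }
    }
}

The shard key shown here gives range-based partitioning on user_id; hashed
shard keys are also possible when a more even distribution is needed.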

Types of NoSQL Databases



A database is a collection of structured data or information which is stored in
a computer system and can be accessed easily. A database is usually
managed by a Database Management System (DBMS).
NoSQL is a non-relational database that is used to store the data in the
nontabular form. NoSQL stands for Not only SQL. The main types are
documents, key-value, wide-column, and graphs.

Types of NoSQL Database: (taxonomy of noSQL)

 Document-based databases
 Key-value stores
 Column-oriented databases
 Graph-based databases
Document-Based Database:

The document-based database is a nonrelational database. Instead of storing
the data in rows and columns (tables), it uses the documents to store the
data in the database. A document database stores data in JSON, BSON, or
XML documents.
Documents can be stored and retrieved in a form that is much closer to the
data objects used in applications which means less translation is required to
use these data in the applications. In the Document database, the particular
elements can be accessed by using the index value that is assigned for
faster querying.
Collections are groups of documents that store documents with similar
contents. The documents in a collection are not required to share the same
schema, because document databases have a flexible schema.
Key features of documents database:
 Flexible schema: Documents in the database have a flexible schema,
meaning documents in the same collection need not follow the same
schema.
 Faster creation and maintenance: the creation of documents is easy
and minimal maintenance is required once we create the
document.
 No foreign keys: There is no dynamic relationship between two
documents so documents can be independent of one another. So,
there is no requirement for a foreign key in a document database.
 Open formats: To build a document we use XML, JSON, and others.
Eg : mongodb

Key-Value Stores:

A key-value store is a nonrelational database. The simplest form of a NoSQL
database is a key-value store. Every data element in the database is stored
in key-value pairs. The data can be retrieved by using a unique key allotted
to each element in the database. The values can be simple data types like
strings and numbers or complex objects.
A key-value store is like a relational database with only two columns: the
key and the value.
Key features of the key-value store:
 Simplicity.
 Scalability.
 Speed.
Eg: Redis, Amazon DynamoDB

Column Oriented Databases:

A column-oriented database is a non-relational database that stores the data
in columns instead of rows. That means when we want to run analytics on a
small number of columns, you can read those columns directly without
consuming memory with the unwanted data.
Columnar databases are designed to read data more efficiently and retrieve
the data with greater speed. A columnar database is used to store a large
amount of data. Key features of a column-oriented database:
 Scalability.
 Compression.
 Very responsive.
Eg: google’s big table

Graph-Based databases:

Graph-based databases focus on the relationship between the elements. It
stores the data in the form of nodes in the database. The connections
between the nodes are called links or relationships.
Key features of graph database:
 In a graph-based database, it is easy to identify the relationship
between the data by using the links.
 The Query’s output is real-time results.
 The speed depends upon the number of relationships among the
database elements.
 Updating data is also easy, as adding a new node or edge to a
graph database is a straightforward task that does not require
significant schema changes.
Eg: Neo4j

Architecture of HBase

1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the
process that assigns regions to region servers and handles DDL
(create, drop table) operations. It monitors all Region Server
instances present in the cluster. In a distributed environment, the
Master runs several background threads. HMaster has many
responsibilities, such as controlling load balancing, failover, etc.

2. Region Server –
HBase tables are divided horizontally by row key range into
regions. Regions are the basic building blocks of an HBase cluster:
each holds a portion of a table's data and is composed of
column families. A Region Server runs on an HDFS DataNode in the
Hadoop cluster and is responsible for handling, managing, and
executing read and write (DML: insert, update, delete) HBase
operations on its set of regions. The default size of a region
is 256 MB.

3. Zookeeper –
It acts as a coordinator in HBase. It provides services like maintaining
configuration information, naming, distributed synchronization,
server failure notification, etc. Clients use ZooKeeper to locate
region servers before communicating with them.

Advantages of HBase –

1. Can store large data sets

2. Database can be shared

3. Cost-effective from gigabytes to petabytes

4. High availability through failover and replication

Disadvantages of HBase –

1. No built-in support for SQL queries

2. No transaction support

3. Sorted only on key

4. Memory issues on the cluster

Comparison between HBase and HDFS:

 HBase provides low latency access while HDFS provides high
latency operations.

 HBase supports random read and writes while HDFS supports Write
once Read Many times.

 HBase is accessed through shell commands, Java API, REST, Avro or
Thrift API while HDFS is accessed through MapReduce jobs.

HBase, Cassandra, and MongoDB are called column-oriented NoSQL databases?
How do row-oriented databases differ from them? Write the differences in
table form.
HBase and Cassandra are commonly categorized as column-oriented (wide-column)
NoSQL databases, while MongoDB is primarily a document-oriented database.
When comparing column-oriented databases like HBase and Cassandra with
row-oriented databases, such as traditional relational databases like MySQL
or PostgreSQL, there are several differences. Here's a comparison of some of
the key differences:

Aspect by aspect, column-oriented databases (HBase, Cassandra) versus
row-oriented databases (MySQL, PostgreSQL):
 Storage format: column-oriented stores data in columns; row-oriented
stores data in rows.
 Data access pattern: column-oriented is optimized for read-heavy analytical
workloads; row-oriented is optimized for transactional operations and
write-heavy workloads.
 Schema flexibility: column-oriented generally offers less schema
flexibility than document-oriented NoSQL databases like MongoDB;
row-oriented offers more rigid schema structures.
 Aggregation performance: column-oriented is typically better for analytical
queries and aggregations over large datasets; row-oriented is typically
better for OLTP (Online Transaction Processing) operations.
 Scalability: column-oriented is highly scalable and can handle large
volumes of data distributed across clusters; row-oriented scalability may
be limited by the underlying relational model, although sharding and
clustering techniques can be applied.
 Consistency model: column-oriented offers eventual or tunable consistency,
sacrificing strong consistency for scalability and availability (CAP
theorem); row-oriented typically provides strong consistency guarantees,
adhering to ACID (Atomicity, Consistency, Isolation, Durability)
properties.
 Use cases: column-oriented suits applications requiring real-time
analytics, time-series data, or high write throughput with moderate read
requirements; row-oriented is commonly used for traditional transactional
applications such as e-commerce systems, banking, and ERP systems.
 Example technologies: HBase and Cassandra versus MySQL and PostgreSQL.

Column-Oriented Database (Apache Cassandra):

Table Name: user_activity
User_ID (Partition Key) | Timestamp (Clustering Column) | Page_ID | Action
101 | 2024-03-31 15:23:45 | 001 | View
101 | 2024-03-31 15:25:12 | 002 | Like
102 | 2024-03-31 15:30:00 | 001 | Share
103 | 2024-03-31 15:35:21 | 003 | Comment

In Cassandra, the table is optimized for fast writes and can efficiently
store large volumes of data. Data is partitioned by User_ID, and within
each partition, rows are sorted by Timestamp. This structure facilitates
fast retrieval of data based on specific queries, such as retrieving all
actions performed by a user or actions within a specific time range.

Row-Oriented Database (MySQL):

Table Name: customers
Customer_ID (Primary Key) | Name | Email | Age | Country
1001 | John Smith | [email protected] | 35 | USA
1002 | Emma Brown | [email protected] | 28 | Canada
1003 | Alice Lee | [email protected] | 42 | Australia

In MySQL, the table follows a traditional row-oriented structure where each
row represents a single customer record. The primary key (Customer_ID)
uniquely identifies each row, and other columns store various attributes of
the customer such as Name, Email, Age, and Country. This structure is
suitable for transactional operations and provides strong consistency
guarantees.

Eventual Consistency:
 Eventual consistency is a consistency model where all updates to the
database propagate through the system asynchronously. This means
that after a certain period of time, all replicas will converge to the
same state.
 In Cassandra, eventual consistency is achieved by allowing read
operations to return data even if some replicas have not yet been
updated with the latest write.
 Eventual consistency prioritizes availability and partition tolerance over
strong consistency, making it suitable for scenarios where immediate
consistency is not required, such as distributed caching or systems
with high availability requirements.

Tunable consistency in Apache Cassandra refers to the ability to adjust the
consistency level on a per-operation basis, allowing developers to balance
consistency, availability, and partition tolerance according to their
application requirements. In Cassandra, consistency levels determine how
many replicas need to respond to (i.e., agree on) a read or write operation
before it is considered successful.

Cassandra offers several consistency levels, each providing different
guarantees:

1. ONE: Only one replica needs to respond to the operation for it to be
considered successful. This level prioritizes availability over
consistency and is suitable for scenarios where immediate consistency
is not critical.
2. QUORUM: A majority of replicas (N/2 + 1) need to respond to the
operation. This level ensures strong consistency within the local data
center and provides a balance between consistency and availability.
3. LOCAL_QUORUM: Similar to QUORUM, but restricted to the replicas
within the local data center. This level is useful for applications where
data locality is important, optimizing for low-latency reads and writes
within the same data center.
4. ALL: All replicas must respond to the operation. This level ensures
strong consistency across all replicas but may impact availability,
especially in distributed environments with network partitions.
5. ANY: The operation is considered successful as long as any replica
responds. This level prioritizes availability over consistency and is
useful for scenarios where immediate consistency is not required.
6. EACH_QUORUM: Requires a quorum of replicas to respond in each data
center (used for write operations), offering strong consistency
guarantees across multiple data centers.
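
Because consistency is chosen per operation, the same table can be written
cheaply and read more strictly. A minimal sketch with the DataStax Java
driver (the keyspace demo_ks, the table users, and the sample values are
assumptions):

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // write succeeds as soon as one replica acknowledges it
            SimpleStatement write = SimpleStatement
                    .newInstance("INSERT INTO demo_ks.users (id, name) VALUES (1, 'Tom')")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(write);

            // read requires a majority of replicas (N/2 + 1) to respond
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT name FROM demo_ks.users WHERE id = 1")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            System.out.println(session.execute(read).one().getString("name"));
        }
    }
}

With a replication factor of 3, combining ONE writes with QUORUM reads still
leaves a window where a read can miss the latest write; choosing QUORUM for
both (so that W + R > N) is the usual way to get read-your-writes behaviour.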

A list of HBase commands is given below.

o Create: Creates a new table identified by 'table1' and a column family
identified by 'colf'.
o Put: Inserts a new record into the table for the row identified by the
given row key; the same command is used to update existing values.
o Scan: returns the data stored in table
o Get: Returns the records matching the row identifier provided in the
table
o Help: Get a list of commands

1. create 'table1', 'colf'
2. list 'table1'
3. put 'table1', 'row1', 'colf:a', 'value1'
4. put 'table1', 'row1', 'colf:b', 'value2'
5. put 'table1', 'row2', 'colf:a', 'value3'
6. scan 'table1'
7. get 'table1', 'row1'

write hbase commands to store and select data in hbase database


what are the commands to STORE, SELECT, MODIFY, and DELETE
records from a Table of hbase
In Apache HBase, you interact with the database using the HBase shell or
through the HBase Java API. Here are the commands to perform basic CRUD
operations (Create, Read, Update, Delete) using the HBase shell:

Store Records (Put):


# Start the HBase shell
$ hbase shell
# Insert a record into a table
put 'table_name', 'row_key', 'column_family:column_qualifier', 'value'

For example:
put 'example_table', 'row1', 'cf1:column1', 'value1'

Select Records (Get and Scan):


# Get a single record by row key
get 'table_name', 'row_key'
# Scan for records in a table
scan 'table_name'

For example:
get 'example_table', 'row1'
scan 'example_table'

Modify Records (Put):


To modify a record, you can use the same put command as used for
storing records. If the row key already exists, the put command will update
the existing record with the new values.
For example:
put 'example_table', 'row1', 'cf1:column1', 'new_value'

Delete Records (Delete):


# Delete a single record by row key
delete 'table_name', 'row_key'
# Delete a specific column in a record
delete 'table_name', 'row_key', 'column_family:column_qualifier'

For example:
delete 'example_table', 'row1'
delete 'example_table', 'row1', 'cf1:column1'
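
The same STORE, SELECT, MODIFY, and DELETE operations can also be performed
through the HBase Java client API mentioned earlier. A minimal sketch,
reusing the example_table, cf1, and column1 names from the shell examples
above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("example_table"))) {

            // STORE / MODIFY: a Put inserts or overwrites the cell value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("column1"), Bytes.toBytes("value1"));
            table.put(put);

            // SELECT: fetch a single row by its row key
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("column1"))));

            // DELETE: remove the whole row (a single column can be targeted via addColumn)
            table.delete(new Delete(Bytes.toBytes("row1")));
        }
    }
}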

Q. How can rdbms be represented in mongodb (6) CRUD operation


justify " distributed scenario normalization contradicts data availability"
In a distributed database system, normalization refers to the process of organizing data
to minimize redundancy and dependency. It involves breaking down data into smaller,
more manageable units and storing them in separate tables, linked together by
relationships. While normalization is a widely accepted practice in traditional relational
database systems (RDBMS), in distributed scenarios, it can contradict the principle of
data availability. Here's why:

1. CAP + CHARACTERISTICS OF NOSQL


2. Increased Network Traffic:
 Normalization often results in spreading related data across multiple
tables, which can lead to increased network traffic in a distributed
environment. Each query may require fetching data from multiple nodes,
leading to latency issues and slower response times. This increased
network overhead can impact data availability, especially in systems with
high throughput requirements.
3. Dependency on Network Connectivity:
 In a distributed environment, network partitions or failures are common
occurrences. When data is normalized and spread across multiple nodes,
each query requires communication across the network to retrieve related
data. If there are network issues or partitions, it can lead to data
unavailability or inconsistent results. Denormalization, on the other hand,
reduces the dependency on network connectivity by storing related data
together in a single document or table.
4. Consistency Challenges:
 In distributed systems, maintaining data consistency across multiple
nodes is challenging, especially in scenarios where data is highly
normalized. Updates to normalized data may require modifying multiple
tables, which increases the likelihood of inconsistencies, especially during
concurrent operations. Denormalization can help mitigate consistency
issues by reducing the need for complex join operations and ensuring that
related data is stored together.
5. Scalability Concerns:
 Normalization can hinder scalability in distributed databases. As the
system scales out to accommodate increased load or data volume, the
overhead of joining normalized data from multiple nodes becomes more
pronounced. This can limit the system's ability to handle large-scale
deployments and may require complex sharding or partitioning strategies
to maintain performance and availability.
6. Trade-off Between Availability and Consistency:
 In distributed systems, there is often a trade-off between data availability
and consistency, known as the CAP theorem. Normalization tends to
prioritize consistency by reducing redundancy and ensuring data integrity.
However, this can come at the cost of availability, especially in scenarios
where network partitions or failures are common. Denormalization allows
for greater flexibility in balancing consistency and availability by
optimizing data access patterns based on specific use cases.

explain the reason why some nosql databases like Cassandra sacrifice
absolute consistency for absolute availability

NoSQL databases like Cassandra prioritize availability over absolute consistency due to
the requirements of distributed systems and the CAP theorem. The CAP theorem,
proposed by computer scientist Eric Brewer, states that in a distributed system, it's
impossible to simultaneously achieve all three of the following guarantees:

1. Consistency (C): Every read receives the most recent write or an error.
2. Availability (A): Every request receives a response, without the guarantee that
it contains the most recent write.
3. Partition tolerance (P): The system continues to operate despite network
partitions (communication failures) between nodes.

Given that network partitions are inevitable in distributed systems, databases must
choose between consistency and availability during network partitions. Cassandra,
along with many other NoSQL databases, opts to prioritize availability over consistency.
Here's why:

1. High Availability Requirements: Many modern applications prioritize
availability over strong consistency, particularly those serving high volumes of
traffic or requiring low-latency responses. Users expect services to remain
operational even under adverse conditions, such as network partitions or
hardware failures. Sacrificing consistency ensures that the system remains
responsive and continues to serve requests, providing a better user experience.
2. Decentralized Nature: Cassandra is designed as a decentralized, distributed
database that can span multiple nodes across different geographical locations.
Each node operates independently, serving read and write requests locally. In
such a setup, enforcing strong consistency across all nodes would introduce
significant latency and increase the risk of network-related issues, undermining
availability.
3. Tunable Consistency and Eventual Consistency: Cassandra offers tunable
consistency levels, allowing developers to choose the appropriate consistency
level for each operation based on their application's requirements. Developers
can select consistency levels ranging from eventual consistency to strong
consistency, depending on factors such as latency, throughput, and data
consistency needs. This flexibility enables developers to strike a balance
between consistency and availability according to the specific use case.
4. Conflict Resolution: NoSQL databases like Cassandra employ conflict resolution
mechanisms to reconcile inconsistencies that may arise due to eventual
consistency. Techniques such as vector clocks or last-write-wins resolution are
used to detect and resolve conflicts during read and write operations, ensuring
eventual convergence of data across replicas.

Differences between relational database management systems (RDBMS) and
NoSQL databases:
 Data model: RDBMS uses a tabular (structured) model; NoSQL uses document,
key-value, column-family, graph, and other models.
 Schema: RDBMS typically has a fixed schema; NoSQL has a flexible schema,
often schema-less or schema-flexible.
 Query language: RDBMS uses SQL (Structured Query Language); NoSQL query
languages vary by database type (e.g., JSON-like queries).
 Transactions: RDBMS supports ACID (Atomicity, Consistency, Isolation,
Durability) transactions; NoSQL may support ACID transactions, but they
are not always guaranteed (CAP trade-offs).
 Scalability: RDBMS scales vertically (scaling up); NoSQL scales
horizontally (scaling out).
 Schema design flexibility: RDBMS has rigid schema structures; NoSQL allows
for more flexible schema designs.
 Data distribution: RDBMS is generally not distributed; NoSQL is
distributed across multiple nodes or clusters.
 Relationships and joins: RDBMS supports complex relationships and joins;
NoSQL may not support joins or complex relationships.
 Data consistency: RDBMS emphasizes strong consistency; NoSQL emphasizes
eventual consistency.
 Use cases: RDBMS suits traditional transactional applications (e.g.,
banking systems); NoSQL suits big data, real-time analytics, distributed
systems, and scalable web applications.
 Examples: RDBMS - MySQL, PostgreSQL, Oracle Database, SQL Server; NoSQL -
MongoDB, Cassandra, Redis, Neo4j, Elasticsearch.

Q. Using a MongoDB database,
create a collection named "posts" and insert the following record:
title: MongoDB, description: MongoDB is a NoSQL database, by: Tom,
comments: We use MongoDB for unstructured data, likes: 100
i). Now write a query to search title of the post written by Tom.
ii) Write mapReduce function to count number of posts created by
various users.

Here are the MongoDB commands to accomplish the tasks:

Insert Records into "posts" Collection:


db.posts.insertOne({
"title": "MongoDB",
"description": "MongoDB is a NoSQL database",
"by": "Tom",
"comments": ["We use MongoDB for unstructured data"],
"likes": 100
});

This command inserts a single document into the "posts" collection with the
specified fields.

Query to Search Title of the Post Written by Tom:


db.posts.find({ "by": "Tom" }, { "title": 1, "_id": 0 });

This query searches for documents in the "posts" collection where the "by" field
equals "Tom" and projects only the "title" field in the result.

MapReduce Function to Count Number of Posts Created by Various Users:


var mapFunction = function() {
emit(this.by, 1);
};

var reduceFunction = function(key, values) {
return Array.sum(values);
};

db.posts.mapReduce(
mapFunction,
reduceFunction,
{ out: "post_count_by_user" }
);
This mapReduce function maps each document to the username ("by" field)
as the key and emits a value of 1. The reduce function then sums up the values for
each key. The result is stored in a collection named "post_count_by_user".
After executing these commands, you can verify the results by querying the
"posts" collection for title or checking the "post_count_by_user" collection for the
count of posts by each user.

why hbase is called column-oriented nosql database built on top of hdfs?

HBase is called a column-oriented NoSQL database because of its design and storage
model, which is optimized for efficient column-wise operations and scalability. Here's
why it's called "column-oriented":

1. Storage Model:
 In HBase, data is stored in a column-oriented fashion, meaning that
columns of data for each row are stored together on disk. This is different
from row-oriented databases where entire rows are stored contiguously.
 Storing data in a column-oriented manner allows for efficient read and
write operations on specific columns, making HBase suitable for analytical
queries and applications with wide and sparse datasets.
2. Column Families:
 HBase organizes data into column families, which are groups of columns
stored together physically. Each column family can contain multiple
columns, and all columns within a column family are stored together on
disk.
 Column families provide a way to group related columns and apply
configurations (such as compression or compaction) at the column family
level, offering flexibility in managing different types of data.
3. Sparse Data Support:
 HBase is designed to efficiently handle sparse data, where rows may have
a varying number of columns, and many columns may have missing or
null values.
 By storing data in a column-oriented fashion, HBase can handle sparse
datasets effectively, as it only stores data that exists, rather than
allocating space for missing values.

Regarding HDFS (Hadoop Distributed File System), HBase is often deployed on top of
HDFS to leverage its distributed storage capabilities. HDFS provides the underlying
storage layer for HBase, enabling HBase to store large volumes of data across a
distributed cluster of commodity hardware. HBase uses HDFS for storing its data files
and relies on Hadoop for distributed data management and fault tolerance.

JSON
JSON stands for JavaScript Object Notation, which is a lightweight data
interchange format used to represent data in a structured format. JSON is often
used for transmitting data between a server and a web application, as it is easy for
humans to read and write and easy for machines to parse and generate.
Here's a simple example of JSON data representing information about a
person:
{
  "name": "John Doe",
  "age": 30,
  "email": "[email protected]",
  "address": {
    "street": "123 Main Street",
    "city": "Anytown",
    "state": "California",
    "postal_code": "12345"
  },
  "phone_numbers": [
    {
      "type": "home",
      "number": "555-1234"
    },
    {
      "type": "work",
      "number": "555-5678"
    }
  ]
}
In this JSON example:
The outermost curly braces {} represent an object.
Inside the object, there are key-value pairs separated by colons (:). The keys
are strings, and the values can be strings, numbers, arrays, objects, boolean values,
or null.
Arrays are enclosed in square brackets [], and elements inside the array are
separated by commas.
Nested objects and arrays can be used to represent complex data structures.
JSON is widely used in web development for transmitting and storing data,
and it's supported by most programming languages through built-in libraries or
third-party packages. It's commonly used in APIs for exchanging data between
servers and clients, configuration files, and data storage formats.
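
As an illustration of machine parsing, the sketch below reads a trimmed-down
version of the document above in Java; it assumes the Jackson library
(com.fasterxml.jackson.databind) is available on the classpath.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonExample {
    public static void main(String[] args) throws Exception {
        String json = "{\"name\": \"John Doe\", \"age\": 30, "
                + "\"address\": {\"city\": \"Anytown\"}, "
                + "\"phone_numbers\": [{\"type\": \"home\", \"number\": \"555-1234\"}]}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);   // parse the JSON text into a tree of nodes

        System.out.println(root.get("name").asText());                                // John Doe
        System.out.println(root.get("address").get("city").asText());                 // Anytown
        System.out.println(root.get("phone_numbers").get(0).get("number").asText());  // 555-1234
    }
}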

Comparison of horizontal scaling and vertical scaling:
 Definition: horizontal scaling adds more machines or nodes to the system;
vertical scaling increases the capacity of existing machines.
 Method: horizontal is distributed scaling (scale-out); vertical is
monolithic scaling (scale-up).
 Resource expansion: horizontal adds more servers or nodes; vertical
upgrades the hardware resources of existing servers.
 Workload distribution: horizontal distributes the workload across multiple
nodes; vertical keeps the workload on a single, upgraded server.
 Scalability: horizontal offers near-linear scalability; vertical is
limited by the hardware limits of a single server.
 High availability: horizontal achieves it through redundancy and fault
tolerance; vertical is limited to the reliability of a single server.
 Cost-effectiveness: horizontal utilizes commodity hardware; vertical may
require investment in high-end hardware.
 Management complexity: horizontal means more nodes to manage; vertical
means fewer nodes to manage.
 Performance impact: horizontal may incur distributed computing overhead;
vertical gives predictable performance improvements with hardware
upgrades.
 Example: horizontal - adding more web servers to handle increased traffic;
vertical - upgrading a database server to improve query performance.

Comparison between SQL (relational databases) and NoSQL databases:
 Data model: SQL uses a tabular (structured) model; NoSQL uses document,
key-value, column-family, graph, and other models.
 Schema: SQL databases typically have a fixed schema; NoSQL has a flexible
schema, often schema-less or schema-flexible.
 Query language: SQL databases use SQL (Structured Query Language); NoSQL
query languages vary by database type (e.g., JSON-like queries).
 Transactions: SQL databases support ACID transactions; NoSQL may support
ACID transactions, but they are not always guaranteed.
 Scalability: SQL databases scale vertically (scaling up); NoSQL scales
horizontally (scaling out).
 Schema design flexibility: SQL databases have rigid schema structures;
NoSQL allows for more flexible schema designs.
 Data distribution: SQL databases are generally not distributed; NoSQL is
distributed across multiple nodes or clusters.
 Relationships and joins: SQL databases support complex relationships and
joins; NoSQL may not support joins or complex relationships.
 Data consistency: SQL databases emphasize strong consistency; NoSQL
emphasizes eventual consistency.
 Use cases: SQL databases suit traditional transactional applications
(e.g., banking systems); NoSQL suits big data, real-time analytics,
distributed systems, and scalable web applications.
 Examples: SQL - MySQL, PostgreSQL, Oracle Database, SQL Server; NoSQL -
MongoDB, Cassandra, Redis, Neo4j, Elasticsearch.

when do we require columnar database explain with example


Columnar databases are particularly useful in scenarios where analytical queries are
frequent and performance is crucial. Here are some situations where a columnar
database might be preferred:

1. Analytics and Reporting:


 Columnar databases excel in analytical workloads where queries involve
aggregating and analyzing large volumes of data. Since analytical queries
typically access a subset of columns rather than entire rows, the columnar
storage format allows for efficient data retrieval and processing.
 Example: A retail company analyzing sales data to identify trends, track
inventory, and optimize pricing strategies might benefit from a columnar
database. Queries such as total sales per product category or sales over
time can be executed quickly due to the columnar storage format.
2. Data Warehousing:
 In data warehousing environments, where historical data from multiple
sources is consolidated for analysis, columnar databases are well-suited
for storing and querying large datasets efficiently.
 Example: A financial institution storing transaction data from various
banking channels (ATMs, online banking, branches) might use a columnar
database for querying historical transaction volumes, identifying patterns,
and detecting fraudulent activities.
3. Data Archiving and Compliance:
 Columnar databases are often used for data archiving and compliance
purposes, where organizations need to store vast amounts of historical
data for regulatory compliance or auditing purposes.
 Example: Healthcare organizations storing patient records, medical
images, and treatment histories might use a columnar database to
efficiently query and retrieve patient data for compliance reporting,
research, or retrospective analysis.
4. Time-Series Data:
 Columnar databases are well-suited for storing time-series data, such as
sensor data, log files, or IoT data, where queries often involve analyzing
data over time intervals or aggregating metrics.
 Example: A smart energy grid system collecting data from sensors
installed on power lines might use a columnar database to store and
analyze energy consumption patterns, voltage fluctuations, and system
performance metrics.
5. Data Exploration and Ad-hoc Queries:
 Columnar databases provide fast query performance, making them
suitable for ad-hoc queries and exploratory data analysis. Users can
interactively query and explore large datasets without experiencing
significant latency.
 Example: Data scientists exploring a dataset to identify correlations,
outliers, or patterns might use a columnar database for interactive data
exploration and visualization.

Chapter 6
why is lucene index called tf-idf

The Lucene index is not directly called "TF-IDF." Rather, Lucene uses the TF-IDF (Term
Frequency-Inverse Document Frequency) algorithm as one of the key components for
scoring and ranking search results. TF-IDF is a statistical measure used to evaluate the
importance of a term within a document relative to a collection of documents.

Here's why Lucene's index is often associated with TF-IDF:


1. Term Frequency (TF): In the context of Lucene, when documents are indexed,
the frequency of each term within a document is recorded. This term frequency
information is then used to calculate the TF component of the TF-IDF score.
Essentially, TF represents how often a term appears within a document.
2. Inverse Document Frequency (IDF): Lucene also maintains statistics about
the overall corpus of indexed documents. IDF measures the rarity of a term
across the entire corpus. Terms that appear frequently in many documents are
considered less important than terms that appear in a smaller subset of
documents. IDF is used to downweight the importance of common terms.
3. Scoring: When a search query is executed, Lucene calculates a relevance score
for each document based on a combination of TF and IDF. Documents containing
the query terms more frequently (higher TF) are given higher scores.
Additionally, terms that are rare across the entire corpus (higher IDF) are given
more weight.
4. Relevance Ranking: Lucene uses the TF-IDF score (along with other factors like
field normalization, length normalization, etc.) to rank search results. Documents
with higher TF-IDF scores are considered more relevant to the query and are
typically displayed higher in the search results.
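
As a rough sketch of the arithmetic behind points 1-3, the classic Lucene
TF-IDF similarity scores a query q against a document d approximately as
follows (boost and normalization factors are omitted, and newer Lucene
versions default to BM25 rather than this classic formula):

\[
\text{score}(q, d) \;\propto\; \sum_{t \in q} \mathrm{tf}(t, d)\cdot \mathrm{idf}(t),
\qquad
\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)},
\qquad
\mathrm{idf}(t) = 1 + \log\frac{N}{\mathrm{df}(t) + 1}
\]

where freq(t, d) is how often term t occurs in document d, N is the total
number of indexed documents, and df(t) is the number of documents
containing t.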

how lucene index resolve write and read congestion problem


Lucene employs several strategies to address write and read congestion problems
inherent in indexing and searching large volumes of data:

1. Segmented Indexing: Lucene divides the index into smaller segments, which
are individually optimized for efficient updates and searches. This segmentation
helps alleviate congestion by distributing the indexing load across multiple
segments. It also allows for incremental updates to the index, reducing the
impact of large indexing operations on search performance.
2. Merge Policy: Lucene employs a merge policy to periodically consolidate
smaller index segments into larger segments. This process helps maintain
optimal index performance by reducing the number of segments and optimizing
disk access patterns. The merge policy can be configured based on factors like
index size, update frequency, and system resources to balance indexing
throughput and search performance.
3. Buffered Indexing: Lucene buffers incoming indexing operations in memory
before flushing them to disk. This buffering reduces the frequency of disk writes,
improving indexing throughput and minimizing congestion caused by frequent
disk access. Buffered indexing also allows for batched updates, which can be
more efficient than individual updates.
4. Concurrent Indexing: Lucene supports concurrent indexing, allowing multiple
threads or processes to index documents simultaneously. This concurrency
increases indexing throughput by leveraging multicore processors and
parallelism. However, care must be taken to ensure thread safety and proper
synchronization to avoid data corruption or inconsistencies.
5. Caching: Lucene incorporates caching mechanisms to cache frequently
accessed data structures and search results. Caching helps reduce the
computational overhead of repetitive search operations and improves overall
search performance. Lucene provides various caching options, such as filter
caching, query result caching, and field value caching, which can be tuned based
on specific use cases and resource constraints.
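
A small sketch of how points 1-4 surface in the Lucene indexing API (the
index path, buffer size, and merge policy values are illustrative
assumptions, not tuning recommendations):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneIndexingExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index"));   // assumed path

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setRAMBufferSizeMB(256);                 // buffered indexing: flush to disk less often
        cfg.setMergePolicy(new TieredMergePolicy()); // merge policy: consolidate small segments

        // IndexWriter is thread-safe, so multiple threads can call addDocument concurrently
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("body", "MongoDB is a NoSQL database", Field.Store.YES));
            writer.addDocument(doc);                 // goes into an in-memory buffer first
            writer.commit();                         // flushes the buffered segment to disk
        }
    }
}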

Analyzers play a crucial role in search engine development as they are
responsible for breaking down text into tokens or terms, normalizing them, and
preparing them for indexing and searching. Different types of analyzers are
available in search engines, each serving specific purposes. Here are some
common types of analyzers and their roles in search engine development:

Standard Analyzer: This is the default analyzer in many search engines. It
performs tokenization, lowercase transformation, and removes stop words
(commonly occurring words like "and," "the," etc.). It's suitable for general-
purpose text indexing and searching.

Whitespace Analyzer: This analyzer tokenizes text based on whitespace
characters. It's useful for preserving exact phrases or treating whitespace-
separated terms as single tokens.

Keyword Analyzer: This analyzer indexes the entire input text as a single term.
It's useful for scenarios where exact matches are required, such as indexing IDs
or codes.

Simple Analyzer: This analyzer breaks text into tokens based on non-letter
characters, such as whitespace and punctuation, and converts tokens to
lowercase. It's less aggressive than the standard analyzer and doesn't remove
stop words.

Language-Specific Analyzers: Analyzers tailored for specific languages are
designed to handle language-specific tokenization rules, stemming (reducing
words to their root form), and stop words. Examples include the EnglishAnalyzer,
FrenchAnalyzer, etc.

Custom Analyzers: Search engines often allow developers to define custom
analyzers tailored to specific requirements. These analyzers can combine
tokenizers, token filters, and character filters in a flexible manner to achieve
desired text processing.
Phonetic Analyzers: These analyzers are used for phonetic matching, which
allows searching for words that sound similar. They transform text into phonetic
representations, enabling searches to match words based on their pronunciation.
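
To make the differences concrete, the sketch below (the field name and the
sample sentence are arbitrary) prints the tokens produced by Lucene's
StandardAnalyzer; swapping in a WhitespaceAnalyzer or KeywordAnalyzer
changes the resulting token stream accordingly.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerExample {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String text = "The Quick Brown Fox jumped over the lazy dog!";

        // the analyzer tokenizes, lowercases, and (depending on configuration) drops stop words
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }
}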