Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It includes HDFS, a distributed file system, and MapReduce, a programming model for large-scale data processing. HDFS stores data reliably across clusters and allows computations to be processed in parallel near the data. The key components are the NameNode, DataNodes, JobTracker and TaskTrackers. HDFS provides high throughput access to application data and is suitable for applications handling large datasets.
The document summarizes Google File System (GFS) and MapReduce. GFS is a scalable distributed file system used for large data applications. It consists of a master server that coordinates metadata, and chunkservers that store and serve file chunks to clients. MapReduce is a processing technique that uses mappers to break data into key-value pairs, and reducers to combine the outputs into smaller sets. It allows for scalable distributed processing across computing nodes.
2. Big Data technologies
• Big data technologies are essential for offering more precise analysis, which can
lead to more concrete decision making and, in turn, better operational
efficiencies, cost reductions, and reduced risks for the business.
• To harness the power of big data you need an infrastructure that can
handle and process huge volumes of structured and unstructured data in real
time and can preserve data privacy and security.
• Today, the various architectures and papers contributed by these and
other developers across the world have culminated in several open-source
projects under the Apache Software Foundation and the NoSQL movement.
• All of these technologies have been identified as Big Data processing platforms,
including Hadoop, Hive, HBase, Cassandra, and MapReduce.
• NoSQL platforms include MongoDB, Neo4j, Riak, Amazon DynamoDB,
MemcachedDB, BerkeleyDB, Voldemort, and many more.
3. Distributed Data Processing
• Distributed data processing has been in
existence since the late 1970s.
• The primary concept was to replicate the
DBMS in a master–slave configuration and
process data across multiple instances
• Each slave would engage in a two-phase
commit with its master in a query processing
situation.
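To make that handshake concrete, the sketch below illustrates a generic two-phase commit between a coordinator (the master) and its participants (the slaves). It is only an illustration of the protocol; the Participant interface and class names are hypothetical and not tied to any particular DBMS.

```java
import java.util.List;

// Hypothetical participant (slave) interface in a master-slave DBMS setup.
interface Participant {
    boolean prepare(String txId);   // phase 1: vote yes (can commit) or no (must abort)
    void commit(String txId);       // phase 2: make the tentative change durable
    void rollback(String txId);     // phase 2: undo the tentative change
}

// The master acts as the two-phase commit coordinator.
class TwoPhaseCommitCoordinator {
    private final List<Participant> slaves;

    TwoPhaseCommitCoordinator(List<Participant> slaves) {
        this.slaves = slaves;
    }

    boolean execute(String txId) {
        // Phase 1: ask every slave to prepare; any "no" vote aborts the transaction.
        for (Participant slave : slaves) {
            if (!slave.prepare(txId)) {
                slaves.forEach(s -> s.rollback(txId));
                return false;
            }
        }
        // Phase 2: all slaves voted yes, so tell each one to commit.
        slaves.forEach(s -> s.commit(txId));
        return true;
    }
}
```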
4. Why did distributed data processing fail to meet the
requirements in the relational data processing
architecture?
• Complex architectures for consistency management
• Latencies across the system
• Slow networks
• Infrastructure cost
• Complex data processing and transformation
requirements
5. Client–Server Data
Processing
Benefits:
• Centralization of administration, security, and setup.
• Back-up and recovery of data are relatively inexpensive; outages at the server
or a client can be restored from backups.
• Infrastructure can be scaled by adding more server or client capacity,
although the scaling is not linear.
• Accessibility of the server from heterogeneous platforms locally or remotely.
• Clients can use servers for different types of processing.
Limitations:
• The server is the central point of failure.
• Very limited scalability.
• Performance can degrade with network congestion.
• A single server cannot process data quickly when too many clients access it at the same time.
6. Big Data Processing Requirements
Volume:
• Size of data to be processed is large—it needs
to be broken into manageable chunks.
• Data needs to be processed in parallel across
multiple systems.
• Data needs to be processed across several
program modules simultaneously.
7. Velocity:
• Data needs to be processed at streaming speeds during data collection.
• Data needs to be processed for multiple acquisition points.
Variety:
• Data of different formats needs to be processed.
• Data of different types needs to be processed.
• Data of different structures needs to be processed.
• Data from different regions needs to be processed.
Ambiguity:
• Big Data is ambiguous by nature due to the lack of relevant metadata and context in
many cases. An example is the use of M and F in a sentence—it can mean,
respectively, Monday and Friday, male and female, or mother and father.
• Big Data that is within the corporation also exhibits this ambiguity to a lesser degree.
For example, employment agreements have standard and custom sections and the
latter is ambiguous without the right context.
Complexity:
• Because of its complexity, Big Data needs many algorithms to process data quickly and
efficiently.
• Several types of data need multipass processing and scalability is extremely
important.
8. Google File System
https://youtu.be/eRgFNW4QFDc
• Developed by Google to meet its ever-increasing
data processing requirements.
• It is a scalable distributed file system.
• GFS is tailored to the data Google works with and
its storage requirements; applications such as the
search engine produce large amounts of data that
need to be stored.
• The main purpose behind the design of GFS is to
meet Google’s huge cluster requirements
without placing extra load on applications.
9. • Google organized the GFS into clusters of computers.
• Each cluster might contain hundreds or even thousands
of machines. Within GFS clusters there are three kinds
of entities: clients, master servers and chunkservers.
• "client" refers to any entity that makes a file request.
• Requests can range from retrieving and manipulating
existing files to creating new files on the system.
• Clients can be other computers or computer
applications.
• You can think of clients as the customers of the GFS.
16. Master Server
• The master server acts as the coordinator for the cluster.
• The master's duties include maintaining an operation log, which keeps track of
the activities of the master's cluster.
• The operation log helps keep service interruptions to a minimum -- if the
master server crashes, a replacement server that has monitored the operation
log can take its place.
• The master server also keeps track of metadata, which is the information that
describes chunks.
• The metadata tells the master server to which files the chunks belong and
where they fit within the overall file.
• Upon startup, the master polls all the chunkservers in its cluster.
• The chunkservers respond by telling the master server the contents of their
inventories.
• From that moment on, the master server keeps track of the location of chunks
within the cluster.
• There's only one active master server per cluster at any one time (though each
cluster has multiple copies of the master server in case of a hardware failure)
17. Chunkservers
• Chunkservers are the workhorses of the GFS.
• They're responsible for storing the 64-MB file chunks.
• The chunkservers don't send chunks to the master
server.
• Instead, they send requested chunks directly to the
client.
• The GFS copies every chunk multiple times and stores it
on different chunkservers.
• Each copy is called a replica.
• By default, the GFS makes three replicas per chunk, but
users can change the setting and make more or fewer
replicas if desired.
18. A GFS cluster:
• A single master
• Multiple chunk servers (workers or slaves) per
master
• Accessed by multiple clients
• Running on commodity Linux machines
A file:
• Represented as fixed-sized chunks
• Labeled with 64-bit unique global IDs
• Stored at chunk servers and three-way mirrored
across chunk servers
19. • In the GFS cluster, input data files are divided into
chunks (64 MB is the standard chunk size), each
assigned its unique 64-bit handle, and stored on
local chunk server systems as files.
• To ensure fault tolerance and scalability, each chunk
is replicated at least once on another server, and the
default design is to create three copies of a chunk.
• The role of the master is to communicate to clients
which chunk servers have which chunks and their
metadata information.
• Clients’ tasks then interact directly with chunk
servers for all subsequent operations, and use the
master only in a minimal fashion.
20. • Another important issue to understand in the GFS architecture is the single point
of failure (SPOF) of the master node and all the metadata that keeps track of the
chunks and their state.
• To avoid this situation, GFS was designed to have the master keep data in
memory for speed, keep a log on the master’s local disk, and replicate the disk
across remote nodes.
• This way if there is a crash in the master node, a shadow can be up and running
almost instantly.
• The master stores three types of metadata:
1. File and chunk names or namespaces.
2. Mapping from files to chunks (i.e., the chunks that make up each file).
3. Locations of each chunk’s replicas. Replica locations are not kept persistently by
the master; each chunk server holds the authoritative record of the chunks it
stores and reports this information to the master at startup or when the
chunk server is added to a cluster.
• Since the master controls the chunk placement, it always updates metadata as
new chunks get written.
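As a rough illustration of this bookkeeping, the sketch below models the master's in-memory metadata. GFS itself is proprietary (and not written in Java); every name here is hypothetical and purely illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative model of the GFS master's in-memory metadata (all names hypothetical).
class GfsMasterMetadata {
    // Metadata types 1 and 2: namespace plus the file-to-chunk mapping,
    // i.e. full path -> ordered list of 64-bit chunk handles.
    private final Map<String, List<Long>> fileToChunks = new ConcurrentHashMap<>();

    // Metadata type 3: chunk handle -> chunk servers holding a replica.
    // These locations are not persisted; chunk servers report them at startup or when they join.
    private final Map<Long, List<String>> chunkToReplicaLocations = new ConcurrentHashMap<>();

    // Called when a chunk server reports its inventory (startup or join).
    void recordReplica(long chunkHandle, String chunkServer) {
        chunkToReplicaLocations
            .computeIfAbsent(chunkHandle, h -> new CopyOnWriteArrayList<>())
            .add(chunkServer);
    }

    // A client asks which chunk servers hold a given chunk of a file;
    // after this lookup it talks to the chunk servers directly.
    List<String> locateChunk(String path, int chunkIndex) {
        long handle = fileToChunks.get(path).get(chunkIndex);
        return chunkToReplicaLocations.get(handle);
    }
}
```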
21. • To recover from any corruption, GFS appends data as it becomes
available rather than updating an existing data set; this provides the
ability to recover from corruption or failure quickly.
• When a corruption is detected, with a combination of frequent
checkpoints, snapshots, and replicas, data is recovered with
minimal chance of data loss.
22. The GFS architecture has the following
strengths:
● Availability:
1. Triple replication–based redundancy (or more if you choose).
2. Chunk replication.
3. Rapid failovers for any master failure.
4. Automatic replication management.
● Performance:
1. The biggest workload for GFS is reads on large data sets, which, based on the architecture
discussion, will be a nonissue.
2. Chunks are rarely written to directly, which helps maintain availability.
● Management:
1. GFS manages itself through multiple failure modes.
2. Automatic load balancing.
3. Storage management and pooling.
4. Chunk management.
5. Failover management.
● Cost:
1. Is not a constraint due to use of commodity hardware and Linux platforms.
23. Hadoop
• The Hadoop framework works in an
environment that provides distributed storage and
computation across clusters of computers.
• Hadoop is designed to scale up from a single server to
thousands of machines, each offering local
computation and storage.
• Hadoop is an open source, Java-based programming
framework that supports the processing and storage
of extremely large data sets in a distributed
computing environment.
26. Map Reduce
• MapReduce is a parallel programming model for
writing distributed applications devised at Google
• For efficient processing of large amounts of data
(multiterabyte datasets), on large clusters
(thousands of nodes) of commodity hardware in a
reliable, fault tolerant manner.
• The MapReduce program runs on Hadoop which is
an Apache open source framework.
27. Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is
based on the Google File System (GFS) and
provides a distributed file system that is
designed to run on commodity hardware.
• It is highly fault tolerant and is designed to be
deployed on low-cost hardware.
• It provides high-throughput access to
application data and is suitable for applications
with large datasets.
28. • Apart from the two core components mentioned
above, the Hadoop framework also includes
the following two modules:
• Hadoop Common:
These are Java libraries and utilities required by
other Hadoop modules.
• Hadoop YARN:
• This is a framework for job scheduling and cluster
resource management.
29. How it works?
• Data is initially divided into directories and files. Files are
divided into uniform-sized blocks of 128 MB or 64 MB
(preferably 128 MB); a small configuration sketch follows this list.
• These files are then distributed across various cluster nodes
for further processing.
• HDFS, sitting on top of the local file system, supervises the
processing.
• Blocks are replicated for handling hardware failure
• Checking that the code was executed successfully
• Performing the sort that takes place between the map and
reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job
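The block size and replication factor mentioned above are ordinary HDFS configuration properties (dfs.blocksize and dfs.replication). The sketch below sets them programmatically; the values are only examples, and in practice they are usually set in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example values: 128 MB block size and 3 replicas per block.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/")));
        fs.close();
    }
}
```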
30. HDFS
• Hadoop Distributed File System is a block-structured file system
where each file is divided into blocks of a pre-determined size.
• These blocks are stored across a cluster of one or several machines.
• The Apache Hadoop HDFS architecture follows a master/slave
design, where a cluster comprises a single NameNode
(master node) and all the other nodes are DataNodes (slave nodes).
• The HDFS architecture was designed to solve two known problems
experienced by the early developers of large-scale data processing.
• The first problem was the ability to break down the files across
multiple systems and process each piece of the file independent of
the other pieces and finally consolidate all the outputs in a single
result set.
• The second problem was the fault tolerance both at the file
processing level and the overall system level in the distributed data
processing systems.
32. Namenode and Datanodes
Master/slave architecture
HDFS cluster consists of a single Namenode, a master server that
manages the file system namespace and regulates access to files by
clients.
There are a number of DataNodes, usually one per node in the cluster.
The DataNodes manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be
stored in files.
A file is split into one or more blocks, and the set of blocks is stored in
DataNodes.
DataNodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the Namenode.
34. File System Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename etc.
• Namenode maintains the file system
• Any metadata changes to the file system are
recorded by the Namenode.
• An application can specify the number of replicas
of a file it needs: this is the replication factor of the file,
and it is stored by the Namenode.
35. Data Replication
HDFS is designed to store very large files across
machines in a large cluster.
Each file is a sequence of blocks.
All blocks in the file except the last are of the same size.
Blocks are replicated for fault tolerance.
Block size and replicas are configurable per file.
The Namenode receives a Heartbeat and a BlockReport
from each DataNode in the cluster.
BlockReport contains all the blocks on a Datanode.
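The namespace operations and the per-file replication factor described on the last two slides map directly onto Hadoop's Java FileSystem API. A minimal sketch; the paths and the replication value are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hierarchical namespace operations, all recorded by the Namenode.
        fs.mkdirs(new Path("/user/demo"));
        fs.rename(new Path("/user/demo/input.txt"), new Path("/user/demo/archive.txt"));

        // The replication factor is configurable per file; the Namenode stores it
        // and the Datanodes replicate the file's blocks accordingly.
        fs.setReplication(new Path("/user/demo/archive.txt"), (short) 2);

        fs.close();
    }
}
```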
36. Datanode
• A Datanode stores data in files in its local file system.
• The Datanode has no knowledge of the HDFS file system as a whole.
• It stores each block of HDFS data in a separate file.
• The Datanode does not create all files in the same directory.
• A typical HDFS cluster can have thousands of DataNodes and tens of
thousands of HDFS clients per cluster, since each DataNode may execute
multiple application tasks simultaneously.
• The DataNodes are responsible for managing read and write requests
from the file system’s clients, and block maintenance and perform
replication as directed by the NameNode.
• The size of the data file equals the actual length of the block. This means
that if a block is half full, it needs only half of the space of a full block on the
local drive. This optimizes storage space for compactness; no extra space
is consumed for the block, unlike in a regular file system.
37. • Image: An image represents the metadata of the namespace (inodes and
lists of blocks).
• On startup, the NameNode pins the entire namespace image in memory.
The in-memory persistence enables the NameNode to service multiple
client requests concurrently.
• Journal: The journal represents the modification log of the image in the
local host’s native file system.
• During normal operation, each client transaction is recorded in the journal,
and the journal file is flushed and synced before the acknowledgment is
sent to the client. The NameNode, upon startup or during recovery, can
replay this journal.
• Checkpoint: To enable recovery, a persistent record of the image is
also stored in the local host’s native file system and is called a
checkpoint.
• Once the system starts up, the NameNode never modifies or updates the
checkpoint file.
• A new checkpoint file can be created during the next startup, on a restart,
or on demand when requested by the administrator or by the
CheckpointNode.
38. Checkpoint Node and Backup Node
• There are two roles that a NameNode can be
designated to perform apart from servicing client
requests and managing Data Nodes.
• These roles are specified during startup and can
be the Checkpoint Node or the Backup Node.
39. Checkpoint Node
• The Checkpoint Node serves as a journal-capture
architecture to create a recovery mechanism for the
NameNode.
• The Checkpoint Node combines the existing
checkpoint and journal to create a new checkpoint
and an empty journal in specific intervals.
• It returns the new checkpoint to the NameNode.
• The Checkpoint Node runs on a different host from
the NameNode since it has the same memory
requirements as the NameNode.
• This mechanism provides protection.
40. Backup Node
• The Backup Node can be considered as a read-only NameNode.
• It contains all file system metadata information except for block
locations.
• It accepts a stream of namespace transactions from the active
NameNode and saves them to its own storage directories, and
applies these transactions to its own namespace image in its
memory.
• If the NameNode fails, the Backup Node’s image in memory and the
checkpoint on disk are a record of the latest namespace state and
can be used to create a checkpoint for recovery.
• Creating a checkpoint from a Backup Node is very efficient as it
processes the entire image in its own disk and memory.
• A Backup Node can perform all operations of the regular
NameNode that do not involve modification of the namespace or
management of block locations.
41. MapReduce
• The key features of MapReduce that make it the
interface on Hadoop or Cassandra include:
1. Automatic parallelization
2. Automatic distribution
3. Fault-tolerance
4. Status and monitoring tools
5. Easy abstraction for programmers
6. Programming language flexibility
7. Extensibility
42. MapReduce Programming Model
• MapReduce is based on functional programming models largely from Lisp.
Typically, the users will implement two functions:
Map (in_key, in_value) -> (out_key, intermediate_value) list
• The Map function written by the user will receive an input pair of keys and values,
and after the computation cycles, will produce a set of intermediate key-value
pairs.
• Library functions are then used to group together all intermediate values
associated with an intermediate key I and pass them to the Reduce function.
Reduce (out_key, intermediate_value list) -> out_value list
• The Reduce function written by the user will accept an intermediate key I, and the
set of values for the key.
• It will merge together these values to form a possibly smaller set of values.
• Reducer outputs are just zero or one output value per invocation.
• The intermediate values are supplied to the Reduce function via an iterator.
• The iterator allows us to handle lists of values that are too large to fit in
memory or to process in a single pass.
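The classic word-count example shows these two functions in Hadoop's Java MapReduce API. This is a minimal sketch of that standard example (class names follow the common tutorial convention), not code taken from this presentation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(in_key, in_value) -> list of intermediate (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit an intermediate (word, 1) pair
        }
    }
}

// Reduce(word, list of counts) -> (word, total count).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();           // merge the intermediate values for this key
        }
        context.write(key, new IntWritable(sum));
    }
}
```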
44. • The main components of this architecture include:
• Mapper—maps input key-value pairs to a set of
intermediate key-value pairs.
• For an input pair, the mapper can map to zero or
many output pairs. By default, one map task is
spawned for each input split.
• Reducer—performs a number of tasks:
Sort and group mapper outputs.
Shuffle partitions.
Perform secondary sorting as necessary.
Manage overrides specified by users for grouping
and partitioning.
45. • Reporter—is used to report progress, set application-level status messages,
update any user set counters, and indicate long running tasks or jobs are alive.
• Combiner—an optional performance booster that can be specified to perform
local aggregation of the intermediate outputs to manage the amount of data
transferred from the Mapper to the Reducer.
• Partitioner—controls the partitioning of the keys of the intermediate map
outputs. The key (or a subset of the key) is used to derive the partition; by default,
partitions are derived by a hash function. The total number of partitions is the
same as the number of reduce tasks for the job.
• Output collector—collects the output of Mappers and Reducers.
• Job configuration—is the primary user interface to manage MapReduce jobs.
• It is typically used to specify the Mapper, Combiner, Partitioner, Reducer,
InputFormat, OutputFormat, and OutputCommitter for every job.
• It also indicates the set of input files and where the output files should be written.
Optionally used to specify other advanced options for the job such as the
comparator to be used, files to be put in the DistributedCache, and compression on
intermediate and/or final job outputs.
• It is also used to control debugging via user-provided scripts, whether job tasks can be
executed speculatively, the maximum number of attempts per task in case of
failure, and the percentage of task failures that can be tolerated by the
job overall.
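A typical job configuration wires these pieces together, including the optional Combiner. A minimal driver sketch that reuses the word-count classes sketched above; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper, optional Combiner (local aggregation), and Reducer for the job.
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);

        // Output key/value types produced by the Reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set of input files to read and directory where output files are written.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```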
46. • Output committer—is used to manage the commit for jobs and tasks in
MapReduce. Key tasks executed are:
• Set up the job during initialization. For example, create the intermediate directory
for the job during the initialization of the job.
• Clean up the job after the job completion. For example, remove the temporary
output directory after the job completion.
• Set up any task temporary output.
• Check whether a task needs a commit. This avoids the overhead of unnecessary
commits.
• Commit the task output on completion.
• On failure, discard the task commit and clean up all intermediate results, memory
release, and other user-specified tasks.
• Job input:
• Specifies the input format for a Map/Reduce job.
• Validate the input specification of the job.
• Split up the input file(s) into logical splits, each assigned to an individual
Mapper.
• Provide input records from the logical splits for processing by the Mapper.
• Memory management, JVM reuse, and compression are managed with the job
configuration set of classes.
50. Anatomy of File Write and Read
• HDFS has a master and slave kind of architecture.
• The Namenode acts as the master and the Datanodes as workers.
• All the metadata information is with the Namenode, and the
actual data is stored on the Datanodes.
• Keeping all this in mind, the figure below gives an idea
of how data flows between the client
interacting with HDFS, i.e. the Namenode and the
Datanodes.
51. • The following steps are involved in reading the file from HDFS:
Let’s suppose a client (an HDFS client) wants to read a file from HDFS.
Step 1: First the client opens the file by calling the open() method on the
FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
Step 2: DistributedFileSystem calls the NameNode, using RPC (Remote Procedure
Call), to determine the locations of the blocks for the first few blocks of the file.
For each block, the NameNode returns the addresses of all the DataNodes that
have a copy of that block. The client then interacts with the respective DataNodes to read
the file. The NameNode also provides a token to the client, which the client shows to the
DataNode for authentication.
• The DistributedFileSystem returns an FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the
DataNode and NameNode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the DataNode addresses for the first few blocks in the file, connects to the
closest DataNode for the first block in the file.
52. • Step 4: Data is streamed from the DataNode back to the
client, which calls read() repeatedly on the stream.
• Step 5: When the end of the block is reached,
DFSInputStream will close the connection to the
DataNode, then find the best DataNode for the next
block. This happens transparently to the client, which
from its point of view is just reading a continuous
stream.
• Step 6: Blocks are read in order, with the
DFSInputStream opening new connections to datanodes
as the client reads through the stream. It will also call the
NameNode to retrieve the DataNode locations for the next
batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
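From the client application's point of view, all of these steps hide behind a handful of calls. A minimal read sketch; the namenode URI and file path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() returns an FSDataInputStream wrapping a DFSInputStream,
        // which talks to the NameNode and DataNodes behind the scenes.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/archive.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // read() is called repeatedly
        }
        fs.close();
    }
}
```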
54. Now we will look at what happens when you write a File in
HDFS.
• The DistributedFileSystem object makes an RPC call to the NameNode to create a new file in
the filesystem namespace, with no blocks associated with it.
• The NameNode performs various checks: (a) whether the client has the required permissions
to create the file, and (b) whether the file already exists. If a check fails, it
throws an IOException to the client.
• Once the file is registered with the NameNode, the client gets an
FSDataOutputStream object, which in turn wraps a DFSOutputStream object that the client
writes data to. DFSOutputStream handles communication with the
DataNodes and the NameNode.
• As the client writes data, DFSOutputStream splits it into packets and writes them to its
internal data queue; it also maintains an acknowledgement queue.
• The data queue is then consumed by a DataStreamer process, which is responsible for
asking the NameNode to allocate new blocks by picking a list of suitable DataNodes to
store the replicas.
55. • The list of DataNodes forms a pipeline; assuming a
replication factor of three, there will
be three nodes in the pipeline.
• The DataStreamer streams the packets to the first
DataNode in the pipeline, which stores each
packet and forwards it to the second DataNode in the
pipeline. Similarly, the second node stores the
packet and forwards it to the next (last)
DataNode in the pipeline.
• Once every DataNode in the pipeline has acknowledged
a packet, the packet is removed from the
acknowledgement queue.
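The client-side code for the write path is just as small, since DFSOutputStream manages the packets, queues, and pipeline internally. A minimal sketch; the file path and contents are placeholders.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() triggers the RPC to the NameNode; the returned FSDataOutputStream
        // wraps a DFSOutputStream that packetizes the data and drives the pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/notes.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```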
56. • Now, what happens when one of the machines in the
pipeline that is running a DataNode process fails?
Hadoop has inbuilt functionality to
handle this scenario. If a DataNode fails while data
is being written to it, then the following actions
are taken, which are transparent to the client
writing the data.
57. • First, the pipeline is closed, and any packets in the ack queue are added to the front
of the data queue so that datanodes that are downstream from the failed node will
not miss any packets.
• The current block on the good datanodes is given a new identity, which is
communicated to the namenode, so that the partial block on the failed datanode will
be deleted if the failed datanode recovers later on.
• The failed datanode is removed from the pipeline, and the remainder of the block’s
data is written to the two good datanodes in the pipeline.
• The namenode notices that the block is under-replicated, and it arranges for a
further replica to be created on another node. Subsequent blocks are then treated as
normal.
It’s possible, but unlikely, that multiple datanodes fail while a block is being written.
• As long as dfs.replication.min replicas (which defaults to one) are written, the write
will succeed, and the block will be asynchronously replicated across the cluster until
its target replication factor is reached (dfs.replication, which defaults to three).