PPT on Hadoop

Jan 9, 2016

The document discusses Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers. It describes Hadoop as having two main components - the Hadoop Distributed File System (HDFS) which stores data across infrastructure, and MapReduce which processes the data in a parallel, distributed manner. HDFS provides redundancy, scalability, and fault tolerance. Together these components provide a solution for businesses to efficiently analyze the large, unstructured "Big Data" they collect.

What is Hadoop?
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is developed by the Apache Software Foundation. The first release dates from 2006; version 1.0 followed in 2011.
• It is written in Java.

Hadoop is open source software.
Framework
Massive Storage
Processing Power

Big Data
• Big data is a term for the very large amounts of unstructured and semi-structured data that a company creates.
• The term is used when talking about petabytes and exabytes of data.
• Loading that much data into a relational database for analysis would take too much time and cost too much.
• Facebook has almost 10 billion photos, taking up 1 petabyte of storage.

So what is the problem?
1. Processing data at that scale in a relational database is very difficult.
2. It would take too much time and money to process the data.
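A rough back-of-the-envelope calculation makes the time problem concrete. The sketch below assumes a sequential read rate of 100 MB/s per disk and a cluster of 1,000 disks; both figures are illustrative assumptions, not numbers from the slides.

```python
PETABYTE = 10**15               # bytes (decimal petabyte)
DISK_THROUGHPUT = 100 * 10**6   # ASSUMED: 100 MB/s sequential read per disk


def scan_seconds(total_bytes, disks):
    """Seconds needed to read total_bytes split evenly over `disks` disks read in parallel."""
    return total_bytes / (DISK_THROUGHPUT * disks)


single = scan_seconds(PETABYTE, 1)
parallel = scan_seconds(PETABYTE, 1000)

print(f"1 disk:     {single / 86400:.1f} days")    # 115.7 days
print(f"1000 disks: {parallel / 3600:.1f} hours")  # 2.8 hours
```

Under these assumptions, simply reading 1 petabyte once takes months on one disk and a few hours when the work is spread over many disks.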

We can solve this problem with distributed computing.
But distributed computing has problems of its own:
Hardware failure
There is always a chance of hardware failure.
Combining the data after analysis
The results from all the disks have to be combined, which is messy.
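To see why hardware failure becomes a routine event as a cluster grows, here is a small sketch. It assumes each machine fails independently with the same probability on a given day; the 0.1% daily failure rate is a made-up illustrative value, and the function name is hypothetical.

```python
def prob_any_failure(nodes, p_fail):
    """Probability that at least one of `nodes` machines fails,
    ASSUMING independent failures with probability p_fail each."""
    return 1 - (1 - p_fail) ** nodes


# ASSUMED daily failure probability per machine: 0.1%
for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes: {prob_any_failure(n, 0.001):.3f}")
```

With a single machine a failure on any given day is rare. With a thousand machines, under the same assumption, a failure somewhere in the cluster is more likely than not, so the system has to treat failure as normal.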

Hadoop was created to solve these problems.
It has two main parts:
1. Hadoop Distributed File System (HDFS)
2. Data processing framework (MapReduce)

1. Hadoop Distributed File System
It ties many small, reasonably priced machines together into a single cost-effective computer cluster.
Data and application processing are protected against hardware failure.
If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail.
It automatically stores multiple copies of all data.
It provides a simplified programming model that allows users to quickly read from and write to the distributed system.
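The idea of storing multiple copies can be illustrated with a toy in-memory model. This is only a sketch, not the HDFS API: the `ToyCluster` class, its method names, the round-robin placement, and the 4-byte block size are all assumptions made for the example. Real HDFS uses much larger blocks and its own placement policy.

```python
import itertools


class ToyCluster:
    """In-memory stand-in for a replicated block store (NOT the HDFS API)."""

    def __init__(self, node_names, replication=3, block_size=4):
        if replication > len(node_names):
            raise ValueError("replication cannot exceed the number of nodes")
        self.nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.alive = set(node_names)
        self.replication = replication
        self.block_size = block_size
        self.files = {}                                 # filename -> [block_id, ...]
        self._ring = itertools.cycle(node_names)

    def write(self, filename, data):
        """Split data into blocks and store each block on `replication` distinct nodes."""
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            block_id = (filename, offset)
            chunk = data[offset:offset + self.block_size]
            # Consecutive picks from the ring are distinct nodes.
            for _ in range(self.replication):
                self.nodes[next(self._ring)][block_id] = chunk
            block_ids.append(block_id)
        self.files[filename] = block_ids

    def fail(self, node_name):
        """Mark a node as down; its copies become unreachable."""
        self.alive.discard(node_name)

    def read(self, filename):
        """Reassemble the file from any surviving copy of each block."""
        parts = []
        for block_id in self.files[filename]:
            holder = next(n for n in sorted(self.alive) if block_id in self.nodes[n])
            parts.append(self.nodes[holder][block_id])
        return b"".join(parts)


cluster = ToyCluster(["n1", "n2", "n3", "n4"], replication=3)
cluster.write("log.txt", b"hello hadoop")
cluster.fail("n1")
cluster.fail("n2")
recovered = cluster.read("log.txt")
print(recovered)
```

Because every block lives on three different nodes in this model, losing any two nodes still leaves at least one readable copy of each block.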

2. MapReduce
MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
The name also refers to the associated implementation of that model.
The MAP function processes a key/value pair to generate a set of intermediate key/value pairs.
The REDUCE function merges all intermediate values associated with the same intermediate key.
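The map and reduce roles described above can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not Hadoop's Java API; the function names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample input lines are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(_line_number, line):
    """MAP: turn one (line number, text) pair into intermediate (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """REDUCE: merge all intermediate values that share the same key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the grouping step, often called shuffle
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


lines = [
    (0, "big data needs big clusters"),
    (1, "hadoop stores big data"),
]
result = run_mapreduce(lines, map_fn, reduce_fn)
print(result)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and each reduce call receives the grouped values for its keys over the network. The sketch keeps only the data flow.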

Pros of Hadoop
1. Computing power
2. Flexibility
3. Fault Tolerance
4. Low Cost
5. Scalability

Cons of Hadoop
1. Integration with existing systems
Hadoop is not optimised for ease of use. Installing it and integrating it with existing
databases might prove difficult, especially since no software support is
provided.
2. Administration and ease of use
Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This
means significant training may be required to administer Hadoop clusters.
3. Security
Hadoop lacks the level of security functionality needed for safe enterprise deployment,
especially if it concerns sensitive data.

Similar to PPT on Hadoop (20)

Hadoop training in bangaloreTIB Academy

Hadoop tutorial for Freshers, TIB Academy

Hadoopthisisnabin

This document discusses Hadoop technology. It was developed by Doug Cutting and Michael J. Cafarella to support large-scale data processing for the Nutch search engine project. Hadoop features include distributed storage via HDFS and distributed processing via MapReduce. It allows for scalable and fault-tolerant processing of large data sets across commodity hardware. While Hadoop provides low-cost processing of big data, its administration and integration with other systems can be challenging.

Hadoop introduction , Why and What is Hadoop ?sudhakara st

Hadoop live online trainingHarika583

Jumpstart your career with the world’s most in-demand technology: Hadoop. Hadooptrainingacademy provides best Hadoop online training with quality videos, comprehensive online live training and detailed study material. Join today! For more info, visit: http://www.hadooptrainingacademy.com/ Contact Us: 8121660088 732-419-2619 http://www.hadooptrainingacademy.com/

Seminar pptRajatTripathi34

This document discusses big data and Hadoop. It defines big data as large amounts of unstructured data that would be too costly to store and analyze in a traditional database. It then describes how Hadoop provides a solution to this challenge through distributed and parallel processing across clusters of commodity hardware. Key aspects of Hadoop covered include HDFS for reliable storage, MapReduce for distributed computing, and how together they allow scalable analysis of very large datasets. Popular users of Hadoop like Amazon, Yahoo and Facebook are also mentioned.

Hadoop by kamran khanKamranKhan587

Learn what is Hadoop-and-BigDataThanusha154

Hadoop Seminar ReportAtul Kushwaha

1. The document discusses the evolution of computing from mainframes to smaller commodity servers and PCs. It then introduces cloud computing as an emerging technology that is changing the technology landscape, with examples like Google File System and Amazon S3. 2. It discusses the need for large data processing due to increasing amounts of data from sources like the stock exchange, Facebook, genealogy sites, and scientific experiments. 3. Hadoop is introduced as a framework for distributed computing and reliable shared storage and analysis of large datasets using its Hadoop Distributed File System (HDFS) for storage and MapReduce for analysis.

Hadoop infoNikita Sure

Understanding hadoopRexRamos9

This document provides an overview of Hadoop, including its core components HDFS, MapReduce, and YARN. It describes how HDFS stores and replicates data across nodes for reliability. MapReduce is used for distributed processing of large datasets by mapping data to key-value pairs, shuffling, and reducing results. YARN was introduced to improve scalability by separating job scheduling and resource management from MapReduce. The document also gives examples of using MapReduce on a movie ratings dataset to demonstrate Hadoop functionality and running simple MapReduce jobs via the command line.

Hadoop technologytipanagiriharika

Hadoop is the popular open source like Facebook, Twitter, RFID readers, sensors, and implementation of MapReduce, a powerful tool so on.Your management wants to derive designed for deep analysis and transformation of information from both the relational data and thevery large data sets. Hadoop enables you to unstructuredexplore complex data, using custom analyses data, and wants this information as soon astailored to your information and questions. possible.Hadoop is the system that allows unstructured What should you do? Hadoop may be the answer!data to be distributed across hundreds or Hadoop is an open source project of the Apachethousands of machines forming shared nothing Foundation.clusters, and the execution of Map/Reduce It is a framework written in Java originallyroutines to run on the data in that cluster. Hadoop developed by Doug Cutting who named it after hishas its own filesystem which replicates data to sons toy elephant.multiple nodes to ensure if one node holding data Hadoop uses Google’s MapReduce and Google Filegoes down, there are at least 2 other nodes from System technologies as its foundation.which to retrieve that piece of information. This It is optimized to handle massive quantities of dataprotects the data availability from node failure, which could be structured, unstructured orsomething which is critical when there are many semi-structured, using commodity hardware, thatnodes in a cluster (aka RAID at a server level). is, relatively inexpensive computers. This massive parallel processing is done with greatWhat is Hadoop? performance. However, it is a batch operation handling massive quantities of data, so theThe data are stored in a relational database in your response time is not immediate.desktop computer and this desktop computer As of Hadoop version 0.20.2, updates are nothas no problem handling this load. 
possible, but appends will be possible starting inThen your company starts growing very quickly, version 0.21.and that data grows to 10GB. Hadoop replicates its data across differentAnd then 100GB. computers, so that if one goes down, the data areAnd you start to reach the limits of your current processed on one of the replicated computers.desktop computer. Hadoop is not suitable for OnLine Transaction So you scale-up by investing in a larger computer, Processing workloads where data are randomly and you are then OK for a few more months. accessed on structured data like a relational When your data grows to 10TB, and then 100TB. database.Hadoop is not suitable for OnLineAnd you are fast approaching the limits of that Analytical Processing or Decision Support Systemcomputer. workloads where data are sequentially accessed onMoreover, you are now asked to feed your structured data like a relational database, to application with unstructured data coming from generate reports that provide business sources intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Pro

Big Data and Hadoop (Mr. Ankit)

This document provides an overview of big data and Hadoop. It defines big data using the 3Vs - volume, variety, and velocity. It describes Hadoop as an open-source software framework for distributed storage and processing of large datasets. The key components of Hadoop are HDFS for storage and MapReduce for processing. HDFS stores data across clusters of commodity hardware and provides redundancy. MapReduce allows parallel processing of large datasets. Careers in big data involve working with Hadoop and related technologies to extract insights from large and diverse datasets.

Seminar_Report_hadoop (Varun Narang)

This document provides an introduction and overview of Hadoop, an open-source framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop uses MapReduce and HDFS to parallelize workloads and store data redundantly across nodes to solve issues around hardware failure and combining results. Key aspects covered include how HDFS distributes and replicates data, how MapReduce isolates processing into mapping and reducing functions to abstract communication, and how Hadoop moves computation to the data to improve performance.

Introduccion a Hadoop / Introduction to Hadoop (GERARDO BARBERENA)

Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.

2.1-HADOOP.pdf (MarianJRuben)

This document provides an overview of Hadoop, including:
- Prerequisites for getting the most out of Hadoop include programming skills in languages like Java and Python, SQL knowledge, and basic Linux skills.
- Hadoop is a software framework for distributed processing of large datasets across computer clusters using MapReduce and HDFS.
- Core Hadoop components include HDFS for storage, MapReduce for distributed processing, and YARN for resource management.
- The Hadoop ecosystem also includes components like HBase, Pig, Hive, Mahout, Sqoop and others that provide additional functionality.

Cppt Hadoop (chunkypandey12)

This document discusses Hadoop and its core components HDFS and MapReduce. It provides an overview of how Hadoop addresses the challenges of big data by allowing distributed processing of large datasets across clusters of computers. Key points include: Hadoop uses HDFS for distributed storage and MapReduce for distributed processing; HDFS works on a master-slave model with a Namenode and Datanodes; MapReduce utilizes a map and reduce programming model to parallelize tasks. Fault tolerance is built into Hadoop to prevent single points of failure.

Cppt (chunkypandey12)

This document provides an overview of Hadoop and how it addresses the challenges of big data. It discusses how Hadoop uses a distributed file system (HDFS) and MapReduce programming model to allow processing of large datasets across clusters of computers. Key aspects summarized include how HDFS works using namenodes and datanodes, how MapReduce leverages mappers and reducers to parallelize processing, and how Hadoop provides fault tolerance.

Hadoop .pdf (SudhanshiBakre1)

Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through HDFS and processes large amounts of data in parallel using MapReduce. The core components of Hadoop are HDFS for storage, MapReduce for processing, and YARN for resource management. Hadoop allows for scalable and cost-effective solutions to various big data problems like storage, processing speed, and scalability by distributing data and computation across clusters.

Hadoop training in bangalore (TIB Academy)

Hadoop tutorial for Freshers, TIB Academy

Hadoop (thisisnabin)

Hadoop introduction, Why and What is Hadoop? (sudhakara st)

Hadoop live online training (Harika583)

Seminar ppt (RajatTripathi34)

Hadoop by kamran khan (KamranKhan587)

Learn what is Hadoop-and-BigData (Thanusha154)

Hadoop Seminar Report (Atul Kushwaha)

Hadoop info (Nikita Sure)

Understanding hadoop (RexRamos9)

Hadoop technology (tipanagiriharika)

PPT on Hadoop

1. BY – SHUBHAM PARMAR

2. What is Hadoop? • The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • It is developed by the Apache Software Foundation; version 1.0 was released in 2011. • Written in Java.

3. Hadoop is open source software. • Framework • Massive Storage • Processing Power

4. Big Data • Big data is a term for the very large amounts of unstructured and semi-structured data that a company creates. • The term is used when talking about petabytes and exabytes of data. • Loading that much data into a relational database for analysis would take too much time and money. • Facebook has almost 10 billion photos, taking up about 1 petabyte of storage.

5. So what is the problem? 1. Processing that much data in a relational database is very difficult. 2. Processing it would take too much time and cost too much.

6. We can solve this problem with distributed computing. But distributed computing has problems of its own: • Hardware failure: with many machines, there is always a chance that some hardware will fail. • Combining the data after analysis: the results from all the disks have to be combined, which is messy.

7. Hadoop was created to solve these problems. It has two main parts: 1. Hadoop Distributed File System (HDFS) 2. Data processing framework (MapReduce)

8. 1. Hadoop Distributed File System • It ties many small, reasonably priced machines together into a single cost-effective computer cluster. • Data and application processing are protected against hardware failure. • If a node goes down, jobs are automatically redirected to other nodes so that the distributed computation does not fail. • It automatically stores multiple copies of all data. • It provides a simplified programming model that lets users quickly read from and write to the distributed system.
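
The "multiple copies of all data" idea on the HDFS slide can be illustrated with a small Python sketch. It is a toy model, not HDFS code: the function names, node names, and the round-robin placement are assumptions made for illustration, and real HDFS placement also takes racks into account. The replication factor of 3 matches the usual HDFS default.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        # Consecutive nodes, wrapping around, so the replicas are always distinct.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical cluster of four nodes holding five blocks of one file.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]

placement = place_blocks(blocks, nodes)
survivors = readable_blocks(placement, failed={"node2"})
```

With three replicas per block, every block stays readable even if any two nodes fail at once, which is the property the slide describes as protection against hardware failure.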

9. 2. MapReduce • MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. • The name also refers to an associated implementation of that model. • The MAP function processes a key/value pair to generate a set of intermediate key/value pairs. • The REDUCE function merges all intermediate values associated with the same intermediate key.
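
The MAP and REDUCE roles described on this slide can be sketched as a word count in plain Python. This is a single-process illustration of the programming model, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the grouping step stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """MAP: take one (line number, text) pair and emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """REDUCE: merge all values that share the same intermediate key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then run reduce per key."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical input: (line number, line text) pairs.
lines = [
    (0, "big data needs big clusters"),
    (1, "hadoop stores big data"),
]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Because each call to `map_fn` looks at only one record and each call to `reduce_fn` looks at only one key, both phases can in principle be spread over many machines, which is what makes the model suitable for a cluster.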

12. Pros of Hadoop 1. Computing power 2. Flexibility 3. Fault Tolerance 4. Low Cost 5. Scalability

13. Cons of Hadoop 1. Integration with existing systems: Hadoop is not optimised for ease of use. Installing it and integrating it with existing databases can be difficult, especially since no software support is provided. 2. Administration and ease of use: Hadoop requires knowledge of MapReduce, while most data practitioners use SQL. This means significant training may be required to administer Hadoop clusters. 3. Security: Hadoop lacks the level of security functionality needed for safe enterprise deployment, especially where sensitive data is concerned.
