This document provides an overview of Apache Hadoop, including its core components and features. It discusses how Hadoop uses MapReduce for distributed processing of large datasets across clusters of computers, and it describes HDFS, Hadoop's distributed file system, which provides high-throughput access to application data by storing data in blocks replicated across multiple nodes.
The 3Vs of Big Data are Volume, Velocity, and Variety.

Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
The project includes these modules:
Hadoop Common: the common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.
Hadoop's biggest advantage is the easy scaling of data processing over many computing nodes.

The primitives used to achieve data processing with MapReduce are called mappers and reducers; a sketch follows.
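As an illustration (not taken from this document), here is a minimal sketch of the canonical MapReduce word-count job in Java. The class name and input/output paths are placeholders; the mapper emits (word, 1) pairs, the reducer sums them, and the driver submits the job to the cluster (YARN in Hadoop 2+).

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every input line, emit (word, 1) for each token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the 1s emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure and submit the job; input/output paths come from args.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```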
Hadoop File System (HDFS)

HDFS stands for Hadoop Distributed File System.
HDFS is a filesystem designed for large-scale distributed data processing under MapReduce.
HDFS is the primary distributed storage used by Hadoop applications.
An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data.
Data is stored in blocks that are distributed and replicated over many nodes, which allows HDFS to store a very large data set (e.g., 100 TB) as a single file. Block sizes typically range from 64 MB to 1 GB; a sketch of setting these per file follows.
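As a hedged illustration, the Java FileSystem API lets a client choose the replication factor and block size per file at create time. The path and sizes below are placeholder values, not taken from this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockDemo {
  public static void main(String[] args) throws Exception {
    // Picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.bin");   // hypothetical path
    short replication = 3;                       // copies of each block
    long blockSize = 128L * 1024 * 1024;         // 128 MB blocks

    try (FSDataOutputStream out =
             fs.create(file, true /* overwrite */, 4096 /* buffer */, replication, blockSize)) {
      out.writeBytes("hello hdfs");
    }
  }
}
```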
Hadoop provides a set of command-line utilities that work similarly to the basic Linux file commands (a sketch of the matching file system API calls follows the list below).

List of supported file systems:
HDFS
FTP
Amazon S3 file system
Windows Azure Storage Blobs (WASB) file system
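As an assumed illustration, the same operations exposed by the shell utilities are also available programmatically through the Java FileSystem API; the paths below are hypothetical, and the comments note the equivalent shell commands.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsOps {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("/user/demo"));                           // hadoop fs -mkdir /user/demo
    fs.copyFromLocalFile(new Path("data.txt"),
                         new Path("/user/demo/data.txt"));       // hadoop fs -put data.txt /user/demo
    for (FileStatus s : fs.listStatus(new Path("/user/demo"))) { // hadoop fs -ls /user/demo
      System.out.println(s.getPath() + "\t" + s.getLen());
    }
  }
}
```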
Features in HDFS:
File permissions and authentication
Rack awareness
Safemode
Upgrade and rollback
Secondary NameNode
Checkpoint node
Backup node
Fsck
Fetchdt

Current level of security in Hadoop

The current version of Hadoop has a very basic, rudimentary implementation of security: an advisory access control mechanism.
Hadoop does not strongly authenticate the client; it simply asks the underlying Unix system for the user's identity by executing the `who am i` command (illustrated below).
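As a rough illustration (my assumption, not from this document), the effect of this scheme is visible through Hadoop's UserGroupInformation API: under simple authentication the identity is derived from the client's local OS login, so the cluster trusts whatever the client asserts.

```java
import org.apache.hadoop.security.UserGroupInformation;

public class WhoAmI {
  public static void main(String[] args) throws Exception {
    // Under simple (non-Kerberos) authentication, this identity comes from
    // the client's local OS user; the cluster does not verify it.
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    System.out.println("Hadoop will act as user: " + ugi.getUserName());
  }
}
```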
Anyone who has the block location details can communicate directly with a DataNode (without needing to communicate with the NameNode) and ask for blocks. (This was demonstrated experimentally at a recent Cloudera Hadoop Hackathon.)

Hadoop Configuration

NameNode configuration and DataNode configuration; a hedged sketch of typical settings follows.
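The document does not spell out the configuration values, so the following is a hedged sketch of typical NameNode/DataNode-related properties using Hadoop's Configuration API; all values shown are illustrative placeholders. In a real deployment these properties usually live in core-site.xml and hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Client/NameNode: where clients locate the NameNode (placeholder host/port).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    // NameNode: local directory for file system metadata (placeholder path).
    conf.set("dfs.namenode.name.dir", "/var/hadoop/name");
    // DataNode: local directory for block storage (placeholder path).
    conf.set("dfs.datanode.data.dir", "/var/hadoop/data");
    // Default replication factor for new files.
    conf.set("dfs.replication", "3");

    System.out.println("NameNode URI: " + conf.get("fs.defaultFS"));
  }
}
```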