Chapter 2: Hadoop Ecosystem
Ref book: Hadoop Essentials by Shiva Achari, ISBN 978-1-78439-668-8
Available on ProQuest E-book Central
• Searching/text mining
• Log processing
• Recommendation systems
• Archiving
• Pattern recognition
• Risk assessment
• Sentiment analysis
Hadoop Modes
• Standalone: Hadoop runs as a single Java process on the local file system. It is used for simple analysis or debugging.
• Pseudo-distributed: Simulates a multi-node installation on a single node. Each of the component processes runs in a separate
JVM, so instead of installing Hadoop on different servers, you can simulate a cluster on a single server.
• Distributed: A cluster with multiple worker nodes, ranging from tens to hundreds or thousands of nodes.
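To make the difference between the modes concrete, the following is a minimal sketch of the settings that usually turn a standalone installation into a pseudo-distributed one. It assumes the property names and values from the Apache Hadoop single-node setup guide (fs.defaultFS pointing at an HDFS daemon on localhost port 9000, dfs.replication set to 1). The helper hadoop_site_xml and the PSEUDO_DISTRIBUTED table are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of Hadoop properties as a *-site.xml document string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Assumed values for pseudo-distributed mode: the file system is HDFS on
# localhost instead of the local disk, and each block keeps a single replica
# because there is only one DataNode.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": 1},
}

for filename, props in PSEUDO_DISTRIBUTED.items():
    print(f"<!-- {filename} -->")
    print(hadoop_site_xml(props))
```

In standalone mode neither file needs these entries, since Hadoop then defaults to the local file system. In a fully distributed cluster the host in fs.defaultFS would be the NameNode's address rather than localhost.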
Hadoop distributions
• Cloudera: The most widely used Hadoop distribution, with the largest customer base. It provides good support and useful utility
components such as Cloudera Manager, which can create, manage, and maintain a cluster and manage job processing. Cloudera also developed and
contributed Impala, which has real-time processing capability.
• Hortonworks: Hortonworks' strategy is to drive all innovation through the open source community and create an ecosystem of partners that
accelerates Hadoop adoption among enterprises. It uses the open source Hadoop project and is a major contributor to enhancements in Apache
Hadoop. Ambari was developed and contributed to Apache by Hortonworks. Hortonworks offers a very good, easy-to-use sandbox for getting
started. Hortonworks also contributed the changes that made Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server
and Microsoft Azure.
• MapR: The MapR distribution of Hadoop uses different concepts than plain open source Hadoop and its competitors, notably support for a network
file system (NFS) instead of HDFS for better performance and ease of use. With NFS, native Unix commands can be used instead of Hadoop
commands. MapR has high-availability features such as snapshots, mirroring, and stateful failover.
• Amazon Elastic MapReduce (EMR): AWS's Elastic MapReduce (EMR) leverages its comprehensive cloud services, such as Amazon EC2 for
compute, Amazon S3 for storage, and other services, to offer a very strong Hadoop solution for customers who wish to implement Hadoop in the
cloud. EMR is advisable for infrequent big data processing: you pay for the cluster only while it runs, which might save you a lot of money.
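The MapR point above, that native Unix commands can replace Hadoop commands when the cluster is exposed over NFS, can be sketched as follows. The idea is that an NFS mount looks like an ordinary directory, so standard file APIs work where plain Hadoop would need something like hadoop fs -ls. The function list_cluster_dir is a hypothetical helper, and a temporary directory stands in for the mount point, since the real mount path depends on the installation.

```python
import os
import tempfile


def list_cluster_dir(mount_point, relative_path=""):
    """List a directory on an NFS-mounted cluster using ordinary file APIs."""
    return sorted(os.listdir(os.path.join(mount_point, relative_path)))


# Demo: a temporary directory stands in for the NFS mount point (assumption;
# on a real cluster this would be wherever the administrator mounted it).
with tempfile.TemporaryDirectory() as fake_mount:
    os.makedirs(os.path.join(fake_mount, "logs"))
    open(os.path.join(fake_mount, "logs", "app.log"), "w").close()
    print(list_cluster_dir(fake_mount, "logs"))
```

With HDFS-only distributions the same listing would go through the Hadoop client rather than the operating system's file layer.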