
Unit-4: Understanding Hadoop Ecosystem

Hadoop Introduction

“Hadoop is a technology to store massive datasets on a cluster of cheap machines in a distributed manner.” It was created by Doug Cutting and Mike Cafarella.

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many other companies. Moreover, it can be scaled up simply by adding nodes to the cluster.

Hadoop is the solution to the Big Data problems described above. It stores massive datasets on a cluster of cheap machines in a distributed manner, and it also provides Big Data analytics through a distributed computing framework.

It is open-source software developed as a project of the Apache Software Foundation. Doug Cutting created Hadoop, and in 2008 Yahoo contributed Hadoop to the Apache Software Foundation. Since then, two major versions have been released: version 1.0 in 2011 and version 2.0.6 in 2013. Hadoop is available in various distributions such as Cloudera, IBM BigInsights, MapR, and Hortonworks.



Why Was Hadoop Invented?

Let us discuss the shortcomings of the traditional approach that led to the invention of Hadoop –

1. Storage for Large Datasets

A conventional RDBMS is incapable of storing huge amounts of data. The cost of data storage in an RDBMS is very high, as it incurs both hardware and software costs.

2. Handling data in different formats

An RDBMS can store and manipulate data only in a structured format. But in the real world we have to deal with structured, semi-structured, and unstructured data.

3. Data generated at high speed

Data is being generated at the rate of terabytes to petabytes daily, so we need a system that can process data in real time, within a few seconds. A traditional RDBMS fails to provide real-time processing at such speeds.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.



In 2002, Doug Cutting and Mike Cafarella started work on Apache Nutch, an open-source web crawler project.

While working on Apache Nutch, they had to deal with big data, and storing that data would have been very costly for the project. This problem became one of the important reasons for the emergence of Hadoop.

In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.

In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.



In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.

In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.

Doug Cutting named his project Hadoop after his son's toy elephant.

In 2007, Yahoo ran two clusters of 1,000 machines.

In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.

In 2013, Hadoop 2.2 was released.

In 2017, Hadoop 3.0 was released.

Hadoop Ecosystem

Hadoop consists of three core components –

• Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.


• Map-Reduce – It is the data processing layer of Hadoop.
• YARN – It is the resource management layer of Hadoop.



Hadoop Stack

Large datasets can be processed, stored, and distributed across computer clusters using the open-source Hadoop framework. Hadoop, created by the Apache Software Foundation, is a scalable, dependable, and affordable solution for managing large amounts of data. The MapReduce programming approach for parallel processing and the Hadoop Distributed File System (HDFS) for distributed storage form the system's foundation. HDFS splits large files into smaller blocks and distributes them throughout the cluster to provide high availability and fault tolerance. Inspired by ideas from functional programming, MapReduce divides large jobs into smaller subtasks and distributes them among cluster nodes to process data in parallel. Because Hadoop is distributed, it can process large volumes of data quickly and effectively, making it a fundamental tool in the big data ecosystem.
With the inclusion of new components like Apache Hive for data warehousing, Apache Pig for high-level scripting, and Apache Spark for
in-memory processing, Hadoop has grown in the last several years. Because of this ecosystem, Hadoop is a flexible platform that can handle
various data processing requirements, from real-time analytics to batch processing. Hadoop continues to be a key technology as businesses
struggle with the problems brought on by the exponential expansion of data. It offers the framework for creating reliable and scalable big
data solutions.

1. Hadoop Distributed File System (HDFS):

The storage part of Hadoop, known as the Hadoop Distributed File System (HDFS), was created primarily to manage massive files and spread them across many cluster nodes. To provide fault tolerance, it divides files into smaller blocks (usually 128 MB or 256 MB in size) and replicates them across nodes. In HDFS's master-slave design, a NameNode manages the metadata while DataNodes store the data. This design makes Hadoop's fault tolerance and high throughput possible, making it ideal for storing and retrieving massive volumes of data.
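As a rough illustration (not from the original notes) of how a client works with HDFS, the sketch below wraps the standard hdfs dfs shell commands in Python; the directory /user/demo and the local file sales.csv are assumptions made for illustration.

```python
# A minimal sketch of interacting with HDFS from Python by invoking the
# standard "hdfs dfs" shell commands; paths and the input file are assumed.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and return its output as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a directory and upload a local file; HDFS splits it into blocks
# (typically 128 MB) and replicates each block across DataNodes.
hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "sales.csv", "/user/demo/sales.csv")

# List the directory and read part of the file back.
print(hdfs("-ls", "/user/demo"))
print(hdfs("-cat", "/user/demo/sales.csv")[:200])

# Increase the replication factor of the file to 3 copies.
hdfs("-setrep", "-w", "3", "/user/demo/sales.csv")
```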

2. MapReduce Programming paradigm:

Hadoop uses this programming paradigm to handle and analyze large datasets concurrently. It separates a task into two stages: the Map
phase, which involves splitting the input into key-value pairs and processing them concurrently, and the Reduce phase, which involves
combining the output of the Map phase. This architecture makes it possible to process data in parallel across several nodes, which makes
handling large-scale computations efficient. Despite its strength, MapReduce can be difficult to use for some computations.

Apart from these fundamental elements, the Hadoop ecosystem has grown to encompass a range of initiatives and resources that enhance
its capabilities and address distinct facets of the data processing workflow.

MapReduce is the heart of Hadoop. It is a software framework for writing applications that process large datasets in parallel across
hundreds or thousands of nodes on the Hadoop cluster.

Hadoop divides the client’s MapReduce job into a number of independent tasks that run in parallel to improve throughput.

The MapReduce framework works in two phases, the Map phase and the Reduce phase. The input to both phases is a set of key-value pairs.
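To make the two phases concrete, here is a minimal, self-contained word-count sketch in Python (not part of the original notes). It runs the map and reduce logic locally on a tiny input; on a real cluster the same two functions would be packaged as mapper and reducer scripts, for example through Hadoop Streaming, and run in parallel across many nodes.

```python
# Word count expressed as a Map phase and a Reduce phase over key-value pairs.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: split each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (pairs are sorted by key first)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
for word, count in reduce_phase(map_phase(lines)):
    print(word, count)   # e.g. big 2, data 2, hadoop 2, ...
```

The sorting and grouping between the two functions plays the role that the shuffle-and-sort step plays between Map and Reduce tasks on a Hadoop cluster.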

Features of Hadoop MapReduce:

• Scalable: Once we write a MapReduce program, we can easily expand it to work over a cluster having hundreds or even thousands
of nodes.
• Fault-tolerance: It is highly fault-tolerant. It automatically recovers from failure.
• Distributed Processing: MapReduce makes it possible to process data in parallel on a group of computers. It provides for scalable
and effective data processing by distributing the data and jobs among numerous nodes.



• Support for Several Programming Languages: Different MapReduce implementations support various programming languages,
enabling developers to do data processing tasks using the language of their choosing.

3. Apache Hive:

Built on top of Hadoop, Apache Hive is a data warehousing tool with an SQL-like query language that makes it easier for analysts and data scientists to handle massive datasets.



Apache Hive is a Java-based data warehousing tool developed by Facebook for analyzing and processing large data.

Hive uses HQL (Hive Query Language), an SQL-like language that is transformed into MapReduce jobs for processing huge amounts of data.

It allows developers and analysts to query and analyze big data with SQL-like queries (HQL) without writing complex MapReduce jobs.

Users can interact with Apache Hive through the command-line tool (the Beeline shell) and the JDBC driver.
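As a hedged sketch of what querying Hive from a client program might look like (not part of the original notes), the snippet below uses the PyHive Python client against the HiveServer2 port; the host name, the sales table, and its columns are assumptions made for illustration.

```python
# A hedged sketch using the PyHive client (assumed to be installed); the Hive
# host, the 'sales' table, and its columns are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# An HQL query: Hive compiles this into MapReduce (or Tez/Spark) jobs.
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")

for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()
```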

Features of Apache Hive:

• Hive supports client applications written in many languages, such as Python, Java, PHP, Ruby, and C++.
• It generally uses an RDBMS for metadata storage, which significantly reduces the time taken for semantic checks.
• Hive partitioning and bucketing improve query performance.
• Hive is fast, scalable, and extensible.
• It supports Online Analytical Processing and is an efficient ETL tool.
• It provides support for user-defined functions (UDFs) to handle use cases that are not covered by built-in functions.

4. Apache Pig:

Apache Pig is a high-level scripting platform created to make writing MapReduce applications easier. Pig Latin is the language used to express data transformations in it.

Pig was developed by Yahoo as an alternative approach to make writing MapReduce jobs easier.



It enables developers to use Pig Latin, a scripting language designed for the Pig framework that runs on the Pig runtime.

Pig Latin consists of SQL-like commands that the compiler converts into MapReduce programs in the background.

A Pig script works by first loading the data from its source.

Then we perform various operations on it, such as sorting, filtering, and joining.

Finally, depending on the requirement, the results are either dumped to the screen or stored back into HDFS.
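A rough sketch of that load-transform-store flow (not from the original notes): a small Pig Latin script is written to a file and run in local mode with the pig command-line tool; the input file students.csv and its columns are assumptions.

```python
# A hedged sketch: write a small Pig Latin script to a file and run it in
# local mode via the 'pig' CLI. The input file and columns are assumptions.
import subprocess

pig_script = """
-- Load the data source, filter, sort, and store the result back.
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, subject:chararray, marks:int);
passed   = FILTER students BY marks >= 40;
ranked   = ORDER passed BY marks DESC;
STORE ranked INTO 'passed_students' USING PigStorage(',');
"""

with open("rank_students.pig", "w") as f:
    f.write(pig_script)

# '-x local' runs Pig on the local filesystem instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "-f", "rank_students.pig"], check=True)
```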

Features of Pig:

• Extensibility: Users can create their own functions to perform special-purpose processing.
• Solving complex use cases: Pig is well suited to complex use cases that involve multiple data-processing steps with multiple imports and exports.
• Handles all kinds of data: Both structured and unstructured data can be easily analyzed or processed using Pig.
• Optimization opportunities: Pig automatically optimizes the execution of tasks, so programmers can focus on semantics rather than efficiency.
• It provides a platform for building data flows for ETL (Extract, Transform, and Load), processing, and analyzing massive data sets.

5. Apache Spark:

Spark was not part of the Hadoop project at first, but it is frequently used in tandem with Hadoop. Compared with conventional MapReduce, it is a fast and versatile cluster-computing system that offers in-memory processing and greater expressiveness.



It is a popular open-source unified analytics engine for big data and machine learning.

Apache Spark was developed under the Apache Software Foundation to speed up Hadoop big data processing.

It extends the Hadoop MapReduce model to effectively use it for more types of computations like interactive queries, stream processing,
etc.

Apache Spark enables batch, real-time, and advanced analytics over the Hadoop platform.

Spark provides in-memory data processing for developers and data scientists.

Companies, including Netflix, Yahoo, eBay, and many more, have deployed Spark at a massive scale.
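A minimal PySpark sketch (not part of the original notes) illustrating in-memory processing with the DataFrame API; the file transactions.csv and its region and amount columns are assumptions made for illustration.

```python
# A hedged PySpark sketch; the CSV path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# Read a CSV (from HDFS or the local filesystem) into a DataFrame.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# cache() keeps the data in memory across the two actions below.
df = df.cache()

summary = (df.groupBy("region")
             .agg(F.sum("amount").alias("total"),
                  F.count("*").alias("orders"))
             .orderBy(F.desc("total")))

summary.show()
print("rows processed:", df.count())

spark.stop()
```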

Features of Apache Spark:

• Speed: Spark has the ability to run applications in Hadoop clusters 100 times faster in memory and ten times faster on the disk.
• Ease of use: It can work with different data stores (such as OpenStack, HDFS, and Cassandra), which gives it more flexibility than Hadoop alone.
• Generality: It contains a stack of libraries, including MLlib for machine learning, SQL and DataFrames, GraphX, and Spark
Streaming. We can combine these libraries in the same application.
• Runs Everywhere: Spark can run on Hadoop, Kubernetes, Apache Mesos, standalone, or in the cloud.

7. Apache HBase



Operating on top of Hadoop, HBase is a distributed, scalable NoSQL database. It offers real-time read and write access to massive datasets. Data is arranged into column families in a column-family store for effective storage and retrieval.

Scalability: Able to handle enormous volumes of data by scaling horizontally.

8. Apache Kafka

Kafka is a distributed streaming platform used to build real-time data pipelines and streaming applications.

Publish-Subscribe Model: The publish-subscribe model decouples producers and consumers in real-time data processing.

Fault Tolerance: Designed with fault tolerance and high availability in mind.
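As a rough illustration of the publish-subscribe model (not from the original notes), the sketch below uses the kafka-python client; the broker address and the clickstream topic are assumptions made for illustration.

```python
# A hedged sketch using the kafka-python client (assumed installed); the broker
# address and the 'clickstream' topic are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes JSON events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "u42", "page": "/checkout"})
producer.flush()

# Consumer: an independent subscriber reads the same topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop iterating after 5 s with no messages
)
for message in consumer:
    print(message.value)
```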

9. ZooKeeper

Distributed systems can be managed and synchronized with ZooKeeper, a distributed coordination service.

Coordination: Offers a centralized solution for naming, distributed synchronization, and configuration information maintenance.
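A minimal sketch (not part of the original notes) of storing a shared configuration value in ZooKeeper using the kazoo Python client; the connection string and the znode path are assumptions made for illustration.

```python
# A hedged sketch with the kazoo client (assumed installed); the ZooKeeper
# address and the znode path are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a configuration value at a znode, creating parent nodes if needed.
if not zk.exists("/app/config/batch_size"):
    zk.create("/app/config/batch_size", b"500", makepath=True)

# Any node in the cluster can now read the same value.
value, stat = zk.get("/app/config/batch_size")
print(value.decode(), "version:", stat.version)

zk.stop()
```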

Together, these elements create a solid and adaptable Hadoop environment that enables businesses to handle a range of big data processing, analytics, and storage tasks. New projects and tools are constantly being introduced to the ecosystem to meet new challenges in the big data era.

10. Apache Mahout



Apache Mahout is an open-source framework that normally runs on top of the Hadoop infrastructure to manage large volumes of data.

The name Mahout is derived from the Hindi word “Mahavat,” which means the rider of an elephant.

Because Apache Mahout runs its algorithms on top of the Hadoop framework, it was given the name Mahout.

We can use Apache Mahout to implement scalable machine learning algorithms on top of Hadoop using the MapReduce paradigm.

Apache Mahout is not restricted to the Hadoop-based implementation; it can also run algorithms in standalone mode.

Apache Mahout implements popular machine learning algorithms such as Classification, Clustering, Recommendation, Collaborative
filtering, etc.

Features of Mahout:

• It works well in a distributed environment since its algorithms are written on top of Hadoop. It uses the Hadoop library to scale in the cloud.
• Mahout offers a ready-to-use framework for performing data mining tasks on large datasets.
• It lets applications analyze large datasets quickly.
• Apache Mahout includes various MapReduce-enabled clustering algorithms such as Canopy, Mean-Shift, k-means, and fuzzy k-means.
• It also includes vector and matrix libraries.
• Apache Mahout exposes various classification algorithms such as Naive Bayes, Complementary Naive Bayes, and Random Forest.

11. HBase

HBase is an open-source distributed NoSQL database that stores sparse data in tables consisting of billions of rows and columns.
It is written in Java and modeled after Google’s Bigtable.

HBase is used when we need to search or retrieve a small amount of data from large data sets.

For example, if we have billions of customer emails and we need to find the names of the customers who used the word "replace" in their emails, we can use HBase.
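A rough sketch of that kind of lookup (not from the original notes) using the happybase Python client, which talks to HBase through its Thrift gateway; the emails table, the msg column family, and the host are assumptions made for illustration.

```python
# A hedged sketch with the happybase client (assumed installed; requires the
# HBase Thrift server). The 'emails' table and 'msg' column family are assumed.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("emails")

# Write one row: the row key is the email id, columns live in a column family.
table.put(b"email-0001", {
    b"msg:customer": b"Asha Rao",
    b"msg:body": b"Please replace my damaged handset.",
})

# Scan rows and pick out customers whose email body contains the word 'replace'.
for row_key, data in table.scan(columns=[b"msg:customer", b"msg:body"]):
    if b"replace" in data.get(b"msg:body", b""):
        print(row_key.decode(), "->", data[b"msg:customer"].decode())

connection.close()
```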

There are two main components in HBase. They are:

• HBase Master: The HBase Master coordinates load balancing across the Region Servers. It controls failover and maintains and monitors the cluster.
• Region Server: The Region Server is the worker node that handles read, write, update, and delete requests from clients.

Features of HBase:

• Scalable storage
• Support for fault tolerance
• Support for real-time search on sparse data
• Support for strongly consistent reads and writes
• Column-Family-Based Data Model: HBase groups data into logical groups of columns called column families. Data is saved in rows
within these column families, each of which can have many columns. This data architecture enables effective data retrieval and flexible
schema creation.
• HBase Coprocessors: HBase supports coprocessors, specialized code modules that run on Region Servers. Coprocessors improve query speed by letting developers run custom data-processing logic close to the data.



Analysing Data with Unix tools

Analysing Data with Hadoop

Hadoop Streaming

IBM Big Data Strategy

Introduction to InfoSphere BigInsights and BigSheets.

Introduction to InfoSphere BigInsights

IBM InfoSphere BigInsights is a big data analytics platform built on Apache Hadoop. It is designed to help enterprises store, manage, and analyze large volumes of structured and unstructured data efficiently.

IBM InfoSphere Streams is a software platform that enables the development and execution of applications that process information in data streams. InfoSphere Streams enables continuous and fast analysis of massive volumes of moving data to help improve the speed of business insight and decision making.

BigInsights is an analytics platform that enables companies to turn complex Internet-scale information sets into insights.

It consists of a packaged Apache Hadoop distribution, with a greatly simplified installation process, and associated tools for application
development, data movement, and cluster management.

Other open source technologies in BigInsights are:



◦Pig: A platform that provides a high-level language for expressing programs that analyze large datasets. Pig has a compiler that translates
Pig programs into sequences of MapReduce jobs that the Hadoop framework executes.

◦Hive: A data-warehousing solution built on top of the Hadoop environment. It brings familiar relational-database concepts, such as tables, columns, and partitions, and a subset of SQL (HiveQL) to the unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs executed using Hadoop.

◦Jaql: An IBM-developed query language designed for JavaScript Object Notation (JSON) that provides an SQL-like interface.

◦HBase: A column-oriented NoSQL data-storage environment designed to support large, sparsely populated tables in Hadoop.

◦Flume: A distributed, reliable, available service for efficiently moving large amounts of data as it is produced. Flume is well-suited to
gathering logs from multiple systems and inserting them into the Hadoop Distributed File System (HDFS) as they are generated.

◦Avro: A data-serialization technology that uses JSON for defining data types and protocols, and serializes data in a compact binary
format.

◦Lucene: A search-engine library that provides high-performance and full-featured text search.

◦ZooKeeper: A centralized service for maintaining configuration information and naming, providing distributed
synchronization and group services.

◦Oozie: A workflow scheduler system for managing and orchestrating the execution of Apache Hadoop jobs.



1. Key Features of BigInsights

1. Hadoop-based Architecture
• Uses HDFS (Hadoop Distributed File System) for scalable data storage.
• Supports MapReduce, Spark, and YARN for distributed computing.
2. Big SQL
• Allows SQL-based querying on Hadoop data, making it accessible for users familiar with relational databases.
• Supports joins, subqueries, and aggregations similar to traditional SQL databases.
3. Advanced Text Analytics
• Enables natural language processing (NLP) and text mining to extract meaningful insights from unstructured data (e.g., emails, documents, social media).
4. Machine Learning and Predictive Analytics
• Supports predictive modeling and data science frameworks like Apache Spark MLlib.
5. Security and Data Governance
• Provides role-based access control (RBAC), encryption, and data auditing features.

2. Infosphere BigInsights Use Cases

2.1 Social Media and Sentiment Analysis


Use Case: A retail company wants to analyze customer opinions from Twitter and Facebook to improve its marketing strategy.

Solution using BigInsights:


• Data Collection: Gather tweets, comments, and posts using APIs.
• Storage: Store raw data in HDFS.
• Processing:
o Use text analytics to detect keywords like "good service" or "bad experience".
o Apply sentiment analysis to classify data as positive, neutral, or negative (a minimal sketch follows this use case).



• Visualization: Generate reports using BigSheets or export insights to a BI tool.

Outcome: The company identifies trending issues and adjusts marketing strategies accordingly.
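A hedged PySpark sketch of the keyword-based processing step above (not part of the original notes); the input path, the text field, and the keyword lists are assumptions made for illustration.

```python
# A hedged PySpark sketch of keyword-based sentiment tagging; the input file,
# column names, and keyword lists are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SentimentTagging").getOrCreate()

# Raw posts collected from social-media APIs and stored in HDFS.
posts = spark.read.json("hdfs:///social/raw_posts.json")   # expects a 'text' field

text = F.lower(F.col("text"))
positive = ["good service", "great", "love"]
negative = ["bad experience", "terrible", "refund"]

# 1 if the text contains any of the given keywords, 0 otherwise.
has_any = lambda words: F.greatest(*[text.contains(w).cast("int") for w in words])

labelled = posts.withColumn(
    "sentiment",
    F.when(has_any(positive) == 1, "positive")
     .when(has_any(negative) == 1, "negative")
     .otherwise("neutral"),
)

labelled.groupBy("sentiment").count().show()
spark.stop()
```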

2.2 Fraud Detection in Banking


Use Case: A bank wants to detect fraudulent transactions in real time.

Solution using BigInsights:


• Data Collection: Aggregate customer transactions and behavior logs.
• Processing:
o Apply machine learning models to detect anomalies.
o Use Big SQL to identify suspicious transactions based on patterns.
• Real-time Analysis: Use Spark Streaming for live fraud detection.

Outcome: The bank reduces financial losses and improves fraud prevention.

2.3 Healthcare and Patient Data Analytics


Use Case: A hospital wants to analyze patient records to predict disease outbreaks.

Solution using BigInsights:


• Data Collection: Gather patient data, symptoms, and historical records.
• Processing:
o Use predictive analytics to detect correlations.
o Apply natural language processing (NLP) to extract patterns from doctor notes.
• Visualization: Present insights using BigSheets for easy interpretation.

Outcome: Early detection of disease trends helps in preventive measures.



Introduction to BigSheets

IBM BigSheets is a spreadsheet-style tool built on Hadoop, allowing business users to explore, clean, and analyze large datasets without coding. IBM InfoSphere BigInsights is a powerful big data processing platform that enables enterprises to analyze large datasets using advanced analytics, machine learning, and Hadoop-based tools. BigSheets, on the other hand, provides a user-friendly, spreadsheet-like interface for analyzing big data without requiring technical expertise.

1. Key Features of BigSheets

1. Excel-like Interface – Familiar UI for easy data analysis.


2. Data Import & Integration – Supports CSV, JSON, XML, and databases.
3. Built-in Functions – Includes filtering, sorting, and data transformation.
4. Data Visualization – Generates charts, graphs, and reports.

2. Infosphere BigSheets Use Cases

2.1 Website Log Analysis


Use Case: A company wants to analyze website traffic logs to improve user experience.

Solution using BigSheets:


• Import Data: Load server logs into BigSheets.
• Processing:
o Filter out bot traffic.
o Identify user behavior patterns.
• Visualization: Create charts for traffic trends.

Outcome: The company improves website design and performance.


2.2 Market Research & Customer Segmentation
Use Case: A telecom company wants to segment customers for targeted promotions.

Solution using BigSheets:


• Import Data: Customer demographics, usage patterns, billing data.
• Processing:
o Apply filters to identify high-value customers.
o Group users based on data patterns.
• Visualization: Generate graphs showing customer segments.

Outcome: The company personalizes marketing campaigns for better engagement.

2.3 Supply Chain Optimization


Use Case: A manufacturing company wants to analyze supplier performance.

Solution using BigSheets:


• Import Data: Supplier deliveries, defect rates, order processing times.
• Processing:
o Identify suppliers with delays.
o Find correlations between defects and specific suppliers.
• Visualization: Generate reports to evaluate supplier reliability.

Outcome: The company optimizes its supply chain and reduces costs.



3. Differences Between BigInsights and BigSheets
Feature       | BigInsights                                                 | BigSheets
Type          | Hadoop-based big data platform                              | Spreadsheet-based big data analysis tool
Processing    | Uses MapReduce, Big SQL, and Spark                          | Uses GUI-based data transformation
Users         | Data scientists, engineers                                  | Business analysts, non-technical users
Complexity    | Requires Hadoop, SQL, and scripting knowledge               | Easy-to-use, spreadsheet-style interface
Functionality | Advanced analytics, machine learning, real-time processing | Data filtering, transformation, visualization

When to use BigInsights?


• When working with complex big data processing, machine learning, and real-time analytics.

When to use BigSheets?


• When a business analyst or non-technical user needs to analyze large datasets quickly using a familiar spreadsheet-style tool.

