
Unit-4: Understanding Hadoop Ecosystem

Hadoop Introduction

“Hadoop is a technology to store massive datasets on a cluster of cheap machines in a distributed manner.” It was created by Doug Cutting and Mike Cafarella.

Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many other companies. Moreover, it can be scaled up simply by adding nodes to the cluster.

Hadoop is the solution to the Big Data problems described above. It stores massive datasets on a cluster of cheap machines in a distributed manner, and it also provides Big Data analytics through a distributed computing framework.

It is open-source software developed as a project of the Apache Software Foundation. Doug Cutting created Hadoop, and in 2008 Yahoo contributed Hadoop to the Apache Software Foundation. Since then, two major versions have been released: version 1.0 in 2011 and version 2.0.6 in 2013. Hadoop is available in various distributions such as Cloudera, IBM BigInsights, MapR, and Hortonworks.



Why Was Hadoop Invented?

Let us discuss the shortcomings of the traditional approach that led to the invention of Hadoop –

1. Storage for Large Datasets

A conventional RDBMS is incapable of storing huge amounts of data. The cost of data storage in an RDBMS is very high, as it incurs both hardware and software costs.

2. Handling data in different formats

An RDBMS can store and manipulate data only in a structured format. But in the real world we have to deal with structured, semi-structured, and unstructured data.

3. Data generated at high speed

Data is being generated at the rate of terabytes to petabytes daily, so we need a system that can process data in real time, within a few seconds. A traditional RDBMS fails to provide real-time processing at such speeds.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.



In 2002, Doug Cutting and Mike Cafarella started work on Apache Nutch, an open-source web crawler project.

While working on Apache Nutch, they had to deal with big data, and storing that data would have been very costly for the project. This problem became one of the important reasons for the emergence of Hadoop.

In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.

In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.



In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.

In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.

Doug Cutting named his project Hadoop after his son's toy elephant.

In 2007, Yahoo ran two clusters of 1,000 machines.

In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.

In 2013, Hadoop 2.2 was released.

In 2017, Hadoop 3.0 was released.

Hadoop Ecosystem

Hadoop consists of three core components –

• Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.


• Map-Reduce – It is the data processing layer of Hadoop.
• YARN – It is the resource management layer of Hadoop.



Hadoop Stack

Large datasets can be processed, stored, and distributed across computer clusters using the open-source Hadoop framework. Hadoop, created by the Apache Software Foundation, is a scalable, dependable, and affordable solution for managing large amounts of data. The MapReduce programming approach for parallel processing and the Hadoop Distributed File System (HDFS) for distributed storage form the system's foundation. HDFS splits large files into smaller blocks and distributes them throughout the cluster to provide high availability and fault tolerance. Inspired by ideas from functional programming, MapReduce divides large jobs into smaller subtasks and distributes them among cluster nodes to process data in parallel. Because Hadoop is distributed, it can process large volumes of data quickly and effectively, making it a fundamental tool in the big data ecosystem.
With the inclusion of new components like Apache Hive for data warehousing, Apache Pig for high-level scripting, and Apache Spark for
in-memory processing, Hadoop has grown in the last several years. Because of this ecosystem, Hadoop is a flexible platform that can handle
various data processing requirements, from real-time analytics to batch processing. Hadoop continues to be a key technology as businesses
struggle with the problems brought on by the exponential expansion of data. It offers the framework for creating reliable and scalable big
data solutions.

1. Hadoop Distributed File System (HDFS):

The storage part of Hadoop, known as the Hadoop Distributed File System (HDFS), was created primarily to manage massive files and spread them across many cluster nodes. To provide fault tolerance, it divides files into smaller blocks (usually 128 MB or 256 MB in size) and replicates them across nodes. In HDFS's master-slave design, a NameNode manages the metadata while DataNodes store the data. This design makes Hadoop's fault tolerance and high throughput possible, making it ideal for storing and retrieving massive volumes of data.
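As a rough illustration (not from the original notes) of how a client works with HDFS, the sketch below wraps the standard hdfs dfs shell commands in Python; the directory /user/demo and the local file sales.csv are assumptions made for illustration.

```python
# A minimal sketch of interacting with HDFS from Python by invoking the
# standard "hdfs dfs" shell commands; paths and the input file are assumed.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' command and return its output as text."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a directory and upload a local file; HDFS splits it into blocks
# (typically 128 MB) and replicates each block across DataNodes.
hdfs("-mkdir", "-p", "/user/demo")
hdfs("-put", "-f", "sales.csv", "/user/demo/sales.csv")

# List the directory and read part of the file back.
print(hdfs("-ls", "/user/demo"))
print(hdfs("-cat", "/user/demo/sales.csv")[:200])

# Increase the replication factor of the file to 3 copies.
hdfs("-setrep", "-w", "3", "/user/demo/sales.csv")
```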

2. MapReduce Programming paradigm:

Hadoop uses this programming paradigm to handle and analyze large datasets concurrently. It separates a task into two stages: the Map
phase, which involves splitting the input into key-value pairs and processing them concurrently, and the Reduce phase, which involves
combining the output of the Map phase. This architecture makes it possible to process data in parallel across several nodes, which makes
handling large-scale computations efficient. Despite its strength, MapReduce can be difficult to use for some computations.

Apart from these fundamental elements, the Hadoop ecosystem has grown to encompass a range of initiatives and resources that enhance
its capabilities and address distinct facets of the data processing workflow.

MapReduce is the heart of Hadoop. It is a software framework for writing applications that process large datasets in parallel across
hundreds or thousands of nodes on the Hadoop cluster.

Hadoop divides the client’s MapReduce job into a number of independent tasks that run in parallel to improve throughput.

The MapReduce framework works in two phases, the Map phase and the Reduce phase. The input to both phases is a set of key-value pairs.
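To make the two phases concrete, here is a minimal, self-contained word-count sketch in Python (not part of the original notes). It runs the map and reduce logic locally on a tiny input; on a real cluster the same two functions would be packaged as mapper and reducer scripts, for example through Hadoop Streaming, and run in parallel across many nodes.

```python
# Word count expressed as a Map phase and a Reduce phase over key-value pairs.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: split each input line into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.strip().split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each word (pairs are sorted by key first)."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
for word, count in reduce_phase(map_phase(lines)):
    print(word, count)   # e.g. big 2, data 2, hadoop 2, ...
```

The sorting and grouping between the two functions plays the role that the shuffle-and-sort step plays between Map and Reduce tasks on a Hadoop cluster.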

Features of Hadoop MapReduce:

• Scalable: Once we write a MapReduce program, we can easily expand it to work over a cluster having hundreds or even thousands
of nodes.
• Fault-tolerance: It is highly fault-tolerant. It automatically recovers from failure.
• Distributed Processing: MapReduce makes it possible to process data in parallel on a group of computers. It provides for scalable
and effective data processing by distributing the data and jobs among numerous nodes.



• Support for Several Programming Languages: Different MapReduce implementations support various programming languages,
enabling developers to do data processing tasks using the language of their choosing.

3. Apache Hive:

Built on top of Hadoop, Apache Hive is a data warehousing tool with an SQL-like query language that makes it easier for analysts and data scientists to handle massive datasets.



Apache Hive is a Java-based data warehousing tool developed by Facebook for analyzing and processing large data.

Hive uses HQL (Hive Query Language), an SQL-like language that is transformed into MapReduce jobs for processing huge amounts of data.

It allows developers and analysts to query and analyze big data with SQL-like queries (HQL) without writing complex MapReduce jobs.

Users can interact with Apache Hive through the command-line tool (the Beeline shell) and the JDBC driver.
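As a hedged sketch of what querying Hive from a client program might look like (not part of the original notes), the snippet below uses the PyHive Python client against the HiveServer2 port; the host name, the sales table, and its columns are assumptions made for illustration.

```python
# A hedged sketch using the PyHive client (assumed to be installed); the Hive
# host, the 'sales' table, and its columns are illustrative assumptions.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       username="analyst", database="default")
cursor = conn.cursor()

# An HQL query: Hive compiles this into MapReduce (or Tez/Spark) jobs.
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")

for region, total in cursor.fetchall():
    print(region, total)

cursor.close()
conn.close()
```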

Features of Apache Hive:

• Hive supports client applications written in many languages, such as Python, Java, PHP, Ruby, and C++.
• It generally uses an RDBMS for metadata storage, which significantly reduces the time taken for semantic checks.
• Hive partitioning and bucketing improve query performance.
• Hive is fast, scalable, and extensible.
• It supports Online Analytical Processing and is an efficient ETL tool.
• It provides support for user-defined functions (UDFs) to handle use cases that are not covered by built-in functions.

4. Apache Pig:

Apache Pig is a high-level scripting platform created to make writing MapReduce applications easier. Pig Latin is the language used to express data transformations in it.

Pig was developed by Yahoo as an alternative approach to make writing MapReduce jobs easier.



It enables developers to use Pig Latin, a scripting language designed for the Pig framework that runs on the Pig runtime.

Pig Latin consists of SQL-like commands that the compiler converts into MapReduce programs in the background.

A Pig script works by first loading the data from its source.

Then we perform various operations on it, such as sorting, filtering, and joining.

Finally, depending on the requirement, the results are either dumped to the screen or stored back into HDFS.
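A rough sketch of that load-transform-store flow (not from the original notes): a small Pig Latin script is written to a file and run in local mode with the pig command-line tool; the input file students.csv and its columns are assumptions.

```python
# A hedged sketch: write a small Pig Latin script to a file and run it in
# local mode via the 'pig' CLI. The input file and columns are assumptions.
import subprocess

pig_script = """
-- Load the data source, filter, sort, and store the result back.
students = LOAD 'students.csv' USING PigStorage(',')
           AS (name:chararray, subject:chararray, marks:int);
passed   = FILTER students BY marks >= 40;
ranked   = ORDER passed BY marks DESC;
STORE ranked INTO 'passed_students' USING PigStorage(',');
"""

with open("rank_students.pig", "w") as f:
    f.write(pig_script)

# '-x local' runs Pig on the local filesystem instead of a Hadoop cluster.
subprocess.run(["pig", "-x", "local", "-f", "rank_students.pig"], check=True)
```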

Features of Pig:

• Extensibility: Users can create their own functions to perform special-purpose processing.
• Solving complex use cases: Pig is well suited to complex use cases that involve multiple data-processing steps with multiple imports and exports.
• Handles all kinds of data: Both structured and unstructured data can be easily analyzed or processed using Pig.
• Optimization opportunities: Pig automatically optimizes the execution of tasks, so programmers can focus on semantics rather than efficiency.
• It provides a platform for building data flows for ETL (Extract, Transform, and Load), processing, and analyzing massive data sets.

5. Apache Spark:

Spark was not part of the Hadoop project at first, but it is frequently used in tandem with Hadoop. Compared with conventional MapReduce, it is a fast and versatile cluster-computing system that offers in-memory processing and greater expressiveness.



It is a popular open-source unified analytics engine for big data and machine learning.

Apache Spark was developed under the Apache Software Foundation to speed up Hadoop big data processing.

It extends the Hadoop MapReduce model to effectively use it for more types of computations like interactive queries, stream processing,
etc.

Apache Spark enables batch, real-time, and advanced analytics over the Hadoop platform.

Spark provides in-memory data processing for developers and data scientists.

Companies, including Netflix, Yahoo, eBay, and many more, have deployed Spark at a massive scale.
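A minimal PySpark sketch (not part of the original notes) illustrating in-memory processing with the DataFrame API; the file transactions.csv and its region and amount columns are assumptions made for illustration.

```python
# A hedged PySpark sketch; the CSV path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# Read a CSV (from HDFS or the local filesystem) into a DataFrame.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# cache() keeps the data in memory across the two actions below.
df = df.cache()

summary = (df.groupBy("region")
             .agg(F.sum("amount").alias("total"),
                  F.count("*").alias("orders"))
             .orderBy(F.desc("total")))

summary.show()
print("rows processed:", df.count())

spark.stop()
```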

Features of Apache Spark:

• Speed: Spark has the ability to run applications in Hadoop clusters 100 times faster in memory and ten times faster on the disk.
• Ease of use: It can work with different data stores (such as OpenStack, HDFS, and Cassandra), which gives it more flexibility than Hadoop alone.
• Generality: It contains a stack of libraries, including MLlib for machine learning, SQL and DataFrames, GraphX, and Spark
Streaming. We can combine these libraries in the same application.
• Runs Everywhere: Spark can run on Hadoop, Kubernetes, Apache Mesos, standalone, or in the cloud.

7. Apache HBase



Operating on top of Hadoop, HBase is a distributed, scalable NoSQL database. It offers real-time read and write access to massive datasets. Data is arranged into column families in a column-family store for effective storage and retrieval.

Scalability: Able to handle enormous volumes of data by scaling horizontally.

8. Apache Kafka

Kafka is a distributed streaming platform used to build real-time data pipelines and streaming applications.

Publish-Subscribe Model: The publish-subscribe model decouples producers and consumers in real-time data processing.

Fault Tolerance: Designed with fault tolerance and high availability in mind.
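As a rough illustration of the publish-subscribe model (not from the original notes), the sketch below uses the kafka-python client; the broker address and the clickstream topic are assumptions made for illustration.

```python
# A hedged sketch using the kafka-python client (assumed installed); the broker
# address and the 'clickstream' topic are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publishes JSON events to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": "u42", "page": "/checkout"})
producer.flush()

# Consumer: an independent subscriber reads the same topic.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop iterating after 5 s with no messages
)
for message in consumer:
    print(message.value)
```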

9. ZooKeeper

Distributed systems can be managed and synchronized with ZooKeeper, a distributed coordination service.

Coordination: Offers a centralized solution for naming, distributed synchronization, and configuration information maintenance.
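A minimal sketch (not part of the original notes) of storing a shared configuration value in ZooKeeper using the kazoo Python client; the connection string and the znode path are assumptions made for illustration.

```python
# A hedged sketch with the kazoo client (assumed installed); the ZooKeeper
# address and the znode path are illustrative assumptions.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Store a configuration value at a znode, creating parent nodes if needed.
if not zk.exists("/app/config/batch_size"):
    zk.create("/app/config/batch_size", b"500", makepath=True)

# Any node in the cluster can now read the same value.
value, stat = zk.get("/app/config/batch_size")
print(value.decode(), "version:", stat.version)

zk.stop()
```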

Together, these elements create a solid and adaptable Hadoop environment that enables businesses to handle a range of big data processing, analytics, and storage tasks. New projects and tools are constantly being introduced to the ecosystem to meet new challenges in the big data era.

10. Apache Mahout



Apache Mahout is an open-source framework that normally runs on top of the Hadoop infrastructure to manage large volumes of data.

The name Mahout is derived from the Hindi word “Mahavat,” which means the rider of an elephant.

Because Apache Mahout runs its algorithms on top of the Hadoop framework, it was given the name Mahout.

We can use Apache Mahout to implement scalable machine learning algorithms on top of Hadoop using the MapReduce paradigm.

Apache Mahout is not restricted to the Hadoop-based implementation; it can also run algorithms in standalone mode.

Apache Mahout implements popular machine learning algorithms such as Classification, Clustering, Recommendation, Collaborative
filtering, etc.

Features of Mahout:

• It works well in a distributed environment since its algorithms are written on top of Hadoop. It uses the Hadoop library to scale in the cloud.
• Mahout offers a ready-to-use framework for performing data mining tasks on large datasets.
• It lets applications analyze large datasets quickly.
• Apache Mahout includes various MapReduce-enabled clustering algorithms such as Canopy, Mean-Shift, k-means, and fuzzy k-means.
• It also includes vector and matrix libraries.
• Apache Mahout exposes various classification algorithms such as Naive Bayes, Complementary Naive Bayes, and Random Forest.

11. HBase

HBase is an open-source distributed NoSQL database that stores sparse data in tables consisting of billions of rows and columns.
It is written in Java and modeled after Google’s Bigtable.

HBase is used when we need to search or retrieve a small amount of data from large data sets.

For example, if we have billions of customer emails and we need to find the names of the customers who used the word "replace" in their emails, we can use HBase.
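A rough sketch of that kind of lookup (not from the original notes) using the happybase Python client, which talks to HBase through its Thrift gateway; the emails table, the msg column family, and the host are assumptions made for illustration.

```python
# A hedged sketch with the happybase client (assumed installed; requires the
# HBase Thrift server). The 'emails' table and 'msg' column family are assumed.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("emails")

# Write one row: the row key is the email id, columns live in a column family.
table.put(b"email-0001", {
    b"msg:customer": b"Asha Rao",
    b"msg:body": b"Please replace my damaged handset.",
})

# Scan rows and pick out customers whose email body contains the word 'replace'.
for row_key, data in table.scan(columns=[b"msg:customer", b"msg:body"]):
    if b"replace" in data.get(b"msg:body", b""):
        print(row_key.decode(), "->", data[b"msg:customer"].decode())

connection.close()
```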

There are two main components in HBase. They are:

• HBase Master: The HBase Master coordinates load balancing across the Region Servers. It controls failover and maintains and monitors the cluster.
• Region Server: The Region Server is the worker node that handles read, write, update, and delete requests from clients.

Features of HBase:

• Scalable storage
• Support for fault tolerance
• Support for real-time search on sparse data
• Support for strongly consistent reads and writes
• Column-Family-Based Data Model: HBase groups data into logical groups of columns called column families. Data is saved in rows
within these column families, each of which can have many columns. This data architecture enables effective data retrieval and flexible
schema creation.
• HBase Coprocessors: HBase supports coprocessors, specialized code modules that run on Region Servers. Coprocessors improve query speed by letting developers run custom data-processing logic close to the data.



Analysing Data with Unix tools

Analysing Data with Hadoop

Hadoop Streaming

IBM Big Data Strategy

Introduction to InfoSphere BigInsights and BigSheets.

Introduction to InfoSphere BigInsights

IBM InfoSphere BigInsights is a big data analytics platform built on Apache Hadoop. It is designed to help enterprises store, manage, and analyze large volumes of structured and unstructured data efficiently.

IBM InfoSphere Streams is a software platform that enables the development and execution of applications that process information in data streams. InfoSphere Streams enables continuous and fast analysis of massive volumes of moving data to help improve the speed of business insight and decision making.

BigInsights is an analytics platform that enables companies to turn complex Internet-scale information sets into insights.

It consists of a packaged Apache Hadoop distribution, with a greatly simplified installation process, and associated tools for application
development, data movement, and cluster management.

Other open source technologies in BigInsights are:



◦Pig: A platform that provides a high-level language for expressing programs that analyze large datasets. Pig has a compiler that translates
Pig programs into sequences of MapReduce jobs that the Hadoop framework executes.

◦Hive: A data-warehousing solution built on top of the Hadoop environment. It brings familiar relational-database concepts, such as tables, columns, and partitions, and a subset of SQL (HiveQL) to the unstructured world of Hadoop. Hive queries are compiled into MapReduce jobs executed using Hadoop.

◦Jaql: An IBM-developed query language designed for JavaScript Object Notation (JSON) that provides an SQL-like interface.

◦HBase: A column-oriented NoSQL data-storage environment designed to support large, sparsely populated tables in Hadoop.

◦Flume: A distributed, reliable, available service for efficiently moving large amounts of data as it is produced. Flume is well-suited to
gathering logs from multiple systems and inserting them into the Hadoop Distributed File System (HDFS) as they are generated.

◦Avro: A data-serialization technology that uses JSON for defining data types and protocols, and serializes data in a compact binary
format.

◦Lucene: A search-engine library that provides high-performance and full-featured text search.

◦ZooKeeper: A centralized service for maintaining configuration information and naming, providing distributed
synchronization and group services.

◦Oozie: A workflow scheduler system for managing and orchestrating the execution of Apache Hadoop jobs.



1. Key Features of BigInsights

1. Hadoop-based Architecture
• Uses HDFS (Hadoop Distributed File System) for scalable data storage.
• Supports MapReduce, Spark, and YARN for distributed computing.
2. Big SQL
• Allows SQL-based querying on Hadoop data, making it accessible for users familiar with relational databases.
• Supports joins, subqueries, and aggregations similar to traditional SQL databases.
3. Advanced Text Analytics
• Enables natural language processing (NLP) and text mining to extract meaningful insights from unstructured data (e.g., emails, documents, social media).
4. Machine Learning and Predictive Analytics
• Supports predictive modeling and data science frameworks like Apache Spark MLlib.
5. Security and Data Governance
• Provides role-based access control (RBAC), encryption, and data auditing features.

2. Infosphere BigInsights Use Cases

2.1 Social Media and Sentiment Analysis


Use Case: A retail company wants to analyze customer opinions from Twitter and Facebook to improve its marketing strategy.

Solution using BigInsights:


• Data Collection: Gather tweets, comments, and posts using APIs.
• Storage: Store raw data in HDFS.
• Processing:
o Use text analytics to detect keywords like "good service" or "bad experience".
o Apply sentiment analysis to classify data as positive, neutral, or negative (a minimal sketch follows this use case).



• Visualization: Generate reports using BigSheets or export insights to a BI tool.

Outcome: The company identifies trending issues and adjusts marketing strategies accordingly.
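A hedged PySpark sketch of the keyword-based processing step above (not part of the original notes); the input path, the text field, and the keyword lists are assumptions made for illustration.

```python
# A hedged PySpark sketch of keyword-based sentiment tagging; the input file,
# column names, and keyword lists are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SentimentTagging").getOrCreate()

# Raw posts collected from social-media APIs and stored in HDFS.
posts = spark.read.json("hdfs:///social/raw_posts.json")   # expects a 'text' field

text = F.lower(F.col("text"))
positive = ["good service", "great", "love"]
negative = ["bad experience", "terrible", "refund"]

# 1 if the text contains any of the given keywords, 0 otherwise.
has_any = lambda words: F.greatest(*[text.contains(w).cast("int") for w in words])

labelled = posts.withColumn(
    "sentiment",
    F.when(has_any(positive) == 1, "positive")
     .when(has_any(negative) == 1, "negative")
     .otherwise("neutral"),
)

labelled.groupBy("sentiment").count().show()
spark.stop()
```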

2.2 Fraud Detection in Banking


Use Case: A bank wants to detect fraudulent transactions in real time.

Solution using BigInsights:


• Data Collection: Aggregate customer transactions and behavior logs.
• Processing:
o Apply machine learning models to detect anomalies.
o Use Big SQL to identify suspicious transactions based on patterns.
• Real-time Analysis: Use Spark Streaming for live fraud detection.

Outcome: The bank reduces financial losses and improves fraud prevention.

2.3 Healthcare and Patient Data Analytics


Use Case: A hospital wants to analyze patient records to predict disease outbreaks.

Solution using BigInsights:


• Data Collection: Gather patient data, symptoms, and historical records.
• Processing:
o Use predictive analytics to detect correlations.
o Apply natural language processing (NLP) to extract patterns from doctor notes.
• Visualization: Present insights using BigSheets for easy interpretation.

Outcome: Early detection of disease trends helps in preventive measures.



Introduction to BigSheets

IBM BigSheets is a spreadsheet-style tool built on Hadoop, allowing business users to explore, clean, and analyze large datasets without coding. IBM InfoSphere BigInsights is a powerful big data processing platform that enables enterprises to analyze large datasets using advanced analytics, machine learning, and Hadoop-based tools. BigSheets, on the other hand, provides a user-friendly, spreadsheet-like interface for analyzing big data without requiring technical expertise.

1. Key Features of BigSheets

1. Excel-like Interface – Familiar UI for easy data analysis.


2. Data Import & Integration – Supports CSV, JSON, XML, and databases.
3. Built-in Functions – Includes filtering, sorting, and data transformation.
4. Data Visualization – Generates charts, graphs, and reports.

2. Infosphere BigSheets Use Cases

2.1 Website Log Analysis


Use Case: A company wants to analyze website traffic logs to improve user experience.

Solution using BigSheets:


• Import Data: Load server logs into BigSheets.
• Processing:
o Filter out bot traffic.
o Identify user behavior patterns.
• Visualization: Create charts for traffic trends.

Outcome: The company improves website design and performance.


2.2 Market Research & Customer Segmentation
Use Case: A telecom company wants to segment customers for targeted promotions.

Solution using BigSheets:


• Import Data: Customer demographics, usage patterns, billing data.
• Processing:
o Apply filters to identify high-value customers.
o Group users based on data patterns.
• Visualization: Generate graphs showing customer segments.

Outcome: The company personalizes marketing campaigns for better engagement.

2.3 Supply Chain Optimization


Use Case: A manufacturing company wants to analyze supplier performance.

Solution using BigSheets:


• Import Data: Supplier deliveries, defect rates, order processing times.
• Processing:
o Identify suppliers with delays.
o Find correlations between defects and specific suppliers.
• Visualization: Generate reports to evaluate supplier reliability.

Outcome: The company optimizes its supply chain and reduces costs.



3. Differences Between BigInsights and BigSheets
Feature       | BigInsights                                                 | BigSheets
Type          | Hadoop-based big data platform                              | Spreadsheet-based big data analysis tool
Processing    | Uses MapReduce, Big SQL, and Spark                          | Uses GUI-based data transformation
Users         | Data scientists, engineers                                  | Business analysts, non-technical users
Complexity    | Requires Hadoop, SQL, and scripting knowledge               | Easy-to-use, spreadsheet-style interface
Functionality | Advanced analytics, machine learning, real-time processing | Data filtering, transformation, visualization

When to use BigInsights?


• When working with complex big data processing, machine learning, and real-time analytics.

When to use BigSheets?


• When a business analyst or non-technical user needs to analyze large datasets quickly using a familiar spreadsheet-style tool.

