
CSE Department

Hadoop Ecosystem

Unit 4

Dr Rakesh Ranjan Kumar


Assistant Professor
PIG: A Big Data Processor

Apache Pig: Introduction
 Pig is an open-source technology (introduced by Yahoo!) that is part of the Hadoop ecosystem for processing high volumes of data (structured, semi-structured, unstructured).

 It provides an abstraction over MapReduce.

 It is used to analyze large sets of data and to represent them as data flows.

 Pig is not an acronym; it was named after the domestic animal. Just as a pig eats anything, Pig can work on any kind of data.
Apache Pig: Contd…
 It has a high-level scripting language known as Pig Latin that helps programmers develop their own functions for reading, writing, and processing data.

 A component known as the Pig Engine is present inside Apache Pig; it takes Pig Latin scripts as input and converts them into MapReduce jobs.

Fun Fact:
10 lines of Pig Latin ≈ 200 lines of MapReduce Java code
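To make that comparison concrete, here is a minimal Pig Latin sketch of the classic word count job (the input/output paths are invented for illustration, not from the slides):

-- Load raw text from HDFS (path is a hypothetical example)
lines = LOAD '/data/input.txt' AS (line:chararray);
-- Split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Write the result back to HDFS
STORE counts INTO '/data/wordcount_output';

The equivalent hand-written MapReduce job in Java needs a mapper class, a reducer class, and driver boilerplate, which is where the roughly 20x difference comes from.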
Why go for Pig when MR is there?

Why Apache Pig?
Apache Pig: Features
• Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc. (a few are illustrated in the sketch after this list).
• Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Optimization opportunities: Tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
• Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
• UDFs: Pig provides the facility to create User-Defined Functions in other programming languages such as Java and to invoke or embed them in Pig scripts.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
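As a hedged sketch of a few of these operators together with a Java UDF call (the file names, schemas, and the myudfs.UPPER class are invented examples):

-- Register a jar containing a hypothetical Java UDF
REGISTER myudfs.jar;
users = LOAD 'users.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (uid:int, amount:double);
adults = FILTER users BY age >= 18;                    -- filter operator
joined = JOIN adults BY id, orders BY uid;             -- join operator
sorted = ORDER joined BY amount DESC;                  -- sort operator
shouted = FOREACH sorted GENERATE myudfs.UPPER(adults::name), amount;  -- UDF call
STORE shouted INTO 'output';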
Apache Pig – Components

Pig Architecture
Apache Pig – Architectural Components
• Parser: Initially, Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
• Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection pushdown.
• Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
• Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed there, producing the desired results.
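The plans these stages produce can be inspected from the Grunt shell with Pig's EXPLAIN operator; a minimal hedged sketch (relation and file names are invented):

grunt> A = LOAD 'data.txt' AS (f1:int, f2:int);
grunt> B = FILTER A BY f1 > 10;
grunt> EXPLAIN B;    -- prints the logical, physical, and MapReduce plans for B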

Apache Pig – Execution Modes

Apache Pig – Interaction Modes
Apache Pig: Job Execution Flow

 The programmer creates a Pig Latin script, which lives in the local file system.

 Once the Pig script is submitted, it is handed to the compiler, which generates a series of MapReduce jobs.

 The Pig compiler reads the raw data from HDFS and performs the operations on it.

 The result files are placed back in the Hadoop Distributed File System (HDFS) after the MapReduce jobs complete.
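As a sketch of how a finished script is launched (the script name is a hypothetical example), Pig can be run in local mode against the local file system or in MapReduce mode against HDFS:

$ pig -x local myscript.pig        # local mode: input/output on the local file system
$ pig -x mapreduce myscript.pig    # MapReduce mode (the default): input/output on HDFS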
How Apache Pig Works
Apache Pig – Data Models

Apache Pig Data Model - Tuple and Bag

Apache Pig Data Model - Map and Atom

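Since the data model slides are figures, here is a hedged textual sketch of the four elements (all values are invented): an Atom is a single value, a Tuple is an ordered set of fields, a Bag is a collection of tuples, and a Map is a set of key-value pairs.

-- Atom:  'alice' or 25
-- Tuple: ('alice', 25)
-- Bag:   {('alice', 25), ('bob', 30)}
-- Map:   ['city'#'delhi', 'pin'#110001]
people = LOAD 'people.txt' AS (name:chararray, age:int,
                               friends:bag{t:tuple(fname:chararray)},
                               props:map[]);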
Apache Pig - Commands

Pig Case Study – Twitter

Pig vs SQL
Apache Pig – Applications
• Processes large volumes of data
• Supports quick prototyping and ad-hoc queries across large datasets
• Performs data processing in search platforms
• Processes time-sensitive data loads
• Used by telecom companies to de-identify user call data
• How Yahoo! uses Pig:
• In Pipelines – To bring together logs from its web servers, where these logs undergo a cleaning step to remove bots, company internal views, and clicks.
• In Research – To quickly write a script to test a theory. Pig integration makes it easy for researchers to take a Perl or Python script and run it against a huge dataset (see the sketch below).
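The Perl/Python integration mentioned above is Pig's STREAM operator, which pipes each record through an external script; a minimal hedged sketch (file and script names are invented):

raw = LOAD 'queries.txt' AS (query:chararray);
-- Pipe every record through an external Python script
scored = STREAM raw THROUGH `python score.py` AS (query:chararray, score:double);
STORE scored INTO 'scored_queries';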

Apache Hive: Data Warehousing & Analytics on Hadoop

Hive History
Hive Introduction
• Hive is a data warehouse infrastructure tool used to process structured data stored in an HDFS cluster.
• It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy.
• Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• It is used by many companies, for example Amazon, Facebook, and Netflix.
Need of Hive

Apache Hive: Features
 Open-source: Apache Hive is an open-source tool.

 Query large datasets: Hive can query and manage huge datasets stored in the Hadoop Distributed File System.

 Multiple users: Multiple users can query the data simultaneously using Hive Query Language (HQL).

 File formats: Hive supports various file formats such as TextFile, ORC, Avro, SequenceFile, Parquet, and RCFile, as well as LZO compression.

 Built-in functions: Hive provides various built-in functions, for example abs(), round(), and isnull() (a short sketch follows this list).

 User-Defined Functions: It also supports User-Defined Functions for tasks like data cleansing and filtering.
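As a hedged illustration of the built-in functions named above (the table and column names are invented):

SELECT name,
       round(salary, 2)   AS salary_rounded,   -- round to 2 decimal places
       abs(balance)       AS abs_balance,      -- absolute value
       isnull(manager_id) AS has_no_manager    -- true when the column is NULL
FROM employees;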
Apache Hive: Contd…
 Fast: Hive is a fast, scalable, extensible tool and uses familiar concepts.

 Table structure: Table structure in Hive is similar to table structure in an RDBMS.

 ETL support: Hive supports ETL operations.

 Storage: Hive allows us to access files stored in HDFS and in similar data storage systems such as HBase.

 Ad-hoc queries: Hive allows us to run ad-hoc queries: loosely typed commands or queries whose value depends on some variable, used for data analysis.
 Ad-hoc query example: var adSQL = "SELECT * FROM table WHERE id = " + myId
 This builds a different query each time the line executes, depending on the value of myId.

 Data visualization: Hive can be used for data visualization; integrating Hive with Apache Tez provides near-real-time processing capabilities.
Limitations of Hive
• Subqueries: Subqueries are not supported.
• Latency: The latency of Apache Hive queries is very high.
• Only non-real-time (cold) data: Hive is not used for real-time data querying, since it takes a while to produce a result.
• No transaction processing: HQL does not support transaction processing.

Architecture of Hive

Apache Hive: Contd…
Hive chiefly consists of three core parts:

 Hive Clients: Hive offers a variety of drivers designed for communication with different applications. For example, Hive provides Thrift clients for Thrift-based applications. These clients and drivers then communicate with the Hive server, which falls under Hive Services.

 Hive Services: Hive Services handle client interactions with Hive. For example, if a client wants to perform a query, it must talk to Hive Services.

 Hive Storage and Computing: Hive services such as the file system, job client, and metastore communicate with Hive storage, which holds things like table metadata and query results.
Apache Hive: Hive Clients
 Hive allows writing applications in various languages, including Java, Python, and C++. It supports different types of clients, such as:

 Thrift Server - a cross-language service provider platform that serves requests from all programming languages that support Thrift.

 JDBC Driver - used to establish a connection between Hive and Java applications (a sketch follows this list).

 ODBC Driver - allows applications that support the ODBC protocol to connect to Hive.
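A minimal hedged sketch of the JDBC route in Java (host, port, credentials, and table are assumptions; HiveServer2 conventionally listens on port 10000):

import java.sql.*;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to HiveServer2 (URL is an example)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "user", "");
        Statement stmt = con.createStatement();
        // Run a simple HiveQL query (table name is hypothetical)
        ResultSet rs = stmt.executeQuery("SELECT name, age FROM employees LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
        }
        con.close();
    }
}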
Apache Hive: Hive Services
 Hive CLI - The Hive CLI (Command Line Interface) is a shell used to execute Hive queries and commands.
 Hive Web UI - The Hive Web UI is an alternative to the Hive CLI. It provides a web-based GUI for executing Hive queries and commands.
 Hive Server - Also referred to as the Apache Thrift Server, it accepts requests from different clients and passes them to the Hive Driver.
 Hive Driver - It receives queries from different sources such as the Web UI, CLI, Thrift, and JDBC/ODBC drivers, and transfers the queries to the compiler.
 Hive Metastore - A central repository that stores all the structural information of the various tables and partitions in the warehouse: column metadata and type information, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
 Hive Compiler - The compiler parses the query and performs semantic analysis on the different query blocks and expressions. It converts HiveQL statements into MapReduce jobs.
 Hive Execution Engine - The optimizer generates the logical plan in the form of a DAG of MapReduce tasks and HDFS tasks; the execution engine then runs these tasks.
Apache Hive: Hive Driver
Data Flow in Hive

Apache Hive: Contd…

 Execute a query, which goes to the driver.
 The driver asks the compiler for a query execution plan.
 The compiler requests the metadata from the metastore.
 The metastore responds with the metadata.
 The compiler gathers this information and sends the plan back to the driver.
 The driver sends the execution plan to the execution engine.
 The execution engine acts as a bridge between Hive and Hadoop to process the query.
 The execution engine also communicates bidirectionally with the metastore to perform various operations, such as creating and dropping tables.
 Finally, bidirectional communication fetches and sends results back to the client.
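The plan the compiler hands to the execution engine can be inspected with HiveQL's EXPLAIN command; a minimal hedged sketch (table name is invented):

EXPLAIN
SELECT dept, count(*) AS cnt
FROM employees
GROUP BY dept;
-- Prints the stage dependencies and the plan for each stage (e.g., the MapReduce stages)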
Hive Data Modelling

Apache Hive: Data Types
Different Modes of Hive

Hive vs RDBMS
Hive Commands
 Commands in Hive:
 Hive DDL (Create, View, Drop, Alter, Use)
 Hive DML (Load, Insert, Update, Delete)
 Data retrieval queries (Select, Where, Group By, Limit)
 Joins in Hive (Inner, Outer, Full Join)
 Built-in functions in Hive
A combined sketch of these command families follows below.
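A hedged end-to-end sketch (database objects and file paths are invented):

-- DDL: create a table over comma-separated text files
CREATE TABLE employees (id INT, name STRING, dept STRING, salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- DML: load data from a local file into the table
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- Data retrieval: Select, Where, Group By, Limit
SELECT dept, avg(salary) AS avg_salary
FROM employees
WHERE salary > 10000
GROUP BY dept
LIMIT 10;

-- Join: inner join with a second, hypothetical table
SELECT e.name, d.location
FROM employees e JOIN departments d ON (e.dept = d.dept_name);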


Hive vs Pig

HiveQL vs Pig Latin

Hive vs Pig vs SQL

HBase: Large-Scale Data Management

HBase History
HBase: Why?
HBase: Introduction
 HBase is an open-source, non-relational, distributed database written in Java. It runs on top of HDFS.

 HBase is a database management system designed in 2007 by Powerset (a company later acquired by Microsoft).

 HBase is a column-oriented database and enables real-time analysis of data.

 It can store huge amounts of data in tabular format (rows and columns) for extremely fast reads and writes (a shell sketch follows this list).

 HBase is mostly used in scenarios that require regular, consistent inserting and overwriting of data.
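As a hedged sketch of these fast tabular reads and writes from the HBase shell (table, column family, and values are invented):

hbase> create 'users', 'info'                     # table with one column family
hbase> put 'users', 'row1', 'info:name', 'Alice'  # write one cell
hbase> put 'users', 'row1', 'info:age', '25'
hbase> get 'users', 'row1'                        # read a single row
hbase> scan 'users'                               # scan the whole table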
NoSQL Types

HBase Use Case
Features of HBase
• Linear and modular scalability: It is highly scalable, which means we can add more machines to its cluster.
• Easy-to-use Java API for client access: HBase has been developed with robust Java API support (client/server) that is simple to use (a hedged sketch follows this list).
• Thrift gateway and RESTful Web services: To support front ends beyond the Java programming language, it supports Thrift and a REST API.
• Atomic read and write: HBase provides atomic reads and writes at the row level: during one read or write process on a row, all other processes are prevented from reading or writing that row.
• Consistent reads and writes: HBase provides consistent reads and writes as a consequence of the feature above.
• Automatic and configurable sharding of tables: HBase tables are split into regions that are distributed across the cluster; regions split and are redistributed automatically as the data grows.
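A minimal hedged sketch of the Java client API mentioned above (table and column names are invented; the calls are from the standard org.apache.hadoop.hbase.client package):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);  // writes are atomic at the row level
            // Read the row back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}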
Applications of HBase
• Medical: The medical industry uses HBase to store patient data such as diseases, age, gender, etc., and to run MapReduce on it.
• Sports: The sports industry uses HBase to store information related to matches. This information helps perform analytics and predict the outcomes of future matches.
• Web: The web uses HBase to store customers' search histories. This search information helps companies target customers directly with the products or services they have searched for.
• Oil and petroleum: HBase is used to store exploration data, which helps in analysing and predicting areas where oil can be found.
• E-commerce: E-commerce uses HBase to record customer logs and the products they search for. It enables organizations to target customers with ads that induce them to buy their products or services.
• Other fields: HBase is employed in other fields where data is the most important factor and petabytes of data must be stored for analysis.
Companies Using HBase
• Mozilla
• Mozilla uses HBase to store all of its crash data.
• Facebook
• Facebook uses HBase storage to store real-time messages.
• Twitter
• Twitter also runs HBase across its entire Hadoop cluster. For Twitter, HBase offers a distributed, read/write backup of all the MySQL tables in its production backend.
• Yahoo!
• Yahoo! also uses HBase, where it helps store document fingerprints in order to detect near-duplicates.
HBase: Column-Oriented Storage
HBase: Architecture
 HBase has two types of nodes: Master and RegionServer.
HBase Architectural Components

Data Storage in HBase
HBase vs Hive
• Hive and HBase are two different Hadoop-based technologies:

• Hive is an SQL-like engine that runs MapReduce jobs, and

• HBase is a NoSQL key/value database on Hadoop.

• Just as Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase is used for real-time querying.
HBase vs RDBMS
Thank You

Rakesh Ranjan Kumar

CSE Department

[email protected]

7070254486

www.cgu-odisha.edu.in