
Department of Computer Science and Engineering

10212CS210 – Big Data Analytics

Course Category : Program Elective
Credits : 4
Slot : S1 & S5
Semester : Summer
Academic Year : 2024-2025
Faculty Name : Dr. S. Jagan

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology
Unit 4: Big Data Visualization and Prediction

Pig: Introduction to Pig, Execution Modes of Pig, Comparison of Pig with Databases, Grunt, Pig Latin, User Defined Functions, Data Processing Operators. Hive: Hive Shell, Hive Services, Hive Metastore, Comparison with Traditional Databases, HiveQL, Tables, Querying Data and User Defined Functions. NoSQL Databases: Schema-less Models: Increasing Flexibility for Data Manipulation – Key-Value Stores – Document Stores – Tabular Stores – Object Data Stores – Graph Databases – Sharding – HBase – Analyzing Big Data with Twitter – Big Data for E-Commerce – Big Data for Blogs.
Introduction to PIG

• Developed at Yahoo!; now a top-level Apache project
• Immediately makes data on a cluster available to non-Java programmers via Pig Latin, a dataflow language
• Interprets Pig Latin and generates MapReduce jobs that run on the cluster
• Enables easy data summarization, ad-hoc reporting and querying, and analysis of large volumes of data
• The Pig interpreter runs on a client machine – no administrative overhead required

Pig Terms

• All data in Pig is one of four types:
  • An Atom is a simple data value, stored as a string but usable as either a string or a number
  • A Tuple is a data record consisting of a sequence of "fields"; each field is a piece of data of any type (atom, tuple or bag)
  • A Bag is a set of tuples (also referred to as a 'Relation') – conceptually, a kind of table
  • A Map is a map from keys that are string literals to values that can be any data type – conceptually, a hash map
Pig Capabilities

• Support for:
  • Grouping
  • Joins
  • Filtering
  • Aggregation
• Extensibility
  • Support for User Defined Functions (UDFs)
• Leverages the same massive parallelism as native MapReduce


Pig Basics

• Pig is a client application
  • No cluster software is required
• Interprets Pig Latin scripts to MapReduce jobs
  • Parses Pig Latin scripts
  • Performs optimization
  • Creates an execution plan
  • Submits MapReduce jobs to the cluster


Execution Modes

• Pig has two execution modes:
  • Local Mode – all files are installed and run using your local host and file system
  • MapReduce Mode – all files are installed and run on a Hadoop cluster and HDFS installation
• Interactive – use the Grunt shell by invoking Pig on the command line:
  $ pig
  grunt>
• Batch – run Pig in batch mode using Pig scripts and the "pig" command:
  $ pig -f id.pig -p <param>=<value> ...
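As a quick illustration, the execution mode can also be selected explicitly with the -x flag (the script name here is illustrative):

$ pig -x local                  # Grunt shell against the local file system
$ pig -x mapreduce              # Grunt shell against the cluster (the default)
$ pig -x local wordcount.pig    # run a script in local mode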



Pig Latin

• Pig Latin scripts are generally organized as follows:
  • A LOAD statement reads data
  • A series of "transformation" statements process the data
  • A STORE statement writes the output to the filesystem
  • A DUMP statement displays output on the screen
• Logical vs. physical plans:
  • All statements are stored and validated as a logical plan
  • Once a STORE or DUMP statement is found, the logical plan is executed


Example Pig Script

-- Load the content of a file into a pig bag named 'input_lines'
input_lines = LOAD 'CHANGES.txt' AS (line:chararray);

-- Extract words from each line and put them into a pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- Create a group for each word
word_groups = GROUP filtered_words BY word;

-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- Order the records by count
ordered_word_count = ORDER word_count BY count DESC;

-- Store the results (executes the pig script)
STORE ordered_word_count INTO 'output';


Basic "grunt" Shell Commands

• Help is available:
  $ pig -h
• Pig supports HDFS commands:
  grunt> pwd
  • put, get, cp, ls, mkdir, rm, mv, etc.
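A few illustrative Grunt file-system commands (the paths and file names are hypothetical):

grunt> ls /user/hadoop
grunt> mkdir /user/hadoop/output
grunt> cp employees.txt employees_backup.txt
grunt> cat employees.txt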



About Pig Scripts

• Pig Latin statements grouped together in a file
• Can be run from the command line or the shell
• Support parameter passing (example below)
• Comments are supported:
  • Inline comments: '--'
  • Block comments: /* */
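A small sketch of both comment styles and parameter passing (the script, file, and parameter names are hypothetical):

/* daily_report.pig: keep one day of records */
-- $date is supplied on the command line
records = LOAD 'events.txt' AS (day:chararray, msg:chararray);
today = FILTER records BY day == '$date';
DUMP today;

Run it as:
$ pig -param date=2024-01-31 -f daily_report.pig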



Simple Data Types

Type        Description
int         4-byte integer
long        8-byte integer
float       4-byte (single precision) floating point
double      8-byte (double precision) floating point
bytearray   Array of bytes; blob
chararray   String ("hello world")
boolean     True/False (case insensitive)
datetime    A date and time
biginteger  Java BigInteger
bigdecimal  Java BigDecimal


Complex Data Types

Type   Description
Tuple  Ordered set of fields (a "row / record")
Bag    Collection of tuples (a "resultset / table")
Map    A set of key-value pairs; keys must be of type chararray
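A hedged sketch of how all four types can appear in one LOAD schema (the file and field names are illustrative):

-- atom, tuple, bag and map fields in one schema
students = LOAD 'students.txt' AS (
    name:chararray,
    address:tuple(street:chararray, city:chararray),
    grades:bag{t:tuple(course:chararray, gpa:float)},
    info:map[chararray]
);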



Pig Data Formats

• BinStorage
  • Loads and stores data in machine-readable (binary) format
• PigStorage
  • Loads and stores data as structured, field-delimited text files
• TextLoader
  • Loads unstructured data in UTF-8 format
• PigDump
  • Stores data in UTF-8 format
• YourOwnFormat!
  • via UDFs


Loading Data Into Pig

• Loads data from an HDFS file:
  var = LOAD 'employees.txt';
  var = LOAD 'employees.txt' AS (id, name, salary);
  var = LOAD 'employees.txt' USING PigStorage() AS (id, name, salary);
• Each LOAD statement defines a new bag
  • Each bag can have multiple elements (atoms)
  • Each element can be referenced by name or position ($n)
• A bag is immutable
• A bag can be aliased and referenced later


Storing Data Into Pig

• STORE
  • Writes output to an HDFS file in a specified directory:
  grunt> STORE processed INTO 'processed_txt';
  • Fails if the directory exists
  • Writes output files, part-[m|r]-xxxxx, to the directory
  • PigStorage can be used to specify a field delimiter (example below)
• DUMP
  • Writes output to the screen:
  grunt> DUMP processed;
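A short example of setting the field delimiter with PigStorage (the output directory name is illustrative):

grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');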



Relational Operators

• FOREACH
  • Applies expressions to every record in a bag
• FILTER
  • Filters by expression
• GROUP
  • Collects records with the same key
• ORDER BY
  • Sorts records
• DISTINCT
  • Removes duplicates


Relational Operators

• Use the FOREACH … GENERATE operator to work with rows of data, call functions, etc.
• Basic syntax:
  alias2 = FOREACH alias1 GENERATE expression;
• Example:
  DUMP alias1;
  (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
  alias2 = FOREACH alias1 GENERATE col1, col2;
  DUMP alias2;
  (1,2) (4,2) (8,3) (4,3) (7,2) (8,4)


Relational Operators

• Use the FILTER operator to restrict tuples or rows of data
• Basic syntax:
  alias2 = FILTER alias1 BY expression;
• Example:
  DUMP alias1;
  (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
  alias2 = FILTER alias1 BY (col1 == 8) OR (NOT (col2+col3 > col1));
  DUMP alias2;
  (4,2,1) (8,3,4) (7,2,5) (8,4,3)


Relational Operators

• Use the GROUP…ALL operator to group data
  • Use GROUP when only one relation is involved
  • Use COGROUP when multiple relations are involved
• Basic syntax:
  alias2 = GROUP alias1 ALL;
• Example:
  DUMP alias1;
  (John,18,4.0F) (Mary,19,3.8F) (Bill,20,3.9F) (Joe,18,3.8F)
  alias2 = GROUP alias1 BY col2;
  DUMP alias2;
  (18,{(John,18,4.0F),(Joe,18,3.8F)})
  (19,{(Mary,19,3.8F)})
  (20,{(Bill,20,3.9F)})
Relational Operators

• Use the ORDER…BY operator to sort a relation based on one or more fields
• Basic syntax:
  alias = ORDER alias BY field_alias [ASC|DESC];
• Example:
  DUMP alias1;
  (1,2,3) (4,2,1) (8,3,4) (4,3,3) (7,2,5) (8,4,3)
  alias2 = ORDER alias1 BY col3 DESC;
  DUMP alias2;
  (7,2,5) (8,3,4) (1,2,3) (4,3,3) (8,4,3) (4,2,1)


Relational Operators

• Use the DISTINCT operator to remove duplicate tuples in a relation
• Basic syntax:
  alias2 = DISTINCT alias1;
• Example:
  DUMP alias1;
  (8,3,4) (1,2,3) (4,3,3) (4,3,3) (1,2,3)
  alias2 = DISTINCT alias1;
  DUMP alias2;
  (8,3,4) (1,2,3) (4,3,3)


Relational Operators

• FLATTEN
  • Used to un-nest tuples as well as bags
• INNER JOIN
  • Used to perform an inner join of two or more relations based on common field values
• OUTER JOIN
  • Used to perform left, right or full outer joins
• SPLIT
  • Used to partition the contents of a relation into two or more relations
• SAMPLE
  • Used to select a random data sample with the stated sample size (see the sketch below)
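A hedged sketch of SPLIT and SAMPLE (the alias and field names are illustrative, and assume a schema with a salary field):

SPLIT employees INTO seniors IF salary >= 50000, juniors IF salary < 50000;
some_emps = SAMPLE employees 0.1;   -- keep roughly 10% of the tuples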



Relational Operators

• Use the JOIN operator to perform an inner equi-join of two or more relations based on common field values
• The JOIN operator always performs an inner join
• Inner joins ignore null keys
  • Filter out null keys before the join
• The JOIN and COGROUP operators perform similar functions
  • JOIN creates a flat set of output records
  • COGROUP creates a nested set of output records


Relational Operators

DUMP Alias1;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

DUMP Alias2;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

Join Alias1 by Col1 to Alias2 by Col1:
Alias3 = JOIN Alias1 BY Col1, Alias2 BY Col1;

DUMP Alias3;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)


Relational Operators

• Use the OUTER JOIN operator to perform left, right, or full outer joins
  • Pig Latin syntax closely adheres to the SQL standard
  • The keyword OUTER is optional
  • The keywords LEFT, RIGHT and FULL imply left outer, right outer and full outer joins respectively
• Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas
• Outer joins will only work for two-way joins
  • To perform a multi-way outer join, chain multiple two-way outer join statements (see the sketch below)
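A hedged example of a left outer join (the aliases, files and fields are illustrative; note the schemas on the null-producing side):

emps = LOAD 'employees.txt' AS (id:int, name:chararray);
bonuses = LOAD 'bonuses.txt' AS (id:int, amount:float);
emp_bonus = JOIN emps BY id LEFT OUTER, bonuses BY id;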



User-Defined Functions

• Natively written in Java and packaged as a jar file
  • Other languages include JavaScript, Ruby, Groovy, and Python
• Register the jar with the REGISTER statement
• Optionally, alias it with the DEFINE statement

REGISTER /src/myfunc.jar;
A = LOAD 'students';
B = FOREACH A GENERATE myfunc.MyEvalFunc($0);


DEFINE

• DEFINE can be used to work with UDFs and also streaming commands
• Useful when dealing with complex input/output formats

/* read and write comma-delimited data */
DEFINE Y 'stream.pl' INPUT(stdin USING PigStreaming(',')) OUTPUT(stdout USING PigStreaming(','));
A = STREAM X THROUGH Y;

/* define UDFs to a more readable format */
DEFINE MAXNUM org.apache.pig.piggybank.evaluation.math.MAX;
A = LOAD 'student_data' AS (name:chararray, gpa1:float, gpa2:double);
B = FOREACH A GENERATE name, MAXNUM(gpa1, gpa2);
DUMP B;


Hive – a data warehousing package built on top of Hadoop


Hive Background

• Started at Facebook
• Data was collected and stored in an Oracle database
• Data grew from tens of GB (2006) to 1 TB/day of new data (2007)
• By 2020, on the order of 1,024 TB of new data was being generated every minute


Hive use case @ Facebook



What is Hive

• Data warehousing package built on top of Hadoop
• Used for data analysis
• Targeted towards users comfortable with SQL
• Its query language is similar to SQL and is called HiveQL
• For managing and querying structured data
• No need to learn Java and Hadoop APIs
• Developed by Facebook, with contributions from the community
• Facebook analyzed several terabytes of data every day using Hive


Features of Hive

• Hive is fast and scalable
• It provides SQL-like queries (HQL) that are implicitly transformed into MapReduce or Spark jobs
• It is capable of analyzing large datasets stored in HDFS
• It allows different storage types such as plain text, RCFile, and HBase
• It uses indexing to accelerate queries
• It can operate on compressed data stored in the Hadoop ecosystem
• It supports user-defined functions (UDFs), through which users can plug in their own functionality


What is Hive

ETL – Extract, Transform, Load


Why go for Hive? When Pig is there



Hive Architecture and components



Why go for Hive When Pig is there

Pig Latin: a procedural data-flow language, e.g.
  A = LOAD 'mydata';
  DUMP A;
Pig is used by programmers and researchers.

Hive QL: a declarative, SQL-ish language, e.g.
  SELECT * FROM mytable;
Hive is used by analysts generating daily reports.


Pig vs Hive

Feature                       Hive                Pig
Language                      SQL-like            Pig Latin
Schemas/Types                 Yes (explicit)      Yes (implicit)
Partitions                    Yes                 No
Server                        Optional (Thrift)   No
User Defined Functions (UDF)  Yes (Java)          Yes (Java)
DFS Direct Access             Yes                 Yes
Join/Order/Sort               Yes                 Yes
Shell                         Yes                 Yes
Web Interface                 Yes                 No
JDBC/ODBC                     Yes                 No


Differences between Hive and Pig

Hive                                           Pig
Commonly used by data analysts                 Commonly used by programmers
Follows SQL-like queries                       Follows a data-flow language
Handles structured data                        Handles semi-structured data
Works on the server side of an HDFS cluster    Works on the client side of an HDFS cluster
Slower than Pig                                Comparatively faster than Hive


Hive Architecture



Apache Hive Installation

• Java installation – check whether Java is installed using the following command:
  $ java -version
• Hadoop installation – check whether Hadoop is installed using the following command:
  $ hadoop version

Steps to install Apache Hive:
1. Download the Apache Hive tar file:
   http://mirrors.estointernet.in/apache/hive/hive-1.2.2/
2. Unzip the downloaded tar file.


Apache Hive Installation

3. Extract the tar file:
   $ tar -xvf apache-hive-1.2.2-bin.tar.gz
4. Open the .bashrc file:
   $ sudo nano ~/.bashrc
5. Provide the following HIVE_HOME path:
   export HIVE_HOME=/home/codegyani/apache-hive-1.2.2-bin
   export PATH=$PATH:/home/codegyani/apache-hive-1.2.2-bin/bin
6. Update the environment variables:
   $ source ~/.bashrc
7. Start Hive with the following command:
   $ hive


Hive Components



Metastore



Limitations of Hive



Abilities of Hive Query Language



Hive Data Models



Partitioning



Partitioning in Hive

• Partitioning in Hive means dividing the table into parts based on the values of a particular column, such as date, course, city or country
• The advantage of partitioning is that, since the data is stored in slices, query response time becomes faster
• Since Hadoop is used to handle huge amounts of data, it is always worth using the best approach to deal with it; partitioning in Hive is a good example of this


Partitioning in Hive

• Let's assume we have data on 10 million students studying in an institute
• Now, we have to fetch the students of a particular course
• With a traditional approach, we have to scan through the entire data set
• This leads to performance degradation
• In such a case, we can adopt a better approach, i.e., partitioning in Hive, and divide the data among different datasets based on particular columns

Partitioning in Hive can be performed in two ways (sketched below):
• Static partitioning
• Dynamic partitioning
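A hedged HiveQL sketch of both styles (the table, column and file names are illustrative):

-- a partitioned table
CREATE TABLE student (id INT, name STRING)
PARTITIONED BY (course STRING);

-- static partitioning: the partition value is given explicitly
LOAD DATA LOCAL INPATH '/home/user/bda_students.txt'
INTO TABLE student PARTITION (course = 'BDA');

-- dynamic partitioning: partition values come from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE student PARTITION (course)
SELECT id, name, course FROM student_staging;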



Bucketing

• The bucket concept is based on (hash of the column value) mod (total number of buckets)


Bucketing in Hive

• Bucketing in Hive is a data organizing technique
• It is similar to partitioning in Hive, with the added functionality that it divides large datasets into more manageable parts known as buckets
• So, we can use bucketing in Hive when the implementation of partitioning becomes difficult
• However, we can also divide partitions further into buckets


Bucketing in Hive

• The concept of bucketing is based on the hashing technique
• Here, the modulus of the hashed column value and the number of required buckets is calculated (say, hash(x) % 3)
• Based on the resulting value, the data is stored in the corresponding bucket


Example of Bucketing in Hive

• First, select the database in which we want to create the table:
hive> use showbucket;
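Continuing the example, a hedged sketch of creating and populating a bucketed table (the table and column names are illustrative):

hive> CREATE TABLE emp_bucket (id INT, name STRING, salary FLOAT)
      CLUSTERED BY (id) INTO 3 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SET hive.enforce.bucketing = true;
hive> INSERT OVERWRITE TABLE emp_bucket SELECT * FROM emp_staging;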



SerDe - Serialization and Deserialization

Introduction to Hive SerDe

• For the purpose of IO, Apache Hive uses the Hive SerDe interface; it handles both serialization and deserialization in Hive
• It also interprets the results of deserialization as individual fields for processing
• A SerDe allows Hive to read data from a table, and to write it back out to HDFS in any custom format
• Anyone can write their own SerDe for their own data formats
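A hedged HiveQL sketch of attaching a SerDe to a table, here using the CSV SerDe that ships with Hive (the table and column names are illustrative):

CREATE TABLE csv_table (name STRING, city STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;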



SerDe

• HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
• Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files


UDF

• User Defined Functions, also known as UDFs, allow you to create custom functions to process records or groups of records
• Hive comes with a comprehensive library of functions
• There are, however, some omissions, and some specific cases for which UDFs are the solution


UDF

A UDF processes one or several columns of one row and outputs one value. For example:

SELECT lower(str) FROM table;

For each row in "table", the "lower" UDF takes one argument, the value of "str", and outputs one value, the lowercase representation of "str".

SELECT datediff(date_begin, date_end) FROM table;


UDF

For each row in "table", the "datediff" UDF takes two arguments, the values of "date_begin" and "date_end", and outputs one value, the difference in time between these two dates.

Each argument of a UDF can be:
• A column of the table
• A constant value
• The result of another UDF
• The result of an arithmetic computation
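Once a custom UDF is compiled into a jar, it is wired in roughly like this (the jar path, class name and table are hypothetical):

hive> ADD JAR /home/user/my_udfs.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.MyLowerUDF';
hive> SELECT my_lower(name) FROM student;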



Types of Built-in Functions in Hive

• Collection functions
• Date functions
• Mathematical functions
• Conditional functions
• String functions


NoSQL – Not Only SQL

• Lightweight and open source
• NoSQL databases are used in:
  • Big data
  • Real-time web applications
  • Log analysis
  • Social networking feeds
• Non-relational
• Distributed
• No support for ACID properties
• No fixed table schema
NoSQL - Types

• Key-value or big hash table – Dynamo, Redis, Riak
• Document – MongoDB, Apache CouchDB, MarkLogic
• Columnar – Cassandra, HBase
• Graph – Neo4j, HyperGraphDB, InfiniteGraph




What is it?

• NoSQL databases are not relational
• Data is stored as key-value pairs, document-oriented, column-oriented, or graph-oriented

Key-value (big hash table):
Key        Value
Firstname  Rahul
Lastname   Dravid

Document-oriented:
• Maintains data in collections made up of documents
• e.g. MongoDB, Apache CouchDB, Couchbase, MarkLogic
{
  "Book Name" : "BDA",
  "Publisher" : "Wiley India",
  "Year of publication" : 2011
}
Column

• Column-oriented – each storage block has data from only one column


Graph

• Also called network databases – a graph database stores data in nodes and the relationships (edges) between them
(diagram: three linked nodes, ID 1001, ID 1002, ID 1003)


NoSQL – Types & Tools



Advantages of NoSQL

• Can easily scale up and down
• Does not require a predefined schema
• Cheap and easy to implement
• Relaxes the data consistency requirement
• Data can be replicated to multiple nodes and can be partitioned


SQL vs NoSQL


NoSQL Vendors

Company    Product     Most widely used by
Amazon     DynamoDB    LinkedIn, Mozilla
Facebook   Cassandra   Netflix, Twitter, eBay
Google     BigTable    Adobe Photoshop


HBase

HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on Google's BigTable.


HBase

• A distributed data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage
• Designed to operate on top of the Hadoop Distributed File System (HDFS) or the Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability


HBase

• Distributed storage
• Table-like data structure: a multi-dimensional map
• High scalability
• High availability
• High performance


HBase

• Started by Chad Walters and Jim Kellerman
• 2006.11 – Google releases its paper on BigTable
• 2007.2 – Initial HBase prototype created as a Hadoop contrib module
• 2007.10 – First usable HBase
• 2008.1 – Hadoop becomes an Apache top-level project and HBase becomes a subproject
• 2008.10~ – HBase 0.18, 0.19 released
HBase

• Tables have one primary index, the row key
• No join operators
• Scans and queries can select a subset of available columns, perhaps by using a wildcard
• There are three types of lookups (see the shell sketch below):
  • Fast lookup using row key and optional timestamp
  • Full table scan
  • Range scan from region start to end
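These three access patterns map onto the HBase shell roughly as follows (the table and row keys are hypothetical):

hbase> get 'mytable', 'row-17'                                       # point lookup by row key
hbase> scan 'mytable'                                                # full table scan
hbase> scan 'mytable', {STARTROW => 'row-10', STOPROW => 'row-20'}   # range scan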



HBase

• HBase is a BigTable clone
• It is open source
• It has a good community and promise for the future
• It is developed on top of, and has good integration with, the Hadoop platform, if you are using Hadoop already
• It has a Cascading connector




Analyzing big data with twitter



Big data for E-Commerce



Big data for blogs

