BDA Unit - IV
School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Unit 4 Big Data Visualization and Prediction
• Support for
  • Grouping
  • Joins
  • Filtering
  • Aggregation
• Extensibility
  • Support for User Defined Functions (UDFs)
• Leverages the same massive parallelism as native MapReduce
-- Extract words from each line and put them into a pig bag named ‘words’
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
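For context, a minimal end-to-end word-count sketch built around the statement above (it is repeated below for completeness); the input and output paths are placeholders, not from the slides:

-- Load each line of a text file as a single chararray field
input_lines = LOAD '/user/hadoop/input.txt' AS (line:chararray);
-- Extract words from each line and put them into a Pig bag named 'words'
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group
grouped = GROUP words BY word;
word_count = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE word_count INTO '/user/hadoop/wordcount_out';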
• Help is available
$ pig -h
• Pig supports HDFS commands
grunt> pwd
• put, get, cp, ls, mkdir, rm, mv, etc.
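For example (the paths below are hypothetical):

grunt> ls /user/hadoop
grunt> mkdir /user/hadoop/pig_output
grunt> cp /user/hadoop/input.txt /user/hadoop/backup.txt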
Type    Description
Tuple   Ordered set of fields (a “row / record”)
Bag     Collection of tuples (a “resultset / table”)
Map     A set of key-value pairs; keys must be of type chararray
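A small illustration of how these types appear in Pig; the values, field names and input file are made up:

-- Tuple: (Alice, 85)
-- Bag:   {(Alice, 85), (Bob, 72)}
-- Map:   [dept#CSE, year#3]
-- Declaring them in a LOAD schema:
students = LOAD 'students.txt'
           AS (name:chararray,
               scores:bag{t:tuple(subject:chararray, mark:int)},
               info:map[]);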
• BinStorage
• Loads and stores data in machine-readable (binary) format
• PigStorage
• Loads and stores data as structured, field-delimited text files
• TextLoader
• Loads unstructured data in UTF-8 format
• PigDump
• Stores data in UTF-8 format
• YourOwnFormat!
• via UDFs
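A hedged sketch of loading data with two of these functions; the file paths, delimiter and field names are assumptions:

-- Structured, comma-delimited text via PigStorage
records = LOAD '/data/students.csv' USING PigStorage(',')
          AS (name:chararray, dept:chararray, marks:int);
-- Unstructured text via TextLoader, one line per record
raw = LOAD '/data/server.log' USING TextLoader() AS (line:chararray);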
• STORE
• Writes output to an HDFS file in a specified directory
grunt> STORE processed INTO 'processed_txt';
• Fails if directory exists
• Writes output files, part-[m|r]-xxxxx, to the directory
• PigStorage can be used to specify a field delimiter
• DUMP
• Write output to screen
grunt> DUMP processed;
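As noted above, a field delimiter can be passed to PigStorage when storing; the output directory name here is a placeholder:

grunt> STORE processed INTO 'processed_csv' USING PigStorage(',');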
• FOREACH
• Applies expressions to every record in a bag
• FILTER
• Filters by expression
• GROUP
• Collect records with the same key
• ORDER BY
• Sorting
• DISTINCT
• Removes duplicates
• FLATTEN
• Used to un-nest tuples as well as bags
• INNER JOIN
• Used to perform an inner join of two or more relations based on common field values
• OUTER JOIN
• Used to perform left, right or full outer joins
• SPLIT
• Used to partition the contents of a relation into two or more relations
• SAMPLE
• Used to select a random data sample with the stated sample size
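A short Pig sketch illustrating several of these operators; the relation names, field names and input files are made up:

students = LOAD 'students.txt' USING PigStorage(',')
           AS (name:chararray, dept:chararray, marks:int);
passed   = FILTER students BY marks >= 50;                          -- FILTER
by_dept  = GROUP passed BY dept;                                    -- GROUP
avg_mark = FOREACH by_dept GENERATE group AS dept,
                                    AVG(passed.marks) AS avg_marks; -- FOREACH
ranked   = ORDER avg_mark BY avg_marks DESC;                        -- ORDER BY
depts    = LOAD 'departments.txt' USING PigStorage(',')
           AS (dept:chararray, hod:chararray);
joined   = JOIN ranked BY dept, depts BY dept;                      -- INNER JOIN
SPLIT students INTO toppers IF marks >= 90, others IF marks < 90;   -- SPLIT
sampled  = SAMPLE students 0.1;                                     -- SAMPLE (~10% of rows)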
• Started at Facebook
• Data was collected and stored in an Oracle DB
• Data grew from tens of GB (2006) to about 1 TB/day of new data (2007)
• By around 2020, roughly 1024 TB of data was being generated every minute
ETL – Extract, Transform, Load
(Figure: Hive vs. Pig comparison)
• Java Installation - Check whether Java is installed using the following command:
$ java -version
• Hadoop Installation - Check whether Hadoop is installed using the following command:
$ hadoop version
Steps to install Apache Hive
Download the Apache Hive tar file:
http://mirrors.estointernet.in/apache/hive/hive-1.2.2/
Unzip the downloaded tar file.
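A possible set of shell commands for these steps; the tarball name and install directory are assumptions based on the 1.2.2 release above:

$ tar -xzf apache-hive-1.2.2-bin.tar.gz
$ mv apache-hive-1.2.2-bin /usr/local/hive
$ export HIVE_HOME=/usr/local/hive
$ export PATH=$PATH:$HIVE_HOME/bin
$ hive --version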
• The partitioning in Hive means dividing the table into some parts based
on the values of a particular column like date, course, city or country.
• The advantage of partitioning is that since the data is stored in slices, the
query response time becomes faster.
• Since Hadoop is used to handle huge volumes of data, it is important to organise that data efficiently.
• Partitioning in Hive is a good example of such an approach.
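For example, a partitioned table might be declared and loaded as follows; the table, columns and file path are made up for illustration:

-- Each (course, year) combination is stored as its own slice on HDFS
CREATE TABLE student (name STRING, marks INT)
PARTITIONED BY (course STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/tmp/bda_students.csv'
INTO TABLE student PARTITION (course = 'BDA', year = 2020);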
• For I/O, Apache Hive uses the SerDe interface, which handles both serialization and deserialization in Hive.
• However, anyone can also write their own SerDe for their own data formats.
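For example, the built-in LazySimpleSerDe (Hive's default SerDe) can be named explicitly in a table definition; the table and columns here are made up:

CREATE TABLE books (title STRING, publisher STRING, year INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',')
STORED AS TEXTFILE;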
A UDF processes one or several columns of one row and outputs one
value. For example :
•SELECT lower(str) from table
For each row in "table," the "lower" UDF takes one argument, the value
of "str", and outputs one value, the lowercase representation of "str".
•SELECT datediff(date_begin, date_end) from table
For each row in "table," the "datediff" UDF takes two arguments, the value of
"date_begin" and "date_end", and outputs one value, the difference in time
between these two dates.
Each argument of a UDF can be:
•A column of the table.
•A constant value.
•The result of another UDF.
•The result of an arithmetic computation.
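A single query can combine all four kinds of arguments; the table and columns below are hypothetical:

-- column (name, marks), constant ('-'), nested UDF (upper), arithmetic (marks * 2)
SELECT concat(upper(name), '-', cast(marks * 2 AS STRING))
FROM student;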
• Collection Functions.
• Date Functions.
• Mathematical Functions.
• Conditional Functions.
• String Functions.
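One built-in example from each category; the table and columns are assumptions (subjects is an array column):

SELECT
  size(subjects),                        -- Collection function
  datediff('2020-12-31', '2020-01-01'),  -- Date function
  round(75.456, 2),                      -- Mathematical function
  if(marks >= 50, 'Pass', 'Fail'),       -- Conditional function
  upper(name)                            -- String function
FROM student;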
• NoSQL DBs are used in
  • Big data
  • Log analysis
• Non-relational database.
• Distributed.
NoSQL – Document oriented
• Maintains data in collections made up of documents.
• For example: MongoDB, Apache CouchDB, Couchbase, MarkLogic.
{
  "Book Name" : "BDA",
  "Publisher" : "Wiley India",
  "Year of publication" : 2011
}
NoSQL – Column oriented
• Each storage block has data from only one column.
NoSQL – Key/Value or big hash table
• Schema-less.
• Example: ID : 1003
HBase is an open-source, distributed, column-oriented database built on top of HDFS, based on Google's BigTable.
• Distributed storage
• Table-like in data structure (a multi-dimensional map)
• High scalability
• High availability
• High performance
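A minimal HBase shell sketch consistent with the points above; the table and column family names are made up:

hbase> create 'books', 'info'
hbase> put 'books', 'row1', 'info:title', 'BDA'
hbase> put 'books', 'row1', 'info:publisher', 'Wiley India'
hbase> get 'books', 'row1'
hbase> scan 'books'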