Bda Module 4
The JobTracker is responsible for coordinating the execution of a job by scheduling tasks
across the cluster, monitoring their progress, and re-executing any failed tasks to ensure
reliability. Meanwhile, the TaskTrackers handle the execution of tasks as directed by the
JobTracker, operating on the cluster's individual nodes.
The input data for a MapReduce task is stored in files, typically within the Hadoop Distributed
File System (HDFS). These files can vary in format, including line-based logs, binary files, or
multi-line input records. The MapReduce framework processes this data entirely as key-value
pairs, where both the input and output of tasks are structured in this form. While the input and
output key-value pairs may differ in type, this flexible model allows for a wide range of data
processing tasks, making MapReduce a robust solution for handling diverse big data
workloads.
Map-Tasks
A Map Task in the MapReduce programming model is responsible for processing input data
in the form of key-value pairs, denoted as (k1, v1). Here, k1 represents a set of keys, and v1 is
a value (often a large string) read from the input file(s). The map() function implemented within
the task executes the user application logic on these pairs. The output of a map task consists of
zero or more intermediate key-value pairs (k2, v2), which are used as input for the Reduce task
for further processing.
The Mapper operates independently on each dataset, without intercommunication between
Mappers. The output of the Mapper, v2, serves as input for transformation operations at the
Reduce stage, typically involving aggregation or other reducing functions. A Reduce Task
takes these intermediate outputs, optionally pre-aggregated by a combiner, applies the reduce()
function, and generates a smaller, summarized dataset. Reduce tasks are always executed after
the completion of all Map tasks.
The Hadoop Java API provides a Mapper class, which includes an abstract map() function.
Any specific Mapper implementation must extend this class and override the map() function
to define its behaviour.
For instance:
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context) throws IOException, InterruptedException {
        // User-defined logic applied to each input (key, value) pair
    }
}
The number of Map tasks, N_map, is determined by the size of the input files and the block
size of the Hadoop Distributed File System (HDFS).
For example, a 1 TB input file with a block size of 128 MB results in 8192 Map tasks. The
number of Map tasks can also be explicitly set using setNumMapTasks(int) and typically
ranges between 10–100 per node, though higher values can be configured for more granular
parallelism.
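As a rough, hedged sketch of that arithmetic and of the legacy configuration hint (setNumMapTasks() belongs to the older org.apache.hadoop.mapred.JobConf API and is only a hint, not a hard limit):

import org.apache.hadoop.mapred.JobConf;

public class MapTaskEstimate {
    public static void main(String[] args) {
        long inputSize = 1024L * 1024 * 1024 * 1024; // 1 TB of input data, in bytes
        long blockSize = 128L * 1024 * 1024;         // 128 MB HDFS block size
        long numMaps = inputSize / blockSize;        // 8192 map tasks, roughly one per block
        System.out.println("Estimated map tasks: " + numMaps);

        JobConf conf = new JobConf();
        conf.setNumMapTasks((int) numMaps);          // a hint to the framework, not a guarantee
    }
}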
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce takes key-value pairs as input and
produces key-value pairs as output. Data must first be converted into key-value pairs before it
is passed to the Mapper, because the Mapper only understands data in key-value form.
Key-value pairs in Hadoop MapReduce are generated as follows:
InputSplit - Defines a logical representation of the data and presents a split of that data for
processing by an individual map().
RecordReader - Communicates with the InputSplit and converts the split into records in the
form of key-value pairs, in a format suitable for reading by the Mapper. By default,
RecordReader uses TextInputFormat to convert data into key-value pairs, and it keeps
communicating with the InputSplit until the entire file has been read.
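A minimal driver sketch (hedged, using the newer mapreduce API) that makes this default explicit; with TextInputFormat, each map() call receives the line's byte offset as a LongWritable key and the line itself as a Text value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kv-demo");
        // TextInputFormat: key = LongWritable (byte offset), value = Text (the line contents)
        job.setInputFormatClass(TextInputFormat.class);
    }
}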
In MapReduce, the Grouping by Key operation involves collecting and grouping all the output
key-value pairs from the mapper by their keys. This process aggregates values associated with
the same key into a list, which is crucial for further processing during the Shuffle and Sorting
Phase. During this phase, all pairs with the same key are grouped together, creating a list for
each unique key, and the results are sorted. The output format of the shuffle phase is <k2,
List(v2)>. Once the shuffle process completes, the data is divided into partitions.
A Partitioner plays a key role in this step, distributing the intermediate data into different
partitions, ensuring efficient data handling across multiple reducers.
A Combiner is an optional, local reducer that aggregates map output records on each node
before the shuffle phase, optimizing data transfer between the mapper and reducer by reducing
the volume of data that needs to be shuffled across the network.
The Reduce Tasks then process the grouped key-value pairs, applying the reduce() function
to aggregate the data and produce the final output. Each reduce task receives a list of values for
each key and iterates over them to generate aggregated results, which are then outputted in the
form of key-value pairs (k3, v3). This setup, which includes the shuffle, partitioning, combiner,
and reduce phases, optimizes performance and reduces the network load in distributed
computing environments like Hadoop.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Processing logic for each key and its list of values
        // Example: sum of the values for each key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the final output key-value pair (k3, v3)
        context.write(key, new IntWritable(sum));
    }
}
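A minimal driver sketch of how such a job is typically wired together, assuming a hypothetical MyMapper that emits (Text, IntWritable) pairs; ExampleReducer above doubles as the combiner and the reducer, and HashPartitioner is the default partitioner made explicit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SumJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-per-key");
        job.setJarByClass(SumJobDriver.class);
        job.setMapperClass(MyMapper.class);             // hypothetical mapper emitting (Text, IntWritable)
        job.setCombinerClass(ExampleReducer.class);     // optional local aggregation before the shuffle
        job.setReducerClass(ExampleReducer.class);      // the reducer shown above
        job.setPartitionerClass(HashPartitioner.class); // hash(key) mod numReduceTasks
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}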
Coping with Node Failure
Hadoop achieves fault tolerance by restarting tasks that fail during the execution of a
MapReduce job.
1. Map TaskTracker Failure:
o If a Map TaskTracker fails, both completed and in-progress map tasks on that node
are reset to idle, because their intermediate output is stored on the failed node's local
disk and is no longer accessible.
o These map tasks are rescheduled on another TaskTracker.
2. Reduce TaskTracker Failure:
o If a Reduce TaskTracker fails, only the in-progress reduce tasks are reset to idle.
o A new TaskTracker will execute the in-progress reduce tasks.
3. Master JobTracker Failure:
o If the JobTracker fails, the entire job is aborted and the client is notified.
o If there is only one master node, a failure in the JobTracker results in the job
needing to restart.
Through regular communication between TaskTrackers and the JobTracker, Hadoop can detect
failures, reassign tasks, and ensure that the job completes even in the event of node failures.
This fault tolerance mechanism helps MapReduce jobs run reliably on a large distributed
cluster.
Composing MapReduce for Calculations and Algorithms
In MapReduce, calculations and algorithms can be composed to efficiently handle a variety of
big data processing tasks. Below are several examples of common MapReduce compositions
for various operations:
1. Counting and Summing
Counting and summing operations are fundamental to MapReduce jobs. For example, counting
the number of alerts or messages generated during a vehicle's maintenance activity for a
specific period (e.g., a month) can be done by emitting a count for each message.
Example: Word count or counting messages in a log file:
Mapper: For each message, emit a key-value pair with key as a generic identifier (like
null or a timestamp) and the value as 1.
Reducer: The reducer will sum the values, providing the total count of messages or
words.
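A hedged mapper sketch for the word-count flavour of this pattern (here each word itself is used as the key, with a count of 1; a sum reducer such as ExampleReducer above completes the job):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in the input line
public class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}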
2. Sorting
Sorting in MapReduce typically occurs by emitting keys in a sorted order during the map phase
and having the framework sort them before they are passed to the reducer. The reducer will
then aggregate the sorted results.
Mapper: Emits items associated with sorting keys.
Reducer: Combines all emitted parts into a final sorted list.
3. Counting Unique Values
Counting the number of unique (distinct) values of a field, often per group, can be composed
in two ways:
1. First Solution: Mapper emits dummy counters for each field and group ID, and the
reducer calculates the total number of occurrences for each pair.
2. Second Solution: The Mapper emits values and group IDs, and the reducer excludes
duplicates and counts unique values for each group.
Example: Counting unique users by their ID in web logs.
Mapper: Emits the user ID with a dummy count.
Reducer: Filters out duplicate user IDs and counts the total number of unique users.
4. Collating
Collating involves collecting all items with the same key into a list. This is useful for operations
like producing inverted indexes or performing extract, transform, and load (ETL) tasks.
Mapper: Computes a given function for each item and emits the result as a key, with
the item itself as a value.
Reducer: Groups items by key and processes them.
Example: Creating an inverted index.
Mapper: Emits each word from the document as a key and the document ID as the
value.
Reducer: Collects all document IDs for each word, producing a list of documents
where each word appears.
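A hedged Java sketch of this inverted-index pattern, assuming each input line has the form "<docId><TAB><document text>" (an assumed layout for illustration):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Mapper: emits (word, docId) for every word in the document body
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            Text docId = new Text(parts[0]);
            for (String word : parts[1].toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), docId);
                }
            }
        }
    }

    // Reducer: collects the distinct document IDs for each word
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new LinkedHashSet<>();
            for (Text id : docIds) {
                postings.add(id.toString());
            }
            context.write(word, new Text(String.join(",", postings)));
        }
    }
}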
5. Filtering or Parsing
Filtering or parsing is used when processing datasets to collect only the items that satisfy certain
conditions or transform items into other formats.
Mapper: Accepts only items that satisfy specific conditions and emits them.
Reducer: Collects all the emitted items and outputs the results.
Example: Extracting valid records from a log file.
Mapper: Filters records based on a condition (e.g., logs with errors) and emits the valid
records.
Reducer: Collects the valid records and saves them.
6. Distributed Tasks Execution
Large-scale computations are divided into multiple partitions and executed in parallel. The
results from each partition are then combined to produce the final result.
Mapper: Processes a specific partition of the data and emits the computed results.
Reducer: Combines the results from all the mappers.
Example: Numerical analysis or performance testing tasks that require distributed execution.
7. Graph Processing using Iterative Message Passing
In graph processing, nodes and edges represent entities and relationships, and iterative message
passing is used for tasks like path traversal.
Mapper: Each node sends messages to its neighbouring nodes.
Reducer: Updates the state of nodes based on received messages.
Example: PageRank computation or social network analysis.
Mapper: Sends messages with node IDs to their neighbouring nodes.
Reducer: Updates each node’s state based on the messages received from neighbours.
Cross-Correlation using MapReduce
Cross-correlation is a technique that computes how much two sequences (or datasets) are
similar to one another. In the context of big data, particularly in text analytics or market
analysis, cross-correlation is used to find co-occurring items, like words in sentences or
products bought together by customers.
Use Cases:
1. Text Analytics: Finding words that co-occur in the same sentence or document.
2. Market Analysis: Identifying which products are often bought together (e.g.,
"customers who bought item X also tend to buy item Y").
Basic Approach:
1. N x N Matrix: If there are N items, the total number of pairwise correlations is N × N. For
example, in text analytics these pairs represent co-occurring words, and in market analysis
they represent items bought together.
2. Memory Constraints: If the N × N matrix is small enough to fit into memory, the
correlation matrix can be processed straightforwardly on a single machine. For larger
datasets, the problem must be distributed across multiple nodes.
MapReduce Approaches for Cross-Correlation:
There are two main solutions for calculating cross-correlation using MapReduce:
1. First Approach: Emitting All Pairs and Dummy Counters
In this approach, the Mapper emits all possible pairs of items and dummy counters for each
pair. The Reducer then sums these counters.
Steps:
Mapper: For each tuple (sentence or transaction), emit pairs of items with a counter
(1).
o Example: For sentence ["apple", "banana", "cherry"], emit the following pairs:
(apple, banana, 1)
(apple, cherry, 1)
(banana, cherry, 1)
Reducer: The reducer will sum all the dummy counters for each item pair to compute
the total co-occurrence count.
o Example: If the pair (apple, banana) appears in three sentences, the reducer will
sum the counters for this pair to give the final count.
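A hedged mapper sketch of this pairs technique, assuming whitespace-separated items per input line; a standard sum reducer (such as ExampleReducer earlier) then totals the co-occurrence counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs mapper: emits ((itemA,itemB), 1) for every pair of items in the line
public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] items = value.toString().split("\\s+");
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                // Composite text key "itemA,itemB" so each pair groups correctly at the reducer
                context.write(new Text(items[i] + "," + items[j]), ONE);
            }
        }
    }
}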
2. Second Approach: Using Stripes for Efficient Computation
When the dataset is large, emitting all pairs directly may not be efficient. Instead, a stripe
technique is used, which groups the data by the first item in each pair. This method accumulates
the counts for all adjacent items in an associative array (or "stripe").
Steps:
Mapper: For each tuple, the mapper groups all adjacent items into an associative array
(stripe). The stripe keeps track of the co-occurrence counts for each item in the tuple.
o Example: For sentence ["apple", "banana", "cherry"], the mapper will emit:
(apple, {banana: 1, cherry: 1})
(banana, {apple: 1, cherry: 1})
(cherry, {apple: 1, banana: 1})
Reducer: The reducer will then merge all stripes for the same leading item (i.e., for
each unique item), aggregate the counts for each co-occurring item, and emit the final
result.
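A hedged mapper sketch of the stripes technique, again assuming whitespace-separated items per input line; the reducer (not shown) would merge the MapWritable stripes for each key by summing the per-neighbour counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stripes mapper: one associative array (stripe) per item, counting its co-occurring neighbours
public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] items = value.toString().split("\\s+");
        for (int i = 0; i < items.length; i++) {
            MapWritable stripe = new MapWritable();
            for (int j = 0; j < items.length; j++) {
                if (i == j) continue;
                Text neighbour = new Text(items[j]);
                IntWritable count = (IntWritable) stripe.get(neighbour);
                stripe.put(neighbour, new IntWritable(count == null ? 1 : count.get() + 1));
            }
            context.write(new Text(items[i]), stripe);
        }
    }
}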
Relational Algebra Operations
Relational algebra is a procedural query language used for querying relational databases. It
consists of a set of operations that take one or two relations (tables) as input and produce a new
relation as output. These operations form the foundation of SQL and are used to manipulate
and retrieve data from relational databases.
Here are the basic relational algebra operations:
1. Selection (σ)
The Selection operation is used to select a subset of rows from a relation that satisfy a given
condition. The result is a new relation that contains only those rows from the original relation
where the condition holds true.
Syntax:
σ condition(R)
Where:
condition is a predicate (a logical condition) that the rows must satisfy.
R is the relation (table) from which rows are selected.
Example:
Consider a relation Employees with attributes (EmpID, Name, Age, Department):
EmpID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
The selection σ Department = 'HR'(Employees) returns only the rows for the HR department:
EmpID Name Age Department
101 Alice 30 HR
103 Carol 35 HR
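For comparison, the same selection written in SQL-style syntax (such as the HiveQL introduced later in this module) would be:

SELECT * FROM Employees WHERE Department = 'HR';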
2. Projection (π)
The Projection operation is used to select specific columns from a relation, effectively
reducing the number of attributes in the resulting relation. It eliminates duplicate rows in the
result.
Syntax:
πattribute1, attribute2, ..., attributeN(R)
Where:
attribute1, attribute2, ..., attributeN are the columns to be selected from the relation.
R is the relation from which attributes are selected.
Example:
Consider the Employees relation again. If we only want to select the Name and Department
columns, we would write: πName, Department(Employees)
This would produce the following result:
Name Department
Alice HR
Bob IT
Carol HR
3. Union (∪)
The Union operation combines the rows of two relations, removing duplicates. The two
relations involved must have the same set of attributes (columns).
Syntax:
R∪S
Where:
R and S are two relations with the same schema (same attributes).
Example:
Let’s assume two relations:
Employees (EmpID, Name) and Contractors (EmpID, Name).
Employees:
EmpID Name
101 Alice
102 Bob
Contractors:
EmpID Name
103 Carol
102 Bob
The union Employees ∪ Contractors produces:
EmpID Name
101 Alice
102 Bob
103 Carol
4. Set Difference (−)
The Set Difference operation returns the rows that appear in the first relation but not in the
second. Both relations must have the same schema.
Syntax:
R − S
Example:
If we subtract Contractors from Employees: Employees − Contractors
This would result in:
EmpID Name
101 Alice
5. Cartesian Product (×)
The Cartesian Product operation combines every row of the first relation with every row of the
second relation. The result contains all attributes of both relations.
Syntax:
R × S
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
DeptID Department
D01 HR
D02 IT
The Cartesian product Employees × Departments produces:
EmpID Name DeptID Department
101 Alice D01 HR
101 Alice D02 IT
102 Bob D01 HR
102 Bob D02 IT
6. Rename (ρ)
The Rename operation is used to rename the attributes (columns) of a relation or to change the
name of the relation itself. This operation is particularly useful when combining relations in
operations like join.
Syntax:
ρNewName(OldName)(R)
Where:
NewName is the new name of the relation.
OldName is the current name of the relation.
R is the relation.
Example:
If we have a relation Employees and want to rename the attribute EmpID to EmployeeID, we
would write: ρEmployees(EmpID → EmployeeID)(Employees)
This would result in the following relation:
EmployeeID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
7. Join (⨝)
The Join operation combines two relations based on a common attribute. It is one of the most
important operations in relational algebra, as it allows combining data from different tables.
Types of Join:
Inner Join: Combines rows from both relations where the join condition is true.
Outer Join: Returns all rows from one or both relations, with null values for unmatched
rows.
Syntax:
R ⨝condition S
Where:
R and S are relations.
condition specifies the common attribute used for the join.
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
EmpID Department
101 HR
102 IT
The inner join Employees ⨝ Employees.EmpID = Departments.EmpID Departments produces:
EmpID Name Department
101 Alice HR
102 Bob IT
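In SQL-style syntax (such as HiveQL), this inner join would read:

SELECT e.EmpID, e.Name, d.Department
FROM Employees e JOIN Departments d ON (e.EmpID = d.EmpID);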
Hive
Hive is a data warehousing and SQL-like query system built on top of Hadoop. It was originally
developed by Facebook to manage large amounts of data in Hadoop's distributed file system
(HDFS). Hive simplifies the process of querying and managing large-scale datasets by
providing an abstraction layer that allows users to run SQL-like queries (HiveQL) on top of
the Hadoop ecosystem.
Characteristics of Hive
1. MapReduce Integration:
Hive translates queries written in Hive Query Language (HiveQL) into MapReduce jobs.
This makes Hive scalable and suitable for managing and analyzing vast datasets,
particularly static data. Since Hive uses MapReduce, it inherits the scalability and parallel
processing capabilities of Hadoop.
2. Web and API Support:
Hive provides web interfaces and APIs that allow clients to interact with the Hive database
server. Users can query Hive either through a web browser or via programmatic access,
making it convenient for both end-users and developers.
3. SQL-like Query Language (HiveQL):
Hive provides a query language called HiveQL (Hive Query Language), which is similar
to SQL. It allows users to perform typical database operations like SELECT, INSERT,
JOIN, and GROUP BY. However, HiveQL is specifically designed to work with the
underlying Hadoop infrastructure.
4. Data Storage on HDFS:
Data loaded into Hive tables is stored in Hadoop's HDFS. Hive abstracts away the
complexity of managing HDFS directly, and users can interact with their data through
HiveQL queries. This allows for easy integration with Hadoop's distributed storage system.
Limitations of Hive
1. Not a Full Database:
While Hive provides querying capabilities and table management features, it is not a full-
fledged database. Some critical operations typically available in traditional databases (such
as UPDATE, ALTER, and DELETE) are not directly supported by Hive. The design of
Hive prioritizes read-heavy, analytical workloads rather than transactional operations.
2. Unstructured Data Handling:
Hive is primarily designed for structured and semi-structured data. It is not optimized for
managing unstructured data (e.g., audio, video, or images). Therefore, it may not be the
best tool for use cases requiring real-time analysis or unstructured data processing.
Data Analytics:
Hive is widely used for analyzing large datasets in industries like e-commerce, finance, and
telecom, where data is typically stored in HDFS, and querying needs are focused on
extracting insights from large volumes of static data.
Hive Architecture
The architecture of Hive is designed to provide an abstraction layer on top of Hadoop, allowing
users to run SQL-like queries (HiveQL) for managing and analyzing large datasets stored in
HDFS. Hive architecture consists of several key components that work together to enable
querying, execution, and management of data within the Hadoop ecosystem.
Components of Hive Architecture
1. Hive Server (Thrift):
o Function: The Hive Server is an optional service that allows remote clients to
submit requests to Hive and retrieve results.
o Client API: The Hive Server exposes a simple client API (through Thrift),
enabling the execution of HiveQL statements. It supports various programming
languages for interacting with Hive, such as Java, Python, and others.
o Role: This server acts as an interface for external applications to communicate
with the Hive system. The Thrift service allows clients to send HiveQL queries
and retrieve results without directly interacting with the underlying
infrastructure.
2. Hive CLI (Command Line Interface):
o Function: The Hive CLI is a popular interface that allows users to interact
directly with Hive through a command line.
o Local Mode: Hive can run in local mode when used with the CLI. In this mode,
Hive uses the local file system for storing data, rather than HDFS. This is useful
for small-scale testing or development.
o Usage: The Hive CLI allows users to submit queries, manage databases, create
tables, and perform other administrative tasks in Hive.
3. Web Interface (HWI):
o Function: Hive can also be accessed through a web interface, which is provided
by the Hive Web Interface (HWI).
o HWI Server: A designated HWI server must be running to provide web-based
access to Hive. Users can access Hive via a web browser by navigating to a
URL like http://hadoop:<port>/hwi.
o Usage: The web interface provides a graphical interface for executing queries,
managing tables, and performing administrative tasks without needing to use
the CLI.
4. Metastore:
o Function: The Metastore is a crucial component of Hive that stores all the
metadata (schema information) related to the tables, databases, and columns.
o Metadata: It stores information such as the database schema, column data
types, and HDFS locations of the data files.
o Interaction: All other components of Hive interact with the Metastore to fetch
or update metadata. For example, when a user queries a table, the Metastore
helps locate the corresponding data in HDFS.
o Storage: The Metastore typically uses a relational database (like MySQL or
PostgreSQL) to store this metadata.
5. Hive Driver:
o Function: The Hive Driver manages the lifecycle of a HiveQL query.
o Lifecycle Management: It is responsible for compiling the HiveQL query,
optimizing it, and finally executing the query on the Hadoop cluster.
o Execution Flow:
Compilation: The Hive Driver compiles the HiveQL statement into a
series of MapReduce jobs (or other execution plans depending on the
environment).
Optimization: The query is then optimized for execution. This may
include tasks such as predicate pushdown, column pruning, and join
optimization.
Execution: The final optimized query is submitted for execution on the
Hadoop cluster, where it is processed by the MapReduce framework.
6. Query Compiler:
o Function: The Query Compiler is responsible for parsing the HiveQL
statements and converting them into execution plans that are understandable by
the Hadoop system.
o Stages: The process involves the compilation of the HiveQL statement into an
Abstract Syntax Tree (AST), followed by the generation of a logical query plan
and its optimization before the physical plan is produced.
7. Execution Engine:
o Function: The Execution Engine is responsible for the actual execution of the
query.
o Processing: It submits tasks to the underlying Hadoop infrastructure
(MapReduce, Tez, or Spark, depending on the configuration). The Execution
Engine also handles data movement between various stages of the computation.
Hive Data Types
Hive supports both primitive data types (numeric, string, date/time, boolean, and binary) and
complex data types (arrays, maps, and structs).
Hive Data Model
The Hive data model organizes and structures data in a way that allows efficient querying,
analysis, and storage in a Hadoop ecosystem. The components of the Hive data model include
Databases, Tables, Partitions, and Buckets.
1. Database
Description: A database in Hive acts as a namespace for organizing and storing tables.
Each database can contain multiple tables, and you can use the USE statement to switch
between databases.
Example: A database can represent different applications or systems, like a database
for customer data or product data.
2. Tables
Description: Tables in Hive are similar to tables in traditional RDBMS. They are used
to store structured data in a tabular format. Each table is backed by a directory in HDFS
where the actual data files reside.
Operations Supported: Hive tables support various operations such as:
o Filter: Filtering rows based on certain conditions.
o Projection: Selecting specific columns to be returned in the result.
o Join: Joining multiple tables to retrieve related data.
o Union: Combining results from multiple queries.
Data Storage: The data in a Hive table is stored in HDFS, and the structure (schema)
is defined when creating the table.
3. Partitions
Description: Partitions are used to divide the data in a table into subsets based on the
values of one or more columns. This helps in organizing large datasets and enables
efficient querying by reducing the amount of data scanned for specific queries.
How It Works: A table can be partitioned by a column, such as date or region, and each
partition will store data corresponding to that column's value (e.g., data from January
will be stored in one partition, data from February in another).
Example: A sales table can be partitioned by the year and month columns to store data
for each year and month separately.
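A hedged HiveQL sketch of such a partitioned table (the sales table and its columns are assumed names for illustration):

CREATE TABLE IF NOT EXISTS sales (
  order_id INT,
  amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load January 2024 data into its own partition
LOAD DATA INPATH '/data/sales/2024-01.csv'
INTO TABLE sales PARTITION (year = 2024, month = 1);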
4. Buckets
Description: Buckets further divide data within each partition based on a hash of a
column in the table. This technique allows data to be split into smaller, more
manageable files within each partition.
How It Works: Data is divided into a specific number of buckets (files) by hashing a
particular column's value. Each bucket corresponds to one file stored in the partition's
directory.
Example: A customer table might be bucketed by the customer_id column, ensuring
that the data for each customer is stored in a separate bucket.
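A hedged HiveQL sketch of such a bucketed table (table and column names are assumed for illustration):

CREATE TABLE IF NOT EXISTS customer (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

-- On older Hive versions, bucketing must be enforced explicitly before inserts
SET hive.enforce.bucketing = true;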
Hive Integration and Workflow Steps
Hive’s integration with Hadoop involves several key components that handle the query
execution, metadata retrieval, and job management.
1. Execute Query:
o The query is sent from the Hive interface (CLI, Web Interface, etc.) to the
Database Driver, which is responsible for initiating the execution process.
2. Get Plan:
o The Driver forwards the query to the Query Compiler. The compiler parses
the query and creates an execution plan, verifying the syntax and determining
the operations required.
3. Get Metadata:
o The Compiler requests metadata information (like table schema, column types,
etc.) from the Metastore (which can be backed by databases like MySQL or
PostgreSQL).
4. Send Metadata:
o The Metastore responds with the metadata, and the Compiler uses this
information to refine the query plan.
5. Send Plan:
o After parsing the query and receiving metadata, the Compiler sends the
finalized query execution plan back to the Driver.
6. Execute Plan:
o The Driver sends the execution plan to the Execution Engine, which is
responsible for actually running the query on the Hadoop cluster.
7. Execute Job:
o The execution engine triggers the execution of the query, which is typically
translated into a MapReduce job. This job is sent to the JobTracker (running
on the NameNode), which assigns tasks to TaskTrackers on DataNodes for
parallel processing.
8. Metadata Operations:
o During the execution, the Execution Engine may also perform metadata
operations with the Metastore, such as querying schema details or updating the
metastore.
9. Fetch Result:
o After completing the MapReduce job, the Execution Engine collects the results
from the DataNodes where the job was processed.
10. Send Results:
o The results are sent back to the Driver, which in turn forwards them to the Hive
interface for display to the user.
Hive Built-in Functions
Hive provides a wide range of built-in functions to operate on different data types, enabling
various data transformations and calculations. Here’s a breakdown of some common built-in
functions in Hive:
1. BIGINT Functions
round(double a)
o Description: Returns the rounded BIGINT (8-byte integer) value of the 8-byte
double-precision floating point number a.
o Return Type: BIGINT
o Example: round(123.456) returns 123.
floor(double a)
o Description: Returns the maximum BIGINT value that is equal to or less than
the double value.
o Return Type: BIGINT
o Example: floor(123.789) returns 123.
ceil(double a)
o Description: Returns the minimum BIGINT value that is equal to or greater
than the double value.
o Return Type: BIGINT
o Example: ceil(123.456) returns 124.
2. Random Number Generation
rand(), rand(int seed)
o Description: Returns a random number (double) that is uniformly distributed
between 0 and 1. The sequence changes with each row, and specifying a seed
ensures the random number sequence is deterministic.
o Return Type: double
o Example: rand() returns a random number like 0.456789, and rand(5) will
generate a sequence based on the seed 5.
3. String Functions
concat(string str1, string str2, ...)
o Description: Concatenates two or more strings into one.
o Return Type: string
o Example: concat('Hello ', 'World') returns 'Hello World'.
substr(string str, int start)
o Description: Returns a substring of str starting from the position start till the
end of the string.
o Return Type: string
o Example: substr('Hello World', 7) returns 'World'.
substr(string str, int start, int length)
o Description: Returns a substring of str starting from position start with the
given length.
o Return Type: string
o Example: substr('Hello World', 1, 5) returns 'Hello'.
upper(string str), ucase(string str)
o Description: Converts all characters of str to uppercase.
o Return Type: string
o Example: upper('hello') returns 'HELLO'.
lower(string str), lcase(string str)
o Description: Converts all characters of str to lowercase.
o Return Type: string
o Example: lower('HELLO') returns 'hello'.
trim(string str)
o Description: Trims spaces from both ends of the string.
o Return Type: string
o Example: trim(' Hello World ') returns 'Hello World'.
ltrim(string str)
o Description: Trims spaces from the left side of the string.
o Return Type: string
o Example: ltrim(' Hello') returns 'Hello'.
rtrim(string str)
o Description: Trims spaces from the right side of the string.
o Return Type: string
o Example: rtrim('Hello ') returns 'Hello'.
4. Date and Time Functions
year(string date)
o Description: Extracts the year part of a date or timestamp string.
o Return Type: int
o Example: year('2024-12-25') returns 2024.
month(string date)
o Description: Extracts the month part of a date or timestamp string.
o Return Type: int
o Example: month('2024-12-25') returns 12.
day(string date)
o Description: Extracts the day part of a date or timestamp string.
o Return Type: int
o Example: day('2024-12-25') returns 25.
HiveQL Features
Data Definition: Allows users to define and manage the schema of tables, databases,
etc.
Data Manipulation: Enables the manipulation of data, such as inserting, updating, or
deleting records (although with some limitations).
Query Processing: Supports querying large datasets using operations like filtering,
joining, and aggregating data.
HiveQL Process Engine
The HiveQL Process Engine translates HiveQL queries into execution plans and
communicates with the Execution Engine to run the query. It is a replacement for the
traditional approach of writing Java-based MapReduce programs.
Hive Execution Engine
The Execution Engine is the component that bridges HiveQL and MapReduce. It
processes the query and generates results in the same way that MapReduce jobs would
do. It uses a variant of MapReduce to execute HiveQL queries across a distributed
Hadoop cluster.
HiveQL Data Definition Language (DDL)
HiveQL provides several commands for defining databases and tables. These commands are
used to manage the structure of the data in Hive.
Creating a Database
To create a new database in Hive, the following command is used:
CREATE DATABASE [IF NOT EXISTS] <database_name>;
IF NOT EXISTS: Ensures that Hive does not throw an error if the database already
exists.
Example:
CREATE DATABASE IF NOT EXISTS my_database;
Show Databases
To list all the databases in Hive, use the command:
SHOW DATABASES;
Dropping a Database
To delete an existing database, use the following command:
DROP DATABASE [IF EXISTS] [RESTRICT | CASCADE] <database_name>;
IF EXISTS: Prevents an error if the database does not exist.
RESTRICT: Deletes the database only if it is empty.
CASCADE: Deletes the database along with any tables it contains.
Example:
DROP DATABASE IF EXISTS my_database CASCADE;
Creating a Table
The syntax for creating a table in Hive is:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[<database_name>.]<table_name>
[(<column_name> <data_type> [COMMENT <column_comment>], ...)]
[COMMENT <table_comment>]
[ROW FORMAT <row_format>]
[STORED AS <file_format>];
TEMPORARY: Creates a temporary table that is only available during the session.
EXTERNAL: Specifies that the table is external, meaning Hive won't manage its data
(i.e., data is stored outside the Hive warehouse).
IF NOT EXISTS: Avoids an error if the table already exists.
COMMENT: Adds a description to the table or column.
ROW FORMAT: Specifies the format of the rows in the table (e.g., DELIMITED).
STORED AS: Specifies the file format for storing data (e.g., TEXTFILE, ORC,
PARQUET).
Example:
CREATE TABLE IF NOT EXISTS employee (
emp_id INT,
name STRING,
salary DOUBLE
)
COMMENT 'Employee table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
HiveQL Data Manipulation Language (DML)
DML commands in Hive are used for managing and modifying the data within Hive tables.
Using a Database
To set the current database in Hive, use the USE command:
USE <database_name>;
Loading Data into a Table
To load data into a Hive table from a local or HDFS path, the LOAD DATA command is used:
LOAD DATA [LOCAL] INPATH '<file_path>' [OVERWRITE] INTO TABLE <table_name>
[PARTITION (partcol1=val1, partcol2=val2, ...)];
LOCAL: Specifies that the file is on the local filesystem.
OVERWRITE: Overwrites the existing data in the table.
PARTITION: Specifies partitioning columns if applicable.
Example:
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE employee;
Dropping a Table
To delete an existing table, the command is:
DROP TABLE [IF EXISTS] <table_name>;
IF EXISTS: Avoids an error if the table does not exist.
Example:
DROP TABLE IF EXISTS employee;
Altering a Table
You can modify the structure of a table using the ALTER TABLE command:
ALTER TABLE <table_name> ADD COLUMNS (<column_name> <data_type>
[COMMENT <column_comment>]);
Example:
ALTER TABLE employee ADD COLUMNS (department STRING);
HiveQL Querying the Data
Hive supports querying data using SQL-like syntax, with additional features for partitioning,
sorting, and aggregating data.
Basic Query
A basic query to select data from a table is:
SELECT [ALL | DISTINCT] <select_expression>, ...
FROM <table_name>
[WHERE <condition>]
[GROUP BY <column_list>]
[HAVING <condition>]
[CLUSTER BY <column_list>]
[DISTRIBUTE BY <column_list>]
[SORT BY <column_list>]
[LIMIT <number>];
ALL: Returns all rows (default).
DISTINCT: Returns only unique rows.
WHERE: Filters rows based on a condition.
GROUP BY: Groups rows based on column values.
HAVING: Filters groups based on a condition.
CLUSTER BY: Shorthand for DISTRIBUTE BY and SORT BY on the same columns; rows are
distributed to reducers by those columns and sorted within each reducer.
DISTRIBUTE BY: Distributes rows to different reducers.
SORT BY: Sorts the data within each reducer.
LIMIT: Limits the number of rows returned.
Example:
SELECT DISTINCT name, salary
FROM employee
WHERE salary > 50000
ORDER BY salary DESC;
PIG
Pig is a high-level platform built on top of Hadoop to facilitate the processing of large datasets.
It abstracts the complexities of writing MapReduce programs and provides a more user-friendly
interface for data manipulation.
Features of Apache Pig
Dataflow Language: Pig uses a dataflow language, where operations on data are linked
in a chain, and the output of one operation is the input to the next.
Simplifies MapReduce: Pig reduces the complexity of writing raw MapReduce
programs by providing a higher-level abstraction.
Parallel Processing: Pig allows the execution of tasks in parallel, which makes it
suitable for handling large datasets.
Flexible: It can process structured, semi-structured, and unstructured data.
High-level Operations: Supports complex data manipulation tasks like filtering,
joining, and aggregating large datasets.
Applications of Apache Pig
Large Dataset Analysis: Ideal for analyzing vast amounts of data in HDFS.
Ad-hoc Data Processing: Useful for quick, one-time data processing tasks.
Processing Streaming Data: It can process web logs, sensor data, or other real-time
data.
Search Platform Data Processing: Pig can be used for processing and analyzing data
related to search platforms.
Time-sensitive Data Processing: Processes and analyzes data quickly, which is
essential for applications that require fast insights.
Pig scripts are often used in combination with Hadoop for data processing at scale, making it
a powerful tool for big data analytics.
Pig Architecture
The Pig architecture is built to support flexible and scalable data processing in a Hadoop
ecosystem. It executes Pig Latin scripts via three main methods:
1. Grunt Shell: An interactive shell that executes Pig scripts in real time.
2. Script File: A file containing Pig commands that are executed on a Pig server.
3. Embedded Script: Pig Latin functions that can be written as User-Defined Functions
(UDFs) in different programming languages and embedded within Pig scripts.
Pig Execution Process
The Pig execution flow involves multiple stages to transform raw data into processed output:
1. Parser: After a script passes through Grunt or Pig Server, the parser handles syntax
checking and type validation. The output of this step is a Directed Acyclic Graph (DAG)
that represents the flow of operations.
o DAG: Nodes represent operations, and edges indicate data flows between them.
This structure ensures that each node only handles one set of inputs at a time,
making the process acyclic.
2. Optimizer: After generating the DAG, the optimizer reduces data at various stages to
optimize performance. Some of the optimizations include:
o PushUpFilter: Splits and pushes filters up in the execution plan to reduce the
dataset early in the pipeline.
o PushDownForEachFlatten: Delays the flatten operation to minimize the data
set in the pipeline.
o ColumnPruner: Removes unused columns as early as possible.
o MapKeyPruner: Discards unused map keys.
o Limit Optimizer: Pushes the LIMIT operation as early as possible to avoid
unnecessary computation.
3. Compiler: After optimization, the Pig scripts are converted into a series of MapReduce
jobs, which are compiled into code that will be executed on the Hadoop cluster.
4. Execution Engine: The execution engine takes the MapReduce jobs and executes them
on the Hadoop cluster, generating the final output.
Pig Grunt Shell
The Grunt shell is primarily used for writing and executing Pig Latin scripts. You can also
invoke shell commands such as sh and ls. For instance:
To execute shell commands: grunt> sh shell_command_parameters
To list files in the Grunt shell: grunt> sh ls
Pig Latin Data Model
Pig Latin supports both primitive (atomic) and complex data types, making it versatile for
handling various data structures.
Primitive Data Types:
int: 32-bit signed integer (e.g., 10)
long: 64-bit signed integer (e.g., 101)
float: 32-bit floating point (e.g., 22.7F)
double: 64-bit floating point (e.g., 3.4)
chararray: Character array (e.g., 'hello')
bytearray: Binary data (e.g., ffoo)
Complex Data Types:
bag: Collection of tuples (e.g., {(1,1), (2,4)})
tuple: Ordered set of fields (e.g., (1, 1))
map: Set of key-value pairs (e.g., ['key1'#1])
Pig Latin Constructs
Pig Latin scripts are built using a variety of operations that handle data input, output, and
transformations. A typical Pig Latin script includes the following:
1. Schemas and Expressions: Defines how data is structured and what operations will be
performed on it.
2. Commands:
o LOAD: Reads data from the file system.
o DUMP: Displays the result.
o STORE: Stores the processed result into the file system.
3. Comments:
o Single-line comments start with --.
o Multiline comments are enclosed in /* */.
4. Case Sensitivity:
o Keywords (like LOAD, STORE, DUMP) are not case-sensitive.
o Function names, relations, and paths are case-sensitive.
Pig Latin Script Execution Modes
1. Interactive Mode: This mode uses the Grunt shell. It allows you to write and execute
Pig Latin scripts interactively, making it ideal for quick testing and debugging.
2. Batch Mode: In this mode, you write the Pig Latin script in a single file with a .pig
extension. The script is then executed as a batch process.
3. Embedded Mode: This mode involves defining User-Defined Functions (UDFs) in
programming languages such as Java, and using them in Pig scripts. It allows for more
advanced functionality beyond the built-in operations of Pig.
Pig Commands
To get a list of Pig commands: pig -help
To check the version of Pig: pig -version
To start the Grunt shell: pig
Load Command
The LOAD command in Pig is used to load data into the system from various data sources.
Here's how it works:
Loading data from HBase:
book = LOAD 'MyBook' USING HBaseStorage();
Loading data from a CSV file using PigStorage, with a comma as a separator:
book = LOAD 'PigDemo/Data/Input/myBook.csv' USING PigStorage(',');
Specifying a schema while loading data: You can define a schema for the loaded data,
which helps in interpreting each field of the record.
book = LOAD 'MyBook' AS (name:chararray, author:chararray, edition:int,
publisher:chararray);
Store Command
The STORE command writes the processed data to a storage location, typically HDFS. It can
store data in various formats.
Default storage in HDFS (tab-delimited format):
STORE processed INTO '/PigDemo/Data/Output/Processed';
Storing data in HBase:
STORE processed INTO 'processed' USING HBaseStorage();
Storing data as comma-separated text:
STORE processed INTO 'processed' USING PigStorage(',');
Dump Command
The DUMP command is useful for displaying the processed data directly on the screen. It’s
often used during debugging or prototyping to quickly inspect the results.
Displaying processed data:
DUMP processed;
Relational Operations in Pig Latin
Pig Latin provides several relational operations that allow you to transform and manipulate
data. These operations are used to sort, group, join, project, and filter data. Some of the basic
relational operators include:
1. FOREACH: This operation applies transformations to the data based on columns and
is often used to project data (i.e., select specific columns). It is the projection operator
in Pig Latin.
Example:
result = FOREACH data GENERATE field1, field2;
The FOREACH operation is extremely powerful and can also be used for applying functions
or expressions to each field in the dataset.
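As a hedged end-to-end sketch that combines LOAD, FOREACH, GROUP, and STORE (the file paths and field names are assumed for illustration):

-- Word count in Pig Latin
lines = LOAD '/PigDemo/Data/Input/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/PigDemo/Data/Output/WordCount' USING PigStorage(',');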
Summary of Key Commands
LOAD: Used for loading data from an external source into Pig.
STORE: Writes processed data to an external location.
DUMP: Displays the processed data on the screen for inspection.
FOREACH: Allows applying transformations and projections to data.
These commands form the foundation of writing and executing Pig scripts, enabling you to
process and analyze large datasets in a Hadoop environment.
-------------------------------------------END OF MODULE 4-----------------------------------------