Bda Module 4
The JobTracker is responsible for coordinating the execution of a job by scheduling tasks
across the cluster, monitoring their progress, and re-executing any failed tasks to ensure
reliability. Meanwhile, the TaskTrackers handle the execution of tasks as directed by the
JobTracker, operating on the cluster's individual nodes.
The input data for a MapReduce task is stored in files, typically within the Hadoop Distributed
File System (HDFS). These files can vary in format, including line-based logs, binary files, or
multi-line input records. The MapReduce framework processes this data entirely as key-value
pairs, where both the input and output of tasks are structured in this form. While the input and
output key-value pairs may differ in type, this flexible model allows for a wide range of data
processing tasks, making MapReduce a robust solution for handling diverse big data
workloads.
Map-Tasks
A Map Task in the MapReduce programming model is responsible for processing input data
in the form of key-value pairs, denoted as (k1, v1). Here, k1 represents a set of keys, and v1 is
a value (often a large string) read from the input file(s). The map() function implemented within
the task executes the user application logic on these pairs. The output of a map task consists of
zero or more intermediate key-value pairs (k2, v2), which are used as input for the Reduce task
for further processing.
The Mapper operates independently on each dataset, without intercommunication between
Mappers. The output of the Mapper, v2, serves as input for transformation operations at the
Reduce stage, typically involving aggregation or other reducing functions. A Reduce Task
takes these intermediate outputs, optionally pre-aggregated by a combiner, applies the reduce()
function, and generates a smaller, summarized dataset. Reduce tasks are always executed after
the completion of all Map tasks.
The Hadoop Java API provides a Mapper class, which includes an abstract map() function.
Any specific Mapper implementation must extend this class and override the map() function
to define its behaviour.
For instance:
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context) throws IOException, InterruptedException {
        // User-defined logic applied to each input (key, value) pair
    }
}
The number of Map tasks, N_map, is determined by the size of the input files and the block
size of the Hadoop Distributed File System (HDFS).
For example, a 1 TB input file with a block size of 128 MB results in 8192 Map tasks. The
number of Map tasks can also be explicitly set using setNumMapTasks(int) and typically
ranges between 10–100 per node, though higher values can be configured for more granular
parallelism.
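As a rough, hedged sketch of that arithmetic and of the legacy configuration hint (setNumMapTasks() belongs to the older org.apache.hadoop.mapred.JobConf API and is only a hint, not a hard limit):

import org.apache.hadoop.mapred.JobConf;

public class MapTaskEstimate {
    public static void main(String[] args) {
        long inputSize = 1024L * 1024 * 1024 * 1024; // 1 TB of input data, in bytes
        long blockSize = 128L * 1024 * 1024;         // 128 MB HDFS block size
        long numMaps = inputSize / blockSize;        // 8192 map tasks, roughly one per block
        System.out.println("Estimated map tasks: " + numMaps);

        JobConf conf = new JobConf();
        conf.setNumMapTasks((int) numMaps);          // a hint to the framework, not a guarantee
    }
}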
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce takes key-value pairs as input and
produces key-value pairs as output. Data must first be converted into key-value pairs before it
is passed to the Mapper, because the Mapper only understands data in key-value form.
Key-value pairs in Hadoop MapReduce are generated as follows:
InputSplit - Defines a logical representation of the data and presents a split of that data for
processing by an individual map().
RecordReader - Communicates with the InputSplit and converts the split into records in the
form of key-value pairs, in a format suitable for reading by the Mapper. By default,
RecordReader uses TextInputFormat to convert data into key-value pairs, and it keeps
communicating with the InputSplit until the entire file has been read.
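A minimal driver sketch (hedged, using the newer mapreduce API) that makes this default explicit; with TextInputFormat, each map() call receives the line's byte offset as a LongWritable key and the line itself as a Text value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "kv-demo");
        // TextInputFormat: key = LongWritable (byte offset), value = Text (the line contents)
        job.setInputFormatClass(TextInputFormat.class);
    }
}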
In MapReduce, the Grouping by Key operation involves collecting and grouping all the output
key-value pairs from the mapper by their keys. This process aggregates values associated with
the same key into a list, which is crucial for further processing during the Shuffle and Sorting
Phase. During this phase, all pairs with the same key are grouped together, creating a list for
each unique key, and the results are sorted. The output format of the shuffle phase is <k2,
List(v2)>. Once the shuffle process completes, the data is divided into partitions.
A Partitioner plays a key role in this step, distributing the intermediate data into different
partitions, ensuring efficient data handling across multiple reducers.
A Combiner is an optional, local reducer that aggregates map output records on each node
before the shuffle phase, optimizing data transfer between the mapper and reducer by reducing
the volume of data that needs to be shuffled across the network.
The Reduce Tasks then process the grouped key-value pairs, applying the reduce() function
to aggregate the data and produce the final output. Each reduce task receives a list of values for
each key and iterates over them to generate aggregated results, which are then outputted in the
form of key-value pairs (k3, v3). This setup, which includes the shuffle, partitioning, combiner,
and reduce phases, optimizes performance and reduces the network load in distributed
computing environments like Hadoop.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Processing logic for each key and its list of values
        // Example: sum of the values for each key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit the final output key-value pair (k3, v3)
        context.write(key, new IntWritable(sum));
    }
}
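A minimal driver sketch of how such a job is typically wired together, assuming a hypothetical MyMapper that emits (Text, IntWritable) pairs; ExampleReducer above doubles as the combiner and the reducer, and HashPartitioner is the default partitioner made explicit:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class SumJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-per-key");
        job.setJarByClass(SumJobDriver.class);
        job.setMapperClass(MyMapper.class);             // hypothetical mapper emitting (Text, IntWritable)
        job.setCombinerClass(ExampleReducer.class);     // optional local aggregation before the shuffle
        job.setReducerClass(ExampleReducer.class);      // the reducer shown above
        job.setPartitionerClass(HashPartitioner.class); // hash(key) mod numReduceTasks
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}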
Coping with Node Failure
Hadoop achieves fault tolerance by restarting tasks that fail during the execution of a
MapReduce job.
1. Map TaskTracker Failure:
o If a Map TaskTracker fails, both completed and in-progress map tasks on that node
are reset to idle, because their intermediate output is stored on the failed node's local
disk and is no longer accessible.
o These map tasks are rescheduled on another TaskTracker.
2. Reduce TaskTracker Failure:
o If a Reduce TaskTracker fails, only the in-progress reduce tasks are reset to idle.
o A new TaskTracker will execute the in-progress reduce tasks.
3. Master JobTracker Failure:
o If the JobTracker fails, the entire job is aborted and the client is notified.
o If there is only one master node, a failure in the JobTracker results in the job
needing to restart.
Through regular communication between TaskTrackers and the JobTracker, Hadoop can detect
failures, reassign tasks, and ensure that the job completes even in the event of node failures.
This fault tolerance mechanism helps MapReduce jobs run reliably on a large distributed
cluster.
Composing MapReduce for Calculations and Algorithms
In MapReduce, calculations and algorithms can be composed to efficiently handle a variety of
big data processing tasks. Below are several examples of common MapReduce compositions
for various operations:
1. Counting and Summing
Counting and summing operations are fundamental to MapReduce jobs. For example, counting
the number of alerts or messages generated during a vehicle's maintenance activity for a
specific period (e.g., a month) can be done by emitting a count for each message.
Example: Word count or counting messages in a log file:
Mapper: For each message, emit a key-value pair with key as a generic identifier (like
null or a timestamp) and the value as 1.
Reducer: The reducer will sum the values, providing the total count of messages or
words.
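A hedged mapper sketch for the word-count flavour of this pattern (here each word itself is used as the key, with a count of 1; a sum reducer such as ExampleReducer above completes the job):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in the input line
public class CountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}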
2. Sorting
Sorting in MapReduce typically occurs by emitting keys in a sorted order during the map phase
and having the framework sort them before they are passed to the reducer. The reducer will
then aggregate the sorted results.
Mapper: Emits items associated with sorting keys.
Reducer: Combines all emitted parts into a final sorted list.
3. Counting Unique Values
Counting the number of unique (distinct) values of a field, often per group, can be composed
in two ways:
1. First Solution: Mapper emits dummy counters for each field and group ID, and the
reducer calculates the total number of occurrences for each pair.
2. Second Solution: The Mapper emits values and group IDs, and the reducer excludes
duplicates and counts unique values for each group.
Example: Counting unique users by their ID in web logs.
Mapper: Emits the user ID with a dummy count.
Reducer: Filters out duplicate user IDs and counts the total number of unique users.
4. Collating
Collating involves collecting all items with the same key into a list. This is useful for operations
like producing inverted indexes or performing extract, transform, and load (ETL) tasks.
Mapper: Computes a given function for each item and emits the result as a key, with
the item itself as a value.
Reducer: Groups items by key and processes them.
Example: Creating an inverted index.
Mapper: Emits each word from the document as a key and the document ID as the
value.
Reducer: Collects all document IDs for each word, producing a list of documents
where each word appears.
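A hedged Java sketch of this inverted-index pattern, assuming each input line has the form "<docId><TAB><document text>" (an assumed layout for illustration):

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Mapper: emits (word, docId) for every word in the document body
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2);
            if (parts.length < 2) return;
            Text docId = new Text(parts[0]);
            for (String word : parts[1].toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), docId);
                }
            }
        }
    }

    // Reducer: collects the distinct document IDs for each word
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            Set<String> postings = new LinkedHashSet<>();
            for (Text id : docIds) {
                postings.add(id.toString());
            }
            context.write(word, new Text(String.join(",", postings)));
        }
    }
}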
5. Filtering or Parsing
Filtering or parsing is used when processing datasets to collect only the items that satisfy certain
conditions or transform items into other formats.
Mapper: Accepts only items that satisfy specific conditions and emits them.
Reducer: Collects all the emitted items and outputs the results.
Example: Extracting valid records from a log file.
Mapper: Filters records based on a condition (e.g., logs with errors) and emits the valid
records.
Reducer: Collects the valid records and saves them.
6. Distributed Tasks Execution
Large-scale computations are divided into multiple partitions and executed in parallel. The
results from each partition are then combined to produce the final result.
Mapper: Processes a specific partition of the data and emits the computed results.
Reducer: Combines the results from all the mappers.
Example: Numerical analysis or performance testing tasks that require distributed execution.
7. Graph Processing using Iterative Message Passing
In graph processing, nodes and edges represent entities and relationships, and iterative message
passing is used for tasks like path traversal.
Mapper: Each node sends messages to its neighbouring nodes.
Reducer: Updates the state of nodes based on received messages.
Example: PageRank computation or social network analysis.
Mapper: Sends messages with node IDs to their neighbouring nodes.
Reducer: Updates each node’s state based on the messages received from neighbours.
Cross-Correlation using MapReduce
Cross-correlation is a technique that computes how much two sequences (or datasets) are
similar to one another. In the context of big data, particularly in text analytics or market
analysis, cross-correlation is used to find co-occurring items, like words in sentences or
products bought together by customers.
Use Cases:
1. Text Analytics: Finding words that co-occur in the same sentence or document.
2. Market Analysis: Identifying which products are often bought together (e.g.,
"customers who bought item X also tend to buy item Y").
Basic Approach:
1. N x N Matrix: If there are N items, the total number of pairwise correlations is N × N. For
example, in text analytics these pairs represent co-occurring words, and in market analysis
they represent items bought together.
2. Memory Constraints: If the N × N matrix is small enough to fit into memory, the
correlation matrix can be processed straightforwardly on a single machine. For larger
datasets, the problem must be distributed across multiple nodes.
MapReduce Approaches for Cross-Correlation:
There are two main solutions for calculating cross-correlation using MapReduce:
1. First Approach: Emitting All Pairs and Dummy Counters
In this approach, the Mapper emits all possible pairs of items and dummy counters for each
pair. The Reducer then sums these counters.
Steps:
Mapper: For each tuple (sentence or transaction), emit pairs of items with a counter
(1).
o Example: For sentence ["apple", "banana", "cherry"], emit the following pairs:
(apple, banana, 1)
(apple, cherry, 1)
(banana, cherry, 1)
Reducer: The reducer will sum all the dummy counters for each item pair to compute
the total co-occurrence count.
o Example: If the pair (apple, banana) appears in three sentences, the reducer will
sum the counters for this pair to give the final count.
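A hedged mapper sketch of this pairs technique, assuming whitespace-separated items per input line; a standard sum reducer (such as ExampleReducer earlier) then totals the co-occurrence counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs mapper: emits ((itemA,itemB), 1) for every pair of items in the line
public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] items = value.toString().split("\\s+");
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                // Composite text key "itemA,itemB" so each pair groups correctly at the reducer
                context.write(new Text(items[i] + "," + items[j]), ONE);
            }
        }
    }
}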
2. Second Approach: Using Stripes for Efficient Computation
When the dataset is large, emitting all pairs directly may not be efficient. Instead, a stripe
technique is used, which groups the data by the first item in each pair. This method accumulates
the counts for all adjacent items in an associative array (or "stripe").
Steps:
Mapper: For each tuple, the mapper groups all adjacent items into an associative array
(stripe). The stripe keeps track of the co-occurrence counts for each item in the tuple.
o Example: For sentence ["apple", "banana", "cherry"], the mapper will emit:
(apple, {banana: 1, cherry: 1})
(banana, {apple: 1, cherry: 1})
(cherry, {apple: 1, banana: 1})
Reducer: The reducer will then merge all stripes for the same leading item (i.e., for
each unique item), aggregate the counts for each co-occurring item, and emit the final
result.
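A hedged mapper sketch of the stripes technique, again assuming whitespace-separated items per input line; the reducer (not shown) would merge the MapWritable stripes for each key by summing the per-neighbour counts:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stripes mapper: one associative array (stripe) per item, counting its co-occurring neighbours
public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] items = value.toString().split("\\s+");
        for (int i = 0; i < items.length; i++) {
            MapWritable stripe = new MapWritable();
            for (int j = 0; j < items.length; j++) {
                if (i == j) continue;
                Text neighbour = new Text(items[j]);
                IntWritable count = (IntWritable) stripe.get(neighbour);
                stripe.put(neighbour, new IntWritable(count == null ? 1 : count.get() + 1));
            }
            context.write(new Text(items[i]), stripe);
        }
    }
}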
Relational Algebra Operations
Relational algebra is a procedural query language used for querying relational databases. It
consists of a set of operations that take one or two relations (tables) as input and produce a new
relation as output. These operations form the foundation of SQL and are used to manipulate
and retrieve data from relational databases.
Here are the basic relational algebra operations:
1. Selection (σ)
The Selection operation is used to select a subset of rows from a relation that satisfy a given
condition. The result is a new relation that contains only those rows from the original relation
where the condition holds true.
Syntax:
σ condition(R)
Where:
condition is a predicate (a logical condition) that the rows must satisfy.
R is the relation (table) from which rows are selected.
Example:
Consider a relation Employees with attributes (EmpID, Name, Age, Department):
EmpID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
The selection σ Department = 'HR'(Employees) returns only the rows for the HR department:
EmpID Name Age Department
101 Alice 30 HR
103 Carol 35 HR
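For comparison, the same selection written in SQL-style syntax (such as the HiveQL introduced later in this module) would be:

SELECT * FROM Employees WHERE Department = 'HR';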
2. Projection (π)
The Projection operation is used to select specific columns from a relation, effectively
reducing the number of attributes in the resulting relation. It eliminates duplicate rows in the
result.
Syntax:
πattribute1, attribute2, ..., attributeN(R)
Where:
attribute1, attribute2, ..., attributeN are the columns to be selected from the relation.
R is the relation from which attributes are selected.
Example:
Consider the Employees relation again. If we only want to select the Name and Department
columns, we would write: πName, Department(Employees)
This would produce the following result:
Name Department
Alice HR
Bob IT
Carol HR
3. Union (∪)
The Union operation combines the rows of two relations, removing duplicates. The two
relations involved must have the same set of attributes (columns).
Syntax:
R∪S
Where:
R and S are two relations with the same schema (same attributes).
Example:
Let’s assume two relations:
Employees (EmpID, Name) and Contractors (EmpID, Name).
Employees:
EmpID Name
101 Alice
102 Bob
Contractors:
EmpID Name
103 Carol
102 Bob
The union Employees ∪ Contractors produces:
EmpID Name
101 Alice
102 Bob
103 Carol
4. Set Difference (−)
The Set Difference operation returns the rows that appear in the first relation but not in the
second. Both relations must have the same schema.
Syntax:
R − S
Example:
If we subtract Contractors from Employees: Employees − Contractors
This would result in:
EmpID Name
101 Alice
5. Cartesian Product (×)
The Cartesian Product operation combines every row of the first relation with every row of the
second relation. The result contains all attributes of both relations.
Syntax:
R × S
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
DeptID Department
D01 HR
D02 IT
The Cartesian product Employees × Departments produces:
EmpID Name DeptID Department
101 Alice D01 HR
101 Alice D02 IT
102 Bob D01 HR
102 Bob D02 IT
6. Rename (ρ)
The Rename operation is used to rename the attributes (columns) of a relation or to change the
name of the relation itself. This operation is particularly useful when combining relations in
operations like join.
Syntax:
ρNewName(OldName)(R)
Where:
NewName is the new name of the relation.
OldName is the current name of the relation.
R is the relation.
Example:
If we have a relation Employees and want to rename the attribute EmpID to EmployeeID, we
would write: ρEmployees(EmpID → EmployeeID)(Employees)
This would result in the following relation:
EmployeeID Name Age Department
101 Alice 30 HR
102 Bob 25 IT
103 Carol 35 HR
7. Join (⨝)
The Join operation combines two relations based on a common attribute. It is one of the most
important operations in relational algebra, as it allows combining data from different tables.
Types of Join:
Inner Join: Combines rows from both relations where the join condition is true.
Outer Join: Returns all rows from one or both relations, with null values for unmatched
rows.
Syntax:
R ⨝condition S
Where:
R and S are relations.
condition specifies the common attribute used for the join.
Example:
Consider the following relations:
Employees:
EmpID Name
101 Alice
102 Bob
Departments:
EmpID Department
101 HR
102 IT
The inner join Employees ⨝ Employees.EmpID = Departments.EmpID Departments produces:
EmpID Name Department
101 Alice HR
102 Bob IT
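In SQL-style syntax (such as HiveQL), this inner join would read:

SELECT e.EmpID, e.Name, d.Department
FROM Employees e JOIN Departments d ON (e.EmpID = d.EmpID);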
Hive
Hive is a data warehousing and SQL-like query system built on top of Hadoop. It was originally
developed by Facebook to manage large amounts of data in Hadoop's distributed file system
(HDFS). Hive simplifies the process of querying and managing large-scale datasets by
providing an abstraction layer that allows users to run SQL-like queries (HiveQL) on top of
the Hadoop ecosystem.
Characteristics of Hive
1. MapReduce Integration:
Hive translates queries written in Hive Query Language (HiveQL) into MapReduce jobs.
This makes Hive scalable and suitable for managing and analyzing vast datasets,
particularly static data. Since Hive uses MapReduce, it inherits the scalability and parallel
processing capabilities of Hadoop.
2. Web and API Support:
Hive provides web interfaces and APIs that allow clients to interact with the Hive database
server. Users can query Hive either through a web browser or via programmatic access,
making it convenient for both end-users and developers.
3. SQL-like Query Language (HiveQL):
Hive provides a query language called HiveQL (Hive Query Language), which is similar
to SQL. It allows users to perform typical database operations like SELECT, INSERT,
JOIN, and GROUP BY. However, HiveQL is specifically designed to work with the
underlying Hadoop infrastructure.
4. Data Storage on HDFS:
Data loaded into Hive tables is stored in Hadoop's HDFS. Hive abstracts away the
complexity of managing HDFS directly, and users can interact with their data through
HiveQL queries. This allows for easy integration with Hadoop's distributed storage system.
Limitations of Hive
1. Not a Full Database:
While Hive provides querying capabilities and table management features, it is not a full-
fledged database. Some critical operations typically available in traditional databases (such
as UPDATE, ALTER, and DELETE) are not directly supported by Hive. The design of
Hive prioritizes read-heavy, analytical workloads rather than transactional operations.
2. Unstructured Data Handling:
Hive is primarily designed for structured and semi-structured data. It is not optimized for
managing unstructured data (e.g., audio, video, or images). Therefore, it may not be the
best tool for use cases requiring real-time analysis or unstructured data processing.
Data Analytics:
Hive is widely used for analyzing large datasets in industries like e-commerce, finance, and
telecom, where data is typically stored in HDFS, and querying needs are focused on
extracting insights from large volumes of static data.
Hive Architecture
The architecture of Hive is designed to provide an abstraction layer on top of Hadoop, allowing
users to run SQL-like queries (HiveQL) for managing and analyzing large datasets stored in
HDFS. Hive architecture consists of several key components that work together to enable
querying, execution, and management of data within the Hadoop ecosystem.
Components of Hive Architecture
1. Hive Server (Thrift):
o Function: The Hive Server is an optional service that allows remote clients to
submit requests to Hive and retrieve results.
o Client API: The Hive Server exposes a simple client API (through Thrift),
enabling the execution of HiveQL statements. It supports various programming
languages for interacting with Hive, such as Java, Python, and others.
o Role: This server acts as an interface for external applications to communicate
with the Hive system. The Thrift service allows clients to send HiveQL queries
and retrieve results without directly interacting with the underlying
infrastructure.
2. Hive CLI (Command Line Interface):
o Function: The Hive CLI is a popular interface that allows users to interact
directly with Hive through a command line.
o Local Mode: Hive can run in local mode when used with the CLI. In this mode,
Hive uses the local file system for storing data, rather than HDFS. This is useful
for small-scale testing or development.
o Usage: The Hive CLI allows users to submit queries, manage databases, create
tables, and perform other administrative tasks in Hive.
3. Web Interface (HWI):
o Function: Hive can also be accessed through a web interface, which is provided
by the Hive Web Interface (HWI).
o HWI Server: A designated HWI server must be running to provide web-based
access to Hive. Users can access Hive via a web browser by navigating to a
URL like http://hadoop:<port>/hwi.
o Usage: The web interface provides a graphical interface for executing queries,
managing tables, and performing administrative tasks without needing to use
the CLI.
4. Metastore:
o Function: The Metastore is a crucial component of Hive that stores all the
metadata (schema information) related to the tables, databases, and columns.
o Metadata: It stores information such as the database schema, column data
types, and HDFS locations of the data files.
o Interaction: All other components of Hive interact with the Metastore to fetch
or update metadata. For example, when a user queries a table, the Metastore
helps locate the corresponding data in HDFS.
o Storage: The Metastore typically uses a relational database (like MySQL or
PostgreSQL) to store this metadata.
5. Hive Driver:
o Function: The Hive Driver manages the lifecycle of a HiveQL query.
o Lifecycle Management: It is responsible for compiling the HiveQL query,
optimizing it, and finally executing the query on the Hadoop cluster.
o Execution Flow:
Compilation: The Hive Driver compiles the HiveQL statement into a
series of MapReduce jobs (or other execution plans depending on the
environment).
Optimization: The query is then optimized for execution. This may
include tasks such as predicate pushdown, column pruning, and join
optimization.
Execution: The final optimized query is submitted for execution on the
Hadoop cluster, where it is processed by the MapReduce framework.
6. Query Compiler:
o Function: The Query Compiler is responsible for parsing the HiveQL
statements and converting them into execution plans that are understandable by
the Hadoop system.
o Stages: The process involves the compilation of the HiveQL statement into an
Abstract Syntax Tree (AST), followed by the generation of a logical query plan
and its optimization before the physical plan is produced.
7. Execution Engine:
o Function: The Execution Engine is responsible for the actual execution of the
query.
o Processing: It submits tasks to the underlying Hadoop infrastructure
(MapReduce, Tez, or Spark, depending on the configuration). The Execution
Engine also handles data movement between various stages of the computation.
Hive Data Types
Hive supports both primitive data types (numeric, string, date/time, boolean, and binary) and
complex data types (arrays, maps, and structs).
Hive Data Model
The Hive data model organizes and structures data in a way that allows efficient querying,
analysis, and storage in a Hadoop ecosystem. The components of the Hive data model include
Databases, Tables, Partitions, and Buckets.
1. Database
Description: A database in Hive acts as a namespace for organizing and storing tables.
Each database can contain multiple tables, and you can use the USE statement to switch
between databases.
Example: A database can represent different applications or systems, like a database
for customer data or product data.
2. Tables
Description: Tables in Hive are similar to tables in traditional RDBMS. They are used
to store structured data in a tabular format. Each table is backed by a directory in HDFS
where the actual data files reside.
Operations Supported: Hive tables support various operations such as:
o Filter: Filtering rows based on certain conditions.
o Projection: Selecting specific columns to be returned in the result.
o Join: Joining multiple tables to retrieve related data.
o Union: Combining results from multiple queries.
Data Storage: The data in a Hive table is stored in HDFS, and the structure (schema)
is defined when creating the table.
3. Partitions
Description: Partitions are used to divide the data in a table into subsets based on the
values of one or more columns. This helps in organizing large datasets and enables
efficient querying by reducing the amount of data scanned for specific queries.
How It Works: A table can be partitioned by a column, such as date or region, and each
partition will store data corresponding to that column's value (e.g., data from January
will be stored in one partition, data from February in another).
Example: A sales table can be partitioned by the year and month columns to store data
for each year and month separately.
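A hedged HiveQL sketch of such a partitioned table (the sales table and its columns are assumed names for illustration):

CREATE TABLE IF NOT EXISTS sales (
  order_id INT,
  amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load January 2024 data into its own partition
LOAD DATA INPATH '/data/sales/2024-01.csv'
INTO TABLE sales PARTITION (year = 2024, month = 1);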
4. Buckets
Description: Buckets further divide data within each partition based on a hash of a
column in the table. This technique allows data to be split into smaller, more
manageable files within each partition.
How It Works: Data is divided into a specific number of buckets (files) by hashing a
particular column's value. Each bucket corresponds to one file stored in the partition's
directory.
Example: A customer table might be bucketed by the customer_id column, ensuring
that the data for each customer is stored in a separate bucket.
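A hedged HiveQL sketch of such a bucketed table (table and column names are assumed for illustration):

CREATE TABLE IF NOT EXISTS customer (
  customer_id INT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 16 BUCKETS
STORED AS ORC;

-- On older Hive versions, bucketing must be enforced explicitly before inserts
SET hive.enforce.bucketing = true;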
Hive Integration and Workflow Steps
Hive’s integration with Hadoop involves several key components that handle the query
execution, metadata retrieval, and job management.
1. Execute Query:
o The query is sent from the Hive interface (CLI, Web Interface, etc.) to the
Database Driver, which is responsible for initiating the execution process.
2. Get Plan:
o The Driver forwards the query to the Query Compiler. The compiler parses
the query and creates an execution plan, verifying the syntax and determining
the operations required.
3. Get Metadata:
o The Compiler requests metadata information (like table schema, column types,
etc.) from the Metastore (which can be backed by databases like MySQL or
PostgreSQL).
4. Send Metadata:
o The Metastore responds with the metadata, and the Compiler uses this
information to refine the query plan.
5. Send Plan:
o After parsing the query and receiving metadata, the Compiler sends the
finalized query execution plan back to the Driver.
6. Execute Plan:
o The Driver sends the execution plan to the Execution Engine, which is
responsible for actually running the query on the Hadoop cluster.
7. Execute Job:
o The execution engine triggers the execution of the query, which is typically
translated into a MapReduce job. This job is sent to the JobTracker (running
on the NameNode), which assigns tasks to TaskTrackers on DataNodes for
parallel processing.
8. Metadata Operations:
o During the execution, the Execution Engine may also perform metadata
operations with the Metastore, such as querying schema details or updating the
metastore.
9. Fetch Result:
o After completing the MapReduce job, the Execution Engine collects the results
from the DataNodes where the job was processed.
10. Send Results:
o The results are sent back to the Driver, which in turn forwards them to the Hive
interface for display to the user.
Hive Built-in Functions
Hive provides a wide range of built-in functions to operate on different data types, enabling
various data transformations and calculations. Here’s a breakdown of some common built-in
functions in Hive:
1. BIGINT Functions
round(double a)
o Description: Returns the rounded BIGINT (8-byte integer) value of the 8-byte
double-precision floating point number a.
o Return Type: BIGINT
o Example: round(123.456) returns 123.
floor(double a)
o Description: Returns the maximum BIGINT value that is equal to or less than
the double value.
o Return Type: BIGINT
o Example: floor(123.789) returns 123.
ceil(double a)
o Description: Returns the minimum BIGINT value that is equal to or greater
than the double value.
o Return Type: BIGINT
o Example: ceil(123.456) returns 124.
2. Random Number Generation
rand(), rand(int seed)
o Description: Returns a random number (double) that is uniformly distributed
between 0 and 1. The sequence changes with each row, and specifying a seed
ensures the random number sequence is deterministic.
o Return Type: double
o Example: rand() returns a random number like 0.456789, and rand(5) will
generate a sequence based on the seed 5.
3. String Functions
concat(string str1, string str2, ...)
o Description: Concatenates two or more strings into one.
o Return Type: string
o Example: concat('Hello ', 'World') returns 'Hello World'.
substr(string str, int start)
o Description: Returns a substring of str starting from the position start till the
end of the string.
o Return Type: string
o Example: substr('Hello World', 7) returns 'World'.
substr(string str, int start, int length)
o Description: Returns a substring of str starting from position start with the
given length.
o Return Type: string
o Example: substr('Hello World', 1, 5) returns 'Hello'.
upper(string str), ucase(string str)
o Description: Converts all characters of str to uppercase.
o Return Type: string
o Example: upper('hello') returns 'HELLO'.
lower(string str), lcase(string str)
o Description: Converts all characters of str to lowercase.
o Return Type: string
o Example: lower('HELLO') returns 'hello'.
trim(string str)
o Description: Trims spaces from both ends of the string.
o Return Type: string
o Example: trim(' Hello World ') returns 'Hello World'.
ltrim(string str)
o Description: Trims spaces from the left side of the string.
o Return Type: string
o Example: ltrim(' Hello') returns 'Hello'.
rtrim(string str)
o Description: Trims spaces from the right side of the string.
o Return Type: string
o Example: rtrim('Hello ') returns 'Hello'.
4. Date and Time Functions
year(string date)
o Description: Extracts the year part of a date or timestamp string.
o Return Type: int
o Example: year('2024-12-25') returns 2024.
month(string date)
o Description: Extracts the month part of a date or timestamp string.
o Return Type: int
o Example: month('2024-12-25') returns 12.
day(string date)
o Description: Extracts the day part of a date or timestamp string.
o Return Type: int
o Example: day('2024-12-25') returns 25.
HiveQL Features
Data Definition: Allows users to define and manage the schema of tables, databases,
etc.
Data Manipulation: Enables the manipulation of data, such as inserting, updating, or
deleting records (although with some limitations).
Query Processing: Supports querying large datasets using operations like filtering,
joining, and aggregating data.
HiveQL Process Engine
The HiveQL Process Engine translates HiveQL queries into execution plans and
communicates with the Execution Engine to run the query. It is a replacement for the
traditional approach of writing Java-based MapReduce programs.
Hive Execution Engine
The Execution Engine is the component that bridges HiveQL and MapReduce. It
processes the query and generates results in the same way that MapReduce jobs would
do. It uses a variant of MapReduce to execute HiveQL queries across a distributed
Hadoop cluster.
HiveQL Data Definition Language (DDL)
HiveQL provides several commands for defining databases and tables. These commands are
used to manage the structure of the data in Hive.
Creating a Database
To create a new database in Hive, the following command is used:
CREATE DATABASE [IF NOT EXISTS] <database_name>;
IF NOT EXISTS: Ensures that Hive does not throw an error if the database already
exists.
Example:
CREATE DATABASE IF NOT EXISTS my_database;
Show Databases
To list all the databases in Hive, use the command:
SHOW DATABASES;
Dropping a Database
To delete an existing database, use the following command:
DROP DATABASE [IF EXISTS] [RESTRICT | CASCADE] <database_name>;
IF EXISTS: Prevents an error if the database does not exist.
RESTRICT: Deletes the database only if it is empty.
CASCADE: Deletes the database along with any tables it contains.
Example:
DROP DATABASE IF EXISTS my_database CASCADE;
Creating a Table
The syntax for creating a table in Hive is:
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[<database_name>.]<table_name>
[(<column_name> <data_type> [COMMENT <column_comment>], ...)]
[COMMENT <table_comment>]
[ROW FORMAT <row_format>]
[STORED AS <file_format>];
TEMPORARY: Creates a temporary table that is only available during the session.
EXTERNAL: Specifies that the table is external, meaning Hive won't manage its data
(i.e., data is stored outside the Hive warehouse).
IF NOT EXISTS: Avoids an error if the table already exists.
COMMENT: Adds a description to the table or column.
ROW FORMAT: Specifies the format of the rows in the table (e.g., DELIMITED).
STORED AS: Specifies the file format for storing data (e.g., TEXTFILE, ORC,
PARQUET).
Example:
CREATE TABLE IF NOT EXISTS employee (
emp_id INT,
name STRING,
salary DOUBLE
)
COMMENT 'Employee table'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
HiveQL Data Manipulation Language (DML)
DML commands in Hive are used for managing and modifying the data within Hive tables.
Using a Database
To set the current database in Hive, use the USE command:
USE <database_name>;
Loading Data into a Table
To load data into a Hive table from a local or HDFS path, the LOAD DATA command is used:
LOAD DATA [LOCAL] INPATH '<file_path>' [OVERWRITE] INTO TABLE <table_name>
[PARTITION (partcol1=val1, partcol2=val2, ...)];
LOCAL: Specifies that the file is on the local filesystem.
OVERWRITE: Overwrites the existing data in the table.
PARTITION: Specifies partitioning columns if applicable.
Example:
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE employee;
Dropping a Table
To delete an existing table, the command is:
DROP TABLE [IF EXISTS] <table_name>;
IF EXISTS: Avoids an error if the table does not exist.
Example:
DROP TABLE IF EXISTS employee;
Altering a Table
You can modify the structure of a table using the ALTER TABLE command:
ALTER TABLE <table_name> ADD COLUMNS (<column_name> <data_type>
[COMMENT <column_comment>]);
Example:
ALTER TABLE employee ADD COLUMNS (department STRING);
HiveQL Querying the Data
Hive supports querying data using SQL-like syntax, with additional features for partitioning,
sorting, and aggregating data.
Basic Query
A basic query to select data from a table is:
SELECT [ALL | DISTINCT] <select_expression>, ...
FROM <table_name>
[WHERE <condition>]
[GROUP BY <column_list>]
[HAVING <condition>]
[CLUSTER BY <column_list>]
[DISTRIBUTE BY <column_list>]
[SORT BY <column_list>]
[LIMIT <number>];
ALL: Returns all rows (default).
DISTINCT: Returns only unique rows.
WHERE: Filters rows based on a condition.
GROUP BY: Groups rows based on column values.
HAVING: Filters groups based on a condition.
CLUSTER BY: Shorthand for DISTRIBUTE BY and SORT BY on the same columns; rows are
distributed to reducers by those columns and sorted within each reducer.
DISTRIBUTE BY: Distributes rows to different reducers.
SORT BY: Sorts the data within each reducer.
LIMIT: Limits the number of rows returned.
Example:
SELECT DISTINCT name, salary
FROM employee
WHERE salary > 50000
ORDER BY salary DESC;
PIG
Pig is a high-level platform built on top of Hadoop to facilitate the processing of large datasets.
It abstracts the complexities of writing MapReduce programs and provides a more user-friendly
interface for data manipulation.
Features of Apache Pig
Dataflow Language: Pig uses a dataflow language, where operations on data are linked
in a chain, and the output of one operation is the input to the next.
Simplifies MapReduce: Pig reduces the complexity of writing raw MapReduce
programs by providing a higher-level abstraction.
Parallel Processing: Pig allows the execution of tasks in parallel, which makes it
suitable for handling large datasets.
Flexible: It can process structured, semi-structured, and unstructured data.
High-level Operations: Supports complex data manipulation tasks like filtering,
joining, and aggregating large datasets.
Applications of Apache Pig
Large Dataset Analysis: Ideal for analyzing vast amounts of data in HDFS.
Ad-hoc Data Processing: Useful for quick, one-time data processing tasks.
Processing Streaming Data: It can process web logs, sensor data, or other real-time
data.
Search Platform Data Processing: Pig can be used for processing and analyzing data
related to search platforms.
Time-sensitive Data Processing: Processes and analyzes data quickly, which is
essential for applications that require fast insights.
Pig scripts are often used in combination with Hadoop for data processing at scale, making it
a powerful tool for big data analytics.
Pig Architecture
The Pig architecture is built to support flexible and scalable data processing in a Hadoop
ecosystem. It executes Pig Latin scripts via three main methods:
1. Grunt Shell: An interactive shell that executes Pig scripts in real time.
2. Script File: A file containing Pig commands that are executed on a Pig server.
3. Embedded Script: Pig Latin functions that can be written as User-Defined Functions
(UDFs) in different programming languages and embedded within Pig scripts.
Pig Execution Process
The Pig execution flow involves multiple stages to transform raw data into processed output:
1. Parser: After a script passes through Grunt or Pig Server, the parser handles syntax
checking and type validation. The output of this step is a Directed Acyclic Graph (DAG)
that represents the flow of operations.
o DAG: Nodes represent operations, and edges indicate data flows between them.
This structure ensures that each node only handles one set of inputs at a time,
making the process acyclic.
2. Optimizer: After generating the DAG, the optimizer reduces data at various stages to
optimize performance. Some of the optimizations include:
o PushUpFilter: Splits and pushes filters up in the execution plan to reduce the
dataset early in the pipeline.
o PushDownForEachFlatten: Delays the flatten operation to minimize the data
set in the pipeline.
o ColumnPruner: Removes unused columns as early as possible.
o MapKeyPruner: Discards unused map keys.
o Limit Optimizer: Pushes the LIMIT operation as early as possible to avoid
unnecessary computation.
3. Compiler: After optimization, the Pig scripts are converted into a series of MapReduce
jobs, which are compiled into code that will be executed on the Hadoop cluster.
4. Execution Engine: The execution engine takes the MapReduce jobs and executes them
on the Hadoop cluster, generating the final output.
Pig Grunt Shell
The Grunt shell is primarily used for writing and executing Pig Latin scripts. You can also
invoke shell commands such as sh and ls. For instance:
To execute shell commands: grunt> sh shell_command_parameters
To list files in the Grunt shell: grunt> sh ls
Pig Latin Data Model
Pig Latin supports both primitive (atomic) and complex data types, making it versatile for
handling various data structures.
Primitive Data Types:
int: 32-bit signed integer (e.g., 10)
long: 64-bit signed integer (e.g., 101)
float: 32-bit floating point (e.g., 22.7F)
double: 64-bit floating point (e.g., 3.4)
chararray: Character array (e.g., 'hello')
bytearray: Binary data (e.g., ffoo)
Complex Data Types:
bag: Collection of tuples (e.g., {(1,1), (2,4)})
tuple: Ordered set of fields (e.g., (1, 1))
map: Set of key-value pairs (e.g., ['key1'#1])
Pig Latin Constructs
Pig Latin scripts are built using a variety of operations that handle data input, output, and
transformations. A typical Pig Latin script includes the following:
1. Schemas and Expressions: Defines how data is structured and what operations will be
performed on it.
2. Commands:
o LOAD: Reads data from the file system.
o DUMP: Displays the result.
o STORE: Stores the processed result into the file system.
3. Comments:
o Single-line comments start with --.
o Multiline comments are enclosed in /* */.
4. Case Sensitivity:
o Keywords (like LOAD, STORE, DUMP) are not case-sensitive.
o Function names, relations, and paths are case-sensitive.
Pig Latin Script Execution Modes
1. Interactive Mode: This mode uses the Grunt shell. It allows you to write and execute
Pig Latin scripts interactively, making it ideal for quick testing and debugging.
2. Batch Mode: In this mode, you write the Pig Latin script in a single file with a .pig
extension. The script is then executed as a batch process.
3. Embedded Mode: This mode involves defining User-Defined Functions (UDFs) in
programming languages such as Java, and using them in Pig scripts. It allows for more
advanced functionality beyond the built-in operations of Pig.
Pig Commands
To get a list of Pig commands: pig -help
To check the version of Pig: pig -version
To start the Grunt shell: pig
Load Command
The LOAD command in Pig is used to load data into the system from various data sources.
Here's how it works:
Loading data from HBase:
book = LOAD 'MyBook' USING HBaseStorage();
Loading data from a CSV file using PigStorage, with a comma as a separator:
book = LOAD 'PigDemo/Data/Input/myBook.csv' USING PigStorage(',');
Specifying a schema while loading data: You can define a schema for the loaded data,
which helps in interpreting each field of the record.
book = LOAD 'MyBook' AS (name:chararray, author:chararray, edition:int,
publisher:chararray);
Store Command
The STORE command writes the processed data to a storage location, typically HDFS. It can
store data in various formats.
Default storage in HDFS (tab-delimited format):
STORE processed INTO '/PigDemo/Data/Output/Processed';
Storing data in HBase:
STORE processed INTO 'processed' USING HBaseStorage();
Storing data as comma-separated text:
STORE processed INTO 'processed' USING PigStorage(',');
Dump Command
The DUMP command is useful for displaying the processed data directly on the screen. It’s
often used during debugging or prototyping to quickly inspect the results.
Displaying processed data:
DUMP processed;
Relational Operations in Pig Latin
Pig Latin provides several relational operations that allow you to transform and manipulate
data. These operations are used to sort, group, join, project, and filter data. Some of the basic
relational operators include:
1. FOREACH: This operation applies transformations to the data based on columns and
is often used to project data (i.e., select specific columns). It is the projection operator
in Pig Latin.
Example:
result = FOREACH data GENERATE field1, field2;
The FOREACH operation is extremely powerful and can also be used for applying functions
or expressions to each field in the dataset.
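As a hedged end-to-end sketch that combines LOAD, FOREACH, GROUP, and STORE (the file paths and field names are assumed for illustration):

-- Word count in Pig Latin
lines = LOAD '/PigDemo/Data/Input/lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/PigDemo/Data/Output/WordCount' USING PigStorage(',');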
Summary of Key Commands
LOAD: Used for loading data from an external source into Pig.
STORE: Writes processed data to an external location.
DUMP: Displays the processed data on the screen for inspection.
FOREACH: Allows applying transformations and projections to data.
These commands form the foundation of writing and executing Pig scripts, enabling you to
process and analyze large datasets in a Hadoop environment.
-------------------------------------------END OF MODULE 4-----------------------------------------