Relational Algebra Operations in MapReduce
Before taking a brief overview of relational algebra we need to know what a relation represents.
Since most of us are already familiar with SQL, a long description isn't needed here: a relation
represents a database table. A major point that distinguishes relational algebra from SQL is that
in relational algebra duplicate rows are implicitly eliminated, which is not the case with SQL
implementations.
Selection: Selection (the WHERE clause in SQL) lets you apply a condition over the data you have
and get back only the rows that satisfy the condition.
Projection: To keep only some of the columns we use the projection operator. It's analogous to
the column list of a SELECT in SQL.
Projection operation selecting only the Name and IsActive columns
Union: We concatenate two tables vertically. Similar to UNION in SQL, but the duplicate
rows are removed implicitly. The point to note in the output table below is that (Smith, 16)
was a duplicate row, so it appears only once in the output, whereas (Tom, 17) and (Tom, 19)
both appear, as those are not identical rows.
Intersection: Same as INTERSECT in SQL. It intersects two tables and selects only the
common rows.
Intersection of two tables
Difference: The rows that are in the first table but not in the second are selected for output.
Keep in mind that (Monty, 21) is not considered in the output, as it is present in the second
table but not in the first.
Natural Join: Merges two tables based on a common column. It corresponds to an INNER JOIN in
SQL, except that the join condition is implicit: equality on the column the two tables share.
The output only contains rows for which the values in the common column match. One output row
is generated every time the column value matches across the two tables, as is the case below
for Tom. If there were multiple Tom values in the top table in the image below, then four rows
would have been created in the output table, representing all the combinations.
Grouping and Aggregation: Group rows based on some set of columns and apply an aggregation
(sum, count, max, min, etc.) on some column within each of the groups that are formed.
This corresponds to GROUP BY in SQL.
Group by Name and take the sum of Winning
In a distributed storage system the entire table isn't stored on a single node (computer); the
most relevant reason is that it doesn't fit completely on a single system because of its large
size. So to store a table, it is partitioned into small files which are distributed across the
nodes available in the system. For why some abstraction of partitions is a must for distributed
storage, the HDFS docs are a good place to go.
Let's set up a simple abstraction of how data is stored in the system: think of tables as being
stored as small CSV files which, if concatenated, represent the full table.
Hash Function
You can check the link in the resources section for how people compute hashes of strings. In
this case we just want a hash function that distributes the work equally among the reduce
workers. Even a high number of collisions is fine, as we are not using the hash function to
build a map that allows fast lookups; we just want the data partitioned into n buckets, where
n is the number of reduce workers, while making sure data for the same key goes to the same
reduce worker across all worker nodes.
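As an illustration, here is a minimal Python sketch of such a partitioner; the function name and the use of MD5 are my own choices, not something prescribed above, and any stable hash of the key would do:

import hashlib

def partition(key, num_reducers):
    # Stable hash of the key's string form, reduced modulo the number
    # of reduce workers. Collisions are fine: we only need every
    # occurrence of the same key to land in the same bucket.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

# The same key maps to the same bucket no matter which map worker
# computes it, because the hash does not depend on the process.
print(partition(("Tom", 17), 2))
print(partition(("Tom", 17), 2))  # always the same bucket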
Communication Cost
As is the case with all systems, the performance of a system is measured on the basis of its
least efficient component. In a MapReduce system that component is the network, and the cost
associated with it is termed the communication cost. The communication cost of a task is the
number of rows input to that task, where a task is either a map task or a reduce task. We use
the number of rows as a measure of size because the data is mostly tabular with a predefined
schema, so the size of each row is almost the same and the amount of data sent over the network
is proportional to the number of rows. The sum of the communication costs of all the tasks is
the communication cost of the operation.
We don't count a task's output toward this cost, as the output becomes the input of the next
task and would otherwise end up being counted twice.
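A toy calculation makes the rule concrete; the task sizes below are made up purely for illustration:

# Hypothetical job: 4 map tasks reading 1000 rows each; their output
# (400 rows in total) becomes the input of 2 reduce tasks.
map_task_inputs = [1000, 1000, 1000, 1000]
reduce_task_inputs = [250, 150]

# Each task contributes its input size; outputs are not counted,
# since the map output is exactly the reduce input and would
# otherwise be counted twice.
total_cost = sum(map_task_inputs) + sum(reduce_task_inputs)
print(total_cost)  # 4400 rows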
Selection Using Map Reduce
To perform selections using MapReduce we need the following Map and Reduce functions:
Map Function: For each row r in the table, apply the condition and produce the key-value pair
(r, r) if the condition is satisfied, else produce nothing, i.e. the key and value are the same row.
Reduce Function: The reduce function has nothing to do in this case. It simply writes the value
for each key it receives to the output.
For our example we will do Selection(B ≤ 3): select all the rows where the value of B is less
than or equal to 3.
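Here is a minimal Python sketch of this pair of functions for Selection(B ≤ 3), assuming each row is an (A, B) tuple and that the framework handles the grouping and shuffling between the two phases (function names are illustrative):

def map_selection(row):
    # Emit (row, row) only when the condition B <= 3 holds.
    if row[1] <= 3:
        yield (row, row)

def reduce_selection(key, values):
    # Nothing to compute: pass every value straight to the output.
    for value in values:
        yield value

# Example: only rows with B <= 3 survive the map phase.
for row in [(1, 2), (2, 5), (3, 3)]:
    for pair in map_selection(row):
        print(pair)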
Let's consider the data initially distributed as files across the map workers, looking like the
following figure:
Initial data distributed in files across map workers representing a single table
After applying the map function (and grouping; there are no common keys in this case, as each
row is unique) we get the following output. The tuples are constructed with the 0th index
containing values from column A and the 1st index containing values from column B. In actual
implementations this information can be sent either as extra metadata or within each value
itself, making keys and values look something like ({A: 1}, {B: 2}), which does look somewhat
inefficient.
Data after applying the map function, which kept only rows with a B value less than or equal to 3
After this, based on the number of reduce workers (2 in our case), a hash function is applied as
explained in the Hash Function section. The files for the reduce workers on the map workers will
look like:
Files for reduce workers created at map worker based on hash function
After this step the files for reduce worker 1 are sent to reduce worker 1 and those for reduce
worker 2 are sent to reduce worker 2. The data at the reduce workers will look like:
Data at reduce workers sent from map workers
The final output, after applying the reduce function which ignores the keys and just considers
the values, will look like:
Output of selection(B ≤ 3)
The point to note here is that we don't really need to shuffle data across the nodes. We can
just execute the map function and save the values to the output from the map workers themselves.
This makes selection an efficient operation compared to the operations where the reduce function
actually does something.
Projection Using Map Reduce
Map Function: For each row r in the table produce a key-value pair (r', r'), where r' only
contains the columns wanted in the projection.
Reduce Function: The reduce function gets outputs in the form r': [r', r', r', r', ...], since
after removing some columns the output may contain duplicate rows. It just takes the value at
the 0th index, getting rid of the duplicates. (Note that this deduplication is done because we
are implementing the operations to produce the outputs that relational algebra prescribes.)
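A minimal sketch of projection(A, B) over rows of the form (A, B, C), under the same assumptions as the selection sketch:

def map_projection(row):
    # Keep only columns A and B; the trimmed row is both key and value,
    # so identical trimmed rows end up grouped under one key.
    trimmed = (row[0], row[1])
    yield (trimmed, trimmed)

def reduce_projection(key, values):
    # All values for a key are identical copies of the trimmed row;
    # emitting only the first one eliminates the duplicates.
    yield values[0]

# Example: (1, 2, 7) and (1, 2, 9) both trim to (1, 2), so the reduce
# function receives two copies and emits just one.
print(list(reduce_projection((1, 2), [(1, 2), (1, 2)])))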
After applying the map function (which drops the values in column C) and grouping the keys, the
data will look like:
The keys will then be partitioned using a hash function, as was the case in selection. The data
will look like:
At the reduce node the keys are aggregated again, as the same key might have occurred at
multiple map workers. As we already know, the reduce function operates on the values of each
key only once.
The reduce function is then applied; it considers only the first value of the values list and
ignores the rest of the information.
Output of projection(A, B)
The point to remember is that here the reduce function is required for duplicate elimination.
If that's not needed (as is the case in SQL), we can get rid of the reduce operation, meaning we
don't have to move data around. So this operation, too, can be implemented without actually
passing data around.
Both selection and projection are operations applied on a single table, whereas union,
intersection and difference are operations applied on two (or more) tables. Let's assume that
the schemas of the two tables are the same and that their columns are ordered in the same order.
Union Using Map Reduce
Map Function: For each row r produce the key-value pair (r, r), i.e. the map function of
selection with no condition applied.
Reduce Function: With each key there can be one or two values (as we don't have duplicate rows
within a table); in either case just output the first value, exactly like the reduce function
of projection.
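A minimal sketch of the two union functions, again assuming the framework does the grouping in between:

def map_union(row):
    # No condition: every row of either table is emitted as (r, r).
    yield (row, row)

def reduce_union(key, values):
    # One or two identical values arrive per key (a row may exist in
    # both tables); emitting the first keeps exactly one copy.
    yield values[0]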
Let's see how this works using an example. Here yellow represents one table and green represents
the other, stored across two map workers.
Initial data at map workers
After applying the map function and grouping the keys we get the following output:
The final output, after applying the reduce function which takes only the first value and
ignores everything else, is as follows:
Note that, as with projection, this can be done without moving data around if we are not
interested in removing duplicates. Hence this operation is also efficient in terms of data
shuffled across machines.
Intersection Using Map Reduce
Map Function: For each row r generate the key-value pair (r, r) (same as union).
Reduce Function: With each key there can be one or two values (as we don't have duplicate rows);
if the list has length 2 we output the first value, else we output nothing.
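The sketch differs from the union one only in the length check inside the reduce function:

def map_intersection(row):
    # Same map function as union: emit (r, r) for every row.
    yield (row, row)

def reduce_intersection(key, values):
    # Two values for a key means the row appeared in both tables.
    if len(values) == 2:
        yield values[0]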
As the map function is the same as union's and we are considering the same data, let's skip to
the point just before the reduce function is applied.
Now we apply the reduce operation, which outputs a row only if its values list has a length of 2.
Output of intersection
Difference Using Map Reduce
Let's again consider the same data. The difficulty with difference arises from the fact that we
want to output a row only if it exists in the first table but not in the second. So the reduce
function needs to keep track of which tuple belongs to which relation. To make that easier to
visualize, we will colour the rows that come from the 2nd table green, the rows that come from
the 1st table yellow, and the rows that come from both tables purple.
Map Function: For each row r create the key-value pair (r, T1) if the row is from table 1, else
produce the key-value pair (r, T2).
Reduce Function: Output the row if and only if the only value in the list is T1; otherwise
output nothing.
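A minimal sketch, assuming the framework tells the map function which table a row belongs to (passed in here as a table_tag argument purely for illustration):

def map_difference(row, table_tag):
    # Tag each row with the table it came from: "T1" or "T2".
    yield (row, table_tag)

def reduce_difference(key, values):
    # Keep the row only if it came exclusively from table 1.
    if values == ["T1"]:
        yield key

# Example: a row present in both tables is dropped.
print(list(reduce_difference(("Tom", 17), ["T1"])))          # kept
print(list(reduce_difference(("Smith", 16), ["T1", "T2"])))  # dropped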
Initial Data
After applying the map function and grouping the keys, the data looks like the following figure:
Data after applying map function and grouping keys
After applying the map function, files for the reduce workers are created based on hashing the
keys, as has been the case so far.
After aggregation of the keys at the reduce workers the data looks like:
The final output is generated by applying the reduce function to this data.
Output of difference of the tables
For the difference operation we notice that we cannot get rid of the reduce part, and hence have
to send data across the workers, as the context of which table a value came from is needed.
Hence this is a more expensive operation compared to selection, projection, union and
intersection.
Grouping and Aggregation Using Map Reduce
Usually, understanding grouping and aggregation takes a bit of time when learning SQL, but not
when we understand these operations through MapReduce. The logic is already there in how the
map workers operate: they implicitly group the keys, and the reduce function acts upon the
aggregated values to generate the output.
Map Function: For each row in the table, make the attributes on which the grouping is to be done
the key, and make the value the attribute on which the aggregation is to be performed. For
example, if a relation has 4 columns A, B, C, D and we want to group by A, B and aggregate over
C, we make (A, B) the key and C the value.
Reduce Function: Apply the aggregation operation (sum, max, min, avg, ...) on the list of values
and output the result.
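A minimal sketch for rows of the form (A, B, C, D), grouping by (A, B) and summing C, under the same assumptions as the earlier sketches:

def map_group(row):
    # Key is the grouping columns (A, B); value is the aggregated
    # column C; column D is simply dropped.
    a, b, c, d = row
    yield ((a, b), c)

def reduce_group(key, values):
    # Apply the aggregation (sum here) to all values of a group.
    yield (key, sum(values))

# Example: two rows with the same (A, B) fold into one output row.
print(list(reduce_group((1, 2), [10, 5])))  # [((1, 2), 15)]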
For our example let's group by (A, B) and apply sum as the aggregation.
After applying the map function and grouping the keys, the data has (A, B) as the key and C as
the value; D is discarded as if it didn't exist.
The data is aggregated based on keys before applying the aggregation function (sum in this case).
Aggregated data based on keys
After applying the sum over the value lists we get the final output:
Here also, like the difference operation, we can't get rid of the reduce stage. The table
context isn't needed here, but the aggregation function makes it necessary for all the values
of a single key to be in one place. This operation is therefore also inefficient compared to
selection, projection, union, and intersection. The column that is in neither the aggregation
nor the grouping clause is ignored and isn't required, so if the data is stored in a columnar
format we can save the cost of loading a lot of data. As usually only a few columns are involved
in grouping and aggregation, this saves a lot of cost, both in terms of the data sent over the
network and the data that needs to be loaded into main memory for execution.
Natural Join Using Map Reduce
The natural join keeps the rows whose values match in the column common to both tables. To
perform a natural join we have to keep track of which table each value came from. If values for
the same key come from different tables, we need to form pairs of those values, along with the
key, to get a single row of the output. A join can explode the number of rows, as we have to
form each and every possible combination of the values from the two tables.
Map Function: For two relations Table 1(A, B) and Table 2(B, C), the map function creates
key-value pairs of the form b: [(T1, a)] for table 1, where T1 records the fact that the value
a came from table 1; for table 2 the key-value pairs are of the form b: [(T2, c)].
Reduce Function: For a given key b, construct all possible combinations of the values where one
value comes from table T1 and the other from table T2. The output consists of key-value pairs
of the form b: [(a, c)], each of which represents one row (a, b, c) of the output table.
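A minimal Python sketch of this pair of functions, assuming rows of Table 1 are (a, b) tuples and rows of Table 2 are (b, c) tuples, with the shuffle handled by the framework (all names here are illustrative):

def map_join_table1(row):
    # Table 1 has schema (A, B); key on B, tag the value with "T1".
    a, b = row
    yield (b, ("T1", a))

def map_join_table2(row):
    # Table 2 has schema (B, C); key on B, tag the value with "T2".
    b, c = row
    yield (b, ("T2", c))

def reduce_join(key, values):
    # Pair every table-1 value with every table-2 value for this key;
    # a key seen in only one table produces no output rows.
    left = [v for tag, v in values if tag == "T1"]
    right = [v for tag, v in values if tag == "T2"]
    for a in left:
        for c in right:
            yield (a, key, c)

# Example: one match on B = 2 yields the single output row (1, 2, 9).
print(list(reduce_join(2, [("T1", 1), ("T2", 9)])))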
As an example, let's consider joining Table 1 and Table 2, where B is the common column.
Data at map workers after applying map function and grouping the keys
As has been the case so far, files for the reduce workers are created at the map workers.
After applying the reduce function, a row is created by taking one value from table T1 and the
other from table T2. If the values list contains only values from T1 or only values from T2,
that key doesn't contribute a row to the output.
Output of the join
As we need to keep the context of which table a value came from, we can't avoid sending data
across the workers for the reduce task, so this operation is also costly compared to the others
we have discussed so far. The fact that for each list of values we need to create all the pairs
is also a major factor in the computation cost associated with this operation.