
HINDUSTHAN

INSTITUTE OF TECHNOLOGY
(An Autonomous Institution)
(Approved By AICTE, New Delhi, Affiliated To Anna University, Chennai, Accredited by NBA & NAAC with 'A' Grade)

COIMBATORE – 641 032

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

22CS520 – BIG DATA ANALYTICS LABORATORY

Name of the Student: ……………….……………………...

Register Number: ……………….……….….….…..………

Branch: ……………………………………………………..

Year/Semester: ……………………………………………….

HINDUSTHAN
INSTITUTE OF TECHNOLOGY
(An Autonomous Institution)
(Approved By AICTE, New Delhi, Affiliated To Anna University, Chennai, Accredited by NBA & NAAC with 'A' Grade)
Coimbatore – 641 032

Certified that this is the bonafide record of work done by


………..……………………………………………….…………..…in the 22CS520 – BIG DATA
ANALYTICS LABORATORY of this Institution, as prescribed by the Anna University, Chennai
for the Sixth Semester, Department of Computer Science and Engineering during the academic
year 2024 – 2025.

Place: Coimbatore
Date:

STAFF IN-CHARGE    HEAD OF THE DEPARTMENT

University Register Number......................................................................................................................

Submitted for the practical Examination conducted on …………………………………………………

INTERNAL EXAMINER EXTERNAL EXAMINER

INDEX

S.NO.  DATE  EXPERIMENT                                                                              PAGE NO.  MARKS  SIGNATURE

1.   INSTALL APACHE HADOOP
2.   MAPREDUCE PROGRAM TO CALCULATE THE FREQUENCY
3.   MAPREDUCE PROGRAM TO FIND THE MAXIMUM TEMPERATURE IN EACH YEAR
4.   MAPREDUCE PROGRAM TO FIND THE GRADES OF STUDENTS
5.   MAPREDUCE PROGRAM TO IMPLEMENT MATRIX MULTIPLICATION
6.   MAPREDUCE TO FIND THE MAXIMUM ELECTRICAL CONSUMPTION IN EACH YEAR
7.   MAPREDUCE TO ANALYZE WEATHER DATA SET AND PRINT WHETHER THE DAY IS SHINY OR COOL
8.   MAPREDUCE PROGRAM TO FIND THE NUMBER OF PRODUCTS SOLD IN EACH COUNTRY
9.   MAPREDUCE PROGRAM TO FIND THE TAGS ASSOCIATED WITH EACH MOVIE BY ANALYZING MOVIELENS DATA
10.  XYZ.COM IS AN ONLINE MUSIC WEBSITE WHERE USERS LISTEN TO VARIOUS TRACKS
11.  MAPREDUCE PROGRAM TO FIND THE FREQUENCY OF BOOKS PUBLISHED EACH YEAR
12.  MAPREDUCE PROGRAM TO ANALYZE TITANIC SHIP DATA AND TO FIND THE AVERAGE AGE OF THE PEOPLE
13.  MAPREDUCE PROGRAM TO ANALYZE UBER DATA SET
14.  PYTHON APPLICATION TO FIND THE MAXIMUM TEMPERATURE USING SPARK

ADDITIONAL EXPERIMENTS
1.   HIVE OPERATIONS
2.   PIG LATIN MODES, PROGRAMS

Average
Exno:1
INSTALL APACHE HADOOP
Date:

AIM:

To Install Apache Hadoop.

Hadoop software can be installed in three modes: stand-alone, pseudo-distributed, and fully distributed.

Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open-source
project in the big data field and is sponsored by the Apache Software Foundation.

Hadoop 2.7.3 comprises four main layers:

 Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
 HDFS, which stands for Hadoop Distributed File System, is responsible for persisting data
to disk.
 YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
 MapReduce is the original processing model for Hadoop clusters. It distributes work within
the cluster or map, then organizes and reduces the results from the nodes into a response to a
query. Many other processing models are available for the 2.x version of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.

Procedure:

We'll install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.

Prerequisites:

Step 1: Installing Java 8

Check the installed Java version:

java -version

openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

This output verifies that OpenJDK has been successfully installed.
Note: Set the JAVA_HOME environment variable to point to the JDK installation path.

Step 2: Installing Hadoop

With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent
stable release and follow the binary link for the current release:

Download Hadoop from www.hadoop.apache.org

Procedure to Run Hadoop

1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS

If Apache Hadoop 2.2.0 is not already installed then follow the post Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS.

2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)

Run the following commands.

Command Prompt
C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons

Namenode, Datanode, Resource Manager and Node Manager will be started in a few minutes,
and the Single Node (pseudo-distributed mode) cluster will be ready to execute Hadoop
MapReduce jobs.

Resource Manager & Node Manager:

Run wordcount MapReduce job

Now we'll run the wordcount MapReduce job available in

%HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar

Create a text file with some content. We'll pass this file as input
to the wordcount MapReduce job for counting words: C:\file1.txt

Install Hadoop

Run Hadoop Wordcount Mapreduce Example

Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be
used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input

Copy the text file(say 'file1.txt') from local disk to the newly created 'input' directory in HDFS.

C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input

Check content of the copied file.

C:\hadoop>hdfs dfs -ls input


Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup 55 2014-02-03 13:19 input/file1.txt

C:\hadoop>bin\hdfs dfs -cat input/file1.txt


Install Hadoop
Run Hadoop Wordcount Mapreduce Example

Run the wordcount MapReduce job provided


in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar

C:\hadoop>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar


wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1391412385921_0002
14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application
application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:04 INFO mapreduce.Job: The url to track the job:
http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job: job_1391412385921_0002
14/02/03 13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running in uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%
14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002 completed
successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=89
FILE: Number of bytes written=160142
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0

HDFS: Number of bytes read=171


HDFS: Number of bytes written=59
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5657
Total time spent by all reduces in occupied slots (ms)=6128
Map-Reduce Framework
Map input records=2
Map output records=7
Map output bytes=82
Map output materialized bytes=89
Input split bytes=116
Combine input records=7
Combine output records=6
Reduce input groups=6
Reduce shuffle bytes=89
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps=1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=145
CPU time spent (ms)=1418
Physical memory (bytes) snapshot=368246784
Virtual memory (bytes) snapshot=513716224
Total committed heap usage (bytes)=307757056
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=55
File Output Format Counters

Bytes Written=59
http://abhijitg:8088/cluster

RESULT:

We've installed Hadoop in stand-alone mode and verified it by running an example program
it provides.

Exno:2
MAPREDUCE PROGRAM TO CALCULATE THE FREQUENCY
Date:

AIM:
To Develop a MapReduce program to calculate the frequency of a given word in a given file.

Map Function – It takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (Key-Value pairs).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Converts into another set of data (Key, Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input
Set of tuples (output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1),
(bus,1),(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output
Converts into a smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Work Flow of Program

Workflow of MapReduce consists of 5 steps:
1. Splitting – The splitting parameter can be anything, e.g. splitting by space,
comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to
group them in the "Reduce Phase", records with the same KEY should be on the same cluster.
4. Reduce – essentially a group-by phase.
5. Combining – The last phase, where all the data (individual result sets from each
cluster) is combined together to form a result.

PROGRAM:-

def mapper(text):
words = text.lower().split()
return [(word.strip(".,!?;\"()"), 1) for word in words]

# Define the input text

text = "This is a sample text. It contains some words, and we want to count them!"
mapped = mapper(text)
print("Mapped Output:\n", mapped)

# Step 2: Group by word (Shuffle step)

from collections import defaultdict


def shuffle(mapped_data):
grouped = defaultdict(list)
for word, count in mapped_data:
grouped[word].append(count)
return grouped
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))

# Step 3: Reduce step (word count aggregation)

def reducer(grouped_data):
reduced = {word: sum(counts) for word, counts in grouped_data.items()}
return reduced
reduced = reducer(shuffled)
print("\nReduced Output (Final Word Count):\n", reduced)

OUTPUT:

Mapped Output:
[('this', 1), ('is', 1), ('a', 1), ('sample', 1), ('text', 1), ('it', 1), ('contains', 1), ('some', 1), ('words', 1), ('and', 1), ('we', 1),
('want', 1), ('to', 1), ('count', 1), ('them', 1)]

Shuffled Output:
{'this': [1], 'is': [1], 'a': [1], 'sample': [1], 'text': [1], 'it': [1], 'contains': [1], 'some': [1], 'words': [1], 'and': [1], 'we': [1],
'want': [1], 'to': [1], 'count': [1], 'them': [1]}

Reduced Output (Final Word Count):


{'this': 1, 'is': 1, 'a': 1, 'sample': 1, 'text': 1, 'it': 1, 'contains': 1, 'some': 1, 'words': 1, 'and': 1, 'we': 1, 'want': 1, 'to': 1,
'count': 1, 'them': 1}

RESULT:

Thus the above program to find the count of the given words has been executed and verified successfully.

Exno:3
MAPREDUCE PROGRAM TO FIND THE MAXIMUM TEMPERATURE IN EACH YEAR
Date:

AIM:
To Develop a MapReduce program to find the maximum temperature in each year.
Description:
MapReduce is a programming model designed for processing large volumes of data in
parallel by dividing the work into a set of independent tasks. The previous exercise gave
an introduction to MapReduce; this exercise explains how to design a MapReduce program.

PROGRAM:

data = [
"2020-01-01 30",
"2020-05-12 45",
"2020-12-30 10",
"2021-01-15 20",
"2021-06-18 50",
"2021-09-20 48",
"2022-02-11 25",
"2022-07-04 39",
"2022-11-22 41"
]

from collections import defaultdict

# Step 1: Mapper - Extract year and temperature

def mapper(data):
mapped = []
for record in data:
date_str, temp_str = record.split()
year = date_str.split("-")[0]
temperature = int(temp_str)
mapped.append((year, temperature))
return mapped

# Step 2: Shuffle - Group temperatures by year

def shuffle(mapped_data):
grouped = defaultdict(list)
for year, temp in mapped_data:
grouped[year].append(temp)
return grouped

# Step 3: Reducer - Find max temperature per year
def reducer(grouped_data):
reduced = {year: max(temps) for year, temps in grouped_data.items()}
return reduced

# Execute MapReduce
mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)
shuffled_data = shuffle(mapped_data)
print("\nShuffled Output:\n", dict(shuffled_data))
reduced_data = reducer(shuffled_data)
print("\nReduced Output (Max Temperature per Year):\n", reduced_data)

Output:

Mapped Output:
[('2020', 30), ('2020', 45), ('2020', 10), ('2021', 20), ('2021', 50), ('2021', 48), ('2022', 25), ('2022', 39), ('2022', 41)]

Shuffled Output:
{'2020': [30, 45, 10], '2021': [20, 50, 48], '2022': [25, 39, 41]}

Reduced Output (Max Temperature per Year):


{'2020': 45, '2021': 50, '2022': 41}

RESULT:

Thus the above program to find the maximum temperature recorded in each year with the help of the given data has been
executed and verified successfully.

Exno:4
MAPREDUCE PROGRAM TO FIND THE GRADES OF STUDENTS
Date:

AIM:
To Develop a MapReduce program to find the grades of students.

Program:

# Step 1: Mapper – Convert input to (Student, Score) pairs
def mapper(data):
    mapped = []
    for line in data:
        name, score = line.split()
        mapped.append((name, int(score)))
    return mapped

# Step 2: Shuffle – Not needed here since we map directly per student

# Step 3: Reducer – Assign grades based on score
def grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

def reducer(mapped_data):
    reduced = {name: grade(score) for name, score in mapped_data}
    return reduced

# Run the simulation
# Define the data within the same scope
data = [
    "Alice 95",
    "Bob 67",
    "Charlie 88",
    "David 73",
    "Eva 54",
    "Frank 100",
    "Grace 82"
]

mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)

reduced_data = reducer(mapped_data)
print("\nReduced Output (Grades):\n", reduced_data)

EXPECTED OUTPUT:

Score Range | Grade


90–100 | A
80–89 | B
70–79 | C
60–69 | D
< 60 | F

OUTPUT:

Mapped Output:
[('Alice', 95), ('Bob', 67), ('Charlie', 88), ('David', 73), ('Eva', 54), ('Frank', 100), ('Grace', 82)]

Reduced Output (Grades):


{'Alice': 'A', 'Bob': 'D', 'Charlie': 'B', 'David': 'C', 'Eva': 'F', 'Frank': 'A', 'Grace': 'B'}

RESULT:

Thus the above program to find the grades of students with the given data has been executed
and verified successfully.

Exno:5
MAPREDUCE PROGRAM TO IMPLEMENT MATRIX MULTIPLICATION
Date:

AIM:

To Develop a MapReduce program to implement Matrix Multiplication.


In mathematics, matrix multiplication or the matrix product is a binary operation that
produces a matrix from two matrices. The definition is motivated by linear equations and
linear transformations on vectors, which have numerous applications in applied
mathematics, physics, and engineering. In more detail, if A is an n × m matrix and B is an m
× p matrix, their matrix product AB is an n × p matrix, in which the m entries across a row
of A are multiplied with the m entries down a column of B and summed to produce an entry
of AB. When two linear transformations are represented by matrices, then the matrix
product represents the composition of the two transformations.

ALGORITHM for Map Function:

a. For each element mij of M, produce (key, value) pairs as ((i,k), (M, j, mij)), for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element njk of N, produce (key, value) pairs as ((i,k), (N, j, njk)), for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs, so that each key (i,k) has a list with values
(M, j, mij) and (N, j, njk) for all possible values of j.

Algorithm for Reduce Function:

d. For each key (i,k):
e. Sort values beginning with M by j into listM, and values beginning with N by j into listN; multiply mij
and njk for the jth value of each list.
f. Sum up mij x njk and return ((i,k), Σj mij x njk).

PROGRAM:

# Input matrices and expected product, shown for reference:
# A = [[1, 2, 3], [4, 5, 6]]
# B = [[7, 8], [9, 10], [11, 12]]
# C = A x B = [[58, 64], [139, 154]]

from collections import defaultdict

# Input matrices
A=[
[1, 2, 3],
[4, 5, 6]
]

B=[
[7, 8],
[9, 10],
[11, 12]
]

# Dimensions
m, n = len(A), len(A[0])
n2, p = len(B), len(B[0])

assert n == n2, "Inner matrix dimensions must match for multiplication."

# Step 1: Mapper
def mapper(A, B):
mapped = []

# For each element in A, emit for matching B[j][k]
for i in range(m):
for k in range(n):
for j in range(p):
mapped.append(((i, j), ('A', k, A[i][k])))

# For each element in B, emit for matching A[i][k]


for k in range(n):
for j in range(p):
for i in range(m):
mapped.append(((i, j), ('B', k, B[k][j])))

return mapped

# Step 2: Shuffle
def shuffle(mapped_data):
grouped = defaultdict(list)
for key, value in mapped_data:
grouped[key].append(value)
return grouped

# Step 3: Reducer
def reducer(grouped_data):
result = defaultdict(int)

for (i, j), values in grouped_data.items():


a_dict = {}
b_dict = {}

# Separate A and B values by index k


for tag, k, val in values:
if tag == 'A':
a_dict[k] = val
elif tag == 'B':
b_dict[k] = val

# Multiply and accumulate


total = 0
for k in range(n):
total += a_dict.get(k, 0) * b_dict.get(k, 0)
result[(i, j)] = total

return result

# Execute MapReduce
mapped = mapper(A, B)
print("Mapped Data:\n", mapped)

shuffled = shuffle(mapped)
print("\nShuffled Data:\n", dict(shuffled))

reduced = reducer(shuffled)

# Convert to matrix format


C = [[reduced[(i, j)] for j in range(p)] for i in range(m)]
print("\nResultant Matrix C (A x B):")
for row in C:
print(row)

EXPECTED OUTPUT:

Mapped Data:
[((0, 0), ('A', 0, 1)), ((0, 1), ('A', 0, 1)), ((0, 0), ('A', 1, 2)), ((0, 1), ('A', 1, 2)), ((0, 0), ('A', 2, 3)), ((0, 1), ('A',
2, 3)), ((1, 0), ('A', 0, 4)), ((1, 1), ('A', 0, 4)), ((1, 0), ('A', 1, 5)), ((1, 1), ('A', 1, 5)), ((1, 0), ('A', 2, 6)), ((1, 1),
('A', 2, 6)), ((0, 0), ('B', 0, 7)), ((1, 0), ('B', 0, 7)), ((0, 1), ('B', 0, 8)), ((1, 1), ('B', 0, 8)), ((0, 0), ('B', 1, 9)),
((1, 0), ('B', 1, 9)), ((0, 1), ('B', 1, 10)), ((1, 1), ('B', 1, 10)), ((0, 0), ('B', 2, 11)), ((1, 0), ('B', 2, 11)), ((0, 1),
('B', 2, 12)), ((1, 1), ('B', 2, 12))]

Shuffled Data:
{(0, 0): [('A', 0, 1), ('A', 1, 2), ('A', 2, 3), ('B', 0, 7), ('B', 1, 9), ('B', 2, 11)], (0, 1): [('A', 0, 1), ('A', 1, 2), ('A',
2, 3), ('B', 0, 8), ('B', 1, 10), ('B', 2, 12)], (1, 0): [('A', 0, 4), ('A', 1, 5), ('A', 2, 6), ('B', 0, 7), ('B', 1, 9), ('B', 2,
11)], (1, 1): [('A', 0, 4), ('A', 1, 5), ('A', 2, 6), ('B', 0, 8), ('B', 1, 10), ('B', 2, 12)]}

Resultant Matrix C (A x B):


[58, 64]
[139, 154]

RESULT:
Thus the above program to compute the product of the given matrices has been executed and verified
successfully.

Exno:6
MAPREDUCE TO FIND THE MAXIMUM ELECTRICAL CONSUMPTION IN EACH YEAR
Date:

AIM:

To Develop a MapReduce program to find the maximum electrical consumption in each year,
given the electrical consumption for each month in each year.

Given below is data regarding the electrical consumption of an organization. It
contains the monthly electrical consumption and the annual average for various years.

If the above data is given as input, we have to write applications to process it and produce
results such as finding the year of maximum usage, the year of minimum usage, and so on. This is
a walkover for programmers with a finite number of records: they will simply write the
logic to produce the required output and pass the data to the application written.

But think of the data representing the electrical consumption of all the large-scale
industries of a particular state since its formation.

PROGRAM:
from collections import defaultdict

# Sample data: "<year>-<month> <consumption>"
# (consumption values reconstructed from the output below; month numbers are illustrative)
data = [
    "2020-01 350",
    "2020-02 420",
    "2020-03 390",
    "2021-01 450",
    "2021-02 430",
    "2022-01 470",
    "2022-02 510",
    "2022-03 495"
]

# Step 1: Mapper – Extract (year, consumption)

def mapper(data):
mapped = []
for line in data:
date_str, value_str = line.split()
year = date_str.split("-")[0]
value = int(value_str)
mapped.append((year, value))
return mapped

# Step 2: Shuffle – Group all values by year

def shuffle(mapped_data):
grouped = defaultdict(list)
for year, value in mapped_data:
grouped[year].append(value)
return grouped

# Step 3: Reducer – Get the max value for each year

def reducer(grouped_data):
reduced = {year: max(values) for year, values in grouped_data.items()}
return reduced

# Run MapReduce simulation

mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)

shuffled_data = shuffle(mapped_data)
print("\nShuffled Output:\n", dict(shuffled_data))

reduced_data = reducer(shuffled_data)
print("\nReduced Output (Max Consumption per Year):\n", reduced_data)

OUTPUT:

Mapped Output:
[('2020', 350), ('2020', 420), ('2020', 390), ('2021', 450), ('2021', 430), ('2022', 470), ('2022', 510), ('2022', 495)]

Shuffled Output:
{'2020': [350, 420, 390], '2021': [450, 430], '2022': [470, 510, 495]}

Reduced Output (Max Consumption per Year):


{'2020': 420, '2021': 450, '2022': 510}

RESULT

Thus the above program to find the maximum electrical consumption in each year with the given data has
been executed and verified successfully.

Exno:7
MAPREDUCE TO ANALYZE WEATHER DATA SET AND PRINT WHETHER THE DAY IS SHINY OR COOL
Date:

AIM:

To Develop a MapReduce program to analyze a weather data set and print whether the day is a shiny
or a cool day.

PROGRAM:

weather_data = [
"2023-06-01 32 Sunny",
"2023-06-02 21 Rainy",
"2023-06-03 25 Cloudy",
"2023-06-04 34 Cloudy",
"2023-06-05 29 Sunny",
"2023-06-06 22 Foggy"
]

# Step 1: Mapper – Extract date and classify day

def mapper(data):
mapped = []
for line in data:
parts = line.split()
date = parts[0]
temp = int(parts[1])
condition = parts[2]

if condition.lower() == "sunny" or temp > 30:


category = "Shiny"
else:
category = "Cool"

mapped.append((date, category))
return mapped

# Step 2: No shuffle needed (1-to-1 mapping)

# Step 3: Reducer – Just print out the result or store it

def reducer(mapped_data):
result = {date: category for date, category in mapped_data}
return result

# Execute MapReduce
mapped = mapper(weather_data) # weather_data is now defined in the same scope
print("Mapped Output:\n", mapped)

reduced = reducer(mapped)
print("\nFinal Classification (Day → Shiny or Cool):")
for date, label in reduced.items():
print(f"{date}: {label}")

OUTPUT:

Mapped Output:
[('2023-06-01', 'Shiny'), ('2023-06-02', 'Cool'), ('2023-06-03', 'Cool'), ('2023-06-04', 'Shiny'), ('2023-06-05',
'Shiny'), ('2023-06-06', 'Cool')]

Final Classification (Day → Shiny or Cool):


2023-06-01: Shiny
2023-06-02: Cool
2023-06-03: Cool
2023-06-04: Shiny
2023-06-05: Shiny
2023-06-06: Cool

RESULT:

Thus the above program to analyze the weather condition with the given data has been executed and
verified successfully

Exno:8
MAPREDUCE PROGRAM TO FIND THE NUMBER OF PRODUCTS SOLD IN EACH COUNTRY
Date:

AIM:

Develop a MapReduce program to find the number of products sold in each country by considering
sales data containing the fields country, product, and quantity sold.
PROGRAM:
sales_data = [
"USA,TV,10",
"India,Laptop,5",
"USA,Phone,7",
"India,Tablet,3",
"Germany,TV,6",
"Germany,Phone,4",
"USA,Laptop,2"
]

from collections import defaultdict

# Step 1: Mapper – Extract (Country, Quantity)


def mapper(data):
mapped = []
for line in data:
country, product, quantity = line.split(',')
mapped.append((country.strip(), int(quantity.strip())))
return mapped

# Step 2: Shuffle – Group quantities by country


def shuffle(mapped_data):
grouped = defaultdict(list)

for country, qty in mapped_data:
grouped[country].append(qty)
return grouped

# Step 3: Reducer – Sum quantities per country


def reducer(grouped_data):
reduced = {country: sum(qtys) for country, qtys in grouped_data.items()}
return reduced

# Run the MapReduce simulation


mapped_data = mapper(sales_data)
print("Mapped Output:\n", mapped_data)
shuffled_data = shuffle(mapped_data)
print("\nShuffled Output:\n", dict(shuffled_data))
reduced_data = reducer(shuffled_data)
print("\nFinal Output (Total Products Sold per Country):")
for country, total in reduced_data.items():
print(f"{country}: {total}")

OUTPUT:
Mapped Output:

[('USA', 10), ('India', 5), ('USA', 7), ('India', 3), ('Germany', 6), ('Germany', 4), ('USA', 2)]

Shuffled Output:

{'USA': [10, 7, 2], 'India': [5, 3], 'Germany': [6, 4]}

Final Output (Total Products Sold per Country):

USA: 19

India: 8

Germany: 10

RESULT:
Thus the above program to analyze the total products sold per country with the given data has been
executed and verified successfully.

Exno:9
MAPREDUCE PROGRAM TO FIND THE TAGS ASSOCIATED WITH EACH MOVIE BY ANALYZING MOVIELENS DATA
Date:

AIM:
To Develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.
PROGRAM:
from collections import defaultdict

# Sample MovieLens-style data (reconstructed to match the output below;
# user IDs and genre fields are illustrative only)
movies = [
    "10,Star Wars (1977),Sci-Fi",
    "12,Titanic (1997),Romance"
]
tags = [
    "1,10,epic",
    "2,10,sci-fi",
    "3,12,romantic"
]
# Step 1: Load movieId → title mapping
def load_movies(movie_data):
movie_dict = {}
for line in movie_data:
parts = line.split(",", 2)
movie_id = parts[0]
title = parts[1]
movie_dict[movie_id] = title
return movie_dict

# Step 2: Mapper – Emit (movieId, tag)


def mapper(tag_data):
mapped = []
for line in tag_data:
parts = line.split(",")
movie_id = parts[1]
tag = parts[2]
mapped.append((movie_id, tag))
return mapped

# Step 3: Shuffle – Group tags by movieId


def shuffle(mapped_data):

grouped = defaultdict(list)
for movie_id, tag in mapped_data:
grouped[movie_id].append(tag)
return grouped

# Step 4: Reducer – Replace movieId with title


def reducer(grouped_data, movie_dict):
reduced = {}
for movie_id, tags in grouped_data.items():
title = movie_dict.get(movie_id, "Unknown Movie")
reduced[title] = tags
return reduced

# Run the MapReduce simulation


movie_dict = load_movies(movies)
mapped = mapper(tags)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled, movie_dict)
print("\nReduced Output (Movie → Tags):")
for title, tag_list in reduced.items():
print(f"{title}: {tag_list}")

OUTPUT:

Mapped Output:
[('10', 'epic'), ('10', 'sci-fi'), ('12', 'romantic')]

Shuffled Output:
{'10': ['epic', 'sci-fi'], '12': ['romantic']}

Reduced Output (Movie → Tags):


Star Wars (1977): ['epic', 'sci-fi']
Titanic (1997): ['romantic']

RESULT:

Thus the above program to find the tags associated with each movie with the given data has been executed and verified
successfully.
Exno:10
XYZ.COM IS AN ONLINE MUSIC WEBSITE WHERE USERS LISTEN TO VARIOUS TRACKS
Date:

AIM:
XYZ.com is an online music website where users listen to various tracks; the data collected is
given below. The goal is to find the number of times each track was played.

PROGRAM:

from collections import defaultdict

# Step 1: Mapper – Emit (track, 1)
def mapper(data):
    mapped = []
    for line in data:
        user, track = line.strip().split(",")
        mapped.append((track, 1))
    return mapped

# Step 2: Shuffle – Group by track
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for track, count in mapped_data:
        grouped[track].append(count)
    return grouped

# Step 3: Reducer – Sum the counts for each track
def reducer(grouped_data):
    reduced = {track: sum(counts) for track, counts in grouped_data.items()}
    return reduced

# Sample Data
logs = [
    "user1,trackA",
    "user2,trackB",
    "user3,trackA",
    "user1,trackC",
    "user2,trackA",
    "user3,trackC",
    "user4,trackA"
]

# Run MapReduce
mapped = mapper(logs)
print("Mapped Output:\n", mapped)

shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))

reduced = reducer(shuffled)
print("\nFinal Output (Track Plays Count):")
for track, count in reduced.items():
    print(f"{track}: {count}")

OUTPUT:

Mapped Output:
[('trackA', 1), ('trackB', 1), ('trackA', 1), ('trackC', 1), ('trackA', 1), ('trackC', 1), ('trackA', 1)]

Shuffled Output:
{'trackA': [1, 1, 1, 1], 'trackB': [1], 'trackC': [1, 1]}

Final Output (Track Plays Count):


trackA: 4
trackB: 1
trackC: 2

RESULT:

Thus the above program to count the number of times each track was played, using the data available on the website, has been
executed and verified successfully.

Exno:11
MAPREDUCE PROGRAM TO FIND THE FREQUENCY OF BOOKS PUBLISHED EACH YEAR
Date:

AIM:
Develop a MapReduce program to find the frequency of books published each year, and find in
which year the maximum number of books was published, using the following data.
Title Author Published Author Language No of pages

PROGRAM:

from collections import defaultdict

# Step 1: Mapper – Extract (Year, 1)


def mapper(data):
mapped = []
for line in data:
parts = line.split(",")
year = parts[2].strip() # Extract the year
mapped.append((year, 1)) # Emit (year, 1)
return mapped

# Step 2: Shuffle – Group by year


def shuffle(mapped_data):
grouped = defaultdict(list)
for year, count in mapped_data:
grouped[year].append(count)
return grouped

# Step 3: Reducer – Sum the counts for each year


def reducer(grouped_data):
reduced = {year: sum(counts) for year, counts in grouped_data.items()}
return reduced

# Sample Data
books = [
"The Great Gatsby,F. Scott Fitzgerald,1925",
"1984,George Orwell,1949",
"To Kill a Mockingbird,Harper Lee,1960",
"The Catcher in the Rye,J.D. Salinger,1951",
"Moby-Dick,Herman Melville,1851",
"Pride and Prejudice,Jane Austen,1813",
"The Hobbit,J.R.R. Tolkien,1937",
"1984,George Orwell,1949",
"The Lord of the Rings,J.R.R. Tolkien,1954"
]

# Run the MapReduce process


mapped = mapper(books)
print("Mapped Output:\n", mapped)

shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))

reduced = reducer(shuffled)
print("\nFinal Output (Books Published Each Year):")
for year, count in reduced.items():
print(f"{year}: {count}")

OUTPUT:
Mapped Output:
[('1925', 1), ('1949', 1), ('1960', 1), ('1951', 1), ('1851', 1), ('1813', 1), ('1937', 1), ('1949', 1), ('1954',
1)]

Shuffled Output:
{'1925': [1], '1949': [1, 1], '1960': [1], '1951': [1], '1851': [1], '1813': [1], '1937': [1], '1954': [1]}

Final Output (Books Published Each Year):


1925: 1
1949: 2
1960: 1
1951: 1
1851: 1
1813: 1
1937: 1
1954: 1

RESULT:

Thus the above program to find the frequency of books published each year with the given data
has been executed and verified successfully.

Exno:12
MAPREDUCE PROGRAM TO ANALYZE TITANIC SHIP DATA AND TO FIND THE AVERAGE AGE OF THE PEOPLE
Date:

AIM:
Develop a MapReduce program to analyze Titanic ship data and to find the average age of the people (both
male and female) who died in the tragedy. How many persons survived in each class?
PROGRAM:

from collections import defaultdict

# Step 1: Mapper – Extract Age and count the number of valid ages

def mapper(data):
mapped = []
for line in data:
parts = line.split(",")
age = parts[4].strip() # Extract the age field
try:
age = float(age) # Convert age to float if valid
if age > 0: # Only consider valid ages
mapped.append((1, age)) # Emit (1, age)
except ValueError:
continue # Skip invalid age entries (empty or non-numeric values)
return mapped

# Step 2: Shuffle – Group all values by key (since it's only one key: 1)
def shuffle(mapped_data):
grouped = defaultdict(list)
for key, age in mapped_data:
grouped[key].append(age)
return grouped

# Step 3: Reducer – Calculate the sum and count of ages, then compute the average
def reducer(grouped_data):
reduced = {}
for key, ages in grouped_data.items():
total_age = sum(ages)
count = len(ages)
average_age = total_age / count if count > 0 else 0
reduced[key] = average_age
return reduced

# Sample Data (Titanic)


titanic_data = [
"1,1,Allen,Mr. William Henry,Male,35,0,0,A/5 21171,8.05,,S",
"2,1,Braund,Mr. James,22,Male,1,0,PC 17599,71.2833,C85,C",
"3,3,Creasey,Miss. Alicia,Female,28,0,0,STON/OQ 392076,7.925,,Q",
"4,1,Heikkinen,Miss. Laina,Female,26,0,0,STON/OQ 392078,7.925,,S",
"5,3,Johnson,Miss. Elizabeth,34,Female,0,0,CA. 2343,8.05,,S",
"6,3,Allen,Mr. Thomas,Male,,0,0,315098,8.05,,S"
]

# Run the MapReduce process


mapped = mapper(titanic_data)
print("Mapped Output:\n", mapped)

shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))

reduced = reducer(shuffled)
print("\nFinal Output (Average Age of Passengers):")
for key, average_age in reduced.items():
print(f"Average Age: {average_age:.2f}")

OUTPUT:

Mapped Output:
[(1, 22.0), (1, 34.0)]

Shuffled Output:
{1: [22.0, 34.0]}

Final Output (Average Age of Passengers):


Average Age: 28.00
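
The aim also asks how many persons survived in each class. The sample rows above do not follow a single column order, so the sketch below assumes records in the standard Kaggle Titanic layout (PassengerId,Survived,Pclass,Name,...); the helper name and sample rows are illustrative only.

from collections import defaultdict

def survivors_by_class(rows):
    # Map: emit (Pclass, 1) for every passenger whose Survived flag is 1;
    # Reduce: sum the counts per class (same map/shuffle/reduce idea as above).
    counts = defaultdict(int)
    for line in rows:
        parts = line.split(",")
        survived, pclass = parts[1].strip(), parts[2].strip()
        if survived == "1":
            counts[pclass] += 1
    return dict(counts)

# Hypothetical rows in Kaggle order: PassengerId,Survived,Pclass,Name
sample = [
    "1,0,3,Braund Mr. Owen Harris",
    "2,1,1,Cumings Mrs. John Bradley",
    "3,1,3,Heikkinen Miss. Laina",
]
print(survivors_by_class(sample))  # {'1': 1, '3': 1}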

RESULT:
Thus the above program to analyze the Titanic ship data and to find the average age of the people (both
male and female) who died in the tragedy has been executed and verified successfully

Exno:13
MAPREDUCE PROGRAM TO ANALYZE UBER DATA SET
Date:

AIM:

To Develop a MapReduce program to analyze an Uber data set to find the days on which each
base has more trips, using the following dataset.

PROGRAM:


from collections import defaultdict

# Step 1: Mapper – Extract pickup location and ride duration

def mapper(data):
mapped = []
for line in data:
parts = line.split(",")
pickup_location = parts[1].strip() # Pickup location
ride_duration = int(parts[4].strip()) # Ride duration (in minutes)
mapped.append((pickup_location, ride_duration)) # Emit (pickup_location, ride_duration)
return mapped

# Step 2: Shuffle – Group all ride durations by pickup location

def shuffle(mapped_data):
grouped = defaultdict(list)
for location, duration in mapped_data:
grouped[location].append(duration)
return grouped

# Step 3: Reducer – Calculate the average ride duration for each pickup location
def reducer(grouped_data):
reduced = {}
for location, durations in grouped_data.items():
total_duration = sum(durations)
count = len(durations)
average_duration = total_duration / count if count > 0 else 0
reduced[location] = average_duration
return reduced

# Sample Data (Uber rides)


uber_data = [
"1,Location1,Location2,2025-04-01 08:00:00,15,123,456",
"2,Location1,Location3,2025-04-01 09:15:00,30,124,457",
"3,Location2,Location1,2025-04-01 10:00:00,20,125,458",
"4,Location1,Location4,2025-04-01 11:00:00,10,126,459",
"5,Location2,Location3,2025-04-01 12:30:00,25,127,460",
"6,Location3,Location2,2025-04-01 13:00:00,18,128,461"
]

# Run the MapReduce process


mapped = mapper(uber_data)
print("Mapped Output:\n", mapped)

shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))

reduced = reducer(shuffled)
print("\nFinal Output (Average Ride Duration by Pickup Location):")
for location, avg_duration in reduced.items():
print(f"{location}: {avg_duration:.2f} minutes")

OUTPUT:
Mapped Output:
[('Location1', 15), ('Location1', 30), ('Location2', 20), ('Location1', 10), ('Location2', 25), ('Location3', 18)]

Shuffled Output:
{'Location1': [15, 30, 10], 'Location2': [20, 25], 'Location3': [18]}

Final Output (Average Ride Duration by Pickup Location):


Location1: 18.33 minutes
Location2: 22.50 minutes
Location3: 18.00 minutes
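
Note that the aim asks for the days on which each base has the most trips, while the program above reports the average ride duration per pickup location. As a sketch, assuming records in the classic Uber FOIL layout (dispatching_base_number,date,active_vehicles,trips), the busiest day per base could be found like this; the function name and sample rows are illustrative only.

from collections import defaultdict

def busiest_day_per_base(rows):
    # Map: emit ((base, date), trips); Reduce: keep the date with the most trips for each base.
    totals = defaultdict(int)
    for line in rows:
        base, date, _vehicles, trips = line.split(",")
        totals[(base, date)] += int(trips)
    best = {}
    for (base, date), trips in totals.items():
        if base not in best or trips > best[base][1]:
            best[base] = (date, trips)
    return best

# Hypothetical sample rows: dispatching_base_number,date,active_vehicles,trips
sample = [
    "B02512,1/1/2015,190,1132",
    "B02512,1/2/2015,225,1765",
    "B02598,1/1/2015,870,13814",
]
print(busiest_day_per_base(sample))  # {'B02512': ('1/2/2015', 1765), 'B02598': ('1/1/2015', 13814)}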

RESULT:

Thus the above MapReduce program to analyze the Uber data set has been executed and verified successfully.

Exno:14
PYTHON APPLICATION TO FIND THE MAXIMUM TEMPERATURE USING SPARK
Date:

AIM:
To Develop a Python application to find the maximum temperature using Spark.

PROGRAM:

Sample input data:
date,location,temperature
2025-04-01,New York,22
2025-04-01,Los Angeles,28
2025-04-01,Chicago,18
2025-04-02,New York,24
2025-04-02,Los Angeles,30
2025-04-02,Chicago,20

from pyspark.sql import SparkSession


from pyspark.sql.functions import col, avg

# Step 1: Initialize the Spark session


spark = SparkSession.builder.appName("AverageTemperatureByLocation").getOrCreate()

# Step 2: Load the temperature data into a Spark DataFrame


data = [
("2025-04-01", "New York", 22),
("2025-04-01", "Los Angeles", 28),
("2025-04-01", "Chicago", 18),
("2025-04-02", "New York", 24),
("2025-04-02", "Los Angeles", 30),
("2025-04-02", "Chicago", 20)
]

columns = ["date", "location", "temperature"]

# Create a DataFrame from the list of tuples


df = spark.createDataFrame(data, columns)

# Step 3: Show the data (optional)


df.show()

# Step 4: Group the data by 'location' and calculate the average temperature
avg_temp_by_location = df.groupBy("location").agg(avg("temperature").alias("avg_temperature"))

# Step 5: Display the result


avg_temp_by_location.show()

# Step 6: Stop the Spark session


spark.stop()
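
The aim asks for the maximum temperature, while the program above reports the average per location. A minimal variation (to be placed before spark.stop(), reusing the same DataFrame df) that returns the per-location maximum would be:

from pyspark.sql.functions import max as spark_max

# Group by location and take the highest recorded temperature
max_temp_by_location = df.groupBy("location").agg(spark_max("temperature").alias("max_temperature"))
max_temp_by_location.show()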

OUTPUT:
+----------+-----------+-----------+
| date| location|temperature|
+----------+-----------+-----------+
|2025-04-01| New York| 22|
|2025-04-01|Los Angeles| 28|
|2025-04-01| Chicago| 18|
|2025-04-02| New York| 24|
|2025-04-02|Los Angeles| 30|
|2025-04-02| Chicago| 20|
+----------+-----------+-----------+

+-----------+---------------+
| location|avg_temperature|
+-----------+---------------+
|Los Angeles| 29.0|
| Chicago| 19.0|
| New York| 23.0|
+-----------+---------------+

RESULT:

Thus the above Python application to find the maximum temperature using Spark
has been executed and verified successfully.

ADDITIONAL EXPERIMENTS

Exno:1
PIG LATIN MODES, PROGRAMS
Date:

OBJECTIVE:
a) To check whether the first letter of a word is a vowel and convert the sentence to Pig Latin.
PROGRAM:

def pig_latin(word):
vowels = "aeiou"

# If the first letter is a vowel, add "way" at the end


if word[0].lower() in vowels:
return word + "way"
else:
# Move the first letter to the end and add "ay"
return word[1:] + word[0] + "ay"

def convert_sentence_to_pig_latin(sentence):
words = sentence.split()
pig_latin_words = [pig_latin(word) for word in words]
return ' '.join(pig_latin_words)

# Sample sentence
sentence = "Hello world this is a test"
pig_latin_sentence = convert_sentence_to_pig_latin(sentence)
print(f"Original Sentence: {sentence}")
print(f"Pig Latin Sentence: {pig_latin_sentence}")

OUTPUT:
Original Sentence: Hello world this is a test
Pig Latin Sentence: elloHay orldway histay isway away esttay

RESULT:
Thus the above Pig Latin program to find the vowels in the given sentence has been
executed and verified successfully.

Exno:2
HIVE OPERATIONS
Date:

AIM:
To Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

Sample Scenario:

We will create a simple sales database with some sample data and then run Hive queries to perform
various operations like:
 Creating a database and tables
 Inserting data
 Running queries like filtering, aggregating, and joining
 Using partitioning

Step 1: Create a Database


CREATE DATABASE sales_db;

Output:
OK

Step 2: Create a Table


USE sales_db;

CREATE TABLE sales_data (


transaction_id INT,
product_name STRING,
amount DOUBLE,
transaction_date DATE
)

ROW FORMAT DELIMITED


FIELDS TERMINATED BY ',';

Output:
OK

Step 3: Insert Data into Table

We'll simulate inserting some sample data for transactions. In Hive, we typically load data from HDFS or a
local file system into the table. Here, we'll show the concept using a query that could be used with HDFS.

LOAD DATA LOCAL INPATH '/path/to/local/sales_data.csv' INTO TABLE sales_data;

Example data in the CSV file (sales_data.csv):


1, "Laptop", 1200, "2025-04-01"
2, "Mobile", 600, "2025-04-01"
3, "Laptop", 1100, "2025-04-02"
4, "Tablet", 400, "2025-04-02"
5, "Mobile", 650, "2025-04-03"

Output:
OK

Step 4: Query the Data

To see the data in the table:

SELECT * FROM sales_data;

Output:
+-----------------+-------------+--------+------------------+
| transaction_id | product_name| amount | transaction_date |
+-----------------+-------------+--------+------------------+
|1 | Laptop | 1200.0 | 2025-04-01 |
|2 | Mobile | 600.0 | 2025-04-01 |
|3 | Laptop | 1100.0 | 2025-04-02 |
|4 | Tablet | 400.0 | 2025-04-02 |
|5 | Mobile | 650.0 | 2025-04-03 |
+-----------------+-------------+--------+------------------+

Step 5: Performing Aggregation (Sum of Sales)

We can calculate the total sales (SUM(amount)) by product_name.


SELECT product_name, SUM(amount) AS total_sales
FROM sales_data
GROUP BY product_name;

Output:
+-------------+------------+
| product_name| total_sales|
+-------------+------------+
| Laptop | 2300.0 |
| Mobile | 1250.0 |
| Tablet | 400.0 |
+-------------+------------+

Step 6: Query with Filtering (Filter Transactions Over 1000)

We can filter the data to show only transactions where the amount is greater than 1000.

SELECT * FROM sales_data WHERE amount > 1000;

Output:
+-----------------+-------------+--------+------------------+
| transaction_id | product_name| amount | transaction_date |
+-----------------+-------------+--------+------------------+
|1 | Laptop | 1200.0 | 2025-04-01 |
|3 | Laptop | 1100.0 | 2025-04-02 |
+-----------------+-------------+--------+------------------+
Step 7: Creating a Partitioned Table

Let’s now partition the sales_data table by year (transaction_year).

CREATE TABLE sales_data_partitioned (


transaction_id INT,
product_name STRING,
amount DOUBLE,
transaction_date DATE
)

PARTITIONED BY (transaction_year INT)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Output:
OK

Step 8: Loading Data into Partitions

Load data into specific partitions (e.g., for year 2025):

LOAD DATA LOCAL INPATH '/path/to/sales_data.csv' INTO TABLE sales_data_partitioned PARTITION


(transaction_year=2025);

Output:

Loading data to table sales_data_partitioned

OK

Step 9: Querying Partitioned Data

Now, we can query the data for a specific partition (e.g., for year 2025):

SELECT * FROM sales_data_partitioned WHERE transaction_year = 2025;

Output:
+-----------------+-------------+--------+------------------+-------------------+
| transaction_id | product_name| amount | transaction_date | transaction_year |
+-----------------+-------------+--------+------------------+-------------------+
|1 | Laptop | 1200.0 | 2025-04-01 | 2025 |
|2 | Mobile | 600.0 | 2025-04-01 | 2025 |
|3 | Laptop | 1100.0 | 2025-04-02 | 2025 |
|4 | Tablet | 400.0 | 2025-04-02 | 2025 |
|5 | Mobile | 650.0 | 2025-04-03 | 2025 |
+-----------------+-------------+--------+------------------+-------------------+

Step 10: Dropping a Table

Once you're done with a table, you can drop it. Here's how you drop the sales_data_partitioned table:

DROP TABLE sales_data_partitioned;


Output:
OK
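
These statements are normally issued from the Hive CLI or Beeline. As a sketch, the same queries can also be run from a Python script using the PyHive library (assuming PyHive is installed and HiveServer2 is reachable; host, port, and username below are placeholders):

from pyhive import hive

# Connect to HiveServer2 and reuse the sales_db database created above
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="sales_db")
cursor = conn.cursor()

# Re-run the aggregation from Step 5
cursor.execute("SELECT product_name, SUM(amount) AS total_sales FROM sales_data GROUP BY product_name")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()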

RESULT:

Thus the above Hive operations to create, alter, and drop databases, tables, views, functions, and indexes has
been executed and verified successfully

