Big Data Lab Manual
HINDUSTHAN INSTITUTE OF TECHNOLOGY
(An Autonomous Institution)
(Approved by AICTE, New Delhi, Affiliated to Anna University, Chennai, Accredited by NBA & NAAC with 'A' Grade)
Coimbatore – 641 032
Branch: ……………………………………………………..
Year/Semester: ……………………………………………….
Place: Coimbatore
Date:
INDEX
Exno:1 INSTALL APACHE HADOOP
Date:
AIM:
To install Apache Hadoop in stand-alone mode and verify the installation by running an example MapReduce program.
Description:
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open
source project in the big data playing field and is sponsored by the Apache Software
Foundation.
Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
HDFS, which stands for Hadoop Distributed File System, is responsible for persisting data to disk.
YARN, short for Yet Another Resource Negotiator, acts as the "operating system" of a Hadoop cluster, scheduling and managing its resources.
MapReduce is the original processing model for Hadoop clusters. It distributes work across the cluster (the map phase), then organizes and reduces the results from the nodes into an answer to a query (the reduce phase). Many other processing models are available for the 2.x line of Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.
Procedure:
We'll install Hadoop in stand-alone mode and run one of the example MapReduce programs it includes to verify the installation.
Prerequisites:
Step 1: Installing Java
Hadoop is written in Java, so a working Java (JDK) installation is required before proceeding.
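A quick way to confirm that Java is available (the exact version string will vary with your installation):
C:\>java -version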
Step 2: Installing Hadoop
With Java in place, we'll visit the Apache Hadoop Releases page to find the most recent stable release and follow the link to the binary for the current release.
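After unpacking the downloaded archive (here we assume it is extracted to c:\hadoop), the installation can be checked by printing the Hadoop version:
C:\hadoop>bin\hadoop version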
Procedure to Run Hadoop
1. If Apache Hadoop 2.2.0 is not already installed, follow the post "Build, Install, Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS".
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node Manager).
Namenode, Datanode, Resource Manager and Node Manager will be started in a few minutes, ready to execute Hadoop MapReduce jobs on the single-node (pseudo-distributed mode) cluster.
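Assuming Hadoop is unpacked at c:\hadoop, the daemons are started with the scripts shipped in the sbin directory of the 2.x Windows distribution:
C:\hadoop>sbin\start-dfs.cmd
C:\hadoop>sbin\start-yarn.cmd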
Run wordcount MapReduce job
Create a text file with some content; we'll pass this file as input to the wordcount MapReduce job for counting words. For example, C:\file1.txt containing:
Install Hadoop
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file (say 'file1.txt') from the local disk to the newly created 'input' directory in HDFS.
C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input
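With the input file in HDFS, the wordcount example job can be run and its result inspected. The examples jar below ships with the 2.2.0 binary distribution (adjust the version number to your build; the 'output' directory name is our choice):
C:\hadoop>bin\yarn jar share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar wordcount input output
C:\hadoop>bin\hdfs dfs -cat output/*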
Among the counters printed when the job completes you should see, for example:
Bytes Written=59
The job and cluster status can also be tracked in the Resource Manager web UI, e.g.:
http://abhijitg:8088/cluster
RESULT:
We've installed Hadoop in stand-alone mode and verified it by running an example program it provides.
Exno:2 MAPREDUCE PROGRAM TO CALCULATE THE FREQUENCY OF A GIVEN WORD
Date:
AIM:
To develop a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Converts into another set of data (Key, Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input
Set of tuples (output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1),
(bus,1),(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Workflow of Program
The workflow of MapReduce consists of 5 steps:
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the "Reduce" phase, records with the same KEY must end up on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – The last phase, where all the data (the individual result sets from each cluster) is combined together to form a result.
PROGRAM:
from collections import defaultdict

def mapper(text):
    # Emit a (word, 1) pair for every word, lower-cased and stripped of punctuation
    words = text.lower().split()
    return [(word.strip(".,!?;\"()"), 1) for word in words]

def shuffle(mapped_data):
    # Group the counts of each word together
    grouped = defaultdict(list)
    for word, count in mapped_data:
        grouped[word].append(count)
    return grouped

def reducer(grouped_data):
    # Sum the counts for each word
    return {word: sum(counts) for word, counts in grouped_data.items()}

text = "This is a sample text. It contains some words, and we want to count them!"
mapped = mapper(text)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nReduced Output (Final Word Count):\n", reduced)
OUTPUT:
Mapped Output:
[('this', 1), ('is', 1), ('a', 1), ('sample', 1), ('text', 1), ('it', 1), ('contains', 1), ('some', 1), ('words', 1), ('and', 1), ('we', 1),
('want', 1), ('to', 1), ('count', 1), ('them', 1)]
Shuffled Output:
{'this': [1], 'is': [1], 'a': [1], 'sample': [1], 'text': [1], 'it': [1], 'contains': [1], 'some': [1], 'words': [1], 'and': [1], 'we': [1], 'want': [1], 'to': [1], 'count': [1], 'them': [1]}
Reduced Output (Final Word Count):
{'this': 1, 'is': 1, 'a': 1, 'sample': 1, 'text': 1, 'it': 1, 'contains': 1, 'some': 1, 'words': 1, 'and': 1, 'we': 1, 'want': 1, 'to': 1, 'count': 1, 'them': 1}
RESULT:
Thus the above program to find the count of each word in the given text has been executed and verified successfully
Exno:3 MAPREDUCE PROGRAM TO FIND THE MAXIMUM
TEMPERATURE IN EACH YEAR
Date:
AIM:
To Develop a MapReduce program to find the maximum temperature in each year.
Description:
MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The previous experiment gave an introduction to MapReduce; this one explains how to design a MapReduce program.
PROGRAM:
from collections import defaultdict
data = [
"2020-01-01 30",
"2020-05-12 45",
"2020-12-30 10",
"2021-01-15 20",
"2021-06-18 50",
"2021-09-20 48",
"2022-02-11 25",
"2022-07-04 39",
"2022-11-22 41"
]
def mapper(data):
    mapped = []
    for record in data:
        date_str, temp_str = record.split()
        year = date_str.split("-")[0]
        temperature = int(temp_str)
        mapped.append((year, temperature))
    return mapped

def shuffle(mapped_data):
    grouped = defaultdict(list)
    for year, temp in mapped_data:
        grouped[year].append(temp)
    return grouped
# Step 3: Reducer - Find max temperature per year
def reducer(grouped_data):
    reduced = {year: max(temps) for year, temps in grouped_data.items()}
    return reduced
# Execute MapReduce
mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)
shuffled_data = shuffle(mapped_data)
print("\nShuffled Output:\n", dict(shuffled_data))
reduced_data = reducer(shuffled_data)
print("\nReduced Output (Max Temperature per Year):\n", reduced_data)
Output:
Mapped Output:
[('2020', 30), ('2020', 45), ('2020', 10), ('2021', 20), ('2021', 50), ('2021', 48), ('2022', 25), ('2022', 39), ('2022', 41)]
Shuffled Output:
{'2020': [30, 45, 10], '2021': [20, 50, 48], '2022': [25, 39, 41]}
Reduced Output (Max Temperature per Year):
{'2020': 45, '2021': 50, '2022': 41}
RESULT:
Thus the above program to find the maximum temperature recorded in each year with the given data has been executed and verified successfully
Exno:4 MAPREDUCE PROGRAM TO FIND THE GRADES OF STUDENTS
Date:
AIM:
To develop a MapReduce program to find the grades of students.
Program:
def mapper(data):
    mapped = []
    for line in data:
        name, score = line.split()
        mapped.append((name, int(score)))
    return mapped

# Step 2: Shuffle – not needed here since we map directly per student

# Grade boundaries below are assumed (a standard 90/80/70/60 scale)
def grade(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

def reducer(mapped_data):
    reduced = {name: grade(score) for name, score in mapped_data}
    return reduced

data = [
    "Alice 95",
    "Bob 67",
    "Charlie 88",
    "David 73",
    "Eva 54",
    "Frank 100",
    "Grace 82"
]

mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)
reduced_data = reducer(mapped_data)
print("\nReduced Output (Student Grades):\n", reduced_data)
OUTPUT:
Mapped Output:
[('Alice', 95), ('Bob', 67), ('Charlie', 88), ('David', 73), ('Eva', 54), ('Frank', 100), ('Grace', 82)]
Reduced Output (Student Grades):
{'Alice': 'A', 'Bob': 'D', 'Charlie': 'B', 'David': 'C', 'Eva': 'F', 'Frank': 'A', 'Grace': 'B'}
RESULT:
Thus the above program to find the grades of students with the given data has been executed and verified successfully
Exno:5 MAPREDUCE PROGRAM TO IMPLEMENT MATRIX MULTIPLICATION
Date:
AIM:
To develop a MapReduce program to implement matrix multiplication.
Algorithm for Map Function:
a. For each element m_ij of matrix M, produce the (key, value) pairs ((i,k), (M, j, m_ij)) for k = 1, ..., p, where p is the number of columns of N.
b. For each element n_jk of matrix N, produce the (key, value) pairs ((i,k), (N, j, n_jk)) for i = 1, ..., m, where m is the number of rows of M.
c. Return the set of (key, value) pairs so that each key (i,k) has a list with values (M, j, m_ij) and (N, j, n_jk) for all possible values of j.
Algorithm for Reduce Function:
For each key (i,k), examine its list of values; for each j, multiply the m_ij from the M entries with the n_jk from the N entries, and sum these products. The sum is the value of element (i,k) of the output matrix.
PROGRAM:
from collections import defaultdict
# Expected result (for verification):
# C = [[58, 64],
#      [139, 154]]
# Input matrices
A=[
[1, 2, 3],
[4, 5, 6]
]
B=[
[7, 8],
[9, 10],
[11, 12]
]
# Dimensions
m, n = len(A), len(A[0])
n2, p = len(B), len(B[0])
# Step 1: Mapper
def mapper(A, B):
    mapped = []
    # Each element A[i][k] contributes to every output cell (i, j)
    for i in range(m):
        for k in range(n):
            for j in range(p):
                mapped.append(((i, j), ('A', k, A[i][k])))
    # Each element B[k][j] contributes to every output cell (i, j)
    for k in range(n2):
        for j in range(p):
            for i in range(m):
                mapped.append(((i, j), ('B', k, B[k][j])))
    return mapped
# Step 2: Shuffle
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for key, value in mapped_data:
        grouped[key].append(value)
    return grouped
# Step 3: Reducer – multiply matching A and B values and sum over k
def reducer(grouped_data):
    result = defaultdict(int)
    for (i, j), values in grouped_data.items():
        a_vals = {k: v for tag, k, v in values if tag == 'A'}
        b_vals = {k: v for tag, k, v in values if tag == 'B'}
        for k in a_vals:
            result[(i, j)] += a_vals[k] * b_vals.get(k, 0)
    return result
# Execute MapReduce
mapped = mapper(A, B)
print("Mapped Data:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Data:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nReduced Output (Result Matrix):\n", dict(reduced))
EXPECTED OUTPUT:
Mapped Data:
[((0, 0), ('A', 0, 1)), ((0, 1), ('A', 0, 1)), ((0, 0), ('A', 1, 2)), ((0, 1), ('A', 1, 2)), ((0, 0), ('A', 2, 3)), ((0, 1), ('A',
2, 3)), ((1, 0), ('A', 0, 4)), ((1, 1), ('A', 0, 4)), ((1, 0), ('A', 1, 5)), ((1, 1), ('A', 1, 5)), ((1, 0), ('A', 2, 6)), ((1, 1),
('A', 2, 6)), ((0, 0), ('B', 0, 7)), ((1, 0), ('B', 0, 7)), ((0, 1), ('B', 0, 8)), ((1, 1), ('B', 0, 8)), ((0, 0), ('B', 1, 9)),
((1, 0), ('B', 1, 9)), ((0, 1), ('B', 1, 10)), ((1, 1), ('B', 1, 10)), ((0, 0), ('B', 2, 11)), ((1, 0), ('B', 2, 11)), ((0, 1),
('B', 2, 12)), ((1, 1), ('B', 2, 12))]
Shuffled Data:
{(0, 0): [('A', 0, 1), ('A', 1, 2), ('A', 2, 3), ('B', 0, 7), ('B', 1, 9), ('B', 2, 11)], (0, 1): [('A', 0, 1), ('A', 1, 2), ('A',
2, 3), ('B', 0, 8), ('B', 1, 10), ('B', 2, 12)], (1, 0): [('A', 0, 4), ('A', 1, 5), ('A', 2, 6), ('B', 0, 7), ('B', 1, 9), ('B', 2,
11)], (1, 1): [('A', 0, 4), ('A', 1, 5), ('A', 2, 6), ('B', 0, 8), ('B', 1, 10), ('B', 2, 12)]}
Reduced Output (Result Matrix):
{(0, 0): 58, (0, 1): 64, (1, 0): 139, (1, 1): 154}
RESULT:
Thus the above program to implement matrix multiplication with the given data has been executed and verified successfully
Exno:6 MAPREDUCE TO FIND THE MAXIMUM ELECTRICAL CONSUMPTION IN EACH YEAR
Date:
AIM:
To develop a MapReduce program to find the maximum electrical consumption in each year.
Description:
Given such data as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records: they will simply write the logic to produce the required output and pass the data to the application. But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation.
PROGRAM:
from collections import defaultdict

# Sample input reconstructed to match the output below; the month parts are illustrative
data = [
    "2020-01 350",
    "2020-05 420",
    "2020-09 390",
    "2021-03 450",
    "2021-07 430",
    "2022-02 470",
    "2022-06 510",
    "2022-10 495"
]

def mapper(data):
    mapped = []
    for line in data:
        date_str, value_str = line.split()
        year = date_str.split("-")[0]
        value = int(value_str)
        mapped.append((year, value))
    return mapped
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for year, value in mapped_data:
        grouped[year].append(value)
    return grouped
def reducer(grouped_data):
    reduced = {year: max(values) for year, values in grouped_data.items()}
    return reduced
# Run MapReduce simulation
mapped_data = mapper(data)
print("Mapped Output:\n", mapped_data)
shuffled_data = shuffle(mapped_data)
print("\nShuffled Output:\n", dict(shuffled_data))
reduced_data = reducer(shuffled_data)
print("\nReduced Output (Max Consumption per Year):\n", reduced_data)
OUTPUT:
Mapped Output:
[('2020', 350), ('2020', 420), ('2020', 390), ('2021', 450), ('2021', 430), ('2022', 470), ('2022', 510), ('2022', 495)]
Shuffled Output:
{'2020': [350, 420, 390], '2021': [450, 430], '2022': [470, 510, 495]}
Reduced Output (Max Consumption per Year):
{'2020': 420, '2021': 450, '2022': 510}
RESULT:
Thus the above program to find the maximum electrical consumption in each year with the given data has been executed and verified successfully
Exno:7 MAPREDUCE TO ANALYZE WEATHER DATA SET AND PRINT WHETHER THE DAY IS SHINY OR COOL
Date:
AIM:
To develop a MapReduce program to analyze a weather data set and print whether each day is a shiny or a cool day.
PROGRAM:
weather_data = [
"2023-06-01 32 Sunny",
"2023-06-02 21 Rainy",
"2023-06-03 25 Cloudy",
"2023-06-04 34 Cloudy",
"2023-06-05 29 Sunny",
"2023-06-06 22 Foggy"
]
def mapper(data):
    mapped = []
    for line in data:
        parts = line.split()
        date = parts[0]
        temp = int(parts[1])
        condition = parts[2]  # the condition field is not used in the classification
        # Threshold inferred from the expected output: above 25 degrees is 'Shiny'
        category = 'Shiny' if temp > 25 else 'Cool'
        mapped.append((date, category))
    return mapped
def reducer(mapped_data):
    result = {date: category for date, category in mapped_data}
    return result
# Execute MapReduce
mapped = mapper(weather_data)
print("Mapped Output:\n", mapped)
reduced = reducer(mapped)
print("\nFinal Classification (Day → Shiny or Cool):")
for date, label in reduced.items():
    print(f"{date}: {label}")
OUTPUT:
Mapped Output:
[('2023-06-01', 'Shiny'), ('2023-06-02', 'Cool'), ('2023-06-03', 'Cool'), ('2023-06-04', 'Shiny'), ('2023-06-05', 'Shiny'), ('2023-06-06', 'Cool')]
Final Classification (Day → Shiny or Cool):
2023-06-01: Shiny
2023-06-02: Cool
2023-06-03: Cool
2023-06-04: Shiny
2023-06-05: Shiny
2023-06-06: Cool
RESULT:
Thus the above program to analyze the weather condition with the given data has been executed and
verified successfully
Exno:8 MAPREDUCE PROGRAM TO FIND THE NUMBER OF PRODUCTS SOLD IN EACH COUNTRY
Date:
AIM:
To develop a MapReduce program to find the number of products sold in each country, considering sales data containing the fields country, product, and quantity sold.
PROGRAM:
sales_data = [
"USA,TV,10",
"India,Laptop,5",
"USA,Phone,7",
"India,Tablet,3",
"Germany,TV,6",
"Germany,Phone,4",
"USA,Laptop,2"
]
from collections import defaultdict

# Step 1: Mapper – emit (country, quantity) for every sales record ("country,product,quantity")
def mapper(data):
    mapped = []
    for line in data:
        country, product, qty = line.split(",")
        mapped.append((country, int(qty)))
    return mapped

# Step 2: Shuffle – group quantities by country
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for country, qty in mapped_data:
        grouped[country].append(qty)
    return grouped

# Step 3: Reducer – sum the quantities for each country
def reducer(grouped_data):
    return {country: sum(qtys) for country, qtys in grouped_data.items()}

# Run MapReduce
mapped = mapper(sales_data)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nReduced Output (Products Sold per Country):")
for country, total in reduced.items():
    print(f"{country}: {total}")
OUTPUT:
Mapped Output:
[('USA', 10), ('India', 5), ('USA', 7), ('India', 3), ('Germany', 6), ('Germany', 4), ('USA', 2)]
Shuffled Output:
{'USA': [10, 7, 2], 'India': [5, 3], 'Germany': [6, 4]}
Reduced Output (Products Sold per Country):
USA: 19
India: 8
Germany: 10
RESULT:
Thus the above program to find the total products sold per country with the given data has been executed and verified successfully.
Exno:9 MAPREDUCE PROGRAM TO FIND THE TAGS ASSOCIATED WITH EACH MOVIE BY ANALYZING MOVIELENS DATA
Date:
AIM:
To develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.
PROGRAM:
from collections import defaultdict
# Step 1: Load movieId → title mapping
def load_movies(movie_data):
    movie_dict = {}
    for line in movie_data:
        parts = line.split(",", 2)
        movie_id = parts[0]
        title = parts[1]
        movie_dict[movie_id] = title
    return movie_dict
# Sample tag data ("userId,movieId,tag"); the movie ids and tags match the output below, the user ids are illustrative
tag_data = [
    "100,10,epic",
    "101,10,sci-fi",
    "102,12,romantic"
]

# Step 2: Mapper – emit (movieId, tag) pairs
def mapper(tag_data):
    mapped = []
    for line in tag_data:
        user_id, movie_id, tag = line.split(",")
        mapped.append((movie_id, tag))
    return mapped

# Step 3: Shuffle – group tags by movieId
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for movie_id, tag in mapped_data:
        grouped[movie_id].append(tag)
    return grouped

# Execute MapReduce (load_movies above can map movie ids back to titles when movie data is supplied)
mapped = mapper(tag_data)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
OUTPUT:
Mapped Output:
[('10', 'epic'), ('10', 'sci-fi'), ('12', 'romantic')]
Shuffled Output:
{'10': ['epic', 'sci-fi'], '12': ['romantic']}
RESULT:
Thus the above program to find the tags associated with each movie with the given data has been executed and verified successfully.
Exno:10 XYZ.COM IS AN ONLINE MUSIC WEBSITE WHERE USERS LISTEN TO VARIOUS TRACKS
Date:
AIM:
XYZ.com is an online music website where users listen to various tracks, and the data gets collected as given below. To develop a MapReduce program to find the number of times each track was played.
PROGRAM:
from collections import defaultdict

# Step 1: Mapper – emit (track, 1) for every listen record ("user,track")
def mapper(data):
    mapped = []
    for line in data:
        user, track = line.split(",")
        mapped.append((track, 1))
    return mapped

# Step 2: Shuffle – group the counts by track
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for track, count in mapped_data:
        grouped[track].append(count)
    return grouped

# Step 3: Reducer – sum the counts for each track
def reducer(grouped_data):
    reduced = {track: sum(counts) for track, counts in grouped_data.items()}
    return reduced

# Sample Data
logs = [
    "user1,trackA",
    "user2,trackB",
    "user3,trackA",
    "user1,trackC",
    "user2,trackA",
    "user3,trackC",
    "user4,trackA"
]

# Run MapReduce
mapped = mapper(logs)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nReduced Output (Play Count per Track):")
for track, count in reduced.items():
    print(f"{track}: {count}")
OUTPUT:
Mapped Output:
[('trackA', 1), ('trackB', 1), ('trackA', 1), ('trackC', 1), ('trackA', 1), ('trackC', 1), ('trackA', 1)]
Shuffled Output:
{'trackA': [1, 1, 1, 1], 'trackB': [1], 'trackC': [1, 1]}
Reduced Output (Play Count per Track):
trackA: 4
trackB: 1
trackC: 2
RESULT:
Thus the above program to count the number of times each track was played, using the data available from the website, has been executed and verified successfully.
Exno:11 MAPREDUCE PROGRAM TO FIND THE FREQUENCY OF BOOKS PUBLISHED EACH YEAR
Date:
AIM:
To develop a MapReduce program to find the frequency of books published each year, and to find the year in which the maximum number of books was published, using the following data.
Fields: Title, Author, Published Year, Language, No. of pages
PROGRAM:
# Sample Data
books = [
"The Great Gatsby,F. Scott Fitzgerald,1925",
"1984,George Orwell,1949",
"To Kill a Mockingbird,Harper Lee,1960",
"The Catcher in the Rye,J.D. Salinger,1951",
"Moby-Dick,Herman Melville,1851",
"Pride and Prejudice,Jane Austen,1813",
"The Hobbit,J.R.R. Tolkien,1937",
"1984,George Orwell,1949",
"The Lord of the Rings,J.R.R. Tolkien,1954"
]

from collections import defaultdict

# Step 1: Mapper – emit (year, 1) for every book record ("title,author,year")
def mapper(data):
    mapped = []
    for line in data:
        year = line.rsplit(",", 1)[1]
        mapped.append((year, 1))
    return mapped

# Step 2: Shuffle – group the counts by year
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for year, count in mapped_data:
        grouped[year].append(count)
    return grouped

# Step 3: Reducer – sum the counts for each year
def reducer(grouped_data):
    return {year: sum(counts) for year, counts in grouped_data.items()}

# Run MapReduce
mapped = mapper(books)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nFinal Output (Books Published Each Year):")
for year, count in reduced.items():
    print(f"{year}: {count}")

# Year with the maximum number of books published
max_year = max(reduced, key=reduced.get)
print(f"\nYear with maximum books published: {max_year} ({reduced[max_year]} books)")
OUTPUT:
Mapped Output:
[('1925', 1), ('1949', 1), ('1960', 1), ('1951', 1), ('1851', 1), ('1813', 1), ('1937', 1), ('1949', 1), ('1954',
1)]
Shuffled Output:
{'1925': [1], '1949': [1, 1], '1960': [1], '1951': [1], '1851': [1], '1813': [1], '1937': [1], '1954': [1]}
Final Output (Books Published Each Year):
1925: 1
1949: 2
1960: 1
1951: 1
1851: 1
1813: 1
1937: 1
1954: 1

Year with maximum books published: 1949 (2 books)
RESULT:
Thus the above program to find the frequency of books published each year, and the year with the maximum number of books, has been executed and verified successfully.
Exno:12 MAPREDUCE PROGRAM TO ANALYZE TITANIC SHIP DATA AND TO FIND THE AVERAGE AGE OF THE PEOPLE
Date:
AIM:
To develop a MapReduce program to analyze Titanic ship data and to find the average age of the people (both male and female) who died in the tragedy, and how many persons survived in each class.
PROGRAM:
from collections import defaultdict

titanic_data = [
"1,1,Allen,Mr. William Henry,Male,35,0,0,A/5 21171,8.05,,S",
"2,1,Braund,Mr. James,22,Male,1,0,PC 17599,71.2833,C85,C",
"3,3,Creasey,Miss. Alicia,Female,28,0,0,STON/OQ 392076,7.925,,Q",
"4,1,Heikkinen,Miss. Laina,Female,26,0,0,STON/OQ 392078,7.925,,S",
"5,3,Johnson,Miss. Elizabeth,34,Female,0,0,CA. 2343,8.05,,S",
"6,3,Allen,Mr. Thomas,Male,,0,0,315098,8.05,,S"
]
# Step 1: Mapper – Extract Age and count the number of valid ages
def mapper(data):
    mapped = []
    for line in data:
        parts = line.split(",")
        age = parts[4].strip()  # Extract the age field
        try:
            age = float(age)  # Convert age to float if valid
            if age > 0:  # Only consider valid ages
                mapped.append((1, age))  # Emit (1, age)
        except ValueError:
            continue  # Skip invalid age entries (empty or non-numeric values)
    return mapped
# Step 2: Shuffle – Group all values by key (since it's only one key: 1)
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for key, age in mapped_data:
        grouped[key].append(age)
    return grouped
# Step 3: Reducer – Calculate the sum and count of ages, then compute the average
def reducer(grouped_data):
    reduced = {}
    for key, ages in grouped_data.items():
        total_age = sum(ages)
        count = len(ages)
        average_age = total_age / count if count > 0 else 0
        reduced[key] = average_age
    return reduced

# Execute MapReduce
mapped = mapper(titanic_data)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nFinal Output (Average Age of Passengers):")
for key, average_age in reduced.items():
    print(f"Average Age: {average_age:.2f}")
OUTPUT:
Mapped Output:
[(1, 22.0), (1, 34.0)]
Shuffled Output:
{1: [22.0, 34.0]}
Final Output (Average Age of Passengers):
Average Age: 28.00
RESULT:
Thus the above program to analyze the Titanic ship data and to find the average age of the people (both
male and female) who died in the tragedy has been executed and verified successfully
Exno:13
MAPREDUCE PROGRAM TO ANALYZE UBER DATA SET
Date:
AIM:
To develop a MapReduce program to analyze an Uber data set to find the days on which each basement has more trips, using the following dataset. (The program below computes the average ride duration for each pickup location.)
PROGRAM:
from collections import defaultdict

uber_data = [
"1,Location1,Location2,2025-04-01 08:00:00,15,123,456",
"2,Location1,Location3,2025-04-01 09:15:00,30,124,457",
"3,Location2,Location1,2025-04-01 10:00:00,20,125,458",
"4,Location1,Location4,2025-04-01 11:00:00,10,126,459",
"5,Location2,Location3,2025-04-01 12:30:00,25,127,460",
"6,Location3,Location2,2025-04-01 13:00:00,18,128,461"
]
def mapper(data):
    mapped = []
    for line in data:
        parts = line.split(",")
        pickup_location = parts[1].strip()  # Pickup location
        ride_duration = int(parts[4].strip())  # Ride duration (in minutes)
        mapped.append((pickup_location, ride_duration))  # Emit (pickup_location, ride_duration)
    return mapped
def shuffle(mapped_data):
    grouped = defaultdict(list)
    for location, duration in mapped_data:
        grouped[location].append(duration)
    return grouped
# Step 3: Reducer – Calculate the average ride duration for each pickup location
def reducer(grouped_data):
    reduced = {}
    for location, durations in grouped_data.items():
        total_duration = sum(durations)
        count = len(durations)
        average_duration = total_duration / count if count > 0 else 0
        reduced[location] = average_duration
    return reduced
# Execute MapReduce
mapped = mapper(uber_data)
print("Mapped Output:\n", mapped)
shuffled = shuffle(mapped)
print("\nShuffled Output:\n", dict(shuffled))
reduced = reducer(shuffled)
print("\nFinal Output (Average Ride Duration by Pickup Location):")
for location, avg_duration in reduced.items():
    print(f"{location}: {avg_duration:.2f} minutes")
OUTPUT:
Mapped Output:
[('Location1', 15), ('Location1', 30), ('Location2', 20), ('Location1', 10), ('Location2', 25), ('Location3', 18)]
Shuffled Output:
{'Location1': [15, 30, 10], 'Location2': [20, 25], 'Location3': [18]}
Final Output (Average Ride Duration by Pickup Location):
Location1: 18.33 minutes
Location2: 22.50 minutes
Location3: 18.00 minutes
RESULT:
Thus the above MapReduce program to analyze the Uber data set with the given dataset has been executed and verified successfully
Exno:14 PYTHON APPLICATION TO FIND THE MAXIMUM TEMPERATURE USING SPARK
Date:
AIM:
To develop a Python application to find the maximum temperature using Spark.
PROGRAM:
# Input file: weather.csv
date,location,temperature
2025-04-01,New York,22
2025-04-01,Los Angeles,28
2025-04-01,Chicago,18
2025-04-02,New York,24
2025-04-02,Los Angeles,30
2025-04-02,Chicago,20

# The session-creation and file-reading steps below are reconstructed,
# assuming the data above is saved as weather.csv
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max as max_

# Step 1: Create a Spark session
spark = SparkSession.builder.appName("TemperatureAnalysis").getOrCreate()

# Step 2: Read the CSV file into a DataFrame
columns = ["date", "location", "temperature"]
df = spark.read.csv("weather.csv", header=True, inferSchema=True).toDF(*columns)

# Step 3: Display the data
df.show()

# Step 4: Group the data by 'location' and calculate the average temperature
avg_temp_by_location = df.groupBy("location").agg(avg("temperature").alias("avg_temperature"))
avg_temp_by_location.show()

# Step 5: Find the maximum temperature per location (as stated in the AIM)
max_temp_by_location = df.groupBy("location").agg(max_("temperature").alias("max_temperature"))
max_temp_by_location.show()
OUTPUT:
+----------+-----------+-----------+
| date| location|temperature|
+----------+-----------+-----------+
|2025-04-01| New York| 22|
|2025-04-01|Los Angeles| 28|
|2025-04-01| Chicago| 18|
|2025-04-02| New York| 24|
|2025-04-02|Los Angeles| 30|
|2025-04-02| Chicago| 20|
+----------+-----------+-----------+
+-----------+---------------+
| location|avg_temperature|
+-----------+---------------+
|Los Angeles| 29.0|
| Chicago| 19.0|
| New York| 23.0|
+-----------+---------------+

+-----------+---------------+
|   location|max_temperature|
+-----------+---------------+
|Los Angeles|             30|
|    Chicago|             20|
|   New York|             24|
+-----------+---------------+
RESULT:
Thus the above Python application to find the maximum temperature using Spark has been executed and verified successfully
ADDITIONAL EXPERIMENTS
Exno:1 PIG LATIN MODES, PROGRAMS
Date:
OBJECTIVE:
a) To convert a sentence to Pig Latin, based on whether the first letter of each word is a vowel.
PROGRAM:
def pig_latin(word):
    vowels = "aeiou"
    # If the word starts with a vowel, append "way"; otherwise move the
    # first letter to the end and append "ay" (rule inferred from the expected output)
    if word[0].lower() in vowels:
        return word + "way"
    return word[1:] + word[0] + "ay"

def convert_sentence_to_pig_latin(sentence):
    words = sentence.split()
    pig_latin_words = [pig_latin(word) for word in words]
    return ' '.join(pig_latin_words)
# Sample sentence
sentence = "Hello world this is a test"
pig_latin_sentence = convert_sentence_to_pig_latin(sentence)
print(f"Original Sentence: {sentence}")
print(f"Pig Latin Sentence: {pig_latin_sentence}")
OUTPUT:
Original Sentence: Hello world this is a test
Pig Latin Sentence: elloHay orldway histay isway away esttay
RESULT:
Thus the above Pig Latin program to convert the given sentence based on its vowels has been executed and verified successfully
Exno:2
HIVE OPERATIONS
Date:
AIM:
To Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
Sample Scenario:
We will create a simple sales database with some sample data and then run Hive queries to perform
various operations like:
Creating a database and tables
Inserting data
Running queries like filtering, aggregating, and joining
Using partitioning
Step 1: Creating the Database
Output:
OK
Step 2: Creating the Transactions Table
Output:
OK
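A minimal sketch of the statements Steps 1 and 2 would run (the database name salesdb is an assumption; the table's column layout is inferred from the query outputs below):

CREATE DATABASE IF NOT EXISTS salesdb;
USE salesdb;

CREATE TABLE transactions (
    transaction_id INT,
    product_name STRING,
    amount DOUBLE,
    transaction_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';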
Step 3: Inserting Data
We'll simulate inserting some sample data for transactions. In Hive, we typically load data from HDFS or a local file system into the table; here, we'll show the concept using a query that could be used with HDFS, as sketched below.
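A minimal sketch of such a load statement (the HDFS path here is hypothetical):

LOAD DATA INPATH '/user/hive/data/transactions.csv' INTO TABLE transactions;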
Output:
OK
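Step 4: Querying the Table
A minimal sketch of the query behind the listing below (assuming the transactions table above):

SELECT * FROM transactions;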
Output:
+-----------------+-------------+--------+------------------+
| transaction_id | product_name| amount | transaction_date |
+-----------------+-------------+--------+------------------+
|1 | Laptop | 1200.0 | 2025-04-01 |
|2 | Mobile | 600.0 | 2025-04-01 |
|3 | Laptop | 1100.0 | 2025-04-02 |
|4 | Tablet | 400.0 | 2025-04-02 |
|5 | Mobile | 650.0 | 2025-04-03 |
+-----------------+-------------+--------+------------------+
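Step 5: Aggregating the Data
A minimal sketch of the aggregation behind the totals below (assuming the same table):

SELECT product_name, SUM(amount) AS total_sales
FROM transactions
GROUP BY product_name;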
Output:
+-------------+------------+
| product_name| total_sales|
+-------------+------------+
| Laptop | 2300.0 |
| Mobile | 1250.0 |
| Tablet | 400.0 |
+-------------+------------+
Step 6: Filtering the Data
We can filter the data to show only transactions where the amount is greater than 1000, as sketched below.
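A minimal sketch of the filter (assuming the same table):

SELECT * FROM transactions WHERE amount > 1000;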
Output:
+-----------------+-------------+--------+------------------+
| transaction_id | product_name| amount | transaction_date |
+-----------------+-------------+--------+------------------+
|1 | Laptop | 1200.0 | 2025-04-01 |
|3 | Laptop | 1100.0 | 2025-04-02 |
+-----------------+-------------+--------+------------------+
Step 7: Creating a Partitioned Table
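A minimal sketch of the statements this step would run (the table name sales_data_partitioned and the partition column transaction_year are taken from the outputs below and from Step 8):

CREATE TABLE sales_data_partitioned (
    transaction_id INT,
    product_name STRING,
    amount DOUBLE,
    transaction_date STRING
)
PARTITIONED BY (transaction_year INT);

INSERT INTO TABLE sales_data_partitioned PARTITION (transaction_year=2025)
SELECT transaction_id, product_name, amount, transaction_date FROM transactions;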
Output:
OK
Output:
OK
Now, we can query the data for a specific partition (e.g., for year 2025):
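A minimal sketch of the partition query:

SELECT * FROM sales_data_partitioned WHERE transaction_year = 2025;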
Output:
+-----------------+-------------+--------+------------------+-------------------+
| transaction_id | product_name| amount | transaction_date | transaction_year |
+-----------------+-------------+--------+------------------+-------------------+
|1 | Laptop | 1200.0 | 2025-04-01 | 2025 |
|2 | Mobile | 600.0 | 2025-04-01 | 2025 |
|3 | Laptop | 1100.0 | 2025-04-02 | 2025 |
|4 | Tablet | 400.0 | 2025-04-02 | 2025 |
|5 | Mobile | 650.0 | 2025-04-03 | 2025 |
+-----------------+-------------+--------+------------------+-------------------+
Step 8: Dropping a Table
Once you're done with a table, you can drop it. Here's how you drop the sales_data_partitioned table:
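A minimal sketch of the drop statement:

DROP TABLE IF EXISTS sales_data_partitioned;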
RESULT:
Thus the above Hive operations to create, alter, and drop databases, tables, views, functions, and indexes has
been executed and verified successfully