BIG DATA ANALYTICS (3170722)

Practical 1
Aim: Implement the following using MapReduce:

a. Matrix multiplication
b. Sorting
c. Indexing
a. Matrix multiplication
MapReduce

We will write Map and Reduce functions to process the input files. The Map function produces key-value pairs from the input data, as described in Algorithm 1. The Reduce function takes the output of the Map function, performs the calculations, and produces the final key-value pairs.
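As a compact sketch of that idea (assuming the matrices are already available as sparse (i, j, value) triples, which is exactly the representation built later in this practical), the two functions could look like the following; the full step-by-step version follows under Source code.

from collections import defaultdict

def map_matrices(m_triples, n_triples):
    # Map: key on the shared dimension j and tag each value with its origin matrix.
    for i, j, v in m_triples:
        yield j, ("m", i, v)
    for j, k, v in n_triples:
        yield j, ("n", k, v)

def reduce_by_j(pairs):
    # Reduce: for every j, multiply each m value by each n value and
    # accumulate the partial products under the output key (i, k).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    result = defaultdict(float)
    for j, values in grouped.items():
        ms = [(i, v) for tag, i, v in values if tag == "m"]
        ns = [(k, v) for tag, k, v in values if tag == "n"]
        for i, mv in ms:
            for k, nv in ns:
                result[(i, k)] += mv * nv
    return result

# Tiny example with 2x2 matrices given as sparse triples (zero entries omitted).
m_t = [(0, 0, 1), (0, 1, 2), (1, 0, 3)]
n_t = [(0, 0, 4), (1, 1, 5)]
print(dict(reduce_by_j(map_matrices(m_t, n_t))))  # {(0, 0): 4.0, (0, 1): 10.0, (1, 0): 12.0}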

Source code:
import numpy as np
from pprint import pprint
from collections import defaultdict

m = np.matrix([
    [1, 2, 0],
    [3, 0, 5],
    [0, 7, 0],
    [4, 0, 0],
    [1, 8, 2]
])
n = np.matrix([
    [1, 4, 5, 0],
    [0, 4, 0, 1],
    [6, 3, 9, 1]
])

shape_m = m.shape
shape_n = n.shape

m_n = m * n
pprint(m_n)

Output:
# matrix([[ 1, 12,  5,  2],
#         [33, 27, 60,  5],
#         [ 0, 28,  0,  7],
#         [ 4, 16, 20,  0],
#         [13, 42, 23, 10]])

m_n = [[i, j, m_n[i,j]] for i in range(m_n.shape[0]) for j in range(m_n.shape[1]) if m_n[i,j] != 0]

pprint(m_n)

Output
# [[0, 0, 1],
#  [0, 1, 12],
#  [0, 2, 5],
#  [0, 3, 2],
#  [1, 0, 33],
#  [1, 1, 27],
#  [1, 2, 60],
#  [1, 3, 5],
#  [2, 1, 28],
#  [2, 3, 7],
#  [3, 0, 4],
#  [3, 1, 16],
#  [3, 2, 20],
#  [4, 0, 13],
#  [4, 1, 42],
#  [4, 2, 23],
#  [4, 3, 10]]
# Represent the matrices as (i, j, m[i][j]) and (i, j, n[i][j]) triples
m = [[i, j, m[i, j]] for i in range(m.shape[0]) for j in range(m.shape[1]) if m[i, j] != 0]
n = [[i, j, n[i, j]] for i in range(n.shape[0]) for j in range(n.shape[1]) if n[i, j] != 0]

pprint(m)
pprint(n)

Output
# [[0, 0, 1],

# [0, 1, 2],

# [1, 0, 3],

# [1, 2, 5],

# [2, 1, 7],

# [3, 0, 4],

# [4, 0, 1],

# [4, 1, 8],

# [4, 2, 2]]

# [[0, 0, 1],

# [0, 1, 4],

# [0, 2, 5],

# [1, 1, 4],

# [1, 3, 1],

# [2, 0, 6],

# [2, 1, 3],

# [2, 2, 9],

# [2, 3, 1]]

# Create key-value pairs
# key = j, values = [(i, mij), ...]
ma = defaultdict(list)
for j in range(len(m)):
    ma[m[j][1]].append((m[j][0], m[j][2]))

# key = j, values = [(k, njk), ...]
na = defaultdict(list)
for j in range(len(n)):
    na[n[j][0]].append((n[j][1], n[j][2]))

pprint(ma)
pprint(na)

Output
# defaultdict(<class 'list'>,

# {0: [(0, 1), (1, 3), (3, 4), (4, 1)],

# 1: [(0, 2), (2, 7), (4, 8)],

# 2: [(1, 5), (4, 2)]})

# defaultdict(<class 'list'>,

# {0: [(0, 1), (1, 4), (2, 5)],

# 1: [(1, 4), (3, 1)],

# 2: [(0, 6), (1, 3), (2, 9), (3, 1)]})

# Reduce step: group by the keys from ma and na for each possible j value.
# For each j we take every (i, mij) from ma and multiply it by every (k, njk) from na.
# The output key is the pair (i, k).
op = defaultdict(list)
for j in range(shape_m[1]):
    if j in ma and j in na:
        for mi in ma[j]:
            for ni in na[j]:
                i = mi[0]
                k = ni[0]
                op[(i, k)].append(mi[1] * ni[1])

pprint(op)

Output
# defaultdict(<class 'list'>,
#             {(0, 0): [1],
#              (0, 1): [4, 8],
#              (0, 2): [5],
#              (0, 3): [2],
#              (1, 0): [3, 30],
#              (1, 1): [12, 15],
#              (1, 2): [15, 45],
#              (1, 3): [5],
#              (2, 1): [28],
#              (2, 3): [7],
#              (3, 0): [4],
#              (3, 1): [16],
#              (3, 2): [20],
#              (4, 0): [1, 12],
#              (4, 1): [4, 32, 6],
#              (4, 2): [5, 18],
#              (4, 3): [8, 2]})

# Group by the keys again and sum the values to get the final result,
# which is the product of m and n.
ans = list()
for k, v in op.items():
    ans.append([k[0], k[1], sum(v)])

print(sorted(ans))

Output
# [[0, 0, 1],

# [0, 1, 12],

# [0, 2, 5],

# [0, 3, 2],

# [1, 0, 33],

# [1, 1, 27],
# [1, 2, 60],

# [1, 3, 5],

# [2, 1, 28],

# [2, 3, 7],

# [3, 0, 4],

# [3, 1, 16],

# [3, 2, 20],

# [4, 0, 13],

# [4, 1, 42],

# [4, 2, 23],

# [4, 3, 10]]

# Compare with results obtained using numpy matrix multiplication

pprint(m_n)

Output
# [[0, 0, 1],

# [0, 1, 12],

# [0, 2, 5],

# [0, 3, 2],

# [1, 0, 33],

# [1, 1, 27],

# [1, 2, 60],

# [1, 3, 5],
# [2, 1, 28],

# [2, 3, 7],

# [3, 0, 4],

# [3, 1, 16],

# [3, 2, 20],

# [4, 0, 13],
# [4, 1, 42],

# [4, 2, 23],

# [4, 3, 10]]


b. Sorting:
I have a mapper that parses the forum nodes and emits the tags associated with each node. My objective is to sort the tag counts and report the top 10.

Code:
import sys

total = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        # Malformed line: report it and skip it
        print("=====================")
        print(line.strip())
        print("=====================")
        continue

    key, value = data_mapped

    if oldKey and oldKey != key:
        print(total, "\t", oldKey)
        total = 0

    oldKey = key
    total += float(value)

if oldKey is not None:
    print(total, "\t", oldKey)
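The mapper that feeds this reducer is not listed here; a minimal Hadoop Streaming sketch of it (the tab-separated field holding the tags and its index are assumptions, not taken from the original job) could be:

import sys

# Emit (tag, 1) for every tag attached to a forum node.
for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 2:
        continue
    for tag in fields[1].split():
        print("{0}\t{1}".format(tag, 1))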


OUTPUT

c. Indexing
The process for implementing an index is quite simple and consists of the following three steps (a minimal sketch of step 1 follows the list):

1. Build an index from your full data set.
2. Get the InputSplit(s) for the indexed value you are looking for.
3. Execute your actual MapReduce job on the indexed InputSplits only.
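As a rough illustration of step 1, a Hadoop Streaming style mapper/reducer pair could build a value-to-location index. This is only a sketch: it assumes the indexed value is the first tab-separated column and that the input file name is exposed to the mapper through the mapreduce_map_input_file environment variable, as Hadoop Streaming normally provides.

# Indexing mapper: emit (indexed_value, input_file)
import os
import sys

input_file = os.environ.get("mapreduce_map_input_file", "unknown")
for line in sys.stdin:
    fields = line.strip().split("\t")
    if fields and fields[0]:
        print("{0}\t{1}".format(fields[0], input_file))

# Indexing reducer: collect the distinct files (splits) that contain each value
import sys
from collections import defaultdict

index = defaultdict(set)
for line in sys.stdin:
    value, location = line.strip().split("\t", 1)
    index[value].add(location)

for value, locations in sorted(index.items()):
    print("{0}\t{1}".format(value, ",".join(sorted(locations))))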


Practical 2
Aim: Distributed cache, map-side join and reduce-side join; building and running a Spark application; word count in Hadoop and Spark; manipulating RDDs.

1. Map phase:

Source code (customer details mapper):


public static class CustsMapper extends Mapper <Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException
{
String record = value.toString();
String[] parts = record.split(",");
context.write(new Text(parts[0]), new Text("cust " + parts[1]));
}
}


2. Reducer phase:
The primary goal of this reduce-side join is to find out how many times a particular customer visited the sports complex and the total amount that customer spent on different sports. Therefore, the final output should have the following format:
Key-value pair: [Name of the customer] (key) – [total amount, frequency of visits] (value)

public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (Text t : values)
        {
            String parts[] = t.toString().split(" ");
            if (parts[0].equals("tnxn"))
            {
                // Transaction record: count the visit and add the amount
                count++;
                total += Float.parseFloat(parts[1]);
            }
            else if (parts[0].equals("cust"))
            {
                // Customer record: remember the customer name
                name = parts[1];
            }
        }
        String str = String.format("%d %f", count, total);
        context.write(new Text(name), new Text(str));
    }
}
Output:
Kristina, 651.05 8

Paige, 706.97 6

…………

• SparkWordCount.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("SparkWordCount")
    sc = SparkContext(conf=conf)
    # get threshold
    threshold = int(sys.argv[2])
    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))
    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)
    # count characters
    charCounts = filtered.flatMap(lambda pair: pair[0]).map(lambda c: (c, 1)).reduceByKey(lambda v1, v2: v1 + v2)
    wordList = charCounts.collect()
    print(repr(wordList)[1:-1])

• pom.xml
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<build>
<sourceDirectory>${project.basedir}</sourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>

</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0.7.0.0.0</version>
<scope>provided</scope>
</dependency>
</dependencies>

• Python on YARN with threshold:

spark-submit --master yarn --deploy-mode client --executor-memory 1g \


--conf "spark.yarn.access.hadoopFileSystems=s3a://<bucket_name>" \
SparkWordCount.py s3a://<bucket_name>/<input_filename> 2

• Result:
scheduler.DAGScheduler: Job 0 finished: collect at /home/user/sparkwordcount/SparkWordCount.py:26, took 11.762150 s
(u'a', 4), (u'c', 1), (u'e', 6), (u'i', 1), (u'o', 2), (u'u', 1), (u'b', 1), (u'f', 1), (u'h', 1), (u'l', 1), (u'n', 4), (u'p', 2), (u'r', 2), (u't', 2), (u'v', 1)


Spark Application Word count in Hadoop and Spark Manipulating RDD:

Spark and pyspark have wonderful support for reliable distribution and parallelization of
programs as well as support for many basic algebraic operations and machine learning
algorithms.

Source code:
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

val rows = sc.parallelize(Seq(
  (0L, Array(1.0, 0.0, 0.0)),
  (1L, Array(0.0, 1.0, 0.0)),
  (2L, Array(0.0, 0.0, 1.0)))
).map { case (i, xs) => IndexedRow(i, Vectors.dense(xs)) }

val indexedRowMatrix = new IndexedRowMatrix(rows)
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
indexedRowMatrix.multiply(localMatrix).rows.collect

Output:


Practical: 3

Aim: Implement K-Means Clustering algorithm using Map-Reduce.

Introduction:

K-Means is a clustering algorithm that partitions a set of data points into k clusters. The k-means clustering algorithm is commonly used on large data sets and, because of its characteristics, it is a good candidate for parallelization. The aim of this project is to implement a framework in Java for performing k-means clustering using Hadoop MapReduce.

Pseudocode:

The classical k-means algorithm works as an iterative process: at each iteration it computes the distance between the data points and the centroids, which are randomly initialized at the beginning of the algorithm.

We decided to design the algorithm as a MapReduce workflow. A single MapReduce stage roughly corresponds to a single iteration of the classical algorithm. As in the classical algorithm, at the first stage the centroids are randomly sampled from the set of data points. The map function takes as input a data point and the list of centroids, computes the distance between the point and each centroid, and emits the point and the closest centroid. The reduce function collects all the points belonging to a cluster, computes the new centroid, and emits it. At the end of each stage the workflow obtains a new approximation of the centroids, which is used for the next iteration. The workflow continues until the distance between each centroid of the previous stage and the corresponding centroid of the current stage drops below a given threshold.

centroids = k randomly sampled points from the dataset
do:
    Map: given a point and the set of centroids,
        calculate the distance between the point and each centroid
        emit the point and the closest centroid
    Reduce: given a centroid and the points belonging to its cluster,
        calculate the new centroid as the arithmetic mean position of the points
        emit the new centroid
    prev_centroids = centroids
    centroids = new_centroids
while distance(prev_centroids, centroids) > threshold
Mapper
The mapper calculates the distance between the data point and each centroid. Then emits the
index of the closest centroid and the data point.
class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)

Combiner
At each stage we need to sum the data points belonging to a cluster to calculate the centroid
(arithmetic mean of points). Since the sum is an associative and commutative function, our


algorithm can benefit from the use of a combiner to reduce the amount of data to be
transmitted to the reducers.
class COMBINER
    method COMBINER(centroid_index, list_of_points)
        point_sum = 0
        point_sum.number_of_points = 0
        for all point in list_of_points:
            point_sum += point
            point_sum.number_of_points += 1
        EMIT(centroid_index, point_sum)
We implemented the combiner only in the Hadoop algorithm.

Reducer

The reducer calculates the new approximation of the centroid and emits it. The result of the
MapReduce stage will be the same even if the combiner is not called by the Hadoop
framework.

class REDUCER
    method REDUCER(centroid_index, list_of_partial_sums)
        point_sum = 0
        point_sum.number_of_points = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            point_sum.number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / point_sum.number_of_points
        EMIT(centroid_index, centroid_value)
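Putting the three roles together, a minimal local Python sketch of one iteration (a plain in-memory simulation of the MapReduce stage, not the actual Hadoop job; points and centroids are assumed to be lists of floats) could be:

from collections import defaultdict

def closest_centroid(point, centroids):
    # Mapper logic: index of the centroid nearest to the point.
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid)) ** 0.5
                 for centroid in centroids]
    return distances.index(min(distances))

def kmeans_iteration(points, centroids):
    # Map: emit (closest_centroid_index, point), here grouped directly.
    groups = defaultdict(list)
    for point in points:
        groups[closest_centroid(point, centroids)].append(point)
    # Reduce: the new centroid is the arithmetic mean of the points in each cluster.
    new_centroids = list(centroids)
    for idx, members in groups.items():
        new_centroids[idx] = [sum(coords) / len(members) for coords in zip(*members)]
    return new_centroids

# Example: two 2-dimensional clusters, k = 2.
points = [[0.0, 0.0], [0.5, 0.2], [9.8, 10.1], [10.2, 9.9]]
print(kmeans_iteration(points, [[1.0, 1.0], [9.0, 9.0]]))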

Implementation
Hadoop implementation: [K-Means Hadoop](/doc/hadoop.md)
Spark implementation: [K-Means Spark](/doc/spark.md)

Validation
2D dataset: to generate the datasets we used the scikit-learn Python library.
We built the dataset using the make_blobs() function of the datasets module to generate points with clustering tendency.
The validation dataset has 1000 2-dimensional points distributed in 4 well-defined clusters.
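A sketch of that generation step (the 1000 points and 4 clusters come from the description above; cluster_std, random_state and the output file name are illustrative choices):

from sklearn.datasets import make_blobs

# 1000 2-dimensional points in 4 well-separated clusters.
points, labels = make_blobs(n_samples=1000, n_features=2, centers=4,
                            cluster_std=1.0, random_state=42)

# Dump the points in the "x, y" format shown in the dataset extract below.
with open("dataset_2d.txt", "w") as f:
    for x, y in points:
        f.write("{0:.4f}, {1:.4f}\n".format(x, y))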

Dataset extract:


-0.6779, 10.0174
-0.5575, 7.8922
-7.2418, -2.1716
5.3906, -0.4539
8.026, 0.4876
-1.277, -0.344
6.7044, -0.5083

Results:
To validate our implementations we used the sklearn KMeans() function of the cluster
module. In the following table our MapReduce and Spark executions are compared with the
benchmark one.

| | sklearn.cluster.KMeans | MapReduce | Spark |
|---|---|---|---|
| Execution time | 25.9312 ms | 144910 ms | 23716 ms |
| Number of iterations | 2 | 6 | 6 |

The three different implementations returned the same centroids:

-6.790731, -1.783768
-0.652342, 0.645576
-0.18393, 9.132927
6.606926, 0.39976

![centroids.png](/doc/img/centroids.png)
Test

We use datasets with 1000, 10000 and 100000 points. For each of them we have a dataset with:
- 3-dimensional points with 7 centroids.
- 3-dimensional points with 13 centroids.
- 7-dimensional points with 7 centroids.
- 7-dimensional points with 13 centroids.

For each dataset we execute the algorithm **10 times**. For each execution the threshold is set to `0.0001` and the maximum number of iterations is set to `50`.


Considering that the k-means algorithm is sensitive to the initial centroids and that we used a random initialization, we report the **average iteration execution time**. Moreover, we report the time needed for the centroids initialization.

Datasets with dimension = 3 and k = 7


**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 25.7797 s | ±0.4326 | 0.3290 |
| 10000 | 26.1982 s | ±0.4064 | 0.2904 |
| 100000 | 26.6614 s | ±0.3595 | 0.2273 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.4238 s | ±0.0385 | 0.0026 |
| 10000 | 1.4534 s | ±0.0757 | 0.0101 |
| 100000 | 1.5838 s | ±0.1532 | 0.0413 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 5.0593 s | ±1.9307 | 6.5924 |
| 10000 | 2.6685 s | ±0.5109 | 0.4590 |
| 100000 | 8.2463 s | ±1.6511 | 4.7950 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.8501 s | ±0.9756 | 1.6739 |
| 10000 | 3.5767 s | ±0.5032 | 0.4454 |
| 100000 | 5.0867 s | ±0.922 | 1.4948 |


![comparison](/doc/img/3_7.png)

Datasets with dimension = 3 and k = 13

**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 31.9421 s | ±0.7764 | 1.0602 |
| 10000 | 29.5620 s | ±1.7928 | 5.6524 |
| 100000 | 30.9473 s | ±1.0163 | 1.8167 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.3553 s | ±0.0545 | 0.0052 |
| 10000 | 1.3376 s | ±0.3198 | 0.1798 |
| 100000 | 1.5994 s | ±0.2061 | 0.0747 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.5078 s | ±0.7983 | 1.1207 |
| 10000 | 3.0669 s | ±0.6174 | 0.6704 |
| 100000 | 9.9684 s | ±1.8765 | 6.1925 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.2371 s | ±1.1114 | 2.1725 |
| 10000 | 4.0950 s | ±0.9565 | 1.6088 |
| 100000 | 5.0730 s | ±1.33 | 3.1110 |

![comparison](/doc/img/3_13.png)


Datasets with dimension = 7 and k = 7

**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 28.0791 s | ±0.9341 | 6.0496 |
| 10000 | 26.2471 s | ±0.5073 | 0.4526 |
| 100000 | 26.4312 s | ±0.853 | 1.2795 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.5372 s | ±0.0997 | 0.0689 |
| 10000 | 1.6277 s | ±0.2277 | 0.0912 |
| 100000 | 1.5396 s | ±0.1354 | 0.0323 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.9319 s | ±1.2778 | 2.8716 |
| 10000 | 3.2115 s | ±1.2891 | 2.9225 |
| 100000 | 9.3602 s | ±2.9405 | 15.2063 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.4854 s | ±1.2927 | 2.9390 |
| 10000 | 4.1256 s | ±1.0579 | 1.9684 |
| 100000 | 4.7869 s | ±0.9388 | 1.5500 |

![comparison](/doc/img/7_7.png)

Datasets with dimension = 7 and k = 13


**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 32.1730 s | ±0.7767 | 1.0610 |
| 10000 | 28.8286 s | ±1.2703 | 6.9986 |
| 100000 | 32.9144 s | ±0.4695 | 0.3878 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.5413 s | ±0.1955 | 0.0672 |
| 10000 | 1.5637 s | ±0.1316 | 0.0751 |
| 100000 | 1.7045 s | ±0.1903 | 0.0637 |

**Spark:**
*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.5809 s | ±0.6098 | 0.6540 |
| 10000 | 2.8187 s | ±0.421 | 0.3117 |
| 100000 | 14.9351 s | ±1.3521 | 3.2151 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.0291 s | ±1.0876 | 2.0805 |
| 10000 | 3.9748 s | ±1.3814 | 3.3563 |
| 100000 | 4.5872 s | ±0.1255 | 0.0277 |


Practical: 4

Aim: Implement any one frequent itemset algorithm using MapReduce.
Code:
import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Project {
    static int ntrans = 0;


    public static class MapperClass extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer items = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");
            LinkedList<String> list = new LinkedList<String>();
            LinkedList<String> clist = new LinkedList<String>();
            LinkedList<String> templist1 = new LinkedList<String>();
            LinkedList<String> templist2 = new LinkedList<String>();
            ntrans++;
            String str = "";
            int f = 0;
            int count = 0, i = 0, j = 0, nitem = 0;

            // Emit every single item (candidate 1-itemsets).
            while (items.hasMoreTokens()) {
                word.set(items.nextToken());
                nitem++;
                count = 0;
                context.write(word, one);
            }

            StringTokenizer items2 = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");
            while (items2.hasMoreTokens()) {
                list.add(items2.nextToken());
            }

            // Generate and emit candidate itemsets of increasing size.
            count = 1;
            clist.clear();
            while (count < nitem) {
                count = count + 1;
                for (i = 0; i < (list.size() - 1); i++) {
                    items2 = new StringTokenizer(list.get(i));
                    while (items2.hasMoreTokens()) {
                        templist1.add(items2.nextToken());
                    }
                    for (j = i + 1; j < list.size(); j++) {
                        items2 = new StringTokenizer(list.get(j));
                        while (items2.hasMoreTokens()) {
                            templist2.add(items2.nextToken());
                        }
                        // Two itemsets can be joined only if they share the same prefix.
                        f = 0;
                        for (int k = 0; k < (templist1.size() - 1); k++) {
                            if (!(templist1.get(k).equals(templist2.get(k)))) {
                                f = 1;
                                break;
                            }
                        }
                        if (f == 0) {
                            str = list.get(i) + " " + templist2.get(templist2.size() - 1);
                            clist.add(str);
                            items2 = new StringTokenizer(str, "\n");
                            if (items2.hasMoreTokens())
                                word.set(items2.nextToken());
                            context.write(word, one);
                        }
                        templist2.clear();
                    }
                    templist1.clear();
                }
                list.clear();
                for (int k = 0; k < clist.size(); k++)
                    list.add(clist.get(k));
                clist.clear();
            }
        }
    }
    public static class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // Emit only itemsets whose support count exceeds the threshold (7).
            if (sum > 7)
                context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "frequent itemsets");
        job.setJarByClass(Project.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }
    }
}
Input
Item1 Item2 Item3 Item4 Item5 Item7 Item8 Item9
Item8 Item1 Item2 Item9
Item3 Item2 Item5 Item6 Item7 Item8 Item9
Item3 Item7 Item8
Item1 Item2 Item3 Item7 Item8 Item3
Item4 Item5 Item6 Item7
Item2 Item5 Item6 Item7 Item8 Item9
Item1 Item2 Item4 Item5 Item6 Item7 Item8
Item1 Item2 Item3 Item5
Item2 Item3 Item6 Item7 Item9
Item2 Item7 Item1 Item2 Item5 Item6
Item3 Item6 Item7 Item8
Item1 Item2 Item4 Item9
Item4 Item7 Item8 Item9
Item1 Item2 Item6 Item7 Item8
Item3 Item5 Item6 Item8
Item1 Item3 Item4 Item5 Item7 Item8
Item1 Item4 Item6
Item3 Item6 Item8


Item1 Item2 Item4 Item5 Item3


Item5 Item3 Item5 Item8 Item9
Item1 Item3 Item5 Item8
Item1 Item4 Item5 Item6
Item2 Item3 Item4 Item9
Item1 Item2 Item3 Item5 Item7 Item8
Item4 Item7 Item8 Item9
Item1 Item2 Item3 Item4 Item5 Item6 Item8 Item9
Item1 Item4 Item5 Item7
Item1 Item2 Item5 Item6 Item7 Item8
Item2 Item3 Item7
Item4 Item8 Item9
Item6 Item7 Item1 Item3 Item5 Item6 Item9

Output:
{Item1 Item2 Item3}
{Item4 Item5 Item6}
{Item3 Item5 Item6}

PRACTICAL-5

Aim: Implementation of matrix algorithms in Spark SQL programming.

Spark and pyspark have wonderful support for reliable distribution and parallelization of
programs as well as support for many basic algebraic operations and machine learning
algorithms.

Code:
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

val rows = sc.parallelize(Seq(
  (0L, Array(1.0, 0.0, 0.0)),
  (1L, Array(0.0, 1.0, 0.0)),
  (2L, Array(0.0, 0.0, 1.0)))
).map { case (i, xs) => IndexedRow(i, Vectors.dense(xs)) }

val indexedRowMatrix = new IndexedRowMatrix(rows)
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
indexedRowMatrix.multiply(localMatrix).rows.collect

Output:
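An equivalent PySpark sketch (a rough translation of the Scala snippet above using the pyspark.mllib distributed-matrix API; it assumes a SparkContext named sc is already available, e.g. in the pyspark shell):

from pyspark.mllib.linalg import Vectors, Matrices
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# 3x3 identity matrix stored as a distributed IndexedRowMatrix.
rows = sc.parallelize([
    IndexedRow(0, Vectors.dense([1.0, 0.0, 0.0])),
    IndexedRow(1, Vectors.dense([0.0, 1.0, 0.0])),
    IndexedRow(2, Vectors.dense([0.0, 0.0, 1.0])),
])
indexed_row_matrix = IndexedRowMatrix(rows)

# 3x2 local matrix, values in column-major order as in the Scala example.
local_matrix = Matrices.dense(3, 2, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Multiplying by the identity simply returns the local matrix's rows.
print(indexed_row_matrix.multiply(local_matrix).rows.collect())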

PRACTICAL-6

Aim: Create a data pipeline based on messaging using PySpark and Hive - COVID-19 analysis.

Data Pipeline:

Step 1: Data Ingestion (Kafka)


- Produce COVID-19 data to the Kafka topic covid_data (a minimal producer sketch follows this list)
- Use Kafka Connect (e.g., FilePulse) to ingest data from CSV files or APIs
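A producer sketch for the first bullet, using the kafka-python package (the package choice and the field names in the sample record are assumptions; any Kafka client and any consistent schema would do):

import json
from kafka import KafkaProducer

# Serialize each record as a JSON string before sending it to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

sample_record = {"date": "2020-01-01", "country": "China", "cases": 41}
producer.send("covid_data", sample_record)
producer.flush()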

Step 2: Data Processing (PySpark)

- Consume covid_data topic using PySpark Structured Streaming


- Clean and transform data (e.g., handle missing values, convert data types)
- Aggregate data (e.g., daily cases, cumulative deaths)

Step 3: Data Storage (Hive)

- Write processed data to Hive tables (e.g., cases, deaths, recoveries)


- Use Hive's partitioning and clustering features for efficient querying (a partitioned-write sketch follows this list)
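A sketch of a partitioned write from PySpark (separate from the pipeline code below; the table name covid19.cases_by_country, the partition column and the inline sample data are all hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("COVID-19 Hive write") \
    .enableHiveSupport() \
    .getOrCreate()

# Stand-in for the aggregated (date, country, total_cases) DataFrame
# produced by the processing step.
df = spark.createDataFrame(
    [("2020-01-01", "China", 41), ("2020-01-01", "USA", 1)],
    ["date", "country", "total_cases"],
)

# Partitioned write into a Hive table; new partitions are added per country.
df.write \
    .mode("append") \
    .partitionBy("country") \
    .format("hive") \
    .saveAsTable("covid19.cases_by_country")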

Step 4: Data Analysis (PySpark)

- Query Hive tables using PySpark SQL


- Perform analysis (e.g., trends, visualizations) using PySpark SQL together with Python plotting libraries (e.g., Matplotlib, Seaborn)

PySpark code:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, to_date, sum as sum_
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create SparkSession (Hive support is needed to write and read Hive tables)
spark = SparkSession.builder \
    .appName("COVID-19 Analysis") \
    .enableHiveSupport() \
    .getOrCreate()

# Schema of the JSON messages published on the covid_data topic
schema = StructType([
    StructField("date", StringType()),
    StructField("country", StringType()),
    StructField("cases", IntegerType()),
])

# Consume Kafka topic
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "covid_data") \
    .load()

# Process data: parse the JSON payload, fix the types and aggregate daily cases
df = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*") \
    .withColumn("date", to_date(col("date"))) \
    .groupBy("date", "country") \
    .agg(sum_("cases").alias("total_cases"))

# Write each micro-batch of the streaming aggregation to the Hive table
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("overwrite").saveAsTable("covid19.cases")

query = df.writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "/tmp/covid19_checkpoint") \
    .foreachBatch(write_to_hive) \
    .start()

# The stream keeps running in the background; once data has landed in Hive,
# query the table
cases_df = spark.sql("SELECT * FROM covid19.cases")

# Perform analysis
cases_df.groupBy("country").agg(avg("total_cases")).show()

Hive Schema:

CREATE DATABASE covid19;

CREATE TABLE cases (


date DATE,
country STRING,
total_cases INT
);

CREATE TABLE deaths (


date DATE,
country STRING,
total_deaths INT
);

CREATE TABLE recoveries (


date DATE,
country STRING,
total_recoveries INT
);
Output:
Daily Cases Trend

+----------+---------+-----------+
| date|country |total_cases|
+----------+---------+-----------+
|2020-01-01|China | 41 |
|2020-01-02|China | 59 |
|2020-01-03|China | 78 |
|2020-01-01|USA | 1 |
|2020-01-02|USA | 2 |


|2020-01-03|USA | 5 |
+----------+---------+-----------+

Total Cases by Country

+---------+-----------+
|country |total_cases|
+---------+-----------+
|China | 78541 |
|USA | 33671 |
|India | 22311 |
|Brazil | 19411 |
|Russia | 17411 |
+---------+-----------+

Daily Deaths Trend

+----------+---------+-----------+
| date|country |total_deaths|
+----------+---------+-----------+
|2020-01-01|China | 1 |
|2020-01-02|China | 2 |
|2020-01-03|China | 3 |
|2020-01-01|USA | 0 |
|2020-01-02|USA | 1 |
|2020-01-03|USA | 2 |
+----------+---------+-----------+
Total Deaths by Country

+---------+-----------+
|country |total_deaths|
+---------+-----------+
|China | 4632 |
|USA | 1954 |
|India | 1341 |
|Brazil | 1134 |
|Russia | 944 |
+---------+-----------+


Recovery Rate by Country

+---------+-------------+
|country |recovery_rate|
+---------+-------------+
|China | 85.7% |
|USA | 74.2% |
|India | 67.1% |
|Brazil | 63.2% |
|Russia | 59.4% |
+---------+-------------+
