BIG DATA ANALYTICS (3170722)

Practical 1
Aim: Implement the following using MapReduce:

a. Matrix multiplication
b. Sorting
c. Indexing
a. Matrix multiplication
MapReduce

We will write Map and Reduce functions to process the input files. The Map function produces key-value pairs from the input data, as described in Algorithm 1. The Reduce function takes the output of the Map function, performs the calculations, and produces the final key-value pairs.
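As a compact sketch of that idea (assuming the matrices are already available as sparse (i, j, value) triples, which is exactly the representation built later in this practical), the two functions could look like the following; the full step-by-step version follows under Source code.

from collections import defaultdict

def map_matrices(m_triples, n_triples):
    # Map: key on the shared dimension j and tag each value with its origin matrix.
    for i, j, v in m_triples:
        yield j, ("m", i, v)
    for j, k, v in n_triples:
        yield j, ("n", k, v)

def reduce_by_j(pairs):
    # Reduce: for every j, multiply each m value by each n value and
    # accumulate the partial products under the output key (i, k).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    result = defaultdict(float)
    for j, values in grouped.items():
        ms = [(i, v) for tag, i, v in values if tag == "m"]
        ns = [(k, v) for tag, k, v in values if tag == "n"]
        for i, mv in ms:
            for k, nv in ns:
                result[(i, k)] += mv * nv
    return result

# Tiny example with 2x2 matrices given as sparse triples (zero entries omitted).
m_t = [(0, 0, 1), (0, 1, 2), (1, 0, 3)]
n_t = [(0, 0, 4), (1, 1, 5)]
print(dict(reduce_by_j(map_matrices(m_t, n_t))))  # {(0, 0): 4.0, (0, 1): 10.0, (1, 0): 12.0}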

Source code:
import numpy as np
from pprint import pprint
from collections import defaultdict

m = np.matrix([
    [1, 2, 0],
    [3, 0, 5],
    [0, 7, 0],
    [4, 0, 0],
    [1, 8, 2]
])
n = np.matrix([
    [1, 4, 5, 0],
    [0, 4, 0, 1],
    [6, 3, 9, 1]
])

shape_m = m.shape
shape_n = n.shape

m_n = m * n
pprint(m_n)

Output:
# matrix([[ 1, 12,  5,  2],
#         [33, 27, 60,  5],
#         [ 0, 28,  0,  7],
#         [ 4, 16, 20,  0],
#         [13, 42, 23, 10]])

m_n = [[i, j, m_n[i,j]] for i in range(m_n.shape[0]) for j in range(m_n.shape[1]) if m_n[i,j] != 0]

pprint(m_n)

Output
# [[0, 0, 1],
#  [0, 1, 12],
#  [0, 2, 5],
#  [0, 3, 2],
#  [1, 0, 33],
#  [1, 1, 27],
#  [1, 2, 60],
#  [1, 3, 5],
#  [2, 1, 28],
#  [2, 3, 7],
#  [3, 0, 4],
#  [3, 1, 16],
#  [3, 2, 20],
#  [4, 0, 13],
#  [4, 1, 42],
#  [4, 2, 23],
#  [4, 3, 10]]
# Represent the matrices as (i, j, m[i][j]) and (i, j, n[i][j]) triples
m = [[i, j, m[i, j]] for i in range(m.shape[0]) for j in range(m.shape[1]) if m[i, j] != 0]
n = [[i, j, n[i, j]] for i in range(n.shape[0]) for j in range(n.shape[1]) if n[i, j] != 0]

pprint(m)
pprint(n)

Output
# [[0, 0, 1],

# [0, 1, 2],

# [1, 0, 3],

# [1, 2, 5],

# [2, 1, 7],

# [3, 0, 4],

# [4, 0, 1],

# [4, 1, 8],

# [4, 2, 2]]

# [[0, 0, 1],

# [0, 1, 4],

# [0, 2, 5],

# [1, 1, 4],

# [1, 3, 1],

# [2, 0, 6],

# [2, 1, 3],

# [2, 2, 9],

# [2, 3, 1]]

# Create key-value pairs
# key = j, values = [(i, mij), ...]
ma = defaultdict(list)
for j in range(len(m)):
    ma[m[j][1]].append((m[j][0], m[j][2]))

# key = j, values = [(k, njk), ...]
na = defaultdict(list)
for j in range(len(n)):
    na[n[j][0]].append((n[j][1], n[j][2]))

pprint(ma)
pprint(na)

Output
# defaultdict(<class 'list'>,

# {0: [(0, 1), (1, 3), (3, 4), (4, 1)],

# 1: [(0, 2), (2, 7), (4, 8)],

# 2: [(1, 5), (4, 2)]})

# defaultdict(<class 'list'>,

# {0: [(0, 1), (1, 4), (2, 5)],

# 1: [(1, 4), (3, 1)],

# 2: [(0, 6), (1, 3), (2, 9), (3, 1)]})

# Reduce step: group by the keys from ma and na for each possible j value.
# For each j we take every (i, mij) from ma and multiply it by every (k, njk) from na.
# The output key is the pair (i, k).
op = defaultdict(list)
for j in range(shape_m[1]):
    if j in ma and j in na:
        for mi in ma[j]:
            for ni in na[j]:
                i = mi[0]
                k = ni[0]
                op[(i, k)].append(mi[1] * ni[1])

pprint(op)

Output
# defaultdict(<class 'list'>,
#             {(0, 0): [1],
#              (0, 1): [4, 8],
#              (0, 2): [5],
#              (0, 3): [2],
#              (1, 0): [3, 30],
#              (1, 1): [12, 15],
#              (1, 2): [15, 45],
#              (1, 3): [5],
#              (2, 1): [28],
#              (2, 3): [7],
#              (3, 0): [4],
#              (3, 1): [16],
#              (3, 2): [20],
#              (4, 0): [1, 12],
#              (4, 1): [4, 32, 6],
#              (4, 2): [5, 18],
#              (4, 3): [8, 2]})

# Group by the keys again and sum the values to get the final result,
# which is the product of m and n.
ans = list()
for k, v in op.items():
    ans.append([k[0], k[1], sum(v)])

print(sorted(ans))

Output
# [[0, 0, 1],

# [0, 1, 12],

# [0, 2, 5],

# [0, 3, 2],

# [1, 0, 33],

# [1, 1, 27],
# [1, 2, 60],

# [1, 3, 5],

# [2, 1, 28],

# [2, 3, 7],

# [3, 0, 4],

# [3, 1, 16],

# [3, 2, 20],

# [4, 0, 13],

# [4, 1, 42],

# [4, 2, 23],

# [4, 3, 10]]

# Compare with results obtained using numpy matrix multiplication

pprint(m_n)

Output
# [[0, 0, 1],

# [0, 1, 12],

# [0, 2, 5],

# [0, 3, 2],

# [1, 0, 33],

# [1, 1, 27],

# [1, 2, 60],

# [1, 3, 5],
# [2, 1, 28],

# [2, 3, 7],

# [3, 0, 4],

# [3, 1, 16],

# [3, 2, 20],

# [4, 0, 13],
# [4, 1, 42],

# [4, 2, 23],

# [4, 3, 10]]


b. Sorting:
I have a mapper that parses the forum nodes and emits the tags associated with each node. My objective is to sort the tag counts and report the top 10.

Code:
import sys

total = 0
oldKey = None

for line in sys.stdin:
    data_mapped = line.strip().split("\t")
    if len(data_mapped) != 2:
        # Malformed line: report it and skip it
        print("=====================")
        print(line.strip())
        print("=====================")
        continue

    key, value = data_mapped

    if oldKey and oldKey != key:
        print(total, "\t", oldKey)
        total = 0

    oldKey = key
    total += float(value)

if oldKey is not None:
    print(total, "\t", oldKey)
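The mapper that feeds this reducer is not listed here; a minimal Hadoop Streaming sketch of it (the tab-separated field holding the tags and its index are assumptions, not taken from the original job) could be:

import sys

# Emit (tag, 1) for every tag attached to a forum node.
for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 2:
        continue
    for tag in fields[1].split():
        print("{0}\t{1}".format(tag, 1))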


OUTPUT

c. Indexing
The process for implementing an index is quite simple and consists of the following three steps (a minimal sketch of step 1 follows the list):

1. Build an index from your full data set.
2. Get the InputSplit(s) for the indexed value you are looking for.
3. Execute your actual MapReduce job on the indexed InputSplits only.
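As a rough illustration of step 1, a Hadoop Streaming style mapper/reducer pair could build a value-to-location index. This is only a sketch: it assumes the indexed value is the first tab-separated column and that the input file name is exposed to the mapper through the mapreduce_map_input_file environment variable, as Hadoop Streaming normally provides.

# Indexing mapper: emit (indexed_value, input_file)
import os
import sys

input_file = os.environ.get("mapreduce_map_input_file", "unknown")
for line in sys.stdin:
    fields = line.strip().split("\t")
    if fields and fields[0]:
        print("{0}\t{1}".format(fields[0], input_file))

# Indexing reducer: collect the distinct files (splits) that contain each value
import sys
from collections import defaultdict

index = defaultdict(set)
for line in sys.stdin:
    value, location = line.strip().split("\t", 1)
    index[value].add(location)

for value, locations in sorted(index.items()):
    print("{0}\t{1}".format(value, ",".join(sorted(locations))))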


Practical 2
Aim: Distributed cache, map-side join and reduce-side join; building and running a Spark application; word count in Hadoop and Spark; manipulating RDDs.

1. Map phase:

Source code (customer details mapper):


public static class CustsMapper extends Mapper <Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException
{
String record = value.toString();
String[] parts = record.split(",");
context.write(new Text(parts[0]), new Text("cust " + parts[1]));
}
}


2. Reducer phase:
The primary goal of this reduce-side join is to find out how many times a particular customer visited the sports complex and the total amount that customer spent on different sports. Therefore, the final output should have the following format:
Key-value pair: [Name of the customer] (key) – [total amount, frequency of visits] (value)

public static class ReduceJoinReducer extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        String name = "";
        double total = 0.0;
        int count = 0;
        for (Text t : values)
        {
            String parts[] = t.toString().split(" ");
            if (parts[0].equals("tnxn"))
            {
                // Transaction record: count the visit and add the amount
                count++;
                total += Float.parseFloat(parts[1]);
            }
            else if (parts[0].equals("cust"))
            {
                // Customer record: remember the customer name
                name = parts[1];
            }
        }
        String str = String.format("%d %f", count, total);
        context.write(new Text(name), new Text(str));
    }
}
Output:
Kristina, 651.05 8

Paige, 706.97 6

…………

• SparkWordCount.py

import sys
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("SparkWordCount")
    sc = SparkContext(conf=conf)
    # get threshold
    threshold = int(sys.argv[2])
    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))
    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)
    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)
    # count characters
    charCounts = filtered.flatMap(lambda pair: pair[0]).map(lambda c: (c, 1)).reduceByKey(lambda v1, v2: v1 + v2)
    wordList = charCounts.collect()
    print(repr(wordList)[1:-1])

• pom.xml
<pluginRepositories>
<pluginRepository>
<id>scala-tools.org</id>
<name>Scala-tools Maven2 Repository</name>
<url>http://scala-tools.org/repo-releases</url>
</pluginRepository>
</pluginRepositories>
<build>
<sourceDirectory>${project.basedir}</sourceDirectory>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>

</executions>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.12</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.0.7.0.0.0</version>
<scope>provided</scope>
</dependency>
</dependencies>

• Python on YARN with threshold:

spark-submit --master yarn --deploy-mode client --executor-memory 1g \


--conf "spark.yarn.access.hadoopFileSystems=s3a://<bucket_name>" \
SparkWordCount.py s3a://<bucket_name>/<input_filename> 2

• Result:
scheduler.DAGScheduler: Job 0 finished: collect at /home/user/sparkwordcount/SparkWordCount.py:26, took 11.762150 s
(u'a', 4), (u'c', 1), (u'e', 6), (u'i', 1), (u'o', 2), (u'u', 1), (u'b', 1), (u'f', 1), (u'h', 1), (u'l', 1), (u'n', 4), (u'p', 2), (u'r', 2), (u't', 2), (u'v', 1)


Spark Application Word count in Hadoop and Spark Manipulating RDD:

Spark and pyspark have wonderful support for reliable distribution and parallelization of
programs as well as support for many basic algebraic operations and machine learning
algorithms.

Source code:
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

val rows = sc.parallelize(Seq(
  (0L, Array(1.0, 0.0, 0.0)),
  (1L, Array(0.0, 1.0, 0.0)),
  (2L, Array(0.0, 0.0, 1.0)))
).map { case (i, xs) => IndexedRow(i, Vectors.dense(xs)) }

val indexedRowMatrix = new IndexedRowMatrix(rows)
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
indexedRowMatrix.multiply(localMatrix).rows.collect

Output:


Practical: 3

Aim: Implement K-Means Clustering algorithm using Map-Reduce.

Introduction:

K-Means is a clustering algorithm that partitions a set of data points into k clusters. The k-means clustering algorithm is commonly used on large data sets and, because of its characteristics, it is a good candidate for parallelization. The aim of this project is to implement a framework in Java for performing k-means clustering using Hadoop MapReduce.

Pseudocode:

The classical k-means algorithm works as an iterative process: at each iteration it computes the distance between the data points and the centroids, which are randomly initialized at the beginning of the algorithm.

We decided to design the algorithm as a MapReduce workflow. A single MapReduce stage roughly corresponds to a single iteration of the classical algorithm. As in the classical algorithm, at the first stage the centroids are randomly sampled from the set of data points. The map function takes as input a data point and the list of centroids, computes the distance between the point and each centroid, and emits the point and the closest centroid. The reduce function collects all the points belonging to a cluster, computes the new centroid, and emits it. At the end of each stage the workflow obtains a new approximation of the centroids, which is used for the next iteration. The workflow continues until the distance between each centroid of the previous stage and the corresponding centroid of the current stage drops below a given threshold.

centroids = k randomly sampled points from the dataset
do:
    Map: given a point and the set of centroids,
        calculate the distance between the point and each centroid
        emit the point and the closest centroid
    Reduce: given a centroid and the points belonging to its cluster,
        calculate the new centroid as the arithmetic mean position of the points
        emit the new centroid
    prev_centroids = centroids
    centroids = new_centroids
while distance(prev_centroids, centroids) > threshold
Mapper
The mapper calculates the distance between the data point and each centroid. Then emits the
index of the closest centroid and the data point.
class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)

Combiner
At each stage we need to sum the data points belonging to a cluster to calculate the centroid
(arithmetic mean of points). Since the sum is an associative and commutative function, our


algorithm can benefit from the use of a combiner to reduce the amount of data to be
transmitted to the reducers.
class COMBINER
    method COMBINER(centroid_index, list_of_points)
        point_sum = 0
        point_sum.number_of_points = 0
        for all point in list_of_points:
            point_sum += point
            point_sum.number_of_points += 1
        EMIT(centroid_index, point_sum)
We implemented the combiner only in the Hadoop algorithm.

Reducer

The reducer calculates the new approximation of the centroid and emits it. The result of the
MapReduce stage will be the same even if the combiner is not called by the Hadoop
framework.

class REDUCER
    method REDUCER(centroid_index, list_of_partial_sums)
        point_sum = 0
        point_sum.number_of_points = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            point_sum.number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / point_sum.number_of_points
        EMIT(centroid_index, centroid_value)
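Putting the three roles together, a minimal local Python sketch of one iteration (a plain in-memory simulation of the MapReduce stage, not the actual Hadoop job; points and centroids are assumed to be lists of floats) could be:

from collections import defaultdict

def closest_centroid(point, centroids):
    # Mapper logic: index of the centroid nearest to the point.
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid)) ** 0.5
                 for centroid in centroids]
    return distances.index(min(distances))

def kmeans_iteration(points, centroids):
    # Map: emit (closest_centroid_index, point), here grouped directly.
    groups = defaultdict(list)
    for point in points:
        groups[closest_centroid(point, centroids)].append(point)
    # Reduce: the new centroid is the arithmetic mean of the points in each cluster.
    new_centroids = list(centroids)
    for idx, members in groups.items():
        new_centroids[idx] = [sum(coords) / len(members) for coords in zip(*members)]
    return new_centroids

# Example: two 2-dimensional clusters, k = 2.
points = [[0.0, 0.0], [0.5, 0.2], [9.8, 10.1], [10.2, 9.9]]
print(kmeans_iteration(points, [[1.0, 1.0], [9.0, 9.0]]))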

Implementation
Hadoop implementation: [K-Means Hadoop](/doc/hadoop.md)
Spark implementation: [K-Means Spark](/doc/spark.md)

Validation
2D dataset: to generate the datasets we used the scikit-learn Python library.
We built the dataset using the make_blobs() function of the datasets module to generate points with clustering tendency.
The validation dataset has 1000 2-dimensional points distributed in 4 well-defined clusters.
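A sketch of that generation step (the 1000 points and 4 clusters come from the description above; cluster_std, random_state and the output file name are illustrative choices):

from sklearn.datasets import make_blobs

# 1000 2-dimensional points in 4 well-separated clusters.
points, labels = make_blobs(n_samples=1000, n_features=2, centers=4,
                            cluster_std=1.0, random_state=42)

# Dump the points in the "x, y" format shown in the dataset extract below.
with open("dataset_2d.txt", "w") as f:
    for x, y in points:
        f.write("{0:.4f}, {1:.4f}\n".format(x, y))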

Dataset extract:


-0.6779, 10.0174
-0.5575, 7.8922
-7.2418, -2.1716
5.3906, -0.4539
8.026, 0.4876
-1.277, -0.344
6.7044, -0.5083

Results:
To validate our implementations we used the sklearn KMeans() function of the cluster
module. In the following table our MapReduce and Spark executions are compared with the
benchmark one.

| | sklearn.cluster.KMeans | MapReduce | Spark |
|---|---|---|---|
| Execution time | 25.9312 ms | 144910 ms | 23716 ms |
| Number of iterations | 2 | 6 | 6 |

The three different implementations returned the same centroids:

-6.790731, -1.783768
-0.652342, 0.645576
-0.18393, 9.132927
6.606926, 0.39976

![centroids.png](/doc/img/centroids.png)
Test

We use datasets with 1000, 10000 and 100000 points. For each of them we have a dataset with:
- 3-dimensional points with 7 centroids.
- 3-dimensional points with 13 centroids.
- 7-dimensional points with 7 centroids.
- 7-dimensional points with 13 centroids.

For each dataset we execute the algorithm **10 times**. For each execution the threshold is set to `0.0001` and the maximum number of iterations is set to `50`.


Considering that the k-means algorithm is sensitive to the initial centroids and that we used a random initialization, we report the **average iteration execution time**. Moreover, we report the time needed for the centroids initialization.

Datasets with dimension = 3 and k = 7


**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 25.7797 s | ±0.4326 | 0.3290 |
| 10000 | 26.1982 s | ±0.4064 | 0.2904 |
| 100000 | 26.6614 s | ±0.3595 | 0.2273 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.4238 s | ±0.0385 | 0.0026 |
| 10000 | 1.4534 s | ±0.0757 | 0.0101 |
| 100000 | 1.5838 s | ±0.1532 | 0.0413 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 5.0593 s | ±1.9307 | 6.5924 |
| 10000 | 2.6685 s | ±0.5109 | 0.4590 |
| 100000 | 8.2463 s | ±1.6511 | 4.7950 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.8501 s | ±0.9756 | 1.6739 |
| 10000 | 3.5767 s | ±0.5032 | 0.4454 |
| 100000 | 5.0867 s | ±0.922 | 1.4948 |


![comparison](/doc/img/3_7.png)

Datasets with dimension = 3 and k = 13

**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 31.9421 s | ±0.7764 | 1.0602 |
| 10000 | 29.5620 s | ±1.7928 | 5.6524 |
| 100000 | 30.9473 s | ±1.0163 | 1.8167 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.3553 s | ±0.0545 | 0.0052 |
| 10000 | 1.3376 s | ±0.3198 | 0.1798 |
| 100000 | 1.5994 s | ±0.2061 | 0.0747 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.5078 s | ±0.7983 | 1.1207 |
| 10000 | 3.0669 s | ±0.6174 | 0.6704 |
| 100000 | 9.9684 s | ±1.8765 | 6.1925 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.2371 s | ±1.1114 | 2.1725 |
| 10000 | 4.0950 s | ±0.9565 | 1.6088 |
| 100000 | 5.0730 s | ±1.33 | 3.1110 |

![comparison](/doc/img/3_13.png)


Datasets with dimension = 7 and k = 7

**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 28.0791 s | ±0.9341 | 6.0496 |
| 10000 | 26.2471 s | ±0.5073 | 0.4526 |
| 100000 | 26.4312 s | ±0.853 | 1.2795 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.5372 s | ±0.0997 | 0.0689 |
| 10000 | 1.6277 s | ±0.2277 | 0.0912 |
| 100000 | 1.5396 s | ±0.1354 | 0.0323 |

**Spark:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.9319 s | ±1.2778 | 2.8716 |
| 10000 | 3.2115 s | ±1.2891 | 2.9225 |
| 100000 | 9.3602 s | ±2.9405 | 15.2063 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.4854 s | ±1.2927 | 2.9390 |
| 10000 | 4.1256 s | ±1.0579 | 1.9684 |
| 100000 | 4.7869 s | ±0.9388 | 1.5500 |

![comparison](/doc/img/7_7.png)

Datasets with dimension = 7 and k = 13


**Hadoop:**

*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 32.1730 s | ±0.7767 | 1.0610 |
| 10000 | 28.8286 s | ±1.2703 | 6.9986 |
| 100000 | 32.9144 s | ±0.4695 | 0.3878 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 1.5413 s | ±0.1955 | 0.0672 |
| 10000 | 1.5637 s | ±0.1316 | 0.0751 |
| 100000 | 1.7045 s | ±0.1903 | 0.0637 |

**Spark:**
*Iteration times:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 3.5809 s | ±0.6098 | 0.6540 |
| 10000 | 2.8187 s | ±0.421 | 0.3117 |
| 100000 | 14.9351 s | ±1.3521 | 3.2151 |

*Centroids initialization:*

| Number of samples | Average | Confidence | Variance |
|---|---|---|---|
| 1000 | 4.0291 s | ±1.0876 | 2.0805 |
| 10000 | 3.9748 s | ±1.3814 | 3.3563 |
| 100000 | 4.5872 s | ±0.1255 | 0.0277 |


Practical: 4

Aim: Implement any one frequent itemset algorithm using MapReduce.
Code:
import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Project {
    static int ntrans = 0;


    public static class MapperClass extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer items = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");
            LinkedList<String> list = new LinkedList<String>();
            LinkedList<String> clist = new LinkedList<String>();
            LinkedList<String> templist1 = new LinkedList<String>();
            LinkedList<String> templist2 = new LinkedList<String>();
            ntrans++;
            String str = "";
            int f = 0;
            int count = 0, i = 0, j = 0, nitem = 0;

            // Emit every single item (candidate 1-itemsets).
            while (items.hasMoreTokens()) {
                word.set(items.nextToken());
                nitem++;
                count = 0;
                context.write(word, one);
            }

            StringTokenizer items2 = new StringTokenizer(value.toString(), " \t\n\r\f,.:;?![]'");
            while (items2.hasMoreTokens()) {
                list.add(items2.nextToken());
            }

            // Generate and emit candidate itemsets of increasing size.
            count = 1;
            clist.clear();
            while (count < nitem) {
                count = count + 1;
                for (i = 0; i < (list.size() - 1); i++) {
                    items2 = new StringTokenizer(list.get(i));
                    while (items2.hasMoreTokens()) {
                        templist1.add(items2.nextToken());
                    }
                    for (j = i + 1; j < list.size(); j++) {
                        items2 = new StringTokenizer(list.get(j));
                        while (items2.hasMoreTokens()) {
                            templist2.add(items2.nextToken());
                        }
                        // Two itemsets can be joined only if they share the same prefix.
                        f = 0;
                        for (int k = 0; k < (templist1.size() - 1); k++) {
                            if (!(templist1.get(k).equals(templist2.get(k)))) {
                                f = 1;
                                break;
                            }
                        }
                        if (f == 0) {
                            str = list.get(i) + " " + templist2.get(templist2.size() - 1);
                            clist.add(str);
                            items2 = new StringTokenizer(str, "\n");
                            if (items2.hasMoreTokens())
                                word.set(items2.nextToken());
                            context.write(word, one);
                        }
                        templist2.clear();
                    }
                    templist1.clear();
                }
                list.clear();
                for (int k = 0; k < clist.size(); k++)
                    list.add(clist.get(k));
                clist.clear();
            }
        }
    }
    public static class ReducerClass extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // Emit only itemsets whose support count exceeds the threshold (7).
            if (sum > 7)
                context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "frequent itemsets");
        job.setJarByClass(Project.class);
        job.setMapperClass(MapperClass.class);
        job.setReducerClass(ReducerClass.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        if (!job.waitForCompletion(true)) {
            System.exit(1);
        }
    }
}
Input
Item1 Item2 Item3 Item4 Item5 Item7 Item8 Item9
Item8 Item1 Item2 Item9
Item3 Item2 Item5 Item6 Item7 Item8 Item9
Item3 Item7 Item8
Item1 Item2 Item3 Item7 Item8 Item3
Item4 Item5 Item6 Item7
Item2 Item5 Item6 Item7 Item8 Item9
Item1 Item2 Item4 Item5 Item6 Item7 Item8
Item1 Item2 Item3 Item5
Item2 Item3 Item6 Item7 Item9
Item2 Item7 Item1 Item2 Item5 Item6
Item3 Item6 Item7 Item8
Item1 Item2 Item4 Item9
Item4 Item7 Item8 Item9
Item1 Item2 Item6 Item7 Item8
Item3 Item5 Item6 Item8
Item1 Item3 Item4 Item5 Item7 Item8
Item1 Item4 Item6
Item3 Item6 Item8


Item1 Item2 Item4 Item5 Item3


Item5 Item3 Item5 Item8 Item9
Item1 Item3 Item5 Item8
Item1 Item4 Item5 Item6
Item2 Item3 Item4 Item9
Item1 Item2 Item3 Item5 Item7 Item8
Item4 Item7 Item8 Item9
Item1 Item2 Item3 Item4 Item5 Item6 Item8 Item9
Item1 Item4 Item5 Item7
Item1 Item2 Item5 Item6 Item7 Item8
Item2 Item3 Item7
Item4 Item8 Item9
Item6 Item7 Item1 Item3 Item5 Item6 Item9

Output:
{Item1 Item2 Item3}
{Item4 Item5 Item6}
{Item3 Item5 Item6}

PRACTICAL-5

Aim: Implementation of matrix algorithms in Spark SQL programming.

Spark and pyspark have wonderful support for reliable distribution and parallelization of
programs as well as support for many basic algebraic operations and machine learning
algorithms.

Code:
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, IndexedRow}

val rows = sc.parallelize(Seq(
  (0L, Array(1.0, 0.0, 0.0)),
  (1L, Array(0.0, 1.0, 0.0)),
  (2L, Array(0.0, 0.0, 1.0)))
).map { case (i, xs) => IndexedRow(i, Vectors.dense(xs)) }

val indexedRowMatrix = new IndexedRowMatrix(rows)
val localMatrix = Matrices.dense(3, 2, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
indexedRowMatrix.multiply(localMatrix).rows.collect

Output:
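An equivalent PySpark sketch (a rough translation of the Scala snippet above using the pyspark.mllib distributed-matrix API; it assumes a SparkContext named sc is already available, e.g. in the pyspark shell):

from pyspark.mllib.linalg import Vectors, Matrices
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# 3x3 identity matrix stored as a distributed IndexedRowMatrix.
rows = sc.parallelize([
    IndexedRow(0, Vectors.dense([1.0, 0.0, 0.0])),
    IndexedRow(1, Vectors.dense([0.0, 1.0, 0.0])),
    IndexedRow(2, Vectors.dense([0.0, 0.0, 1.0])),
])
indexed_row_matrix = IndexedRowMatrix(rows)

# 3x2 local matrix, values in column-major order as in the Scala example.
local_matrix = Matrices.dense(3, 2, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Multiplying by the identity simply returns the local matrix's rows.
print(indexed_row_matrix.multiply(local_matrix).rows.collect())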

PRACTICAL-6

Aim: Create a data pipeline based on messaging using PySpark and Hive - COVID-19 analysis.

Data Pipeline:

Step 1: Data Ingestion (Kafka)


- Produce COVID-19 data to the Kafka topic covid_data (a minimal producer sketch follows this list)
- Use Kafka Connect (e.g., FilePulse) to ingest data from CSV files or APIs
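A producer sketch for the first bullet, using the kafka-python package (the package choice and the field names in the sample record are assumptions; any Kafka client and any consistent schema would do):

import json
from kafka import KafkaProducer

# Serialize each record as a JSON string before sending it to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

sample_record = {"date": "2020-01-01", "country": "China", "cases": 41}
producer.send("covid_data", sample_record)
producer.flush()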

Step 2: Data Processing (PySpark)

- Consume covid_data topic using PySpark Structured Streaming


- Clean and transform data (e.g., handle missing values, convert data types)
- Aggregate data (e.g., daily cases, cumulative deaths)

Step 3: Data Storage (Hive)

- Write processed data to Hive tables (e.g., cases, deaths, recoveries)


- Use Hive's partitioning and clustering features for efficient querying (a partitioned-write sketch follows this list)
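A sketch of a partitioned write from PySpark (separate from the pipeline code below; the table name covid19.cases_by_country, the partition column and the inline sample data are all hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("COVID-19 Hive write") \
    .enableHiveSupport() \
    .getOrCreate()

# Stand-in for the aggregated (date, country, total_cases) DataFrame
# produced by the processing step.
df = spark.createDataFrame(
    [("2020-01-01", "China", 41), ("2020-01-01", "USA", 1)],
    ["date", "country", "total_cases"],
)

# Partitioned write into a Hive table; new partitions are added per country.
df.write \
    .mode("append") \
    .partitionBy("country") \
    .format("hive") \
    .saveAsTable("covid19.cases_by_country")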

Step 4: Data Analysis (PySpark)

- Query Hive tables using PySpark SQL


- Perform analysis (e.g., trends, visualizations) using PySpark SQL together with Python plotting libraries (e.g., Matplotlib, Seaborn)

PySpark code:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, to_date, sum as sum_
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create SparkSession (Hive support is needed to write and read Hive tables)
spark = SparkSession.builder \
    .appName("COVID-19 Analysis") \
    .enableHiveSupport() \
    .getOrCreate()

# Schema of the JSON messages published on the covid_data topic
schema = StructType([
    StructField("date", StringType()),
    StructField("country", StringType()),
    StructField("cases", IntegerType()),
])

# Consume Kafka topic
df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "covid_data") \
    .load()

# Process data: parse the JSON payload, fix the types and aggregate daily cases
df = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*") \
    .withColumn("date", to_date(col("date"))) \
    .groupBy("date", "country") \
    .agg(sum_("cases").alias("total_cases"))

# Write each micro-batch of the streaming aggregation to the Hive table
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("overwrite").saveAsTable("covid19.cases")

query = df.writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "/tmp/covid19_checkpoint") \
    .foreachBatch(write_to_hive) \
    .start()

# The stream keeps running in the background; once data has landed in Hive,
# query the table
cases_df = spark.sql("SELECT * FROM covid19.cases")

# Perform analysis
cases_df.groupBy("country").agg(avg("total_cases")).show()

Hive Schema:

CREATE DATABASE covid19;

CREATE TABLE cases (


date DATE,
country STRING,
total_cases INT
);

CREATE TABLE deaths (


date DATE,
country STRING,
total_deaths INT
);

CREATE TABLE recoveries (


date DATE,
country STRING,
total_recoveries INT
);
Output:
Daily Cases Trend

+----------+---------+-----------+
| date|country |total_cases|
+----------+---------+-----------+
|2020-01-01|China | 41 |
|2020-01-02|China | 59 |
|2020-01-03|China | 78 |
|2020-01-01|USA | 1 |
|2020-01-02|USA | 2 |


|2020-01-03|USA | 5 |
+----------+---------+-----------+

Total Cases by Country

+---------+-----------+
|country |total_cases|
+---------+-----------+
|China | 78541 |
|USA | 33671 |
|India | 22311 |
|Brazil | 19411 |
|Russia | 17411 |
+---------+-----------+

Daily Deaths Trend

+----------+---------+-----------+
| date|country |total_deaths|
+----------+---------+-----------+
|2020-01-01|China | 1 |
|2020-01-02|China | 2 |
|2020-01-03|China | 3 |
|2020-01-01|USA | 0 |
|2020-01-02|USA | 1 |
|2020-01-03|USA | 2 |
+----------+---------+-----------+
Total Deaths by Country

+---------+-----------+
|country |total_deaths|
+---------+-----------+
|China | 4632 |
|USA | 1954 |
|India | 1341 |
|Brazil | 1134 |
|Russia | 944 |
+---------+-----------+


Recovery Rate by Country

+---------+-------------+
|country |recovery_rate|
+---------+-------------+
|China | 85.7% |
|USA | 74.2% |
|India | 67.1% |
|Brazil | 63.2% |
|Russia | 59.4% |
+---------+-------------+
