
BIG DATA ANALYTICS LABORATORY

1. Design MapReduce technique for word counting using python on Hadoop cluster
2. Develop MapReduce algorithm for finding the coolest year from the available Weather data
using java/python program on Hadoop cluster
3. Design a bloom filter to remove the duplicate users from the Log file and analyse the filter
with different cases.
4. Implement the Flajolet-Martin algorithm to extract the distinct twitter users from the twitter
data set.
5. Demonstrate the significance of Page rank algorithm in the Hadoop platform with available
data set using MapReduce based Matrix vector multiplication algorithm.
6. Design a friend of friend’s network using Girvan Newman algorithms from the social network
data.
7. Demonstrate the relational algebra operations such as sort, group, join, project, and filter using
Hive.
8. Load unstructured data into Hadoop and convert it into structured data using Hive.
Develop Hive and HBase databases, tables, views, functions and indexes, and perform
some basic query operations.
9. Implement Pig Latin scripts to sort, group, join, project, and filter your data.
10. Implement the collaborative filtering system using PySpark
11. Perform the Logistic Regression, SVM and Decision Tree classifier algorithms
using PySpark, display the results with graphs and compare the accuracy of the algorithms
using Precision, Recall and F-Measure.
12. Implement the K-Means clustering algorithm using PySpark.

EXERCISE 1: Design MapReduce technique for word counting using python on Hadoop
cluster

Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input
dataset into independent chunks which are processed by the map tasks in a completely parallel
manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and one MRAppMaster per application. Minimally, applications
specify the input/output locations and supply map and reduce functions via implementations of
appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job
configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration
to the ResourceManager, which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them, and providing status
and diagnostic information to the job client.

Although the Hadoop framework is implemented in Java, MapReduce applications need not be
written in Java.
⚫ Hadoop Streaming is a utility which allows users to create and run jobs with any executables
(e.g. shell utilities) as the mapper and/or the reducer.
⚫ Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI
based).

Hadoop Installation:
https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz

Inputs and Outputs:


• The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a set of
<key, value> pairs as the output of the job, conceivably of different types.
• Input and Output types of a MapReduce job:
o (input) <k1, v1> -> map -> <k2, v2>
o combine -> <k2, v2>
o reduce -> <k3, v3> (output)

1.1 Finding word count using the Map Reduce algorithm using Python without Hadoop:

Problem Statement: Design Map_Reduce technique for word counting using python

Description: Map Reduce is a programming model for writing applications that can process big
data in parallel on multiple nodes. Map Reduce provides analytical capabilities for analyzing huge
volumes of data.

Algorithm/Steps:

⚫ Create one text file and store the input into the text file.
⚫ Splitting - Split the input into various lines using the split function in Python.
⚫ Mapping - Each of the words in a particular line should be separated and their counts must be
saved.
⚫ Sort/Shuffle - From all the sentences, get all the common words together along with their
respective counts.
⚫ Reduce - Sum up the individual counts of these common words and calculate the final count
of each word.
⚫ Result - Return the result of the map reduce.

Program:

sentence = open('word.txt')
txt = sentence.read()
data = txt.split()
# Mapping: emit (word, 1) for every word in the input
for word in data:
    print(word, 1)
# Reduce: count the occurrences of each distinct word
q = []
for word in data:
    if word not in q:
        q.append(word)
        currentword = word
        count = 0
        for wrd in data:
            if wrd == currentword:
                count += 1
        print(currentword, count)
Input: River Bear Dear Clever Cheer Dear Bear River Clever Cheer Power tower

Output:

Result: Using the MapReduce algorithm, the word count for the given sentence was performed and
the required output was obtained.
1.2 Finding word count using Map reduce algorithm using Python with Hadoop

Problem Statement: Designing Map Reduce technique for word counting using python in Hadoop.

Description: Map Reduce is a programming model for writing applications that can process big
data in parallel on multiple nodes. Map Reduce provides analytical capabilities for analyzing huge
volumes of data.

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

⚫ The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
⚫ The Reduce task takes the output from the Map as an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.

There are 6 phases in the Map Reduce algorithm. They are:

⚫ Input phase - gets input from a text file.


⚫ Splitting phase - splits every word in the text file.
⚫ Mapping phase - makes key-value pairs.
⚫ Shuffle and/or sorting - groups similar data from the map phase into identifiable sets.
⚫ Reducing - makes a key-value pair for each set that are identified in the previous step.
⚫ Output - writes the output into a file.

Algorithm :

map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; value: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)

Map Reduce code using Hadoop:

Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use
various different languages for writing MapReduce programs, like Python, C++, Ruby, etc. To
implement the word count problem in Python to understand Hadoop Streaming, we will be creating
mapper.py and reducer.py to perform the map and reduce tasks. Let's create one file which contains
multiple words that can be counted.

Step 1: Open the virtual box and go to the Ubuntu operating system. Create a file with the name
word_count_data.txt and add some data to it. Type these commands in the terminal.

cd Documents/ # to change the directory to /Documents


touch word_count_data.txt # touch is used to create an empty file
nano word_count_data.txt # nano is a command line editor to edit the file
cat word_count_data.txt # cat is used to see the content of the file

Step 2: Create a mapper.py file that implements the mapper logic. It will read the data from STDIN,
split the lines into words, and generate an output of each word with its individual
count.

cd Documents/ # to change the directory to /Documents


touch mapper.py # touch is used to create an empty file
cat mapper.py # cat is used to see the content of the file
Mapper.py code:

import sys

# Read lines from standard input and emit (word, 1) for every word
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(word, 1)

Let’s test our mapper.py locally to see if it is working fine or not.

Syntax: cat text_data_file | python3 mapper_code_python_file

Command(in my case): cat word_count_data.txt | python3 mapper.py


Step 3: Create a reducer.py file that implements the reducer logic. It will read the output of
mapper.py from STDIN(standard input) and will aggregate the occurrence of each word and will
write the final output to STDOUT.

cd Documents/ # to change the directory to /Documents


touch reducer.py # touch is used to create an empty file
Reducer.py code:
import sys

# Collect the words emitted by mapper.py (one "word 1" pair per line)
data = [line.split()[0] for line in sys.stdin if line.strip()]
q = []
for word in data:
    if word not in q:
        q.append(word)
        currentword = word
        count = 0
        for wrd in data:
            if wrd == currentword:
                count += 1
        print(currentword, count)
Now let's check whether our reducer code reducer.py works properly together with mapper.py, with the
help of the below command.

cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py


Step 4: Now let’s start all our Hadoop daemons with the below command.

start-dfs.sh
start-yarn.sh

Now make a directory word_count_in_python in our HDFS in the root directory that will store our
word_count_data.txt file, with the below command.

hdfs dfs -mkdir /word_count_in_python

Copy word_count_data.txt to this folder in our HDFS with the help of the copyFromLocal command.


Syntax to copy a file from your local file system to the HDFS is given below:

hdfs dfs -copyFromLocal /path 1 /path 2 /path n /destination


Actual command(in my case)

hdfs dfs -copyFromLocal /home/dikshant/Documents/word_count_data.txt /word_count_in_python

Now our data file has been sent to HDFS successfully. We can check whether it was copied or not by
using the below command or by manually visiting our HDFS.

hdfs dfs -ls / # list down content of the root directory


hdfs dfs -ls /word_count_in_python # list down content of /word_count_in_python
Let's give executable permission to our mapper.py and reducer.py with the help of the below
command.

cd Documents/
chmod 777 mapper.py reducer.py

# changing the permission to read, write, execute for user, group and others

Step 5: Now download the latest hadoop-streaming jar file. Then place this hadoop-streaming
jar file in a place from where you can easily access it. In my case, I am placing it in the /Documents
folder where the mapper.py and reducer.py files are present. Now let's run our Python files with the help
of the Hadoop streaming utility as shown below.

hadoop jar /home/dikshant/Documents/hadoop-streaming-2.7.3.jar \


-input /word_count_in_python/word_count_data.txt \
-output /word_count_in_python/output \
-mapper /home/dikshant/Documents/mapper.py \
-reducer /home/dikshant/Documents/reducer.py
In the above command, with -output we specify the location in HDFS where we want our output
to be stored. So let's check our output in the output file at location

/word_count_in_python/output/part-00000 in my case.

We can check the result by manually visiting the location in HDFS or with the help of the cat command
as shown below.

hdfs dfs -cat /word_count_in_python/output/part-00000


Result: By using the MapReduce algorithm we performed the word count for a particular sentence and
got the required output.
EXERCISE 2: Develop MapReduce algorithm for finding the coolest year from the available
Weather data using java program on Hadoop cluster

2.1 Using python, without Hadoop:

AIM : To develop a MapReduce algorithm to find the hottest and coldest month in a year from
the available weather data using Python.

MODEL DESCRIPTION :

Map Reduce is a Distributed Data Processing Algorithm. It is mainly inspired by the Functional
Programming model. This algorithm is useful to process huge amounts of data in parallel, reliable
and efficient ways in cluster environments. It divides input tasks into smaller and manageable
sub-tasks to execute them in parallel.

Map reduce works by breaking the process into 3 phases :

1. Map Phase
2. Sort and Shuffle Phase
3. Reduce Phase

DATASET DESCRIPTION:

The dataset is created manually using a notepad with details of weather such as months and their
respective minimum and maximum temperatures in three different columns. This file is then
imported into Hadoop to perform a map reduce algorithm to find the coolest and the hottest
month of the year.
ALGORITHM :

1) Create an input text file using a notepad containing the weather details of all months in a year.
2) It has columns such as Month, Minimum Temperature, Maximum Temperature.
3) Import the dataset and store the data in three separate lists.
4) Find out the minimum temperature and maximum temperature using the map reduce algorithm.
5) We can then find out the coolest and the hottest month of the year.

STEP-BY-STEP IMPLEMENTATION OF CODE :

1. Upload the created dataset.


2. Import all the required packages before starting the implementation.
3. Create two separate lists for minimum and maximum temperatures.
4. Append the minimum and maximum temperatures to the lists after splitting the data.
5. Find the minimum temperature from the corresponding minimum list and the maximum
temperature from the corresponding maximum list.
6. Print the respective months of minimum and maximum temperatures found.
7. Close the file.

CODE:

myfile = open("weatherdata.txt")
txt = myfile.read()
months = []
min_temp = []
max_temp = []
total_list = txt.split(',')
for i in range(len(total_list)):
    if i % 3 == 0:
        months.append(total_list[i].strip('\n\t '))
    elif i % 3 == 1:
        min_temp.append(int(total_list[i]))
    else:
        max_temp.append(int(total_list[i]))
minimum = min(min_temp)
maximum = max(max_temp)
print("Month with minimum temperature of {t} is {m}".format(t=minimum, m=months[min_temp.index(minimum)]))
print("Month with maximum temperature of {t} is {m}".format(t=maximum, m=months[max_temp.index(maximum)]))
myfile.close()

OUTPUT:
RESULT : Thus, the coolest and hottest months of the year have been determined using map
reducing algorithm.

2.2 Using Java, with Hadoop:

AIM : To develop a MapReduce algorithm to find the hottest and coldest month in a year from
the available weather data on a Hadoop cluster.

MODEL DESCRIPTION : Map Reduce is a Distributed Data Processing Algorithm. It is mainly
inspired by the Functional Programming model. This algorithm is useful to process huge amounts
of data in parallel, reliable and efficient ways in cluster environments. It divides input tasks into
smaller and manageable sub-tasks to execute them in parallel. Map reduce works by breaking the
process into 3 phases:

1. Map Phase
2. Sort and Shuffle Phase
3. Reduce Phase

DATASET DESCRIPTION: The dataset is created manually using a notepad with details of
weather such as months and their respective minimum and maximum temperatures in three
different columns. This file is then imported into Hadoop to perform a map reduce algorithm to
find the coolest and the hottest month of the year.
ALGORITHM:

1) Create an input text file using a notepad containing the weather details of all months in a year.
2) It has columns such as Month, Minimum Temperature, Maximum Temperature.
3) Import the dataset and store the data in three separate lists.
4) Find out the minimum temperature and maximum temperature using the map reduce algorithm.
5) We can then find out the coolest and the hottest month of the year.

STEP-BY-STEP IMPLEMENTATION OF CODE:

1) Create a Java project "Weather".


2) Import all the required packages before starting the implementation.
3) Create a Mapper class. The Record Reader processes each input record and generates the respective
key-value pair. Hadoop's Mapper store saves this intermediate data to the local disk.
4) Create a Reducer class. The intermediate output generated from the mapper is fed to the reducer,
which processes it and generates the final output, which is then saved.
5) Create a Driver class. The major component in a MapReduce job is the Driver class. It is
responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper
and Reducer classes along with data types and their respective job names.

CODE:
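The Java source is not reproduced in this copy. As an illustrative alternative that follows the same Mapper/Reducer logic, a minimal Hadoop Streaming sketch in Python is shown below; the file names weather_mapper.py / weather_reducer.py and the comma-separated Month,Min,Max record format are assumptions based on the dataset description above, not the original code.

# weather_mapper.py -- emits "month<TAB>min,max" for every input record
import sys

for line in sys.stdin:
    parts = line.strip().split(',')
    if len(parts) == 3:
        month, min_t, max_t = parts
        print(month.strip() + "\t" + min_t.strip() + "," + max_t.strip())

# weather_reducer.py -- tracks the overall coolest and hottest months
import sys

coolest = (None, float('inf'))    # (month, minimum temperature seen)
hottest = (None, float('-inf'))   # (month, maximum temperature seen)

for line in sys.stdin:
    month, temps = line.strip().split('\t')
    min_t, max_t = map(int, temps.split(','))
    if min_t < coolest[1]:
        coolest = (month, min_t)
    if max_t > hottest[1]:
        hottest = (month, max_t)

print("Coolest month:", coolest[0], coolest[1])
print("Hottest month:", hottest[0], hottest[1])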
OUTPUT:
RESULT: Thus, the coolest and hottest months of the year have been determined using map
reducing algorithm.
EXERCISE 3 : Design a bloom filter to remove the duplicate users from the Log file and
analyze the filter with different cases.

3.1 Without Hadoop:

Aim: To design a bloom’s filter to remove duplicates and analyze filters with different cases.

Description: A Bloom filter is a space-efficient probabilistic data structure that is used to test
whether an element is a member of a set. It is used to remove duplicate users from log files and
analyze the filters in different cases.

Algorithm:

The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while
rejecting most of the stream elements whose keys are not in S.

A Bloom filter consists of:


1. An array of n bits, initially all 0’s.
a. To initialize the bit array, begin with all bits 0.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n
buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
4. Take each key value in S and hash it using each of the k hash functions.
a. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
5. To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s
in the bit-array.
a. If all are 1’s, then let the stream element through.
b. If one or more of these bits are 0, then K could not be in S, so reject the stream
element.
The analysis follows the model of throwing y darts at x targets:
• The probability that a given dart will not hit a given target is (x − 1)/x.
• The probability that none of the y darts will hit a given target is ((x − 1)/x)^y.
• Using the approximation (1 − ε)^(1/ε) ≈ 1/e for small ε, we conclude that the probability that
none of the y darts hits a given target is e^(−y/x).
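As a rough numerical illustration of the last formula (the bit-array size and number of insertions below are arbitrary example values, not from the exercise), the fraction of bits left at 0 can be estimated in a couple of lines of Python:

import math

# Example values (assumptions): x = number of bits in the filter,
# y = total number of hash insertions (k hash functions x m keys)
x = 8_000_000_000
y = 1_000_000_000
fraction_still_zero = math.exp(-y / x)        # probability that a given bit is still 0
fraction_set_to_one = 1 - fraction_still_zero
print(round(fraction_still_zero, 4), round(fraction_set_to_one, 4))   # ~0.8825 and ~0.1175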
Steps :
1) Import all necessary packages.
2) Initialize all indexes to zero in an empty bit array.
3) Define the necessary hash functions h1 and h2.
4) Store the IP addresses which we take from the dataset into a list, convert them to integers and
store them in a new list.
5) Calculate hash values h1 and h2 for the numbers for which we will find duplicates.
6) If the element for the particular hash is 0, change it to 1.
7) And we will get the final Bloom filter.
8) Now we can get an IP address as input from the user and we will check whether the particular
user is already present or not.
9) If the value at a particular index is already present in the array (hash value is 1), print it as a
duplicate document.
10) If the value at a particular index is 0, then it is unique and print it as not a duplicate document.

Code:

M = int(input())
x = list(map(int, input().split()))
arr = [0] * M
for i in range(len(x)):
    h1 = x[i] % M
    h2 = (3 * x[i] + 2) % M
    if arr[h1] == 0:
        arr[h1] = 1
    if arr[h2] == 0:
        arr[h2] = 1
print("Final Blooms filter :", arr)
x1 = int(input())
h1 = x1 % M
h2 = (3 * x1 + 2) % M
# An element is reported as a (possible) duplicate only if all of its hash bits are already 1
if arr[h1] == 1 and arr[h2] == 1:
    print("It is duplicate")
else:
    arr[h1] = 1
    arr[h2] = 1
    print("It is not duplicate")
Output:

Result: Bloom’s Filter is thereby used to remove duplicates and analyze the filter withdifferent
cases.
3.2 With Hadoop:

Aim: To design a Bloom filter to remove duplicate users from log values and analyze the filter with
different cases.

Description: Bloom’s filter is a space efficient probabilistic data structure that is used to test
whether an element is a member of a set. It is used to remove duplicate users from log files and
analyze the filters in different cases.

The dataset used is the weblog.csv

Link to the dataset: https://www.kaggle.com/datasets/shawon10/web-log-dataset

Algorithm:
1) Import all necessary packages
2) Initialize all indexes to zero in an empty bit array.
3) Define the necessary hash functions h1 and h2.
4) Store the IP addresses which we take from the dataset into a list, convert them to integers and
store them in a new list.
5) Calculate hash values h1 and h2 for the numbers for which we will find duplicates.
6) If the element for the particular hash is 0, change it to 1.
7) And we will get the final Bloom filter.
8) Now we can get an IP address as input from the user and we will check whether the particular
user is already present or not.
9) If the value at a particular index is already present in the array (hash value is 1), print it as a
duplicate document.
10) If the value at a particular index is 0, then it is unique and print it as not a duplicate document.

Code:
import pandas as pd

df = pd.read_csv("weblog.csv")
# Getting only the IP addresses
x = df["IP"]

# Taking a large m value and creating the Bloom bit array
m = 6023
l = [0] * m

# Function to execute the two hash functions and return the index values
def bloom(n, m):
    h = n % m
    h1 = (2 * n + 3) % m
    return h, h1

# For running in less time, only 100 records are considered.
# Use range(len(df)) for all records.
for i in range(100):
    s = x[i]
    # Convert the dotted IP address (e.g. 10.1.2.82) into a single integer (101282)
    z = int("".join(s.split(".")))
    print("Integer =", z)
    c, c1 = bloom(z, m)
    print("Blooms value", c)
    print("Second hash value", c1)
    l[c] = 1
    l[c1] = 1
    print("State of l[c] is changed to: ", l[c])
    print("State of l[c1] is changed to: ", l[c1])
    print()

# Taking a new value for testing
test = x[101]
# Taking an already existing value for testing
test1 = "10.1.2.82"

# Testing for the new value
z = int("".join(test.split(".")))
print(z)
c, c1 = bloom(z, m)
print(c)
print(c1)
if l[c] == 1 and l[c1] == 1:
    print("Duplicate!")
else:
    print("Not Duplicate!")

# Testing for the already existing value - 101282
z = int("".join(test1.split(".")))
print(z)
c, c1 = bloom(z, m)
print(c)
print(c1)
if l[c] == 1 and l[c1] == 1:
    print("Duplicate!")
else:
    print("Not Duplicate!")
Output:

Result:
The Bloom filter is thereby used to remove duplicate users from the log file and analyze the filter
with different cases.
EXERCISE 4 : Implement the Flajolet-Martin algorithm to extract the distinct twitter users
from the twitter data set

Aim: To implement the Flajolet-Martin algorithm using Python to extract the distinct Twitter users
from the Twitter dataset.

Description:

Data cleaning can be done using various techniques, but before cleaning the data, it is necessary to
know the amount of useful data present in the dataset. Therefore, before the removal of duplicate
data from a data stream or database, it is necessary to have knowledge of the distinct or unique data
present. A way to do so is by hashing the elements of the universal set using the Flajolet-Martin
algorithm. The FM algorithm is used in database queries, big data analytics, spectrum sensing in
cognitive radio sensor networks, and many more areas. It shows superior performance as
compared with many other methods to find distinct elements in a stream of data. The Flajolet-Martin
algorithm, also known as the FM algorithm, is used to approximate the number of unique elements in
a data stream or database in one pass. The highlight of this algorithm is that it uses less memory
space while executing.

Algorithm:

The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the
stream, the more different hash-values we shall see. As we see more different hash-values, it
becomes more likely that one of these values will be “unusual.” The particular unusual property
we shall exploit is that the value ends in many 0’s.

• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in
some number of 0's, possibly none. Call this number the tail length for a and h.
• Let R be the maximum tail length of any a seen so far in the stream.
• Then estimate 2^R for the number of distinct elements seen in the stream.
• The probability that a given stream element a has h(a) ending in at least r 0's is 2^(−r).
• Suppose there are m distinct elements in the stream; then the probability that none of them
has tail length at least r is (1 − 2^(−r))^m.
Steps:
1. Select a hash function h so that each element in the set is mapped to a string of at least
log2(n) bits. The names are converted to binary numbers using the built-in hash function in Python.
2. For each element x, r(x) = length of trailing zeroes in h(x).
3. R = max(r(x)). R is the maximum count of trailing zeroes.
4. R is substituted in the formula 2^R to yield the result with a distinct count.
5. Distinct elements = 2^R.

Code:

# TASK-4 FLAJOLET MARTIN

import pandas as pd

df = pd.read_csv("gender-classifier-DFE-791531.csv")
df = df["name"]

def getTrailingZeroes(num):
    # Count the number of trailing zero bits in num
    cnt = 0
    while num:
        if num & 1:
            break
        cnt += 1
        num >>= 1
    return cnt

maxTrailingZeroes = 0
for name in df:
    name = name.strip("\n")
    # Hash the name and count the trailing zeroes of the hash value
    binar = abs(hash(name))
    maxTrailingZeroes = max(maxTrailingZeroes, getTrailingZeroes(binar))

print("Possible distinct count :", 2 ** maxTrailingZeroes)
Output:
Possible distinct count : 8192

Result:

The Flajolet-Martin algorithm has been implemented to find the number of distinct Twitter users, and it
is observed that it occupies less memory space and shows better results in terms of time in
seconds.
EXERCISE 5 : Demonstrate the significance of the PageRank algorithm in the Hadoop platform
with the available data set using a MapReduce based matrix-vector multiplication algorithm.

Project Objective

• Gaining working experience with Hadoop and MapReduce.


• Understanding graph algorithms such as PageRank and implementing them using MapReduce.
• Running PageRank using Apache Giraph, and other algorithms that rely on MapReduce.
• Comparing results and performance between a direct implementation in MapReduce vs
Giraph.

Hadoop
An Apache open source project. It is an application framework that helps integrate HDFS (Hadoop
Distributed File System) and run MapReduce jobs.

MapReduce

A computational technique of dividing up applications into small independent tasks which can be
executed on any node in the cluster.

Map -> Shuffle -> Reduce


What is PageRank?

PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine
results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way
of measuring the importance of website pages. PageRank works by counting the number and
quality of links to a page to determine a rough estimate of how important the website is. The
underlying assumption is that more important websites are likely to receive more links from other
websites.

It is not the only algorithm used by Google to order search engine results, but it is the first algorithm
that was used by the company, and it is the best-known. The above centrality measure is not
implemented for multi-graphs.

i. Measures the relative importance of a hyperlinked set of documents (web pages)
ii. Assumes that more important websites will have more links from other websites
iii. Each web page is a node and the hyperlinks between web pages are edges
iv. A hyperlink to a page is a count of support with a weighted value based on importance

Page Rank algorithm:

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be calculated
for collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called "iterations",
through the collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.

Algorithm: The matrix-vector multiplication is defined as:


➢ Suppose we have an n × n matrix M, whose element in row i and column j is
denoted m_ij, and suppose we also have a vector v of length n, whose jth element is v_j.
➢ Then the matrix-vector product is the vector x of length n, whose ith element x_i is given
by x_i = sum over j = 1..n of (m_ij * v_j). (A small MapReduce-style sketch of this product is given below.)
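A minimal sketch of how this product maps onto map and reduce functions is shown below; the in-memory representation (M as (i, j, m_ij) triples, v as a dictionary) and the toy values are assumptions for illustration, not Hadoop code from the exercise.

from collections import defaultdict

def map_phase(matrix_entries, v):
    # Map: for every matrix element m_ij emit the pair (i, m_ij * v_j)
    for i, j, m_ij in matrix_entries:
        yield i, m_ij * v[j]

def reduce_phase(mapped_pairs):
    # Reduce: sum all values that share the same row key i to obtain x_i
    sums = defaultdict(float)
    for i, value in mapped_pairs:
        sums[i] += value
    return dict(sums)

# Toy example (assumed data)
M = [(0, 0, 0.5), (0, 1, 0.5), (1, 0, 1.0)]
v = {0: 0.5, 1: 0.5}
print(reduce_phase(map_phase(M, v)))   # {0: 0.5, 1: 0.5}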
Steps:
• Start with a uniform initialization of all pages.
• Simple algorithm with damping factor d = 0.85:
  PR(p_i) = (1 − d)/N + d * sum over p_j in M(p_i) of PR(p_j)/L(p_j),
  where d is the damping factor (0.85), M(p_i) is the set of pages linking to p_i and
  L(p_j) is the number of outbound links of p_j.
• Iterate until convergence or for a fixed number of iterations.

PageRank using MapReduce

CODE:
import numpy as np
import numpy.linalg as la

L = np.array([[0,   1/2, 1/3, 0, 0,   0  ],
              [1/3, 0,   0,   0, 1/2, 0  ],
              [1/3, 1/2, 0,   1, 0,   1/2],
              [1/3, 0,   1/3, 0, 1/2, 1/2],
              [0,   0,   0,   0, 0,   0  ],
              [0,   0,   1/3, 0, 0,   0  ]])

r = 100 * np.ones(6) / 6   # Sets up this vector (6 entries of 1/6 x 100 each)
# array([16.66666667, 16.66666667, 16.66666667, 16.66666667, 16.66666667, 16.66666667])

for i in np.arange(100):   # Repeat the dot product calculation for 100 iterations
    r = L @ r
# array([16., 5.33333333, 40., 25.33333333, 0., 13.33333333])

r = 100 * np.ones(6) / 6   # Sets up this vector (6 entries of 1/6 x 100 each)
lastR = r
r = L @ r
i = 0
while la.norm(lastR - r) > 0.01:
    lastR = r
    r = L @ r
    i += 1
print(str(i) + " iterations to convergence.")
print(r)
# 18 iterations to convergence.
# array([16.00149917, 5.33252025, 39.99916911, 25.3324738, 0., 13.33433767])
# We'll call this one L2, to distinguish it from the previous L.
L2 = np.array([[0,   1/2, 1/3, 0, 0,   0, 0],
               [1/3, 0,   0,   0, 1/2, 0, 0],
               [1/3, 1/2, 0,   1, 0,   0, 0],
               [1/3, 0,   1/3, 0, 1/2, 0, 0],
               [0,   0,   0,   0, 0,   0, 0],
               [0,   0,   1/3, 0, 0,   1, 0],
               [0,   0,   0,   0, 0,   0, 1]])

d = 0.8
M = d * L2 + (1 - d) / 7 * np.ones([7, 7])  # np.ones() is the J matrix, with ones for each entry.

r = 100 * np.ones(7) / 7   # Sets up this vector (7 entries of 1/7 x 100 each)
lastR = r
r = L2 @ r
i = 0
while la.norm(lastR - r) > 0.01:
    lastR = r
    r = L2 @ r
    i += 1
print(str(i) + " iterations to convergence.")
print(r)
# 46 iterations to convergence.
# array([1.07742874e-02, 4.20323681e-03, 2.13132105e-02, 1.25178906e-02, 0.00000000e+00, 8.56654771e+01, 1.42857143e+01])

r = 100 * np.ones(7) / 7   # Sets up this vector (7 entries of 1/7 x 100 each)
lastR = r
r = M @ r
i = 0
while la.norm(lastR - r) > 0.01:
    lastR = r
    r = M @ r
    i += 1
print(str(i) + " iterations to convergence.")
print(r)
# 19 iterations to convergence.
# array([10.15892915, 6.7094426, 17.3137475, 11.32722084, 2.85714286, 37.34780278, 14.28571429])

This is certainly better; the PageRank gives sensible numbers for the probability that users end up on
each webpage.
EXERCISE 6 : Design a friend of friend's network using the Girvan-Newman algorithm from
the social network data

Aim: Designing a friend of friend’s network using Girvan Newman algorithm.

Description: The Girvan-Newman technique for the detection and analysis of community structure
depends upon the iterative elimination of edges with the highest number of shortest paths that
pass through them. By getting rid of the edges, the network breaks down into smaller networks,
i.e. communities.

The algorithm, as the name suggests, was introduced by Girvan & Newman.


The idea was to find which edges in a network occur most frequently between other pairs of nodes
by finding the edge betweenness. The edges joining communities are then expected to have high edge
betweenness. The underlying community structure of the network will be much more fine-grained once
we eliminate edges with high edge betweenness. For the removal of each edge, the calculation of
edge betweenness is O(EN); therefore, this algorithm's time complexity is O(E^2 N).
We can express Girvan-Newman algorithm in the following procedure:

1. Calculate edge betweenness for every edge in the graph.


2. Remove the edge with highest edge betweenness.
3. Calculate edge betweenness for remaining edges.
4. Repeat steps 2–3 until all edges are removed.

In order to calculate edge betweenness it is necessary to find all shortest paths in the graph. The
algorithm starts with one vertex, calculates edge weights for paths going through that vertex, and
then repeats this for every vertex in the graph and sums the weights for every edge.

Although understandable and simple, the Girvan-Newman algorithm has its own limitations. The
algorithm is not very time efficient with networks containing a large number of nodes and data.
Communities in huge and complex networks are difficult to detect, and therefore Girvan-Newman
is not favorable for very large data sets. Various variations, modifications and other methods of
community detection then appeared, which work on the idea of "Modularity".

Algorithm:
• The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at
the node X.
• Note that the level of each node in the BFS presentation is the length of the shortest path
from X to that node.
• The edges that go between nodes at the same level can never be part of a shortest path
from X. Edges between levels are called DAG edges (“DAG” stands for directed, acyclic
graph).
• Each DAG edge will be part of at least one shortest path from root X.
• If there is a DAG edge (Y, Z), where Y is at the level above Z (i.e., closer to the root),
then we shall call Y a parent of Z and Z a child of Y , although parents are not necessarily
unique in a DAG as they would be in a tree.
• In order to exploit the betweenness of edges, we need to calculate the number of shortest
paths going through each edge. This is done by a method called the Girvan-Newman (GN)
Algorithm, which
o visits each node X once and computes the number of shortest paths from X to
each of the other nodes that go through each of the edges.
• The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at
the node X.
o Note that the level of each node in the BFS presentation is the length of the
shortest path from X to that node.
• Thus, the edges that go between nodes at the same level can never be part of a shortest
path from X.
• Edges between levels are called DAG edges ("DAG" stands for directed, acyclic graph)
and each DAG edge will be part of at least one shortest path from root X.
• If there is a DAG edge (Y, Z), where Y is at the level above Z (i.e., closer to the
root), then we shall call Y a parent of Z and Z a child of Y, although parents are
not necessarily unique in a DAG as they would be in a tree.
The rules for the calculation are as follows:
1. Each leaf in the DAG (a leaf is a node with no DAG edges to nodes at levels below) gets
a credit of 1.
2. Each node that is not a leaf gets a credit equal to 1 plus the sum of the credits of the DAG
edges from that node to the level below.
3. A DAG edge e entering node Z from the level above is given a share of the credit of Z
proportional to the fraction of shortest paths from the root to Z that go through e.

Source Code:

import networkx as nx

def edge_to_remove(g):
    d1 = nx.edge_betweenness_centrality(g)
    list_of_tuples = sorted(d1.items(), key=lambda x: x[1], reverse=True)
    # Will return the edge with the highest betweenness, in the form (a, b)
    return list_of_tuples[0][0]

def girvan(g):
    a = nx.connected_components(g)
    lena = len(list(a))
    print("The number of connected components are", lena)
    while lena == 1:
        # We need (a, b) instead of ((a, b))
        u, v = edge_to_remove(g)
        g.remove_edge(u, v)
        a = nx.connected_components(g)
        lena = len(list(a))
        print("The number of connected components are", lena)
    # Recompute the components, since the generator above has already been consumed
    return nx.connected_components(g)

print("enter input1")
k = int(input())
print("enter input2")
l = int(input())
g = nx.barbell_graph(k, l)
a = girvan(g)
print("Barbell Graph")
for i in a:
    print(list(i))
    print('...........')

g1 = nx.karate_club_graph()
a1 = girvan(g1)
print("Karate Club Graph")
for i in a1:
    print(list(i))
print("..................................")

Input:
Enter input 1:
5
Enter input 2:
0

Output:
Barbell Graph
The number of connected components are 1
The number of connected components are 2
[0, 1, 2, 3, 4]
...........
[8, 9, 5, 6, 7]
...........

Karate Club Graph

The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 2
[32, 33, 2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
...........
[0, 1, 3, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17, 19, 21]
...........

Implementation using NetworkX:

NetworkX is a package for the Python programming language that is used to create,
manipulate, and study the structure, dynamics, and functions of complex graph networks. Using
NetworkX we can load and store complex networks.
Exercise 7 : Demonstrate the relational algebra operations such as sort, group,join, project,
and filter using Hive and Pig.

Problem Statement: Demonstrate the relational algebra operations such as sort, group, join,
project and filter using Hive.

Description: Hive is built on top of Apache Hadoop, which is an open-source framework used to
efficiently store and process large datasets. Hive allows users to read, write, and manage petabytes
of data using SQL. Hive provides SQL-like querying capabilities with HiveQL. It supports
structured and unstructured data. Provides native support for common SQL data types, like INT,
FLOAT, and VARCHAR.

Algorithm:

1. Go to the virtual box.
2. Select CentOS (Cloudera platform) inside the virtual machine.
3. Then, go to the terminal.
4. Check if Hadoop exists in the system, using the command >hadoop version
5. Then type hive to get the terminal for typing Hive commands, using the command

>hive

Creating Table in hive:

Table 1 (employee2)

>create table employee2(eid INT, name String, role String, salary String)
>row format delimited
>fields terminated by ',';

Table 2 (personal2)

>create table personal2(id INT, school String, city String, score INT)
>row format delimited
>fields terminated by ',';

Inserting data into the table

>load data local inpath '/home/cloudera/employee.txt' into table employee2;
>load data local inpath '/home/cloudera/employee3.txt' into table personal2;
Table description:

describe employee2;

describe personal2;
Viewing the tables in Hive:

show tables;

Viewing contents of tables:

select * from employee2 ;

select * from personal2;


Sorting:

select * from employee2 order by salary;

select * from personal2 order by score;


select * from employee2 sort by salary;

select * from personal2 sort by score;


Grouping:

select role, count(*) from employee2 group by role;

select city, count(*) from personal2 group by city;


Join:

select e.name,e.role,p.school,p.city from employee2 e join personal2 p on e.eid=p.id;

select e.name,e.role,p.city from employee2 e left outer join personal2 p on e.eid=p.id;


Filter:

select * from employee2 where role in('SDE','TAX');

select * from personal2 where score>9;


----

select count(eid),role from employee2 group by role having count(eid)>=2;

select count(id),city from personal2 group by city having count(id)=1 order by city desc;
select count(distinct role) from employee2;

select count(distinct city) from personal2;
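The problem statement also mentions the project operation; a minimal example of projection (selecting only specific columns from the tables created above) would be:

select name, role from employee2;

select school, city from personal2;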


Exercise 8
Exercise 9: IMPLEMENT PIG LATIN SCRIPTS TO SORT, GROUP, JOIN, PROJECT
AND FILTER DATA

Commands to run Pig scripts:

Open terminal in CentOS and follow these steps:


1. To check the version of Pig: pig -version
2. To start the Pig environment in local mode: pig -x local
3. To start the Pig environment in MapReduce mode: pig -x mapreduce
4. To run a Pig script: pig -x local name.pig

Commands 2 and 3 are used to run pig in the “Grunt” Shell. For this exercise we can use
command 4 to run our scripts.

Steps:

1. Create some sample data or use some database which is already available. In our case we have
used Medicine.txt

2. Load the file in the pig latin script using LOAD function
3. For SORT, first load the data and then use the ORDER BY command as shown inthe picture
below.

Code for Sort:

med = LOAD 'path' USING PigStorage(',') AS (name:chararray, dosage:int, madedate:int, expiry:int, company:chararray);
Order_by_date = ORDER med BY madedate DESC;
DUMP med;
DUMP Order_by_date;
Type this code in text editor and save it as sample_script.pig. To run this code type the following
command in the terminal:

To run the pig script:

pig -x local sample_script.pig

Output:
1. For executing the group operation, Load the data into the Pig Latin script and use the GROUP BY
command

Code for GROUP:

med = LOAD 'path' USING PigStorage(',') AS (name:chararray, dosage:int, madedate:int, expiry:int, company:chararray);
a = GROUP med BY expiry;
DUMP a;
Type this code in text editor and save it as sample_script1.pig.

To run this code type the following command in the terminal:

pig -x local sample_script1.pig

OUTPUT:
1. For executing the join operation Load the data into the piglatin file and use the JOIN
command.
2. For join we use another text file with some additional details. We use join to use a common
column between both the text files and generate the final output with the complete
information. In this case we have used ‘med1.txt’ which contains the minimum age for each
drug.

CODE FOR JOIN:

med = LOAD 'path' USING PigStorage(',') AS (name:chararray, dosage:int, madedate:int, expiry:int, company:chararray);
med1 = LOAD 'path' USING PigStorage(',') AS (name:chararray, minage:int);
a = JOIN med BY name, med1 BY name USING 'skewed';
DUMP a;
Type this code in text editor and save it as sample_script2.pig.
To run this code type the following command in the terminal:
pig -x local sample_script2.pig

OUTPUT:
1. For executing the project operation, load the data into the Pig Latin file and use FOREACH ... GENERATE
on the required fields.
2. Project is used to display only some columns of a particular dataset.

CODE FOR PROJECT:

med = LOAD 'path' USING PigStorage(',') AS (t1:chararray, t2:int, t3:int, t4:int, t5:chararray);
-- projection: keep only selected columns (field choice is illustrative)
projected = FOREACH med GENERATE t1, t5;
DUMP projected;
Type this code in text editor and save it as sample_script3.pig.

To run this code type the following command in the terminal:


pig -x sample_script3.pig

OUTPUT:
1. For executing the filter operation, load the data into the Pig Latin file and use the FILTER
BY command.
2. Filter by is used to select and display the data based on some condition.

CODE FOR FILTER BY:

med = LOAD 'path' USING PigStorage(',') AS (name:chararray, dosage:int, madedate:int, expiry:int, company:chararray);
relation = FILTER med BY expiry >= 2022;
DUMP relation;
Type this code in a text editor and save it as sample_script4.pig.

To run this code type the following command in the terminal:

pig -x local sample_script4.pig

OUTPUT:
EXERCISE 10 : COLLABORATIVE FILTERING USING PYSPARK

Description: For this project, we have to create a recommender system that will recommend
new musical artists to a user based on their listening history. Suggesting different songs or
musical artists to a user is important to many music streaming services, such as Wynk and
Spotify. In addition, this type of recommender system could also be used as a means of
suggesting TV shows or movies to a user (e.g., Netflix).

To create this system you will be using Spark and the collaborative filtering technique. The
instructions for completing this project will be laid out entirely in this file. You will have to
implement any missing code as well as answer any questions.

Submission Instructions: Add all of your updates to this IPython file and do not clear any of
the output you get from running your code.

Datasets : You will be using some publicly available song data from Audioscrobbler, which can
be found here (http://wwwetud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html).

However, we modified the original data files so that the code will run in a reasonable time on
a single machine. The reduced data files have been suffixed with _small.txt, and contain only
the information relevant to the top 50 most prolific users (highest artist play counts).

The original data file user_artist_data.txt contained about 141,000 unique users, and 1.6 million
unique artists. About 24.2 million users' plays of artists are recorded, along with their count.

Note that when plays are scrobbled, the client application submits the name of the artist being
played. This name could be misspelled or nonstandard, and this may only be detected later.

For example, "The Smiths", "Smiths, The", and "the smiths" may appear as distinct artist IDs in
the data set, even though they clearly refer to the same artist. So, the data set includes
artist_alias.txt, which maps artist IDs that are known misspellings or variants to the canonical ID
of that artist.

The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist.

Exercise 11 : LOGISTIC REGRESSION CLASSIFICATION, SVM & DECISION


TREE CLASSIFIER ALGORITHMS USING PYSPARK

Program Statement : To perform logistic regression , SVM and Decision tree Classification
algorithm.

Process:
1. Download the dataset from net as a csv file
2. Install the pyspark on the windows
3. Import the required libraries and modules.
4. Cleaning the dataset
5. Analyzing the dataset
6. Splitting the dataset into training and testing and applying the necessary classification
algorithms we need
7. Summarizing the performance of the prediction model using precision,recall and
accuracy.

Logistic regression: Logistic regression is one of the supervised machine learning algorithms used
for classification to predict discrete value outcomes. It uses a statistical approach to predict the
outcomes of dependent variables based on the observations given in the dataset. The dataset we
used here is a search engine dataset.

Code:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Logistic_Regression').getOrCreate()
df = spark.read.csv('search_engine.csv', inferSchema=True, header=True)
df.describe().show()
df.groupBy('Country').count().show()
df.groupBy('Platform').count().show()
df.groupBy('Status').count().show()

import matplotlib.pyplot as plt
import seaborn as sns
df11 = df.toPandas()
sns.set_style('whitegrid')
sns.countplot(x='Country', hue='Platform', data=df11)

from pyspark.ml.feature import StringIndexer
search_engine_indexer = StringIndexer(inputCol="Platform", outputCol="Platform_Num").fit(df)
df = search_engine_indexer.transform(df)
# The Country column is indexed the same way so that Country_Num / Country_Vector exist
country_indexer = StringIndexer(inputCol="Country", outputCol="Country_Num").fit(df)
df = country_indexer.transform(df)
df.select(['Platform', 'Platform_Num']).show(10, False)
df.select(['Country', 'Country_Num']).show(10, False)

from pyspark.ml.feature import OneHotEncoder
search_engine_encoder = OneHotEncoder(inputCol="Platform_Num", outputCol="Search_Engine_Vector").fit(df)
df = search_engine_encoder.transform(df)
country_encoder = OneHotEncoder(inputCol="Country_Num", outputCol="Country_Vector").fit(df)
df = country_encoder.transform(df)

from pyspark.ml.feature import VectorAssembler
df_assembler = VectorAssembler(
    inputCols=['Search_Engine_Vector', 'Country_Vector', 'Age', 'Repeat_Visitor', 'Web_pages_viewed'],
    outputCol="features")
df = df_assembler.transform(df)

model_df = df.select(['features', 'Status'])
training_df, test_df = model_df.randomSplit([0.75, 0.25])

from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(labelCol='Status').fit(training_df)
train_results = log_reg.evaluate(training_df).predictions
train_results.filter(train_results['Status'] == 1).filter(train_results['prediction'] == 1) \
    .select(['Status', 'prediction', 'probability']).show(10, False)

# Evaluate on the test split and build the confusion-matrix counts
results = log_reg.evaluate(test_df).predictions
true_postives = results.filter((results['Status'] == 1) & (results['prediction'] == 1)).count()
true_negatives = results.filter((results['Status'] == 0) & (results['prediction'] == 0)).count()
false_positives = results.filter((results['Status'] == 0) & (results['prediction'] == 1)).count()
false_negatives = results.filter((results['Status'] == 1) & (results['prediction'] == 0)).count()

accuracy = float((true_postives + true_negatives) / (results.count()))
print("Accuracy : " + str(accuracy))

recall = float(true_postives) / (true_postives + false_negatives)
print("Recall Rate : " + str(recall))

precision = float(true_postives) / (true_postives + false_positives)
print("Precision Rate : " + str(precision))

f_measure = 2 * precision * recall / (precision + recall)
print("F-Measure : " + str(f_measure))
SVM: Support Vector Machine or SVM is one of the most popular supervised learning
algorithms, which is used for classification as well as regression problems. However,
primarily, it is used for classification problems in machine learning. The dataset we used here
is the breast cancer dataset.

Code:

from pyspark import SparkContext


from pyspark.sql import SQLContext
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer
import pandas as pd

bc = load_breast_cancer()

df_bc = pd.DataFrame(bc.data, columns=bc.feature_names)


df_bc['label'] = pd.Series(bc.target)
print(df_bc.head())

sc = SparkContext().getOrCreate()
sqlContext = SQLContext(sc)

data = sqlContext.createDataFrame(df_bc)
print(data.printSchema())

features = bc.feature_names

va = VectorAssembler(inputCols = features, outputCol='features')

va_df = va.transform(data)
va_df = va_df.select(['features', 'label'])
va_df.show(3)

(train, test) = va_df.randomSplit([0.9, 0.1])

lsvc = LinearSVC(labelCol="label", maxIter=50)


lsvc = lsvc.fit(train)

pred = lsvc.transform(test)
pred.show(3)

evaluator=MulticlassClassificationEvaluator(metricName="accuracy")
acc = evaluator.evaluate(pred)

print("Prediction Accuracy: ", acc)

y_pred=pred.select("prediction").collect()
y_orig=pred.select("label").collect()

cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)

DECISION TREE CLASSIFIER: Decision Tree is a supervised learning technique that can
be used for both classification and regression problems, but mostly it is preferred for solving
classification problems. It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf node represents the
outcome. The dataset we used for this is the Iris dataset.

Code:

from pyspark import SparkContext


from pyspark.sql import SQLContext
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['label'] = pd.Series(iris.target)

print(df_iris.head())

sc = SparkContext().getOrCreate()
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(df_iris)
print(data.printSchema())
features = iris.feature_names

va = VectorAssembler(inputCols = features, outputCol='features')


va_df = va.transform(data)
va_df = va_df.select(['features', 'label'])
va_df.show(3)

(train, test) = va_df.randomSplit([0.8, 0.2])

dtc = DecisionTreeClassifier(featuresCol="features", labelCol="label")


dtc = dtc.fit(train)

pred = dtc.transform(test)
pred.show(3)

evaluator=MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
acc = evaluator.evaluate(pred)

print("Prediction Accuracy: ", acc)


y_pred=pred.select("prediction").collect()
y_orig=pred.select("label").collect()

cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)

Results: Several classification methods like SVM, Decision tree, Logistic regression are
applied over three different datasets Breast cancer dataset, Iris dataset, Search Engine dataset
respectively among which Decision tree gave high accuracy.
Exercise 12: Implement K-means clustering algorithm using pyspark.

DESCRIPTION: The K-means clustering algorithm is an unsupervised learning algorithm that is
used to solve clustering problems in machine learning and other fields. It creates clusters by
placing a number of points, called centroids, inside the feature space. Each point in the dataset
is assigned to the cluster of whichever centroid it is closest to. The "K" refers to
how many centroids it creates.
The dataset contains:
CUST ID : Identification of credit card holder (categorical)
BALANCE : Balance amount left in the account to make purchases
BALANCE FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from the account
ONEOFF PURCHASES : Maximum purchase amount done in one go
INSTALLMENTS PURCHASES : Amount of purchases done in installments
CASH ADVANCE : Cash in advance given by the user
PURCHASES FREQUENCY : How frequently the purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF PURCHASES FREQUENCY : How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES INSTALLMENTS FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
CASH ADVANCE FREQUENCY : How frequently the cash in advance is being paid
CASH ADVANCE TRX : Number of transactions made with "Cash in Advance"
PURCHASES TRX : Number of purchase transactions made
CREDIT LIMIT : Limit of credit card for user
PAYMENTS : Amount of payment done by user
MINIMUM PAYMENTS : Minimum amount of payments made by user
PRC FULL PAYMENT : Percent of full payment paid by user
TENURE : Tenure of credit card service for user

ALGORITHM:
• Import the required libraries and dataset.
• Perform exploratory data analysis on the imported dataset and clean the dataset.
• Declare the feature vector and target variable.
• Convert the categorical data into integers.
• Perform feature scaling.
• Import K means clustering.
• Fit the model with the data.
• Predict the results and compute the required measures.
CODE:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install pyspark py4j
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("test_pyspark").getOrCreate()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Clustering using K-Means').getOrCreate()
data_customer=spark.read.csv('CC GENERAL.csv', header=True,
inferSchema=True)
data_customer.printSchema()
data_customer=data_customer.na.drop()
from pyspark.ml.feature import VectorAssembler
data_customer.columns
assemble=VectorAssembler(inputCols=[
'BALANCE',
'BALANCE_FREQUENCY',
'PURCHASES',
'ONEOFF_PURCHASES',
'INSTALLMENTS_PURCHASES',
'CASH_ADVANCE',
'PURCHASES_FREQUENCY',
'ONEOFF_PURCHASES_FREQUENCY',
'PURCHASES_INSTALLMENTS_FREQUENCY',
'CASH_ADVANCE_FREQUENCY',
'CASH_ADVANCE_TRX',
'PURCHASES_TRX',
'CREDIT_LIMIT',
'PAYMENTS',
'MINIMUM_PAYMENTS',
'PRC_FULL_PAYMENT',
'TENURE'], outputCol='features')
assembled_data=assemble.transform(data_customer)
assembled_data.show(2)
from pyspark.ml.feature import StandardScaler
scale=StandardScaler(inputCol='features',outputCol='standardized')
data_scale=scale.fit(assembled_data)
data_scale_output=data_scale.transform(assembled_data)
data_scale_output.show(2)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette_score=[]
evaluator = ClusteringEvaluator(predictionCol='prediction',
    featuresCol='standardized', metricName='silhouette',
    distanceMeasure='squaredEuclidean')
for i in range(2,10):
KMeans_algo=KMeans(featuresCol= 'standardized', k=i)
KMeans_fit=KMeans_algo.fit(data_scale_output)
output=KMeans_fit.transform(data_scale_output)
score=evaluator.evaluate(output)
silhouette_score.append(score)
print("Silhouette Score:",score)
score

# Visualizing the silhouette scores in a plot
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(range(2, 10), silhouette_score)
ax.set_xlabel('k')
ax.set_ylabel('silhouette score')

OUTPUT:
Results : Data is divided across processors and then sorted within each processor. Each processor
works on a set of records. The initial centroid values are initialized and divided/shared across each
of these processors, so every processor has the centroid information. The distance of the points to
these centroids is computed. For the data points in a processor: if a point is closer to a centroid held
by that processor, it is assigned to that cluster; if it is closer to a centroid belonging to a different
processor, the data point is moved to the new processor. This is repeated until convergence is met.
All local clusters from each processor P are returned.
Experiment 8: DEVELOP HBASE DATABASE AND PERFORM BASIC QUERY OPERATIONS

Name: Punna Rahul(123015079)

Name: Rakulan(123015085)

Name: RameshKrishnan(123015086)

• The conventional relational approach presents a modeling challenge for many data analytics applications

• Real-time applications like Facebook

• need to record high volumes of time-based events

• and tend to have many possible structural variations

NoSQL and Column-Oriented Databases

• Refers to

• Non-relational databases

• Encompasses a wide collection of data storage models

• Graph databases

• Document databases

• Key/value data stores

• Column-family databases

• HBase

• Classified as a column-family/column-oriented DB
• Modeled on Google’s BigTable Architecture

• Provides

• Random(row-level) read/write access

• Strong consistency

• “schema-less”/flexible data modeling

• HBase

• Organizes data into tables that contain rows

• Within table

• Rows are identified by their unique row key

• Do not have a data type

• Stored and treated as a byte array

• Row keys are

• Automatically indexed

• similar to primary keys in relational databases

• Table rows are sorted by row key

• Stores data as key/value pairs

• All table lookups are performed via

• table’s row key

• Unique identifier to the stored record data

Creating :

Altering etc.:
Describe :

• Inserting data with put

• To store descriptive data about a link, such as its title

• Table -> row -> column

Get rows or cell values:


Get command accepts parameters like

• Column

• Timestamp

• Version

• To retrieve data
Retrieving data on time range:

Scanning the entire HBase table with the scan command; a combined sketch of these shell commands is given below.
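The command screenshots are not reproduced in this copy. The sketch below shows typical HBase shell commands for the operations listed above; the table name 'links', the column families and the row keys are assumed examples, not values from the original screenshots.

create 'links', 'details'                      # create a table with one column family
describe 'links'                               # show the table schema
alter 'links', NAME => 'stats'                 # add a second column family
put 'links', 'row1', 'details:title', 'Home'   # insert a cell value
get 'links', 'row1'                            # read back a whole row
get 'links', 'row1', {COLUMN => 'details:title'}
scan 'links'                                   # scan the whole table
scan 'links', {TIMERANGE => [1000, 2000]}      # restrict the scan to a time range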


Implement the collaborative filtering system using PySpark

Team Members:

Anand.S(123015011)
Bharathi.S(123015018)
Keerthana.S(123015047)

AIM:
To Implement the collaborative filtering system using PySpark.

DESCRIPTION:

Collaborative Filtering is a mathematical method to find the predictions about how users
can rate a particular item based on ratings of other similar users. Typical Collaborative Filtering
involves 4 different stages:

Data Collection — Collecting user behaviors and associated data items


Data Processing — Processing the collected data
Recommendation Calculation — Calculate referrals based on processed data
Result Derivation — Extract the similarity and return the top N results.

The library package spark.ml currently supports model-based collaborative filtering, in which
users and products are described by a small set of latent factors that can be used to make
predictions. It uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.

We will use the dataset from https://www.last.fm/api/ which contains 3 files:

user_artist_data.txt 3 columns: userid artistid playcount


artist_data.txt 2 columns: artistid artist_name
artist_alias.txt 2 columns: badid, goodid [known incorrectly spelt artists and the correct artist id]

Explicit & Implicit Feedbacks

So what type of data are being collected in the first stage of Collaborative Filtering? There’s two
different categories of data (referred as feedbacks), which can be Explicit or Implicit.

An example of Explicit Feedback is the ratings given by users, which Netflix collects to
make recommendations to their customers after they provide ratings for the movies they have
watched. Implicit Feedback is less straightforward as it is based on the users' interactions with
the platform, ranging from clicks, views, likes and purchases. Spotify makes use of Implicit
Feedback to implement its recommendation system.

Calculating similarity
Once the data has been collected and processed, some mathematical formula is needed to
make the similarity calculation. The two most common measures are:

Euclidean Distance — Distance of the preference between two users. If the distance is small,
similarity between both users is high
Pearson Correlation — If the cosine values (angle of incidence) between two users coincide,
similarity between both users is high
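As a small, illustrative sketch of the two measures above (the rating vectors are made-up values, not taken from the Last.fm data):

import math

# Hypothetical rating/play-count vectors for two users over the same items
user_a = [5.0, 3.0, 4.0, 4.0]
user_b = [3.0, 1.0, 2.0, 3.0]

# Euclidean distance: a smaller distance means the two users are more similar
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(user_a, user_b)))

# Pearson correlation: values close to 1 indicate strong positive similarity
mean_a = sum(user_a) / len(user_a)
mean_b = sum(user_b) / len(user_b)
cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(user_a, user_b))
std_a = math.sqrt(sum((a - mean_a) ** 2 for a in user_a))
std_b = math.sqrt(sum((b - mean_b) ** 2 for b in user_b))
pearson = cov / (std_a * std_b)

print("Euclidean distance:", round(euclidean, 3))
print("Pearson correlation:", round(pearson, 3))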

DATASET:

PYTHON CODE:
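The code is not reproduced in this copy; below is a minimal sketch of the ALS-based recommender described above. The file path, column names and ALS parameter values are assumptions for illustration only.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName('CollaborativeFiltering').getOrCreate()

# user_artist_data.txt: "userid artistid playcount", space separated (assumed path/format)
raw = spark.read.text('user_artist_data.txt')
ratings = raw.selectExpr(
    "cast(split(value, ' ')[0] as int) as userId",
    "cast(split(value, ' ')[1] as int) as artistId",
    "cast(split(value, ' ')[2] as float) as playCount")

train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# ALS with implicit feedback, since play counts are implicit preferences
als = ALS(userCol='userId', itemCol='artistId', ratingCol='playCount',
          implicitPrefs=True, rank=10, maxIter=10, regParam=0.1,
          coldStartStrategy='drop')
model = als.fit(train)

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName='rmse', labelCol='playCount',
                                predictionCol='prediction')
print('RMSE:', evaluator.evaluate(predictions))

# Recommend the top 5 artists for every user
model.recommendForAllUsers(5).show(5, truncate=False)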
Result:

Pyspark is installed successfully and a recommendation system using collaborative filtering for
movie rating is executed successfully.
