BDA Lab Exercises Lab Manual - 2023
1. Design MapReduce technique for word counting using python on Hadoop cluster
2. Develop MapReduce algorithm for finding the coolest year from the available Weather data
using java/python program on Hadoop cluster
3. Design a bloom filter to remove the duplicate users from the Log file and analyse the filter
with different cases.
4. Implement the Flajolet-Martin algorithm to extract the distinct twitter users from the twitter
data set.
5. Demonstrate the significance of Page rank algorithm in the Hadoop platform with available
data set using MapReduce based Matrix vector multiplication algorithm.
6. Design a friend of friend’s network using Girvan Newman algorithms from the social network
data.
7. Demonstrate the relational algebra operations such as sort, group, join, project, and filter using
Hive.
8. Load the unstructured data into the Hadoop and convert it into the structured data using Hive.
Develop Hive and HBase databases, tables, views, functions and indexes and perform
some basic query operations.
9. Implement a Pig Latin scripts to sort, group, join, project, and filter your data.
10. Implement the collaborative filtering system using PySpark
11. Perform the Logistic regression classification, SVM and Decision tree classifier algorithms
using PySpark, display the results with graphs, and compare the accuracy of the algorithms
using Precision, Recall and F-Measure.
12. Implement the K-Means clustering algorithm using PySpark.
EXERCISE 1: Design MapReduce technique for word counting using python on Hadoop
cluster
Hadoop MapReduce is a software framework for easily writing applications which process vast
amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input
dataset into independent chunks which are processed by the map tasks in a completely parallel
manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Typically both the input and the output of the job are stored in a file-system. The framework takes
care of scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master ResourceManager, one worker
NodeManager per cluster-node, and MRAppMaster per application. Minimally, applications
specify the input/output locations and supply map and reduce functions via implementations of
appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job
configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration
to the ResourceManager which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them, providing status
and diagnostic information to the job-client.
Although the Hadoop framework is implemented in Java, MapReduce applications need not be
written in Java.
⚫ Hadoop Streaming is a utility which allows users to create and run jobs with any executables
(e.g. shell utilities) as the mapper and/or the reducer.
⚫ Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI
based).
Hadoop Installation-
https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
1.1 Finding word count using Map reduce algorithm using Python without Hadoop:
Problem Statement: Design Map_Reduce technique for word counting using python
Description: Map Reduce is a programming model for writing applications that can process big
data in parallel on multiple nodes. Map-reduce provides analytical capabilities for analyzing huge
volumes of data.
Algorithm/Steps:
⚫ Create one text file and store input into the text file.
⚫ Splitting - Split the input into various lines by taking the split function in python.
⚫ Mapping - Each of the words should be separated in a particular line and their counts must be
saved
⚫ Sort/Shuffle - From all the sentences get all the common words together along with their
respective counts.
⚫ Reduce - Sum up the individual counts of these common words and calculate the final count
of each word
⚫ Result - Return the result of the map reduce
Program:
sentence = open('word.txt')
txt = sentence.read()
data = txt.split()

# Map phase: emit (word, 1) for every word in the input
for word in data:
    print(word, 1)

# Reduce phase: sum the counts of each distinct word
q = []
for word in data:
    if word not in q:
        q.append(word)
        currentword = word
        count = 0
        for wrd in data:
            if wrd == currentword:
                count += 1
        print(currentword, count)
Input: River Bear Dear Clever Cheer Dear Bear River Clever Cheer Power tower
Output:
Result: By using the Map-Reduce algorithm, the word count for a particular sentence was computed
and the required output was obtained.
1.2 Finding word count using Map reduce algorithm using Python with Hadoop
Problem Statement: Designing Map Reduce technique for word counting using python in Hadoop.
Description: Map Reduce is a programming model for writing applications that can process big
data in parallel on multiple nodes. Map-reduce provides analytical capabilities for analyzing huge
volumes of data.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
⚫ The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
⚫ The Reduce task takes the output from the Map as an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.
Algorithm :
map(key, value):
// key: document name; value: text of the document
for each word w in value:
emit(w, 1)
reduce(key, values):
// key: a word; value: an iterator over counts
result = 0
for each count v in values:
result += v
emit(key, result)
Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use
various different languages for writing MapReduce programs, like Python, C++, Ruby, etc. To
implement the word count problem in Python and understand Hadoop Streaming, we will be creating
mapper.py and reducer.py to perform the map and reduce tasks. Let's create one file which contains
multiple words that can be counted.
Step 1: Open the virtual box and go to the Ubuntu operating system. Create a file with the name
word_count_data.txt and add some data to it. Type these commands in the terminal.
Step 2: Create a mapper.py file that implements the mapper logic. It will read the data from STDIN,
split the lines into words, and generate an output of each word with its individual
count.
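The mapper.py and reducer.py files appear only as screenshots in the original manual; the sketches below assume the standard Hadoop Streaming word-count pattern, with tab-separated key/value pairs written to standard output.

#!/usr/bin/python3
# mapper.py -- emit (word, 1) for every word read from STDIN
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/python3
# reducer.py -- sum the counts for each word (Hadoop delivers the input sorted by key)
import sys
current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, count
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))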
start-dfs.sh
start-yarn.sh
Now make a directory word_count_in_python in our HDFS in the root directory that will store our
word_count_data.txt file, using the below command.
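The exact commands are not reproduced in the manual; a typical sequence, assuming HDFS is running and word_count_data.txt is in the current directory, would be:

hdfs dfs -mkdir /word_count_in_python
hdfs dfs -copyFromLocal word_count_data.txt /word_count_in_python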
Now our data file has been sent to HDFS successfully. We can check whether it was sent or not by
using the below command or by manually visiting our HDFS.
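For example:

hdfs dfs -ls /word_count_in_python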
# changing the permission to read, write, execute for user, group and
others
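The command that goes with this comment would typically be (using the file names created in the steps above):

chmod 777 mapper.py reducer.py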
Step 5: Now download the latest hadoop-streaming jar file. Then place this hadoop-streaming
jar file in a place from where you can easily access it. In my case, I am placing it in the /Documents
folder where the mapper.py and reducer.py files are present. Now let's run our Python files with the help
of the Hadoop streaming utility as shown below.
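A typical invocation is sketched below; the jar name and the /home/user/Documents paths are assumptions and should be adjusted to your installation.

hadoop jar /home/user/Documents/hadoop-streaming-3.1.2.jar \
  -input /word_count_in_python/word_count_data.txt \
  -output /word_count_in_python/output \
  -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
  -file /home/user/Documents/mapper.py -file /home/user/Documents/reducer.py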
/word_count_in_python/output/part-00000 in my case.
We can check results by manually visiting the location in HDFS or with the help of the cat command
as shown below.
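For example:

hdfs dfs -cat /word_count_in_python/output/part-00000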
EXERCISE 2 : Develop MapReduce algorithm for finding the coolest year from the available
weather data using a Java/Python program on Hadoop cluster
AIM : To develop a MapReduce algorithm to find the hottest and coldest month in a year from
the available weather data using Python.
MODEL DESCRIPTION :
1. Map Phase
2. Sort and Shuffle Phase
3. Reduce Phase
DATASET DESCRIPTION:
The dataset is created manually using a notepad with details of weather such as months and their
respective minimum and maximum temperatures in three different columns. This file is then
imported into hadoop to perform a map reducing algorithm to find the coolest and the hottest
month of the year.
ALGORITHM :
1) Create an input text file using a notepad containing the weather details of all months in a year.
2) It has columns such as Month, Minimum Temperature and Maximum Temperature.
3) Import the dataset and store the data in three separate lists.
4) Find out the minimum temperature and maximum temperature using the MapReduce algorithm.
5) We can then find out the coolest and the hottest month of the year.
CODE:
myfile = open("weatherdata.txt")
txt = myfile.read()
months = []
min_temp = []
max_temp = []
total_list = txt.split(',')
for i in range(len(total_list)):
    if i % 3 == 0:
        months.append(total_list[i].strip())
    elif i % 3 == 1:
        min_temp.append(int(total_list[i]))
    else:
        max_temp.append(int(total_list[i]))
minimum = min(min_temp)
maximum = max(max_temp)
print("Month with minimum temperature of {t} is {m}".format(t=minimum, m=months[min_temp.index(minimum)]))
print("Month with maximum temperature of {t} is {m}".format(t=maximum, m=months[max_temp.index(maximum)]))
myfile.close()
OUTPUT:
RESULT : Thus, the coolest and hottest months of the year have been determined using map
reducing algorithm.
AIM : To develop a MapReduce algorithm to find the hottest and coldest month in a year from
the available weather data on Hadoop Cluster.
MODEL DESCRIPTION : MapReduce is a distributed data processing algorithm, mainly
inspired by the functional programming model. This algorithm is useful for processing huge amounts
of data in parallel, in a reliable and efficient way, in cluster environments. It divides input tasks into
smaller and manageable sub-tasks and executes them in parallel. MapReduce works by breaking the
process into 3 phases:
1. Map Phase
2. Sort and Shuffle Phase
3. Reduce Phase
DATASET DESCRIPTION: The dataset is created manually using a notepad with details of
weather such as months and their respective minimum and maximum temperatures in three
different columns. This file is then imported into Hadoop to perform the MapReduce algorithm to
find the coolest and the hottest month of the year.
ALGORITHM:
1) Create an input text file using a notepad containing the weather details of all months in a year.
2) It has columns such as Month, Minimum Temperature and Maximum Temperature.
3) Import the dataset and store the data in three separate lists.
4) Find out the minimum temperature and maximum temperature using the MapReduce algorithm.
5) We can then find out the coolest and the hottest month of the year.
CODE:
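The original manual shows this code only as a screenshot; a minimal Hadoop Streaming sketch is given below, assuming each input line has the form Month,MinTemp,MaxTemp and that a single reducer is used.

#!/usr/bin/python3
# weather_mapper.py -- emit (month, min_temp, max_temp) triples from STDIN
import sys
for line in sys.stdin:
    parts = line.strip().split(',')
    if len(parts) == 3:
        month, tmin, tmax = parts
        print('%s\t%s\t%s' % (month, tmin, tmax))

#!/usr/bin/python3
# weather_reducer.py -- track the months with the overall minimum and maximum temperature
import sys
coolest, hottest = None, None
for line in sys.stdin:
    month, tmin, tmax = line.strip().split('\t')
    tmin, tmax = int(tmin), int(tmax)
    if coolest is None or tmin < coolest[1]:
        coolest = (month, tmin)
    if hottest is None or tmax > hottest[1]:
        hottest = (month, tmax)
print('Coolest month:', coolest[0], coolest[1])
print('Hottest month:', hottest[0], hottest[1])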
OUTPUT:
RESULT: Thus, the coolest and hottest months of the year have been determined using map
reducing algorithm.
EXERCISE 3 : Design a bloom filter to remove the duplicate users from the Log file and
analyze the filter with different cases.
Aim: To design a bloom’s filter to remove duplicates and analyze filters with different cases.
Description: Bloom’s filter is a space efficient probabilistic data structure that is used to test
whether an element is a member of a set. It is used to remove duplicate users from log files and
analyze the filters in different cases.
Algorithm:
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while
rejecting most of the stream elements whose keys are not in S.
Code:
# Build the Bloom filter bit array from the stream of user keys
M = int(input())
x = list(map(int, input().split()))
arr = [0] * M
for i in range(len(x)):
    h1 = x[i] % M
    h2 = (3 * x[i] + 2) % M
    if arr[h1] == 0:
        arr[h1] = 1
    if arr[h2] == 0:
        arr[h2] = 1
print("Final Blooms filter :", arr)

# Check a new element: it is a (probable) duplicate only if both bits are already set
x1 = int(input())
h1 = x1 % M
h2 = (3 * x1 + 2) % M
if arr[h1] == 1 and arr[h2] == 1:
    print("It is duplicate")
else:
    arr[h1] = 1
    arr[h2] = 1
    print("It is not a duplicate")
Output:
Result: Bloom’s Filter is thereby used to remove duplicates and analyze the filter with different
cases.
3.2 With Hadoop:
Aim: To design a bloom’s filter to remove duplicate users from log values and analyze the filter with
different cases.
Description: Bloom’s filter is a space efficient probabilistic data structure that is used to test
whether an element is a member of a set. It is used to remove duplicate users from log files and
analyze the filters in different cases.
Algorithm:
1) Import all necessary packages
2) Initialize all indexes to zero in an empty bit array.
3) Define the necessary hash functions h1 and h2.
4) Store the IP addresses which we take from the dataset into a list, convert them to integers and
store them in a new list.
5) Calculate hash values h1 and h2 for the numbers for which we will find duplicates.
6) If the element for the particular hash is 0, change it to 1.
7) And we will get the final Bloom filter.
8) Now we can get an IP address as input from the user and we will check whether the particular
user is already present or not.
9) If the value at a particular index is already present in the array (hash value is 1), print it as a
duplicate document.
10) If the value at a particular index is 0, then it is unique and print it as not a duplicate document.
Code:
import pandas as pd

df = pd.read_csv("weblog.csv")

# Getting only the IP addresses (assuming the weblog.csv file has an 'IP' column such as 10.128.2.1)
x = df["IP"]

# Taking a large m value and creating the Bloom filter bit array
m = 6023
l = [0] * m

# Function to execute the two hash functions and return the index values
def bloom(n, m):
    h = n % m
    h1 = (2 * n + 3) % m
    return h, h1

y = ""
# For running in less time, only 100 records are considered.
# Use range(len(df)) to process all records.
for i in range(100):
    s = x[i]
    a = s.split(".")          # split the IP address into its octets
    for j in range(3):
        y = y + a[j]
    z = int(y)                # join the first three octets into one integer key
    print("Integer =", z)
    y = ""
    c, c1 = bloom(z, m)
    print("Blooms value", c)
    print("Second hash value", c1)
    l[c] = 1
    l[c1] = 1
Result:
Bloom’s Filter is thereby used to remove duplicate users from the log file and analyze the filter
with different cases.
EXERCISE 4 : Implement the Flajolet-Martin algorithm to extract the distinct twitter users
from the twitter data set
Aim: To implement the Flajolet-Martin algorithm using Python to extract the distinct twitter users
from the twitter dataset.
Description:
Data cleaning can be done using various techniques but before cleaning the data, it is necessary to
know the amount of useful data present in the dataset. Therefore, before the removal of duplicate
data from a data stream or database, it is necessary to have knowledge of the distinct or unique data
present. A way to do so is by hashing the elements of the universal set using the Flajolet-Martin
Algorithm. The FM algorithm is used in database queries, big data analytics, spectrum sensing in
cognitive radio sensor networks, and many more areas. It shows superior performance compared
with many other methods for finding distinct elements in a stream of data. The Flajolet-Martin
Algorithm, also known as the FM algorithm, is used to approximate the number of unique elements in
a data stream or database in one pass. The highlight of this algorithm is that it uses less memory
space while executing.
Algorithm:
The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the
stream, the more different hash-values we shall see. As we see more different hash-values, it
becomes more likely that one of these values will be “unusual.” The particular unusual property
we shall exploit is that the value ends in many 0’s.
• Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in
some number of 0's, possibly none. Call this number the tail length for a and h.
• Let R be the maximum tail length of any a seen so far in the stream.
• Then estimate 2^R as the number of distinct elements seen in the stream.
• The probability that a given stream element a has h(a) ending in at least r 0's is 2^(-r).
• Suppose there are m distinct elements in the stream; then the probability that none of them
has tail length at least r is (1 - 2^(-r))^m.
Steps:
1. Select a hash function h so that each element in the set is mapped to a bit string of at least
log2(n) bits. The names are converted to binary numbers using the in-built hash function present in
Python.
2. For each element x, r(x) = length of trailing zeroes in h(x).
3. R = max(r(x)). The maximum count of trailing zeroes is assigned to R.
4. R is substituted in the formula 2^R to yield the estimate of the distinct count.
5. Distinct elements = 2^R.
Code:
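The original code appears only as a screenshot; a minimal Python sketch of the Flajolet-Martin estimate is given below. The Twitter dataset path and column are not specified in the manual, so a small list of handles is used for illustration.

import hashlib

def trailing_zeros(x):
    # Tail length: number of trailing zero bits in the binary form of x
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        count += 1
        x >>= 1
    return count

def flajolet_martin(stream):
    # Estimate the number of distinct elements as 2^R,
    # where R is the maximum tail length observed over the stream
    R = 0
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

users = ["@alice", "@bob", "@carol", "@alice", "@bob", "@dave"]
print("Estimated distinct users:", flajolet_martin(users))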
Result:
The Flajolet-Martin Algorithm has been implemented to find the number of distinct twitter users and it
is observed that it occupies less memory space and shows better results in terms of time in
seconds.
EXERCISE 5 : Demonstrate the significance of the PageRank algorithm in the Hadoop platform
with an available data set using a MapReduce based Matrix vector multiplication algorithm.
Project Objective
Hadoop
An Apache open source project. It is an application framework that helps integrate HDFS (Hadoop
Distributed File System) and run MapReduce jobs.
MapReduce
A programming model for processing large data sets in parallel on a Hadoop cluster (see Exercise 1).
PageRank
PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine
results. PageRank was named after Larry Page, one of the foundersof Google. PageRank is a way
of measuring the importance of website pages. PageRank works by counting the number and
quality of links to a page to determine a rough estimate of how important the website is. The
underlying assumption is that more important websites are likely to receive more links fromother
websites.
It is not the only algorithm used by Google to order search engine results, but it is the first algorithm
that was used by the company, and it is the best-known. The above centrality measure is not
implemented for multi-graphs.
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a
person randomly clicking on links will arrive at any particular page. PageRank can be calculated
for collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called “iterations”,
through the collection to adjust approximate PageRank values to more closely reflect the
theoretical true value.
CODE:
import numpy as np
import numpy.linalg as la

# L2 is the 7x7 column-stochastic link matrix of the example web network
# (assumed to be defined earlier in the notebook, one column per page).

# Power iteration with the raw link matrix L2 (no damping)
r = 100 * np.ones(7) / 7          # initial rank vector (7 entries of 1/7 x 100 each)
lastR = r
r = L2 @ r
i = 0
while la.norm(lastR - r) > 0.01:
    lastR = r
    r = L2 @ r
    i += 1
print(str(i) + " iterations to convergence.")
r

46 iterations to convergence.
array([1.07742874e-02, 4.20323681e-03, 2.13132105e-02, 1.25178906e-02, 0.00000000e+00,
       8.56654771e+01, 1.42857143e+01])

# Power iteration with a damping factor d = 0.8
d = 0.8
M = d * L2 + (1 - d) / 7 * np.ones([7, 7])   # np.ones() is the J matrix, with ones for each entry
r = 100 * np.ones(7) / 7          # reset the rank vector (7 entries of 1/7 x 100 each)
lastR = r
r = M @ r
i = 0
while la.norm(lastR - r) > 0.01:
    lastR = r
    r = M @ r
    i += 1
print(str(i) + " iterations to convergence.")
r

19 iterations to convergence.
array([10.15892915,  6.7094426 , 17.3137475 , 11.32722084,  2.85714286,
       37.34780278, 14.28571429])
This is certainly better: the PageRank now gives sensible numbers for the proportion of users that end up on
each webpage.
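The exercise title asks for a MapReduce-based matrix-vector multiplication; the notebook above uses plain NumPy, so a minimal Hadoop Streaming style sketch is added here for reference. It assumes the link matrix is stored as "i j value" triples and that the current rank vector fits in memory in a hypothetical file r.txt.

#!/usr/bin/python3
# mv_mapper.py -- for each matrix entry m[i][j], emit (i, m[i][j] * r[j])
import sys
r = [float(v) for v in open('r.txt').read().split()]   # current rank vector (assumed small)
for line in sys.stdin:
    i, j, value = line.strip().split()
    print('%s\t%s' % (i, float(value) * r[int(j)]))

#!/usr/bin/python3
# mv_reducer.py -- sum the partial products for each row i to get the new rank entry r[i]
import sys
current_row, total = None, 0.0
for line in sys.stdin:
    i, partial = line.strip().split('\t')
    if i == current_row:
        total += float(partial)
    else:
        if current_row is not None:
            print('%s\t%s' % (current_row, total))
        current_row, total = i, float(partial)
if current_row is not None:
    print('%s\t%s' % (current_row, total))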
EXERCISE 6 : Design a friend of friend's network using Girvan Newman algorithmsfrom
the social network data
Description: The Girvan-Newman technique for the detection and analysis of community structure
depends upon the iterative elimination of edges with the highest number of shortest paths that
pass through them. By getting rid of these edges, the network breaks down into smaller networks,
i.e. communities.
In order to calculate edge betweenness it is necessary to find all shortest paths in the graph. The
algorithm starts with one vertex, calculates edge weights for paths going through that vertex, and
then repeats this for every vertex in the graph and sums the weights for every edge.
Although understandable and simple, the Girvan-Newman Algorithm has its own limitations. The
algorithm is not very time efficient for networks containing a large number of nodes and edges.
Communities in huge and complex networks are difficult to detect and therefore Girvan-Newman
is not favorable for very large data sets. This led to various variations, modifications and other
methods of community detection which work on the idea of “Modularity”.
Algorithm:
• In order to exploit the betweenness of edges, we need to calculate the number of shortest
paths going through each edge; the method described here is the Girvan-Newman (GN)
Algorithm. It visits each node X once and computes the number of shortest paths from X to
each of the other nodes that go through each of the edges.
• The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at
the node X.
• Note that the level of each node in the BFS presentation is the length of the shortest path
from X to that node.
• The edges that go between nodes at the same level can never be part of a shortest path
from X. Edges between levels are called DAG edges (“DAG” stands for directed, acyclic
graph).
• Each DAG edge will be part of at least one shortest path from root X.
• If there is a DAG edge (Y, Z), where Y is at the level above Z (i.e., closer to the root),
then we shall call Y a parent of Z and Z a child of Y, although parents are not necessarily
unique in a DAG as they would be in a tree.
The rules for the calculation are as follows:
1. Each leaf in the DAG (a leaf is a node with no DAG edges to nodes at levels below) gets
a credit of 1.
2. Each node that is not a leaf gets a credit equal to 1 plus the sum of the credits of the DAG
edges from that node to the level below.
3. A DAG edge e entering node Z from the level above is given a share of the credit of Z
proportional to the fraction of shortest paths from the root to Z that go through e.
Source Code:
import networkx as nx

def girvan(g):
    # Repeatedly remove the highest-betweenness edge until the graph splits
    while True:
        a = list(nx.connected_components(g))
        lena = len(a)
        print("The number of connected components are", lena)
        if lena > 1:
            return a
        eb = nx.edge_betweenness_centrality(g)
        g.remove_edge(*max(eb, key=eb.get))

g1 = nx.karate_club_graph()
a1 = girvan(g1)
print("Karate Club Graph")
for i in a1:
    print(sorted(i))
print(" .................................. ")
Input:
Enter input 1:
5
Enter input 2:
0
Output:
Barbell Graph
The number of connected components are 1
The number of connected components are 2 = [0, 1, 2, 3, 4] …………. [8, 9, 5, 6, 7] ………….
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 1
The number of connected components are 2
[32, 33, 2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] ………….
………….
NetworkX is a package for the Python programming language that's used to create,
manipulate, and study the structure, dynamics, and functions of complex graph networks. Using
networkx we can load and store complex Networks.
Exercise 7 : Demonstrate the relational algebra operations such as sort, group, join, project,
and filter using Hive and Pig.
Problem Statement: Demonstrate the relational algebra operations such as sort, group, join,
project and filter using Hive.
Description: Hive is built on top of Apache Hadoop, which is an open-source framework used to
efficiently store and process large datasets. Hive allows users to read, write, and manage petabytes
of data using SQL. Hive provides SQL-like querying capabilities with HiveQL. It supports
structured and unstructured data, and provides native support for common SQL data types, like INT,
FLOAT, and VARCHAR.
Algorithm:
>hive
Table 1 (employee2)
Table 2 (personal2)
describe employee2;
describe personal2;
Viewing the tables in Hive: show tables;
select count(id), city from personal2 group by city having count(id)=1 order by city desc;
select count(distinct role) from employee2;
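The table definitions and the remaining queries appear only as screenshots in the original manual. A representative set of HiveQL statements covering project, filter, sort, group and join is sketched below; the column names (name, role, salary, age) are assumptions for illustration.

create table employee2 (id int, name string, role string, salary int) row format delimited fields terminated by ',';
create table personal2 (id int, city string, age int) row format delimited fields terminated by ',';
-- project and filter
select name, salary from employee2 where salary > 30000;
-- sort
select * from employee2 order by salary desc;
-- group
select role, count(*) from employee2 group by role;
-- join
select e.name, p.city from employee2 e join personal2 p on (e.id = p.id);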
Commands 2 and 3 are used to run pig in the “Grunt” Shell. For this exercise we can use
command 4 to run our scripts.
Steps:
1. Create some sample data or use some database which is already available. In our case we have
used Medicine.txt.
2. Load the file in the Pig Latin script using the LOAD function.
3. For SORT, first load the data and then use the ORDER BY command as shown
below.
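A minimal sample_script.pig sketch; the Medicine.txt column layout used here is an assumption for illustration.

medicine = LOAD 'Medicine.txt' USING PigStorage(',') AS (id:int, name:chararray, price:int);
sorted = ORDER medicine BY price DESC;
DUMP sorted;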
pig -x local sample_script.pig
Output:
1. For executing the GROUP operation, load the data into the Pig Latin script and use the GROUP BY
command, as sketched below.
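A possible sample_script1.pig, using the same assumed columns as above:

medicine = LOAD 'Medicine.txt' USING PigStorage(',') AS (id:int, name:chararray, price:int);
grouped = GROUP medicine BY name;
DUMP grouped;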
To run this Pig script, type the following command in the terminal:
pig -x local sample_script1.pig
OUTPUT:
1. For executing the JOIN operation, load the data into the Pig Latin file and use the JOIN
command, as sketched below.
2. For join we use another text file with some additional details. We use JOIN on a common
column between both the text files and generate the final output with the complete
information. In this case we have used 'med1.txt', which contains the minimum age for each
drug.
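A possible join script (column names are assumptions):

medicine = LOAD 'Medicine.txt' USING PigStorage(',') AS (id:int, name:chararray, price:int);
minage = LOAD 'med1.txt' USING PigStorage(',') AS (name:chararray, min_age:int);
joined = JOIN medicine BY name, minage BY name;
DUMP joined;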
OUTPUT:
1. For executing the FILTER operation, load the data into the Pig Latin file and use the FILTER
BY command, as sketched below.
2. FILTER BY is used to select and display the data based on some condition.
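For example (assumed columns):

medicine = LOAD 'Medicine.txt' USING PigStorage(',') AS (id:int, name:chararray, price:int);
cheap = FILTER medicine BY price < 100;
DUMP cheap;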
OUTPUT:
EXERCISE 10 : COLLABORATIVE FILTERING USING PYSPARK
Description: For this project, we have to create a recommender system that will recommend
new musical artists to a user based on their listening history. Suggesting different songs or
musical artists to a user is important to many music streaming services, such as Wynk and
Spotify. In addition, this type of recommender system could also be used as a means of
suggesting TV shows or movies to a user (e.g., Netflix).
To create this system you will be using Spark and the collaborative filtering technique. The
instructions for completing this project will be laid out entirely in this file. You will have to
implement any missing code as well as answer any questions.
Submission Instructions: Add all of your updates to this IPython file and do not clear any of
the output youget from running your code.
Datasets : You will be using some publicly available song data from Audioscrobbler, which can
be found here (http://wwwetud.iro.umontreal.ca/~bergstrj/audioscrobbler_data.html).
However, we modified the original data files so that the code will run in a reasonable time on
a single machine. The reduced data files have been suffixed with _small.txt and contain only
the information relevant to the top 50 most prolific users (highest artist play counts).
The original data file user_artist_data.txt contained about 141,000 unique users and 1.6 million
unique artists. About 24.2 million users' plays of artists are recorded, along with their count.
Note that when plays are scrobbled, the client application submits the name of the artist being
played. This name could be misspelled or nonstandard, and this may only be detected later.
The artist_data.txt file then provides a map from the canonical artist ID to the name of the artist.
EXERCISE 11 : Perform the Logistic regression classification, SVM and Decision tree classifier
algorithms using PySpark and compare their accuracy using Precision, Recall and F-Measure.
Program Statement : To perform Logistic regression, SVM and Decision tree classification
algorithms using PySpark.
Process:
1. Download the dataset from net as a csv file
2. Install the pyspark on the windows
3. Import the required libraries and modules.
4. Cleaning the dataset
5. Analyzing the dataset
6. Splitting the dataset into training and testing and applying the necessary classification
algorithms we need
7. Summarizing the performance of the prediction model using precision, recall and
accuracy.
Logistic regression: Logistic regression is one of the supervised machine learning algorithms,
used for classification to predict discrete value outcomes. It uses a statistical approach to predict
the outcomes of the dependent variable based on the observations given in the dataset. The
dataset we used here is a search engine dataset.
Code:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Logistic_Regression').getOrCreate()
df = spark.read.csv('search_engine.csv', inferSchema=True, header=True)
df.describe().show()
df.groupBy('Country').count().show()
df.groupBy('Platform').count().show()
df.groupBy('Status').count().show()

import matplotlib.pyplot as plt
import seaborn as sns
df11 = df.toPandas()
sns.set_style('whitegrid')
sns.countplot(x='Country', hue='Platform', data=df11)

# Convert the categorical columns into numeric indexes
from pyspark.ml.feature import StringIndexer
search_engine_indexer = StringIndexer(inputCol="Platform", outputCol="Platform_Num").fit(df)
df = search_engine_indexer.transform(df)
country_indexer = StringIndexer(inputCol="Country", outputCol="Country_Num").fit(df)
df = country_indexer.transform(df)
df.select(['Platform', 'Platform_Num']).show(10, False)
df.select(['Country', 'Country_Num']).show(10, False)

# One-hot encode the indexed columns
from pyspark.ml.feature import OneHotEncoder
search_engine_encoder = OneHotEncoder(inputCol="Platform_Num", outputCol="Search_Engine_Vector").fit(df)
df = search_engine_encoder.transform(df)
country_encoder = OneHotEncoder(inputCol="Country_Num", outputCol="Country_Vector").fit(df)
df = country_encoder.transform(df)

# Assemble all input columns into a single feature vector
from pyspark.ml.feature import VectorAssembler
df_assembler = VectorAssembler(
    inputCols=['Search_Engine_Vector', 'Country_Vector', 'Age',
               'Repeat_Visitor', 'Web_pages_viewed'],
    outputCol="features")
df = df_assembler.transform(df)

model_df = df.select(['features', 'Status'])
training_df, test_df = model_df.randomSplit([0.75, 0.25])

from pyspark.ml.classification import LogisticRegression
log_reg = LogisticRegression(labelCol='Status').fit(training_df)

train_results = log_reg.evaluate(training_df).predictions
train_results.filter(train_results['Status'] == 1) \
    .filter(train_results['prediction'] == 1) \
    .select(['Status', 'prediction', 'probability']).show(10, False)

# Evaluate on the test set
results = log_reg.evaluate(test_df).predictions
true_positives = results.filter((results.Status == 1) & (results.prediction == 1)).count()
true_negatives = results.filter((results.Status == 0) & (results.prediction == 0)).count()
accuracy = float((true_positives + true_negatives) / (results.count()))
print("Accuracy : " + str(accuracy))
SVM (Support Vector Machine): A Support Vector Machine is a supervised learning algorithm that
finds the hyperplane which best separates the two classes. The dataset we used for this is the
Breast Cancer dataset.
Code:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the Breast Cancer dataset into a pandas DataFrame
bc = load_breast_cancer()
df_bc = pd.DataFrame(bc.data, columns=bc.feature_names)
df_bc['label'] = pd.Series(bc.target)

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(df_bc)
print(data.printSchema())

# Assemble the feature columns into a single vector and train a linear SVM
features = bc.feature_names
va = VectorAssembler(inputCols=list(features), outputCol='features')
va_df = va.transform(data)
va_df = va_df.select(['features', 'label'])
va_df.show(3)

train, test = va_df.randomSplit([0.8, 0.2])
lsvc = LinearSVC(labelCol='label').fit(train)
pred = lsvc.transform(test)
pred.show(3)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
acc = evaluator.evaluate(pred)
y_pred = pred.select("prediction").collect()
y_orig = pred.select("label").collect()
cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)
DECISION TREE CLASSIFIER: Decision Tree is a Supervised learning technique that can
be used for both classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf node represents the
outcome. The dataset we used for this is the Iris dataset.
Code:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load the Iris dataset into a pandas DataFrame
iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['label'] = pd.Series(iris.target)
print(df_iris.head())

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
data = sqlContext.createDataFrame(df_iris)
print(data.printSchema())

# Assemble the feature columns and train a decision tree classifier
features = iris.feature_names
va = VectorAssembler(inputCols=list(features), outputCol='features')
va_df = va.transform(data).select(['features', 'label'])
train, test = va_df.randomSplit([0.8, 0.2])
dtc = DecisionTreeClassifier(featuresCol='features', labelCol='label').fit(train)

pred = dtc.transform(test)
pred.show(3)
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
acc = evaluator.evaluate(pred)
y_pred = pred.select("prediction").collect()
y_orig = pred.select("label").collect()
cm = confusion_matrix(y_orig, y_pred)
print("Confusion Matrix:")
print(cm)
Results: Several classification methods like SVM, Decision tree, Logistic regression are
applied over three different datasets Breast cancer dataset, Iris dataset, Search Engine dataset
respectively among which Decision tree gave high accuracy.
Exercise 12: Implement K-means clustering algorithm using pyspark.
ALGORITHM:
• Import the required libraries and dataset.
• Perform exploratory data analysis on the imported dataset and clean the dataset.
• Declare the feature vector and target variable.
• Convert the categorical data into integers.
• Perform feature scaling.
• Import K means clustering.
• Fit the model with the data.
• Predict the results and compute the required measures.
CODE:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install pyspark py4j
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("test_pyspark").getOrCreate()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Clustering using K-Means').getOrCreate()
data_customer=spark.read.csv('CC GENERAL.csv', header=True,
inferSchema=True)
data_customer.printSchema()
data_customer=data_customer.na.drop()
from pyspark.ml.feature import VectorAssembler
data_customer.columns
assemble=VectorAssembler(inputCols=[
'BALANCE',
'BALANCE_FREQUENCY',
'PURCHASES',
'ONEOFF_PURCHASES',
'INSTALLMENTS_PURCHASES',
'CASH_ADVANCE',
'PURCHASES_FREQUENCY',
'ONEOFF_PURCHASES_FREQUENCY',
'PURCHASES_INSTALLMENTS_FREQUENCY',
'CASH_ADVANCE_FREQUENCY',
'CASH_ADVANCE_TRX',
'PURCHASES_TRX',
'CREDIT_LIMIT',
'PAYMENTS',
'MINIMUM_PAYMENTS',
'PRC_FULL_PAYMENT',
'TENURE'], outputCol='features')
assembled_data=assemble.transform(data_customer)
assembled_data.show(2)
from pyspark.ml.feature import StandardScaler
scale=StandardScaler(inputCol='features',outputCol='standardized')
data_scale=scale.fit(assembled_data)
data_scale_output=data_scale.transform(assembled_data)
data_scale_output.show(2)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
silhouette_score=[]
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='standardized',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')
for i in range(2, 10):
    KMeans_algo = KMeans(featuresCol='standardized', k=i)
    KMeans_fit = KMeans_algo.fit(data_scale_output)
    output = KMeans_fit.transform(data_scale_output)
    score = evaluator.evaluate(output)
    silhouette_score.append(score)
    print("Silhouette Score:", score)
score
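To choose a value of k, the silhouette scores collected above can be plotted against the number of clusters; a short sketch (matplotlib is assumed to be available in the Colab environment):

import matplotlib.pyplot as plt
# Plot silhouette score versus number of clusters k
plt.plot(range(2, 10), silhouette_score, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.show()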
OUTPUT:
Results : Data is divided across processors and then sorted within each processor. Each processor
works on a set of records. The initial centroid values are initialized and divided/shared across each
of these processors, so every processor holds one centroid of information. The distance of the data
points to these centroids is then computed. For data points in a processor which are extremely low or high:
if they are closer to the centroid of that processor, assign them to that cluster; else, if they are closer to
the centroid belonging to a different processor, move the data point to the new processor. Repeat till
convergence is met. Finally, all local clusters from each processor P are returned.
Experiment 8: DEVELOP HBASE DATABASE AND PERFORM BASIC QUERY OPERATIONS
Name: Rakulan(123015085)
Name: RameshKrishnan(123015086)
• NoSQL refers to
  • Non-relational databases
  • Graph databases
  • Document databases
  • Column-family databases
• HBase
  • Classified as a column-family/column-oriented DB
  • Modeled on Google’s BigTable architecture
  • Provides strong consistency
  • Within a table, data is automatically indexed (by row key)
Creating :
Altering, etc. :
Describe :
• Data is addressed as table -> row -> column
• The get command is used to retrieve data and accepts parameters like
  • Column
  • Timestamp
  • Version
Retrieving data on time range:
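The shell screenshots are not reproduced in this manual; a representative HBase shell session covering create, alter, describe, put, get and a time-range scan is sketched below. The table name, column families and timestamps are assumptions for illustration.

create 'employee', 'personal', 'professional'
alter 'employee', NAME => 'personal', VERSIONS => 3
describe 'employee'
put 'employee', 'row1', 'personal:name', 'Rahul'
put 'employee', 'row1', 'professional:role', 'Developer'
get 'employee', 'row1'
get 'employee', 'row1', {COLUMN => 'personal:name', VERSIONS => 2}
scan 'employee', {TIMERANGE => [1514764800000, 1543622400000]}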
Team Members:
Anand.S(123015011)
Bharathi.S(123015018)
Keerthana.S(123015047)
AIM:
To Implement the collaborative filtering system using PySpark.
DESCRIPTION:
Collaborative Filtering is a mathematical method to find the predictions about how users
can rate a particular item based on ratings of other similar users. Typical Collaborative Filtering
involves 4 different stages:
The library package spark.ml currently supports model-based collaborative filtering, in which
users and products are described by a small set of latent factors that can be used to make
predictions. It uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.
So what type of data are being collected in the first stage of Collaborative Filtering? There are two
different categories of data (referred to as feedback), which can be Explicit or Implicit.
An example of Explicit Feedback is the ratings given by users, which Netflix is collecting to
make recommendations to their customers after they provide ratings to the movies they have
watched. Implicit Feedback is less straightforward as it’s based on the users’ interactions with
the platform, ranging from clicks, views, likes and purchases. Spotify makes use of Implicit
Feedbacks to implement it’s recommendation system.
Calculating similarity
Once the data has been collected and processed, some mathematical formula is needed to
make the similarity calculation. The two most common measures are:
Euclidean Distance — Distance of the preference between two users. If the distance is small,
similarity between both users is high
Pearson Correlation — If the cosine values (angle of incidence) between two users coincide,
similarity between both users is high
DATASET:
PYTHON CODE:
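The dataset preview and the full notebook appear only as screenshots; a minimal PySpark ALS sketch is given below, assuming a ratings file named ratings.csv with userId, movieId and rating columns.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName('CollaborativeFiltering').getOrCreate()
ratings = spark.read.csv('ratings.csv', header=True, inferSchema=True)

train, test = ratings.randomSplit([0.8, 0.2])

# Alternating Least Squares learns the latent user and item factors
als = ALS(userCol='userId', itemCol='movieId', ratingCol='rating',
          coldStartStrategy='drop', nonnegative=True)
model = als.fit(train)

# Evaluate with RMSE and produce top-5 recommendations per user
predictions = model.transform(test)
rmse = RegressionEvaluator(metricName='rmse', labelCol='rating',
                           predictionCol='prediction').evaluate(predictions)
print('RMSE:', rmse)
model.recommendForAllUsers(5).show(truncate=False)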
Result:
Pyspark is installed successfully and a recommendation system using collaborative filtering for
movie rating is executed successfully.