BDA MSc IT: Big Data Analytics Journal

Practical No: 1 – K-means clustering.

Aim:
Read a datafile grades_km_input.csv and apply k-means clustering.

Datafile:
https://github.com/Mounaki/Clustering/blob/master/grades_km_input.csv

Source code:
# install required packages

install.packages("plyr")
install.packages("ggplot2")
install.packages("cluster")
install.packages("lattice")
install.packages("grid")
install.packages("gridExtra")

# Load the package

library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(grid)
library(gridExtra)

# A data frame is a two-dimensional, array-like structure in which each column contains
# values of one variable and each row contains one set of values from each column.

grade_input=as.data.frame(read.csv("D:/2020/Big Data Analytics/Practical/grades_km_input.csv"))


kmdata_orig=as.matrix(grade_input[, c ("Student","English","Math","Science")])
kmdata=kmdata_orig[,2:4]
kmdata[1:10,]

# The k-means algorithm is used to identify clusters for k = 1, 2, ..., 15. For each
# value of k, the WSS (within-cluster sum of squares) is calculated.

wss=numeric(15)

# The option nstart=25 specifies that the k-means algorithm will be repeated 25 times,
# each starting with k random initial centroids.

for(k in 1:15)wss[k]=sum(kmeans(kmdata,centers=k,nstart=25)$withinss)

plot(1:15,wss,type="b",xlab="Number of Clusters",ylab="Within sum of square")

# As can be seen, the WSS is greatly reduced when k increases from one to two. Another
# substantial reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly
# linear for k > 3, so k = 3 is chosen.

km = kmeans(kmdata,3,nstart=25)
km
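Beyond the printed summary, the fitted kmeans object exposes its components directly; a minimal sketch inspecting the same km object as above:

# cluster sizes, centroids, and the share of variance explained by the clustering
km$size
km$centers
km$betweenss / km$totss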

# confirm that the WSS for k = 3 from the elbow computation matches the fitted model's WSS
c(wss[3], sum(km$withinss))
df=as.data.frame(kmdata_orig[,2:4])
df$cluster=factor(km$cluster)
centers=as.data.frame(km$centers)
g1=ggplot(data=df, aes(x=English, y=Math, color=cluster )) +
geom_point() + theme(legend.position="right") +
geom_point(data=centers,aes(x=English,y=Math, color=as.factor(c(1,2,3))),size=10, alpha=.3,
show.legend =FALSE)
g2=ggplot(data=df, aes(x=English, y=Science, color=cluster )) +
geom_point () +geom_point(data=centers,aes(x=English,y=Science,
color=as.factor(c(1,2,3))),size=10, alpha=.3, show.legend=FALSE)
g3 = ggplot(data=df, aes(x=Math, y=Science, color=cluster )) +
geom_point () + geom_point(data=centers,aes(x=Math,y=Science,
color=as.factor(c(1,2,3))),size=10, alpha=.3, show.legend=FALSE)
tmp=ggplot_gtable(ggplot_build(g1))

grid.arrange(arrangeGrob(g1 + theme(legend.position="none"),
g2 + theme(legend.position="none"),
g3 + theme(legend.position="none"),
top = "High School Student Cluster Analysis", ncol = 1))

Output:

Practical No: 2 – Apriori algorithm

Aim: Perform Apriori algorithm using Groceries dataset from the R arules package.

Code:

install.packages("arules")
install.packages("arulesViz")
install.packages("RColorBrewer")

# Loading Libraries
library(arules)
library(arulesViz)
library(RColorBrewer)

# import dataset
data(Groceries)
Groceries

summary(Groceries)
class(Groceries)

# using apriori() function


rules = apriori(Groceries, parameter = list(supp = 0.02, conf = 0.2))
summary (rules)

# using inspect() function


inspect(rules[1:10])
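The ten rules above are listed in no particular order; ranking by lift usually surfaces the strongest associations first. A minimal sketch using the sort() method provided by arules:

# show the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))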

# using itemFrequencyPlot() function


arules::itemFrequencyPlot(Groceries, topN = 20,
col = brewer.pal(8, 'Pastel2'),
main = 'Relative Item Frequency Plot',
type = "relative",
ylab = "Item Frequency (Relative)")

itemsets = apriori(Groceries, parameter = list(minlen=2, maxlen=2, support=0.02, target="frequent itemsets"))
summary(itemsets)

# using inspect() function


inspect(itemsets[1:10])

itemsets_3 = apriori(Groceries, parameter = list(minlen=3, maxlen=3, support=0.02, target="frequent itemsets"))
summary(itemsets_3)

# using inspect() function

inspect(itemsets_3)

Output:
lhs rhs support confidence coverage lift count
[1] {} => {whole milk} 0.25551601 0.2555160 1.00000000 1.000000 2513
[2] {hard cheese} => {whole milk} 0.01006609 0.4107884 0.02450432 1.607682 99
[3] {butter milk} => {other vegetables} 0.01037112 0.3709091 0.02796136 1.916916 102
[4] {butter milk} => {whole milk} 0.01159126 0.4145455 0.02796136 1.622385 114
[5] {ham} => {whole milk} 0.01148958 0.4414062 0.02602949 1.727509 113
[6] {sliced cheese} => {whole milk} 0.01077783 0.4398340 0.02450432 1.721356 106
[7] {oil} => {whole milk} 0.01128622 0.4021739 0.02806304 1.573968 111
[8] {onions} => {other vegetables} 0.01423488 0.4590164 0.03101169 2.372268 140
[9] {onions} => {whole milk} 0.01209964 0.3901639 0.03101169 1.526965 119
[10] {berries} => {yogurt} 0.01057448 0.3180428 0.03324860 2.279848 104
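Lift can be verified by hand from the columns above, since lift = confidence / support(rhs). For example, for rule [2], {hard cheese} => {whole milk}, the support of {whole milk} is 0.2555160 (rule [1]):

# lift = confidence / support(rhs)
0.4107884 / 0.2555160 # = 1.607682, matching the lift column of rule [2]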

Practical No: 3 – Linear regression

Practical No: 3 a) Simple linear regression

Aim: Create your own data for years of experience and salary in lakhs, and apply a linear
regression model to predict the salary.

Code:

years_of_exp = c(7,5,1,3)
salary_in_lakhs = c(21,13,6,8)

#employee.data = data.frame(satisfaction_score, years_of_exp, salary_in_lakhs)


employee.data = data.frame(years_of_exp, salary_in_lakhs)
employee.data

# Estimate the salary of an employee based on years of experience in the company.

model <- lm(salary_in_lakhs ~ years_of_exp, data = employee.data)


summary(model)

# The fitted regression equation becomes:

# salary_in_lakhs = 2 + 2.5 * years_of_exp

# Visualization of Regression

plot(salary_in_lakhs ~ years_of_exp, data = employee.data)


abline(model)
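Once fitted, the model can also be queried for new inputs; a minimal sketch (the 4-year query is a hypothetical value, not part of the journal's data):

# predict the salary for an employee with 4 years of experience
predict(model, newdata = data.frame(years_of_exp = 4))
# from the fitted equation: 2 + 2.5 * 4 = 12 lakhs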

Output:

years_of_exp salary_in_lakhs
1 7 21
2 5 13
3 1 6
4 3 8

Residuals:
1 2 3 4
1.5 -1.5 1.5 -1.5

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0000 2.1737 0.92 0.4547
years_of_exp 2.5000 0.4743 5.27 0.0342 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.121 on 2 degrees of freedom


Multiple R-squared: 0.9328, Adjusted R-squared: 0.8993

F-statistic: 27.78 on 1 and 2 DF, p-value: 0.03417
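As a quick sanity check, the fitted equation reproduces the residuals shown above: for the first employee (7 years of experience), the fitted salary is 2 + 2.5 * 7 = 19.5 lakhs, and the observed value of 21 leaves the residual of 1.5.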

b) Logistic regression

Aim: Take the built-in Default dataset from the ISLR package and apply generalized logistic
regression to predict whether a person will be a defaulter or not, considering student,
income, and balance as inputs.

Source code:

install.packages("ISLR")
library(ISLR)

#load dataset
data <- ISLR::Default
print (head(ISLR::Default))

#view summary of dataset


summary(data)

#find total observations in dataset


nrow(data)

#Create Training and Test Samples


#split the dataset into a training set to train the model on and a testing set to test the model
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
print (sample)

train <- data[sample, ]


test <- data[!sample, ]

nrow(train)
nrow(test)

# Fit the Logistic Regression Model


# use the glm() (generalized linear model) function and specify family="binomial"
# so that R fits a logistic regression model to the dataset

model <- glm(default~student+balance+income, family="binomial", data=train)

#view model summary


summary(model)

#Model Diagnostics
install.packages("InformationValue")
library(InformationValue)
predicted <- predict(model, test, type="response")

confusionMatrix(test$default, predicted)
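Before reading the confusion matrix, it can help to choose the probability cutoff explicitly. A minimal sketch, assuming the optimalCutoff() and misClassError() helpers from the same InformationValue package; the actuals are converted to 0/1 first, since these helpers expect numeric labels:

# find the cutoff that minimises misclassification, then report the error at that cutoff
actuals <- ifelse(test$default == "Yes", 1, 0)
optimal <- optimalCutoff(actuals, predicted)[1]
misClassError(actuals, predicted, threshold = optimal)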

Output:

> print (head(ISLR::Default))


default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559

summary(data)
default student balance income
No :9667 No :7056 Min. : 0.0 Min. : 772
Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
Median : 823.6 Median :34553
Mean : 835.4 Mean :33517
3rd Qu.:1166.3 3rd Qu.:43808
Max. :2654.3 Max. :73554
> nrow(data)
[1] 10000
> print (sample)
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
FALSE TRUE FALSE FALSE
[19] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
TRUE TRUE FALSE TRUE
> nrow(train)
[1] 6964
> nrow(test)
[1] 3036

Practical 4 a Decision Tree

Get the ElemStatLearn package from the CRAN archive:
https://cran.r-project.org/src/contrib/Archive/ElemStatLearn/

Code:
# Decision Tree Classification

# Importing the dataset


dataset = read.csv('D:\\2020\\Big Data Analytics\\Practical\\p4 decision tree\\Social_Network_Ads.csv')
dataset = dataset[3:5]

# Encoding the target feature as factor


dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)

test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

# Fitting Decision Tree Classification to the Training set


install.packages('rpart')
library(rpart)
classifier = rpart(formula = Purchased ~ .,
data = training_set)

# Predicting the Test set results


y_pred = predict(classifier, newdata = test_set[-3], type = 'class')

# Making the Confusion Matrix


cm = table(test_set[, 3], y_pred)
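The confusion matrix reduces to a single accuracy figure with base R alone:

# accuracy = correctly classified test points / all test points
accuracy = sum(diag(cm)) / sum(cm)
accuracy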

# Visualising the Training set results


install.packages("ElemStatLearn")
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
main = 'Decision Tree Classification (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

# Visualising the Test set results


library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

# Plotting the tree

plot(classifier)
text(classifier)

Input: Social_Network_Ads.csv

User ID Gender Age EstimatedSalary Purchased


15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 0
15694829 Female 32 150000 1
15600575 Male 25 33000 0
15727311 Female 35 65000 0

Output:

Practical no: 4b Naïve Bayes Classification

Code:

# Naive Bayes

# Importing the dataset


dataset = read.csv('D:\\2020\\Big Data Analytics\\Practical\\p4 naive bayes\\Social_Network_Ads.csv')
dataset = dataset[3:5]

# Encoding the target feature as factor


dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

# Fitting Naive Bayes to the Training set


install.packages('e1071')
library(e1071)
classifier = naiveBayes(x = training_set[-3],
y = training_set$Purchased)

# Predicting the Test set results


y_pred = predict(classifier, newdata = test_set[-3])

# Making the Confusion Matrix


cm = table(test_set[, 3], y_pred)
print(cm)
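predict() on a naiveBayes classifier can also return posterior class probabilities instead of hard labels; a minimal sketch using the type = 'raw' option of e1071's predict method:

# posterior probabilities for the first few test observations
probs = predict(classifier, newdata = test_set[-3], type = 'raw')
head(probs)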

# Visualising the Training set results


install.packages("ElemStatLearn")
library(ElemStatLearn)
set = training_set
print(set)
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
main = 'Naive Bayes (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

# Visualising the Test set results


library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'NaiveBayes (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Input: Social_Network_Ads.csv

User ID Gender Age EstimatedSalary Purchased


15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 0
15694829 Female 32 150000 1
15600575 Male 25 33000 0
15727311 Female 35 65000 0

Output:

Practical 5: Text Analysis

Code:
# Natural Language Processing

# Importing the dataset


dataset_original = read.delim('D:\\2020\\Big Data Analytics\\Practical\\P6 NLP\\Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

# Cleaning the texts


install.packages('tm')
install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model


dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked
print(dataset$Liked)
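Before modelling, it is worth sanity-checking the bag-of-words matrix; a minimal sketch using tm's findFreqTerms():

# number of reviews x number of retained terms
dim(dtm)
# terms appearing at least 20 times across all reviews
findFreqTerms(dtm, lowfreq = 20)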

# Encoding the target feature as factor


dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Random Forest Classification to the Training set


install.packages('randomForest')
library(randomForest)
classifier = randomForest(x = training_set[-692],
y = training_set$Liked,
ntree = 10)

# Predicting the Test set results


y_pred = predict(classifier, newdata = test_set[-692])

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)
print(cm)
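To see which words drive the classification, the forest's variable importance can be plotted; a minimal sketch using varImpPlot() from randomForest:

# the ten words with the highest mean decrease in Gini impurity
varImpPlot(classifier, n.var = 10, main = 'Top 10 predictive words')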

Input: Restaurant_Reviews.tsv
Review Liked
Wow... Loved this place. 1
Crust is not good. 0
Not tasty and the texture was just nasty. 0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. 1
The selection on the menu was great and so were the prices. 1
:
:
Overall I was not impressed and would not go back. 0
The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time. 0
Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing out
the time it took to bring the check. 0

Output:

Practical No.: 6 and 7
6 Aim: Install VirtualBox.
7 Aim: Install, configure, and run Hadoop and HDFS, and explore HDFS.

Step 1: Download and install VirtualBox


Go to the Oracle VirtualBox website and get the latest stable version from:
https://www.virtualbox.org/
Click on 'Download'.

You will get the file VirtualBox-6.1.22-144080-Win.exe downloaded.
Double-click to run it, then click on 'Next'.

Click on 'Next' without changing the default folder.

Again, click on 'Next'.

Finally, click on ‘Yes’.

Click on ’Install’.

It may ask you for permission to install; click 'Yes' to allow.
Select 'Install'.

Click on 'Finish' to finish the installation of VirtualBox.

Step 2: Download Ubuntu

Download the ISO file ubuntu-20.04.2.0-desktop-amd64.iso, which is required to install Ubuntu.
Browse to ubuntu.com, click on 'Download', and choose 20.04 LTS (LTS stands for Long Term Support).

The file may take a few minutes to download.
Now click on 'New' in VirtualBox and enter the name 'Ubuntu'.

Click on ‘Next’.

Here, allocate memory up to the green indicator (1970 MB) and click on 'Next'.

Don't change anything on this screen and click on 'Create'.

Click on 'Next', keeping the selection as it is (VDI).

Keep this screen as it is and click on 'Next'.

Keep the file location as it is, but preferably set the size to 100 GB, and click on 'Create'.
You should now see Ubuntu listed in the VirtualBox manager.

Select 'Settings'.
Select 'General' -> 'Basic'; you may change the name from Ubuntu to Ubuntu 20.04.
Select 'Bidirectional' in 'General' -> 'Advanced'.

Go to the 'System' option and raise the processor count up to the green bar, usually 4 (if it allows).

Cut and paste your Ubuntu .iso file from its current folder to the
C:\Users\ADMIN\VirtualBox VMs\Ubuntu 20.04 folder.


Click on 'Storage', then click on 'Empty', followed by 'Choose a disk file'.

Browse to the folder containing the Ubuntu ISO file.

Click on the Ubuntu .iso file, click on 'Open', and then click on 'OK'.

Select Ubuntu and click the 'Start' button.

Again, click on the 'Start' button. The virtual machine boots and shows one or more startup screens.

Keep closing all warnings.
Next, you will get the installer screen automatically.

Select language -> English and click on 'Install Ubuntu'. In the 'Keyboard Layout' screen, select 'English UK' and click on 'Continue'.
Select the checkbox for third-party software.

Click on 'Continue'.

Select 'Erase disk and install Ubuntu' and click on 'Install Now'.
Click on 'Continue' on the next screen.
Select "Kolkata" for "Where are you?" and click on 'Continue'.

Enter your name, the computer's name, username, and password, confirm the password, and click on 'Continue'.

The installation of Ubuntu starts. Click on 'Finish' once the installation is done.
Click on 'Restart' and press the Enter key.

Step 3: Install Hadoop

Log in to Ubuntu.
Some keys may be remapped: for example, typing @ produces ".
** Please refer to the note "Some keys for Ubuntu under the UK keyboard layout" at the end.

After login, search for the Ubuntu terminal in the search bar.

Apply the following commands from the Ubuntu terminal.

Prerequisite

ubuntu@ubuntu:~$ sudo apt update


Ign:1 cdrom://Ubuntu 20.04.2.0 LTS _Focal Fossa_ - Release amd64 (20210209.1) focal
InRelease
Hit:2 cdrom://Ubuntu 20.04.2.0 LTS _Focal Fossa_ - Release amd64 (20210209.1) focal
Release
Hit:4 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:6 http://security.ubuntu.com/ubuntu focal-security InRelease
Reading package lists... Done

Building dependency tree
Reading state information... Done
291 packages can be upgraded. Run 'apt list --upgradable' to see them.

bda@bda-VirtualBox:~$ sudo apt install default-jdk


Reading package lists... Done
Building dependency tree
:
Setting up default-jdk (2:1.11-72) ...
Setting up libxt-dev:amd64 (1:1.1.5-1) ...

bda@bda-VirtualBox:~$ java -version


openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing)

OpenSSH server


bda@bda-VirtualBox:~$ sudo apt install openssh-server openssh-client -y
Reading package lists... Done
Building dependency tree
:
Processing triggers for ufw (0.36-6) ...

bda@bda-VirtualBox:~$ sudo adduser hdoop


Adding user `hdoop' ...
Adding new group `hdoop' (1000) ...
Adding new user `hdoop' (1000) with group `hdoop' ...
Creating home directory `/home/hdoop' ...
Copying files from `/etc/skel' ...
New password: hdoop
Retype new password:
passwd: password updated successfully
Changing the user information for hdoop
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] y

bda@bda-VirtualBox:~$ su - hdoop
Password: hdoop

hdoop@bda-VirtualBox:~$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa


Generating public/private rsa key pair.
Created directory '/home/hdoop/.ssh'.
Your identification has been saved in /home/hdoop/.ssh/id_rsa
Your public key has been saved in /home/hdoop/.ssh/id_rsa.pub
The key fingerprint is:

SHA256:EDxiHTL1r3LUCdKFWc0moPHUh1D8tU6Y0b2rnxuwUtQ hdoop@bda-VirtualBox
The key's randomart image is:
+---[RSA 3072]----+
| o+=.+X++ . . |
| oo+Oo.= * + .|
| . .+.= = * E.|
| o + .= o. |
| S + = .|
| . . . +. |
| . o . ... |
| o .. o|
| .+.|
+----[SHA256]-----+

hdoop@bda-VirtualBox:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys


hdoop@bda-VirtualBox:~$ chmod 0600 ~/.ssh/authorized_keys

hdoop@bda-VirtualBox:~$ ssh localhost


The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is
SHA256:4TE4DDAv14vhARPWjZcW3C5UM3X94B7wUudPrT+ZmF0.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
:
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

Downloading Hadoop

hdoop@bda-VirtualBox:~$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
--2021-06-14 08:52:00-- https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.209.10,
135.181.214.104, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 359196911 (343M) [application/x-gzip]
Saving to: ‘hadoop-3.3.1.tar.gz’

hadoop-3.3.1.tar.gz 100%[==================>] 342.56M 15.4MB/s in 33s

2021-06-14 08:52:34 (10.2 MB/s) - ‘hadoop-3.3.1.tar.gz’ saved [359196911/359196911]

hdoop@bda-VirtualBox:~$ ls
hadoop-3.3.1.tar.gz
hdoop@bda-VirtualBox:~$ tar xzf hadoop-3.3.1.tar.gz
hdoop@bda-VirtualBox:~$ ls
hadoop-3.3.1 hadoop-3.3.1.tar.gz

Editing 6 important files to create a single-node cluster

hdoop@bda-VirtualBox:~$ su - bda
bda@bda-VirtualBox:~$ sudo adduser hdoop sudo
Adding user `hdoop' to group `sudo' ...
Adding user hdoop to group sudo
Done.
bda@bda-VirtualBox:~$ su - hdoop

1. Edit the .bashrc file

hdoop@bda-VirtualBox:~$ sudo nano .bashrc
The file will open; add the following lines at the end:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save the file with Ctrl+X, then Y, then Enter.

hdoop@bda-VirtualBox:~$ source ~/.bashrc

2. Edit the hadoop-env.sh file

The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and
Hadoop-related project settings.
When setting up a single node Hadoop cluster, you need to define which Java
implementation is to be utilized. Use the previously created $HADOOP_HOME variable to
access the hadoop-env.sh file:
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
At the end of the file, add the following line, then save it:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

3. Edit the core-site.xml file

The core-site.xml file defines HDFS and Hadoop core properties.


To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your
NameNode, and the temporary directory Hadoop uses for the map and reduce process.
Open the core-site.xml file in a text editor:

hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

4. Edit the hdfs-site.xml file
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

5. Edit the mapred-site.xml file
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

6. Edit the yarn-site.xml file
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>

<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Format HDFS NameNode


hdoop@bda-VirtualBox:~$ hdfs namenode -format
:
xid=0 when meet shutdown.
2021-06-18 14:16:33,353 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bda-VirtualBox/127.0.1.1
************************************************************/

Start Hadoop Cluster (services)

hdoop@bda-VirtualBox:~$ cd hadoop-3.3.1
hdoop@bda-VirtualBox:~/hadoop-3.3.1$ cd sbin
hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [bda-VirtualBox]
bda-VirtualBox: Warning: Permanently added 'bda-virtualbox' (ECDSA) to the list of known
hosts.
2021-06-18 14:26:34,962 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ./start-yarn.sh
Starting resourcemanager
Starting nodemanagers
To see all running components, use the jps command:

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ jps
11744 NodeManager
11616 ResourceManager
12192 Jps
11268 SecondaryNameNode
11077 DataNode
10954 NameNode

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ hdfs dfs -ls /
2021-06-18 14:33:24,698 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ sudo nano /home/bda/sample.txt
[sudo] password for hdoop:
Edit the file by adding some text, then save and exit.

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ls /home/bda/
Desktop Downloads Pictures sample.txt Videos
Documents Music Public Templates

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ hdfs dfs -put /home/bda/sample.txt /


2021-06-18 14:44:24,257 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ hdfs dfs -ls /


2021-06-18 14:48:17,221 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 hdoop supergroup 6 2021-06-18 14:44 /sample.txt

**Note:
Some keys for Ubuntu under the UK keyboard layout:
" -> @
@ -> "
~ -> | (pipe)
If you cannot type the pipe character directly, copy it from a file or from a web search.