BDA MSc IT
Aim:
Read a datafile grades_km_input.csv and apply k-means clustering.
Datafile:
https://github.com/Mounaki/Clustering/blob/master/grades_km_input.csv
Source code:
# install required packages
install.packages("plyr")
install.packages("ggplot2")
install.packages("cluster")
install.packages("lattice")
install.packages("grid")
install.packages("gridExtra")
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(grid)
library(gridExtra)
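# The clustering loop below needs the grades in a matrix named kmdata; a minimal
# sketch of loading it (the column names English, Math and Science are assumptions
# based on the usual grades_km_input.csv example):
kmdata_orig = read.csv("grades_km_input.csv")
kmdata = as.matrix(kmdata_orig[, c("English", "Math", "Science")])
summary(kmdata)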
# A data frame is a two-dimensional, array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
# The k-means algorithm is used to identify clusters for k = 1, 2, ..., 15. For each value of k, the WSS (within sum of squares) is calculated.
wss=numeric(15)
# The option nstart=25 specifies that the k-means algorithm will be repeated 25 times, each time starting with k random initial centroids.
for (k in 1:15) wss[k] = sum(kmeans(kmdata, centers = k, nstart = 25)$withinss)
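# Plot WSS against the number of clusters to show the elbow referred to below:
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")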
# As can be seen, the WSS is greatly reduced when k increases from one to two. Another substantial reduction in WSS occurs at k = 3. However, the improvement in WSS is fairly linear for k > 3.
km = kmeans(kmdata,3,nstart=25)
km
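# grid.arrange() below assumes three ggplot scatter plots g1, g2 and g3; a minimal
# sketch of building them (column names as assumed above):
df = as.data.frame(kmdata_orig[, c("English", "Math", "Science")])
df$cluster = factor(km$cluster)
g1 = ggplot(df, aes(x = English, y = Math, color = cluster)) + geom_point()
g2 = ggplot(df, aes(x = English, y = Science, color = cluster)) + geom_point()
g3 = ggplot(df, aes(x = Math, y = Science, color = cluster)) + geom_point()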
grid.arrange(arrangeGrob(g1 + theme(legend.position = "none"),
                         g2 + theme(legend.position = "none"),
                         g3 + theme(legend.position = "none"),
                         top = "High School Student Cluster Analysis", ncol = 1))
Output:
Aim: Perform the Apriori algorithm on the Groceries dataset from the R arules package.
Code:
install.packages("arules")
install.packages("arulesViz")
install.packages("RColorBrewer")
# Loading Libraries
library(arules)
library(arulesViz)
library(RColorBrewer)
# import dataset
data(Groceries)
Groceries
summary(Groceries)
class(Groceries)
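# The rules in the output below come from an apriori() call that is missing above;
# a minimal sketch (the support and confidence thresholds are assumptions):
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.25))
inspect(rules[1:10])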
Output:
lhs rhs support confidence coverage lift count
[1] {} => {whole milk} 0.25551601 0.2555160 1.00000000 1.000000 2513
[2] {hard cheese} => {whole milk} 0.01006609 0.4107884 0.02450432 1.607682 99
[3] {butter milk} => {other vegetables} 0.01037112 0.3709091 0.02796136 1.916916 102
[4] {butter milk} => {whole milk} 0.01159126 0.4145455 0.02796136 1.622385 114
[5] {ham} => {whole milk} 0.01148958 0.4414062 0.02602949 1.727509 113
[6] {sliced cheese} => {whole milk} 0.01077783 0.4398340 0.02450432 1.721356 106
[7] {oil} => {whole milk} 0.01128622 0.4021739 0.02806304 1.573968 111
[8] {onions} => {other vegetables} 0.01423488 0.4590164 0.03101169 2.372268 140
[9] {onions} => {whole milk} 0.01209964 0.3901639 0.03101169 1.526965 119
[10] {berries} => {yogurt} 0.01057448 0.3180428 0.03324860 2.279848 104
Aim: Create your own data for years of experience and salary in lakhs and apply a linear regression model to predict the salary.
Code:
years_of_exp = c(7,5,1,3)
salary_in_lakhs = c(21,13,6,8)
# Estimation of the salary of an employee, based on his years of experience in the company.
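# Build the data frame and fit the linear model; this fit produces the residuals
# and coefficients shown in the output below:
employee_data = data.frame(years_of_exp, salary_in_lakhs)
print(employee_data)
model = lm(salary_in_lakhs ~ years_of_exp, data = employee_data)
summary(model)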
# Visualization of Regression
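# A minimal sketch of the plot (base graphics; the axis labels are assumptions):
plot(years_of_exp, salary_in_lakhs, xlab = "Years of Experience", ylab = "Salary (in lakhs)")
abline(model, col = "blue")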
Output:
years_of_exp salary_in_lakhs
1 7 21
2 5 13
3 1 6
4 3 8
Residuals:
1 2 3 4
1.5 -1.5 1.5 -1.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0000 2.1737 0.92 0.4547
years_of_exp 2.5000 0.4743 5.27 0.0342 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Aim: Take the in-built data from the ISLR package and apply generalized logistic regression to find whether a person would be a defaulter or not, considering student, income and balance as inputs.
Source code:
install.packages("ISLR")
library(ISLR)
#load dataset
data <- ISLR::Default
print (head(ISLR::Default))
#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
print (sample)
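# Define the training and testing sets from the sample vector; these definitions
# are needed by nrow(train) and nrow(test) below:
train <- data[sample, ]
test <- data[!sample, ]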
nrow(train)
nrow(test)
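# Fit the logistic regression model with student, balance and income as inputs;
# the fitted model is needed by predict() below:
model <- glm(default ~ student + balance + income, family = "binomial", data = train)
summary(model)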
#Model Diagnostics
install.packages("InformationValue")
library(InformationValue)
predicted <- predict(model, test, type="response")
confusionMatrix(test$default, predicted)
summary(data)
Output:
default student balance income
No :9667 No :7056 Min. : 0.0 Min. : 772
Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
Median : 823.6 Median :34553
Mean : 835.4 Mean :33517
3rd Qu.:1166.3 3rd Qu.:43808
Max. :2654.3 Max. :73554
> nrow(data)
[1] 10000
> print (sample)
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
FALSE TRUE FALSE FALSE
[19] TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
TRUE TRUE FALSE TRUE
> nrow(train)
[1] 6964
> nrow(test)
[1] 3036
Code:
# Decision Tree Classification
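# Importing the dataset (a sketch; assuming the usual Social_Network_Ads layout with
# Age, EstimatedSalary and Purchased in columns 3 to 5):
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as a factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))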
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
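# Fitting the Decision Tree classifier to the Training set; a minimal sketch using
# rpart (the original code stops at feature scaling):
library(rpart)
classifier = rpart(formula = Purchased ~ ., data = training_set)
# Predicting the Test set results and building the confusion matrix
y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
cm = table(test_set[, 3], y_pred)
print(cm)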
Input: Social_Network_Ads.csv
Output:
Code:
# Naive Bayes
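# Importing the dataset (a sketch; assuming the usual Social_Network_Ads layout with
# Age, EstimatedSalary and Purchased in columns 3 to 5):
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
# Encoding the target feature as a factor (naiveBayes requires a factor response)
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))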
# Splitting the dataset into the Training set and Test set
#install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
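# Fitting Naive Bayes to the Training set; a minimal sketch using e1071 (the
# original code stops at feature scaling):
library(e1071)
classifier = naiveBayes(x = training_set[-3], y = training_set$Purchased)
# Predicting the Test set results and building the confusion matrix
y_pred = predict(classifier, newdata = test_set[-3])
cm = table(test_set[, 3], y_pred)
print(cm)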
Input: Social_Network_Ads.csv
Output:
Code:
# Natural Language Processing
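# The split below assumes a bag-of-words data frame named dataset with a Liked column;
# a minimal sketch of building it with the tm package (the file name
# Restaurant_Reviews.tsv and the 0.999 sparsity threshold are assumptions):
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = factor(dataset_original$Liked, levels = c(0, 1))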
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
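# Fitting a classifier to the Training set; a minimal sketch using Random Forest
# (the choice of classifier is an assumption, as the original code stops at the split):
library(randomForest)
classifier = randomForest(x = training_set[-ncol(training_set)], y = training_set$Liked, ntree = 10)
# Predicting the Test set results and building the confusion matrix
y_pred = predict(classifier, newdata = test_set[-ncol(test_set)])
cm = table(test_set$Liked, y_pred)
print(cm)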
Input:
Review Liked
Wow... Loved this place. 1
Crust is not good. 0
Not tasty and the texture was just nasty. 0
Stopped by during the late May bank holiday off Rick Steve recommendation and loved it. 1
The selection on the menu was great and so were the prices. 1
:
:
Overall I was not impressed and would not go back. 0
The whole experience was underwhelming, and I think we'll just go to Ninja Sushi next time. 0
Then, as if I hadn't wasted enough of my life there, they poured salt in the wound by drawing out the time it took to bring the check. 0
Output:
Practical No.: 1
Aim: Install, configure and run Hadoop and HDFS and explore HDFS.
Click on ‘Install’.
You will get the file, which may take a few minutes to download.
Now, click on ‘New’ in VirtualBox and enter the Name as ‘Ubuntu’ as shown below:
Click on ‘Next’.
Keep the file location as it is, but preferably set the size to 100 GB, and click on ‘Create’.
You should see the following screen showing Ubuntu on the Virtual Machine.
Select ‘Settings’.
Select ‘General’ -> ‘Basic’ as shown below:
You may change the name from Ubuntu to Ubuntu 20.04.
Select ‘Bidirectional’ in ‘General’ -> ‘Advanced’ as shown below:
Click on the Ubuntu….iso file, click on ‘Open’ and then click on ‘OK’.
Select Language -> English and click on ‘Install Ubuntu’. In the ‘Keyboard Layout’ screen, select ‘English (UK)’. Click on ‘Continue’.
Select the checkbox for third-party software as shown below:
Click on ‘Continue’.
Prerequisite
bda@bda-VirtualBox:~$ su - hdoop
Password: hdoop
Downloading Hadoop
hdoop@bda-VirtualBox:~$ wget
https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
--2021-06-14 08:52:00-- https://downloads.apache.org/hadoop/common/hadoop-
3.3.1/hadoop-3.3.1.tar.gz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 135.181.209.10,
135.181.214.104, ...
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 359196911 (343M) [application/x-gzip]
Saving to: ‘hadoop-3.3.1.tar.gz’
hdoop@bda-VirtualBox:~$ ls
hadoop-3.3.1.tar.gz
hdoop@bda-VirtualBox:~$ tar xzf hadoop-3.3.1.tar.gz
hdoop@bda-VirtualBox:~$ ls
hadoop-3.3.1 hadoop-3.3.1.tar.gz
hdoop@bda-VirtualBox:~$ su - bda
bda@bda-VirtualBox:~$ sudo adduser hdoop sudo
Adding user `hdoop' to group `sudo' ...
Adding user hdoop to group sudo
Done.
bda@bda-VirtualBox:~$ su - hdoop
1.
hdoop@bda-VirtualBox:~$ sudo nano .bashrc
The file will open; add the following lines at the end of the file:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.3.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
2.
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and
Hadoop-related project settings.
When setting up a single node Hadoop cluster, you need to define which Java
implementation is to be utilized. Use the previously created $HADOOP_HOME variable to
access the hadoop-env.sh file:
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
At the end of the file, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/
Save it.
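3.
A core-site.xml step belongs here, before the HDFS and YARN configuration; a minimal sketch following the usual single-node setup (the tmpdata directory and port 9000 are assumed values):
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>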
4.
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
5.
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
6.
hdoop@bda-VirtualBox:~$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
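Before starting the services for the first time, the NameNode is normally formatted; a standard step (assuming the PATH set in .bashrc above):
hdoop@bda-VirtualBox:~$ hdfs namenode -format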
hdoop@bda-VirtualBox:~$ cd hadoop-3.3.1
hdoop@bda-VirtualBox:~/hadoop-3.3.1$ cd sbin
hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [bda-VirtualBox]
bda-VirtualBox: Warning: Permanently added 'bda-virtualbox' (ECDSA) to the list of known
hosts.
2021-06-18 14:26:34,962 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ./start-yarn.sh
Starting resourcemanager
Starting nodemanagers
To see all running components, use the jps command:
hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ jps
11744 NodeManager
11616 ResourceManager
12192 Jps
11268 SecondaryNameNode
11077 DataNode
10954 NameNode
hdoop@bda-VirtualBox:~/hadoop-3.3.1/sbin$ ls /home/bda/
Desktop Downloads Pictures sample.txt Videos
Documents Music Public Templates
Note:
Some keys are swapped under the Ubuntu UK keyboard layout:
“ -> @
@ -> “
~ -> | (pressing the ~ key produces the pipe character; if needed, copy the pipe character from a file or search for it online)
Compiled by: Ms. Beena Kapadia
Vidyalankar School of Information Technology