
UNIVERSITY INSTITUTE OF COMPUTING

MASTER OF COMPUTER APPLICATIONS


Big Data Analytics
22CAH-782

DISCOVER . LEARN . EMPOWER


1
Outlines

• Big SQL : Introduction


• Machine Learning : Introduction
• Supervised Learning
• Unsupervised Learning
• Collaborative Filtering
• Big Data Analytics with BigR

2
Big Data Analytics using Machine Learning
Techniques
Machine learning (ML) focuses on analyzing data with different
statistical tools and learning processes to obtain more knowledge from
the data (Faruk and Cahyono, 2018). ML has been applied to many
problems, such as recognition systems, informatics and data mining, and
autonomous control systems (Qiu et al., 2016). Machine learning
algorithms are defined in different ways, but they are commonly
subdivided into supervised learning, unsupervised learning and semi-
supervised learning, which combines both supervised learning and
unsupervised learning.

3
Supervised Machine Learning algorithms
• Supervised learning is defined as learning a function that maps an input variable to
an output variable based on example input-output pairs. Linear regression and
random forest are two examples of supervised learning algorithms that will
be considered in this course.
• Linear Regression
• Linear regression is a statistical tool that is mainly used for predicting
and forecasting values based on historical information, subject to some
important assumptions:
• It requires a dependent variable and a set of independent variables.
• There exists a linear relationship between the dependent and the
independent variables, that is:
4
𝑦 = 𝑎1𝑥1 + 𝑎2𝑥2 + … + 𝑎𝑝𝑥𝑝 + 𝑏 + 𝑒
Where:
𝑦 : is the response variable.
𝑥𝑗 : is the predictor variable j, where j = 1, 2, 3, …, p.
𝑒 : is the error term, normally distributed with mean 0 and constant variance.
𝑎𝑗 and 𝑏 : are the regression coefficients to be estimated.
Regression is a technique used to identify the linear relationship between target
variables and explanatory variables (Prajapati, 2013). Other terms are also used to
describe these variables. One variable is called the predictor variable, whose value
is gathered through experiments. The other variable is called the response variable,
whose value is derived from the predictor variable.
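To make the equation concrete, here is a minimal sketch of estimating the coefficients by ordinary least squares. It is written in Python with NumPy for illustration (not the R used later in the course), and the data and true coefficient values (a1 = 2, a2 = -1, b = 0.5) are made up.

```python
import numpy as np

# Made-up data for y = a1*x1 + a2*x2 + b + e with a1=2, a2=-1, b=0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two predictor variables
e = rng.normal(scale=0.1, size=100)            # error term, mean 0
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + e

# Append a column of ones so the intercept b is estimated alongside a1, a2
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a1, a2, b = coeffs                             # estimates close to 2, -1, 0.5
```

With enough observations and a small error term, the estimated coefficients recover the values used to generate the data.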

5
• Logistic Regression
• In statistics, logistic regression is known as a probabilistic
classification model. Logistic regression is widely used in many
disciplines, including the medical and social science fields. Logistic
regression can be either binomial or multinomial, and it is very popular
for predicting a categorical response. Binary logistic regression is used in
cases where the outcome for a dependent variable has two
possibilities. Multinomial logistic regression is concerned
with cases where there are three or more possible types.
• Using logistic regression, the input values (x) are combined linearly
using weights or coefficient values to predict an output value (y) based
on the log odds ratio.
6
• One major difference between linear regression and logistic regression is that in
linear regression, the output value being modeled is a numerical value while in
logistic, it is a binary value (0 or 1) (Prajapati, 2013).
• The logistic regression equation can be given as follows:
• Py = e^(b0 + b1*xi) / (1 + e^(b0 + b1*xi))
• Where Py is the predicted probability of the outcome, b0 is the bias or
intercept term and b1 is the coefficient for the single input value (xi). Each
column in your input data has an associated b coefficient (a constant real value)
that must be learned from your training data.
• It is quite simple to make predictions using logistic regression, since you only
need to plug numbers into the logistic regression equation to obtain the output.
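The prediction step can be sketched as follows. This is a minimal Python illustration; the coefficient values b0 = -1.0, b1 = 0.75 and the input x = 2.0 are assumed for the example, not taken from any real model.

```python
import math

def predict_probability(x, b0, b1):
    """Plug values into Py = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Assumed coefficients and input value, for illustration only
p = predict_probability(x=2.0, b0=-1.0, b1=0.75)   # about 0.62
label = 1 if p >= 0.5 else 0                       # classify at the 0.5 cut-off
```

Here z = -1.0 + 0.75 * 2.0 = 0.5, so the predicted probability is e^0.5 / (1 + e^0.5), roughly 0.62, and the observation is classified as 1.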

7
Random Forest
Random forests (RF) are a very popular method for classification and
regression. RF is a very powerful machine learning algorithm which
combines the decision outputs from several trees (Faruk and
Cahyono, 2018). In fact, it combines tree predictors in such a way that each tree
depends on the values of a random vector sampled independently. The
generalization error of a forest of tree classifiers relies on the strength of the
individual trees in the forest and the correlation between them (Breiman, 2001). RF
follows the same principles as decision trees, but it does not select all the data
points and variables for each of the trees. It randomly samples data points and
variables for each tree that it creates and then combines the outputs at the end.
This removes the bias that a single decision tree model might introduce into the
system, and it also improves the predictive power significantly.
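The sampling-and-voting idea can be sketched as follows. This is a conceptual Python illustration, not the course's R code: trivial one-feature "stumps" stand in for real decision trees, so only the bootstrap sampling, random variable selection and majority vote are shown.

```python
import random
from collections import Counter

def train_stump(rows, feature):
    # Stand-in for real tree training: threshold one feature at its mean
    mean = sum(r[feature] for r in rows) / len(rows)
    return lambda r: 1 if r[feature] > mean else 0

def random_forest(rows, n_features, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap the data points
        feature = rng.randrange(n_features)         # random variable per tree
        stumps.append(train_stump(sample, feature))
    def predict(r):
        votes = Counter(s(r) for s in stumps)       # combine outputs at the end
        return votes.most_common(1)[0][0]           # majority vote
    return predict

# Made-up data: two low-valued and two high-valued observations
rows = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
predict = random_forest(rows, n_features=2)
```

Because each tree sees a different bootstrap sample and variable, individual trees disagree, but the combined vote is more stable than any single tree.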

8
Unsupervised Machine Learning algorithms

• Compared to supervised learning, unsupervised learning makes use of
input data that have no corresponding output variables. In other words,
unsupervised learning algorithms have no outcome to be predicted.
They aim at discovering the structure and distribution of the data in order
to learn more about it. Two examples of unsupervised learning
algorithms are the K-Means algorithm and the Principal Components Analysis
(PCA) algorithm.
K-Means Algorithm
• K-Means is an unsupervised machine learning algorithm
which aims at clustering data together, that is, finding clusters in data
based on similarity in the descriptions of the data and their
relationships (Hans-Hermann, 2008). Each cluster is associated with a
center point known as a centroid.
9
Based on the centers, the distance of each observation from each center is
calculated, and the clusters are formed by assigning points to the closest centroid.
Various distance measures, such as the Euclidean distance, squared Euclidean
distance and the Manhattan (or city-block) distance, are used to determine which
observation is appended to which centroid. The number of clusters is represented
by the variable K.
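The assignment step can be sketched as follows. This is a minimal Python illustration with made-up points and K = 2, showing only the Euclidean-distance assignment to the closest centroid.

```python
import math

def euclidean(p, c):
    """Euclidean distance between a point and a centroid."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def assign(points, centroids):
    """Return the index of the closest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda k: euclidean(p, centroids[k]))
            for p in points]

# Made-up observations: two near (1, 1) and two near (8.5, 8.5), with K = 2
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
centroids = [(1.0, 1.0), (8.5, 8.5)]
labels = assign(points, centroids)   # first two points join cluster 0
```

A full K-Means run alternates this assignment step with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.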
• Principal Components Analysis (PCA)
The PCA algorithm aims at analyzing data to identify patterns and expressing the data in
such a way as to highlight similarities and differences (Smith, 2002). Once the patterns
are found, the data can be compressed by reducing the dimensions of the dataset
with minimal loss of information. This technique is commonly used for image
compression and can be applied in fields such as finance and bioinformatics to
find patterns in data of high dimension.
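The idea can be sketched as follows: a minimal Python/NumPy illustration (not the course's R code) on a small made-up 2-D dataset, reducing it to one dimension via the eigendecomposition of the covariance matrix.

```python
import numpy as np

# Made-up 2-D data with strongly correlated variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)                      # center each variable

cov = np.cov(Xc, rowvar=False)               # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = eigvecs[:, np.argmax(eigvals)]         # direction of greatest variance

projected = Xc @ top                         # 1-D compressed representation
explained = eigvals.max() / eigvals.sum()    # share of variance retained
```

Because the two variables move together, the top component captures most of the variance, so the 1-D projection loses little information.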

10
• Setting up the environment for Big Data Analytics using Spark
To be able to run R on Hadoop, you now have to install some packages.
SparkR has been chosen as it has an in-built machine learning library
consisting of a number of algorithms that run in memory. Proceed with
the following steps to install SparkR.
• Steps:
1. From the VM console, type the following to install the packages for
curl, libcurl and libxml (if not already installed). If they are already
installed, the messages displayed will indicate so.
sudo yum -y install curl libcurl-devel
sudo yum -y install libxml2 libxml2-devel
11
2. For installing SparkR, the tutorial from http://spark.rstudio.com/ has been used.
3. Open RStudio and install the sparklyr package as follows:
install.packages("sparklyr")
Warning: In case the following error is encountered:
• ERROR: configuration failed for package 'stringi'
• Install the stringi dependency as follows:
install.packages(c("stringi"), configure.args = c("--disable-cxx11"), repos = "https://cran.rstudio.com")
• Install the sparklyr package again as follows:
install.packages("sparklyr")

12
4. Install a local version of Spark
library(sparklyr)
spark_install(version = "2.1.0")
5. Install the latest version of sparklyr
install.packages(c("devtools", "curl"))
devtools::install_github("rstudio/sparklyr")
• Your SparkR environment should now be ready to use.
6. You can connect to a local instance of Spark as well as to remote Spark clusters.
• Use the following code to connect to a local instance of Spark via the
spark_connect function:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
13
7. Read the dataset
• Note that before reading any dataset in Spark, you need to upload the
data from the source (if it is not already in Spark) using the
following code:
databaseName <- read.csv("name of the database.csv")
8. Load data from R dataset to Spark

library(dplyr)
tableName_tbl <- copy_to(sc, TableName)

14
• Applying supervised Machine Learning techniques using Spark
• This section provides examples of how the supervised machine learning techniques
linear regression, random forest and logistic regression can be applied to a
dataset, based on the following case study.
• Case Study Description
• The case study chosen is on the birth weights of babies. The dataset has been
downloaded from the following link: https://www.sheffield.ac.uk/mash/data. It
contains details on the weight of newborn babies and their parents. The dataset
contains mostly continuous variables, which are most useful for correlation and
regression. Supervised techniques are applied to the dataset to determine which
variables have an influence on the babies' birthweight.

15
The table below describes the different variables that have been used for the birthweight
dataset.

16
• Linear Regression
• The first supervised technique that is being applied on the birthweight
dataset is the linear regression.
1. The libraries have to be imported and the database loaded.
2. Load the data into an R dataset.
birthweight <- read.csv("birthweight_reduced.csv")
birthweight_tbl <- copy_to(sc,birthweight)
head(birthweight_tbl)

17
The details of the dataset can be obtained by using the head() function. After running the
code, the following is displayed:

18
• The summary can be obtained by using the summary() function
summary(birthweight_tbl)
3. Apply the linear regression function to the dataset. To apply
linear regression, it is important to know which factor to take as the
response variable and which one(s) to take as predictor variables.
4. Different linear models are formulated based on questions.

19
• Question: Do the mother's height and father's height have any
influence on the baby's length?
A linear model is formulated as follows, based on the question:
lm_model2 <- birthweight_tbl %>%
  select(length, mheight, fheight) %>%
  ml_linear_regression(length ~ mheight + fheight)
• summary(lm_model2)
• R-Squared: 0.1779
• Root Mean Squared Error: 0.997

20
21
• Applying unsupervised Machine Learning Techniques
• This section provides an example of how an unsupervised machine learning
technique, namely the K-Means algorithm, can be applied to a dataset based on the
following case study.
• Case Study Description
• The Breast Cancer dataset has been chosen for demonstrating the K-Means
algorithm. According to Dubey et al. (2016), clustering is an important activity
that enables grouping of data based on the nature or a symptom of the disease. The
Breast Cancer Wisconsin (Original) dataset has been downloaded from the UCI Repository:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original). The
details of the fields are given in the table below. The dataset has 11 variables with 699
observations; the first variable is the identifier and has been excluded from the analysis.
Thus, there are 9 predictors and a response variable (class). The response
variable denotes "Malignant" or "Benign" cases.
22
23
Steps:
1. The libraries have to be imported and the database loaded.
2. Load the data into an R dataset.
library(sparklyr)
library(ggplot2)
library(dplyr)
sc <- spark_connect(master = "local")
library(readxl)
BreastCancerData <- read_excel("/home/cloudera/Downloads/BreastCancerData.xlsx")
3. Select the appropriate table, fields and labels that will be used to formulate the K-
Means model. The predictors Uniformity_of_Cell_Shape and
Uniformity_of_Cell_Size are chosen to formulate the K-Means model.

24
4. Write the codes for the KMeans model.
breastcancer_tbl <- copy_to(sc,BreastCancerData, "BreastCancerData", overwrite = TRUE)
kmeans_model <- breastcancer_tbl %>%
ml_kmeans(formula= ~ Uniformity_of_Cell_Shape + Uniformity_of_Cell_Size, centers = 2)
kmeans_model

25
• 5. Prediction is made based on the model. Write the following codes.
predicted <- sdf_predict(kmeans_model, breastcancer_tbl) %>%
collect
table(predicted$Class_2benign_4malignant, predicted$prediction)
• The output below is displayed:
      0    1
2     9  449
4   191   50
• Interpretation of results
449 of the benign cases have been classified under cluster 1, while 9 cases have been
misclassified. 191 of the malignant cases have been grouped under cluster 0, while 50 have
been misclassified.
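The interpretation above can be checked arithmetically. This quick Python sketch uses the counts from the output table to compute the overall misclassification rate.

```python
# Counts taken from the K-Means output table: benign = class 2, malignant = class 4
confusion = {("benign", 0): 9, ("benign", 1): 449,
             ("malignant", 0): 191, ("malignant", 1): 50}

total = sum(confusion.values())                       # all 699 observations
misclassified = confusion[("benign", 0)] + confusion[("malignant", 1)]
error_rate = misclassified / total                    # (9 + 50) / 699, about 8.4%
```

So of the 699 observations, 59 are placed in the wrong cluster, a misclassification rate of roughly 8.4%.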
26
THANK YOU

27
