
UNIVERSITY INSTITUTE OF COMPUTING

MASTER OF COMPUTER APPLICATIONS


Big Data Analytics
22CAH-782

DISCOVER . LEARN . EMPOWER


1
Outlines

• Big SQL : Introduction


• Machine Learning : Introduction
• Supervised Learning
• Unsupervised Learning
• Collaborative Filtering
• Big Data Analytics with BigR

2
Big Data Analytics using Machine Learning
Techniques
Machine learning (ML) focuses on analyzing data with different
statistical tools and learning processes to obtain more knowledge from
the data (Faruk and Cahyono, 2018). ML has been applied to many
problems, such as recognition systems, informatics and data mining, and
autonomous control systems (Qiu et al., 2016). Machine learning
algorithms are defined in different ways, but they are commonly
subdivided into supervised learning, unsupervised learning and semi-
supervised learning, which combines both supervised learning and
unsupervised learning.

3
Supervised Machine Learning algorithms
• Supervised learning is defined as learning a function that maps an input variable to
an output variable based on example input-output pairs. Linear regression and
random forest are two examples of supervised learning algorithms that will
be considered in this course.
• Linear Regression
• Linear regression is a statistical tool that is mainly used for predicting
and forecasting values based on historical information, subject to some
important assumptions:
• It requires a dependent variable and a set of independent variables.
• There exists a linear relationship between the dependent and the
independent variables, that is:
4
𝑦 = 𝑎1𝑥1 + 𝑎2𝑥2 + … + 𝑎𝑝𝑥𝑝 + 𝑏 + 𝑒
Where:
𝑦 : is the response variable.
𝑥𝑗 : is the predictor variable j, where j = 1, 2, 3, …, p.
𝑒 : is the error term, normally distributed with mean 0 and constant variance.
𝑎𝑗 and 𝑏 : are the regression coefficients to be estimated.
Regression is a technique used to identify the linear relationship between target
variables and explanatory variables (Prajapati, 2013). Other terms are also used to
describe these variables. One variable is called the predictor variable, whose value
is gathered through experiments. The other variable is called the response variable,
whose value is derived from the predictor variable.
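To make the equation concrete, here is a minimal sketch of estimating the coefficients by ordinary least squares. It is written in Python with NumPy for illustration (not the R used later in the course), and the data and true coefficient values (a1 = 2, a2 = -1, b = 0.5) are made up.

```python
import numpy as np

# Made-up data for y = a1*x1 + a2*x2 + b + e with a1=2, a2=-1, b=0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # two predictor variables
e = rng.normal(scale=0.1, size=100)            # error term, mean 0
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + e

# Append a column of ones so the intercept b is estimated alongside a1, a2
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a1, a2, b = coeffs                             # estimates close to 2, -1, 0.5
```

With enough observations and a small error term, the estimated coefficients recover the values used to generate the data.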

5
• Logistic Regression
• In statistics, logistic regression is known as a probabilistic
classification model. Logistic regression is widely used in many
disciplines, including the medical and social science fields. Logistic
regression can be either binomial or multinomial, and it is very popular
for predicting a categorical response. Binary logistic regression is used in
cases where the outcome for a dependent variable has two
possibilities. Multinomial logistic regression is concerned
with cases where there are three or more possible types.
• Using logistic regression, the input values (x) are combined linearly
using weights or coefficient values to predict an output value (y) based
on the log odds ratio.
6
• One major difference between linear regression and logistic regression is that in
linear regression, the output value being modeled is a numerical value while in
logistic, it is a binary value (0 or 1) (Prajapati, 2013).
• The logistic regression equation can be given as follows:
• Py = e^(b0 + b1*xi) / (1 + e^(b0 + b1*xi))
• Where Py is the predicted probability of the outcome, b0 is the bias or
intercept term and b1 is the coefficient for the single input value (xi). Each
column in your input data has an associated b coefficient (a constant real value)
that must be learned from your training data.
• It is quite simple to make predictions using logistic regression, since you only
need to plug numbers into the logistic regression equation to obtain the output.
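The prediction step can be sketched as follows. This is a minimal Python illustration; the coefficient values b0 = -1.0, b1 = 0.75 and the input x = 2.0 are assumed for the example, not taken from any real model.

```python
import math

def predict_probability(x, b0, b1):
    """Plug values into Py = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Assumed coefficients and input value, for illustration only
p = predict_probability(x=2.0, b0=-1.0, b1=0.75)   # about 0.62
label = 1 if p >= 0.5 else 0                       # classify at the 0.5 cut-off
```

Here z = -1.0 + 0.75 * 2.0 = 0.5, so the predicted probability is e^0.5 / (1 + e^0.5), roughly 0.62, and the observation is classified as 1.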

7
Random Forest
Random forests (RF) are a very popular method for classification and
regression. RF is a very powerful machine learning algorithm which
combines the decision outputs from several trees (Faruk and
Cahyono, 2018). In fact, it combines tree predictors in such a way that each tree
depends on the values of a random vector sampled independently. The
generalization error of a forest of tree classifiers relies on the strength of the
individual trees in the forest and the correlation between them (Breiman, 2001). RF
follows the same principles as decision trees, but it does not select all the data
points and variables for each of the trees. It randomly samples data points and
variables for each tree that it creates and then combines the outputs at the end.
This removes the bias that a single decision tree model might introduce into the
system, and it also improves the predictive power significantly.
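The sampling-and-voting idea can be sketched as follows. This is a conceptual Python illustration, not the course's R code: trivial one-feature "stumps" stand in for real decision trees, so only the bootstrap sampling, random variable selection and majority vote are shown.

```python
import random
from collections import Counter

def train_stump(rows, feature):
    # Stand-in for real tree training: threshold one feature at its mean
    mean = sum(r[feature] for r in rows) / len(rows)
    return lambda r: 1 if r[feature] > mean else 0

def random_forest(rows, n_features, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap the data points
        feature = rng.randrange(n_features)         # random variable per tree
        stumps.append(train_stump(sample, feature))
    def predict(r):
        votes = Counter(s(r) for s in stumps)       # combine outputs at the end
        return votes.most_common(1)[0][0]           # majority vote
    return predict

# Made-up data: two low-valued and two high-valued observations
rows = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9)]
predict = random_forest(rows, n_features=2)
```

Because each tree sees a different bootstrap sample and variable, individual trees disagree, but the combined vote is more stable than any single tree.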

8
Unsupervised Machine Learning algorithms

• Compared to supervised learning, unsupervised learning makes use of
input data that have no corresponding output variables. In other words,
unsupervised learning algorithms have no outcome to be predicted.
They aim at discovering the structure and distribution of the data in order
to learn more about it. Two examples of unsupervised learning
algorithms are the K-Means algorithm and the Principal Components Analysis
(PCA) algorithm.
K-Means Algorithm
• K-Means is an unsupervised machine learning algorithm
which aims at clustering data together, that is, finding clusters in data
based on similarity in the descriptions of the data and their
relationships (Hans-Hermann, 2008). Each cluster is associated with a
center point known as a centroid.
9
Based on the centers, the distance of each observation from each center is
calculated, and the clusters are formed by assigning points to the closest centroid.
Various distance measures, such as the Euclidean distance, squared Euclidean
distance and the Manhattan (or city-block) distance, are used to determine which
observation is appended to which centroid. The number of clusters is represented
by the variable K.
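The assignment step can be sketched as follows. This is a minimal Python illustration with made-up points and K = 2, showing only the Euclidean-distance assignment to the closest centroid.

```python
import math

def euclidean(p, c):
    """Euclidean distance between a point and a centroid."""
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def assign(points, centroids):
    """Return the index of the closest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda k: euclidean(p, centroids[k]))
            for p in points]

# Made-up observations: two near (1, 1) and two near (8.5, 8.5), with K = 2
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]
centroids = [(1.0, 1.0), (8.5, 8.5)]
labels = assign(points, centroids)   # first two points join cluster 0
```

A full K-Means run alternates this assignment step with recomputing each centroid as the mean of its assigned points, until the assignments stop changing.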
• Principal Components Analysis (PCA)
The PCA algorithm aims at analyzing data to identify patterns and expressing the data in
such a way as to highlight similarities and differences (Smith, 2002). Once the patterns
are found, the data can be compressed by reducing the dimensions of the dataset
with minimal loss of information. This technique is commonly used for image
compression and can be applied in fields such as finance and bioinformatics to
find patterns in data of high dimension.
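The idea can be sketched as follows: a minimal Python/NumPy illustration (not the course's R code) on a small made-up 2-D dataset, reducing it to one dimension via the eigendecomposition of the covariance matrix.

```python
import numpy as np

# Made-up 2-D data with strongly correlated variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Xc = X - X.mean(axis=0)                      # center each variable

cov = np.cov(Xc, rowvar=False)               # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = eigvecs[:, np.argmax(eigvals)]         # direction of greatest variance

projected = Xc @ top                         # 1-D compressed representation
explained = eigvals.max() / eigvals.sum()    # share of variance retained
```

Because the two variables move together, the top component captures most of the variance, so the 1-D projection loses little information.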

10
• Setting up the environment for Big Data Analytics using Spark
To be able to run R on Hadoop, you now have to install some packages.
SparkR has been chosen as it has an in-built machine learning library
consisting of a number of algorithms that run in memory. Proceed with
the following steps to install SparkR.
• Steps:
1. From the VM console, type the following to install the packages for
curl, libcurl and libxml (if not already installed). If they are already
installed, the messages displayed will indicate so.
sudo yum -y install curl libcurl-devel
sudo yum -y install libxml2 libxml2-devel
11
2. For installing SparkR, the tutorial from http://spark.rstudio.com/ has been used.
3. Open RStudio and install the sparklyr package as follows:
install.packages("sparklyr")
Warning: In case the following error is encountered:
• ERROR: configuration failed for package 'stringi'
• Install the stringi dependency as follows:
install.packages(c("stringi"), configure.args = c("--disable-cxx11"), repos = "https://cran.rstudio.com")
• Install the sparklyr package again as follows:
install.packages("sparklyr")

12
4. Install a local version of Spark
library(sparklyr)
spark_install(version = "2.1.0")
5. Install the latest version of sparklyr
install.packages(c("devtools", "curl"))
devtools::install_github("rstudio/sparklyr")
• Your SparkR environment should now be ready to use.
6. You can connect to a local instance of Spark as well as to remote Spark clusters.
• Use the following code to connect to a local instance of Spark via the
spark_connect function:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
13
7. Read the dataset
• Note that before reading any dataset in Spark, you need to upload the
data from the source (if it is not already in Spark) using the
following code:
databaseName <- read.csv("name of the database.csv")
8. Load data from R dataset to Spark

library(dplyr)
tableName_tbl <- copy_to(sc, TableName)

14
• Applying supervised Machine Learning techniques using Spark
• This section provides examples of how the supervised machine learning techniques
linear regression, random forest and logistic regression can be applied to a
dataset, based on the following case study.
• Case Study Description
• The case study chosen is on the birth weights of babies. The dataset has been
downloaded from the following link: https://www.sheffield.ac.uk/mash/data. It
contains details on the weight of newborn babies and their parents. The dataset
contains mostly continuous variables, which are most useful for correlation and
regression. Supervised techniques are applied to the dataset to determine which
variables have an influence on the babies' birthweight.

15
The table below describes the different variables that have been used for the birthweight
dataset.

16
• Linear Regression
• The first supervised technique that is being applied on the birthweight
dataset is the linear regression.
1. The libraries have to be imported and the database loaded.
2. Load the data into an R dataset.
birthweight <- read.csv("birthweight_reduced.csv")
birthweight_tbl <- copy_to(sc,birthweight)
head(birthweight_tbl)

17
The details of the dataset can be obtained by using the head() function. After running the
code, the following is displayed:

18
• The summary can be obtained by using the summary() function
summary(birthweight_tbl)
3. Apply the linear regression function to the dataset. To apply
linear regression, it is important to know which factor to take as the
response variable and which one(s) to take as predictor variables.
4. Different linear models are formulated based on questions.

19
• Question: Do the mother's height and father's height have any
influence on the baby's length?
A linear model is formulated as follows, based on the question:
lm_model2 <- birthweight_tbl %>%
  select(length, mheight, fheight) %>%
  ml_linear_regression(length ~ mheight + fheight)
• summary(lm_model2)
• R-Squared: 0.1779
• Root Mean Squared Error: 0.997

20
21
• Applying unsupervised Machine Learning Techniques
• This section provides an example of how an unsupervised machine learning
technique, namely the K-Means algorithm, can be applied to a dataset based on the
following case study.
• Case Study Description
• The Breast Cancer dataset has been chosen for demonstrating the K-Means
algorithm. According to Dubey et al. (2016), clustering is an important activity
that enables grouping of data based on the nature or a symptom of the disease. The
Breast Cancer Wisconsin (Original) dataset has been downloaded from the UCI Repository:
https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original). The
details of the fields are given in the table below. The dataset has 11 variables with 699
observations; the first variable is the identifier and has been excluded from the analysis.
Thus, there are 9 predictors and a response variable (class). The response
variable denotes "Malignant" or "Benign" cases.
22
23
Steps:
1. The libraries have to be imported and the database loaded.
2. Load the data into an R dataset.
library(sparklyr)
library(ggplot2)
library(dplyr)
sc <- spark_connect(master = "local")
library(readxl)
BreastCancerData <- read_excel("/home/cloudera/Downloads/BreastCancerData.xlsx")
3. Select the appropriate table, fields and labels that will be used to formulate the K-
Means model. The predictors Uniformity_of_Cell_Shape and
Uniformity_of_Cell_Size are chosen to formulate the K-Means model.

24
4. Write the codes for the KMeans model.
breastcancer_tbl <- copy_to(sc,BreastCancerData, "BreastCancerData", overwrite = TRUE)
kmeans_model <- breastcancer_tbl %>%
ml_kmeans(formula= ~ Uniformity_of_Cell_Shape + Uniformity_of_Cell_Size, centers = 2)
kmeans_model

25
• 5. Prediction is made based on the model. Write the following codes.
predicted <- sdf_predict(kmeans_model, breastcancer_tbl) %>%
collect
table(predicted$Class_2benign_4malignant, predicted$prediction)
• The output below is displayed:
      0    1
2     9  449
4   191   50
• Interpretation of results
449 of the benign cases have been classified under cluster 1, while 9 cases have been
misclassified. 191 of the malignant cases have been grouped under cluster 0, while 50 have
been misclassified.
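The interpretation above can be checked arithmetically. This quick Python sketch uses the counts from the output table to compute the overall misclassification rate.

```python
# Counts taken from the K-Means output table: benign = class 2, malignant = class 4
confusion = {("benign", 0): 9, ("benign", 1): 449,
             ("malignant", 0): 191, ("malignant", 1): 50}

total = sum(confusion.values())                       # all 699 observations
misclassified = confusion[("benign", 0)] + confusion[("malignant", 1)]
error_rate = misclassified / total                    # (9 + 50) / 699, about 8.4%
```

So of the 699 observations, 59 are placed in the wrong cluster, a misclassification rate of roughly 8.4%.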
26
THANK YOU

27
