CSE-Machine Learning & Big Data - WSS Source Book
OF
COMPUTER SCIENCE AND ENGINEERING
WSS-SOURCE BOOK
WSCS17
Prepared by:
With reference:
BIG DATA AND MACHINE LEARNING
Description of the Skill:
Web technology refers to the means by which computers communicate with each other using markup languages and multimedia packages. It gives us a way to interact with hosted information, such as websites. In order to make websites look and function a certain way, web developers utilize different languages. The three core languages that make up the World Wide Web are HTML, CSS, and JavaScript, with HTML forming the backbone of most webpages.
Client-side technologies are things that operate in the browser. There is no need to interact with
the server. These languages are generally very easy to use, and we can try them out right on our own
computer.
HTML
HTML is the basic mark-up language that we will use to create the structure of our web pages. We can think of this as the framing for the house that we are building. It is the most basic and essential part of our web site - it gives our house shape, rooms, and structure.
CSS
CSS is used to create the decoration for our website. CSS describes how a web page should look in the browser. The key to good web page creation is to completely separate the presentation (CSS) from the structure of our site (HTML). This way it is easy to make changes to the look of our site without changing all of the HTML files.
JavaScript
JavaScript is a simple scripting language used to make things happen on the web page. As a
client-side language, JavaScript only works within the browser window. It cannot retrieve, create, or store
data on the server; it can only manipulate things within the browser.
CONTENTS
5. Front-End Development
Topics: JavaScript; how to integrate libraries, frameworks and other systems or features with JavaScript; use of JavaScript pre/post processors and task-running workflows.
Skills: Create website animations and functionalities to assist in context explanations and add visual appeal; create and update JavaScript code to enhance a website's functionality, usability and aesthetics; manipulate data and custom media with JavaScript.
Days: 6 | Weighting: 30%

6. Back-End Development
Topics: Object-oriented PHP; open-source server-side libraries and frameworks; connecting to the server through SSH to operate server-side libraries and frameworks; how to design and implement databases with MySQL; FTP (File Transfer Protocol) server and client relationships and packages.
Skills: Manipulate data making use of programming skills; protect against security exploits; integrate with existing code, with APIs (Application Programming Interfaces), libraries and frameworks; create or maintain a database to support system and software requirements; create code that is modular and reusable.
Days: 6 | Weighting: 30%

Test Project: 1 day
Test Project: 1 day
Assessment: 1 day
Total No. of days: 25 (100%)
TRAINING SCHEDULE
Skill Name: BIG DATA AND MACHINE LEARNING
Day Content
INTRODUCTION TO MACHINE LEARNING, DATA COLLECTION AND
Day 1
STUDY ABOUT DATASET
Day 2 INTRODUCTION TO RAPID MINER AND DATA PREPROCESSING
Day 3 DATA MINING INTRODUCTION — DATA PREPROCESSING WITH R
Day 4 DATA PREPROCESSING USING PYTHON
Day 5 MODEL EVALUATION AND SELECTION-LINEAR REGRESSION
Marking Scheme
The assessment is done by awarding points using two methods: Measurement and Judgement.
Measurement – either a criterion that is directly measurable, or one scored according to a binary system (Yes/No).
Judgement – either a criterion scored on a scale based on industry expectations, or one scored according to a binary system (Yes/No).
DAY 1 INTRODUCTION TO MACHINE LEARNING, DATA COLLECTION AND STUDY ABOUT DATASET
Objectives: The aim of the task is to make the students understand what machine learning and big data are, and to give them practice in collecting datasets from various resources.
Outcome: Downloading datasets and preparing their own datasets for various analytics.
Theory:
What is Machine Learning?
Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning is generally to understand the structure of data and fit that data into models that can be understood and utilized by people. Machine learning is used for everything from automating mundane tasks to offering intelligent insights, and industries in every sector try to benefit from it.
Sample Coding:
Step 2: Select the yellow highlighted drop down arrow and create a new repository
Step 3: Now select the repository you created, and create two sub folders under the repository as
Data and processes.
Type in Data and repeat the same method to create the sub folder for Processes.
Step 4: Select the purple highlighted box, Import Data. In MyComputer, choose the downloaded
dataset
Step 5: Do the preliminary formatting of the dataset, such as choosing the class labels, changing roles, etc.
Step 6: Finally save the dataset in the Data folder.
Python:
1. Download dataset from Kaggle or any machine learning repositories in the form of CSV
2. from google.colab import files
uploaded = files.upload()
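Once the file has been uploaded to Colab, it can be loaded into a pandas DataFrame and inspected. The snippet below is a minimal sketch; the file name mydata.csv is only a placeholder for whatever dataset you downloaded:
import io
import pandas as pd
# read the uploaded CSV into a DataFrame ('mydata.csv' is a placeholder file name)
df = pd.read_csv(io.BytesIO(uploaded['mydata.csv']))
# inspect the first and last five rows to confirm the import
print(df.head())
print(df.tail())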
R Tool:
1. Download the dataset from a machine learning repository and save it in a location on your computer.
2. Open RStudio and type the following command:
mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
3. mydata
Schedule
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 1: Introduction; installation of Rapid Miner, R Tool and Python IDE; tasks related to importing a dataset; dataset collection.
Download a health-care dataset (Excel or CSV format) from Kaggle, import it into Rapid Miner, R Tool and a Python IDE, and print the imported dataset to check whether it has been imported correctly.
Assessment specification:

S.No | Type | Aspect Description | Additional Aspect Description | Aspect Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 2 Marks: the downloaded dataset is a CSV; 0 Marks: the wrong dataset is downloaded | Browser | 2
2 | J | Creating a new repository in the name of the dataset in RapidMiner | 0.5 Mark: the new repository is correctly created; 0 Marks: no repository is created | Rapid Miner | 0.5
3 | J | Creating two subfolders in the repository | 0.25 Mark: the Data subfolder is created correctly inside the new repository; 0.25 Mark: the Process subfolder is created correctly inside the new repository; 0 Marks: no subfolders are created | Rapid Miner | 0.5
4 | J | Changing the roles for target labels | 1 Mark: the target labels are set; 0 Marks: the target labels are not set | Rapid Miner | 1
5 | J | Displaying the imported dataset | 1 Mark: the dataset is correctly displayed, indicating the features and target labels; 0 Marks: the dataset is not imported properly | Rapid Miner | 1
6 | J | Importing the dataset using Python | 2 Marks: the dataframe is properly used and printed; 0 Marks: the dataframe is not properly imported | Pycharm Editor | 2
7 | J | Displaying different parts of the data using the pandas package | 1 Mark: the first 5 rows and last 5 rows are correctly displayed; 0 Marks: the data manipulation is not properly done | Pycharm Editor | 1
8 | J | Importing the dataset using the R Tool | 1 Mark: the dataframe is properly used and printed; 0 Marks: the dataframe is not properly imported | R Studio | 1
9 | J | Displaying different parts of the data | 1 Mark: the first 5 rows and last 5 rows are correctly displayed; 0 Marks: the data manipulation is not properly done | R Studio | 1
Conclusion:
Thus, the various resources for data collection and the methodology for importing datasets using Rapid Miner, R Tool and a Python IDE have been covered.
DAY 2 INTRODUCTION TO RAPID MINER AND DATA PRE-PROCESSING
Objectives: The aim of the task is to pre-process a dataset by formatting the data and identifying the statistics of the data in order to improve the results of the data analysis.
Outcome: Students will be able to apply the Rapid Miner tool to pre-process datasets required for data analysis.
Prerequisites:
Knowledge on Machine Learning
Theory:
Rapid Miner is a data science software platform developed by the company of the same name that
provides an integrated environment for data preparation, machine learning, deep learning, text
mining, and predictive analytics. It is used for business and commercial applications as well as for
research, education, training, rapid prototyping, and application development and supports all steps
of the machine learning process including data preparation, results visualization, model validation
and optimization. Rapid Miner is developed on an open-core model. The Rapid Miner Studio Free Edition, which is limited to 1 logical processor and 10,000 data rows, is available under the AGPL license, while depending on various non-open-source components.
Rapid Miner is written in the Java programming language. Rapid Miner provides a GUI to
design and execute analytical workflows. Those workflows are called “Processes” in Rapid Miner
and they consist of multiple “Operators”. Each operator performs a single task within the process,
and the output of each operator forms the input of the next one. Alternatively, the engine can be
called from other programs or used as an API. Individual functions can be called from the command
line. Rapid Miner provides learning schemes, models and algorithms and can be extended
using R and Python scripts.
Data preprocessing
Step 1:
Step 2:
Step 3:
Step 2: Using the Operators panel, expand Data Transformation, then Data Cleansing, then Outlier Detection, and select the Detect Outlier (Distances) operator.
Step 3: Run the process and check the detected outliers.
1.3.1 Normalization:
Step 1: Import the file sales_data. Using the Operators panel, expand Data Transformation, then Value Modification, then Numerical Value Modification, and select the Normalize operator.
Step 2: Run the process and check the result; after normalization, all the values are replaced by values within the given range.
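The same normalization can be reproduced outside Rapid Miner. The snippet below is a minimal pandas/scikit-learn sketch; the example columns and the 0–1 range are assumptions, not part of the Rapid Miner demo:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# toy numeric frame standing in for the imported sales_data
df = pd.DataFrame({'units': [12, 40, 7, 55], 'revenue': [120.0, 410.5, 69.9, 560.0]})
scaler = MinMaxScaler(feature_range=(0, 1))          # rescale each column into [0, 1]
normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(normalized)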
Demo: 1.3.2 Aggregation:
Step 4: De-select any columns that you do not want to import. In this case I do not care to see what teams the QBs play for. You must make the column you want to sort by the ID. You can see that the first column, which contains the names, was changed from attribute to ID. After you do that, click Next and save your data.
Step 5: Once you get into your main process, drag and drop your data onto the process area. Click Data Transformation > Aggregation. Drag and drop the Aggregate widget onto the process area. Next, connect the out port of the data to the exa port on the left side of the Aggregate widget. Then connect the exa port on the right side of the widget to the result port.
Step 6: After you connect the ports, select Edit List next to aggregation attributes. Here, make an entry for each attribute you want to aggregate and select the functions you want to use. After you do this, click OK.
Step 7: Next, click Select Attributes next to group by attributes. Here, move your ID column (in this case Name) into the right side by selecting your ID and clicking on the arrow pointing right. Click OK.
Step 8 :Now just click the play button on the toolbar and you get your results!
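For reference, the same group-by aggregation can be expressed in a few lines of pandas. This is a minimal sketch only; the column names Name, Yards and TD are stand-ins for the QB example above:
import pandas as pd
# toy quarterback data standing in for the imported spreadsheet
qb = pd.DataFrame({'Name': ['Brady', 'Brady', 'Rodgers'],
                   'Yards': [300, 250, 310],
                   'TD': [2, 1, 3]})
# group by the ID column and aggregate each attribute with a chosen function
result = qb.groupby('Name').agg({'Yards': 'sum', 'TD': 'mean'})
print(result)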
Step 1 :The Generate Nominal Data operator generates an ExampleSet based on nominal attributes.
Select any one parameter in the Generate Nominal Data operator.
Parameters
Step 2: The number of examples parameter is set to 100, thus the ExampleSet will have 100 examples. The number of attributes parameter is set to 3, thus three nominal attributes will be generated besides the label attribute. The number of values parameter is set to 5, thus each attribute will have 5 possible values.
Step 3: Using the Operators panel, expand Data Transformation, then Type Conversion, then Discretization, and select Discretize by Binning.
Step 4: Run the process and check the result; after discretization, you can see the difference by plotting a histogram for the attribute "product_id".
Histogram of “product_id” after discretization, “5” number of bins.
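Equal-width binning can also be sketched in pandas for comparison with the Discretize by Binning operator. The snippet below is an illustrative example only; the column product_id and the choice of 5 bins mirror the demo above, everything else is assumed:
import numpy as np
import pandas as pd
# random integer ids standing in for the 'product_id' attribute
product_id = pd.Series(np.random.randint(0, 1000, size=100))
# discretize into 5 equal-width bins and count how many values fall in each
bins = pd.cut(product_id, bins=5)
print(bins.value_counts().sort_index())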
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 2: Rapid Miner installation and introduction to data preprocessing; Exercise 1; Exercise 2; Exercise 3.
Download IRIS dataset (Excel or CSV format) from Kaggle and import it in Rapid Miner, and do
the necessary preprocessing
Sample Output
Assessment specification:
6 | J | Transforming data to suitable form | 1 Mark: normalization is applied to the dataset; 0 Marks: no normalization is applied | Rapid Miner | 1
7 | J | Transforming data to suitable form | 1 Mark: aggregation is applied to the dataset; 0 Marks: no aggregation is applied | Rapid Miner | 1
8 | J | Apply Generate Nominal Data | 1 Mark: the Generate Nominal Data operator is applied to the dataset to generate an ExampleSet; 0 Marks: no Generate Nominal Data operator is applied | Rapid Miner | 1
9 | J | Apply Data Reduction process | 1 Mark: data discretization is applied to the dataset; 0 Marks: no data discretization is applied | Rapid Miner | 1
10 | J | Plotting Histogram | 1 Mark: histogram plotting is applied for all the attributes in the dataset; 0 Marks: no histogram plotting is applied | Rapid Miner | 1
Conclusion:
Thus, various pre-processing tasks can be achieved using Rapid Miner, which results in clean data for performing various analytics.
DAY 3 DATA MINING INTRODUCTION — DATA PREPROCESSING WITH R
Objectives: The aim of the task is to make the students understand the basic concepts of data mining
and fundamental data processing procedures using R
Data mining is a field of research that emerged in the 1990s and is very popular today, sometimes under different names such as "big data" and "data science", which have a similar meaning. To give a short definition of data mining, it can be defined as a set of techniques for automatically analyzing data to discover interesting knowledge or patterns in the data.
To perform data mining, a process consisting of seven steps is usually followed. This process is often
called the “Knowledge Discovery in Database” (KDD) process.
1. Data cleaning: This step consists of cleaning the data by removing noise or other
inconsistencies that could be a problem for analyzing the data.
2. Data integration: This step consists of integrating data from various sources to prepare the data that needs to be analyzed. For example, if the data is stored in multiple databases or files, it may be necessary to integrate the data into a single file or database to analyze it.
3. Data selection: This step consists of selecting the relevant data for the analysis to be
performed.
4. Data transformation: This step consists of transforming the data to a proper format that can
be analyzed using data mining techniques. For example, some data mining techniques require
that all numerical values are normalized.
5. Data mining: This step consists of applying some data mining techniques (algorithms) to
analyze the data and discover interesting patterns or extract interesting knowledge from this
data.
6. Evaluating the knowledge that has been discovered: This step consists of evaluating the
knowledge that has been extracted from the data. This can be done in terms of objective and/or
subjective measures.
7. Visualization: Finally, the last step is to visualize the knowledge that has been extracted from
the data.
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses an extensive catalog of statistical and graphical methods. It includes machine learning algorithms, linear regression, time series and statistical inference, to name a few. Most of the R libraries are written in R, but for heavy computational tasks, C, C++ and Fortran code is preferred.
Data Preprocessing in R
The code for this step checks for missing values in the age and salary columns and updates the missing cells with the column-wise average.
Since we are not interested in having decimal places for age, we round it using the code for this step.
Splitting the Dataset into Training and Testing Sets
set.seed(123)
# the train/test split itself was shown in the screenshots; both sets are then feature-scaled
training_set[, 3:4] = scale(training_set[, 3:4])
test_set[, 3:4] = scale(test_set[, 3:4])
The scale method in R can be used to scale the features in the dataset. Here we are only scaling the non-factor columns, which are age and salary.
Training_set:
Test_set:
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 4: Data mining introduction; R programming introduction; data pre-processing task; data pre-processing task.
Sample Output:
Assessment specification:
Conclusion:
Thus, various pre-processing tasks can be achieved using the R Tool, which results in clean data for performing various analytics.
DAY 4 DATA PREPROCESSING USING PYTHON
Objectives: The aim of the task is to pre-process a dataset by formatting the data and identifying the statistics of the data to improve the results of the data analysis.
Outcome:
Students will be able to apply Python libraries to pre-process datasets required for data analysis.
Resources required:
Python Version 3.6
Prerequisites:
Knowledge on Python Programming language
Theory:
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In
other words, whenever the data is gathered from different sources it is collected in raw format
which is not feasible for the analysis.
Need of Data Preprocessing:
For achieving better results from the applied model in Machine Learning projects, the data has to be in a proper format. Some Machine Learning models need information in a specified format; for example, the Random Forest algorithm does not support null values, therefore to execute the Random Forest algorithm null values have to be managed in the original raw data set.
Another aspect is that the data set should be formatted in such a way that more than one Machine Learning or Deep Learning algorithm can be executed on the same data set, and the best of them is chosen.
Sample Coding:
The “chronic_kidney_disease.arff” dataset is used for this task, which is available at the UCI Repository.
1. Read and clean the data
# kidney_dis.py
import pandas as pd
import numpy as np
# create header for dataset
header = ['age','bp','sg','al','su','rbc','pc','pcc',
'ba','bgr','bu','sc','sod','pot','hemo','pcv',
'wbcc','rbcc','htn','dm','cad','appet','pe','ane',
'classification']
# read the dataset
df = pd.read_csv("data/chronic_kidney_disease.arff",
                 header=None,
                 names=header
                 )
# dataset has '?' in it, convert these into NaN
df = df.replace('?', np.nan)
# drop the NaN
df = df.dropna(axis=0, how="any")
# print total samples
print("Total samples:", len(df))
# print 4-rows and 6-columns
print("Partial data\n", df.iloc[0:4, 0:6])
Below is the output of the above code:
$ python kidney_dis.py
Total samples: 157
Partial data
age bp sg al su rbc
30 48 70 1.005 4 0 normal
36 53 90 1.020 2 0 abnormal
38 63 70 1.010 3 0 abnormal
41 68 80 1.010 3 2 normal
One can also convert the categorical targets (i.e. the strings 'ckd' and 'notckd') into numeric targets (i.e. 0 and 1) using the ".cat.codes" command, as shown below:
# convert 'ckd' and 'notckd' labels to '0' and '1'
targets = df['classification'].astype('category').cat.codes
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i==0 else 'green' for i in targets]
print(label_color[0:3], label_color[-3:-1])
Below are the first three and the last two samples of 'label_color':
$ python kidney_dis.py
['red', 'red', 'red'] ['green', 'green']
targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])
# list of categorical features
categorical_ = ['rbc', 'pc', 'pcc', 'ba', 'htn',
'dm', 'cad', 'appet', 'pe', 'ane'
]
# drop the "categorical" features
# drop the classification column
df = df.drop(labels=['classification'], axis=1)
# drop using 'inplace' which is equivalent to df = df.drop()
df.drop(labels=categorical_, axis=1, inplace=True)
print("Partial data\n", df.iloc[0:4, 0:6]) # print partial data
Below is the output of the above code. Note that, if we compare the below results with the results of Listing
8.1, we can see that the ‘rbc’ column is removed.
$ python kidney_dis.py
Partial data
age bp sg al su bgr
30 48 70 1.005 4 0 117
36 53 90 1.020 2 0 70
38 63 70 1.010 3 0 380
41 68 80 1.010 3 2 157
4. Dimensionality reduction
Let's perform dimensionality reduction using the PCA model. From the results, we can see that the model can fairly well separate the kidney disease cases based on the provided features.
# kidney_dis.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
targets = df['classification'].astype('category')
# save target-values as color for plotting
# red: disease, green: no disease
label_color = ['red' if i=='ckd' else 'green' for i in targets]
# print(label_color[0:3], label_color[-3:-1])
pca = PCA(n_components=2)
pca.fit(df)
T = pca.transform(df) # transformed data
# change 'T' to Pandas-DataFrame to plot using Pandas-plots
T = pd.DataFrame(T)
5. Data Visualization
[Figure: PCA scatter plot of the Chronic Kidney Disease dataset]
The dataset had a large number of features. PCA looks for the correlation between these features and
reduces the dimensionality. In this example, we reduce the number of features to 2 using PCA.
After the dimensionality reduction, only 2 features are extracted, therefore it is plotted using the
scatter-plot, which is easier to visualize. For example, one can clearly see the differences between the
‘ckd’ and ‘not ckd’ in the current example.
The dimensionality reduction methods, such as PCA are used to reduce the dimensionality of the
features to 2 or 3. Next, these 2 or 3 features can be plotted to visualize the information.
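The scatter plot itself can be produced with a few matplotlib lines on top of the PCA code above. This is a minimal sketch; it reuses T and label_color from the previous listing and assumes they are already defined:
# scatter plot of the two principal components,
# coloured red for 'ckd' and green for 'notckd'
plt.scatter(T[0], T[1], c=label_color)
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.title("Chronic Kidney Disease")
plt.show()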
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 3: Introduction; Exercise 1; Exercise 2; Exercise 3.
Sample Output:
Assessment specification:
5 | J | Handling categorical data | 1 Mark: Label Encoders are used to convert categorical data to numerical data; 0 Marks: the categorical data is not converted | Python | 1
Conclusion:
Thus, various pre-processing tasks can be achieved using Python, which results in clean data for performing various analytics.
DAY 5 MODEL EVALUATION AND SELECTION-LINEAR REGRESSION
Objectives:
The aim of the task is to select a model for doing analytics and insights about Linear Regression.
Outcome:
Students will be able to apply a machine learning model for predicting a result using Linear Regression.
Theory:
Machine learning continues to be an increasingly integral component of our lives, whether we’re applying
the techniques to research or business problems. Machine learning models ought to be able to give
accurate predictions in order to create real value for a given organization.
Linear Regression
Linear regression is a method for approximating a linear relationship between two variables.
Linear Equations
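In its simplest form, simple linear regression fits a straight line y = b0 + b1*x, where b0 is the intercept and b1 is the slope; the model estimates b0 and b1 so that the line best approximates the relationship between the two variables.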
Sample Coding:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
import io
from google.colab import files
uploaded = files.upload()
import io
df2 = pd.read_csv(io.BytesIO(uploaded['homeprice.csv']))
df2
plt.figure(figsize=(10, 5))
sns.heatmap(df2.corr(), annot=True)
sns.heatmap(df2.corr())
plt.show()
ch = pd.read_csv(io.BytesIO(uploaded['homeprice.csv']))
print(ch.head())
grp = sns.regplot(x='area', y='price', data=ch, color='orange')
plt.title("Predicting Home Price")
plt.show()
order = ch[['area']]
print(order)
totalorders = ch[['price']]
print(totalorders)
reg = linear_model.LinearRegression()
reg.fit(order, totalorders)
from google.colab import files
uploaded = files.upload()
ch1 = pd.read_csv(io.BytesIO(uploaded['homeprice1.csv']))
print(ch1.head())
p = reg.predict(ch1)
print(p)
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 6: Linear Regression introduction; Linear Regression basics; Linear Regression for predicting the home price; task on Linear Regression.
Determine the rise in price of Gold using the past data and predict how the price would be in the
forthcoming years using Linear Regression in Python
Sample Output
Assessment specification:
Conclusion:
Thus linear regression is used to predict the future values of the target labels in the chosen dataset.
DAY 6 K-NEAREST NEIGHBOR REGRESSION
Objectives:
The aim of the task is to make the students understand the implementation of K-Nearest Neighbour
regression using Python programming
Outcome:
Students will be able to perform the K-Nearest Neighbour regression method, which aims to predict a numerical target based on a similarity measure such as a distance function.
Theory:
K-Nearest Neighbors (KNN) is one of the simplest algorithms used in Machine Learning for regression and classification problems. A classification problem has a discrete value as its output, whereas a regression problem has a real number (a number with a decimal point) as its output. KNN algorithms use data and classify new data points based on similarity measures (e.g. a distance function). Classification is done by a majority vote of a point's neighbors: the data point is assigned to the class that is most common among its nearest neighbors. As the number of nearest neighbors k is increased, accuracy might increase.
Sample Coding:
Step1: Import the Libraries
Output:
Coding:
Output:
Coding:
Coding:
Output:
Schedule
Download the iris dataset (Excel or CSV format) from Kaggle, implement the KNN algorithm and visualize
the test result
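The coding screenshots for this day are not reproduced here. A minimal scikit-learn sketch of the task is given below; it uses the built-in iris data rather than a downloaded CSV, and the choice of k = 3 and the 70/30 split are assumptions:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)
# fit a KNN model with k = 3 neighbours and evaluate it on the test split
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))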
Assessment specification:

S.No | Type | Aspect Description | Additional Aspect Description | Aspect Requirement | Maximum Score (10)
1 | J | Downloading correct dataset | 2 Marks: the downloaded dataset is a CSV; 0 Marks: the wrong dataset is downloaded | Browser | 2
5 | J | Training data | 2 Marks: the dataset is split into training data; 0 Marks: the data is not split properly | Python | 2
Conclusion:
Thus, the implementation of K-Nearest Neighbour regression using a Python IDE is completed successfully.
DAY 7 RANDOM FOREST REGRESSION
Objectives:
The aim of the task is to predict the selling prices of houses based on some economic factors and
build a random forest regression model using Python programming
Outcome:
Students will be able to apply the Random Forest regression algorithm for predicting a result based on the average of the predictions of different decision trees.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Random Forest Regression
Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where different algorithms, or the same algorithm applied multiple times, are combined to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.
Sample Coding:
Step1: Import libraries
Coding:
Output:
Step 2: Define the features and the target
Coding:
Output:
Step 3: Split the dataset into train and test sets
Coding:
Step 4: Build the random forest regression model with random forest regressor function
Coding:
Output:
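The coding screenshots for Steps 1–4 are not reproduced here. Below is a minimal end-to-end sketch of the same workflow with scikit-learn; the file name house_prices.csv, the column name price and the chosen hyperparameters are placeholders, not values from the original exercise:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Steps 1-2: load the data and define the features and the target
df = pd.read_csv("house_prices.csv")          # placeholder file name
X = df.drop("price", axis=1)                  # placeholder target column
y = df["price"]
# Step 3: split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Step 4: build and evaluate the random forest regression model
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
print("Mean squared error:", mean_squared_error(y_test, y_pred))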
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 8: Introduction; Exercise 1; Exercise 2; Exercise 3.
Assessment specification:
Conclusion:
Thus, the implementation of the Random Forest Regression algorithm using Rapid Miner/R Tool and a Python IDE is completed successfully.
DAY 8 GRADIENT BOOSTING REGRESSION
Objectives:
The aim of the task is to make the students understand the implementation of gradient boosting
regression using R
Outcome:
Students will be able to perform boosting method which aims to optimise an arbitrary
(differentiable) cost function.
Boosting is a class of ensemble learning techniques for regression and classification problems.
Boosting aims to build a set of weak learners (i.e. predictive models that are only slightly better than
random chance) to create one ‘strong’ learner (i.e. a predictive model that predicts the response variable
with a high degree of accuracy). Gradient boosting is a machine learning technique for regression and
classification problems, which produces a prediction model in the form of an ensemble of weak prediction
models, typically decision trees.
Sample coding
Step1: Importing packages
Coding
require(gbm)
require(MASS)#package with the boston housing dataset
#separating training and test data
train =sample(1:506,size=374)
Step2: Apply Boston housing dataset to predict the median value of houses
Coding:
Boston.boost=gbm(medv ~ . ,data = Boston[train,],distribution = "gaussian",n.trees =
10000,shrinkage = 0.01, interaction.depth = 4)
Boston.boost
summary(Boston.boost)
Output: #Summary gives a table of Variable Importance and a plot of Variable
Importance
gbm(formula = medv ~ ., distribution = "gaussian", data = Boston[train, ], n.trees = 10000,
    interaction.depth = 4, shrinkage = 0.01)
A gradient boosted model with gaussian loss function.
10000 iterations were performed.
There were 13 predictors of which 13 had non-zero influence.
>summary(Boston.boost)
var rel.inf
rm rm 36.96963915
lstat lstat 24.40113288
dis dis 10.67520770
crim crim 8.61298346
age age 4.86776735
black black 4.23048222
nox nox 4.06930868
ptratio ptratio 2.21423811
tax tax 1.73154882
rad rad 1.04400159
indus indus 0.80564216
chas chas 0.28507720
zn zn 0.09297068
Step 3: Plotting the Partial Dependence Plot
#Plot of Response variable with lstat variable
plot(Boston.boost,i="lstat")
#Inverse relation with lstat variable
plot(Boston.boost,i="rm")
#as the average number of rooms increases, the price increases
Step 4: Prediction on Test Set
Coding:
n.trees = seq(from=100, to=10000, by=100) #no of trees - a vector of 100 values
#Generating a Prediction matrix for each Tree
predmatrix <- predict(Boston.boost, Boston[-train,], n.trees = n.trees)
dim(predmatrix) #dimensions of the Prediction Matrix
#Calculating The Mean squared Test Error
test.error <- with(Boston[-train,], apply((predmatrix - medv)^2, 2, mean))
head(test.error) #contains the Mean squared test error for each of the 100 trees averaged
#Plotting the test error vs number of trees
plot(n.trees, test.error, pch=19, col="blue", xlab="Number of Trees", ylab="Test Error",
     main = "Performance of Boosting on Test Set")
#Adding the Random Forest minimum error line trained on the same data with similar parameters
abline(h = min(test.err), col="red") #test.err is the test error of a Random Forest fitted on the same data
legend("topright", c("Minimum Test error Line for Random Forests"), col="red", lty=1, lwd=1)
Output:
dim(predmatrix)
[1] 206 100
head(test.error)
100 200 300 400 500 600
26.428346 14.938232 11.232557 9.221813 7.873472 6.911313
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 9: Introduction – Random Forest vs Gradient Boosting; Gradient Boosting example; hands-on session for Gradient Boosting regression; Gradient Boosting regression task.
Assessment specification:
(continued from the previous row) 0 Marks: the prediction matrix is not generated for each tree
6 | J | Mean squared test error | 2 Marks: the mean squared test error is computed; 0 Marks: the mean squared test error is not computed | R program | 2
7 | J | Plotting test error | 1 Mark: the mean squared test error is plotted as a graph; 0 Marks: the mean squared test error is not plotted as a graph | R program | 1
8 | J | Visualization of result | 1 Mark: the performance of boosting on the test set is visualized with a graph; 0 Marks: the performance of boosting on the test set is not visualized with a graph | R program | 1
Conclusion:
Thus, the implementation of the Gradient Boosting Regression algorithm using Rapid Miner/R Tool is completed successfully.
DAY 9 SUPPORT VECTOR REGRESSION
Objectives:
The aim of the task is to make the students understand the implementation of support vector
regression using Python.
Outcome:
Students will be able to perform the support vector regression method, which aims to optimise an arbitrary (differentiable) cost function by fitting the best line within a predefined threshold error value.
Theory:
Sample Coding:
Step1: Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
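Steps 2–5 (feature scaling and fitting the SVR model that the regressor variable in Step 6 refers to) are only shown as screenshots in the source. A minimal sketch of those steps is given below; the RBF kernel and the use of StandardScaler are assumptions consistent with the Position_Salaries example:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# feature-scale X and y (SVR does not scale its inputs itself)
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
# fit the support vector regression model used in Step 6
regressor = SVR(kernel='rbf')
regressor.fit(X, y)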
Step 6: Visualizing the SVR results (for higher resolution and smoother curve)
X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
X_grid = X_grid.reshape((len(X_grid), 1))
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
Output:
Schedule:
Assessment specification:
Conclusion
By applying Support Vector Regression, predicting the salary of employees according to the position held in the organization is completed successfully.
DAY 10
LOGISTIC REGRESSION
Objectives:
The aim of the task is to make the students understand about logistic regression to classify binary
response variables
Outcome:
Students will be able to understand the need for classifying dependent and independent variables for predicting binary classes and understand the computation of event occurrence probability.
Sample coding
Step 1: Data gathering
The required data to build a logistic regression model in Python, in order to determine whether candidates would get admitted to a prestigious university, is gathered. The two possible outcomes are: Admitted (represented by the value '1') and Rejected (represented by the value '0'). The logistic regression model consists of:
(i) The dependent variable, which represents whether a person gets admitted; and
(ii) The 3 independent variables: the GMAT score, GPA and years of work experience.
Step 2: Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
print(df)
X = df[['gmat', 'gpa', 'work_experience']]   # independent variables
y = df['admitted']                           # dependent variable
Apply the logistic regression as follows:
# split the data into training and test sets (the 75/25 split is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
Output:
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 12: Introduction to Logistic Regression; Ex. 1 – predicting the variables and data exploration; visualization and Recursive Feature Elimination; Logistic Regression model fitting; ROC curve.
Confusion Matrix:
[[65 3]
[8 24]]
Accuracy: 0.89
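The confusion matrix and accuracy shown above can be produced with the metrics module and seaborn that were imported earlier. A minimal sketch, assuming y_test and y_pred from the model above:
# confusion matrix of actual vs predicted admissions
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))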
Assessment specification:
(continued from the previous row) 0 Marks: the performance result is not visualized in the form of a graph | 1
Conclusion:
Thus, the implementation of the Logistic Regression algorithm using Python programming is completed successfully.
DAY 11 INTRODUCTION TO CLASSIFICATION AND DIFFERENT TYPES OF
CLASSIFICATION ALGORITHMS AND NAÏVE BAYES ALGORITHM
Objectives:
The aim of the task is to classify the given test dataset using Naïve Bayes algorithm to predict the
accuracy.
Outcome:
Students will be able to apply Python libraries to the given dataset and identify the accuracy using the Naïve Bayes algorithm.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Naive Bayes is a probabilistic classifier in Machine Learning which is built on the principle of Bayes
theorem. Naive Bayes classifier makes an assumption that one particular feature in a class is
unrelated to any other feature and that is why it is known as naive.
Bayes' theorem can be stated as follows:
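P(h|D) = P(D|h) * P(h) / P(D)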
• P(h): the probability of hypothesis h being true (regardless of the data). This is known as the prior probability of h.
• P(D): the probability of the data (regardless of the hypothesis). This is known as the prior probability of D.
• P(h|D): the probability of hypothesis h given the data D. This is known as the posterior probability.
• P(D|h): the probability of the data D given that hypothesis h is true. This is known as the likelihood.
Sample Coding
from sklearn import datasets
iris = datasets.load_iris()
print(iris)
print("Features: ", iris.feature_names)
print ("Labels: ", iris.target_names)
print(iris.data[0:5])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=109)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Schedule:
Sessions run from 8.45am to 4.15pm, with tea breaks from 10.25am to 10.40am and from 3.10pm to 3.25pm, and a lunch break from 12.20pm to 1.30pm.
Day 13: Introduction to different classification algorithms; Exercise 1; Naive Bayes algorithm; Exercise 2.
Download the dataset containing the weather information from Kaggle and classify whether
players will play or not based on weather condition
Sample Output:
Assessment specification:
Conclusion:
Thus, the implementation of the Naïve Bayes algorithm using Python programming is completed successfully.
DAY 12 STOCHASTIC GRADIENT DESCENT ALGORITHM
Objectives:
The aim of the task is to classify the given test dataset using Stochastic Gradient Descent Algorithm to
predict the accuracy level.
Outcome:
Students will be able to apply Python libraries to the given dataset and identify the accuracy using the Stochastic Gradient Descent classifier.
Resources required: Python Version 3.5
Prerequisites: Knowledge on Python Programming language
Theory:
Gradient descent is the backbone of machine learning algorithms. Imagine that you are on a mountain, blindfolded, and your task is to come down from the mountain to the flat land without assistance. The only assistance you have is a gadget which tells you your height above sea level. What would your approach be? You would start to descend in some random direction and then ask the gadget what the height is now. If the gadget tells you a height that is more than the initial height, then you know you started in the wrong direction. You change the direction and repeat the process. In this way, after many iterations, you finally descend successfully. Here is the analogy in machine learning terms:
Size of the steps taken in any direction = learning rate
Height the gadget tells you = cost function
The direction of your steps = gradients
It looks simple, but how can we represent this mathematically? Here is the maths:
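For a cost function J(w) with parameters w, each step moves the parameters against the gradient: w := w - r * dJ/dw, where r is the learning rate. Stochastic (mini-batch) gradient descent estimates dJ/dw from a small random sample of the training data on every iteration instead of the full dataset.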
# standardizing data
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test=scaler.transform(x_test)
x_test=np.array(x_test)
y_test=np.array(y_test)
def MyCustomSGD(train_data, learning_rate, n_iter, k, divideby):
    # sketch of the elided steps; assumes train_data is a DataFrame whose last column is the target
    w = np.zeros(train_data.shape[1] - 1); b = 0; cur_iter = 1
    while cur_iter <= n_iter:
        temp = train_data.sample(k)                              # random mini-batch of k rows
        y = np.array(temp.iloc[:, -1]); x = np.array(temp.iloc[:, :-1])
        w_gradient = -2 * np.dot(y - (np.dot(x, w) + b), x)      # gradients of the squared error
        b_gradient = -2 * np.sum(y - (np.dot(x, w) + b))
        w = w - learning_rate * (w_gradient / k)                 # update weights with the gradients above
        b = b - learning_rate * (b_gradient / k)                 # update bias with the gradients above
        cur_iter += 1; learning_rate = learning_rate / divideby  # decay the learning rate
    return w, b
def predict(x, w, b):
    y_pred = []
    for i in range(len(x)):
        y = np.asscalar(np.dot(w, x[i]) + b)
        y_pred.append(y)
    return np.array(y_pred)
w,b=MyCustomSGD(train_data,learning_rate=1,n_iter=100,divideby=2,k=10)
y_pred_customsgd=predict(x_test,w,b)
plt.scatter(y_test,y_pred_customsgd)
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Scatter plot from actual y and predicted y')
plt.show()
print('Mean Squared Error :',mean_squared_error(y_test, y_pred_customsgd))
plt.scatter(y_test,y_pred_customsgd_improved)
plt.grid()
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.title('Scatter plot from actual y and predicted y')
plt.show()
print('Mean Squared Error :',mean_squared_error(y_test, y_pred_customsgd_improved))
Schedule:
Description of the task:
Download the Boston dataset from Kaggle and perform Linear Regression on Boston Housing data
using Scikit Learn’s SGDRegressor and visualize the results
Sample Output:
Assessment specification:
Conclusion:
Thus, the implementation of the Stochastic Gradient Descent algorithm using Python programming is completed successfully.
DAY 13 K-NEAREST NEIGHBOURS ALGORITHM
Objectives:
The aim of the task is to classify the given test dataset using K- Nearest Neighbour to predict the accuracy
level.
Outcome:
Students will be able to apply Python libraries to the given dataset and identify the accuracy using the KNN classifier.
Theory:
Classifying the input data is a very important task in Machine Learning, for example, whether a mail is
genuine or spam, whether a transaction is fraudulent or not and there are multiple other examples.
Let’s say, you live in a gated housing society and your society has separate dustbins for different types of
waste: one for paper waste, one for plastic waste, and so on. What you are basically doing over here is
classifying the waste into different categories. So, classification is the process of assigning a ‘class label’
to a particular item. In the above example, we are assigning the labels ‘paper’, ‘metal’, ‘plastic’, and so on
to different types of waste.
K-Nearest Neighbor
KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption for
underlying data distribution.
Sample Coding:
# Assigning features and label variables
# First Feature
weather=['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny',
'Rainy','Sunny','Overcast','Overcast','Rainy']
# Second Feature
temp=['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
# Label or target varible
play=['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Import LabelEncoder
from sklearn import preprocessing
#creating labelEncoder
le = preprocessing.LabelEncoder()
# Converting string labels into numbers
weather_encoded = le.fit_transform(weather)
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
# Combine the encoded features into (weather, temp) pairs
features = list(zip(weather_encoded, temp_encoded))
# Train a KNN classifier and predict an unseen combination (sketch of the remaining steps)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, label)
predicted = model.predict([[0, 2]])   # 0: Overcast, 2: Mild
print(predicted)
Download the wine dataset from Kaggle and perform classification on the wine data to classify the three types of wine using the K-Nearest Neighbor algorithm and visualize the results.
Sample Output:
Conclusion:
Thus, the implementation of the K-Nearest Neighbor algorithm using Python programming is completed successfully.
Objective:
The aim of this task is to use SVM to represent different classes with a hyperplane in a multidimensional space.
Outcome:
Students will be able to:
Understand how SVM distinctly classifies data points
Apply techniques to maximize the margin of the classifier
Resource Required: Python IDE (Jupyter, Anaconda, Spyder or PyCharm)
Theory:
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is
used for Classification as well as Regression problems. However, primarily, it is used for Classification problems
in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in the future.
This best decision boundary is called a hyperplane.
Sample Coding:
#Load dataset
from sklearn import datasets, svm, metrics
from sklearn.model_selection import train_test_split
cancer = datasets.load_breast_cancer()
# split into training and test sets (the 70/30 split is an assumption; this step was only shown as a screenshot)
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109)
# create an SVM classifier (the linear kernel is an assumption)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
Schedule
Download the iris dataset from Kaggle and perform classification on iris data to classify the three types of
flowers using SVM and visualize the results.
Sample Output:
Assessment specification:
Conclusion:
Thus, the implementation of the Support Vector Machine using Python programming is completed successfully.
Objective:
The aim of this task is to use Arima Model to predict the time series data
Outcome:
Students will able to understand ARIMA model to predict and obtain the results for time series data
Theory:
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. Time series are widely used for non-stationary data, such as economic, weather, stock price, and retail sales data. Forecasting is the next step, where you want to predict the future values the series is going to take.
ARIMA, short for ‘Auto Regressive Integrated Moving Average’ is actually a class of models that
‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so
that equation can be used to forecast future values.
Sample Coding:
import pandas as pd
def parser(x):
return datetime.strptime(x,'%Y-%m')
sales=pd.read_csv('/content/sample_data/sales.csv',date_parser=(0),date_parser=parser)
sales.head()
sales.Month[1]
sales=pd.read_csv('/content/sample_data/sales.csv',index_col=0,parse_dates=[0],date_parser=parser)
sales.plot()
plot_acf(sales)
sales.shift(1)
sales_diff=sales.diff(periods=1)
#integrated of order 1, denoted by d (for diff), one of the parameter of the ARIMA model
sales_diff
sales_diff=sales_diff[1:]
sales_diff.head()
X=sales.values
X.size
train=X[0:15]
test=X[15:]
predictions=[]
model_arima = ARIMA(train,order=(1,0,1))
model_arima_fit = model_arima.fit()
print(model_arima_fit.aic)
predictions= model_arima_fit.forecast(steps=10)[0]
plt.plot(test)
plt.plot(predictions,color='red')
Schedule
Download the COVID dataset from Kaggle and perform prediction using ARIMA model to determine the
increase in cases and deaths
Sample Output:
Assessment specification:
Conclusion:
Thus, the implementation of time series prediction with the ARIMA model using Python programming is completed successfully.
Objectives:
The aim of the task is to provide adequate knowledge on machine learning using decision
tree in PIMA Indian Diabetes dataset.
Resources required: Programming language Python, Chrome browser or any other available
browser.
Theory:
A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition on the basis of attribute values, and it partitions the data in a recursive manner called recursive partitioning. This flowchart-like structure helps in decision making. Its visualization, like a flowchart diagram, easily mimics human-level thinking. That is why decision trees are easy to understand and interpret.
1. Select the best attribute using an Attribute Selection Measure (ASM), such as information gain or Gini impurity, to split the records (a small Gini example is shown after this list).
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Start tree building by repeating this process recursively for each child until one of the following
conditions is met:
All the tuples belong to the same attribute value.
There are no more remaining attributes.
There are no more instances.
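As an illustration of an attribute selection measure, the short sketch below computes the Gini impurity of a candidate split; the label lists used here are made-up values for demonstration only.
# Gini impurity of a set of class labels and of a candidate split (illustrative values)
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

left = [0, 0, 1]        # labels falling on one side of the split
right = [1, 1, 1, 0]    # labels falling on the other side
n = len(left) + len(right)
weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print("Gini(left):", gini(left), "Gini(right):", gini(right), "Weighted:", weighted)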
Sample Coding:
Let's first load the required Pima Indian Diabetes dataset using pandas' read_csv function.
To understand model performance, dividing the dataset into a training set and a test set is a good
strategy. Let's split the dataset using the train_test_split() function; you need to pass three parameters:
features, target, and test set size.
Well, you got a classification rate of 67.53%, which is considered good accuracy. You can improve this
accuracy by tuning the parameters of the Decision Tree algorithm.
The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this
dot file to PNG or a displayable form in Jupyter. A complete sketch of these steps is given below.
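A minimal end-to-end sketch of the steps described above, assuming the dataset has been saved as diabetes.csv with the usual Kaggle column names (both the file name and the column names are assumptions and may need adjusting to your copy of the data):
# Decision tree on the Pima Indian Diabetes dataset (sketch)
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Assumed file and column names; adjust to match the downloaded CSV
pima = pd.read_csv("diabetes.csv")
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = pima[feature_cols]   # Features
y = pima['Outcome']      # Target (1 = diabetic, 0 = non-diabetic)

# Split into 70% training and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train the classifier and evaluate on the held-out test set
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# The trained tree can then be exported with export_graphviz and rendered with pydotplus as described above.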
Download an IRIS dataset (Excel or CSV format) from Kaggle, import it into a Python IDE, and print the
decision tree.
Output:
Conclusion:
Thus the working mechanism of the decision tree has been successfully implemented in machine
learning and the results have been verified successfully.
Objectives:
The aim of the task is to provide adequate knowledge on machine learning using a random
forest classifier on the IRIS dataset and neural networks.
Theory:
Random Forests:
Random forest is a supervised learning algorithm. It can be used both for classification
and regression. It is also a flexible and easy-to-use algorithm. A forest is comprised of
trees, and it is said that the more trees it has, the more robust the forest is. Random forest creates
decision trees on randomly selected data samples, gets a prediction from each tree and selects the
best solution by means of voting. It also provides a pretty good indicator of feature
importance.
Sample code:
Step 1: Load the dataset and print the target and feature names
Start by importing the datasets library from scikit-learn, and load the iris dataset with
load_iris().
The code given below is to print the top 5 records of IRIS dataset.
Sample Code:
# Import the datasets module from scikit-learn and load the iris dataset
from sklearn import datasets
iris = datasets.load_iris()
# print the iris data (top 5 records)
print(iris.data[0:5])
# print the iris labels (0:setosa, 1:versicolor, 2:virginica)
print(iris.target)
Source Code:
# Creating a DataFrame of given iris dataset
import pandas as pd
data=pd.DataFrame({
'sepal length':iris.data[:,0],
'sepal width':iris.data[:,1],
'petal length':iris.data[:,2],
'petal width':iris.data[:,3],
'species':iris.target
})
data.head()
Step 3: Split features and labels into training and test data
Sample Code:
# Import train_test_split function
from sklearn.model_selection import train_test_split
X=data[['sepal length', 'sepal width', 'petal length', 'petal width']] # Features
y=data['species'] # Labels
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% training and 30% test
Step 4: Prediction
Sample code:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier
#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)
#Train the model using the training sets
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
The code given below is to import the scikit-learn metrics module for accuracy
calculation on IRIS dataset.
Sample code:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
The code given below is to make a prediction for a single item on IRIS dataset
Sample code:
#Make a prediction for a single item
#sepal length = 3
#sepal width = 5
#petal length = 4
#petal width = 2
clf.predict([[3, 5, 4, 2]])
The code given below is for finding the important features of the Gaussian classifier created
and trained above.
Source code:
#Finding Important Features
from sklearn.ensemble import RandomForestClassifier
Sample code:
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
feature_imp
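The importance scores can also be plotted; a minimal sketch, assuming matplotlib and seaborn are installed:
# Visualize feature importance scores (sketch)
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title('Visualizing Important Features')
plt.show()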
Source code:
#Generating model on selected Features
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into features and labels
X=data[['petal length', 'petal width','sepal length']] # Removed feature "sepal width"
y=data['species']
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% training and 30% test
The code given below imports the Random Forest Classifier once more; the model is then retrained on the selected features, as shown in the sketch after the import.
Sample code:
from sklearn.ensemble import RandomForestClassifier
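A minimal sketch of the retraining and re-evaluation step, reusing the X_train, X_test, y_train and y_test variables produced by the split above:
# Retrain on the selected features and check accuracy (sketch)
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))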
Schedule
Download a PIMA Indian Diabetes dataset (Excel or CSV format) from Kaggle, import it into a Python
IDE, and print the accuracy using a random forest classifier.
Assessment specification:
Conclusion:
Thus the working mechanism of the random forest algorithm has been successfully implemented.
Objective:
The aim of this task is to use deep learning for image classification by applying a convolutional
neural network.
Outcome:
Students will be able to build a neural network model for performing image classification on a real-
world dataset.
Resources Required: Python IDE (Jupyter or Anaconda or Spyder or PyCharm)
Theory:
Deep Learning is a very popular subset of machine learning due to its high level of performance
across many types of data. A Convolutional Neural Network (CNN) is the deep learning model typically used to classify
images. Many libraries in Python help to apply CNNs, and the Keras library makes it one of the
simplest.
Computers see images using pixels. Pixels in images are usually related. For example, a certain
group of pixels may signify an edge in an image or some other pattern. Convolutions use this to help
identify images.
A convolution multiplies a matrix of pixels with a filter matrix or kernel and sums up the
multiplication values. Then the convolution slides over to the next pixel and repeats the same process
until all the image pixels have been covered.
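To make this sliding-window operation concrete, the following NumPy sketch performs a 2D convolution of a small single-channel image with a 2x2 kernel; the numbers are made-up illustrative values.
# Manual 2D convolution of a small greyscale "image" with a kernel (illustrative values)
import numpy as np

image = np.array([[1, 2, 0, 1],
                  [3, 1, 1, 0],
                  [0, 2, 2, 1],
                  [1, 0, 1, 3]])
kernel = np.array([[1, 0],
                   [0, -1]])

out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_h, out_w))

# Slide the kernel over every position, multiply element-wise and sum
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
        output[i, j] = np.sum(patch * kernel)

print(output)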
CNNs, like other neural networks, are made up of neurons with learnable weights and biases. Each
neuron receives several inputs, takes a weighted sum over them, passes it through an activation function
and responds with an output.
Data pre-processing
Next, reshape the dataset inputs (X_train and X_test) to the shape that our model expects when
we train the model. The first number is the number of images (60,000 for X_train and 10,000 for X_test).
Then comes the shape of each image (28x28). The last number is 1, which signifies that the images are
greyscale.
#load the image data described above (the Keras MNIST digit dataset is assumed)
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
#reshape data to fit model
X_train = X_train.reshape(60000,28,28,1)
X_test = X_test.reshape(10000,28,28,1)
We need to ‘one-hot-encode’ our target variable. This means that a column will be created for
each output category and a binary variable is inputted for each category. For example, we saw that the
first image in the dataset is a 5. This means that the sixth number in our array will have a 1 and the rest of
the array will be filled with 0.
from keras.utils import to_categorical
#one-hot encode target column
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
y_train[0]
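The model building and evaluation steps scheduled for the rest of the day can be sketched in Keras as follows, reusing the reshaped X_train/X_test and one-hot encoded y_train/y_test from above; the layer sizes and number of epochs are illustrative choices, not prescribed values.
# Build, train and evaluate a small CNN (sketch; layer sizes and epochs are illustrative)
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(32, kernel_size=3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))  # one output per digit class

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3)

# Evaluate the trained model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print("Test accuracy:", accuracy)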
Schedule:
Day 4: Introduction to CNN (8.45am to 10.25am), Tea Break (10.25am to 10.40am), Building the Model (10.40am to 12.20pm), Lunch Break (12.20pm to 1.30pm), Evaluating the Model (1.30pm to 3.10pm), Tea Break (3.10pm to 3.25pm), Task / Assessment (3.25pm to 4.15pm).
Sample Output:
Conclusion:
The Convolutional Neural Network model is built using the Keras library in Python, used for classifying
the images, and the accuracy of the model is evaluated.
Objective:
The aim of this task is to create a simple chatbot in Python using the NLTK library.
Outcome:
Students will be able to build a chatbot in Python using the NLTK library.
Theory:
What is a Chatbot?
A chatbot is AI-based software designed to interact with humans in their natural languages. These
chatbots usually converse via auditory or textual methods, and they can effortlessly mimic human
language to communicate with human beings in a human-like manner.
1. In a rule-based approach, a bot answers questions based on the rules on which it is trained.
The rules defined can range from very simple to very complex. Such bots can handle simple queries
but fail to manage complex ones.
2. Self-learning bots are the ones that use Machine Learning-based approaches and are
generally more efficient than rule-based bots. These bots can further be of two
types: Retrieval-Based or Generative.
The nltk library is used here. NLTK stands for Natural Language Toolkit and is a leading Python library for
working with text data. The first line of code below imports the library, while the second line uses
the nltk.chat module to import the required utilities.
import nltk
from nltk.chat.util import Chat, reflections
The code below shows that utility Chat is a class that provides logic for building the chatbot.
print(Chat)
Output:
<class 'nltk.chat.util.Chat'>
The other import you did above was Reflections, which is a dictionary that contains a set of input text
and its corresponding output values. You can examine the dictionary with the code below. This is an
optional dictionary and you can create your own dictionary in the same format as below.
reflections
The first step is to create rules that will be used to train the chatbot. The lines of code below create a
simple set of rules. The first element of the list is the user input, whereas the second element is the
response from the bot. Several such lists are created in the set_pairs object.
set_pairs = [
    [
        r"my name is (.*)",
        ["Hello %1, How are you doing today ?",]
    ],
    [
        r"hi|hey|hello",
        ["Hello", "Hey there",]
    ],
    [
        r"what is your name?",
        ["You can call me a chatbot ?",]
    ],
]

def chatbot():
    # Greeting printed when the chatbot starts (assumed definition for the chatbot() calls below)
    print("Hi, I'm a chatbot built with NLTK. Type 'quit' to leave.")

chatbot()
Output:
The next step is to instantiate the Chat() function containing the pairs and reflections.
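A minimal sketch of this step, using the set_pairs and reflections objects defined above:
# Instantiate the Chat utility with the rule pairs and the reflections dictionary
chat = Chat(set_pairs, reflections)
print(chat)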
Output:
You have created a simple rule-based chatbot, and the last step is to initiate the conversation. This is done
using the code below where the converse() function triggers the conversation.
chat.converse()
if __name__ == "__main__":
chatbot()
The code above will generate the following chatbox in your notebook, as shown in the image below.
You're ready to interact with the chatbot. Start by typing a simple greeting, "hi", in the box, and you'll get
the response "Hello" from the bot, as shown in the image below.
Output:
You can continue conversing with the chatbot and quit the conversation once you are done, as shown in
the image below.
Output:
Schedule: standard daily timetable, 8.45am to 4.15pm, with tea breaks (10.25am to 10.40am, 3.10pm to 3.25pm) and lunch break (12.20pm to 1.30pm).
Sample Output:
Conclusion:
Thus an interactive chatbot is created for an application using the NLTK library in Python and executed
successfully.
Objective:
The aim of this task is to understand the basic concepts of Big Data and the installation process of
Hadoop.
Outcome:
Students will be able to understand the importance of Big Data and how it is applied in
real-world environments.
Theory:
Big data analytics is the often complex process of examining large and varied data sets, or big
data, to uncover information -- such as hidden patterns, unknown correlations, market trends and
customer preferences -- that can help organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques provide a means to analyse data sets
and draw conclusions about them which help organizations make informed business decisions.
Business intelligence (BI) queries answer basic questions about business operations and
performance.
Big data analytics is a form of advanced analytics, which involves complex applications with
elements such as predictive models, statistical algorithms and what-if analysis powered by high-
performance analytics systems.
Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians
and other analytics professionals to analyze growing volumes of structured transaction data, plus other
forms of data that are often left untapped by conventional BI and analytics programs. This encompasses a
mix of semi-structured and unstructured data -- for example, internet clickstream data, web server logs,
social media content, text from customer emails and survey responses, mobile phone records, and
machine data captured by sensors connected to the internet of things (IoT).
Installation of Hadoop
Hadoop is a software framework from the Apache Software Foundation that is used to store and process
Big Data. It has two main components: the Hadoop Distributed File System (HDFS), its storage system, and
MapReduce, its data processing framework. Hadoop has the capability to manage large datasets by
distributing a dataset into smaller chunks across multiple machines and performing parallel
computation on them.
Step 1:
Download the Hadoop version 3.1 from the following Link
CLICK HERE TO INSTALL HADOOP
Step 2:
Extract it to the folder.
Now we need to edit some configuration files located in the etc/hadoop directory of the folder where we
extracted Hadoop: core-site.xml, mapred-site.xml, hdfs-site.xml, yarn-site.xml and hadoop-env.cmd. The
first block below is added to core-site.xml and the second block to mapred-site.xml.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Create a folder with the name ‘datanode’ and a folder with the name ‘namenode’ inside the data directory of the Hadoop folder, then add the properties below to hdfs-site.xml.
Note: The paths given inside the value elements must be the paths of the datanode and namenode
folders you just created.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value> C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
</property>
</configuration>
5. Edit the file yarn-site.xml and add the below properties in the configuration.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
6. Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of the Java folder where your JDK
1.8 is installed.
Hadoop on Windows also needs additional native binaries (winutils). To include those files, replace the bin folder in the Hadoop directory with the bin folder provided in this
GitHub link.
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as zip file. Extract it and copy the bin folder in it. If you want to save the old bin folder,
rename it like bin_old and paste the copied bin folder in that directory.
To verify the installation, open a new command prompt and run:
hadoop version
Schedule:
Standard daily timetable, 8.45am to 4.15pm, with tea breaks (10.25am to 10.40am, 3.10pm to 3.25pm) and lunch break (12.20pm to 1.30pm).
Using the installation steps of Hadoop 2.0, install Hadoop 3.0 and display the version details.
Sample Output:
Assessment specification:
S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Java Installation | JDK selection & installation (1 Mark); Environment setup (1 Mark) | - | 2
2 | J | Hadoop | Hadoop download (1 Mark); Environment setup (1 Mark) | - | 2
3 | J | Configuration | Configure core-site.xml (0.25 Mark); Configure hadoop-env.xml (0.25 Mark); Configure hdfs-site.xml (0.5 Mark) | - | 1
4 | J | Configuration | Configure mapred-site.xml (0.5 Mark); Configure yarn-site.xml (0.5 Mark) | - | 2
7 | J | Learning the commands | Learning Hadoop commands | - | 1
8 | M | Time Management | Completed within 30 min (1 Mark); Completed within 30 to 45 mins (0.5 Mark); Exceeded 45 mins (0 Mark) | 30 mins | 1
Conclusion:
Thus Hadoop 2.0 and Hadoop 3.0 are installed and executed successfully.
Objectives:
The aim of the task is to provide adequate knowledge on creating and deploying a highly scalable
and performance-oriented database.
Theory:
Installation of MongoDB:
Step 2: Move to the Community Server tab on the same page and choose the latest version for
Windows. Click Download.
Step 4: Choose the setup type as “Complete” and press “Next”. Then click “Run Service as
Network Service User”, fill the service name and directory fields, and proceed with “Next”.
Step 6: After Installation, License agreement screen opens up and press “Agree”.
Step 8: Press “Get Started”. Check all the boxes in Private settings and press “Start using
Compass”.
Step 11: Inside data folder create another new folder named “db”
Step 12: Copy the entire path of the above screen. Move to “System Properties” and
select “Environment Variables”.
Step 14: Press “Ok” in Environment Variables and System properties. Open the command
prompt and type “mongod”. Now the installation of packages started.
Step 16: Go to the MongoDB window which we minimized earlier. Click Connect and it will
get connected.
Step 2: Click “Create Database” to create the database and its first collection.
Step 3: Access the collections screen for a database by clicking the database name in
the main Databases view.
Step 4: Click the “Create Collection” button and enter the name of the collection to
create.
Step 5: To insert documents into the collection, click the “Add Data” dropdown button and
select “Insert Document”.
(Note: Click the {} brackets for JSON view. This is the default view. Or click the list icon
for Field-by-Field mode)
db.post.insert([
{
title: "MongoDB Overview",
description: "MongoDB is no SQL database",
by: "BIT",
url: "http://www.bitsathy.ac.in",
tags: ["mongodb", "database", "NoSQL"],
likes: 100
},
{
title: "NoSQL Database",
description: "NoSQL database doesn't have tables",
by: "BIT",
url: "http://www.bitsathy.ac.in",
tags: ["mongodb", "database", "NoSQL"],
likes: 20,
comments: [
{
user:"user1",
message: "My first comment",
dateCreated: new Date(2013,11,10,2,35),
like: 0
}
]
}
])
> db.createCollection("empDetails")
{ "ok" : 1 }
> db.empDetails.insertOne(
{
First_Name: "Radhika",
Last_Name: "Sharma",
Date_Of_Birth: "1995-09-26",
e_mail: "[email protected]",
phone: "9848022338"
})
{
"acknowledged" : true,
"insertedId" : ObjectId("5dd62b4070fb13eec3963bea")
}
> db.empDetails.insertMany(
[
{
First_Name: "Radhika",
Last_Name: "Sharma",
Date_Of_Birth: "1995-09-26",
e_mail: "[email protected]",
phone: "9000012345"
},
{
First_Name: "Rachel",
Last_Name: "Christopher",
Date_Of_Birth: "1990-02-16",
e_mail: "[email protected]",
phone: "9000054321"
},
{
First_Name: "Fathima",
Last_Name: "Sheik",
Date_Of_Birth: "1990-02-16",
e_mail: "[email protected]",
phone: "9000054321"
}
]
)
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("5dd631f270fb13eec3963bed"),
ObjectId("5dd631f270fb13eec3963bee"),
ObjectId("5dd631f270fb13eec3963bef")
]
}
>
Schedule: standard daily timetable, 8.45am to 4.15pm, with tea breaks (10.25am to 10.40am, 3.10pm to 3.25pm) and lunch break (12.20pm to 1.30pm).
The following is the structure of the 'restaurants' collection; use MongoDB to create the given structure
containing documents and collections.
{
"address": {
"building": "1007",
"coord": [-73.856077, 40.848447],
"street": "Morris Park Ave",
"zipcode": "10462"
},
"borough": "Bronx",
"cuisine": "Bakery",
"grades": [
{"date": {"$date": 1393804800000}, "grade": "A", "score": 2},
{"date": {"$date": 1378857600000}, "grade": "A", "score": 6},
{"date": {"$date": 1358985600000}, "grade": "A", "score": 10},
{"date": {"$date": 1322006400000}, "grade": "A", "score": 9},
{"date": {"$date": 1299715200000}, "grade": "B", "score": 14}
],
"name": "Morris Park Bake Shop",
"restaurant_id": "30075445"
}
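Before attempting the queries below, the following pymongo sketch shows how such queries can be run from Python; the connection string and the database name 'test' are assumptions and should match your local MongoDB setup.
# Query the 'restaurants' collection from Python (sketch; assumes pymongo and a local mongod)
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["test"]                 # assumed database name
restaurants = db["restaurants"]

# Display all documents in the collection
for doc in restaurants.find():
    print(doc)

# Display only selected fields, excluding _id
projection = {"restaurant_id": 1, "name": 1, "borough": 1, "cuisine": 1, "_id": 0}
for doc in restaurants.find({}, projection):
    print(doc)

# Restaurants in the borough Bronx, limited to the first 5
for doc in restaurants.find({"borough": "Bronx"}).limit(5):
    print(doc)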
1. Write a MongoDB query to display all the documents in the collection restaurants.
Sample output:
2. Write a MongoDB query to display the fields restaurant_id, name, borough and cuisine for all the
documents in the collection restaurant.
Sample output:
3. Write a MongoDB query to display the fields restaurant_id, name, borough and cuisine, but exclude the
field _id for all the documents in the collection restaurant.
Sample output:
{ "borough" : "Manhattan", "cuisine" : "Irish", "name" : "Dj Reynolds Pub And Restaurant",
"restaurant_id" : "30191841" }
{ "borough" : "Bronx", "cuisine" : "Bakery", "name" : "Morris Park Bake Shop", "restaurant_id" :
"30075445" }
{ "borough" : "Brooklyn", "cuisine" : "American ", "name" : "Riviera Caterer", "restaurant_id" :
"40356018" }
Etc……….
4. Write a MongoDB query to display the fields restaurant_id, name, borough and zip code, but exclude the
field _id for all the documents in the collection restaurant.
Sample output:
{ "address" : { "zipcode" : "10019" }, "borough" : "Manhattan", "name" : "Dj Reynolds Pub And
Restaurant", "restaurant_id" : "30191841" }
{ "address" : { "zipcode" : "10462" }, "borough" : "Bronx", "name" : "Morris Park Bake Shop",
"restaurant_id" : "30075445" }
{ "address" : { "zipcode" : "11224" }, "borough" : "Brooklyn", "name" : "Riviera Caterer",
"restaurant_id" : "40356018" }
Etc…….
5. Write a MongoDB query to display all the restaurants which are in the borough Bronx.
6. Write a MongoDB query to display the first 5 restaurants which are in the borough Bronx.
Sample output:
Conclusion:
Thus the working of MongoDB is clearly discussed and the given queries are implemented in
MongoDB and the results are verified successfully
Objectives: The aim of the task is to provide adequate knowledge of SQL and NoSQL databases.
Theory:
SQL commands:
Transaction control language (TCL) commands are used to keep a check on other commands and their effect on the database.
These commands can annul changes made by other commands by rolling the data back to its
original state. They can also make any temporary change permanent.
Data control languages are the commands to grant and take back authority from any database
user.
Data query language is used to fetch data from tables based on conditions that we can easily
apply.
1. In a relational database we need to define the structure and schema of the data first, and only then can we
process the data.
2. Relational database systems provide consistency and integrity of data by enforcing the ACID
properties (Atomicity, Consistency, Isolation and Durability). There are some scenarios where this is
useful, like a banking system. However, in most other cases these properties add a significant
performance overhead and can make database responses very slow.
3. Many applications store their data in JSON format, and an RDBMS does not provide a convenient
way of performing operations such as create, insert, update and delete on such data. On the other hand,
NoSQL databases store their data in JSON format, which is compatible with most of today's
applications.
NoSQL databases are different from relational databases. In a relational database you need to create
the table, define the schema, set the data types of the fields, and so on, before you can actually insert the data. In NoSQL,
you can insert and update data on the fly.
One of the advantages of NoSQL databases is that they are easy to scale and they are much
faster for most types of operations that we perform on a database. There are certain situations where you
would prefer a relational database over NoSQL; however, when you are dealing with huge amounts of data,
a NoSQL database is often the better choice.
Advantages of NoSQL
There are several advantages of working with NoSQL databases such as MongoDB and
Cassandra. The main advantages are high scalability and high availability.
(i) High scalability: NoSQL database such as MongoDB uses sharding for horizontal scaling.
Sharding is partitioning of data and placing it on multiple machines in such a way that the order
of the data is preserved. Vertical scaling means adding more resources to the existing machine
while horizontal scaling means adding more machines to handle the data. Vertical scaling is not
that easy to implement, on the other hand horizontal scaling is easy to implement. Horizontal
scaling database examples: MongoDB, Cassandra, etc. Because of this feature, NoSQL can
handle huge amounts of data; as the data grows, NoSQL scales itself to handle that data in an
efficient manner.
(ii) High Availability: The auto-replication feature in MongoDB makes it highly available, because in
case of any failure the data replicates itself back to a previous consistent state.
There are several types of NoSQL databases (document stores, key-value stores, column-family stores and
graph databases), each with database systems that fall in that category. MongoDB falls in the category of NoSQL document-based databases.
SQL database:
DDL COMMANDS
1. CREATE
CREATE statement is used to create a new database, table, index or stored procedure.
CREATE TABLE user (id INT (16) PRIMARY KEY AUTO_INCREMENT, name
VARCHAR (255) NOT NULL);
2. DROP
DROP statement allows you to remove database, table, index or stored procedure.
3. ALTER
ALTER is used to modify existing database data structures (database, table).
ALTER TABLE user ADD COLUMN lastname VARCHAR (255) NOT NULL;
4. RENAME
RENAME command is used to rename SQL table.
5. TRUNCATE
The TRUNCATE operation is used to delete all table records. Logically it is similar to a
DELETE command without a WHERE clause.
Example:
TRUNCATE student;
DML COMMANDS:
Example:
2. INSERT
INSERT command is used to add new rows into the database table.
Example:
3. UPDATE
UPDATE statement modifies records into the table.
Example:
4. DELETE
DELETE query removes entries from the table.
Example:
DCL COMMANDS:
1. GRANT
For example, to grant all privileges on the ‘explainjava’ database to the user ‘dmytro’@‘localhost’:
GRANT ALL PRIVILEGES ON explainjava.* TO 'dmytro'@'localhost';
FLUSH PRIVILEGES;
2. REVOKE
REVOKE statement is used to remove privileges from user accounts.
Example:
FLUSH PRIVILEGES;
TCL COMMANDS:
1. START TRANSACTION
The START TRANSACTION command begins a transaction; after that, you perform manipulations on the data
(insert, update, delete) and at the end you need to commit the transaction.
2. COMMIT
As mentioned above, the COMMIT command finishes the transaction and stores all changes
made inside the transaction.
Example:
START TRANSACTION;
INSERT INTO student (name, lastname) VALUES ('Dmytro', 'Shvechikov');
COMMIT;
3. ROLLBACK
ROLLBACK statement reverts all changes made in the scope of transaction.
Example:
START TRANSACTION;
INSERT INTO student (name, lastname) VALUES ('Dmytro', 'Shvechikov');
ROLLBACK;
4. SAVEPOINT
SAVEPOINT is a point in a transaction when you can roll the transaction back to a
certain point without rolling back the entire transaction.
Example:
SAVEPOINT SAVEPOINT_NAME;
The data in MongoDB is stored in form of documents. These documents are stored in
Collection and Collection is stored in Database.
db.students.insert({
name: "Chaitanya",
dept: "CSE",
place: "Coimbatore"
})
We do not have a collection students in the database bitdb. This command will
create the collection named “students” on the fly and insert a document in it with the
specified key and value pairs.
(ii) To check whether the document is successfully inserted, type the following
command.
Syntax: db.collection_name.find()
It shows all the documents in the given collection.
(iii) To check whether the collection is created successfully, use the following
command.
Syntax: show collections
(v) To drop a collection, first connect to the database in which you want to delete
collection and then type the following command to delete the collection:
©Bannari Amman Institute of Technology. All Rights Reserved
Machine Learning
Syntax: db.collection_name.drop()
(vii) To delete documents from a collection. The remove() method is used for
removing the documents from a collection in MongoDB.
Syntax: db.collection_name.remove(delete_criteria)
1. SQL Commands:
Query:
1. Create table books (author varchar (20), bookid int, noofpages int);
insert into books values ('xxx', 1234, 90);
select * from books;
2. alter table books add publisher varchar (30);
select * from books;
3. insert into books values('xxx',1234, 90, 'VPN');
select * from books;
4. alter table books rename to library;
select * from library;
db.students.remove({"StudentId": 3333})
To verify whether the document is actually deleted. Type the following command:
db.students.find().pretty()
Schedule: standard daily timetable, 8.45am to 4.15pm, with tea breaks (10.25am to 10.40am, 3.10pm to 3.25pm) and lunch break (12.20pm to 1.30pm).
1. Create tables to store the details of student, teacher and department using SQL and perform
the basic operations like find, insert, update and delete.
2. Create collections and documents to add student details, teacher details and department
details using MongoDB, and perform the basic operations like find, insert, update and delete.
Assessment specification:
S.No | Aspect Type | Aspect Description | Additional Aspect Description | Requirement | Maximum Score (10)
1 | J | Create table | If all the tables are created (2 Mark); If 2 tables are created (1 Mark); If no table is created (0 Mark) | - | 2
2 | J | Insert values | If values for all the tables are inserted (1 Mark); If values for 2 tables are inserted (0.5 Mark); If the values are not inserted (0 Mark) | - | 1
3 | J | Delete values | If any of the values are deleted (0.5 Mark); If the values are not deleted (0 Mark) | - | 0.5