
LAB MANUAL

SUBJECT: DS USING PYTHON LAB
(ITL605)
Institute’s Vision

To be an organization with potential for excellence in engineering and management for the advancement of society and humankind.

Institute’s Mission

To excel in academics, practical engineering, management and to commence research endeavors.
To prepare students for future opportunities.
To nurture students with social and ethical responsibilities.
Department’s Vision

To create IT graduates with ethical and employable skills.

Department’s Mission

To imbibe problem solving and analytical skills through teaching learning process.
To impart technical and managerial skills to meet the industry requirement.
To encourage ethical and value-based education.
Excelssior’s Education Society
K. C. COLLEGE OF ENGINEERING AND
MANAGEMENT STUDIES AND RESEARCH
THANE (EAST).

Certificate

This is to certify that Mr/Ms Om Ishwar Badhe of Semester: VI, Branch: IT, Roll No: 01 has performed and successfully completed all the practicals in the subject DSL LAB for the academic year 2023 to 2024 as prescribed by the University of Mumbai.

DATE: -

Practical In charge Internal Examiner

Head of Department External Examiner

COLLEGE SEAL
Course details
Lab Objectives: The lab experiments aim:
1. To know the fundamental concepts of data science and analytics
2. To learn data collection, preprocessing and visualization techniques for data science
3. To understand and practice analytical methods for solving real-life problems based on statistical analysis
4. To learn various machine learning techniques to solve complex real-world problems
5. To learn streaming and batch data processing using Apache Spark
6. To map the elements of data science to perceive information

Lab Outcomes (with cognitive levels of attainment as per Bloom's Taxonomy):
1. Understand the concept of the Data science process and associated terminologies to solve real-world problems. (L1)
2. Analyze the data using different statistical techniques and visualize the outcome using different types of plots. (L1, L2, L3, L4)
3. Analyze and apply the supervised machine learning techniques like Classification, Regression or Support Vector Machine on data for building the models of data and solving the problems. (L1, L2, L3, L4)
4. Apply the different unsupervised machine learning algorithms like Clustering, Decision Trees, Random Forests or Association to solve the problems. (L1, L2, L3)
5. Design and Build an application that performs exploratory data analysis using Apache Spark. (L1, L2, L3, L4, L5, L6)
6. Design and develop a data science application that can have data acquisition, processing, visualization and statistical analysis methods with a supported machine learning technique to solve a real-world problem. (L1, L2, L3, L4, L5, L6)
Prerequisite: Basics of Python programming and Database management system.

DETAILED SYLLABUS:
Program Outcomes
Engineering Graduates will be able to:

Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences,
and engineering sciences.

Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.

Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.

The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.

Environment and sustainability: Understand the impact of the professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.

Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.

Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.

Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team,
to manage projects and in multidisciplinary environments.

Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Department of Information Technology

Subject : DS Using Python Lab

Semester :VI

Class : TE

Course Outcomes / Lab Outcomes

Course Code (ITL605) Lab Outcomes

At the end of the experiments, the student will be able to:

ITL 605.1 Understand the concept of Data science process and associated
terminologies to solve real-world problems

ITL 605.2 Analyze the data using different statistical techniques and visualize
the outcome using different types of plots.

ITL 605.3 Analyze and apply the supervised machine learning techniques like
Classification, Regression or Support Vector Machine on data for
building the models of data and solve the problems.

ITL 605.4 Apply the different unsupervised machine learning algorithms like
Clustering, Decision Trees, Random Forests or Association to solve
the problems.

ITL 605.5 Design and Build an application that performs exploratory data analysis using Apache Spark

ITL 605.6 Design and develop a data science application that can have data
acquisition, processing, visualization and statistical analysis
methods with supported machine learning technique to solve the
real-world problem
Rubrics for Practical

Each rubric carries a maximum weight of 5 marks (total 15); the four performance columns correspond to overall score bands of 15-12, 12-9, 9-6 and 6-0.

Implementation (R1), weight 5: Successful completion with accurate output (5-4); output correct but not precise (4-3); few errors in the output (3-2); incorrect output (2-0).

Understanding (R2), weight 5: Understanding of the experiment and correct conclusion (5-4); understanding of the experiment with conclusion drawn but less accurate (4-3); improper conclusion (3-2); no conclusion (2-0).

Punctuality and Discipline (R3), weight 5: Submission within a week (5-4); submission after a week (4-3); submission after two weeks (3-2); submission after three weeks and more (2-0).
List of Experiments (the manual records the Date of Conduction, Date of Submission, Page No., Grade/Marks and Sign for each experiment):

1. Data preparation using NumPy and Pandas
2. Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn
   a. Create a bar graph, contingency table using any 2 variables.
   b. Create a normalized histogram.
   c. Describe what these graphs and tables indicate.
3. Data Modeling: Validating partition by performing a two-sample Z-test.
4. Implementation of Statistical Hypothesis Test using SciPy and scikit-learn. Correlation Tests: Chi-Squared Test
5. Apply regression model techniques to predict the data on the House Prices dataset, and Prediction of Loan Using Multivariable Regression in Python.
6. Classification modeling
   a. Choose a classifier for classification problems.
   b. Evaluate the performance of the classifier.
7. Clustering
   a. Clustering algorithms for unsupervised classification.
   b. Plot the cluster data.
8. Using any machine learning technique and available data sets, develop a recommendation system.
9. Exploratory data analysis using Apache Spark and Pandas
10. Batch and Streamed Data Analysis using Spark
11. Implementation of a mini project based on a case study taken from a given dataset using Data science and Machine learning. Each group has to select a problem based on which the ML project is done; attach the same here. The following steps should be outlined:
    a) Problem definition, identifying which data set can be implemented.
    b) Identify and use a standard data mining dataset available for the problem. Some links for data science datasets are: Kaggle, UCI Machine Learning Repository, etc.
    c) Implement appropriate machine learning algorithms.
    d) Interpret and visualize the results.
12. Assignment 1
13. Assignment 2
Total Grade / Marks :-

Avg. marks of Experiments (A): Obtained / Out of
Avg. marks of Assignments (B): Obtained / Out of
Total Marks (A+B):

Practical Incharge                                   Date


EXPERIMENT NO. - 01

Aim of the Experiment: - Data preparation using NumPy and Pandas

Lab Outcome: - Understand the concept of Data science process and associated
terminologies to solve real-world problems

Date of Conduction: Date of Submission:

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
EXPERIMENT NO. - 01

AIM : Data preparation using NumPy and Pandas

THEORY:
Data Preprocessing:
Data preprocessing is a data mining technique which is used to transform the raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.

● (a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean or the most probable value (a short pandas sketch of both options follows).
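A minimal pandas sketch of both options on a small hypothetical DataFrame (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

# hypothetical data with missing entries
df = pd.DataFrame({"age": [25, 30, np.nan, 40],
                   "salary": [30000, 42000, 50000, np.nan]})

print(df.isnull().sum())        # how many values are missing per column

dropped = df.dropna()           # option 1: ignore (drop) the tuples with missing values
filled = df.fillna(df.mean())   # option 2: fill missing values with the attribute mean
print(filled)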

● (b). Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately. One can replace all data in a segment by its mean or
boundary values can be used to complete the task.

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they may fall outside the clusters.

2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for the mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.

3. Discretization:
This is done to replace the raw values of numeric attributes by interval levels or conceptual
levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder while working with such volumes. In order to get rid of this, we use data reduction techniques. They aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and p-value of the attribute. An attribute having a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example: Regression Models.

4. Dimensionality Reduction:
This reduces the size of the data by an encoding mechanism. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).

Feature Scaling:
Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values or units. If feature scaling is not done, then a machine learning algorithm tends to weigh greater values higher and consider smaller values as lower, regardless of the unit of the values.
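A short sketch of feature scaling with scikit-learn, assuming a small numeric DataFrame; both standardization (z-score) and min-max scaling to the range [0, 1] are shown:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"age": [25, 30, 35, 40],
                   "salary": [30000, 42000, 50000, 61000]})

# z-score standardization: each column gets mean 0 and standard deviation 1
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# min-max scaling: each column is rescaled to [0, 1]
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(standardized)
print(normalized)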
OUTPUT :
Data preparation using NumPy and Pandas:
● Finding out the missing values
● Standardizing the variable
● Identifying the outliers

CONCLUSION:

Data preparation using NumPy and Pandas has been successfully studied on a small sample data set in the Google Colab environment.
EXPERIMENT NO. - 02

Aim of the Experiment :- Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create a normalized histogram.
c. Describe what these graphs and tables indicate.

Lab Outcome :- Analyze the data using different statistical techniques and visualize the outcome using different types of plots.

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
EXPERIMENT NO. - 02
AIM : Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and
Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create a normalized histogram.
c. Describe what these graphs and tables indicate?

THEORY:
A bar graph is the graphical representation of categorical data using rectangular bars where the length
of each bar is proportional to the value they represent. A histogram is the graphical representation of
data where data is grouped into continuous number ranges and each range corresponds to a vertical bar.

Contingency Table is one of the techniques for exploring two or even more variables. It is basically
a tally of counts between two or more categorical variables.

Seaborn.barplot() method in Python


Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface
for drawing attractive and informative statistical graphics.

A barplot is basically used to aggregate the categorical data according to some methods and by default
it’s the mean. It can also be understood as a visualization of the group by action. To use this plot we
choose a categorical column for the x-axis and a numerical column for the y-axis, and we see that it
creates a plot taking a mean per categorical column.

Syntax : seaborn.barplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, estimator=mean, ci=95, n_boot=1000, units=None, orient=None, color=None, palette=None, saturation=0.75, errcolor='.26', errwidth=None, capsize=None, dodge=True, ax=None, **kwargs)

Parameters :

x, y, hue : names of variables in `data` or vector data, optional. Inputs for plotting long-form data. See examples for interpretation.

data : DataFrame, array, or list of arrays, optional. Dataset for plotting. If x and y are absent, this is interpreted as wide-form; otherwise it is expected to be long-form.

order, hue_order : lists of strings, optional. Order to plot the categorical levels in; otherwise the levels are inferred from the data objects.

estimator : callable that maps vector -> scalar, optional. Statistical function to estimate within each categorical bin.

ci : float or "sd" or None, optional. Size of confidence intervals to draw around estimated values. If "sd", skip bootstrapping and draw the standard deviation of the observations. If None, no bootstrapping will be performed, and error bars will not be drawn.

n_boot : int, optional. Number of bootstrap iterations to use when computing confidence intervals.

units : name of variable in `data` or vector data, optional. Identifier of sampling units, which will be used to perform a multilevel bootstrap and account for repeated measures design.

orient : "v" | "h", optional. Orientation of the plot (vertical or horizontal). This is usually inferred from the dtype of the input variables, but can be used to specify when the "categorical" variable is numeric or when plotting wide-form data.

color : matplotlib color, optional. Color for all of the elements, or seed for a gradient palette.

palette : palette name, list, or dict, optional. Colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.

saturation : float, optional. Proportion of the original saturation to draw colors at. Large patches often look better with slightly desaturated colors, but set this to 1 if you want the plot colors to perfectly match the input color spec.

errcolor : matplotlib color. Color for the lines that represent the confidence interval.

errwidth : float, optional. Thickness of error bar lines (and caps).

capsize : float, optional. Width of the "caps" on error bars.

dodge : bool, optional. When hue nesting is used, whether elements should be shifted along the categorical axis.

ax : matplotlib Axes, optional. Axes object to draw the plot onto, otherwise uses the current Axes.

kwargs : key, value mappings. Other keyword arguments are passed through to plt.bar at draw time.
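A minimal sketch covering the three tasks of this experiment, using the seaborn "tips" example dataset as a stand-in for the selected data set (it is downloaded on first use; any dataset with one categorical and one numeric column works the same way):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")

# a. bar graph (mean total_bill per day) and a contingency table of two categorical variables
sns.barplot(x="day", y="total_bill", data=tips)
plt.show()
print(pd.crosstab(tips["day"], tips["sex"]))

# b. normalized histogram: bar heights are scaled so the total area sums to 1
plt.hist(tips["total_bill"], bins=20, density=True)
plt.xlabel("total_bill")
plt.show()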
OUTPUT:
Conclusion :
We have successfully implemented Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and Seaborn.
EXPERIMENT NO. - 03

Aim of the Experiment :- Data Modeling: Validating partition by performing a two-sample Z-test.

Lab Outcome :- Understand the concept of Data science process and associated terminologies to solve real-world problems

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
Experiment No. 3
AIM : Data Modeling : Validating partition by performing a two‐sample Z‐test.

THEORY: Data Modeling

Data modeling is the process of creating a simplified diagram of a software system and the data
elements it contains, using text and symbols to represent the data and how it flows. Data models
provide a blueprint for designing a new database or reengineering a legacy application.

Z-test

Z-test is a statistical method to determine whether the distribution of the test statistics can be
approximated by a normal distribution. It is the method to determine whether two sample means
are approximately the same or different when their variance is known and the sample size is
large (should be >= 30).

When to Use Z-test:

● The sample size should be greater than 30. Otherwise, we should use the t-test.
● Samples should be drawn at random from the population.
● The standard deviation of the population should be known.
● Samples that are drawn from the population should be independent of each other.
● The data should be normally distributed, however for large sample size, it is assumed
to have a normal distribution.

Hypothesis Testing

A hypothesis is an educated guess/claim about a particular property of an object. Hypothesis testing is a way to validate the claim of an experiment.

● Null Hypothesis: The null hypothesis is a statement that the value of a population
parameter (such as proportion, mean, or standard deviation) is equal to some claimed
value. We either reject or fail to reject the null hypothesis. Null Hypothesis is denoted by
H0.
● Alternate Hypothesis: The alternative hypothesis is the statement that the parameter
has a value that is different from the claimed value. It is denoted by HA.

Steps to perform a Z-test:

● First, identify the null and alternate hypotheses.
● Determine the level of significance (α).
● Find the critical value of z using the standard normal (z) table.
● Calculate the z-test statistic: z = (X̄ - μ) / (σ / √n), where
  o X̄: mean of the sample
  o μ: mean of the population
  o σ: standard deviation of the population
  o n: sample size

Two-sample z-test:
In this test, we are given 2 normally distributed and independent populations, and we have drawn samples at random from both populations. Here, we consider μ1 and μ2 to be the population means, and X̄1 and X̄2 the observed sample means. Our null hypothesis is:

H0: μ1 - μ2 = 0

and the alternative hypothesis is:

H1: μ1 - μ2 ≠ 0

The formula for calculating the two-sample z-test statistic is:

z = ((X̄1 - X̄2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2)

where σ1 and σ2 are the standard deviations and n1 and n2 are the sample sizes of the populations corresponding to μ1 and μ2.

Type I error and Type II error:

● Type I error: A Type I error occurs when we reject the null hypothesis even though it is true. This error is denoted by alpha (α).
● Type II error: A Type II error occurs when we fail to reject the null hypothesis even though it is false. This error is denoted by beta (β).
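A minimal sketch of the two-sample Z-test computed directly from the formula above with NumPy and SciPy; the two arrays are synthetic stand-ins for the two data partitions being validated:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=50, scale=10, size=100)   # partition 1 (synthetic)
sample2 = rng.normal(loc=50, scale=10, size=100)   # partition 2 (synthetic)

# two-sample z statistic for H0: mu1 - mu2 = 0
z = (sample1.mean() - sample2.mean()) / np.sqrt(
    sample1.var(ddof=1) / len(sample1) + sample2.var(ddof=1) / len(sample2))
p_value = 2 * (1 - norm.cdf(abs(z)))               # two-tailed p-value

alpha = 0.05
print("z =", z, "p =", p_value)
print("Reject H0" if p_value <= alpha else "Fail to reject H0 (the partitions are similar)")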
OUTPUT:

CONCLUSION:
Data Modeling: Validating partition by performing a two-sample Z-test has been successfully performed, along with the study of the output, in the Google Colab environment.
EXPERIMENT NO. - 04

Aim of the Experiment :- Implementation of Statistical Hypothesis Test using SciPy and scikit-learn.
Correlation Tests : Chi-Squared Test

Lab Outcome :- Analyze the data using different statistical techniques and visualize the
outcome using different types of plots.

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical In charge
Experiment No. 4
AIM : Implementation of Statistical Hypothesis Test using SciPy and scikit-learn.

Correlation Tests : Chi-Squared Test

THEORY:

The Pearson's Chi-Square statistical hypothesis test is a test for independence between categorical variables. Here, we will perform the test using a mathematical approach and then using Python's SciPy module.

The Contingency Table :

A Contingency table (also called crosstab) is used in statistics to summarise the relationship
between several categorical variables. Here, we take a table that shows the number of men and
women buying different types of pets.

dog cat bird total


men 207 282 241 730
women 234 242 232 708
total 441 524 473 1438

The aim of the test is to conclude whether the two variables( gender and choice of pet ) are related
to each other.

Null hypothesis:

We start by defining the null hypothesis (H0) which states that there is no relation between the
variables. An alternate hypothesis would state that there is a significant relation between the two.

We can verify the hypothesis by these methods:

● Using p-value:

We define a significance factor to determine whether the relation between the variables is of
considerable significance. Generally a significance factor or alpha value of 0.05 is chosen. This
alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha value
is chosen in cases where we expect more precision. If the p-value for the test comes out to be
strictly greater than the alpha value, then H0 holds true.

● Using chi-square value:

If our calculated value of chi-square is less than or equal to the tabular (also called critical) value of chi-square, then H0 holds true.

Expected Values Table :Next, we prepare a similar table of calculated(or expected) values.
To do this we need to calculate each item in the new table as :
row total * column total / grand total

The expected values table :

dog cat bird total


men 223.87343533 266.00834492 240.11821975 730
women 217.12656467 257.99165508 232.88178025 708
total 441 524 473 1438

Chi-Square Table :

We prepare this table by calculating, for each item, (Observed_value – Calculated_value)^2 / Calculated_value.

The chi-square table:


observed (o) calculated (c) (o-c)^2 / c
207 223.87343533 1.2717579435607573
282 266.00834492 0.9613722161954465
241 240.11821975 0.003238139990850831
234 217.12656467 1.3112758457617977
242 257.99165508 0.991245364156322
232 232.88178025 0.0033387601600580606
Total 4.542228269825232

From this table, we obtain the total of the last column, which gives us the calculated value of chi-
square. Hence the calculated value of chi-square is 4.542228269825232

Now, we need to find the critical value of chi-square. We can obtain this from a table. To use this
table, we need to know the degrees of freedom for the dataset. The degrees of freedom is defined
as : (no. of rows – 1) * (no. of columns – 1).
Hence, the degrees of freedom is (2-1) * (3-1) = 2

Now, look at the table and find the value corresponding to 2 degrees of freedom and a 0.05 significance factor:
The tabular value of chi-square here is 5.991.
Hence, the critical value of X^2 >= the calculated value of X^2.
Therefore, H0 is accepted; that is, the variables do not have a significant relation.

Performing the test using Python (scipy.stats) :

SciPy is an open-source Python library, which is used in mathematics, engineering, scientific and technical computing.

Installation:

pip install scipy

The chi2_contingency() function of scipy.stats module takes as input, the contingency table in
2d array format. It returns a tuple containing test statistics, the p-value, degrees of freedom and
expected table(the one we created from the calculated values) in that order.

Hence, we need to compare the obtained p-value with alpha value of 0.05.

from scipy.stats import chi2_contingency

# defining the table
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)

# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

OUTPUT:
Chi-square Test for feature selection

Feature selection, also known as attribute selection, is the process of extracting the most relevant features from the dataset and then applying machine learning algorithms for better performance of the model. A large number of irrelevant features increases the training time exponentially and increases the risk of overfitting.

Chi-square Test for Feature Extraction:

Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each
feature and the target and select the desired number of features with best Chi-square scores. It
determines if the association between two categorical variables of the sample would reflect their
real association in the population.

The chi-square score is given by:

X^2 = Σ (Observed frequency – Expected frequency)^2 / Expected frequency

where
Observed frequency = No. of observations of a class
Expected frequency = No. of expected observations of a class if there was no relationship between the feature and the target.
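A short sketch of chi-square based feature selection with scikit-learn's SelectKBest, using the built-in Iris data as a stand-in dataset (chi2 requires non-negative feature values, which holds here):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# keep the 2 features with the best chi-square scores against the target
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("chi-square scores:", selector.scores_)
print("shape after selection:", X_new.shape)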

OUTPUT:
CONCLUSION:

Implementation of Statistical Hypothesis Test using SciPy and scikit-learn, Correlation Tests: Chi-Squared Test has been successfully studied, along with the desired conclusion for the obtained results, in the Google Colab environment.
EXPERIMENT NO. - 05

Aim of the Experiment :- Apply regression model techniques to predict the data on the House Prices dataset, and Prediction of Loan Using Multivariable Regression in Python.

Lab Outcome :- Analyze and apply the supervised machine learning techniques like Classification, Regression or Support Vector Machine on data for building the models of data and solve the problems.

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical In charge
Experiment No. 5
AIM :- Apply regression model techniques to predict the data on the House Prices dataset, and Prediction of Loan Using Multivariable Regression in Python.

THEORY: Linear Regression:


Linear regression is probably one of the most important and widely used regression techniques.
It’s among the simplest regression methods. One of its main advantages is the ease of interpreting
results.

When implementing linear regression of some dependent variable 𝑦 on the set of independent variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀, 𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression calculates the estimators of the regression coefficients or simply the predicted weights, denoted with 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the sum of squared residuals (SSR) for all observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary least squares.

Multiple Linear Regression:


Multiple or multivariate linear regression is a case of linear regression with two or more
independent variables.

If there are just two independent variables, the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space. The goal of regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual responses and yields the minimal SSR.

The case of more than two independent variables is similar, but more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined when the number of inputs is 𝑟.
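A minimal sketch of fitting a multiple linear regression with scikit-learn; the built-in California housing data stands in for the house-prices data (the actual house-price or loan CSV used in the lab would be loaded with pandas instead):

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ordinary least squares fit: estimates the weights b0, b1, ..., br
model = LinearRegression().fit(X_train, y_train)

print("intercept b0:", model.intercept_)
print("coefficients b1..br:", model.coef_)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))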
Output:-
CONCLUSION:

Applying regression model techniques to predict the data on the House Prices dataset, and Prediction of Loan Using Multivariable Regression in Python, has been successfully implemented using Google Colab.
EXPERIMENT NO. - 06

Aim of the Experiment :- Classification modeling
a. Choose a classifier for a classification problem.
b. Evaluate the performance of the classifier.

Lab Outcome: - Analyze and apply the supervised machine learning techniques like
Classification, Regression or Support Vector Machine on data for building the
models of data and solve the problems.

Date of Conduction: Date of Submission

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
Experiment No. 6

AIM : Classification modeling

a. Choose a classifier for classification problems.
b. Evaluate the performance of the classifier.

THEORY: Ensemble learning is a machine learning paradigm where multiple models (often called
“weak learners”) are trained to solve the same problem and combined to get better results. The
main hypothesis is that when weak models are correctly combined we can obtain more accurate
and/or robust models.

Bagging is a homogeneous weak learners' model in which the learners are trained independently of each other in parallel and then combined to determine the model average. Bagging is an acronym for 'Bootstrap Aggregation' and is used to decrease the variance of the prediction model. Bagging is a parallel method that fits the different learners independently from each other, making it possible to train them simultaneously.
Bagging generates additional data for training from the dataset. This is achieved by random
sampling with replacement from the original dataset. Sampling with replacement may repeat some
observations in each new training data set. Every element in Bagging is equally probable for
appearing in a new dataset.
These multi datasets are used to train multiple models in parallel. The average of all the predictions
from different ensemble models is calculated. The majority vote gained from the voting mechanism
is considered when classification is made. Bagging decreases the variance and tunes the prediction
to an expected outcome.
Example of Bagging: The Random Forest model uses Bagging, where decision tree models with
higher variance are present. It makes random feature selection to grow trees. Several random trees
make a Random Forest.
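A short sketch of the two bagging-style classifiers mentioned above, evaluated with accuracy on a held-out split of the built-in Iris data (any other labelled dataset can be substituted):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# BaggingClassifier trains decision trees on bootstrap samples drawn with replacement;
# RandomForestClassifier additionally uses random feature selection when growing the trees
bagging = BaggingClassifier(n_estimators=50, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1)

for name, clf in [("Bagging", bagging), ("Random Forest", forest)]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))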

OUTPUT:
CONCLUSION:
Classification modeling (a. choose a classifier for a classification problem, b. evaluate the performance of the classifier) has been successfully studied in the Google Colab environment.
EXPERIMENT NO. - 07

Aim of the Experiment :- Clustering
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.

Lab Outcome :- ITL605.4 Apply the different unsupervised machine learning algorithms like Clustering, Decision Trees, Random Forests or Association to solve the problems.

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
Experiment No. 7

AIM : Clustering
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.

THEORY: K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is and how the algorithm works, along with the Python implementation of K-means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm which groups the unlabeled dataset into different clusters. Here K defines the number of predefined clusters that need to be created in the process; if K=2, there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training. It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it cannot find better clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K centre points or centroids by an iterative process.
o Assigns each data point to its closest k-centre. Those data points which are near a particular k-centre create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other clusters. The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below (a short sketch follows the steps):
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
Step-7: The model is ready.
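A minimal sketch of the algorithm with scikit-learn, clustering synthetic 2-D data and plotting the clusters together with their centroids:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic unlabeled data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="x", s=200, c="red")
plt.title("K-Means clusters (k=3)")
plt.show()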

OUTPUT:
CONCLUSION:
Clustering (a. clustering algorithms for unsupervised classification, b. plotting the cluster data) has been successfully studied and implemented in the Google Colab environment.
EXPERIMENT NO. - 08

Aim of the experiment :- Using any machine learning technique and an available data set, develop a recommendation system.

Lab Outcome :- ITL605.3 Analyze and apply the supervised machine learning
techniques like Classification, Regression or Support Vector Machine on data for
building the models of data and solve the problems.

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
EXPERIMENT NO.: 8

AIM: Using any machine learning technique and an available data set, develop a recommendation system.

THEORY:
Practically, recommender systems encompass a class of techniques and algorithms which are
able to suggest “relevant” items to users. Ideally, the suggested items are as relevant to the user
as possible, so that the user can engage with those items: YouTube videos, news articles, online
products, and so on.
Items are ranked according to their relevancy, and the most relevant ones are shown to the user.
The relevancy is something that the recommender system must determine and is mainly based
on historical data. If you’ve recently watched YouTube videos about elephants, then YouTube is
going to start showing you a lot of elephant videos with similar titles and themes!
Recommender systems are generally divided into two main categories: collaborative filtering
and content-based systems.

Figure 1: A tree of the different types of Recommender Systems.


Collaborative Filtering Systems
Collaborative filtering methods for recommender systems are methods that are solely based
on the past interactions between users and the target items. Thus, the input to a collaborative
filtering system will be all historical data of user interactions with target items. This data is
typically stored in a matrix where the rows are the users, and the columns are the items.
The core idea behind such systems is that the historical data of the users should be enough to make a prediction, i.e., we don't need anything more than that historical data: no extra push from the user, no presently trending information, etc.
Beyond this, collaborative filtering methods are further divided into two sub-groups: memory-based and model-based methods. Model-based approaches, on the other hand, always assume some kind of underlying model and basically try to make sure that whatever predictions come out will fit the model well.
Steps:

1. Load up the data with pandas
2. Convert the pandas DataFrames to GraphLab SFrames
3. Train the model
4. Make recommendations

Principal component analysis (PCA) is a statistical procedure that is used to reduce the
dimensionality. It uses an orthogonal transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated variables called principal
components. It is often used as a dimensionality reduction technique.

Steps Involved in the PCA

Step 1: Standardize the dataset.

Step 2: Calculate the covariance matrix for the features in the dataset.

Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

Step 4: Sort eigenvalues and their corresponding eigenvectors.

Step 5: Pick k eigenvalues and form a matrix of eigenvectors.

Step 6: Transform the original matrix.
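A short NumPy sketch that follows the six steps above, using the built-in Iris data as a stand-in and keeping k = 2 principal components (in practice sklearn.decomposition.PCA does the same in one call):

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Step 1: standardize the dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std.T)

# Steps 3 and 4: eigenvalues/eigenvectors, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: pick k = 2 eigenvectors; Step 6: transform the original matrix
k = 2
X_pca = X_std @ eigvecs[:, :k]

print("explained variance ratio:", eigvals[:k] / eigvals.sum())
print("shape after PCA:", X_pca.shape)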

OUTPUT:
CONCLUSION:
Using a machine learning technique and an available data set, a recommendation system has been developed in a Jupyter notebook.
EXPERIMENT NO. - 09

Aim of the Experiment :- Exploratory data analysis using Apache Spark and Pandas

Lab Outcome :- Design and Build an application that performs exploratory data analysis using Apache Spark

Date of Conduction : Date of Submission : _______

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
Experiment No-9

AIM: Exploratory data analysis using Apache Spark and Pandas

THEORY:

Exploratory Data Analysis In Python?

Exploratory Data Analysis (EDA) in Python is the first step in the data analysis process, developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

For Example, You are planning to go on a trip to the “X” location. Things you do before taking a
decision:

● You will explore the location on what all places, waterfalls, trekking, beaches, restaurants
that location has in Google, Instagram, Facebook, and other social Websites.
● Calculate whether it is in your budget or not.
● Check for the time to cover all the places.
● Type of Travel method.

Similarly, when you are trying to build a machine learning model you need to be pretty sure whether your data makes sense or not. The main aim of exploratory data analysis is to obtain confidence in your data to an extent where you are ready to engage a machine learning algorithm.

Need For Exploratory Data Analysis

Exploratory Data Analysis is a crucial step before jumping to machine learning or modeling of the
data. By doing this you can get to know whether the selected features are good enough to model,
are all the features required, are there any correlations based on which we can either go back to the
Data Pre-processing step or move on to modeling.

Once Exploratory Data Analysis is complete, its features can be used for supervised and unsupervised machine learning modeling.

In every machine learning workflow, the last step is reporting or providing the insights to the stakeholders. By completing the Exploratory Data Analysis, many plots, heat maps, frequency distributions, graphs and correlation matrices can be drawn, along with the hypotheses, by which any individual can understand what the data is all about and what insights can be gained from exploring the data set.

In the trip example, all the exploration of the selected place is done, based on which we will get the confidence to plan the trip and even share with our friends the insights we got regarding the place so that they can also join.

What Are The Steps In Exploratory Data Analysis In Python?

There are many steps for conducting Exploratory Data Analysis:
● Description of data
● Handling missing data
● Handling outliers
● Understanding relationships and new insights through plots

a) Description of data:
We need to know the different kinds of data and other statistics of our data before we can move on
to the other steps. A good one is to start with the describe() function in python. In Pandas, we can
apply describe() on a DataFrame which helps in generating descriptive statistics that summarize
the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

The result’s index will include count, mean, std, min, max as well as lower, 50 and upper
percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50 percentile
is the same as the median.

Loading the Dataset:

import pandas as pd
# note: load_boston was removed in scikit-learn 1.2; use an older version
# or another dataset such as fetch_california_housing
from sklearn.datasets import load_boston

boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# creating the dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.describe()

b) Handling missing data:


Data in the real-world are rarely clean and homogeneous. Data can either be missing during data
extraction or collection due to several reasons. Missing values need to be handled carefully because
they reduce the quality of any of our performance matrix. It can also lead to wrong prediction or
classification and can also cause a high bias for any given model being used. There are several
options for handling missing values. However, the choice of what should be done is largely
dependent on the nature of our data and the missing values. Below are some of the techniques:

● Drop NULL or missing values


● Fill Missing Values
● Predict Missing values with an ML Algorithm
Drop NULL or missing values:
This is the fastest and easiest way to handle missing values. However, it is not generally advised. This method reduces the quality of our model as it reduces the sample size, because it works by deleting all other observations where any of the variables is missing.

A check such as boston_df.isnull().sum() indicates that there are no null values in our data set.

Fill Missing Values:
This is the most common method of handling missing values. It is a process whereby missing values are replaced with a test statistic like the mean, median or mode of the particular feature the missing value belongs to. Let's suppose we have a missing value of age in the Boston data set. Then the code below will fill the missing value with 30.
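A minimal sketch of these two options on a small hypothetical DataFrame (in the lab the same calls would be applied to boston_df from above; the AGE values here are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"AGE": [65.2, np.nan, 61.1, np.nan],
                   "TAX": [296, 242, 242, 222]})

print(df.isnull().sum())            # count of missing values per column

df_drop = df.dropna()               # drop rows containing missing values
df["AGE"] = df["AGE"].fillna(30)    # fill the missing AGE values with 30, as in the text
# alternatively: df["AGE"].fillna(df["AGE"].mean()) to use the column mean
print(df)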

Predict Missing values with an ML Algorithm:


This is by far one of the best and most efficient methods for handling missing data. Depending on
the class of data that is missing, one can either use a regression or classification model to predict
missing data.

c) Handling outliers:
An outlier is something which is separate or different from the crowd. Outliers can be a result of a
mistake during data collection or it can be just an indication of variance in your data. Some of the
methods for detecting and handling outliers:

● BoxPlot
● Scatterplot
● Z-score
● IQR(Inter-Quartile Range)

BoxPlot:
A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. Outlier points are those past the end of the whiskers. Boxplots show robust measures of location and spread as well as providing information about symmetry and outliers.
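A short sketch of outlier detection with a boxplot and the IQR rule, on a synthetic column containing one obvious outlier:

import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 18, 19, 20, 95])   # 95 is an artificial outlier

s.plot(kind="box")                   # points beyond the whiskers are outliers
plt.show()

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print("outliers by the IQR rule:", outliers.tolist())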
Output:-
CONCLUSION:

Exploratory data analysis using Apache Spark and Pandas has been successfully studied and implemented in a Jupyter notebook environment.
EXPERIMENT NO. - 10

Aim of the Experiment :- Batch and Streamed Data Analysis using Spark.

Lab Outcome :- Design and Build an application that performs exploratory data analysis using Apache Spark

Date of Conduction : Date of Submission :

Implementation (5)    Understanding (5)    Punctuality & Discipline (5)    Total (15)

Practical Incharge
EXPERIMENT NO. - 10
AIM :- Batch and Streamed Data Analysis using Spark.

THEORY:

Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms involving large data and a high amount of computation are often run on a distributed computing system. A distributed computing system involves nodes (networked computers) that run processes in parallel and communicate if necessary.
MapReduce – The programming model that is used for Distributed computing is known as
MapReduce. The MapReduce model involves two stages, Map and Reduce.
1. Map – The mapper processes each line of the input data (it is in the form of a file), and
produces key – value pairs.
Input data → Mapper → list([key, value])
2. Reduce – The reducer processes the list of key – value pairs (after the Mapper’s
function). It outputs a new set of key – value pairs.
list([key, value]) → Reducer → list([key, list(values)])
Spark – Spark (an open-source big-data processing engine by Apache) is a cluster computing system. It is faster compared to other cluster computing systems (such as Hadoop). It provides high-level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this will make the learning curve flatter. To install Spark on a Linux system, follow this. To run Spark in a multi-cluster system, follow this. We will see how to create RDDs (the fundamental data structure of Spark).
RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. These will become clearer further on.

SparkContext – For creating a standalone application in Spark, we first define a SparkContext.
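A minimal sketch of defining a SparkContext, assuming PySpark is installed; the application name is a placeholder and "local[*]" runs Spark on all local cores:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("GraphDegrees").setMaster("local[*]")
sc = SparkContext(conf=conf)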

RDD transformations – Now that a SparkContext object is created, we will create RDDs and see some transformations on them.

One major advantage of using Spark is that it does not load the dataset into memory; lines is a pointer to the 'file_name.txt' file.
Steps:
1. Our text file is in the following format (each line represents an edge of a directed graph):
1 2
1 3
2 3
3 4
...
2. Large datasets may contain millions of nodes and edges.
3. The first few lines set up the SparkContext. We create an RDD lines from it.
4. Then, we transform the lines RDD to the edges RDD. The function conv acts on each line, and key-value pairs of the form (1, 2), (1, 3), (2, 3), (3, 4), … are stored in the edges RDD.
5. After this, reduceByKey aggregates all the key-value pairs corresponding to a particular key, and the numNeighbours function is used for generating each vertex's degree in a separate RDD Adj_list, which has the form (1, 2), (2, 1), (3, 1), … (see the PySpark sketch below).
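A minimal PySpark sketch of steps 3-5, assuming the edge list is stored in 'file_name.txt'; conv and numNeighbours are written here to match the description above:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()            # reuse the SparkContext defined earlier

lines = sc.textFile("file_name.txt")       # a pointer to the edge list; loaded lazily

def conv(line):
    # "1 2" -> (1, 2)
    src, dst = line.split()
    return (int(src), int(dst))

edges = lines.map(conv)

# numNeighbours: emit (vertex, 1) per outgoing edge, then sum the counts per vertex
def numNeighbours(edge):
    return (edge[0], 1)

Adj_list = edges.map(numNeighbours).reduceByKey(lambda a, b: a + b)
print(Adj_list.collect())                  # e.g. [(1, 2), (2, 1), (3, 1), ...]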

OUTPUT:
CONCLUSION:
Batch and Streamed Data Analysis using Spark has been successfully implemented and studied using the Google Colab environment.
