Datascience Lab

The document outlines a series of experiments related to data science using Python and various libraries such as Pandas, Matplotlib, and NumPy. It includes practical examples of working with data frames, creating plots, performing statistical tests like Z-test and T-test, and building linear and logistic models. Each experiment is numbered and provides code snippets along with expected outputs.

INDEX

Tools: Python, NumPy, SciPy, Matplotlib, Pandas, Statsmodels, Seaborn, Plotly, Bokeh; working with NumPy arrays

Sl.No.  LIST OF EXPERIMENTS
1.   Working with Pandas data frames
2.   Basic plots using Matplotlib
3.   Frequency distributions, averages, variability
4.   Normal curves, correlation and scatter plots, correlation coefficient
5.   Regression
6.   Z-test
7.   T-test
8.   ANOVA
9.   Building and validating linear models
10.  Building and validating logistic models
11.  Time series analysis



Experiment No: 1

WORKING WITH PANDAS DATA FRAMES

Program:
import pandas as pd
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
# load the data into a DataFrame object
df = pd.DataFrame(data)
print(df.loc[0])

Output:
calories 420
duration 50
Name: 0, dtype: int64
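
A few further DataFrame operations (a minimal sketch using the same data dictionary, not part of the original listing) illustrate column selection, derived columns, and row filtering:

import pandas as pd

data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data)

# select a single column as a Series
print(df["calories"])

# add a derived column: calories burned per minute
df["cal_per_min"] = df["calories"] / df["duration"]

# filter rows where the duration is at least 45 minutes
print(df[df["duration"] >= 45])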





Experiment No: 2

BASIC PLOTS USING MATPLOTLIB

Program:
import matplotlib.pyplot as plt
a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)
# "o" is for circle markers and "r" is for red
plt.plot(b, "or")
plt.plot(list(range(0, 22, 3)))
# naming the x-axis
plt.xlabel('Day ->')
# naming the y-axis
plt.ylabel('Temp ->')
c = [4, 2, 6, 8, 3, 20, 13, 15]
plt.plot(c, label = '4th Rep')
# get the current axes
ax = plt.gca()
# take control of the individual boundary lines of the graph body
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
# set the bounds of the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)
# set the interval at which the x-axis places its marks
plt.xticks(list(range(-3, 10)))
# set the interval at which the y-axis places its marks
plt.yticks(list(range(-3, 20, 3)))
# the legend states which colour signifies which series
ax.legend(['1st Rep', '2nd Rep', '3rd Rep', '4th Rep'])
# annotate writes text on the graph; xy gives the position
plt.annotate('Temperature V / s Days', xy = (1.01, -2.15))
# give the graph a title
plt.title('All Features Discussed')
plt.show()

Output:





Experiment No: 3

FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY

Program:
# Python program to get the average of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [2, 40, 2, 502, 177, 7, 9]
# Calculating the average using average()
print(np.average(list))

Output:
105.57142857142857

# Python program to get the variance of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculating the variance using var()
print(np.var(list))

Output:
4.0

# Python program to get the standard deviation of a list
# Importing the NumPy module
import numpy as np
# Taking a list of elements
list = [290, 124, 127, 899]
# Calculating the standard deviation using std()
print(np.std(list))

Output:
318.35750344541907
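
The experiment title also covers frequency distributions, which the listings above do not show; a minimal sketch using collections.Counter on a sample list could be:

# Python program to get the frequency distribution of a list
from collections import Counter
# Taking a list of elements
values = [2, 4, 4, 4, 5, 5, 7, 9]
# Counting how often each value occurs
freq = Counter(values)
for value, count in sorted(freq.items()):
    print(value, count)

Output:
2 1
4 3
5 2
7 1
9 1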





Experiment No: 4

NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT

Program:
# Normal curves
import matplotlib.pyplot as plt
import numpy as np
mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)
# Create the bins and histogram (density=True replaces the removed normed=True argument)
count, bins, ignored = plt.hist(s, 20, density=True)
plt.show()

Output:

# Correlation and scatter plots
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
correlation = y.corr(x)
print(correlation)

Downloaded by Saravanan Sujatha


Output:
0.8603090020146067
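
The snippet above computes the correlation but does not draw the scatter plot named in the experiment title; a minimal sketch using the same x and y series:

import matplotlib.pyplot as plt
import pandas as pd

y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
# scatter plot of the two series
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter plot of y against x (correlation of about 0.86)')
plt.show()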
# Correlation coefficient
import math

# function that returns the correlation coefficient
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X
        sum_X = sum_X + X[i]
        # sum of elements of array Y
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i]
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of the array elements
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    # use the formula for the correlation coefficient
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt((n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))
    return corr

# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# Find the size of the arrays.
n = len(X)
# Function call to correlationCoefficient.
print('{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output:
0.953463
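
As a cross-check (a short sketch, not part of the original listing), NumPy's built-in corrcoef gives the same value:

import numpy as np

X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient of X and Y
print(np.corrcoef(X, Y)[0, 1])   # approximately 0.953463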




Experiment No: 5

REGRESSION

Program:
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating the regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # show the plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating the coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
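
The coefficients can be cross-checked (a short sketch, not part of the original program) against NumPy's least-squares polynomial fit:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# np.polyfit with degree 1 returns the slope and intercept of the
# least-squares line, matching b_1 and b_0 above
b_1, b_0 = np.polyfit(x, y, 1)
print(b_0, b_1)   # approximately -0.0586 and 1.4575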






Experiment No: 6

Z-TEST

Program:
# imports
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Generate a random sample of 50 values whose mean is about 110; the spread
# is set to 15/sqrt(50), the standard error of the mean for sd 15 and n = 50,
# similar to the IQ scores data we assume above
mean_iq = 110
sd_iq = 15/math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq*randn(50) + mean_iq
# print the mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# Now we perform the test. We pass the data, pass the mean under the null
# hypothesis in the value parameter, and in the alternative hypothesis we
# check whether the mean is larger.
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# The function returns a z-score and the corresponding p-value. We compare
# the p-value with alpha: if it is greater than alpha we do not reject the
# null hypothesis, otherwise we reject it.
if(p_value < alpha):
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Output:
Reject Null Hypothesis
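
For reference, the statistic that ztest computes here is the familiar one-sample z-score; a short sketch (assuming the same data array and null_mean as above):

import numpy as np

# z = (sample mean - hypothesised mean) / (sample sd / sqrt(n))
n = len(data)
z = (np.mean(data) - null_mean) / (np.std(data, ddof=1) / np.sqrt(n))
print(z)
# with the simulated data (mean near 110 against a null of 100), z lies far
# above the one-sided critical value of about 1.645 at alpha = 0.05,
# which is why the null hypothesis is rejected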






Experiment No: 7

T-TEST

Program:
# Importing the required libraries and packages
import numpy as np
from scipy import stats

# Defining two random distributions
# Sample size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)

# Calculating the standard deviation
# Calculating the variances to get the pooled standard deviation
var_x = x.var(ddof = 1)
var_y = y.var(ddof = 1)
# Standard deviation
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the t-statistic
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Comparing with the critical t-value
# Degrees of freedom
dof = 2 * N - 2
# p-value after comparison with the t-statistic
pval = 1 - stats.t.cdf(tval, df = dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))

# Cross-checking using the built-in function from the SciPy package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

Output:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205






Experiment No: 8

ANOVA

Program (in R):
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Variance in the mean within groups and between groups
boxplot(mtcars$disp ~ factor(mtcars$gear),
        xlab = "gear", ylab = "disp")
# Step 1: Set up the null and alternate hypotheses
# H0: mu = mu01 = mu02 (there is no difference between the average
#     displacement for the different gears)
# H1: not all means are equal
# Step 2: Calculate the test statistic using the aov function
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
# Step 3: Find the F-critical value at the 0.05 significance level
# Step 4: Compare the test statistic with the F-critical value and
#         conclude: if p < alpha, reject the null hypothesis

Output:



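The other experiments in this manual use Python; an equivalent one-way ANOVA can be sketched with scipy.stats.f_oneway (the three groups below are hypothetical displacement values standing in for mtcars$disp split by gear, not data from the original experiment):

from scipy import stats

# hypothetical displacement values for cars with 3, 4 and 5 gears
gear3 = [275.8, 360.0, 318.0, 304.0, 350.0]
gear4 = [160.0, 108.0, 146.7, 140.8, 121.0]
gear5 = [120.3, 95.1, 351.0, 145.0, 301.0]

# one-way ANOVA: H0 is that all group means are equal
f_stat, p_value = stats.f_oneway(gear3, gear4, gear5)
print("F =", f_stat, "p =", p_value)
# if p_value < 0.05, reject the null hypothesis of equal means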



Experiment No: 9

BUILDING AND VALIDATING LINEAR MODELS

Program
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Note: load_boston was removed in scikit-learn 1.2, so this listing
# requires an older scikit-learn release.
from sklearn.datasets import load_boston
sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150
# loading the data
boston = load_boston()

You can check the dataset's keys with the following code.
print(boston.keys())

The output will be as follows:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

You will find these details in the output:

Attribute Information (in order):
- CRIM      per capita crime rate by town
- ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS     proportion of non-retail business acres per town
- CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX       nitric oxides concentration (parts per 10 million)
- RM        average number of rooms per dwelling
- AGE       proportion of owner-occupied units built prior to 1940
- DIS       weighted distances to five Boston employment centres
- RAD       index of accessibility to radial highways
- TAX       full-value property-tax rate per $10,000
- PTRATIO   pupil-teacher ratio by town
- B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT     % lower status of the population
- MEDV      Median value of owner-occupied homes in $1000's
Missing Attribute Values: None
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df.head()
# print the columns present in the dataset
print(df.columns)
# print the top 5 rows of the dataset
print(df.head())

First five records from data set


# plotting a heatmap for the overall data set
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')




Heat map of overall data set


So let’s plot a regression plot to see the correlation between RM and MEDV.
# add the target (median home value) to the DataFrame as the MEDV column
df['MEDV'] = boston.target
sns.lmplot(x = 'RM', y = 'MEDV', data = df)

Regression plot with RM and MEDV
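
The experiment is titled building and validating linear models, yet the listing stops at this exploratory plot. A minimal sketch of the remaining steps (assuming df with the MEDV column from the code above, and using scikit-learn's LinearRegression, which the original listing does not show):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# features and target (MEDV was added to df above)
X = df.drop(columns=['MEDV'])
y = df['MEDV']

# hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# fit the linear model on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# validate on the held-out split
y_pred = model.predict(X_test)
print("R^2  =", r2_score(y_test, y_pred))
print("RMSE =", np.sqrt(mean_squared_error(y_test, y_pred)))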






Experiment No: 10

BUILDING AND VALIDATING LOGISTIC MODELS

Program

Building the Logistic Regression model:


# importing libraries
import statsmodels.api as sm
import pandas as pd
# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col = 0)
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df[['admitted']]
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
# printing the summary table
print(log_reg.summary())

Output :
                           Logit Regression Results
==============================================================================
Dep. Variable:               admitted   No. Observations:                   30
Model:                          Logit   Df Residuals:                       27
Method:                           MLE   Df Model:                            2
Date:                Wed, 15 Jul 2020   Pseudo R-squ.:                  0.4912
Time:                        16:09:17   Log-Likelihood:                -10.581
converged:                       True   LL-Null:                       -20.794
Covariance Type:            nonrobust   LLR p-value:                 3.668e-05
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
gmat               -0.0262      0.011     -2.383      0.017      -0.048      -0.005
gpa                 3.9422      1.964      2.007      0.045       0.092       7.792
work_experience     1.1983      0.482      2.487      0.013       0.254       2.143
===================================================================================

Predicting on New Data :

# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col = 0)
# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))
# comparing the original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)

Output :
Optimization terminated successfully.
Current function value: 0.352707
Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]




Testing the accuracy of the model :

from sklearn.metrics import confusion_matrix, accuracy_score
# confusion matrix
cm = confusion_matrix(ytest, prediction)
print ("Confusion Matrix : \n", cm)
# accuracy score of the model
print('Test accuracy = ', accuracy_score(ytest, prediction))

Output :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8
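
Beyond overall accuracy, per-class precision and recall can also be reported (a short sketch, assuming ytest and prediction from the code above):

from sklearn.metrics import classification_report

# precision, recall and F1-score for each class (0 = not admitted, 1 = admitted)
print(classification_report(ytest, prediction))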






Experiment No: 11

TIME SERIES ANALYSIS

Program
We are using the Superstore sales data.
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'

We start with time series analysis and forecasting of furniture sales.
df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']
We have a good four years of furniture sales data.
furniture['Order Date'].min(), furniture['Order Date'].max()
Timestamp('2014-01-06 00:00:00'), Timestamp('2017-12-30 00:00:00')
Data Preprocessing
This step includes removing columns we do not need, checking for missing values, aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
        'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
        'Quantity', 'Discount', 'Profit']




furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

Order Date    0
Sales         0
dtype: int64
Figure 1

Indexing with Time Series Data


furniture = furniture.set_index('Order Date')
furniture.index

Figure 2
We will use the average daily sales value for each month instead, and we use the start of each month as the timestamp.
y = furniture['Sales'].resample('MS').mean()
Have a quick peek at the 2017 furniture sales data.
y['2017':]




Figure 3

Visualizing Furniture Sales Time Series Data


y.plot(figsize=(15, 6))
plt.show()
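
The imports at the top of this experiment (itertools and statsmodels) point towards fitting a seasonal ARIMA model to this monthly series as the next step; a minimal sketch (the order and seasonal_order values below are illustrative placeholders, not tuned parameters):

import statsmodels.api as sm
import matplotlib.pyplot as plt

# fit a seasonal ARIMA model to the monthly furniture series y
mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 0, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])

# forecast one year ahead and plot it against the observed series
pred = results.get_forecast(steps=12)
ax = y.plot(label='observed', figsize=(15, 6))
pred.predicted_mean.plot(ax=ax, label='forecast')
ax.legend()
plt.show()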



