FDS LAB
LABORATORY
NAME :
REGISTER NO :
SEMESTER :
NAME : ……………………………………………
DEPARTMENT : ……………………………………………
YEAR/SEM : ……………………………………………
Certified that this is the Bonafide record of practical work done by the aforesaid
student in the …………………… during the year …………………… .
VISION :
To create a knowledge pool in the field of computer science and engineering that empowers students to meet
the challenges of society.
MISSION :
Prepare the students with strong fundamental concepts, analytical capabilities, programming and problem
solving skills.
Bringing an Eco-System to provide new cutting edge technologies required to meet the challenges.
Imparting necessary skills to become continuous learners in the field of Computer Science and Engineering
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS
COURSE OBJECTIVES:
• To understand the techniques and processes of data science
• To apply descriptive data analytics
• To visualize data for various applications
• To understand inferential data analytics
• To analyze and build predictive models from data
SUGGESTED EXPERIMENTS
1. Working with Numpy arrays
2. Working with Pandas data frames
3. Develop a Python Program for Basic plots using Matplotlib
4.a Develop a Python Program for Frequency distributions
4.b Develop a Python Program for Averages
4.c Develop a Python Program for Variability
5.a Develop a Python Program for Normal Curves
5.b Develop a Python Program for Correlation and scatter plots
5.c Develop a Python Program for Correlation Coefficient
6. Develop a Python program for Simple Linear Regression
7. Develop a Python Program for Z-Test
8. Develop a Python Program for T-Test
9. Develop a Python Program for ANOVA
10. Building and Validating Linear Models
11. Building and Validating Logistic Models
12. Develop a Python Program for Time Series Analysis
COURSE OUTCOMES:
Upon successful completion of this course, the students will be able to:
CO1: Explain the data analytics pipeline
CO2: Describe and visualize data
CO3: Perform statistical inferences from data
CO4: Analyze the variance in the data
CO5: Build models for predictive analytics
CO’s- PO’s & PSO’s MAPPING
CONTENTS
AIM:
To work with Numpy arrays
ALGORITHM:
Step1: Start
Step2: Import Numpy module
Step3: Print the basic characteristics and operations of array
Step4: Stop
PROGRAM:
import numpy as np

# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])

# Printing type of arr object
print("Array is of type: ", type(arr))

# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)

# Printing shape of array
print("Shape of array: ", arr.shape)

# Printing size (total number of elements) of array
print("Size of array: ", arr.size)

# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
OUTPUT:
Program to Perform Array Slicing
a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])
print(a)
print("After slicing")
# rows from index 1 onwards
print(a[1:])
OUTPUT:
[[1 2 3]
[3 4 5]
[4 5 6]]
After slicing
[[3 4 5]
[4 5 6]]
OUTPUT:
Our array is:
[[1 2 3]
[3 4 5]
[4 5 6]]
The items in the second column are:
[2 4 5]
The items in the second row are:
[3 4 5]
The items column 1 onwards are:
[[2 3]
[4 5]
[5 6]]
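The listing that produced the column and row output above is not included in this record. A minimal sketch that prints the same lines, assuming the array a from the slicing program and plain NumPy index expressions (the variable names are illustrative), could look like this:

```python
import numpy as np

# Same 3x3 array as in the slicing program
a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

second_column = a[:, 1]    # every row, column index 1
second_row = a[1, :]       # row index 1, every column
from_column_1 = a[:, 1:]   # every row, columns 1 onwards

print("Our array is:")
print(a)
print("The items in the second column are:")
print(second_column)
print("The items in the second row are:")
print(second_row)
print("The items column 1 onwards are:")
print(from_column_1)
```

The colon selects a whole axis, so a[:, 1] reads "all rows, column 1".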
RESULT:
Thus the program for working with Numpy arrays was executed and verified successfully.
Ex. No.: 2 Working with Pandas data frames
Date:
AIM:
To work with Pandas data frames.
ALGORITHM:
Step1: Start
Step2: import numpy and pandas module
Step3: Create a dataframe using the dictionary
Step4: Print the output
Step5: Stop
PROGRAM:
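import numpy as np
import pandas as pd

# Input array (assumed, reconstructed from the first table in the output):
# the first row holds the column labels, the first column the row labels
data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4]])
print(pd.DataFrame(data=data[1:, 1:],
                   index=data[1:, 0],
                   columns=data[0, 1:]))

# Take a 2D array as input to your DataFrame
my_2darray = np.array([[1, 2, 3], [4, 5, 6]])
print(pd.DataFrame(my_2darray))

# Take a Series as input to your DataFrame
my_series = pd.Series({"United Kingdom": "London", "India": "New Delhi",
                       "United States": "Washington", "Belgium": "Brussels"})
print(pd.DataFrame(my_series))

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
# shape and number of rows (assumed: inferred from the last two output lines)
print(df.shape)
print(len(df.index))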
OUTPUT:
     Col1 Col2
Row1    1    2
Row2    3    4
   0  1  2
0  1  2  3
1  4  5  6
   1  2  3
0  1  1  2
1  3  2  4
   A
0  4
1  5
2  6
3  7
                         0
United Kingdom      London
India            New Delhi
United States   Washington
Belgium           Brussels
(2, 3)
2
RESULT:
Thus the program for working with Pandas data frames was executed and verified successfully.
Ex. No.:3 Develop a Python Program for Basic plots using Matplotlib
Date:
AIM:
To develop a Python program for basic plots using Matplotlib.
ALGORITHM:
Step1: Start
Step2: import Matplotlib module
Step3: Create basic plots using Matplotlib
Step4: Print the output
Step5: Stop
PROGRAM:
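import matplotlib.pyplot as plt
# x axis values
x = [1, 2, 3]
# corresponding y axis values
y = [2, 4, 1]
# plotting the points and showing the plot
plt.plot(x, y)
plt.show()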
OUTPUT:
PROGRAM:3B
import matplotlib.pyplot as plt

a = [1, 2, 3, 4, 5]
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(a)

# get the current axes (assumed: the listing does not show where ax is created)
ax = plt.gca()
ax.spines['top'].set_visible(False)
# set the range or the bounds of the left boundary line to a fixed range
ax.spines['left'].set_bounds(-3, 40)
plt.show()
OUTPUT:
PROGRAM:3c
import matplotlib.pyplot as plt
sub2.plot(b, 'or')
sub4.plot(c, 'Dm')
sub4.set_yticks(list(range(0, 24, 2)))
sub4.set_title('4th Rep')
OUTPUT:
RESULT:
Thus the Python program for basic plots using Matplotlib was executed and verified successfully.
Ex. No.:4a Develop a Python Program for Frequency distributions
Date:
AIM:
To count the frequency of occurrence of words in a body of text.
ALGORITHM:
PROGRAM:
# take the first 50 tokens and count how often each one occurs among them
for i in range(50):
    wlist.append(token[i])
wordfreq = [wlist.count(w) for w in wlist]
OUTPUT:
[('[', 1), ('Poems', 1), ('by', 1), ('William', 1), ('Blake', 1), ('1789', 1), (']', 1), ('SONGS', 2), ('OF', 3),
('INNOCENCE', 2), ('AND', 1), ('OF', 3), ('EXPERIENCE', 1), ('and', 1), ('THE', 1), ('BOOK', 1), ('of', 2),
('THEL', 1), ('SONGS', 2), ('OF', 3), ('INNOCENCE', 2), ('INTRODUCTION', 1), ('Piping', 2), ('down', 1),
('the', 1), ('valleys', 1), ('wild', 1), (',', 3), ('Piping', 2), ('songs', 1), ('of', 2), ('pleasant', 1), ('glee', 1), (',', 3),
('On', 1), ('a', 2), ('cloud', 1), ('I', 1), ('saw', 1), ('a', 2), ('child', 1), (',', 3), ('And', 1), ('he', 1), ('laughing', 1),
('said', 1), ('to', 1), ('me', 1), (':', 1), ('``', 1)]
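The counting step itself does not depend on a tokenizer library. A minimal sketch, assuming a short hypothetical sentence split on whitespace in place of the tokenized text used above, pairs each token with its count in the same way:

```python
from collections import Counter

# Hypothetical sample text (not the text used in the experiment)
text = "the cat and the hat and the bat"
tokens = text.split()

# Count every distinct token once, then look up the count for each token
counts = Counter(tokens)
pairs = [(w, counts[w]) for w in tokens]
print(pairs)
```

Counter scans the token list once, whereas calling wlist.count(w) inside a list comprehension rescans the list for every token; for 50 tokens either approach is fine.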
RESULT:
Thus the program to count the frequency of occurrence of words in a body of text was executed and verified
successfully.
Ex. No.:4b Develop a Python Program for Averages
Date:
AIM:
To compute weighted averages in Python, either by defining your own functions or by using Numpy.
ALGORITHM:
PROGRAM:
OUTPUT:
44225.35
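The program listing for this experiment is not included in this record. A minimal sketch of a user-defined weighted average, assuming hypothetical salary and head-count figures (they do not reproduce the output value above), could look like this:

```python
def weighted_average(values, weights):
    """Return sum(value * weight) / sum(weight) for paired inputs."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: average salary per group, employees per group
salaries = [50000, 40000, 30000]
employees = [10, 30, 60]

result = weighted_average(salaries, employees)
print(result)
```

With Numpy, np.average(salaries, weights=employees) computes the same quantity.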
RESULT:
Thus the computation of weighted averages in Python, either by defining your own functions or by using
Numpy, was executed and verified successfully.
Ex. No.: 4c Develop a Python Program for Variability
Date:
AIM:
To write a python program to calculate the variance.
ALGORITHM:
PROGRAM:
print("Variance of Sample3 is % s " % (variance(sample3)))
print("Variance of Sample4 is % s " % (variance(sample4)))
print("Variance of Sample5 is % s " % (variance(sample5)))
OUTPUT :
RESULT:
Thus the computation for variance was executed and verified successfully.
Ex. No.:5a Develop a Python Program for Normal Curve
Date:
AIM:
To create a normal curve using python program.
ALGORITHM:
Step 1: Start the Program
Step 2: Import the scipy package and its scipy.stats module
Step 3: Import packages numpy, matplotlib and seaborn
Step 4: Create the distribution
Step 5: Visualizing the distribution
Step 6: Stop the process
PROGRAM:
sb.set_style('whitegrid')
# recent seaborn releases accept x and y only as keyword arguments
sb.lineplot(x=data, y=pdf, color='black')
plt.xlabel('Heights')
plt.ylabel('Probability Density')
plt.show()
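The step that creates the distribution (Step 4) is not shown in the listing above. A minimal sketch of that step, assuming a hypothetical mean of 5.3 and standard deviation of 1 for the heights, evaluates the normal density by hand and cross-checks it against the standard library:

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Hypothetical parameters for the height distribution
mean, sd = 5.3, 1.0

# Grid of heights from 1.00 to 9.99 in steps of 0.01
heights = [1 + 0.01 * i for i in range(900)]
pdf = [normal_pdf(h, mean, sd) for h in heights]

peak = max(pdf)
reference = NormalDist(mean, sd).pdf(mean)
print(round(peak, 4), round(reference, 4))
```

The curve peaks at the mean, where the density equals 1 / (sd * sqrt(2 * pi)).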
OUTPUT:
RESULT:
Thus the Python program to create a normal curve was executed and verified successfully.
Ex. No.: 5b Develop a Python Program for Correlation and scatter plots
Date:
AIM:
To write a python program for correlation with scatter plot
ALGORITHM:
PROGRAM:
import numpy as np
import matplotlib.pyplot as plt

# Data
x = np.random.randn(100)
y1 = 5 * x + 9            # perfect positive correlation with x
y2 = -5 * x               # perfect negative correlation with x
y3 = np.random.randn(100) # independent of x

# Plot
plt.figure(figsize=(10, 8), dpi=100)
plt.scatter(x, y1, label='y1', color='blue')
plt.scatter(x, y2, label='y2', color='red')
plt.scatter(x, y3, label='y3', color='green')
plt.title('Scatterplot and Correlations')
plt.legend()
plt.show()
OUTPUT:
RESULT:
Thus the Python program for correlation and scatter plots was executed and verified
successfully.
Ex. No.: 5c Develop a Python Program for Correlation coefficient
Date:
AIM:
To write a python program to compute correlation coefficient.
ALGORITHM:
PROGRAM:
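# Python Program to find correlation coefficient.
import math

# (function name, initial values and final formula are reconstructed from the
# standard Pearson formula; they are missing from the surviving listing)
def correlation_coefficient(X, Y, n):
    sum_X = sum_Y = sum_XY = 0
    squareSum_X = squareSum_Y = 0
    i = 0
    while i < n:
        # sum of elements of array X.
        sum_X = sum_X + X[i]
        # sum of elements of array Y.
        sum_Y = sum_Y + Y[i]
        # sum of X[i] * Y[i].
        sum_XY = sum_XY + X[i] * Y[i]
        # sum of squares of array elements.
        squareSum_X = squareSum_X + X[i] * X[i]
        squareSum_Y = squareSum_Y + Y[i] * Y[i]
        i = i + 1
    return (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))

# Driver function
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
print('{0:.6f}'.format(correlation_coefficient(X, Y, len(X))))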
OUTPUT :
0.953463
RESULT:
Thus the computation for correlation coefficient was executed and verified successfully.
Ex. No.: 6 Develop a Python Program for Regression
Date:
AIM:
To write a python program for Simple Linear Regression
ALGORITHM:
PROGRAM:
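import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # cross-deviation and deviation about x
    # (reconstructed from the standard least-squares formulas)
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # plotting the fitted line (reconstructed step)
    plt.plot(x, b[0] + b[1] * x, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))

    # plotting the regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()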
OUTPUT :
Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697
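The closed-form estimates b_1 = SS_xy / SS_xx and b_0 = m_y - b_1*m_x can be cross-checked against NumPy's least-squares polynomial fit. A minimal sketch using the same observations as main(); np.polyfit is a swapped-in technique here, not part of the listing above:

```python
import numpy as np

# Same observations as in main()
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# For degree 1, np.polyfit returns the coefficients highest power first:
# [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print("b_0 =", intercept)
print("b_1 =", slope)
```

For this data SS_xy = 96.5 and SS_xx = 82.5, so both approaches should agree on a slope of about 1.17 and an intercept of about 1.24.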
Graph:
RESULT:
Thus the computation for Regression was executed and verified successfully.
Ex. No.: 7 Develop a Python Program for z-test
Date:
AIM:
To write a python program for z-test
ALGORITHM:
PROGRAM:
from statsmodels.stats.weightstats import ztest

# alternative hypothesis: the mean is larger
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')

# the function outputs a p_value and z-score corresponding to that value;
# we compare the p-value with alpha: if it is greater than alpha we do not
# reject the null hypothesis, else we reject it.
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")
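The definitions of data, null_mean and alpha are not part of the surviving listing. A minimal sketch of a right-tailed one-sample z-test computed by hand, assuming a small hypothetical sample and using only the standard library in place of statsmodels, could look like this:

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample and hypothesis values
data = [112, 108, 115, 109, 111, 107, 113, 110, 114, 106]
null_mean = 100
alpha = 0.05

# z = (sample mean - hypothesised mean) / standard error of the mean
standard_error = stdev(data) / math.sqrt(len(data))
z_score = (mean(data) - null_mean) / standard_error

# Right-tailed p-value: probability of a z at least this large under H0
p_value = 1 - NormalDist().cdf(z_score)

decision = ("Reject Null Hypothesis" if p_value < alpha
            else "Fail to Reject Null Hypothesis")
print(z_score, p_value)
print(decision)
```

With only ten observations a t-test would normally be preferred; the small sample keeps the arithmetic easy to follow.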
OUTPUT :
RESULT:
Thus the computation for z-test was executed and verified successfully.
Ex. No.: 8 Develop a Python Program for t-test
Date:
AIM:
To write a python program for t-test.
ALGORITHM
PROGRAM:
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)

# Calculating the T-Statistics
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))

# Comparing with the critical T-Value
# Degrees of freedom
dof = 2 * N - 2

# p-value after comparison with the T-Statistics
pval = 1 - stats.t.cdf(tval, df=dof)

print("t = " + str(tval))
print("p = " + str(2 * pval))

## Cross Checking using the internal function from SciPy Package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))
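The sample data and the variance calculation are not part of the surviving listing. A minimal sketch of the same two-sample statistic for equal-sized groups, assuming two hypothetical samples and using only the standard library, could look like this:

```python
import math
from statistics import mean, variance

# Hypothetical samples of equal size
x = [5.1, 4.9, 5.6, 5.8, 6.0]
y = [4.1, 4.4, 3.9, 4.6, 4.0]
N = len(x)

# Sample variances (n - 1 in the denominator)
var_x = variance(x)
var_y = variance(y)

# Pooled standard deviation for equal group sizes
SD = math.sqrt((var_x + var_y) / 2)

# t statistic and degrees of freedom
tval = (mean(x) - mean(y)) / (SD * math.sqrt(2 / N))
dof = 2 * N - 2

print("t =", tval)
print("dof =", dof)
```

Converting tval to a p-value needs the t distribution, for example stats.t.cdf from SciPy as in the listing above.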
Output:
Standard Deviation = 0.7642398582227466
t = 4.87688162540348
p = 0.0001212767169695983
t = 4.876881625403479
p = 0.00012127671696957205
RESULT:
Thus the computation for t-test was executed and verified successfully.
Ex. No.: 9 Develop a Python Program for ANOVA
Date:
AIM:
To write a python program for ANOVA.
ALGORITHM
Step 1: Start the Program
Step 2: Install and load the dplyr package (the listing below is written in R)
Step 3: Set up Null Hypothesis and Alternate Hypothesis
Step 4: Calculate test statistics using aov function
Step 5: Calculate F-Critical Value
Step 6: Compare test statistics with F-Critical value
Step 7: Print the result
Step 8: Stop the process
PROGRAM:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Variance in mean within group and between group
boxplot(mtcars$disp~factor(mtcars$gear),
xlab = "gear", ylab = "disp")
# Step 1: Setup Null Hypothesis and Alternate Hypothesis
# H0 = mu = mu01 = mu02 (There is no difference
# between average displacement for different gear)
# H1 = Not all means are equal
# Step 2: Calculate test statistics using aov function
# For 0.05 Significant value, critical value = alpha = 0.05
# Step 4: Compare test statistics with F-Critical value
# and conclude test p <alpha, Reject Null Hypothesis
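The listing above uses R (dplyr, aov and the mtcars data set). For a Python counterpart, a minimal sketch of the one-way ANOVA F statistic computed by hand, assuming three small hypothetical groups in place of mtcars, could look like this:

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical groups (not the mtcars data)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

SciPy's scipy.stats.f_oneway(*groups) returns the same F statistic together with its p-value, which can then be compared with alpha.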
RESULT:
Thus the computation for ANOVA was executed and verified successfully.
Ex. No.: 10 Develop a Python Program for building and validating linear models
Date:
AIM:
To write a python program for building and validating linear models.
ALGORITHM:
PROGRAM:
‘filename’])
print(boston.DESCR)
#You will find these details in output:
Attribute Information (in order):
— CRIM per capita crime rate by town
— ZN proportion of residential land zoned for lots over 25,000 sq.ft.
— INDUS proportion of non-retail business acres per town
— CHAS Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
— NOX nitric oxides concentration (parts per 10 million)
— RM average number of rooms per dwelling
— AGE proportion of owner-occupied units built prior to 1940
— DIS weighted distances to five Boston employment centres
— RAD index of accessibility to radial highways
— TAX full-value property-tax rate per $10,000
df=pd.DataFrame(boston.data,columns=boston.feature_names)
df.head()
# print the columns present in the dataset
print(df.columns)
# print the top 5 rows in the dataset
print(df.head())
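The model-building and validation steps are not part of the surviving listing, and load_boston has been removed from scikit-learn since version 1.2. A minimal sketch of the build-and-validate cycle, assuming synthetic data and plain NumPy least squares in place of the Boston data set, could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3 + 2*x1 - x2 + small noise
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out the last 25% of the rows for validation
split = 150
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_intercept(A):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(A)), A])


# Build: ordinary least squares on the training rows
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

# Validate: R^2 on the held-out rows
y_pred = add_intercept(X_test) @ coef
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("coefficients:", coef)
print("test R^2:", r2)
```

Scoring the model on rows it was not fitted on is what makes this a validation rather than a measure of training fit.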
OUTPUT:
RESULT:
Thus the Python Program for building and validating linear models was executed and verified
successfully.
Ex. No.: 11 Develop a Python Program for building and validating logistic models
Date:
AIM:
To write a python program for building and validating logistic models.
ALGORITHM
Step 1: Start the Program
Step 2: Import statsmodels and pandas package
Step 3: Load a dataset
PROGRAM:
# Predicting on new data
# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col=0)

# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']

# performing predictions on the test dataset
yhat = log_reg.predict(Xtest)
prediction = list(map(round, yhat))

# comparing original and predicted values of y
print('Actual values', list(ytest.values))
print('Predictions :', prediction)
OUTPUT :
Optimization terminated successfully.
         Current function value: 0.352707
         Iterations 8
Actual values [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
Predictions : [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
OUTPUT :
Confusion Matrix :
[[6 0]
[2 2]]
Test accuracy = 0.8
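The confusion matrix and accuracy can be recomputed directly from the actual and predicted values printed above. A minimal sketch in plain Python (the variable names are illustrative; the two lists are copied from the output):

```python
# Values copied from the output above
actual = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# Rows are the actual class, columns are the predicted class
matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# Correct predictions sit on the diagonal
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)

print("Confusion Matrix :")
print(matrix)
print("Test accuracy =", accuracy)
```

The two off-diagonal cases are both admitted students predicted as not admitted, so the model never produced a false positive on this test set.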
RESULT:
Thus the Python Program for building and validating logistic models was executed and verified
successfully.
Ex. No.: 12 Develop a Python Program for Time Series Analysis
Date:
AIM:
To write a python program for Time Series Analysis.
ALGORITHM:
PROGRAM:
# Data Preprocessing
# This step includes removing columns we do not need, checking for missing values,
# aggregating sales by date, etc.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region',
        'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Quantity', 'Discount',
        'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture = furniture.sort_values('Order Date')
furniture.isnull().sum()
furniture = furniture.groupby('Order Date')['Sales'].sum().reset_index()

# resample() needs a datetime index, so index the frame by order date
furniture = furniture.set_index('Order Date')

# We will use the average daily sales value for each month instead, and we are using
# the start of each month as the timestamp.
y = furniture['Sales'].resample('MS').mean()
y['2017':]  # Have a quick peek at the 2017 furniture sales data.
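The monthly averaging step can be tried without the sales file. A minimal sketch, assuming a hypothetical daily series for January and February 2017 in place of the furniture data, shows what resampling with 'MS' (month start) returns:

```python
import pandas as pd

# Hypothetical daily sales: 0, 1, 2, ... for January and February 2017
index = pd.date_range("2017-01-01", "2017-02-28", freq="D")
sales = pd.Series(range(len(index)), index=index, dtype="float64")

# 'MS' groups the rows by calendar month and labels each group
# with the first day of that month
monthly_mean = sales.resample("MS").mean()
print(monthly_mean)
```

The result has one row per month, indexed by 2017-01-01 and 2017-02-01, each holding the mean of that month's daily values.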
OUTPUT:
RESULT:
Thus the Python program for Time Series Analysis was executed and verified successfully.