cs3362 Foundations of Data Science Lab Manual

Uploaded by

thilakraj.a0321

CS3362 Foundations of Data Science Lab Manual

Computer Science and Engineering (Anna University)


Downloaded by Jegatheeswari ic37721 ([email protected])
EX.NO.4. READING DATA FROM TEXT FILES, EXCEL AND THE WEB
DATE:

AIM:
To read data from text files, Excel and the web using the pandas package.

ALGORITHM:
STEP 1: Start the program
STEP 2: Read data from a CSV file using the pandas package.
STEP 3: Read data from an Excel file using the pandas package.
STEP 4: Read tables from an HTML page using the pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT

This notebook is the reference code for data input and output. pandas can read a variety of file
types using its pd.read_* functions. Let's take a look at the most common ones:

import numpy as np
import pandas as pd

CSV

CSV INPUT:
df = pd.read_csv('example')
df

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15


CSV OUTPUT:
df.to_csv('example',index=False)
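
A minimal sketch of the CSV round trip above, assuming the file holds the same four-column table. It uses in-memory io.StringIO buffers in place of a file on disk, so the sample text `csv_text` and the names `buffer` and `round_trip` are illustrative assumptions and not part of the original notebook:

```python
import io

import pandas as pd

# Sample CSV content (an assumption standing in for the 'example' file).
csv_text = "a,b,c,d\n0,1,2,3\n4,5,6,7\n8,9,10,11\n12,13,14,15\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# index=False keeps the row labels out of the written text,
# so reading the output back yields the same four columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(round_trip)
```

Without index=False the row labels would be written as an extra unnamed first column, and the table read back would have five columns instead of four.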

EXCEL

pandas can read and write Excel files. Keep in mind that this imports only the data, not formulas
or images; a workbook that contains images or macros may cause read_excel to fail.

EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

You may need to install html5lib, lxml, and BeautifulSoup4. In your terminal/command prompt
run:

pip install lxml
pip install html5lib==1.1
pip install BeautifulSoup4

Then restart Jupyter Notebook (or install the packages with conda install instead).

pandas can read tables from HTML pages.

For example:

HTML INPUT

The pandas read_html function reads the tables on a web page and returns a list of DataFrame objects:

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list'

# read every table on the page; the result is a list of DataFrames
df = pd.read_html(url)
df[0]

# keep only the tables whose text matches the given string
match = "Metcalf Bank"
df_list = pd.read_html(url, match=match)
df_list[0]

HTML OUTPUT:

RESULT:
The commands for reading data from a CSV file, an Excel file and an HTML page were
executed successfully.


EX NO 4(a). EXPLORING VARIOUS COMMANDS FOR DOING DESCRIPTIVE
ANALYTICS ON THE IRIS DATA SET.
DATE:

AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: Understand the idea behind descriptive statistics.
STEP 3: Load the packages we will need and also the `iris` dataset.
STEP 4: Call load_iris(), which returns an object containing the iris dataset, and store it in
`iris_obj`.
STEP 5: Compute basic statistics: count, mean, median, min, max.
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd

from pandas import DataFrame

from sklearn.datasets import load_iris

# sklearn.datasets includes common example datasets

# A function to load in the iris dataset
iris_obj = load_iris()

# Dataset preview
iris_obj.data

# Combine the feature matrix and the target vector into one DataFrame
iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names,
                 index=pd.Index([i for i in range(iris_obj.data.shape[0])])).join(
    DataFrame(iris_obj.target, columns=pd.Index(["species"]),
              index=pd.Index([i for i in range(iris_obj.target.shape[0])])))

iris # prints iris data

Commands

iris_obj.feature_names

iris.count()

iris.mean()

iris.median()
iris.var()

iris.std()

iris.max()

iris.min()

iris.describe()
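
The same descriptive statistics can also be computed per class. The sketch below assumes a scikit-learn version that supports `load_iris(as_frame=True)`; the names `iris_frame`, `sepal_length_mean` and `per_class` are illustrative and do not appear in the program above:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame holds the four
# feature columns plus a 'target' column with the class codes 0, 1, 2.
iris_frame = load_iris(as_frame=True).frame

# Overall summary of one feature.
sepal_length_mean = iris_frame["sepal length (cm)"].mean()

# Per-class summary: count, mean and standard deviation of each feature.
per_class = iris_frame.groupby("target").agg(["count", "mean", "std"])

print(round(sepal_length_mean, 3))
print(per_class["petal length (cm)"])
```

Grouping by the target column shows how far the class means differ from one another, which the overall describe() table cannot show.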

OUTPUT:

RESULT:
The commands for doing descriptive analytics on the Iris data set were
executed successfully.


EX.NO 5. USE THE DIABETES DATA SET FROM UCI AND PIMA
INDIANS DIABETES DATA SET FOR PERFORMING THE FOLLOWING:
DATE:

A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE,
STANDARD DEVIATION, SKEWNESS AND KURTOSIS.
AIM:
To explore various commands for doing univariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program.
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Find the mean, median, mode, variance, standard deviation, skewness and
kurtosis of the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T

# Frequency: count of each unique Outcome value
df1 = df['Outcome'].value_counts()
# displaying
print(df1)
# mean
df.mean()
# median
df.median()
# mode
df.mode()
# variance
df.var()
# standard deviation
df.std()
# kurtosis
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
# skewness along the index axis (one value per column), skipping the NA values
df.skew(axis = 0, skipna = True)
# skewness in each row
df.skew(axis = 1, skipna = True)
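
To see what these measures report, the following sketch applies them to two small made-up series whose results are easy to check by hand. The values and the names `symmetric`, `right_skewed` and `summary` are assumptions for illustration, not part of the diabetes data:

```python
import pandas as pd

# Made-up values, chosen so the results are easy to check by hand.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])

summary = pd.DataFrame({
    "mean": [symmetric.mean(), right_skewed.mean()],
    "median": [symmetric.median(), right_skewed.median()],
    "variance": [symmetric.var(), right_skewed.var()],    # sample variance (ddof=1)
    "std": [symmetric.std(), right_skewed.std()],
    "skewness": [symmetric.skew(), right_skewed.skew()],
    "kurtosis": [symmetric.kurt(), right_skewed.kurt()],  # excess kurtosis
}, index=["symmetric", "right_skewed"])

print(summary)
```

A symmetric sample has zero skewness, while a long right tail pulls the mean above the median and makes the skewness positive. pandas reports excess kurtosis, so values are measured relative to a normal distribution, which scores 0.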

#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc = np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)

preg = pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,
                     'percentage_proportion':preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)
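
A frequency table like the one above can also be built with `value_counts(normalize=True)`. This is a sketch on a made-up sample; the series `pregnancies` and the frame `freq` are illustrative names, not objects from the program:

```python
import pandas as pd

# Made-up sample standing in for df['Pregnancies'].
pregnancies = pd.Series([0, 1, 1, 2, 2, 2, 3, 0, 1, 2], name="Pregnancies")

counts = pregnancies.value_counts()                 # absolute frequency
shares = pregnancies.value_counts(normalize=True)   # relative frequency

# The two Series share the same index, so they align row by row.
freq = pd.DataFrame({
    "count": counts,
    "percentage": (shares * 100).round(1),
}).sort_index()

print(freq)
```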

sns.countplot(x='Outcome',data=df)

sns.distplot(df['Pregnancies'])

sns.boxplot(data=df['Pregnancies'])


OUTPUT:

RESULT:
Exploring various commands for doing univariate analytics on the UCI AND PIMA
INDIANS DIABETES was successfully executed.


EX.NO:5. B) BIVARIATE ANALYSIS: LINEAR AND LOGISTIC REGRESSION MODELING
DATE:
AIM:
To explore the linear and logistic regression models on the USA HOUSING and the UCI
AND PIMA INDIANS DIABETES data sets.
ALGORITHM:
STEP 1: Start the program.
STEP 2: Download a data set, such as the housing data set, from Kaggle.
STEP 3: Read data from the downloaded data set.
STEP 4: Fit the linear and logistic regression models to the given data set.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
BIVARIATE ANALYSIS GENERAL PROGRAM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
from matplotlib.ticker import FormatStrFormatter
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')

fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))

# row 0, left: count of each Pregnancies value
plot00=sns.countplot(x='Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count',fontdict={'fontsize':7})
plt.tight_layout()

# row 0, right: the same counts split by Outcome
plot01=sns.countplot(x='Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Month of Preg.',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count',fontdict={'fontsize':7})
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

# row 1, left: distribution of Pregnancies
plot10 = sns.distplot(df['Pregnancies'],ax=axes[1][0])
axes[1][0].set_title('Pregnancies Distribution',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][0].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plt.tight_layout()

# row 1, right: histograms for the two Outcome groups
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-Diab.')
plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1][1],label='Diab.')
axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7})
axes[1][1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend text
plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend title
plt.tight_layout()

# row 2, left: box plot of Pregnancies
plot20 = sns.boxplot(y=df['Pregnancies'],ax=axes[2][0],orient='v')
axes[2][0].set_title('Pregnancies',fontdict={'fontsize':8})
axes[2][0].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][0].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.tight_layout()

# row 2, right: box plot of Pregnancies for each Outcome group
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2][1])
axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[2][1].set_xlabel('Pregnancy',fontdict={'fontsize':7})
axes[2][1].set_ylabel('Five Point Summary',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
plt.tight_layout()
plt.show()

OUTPUT:

## Blood Pressure variable

fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['BloodPressure'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0][1].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10=sns.boxplot(y=df['BloodPressure'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('BP',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

plot11=sns.boxplot(x='Outcome',y='BloodPressure',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()
plt.show()

OUTPUT:


fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(y=df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()

OUTPUT:


LINEAR REGRESSION MODELLING ON HOUSING DATASET

# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)

sns.distplot(USAhousing['Price'])

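# numeric_only=True leaves out non-numeric columns (for example a text address column)
sns.heatmap(USAhousing.corr(numeric_only=True))

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]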
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)

coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)

sns.distplot((y_test-predictions),bins=50);

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
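
The same fit-predict-evaluate steps can be tried without the housing file. The sketch below uses synthetic data with known coefficients, which is an assumption made so the fitted values can be checked; it does not reproduce the housing results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): y = 3*x1 - 2*x2 + 5 + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)   # RMSE is in the same units as y

print(lm.intercept_, lm.coef_)
print(mae, mse, rmse)
```

Because the true coefficients are known here, the printed intercept and coefficients should land close to 5, 3 and -2, and the RMSE should be close to the noise level.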


OUTPUT:


LOGISTIC REGRESSION MODELLING ON PIMA DIABETES

# Data manipulation libraries

import numpy as np
import pandas as pd

###scikit Learn Modules needed for Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,OneHotEncoder,StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


#for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
import warnings
warnings.filterwarnings('ignore')

df=pd.read_csv('C:/Users/diabetes.csv')

df.head()

df.tail()

df.isnull().sum()

df.describe(include='all')

df.corr()

sns.heatmap(df.corr(),annot=True)
plt.show()

df.hist()
plt.show()

sns.countplot(x=df['Outcome'])

scaler=StandardScaler()
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]=scaler.fit_transform(df[['Pregnancies',
'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']])

df_new = df
# Train & Test split
x_train, x_test, y_train, y_test = train_test_split( df_new[['Pregnancies', 'Glucose',
'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']],
df_new['Outcome'],test_size=0.20,
random_state=21)

print('Shape of Training Xs:{}'.format(x_train.shape))

print('Shape of Test Xs:{}'.format(x_test.shape))
print('Shape of Training y:{}'.format(y_train.shape))
print('Shape of Test y:{}'.format(y_test.shape))

Shape of Training Xs:(614, 8)

Shape of Test Xs:(154, 8)
Shape of Training y:(614,)
Shape of Test y:(154,)

# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)

score=model.score(x_test,y_test);
print(score)

0.7337662337662337

#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
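
The program above scales the whole table before splitting it, so the test rows take part in computing the scaling statistics. One possible alternative, sketched here under the assumption that a scikit-learn pipeline is acceptable, fits the scaler on the training rows only. The data is synthetic (generated with `make_classification`) and matches only the shape of the diabetes table:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data (an assumption) with 8 features and 768 rows.
X, y = make_classification(n_samples=768, n_features=8, random_state=21)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=21)

# The pipeline fits the scaler on the training rows only and reuses
# those statistics when it transforms the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(x_train, y_train)

y_predicted = model.predict(x_test)
cnf_matrix = confusion_matrix(y_test, y_predicted)

# Rows are true classes, columns are predicted classes;
# accuracy is the diagonal total divided by the number of test rows.
accuracy = cnf_matrix.trace() / cnf_matrix.sum()
print(cnf_matrix)
print(accuracy)
```

The accuracy computed from the confusion matrix is the same number that `model.score(x_test, y_test)` returns.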


OUTPUT:

RESULT:
Exploring various commands for doing Bivariate analytics on the USA HOUSING Dataset
was successfully executed.


EX.NO:5.C) MULTIPLE REGRESSION ANALYSIS
DATE:
AIM:
To explore various commands for doing multivariate analytics on the UCI AND PIMA
INDIANS DIABETES data set.
ALGORITHM:
STEP 1: Start the program.
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Perform the multiple regression analysis.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Data manipulation libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()

USAhousing.columns
sns.pairplot(USAhousing)
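
The program above stops at the pair plot. As a hedged illustration of what multiple regression adds over a single-predictor model, the sketch below fits both on synthetic data; the data, the coefficients and the names `single` and `multiple` are assumptions, not results from the housing file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): the response depends on three predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Simple regression: one predictor only.
single = LinearRegression().fit(X[:, [0]], y)
r2_single = single.score(X[:, [0]], y)

# Multiple regression: all three predictors together.
multiple = LinearRegression().fit(X, y)
r2_multiple = multiple.score(X, y)

print(r2_single, r2_multiple)
print(multiple.coef_)
```

On the data used for fitting, R-squared cannot decrease when predictors are added, so a comparison on held-out rows is the fairer check of whether the extra predictors help.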


OUTPUT:


RESULT:

Thus the multiple regression analysis using the housing data set was executed successfully.


EX.NO:5.D) ALSO COMPARE THE RESULTS OF THE ABOVE ANALYSIS FOR
THE TWO DATA SETS.
DATE:

AIM:
To explore various commands for comparing the results of the above analysis for the
two data sets.
ALGORITHM:
STEP 1: Start the program.
STEP 2: Download the UCI AND PIMA INDIANS DIABETES data set from Kaggle.
STEP 3: Read data from the UCI AND PIMA INDIANS DIABETES data set.
STEP 4: Compare the two data sets using various commands.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()

#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))

plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green')
axes[0][0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0][0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='Non Diab.')
sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0][1],color='red',label='Diab')
axes[0][1].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0][1].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()

plot10=sns.boxplot(y=df['Glucose'],ax=axes[1][0],orient='v')
axes[1][0].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1][0].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1][0].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()

plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
axes[1][1].set_title(r'Numerical Summary (Outcome)',fontdict={'fontsize':8})
axes[1][1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.xticks(ticks=[0,1],labels=['Non-Diab.','Diab.'],fontsize=7)
axes[1][1].set_xlabel('Category',fontdict={'fontsize':7})
plt.tight_layout()

plt.show()

fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))

plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()

plot1=sns.boxplot(y=df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
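
The plots above drop rows where Glucose is 0 before drawing. The numerical effect of that choice on a group comparison can be sketched as follows, on made-up rows and under the assumption that a reading of 0 stands for a missing value; the frame `sample` and the other names are illustrative:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the Glucose and Outcome columns.
sample = pd.DataFrame({
    "Glucose": [0, 85, 90, 110, 0, 150, 160, 170],
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# A glucose reading of 0 is treated here as a missing value (an assumption),
# so it is replaced with NaN instead of being averaged in.
cleaned = sample.assign(Glucose=sample["Glucose"].replace(0, np.nan))

# mean() skips NaN by default, so the zeros no longer pull the averages down.
raw_means = sample.groupby("Outcome")["Glucose"].mean()
clean_means = cleaned.groupby("Outcome")["Glucose"].mean()

print(pd.DataFrame({"with_zeros": raw_means, "zeros_as_missing": clean_means}))
```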


OUTPUT:

RESULT:

Thus the comparison of the above analysis for the two data sets was executed successfully.


EX.NO:6. APPLY AND EXPLORE VARIOUS PLOTTING FUNCTIONS ON
UCI DATA SETS.
DATE:

AIM:
To apply and explore various plotting functions on UCI datasets.

ALGORITHM:

STEP 1: Install the seaborn package and import it.

STEP 2: Visualize normal curves, density or contour plots, correlation and scatter plots, and
histogram plots.
STEP 3: Do the 3D plotting using the plotly package.
STEP 4: Stop the program.
PROGRAM:

A. NORMAL CURVES

#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
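
The program above draws a line plot of the flights data. For a curve of the normal density itself, one possible sketch evaluates the density formula with NumPy and plots it with matplotlib; the parameter values `mu` and `sigma` and the grid size are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Parameters of the curve (illustrative values).
mu, sigma = 0.0, 1.0

# Evaluate the normal density on a grid covering mu +/- 4 sigma.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 801)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Riemann-sum check: the area under the curve should be close to 1.
area = pdf.sum() * (x[1] - x[0])

fig, ax = plt.subplots()
ax.plot(x, pdf)
ax.set_xlabel("x")
ax.set_ylabel("density")
ax.set_title("Normal curve")
```

The curve peaks at the mean, and almost all of the area lies within four standard deviations of it.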

OUTPUT:


B. DENSITY AND CONTOUR PLOTS

iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)

OUTPUT:

C. CORRELATION AND SCATTER PLOTS

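#correlation visualized using heatmap function

df = sns.load_dataset("titanic")
# heatmap needs a numeric matrix, so plot the correlation matrix of the numeric columns
ax = sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")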

#scatter plots of categorical variable

df = sns.load_dataset("titanic")
sns.catplot(data=df, x="age", y="class")

OUTPUT:


D. HISTOGRAMS

#histogram of dataframe

df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")

OUTPUT:

E. THREE DIMENSIONAL PLOTTING

#3d plotting using the plotly package

import plotly.express as px
df = sns.load_dataset("iris")

# column and species names follow the seaborn iris dataset
px.scatter_3d(df, x="petal_length", y="petal_width", z="sepal_width",
              size="sepal_length",
              color="species", color_discrete_map = {"setosa": "blue", "versicolor": "violet",
              "virginica":"pink"})

OUTPUT:


RESULT:

Thus the various visual plots were explored and executed successfully.


EX.NO:7. VISUALIZING GEOGRAPHIC DATA WITH BASEMAP
DATE:

AIM:

To visualize geographic data with Basemap using Google Colab.

ALGORITHM:

STEP 1: Install the basemap package.

Use Google Colab; in the Anaconda prompt the conda version may need to be changed, which may
affect the compatibility of other packages. Install the package with:
pip install basemap
(or)
conda install -c https://conda.anaconda.org/anaconda basemap

STEP 2: Explore the various projection options, for example ortho and lcc.

STEP 3: Mark a location using its longitude and latitude.

PROGRAM:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

OUTPUT:


fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting

x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

OUTPUT:

from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)

    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

    # keys contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)

    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='w')

fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90,
            llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)

OUTPUT:

fig = plt.figure(figsize=(8, 8))

m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)

OUTPUT:

RESULT:

Thus the exploration of geographic data with Basemap was executed successfully.
2 pages
3.2 Light
No ratings yet
3.2 Light
48 pages
Preparations of Tetraamminecopper II
No ratings yet
Preparations of Tetraamminecopper II
13 pages
Ahmed Mohamed CV
No ratings yet
Ahmed Mohamed CV
3 pages

cs3362 Foundations of Data Science Lab Manual

Uploaded by

cs3362 Foundations of Data Science Lab Manual

Uploaded by

CS3362 Foundations OF DATA Science LAB Manual

Computer Science and Engineering (Anna University)

Studocu is not sponsored or endorsed by any college or university

Downloaded by Jegatheeswari ic37721 ([email protected])

pip install lxml

Then restart Jupyter Notebook. (Alternatively, use conda install.)

pandas can read tables directly from an HTML page.

match = "Metcalf Bank"

df_list = pd.read_html(url, match=match)
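To see how `match` filters tables, the following is a minimal sketch that parses a small inline HTML snippet instead of a live URL. The HTML content, the second table, and the column names are invented for illustration. It assumes lxml is installed as described above.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the first mentions "Metcalf Bank".
html = """
<table>
  <thead><tr><th>Bank Name</th><th>City</th></tr></thead>
  <tbody>
    <tr><td>Metcalf Bank</td><td>City A</td></tr>
    <tr><td>Example Bank</td><td>City B</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Item</th><th>Count</th></tr></thead>
  <tbody><tr><td>Branches</td><td>3</td></tr></tbody>
</table>
"""

match = "Metcalf Bank"

# read_html returns a list of DataFrames, one for each table
# whose text matches the given string or regular expression.
df_list = pd.read_html(StringIO(html), match=match)
df = df_list[0]

print(len(df_list))
print(df)
```

Because `read_html` always returns a list, the table of interest has to be picked out by index even when only one table matches.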


from pandas import DataFrame

from sklearn.datasets import load_iris

# sklearn.datasets includes common example datasets

# A function to load in the iris dataset

iris = DataFrame(iris_obj.data, columns=iris_obj.feature_names,index=pd.Index([i for i in

iris # prints iris data
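The fragments above can be assembled into a runnable form along the following lines. This is a sketch, not the manual's exact listing: the name `iris_obj` is taken from the fragment, while the `species` column and the use of the default integer index are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns a Bunch object with .data, .target,
# .feature_names and .target_names attributes.
iris_obj = load_iris()

# Build a DataFrame of the four measurements (default 0..149 index).
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)

# Assumed extra column: map the integer targets to species names.
iris["species"] = pd.Categorical.from_codes(iris_obj.target, iris_obj.target_names)

print(iris.head())
```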


A) UNIVARIATE ANALYSIS: FREQUENCY, MEAN, MEDIAN, MODE, VARIANCE,

# Frequency: finding the unique count

# skip the na values
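The univariate measures named in the heading can be sketched with pandas as follows. The values are illustrative only and are not taken from the lab dataset. The column name `Pregnancies` matches the one used later in the manual.

```python
import pandas as pd

# Illustrative values only; the None entry stands in for a missing reading.
df = pd.DataFrame({"Pregnancies": [1, 2, 2, 3, None, 6]})
col = df["Pregnancies"]

frequency = col.value_counts()          # count of each distinct value (NaN excluded)
unique_count = col.nunique()            # number of distinct non-missing values
mean_value = col.mean(skipna=True)      # skip the NA values
median_value = col.median(skipna=True)
mode_value = col.mode().iloc[0]         # mode() returns a Series; take the first entry
variance = col.var(skipna=True)         # sample variance (ddof=1)

print(frequency)
print(unique_count, mean_value, median_value, mode_value, variance)
```

Note that `var` uses the sample variance (dividing by n - 1) by default; pass `ddof=0` for the population variance.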


fig,axes = plt.subplots(nrows=3,ncols=2,dpi=120,figsize = (8,6))

plot10 = sns.distplot(df['Pregnancies'], ax=axes[1][0])
axes[1]

plot20 = sns.boxplot(df['Pregnancies'], ax=axes[2][0], orient='v')
axes[2]


## Blood Pressure variable

fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))


# Data manipulation libraries

from sklearn import metrics


# Data manipulation libraries

# scikit-learn modules needed for Logistic Regression


print('Shape of Training Xs:{}'.format(x_train.shape))

Shape of Training Xs:(614, 8)
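A training shape of (614, 8) is what an 80/20 split of 768 rows with 8 features would produce. The split ratio and row count are not shown in this part of the manual, so the sketch below assumes them and uses synthetic data in place of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 768 rows, 8 features (assumed sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)   # synthetic binary labels

# Hold out 20% of the rows for testing (assumed ratio).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print('Shape of Training Xs:{}'.format(x_train.shape))
print('Shape of Test Xs:{}'.format(x_test.shape))
```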


fig,axes = plt.subplots(nrows=1,ncols=2,dpi=120,figsize = (8,4))


STEP 1: Install seaborn package and import the package.


C. CORRELATION AND SCATTER PLOTS

#correlation visualized using heatmap function

#scatter plots of categorical variable
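Before drawing a heatmap, the correlation matrix itself has to be computed. A minimal sketch with pandas is given below; the column names `x`, `y`, `z` and their values are made up so that the expected correlations are easy to check by eye.

```python
import pandas as pd

# Toy data (hypothetical): y rises with x, z falls with x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Passing `corr` to seaborn's `heatmap` function, for example `sns.heatmap(corr, annot=True)`, would then render the matrix as a colour-coded grid.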


E. THREE DIMENSIONAL PLOTTING

# 3D plotting using the plotly package

px.scatter_3d(df, x="PetalLengthCm", y="PetalWidthCm", z="SepalWidthCm",


Thus, the various exploratory visual plots were executed successfully.


To visualize geographic data with Basemap using Google Colab.

STEP 1: Install the basemap package

Install the below package:

STEP 2: Explore the various projection options, for example ortho and lcc.


# Map (long, lat) to (x, y) for plotting

from itertools import chain

def draw_map(m, scale=0.2):

# lats and longs are returned as a dictionary

# keys contain the plt.Line2D instances

# cycle through these lines and set the desired style

fig = plt.figure(figsize=(8, 6), edgecolor='w')

fig = plt.figure(figsize=(8, 8))
