cs3362 Foundations of Data Science Lab Manual
cs3362 Foundations of Data Science Lab Manual
Aim:
To Reading data from text files, Excel and the web using pandas package.
ALGORITHM:
STEP 1: Start the program
STEP 2: To read data from csv file using pandas package.
STEP 3: To read data from excel file using pandas package.
STEP 4: To read data from html file using pandas package.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
DATA INPUT AND OUTPUT
This notebook is the reference code for getting input and output, pandas can read a variety of file
types using its pd.read_ methods. Let’s take a look at the most common data types:
import numpy as np
import pandas as pd
CSV
CSV INPUT:
df = pd.read_csv('example')
df
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
EXCEL
Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or
images, having images or macros may cause this read_excel method to crash.
EXCEL INPUT :
pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1')
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
EXCEL OUTPUT :
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
HTML
You may need to install htmllib5, lxml, and BeautifulSoup4. In your terminal/command prompt
run:
For example:
HTML INPUT
Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:
Downloaded by Jegatheeswari ic37721 ([email protected])
url = https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list
df = pd.read_html(url)
df[0]
df_list[0]
HTML OUTPUT:
RESULT:
Exploring commands for read data from csv file, excel file and html are successfully
executed.
AIM:
To explore various commands for doing descriptive analytics on the Iris data set.
ALGORITHM:
STEP 1: Start the program
STEP 2: To understand idea behind Descriptive Statistics.
STEP 3: Load the packages we will need and also the `iris` dataset.
STEP 4: load_iris() loads in an object containing the iris dataset, which I stored in
`iris_obj`.
STEP 5: Basic statistics: count, mean, median, min, max
STEP 6: Display the output.
STEP 7: Stop the program.
PROGRAM:
import pandas as pd
iris_obj = load_iris()
# Dataset preview
iris_obj.data
Commands
iris_obj.feature_names
iris.count()
iris.mean()
iris.median()
Downloaded by Jegatheeswari ic37721 ([email protected])
iris.var()
iris.std()
iris.max()
iris.min()
iris.describe()
OUTPUT:
RESULT:
Exploring various commands for doing descriptive analytics on the Iris data set
successfully executed.
df = pd.read_csv('C:/Users/kirub/Documents/Learning/Untitled Folder/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
df.dtypes['Outcome']
df.info()
df.describe().T
# displaying
df1 print(df1)
#mean
df.mean()
#median
df.median()
Downloaded by Jegatheeswari ic37721 ([email protected])
#mode
df.mode()
#Variance
df.var()
#standard deviation
df.std()
#
#kurtosis
df.kurtosis(axis=0,skipna=True)
df['Outcome'].kurtosis(axis=0,skipna=True)
#skewness
# skewness along the index axis
df.skew(axis = 0, skipna = True)
#Pregnancy variable
preg_proportion = np.array(df['Pregnancies'].value_counts())
preg_month = np.array(df['Pregnancies'].value_counts().index)
preg_proportion_perc =
np.array(np.round(preg_proportion/sum(preg_proportion),3)*100,dtype=int)
preg =
pd.DataFrame({'month':preg_month,'count_of_preg_prop':preg_proportion,'percentage_pro
portion':preg_proportion_perc})
preg.set_index(['month'],inplace=True)
preg.head(10)
sns.countplot(data=df['Outcome'])
sns.distplot(df['Pregnancies'])
sns.boxplot(data=df['Pregnancies'])
RESULT:
Exploring various commands for doing univariate analytics on the UCI AND PIMA
INDIANS DIABETES was successfully executed.
df = pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.shape
df.dtypes
df['Outcome']=df['Outcome'].astype('bool')
plot00=sns.countplot('Pregnancies',data=df,ax=axes[0][0],color='green')
axes[0][0].set_title('Count',fontdict={'fontsize':8}) axes[0]
[0].set_xlabel('Month of Preg.',fontdict={'fontsize':7}) axes[0]
[0].set_ylabel('Count',fontdict={'fontsize':7})
Downloaded by Jegatheeswari ic37721 ([email protected])
plt.tight_layout()
plot01=sns.countplot('Pregnancies',data=df,hue='Outcome',ax=axes[0][1])
axes[0][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8}) axes[0]
[1].set_xlabel('Month of Preg.',fontdict={'fontsize':7}) axes[0]
[1].set_ylabel('Count',fontdict={'fontsize':7}) plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot11 = df[df['Outcome']==False]['Pregnancies'].plot.hist(ax=axes[1][1],label='Non-
Diab.') plot11_2=df[df['Outcome']==True]['Pregnancies'].plot.hist(ax=axes[1]
[1],label='Diab.') axes[1][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
axes[1][1].set_xlabel('Pregnancy Class',fontdict={'fontsize':7}) axes[1]
[1].set_ylabel('Freq/Dist',fontdict={'fontsize':7})
plot11.axes.legend(loc=1)
plt.setp(axes[1][1].get_legend().get_texts(), fontsize='6') # for legend
text plt.setp(axes[1][1].get_legend().get_title(), fontsize='6') # for legend
title plt.tight_layout()
plot21 = sns.boxplot(x='Outcome',y='Pregnancies',data=df,ax=axes[2]
[1]) axes[2][1].set_title('Diab. VS Non-Diab.',fontdict={'fontsize':8})
OUTPUT:
plot01=sns.distplot(df[df['Outcome']==False]['BloodPressure'],ax=axes[0][1],color='green',
label='Non Diab.') sns.distplot(df[df.Outcome==True]['BloodPressure'],ax=axes[0]
[1],color='red',label='Diab')
OUTPUT:
plot0=sns.distplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of BP',fontdict={'fontsize':8})
axes[0].set_xlabel('BP Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(df[df['BloodPressure']!=0]['BloodPressure'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('BloodPressure',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(BP)',fontdict={'fontsize':7})
plt.tight_layout()
OUTPUT:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns
sns.pairplot(USAhousing)
sns.distplot(USAhousing['Price'])
sns.heatmap(USAhousing.corr())
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of
Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)
# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df
predictions = lm.predict(X_test)
plt.scatter(y_test,predictions)
sns.distplot((y_test-predictions),bins=50);
df=pd.read_csv('C:/Users/diabetes.csv')
df.head()
df.tail()
df.isnull().sum()
df.describe(include='all')
df.corr()
sns.heatmap(df.corr(),annot=True)
plt.show()
df.hist()
plt.show()
sns.countplot(x=df['Outcome'])
scaler=StandardScaler()
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']]=scaler.fit_transform(df[['Pregnancies',
'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']])
df_new = df
Downloaded by Jegatheeswari ic37721 ([email protected])
# Train & Test split
x_train, x_test, y_train, y_test = train_test_split( df_new[['Pregnancies', 'Glucose',
'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age']],
df_new['Outcome'],test_size=0.20,
random_state=21)
# Build Model
model = LogisticRegression()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)
score=model.score(x_test,y_test);
print(score)
0.7337662337662337
#Confusion Matrix
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_predicted)
np.set_printoptions(precision=2)
cnf_matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()
USAhousing.info()
USAhousing.describe()
USAhousing.columns
sns.pairplot(USAhousing)
Thus the Multi regression analysis using housing data sets are executed successfully.
AIM:
To explore various commands for compare the results of the above analysis for the date:
two data sets.
ALGORITHM:
STEP 1: Start the program
STEP 2: To download the UCI AND PIMA INDIANS DIABETES data set using Kaggle.
STEP 3: To read data from UCI AND PIMA INDIANS DIABETES data set.
STEP 4: To find the comparison between the two different dataset using various command.
STEP 5: Display the output.
STEP 6: Stop the program.
PROGRAM:
# Glucose Variable
df.Glucose.describe()
#sns.set_style('darkgrid')
fig,axes = plt.subplots(nrows=2,ncols=2,dpi=120,figsize = (8,6))
plot00=sns.distplot(df['Glucose'],ax=axes[0][0],color='green') axes[0]
[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f')) axes[0]
[0].set_title('Distribution of Glucose',fontdict={'fontsize':8}) axes[0]
[0].set_xlabel('Glucose Class',fontdict={'fontsize':7}) axes[0]
[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7}) plt.tight_layout()
plot01=sns.distplot(df[df['Outcome']==False]['Glucose'],ax=axes[0][1],color='green',label='
Non Diab.') sns.distplot(df[df.Outcome==True]['Glucose'],ax=axes[0]
[1],color='red',label='Diab') axes[0][1].set_title('Distribution of
Glucose',fontdict={'fontsize':8}) axes[0][1].set_xlabel('Glucose
Class',fontdict={'fontsize':7}) axes[0][1].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
axes[0][1].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plot01.axes.legend(loc=1)
plt.setp(axes[0][1].get_legend().get_texts(), fontsize='6')
plt.setp(axes[0][1].get_legend().get_title(), fontsize='6')
plt.tight_layout()
plot10=sns.boxplot(df['Glucose'],ax=axes[1][0],orient='v') axes[1]
[0].set_title('Numerical Summary',fontdict={'fontsize':8}) axes[1]
[0].set_xlabel('Glucose',fontdict={'fontsize':7}) axes[1][0].set_ylabel(r'Five
Point Summary(Glucose)',fontdict={'fontsize':7}) plt.tight_layout()
plot11=sns.boxplot(x='Outcome',y='Glucose',data=df,ax=axes[1][1])
plt.show()
plot0=sns.distplot(df[df['Glucose']!=0]['Glucose'],ax=axes[0],color='green')
axes[0].yaxis.set_major_formatter(FormatStrFormatter('%.3f'))
axes[0].set_title('Distribution of Glucose',fontdict={'fontsize':8})
axes[0].set_xlabel('Glucose Class',fontdict={'fontsize':7})
axes[0].set_ylabel('Count/Dist.',fontdict={'fontsize':7})
plt.tight_layout()
plot1=sns.boxplot(df[df['Glucose']!=0]['Glucose'],ax=axes[1],orient='v')
axes[1].set_title('Numerical Summary',fontdict={'fontsize':8})
axes[1].set_xlabel('Glucose',fontdict={'fontsize':7})
axes[1].set_ylabel(r'Five Point Summary(Glucose)',fontdict={'fontsize':7})
plt.tight_layout()
RESULT:
Thus the comparison of the above analysis for the two datasets are executed successfully.
AIM:
To apply and explore various plotting functions on UCI datasets.
ALGORITHM:
A. NORMAL CURVES
#seaborn package
import seaborn as sns
flights = sns.load_dataset("flights")
flights.head()
may_flights = flights.query("month == 'May'")
sns.lineplot(data=may_flights, x="year", y="passengers")
OUTPUT:
iris = sns.load_dataset("iris")
sns.kdeplot(data=iris)
OUTPUT:
OUTPUT:
#histogram of datafra,e
df = sns.load_dataset("titanic")
sns.histplot(data=df, x="age")
OUTPUT:
OUTPUT:
AIM:
ALGORITHM:
PROGRAM:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
OUTPUT:
OUTPUT:
OUTPUT:
OUTPUT:
RESULT:
Thus the Exploring Geographic Data with Basemap was successfully executed.
Downloaded by Jegatheeswari ic37721 ([email protected])