Quality Prediction Checkpoint

In [260]:

# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

In [215]:

# Loading the dataset


df = pd.read_csv('QualityPrediction.csv')
df

Out[215]:

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0               7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
1               7.8             0.880         0.00             2.6      0.098                 25.0                  67.0  0.99680  3.20       0.68      9.8        5
2               7.8             0.760         0.04             2.3      0.092                 15.0                  54.0  0.99700  3.26       0.65      9.8        5
3              11.2             0.280         0.56             1.9      0.075                 17.0                  60.0  0.99800  3.16       0.58      9.8        6
4               7.4             0.700         0.00             1.9      0.076                 11.0                  34.0  0.99780  3.51       0.56      9.4        5
...             ...               ...          ...             ...        ...                  ...                   ...      ...   ...        ...      ...      ...
1594            6.2             0.600         0.08             2.0      0.090                 32.0                  44.0  0.99490  3.45       0.58     10.5        5
1595            5.9             0.550         0.10             2.2      0.062                 39.0                  51.0  0.99512  3.52       0.76     11.2        6
1596            6.3             0.510         0.13             2.3      0.076                 29.0                  40.0  0.99574  3.42       0.75     11.0        6
1597            5.9             0.645         0.12             2.0      0.075                 32.0                  44.0  0.99547  3.57       0.71     10.2        5
1598            6.0             0.310         0.47             3.6      0.067                 18.0                  42.0  0.99549  3.39       0.66     11.0        6

[1599 rows x 12 columns]

In [216]:

# Checking for null values


df.isnull().sum()

Out[216]:

fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
In [217]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

In [218]:

df.describe()

Out[218]:

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides  free sulfur dioxide  total sulfur dioxide      density           pH    sulphates  ...
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000          1599.000000           1599.000000  1599.000000  1599.000000  1599.000000  ...
mean        8.319637          0.527821     0.270976        2.538806     0.087467            15.874922             46.467792     0.996747     3.311113     0.658149  ...
std         1.741096          0.179060     0.194801        1.409928     0.047065            10.460157             32.895324     0.001887     0.154386     0.169507  ...
min         4.600000          0.120000     0.000000        0.900000     0.012000             1.000000              6.000000     0.990070     2.740000     0.330000  ...
25%         7.100000          0.390000     0.090000        1.900000     0.070000             7.000000             22.000000     0.995600     3.210000     0.550000  ...
50%         7.900000          0.520000     0.260000        2.200000     0.079000            14.000000             38.000000     0.996750     3.310000     0.620000  ...
75%         9.200000          0.640000     0.420000        2.600000     0.090000            21.000000             62.000000     0.997835     3.400000     0.730000  ...
max        15.900000          1.580000     1.000000       15.500000     0.611000            72.000000            289.000000     1.003690     4.010000     2.000000  ...

In [219]:

df.head(5)

Out[219]:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5

Data Preprocessing
In [220]:

df['quality'].value_counts()

Out[220]:

5 681
6 638
7 199
4 53
8 18
3 10
Name: quality, dtype: int64
In [221]:

sns.catplot(x='quality', data=df, kind='count')

Out[221]:

<seaborn.axisgrid.FacetGrid at 0x152b9301c10>

In [222]:

plot=plt.figure(figsize=(5,5))
sns.barplot(x='quality',y='volatile acidity',data=df)

Out[222]:

<AxesSubplot:xlabel='quality', ylabel='volatile acidity'>


In [223]:

plot=plt.figure(figsize=(5,5))
sns.barplot(x='quality',y='citric acid',data=df)

Out[223]:

<AxesSubplot:xlabel='quality', ylabel='citric acid'>

In [224]:

plt.bar(df['quality'], df['alcohol'])
plt.xlabel('quality')
plt.ylabel('alcohol')
plt.show()
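plt.bar on the raw columns draws one bar per row at each quality value, so later rows simply overdraw earlier ones. A minimal sketch of the aggregated view this plot is presumably after, mean alcohol per quality level:

# mean alcohol content per quality level
df.groupby('quality')['alcohol'].mean().plot(kind='bar')
plt.xlabel('quality')
plt.ylabel('mean alcohol')
plt.show()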

Exploratory Data Analysis


In [225]:

# binarise the target: 1 = good quality (score >= 7), 0 = otherwise
df['quality'] = df['quality'].apply(lambda x: 1 if x >= 7 else 0)


df.rename(columns={'quality': 'good quality'}, inplace=True)
df.head()

Out[225]:

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  good quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4             0
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8             0
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8             0
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8             0
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4             0

In [226]:

plt.figure(figsize=(5,5))
sns.countplot(x='good quality', data=df)
plt.xlabel('good quality')
plt.ylabel('Count')
plt.title('Count of Good vs Bad Quality Wines')
plt.show()
In [227]:

plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True)
plt.show()
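To read the heatmap as a feature ranking, the same correlation matrix can be reduced to a single sorted column against the target; a minimal sketch (corr_with_target is an illustrative name):

# correlation of each feature with the binary target, strongest first
corr_with_target = df.corr()['good quality'].drop('good quality').sort_values(ascending=False)
print(corr_with_target)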

In [228]:
fig, ax = plt.subplots(2,4,figsize=(20,20))
sns.scatterplot(x = 'fixed acidity', y = 'citric acid', hue = 'good quality', data = df, ax=ax[0,0])
sns.scatterplot(x = 'volatile acidity', y = 'citric acid', hue = 'good quality', data = df, ax=ax[0,1])
sns.scatterplot(x = 'free sulfur dioxide', y = 'total sulfur dioxide', hue = 'good quality', data = df, ax=ax[0,2])
sns.scatterplot(x = 'fixed acidity', y = 'density', hue = 'good quality', data = df, ax=ax[0,3])
sns.scatterplot(x = 'fixed acidity', y = 'pH', hue = 'good quality', data = df, ax=ax[1,0])
sns.scatterplot(x = 'citric acid', y = 'pH', hue = 'good quality', data = df, ax=ax[1,1])
sns.scatterplot(x = 'chlorides', y = 'sulphates', hue = 'good quality', data = df, ax=ax[1,2])
sns.scatterplot(x = 'residual sugar', y = 'alcohol', hue = 'good quality', data = df, ax=ax[1,3])
Out[228]:

<AxesSubplot:xlabel='residual sugar', ylabel='alcohol'>

Train Test Split


In [229]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('good quality', axis=1), df['good quality'], test_size=0.3, random_state=42)
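Only 217 of the 1599 wines (about 14%) fall in the good-quality class, so a purely random split can give the test set a slightly different class ratio. A minimal sketch of a stratified variant of the same split (an optional alternative, not what the cells below use):

# stratified split: preserves the good/not-good ratio in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('good quality', axis=1), df['good quality'],
    test_size=0.3, random_state=42, stratify=df['good quality'])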

In [230]:

X_train.head()

Out[230]:

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol
925             8.6              0.22         0.36             1.9      0.064                 53.0                  77.0  0.99604  3.47       0.87     11.0
363            12.5              0.46         0.63             2.0      0.071                  6.0                  15.0  0.99880  2.99       0.87     10.2
906             7.2              0.54         0.27             2.6      0.084                 12.0                  78.0  0.99640  3.39       0.71     11.0
426             6.4              0.67         0.08             2.1      0.045                 19.0                  48.0  0.99490  3.49       0.49     11.4
1251            7.5              0.58         0.14             2.2      0.077                 27.0                  60.0  0.99630  3.28       0.59      9.8
In [231]:

X_test.head()

Out[231]:

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol
803             7.7              0.56         0.08            2.50      0.114                 14.0                  46.0   0.9971  3.24       0.66      9.6
124             7.8              0.50         0.17            1.60      0.082                 21.0                 102.0   0.9960  3.39       0.48      9.5
350            10.7              0.67         0.22            2.70      0.107                 17.0                  34.0   1.0004  3.28       0.98      9.9
682             8.5              0.46         0.31            2.25      0.078                 32.0                  58.0   0.9980  3.33       0.54      9.8
1326            6.7              0.46         0.24            1.70      0.077                 18.0                  34.0   0.9948  3.39       0.60     10.6

Model Training
Feature Scaling
In [232]:
# standardise features to zero mean and unit variance (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [233]:

X_train_scaled

Out[233]:

array([[ 1.69536131e-01, -1.72107140e+00,  4.59303345e-01, ...,
         1.01180685e+00,  1.22661179e+00,  5.50057013e-01],
[ 2.44606730e+00, -4.01957443e-01, 1.84105501e+00, ...,
-2.10687612e+00, 1.22661179e+00, -2.05174641e-01],
[-6.47680186e-01, 3.77472102e-02, -1.28054303e-03, ...,
4.92026353e-01, 2.97270776e-01, 5.50057013e-01],
...,
[-6.47680186e-01, 4.77451864e-01, -1.07597628e+00, ...,
1.27169710e+00, -6.90154049e-01, -8.66002338e-01],
[-2.39072027e-01, -1.83099757e+00, 4.08127357e-01, ...,
3.72184202e-02, 8.20025095e-01, 1.39969262e+00],
[-1.46489650e+00, -1.33632983e+00, -5.24565306e-02, ...,
4.92026353e-01, -6.90154049e-01, 2.91015593e+00]])

In [234]:

X_test_scaled

Out[234]:

array([[-0.35581722,  0.14767337, -0.97362431, ..., -0.48256207,
         0.00685171, -0.77159838],
[-0.29744462, -0.18210512, -0.51304042, ..., 0.49202635,
-1.03865693, -0.86600234],
[ 1.39536061, 0.75226727, -0.25716048, ..., -0.22267183,
1.86553373, -0.48838651],
...,
[-0.93954316, -0.40195744, -0.15480851, ..., 0.49202635,
-0.34165117, 0.17244119],
[ 1.27861542, -0.12714203, 1.892231 , ..., -1.4571505 ,
0.00685171, 1.30528867],
[ 0.92837985, -0.18210512, -0.15480851, ..., 0.16716354,
-0.80632167, -0.39398255]])
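The scaled arrays above are computed, but the model cells below fit on the unscaled X_train / X_test. A minimal sketch (illustrative names, not part of the original notebook) of wrapping the scaler and a classifier in a Pipeline so the standardisation is actually applied, using only training statistics:

from sklearn.pipeline import Pipeline

# scaler + classifier chained: the test data is transformed with the training statistics automatically
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))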

Logistic Regression
In [235]:
lr = LogisticRegression()
lr

Out[235]:

LogisticRegression()
In [261]:

# training the model


lr.fit(X_train, y_train)
lr.score(X_train, y_train)

Out[261]:

0.8838248436103664

In [237]:

# testing the model


lr_pred = lr.predict(X_test)
accuracy_score(y_test, lr_pred)

Out[237]:

0.85625

Support Vector Machine (SVM)


In [238]:

clf = svm.SVC(kernel='rbf')
clf

Out[238]:

SVC()

In [239]:

# training the model


clf.fit(X_train, y_train)
clf.score(X_train, y_train)

Out[239]:

0.8668453976764968

In [240]:

# testing the model


sv_pred = clf.predict(X_test)
accuracy_score(y_test, sv_pred)

Out[240]:

0.8625

Decision Tree
In [241]:

dtree = DecisionTreeClassifier()
dtree

Out[241]:

DecisionTreeClassifier()

In [242]:

# training the model


dtree.fit(X_train, y_train)
dtree.score(X_train, y_train)

Out[242]:

1.0

In [243]:

# testing the model


tr_pred = dtree.predict(X_test)
accuracy_score(y_test, tr_pred)

Out[243]:

0.8604166666666667
K-Nearest Neighbors (KNN)
In [244]:

from sklearn.neighbors import KNeighborsClassifier


knn = KNeighborsClassifier(n_neighbors=5)
knn

Out[244]:

KNeighborsClassifier()

In [262]:

# training the model


knn.fit(X_train, y_train)
knn.score(X_train, y_train)

Out[262]:

0.9079535299374442

In [263]:

# testing the model


kn_pred = knn.predict(X_test)
accuracy_score(y_test, kn_pred)

Out[263]:

0.8583333333333333

Model Evaluation
Logistic Regression
In [247]:

# logistic regression model evaluation


sns.heatmap(confusion_matrix(y_test, lr_pred), annot=True, cmap='Blues')
plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()
In [248]:

print('Logistic Regression Model Accuracy: ', accuracy_score(y_test, lr_pred))


print('Logistic Regression Model f1 score: ', metrics.f1_score(y_test, lr_pred))
print('Logistic Regression Model MAE: ', metrics.mean_absolute_error(y_test, lr_pred))
print('Logistic Regression Model RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, lr_pred)))

Logistic Regression Model Accuracy: 0.85625


Logistic Regression Model f1 score: 0.28865979381443296
Logistic Regression Model MAE: 0.14375
Logistic Regression Model RMSE: 0.3791437722025775
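The accuracy is high, but the f1 score of about 0.29 shows the model rarely flags good wines, a side effect of the class imbalance. A minimal sketch of one common adjustment, re-weighting the classes (lr_bal is an illustrative name):

# up-weight the minority (good quality) class
lr_bal = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_bal.fit(X_train, y_train)
print(metrics.classification_report(y_test, lr_bal.predict(X_test)))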

Support Vector Machine (SVM)


In [249]:

sns.heatmap(confusion_matrix(y_test, sv_pred), annot=True, cmap='Reds')


plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.title('Confusion Matrix for Support Vector Machine')
plt.show()

In [250]:

print('Support Vector Machine Model Accuracy: ', accuracy_score(y_test, sv_pred))


print('Support Vector Machine Model f1 score: ', metrics.f1_score(y_test, sv_pred))
print('Support Vector Machine Model MAE: ', metrics.mean_absolute_error(y_test, sv_pred))
print('Support Vector Machine Model RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, sv_pred)))

Support Vector Machine Model Accuracy: 0.8625


Support Vector Machine Model f1 score: 0.029411764705882353
Support Vector Machine Model MAE: 0.1375
Support Vector Machine Model RMSE: 0.37080992435478316
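An f1 score of roughly 0.03 means the SVM predicts the majority class almost exclusively, so its accuracy is close to the share of non-good wines. RBF SVMs are also sensitive to feature scale; a minimal sketch of refitting on the standardised features with class weighting (clf_scaled is an illustrative name):

# RBF SVM on standardised features, minority class up-weighted
clf_scaled = svm.SVC(kernel='rbf', class_weight='balanced')
clf_scaled.fit(X_train_scaled, y_train)
print(metrics.classification_report(y_test, clf_scaled.predict(X_test_scaled)))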

Decision Tree
In [251]:

sns.heatmap(confusion_matrix(y_test, tr_pred), annot=True, cmap='Greens')


plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.title('Confusion Matrix for Decision Tree')
plt.show()

In [252]:
print('Decision Tree Model Accuracy: ', accuracy_score(y_test, tr_pred))
print('Decision Tree Model f1 score: ', metrics.f1_score(y_test, tr_pred))
print('Decision Tree Model MAE: ', metrics.mean_absolute_error(y_test, tr_pred))
print('Decision Tree Model RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, tr_pred)))

Decision Tree Model Accuracy: 0.8604166666666667


Decision Tree Model f1 score: 0.5677419354838709
Decision Tree Model MAE: 0.13958333333333334
Decision Tree Model RMSE: 0.3736085295243316
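The training accuracy of 1.0 a few cells above indicates the unconstrained tree has memorised the training data. A minimal sketch of limiting the depth to curb overfitting (max_depth=5 is an arbitrary illustrative value):

# shallower tree: trades some training fit for better generalisation
dtree_small = DecisionTreeClassifier(max_depth=5, random_state=42)
dtree_small.fit(X_train, y_train)
print('train accuracy:', dtree_small.score(X_train, y_train))
print('test accuracy :', accuracy_score(y_test, dtree_small.predict(X_test)))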

K-Nearest Neighbors (KNN)


In [253]:

sns.heatmap(confusion_matrix(y_test, kn_pred), annot=True, cmap='Purples')


plt.ylabel('Actual Values')
plt.xlabel('Predicted Values')
plt.title('Confusion Matrix for K-Nearest Neighbors')
plt.show()

In [254]:
print('K-Nearest Neighbors Model Accuracy: ', accuracy_score(y_test, kn_pred))
print('K-Nearest Neighbors Model f1 score: ', metrics.f1_score(y_test, kn_pred))
print('K-Nearest Neighbors Model MAE: ', metrics.mean_absolute_error(y_test, kn_pred))
print('K-Nearest Neighbors Model RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, kn_pred)))

K-Nearest Neighbors Model Accuracy: 0.8583333333333333


K-Nearest Neighbors Model f1 score: 0.276595744680851
K-Nearest Neighbors Model MAE: 0.14166666666666666
K-Nearest Neighbors Model RMSE: 0.3763863263545405
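n_neighbors=5 is the default rather than a tuned value. A minimal sketch of scanning a few candidate k values with 5-fold cross-validation on the scaled training data:

from sklearn.model_selection import cross_val_score

# mean CV accuracy for a handful of candidate k values
for k in [3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train_scaled, y_train, cv=5)
    print(k, scores.mean())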

Model Comparison
In [264]:

models = ['Logistic Regression', 'Support Vector Machine', 'Decision Tree', 'K-Nearest Neighbors']
accuracy = [accuracy_score(y_test, lr_pred), accuracy_score(y_test, sv_pred), accuracy_score(y_test, tr_pred), accuracy_score(y_test, kn_pred)]
plt.figure(figsize=(10,6))
sns.barplot(x=models, y=accuracy)
plt.title('Model Accuracy Comparison')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.ylim(0.5, 1.0)
plt.show()
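Because of the class imbalance, accuracy alone can be misleading here. A minimal sketch of plotting f1 scores for the same test predictions (f1_scores is an illustrative name):

# f1 score comparison on the same test predictions
f1_scores = [metrics.f1_score(y_test, p) for p in [lr_pred, sv_pred, tr_pred, kn_pred]]
plt.figure(figsize=(10,6))
sns.barplot(x=models, y=f1_scores)
plt.title('Model F1 Score Comparison')
plt.xlabel('Model')
plt.ylabel('F1 score')
plt.show()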

Conclusion

It is observed that all four models reach similar test accuracies: the Support Vector Machine is highest at about 86.3%, followed by the Decision Tree (86.0%), K-Nearest Neighbors (85.8%) and Logistic Regression (85.6%). Judged by f1 score, however, the Decision Tree (about 0.57) separates good wines best, while the SVM's near-zero f1 shows it almost never predicts the good class. Overall, the models predict wine quality from the given features with roughly 86% accuracy, with clear room to improve detection of the minority good-quality class.

In [ ]:
