Assignment4 VidulGarg
Project Title:
Grapes to Greatness: Machine Learning in Wine Quality Prediction
Description:
Predicting wine quality is a common application of machine learning in data science and
analytics. The task is to build a model that predicts a wine's quality rating from input
features such as chemical composition, sensory characteristics, and environmental factors.
Tasks:
Load the dataset, preprocess the data (including visualization), build the machine learning
models, evaluate the models, and test with random observations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
df=pd.read_csv(r"D:\MachineLearning\DataScienceCourse\winequality-red.csv")
df
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
... ... ...
1594 10.5 5
1595 11.2 6
1596 11.0 6
1597 10.2 5
1598 11.0 6
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
df.isnull().sum()
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
free sulfur dioxide 0
total sulfur dioxide 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
df.describe()
Data Visualization
plt.figure(figsize=(5,3))
df["quality"].value_counts().plot(kind='bar')
plt.xticks(rotation=0)
(array([0, 1, 2, 3, 4, 5]),
[Text(0, 0, '5'),
Text(1, 0, '6'),
Text(2, 0, '7'),
Text(3, 0, '4'),
Text(4, 0, '8'),
Text(5, 0, '3')])
Wines with quality 5 and 6 are the most common, so the classes are imbalanced.
plt.figure(figsize=(8,8))
l=["fixed acidity","volatile acidity","citric acid","residual sugar",
   "chlorides","free sulfur dioxide","total sulfur dioxide","density",
   "pH","sulphates","alcohol"]
for i in l:
    plt.subplot(4, 3, l.index(i) + 1) # 4 rows, 3 columns
    sns.barplot(x=df["quality"],y=df[i])
plt.tight_layout()
# sns.barplot(x=df["quality"],y=df["alcohol"])
Correlation Check
plt.figure(figsize=(12, 8))
cor=df.corr()
sns.heatmap(cor,annot=True)
<Axes: >
As the heatmap shows, no features in the dataset are highly correlated with one another.
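A heatmap with this many cells is hard to judge by eye, so it may help to list the strongest pairs explicitly. The following is a minimal sketch, not part of the original notebook: the helper name `top_correlations` and the small synthetic DataFrame are assumptions for illustration. On the real data one would pass `df` instead of `demo`.

```python
import numpy as np
import pandas as pd

def top_correlations(frame, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    cor = frame.corr()
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)
    pairs = cor.where(mask).stack().dropna()
    # Rank by absolute value but keep the signed correlation in the result.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Synthetic stand-in for the wine DataFrame (made-up values).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),                      # independent noise
})
print(top_correlations(demo, n=3))
```

Sorting by absolute value keeps strong negative correlations near the top, which a colour scale can make easy to overlook.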
Checking outliers
sns.boxplot(data=df, orient='h') # 'orient' is set to 'h' for horizontal box plots
plt.xlabel('Values')
plt.title('Box Plot of All Columns')
df[i]=np.where(df[i]>upperL,upperL,np.where(df[i]<lowerL,lowerL,df[i]))
plt.xlabel('Values')
plt.title('Box Plot of All Columns')
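The capping line above uses `upperL` and `lowerL`, but the cell that defines them did not survive in this export. One common choice is the 1.5 x IQR rule, sketched below as an assumption rather than as the notebook's actual definition. The helper name `cap_outliers_iqr`, the factor `k`, and the demo values are all hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(frame, columns, k=1.5):
    """Clip each listed column to [Q1 - k*IQR, Q3 + k*IQR] and return a copy."""
    capped = frame.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        lowerL = q1 - k * iqr  # assumed definition of the lower limit
        upperL = q3 + k * iqr  # assumed definition of the upper limit
        # clip() has the same effect as the nested np.where() capping.
        capped[col] = capped[col].clip(lower=lowerL, upper=upperL)
    return capped

# Made-up values with one obvious outlier.
demo = pd.DataFrame({"alcohol": [9.4, 9.8, 9.8, 10.2, 11.0, 25.0]})
capped = cap_outliers_iqr(demo, ["alcohol"])
print(capped["alcohol"].tolist())
```

Capping keeps all 1599 rows, which matches the unchanged row count in the `info()` output after this step. The target column `quality` would normally be left out of `columns`.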
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
Splitting the data into dependent and independent variables
x=df.iloc[:,:11]
y=df.iloc[:,-1]
x.info()
y.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
dtypes: float64(11)
memory usage: 137.5 KB
<class 'pandas.core.series.Series'>
RangeIndex: 1599 entries, 0 to 1598
Series name: quality
Non-Null Count Dtype
-------------- -----
1599 non-null int64
dtypes: int64(1)
memory usage: 12.6 KB
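The models that follow use `x_train`, `x_test`, `y_train`, and `y_test`, but the cell that creates them is missing from this export. A minimal sketch of that step is given here under stated assumptions: a 20% test split (consistent with the 320 rows in each confusion matrix), an arbitrary `random_state`, and `MinMaxScaler` for scaling because it is imported at the top. The synthetic `x` and `y` only stand in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with the same row count as the wine data (made-up values).
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.normal(size=(1599, 3)), columns=["f1", "f2", "f3"])
y = pd.Series(rng.integers(3, 9, size=1599), name="quality")

# Hold out 20% of the rows; random_state is an arbitrary choice for repeatability.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both parts,
# so nothing about the test set leaks into preprocessing.
scaler = MinMaxScaler()
x_train = pd.DataFrame(scaler.fit_transform(x_train),
                       columns=x.columns, index=x_train.index)
x_test = pd.DataFrame(scaler.transform(x_test),
                      columns=x.columns, index=x_test.index)

print(x_train.shape, x_test.shape)
```

Scaling matters most for KNN, which compares raw distances between feature vectors. Wrapping the scaled arrays back into DataFrames keeps the column names available to the fitted models.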
Model Training
KNN Classifier
model1=KNeighborsClassifier(n_neighbors=3)
model1.fit(x_train, y_train)
y_pred1 = model1.predict(x_test)
print(classification_report(y_test, y_pred1))
print(confusion_matrix(y_test,y_pred1))
[[ 0 0 1 0 0 0]
[ 2 0 3 2 1 0]
[ 0 6 70 37 7 0]
[ 1 9 62 64 10 0]
[ 0 0 15 13 12 0]
[ 0 0 1 2 2 0]]
C:\Users\Vidul\AppData\Local\Programs\Python\Python311\Lib\site-
packages\sklearn\metrics\_classification.py:1469:
UndefinedMetricWarning: Precision and F-score are ill-defined and
being set to 0.0 in labels with no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
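The warning is raised because some quality classes are never predicted, which leaves their precision undefined (zero divided by zero). As the message suggests, `classification_report` accepts a `zero_division` argument that sets the reported value and silences the warning. The toy labels below are made up for illustration.

```python
from sklearn.metrics import classification_report

# Toy labels (made up): class 3 occurs in y_true but is never predicted.
y_true = [3, 5, 5, 6, 6, 6]
y_hat = [5, 5, 5, 6, 6, 5]

# zero_division=0 reports 0.0 for the undefined precision without a warning.
report = classification_report(y_true, y_hat, zero_division=0)
print(report)
```

This changes only how the undefined scores are displayed. The model still never predicts the rare classes, which is a consequence of the class imbalance.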
Logistic Regression
model2=LogisticRegression(max_iter=5000)
model2.fit(x_train, y_train)
y_pred2 = model2.predict(x_test)
print(classification_report(y_test, y_pred2))
print(confusion_matrix(y_test,y_pred2))
[[ 0 0 1 0 0 0]
[ 0 0 2 6 0 0]
[ 0 0 92 28 0 0]
[ 0 0 57 82 7 0]
[ 0 0 2 31 7 0]
[ 0 0 0 2 3 0]]
C:\Users\Vidul\AppData\Local\Programs\Python\Python311\Lib\site-
packages\sklearn\metrics\_classification.py:1469:
UndefinedMetricWarning: Precision and F-score are ill-defined and
being set to 0.0 in labels with no predicted samples. Use
`zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
Decision Tree
[[ 0 0 0 1 0 0]
[ 1 1 3 2 1 0]
[ 1 4 90 22 3 0]
[ 0 3 45 77 21 0]
[ 0 0 7 13 16 4]
[ 0 0 0 1 3 1]]
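The third confusion matrix above has no accompanying code in this export. Since `DecisionTreeClassifier` is imported and `y_pred3` appears in the accuracy check, it presumably comes from a decision tree. The sketch below shows what such a cell might look like, assuming default hyperparameters apart from an arbitrary `random_state`. It runs on synthetic data from `make_classification`, so its numbers will not match the matrix above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features (not the real data).
X, y = make_classification(n_samples=400, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# random_state makes the tree reproducible when splits tie.
model3 = DecisionTreeClassifier(random_state=0)
model3.fit(x_train, y_train)
y_pred3 = model3.predict(x_test)

print(classification_report(y_test, y_pred3, zero_division=0))
print(confusion_matrix(y_test, y_pred3))
```

An unpruned tree usually fits its training data almost perfectly, so limiting `max_depth` or `min_samples_leaf` is a common way to reduce overfitting.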
Accuracy Check
print("KNN Classifier Accuracy:", accuracy_score(y_test, y_pred1)*100)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred2)*100)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred3)*100)
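The printed accuracy values are not included in this export, but they can be recovered from the confusion matrices shown earlier, because accuracy is the sum of the diagonal divided by the total count. The sketch below copies the three matrices as they appear above. Attributing the third one to the decision tree is an assumption, and the helper name `accuracy_from_cm` is hypothetical.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class).
cm_knn = np.array([[0, 0, 1, 0, 0, 0],
                   [2, 0, 3, 2, 1, 0],
                   [0, 6, 70, 37, 7, 0],
                   [1, 9, 62, 64, 10, 0],
                   [0, 0, 15, 13, 12, 0],
                   [0, 0, 1, 2, 2, 0]])
cm_logreg = np.array([[0, 0, 1, 0, 0, 0],
                      [0, 0, 2, 6, 0, 0],
                      [0, 0, 92, 28, 0, 0],
                      [0, 0, 57, 82, 7, 0],
                      [0, 0, 2, 31, 7, 0],
                      [0, 0, 0, 2, 3, 0]])
# Assumed to belong to the decision tree (its cell is missing).
cm_third = np.array([[0, 0, 0, 1, 0, 0],
                     [1, 1, 3, 2, 1, 0],
                     [1, 4, 90, 22, 3, 0],
                     [0, 3, 45, 77, 21, 0],
                     [0, 0, 7, 13, 16, 4],
                     [0, 0, 0, 1, 3, 1]])

def accuracy_from_cm(cm):
    """Accuracy is the share of samples that fall on the diagonal."""
    return np.trace(cm) / cm.sum()

for name, cm in [("KNN", cm_knn), ("Logistic Regression", cm_logreg),
                 ("Third model", cm_third)]:
    print(f"{name}: {accuracy_from_cm(cm) * 100:.1f}%")
```

Each matrix sums to 320 test rows, and the accuracies work out to roughly 45.6% for KNN, 56.6% for logistic regression, and 57.8% for the third model.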
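# Classify each test observation; a predicted quality of 6 or above counts as good
for i in sample_check:
    pred=model2.predict([i])  # separate name so the feature matrix x is not overwritten
    if pred[0]>=6:
        print(pred, "--> Good")
    else:
        print(pred, "--> Not Good")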
[5] --> Not Good
[6] --> Good
[5] --> Not Good
[6] --> Good
[5] --> Not Good
C:\Users\Vidul\AppData\Local\Programs\Python\Python311\Lib\site-
packages\sklearn\base.py:464: UserWarning: X does not have valid
feature names, but LogisticRegression was fitted with feature names
warnings.warn(
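The warning appears because the model was fitted on a DataFrame with column names, while `predict` receives a plain list. Wrapping the observations in a DataFrame with the same columns avoids it, as the sketch below suggests. The training data and the `sample_check` values here are made up, since the originals are not shown in the notebook, and only two hypothetical features are used to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data with named columns (not the wine data).
rng = np.random.default_rng(0)
x_train = pd.DataFrame(rng.random((200, 2)), columns=["alcohol", "sulphates"])
y_train = np.where(x_train["alcohol"] + x_train["sulphates"] > 1.0, 6, 5)

model = LogisticRegression(max_iter=5000).fit(x_train, y_train)

# Made-up observations standing in for the notebook's sample_check.
sample_check = [[0.9, 0.8], [0.1, 0.2]]
samples = pd.DataFrame(sample_check, columns=x_train.columns)

# Predict all rows at once; matching column names raise no warning.
for pred in model.predict(samples):
    label = "Good" if pred >= 6 else "Not Good"
    print(pred, "-->", label)
```

If the training features were scaled, the same fitted scaler has to be applied to the new observations before predicting.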