m1
m1
(https://colab.research.google.com/github/JAYASURYAb/ML-project1/blob/master/Project_1.ipynb)
Importing Libraries
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
In [ ]:
dataset = pd.read_csv('/content/winequality-red.csv')
In [ ]:
dataset
Out[ ]:
free total
fixed volatile citric residual
chlorides sulfur sulfur density pH sulphates
acidity acidity acid sugar
dioxide dioxide
0 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56
1 7.8 0.880 0.00 2.6 0.098 25.0 67.0 0.99680 3.20 0.68
2 7.8 0.760 0.04 2.3 0.092 15.0 54.0 0.99700 3.26 0.65
3 11.2 0.280 0.56 1.9 0.075 17.0 60.0 0.99800 3.16 0.58
4 7.4 0.700 0.00 1.9 0.076 11.0 34.0 0.99780 3.51 0.56
... ... ... ... ... ... ... ... ... ... ...
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66
dataset.head()
Out[ ]:
free total
fixed volatile citric residual
chlorides sulfur sulfur density pH sulphates alc
acidity acidity acid sugar
dioxide dioxide
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56
In [ ]:
dataset.tail()
Out[ ]:
free total
fixed volatile citric residual
chlorides sulfur sulfur density pH sulphates
acidity acidity acid sugar
dioxide dioxide
1594 6.2 0.600 0.08 2.0 0.090 32.0 44.0 0.99490 3.45 0.58
1595 5.9 0.550 0.10 2.2 0.062 39.0 51.0 0.99512 3.52 0.76
1596 6.3 0.510 0.13 2.3 0.076 29.0 40.0 0.99574 3.42 0.75
1597 5.9 0.645 0.12 2.0 0.075 32.0 44.0 0.99547 3.57 0.71
1598 6.0 0.310 0.47 3.6 0.067 18.0 42.0 0.99549 3.39 0.66
In [ ]:
dataset.describe()
Out[ ]:
In [ ]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 fixed acidity 1599 non-null float64
1 volatile acidity 1599 non-null float64
2 citric acid 1599 non-null float64
3 residual sugar 1599 non-null float64
4 chlorides 1599 non-null float64
5 free sulfur dioxide 1599 non-null float64
6 total sulfur dioxide 1599 non-null float64
7 density 1599 non-null float64
8 pH 1599 non-null float64
9 sulphates 1599 non-null float64
10 alcohol 1599 non-null float64
11 quality 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
In [ ]:
dataset['quality'].unique()
Out[ ]:
array([5, 6, 7, 4, 8, 3])
In [ ]:
dataset.quality.value_counts().sort_index()
Out[ ]:
3 10
4 53
5 681
6 638
7 199
8 18
Name: quality, dtype: int64
In [ ]:
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: F
utureWarning: pandas.util.testing is deprecated. Use the functions in the
public API at pandas.testing instead.
import pandas.util.testing as tm
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4515a60eb8>
making binary classification for the variable and dividing wine as bad and good by giving the limit to
quality
In [ ]:
In [ ]:
le = LabelEncoder()
In [ ]:
dataset['quality']= le.fit_transform(dataset['quality'])
In [ ]:
dataset['quality'].value_counts()
Out[ ]:
0 1382
1 217
Name: quality, dtype: int64
In [ ]:
sns.countplot(dataset['quality'])
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f451458c390>
x = dataset.drop('quality', axis = 1)
y = dataset['quality']
Splitting into test and train sets
In [ ]:
In [ ]:
In [ ]:
print(len(x_train))
print(len(x_test))
print(len(y_train))
print(len(y_test))
1279
320
1279
320
Standardization
In [ ]:
In [ ]:
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)
In [ ]:
print(dataset.describe())
[8 rows x 12 columns]
In [ ]:
In [ ]:
l_cla = LogisticRegression()
k_cla = KNeighborsClassifier()
d_cla = DecisionTreeClassifier()
r_cla = RandomForestClassifier(n_estimators = 10)
s_cla = SVC(kernel='linear')
ks_cla = SVC(kernel= 'rbf')
In [ ]:
l_cla.fit(x_train, y_train)
k_cla.fit(x_train, y_train)
d_cla.fit(x_train, y_train)
r_cla.fit(x_train, y_train)
s_cla.fit(x_train, y_train)
ks_cla.fit(x_train, y_train)
Out[ ]:
In [ ]:
l_pred = l_cla.predict(x_test)
k_pred = k_cla.predict(x_test)
d_pred = d_cla.predict(x_test)
r_pred = r_cla.predict(x_test)
s_pred = s_cla.predict(x_test)
ks_pred = ks_cla.predict(x_test)
In [ ]:
In [ ]:
print(l_c)
print(k_c)
print(d_c)
print(r_c)
print(s_c)
print(ks_c)
[[275 12]
[ 19 14]]
[[278 9]
[ 17 16]]
[[261 26]
[ 13 20]]
[[283 4]
[ 18 15]]
[[287 0]
[ 33 0]]
[[282 5]
[ 20 13]]
In [ ]:
In [ ]:
print('Logistic Regression: ' + str(l_a) + '\nKNN: ' + str(k_a) + '\nDecision Tree: ' +
str(d_a) + '\nRandom Forest: ' + str(r_a) + '\nLinear SVC: ' + str(s_a) + '\nKernel SV
C: ' + str(ks_a))
So from above,we can conclude that Random Forest classifier shows the highest accuracy
Checking for Regression model
Linear Regression
Polynomial Regression
Decision Tree
Random Forest
Support Vector Regression
In [ ]:
In [ ]:
m_reg = LinearRegression()
p_reg = LinearRegression()
d_reg = DecisionTreeRegressor()
r_reg = RandomForestRegressor(n_estimators = 500)
s_reg = SVR()
In [ ]:
x_poly = PolynomialFeatures(degree = 5)
x_poly = x_poly.fit_transform(x_train)
In [ ]:
m_reg.fit(x_train , y_train)
p_reg.fit(x_poly , y_train)
d_reg.fit(x_train , y_train)
r_reg.fit(x_train , y_train)
s_reg.fit(x_train , y_train)
Out[ ]:
In [ ]:
temp = PolynomialFeatures(degree = 5)
temp = temp.fit_transform(x_test)
In [ ]:
m_pred = m_reg.predict(x_test)
p_pred = p_reg.predict(temp)
d_pred = d_reg.predict(x_test)
r_pred = r_reg.predict(x_test)
s_pred = s_reg.predict(x_test)
In [ ]:
In [ ]:
m = r2_score(y_test ,m_pred)
p = r2_score(y_test ,p_pred)
d = r2_score(y_test ,d_pred)
r = r2_score(y_test ,r_pred)
s = r2_score(y_test , s_pred)
In [ ]:
Linear regression:0.22111770775681405
polynomialFeatures:-8951.520231784129
decisiontree:-0.419068736141907
RandomForest:0.4433563678597823
SupportVectorRegression:0.40423984022510806
Here we are getting negative values for our regression models and score is also comparatively less
as compared to classification model.
So from above results,we can conclude that Classification model is best fit for our dataset.
Thank you
Jayasurya B