ML pr5
Aim: Import the Pima Indians diabetes data, apply SelectKBest with the chi2 score function for feature selection, and identify the best features.
The following lines of code will select the four best features from the dataset −
from sklearn.feature_selection import SelectKBest, chi2

# X holds the 8 input attributes, Y the class label
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
We can also format the output as we choose. Here, we set the print precision to 2, print the chi-squared score of each of the 8 attributes, and then show the first rows of the 4 selected attributes −
from numpy import set_printoptions

set_printoptions(precision=2)
print(fit.scores_)
featured_data = fit.transform(X)
print("\nFeatured data:\n", featured_data[0:4])
Output
[ 111.52 1411.89 17.61 53.11 2175.57 127.67 5.39 181.3 ]
Featured data:
[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]]
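The fragments above can be combined into one runnable script. As a stand-in for the full Pima CSV (which in practice you would load, e.g. with pandas.read_csv), the sketch below hard-codes the first four rows of the dataset; with so few rows the printed scores will differ from the full-data scores shown above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in for the Pima Indians diabetes data: its first four rows.
# Columns: preg, plas, pres, skin, test, mass, pedi, age, class
data = np.array([
    [6, 148, 72, 35,  0, 33.6, 0.627, 50, 1],
    [1,  85, 66, 29,  0, 26.6, 0.351, 31, 0],
    [8, 183, 64,  0,  0, 23.3, 0.672, 32, 1],
    [1,  89, 66, 23, 94, 28.1, 0.167, 21, 0],
])
X, Y = data[:, :8], data[:, 8]

# Keep the four attributes with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

np.set_printoptions(precision=2)
print(fit.scores_)                # one chi-squared score per attribute
featured_data = fit.transform(X)  # only the 4 selected columns remain
print("\nFeatured data:\n", featured_data)
```

Swapping the hard-coded array for the full CSV reproduces the output shown above.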
where −
Observed frequency = the number of observations of each class.
Expected frequency = the number of observations of each class that would be expected if there were no relationship between the feature and the target.
The chi-squared score of a feature is then sum((observed − expected)^2 / expected), so a higher score indicates a stronger dependence between the feature and the target.
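These definitions can be checked numerically. The toy matrix below is purely illustrative; it mirrors how sklearn's chi2 treats the per-class sum of each feature as the observed frequency and the class proportion times the total feature sum as the expected frequency:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data: 2 non-negative features, 2 classes (illustrative values only)
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.],
              [7., 8.]])
y = np.array([0, 0, 1, 1])

# Observed frequency: per-class sum of each feature
observed = np.array([X[y == c].sum(axis=0) for c in (0, 1)])

# Expected frequency: class proportion times total feature sum,
# i.e. what we would see if feature and target were independent
class_prob = np.bincount(y) / len(y)
expected = np.outer(class_prob, X.sum(axis=0))

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(manual_scores)
print(chi2(X, y)[0])  # sklearn's scores, computed the same way
```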
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load the iris data
X, y = load_iris(return_X_y=True)

# Keep the two features with the highest chi-squared scores
X_kbest = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Reduced features
print('Original feature number:', X.shape[1])
print('Reduced feature number:', X_kbest.shape[1])
Output:
Original feature number: 4
Reduced feature number: 2
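To actually identify which features were kept (the stated aim of the exercise, not just how many), SelectKBest.get_support returns a boolean mask over the columns; a short sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit the selector so we can inspect which columns it chose
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Boolean mask of the selected columns -> human-readable feature names
mask = selector.get_support()
best = [name for name, keep in zip(iris.feature_names, mask) if keep]
print('Selected features:', best)
```

The same pattern applied to the Pima data would name the four attributes behind the scores printed earlier.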