Scikit-Learn Cheat Sheet

This document provides an overview of preprocessing techniques, model evaluation metrics, and machine learning algorithms in scikit-learn. It discusses standardization, normalization, binarization, imputing missing values, encoding categorical features, generating polynomial features, and loading data. For model evaluation, it covers classification metrics such as accuracy score, classification report, and confusion matrix, as well as regression metrics such as mean absolute error, mean squared error, and R² score. It also lists clustering evaluation metrics: adjusted Rand index, homogeneity, and V-measure. Scikit-learn is an open-source Python machine learning library that implements preprocessing, modeling, validation, and visualization algorithms.


Python For Data Science | Learn Scikit-Learn online at www.DataCamp.com

> Scikit-learn

Scikit-learn is an open-source Python library that implements a range of machine learning, preprocessing, cross-validation, and visualization algorithms using a unified interface.

> A Basic Example

>>> from sklearn import neighbors, datasets, preprocessing
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> iris = datasets.load_iris()
>>> X, y = iris.data[:, :2], iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train = scaler.transform(X_train)
>>> X_test = scaler.transform(X_test)
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)
>>> y_pred = knn.predict(X_test)
>>> accuracy_score(y_test, y_pred)
> Loading The Data (Also see NumPy & Pandas)

Your data needs to be numeric and stored as NumPy arrays or SciPy sparse matrices. Other types that are convertible to numeric arrays, such as Pandas DataFrames, are also acceptable.

>>> import numpy as np
>>> X = np.random.random((10,5))
>>> y = np.array(['M','M','F','F','M','F','M','M','F','F'])   #one label per row of X
>>> X[X < 0.7] = 0
> Training And Test Data

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
> Preprocessing The Data

Standardization

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler().fit(X_train)
>>> standardized_X = scaler.transform(X_train)
>>> standardized_X_test = scaler.transform(X_test)

Normalization

>>> from sklearn.preprocessing import Normalizer
>>> scaler = Normalizer().fit(X_train)
>>> normalized_X = scaler.transform(X_train)
>>> normalized_X_test = scaler.transform(X_test)

Binarization

>>> from sklearn.preprocessing import Binarizer
>>> binarizer = Binarizer(threshold=0.0).fit(X)
>>> binary_X = binarizer.transform(X)

Encoding Categorical Features

>>> from sklearn.preprocessing import LabelEncoder
>>> enc = LabelEncoder()
>>> y = enc.fit_transform(y)
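LabelEncoder is meant for target labels; for categorical input features, scikit-learn provides OneHotEncoder. A minimal sketch (the color values are made up for illustration):

>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(handle_unknown='ignore')
>>> X_cat = [['red'], ['green'], ['red']]
>>> ohe.fit_transform(X_cat).toarray()   #one binary column per category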

Imputing Missing Values

>>> from sklearn.impute import SimpleImputer   #SimpleImputer replaces the removed sklearn.preprocessing.Imputer
>>> imp = SimpleImputer(missing_values=0, strategy='mean')
>>> imp.fit_transform(X_train)

Generating Polynomial Features

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly = PolynomialFeatures(5)
>>> poly.fit_transform(X)
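These transformers are commonly chained with an estimator in a Pipeline so the exact same preprocessing is applied to training and test data. A minimal sketch, not part of the original sheet:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neighbors import KNeighborsClassifier
>>> pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
>>> pipe.fit(X_train, y_train)     #scaler is fit on the training data only
>>> pipe.score(X_test, y_test)     #the fitted scaler is reused on the test data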

> Create Your Model

Supervised Learning Estimators

Linear Regression
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()   #the old normalize=True option was removed; scale inputs with StandardScaler instead

Support Vector Machines (SVM)
>>> from sklearn.svm import SVC
>>> svc = SVC(kernel='linear')

Naive Bayes
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB()

KNN
>>> from sklearn import neighbors
>>> knn = neighbors.KNeighborsClassifier(n_neighbors=5)

Unsupervised Learning Estimators

Principal Component Analysis (PCA)
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=0.95)   #keep enough components to explain 95% of the variance

K Means
>>> from sklearn.cluster import KMeans
>>> k_means = KMeans(n_clusters=3, random_state=0)
> Model Fitting

Supervised learning
>>> lr.fit(X, y)   #Fit the model to the data
>>> knn.fit(X_train, y_train)
>>> svc.fit(X_train, y_train)

Unsupervised learning
>>> k_means.fit(X_train)   #Fit the model to the data
>>> pca_model = pca.fit_transform(X_train)   #Fit to data, then transform it

> Prediction

Supervised Estimators
>>> y_pred = svc.predict(np.random.random((2,5)))   #Predict labels
>>> y_pred = lr.predict(X_test)   #Predict labels
>>> y_pred = knn.predict_proba(X_test)   #Estimate probability of a label

Unsupervised Estimators
>>> y_pred = k_means.predict(X_test)   #Predict labels in clustering algos
> Evaluate Your Model's Performance

Classification Metrics

Accuracy Score
>>> knn.score(X_test, y_test)   #Estimator score method
>>> from sklearn.metrics import accuracy_score   #Metric scoring functions
>>> accuracy_score(y_test, y_pred)

Classification Report
>>> from sklearn.metrics import classification_report   #Precision, recall, f1-score and support
>>> print(classification_report(y_test, y_pred))

Confusion Matrix
>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(y_test, y_pred))

Regression Metrics

Mean Absolute Error
>>> from sklearn.metrics import mean_absolute_error
>>> y_true = [3, -0.5, 2]
>>> mean_absolute_error(y_true, y_pred)

Mean Squared Error
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(y_test, y_pred)

R² Score
>>> from sklearn.metrics import r2_score
>>> r2_score(y_true, y_pred)

Clustering Metrics

Adjusted Rand Index
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_true, y_pred)

Homogeneity
>>> from sklearn.metrics import homogeneity_score
>>> homogeneity_score(y_true, y_pred)

V-measure
>>> from sklearn.metrics import v_measure_score
>>> v_measure_score(y_true, y_pred)

Cross-Validation
>>> from sklearn.model_selection import cross_val_score   #sklearn.cross_validation was removed
>>> print(cross_val_score(knn, X_train, y_train, cv=4))
>>> print(cross_val_score(lr, X, y, cv=2))
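cross_val_score uses the estimator's default scorer unless told otherwise; any named scikit-learn metric can be requested via the scoring parameter. A hedged example, not in the original sheet:

>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(knn, X_train, y_train, cv=4, scoring='f1_macro')   #macro-averaged F1 instead of accuracy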
> Tune Your Model

Grid Search
>>> from sklearn.model_selection import GridSearchCV   #sklearn.grid_search was removed
>>> params = {"n_neighbors": np.arange(1,3),
              "metric": ["euclidean", "cityblock"]}
>>> grid = GridSearchCV(estimator=knn, param_grid=params)
>>> grid.fit(X_train, y_train)
>>> print(grid.best_score_)
>>> print(grid.best_estimator_.n_neighbors)

Randomized Parameter Optimization
>>> from sklearn.model_selection import RandomizedSearchCV
>>> params = {"n_neighbors": range(1,5), "weights": ["uniform", "distance"]}
>>> rsearch = RandomizedSearchCV(estimator=knn, param_distributions=params,
                                 cv=4, n_iter=8, random_state=5)
>>> rsearch.fit(X_train, y_train)
>>> print(rsearch.best_score_)
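After a search finishes, the best configuration is available for direct use; a minimal follow-up sketch (by default the search refits the best estimator on the whole training set):

>>> print(rsearch.best_params_)           #winning parameter combination
>>> best_knn = rsearch.best_estimator_    #already refit on all of X_train, y_train
>>> y_pred = best_knn.predict(X_test)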
Learn Data Skills Online at www.DataCamp.com