scikit-learn Train a classifier with cross-validation


Using the iris dataset:

import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

The data is split into training and test sets. To do this, we use the train_test_split utility function to randomly split both X and y (the data and target vectors) with the option train_size=0.75, so the training set contains 75% of the data.

The training data is fed into a k-nearest neighbors classifier. The classifier's fit method fits the model to the data.

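from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
# Fit a 3-nearest-neighbors classifier on the training data
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)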

Finally, evaluate the prediction quality on the test set:

clf.score(X_test, y_test) # Output (varies with the random split): 0.94736842105263153
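
Because train_test_split shuffles the data randomly, the score above can change from one run to the next. The following is a minimal sketch of that effect, assuming a recent scikit-learn where train_test_split lives in sklearn.model_selection. The seed values 0 to 4 and the name split_scores are illustrative choices, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

split_scores = []
for seed in range(5):  # seeds 0..4 are arbitrary, illustrative values
    # Same 75/25 split as above, but with a fixed seed so each run is repeatable
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)
    split_scores.append(clf.score(X_test, y_test))

# One accuracy value per split; the values may differ between seeds
print(split_scores)
```

Passing random_state makes a given split reproducible, but it does not remove the dependence of the score on which split was chosen.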

With a single pair of train and test sets, the estimate of the classifier's quality may be biased by the arbitrary choice of the data split. With cross-validation, we fit the classifier on several different train/test subsets of the data and average the accuracy results. The function cross_val_score fits a classifier to the input data using cross-validation. It accepts as input the number of splits (folds) to use (5 in the example below).

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)
# scores: array([ 0.96666667,  0.96666667,  0.93333333,  0.96666667,  1.        ])
# Report the mean accuracy and two standard deviations across the folds
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# Output: Accuracy: 0.97 (+/- 0.04)
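
To make the cross-validation procedure concrete, here is a rough sketch of what cross_val_score does for this example, written as an explicit loop. It relies on the documented behavior that an integer cv with a classifier uses StratifiedKFold. The names fold_clf, manual_scores, and library_scores are illustrative, and the loop is a simplification rather than the library's actual implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3)

manual_scores = []
# Each fold serves once as the test set; the remaining folds form the training set
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_clf = clone(clf)  # fresh, unfitted copy of the classifier for every fold
    fold_clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_clf.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should agree with the explicit loop
library_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores)
print(library_scores)
```

Writing the loop out by hand is mainly useful for understanding or for custom per-fold logic; for routine evaluation, cross_val_score is shorter and less error-prone.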