We start by loading the iris dataset:
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
X, y = iris_dataset['data'], iris_dataset['target']
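As a quick sanity check (not part of the original workflow), we can look at the shape of the arrays and the class names:
# The iris data holds 150 samples with 4 features each, in 3 classes
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
print(iris_dataset['target_names'])  # ['setosa' 'versicolor' 'virginica']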
The data is split into train and test sets. To do this we use the train_test_split utility function, which randomly splits both X and y (the data and target arrays); with the option train_size=0.75 the training set contains 75% of the data.
The training data is then fed into a k-nearest neighbors classifier. Calling the classifier's fit method fits the model to the data.
from sklearn.model_selection import train_test_split  # moved here from sklearn.cross_validation in scikit-learn 0.18
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75)
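As a quick check, we can verify the sizes of the resulting splits: with 150 samples and train_size=0.75, train_test_split puts 112 samples in the training set and 38 in the test set:
# 75% of the 150 samples go to training, the remaining 25% to testing
print(len(X_train), len(X_test))  # 112 38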
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
Finally, we estimate the quality of the classifier on the test set:
print(clf.score(X_test, y_test))  # Output: 0.94736842105263153
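A score of roughly 0.947 corresponds to 36 of the 38 test samples being classified correctly. To inspect the individual predictions we can call predict (a small illustrative check; the exact predictions depend on the random split):
# Per-sample predictions on the test set, compared against the true labels
y_pred = clf.predict(X_test)
print((y_pred == y_test).sum(), "of", len(y_test), "correct")  # e.g. 36 of 38 correct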
By using a single pair of train and test sets we might get a biased estimate of the classifier's quality due to the arbitrary choice of the data split.
With cross-validation we can instead fit the classifier on several different train/test subsets of the data and average the accuracy over all the results.
The function cross_val_score fits a classifier to the input data using cross-validation. It takes as input the number of splits (folds) to use (5 in the example below).
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)
print(scores)
# Output: array([ 0.96666667, 0.96666667, 0.93333333, 0.96666667, 1. ])
print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() / 2)
# Output: Accuracy: 0.97 (+/- 0.03)
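For a classifier and integer class labels, passing cv=5 makes cross_val_score use stratified 5-fold splitting internally. The loop below is a rough sketch of the equivalent manual procedure (assuming the model_selection API; it should reproduce the scores above):
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Manually fit and score the classifier on each of the 5 stratified folds
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(manual_scores))  # should match scores.mean() above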