Getting started with scikit-learn: Sample datasets


For ease of testing, sklearn provides some built-in datasets in the sklearn.datasets module. For example, let's load Fisher's iris dataset and list the fields it contains:

import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
iris_dataset.keys()
['target_names', 'data', 'target', 'DESCR', 'feature_names']

You can read the full description (DESCR), the names of the features (feature_names) and the names of the classes (target_names). These are stored as strings.
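As a minimal sketch of reading those descriptive fields, the snippet below prints the feature names, the class names and the start of the description. The 200-character cut-off is an arbitrary choice made here for brevity; it is not part of the library.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Names of the measured features and of the classes (species).
feature_names = list(iris_dataset['feature_names'])
target_names = list(iris_dataset['target_names'])
print(feature_names)
print(target_names)

# DESCR is one long string; print only its beginning (arbitrary cut-off).
print(iris_dataset['DESCR'][:200])
```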

We are interested in the data and classes, which are stored in the data and target fields. By convention, these are denoted X and y:

import numpy as np
X, y = iris_dataset['data'], iris_dataset['target']
X.shape, y.shape
((150, 4), (150,))
np.unique(y)
array([0, 1, 2])

The shapes of X and y show that there are 150 samples with 4 features each. Each sample belongs to one of the following classes: 0, 1 or 2.
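To check how the samples are spread over the classes, one option (a sketch, not the only way) is to count the labels with np.bincount and pair each count with its class name:

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Number of rows is the number of samples, number of columns the number of features.
n_samples, n_features = X.shape

# np.bincount counts how often each integer label 0, 1, 2 occurs in y.
class_counts = np.bincount(y)
for name, count in zip(iris_dataset['target_names'], class_counts):
    print(name, count)
```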

X and y can now be used in training a classifier, by calling the classifier's fit() method.
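A minimal sketch of that step follows. The choice of KNeighborsClassifier, the 25% held-out split and the random_state value are assumptions made here for illustration; any scikit-learn classifier exposes the same fit() interface.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Hold out a quarter of the samples so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Train the classifier on the training part only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out samples.
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Scoring on held-out samples rather than on the training data gives a less optimistic estimate of how well the classifier generalizes.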

Here is the full list of datasets provided by the sklearn.datasets module with their size and intended use:

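Load with               Description                       Size   Usage
----------------------  --------------------------------  -----  ----------------------------
load_boston()           Boston house-prices dataset       506    regression
load_breast_cancer()    Breast cancer Wisconsin dataset   569    classification (binary)
load_diabetes()         Diabetes dataset                  442    regression
load_digits(n_class)    Digits dataset                    1797   classification
load_iris()             Iris dataset                      150    classification (multi-class)
load_linnerud()         Linnerud dataset                  20     multivariate regression

(load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older versions.)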

Note that, as the scikit-learn documentation puts it:

These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks.
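The loaders listed in the table all return the same kind of object, so they can be explored in a loop. The sketch below records the shapes of data and target for each one; load_boston() is left out on purpose because it is not available in recent scikit-learn versions, and the dictionary names are just illustrative.

```python
from sklearn import datasets

# Map a readable name to each loader function (load_boston omitted).
loaders = {
    'breast_cancer': datasets.load_breast_cancer,
    'diabetes': datasets.load_diabetes,
    'digits': datasets.load_digits,
    'iris': datasets.load_iris,
    'linnerud': datasets.load_linnerud,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()
    # Every loader returns an object with 'data' and 'target' arrays.
    shapes[name] = (bunch['data'].shape, bunch['target'].shape)
    print(name, shapes[name])
```

The first entry of each data shape is the number of samples, which should match the Size column of the table.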

In addition to these built-in toy sample datasets, sklearn.datasets also provides utility functions for loading external datasets:

  • load_mlcomp for loading sample datasets from the MLComp repository (note that the datasets must be downloaded beforehand).
  • fetch_lfw_pairs and fetch_lfw_people for loading the Labeled Faces in the Wild (LFW) dataset, used for face verification and face recognition respectively. This dataset is larger than 200 MB.