For ease of testing, sklearn provides some built-in datasets in the sklearn.datasets module. For example, let's load Fisher's iris dataset:
import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
iris_dataset.keys()
['target_names', 'data', 'target', 'DESCR', 'feature_names']
You can read the full description (DESCR), the names of the features (feature_names) and the names of the classes (target_names). Those are stored as strings.
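As a small sketch of how those descriptive fields can be inspected, the snippet below prints the feature names, the class names and the beginning of the description. Truncating the description to 200 characters is only an illustrative choice to keep the output short.

```python
import sklearn.datasets

iris_dataset = sklearn.datasets.load_iris()

# Names of the four measured features (one per column of the data array).
print(iris_dataset['feature_names'])

# Names of the classes; the position of a name is its integer label in target.
print(iris_dataset['target_names'])

# The full description is a long string, so only its beginning is shown here.
# The cut-off of 200 characters is arbitrary.
print(iris_dataset['DESCR'][:200])
```

The position of each name in target_names corresponds to the integer label used in target, so label 0 refers to the first class name, and so on.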
We are interested in the data and classes, which are stored in the data and target fields. By convention those are denoted as X and y:
X, y = iris_dataset['data'], iris_dataset['target']
X.shape, y.shape
((150, 4), (150,))
import numpy
numpy.unique(y)
array([0, 1, 2])
The shapes of X and y show that there are 150 samples with 4 features each. Each sample belongs to one of the following classes: 0, 1 or 2.
X and y can now be used to train a classifier by calling the classifier's fit() method.
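A minimal sketch of that step could look as follows. The choice of KNeighborsClassifier, the value n_neighbors=5 and the 25% hold-out split are assumptions made for illustration only; any scikit-learn classifier exposes the same fit() and predict() methods.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# return_X_y=True returns the data and target arrays directly.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples so the model is scored on data it has not seen.
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# The classifier and its parameter are illustrative choices, not a recommendation.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict class labels for the held-out samples and compute the mean accuracy.
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(predictions.shape, accuracy)
```

Evaluating on a held-out part of the data, rather than on the samples used for fitting, gives a more honest picture of how the classifier behaves on new samples.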
Here is the full list of datasets provided by the sklearn.datasets module, with their size and intended use:
Load with | Description | Size (samples) | Usage |
---|---|---|---|
load_boston() | Boston house-prices dataset (removed in scikit-learn 1.2) | 506 | regression |
load_breast_cancer() | Breast cancer Wisconsin dataset | 569 | classification (binary) |
load_diabetes() | Diabetes dataset | 442 | regression |
load_digits(n_class) | Digits dataset | 1797 | classification |
load_iris() | Iris dataset | 150 | classification (multi-class) |
load_linnerud() | Linnerud dataset | 20 | multivariate regression |
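The sizes in the table can be checked with a short loop such as the sketch below. The dictionary of loaders and the sizes variable are names introduced here for illustration. load_boston is left out because it is not available in recent scikit-learn releases.

```python
from sklearn import datasets

# Map a short label to each loader function from the table above.
loaders = {
    'breast_cancer': datasets.load_breast_cancer,
    'diabetes': datasets.load_diabetes,
    'digits': datasets.load_digits,
    'iris': datasets.load_iris,
    'linnerud': datasets.load_linnerud,
}

sizes = {}
for name, loader in loaders.items():
    bunch = loader()
    # The first dimension of the data array is the number of samples.
    sizes[name] = bunch['data'].shape[0]
    print(name, bunch['data'].shape, bunch['target'].shape)
```

For the Linnerud dataset the target array has more than one column, which is why it is listed as a multivariate regression problem.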
Note that (source: http://scikit-learn.org/stable/datasets/):
These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks.
In addition to these built-in toy datasets, sklearn.datasets also provides utility functions for loading external datasets:
- load_mlcomp for loading sample datasets from the mlcomp.org repository (note that the datasets need to be downloaded beforehand).
- fetch_lfw_pairs and fetch_lfw_people for loading the Labeled Faces in the Wild (LFW) dataset from http://vis-www.cs.umass.edu/lfw/, used for face verification and face recognition, respectively. This dataset is larger than 200 MB.