For ease of testing, sklearn provides some built-in datasets in the sklearn.datasets module. For example, let's load Fisher's iris dataset:
    import sklearn.datasets
    iris_dataset = sklearn.datasets.load_iris()
    iris_dataset.keys()
    # ['target_names', 'data', 'target', 'DESCR', 'feature_names']
You can read the full description (DESCR), the names of the features (feature_names) and the names of the classes (target_names). Those are stored as strings.
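As a quick illustration of those descriptive fields, the sketch below reads each of them from the loaded object. It assumes a reasonably recent scikit-learn in which load_iris() returns a dictionary-like object; the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Names of the four measured features (one per column of the data array).
feature_names = list(iris_dataset['feature_names'])
# Names of the three iris species, indexed by class label 0, 1, 2.
target_names = list(iris_dataset['target_names'])
# DESCR is one long string holding the full dataset description.
description = iris_dataset['DESCR']

print(feature_names)
print(target_names)
print(description[:200])  # first 200 characters only
```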
We are interested in the data and classes, which are stored in the data and target fields. By convention those are denoted as X and y:
    X, y = iris_dataset['data'], iris_dataset['target']
    X.shape, y.shape
    # ((150, 4), (150,))

    import numpy
    numpy.unique(y)
    # array([0, 1, 2])
The shapes of X and y say that there are 150 samples with 4 features. Each sample belongs to one of the following classes: 0, 1 or 2.
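To see how the 150 samples are spread over the three classes, one option is to ask numpy.unique for the counts as well. This is a small sketch rather than part of the original walkthrough; it reloads the dataset so that it runs on its own.

```python
import numpy
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# return_counts=True also reports how many samples carry each label.
classes, counts = numpy.unique(y, return_counts=True)
for label, count in zip(classes, counts):
    print(label, iris_dataset['target_names'][label], count)
```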
X and y can now be used in training a classifier, by calling the classifier's fit() method.
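A minimal sketch of that training step is shown here, passing X and y to a classifier's fit() method. The choice of KNeighborsClassifier, the 75/25 split and the random_state value are assumptions made for illustration and are not prescribed by the text; the sketch also assumes a scikit-learn version that provides sklearn.model_selection.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Hold out a quarter of the samples to check the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Illustrative classifier choice; any estimator with fit() works the same way.
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```

Any other scikit-learn classifier can be swapped in, since they all share the same fit(X, y) and predict(X) interface.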
Here is a list of datasets provided by the sklearn.datasets module, with their size and intended use:
| Dataset | Size (samples) | Intended use |
|---|---|---|
| Boston house-prices dataset | 506 | regression |
| Breast cancer Wisconsin dataset | 569 | classification (binary) |
| Iris dataset | 150 | classification (multi-class) |
| Linnerud dataset | 20 | multivariate regression |
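The sizes in the table can be checked by calling the corresponding loader functions and inspecting the shape of each data array. The sketch below does this for three of the datasets; the Boston loader is left out on purpose because recent scikit-learn releases no longer ship it. The dictionary keys are labels chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer, load_iris, load_linnerud

loaders = {
    'breast cancer': load_breast_cancer,
    'iris': load_iris,
    'linnerud': load_linnerud,
}

sizes = {}
for name, loader in loaders.items():
    dataset = loader()
    # The first axis of the data array is the number of samples.
    sizes[name] = dataset['data'].shape[0]
    print(name, dataset['data'].shape, dataset['target'].shape)
```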
Note that (source: http://scikit-learn.org/stable/datasets/):
These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks.
In addition to these built-in toy sample datasets, sklearn.datasets also provides utility functions for loading external datasets:
load_mlcomp for loading sample datasets from the mlcomp.org repository (note that the datasets need to be downloaded beforehand).
fetch_lfw_people for loading the Labeled Faces in the Wild (LFW) people dataset from http://vis-www.cs.umass.edu/lfw/, used for face recognition (the companion fetch_lfw_pairs loads the pairs dataset, used for face verification). This dataset is larger than 200 MB.
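The fetch-style functions are used much like the built-in loaders, except that the first call downloads the data. The following is a hedged sketch for the LFW people dataset: the wrapper function load_lfw_subset and its default arguments are hypothetical names and values introduced for this example, while min_faces_per_person and resize are parameters of fetch_lfw_people.

```python
from sklearn.datasets import fetch_lfw_people


def load_lfw_subset(min_faces=70, resize=0.4):
    """Return (X, y, names) for people with at least `min_faces` images.

    The first call downloads the archive (more than 200 MB) and caches it
    locally; later calls read the cached copy.
    """
    lfw = fetch_lfw_people(min_faces_per_person=min_faces, resize=resize)
    return lfw['data'], lfw['target'], lfw['target_names']


# Usage (commented out because it triggers the download):
# X, y, names = load_lfw_subset()
# print(X.shape, y.shape, len(names))
```

Wrapping the call in a function keeps the large download from starting as soon as the module is imported.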