scikit-learn Tutorial => Sample datasets

Example

For ease of testing, sklearn provides some built-in datasets in the sklearn.datasets module. For example, let's load Fisher's iris dataset:

import sklearn.datasets
iris_dataset = sklearn.datasets.load_iris()
iris_dataset.keys()
['target_names', 'data', 'target', 'DESCR', 'feature_names']

You can read the full description (DESCR), the names of the features (feature_names) and the names of the classes (target_names). Those are stored as strings.
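
As a minimal sketch of reading those fields, the snippet below pulls the class names, feature names and the start of the description out of the loaded dataset. The variable names class_names and feature_names and the 200-character cutoff are illustrative choices, not part of the library.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Class names, in the order matching the integer labels in 'target'
class_names = list(iris_dataset['target_names'])

# Feature (column) names, in the order matching the columns of 'data'
feature_names = list(iris_dataset['feature_names'])

print(class_names)
print(feature_names)

# Beginning of the human-readable description (arbitrary 200-character cutoff)
print(iris_dataset['DESCR'][:200])
```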

We are interested in the data and classes, which are stored in the data and target fields. By convention those are denoted as X and y:

X, y = iris_dataset['data'], iris_dataset['target']
X.shape, y.shape
((150, 4), (150,))

import numpy
numpy.unique(y)
array([0, 1, 2])

The shapes of X and y say that there are 150 samples with 4 features each. Each sample belongs to one of the following classes: 0, 1 or 2.

X and y can now be used in training a classifier, by calling the classifier's fit() method.
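
A short sketch of that step is shown here. The choice of KNeighborsClassifier with n_neighbors=5 is only an assumption made for illustration; any scikit-learn classifier exposes the same fit() and predict() methods.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Illustrative classifier choice; the hyperparameter value is arbitrary
clf = KNeighborsClassifier(n_neighbors=5)

# Train on the full dataset (no train/test split, for brevity only)
clf.fit(X, y)

# Predict class labels for the first three samples
predicted = clf.predict(X[:3])
print(predicted)
```

Predicting on samples the model was trained on, as done here, only demonstrates the API; it says nothing about how well the classifier generalizes.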

Here is the full list of datasets provided by the sklearn.datasets module with their size and intended use:

Load with	Description	Size	Usage
`load_boston()`	Boston house-prices dataset (removed in scikit-learn 1.2)	506	regression
`load_breast_cancer()`	Breast cancer Wisconsin dataset	569	classification (binary)
`load_diabetes()`	Diabetes dataset	442	regression
`load_digits(n_class)`	Digits dataset	1797	classification
`load_iris()`	Iris dataset	150	classification (multi-class)
`load_linnerud()`	Linnerud dataset	20	multivariate regression
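
The loaders in the table all follow the same calling pattern as load_iris(). As a rough sketch, the loop below calls several of them and records the shapes of their data and target arrays. The dictionary names loaders and shapes are illustrative, and load_boston() is left out because it is not available in recent scikit-learn releases.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_linnerud,
)

# Map a short label to each loader function (labels are arbitrary)
loaders = {
    'breast_cancer': load_breast_cancer,
    'diabetes': load_diabetes,
    'digits': load_digits,
    'linnerud': load_linnerud,
}

shapes = {}
for name, loader in loaders.items():
    dataset = loader()
    # The first dimension of 'data' is the number of samples
    shapes[name] = (dataset['data'].shape, dataset['target'].shape)
    print(name, shapes[name])
```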

Note that (source: http://scikit-learn.org/stable/datasets/):

These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks.

In addition to these built-in toy sample datasets, sklearn.datasets also provides utility functions for loading external datasets:

load_mlcomp for loading sample datasets from the mlcomp.org repository (note that the datasets need to be downloaded beforehand). This function is no longer available in recent scikit-learn releases.
fetch_lfw_pairs and fetch_lfw_people for loading the Labeled Faces in the Wild (LFW) dataset from http://vis-www.cs.umass.edu/lfw/, used for face verification and face recognition respectively. This dataset is larger than 200 MB.
