machine-learning Train the first classifier: Setting a baseline with ZeroR


ZeroR is a simple classifier. It doesn't operate per instance instead it operates on general distribution of the classes. It selects the class with the largest a priori probability. It is not a good classifier in the sense that it doesn't use any information in the candidate, but it is often used as a baseline. Note: Other baselines can be used aswel, such as: Industry standard classifiers or handcrafted rules

 // First we tell our data that it's class is hidden in the last attribute
 data.setClassIndex(data.numAttributes() -1);
 // Then we split the data in to two sets
 // randomize first because we don't want unequal distributions
 data.randomize(new java.util.Random(0));
 Instances testset = new Instances(data, 0, 50);
 Instances trainset = new Instances(data, 50, 99);
 // Now we build a classifier
 // Train it with the trainset
 ZeroR classifier1 = new ZeroR();
 // Next we test it against the testset
 Evaluation Test = new Evaluation(trainset);
 Test.evaluateModel(classifier1, testset);

The largest class in the set gives you a 34% correct rate. (50 out of 149)

The actual distribution of the classes in the set

Note: The ZeroR performs around 30%. This is because we splitted randomly into a train and test set. The largest set in the train set, will thusly be the smallest in the test set. Crafting a good test/train set can be worth your while