Introduction to Supervised Learning


There are many situations in which one has a large amount of data and must use it to classify an object into one of several known classes. Consider the following situations:

Banking: When a bank receives a request from a customer for a bankcard, it has to decide whether or not to issue the card, based on the characteristics of its existing cardholders whose credit history is known.

Medical: One may be interested in developing a medical system that diagnoses whether or not a patient has a particular disease, based on the symptoms observed and the medical tests conducted on that patient.

Finance: A financial consulting firm would like to predict the trend of a stock's price, classified as upward, downward, or no trend, based on several technical features that govern the price movement.

Gene Expression: A scientist analyzing gene expression data would like to identify the most relevant genes and risk factors involved in breast cancer, in order to separate healthy individuals from breast cancer patients.

In all the above examples, an object is classified into one of several known classes based on measurements of a number of characteristics that the analyst believes discriminate between objects of different classes. These variables are called predictor variables, and the class label is called the dependent variable. Note that in all the above examples the dependent variable is categorical.

To develop a model for a classification problem, we require, for each object, data on a set of prescribed characteristics together with the class label to which the object belongs. The data set is divided into two sets in a prescribed ratio. The larger of the two is called the training data set and the other the test data set. The training data set is used to develop the model. Because the model is developed using observations whose class labels are known, such models are known as supervised learning models.
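The split into training and test data can be sketched in a few lines of Python. This is only an illustration: the function name `train_test_split`, the 70:30 ratio, and the toy bankcard data are assumptions made for the example, not part of the original text.

```python
import random


def train_test_split(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle the labelled observations and split them in the given ratio."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    train = [(rows[i], labels[i]) for i in indices[:n_train]]
    test = [(rows[i], labels[i]) for i in indices[n_train:]]
    return train, test


# Hypothetical toy data: (income in thousands, years as customer) -> card decision
rows = [(25, 1), (40, 3), (60, 5), (15, 0), (80, 10),
        (30, 2), (55, 4), (20, 1), (70, 8), (45, 6)]
labels = ["reject", "issue", "issue", "reject", "issue",
          "reject", "issue", "reject", "issue", "issue"]

train, test = train_test_split(rows, labels, train_fraction=0.7, seed=42)
print(len(train), len(test))  # 7 3
```

Shuffling before splitting is a common precaution so that any ordering in the original data does not end up concentrated in one of the two sets.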

After the model is developed, its performance is evaluated using the test data set. The objective of a classification model is to have the minimum probability of misclassification on unseen observations, that is, observations not used in developing the model.
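In practice, the probability of misclassification is usually estimated by the fraction of test observations that the model labels incorrectly. A minimal sketch follows, assuming the model's predictions for the test set are already available. The function name and the two label lists are hypothetical.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of observations whose predicted class differs from the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one observation is required")
    errors = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return errors / len(true_labels)


# Hypothetical test-set labels and the predictions of some already-fitted model.
true_labels = ["issue", "reject", "issue", "issue", "reject"]
predicted_labels = ["issue", "issue", "issue", "reject", "reject"]

rate = misclassification_rate(true_labels, predicted_labels)
print(rate)  # 0.4 (2 of 5 observations misclassified)
```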

Decision tree induction is one technique for building classification models. A decision tree model built for a categorical dependent variable is called a classification tree. In some problems the dependent variable is numeric; a decision tree model developed for a numeric dependent variable is called a regression tree.
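The difference between the two kinds of tree can be illustrated with the smallest possible tree: a single split on one numeric predictor, often called a stump. The sketch below is not a full tree-induction algorithm. The helper names, the choice of error measures (misclassification count and squared error), and the toy data are all assumptions made for illustration. A classification leaf predicts the majority class, while a regression leaf predicts the mean.

```python
from collections import Counter
from statistics import mean


def majority_class(labels):
    """Leaf prediction for a classification tree: the most common class."""
    return Counter(labels).most_common(1)[0][0]


def misclassified(labels):
    """Leaf error for classification: observations outside the majority class."""
    return len(labels) - Counter(labels).most_common(1)[0][1]


def squared_error(values):
    """Leaf error for regression: sum of squared deviations from the leaf mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(xs, ys, leaf_value, leaf_error):
    """Pick the threshold on one predictor that minimises the total leaf error."""
    best = None
    candidates = sorted(set(xs))
    # Try the midpoint between each pair of adjacent distinct predictor values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        error = leaf_error(left) + leaf_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, leaf_value(left), leaf_value(right))
    if best is None:
        raise ValueError("need at least two distinct predictor values")
    _, threshold, left_value, right_value = best

    def predict(x):
        return left_value if x <= threshold else right_value

    return threshold, predict


# Hypothetical toy data: one predictor (income) and two kinds of dependent variable.
income = [15, 20, 25, 30, 40, 45, 55, 60]
decision = ["reject", "reject", "reject", "reject", "issue", "issue", "issue", "issue"]
credit_limit = [1.0, 1.0, 1.5, 1.5, 4.0, 4.5, 5.0, 5.5]

# Categorical dependent variable -> classification stump.
clf_threshold, classify = fit_stump(income, decision, majority_class, misclassified)
# Numeric dependent variable -> regression stump.
reg_threshold, regress = fit_stump(income, credit_limit, mean, squared_error)

print(clf_threshold, classify(20), classify(50))  # 35.0 reject issue
print(reg_threshold, regress(20), regress(50))    # 35.0 1.25 4.75
```

The search for the split is identical in both cases. Only the leaf prediction and the error measure change with the type of the dependent variable.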