Evaluating what's been learned

Issues

How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data. This error can often be reduced to 0 easily, but what we need is a model that generalizes beyond the training data.
Solution: split data into training and test set
However, to create a good model we need a large training set and to evaluate its performance we need a large test set as well. So, what we really need is lots of preclassified data.
We also need to assess the statistical reliability of estimated differences in performance (significance tests).
Performance measures:

Number of correct classifications
Accuracy of probability estimates
Costs assigned to different types of errors

To improve classifier accuracy we may combine multiple models.
We can measure the predictiveness of a classifier by applying the Minimum Description Length (MDL) principle.

Training and testing

Error rate

Natural performance measure for classification problems.
Success: instance class is predicted correctly.
Error: instance class is predicted incorrectly.
(Observed) Error rate: proportion of errors made over the whole set of instances tested.
Resubstitution error: error on the training set (an overly optimistic measure!).
True error rate: the actual error rate on the whole population (usually estimated because in most cases the whole population is not available).

Testing

Test set: a set of instances that have not been used in the training process.
Assumption: training set and test set are both representative samples of the same larger population.
Some classifiers work in two steps (often iteratively):

Step 1: learning the basic structure.
Step 2: optimizing parameters that are used in learning.

The test set must not be used in any way in the training process (even for parameter tuning, in Step 2).
There may be another independent set of instances used for optimizing parameters (validation set). That is, we split the set of known data into three: training set, validation set and test set.
Holdout procedure: the method of splitting data into training and test set. Dilemma: the balance between training and test set (more training data gives a better model, more test data gives a more reliable error estimate).
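The three-way split described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name three_way_split and the 60/20/20 proportions are assumptions chosen for the example.

```python
import random

def three_way_split(instances, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the instances and cut them into training, validation and test lists.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation parts becomes the test set.
    """
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # never touched during training or tuning
    return train, valid, test

train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

Parameters would be tuned against valid only, and test would be used once, for the final estimate.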

Predicting performance (true success/error rate).

Testing just estimates the probability of success on unknown data (data used in neither training nor testing).
How good is this estimate? (What is the true success/error rate?) We need confidence intervals (a kind of statistical reasoning) to predict this.
Assume that success and error are the two possible outcomes of a statistical experiment. For large N, the observed success rate is approximately normally distributed.
Bernoulli process: we have made N experiments and got S successes. Then the observed success rate is P=S/N. What is the true success rate p?
Example:

N=100, S=75. Then with confidence 80% the true success rate p is in [0.691, 0.801].
N=1000, S=750. Then with confidence 80% the true success rate p is in [0.732, 0.767].
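The notes do not say which formula produced these intervals. As an assumption, the sketch below uses the Wilson score interval for a Bernoulli proportion, which gives values close to the ones quoted. The function name wilson_interval is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, trials, confidence=0.80):
    """Wilson score interval for the true success rate of a Bernoulli process.

    Assumption: this is one standard way to build the interval; the source
    does not name the method it used.
    """
    f = successes / trials                                # observed success rate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)    # two-sided normal quantile
    denom = 1 + z * z / trials
    centre = f + z * z / (2 * trials)
    spread = z * sqrt(f * (1 - f) / trials + z * z / (4 * trials * trials))
    return (centre - spread) / denom, (centre + spread) / denom

for s, n in [(75, 100), (750, 1000)]:
    low, high = wilson_interval(s, n)
    print(f"N={n}, S={s}: [{low:.3f}, {high:.3f}]")
```

Note how the interval narrows as N grows, even though the observed rate is 75% in both cases.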

Estimating classifier accuracy

Holdout

Reserve a certain amount for testing and use the remainder for training (usually 1/3 for testing, 2/3 for training).
Problem: the samples might not be representative. For example, some classes might be represented by very few instances or even by no instances at all.
Solution: stratification, i.e. sampling for training and testing within classes. This ensures that each class is represented in approximately equal proportions in both subsets.
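A stratified holdout can be sketched by sampling within each class separately. The helper stratified_holdout and the toy labels below are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_frac=1/3, seed=0):
    """Return (train_idx, test_idx), reserving test_frac of every class for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)              # group instance indices by class
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)               # random sample within the class
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy data: 30 "yes" and 60 "no" instances.
labels = ["yes"] * 30 + ["no"] * 60
train_idx, test_idx = stratified_holdout(labels)
print(len(train_idx), len(test_idx))
```

Both subsets keep the original 1:2 class ratio, which a plain random split would only achieve on average.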

Repeated holdout. Success/error estimate can be made more reliable by repeating the process with different subsamples.

In each iteration, a certain proportion is randomly selected for training (possibly with stratification).
The error rates on the different iterations are averaged to yield an overall error rate.
Problem: the different test sets may overlap. Can we prevent overlapping?

Cross-validation (CV). Avoids overlapping test sets.

k-fold cross-validation

First step: data is split into k subsets of equal size (usually by random sampling).
Second step: each subset in turn is used for testing and the remainder for training.
The error estimates are averaged to yield an overall error estimate.
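The two steps above can be sketched as follows. The learner is a deliberately trivial majority-class predictor, and all function names (k_fold_indices, cross_validate, train_majority, error_rate) are hypothetical helpers for this example.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """First step: split range(n) into k disjoint folds of (nearly) equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, error_fn, seed=0):
    """Second step: use each fold once for testing and the rest for training,
    then average the k error estimates."""
    errors = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        errors.append(error_fn(train_fn(train), test))
    return sum(errors) / k

def train_majority(train):
    """Hypothetical learner: always predict the most frequent training class."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def error_rate(model, test):
    return sum(label != model for _, label in test) / len(test)

# Toy data: 80 instances of class "a" and 20 of class "b".
data = [(x, "a") for x in range(80)] + [(x, "b") for x in range(20)]
cv_error = cross_validate(data, 10, train_majority, error_rate)
print(cv_error)
```

Because the folds are disjoint, every instance is tested exactly once, which is what distinguishes this from repeated holdout.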

Stratified cross-validation: subsets are stratified before the cross-validation is performed.
Stratified ten-fold cross-validation

Standard method for evaluation. Extensive experiments have shown that this is the best choice to get an accurate estimate. There is also some theoretical evidence for this.
Stratification reduces the estimate's variance.
Repeated stratified cross-validation is even better. Ten-fold cross-validation is repeated ten times and results are averaged.

Leave-one-out cross-validation (LOO CV).

LOO CV is an n-fold cross-validation, where n is the number of training instances. That is, n classifiers are built, one for each possible (n-1)-element subset of the training set, and each is tested on the remaining single instance.
LOO CV makes maximum use of the data.
No random subsampling is involved.
Problems

LOO CV is very computationally expensive.
Stratification is not possible. Actually this method guarantees a non-stratified sample (there is only one instance in the test set).
Worst case example: assume a completely random dataset with two classes, each represented by 50% of the instances. The best classifier for this data is the majority predictor. LOO CV will estimate a 100% error rate (!) for this classifier (explain why).
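The worst case can be checked with a small experiment. The sketch below assumes a balanced toy dataset of 25 "yes" and 25 "no" labels; the helper names are illustrative.

```python
from collections import Counter

def majority_class(labels):
    """Majority predictor: the most frequent class among the training labels."""
    return Counter(labels).most_common(1)[0][0]

def loo_error(labels):
    """Leave-one-out error of the majority predictor (attributes play no role)."""
    errors = 0
    for i, actual in enumerate(labels):
        rest = labels[:i] + labels[i + 1:]     # train on the other n-1 instances
        if majority_class(rest) != actual:     # test on the one left out
            errors += 1
    return errors / len(labels)

labels = ["yes", "no"] * 25                    # exactly 50% of each class
print(loo_error(labels))
```

Removing the test instance always leaves its own class in the minority of the remaining n-1 instances, so the majority predictor is wrong on every fold.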

Bootstrapping

CV uses sampling without replacement. That is, the same instance, once selected, cannot be selected again for a particular training/test set.
The bootstrap is an estimation method that uses sampling with replacement to form the training set.
Training set: a dataset of n instances is sampled n times with replacement to form a training set of n instances (possibly with repetitions).
Test set: the instances from the original dataset that don't occur in the training set.
0.632 bootstrap:

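In a single draw, a particular instance has probability 1-1/n of not being picked. Thus, over n draws an instance is never picked, and so falls in the test set, with probability (1-1/n)ⁿ ≈ 1/e ≈ 0.368 for large n.
This means that the training data will contain approximately 63.2% of the distinct instances. The error measured on the test set is therefore quite pessimistic, because the classifier was trained on only about 63% of the distinct instances. The 0.632 bootstrap compensates by combining it with the resubstitution error: error = 0.632 × (error on test instances) + 0.368 × (error on training instances).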
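The 0.368 figure can be checked empirically. The sketch below draws one bootstrap sample and compares the fraction of left-out instances with (1-1/n)ⁿ; the function name bootstrap_split and the choice n = 10000 are assumptions for the illustration.

```python
import random

def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; unpicked indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]      # may contain repetitions
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]   # instances never drawn
    return train, test

n = 10_000
train, test = bootstrap_split(n)
test_fraction = len(test) / n       # empirical share of left-out instances
limit = (1 - 1 / n) ** n            # theoretical probability of being left out
print(test_fraction, limit)
```

Both numbers should come out near 0.368, leaving roughly 63.2% of the distinct instances in the training set.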

Bootstrapping is probably the best way to estimate the error for very small datasets.

Counting the cost

Different types of classification errors often incur different costs.
Example: predict cancer. Compare the cost of predicting "no" when the actual classification is "yes" and predicting "yes" when the actual classification is "no". Obviously the first error is much more costly.
Confusion matrix

Actual \ Predicted class	yes	no
yes	True positive (TP)	False negative (FN)
no	False positive (FP)	True negative (TN)

Total error = (FP+FN)/(TP+FP+TN+FN)
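The confusion matrix and the total error can be computed directly from lists of actual and predicted classes. In the sketch below, the function names, the toy predictions and the cost values (a false negative costing ten times a false positive) are all assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, TN and FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def total_error(tp, fp, tn, fn):
    return (fp + fn) / (tp + fp + tn + fn)

def average_cost(tp, fp, tn, fn, cost_fp=1.0, cost_fn=10.0):
    """Average cost per instance; the cost values are hypothetical."""
    return (fp * cost_fp + fn * cost_fn) / (tp + fp + tn + fn)

actual    = ["yes", "yes", "yes", "no", "no",  "no", "no", "no"]
predicted = ["yes", "no",  "yes", "no", "yes", "no", "no", "no"]
counts = confusion_counts(actual, predicted)
print(counts, total_error(*counts), average_cost(*counts))
```

Two classifiers with the same total error can have very different average costs once the two error types are priced differently.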
Lift charts

Sort instances according to their predicted probability of being positive, highest first.
X axis is sample size and Y axis is number of true positives (TP).
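The points of a lift chart can be computed in one pass over the sorted instances. A minimal sketch, assuming each instance is given as a (predicted probability, is actually positive) pair; lift_points is a hypothetical name.

```python
def lift_points(scored):
    """Return (sample size, number of true positives) pairs for a lift chart.

    scored: list of (predicted_probability_of_positive, is_positive) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points, positives = [], 0
    for size, (_, is_positive) in enumerate(ranked, start=1):
        positives += is_positive           # True counts as 1, False as 0
        points.append((size, positives))
    return points

scored = [(0.9, True), (0.8, True), (0.7, False),
          (0.6, True), (0.4, False), (0.2, False)]
print(lift_points(scored))
```

A good model makes the curve rise steeply at first, since most positives are found in a small top-ranked sample.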

ROC curves (ROC means receiver operating characteristic, a term from signal processing)

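X axis shows the false positive rate: the percentage of the negative instances in the sample that are classified as positive, FP / (FP + TN).
Y axis shows the true positive rate: the percentage of the positive instances in the sample that are classified as positive, TP / (TP + FN).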
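ROC points can be generated from the same ranked list used for a lift chart. The sketch below reads the two percentages as rates over the actual negatives and actual positives, ignores tied scores, and assumes the sample contains at least one instance of each class; roc_points is a hypothetical name.

```python
def roc_points(scored):
    """Return (FP %, TP %) points, one per cut-off position in the ranking.

    scored: list of (predicted_probability_of_positive, is_positive) pairs.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]                  # nothing predicted positive yet
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((100 * fp / total_neg, 100 * tp / total_pos))
    return points

scored = [(0.9, True), (0.7, False), (0.6, True), (0.3, False)]
print(roc_points(scored))
```

Unlike the lift chart, both axes are percentages, so curves from samples of different sizes can be compared.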

Recall - precision (information retrieval):

Precision (retrieved relevant / total retrieved) = TP / (TP+FP)
Recall (retrieved relevant / total relevant) = TP / (TP + FN)
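Both measures follow directly from the confusion matrix counts. In this small sketch, returning 0.0 when a denominator is zero is a convention assumed here, not something stated in the notes.

```python
def precision(tp, fp):
    """Share of retrieved (predicted positive) instances that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Share of relevant (actually positive) instances that were retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

# Illustrative counts: 40 true positives, 10 false positives, 20 false negatives.
print(precision(40, 10), recall(40, 20))
```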

Combining multiple models

Basic idea: meta learning

Build different models and combine their classifications
Advantage: often improves predictive accuracy
Disadvantage: the output is hard to understand (does not work for explanation).
Three basic approaches: bagging, boosting and stacking.

Bagging

Combining predictions by voting or averaging (for numeric prediction).
Each model receives equal weight.
Algorithm:

Generate several training sets of size n by sampling with replacement from the original training set of size n.
Build a classifier for each training set.
Combine the classifiers' predictions by voting/averaging.

This improves performance in almost all cases.
Usually, the more classifiers the better.
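The bagging algorithm can be sketched with a very simple base learner. Everything specific here is an assumption for illustration: the base learner is a one-threshold rule (a decision stump) on a single numeric attribute with classes "a" and "b", and 25 models are built.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical base learner: the single threshold rule with fewest errors."""
    best = None
    for threshold, _ in sample:
        for left, right in (("a", "b"), ("b", "a")):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(train, n_models=25, seed=0):
    """Build one classifier per bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(train)
    models = []
    for _ in range(n_models):
        sample = [train[rng.randrange(n)] for _ in range(n)]  # with replacement
        models.append(train_stump(sample))
    return models

def vote(models, x):
    """Each model gets one equally weighted vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

# Toy data: class "a" below 50, class "b" from 50 upwards.
train = [(x, "a" if x < 50 else "b") for x in range(100)]
models = bagging(train)
print(vote(models, 10), vote(models, 90))
```

For numeric prediction, vote would be replaced by an average of the individual predictions.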

Boosting

Uses voting/averaging but models are weighted according to their performance.
Iterative procedure: new models are influenced by performance of previously built ones.
Information is passed between iterations by weights assigned to instances.
When a new classifier is created the instances are reweighted according to the classifier's output:

The weights of the correctly classified instances are decreased.
The weights of the incorrectly classified instances are increased.

The classifier error is calculated as the sum of the weights of the misclassified instances divided by the total weight of all instances. In this way a series of models that complement one another is created.
Classification:

Each classifier receives a weight according to its performance on the weighted data: weight = -log(e/(1-e)), where e is the classifier error.
Weights of all classifiers that vote for a particular class are summed, and the class with the highest total is selected.
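The procedure can be sketched along the lines of AdaBoost.M1. The notes only say that weights are decreased or increased; the specific update used here (multiply the weights of correctly classified instances by e/(1-e), then renormalize) is an assumption, as are the weighted decision stump used as the weak learner, the stopping rule and the toy data with classes +1 and -1.

```python
from math import log

def train_weighted_stump(data, weights):
    """Hypothetical weak learner: threshold rule with the least weighted error."""
    best = None
    for threshold, _ in data:
        for left, right in ((1, -1), (-1, 1)):
            predict = lambda x, t=threshold, l=left, r=right: l if x <= t else r
            err = sum(w for (x, y), w in zip(data, weights) if predict(x) != y)
            if best is None or err < best[0]:
                best = (err, predict)
    return best[1]

def boost(data, rounds=10):
    n = len(data)
    weights = [1.0 / n] * n                      # start with equal weights
    ensemble = []                                # (model, voting weight) pairs
    for _ in range(rounds):
        model = train_weighted_stump(data, weights)
        wrong = sum(w for (x, y), w in zip(data, weights) if model(x) != y)
        e = wrong / sum(weights)                 # weighted classifier error
        if e == 0 or e >= 0.5:                   # assumed stopping rule
            if e == 0 and not ensemble:
                ensemble.append((model, 1.0))    # a perfect single model suffices
            break
        ensemble.append((model, -log(e / (1 - e))))
        # Shrink the weights of correctly classified instances, then renormalize,
        # so that misclassified instances gain relative weight.
        weights = [w * e / (1 - e) if model(x) == y else w
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Sum the voting weights per class and return the class with the highest total."""
    totals = {}
    for model, vote_weight in ensemble:
        label = model(x)
        totals[label] = totals.get(label, 0.0) + vote_weight
    return max(totals, key=totals.get)

# Toy data that no single threshold rule can classify perfectly.
data = list(zip(range(10), [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]))
ensemble = boost(data, rounds=3)
print([classify(ensemble, x) for x, _ in data])
```

On this toy set the best single stump misclassifies three instances, while the weighted vote of three complementary stumps classifies all ten training instances correctly.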

Stacking

Uses a meta classifier instead of voting to combine the predictions of the base classifiers.
Predictions of base learners (level-0 models) are used as input for the meta learner (level-1 model).
Base learners and the meta learner usually use different learning schemes.

Minimum Description Length Principle

This topic is covered in a separate PDF document.