Evaluating what's been learned
Issues

How predictive is the model we learned?

Error on the training data is not a good indicator of performance on future
data. This error can often be reduced to 0 easily, but what we need is a model
that generalizes beyond the training data.

Solution: split data into training and test set

However, to create a good model we need a large training set and to evaluate
its performance we need a large test set as well. So, what we really need
is lots of preclassified data.

We also need to assess the statistical reliability of estimated differences
in performance (significance tests).

Performance measures:

Number of correct classifications

Accuracy of probability estimates

Costs assigned to different types of errors

To improve classifier accuracy we may combine multiple models.

We can measure the predictiveness of a classifier by applying the Minimum
Description Length (MDL) principle.

Training and testing

Error rate

Natural performance measure for classification problems.

Success: instance class is predicted correctly.

Error: instance class is predicted incorrectly.

(Observed) Error rate: proportion of errors made over the whole
set of instances tested.

Resubstitution error: error on the training set (an overly optimistic
measure!).

True error rate: the actual error rate on the whole population (usually
estimated because in most cases the whole population is not available).
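
The observed error rate defined above is a simple proportion. A minimal sketch follows, assuming class labels are held in two parallel lists; the function name and the sample labels are hypothetical.

```python
def error_rate(predicted, actual):
    """Proportion of instances whose predicted class differs from the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute an error rate on an empty set")
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


# Hypothetical labels, for illustration only.
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes"]
rate = error_rate(predicted, actual)  # 2 errors out of 5 instances
```

Computed on the training set this gives the resubstitution error; computed on unseen instances it gives an estimate of the true error rate.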

Testing

Test set: a set of instances that have not been used in the training
process.

Assumption: training set and test set are both representative samples
of the same larger population.

Some classifiers work in two steps (often iteratively):

Step 1: learning the basic structure.

Step 2: optimizing parameters that are used in learning.

The test set must not be used in any way in the training
process (even for parameter tuning, in Step 2).

There may be another independent set of instances used for optimizing
parameters (the validation set). That is, we split the set of known data into
three: training set, validation set and test set.
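
A three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name are assumptions made for the example, not values prescribed by these notes.

```python
import random


def three_way_split(instances, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle a copy of the data and cut it into disjoint train/validation/test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # never touched during training or tuning
    return train, valid, test


train, valid, test = three_way_split(range(100))
```

Parameters would be tuned against `valid` only, so that `test` stays untouched until the final evaluation.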

Holdout procedure: the method of splitting data into training and
test set. Dilemma: the balance between training and test set.

Predicting performance (true success/error rate).

Testing just estimates the probability of success on unknown data (data
not used in either training or testing).

How good is this estimate? (What is the true success/error rate?)
We need confidence intervals (a kind of statistical reasoning) to
predict this.

Assume that success and error are the two possible outcomes of a statistical
experiment. For large N the observed success rate is then approximately
normally distributed.

Bernoulli process: We have made N experiments and got S successes.
Then, the observed success rate is P=S/N. What is the true
success rate?

Example:

N=100, S=75. Then with confidence 80% the true success rate is in
[0.691, 0.801].

N=1000, S=750. Then with confidence 80% the true success rate is in
[0.732, 0.767].
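
The notes do not say how these intervals were computed. One formula that reproduces them is the Wilson score interval with z = 1.28, the normal quantile assumed here for a two-sided 80% confidence level. The sketch below rests on that assumption, and the function name is hypothetical.

```python
import math


def success_rate_interval(successes, trials, z=1.28):
    """Wilson score interval for the true success rate of a Bernoulli process.

    z is the standard normal quantile; z = 1.28 is assumed here for a
    two-sided 80% confidence level.
    """
    f = successes / trials  # observed success rate S/N
    z2 = z * z
    centre = f + z2 / (2 * trials)
    spread = z * math.sqrt(f * (1 - f) / trials + z2 / (4 * trials ** 2))
    denom = 1 + z2 / trials
    return (centre - spread) / denom, (centre + spread) / denom


low_100, high_100 = success_rate_interval(75, 100)
low_1000, high_1000 = success_rate_interval(750, 1000)
```

The interval for N=1000 is much narrower than the one for N=100, although the observed success rate is the same in both cases.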

Estimating classifier accuracy

Holdout

Reserve a certain amount for testing and use the remainder for training
(usually 1/3 for testing, 2/3 for training).

Problem: the samples might not be representative. For example, some classes
might be represented by very few instances or even by none at all.

Solution: stratification, that is, sampling for training and testing within
each class. This ensures that each class is represented in approximately
equal proportions in both subsets.
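
A stratified holdout split can be sketched by sampling within each class separately. The function name, the fixed seed and the imbalanced sample data are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_frac=1 / 3, seed=0):
    """Return (train_idx, test_idx), sampling test instances within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)  # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced data: 90 "no" instances and 30 "yes" instances.
labels = ["no"] * 90 + ["yes"] * 30
train_idx, test_idx = stratified_holdout(labels)
```

Both subsets keep the original 3:1 class ratio, which a purely random split does not guarantee.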

Repeated holdout. Success/error estimate can be made more reliable by repeating
the process with different subsamples.

In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)

The error rates on the different iterations are averaged to yield an overall
error rate.

Problem: the different test sets may overlap. Can we prevent overlapping?

Cross-validation (CV). Avoids overlapping test sets.

k-fold cross-validation

First step: data is split into k subsets of equal size (usually by random
sampling).

Second step: each subset in turn is used for testing and the remainder
for training.

The error estimates are averaged to yield an overall error estimate.

Stratified cross-validation: subsets are stratified before the
cross-validation is performed.
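
The two steps, with stratification, can be sketched as below. To keep the example short the learner is a toy majority-class predictor that ignores the attributes; that learner, the function names and the sample data are all assumptions made for the illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each instance index to one of k folds, dealing out each class in turn."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds


def cross_validated_error(labels, k, train, seed=0):
    """Average the test error over k folds; train(labels) returns a predict function."""
    fold_errors = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        predict = train(train_labels)
        errors = sum(1 for i in test_idx if predict(i) != labels[i])
        fold_errors.append(errors / len(test_idx))
    return sum(fold_errors) / k


def train_majority(train_labels):
    """Toy learner (an assumption for the demo): always predict the most common class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _instance: majority


labels = ["no"] * 70 + ["yes"] * 30
folds = stratified_folds(labels, k=10)
error = cross_validated_error(labels, k=10, train=train_majority)
```

Each instance appears in exactly one test fold, so the test sets do not overlap.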

Stratified tenfold cross-validation

Standard method for evaluation. Extensive experiments have shown that this
is the best choice to get an accurate estimate. There is also some
theoretical evidence for this.

Stratification reduces the estimate's variance.

Repeated stratified cross-validation is even better. Tenfold cross-validation
is repeated ten times and the results are averaged.

Leave-one-out cross-validation (LOO CV).

LOO CV is n-fold cross-validation, where n is the number of training
instances. That is, n classifiers are built, one for each possible
(n-1)-element subset of the training set, and each is tested on the single
remaining instance.

LOO CV makes maximum use of the data.

No random subsampling is involved.

Problems

LOO CV is very computationally expensive.

Stratification is not possible. Actually this method guarantees a
non-stratified sample (there is only one instance in the test set).

Worst case example: assume a completely random dataset with two
classes, each represented by 50% of the instances. The best classifier
for this data is the majority predictor, whose true error rate is 50%.
LOO CV will nevertheless estimate a 100% error rate (!) for this classifier
(explain why).
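
One way to check this case is to run it. The sketch assumes the majority predictor is retrained on the n-1 remaining instances in every fold; the function name and data are hypothetical. Removing an instance always turns its own class into the minority of the training set, so the held-out instance is always misclassified.

```python
from collections import Counter


def loo_error_majority(labels):
    """Leave-one-out error of a majority-class predictor."""
    errors = 0
    for i, held_out in enumerate(labels):
        training = labels[:i] + labels[i + 1:]  # all instances except number i
        majority = Counter(training).most_common(1)[0][0]
        if majority != held_out:
            errors += 1
    return errors / len(labels)


# Two classes, each with exactly 50% of the instances.
labels = ["a"] * 50 + ["b"] * 50
loo_error = loo_error_majority(labels)
```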

Bootstrapping

CV uses sampling without replacement. That is, the same instance, once
selected, cannot be selected again for a particular training/test set.

The bootstrap is an estimation method that uses sampling with replacement
to form the training set.

Training set: a dataset of n instances is sampled n times with replacement
to form a training set of n instances (possibly with repetitions).

Test set: the instances from the original dataset that don't occur in the
training set.

0.632 bootstrap:

On each draw, a particular instance has a probability of (1 - 1/n) of not
being selected for the training set. Thus, after n draws, an instance will
fall in the test set with probability (1 - 1/n)^{n} ≈ e^{-1} ≈ 0.368
(for large n).

This means that the training data will contain approximately 63.2% of the
distinct instances. Because the classifier is trained on so little of the
data, the error estimate on the test set will be very pessimistic.
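
The 0.368 figure can be checked empirically with a short sketch. The function names and the seed are assumptions. The second function shows the commonly cited way of offsetting the pessimism, which combines the test error with the resubstitution error using weights 0.632 and 0.368; that formula is not stated in these notes.

```python
import math
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement for training; unseen indices form the test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx


def bootstrap_632_error(test_error, training_error):
    """Commonly cited 0.632 bootstrap combination (an addition, not from the notes)."""
    return 0.632 * test_error + 0.368 * training_error


n = 10_000
train_idx, test_idx = bootstrap_split(n)
test_fraction = len(test_idx) / n  # close to 0.368 for large n
limit = (1 - 1 / n) ** n           # approaches 1/e as n grows
```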

Bootstrapping is probably the best way of estimating the error for very small
datasets.

Counting the cost

Different types of classification errors often incur different costs.

Example: predict cancer. Compare the cost of predicting "no" when the actual
classification is "yes" and predicting "yes" when the actual classification
is "no". Obviously the first error is much more costly.

Confusion matrix

Actual \ Predicted class | yes                 | no
yes                      | True positive (TP)  | False negative (FN)
no                       | False positive (FP) | True negative (TN)

Total error = (FP+FN)/(TP+FP+TN+FN)
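
The four cells and the total error can be computed as sketched below. The function names, the sample labels and the cost values (a false negative costing ten times a false positive) are assumptions made for the example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FN, FP and TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def total_error(tp, fn, fp, tn):
    """Total error = (FP + FN) / (TP + FP + TN + FN)."""
    return (fp + fn) / (tp + fp + tn + fn)


def total_cost(fn, fp, cost_fn, cost_fp):
    """Weight each type of error by its (hypothetical) cost."""
    return fn * cost_fn + fp * cost_fp


actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "no", "yes", "yes", "no", "no", "no", "no"]
tp, fn, fp, tn = confusion_counts(actual, predicted)
error = total_error(tp, fn, fp, tn)
cost = total_cost(fn, fp, cost_fn=10, cost_fp=1)
```

Two classifiers with the same total error can have very different total costs once the two error types are weighted differently.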

Lift charts

Sort instances in descending order of their predicted probability of being
positive.

X axis is sample size and Y axis is number of true positives
(TP).
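
The points of a lift chart can be generated as sketched below, assuming each instance comes with a predicted probability of the positive class. The function name and the sample scores are hypothetical.

```python
def lift_points(scores, actual, positive="yes"):
    """Return (sample_size, true_positives) pairs after ranking by descending score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points = []
    hits = 0
    for size, (_score, label) in enumerate(ranked, start=1):
        if label == positive:
            hits += 1
        points.append((size, hits))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
actual = ["yes", "yes", "no", "yes", "no", "no"]
points = lift_points(scores, actual)
```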

ROC curves (ROC stands for receiver operating characteristic, a term from
signal processing)

X axis shows the number of false positives (FP) in the sample, as a
percentage of all negatives.

Y axis shows the number of true positives (TP) in the sample, as a
percentage of all positives.
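
The points of an ROC curve can be traced by walking down the ranked list of instances. The sketch assumes the percentages are taken relative to all negatives (X axis) and all positives (Y axis), and it ignores tied scores; the function name and the sample data are hypothetical.

```python
def roc_points(scores, actual, positive="yes"):
    """Return (fp_percent, tp_percent) pairs, starting from (0, 0)."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for label in actual if label == positive)
    n_neg = len(actual) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in ranked:
        if label == positive:
            tp += 1  # step up
        else:
            fp += 1  # step right
        points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
actual = ["yes", "yes", "no", "yes", "no", "no"]
points = roc_points(scores, actual)
```

A curve that rises quickly towards the top left corner indicates a ranking that places positives ahead of negatives.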

Recall-precision (information retrieval):

Precision (retrieved relevant / total retrieved) = TP / (TP+FP)

Recall (retrieved relevant / total relevant) = TP / (TP + FN)
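
Both measures follow directly from the confusion matrix counts. In this sketch the counts are hypothetical, and returning 0.0 for an empty denominator is a convention assumed for the example.

```python
def precision(tp, fp):
    """Retrieved relevant / total retrieved."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0  # assumed convention for 0/0


def recall(tp, fn):
    """Retrieved relevant / total relevant."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0  # assumed convention for 0/0


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
p = precision(tp=30, fp=10)
r = recall(tp=30, fn=20)
```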

Combining multiple models

Basic idea: meta learning

Build different models and combine their classifications

Advantage: often improves predictive accuracy

Disadvantage: the output is hard to understand (it does not work for explanation).

Three basic approaches: bagging, boosting and stacking.

Bagging

Combining predictions by voting or averaging (for numeric prediction).

Each model receives equal weight.

Algorithm:

Generate several training sets of size n by sampling with replacement
from the original training set of size n.

Build a classifier for each training set.

Combine the classifiers' predictions by voting/averaging.

This improves performance in almost all cases.

The more classifiers the better.
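
The bagging algorithm can be sketched as follows. The base learner is a toy threshold rule on a single numeric attribute with classes "yes" and "no"; that learner, the function names, the number of models and the sample data are all assumptions made for the illustration.

```python
import random
from collections import Counter


def train_stump(points):
    """Toy base learner (assumed for the demo): best single threshold on a 1-D feature."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for low_label, high_label in (("no", "yes"), ("yes", "no")):
            errors = sum(
                1 for x, y in points
                if (low_label if x < threshold else high_label) != y
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x < threshold else high_label


def bagging(points, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by unweighted majority vote."""
    rng = random.Random(seed)
    n = len(points)
    models = []
    for _ in range(n_models):
        sample = [points[rng.randrange(n)] for _ in range(n)]  # with replacement
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)  # each model has equal weight
        return votes.most_common(1)[0][0]

    return predict


points = [(x, "yes" if x >= 5 else "no") for x in range(10)]
predict = bagging(points)
```

For numeric prediction the vote would be replaced by the average of the models' outputs.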

Boosting

Uses voting/averaging but models are weighted according to their performance.

Iterative procedure: new models are influenced by performance of previously
built ones.

Information is passed between iterations by weights assigned to instances.

When a new classifier is created the instances are reweighted according
to the classifier's output:

The weights of the correctly classified instances are decreased.

The weights of the incorrectly classified instances are increased.

The classifier error is calculated as the sum of the weights of the
misclassified instances divided by the total weight of all instances. In this
way a series of models that complement one another is created.

Classification:

Each classifier receives a weight according to its performance on the weighted
data: weight = -log(e/(1-e)), where e is the classifier error.

Weights of all classifiers that vote for a particular class are summed,
and the class with the highest total is selected.
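
The procedure resembles AdaBoost.M1, and the sketch below follows that reading as an assumption. The classifier weight is taken as log((1-e)/e), correctly classified instances have their weights multiplied by e/(1-e), and all weights are then renormalised. The toy threshold learner, the stopping rule, the function names and the sample data are illustrative choices.

```python
import math
from collections import defaultdict


def train_weighted_stump(points, weights):
    """Toy base learner (assumed): threshold rule with the lowest weighted error."""
    labels = sorted({y for _, y in points})
    best = None
    for threshold in sorted({x for x, _ in points}):
        for low in labels:
            for high in labels:
                err = sum(
                    w for (x, y), w in zip(points, weights)
                    if (low if x < threshold else high) != y
                )
                if best is None or err < best[0]:
                    best = (err, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def boost(points, n_rounds=3):
    n = len(points)
    weights = [1.0 / n] * n
    ensemble = []  # list of (model, model_weight)
    for _ in range(n_rounds):
        model = train_weighted_stump(points, weights)
        wrong = [model(x) != y for x, y in points]
        e = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
        if e == 0 or e >= 0.5:
            if not ensemble:
                ensemble.append((model, 1.0))  # fallback (assumption): keep one model
            break
        ensemble.append((model, math.log((1 - e) / e)))
        # Shrink the weights of correct instances, then renormalise so that
        # misclassified instances gain weight in relative terms.
        weights = [w if bad else w * e / (1 - e) for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        totals = defaultdict(float)
        for model, model_weight in ensemble:
            totals[model(x)] += model_weight
        return max(totals, key=totals.get)

    return predict


# No single threshold separates this data, but three weighted stumps do.
points = [(0, "no"), (1, "no"), (2, "yes"), (3, "yes"), (4, "no"), (5, "no")]
predict = boost(points, n_rounds=3)
```

On this data the three stumps receive weights log 2, log 3 and log 5, and their weighted vote classifies all six training instances correctly.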

Stacking

Uses a meta classifier instead of voting to combine the predictions of
the base classifiers.

Predictions of base learners (level-0 models) are used as input for the
meta learner (level-1 model).

Base learners and the meta learner usually use different learning schemes.
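
A minimal stacking sketch follows. The level-0 models are toy threshold rules on two different attributes, and the level-1 model is a lookup table over their joint predictions. The even/odd split of the data between the two levels is an assumption made to keep the example deterministic (held-out or cross-validated predictions are more usual), and all names and data are hypothetical.

```python
from collections import Counter, defaultdict


def train_stump(instances, labels, feature):
    """Level-0 learner (toy, assumed): best threshold rule on one feature."""
    classes = sorted(set(labels))
    best = None
    for threshold in sorted({inst[feature] for inst in instances}):
        for low in classes:
            for high in classes:
                errors = sum(
                    1 for inst, y in zip(instances, labels)
                    if (low if inst[feature] < threshold else high) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda inst: low if inst[feature] < threshold else high


def train_table(meta_instances, labels):
    """Level-1 learner (toy, assumed): most common class per tuple of base predictions."""
    counts = defaultdict(Counter)
    for key, y in zip(meta_instances, labels):
        counts[key][y] += 1
    default = Counter(labels).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return lambda key: table.get(key, default)


def stack(instances, labels):
    """Fit level-0 models on even-indexed data and the level-1 model on the rest."""
    base_idx = range(0, len(instances), 2)
    meta_idx = range(1, len(instances), 2)
    base_x = [instances[i] for i in base_idx]
    base_y = [labels[i] for i in base_idx]
    level0 = [train_stump(base_x, base_y, feature) for feature in (0, 1)]
    # Level-0 predictions on unseen instances become the level-1 training data.
    meta_x = [tuple(m(instances[i]) for m in level0) for i in meta_idx]
    meta_y = [labels[i] for i in meta_idx]
    level1 = train_table(meta_x, meta_y)
    return lambda inst: level1(tuple(m(inst) for m in level0))


instances = [(x1, x2) for x1 in range(10) for x2 in range(10)]
labels = ["yes" if x1 + x2 >= 9 else "no" for x1, x2 in instances]
predict = stack(instances, labels)
```

Neither threshold rule is accurate on its own here, but the level-1 table learns to answer "no" only when both level-0 models say "no".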

Minimum Description Length Principle

This topic is covered in a separate PDF document.