Machine Learning - Spring 2004
==============================
Lab experiments 10
------------------
Program: knn.pl
Data: loandata.pl
-----------------
1. knn.pl basic procedures (building blocks)
===========================================
dist(X,Y,D) - calculates the distance (number of different A=V pairs) between X and Y.
neighbors(Instance,K,Neighbors) - Finds K nearest neighbors of Instance.
sum(Neighbors,Sum) - returns in Sum a list of Count-Class pairs (vote counts per class).
sumw(Neighbors,Sum) - returns in Sum a list of distance-weighted Count-Class pairs.
knn(Instance, K, Class) - K-Nearest Neighbor.
knnw(Instance, K, Class) - Distance-weighted K-Nearest Neighbor.
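As a cross-check on the first two building blocks, here is an illustrative Python sketch (an assumption-laden re-implementation of the same idea, not the actual knn.pl code; the attribute lists mirror the loandata format):

```python
# Illustrative Python sketch of dist/3 and neighbors/3 (not the lab's Prolog code).

def dist(x, y):
    # Distance = number of attribute=value pairs on which the two examples differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def neighbors(instance, k, examples):
    # examples: list of (class, attribute_list) pairs; returns the k closest
    # examples as Distance-Class pairs, ordered by distance (closest first).
    scored = sorted((dist(instance, x), c) for c, x in examples)
    return scored[:k]

x = ["emp=yes", "buy=car", "sex=f", "married=yes"]
y = ["emp=no", "buy=car", "sex=f", "married=no"]
print(dist(x, y))  # -> 2 (emp and married differ)
```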
2. Experiments with the knn basic procedures
============================================
?- ['c:/prolog/knn.pl']. % load the program
We are going to investigate how knn deals with example 4 from the loan data set
(it was misclassified by Naive Bayes). To do this we leave it out of the training
set by changing its name from "example" to "test". This keeps it available in
the database, so that we can pass it to the knn algorithm. Thus, after
making this modification we load loandata in the Prolog database.
?- ['c:/prolog/loandata.pl']. % load the modified data set
So, we are going to classify the following example:
?- test(4,C,X).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
Let's try first some basic procedures of knn.pl. Assume we want to find the
distance between example 4 and the examples from the training data.
?- test(4,C,X),example(N,P,Y),dist(X,Y,D),write(N-D-P),nl,fail.
1-2-approve
2-2-reject
3-3-approve
5-1-reject
6-1-approve
7-2-approve
8-3-approve
9-2-approve
10-2-approve
11-3-reject
12-1-reject
This query shows that example 4 has three nearest neighbors all at distance 1.
These are examples 5, 6 and 12. If we use the 1-NN algorithm we may get
different classifications depending on which of the nearest neighbors we pick.
If we pick 5 or 12, the classification will be "reject". If we pick 6, the
classification will be "approve". In this situation a better approach would be
to use all three nearest neighbors, i.e. k-NN with k=3. Then the majority class
among them is "reject" and this will be the classification of the example in
question.
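The majority-vote step just described can be sketched in Python (the distances are copied from the query trace above; this is an illustration, not the knn.pl code):

```python
from collections import Counter

# (example number, distance to example 4, class), copied from the trace above.
training = [(1, 2, "approve"), (2, 2, "reject"),  (3, 3, "approve"),
            (5, 1, "reject"),  (6, 1, "approve"), (7, 2, "approve"),
            (8, 3, "approve"), (9, 2, "approve"), (10, 2, "approve"),
            (11, 3, "reject"), (12, 1, "reject")]

def knn_vote(k):
    # Take the k closest examples and return the majority class among them.
    nearest = sorted(training, key=lambda t: t[1])[:k]
    votes = Counter(c for _, _, c in nearest)
    return votes.most_common(1)[0][0]

print(knn_vote(3))  # the three distance-1 neighbors vote 2-reject, 1-approve
```

With k=1 the prediction depends on how the tie among the three distance-1 neighbors is broken, which is exactly the ambiguity discussed above.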
The process of picking nearest neighbors is implemented by the procedure
"neighbors", which picks a specified number of nearest neighbors and returns a
list of Distance-Class pairs, ordered by distance (closest first).
For example:
?- test(4,C,X),neighbors(X,1,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]
?- test(4,C,X),neighbors(X,3,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
?- test(4,C,X),neighbors(X,5,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]
?- test(4,C,X),neighbors(X,11,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]
The last query actually includes all examples in the database. (Prolog does not print
long lists in full in the standard answer to a query, so we can add write(L) to see the whole list.)
To "count the votes" (the class labels) among the selected nearest neighbors we can use
the "sum" procedure.
?- test(4,C,X),neighbors(X,3,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]
?- test(4,C,X),neighbors(X,5,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]
S = [3-reject, 2-approve]
?- test(4,C,X),neighbors(X,7,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve]
S = [3-reject, 4-approve]
Now it's easy to make the classification - we just pick the class label with the larger count.
It's obvious, however, that the parameter k (the number of neighbors) plays an important role
here. Using a large k does not make sense, because then the classification is affected
by neighbors that are not close at all. The extreme case is to use all examples as neighbors. Then
we have:
?- test(4,C,X),neighbors(X,11,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]
S = [4-reject, 7-approve]
Note, however, that in this case S contains the class distribution, and thus all new
examples would be classified as belonging to the majority class ("approve" in this case).
So, with k equal to the number of training examples, knn works as the simplest classifier,
called "majority predictor". In practice we usually use k=1 or k=3.
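The degenerate case is easy to check directly: with k equal to the training-set size, counting votes just recovers the class distribution (labels copied from the distance trace in section 2):

```python
from collections import Counter

# Class labels of all 11 training examples, taken from the trace above.
labels = ["approve", "reject", "approve", "reject", "approve", "approve",
          "approve", "approve", "approve", "reject", "reject"]

votes = Counter(labels)
print(votes)                       # the class distribution: 7 approve, 4 reject
print(votes.most_common(1)[0][0])  # -> approve, regardless of the test example
```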
3. Automatic classification using knn.pl
========================================
All procedures illustrated in the previous section are encapsulated in the "knn"
procedure.
?- test(4,C,X),knn(X,1,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject
?- test(4,C,X),knn(X,3,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject
?- test(4,C,X),knn(X,11,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = approve
These experiments actually confirm the results obtained by Naive Bayes. The actual
class of example 4 should be "reject", because the majority of its nearest
neighbors belong to this class. Of course, including more neighbors makes the
majority class among all examples prevail.
4. Distance-weighted K-Nearest Neighbor
=======================================
In contrast to Naive Bayes, knn is a local prediction algorithm, because it uses just
a part of the training examples to compute the predicted class. However, there is an
extension of knn, the so-called distance-weighted knn, which uses all examples.
It's interesting to see whether the distance-weighted knn will be in agreement with Naive Bayes.
We use it through the "knnw" procedure as follows:
?- test(4,C,X),knnw(X,1,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject
?- test(4,C,X),knnw(X,3,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject
?- test(4,C,X),knnw(X,11,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject
In fact, there is not much sense in using knnw with k=1, because it works exactly
as knn. However, in contrast to knn, it does make sense to use knnw with k=11 (all examples).
We also see that in this case it makes a different prediction ("reject"), which agrees with
the local knn and with Naive Bayes as well.
To see the mechanism of the weighted voting (each vote is weighted by 1/(D^2), where D is the
distance), we may use the "sumw" procedure (it's a part of knnw).
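A minimal Python sketch of this weighted voting, assuming the 1/(D^2) weighting just described (not the actual sumw code):

```python
from collections import defaultdict

def sumw(neighbors):
    # neighbors: list of (distance, class) pairs; each neighbor votes with
    # weight 1/distance^2, so closer neighbors count more.
    votes = defaultdict(float)
    for d, c in neighbors:
        votes[c] += 1.0 / d ** 2
    return dict(votes)

# The five nearest neighbors of example 4, copied from the trace in section 2:
print(sumw([(1, "reject"), (1, "approve"), (1, "reject"),
            (2, "approve"), (2, "reject")]))
# -> {'reject': 2.25, 'approve': 1.25}
```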
?- test(4,C,X),neighbors(X,1,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]
S = [1-reject]
?- test(4,C,X),neighbors(X,3,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]
?- test(4,C,X),neighbors(X,5,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]
S = [2.25-reject, 1.25-approve]
?- test(4,C,X),neighbors(X,11,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]
S = [2.36111-reject, 2.22222-approve]
5. The effect of repetitions
============================
The loan data set has 3 examples that appear twice in the data set. These are 1, 3 and 9.
We can detect them by using the following query:
?- example(N1,C1,E),example(N2,C2,E),N1>N2,write(N1-N2),nl,fail.
7-1
8-3
10-9
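The same pairwise check can be sketched in Python (with hypothetical stand-in data; the real attribute lists live in loandata.pl):

```python
def find_duplicates(examples):
    # examples: dict mapping example number -> attribute data; report each
    # pair (N1, N2) with N1 > N2 whose attribute data are identical.
    nums = sorted(examples)
    return [(n1, n2) for n2 in nums for n1 in nums
            if n1 > n2 and examples[n1] == examples[n2]]

# Hypothetical stand-in data: only the duplicate structure matters here.
data = {1: "A", 3: "B", 7: "A", 8: "B", 9: "C", 10: "C"}
print(find_duplicates(data))  # -> [(7, 1), (8, 3), (10, 9)]
```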
Now, let's remove the repeated copies from the data set and see how knn works.
(Just put % before examples 1, 3 and 9 and reload loandata.pl.)
?- test(4,C,X),neighbors(X,1,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]
S = [1-reject]
?- test(4,C,X),neighbors(X,3,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]
?- test(4,C,X),neighbors(X,8,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-reject, 2-approve, 2-approve, 3-approve, 3-reject]
S = [2.36111-reject, 1.61111-approve]
The predictions are the same: the outputs with k=1 and k=3 are identical, but
with k=8 (all examples in the training data) we get a stronger vote for "reject".
6. Further experiments
======================
See how Naive Bayes and the concept learning algorithms (id3, lgg, search) are affected
by removing the repetitions in loandata.pl and how their predictions compare with the knn
predictions.