Machine Learning - Spring 2004
==============================

Lab experiments 10
------------------
Program: knn.pl
Data: loandata.pl
-----------------

1. knn.pl basic procedures (building blocks)
============================================

dist(X,Y,D) - calculates the distance (the number of differing A=V pairs) between X and Y.
neighbors(Instance,K,Neighbors) - finds the K nearest neighbors of Instance as a list of Distance-Class pairs.
sum(Neighbors,Sum) - returns in Sum a list of class counts (Count-Class pairs).
sumw(Neighbors,Sum) - returns in Sum a list of distance-weighted class counts (Weight-Class pairs).
knn(Instance,K,Class) - K-Nearest Neighbor classification.
knnw(Instance,K,Class) - Distance-weighted K-Nearest Neighbor classification.

2. Experiments with the knn basic procedures
============================================

?- ['c:/prolog/knn.pl'].         % load the program

We are going to investigate how knn deals with example 4 from the loan
data set (it was misclassified by Naive Bayes). To do this we leave it
out of the training set by changing its name from "example" to "test".
It remains available in the database, so that we can pass it to the knn
algorithm. After this modification we load loandata.pl into the Prolog
database.

?- ['c:/prolog/loandata.pl'].    % load the modified data set

So, we are going to classify the following example:

?- test(4,C,X).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]

Let's first try some basic procedures of knn.pl. Assume we want to find
the distances between example 4 and the examples in the training data.

?- test(4,C,X),example(N,P,Y),dist(X,Y,D),write(N-D-P),nl,fail.
1-2-approve
2-2-reject
3-3-approve
5-1-reject
6-1-approve
7-2-approve
8-3-approve
9-2-approve
10-2-approve
11-3-reject
12-1-reject

This query shows that example 4 has three nearest neighbors, all at
distance 1: examples 5, 6 and 12. If we use the 1-NN algorithm we may
get different classifications depending on which of these neighbors we
pick. If we pick 5 or 12, the classification will be "reject"; if we
pick 6, it will be "approve". In this situation a better approach is to
use all three nearest neighbors, i.e. k-NN with k=3. Then the majority
class among them is "reject", and this is the classification of the
example in question.

The process of picking nearest neighbors is implemented by the procedure
"neighbors", which picks a specified number of nearest neighbors and
returns a list of Distance-Class pairs ordered by distance (closest
first). For example:

?- test(4,C,X),neighbors(X,1,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]

?- test(4,C,X),neighbors(X,3,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]

?- test(4,C,X),neighbors(X,5,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]

?- test(4,C,X),neighbors(X,11,L).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]

The last query actually selects all examples in the database. (Prolog
does not print long lists in full in the standard answer to a query, so
we can add write(L) to the query to see the whole list.)
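To make the behavior of dist and neighbors more concrete, here is a
minimal sketch of how they could be written, assuming instances are
lists of A=V pairs in a fixed attribute order (as in the queries above).
The actual definitions in knn.pl may well differ.

% Sketch only (not necessarily the code in knn.pl).
% dist/3: the distance is the number of attribute=value pairs that differ.
dist([], [], 0).
dist([A=V1|Xs], [A=V2|Ys], D) :-
    dist(Xs, Ys, D1),
    ( V1 == V2 -> D = D1 ; D is D1 + 1 ).

% neighbors/3: pair every training example with its distance to the
% instance, order by distance (stable sort) and keep the first K
% Distance-Class pairs.
neighbors(X, K, Neighbors) :-
    findall(D-C, (example(_, C, Y), dist(X, Y, D)), All),
    keysort(All, Sorted),
    length(Neighbors, K),
    append(Neighbors, _, Sorted).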
To "count the votes" (the class labels) among the selected nearest
neighbors we can use the "sum" procedure.

?- test(4,C,X),neighbors(X,3,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]

?- test(4,C,X),neighbors(X,5,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]
S = [3-reject, 2-approve]

?- test(4,C,X),neighbors(X,7,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve]
S = [3-reject, 4-approve]

Now it's easy to make the classification - we just pick the class label
with the larger count. It's obvious, however, that the parameter k (the
number of neighbors) plays an important role here. Using a large k does
not make much sense, because then the classification is also affected by
neighbors that are not so close. The extreme case is to use all examples
as neighbors. Then we have:

?- test(4,C,X),neighbors(X,11,L),sum(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]
S = [4-reject, 7-approve]

Note, however, that in this case S is just the class distribution of the
training set, so every new example would be classified as belonging to
the majority class ("approve" in this case). So, with k equal to the
number of training examples knn works as the simplest classifier, the
"majority predictor". In practice we usually use k=1 or k=3.

3. Automatic classification using knn.pl
========================================

All procedures illustrated in the previous section are encapsulated in
the "knn" procedure.

?- test(4,C,X),knn(X,1,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject

?- test(4,C,X),knn(X,3,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject

?- test(4,C,X),knn(X,11,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = approve

These experiments actually confirm the result obtained by Naive Bayes:
the class of example 4 arguably should be "reject", because the majority
of its nearest neighbors belong to that class. Of course, including more
neighbors makes the majority class among all examples prevail ("approve"
with k=11).

4. Distance-weighted K-Nearest Neighbor
=======================================

In contrast to Naive Bayes, knn is a local prediction algorithm: it uses
only a part of the training examples to compute the predicted class.
There is, however, an extension of knn, the so-called distance-weighted
knn, which uses all examples. It is interesting to see whether the
distance-weighted knn agrees with Naive Bayes. We use it through the
"knnw" procedure as follows:

?- test(4,C,X),knnw(X,1,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject

?- test(4,C,X),knnw(X,3,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject

?- test(4,C,X),knnw(X,11,P).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
P = reject

In fact, there is not much point in using knnw with k=1, because then it
works exactly as knn. However, in contrast to knn, it does make sense to
use knnw with k=11 (all examples). We also see that with k=11 it makes a
different prediction ("reject"), which agrees with the local knn and with
Naive Bayes as well.

To see the mechanism of the weighted voting (each vote is weighted by
1/(D^2), where D is the distance), we may use the "sumw" procedure (it
is part of knnw).
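As a rough illustration only (the real sumw in knn.pl may aggregate the
votes and order the result differently), distance-weighted vote counting
could look like this, using the Distance-Class pair representation from
above. An example at distance 0 would need special treatment, but that
cannot happen here because example 4 is not in the training set.

% Sketch of distance-weighted vote counting in the spirit of sumw/2:
% each neighbor votes for its class with weight 1/D^2.
sumw(Neighbors, Sums) :-
    findall(C, member(_-C, Neighbors), Cs),
    sort(Cs, Classes),                    % the distinct class labels
    findall(S-Class,
            ( member(Class, Classes),
              findall(W, (member(D-Class, Neighbors), W is 1/(D*D)), Ws),
              sum_list(Ws, S) ),
            Sums).

With this picture in mind, let's trace sumw on our example: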
?- test(4,C,X),neighbors(X,1,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]
S = [1-reject]

?- test(4,C,X),neighbors(X,3,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]

?- test(4,C,X),neighbors(X,5,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject]
S = [2.25-reject, 1.25-approve]

?- test(4,C,X),neighbors(X,11,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-approve, 2-reject, 2-approve, 2-approve, 2-approve, ... -...|...]
S = [2.36111-reject, 2.22222-approve]

5. The effect of repetitions
============================

The loan data set contains 3 examples that appear twice: examples 1, 3
and 9. We can detect the repetitions with the following query:

?- example(N1,C1,E),example(N2,C2,E),N1>N2,write(N1-N2),nl,fail.
7-1
8-3
10-9

Now, let's remove the repetitions from the data set and see how knn
works. (Just put % before examples 1, 3 and 9 and reload loandata.pl.)

?- test(4,C,X),neighbors(X,1,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject]
S = [1-reject]

?- test(4,C,X),neighbors(X,3,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject]
S = [2-reject, 1-approve]

?- test(4,C,X),neighbors(X,8,L),sumw(L,S).
C = approve
X = [emp=yes, buy=car, sex=f, married=yes]
L = [1-reject, 1-approve, 1-reject, 2-reject, 2-approve, 2-approve, 3-approve, 3-reject]
S = [2.36111-reject, 1.61111-approve]

The predictions are the same: the outputs for k=1 and k=3 are unchanged,
but with k=8 (all examples in the training data) we now get a stronger
vote for "reject".

6. Further experiments
======================

See how Naive Bayes and the concept learning algorithms (id3, lgg,
search) are affected by removing the repetitions in loandata.pl, and how
their predictions compare with the knn predictions.
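As a convenience for these comparisons we could add a small helper
predicate of our own (it is not part of knn.pl; the name and the list of
k values are just an illustration) that prints the knn and knnw
predictions for the held-out test example next to its actual class:

% Hypothetical helper (not in knn.pl): print the actual class of the test
% example together with the knn and knnw predictions for several values
% of k, so they can be compared with the Naive Bayes, id3, lgg and search
% results.
compare_predictions :-
    test(N, Actual, X),
    member(K, [1, 3]),        % add larger k, up to the training set size
    knn(X, K, P1),
    knnw(X, K, P2),
    write(N-K-actual(Actual)-knn(P1)-knnw(P2)), nl,
    fail.
compare_predictions.

?- compare_predictions.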