Machine Learning - Lab experiments 3
====================================

Programs needed: id3.pl, vs.pl
Data files: animals.pl, loandata.pl
----------------------------------------

1. GENERATING, VIEWING AND ACCESSING DECISION TREES AND RULES
-------------------------------------------------------------

?- [id3].
?- [animals].
?- id3(2).

The parameter we specify (2) controls the growing of the tree: if a node
covers 2 or fewer examples, it becomes a leaf no matter what the class
distribution is. To see the tree we use:

?- showtree.

has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none => [amphibian/1, mammal/1]
has_covering=hair => [mammal/3]

The lists at the leaves show the class distribution in the examples covered.
If we change the threshold we get a different tree. With threshold 1 we will
have one-class examples at the leaves. For example:

?- id3(1),showtree.

has_covering=feathers => [bird/2]
has_covering=scales
   gills=f => [reptile/2]
   gills=t => [fish/1]
has_covering=none
   eggs=t => [amphibian/1]
   eggs=f => [mammal/1]
has_covering=hair => [mammal/3]

The "id3" query also generates "if-then" rules. We can see these rules by
using "listing":

?- listing(if).

:- dynamic if/1.

if[has_covering=feathers]then bird.
if[has_covering=scales, gills=f]then reptile.
if[has_covering=scales, gills=t]then fish.
if[has_covering=none, eggs=t]then amphibian.
if[has_covering=none, eggs=f]then mammal.
if[has_covering=hair]then mammal.

NOTE: "listing" is a Prolog built-in procedure that lists the contents of the
database for a specified name. This means that the if-then structures are
already in the database and their name is "if". In fact, when started by
"?- id3(1).", the ID3 program creates a decision tree (which can be viewed by
"?- showtree.") and loads into the database a set of rules that equivalently
represent the tree.

The rules created by ID3 are ordinary Prolog facts, written in a special
if-then operator notation. They can also be accessed directly by Prolog
queries like these:

?- if X then mammal.
X = [has_covering=none, eggs=f] ;
X = [has_covering=hair] ;

?- if X then Y.
X = [has_covering=feathers]
Y = bird ;
X = [has_covering=scales, gills=f]
Y = reptile ;
X = [has_covering=scales, gills=t]
Y = fish ;
X = [has_covering=none, eggs=t]
Y = amphibian ;
X = [has_covering=none, eggs=f]
Y = mammal ;
X = [has_covering=hair]
Y = mammal

If you want to run ID3 with another data file, load the data only (there is
no need to reload id3.pl). However, NOTE that after executing "?- id3(1)" the
previous tree and rules are deleted and replaced with those corresponding to
the currently loaded data file. For example:

?- [loandata].
?- id3(1),showtree.

emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]

?- id3(9),showtree.

emp=no => [reject/3]
emp=yes => [approve/8, reject/1]


2. COMPARISON WITH VERSION SPACE LEARNING
-----------------------------------------

Assuming that the loandata tree and rules created by "?- id3(1)" are in the
database, try this (note that the set of "reject" examples splits as
[2, 5, 11, 12] = [2, 11, 12] + [5]):

?- if H then [reject/N],model(H,M).
H = [emp=no]
N = 3
M = [2, 11, 12] ;
H = [emp=yes, buy=car, married=no]
N = 1
M = [5] ;

This query accesses the two disjunctive components of the hypothesis for
class "reject" (the two paths in the decision tree leading to the "reject"
leaves). Now, obviously, we should be able to run VS on each of the two sets
and produce consistent hypotheses.
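As an aside, the coverage that model/2 returns can be reproduced with a few
lines of standard Prolog. The sketch below is only an illustration of the
idea, assuming example/3 facts as in loandata.pl; satisfies/2 and
covered_ids/2 are hypothetical names, not the actual code in id3.pl.

:- use_module(library(lists)).   % member/2

% satisfies(+Description, +Conditions): every Attr=Val condition
% occurs in the example description (hypothetical helper).
satisfies(_, []).
satisfies(Descr, [Attr=Val|Rest]) :-
    member(Attr=Val, Descr),
    satisfies(Descr, Rest).

% covered_ids(+Conditions, -Ids): IDs of all examples whose
% description satisfies the conditions (hypothetical helper).
covered_ids(Conds, Ids) :-
    findall(Id,
            (example(Id, _Class, Descr), satisfies(Descr, Conds)),
            Ids).

With loandata.pl loaded, the query "?- covered_ids([emp=no], M)." should give
the same coverage, M = [2, 11, 12], as model([emp=no], M) above.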
For the first part, [2, 11, 12], we put these examples at the beginning and
let all examples from class "approve" follow (we may also leave the other
"reject" example at the end). Let's call this new data file loandata1.pl and
load it into the database.

?- [loandata1].

We may try "listing" to verify our data:

?- listing(example).

example(2, reject, [emp=no, buy=comp, sex=f, married=yes]).
example(11, reject, [emp=no, buy=comp, sex=m, married=yes]).
example(12, reject, [emp=no, buy=car, sex=f, married=yes]).
example(1, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(3, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(4, approve, [emp=yes, buy=car, sex=f, married=yes]).
example(6, approve, [emp=yes, buy=comp, sex=f, married=yes]).
example(7, approve, [emp=yes, buy=comp, sex=f, married=no]).
example(8, approve, [emp=yes, buy=comp, sex=m, married=no]).
example(9, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(10, approve, [emp=yes, buy=comp, sex=m, married=yes]).
example(5, reject, [emp=yes, buy=car, sex=f, married=no]).

Now we load and run the candidate elimination algorithm (vs.pl) in batch mode.

?- [vs].
?- batch.

+[no, comp, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, f, yes]]
+[no, comp, m, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, comp, ?, yes]]
+[no, car, f, yes]
G-set: [[?, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, no]
G-set: [[?, ?, ?, yes], [no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[yes, car, f, no]
G-set: []
S-set: []
There is no consistent concept description in this language !

NOTE that the inconsistency occurs because of the last example (#5), which is
part of the other hypothesis. The hypothesis created by ID3, [emp=no], is
contained in the version space:

G-set: [[no, ?, ?, ?]]
S-set: [[no, ?, ?, yes]]

In fact, it is the one in the G-set. Interestingly, we have another one (the
S-set). This means that there may be another, more specific decision tree,
which looks like this:

emp=no
   married=yes => reject     THIS IS THE NEW BRANCH
   married=no => approve     THIS HAS TO BE ADDED TO COMPLETE THE SUBTREE
emp=yes
   buy=car
      married=no => reject
      married=yes => approve
   buy=comp => approve

NOTE that this tree is created MANUALLY. With loandata1.pl the ID3 algorithm
will create the previous tree, because the order of the examples does not
matter.

Let's verify what the new branch covers. First, we need the "model" procedure
back (it's in id3.pl).

?- [id3].
?- model([emp=no,married=yes],M).
M = [2, 11, 12]

NOTE that this is the same coverage as before.

?- model([emp=no,married=no],M).
M = []

The branch we added to complete the tree does not cover anything, but this is
OK.

Similarly, the other disjunctive component of ID3's "reject" hypothesis,

if [emp=yes, buy=car, married=no] then reject

may be learned with VS. First, we create file loandata2.pl with the following
order of the examples: [5, 1, 3, 4, 6, 7, 8, 9, 10, 2, 11, 12]. Then we run
VS:

?- batch.

+[yes, car, f, no]
G-set: [[?, ?, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, no]
G-set: [[?, car, ?, ?]]
S-set: [[yes, car, f, no]]
-[yes, comp, m, no]
-[yes, car, f, yes]
G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]
-[yes, comp, f, yes]
-[yes, comp, f, no]
-[yes, comp, m, no]
-[yes, comp, m, yes]
-[yes, comp, m, yes]
+[no, comp, f, yes]
G-set: []
S-set: []
There is no consistent concept description in this language !
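Before looking at the resulting version space, here is a rough sketch of what
it means for a hypothesis with '?' wildcards (as in the G- and S-sets printed
above) to cover an instance vector. covers_vs/2 is a hypothetical name used
only for illustration; it is not a predicate from vs.pl.

% covers_vs(+Hypothesis, +Instance): succeeds if every position of the
% hypothesis is either the wildcard '?' or equal to the corresponding
% value of the instance (hypothetical helper).
covers_vs([], []).
covers_vs(['?'|H], [_|I]) :- !, covers_vs(H, I).
covers_vs([V|H], [V|I]) :- covers_vs(H, I).

For example, "?- covers_vs(['?',car,'?',no], [yes,car,f,no])." succeeds,
while "?- covers_vs([yes,car,f,no], [yes,car,f,yes])." fails. Every
hypothesis in the version space is at least as general as an element of the
S-set and at most as general as an element of the G-set.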
NOTE that the version space here is bigger:

G-set: [[?, car, ?, no]]
S-set: [[yes, car, f, no]]

It contains four hypotheses (all the ways of generalizing the S-set element
toward the G-set element), among which is the one created by ID3.


3. ONE-LEVEL DECISION TREE WITH ERROR-BASED ATTRIBUTE SELECTION (OneR algorithm)
--------------------------------------------------------------------------------

?- [id3].
?- [loandata].
?- id3(1),showtree.

emp=no => [reject/3]
emp=yes
   buy=car
      married=no => [reject/1]
      married=yes => [approve/1]
   buy=comp => [approve/7]

The id3.pl file defines procedures to compute the entropy and the
distribution of classes in a set of examples. For example:

?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;
H = [emp=yes, buy=car, married=no]
C = [reject/1]
M = [5]
I = 0 ;
H = [emp=yes, buy=car, married=yes]
C = [approve/1]
M = [4]
I = 0 ;
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0

These are all 100% correct hypotheses, and that is why the entropy of the
class distribution is 0. If we allow mixed classes at the leaves, we may get
non-zero entropy. For example:

?- id3(2),showtree.

emp=no => [reject/3]
emp=yes
   buy=car => [approve/1, reject/1]
   buy=comp => [approve/7]

?- if H then C, model(H,M),entropy(M,I).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
I = 0 ;
H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
I = 1 ;
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
I = 0 ;

The id3.pl program also includes a procedure to compute the class
distribution of a set of examples. For example:

?- if H then C, model(H,M),distr(M,D).
H = [emp=no]
C = [reject/3]
M = [2, 11, 12]
D = [reject/3] ;
H = [emp=yes, buy=car]
C = [approve/1, reject/1]
M = [4, 5]
D = [approve/1, reject/1] ;
H = [emp=yes, buy=comp]
C = [approve/7]
M = [1, 3, 6, 7, 8, 9, 10]
D = [approve/7] ;

Thus we may see the class distribution and the entropy of any set of
examples. For example:

?- distr([1,2,3,4],D),entropy([1,2,3,4],I).
D = [approve/3, reject/1]
I = 0.811278

Here is the entropy of a hypothesis that covers mixed-class examples:

?- model([buy=car],M),distr(M,D),entropy(M,I).
M = [4, 5, 12]
D = [approve/1, reject/2]
I = 0.918296

The distribution of classes can be used to compute the error. For example,
assuming that we assign the majority class label (reject) to the set above,
the error will be 1/3 (one wrong classification out of 3).

The OneR algorithm generates a one-level decision tree, using error-based
attribute selection. The following query can be used to find the OneR
decision tree for our loan data ("attr" and "vals" are described in the
header of id3.pl; try the query without the writeln and fail calls to better
understand how it works):

?- attr(L),member(A,L),vals(A,W),nl,member(V,W),model([V],M),distr(M,D),writeln([V]-D),fail.

[buy=car]-[approve/1, reject/2]
[buy=comp]-[approve/7, reject/2]
[emp=no]-[reject/3]
[emp=yes]-[approve/8, reject/1]
[married=no]-[approve/4, reject/1]
[married=yes]-[approve/4, reject/3]
[sex=f]-[approve/4, reject/3]
[sex=m]-[approve/4, reject/1]

The class distributions above allow us to compute the total error of each of
the four possible one-level trees:

1. buy=car => reject       [approve/1, reject/2]
   buy=comp => approve     [approve/7, reject/2]
   -----------------------------------------
   Error = (1+2)/12 = 0.25

2. emp=no => reject        [reject/3]
   emp=yes => approve      [approve/8, reject/1]
   ----------------------------------------
   Error = (0+1)/12 = 0.0833333

3. married=no => approve   [approve/4, reject/1]
   married=yes => approve  [approve/4, reject/3]
   --------------------------------------------
   Error = (1+3)/12 = 0.333333

4. sex=f => approve        [approve/4, reject/3]
   sex=m => approve        [approve/4, reject/1]
   --------------------------------------
   Error = (3+1)/12 = 0.333333

Obviously the best OneR tree is #2, because it has the lowest error rate,
0.0833333. Interestingly, the entropy-based attribute selection used by ID3
agrees with the error-based one: the pruned ID3 tree (produced by id3(9)) is
the same as the OneR one. (The short sketch below recomputes these entropy
and error figures from the class distributions.)
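The entropy and error figures in this section are easy to recompute directly
from class distributions of the form [approve/7, reject/2]. The following is
only an illustrative sketch; entropy_of/2 and oner_error/2 are hypothetical
names and are not the entropy/2 and distr/2 procedures defined in id3.pl.

:- use_module(library(lists)).   % member/2, sum_list/2, max_list/2

% entropy_of(+Distr, -E): entropy of a class distribution such as
% [approve/1, reject/2] (hypothetical helper).
entropy_of(Distr, E) :-
    findall(N, member(_Class/N, Distr), Counts),
    sum_list(Counts, Total),
    entropy_sum(Counts, Total, 0, E).

entropy_sum([], _, E, E).
entropy_sum([N|Ns], Total, Acc, E) :-
    P is N / Total,
    Acc1 is Acc - P * log(P) / log(2),
    entropy_sum(Ns, Total, Acc1, E).

% oner_error(+LeafDistrs, -Error): error of a one-level tree whose leaves
% have the given class distributions, each leaf predicting its majority
% class (hypothetical helper).
oner_error(LeafDistrs, Error) :-
    leaf_counts(LeafDistrs, 0, 0, Wrong, Total),
    Error is Wrong / Total.

leaf_counts([], W, T, W, T).
leaf_counts([D|Ds], W0, T0, W, T) :-
    findall(N, member(_Class/N, D), Counts),
    sum_list(Counts, LeafTotal),
    max_list(Counts, Majority),
    W1 is W0 + LeafTotal - Majority,
    T1 is T0 + LeafTotal,
    leaf_counts(Ds, W1, T1, W, T).

For example, "?- entropy_of([approve/1, reject/2], E)." should give
E = 0.918..., and "?- oner_error([[reject/3], [approve/8, reject/1]], Err)."
should give Err = 0.0833..., matching tree #2 above.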