Machine Learning - Spring 2004
==============================

Lab experiments 12
------------------

Programs: stats.pl, cluster.pl
Data: sample.pl, weather.pl, cells.pl, animals.pl
-------------------------------------------------

I. Experiments with EM (stats.pl)
=================================

1. Basic statistics and Bayes
-----------------------------

Note that stats.pl does not implement the EM algorithm. Rather, it provides a
collection of statistical procedures, which are used hereafter to illustrate
the basic principles of EM. They are the following:

mean(List, Mean)                    - computes the mean
stdev(List, Deviation)              - computes the standard deviation
variance(List, Variance)            - computes the variance
p(Value, Mean, Deviation, Density)  - computes the density function given the mean and stdev
p(Value, List, Density)             - computes the density function given the sample

For the experiments we use the data file sample.pl, which contains a set of
examples with one numeric attribute and two classes. Let's examine this data
sample first.

?- ['c:/prolog/stats.pl'].   % load the program

?- ['c:/prolog/sample.pl'].  % load the data

?- listing(example).

example(1, a, 51).
example(2, a, 43).
example(3, b, 62).
...
example(49, a, 42).
example(50, a, 48).
example(51, a, 41).

For brevity we don't include the whole listing above. There are 51 examples
in it. Let's do some statistics.

?- findall(X,example(_,a,X),A),length(A,N),PA is N/51.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
N = 32
PA = 0.627451

?- findall(X,example(_,b,X),B),length(B,N),PB is N/51.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
N = 19
PB = 0.372549

In the above queries "findall" collects all values of the variable X that
satisfy the goal in the second argument. That is, we collect all examples
from class "a" in list A and all examples from class "b" in list B.
"length" computes the length of the lists. Thus we obtain the probabilities
of classes "a" and "b" in PA and PB.

Let's use some of the procedures from stats.pl.

?- findall(X,example(_,a,X),A),mean(A,MA),stdev(A,DA).

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
MA = 46.8125
DA = 3.72816

?- findall(X,example(_,b,X),B),mean(B,MB),stdev(B,DB).

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
MB = 63.6316
DB = 1.21154

Now we know our data well. We have a mixture of samples from two
distributions A and B, where the probabilities of sampling are:

P(A) = 0.627451
P(B) = 0.372549

The parameters of each distribution are its mean (MA and MB) and standard
deviation (DA and DB), computed by the above queries. The data sample along
with graphs of the two (normal) distributions is shown in the PDF document at
http://www.cs.ccsu.edu/~markov/ccsu_courses/mixture.pdf.

Knowing these parameters we can apply Bayes' rule to compute the probability
of sampling for each value. As these are continuous variables, we cannot use
counts and discrete probabilities as we do with nominal attributes. Rather,
we need the probability density function, which uses the mean and the
standard deviation. So, assume we want to find the distribution from which
the value 46 is drawn (i.e. the classification of 46). We have to compute

P(A|46) = p(46|A)*P(A)/P(46)
P(B|46) = p(46|B)*P(B)/P(46)

and compare the probabilities (or the likelihoods only, without the
denominators).

?- findall(X,example(_,a,X),A),mean(A,MA),stdev(A,DA),p(46,MA,DA,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
MA = 46.8125
DA = 3.72816
P = 0.104496
PA = 0.0655664

?- findall(X,example(_,b,X),B),mean(B,MB),stdev(B,DB),p(46,MB,DB,P),PB is P*0.372549.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
MB = 63.6316
DB = 1.21154
P = 3.37307e-047
PB = 1.25664e-047
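The density values P computed above come from the normal (Gaussian) density
function, using the mean and standard deviation of each sample. As an
illustration only (the actual definition of p/4 in stats.pl may differ), an
equivalent predicate could look like this:

% Normal (Gaussian) density: p(x) = exp(-(x-m)^2/(2*s^2)) / (s*sqrt(2*pi))
normal_density(X, Mean, Dev, Density) :-
    Density is exp(-((X - Mean)^2) / (2 * Dev^2)) / (Dev * sqrt(2 * pi)).

For instance, normal_density(46, 46.8125, 3.72816, D) yields approximately
0.104496, the density computed for class "a" above.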
There is also a simpler way to compute the density, through the other version
of "p" that uses the list of values directly.

?- findall(X,example(_,a,X),A),p(46,A,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
P = 0.104496
PA = 0.0655664

?- findall(X,example(_,b,X),B),p(46,B,P),PB is P*0.372549.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
P = 3.37307e-047
PB = 1.25664e-047

Thus we have:

p(46|A)*P(A) = 0.0655664
p(46|B)*P(B) = 1.25664e-047 (a very small value)

Even at this point we see that 46 obviously belongs to distribution A. But
let's do the normalization too:

?- PA46 is 0.0655664/(0.0655664+1.25664e-047).

PA46 = 1

?- PB46 is 1.25664e-047/(0.0655664+1.25664e-047).

PB46 = 1.91659e-046

It is now clear that the second probability is practically zero. This result
agrees with the actual labels the values in the sample come with: 46 appears
in examples 7, 18 and 36, all labeled "a".

In contrast to the discrete case (with nominal values), here we can compute
probabilities for unknown values directly. For example:

?- findall(X,example(_,a,X),A),p(46.5,A,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
P = 0.106632
PA = 0.0669067

The value 46.5, which is not in the sample, has a slightly higher likelihood
of belonging to A than 46 does. The reason is that 46.5 is closer to the
distribution's mean, which is 46.8125.

2. Learning with unobserved or partially observed variables
-----------------------------------------------------------

Assume now that we don't know, or know only some of, the parameters of the
distributions A and B. The EM algorithm can be used to find the missing
parameters. The basic idea is to assign the labels A and B to the values
arbitrarily, then classify each value using the Bayes rule (as done in the
queries above) and update its label according to this classification. After
multiple iterations the labels stabilize, and thus the originally unknown
parameters of the two distributions are found.

To illustrate this process, assume that the algorithm is close to
convergence, but there are still some values whose classification differs
from the label they carry. Let's switch the labels of examples 7, 22 and 37
to simulate this. We do this in the file and then reload it.

?- ['c:/prolog/sample.pl'].  % load the modified sample

?- findall(X,example(_,a,X),A),findall(X,example(_,b,X),B),example(ID,C,V),p(V,A,PAV),p(V,B,PBV),
   length(A,N),length(B,M),PA is PAV*N/(N+M),PB is PBV*M/(N+M),(PA>PB,C\=a;PB>PA,C\=b).

ID = 7
C = b
V = 46
PAV = 0.0849553
PBV = 0.0040198
N = 31
M = 20
PA = 0.0516395
PB = 0.00157639 ;

ID = 22
C = a
V = 62
PAV = 0.000424731
PBV = 0.0592342
N = 31
M = 20
PA = 0.00025817
PB = 0.0232291 ;

ID = 37
C = b
V = 39
PAV = 0.0135648
PBV = 0.000208726
N = 31
M = 20
PA = 0.00824528
PB = 8.18534e-005 ;

No

Here is how this query works. After getting the samples from distributions A
and B, it takes a value V from an example, computes the densities in PAV and
PBV, and the corresponding likelihoods in PA and PB, where the terms N/(N+M)
and M/(N+M) compute P(A) and P(B) respectively. Then the term
(PA>PB,C\=a;PB>PA,C\=b) works as a filter: it is true only if the predicted
label for V is different from its actual label in C (if PA>PB then V belongs
to a, and if PB>PA then V belongs to b). Interestingly, the query detects all
the swaps we have made.

In fact, the above query roughly simulates one iteration of the EM algorithm.
After it, the labels of examples 7, 22 and 37 have to be switched back, which
will result in convergence (though this may be just a local solution).
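For convenience, the long query above can be packaged as a predicate. This is
not part of stats.pl; it is just a sketch wrapping the same goals:

% True if example ID (labeled C, with value V) would be classified
% differently from its current label.
misclassified(ID, C, V) :-
    findall(X, example(_, a, X), A),
    findall(X, example(_, b, X), B),
    length(A, N), length(B, M),
    example(ID, C, V),
    p(V, A, PAV), p(V, B, PBV),
    PA is PAV * N / (N + M),
    PB is PBV * M / (N + M),
    ( PA > PB, C \= a ; PB > PA, C \= b ).

With this definition the check above becomes simply ?- misclassified(ID,C,V).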
To verify this, let's switch the labels of examples 7, 22 and 37 back to
their original values and reload the file.

?- ['c:/prolog/sample.pl'].  % load the original sample

?- findall(X,example(_,a,X),A),findall(X,example(_,b,X),B),example(ID,C,V),p(V,A,PAV),p(V,B,PBV),
   length(A,N),length(B,M),PA is PAV*N/(N+M),PB is PBV*M/(N+M),(PA>PB,C\=a;PB>PA,C\=b).

No

"No" here means that there is no value with a label different from the
predicted one. So there is no need for more iterations: all labels have
fallen into their right places.

This approach also works with more attributes. In that case the attributes
are assumed to be independent (the Naive Bayes assumption) and the product of
all densities is used. In the case of nominal attributes the discrete
probabilities (based on counts) are used.

The task that the EM algorithm solves can be seen as unsupervised learning
(clustering). That is the case when, given a mixture, no parameters of the
distributions are known, which means that no class labels are assigned to the
examples. Then, starting with random class assignments, the algorithm comes
up with a grouping of the instances into classes (clusters). In the next
section we discuss another (non-probabilistic) approach to clustering.

II. Distance-based agglomerative clustering (cluster.pl)
========================================================

1. Basic functions
------------------

cluster(Mode,Threshold,Clustering)  - returns Clustering (a list of pairs)
cluster(Mode,Threshold)             - prints the clustering tree

Where:

Mode = min   Single-linkage agglomerative clustering.
             Merging stops when distance > Threshold.
Mode = max   Complete-linkage agglomerative clustering.
             Merging stops when distance > Threshold.

The distance is computed as the number of differing attribute values (as in
k-NN). This means that the distance is in the interval [0, number of
attributes].

2. Clustering cells.pl
----------------------

?- ['c:/prolog/cluster.pl'].  % load the program

?- ['c:/prolog/cells.pl'].    % load the data

?- listing(example).

example(1, c1, [tails=one, color=light, nuclei=one]).
example(2, c2, [tails=two, color=dark, nuclei=two]).
example(3, c2, [tails=two, color=light, nuclei=two]).
example(4, c1, [tails=one, color=dark, nuclei=three]).

Note that the examples are represented in the same way as for the purposes of
concept learning and prediction, i.e. with class labels. These labels,
however, are NOT used in the process of clustering.

The following query generates a clustering in the form of a list of pairs of
examples. It uses the single-linkage clustering method (the Mode parameter is
min), where merging stops if distance > 2.

?- cluster(min,2,CL).

CL = [[[1, 4], [2, 3]]]

To show the clustering as a horizontal tree we may use the "show" procedure.
The second parameter (0) specifies the column where the root node is printed.

?- cluster(min,2,CL),show(CL,0).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

CL = [[[1, 4], [2, 3]]]

Both creating and printing the clustering can be achieved by using the
cluster procedure with two parameters only.

?- cluster(min,2).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

Note the format in which the tree is printed. The root and non-leaf nodes are
shown as "+". Leaves are shown with the corresponding example IDs and their
class labels. As noted above, the class labels are not used in the process of
clustering. If they are unknown or arbitrary, the clustering would be the
same (try, for example, replacing them all with one label). However, when the
class labels are known (as in the example here), we may use them to evaluate
the quality of the clustering (the so-called error-based evaluation).
Obviously, the error of the above clustering is 0, because all examples
grouped in a cluster have the same class label.
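Recall that the distance between two examples is the number of attribute
values on which they differ. As an illustration only (cluster.pl has its own
internal definition, which may differ), such a distance could be computed as
follows:

% Number of attribute values on which two examples differ.
attr_distance(Ex1, Ex2, D) :-
    findall(A, ( member(A=V1, Ex1), member(A=V2, Ex2), V1 \== V2 ), Diffs),
    length(Diffs, D).

For example, ?- example(2,_,E2), example(3,_,E3), attr_distance(E2,E3,D).
gives D = 1 (only color differs), the smallest distance in cells.pl, which is
why examples 2 and 3 are merged first.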
Let's experiment with the value of the threshold.

?- cluster(min,1).  % stop merging if distance > 1

+
    +
        2-c2
        3-c2
    1-c1
    4-c1

So here we have three clusters: one with two examples (2 and 3) and two
others with one example each. The error is again 0.

?- cluster(min,0).  % stop merging if distance > 0, i.e. no merging at all

+
    1-c1
    2-c2
    3-c2
    4-c1

Here each example represents a cluster by itself.

The other agglomerative clustering method (complete-linkage clustering)
produces similar results.

?- cluster(max,1).  % Mode = max, i.e. complete-linkage clustering

+
    +
        2-c2
        3-c2
    1-c1
    4-c1

?- cluster(max,2).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

3. Clustering the animals data set
----------------------------------

?- ['c:/prolog/animals.pl'].  % load the data

?- cluster(min,1).

+
    +
        +
            5-fish
            7-reptile
        +
            6-reptile
            10-amphibian
    +
        1-mammal
        4-mammal
    +
        8-bird
        9-bird
    2-mammal
    3-mammal

Let's compute the error of this clustering. First we have to decide whether
to consider all clusters or only the top-level ones (the immediate successors
of the root). Assume we use the top-level clusters only. There are 5 of them:
[[5,7],[6,10]], [1,4], [8,9], 2 and 3. In each cluster we find the majority
label and then count the labels that are different (they are considered
errors). Thus examples 5 and 10 in cluster [[5,7],[6,10]] are errors. All
other clusters are error-free. The total error of the top-level clustering is
2/10 = 0.2 error rate.

?- cluster(min,2).

+
    +
        +
            2-mammal
            3-mammal
        +
            1-mammal
            4-mammal
    +
        +
            +
                5-fish
                7-reptile
            +
                6-reptile
                10-amphibian
        +
            8-bird
            9-bird

The error is 2/10 again, but the advantage of this clustering is that it puts
all mammals in one cluster.

?- cluster(max,2).

+
    +
        +
            2-mammal
            3-mammal
        +
            1-mammal
            4-mammal
    +
        5-fish
        7-reptile
    +
        6-reptile
        10-amphibian
    +
        8-bird
        9-bird

This clustering is even better: the error is the same (2/10), but we also
have the birds in one cluster at the top level.

Other thresholds can be tried too, but no lower error can be achieved. The
maximal value of the threshold is 6, because the maximal distance between
examples with 6 attributes is 6. Using bigger thresholds will not make any
difference.
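The error-based evaluation described above can also be sketched in Prolog.
The following predicate is not part of cluster.pl; it simply counts, for a
flat list of example IDs forming one cluster, how many examples disagree with
the cluster's majority label:

% Count the examples in a cluster (a flat list of IDs) whose label differs
% from the cluster's majority label.
cluster_errors(IDs, Errors) :-
    findall(C, ( member(ID, IDs), example(ID, C, _) ), Labels),
    length(Labels, Total),
    findall(N, ( member(L, Labels),
                 findall(1, member(L, Labels), Ones),
                 length(Ones, N) ),
            Counts),
    max_list(Counts, Majority),
    Errors is Total - Majority.

For the top-level cluster [[5,7],[6,10]] of the animals data, the query
?- cluster_errors([5,7,6,10], E). gives E = 2, which corresponds to the two
errors (examples 5 and 10) counted above.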