Machine Learning - Spring 2004
==============================

Lab experiments 12
------------------

Programs: stats.pl, cluster.pl
Data: sample.pl, weather.pl, cells.pl, animals.pl
-------------------------------------------------

I. Experiments with EM (stats.pl)
=================================

1. Basic statistics and Bayes
-----------------------------

Note that stats.pl does not implement the EM algorithm. Rather, it provides a
collection of statistical procedures, which are used hereafter to illustrate
the basic principles of EM. They are the following:

mean(List, Mean)                    - computes the mean
stdev(List, Deviation)              - computes the standard deviation
variance(List, Variance)            - computes the variance
p(Value, Mean, Deviation, Density)  - computes the density function given the mean and stdev
p(Value, List, Density)             - computes the density function given the sample

For the experiments we use the data file sample.pl, which contains a set of
examples with one numeric attribute and two classes. Let's examine this data
sample first.

?- ['c:/prolog/stats.pl'].   % load the program

?- ['c:/prolog/sample.pl'].  % load the data

?- listing(example).

example(1, a, 51).
example(2, a, 43).
example(3, b, 62).
...
example(49, a, 42).
example(50, a, 48).
example(51, a, 41).

For brevity we don't include the whole listing above. There are 51 examples
in it. Let's do some statistics.

?- findall(X,example(_,a,X),A),length(A,N),PA is N/51.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
N = 32
PA = 0.627451

?- findall(X,example(_,b,X),B),length(B,N),PB is N/51.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
N = 19
PB = 0.372549

In the above queries "findall" collects all values of the variable X that
satisfy the goal in the second argument. That is, we collect all examples
from class "a" in list A and all examples from class "b" in list B.
"length" computes the length of the lists. Thus we obtain the probabilities
of classes "a" and "b" in PA and PB.

Let's use some of the procedures from stats.pl.

?- findall(X,example(_,a,X),A),mean(A,MA),stdev(A,DA).

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
MA = 46.8125
DA = 3.72816

?- findall(X,example(_,b,X),B),mean(B,MB),stdev(B,DB).

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
MB = 63.6316
DB = 1.21154

Now we know our data well. We have a mixture of samples from two
distributions A and B, where the probabilities of sampling are:

P(A) = 0.627451
P(B) = 0.372549

The parameters of each distribution are its mean (MA and MB) and standard
deviation (DA and DB), computed by the above queries. The data sample along
with graphs of the two (normal) distributions is shown in the PDF document at
http://www.cs.ccsu.edu/~markov/ccsu_courses/mixture.pdf.

Knowing these parameters we can apply Bayes' rule to compute the probability
of sampling for each value. As these are continuous variables, we cannot use
counts and discrete probabilities as we do with nominal attributes. Rather,
we need the probability density function, which uses the mean and the
standard deviation. So, assume we want to find the distribution from which
the value 46 is drawn (i.e. the classification of 46). We have to compute

P(A|46) = p(46|A)*P(A)/P(46)
P(B|46) = p(46|B)*P(B)/P(46)

and compare the probabilities (or the likelihoods only, without the
denominators).

?- findall(X,example(_,a,X),A),mean(A,MA),stdev(A,DA),p(46,MA,DA,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
MA = 46.8125
DA = 3.72816
P = 0.104496
PA = 0.0655664

?- findall(X,example(_,b,X),B),mean(B,MB),stdev(B,DB),p(46,MB,DB,P),PB is P*0.372549.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
MB = 63.6316
DB = 1.21154
P = 3.37307e-047
PB = 1.25664e-047
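The density values P computed above come from the normal (Gaussian) density
function, using the mean and standard deviation of each sample. As an
illustration only (the actual definition of p/4 in stats.pl may differ), an
equivalent predicate could look like this:

% Normal (Gaussian) density: p(x) = exp(-(x-m)^2/(2*s^2)) / (s*sqrt(2*pi))
normal_density(X, Mean, Dev, Density) :-
    Density is exp(-((X - Mean)^2) / (2 * Dev^2)) / (Dev * sqrt(2 * pi)).

For instance, normal_density(46, 46.8125, 3.72816, D) yields approximately
0.104496, the density computed for class "a" above.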
There is also a simpler way to compute the density, through the other version
of "p" that uses the list of values directly.

?- findall(X,example(_,a,X),A),p(46,A,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
P = 0.104496
PA = 0.0655664

?- findall(X,example(_,b,X),B),p(46,B,P),PB is P*0.372549.

B = [62, 64, 62, 64, 65, 64, 62, 62, 64|...]
P = 3.37307e-047
PB = 1.25664e-047

Thus we have:

p(46|A)*P(A) = 0.0655664
p(46|B)*P(B) = 1.25664e-047 (a very small value)

Even at this point we see that 46 obviously belongs to distribution A. But
let's do the normalization too:

?- PA46 is 0.0655664/(0.0655664+1.25664e-047).

PA46 = 1

?- PB46 is 1.25664e-047/(0.0655664+1.25664e-047).

PB46 = 1.91659e-046

It is now clear that the second probability is practically zero. This result
agrees with the actual labels the values in the sample come with: 46 appears
in examples 7, 18 and 36, all labeled "a".

In contrast to the discrete case (with nominal values), here we can compute
probabilities for unknown values directly. For example:

?- findall(X,example(_,a,X),A),p(46.5,A,P),PA is P*0.627451.

A = [51, 43, 45, 42, 46, 45, 45, 47, 52|...]
P = 0.106632
PA = 0.0669067

The value 46.5, which is not in the sample, has a slightly higher likelihood
of belonging to A than 46 does. The reason is that 46.5 is closer to the
distribution's mean, which is 46.8125.

2. Learning with unobserved or partially observed variables
-----------------------------------------------------------

Assume now that we don't know, or know only some of, the parameters of the
distributions A and B. The EM algorithm can be used to find the missing
parameters. The basic idea is to assign the labels A and B to the values
arbitrarily, then classify each value using the Bayes rule (as done in the
queries above) and update its label according to this classification. After
multiple iterations the labels stabilize, and thus the originally unknown
parameters of the two distributions are found.

To illustrate this process, assume that the algorithm is close to
convergence, but there are still some values whose classification differs
from the label they carry. Let's switch the labels of examples 7, 22 and 37
to simulate this. We do this in the file and then reload it.

?- ['c:/prolog/sample.pl'].  % load the modified sample

?- findall(X,example(_,a,X),A),findall(X,example(_,b,X),B),example(ID,C,V),p(V,A,PAV),p(V,B,PBV),
   length(A,N),length(B,M),PA is PAV*N/(N+M),PB is PBV*M/(N+M),(PA>PB,C\=a;PB>PA,C\=b).

ID = 7
C = b
V = 46
PAV = 0.0849553
PBV = 0.0040198
N = 31
M = 20
PA = 0.0516395
PB = 0.00157639 ;

ID = 22
C = a
V = 62
PAV = 0.000424731
PBV = 0.0592342
N = 31
M = 20
PA = 0.00025817
PB = 0.0232291 ;

ID = 37
C = b
V = 39
PAV = 0.0135648
PBV = 0.000208726
N = 31
M = 20
PA = 0.00824528
PB = 8.18534e-005 ;

No

Here is how this query works. After getting the samples from distributions A
and B, it takes a value V from an example, computes the densities in PAV and
PBV, and the corresponding likelihoods in PA and PB, where the terms N/(N+M)
and M/(N+M) compute P(A) and P(B) respectively. Then the term
(PA>PB,C\=a;PB>PA,C\=b) works as a filter: it is true only if the predicted
label for V is different from its actual label in C (if PA>PB then V belongs
to a, and if PB>PA then V belongs to b). Interestingly, the query detects all
the swaps we have made.

In fact, the above query roughly simulates one iteration of the EM algorithm.
After it, the labels of examples 7, 22 and 37 have to be switched back, which
will result in convergence (though this may be just a local solution).
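For convenience, the long query above can be packaged as a predicate. This is
not part of stats.pl; it is just a sketch wrapping the same goals:

% True if example ID (labeled C, with value V) would be classified
% differently from its current label.
misclassified(ID, C, V) :-
    findall(X, example(_, a, X), A),
    findall(X, example(_, b, X), B),
    length(A, N), length(B, M),
    example(ID, C, V),
    p(V, A, PAV), p(V, B, PBV),
    PA is PAV * N / (N + M),
    PB is PBV * M / (N + M),
    ( PA > PB, C \= a ; PB > PA, C \= b ).

With this definition the check above becomes simply ?- misclassified(ID,C,V).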
To verify this, let's switch the labels of examples 7, 22 and 37 back to
their original values and reload the file.

?- ['c:/prolog/sample.pl'].  % load the original sample

?- findall(X,example(_,a,X),A),findall(X,example(_,b,X),B),example(ID,C,V),p(V,A,PAV),p(V,B,PBV),
   length(A,N),length(B,M),PA is PAV*N/(N+M),PB is PBV*M/(N+M),(PA>PB,C\=a;PB>PA,C\=b).

No

"No" here means that there is no value with a label different from the
predicted one. So there is no need for more iterations: all labels have
fallen into their right places.

This approach also works with more attributes. In that case the attributes
are assumed to be independent (the Naive Bayes assumption) and the product of
all densities is used. In the case of nominal attributes the discrete
probabilities (based on counts) are used.

The task that the EM algorithm solves can be seen as unsupervised learning
(clustering). That is the case when, given a mixture, no parameters of the
distributions are known, which means that no class labels are assigned to the
examples. Then, starting with random class assignments, the algorithm comes
up with a grouping of the instances into classes (clusters). In the next
section we discuss another (non-probabilistic) approach to clustering.

II. Distance-based agglomerative clustering (cluster.pl)
========================================================

1. Basic functions
------------------

cluster(Mode,Threshold,Clustering)  - returns Clustering (a list of pairs)
cluster(Mode,Threshold)             - prints the clustering tree

Where:

Mode = min   Single-linkage agglomerative clustering.
             Merging stops when distance > Threshold.
Mode = max   Complete-linkage agglomerative clustering.
             Merging stops when distance > Threshold.

The distance is computed as the number of differing attribute values (as in
k-NN). This means that the distance is in the interval [0, number of
attributes].

2. Clustering cells.pl
----------------------

?- ['c:/prolog/cluster.pl'].  % load the program

?- ['c:/prolog/cells.pl'].    % load the data

?- listing(example).

example(1, c1, [tails=one, color=light, nuclei=one]).
example(2, c2, [tails=two, color=dark, nuclei=two]).
example(3, c2, [tails=two, color=light, nuclei=two]).
example(4, c1, [tails=one, color=dark, nuclei=three]).

Note that the examples are represented in the same way as for the purposes of
concept learning and prediction, i.e. with class labels. These labels,
however, are NOT used in the process of clustering.

The following query generates a clustering in the form of a list of pairs of
examples. It uses the single-linkage clustering method (the Mode parameter is
min), where merging stops if distance > 2.

?- cluster(min,2,CL).

CL = [[[1, 4], [2, 3]]]

To show the clustering as a horizontal tree we may use the "show" procedure.
The second parameter (0) specifies the column where the root node is printed.

?- cluster(min,2,CL),show(CL,0).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

CL = [[[1, 4], [2, 3]]]

Both creating and printing the clustering can be achieved by using the
cluster procedure with two parameters only.

?- cluster(min,2).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

Note the format in which the tree is printed. The root and non-leaf nodes are
shown as "+". Leaves are shown with the corresponding example IDs and their
class labels. As noted above, the class labels are not used in the process of
clustering. If they are unknown or arbitrary, the clustering would be the
same (try, for example, replacing them all with one label). However, when the
class labels are known (as in the example here), we may use them to evaluate
the quality of the clustering (the so-called error-based evaluation).
Obviously, the error of the above clustering is 0, because all examples
grouped in a cluster have the same class label.
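Recall that the distance between two examples is the number of attribute
values on which they differ. As an illustration only (cluster.pl has its own
internal definition, which may differ), such a distance could be computed as
follows:

% Number of attribute values on which two examples differ.
attr_distance(Ex1, Ex2, D) :-
    findall(A, ( member(A=V1, Ex1), member(A=V2, Ex2), V1 \== V2 ), Diffs),
    length(Diffs, D).

For example, ?- example(2,_,E2), example(3,_,E3), attr_distance(E2,E3,D).
gives D = 1 (only color differs), the smallest distance in cells.pl, which is
why examples 2 and 3 are merged first.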
Let's experiment with the value of the threshold.

?- cluster(min,1).  % stop merging if distance > 1

+
    +
        2-c2
        3-c2
    1-c1
    4-c1

So here we have three clusters: one with two examples (2 and 3) and two
others with one example each. The error is again 0.

?- cluster(min,0).  % stop merging if distance > 0, i.e. no merging at all

+
    1-c1
    2-c2
    3-c2
    4-c1

Here each example represents a cluster by itself.

The other agglomerative clustering method (complete-linkage clustering)
produces similar results.

?- cluster(max,1).  % Mode = max, i.e. complete-linkage clustering

+
    +
        2-c2
        3-c2
    1-c1
    4-c1

?- cluster(max,2).

+
    +
        1-c1
        4-c1
    +
        2-c2
        3-c2

3. Clustering the animals data set
----------------------------------

?- ['c:/prolog/animals.pl'].  % load the data

?- cluster(min,1).

+
    +
        +
            5-fish
            7-reptile
        +
            6-reptile
            10-amphibian
    +
        1-mammal
        4-mammal
    +
        8-bird
        9-bird
    2-mammal
    3-mammal

Let's compute the error of this clustering. First we have to decide whether
to consider all clusters or only the top-level ones (the immediate successors
of the root). Assume we use the top-level clusters only. There are 5 of them:
[[5,7],[6,10]], [1,4], [8,9], 2 and 3. In each cluster we find the majority
label and then count the labels that are different (they are considered
errors). Thus examples 5 and 10 in cluster [[5,7],[6,10]] are errors. All
other clusters are error-free. The total error of the top-level clustering is
2/10 = 0.2 error rate.

?- cluster(min,2).

+
    +
        +
            2-mammal
            3-mammal
        +
            1-mammal
            4-mammal
    +
        +
            +
                5-fish
                7-reptile
            +
                6-reptile
                10-amphibian
        +
            8-bird
            9-bird

The error is 2/10 again, but the advantage of this clustering is that it puts
all mammals in one cluster.

?- cluster(max,2).

+
    +
        +
            2-mammal
            3-mammal
        +
            1-mammal
            4-mammal
    +
        5-fish
        7-reptile
    +
        6-reptile
        10-amphibian
    +
        8-bird
        9-bird

This clustering is even better: the error is the same (2/10), but we also
have the birds in one cluster at the top level.

Other thresholds can be tried too, but no lower error can be achieved. The
maximal value of the threshold is 6, because the maximal distance between
examples with 6 attributes is 6. Using bigger thresholds will not make any
difference.
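The error-based evaluation described above can also be sketched in Prolog.
The following predicate is not part of cluster.pl; it simply counts, for a
flat list of example IDs forming one cluster, how many examples disagree with
the cluster's majority label:

% Count the examples in a cluster (a flat list of IDs) whose label differs
% from the cluster's majority label.
cluster_errors(IDs, Errors) :-
    findall(C, ( member(ID, IDs), example(ID, C, _) ), Labels),
    length(Labels, Total),
    findall(N, ( member(L, Labels),
                 findall(1, member(L, Labels), Ones),
                 length(Ones, N) ),
            Counts),
    max_list(Counts, Majority),
    Errors is Total - Majority.

For the top-level cluster [[5,7],[6,10]] of the animals data, the query
?- cluster_errors([5,7,6,10], E). gives E = 2, which corresponds to the two
errors (examples 5 and 10) counted above.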