CS580 - Web Mining
Fall-2004
Laboratory Project 3
====================

Program files: textmine.pl
Data files: webdata.zip, alarm.pl, artsci.pl

I. Web Document Classification by Naive Bayes
---------------------------------------------

Assume we have a catalog of documents and their class labels, both stored in "files.pl" as follows:

files([
  'Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
  'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
  'English.txt', 'Geography.txt', 'History.txt', 'Math.txt',
  'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Physics.txt',
  'Political.txt', 'Psychology.txt', 'Sociology.txt', 'Theatre.txt'
]).

label([
  art - ['Art.txt', 'Justice.txt', 'English.txt', 'History.txt', 'Languages.txt',
         'Music.txt', 'Philosophy.txt', 'Political.txt', 'Theatre.txt'],
  sci - ['Anthropology.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt',
         'Computer.txt', 'Math.txt', 'Physics.txt', 'Geography.txt',
         'Economics.txt', 'Psychology.txt', 'Sociology.txt']
]).

For these experiments we use a discrete representation of our documents as boolean vectors. So, instead of "vectors", we use "binvectors". For example, for 5-component vectors we use the following query:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1, 1]
Art.txt-art-[0, 1, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0, 1]
Chemistry.txt-sci-[1, 0, 1, 0, 0]
Communication.txt-sci-[0, 1, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1, 0]
Justice.txt-art-[0, 0, 1, 0, 0]
Economics.txt-sci-[0, 0, 1, 0, 0]
English.txt-art-[0, 1, 0, 0, 1]
Geography.txt-sci-[1, 0, 0, 1, 1]
History.txt-art-[0, 1, 0, 1, 1]
Math.txt-sci-[1, 1, 0, 0, 1]
Languages.txt-art-[0, 1, 0, 1, 1]
Music.txt-art-[0, 0, 1, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0, 1]
Physics.txt-sci-[0, 0, 1, 1, 0]
Political.txt-art-[1, 0, 1, 1, 0]
Psychology.txt-sci-[0, 1, 0, 1, 1]
Sociology.txt-sci-[0, 1, 1, 0, 1]
Theatre.txt-art-[0, 0, 0, 1, 1]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty, 0.405465-programs]

The list of terms used to create the above binary vector representation is the IDF list above, where the IDF weights are ignored. That is, if a term appears in a document the corresponding vector component is 1; if not, it is 0.

The following procedures, which implement parts of the Naive Bayes algorithm, are provided in textmine.pl:

cond_prob(X,Class,Examples,LP) - generates a list LP of conditional probabilities, one for each attribute-value pair in X, given Class.
class_prob(Class,Examples,CP) - calculates the probability of Class (the proportion of examples in Class w.r.t. the whole data set).
probs(E,Examples,LP) - generates the likelihoods of E belonging to each class.
bayes(X,Examples,Class) - assigns X a Class according to the Naive Bayes algorithm.
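The actual source of these procedures is in textmine.pl and is not reproduced here. As a rough sketch of the computation they perform (assuming each example has the form File-Class-Vector, as printed by ppl above), they could be written along these lines. This is illustrative only: the helpers count_first, drop_first and product are hypothetical, and the real definitions in textmine.pl may differ.

% class_prob(+Class, +Examples, -CP): proportion of examples labeled Class.
class_prob(Class, Examples, CP) :-
    length(Examples, N),
    findall(x, member(_-Class-_, Examples), Hits),
    length(Hits, K),
    CP is K / N.

% cond_prob(+X, +Class, +Examples, -LP): for each component of X, the relative
% frequency of that value at that position among the examples of Class.
cond_prob(X, Class, Examples, LP) :-
    findall(V, member(_-Class-V, Examples), Vecs),
    length(Vecs, N),
    cond_probs(X, Vecs, N, LP).

cond_probs([], _, _, []).
cond_probs([V|Rest], Vecs, N, [P|Ps]) :-
    count_first(V, Vecs, K),       % class examples that agree on this component
    P is K / N,
    drop_first(Vecs, Tails),       % move on to the next component
    cond_probs(Rest, Tails, N, Ps).

count_first(_, [], 0).
count_first(V, [[V|_]|Vecs], K) :- !, count_first(V, Vecs, K1), K is K1 + 1.
count_first(V, [_|Vecs], K) :- count_first(V, Vecs, K).

drop_first([], []).
drop_first([[_|T]|Vecs], [T|Ts]) :- drop_first(Vecs, Ts).

% probs(+X, +Examples, -Ps): likelihood of X for each class, largest first.
probs(X, Examples, Ps) :-
    setof(C, F^V^member(F-C-V, Examples), Classes),
    findall(P-C,
            ( member(C, Classes),
              class_prob(C, Examples, CP),
              cond_prob(X, C, Examples, LP),
              product(LP, CP, P) ),
            Unsorted),
    msort(Unsorted, Ascending),
    reverse(Ascending, Ps).

product([], P, P).
product([Q|Qs], Acc, P) :- Acc1 is Acc * Q, product(Qs, Acc1, P).

% bayes(+X, +Examples, -Class): the class with the highest likelihood.
bayes(X, Examples, Class) :- probs(X, Examples, [_-Class|_]).

With these definitions, probs/3 multiplies the class prior by the per-component conditional probabilities, which is exactly the computation traced step by step in the examples below.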
Below are examples of using the above procedures with the 5-component vector representation of the documents described in files.pl. The vector X=[0,1,1,1,0] is used as the representation of a new document, which is to be classified.

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,1,1,1,0],sci,Vs,LP).

LP = [0.454545, 0.363636, 0.636364, 0.545455, 0.363636]

In LP we have the conditional probabilities of each value (0/1) in the vector [0,1,1,1,0] given class "sci". We may also compute the prior probabilities of classes "sci" and "art" as follows:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,CP).

CP = 0.55

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,CP).

CP = 0.45

Then, by multiplying these probabilities, we get the likelihoods of X belonging to "sci" and to "art". This is done by the procedure "probs".

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),probs([0,1,1,1,0],Vs,Ps).

Ps = [0.0182899-art, 0.0114746-sci]

According to these likelihoods, X belongs to class "art". The procedure "bayes" does all this in one step:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([0,1,1,1,0],Vs,Class).

Class = art

If we change the test example X to [1,1,1,1,0], we get a different classification:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([1,1,1,1,0],Vs,Class).

Class = sci

II. LOO-CV evaluation
---------------------

The following query deletes a vector and then classifies it using the rest of the vectors. This is repeated for all vectors in the set.

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),del(ID-X,Vs,R),bayes(X,R,Class),write(ID-Class),nl,fail.

Anthropology.txt-sci-sci
Art.txt-art-art
Biology.txt-sci-sci
Chemistry.txt-sci-sci
Communication.txt-sci-art
Computer.txt-sci-sci
Justice.txt-art-sci
Economics.txt-sci-sci
English.txt-art-art
Geography.txt-sci-sci
History.txt-art-art
Math.txt-sci-sci
Languages.txt-art-art
Music.txt-art-sci
Philosophy.txt-art-sci
Physics.txt-sci-art
Political.txt-art-sci
Psychology.txt-sci-art
Sociology.txt-sci-art
Theatre.txt-art-art

By counting the number of differences between the actual and predicted labels we compute the LOO-CV error. For the above experiment it is 8/20 = 40%.
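The counting can also be automated with a small helper predicate. The name loo_error is hypothetical (it is not part of textmine.pl); it relies only on del/3 and bayes/3, used exactly as in the query above:

% loo_error(+Vs, -Error): leave-one-out cross-validation error of Naive Bayes.
loo_error(Vs, Error) :-
    findall(x,
            ( del(_File-Actual-X, Vs, Rest),   % hold one vector out
              bayes(X, Rest, Predicted),       % classify it with the remaining ones
              Predicted \== Actual ),          % keep only the misclassified cases
            Misses),
    length(Misses, M),
    length(Vs, N),
    Error is M / N.

With the 5-term representation above, replacing the del/bayes/write part of the query by loo_error(Vs,E) should report E = 0.4, matching the 8/20 count.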
III. Using Bayesian (Belief) Network for reasoning/classification
------------------------------------------------------------------

A simple example of a belief network is provided in the file alarm.pl. Read it first to understand the representation. The BN reasoning algorithm is used through p(Var,Obs,Dist), where:
- Var is a variable (as defined in variables([...]));
- Obs is a list of observations (variable=value pairs);
- Dist is the distribution P(Var|Obs);
- Var must not appear in the observations Obs.

Make sure that textmine.pl is loaded and then load alarm.pl too:

?- [alarm].    % load the BN for the alarm network

1. Diagnostic reasoning: from effect to cause (the effect is given as evidence)
-------------------------------------------------------------------------------

What is the probability distribution of Burglary (b) given the evidence that John calls (j=t)?

?- p(b,[j=t],P).

P = [0.0162837, 0.983716]

The probability distribution [0.0162837, 0.983716] corresponds to the values of the variable b as defined in values(b,[t,f]). Thus, the probability of "f" is 0.983716. Without this evidence we have:

?- p(b,[],P).

P = [0.001, 0.999]

This is the prior distribution of b, as it has no parents. Let's add more evidence: John calls and Mary calls too.

?- p(b,[j=t,m=t],P).

P = [0.284172, 0.715828]

The probability of burglary increases: if they both call, it becomes more likely that there is a burglary. Still, the alarm itself has not been observed, so the probability of b is not too high. Adding the alarm (a=t) as evidence further increases this probability.

?- p(b,[j=t,m=t,a=t],P).

P = [0.373551, 0.626449]

Because there is another possible cause for the alarm (earthquake), the probability of b can also be increased by adding the evidence that there is no earthquake (e=f).

?- p(b,[j=t,m=t,a=t,e=f],P).

P = [0.484786, 0.515214]

This last step, however, is not diagnostic reasoning, because b (burglary) and e (earthquake) are not connected by a causal relation.

2. Predictive reasoning: from cause to effect
---------------------------------------------

What is the probability distribution of John calls (j), given that there is a burglary (b=t)?

?- p(j,[b=t],P).

P = [0.849017, 0.150983]

So, it is very likely that John calls in this situation (the first value for j is t). Similarly, we can get the probability that Mary calls given the same evidence. It is lower than John's (why?).

?- p(m,[b=t],P).

P = [0.658614, 0.341386]
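The file alarm.pl itself is not reproduced in this handout, but, consistent with the query results above, it presumably encodes the standard burglary-alarm network along the following lines, using the same variables/parents/values/pr representation that we build for the documents in the next section. The exact layout, in particular how the two parent values of a are listed in pr/3 and the ordering inside variables/1, is an assumption here; check the actual file.

variables([b, e, a, j, m]).        % burglary, earthquake, alarm, JohnCalls, MaryCalls

parents(b, []).
parents(e, []).
parents(a, [b, e]).                % the alarm is caused by burglary and/or earthquake
parents(j, [a]).                   % John calls when he hears the alarm
parents(m, [a]).                   % Mary calls when she hears the alarm

values(b, [t, f]).
values(e, [t, f]).
values(a, [t, f]).
values(j, [t, f]).
values(m, [t, f]).

pr(b, [], [0.001, 0.999]).         % prior of burglary (matches p(b,[],P) above)
pr(e, [], [0.002, 0.998]).
pr(a, [b=t, e=t], [0.95, 0.05]).
pr(a, [b=t, e=f], [0.94, 0.06]).
pr(a, [b=f, e=t], [0.29, 0.71]).
pr(a, [b=f, e=f], [0.001, 0.999]).
pr(j, [a=t], [0.90, 0.10]).
pr(j, [a=f], [0.05, 0.95]).
pr(m, [a=t], [0.70, 0.30]).
pr(m, [a=f], [0.01, 0.99]).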
IV. Creating a belief network for the text documents from files.pl
-------------------------------------------------------------------

We shall illustrate this with a sample of our document representation obtained by the following query:

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1]
Art.txt-art-[0, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0]
Chemistry.txt-sci-[1, 0, 1, 0]
Communication.txt-sci-[0, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1]
Justice.txt-art-[0, 0, 1, 0]
Economics.txt-sci-[0, 0, 1, 0]
English.txt-art-[0, 1, 0, 0]
Geography.txt-sci-[1, 0, 0, 1]
History.txt-art-[0, 1, 0, 1]
Math.txt-sci-[1, 1, 0, 0]
Languages.txt-art-[0, 1, 0, 1]
Music.txt-art-[0, 0, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0]
Physics.txt-sci-[0, 0, 1, 1]
Political.txt-art-[1, 0, 1, 1]
Psychology.txt-sci-[0, 1, 0, 1]
Sociology.txt-sci-[0, 1, 1, 0]
Theatre.txt-art-[0, 0, 0, 1]

IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty]

Thus we have 4 terms: science, offers, program, faculty.

1. The BN variables are the class and the terms (features):

variables([class, science, offers, program, faculty]).

2. The structure of the graph represents the causal relationship between the terms and the class. The class value determines (is a cause for) the term values (the effects). So, we have the class node as a parent of all attributes. In Prolog this is:

parents(science,[class]).
parents(offers,[class]).
parents(program,[class]).
parents(faculty,[class]).
parents(class,[]).

3. The term values (0/1) are the possible values that the BN variables can take:

values(science,[0,1]).
values(offers,[0,1]).
values(program,[0,1]).
values(faculty,[0,1]).
values(class,[sci,art]).

4. Conditional probabilities: use the procedures from Naive Bayes to compute them.

CPT's for the terms:
--------------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],sci,Vs,P).

P = [0.454545, 0.636364, 0.363636, 0.454545]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],sci,Vs,P).

P = [0.545455, 0.363636, 0.636364, 0.545455]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],art,Vs,P).

P = [0.888889, 0.444444, 0.444444, 0.333333]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],art,Vs,P).

P = [0.111111, 0.555556, 0.555556, 0.666667]

The probabilities calculated by cond_prob are used to create the CPT's as follows:

pr(science,[class=sci],[0.454545,0.545455]).
pr(science,[class=art],[0.888889,0.111111]).
pr(offers,[class=sci],[0.636364,0.363636]).
pr(offers,[class=art],[0.444444,0.555556]).
pr(program,[class=sci],[0.363636,0.636364]).
pr(program,[class=art],[0.444444,0.555556]).
pr(faculty,[class=sci],[0.454545,0.545455]).
pr(faculty,[class=art],[0.333333,0.666667]).

CPT for class:
--------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,P).

P = 0.55

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,P).

P = 0.45

This leads to the following definition:

pr(class,[],[0.55,0.45]).

Note that for all CPT's the order in which the probabilities are listed follows the order of their corresponding values as described in "values".

V. Classification of new vectors (documents) with BN
-----------------------------------------------------

Let's put all the Prolog structures used to define the BN in a file named artsci.pl and load the latter into Prolog:

?- [artsci].

Then a new document vector, say [1,1,1,1], can be classified by using the following query:

?- p(class,[science=1,offers=1,program=1,faculty=1],P).

P = [0.786352, 0.213648]

The first probability is for class=sci, and because it is higher we can classify this document as belonging to class sci. We may also try this with less evidence, for example:

?- p(class,[science=1,offers=1,program=1],P).

P = [0.818133, 0.181867]

?- p(class,[science=1],P).

P = [0.857143, 0.142857]

Basically the same results are obtained by Naive Bayes.
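This is expected: with the structure above (class is the only parent of every term), the BN posterior for class is exactly the Naive Bayes computation. For example, for the last query, with a single piece of evidence, Bayes' rule applied to the CPT entries above gives

P(class=sci | science=1)
  = P(science=1|sci)*P(sci) / ( P(science=1|sci)*P(sci) + P(science=1|art)*P(art) )
  = 0.545455*0.55 / (0.545455*0.55 + 0.111111*0.45)
  = 0.3 / 0.35
  = 0.857143,

which agrees with the result returned by the query.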