CS580 - Web Mining, Fall-2004
Laboratory Project 4
====================

Program files: textmine.pl
Data files: webdata.zip

I. Web Document Classification by Decision Tree Learning
--------------------------------------------------------

Assume we have a catalog of documents and their class labels, both stored in
"files.pl" as follows:

files([
  'Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
  'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
  'English.txt', 'Geography.txt', 'History.txt', 'Math.txt',
  'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Physics.txt',
  'Political.txt', 'Psychology.txt', 'Sociology.txt', 'Theatre.txt'
]).

label([
  art - ['Art.txt', 'Justice.txt', 'English.txt', 'History.txt',
         'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Political.txt',
         'Theatre.txt'],
  sci - ['Anthropology.txt', 'Biology.txt', 'Chemistry.txt',
         'Communication.txt', 'Computer.txt', 'Math.txt', 'Physics.txt',
         'Geography.txt', 'Economics.txt', 'Psychology.txt', 'Sociology.txt']
]).

For these experiments we use the discrete representation of our documents as
boolean vectors, along with an additional procedure ("id3format") that
converts the list of vectors into a set of structures and loads them into the
Prolog database. For example, for 5-component vectors we use the following
query:

?- files(F), tf(F,15,T), idf(F,T,5,IDF), label(L), class(F,L,FL),
   binvectors(FL,IDF,Vs), id3format(Vs,IDF).
...

To see the new format of the examples we use the Prolog built-in procedure
"listing":

?- listing(example).

example('Anthropology.txt', sci, [science=1, offers=0, program=1, faculty=1, programs=1]).
example('Art.txt', art, [science=0, offers=1, program=1, faculty=1, programs=1]).
example('Biology.txt', sci, [science=1, offers=0, program=0, faculty=0, programs=1]).
example('Chemistry.txt', sci, [science=1, offers=0, program=1, faculty=0, programs=0]).
example('Communication.txt', sci, [science=0, offers=1, program=1, faculty=1, programs=1]).
example('Computer.txt', sci, [science=1, offers=0, program=1, faculty=1, programs=0]).
example('Justice.txt', art, [science=0, offers=0, program=1, faculty=0, programs=0]).
example('Economics.txt', sci, [science=0, offers=0, program=1, faculty=0, programs=0]).
example('English.txt', art, [science=0, offers=1, program=0, faculty=0, programs=1]).
example('Geography.txt', sci, [science=1, offers=0, program=0, faculty=1, programs=1]).
example('History.txt', art, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Math.txt', sci, [science=1, offers=1, program=0, faculty=0, programs=1]).
example('Languages.txt', art, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Music.txt', art, [science=0, offers=0, program=1, faculty=1, programs=1]).
example('Philosophy.txt', art, [science=0, offers=1, program=1, faculty=0, programs=1]).
example('Physics.txt', sci, [science=0, offers=0, program=1, faculty=1, programs=0]).
example('Political.txt', art, [science=1, offers=0, program=1, faculty=1, programs=0]).
example('Psychology.txt', sci, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Sociology.txt', sci, [science=0, offers=1, program=1, faculty=0, programs=1]).
example('Theatre.txt', art, [science=0, offers=0, program=0, faculty=1, programs=1]).

Now we are ready to run the decision tree learning algorithm:

?- id3(3).

The parameter we specify (3) controls the growth of the tree: if a node
covers 3 or fewer examples, it becomes a leaf no matter what the class
distribution is.
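The source of id3format is in textmine.pl and is not reproduced here. Purely
as an illustration of the conversion step, the following is a minimal sketch,
assuming (hypothetically) that binvectors produces elements of the form
File-Class-Bits and that IDF is the list of selected terms:

:- dynamic example/3.

% Hypothetical sketch, not the actual textmine.pl code.
id3format(Vs, Terms) :-
    retractall(example(_, _, _)),            % discard previously loaded examples
    forall(member(File-Class-Bits, Vs),
           ( pairs(Terms, Bits, AttrVals),   % build [term=Bit, ...]
             assertz(example(File, Class, AttrVals)) )).

% pairs(+Terms, +Bits, -AttrVals): zip each term with its bit.
pairs([], [], []).
pairs([T|Ts], [B|Bs], [T=B|Rest]) :-
    pairs(Ts, Bs, Rest).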
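For reference, ID3 chooses the attribute to split on at each node by
information gain, i.e. the reduction in entropy of the class distribution.
The helper below is not part of textmine.pl; it is only a sketch that makes
the criterion concrete for distributions written as [Class/Count, ...]:

% Illustrative only: entropy of a class distribution, e.g.
% entropy([art/1, sci/2], H) gives H ~ 0.918 bits.
entropy(Distribution, H) :-
    findall(Count, member(_/Count, Distribution), Counts),
    sum_list(Counts, Total),
    entropy_sum(Counts, Total, 0.0, H).

entropy_sum([], _, H, H).
entropy_sum([Count|Counts], Total, Acc, H) :-
    P is Count / Total,
    Acc1 is Acc - P * log(P) / log(2),       % accumulate -p*log2(p)
    entropy_sum(Counts, Total, Acc1, H).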
To see the tree we use:

?- showtree.

science=0
  programs=0 => [art/1, sci/2]
  programs=1
    offers=0 => [art/2]
    offers=1
      program=0
        faculty=1 => [art/2, sci/1]
        faculty=0 => [art/1]
      program=1
        faculty=0 => [art/1, sci/1]
        faculty=1 => [art/1, sci/1]
science=1
  programs=0 => [art/1, sci/2]
  programs=1 => [sci/4]

We can also see the tree as a set of If-Then rules:

?- listing(if).

if [science=0, programs=0] then [art/1, sci/2].
if [science=0, programs=1, offers=0] then [art/2].
if [science=0, programs=1, offers=1, program=0, faculty=1] then [art/2, sci/1].
if [science=0, programs=1, offers=1, program=0, faculty=0] then [art/1].
if [science=0, programs=1, offers=1, program=1, faculty=0] then [art/1, sci/1].
if [science=0, programs=1, offers=1, program=1, faculty=1] then [art/1, sci/1].
if [science=1, programs=0] then [art/1, sci/2].
if [science=1, programs=1] then [sci/4].

Assume we have a new example:

[science=0, offers=1, program=1, faculty=1, programs=0]

Then these rules may be used to classify the new example as follows:

?- if Y then Class,
   subset(Y, [science=0, offers=1, program=1, faculty=1, programs=0]).

Y = [science=0, programs=0]
Class = [art/1, sci/2] ;

In fact we get a class distribution rather than a single class, but as the
majority class is "sci", we may assign class "sci" to the new example. (See
the sketch at the end of this section.)

If we want to get "pure" sets of examples at the leaves, then we have to pass
1 as the parameter to id3:

?- id3(1).
Inconsistent data: cannot split [Justice.txt, Economics.txt] at node offers=0
Inconsistent data: cannot split [History.txt, Languages.txt, Psychology.txt] at node faculty=1
Inconsistent data: cannot split [Philosophy.txt, Sociology.txt] at node faculty=0
Inconsistent data: cannot split [Art.txt, Communication.txt] at node faculty=1
Inconsistent data: cannot split [Computer.txt, Political.txt] at node offers=0

In this case, however, we get some messages indicating inconsistencies in the
data: examples whose attribute vectors are identical but whose class labels
differ cannot be separated by any split (find out which examples are
inconsistent). Using more terms may help avoid such inconsistencies. For
example:

?- files(F), tf(F,15,T), idf(F,T,8,IDF), label(L), class(F,L,FL),
   binvectors(FL,IDF,Vs), id3format(Vs,IDF).

?- id3(1).

?- showtree.

ba=0 => [sci/4]
ba=1
  science=0
    hall=0
      program=0 => [art/1]
      program=1
        students=0 => [sci/1]
        students=1
          programs=0 => [art/1]
          programs=1 => [sci/1]
    hall=1
      students=0 => [art/3]
      students=1
        offers=0 => [art/2]
        offers=1
          faculty=0 => [sci/1]
          faculty=1
            program=0 => [sci/1]
            program=1 => [art/1]
  science=1
    programs=0 => [art/1]
    programs=1 => [sci/3]

Now the leaf sets are "pure".
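The majority-vote step used to classify the new example above can itself be
written as a small Prolog predicate. The classify/2 helper below is not part
of textmine.pl; it is a minimal sketch assuming the if/then operators and the
subset/2 predicate used in the queries above:

% Hypothetical helper: classify an example by finding a rule whose
% conditions it satisfies and returning the majority class of that
% rule's leaf distribution.
classify(Example, Class) :-
    (if Conditions then Distribution),       % pick a rule from the tree
    subset(Conditions, Example),             % its tests hold for Example
    majority(Distribution, Class).

% majority(+Distribution, -Class): the class with the largest count,
% e.g. majority([art/1, sci/2], sci).
majority(Distribution, Class) :-
    findall(Count-C, member(C/Count, Distribution), Pairs),
    msort(Pairs, Ascending),                 % sort by count, keep duplicates
    last(Ascending, _-Class).                % take the largest count

For the new example above, classify([science=0, offers=1, program=1,
faculty=1, programs=0], Class) matches the rule with distribution
[art/1, sci/2] and yields Class = sci.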