CS462 - Artificial Intelligence
Spring-2005
Laboratory Project 1: Web/text document classification
======================================================

Program files: textmine.pl
Data files: webdata.zip, artsci.pl

1. Web Document Collection
--------------------------

Let's browse a number of web pages and save each one as a text file. For
example, the following 20 pages were collected from the web site of the CCSU
School of Arts and Sciences. For convenience we put the file names in a list,
put the list in a file, and then load the latter into the Prolog database.
So, let's assume that the file name is files.pl and its contents are:

files([
   'Anthropology.txt',
   'Art.txt',
   'Biology.txt',
   'Chemistry.txt',
   'Communication.txt',
   'Computer.txt',
   'Justice.txt',
   'Economics.txt',
   'English.txt',
   'Geography.txt',
   'History.txt',
   'Math.txt',
   'Languages.txt',
   'Music.txt',
   'Philosophy.txt',
   'Physics.txt',
   'Political.txt',
   'Psychology.txt',
   'Sociology.txt',
   'Theatre.txt'
]).

label([
   art - ['Art.txt', 'Justice.txt', 'English.txt', 'History.txt',
          'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Political.txt',
          'Theatre.txt'],
   sci - ['Anthropology.txt', 'Biology.txt', 'Chemistry.txt',
          'Communication.txt', 'Computer.txt', 'Math.txt', 'Physics.txt',
          'Geography.txt', 'Economics.txt', 'Psychology.txt',
          'Sociology.txt']
]).

Then to load the file in Prolog we use the query:

?- [files].

2. Generating terms from the corpus of all documents
----------------------------------------------------

For this and all further steps we use the Prolog program textmine.pl,
loading it into Prolog first with the query:

?- [textmine].

The following query generates a list of the 20 most frequent terms that
appear in the corpus of all 20 documents.

?- files(F),tf(F,20,T),write(T).

[department, study, students, ba, website, location, programs, 832, phone,
chair, program, science, hall, faculty, offers, music, courses, research,
studies, sociology]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832,
phone|...]

Note that we use write(T) to print the whole list, because in its standard
answer Prolog prints just the first 9 elements of a long list.

3. Generating the inverse document frequency list (IDF)
-------------------------------------------------------

First we have to generate a list of terms, which we then pass to the
procedure that generates the IDF list. For example:

?- files(F),tf(F,50,T),idf(F,T,20,IDF),write(IDF).

[3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre,
3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry,
2.35138-physics, 2.35138-political, 1.94591-history, 1.94591-sciences,
1.65823-american, 1.65823-social, 1.65823-international, 1.65823-public,
1.43508-computer, 1.43508-offered, 1.25276-ma, 1.25276-work]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832,
phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology,
3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication,
3.04452-chemistry, ... -...|...]

Note that the IDF list is ordered by decreasing IDF values (shown before
each term). As the IDF value is typically large for rare terms, the IDF list
contains the 20 least frequent of the 50 terms generated by tf(F,50,T).
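For reference, the printed weights are consistent with the usual inverse
document frequency formula (this is inferred from the values above, not
quoted from textmine.pl, so consult the source for the exact definition):

   idf(t) = ln((N + 1) / df(t))

where N = 20 is the number of documents and df(t) is the number of documents
in which term t occurs. For instance, ln(21/1) = 3.04452, matching terms
such as music that would then occur in a single document, and
ln(21/3) = 1.94591, matching history.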
4. Generating document vectors
------------------------------

?- files(F),tf(F,50,T),idf(F,T,20,IDF),vectors(F,IDF,V),ppl(V).

Anthropology.txt-[0, 0, 0.989818, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.066903, 0.0677833, 0.0686871, 0, 0.0602473, 0, 0.0533136]
Art.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Biology.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.919678, 0, 0, 0, 0, 0.233791, 0.236478, 0.208835, 0]
Chemistry.txt-[0, 0, 0, 0, 0, 0, 0, 0.991006, 0, 0, 0, 0, 0.10034, 0, 0, 0, 0.0885403, 0, 0, 0]
Communication.txt-[0, 0, 0, 0, 0, 0, 0.981156, 0, 0, 0.135537, 0, 0, 0.0967631, 0, 0, 0.0979726, 0, 0, 0, 0]
Computer.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Justice.txt-[0, 0, 0, 0, 0.703715, 0.710482, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Economics.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.546823, 0, 0, 0, 0, 0.837248]
English.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.500821, 0, 0, 0, 0, 0, 0.383151, 0.776127]
Geography.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
History.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.877854, 0, 0.469615, 0, 0, 0, 0, 0, 0.0939918, 0]
Math.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0.32297, 0, 0, 0.814169, 0, 0, 0, 0, 0.209831, 0.213271, 0.378557, 0]
Languages.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.641837, 0, 0, 0.570685, 0.512215, 0]
Music.txt-[0.995667, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0610873, 0, 0, 0, 0, 0.0527235, 0.0462213, 0, 0, 0]
Philosophy.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Physics.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0.934359, 0, 0, 0.356332, 0, 0, 0, 0, 0, 0, 0, 0]
Political.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0.895432, 0, 0, 0, 0.113139, 0.229142, 0.352641, 0, 0, 0, 0.0924049]
Psychology.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.545458, 0, 0, 0, 0, 0.838138, 0]
Sociology.txt-[0, 0.967344, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.248383, 0, 0, 0, 0, 0, 0.0505209]
Theatre.txt-[0, 0, 0, 0.991255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.064895, 0, 0.114897]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832,
phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology,
3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication,
3.04452-chemistry, ... -...|...]
V = ['Anthropology.txt'-[0, 0, 0.989818, 0, 0, 0, 0|...],
'Art.txt'-[0, 0, 0, 0, 0, 0|...], 'Biology.txt'-[0, 0, 0, 0, 0|...],
'Chemistry.txt'-[0, 0, 0, 0|...], 'Communication.txt'-[0, 0, 0|...],
'Computer.txt'-[0, 0|...], 'Justice.txt'-[0|...],
'Economics.txt'-[...|...], ... -...|...]

Note that ppl(V) is used here to print the list of vectors nicely (it prints
each list item on a separate line).
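The components of each vector are TF-IDF weights for the 20 terms in the IDF
list, in the order shown there. The exact weighting scheme is defined in
textmine.pl, but the printed values are consistent with each vector being
normalized to unit Euclidean length, roughly:

   w_i = tf(t_i,d) * idf(t_i),   v_i = w_i / sqrt(w_1^2 + ... + w_20^2)

For example, the squared components of the Anthropology.txt vector sum to
about 1; documents containing exactly one of the 20 terms (Computer.txt,
Philosophy.txt) get a single component equal to 1; and documents containing
none of the terms (Art.txt, Geography.txt) get all-zero vectors.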
5. Predicting document class with Naive Bayes
---------------------------------------------

For these experiments we use a discrete representation of our documents as
boolean vectors. So, instead of "vectors", we use "binvectors". We also use
the document labels, so that we get vectors with IDs and labels too. For
example, for 5-component vectors we use the following query:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1, 1]
Art.txt-art-[0, 1, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0, 1]
Chemistry.txt-sci-[1, 0, 1, 0, 0]
Communication.txt-sci-[0, 1, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1, 0]
Justice.txt-art-[0, 0, 1, 0, 0]
Economics.txt-sci-[0, 0, 1, 0, 0]
English.txt-art-[0, 1, 0, 0, 1]
Geography.txt-sci-[1, 0, 0, 1, 1]
History.txt-art-[0, 1, 0, 1, 1]
Math.txt-sci-[1, 1, 0, 0, 1]
Languages.txt-art-[0, 1, 0, 1, 1]
Music.txt-art-[0, 0, 1, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0, 1]
Physics.txt-sci-[0, 0, 1, 1, 0]
Political.txt-art-[1, 0, 1, 1, 0]
Psychology.txt-sci-[0, 1, 0, 1, 1]
Sociology.txt-sci-[0, 1, 1, 0, 1]
Theatre.txt-art-[0, 0, 0, 1, 1]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832,
phone|...]
IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty,
0.405465-programs]

The list of terms used to create the above binary vector representation is
the IDF list above, with the IDF weights ignored. That is, if a term appears
in a document, the corresponding vector component is 1; if not, it is 0.

The following procedures, which implement parts of the Naive Bayes
algorithm, are provided in textmine.pl:

cond_prob(X,Class,Examples,LP) - generates a list LP of conditional
probabilities for each value pair in X, given Class.

class_prob(Class,Examples,CP) - calculates the probability of Class (the
proportion of examples in Class w.r.t. the whole data set).

probs(E,Examples,LP) - generates the likelihoods of E belonging to each
class.

bayes(X,Examples,Class) - assigns X a Class according to the Naive Bayes
algorithm.

Below are examples of using the above procedures with the 5-component vector
representation of the documents described in files.pl. The vector
X = [0,1,1,1,0] is used as a representation of a new document, which is to
be classified.

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,1,1,1,0],sci,Vs,LP).

LP = [0.454545, 0.363636, 0.636364, 0.545455, 0.363636]

In LP we have the conditional probabilities of each value (0/1) in the
vector [0,1,1,1,0] given class "sci". We may also compute the prior
probabilities of classes "sci" and "art" as follows:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,CP).

CP = 0.55

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,CP).

CP = 0.45

Then, by multiplying each class prior by the corresponding conditional
probabilities, we get the likelihoods of X belonging to "sci" and to "art".
This is done by the procedure "probs":

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),probs([0,1,1,1,0],Vs,Ps).

Ps = [0.0182899-art, 0.0114746-sci]

According to these likelihoods, X belongs to class "art". The procedure
"bayes" does all this in one step:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([0,1,1,1,0],Vs,Class).

Class = art

If we change the test example X to [1,1,1,1,0] we get a different
classification:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([1,1,1,1,0],Vs,Class).

Class = sci
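As a sanity check on the numbers above (a hand calculation, not part of
textmine.pl), the likelihood that probs reports for class "sci" is just the
class prior times the five conditional probabilities returned by cond_prob:

   P(sci) * P(X|sci) = 0.55 * 0.454545 * 0.363636 * 0.636364 * 0.545455 * 0.363636
                     ~ 0.0114746

which matches the 0.0114746-sci entry in Ps. The 0.0182899-art entry is
obtained in the same way from the "art" prior and conditionals.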
6. Creating a Bayesian network for the text documents
-----------------------------------------------------

We shall illustrate this with a sample of our document representation
obtained by the following query:

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1]
Art.txt-art-[0, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0]
Chemistry.txt-sci-[1, 0, 1, 0]
Communication.txt-sci-[0, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1]
Justice.txt-art-[0, 0, 1, 0]
Economics.txt-sci-[0, 0, 1, 0]
English.txt-art-[0, 1, 0, 0]
Geography.txt-sci-[1, 0, 0, 1]
History.txt-art-[0, 1, 0, 1]
Math.txt-sci-[1, 1, 0, 0]
Languages.txt-art-[0, 1, 0, 1]
Music.txt-art-[0, 0, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0]
Physics.txt-sci-[0, 0, 1, 1]
Political.txt-art-[1, 0, 1, 1]
Psychology.txt-sci-[0, 1, 0, 1]
Sociology.txt-sci-[0, 1, 1, 0]
Theatre.txt-art-[0, 0, 0, 1]

IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty]

Thus we have 4 terms: science, offers, program, faculty.

1. The BN variables are the class and the terms (features):

variables([class, science, offers, program, faculty]).

2. The structure of the graph represents the causal relationship between the
terms and the class. The class value determines (is a cause for) the term
values (the effects). So, we have the class node as a parent of all
attributes. In Prolog this is:

parents(science,[class]).
parents(offers,[class]).
parents(program,[class]).
parents(faculty,[class]).
parents(class,[]).

3. The term values (0/1) are the possible values that the BN variables can
take:

values(science,[0,1]).
values(offers,[0,1]).
values(program,[0,1]).
values(faculty,[0,1]).
values(class,[sci,art]).

4. Conditional probabilities: we use the procedures from Naive Bayes to
compute them.

CPTs for the terms:
-------------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],sci,Vs,P).

P = [0.454545, 0.636364, 0.363636, 0.454545]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],sci,Vs,P).

P = [0.545455, 0.363636, 0.636364, 0.545455]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],art,Vs,P).

P = [0.888889, 0.444444, 0.444444, 0.333333]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],art,Vs,P).

P = [0.111111, 0.555556, 0.555556, 0.666667]

The probabilities calculated by cond_prob are used to create the CPTs as
follows:

pr(science,[class=sci],[0.454545,0.545455]).
pr(science,[class=art],[0.888889,0.111111]).
pr(offers,[class=sci],[0.636364,0.363636]).
pr(offers,[class=art],[0.444444,0.555556]).
pr(program,[class=sci],[0.363636,0.636364]).
pr(program,[class=art],[0.444444,0.555556]).
pr(faculty,[class=sci],[0.454545,0.545455]).
pr(faculty,[class=art],[0.333333,0.666667]).

CPT for class:
--------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,P).

P = 0.55

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,P).

P = 0.45

This leads to the following definition:

pr(class,[],[0.55,0.45]).

Note that for all CPTs the probabilities are listed in the order of their
corresponding values as described in "values".
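To see the whole network at a glance, the facts from steps 1-4 can be
collected into a single Prolog file. The sketch below is exactly the set of
facts derived above; the inference procedure p/3 used in the next section is
assumed to be provided by textmine.pl:

% Complete BN definition for the art/sci document classifier,
% assembled verbatim from steps 1-4 above.
variables([class, science, offers, program, faculty]).

parents(class,[]).                  % class is the single root (cause) node
parents(science,[class]).
parents(offers,[class]).
parents(program,[class]).
parents(faculty,[class]).

values(class,[sci,art]).
values(science,[0,1]).
values(offers,[0,1]).
values(program,[0,1]).
values(faculty,[0,1]).

pr(class,[],[0.55,0.45]).           % prior: P(sci), P(art)
pr(science,[class=sci],[0.454545,0.545455]).
pr(science,[class=art],[0.888889,0.111111]).
pr(offers,[class=sci],[0.636364,0.363636]).
pr(offers,[class=art],[0.444444,0.555556]).
pr(program,[class=sci],[0.363636,0.636364]).
pr(program,[class=art],[0.444444,0.555556]).
pr(faculty,[class=sci],[0.454545,0.545455]).
pr(faculty,[class=art],[0.333333,0.666667]).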
7. Classification of new vectors (documents) with BN
----------------------------------------------------

Let's put all the Prolog structures used to define the BN in a file named
artsci.pl and load the latter into Prolog:

?- [artsci].

Then a new document vector, say [1,1,1,1], can be classified by using the
following query:

?- p(class,[science=1,offers=1,program=1,faculty=1],P).

P = [0.786352, 0.213648]

The first probability is for class=sci, and because it is higher we can
classify this document as belonging to class sci. We may also try this with
less evidence, for example:

?- p(class,[science=1,offers=1,program=1],P).

P = [0.818133, 0.181867]

?- p(class,[science=1],P).

P = [0.857143, 0.142857]

Basically the same results are obtained by Naive Bayes.
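The last answer can be verified by hand with Bayes' rule, using the CPT
entries from the previous section (a check on the printed values; p/3 may
organize the computation differently internally):

   P(sci|science=1) = P(sci) P(science=1|sci) /
                      (P(sci) P(science=1|sci) + P(art) P(science=1|art))
                    = 0.55*0.545455 / (0.55*0.545455 + 0.45*0.111111)
                    = 0.3 / 0.35 ~ 0.857143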
8. Document classification with Decision Tree learning
------------------------------------------------------

For these experiments we use the discrete representation of our documents as
boolean vectors and an additional procedure ("id3format") that converts the
list of vectors into a set of structures and loads them into the Prolog
database. For example, for 5-component vectors we use the following query:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),id3format(Vs,IDF).
...

To see the new format of the examples we use the Prolog built-in procedure
"listing":

?- listing(example).

example('Anthropology.txt', sci, [science=1, offers=0, program=1, faculty=1, programs=1]).
example('Art.txt', art, [science=0, offers=1, program=1, faculty=1, programs=1]).
example('Biology.txt', sci, [science=1, offers=0, program=0, faculty=0, programs=1]).
example('Chemistry.txt', sci, [science=1, offers=0, program=1, faculty=0, programs=0]).
example('Communication.txt', sci, [science=0, offers=1, program=1, faculty=1, programs=1]).
example('Computer.txt', sci, [science=1, offers=0, program=1, faculty=1, programs=0]).
example('Justice.txt', art, [science=0, offers=0, program=1, faculty=0, programs=0]).
example('Economics.txt', sci, [science=0, offers=0, program=1, faculty=0, programs=0]).
example('English.txt', art, [science=0, offers=1, program=0, faculty=0, programs=1]).
example('Geography.txt', sci, [science=1, offers=0, program=0, faculty=1, programs=1]).
example('History.txt', art, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Math.txt', sci, [science=1, offers=1, program=0, faculty=0, programs=1]).
example('Languages.txt', art, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Music.txt', art, [science=0, offers=0, program=1, faculty=1, programs=1]).
example('Philosophy.txt', art, [science=0, offers=1, program=1, faculty=0, programs=1]).
example('Physics.txt', sci, [science=0, offers=0, program=1, faculty=1, programs=0]).
example('Political.txt', art, [science=1, offers=0, program=1, faculty=1, programs=0]).
example('Psychology.txt', sci, [science=0, offers=1, program=0, faculty=1, programs=1]).
example('Sociology.txt', sci, [science=0, offers=1, program=1, faculty=0, programs=1]).
example('Theatre.txt', art, [science=0, offers=0, program=0, faculty=1, programs=1]).

Now we are ready to run the decision tree learning algorithm:

?- id3(3).

The parameter we specify (3) is used to control the growing of the tree: if
a node covers 3 or fewer examples, it becomes a leaf no matter what the
class distribution is. To see the tree we use:

?- showtree.

science=0
   programs=0 => [art/1, sci/2]
   programs=1
      offers=0 => [art/2]
      offers=1
         program=0
            faculty=1 => [art/2, sci/1]
            faculty=0 => [art/1]
         program=1
            faculty=0 => [art/1, sci/1]
            faculty=1 => [art/1, sci/1]
science=1
   programs=0 => [art/1, sci/2]
   programs=1 => [sci/4]

We can also see the tree as a set of If-Then rules:

?- listing(if).

if[science=0, programs=0]then[art/1, sci/2].
if[science=0, programs=1, offers=0]then[art/2].
if[science=0, programs=1, offers=1, program=0, faculty=1]then[art/2, sci/1].
if[science=0, programs=1, offers=1, program=0, faculty=0]then[art/1].
if[science=0, programs=1, offers=1, program=1, faculty=0]then[art/1, sci/1].
if[science=0, programs=1, offers=1, program=1, faculty=1]then[art/1, sci/1].
if[science=1, programs=0]then[art/1, sci/2].
if[science=1, programs=1]then[sci/4].

Assume we have a new example [science=0, offers=1, program=1, faculty=1,
programs=0]. Then these rules may be used to classify it as follows:

?- if Y then Class, subset(Y,[science=0, offers=1, program=1, faculty=1, programs=0]).

Y = [science=0, programs=0]
Class = [art/1, sci/2] ;

In fact we get a distribution, but as the majority class is "sci", we may
assign class "sci" to the new example.

If we want to get "pure" sets of examples at the leaves, then we have to
pass 1 as a parameter to id3:

?- id3(1).

Inconsistent data: cannot split [Justice.txt, Economics.txt] at node offers=0
Inconsistent data: cannot split [History.txt, Languages.txt, Psychology.txt] at node faculty=1
Inconsistent data: cannot split [Philosophy.txt, Sociology.txt] at node faculty=0
Inconsistent data: cannot split [Art.txt, Communication.txt] at node faculty=1
Inconsistent data: cannot split [Computer.txt, Political.txt] at node offers=0

In this case, however, we get some messages indicating inconsistencies in
the data (find out what is inconsistent). Using more terms may help avoid
such inconsistencies. For example:

?- files(F),tf(F,15,T),idf(F,T,8,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),id3format(Vs,IDF).

?- id3(1).

?- showtree.

ba=0 => [sci/4]
ba=1
   science=0
      hall=0
         program=0 => [art/1]
         program=1
            students=0 => [sci/1]
            students=1
               programs=0 => [art/1]
               programs=1 => [sci/1]
      hall=1
         students=0 => [art/3]
         students=1
            offers=0 => [art/2]
            offers=1
               faculty=0 => [sci/1]
               faculty=1
                  program=0 => [sci/1]
                  program=1 => [art/1]
   science=1
      programs=0 => [art/1]
      programs=1 => [sci/3]

Now the leaf sets are "pure".

9. Generating an ARFF file (the WEKA data format)
-------------------------------------------------

(For Weka see http://www.cs.waikato.ac.nz/~ml/weka/index.html)

?- files(F),tf(F,50,T),idf(F,T,20,IDF),vectors(F,IDF,V),arff(IDF,V,'wekadata.txt').

This query generates an ARFF file named 'wekadata.txt'. If we specify class
labels, the ARFF file will include them in the attribute definitions and in
the data rows. For example, the following query generates an ARFF file with
labeled data:

?- files(F),tf(F,50,T),idf(F,T,20,IDF),label(CL),class(F,CL,LF),vectors(LF,IDF,V),arff(IDF,V,'wekadata.txt').

Note that we use .txt files for easy viewing (with Notepad). To prepare data
files for direct use by Weka we may use the .arff extension.
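For readers unfamiliar with the format, a labeled ARFF file of this kind
would look roughly like the sketch below. This is a hypothetical
illustration of the general ARFF layout only; the relation name, attribute
order, and number formatting depend on how arff/3 is implemented in
textmine.pl:

% hypothetical sketch of wekadata.txt, not actual arff/3 output
@relation wekadata

@attribute music numeric        % one numeric attribute per IDF term,
@attribute sociology numeric    % presumably in IDF-list order
...
@attribute class {art,sci}      % class attribute, present for labeled data

@data
0, 0, 0.989818, ..., sci        % one row per document: vector plus label
0, 0, 0, ..., art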