CS580 - Web Mining
Fall-2004
Laboratory Project 3
====================

Program files: textmine.pl
Data files: webdata.zip, alarm.pl, artsci.pl

I. Web Document Classification by Naive Bayes
---------------------------------------------

Assume we have a catalog of documents and their class labels, both stored in "files.pl" as follows:

files([
  'Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
  'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
  'English.txt', 'Geography.txt', 'History.txt', 'Math.txt',
  'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Physics.txt',
  'Political.txt', 'Psychology.txt', 'Sociology.txt', 'Theatre.txt'
]).

label([
  art - ['Art.txt', 'Justice.txt', 'English.txt', 'History.txt', 'Languages.txt',
         'Music.txt', 'Philosophy.txt', 'Political.txt', 'Theatre.txt'],
  sci - ['Anthropology.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt',
         'Computer.txt', 'Math.txt', 'Physics.txt', 'Geography.txt',
         'Economics.txt', 'Psychology.txt', 'Sociology.txt']
]).

For these experiments we use a discrete representation of our documents as boolean vectors. So, instead of "vectors", we use "binvectors". For example, for 5-component vectors we use the following query:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1, 1]
Art.txt-art-[0, 1, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0, 1]
Chemistry.txt-sci-[1, 0, 1, 0, 0]
Communication.txt-sci-[0, 1, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1, 0]
Justice.txt-art-[0, 0, 1, 0, 0]
Economics.txt-sci-[0, 0, 1, 0, 0]
English.txt-art-[0, 1, 0, 0, 1]
Geography.txt-sci-[1, 0, 0, 1, 1]
History.txt-art-[0, 1, 0, 1, 1]
Math.txt-sci-[1, 1, 0, 0, 1]
Languages.txt-art-[0, 1, 0, 1, 1]
Music.txt-art-[0, 0, 1, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0, 1]
Physics.txt-sci-[0, 0, 1, 1, 0]
Political.txt-art-[1, 0, 1, 1, 0]
Psychology.txt-sci-[0, 1, 0, 1, 1]
Sociology.txt-sci-[0, 1, 1, 0, 1]
Theatre.txt-art-[0, 0, 0, 1, 1]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty, 0.405465-programs]

The list of terms used to create the above binary vector representation is the IDF list above, where the IDF weights are ignored. That is, if a term appears in a document the corresponding vector component is 1; if not, it is 0.

The following procedures, which implement parts of the Naive Bayes algorithm, are provided in textmine.pl:

cond_prob(X,Class,Examples,LP) - generates a list LP of conditional probabilities, one for each attribute-value pair in X, given Class.
class_prob(Class,Examples,CP) - calculates the probability of Class (the proportion of examples in Class w.r.t. the whole data set).
probs(E,Examples,LP) - generates the likelihoods of E belonging to each class.
bayes(X,Examples,Class) - assigns X a Class according to the Naive Bayes algorithm.
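The actual source of these procedures is in textmine.pl and is not reproduced here. As a rough sketch of the computation they perform (assuming each example has the form File-Class-Vector, as printed by ppl above), they could be written along these lines. This is illustrative only: the helpers count_first, drop_first and product are hypothetical, and the real definitions in textmine.pl may differ.

% class_prob(+Class, +Examples, -CP): proportion of examples labeled Class.
class_prob(Class, Examples, CP) :-
    length(Examples, N),
    findall(x, member(_-Class-_, Examples), Hits),
    length(Hits, K),
    CP is K / N.

% cond_prob(+X, +Class, +Examples, -LP): for each component of X, the relative
% frequency of that value at that position among the examples of Class.
cond_prob(X, Class, Examples, LP) :-
    findall(V, member(_-Class-V, Examples), Vecs),
    length(Vecs, N),
    cond_probs(X, Vecs, N, LP).

cond_probs([], _, _, []).
cond_probs([V|Rest], Vecs, N, [P|Ps]) :-
    count_first(V, Vecs, K),       % class examples that agree on this component
    P is K / N,
    drop_first(Vecs, Tails),       % move on to the next component
    cond_probs(Rest, Tails, N, Ps).

count_first(_, [], 0).
count_first(V, [[V|_]|Vecs], K) :- !, count_first(V, Vecs, K1), K is K1 + 1.
count_first(V, [_|Vecs], K) :- count_first(V, Vecs, K).

drop_first([], []).
drop_first([[_|T]|Vecs], [T|Ts]) :- drop_first(Vecs, Ts).

% probs(+X, +Examples, -Ps): likelihood of X for each class, largest first.
probs(X, Examples, Ps) :-
    setof(C, F^V^member(F-C-V, Examples), Classes),
    findall(P-C,
            ( member(C, Classes),
              class_prob(C, Examples, CP),
              cond_prob(X, C, Examples, LP),
              product(LP, CP, P) ),
            Unsorted),
    msort(Unsorted, Ascending),
    reverse(Ascending, Ps).

product([], P, P).
product([Q|Qs], Acc, P) :- Acc1 is Acc * Q, product(Qs, Acc1, P).

% bayes(+X, +Examples, -Class): the class with the highest likelihood.
bayes(X, Examples, Class) :- probs(X, Examples, [_-Class|_]).

With these definitions, probs/3 multiplies the class prior by the per-component conditional probabilities, which is exactly the computation traced step by step in the examples below.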
Below are examples of using the above procedures with the 5-component vector representation of the documents described in files.pl. The vector X=[0,1,1,1,0] is used as the representation of a new document, which is to be classified.

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,1,1,1,0],sci,Vs,LP).

LP = [0.454545, 0.363636, 0.636364, 0.545455, 0.363636]

In LP we have the conditional probabilities of each value (0/1) in the vector [0,1,1,1,0] given class "sci". We may also compute the prior probabilities of classes "sci" and "art" as follows:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,CP).

CP = 0.55

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,CP).

CP = 0.45

Then, by multiplying these probabilities, we get the likelihoods of X belonging to "sci" and to "art". This is done by the procedure "probs".

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),probs([0,1,1,1,0],Vs,Ps).

Ps = [0.0182899-art, 0.0114746-sci]

According to these likelihoods, X belongs to class "art". The procedure "bayes" does all this in one step:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([0,1,1,1,0],Vs,Class).

Class = art

If we change the test example X to [1,1,1,1,0], we get a different classification:

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),bayes([1,1,1,1,0],Vs,Class).

Class = sci

II. LOO-CV evaluation
---------------------

The following query deletes a vector and then classifies it using the rest of the vectors. This is repeated for all vectors in the set.

?- files(F),tf(F,15,T),idf(F,T,5,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),del(ID-X,Vs,R),bayes(X,R,Class),write(ID-Class),nl,fail.

Anthropology.txt-sci-sci
Art.txt-art-art
Biology.txt-sci-sci
Chemistry.txt-sci-sci
Communication.txt-sci-art
Computer.txt-sci-sci
Justice.txt-art-sci
Economics.txt-sci-sci
English.txt-art-art
Geography.txt-sci-sci
History.txt-art-art
Math.txt-sci-sci
Languages.txt-art-art
Music.txt-art-sci
Philosophy.txt-art-sci
Physics.txt-sci-art
Political.txt-art-sci
Psychology.txt-sci-art
Sociology.txt-sci-art
Theatre.txt-art-art

By counting the number of differences between the actual and predicted labels we compute the LOO-CV error. For the above experiment it is 8/20 = 40%.
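The counting can also be automated with a small helper predicate. The name loo_error is hypothetical (it is not part of textmine.pl); it relies only on del/3 and bayes/3, used exactly as in the query above:

% loo_error(+Vs, -Error): leave-one-out cross-validation error of Naive Bayes.
loo_error(Vs, Error) :-
    findall(x,
            ( del(_File-Actual-X, Vs, Rest),   % hold one vector out
              bayes(X, Rest, Predicted),       % classify it with the remaining ones
              Predicted \== Actual ),          % keep only the misclassified cases
            Misses),
    length(Misses, M),
    length(Vs, N),
    Error is M / N.

With the 5-term representation above, replacing the del/bayes/write part of the query by loo_error(Vs,E) should report E = 0.4, matching the 8/20 count.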
III. Using Bayesian (Belief) Network for reasoning/classification
------------------------------------------------------------------

A simple example of a belief network is provided in the file alarm.pl. Read it first to understand the representation. The BN reasoning algorithm is used through p(Var,Obs,Dist), where:
- Var is a variable (as defined in variables([...]));
- Obs is a list of observations (variable=value pairs);
- Dist is the distribution P(Var|Obs);
- Var must not appear in the observations Obs.

Make sure that textmine.pl is loaded and then load alarm.pl too:

?- [alarm].    % load the BN for the alarm network

1. Diagnostic reasoning: from effect to cause (the effect is given as evidence)
-------------------------------------------------------------------------------

What is the probability distribution of Burglary (b) given the evidence that John calls (j=t)?

?- p(b,[j=t],P).

P = [0.0162837, 0.983716]

The probability distribution [0.0162837, 0.983716] corresponds to the values of the variable b as defined in values(b,[t,f]). Thus, the probability of "f" is 0.983716. Without this evidence we have:

?- p(b,[],P).

P = [0.001, 0.999]

This is the prior distribution of b, as it has no parents. Let's add more evidence: John calls and Mary calls too.

?- p(b,[j=t,m=t],P).

P = [0.284172, 0.715828]

The probability of burglary increases: if they both call, it becomes more likely that there is a burglary. Still, the alarm itself has not been observed, so the probability of b is not too high. Adding the alarm (a=t) as evidence further increases this probability.

?- p(b,[j=t,m=t,a=t],P).

P = [0.373551, 0.626449]

Because there is another possible cause for the alarm (earthquake), the probability of b can also be increased by adding the evidence that there is no earthquake (e=f).

?- p(b,[j=t,m=t,a=t,e=f],P).

P = [0.484786, 0.515214]

This last step, however, is not diagnostic reasoning, because b (burglary) and e (earthquake) are not connected by a causal relation.

2. Predictive reasoning: from cause to effect
---------------------------------------------

What is the probability distribution of John calls (j), given that there is a burglary (b=t)?

?- p(j,[b=t],P).

P = [0.849017, 0.150983]

So, it is very likely that John calls in this situation (the first value for j is t). Similarly, we can get the probability that Mary calls given the same evidence. It is lower than John's (why?).

?- p(m,[b=t],P).

P = [0.658614, 0.341386]
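The file alarm.pl itself is not reproduced in this handout, but, consistent with the query results above, it presumably encodes the standard burglary-alarm network along the following lines, using the same variables/parents/values/pr representation that we build for the documents in the next section. The exact layout, in particular how the two parent values of a are listed in pr/3 and the ordering inside variables/1, is an assumption here; check the actual file.

variables([b, e, a, j, m]).        % burglary, earthquake, alarm, JohnCalls, MaryCalls

parents(b, []).
parents(e, []).
parents(a, [b, e]).                % the alarm is caused by burglary and/or earthquake
parents(j, [a]).                   % John calls when he hears the alarm
parents(m, [a]).                   % Mary calls when she hears the alarm

values(b, [t, f]).
values(e, [t, f]).
values(a, [t, f]).
values(j, [t, f]).
values(m, [t, f]).

pr(b, [], [0.001, 0.999]).         % prior of burglary (matches p(b,[],P) above)
pr(e, [], [0.002, 0.998]).
pr(a, [b=t, e=t], [0.95, 0.05]).
pr(a, [b=t, e=f], [0.94, 0.06]).
pr(a, [b=f, e=t], [0.29, 0.71]).
pr(a, [b=f, e=f], [0.001, 0.999]).
pr(j, [a=t], [0.90, 0.10]).
pr(j, [a=f], [0.05, 0.95]).
pr(m, [a=t], [0.70, 0.30]).
pr(m, [a=f], [0.01, 0.99]).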
IV. Creating a belief network for the text documents from files.pl
-------------------------------------------------------------------

We shall illustrate this with a sample of our document representation obtained by the following query:

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),ppl(Vs).

Anthropology.txt-sci-[1, 0, 1, 1]
Art.txt-art-[0, 1, 1, 1]
Biology.txt-sci-[1, 0, 0, 0]
Chemistry.txt-sci-[1, 0, 1, 0]
Communication.txt-sci-[0, 1, 1, 1]
Computer.txt-sci-[1, 0, 1, 1]
Justice.txt-art-[0, 0, 1, 0]
Economics.txt-sci-[0, 0, 1, 0]
English.txt-art-[0, 1, 0, 0]
Geography.txt-sci-[1, 0, 0, 1]
History.txt-art-[0, 1, 0, 1]
Math.txt-sci-[1, 1, 0, 0]
Languages.txt-art-[0, 1, 0, 1]
Music.txt-art-[0, 0, 1, 1]
Philosophy.txt-art-[0, 1, 1, 0]
Physics.txt-sci-[0, 0, 1, 1]
Political.txt-art-[1, 0, 1, 1]
Psychology.txt-sci-[0, 1, 0, 1]
Sociology.txt-sci-[0, 1, 1, 0]
Theatre.txt-art-[0, 0, 0, 1]

IDF = [1.09861-science, 0.847298-offers, 0.559616-program, 0.559616-faculty]

Thus we have 4 terms: science, offers, program, faculty.

1. The BN variables are the class and the terms (features):

variables([class, science, offers, program, faculty]).

2. The structure of the graph represents the causal relationship between the terms and the class. The class value determines (is a cause for) the term values (the effects). So, we have the class node as a parent of all attributes. In Prolog this is:

parents(science,[class]).
parents(offers,[class]).
parents(program,[class]).
parents(faculty,[class]).
parents(class,[]).

3. The term values (0/1) are the possible values that the BN variables can take:

values(science,[0,1]).
values(offers,[0,1]).
values(program,[0,1]).
values(faculty,[0,1]).
values(class,[sci,art]).

4. Conditional probabilities: use the procedures from Naive Bayes to compute them.

CPT's for the terms:
--------------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],sci,Vs,P).

P = [0.454545, 0.636364, 0.363636, 0.454545]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],sci,Vs,P).

P = [0.545455, 0.363636, 0.636364, 0.545455]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([0,0,0,0],art,Vs,P).

P = [0.888889, 0.444444, 0.444444, 0.333333]

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),cond_prob([1,1,1,1],art,Vs,P).

P = [0.111111, 0.555556, 0.555556, 0.666667]

The probabilities calculated by cond_prob are used to create the CPT's as follows:

pr(science,[class=sci],[0.454545,0.545455]).
pr(science,[class=art],[0.888889,0.111111]).
pr(offers,[class=sci],[0.636364,0.363636]).
pr(offers,[class=art],[0.444444,0.555556]).
pr(program,[class=sci],[0.363636,0.636364]).
pr(program,[class=art],[0.444444,0.555556]).
pr(faculty,[class=sci],[0.454545,0.545455]).
pr(faculty,[class=art],[0.333333,0.666667]).

CPT for class:
--------------

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(sci,Vs,P).

P = 0.55

?- files(F),tf(F,15,T),idf(F,T,4,IDF),label(L),class(F,L,FL),binvectors(FL,IDF,Vs),class_prob(art,Vs,P).

P = 0.45

This leads to the following definition:

pr(class,[],[0.55,0.45]).

Note that for all CPT's the order in which the probabilities are listed follows the order of their corresponding values as described in "values".

V. Classification of new vectors (documents) with BN
-----------------------------------------------------

Let's put all the Prolog structures used to define the BN in a file named artsci.pl and load the latter into Prolog:

?- [artsci].

Then a new document vector, say [1,1,1,1], can be classified by using the following query:

?- p(class,[science=1,offers=1,program=1,faculty=1],P).

P = [0.786352, 0.213648]

The first probability is for class=sci, and because it is higher we can classify this document as belonging to class sci. We may also try this with less evidence, for example:

?- p(class,[science=1,offers=1,program=1],P).

P = [0.818133, 0.181867]

?- p(class,[science=1],P).

P = [0.857143, 0.142857]

Basically the same results are obtained by Naive Bayes.
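This is expected: with the structure above (class is the only parent of every term), the BN posterior for class is exactly the Naive Bayes computation. For example, for the last query, with a single piece of evidence, Bayes' rule applied to the CPT entries above gives

P(class=sci | science=1)
  = P(science=1|sci)*P(sci) / ( P(science=1|sci)*P(sci) + P(science=1|art)*P(art) )
  = 0.545455*0.55 / (0.545455*0.55 + 0.111111*0.45)
  = 0.3 / 0.35
  = 0.857143,

which agrees with the result returned by the query.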