CS580 - Web Mining
Fall-2004
Laboratory Project 1
====================

Program files: textmine.pl
Data files: webdata.zip

1. Web Document Collection
--------------------------

Let's browse a number of web pages and save each one as a text file. For
example, the following 20 pages were collected from the web site of the CCSU
School of Arts and Sciences. For convenience we put the file names in a list,
store the list in a file, and then load that file into the Prolog database.
Assume that the file is named files.pl and that its contents are:

files([ 'Anthropology.txt',
        'Art.txt',
        'Biology.txt',
        'Chemistry.txt',
        'Communication.txt',
        'Computer.txt',
        'Justice.txt',
        'Economics.txt',
        'English.txt',
        'Geography.txt',
        'History.txt',
        'Math.txt',
        'Languages.txt',
        'Music.txt',
        'Philosophy.txt',
        'Physics.txt',
        'Political.txt',
        'Psychology.txt',
        'Sociology.txt',
        'Theatre.txt' ]).

Then to load the file in Prolog we use the query:

?- [files].

2. Generating terms from the corpus of all documents
----------------------------------------------------

For this and all further steps we use the Prolog program textmine.pl, which
we first load into Prolog with the query:

?- [textmine].

The following query generates a list of the 20 most frequent terms that
appear in the corpus of all 20 documents.

?- files(F),tf(F,20,T),write(T).
[department, study, students, ba, website, location, programs, 832, phone, chair, program, science, hall, faculty, offers, music, courses, research, studies, sociology]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]

Note that we use write(T) to print the whole list, because Prolog prints just
the first 9 elements in its standard answer.

3. Generating the inverse document frequency list (IDF)
-------------------------------------------------------

First we have to generate a list of terms and then we pass them to the
procedure that generates the IDF list. For example,

?- files(F),tf(F,50,T),idf(F,T,20,IDF),write(IDF).
[3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, 2.35138-physics, 2.35138-political, 1.94591-history, 1.94591-sciences, 1.65823-american, 1.65823-social, 1.65823-international, 1.65823-public, 1.43508-computer, 1.43508-offered, 1.25276-ma, 1.25276-work]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, ... -...|...]

Note that the IDF list is ordered by decreasing IDF value (shown before each
term). Because the IDF value is large for terms that occur in few documents,
the IDF list contains the 20 rarest of the 50 terms generated by
tf(F,50,T). The values shown are consistent with IDF = ln((N+1)/df) for
N = 20 documents: ln(21/1) = 3.04452 for a term found in one document and
ln(21/2) = 2.35138 for a term found in two.

4. Generating document vectors
------------------------------

?- files(F),tf(F,50,T),idf(F,T,20,IDF),vectors(F,IDF,V),ppl(V).
Anthropology.txt-[0, 0, 0.989818, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.066903, 0.0677833, 0.0686871, 0, 0.0602473, 0, 0.0533136]
Art.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Biology.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.919678, 0, 0, 0, 0, 0.233791, 0.236478, 0.208835, 0]
Chemistry.txt-[0, 0, 0, 0, 0, 0, 0, 0.991006, 0, 0, 0, 0, 0.10034, 0, 0, 0, 0.0885403, 0, 0, 0]
Communication.txt-[0, 0, 0, 0, 0, 0, 0.981156, 0, 0, 0.135537, 0, 0, 0.0967631, 0, 0, 0.0979726, 0, 0, 0, 0]
Computer.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
Justice.txt-[0, 0, 0, 0, 0.703715, 0.710482, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Economics.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.546823, 0, 0, 0, 0, 0.837248]
English.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.500821, 0, 0, 0, 0, 0, 0.383151, 0.776127]
Geography.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
History.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.877854, 0, 0.469615, 0, 0, 0, 0, 0, 0.0939918, 0]
Math.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0.32297, 0, 0, 0.814169, 0, 0, 0, 0, 0.209831, 0.213271, 0.378557, 0]
Languages.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.641837, 0, 0, 0.570685, 0.512215, 0]
Music.txt-[0.995667, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0610873, 0, 0, 0, 0, 0.0527235, 0.0462213, 0, 0, 0]
Philosophy.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Physics.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0.934359, 0, 0, 0.356332, 0, 0, 0, 0, 0, 0, 0, 0]
Political.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0.895432, 0, 0, 0, 0.113139, 0.229142, 0.352641, 0, 0, 0, 0.0924049]
Psychology.txt-[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.545458, 0, 0, 0, 0, 0.838138, 0]
Sociology.txt-[0, 0.967344, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.248383, 0, 0, 0, 0, 0, 0.0505209]
Theatre.txt-[0, 0, 0, 0.991255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.064895, 0, 0.114897]

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, ... -...|...]
V = ['Anthropology.txt'-[0, 0, 0.989818, 0, 0, 0, 0|...], 'Art.txt'-[0, 0, 0, 0, 0, 0|...], 'Biology.txt'-[0, 0, 0, 0, 0|...], 'Chemistry.txt'-[0, 0, 0, 0|...], 'Communication.txt'-[0, 0, 0|...], 'Computer.txt'-[0, 0|...], 'Justice.txt'-[0|...], 'Economics.txt'-[...|...], ... -...|...]

Note that ppl(V) is used here to print the list of vectors nicely (it prints
each list item on a separate line).

5. Clustering documents
-----------------------

?- files(F),tf(F,50,T),idf(F,T,20,IDF),vectors(F,IDF,V),cluster(V,Q).
+
  +
    +
      +
        +
          +
            +
              History.txt
              Philosophy.txt
            Music.txt
          +
            +
              +
                +
                  Biology.txt
                  Math.txt
                Physics.txt
              Computer.txt
            Chemistry.txt
        +
          +
            +
              +
                +
                  +
                    Languages.txt
                    Psychology.txt
                  +
                    Economics.txt
                    English.txt
                Political.txt
              Communication.txt
            Anthropology.txt
          Sociology.txt
      Theatre.txt
    Art.txt
  +
    Justice.txt
    Geography.txt

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 3.04452-anthropology, 3.04452-theatre, 3.04452-criminal, 3.04452-justice, 3.04452-communication, 3.04452-chemistry, ... -...|...]
V = ['Anthropology.txt'-[0, 0, 0.989818, 0, 0, 0, 0|...], 'Art.txt'-[0, 0, 0, 0, 0, 0|...], 'Biology.txt'-[0, 0, 0, 0, 0|...], 'Chemistry.txt'-[0, 0, 0, 0|...], 'Communication.txt'-[0, 0, 0|...], 'Computer.txt'-[0, 0|...], 'Justice.txt'-[0|...], 'Economics.txt'-[...|...], ... -...|...]
Q = 0.229943

The clustering algorithm used here is Hierarchical Agglomerative Clustering,
implemented by the cluster(V,Q) procedure.
It uses another procedure, "show", internally to print the clustering as a
horizontal tree. Thus "cluster(V,Q)" is equivalent to
"cluster(V,C,Q),show(C)". In the latter case we also see the actual
clustering in variable C (as a nested list). The parameter Q that
cluster(V,Q) returns is the clustering quality computed as an average
distance between all merged clusters.

By increasing the number of terms (the size of the document vectors) we
usually get a better clustering quality. For example:

?- files(F),tf(F,50,T),idf(F,T,50,IDF),vectors(F,IDF,V),cluster(V,Q).
+
  +
    +
      +
        +
          +
            +
              +
                +
                  +
                    +
                      +
                        +
                          +
                            English.txt
                            Psychology.txt
                          Art.txt
                        Languages.txt
                      Economics.txt
                    +
                      History.txt
                      Philosophy.txt
                  +
                    +
                      Biology.txt
                      Math.txt
                    Physics.txt
                +
                  +
                    Computer.txt
                    Geography.txt
                  Political.txt
              Chemistry.txt
            Communication.txt
          Anthropology.txt
        Sociology.txt
      Theatre.txt
    Music.txt
  Justice.txt

Q = 0.283154

However, the following query produces a lower clustering quality (explain
why).

?- files(F),tf(F,100,T),idf(F,T,50,IDF),vectors(F,IDF,V),cluster(V,Q),show(V).

Q = 0.0661595

6. Entropy-based evaluation
---------------------------

Assume that we have assigned class (category) labels to all documents. Then
we may compute the clustering quality by looking at the distribution of
class labels in each cluster. For example, assume that our departments are
grouped into two categories ("art" and "sci"), represented as follows and
added to the file catalog (files.pl) for convenient access.

label( [
art - [ 'Art.txt',
        'Justice.txt',
        'English.txt',
        'History.txt',
        'Languages.txt',
        'Music.txt',
        'Philosophy.txt',
        'Political.txt',
        'Theatre.txt' ],
sci - [ 'Anthropology.txt',
        'Biology.txt',
        'Chemistry.txt',
        'Communication.txt',
        'Computer.txt',
        'Math.txt',
        'Physics.txt',
        'Geography.txt',
        'Economics.txt',
        'Psychology.txt',
        'Sociology.txt' ]
]).
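
The entropy figures reported for labeled clusters in this section can be
checked by hand. The sketch below assumes base-2 entropy over the class
counts of a cluster, which is consistent with the printed values; entropy/2
and add_term/4 are hypothetical helper names, not part of textmine.pl, and
the code assumes SWI-Prolog (sum_list/2, foldl/4).

```prolog
% entropy(+Counts, -H): base-2 entropy of a list of class counts.
% Hypothetical helper for checking the listing; not taken from textmine.pl.
entropy(Counts, H) :-
    sum_list(Counts, N),
    N > 0,
    foldl(add_term(N), Counts, 0, H).

% A class with zero members contributes nothing to the entropy.
add_term(_, 0, Acc, Acc) :- !.
add_term(N, C, Acc0, Acc) :-
    P is C / N,
    Acc is Acc0 - P * log(P) / log(2).   % log/1 is the natural logarithm
```

For the whole collection as labeled above (9 art and 11 sci documents), the
query ?- entropy([9, 11], H). gives H of about 0.9928, and a cluster that
holds a single class, such as entropy([4], H), gives 0.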

After reloading files.pl (with the label/1 fact added) we use class(F,L,FL)
to add class labels to the document vectors as follows:

?- files(F),tf(F,100,T),idf(F,T,50,IDF),label(L),class(F,L,FL),vectors(FL,IDF,V),cluster(V,Q).
[9-art, 11-sci]-0.992774
  [1-art, 1-sci]-1
    Computer.txt-sci
    Justice.txt-art
  [8-art, 10-sci]-0.991076
    [6-art, 10-sci]-0.954434
      [5-art, 10-sci]-0.918296
        [5-art, 9-sci]-0.940286
          [5-art, 8-sci]-0.961237
            [5-art, 7-sci]-0.979869
              [5-art, 2-sci]-0.863121
                [4-art]-0
                  [2-art]-0
                    English.txt-art
                    Languages.txt-art
                  [2-art]-0
                    History.txt-art
                    Philosophy.txt-art
                [1-art, 2-sci]-0.918296
                  [1-art, 1-sci]-1
                    Economics.txt-sci
                    Political.txt-art
                  Communication.txt-sci
              [5-sci]-0
                [4-sci]-0
                  [3-sci]-0
                    [2-sci]-0
                      Math.txt-sci
                      Physics.txt-sci
                    Biology.txt-sci
                  Psychology.txt-sci
                Sociology.txt-sci
            Anthropology.txt-sci
          Chemistry.txt-sci
        Geography.txt-sci
      Music.txt-art
    [2-art]-0
      Art.txt-art
      Theatre.txt-art

Note the difference in the way clusters are shown now. When class labels are
used, the show(Clustering) procedure prints the class label distribution and
its entropy for each cluster (instead of "+"). In the clustering above, good
clusters are those with zero or low entropy.

7. Generating an ARFF file (the WEKA data format)
-------------------------------------------------

?- files(F),tf(F,50,T),idf(F,T,20,IDF),vectors(F,IDF,V),arff(IDF,V,'wekadata.txt').

This query generates an ARFF file named 'wekadata.txt'. If we specify class
labels, the ARFF file will include them in the attribute definitions and in
the data rows. For example, the following query generates an ARFF file with
labeled data.

?- files(F),tf(F,50,T),idf(F,T,20,IDF),label(CL),class(F,CL,LF),vectors(LF,IDF,V),arff(IDF,V,'wekadata.txt').

Note that we use .txt files for easy viewing (with Notepad). To prepare data
files for direct use by Weka we may use the .arff extension.
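
For readers who have not seen the ARFF format, the following sketch shows
its general shape for unlabeled numeric vectors: a relation name, one
numeric attribute per term, and one comma-separated data row per document.
It is only an illustration under stated assumptions. write_arff/3 and the
relation name "documents" are hypothetical, the code assumes SWI-Prolog, and
the layout produced by the actual arff/3 procedure in textmine.pl may
differ.

```prolog
% write_arff(+Stream, +Terms, +Rows): write a minimal ARFF document.
% Terms is a list of term atoms; Rows is a list of Name-Vector pairs,
% the same shape as the V list printed by ppl/1.
% Hypothetical sketch; not the arff/3 procedure from textmine.pl.
write_arff(Stream, Terms, Rows) :-
    format(Stream, "@relation documents~n~n", []),
    % One numeric attribute per term.
    forall(member(Term, Terms),
           format(Stream, "@attribute ~w numeric~n", [Term])),
    format(Stream, "~n@data~n", []),
    % One comma-separated row of values per document.
    forall(member(_Name-Vector, Rows),
           ( atomic_list_concat(Vector, ',', Line),
             format(Stream, "~w~n", [Line]) )).
```

A call such as
?- write_arff(user_output, [music, physics], ['Music.txt'-[0.99, 0], 'Physics.txt'-[0, 0.93]]).
prints the header followed by one data line per document to the terminal.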