CS580 - Web Mining
Fall-2004
Laboratory Project 2
====================

Program files: textmine.pl
Data files: webdata.zip

1. Web Document Classification by KNN
-------------------------------------

Assume we have a catalog of documents and their class labels, both stored
in "files.pl" as follows:

files([ 'Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt',
        'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt',
        'English.txt', 'Geography.txt', 'History.txt', 'Math.txt',
        'Languages.txt', 'Music.txt', 'Philosophy.txt', 'Physics.txt',
        'Political.txt', 'Psychology.txt', 'Sociology.txt', 'Theatre.txt' ]).

label([ art - [ 'Art.txt', 'Justice.txt', 'English.txt', 'History.txt',
                'Languages.txt', 'Music.txt', 'Philosophy.txt',
                'Political.txt', 'Theatre.txt' ],
        sci - [ 'Anthropology.txt', 'Biology.txt', 'Chemistry.txt',
                'Communication.txt', 'Computer.txt', 'Math.txt',
                'Physics.txt', 'Geography.txt', 'Economics.txt',
                'Psychology.txt', 'Sociology.txt' ] ]).

After loading files.pl into the Prolog database we can get a labeled list
of vectors as follows:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),ppl(Vs).
Anthropology.txt-sci-[0, 0, 0.633684, 0.563212, 0, 0, 0, 0.330474, 0.334456, 0.245282]
Art.txt-art-[0, 0, 0, 0, 0, 0.801579, 0.408572, 0.272495, 0.275193, 0.201383]
Biology.txt-sci-[0, 0, 0.308831, 0.822925, 0, 0, 0, 0, 0, 0.476883]
Chemistry.txt-sci-[0, 0, 0.884458, 0, 0, 0, 0, 0.466619, 0, 0]
Communication.txt-sci-[0, 0, 0, 0, 0, 0.50298, 0.508762, 0.33993, 0.343929, 0.504315]
Computer.txt-sci-[0, 0, 0.945013, 0, 0, 0, 0, 0.308882, 0.107437, 0]
Justice.txt-art-[0, 0, 0, 0, 0.857546, 0, 0, 0.514407, 0, 0]
Economics.txt-sci-[0, 0, 0, 0.862175, 0, 0, 0, 0.50661, 0, 0]
English.txt-art-[0, 0, 0, 0, 0, 0.658568, 0.675454, 0, 0, 0.331737]
Geography.txt-sci-[0, 0, 0.671246, 0, 0.59846, 0, 0, 0, 0.352284, 0.259171]
History.txt-art-[0, 0, 0, 0, 0.655244, 0.590413, 0.303185, 0, 0.202988, 0.298232]
Math.txt-sci-[0, 0, 0.622853, 0.555566, 0, 0.495383, 0, 0, 0, 0.240823]
Languages.txt-art-[0, 0, 0, 0, 0.653047, 0.294527, 0.597236, 0, 0.202864, 0.298227]
Music.txt-art-[0.999126, 0, 0, 0, 0.0304017, 0, 0, 0.0178549, 0.0180868, 0.013277]
Philosophy.txt-art-[0, 0, 0, 0, 0.381885, 0.683209, 0.355001, 0.478314, 0, 0.180499]
Physics.txt-sci-[0, 0, 0, 0.393327, 0.3973, 0, 0.352371, 0.23513, 0.712738, 0]
Political.txt-art-[0, 0, 0.912367, 0.340379, 0.174498, 0, 0, 0.102482, 0.103813, 0]
Psychology.txt-sci-[0, 0, 0, 0.671941, 0, 0.305138, 0.620983, 0, 0.212528, 0.156837]
Sociology.txt-sci-[0, 0.997769, 0, 0.0372761, 0, 0.0333218, 0.0339388, 0.0228386, 0, 0.0168657]
Theatre.txt-art-[0, 0, 0, 0, 0, 0, 0, 0, 0.806259, 0.591563]

Now we may select a vector and find its nearest neighbors.

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(X,Vs),neighbors(X,5,Vs,NNs).

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 1.09861-science, 0.965081-research, 0.965081-studies, 0.847298-offers, 0.847298-courses, 0.559616-program, ... -...|...]
L = [art-['Art.txt', 'Justice.txt', 'English.txt', 'History.txt', 'Languages.txt', 'Music.txt', 'Philosophy.txt'|...], sci-['Anthropology.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Math.txt'|...]]
FL = ['Anthropology.txt'-sci, 'Art.txt'-art, 'Biology.txt'-sci, 'Chemistry.txt'-sci, 'Communication.txt'-sci, 'Computer.txt'-sci, 'Justice.txt'-art, 'Economics.txt'-sci, ... -...|...]
Vs = ['Anthropology.txt'-sci-[0, 0, 0.633684, 0.563212, 0, 0, 0|...], 'Art.txt'-art-[0, 0, 0, 0, 0, 0.801579|...], 'Biology.txt'-sci-[0, 0, 0.308831, 0.822925, 0|...], 'Chemistry.txt'-sci-[0, 0, 0.884458, 0|...], 'Communication.txt'-sci-[0, 0, 0|...], 'Computer.txt'-sci-[0, 0|...], 'Justice.txt'-art-[0|...], ... -... -[...|...], ... -...|...]
X = 'Anthropology.txt'-sci-[0, 0, 0.633684, 0.563212, 0, 0, 0, 0.330474|...]
NNs = [1-sci, 0.838446-art, 0.776153-sci, 0.766663-sci, 0.73685-sci]

If we want to find the neighbors of all vectors we may use the following
query:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(Id-X,Vs),neighbors(Id-X,5,Vs,NNs),write(Id-NNs),nl,fail.
Anthropology.txt-sci-[1-sci, 0.838446-art, 0.776153-sci, 0.766663-sci, 0.73685-sci]
Art.txt-art-[1-art, 0.899881-sci, 0.870672-art, 0.859377-art, 0.713056-art]
Biology.txt-sci-[1-sci, 0.776153-sci, 0.76439-sci, 0.709506-sci, 0.62775-sci]
Chemistry.txt-sci-[1-sci, 0.979955-sci, 0.854771-art, 0.714673-sci, 0.593689-sci]
Communication.txt-sci-[1-sci, 0.899881-art, 0.842192-art, 0.777873-art, 0.672163-art]
Computer.txt-sci-[1-sci, 0.979955-sci, 0.905007-art, 0.73685-sci, 0.672185-sci]
Justice.txt-art-[1-art, 0.573532-art, 0.561902-art, 0.560018-art, 0.513207-sci]
Economics.txt-sci-[1-sci, 0.709506-sci, 0.653009-sci, 0.579331-sci, 0.478995-sci]
English.txt-art-[1-art, 0.870672-art, 0.842192-sci, 0.749604-art, 0.696304-art]
Geography.txt-sci-[1-sci, 0.753425-art, 0.672185-sci, 0.606752-sci, 0.593689-sci]
History.txt-art-[1-art, 0.91299-art, 0.815065-art, 0.713056-art, 0.69255-art]
Math.txt-sci-[1-sci, 0.766663-sci, 0.76439-sci, 0.757373-art, 0.588604-sci]
Languages.txt-art-[1-art, 0.91299-art, 0.716461-art, 0.696304-art, 0.672163-sci]
Music.txt-art-[1-art, 0.0352555-art, 0.0291679-sci, 0.0280069-sci, 0.0275516-art]
Philosophy.txt-art-[1-art, 0.859377-art, 0.815065-art, 0.777873-sci, 0.749604-art]
Physics.txt-sci-[1-sci, 0.634585-sci, 0.614493-art, 0.574651-art, 0.53761-sci]
Political.txt-art-[1-art, 0.905007-sci, 0.854771-sci, 0.838446-sci, 0.757373-sci]
Psychology.txt-sci-[1-sci, 0.672429-art, 0.634585-sci, 0.62775-sci, 0.621601-sci]
Sociology.txt-sci-[1-sci, 0.0589357-sci, 0.0504638-art, 0.0502961-sci, 0.0501964-art]
Theatre.txt-art-[1-art, 0.57563-sci, 0.574651-sci, 0.437348-sci, 0.414758-sci]

Each list starts with the vector itself (similarity 1). To count the
neighbors in each class we may add sumv to the query:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(Id-X,Vs),neighbors(Id-X,5,Vs,NNs),sumv(NNs,S),write(Id-S),nl,fail.
Anthropology.txt-sci-[4-sci, 1-art]
Art.txt-art-[4-art, 1-sci]
Biology.txt-sci-[5-sci]
Chemistry.txt-sci-[4-sci, 1-art]
Communication.txt-sci-[1-sci, 4-art]
Computer.txt-sci-[4-sci, 1-art]
Justice.txt-art-[4-art, 1-sci]
Economics.txt-sci-[5-sci]
English.txt-art-[4-art, 1-sci]
Geography.txt-sci-[4-sci, 1-art]
History.txt-art-[5-art]
Math.txt-sci-[4-sci, 1-art]
Languages.txt-art-[4-art, 1-sci]
Music.txt-art-[3-art, 2-sci]
Philosophy.txt-art-[4-art, 1-sci]
Physics.txt-sci-[3-sci, 2-art]
Political.txt-art-[1-art, 4-sci]
Psychology.txt-sci-[4-sci, 1-art]
Sociology.txt-sci-[3-sci, 2-art]
Theatre.txt-art-[1-art, 4-sci]

To sum the neighbor similarities in each class instead of counting them,
we may use sumw:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(Id-X,Vs),neighbors(Id-X,5,Vs,NNs),sumw(NNs,S),write(Id-S),nl,fail.

Anthropology.txt-sci-[3.27967-sci, 0.838446-art]
Art.txt-art-[3.4431-art, 0.899881-sci]
Biology.txt-sci-[3.8778-sci]
Chemistry.txt-sci-[3.28832-sci, 0.854771-art]
Communication.txt-sci-[1-sci, 3.19211-art]
Computer.txt-sci-[3.38899-sci, 0.905007-art]
Justice.txt-art-[2.69545-art, 0.513207-sci]
Economics.txt-sci-[3.42084-sci]
English.txt-art-[3.31658-art, 0.842192-sci]
Geography.txt-sci-[2.87263-sci, 0.753425-art]
History.txt-art-[4.13366-art]
Math.txt-sci-[3.11966-sci, 0.757373-art]
Languages.txt-art-[3.32576-art, 0.672163-sci]
Music.txt-art-[1.06281-art, 0.0571748-sci]
Philosophy.txt-art-[3.42405-art, 0.777873-sci]
Physics.txt-sci-[2.1722-sci, 1.18914-art]
Political.txt-art-[1-art, 3.3556-sci]
Psychology.txt-sci-[2.88394-sci, 0.672429-art]
Sociology.txt-sci-[1.10923-sci, 0.10066-art]
Theatre.txt-art-[1-art, 2.00239-sci]

The predicate knn returns a single class for the selected vector:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(Id-X,Vs),knn(Id-X,5,Vs,Class).

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 1.09861-science, 0.965081-research, 0.965081-studies, 0.847298-offers, 0.847298-courses, 0.559616-program, ... -...|...]
L = [art-['Art.txt', 'Justice.txt', 'English.txt', 'History.txt', 'Languages.txt', 'Music.txt', 'Philosophy.txt'|...], sci-['Anthropology.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Math.txt'|...]]
FL = ['Anthropology.txt'-sci, 'Art.txt'-art, 'Biology.txt'-sci, 'Chemistry.txt'-sci, 'Communication.txt'-sci, 'Computer.txt'-sci, 'Justice.txt'-art, 'Economics.txt'-sci, ... -...|...]
Vs = ['Anthropology.txt'-sci-[0, 0, 0.633684, 0.563212, 0, 0, 0|...], 'Art.txt'-art-[0, 0, 0, 0, 0, 0.801579|...], 'Biology.txt'-sci-[0, 0, 0.308831, 0.822925, 0|...], 'Chemistry.txt'-sci-[0, 0, 0.884458, 0|...], 'Communication.txt'-sci-[0, 0, 0|...], 'Computer.txt'-sci-[0, 0|...], 'Justice.txt'-art-[0|...], ... -... -[...|...], ... -...|...]
Id = 'Anthropology.txt'-sci
X = [0, 0, 0.633684, 0.563212, 0, 0, 0, 0.330474, 0.334456|...]
Class = sci

The predicate knnw is called with the same arguments:

?- files(F),tf(F,20,T),idf(F,T,10,IDF),label(L),class(F,L,FL),vectors(FL,IDF,Vs),member(Id-X,Vs),knnw(Id-X,5,Vs,Class).

F = ['Anthropology.txt', 'Art.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Justice.txt', 'Economics.txt', 'English.txt'|...]
T = [department, study, students, ba, website, location, programs, 832, phone|...]
IDF = [3.04452-music, 3.04452-sociology, 1.09861-science, 0.965081-research, 0.965081-studies, 0.847298-offers, 0.847298-courses, 0.559616-program, ... -...|...]
L = [art-['Art.txt', 'Justice.txt', 'English.txt', 'History.txt', 'Languages.txt', 'Music.txt', 'Philosophy.txt'|...], sci-['Anthropology.txt', 'Biology.txt', 'Chemistry.txt', 'Communication.txt', 'Computer.txt', 'Math.txt'|...]]
FL = ['Anthropology.txt'-sci, 'Art.txt'-art, 'Biology.txt'-sci, 'Chemistry.txt'-sci, 'Communication.txt'-sci, 'Computer.txt'-sci, 'Justice.txt'-art, 'Economics.txt'-sci, ... -...|...]
Vs = ['Anthropology.txt'-sci-[0, 0, 0.633684, 0.563212, 0, 0, 0|...], 'Art.txt'-art-[0, 0, 0, 0, 0, 0.801579|...], 'Biology.txt'-sci-[0, 0, 0.308831, 0.822925, 0|...], 'Chemistry.txt'-sci-[0, 0, 0.884458, 0|...], 'Communication.txt'-sci-[0, 0, 0|...], 'Computer.txt'-sci-[0, 0|...], 'Justice.txt'-art-[0|...], ... -... -[...|...], ... -...|...]
Id = 'Anthropology.txt'-sci
X = [0, 0, 0.633684, 0.563212, 0, 0, 0, 0.330474, 0.334456|...]
Class = sci
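
The definitions in textmine.pl are not listed here, so the following is
only a sketch of how neighbor lists and class totals like the ones above
could be computed. It assumes that the similarity is the dot product of
the two vectors. This agrees with the listing: the vectors have unit
length, and the dot product of the Chemistry.txt and Computer.txt vectors
is about 0.979955, the value shown in their neighbor lists. The names
sample/1, dot/3, nearest/4, tally/2, votes/2 and classify/4 are
hypothetical and are not part of textmine.pl. As in the output above, the
query vector stays in the list and so counts as its own nearest neighbor.

```prolog
:- use_module(library(lists)).

% Hypothetical stand-ins for illustration only; the actual neighbors/4,
% sumv/2, sumw/2, knn/4 and knnw/4 are defined in textmine.pl.

% sample(-Vs): three labeled vectors copied from the ppl/1 listing.
sample([ 'Chemistry.txt'-sci-[0, 0, 0.884458, 0, 0, 0, 0, 0.466619, 0, 0],
         'Computer.txt'-sci-[0, 0, 0.945013, 0, 0, 0, 0, 0.308882, 0.107437, 0],
         'Justice.txt'-art-[0, 0, 0, 0, 0.857546, 0, 0, 0.514407, 0, 0] ]).

% dot(+Xs, +Ys, -D): dot product of two equal-length numeric lists.
dot([], [], 0).
dot([X|Xs], [Y|Ys], D) :-
    dot(Xs, Ys, D0),
    D is D0 + X*Y.

% nearest(+Query, +K, +Labeled, -NNs): the K most similar entries of
% Labeled as Similarity-Class pairs, most similar first.
% Fails if Labeled has fewer than K entries.
nearest(_-_-Q, K, Labeled, NNs) :-
    findall(S-C, (member(_-C-V, Labeled), dot(Q, V, S)), Scored),
    sort(0, @>=, Scored, Sorted),      % descending, duplicates kept
    length(NNs, K),
    append(NNs, _, Sorted).

% tally(+Pairs, -Totals): sum the weights of Weight-Class pairs per class.
% Totals is a list of Total-Class pairs, ordered by class name.
tally(Pairs, Totals) :-
    findall(C, member(_-C, Pairs), Cs0),
    sort(Cs0, Cs),                     % distinct classes
    findall(T-C,
            ( member(C, Cs),
              findall(W, member(W-C, Pairs), Ws),
              sum_list(Ws, T) ),
            Totals).

% votes(+NNs, -Counts): unweighted tally, every neighbor has weight 1.
votes(NNs, Counts) :-
    findall(1-C, member(_-C, NNs), Ones),
    tally(Ones, Counts).

% classify(+Query, +K, +Labeled, -Class): the class with the largest
% summed similarity among the K nearest neighbors.
classify(Query, K, Labeled, Class) :-
    nearest(Query, K, Labeled, NNs),
    tally(NNs, Totals),
    sort(0, @>=, Totals, [_-Class|_]).
```

For example, the query
?- sample(Vs), Vs = [Q|_], classify(Q, 3, Vs, Class).
gives Class = sci for the Chemistry.txt vector. Replacing tally/2 with
votes/2 in classify/4 turns the weighted vote into a plain majority vote.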