CS 580 - Web Mining
Fall-2008
Classes: MW 6:45 pm - 8:00 pm, Room: Maria Sanford Hall 214
Instructor: Dr. Zdravko Markov, 30307 Maria Sanford Hall, (860)-832-2711,
http://www.cs.ccsu.edu/~markov/,
e-mail: markovz at ccsu dot edu
Office hours: TR 10:00 - 12:30 pm, or by appointment
Description: The Web is the largest collection of electronically
accessible documents, which makes it the richest source of information in the
world. The problem with the Web is that this information is not well structured
and organized, so it cannot be easily retrieved. Search engines
help in accessing web documents by keywords, but this is still far from
what we need in order to effectively use the knowledge available on the
Web. Machine Learning and Data Mining approaches go further and try to
extract knowledge from the raw data available on the Web by organizing
web pages in well defined structures or by looking into patterns of activities
of Web users. These are the challenges of the area of Web Mining. This
course focuses on extracting knowledge from the web by applying Machine
Learning techniques for classification and clustering of hypertext documents.
Basic approaches from the area of Information Retrieval and text analysis
are also discussed. The students use recent Machine Learning and Data Mining
software to implement practical applications for web document retrieval,
classification and clustering.
Prerequisites: CS 501 and CS 502, basic knowledge of algebra,
discrete math and statistics.
Course Objectives
Introduce students to the basic concepts and techniques of Information
Retrieval, Web Search, Data Mining, and Machine Learning for extracting
knowledge from the web.
Develop skills of using recent data mining software for solving practical
problems of Web Mining.
Gain experience of doing independent study and research.
Required text (DMW): Zdravko Markov and Daniel T. Larose. Data
Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Wiley, 2007, ISBN: 978-0-471-66655-4.
Recommended texts: Ian H. Witten and Eibe Frank. Data Mining:
Practical Machine Learning Tools and Techniques (Second Edition), Morgan
Kaufmann, 2005, ISBN: 0-12-088407-0.
Required software: Weka
3 Data Mining System - free, open-source machine learning software in Java.
Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html
Semester project: There will be a semester project that involves
independent study, work with the course software, writing reports and making
presentations. The project can be done individually or in teams of 2 or
3. The project description and timetable are included in the schedule for
classes and assignments.
Grading: The final grade will be based on the project (80%) and
two tests (20%), and will be affected by classroom participation. The letter
grades will be calculated according to the following table:
A    95-100
A-   90-94
B+   87-89
B    84-86
B-   80-83
C+   77-79
C    74-76
C-   70-73
D+   67-69
D    64-66
D-   60-63
F    0-59
Honesty policy: It is expected that all students will conduct
themselves in an honest manner (see the CCSU Student handbook), and NEVER
claim work which is not their own. Violating this policy will result in
a substantial grade penalty, and may lead to expulsion from the University.
Tentative schedule of classes, assignments and tests
-
Introduction
-
The Web Challenges (How to turn the web data into web knowledge):
-
Web Search Engines
-
Topic Directories
-
Semantic Web
-
Web Mining
-
Web content mining - discovery of Web document content patterns (text mining).
-
Web structure mining - discovery of hypertext/linking structure patterns
-
use hyperlinks to enhance text classification
-
page ranking
-
modeling and measuring the Web
-
Web usage mining - discovery of web users activity patterns
-
mining web server logs
-
mining client machine access logs
-
Related areas
-
Reading: DMW, Chapter 1
-
Lecture slides: dmw1.pdf
-
Information Retrieval and Web Search
-
Topics:
-
Crawling the Web
-
Indexing and keyword search
-
Document representation
-
Relevance Ranking
-
Vector space model (TF, IDF, TFIDF), Euclidean distance, cosine similarity
-
Relevance feedback
-
Advanced text search
-
Using the HTML structure in keyword search
-
Evaluating search quality
-
Similarity search
-
Reading: DMW, Chapter 1
-
Lecture slides: dmw1.pdf
-
Exercises:
-
Hyperlink Based Ranking
-
Clustering approaches for Web Mining
-
Evaluating Clustering
-
Classification approaches for Web Mining
-
Reading: DMW, Chapter 5
-
Lecture slides: dmw5.pdf
-
Basic approaches
-
Semester Projects
-
Students may choose one out of the following three projects:
-
To complete the project students are required to:
-
Write an initial project description, including specific goals,
resources to be used, plans for how to achieve the goals and evaluate the project
results, and a timetable.
-
Submit reports and make presentations on:
-
the initial project description (10% of final grade), due on October
1
-
the progress made by midterm (30% of final grade), due on November 5
-
the results achieved upon project completion (40% of final grade), due
on December 17
-
The students may work individually or in teams of 2 or 3.
-
The project grading will be based on both reports and presentations.
Hyperlink Based Ranking
1. The structure of the Web
-
Estimated (1998) at about 150 million nodes (pages) and 1.7 billion edges (links).
Now more than 300 million pages, with about 1 million added every day.
-
Pages are very diverse in format (text, images, animation, scripts, forms
etc.) and content (information, ads, news, personal pages etc.)
-
No central authority or editors: relevance, popularity, and authority
are hard to evaluate
-
Links are also very diverse, many have nothing to do with content or authority
(e.g. navigation links).
-
The challenge: use the web hyperlink structure to evaluate the importance
of pages and to enhance search
2. Social networks
-
An early approach, works well for academic networks, bibliometrics.
-
Mostly counting the in-degree of nodes, e.g. impact factor (number of citations
in the previous two years).
-
Prestige in social networks:
-
A(u,v) = 1 if page u cites page v; 0 otherwise
-
p(v) = Σ_u A(u,v) p(u)
-
Matrix notation: compute P (a column vector over web pages) by the iterative
assignment P' = A^T P
-
Basics of linear algebra
-
Matrices (see http://mathworld.wolfram.com/Matrix.html)
-
Vectors and norms (see http://mathworld.wolfram.com/VectorNorm.html)
-
Eigenvectors (see http://mathworld.wolfram.com/Eigenvector.html)
-
Example:
-
Graph: a → b, a → c, b → c, c → a
-
Prestige vector (column): P = (p(a), p(b), p(c))
-
Matrix: A = [(0,1,1), (0,0,1), (1,0,0)]; A^T = [(0,0,1), (1,0,0), (1,1,0)]
-
Equation: c P = A^T P
-
Solution: eigenvalue c = 1.325; eigenvector P = (0.548, 0.413, 0.726)
(see the power-iteration sketch at the end of this section)
-
Differences with the Web
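The principal eigenvector in the example above can also be found numerically by power iteration: start from any positive vector, repeatedly multiply it by A^T, and normalize after each step. Below is a minimal sketch in plain Java (not part of the course software) using the example matrix; it should print approximately the prestige vector given above.

// Power iteration for the prestige vector P in c P = A^T P.
// Sketch only; the matrix is the 3-node example (a -> b, a -> c, b -> c, c -> a).
public class Prestige {
    public static void main(String[] args) {
        double[][] a = { {0, 1, 1}, {0, 0, 1}, {1, 0, 0} };   // A(u,v) = 1 if page u cites page v
        int n = a.length;
        double[] p = {1, 1, 1};                               // initial prestige vector
        for (int iter = 0; iter < 100; iter++) {
            double[] next = new double[n];
            for (int v = 0; v < n; v++)                       // next(v) = sum over u of A(u,v) * p(u)
                for (int u = 0; u < n; u++)
                    next[v] += a[u][v] * p[u];
            double norm = 0;                                  // L2 normalization
            for (double x : next) norm += x * x;
            norm = Math.sqrt(norm);
            for (int v = 0; v < n; v++) p[v] = next[v] / norm;
        }
        // Expect approximately P = (0.548, 0.413, 0.726), i.e. the eigenvector for c = 1.325
        System.out.printf("P = (%.3f, %.3f, %.3f)%n", p[0], p[1], p[2]);
    }
}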
3. PageRank
-
Web page u, F_u = {pages u points to}, B_u = {pages
that point to u}, N_u = |F_u|
-
Basic idea: propagation of ranking through links (see Page, Brin et al.,
Figure 2)
-
R(u) = c Σ_{v ∈ B_u} R(v)/N_v
-
Example:
-
Graph: a → b, a → c, b → c, c → a
-
R(a) = 0.4; R(b) = 0.2; R(c) = 0.4 (see Page, Brin et al., Figure 3)
-
Eigenvector approach:
-
A(u,v) = 1/N_u if u cites v; 0 otherwise;
-
matrix: A = [(0, 0.5, 0.5), (0,0,1), (1,0,0)]; A^T = [(0,0,1),
(0.5,0,0), (0.5,1,0)]
-
Equation: c P = A^T P
-
Solutions (find eigenvalue c and eigenvector P):
-
Integer: c = 1; P = (2, 1, 2)
-
|P|_2 = 1 (L2 norm): c = 1; P = (0.666, 0.333, 0.666)
-
|P|_1 = 1 (L1 norm): c = 1; P = (0.4, 0.2, 0.4)
-
Rank sink (a loop without outlinks)
-
Source of rank E(u)
-
R(u) = c Σ_{v ∈ B_u} R(v)/N_v + c E(u), where c is maximized and |R|_1
= 1 (the L1 vector norm of R).
-
Computing PageRank (S is an initial vector over web pages, e.g. E; all norms
are L1; a code sketch appears at the end of this section):
-
R_0 = S
-
Loop
-
R_{i+1} = A^T R_i
-
d = |R_i| - |R_{i+1}|
-
R_{i+1} = R_{i+1} + d E
-
While |R_{i+1} - R_i| > ε
-
Random surfer model:
-
R(u) is the probability that a random walk on the Web graph visits page u.
-
If the surfer gets into a loop, it jumps to a random page chosen according
to the distribution E
-
Adjusting PageRank by using the source of rank E
-
E is a uniform vector with a small norm (e.g. |E| = 0.15), i.e. the surfer
periodically jumps to a random web page. Problem: manipulation by commercial
interests (e.g. having an important page, or a lot of unimportant pages,
include a link to the page being promoted)
-
E is just one web page: the chosen page gets the highest rank, followed
by its links.
-
Other approaches: use all root level pages of all web servers (difficult
to manipulate).
-
Other applications of PageRank
-
Estimating Web traffic
-
Optimal crawling: using PageRank as an evaluation function.
-
Page navigation (show the PageRank of a link before the user clicks on
it).
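Below is a minimal plain-Java sketch of the iterative computation above, run on the three-page example graph. The uniform source of rank E with |E|_1 = 0.15 is the illustrative value mentioned above; since this small graph has no rank sinks, d stays at 0 and the result matches the example R = (0.4, 0.2, 0.4).

// Iterative PageRank as outlined above (Page, Brin et al.), on the example
// graph a -> b, a -> c, b -> c, c -> a.  Illustrative sketch only.
public class PageRankSketch {
    public static void main(String[] args) {
        // adj[u][v] = true if u links to v; page order: a=0, b=1, c=2
        boolean[][] adj = { {false, true, true}, {false, false, true}, {true, false, false} };
        int n = adj.length;
        int[] outDeg = new int[n];
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++) if (adj[u][v]) outDeg[u]++;

        double[] e = new double[n];                  // source of rank E, uniform, |E|_1 = 0.15
        java.util.Arrays.fill(e, 0.15 / n);
        double[] r = new double[n];                  // R_0 = S (here a uniform start vector)
        java.util.Arrays.fill(r, 1.0 / n);

        double eps = 1e-9, delta;
        do {
            double[] next = new double[n];
            for (int u = 0; u < n; u++)              // next(v) = sum over u in B_v of R(u)/N_u
                for (int v = 0; v < n; v++)
                    if (adj[u][v]) next[v] += r[u] / outDeg[u];
            double d = l1(r) - l1(next);             // rank lost to sinks (0 in this example)
            for (int v = 0; v < n; v++) next[v] += d * e[v];
            delta = 0;
            for (int v = 0; v < n; v++) delta += Math.abs(next[v] - r[v]);
            r = next;
        } while (delta > eps);
        // Expect approximately R = (0.4, 0.2, 0.4), as in the example above
        System.out.printf("R = (%.3f, %.3f, %.3f)%n", r[0], r[1], r[2]);
    }
    static double l1(double[] x) { double s = 0; for (double v : x) s += Math.abs(v); return s; }
}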
4. Authorities and Hubs
-
Problems with associating authority with in-degree:
-
Often links have nothing to do with authority (e.g. navigational links)
-
The balance between relevance and popularity (the most popular pages are
not necessarily the most relevant ones, e.g. sometimes they do not even contain
the query string)
-
Idea:
-
Focus on the relevant pages first and then compute authority
-
Use also hub pages (pages that point to multiple relevant authoritative
pages)
-
The algorithm (HITS) - topic distillation (a code sketch appears at the end of this section). Given a query q:
-
Using a text-based search, find a small set of relevant pages (the root set
R_q).
-
Expand the root set by adding pages that point to and are pointed to by
pages from the root set. This creates the base set S_q.
-
Find authorities and hubs in S_q
-
E(u,v) = 1 if u points to v; 0 otherwise (both u and v belong to S_q)
-
x - authority vector; y - hub vector; k - parameter (number of iterations)
-
(x_1, x_2, ..., x_n) = (1, 1, ..., 1)
-
(y_1, y_2, ..., y_n) = (1, 1, ..., 1)
-
Loop k times
-
x_u = Σ_{v: E(v,u)=1} y_v, for all u
-
y_u = Σ_{v: E(u,v)=1} x_v, for all u
-
normalize x and y (L2 norm)
-
End loop
-
Similar page queries
-
Link-based approach (the alternative is text-based similarity)
-
Find k pages pointing to p
-
Find the root set R_p and the base set S_p
-
Search in S_p for hubs and authorities
-
Report the highest ranking authorities and hubs as similar pages to p
-
Advantages: no problems with pages containing images or very little text
(e.g. very little overlap).
-
Dealing with disconnected graphs
-
Example: ambiguous queries
-
Using higher-order eigenvectors: HITS actually finds the principal eigenvector
of E E^T and of E^T E (the eigenvector associated with the largest eigenvalue).
-
More eigenvectors may also be used to find hubs and authorities in smaller subgraphs
-
In general, higher order eigenvectors reveal clusters in the graph structure.
-
Improving HITS stability - random walk model (parameter d)
-
with probability d the surfer jumps to a random node in the base set.
-
with probability (1-d) the surfer takes a random out-link from the current
page or goes back to a random page that points to the current one.
-
Tuning parameter d
-
stability improves as d increases
-
d=1 (no ranking)
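Below is a minimal plain-Java sketch of the basic HITS iteration described above. The five-page base set and its link matrix are made up for illustration only; a real run would build the base set from a text-based search as described in the algorithm.

// HITS iteration on a small, made-up base set S_q.  Sketch only.
public class HitsSketch {
    public static void main(String[] args) {
        // edge[u][v] = 1 if page u points to page v (5 hypothetical pages in the base set)
        int[][] edge = {
            {0, 0, 1, 1, 0},
            {0, 0, 1, 1, 0},
            {0, 0, 0, 0, 1},
            {0, 0, 0, 0, 1},
            {0, 0, 0, 0, 0}
        };
        int n = edge.length, k = 50;                  // k = number of iterations
        double[] x = new double[n], y = new double[n];
        java.util.Arrays.fill(x, 1.0);                // authority scores
        java.util.Arrays.fill(y, 1.0);                // hub scores
        for (int iter = 0; iter < k; iter++) {
            double[] nx = new double[n], ny = new double[n];
            for (int u = 0; u < n; u++)
                for (int v = 0; v < n; v++) {
                    if (edge[v][u] == 1) nx[u] += y[v];   // x_u: sum of hub scores of pages pointing to u
                    if (edge[u][v] == 1) ny[u] += x[v];   // y_u: sum of authority scores of pages u points to
                }
            normalize(nx); normalize(ny);             // L2 normalization, as in the notes
            x = nx; y = ny;
        }
        System.out.println("authorities: " + java.util.Arrays.toString(x));
        System.out.println("hubs:        " + java.util.Arrays.toString(y));
    }
    static void normalize(double[] v) {
        double s = 0; for (double t : v) s += t * t; s = Math.sqrt(s);
        if (s > 0) for (int i = 0; i < v.length; i++) v[i] /= s;
    }
}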
5. Enhanced techniques for page ranking
-
Coarse-grained and fine-grained models
-
Topic generalization and drift
-
Avoiding nepotism
-
k pages on a single host
-
Assign a weight of 1/k to the in-links coming from these pages
-
Eliminating outliers
-
Create vector space representation for the retrieved pages
-
Find the centroid of the root set
-
Eliminate pages from the base set that are too far from the centroid
-
Fine Grained models
-
Using the anchor text (Rank-and-File)
-
No hubs and authorities
-
Use a base set only and consider pages as chains of terms and links.
-
Increment counts for URLs that appear near (within distance k of) a query
term (start with counts of 0)
-
Report the top ranking pages
-
Using the document markup structure (DOM)
-
Slides
6. Using Web structure to enhance crawling and similarity search
-
Enhanced Crawling
-
Crawling as guided search (e.g. use PageRank as evaluation function)
-
Keyword based search
-
Link-based similarity search
General Setting and Evaluation Techniques
1. General Setting
-
General setting for Classification (Supervised Learning, Learning from
examples, Concept learning)
-
Step 1: Data collection
-
Training documents (model construction subset + model validation subset)
-
Test documents
-
Step 2: Building a model
-
Feature Selection
-
Applying an ML approach (learner, classifier)
-
Validating the Model (tuning learner parameters)
-
Step 3: Testing and evaluating the model
-
Step 4: Using the model to classify new documents (with unknown class labels)
-
Problems with classification of text and hypertext
-
Very large number of features (terms) compared with the number of examples
(documents)
-
Many irrelevant or correlated features
-
Different number of features in different documents
2. Evaluating text classifiers
-
Evaluation criteria
-
Accuracy
-
Computational efficiency (speed, scalability, modification)
-
Ease of model interpretation and using user feedback
-
Simplicity (MDL)
-
Benchmark data
-
Evaluating classification accuracy
-
Holdout
-
Reserve a certain amount for testing and use the remainder for training
(usually 1/3 for testing, 2/3 for training).
-
Problem: the samples might not be representative. For example, some classes
might be represented with very few instances or even with no instances at
all.
-
Solution: stratification - sampling for training and testing within
classes. This ensures that each class is represented with approximately
equal proportions in both subsets
-
Repeated holdout. Success/error estimate can be made more reliable by repeating
the process with different subsamples.
-
In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)
-
The error rates on the different iterations are averaged to yield an overall
error rate.
-
Problem: the different test sets may overlap. Can we prevent overlapping?
-
Cross-validation (CV). Avoids overlapping test sets.
-
k-fold cross-validation
-
First step: data is split into k subsets of equal size (usually by random
sampling).
-
Second step: each subset in turn is used for testing and the remainder
for training.
-
The error estimates are averaged to yield an overall error estimate.
-
Stratified cross-validation: subsets are stratified before the cross-validation
is performed.
-
Stratified ten-fold cross-validation
-
Standard method for evaluation. Extensive experiments have shown that this
is the best choice to get an accurate estimate. There is also some
theoretical evidence for this.
-
Stratification reduces the estimate's variance.
-
Repeated stratified cross-validation is even better. Ten-fold cross-validation
is repeated ten times and results are averaged.
-
Leave-one-out cross-validation (LOO CV).
-
LOO CV is an n-fold cross-validation, where n is the number
of training instances. That is, n classifiers are built for all
possible (n-1)-element subsets of the training set and then tested
on the remaining single instance.
-
LOO CV makes maximum use of the data.
-
No random subsampling is involved.
-
Problems
-
LOO CV is very computationally expensive.
-
Stratification is not possible. Actually, this method guarantees a
non-stratified sample (there is only one instance in the test set).
-
Worst case example: assume a completely random dataset with two
classes, each represented by 50% of the instances. The best classifier
for this data is the majority predictor. LOO CV will estimate a 100% error
rate (!) for this classifier (explain why).
-
Contingency matrix
Actual \ Predicted    +                       -
+                     True positive (TP)      False negative (FN)
-                     False positive (FP)     True negative (TN)
-
Total error = (FP+FN)/(TP+FP+TN+FN)
-
Recall - precision (information retrieval):
-
Precision (retrieved relevant / total retrieved) = TP / (TP+FP)
-
Recall (retrieved relevant / total relevant) = TP / (TP + FN)
-
Combined measures: F=2*Recall*Precision/(Recall+Precision)
-
Multiple class setting
-
Predicting performance (true success/error rate)
-
Testing just estimates the probability of success on unknown data (data
used in neither training nor testing).
-
How good is this estimate? (What is the true success/error rate?)
We need confidence intervals (a kind of statistical reasoning) to
predict this.
-
Assume that success and error are the two possible outcomes of a statistical
experiment; for large N the observed success rate is approximately normally distributed.
-
Bernoulli process: We have made N experiments and got S successes.
Then, the observed success rate is P=S/N. What is the true
success rate?
-
Example:
-
N=100, S=75. Then with confidence 80% P is in [0.691,
0.801].
-
N=1000, S=750. Then with confidence 80% P is in [0.732,
0.767].
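The two examples above can be checked with a small plain-Java sketch that computes the confidence interval for the true success rate using the normal approximation to the Bernoulli process (the Wilson score interval used in the recommended Witten & Frank text). The value z = 1.28 (the standard normal quantile for 80% two-sided confidence) is an assumption of the sketch.

// Confidence interval for the true success rate P, given N trials and S successes,
// using the Wilson score interval (normal approximation to the Bernoulli process).
// Sketch only; z = 1.28 is the standard normal quantile for ~80% confidence.
public class SuccessRateInterval {
    static double[] interval(int n, int s, double z) {
        double p = (double) s / n;                           // observed success rate
        double center = p + z * z / (2.0 * n);
        double spread = z * Math.sqrt(p * (1 - p) / n + z * z / (4.0 * n * n));
        double denom  = 1 + z * z / n;
        return new double[] { (center - spread) / denom, (center + spread) / denom };
    }
    public static void main(String[] args) {
        double z = 1.28;                                     // ~80% two-sided confidence
        double[] a = interval(100, 75, z);                   // expect roughly [0.691, 0.801]
        double[] b = interval(1000, 750, z);                 // expect roughly [0.732, 0.767]
        System.out.printf("N=100,  S=75:  [%.3f, %.3f]%n", a[0], a[1]);
        System.out.printf("N=1000, S=750: [%.3f, %.3f]%n", b[0], b[1]);
    }
}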
3. Basic Approaches
-
Nearest Neighbor
-
Feature Selection
-
Bayesian approaches (Naive Bayes, Bayesian Networks, Maximal Entropy)
-
Numeric approaches (Linear regression and SVM)
-
Decision tree learning
-
Using Hypertext structure and Relational Learning (First-Order rule induction)
Nearest Neighbor Learning
-
Distance or similarity function defines what's learned.
-
Euclidean distance (for numeric attributes): D(X,Y) = sqrt[(x_1-y_1)^2
+ (x_2-y_2)^2 + ... + (x_n-y_n)^2],
where X = (x_1, x_2, ..., x_n), Y = (y_1, y_2, ..., y_n).
-
Cosine similarity (dot product when normalized to unit length): Sim(X,Y)
= x_1*y_1 + x_2*y_2 + ... + x_n*y_n
-
Other popular metric: city-block distance. D(X,Y) = |x_1-y_1|
+ |x_2-y_2| + ... + |x_n-y_n|.
-
As different attributes use different scales, normalization is required:
V_norm = (V - V_min) / (V_max - V_min).
Thus V_norm is within [0,1].
-
Nominal attributes: number of differences, i.e. city-block distance, where
|x_i - y_i| = 0 if x_i = y_i, and 1 otherwise.
-
Missing attributes: assumed to be maximally distant (given normalized attributes).
-
Example: weather data
ID   outlook    temp   humidity   windy   play
1    sunny      hot    high       false   no
2    sunny      hot    high       true    no
3    overcast   hot    high       false   yes
4    rainy      mild   high       false   yes
5    rainy      cool   normal     false   yes
6    rainy      cool   normal     true    no
7    overcast   cool   normal     true    yes
8    sunny      mild   high       false   no
9    sunny      cool   normal     false   yes
10   rainy      mild   normal     false   yes
11   sunny      mild   normal     true    yes
12   overcast   mild   high       true    yes
13   overcast   hot    normal     false   yes
14   rainy      mild   high       true    no
X    sunny      cool   high       true    ?

ID         2    8    9    11
D(X, ID)   1    2    2    2
play       no   no   yes  yes
-
Discussion
-
Instance space: Voronoi diagram
-
1-NN is very accurate but also slow: scans entire training data to derive
a prediction (possible improvements: use a sample)
-
Assumes all attributes are equally important. Remedy: attribute selection
or weights (see attribute relevance).
-
Dealing with noise (wrong values of some attributes)
-
Taking a majority vote over the k nearest neighbors (k-NN).
-
Removing noisy instances from dataset (difficult!)
-
Numeric class attribute: take the mean of the class values of the k nearest neighbors.
-
k-NN has been used by statisticians since the early 1950s. Question: how to choose k?
-
Distance weighted k-NN:
-
Weight each vote or class value (for numeric) with the distance.
-
For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)^2
-
Then it makes sense to use all instances (k=n).
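Below is a minimal plain-Java sketch of k-NN on the weather data above, using the nominal city-block distance (number of differing attribute values). Note that several instances tie at distance 2 from X, so exactly which k neighbors are reported depends on how ties are broken; the value k = 5 is chosen only for illustration.

// k-NN with the nominal-attribute distance (number of differing values) on the
// weather data above.  Sketch only (plain Java, not the course software).
import java.util.Arrays;
import java.util.Comparator;

public class WeatherKnn {
    public static void main(String[] args) {
        String[][] data = {                        // outlook, temp, humidity, windy, play
            {"sunny","hot","high","false","no"},     {"sunny","hot","high","true","no"},
            {"overcast","hot","high","false","yes"}, {"rainy","mild","high","false","yes"},
            {"rainy","cool","normal","false","yes"}, {"rainy","cool","normal","true","no"},
            {"overcast","cool","normal","true","yes"},{"sunny","mild","high","false","no"},
            {"sunny","cool","normal","false","yes"}, {"rainy","mild","normal","false","yes"},
            {"sunny","mild","normal","true","yes"},  {"overcast","mild","high","true","yes"},
            {"overcast","hot","normal","false","yes"},{"rainy","mild","high","true","no"}
        };
        String[] x = {"sunny", "cool", "high", "true"};          // the query instance X
        int k = 5;

        Integer[] idx = new Integer[data.length];                // sort instance indices by distance to X
        for (int i = 0; i < data.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingInt((Integer i) -> distance(x, data[i])));

        int yes = 0, no = 0;                                     // majority vote over the k nearest
        for (int j = 0; j < k; j++) {
            String[] inst = data[idx[j]];
            System.out.printf("ID %d: distance %d, play=%s%n",
                              idx[j] + 1, distance(x, inst), inst[4]);
            if (inst[4].equals("yes")) yes++; else no++;
        }
        System.out.println("predicted play = " + (yes > no ? "yes" : "no"));
    }
    static int distance(String[] x, String[] inst) {             // city-block distance on nominal values
        int d = 0;
        for (int i = 0; i < x.length; i++) if (!x[i].equals(inst[i])) d++;
        return d;
    }
}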
Bayesian approaches
Naive Bayes
-
Basic assumptions
-
Opposite of KNN: use all examples
-
Attributes are assumed to be:
-
equally important: all attributes have the same relevance to the classification
task.
-
statistically independent (given the class value): knowledge about the
value of a particular attribute doesn't tell us anything about the value
of another attribute (if the class is known).
-
Although based on assumptions that are almost never correct, this scheme
works well in practice!
-
Probabilities of weather data
outlook    temp   humidity   windy   play
sunny      hot    high       false   no
sunny      hot    high       true    no
overcast   hot    high       false   yes
rainy      mild   high       false   yes
rainy      cool   normal     false   yes
rainy      cool   normal     true    no
overcast   cool   normal     true    yes
sunny      mild   high       false   no
sunny      cool   normal     false   yes
rainy      mild   normal     false   yes
sunny      mild   normal     true    yes
overcast   mild   high       true    yes
overcast   hot    normal     false   yes
rainy      mild   high       true    no
-
outlook = sunny [yes (2/9); no (3/5)];
-
temperature = cool [yes (3/9); no (1/5)];
-
humidity = high [yes (3/9); no (4/5)];
-
windy = true [yes (3/9); no (3/5)];
-
play = yes [(9/14)]
-
play = no [(5/14)]
-
New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]
-
Likelihood of the two classes (play=yes; play=no); a code sketch at the end of this Naive Bayes section reproduces this calculation:
-
yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;
-
no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;
-
Conversion into probabilities by normalization:
-
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
-
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
-
Bayes
theorem (Bayes rule)
-
Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);
-
P(H): a priori probability of H (probability of event before
evidence has been seen);
-
P(H|E): a posteriori (conditional) probability of H (probability
of event after evidence has been seen);
-
Bayes for classification
-
What is the probability of the class given an instance?
-
Evidence E = instance
-
Event H = class value for instance
-
Naïve Bayes assumption: evidence can be split into independent parts
(attributes of the instance).
-
E = [A_1, A_2, ..., A_n]
-
P(E|H) = P(A_1|H) * P(A_2|H) * ... * P(A_n|H)
-
Bayes: P(H|E) = P(A_1|H) * P(A_2|H) * ... * P(A_n|H) * P(H) / P(E)
-
Weather data:
-
E = [outlook=sunny, temp=cool, humidity=high, windy=true]
-
P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes)
* P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) /
P(E)
-
The “zero-frequency problem”
-
What if an attribute value never occurs with some class value (e.g. outlook
= overcast never occurs with class no)?
-
Probability will be zero, for example P(outlook=overcast|no) = 0;
-
A posteriori probability will also be zero: P(no|E) = 0 for any instance with
outlook = overcast (no matter how likely the other values are!)
-
Remedy: add 1 to the count for every attribute value-class combination
(i.e. use the Laplace estimator: (count + 1) / (total + number of attribute values)).
-
Result: probabilities will never be zero! (also stabilizes probability
estimates)
-
Missing values
-
Calculating probabilities: instance is not included in frequency count
for attribute value-class combination.
-
Classification: attribute will be omitted from calculation
-
Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]
-
Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;
-
Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;
-
P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41
-
P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59
-
Numeric attributes
-
Assumption: attributes have a normal or Gaussian probability
distribution (given the class)
-
Parameters involved: mean, standard deviation, density function for probability
-
Discussion
-
Naïve Bayes works surprisingly well (even if independence assumption
is clearly violated).
-
Why? Because classification doesn't require accurate probability estimates
as long as the maximum probability is assigned to the correct class.
-
Adding too many redundant attributes will cause problems (e. g. identical
attributes).
-
Numeric attributes are often not normally distributed.
-
Yet another problem: estimating prior probability is difficult.
-
Advanced approaches: Bayesian networks.
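The hand calculation above can be reproduced with a minimal plain-Java sketch that simply counts frequencies in the weather data (no Laplace correction, to match the calculation): it should print likelihoods of about 0.0053 (yes) and 0.0206 (no) and probabilities of about 0.205 and 0.795.

// Naive Bayes on the weather data above; reproduces the hand calculation for
// the instance [sunny, cool, high, true].  Sketch only (no Laplace correction).
public class WeatherNaiveBayes {
    public static void main(String[] args) {
        String[][] data = {                        // outlook, temp, humidity, windy, play
            {"sunny","hot","high","false","no"},     {"sunny","hot","high","true","no"},
            {"overcast","hot","high","false","yes"}, {"rainy","mild","high","false","yes"},
            {"rainy","cool","normal","false","yes"}, {"rainy","cool","normal","true","no"},
            {"overcast","cool","normal","true","yes"},{"sunny","mild","high","false","no"},
            {"sunny","cool","normal","false","yes"}, {"rainy","mild","normal","false","yes"},
            {"sunny","mild","normal","true","yes"},  {"overcast","mild","high","true","yes"},
            {"overcast","hot","normal","false","yes"},{"rainy","mild","high","true","no"}
        };
        String[] x = {"sunny", "cool", "high", "true"};      // the new instance
        String[] classes = {"yes", "no"};
        double[] likelihood = new double[classes.length];
        for (int c = 0; c < classes.length; c++) {
            int classCount = 0;
            for (String[] row : data) if (row[4].equals(classes[c])) classCount++;
            double l = (double) classCount / data.length;    // prior P(class)
            for (int a = 0; a < x.length; a++) {
                int count = 0;                               // count(attribute value and class)
                for (String[] row : data)
                    if (row[4].equals(classes[c]) && row[a].equals(x[a])) count++;
                l *= (double) count / classCount;            // P(value | class)
            }
            likelihood[c] = l;
        }
        double sum = likelihood[0] + likelihood[1];          // normalization into probabilities
        for (int c = 0; c < classes.length; c++)
            System.out.printf("%s: likelihood = %.4f, P = %.3f%n",
                              classes[c], likelihood[c], likelihood[c] / sum);
    }
}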
Bayesian networks
-
Basics of BN
-
Define joint conditional probabilities.
-
Combine Bayesian reasoning with causal relationships between attributes.
-
Also known as belief networks, probabilistic networks.
-
Defined by:
-
Directed acyclic graph, with nodes representing random variables and links
representing probabilistic dependences.
-
Conditional probability tables (CPT) for each variable (node): specifies
all P(X|parents(X)), i.e. the probability of each value of X, given every
possible combination of values for its parents.
-
Reasoning: given the probabilities at some nodes (inputs) BN calculates
the probabilities in other nodes (outputs).
-
Classification: inputs - attribute values, output - class value probability.
-
There are mechanisms for training BN from examples, given variables and
network structure, i.e. creating CPT's.
-
Example:
-
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls
(M)
-
Structure ("->" denotes causal relation): Burglary ->
Alarm; Earthquake -> Alarm; Alarm -> JohnCalls; Alarm
->
MaryCalls.
-
CPT's (for brevity, probability of false is not given, rows must sum to
1):
-
P(B) = 0.001
-
P(E) = 0.002
B E     P(A)
T T     0.95
T F     0.94
F T     0.29
F F     0.001
-
Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) =
P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 *
0.999 * 0.998 = 0.000628.
-
Reasoning (using the complete joint distribution or other more efficient methods):
-
Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29;
P(A|J and M) = 0.76
-
Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;
-
Other: intercausal P(B|A), mixed P(A|J and ~E)
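The joint probability computed above can be checked with a very small plain-Java sketch. The values P(J|A) = 0.9 and P(M|A) = 0.7 are taken from the calculation itself (their CPTs are not listed here), so they are assumptions of the sketch.

// Joint probability in the burglary/earthquake alarm network:
// P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E).
// Sketch only; P(J|A)=0.9 and P(M|A)=0.7 are taken from the calculation above.
public class AlarmJoint {
    public static void main(String[] args) {
        double pB = 0.001, pE = 0.002;                 // priors P(B), P(E)
        double pA_notB_notE = 0.001;                   // CPT row for B=F, E=F
        double pJ_givenA = 0.9, pM_givenA = 0.7;       // CPTs for JohnCalls and MaryCalls
        double joint = pJ_givenA * pM_givenA * pA_notB_notE * (1 - pB) * (1 - pE);
        System.out.printf("P(J, M, A, ~B, ~E) = %.6f%n", joint);   // expect ~0.000628
    }
}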
-
Naive Bayes as a BN
-
Variables: play, outlook, temp, humidity, windy.
-
Structure: play -> outlook, play -> temp, play ->
humidity, play -> windy.
-
CPT's:
-
play: P(play=yes)=9/14; P(play=no)=5/14;
-
outlook:
-
P(outlook=overcast | play=yes) = 4/9
-
P(outlook=sunny | play=yes) = 2/9
-
P(outlook=rainy | play=yes) = 3/9
-
P(outlook=overcast | play=no) = 0/5
-
P(outlook=sunny | play=no) = 3/5
-
P(outlook=rainy | play=no) = 2/5
-
...
-
Links/Exercises:
Numeric Approaches
Linear Regression
-
Basic idea
-
Work most naturally with numeric attributes. The standard technique
for numeric prediction is linear regression.
-
Predicted class value is a linear combination of attribute values
(a_i): C = w_0*a_0 + w_1*a_1 + w_2*a_2 + ... + w_k*a_k.
For k attributes we have k+1 coefficients. To simplify notation
we add an attribute a_0 that is always 1.
-
Squared error: the sum over all instances of (actual class value - predicted value)^2
-
Deriving the coefficients (w_i): minimize the squared
error on the training data, using standard numerical analysis techniques
(matrix operations). Can be done if there are more instances than attributes
(roughly speaking).
-
Classification by linear regression
-
Binary classification (class values 1, -1). Two possible interpretations:
-
Hyperplane that separates the two classes
-
Data points are projected on a line perpendicular to the hyperplane and
thus positive and negative points are separated.
-
Multi-response linear regression (learning a membership function for each
class)
-
Training: perform a regression (create a model) for each class, setting
the output to 1 for training instances that belong to the class, and 0
for those that do not.
-
Prediction: predict the class corresponding to the model with largest output
value
-
Pairwise regression (designed especially for multiple classification)
-
Training: perform regression for every pair of classes assigning output
1 for one class and -1 for the other.
-
Prediction: predict the class that receives most "votes" (outputs > 0)
from the regression lines.
-
More accurate than multi-response linear regression, however more computationally
expensive.
-
Discussion
-
Creates a hyperplane for any two classes
-
Pairwise: the regression line between the two classes
-
Multi-response: (w_0-v_0)*a_0 + (w_1-v_1)*a_1
+ ... + (w_k-v_k)*a_k = 0, where w_i
and v_i are the coefficients of the models for the two
classes.
-
Not appropriate if data exhibits non-linear dependencies. For example,
instances that cannot be separated by a hyperplane. Classical example:
XOR function.
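Below is a small plain-Java sketch of binary classification by linear regression. The notes describe fitting the coefficients by matrix operations; to keep the sketch short it uses simple gradient descent on the squared error instead, and the tiny two-attribute data set is made up for illustration.

// Binary classification (classes +1 / -1) by linear regression, fitted here by
// gradient descent on the squared error (the notes use matrix operations).
// The 2-D data set is made up for illustration only.
public class LinearClassifier {
    public static void main(String[] args) {
        double[][] a = { {1, 1}, {2, 1}, {1, 2}, {4, 5}, {5, 4}, {5, 5} };  // attribute values
        double[] c = { -1, -1, -1, 1, 1, 1 };                               // class values
        double[] w = new double[3];                  // w[0] is the coefficient of a_0 = 1
        double rate = 0.01;
        for (int epoch = 0; epoch < 5000; epoch++) {
            for (int i = 0; i < a.length; i++) {
                double pred = w[0] + w[1] * a[i][0] + w[2] * a[i][1];
                double err = pred - c[i];            // gradient of the squared error
                w[0] -= rate * err;
                w[1] -= rate * err * a[i][0];
                w[2] -= rate * err * a[i][1];
            }
        }
        // The sign of the output gives the predicted class (the hyperplane is output = 0).
        double[] x = {3, 4};                         // a new instance; expect class +1,
        double out = w[0] + w[1] * x[0] + w[2] * x[1];   // since (3,4) lies nearer the positive cluster
        System.out.printf("w = (%.2f, %.2f, %.2f), output(3,4) = %.2f -> class %s%n",
                          w[0], w[1], w[2], out, out > 0 ? "+1" : "-1");
    }
}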
Support Vector Machine (SVM)
-
Same idea as linear separation (projection)
-
Choosing the hyperplane so that all points are at least some (maximized)
minimal distance from it, i.e. the maximum-margin hyperplane.
-
One of the most accurate text document classifiers
-
Quadratic optimization problem solved by iterative algorithms
Web Crawler Project
This project includes two basic steps:
-
Implementing a Web Crawler.
-
Using the crawler to collect a set of web pages and identify their properties
related to the web structure.
For step 1 you may use WebSPHINX:
A Personal, Customizable Web Crawler, or write your own crawler in Java
or C using the open source code provided with WebSPHINX
or the W3C Protocol Library. Step 2 includes:
-
Identifying a portion of the Web (a subtree, a server or a topic oriented
part of the Web) to be analyzed.
-
Analysis of the structure of the set of web pages.
-
Ranking pages by using various techniques.
-
Grouping pages by similarity.
Note that no programming or implementation of stand-alone applications is required.
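For students who nevertheless choose to write their own crawler (in Java, as suggested above), the following is a minimal sketch of the core loop: fetch a page, extract absolute href links with a regular expression, and follow unvisited links up to a limit. The seed URL and the page limit are placeholders; a real crawler should also honor robots.txt, throttle its requests, and resolve relative links.

// A minimal breadth-first crawler sketch: fetch pages, extract href links,
// follow unvisited ones up to a fixed limit.  The seed URL and the page limit
// are placeholders only.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class MiniCrawler {
    public static void main(String[] args) throws Exception {
        String seed = "http://www.example.edu/";           // placeholder starting point
        int maxPages = 20;                                  // placeholder crawl limit
        Pattern href = Pattern.compile("href\\s*=\\s*\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;                // skip already-visited pages
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) page.append(line).append('\n');
                System.out.println("fetched: " + url + " (" + page.length() + " chars)");
                Matcher m = href.matcher(page);             // extract absolute links only
                while (m.find())
                    if (!visited.contains(m.group(1))) frontier.add(m.group(1));
            } catch (Exception e) {
                System.out.println("skipped: " + url + " (" + e.getMessage() + ")");
            }
        }
    }
}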
Web Document Classification Project
Introduction
Along with the search engines, topic directories
are the most popular sites on the Web. Topic directories organize web pages
in a hierarchical structure (taxonomy, ontology) according to their content.
The purpose of this structuring is twofold: firstly, it helps web searches
focus on the relevant collection of Web documents. The ultimate goal here
is to organize the entire web into a directory, where each web page has
its place in the hierarchy and thus can be easily identified and accessed.
The Open Directory Project (dmoz.org) and About.com are some of the best-known
projects in this area. Secondly, the topic directories can be used to classify
web pages or associate them with known topics. This process is called tagging
and can be used to extend the directories themselves. In fact, some well-known
search portals such as Yahoo and Google return with their responses the topic
path of the response, if the response URL has been associated with some
topic found in a topic directory. As these topic directories are usually
created manually, they cannot capture all URLs; therefore, just a fraction
of all responses are tagged.
Project overview
The aim of the project is to investigate the process
of tagging web pages using the topic directory structures and apply Machine
Learning techniques for automatic tagging or classifying web pages into
topic categories. This would help filtering out the responses of a search
engine or ranking them according to their relevance to a topic specified
by the user.
For example, a keyword search for “Machine Learning”
using Yahoo may return along with some of the pages found (about 5 million)
topic directory paths like:
Category: Artificial Intelligence > Machine Learning
Category: Artificial Intelligence > Web Directories
Category: Maryland > Baltimore > Johns Hopkins University > Courses
(Note that this may not be what you see when you try this query. The
web content is constantly changing as well as the search engines’ approaches
to search the web. This usually results in getting different results from
the same search query at different times.)
Most of the pages returned however are not tagged with directory topics.
Assuming that we know the general topic of such untagged web page, say,
Artificial Intelligence and this is a topic in a directory, we can try
to find the closest subtopic to the web page found. This is where Machine
Learning comes into play. Using some text document classification techniques
we can classify the new web page to one of the existing topics. By using
the collection of pages available under each topic as examples we can create
category descriptions (e.g. classification rules, or conditional probabilities).
Then using these descriptions we can classify new web pages. Another approach
would be the similarity search approach, where using some metric over text
documents we find the closest document and assign its category to the new
web page.
Project description
The project is split into three major parts. These
parts are also stages in the overall process of knowledge extraction from
the web and classification of web documents (tagging). As this process
is interactive and iterative in nature, the stages may be included in a
loop structure that would allow each stage to be revisited so that
feedback from later stages can be used. The parts are well defined and
can be developed separately and then put together as components in a semi-automated
system or executed manually. Hereafter we describe the project stages in
detail along with the deliverables that the students need to document in
the final report for each stage.
1. Collecting sets of web documents grouped by topic
The purpose of this stage is to collect sets of web documents belonging
to different topics (subject area). The basic idea is to use a topic directory
structure. Such structures are available from dmoz.org (the Open Directory
project), the yahoo directory (dir.yahoo.com), about.com and many other
web sites that provide access to web pages grouped by topic or subject.
These
topic structures have to be examined in order to find several topics (e.g.
5), each of which is well represented by a set of documents (at least 20).
Alternative approaches could be extracting web documents manually from
the list of hits returned by a search engine using a general keyword search
or collecting web pages by using a Web Crawler (see the Web Crawler project)
from the web page structure of a large organization (e.g. university).
Deliverable: The outcome of this stage is a collection
of several sets of web documents (actual files stored locally, not just
URL’s) representing different topics or subjects, where the following restrictions
apply:
a) As these topics will be used for learning and classification experiments
at later stages they have to form a specific structure (part of the topic
hierarchy). It’s good to have topics at different levels of the topic hierarchy
and with different distances between them (a distance between two topics
can be defined as the number of predecessors to the first common parent
in the hierarchy). An example of such structure is:
topic1 > topic2 > topic3
topic1 > topic2 > topic4
topic1 > topic5 > topic6
topic1 > topic7 > topic8
topic1 > topic9
The set of topics here is {topic3, topic4, topic6, topic8, topic9}.
Also, it would be interesting to find topics, which are subtopics of two
different topics. An example of this is:
Top > … > topic2 > topic4
Top > … > topic5 > topic4
b) There must be at least 5 different topics with at least 20 documents
in each.
c) Each document should contain a certain minimum amount of text. This
may be measured by the number of words (excluding stopwords and punctuation
marks). For example, this minimum could be 200 words.
2. Feature extraction and data preparation
At this stage the web documents are represented
by feature vectors, which in turn are used to form a training data set
for the Machine Learning stage. To complete this use the Weka system and
follow the directions provided in the Exercises section of DMW, Chapter 1
(free download from Wiley).
Deliverable: ARFF data files containing the feature vectors
for all web documents collected at stage 1. It is recommended that students
prepare several files by using different approaches to feature extraction,
for example, one with Boolean attributes and one with numeric ones created
by applying the TFIDF approach. Versions of the data sets with different
numbers of attributes can also be prepared. A code sketch of this stage follows.
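The sketch below carries out stage 2 with the Weka Java API rather than the GUI. It assumes the documents from stage 1 are stored in one subdirectory per topic (the layout expected by Weka's TextDirectoryLoader); the directory and file names are placeholders, and the same steps can be carried out interactively as described in the DMW Chapter 1 exercises.

// Stage 2 sketch using the Weka API: load the documents collected in stage 1
// (one subdirectory per topic), convert them to TFIDF feature vectors with
// StringToWordVector, and save an ARFF file.  Directory and file names are placeholders.
import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.TextDirectoryLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BuildArff {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("topics"));          // placeholder: topics/<topic-name>/*.txt
        Instances raw = loader.getDataSet();              // one string attribute + class (topic)

        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);                 // term counts instead of 0/1 values
        filter.setTFTransform(true);                      // TF and IDF transforms -> TFIDF-style weights
        filter.setIDFTransform(true);
        filter.setWordsToKeep(300);                       // rough limit on the number of terms
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);

        ArffSaver saver = new ArffSaver();                // write the feature vectors to an ARFF file
        saver.setInstances(vectors);
        saver.setFile(new File("topics-tfidf.arff"));
        saver.writeBatch();
        System.out.println(vectors.numInstances() + " documents, "
                           + vectors.numAttributes() + " attributes written");
    }
}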
3. Machine Learning Stage
At this stage Machine Learning algorithms are used
to create models of the data sets. These models are then used for two purposes.
Firstly, the accuracy of the initial topic structure is evaluated and secondly,
new web documents are classified into existing topics. For both purposes
we use the Weka system. The ML stage consists of the following steps:
-
Preprocessing of the web document data. Load the ARFF files created at
project stage 2, verify their consistency and get some statistics by using
the preprocess panel.
-
Using Weka's decision tree algorithm (J48), examine the decision trees
generated with different data sets. Which are the most important terms
for each data set (the terms appearing at the top of the tree)? How do
they change when the data set changes? Check also the classification accuracy
and the confusion matrix obtained with 10-fold cross-validation and find
out which topic is best represented by the decision tree.
-
Use the Naïve Bayes and Nearest Neighbor (IBk) algorithms and compare
their classification accuracy and confusion matrices obtained with 10-fold
cross-validation with the ones produced by the decision tree. Which ones
are better? Why? (A code sketch after this list illustrates these comparisons.)
-
Run the Weka clustering algorithms (k-means, EM and Cobweb) ignoring the
class attribute (document topic) on all data sets. Evaluate the obtained
clusterings by comparing them to the original set of topics or to the topic
hierarchy (when using Cobweb). Use also the formal method, classes to clusters
evaluation, provided by Weka. For more details of clustering with Weka
see http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex3.html.
-
Classification of new web documents. Get web documents from the same subject
areas (topics), but not belonging to the original set of documents prepared
in project stage 1. Also get documents from different topics. Apply feature
extraction and create ARFF files, each one representing one document. Then
using the Weka test set option classify the new documents. Compare their
original topic with the one predicted by Weka. For the classification experiments
use the guidelines provided in http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex2.doc.
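The classification comparisons above can also be run programmatically with the Weka Java API (the Explorer GUI gives the same results). In the sketch below the ARFF file name is a placeholder for the file produced in stage 2, and the class-attribute index may need adjusting to match your data.

// Stage 3 sketch using the Weka API: compare J48, Naive Bayes and IBk on the
// ARFF file produced in stage 2 with 10-fold cross-validation.
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("topics-tfidf.arff").getDataSet();   // placeholder file name
        data.setClassIndex(0);     // the topic label; with the stage-2 pipeline it usually ends up
                                   // as the first attribute -- adjust if your ARFF file differs

        Classifier[] learners = { new J48(), new NaiveBayes(), new IBk(3) };
        for (Classifier learner : learners) {
            Evaluation eval = new Evaluation(data);       // 10-fold cross-validation, as in the exercises
            eval.crossValidateModel(learner, data, 10, new Random(1));
            System.out.println(learner.getClass().getSimpleName()
                               + ": " + String.format("%.1f%% correct", eval.pctCorrect()));
            System.out.println(eval.toMatrixString("Confusion matrix:"));
        }
    }
}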
Deliverable: This stage of the project requires writing a
report on the experiments performed. The report should include detailed
description of the experiments (input data, Weka outputs), answers to the
questions, and interpretation and analysis of the results with respect
to the original problem stated in the project, web document classification.
Intelligent Web Browser Project
Introduction
Web searches provide a large amount of information about web users.
Data mining techniques can be used to analyze this information and create
user profiles or identify user preferences. A key application of this approach
is in marketing and offering personalized services, an area referred to
as "data gold rush". This project will focus on use of machine learning
approaches to create models of web users. Students will collect web pages
from web searches or by running a web crawler and label them according
to user preferences. The labeled pages will be then encoded as feature
vectors and fed into the machine learning system. The latter will produce
user models that may be used for improving the efficiency of web searches
or identifying users.
Project description
Similarly to the web document classification project this project is split
into three major parts/stages - data collection, feature extraction and
machine learning (mining). At the data collection and feature extraction
stages web pages (documents) are collected and represented as feature vectors.
The important difference with the document classification project is that
the documents are mapped to users (not topic categories). At the machine
learning stage, various learning algorithms are applied to the
feature vectors in order to create models of the users that these vectors
(documents) are mapped onto. Then the models can be used to filter out
web documents returned by searches so that the users can get more focused
information from the search engines. In this way users can also be identified
by their preferences and new users classified accordingly. Hereafter we
describe briefly the project stages.
1. Collecting sets of web documents grouped by users' preference
The purpose of this stage is to collect a set of web documents labeled
with user preferences. This can be done in the following way: A user performs
web searches with simple keyword search, just browses the web or examines
a set of pages collected by a web crawler. To each web document the user
assigns a label representing whether or not the document is interesting
to the user. As in the web document classification project some restrictions
apply: (1) The number of web pages should be greater than the number of
selected features (stage 2). (2) The web pages should have sufficient text
content so that they could be well described by feature vectors.
2. Feature extraction and data preparation
This stage is very similar to the one described in the Web Document Classification
project. By using the Weka filters Boolean or numeric values are calculated
for each web document and the corresponding feature vector is created.
Finally the vectors are included in the ARFF file to be used by WEKA. Note
that at this last step the vectors are extended with class labels (for
example, interesting/non-interesting or +/-) according to the user preferences.
As in the web document classification project, the outcome of
this stage is an ARFF data file containing the feature vectors for all
web documents collected at stage 1. It is recommended that students prepare
several files by using different approaches to feature extraction - Boolean
attributes, numeric attributes (using the TFIDF approach) and with different
number of terms. The idea is to do more experiments with different data
sets and different ML algorithms in order to find the best user model.
3. Machine Learning Stage
At this stage the approaches and experiments are similar to those described
in the Web Document Classification project with an important difference
in the last step where the machine learning models are used. This step
can be called web document filtering (focusing the search) and can be described
as follows: Collect a number of web documents using one of the approaches
suggested in project stage 1. Apply feature extraction and create an ARFF
test file with one data row for each document. Then using the training
set prepared in stage 2 and Weka's test set option, classify the new
documents. Each one will get a corresponding label (interesting/non-interesting
or +/-). Then simply discard the non-interesting documents and present
the interesting ones to the user. Further, this step can be incorporated
into a web browser, so that it automatically labels all web pages as interesting/non-interesting
according to the user preferences.