-
An early approach that works well for academic citation networks (bibliometrics).
-
Mostly based on counting the in-degree of nodes, e.g. the impact factor (number of citations
received in the previous two years).
-
Prestige in social networks:
-
A(u,v) = 1 if page u cites page v; 0 otherwise
-
p(v) = Σ_u A(u,v) p(u)
-
Matrix notation: compute P (a column vector over web pages) by the iterative
assignment P' = A^T P
-
Basics of linear algebra
-
Matrices (see http://mathworld.wolfram.com/Matrix.html)
-
Vectors and norms (see http://mathworld.wolfram.com/VectorNorm.html)
-
Eigenvectors (see http://mathworld.wolfram.com/Eigenvector.html)
-
Example:
-
Graph: a → b, a → c, b → c, c → a
-
Prestige vector (column): P = (p(a), p(b), p(c))
-
Matrix: A = [(0,1,1), (0,0,1), (1,0,0)]; A^T = [(0,0,1), (1,0,0), (1,1,0)]
-
Equation: c P = A^T P
-
Solution: eigenvalue c = 1.325; eigenvector P = (0.548, 0.413, 0.726)
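The eigenpair above can be checked with a few lines of code. A minimal sketch (not part of the original notes), assuming NumPy is available; it computes the principal eigenpair of A^T for the example graph:

import numpy as np

# Adjacency matrix for the graph a -> b, a -> c, b -> c, c -> a
# (node order: a, b, c); A[u, v] = 1 if page u cites page v.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

# Prestige: solve c * P = A^T * P, i.e. find the principal eigenpair of A^T.
eigvals, eigvecs = np.linalg.eig(A.T)
i = np.argmax(eigvals.real)          # index of the largest (real) eigenvalue
c = eigvals[i].real
P = np.abs(eigvecs[:, i].real)       # eig returns an L2-normalized eigenvector
print(round(c, 3), np.round(P, 3))   # ~1.325 and ~[0.548 0.413 0.726]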
-
Differences with the Web
4. PageRank
-
Web page u, F_u = {pages u points to}, B_u = {pages
that point to u}, N_u = |F_u|
-
Basic idea: propagation of ranking through links (see Page, Brin et al.,
Figure 2)
-
R(u) = c Σ_{v ∈ B_u} R(v)/N_v
-
Example:
-
Graph: a → b, a → c, b → c, c → a
-
R(a) = 0.4; R(b) = 0.2; R(c) = 0.4 (see Page, Brin et al., Figure 3)
-
Eigenvector approach:
-
A(u,v) = 1/N_u if u cites v; 0 otherwise;
-
matrix: A = [(0, 0.5, 0.5), (0,0,1), (1,0,0)]; A^T = [(0,0,1), (0.5,0,0), (0.5,1,0)]
-
Equation: c P = A^T P
-
Solutions (find eigenvalue c and eigenvector P):
-
Integer: c = 1; P = (2, 1, 2)
-
|P|_2 = 1 (L2 norm): c = 1; P = (0.666, 0.333, 0.666)
-
|P|_1 = 1 (L1 norm): c = 1; P = (0.4, 0.2, 0.4)
-
Rank sink (a loop without outlinks)
-
Source of rank E(u)
-
R(u) = c Σ_{v ∈ B_u} R(v)/N_v + c E(u), where c is maximized and |R|_1 = 1 (the L1 vector norm of R).
-
Computing PageRank (S is the initial vector over web pages, e.g. E; all norms
are L1):
-
R_0 = S
-
Loop
-
R_{i+1} = A R_i
-
d = |R_i| - |R_{i+1}|
-
R_{i+1} = R_{i+1} + d*E
-
While |R_{i+1} - R_i| > ε
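A minimal sketch (not part of the original notes, assuming NumPy) of this iteration for the example graph above; since that graph has no rank sink, the source term contributes nothing and the result matches R = (0.4, 0.2, 0.4):

import numpy as np

def pagerank(out_links, E, eps=1e-8):
    n = len(out_links)
    A = np.zeros((n, n))                  # A[u, v] = 1/N_u if u links to v
    for u, targets in out_links.items():
        for v in targets:
            A[u, v] = 1.0 / len(targets)
    R = E / np.abs(E).sum()               # R_0 = S (here S = E), L1-normalized
    while True:
        R_new = A.T @ R                   # propagate rank along in-links
        d = np.abs(R).sum() - np.abs(R_new).sum()   # rank lost in sinks
        R_new = R_new + d * E             # redistribute it via the source E
        if np.abs(R_new - R).sum() < eps:
            return R_new
        R = R_new

out_links = {0: [1, 2], 1: [2], 2: [0]}   # a -> b, a -> c, b -> c, c -> a
E = np.ones(3) / 3                        # uniform rank source
print(np.round(pagerank(out_links, E), 3))   # ~[0.4 0.2 0.4]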
-
Random surfer model:
-
R(u) is the probability that a random walk on the graph of the Web visits page u.
-
If the surfer gets into a loop, it jumps to a random page chosen according
to the distribution E.
-
Adjusting PageRank by using the source of rank E
-
E is a uniform vector with a small norm (e.g. |E| = 0.15), i.e. the surfer periodically
jumps to a random web page. Problem: manipulation by commercial interests
(getting an important page, or many unimportant pages, to include a link)
-
E is just one web page: the chosen page gets the highest rank, followed
by its links.
-
Other approaches: use all root level pages of all web servers (difficult
to manipulate).
-
Other applications of PageRank
-
Estimating Web traffic
-
Optimal crawling: using PageRank as an evaluation function.
-
Page navigation (show the PageRank of a link before the user clicks on
it).
5. Authorities and Hubs
-
Problems with associating authority with in-degree:
-
Often links have nothing to do with authority (e.g. navigational links)
-
The balance between relevance and popularity (the most popular pages are
not necessarily the most relevant ones; highly relevant pages sometimes do not
even contain the query string)
-
Idea:
-
Focus on the relevant pages first and then compute authority
-
Use also hub pages (pages that point to multiple relevant authoritative
pages)
-
The algorithm (HITS) - topic distillation. Given a query q:
-
Using a text-based search find a small set of relevant pages (root set R_q).
-
Expand the root set by adding pages that point to, or are pointed to by,
pages from the root set. This creates the base set S_q.
-
Find authorities and hubs in S_q
-
E(u,v) = 1 if u points to v; 0 otherwise (both u and v belong to S_q)
-
x - authority vector; y - hub vector; k - parameter (number of iterations)
-
(x_1, x_2, ..., x_n) = (1, 1, 1, ..., 1)
-
(y_1, y_2, ..., y_n) = (1, 1, 1, ..., 1)
-
Loop k times
-
x_u = Σ_{v: E(v,u)=1} y_v, for all u
-
y_u = Σ_{v: E(u,v)=1} x_v, for all u
-
normalize x and y (L2 norm)
-
End loop
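A minimal sketch (not part of the original notes, assuming NumPy) of the HITS iteration; E is the adjacency matrix of the base set as defined above:

import numpy as np

def hits(E, k=50):
    """E[u, v] = 1 if page u points to page v (pages in the base set S_q)."""
    n = E.shape[0]
    x = np.ones(n)                       # authority scores
    y = np.ones(n)                       # hub scores
    for _ in range(k):
        x = E.T @ y                      # x_u = sum of y_v over pages v pointing to u
        y = E @ x                        # y_u = sum of x_v over pages v that u points to
        x = x / np.linalg.norm(x)        # L2-normalize
        y = y / np.linalg.norm(y)
    return x, y

# Example graph a -> b, a -> c, b -> c, c -> a
E = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
authorities, hubs = hits(E)
print(np.round(authorities, 3), np.round(hubs, 3))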
-
Similar page queries
-
Link-based approach (the alternative is text-based similarity)
-
Find k pages pointing to p
-
Find the root set R_p and the base set S_p
-
Search in S_p for hubs and authorities
-
Report the highest ranking authorities and hubs as similar pages to p
-
Advantage: no problems with pages containing mostly images or very little text
(where text-based similarity would find very little overlap).
-
Dealing with disconnected graphs
-
Example: ambiguous queries
-
Using higher order eigenvectors: HITS actually finds the principal eigenvector
of E E^T and of E^T E (the eigenvector associated with the largest eigenvalue).
-
More eigenvectors may also be used to find hubs and authorities in smaller subgraphs
-
In general, higher order eigenvectors reveal clusters in the graph structure.
-
Improving HITS stability - random walk model (parameter d)
-
with probability d the surfer jumps to a random node in the base set.
-
with probability (1-d) the surfer takes a random out-link from the current
page or goes back to a random page that points to the current one.
-
Tuning parameter d
-
stability improves as d increases
-
d = 1: pure random jumps (no meaningful ranking)
6. Enhanced techniques for page ranking
-
Coarse-grained and fine-grained models
-
Topic generalization and drift
-
Avoiding nepotism
-
k pages on a single host
-
Assign a weight of 1/k to the in-links coming from these pages
-
Eliminating outliers
-
Create vector space representation for the retrieved pages
-
Find the centroid of the root set
-
Eliminate pages from the base set that are too far from the centroid
-
Fine-grained models
-
Using the anchor text (Rank-and-File)
-
Use a base set only and consider pages as chains of terms and links.
-
Increment counts for URLs that appear near (within distance k of) a query
term (start with counts of 0)
-
Report the top ranking pages
-
Using the document markup structure (DOM)
-
Slides
7. Using Web structure to enhance crawling and similarity search
-
Enhanced Crawling
-
Crawling as guided search (e.g. use PageRank as evaluation function)
-
Keyword based search
-
Link-based similarity search
General Setting and Evaluation Techniques
1. General Setting
-
General setting for Classification (Supervised Learning, Learning from
examples, Concept learning)
-
Step 1: Data collection
-
Training documents (model construction subset + model validation subset)
-
Test documents
-
Step 2: Building a model
-
Feature Selection
-
Applying an ML approach (learner, classifier)
-
Validating the Model (tuning learner parameters)
-
Step 3: Testing and evaluating the model
-
Step 4: Using the model to classify new documents (with unknown class labels)
-
Problems with classification of text and hypertext
-
Very large number of features (terms) compared with the number of examples
(documents)
-
Many irrelevant or correlated features
-
Different number of features in different documents
2. Evaluating text classifiers
-
Evaluation criteria
-
Accuracy
-
Computational efficiency (speed, scalability, modification)
-
Ease of model interpretation and using user feedback
-
Simplicity (MDL)
-
Benchmark data
-
Evaluating classification accuracy
-
Holdout
-
Reserve a certain amount for testing and use the remainder for training
(usually 1/3 for testing, 2/3 for training).
-
Problem: the samples might not be representative. For example, some classes
might be represented with very few instances or even with no instances at
all.
-
Solution: stratification - sampling for training and testing within
classes. This ensures that each class is represented with approximately
equal proportions in both subsets
-
Repeated holdout. Success/error estimate can be made more reliable by repeating
the process with different subsamples.
-
In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)
-
The error rates on the different iterations are averaged to yield an overall
error rate.
-
Problem: the different test sets may overlap. Can we prevent overlapping?
-
Cross-validation (CV). Avoids overlapping test sets.
-
k-fold cross-validation
-
First step: data is split into k subsets of equal size (usually by random
sampling).
-
Second step: each subset in turn is used for testing and the remainder
for training.
-
The error estimates are averaged to yield an overall error estimate.
-
Stratified cross-validation: subsets are stratified before the cross-validation
is performed.
-
Stratified ten-fold cross-validation
-
Standard method for evaluation. Extensive experiments have shown that this
is the best choice to get an accurate estimate. There is also some
theoretical evidence for this.
-
Stratification reduces the estimate's variance.
-
Repeated stratified cross-validation is even better. Ten-fold cross-validation
is repeated ten times and results are averaged.
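A minimal sketch (not part of the original notes) of stratified k-fold cross-validation; train_and_test is an assumed helper that builds a classifier on the training split and returns its error on the test split:

import random
from collections import defaultdict

def stratified_k_fold(data, k, train_and_test, seed=0):
    """data: list of (instance, label) pairs; returns the averaged error."""
    rng = random.Random(seed)
    # Group instances by class and shuffle within each class (stratification).
    by_class = defaultdict(list)
    for example in data:
        by_class[example[1]].append(example)
    folds = [[] for _ in range(k)]
    for examples in by_class.values():
        rng.shuffle(examples)
        for i, example in enumerate(examples):
            folds[i % k].append(example)    # deal examples round-robin into folds
    # Each fold is used once for testing, the rest for training.
    errors = []
    for i in range(k):
        test = folds[i]
        train = [e for j in range(k) if j != i for e in folds[j]]
        errors.append(train_and_test(train, test))
    return sum(errors) / k                  # overall error estimate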
-
Leave-one-out cross-validation (LOO CV).
-
LOO CV is an n-fold cross-validation, where n is the number
of training instances. That is, n classifiers are built for all
possible (n-1)-element subsets of the training set, and each is tested
on the remaining single instance.
-
LOO CV makes maximum use of the data.
-
No random subsampling is involved.
-
Problems
-
LOO CV is very computationally expensive.
-
Stratification is not possible. In fact, this method guarantees a non-stratified
sample (there is only one instance in the test set).
-
Worst case example: assume a completely random dataset with two
classes, each represented by 50% of the instances. The best classifier
for this data is the majority predictor. LOO CV will estimate a 100% error
rate (!) for this classifier (explain why).
-
Contingency matrix
Actual \ Predicted |          +          |          -
        +          | True positive (TP)  | False negative (FN)
        -          | False positive (FP) | True negative (TN)
-
Total error = (FP+FN)/(TP+FP+TN+FN)
-
Recall - precision (information retrieval):
-
Precision (retrieved relevant / total retrieved) = TP / (TP+FP)
-
Recall (retrieved relevant / total relevant) = TP / (TP + FN)
-
Combined measure (F-measure): F = 2*Precision*Recall / (Precision+Recall)
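A small sketch (not part of the original notes) of these measures computed from the contingency matrix; the counts are hypothetical:

def evaluate(tp, fn, fp, tn):
    total_error = (fp + fn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # retrieved relevant / total retrieved
    recall = tp / (tp + fn)             # retrieved relevant / total relevant
    f_measure = 2 * precision * recall / (precision + recall)
    return total_error, precision, recall, f_measure

# Hypothetical counts, just to show the calculation.
print(evaluate(tp=40, fn=10, fp=20, tn=30))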
-
Multiple class setting
-
Predicting performance (true success/error rate)
-
Testing just estimates the probability of success on unknown data (data
not used in either training or testing).
-
How good is this estimate? (What is the true success/error rate?)
We need confidence intervals (a kind of statistical reasoning) to
predict this.
-
Assume that success and error are the two possible outcomes of a statistical
experiment (a Bernoulli trial; for large N the observed success rate is
approximately normally distributed).
-
Bernoulli process: We have made N experiments and got S successes.
Then, the observed success rate is P=S/N. What is the true
success rate?
-
Example:
-
N=100, S=75. Then with confidence 80% P is in [0.691,
0.801].
-
N=1000, S=750. Then with confidence 80% P is in [0.732,
0.767].
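The intervals above are consistent with the Wilson score confidence interval (the confidence-interval formula used in Witten and Frank). A minimal sketch, not part of the original notes:

from math import sqrt

def wilson_interval(S, N, z=1.2816):        # z = 1.2816 for 80% confidence
    p = S / N                               # observed success rate
    center = (p + z * z / (2 * N)) / (1 + z * z / N)
    half = z * sqrt(p * (1 - p) / N + z * z / (4 * N * N)) / (1 + z * z / N)
    return center - half, center + half

print(wilson_interval(75, 100))             # ~(0.691, 0.801)
print(wilson_interval(750, 1000))           # ~(0.732, 0.767)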
3. Basic Approaches
-
Nearest Neighbor
-
Feature Selection
-
Bayesian approaches (Naive Bayes, Bayesian Networks, Maximal Entropy)
-
Numeric approaches (Linear regression and SVM)
-
Decision tree learning
-
Using Hypertext structure and Relational Learning (First-Order rule induction)
Nearest Neighbor Learning
-
Distance or similarity function defines what's learned.
-
Euclidean distance (for numeric attributes): D(X,Y) = sqrt[(x_1-y_1)² + (x_2-y_2)² + ... + (x_n-y_n)²],
where X = (x_1, x_2, ..., x_n), Y = (y_1, y_2, ..., y_n).
-
Cosine similarity (the dot product when vectors are normalized to unit length):
Sim(X,Y) = x_1*y_1 + x_2*y_2 + ... + x_n*y_n
-
Other popular metric: city-block (Manhattan) distance. D(X,Y) = |x_1-y_1| + |x_2-y_2| + ... + |x_n-y_n|.
-
As different attributes use different scales, normalization is required:
V_norm = (V - V_min) / (V_max - V_min). Thus V_norm is within [0,1].
-
Nominal attributes: number of differences, i.e. city-block distance, where
|x_i - y_i| = 0 if x_i = y_i and 1 if x_i ≠ y_i.
-
Missing attributes: assumed to be maximally distant (given normalized attributes).
-
Example: weather data
ID | outlook  | temp | humidity | windy | play
 1 | sunny    | hot  | high     | false | no
 2 | sunny    | hot  | high     | true  | no
 3 | overcast | hot  | high     | false | yes
 4 | rainy    | mild | high     | false | yes
 5 | rainy    | cool | normal   | false | yes
 6 | rainy    | cool | normal   | true  | no
 7 | overcast | cool | normal   | true  | yes
 8 | sunny    | mild | high     | false | no
 9 | sunny    | cool | normal   | false | yes
10 | rainy    | mild | normal   | false | yes
11 | sunny    | mild | normal   | true  | yes
12 | overcast | mild | high     | true  | yes
13 | overcast | hot  | normal   | false | yes
14 | rainy    | mild | high     | true  | no
 X | sunny    | cool | high     | true  | ?

Distances from the new instance X to its nearest neighbors:

ID       | 2  | 8  | 9   | 11
D(X, ID) | 1  | 2  | 2   | 2
play     | no | no | yes | yes
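A minimal sketch (not part of the original notes) that reproduces the distances above; for nominal attributes the distance is simply the number of differing attribute values:

weather = {
    1:  ("sunny",    "hot",  "high",   "false", "no"),
    2:  ("sunny",    "hot",  "high",   "true",  "no"),
    3:  ("overcast", "hot",  "high",   "false", "yes"),
    4:  ("rainy",    "mild", "high",   "false", "yes"),
    5:  ("rainy",    "cool", "normal", "false", "yes"),
    6:  ("rainy",    "cool", "normal", "true",  "no"),
    7:  ("overcast", "cool", "normal", "true",  "yes"),
    8:  ("sunny",    "mild", "high",   "false", "no"),
    9:  ("sunny",    "cool", "normal", "false", "yes"),
    10: ("rainy",    "mild", "normal", "false", "yes"),
    11: ("sunny",    "mild", "normal", "true",  "yes"),
    12: ("overcast", "mild", "high",   "true",  "yes"),
    13: ("overcast", "hot",  "normal", "false", "yes"),
    14: ("rainy",    "mild", "high",   "true",  "no"),
}

def distance(a, b):
    # Nominal attributes: count the attribute values that differ.
    return sum(1 for x, y in zip(a, b) if x != y)

X = ("sunny", "cool", "high", "true")
for id_ in (2, 8, 9, 11):
    row = weather[id_]
    print(id_, distance(X, row[:4]), row[4])
# 2 1 no / 8 2 no / 9 2 yes / 11 2 yes -- instance 2 is the single nearest
# neighbor, so 1-NN predicts play = no for X.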
-
Discussion
-
Instance space: Voronoi diagram
-
1-NN is very accurate but also slow: scans entire training data to derive
a prediction (possible improvements: use a sample)
-
Assumes all attributes are equally important. Remedy: attribute selection
or weights (see attribute relevance).
-
Dealing with noise (wrong values of some attributes)
-
Taking a majority vote over the k nearest neighbors (k-NN).
-
Removing noisy instances from dataset (difficult!)
-
Numeric class attribute: take the mean of the class values of the k nearest neighbors.
-
k-NN has been used by statisticians since the early 1950s. Question: how to choose k?
-
Distance weighted k-NN:
-
Weight each vote or class value (for numeric) with the distance.
-
For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)²
-
Then it makes sense to use all instances (k=n).
Bayesian approaches
Naive Bayes
-
Basic assumptions
-
Opposite of KNN: use all examples
-
Attributes are assumed to be:
-
equally important: all attributes have the same relevance to the classification
task.
-
statistically independent (given the class value): knowledge about the
value of a particular attribute doesn't tell us anything about the value
of another attribute (if the class is known).
-
Although based on assumptions that are almost never correct, this scheme
works well in practice!
-
Probabilities of weather data
outlook  | temp | humidity | windy | play
sunny    | hot  | high     | false | no
sunny    | hot  | high     | true  | no
overcast | hot  | high     | false | yes
rainy    | mild | high     | false | yes
rainy    | cool | normal   | false | yes
rainy    | cool | normal   | true  | no
overcast | cool | normal   | true  | yes
sunny    | mild | high     | false | no
sunny    | cool | normal   | false | yes
rainy    | mild | normal   | false | yes
sunny    | mild | normal   | true  | yes
overcast | mild | high     | true  | yes
overcast | hot  | normal   | false | yes
rainy    | mild | high     | true  | no
-
outlook = sunny [yes (2/9); no (3/5)];
-
temperature = cool [yes (3/9); no (1/5)];
-
humidity = high [yes (3/9); no (4/5)];
-
windy = true [yes (3/9); no (3/5)];
-
play = yes [(9/14)]
-
play = no [(5/14)]
-
New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]
-
Likelihood of the two classes (play=yes; play=no):
-
yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;
-
no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;
-
Conversion into probabilities by normalization:
-
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
-
P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
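A minimal sketch (not part of the original notes) that reproduces this calculation by counting over the weather data:

from collections import Counter, defaultdict

# (outlook, temp, humidity, windy, play) -- the weather data from the table above
data = [
    ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no"),
]
class_counts = Counter(row[-1] for row in data)
counts = defaultdict(int)      # counts[(attribute index, value, class)]
for row in data:
    for i, value in enumerate(row[:-1]):
        counts[(i, value, row[-1])] += 1

def likelihood(instance, cls):
    p = class_counts[cls] / len(data)                       # prior P(class)
    for i, value in enumerate(instance):
        p *= counts[(i, value, cls)] / class_counts[cls]    # P(A_i = value | class)
    return p

x = ("sunny", "cool", "high", "true")
like = {cls: likelihood(x, cls) for cls in class_counts}
total = sum(like.values())
print({cls: round(v / total, 3) for cls, v in like.items()})  # {'no': 0.795, 'yes': 0.205}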
-
Bayes theorem (Bayes rule)
-
Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);
-
P(H): a priori probability of H (probability of event before
evidence has been seen);
-
P(H|E): a posteriori (conditional) probability of H (probability
of event after evidence has been seen);
-
Bayes for classification
-
What is the probability of the class given an instance?
-
Evidence E = instance
-
Event H = class value for instance
-
Naïve Bayes assumption: evidence can be split into independent parts
(attributes of the instance).
-
E = [A_1, A_2, ..., A_n]
-
P(E|H) = P(A_1|H) * P(A_2|H) * ... * P(A_n|H)
-
Bayes: P(H|E) = P(A_1|H) * P(A_2|H) * ... * P(A_n|H) * P(H) / P(E)
-
Weather data:
-
E = [outlook=sunny, temp=cool, humidity=high, windy=true]
-
P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes)
* P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E)
-
The “zero-frequency problem”
-
What if an attribute value never occurs with some class value (e.g.
outlook = overcast for class no in the weather data)?
-
The probability will then be zero, for example P(outlook=overcast|no) = 0;
-
The a posteriori probability will also be zero: P(no|E) = 0 (no matter how
likely the other values are!)
-
Remedy: add 1 to the count for every attribute value-class combination
(a Laplace estimator: (count + 1) / (total + number of attribute values)).
-
Result: probabilities will never be zero! (also stabilizes probability
estimates)
-
Missing values
-
Calculating probabilities: instance is not included in frequency count
for attribute value-class combination.
-
Classification: attribute will be omitted from calculation
-
Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]
-
Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;
-
Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;
-
P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41
-
P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59
-
Numeric attributes
-
Assumption: attributes have a normal or Gaussian probability
distribution (given the class)
-
Parameters involved: mean, standard deviation, and the probability density function
-
Discussion
-
Naïve Bayes works surprisingly well (even if independence assumption
is clearly violated).
-
Why? Because classification doesn't require accurate probability estimates
as long as the maximum probability is assigned to the correct class.
-
Adding too many redundant attributes will cause problems (e. g. identical
attributes).
-
Numeric attributes are often not normally distributed.
-
Yet another problem: estimating prior probability is difficult.
-
Advanced approaches: Bayesian networks.
Bayesian networks
-
Basics of BN
-
Define joint conditional probabilities.
-
Combine Bayesian reasoning with causal relationships between attributes.
-
Also known as belief networks, probabilistic networks.
-
Defined by:
-
Directed acyclic graph, with nodes representing random variables and links
- probabilistic dependence.
-
Conditional probability tables (CPTs) for each variable (node): a CPT specifies
P(X|parents(X)), i.e. the probability of each value of X given every
possible combination of values of its parents.
-
Reasoning: given the probabilities at some nodes (inputs) BN calculates
the probabilities in other nodes (outputs).
-
Classification: inputs - attribute values, output - class value probability.
-
There are mechanisms for training a BN from examples, given the variables and
the network structure, i.e. for creating the CPTs.
-
Example:
-
Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls
(M)
-
Structure ("->" denotes causal relation): Burglary ->
Alarm; Earthquake -> Alarm; Alarm -> JohnCalls; Alarm
->
MaryCalls.
-
CPTs (for brevity, the probability of false is not given; it is the complement,
so each row sums to 1):
B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001
-
Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) =
P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 *
0.999 * 0.998 = 0.000628.
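A small sketch (not part of the original notes) of this joint-probability calculation; the CPT values for JohnCalls and MaryCalls and the priors P(B) = 0.001, P(E) = 0.002 are the standard values that go with this example (only P(A|B,E) is listed in the table above):

P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E) from the table
P_J_given_A = {True: 0.90, False: 0.05}
P_M_given_A = {True: 0.70, False: 0.01}

# P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E)
p = P_J_given_A[True] * P_M_given_A[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(round(p, 6))   # ~0.000628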
-
Reasoning (using the complete joint distribution or other, more efficient methods):
-
Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29;
P(A|J and M) = 0.76
-
Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;
-
Other: intercausal P(B|A), mixed P(A|J and ~E)
-
Naive Bayes as a BN
-
Variables: play, outlook, temp, humidity, windy.
-
Structure: play -> outlook, play -> temp, play ->
humidity, play -> windy.
-
CPT's:
-
play: P(play=yes)=9/14; P(play=no)=5/14;
-
outlook:
-
P(outlook=overcast | play=yes) = 4/9
-
P(outlook=sunny | play=yes) = 2/9
-
P(outlook=rainy | play=yes) = 3/9
-
P(outlook=overcast | play=no) = 0/5
-
P(outlook=sunny | play=no) = 3/5
-
P(outlook=rainy | play=no) = 2/5
-
...
Numeric Approaches
Linear Regression
-
Basic idea
-
Works most naturally with numeric attributes. The standard technique
for numeric prediction is linear regression.
-
Predicted class value is a linear combination of the attribute values
(a_i): C = w_0*a_0 + w_1*a_1 + w_2*a_2 + ... + w_k*a_k.
For k attributes we have k+1 coefficients. To simplify notation
we add a_0, which is always 1.
-
Squared error: the sum over all instances of (actual class value - predicted value)².
-
Deriving the coefficients (w_i): minimizing the squared error on the
training data, using standard numerical analysis techniques
(matrix operations). This can be done if there are more instances than attributes
(roughly speaking).
-
Classification by linear regression
-
Binary classification (class values 1, -1). Two possible interpretations:
-
Hyperplane that separates the two classes
-
Data points are projected on a line perpendicular to the hyperplane and
thus positive and negative points are separated.
-
Multi-response linear regression (learning a membership function for each
class)
-
Training: perform a regression (create a model) for each class, setting
the output to 1 for training instances that belong to the class, and 0
for those that do not.
-
Prediction: predict the class corresponding to the model with largest output
value
-
Pairwise regression (designed especially for multiple classification)
-
Training: perform regression for every pair of classes assigning output
1 for one class and -1 for the other.
-
Prediction: predict the class that receives most "votes" (outputs > 0)
from the regression lines.
-
More accurate than multi-response linear regression, however more computationally
expensive.
-
Discussion
-
Creates a hyperplane for any two classes
-
Pairwise: the regression line between the two classes
-
Multi-response: (w_0-v_0)*a_0 + (w_1-v_1)*a_1 + ... + (w_k-v_k)*a_k, where w_i
and v_i are the coefficients of the models for the two classes.
-
Not appropriate if data exhibits non-linear dependencies. For example,
instances that cannot be separated by a hyperplane. Classical example:
XOR function.
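A minimal sketch (not part of the original notes, assuming NumPy) of multi-response linear regression used as a classifier: one least-squares model per class, and prediction by the largest output. The tiny dataset is hypothetical:

import numpy as np

def train_multiresponse(X, y):
    """X: (n_instances, k) attribute matrix; y: class labels."""
    n = X.shape[0]
    X1 = np.hstack([np.ones((n, 1)), X])        # add a_0 = 1 for the intercept w_0
    models = {}
    for cls in set(y):
        target = (np.asarray(y) == cls).astype(float)   # 1 for the class, 0 otherwise
        w, *_ = np.linalg.lstsq(X1, target, rcond=None) # minimize squared error
        models[cls] = w
    return models

def predict(models, x):
    x1 = np.concatenate([[1.0], x])
    return max(models, key=lambda cls: models[cls] @ x1)  # largest membership output

# Tiny hypothetical numeric dataset (two attributes, two classes).
X = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y = ["a", "a", "b", "b"]
models = train_multiresponse(X, y)
print(predict(models, np.array([8.5, 8.5])))   # expected: 'b'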
Support Vector Machine (SVM)
-
Same idea as linear separation (projection)
-
Choose the hyperplane that maximizes the margin, i.e. the minimal
distance from the data points to the hyperplane.
-
One of the most accurate text document classifiers
-
Quadratic optimization problem solved by iterative algorithms
Decision tree learning
-
Type of learning: supervised, concept learning, divide-and-conquer strategy.
-
Strategies for concept learning:
-
Covering: generate a rule, exclude the data covered by it and continue
with the rest.
-
Divide-and-conquer: split the data in subsets and apply the algorithm recursively
to the subsets.
-
Top-down induction of decision trees (TDIDT, an old approach known from pattern
recognition):
-
Select an attribute for root node and create a branch for each possible
attribute value.
-
Split the instances into subsets (one for each branch extending from the
node).
-
Repeat the procedure recursively for each branch, using only instances
that reach the branch (those that satisfy the conditions along the path
from the root to the branch).
-
Stop if all instances have the same class.
-
A criterion for attribute selection
-
Basic idea: choose the attribute which will result in the smallest tree.
-
Heuristic: choose the attribute that produces the “purest” nodes.
-
Properties we require from a purity measure:
-
When node is pure, measure should be zero.
-
When impurity is maximal (i. e. all classes equally likely), measure should
be maximal.
-
The measure should obey multistage property (i. e. decisions can
be made in several stages). For example, assume [2,3,4] is the distribution
of three classes in a set of 9 instances. Then, this property states that
measure([2,3,4])
= measure([2,7]) + (7/9)*measure([3,4]).
-
Entropy is the only function that satisfies all three properties!
-
Given a probability distribution (P_1, P_2, ..., P_n),
the information required to predict an event is the distribution's entropy.
-
Entropy(P_1, P_2, ..., P_n) = -P_1*log(P_1) - P_2*log(P_2) - ... - P_n*log(P_n).
When the base of the log is 2, the entropy is measured in bits.
-
Example: entropy of the class distribution in the weather data (9 yes's
and 5 no's). Entropy(9/14,5/14) = -(9/14)* log(9/14)-(5/14)*log(5/14)
= 0.94.
-
Information in a set, Info([9,5]) = Entropy(9/14,5/14).
-
Attribute outlook splits the set in three subsets, [9,5]
= [2,3] + [4,0] + [3,2]. The information in this partition is Info([2,3],[4,0],[3,2])
= (5/14)*Info([2,3]) + (4/14)*Info([4,0]) + (5/14)*Info([3,2]).
-
Information gain = information before splitting – information after
splitting. Gain(outlook) = Info([9,5]) - Info([2,3],[4,0],[3,2]).
-
Gain(outlook)=0.247, Gain(temperature)=0.029, Gain(humidity)=0.152, Gain(windy)=0.048.
Best attribute: outlook (generates the purest split).
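A small sketch (not part of the original notes) that reproduces these numbers for the outlook attribute:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(partition):
    """Weighted average entropy of a list of class-count lists."""
    total = sum(sum(subset) for subset in partition)
    return sum(sum(subset) / total * entropy(subset) for subset in partition)

info_before = entropy([9, 5])                        # Info([9,5]) = 0.940
# outlook splits [9,5] into sunny [2,3], overcast [4,0], rainy [3,2]
gain_outlook = info_before - info([[2, 3], [4, 0], [3, 2]])
print(round(info_before, 3), round(gain_outlook, 3))  # 0.94, 0.247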
-
Highly-branching attributes (with large number of values)
-
Problem: the gain is usually high. Extreme case: a tuple ID attribute. Gain(ID)
= Info([9,5]) - Info([1,0],[1,0],...,[1,0]) = 0.94 (the maximum).
-
Subsets are more likely to be pure if there is a large number of values
-
Information gain is biased towards choosing attributes with a large number
of values
-
This may result in overfitting (selection of an attribute that is non-optimal
for prediction)
-
The gain ratio: a modification of the information gain that reduces its
bias.
-
Gain ratio takes number and size of branches into account when choosing
an attribute.
-
It corrects the information gain by taking the intrinsic information
of a split into account.
-
Intrinsic information: entropy of distribution of instances into branches
(i. e. how much info do we need to tell which branch an instance belongs
to).
-
Example: intrinsic information for ID code: Intrinsic_info([1,1,...,1])
= -14*(1/14)*log(1/14) = 3.807 bits.
-
Gain_ratio(Attribute) = Gain(Attribute)/Intrinsic_info(Attribute).
-
Problem: ID is still the best attribute. Solution: ignore ID attribute.
-
Another problem with gain ratio: it may overcompensate, i.e. choose an
attribute just because its intrinsic information is very low.
-
Decision tree pruning: avoiding overfitting (overspecialization) and fragmentation.
-
Pre-pruning (stop generating splits, i.e. new nodes):
-
assign a leaf label to a node if its error is lower than a prespecified level.
-
stop splitting when the gain gets lower than a prespecified threshold.
-
stop when a node represents fewer than some threshold number of instances
(say 10, or 5% of the total training set)
-
Postpruning (cutting subtrees after the complete tree has been built).
Usually error-based: replace a subtree with a leaf node, if the error on
the test data is the same or lower (use cross-validation). Computationally
expensive.
-
Generating rules from decision trees
-
Direct approach: each leaf is represented by a rule whose antecedent includes
all tests along the path from the root to that leaf.
-
Rule optimization: deleting conditions from a rule if this does not affect
the error rate.
-
Discussion
-
The basic ideas of TDIDT were developed in the 1960s (CLS, 1966).
-
Algorithm for top-down induction of decision trees using information gain
for attribute selection (ID3) was developed by Ross Quinlan (1981).
-
Gain ratio and other modifications and improvements led to development
of C4.5, which can deal with numeric attributes, missing values, and noisy
data, and also can extract rules from the tree (one of the best concept
learners).
-
There are many other attribute selection criteria (but they make almost no
difference in the accuracy of the results).
Web Crawler Project
This project includes two basic steps:
-
Implementing a Web Crawler
-
Using the crawler to collect a set of web pages and identify their properties
related to the web structure (see Chakrabarti - Chapter 7 and 8).
For step 1 you may use WebSPHINX:
A Personal, Customizable Web Crawler or write your own crawler in Java
or C using the open source provided with WebSPHINX
or the W3C Protocol Library. Step
2 includes:
-
Identifying a portion of the Web (a subtree, a server or a topic oriented
part of the Web) to be analyzed.
-
Analysis of the structure of the set of web pages (on-line, during crawling
or off-line, after collecting the pages).
-
Ranking pages by using various techniques.
-
Grouping pages by similarity.
Note that programming and implementing stand-alone applications is not required
by this project - it is optional and depends on the students' programming
experience and preferences. The project may very well be completed by using
some ready-made tools and manual analysis.
Web Document Classification Project
Introduction
Along with the search engines, topic directories
are the most popular sites on the Web. Topic directories organize web pages
in a hierarchical structure (taxonomy, ontology) according to their content.
The purpose of this structuring is twofold: firstly, it helps web searches
focus on the relevant collection of Web documents. The ultimate goal here
is to organize the entire web into a directory, where each web page has
its place in the hierarchy and thus can be easily identified and accessed.
The Open Directory Project (dmoz.org) and About.com are some of the best-known
projects in this area. Secondly, the topic directories can be used to classify
web pages or associate them with known topics. This process is called tagging
and can be used to extend the directories themselves. In fact, some well-known
search portals such as Yahoo and Google return with their responses the topic
path of the response, if the response URL has been associated with some
topic found in a topic directory. As these topic directories are usually
created manually, they cannot capture all URLs; therefore, only a fraction
of all responses is tagged.
Project overview
The aim of the project is to investigate the process
of tagging web pages using the topic directory structures and apply Machine
Learning techniques for automatic tagging or classifying web pages into
topic categories. This would help filtering out the responses of a search
engine or ranking them according to their relevance to a topic specified
by the user.
For example, a keyword search for “Machine Learning”
using Yahoo may return along with some of the pages found (about 5 million)
topic directory paths like:
Category: Artificial Intelligence > Machine Learning
Category: Artificial Intelligence > Web Directories
Category: Maryland > Baltimore > Johns Hopkins University > Courses
(Note that this may not be what you see when you try this query. The
web content is constantly changing as well as the search engines’ approaches
to search the web. This usually results in getting different results from
the same search query at different times.)
Most of the pages returned, however, are not tagged with directory topics.
Assuming that we know the general topic of such an untagged web page, say
Artificial Intelligence, and this is a topic in a directory, we can try
to find the closest subtopic to the web page found. This is where Machine
Learning comes into play. Using some text document classification techniques
we can classify the new web page to one of the existing topics. By using
the collection of pages available under each topic as examples we can create
category descriptions (e.g. classification rules, or conditional probabilities).
Then using these descriptions we can classify new web pages. Another approach
would be the similarity search approach, where using some metric over text
documents we find the closest document and assign its category to the new
web page.
Project description
The project is split into three major parts. These
parts are also stages in the overall process of knowledge extraction from
the web and classification of web documents (tagging). As this process
is interactive and iterative in nature, the stages may be included in a
loop structure that allows each stage to be revisited so that feedback
from later stages can be used. The parts are well defined and
can be developed separately and then put together as components in a semi-automated
system or executed manually. Hereafter we describe the project stages in
detail along with the deliverables that the students need to document in
the final report for each stage.
1. Collecting sets of web documents grouped by topic
The purpose of this stage is to collect sets of web documents belonging
to different topics (subject area). The basic idea is to use a topic directory
structure. Such structures are available from dmoz.org (the Open Directory
project), the yahoo directory (dir.yahoo.com), about.com and many other
web sites that provide access to web pages grouped by topic or subject.
These topic structures have to be examined in order to find several topics
(e.g. 5), each of which is well represented by a set of documents (at least
20).
Alternative approaches could be extracting web documents manually from
the list of hits returned by a search engine using a general keyword search
or collecting web pages by using a Web Crawler (see the Web Crawler project)
from the web page structure of a large organization (e.g. university).
Deliverable: The outcome of this stage is a collection
of several sets of web documents (actual files stored locally, not just
URL’s) representing different topics or subjects, where the following restrictions
apply:
a) As these topics will be used for learning and classification experiments
at later stages they have to form a specific structure (part of the topic
hierarchy). It’s good to have topics at different levels of the topic hierarchy
and with different distances between them (a distance between two topics
can be defined as the number of predecessors to the first common parent
in the hierarchy). An example of such structure is:
topic1 > topic2 > topic3
topic1 > topic2 > topic4
topic1 > topic5 > topic6
topic1 > topic7 > topic8
topic1 > topic9
The set of topics here is {topic3, topic4, topic6, topic8, topic9}.
Also, it would be interesting to find topics, which are subtopics of two
different topics. An example of this is:
Top > … > topic2 > topic4
Top > … > topic5 > topic4
b) There must be at least 5 different topics with at least 20 documents
in each.
c) Each document should contain a certain minimum amount of text. This
may be measured with the number of words (excluding stopwords and punctuation
marks). For example, this minimum could be 200 words.
d) Each document should be in HTML format and contain HTML tags such as title,
headings, or font modifiers.
2. Feature extraction and data preparation
At this stage the web documents are represented
by feature vectors, which in turn are used to form a training data set
for the Machine Learning stage. The basic steps to achieve this are the
following:
-
Select a number of terms (words) whose presence or absence in each document
can be used to describe the document topic. This can be done manually by
using some domain expertise for each topic or automatically by using a
statistical text processing system. The latter is based on putting all
documents together and sorting all words appearing in them by frequency in
descending order. The first N words in the sorted sequence
can be used to represent the documents with vectors of size N.
-
Using the selected set of terms as features (attributes) create a feature
vector (tuple) for each document with Boolean values corresponding to each
attribute (1 if the term is in the document, 0 if it is not). A more sophisticated
approach for determining the attribute values can be used too. It is based
on using the term frequencies, scaled in some way to normalize the document
length and to adjust for too frequent words (the TFIDF approach known from Information
Retrieval). Further, the HTML tags may be used to modify the attribute
values of the terms appearing within the scope of some tags (for example,
increase the values for titles, headings and emphasized terms).
-
Create a data set in the ARFF format to be used by the Weka Machine Learning
system. An ARFF file is a text file, which defines the attribute types
(for the Boolean values they will be nominal, and for the frequency-based
ones – numeric) and lists all document feature vectors along with their
class value (the document topic).
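A minimal sketch (not part of the original notes) of steps (1)-(3): it selects the N most frequent terms, builds Boolean feature vectors and writes a simple ARFF file for Weka. The documents and topics dictionaries, the stopword set and the file name are assumptions for illustration:

import re
from collections import Counter

def build_arff(documents, topics, n_terms=100, stopwords=frozenset(), path="docs.arff"):
    # Step 1: select terms by collection frequency (most frequent first).
    counts = Counter()
    tokens = {}
    for name, text in documents.items():
        words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
        tokens[name] = set(words)
        counts.update(words)
    terms = [term for term, _ in counts.most_common(n_terms)]

    # Steps 2-3: one Boolean feature vector per document, written in ARFF format.
    classes = sorted(set(topics.values()))
    with open(path, "w") as f:
        f.write("@relation web_documents\n")
        for term in terms:
            f.write(f"@attribute {term} {{0,1}}\n")
        f.write(f"@attribute topic {{{','.join(classes)}}}\n@data\n")
        for name in documents:
            values = ["1" if term in tokens[name] else "0" for term in terms]
            f.write(",".join(values + [topics[name]]) + "\n")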
Steps (1) and (2) are part of the vector space model, which is well known
in the area of Information Retrieval (IR). For details see the text [Chakrabarti,
2003], Chapter 3. Students with good experience in programming can write
a program (for example, in Java) to create the vector space model. Another
option is to use a text corpus analysis package that filters and extracts
keywords with their frequency counts. An example of such a system is TextSTAT,
freeware software available from http://www.niederlandistik.fu-berlin.de/textstat/software-en.html.
Other such systems are also available as freeware from http://www.textanalysis.info/.
Step (3) is discussed in [Witten and Frank, 2000], Chapter 8 (available
online at http://prdownloads.sourceforge.net/weka/Tutorial.pdf). More details
about the ARFF file format can be found in the document http://www.cs.waikato.ac.nz/~ml/weka/arff.html.
Deliverable: ARFF data files containing the feature vectors
for all web documents collected at stage 1. It is recommended that students
prepare several files by using different approaches to feature extraction,
for example, one with Boolean attributes and one with numeric ones created
by applying the TFIDF approach. Versions of the data sets with different
numbers of attributes can also be prepared. A rule of thumb here is that
the number of attributes should be less than the number of examples. The
idea of preparing all those data sets is twofold. Firstly, by experimenting
with different data sets and different ML algorithms the best classification
model can be found. Secondly, by evaluating all those models students will
understand the importance of various parameters of the input data for the
quality of learning and classification.
3. Machine Learning Stage
At this stage Machine Learning algorithms are used
to create models of the data sets. These models are then used for two purposes.
Firstly, the accuracy of the initial topic structure is evaluated and secondly,
new web documents are classified into existing topics. For both purposes
we use the Weka Data Mining System – a free Machine Learning software package
in Java available from
http://www.cs.waikato.ac.nz/~ml/weka/index.html.
This is one of the most popular ML systems used for educational purposes.
It is the companion software package of an excellent book on Machine Learning
and Data Mining [Witten and Frank, 2000]. The ML stage of the project consists
of the following steps:
-
Installation of the Weka package and familiarizations with its functionality.
-
Install Weka using the information provided in the Weka software page.
-
Read the Tutorial (Chapter 8), which describes well the Weka functionality
without GUI. Then read the GUI version user guide at http://prdownloads.sourceforge.net/weka/ExplorerGuide.pdf
and run some experiments using the data sets provided with the package
(e.g. the weather data).
-
Preprocessing of the web document data. Load the ARFF files created at
project stage 2, verify their consistency and get some statistics by using
the preprocess panel.
-
Using the Weka’s decision tree algorithm (J48) examine the decision trees
generated with different data sets. Which are the most important terms
for each data set (the terms appearing on the top of the tree)? How do
they change with changing the data set? Check also the classification accuracy
and the confusion matrix obtained with 10-fold cross validation and find
out which topic is best represented by the decision tree.
-
Use the Naïve Bayes and Nearest Neighbor (IBk) algorithms and compare
their classification accuracy and confusion matrices obtained with 10-fold
cross validation with the ones produced by the decision tree. Which ones
are better? Why?
-
Run the Weka clustering algorithms (k-means, EM and Cobweb) ignoring the
class attribute (document topic) on all data sets. Evaluate the obtained
clusterings by comparing them to the original set of topics or to the topic
hierarchy (when using Cobweb). Use also the formal method, classes to clusters
evaluation, provided by Weka. For more details of clustering with Weka
see http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex3.html.
-
New web document classification. Get web documents from the same subject
areas (topics), but not belonging to the original set of documents prepared
in project stage 1. Get also documents from different topics. Apply feature
extraction and create ARFF files each one representing one document. Then
using the Weka test set option classify the new documents. Compare their
original topic with the one predicted by Weka. For the classification experiments
use the guidelines provided in http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex2.doc.
Deliverable: This stage of the project requires writing a
report on the experiments performed. The report should include a detailed
description of the experiments (input data, Weka outputs), answers to
the questions, and interpretation and analysis of the results with respect
to the original problem stated in the project, web document classification.
References and Readings
-
[Chakrabarti, 2003] Chakrabarti, Mining the Web - Discovering Knowledge
from Hypertext Data, Morgan Kaufmann Publishers, 2003.
-
[Witten and Frank, 2000] Ian H. Witten and Eibe Frank, Data Mining: Practical
Machine Learning Tools and Techniques with Java Implementations, Morgan
Kaufmann, 2000.
Intelligent Web Browser Project
Introduction
Web searches provide a large amount of information about web users.
Data mining techniques can be used to analyze this information and create
user profiles or identify user preferences. A key application of this approach
is in marketing and offering personalized services, an area referred to
as "data gold rush". This project will focus on use of machine learning
approaches to create models of web users. Students will collect web pages
from web searches or by running a web crawler and label them according
to user preferences. The labeled pages will then be encoded as feature
vectors and fed into the machine learning system. The latter will produce
user models that may be used for improving the efficiency of web searches
or identifying users.
Project description
Similarly to the web document classification project this project is split
into three major parts/stages - data collection, feature extraction and
machine learning (mining). At the data collection and feature extraction
stages web pages (documents) are collected and represented as feature vectors.
The important difference with the document classification project is that
the documents are mapped onto users (not topic categories). At the machine
learning stage various learning algorithms are applied to the
feature vectors in order to create models of the users that these vectors
(documents) are mapped onto. Then the models can be used to filter out
web documents returned by searches so that the users can get more focused
information from the search engines. In this way users can also be identified
by their preferences and new users classified accordingly. Hereafter we
describe briefly the project stages.
1. Collecting sets of web documents grouped by users' preference
The purpose of this stage is to collect a set of web documents labeled
with user preferences. This can be done in the following way: A user performs
web searches with simple keyword search, just browses the web or examines
a set of pages collected by a web crawler. To each web document the user
assigns a label representing whether or not the document is interesting
to the user. As in the web document classification project some restrictions
apply: (1) The number of web pages should be greater than the number of
selected features (stage 2). (2) The web pages should have sufficient text
content so that they could be well described by feature vectors.
2. Feature extraction and data preparation
This stage is very similar to the one described in the Web Document Classification
project. By applying statistical text processing software the terms that
will be used in the feature vectors are first identified from the document
corpus. Then Boolean or numeric values are calculated for each web document
and the corresponding feature vector is created. Finally the vectors are
included in the ARFF file to be used by WEKA. Note that at this last step
the vectors are extended with class labels (for example, interesting/non-interesting
or +/-) according to the user preferences.
As in the web document classification project, the outcome of
this stage is an ARFF data file containing the feature vectors for all
web documents collected at stage 1. It is recommended that students prepare
several files by using different approaches to feature extraction - Boolean
attributes, numeric attributes (using the TFIDF approach) and with different
number of terms. The idea is to do more experiments with different data
sets and different ML algorithms in order to find the best user model.
3. Machine Learning Stage
At this stage the approaches and experiments are similar to those described
in the Web Document Classification project with an important difference
in the last step where the machine learning models are used. This step
can be called web document filtering (focusing the search) and can be described
as follows: Collect a number of web documents using one of the approaches
suggested in project stage 1. Apply feature extraction and create an ARFF
test file with one data row for each document. Then using the training
set prepared in stage 2 and Weka's test set option, classify the new
documents. Each one will get a corresponding label (interesting/non-interesting
or +/-). Then simply discard the non-interesting documents and present
the interesting ones to the user. Further, this step can be incorporated
into a web browser, so that it automatically labels all web pages as interesting/non-interesting
according to the user preferences.