CS 580 Web Mining
Fall 2008
Classes: MW 6:45 pm - 8:00 pm, Room: Maria Sanford Hall 214
Instructor: Dr. Zdravko Markov, 303-07 Maria Sanford Hall, (860) 832-2711,
http://www.cs.ccsu.edu/~markov/,
email: markovz at ccsu dot edu
Office hours: TR 10:00 am - 12:30 pm, or by appointment
Description: The Web is the largest collection of electronically
accessible documents, which makes it the richest source of information
in the world. The problem with the Web is that this information is not
well structured and organized, so it cannot be easily retrieved. Search
engines help in accessing web documents by keywords, but this is still
far from what we need in order to use the knowledge available on the
Web effectively. Machine Learning and Data Mining approaches go further
and try to extract knowledge from the raw data available on the Web by
organizing web pages in well-defined structures or by looking into the
patterns of activity of Web users. These are the challenges of the area
of Web Mining. This course focuses on extracting knowledge from the web
by applying Machine Learning techniques for classification and
clustering of hypertext documents. Basic approaches from the area of
Information Retrieval and text analysis are also discussed. The
students use recent Machine Learning and Data Mining software to
implement practical applications for web document retrieval,
classification and clustering.
Prerequisites: CS 501 and CS 502, basic knowledge of algebra,
discrete math and statistics.
Course Objectives
Introduce students to the basic concepts and techniques of Information
Retrieval, Web Search, Data Mining, and Machine Learning for extracting
knowledge from the web.
Develop skills of using recent data mining software for solving practical
problems of Web Mining.
Gain experience of doing independent study and research.
Required text (DMW): Zdravko Markov and Daniel T. Larose. Data
Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Wiley, 2007, ISBN: 978-0-471-66655-4.
Recommended texts: Ian H. Witten and Eibe Frank. Data Mining:
Practical Machine Learning Tools and Techniques (Second Edition), Morgan
Kaufmann, 2005, ISBN: 0-12-088407-0.
Required software: Weka 3 Data Mining System - Free Open Source Machine
Learning Software in Java. Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html
Semester project: There will be a semester project that involves
independent study, work with the course software, writing reports and making
presentations. The project can be done individually or in teams of 2 or
3. The project description and timetable are included in the schedule of
classes and assignments.
Grading: The final grade will be based on the project (80%) and
two tests (20%), and will be affected by classroom participation. The letter
grades will be calculated according to the following table:
A   95-100
A-  90-94
B+  87-89
B   84-86
B-  80-83
C+  77-79
C   74-76
C-  70-73
D+  67-69
D   64-66
D-  60-63
F   0-59
Honesty policy: It is expected that all students will conduct
themselves in an honest manner (see the CCSU Student handbook), and NEVER
claim work which is not their own. Violating this policy will result in
a substantial grade penalty, and may lead to expulsion from the University.
Tentative schedule of classes, assignments and tests

Introduction

The Web Challenges (How to turn the web data into web knowledge):

Web Search Engines

Topic Directories

Semantic Web

Web Mining

Web content mining - discovery of Web document content patterns (text mining).

Web structure mining - discovery of hypertext/linking structure patterns

use hyperlinks to enhance text classification

page ranking

modeling and measuring the Web

Web usage mining - discovery of web users' activity patterns

mining web server logs

mining client machine access logs

Related areas

Reading: DMW, Chapter 1

Lecture slides: dmw1.pdf

Information Retrieval and Web Search

Topics:

Crawling the Web

Indexing and keyword search

Document representation

Relevance Ranking

Vector space model (TF, IDF, TF-IDF), Euclidean distance, cosine similarity

Relevance feedback

Advanced text search

Using the HTML structure in keyword search

Evaluating search quality

Similarity search

Reading: DMW, Chapter 1

Lecture slides: dmw1.pdf
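The vector space topics above can be illustrated with a short sketch. This is a simplified illustration, not course material: the tiny corpus, the tokenization, and the function names are ours (Weka's StringToWordVector filter does the real work in the course software).

```python
import math

def tfidf_vectors(docs):
    """Build TF-IDF vectors: TF = term count in the document,
    IDF = log(N / number of documents containing the term)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    return [[doc.count(t) * math.log(n / df[t]) for t in vocab]
            for doc in tokenized]

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

docs = ["web mining course", "data mining course", "cooking recipes"]
v = tfidf_vectors(docs)
# the two "mining" documents are more similar to each other than to the third
```

Note that terms occurring in every document get IDF = log(1) = 0, so they contribute nothing to the similarity, which is exactly the intended effect of the IDF weighting.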

Exercises:

Hyperlink Based Ranking

Clustering approaches for Web Mining

Evaluating Clustering

Classification approaches for Web Mining

Reading: DMW, Chapter 5

Lecture slides: dmw5.pdf

Basic approaches

Semester Projects

Students may choose one out of the following three projects:

To complete the project students are required to:

Write an initial project description, including specific goals,
resources to be used, plans for how to achieve the goals and evaluate the
project results, and a timetable.

Submit reports and make presentations on:

the initial project description (10% of final grade), due on October
1

the progress made by midterm (30% of final grade), due on November 5

the results achieved upon project completion (40% of final grade), due
on December 17

The students may work individually or in teams of 2 or 3.

The project grading will be based on both reports and presentations.
Hyperlink Based Ranking
1. The structure of the Web

Estimated (1998) at 150 million nodes (pages) and 1.7 billion edges (links).
Now more than 300 million pages, with 1 million added every day.

Pages are very diverse in format (text, images, animation, scripts, forms
etc.) and content (information, ads, news, personal pages etc.)

No central authority of editors: relevance, popularity, authority
are hard to evaluate

Links are also very diverse, many have nothing to do with content or authority
(e.g. navigation links).

The challenge: use the web hyperlink structure to evaluate the importance
of pages and to enhance search
2. Social networks

An early approach that works well for academic networks (bibliometrics).

Mostly counting the indegree of nodes, e.g. impact factor (number of citations
in the previous two years).

Prestige in social networks:

A(u,v) = 1 if page u cites page v; 0 otherwise

p(v) = Σ_{u} A(u,v) p(u)

Matrix notation: compute P (column vector over web pages) by iterative
assignment P' = A^{T}P

Basics of linear algebra

Matrices (see http://mathworld.wolfram.com/Matrix.html)

Vectors and norms (see http://mathworld.wolfram.com/VectorNorm.html)

Eigenvectors (see http://mathworld.wolfram.com/Eigenvector.html)

Example:

Graph: a → b, a → c, b → c, c → a

Prestige vector (column): P = (p(a), p(b), p(c))

Matrix: A = [(0,1,1), (0,0,1), (1,0,0)]; A^{T} = [(0,0,1), (1,0,0),
(1,1,0)]

Equation: cP = A^{T}P

Solution: eigenvalue c = 1.325; eigenvector P = (0.548, 0.413, 0.726)
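The eigenvector solution above can be checked numerically with power iteration, the standard way to find the dominant eigenvector. This is a sketch; the function name and matrix layout are ours.

```python
def power_iteration(at, steps=100):
    """Iterate P' = A^T P, normalizing P to unit L2 length each step;
    this converges to the dominant eigenvector of A^T, and the norm of
    the last unnormalized iterate approximates the eigenvalue c."""
    n = len(at)
    p = [1.0] * n
    for _ in range(steps):
        q = [sum(at[i][j] * p[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    return p, norm

# A^T for the graph a -> b, a -> c, b -> c, c -> a
at = [[0, 0, 1], [1, 0, 0], [1, 1, 0]]
p, c = power_iteration(at)
# p is approximately (0.548, 0.413, 0.726) and c is approximately 1.325
```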

Differences with the Web
3. PageRank

Web page u, F_{u} = {pages u points to}, B_{u} = {pages
that point to u}, N_{u} = |F_{u}| (the number of outlinks of u)

Basic idea: propagation of ranking through links (see Page, Brin et al.,
Figure 2)

R(u) = c Σ_{v ∈ B_{u}} R(v)/N_{v}

Example:

Graph: a → b, a → c, b → c, c → a

R(a) = 0.4; R(b) = 0.2; R(c) = 0.4 (see Page, Brin et al., Figure 3)

Eigenvector approach:

A(u,v) = 1/N_{u}, if u cites v, 0 otherwise;

matrix: A = [(0, 0.5, 0.5), (0,0,1), (1,0,0)]; A^{T} = [(0,0,1),
(0.5,0,0), (0.5,1,0)]

Equation: cP = A^{T}P

Solutions (find eigenvalue c and eigenvector P):

Integer: c = 1; P = (2, 1, 2)

||P||_{2} = 1 (L_{2} norm): c = 1; P = (0.666, 0.333, 0.666)

||P||_{1} = 1 (L_{1} norm): c = 1; P = (0.4, 0.2, 0.4)

Rank sink (a loop without outlinks)

Source of rank E(u)

R(u) = c Σ_{v ∈ B_{u}} R(v)/N_{v} + c E(u), where c is maximized and
||R||_{1} = 1 (the L_{1} vector norm of R).

Computing PageRank (S is the initial vector over web pages, e.g. E; all
norms are L_{1}):

R_{0} = S

Loop

R_{i+1} = AR_{i}

d = ||R_{i}|| - ||R_{i+1}||

R_{i+1} = R_{i+1} + dE

While ||R_{i+1} - R_{i}|| > ε
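The loop above transcribes directly into code. This sketch assumes a uniform source of rank E with ||E||_{1} = 1 and a dictionary-based graph representation; both choices are ours.

```python
def pagerank(links, eps=1e-9):
    """PageRank with a uniform source of rank E; `links` maps each
    page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    e = {u: 1.0 / n for u in pages}   # E: uniform distribution
    r = dict(e)                       # R_0 = S = E
    while True:
        nxt = {u: 0.0 for u in pages}
        for u in pages:               # R_{i+1} = A R_i
            for v in links[u]:
                nxt[v] += r[u] / len(links[u])
        d = sum(r.values()) - sum(nxt.values())  # rank lost in sinks
        for u in pages:
            nxt[u] += d * e[u]        # R_{i+1} = R_{i+1} + dE
        delta = sum(abs(nxt[u] - r[u]) for u in pages)
        r = nxt
        if delta <= eps:              # while ||R_{i+1} - R_i|| > eps
            return r

ranks = pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']})
# approximately R(a) = 0.4, R(b) = 0.2, R(c) = 0.4, matching the example
```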

Random surfer model:

R(u) is the probability of a random walk on the graph of the Web.

If the surfer gets into a loop, then jumps to a random page chosen based
on the distribution in E

Adjusting PageRank by using the source of rank E

E is a uniform vector with a small norm (e.g. ||E||_{1} = 0.15), i.e. periodically
jumping to a random web page. Problem: manipulation by commercial interests
(getting an important page or a lot of non-important pages to include a link)

E is just one web page: the chosen page gets the highest rank, followed
by its links.

Other approaches: use all root level pages of all web servers (difficult
to manipulate).

Other applications of PageRank

Estimating Web traffic

Optimal crawling: using PageRank as an evaluation function.

Page navigation (show the PageRank of a link before the user clicks on
it).
4. Authorities and Hubs

Problems with associating authority with indegree:

Often links have nothing to do with authority (e.g. navigational links)

The balance between relevance and popularity (the most popular pages are
not the most relevant ones, e.g. sometimes the latter do not contain
the query string)

Idea:

Focus on the relevant pages first and then compute authority

Use also hub pages (pages that point to multiple relevant authoritative
pages)

The algorithm (HITS) - topic distillation. Given a query q:

Using a text-based search, find a small set of relevant pages (the root set
R_{q}).

Expand the root set by adding pages that point to and are pointed to by
pages from the root set. Thus create the base set S_{q}.

Find authorities and hubs in S_{q}

E(u,v)=1 if u points to v; 0 otherwise (both u and v belong to S_{q})

x  authority vector; y  hub vector; k  parameter (number of iterations)

(x_{1}, x_{2}, ..., x_{n}) = (1,1,1,
...,1)

(y_{1}, y_{2}, ..., y_{n}) = (1,1,1,
...,1)

Loop k times

x_{u} = Σ_{v: E(v,u)=1} y_{v}, for all u

y_{u} = Σ_{v: E(u,v)=1} x_{v}, for all u

normalize x and y (L_{2} norm)

End loop
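The HITS loop above can be sketched as follows. The edge-list representation and the tiny two-hub example graph are ours, used only to make the fixed point visible.

```python
def hits(edges, nodes, k=20):
    """HITS on the base set: each authority score collects the hub
    scores of pages pointing to it; each hub score collects the
    authority scores of the pages it points to."""
    x = {u: 1.0 for u in nodes}   # authority vector
    y = {u: 1.0 for u in nodes}   # hub vector
    for _ in range(k):
        x = {u: sum(y[v] for v, w in edges if w == u) for u in nodes}
        y = {u: sum(x[v] for w, v in edges if w == u) for u in nodes}
        for vec in (x, y):        # normalize x and y (L2 norm)
            norm = sum(s * s for s in vec.values()) ** 0.5 or 1.0
            for u in vec:
                vec[u] /= norm
    return x, y

# two hubs a and b both pointing to a single authority c
x, y = hits([('a', 'c'), ('b', 'c')], ['a', 'b', 'c'])
# x['c'] = 1.0 (the only authority); y['a'] = y['b'] ≈ 0.707 (equal hubs)
```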

Similar page queries

Link-based approach (the alternative is text-based similarity)

Find k pages pointing to p

Find the root set R_{p} and the base set S_{p}

Search in S_{p} for hubs and authorities

Report the highest ranking authorities and hubs as similar pages to p

Advantages: no problems with pages containing images or very little text
(e.g. very little overlap).

Dealing with disconnected graphs

Example: ambiguous queries

Using higher order eigenvectors: HITS actually finds the principal eigenvector
of EE^{T} and E^{T}E (the eigenvector associated with the
largest eigenvalue).

More eigenvectors may also be used to find hubs and authorities in smaller subgraphs

In general, higher order eigenvectors reveal clusters in the graph structure.

Improving HITS stability - random walk model (parameter d)

with probability d the surfer jumps to a random node in the base set.

with probability (1 - d) the surfer takes a random outlink from the current
page or goes back to a random page that points to the current one.

Tuning parameter d

stability improves as d increases

d=1 (no ranking)
5. Enhanced techniques for page ranking

Coarse-grained and fine-grained models

Topic generalization and drift

Avoiding nepotism

k pages on a single host

Assign a weight of 1/k to the inlinks coming from these pages

Eliminating outliers

Create vector space representation for the retrieved pages

Find the centroid of the root set

Eliminate pages from the base set that are too far from the centroid

Fine Grained models

Using the anchor text (Rank-and-File)

No hubs and authorities

Use a base set only and consider pages as chains of terms and links.

Increment counts for URLs that appear near (within distance k of) a query
term (start with 0 counts)

Report the top ranking pages

Using the document markup structure (DOM)

Slides
6. Using Web structure to enhance crawling and similarity search

Enhanced Crawling

Crawling as guided search (e.g. use PageRank as evaluation function)

Keyword based search

Link-based similarity search
General Setting and Evaluation
Techniques
1. General Setting

General setting for Classification (Supervised Learning, Learning from
examples, Concept learning)

Step 1: Data collection

Training documents (model construction subset + model validation subset)

Test documents

Step 2: Building a model

Feature Selection

Applying an ML approach (learner, classifier)

Validating the Model (tuning learner parameters)

Step 3: Testing and evaluating the model

Step 4: Using the model to classify new documents (with unknown class labels)

Problems with classification of text and hypertext

Very large number of features (terms) compared with the number of examples
(documents)

Many irrelevant or correlated features

Different number of features in different documents
2. Evaluating text classifiers

Evaluation criteria

Accuracy

Computational efficiency (speed, scalability, modification)

Ease of model interpretation and using user feedback

Simplicity (MDL)

Benchmark data

Evaluating classification accuracy

Holdout

Reserve a certain amount for testing and use the remainder for training
(usually 1/3 for testing, 2/3 for training).

Problem: the samples might not be representative. For example, some classes
might be represented with very few instances or even with no instances at
all.

Solution: stratification - sampling for training and testing within
classes. This ensures that each class is represented with approximately
equal proportions in both subsets

Repeated holdout. Success/error estimate can be made more reliable by repeating
the process with different subsamples.

In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)

The error rates on the different iterations are averaged to yield an overall
error rate.

Problem: the different test sets may overlap. Can we prevent overlapping?

Cross-validation (CV). Avoids overlapping test sets.

k-fold cross-validation

First step: data is split into k subsets of equal size (usually by random
sampling).

Second step: each subset in turn is used for testing and the remainder
for training.

The error estimates are averaged to yield an overall error estimate.
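The two steps above can be sketched in a few lines. This is a minimal illustration without stratification; the function name and the use of a fixed random seed are our choices.

```python
import random

def k_fold_splits(instances, k=10, seed=0):
    """Split data into k folds by random sampling; each fold in turn
    is the test set and the remaining k-1 folds form the training set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=5):
    pass  # train and evaluate the classifier on each train/test pair
```

By construction the test sets never overlap, and every instance is used for testing exactly once, which is the property that distinguishes cross-validation from repeated holdout.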

Stratified cross-validation: subsets are stratified before the cross-validation
is performed.

Stratified ten-fold cross-validation

Standard method for evaluation. Extensive experiments have shown that this
is the best choice to get an accurate estimate. There is also some
theoretical evidence for this.

Stratification reduces the estimate's variance.

Repeated stratified cross-validation is even better: ten-fold cross-validation
is repeated ten times and the results are averaged.

Leave-one-out cross-validation (LOO CV).

LOO CV is n-fold cross-validation, where n is the number
of training instances. That is, n classifiers are built for all
possible (n-1)-element subsets of the training set and then tested
on the remaining single instance.

LOO CV makes maximum use of the data.

No random subsampling is involved.

Problems

LOO CV is very computationally expensive.

Stratification is not possible. Actually this method guarantees a
non-stratified sample (there is only one instance in the test set).

Worst case example: assume a completely random dataset with two
classes each represented by 50% of the instances. The best classifier
for this data is the majority predictor. LOO CV will predict 100% error
(!) rate for this classifier (explain why?).

Contingency matrix
Actual \ Predicted | +                   | -
+                  | True positive (TP)  | False negative (FN)
-                  | False positive (FP) | True negative (TN)

Total error = (FP+FN)/(TP+FP+TN+FN)

Recall - precision (information retrieval):

Precision (retrieved relevant / total retrieved) = TP / (TP+FP)

Recall (retrieved relevant / total relevant) = TP / (TP + FN)

Combined measures: F=2*Recall*Precision/(Recall+Precision)
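The measures above compute directly from the contingency counts. The example counts below are illustrative, not from the course data.

```python
def evaluate(tp, fp, fn, tn):
    """Total error, precision, recall and F-measure from the
    contingency matrix counts."""
    error = (fp + fn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # retrieved relevant / total retrieved
    recall = tp / (tp + fn)             # retrieved relevant / total relevant
    f = 2 * recall * precision / (recall + precision)
    return error, precision, recall, f

# e.g. 40 true positives, 10 false positives, 20 false negatives, 30 true negatives
error, precision, recall, f = evaluate(40, 10, 20, 30)
# error = 0.3, precision = 0.8, recall ≈ 0.667, F ≈ 0.727
```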

Multiple class setting

Predicting performance (true success/error rate)

Testing just estimates the probability of success on unknown data (data
used in neither training nor testing).

How good is this estimate? (What is the true success/error rate?)
We need confidence intervals (a kind of statistical reasoning) to
predict this.

Assume that success and error are two possible outcomes of a statistical
experiment (normally distributed random variable).

Bernoulli process: We have made N experiments and got S successes.
Then, the observed success rate is P=S/N. What is the true
success rate?

Example:

N=100, S=75. Then with confidence 80% P is in [0.691,
0.801].

N=1000, S=750. Then with confidence 80% P is in [0.732,
0.767].
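The intervals in the example match the Wilson score interval for a Bernoulli proportion (that identification is ours; the text does not name the formula). A sketch:

```python
from statistics import NormalDist

def wilson_interval(s, n, confidence=0.80):
    """Wilson score interval for the true success rate P,
    given S successes in N trials."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.28 for 80%
    p = s / n
    center = p + z * z / (2 * n)
    spread = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_interval(75, 100)
# lo ≈ 0.691, hi ≈ 0.801, as in the first example
```

Note how the interval shrinks with N at fixed observed rate: with N=1000 and S=750 it narrows to roughly [0.732, 0.767], matching the second example.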
3. Basic Approaches

Nearest Neighbor

Feature Selection

Bayesian approaches (Naive Bayes, Bayesian Networks, Maximal Entropy)

Numeric approaches (Linear regression and SVM)

Decision tree learning

Using Hypertext structure and Relational Learning (FirstOrder rule induction)
Nearest Neighbor Learning

Distance or similarity function defines what's learned.

Euclidean distance (for numeric attributes): D(X,Y) = sqrt[(x_{1} - y_{1})^{2}
+ (x_{2} - y_{2})^{2} + ... + (x_{n} - y_{n})^{2}],
where X = {x_{1}, x_{2}, ..., x_{n}}, Y = {y_{1},
y_{2}, ..., y_{n}}.

Cosine similarity (dot product when normalized to unit length): Sim(X,Y)
= x_{1}*y_{1} + x_{2}*y_{2} + ... + x_{n}*y_{n}

Other popular metric: city-block distance: D(X,Y) = |x_{1} - y_{1}|
+ |x_{2} - y_{2}| + ... + |x_{n} - y_{n}|.

As different attributes use different scales, normalization is required:
V_{norm} = (V - V_{min}) / (V_{max} - V_{min}).
Thus V_{norm} is within [0,1].

Nominal attributes: number of differences, i.e. city-block distance, where
|x_{i} - y_{i}| = 0 if x_{i} = y_{i}, and 1 otherwise.

Missing attributes: assumed to be maximally distant (given normalized attributes).

Example: weather data
ID | outlook  | temp | humidity | windy | play
1  | sunny    | hot  | high     | false | no
2  | sunny    | hot  | high     | true  | no
3  | overcast | hot  | high     | false | yes
4  | rainy    | mild | high     | false | yes
5  | rainy    | cool | normal   | false | yes
6  | rainy    | cool | normal   | true  | no
7  | overcast | cool | normal   | true  | yes
8  | sunny    | mild | high     | false | no
9  | sunny    | cool | normal   | false | yes
10 | rainy    | mild | normal   | false | yes
11 | sunny    | mild | normal   | true  | yes
12 | overcast | mild | high     | true  | yes
13 | overcast | hot  | normal   | false | yes
14 | rainy    | mild | high     | true  | no
X  | sunny    | cool | high     | true  | ?

ID       | 2  | 8  | 9   | 11
D(X, ID) | 1  | 2  | 2   | 2
play     | no | no | yes | yes
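The distance computation above can be reproduced in a few lines, using the nominal city-block distance (count of differing attribute values). The data layout is ours; only four instances from the weather table are included, matching the distance table.

```python
def nominal_distance(x, y):
    """City-block distance for nominal attributes: the number of
    attributes on which the two instances differ."""
    return sum(a != b for a, b in zip(x, y))

# four instances from the weather data (outlook, temp, humidity, windy)
data = {
    2:  (('sunny', 'hot',  'high',   'true'),  'no'),
    8:  (('sunny', 'mild', 'high',   'false'), 'no'),
    9:  (('sunny', 'cool', 'normal', 'false'), 'yes'),
    11: (('sunny', 'mild', 'normal', 'true'),  'yes'),
}
x = ('sunny', 'cool', 'high', 'true')
dists = {i: nominal_distance(x, attrs) for i, (attrs, _) in data.items()}
nearest = min(dists, key=dists.get)
# dists == {2: 1, 8: 2, 9: 2, 11: 2}; 1-NN predicts the class of
# instance 2: play = no
```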

Discussion

Instance space: Voronoi diagram

1-NN is very accurate but also slow: it scans the entire training data to derive
a prediction (possible improvement: use a sample)

Assumes all attributes are equally important. Remedy: attribute selection
or weights (see attribute relevance).

Dealing with noise (wrong values of some attributes)

Taking a majority vote over the k nearest neighbors (kNN).

Removing noisy instances from dataset (difficult!)

Numeric class attribute: take the mean of the class values of the k nearest neighbors.

kNN has been used by statisticians since the early 1950s. Question: k = ?

Distance weighted kNN:

Weight each vote or class value (for numeric) with the distance.

For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)^{2}

Then it makes sense to use all instances (k=n).
Bayesian approaches
Naive Bayes

Basic assumptions

Opposite of KNN: use all examples

Attributes are assumed to be:

equally important: all attributes have the same relevance to the classification
task.

statistically independent (given the class value): knowledge about the
value of a particular attribute doesn't tell us anything about the value
of another attribute (if the class is known).

Although based on assumptions that are almost never correct, this scheme
works well in practice!

Probabilities of weather data
outlook  | temp | humidity | windy | play
sunny    | hot  | high     | false | no
sunny    | hot  | high     | true  | no
overcast | hot  | high     | false | yes
rainy    | mild | high     | false | yes
rainy    | cool | normal   | false | yes
rainy    | cool | normal   | true  | no
overcast | cool | normal   | true  | yes
sunny    | mild | high     | false | no
sunny    | cool | normal   | false | yes
rainy    | mild | normal   | false | yes
sunny    | mild | normal   | true  | yes
overcast | mild | high     | true  | yes
overcast | hot  | normal   | false | yes
rainy    | mild | high     | true  | no

outlook = sunny [yes (2/9); no (3/5)];

temperature = cool [yes (3/9); no (1/5)];

humidity = high [yes (3/9); no (4/5)];

windy = true [yes (3/9); no (3/5)];

play = yes [(9/14)]

play = no [(5/14)]

New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]

Likelihood of the two classes (play=yes; play=no):

yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;

no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;

Conversion into probabilities by normalization:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205

P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
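The likelihood computation above transcribes directly into code. The conditional probabilities are read off the weather data; the function and dictionary layout are our sketch.

```python
def naive_bayes_posteriors(evidence, cond, prior):
    """Multiply the class-conditional probabilities and the prior for
    each class, then normalize the likelihoods into probabilities."""
    likelihood = {}
    for cls in prior:
        p = prior[cls]
        for attr, value in evidence.items():
            p *= cond[cls][(attr, value)]
        likelihood[cls] = p
    total = sum(likelihood.values())
    return {cls: p / total for cls, p in likelihood.items()}

# conditional probabilities for the new instance, from the weather data
cond = {'yes': {('outlook', 'sunny'): 2/9, ('temp', 'cool'): 3/9,
                ('humidity', 'high'): 3/9, ('windy', 'true'): 3/9},
        'no':  {('outlook', 'sunny'): 3/5, ('temp', 'cool'): 1/5,
                ('humidity', 'high'): 4/5, ('windy', 'true'): 3/5}}
prior = {'yes': 9/14, 'no': 5/14}
evidence = {'outlook': 'sunny', 'temp': 'cool',
            'humidity': 'high', 'windy': 'true'}
post = naive_bayes_posteriors(evidence, cond, prior)
# post['yes'] ≈ 0.205, post['no'] ≈ 0.795, as computed above
```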

Bayes
theorem (Bayes rule)

Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);

P(H): a priori probability of H (probability of the event before
evidence has been seen);

P(H|E): a posteriori (conditional) probability of H (probability
of the event after evidence has been seen);

Bayes for classification

What is the probability of the class given an instance?

Evidence E = instance

Event H = class value for instance

Naïve Bayes assumption: evidence can be split into independent parts
(attributes of the instance).

E = [A_{1},A_{2},...,A_{n}]

P(E|H) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)

Bayes: P(H|E) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)*P(H)
/ P(E)

Weather data:

E = [outlook=sunny, temp=cool, humidity=high, windy=true]

P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes)
* P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) /
P(E)

The “zero-frequency problem”

What if an attribute value doesn't occur with every class value (e.g.
humidity = high for class yes)?

The probability would be zero, for example P(humidity=high|yes) = 0;

The a posteriori probability would also be zero: P(yes|E) = 0 (no matter how
likely the other values are!)

Remedy: add 1 to the count for every attribute value-class combination
(the Laplace estimator: (p+1) / (n+k), where k is the number of values
for the attribute).

Result: probabilities will never be zero! (also stabilizes probability
estimates)

Missing values

Calculating probabilities: the instance is not included in the frequency count
for the attribute value-class combination.

Classification: attribute will be omitted from calculation

Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]

Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;

Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;

P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41

P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59

Numeric attributes

Assumption: attributes have a normal or Gaussian probability
distribution (given the class)

Parameters involved: mean, standard deviation, and the probability density function

Discussion

Naïve Bayes works surprisingly well (even if independence assumption
is clearly violated).

Why? Because classification doesn't require accurate probability estimates
as long as
maximum probability is assigned to correct class.

Adding too many redundant attributes will cause problems (e.g. identical
attributes).

Numeric attributes are often not normally distributed.

Yet another problem: estimating prior probability is difficult.

Advanced approaches: Bayesian networks.
Bayesian networks

Basics of BN

Define joint conditional probabilities.

Combine Bayesian reasoning with causal relationships between attributes.

Also known as belief networks, probabilistic networks.

Defined by:

Directed acyclic graph, with nodes representing random variables and links
representing probabilistic dependence.

Conditional probability tables (CPT) for each variable (node): specifies
all P(X|parents(X)), i.e. the probability of each value of X, given every
possible combination of values for its parents.

Reasoning: given the probabilities at some nodes (inputs) BN calculates
the probabilities in other nodes (outputs).

Classification: inputs - attribute values, output - class value probability.

There are mechanisms for training BN from examples, given variables and
network structure, i.e. creating CPT's.

Example:

Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls
(M)

Structure ("->" denotes a causal relation): Burglary -> Alarm;
Earthquake -> Alarm; Alarm -> JohnCalls; Alarm -> MaryCalls.

CPT's (for brevity, probability of false is not given, rows must sum to
1):

P(B) = 0.001

P(E) = 0.002
B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) =
P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 *
0.999 * 0.998 = 0.000628.
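The joint probability above can be checked numerically. The CPT entries P(J|A) = 0.9 and P(M|A) = 0.7 are taken from the calculation itself (the full CPTs for JohnCalls and MaryCalls are not listed in the notes).

```python
# CPT entries from the example; probability of false is the complement
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j_given_a, p_m_given_a = 0.9, 0.7

# P(J, M, A, ~B, ~E) = P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)
joint = (p_j_given_a * p_m_given_a * p_a[(False, False)]
         * (1 - p_b) * (1 - p_e))
# joint ≈ 0.000628
```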

Reasoning (using the complete joint distribution or other more efficient methods):

Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29;
P(A|J and M) = 0.76

Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;

Other: intercausal P(B|A), mixed P(A|J and ~E)


Naive Bayes as a BN

Variables: play, outlook, temp, humidity, windy.

Structure: play -> outlook, play -> temp, play -> humidity, play -> windy.

CPT's:

play: P(play=yes)=9/14; P(play=no)=5/14;

outlook:

P(outlook=overcast | play=yes) = 4/9

P(outlook=sunny | play=yes) = 2/9

P(outlook=rainy | play=yes) = 3/9

P(outlook=overcast | play=no) = 0/5

P(outlook=sunny | play=no) = 3/5

P(outlook=rainy | play=no) = 2/5

...

Numeric Approaches
Linear Regression

Basic idea

Work most naturally with numeric attributes. The standard technique
for numeric prediction is linear regression.

Predicted class value is a linear combination of attribute values
(a_{i}): C = w_{0}*a_{0} + w_{1}*a_{1}
+ w_{2}*a_{2} + ... + w_{k}*a_{k}.
For k attributes we have k+1 coefficients. To simplify notation
we add a_{0}, which is always 1.

Squared error: the sum over all instances of (actual class value - predicted
one)^{2}

Deriving the coefficients (w_{i}): minimizing squared
error on training data. Using standard numerical analysis techniques
(matrix operations). Can be done if there are more instances than attributes
(roughly speaking).
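For a single attribute the least-squares coefficients have a closed form, which illustrates the idea (the general multi-attribute case uses matrix operations, as noted above). The data values below are illustrative.

```python
def fit_line(xs, ys):
    """Minimize squared error for C = w0 + w1*a: w1 is the covariance
    of a and C divided by the variance of a, and w0 follows from the
    means (the fitted line passes through the point of means)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

w0, w1 = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.0, 9.0])
# roughly w0 ≈ 1.05, w1 ≈ 1.98 for this nearly linear data
```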

Classification by linear regression

Binary classification (class values 1, -1). Two possible interpretations:

Hyperplane that separates the two classes

Data points are projected on a line perpendicular to the hyperplane and
thus positive and negative points are separated.

Multiresponse linear regression (learning a membership function for each
class)

Training: perform a regression (create a model) for each class, setting
the output to 1 for training instances that belong to the class, and 0
for those that do not.

Prediction: predict the class corresponding to the model with largest output
value

Pairwise regression (designed especially for multiple classification)

Training: perform regression for every pair of classes, assigning output
1 for one class and -1 for the other.

Prediction: predict the class that receives most "votes" (outputs > 0)
from the regression lines.

More accurate than multiresponse linear regression, but more computationally
expensive.

Discussion

Creates a hyperplane for any two classes

Pairwise: the regression line between the two classes

Multiresponse: (w_{0} - v_{0})*a_{0} + (w_{1} - v_{1})*a_{1}
+ ... + (w_{k} - v_{k})*a_{k}, where w_{i}
and v_{i} are the coefficients of the models for the two
classes.

Not appropriate if data exhibits nonlinear dependencies. For example,
instances that cannot be separated by a hyperplane. Classical example:
XOR function.
Support Vector Machine (SVM)

Same idea as linear separation (projection)

Choosing the hyperplane so that the minimal distance from the points to
the hyperplane (the margin) is maximized.

One of the most accurate text document classifiers

Quadratic optimization problem solved by iterative algorithms
Web Crawler Project
This project includes two basic steps:

Implementing a Web Crawler.

Using the crawler to collect a set of web pages and identify their properties
related to the web structure.
For step 1 you may use WebSPHINX:
A Personal, Customizable Web Crawler, or write your own crawler in Java
or C using the open source code provided with WebSPHINX
or the W3C Protocol Library. Step
2 includes:

Identifying a portion of the Web (a subtree, a server or a topic oriented
part of the Web) to be analyzed.

Analysis of the structure of the set of web pages.

Ranking pages by using various techniques.

Grouping pages by similarity.
Note that no programming or implementation of standalone applications is required.
Web Document Classification Project
Introduction
Along with the search engines, topic directories
are the most popular sites on the Web. Topic directories organize web pages
in a hierarchical structure (taxonomy, ontology) according to their content.
The purpose of this structuring is twofold: firstly, it helps web searches
focus on the relevant collection of Web documents. The ultimate goal here
is to organize the entire web into a directory, where each web page has
its place in the hierarchy and thus can be easily identified and accessed.
The Open Directory Project (dmoz.org) and About.com are some of the best-known
projects in this area. Secondly, topic directories can be used to classify
web pages or associate them with known topics. This process is called tagging
and can be used to extend the directories themselves. In fact, some well-known
search portals such as Yahoo and Google return with their responses the topic
path of the response, if the response URL has been associated with some
topic found in a topic directory. As these topic directories are usually
created manually, they cannot capture all URLs; therefore only a fraction
of all responses are tagged.
Project overview
The aim of the project is to investigate the process
of tagging web pages using the topic directory structures and apply Machine
Learning techniques for automatic tagging or classifying web pages into
topic categories. This would help filtering out the responses of a search
engine or ranking them according to their relevance to a topic specified
by the user.
For example, a keyword search for “Machine Learning”
using Yahoo may return along with some of the pages found (about 5 million)
topic directory paths like:
Category: Artificial Intelligence > Machine Learning
Category: Artificial Intelligence > Web Directories
Category: Maryland > Baltimore > Johns Hopkins University > Courses
(Note that this may not be what you see when you try this query. The
web content is constantly changing as well as the search engines’ approaches
to search the web. This usually results in getting different results from
the same search query at different times.)
Most of the pages returned however are not tagged with directory topics.
Assuming that we know the general topic of such an untagged web page, say,
Artificial Intelligence, and this is a topic in a directory, we can try
to find the closest subtopic to the web page found. This is where Machine
Learning comes into play. Using some text document classification techniques
we can classify the new web page to one of the existing topics. By using
the collection of pages available under each topic as examples we can create
category descriptions (e.g. classification rules, or conditional probabilities).
Then using these descriptions we can classify new web pages. Another approach
would be the similarity search approach, where using some metric over text
documents we find the closest document and assign its category to the new
web page.
Project description
The project is split into three major parts. These
parts are also stages in the overall process of knowledge extraction from
the web and classification of web documents (tagging). As this process
is interactive and iterative in nature, the stages may be included in a
loop structure that allows each stage to be revisited so that feedback
from later stages can be used. The parts are well defined and can be
developed separately and then put together as components in a semi-automated
system or executed manually. Hereafter we describe the project stages in
detail, along with the deliverables that the students need to document in
the final report for each stage.
1. Collecting sets of web documents grouped by topic
The purpose of this stage is to collect sets of web documents belonging
to different topics (subject areas). The basic idea is to use a topic directory
structure. Such structures are available from dmoz.org (the Open Directory
project), the Yahoo directory (dir.yahoo.com), about.com, and many other
web sites that provide access to web pages grouped by topic or subject.
These
topic structures have to be examined in order to find several topics (e.g.
5), each of which is well represented by a set of documents (at least 20).
Alternative approaches could be extracting web documents manually from
the list of hits returned by a search engine using a general keyword search
or collecting web pages by using a Web Crawler (see the Web Crawler project)
from the web page structure of a large organization (e.g. university).
Deliverable: The outcome of this stage is a collection
of several sets of web documents (actual files stored locally, not just
URL’s) representing different topics or subjects, where the following restrictions
apply:
a) As these topics will be used for learning and classification experiments
at later stages they have to form a specific structure (part of the topic
hierarchy). It’s good to have topics at different levels of the topic hierarchy
and with different distances between them (a distance between two topics
can be defined as the number of predecessors to the first common parent
in the hierarchy). An example of such structure is:
topic1 > topic2 > topic3
topic1 > topic2 > topic4
topic1 > topic5 > topic6
topic1 > topic7 > topic8
topic1 > topic9
The set of topics here is {topic3, topic4, topic6, topic8, topic9}.
Also, it would be interesting to find topics, which are subtopics of two
different topics. An example of this is:
Top > … > topic2 > topic4
Top > … > topic5 > topic4
b) There must be at least 5 different topics with at least 20 documents
in each.
c) Each document should contain certain minimum amount of text. This
may be measured with the number of words (excluding stopwords and punctuation
marks). For example, this minimum could be 200 words.
2. Feature extraction and data preparation
At this stage the web documents are represented
by feature vectors, which in turn are used to form a training data set
for the Machine Learning stage. To complete this use the Weka system and
follow the directions provided in section Exercises of DMW, Chapter 1 (
free
download from Wiley).
Deliverable: ARFF data files containing the feature vectors
for all web documents collected at stage 1. It is recommended that students
prepare several files by using different approaches to feature extraction,
for example, one with Boolean attributes and one with numeric ones created
by applying the TFIDF approach. Versions of the data sets with different
number of attributes can be also prepared.
3. Machine Learning Stage
At this stage Machine Learning algorithms are used
to create models of the data sets. These models are then used for two purposes.
Firstly, the accuracy of the initial topic structure is evaluated and secondly,
new web documents are classified into existing topics. For both purposes
we use the Weka system. The ML stage of consists of the following steps:

Preprocessing of the web document data. Load the ARFF files created at
project stage 2, verify their consistency and get some statistics by using
the preprocess panel.

Using the Weka’s decision tree algorithm (J48) examine the decision trees
generated with different data sets. Which are the most important terms
for each data set (the terms appearing on the top of the tree)? How do
they change with changing the data set? Check also the classification accuracy
and the confusion matrix obtained with 10fold cross validation and find
out which topic is best represented by the decision tree.

Use the Naïve Bayes and Nearest Neighbor (IBk) algorithms and compare
their classification accuracy and confusion matrices obtained with 10fold
cross validation with the ones produces by the decision tree. Which ones
are better? Why?

Run the Weka clustering algorithms (kmeans, EM and Cobweb) ignoring the
class attribute (document topic) on all data sets. Evaluate the obtained
clusterings by comparing them to the original set of topics or to the topic
hierarchy (when using Cobweb). Use also the formal method, classes to clusters
evaluation, provided by Weka. For more details of clustering with Weka
see http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMiningEx3.html.

New web document classifications. Get web documents from the same subject
areas (topics), but not belonging to the original set of documents prepared
in project stage 1. Get also documents from different topics. Apply feature
extraction and create ARFF files each one representing one document. Then
using the Weka test set option classify the new documents. Compare their
original topic with the one predicted by Weka. For the classification experiments
use the guidelines provided in http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMiningEx2.doc.
Deliverable: This stage of the project requires writing a
report on the experiments performed. The report should include detailed
description of the experiments (input data, Weka outputs), answers to the
questions, and interpretation and analysis of the results with respect
to the original problem stated in the project, web document classification.
Intelligent Web Browser Project
Introduction
The Web searches provide large amount of information about the web users.
Data mining techniques can be used to analyze this information and create
user profiles or identify user preferences. A key application of this approach
is in marketing and offering personalized services, an area referred to
as "data gold rush". This project will focus on use of machine learning
approaches to create models of web users. Students will collect web pages
from web searches or by running a web crawler and label them according
to user preferences. The labeled pages will be then encoded as feature
vectors and fed into the machine learning system. The later will produce
user models that may be used for improving the efficiency of web searches
or identifying users.
Project description
Similarly to the web document classification project this project is split
into three major parts/stages  data collection, feature extraction and
machine learning (mining). At the data collection and feature extraction
stages web pages (documents) are collected and represented as feature vectors.
The important difference with the document classification project is that
the documents are mapped into users (not topic categories). At the machine
learning algorithms stage various learning algorithms are applied to the
feature vectors in order to create models of the users that these vectors
(documents) are mapped onto. Then the models can be used to filter out
web documents returned by searches so that the users can get more focused
information from the search engines. In this way users can also be identified
by their preferences and new users classified accordingly. Hereafter we
describe briefly the project stages.
1. Collecting sets of web documents grouped by users' preference
The purpose of this stage is to collect a set of web documents labeled
with user preferences. This can be done in the following way: A user performs
web searches with simple keyword search, just browses the web or examines
a set of pages collected by a web crawler. To each web document the user
assigns a label representing whether or not the document is interesting
to the user. As in the web document classification project some restrictions
apply: (1) The number of web pages should be greater than the number of
selected features (stage 2). (2) The web pages should have sufficient text
content so that they could be well described by feature vectors.
2. Feature extraction and data preparation
This stage is very similar to the one described in the Web Document Classification
project. By using the Weka filters Boolean or numeric values are calculated
for each web document and the corresponding feature vector is created.
Finally the vectors are included in the ARFF file to be used by WEKA. Note
that at this last step the vectors are extended with class labels (for
example, interesting/noninteresting or +/) according to the user preferences.
As in the in the web document classification project the outcome of
this stage is an ARFF data file containing the feature vectors for all
web documents collected at stage 1. It is recommended that students prepare
several files by using different approaches to feature extraction  Boolean
attributes, numeric attributes (using the TFIDF approach) and with different
number of terms. The idea is to do more experiments with different data
sets and different ML algorithms in order to find the best user model.
3. Machine Learning Stage
At this stage the approaches and experiments are similar to those described
in the Web Document Classification project with an important difference
in the last step where the machine learning models are used. This step
can be called web document filtering (focusing the search) and can be described
as follows: Collect a number of web documents using one of the approaches
suggested in project stage 1. Apply feature extraction and create an ARFF
test file with one data row for each document. Then using the training
set prepared in stage 2 and the Weka's test set option classify the new
documents. Each one will get a corresponding label (interesting/noninteresting
or +/). Then simply discard the noninteresting documents and present
the interesting ones to the user. Further, this step can be incorporated
into a web browser, so that it automatically labels all web pages as interesting/noninteresting
according to the user preferences.