CS570 - Topics in AI: Machine Learning

Summer/2003 (May 27 - Jun 26)

Classes: MTWR 5:30 pm - 7:30 pm, Frank J. DiLoreto Hall 012
Instructor: Dr. Zdravko Markov, MS 203, (860)-832-2711, http://www.cs.ccsu.edu/~markov/, e-mail: markovz@ccsu.edu
Office hours: MTWR 7:30pm - 8:30pm, or by appointment

Description: One of the many definitions of Machine Learning (ML) is "Any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population" (Simon, 1983). Practically, this means developing computer programs that automatically improve their performance through experience. The course covers the basic concepts and techniques of Machine Learning from both theoretical and practical perspectives. The material includes classical ML approaches such as Version Spaces and Decision Trees, newer approaches such as Inductive Logic Programming and the Minimum Description Length (MDL) principle, as well as "hot" topics such as Knowledge Discovery and Data Mining. The students will be able to experiment with implementations of almost all algorithms discussed in class and will use a recent ML system to solve practical problems.

Prerequisites: CS 501 and CS 502, basic knowledge of algebra, discrete math and statistics.

Course Objectives

  • To introduce students to the basic concepts and techniques of Machine Learning.
  • To develop skills of using recent machine learning software for solving practical problems.
  • To gain experience of doing independent study and research.

Required texts: T. Mitchell, Machine Learning, McGraw-Hill, 1997 (referenced in the schedule below as "Mitchell").

Required software: Zprolog (updated on 11/7/2001) - a free DOS window-based Prolog interpreter.

    Assignments and projects: The students will use a set of ML programs written in Prolog and a simple Prolog interpreter to do experiments and to complete their assignments. There will be 5 projects requiring independent study and practical work with the ML programs.

    Grading and grading policy: The final grade will be based 90% on projects and 10% on participation in class discussions. The letter grades will be calculated according to the following table:
     
    A       A-      B+      B       B-      C+      C       C-      D+      D       D-      F
    95-100  90-94   87-89   84-86   80-83   77-79   74-76   70-73   67-69   64-66   60-63   0-59

    Late assignments will be marked one letter grade down for every 3 days they are late. It is expected that all students will conduct themselves in an honest manner and NEVER claim work that is not their own. Violating this policy will result in a substantial grade penalty or a final grade of F.

    WEB resources

    1. Journal of Machine Learning Research
    2. Journal of Machine Learning Gossip (ML humor)
    3. Machine Learning Database Repository at UC Irvine
    4. Machine Learning subject index maintained by the Knowledge Systems Laboratory, Canada
    5. MLnet Online Information Service
    6. ML Information Services maintained by the Austrian Research Institute for Artificial Intelligence (OFAI)
    7. David Aha's list of machine learning resources
    8. Avrim Blum's Machine Learning Page
    9. The AutoClass Project
    10. UCI - Machine Learning information, software and databases
    11. UTCS Machine Learning Research Group
    12. Microsoft Bayesian Network Editor (MSBNx)
    13. Machine Learning Journal Online
    14. Weka 3 -- Machine Learning Software in Java
    15. Journal of AI Research
    16. Journal of Data Mining and Knowledge Discovery
    17. C5/See5
    18. MLC++, A Machine Learning Library in C++
    19. Web->KB project
    20. KD Nuggets Directory
    21. KDNet

    Tentative schedule of classes and assignments
    1. Introduction
    2. Inductive learning
    3. Languages for learning
    4. Project 1: Representing examples and hypotheses. Due date: 6/2/2003
    5. Version space learning
    6. Induction of Decision Trees
    7. Covering strategies
    8. Searching the generalization/specialization graph
    9. Project 2: Basic concept learning methods. Due date: 6/10/2003
    10. Inductive Logic Programming
    11. Evaluating hypotheses
    12. Project 3: Evaluating hypotheses. Due date: 6/16/2003
    13. Bayesian learning
    14. Bayesian Belief Networks
    15. Instance-based learning
    16. Project 4: Prediction. Due date: 6/23/2003
    17. Unsupervised Learning
    18. Project 5: Clustering. Due date: 6/26/2003
    19. Analytical (Explanation-based) learning
    20. Artificial Neural Networks. Reading: Mitchell - Chapter 4. Lecture slides: Mitchell - Chapter 4.
    21. Genetic Algorithms. Reading: Mitchell - Chapter 9. Lecture slides: Mitchell - Chapter 9.
    22. Reinforcement Learning. Reading: Mitchell - Chapter 13. Lecture slides: Mitchell - Chapter 13.
    23. Knowledge Discovery and Data Mining:

    Inferring rudimentary rules

    1. OneR: learns a one-level decision tree, i.e. it generates a set of rules that all test one particular attribute. Basic version (assuming nominal attributes; a code sketch follows this section):

       For each attribute,
         for each value of that attribute, make a rule as follows:
           count how often each class appears,
           find the most frequent class,
           make the rule assign that class to this attribute-value;
         calculate the error rate of the rules.
       Choose the attribute whose rules give the smallest total error.
    2. Example: evaluating the weather attributes
     
    outlook   temperature  humidity  windy  play
    sunny     hot          high      false  no
    sunny     hot          high      true   no
    overcast  hot          high      false  yes
    rainy     mild         high      false  yes
    rainy     cool         normal    false  yes
    rainy     cool         normal    true   no
    overcast  cool         normal    true   yes
    sunny     mild         high      false  no
    sunny     cool         normal    false  yes
    rainy     mild         normal    false  yes
    sunny     mild         normal    true   yes
    overcast  mild         high      true   yes
    overcast  hot          normal    false  yes
    rainy     mild         high      true   no

     
    Attribute    Rules            Errors  Total errors
    outlook      sunny -> no      2/5     4/14
                 overcast -> yes  0/4
                 rainy -> yes     2/5
    temperature  hot -> no        2/4     5/14
                 mild -> yes      2/6
                 cool -> yes      1/4
    humidity     high -> no       3/7     4/14
                 normal -> yes    1/7
    windy        false -> yes     2/8     5/14
                 true -> no       3/6
    3. Dealing with numeric attributes
    4. Discussion of OneR
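
    The following is a minimal OneR sketch in Python. The course software is Prolog-based, so this illustration (including the helper name one_r) is only an assumption for exposition. Run on the 14 weather instances above, it reproduces the total-error column of the table: 4/14 for outlook and humidity, 5/14 for temperature and windy.

    from collections import Counter

    # The 14 weather instances: (outlook, temperature, humidity, windy, play)
    data = [
        ("sunny",    "hot",  "high",   "false", "no"),
        ("sunny",    "hot",  "high",   "true",  "no"),
        ("overcast", "hot",  "high",   "false", "yes"),
        ("rainy",    "mild", "high",   "false", "yes"),
        ("rainy",    "cool", "normal", "false", "yes"),
        ("rainy",    "cool", "normal", "true",  "no"),
        ("overcast", "cool", "normal", "true",  "yes"),
        ("sunny",    "mild", "high",   "false", "no"),
        ("sunny",    "cool", "normal", "false", "yes"),
        ("rainy",    "mild", "normal", "false", "yes"),
        ("sunny",    "mild", "normal", "true",  "yes"),
        ("overcast", "mild", "high",   "true",  "yes"),
        ("overcast", "hot",  "normal", "false", "yes"),
        ("rainy",    "mild", "high",   "true",  "no"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    def one_r(data, attributes):
        best = None
        for i, attr in enumerate(attributes):
            rules, errors = {}, 0
            for v in sorted(set(row[i] for row in data)):
                # Predict the most frequent class among instances with this value.
                classes = Counter(row[-1] for row in data if row[i] == v)
                majority, hits = classes.most_common(1)[0]
                rules[v] = majority
                errors += sum(classes.values()) - hits
            print(attr, rules, "total errors:", f"{errors}/{len(data)}")
            # Keep the attribute with the fewest misclassified instances.
            if best is None or errors < best[0]:
                best = (errors, attr, rules)
        return best

    print("chosen attribute:", one_r(data, attributes)[1])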


    Covering algorithms

    1. General strategy: for each class, find a rule set that covers all instances in the class while excluding the instances not in it. This is called a covering approach because at each stage a rule is identified that covers some of the instances.
    2. General to specific rule induction (the PRISM algorithm): start with an empty rule for a class and repeatedly add the attribute-value condition with the highest accuracy p/t until the rule covers only instances of that class; then remove the covered instances and repeat (see the sketch after this list).
    3. Example: covering class "play=yes" in weather data.
    4. Specific to general rule induction: start with a most specific rule built from a seed example and generalize it step by step.
    5. Problems: rule overlapping, rule subsumption.
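
    As a sketch of the general-to-specific covering loop in item 2, here is a hedged PRISM-style implementation in Python; it reuses data and attributes from the OneR sketch above, and the function name prism is an illustrative assumption, not the course's actual code. For class "play=yes" its first rule is outlook=overcast (accuracy 4/4); it then removes the covered instances and continues.

    def prism(data, attributes, target):
        # Learn a rule set (each rule is a list of (attr_index, value)
        # conditions) covering every `target` instance and no others.
        rules, remaining = [], list(data)
        while any(row[-1] == target for row in remaining):
            rule, covered = [], remaining
            # Specialize: add the condition with the best accuracy p/t
            # until the rule covers only instances of the target class.
            while any(row[-1] != target for row in covered):
                best = None
                for i, _ in enumerate(attributes):
                    if any(i == j for j, _ in rule):
                        continue  # use each attribute at most once per rule
                    for v in set(row[i] for row in covered):
                        subset = [row for row in covered if row[i] == v]
                        p = sum(1 for row in subset if row[-1] == target)
                        # Maximize accuracy p/t; break ties by coverage p.
                        if p and (best is None or (p / len(subset), p) > best[0]):
                            best = ((p / len(subset), p), (i, v))
                if best is None:
                    break  # no attributes left to test
                i, v = best[1]
                rule.append((i, v))
                covered = [row for row in covered if row[i] == v]
            rules.append(rule)
            # Remove the instances covered by the finished rule and repeat.
            remaining = [row for row in remaining if row not in covered]
        return rules

    for rule in prism(data, attributes, "yes"):
        print("IF " + " AND ".join(f"{attributes[i]}={v}" for i, v in rule) + " THEN play=yes")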

    Last updated: 6-5-2003