CS570 - Topics in AI: Machine Learning
Summer/2003 (May 27 - Jun 26)
Classes: MTWR 5:30 pm - 7:30 pm, Frank J. DiLoreto Hall 012
Instructor: Dr. Zdravko Markov, MS 203, (860)-832-2711, http://www.cs.ccsu.edu/~markov/,
e-mail: markovz@ccsu.edu
Office hours: MTWR 7:30pm - 8:30pm, or by appointment
Description: One of the many definitions of Machine Learning
(ML) is "Any change in a system that allows it to perform better the
second time on repetition of the same task or on another task drawn
from the same population" (Simon, 1983). Practically, this means developing
computer programs that automatically improve their performance through
experience. The course covers the basic concepts and techniques of
Machine Learning from both a theoretical and a practical perspective. The material
includes classical ML approaches such as Version Spaces and Decision Trees,
newer approaches such as Inductive Logic Programming and the Minimum Description
Length (MDL) principle, as well as "hot" topics such as Knowledge Discovery and
Data Mining. Students will be able to experiment with implementations of
almost all algorithms discussed in class and will use a recent ML system
to solve practical problems.
Prerequisites: CS 501 and CS 502; basic knowledge of algebra,
discrete math, and statistics.
Course Objectives
To introduce students to the basic concepts and techniques of Machine Learning.
To develop skills in using recent machine learning software for solving
practical problems.
To gain experience in independent study and research.
Required Texts:
Required software: Zprolog
(updated on 11/7/2001) - a free DOS window-based Prolog interpreter.
Assignments and projects: The students will use a set of ML programs
written in Prolog and a simple Prolog interpreter to do experiments and
to complete their assignments. There will be 5 projects requiring independent
study and practical work with the ML programs.
Grading and grading policy: The final grade will be based 90%
on projects and 10% on participation in class discussions. The letter grades
will be calculated according to the following table:
Grade | Range
A     | 95-100
A-    | 90-94
B+    | 87-89
B     | 84-86
B-    | 80-83
C+    | 77-79
C     | 74-76
C-    | 70-73
D+    | 67-69
D     | 64-66
D-    | 60-63
F     | 0-59
Late assignments will be marked one letter grade down for every 3 days
they are late. It is expected that all students will conduct themselves
in an honest manner and NEVER claim work that is not their own. Violating
this policy will result in a substantial grade penalty or a final grade
of F.
WEB resources
- Journal of Machine Learning Research
- Journal of Machine Learning Gossip (ML humor)
- Machine Learning Database Repository at UC Irvine
- Machine Learning subject index maintained by the Knowledge Systems Laboratory, Canada
- MLnet Online Information Service
- ML Information Services maintained by the Austrian Research Institute for Artificial Intelligence (OFAI)
- David Aha's list of machine learning resources
- Avrim Blum's Machine Learning Page
- The AutoClass Project
- UCI - Machine Learning information, software and databases
- UTCS Machine Learning Research Group
- Microsoft Bayesian Network Editor (MSBNx)
- Machine Learning Journal Online
- Weka 3 -- Machine Learning Software in Java
- Journal of AI Research
- Journal of Data Mining and Knowledge Discovery
- C5/See5
- MLC++, A Machine Learning Library in C++
- Web->KB project
- KD Nuggets Directory
- KDNet
Tentative schedule of classes and assignments
- Introduction
- Inductive learning
  - Topics
    - Introducing basic concepts by example - learning semantic networks
    - General setting for induction (concept learning) - languages, orderings, generalization and specialization operators, structure of the hypothesis space
  - Reading: Markov - Chapter 2
  - Lecture slides: Markov - Chapter 2
  - Program: arch.pl
  - Data: archdata.pl
- Languages for learning
  - Topics
    - Propositional languages (attribute-value) - attribute types, syntactic and semantic covering relations, least general generalization (lgg)
    - Relational languages - Logic Programming, syntax and semantics of logic programs
    - Prolog
  - Reading: Markov - Chapter 3, Mitchell - Sections 10.4, 10.7.1
  - Lecture slides: Markov - Chapter 3
  - Program: covering.pl
  - Data: animals.pl, monks.pl, shapes.pl, taxonomy.pl
  - Lab experiments 1
- Project 1: Representing examples and hypotheses. Due date: 06/2/2003
- Version space learning
- Induction of Decision Trees
  - Topics
    - Representing disjunctive concepts as trees and rules
    - Building a decision tree
    - Information-based heuristic for attribute selection
    - Learning from noisy data
    - Avoiding overfitting and tree pruning
    - One-level decision tree: OneR
  - Reading: Mitchell - Chapter 3, Markov - Chapter 5, OneR
  - Lecture slides: Mitchell - Chapter 3
  - Program: id3.pl
  - Data: animals.pl, loandata.pl
  - Lab experiments 3
- Covering strategies
- Searching the generalization/specialization graph
- Project 2: Basic concept learning methods. Due date: 6/10/2003
- Inductive Logic Programming
  - Topics
    - General setting for ILP
    - Term ordering
    - Ordering Horn clauses - theta subsumption, subsumption under implication
    - Inverse resolution - V and W operators
    - Illustrative examples
    - Basic strategies for solving the ILP problem
  - Reading: Mitchell - Sections 10.6, 10.7, Markov - Chapter 8
  - Lecture slides: Mitchell - Chapter 10
- Evaluating hypotheses
- Project 3: Evaluating hypotheses. Due date: 6/16/2003
- Bayesian learning
- Bayesian Belief Networks
- Instance-based learning
- Project 4: Prediction. Due date: 6/23/2003
- Unsupervised Learning
- Project 5: Clustering. Due date: 6/26/2003
- Analytical (Explanation-based) learning
  - Topics
    - Generalization-based (inductive) learning vs. improving the system's knowledge and/or performance (analytical learning)
    - Explanation-based learning (EBL)
    - The EBL task: find an effective definition of the target concept given a domain theory, a training example, and an operationality criterion
    - Perfect domain theories (Prolog-EBG)
    - EBL as speed-up learning or knowledge reformulation (partial evaluation, unfolding; newly inferred rules belong to the deductive closure of the theory)
    - Advantages of EBL: inferring good heuristics, refining incomplete or incorrect theories, integration of EBL and inductive learning
  - Reading: Mitchell - Chapter 11, Markov - Chapter 11
  - Lecture slides: Mitchell - Chapter 11
  - Programs: ebl.pl
  - Data: dtheory1.pl, dtheory2.pl
  - Lab experiments 11: Perfect domain theories (Prolog-EBG)
- Artificial Neural Networks. Reading: Mitchell - Chapter 4. Lecture slides: Mitchell - Chapter 4.
- Genetic Algorithms. Reading: Mitchell - Chapter 9. Lecture slides: Mitchell - Chapter 9.
- Reinforcement Learning. Reading: Mitchell - Chapter 13. Lecture slides: Mitchell - Chapter 13.
- Knowledge Discovery and Data Mining: Inferring rudimentary rules
  - OneR learns a one-level decision tree, i.e. it generates a set of rules that all test one particular attribute. Basic version (assuming nominal attributes; a small code sketch follows the evaluation table below):
    - One branch for each of the attribute's values
    - Each branch assigns the most frequent class
    - Error rate: the proportion of instances that do not belong to the majority class of their corresponding branch
    - Choose the attribute with the lowest error rate
  - Example: evaluating the weather attributes
    outlook  | temperature | humidity | windy | play
    ---------+-------------+----------+-------+-----
    sunny    | hot         | high     | false | no
    sunny    | hot         | high     | true  | no
    overcast | hot         | high     | false | yes
    rainy    | mild        | high     | false | yes
    rainy    | cool        | normal   | false | yes
    rainy    | cool        | normal   | true  | no
    overcast | cool        | normal   | true  | yes
    sunny    | mild        | high     | false | no
    sunny    | cool        | normal   | false | yes
    rainy    | mild        | normal   | false | yes
    sunny    | mild        | normal   | true  | yes
    overcast | mild        | high     | true  | yes
    overcast | hot         | normal   | false | yes
    rainy    | mild        | high     | true  | no
    Attribute   | Rules           | Errors | Total errors
    ------------+-----------------+--------+-------------
    outlook     | sunny -> no     | 2/5    |
                | overcast -> yes | 0/4    |
                | rainy -> yes    | 2/5    | 4/14
    temperature | hot -> no       | 2/4    |
                | mild -> yes     | 2/6    |
                | cool -> yes     | 1/4    | 5/14
    humidity    | high -> no      | 3/7    |
                | normal -> yes   | 1/7    | 4/14
    windy       | false -> yes    | 2/8    |
                | true -> no      | 3/5    | 5/14
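The following is a minimal Python 3 sketch of the basic OneR procedure described above, run on the 14 weather instances from the table. It is not one of the course's Prolog programs (such as id3.pl); the function and variable names are illustrative only.

from collections import Counter

# The 14 weather instances from the table above, as
# (outlook, temperature, humidity, windy, play) tuples.
ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def one_r(data, attrs):
    """Return (attribute, rules, errors) for the attribute whose one-level
    rule set makes the fewest errors on the training data."""
    best = None
    for i, attr in enumerate(attrs):
        # Count the classes separately for each value of this attribute.
        counts = {}
        for row in data:
            counts.setdefault(row[i], Counter())[row[-1]] += 1
        # Each value predicts its most frequent class; the rest are errors.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(DATA, ATTRS)
print(attr, rules, "%d/%d" % (errors, len(DATA)))
# outlook {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'} 4/14
# (humidity ties at 4/14; this sketch simply keeps the first minimum found)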
  - Dealing with numeric attributes
    - Discretizing numeric attributes: error-based discretization
    - Instances are sorted according to the attribute's values
    - Breakpoints are placed where the (majority) class changes (so that the total error is minimized)
    - Example: discretizing temperature (error = 1/14)
      64  65  68  69  70  71  72  72  75  75  80  81  83  85
      Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
    - Problem: overfitting (little generalization, difficult to apply to new data). Solution: enforce a minimum number of instances in the majority class per interval (a code sketch follows this list):
      64  65  68  69  70  71  72  72  75  75  80  81  83  85
      Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
    - Now error = ?
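A minimal Python sketch of this minimum-bucket-size discretization, assuming a minimum of 3 instances of the majority class per interval; with that assumption it reproduces the three intervals of the last partition above. The names and the exact stopping rule are illustrative, not the course's own code.

from collections import Counter

# Temperature values and classes from the example above, already sorted.
TEMPS = [(64, "yes"), (65, "no"), (68, "yes"), (69, "yes"), (70, "yes"),
         (71, "no"), (72, "no"), (72, "yes"), (75, "yes"), (75, "yes"),
         (80, "no"), (81, "yes"), (83, "yes"), (85, "no")]

def discretize(pairs, min_bucket=3):
    """Split sorted (value, class) pairs into intervals: grow an interval
    until some class occurs min_bucket times, then keep extending it while
    the next instance also has that majority class.  (This sketch ignores
    the extra rule that equal values must stay in the same interval.)"""
    intervals, i, n = [], 0, len(pairs)
    while i < n:
        counts, j = Counter(), i
        while j < n and max(counts.values(), default=0) < min_bucket:
            counts[pairs[j][1]] += 1
            j += 1
        majority = counts.most_common(1)[0][0]
        while j < n and pairs[j][1] == majority:
            j += 1
        intervals.append(pairs[i:j])
        i = j
    return intervals

total_error = 0
for interval in discretize(TEMPS):
    classes = Counter(c for _, c in interval)
    # Non-majority instances in each interval count as errors.
    total_error += len(interval) - classes.most_common(1)[0][1]
    print([c for _, c in interval])
print("error = %d/%d" % (total_error, len(TEMPS)))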
  - Discussion of OneR
    - OneR was described in a paper by Holte (1993)
    - The paper contains an experimental evaluation on 16 datasets (using cross-validation, so that the results were representative of performance on future data)
    - The minimum number of instances (for discretization) was set to 6 after some experimentation (the minBucketSize parameter in Weka)
    - OneR's simple rules performed not much worse than much more complex decision trees
    - Simplicity first pays off! (Occam's Razor)
Covering algorithms
  - General strategy: for each class, find a rule set that covers all instances in it (and excludes instances not in the class). This is called a covering approach because at each stage a rule is identified that "covers" some of the instances.
  - General-to-specific rule induction (the PRISM algorithm; a code sketch follows the example below):
    - For each class C:
      - Initialize E to the instance set
      - While E contains instances in class C:
        - Create a rule R with an empty left-hand side that predicts class C
        - While R covers instances from classes other than C, do:
          - For each attribute A not mentioned in R and each value v, consider adding the condition A = v to the left-hand side of R
          - Select A and v to maximize the accuracy, i.e. (# of instances from C) / (total # of instances covered by R); for equal accuracies, choose the condition providing the largest coverage
          - Add (A = v) to R
        - Remove the instances covered by R from E
  - Example: covering class "play=yes" in the weather data
    - Rule 1: If {outlook=overcast} Then play=yes (error=0/4; covered 4 out of 9)
    - Rule 2: If {humidity=normal, windy=false} Then play=yes (error=0/3; covered 3 out of 5)
    - Rule 3: If {temp=mild, humidity=normal} Then play=yes (error=0/1; covered 1 out of 2)
    - Rule 4: If {outlook=rainy, temp=mild, windy=false} Then play=yes (error=0/1; covered 1 out of 1)
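Below is a minimal Python 3 sketch of the PRISM loop above (not the course's covering.pl). It reuses the DATA and ATTRS weather tuples defined in the OneR sketch earlier; all names are illustrative. The first rule it finds for play=yes is If {outlook=overcast}, matching Rule 1 above; later rules may differ slightly from the list above depending on how ties are broken.

def prism_rules(data, attrs, cls):
    """Learn rules for class `cls` by general-to-specific covering.
    Each rule is grown by adding the condition with the highest accuracy
    (ties broken by coverage) until it covers only instances of `cls`."""
    E, rules = list(data), []
    while any(row[-1] == cls for row in E):
        rule, covered = {}, list(E)          # rule maps attribute index -> value
        while any(row[-1] != cls for row in covered):
            best = None                      # (accuracy, coverage, index, value)
            for i in range(len(attrs)):
                if i in rule:
                    continue
                for v in set(row[i] for row in covered):
                    hits = [row for row in covered if row[i] == v]
                    pos = sum(1 for row in hits if row[-1] == cls)
                    cand = (pos / len(hits), len(hits), i, v)
                    if best is None or cand[:2] > best[:2]:
                        best = cand
            if best is None:                 # no attribute left to test
                break
            _, _, i, v = best
            rule[i] = v
            covered = [row for row in covered if row[i] == v]
        rules.append({attrs[i]: v for i, v in rule.items()})
        # Remove every instance covered by the finished rule.
        E = [row for row in E if any(row[i] != v for i, v in rule.items())]
    return rules

for r in prism_rules(DATA, ATTRS, "yes"):    # DATA, ATTRS as in the OneR sketch
    print("If", r, "Then play=yes")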
  - Specific-to-general rule induction (a code sketch follows this list):
    - Pick an instance and generalize it by repeatedly dropping conditions. Stop when every further generalization would cover instances from other classes. Save the generalized instance as a rule.
    - Remove all instances covered by the rule and continue until all instances are covered.
    - When dropping conditions, choose the ones that maximize rule coverage.
    - Problems: rule overlapping, rule subsumption.
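A matching Python sketch of the specific-to-general strategy, again reusing DATA and ATTRS from the OneR sketch; the seed choice (first uncovered positive instance) and all names are illustrative assumptions, not the course's own method.

def drop_condition_rules(data, attrs, cls):
    """Specific-to-general covering: take an uncovered instance of `cls` as a
    maximally specific rule and drop conditions while no instance of another
    class becomes covered, preferring the drop that maximizes coverage."""
    positives = [row for row in data if row[-1] == cls]
    negatives = [row for row in data if row[-1] != cls]
    rules = []
    while positives:
        seed = positives[0]                              # simplest seed choice
        rule = {i: seed[i] for i in range(len(attrs))}   # all conditions of the seed
        while True:
            best = None                                  # (coverage, index to drop)
            for i in list(rule):
                trial = {j: v for j, v in rule.items() if j != i}
                # A drop is allowed only if the generalized rule stays consistent.
                if any(all(neg[j] == v for j, v in trial.items()) for neg in negatives):
                    continue
                cov = sum(1 for row in positives
                          if all(row[j] == v for j, v in trial.items()))
                if best is None or cov > best[0]:
                    best = (cov, i)
            if best is None:
                break
            del rule[best[1]]
        rules.append({attrs[i]: v for i, v in rule.items()})
        positives = [row for row in positives
                     if any(row[i] != v for i, v in rule.items())]
    return rules

for r in drop_condition_rules(DATA, ATTRS, "yes"):       # DATA, ATTRS as before
    print("If", r, "Then play=yes")
# Starting from the first "yes" instance, the first rule found is again
# If {'outlook': 'overcast'} Then play=yes; later rules depend on tie-breaking.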
Last updated: 6-5-2003