CS570 - Topics in AI: Machine Learning
Summer/2003 (May 27 - Jun 26)
Classes: MTWR 5:30 pm - 7:30 pm, Frank J. DiLoreto Hall 012
Instructor: Dr. Zdravko Markov, MS 203, (860)-832-2711, http://www.cs.ccsu.edu/~markov/,
e-mail: markovz@ccsu.edu
Office hours: MTWR 7:30pm - 8:30pm, or by appointment
Description: One of the many definitions of Machine Learning
(ML) is "Any change in a system that allows it to perform better the
second time on repetition of the same task or on another task drawn
from the same population" (Simon, 1983). In practice, this means developing
computer programs that automatically improve their performance through
experience. The course covers the basic concepts and techniques of
Machine Learning from both theoretical and practical perspectives. The material
includes classical ML approaches such as Version Spaces and Decision Trees,
newer approaches such as Inductive Logic Programming and the Minimum Description Length
Principle (MDL), as well as "hot" topics such as Knowledge Discovery and Data
Mining. Students will be able to experiment with implementations of
almost all algorithms discussed in class and will use a recent ML system
to solve practical problems.
Prerequisites: CS 501 and CS 502, basic knowledge of algebra,
discrete math and statistics.
Course Objectives
To introduce students to the basic concepts and techniques of Machine Learning.
To develop skills in using recent machine learning software to solve
practical problems.
To gain experience in independent study and research.
Required Texts:
Required software: Zprolog
(updated on 11/7/2001) - a free DOS window-based Prolog interpreter.
Assignments and projects: The students will use a set of ML programs
written in Prolog and a simple Prolog interpreter to do experiments and
to complete their assignments. There will be 5 projects requiring independent
study and practical work with the ML programs.
Grading and grading policy: The final grade will be based 90%
on projects and 10% on participation in class discussions. The letter grades
will be calculated according to the following table:
A | A- | B+ | B | B- | C+ | C | C- | D+ | D | D- | F
95-100 | 90-94 | 87-89 | 84-86 | 80-83 | 77-79 | 74-76 | 70-73 | 67-69 | 64-66 | 60-63 | 0-59
Late assignments will be marked one letter grade down for each 3 days
they are late. It is expected that all students will conduct themselves
in an honest manner and NEVER claim work which is not their own. Violating
this policy will result in a substantial grade penalty or a final grade
of F.
WEB resources
Journal of Machine Learning Research
Journal of Machine Learning Gossip (ML humor)
Machine Learning Database Repository at UC Irvine
Machine Learning subject index maintained by the Knowledge Systems Laboratory, Canada
MLnet Online Information Service
ML Information Services maintained by the Austrian Research Institute for Artificial Intelligence
Aha's list of machine learning resources
Avrim Blum's Machine Learning
AutoClass Project
UCI - Machine Learning information, software and databases
UTCS Machine Learning Research
Microsoft Bayesian Network Editor (MSBNx)
Machine Learning Journal Online
Weka 3 -- Machine Learning Software in Java
Journal of AI Research
of Data Mining and Knowledge Discovery
MLC++, A Machine Learning Library in C++
KD Nuggets Directory
Tentative schedule of classes and assignments
Inductive learning
Introducing basic concepts by example - learning semantic networks
General setting for induction (concept learning) - languages, orderings,
generalization and specialization operators, structure of the hypothesis space
Reading: Markov - Chapter 2
Lecture slides: Markov - Chapter 2
Program: arch.pl
Languages for learning
Propositional languages (attribute-value) - attribute types, syntactic
and semantic covering relations, least general generalization (lgg)
Relational languages - Logic Programming, syntax and semantics of logic
Reading: Markov - Chapter 3, Mitchell - Sections 10.4, 10.7.1
Lecture slides: Markov - Chapter 3
Program: covering.pl
Data: animals.pl,
Lab experiments 1
Project 1: Representing examples
and hypotheses. Due date: 06/2/2003
Version space learning
Induction of Decision Trees
Representing disjunctive concepts as trees and rules
Building a decision tree
Information-based heuristic for attribute selection
Learning from noisy data
Avoiding overfitting and tree pruning
One-level decision tree: OneR
Reading: Mitchell - Chapter 3, Markov - Chapter 5, OneR
Lecture slides: Mitchell - Chapter 3
Program: id3.pl
Data: animals.pl,
Lab experiments 3
Covering strategies
Searching the generalization/specialization graph
Project 2: Basic concept learning
methods. Due date: 6/10/2003
Inductive Logic Programming
General setting for ILP
Term ordering
Ordering Horn clauses - theta subsumption, subsumption under implication
Inverse resolution - V and W operators
Illustrative examples
Basic strategies for solving the ILP problem
Reading: Mitchell - Section 10.6, 10.7, Markov - Chapter 8
Lecture slides: Mitchell - Chapter 10
Evaluating hypotheses
Project 3: Evaluating hypotheses. Due
date: 6/16/2003
Bayesian learning
Bayesian Belief Networks
Instance-based learning
Project 4: Prediction. Due
date: 6/23/2003
Unsupervised Learning
Project 5: Clustering.
Due date: 6/26/2003
Analytical (Explanation-based) learning
Generalization-based (inductive) learning vs. improving a system's knowledge
and/or performance (analytical learning).
Explanation-based learning (EBL)
EBL task: find an effective definition of the target concept given a domain
theory, a training example and operationality criteria.
Perfect domain theories (Prolog-EBG)
EBL as speed-up learning or knowledge reformulation (partial evaluation,
unfolding; newly inferred rules belong to the deductive closure of the
domain theory).
Advantages of EBL: inferring good heuristics, refining incomplete or incorrect
theories, integration of EBL and inductive learning.
Reading: Mitchell - Chapter 11, Markov - Chapter 11
Lecture slides: Mitchell - Chapter 11
Programs: ebl.pl
Lab experiments 11:
Perfect domain theories (Prolog-EBG)
Artificial Neural Networks. Reading: Mitchell - Chapter 4. Lecture slides:
Mitchell - Chapter 4.
Genetic Algorithms. Reading: Mitchell - Chapter 9. Lecture slides:
Mitchell - Chapter 9.
Reinforcement Learning. Reading: Mitchell - Chapter 13. Lecture slides:
Mitchell - Chapter 13.
Knowledge Discovery and Data Mining:
Inferring rudimentary rules
OneR: learns a one-level decision tree, i.e. generates a set of rules that
test one particular attribute. Basic version (assuming nominal attributes):
One branch for each of the attribute’s values
Each branch assigns most frequent class
Error rate: proportion of instances that don't belong to the majority class
of their corresponding branch
Choose attribute with lowest error rate
Example: evaluating the weather attributes
outlook | temperature | humidity | windy | play
sunny | hot | high | false | no
sunny | hot | high | true | no
overcast | hot | high | false | yes
rainy | mild | high | false | yes
rainy | cool | normal | false | yes
rainy | cool | normal | true | no
overcast | cool | normal | true | yes
sunny | mild | high | false | no
sunny | cool | normal | false | yes
rainy | mild | normal | false | yes
sunny | mild | normal | true | yes
overcast | mild | high | true | yes
overcast | hot | normal | false | yes
rainy | mild | high | true | no
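Attribute | Rules | Errors | Total errors
outlook | sunny -> no | 2/5 | 4/14
 | overcast -> yes | 0/4 |
 | rainy -> yes | 2/5 |
temperature | hot -> no | 2/4 | 5/14
 | mild -> yes | 2/6 |
 | cool -> yes | 1/4 |
humidity | high -> no | 3/7 | 4/14
 | normal -> yes | 1/7 |
windy | false -> yes | 2/8 | 5/14
 | true -> no | 3/6 |
(For temperature=hot and windy=true the two classes are tied, so the predicted class is an arbitrary choice.)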
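The error counts for the weather data can be reproduced with a few lines of code. The following is a minimal Python sketch of the basic nominal-attribute version of OneR, not the course's Prolog programs; the names ATTRIBUTES, WEATHER and one_r are made up for this illustration. Outlook and humidity tie at 4 errors, and this sketch simply keeps the first one in attribute order.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# The 14 weather instances; the last column is the class (play).
WEATHER = [line.split() for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]

def one_r(rows, attributes):
    """Return (best attribute, {attribute: (rules, total errors)})."""
    table = {}
    for i, attr in enumerate(attributes):
        # Class counts for each value of this attribute (one branch per value).
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[i]][row[-1]] += 1
        # Each branch predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        table[attr] = (rules, errors)
    # Lowest total error wins; min() keeps the first attribute on ties.
    best = min(attributes, key=lambda a: table[a][1])
    return best, table

best, table = one_r(WEATHER, ATTRIBUTES)
for attr in ATTRIBUTES:
    rules, errors = table[attr]
    print(attr, rules, f"{errors}/{len(WEATHER)}")
print("chosen attribute:", best)
```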
Dealing with numeric attributes
Discretizing numeric attributes: error-based discretization.
Instances are sorted according to attribute’s values
Breakpoints are placed where the (majority) class changes (so that the
total error is minimized)
Example: discretizing temperature (error = 1/14).
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
Problem: overfitting (little generalization, difficult to apply
to new data). Solution: enforce minimum number of instances in the majority
class per interval.
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
Now error = ?
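The error of a given discretization can be counted mechanically. The Python sketch below is an illustration only: it does not search for breakpoints, it just scores a partition that is already chosen. The names interval_errors, TEMPS, PLAY and FINE are hypothetical; each partition is described by the indices at which a new interval starts in the sorted list.

```python
from collections import Counter

# Sorted temperature values and the class of each instance, as listed above.
TEMPS = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
PLAY = "yes no yes yes yes no no yes yes yes no yes yes no".split()

def interval_errors(values, classes, starts):
    """Return (low, high, majority class, errors) for each interval.

    starts holds the index of the first instance of every interval
    except the first one.
    """
    edges = [0, *starts, len(values)]
    report = []
    for lo, hi in zip(edges, edges[1:]):
        majority, hits = Counter(classes[lo:hi]).most_common(1)[0]
        report.append((values[lo], values[hi - 1], majority, (hi - lo) - hits))
    return report

# The first partition above: a breakpoint wherever the majority class changes.
FINE = [1, 2, 5, 8, 10, 11, 13]
for low, high, majority, errors in interval_errors(TEMPS, PLAY, FINE):
    print(f"{low}-{high}: {majority}, errors={errors}")
print("total errors:", sum(e for *_, e in interval_errors(TEMPS, PLAY, FINE)))
```

Calling interval_errors with the interval starts of the coarser partition is one way to check an answer to the question above.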
Discussion of OneR
OneR was described in a paper by Holte (1993)
Contains an experimental evaluation on 16 datasets (using cross-validation,
so that results were representative of performance on future data)
Minimum number of instances (for discretization) was set to 6 after some
experimentation (minBucketSize parameter in Weka).
OneR’s simple rules performed not much worse than much more complex decision
trees.
Simplicity first pays off! (Occam's Razor).
Covering algorithms
General strategy: for each class find a rule set that covers all instances
in it (excluding instances not in the class). This approach is called a
covering approach because at each stage a rule is identified that covers some of
the instances.
General to specific rule induction (PRISM algorithm):
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    While R covers instances from classes other than C do:
      For each attribute A not mentioned in R, and each value v,
        Consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy, i.e. (# of instances from C)/(total
      # of instances covered by R). For same accuracies choose the condition
      providing the largest coverage.
      Add (A = v) to R
    Remove the instances covered by R from E
Example: covering class "play=yes" in weather data.
Rule 1: If {outlook=overcast} Then play=yes (error=0/4; covered 4 out of 9)
Rule 2: If {humidity=normal, windy=false} Then play=yes (error=0/3; covered 3 out of 5)
Rule 3: If {temp=mild, humidity=normal} Then play=yes (error=0/1; covered 1 out of 2)
Rule 4: If {outlook=rainy, temp=mild, windy=false} Then play=yes (error=0/1; covered 1 out of 1)
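The general-to-specific procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the course's Prolog implementation: the names ATTRS, DATA, covers and prism are hypothetical, and ties that remain after the coverage rule are broken by iteration order. On the weather data it finds the first three rules listed above for play=yes; because of the tie-breaking, the last rule it finds can differ from Rule 4.

```python
from fractions import Fraction

ATTRS = ["outlook", "temp", "humidity", "windy"]
DATA = [dict(zip(ATTRS + ["play"], line.split())) for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]

def covers(rule, inst):
    """A rule is a dict of attribute=value conditions."""
    return all(inst[a] == v for a, v in rule.items())

def prism(instances, attrs, target, cls):
    """Learn a list of rules that together cover every instance of class cls."""
    rules = []
    remaining = list(instances)                  # E in the pseudocode
    while any(x[target] == cls for x in remaining):
        rule, covered = {}, remaining            # empty left-hand side
        # Specialize while the rule still covers other classes.
        while any(x[target] != cls for x in covered) and len(rule) < len(attrs):
            best = None
            for a in attrs:
                if a in rule:
                    continue
                for v in dict.fromkeys(x[a] for x in covered):
                    subset = [x for x in covered if x[a] == v]
                    hits = sum(x[target] == cls for x in subset)
                    # Accuracy first, then coverage as the tie-breaker.
                    score = (Fraction(hits, len(subset)), len(subset))
                    if best is None or score > best[0]:
                        best = (score, a, v, subset)
            _, a, v, covered = best
            rule[a] = v
        rules.append(rule)
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules

rules = prism(DATA, ATTRS, "play", "yes")
for r in rules:
    print("If", r, "Then play=yes")
```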
Specific to general rule induction:
Pick an instance and generalize it by repeatedly dropping conditions.
Stop when all further generalizations lead to covering instances from other
classes. Save the generalized instance as a rule R.
Remove all instances covered by R and continue until all instances are
covered.
When dropping conditions choose the ones that maximize rule coverage.
Problems: rule overlapping, rule subsumption.
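A minimal Python sketch of the specific-to-general strategy on the weather data is given below. It is one possible reading of the steps above, with hypothetical names (ATTRS, DATA, covers, generalize, specific_to_general): the seed is always the first uncovered instance, and a condition is dropped only if the resulting rule covers no instance of another class.

```python
ATTRS = ["outlook", "temp", "humidity", "windy"]
DATA = [dict(zip(ATTRS + ["play"], line.split())) for line in """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
""".strip().splitlines()]

def covers(rule, inst):
    return all(inst[a] == v for a, v in rule.items())

def generalize(seed, instances, attrs, target):
    """Drop conditions from the seed while no other class gets covered."""
    cls = seed[target]
    rule = {a: seed[a] for a in attrs}
    while rule:
        best = None
        for a in rule:
            candidate = {k: v for k, v in rule.items() if k != a}
            covered = [x for x in instances if covers(candidate, x)]
            # Keep only generalizations that stay inside the seed's class,
            # preferring the one with the largest coverage.
            if all(x[target] == cls for x in covered):
                if best is None or len(covered) > best[0]:
                    best = (len(covered), candidate)
        if best is None:
            break                                # every further drop is too general
        rule = best[1]
    return rule

def specific_to_general(instances, attrs, target):
    rules, remaining = [], list(instances)
    while remaining:
        seed = remaining[0]
        rule = generalize(seed, instances, attrs, target)
        rules.append((rule, seed[target]))
        remaining = [x for x in remaining if not covers(rule, x)]
    return rules

for rule, cls in specific_to_general(DATA, ATTRS, "play"):
    print("If", rule, "Then play =", cls)
```

Rules produced this way can overlap or subsume one another, which is the problem noted above.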
Last updated: 6-5-2003