Machine Learning - Project 5

Machine Learning - Summer 2003

Project 5 - Clustering

Posted: 6/19/2003
Due: 6/26/2003

1. Mixture models. Consider the data set weather.pl, a version of the PlayTennis data in which temp and humidity are numeric attributes. Create mixture models for these two attributes (i.e., find the mean and standard deviation of each attribute for each class). Then use the mixture models, together with the discrete probabilities of the other (nominal) attributes, to predict the classification of the following new example:

[outlook = sunny, temp = 67, humidity = 50, wind = strong]

Include in your report: a detailed description of the approach you used.
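
As an illustration of the approach in question 1, the following is a minimal Python sketch that combines per-class Gaussian densities for numeric attributes with relative frequencies for nominal attributes. It is not the course's Prolog code. The function names (fit, score, predict) and the TOY data set with attributes x, color and label are hypothetical and are not taken from weather.pl. The sketch assumes the sample standard deviation (n - 1 denominator) and applies no smoothing to zero counts; either choice may differ from what the course expects.

```python
import math
from collections import Counter, defaultdict

def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x (std must be > 0)."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def fit(examples, numeric, nominal, target):
    """Estimate class priors, per-class Gaussian parameters and nominal value counts."""
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[target]].append(ex)
    model = {}
    for cls, rows in by_class.items():
        gauss = {}
        for a in numeric:
            vals = [r[a] for r in rows]
            mean = sum(vals) / len(vals)
            # Assumption: sample variance (n - 1); needs at least 2 rows per class.
            var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
            gauss[a] = (mean, math.sqrt(var))
        freq = {a: Counter(r[a] for r in rows) for a in nominal}
        model[cls] = {"prior": len(rows) / len(examples), "n": len(rows),
                      "gauss": gauss, "freq": freq}
    return model

def score(model, example):
    """Unnormalized score per class: prior times the product of attribute likelihoods."""
    scores = {}
    for cls, m in model.items():
        s = m["prior"]
        for a, (mean, std) in m["gauss"].items():
            s *= gaussian_pdf(example[a], mean, std)
        for a, counts in m["freq"].items():
            # No smoothing: an unseen value gives a likelihood of 0.
            s *= counts[example[a]] / m["n"]
        scores[cls] = s
    return scores

def predict(model, example):
    """Return the class with the highest score."""
    scores = score(model, example)
    return max(scores, key=scores.get)

# Hypothetical toy data, NOT the contents of weather.pl.
TOY = [
    {"x": 1.0, "color": "red", "label": "a"},
    {"x": 2.0, "color": "red", "label": "a"},
    {"x": 3.0, "color": "blue", "label": "a"},
    {"x": 8.0, "color": "blue", "label": "b"},
    {"x": 9.0, "color": "blue", "label": "b"},
    {"x": 10.0, "color": "red", "label": "b"},
]
toy_model = fit(TOY, numeric=["x"], nominal=["color"], target="label")
toy_prediction = predict(toy_model, {"x": 2.5, "color": "red"})
```

Dividing each score by the sum of all scores turns the values into class probabilities, which may be useful when describing the approach in the report.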

2. Hierarchical agglomerative clustering. For each of the algorithms min, max, lgg_m, and lgg_s (selected through the Mode parameter of cluster.pl), find the best clustering hierarchy for the loandata.pl data set. Vary the threshold parameter to obtain different clustering hierarchies, evaluate each one with error-based evaluation, and choose the hierarchy that minimizes the error in the top level partition. Include in your report: the best hierarchies, their errors, and the values of the mode and threshold parameters used in clustering.

3. Category utility and Cobweb.

Run cobweb.pl on the loandata.pl data set and calculate the error in the top level partition.
Calculate the category utility of the top level partition of the best clustering of loandata found in question 2.
Compare the two hierarchies (from cobweb.pl and from cluster.pl) and analyze their differences in terms of structure, error and category utility value for the top level partition.
Include in your report: all calculations and the analysis of the clustering results.
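
For reference, the category utility calculation in question 3 can be sketched in Python as follows. The sketch assumes the standard definition for nominal attributes, CU = (1/K) * sum over clusters C_k of P(C_k) * [sum of P(A = v | C_k)^2 - sum of P(A = v)^2], where K is the number of clusters. The function name category_utility and the input layout (a list of non-empty clusters, each a list of attribute dictionaries) are hypothetical, and cobweb.pl may compute or report the value differently.

```python
from collections import Counter

def category_utility(clusters, attributes):
    """Category utility of a partition over nominal attributes.

    clusters: list of non-empty clusters, each a list of dict examples.
    attributes: names of the nominal attributes to include.
    """
    all_rows = [row for cluster in clusters for row in cluster]
    n = len(all_rows)

    def sum_sq(rows):
        # Sum over attributes and values of P(A = v)^2, estimated from rows.
        total = 0.0
        for a in attributes:
            counts = Counter(r[a] for r in rows)
            total += sum((c / len(rows)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq(all_rows)  # expected score without any clustering
    gain = sum(len(c) / n * (sum_sq(c) - baseline) for c in clusters)
    return gain / len(clusters)
```

A partition that places all examples in one cluster has category utility 0, since the within-cluster probabilities then equal the overall probabilities.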

Hint: to calculate the error in a partition, assign the majority class to each cluster and count the misclassified examples across all clusters. The error is this count divided by the total number of examples.
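
The calculation described in the hint can be sketched in Python as follows. The function name partition_error and the input layout (one list of class labels per cluster) are hypothetical; the sketch only restates the hint and is not part of cluster.pl or cobweb.pl.

```python
from collections import Counter

def partition_error(clusters):
    """Error of a partition given as a list of clusters of class labels.

    Each cluster is labeled with its majority class; every example whose
    class differs from that label counts as misclassified.
    """
    total = sum(len(c) for c in clusters)
    misclassified = sum(
        len(c) - Counter(c).most_common(1)[0][1]  # cluster size minus majority count
        for c in clusters
        if c  # skip empty clusters
    )
    return misclassified / total
```

For example, a partition with clusters ["yes", "yes", "no"] and ["no", "no", "no", "yes"] has one misclassified example in each cluster, so its error is 2/7.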