CS545 - ML for Data Mining / CS407 - Machine Learning
Summer 2024 (online asynchronous)
Classes: Online Asynchronous, May 28 - Jul 22, 2024
Instructor: Dr. Zdravko Markov, 30307 Maria Sanford Hall, (860)-832-2711,
http://www.cs.ccsu.edu/~markov/,
e-mail: markovz at ccsu dot edu
Office hours: TBA, via Blackboard Collaborate
Prerequisites by topic
- Programming Fundamentals and Data Structures
- Basic Statistics
- Linear Algebra
- Discrete Mathematics
Required Textbook
Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher Pal. Data
Mining: Practical Machine Learning Tools and Techniques (Fourth Edition),
Morgan Kaufmann, January 2017, ISBN 978-0-12-804291-5.
Required Software
The Weka Workbench - open-source machine learning software available at https://ml.cms.waikato.ac.nz/weka/index.html
Course Goals
- Introduce students to the basic concepts and techniques of Data Mining and Machine Learning.
- Develop skills in using recent Machine Learning software to solve practical problems.
- Gain experience in independent study and research.
Grading Policies
Grading will be based on eight assignments (75%), one quiz (10%) and class
participation through scheduled discussions (15%). The letter grade will be determined by the following
grading scale:
A      | A-    | B+    | B     | B-    | C+    | C     | C-    | D+    | D     | D-    | F
94-100 | 90-93 | 87-89 | 84-86 | 80-83 | 77-79 | 74-76 | 70-73 | 67-69 | 64-66 | 60-63 | 0-59
Late assignments will be marked down one letter grade for every 3 days
they are late. All students are expected to conduct themselves honestly
and NEVER claim work that is not their own. Violating this policy will
result in a substantial grade penalty or a final grade of F.
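As an illustration of how the weights and the grading scale above combine, here is a minimal sketch in Python. It is not an official grade calculator. It assumes each component (assignments, quiz, participation) is scored on a 0-100 scale, and that a fractional final score is rounded half up to the nearest whole point before the scale is applied; the syllabus does not specify a rounding rule. The names `WEIGHTS`, `SCALE`, `final_score` and `letter_grade` are hypothetical, and the late penalty is not modeled.

```python
import math

# Component weights from the grading policy (percent of the final grade).
WEIGHTS = {"assignments": 75, "quiz": 10, "participation": 15}

# Lower bound of each letter band, highest first, taken from the grading scale.
SCALE = [
    (94, "A"), (90, "A-"), (87, "B+"), (84, "B"), (80, "B-"), (77, "C+"),
    (74, "C"), (70, "C-"), (67, "D+"), (64, "D"), (60, "D-"), (0, "F"),
]


def final_score(component_scores):
    """Weighted average of the component scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a 0-100 score to a letter using the scale above."""
    # Assumption (not stated in the syllabus): round half up to a whole point.
    points = math.floor(score + 0.5)
    for lower_bound, letter in SCALE:
        if points >= lower_bound:
            return letter
    return "F"


example = {"assignments": 90, "quiz": 80, "participation": 100}
score = final_score(example)  # (75*90 + 10*80 + 15*100) / 100
print(score, letter_grade(score))
```

With these example scores the weighted total is 90.5, which falls in the A- band under the rounding assumption stated above.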
Course Content (12 units)
- Introduction
  - What is data mining?
  - Related technologies - Machine Learning, DBMS, OLAP, Statistics
  - Data Mining Goals
  - Stages of the Data Mining Process
  - Data Mining Techniques
  - Knowledge Representation Methods
  - Applications
  - Example: weather data
- Data Warehouse and OLAP
  - Data Warehouse and DBMS
  - Multidimensional data model
  - OLAP operations
  - Example: loan data set
- Data preprocessing
  - Data cleaning
  - Data transformation
  - Data reduction
  - Discretization and generating concept hierarchies
  - Installing Weka 3 Data Mining System
  - Experiments with Weka - filters, discretization
- Data mining knowledge representation
  - Task relevant data
  - Background knowledge
  - Interestingness measures
  - Representing input data and output knowledge
  - Visualization techniques
  - Experiments with Weka - visualization
- Attribute-oriented analysis
  - Attribute generalization
  - Attribute relevance
  - Class comparison
  - Statistical measures
  - Experiments with Weka - using filters and statistics
- Learning Association rules
  - Motivation and terminology
  - Example: weather data
  - Basic idea: item sets
  - Generating item sets and rules efficiently
  - Correlation analysis
  - Experiments with Weka - mining association rules
- Machine Learning algorithms: Classification
  - Basic learning/mining tasks
  - Inferring rudimentary rules: 1R algorithm
  - Decision trees
  - Covering rules
  - Experiments with Weka - decision trees, rules
- Machine Learning algorithms: Prediction
  - The prediction task
  - Statistical (Bayesian) classification
  - Bayesian networks
  - Instance-based methods (nearest neighbor)
  - Linear models
  - Experiments with Weka - Prediction
- Evaluating what's been learned
  - Basic issues
  - Training and testing
  - Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
  - Combining multiple models (bagging, boosting, stacking)
  - Minimum Description Length Principle (MDL)
  - Experiments with Weka - training and testing
- Mining real data
  - Preprocessing data from a real medical domain (310 patients with Hepatitis C).
  - Applying various data mining techniques to create a comprehensive and accurate model of the data.
- Machine Learning algorithms: Clustering
  - Basic issues in clustering
  - First conceptual clustering system: Cluster/2
  - Partitioning methods: k-means, expectation maximization (EM)
  - Hierarchical methods: distance-based agglomerative and divisive clustering
  - Conceptual clustering: Cobweb
  - Experiments with Weka - k-means, EM, Cobweb
- Text and Web Mining
  - Basics of Information Retrieval
  - TFIDF representation of text documents
  - Text classification