CS580 - Data Mining, Summer 2023 (online)
Course Information
Course title: Data Mining
Course number: CS 580
Course description: Data Mining studies algorithms and computational
paradigms that allow computers to find patterns and regularities in data,
perform prediction and forecasting, and generally improve their performance
through interaction with data. It is currently regarded as the key element
of a more general process called Knowledge Discovery that deals with extracting
useful knowledge from raw data. The knowledge discovery process includes
data selection, cleaning, coding, using different statistical and machine
learning techniques, and visualization of the generated structures. The
course will cover all these issues and will illustrate the whole process
with examples. Special emphasis will be given to Machine Learning methods,
as they provide the real knowledge discovery tools. Related technologies,
such as data warehousing and on-line analytical processing (OLAP), will also
be discussed. Students will use recent Data Mining software. Enrollment
in this course is limited to 15 students.
Course dates: May 30, 2023 - Jul 24, 2023
Instructor Information
Required reading: Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. Data
Mining: Practical Machine Learning Tools and Techniques (Fourth Edition),
Morgan Kaufmann, January 2017, ISBN 978-0-12-804291-5.
Required Software: Weka 3: Data Mining System with Free Open Source
Machine Learning Software in Java. Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html
Course Goals
Introduce students to the basic concepts and techniques of Data Mining.
Develop skills in using recent data mining software for solving practical problems.
Gain experience in doing independent study and research.
Grading Policies
Grading will be based on eight assignments (75%), one quiz (10%) and class
participation through three scheduled discussions (15%). The maximum course
total is 1000 points. The letter grade will be determined by the following
grading scale:
Late assignments will be marked down one letter grade for every 3 days
they are late. It is expected that all students will conduct themselves
in an honest manner and NEVER claim work which is not their own. Violating
this policy will result in a substantial grade penalty or a final grade
of F.
Course Content (12 units)
Introduction to Data Mining
What is data mining?
Related technologies - Machine Learning, DBMS, OLAP, Statistics
Data Mining Goals
Stages of the Data Mining Process
Data Mining Techniques
Knowledge Representation Methods
Example: weather data
Data Warehouse and OLAP
Data Warehouse and DBMS
Multidimensional data model
OLAP operations
Example: loan data set
Data preprocessing
Data cleaning
Data transformation
Data reduction
Discretization and generating concept hierarchies
Installing Weka 3 Data Mining System
Experiments with Weka - filters, discretization
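As an informal illustration of the discretization topic in this unit, the following is a minimal Python sketch of equal-width binning. It is not Weka's discretization filter; the function name equal_width_bins and the sample readings are hypothetical and chosen only for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 85]  # hypothetical readings
bins = equal_width_bins(temperatures, 3)
```

With three bins, the range 64 to 85 is cut into intervals of width 7, so each reading is replaced by the index of the interval it falls into.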
Data mining knowledge representation
Task relevant data
Background knowledge
Interestingness measures
Representing input data and output knowledge
Visualization techniques
Experiments with Weka - visualization
Attribute-oriented analysis
Attribute generalization
Attribute relevance
Class comparison
Statistical measures
Experiments with Weka - using filters and statistics
Data mining algorithms: Association rules
Motivation and terminology
Example: mining weather data
Basic idea: item sets
Generating item sets and rules efficiently
Correlation analysis
Experiments with Weka - mining association rules
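The item set idea in this unit can be sketched in a few lines of Python. This is a brute-force enumeration under stated assumptions, not the efficient generation procedure covered in the unit and not Weka's implementation; the transactions and the function names (support, frequent_itemsets, confidence) are hypothetical.

```python
from itertools import combinations

transactions = [  # hypothetical market baskets
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Enumerate all item sets with support >= min_support (brute force over all subsets)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


frequent = frequent_itemsets(transactions, min_support=0.6)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule "bread implies milk" has support 3/5 and confidence (3/5) / (4/5) = 0.75. Enumerating every subset is only feasible for a handful of items, which is why efficient item set generation is a topic of its own.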
Data mining algorithms: Classification
Basic learning/mining tasks
Inferring rudimentary rules: 1R algorithm
Decision trees
Covering rules
Experiments with Weka - decision trees, rules
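To give a feel for the 1R algorithm listed in this unit, here is a minimal Python sketch for nominal attributes. It assumes the 14-instance nominal weather data as commonly reproduced from the required textbook (check the rows against the book before relying on them); the function name one_r is hypothetical, and missing values and numeric attributes are not handled.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Nominal weather data (assumed transcription); the last column is the class "play".
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rule, errors) for the one-attribute rule set with fewest training errors."""
    best = None
    for i, name in enumerate(attributes):
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[i]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the instances that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Strict comparison keeps the earlier attribute when error counts tie.
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


attribute, rule, errors = one_r(WEATHER, ATTRIBUTES)
```

On this data, outlook and humidity each make 4 errors out of 14, and the sketch keeps the first of the two.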
Data mining algorithms: Prediction
The prediction task
Statistical (Bayesian) classification
Bayesian networks
Instance-based methods (nearest neighbor)
Linear models
Experiments with Weka - Prediction
Evaluating what's been learned
Basic issues
Training and testing
Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
Combining multiple models (bagging, boosting, stacking)
Minimum Description Length Principle (MDL)
Experiments with Weka - training and testing
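The accuracy estimation schemes named in this unit can be pictured with a small Python sketch of k-fold cross-validation. It evaluates only a majority-class baseline, uses unshuffled folds for reproducibility, and is not Weka's evaluation code; the labels and the function names are hypothetical.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of nearly equal size (no shuffling)."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds


def majority_class(labels):
    """Baseline 'classifier': always predict the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_val_accuracy(labels, k):
    """Estimate the accuracy of the majority-class baseline with k-fold cross-validation."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        # Train only on the instances outside the current fold.
        train = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = majority_class(train)
        correct += sum(labels[i] == prediction for i in test_idx)
    return correct / len(labels)


labels = ["yes"] * 9 + ["no"] * 5  # hypothetical class labels
accuracy = cross_val_accuracy(labels, k=7)
leave_one_out = cross_val_accuracy(labels, k=len(labels))
```

Setting k equal to the number of instances gives leave-one-out, while a single train/test split corresponds to the holdout method.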
Mining real data
Preprocessing data from a real medical domain (310 patients with Hepatitis
Applying various data mining techniques to create a comprehensive and accurate
model of the data.
Clustering
Basic issues in clustering
First conceptual clustering system: Cluster/2
Partitioning methods: k-means, expectation maximization (EM)
Hierarchical methods: distance-based agglomerative and divisive clustering
Conceptual clustering: Cobweb
Experiments with Weka - k-means, EM, Cobweb
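As a rough picture of the k-means partitioning method listed in this unit, the following Python sketch alternates the assignment and update steps on 2-D points. Starting the centroids at the first k points is a simplifying assumption made here for reproducibility; the function name k_means and the sample points are hypothetical, and this is not Weka's implementation.

```python
def k_means(points, k, iterations=100):
    """Basic k-means on 2-D points; centroids start at the first k points (a simplification)."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the procedure has converged.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, assignment


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]  # hypothetical
centroids, assignment = k_means(points, k=2)
```

Both initial centroids fall in the same group of points here, yet the two well-separated groups are recovered after a few iterations. In general the result depends on the starting centroids.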
Text and Web Mining
Basics of Information Retrieval
TF-IDF representation of text documents
Text classification
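The TF-IDF representation mentioned in this unit can be sketched as follows. The sketch assumes one common weighting variant (term frequency as count divided by document length, inverse document frequency as the natural log of N / df) and whitespace tokenization; other variants exist. The function name tf_idf and the mini-corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with tf = count / length and idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [  # hypothetical mini-corpus
    "data mining finds patterns in data",
    "text mining applies data mining to documents",
    "web pages link to other web pages",
]
vectors = tf_idf(docs)
```

Terms that are frequent in one document but rare in the collection receive the largest weights, and a term that occurs in every document gets weight zero under this variant.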