CS580 - Data Mining, Summer 2018 (online)
Course Information

Course title: Data Mining

Course number: CS 580

Course description: Data Mining studies algorithms and computational
paradigms that allow computers to find patterns and regularities in databases,
perform prediction and forecasting, and generally improve their performance
through interaction with data. It is currently regarded as the key element
of a more general process called Knowledge Discovery that deals with extracting
useful knowledge from raw data. The knowledge discovery process includes
data selection, cleaning, coding, using different statistical and machine
learning techniques, and visualization of the generated structures. The
course will cover all these issues and will illustrate the whole process
with examples. Special emphasis will be given to Machine Learning methods,
as they provide the real knowledge discovery tools. Important related technologies,
such as data warehousing and online analytical processing (OLAP), will also
be discussed. Students will use recent Data Mining software. Enrollment
in this course is limited to 15 students.

Course dates: May 29, 2018 - Jul 23, 2018
Instructor Information

Name: Zdravko Markov

Email: markovz@ccsu.edu

URL: http://www.cs.ccsu.edu/~markov/

Office location: 303 Maria Sanford Hall, CCSU

Phone: (860) 832-2711

Biography: Dr. Zdravko Markov has an M.S. in Mathematics and Computer
Science and a Ph.D. in Artificial Intelligence. He has been teaching and
doing research in the area of Machine Learning for more than 20 years.
Recently he has been working on unsupervised learning algorithms for attribute
selection and clustering based on the Minimum Description Length Principle.
Dr. Markov has published 4 books and more than 60 research papers in
conference proceedings and journals. His most recent book (coauthored
with Daniel Larose), "Data Mining The Web: Uncovering Patterns in Web
Content, Structure, and Usage", was published by Wiley in 2007.
Dr. Markov teaches courses in the areas of Programming, Computer Architecture,
Machine Learning, Data and Web Mining.
Textbook

Required reading: Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher Pal. Data
Mining: Practical Machine Learning Tools and Techniques (Fourth Edition),
Morgan Kaufmann, January 2017, ISBN 9780128042915.

Required Software: Weka 3: Data Mining System with Free Open Source
Machine Learning Software in Java. Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html
Course Goals

Introduce students to the basic concepts and techniques of Data Mining.

Develop skills in using recent data mining software to solve practical
problems.

Gain experience in doing independent study and research.
Grading Policies
Grading will be based on eight assignments (75%), one quiz (10%) and class
participation through three scheduled discussions (15%). The maximum course
total is 1000 points. The letter grade will be determined by the following
grading scale:
Grade   Points
A       950-1000
A-      900-940
B+      870-890
B       840-860
B-      800-830
C+      770-790
C       740-760
C-      700-730
D+      670-690
D       640-660
D-      600-630
F       0-590

Late assignments will be marked one letter grade down for each 3 days
they are late. It is expected that all students will conduct themselves
in an honest manner and NEVER claim work which is not their own. Violating
this policy will result in a substantial grade penalty or a final grade
of F.
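
As an illustration only, the weighting and grading scale above can be sketched in a few lines of Python. This is not an official grade calculator. The function names are hypothetical, component scores are assumed to be given as percentages (0-100), and totals that fall between two listed ranges (for example 945) are assumed to receive the lower grade. Late penalties are not modeled.

```python
# Lower bound of each grade band, taken from the grading scale above.
# Assumption: a total between two listed bands (e.g. 945) gets the lower grade.
GRADE_FLOORS = [
    (950, "A"), (900, "A-"), (870, "B+"), (840, "B"), (800, "B-"),
    (770, "C+"), (740, "C"), (700, "C-"), (670, "D+"), (640, "D"),
    (600, "D-"), (0, "F"),
]


def course_total(assignment_pct, quiz_pct, discussion_pct):
    """Combine component scores (each 0-100) into a total out of 1000 points.

    Weights follow the syllabus: assignments 75%, quiz 10%, discussions 15%.
    """
    weighted = 75 * assignment_pct + 10 * quiz_pct + 15 * discussion_pct
    return round(weighted / 10)


def letter_grade(total):
    """Map a course total (0-1000 points) to a letter grade."""
    if not 0 <= total <= 1000:
        raise ValueError("total must be between 0 and 1000")
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter


if __name__ == "__main__":
    total = course_total(assignment_pct=90, quiz_pct=80, discussion_pct=100)
    print(total, letter_grade(total))
```

For example, 90% on the assignments, 80% on the quiz, and full discussion credit gives 675 + 80 + 150 = 905 points, which falls in the A- range.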
Course Content (12 units)

Introduction to Data Mining

What is data mining?

Related technologies - Machine Learning, DBMS, OLAP, Statistics

Data Mining Goals

Stages of the Data Mining Process

Data Mining Techniques

Knowledge Representation Methods

Applications

Example: weather data

Data Warehouse and OLAP

Data Warehouse and DBMS

Multidimensional data model

OLAP operations

Example: loan data set

Data preprocessing

Data cleaning

Data transformation

Data reduction

Discretization and generating concept hierarchies

Installing Weka 3 Data Mining System

Experiments with Weka - filters, discretization

Data mining knowledge representation

Task relevant data

Background knowledge

Interestingness measures

Representing input data and output knowledge

Visualization techniques

Experiments with Weka - visualization

Attribute-oriented analysis

Attribute generalization

Attribute relevance

Class comparison

Statistical measures

Experiments with Weka - using filters and statistics

Data mining algorithms: Association rules

Motivation and terminology

Example: mining weather data

Basic idea: item sets

Generating item sets and rules efficiently

Correlation analysis

Experiments with Weka - mining association rules

Data mining algorithms: Classification

Basic learning/mining tasks

Inferring rudimentary rules: 1R algorithm

Decision trees

Covering rules

Experiments with Weka - decision trees, rules

Data mining algorithms: Prediction

The prediction task

Statistical (Bayesian) classification

Bayesian networks

Instance-based methods (nearest neighbor)

Linear models

Experiments with Weka - prediction

Evaluating what's been learned

Basic issues

Training and testing

Estimating classifier accuracy (holdout, cross-validation, leave-one-out)

Combining multiple models (bagging, boosting, stacking)

Minimum Description Length Principle (MDL)

Experiments with Weka - training and testing

Mining real data

Preprocessing data from a real medical domain (310 patients with Hepatitis
C).

Applying various data mining techniques to create a comprehensive and accurate
model of the data.

Clustering

Basic issues in clustering

First conceptual clustering system: Cluster/2

Partitioning methods: k-means, expectation maximization (EM)

Hierarchical methods: distance-based agglomerative and divisive clustering

Conceptual clustering: Cobweb

Experiments with Weka - k-means, EM, Cobweb

Text and Web Mining

Basics of Information Retrieval

TF-IDF representation of text documents

Text classification