Course Information 





Course title:

Topics:
Data Mining

Course number:

CS 580

Course description:

Data Mining studies algorithms and computational paradigms that allow computers
to find patterns and regularities in databases, perform prediction and
forecasting, and generally improve their performance through interaction
with data. It is currently regarded as the key element of a more general
process called Knowledge Discovery that deals with extracting useful knowledge
from raw data. The knowledge discovery process includes data selection,
cleaning, coding, the use of different statistical and machine learning
techniques, and visualization of the generated structures. The course will
cover all these issues and will illustrate the whole process with examples.
Special emphasis will be given to Machine Learning methods, as they provide
the real knowledge discovery tools. Important related technologies, such as
data warehousing and online analytical processing (OLAP), will also be
discussed. Students will use recent Data Mining software. Enrollment in this
course is limited to 15 students.

Course dates:

May 27, 2014 - Jul 21, 2014



Instructor Information 





Name:

Zdravko Markov

Email:

markovz@ccsu.edu

Office location:

303 Maria Sanford Hall, CCSU

Phone:

(860) 832-2711; fax (860) 832-2712

Biography:

Dr. Zdravko Markov has an M.S. in Mathematics and Computer Science and a Ph.D.
in Artificial Intelligence. He has been teaching and doing research in
the area of Machine Learning for more than 15 years. Recently he developed
a novel approach to conceptual clustering and is studying its application
to Data Mining tasks. Dr. Markov has published 4 textbooks and more than
50 research papers in conference proceedings and journals. His most recent
book (coauthored with Daniel Larose) is "Data Mining the Web: Uncovering
Patterns in Web Content, Structure, and Usage", published by Wiley in 2007.
Dr. Markov's CCSU courses are in the areas of Computer Architecture and
Design, Computing and Communication Technology, Machine Learning, and Data
and Web Mining.



Textbook 





Required reading:

Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools
and Techniques (Second Edition), Morgan Kaufmann, 2005,
ISBN: 0120884070.



Required Software 




Course Goals 







To introduce students to the basic concepts and techniques of Data Mining.

To develop skills in using recent data mining software to solve practical
problems.

To gain experience in doing independent study and research.



Grading Policies 






Grading will be based on six assignments (60%), two quizzes (20%), and class
participation through four scheduled discussions (20%). Assignments and
quizzes will be graded on a 100-point scale, discussions on a 50-point scale.
Thus the maximum course total will be 1000 points.

The letter grade will be determined by the following grading scale:

Letter grade    Points
A               950-1000
A-              900-940
B+              870-890
B               840-860
B-              800-830
C+              770-790
C               740-760
C-              700-730
D+              670-690
D               640-660
D-              600-630
F               0-590

Late assignments will be marked one letter grade down for each 3 days they are
late.

It is expected that all students will conduct themselves in an honest manner
and NEVER claim work which is not their own. Violating this policy will
result in a substantial grade penalty or a final grade of F.
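
As a worked illustration of the point arithmetic in the grading policy above, the following Python sketch adds up raw points and looks up a letter grade. It is an unofficial example, not part of the course policy: the function names and sample scores are hypothetical, and it assumes that a total falling between two listed ranges (for example 945) receives the lower grade, which the scale does not state.

```python
# Lower bound of each letter-grade range, read from the grading scale.
# Assumption: a total between two listed ranges gets the lower grade.
GRADE_FLOORS = [
    (950, "A"), (900, "A-"), (870, "B+"), (840, "B"), (800, "B-"),
    (770, "C+"), (740, "C"), (700, "C-"), (670, "D+"), (640, "D"),
    (600, "D-"), (0, "F"),
]


def course_total(assignments, quizzes, discussions):
    """Sum raw points: 6 assignments and 2 quizzes (100 each), 4 discussions (50 each)."""
    if (len(assignments), len(quizzes), len(discussions)) != (6, 2, 4):
        raise ValueError("expected 6 assignments, 2 quizzes and 4 discussions")
    return sum(assignments) + sum(quizzes) + sum(discussions)


def letter_grade(total):
    """Return the first grade whose lower bound the total reaches."""
    for floor, letter in GRADE_FLOORS:
        if total >= floor:
            return letter
    raise ValueError("total must be non-negative")


# Hypothetical scores, for illustration only.
total = course_total([90, 85, 100, 95, 80, 90], [88, 92], [45, 50, 40, 48])
print(total, letter_grade(total))  # 903 A-
```

The late-assignment penalty is not modeled here, since the policy does not say how a letter-grade reduction maps onto assignment points.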


Course Content (12 units) 






Introduction to Data Mining

What is data mining?

Related technologies - Machine Learning, DBMS, OLAP, Statistics

Data Mining Goals

Stages of the Data Mining Process

Data Mining Techniques

Knowledge Representation Methods

Applications

Example: weather data

Data Warehouse and OLAP

Data Warehouse and DBMS

Multidimensional data model

OLAP operations

Example: loan data set

Data preprocessing

Data cleaning

Data transformation

Data reduction

Discretization and generating concept hierarchies

Installing Weka 3 Data Mining System

Experiments with Weka - filters, discretization

Data mining knowledge representation

Task relevant data

Background knowledge

Interestingness measures

Representing input data and output knowledge

Visualization techniques

Experiments with Weka - visualization

Attribute-oriented analysis

Attribute generalization

Attribute relevance

Class comparison

Statistical measures

Experiments with Weka - using filters and statistics

Data mining algorithms: Association rules

Motivation and terminology

Example: mining weather data

Basic idea: item sets

Generating item sets and rules efficiently

Correlation analysis

Experiments with Weka - mining association rules

Data mining algorithms: Classification

Basic learning/mining tasks

Inferring rudimentary rules: 1R algorithm

Decision trees

Covering rules

Experiments with Weka - decision trees, rules

Data mining algorithms: Prediction

The prediction task

Statistical (Bayesian) classification

Bayesian networks

Instance-based methods (nearest neighbor)

Linear models

Experiments with Weka - prediction

Evaluating what's been learned

Basic issues

Training and testing

Estimating classifier accuracy (holdout, cross-validation, leave-one-out)

Combining multiple models (bagging, boosting, stacking)

Minimum Description Length Principle (MDL)

Experiments with Weka - training and testing

Mining real data

Preprocessing data from a real medical domain (310 patients with Hepatitis
C).

Applying various data mining techniques to create a comprehensive and accurate
model of the data.

Clustering

Basic issues in clustering

First conceptual clustering system: Cluster/2

Partitioning methods: k-means, expectation maximization (EM)

Hierarchical methods: distance-based agglomerative and divisive clustering

Conceptual clustering: Cobweb

Experiments with Weka - k-means, EM, Cobweb

Advanced techniques, Data Mining software and applications

Text mining: extracting attributes (keywords), structural approaches (parsing,
soft parsing).

Bayesian approach to classifying text

Web mining: classifying web pages, extracting knowledge from the web

Data Mining software and applications

