CS580-Data Mining: Syllabus

Syllabus: CS580 - Data Mining, Summer 2014

Course Information


Course title:	Topics: Data Mining
Course number:	CS 580
Course description:	Data Mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, coding, the use of different statistical and machine learning techniques, and visualization of the generated structures. The course will cover all these issues and will illustrate the whole process with examples. Special emphasis will be given to Machine Learning methods, as they provide the real knowledge discovery tools. Important related technologies, such as data warehousing and on-line analytical processing (OLAP), will also be discussed. Students will use recent Data Mining software. Enrollment in this course is limited to 15 students.
Course dates:	May 27, 2014 - Jul 21, 2014

Instructor Information


Name:	Zdravko Markov
Email:	markovz@ccsu.edu
Office location:	303 Maria Sanford Hall, CCSU
Phone:	(860) 832-2711; fax (860) 832-2712
Biography:	Dr. Zdravko Markov has an M.S. in Mathematics and Computer Science and a Ph.D. in Artificial Intelligence. He has been teaching and doing research in the area of Machine Learning for more than 15 years. Recently he developed a novel approach to conceptual clustering and is studying its application to Data Mining tasks. Dr. Markov has published 4 textbooks and more than 50 research papers in conference proceedings and journals. His most recent book (co-authored with Daniel Larose) is "Data Mining The Web: Uncovering Patterns in Web Content, Structure, and Usage", published by Wiley in 2007. Dr. Markov's CCSU courses are in the areas of Computer Architecture and Design, Computing and Communication technology, Machine Learning, Data and Web Mining.

Textbook


Required reading:	Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition), Morgan Kaufmann, 2005, ISBN: 0-12-088407-0.

Required Software


Weka 3 :	Data Mining System with Free Open Source Machine Learning Software in Java. Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html

Course Goals


	To introduce students to the basic concepts and techniques of Data Mining. To develop skills in using recent data mining software to solve practical problems. To gain experience in independent study and research.

Grading Policies

Grading will be based on six assignments (60%), two quizzes (20%) and class participation through four scheduled discussions (20%). Assignments and quizzes will be graded on a 100-point scale, and discussions on a 50-point scale. Thus the maximum course total will be 1000 points (6 x 100 + 2 x 100 + 4 x 50).

The letter grade will be determined by the following grading scale:


A	A-	B+	B	B-	C+	C	C-	D+	D	D-	F
950-1000	900-940	870-890	840-860	800-830	770-790	740-760	700-730	670-690	640-660	600-630	0-590
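
As a rough illustration of the arithmetic above, the following Python sketch adds up the points and looks up a letter grade. The function names are hypothetical and are not part of any course tool. The published bands leave gaps (for example, 941-949 is not listed), so the sketch assumes that the lower bound of each band acts as a cutoff; the instructor's actual handling of such totals may differ.

```python
# Point scales taken from the grading policy.
ASSIGNMENT_MAX = 100   # six assignments
QUIZ_MAX = 100         # two quizzes
DISCUSSION_MAX = 50    # four scheduled discussions

# Lower bound of each band in the grading scale, highest first.
# ASSUMPTION: a total that falls in a gap between bands (e.g. 945)
# receives the grade of the band just below it.
CUTOFFS = [
    (950, "A"), (900, "A-"), (870, "B+"), (840, "B"), (800, "B-"),
    (770, "C+"), (740, "C"), (700, "C-"), (670, "D+"), (640, "D"),
    (600, "D-"), (0, "F"),
]


def course_total(assignments, quizzes, discussions):
    """Sum the raw points after checking counts and per-item maxima."""
    groups = [
        (assignments, 6, ASSIGNMENT_MAX),
        (quizzes, 2, QUIZ_MAX),
        (discussions, 4, DISCUSSION_MAX),
    ]
    total = 0
    for scores, count, maximum in groups:
        if len(scores) != count:
            raise ValueError(f"expected {count} scores, got {len(scores)}")
        if any(s < 0 or s > maximum for s in scores):
            raise ValueError(f"scores must be between 0 and {maximum}")
        total += sum(scores)
    return total


def letter_grade(total):
    """Return the letter for the first cutoff the total reaches."""
    if total < 0 or total > 1000:
        raise ValueError("total must be between 0 and 1000")
    for cutoff, letter in CUTOFFS:
        if total >= cutoff:
            return letter


if __name__ == "__main__":
    total = course_total([90] * 6, [85, 80], [45] * 4)
    print(total, letter_grade(total))
```

The late-assignment penalty is not modeled here, since the syllabus states it in letter grades rather than points.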

Late assignments will be marked down one letter grade for every 3 days they are late.

It is expected that all students will conduct themselves in an honest manner and NEVER claim work which is not their own. Violating this policy will result in a substantial grade penalty or a final grade of F.

Course Content (12 units)


1. Introduction to Data Mining
	What is data mining?
	Related technologies - Machine Learning, DBMS, OLAP, Statistics
	Data Mining Goals
	Stages of the Data Mining Process
	Data Mining Techniques
	Knowledge Representation Methods
	Applications
	Example: weather data

2. Data Warehouse and OLAP
	Data Warehouse and DBMS
	Multidimensional data model
	OLAP operations
	Example: loan data set

3. Data preprocessing
	Data cleaning
	Data transformation
	Data reduction
	Discretization and generating concept hierarchies
	Installing Weka 3 Data Mining System
	Experiments with Weka - filters, discretization

4. Data mining knowledge representation
	Task relevant data
	Background knowledge
	Interestingness measures
	Representing input data and output knowledge
	Visualization techniques
	Experiments with Weka - visualization

5. Attribute-oriented analysis
	Attribute generalization
	Attribute relevance
	Class comparison
	Statistical measures
	Experiments with Weka - using filters and statistics

6. Data mining algorithms: Association rules
	Motivation and terminology
	Example: mining weather data
	Basic idea: item sets
	Generating item sets and rules efficiently
	Correlation analysis
	Experiments with Weka - mining association rules

7. Data mining algorithms: Classification
	Basic learning/mining tasks
	Inferring rudimentary rules: 1R algorithm
	Decision trees
	Covering rules
	Experiments with Weka - decision trees, rules

8. Data mining algorithms: Prediction
	The prediction task
	Statistical (Bayesian) classification
	Bayesian networks
	Instance-based methods (nearest neighbor)
	Linear models
	Experiments with Weka - prediction

9. Evaluating what's been learned
	Basic issues
	Training and testing
	Estimating classifier accuracy (holdout, cross-validation, leave-one-out)
	Combining multiple models (bagging, boosting, stacking)
	Minimum Description Length Principle (MDL)
	Experiments with Weka - training and testing

10. Mining real data
	Preprocessing data from a real medical domain (310 patients with Hepatitis C)
	Applying various data mining techniques to create a comprehensive and accurate model of the data

11. Clustering
	Basic issues in clustering
	First conceptual clustering system: Cluster/2
	Partitioning methods: k-means, expectation maximization (EM)
	Hierarchical methods: distance-based agglomerative and divisive clustering
	Conceptual clustering: Cobweb
	Experiments with Weka - k-means, EM, Cobweb

12. Advanced techniques, Data Mining software and applications
	Text mining: extracting attributes (keywords), structural approaches (parsing, soft parsing)
	Bayesian approach to classifying text
	Web mining: classifying web pages, extracting knowledge from the web
	Data Mining software and applications