CS 580 - Web Mining


Classes: TR 6:45 pm - 8:00 pm, Maria Sanford Hall, Room 210
Instructor: Dr. Zdravko Markov, MS 203, (860)-832-2711, http://www.cs.ccsu.edu/~markov/, e-mail: markovz@ccsu.edu
Office hours:  MW 5:00pm-6:45pm, TR 8:00pm-9:00pm, or by appointment

Description: The Web is the largest collection of electronically accessible documents, which makes it the richest source of information in the world. The problem with the Web is that this information is not well structured or organized, so it cannot be easily retrieved. Search engines help in accessing web documents by keywords, but this is still far from what we need in order to use the knowledge available on the Web effectively. Machine Learning and Data Mining approaches go further and try to extract knowledge from the raw data available on the Web by organizing web pages into well-defined structures or by looking for patterns in the activities of Web users. These are the challenges of the area of Web Mining. This course focuses on extracting knowledge from the Web by applying Machine Learning techniques for classification and clustering of hypertext documents. Basic approaches from the area of Information Retrieval and text analysis are also discussed. Students implement practical applications for creating topic directories and customizing software.

Prerequisites: CS 501 and CS 502, basic knowledge of algebra, discrete math and statistics.

Course Objectives

  • Introduce students to the basic concepts and techniques of information retrieval and web search, and to Data Mining and Machine Learning methods for extracting knowledge from the Web.
  • Develop skills in using recent data mining software to solve practical Web Mining problems.
  • Gain experience in independent study and research.

    Required text: Soumen Chakrabarti, Mining the Web - Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2003, ISBN 1-55860-754-4.

    Recommended text: Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999, ISBN 1-55860-552-5. Chapter 8 (Nuts and Bolts: Machine Learning Algorithms in Java) is available online from the Weka 3 homepage.

    Required software: Weka 3 Data Mining System, a free machine learning software package written in Java.

    Tentative list of topics (will be further elaborated)

    1. Introduction
    2. Crawling the Web
    3. Information Retrieval and Web Search
    4. Analyzing the Web structure
    5. Clustering approaches for Web Mining
    6. Classification approaches for Web Mining
    7. Projects

    Analyzing the Web structure

    1. Reading

    2. The structure of the Web

    3. Social networks