CS 580 Web Mining
Fall 2008
Classes: MW 6:45 pm - 8:00 pm, Room: Maria Sanford Hall 214
Instructor: Dr. Zdravko Markov, 303-07 Maria Sanford Hall, (860) 832-2711,
http://www.cs.ccsu.edu/~markov/,
email: markovz at ccsu dot edu
Office hours: TR 10:00 am - 12:30 pm, or by appointment
Description: The Web is the largest collection of electronically
accessible documents, which makes it the richest source of information
in the world. The problem with the Web is that this information is not
well structured and organized, so it cannot be easily retrieved. Search
engines help in accessing web documents by keywords, but this is still
far from what we need in order to use the knowledge available on the
Web effectively. Machine Learning and Data Mining approaches go further
and try to extract knowledge from the raw data available on the Web by
organizing web pages in well-defined structures or by looking into the
patterns of activity of Web users. These are the challenges of the area
of Web Mining. This course focuses on extracting knowledge from the web
by applying Machine Learning techniques for classification and
clustering of hypertext documents. Basic approaches from the area of
Information Retrieval and text analysis are also discussed. The
students use recent Machine Learning and Data Mining software to
implement practical applications for web document retrieval,
classification and clustering.
Prerequisites: CS 501 and CS 502, basic knowledge of algebra,
discrete math and statistics.
Course Objectives
Introduce students to the basic concepts and techniques of Information
Retrieval, Web Search, Data Mining, and Machine Learning for extracting
knowledge from the web.
Develop skills of using recent data mining software for solving practical
problems of Web Mining.
Gain experience of doing independent study and research.
Required text (DMW): Zdravko Markov and Daniel T. Larose. Data
Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage,
Wiley, 2007, ISBN: 978-0-471-66655-4.
Recommended texts: Ian H. Witten and Eibe Frank. Data Mining:
Practical Machine Learning Tools and Techniques (Second Edition), Morgan
Kaufmann, 2005, ISBN: 0-12-088407-0.
Required software: Weka 3 Data Mining System - Free Open Source Machine
Learning Software in Java. Available at http://www.cs.waikato.ac.nz/~ml/weka/index.html
Semester project: There will be a semester project that involves
independent study, work with the course software, writing reports and making
presentations. The project can be done individually or in teams of 2 or
3. The project description and timetable are included in the schedule of
classes and assignments.
Grading: The final grade will be based on the project (80%) and
two tests (20%), and will be affected by classroom participation. The letter
grades will be calculated according to the following table:
A   95-100
A-  90-94
B+  87-89
B   84-86
B-  80-83
C+  77-79
C   74-76
C-  70-73
D+  67-69
D   64-66
D-  60-63
F   0-59
Honesty policy: It is expected that all students will conduct
themselves in an honest manner (see the CCSU Student handbook), and NEVER
claim work which is not their own. Violating this policy will result in
a substantial grade penalty, and may lead to expulsion from the University.
Tentative schedule of classes, assignments and tests

Introduction

The Web Challenges (How to turn the web data into web knowledge):

Web Search Engines

Topic Directories

Semantic Web

Web Mining

Web content mining - discovery of Web document content patterns (text mining).

Web structure mining - discovery of hypertext/linking structure patterns

use hyperlinks to enhance text classification

page ranking

modeling and measuring the Web

Web usage mining - discovery of web users' activity patterns

mining web server logs

mining client machine access logs

Related areas

Reading: DMW, Chapter 1

Lecture slides: dmw1.pdf

Information Retrieval and Web Search

Topics:

Crawling the Web

Indexing and keyword search

Document representation

Relevance Ranking

Vector space model (TF, IDF, TF-IDF), Euclidean distance, cosine similarity

Relevance feedback

Advanced text search

Using the HTML structure in keyword search

Evaluating search quality

Similarity search

Reading: DMW, Chapter 1

Lecture slides: dmw1.pdf
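The vector space topics above can be illustrated with a short sketch. This is a simplified illustration, not course material: the tiny corpus, the tokenization, and the function names are ours (Weka's StringToWordVector filter does the real work in the course software).

```python
import math

def tfidf_vectors(docs):
    """Build TF-IDF vectors: TF = term count in the document,
    IDF = log(N / number of documents containing the term)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(t for doc in tokenized for t in doc))
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    return [[doc.count(t) * math.log(n / df[t]) for t in vocab]
            for doc in tokenized]

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

docs = ["web mining course", "data mining course", "cooking recipes"]
v = tfidf_vectors(docs)
# the two "mining" documents are more similar to each other than to the third
```

Note that terms occurring in every document get IDF = log(1) = 0, so they contribute nothing to the similarity, which is exactly the intended effect of the IDF weighting.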

Exercises:

Hyperlink Based Ranking

Clustering approaches for Web Mining

Evaluating Clustering

Classification approaches for Web Mining

Reading: DMW, Chapter 5

Lecture slides: dmw5.pdf

Basic approaches

Semester Projects

Students may choose one out of the following three projects:

To complete the project students are required to:

Write an initial project description, including specific goals,
resources to be used, plans for how to achieve the goals and evaluate the
project results, and a timetable.

Submit reports and make presentations on:

the initial project description (10% of final grade), due on October
1

the progress made by midterm (30% of final grade), due on November 5

the results achieved upon project completion (40% of final grade), due
on December 17

The students may work individually or in teams of 2 or 3.

The project grading will be based on both reports and presentations.
Hyperlink Based Ranking
1. The structure of the Web

Estimated (1998) at 150 million nodes (pages) and 1.7 billion edges (links).
Now more than 300 million pages, with 1 million added every day.

Pages are very diverse in format (text, images, animation, scripts, forms
etc.) and content (information, ads, news, personal pages etc.)

No central authority of editors: relevance, popularity, authority
are hard to evaluate

Links are also very diverse, many have nothing to do with content or authority
(e.g. navigation links).

The challenge: use the web hyperlink structure to evaluate the importance
of pages and to enhance search
2. Social networks

An early approach that works well for academic networks (bibliometrics).

Mostly counting the indegree of nodes, e.g. impact factor (number of citations
in the previous two years).

Prestige in social networks:

A(u,v) = 1 if page u cites page v; 0 otherwise

p(v) = Σ_{u} A(u,v) p(u)

Matrix notation: compute P (column vector over web pages) by iterative
assignment P' = A^{T}P

Basics of linear algebra

Matrices (see http://mathworld.wolfram.com/Matrix.html)

Vectors and norms (see http://mathworld.wolfram.com/VectorNorm.html)

Eigenvectors (see http://mathworld.wolfram.com/Eigenvector.html)

Example:

Graph: a → b, a → c, b → c, c → a

Prestige vector (column): P = (p(a), p(b), p(c))

Matrix: A = [(0,1,1), (0,0,1), (1,0,0)]; A^{T} = [(0,0,1), (1,0,0),
(1,1,0)]

Equation: cP = A^{T}P

Solution: eigenvalue c = 1.325; eigenvector P = (0.548, 0.413, 0.726)
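The eigenvector solution above can be checked numerically with power iteration, the standard way to find the dominant eigenvector. This is a sketch; the function name and matrix layout are ours.

```python
def power_iteration(at, steps=100):
    """Iterate P' = A^T P, normalizing P to unit L2 length each step;
    this converges to the dominant eigenvector of A^T, and the norm of
    the last unnormalized iterate approximates the eigenvalue c."""
    n = len(at)
    p = [1.0] * n
    for _ in range(steps):
        q = [sum(at[i][j] * p[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    return p, norm

# A^T for the graph a -> b, a -> c, b -> c, c -> a
at = [[0, 0, 1], [1, 0, 0], [1, 1, 0]]
p, c = power_iteration(at)
# p is approximately (0.548, 0.413, 0.726) and c is approximately 1.325
```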

Differences with the Web
3. PageRank

Web page u, F_{u} = {pages u points to}, B_{u} = {pages
that point to u}, N_{u} = |F_{u}| (the number of outlinks of u)

Basic idea: propagation of ranking through links (see Page, Brin et al.,
Figure 2)

R(u) = c Σ_{v ∈ B_{u}} R(v)/N_{v}

Example:

Graph: a → b, a → c, b → c, c → a

R(a) = 0.4; R(b) = 0.2; R(c) = 0.4 (see Page, Brin et al., Figure 3)

Eigenvector approach:

A(u,v) = 1/N_{u}, if u cites v, 0 otherwise;

matrix: A = [(0, 0.5, 0.5), (0,0,1), (1,0,0)]; A^{T} = [(0,0,1),
(0.5,0,0), (0.5,1,0)]

Equation: cP = A^{T}P

Solutions (find eigenvalue c and eigenvector P):

Integer: c = 1; P = (2, 1, 2)

||P||_{2} = 1 (L_{2} norm): c = 1; P = (0.666, 0.333, 0.666)

||P||_{1} = 1 (L_{1} norm): c = 1; P = (0.4, 0.2, 0.4)

Rank sink (a loop without outlinks)

Source of rank E(u)

R(u) = c Σ_{v ∈ B_{u}} R(v)/N_{v} + c E(u), where c is maximized and
||R||_{1} = 1 (the L_{1} vector norm of R).

Computing PageRank (S is the initial vector over web pages, e.g. E; all
norms are L_{1}):

R_{0} = S

Loop

R_{i+1} = AR_{i}

d = ||R_{i}|| - ||R_{i+1}||

R_{i+1} = R_{i+1} + dE

While ||R_{i+1} - R_{i}|| > ε
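The loop above transcribes directly into code. This sketch assumes a uniform source of rank E with ||E||_{1} = 1 and a dictionary-based graph representation; both choices are ours.

```python
def pagerank(links, eps=1e-9):
    """PageRank with a uniform source of rank E; `links` maps each
    page to the list of pages it points to."""
    pages = list(links)
    n = len(pages)
    e = {u: 1.0 / n for u in pages}   # E: uniform distribution
    r = dict(e)                       # R_0 = S = E
    while True:
        nxt = {u: 0.0 for u in pages}
        for u in pages:               # R_{i+1} = A R_i
            for v in links[u]:
                nxt[v] += r[u] / len(links[u])
        d = sum(r.values()) - sum(nxt.values())  # rank lost in sinks
        for u in pages:
            nxt[u] += d * e[u]        # R_{i+1} = R_{i+1} + dE
        delta = sum(abs(nxt[u] - r[u]) for u in pages)
        r = nxt
        if delta <= eps:              # while ||R_{i+1} - R_i|| > eps
            return r

ranks = pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']})
# approximately R(a) = 0.4, R(b) = 0.2, R(c) = 0.4, matching the example
```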

Random surfer model:

R(u) is the probability of a random walk on the graph of the Web.

If the surfer gets into a loop, then jumps to a random page chosen based
on the distribution in E

Adjusting PageRank by using the source of rank E

E is a uniform vector with a small norm (e.g. ||E||_{1} = 0.15), i.e. periodically
jumping to a random web page. Problem: manipulation by commercial interests
(getting an important page or a lot of non-important pages to include a link)

E is just one web page: the chosen page gets the highest rank, followed
by its links.

Other approaches: use all root level pages of all web servers (difficult
to manipulate).

Other applications of PageRank

Estimating Web traffic

Optimal crawling: using PageRank as an evaluation function.

Page navigation (show the PageRank of a link before the user clicks on
it).
4. Authorities and Hubs

Problems with associating authority with indegree:

Often links have nothing to do with authority (e.g. navigational links)

The balance between relevance and popularity (the most popular pages are
not the most relevant ones, e.g. sometimes the latter do not contain
the query string)

Idea:

Focus on the relevant pages first and then compute authority

Use also hub pages (pages that point to multiple relevant authoritative
pages)

The algorithm (HITS) - topic distillation. Given a query q:

Using a text-based search, find a small set of relevant pages (the root set
R_{q}).

Expand the root set by adding pages that point to and are pointed to by
pages from the root set. Thus create the base set S_{q}.

Find authorities and hubs in S_{q}

E(u,v)=1 if u points to v; 0 otherwise (both u and v belong to S_{q})

x  authority vector; y  hub vector; k  parameter (number of iterations)

(x_{1}, x_{2}, ..., x_{n}) = (1,1,1,
...,1)

(y_{1}, y_{2}, ..., y_{n}) = (1,1,1,
...,1)

Loop k times

x_{u} = Σ_{v: E(v,u)=1} y_{v}, for all u

y_{u} = Σ_{v: E(u,v)=1} x_{v}, for all u

normalize x and y (L_{2} norm)

End loop
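The HITS loop above can be sketched as follows. The edge-list representation and the tiny two-hub example graph are ours, used only to make the fixed point visible.

```python
def hits(edges, nodes, k=20):
    """HITS on the base set: each authority score collects the hub
    scores of pages pointing to it; each hub score collects the
    authority scores of the pages it points to."""
    x = {u: 1.0 for u in nodes}   # authority vector
    y = {u: 1.0 for u in nodes}   # hub vector
    for _ in range(k):
        x = {u: sum(y[v] for v, w in edges if w == u) for u in nodes}
        y = {u: sum(x[v] for w, v in edges if w == u) for u in nodes}
        for vec in (x, y):        # normalize x and y (L2 norm)
            norm = sum(s * s for s in vec.values()) ** 0.5 or 1.0
            for u in vec:
                vec[u] /= norm
    return x, y

# two hubs a and b both pointing to a single authority c
x, y = hits([('a', 'c'), ('b', 'c')], ['a', 'b', 'c'])
# x['c'] = 1.0 (the only authority); y['a'] = y['b'] ≈ 0.707 (equal hubs)
```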

Similar page queries

Link-based approach (the alternative is text-based similarity)

Find k pages pointing to p

Find the root set R_{p} and the base set S_{p}

Search in S_{p} for hubs and authorities

Report the highest ranking authorities and hubs as similar pages to p

Advantages: no problems with pages containing images or very little text
(e.g. very little overlap).

Dealing with disconnected graphs

Example: ambiguous queries

Using higher order eigenvectors: HITS actually finds the principal eigenvector
of EE^{T} and E^{T}E (the eigenvector associated with the
largest eigenvalue).

More eigenvectors may also be used to find hubs and authorities in smaller subgraphs

In general, higher order eigenvectors reveal clusters in the graph structure.

Improving HITS stability - random walk model (parameter d)

with probability d the surfer jumps to a random node in the base set.

with probability (1 - d) the surfer takes a random outlink from the current
page or goes back to a random page that points to the current one.

Tuning parameter d

stability improves as d increases

d=1 (no ranking)
5. Enhanced techniques for page ranking

Coarse-grained and fine-grained models

Topic generalization and drift

Avoiding nepotism

k pages on a single host

Assign a weight of 1/k to the inlinks coming from these pages

Eliminating outliers

Create vector space representation for the retrieved pages

Find the centroid of the root set

Eliminate pages from the base set that are too far from the centroid

Fine Grained models

Using the anchor text (Rank-and-File)

No hubs and authorities

Use a base set only and consider pages as chains of terms and links.

Increment counts for URLs that appear near (within distance k of) a query
term (start with 0 counts)

Report the top ranking pages

Using the document markup structure (DOM)

Slides
6. Using Web structure to enhance crawling and similarity search

Enhanced Crawling

Crawling as guided search (e.g. use PageRank as evaluation function)

Keyword based search

Link-based similarity search
General Setting and Evaluation
Techniques
1. General Setting

General setting for Classification (Supervised Learning, Learning from
examples, Concept learning)

Step 1: Data collection

Training documents (model construction subset + model validation subset)

Test documents

Step 2: Building a model

Feature Selection

Applying an ML approach (learner, classifier)

Validating the Model (tuning learner parameters)

Step 3: Testing and evaluating the model

Step 4: Using the model to classify new documents (with unknown class labels)

Problems with classification of text and hypertext

Very large number of features (terms) compared with the number of examples
(documents)

Many irrelevant or correlated features

Different number of features in different documents
2. Evaluating text classifiers

Evaluation criteria

Accuracy

Computational efficiency (speed, scalability, modification)

Ease of model interpretation and using user feedback

Simplicity (MDL)

Benchmark data

Evaluating classification accuracy

Holdout

Reserve a certain amount for testing and use the remainder for training
(usually 1/3 for testing, 2/3 for training).

Problem: the samples might not be representative. For example, some classes
might be represented with very few instances or even with no instances at
all.

Solution: stratification - sampling for training and testing within
classes. This ensures that each class is represented with approximately
equal proportions in both subsets

Repeated holdout. Success/error estimate can be made more reliable by repeating
the process with different subsamples.

In each iteration, a certain proportion is randomly selected for training
(possibly with stratification)

The error rates on the different iterations are averaged to yield an overall
error rate.

Problem: the different test sets may overlap. Can we prevent overlapping?

Cross-validation (CV). Avoids overlapping test sets.

k-fold cross-validation

First step: data is split into k subsets of equal size (usually by random
sampling).

Second step: each subset in turn is used for testing and the remainder
for training.

The error estimates are averaged to yield an overall error estimate.
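The two steps above can be sketched in a few lines. This is a minimal illustration without stratification; the function name and the use of a fixed random seed are our choices.

```python
import random

def k_fold_splits(instances, k=10, seed=0):
    """Split data into k folds by random sampling; each fold in turn
    is the test set and the remaining k-1 folds form the training set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(20))
for train, test in k_fold_splits(data, k=5):
    pass  # train and evaluate the classifier on each train/test pair
```

By construction the test sets never overlap, and every instance is used for testing exactly once, which is the property that distinguishes cross-validation from repeated holdout.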

Stratified cross-validation: subsets are stratified before the cross-validation
is performed.

Stratified ten-fold cross-validation

Standard method for evaluation. Extensive experiments have shown that this
is the best choice to get an accurate estimate. There is also some
theoretical evidence for this.

Stratification reduces the estimate's variance.

Repeated stratified cross-validation is even better: ten-fold cross-validation
is repeated ten times and the results are averaged.

Leave-one-out cross-validation (LOO CV).

LOO CV is n-fold cross-validation, where n is the number
of training instances. That is, n classifiers are built for all
possible (n-1)-element subsets of the training set and then tested
on the remaining single instance.

LOO CV makes maximum use of the data.

No random subsampling is involved.

Problems

LOO CV is very computationally expensive.

Stratification is not possible. Actually this method guarantees a
non-stratified sample (there is only one instance in the test set).

Worst case example: assume a completely random dataset with two
classes each represented by 50% of the instances. The best classifier
for this data is the majority predictor. LOO CV will predict 100% error
(!) rate for this classifier (explain why?).

Contingency matrix
Actual \ Predicted | +                   | -
+                  | True positive (TP)  | False negative (FN)
-                  | False positive (FP) | True negative (TN)

Total error = (FP+FN)/(TP+FP+TN+FN)

Recall - precision (information retrieval):

Precision (retrieved relevant / total retrieved) = TP / (TP+FP)

Recall (retrieved relevant / total relevant) = TP / (TP + FN)

Combined measures: F=2*Recall*Precision/(Recall+Precision)
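The measures above compute directly from the contingency counts. The example counts below are illustrative, not from the course data.

```python
def evaluate(tp, fp, fn, tn):
    """Total error, precision, recall and F-measure from the
    contingency matrix counts."""
    error = (fp + fn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)          # retrieved relevant / total retrieved
    recall = tp / (tp + fn)             # retrieved relevant / total relevant
    f = 2 * recall * precision / (recall + precision)
    return error, precision, recall, f

# e.g. 40 true positives, 10 false positives, 20 false negatives, 30 true negatives
error, precision, recall, f = evaluate(40, 10, 20, 30)
# error = 0.3, precision = 0.8, recall ≈ 0.667, F ≈ 0.727
```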

Multiple class setting

Predicting performance (true success/error rate)

Testing just estimates the probability of success on unknown data (data
used in neither training nor testing).

How good is this estimate? (What is the true success/error rate?)
We need confidence intervals (a kind of statistical reasoning) to
predict this.

Assume that success and error are two possible outcomes of a statistical
experiment (normally distributed random variable).

Bernoulli process: We have made N experiments and got S successes.
Then, the observed success rate is P=S/N. What is the true
success rate?

Example:

N=100, S=75. Then with confidence 80% P is in [0.691,
0.801].

N=1000, S=750. Then with confidence 80% P is in [0.732,
0.767].
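The intervals in the example match the Wilson score interval for a Bernoulli proportion (that identification is ours; the text does not name the formula). A sketch:

```python
from statistics import NormalDist

def wilson_interval(s, n, confidence=0.80):
    """Wilson score interval for the true success rate P,
    given S successes in N trials."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.28 for 80%
    p = s / n
    center = p + z * z / (2 * n)
    spread = z * (p * (1 - p) / n + z * z / (4 * n * n)) ** 0.5
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

lo, hi = wilson_interval(75, 100)
# lo ≈ 0.691, hi ≈ 0.801, as in the first example
```

Note how the interval shrinks with N at fixed observed rate: with N=1000 and S=750 it narrows to roughly [0.732, 0.767], matching the second example.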
3. Basic Approaches

Nearest Neighbor

Feature Selection

Bayesian approaches (Naive Bayes, Bayesian Networks, Maximal Entropy)

Numeric approaches (Linear regression and SVM)

Decision tree learning

Using Hypertext structure and Relational Learning (FirstOrder rule induction)
Nearest Neighbor Learning

Distance or similarity function defines what's learned.

Euclidean distance (for numeric attributes): D(X,Y) = sqrt[(x_{1} - y_{1})^{2}
+ (x_{2} - y_{2})^{2} + ... + (x_{n} - y_{n})^{2}],
where X = {x_{1}, x_{2}, ..., x_{n}}, Y = {y_{1},
y_{2}, ..., y_{n}}.

Cosine similarity (dot product when normalized to unit length): Sim(X,Y)
= x_{1}*y_{1} + x_{2}*y_{2} + ... + x_{n}*y_{n}

Other popular metric: city-block distance: D(X,Y) = |x_{1} - y_{1}|
+ |x_{2} - y_{2}| + ... + |x_{n} - y_{n}|.

As different attributes use different scales, normalization is required:
V_{norm} = (V - V_{min}) / (V_{max} - V_{min}).
Thus V_{norm} is within [0,1].

Nominal attributes: number of differences, i.e. city-block distance, where
|x_{i} - y_{i}| = 0 if x_{i} = y_{i}, and 1 otherwise.

Missing attributes: assumed to be maximally distant (given normalized attributes).

Example: weather data
ID | outlook  | temp | humidity | windy | play
1  | sunny    | hot  | high     | false | no
2  | sunny    | hot  | high     | true  | no
3  | overcast | hot  | high     | false | yes
4  | rainy    | mild | high     | false | yes
5  | rainy    | cool | normal   | false | yes
6  | rainy    | cool | normal   | true  | no
7  | overcast | cool | normal   | true  | yes
8  | sunny    | mild | high     | false | no
9  | sunny    | cool | normal   | false | yes
10 | rainy    | mild | normal   | false | yes
11 | sunny    | mild | normal   | true  | yes
12 | overcast | mild | high     | true  | yes
13 | overcast | hot  | normal   | false | yes
14 | rainy    | mild | high     | true  | no
X  | sunny    | cool | high     | true  | ?

ID       | 2  | 8  | 9   | 11
D(X, ID) | 1  | 2  | 2   | 2
play     | no | no | yes | yes
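The distance computation above can be reproduced in a few lines, using the nominal city-block distance (count of differing attribute values). The data layout is ours; only four instances from the weather table are included, matching the distance table.

```python
def nominal_distance(x, y):
    """City-block distance for nominal attributes: the number of
    attributes on which the two instances differ."""
    return sum(a != b for a, b in zip(x, y))

# four instances from the weather data (outlook, temp, humidity, windy)
data = {
    2:  (('sunny', 'hot',  'high',   'true'),  'no'),
    8:  (('sunny', 'mild', 'high',   'false'), 'no'),
    9:  (('sunny', 'cool', 'normal', 'false'), 'yes'),
    11: (('sunny', 'mild', 'normal', 'true'),  'yes'),
}
x = ('sunny', 'cool', 'high', 'true')
dists = {i: nominal_distance(x, attrs) for i, (attrs, _) in data.items()}
nearest = min(dists, key=dists.get)
# dists == {2: 1, 8: 2, 9: 2, 11: 2}; 1-NN predicts the class of
# instance 2: play = no
```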

Discussion

Instance space: Voronoi diagram

1-NN is very accurate but also slow: it scans the entire training data to derive
a prediction (possible improvement: use a sample)

Assumes all attributes are equally important. Remedy: attribute selection
or weights (see attribute relevance).

Dealing with noise (wrong values of some attributes)

Taking a majority vote over the k nearest neighbors (kNN).

Removing noisy instances from dataset (difficult!)

Numeric class attribute: take the mean of the class values of the k nearest neighbors.

kNN has been used by statisticians since the early 1950s. Question: k = ?

Distance weighted kNN:

Weight each vote or class value (for numeric) with the distance.

For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)^{2}

Then it makes sense to use all instances (k=n).
Bayesian approaches
Naive Bayes

Basic assumptions

Opposite of KNN: use all examples

Attributes are assumed to be:

equally important: all attributes have the same relevance to the classification
task.

statistically independent (given the class value): knowledge about the
value of a particular attribute doesn't tell us anything about the value
of another attribute (if the class is known).

Although based on assumptions that are almost never correct, this scheme
works well in practice!

Probabilities of weather data
outlook  | temp | humidity | windy | play
sunny    | hot  | high     | false | no
sunny    | hot  | high     | true  | no
overcast | hot  | high     | false | yes
rainy    | mild | high     | false | yes
rainy    | cool | normal   | false | yes
rainy    | cool | normal   | true  | no
overcast | cool | normal   | true  | yes
sunny    | mild | high     | false | no
sunny    | cool | normal   | false | yes
rainy    | mild | normal   | false | yes
sunny    | mild | normal   | true  | yes
overcast | mild | high     | true  | yes
overcast | hot  | normal   | false | yes
rainy    | mild | high     | true  | no

outlook = sunny [yes (2/9); no (3/5)];

temperature = cool [yes (3/9); no (1/5)];

humidity = high [yes (3/9); no (4/5)];

windy = true [yes (3/9); no (3/5)];

play = yes [(9/14)]

play = no [(5/14)]

New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]

Likelihood of the two classes (play=yes; play=no):

yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;

no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;

Conversion into probabilities by normalization:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205

P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
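The likelihood computation above transcribes directly into code. The conditional probabilities are read off the weather data; the function and dictionary layout are our sketch.

```python
def naive_bayes_posteriors(evidence, cond, prior):
    """Multiply the class-conditional probabilities and the prior for
    each class, then normalize the likelihoods into probabilities."""
    likelihood = {}
    for cls in prior:
        p = prior[cls]
        for attr, value in evidence.items():
            p *= cond[cls][(attr, value)]
        likelihood[cls] = p
    total = sum(likelihood.values())
    return {cls: p / total for cls, p in likelihood.items()}

# conditional probabilities for the new instance, from the weather data
cond = {'yes': {('outlook', 'sunny'): 2/9, ('temp', 'cool'): 3/9,
                ('humidity', 'high'): 3/9, ('windy', 'true'): 3/9},
        'no':  {('outlook', 'sunny'): 3/5, ('temp', 'cool'): 1/5,
                ('humidity', 'high'): 4/5, ('windy', 'true'): 3/5}}
prior = {'yes': 9/14, 'no': 5/14}
evidence = {'outlook': 'sunny', 'temp': 'cool',
            'humidity': 'high', 'windy': 'true'}
post = naive_bayes_posteriors(evidence, cond, prior)
# post['yes'] ≈ 0.205, post['no'] ≈ 0.795, as computed above
```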

Bayes
theorem (Bayes rule)

Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);

P(H): a priori probability of H (probability of the event before
evidence has been seen);

P(H|E): a posteriori (conditional) probability of H (probability
of the event after evidence has been seen);

Bayes for classification

What is the probability of the class given an instance?

Evidence E = instance

Event H = class value for instance

Naïve Bayes assumption: evidence can be split into independent parts
(attributes of the instance).

E = [A_{1},A_{2},...,A_{n}]

P(E|H) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)

Bayes: P(H|E) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)*P(H)
/ P(E)

Weather data:

E = [outlook=sunny, temp=cool, humidity=high, windy=true]

P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes)
* P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) /
P(E)

The “zero-frequency problem”

What if an attribute value doesn't occur with every class value (e.g.
humidity = high for class yes)?

The probability would be zero, for example P(humidity=high|yes) = 0;

The a posteriori probability would also be zero: P(yes|E) = 0 (no matter how
likely the other values are!)

Remedy: add 1 to the count for every attribute value-class combination
(the Laplace estimator: (p+1) / (n+k), where k is the number of values
for the attribute).

Result: probabilities will never be zero! (also stabilizes probability
estimates)

Missing values

Calculating probabilities: the instance is not included in the frequency count
for the attribute value-class combination.

Classification: attribute will be omitted from calculation

Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]

Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;

Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;

P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41

P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59

Numeric attributes

Assumption: attributes have a normal or Gaussian probability
distribution (given the class)

Parameters involved: mean, standard deviation, and the probability density function

Discussion

Naïve Bayes works surprisingly well (even if independence assumption
is clearly violated).

Why? Because classification doesn't require accurate probability estimates
as long as
maximum probability is assigned to correct class.

Adding too many redundant attributes will cause problems (e.g. identical
attributes).

Numeric attributes are often not normally distributed.

Yet another problem: estimating prior probability is difficult.

Advanced approaches: Bayesian networks.
Bayesian networks

Basics of BN

Define joint conditional probabilities.

Combine Bayesian reasoning with causal relationships between attributes.

Also known as belief networks, probabilistic networks.

Defined by:

Directed acyclic graph, with nodes representing random variables and links
representing probabilistic dependence.

Conditional probability tables (CPT) for each variable (node): specifies
all P(X|parents(X)), i.e. the probability of each value of X, given every
possible combination of values for its parents.

Reasoning: given the probabilities at some nodes (inputs) BN calculates
the probabilities in other nodes (outputs).

Classification: inputs - attribute values, output - class value probability.

There are mechanisms for training BN from examples, given variables and
network structure, i.e. creating CPT's.

Example:

Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls
(M)

Structure ("->" denotes a causal relation): Burglary -> Alarm;
Earthquake -> Alarm; Alarm -> JohnCalls; Alarm -> MaryCalls.

CPT's (for brevity, probability of false is not given, rows must sum to
1):

P(B) = 0.001

P(E) = 0.002
B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) =
P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 *
0.999 * 0.998 = 0.000628.
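The joint probability above can be checked numerically. The CPT entries P(J|A) = 0.9 and P(M|A) = 0.7 are taken from the calculation itself (the full CPTs for JohnCalls and MaryCalls are not listed in the notes).

```python
# CPT entries from the example; probability of false is the complement
p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j_given_a, p_m_given_a = 0.9, 0.7

# P(J, M, A, ~B, ~E) = P(J|A) P(M|A) P(A|~B,~E) P(~B) P(~E)
joint = (p_j_given_a * p_m_given_a * p_a[(False, False)]
         * (1 - p_b) * (1 - p_e))
# joint ≈ 0.000628
```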

Reasoning (using the complete joint distribution or other more efficient methods):

Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29;
P(A|J and M) = 0.76

Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;

Other: intercausal P(B|A), mixed P(A|J and ~E)


Naive Bayes as a BN

Variables: play, outlook, temp, humidity, windy.

Structure: play -> outlook, play -> temp, play -> humidity, play -> windy.

CPT's:

play: P(play=yes)=9/14; P(play=no)=5/14;

outlook:

P(outlook=overcast | play=yes) = 4/9

P(outlook=sunny | play=yes) = 2/9

P(outlook=rainy | play=yes) = 3/9

P(outlook=overcast | play=no) = 0/5

P(outlook=sunny | play=no) = 3/5

P(outlook=rainy | play=no) = 2/5

...

Numeric Approaches
Linear Regression

Basic idea

Work most naturally with numeric attributes. The standard technique
for numeric prediction is linear regression.

Predicted class value is a linear combination of attribute values
(a_{i}): C = w_{0}*a_{0} + w_{1}*a_{1}
+ w_{2}*a_{2} + ... + w_{k}*a_{k}.
For k attributes we have k+1 coefficients. To simplify notation
we add a_{0}, which is always 1.

Squared error: the sum over all instances of (actual class value - predicted
one)^{2}

Deriving the coefficients (w_{i}): minimizing squared
error on training data. Using standard numerical analysis techniques
(matrix operations). Can be done if there are more instances than attributes
(roughly speaking).
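For a single attribute the least-squares coefficients have a closed form, which illustrates the idea (the general multi-attribute case uses matrix operations, as noted above). The data values below are illustrative.

```python
def fit_line(xs, ys):
    """Minimize squared error for C = w0 + w1*a: w1 is the covariance
    of a and C divided by the variance of a, and w0 follows from the
    means (the fitted line passes through the point of means)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

w0, w1 = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.0, 9.0])
# roughly w0 ≈ 1.05, w1 ≈ 1.98 for this nearly linear data
```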

Classification by linear regression

Binary classification (class values 1, -1). Two possible interpretations:

Hyperplane that separates the two classes

Data points are projected on a line perpendicular to the hyperplane and
thus positive and negative points are separated.

Multiresponse linear regression (learning a membership function for each
class)

Training: perform a regression (create a model) for each class, setting
the output to 1 for training instances that belong to the class, and 0
for those that do not.

Prediction: predict the class corresponding to the model with largest output
value

Pairwise regression (designed especially for multiple classification)

Training: perform regression for every pair of classes, assigning output
1 for one class and -1 for the other.

Prediction: predict the class that receives most "votes" (outputs > 0)
from the regression lines.

More accurate than multiresponse linear regression, but more computationally
expensive.

Discussion

Creates a hyperplane for any two classes

Pairwise: the regression line between the two classes

Multiresponse: (w_{0} - v_{0})*a_{0} + (w_{1} - v_{1})*a_{1}
+ ... + (w_{k} - v_{k})*a_{k}, where w_{i}
and v_{i} are the coefficients of the models for the two
classes.

Not appropriate if data exhibits nonlinear dependencies. For example,
instances that cannot be separated by a hyperplane. Classical example:
XOR function.
Support Vector Machine (SVM)

Same idea as linear separation (projection)

Choosing the hyperplane so that the minimal distance from the points to
the hyperplane (the margin) is maximized.

One of the most accurate text document classifiers

Quadratic optimization problem solved by iterative algorithms
Web Crawler Project
This project includes two basic steps:

Implementing a Web Crawler.

Using the crawler to collect a set of web pages and identify their properties
related to the web structure.
For step 1 you may use WebSPHINX:
A Personal, Customizable Web Crawler, or write your own crawler in Java
or C using the open source code provided with WebSPHINX
or the W3C Protocol Library. Step
2 includes:

Identifying a portion of the Web (a subtree, a server or a topic oriented
part of the Web) to be analyzed.

Analysis of the structure of the set of web pages.

Ranking pages by using various techniques.

Grouping pages by similarity.
Note that no programming or implementation of standalone applications is required.
Web Document Classification Project
Introduction
Along with the search engines, topic directories
are the most popular sites on the Web. Topic directories organize web pages
in a hierarchical structure (taxonomy, ontology) according to their content.
The purpose of this structuring is twofold: firstly, it helps web searches
focus on the relevant collection of Web documents. The ultimate goal here
is to organize the entire web into a directory, where each web page has
its place in the hierarchy and thus can be easily identified and accessed.
The Open Directory Project (dmoz.org) and About.com are some of the best-known
projects in this area. Secondly, topic directories can be used to classify
web pages or associate them with known topics. This process is called tagging
and can be used to extend the directories themselves. In fact, some well-known
search portals such as Yahoo and Google return with their responses the topic
path of the response, if the response URL has been associated with some
topic found in a topic directory. As these topic directories are usually
created manually, they cannot capture all URLs; therefore only a fraction
of all responses are tagged.
Project overview
The aim of the project is to investigate the process
of tagging web pages using the topic directory structures and apply Machine
Learning techniques for automatic tagging or classifying web pages into
topic categories. This would help filtering out the responses of a search
engine or ranking them according to their relevance to a topic specified
by the user.
For example, a keyword search for “Machine Learning”
using Yahoo may return along with some of the pages found (about 5 million)
topic directory paths like:
Category: Artificial Intelligence > Machine Learning
Category: Artificial Intelligence > Web Directories
Category: Maryland > Baltimore > Johns Hopkins University > Courses
(Note that this may not be what you see when you try this query. The
web content is constantly changing as well as the search engines’ approaches
to search the web. This usually results in getting different results from
the same search query at different times.)
Most of the pages returned however are not tagged with directory topics.
Assuming that we know the general topic of such an untagged web page, say,
Artificial Intelligence, and this is a topic in a directory, we can try
to find the closest subtopic to the web page found. This is where Machine
Learning comes into play. Using some text document classification techniques
we can classify the new web page to one of the existing topics. By using
the collection of pages available under each topic as examples we can create
category descriptions (e.g. classification rules, or conditional probabilities).
Then using these descriptions we can classify new web pages. Another approach
would be the similarity search approach, where using some metric over text
documents we find the closest document and assign its category to the new
web page.
Project description
The project is split into three major parts. These
parts are also stages in the overall process of knowledge extraction from
the web and classification of web documents (tagging). As this process
is interactive and iterative in nature, the stages may be included in a
loop structure that allows each stage to be revisited so that feedback
from later stages can be used. The parts are well defined and can be
developed separately and then put together as components in a semi-automated
system or executed manually. Hereafter we describe the project stages in
detail, along with the deliverables that the students need to document in
the final report for each stage.
1. Collecting sets of web documents grouped by topic
The purpose of this stage is to collect sets of web documents belonging
to different topics (subject areas). The basic idea is to use a topic directory
structure. Such structures are available from dmoz.org (the Open Directory
project), the Yahoo directory (dir.yahoo.com), about.com, and many other
web sites that provide access to web pages grouped by topic or subject.
These
topic structures have to be examined in order to find several topics (e.g.
5), each of which is well represented by a set of documents (at least 20).
Alternative approaches could be extracting web documents manually from
the list of hits returned by a search engine using a general keyword search
or collecting web pages by using a Web Crawler (see the Web Crawler project)
from the web page structure of a large organization (e.g. university).
Deliverable: The outcome of this stage is a collection
of several sets of web documents (actual files stored locally, not just
URL’s) representing different topics or subjects, where the following restrictions
apply:
a) As these topics will be used for learning and classification experiments
at later stages they have to form a specific structure (part of the topic
hierarchy). It’s good to have topics at different levels of the topic hierarchy
and with different distances between them (a distance between two topics
can be defined as the number of predecessors to the first common parent
in the hierarchy). An example of such structure is:
topic1 > topic2 > topic3
topic1 > topic2 > topic4
topic1 > topic5 > topic6
topic1 > topic7 > topic8
topic1 > topic9
The set of topics here is {topic3, topic4, topic6, topic8, topic9}.
Also, it would be interesting to find topics, which are subtopics of two
different topics. An example of this is:
Top > … > topic2 > topic4
Top > … > topic5 > topic4
b) There must be at least 5 different topics with at least 20 documents
in each.
c) Each document should contain certain minimum amount of text. This
may be measured with the number of words (excluding stopwords and punctuation
marks). For example, this minimum could be 200 words.
2. Feature extraction and data preparation
At this stage the web documents are represented
by feature vectors, which in turn are used to form a training data set
for the Machine Learning stage. To complete this use the Weka system and
follow the directions provided in section Exercises of DMW, Chapter 1 (
free
download from Wiley).
Deliverable: ARFF data files containing the feature vectors
for all web documents collected at stage 1. It is recommended that students
prepare several files by using different approaches to feature extraction,
for example, one with Boolean attributes and one with numeric ones created
by applying the TFIDF approach. Versions of the data sets with different
number of attributes can be also prepared.
3. Machine Learning Stage
At this stage Machine Learning algorithms are used
to create models of the data sets. These models are then used for two purposes.
Firstly, the accuracy of the initial topic structure is evaluated and secondly,
new web documents are classified into existing topics. For both purposes
we use the Weka system. The ML stage of consists of the following steps:

Preprocessing of the web document data. Load the ARFF files created at
project stage 2, verify their consistency and get some statistics by using
the preprocess panel.

Using the Weka’s decision tree algorithm (J48) examine the decision trees
generated with different data sets. Which are the most important terms
for each data set (the terms appearing on the top of the tree)? How do
they change with changing the data set? Check also the classification accuracy
and the confusion matrix obtained with 10fold cross validation and find
out which topic is best represented by the decision tree.

Use the Naïve Bayes and Nearest Neighbor (IBk) algorithms and compare
their classification accuracy and confusion matrices obtained with 10fold
cross validation with the ones produces by the decision tree. Which ones
are better? Why?

Run the Weka clustering algorithms (kmeans, EM and Cobweb) ignoring the
class attribute (document topic) on all data sets. Evaluate the obtained
clusterings by comparing them to the original set of topics or to the topic
hierarchy (when using Cobweb). Use also the formal method, classes to clusters
evaluation, provided by Weka. For more details of clustering with Weka
see http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMiningEx3.html.

New web document classifications. Get web documents from the same subject
areas (topics), but not belonging to the original set of documents prepared
in project stage 1. Get also documents from different topics. Apply feature
extraction and create ARFF files each one representing one document. Then
using the Weka test set option classify the new documents. Compare their
original topic with the one predicted by Weka. For the classification experiments
use the guidelines provided in http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMiningEx2.doc.
Deliverable: This stage of the project requires writing a
report on the experiments performed. The report should include detailed
description of the experiments (input data, Weka outputs), answers to the
questions, and interpretation and analysis of the results with respect
to the original problem stated in the project, web document classification.
Intelligent Web Browser Project
Introduction
The Web searches provide large amount of information about the web users.
Data mining techniques can be used to analyze this information and create
user profiles or identify user preferences. A key application of this approach
is in marketing and offering personalized services, an area referred to
as "data gold rush". This project will focus on use of machine learning
approaches to create models of web users. Students will collect web pages
from web searches or by running a web crawler and label them according
to user preferences. The labeled pages will be then encoded as feature
vectors and fed into the machine learning system. The later will produce
user models that may be used for improving the efficiency of web searches
or identifying users.
Project description
Similarly to the web document classification project this project is split
into three major parts/stages  data collection, feature extraction and
machine learning (mining). At the data collection and feature extraction
stages web pages (documents) are collected and represented as feature vectors.
The important difference with the document classification project is that
the documents are mapped into users (not topic categories). At the machine
learning algorithms stage various learning algorithms are applied to the
feature vectors in order to create models of the users that these vectors
(documents) are mapped onto. Then the models can be used to filter out
web documents returned by searches so that the users can get more focused
information from the search engines. In this way users can also be identified
by their preferences and new users classified accordingly. Hereafter we
describe briefly the project stages.
1. Collecting sets of web documents grouped by users' preference
The purpose of this stage is to collect a set of web documents labeled
with user preferences. This can be done in the following way: A user performs
web searches with simple keyword search, just browses the web or examines
a set of pages collected by a web crawler. To each web document the user
assigns a label representing whether or not the document is interesting
to the user. As in the web document classification project some restrictions
apply: (1) The number of web pages should be greater than the number of
selected features (stage 2). (2) The web pages should have sufficient text
content so that they could be well described by feature vectors.
2. Feature extraction and data preparation
This stage is very similar to the one described in the Web Document Classification
project. By using the Weka filters Boolean or numeric values are calculated
for each web document and the corresponding feature vector is created.
Finally the vectors are included in the ARFF file to be used by WEKA. Note
that at this last step the vectors are extended with class labels (for
example, interesting/noninteresting or +/) according to the user preferences.
As in the in the web document classification project the outcome of
this stage is an ARFF data file containing the feature vectors for all
web documents collected at stage 1. It is recommended that students prepare
several files by using different approaches to feature extraction  Boolean
attributes, numeric attributes (using the TFIDF approach) and with different
number of terms. The idea is to do more experiments with different data
sets and different ML algorithms in order to find the best user model.
3. Machine Learning Stage
At this stage the approaches and experiments are similar to those described
in the Web Document Classification project with an important difference
in the last step where the machine learning models are used. This step
can be called web document filtering (focusing the search) and can be described
as follows: Collect a number of web documents using one of the approaches
suggested in project stage 1. Apply feature extraction and create an ARFF
test file with one data row for each document. Then using the training
set prepared in stage 2 and the Weka's test set option classify the new
documents. Each one will get a corresponding label (interesting/noninteresting
or +/). Then simply discard the noninteresting documents and present
the interesting ones to the user. Further, this step can be incorporated
into a web browser, so that it automatically labels all web pages as interesting/noninteresting
according to the user preferences.