Overview

Along with search engines, topic directories are among the most popular sites on the web. Topic directories organize web pages in a hierarchical structure according to their content. This structure helps web searches focus on relevant collections of web documents. The ultimate goal in this direction would be to organize the entire web as a directory, where each web page has its own place in the hierarchy and can thus be easily identified and accessed. Topic directories are not yet well developed, mainly because they are created manually. Automatic classification of web pages would greatly facilitate this process by identifying where in the directory structure a page belongs, or by expanding the directory and creating new subdirectory structures.

The goal of the project is to investigate the process of tagging web pages using topic directory structures and to apply Machine Learning techniques for automatic tagging. Automatic tagging would help in filtering the responses of a search engine or in ranking them according to their relevance to a topic specified by the user.
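
To make the idea of ranking by relevance to a topic concrete, the following is a minimal sketch, not the project's prescribed method. It assumes that both the topic and each page are represented as bags of words and compared by cosine similarity. The function names (tokenize, cosine, rank_by_topic) and the sample pages are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_topic(topic_text, pages):
    """Return page names sorted from most to least similar to the topic."""
    topic_vec = Counter(tokenize(topic_text))
    scores = {name: cosine(topic_vec, Counter(tokenize(text)))
              for name, text in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical search results, invented for illustration only.
pages = {
    "page_a": "decision tree learning and machine learning algorithms",
    "page_b": "cheap flights and hotel deals",
    "page_c": "learning to cook with simple recipes",
}
ranking = rank_by_topic("machine learning algorithms", pages)
print(ranking)
```

Raw word counts are the simplest possible representation. A real system would more likely weight terms (for example with TF-IDF) and describe the topic by the pages already filed under it in the directory.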

While doing this project, students will learn the basics of Information Retrieval, Data Mining and Machine Learning, and gain experience in using recent software applications in these areas. Most importantly, they will gain a better understanding of fundamental AI concepts such as Knowledge Representation and Search, which play important roles in the areas mentioned above.

While reinforcing traditional core AI topics through a unified example, in this case web document classification, the project allows the discussion of various issues related to machine learning, including:
  • The basic concepts and techniques of machine learning.
  • Issues involved in the implementation of a learning system.
  • The role of learning in improved performance and in allowing a system to adapt based on previous experiences.
  • The important role data preparation and feature extraction play in machine learning.
  • The vector space model for representing web documents, and a variety of feature extraction techniques, with the pros and cons of each for identifying and classifying documents by their feature vectors.
  • The importance of model evaluation in machine learning, and in particular the training and testing framework used to choose the best model for web page classification.
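The last two points, the vector space model and the training and testing framework, can be illustrated with a small sketch. It is not the project's implementation, which relies on Weka. It assumes TF-IDF term weighting and a nearest-centroid classifier chosen here only for brevity, and every function name and sample document below is hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_idf(train_docs):
    """Inverse document frequency, estimated from the training documents only."""
    n = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector; terms never seen in training are dropped."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_centroids(vectors, labels):
    """Sum the vectors of each class; cosine ignores scale, so no averaging."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return sums


def classify(vec, centroids):
    """Predict the class whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical, already tokenized documents, split into training and test sets.
train = [(["football", "match", "goal"], "sports"),
         (["team", "goal", "score"], "sports"),
         (["stock", "market", "price"], "finance"),
         (["bank", "market", "loan"], "finance")]
test = [(["goal", "team", "win"], "sports"),
        (["market", "loan", "rate"], "finance")]

idf = fit_idf([doc for doc, _ in train])
centroids = train_centroids([vectorize(doc, idf) for doc, _ in train],
                            [label for _, label in train])
predictions = [classify(vectorize(doc, idf), centroids) for doc, _ in test]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

The IDF weights and the centroids are computed from the training documents alone, and the test documents are used only to measure accuracy. Keeping the two sets separate in this way is what makes the accuracy figure a fair basis for comparing models.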
Students should have a basic knowledge of algebra, discrete mathematics and statistics. A data structures course is also a prerequisite. While not necessary, experience with Java would help, because the basic tool needed for this project, the Weka Machine Learning system, is implemented in Java. Before starting the project, students may want to cover the recommended reading so that they better understand the fundamental concepts of Information Retrieval and Machine Learning.

In support of the exercises and project, students should download Weka, which is available at http://www.cs.waikato.ac.nz/~ml/weka/index.html, and a text corpus analysis package such as TextSTAT, a freeware package available from http://www.niederlandistik.fu-berlin.de/textstat/software-en.html
For an introduction to machine learning and to decision tree learning, students can read the corresponding chapter in any good AI book. For example, one might assign chapter 18 of:
To understand the basic concepts of Information Retrieval, Web Search and Web document classification, students are encouraged to read chapters 3 and 5 of:
  • Soumen Chakrabarti, Mining the Web - Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002.
The following book is recommended for the basic principles of Machine Learning and practical information about the Weka Machine Learning system:
  • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Chapters 2, 3 and 4 give important background information. Chapter 8, which is also available online at http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer, discusses the use of the Weka system and should be read by all students. Students are required to install and use the Weka system. They are also encouraged to experiment with the examples provided in the book and in the software package.
The detailed project description is available in the PDF file DocClassification.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used at the University of Hartford when this project was assigned is available at:
Syllabus for AI Course at the University of Hartford

A sample syllabus used at Central Connecticut State University when this project was assigned is available at:
Syllabus for AI Course at Central Connecticut State University

Additional readings are included in the Background section above.