Laboratory Experiences for Introducing Undergraduates to Artificial
Along with the search engines, topic directories are the most popular sites on the web. Topic directories organize web pages in a hierarchical structure according to their content. This structuring helps web searches to focus on relevant collections of Web documents. The ultimate goal in this direction would be to organize the entire web as a directory, where each web page has its own place in the hierarchy and thus can be easily identified and accessed. Topic directories are not very well developed yet, mainly because they are created manually. Automatic classification of web pages would greatly facilitate this process by identifying where in the directory structure a page belongs or by expanding and creating new subdirectory structures.
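To make the idea of a hierarchical directory concrete, a topic directory can be pictured as a tree in which each page is stored at the node for its topic. The following is only a minimal sketch under that assumption; the class name `TopicNode`, its methods and the example.org URLs are hypothetical and are not taken from any existing directory or from the project materials.

```python
from dataclasses import dataclass, field


@dataclass
class TopicNode:
    """One topic in a hypothetical directory tree."""
    name: str
    children: dict = field(default_factory=dict)  # subtopic name -> TopicNode
    pages: list = field(default_factory=list)     # URLs filed directly under this topic

    def add_page(self, path, url):
        """File url under the topic named by path, creating subtopics as needed."""
        node = self
        for topic in path:
            node = node.children.setdefault(topic, TopicNode(topic))
        node.pages.append(url)

    def pages_under(self, path):
        """Return all URLs filed at the topic named by path or anywhere below it."""
        node = self
        for topic in path:
            node = node.children.get(topic)
            if node is None:
                return []  # no such topic in the directory
        found = list(node.pages)
        for child in node.children.values():
            found.extend(child.pages_under([]))
        return found


# Illustrative usage with made-up topics and URLs.
directory = TopicNode("Top")
directory.add_page(["Computers", "Artificial Intelligence"], "http://example.org/ai")
directory.add_page(["Computers", "Artificial Intelligence", "Machine Learning"],
                   "http://example.org/ml")
directory.add_page(["Arts"], "http://example.org/art")
```

In this picture, automatic classification amounts to choosing the `path` argument for a new page instead of having a human editor pick it.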
The goal of the project is to investigate the process of tagging web pages using topic directory structures and to apply Machine Learning techniques for automatic tagging. This would help in filtering the responses of a search engine or in ranking them according to their relevance to a topic specified by the user.
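One possible way to rank responses by relevance to a topic is to compare word-count vectors with cosine similarity. The project description does not prescribe this method; the sketch below is an assumption chosen for illustration, and the function names, page texts and URLs are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase words with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_topic(topic_description, pages):
    """Order page URLs from most to least similar to the topic description.

    pages maps a URL to the text of that page.
    """
    topic = term_vector(topic_description)
    scored = [(cosine(topic, term_vector(text)), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, key=lambda s: -s[0])]


# Illustrative usage with made-up search responses.
responses = {
    "http://example.org/a": "machine learning decision tree",
    "http://example.org/b": "oil painting gallery",
    "http://example.org/c": "learning to paint",
}
ranking = rank_by_topic("machine learning", responses)
```

Filtering can be sketched the same way by dropping pages whose similarity falls below a chosen threshold.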
While doing this project, students will learn the basics of Information Retrieval, Data Mining and Machine Learning, and will gain experience in using recent software applications in these areas. Most importantly, they will gain a better understanding of fundamental AI concepts such as Knowledge Representation and Search, which play important roles in the areas mentioned above.
While reinforcing traditional AI core topics through a unified example, in this case web document classification, the project allows the discussion of various issues related to machine learning, including:
The importance of model evaluation in machine learning and in particular the training and testing framework used to choose the best model for web page classification.
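The training and testing framework can be illustrated with a holdout split: each candidate model is built from the training portion only and compared on the held-out test portion. The sketch below is written under simplifying assumptions and does not use Weka. The two toy models (a majority-class baseline and a word-voting classifier) and all function names are hypothetical stand-ins for the learners students would actually try.

```python
import random
from collections import Counter


def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle (text, label) pairs and hold out a fraction of them for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classify, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)


def train_majority(train_set):
    """Baseline model: always predict the most frequent training label."""
    majority = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda text: majority


def train_keyword(train_set):
    """Toy model: each known word votes for the labels it appeared with."""
    word_labels = {}
    for text, label in train_set:
        for word in set(text.lower().split()):
            word_labels.setdefault(word, Counter())[label] += 1
    fallback = train_majority(train_set)

    def classify(text):
        votes = Counter()
        for word in set(text.lower().split()):
            votes.update(word_labels.get(word, {}))
        # Fall back to the baseline when no word of the text was seen in training.
        return votes.most_common(1)[0][0] if votes else fallback(text)

    return classify


def choose_best(trainers, examples, test_fraction=0.3, seed=0):
    """Train every candidate on the same split; return the name with the best test accuracy."""
    train_set, test_set = train_test_split(examples, test_fraction, seed)
    scores = {name: accuracy(trainer(train_set), test_set)
              for name, trainer in trainers.items()}
    return max(scores, key=scores.get), scores
```

The essential point is that `accuracy` is computed only on examples the model never saw during training, so the comparison reflects how each model generalizes rather than how well it memorizes.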
The students should have a basic knowledge of algebra, discrete mathematics and statistics. Another prerequisite is the data structures course. While not necessary, experience with Java would be of help as the basic tool needed for this project -- the Weka Machine Learning system -- is implemented in Java. Before starting the project, students may want to cover the recommended reading so that they understand better the fundamental concepts of Information Retrieval and Machine Learning.
In support of the exercises and project, students should download Weka, which is available at http://www.cs.waikato.ac.nz/~ml/weka/index.html, and a text corpus analysis package such as TextSTAT, a freeware package available from
For an introduction to machine learning and to decision tree learning, students can read the corresponding chapter in any good AI book. For example, one might assign chapter 18 of:
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 2nd edition, Prentice Hall,
To understand the basic concepts of Information Retrieval, Web Search and Web document classification, the students are encouraged to read chapters 3 and 5 of:
Soumen Chakrabarti, Mining the Web - Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002.
The following book is recommended for the basic principles of Machine Learning and practical information about the Weka Machine Learning system:
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Chapters 2, 3 and 4 give important background information. Chapter 8, which is also available online at http://www.cs.waikato.ac.nz/~ml/weka/Tutorial.pdf, discusses the use of the Weka system.
Students should read chapter 8. Students are required to install and use the Weka system. They are also encouraged to experiment with the examples provided in the book and in the software
The project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used
Additional readings are included in the Background section above.