Overview

Most content-based approaches to text and web document classification explored in related projects rely on the bag-of-words model, well known from Information Retrieval. This model is simple and efficient, but it fails to capture many additional document features, such as the internal HTML structure, the language structure, and the inter-document link structure, all of which may be valuable sources of information for the classification task. The basic problem with incorporating this information into a classification algorithm is the need for a uniform representation. For example, content-based classification works well with the vector space representation, while hyperlink-based classification can be implemented using graph models. This project introduces an approach that allows various kinds of information to be represented in a uniform way and used for document classification. The approach is known as Relational Learning or First-Order Learning. A closely related term is Inductive Logic Programming (ILP), which uses the language of logic programming (Prolog) as the representation language for learning. Some relational learning techniques have been successfully used in Data Mining applications (Relational Data Mining).
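
To make the idea of a uniform representation concrete, the following minimal Python sketch stores both document content and hyperlinks as relational facts and evaluates one first-order rule over them. The document identifiers, the words, the predicate names (`contains`, `links_to`), and the target concept `ai_page` are all hypothetical and chosen only for illustration; the project itself uses Prolog and FOIL, not this code.

```python
# Hypothetical facts: every identifier below is illustrative, not project data.
# Content and link structure share one format: (predicate, arg1, arg2).
facts = {
    ("contains", "d1", "learning"),
    ("contains", "d2", "logic"),
    ("contains", "d3", "football"),
    ("links_to", "d1", "d2"),
    ("links_to", "d3", "d1"),
}

def query(facts, pred, a=None, b=None):
    """Return sorted (arg1, arg2) pairs of facts matching pred; None is a wildcard."""
    return sorted((x, y) for (p, x, y) in facts
                  if p == pred and a in (None, x) and b in (None, y))

def is_ai_page(facts, doc):
    """Hand-written rule, in Prolog notation:
    ai_page(D) :- contains(D, learning), links_to(D, E), contains(E, logic).
    """
    if not query(facts, "contains", doc, "learning"):
        return False
    # Follow every outgoing link of doc and test the content of its target.
    return any(query(facts, "contains", target, "logic")
               for (_, target) in query(facts, "links_to", doc))
```

Here the rule combines a content condition with a link condition in a single expression, which is hard to state in a pure vector space or pure graph model. A relational learner would induce such rules from examples rather than have them written by hand.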

The project allows students to study the basics of relational learning and reasoning in the context of solving practical problems. FOIL, one of the most successful relational learning systems, is used to create relational representations of web documents and to solve classification problems.
The aim of this project is to provide a framework for experimentation and for solving practical problems in the area of relational learning and first-order inference. By using this framework, students can:
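
As a rough illustration of how a FOIL-style learner chooses which literal to add to a rule, the sketch below computes the standard FOIL information gain from coverage counts. The function name and the example counts are hypothetical, and the sketch makes a simplifying assumption: the number of positive bindings that remain covered is taken to be the positive count after adding the literal, which holds when the literal introduces no new variables. It is not the FOIL implementation used in the project.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain for adding one literal to a rule.

    p0, n0: positive / negative bindings covered before adding the literal
    p1, n1: positive / negative bindings covered after adding it
    Assumption: the count of still-covered positive bindings equals p1.
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # nothing positive is covered, so the literal gains nothing
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)
```

For example, a literal that keeps all 4 positive bindings and excludes all 4 negative ones gives `foil_gain(4, 4, 4, 0) == 4.0`, while a literal that halves both groups gives `foil_gain(4, 4, 2, 2) == 0.0`, so the first literal would be preferred.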
  • Learn the basics of Relational Learning and its application to web document classification.
  • Gain experience in using software applications in these areas for solving practical problems.
  • Better understand the fundamentals of First-Order Logic and of learning and reasoning in First-Order Logic, which are basic components of the wider area of knowledge representation and reasoning in AI.
Students should have basic knowledge of discrete mathematics and logic. Some programming experience in Prolog would be helpful, since most of the software tools used in the project are implemented in Prolog, and Prolog is also the representation language for First-Order Learning.

The software packages and data sets used in the project are freely available on the Web.
Before starting the project, students are advised to read Chapters 8, 9, and 19 of Russell and Norvig’s book ([1]), Chapter 1 and Chapter 5 (Section “Relational Learning”) of Markov and Larose’s book ([2]), or Chapter 10 of Mitchell’s book ([3]). While installing and experimenting with Prolog, they may use a Prolog tutorial ([5,6]).
  1. Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, 2nd edition. Prentice Hall, Upper Saddle River, NJ, USA, 2003.
  2. Zdravko Markov and Daniel T. Larose. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage. Wiley, 2007. Chapter 1 is available for download from Wiley.
  3. Tom Mitchell. Machine Learning. McGraw Hill, 1997.
  4. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Morgan Kaufmann, 2005.
  5. Quick Introduction to Prolog, available at http://www.cs.ccsu.edu/~markov/ccsu_courses/prolog.txt.
  6. A Prolog Tutorial by J.R. Fisher, available at http://www.csupomona.edu/~jrfisher/www/prolog_tutorial/contents.html.
The detailed project description is available in the PDF file RelationalLearning.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project can be customized to accommodate different approaches to teaching and different implementations. Additional exercises are included for students seeking further challenges.
A sample syllabus is not available.

Additional readings are included in the Background section above.