The Web is the largest collection of electronically accessible documents, which makes it the richest source of information in the world. The problem with the Web is that this information is not structured and organized well enough to be easily retrieved. Search engines help in accessing web documents by keywords, but this is still far from what we need in order to use the knowledge available on the Web effectively. Machine Learning and Data Mining approaches go further and try to extract knowledge from the raw data available on the Web, either by organizing web pages into well-defined structures or by looking for patterns in the activities of Web users. This project focuses on this challenge and explores the Machine Learning techniques suitable for this purpose.
Web searches provide large amounts of information about web users. Data mining techniques can be used to analyze this information and create web user profiles. A key application of this approach is in marketing and in offering personalized services, an area sometimes referred to as the “data gold rush”.
The aim of this project is to build a system that supports the development of an intelligent web browser. The project will focus on the use of Decision Tree learning to create models of web users. You will be provided with Decision Tree learning tools and will collect data from web searches. You will then experiment with creating web user models and with using these models to improve the efficiency of web searches performed by the same or new users.
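To make the idea of Decision Tree learning more concrete, the following is a minimal sketch (not the project's prescribed method) of how a decision-tree learner might choose which attribute to split on, using information gain. The class name InfoGainSketch, the toy attributes (page topic and page length), and the data rows are all hypothetical and serve only as illustration; in the project itself this work is done by the Decision Tree learning tools provided.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and data are assumptions, not part of the project.
class InfoGainSketch {

    // Shannon entropy (in bits) of a list of class labels.
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Information gain obtained by splitting the rows on column attr.
    // Each row holds attribute values; the last column is the class label.
    static double informationGain(List<String[]> rows, int attr) {
        int labelCol = rows.get(0).length - 1;
        List<String> allLabels = new ArrayList<>();
        Map<String, List<String>> partitions = new HashMap<>();
        for (String[] row : rows) {
            allLabels.add(row[labelCol]);
            partitions.computeIfAbsent(row[attr], k -> new ArrayList<>())
                      .add(row[labelCol]);
        }
        // Weighted average entropy of the partitions after the split.
        double remainder = 0.0;
        for (List<String> part : partitions.values()) {
            remainder += ((double) part.size() / rows.size()) * entropy(part);
        }
        return entropy(allLabels) - remainder;
    }

    // Index of the attribute with the highest information gain.
    static int bestAttribute(List<String[]> rows) {
        int numAttrs = rows.get(0).length - 1;
        int best = 0;
        for (int a = 1; a < numAttrs; a++) {
            if (informationGain(rows, a) > informationGain(rows, best)) {
                best = a;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical user data: {page topic, page length, did the user find it relevant?}
        List<String[]> rows = Arrays.asList(
            new String[] {"ai", "short", "yes"},
            new String[] {"ai", "long", "yes"},
            new String[] {"sports", "short", "no"},
            new String[] {"sports", "long", "no"});
        String[] names = {"topic", "length"};
        for (int a = 0; a < names.length; a++) {
            System.out.println(names[a] + " gain = " + informationGain(rows, a));
        }
        System.out.println("split on: " + names[bestAttribute(rows)]);
    }
}
```

In this toy data the page topic separates relevant from irrelevant pages perfectly, while page length tells us nothing, so a learner based on information gain would place the topic test at the root of the user model.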

The aim of this project is to investigate approaches and algorithms needed to develop an intelligent web browser that is able to adjust automatically to user preferences. The project focuses on the use of Decision Tree learning to create models of web users.
The learning objectives of the project are:
  • Learning the basics of Information Retrieval and Machine Learning
  • Gaining experience in using recent software applications in these areas
  • Gaining a better understanding of fundamental AI concepts such as Knowledge Representation and Search
Students should have a basic knowledge of algebra, discrete mathematics, and statistics. Another prerequisite is the data structures course. Experience with Java is not necessary but would help, because the basic tool needed for this project, the Weka Machine Learning system, is implemented in Java. Before starting the project, students may want to cover the recommended reading so that they better understand the fundamental concepts of Information Retrieval and Machine Learning.

In support of the exercises and project, students should download Weka, which is available at http://www.cs.waikato.ac.nz/~ml/weka/index.html, and a text corpus analysis package such as TextSTAT, a freeware program available from http://www.niederlandistik.fu-berlin.de/textstat/software-en.html.
For an introduction to machine learning and to decision tree learning, students can read the corresponding chapter in any good AI book. For example, one might assign chapter 18 of:
To understand the basic concepts of Information Retrieval, Web Search, and Web document classification, students are encouraged to read chapters 3 and 5 of:
  • Soumen Chakrabarti, Mining the Web - Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002.
The following book is recommended for the basic principles of Machine Learning and practical information about the Weka Machine Learning system:
  • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Chapters 2, 3, and 4 give important background information. Chapter 8, which is also available online at http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer, discusses the use of the Weka system, and students should read it. Students are required to install and use the Weka system. They are also encouraged to experiment with the examples provided in the book and in the software package.
The detailed project description is available in the PDF file UserProfiling.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used at the University of Hartford when this project was assigned is available at:
Syllabus for AI Course at the University of Hartford

A sample syllabus used at Central Connecticut State University when this project was assigned is available at:
Syllabus for AI Course at Central Connecticut State University

Additional readings are included in the Background section above.