In this project, students learn the basics of Information Retrieval, Data Mining
and Machine Learning and gain experience with recent software applications in
these areas. Most importantly, they gain a better understanding of fundamental AI
concepts such as Knowledge Representation and Search, which play important roles
in the areas mentioned above.
While reinforcing traditional AI core topics through a single unified example, in
this case web document classification, the project allows the discussion of various
issues related to machine learning, including:
- The basic concepts and techniques of machine learning.
- Issues involved in the implementation of a learning system.
- The role of learning in improving performance and in allowing a system to adapt based
on previous experience.
- The important role data preparation and feature extraction play in machine learning.
- The vector space model for representing web documents, and a variety of feature
extraction techniques, with the pros and cons of each for identifying and classifying
documents by their feature vectors.
- The importance of model evaluation in machine learning, and in particular the training
and testing framework used to choose the best model for web page classification.
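The last two items can be illustrated with a minimal sketch. It is not the project's own method: it assumes raw term-frequency vectors, cosine similarity, and a 1-nearest-neighbour classifier, and it uses a tiny invented corpus with hypothetical labels ("course", "faculty"). All function names are illustrative. The vocabulary is built from the training documents only, and accuracy is measured on held-out test documents.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def tf_vector(text, vocabulary):
    """Map a document to a vector of raw term frequencies over the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, training_set, vocabulary):
    """Return the label of the most similar training document (1-nearest neighbour)."""
    query = tf_vector(text, vocabulary)
    best_label, best_score = None, -1.0
    for train_text, label in training_set:
        score = cosine_similarity(query, tf_vector(train_text, vocabulary))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(test_set, training_set, vocabulary):
    """Fraction of held-out documents whose predicted label matches the true one."""
    correct = sum(
        1 for text, label in test_set
        if classify(text, training_set, vocabulary) == label
    )
    return correct / len(test_set)


# Invented toy corpus (an assumption for illustration only).
training_set = [
    ("course syllabus lecture homework exam", "course"),
    ("lecture notes exam schedule course", "course"),
    ("professor research publications students", "faculty"),
    ("research interests publications professor office", "faculty"),
]
test_set = [
    ("homework and exam dates for the course", "course"),
    ("publications and research of the professor", "faculty"),
]

# The vocabulary comes from the training documents only, so words that appear
# only in the test documents are ignored when the test vectors are built.
vocabulary = build_vocabulary(text for text, _ in training_set)
test_accuracy = accuracy(test_set, training_set, vocabulary)
```

Replacing `tf_vector` with a Boolean or TF-IDF weighting and comparing the resulting test accuracies is one way to explore the trade-offs between feature extraction techniques.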
Students should have a basic knowledge of algebra, discrete mathematics and
statistics. Another prerequisite is the data structures course. While not necessary,
experience with Java would help, as the basic tool needed for this project,
the Weka Machine Learning system, is implemented in Java. Before starting the
project, students may want to cover the recommended reading so that they better
understand the fundamental concepts of Information Retrieval and Machine Learning.
In support of the exercises and project, students should download Weka, which is
available at
http://www.cs.waikato.ac.nz/~ml/weka/index.html
and a text corpus
analysis package such as TextSTAT, a freeware package available from
http://www.niederlandistik.fu-berlin.de/textstat/software-en.html
For an introduction to machine learning and to decision tree learning, students
can read the corresponding chapter in any good AI book. For example, one might assign
chapter 18 of:
To understand the basic concepts of Information Retrieval, Web Search and Web document
classification, the students are encouraged to read chapters 3 and 5 of:
- Soumen Chakrabarti, Mining the Web - Discovering Knowledge from Hypertext
Data, Morgan Kaufmann Publishers, 2002.
The following book is recommended for the basic principles of Machine Learning and
practical information about the Weka Machine Learning system:
- Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations, Morgan Kaufmann, 2000.
Chapters 2, 3 and 4 give important background information. Chapter 8, which is also
available online at http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer,
discusses the use of the Weka system and should be read in full. Students are
required to install and use the Weka system. They are also encouraged to experiment
with the examples provided in the book and in the software package.
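As a hedged illustration of preparing data for Weka, the sketch below writes document feature vectors in Weka's ARFF text format (`@relation`, `@attribute`, `@data`). The function name `to_arff`, the class attribute name `doc_class`, and the sample data are assumptions made for this example and are not taken from the project description. The sketch also assumes that terms are plain alphabetic words, so no quoting of attribute names is needed.

```python
def to_arff(relation, terms, rows, class_values):
    """Render term-frequency vectors as an ARFF document.

    relation     -- name written on the @relation line
    terms        -- attribute names, one numeric attribute per term
    rows         -- list of (vector, label) pairs; each vector aligns with terms
    class_values -- the allowed labels for the nominal class attribute
    """
    lines = [f"@relation {relation}", ""]
    for term in terms:
        lines.append(f"@attribute {term} numeric")
    # The class is a nominal attribute; "doc_class" is a hypothetical name.
    lines.append("@attribute doc_class {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for vector, label in rows:
        lines.append(",".join(str(value) for value in vector) + "," + label)
    return "\n".join(lines) + "\n"


# Invented sample data: two terms, two documents, two classes.
arff_text = to_arff(
    "webdocs",
    ["exam", "research"],
    [([2, 0], "course"), ([0, 3], "faculty")],
    ["course", "faculty"],
)
```

Saving `arff_text` to a file with an `.arff` extension should give a data set that can be opened in Weka, with the last attribute serving as the class.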
The detailed project description is available in the PDF file DocClassification.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.