The explosive growth of knowledge in biotechnology (about 1500 research abstracts are added every day to MEDLINE, an electronic repository of biomedical papers) has created an acute need for knowledge management tools. Natural Language Processing (NLP) offers the technologies needed to organize, mine, and provide natural access to huge collections of text, or of text combined with other types of media such as tables, charts, and images.

A major problem in BioNLP is that the same biological term is frequently used with different meanings in biological texts. For instance, SBP2 can refer to either a protein or a gene. In this project, we combine NLP with Machine Learning techniques, namely decision trees and Naïve Bayes, to build software tools that classify terms into five classes: DNA, RNA, protein, cell_line, and cell_type.
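To make the Naïve Bayes idea concrete, here is a minimal sketch in plain Java that classifies an ambiguous term from the words around it. It is not the project's implementation and does not use Weka: the class name `TermNaiveBayes`, the choice of context words as features, and the four training examples are all invented for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal multinomial Naive Bayes over context words (illustrative sketch only).
class TermNaiveBayes {
    private final Map<String, Integer> examplesPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalExamples = 0;

    // Record one labelled example: the words around a term and the term's class.
    void train(String label, String... contextWords) {
        totalExamples++;
        examplesPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : contextWords) {
            counts.merge(w, 1, Integer::sum);
            wordsPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Return the class with the highest log posterior, using add-one smoothing.
    String classify(String... contextWords) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : examplesPerClass.keySet()) {
            // log P(class)
            double score = Math.log(examplesPerClass.get(label) / (double) totalExamples);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerClass.getOrDefault(label, 0);
            for (String w : contextWords) {
                // log P(word | class), smoothed so unseen words do not zero out the score
                int c = counts.getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    // Toy model trained on four invented contexts (not real corpus data).
    static TermNaiveBayes demoModel() {
        TermNaiveBayes nb = new TermNaiveBayes();
        nb.train("protein", "binds", "to", "the", "element");
        nb.train("protein", "binds", "the", "receptor");
        nb.train("DNA", "gene", "is", "transcribed");
        nb.train("DNA", "promoter", "of", "the", "gene");
        return nb;
    }

    public static void main(String[] args) {
        TermNaiveBayes nb = demoModel();
        System.out.println(nb.classify("binds", "receptor"));
        System.out.println(nb.classify("the", "gene"));
    }
}
```

With so few examples the model only illustrates the mechanics: class priors are estimated from label counts, word likelihoods from per-class word counts, and the predicted class is the one that maximizes their product.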
Students will learn core concepts in Natural Language Processing and Machine Learning. In addition, they will become familiar with issues in Biomedical Informatics, an area expected to offer a great many opportunities in the future. Students will be exposed to software tools available for Natural Language Processing and Machine Learning. Most importantly, students will be exposed to fundamental concepts in AI such as Knowledge Representation.

The project will familiarize students with the following concepts related to Machine Learning:
  • The basic concepts and techniques of machine learning.
  • Issues involved in the implementation of a learning system.
  • The role of learning in improved performance and in allowing a system to adapt based on previous experiences.
  • The important role data preparation and feature extraction play in machine learning.
  • The importance of model evaluation in machine learning, in particular the training and testing framework used to choose the best model for biological term classification.
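The training and testing framework mentioned in the last item can be sketched in a few lines of plain Java. This is only an outline under stated assumptions: the class `HoldOutEvaluation`, the `Example` type, the 70/30 split, and the two stand-in models (a majority-class baseline and a hand-written keyword rule) are hypothetical, and the toy data is invented. In the project itself, Weka provides its own evaluation tools.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hold-out evaluation sketch: split the data, fit on one part, score on the other.
class HoldOutEvaluation {
    // One labelled example: a feature string (here, a context) and its gold class.
    static final class Example {
        final String features;
        final String label;
        Example(String features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    // Shuffle with a fixed seed, then split into [training set, test set].
    static List<List<Example>> split(List<Example> data, double trainFraction, long seed) {
        List<Example> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<Example>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    // "Train" the simplest possible model: always predict the most frequent training label.
    static Function<String, String> majorityBaseline(List<Example> train) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (Example e : train) {
            int c = counts.merge(e.label, 1, Integer::sum);
            if (best == null || c > counts.get(best)) best = e.label;
        }
        final String prediction = best;
        return features -> prediction;
    }

    // Fraction of test examples whose predicted class matches the gold label.
    static double accuracy(Function<String, String> model, List<Example> test) {
        if (test.isEmpty()) return 0.0;
        int correct = 0;
        for (Example e : test) {
            if (e.label.equals(model.apply(e.features))) correct++;
        }
        return correct / (double) test.size();
    }

    public static void main(String[] args) {
        // Invented toy data: contexts mentioning "gene" are labelled DNA, the rest protein.
        List<Example> data = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            boolean gene = i % 3 == 0;
            data.add(new Example(gene ? "the gene " + i : "binds site " + i, gene ? "DNA" : "protein"));
        }
        List<List<Example>> parts = split(data, 0.7, 42L);
        Function<String, String> baseline = majorityBaseline(parts.get(0));
        Function<String, String> keywordRule = f -> f.contains("gene") ? "DNA" : "protein";
        // Both models are scored only on examples they were not fitted to.
        System.out.println("baseline accuracy: " + accuracy(baseline, parts.get(1)));
        System.out.println("keyword accuracy:  " + accuracy(keywordRule, parts.get(1)));
    }
}
```

The key design point is that accuracy is measured only on the held-out part. A model chosen by its score on the training data would tend to look better than it really is.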
Students should have basic knowledge of algebra, discrete mathematics, and statistics. Another prerequisite is the data structures course. While not necessary, experience with Java would help, since the basic tool needed for this project, the Weka Machine Learning system, is implemented in Java. Before starting the project, students may want to cover the recommended reading so that they better understand the fundamental concepts of Natural Language Processing and Machine Learning. In support of the exercises and project, students should download Weka, which is available at http://www.cs.waikato.ac.nz/~ml/weka/index.html.
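Weka reads its data from ARFF files, so a typical first step is to turn each term into a row of feature values. The sketch below shows one way this might look in plain Java. The class `ArffWriterSketch`, the relation name `bio_terms`, and the three surface features (length, digit count, upper-case count) are assumptions made for illustration; the project description may call for different features.

```java
// Sketch: turn terms into ARFF text that Weka can load (feature choice is hypothetical).
class ArffWriterSketch {
    // The five target classes named in the project description.
    static final String[] CLASSES = {"DNA", "RNA", "protein", "cell_line", "cell_type"};

    // ARFF header: relation name, one line per attribute, then the data marker.
    static String header() {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation bio_terms\n\n");
        sb.append("@attribute length numeric\n");
        sb.append("@attribute digits numeric\n");
        sb.append("@attribute uppercase numeric\n");
        sb.append("@attribute class {").append(String.join(",", CLASSES)).append("}\n\n");
        sb.append("@data\n");
        return sb.toString();
    }

    // One comma-separated data row: three surface features followed by the class label.
    static String row(String term, String label) {
        int digits = 0;
        int upper = 0;
        for (char c : term.toCharArray()) {
            if (Character.isDigit(c)) digits++;
            if (Character.isUpperCase(c)) upper++;
        }
        return term.length() + "," + digits + "," + upper + "," + label;
    }

    public static void main(String[] args) {
        System.out.print(header());
        // SBP2 is labelled "protein" here only for the sake of the example.
        System.out.println(row("SBP2", "protein"));
    }
}
```

Redirecting the program's output to a file with the `.arff` extension gives a data set that can be opened in Weka. Surface features alone cannot separate the protein and gene senses of a term such as SBP2, which is why features drawn from the surrounding context usually matter as well.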
For an introduction to machine learning and to decision tree learning, students can read the corresponding chapter in any good AI book. For example, one might assign chapter 18 of:
To understand the basic concepts of Natural Language Processing the students are encouraged to read chapters 1, 5, 11 and 19 of:
  • Daniel Jurafsky and James Martin. Speech and Language Processing. Prentice-Hall, 2000.
The following book is recommended for the basic principles of Machine Learning and for practical information about the Weka Machine Learning system. Chapters 2, 3 and 4 give important background information. Students should read chapter 8, which discusses the use of the Weka system and is also available online at http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer. Students are required to install and use the Weka system. They are also encouraged to experiment with the examples provided in the book and in the software package.
  • Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
For an introduction to BioNLP, students should read the following report:
For an introduction to biology, students are encouraged to read the following book:
  • William Cohen. Computer Scientist's Guide To Biology. Springer, 2007.
The detailed project description is available in the PDF file BioNLP.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used at The University of Memphis when this project was assigned will be available soon.

Additional readings are included in the Background section above.