A DNA microarray is a small silicon chip that is covered with thousands of spots of DNA of known sequence. Biologists use microarrays to study gene expression patterns in a wide range of medical and scientific applications. For example, analysis of microarrays has led to the discovery of genomic patterns that distinguish cancerous from non-cancerous cells. Other researchers have used microarrays to identify the genes in the brains of fish that are associated with certain kinds of mating behavior. Microarray data sets are very large and can be analyzed only with the help of computers. A support vector machine (SVM) is a powerful machine learning technique that is used in a variety of data mining applications, including the analysis of DNA microarrays.

The goal of this project is to learn how to use an SVM to recognize patterns in microarray data. Using an open source software package named libsvm and downloaded data from a cancer research project, we will train an SVM to distinguish between gene expression patterns that are associated with two different forms of leukemia.
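The train-then-classify workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression values are invented, the labels +1 and -1 stand in for the two forms of leukemia, and a nearest-centroid rule is used as a simple stand-in for the SVM that libsvm would supply in the actual project.

```python
# Toy illustration of supervised learning: learn from labeled examples,
# then classify examples the learner has not seen before.
# NOTE: the data are invented, and nearest-centroid is a stand-in for an SVM.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(examples):
    """examples: list of (label, vector). Returns one centroid per label."""
    by_label = {}
    for label, vector in examples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vs) for label, vs in by_label.items()}

def predict(model, vector):
    """Return the label whose centroid lies closest to the vector."""
    return min(model, key=lambda label: squared_distance(model[label], vector))

# Hypothetical expression levels for three genes per sample.
training_set = [
    (+1, [0.9, 0.1, 0.8]),
    (+1, [0.8, 0.2, 0.7]),
    (-1, [0.1, 0.9, 0.2]),
    (-1, [0.2, 0.8, 0.3]),
]
# Held-out samples, used only to measure how well the model generalizes.
test_set = [(+1, [0.7, 0.3, 0.9]), (-1, [0.3, 0.7, 0.1])]

model = train(training_set)
correct = sum(predict(model, v) == label for label, v in test_set)
accuracy = correct / len(test_set)
print(f"accuracy on held-out samples: {accuracy:.2f}")
```

Keeping the test samples separate from the training samples is the important design choice here: accuracy measured on the training data alone would say little about how the classifier handles new patients.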
Although this project will focus on SVMs and machine learning techniques, students will learn enough about the biology behind DNA analysis for the project to make sense. Students will gain experience using SVM software and will emerge from this project with an improved understanding of how machine learning can be used to recognize important patterns in vast amounts of data. The specific objectives of the project are to learn:
  • The basic concepts and techniques of supervised machine learning.
  • Some of the issues involved in the implementation of a learning system.
  • The vector space model for representing microarray (and other) data.
  • How to design a simple learning machine experiment using our own data set.
  • Some of the challenges involved in data mining in general and microarray analysis in particular.
Because microarray data are so esoteric, especially for those without extensive training in genetics, many of these objectives will be addressed through the analysis of a large set of baseball statistics.
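To make the vector space model concrete, the sketch below turns a few baseball records into numeric vectors. The player names, statistics, and field names are hypothetical, chosen only for illustration. Each record becomes a point in a space with one dimension per statistic, and each dimension is rescaled to the range [0, 1] so that no statistic dominates simply because of its units.

```python
# Vector space model sketch: each example becomes a fixed-length numeric vector.
# NOTE: the records and field names below are hypothetical.

FEATURES = ["batting_avg", "home_runs", "stolen_bases"]

players = [
    {"name": "Player A", "batting_avg": 0.310, "home_runs": 40, "stolen_bases": 5},
    {"name": "Player B", "batting_avg": 0.250, "home_runs": 10, "stolen_bases": 45},
    {"name": "Player C", "batting_avg": 0.280, "home_runs": 25, "stolen_bases": 25},
]

def to_vector(record):
    """Pick out the chosen statistics in a fixed order."""
    return [float(record[f]) for f in FEATURES]

def min_max_scale(vectors):
    """Rescale every dimension to [0, 1] using that dimension's min and max."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for v in vectors:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for x, lo, hi in zip(v, lows, highs)
        ])
    return scaled

raw = [to_vector(p) for p in players]
scaled = min_max_scale(raw)
for player, vector in zip(players, scaled):
    print(player["name"], [round(x, 2) for x in vector])
```

The same representation applies to microarray data, where each dimension would hold the measured expression level of one gene instead of one baseball statistic.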
Although the theory underlying SVMs involves linear algebra, Lagrangian multipliers, and other concepts from advanced mathematics, the basic ideas of how SVMs work will be presented in non-mathematical terms. Therefore, to complete the project, students need only a basic knowledge of algebra, discrete mathematics, statistics, and data structures.

We will be using an open source software tool named libsvm, which can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The software comes with a variety of supporting documentation, including A Practical Guide to SVM Classification. No special knowledge, beyond what is covered in this module, is required to use the software.
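libsvm reads its training and test data from plain text files in which each line holds one example: a label followed by index:value pairs, with feature indices starting at 1 and zero-valued features optionally omitted. As a minimal sketch, the helper below (a hypothetical function written for this illustration, not part of libsvm) converts a labeled vector into one such line. The sample values are invented.

```python
# Sketch: write labeled vectors in libsvm's sparse text format,
#   <label> <index>:<value> <index>:<value> ...
# NOTE: to_libsvm_line is a hypothetical helper, not a libsvm function.

def to_libsvm_line(label, vector):
    """Format one example; indices are 1-based and zero values are skipped."""
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(vector, start=1)
        if value != 0
    ]
    return " ".join([str(label)] + pairs)

# Invented samples: (label, feature vector).
samples = [(1, [0.5, 0.0, 1.2]), (-1, [0.0, 0.3, 0.0])]
lines = [to_libsvm_line(label, vec) for label, vec in samples]
print("\n".join(lines))
```

A file made up of such lines can then be given to libsvm's command-line tools, such as svm-train and svm-predict.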
For an introduction to general concepts in machine learning, students can read the corresponding chapter in any good AI book.
There are a number of good online tutorials and primers available on microarrays.
To understand the basic concepts of SVMs and how they are used in classification problems, students are encouraged to read short articles on the topic, including one that describes, in non-mathematical terms, how an SVM classifier works and one that provides a mathematical introduction to SVMs (for those with advanced math knowledge).
A tutorial that gives a concise explanation of basic concepts in statistics and probability is also useful background.
The detailed project description is available in the PDF file svm_project.pdf. You will need the free Adobe Acrobat Reader to view this file.
This project is customizable to accommodate different approaches to teaching and different implementations. Additional exercises are also included for students seeking more extended challenges.
A sample syllabus used at Trinity College when this project was assigned is available at:
Syllabus for AI Course at Trinity College

Additional readings are included in the Background section above.