Gene selection and classification with support vector machines applied to microarray data

Furlanello, Cesare;Merler, Stefano;Jurman, Giuseppe;Serafini, Maria
2002-01-01

Abstract

Microarray expression studies are producing massive quantities of gene expression and other functional genomics data. One of the most challenging aspects of the discovery process based on gene expression data matrices is identifying small subsets of genes likely to be strongly related to the biological pathways involved in the experiment. We developed a gene selection method based on the Recursive Feature Elimination procedure for Support Vector Machines (SVM-RFE; Guyon et al., 2002). To better control and speed up the elimination of genes (typically from several thousand to fewer than 10), we introduced a reduction algorithm, E-RFE, based on the structure of the distribution of weights obtained from an SVM classifier during feature elimination. The reduction algorithm relies on an entropy measure of the weight distribution and eliminates chunks of uninteresting genes until the remaining distribution stabilizes, typically at 50 genes; from that point on, single-step SVM-RFE is applied. Our first experiments on public and unpublished microarray data are very promising: SVM classification accuracy is maintained even with very few remaining genes, and the procedure is remarkably faster than SVM-RFE. A first analysis of the oncological interest of the selected genes, performed by specialists, also gave interesting results, as only genes relevant to cancer classification were selected. In particular, one gene related to tissue composition that was selected by RFE was not selected by E-RFE. This is a case in which the features selected may matter more than the classifier used. For tasks such as predicting patients' response to therapy, we aim to develop accurate classification systems based on a very small number of genes, in order to provide both a predictive methodology and an analysis tool for experimental oncology.
The automatic selection of genes relevant to the underlying oncological basis is thus crucial in the design of targeted experiments. In this paper we apply the method to three microarray data sets: diffuse large B-cell lymphoma (96 samples and 4026 genes, of which 7 were selected); the AML/ALL data set (discriminating Acute Myeloid Leukemia from Acute Lymphoblastic Leukemia; 72 cases and 7129 genes, of which 9 were selected); and colon tumor (discriminating tumor from normal colon tissue; 62 cases and 2000 genes, of which 7 were selected). In all cases, the classification accuracy of the reduced models was comparable to previously published results. With a view to automating gene selection within an integrated discovery process, we are now developing a system for complete bioinformatics treatment of the data, including database-backed data access, support for tasks such as comparing gene selection results against the BLAST service, and specialized data displays produced by statistical software.
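
The two-phase elimination described in the abstract (entropy-guided chunk removal down to about 50 genes, then one gene per step) can be illustrated with a minimal sketch. The abstract does not give the exact E-RFE rule, so everything specific here is an assumption made for illustration: the histogram-based entropy, the way the chunk size is derived from it, and the names `weight_entropy`, `eliminate_step`, `e_rfe` and `toy_train_weights`. The `train_weights` callback is a hypothetical stand-in for retraining a linear SVM on the remaining genes and returning its weight vector.

```python
import math
import random


def weight_entropy(weights, n_bins=10):
    """Shannon entropy (bits) of a histogram of the absolute weights."""
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    if hi == lo:
        return 0.0  # all weights identical: a single occupied bin
    counts = [0] * n_bins
    for m in mags:
        idx = min(int((m - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    n = len(mags)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def eliminate_step(weights, stable_size=50, n_bins=10):
    """Positions to drop in one step, lowest absolute weight first.

    Assumed rule (not taken from the paper): the more concentrated the
    weight histogram (low entropy), the larger the chunk that is dropped.
    At or below stable_size, fall back to dropping a single feature.
    """
    n = len(weights)
    order = sorted(range(n), key=lambda i: abs(weights[i]))
    if n <= stable_size:
        return order[:1]  # single-step RFE phase
    concentration = 1.0 - weight_entropy(weights, n_bins) / math.log2(n_bins)
    chunk = max(1, int(concentration * (n - stable_size)))
    return order[:chunk]


def e_rfe(n_features, train_weights, target=10, stable_size=50, n_bins=10):
    """Eliminate features until `target` remain; return (features, steps)."""
    remaining = list(range(n_features))
    steps = 0
    while len(remaining) > target:
        weights = train_weights(remaining)  # hypothetical SVM retraining
        drop = eliminate_step(weights, stable_size, n_bins)
        drop = set(drop[: len(remaining) - target])  # never overshoot target
        remaining = [f for pos, f in enumerate(remaining) if pos not in drop]
        steps += 1
    return remaining, steps


# Toy data: 10 informative features with large weights, 1990 near-zero ones.
# A real SVM would produce different weights after every retraining; here
# the weights are fixed so the example stays self-contained.
rng = random.Random(0)
importance = [1.0] * 10 + [rng.uniform(-0.1, 0.1) for _ in range(1990)]


def toy_train_weights(feature_ids):
    return [importance[f] for f in feature_ids]


selected, steps = e_rfe(2000, toy_train_weights)
print(sorted(selected), steps)
```

In this toy setting the chunk phase removes most of the near-zero features in a handful of steps, whereas plain one-at-a-time elimination would need 1990 retrainings; that difference is the kind of acceleration the abstract refers to, although the actual speed-up of E-RFE depends on its real elimination rule and data.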
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11582/640