Entropy-based gene ranking without selection bias for the predictive classification of microarray data
Furlanello, Cesare;Merler, Stefano;Jurman, Giuseppe;Serafini, Maria
2002-01-01
Abstract
We describe a ranking method for identifying biomedically important genes for diagnosis and therapy, and provide an experimental scheme for developing predictive classification models on microarray data. This combined approach is designed to correct the selection bias problem, that is, the overly optimistic prediction errors obtained when rules are tested on samples previously used in the feature selection process. Our E-RFE ranking algorithm is a wrapper based on the entropy of the distribution of weights obtained from an SVM classifier. To control and speed up the ranking process, E-RFE eliminates chunks of uninteresting genes until the remaining distribution stabilizes at higher entropy levels, and then proceeds with shorter elimination steps, as in the original recursive feature elimination (RFE) method. To control the selection bias, we use an external stratified partition resampling scheme and an internal K-fold cross-validation for the E-RFE feature ranking at each run. This doubly intensive model selection and error estimation process is made viable by E-RFE, which reduces computing time to 1/100 of that of RFE without a decrease in classification accuracy. The experimental scheme supports the identification of the gene candidates with the highest contribution to predictive accuracy.
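The elimination strategy summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram with `n_bins` bins, the default `entropy_threshold`, and the rule of discarding every feature in the lowest-weight bin are assumptions made here for illustration, and `fit_weights` is a hypothetical callback standing in for training a linear SVM on the surviving genes and returning one weight per gene.

```python
import math
from typing import Callable, List, Optional, Sequence


def weight_entropy(weights: Sequence[float], n_bins: int = 10) -> float:
    """Shannon entropy (in bits) of a histogram of the weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # all weights identical: a single occupied bin
    counts = [0] * n_bins
    for w in weights:
        idx = min(int((w - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    n = len(weights)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def entropy_rfe(
    fit_weights: Callable[[List[int]], Sequence[float]],
    n_features: int,
    n_bins: int = 10,
    entropy_threshold: Optional[float] = None,
) -> List[int]:
    """Rank feature indices, best first (illustrative sketch only).

    fit_weights(remaining) is a hypothetical callback returning one weight
    per surviving feature, e.g. the coefficients of a linear SVM.
    """
    if entropy_threshold is None:
        # Assumed default: half of the maximum attainable entropy.
        entropy_threshold = 0.5 * math.log2(n_bins)
    remaining = list(range(n_features))
    eliminated: List[int] = []  # worst features first
    while len(remaining) > 1:
        weights = [abs(w) for w in fit_weights(remaining)]
        lo, hi = min(weights), max(weights)
        drop: List[int] = []
        if hi > lo and weight_entropy(weights, n_bins) < entropy_threshold:
            # Low entropy: weights are concentrated, so remove the whole
            # lowest-weight bin in one step (assumed chunking rule).
            cut = lo + (hi - lo) / n_bins
            pairs = sorted((w, f) for f, w in zip(remaining, weights) if w < cut)
            drop = [f for _, f in pairs]
        if not drop:
            # Otherwise remove one feature at a time, as in plain RFE.
            worst = min(range(len(remaining)), key=weights.__getitem__)
            drop = [remaining[worst]]
        eliminated.extend(drop)
        dropped = set(drop)
        remaining = [f for f in remaining if f not in dropped]
    return remaining + eliminated[::-1]
```

To limit selection bias in the sense described above, a function like `entropy_rfe` would be called only on the training portion of each external resampling split, so that the held-out samples never influence the ranking.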