Classification of microarray datasets using random forest

Date
2009-06
Authors
Ee Ling, Ng
Abstract
DNA microarray technology makes it possible to monitor the expression of tens of thousands of genes in a biological sample on a single chip. Medical fields can benefit from microarray data mining because it helps in the early detection of gene mutations and the diagnosis of disease. A well-built model can be used to predict unknown disease classes in a test case. Building such a model requires good classification results, which depend heavily on the classifier being used. However, in most microarray data the number of genes far exceeds the number of samples. Thus, the choice of classifier is not the only essential factor: the criteria used to select the significant genes that contribute to good classification results matter as well. Gene selection also varies with the scope of the study and depends on the criteria the researchers are interested in. In this study, we propose a stair-line method that selects significant genes so as to reduce the effect of kurtosis found among the genes. Classification is then performed using Random Forest. Five microarray datasets with different numbers of genes and samples are used to demonstrate the effectiveness of this method. The method improves the percentage of correct classifications and at the same time reduces the effect of kurtosis in the gene expression values. Other conventional classification schemes are also examined for comparison, and Random Forest is shown to be superior to them. In short, Random Forest gave competitive classification results and performed consistently well on all datasets.
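The abstract does not define the stair-line method, so the following is only a minimal sketch of the general idea of screening genes by kurtosis before classification. It is not the author's procedure. The function names, the threshold `max_abs_kurtosis`, and the choice of population excess kurtosis are all assumptions made for illustration. The Random Forest step is not shown and would follow on the retained genes.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Constant gene: kurtosis is undefined. Assumption: treat it as
        # infinite so that the gene is dropped by the filter below.
        return float("inf")
    return m4 / (m2 ** 2) - 3.0


def select_low_kurtosis_genes(expression, max_abs_kurtosis):
    """Keep genes whose absolute excess kurtosis is within the threshold.

    expression: dict mapping a gene name to its expression values across samples.
    max_abs_kurtosis: hypothetical cut-off, not taken from the thesis.
    """
    return [
        gene
        for gene, values in expression.items()
        if abs(excess_kurtosis(values)) <= max_abs_kurtosis
    ]


# Toy data: g1 is evenly spread, g2 has one extreme outlier (heavy tail).
toy_expression = {
    "g1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "g2": [0, 0, 0, 0, 0, 0, 0, 0, 0, 100],
}
selected = select_low_kurtosis_genes(toy_expression, max_abs_kurtosis=2.0)
```

In this toy example the outlier-dominated gene is discarded and the evenly spread gene is kept. A real pipeline would then train a classifier on the retained genes only.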
Keywords
Microarray datasets, Random forest