Classification of microarray datasets using random forest
Date
2009-06
Authors
Ee Ling, Ng
Abstract
DNA microarray technology makes it possible to monitor the expression of
tens of thousands of genes in a biological sample on a single chip.
Medicine can benefit from microarray data mining because it helps in the early detection
of gene mutations and the diagnosis of disease. A well-built model can be used to predict
the unknown disease class of a test case. Building such a model requires good
classification results, which depend heavily on the classifier being used.
However, in most microarray data, the number of genes usually outnumbers the number
of samples. Thus, it is not only the choice of classifier that is essential but
also the criteria used to select the significant genes that contribute to good
classification results. Gene selection also varies with the scope of the study and depends
on the criteria the researchers apply. In this study, we propose a stair-line method for
selecting significant genes that reduces the effect of kurtosis found among the genes.
Classification is then performed using Random Forest. Five microarray datasets with
different numbers of genes and samples are used to demonstrate the effectiveness of this
method. The method improves the percentage of correct classifications and at the same
time reduces the effect of kurtosis in the gene expression values. Other conventional
classification schemes are also examined for comparison with Random Forest, and Random
Forest is shown to be superior to them. In short, Random Forest gave competitive
classification results and performed consistently well on all datasets.
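
The abstract does not define the stair-line method, so the sketch below does not attempt to reproduce it. It only illustrates the quantity the method is said to reduce: the kurtosis of each gene's expression values across samples. The function names, the gene labels, and the toy expression values are all assumptions made for illustration, and the population (biased) form of excess kurtosis is used as one common definition among several.

```python
def excess_kurtosis(values):
    """Population excess kurtosis, m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Second and fourth central moments (divide by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant values")
    return m4 / (m2 * m2) - 3.0


def kurtosis_by_gene(expression):
    """Map each gene name to the excess kurtosis of its values across samples."""
    return {gene: excess_kurtosis(vals) for gene, vals in expression.items()}


# Hypothetical toy data: two genes measured in five samples.
toy_expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],   # evenly spread values
    "gene_b": [0.0, 0.0, 0.0, 0.0, 10.0],  # one extreme sample
}

for gene, k in kurtosis_by_gene(toy_expression).items():
    print(f"{gene}: excess kurtosis = {k:.2f}")
```

In this toy example the gene with a single extreme sample has the higher kurtosis, which is the kind of heavy-tailed behaviour a kurtosis-aware gene selection step would presumably need to account for before the selected genes are passed to a classifier such as Random Forest.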
Keywords
Microarray datasets, Random forest