Microarray data mining based on forward-backward stepping approach

Date

2005-11

Authors

Wei Ping, Loh

Abstract

Research on genetic microarray data is a recent effort that involves researchers from diverse disciplines. Analyzing gene expression data has many important applications in medicine and biology, which motivates mining patterns and hidden information from microarray data. Mining microarray data presents significant challenges, including: i) analyzing data with many attributes but few examples, ii) a high likelihood of false positives, iii) assessing classifier certainty, iv) understanding the influence of data preprocessing steps, v) the complexity of gene interactions, vi) the lack of absolute ground truth, and vii) the abundance of biological knowledge and the difficulty of integrating it. This research attempts to address these issues. The publicly available study data comprise training and testing files covering 7070 genes. Samples in the training file are labelled with five pediatric brain tumour diseases: MED, MGL, RHB, EPD and JPA. Our research is based on data mining concepts, aided by the software tool WEKA. The emphasis is on data quality, data distribution and visualization of patterns. Further, the disease classes of unlabelled samples are predicted through classification analysis. The classification algorithms used include Naïve Bayes, 181, 182, 183 and J-48. Uncertainties that exist within the five prediction models are identified. Data are generally mined in sequential stages: preprocessing, processing and post-processing. However, this technique seems inefficient for prediction, as it often ends up with uncertainties. Thus, we propose two mining strategies: backward stepping and forward-backward stepping. The backward stepping approach relies on visualization, whilst the forward-backward stepping approach predicts samples by applying classification algorithms. The resulting models undergo backward flow analysis to determine the best predicted model for each sample. In this study, out of 23 test predictions, MED is the best-predicted disease class.
The forward-backward stepping strategy is more efficient in predictive data mining than the usual approach, because uncertainties in the predictions are clarified before the best classification is determined. In addition, this approach requires far simpler data preprocessing, yet achieves satisfactory accuracy. A general mathematical formula representing the best-interpreted disease is also derived. A further initiative compares preprocessing techniques for the training and testing files in the forward-backward stepping strategy. Two settings, clean trained-partial clean test and clean trained-clean test, are compared to study their effects on the research outcomes. The clean trained-clean test setting yields better accuracy. This research successfully shows data patterns, interprets relationships between the two files and predicts diseases in test samples.
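
The abstract does not spell out how prediction uncertainties are identified across the five models, so the following is only an illustrative sketch of one simple way to flag them: compare the labels that several classifiers assign to each test sample and mark samples where the classifiers disagree. The model names, the predictions and the helper functions are hypothetical and are not taken from the thesis; a plain majority vote stands in for the backward flow analysis, which may work quite differently.

```python
from collections import Counter

# Hypothetical predictions (not from the thesis): one list per classifier,
# one disease label per test sample, in the same sample order.
predictions = {
    "model_a": ["MED", "MGL", "EPD", "JPA"],
    "model_b": ["MED", "RHB", "EPD", "JPA"],
    "model_c": ["MED", "MGL", "RHB", "JPA"],
}


def summarize_sample(labels):
    """Return (most common label, share of models agreeing, uncertain flag)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    # A sample is flagged as uncertain when the models do not all agree.
    return label, agreement, agreement < 1.0


def summarize_all(model_predictions):
    """Apply summarize_sample to every test sample across all models."""
    per_sample = zip(*model_predictions.values())
    return [summarize_sample(list(labels)) for labels in per_sample]


results = summarize_all(predictions)
for index, (label, agreement, uncertain) in enumerate(results):
    status = "uncertain" if uncertain else "consistent"
    print(f"sample {index}: {label} (agreement {agreement:.2f}, {status})")
```

In this sketch, samples flagged as uncertain would be the ones to revisit in a later step, for example by inspecting the data or the individual models more closely.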

Keywords

Data mining based, Stepping approach

URI

http://hdl.handle.net/123456789/1570

Collections

Pusat Pengajian Sains Matematik - Tesis
