Microarray data mining based on forward-backward stepping approach
Date
2005-11
Authors
Wei Ping, Loh
Abstract
Research on genetic microarray data is a recent effort that involves researchers
from diverse disciplines. Analyzing gene expression data has many important
applications in medicine and biology, which motivates mining patterns and hidden
information from microarray data.
Mining microarray data presents significant challenges, which include: i)
analyzing data with many attributes but few examples, ii) a high likelihood of false
positives, iii) assessing classifier certainty, iv) understanding the influence of data
preprocessing steps, v) the complexities of gene interaction, vi) the lack of absolute
ground truth, and vii) the abundance of biological knowledge and the difficulty of
integrating it.
This research attempts to address the above issues. The publicly available
study data comprise training and testing files of 7070 genes. Samples in the training
file are labelled with five pediatric brain tumour diseases: MED, MGL, RHB, EPD and JPA.
Our research is based on data mining concepts, aided by the software tool
WEKA. The emphasis is on data quality, data distribution and visualization of patterns.
Further, disease classes of unlabelled samples are predicted through classification
analysis. The classification algorithms used include Naïve Bayes, IB1, IB2, IB3
and J-48. Uncertainties that exist within the five prediction models are identified.
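One way to picture uncertainty across five prediction models is to check where the models disagree on a sample. The sketch below is only an illustration under stated assumptions: the abstract does not say how uncertainty was measured, so disagreement between models is an assumed criterion, and the sample names, predicted labels and the `summarize` helper are all hypothetical. Only the five class labels come from the study description.

```python
from collections import Counter

# Hypothetical predictions for three test samples from five models.
# These values are invented for illustration; they are NOT results from the study.
predictions = {
    "sample_1": ["MED", "MED", "MED", "MED", "MED"],
    "sample_2": ["MED", "EPD", "MED", "JPA", "MED"],
    "sample_3": ["RHB", "MGL", "RHB", "MGL", "EPD"],
}


def summarize(labels):
    """Return (most_frequent_label, agreement_fraction, is_uncertain).

    Assumption: a sample counts as uncertain when the models do not all agree.
    """
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    return top_label, agreement, len(counts) > 1


for sample, labels in predictions.items():
    top_label, agreement, uncertain = summarize(labels)
    status = "uncertain" if uncertain else "consistent"
    print(f"{sample}: {top_label} ({agreement:.0%} agreement, {status})")
```

Samples flagged as uncertain in such a check would be the natural candidates for closer inspection before a final class is assigned.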
Data are generally mined in sequential stages: preprocessing, processing and
post-processing. However, this technique seems inefficient for prediction, as it
often ends with unresolved uncertainties. We therefore propose two mining strategies:
backward stepping and forward-backward stepping. The backward stepping approach
relies on visualization, whilst the forward-backward stepping approach predicts
samples by applying classification algorithms. The resulting models undergo backward
flow analysis to determine the best predicted model for each sample.
In this study, out of 23 test predictions, MED is the best predicted disease class.
The forward-backward stepping strategy is more efficient in predictive data mining
than the usual approach, because uncertainties in the predictions are clarified before
the best classification is determined. In addition, this approach is far simpler in
its data preprocessing, yet provides a satisfactory percentage of accuracy. A general
mathematical formula to represent the best-interpreted disease is also generated.
A further initiative is to compare preprocessing techniques for the training and
testing files in the forward-backward stepping strategy. Two configurations,
clean trained-partial clean test and clean trained-clean test, are compared to study
their effects on our research outcomes. The clean trained-clean test configuration
results in better accuracies.
This research successfully shows data patterns, interprets relationships
between the training and testing files, and predicts diseases in test samples.
Keywords
Data mining based, Stepping approach