A multi-tier knowledge discovery info-structure using ensemble techniques

Date

2007

Authors

Sakthiaseelan, Karthigasoo

Abstract

Our ultimate goal is to learn rule instances that have been discovered from unannotated data and to generate results with high accuracy. This is done via a hybrid methodology that features both supervised and unsupervised techniques. Unannotated data without prior classification information can thereby become useful, and our research brings new insight to knowledge discovery and learning. Our Methodology for Knowledge Discovery and Learning (MKDL) consists of 6 phases, each using different algorithms to produce its outcome. The phases and algorithms are as follows: a) Data Preprocessing using Mean/Mode Fill and Combinatorial Completion, b) Clustering Ensemble using the Boosting technique within the Kohonen Self-Organizing Map, c) Data Discretization using Boolean Reasoning and Entropy/Minimum Description Length, d) Rule Generation using the Genetic Algorithm, the Johnson Algorithm and Rough Sets Approximation, e) Rule Filtering using Michalski’s formula and Torgo’s technique and f) Learning using the ensemble technique with Bagging within Neural Networks. The output of each phase is the input to the next. Together, the 6 phases, with their functions and algorithms, integrate several different applications. This complete architecture forms the Multi-tier Knowledge Discovery, Amalgamation and Learning Info-structure (MESTAC). We performed a comparison and analysis with 2 knowledge discovery frameworks and different algorithms to arrive at the best model (combination of algorithms), one that results in high prediction accuracy. We introduced a boosting ensemble technique into the Kohonen Self-Organizing Map to produce better clustering results. We also introduced a bagging ensemble technique to a combination of neural network algorithms to produce more precise predictions. MESTAC may seem to be a complex combination of phases, but its overall methodology has 3 important advantages: it is simple, efficient and generic. Simplicity here means that MESTAC is a highly modular info-structure in which each phase is an independent, function-specific module. Efficiency here means that the final outcome of the info-structure is more accurate. Genericity here means that the info-structure can be used to discover knowledge from different types of data sets, such as continuous, mixed and discrete data sets. MESTAC has been demonstrated to be a feasible method using a well-known breast cancer data set. The positive results from the empirical study indicate that the methodology is sound and is indeed applicable as a new knowledge discovery and learning methodology.
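
The abstract gives no implementation details, so the following is only a minimal sketch of how two of the six phases might look in isolation: mean/mode fill from phase (a) and bagging with a majority vote from phase (f). All function names are hypothetical, the nearest-centroid base learner is a deliberately simple stand-in for the neural networks used in the thesis, and the toy values are made up for illustration.

```python
import random
from collections import Counter
from statistics import mean, mode


def mean_mode_fill(rows):
    """Phase (a) sketch: replace None with the column mean (numeric
    columns) or the column mode (categorical columns).
    Assumes every column has at least one observed value."""
    fills = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        fills.append(mean(present) if numeric else mode(present))
    return [
        [fill if v is None else v for v, fill in zip(row, fills)]
        for row in rows
    ]


def train_nearest_centroid(sample):
    """Stand-in base learner (NOT a neural network): predicts the label
    whose class centroid is closest to the query point."""
    by_label = {}
    for features, label in sample:
        by_label.setdefault(label, []).append(features)
    centroids = {
        label: [mean(dim) for dim in zip(*points)]
        for label, points in by_label.items()
    }

    def predict(features):
        return min(
            centroids,
            key=lambda label: sum(
                (a - b) ** 2 for a, b in zip(features, centroids[label])
            ),
        )

    return predict


def bagging_fit(data, n_models=15, seed=0):
    """Phase (f) sketch: train each model on a bootstrap sample, i.e. a
    sample of the same size drawn with replacement."""
    rng = random.Random(seed)
    return [
        train_nearest_centroid([rng.choice(data) for _ in data])
        for _ in range(n_models)
    ]


def bagging_predict(models, features):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up toy rows: one numeric and one categorical attribute.
    print(mean_mode_fill([[1.0, "x"], [None, "x"], [3.0, None]]))

    # Made-up, well-separated toy training set of (features, label) pairs.
    data = [
        ((1.0, 1.0), "neg"), ((2.0, 1.0), "neg"), ((1.0, 2.0), "neg"),
        ((9.0, 9.0), "pos"), ((8.0, 9.0), "pos"), ((9.0, 8.0), "pos"),
    ]
    models = bagging_fit(data)
    print(bagging_predict(models, (1.5, 1.0)))
```

Keeping each phase as a standalone function that takes the previous phase's output mirrors the modularity the abstract describes: any one step could be swapped for a different algorithm without touching the others.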

Description

Master

Keywords

Science Physic, Ensemble Techniques

URI

http://hdl.handle.net/123456789/547

Collections

School of Chemical Sciences (Pusat Pengajian Sains Kimia) - Theses
