Topic hierarchy annotation using feature selection techniques

Date

2002-12

Authors

Sae Tan, Saravadee

Abstract

Over the past decade, we have witnessed an explosion in the availability of online information. As the amount of information increases, it becomes more difficult to assimilate and profitably use it. Categorizing documents according to their topic, i.e. in a hierarchical structure of increasing specificity, makes this mass of information easier to process and navigate. In this way, the problem is reduced to a manageable size. In a topic hierarchy, each topic/category is given a name that describes it. In addition to the keywords used for naming, we propose to generate a set of additional keywords for each category that clearly express its content; these can be treated as an annotation of the topic hierarchy. The annotation approach proposed in this thesis uses a feature selection technique to annotate a topic hierarchy. Feature selection is a process that extracts significant words from the documents classified under a category. This method favors words that appear commonly in those documents. Usually, these words are related to the situation referred to by the category, and they are descriptive keywords for representing the concepts of the category. Different methods of feature selection have been developed by researchers in machine learning and text learning. Feature selection in machine learning is described as a search through a space of feature subsets that selects an optimal subset using an evaluation criterion. Because the number of candidate subsets grows rapidly with the number of features, the machine learning method is less appropriate when the number of features is large. Feature selection in text learning, on the other hand, evaluates all features independently. Each feature is assigned a score using a scoring measure, and the features are sorted by score. Normally, a predefined number of the best features is selected as the solution subset. However, the number of features to select is an experimental issue in text learning.
In this thesis, we propose a new method of feature selection. The proposed method combines ideas from feature selection in both machine learning and text learning, so that the two approaches complement each other, retaining the advantages of each while overcoming their respective limitations. The proposed feature selection is simplified by sorting all features in a list using a scoring measure from text learning and finding a cut-off point in the list using an evaluation function from machine learning. Using this idea, the proposed feature selection can handle a large number of features and is fast in finding the smallest set of optimal features. In the thesis, the hierarchical structure of a topic hierarchy is exploited by decomposing the topic hierarchy into a set of sub-hierarchies. This serves to simplify the annotation task as well as to improve accuracy in selecting significant keywords to represent every category in the topic hierarchy. Experiments have been conducted to demonstrate the efficacy of the proposed annotation approach in terms of its ability to select significant keywords that further describe the content of a category in a topic hierarchy. Experimental evaluations on real-world data collected from the web show that the proposed annotation approach gives promising results and can potentially be used to annotate a web hierarchy.
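The rank-then-cut idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the scoring measure or the evaluation function, so both are assumptions here: the score is a simple difference of document-frequency ratios, and the evaluation is the accuracy of a hypothetical rule "a document belongs to the category if it contains any selected word". The function names and the toy data are also illustrative, not taken from the thesis.

```python
from collections import Counter


def score_features(pos_docs, neg_docs):
    """Assumed scoring measure: share of category documents containing
    the word minus share of other documents containing it."""
    pos_df = Counter(w for d in pos_docs for w in set(d))
    neg_df = Counter(w for d in neg_docs for w in set(d))
    vocab = set(pos_df) | set(neg_df)
    return {
        w: pos_df[w] / len(pos_docs) - neg_df[w] / len(neg_docs)
        for w in vocab
    }


def evaluate(features, pos_docs, neg_docs):
    """Assumed evaluation function: accuracy of the rule
    'in category if the document contains any selected word'."""
    feats = set(features)
    correct = sum(1 for d in pos_docs if feats & set(d))
    correct += sum(1 for d in neg_docs if not feats & set(d))
    return correct / (len(pos_docs) + len(neg_docs))


def select_features(pos_docs, neg_docs):
    """Sort all words by score, then pick the shortest prefix of the
    ranked list that reaches the best evaluation value."""
    scores = score_features(pos_docs, neg_docs)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    best_k, best_val = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        val = evaluate(ranked[:k], pos_docs, neg_docs)
        if val > best_val:  # strict '>' keeps the smallest such prefix
            best_k, best_val = k, val
    return ranked[:best_k]


# Toy data: documents as word lists (hypothetical).
pos_docs = [["python", "code", "bug"], ["python", "library"], ["code", "compiler", "the"]]
neg_docs = [["football", "goal", "the"], ["match", "the", "team"]]

selected = select_features(pos_docs, neg_docs)
print(selected)  # ['code', 'python']
```

Only prefixes of the ranked list are evaluated, so the number of evaluations is linear in the number of features rather than exponential as in a full subset search; this is the trade-off the abstract points to, shown here under the stated assumptions.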

Keywords

Hierarchy annotation, Feature selection techniques

URI

http://hdl.handle.net/123456789/1111

Collections

Pusat Pengajian Sains Komputer - Tesis
