Topic hierarchy annotation using feature selection techniques
Date
2002-12
Authors
Sae Tan, Saravadee
Abstract
Over the past decade, we have witnessed an explosion in the availability of online
information. As the amount of information increases, it becomes more difficult to
assimilate and profitably utilize such a large amount of information. Categorizing the
documents according to their topic, i.e. in a hierarchical structure of increasing
specificity, makes this mass of information easier to process and navigate. In this way, the
problem is reduced to a manageable size. In a topic hierarchy, each topic/category is
given a name to describe the category. In addition to the keywords for naming, we
propose to generate a set of additional keywords for each category to clearly express the
content of the category, which can be treated as annotation to the topic hierarchy.
The annotation approach proposed in this thesis uses a feature selection technique to
annotate a topic hierarchy. Feature selection is a process that extracts significant words
from documents classified under a category. This method favors words that appear
frequently in those documents. Usually, these words are related to the
situation referred to by the category, and they are descriptive keywords for representing the
concepts of the category.
Different methods of feature selection have been developed by researchers in
machine learning and text learning. Feature selection in machine learning is described as
a search through the space of feature subsets that selects an optimal subset using an
evaluation criterion. Because the number of candidate subsets grows exponentially with the
number of features, the machine learning method is less appropriate when
the number of features is large. Feature selection in text learning, on the other hand,
evaluates all features independently. Each of the features is assigned a score using a
scoring measure and they are sorted according to the assigned score. Normally, a
predefined number of best features are selected as the solution subset. However, the
number of features to select has to be determined experimentally in text learning.
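
The text learning style of feature selection described above can be illustrated with a minimal sketch. The abstract does not name the scoring measure, so the one used here (the fraction of in-category documents containing a word minus the fraction of other documents containing it) is an assumption chosen only for illustration, as are the function names and the toy documents.

```python
from collections import Counter

def score_features(category_docs, other_docs):
    """Score each word independently of all other words.

    Assumed scoring measure (not from the thesis): share of in-category
    documents containing the word minus share of other documents containing it.
    """
    in_df = Counter(w for doc in category_docs for w in set(doc))
    out_df = Counter(w for doc in other_docs for w in set(doc))
    n_in, n_out = len(category_docs), max(len(other_docs), 1)
    return {w: in_df[w] / n_in - out_df[w] / n_out for w in in_df}

def top_k_features(scores, k):
    """Sort words by descending score (ties broken alphabetically), keep k."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]

# Toy documents, already tokenized (hypothetical data).
category_docs = [["goal", "match", "team"],
                 ["team", "goal", "the"],
                 ["match", "the", "team"]]
other_docs = [["the", "stock", "market"],
              ["market", "the", "price"]]

scores = score_features(category_docs, other_docs)
print(top_k_features(scores, 2))  # ['team', 'goal']
```

Note that `k` is passed in by the caller: nothing in the ranking itself says where the list should be cut, which is the experimental issue mentioned above.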
In this thesis, we propose a new method of feature selection. The proposed feature
selection combines ideas from the feature selection methods in both machine learning and
text learning, so that the two methods complement each other, retaining the advantages of
each while overcoming their respective limitations. The proposed feature
selection is simplified by sorting all features in a list using a scoring measure from text
learning and finding a cut-off point in the list using an evaluation function from machine
learning. Using this idea, the proposed feature selection is able to handle a large number of
features and is fast in finding the smallest set of optimal features.
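
As a rough sketch of the combined idea, the code below takes a list of features already sorted by score and scans its prefixes, using an evaluation function to pick the cut-off point. The abstract does not specify the evaluation function or the search strategy, so the linear scan, the name `find_cutoff`, and the toy accuracy-style `evaluate` are all assumptions made for illustration.

```python
def find_cutoff(ranked_features, evaluate):
    """Return the shortest prefix of the ranked list with the best evaluation.

    Only len(ranked_features) candidate subsets are evaluated, instead of
    searching over all possible subsets.
    """
    best_value, best_k = float("-inf"), 0
    for k in range(1, len(ranked_features) + 1):
        value = evaluate(ranked_features[:k])
        if value > best_value:  # strict '>' keeps the smallest prefix on ties
            best_value, best_k = value, k
    return ranked_features[:best_k]

# Hypothetical data: features sorted by some scoring measure, plus documents.
ranked = ["goal", "match", "team", "the"]
category_docs = [{"goal", "the"}, {"match", "the"}, {"goal", "team"}]
other_docs = [{"team", "the", "stock"}, {"the", "market"}]

def evaluate(selected):
    """Toy evaluation function (an assumption): a document is predicted to be
    in the category if it contains any selected word; return the accuracy."""
    hits = sum(any(w in doc for w in selected) for doc in category_docs)
    rejects = sum(not any(w in doc for w in selected) for doc in other_docs)
    return (hits + rejects) / (len(category_docs) + len(other_docs))

selected = find_cutoff(ranked, evaluate)
print(selected)  # ['goal', 'match']
```

In this toy run the evaluation rises from 0.8 with one feature to 1.0 with two, then falls once "team" and "the" are added, so the cut-off lands after the second feature.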
In the thesis, the hierarchical structure of a topic hierarchy is utilized by
decomposing the topic hierarchy into a set of sub-hierarchies. This serves to simplify the
annotation task as well as improve accuracy in selecting significant keywords to represent
every category in the topic hierarchy.
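
One possible reading of the decomposition step is sketched below. The abstract does not say how the sub-hierarchies are formed; this sketch assumes each sub-hierarchy consists of one parent category and its direct children, so that keywords for a child only need to distinguish it from its siblings. The dictionary representation, the function name `decompose`, and the category names are hypothetical.

```python
def decompose(hierarchy, root):
    """Split a topic hierarchy into (parent, children) sub-hierarchies.

    `hierarchy` maps a category name to the list of its direct children;
    leaf categories may be omitted from the mapping.
    """
    sub_hierarchies = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = hierarchy.get(node, [])
        if children:
            sub_hierarchies.append((node, list(children)))
            # Reverse so that children are visited in their listed order.
            stack.extend(reversed(children))
    return sub_hierarchies

# Hypothetical topic hierarchy.
hierarchy = {
    "Top": ["Science", "Sports"],
    "Science": ["Physics", "Biology"],
    "Sports": ["Soccer"],
}

subs = decompose(hierarchy, "Top")
for parent, children in subs:
    print(parent, "->", children)
```

Under this assumption, each annotation task involves only the documents of a few sibling categories rather than the whole hierarchy.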
Experiments have been conducted in order to demonstrate the efficacy of the
proposed annotation approach in terms of its ability to select significant keywords for
further describing the content of a category in a topic hierarchy. Experimental evaluations
on real world data collected from the web have shown that the proposed annotation
approach gives promising results and it can potentially be used to annotate a web
hierarchy.
Keywords
Hierarchy annotation, Feature selection techniques