An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification

Date
2016-03
Authors
Kadhim, Ammar Ismael
Publisher
Universiti Sains Malaysia
Abstract
Automatic text classification has become important with the increased availability of digital documents and the resulting need to organize them. Current state-of-the-art statistical modeling approaches do not provide sufficient useful information on the topics for each feature and category. Furthermore, feature extraction using traditional term frequency-inverse document frequency (TF-IDF) identifies too many categories for a particular document. In terms of classification, current k-NN approaches using Euclidean distance or the cosine similarity score show a wide variance in performance. To address these issues, this study classifies topics in short and long texts using new methods for the two main stages, feature extraction and text classification: TF-IDF with logarithm for feature extraction and k-NN with a new cosine similarity score for classification. The factors that affect the performance of supervised machine learning (ML) are also identified. For short texts, three datasets of different sizes (2,196; 5,534; and 10,186 tweets) are collected via an API; for long texts, the Reuters-21578 test collection is used. The experiments show that TF-IDF with logarithm improves feature extraction, with average F1-measures of 92.36%, 93.04%, and 93.60% on the 2,196-, 5,534-, and 10,186-tweet datasets, respectively, and 92.53% on the long texts. For dimension reduction (DR), four different cases are applied to each dataset in both short and long texts. Subsequently, for text classification, the proposed k-NN approach with the new cosine similarity score (k-NN-CSNew) outperforms k-NN with Euclidean distance (k-NN-ED) and k-NN with the traditional cosine similarity score (k-NN-CSOld) across different numbers of neighbors k.
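The pipeline described above (log-weighted TF-IDF features followed by k-NN classification under a cosine similarity score) can be sketched as follows. This is a minimal illustration using only the standard formulations: the logarithmic term-frequency weighting (1 + log tf) · idf and the traditional cosine score are assumptions, since the abstract does not give the thesis's modified weighting or its new similarity coefficient (k-NN-CSNew). All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def build_idf(train_docs):
    # Inverse document frequency over the training corpus: idf(t) = log(N / df(t)).
    N = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc))
    return {t: math.log(N / c) for t, c in df.items()}

def vectorize(doc, idf):
    # TF-IDF with logarithmic term frequency: (1 + log tf) * idf.
    # Terms unseen in training are dropped (they have no idf value).
    tf = Counter(doc)
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    # Traditional cosine similarity between two sparse vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(train_vecs, labels, query_vec, k=3):
    # Rank training documents by similarity to the query and take a
    # majority vote over the k nearest neighbors.
    ranked = sorted(zip(train_vecs, labels),
                    key=lambda p: cosine(query_vec, p[0]), reverse=True)
    return Counter(lbl for _, lbl in ranked[:k]).most_common(1)[0][0]
```

For example, with a toy corpus of tokenized documents labeled "travel" or "sport", a query such as ["football", "match"] would be vectorized with the training idf and classified by majority vote among its k most cosine-similar neighbors.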
Keywords
Document