An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification

dc.contributor.author: Kadhim, Ammar Ismael
dc.date.accessioned: 2017-01-09T07:24:17Z
dc.date.available: 2017-01-09T07:24:17Z
dc.date.issued: 2016-03
dc.description.abstract: Automatic text classification is important because of the increased availability of digital documents and the consequent need to organize them. Current state-of-the-art statistical modeling approaches do not provide sufficiently useful information on the topics for each feature and category. Furthermore, feature extraction using traditional term frequency-inverse document frequency (TF-IDF) results in the identification of too many categories for a particular document. In terms of classification, current k-NN approaches with Euclidean distance and cosine similarity score produce a wide variance in performance. To address these issues, this study classifies topics for short and long texts using a new method for the main stages (i.e., feature extraction and text classification). The study introduces TF-IDF with logarithm and k-NN with a new cosine similarity score for feature extraction and classification, respectively. Moreover, the factors that affect the performance of supervised machine learning (ML) are identified. For short texts, three datasets of different sizes are collected via an API (with 2,196; 5,534; and 10,186 tweets, respectively). For long texts, the Reuters-21578 test collection is used. The experiments show that TF-IDF with logarithm improves feature extraction, with average F1-measures of 92.36%, 93.04%, and 93.60% for the 2,196-, 5,534-, and 10,186-tweet datasets, respectively, for short texts, and 92.53% for long texts. For dimension reduction (DR), four different cases are applied to each dataset for both short and long texts. Subsequently, for text classification, the proposed k-NN approach with the new cosine similarity score (k-NN-CSNew) outperforms k-NN with Euclidean distance (k-NN-ED) and k-NN with the traditional cosine similarity score (k-NN-CSOld) across different numbers of neighbors k. [en_US]
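The combination the abstract describes (log-scaled TF-IDF weighting followed by k-NN classification with a cosine similarity score) can be sketched roughly as follows. This is an illustrative approximation only, not the thesis's actual implementation: the (1 + log tf) * log(N/df) weighting, the toy corpus, the labels, and k = 3 are all assumptions made for demonstration.

```python
# Illustrative sketch (NOT the thesis's exact method): log-scaled TF-IDF
# weighting, then k-NN classification by cosine similarity with majority vote.
import math
from collections import Counter

def tfidf_log(docs):
    """Weight term t in doc d as (1 + log tf) * log(N / df); docs are token lists."""
    N = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    vocab = sorted(df)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([(1 + math.log(tf[t])) * math.log(N / df[t]) if tf[t] else 0.0
                     for t in vocab])
    return vecs, vocab

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Label the query by majority vote among its k most cosine-similar neighbors."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(train_vecs[i], query_vec),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-topic corpus; the last document is the query to classify.
docs = [["ball", "goal", "team"], ["goal", "match", "team"],
        ["cpu", "gpu", "code"], ["code", "bug", "gpu"],
        ["goal", "ball", "match"]]
vecs, vocab = tfidf_log(docs)
print(knn_predict(vecs[:4], ["sport", "sport", "tech", "tech"], vecs[4], k=3))
```

In this sketch the query shares weighted terms only with the "sport" documents, so the top-ranked neighbors vote it into that class; the thesis's k-NN-CSNew variant modifies the similarity coefficient itself, which is not reproduced here.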
dc.identifier.uri: http://hdl.handle.net/123456789/3363
dc.language.iso: en [en_US]
dc.publisher: Universiti Sains Malaysia [en_US]
dc.subject: Dokumen (Document) [en_US]
dc.title: An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification [en_US]
dc.type: Thesis [en_US]