An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification

dc.contributor.author: Kadhim, Ammar Ismael
dc.date.accessioned: 2017-01-09T07:24:17Z
dc.date.available: 2017-01-09T07:24:17Z
dc.date.issued: 2016-03
dc.description.abstract: Automatic text classification is important because of the increased availability of digital documents and the consequent need to organize them. Current state-of-the-art statistical modeling approaches do not provide sufficiently useful information on the topics for each feature and category. Furthermore, feature extraction using traditional term frequency-inverse document frequency (TF-IDF) results in the identification of too many categories for a particular document. In terms of classification, current k-NN approaches with Euclidean distance and cosine similarity score produce a wide variance in performance. To address these issues, this study classifies topics for short and long texts using a new method for the main stages (i.e., feature extraction and text classification). The study introduces TF-IDF with logarithm and k-NN with a new cosine similarity score for feature extraction and classification, respectively. Moreover, the factors that affect the performance of supervised machine learning (ML) are identified. For short texts, three datasets of different sizes are collected via an API (with 2,196; 5,534; and 10,186 tweets, respectively). For long texts, the Reuters-21578 test collection is used. The experiments show that TF-IDF with logarithm improves feature extraction, with average F1-measures of 92.36%, 93.04%, and 93.60% for the 2,196-, 5,534-, and 10,186-tweet datasets, respectively, for short texts, and 92.53% for long texts. For dimension reduction (DR), four different cases are applied to each dataset for both short and long texts. Subsequently, for text classification, the proposed k-NN approach with the new cosine similarity score (k-NN-CSNew) outperforms k-NN with Euclidean distance (k-NN-ED) and k-NN with the traditional cosine similarity score (k-NN-CSOld) across different numbers of neighbors k. [en_US]
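The combination the abstract describes (log-scaled TF-IDF weighting followed by k-NN classification with a cosine similarity score) can be sketched roughly as follows. This is an illustrative approximation only, not the thesis's actual implementation: the (1 + log tf) * log(N/df) weighting, the toy corpus, the labels, and k = 3 are all assumptions made for demonstration.

```python
# Illustrative sketch (NOT the thesis's exact method): log-scaled TF-IDF
# weighting, then k-NN classification by cosine similarity with majority vote.
import math
from collections import Counter

def tfidf_log(docs):
    """Weight term t in doc d as (1 + log tf) * log(N / df); docs are token lists."""
    N = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    vocab = sorted(df)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([(1 + math.log(tf[t])) * math.log(N / df[t]) if tf[t] else 0.0
                     for t in vocab])
    return vecs, vocab

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train_vecs, train_labels, query_vec, k=3):
    """Label the query by majority vote among its k most cosine-similar neighbors."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(train_vecs[i], query_vec),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-topic corpus; the last document is the query to classify.
docs = [["ball", "goal", "team"], ["goal", "match", "team"],
        ["cpu", "gpu", "code"], ["code", "bug", "gpu"],
        ["goal", "ball", "match"]]
vecs, vocab = tfidf_log(docs)
print(knn_predict(vecs[:4], ["sport", "sport", "tech", "tech"], vecs[4], k=3))
```

In this sketch the query shares weighted terms only with the "sport" documents, so the top-ranked neighbors vote it into that class; the thesis's k-NN-CSNew variant modifies the similarity coefficient itself, which is not reproduced here.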
dc.identifier.uri: http://hdl.handle.net/123456789/3363
dc.language.iso: en [en_US]
dc.publisher: Universiti Sains Malaysia [en_US]
dc.subject: Dokumen (Document) [en_US]
dc.title: An Improved K-Nearest Neighbors Approach Using Modified Term Weighting And Similarity Coefficient For Text Classification [en_US]
dc.type: Thesis [en_US]