Evaluating the suitability of Normalized Google Similarity and Individual Match Ratio Average as measures for protein similari

Date

2008-06

Authors

Lee, Jun Choi

Abstract

Biological sequence comparison faces various challenges. Although dynamic programming is claimed to be the optimal solution for the comparison process, its computational limitations and some fundamental challenges still make it inefficient for mass sequence comparison. Statistical methods explore the statistics of sequences through the frequency of words or partitions in the sequence; they not only provide a solution without loss of statistical information but also address some of the fundamental problems in sequence comparison. Normalized Google Distance is a way of finding semantic similarity in web pages, with significant related characteristics. In this study, the suitability of Normalized Google Similarity and Individual Match Ratio Average for representing the statistical significance of proteins in protein sequence comparison is studied. The potential of the proposed similarity measurements is evaluated through correlation coefficient and accuracy, with FASTA as the reference benchmark. This study shows that protein similarity measurement based on overlapping K-tuples gives better overall results than measurement based on non-overlapping K-tuples. Both Normalized Google Similarity and Individual Match Ratio Average show capability in representing protein sequence comparison.
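As a rough illustration of two ideas named in the abstract, the sketch below splits a sequence into overlapping and non-overlapping K-tuples and evaluates the standard Normalized Google Distance formula from raw counts. The function names, the example sequence, and the counts are hypothetical, and the abstract does not state how the thesis maps K-tuple frequencies onto the formula or how Normalized Google Similarity and Individual Match Ratio Average are defined, so this should not be read as the author's method.

```python
import math


def k_tuples(sequence, k, overlapping=True):
    """Split a sequence into words of length k (hypothetical helper).

    Overlapping tuples advance one residue at a time; non-overlapping
    tuples advance k residues, dropping any incomplete trailing word.
    """
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from counts of x, y, their co-occurrence, and total n.

    Assumes n is larger than both fx and fy, so the denominator is positive.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf  # no occurrences or no co-occurrence: maximally distant
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical toy sequence, not taken from the thesis.
seq = "MKVLAAGIV"
overlapping_words = k_tuples(seq, 3)
non_overlapping_words = k_tuples(seq, 3, overlapping=False)

# Hypothetical counts chosen only to exercise the formula.
identical_usage = normalized_google_distance(10, 10, 10, 100)
partial_usage = normalized_google_distance(100, 10, 10, 1000)
```

Overlapping extraction yields many more words from the same sequence than non-overlapping extraction, which gives a frequency-based measure more counts to work with.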

Keywords

Normalized Google Distance is a way of finding semantic similarity in web pages

URI

http://hdl.handle.net/123456789/2921

Collections

Pusat Pengajian Sains Komputer - Tesis
