Parallelization of maximum clique algorithm for protein sequence clustering

Date

2008-06

Authors

Al-Qudami, Nasser Ali Mohammed

Abstract

Clustering algorithms are widely used in bioinformatics to classify biological data. Graph-based clustering is one of the methods used in protein sequences clustering. This method transforms the clustering process into a graph problem via representing the sequences as vertices and their similarities as weighted edges. In this method, a sub-graph is defined as a core cluster. The clique algorithm has been employed in the clustering method to find a maximum sub-graph (or clique) in which each vertex is linked by an edge. However, applying this clustering method on large protein databases leads to the generation of large graphs in which finding a sub-graph ( clique) of maximum size became quite difficult and time consuming. This work focuses on parallelizing a maximum clique algorithm, which uses a branch-andbound technique to find a clique in a graph, in order to enhance clique algorithm performance and simultaneously conferring a positive impact on overall clustering performance. The algorithm has been implemented using OpenMP and executed in a multi-core computer which supports the multi-threads programming. A 7.9 speedup obtained by executing the algorithm with eight threads. The main benefit of this study was its successful reduction of a 71 minute problem (or dataset consisting of 1800 protein sequences) into a 9 minute problem.

Keywords

Graph-based clustering is one of the methods , used in protein sequences clustering.

URI

http://hdl.handle.net/123456789/3041

Collections

Pusat Pengajian Sains Komputer - Tesis

Full item page