Filtered Distance Matrix For Constructing High-Throughput Multiple Sequence Alignment On Protein Data

Date

2015-02

Authors

MOHAMMAD ABU-HASHEM, MUHANNAD ABDUL-QADER

Abstract

Multiple sequence alignment (MSA) is a significant process in computational biology and bioinformatics. Optimal MSA is an NP-hard problem, while building optimal alignment using dynamic programming is an NP complete problem. Although numerous algorithms have been proposed for MSA, producing an efficient MSA with high accuracy remains a huge challenge. Computing the MSA of a large dataset may take hours. Progressive alignment method is broadly used for constructing MSA. It uses guide trees as an input to guide the alignment process. Pair-wise alignment plays a significant role in building the distance matrices where distance matrices are necessary for building the guide trees. Robust distance matrix leads to better MSA. In this research, HashTable-N-Gram-Hirschberg (HT-NGH) and Filtered Distance Matrix for building MSA (FDM-MSA) to construct MSA methods are presented. HT-NGH is an extension to the N-Gram-Hirschberg (NGH) and Hashing-N-Gram-Hirschberg (H-NGH) pair-wise alignment methods; it uses the hash table capabilities to enhance the transformation phase. FDM-MSA is divided into four phases: constructing the distance matrix, building the filtering system, building the guide tree, and constructing the MSA. HT-NGH is used to build the distance matrix. Two sequence detectors are involved in building the filtering system: multi-domain detector and outlier detector. After filtering the distance matrix, Neighbor Joining and progressive alignment methods are employed to construct the MSA. The experiments show that the HT-NGH algorithm outperforms xxi the NGH and H-NGH methods by 60% and 30% respectively in terms of time, while it outperforms the H-NGH algorithm in terms of accuracy by producing the same results as the NGH algorithm. On the other hand, experiments on the FDM-MSA algorithm show improved performance in both terms; time and accuracy. FDM-MSA algorithm obtains the best time performance over all competitive methods in most datasets (total time in seconds on Balibase =1087, IRMbase=163, SABMARK=81, and Oxbench=83), as well as obtains the highest Sum-of-Pairs Score on RV2 (SP score = 0.9437) dataset of BAlibase dataset and the second best Total Column score on average.

Keywords

Filtered Distance Matrix For Constructing High-Throughput , Multiple Sequence Alignment On Protein Data

URI

http://hdl.handle.net/123456789/2028

Collections

Pusat Pengajian Sains Komputer - Tesis

Full item page