Filtered Distance Matrix For Constructing High-Throughput Multiple Sequence Alignment On Protein Data

dc.contributor.authorMOHAMMAD ABU-HASHEM, MUHANNAD ABDUL-QADER
dc.date.accessioned2016-05-11T03:03:17Z
dc.date.available2016-05-11T03:03:17Z
dc.date.issued2015-02
dc.description.abstractMultiple sequence alignment (MSA) is a significant process in computational biology and bioinformatics. Optimal MSA is an NP-hard problem, while building optimal alignment using dynamic programming is an NP complete problem. Although numerous algorithms have been proposed for MSA, producing an efficient MSA with high accuracy remains a huge challenge. Computing the MSA of a large dataset may take hours. Progressive alignment method is broadly used for constructing MSA. It uses guide trees as an input to guide the alignment process. Pair-wise alignment plays a significant role in building the distance matrices where distance matrices are necessary for building the guide trees. Robust distance matrix leads to better MSA. In this research, HashTable-N-Gram-Hirschberg (HT-NGH) and Filtered Distance Matrix for building MSA (FDM-MSA) to construct MSA methods are presented. HT-NGH is an extension to the N-Gram-Hirschberg (NGH) and Hashing-N-Gram-Hirschberg (H-NGH) pair-wise alignment methods; it uses the hash table capabilities to enhance the transformation phase. FDM-MSA is divided into four phases: constructing the distance matrix, building the filtering system, building the guide tree, and constructing the MSA. HT-NGH is used to build the distance matrix. Two sequence detectors are involved in building the filtering system: multi-domain detector and outlier detector. After filtering the distance matrix, Neighbor Joining and progressive alignment methods are employed to construct the MSA. The experiments show that the HT-NGH algorithm outperforms xxi the NGH and H-NGH methods by 60% and 30% respectively in terms of time, while it outperforms the H-NGH algorithm in terms of accuracy by producing the same results as the NGH algorithm. On the other hand, experiments on the FDM-MSA algorithm show improved performance in both terms; time and accuracy. FDM-MSA algorithm obtains the best time performance over all competitive methods in most datasets (total time in seconds on Balibase =1087, IRMbase=163, SABMARK=81, and Oxbench=83), as well as obtains the highest Sum-of-Pairs Score on RV2 (SP score = 0.9437) dataset of BAlibase dataset and the second best Total Column score on average.en_US
dc.identifier.urihttp://hdl.handle.net/123456789/2028
dc.subjectFiltered Distance Matrix For Constructing High-Throughputen_US
dc.subjectMultiple Sequence Alignment On Protein Dataen_US
dc.titleFiltered Distance Matrix For Constructing High-Throughput Multiple Sequence Alignment On Protein Dataen_US
dc.typeThesisen_US
Files
License bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission
Description: