Abstract:
The development of computers and of the theory of doubly stochastic processes has led to a wide variety of applications of hidden Markov models (HMMs). Because of their computational efficiency, discrete HMMs are often preferred. HMMs offer a flexible way of representing events with temporal and dynamical variation. Both kinds of variation arise in alignment and gene finding problems, which are of increasing interest in computational biology. The continued rise in attention to and financial support for public sequencing projects, combined with improvements in sequencing techniques, has led to exponential growth in the amount of biological sequence data available. This poses a huge challenge to the bioinformatics community, which tries to interpret these data. One of its primary goals is to reveal the structure of genomes, especially the location of regions regulating or directly encoding functional elements. Amino acid sequence alignments are widely used for the analysis of protein structure, gene function prediction, inference of phylogeny, and other aspects of bioinformatics.
This study gives a mathematically uniform presentation of the theory of discrete hidden Markov models. In particular, three basic problems are considered: scoring, decoding, and estimation. To solve these problems, the forward and backward algorithms, the Viterbi algorithm, and the Baum-Welch algorithm are presented, respectively.
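As a rough illustration of the scoring and decoding problems named above, the following minimal Python sketch applies the forward algorithm and the Viterbi algorithm to a discrete HMM. The two-state, three-symbol toy model and the names `forward`, `viterbi`, `pi`, `A`, and `B` are assumptions made for this example only; they are not the models or notation of this study.

```python
def forward(obs, pi, A, B):
    """Scoring: probability of the observation sequence under the model."""
    n = len(pi)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, pi, A, B):
    """Decoding: most probable state path and its joint probability."""
    n = len(pi)
    # delta[i] = probability of the best path ending in state i
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        ptr = [max(range(n), key=lambda i: delta[i] * A[i][j])
               for j in range(n)]
        delta = [delta[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(n)]
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    best = delta[state]
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, best


# Hypothetical toy model: 2 hidden states, 3 observation symbols.
pi = [0.6, 0.4]                          # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]             # state transition probabilities
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]   # emission probabilities
obs = [0, 1, 2]                          # observed symbol indices

print(forward(obs, pi, A, B))
print(viterbi(obs, pi, A, B))
```

The sketch multiplies raw probabilities, which is adequate for a three-symbol sequence; for sequences of genomic length, log probabilities or scaling would be needed to avoid numerical underflow.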
The second objective of this study is to present an application of HMMs to alignment and gene finding problems. A model was designed to improve gene prediction quality in terms of finding exact gene boundaries; the model was evaluated on several test sets, including eight complete bacterial genomes. The model proved accurate on these test sets. An analysis of false positive and false negative predictions is presented, with the caution that these categories are not precisely defined when public database annotation is used as the control.
This study also examines the comparison of sequences over a finite alphabet and the theory of hidden Markov models. The main motivation is the application to biological sequence analysis. A central problem is to test whether two sequences violate the hypothesis of independence, using a fast statistic chosen to reflect the specific kind of violation of interest. A comprehensive search for conserved elements in the HPV5 L1 open reading frame was conducted using multiple alignments of HPV5 L1 from Sudan with four sequences from the database. Significant conserved elements were identified with the model of this study, and the variations and differences between the sequences at the amino acid level are discussed.
The gene prediction model of this study proved accurate in determining gene boundaries and is therefore recommended for that purpose. The study also recommends using the alignment model in the design of diagnostic tests, vaccines, and drug synthesis for HPV.