A Computational Method for Refinement of Gene 5’ End Identification in Genome Sequences Using Predicted Ribosome Binding Site and Homology Search Information

Author: Pankaj Vashith

Primary Advisor: Michel E. Brandt, PhD

Committee Members: Steven J. Norris, PhD

Master's thesis, The University of Texas School of Health Information Sciences at Houston.


Numerous microbial genome sequencing projects are in progress, but the precision of annotation remains a concern. The technique described here aims to improve gene identification for the Treponema denticola (ATCC 35405) genome sequence. Ribosome binding sites (RBS) play an important role in the translation of genes, and BLAST, a program developed by NCBI, is widely used to align similar sequences and find regions of homology. We automated both analyses by developing a new computational technique. A program called RBS identifier was written in Perl to identify Shine-Dalgarno sequences or purine-rich regions upstream of the start site. To analyze the very large output file produced by a BLAST search of an entire genome, a Perl program and a Microsoft Access database were used to filter matches by N-terminus evaluation, E value, and length ratio. RBS and BLAST analyses are often used separately for gene annotation; in our procedure, the data obtained from both processes were combined.
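The BLAST filtering step described above can be illustrated with a minimal Python sketch. It is not the Perl program or the Access queries used in the thesis. The `Hit` record, the three cutoff values, and the reading of "N-terminus evaluation" as "the alignment begins near the first residue of both proteins" are all assumptions made for illustration, since the abstract does not give these details.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST match between a predicted protein and a database protein."""
    query_id: str
    subject_id: str
    query_len: int      # length of the predicted protein, in residues
    subject_len: int    # length of the database protein, in residues
    query_start: int    # 1-based alignment start in the query
    subject_start: int  # 1-based alignment start in the subject
    evalue: float


# Hypothetical cutoffs; the source does not state the values actually used.
MAX_EVALUE = 1e-5
MIN_LENGTH_RATIO = 0.8
MAX_NTERM_OFFSET = 10


def length_ratio(hit):
    """Shorter protein length divided by the longer one (1.0 means equal)."""
    return min(hit.query_len, hit.subject_len) / max(hit.query_len, hit.subject_len)


def supports_start(hit):
    """True if the hit is significant, similar in length, and aligned
    from near the N-terminus of both the query and the subject."""
    return (
        hit.evalue <= MAX_EVALUE
        and length_ratio(hit) >= MIN_LENGTH_RATIO
        and hit.query_start <= MAX_NTERM_OFFSET
        and hit.subject_start <= MAX_NTERM_OFFSET
    )


def filter_hits(hits):
    """Keep only the hits that pass all three criteria."""
    return [h for h in hits if supports_start(h)]
```

Under these assumptions, a hit that passes the filter is treated as homology evidence that the predicted 5’ end is plausible, while an ORF with no passing hit has no such support.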

Genes that lacked a detectable RBS upstream of their start sites were selected to determine whether a new 5’ end could be identified. For this purpose we developed the Newstartfinder program, which searches for new start sites based on RBS predictions. Using this method, we were able to improve 5’ end identification for 12% of the T. denticola ORFs relative to the original Glimmer results. This total comprises the identification of potential new start sites for 9% of the original ORFs and the finding that 3% of the ORFs lacked a potential RBS and had negative BLAST results.
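The idea of checking for an RBS and then looking for an alternative start site can be sketched in Python as follows. This is an illustration only, not the RBS identifier or Newstartfinder programs themselves. The motif set, the upstream window, the purine-richness rule, the choice of start codons, and all function names are assumptions, and the sketch handles forward-strand genes only (a reverse-strand gene would first need to be reverse-complemented).

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}
SD_CORES = {"AGGA", "GGAG", "GAGG"}  # assumed 4-mers of the AGGAGG consensus
PURINES = set("AG")


def has_rbs(seq, start, window=20, gap=4, purine_len=6, purine_min=5):
    """Report whether an SD-like core or a purine-rich stretch lies upstream.

    The region examined is seq[start - window : start - gap], where `start`
    is the 0-based index of the first base of the start codon.
    """
    lo = max(0, start - window)
    hi = max(0, start - gap)
    region = seq[lo:hi]
    if any(core in region for core in SD_CORES):
        return True
    # Purine-rich: at least `purine_min` A/G bases in any `purine_len` window.
    for i in range(len(region) - purine_len + 1):
        chunk = region[i:i + purine_len]
        if sum(base in PURINES for base in chunk) >= purine_min:
            return True
    return False


def candidate_starts(seq, start, stop):
    """In-frame start codons between the nearest upstream in-frame stop
    (or the sequence edge) and the gene's stop codon at index `stop`."""
    first = start
    pos = start - 3
    while pos >= 0 and seq[pos:pos + 3] not in STOP_CODONS:
        first = pos
        pos -= 3
    return [p for p in range(first, stop, 3) if seq[p:p + 3] in START_CODONS]


def find_new_starts(seq, start, stop):
    """Alternative in-frame starts that have an upstream RBS, nearest first.

    Returns an empty list when the annotated start already has an RBS.
    """
    if has_rbs(seq, start):
        return []
    hits = [p for p in candidate_starts(seq, start, stop)
            if p != start and has_rbs(seq, p)]
    return sorted(hits, key=lambda p: abs(p - start))


# Toy sequence: an SD motif precedes an upstream ATG at index 18, while the
# annotated ATG at index 45 is preceded only by pyrimidines.
toy = "CTCTCT" + "AGGAGG" + "CTCTCT" + "ATG" + "CTC" * 8 + "ATG" + "CTC" * 4 + "TAA"
print(find_new_starts(toy, start=45, stop=60))  # -> [18]
```

In this sketch an ORF for which `find_new_starts` returns nothing has no RBS-supported start; combined with a negative BLAST result, that would correspond to the second group of ORFs described above.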