Comments, Questions to Joel Graber


Analysis for 3' pairs of ORFs

Complete list of overlapping pairs of ORFs


We applied the S. cerevisiae 3'-processing site DSM to a set of 429 pairs of ORFs that were previously identified as problematic because their projected coding sequences (CDS) overlapped (Smith, T. F. and X. Zhang (1997). Nat Biotechnol 15(12): 1222-3). The table at the bottom of this page lists the ORF identifiers, the common name where one exists, and the base-10 logarithm of the maximum DSM score found in the 500 nt immediately downstream of the projected CDS. In many of the pairs, the log10(maximum) values of the two ORFs differ by at least 1.0, meaning that the "better" site has a DSM score at least an order of magnitude higher.
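As a rough illustration of the comparison described above, the following Python sketch takes the DSM scores in the downstream window of each ORF in a pair, computes the log10 of each maximum, and flags pairs whose values differ by at least 1.0. The function names and the input scores are hypothetical and chosen only for illustration; they are not taken from the actual analysis code or data.

```python
import math

def log10_max_score(scores):
    """Return log10 of the largest score in a window of per-position scores.

    `scores` is assumed to hold the (positive) DSM scores for the 500 nt
    downstream of one ORF's projected CDS.
    """
    return math.log10(max(scores))

def pair_summary(scores_a, scores_b, threshold=1.0):
    """Return (log10 max of A, log10 max of B, whether they differ by >= threshold)."""
    log_a = log10_max_score(scores_a)
    log_b = log10_max_score(scores_b)
    return log_a, log_b, abs(log_a - log_b) >= threshold

# Hypothetical per-position scores (illustrative values, not real data).
orf_a = [0.002, 0.5, 0.03]
orf_b = [0.001, 0.004, 0.01]
log_a, log_b, differs = pair_summary(orf_a, orf_b)
print(f"log10(max) A = {log_a:.2f}, B = {log_b:.2f}, differ by >= 1.0: {differs}")
```

A difference of 1.0 on the log10 scale corresponds to a tenfold ratio between the two maximum scores, which is why it is read as an order of magnitude.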

For each pair, we separated the two maximum scores into a "higher" and a "lower" value. The figure below shows the distributions of the two resulting sets of scores: the lower score from each pair is plotted as a thin blue line, and the higher score as a thick blue line. For comparison, the distribution for all ORFs is plotted as a thick black line.
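The split into "higher" and "lower" scores, and the binning needed to draw a distribution, could be sketched as follows. This is a minimal example under assumed inputs: each pair is given as a tuple of two log10(maximum) values, and the bin range and bin count are arbitrary choices rather than the ones used for the figure.

```python
def split_pairs(pairs):
    """Split (score_a, score_b) tuples into lists of higher and lower scores."""
    higher = [max(a, b) for a, b in pairs]
    lower = [min(a, b) for a, b in pairs]
    return higher, lower

def histogram(values, lo, hi, n_bins):
    """Return the fraction of values falling in each of n_bins equal-width bins.

    Values outside [lo, hi] are ignored; a value equal to hi goes in the last bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical log10(max) values for two pairs (not real data).
pairs = [(-1.0, -3.0), (-2.5, -0.5)]
higher, lower = split_pairs(pairs)
lower_dist = histogram(lower, lo=-4.0, hi=0.0, n_bins=4)
```

Normalizing each histogram to fractions lets sets of different sizes (the overlapping pairs and the full ORF set) be compared on the same axes.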


For further comparison, we wanted to see whether the distribution could be explained simply by the variation in the scores of all ORFs. We therefore created two control sets of scores. In the first set, represented by the red line plots above (thin = low score and thick = high score), we randomly selected both members of each pair from the complete set of 6,281 predicted ORFs for S. cerevisiae. In the second set (green plots above), we randomly selected only the first ORF in each pair and used an adjacent ORF in the genome as the second. (This more closely mimics the set of overlapping pairs.) As can be seen, the two sets of randomly selected pairs have very similar distributions. The lower-scoring ORFs from the overlapping pairs clearly are not a random draw from the complete population of all ORFs, which we believe indicates that at least some of these predicted genes are spurious ORFs that are not actively transcribed.
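The two control sets described above might be constructed along these lines. This is only a sketch: it assumes the ORF identifiers are held in a list ordered by genomic position, treats "adjacent" as the immediately preceding or following entry in that list, and ignores chromosome boundaries. None of these details are specified in the text, and the identifiers below are placeholders.

```python
import random

def random_pairs(orf_ids, n_pairs, rng):
    """Control 1: both members of each pair are drawn at random (distinct ORFs)."""
    return [tuple(rng.sample(orf_ids, 2)) for _ in range(n_pairs)]

def adjacent_pairs(ordered_orf_ids, n_pairs, rng):
    """Control 2: a random first ORF paired with a neighbor in genomic order.

    Assumes `ordered_orf_ids` is sorted by genomic position; the neighbor is
    the previous or next entry (an assumption made for this sketch).
    """
    last = len(ordered_orf_ids) - 1
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(ordered_orf_ids))
        if i == 0:
            j = 1
        elif i == last:
            j = last - 1
        else:
            j = i + rng.choice((-1, 1))
        pairs.append((ordered_orf_ids[i], ordered_orf_ids[j]))
    return pairs

# Placeholder identifiers standing in for the full ORF list.
rng = random.Random(0)  # fixed seed so the example is reproducible
orfs = [f"ORF{i}" for i in range(20)]
control_random = random_pairs(orfs, 5, rng)
control_adjacent = adjacent_pairs(orfs, 5, rng)
```

Each control set would then be split into higher and lower scores in the same way as the overlapping pairs, so that the resulting distributions can be compared directly.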