Complete list of overlapping pairs of ORFs
We applied the S. cerevisiae 3'-processing site DSM
to a set of 429 pairs of ORFs that were previously identified
as problematic because their projected coding sequences (CDS)
overlapped (Smith, T. F. and X. Zhang (1997). Nat Biotechnol
15(12): 1222-3). In the table at the bottom of this
page, we list each ORF identifier, along with the common name
if one exists, and the log10 of the maximum DSM score found
in the 500 nt immediately downstream of the projected CDS.
In many of the pairs, the log10 maxima of the two ORFs differ
by at least 1.0, meaning the "better" site has a DSM score at
least an order of magnitude higher.
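The per-ORF value and the pair comparison described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: it assumes the DSM has already produced one positive score per downstream position, and the function names and the list-of-floats input are hypothetical.

```python
import math

WINDOW_NT = 500  # downstream region length stated in the text


def log10_max_score(downstream_scores, window=WINDOW_NT):
    """Return log10 of the largest score in the first `window` positions.

    `downstream_scores` is assumed to hold one DSM score per nucleotide,
    starting immediately after the projected CDS.
    """
    region = downstream_scores[:window]
    if not region:
        raise ValueError("no downstream scores supplied")
    peak = max(region)
    if peak <= 0:
        # Assumption: DSM scores are positive, so the logarithm is defined.
        raise ValueError("DSM scores are assumed to be positive")
    return math.log10(peak)


def differs_by_order_of_magnitude(log_a, log_b, threshold=1.0):
    """True when two log10 maxima differ by at least `threshold`."""
    return abs(log_a - log_b) >= threshold
```

Because the values are already on a log10 scale, a difference of 1.0 corresponds to a tenfold ratio between the two raw maxima.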
We separated the maximum scores for each pair into "higher"
and "lower" values. In the figure below, we plot the distribution
of the two sets of scores. The distribution of the lower scores
is plotted as the thin blue line; the distribution of the higher
scores is plotted as the thick blue line. For comparison, the
distribution for all ORFs is plotted as the thick black line.
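As a rough sketch of this step, the pairs can be split into "higher" and "lower" lists and each list binned into a normalized histogram for plotting. The bin width of 0.5 log10 units and both function names are assumptions made for illustration; the binning actually used for the figure is not stated here.

```python
import math
from collections import Counter


def split_pair_scores(pairs):
    """Split (score_a, score_b) pairs into higher and lower score lists."""
    higher = [max(a, b) for a, b in pairs]
    lower = [min(a, b) for a, b in pairs]
    return higher, lower


def binned_fraction(values, bin_width=0.5):
    """Map each bin's left edge to the fraction of values falling in it.

    `bin_width` is a hypothetical choice, in log10 units.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = len(values)
    return {idx * bin_width: n / total for idx, n in sorted(counts.items())}
```

Normalizing to fractions rather than raw counts lets the 429 overlapping pairs be compared directly with the much larger set of all ORFs.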
For further comparison, we asked whether these distributions
could be explained simply by the variation in the scores of
all ORFs. We therefore created two control sets of scores.
In the first set, represented by the red line plots above (thin
= low score and thick = high score), we randomly selected pairs
from the complete set of 6,281 predicted ORFs for S. cerevisiae.
In the second set (green plots above), we randomly selected
only the first ORF in each pair and used an adjacent ORF in
the genome as the second, which more closely mimics the set
of overlapping pairs. As can be seen, the two sets
of randomly selected pairs have very similar distributions.
The lower-scoring ORFs from the overlapping pairs clearly are
not a random sample of the complete population of all ORFs,
which we believe indicates that at least some of these
predicted genes are spurious ORFs that are not actively transcribed.
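The two control sets described above could be built along the following lines. This is only a sketch under stated assumptions: ORFs are given as a list in genome order, "adjacent" is taken to mean the next entry in that list (chromosome boundaries and strand are ignored), and sampling details such as replacement are not specified in the text. The function names are hypothetical.

```python
import random


def random_pairs(orf_ids, n_pairs, rng):
    """Control 1: both members of each pair drawn at random.

    The two members of a pair are distinct; the same ORF may still
    appear in more than one pair.
    """
    return [tuple(rng.sample(orf_ids, 2)) for _ in range(n_pairs)]


def adjacent_pairs(ordered_orf_ids, n_pairs, rng):
    """Control 2: random first ORF, paired with its neighbour.

    Assumption: the neighbour is the next ORF in `ordered_orf_ids`,
    which is expected to be sorted by genome position.
    """
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(ordered_orf_ids) - 1)
        pairs.append((ordered_orf_ids[i], ordered_orf_ids[i + 1]))
    return pairs
```

Passing an explicit `random.Random` instance as `rng` keeps the selection reproducible, so the same control pairs can be regenerated when the plots are redrawn.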