EST library analysis

Analytic Tools

Supplemental Results referenced in the manuscript

Supplemental Figure 1. Mouse libraries: median 3'-UTR length vs. L-divergence from the reference set

In this figure, we show the L-divergence between the transcript length distribution of the ENSEMBL reference transcript set and that of each EST library for which we could determine at least 50 unique putative 3'-processing sites. We plot the median 3'-UTR length (based on the putative 3'-processing sites) as a function of the L-divergence between the EST library and the ENSEMBL reference transcript set. As shown, the scatter plot segregates into two classes, with L = 0.15 as an approximate separation. All of the libraries with L > 0.15 come from the NIH Brain Molecular Anatomy Project, a somewhat expected result since many of these EST libraries were constructed with very specific and extreme selections on cDNA insert size. For the libraries with L < 0.15, the data reveal a significant correlation (r = 0.61), indicating a general lengthening of the 3'-UTR distribution with increasing divergence from the reference set. This result is consistent with a depletion or absence of short transcripts (and therefore short 3'-UTR sequences) in many EST libraries. In further support of this depletion, we examined a number of comparative plots such as the one shown in Figure 1 of the manuscript, together with the annotation of the associated EST libraries (data not shown). The plots showed a general depletion of short (< 500 nucleotides) transcripts in the transcript length distribution, and the annotations indicated removal of clones with short cDNA inserts.
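The quantities plotted in this figure can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: that "L-divergence" refers to Lin's symmetric divergence, L(p, q) = K(p, m) + K(q, m) with m = (p + q) / 2, computed in bits, and that transcript lengths are histogrammed into fixed-width bins before comparison. The bin width, the function names, and the example numbers are all hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter


def length_histogram(lengths, bin_size=500):
    """Normalized histogram: bin index -> fraction of transcripts.

    The bin width is an assumption; the text does not state one for this figure.
    """
    counts = Counter(length // bin_size for length in lengths)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def l_divergence(p, q):
    """Assumed definition: Lin's L-divergence, K(p, m) + K(q, m), m = (p + q) / 2.

    With base-2 logarithms the value is 0 for identical distributions
    and 2 for distributions with no overlapping bins.
    """
    total = 0.0
    for b in set(p) | set(q):
        pb, qb = p.get(b, 0.0), q.get(b, 0.0)
        mb = (pb + qb) / 2
        if pb > 0:
            total += pb * math.log2(pb / mb)
        if qb > 0:
            total += qb * math.log2(qb / mb)
    return total


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up transcript lengths (nucleotides), for illustration only.
    reference = [400, 800, 1200, 1600, 2400]
    library = [1500, 1800, 2100, 2600, 3200]
    L = l_divergence(length_histogram(reference), length_histogram(library))
    print(f"L-divergence: {L:.3f}")

    # Made-up (L, median 3'-UTR length) points, one per library with L < 0.15.
    points = [(0.02, 310), (0.05, 350), (0.08, 340), (0.12, 420)]
    r = pearson_r([x for x, _ in points], [y for _, y in points])
    print(f"Pearson r: {r:.2f}")
```

Under this assumed definition, a library depleted of short transcripts shifts probability mass toward longer bins, which raises L relative to the reference set.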

Supplemental Figure 2. Mouse: 3'-UTR length distribution as a function of transcript length, from PACdb

In this figure, we display a heatmap representation of the dependence of 3'-UTR length on approximate transcript length, based on high-confidence mouse 3'-processing sites extracted from PACdb. Since 5'-UTRs are roughly constant in size and relatively short, the sum of CDS and 3'-UTR lengths was used as a surrogate for transcript length. As shown, the distribution of 3'-UTR lengths with respect to transcript length is far from uniform, with longer transcripts preferentially associating with long 3'-UTR sequences. Accurate assessment of the 3'-UTR length distribution and, more importantly, comparison of distributions between distinct EST libraries will therefore be highly affected by the transcript sampling in the EST library. Both transcript and 3'-UTR lengths were separated into 50-nucleotide bins. The heatmap displays the number of transcripts observed in PACdb for each pair of 3'-UTR length (y-coordinate) and transcript length (x-coordinate) bins.
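The binning behind such a heatmap can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the figure: the input format (pairs of CDS and 3'-UTR lengths), the function name, and the example records are hypothetical. Only the 50-nucleotide bin width and the use of CDS + 3'-UTR length as a surrogate for transcript length come from the text.

```python
from collections import Counter

BIN_SIZE = 50  # nucleotides, as stated in the figure description


def heatmap_counts(records, bin_size=BIN_SIZE):
    """Count transcripts per (transcript-length bin, 3'-UTR-length bin).

    `records` is a hypothetical input: an iterable of
    (cds_length, utr3_length) pairs in nucleotides.
    Keys of the result are (x_bin, y_bin) = (transcript bin, 3'-UTR bin).
    """
    counts = Counter()
    for cds_len, utr_len in records:
        # CDS + 3'-UTR length stands in for the full transcript length.
        transcript_len = cds_len + utr_len
        counts[(transcript_len // bin_size, utr_len // bin_size)] += 1
    return counts


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [(300, 120), (310, 130), (900, 1200)]
    for (x_bin, y_bin), n in sorted(heatmap_counts(records).items()):
        print(f"transcript {x_bin * BIN_SIZE}-{(x_bin + 1) * BIN_SIZE - 1} nt, "
              f"3'-UTR {y_bin * BIN_SIZE}-{(y_bin + 1) * BIN_SIZE - 1} nt: {n}")
```

Because the 3'-UTR is part of the surrogate transcript length, a 3'-UTR bin can never exceed its transcript bin, so only cells on or below the diagonal of the grid can be populated.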