K-means clustering of hexamers using sequences that extended 40 nt downstream of the 3'-processing site.

We restricted the clustering analysis to hexamers that occur at least 0.05 times per sequence, so that our models would reflect general organism-wide trends rather than rare functional forms. This cutoff produced an average of 80 hexamers per species, ranging from 48 in A. gambiae to 113 in C. elegans. The hexamers above this threshold were k-means clustered using the CPAN Algorithm::Cluster module. Each run used 3 clusters, 300 iterations, and Pearson's correlation as the distance measure from the centroid mean.
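The filtering and clustering steps can be sketched as follows. This is a minimal pure-Python illustration, not the Algorithm::Cluster implementation used for the analysis. The function names, the representation of each hexamer as a vector of per-position counts, the reading of the cutoff as total occurrences divided by the number of sequences, and the deterministic farthest-first initialization are all assumptions made for the example.

```python
import math
from collections import Counter


def frequent_hexamers(seqs, k=6, min_per_seq=0.05):
    """Return k-mers with (total occurrences / number of sequences) >= cutoff."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    n = len(seqs)
    return {w for w, c in counts.items() if c / n >= min_per_seq}


def positional_profile(seqs, word):
    """Count how often `word` starts at each position (equal-length sequences assumed)."""
    k = len(word)
    length = len(seqs[0]) - k + 1
    profile = [0] * length
    for s in seqs:
        for i in range(length):
            if s[i:i + k] == word:
                profile[i] += 1
    return profile


def pearson_distance(x, y):
    """1 - Pearson correlation; defined as 1.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / math.sqrt(sxx * syy)


def kmeans(vectors, k=3, iterations=300):
    """K-means with Pearson distance to the centroid mean; returns one label per vector."""
    # Assumption: farthest-first seeding keeps the sketch deterministic.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(pearson_distance(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: pearson_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Because Pearson distance compares the shape of a profile rather than its magnitude, an abundant hexamer and a rare one with the same positional bias fall into the same cluster.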
The three groups resemble the tetramer groups with regard to positioning, although with some caveats. In the vertebrate sequences, the positioning biases of the three groups (reading left to right) can generally be described as corresponding to the UG-rich, composite U/UG-rich, and U-rich tetramer elements.
We interpret the composite cluster as a consequence of the sequence degeneracy, sequence similarity, and partial positioning overlap of the UG- and U-rich elements. For instance, specific hexamers such as GUGUUU may be compatible with the sequence requirements of both the UG- and U-rich elements. In D. melanogaster and A. gambiae, there does not appear to be a composite element, but rather distinct UG- and U-rich elements, along with a more distal A-rich element (group C in both species).
This A-rich element marks the end of the U-rich DSE region and a return to background nucleotide frequencies. In C. elegans, no UG-rich element is apparent, while the U-rich element extends over a broader range of positions, as was indicated by the tetramer analysis. As in the arthropods, a distal A-rich element appears as the third grouping. The Sequence Logo representations were generated by optimizing the alignments of the PWC vectors for each word in a cluster, as described in Methods. The reported Logos extend beyond six nucleotides because the positioning of each hexamer in the logo is set according to the calculated minimum distance offset of its aligned PWC vector.
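To illustrate why offset alignment widens a logo beyond six columns, the following sketch finds, for one positional vector, the shift against a reference vector that minimizes the Pearson distance over the overlapping positions. It is not the procedure from Methods: the names `best_offset` and `logo_width`, the choice of a single reference vector, and the `max_shift` window are hypothetical.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation; defined as 1.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0
    return 1.0 - sxy / math.sqrt(sxx * syy)


def best_offset(reference, vector, max_shift=3):
    """Shift s such that vector[j] best matches reference[j + s] (minimum distance)."""
    best_dist, best_shift = float("inf"), 0
    for shift in range(-max_shift, max_shift + 1):
        # Pair each reference position with the vector position it overlaps.
        pairs = [
            (reference[i], vector[i - shift])
            for i in range(len(reference))
            if 0 <= i - shift < len(vector)
        ]
        if len(pairs) < 2:
            continue  # too little overlap to correlate
        xs, ys = zip(*pairs)
        dist = pearson_distance(xs, ys)
        if dist < best_dist:
            best_dist, best_shift = dist, shift
    return best_shift


def logo_width(offsets, k=6):
    """Columns needed to place every k-mer at its own offset."""
    return max(offsets.values()) - min(offsets.values()) + k
```

Under these assumptions, three hexamers placed at offsets -1, 0, and +1 occupy eight logo columns rather than six.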