Position-dependent motif characterization using nonnegative matrix factorization

Lucie N. Hutchins, Sean Murphy, Priyam Singh, Joel H. Graber


In press at Bioinformatics

Supplemental Data And Programs

Fasta-formatted sequence files
Random sequences with implanted patterns
  • Random sequences, no patterns: T = 30, 100, 300, 1000, 3000, 10000
  • Random bg, six inserted patterns: T = 30, 100, 300, 1000, 3000, 10000
(polyadenylation) sites
  • Human (10059 sequences)
  • Mouse (8542 sequences)
D. melanogaster transcription start sites
  • All Unique Sequences
  • Supplement set 1

nmf Downloads




MOTIVATION: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on nonnegative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs.

RESULTS: Tests on artificially generated sequences show that NMF can faithfully reproduce both positioning and content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting data sets. Specifically, an analysis of mRNA 3'-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements.
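As a rough illustration of the factorization idea summarized above, the following Python sketch factors a small words-by-positions count matrix V into nonnegative factors W and H and reports the residual sum of squares (RSS) for several ranks k. It assumes the standard Lee and Seung multiplicative updates for the squared-error objective. The function names, the toy matrix, and the iteration count are hypothetical and are not taken from the nnmf program in this package.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def rss(v, w, h):
    """Residual sum of squares between V and the product W*H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, k, iterations=200, seed=0, eps=1e-12):
    """Factor V (n x m) into W (n x k) and H (k x m), all entries nonnegative.

    Uses multiplicative updates (an assumption of this sketch); eps guards
    against division by zero. Returns W, H and the RSS after each iteration.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    history = [rss(v, w, h)]
    for _ in range(iterations):
        wt = transpose(w)
        num = matmul(wt, v)               # k x m
        den = matmul(matmul(wt, w), h)    # k x m
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num = matmul(v, ht)               # n x k
        den = matmul(w, matmul(h, ht))    # n x k
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
        history.append(rss(v, w, h))
    return w, h, history

# Toy data (hypothetical): 4 words counted at 6 positions, built from
# two hidden patterns that occupy different positions.
true_w = [[3, 0], [2, 0], [0, 4], [0, 1]]
true_h = [[1, 2, 1, 0, 0, 0], [0, 0, 0, 1, 3, 1]]
v = matmul(true_w, true_h)

for k in (1, 2, 3):
    _, _, history = nmf(v, k)
    print("k =", k, "final RSS =", round(history[-1], 4))
```

On data of this kind, the final RSS typically drops sharply until k reaches the number of implanted patterns and then flattens, which is the kind of variation the abstract refers to when estimating the number of motifs.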

Software and System Requirements

System requirements: Linux/Unix
Operating System: A reasonably modern 32/64-bit Linux environment is required.
Computer Processor: 1.5 GHz or better
Computer Memory: 1 GB or more
Software requirements
  1. gsl package installed
  2. g++ compiler
  3. gnuplot installed and included in your command line path
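Before running the setup script, it can help to confirm that the tools listed above are reachable from the command line. The sketch below is a hypothetical helper, not the check_libraries script shipped with this package. It assumes that GSL's gsl-config helper script is on the PATH when GSL is installed, which may not hold on every system.

```python
import shutil

# Assumed command names: g++ and gnuplot come from the list above;
# gsl-config is used here only as a hint that GSL is installed.
REQUIRED_TOOLS = ("g++", "gnuplot", "gsl-config")

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```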

Files included in this package

  1. setup script (to compile nmf)
  2. Configuration file (pipeline_arg_config)
  3. Install shell script file (check_libraries)
  4. source code files of:
    1. fa2Stats
    2. WindowCount
    3. nmfSmoothAndPseudoCounts
    4. nnmf
    5. nmfSortMatrix
    6. nmfBuildMotifs
    7. nmfMotifsToModels
    8. pwmToExamples
    9. seqlogo
    10. nmfWebPage
  5. html main page (index.html)

Third-Party Software

weblogo package
WebLogo is a web based application designed to make the generation of sequence logos as easy and painless as possible. Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike Stephens. Each logo consists of stacks of symbols, one stack for each position in the sequence. The overall height of the stack indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino or nucleic acid at that position. In general, a sequence logo provides a richer and more precise description of, for example, a binding site, than would a consensus sequence.
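The stack heights described above can be sketched in a few lines of Python. This is a minimal illustration that assumes a four-letter nucleotide alphabet and omits any small-sample correction, so the values may differ from what WebLogo itself draws. The function name logo_column_heights is hypothetical and is not part of the weblogo package.

```python
import math
from collections import Counter

def logo_column_heights(column, alphabet="ACGT"):
    """Return (stack_height, symbol_heights) in bits for one alignment column.

    stack_height is log2(alphabet size) minus the Shannon entropy of the
    column; each symbol's height is its relative frequency times that total.
    """
    counts = Counter(column)
    total = sum(counts[s] for s in alphabet)
    if total == 0:
        raise ValueError("column contains no symbols from the alphabet")
    freqs = {s: counts[s] / total for s in alphabet}
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    stack_height = math.log2(len(alphabet)) - entropy
    symbol_heights = {s: p * stack_height for s, p in freqs.items()}
    return stack_height, symbol_heights

# A fully conserved column gives the maximum stack height for DNA (2 bits);
# a column with all four bases equally frequent gives a height of 0.
conserved, _ = logo_column_heights("AAAA")
uniform, _ = logo_column_heights("ACGT")
print(conserved, uniform)
```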

WebLogo requires a recent version of gs (Ghostscript) to create PNG and PDF output, and ImageMagick's convert to create GIFs.

  • ./seqlogo -F PDF -f globin.fasta > globin.pdf
  • ./seqlogo -F PNG -f globin.fasta > globin.png
  • ./seqlogo -F GIF -f globin.fasta > globin.gif

gsl package
The GNU Scientific Library (GSL, http://directory.fsf.org/project/gsl/) is a collection of routines for numerical computing. The routines are written from scratch by the GSL team in ANSI C and present a modern API for C programmers, while allowing wrappers to be written for very high-level languages. The library is stable and available for general use.

gnuplot package
Gnuplot (http://sourceforge.net/project/showfiles.php?group_id=2055) is a portable, command-line driven, interactive data and function plotting utility for UNIX, IBM OS/2, MS Windows, DOS, Macintosh, VMS, Atari and many other platforms. The software is copyrighted but freely distributed (i.e., you don't have to pay for it). It was originally intended to allow scientists and students to visualize mathematical functions and data.