NMF analysis pipeline arguments

Required arguments (with no default values):

-i: input filename (string)

Must be the name of an existing FASTA-formatted sequence file.

-L: sequence length (integer)

The length of all sequences found in the file specified by -i.

All sequences must be the same length. Behavior is unpredictable if the sequences differ in length or if -L is specified incorrectly.
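Because a length mismatch leads to unpredictable behavior, it can help to check the input file before starting a run. The following is a minimal sketch of such a check, not part of the pipeline itself; the function names `read_fasta` and `check_lengths` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_lengths(lines, expected_length):
    """Return the headers of sequences whose length differs from expected_length."""
    return [h for h, seq in read_fasta(lines) if len(seq) != expected_length]


# Small in-memory example; with a real file, pass the open file object instead.
example = [">seq1", "ACGTAC", ">seq2", "ACG", "TAC", ">seq3", "ACGT"]
bad = check_lengths(example, 6)
print(bad)
```

An empty list means every sequence matches the value intended for -L.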

-o: output file prefix (string)

The prefix used at the start of the names of output files from all programs.

The remainder of each output filename is generated automatically according to run parameters.

The string specified by -o must be capable of being part of legal filenames.

-a: start action (integer between 1 and 11)

The step number at which the pipeline is to start.

Optional arguments (including default values in square brackets):

-r: number of patterns to generate (integer [8])

The number of patterns (basis vectors) in the analysis. This is the critical setting for obtaining good results.

-k: k-mer size (integer [4])

Size of k-mers to use in the analysis; constrained between 3 and 8.

-w: window size (integer [5])

The window size used in the analysis; upper limit is 100.

-z: zero point (integer [0])

Reference point in the sequences: word-counting windows will be set up in a non-overlapping manner based on this zero position.
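One possible reading of this layout is sketched below. It assumes that -z is a 0-based index into each sequence, that window boundaries fall at z plus or minus multiples of w, and that incomplete windows at either end are dropped. None of these details is stated above, and the function name `window_starts` is hypothetical.

```python
def window_starts(seq_len, window, zero=0):
    """Start positions of non-overlapping windows aligned to the zero point.

    Assumption: boundaries lie at zero + n * window for integer n, and only
    windows that fit entirely inside the sequence are kept.
    """
    first = zero % window  # leftmost aligned boundary at or after position 0
    return list(range(first, seq_len - window + 1, window))


print(window_starts(20, 5, zero=0))
print(window_starts(20, 5, zero=7))
```

With a zero point of 7 and a window size of 5, one window starts exactly at position 7 and the others tile outward from it.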

-d: dry run control (integer [0])

Any non-zero value causes the pipeline to print each command line with its arguments without executing it.
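The dry-run behavior can be illustrated with a small wrapper. This is a sketch of the general idea only, not the pipeline's own code; the function name `run_step` and the example command are hypothetical.

```python
import shlex
import subprocess


def run_step(argv, dry_run=0):
    """Print the command when dry_run is non-zero; otherwise execute it."""
    if dry_run != 0:
        # shlex.join quotes arguments so the printed line can be pasted into a shell.
        print(shlex.join(argv))
        return None
    return subprocess.run(argv, check=True).returncode


# Hypothetical command; in dry-run mode nothing is executed.
result = run_step(["some_program", "-i", "my sequences.fa", "-k", "4"], dry_run=1)
```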

-p: stop action (integer between 1 and 11, [11])

The step number at which the pipeline is to stop. Must be greater than or equal to the value set by -a.
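The constraint between the start action (-a) and the stop action (-p) can be checked before a run. The sketch below shows one way to do so; the function name `steps_to_run` is hypothetical, and the range of 1 to 11 is taken from the argument descriptions.

```python
def steps_to_run(start, stop=11, first=1, last=11):
    """Validate the start and stop actions and return the step numbers to run."""
    if not first <= start <= last:
        raise ValueError(f"start action must be between {first} and {last}")
    if not first <= stop <= last:
        raise ValueError(f"stop action must be between {first} and {last}")
    if stop < start:
        raise ValueError("stop action must be >= start action")
    return list(range(start, stop + 1))


print(steps_to_run(3, 5))
```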

-smN: maximum number of words to use in the analysis (integer [65536])

The nmf analysis can be focused and made to execute more quickly with an upper limit on the number of k-mers to be used.

If the number specified is less than 4^k, then the word list is truncated according to ranking based on the ratio of variance to mean of the k-mer counts across the windows.
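The ranking can be sketched as follows. The sketch assumes that words with a higher variance-to-mean ratio are kept first and that the population variance is used; neither detail is stated above. The function name `top_words` and the example counts are hypothetical.

```python
from statistics import mean, pvariance


def top_words(counts, max_words):
    """Keep the max_words k-mers with the highest variance-to-mean ratio.

    counts maps each k-mer to its list of counts, one per window.
    """
    def dispersion(row):
        m = mean(row)
        # Words that never occur get a ratio of zero (assumption).
        return pvariance(row) / m if m > 0 else 0.0

    ranked = sorted(counts, key=lambda word: dispersion(counts[word]), reverse=True)
    return ranked[:max_words]


counts = {
    "ACGT": [0, 0, 8, 0],  # concentrated in one window: ratio 6.0
    "AAAA": [2, 2, 2, 2],  # uniform across windows: ratio 0.0
    "CGCG": [1, 3, 1, 3],  # mildly variable: ratio 0.5
}
print(top_words(counts, 2))
```

Words whose counts are concentrated in a few windows rank highest, which is why truncation tends to discard position-independent k-mers first.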

-smr: smoothing radius on the rows of the word count matrix (integer [1])

The smoothing is a polynomial smoothing that preserves the sum of the counts of the word across all windows.

Larger values provide greater smoothing. Negative values give undefined results.

-smp: pseudocounts (float [0.0])

Pseudocount value to add to each element in the word count matrix prior to nmf analysis.

-e: nmf convergence tolerance (float [1.0e-6])

The tolerance for determining convergence in each nmf restart. Smaller numbers result in longer run times and can prevent convergence.

-nmfI: max iterations in each restart (integer [1000])

The maximum number of update steps to attempt in each restart of the nmf analysis.

Larger values for k or smaller values for w may require increasing nmfI.

-nmfS: number of times to restart nmf analysis (integer [300])

Larger numbers increase the confidence of finding a global best solution.

Small numbers such as 20-30 are useful in determining optimum values of r.

-xLab: label for the reference point (string ["3'-processing site"])

A label for the position on which all sequences are aligned, as specified by the -z option.

-bmS: number of times to restart the motif build process (integer [300])

The motif builder is a Markov chain Monte Carlo (MCMC) process with a random starting motif.

A larger number of restarts improves the likelihood of finding a global best solution.

-bmMinW: minimum size motif to consider (integer [4])

-bmMaxW: maximum size motif to consider (integer [8])

-bmI: maximum number of iterations in each restart of the motif building MCMC (integer [2000])

-bmN: assumed number of models in model evaluation (integer [1000])

This value primarily affects the scale of the changes made to motif models in each iteration.

Larger numbers make the changes smaller, giving smoother but slower convergence to a solution.

-bm_P: MCMC plateau value (integer [200])

The number of iterations with no improvement in the motif after which convergence is declared.
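The interaction between the iteration limit (-bmI) and the plateau value (-bm_P) can be illustrated with a toy search. The sketch below is a simple greedy stochastic search, not the pipeline's MCMC motif builder, and it assumes the plateau counts consecutive non-improving iterations. All names in it are hypothetical.

```python
import random


def search_with_plateau(score, propose, start, max_iter=2000, plateau=200, seed=0):
    """Greedy stochastic search that stops after `plateau` non-improving steps.

    Returns (best_state, iterations_used, converged).
    """
    rng = random.Random(seed)
    best, best_score = start, score(start)
    since_improved = 0
    for iteration in range(1, max_iter + 1):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            since_improved = 0  # any improvement resets the plateau counter
        else:
            since_improved += 1
            if since_improved >= plateau:
                return best, iteration, True
    return best, max_iter, False  # iteration limit reached without a plateau


# Toy problem: walk along the integers toward the maximum of -(x - 10)^2.
best, used, converged = search_with_plateau(
    score=lambda x: -(x - 10) ** 2,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
    start=0,
)
print(best, converged)
```

A small plateau value ends each restart sooner but risks stopping before a slow improvement arrives; a value larger than the iteration limit means the limit always decides.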

-bmx: min overlap in k-mer and motif (integer [2])

Used in generating the probability of observing a given k-mer from a motif, specifically the start and stop offsets at which the probability will be calculated.

The number must be smaller than k-1. Larger numbers will focus the probability distribution into the motif; smaller numbers allow greater contribution from the surrounding background.

-bmff: empirical focus factor (float [1.0])

An exponential coefficient helping to skew the probability distribution of all k-mers generated by a motif.

A factor of 1.0 makes no change. Empirical studies of artificial known sequence patterns showed that results improved significantly for smaller sequence sets with increasing values of bmff.

Roughly, a set of T=30 sequences or fewer should have bmff ~2.0 or greater; T=100, bmff ~1.6-1.8; T=300, bmff ~1.2-1.4.
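The effect of an exponent on a probability distribution can be seen in a short sketch. It assumes that the focus factor is applied by raising each k-mer probability to the power bmff and renormalizing; the exact formula used by the motif builder is not given above, and the function name `apply_focus` is hypothetical.

```python
def apply_focus(probs, focus=1.0):
    """Raise each probability to the power `focus` and renormalize to sum to 1."""
    powered = [p ** focus for p in probs]
    total = sum(powered)
    if total == 0:
        raise ValueError("at least one probability must be positive")
    return [p / total for p in powered]


original = [0.5, 0.3, 0.2]
unchanged = apply_focus(original, 1.0)   # a factor of 1.0 makes no change
sharpened = apply_focus(original, 2.0)   # larger factors favor the likeliest k-mers
print(unchanged)
print(sharpened)
```

Under this assumption, values above 1.0 concentrate probability on the most likely k-mers, which is consistent with the recommendation of larger values for smaller sequence sets.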

-mmMinIC: minimum information content (float [0.2])

After convergence, motifs are trimmed from each end, eliminating any positions with information content less than the value set by this argument.

-mmMinW: minimum motif width (integer [4])

Minimum motif width to carry forward after trimming.
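The trimming controlled by -mmMinIC and -mmMinW can be sketched as below. The sketch assumes DNA motifs stored as per-position probabilities over A, C, G, T, information content measured in bits against a uniform background, and that motifs narrower than the minimum width after trimming are discarded. These details are assumptions, and the function names are hypothetical.

```python
import math


def information_content(column):
    """Information content in bits of one motif position (uniform background)."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def trim_motif(motif, min_ic=0.2, min_width=4):
    """Trim low-information positions from both ends of a motif.

    Returns the trimmed motif, or None if it is narrower than min_width.
    """
    start, end = 0, len(motif)
    while start < end and information_content(motif[start]) < min_ic:
        start += 1
    while end > start and information_content(motif[end - 1]) < min_ic:
        end -= 1
    trimmed = motif[start:end]
    return trimmed if len(trimmed) >= min_width else None


flat = [0.25, 0.25, 0.25, 0.25]    # 0 bits: carries no information
strong = [0.97, 0.01, 0.01, 0.01]  # about 1.76 bits
motif = [flat, strong, strong, strong, strong, flat, flat]
trimmed = trim_motif(motif)
print(len(trimmed))
```

Only the ends are trimmed, so a low-information position between two informative positions is kept.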

-nEx: number of examples for Sequence Logo generation (integer [100])

Sequence logos are generated by using the final motif to generate a set of examples that are fed to the sequence logo code. This argument controls the number generated.
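Generating example sequences from a motif can be sketched as follows, assuming the motif is stored as a list of per-position probabilities over A, C, G, T and each position is sampled independently. The function name `sample_examples` is hypothetical, and the sketch does not reproduce the pipeline's own sampler.

```python
import random


def sample_examples(motif, n_examples=100, alphabet="ACGT", seed=0):
    """Draw n_examples sequences, sampling each position from its column."""
    rng = random.Random(seed)
    return [
        "".join(rng.choices(alphabet, weights=column)[0] for column in motif)
        for _ in range(n_examples)
    ]


motif = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.0, 0.0, 1.0],  # this position is always T
]
examples = sample_examples(motif, n_examples=100)
print(examples[:5])
```

More examples make the letter frequencies in the logo track the motif probabilities more closely, at the cost of a larger input to the logo code.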

-nWd: number of words in html (integer [10])

The number of top-ranked k-mers displayed for each pattern in the html output page.