Non-negative Matrix Factorization Pipeline
How to run the pipeline
Note: This package includes both 32-bit and 64-bit executables. The 32-bit executables are stored under $NMF_PATH/bin/cpp/linux_32/ and the 64-bit executables under $NMF_PATH/bin/cpp/linux_64/.
By default, the pipeline uses the 32-bit executables; if it detects a 64-bit architecture, it uses the 64-bit executables instead.
To force the 32-bit executables on a 64-bit architecture, edit the run_nmf_pipeline.prl script under $NMF_PATH/pipeline/ and comment out lines 126-128.
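To see which set of executables to expect on a given machine, the selection described above can be sketched in Python. This is only an illustration: the function name executable_dir and the "64 appears in the machine name" heuristic are assumptions, not the logic of run_nmf_pipeline.prl. Only the two directory names come from the note above.

```python
import platform

# Directory names are taken from the note above; the selection rule is an
# ASSUMPTION about the pipeline's behaviour, not a copy of its script.
BIN_DIRS = {32: "bin/cpp/linux_32", 64: "bin/cpp/linux_64"}


def executable_dir(nmf_path, machine=None, force_32bit=False):
    """Return the executable directory one would expect the pipeline to use."""
    if machine is None:
        machine = platform.machine()  # e.g. "x86_64" or "i686"
    # Rough heuristic: 64-bit machine names usually contain "64".
    is_64bit = "64" in machine
    bits = 64 if (is_64bit and not force_32bit) else 32
    return nmf_path.rstrip("/") + "/" + BIN_DIRS[bits]
```

For example, executable_dir("/opt/nmf", machine="i686") points at the linux_32 directory (the path /opt/nmf is hypothetical).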
The input file must be FASTA-formatted. All sequences are assumed to have the same length and to be aligned on a common feature.
The common length of the sequences must be known in advance and passed to the pipeline program with the "-L" option. The alignment point can be set with the "-z" option; if it is not set, the beginning of each sequence is taken as position 0. For instance, in the example file, all sequences span 200 nt upstream and 200 nt downstream of the polyA site, so the arguments would be "-L" 400 and "-z" 200. This places one count window boundary at position 200 of the sequences (which is then called position 0), with all other windows set up accordingly.
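The effect of the alignment point on window placement can be sketched as follows. This is an illustration under stated assumptions: the window width parameter and the choice to drop partial windows at either end are hypothetical, since this document does not say how the pipeline sizes its windows. Only the rule that one boundary sits on the alignment point (reported as position 0) comes from the text above.

```python
def window_boundaries(seq_len, align_pos=0, width=50):
    """Window boundaries in coordinates relative to the alignment point.

    seq_len   -- common sequence length (the "-L" value)
    align_pos -- alignment point within the sequence (the "-z" value)
    width     -- window width; a HYPOTHETICAL parameter for this sketch
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if not 0 <= align_pos <= seq_len:
        raise ValueError("alignment point must lie within the sequence")
    # Step outwards from the alignment point in whole windows only.
    first = -(align_pos // width) * width
    last = ((seq_len - align_pos) // width) * width
    return list(range(first, last + 1, width))
```

With "-L" 400 and "-z" 200, window_boundaries(400, 200, 50) runs from -200 to 200 in steps of 50, with a boundary at 0.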
The "-a" argument is typically set to 1, the beginning of the process (the call to fa2stats). However, if the files generated by the earlier programs already exist, the process can be started at a later step. For instance, the first step (fa2stats) only needs to be run once on a given FASTA-formatted sequence file. If the same sequence file is analyzed again with different values for "-w" or "-k", there is no reason to start at "-a" 1. Similarly, if the choices for "-w" and "-k" are unchanged and only a new value of "-r" (the number of patterns) is chosen, then the first three steps are redundant, so "-a" 4 would be appropriate.
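The choice of start action can be summarised in a small helper. This is a sketch, not part of the pipeline: the function name start_action is hypothetical, and returning 2 when "-w" or "-k" changes is an assumption, because the text above only says that step 1 need not be repeated in that case.

```python
def start_action(changed):
    """Suggest a value for "-a" given what differs from the previous run.

    changed -- set of names: "input" for a new sequence file, or the
               option letters "w", "k", "r".
    """
    if not changed or "input" in changed:
        return 1  # new sequence file (or nothing known): run everything
    if changed & {"w", "k"}:
        return 2  # ASSUMPTION: only fa2stats (step 1) can be skipped
    if changed == {"r"}:
        return 4  # the first three steps are redundant
    return 1      # anything unrecognised: fall back to a full run
```

For example, start_action({"r"}) suggests 4, while start_action({"input", "r"}) suggests 1.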
What is critical to understand is that the pipeline program assumes not only that the earlier programs have been run, but also that the names of their output files can be predicted from the parameters of the current run. If this condition is not met, the pipeline will likely fail, and the current version of the programs does not fail gracefully.
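Because a late start does not fail gracefully, it may help to confirm that the files from the earlier run are present before choosing a later start action. The sketch below assumes you already know those file names from your previous run. This document does not describe the naming scheme, so none is assumed here, and the helper name missing_inputs is hypothetical.

```python
import os


def missing_inputs(expected_files):
    """Return the expected intermediate files that are absent or empty."""
    return [
        path
        for path in expected_files
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]
```

If the returned list is not empty, starting again from "-a" 1 is the safer choice.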
Step 1: Store your FASTA-formatted input file (myinputfile.fa) in the data directory, as $NMF_PATH/data/myinputfile.fa
Note: myinputfile.fa must be a FASTA file of m equal-length sequences.
For example:
    >1 786
    AATGCTTGTTTATATGTTTGTTTCTGCTAACTTGTCTTGTTGTTCTTGAAGCTGACGTGTGTTT
    TGTCTTTCAGGAATTCACTGATCATCTGGCCAAGACTCACACCAGGGTCTCCGTTCAGAGGACT
    CAGGGAGCTCTTCCTCAGGGAACCTCCTAAATTTTGTACAACGGATTGAAATAAAAGTCCAGAA
    GTAAAACCATCTTTGAGTGTATGGTCTGTTCTTGGGCGTTAAAGTTGTGTGAAAGGTGCAAAAT
    AATATCCTGATATTACTGGCTTTCCCATGGTTCCTTAATCTTTTTAAAGTTTGTGATTCTGATT
    ACATTAGATGCTGGAAAGTGTTGGAAAATAAGCATTAATAATGGTCCTTGCTTGAATTTGTTTA
    ATGCTGCGTCCCAATT
    >2 886
    TGTCTTTCAGGAATTCACTGATCATCTGGCCAAGACTCACACCAGGGTCTCCGTTCAGAGGACT
    CAGGGAGCTCTTCCTCAGGGAACCTCCTAAATTTTGTACAACGGATTGAAATAAAAGTCCAGAA
    GTAAAACCATCTTTGAGTGTATGGTCTGTTCTTGGGCGTTAAAGTTGTGTGAAAGGTGCAAAAT
    AATATCCTGATATTACTGGCTTTCCCATGGTTCCTTAATCTTTTTAAAGTTTGTGATTCTGATT
    ACATTAGATGCTGGAAAGTGTTGGAAAATAAGCATTAATAATGGTCCTTGCTTGAATTTGTTTA
    AATGCTTGTTTATATGTTTGTTTCTGCTAACTTGTCTTGTTGTTCTTGAAGCTGACGTGTGTTT
    ATGCTGCGTCCCAATT
    ....
In this example, the FASTA file contains genome-aligned sequences of length = 400 bp (word_len = 400 bp).
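Since the pipeline assumes that every sequence has the same length, it can be worth checking the input file before running anything. The following is a minimal sketch, not part of the package: the function names read_fasta and check_equal_length are hypothetical, and the parser handles only plain FASTA (header lines starting with ">", sequence possibly wrapped over several lines).

```python
def read_fasta(lines):
    """Parse an iterable of FASTA lines into (header, sequence) pairs."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            records.append([line[1:], []])  # start a new record
        elif records:
            records[-1][1].append(line)     # wrapped sequence line
        else:
            raise ValueError("sequence data found before the first header")
    return [(header, "".join(parts)) for header, parts in records]


def check_equal_length(records, expected_len):
    """Return the headers of sequences whose length is not expected_len."""
    return [header for header, seq in records if len(seq) != expected_len]
```

A possible use, with the file name from the step above and the length passed as "-L":

```python
with open("myinputfile.fa") as handle:
    bad = check_equal_length(read_fasta(handle), 400)
print("sequences with the wrong length:", bad)
```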
Step 2: Check that third-party libraries are installed
Change directory to sh_script and run the check_libraries script from the command line:
%> ./check_libraries
If the check_libraries script succeeds and displays "gsl and gnuplot are installed.You can run the pipeline", skip steps 3 and 4.
Step 3: Install third-party libraries
- You can use our gsl package under $NMF_PATH/bin/libpck/gsl-1.8.tar.gz to install the gsl libraries. Make sure that the gsl package is installed under /usr/local/include/ (that is, /usr/local/include/gsl).
- You can use our gnuplot package under $NMF_PATH/bin/libpck/gnuplot-4.2.4.tar.gz to install gnuplot. Make sure that gnuplot is in your path, so that you can invoke gnuplot from the command line.
- After installing the third-party programs and libraries, re-run the check_libraries script to verify that the nmf supporting programs can find the libraries. Change directory to sh_script and run:
%> ./check_libraries
If the check_libraries script succeeds and displays "gsl and gnuplot are installed.You can run the pipeline", skip step 4.
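The two requirements above (gsl under /usr/local/include/gsl and gnuplot on your path) can also be checked by hand. The sketch below is not a replacement for check_libraries, whose exact checks are not documented here. The function name check_dependencies is hypothetical, and it tests only the two conditions stated in this step.

```python
import os
import shutil


def check_dependencies(gsl_include_dir="/usr/local/include/gsl",
                       which=shutil.which):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    # gsl is expected under /usr/local/include/gsl (see the step above).
    if not os.path.isdir(gsl_include_dir):
        problems.append("gsl headers not found under " + gsl_include_dir)
    # gnuplot must be callable from the command line, i.e. be on PATH.
    if which("gnuplot") is None:
        problems.append("gnuplot is not on PATH")
    return problems
```

Calling check_dependencies() with no arguments tests the default locations; the which parameter exists only so the lookup can be replaced when testing.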
Step 4: Run the setup script
Change directory to sh_script and run the setup script from the command line:
%> ./setup
Then run the check_libraries script from the same directory:
%> ./check_libraries
If the check_libraries script succeeds and displays "gsl and gnuplot are installed.You can run the pipeline", proceed with step 5. Otherwise, check the error logs under logs/, fix the problem, and re-run the setup and check_libraries scripts.
Step 5: Running the pipeline
- Change directory to $NMF_PATH/pipeline/, open the configuration file (pipeline_arg_config), and set the pipeline arguments:
  - Set the absolute path to the nmf package ($NMF_PATH)
  - Modify the input file name ($INPUT_FILE)
  - Set the word length ($WORD_LEN)
  - Set the output prefix ($OUTPUT_PREFIX)
  - Set the start action ($A)
  - Set optional arguments as needed (only the default values of optional arguments are used at the moment)
- Change directory to $NMF_PATH/sh_script and run the pipeline script (run_pipeline):
  %> ./run_pipeline
- You can also run the pipeline as a background process:
  %> nohup ./run_pipeline &
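Before launching a long run, the required settings listed above can be sanity-checked. The sketch below makes several assumptions: the syntax of pipeline_arg_config is not described in this document, so the settings are passed in as a plain dictionary. The function name config_problems and the example values ("/opt/nmf", "run1") are hypothetical.

```python
# Variable names are the ones listed in the configuration step above.
REQUIRED = ("NMF_PATH", "INPUT_FILE", "WORD_LEN", "OUTPUT_PREFIX", "A")


def config_problems(settings):
    """Return a list of problems found in a dict of pipeline settings."""
    problems = [
        name + " is not set"
        for name in REQUIRED
        if not str(settings.get(name, "")).strip()
    ]
    word_len = str(settings.get("WORD_LEN", "")).strip()
    if word_len and not (word_len.isdecimal() and int(word_len) > 0):
        problems.append("WORD_LEN must be a positive integer")
    action = str(settings.get("A", "")).strip()
    if action and not (action.isdecimal() and int(action) >= 1):
        problems.append("A must be a step number starting at 1")
    return problems
```

For instance:

```python
settings = {"NMF_PATH": "/opt/nmf", "INPUT_FILE": "myinputfile.fa",
            "WORD_LEN": "400", "OUTPUT_PREFIX": "run1", "A": "1"}
print(config_problems(settings))
```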
Pipeline result files
The generated result files are under $NMF_PATH/pipeline/result/. Click on the "RESULT FILES" link ($NMF_PATH/docs/nmfOutputFiles.txt) on the nmf main page to read the description of each file. The generated log and error files can be found under $NMF_PATH/logs/.