README CONTENTS * Overview * Contents of this Package * Descriptions of C++ Executables * Descriptions of Perl Scripts * Contents of Example Directory * Walkthrough OVERVIEW This document describes the software distributed with A biophysical approach to predicting protein-DNA binding energetics, George Locke and Alexandre V. Morozov. First, I will describe the programs included in this package. The I will walk the reader through an example workflow. This software was written by George Locke from 2012-2014. Graphics code from TFBS (Lenhard B., Wasserman W.W. (2002) TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics 18:1135-1136) formed the starting point for graphics code in intervalLogo.pl. This readme was written by George Locke in 2014. CONTENTS OF THIS PACKAGE This distribution includes four C++ executables: (a) bindSter, (b) applyModel, (c) fullProfile, (d) interval. In addition, the tarball includes the following perl scripts: (e) writeBindSter.pl, (f) queueList.pl, (g) checkMinImprovement.pl, (h) findBestErr.pl, (i) intervalLogo.pl, (j) rcIntervalTab.pl, (k) motifCor.pl. You will also find a directory named "example", which contains some examples of the kinds of input the package expects and the outputs it will generate. INSTALLATION By now you've probably untarred bindSter.tgz; if not, do so. Next, enter the directory where the tarball has been extracted and type 'make'. Aside from standard libraries, the C++ included here requires various BOOST libraries, kseq (included), and zlib (required by kseq). G++ may produce warnings that look like this: applyModel2.cpp: In function int ks_getuntil(kstream_t*, int, kstring_t*, int*) applyModel2.cpp:21:1: warning: comparison between signed and unsigned integer expressions These warnings are due to kseq and are unlikely to cause problems unless "high throughput" starts to mean something many orders of magnitude greater. Most of the perl modules required are part of the core library, although a few are not; I will briefly describe the notable exceptions. There is one routine that calls BioPerl; installing bioperl can be a bit of a pain, and as its only function here is to read fasta files (which it does far more rapidly than native perl can, probably because of XS), you should be able write your own work-around with relative ease. intervalLogo.pl requires GD and a few of its sub-modules. Without these it can't draw logos. libBindSter.pm is included here; it contains a few mundane subroutines shared by a few scripts, and the perl code included here should automatically locate this module so that you needn't install this module natively. DESCRIPTIONS OF C++ EXECUTABLES Each of the C++ executables takes a variety of arguments. The precise meaning of each of these arguments is documented by the executable itself, and this documentation will not be reproduced here, at least not in all its gory detail. To view this documentation, simply run any of these executables without any arguments, and a comprehensive explanation for all arguments will be written to stderr. A: bindSter This program is the heart of the matter. Provided with high-throughput TF binding data, it will fit a model. If you provide a file specifying model parameters, bindSter will refine them. Otherwise, it will create a random motif to start with and go from there. The program creates multiple output files. The main one is the fitted parameters, and this you specify with the --out argument; for convenience, I will suppose that you specify --out $out. The recommended convention is that you specify $out to have the suffix ".coef". The second file, which is always written, is written to "$out.err", and this file simply records the error (typically 1-r, where r is cor(predition, target)) at each step. (Note that this .err file is essential if you want to split a regression up into multiple runs of bindSter; see section G: checkMinImprovement.pl and the walkthrough). There are several other outputs that are conditional on particular arguments being set. If you do not specify --coef, bindSter creates a random starting motif, and it will write this random motif to a file called "$out.initial.coef". If you specify --coef2, it will write the optimized second motif to "$out.primary.coef". If you specify "--nonspec", it will write the optimized position-independent parameters to "$out.nonspec.coef". The program will write progress updates to STDOUT as it churns along, and when it finishes it will also write "took X lateral steps", where X is the number of steps in the TryAndStep regression that did not decrease the error. Most regressions will take very few lateral steps. To reproduce our procedures, it is recommended that you use writeBindSter.pl (or some other automated tool of your own devising) rather than interacting directly with bindSter from the command line, as our procedure involves running bindSter 80 times in parallel, which is impractical without computer assistance. See sections E: writeBindSter.pl and the walkthrough for typical use instructions. bindSter will accept binding data in three formats: --mit, --dual, or --targ with --fa. * --mit file.txt expects file.txt to have the format of MITOMI 2.0 data as described in Fordyce et al. NatBiotech 2010 (as of this writing, these data are available from http://derisilab.ucsf.edu/index.php?page=data ) * --dual file.txt expects a whose every line consists of " ", where represents the factor's affinity for the on that line (PBM data is often provided in such a fashion). A dummy example of what this input format looks like is found in the example sub-directory (rand.1000.60.dual.txt). * --targ score.txt --fa seq.fasta expects two files. the file specified by --targ will have a number on each line, and seq.fasta is a multi-fasta file. It is assumed that the n'th line in score.txt gives the binding score corresponding to the n'th sequence in seq.fasta example use: ./bindSter --dual example/rand.1000.60.dual.txt --out example/try.coef --maxstep 20 --seed 1 B: applyModel Given model parameters and target sequences, applyModel will produce a file showing the predicted score, the target score (if provided) and the target sequence. By default, the predicted score is exactly the average number of proteins bound per probe, although if the model parameters include linear scaling terms (called "floor" and "scale", corresponding to the m and b in 'm'x + 'b'), and --noscale is set to false, it will scale the average occupancy to produce a normalized score. The output file will also indicate the correlation (squared), and, if --noscale 0 is set, absolute error. example use: ./applyModel --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.apply C: fullProfile This program does essentially the same thing that applyModel does, but it provides more detailed output that describes the probability to find a protein bound at any position within the probe. The output format is as follows: >[sequence_1] [P^1_1],[P^1_2],[P^1_3],... [P^2_1],[P^2_2],[P^2_3],... ... >[sequence_2] [P^1_1],[P^1_2],[P^1_3],... [P^2_1],[P^2_2],[P^2_3],... ... sequence_i represents the sequence of the i'th probe. P^j_k represents the probability to find a bound protein at position k. the j superscript refers to the binding mode. Note that binding to Watson and Crick strands are considered different binding modes. So the "mononuc", "dinuc", "k-mer" models will have two binding modes, while the "two-motif", "secondary motif", "mononuc + PI" modes will have four binding modes. (the --coef2 or --nonspec binding modes will be j=3,4, the --coef binding mode will always be j=1,2.) (libBindSter.pm includes a few subroutines that may be useful should you wish to do some analysis of the fullProfile output files.) example use: ./fullProfile --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.fullProfile D: interval Calculate the confidence boundary for a given motif. Given target data and a motif, interval will calculate the how far you can push each parameter in the motif until you lose %2 of the correlation coefficient it has at the motifs given parameters. example use: ./interval --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.interval.tab DESCRIPTIONS OF PERL SCRIPTS The procedure we used in applying the bindSter algorithm amounts to making many parallel runs of bindSter and collating the results. Most of the scripts provided here automate this process. Some of these scripts may be platform-specific, so you may have to modify them yourself to match your server's requirements. These modifications may end up being non-trivial, so these scripts are to be understood as an example of the kind of thing you might do to make your life with bindSter convenient rather than a robust, battle-hardened software package. They work on my system, and something like them will work on yours. If you execute any of these scripts without any arguments, they will remind you what arguments they take. (findBestErr.pl is an exception, as I found it convenient to use without any arguments.) E: writeBindSter.pl This script creates a number of bash scripts that call bindSter. The idea is to make it easy to use a queue manager to run your jobs on a cluster. The output scripts are written to the current directory, so it's best to create a directory expressly for this purpose, e.g. step01 (see the walkthrough below). You need to provide target data (formats identical to that described section A). You also need to provide some identifying name for the job; this argument will also be used to make the --out argument of bindSter. The rest of the arguments are optional. Many of these arguments directly correspond to bindSter arguments, and you may refer to the bindSter auto-help for those. The rest are described here. bindSter-corresponding args: obj, ncmp, minimp, wordlen, ntry, threads, seed, locbias*, nonspeconly*, seqbias*, soft* the starred ones are the same as their counterparts in bindSter, but the C++ requires you to supply a true/false, whereas the perl treats them like flags, so writeBindSter.pl ... -seqbias -soft corresponds to bindSter ... --seqbias 1 --soft 1 the following arguments are new or interpreted differently * -njobs : how many parallel runs of bindSter do you want? * -coef2 : if you specify a specific .coef file, you'll reproduce the procedure for "secondary" fit. if you use "rand", you'll reproduce the procedure for a "two-motif" fit. * -nonspec : if set to n, follows the "Position-independent N=n" procedure. * -errfit : this is just the same as --cor in bindSter, except it's a flag that sets --cor 0 when active (defaults to --cor 1). However, just don't use this flag as it's useless. The regressions converge much more quickly without it. example use: mkdir example/mytest mkdir example/mytest/step01 cd example/mytest/step01 ../../../writeBindSter.pl -dual ../../../rand.1000.60.dual.txt -dir ../../regressions/ -job testing01 -maxstep 20 -njobs 20 (NB this number of steps is way too low to be practical, but is used here for expository purposes.) F: queueList.pl This script takes a list of bash scripts and sends them to PBS via qsub. This script does one slightly non-trivial thing which is that it will wait and check to see how many jobs you have in the queue and only submit more if there are less than N currently running or queued. This is handy if your queue manager doesn't have functioning "fairness" parameters and you dno't want to block all your colleagues from access to the server, but this piece of the software in particular is likely to be non-portable. the two '$PBS_...' args at the top of the script may help should you try and use it. This is its default behavior. If you just want it to submit everything in your list all at once, use the -all flag. This script was designed to work on our server; it is platform-specific. You may prefer to use GNU parallels, which does much the same thing, as I understand it. example use: (picks up where E left off) ls $PWD/*bash > bash.list ../../../queueList.pl -in bash.list -all G: checkMinImprovement.pl bindSter regressions conclude when the error has improved by less than X amount over the last Y turns. This script takes a list of bash scripts and checks to see whether the bindSter runs in that list have concluded having reached this minimum improvement. For each job that concluded without reaching min improvement, a new batch script is created that picks up where the previous one left off. Most PBS systems have a walltime or some other cutoff that forbids jobs from running forever. As a result, it becomes practical to split long jobs into many parts. checkMinImprovement.pl is designed to facilitate this process. You provide it a list of bash scripts, and it categorizes the current status of these jobs into one of three groups: ended on error, ended without reaching min improvement, ended after reaching min improvement. The script will try and divide errors detected into categories that seemed useful to me. (Of note, sometimes I got inexplicable nan errors from jobs that worked fine when next submitted. If you find such errors, just resubmit the scripts in the fail.nan.list.) This script checks for stdin and stdout records produced by the queue manager. It looks for these files under a naming convention that may not be consistent with your server, so you may need to modify it for your use. The args it takes are: * -in : a file listing batch scripts * -dir : follow up scripts are written here * -err : jobs that ended with an error are listed in this file optional arguments: * -done : list jobs that ended after reaching min improvement here * -ncmp/-minimp : correspond to the bindSter arguments * -errlist : this mode was convenient to me at one time, but I don't think it will come up for you. when this flag is set, the -in argument is interpreted not as a list of batch scripts, but a list of .err files as output by bindSter. in this case, the only output will be to write a list of those error files that correspond to unfinished jobs to the file specified by -err. (note that it will look at myJob12_iter002.coef.err and myJob12_iter003.coef.err and notice that iter003 is a more recent version, and ignore 002.) * -nosort : this flag stops the script from attempting to categorize the errors. example use: (picks up where F left off) ../../../checkMinImprovement -in bash.list -dir ../step02 -err fail.list -done finished.list the script will create the step02 directory if necessary, and it will not create fail.list or finished.list if these lists would be empty. One quirk: this script creates new seeds for any scripts it makes, and it uses previous seeds to do so. if the previous seed is less than 6 digits long, then there's a regular expression that will fail and die, on line 236. Typically, the seeds are derived from system time, which is 10 digits. So if you're testing and using -seed 1 or something, this script will fail; using -seed 100000000 will solve the problem. H: findBestErr.pl So now you've completed many rounds of bindSter regressions and you want to find the motif with the least error. findBestErr.pl to the rescue! Go to the directory where the bindSter output is written (or specify it via the -dir arg) and findBestErr.pl will list all the jobs and the errors associated with them, printing the best error last. If you just want to know the name of the file with the best motif, specify -brief. example use: (again picking up where we left off) ../../findBestErr.pl -brief -dir ../ cd .. ../../findBestErr.pl this script will not auto-identify when run with no arguments like the others. If you want it to remind you of its arguments, say findBestErr.pl -help. I: intervalLogo.pl This script takes the output of the interval program and turns it into an IntervalLogo as described in the paper. The only output format currently supported is PNG. The user selects between the following input formats: interval, coef, psam, matrix. The first option is the default, and it refers to the output of the interval program (D) included in this package (or a similarly formatted file). This is the only input format that includes confidence intervals, so the other formats will produce a more traditional sequence logo without the 3-color shading described in the paper. The other formats are : * -coef : a coef file, such as that produced by bindSter. * -psam : a simply formatted file where each row lists the "affinity" for each letter. So the format would be like A1 C1 G1 T1 A2 C2 G2 T2 ... where A1 is interpreted as exp(-epsilon_A^1 + delta^1) where epsilon_A^1 is the energy parameter associated with an A at position 1 and delta^1 is any constant that is added to all energies at position 1. (That is, often PSAMs are written such that the most popular latter at a given position has weight 1, and other positions are between 0 and 1. This is easily interpreted as a representation of a mononuc energy motif where delta^i = -min(epsilon^i)-ln(1). Note that a constant offset added to all letters at a given position has no effect on P_a^i and that such constants may be different at each position so long as they are shared by all letters at each position.) * -matrix : A file containing energy paramters formated like so A epsilon_A^1 epsilon_A^2 ... C epsilon_C^1 epsilon_C^2 ... G epsilon_G^1 epsilon_G^2 ... T epsilon_T^1 epsilon_T^2 ... again, any offsets added across a given position/column have no effect on the logo. The other arguments are: * -reduce : given a higher-order model, the script will produce a mononuc logo if this flag is set. * -nointerval : force the script to hide confidence interval shading * -nonspec : if this flag is set, only the first position of any parameter will be drawn, and position labels on the x-axis will be suppressed. (suitable for drawing the position-independent component of a mononuc + PI model) * -kmer : if this flag is set, higher order parameters within the input will be interpreted as position-independent and plotted to a separate png file. If you try to draw a k-mer motif without either this or -reduce set, the program will balk. * -linewidth : width of axis lines * -ptsize : font point size * -height, -width, -margin : height/width of the output file in pixels. the margin is the distance between the edge of the image and the axes. example use: ./intervalLogo.pl -in example/testing01_001_iter003.coef.interval.tab -out example/testing01_001_iter003.coef.logo.png J: rcIntervalTab.pl This is a simple script that takes the output of the interval program and reproduces the same information after taking the reverse complement. It takes one optional argument, -kmer, a flag that is necessary for taking the r/c of k-mer models. (To be precise, the program will work but the output will place the higher-order parameters at the last position rather than the first, in which case there will be problems with intervalLogo should you want to plot your results.) ./rcIntervalTab.pl -in example/testing01_001_iter003.coef.interval.tab -out example/testing01_001_iter003.coef.interval.rc.tab K: motifCor.pl This script produces the motif correlation between two or more motifs, as described in the paper. The input files must be formatted as bindSter coef files. If correlating two files, you may specify them using the -in1 and -in2 arguments. You can also produce corrlations for all pairs among a list of coefficients using the -in argument, where the file -in points to is a list of coef files, with one entry per line. In all cases, the output is a table showing the correlations for each pair of motifs. If there are only two motifs being correlated, the output file includes a note specifying the conditions that led to the maximum correlation between those motifs. That is, the note will specify whether a reverse complement was required and what offset was necessary. Additionally, in the case of only two motifs being correlated, the correlation obtained is written to stdout. arguments: * -out : write the table of correlations here * -maxshift : the maximum offset allowed when comparing one motif to another * -energy : compare not the information content (as described in the paper) but the energy parameters themselves. The procedure is identical except that the conversion from e_a^i to h_a^i is not performed. * -ignorewords : treat all motifs as if they had no higher order parameters (no effect on mononuc motifs) * -noreduce : by default, when comparing two motifs, higher order motifs are "reduced" to mononucleotide information content before comparison. This flag forces the script to compare information content without reducing to mononuc. Note that in this case, correlations between different models are impossible. (no effect on mononuc motifs) * -kmer : if comparing two kmer models of the same order, this flag alerts the script that higher order parameters are position independent. The -noreduce must be set in order to use this option, as it is meaningless unless reduction to mononuc is not performed. CONTENTS OF EXAMPLE DIRECTORY * rand.1000.60.dual.text This file contains dummy data for use in testing the package. It has 1000 random probes, each with a random sequence of 60 bases and a random score drawn from a uniform distribution on the interval [0,1). * safe_try.coef This file should look exactly like the output of the example use from section A: bindSter * testing01_001_iter003.coef* These are the results of the walkthrough procedure * provided/* These are the working files used to produce the walkthrough results WALKTHROUGH This walkthrough describes the procedure for fitting a new mononuc motif from high throughput TF binding data such as PBM. The only differences between this walkthrough and our actual procedure is the use of dummy data, the use of only 20 parallel runs instead of 80, and that the regression will not converge in the number of steps used here. (The walkthrough creates an environment variable called $BSDIR that points to the location of the executables distributed with this package. This is merely a convenience, as none of those executables will require such an environment variable.) You may check your working files against those in examples/provided/ and the "final" results against examples/testing01_001_iter003.coef* . I instruct you to use the same seed I used in creating those results, so your results should be identical. However, it's possible that an update to the BOOST random library or something platform dependent will result in differences between your results and those in the examples/provided/ directory. # Step 0: prepare working space # cd to wherever the bindSterPackage extracted to export BSDIR=$PWD # your shell may want different syntax, e.g. setenv mkdir example/yourDir mkdir example/yourDir/regressions mkdir example/yourDir/scripts/ mkdir example/yourDir/scripts/step01 # Step 1: make and submit jobs cd example/yourDir/scripts/step01 $BSDIR/writeBindSter.pl -dual /home/glocke/percus5/scripts/tfscripts/cpp/distribute/example/rand.1000.60.dual.txt -dir ../../regressions/ -job yourJob -maxstep 20 -njobs 20 -seed 1415988679 ls $PWD/*bash > bash.list $BSDIR/queueList.pl -in bash.list -all # as noted above, queueList.pl is platform-specific. see section F. # these jobs should complete within a few seconds # Step 2: check to see whether any of the jobs have found min improvement # they won't have in this case, but if you had set -maxstep 3000 # then surely most of them would have $BSDIR/checkMinImprovement.pl -in bash.list -dir ../step02 -err fail.list # checkMinImprovement.pl is also platform-specific. see section G. ls *list # if something has gone wrong, you will probably find a file # called fail.unknown.list or fail.absent.list. # If you find no such list, then proceed. ls ../../regressions # you should find 60 files in here now # Step 3: continue the unfinished jobs cd ../step02 ls $PWD/*bash > bash.list $BSDIR/queueList.pl -in bash.list -all # again, these should conclude very rapidly # Step 4: rinse and repeat $BSDIR/checkMinImprovement.pl -in bash.list -dir ../step03 -err fail.list ls *list # just checking ls ../../regressions # 100 files now expected cd ../step03 ls $PWD/*bash > bash.list $BSDIR/queueList.pl -in bash.list -all # in the actual case, you'd repeate this process until all jobs finish ## checkMinImprovement will alert you if this is the case. # Step 5: Find the best regression # we'll now assume that all parallel runs of bindSter have reached # minimum improvement. cd ../../regressions $BSDIR/findBestErr.pl # if your results match mine, then findBestErr should produce a bunch # of output, the last four lines of which will look like this: #./yourJob_008_iter003.coef.err 0.870841 #./yourJob_011_iter003.coef.err 0.870405 #./yourJob_002_iter003.coef.err 0.865921 #./yourJob_001_iter003.coef.err 0.862233 cd ../ $BSDIR/findBestErr.pl -dir regressions -brief # this should write the line, "regressions/yourJob_001_iter003.coef" to STDOUT cp regressions/yourJob_001_iter003.coef . # Step 6: prepare your final results $BSDIR/applyModel --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.applied $BSDIR/fullProfile --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.fullProfile $BSDIR/interval --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.interval.tab # this may take a minute or so $BSDIR/rcIntervalTab.pl -in yourJob_001_iter003.coef.interval.tab -out yourJob_001_iter003.coef.interval.rc.tab $BSDIR/intervalLogo.pl -in yourJob_001_iter003.coef.interval.tab -out yourJob_001_iter003.coef.logo.png $BSDIR/intervalLogo.pl -in yourJob_001_iter003.coef.interval.rc.tab -out yourJob_001_iter003.coef.rc.logo.png # yourDir should now have 8 files and two directories in it # you can compare the files with their counterparts in the examples # directory.