README CONTENTS
 * Overview
 * Contents of this Package
 * Descriptions of C++ Executables
 * Descriptions of Perl Scripts
 * Contents of Example Directory
 * Walkthrough


OVERVIEW

This document describes the software distributed with A biophysical
approach to predicting protein-DNA binding energetics, George Locke and 
Alexandre V. Morozov.

First, I will describe the programs included in this package.  The I
will walk the reader through an example workflow.  

This software was written by George Locke from 2012-2014. 
Graphics code from TFBS (Lenhard B., Wasserman W.W. (2002) TFBS: 
Computational framework for transcription factor binding site
analysis. Bioinformatics 18:1135-1136) formed the starting point for
graphics code in intervalLogo.pl.  

This readme was written by George Locke in 2014.


CONTENTS OF THIS PACKAGE

This distribution includes four C++ executables: (a) bindSter, (b) applyModel, 
(c) fullProfile, (d) interval.  In addition, the tarball includes the
following perl scripts: (e) writeBindSter.pl, (f) queueList.pl, 
(g) checkMinImprovement.pl, (h) findBestErr.pl, (i) intervalLogo.pl,
(j) rcIntervalTab.pl, (k) motifCor.pl.

You will also find a directory named "example", which contains some
examples of the kinds of input the package expects and the outputs it
will generate.

INSTALLATION

By now you've probably untarred bindSter.tgz; if not, do so.  Next,
enter the directory where the tarball has been extracted and type
'make'.  Aside from standard libraries, the C++ included here requires 
various BOOST libraries, kseq (included), and zlib (required by kseq).  

G++  may produce warnings that look like this: 
applyModel2.cpp: In function int ks_getuntil(kstream_t*, int, kstring_t*, int*)
applyModel2.cpp:21:1: warning: comparison between signed and unsigned integer expressions
These warnings are due to kseq and are unlikely to cause problems
unless "high throughput" starts to mean something many orders of
magnitude greater.

Most of the perl modules required are part of the core library,
although a few are not; I will briefly describe the notable
exceptions. There is one routine that calls BioPerl; installing
bioperl can be a bit of a pain, and as its only function here is to
read fasta files (which it does far more rapidly than native perl can,
probably because of XS), you should be able write your own work-around
with relative ease.  intervalLogo.pl requires GD and a few of its 
sub-modules.  Without these it can't draw logos.  libBindSter.pm is
included here; it contains a few mundane subroutines shared by a few
scripts, and the perl code included here should automatically locate
this module so that you needn't install this module natively.




DESCRIPTIONS OF C++ EXECUTABLES

Each of the C++ executables takes a variety of arguments.  The precise meaning 
of each of these arguments is documented by the executable itself, and this 
documentation will not be reproduced here, at least not in all its gory detail.
To view this documentation, simply run any of these executables without any 
arguments, and a comprehensive explanation for all arguments will be written to
stderr. 

A: bindSter

This program is the heart of the matter.  Provided with high-throughput 
TF binding data, it will fit a model.  If you provide a file specifying
model parameters, bindSter will refine them.  Otherwise, it will create
a random motif to start with and go from there.

The program creates multiple output files.  The main one is the fitted
parameters, and this you specify with the --out argument; for
convenience, I will suppose that you specify --out $out.
The recommended convention is that you specify $out to have the suffix
".coef".  The second file, which is always written, is written to
"$out.err", and this file simply records the error (typically 1-r,
where r is cor(predition, target)) at each step.  (Note that this .err
file is essential if you want to split a regression up into multiple
runs of bindSter; see section G: checkMinImprovement.pl and the
walkthrough).   

There are several other outputs that are conditional on particular
arguments being set.  If you do not specify --coef, bindSter creates a
random starting motif, and it will write this random motif to a file
called "$out.initial.coef".  If you specify --coef2, it will write the
optimized second motif to "$out.primary.coef".  If you specify
"--nonspec", it will write the optimized position-independent
parameters to "$out.nonspec.coef". 

The program will write progress updates to STDOUT as it churns along,
and when it finishes it will also write "took X lateral steps", where
X is the number of steps in the TryAndStep regression that did not
decrease the error.  Most regressions will take very few lateral
steps.

To reproduce our procedures, it is recommended that you use
writeBindSter.pl (or some other automated tool of your own devising)
rather than interacting directly with bindSter from the command line,
as our procedure involves running bindSter 80 times in parallel, which
is impractical without computer assistance.  See sections E:
writeBindSter.pl and the walkthrough for typical use instructions.

bindSter will accept binding data in three formats: --mit, --dual, or --targ 
with --fa.  

* --mit file.txt expects file.txt to have the format of MITOMI 2.0 data as 
    described in Fordyce et al.  NatBiotech 2010 (as of this writing,
    these data are available from 
    http://derisilab.ucsf.edu/index.php?page=data )
* --dual file.txt expects a whose every line consists of "<score> <sequence>", 
    where <score> represents the factor's affinity for the <sequence> on that 
    line (PBM data is often provided in such a fashion).  A dummy
    example of what this input format looks like is found in the
    example sub-directory (rand.1000.60.dual.txt).
* --targ score.txt --fa seq.fasta expects two files.  the file specified by 
    --targ will have a number on each line, and seq.fasta is a multi-fasta file.
    It is assumed that the n'th line in score.txt gives the binding score 
    corresponding to the n'th sequence in seq.fasta

example use: 
./bindSter --dual example/rand.1000.60.dual.txt --out example/try.coef --maxstep 20 --seed 1

B: applyModel

Given model parameters and target sequences, applyModel will produce a
file showing the predicted score, the target score (if provided) and
the target sequence.  By default, the predicted score is exactly the
average number of proteins bound per probe, although if the model
parameters include linear scaling terms (called "floor" and "scale",
corresponding to the m and b in 'm'x + 'b'), and --noscale is set to
false, it will scale the average occupancy to produce a normalized
score.  The output file will also indicate the correlation (squared),
and, if --noscale 0 is set, absolute error.  

example use:
./applyModel --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.apply


C: fullProfile

This program does essentially the same thing that applyModel does, but
it provides more detailed output that describes the probability to
find a protein bound at any position within the probe.  The output
format is as follows: 
>[sequence_1]
[P^1_1],[P^1_2],[P^1_3],...
[P^2_1],[P^2_2],[P^2_3],...
...
>[sequence_2]
[P^1_1],[P^1_2],[P^1_3],...
[P^2_1],[P^2_2],[P^2_3],...
...

sequence_i represents the sequence of the i'th probe.
P^j_k represents the probability to find a bound protein at position
k.  the j superscript refers to the binding mode.  Note that binding
to Watson and Crick strands are considered different binding modes.
So the "mononuc", "dinuc", "k-mer" models will have two binding modes,
while the "two-motif", "secondary motif", "mononuc + PI" modes will
have four binding modes.  (the --coef2 or --nonspec binding modes
will be j=3,4, the --coef binding mode will always be j=1,2.)

(libBindSter.pm includes a few subroutines that may be useful should
you wish to do some analysis of the fullProfile output files.)

example use:
./fullProfile --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.fullProfile


D: interval

Calculate the confidence boundary for a given motif.  Given target
data and a motif, interval will calculate the how far you can push
each parameter in the motif until you lose %2 of the correlation
coefficient it has at the motifs given parameters.  

example use:
./interval --dual example/rand.1000.60.dual.txt --coef example/try.coef --out example/try.interval.tab




DESCRIPTIONS OF PERL SCRIPTS

The procedure we used in applying the bindSter algorithm amounts to
making many parallel runs of bindSter and collating the results.
Most of the scripts provided here automate this process.  Some of
these scripts may be platform-specific, so you may have to modify them
yourself to match your server's requirements.  These modifications may
end up being non-trivial, so these scripts are to be understood as an
example of the kind of thing you might do to make your life with
bindSter convenient rather than a robust, battle-hardened software
package.  They work on my system, and something like them will work on
yours.  

If you execute any of these scripts without any arguments, they will
remind you what arguments they take.  (findBestErr.pl is an exception,
as I found it convenient to use without any arguments.)

E: writeBindSter.pl

This script creates a number of bash scripts that call bindSter.  The
idea is to make it easy to use a queue manager to run your jobs on a
cluster.  The output scripts are written to the current directory, so it's
best to create a directory expressly for this purpose, e.g. step01
(see the walkthrough below).

You need to provide target data (formats identical to that described  
section A).  You also need to provide some identifying name for the
job; this argument will also be used to make the --out argument of
bindSter.  The rest of the arguments are optional.  Many of these
arguments directly correspond to bindSter arguments, and you may
refer to the bindSter auto-help for those. The rest are described here.

bindSter-corresponding args: obj, ncmp, minimp, wordlen, ntry,
  threads, seed, locbias*, nonspeconly*, seqbias*, soft*
the starred ones are the same as their counterparts in bindSter, but
  the C++ requires you to supply a true/false, whereas the perl
  treats them like flags, so writeBindSter.pl ... -seqbias -soft 
  corresponds to bindSter ... --seqbias 1 --soft 1

the following arguments are new or interpreted differently
* -njobs : how many parallel runs of bindSter do you want?  
* -coef2 : if you specify a specific .coef file, you'll reproduce
  the procedure for "secondary" fit.  if you use "rand",
  you'll reproduce the procedure for a "two-motif" fit.
* -nonspec : if set to n, follows the "Position-independent N=n"
  procedure. 
* -errfit : this is just the same as --cor in bindSter, except it's
  a flag that sets --cor 0 when active (defaults to --cor
  1).  However, just don't use this flag as it's useless.
  The regressions converge much more quickly without it.

example use: 
mkdir example/mytest
mkdir example/mytest/step01
cd example/mytest/step01
../../../writeBindSter.pl -dual ../../../rand.1000.60.dual.txt -dir ../../regressions/ -job testing01 -maxstep 20 -njobs 20

(NB this number of steps is way too low to be practical, but is used
here for expository purposes.)

F: queueList.pl

This script takes a list of bash scripts and sends them to PBS via
qsub.  

This script does one slightly non-trivial thing which is that it will
wait and check to see how many jobs you have in the queue and only
submit more if there are less than N currently running or queued.
This is handy if your queue manager doesn't have functioning
"fairness" parameters and you dno't want to block all your colleagues
from access to the server, but this piece of the software in
particular is likely to be non-portable.  the two '$PBS_...' args at
the top of the script may help should you try and use it.  This is its
default behavior.  If you just want it to submit everything in your
list all at once, use the -all flag.  

This script was designed to work on our server; it is
platform-specific.  You may prefer to use GNU parallels, which does
much the same thing, as I understand it.

example use:
(picks up where E left off)
ls $PWD/*bash > bash.list
../../../queueList.pl -in bash.list -all

G: checkMinImprovement.pl

bindSter regressions conclude when the error has improved by less than
X amount over the last Y turns.  This script takes a list of bash
scripts and checks to see whether the bindSter runs in that list have
concluded having reached this minimum improvement.  For each job that
concluded without reaching min improvement, a new batch script is
created that picks up where the previous one left off. 

Most PBS systems have a walltime or some other cutoff that forbids
jobs from running forever.  As a result, it becomes practical to split
long jobs into many parts.  checkMinImprovement.pl is designed to
facilitate this process.  You provide it a list of bash scripts, and
it categorizes the current status of these jobs into one of three
groups: ended on error, ended without reaching min improvement, ended
after reaching min improvement.  The script will try and divide errors
detected into categories that seemed useful to me.  (Of note,
sometimes I got inexplicable nan errors from jobs that worked fine
when next submitted.  If you find such errors, just resubmit the scripts
in the fail.nan.list.)

This script checks for stdin and stdout records produced by the queue
manager.  It looks for these files under a naming convention that may
not be consistent with your server, so you may need to modify it for
your use.  

The args it takes are: 
* -in : a file listing batch scripts
* -dir : follow up scripts are written here
* -err : jobs that ended with an error are listed in this file
  optional arguments:
* -done : list jobs that ended after reaching min improvement here
* -ncmp/-minimp : correspond to the bindSter arguments
* -errlist : this mode was convenient to me at one time, but I don't
  think it will come up for you.  when this flag is set, the
  -in argument is interpreted not as a list of batch scripts,
  but a list of .err files as output by bindSter.  in this
  case, the only output will be to write a list of those
  error files that correspond to unfinished jobs to the file
  specified by -err.  (note that it will look at
  myJob12_iter002.coef.err and myJob12_iter003.coef.err and
  notice that iter003 is a more recent version, and ignore 002.)
* -nosort : this flag stops the script from attempting to categorize the errors.

example use:
(picks up where F left off)
../../../checkMinImprovement -in bash.list -dir ../step02 -err fail.list -done finished.list

the script will create the step02 directory if necessary, and it will
not create fail.list or finished.list if these lists would be empty. 

One quirk: this script creates new seeds for any scripts it makes, and
it uses previous seeds to do so.  if the previous seed is less than 6
digits long, then there's a regular expression that will fail and die,
on line 236.  Typically, the seeds are derived from system time, which
is 10 digits.  So if you're testing and using -seed 1 or something,
this script will fail; using -seed 100000000 will solve the problem. 

H: findBestErr.pl

So now you've completed many rounds of bindSter regressions and you
want to find the motif with the least error.  findBestErr.pl to the
rescue!  Go to the directory where the bindSter output is written (or
specify it via the -dir arg) and findBestErr.pl will list all the jobs
and the errors associated with them, printing the best error last.  If
you just want to know the name of the file with the best motif,
specify -brief.

example use: 
(again picking up where we left off)
../../findBestErr.pl -brief -dir ../
cd ..
../../findBestErr.pl 

this script will not auto-identify when run with no arguments like the
others.  If you want it to remind you of its arguments, say
findBestErr.pl -help.

I: intervalLogo.pl

This script takes the output of the interval program and turns it into
an IntervalLogo as described in the paper.  The only output format
currently supported is PNG.

The user selects between the following input formats: interval, coef,
psam, matrix.  The first option is the default, and it refers to the
output of the interval program (D) included in this package (or a
similarly formatted file).  This is the only input format that
includes confidence intervals, so the other formats will produce a
more traditional sequence logo without the 3-color shading described
in the paper.  The other formats are :

* -coef : a coef file, such as that produced by bindSter.  
* -psam : a simply formatted file where each row lists the "affinity" 
  for each letter.  So the format would be like
  A1 C1 G1 T1
  A2 C2 G2 T2
  ...
  where A1 is interpreted as exp(-epsilon_A^1 + delta^1) where
  epsilon_A^1 is the energy parameter associated with an A at
  position 1 and delta^1 is any constant that is added to all
  energies at position 1.  (That is, often PSAMs are written
  such that the most popular latter at a given position has
  weight 1, and other positions are between 0 and 1.  This is
  easily interpreted as a representation of a mononuc energy
  motif where delta^i = -min(epsilon^i)-ln(1). Note that a
  constant offset added to all letters at a given position has
  no effect on P_a^i and that such constants may be different
  at each position so long as they are shared by all letters at
  each position.)   
* -matrix : A file containing energy paramters formated like so
  A epsilon_A^1 epsilon_A^2 ...
  C epsilon_C^1 epsilon_C^2 ...
  G epsilon_G^1 epsilon_G^2 ...
  T epsilon_T^1 epsilon_T^2 ...
  again, any offsets added across a given position/column
  have no effect on the logo.

The other arguments are:
* -reduce : given a higher-order model, the script will produce a
  mononuc logo if this flag is set.
* -nointerval : force the script to hide confidence interval shading
* -nonspec : if this flag is set, only the first position of any
  parameter will be drawn, and position labels on the
  x-axis will be suppressed.  (suitable for drawing the
  position-independent component of a mononuc + PI model)
* -kmer : if this flag is set, higher order parameters within the
  input will be interpreted as position-independent and
  plotted to a separate png file.  If you try to draw a k-mer
  motif without either this or -reduce set, the program will
  balk. 
* -linewidth : width of axis lines
* -ptsize : font point size
* -height, -width, -margin : height/width of the output file in
  pixels.  the margin is the distance between the edge of the image
  and the axes.

example use: 
./intervalLogo.pl -in example/testing01_001_iter003.coef.interval.tab -out example/testing01_001_iter003.coef.logo.png

J: rcIntervalTab.pl

This is a simple script that takes the output of the interval program
and reproduces the same information after taking the reverse
complement.  It takes one optional argument, -kmer, a flag that is
necessary for taking the r/c of k-mer models.  (To be precise, the
program will work but the output will place the higher-order
parameters at the last position rather than the first, in which case
there will be problems with intervalLogo should you want to plot your
results.) 

./rcIntervalTab.pl -in example/testing01_001_iter003.coef.interval.tab -out example/testing01_001_iter003.coef.interval.rc.tab

K: motifCor.pl

This script produces the motif correlation between two or more motifs,
as described in the paper.  The input files must be formatted as
bindSter coef files.  If correlating two files, you may specify them
using the -in1 and -in2 arguments.  You can also produce corrlations
for all pairs among a list of coefficients using the -in argument,
where the file -in points to is a list of coef files, with one entry
per line.  

In all cases, the output is a table showing the correlations for each
pair of motifs.  If there are only two motifs being correlated, the
output file includes a note specifying the conditions that led to the
maximum correlation between those motifs.  That is, the note will
specify whether a reverse complement was required and what offset was
necessary.  Additionally, in the case of only two motifs being
correlated, the correlation obtained is written to stdout.

arguments:
* -out : write the table of correlations here
* -maxshift : the maximum offset allowed when comparing one motif to another
* -energy : compare not the information content (as described in the
  paper) but the energy parameters themselves.  The procedure
  is identical except that the conversion from e_a^i to h_a^i
  is not performed. 
* -ignorewords : treat all motifs as if they had no higher order
  parameters (no effect on mononuc motifs)
* -noreduce : by default, when comparing two motifs, higher order motifs
  are "reduced" to mononucleotide information content before
  comparison.  This flag forces the script to compare
  information content without reducing to mononuc.  Note
  that in this case, correlations between different models
  are impossible. (no effect on mononuc motifs)
* -kmer : if comparing two kmer models of the same order, this flag
  alerts the script that higher order parameters are position
  independent.  The -noreduce must be set in order to use this
  option, as it is meaningless unless reduction to mononuc is
  not performed.




CONTENTS OF EXAMPLE DIRECTORY

* rand.1000.60.dual.text
    This file contains dummy data for use in testing the package.  It
    has 1000 random probes, each with a random sequence of 60 bases and
    a random score drawn from a uniform distribution on the interval
    [0,1).  
* safe_try.coef 
    This file should look exactly like the output of the example use
    from section A: bindSter
* testing01_001_iter003.coef*
    These are the results of the walkthrough procedure
* provided/*
    These are the working files used to produce the walkthrough results 




WALKTHROUGH

This walkthrough describes the procedure for fitting a new mononuc
motif from high throughput TF binding data such as PBM.  The only
differences between this walkthrough and our actual procedure is the
use of dummy data, the use of only 20 parallel runs instead of 80, and
that the regression will not converge in the number of steps used here.   

(The walkthrough creates an environment variable called $BSDIR that
points to the location of the executables distributed with this
package.  This is merely a convenience, as none of those executables
will require such an environment variable.)

You may check your working files against those in examples/provided/
and the "final" results against examples/testing01_001_iter003.coef* .
I instruct you to use the same seed I used in creating those results,
so your results should be identical.  However, it's possible that an
update to the BOOST random library or something platform dependent
will result in differences between your results and those in the 
examples/provided/ directory.



# Step 0: prepare working space

# cd to wherever the bindSterPackage extracted to

export BSDIR=$PWD # your shell may want different syntax, e.g. setenv
mkdir example/yourDir 
mkdir example/yourDir/regressions
mkdir example/yourDir/scripts/
mkdir example/yourDir/scripts/step01


# Step 1: make and submit jobs

cd example/yourDir/scripts/step01
$BSDIR/writeBindSter.pl -dual /home/glocke/percus5/scripts/tfscripts/cpp/distribute/example/rand.1000.60.dual.txt -dir ../../regressions/ -job yourJob -maxstep 20 -njobs 20 -seed 1415988679
ls $PWD/*bash > bash.list
$BSDIR/queueList.pl -in bash.list -all
# as noted above, queueList.pl is platform-specific.  see section F.

# these jobs should complete within a few seconds


# Step 2: check to see whether any of the jobs have found min improvement
# they won't have in this case, but if you had set -maxstep 3000
# then surely most of them would have

$BSDIR/checkMinImprovement.pl -in bash.list -dir ../step02 -err fail.list
# checkMinImprovement.pl is also platform-specific. see section G.
ls *list # if something has gone wrong, you will probably find a file
# called fail.unknown.list or fail.absent.list.  
# If you find no such list, then proceed.

ls ../../regressions # you should find 60 files in here now

# Step 3: continue the unfinished jobs

cd ../step02
ls $PWD/*bash > bash.list 
$BSDIR/queueList.pl -in bash.list -all

# again, these should conclude very rapidly


# Step 4: rinse and repeat
$BSDIR/checkMinImprovement.pl -in bash.list -dir ../step03 -err fail.list
ls *list # just checking
ls ../../regressions # 100 files now expected

cd ../step03
ls $PWD/*bash > bash.list 
$BSDIR/queueList.pl -in bash.list -all

# in the actual case, you'd repeate this process until all jobs finish
##   checkMinImprovement will alert you if this is the case.


# Step 5: Find the best regression
# we'll now assume that all parallel runs of bindSter have reached
# minimum improvement.

cd ../../regressions
$BSDIR/findBestErr.pl

# if your results match mine, then findBestErr should produce a bunch
# of output, the last four lines of which will look like this: 

#./yourJob_008_iter003.coef.err  0.870841
#./yourJob_011_iter003.coef.err  0.870405
#./yourJob_002_iter003.coef.err  0.865921
#./yourJob_001_iter003.coef.err  0.862233

cd ../
$BSDIR/findBestErr.pl -dir regressions -brief
# this should write the line, "regressions/yourJob_001_iter003.coef" to STDOUT
cp regressions/yourJob_001_iter003.coef .

# Step 6: prepare your final results
$BSDIR/applyModel --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.applied
$BSDIR/fullProfile --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.fullProfile 
$BSDIR/interval --coef yourJob_001_iter003.coef --dual ../rand.1000.60.dual.txt --out yourJob_001_iter003.coef.interval.tab # this may take a minute or so
$BSDIR/rcIntervalTab.pl -in yourJob_001_iter003.coef.interval.tab -out yourJob_001_iter003.coef.interval.rc.tab
$BSDIR/intervalLogo.pl -in yourJob_001_iter003.coef.interval.tab -out yourJob_001_iter003.coef.logo.png
$BSDIR/intervalLogo.pl -in yourJob_001_iter003.coef.interval.rc.tab -out yourJob_001_iter003.coef.rc.logo.png

# yourDir should now have 8 files and two directories in it
# you can compare the files with their counterparts in the examples
# directory.