nextpeak

NEXT-peak

NEXT-peak is a program to call peaks from ChIP-seq data for transcription factor binding sites.
The latest version of NEXT-peak can be downloaded here.
Instructions for installation can be found here.
Supplementary program for finding multiple binding sites in a region can be found here.

NEXT-peak is implementing the Normal-EXponential Two-peak (NEXT-peak) model. To specify the model, parameters beta and sigma have to be estimated. Parameter estimation can be done with ChIP-seq data and a list of the corresponding motif sites. NEXT-peak can use optional mappability information. The mappability information should be saved in a directory with the file names of chr1.map, chr2.map .... The files should contain values of zeros or ones, where 1 represents mappable, 0, unmappable. Using the PeakSeq suite, these files can be generated by modifying CountMap.py of the PeakSeq suite.

Usage

$ nextpeak <OPTIONS>

nextpeak is a name of the executable file.
<OPTIONS> is a list of options (arguments).

The options are separated by spaces.
Each option starts with the symbol "-" and takes the form

-<option code> <option value>

Options

-l    To provide a list of the genomic sequence lengths. A possible format is a single column of values.
    If this file is not provided, a user can use -g option to provide a genomic sequence file.
    If neither -l nor -g options was used, sequence lengths of hg18 (Homo Sapiens Build 36.1) is used.

-g    When -l option is not used, a user can provide a genomic sequence file, containing all genomic
    sequences in FASTA format. Possible headers are

    ">" immediately followed by a chromosome number, e.g., >1, >2, etc.

    headers of UCSC human genome. ">chr" followed by a number or "X", "Y",
        e.g., >chr1, >chr2, ..., >chrX, >chrY.

    headers of NCBI Build 36.1.
        e.g. >gi|89161185|ref|NC_000001.9|NC_000001 Homo sapiens chromosome 1, reference assembly, complete sequence

-c    To provide mapped ChIP-seq data. The format should be "+"/"-" for strand information, followed by
    a genome number, followed by a genomic coordinate. e.g.,

        -       5       25494792
        +       18      31889166
      -       23      118034178
        ...
        (more lines)
    When Bowtie is used for mapping, this format can be generated with --suppress 1,5,6,7,8 option with
    headers containing only numbers.

-d    To provide a directory name that contains files for optional mappability information. The directory
    should contain "/" at the last position of the text, e.g., /home/mappability/
    Under the directory, it should contain files with names "chr1.map", "chr2.map",..., "chr24.map",etc.
    Note that "chrX.map" and "chrY.map" are not allowed; use "chr23.map" and "chr24.map" instead.
    Each file should contain a sequence of zeros and ones, "0" for unmappable location, "1" for mappable.

-r    To generate optional output for the candidate regions for peaks. With this output, a user can
    make a plot of any particular region of interest. If mappability information was used, unmappable
    locations have "Non" instead of usually a pair of counts, one for the left strand, the other for
    the right strand, separated by "-", e.g., 2-4 means 2 tags were mapped in the left (forward) strand,
    4 tags were mapped in the right (backward) strand for this particular genomic location.

-m    To estimate NEXT-peak model parameters beta and sigma, a user can provide a motif site location file.
    The file should have two columns, the first is for the genome number, the second is for the left end
    location of the site. When using -m option, make sure to provide the tag length (-h option) and
    the motif length (-i option) as well.

-h    To provide the length of tags.

-i    To provide the length of the motif.

-a    To provide the name of the file that contains observed frequencies around the motif sites.

-b    To provide the value of parameter beta. If -m option is used, beta and sigma will be estimated.

-s    To provide the value of parameter sigma.

-w    To provide a window length for identifying regions of interest. (Default: 150)

-t    To provide a minimum count in a window for identifying regions of interest. (Default: 15)

-v    To provide the vector length for NEXT-peak density tabulation. (Default: 1000)

Output format

NEXT-peak creates two parts of the output: a global summary and regions statistics.

A global summary appears ahead of region statistics. It shows a ChIP-seq file name, parameter estimates for the model, and suggested cutoff values for spurious region. Note that the area under the curve (AUC) is not the usual AUC from a ROC curve. This is a precision curve of Figure 3 in the article. It considers top 10,000 peaks.

NEXT-peak generates several statistics for each candidate region where each line holds statistics for each region. Total 20 columns will be reported. The information in the columns are as follows:

col 1: genome number
col 2: starting position of the region in terms of genomic coordinate (position starts from 1, inclusive)
col 3: ending position of the region (inclusive)
col 4: region length defined as col 3 - col 2 + 1
col 5: number of mappable locations in the region. If all locations are mappable or mappability information was not used, col 5 = col 4.
col 6: ratio of mappable locations defined as col 5 / col 4
col 7: relative position of the center of the estimated peak (mu). To convert to the genomic coordinate of the peak, add col 2 to col 7.
col 8: estimated standard error for the peak position (standard error of mu)
col 9: number of the left (forward) tags in the region
col 10: ratio of the left tags and the total number of tags: col 9 / col 11
col 11: the total number of tags in the region
col 12: estimated number of tags due to the binding on both strands (2 * nu)
col 13: estimated standard error for the tags numbers (standard error of 2 * nu)
col 14: estimated local background error rate (rho)
col 15: estimated standard error of the background error rate (standard error of rho)
col 16: ratio of the tags explained by the binding event
col 17: p-value for test of nu = 0, i.e., whether there is no binding event in the region.
col 18: test statistic for the goodness-of-fit test of the observed tags to the NEXT-peak model
col 19: degrees of freedom of the goodness-of-fit test
col 20: p-value for the goodness-of-fit test

Output example

ChIP-seq file: /STAT1/STAT1mapped.txt
Mappability information is used
Motif location file: /CPP/FindKnownMotif/STAT1_w15_5m6.txt
Parameter estimation using anchored data: beta=68.3992 sigma=50.5182 mu=500 nu=180350 rho=182.072 Total count=724845
cutoff for spurious region: (region length = 400, p-value for GOF = 1e-6) AUC = 0.307992 Distance to motif cutoff=250
1 554323 558411 4089 3759 0.919296 791.791 12.5462 3488 0.462538 7541 208.413 28.4792 0.976817 0.00378813 0.0261623 0 19.8336 12 0.0703005
1 558564 560144 1581 1341 0.848197 905.581 8.03588 1257 0.483834 2598 251.906 20.3322 0.883445 0.00758099 0.0879905 0 37.5651 12 0.000180821
1 703751 703951 201 170 0.845771 177.155 7.92663 16 0.727273 22 19.1956 8.83313 0.0406596 0.0259798 0.371624 6.39239e-08 16.5858 12 0.165855
1 865987 866342 356 356 1 174.524 6.68542 30 0.517241 58 53.0984 6.03224 0.0142044 0.00847225 0.825629 2.92872e-13 14.6734 12 0.259783
1 926070 926219 150 150 1 85.3262 6.54363 7 0.466667 15 18.3815 4.48855 0.0150987 0.0149618 0.698026 0.000978739 13.2669 12 0.349941
1 938384 938767 384 384 1 205.205 5.9193 31 0.596154 52 44.2749 3.86602 0.0147415 0.00503388 0.78228 1.23131e-11 20.1754 12 0.0638401
1 1004685 1004834 150 150 1 130.048 5.75204 13 0.866667 15 27.2357 3.47877 0.00553989 0.0115959 0.889202 0.00261428 17.7941 12 0.122086
1 1041429 1041681 253 253 1 108.664 5.64222 14 0.451613 31 20.2571 3.08478 0.0298033 0.0060964 0.513534 6.61504e-09 11.8285 12 0.459548
1 1060382 1061082 701 661 0.942939 400.4 3.65425 140 0.454545 308 302.083 3.04245 0.0289262 0.0023014 0.875843 4.61065e-99 64.2658 12 3.72534e-09
...
(more lines)

Usage examples

Example 1:

A mapped ChIP-seq data file should be prepared in the format explained in -c option. If a user knows beta and sigma values from the previous run or from a separate analysis, beta value (-b option) and sigma (-s option) could be used. nextpeak assumes hg18 (Homo Sapiens Build 36.1) as a reference genome.

$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1

Example 2:

If the reference genome is not hg18, a user can provide a list of genomic sequences.

$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -l hg19SeqLen.txt

Example 3:

As an alternative to Example 2, a user can provide genomic sequences in FASTA format.

$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -g hg19all.fa

Example 4:

When beta and sigma values are unknown, a file for motif sites should be prepared as explained in -m option.When -m option is used, make sure to provide the tag length (-h option) and the motif length (-i option).

$ nextpeak -c ChIPmapped.txt -m motifSites.txt -h 27 -i 15

Example 5:

When mappability information is available through the use of the PeakSeq suite, a user can improve peak-calling by adding mappability information. As explained in -d option, each file should contain mappability information for each chromosome. Make sure to add "/" when providing the path of the directory.

$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -d /home/Mappability/

Example 6:

To save frequencies of tags around the motif sites, use -a option.

$ nextpeak -c ChIPmapped.txt -m motifSites.txt -h 27 -i 15 -a anchoredFreq.txt

Example 7:

To save frequencies of the mapped tags in the regions, add -r option.

$ nextpeak -c ChIPmapped.txt -m motifSites.txt -h 27 -i 15 -r regions.txt

Example 8:

Combination of multiple options.

$ nextpeak -c ChIPmapped.txt -m motifSites.txt -h 27 -i 15 -a anchoredFreq.txt -r regions.txt -d /home/Mappability/

Files and Installation

The files in the downloaded directory include:

nextpeak1.1_linux.zip: Linux executable
nextpeak1.1_cpp_files.zip: C++ source files

No special installation is required
The executable file can be downloaded, unzipped and run with an appropriate command
Alternatively, the source C++ files can be downloaded, unzipped, and complied in a suitable C++ environment. For compilation in Linux, included Makefile can be used.

Reference

Kim NK, Jayatillake RV, Spouge JL,
"NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data"
(2013), BMC Genomics, 14:349