NEXT-peak
- NEXT-peak
is a program to call
peaks from ChIP-seq data for
transcription factor binding sites.
- The latest version of NEXT-peak can be downloaded here.
- Instructions for installation can be found here.
- Supplementary program for finding multiple binding sites in a
region can be found here.
NEXT-peak is implementing the
Normal-EXponential Two-peak (NEXT-peak)
model. To specify the model, parameters beta and sigma have to be
estimated. Parameter estimation
can be done with ChIP-seq data and a list of the corresponding motif
sites. NEXT-peak can use
optional
mappability information. The mappability information should be saved in
a directory with the
file names of chr1.map, chr2.map .... The files should contain values
of zeros or ones, where 1 represents
mappable, 0, unmappable. Using the PeakSeq suite, these files can be
generated by modifying
CountMap.py of the PeakSeq suite.
Usage
$ nextpeak <OPTIONS>
- nextpeak is a name of the executable file.
- <OPTIONS>
is a list of options (arguments).
- The options are
separated by spaces.
- Each option starts with the symbol "-" and takes the
form
-<option code>
<option value>
Options
-l To provide a list of the genomic
sequence lengths.
A possible format is a single column of values.
If this file is not provided, a user can
use -g
option to provide a genomic sequence file.
If neither -l nor -g options was used,
sequence
lengths of hg18 (Homo Sapiens Build 36.1) is used.
-g When -l option is not used, a user can
provide a
genomic sequence file, containing all genomic
sequences in FASTA format. Possible
headers are
">" immediately followed by
a chromosome number, e.g., >1,
>2, etc.
headers of UCSC human genome.
">chr"
followed by a number or "X", "Y",
e.g.,
>chr1, >chr2, ...,
>chrX, >chrY.
headers of NCBI Build 36.1.
e.g.
>gi|89161185|ref|NC_000001.9|NC_000001 Homo sapiens chromosome
1,
reference assembly, complete sequence
-c To provide mapped ChIP-seq data. The
format should
be "+"/"-" for strand information, followed by
a genome number, followed by a genomic
coordinate.
e.g.,
-
5 25494792
+
18
31889166
- 23
118034178
...
(more lines)
When Bowtie is used for mapping, this
format can be
generated with --suppress 1,5,6,7,8 option with
headers containing only numbers.
-d To provide a directory name that
contains files
for optional mappability information. The directory
should contain "/" at the last position
of the text,
e.g., /home/mappability/
Under the directory, it should contain
files with
names "chr1.map", "chr2.map",..., "chr24.map",etc.
Note that "chrX.map" and "chrY.map" are
not allowed;
use "chr23.map" and "chr24.map" instead.
Each file should contain a sequence of
zeros and
ones, "0" for unmappable location, "1" for mappable.
-r To generate optional output for the
candidate
regions for peaks. With this output, a user can
make a plot of any particular region of
interest. If
mappability information was used, unmappable
locations have "Non" instead of usually
a pair of
counts, one for the left strand, the other for
the right strand, separated by "-",
e.g., 2-4 means
2 tags were mapped in the left (forward) strand,
4 tags were mapped in the right
(backward) strand
for this particular genomic location.
-m To estimate NEXT-peak model parameters
beta and
sigma, a user can provide a motif site location file.
The file should have two columns, the
first is for
the genome number, the second is for the left end
location of the site. When using -m
option, make
sure to provide the tag length (-h option) and
the motif length (-i option) as well.
-h To provide the length of tags.
-i To provide the length of the motif.
-a To provide the name of the file that contains
observed frequencies around the motif sites.
-b To provide the value of parameter
beta. If -m
option is used, beta and sigma will be estimated.
-s To provide the value of parameter
sigma.
-w To provide a window length for
identifying regions
of interest. (Default: 150)
-t To provide a minimum count in a window
for
identifying regions of interest. (Default: 15)
-v To provide the vector length for
NEXT-peak density
tabulation. (Default: 1000)
Output
format
NEXT-peak creates two parts of the output: a global summary and regions
statistics.
A global summary appears ahead of region statistics. It shows a
ChIP-seq file name, parameter estimates for the model, and suggested
cutoff values for spurious region. Note that the area under the
curve (AUC) is not the usual AUC from a ROC curve. This is a precision
curve of Figure 3 in the article. It considers top 10,000 peaks.
NEXT-peak generates several statistics for each candidate region where
each line holds statistics for each region. Total 20 columns will be
reported. The information in the
columns are as follows:
- col 1: genome number
- col 2: starting position of the region in
terms of
genomic coordinate (position starts from 1, inclusive)
- col 3: ending position of the region
(inclusive)
- col 4: region length defined as col 3 -
col 2 + 1
- col 5: number of mappable locations in
the region. If
all locations are mappable or mappability information was not
used,
col 5 = col 4.
- col 6: ratio of mappable locations
defined as col 5 /
col 4
- col 7: relative position of the center of
the
estimated peak (mu). To convert to the genomic coordinate of the peak,
add
col 2 to col 7.
- col 8: estimated standard error for the
peak position
(standard error of mu)
- col 9: number of the left (forward) tags
in the region
- col 10: ratio of the left tags and the
total number
of tags: col 9 / col 11
- col 11: the total number of tags in the
region
- col 12: estimated number of tags due to
the binding
on both strands (2 * nu)
- col 13: estimated standard error for the
tags numbers
(standard error of 2 * nu)
- col 14: estimated local background error
rate (rho)
- col 15: estimated standard error of the
background
error rate (standard error of rho)
- col 16: ratio of the tags explained by
the binding
event
- col 17: p-value for test of nu = 0, i.e.,
whether
there is no binding event in the region.
- col 18: test statistic for the
goodness-of-fit test
of the observed tags to the NEXT-peak model
- col 19: degrees of freedom of the
goodness-of-fit test
- col 20: p-value for the goodness-of-fit
test
Output
example
ChIP-seq file: /STAT1/STAT1mapped.txt
Mappability information is used
Motif location file: /CPP/FindKnownMotif/STAT1_w15_5m6.txt
Parameter estimation using anchored data: beta=68.3992 sigma=50.5182
mu=500 nu=180350 rho=182.072 Total count=724845
cutoff for spurious region: (region length = 400, p-value for GOF =
1e-6) AUC = 0.307992 Distance to motif cutoff=250
1 554323 558411 4089 3759 0.919296 791.791 12.5462 3488 0.462538 7541
208.413 28.4792 0.976817 0.00378813 0.0261623 0 19.8336 12 0.0703005
1 558564 560144 1581 1341 0.848197 905.581 8.03588 1257 0.483834 2598
251.906 20.3322 0.883445 0.00758099 0.0879905 0 37.5651 12 0.000180821
1 703751 703951 201 170 0.845771 177.155 7.92663 16 0.727273 22 19.1956
8.83313 0.0406596 0.0259798 0.371624 6.39239e-08 16.5858 12 0.165855
1 865987 866342 356 356 1 174.524 6.68542 30 0.517241 58 53.0984
6.03224 0.0142044 0.00847225 0.825629 2.92872e-13 14.6734 12 0.259783
1 926070 926219 150 150 1 85.3262 6.54363 7 0.466667 15 18.3815 4.48855
0.0150987 0.0149618 0.698026 0.000978739 13.2669 12 0.349941
1 938384 938767 384 384 1 205.205 5.9193 31 0.596154 52 44.2749 3.86602
0.0147415 0.00503388 0.78228 1.23131e-11 20.1754 12 0.0638401
1 1004685 1004834 150 150 1 130.048 5.75204 13 0.866667 15 27.2357
3.47877 0.00553989 0.0115959 0.889202 0.00261428 17.7941 12 0.122086
1 1041429 1041681 253 253 1 108.664 5.64222 14 0.451613 31 20.2571
3.08478 0.0298033 0.0060964 0.513534 6.61504e-09 11.8285 12 0.459548
1 1060382 1061082 701 661 0.942939 400.4 3.65425 140 0.454545 308
302.083 3.04245 0.0289262 0.0023014 0.875843 4.61065e-99 64.2658 12
3.72534e-09
...
(more lines)
Usage
examples
- Example 1:
- A mapped ChIP-seq data file should be prepared in the format
explained
in -c option. If a user knows beta and sigma values from the previous
run or from a separate
analysis, beta value (-b option) and sigma (-s option) could be used.
nextpeak assumes hg18 (Homo Sapiens
Build 36.1) as a reference genome.
$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1
- Example 2:
- If the reference genome is not hg18, a user can provide a list
of
genomic sequences.
$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -l
hg19SeqLen.txt
- Example 3:
- As an alternative to Example 2, a user can provide genomic
sequences in
FASTA format.
$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -g
hg19all.fa
- Example 4:
- When beta and sigma values are unknown, a file for motif sites
should
be prepared as explained in -m option.When -m option is used, make sure
to provide the tag length (-h option)
and the motif length (-i option).
$ nextpeak -c ChIPmapped.txt -m motifSites.txt
-h 27 -i 15
- Example 5:
- When mappability information is available through the use of
the
PeakSeq suite, a user can improve peak-calling by adding mappability
information. As explained in -d
option, each file should contain mappability information for each
chromosome. Make sure to add "/" when
providing the path of the directory.
$ nextpeak -c ChIPmapped.txt -b 65.2 -s 45.1 -d
/home/Mappability/
- Example 6:
- To save frequencies of tags around the motif sites, use -a
option.
$ nextpeak -c ChIPmapped.txt -m motifSites.txt
-h 27 -i 15 -a
anchoredFreq.txt
- Example 7:
- To save frequencies of the mapped tags in the regions, add -r
option.
$ nextpeak -c ChIPmapped.txt -m motifSites.txt
-h 27 -i 15 -r
regions.txt
- Example 8:
- Combination of multiple options.
$ nextpeak -c ChIPmapped.txt -m motifSites.txt
-h 27 -i 15 -a
anchoredFreq.txt -r regions.txt -d /home/Mappability/
Files
and Installation
The files in the downloaded directory include:
- nextpeak1.1_linux.zip: Linux
executable
- nextpeak1.1_cpp_files.zip:
C++ source files
- No special installation is required
- The executable file can be downloaded, unzipped and run with an
appropriate command
- Alternatively, the source C++ files can be downloaded, unzipped,
and complied in a suitable C++ environment. For compilation in Linux,
included Makefile can be used.
Reference
Kim NK, Jayatillake RV, Spouge JL,
"NEXT-peak: a normal-exponential two-peak model for peak-calling in
ChIP-seq data"
(2013), BMC
Genomics,
14:349