3. Aplicaciones: ChIP-seq

Anuncio
Análisis Datos NGS
3. Aplicaciones: ChIP-seq
David G. Pisano
dgpisano@cnio.es
1
2
Visión general de un experimento de ChIP-seq
Sample fragmentation
Immunoprecipitation
Non-histone ChIP
Histone ChIP
DNA purification
End repair and
adaptor ligation
Cluster
generation
(bridge PCR)
Illumina
Sequencing
with reversible
terminators
PolyA tailing
Amplification
on beads
(emulsion PCR)
Roche
Pyrosequencing
ABI
Sequencing
by ligation
Helicos
Single-molecule
sequencing
with reversible
terminators
Sequence reads
Figure 1 | Overview of a ChIP–seq experiment. Using chromatin immunoprecipitation
Nature Reviews | Genetics
(ChIP) followed by massively parallel sequencing,
3 the specific DNA sites that interact
with transcription factors or other chromatin-associated proteins (non-histone ChIP)
REVI
instance, can be done much more effectively by
Sequence variations within repeat elemen
captured by sequencing and used to map re
genome; unique sequences that flank repea
helpful in aligning the reads to the genome.
ple, only 48% of the human genome is non-rep
80% is mappable with 30 bp reads and 89% is
with 70 bp reads38.
All profiling technologies produce u
artefacts, and ChIP–seq is no exception.
sequencing errors have been reduced subst
the technology has improved, they are sti
especially towards the end of each read. Thi
can be ameliorated by improvements in align
rithms (see below) and computational analys
also bias towards GC-rich content in fragmen
both in library preparation and in amplificat
and during sequencing14,39, although notable
ments have been made recently. In addition
insufficient number of reads is generated, the
sensitivity or specificityin detection of enrich
There are also technical issues in performing
ment, such as loading the correct amount of s
little sample will result in too few tags; too mu
will result in fluorescent labels that are too c
another, and therefore lower quality data.
However, the main disadvantage with
is its current cost and availability. Several gr
successfully developed and applied their o
cols for library construction, which has low
cost substantially. But the overall cost of C
which includes machine depreciation and re
will have to be lowered further for it to be co
with the cost of ChIP–chip in every case.
resolution profiling of an entire large genome
is already less expensive than ChIP–chip, bu
Nat.size
Rev.
Genetics
2009
ing on the Park,
genome
and
the depth
of se
4
5
6
7
8
ChIP-Seq Methods Benchmarking
Tested -> Results!
Problems (no controls, human reference only, nb. Of
reads, sequence formats….)!
Commercial Solutions!
9
Enrique Carrillo de Santa-Pau
Intergenic Peaks
Gene-Associated Peaks
max|FC| = maximal|log2(fold change WT/KO)| within a 100bp window for each peak
10
Angel Carro
Check sequencing depth
Sequencing
platform
Image analysis
(base calling)
35–50 bp
sequences
Genome
alignment
Quality
scores
Visualization
with genome
browser
Peak
calling
Control
sample
Other
advanced
analysis
Enriched
regions
Differential
profile
analysis
Motif
discovery
Relationship
to gene
structure
Integration
with gene
expression
Gene set
analysis
Figure 4 | Overview of ChIP–seq analysis.11The raw data for chromatin immunopre-
Nature Reviews | Genetics
generally
become t
are still re
samples
steps. As
concorda
as a third
Challen
As NGS p
generatio
ing factor
of the da
tion, I di
data anal
range of
varied an
flow char
shown in
Data m
produces
and imag
run, whic
no
me
© 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology
cis
ge
ARTICLES
An integrated software system for analyzing ChIP-chip
and ChIP-seq data
Hongkai Ji1, Hui Jiang2, Wenxiu Ma3, David S Johnson4,8, Richard M Myers5 & Wing H Wong6,7
We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome
is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false
discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously
published ChIP–microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically
for ChlP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of
CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode
computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure,
conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of
ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative
control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions.
Chromatin immunoprecipitation followed by either genome tiling
array analysis (ChIP-chip)1–3 or massively parallel sequencing
(ChIP-seq)4–10 enables transcriptional regulation to be studied on a
genome-wide scale (Supplementary Fig. 1 online). By systematically
identifying protein-DNA interactions of interest, studies using these
technologies provide information on cis-regulatory circuitry underlying various cellular processes. However, analysis of the massive and
heterogeneous datasets from these studies poses several challenges,
including effective data visualization, seamless connection of low-level
(close to raw data) and high-level (close to answering biological
questions) analysis, integration of data from multiple technological
platforms, and flexibility to customize the analysis so that specific
biological questions can be addressed. Although there are several
recently developed programs11–31 that target some of the individual
steps, an integrated tool that can satisfy all basic needs in ChIP data
analyses is not yet available (see Supplementary Notes online).
We developed a set of methods to meet these needs in ChIP data
analyses and implemented them in an integrated software package
(Fig. 1). CisGenome provides a wide range of functionalities for
ChIP data analyses that can be accessed through a menu-driven
system in a graphic user interface (GUI). The results are automatically
linked to the CisGenome browser, which is designed for data
visualization. CisGenome is a standalone system that bench biologists
can use to analyze their own data locally on personal computers.
At the same time, most CisGenome functionalities can also be
accessed in a command-line manner. This modular design allows
computational biologists to build large batch jobs for customized
analyses on computer servers.
Cisgenome es un paquete de programas para
analizar experimentos de ChIP-chip y de ChIP-seq
RESULTS
CisGenome basic functionalities are listed below.
Data processing and binding region identification
Finding regions harboring protein-DNA association is the critical first
step of ChIP data analyses. CisGenome can detect these regions (or
peaks) from raw tiling array probe intensities or mapped sequence
reads. For example, using the GUI, one can directly load Affymetrix
CEL and BPMAP tiling array data, examine raw array images to detect
hybridization artifacts, normalize data across different arrays and then
detect binding regions (Supplementary Fig. 2a–c online). CisGenome
can also take as input the binding regions or peak scores obtained
from other preprocessing programs, such as MAT11 for ChIP-chip and
QuEST30 for ChIP-seq data. CisGenome uses TileMap12 for internal
ChIP-chip peak calling and false discovery rate (FDR) estimation
(Supplementary Methods online).
• Usa modelos de poisson y de binomial condicionada
• Multiplataforma (GUI sólo en Windos, Mac, Linux...)
• Rápido
• Muchos falsos positivos, pero más preciso en la
Visualization of results
Convenient visualization of raw and processed data provides a useful
way to assess data quality and generate scientific hypotheses. In
CisGenome, the peak signals, including fold changes and summary
statistics, are reported in tables and linked to the CisGenome browser.
Using the browser, one can visualize at the probe or read level
determinación de los picos que la mayoría
• Herramientas downstream asociadas: anotación de picos,
obtención de secuencias, análisis de motivos de unión
http://www.biostat.jhsph.edu/~hji/cisgenome/
1Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205, USA. 2Institute for Computational
and Mathematical Engineering, Stanford University, Durand Building, 496 Lomita Mall, Stanford, California 94305, USA. 3Department of Computer Science, Stanford
University, 353 Serra Mall, Stanford, California 94305, USA. 4Department of Genetics, Stanford University School of Medicine, 300 Pasteur Drive, Stanford, California
94305, USA. 5HudsonAlpha Institute for Biotechnology, 601 Genome Way, Huntsville, Alabama 35806, USA. 6Department of Statistics, 7Department of Health Research
and Policy, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, California 94305, USA. 8Present address: Gene Security Network, Inc., 1442 Cortland Avenue,
San Francisco, California 94110, USA. Correspondence should be addressed to W.H.W. (whwong@stanford.edu).
Received 11 June; accepted 3 October; published online 2 November 2008; doi:10.1038/nbt.1505
NATURE BIOTECHNOLOGY
VOLUME 26
NUMBER 11
12
NOVEMBER 2008
1293
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
1*. Convertir el fichero al estándar de alineamientos de cisgenome (aln)
3Jane:cisgenome$ file_bed2aln
* ----------------------------- */
file_bed2aln
-i input
-o output
example:
file_bed2aln -i input.bed -o output.aln
3Jane:cisgenome$ file_eland2aln
/* ----------------------------- */
file_eland2aln
-i input
-o output
-s number of bp to shift the start position (default: 12bp)
example:
file_eland2aln -i seq.eland.out -o seq.aln
/* ----------------------------- */
13
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
2. Ordenar el fichero de alineamiento por localización cromosómica
3Jane:cisgenome$ tablesorter_str
tablesorter <src txt file>
tablesorter <src txt file> <dest txt file>
3Jane:cisgenome$ tablesorter_str chip2002_uniq_hg17_chr21.aln chip2002_uniq_hg17_chr21.aln.sort
25573 data units loaded.
Sorting ...
14
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
3. Convertir el fichero de texto ordenado a binario (bar)
3Jane:cisgenome$ hts_aln2barv2
/* ----------------------------- */
hts_aln2barv2
-i input
-o output
example:
hts_aln2barv2 -i input.txt -o output.bar
3Jane:cisgenome$ hts_aln2barv2 -i chip2002_uniq_hg17_chr21.aln.sort -o
chip2002_uniq_hg17_chr21.aln.sort.bar
Nota: el programa genera 3 ficheros bar, con extensión .bar, _F.bar y _R.bar
respectivamente. Contienen todas las lecturas, sólo las lecturas 5’ y sólo las 3’
respectivamente. En los pasos posteriores, usar sólamente el fichero .bar en la
opción -i. No usar _F.bar ó _R.bar, el programa los buscará y usará
automáticamente cuando los necesite.
15
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
4. Calcular el FDR y elegir un punto de corte
3Jane:cisgenome$ hts_windowsummaryv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/
hg17/chrlist.txt -l ../../genomes/hg17/chrlen.txt -w 100 -o summarys1.txt
3Jane:cisgenome$ more summarys1.txt
#Forward+Reverse_Combined
# Window_Size=100
Poisson_Lambda=0.000538 Poisson_p=0.999887
NegBinomial_Alpha=0.001596
NegBinomial_Beta=1.965232
NegBinomial_p=1.000000
#No_of_reads/window
No_of_window
percentage
poisson_expected
poisson_exp/obs
negbinomial_expected
negbinomial_exp/obs
0
30784167
0.999349
0.999349
1.000000
0.999344
0.999994
1
16567
0.000538
0.000538
1.000000
0.000538
0.999994
2
2798
0.000091
0.000000
0.001593
0.000091
0.999994
3
473
0.000015
0.000000
0.000002
0.000020
1.331008
4
109
0.000004
0.000000
0.000000
0.000005
1.461668
5
31
0.000001
0.000000
0.000000
0.000001
1.387133
6
15
0.000000
0.000000
0.000000
0.000000
0.805911
7
7
0.000000
0.000000
0.000000
0.000000
0.499333
8
6
0.000000
0.000000
0.000000
0.000000
0.171943
9
1
0.000000
0.000000
0.000000
0.000000
0.309323
10
3
0.000000
0.000000
0.000000
0.000000
0.031301
11
3
0.000000
0.000000
0.000000
0.000000
0.009598
12
2
0.000000
0.000000
0.000000
0.000000
0.004451
13
2
0.000000
0.000000
0.000000
0.000000
0.001386
FDR<0.01
El FDR para cada nivel de número de reads está en la última
columna de summarys1.txt
16
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
5. Primera ronda de detección de picos
Obtenido en el
paso anterior
3Jane:cisgenome$3Jane:nrsf_chipseq dgonzalez$ hts_peakdetectorv2
/* ----------------------------- */
hts_peakdetectorv2
-i input
-d output folder
-o output title
-w window size (default = 100)
-s step size (default = 25)
-c cutoff (default = 10)
-g max gap (default = 0)
-l min region length (default = 1)
-br apply boundary refinement (0[default]: no; 1: yes)
-brl boundary refinement min region length (default=30)
-ssf apply single strand filtering (0[default]: no; 1: yes)
-cf (currently only for developer's use) forward cutoff (default = c)
-cr (currently only for developer's use) reverse cutoff (default = c)
-z use combined data after shifting (default = 0)
example:
hts_peakdetectorv2 -i input.bar -d . -o output -w 200 -s 25 -c 10 -g 50 l 500 -br 1 -brl 30 -ssf 1 -cf 8 -cr 8 -z 1
/* ----------------------------- */
17
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
5. Primera ronda de detección de picos
3Jane:cisgenome$ mkdir out
3Jane:cisgenome$ hts_peakdetectorv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -d out -o
chip2002_uniq_hg17_chr21 -w 100 -s 25 -c 11 -br 1 -brl 30 -ssf 1
Nota: Si el ordenador no tiene suficiente memoria, usa un -s mayor (por ejemplo, -s 100)
6. Conversión a bed para su visualización
3Jane:cisgenome$ cd out
3Jane:cisgenome$ file_cod2bed
/* ----------------------------file_cod2bed
-i input
-o output
example:
file_cod2bed -i input.cod -o
/* ----------------------------3Jane:cisgenome$ file_cod2bed -i
*/
output.bed
*/
chip2002_uniq_hg17_chr21.cod -o chip2002_uniq_hg17_chr21.bed
18
om
e
en
cis
g
a
r
t
s
e
u
m
a
n
u
e
d
s
i
cisgenome: anális
7. Mover las lecturas hacia el centro del pico L bp (mitad de longitud mediana)
3Jane:cisgenome$ hts_alnshift2bar -i chip2002_uniq_hg17_chr21.aln.sort.bar -s 35
8. Recalcular FDR y elegir cut-off de nuevo
3Jane:cisgenome$ hts_windowsummaryv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/
chrlist.txt -l ../../genomes/hg17/chrlen.txt -w 100 -o summarys1_2.txt -z 1
9. Segunda ronda de detección de picos
3Jane:cisgenome$ mkdir out2
3Jane:cisgenome$ hts_peakdetectorv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -d out2 -o
chip2002_uniq_hg17_chr21 -w 100 -s 25 -c 13 -br 1 -brl 30 -ssf 1 -z 1
19
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
1*. Convertir los 2 ficheros al estándar de alineamientos de cisgenome (aln)
3Jane:cisgenome$ file_bed2aln
* ----------------------------- */
file_bed2aln
-i input
-o output
example:
file_bed2aln -i input.bed -o output.aln
3Jane:cisgenome$ file_eland2aln
/* ----------------------------- */
file_eland2aln
-i input
-o output
-s number of bp to shift the start position (default: 12bp)
example:
file_eland2aln -i seq.eland.out -o seq.aln
/* ----------------------------- */
20
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
2. Ordenar los ficheros de alineamiento por localización cromosómica
3Jane:cisgenome$
25573 data units
Sorting ...
3Jane:cisgenome$
35197 data units
Sorting ...
tablesorter_str chip2002_uniq_hg17_chr21.aln chip2002_uniq_hg17_chr21.aln.sort
loaded.
tablesorter_str mock2002_uniq_hg17_chr21.aln mock2002_uniq_hg17_chr21.aln.sort
loaded.
21
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
3. Convertir los ficheros de texto ordenados a binario (bar)
3Jane:cisgenome$ hts_aln2barv2 -i chip2002_uniq_hg17_chr21.aln.sort -o
chip2002_uniq_hg17_chr21.aln.sort.bar
3Jane:cisgenome$ hts_aln2barv2 -i mock2002_uniq_hg17_chr21.aln.sort -o
mock2002_uniq_hg17_chr21.aln.sort.bar
22
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
4. Calcular el FDR y anotar el dp_Hat
3Jane:cisgenome$ hts_windowsummaryv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n
mock2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/chrlist.txt -l ../../genomes/hg17/
chrlen.txt -w 100 -o summarys2.txt
#Window_Size=100
dP0_hat=0.368806
#Poisson_Lambda=0.001239
Poisson_p=0.999701
#NegBinomial_Alpha=0.003319
NegBinomial_Beta=1.678024
NegBinomial_p=1.000000
#No_of_reads/window
No_of_window
percentage
poisson_expected
poisson_exp/obs
negbinomial_expected
negbinomial_exp/obs
0
30756849
0.998463
0.998463
1.000000
0.998450
0.999987
1
38121
0.001238
0.001238
1.000000
0.001238
0.999987
2
7141
0.000232
0.000001
0.003308
0.000232
0.999987
3
1493
0.000048
0.000000
0.000007
0.000058
1.192636
4
373
0.000012
0.000000
0.000000
0.000016
1.338401
5
107
0.000003
0.000000
0.000000
0.000005
1.394913
6
40
0.000001
0.000000
0.000000
0.000002
1.161886
7
26
0.000001
0.000000
0.000000
0.000000
0.572439
8
13
0.000000
0.000000
0.000000
0.000000
0.374247
9
4
0.000000
0.000000
0.000000
0.000000
0.403883
10
4
0.000000
0.000000
0.000000
0.000000
0.135782
11
3
0.000000
0.000000
0.000000
0.000000
0.061478
12
4
0.000000
0.000000
0.000000
0.000000
0.015787
13
6
0.000000
0.000000
0.000000
0.000000
0.003629
23
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
5. Primera ronda de detección de picos
Obtenido en el
paso anterior
hts_peakdetectorv2_2sample
/* ----------------------------- */
hts_peakdetectorv2_2sample
-i input (positive)
-n input (negative)
-d output folder
-o output title
-t 1: one-sided or 0: two-sided test (default=1)
-w window size (default = 100)
-s step size (default = 25)
-f FDR file
-c fdr cutoff (default = 0.1)
-m min window read_num (default = 10)
-p p0 = pos/neg background ratio
-g max gap (default = 0)
-l min region length (default = 1)
-br apply boundary refinement (0[default]: no; 1: yes)
-brl boundary refinement min region length (default=30)
-ssf apply single strand filtering (0[default]: no; 1: yes)
-cf (currently only for developer's use) forward cutoff (default = c)
-cr (currently only for developer's use) reverse cutoff (default = c)
-z use combined data after shifting (default = 0)
-fc minimal fold change
-tfc minimal region pos_read/neg_read ratio
example:
hts_peakdetectorv2_2sample -i pos.bar -n neg.bar -d . -o output -f
FDR.txt -c 0.2 -p 0.5 -br 1 -brl 30 -ssf 1 -cf 8 -cr 8 -z 1
/* ----------------------------- */
24
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
5. Primera ronda de detección de picos
3Jane:cisgenome$ mkdir out_2s
3Jane:cisgenome$ hts_peakdetectorv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n
mock2002_uniq_hg17_chr21.aln.sort.bar -d out_2s/ -o chipvsmock_chr21 -f summarys2.txt.fdr -c 0.1 -m 10 -w
100 -s 25 -p 0.368806 -br 1 -brl 30 -ssf 1
Nota: Si el ordenador no tiene suficiente memoria, usa un -s mayor (por ejemplo, -s 100)
6. Conversión a bed para su visualización
3Jane:cisgenome$ cd out_2S
3Jane:cisgenome$ file_cod2bed -i chipvsmock_chr21.cod -o chipvsmock_chr21.bed
25
om
e
en
cis
g
s
a
r
t
s
e
u
m
s
o
d
e
d
s
cisgenome: análisi
7. Mover las lecturas hacia el centro del pico L bp (mitad de longitud mediana)
3Jane:cisgenome$ hts_alnshift2bar -i chip2002_uniq_hg17_chr21.aln.sort.bar -s 35
3Jane:cisgenome$ hts_alnshift2bar -i mock2002_uniq_hg17_chr21.aln.sort.bar -s 35
8. Recalcular FDR y elegir cut-off de nuevo
3Jane:cisgenome$ hts_windowsummaryv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n
mock2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/chrlist.txt -l ../../genomes/hg17/chrlen.txt
-w 100 -o summarys2_2.txt -z 1
9. Segunda ronda de detección de picos
3Jane:cisgenome$ mkdir out_2s2
3Jane:cisgenome$ hts_peakdetectorv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n
mock2002_uniq_hg17_chr21.aln.sort.bar -d out_2s2/ -o chipvsmock_chr21 -f summarys2_2.txt.fdr -c 0.1 -m 10
-w 100 -s 25 -p 0.369843 -br 1 -brl 30 -ssf 1 -z 1
26
Descargar