Análisis Datos NGS 3. Aplicaciones: ChIP-seq David G. Pisano dgpisano@cnio.es 1 2 Visión general de un experimento de ChIP-seq Sample fragmentation Immunoprecipitation Non-histone ChIP Histone ChIP DNA purification End repair and adaptor ligation Cluster generation (bridge PCR) Illumina Sequencing with reversible terminators PolyA tailing Amplification on beads (emulsion PCR) Roche Pyrosequencing ABI Sequencing by ligation Helicos Single-molecule sequencing with reversible terminators Sequence reads Figure 1 | Overview of a ChIP–seq experiment. Using chromatin immunoprecipitation Nature Reviews | Genetics (ChIP) followed by massively parallel sequencing, 3 the specific DNA sites that interact with transcription factors or other chromatin-associated proteins (non-histone ChIP) REVI instance, can be done much more effectively by Sequence variations within repeat elemen captured by sequencing and used to map re genome; unique sequences that flank repea helpful in aligning the reads to the genome. ple, only 48% of the human genome is non-rep 80% is mappable with 30 bp reads and 89% is with 70 bp reads38. All profiling technologies produce u artefacts, and ChIP–seq is no exception. sequencing errors have been reduced subst the technology has improved, they are sti especially towards the end of each read. Thi can be ameliorated by improvements in align rithms (see below) and computational analys also bias towards GC-rich content in fragmen both in library preparation and in amplificat and during sequencing14,39, although notable ments have been made recently. In addition insufficient number of reads is generated, the sensitivity or specificityin detection of enrich There are also technical issues in performing ment, such as loading the correct amount of s little sample will result in too few tags; too mu will result in fluorescent labels that are too c another, and therefore lower quality data. However, the main disadvantage with is its current cost and availability. Several gr successfully developed and applied their o cols for library construction, which has low cost substantially. But the overall cost of C which includes machine depreciation and re will have to be lowered further for it to be co with the cost of ChIP–chip in every case. resolution profiling of an entire large genome is already less expensive than ChIP–chip, bu Nat.size Rev. Genetics 2009 ing on the Park, genome and the depth of se 4 5 6 7 8 ChIP-Seq Methods Benchmarking Tested -> Results! Problems (no controls, human reference only, nb. Of reads, sequence formats….)! Commercial Solutions! 9 Enrique Carrillo de Santa-Pau Intergenic Peaks Gene-Associated Peaks max|FC| = maximal|log2(fold change WT/KO)| within a 100bp window for each peak 10 Angel Carro Check sequencing depth Sequencing platform Image analysis (base calling) 35–50 bp sequences Genome alignment Quality scores Visualization with genome browser Peak calling Control sample Other advanced analysis Enriched regions Differential profile analysis Motif discovery Relationship to gene structure Integration with gene expression Gene set analysis Figure 4 | Overview of ChIP–seq analysis.11The raw data for chromatin immunopre- Nature Reviews | Genetics generally become t are still re samples steps. As concorda as a third Challen As NGS p generatio ing factor of the da tion, I di data anal range of varied an flow char shown in Data m produces and imag run, whic no me © 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology cis ge ARTICLES An integrated software system for analyzing ChIP-chip and ChIP-seq data Hongkai Ji1, Hui Jiang2, Wenxiu Ma3, David S Johnson4,8, Richard M Myers5 & Wing H Wong6,7 We present CisGenome, a software system for analyzing genome-wide chromatin immunoprecipitation (ChIP) data. CisGenome is designed to meet all basic needs of ChIP data analyses, including visualization, data normalization, peak detection, false discovery rate computation, gene-peak association, and sequence and motif analysis. In addition to implementing previously published ChIP–microarray (ChIP-chip) analysis methods, the software contains statistical methods designed specifically for ChlP sequencing (ChIP-seq) data obtained by coupling ChIP with massively parallel sequencing. The modular design of CisGenome enables it to support interactive analyses through a graphic user interface as well as customized batch-mode computation for advanced data mining. A built-in browser allows visualization of array images, signals, gene structure, conservation, and DNA sequence and motif information. We demonstrate the use of these tools by a comparative analysis of ChIP-chip and ChIP-seq data for the transcription factor NRSF/REST, a study of ChIP-seq analysis with or without a negative control sample, and an analysis of a new motif in Nanog- and Sox2-binding regions. Chromatin immunoprecipitation followed by either genome tiling array analysis (ChIP-chip)1–3 or massively parallel sequencing (ChIP-seq)4–10 enables transcriptional regulation to be studied on a genome-wide scale (Supplementary Fig. 1 online). By systematically identifying protein-DNA interactions of interest, studies using these technologies provide information on cis-regulatory circuitry underlying various cellular processes. However, analysis of the massive and heterogeneous datasets from these studies poses several challenges, including effective data visualization, seamless connection of low-level (close to raw data) and high-level (close to answering biological questions) analysis, integration of data from multiple technological platforms, and flexibility to customize the analysis so that specific biological questions can be addressed. Although there are several recently developed programs11–31 that target some of the individual steps, an integrated tool that can satisfy all basic needs in ChIP data analyses is not yet available (see Supplementary Notes online). We developed a set of methods to meet these needs in ChIP data analyses and implemented them in an integrated software package (Fig. 1). CisGenome provides a wide range of functionalities for ChIP data analyses that can be accessed through a menu-driven system in a graphic user interface (GUI). The results are automatically linked to the CisGenome browser, which is designed for data visualization. CisGenome is a standalone system that bench biologists can use to analyze their own data locally on personal computers. At the same time, most CisGenome functionalities can also be accessed in a command-line manner. This modular design allows computational biologists to build large batch jobs for customized analyses on computer servers. Cisgenome es un paquete de programas para analizar experimentos de ChIP-chip y de ChIP-seq RESULTS CisGenome basic functionalities are listed below. Data processing and binding region identification Finding regions harboring protein-DNA association is the critical first step of ChIP data analyses. CisGenome can detect these regions (or peaks) from raw tiling array probe intensities or mapped sequence reads. For example, using the GUI, one can directly load Affymetrix CEL and BPMAP tiling array data, examine raw array images to detect hybridization artifacts, normalize data across different arrays and then detect binding regions (Supplementary Fig. 2a–c online). CisGenome can also take as input the binding regions or peak scores obtained from other preprocessing programs, such as MAT11 for ChIP-chip and QuEST30 for ChIP-seq data. CisGenome uses TileMap12 for internal ChIP-chip peak calling and false discovery rate (FDR) estimation (Supplementary Methods online). • Usa modelos de poisson y de binomial condicionada • Multiplataforma (GUI sólo en Windos, Mac, Linux...) • Rápido • Muchos falsos positivos, pero más preciso en la Visualization of results Convenient visualization of raw and processed data provides a useful way to assess data quality and generate scientific hypotheses. In CisGenome, the peak signals, including fold changes and summary statistics, are reported in tables and linked to the CisGenome browser. Using the browser, one can visualize at the probe or read level determinación de los picos que la mayoría • Herramientas downstream asociadas: anotación de picos, obtención de secuencias, análisis de motivos de unión http://www.biostat.jhsph.edu/~hji/cisgenome/ 1Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, Maryland 21205, USA. 2Institute for Computational and Mathematical Engineering, Stanford University, Durand Building, 496 Lomita Mall, Stanford, California 94305, USA. 3Department of Computer Science, Stanford University, 353 Serra Mall, Stanford, California 94305, USA. 4Department of Genetics, Stanford University School of Medicine, 300 Pasteur Drive, Stanford, California 94305, USA. 5HudsonAlpha Institute for Biotechnology, 601 Genome Way, Huntsville, Alabama 35806, USA. 6Department of Statistics, 7Department of Health Research and Policy, Stanford University, Sequoia Hall, 390 Serra Mall, Stanford, California 94305, USA. 8Present address: Gene Security Network, Inc., 1442 Cortland Avenue, San Francisco, California 94110, USA. Correspondence should be addressed to W.H.W. (whwong@stanford.edu). Received 11 June; accepted 3 October; published online 2 November 2008; doi:10.1038/nbt.1505 NATURE BIOTECHNOLOGY VOLUME 26 NUMBER 11 12 NOVEMBER 2008 1293 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 1*. Convertir el fichero al estándar de alineamientos de cisgenome (aln) 3Jane:cisgenome$ file_bed2aln * ----------------------------- */ file_bed2aln -i input -o output example: file_bed2aln -i input.bed -o output.aln 3Jane:cisgenome$ file_eland2aln /* ----------------------------- */ file_eland2aln -i input -o output -s number of bp to shift the start position (default: 12bp) example: file_eland2aln -i seq.eland.out -o seq.aln /* ----------------------------- */ 13 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 2. Ordenar el fichero de alineamiento por localización cromosómica 3Jane:cisgenome$ tablesorter_str tablesorter <src txt file> tablesorter <src txt file> <dest txt file> 3Jane:cisgenome$ tablesorter_str chip2002_uniq_hg17_chr21.aln chip2002_uniq_hg17_chr21.aln.sort 25573 data units loaded. Sorting ... 14 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 3. Convertir el fichero de texto ordenado a binario (bar) 3Jane:cisgenome$ hts_aln2barv2 /* ----------------------------- */ hts_aln2barv2 -i input -o output example: hts_aln2barv2 -i input.txt -o output.bar 3Jane:cisgenome$ hts_aln2barv2 -i chip2002_uniq_hg17_chr21.aln.sort -o chip2002_uniq_hg17_chr21.aln.sort.bar Nota: el programa genera 3 ficheros bar, con extensión .bar, _F.bar y _R.bar respectivamente. Contienen todas las lecturas, sólo las lecturas 5’ y sólo las 3’ respectivamente. En los pasos posteriores, usar sólamente el fichero .bar en la opción -i. No usar _F.bar ó _R.bar, el programa los buscará y usará automáticamente cuando los necesite. 15 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 4. Calcular el FDR y elegir un punto de corte 3Jane:cisgenome$ hts_windowsummaryv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/ hg17/chrlist.txt -l ../../genomes/hg17/chrlen.txt -w 100 -o summarys1.txt 3Jane:cisgenome$ more summarys1.txt #Forward+Reverse_Combined # Window_Size=100 Poisson_Lambda=0.000538 Poisson_p=0.999887 NegBinomial_Alpha=0.001596 NegBinomial_Beta=1.965232 NegBinomial_p=1.000000 #No_of_reads/window No_of_window percentage poisson_expected poisson_exp/obs negbinomial_expected negbinomial_exp/obs 0 30784167 0.999349 0.999349 1.000000 0.999344 0.999994 1 16567 0.000538 0.000538 1.000000 0.000538 0.999994 2 2798 0.000091 0.000000 0.001593 0.000091 0.999994 3 473 0.000015 0.000000 0.000002 0.000020 1.331008 4 109 0.000004 0.000000 0.000000 0.000005 1.461668 5 31 0.000001 0.000000 0.000000 0.000001 1.387133 6 15 0.000000 0.000000 0.000000 0.000000 0.805911 7 7 0.000000 0.000000 0.000000 0.000000 0.499333 8 6 0.000000 0.000000 0.000000 0.000000 0.171943 9 1 0.000000 0.000000 0.000000 0.000000 0.309323 10 3 0.000000 0.000000 0.000000 0.000000 0.031301 11 3 0.000000 0.000000 0.000000 0.000000 0.009598 12 2 0.000000 0.000000 0.000000 0.000000 0.004451 13 2 0.000000 0.000000 0.000000 0.000000 0.001386 FDR<0.01 El FDR para cada nivel de número de reads está en la última columna de summarys1.txt 16 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 5. Primera ronda de detección de picos Obtenido en el paso anterior 3Jane:cisgenome$3Jane:nrsf_chipseq dgonzalez$ hts_peakdetectorv2 /* ----------------------------- */ hts_peakdetectorv2 -i input -d output folder -o output title -w window size (default = 100) -s step size (default = 25) -c cutoff (default = 10) -g max gap (default = 0) -l min region length (default = 1) -br apply boundary refinement (0[default]: no; 1: yes) -brl boundary refinement min region length (default=30) -ssf apply single strand filtering (0[default]: no; 1: yes) -cf (currently only for developer's use) forward cutoff (default = c) -cr (currently only for developer's use) reverse cutoff (default = c) -z use combined data after shifting (default = 0) example: hts_peakdetectorv2 -i input.bar -d . -o output -w 200 -s 25 -c 10 -g 50 l 500 -br 1 -brl 30 -ssf 1 -cf 8 -cr 8 -z 1 /* ----------------------------- */ 17 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 5. Primera ronda de detección de picos 3Jane:cisgenome$ mkdir out 3Jane:cisgenome$ hts_peakdetectorv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -d out -o chip2002_uniq_hg17_chr21 -w 100 -s 25 -c 11 -br 1 -brl 30 -ssf 1 Nota: Si el ordenador no tiene suficiente memoria, usa un -s mayor (por ejemplo, -s 100) 6. Conversión a bed para su visualización 3Jane:cisgenome$ cd out 3Jane:cisgenome$ file_cod2bed /* ----------------------------file_cod2bed -i input -o output example: file_cod2bed -i input.cod -o /* ----------------------------3Jane:cisgenome$ file_cod2bed -i */ output.bed */ chip2002_uniq_hg17_chr21.cod -o chip2002_uniq_hg17_chr21.bed 18 om e en cis g a r t s e u m a n u e d s i cisgenome: anális 7. Mover las lecturas hacia el centro del pico L bp (mitad de longitud mediana) 3Jane:cisgenome$ hts_alnshift2bar -i chip2002_uniq_hg17_chr21.aln.sort.bar -s 35 8. Recalcular FDR y elegir cut-off de nuevo 3Jane:cisgenome$ hts_windowsummaryv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/ chrlist.txt -l ../../genomes/hg17/chrlen.txt -w 100 -o summarys1_2.txt -z 1 9. Segunda ronda de detección de picos 3Jane:cisgenome$ mkdir out2 3Jane:cisgenome$ hts_peakdetectorv2 -i chip2002_uniq_hg17_chr21.aln.sort.bar -d out2 -o chip2002_uniq_hg17_chr21 -w 100 -s 25 -c 13 -br 1 -brl 30 -ssf 1 -z 1 19 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 1*. Convertir los 2 ficheros al estándar de alineamientos de cisgenome (aln) 3Jane:cisgenome$ file_bed2aln * ----------------------------- */ file_bed2aln -i input -o output example: file_bed2aln -i input.bed -o output.aln 3Jane:cisgenome$ file_eland2aln /* ----------------------------- */ file_eland2aln -i input -o output -s number of bp to shift the start position (default: 12bp) example: file_eland2aln -i seq.eland.out -o seq.aln /* ----------------------------- */ 20 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 2. Ordenar los ficheros de alineamiento por localización cromosómica 3Jane:cisgenome$ 25573 data units Sorting ... 3Jane:cisgenome$ 35197 data units Sorting ... tablesorter_str chip2002_uniq_hg17_chr21.aln chip2002_uniq_hg17_chr21.aln.sort loaded. tablesorter_str mock2002_uniq_hg17_chr21.aln mock2002_uniq_hg17_chr21.aln.sort loaded. 21 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 3. Convertir los ficheros de texto ordenados a binario (bar) 3Jane:cisgenome$ hts_aln2barv2 -i chip2002_uniq_hg17_chr21.aln.sort -o chip2002_uniq_hg17_chr21.aln.sort.bar 3Jane:cisgenome$ hts_aln2barv2 -i mock2002_uniq_hg17_chr21.aln.sort -o mock2002_uniq_hg17_chr21.aln.sort.bar 22 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 4. Calcular el FDR y anotar el dp_Hat 3Jane:cisgenome$ hts_windowsummaryv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n mock2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/chrlist.txt -l ../../genomes/hg17/ chrlen.txt -w 100 -o summarys2.txt #Window_Size=100 dP0_hat=0.368806 #Poisson_Lambda=0.001239 Poisson_p=0.999701 #NegBinomial_Alpha=0.003319 NegBinomial_Beta=1.678024 NegBinomial_p=1.000000 #No_of_reads/window No_of_window percentage poisson_expected poisson_exp/obs negbinomial_expected negbinomial_exp/obs 0 30756849 0.998463 0.998463 1.000000 0.998450 0.999987 1 38121 0.001238 0.001238 1.000000 0.001238 0.999987 2 7141 0.000232 0.000001 0.003308 0.000232 0.999987 3 1493 0.000048 0.000000 0.000007 0.000058 1.192636 4 373 0.000012 0.000000 0.000000 0.000016 1.338401 5 107 0.000003 0.000000 0.000000 0.000005 1.394913 6 40 0.000001 0.000000 0.000000 0.000002 1.161886 7 26 0.000001 0.000000 0.000000 0.000000 0.572439 8 13 0.000000 0.000000 0.000000 0.000000 0.374247 9 4 0.000000 0.000000 0.000000 0.000000 0.403883 10 4 0.000000 0.000000 0.000000 0.000000 0.135782 11 3 0.000000 0.000000 0.000000 0.000000 0.061478 12 4 0.000000 0.000000 0.000000 0.000000 0.015787 13 6 0.000000 0.000000 0.000000 0.000000 0.003629 23 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 5. Primera ronda de detección de picos Obtenido en el paso anterior hts_peakdetectorv2_2sample /* ----------------------------- */ hts_peakdetectorv2_2sample -i input (positive) -n input (negative) -d output folder -o output title -t 1: one-sided or 0: two-sided test (default=1) -w window size (default = 100) -s step size (default = 25) -f FDR file -c fdr cutoff (default = 0.1) -m min window read_num (default = 10) -p p0 = pos/neg background ratio -g max gap (default = 0) -l min region length (default = 1) -br apply boundary refinement (0[default]: no; 1: yes) -brl boundary refinement min region length (default=30) -ssf apply single strand filtering (0[default]: no; 1: yes) -cf (currently only for developer's use) forward cutoff (default = c) -cr (currently only for developer's use) reverse cutoff (default = c) -z use combined data after shifting (default = 0) -fc minimal fold change -tfc minimal region pos_read/neg_read ratio example: hts_peakdetectorv2_2sample -i pos.bar -n neg.bar -d . -o output -f FDR.txt -c 0.2 -p 0.5 -br 1 -brl 30 -ssf 1 -cf 8 -cr 8 -z 1 /* ----------------------------- */ 24 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 5. Primera ronda de detección de picos 3Jane:cisgenome$ mkdir out_2s 3Jane:cisgenome$ hts_peakdetectorv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n mock2002_uniq_hg17_chr21.aln.sort.bar -d out_2s/ -o chipvsmock_chr21 -f summarys2.txt.fdr -c 0.1 -m 10 -w 100 -s 25 -p 0.368806 -br 1 -brl 30 -ssf 1 Nota: Si el ordenador no tiene suficiente memoria, usa un -s mayor (por ejemplo, -s 100) 6. Conversión a bed para su visualización 3Jane:cisgenome$ cd out_2S 3Jane:cisgenome$ file_cod2bed -i chipvsmock_chr21.cod -o chipvsmock_chr21.bed 25 om e en cis g s a r t s e u m s o d e d s cisgenome: análisi 7. Mover las lecturas hacia el centro del pico L bp (mitad de longitud mediana) 3Jane:cisgenome$ hts_alnshift2bar -i chip2002_uniq_hg17_chr21.aln.sort.bar -s 35 3Jane:cisgenome$ hts_alnshift2bar -i mock2002_uniq_hg17_chr21.aln.sort.bar -s 35 8. Recalcular FDR y elegir cut-off de nuevo 3Jane:cisgenome$ hts_windowsummaryv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n mock2002_uniq_hg17_chr21.aln.sort.bar -g ../../genomes/hg17/chrlist.txt -l ../../genomes/hg17/chrlen.txt -w 100 -o summarys2_2.txt -z 1 9. Segunda ronda de detección de picos 3Jane:cisgenome$ mkdir out_2s2 3Jane:cisgenome$ hts_peakdetectorv2_2sample -i chip2002_uniq_hg17_chr21.aln.sort.bar -n mock2002_uniq_hg17_chr21.aln.sort.bar -d out_2s2/ -o chipvsmock_chr21 -f summarys2_2.txt.fdr -c 0.1 -m 10 -w 100 -s 25 -p 0.369843 -br 1 -brl 30 -ssf 1 -z 1 26