Blog
New Developments In NGS Data Analysis: ChIP seq
- December 4, 2018
- Posted by: rasa
- Category: Bioinformatics Next Generation Sequencing
Genome-wide examinations concerning cooperative interactions among genomic functions, e.g. DNA replication, isolation, translation, repair and rearrangement, are fundamental for methodically explaining every single biological activity on the genome. To this end, chromatin immunoprecipitation pursued by sequencing (ChIP-seq) examination was produced to comprehend the collaboration and cooperations that happen in a wide assortment of living beings utilizing next-generation sequencing (NGS) [1– 3]. ChIP-seq examination is a standard technique in genomics and epigenomics and has prompted vital revelations identified with disease-related transcriptional regulation [4– 7], tissue-explicitness of epigenetic control [8, 9] and chromatin association [10– 13].
ChlP-seq protocols have numerous steps including sample preparation and computational examination (Figure 1). In a word, cross-linked chromatin is sonicated, and purged with and without immunoprecipitation (ChIP and comparing input DNA fragments, individually). DNA sections are sequenced as reads, which are then mapped onto the reference genome, and the genomic areas that are fundamentally enriched for ChIP reads, contrasted with input reads, are distinguished as peaks. Other genomic areas are viewed as non-specific background. Called peaks, which speak to candidates of targeted protein/DNA-binding and histone alteration sites, can be utilized to distinguish related functional annotations, including binding motifs [14, 15] and gene ontology [16, 17]. ChIP-seq results can likewise be coordinated with different sorts of genomic tests, including gene expression, DNA methylation, and chromatin conformation, to comprehend mechanisms of genomic functions from numerous aspects [18– 20].
The shapes of the peaks differ among proteins and are arranged into three modes [1]: ‘sharp mode’, situated at specific positions in the genome; ‘broad mode’, related with substantial genomic domains; and mixed mode’, which includes both peak modes. As most point-source transcription factors (TFs) and restricted chromatin markers (e.g. H3K4me3) have sharp modes, a vast part of peak-calling calculations has been intended for this mode, despite the fact that different proteins (e.g. heterochromatin protein HP1 [21]) and some histone changes (e.g. H3K9me3) have broad modes. The mixed mode is seen for RNA polymerase II (Pol II) and transcription elongation factors [22]. Distinctive peak- calling techniques are required for each shape.
Recent advances in sequencing technology and examinations empower us to deal with several ChIP tests all the while; such huge scale investigations uncovered the high-dimensional interrelationship for administrative components [23– 25] and comment on
novel utilitarian genomic regions once more [26, 27]. Since a large scale investigation is delicate to the nature of input samples and changing the conventions for each sample’s quality is troublesome, samples which have deficient quality ought to be dismissed naturally. As there are different components (counting immune response quality) during sample preparation that influence the nature of the obtained outcomes [28, 29], multilateral quality evaluations during the computational methods are fundamental. Thus, to acquire high caliber, unprejudiced and reasonable data, the general convention design and quality administration, which are changed in accordance with the examination properties, are of basic significance.
- Sample preparation and sequencing: Fragmented DNAs (150– 500 bp) from ChIP-seq tests are sequenced as reads (36– 100 bp). Single-end reads are regularly utilized for ChIP-seq investigations, while paired-end ones enhance the library complexity and increment mapping proficiency at repetitive regions [36]. At the point when research focuses on repetitive regions, longer, as well as paired end reads, are favored. The chromatin accessibility during fragmentation isn’t uniform over the genome. In some open-chromatin areas, DNA is manageable to discontinuity and along these lines specially represented in the fragmented sample, which causes false-positive read advancement [37].
- Sequencing Depth: Successful investigation of ChIP-seq data requires adequate inclusion by sequence reads (sequencing profundity). It mainly relies upon the size of the genome, and the number and size of the coupling sites of the protein.
- Quality Metrics: Preprocessing of ChIP-seq data, as a rule, is like that of some other sequencing data; survey the quality of raw reads to recognize conceivable sequencing errors or biases.
- Read Mapping: It is imperative to consider the percentage of interestingly mapped reads revealed by the mapper. The rate changes between life forms; for human, mouse, or Arabidopsis, above 70% exceptionally mapped reads is normal. A low level of particularly mapped reads might be cause for concern and is regularly because of, (I) either inordinate amplification in the PCR step, (ii) inadequate read length, or (iii) issues with the sequencing stage. Read mappers are intended to permit a (client settable) number of mismatches in the reads. Most peak-calling calculations would overlook the multi-mapping reads.
- Â Library complexity: Library complexity is estimated by the non- redundant division (NRF) (1). Since the NRF score relies upon the aggregate number of mapped reads, read sampling is fundamental when looking at NRF scores among tests. Low library complexity frequently happens when samples are set up from a little number of materials. Regardless of whether the quantity of sequenced reads is adequate after polymerase chain response (PCR) amplification, the considerable read number might be little, bringing about a poor significance power.
- Signal to noise ratio: The S/N is assessed by the number and quality of peak acquired for each ChIP test. This measure can likewise be utilized to evaluate the level of noise in the input sample. The ENCODE consortium proposed two measurements, fraction of reads in peaks (FRiP) and cross-correlation profiles (CCPs) to quantify the S/Ns [28]. This value relates decidedly to the number and power of the identified peaks. ChIP tests that have excessively few peaks can be separated utilizing a cutoff for this score. Additionally, an extensive S/N does not ensure that the recognized peaks are real binding sites—a substantial score simply implies that there are many read-enriched areas in the genome. Tests that have some false-positive sites (e.g. non-explicit restricting destinations) likewise have huge S/Ns.
- Mapping: The huge number of reads produced in each study should be analyzed and that investigation starts with alignment to a reference genome. The most broadly utilized for ChIP-seq have been ELAND, MAQ7 and Bowtie8. Mapping is by and large performed while taking into consideration a modest number (1-3) of sequence mismatches.
- MAQ: It makes utilization of the sequence quality values, with the goal that a confound at low-quality bases is dealt with uniquely in contrast to a mismatch at high-quality bases, accepting that a low-quality base-call is almost certain a sequencing error.
- 5.2 Bowtie: It is one of the quickest mapping calculations. The calculations likewise vary in their treatment of reads that guide to numerous areas, situating them haphazardly or self-assertively. In the event that ChIP-seq analysis recoups sequences from exceedingly repetitive areas, the utilization of paired-end sequencing presents the chance to grapple read-pairs in a non-rehash region of the genome, consequently expanding trust in the last mapping.
- Peak Calling: A peak is called where either the number of reads exceeds a pre-determined threshold value or where there is a minimum enrichment compared to background signal, often in a sliding window across the genome. Some tools apply both methods. The parameters for identifying peaks can be adjusted, sometimes leading to very different numbers of peaks being called. The user must determine if fewer high-quality peaks are preferred over lower-quality peaks.A peak is called where either the quantity of reads surpasses a pre-decided threshold value or where there is a minimum enrichment contrasted with foundation signal, regularly in a sliding window over the genome. A few instruments apply the two techniques. The parameters for distinguishing peaks can be balanced, once in a while leading to different quantities of peaks being called. The researcher must decide whether few high-quality peaks are favored over lower-quality peaks.
- Significance analysis: Many peak callers register a P value for called peaks, while others utilize the height of the peaks and additionally improvement over foundation to rank peaks. However, these annotations don’t give factual criticalness values. The false discovery rate (FDR) is frequently used to give a more genuine peak list, and this can be computed from the given P value. A few bundles make utilization of the control data to decide an experimental FDR and create a proportion of peaks in the sample versus a control.
- Peak Callers:
- MACS: MACS, used often for transcription factor binding site peaks, is one of the most popular peaks MACS, utilized frequently for transcription factor binding site peaks, is a standout amongst the most well-known peak-callers, it is additionally one of the most seasoned, and its age likely adds to its prosperity. It is a decent technique, adequate for some experimental conditions and requires almost no support whenever referred to as the apparatus utilized in a publication. Â MACS expel repetitive reads and perform read shifting to represent the counterbalance in forward or invert strand reads. It utilizes control samples and local measurements to limit bias and computes an observational FDR.
- SICER: SICER was produced for more diffuse chromatin modification that can range kilobases or megabases of the genome. The SICER technique examines the genome in windows and distinguishes groups of spatial signs that are probably not going to show up by chance. These groups or “islands” are utilized rather than fixed length windows. Gaps in the islands are permitted with the end goal to conquer technical issues. This gap size can be balanced for various sorts of chromatin modification. The program makes utilization of control data or an irregular background model (30)
- T-PIC: This utilizes the shape of putative peaks to recognize genuine peaks from the background noise. The researchers contrasted their methodology with MACS and PeakSeq and exhibited enhanced outcomes. The bundle initially stretches out short reads to the evaluated section length; it then isolates the genome into regions for which it develops “trees” for shape investigation and utilization the tree shape measurement to distinguish genuine peaks (31).
- Genome-wide event finding and motif discovery (GEM): This is one of the most up to date apparatuses. Its special element is the mix of peak finding and motif analysis to enhance the goals of the final peaks called. The GEM research shows very nearly 400 spatially-obliged transcription factor binding occasions. This instrument seems, by all accounts, to be an energizing improvement for ChIP-seq researches.
- DATA ANALYSIS
- Differential Binding Analysis: A moderately new system is the investigation of differential binding, which draws much from the examination of differential gene expression and has the comparative capacity to identify biologically important binding changes between samples (35). The DiffBind programming package permits identification of genomic loci that are differentially bound between two conditions. It was produced dependent on calculations utilized for differential gene expression examination by RNA-seq. These differential strategies enable scientists to survey ChIP peaks quantitatively utilizing peak heights. Key to these techniques is the standardization of read counts in ChIP-seq datasets and quantile standardization strategies like those utilized in microarray examination.
- Motif Analysis: A standout amongst the most widely recognized as of ChIP-seq tests is to find the sequence motifs for protein binding in the genome. The Multiple EM for Motif Elicitation (MEME) calculation is the most generally embraced instrument for motif discovery (32). Regularly, various motifs can be found in a solitary data collection, and motif analysis can be performed even on low-quality ChIP-seq data, despite the fact that the statistical importance of these motifs is probably going to be lower.
- Chromatin State: Another valuable examination of ChIP-seq data originates from a deliberate methodology utilized by the ENCODE consortium to describe genomic regions dependent on histone modification content (34). Different histone changes are assayed utilizing modification-specific histone antibodies in ChIP-seq investigations to get a profile of that histone mark inside a sample. For its very own investigations, the consortium has actualized thorough particularity tests that utilization varieties of differentially changed histone tail peptides to guarantee antibody specificity. They additionally share normal cell sources which are collectively profiled and compared, guaranteeing consistency between individual trials. Their present rules cover antibody validation, test replication, sequencing depth, information and metadata revealing, and data quality assessment (33).
REFERENCES
- Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet2009;10:669–80.
- Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods2009;6:S22–32.
- Furey TS.ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet.2012;13:840–52.
- Deardorff MA, Bando M, Nakato R, et al.   HDAC8 mutations in Cornelia de Lange syndrome affect the cohesin acetylation cycle. Nature2012;489:313–17.
- Schaub MA, Boyle AP, Kundaje A, et al.   Linking disease associations with regulatory information in the human genome. Genome Res 2012;22:1748–59.
- Zuin J, FrankeV, van Ijcken WF, et al.   A cohesin-independent role for NIPBL at promoters provides insights in CdLS. PLoS Genet2014;10:e1004153.
- Izumi K, Nakato R, Zhang Z, et al.   Germline gain-of-function mutations in AFF4 cause a developmental syndrome functionally linking the super elongation complex and cohesin. Nat Genet.2015;47:338–44.
- Mikkelsen TS, Xu Z, Zhang X, et al.   Comparative epigenomic analysis of murine and human adipogenesis. Cell 2010;143:156-69.
- ErnstJ, Kheradpour P, Mikkelsen TS, et al.   Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011;473:43–9.
- Shang WH, Hori T, Martins NM, et al.   Chromosome engineering allows the efficient isolation of vertebrate neocentromeres. Dev Cell2013;24:635 –48.
- Jeppsson K, Carlborg KK, Nakato R, et al.   The chromosomal association of the smc5/6 complex depends on cohesion and predicts the level of sister chromatid entanglement. PLoS Genet 2014;10:e1004680.
- Hansen P, Hecht J, Ibrahim DM, et al.   Saturation analysis of ChIP-seq data for reproducible identification of binding peaks. Genome Res 2015;25:1391–400.
- Sutani T, Sakata T, Nakato R, et al.   Condensin targets and reduces unwound DNA structures associated with transcription in mitotic chromosome condensation. Nat Commun 2015;6:7815.
- Machanick P, Bailey TL. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 2011;27:1696–7.
- Thomas-Chollier M, Herrmann C, Defrance M, et al.   RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 2012;40:e31.
- McLean CY, Bristor D, Hiller M, et al.   GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 2010;28:495–501.
- Welch RP, Lee C, Imbriano PM, et al.   ChIP-Enrich: gene set enrichment testing for ChIP-seq data. Nucleic Acids Res 2014;42:e105.
- Dixon JR, Selvaraj S, Yue F, et al.   Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012;485:376–80.
- Li G, Ruan X, Auerbach RK, et al.   Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 2012;148:84–98.
- Agirre X, Castellano G, Pascual M, et al.   Whole-epigenome analysis in multiple myeloma reveals DNA hypermethylation of B cell specific enhancers. Genome Res 2015;25:478–87.
- Alekseyenko AA, Gorchakov AA, Zee BM, et al.   Heterochromatin-associated interactions of Drosophila HP1a with dADD1, HIPP1, and repetitive RNAs. Genes Dev 2014;28:1445–60.
- Lin C, Garrett AS, De Kumar B, et al.   Dynamic transcriptional events in embryonic stem cells mediated by the super elongation complex (SEC). Genes Dev2011;25:1486–98.
- Gerstein MB, Kundaje A, Hariharan M, et al.   Architecture of the human regulatory network derived from ENCODE data. Nature2012;489:91–100.
- Yan J, Enge M, Whitington T, et al.   Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 2013;154:801–13.
- Griffon A, Barbier Q, Dalino J, et al.   Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape. Nucleic Acids Res 2015;43:e27.
- Hoffman MM, Ernst J, Wilder SP, et al.   Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res 2013;41:827–41.
- Kundaje A, Meuleman W, Ernst J, et al.  ; Roadmap Epigenomics Consortium .Integrative analysis of 111 reference human epigenomes. Nature2015;518:317–30.
- Landt SG, Marinov GK, Kundaje A, et al.   ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 2012; 22:1813–31.
- Meyer CA , Liu XS. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat Rev Genet 2014;15:709–21.
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009 Aug 1;25(15):1952-1958.
- Hower V, Evans SN, Pachter L. Shape-based peak identification for ChIP-Seq.BMC Bioinformatics. 2011 Jan 12;12:15.
- Bailey TL and Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning. 1995; 21(1-2): 51- 83.
- Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010 Oct;28(10):1045-8.
- Landt SG, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31.
- Bardet AF, He Q, Zeitlinger J, Stark A. A computational pipeline for comparative ChIP-seq analyses. Nat Protoc. 2011 Dec 15;7(1):45-61.
- Chen Y, Negre N, Li Q, et al.   Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Method 2012;9:609-14.
- Auerbach RK, Euskirchen G, Rozowsky J, et al.   Mapping accessible chromatin regions using Sono-Seq. Proc Natl Acad Sci USA2009; 106:14926–31