An integrated encyclopedia of DNA elements in the human genome


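Since 2007, ENCODE has developed methods and performed a large number of sequence-based studies to map functional elements across the human genome3. The elements mapped (and approaches used) include RNA transcribed regions (RNA-seq, CAGE, RNA-PET and manual annotation), protein-coding regions (mass spectrometry), transcription-factor-binding sites (ChIP-seq and DNase-seq), chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq and MNase-seq), and DNA methylation sites (RRBS assay) (Box 1 lists methods and abbreviations; Supplementary Table 1, section P, details production statistics)3. To compare and integrate results across the different laboratories, data production efforts focused on two selected sets of cell lines, designated 'tier 1' and 'tier 2' (Box 1). To capture a broader spectrum of biological diversity, selected assays were also executed on a third tier comprising more than 100 cell types. All data and protocol descriptions are available at http://www.encodeproject.org/, and a user's guide including details of cell-type choice and limitations was published recently3.
For consistency, data were generated and processed using standardized guidelines, and for some assays new quality-control measures were designed (see refs 3, 12 and http://encodeproject.org/encode/encode/datastandards.html; A. Kundaje, personal communication). Uniform data-processing methods were developed for each assay (see Supplementary Information; A. Kundaje, personal communication), and most assay results can be represented both as signal information (a per-base estimate across the genome) and as discrete elements (regions computationally identified as enriched for signal). Extensive processing pipelines were developed to generate each representation (M. M. Hoffman et al., manuscript in preparation, and A. Kundaje, personal communication). In addition, we developed the irreproducible discovery rate (IDR)13 measure to provide a robust and conservative estimate of the threshold at which two ranked lists of results from biological replicates no longer agree (that is, are irreproducible), and we applied this to defining sets of discrete elements. We identified, and excluded from most analyses, regions yielding untrustworthy signals likely to be artefactual (for example, multicopy regions). Together, these regions comprise 0.39% of the genome (see Supplementary Information). The poster accompanying this issue represents the different ENCODE-identified elements and their genome coverage.
We used manual and automated annotation to produce a comprehensive catalogue of human protein-coding and non-coding RNAs as well as pseudogenes, referred to as the GENCODE reference gene set14,15 (Supplementary Table 1, section U). This includes 20,687 protein-coding genes (GENCODE annotation, v7) with alternatively spliced transcripts at each locus (on average 3.9 different protein-coding transcripts per locus). In total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the genome, or 1.22% for protein-coding exons alone. Protein-coding genes span 33.45% of the genome from the outermost start to stop codons, or 39.54% from promoter to poly(A) site. Analysis of mass spectrometry data from K562 and GM12878 cell lines yielded 57 confidently identified unique peptide sequences in intergenic regions relative to the GENCODE annotation. Taken together with evidence of pervasive genome transcription16, these data indicate that additional protein-coding genes remain to be found.
In addition, we annotated 8,801 automatically derived small RNAs and 9,640 manually curated long non-coding RNA (lncRNA) loci. Comparing lncRNAs to other ENCODE data indicates that lncRNAs are generated through a pathway similar to that for protein-coding genes. The GENCODE project also annotated 11,224 pseudogenes, of which 863 were transcribed and associated with active chromatin18.
We sequenced RNA16 from different cell lines and multiple subcellular fractions to develop an extensive RNA expression catalogue. Using a conservative threshold to identify regions of RNA activity, 62% of genomic bases are reproducibly represented in sequenced long (>200 nucleotides) RNA molecules or GENCODE exons. Of these bases, only 5.5% are explained by GENCODE exons. Most transcribed bases lie within or overlap annotated gene boundaries (that is, they are intronic), and only 31% of bases in sequenced transcripts were intergenic16.
We used CAGE-seq (5′ cap-targeted RNA isolation and sequencing) to identify 62,403 transcription start sites (TSSs) at high confidence (IDR of 0.01) in tier 1 and tier 2 cell types. Of these, 27,362 (44%) are within 100 base pairs (bp) of the 5′ end of a GENCODE-annotated transcript or a previously reported full-length messenger RNA. The remaining regions lie predominantly across exons and 3′ untranslated regions (UTRs), and some exhibit cell-type-restricted expression; these may represent the start sites of novel, cell-type-specific transcripts.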
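The 100 bp proximity criterion used for the CAGE-defined TSSs can be illustrated with a small sketch. This is not the ENCODE pipeline: the function name `fraction_near_annotation`, the single-chromosome, strand-agnostic coordinates and the toy positions are all assumptions made for illustration. Only the counts 27,362 and 62,403 come from the text.

```python
from bisect import bisect_left

def fraction_near_annotation(tss_positions, annotated_starts, window=100):
    """Fraction of TSS positions lying within `window` bp of any annotated 5' end.

    Hypothetical helper: assumes all positions are on one chromosome and
    ignores strand, which a real analysis would have to handle.
    """
    starts = sorted(annotated_starts)
    if not tss_positions:
        return 0.0
    hits = 0
    for pos in tss_positions:
        i = bisect_left(starts, pos)
        # The nearest annotated start is either just left or just right of pos.
        neighbours = starts[max(i - 1, 0):i + 1]
        if any(abs(pos - s) <= window for s in neighbours):
            hits += 1
    return hits / len(tss_positions)

# Toy coordinates (not real data): two of the three TSSs fall within 100 bp.
toy_fraction = fraction_near_annotation([1000, 5000, 9050], [950, 9100, 20000])

# The proportion reported in the text: 27,362 of 62,403 TSSs.
reported_fraction = 27362 / 62403
```

Sorting the annotated starts once and using binary search keeps each lookup logarithmic, which matters when tens of thousands of sites are compared against a full annotation.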
Finally, we saw a significant proportion of coding and non-coding transcripts processed into steady-state stable RNAs shorter than 200 nucleotides. These precursors include transfer RNA, microRNA, small nuclear RNA and small nucleolar RNA (tRNA, miRNA, snRNA and snoRNA, respectively), and the 5′ termini of these processed products align with the capped 5′ end tags16.
To identify regulatory regions directly, we mapped the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types using ChIP-seq (Table 1, Supplementary Table 1, section N, and ref. 19); 87 (73%) were sequence-specific transcription factors. Overall, 636,336 binding regions covering 231 megabases (Mb; 8.1%) of the genome are enriched for regions bound by DNA-binding proteins across all cell types. We assessed each protein-binding site for enrichment of known DNA-binding motifs and for the presence of novel motifs. Overall, 86% of the DNA segments occupied by sequence-specific transcription factors contained a strong DNA-binding motif, and in most (55%) cases the known motif was the most enriched (P. Kheradpour and M. Kellis, manuscript in preparation).
Protein-binding regions lacking high- or moderate-affinity cognate recognition sites have 21% lower signal than regions with recognition sequences (Wilcoxon rank sum P value <10^-16). Eighty-two per cent of the low-signal regions have high-affinity recognition sequences for other factors. In addition, when ChIP-seq peaks are ranked by their concordance with their known recognition sequence, the median DNase I accessibility is twofold higher in the bottom 20% of peaks than in the upper 80% (genome structure correction (GSC)20 P value <10^-16), consistent with previous observations21,22,23,24. We speculate that low-signal regions are either lower-affinity sites21 or indirect transcription-factor target regions associated through interactions with other factors (see also refs 25, 26).
We organized all the information associated with each transcription factor (including the ChIP-seq peaks, discovered motifs and associated histone modification patterns) in FactorBook (http://www.factorbook.org; ref. 26), a public resource that will be updated as the project proceeds.
Chromatin accessibility characterized by DNase I hypersensitivity is the hallmark of regulatory DNA regions27,28. We mapped 2.89 million unique, non-overlapping DNase I hypersensitive sites (DHSs) by DNase-seq in 125 cell types, the overwhelming majority of which lie distal to TSSs29. We also mapped 4.8 million sites across 25 cell types that displayed reduced nucleosomal crosslinking by FAIRE, many of which coincide with DHSs. In addition, we used micrococcal nuclease to map nucleosome occupancy in GM12878 and K562 cells30.
In tier 1 and tier 2 cell types, we identified a mean of 205,109 DHSs per cell type (at false discovery rate (FDR) 1%), encompassing an average of 1.0% of the genomic sequence in each cell type, and 3.9% in aggregate. On average, 98.5% of the occupancy sites of transcription factors mapped by ENCODE ChIP-seq (and, collectively, 94.4% of all 1.1 million transcription factor ChIP-seq peaks in K562 cells) lie within accessible chromatin defined by DNase I hotspots29. However, a small number of factors, most prominently heterochromatin-bound repressive complexes (for example, the TRIM28–SETDB1–ZNF274 complex31,32 encoded by the TRIM28, SETDB1 and ZNF274 genes), seem to occupy a significant fraction of nucleosomal sites.
Using genomic DNase I footprinting33,34 on 41 cell types we identified 8.4 million distinct DNase I footprints (FDR 1%)25. Our de novo motif discovery on DNase I footprints recovered 90% of known transcription factor motifs, together with hundreds of novel evolutionarily conserved motifs, many displaying highly cell-selective occupancy patterns similar to major developmental and tissue-specific regulators.
We assayed chromosomal locations for up to 12 histone modifications and variants in 46 cell types, including a complete matrix of eight modifications across tier 1 and tier 2. Because modification states may span multiple nucleosomes, which themselves can vary in position across cell populations, we used a continuous signal measure of histone modifications in downstream analysis, rather than calling regions (M. M. Hoffman et al., manuscript in preparation; see http://code.google.com/p/align2rawsignal/). For the strongest, 'peak-like' histone modifications, we used MACS35 to characterize enriched sites. Table 2 describes the different histone modifications, their peak characteristics, and a summary of their known roles (reviewed in refs 36–39).
Our data show that global patterns of modification are highly variable across cell types, in accordance with changes in transcriptional activity. Consistent with previous studies40,41, we find that integration of the different histone modification information can be used systematically to assign functional attributes to genomic regions (see below).
Methylation of cytosine, usually at CpG dinucleotides, is involved in epigenetic regulation of gene expression. Promoter methylation is typically associated with repression, whereas genic methylation correlates with transcriptional activity42. We used reduced representation bisulphite sequencing (RRBS) to profile DNA methylation quantitatively for an average of 1.2 million CpGs in each of 82 cell lines and tissues (8.6% of non-repetitive genomic CpGs), including CpGs in intergenic regions, proximal promoters and intragenic regions (gene bodies)43, although it should be noted that the RRBS method preferentially targets CpG-rich islands. We found that 96% of CpGs exhibited differential methylation in at least one cell type or tissue assayed (K. Varley et al., personal communication), and levels of DNA methylation correlated with chromatin accessibility. The most variably methylated CpGs are found more often in gene bodies and intergenic regions, rather than in promoters and upstream regulatory regions. In addition, we identified an unexpected correspondence between unmethylated genic CpG islands and binding by P300, a histone acetyltransferase linked to enhancer activity44.
Because RRBS is a sequence-based assay with single-base resolution, we were able to identify CpGs with allele-specific methylation consistent with genomic imprinting, and determined that these loci exhibit aberrant methylation in cancer cell lines (K. Varley et al., personal communication). Furthermore, we detected reproducible cytosine methylation outside CpG dinucleotides in adult tissues45, providing further support that this non-canonical methylation event may have important roles in human biology (K. Varley et al., personal communication).
Physical interaction between distinct chromosome regions that can be separated by hundreds of kilobases is thought to be important in the regulation of gene expression46. We used two complementary chromosome conformation capture (3C)-based technologies to probe these long-range physical interactions.
A 3C-carbon copy (5C) approach47,48 provided unbiased detection of long-range interactions with TSSs in a targeted 1% of the genome (the 44 ENCODE pilot regions) in four cell types (GM12878, K562, HeLa-S3 and H1 hESC)49. We discovered hundreds of statistically significant long-range interactions in each cell type after accounting for chromatin polymer behaviour and experimental variation. Pairs of interacting loci showed strong correlation between the gene expression level of the TSS and the presence of specific functional element classes such as enhancers. The average number of distal elements interacting with a TSS was 3.9, and the average number of TSSs interacting with a distal element was 2.5, indicating a complex network of interconnected chromatin. Such interwoven long-range architecture was also uncovered genome-wide using chromatin interaction analysis with paired-end tag sequencing (ChIA-PET)50 applied to identify interactions in chromatin enriched by RNA polymerase II (Pol II) ChIP from five cell types51. In K562 cells, we identified 127,417 promoter-centred chromatin interactions using ChIA-PET, 98% of which were intra-chromosomal. Whereas promoter regions of 2,324 genes were involved in 'single-gene' enhancer–promoter interactions, those of 19,813 genes were involved in 'multi-gene' interaction complexes spanning up to several megabases, including promoter–promoter and enhancer–promoter interactions51.
These analyses portray a complex landscape of long-range gene–element connectivity across ranges of hundreds of kilobases to several megabases, including interactions among unrelated genes (Supplementary Fig. 1, section Y). Furthermore, in the 5C results, 50–60% of long-range interactions occurred in only one of the four cell lines, indicative of a high degree of tissue specificity for gene–element connectivity49.
Accounting for all these elements, a surprisingly large amount of the human genome, 80.4%, is covered by at least one ENCODE-identified element (detailed in Supplementary Table 1, section Q). The broadest element class represents the different RNA types, covering 62% of the genome (although the majority is inside of introns or near genes). Regions highly enriched for histone modifications form the next largest class (56.1%). Excluding RNA elements and broad histone elements, 44.2% of the genome is covered. Smaller proportions of the genome are occupied by regions of open chromatin (15.2%) or sites of transcription factor binding (8.1%), with 19.4% covered by at least one DHS or transcription factor ChIP-seq peak across all cell lines. Using our most conservative assessment, 8.5% of bases are covered by either a transcription-factor-binding-site motif (4.6%) or a DHS footprint (5.7%). This, however, is still about 4.5-fold higher than the amount of protein-coding exons, and about twofold higher than the estimated amount of pan-mammalian constraint.
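Coverage figures such as "8.5% covered by either a motif (4.6%) or a footprint (5.7%)" are unions of interval sets, so overlapping bases are counted once. The sketch below shows one way such a union could be computed. The helper names, the half-open coordinate convention and the toy intervals on a 1,000-base "genome" are assumptions chosen so that the toy numbers mirror the reported percentages; they are not ENCODE data.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals given as (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of counting bases twice.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def covered_bases(intervals):
    """Number of distinct bases covered by a set of half-open intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))

# Toy example on a hypothetical 1,000-base genome (illustrative only).
genome_length = 1000
motifs = [(100, 146)]      # 46 bases, i.e. 4.6%
footprints = [(128, 185)]  # 57 bases, i.e. 5.7%

motif_pct = 100 * covered_bases(motifs) / genome_length
footprint_pct = 100 * covered_bases(footprints) / genome_length
union_pct = 100 * covered_bases(motifs + footprints) / genome_length

# Inclusion-exclusion: the shared portion is the sum of the parts minus the union.
overlap_pct = motif_pct + footprint_pct - union_pct
```

By the same inclusion-exclusion arithmetic, the reported figures would imply that roughly 4.6 + 5.7 − 8.5 = 1.8% of bases are covered by both a motif and a footprint.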
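Given that the ENCODE project did not assay all cell types, or all transcription factors, and in particular has sampled few specialized or developmentally restricted cell lineages, these proportions must be underestimates of the total amount of functional bases. However, many assays were performed on more than one cell type, allowing assessment of the rate of discovery of new elements. For both DHSs and CTCF-bound sites, the number of new elements initially increases rapidly with a steep gradient for the saturation curve and then slows with increasing number of cell types (Supplementary Figs 1 and 2, section R). With the current data, at the flattest part of the saturation curve each new cell type adds, on average, 9,500 DHS elements (across 106 cell types) and 500 CTCF-binding elements (across 49 cell types), representing 0.45% of the total element number. We modelled saturation for the DHSs and CTCF-binding sites using a Weibull distribution (r^2 > 0.999) and predict saturation at approximately 4.1 million (standard error (s.e.) = 108,000) and 185,100 (s.e. = 18,020) sites, respectively, indicating that we have discovered around half of the estimated total DHSs. These estimates represent a lower bound, but reinforce the observation that there is more non-coding functional DNA than either coding sequence or mammalian evolutionarily constrained bases.
From comparative genomic studies, at least 3–8% of bases are under purifying (negative) selection4,5,6,7,8,9,10,11, indicating that these bases may potentially be functional. We previously found that 60% of mammalian evolutionarily constrained bases were annotated in the ENCODE pilot project, but also observed that many functional elements lacked evidence of constraint, a conclusion substantiated by others52,53,54. The diversity and genome-wide occurrence of the functional elements now identified provide an unprecedented opportunity to examine further the forces of negative selection on human functional sequences.
We examined negative selection using two measures that highlight different periods of selection in the human genome. The first measure, inter-species, pan-mammalian constraint (GERP-based scores; 24 mammals8), addresses selection during mammalian evolution. The second measure is intra-species constraint, estimated from the numbers of variants discovered in human populations using data from the 1000 Genomes Project55, and covers selection over human evolution. In Fig. 1, we plot both of these measures of constraint for different classes of identified functional elements, excluding features overlapping exons and promoters, which are known to be constrained. Each graph also shows genomic background levels and measures of coding-gene constraint for comparison. Because we plot human population diversity on an inverted scale, elements that are more constrained by negative selection tend to lie in the upper and right-hand regions of the plot.
For DNase I elements (Fig. 1b) and bound motifs (Fig. 1c), most sets of elements show enrichment in pan-mammalian constraint and decreased human population diversity, although for some cell types the DNase I sites do not appear overall to be subject to pan-mammalian constraint. Bound transcription factor motifs have a natural control in the set of transcription factor motifs with equal sequence potential for binding but without binding evidence from ChIP-seq experiments; in all cases, the bound motifs show both more mammalian constraint and higher suppression of human diversity.
Consistent with previous findings, we do not observe genome-wide evidence for pan-mammalian selection of novel RNA sequences (Fig. 1d). There are also a large number of elements without mammalian constraint, between 17% and 90% for transcription-factor-binding regions as well as DHSs and FAIRE regions. Previous studies could not determine whether these sequences are biochemically active but with little overall impact on the organism, or are under lineage-specific selection. By isolating sequences preferentially inserted into the primate lineage, which is feasible only given the genome-wide scale of these data, we are able to examine this issue specifically. Most primate-specific sequence is due to retrotransposon activity, but an appreciable proportion is non-repetitive primate-specific sequence. Of 104,343,413 primate-specific bases (excluding repetitive elements), 67,769,372 (65%) are found within ENCODE-identified elements. Examination of 227,688 variants segregating in these primate-specific regions revealed that all classes of elements (RNA and regulatory) show depressed derived allele frequencies, consistent with negative selection occurring in at least some of these regions (Fig. 1e). An alternative approach examining sequences that are not clearly under pan-mammalian constraint showed a similar result (L. Ward and M. Kellis, manuscript submitted). This indicates that an appreciable proportion of the unconstrained elements are lineage-specific elements required for organismal function, consistent with long-standing views of recent evolution, and that the remainder are probably 'neutral' elements2 that are not currently under selection but may still affect cellular or larger-scale phenotypes without an effect on fitness.
The binding patterns of transcription factors are not uniform, and we can correlate both inter- and intra-species measures of negative selection with the overall information content of motif positions. The selection on some motif positions is as high as that on protein-coding exons (Fig. 1f; L. Ward and M. Kellis, manuscript submitted). These aggregate measures across motifs show that the binding preferences found in the population of sites are also relevant to the per-site behaviour. By developing a per-site metric of population effect on bound motifs, we found that highly constrained bound instances across mammals are able to buffer the impact of individual variation57.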
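The "information content of motif positions" mentioned in the closing paragraph is commonly computed per position as 2 bits minus the Shannon entropy of the base frequencies at that position. The sketch below assumes that standard definition, without any small-sample correction; the function name and the toy motif columns are hypothetical and are not taken from the paper.

```python
import math

def information_content(column):
    """Information content, in bits, of one motif position.

    `column` maps each base to its count (or frequency) at that position.
    Assumes a uniform background over the four DNA bases, so the maximum
    is log2(4) = 2 bits, and applies no small-sample correction.
    """
    total = sum(column.values())
    ic = 2.0
    for count in column.values():
        if count > 0:
            p = count / total
            ic += p * math.log2(p)  # adds a negative term: subtracts entropy
    return ic

# Toy motif (illustrative counts only): an invariant position, a position
# split evenly between two bases, and a completely uninformative position.
toy_motif = [
    {"A": 20, "C": 0, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 10, "T": 0},
    {"A": 5, "C": 5, "G": 5, "T": 5},
]
ic_per_position = [information_content(col) for col in toy_motif]
```

Positions with high information content tolerate few substitutions, which is why one would expect them to show stronger signatures of negative selection than degenerate positions.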
