Transcript expression-aware annotation improves rare variant interpretation

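No statistical methods were used to predetermine sample size. The experiments were not randomized, and the investigators were not blinded to allocation during experiments and outcome assessment.

To identify haploinsufficient developmental delay genes, we selected genes curated by the ClinGen Dosage Sensitivity Working Group34. Of the 61 genes, 58 have a score of 3 (sufficient evidence for pathogenicity), two genes (CHAMP1, CTCF) have a score of 2 (some evidence), and the remaining gene is RERE. The penetrance of pathogenic variants in each gene was reviewed in the literature, and only genes with greater than 75% penetrance were included. These conditions are severe enough that an affected individual would not be expected in gnomAD (such an individual would probably be unable to consent to a study without guardianship). The 61 genes comprise 50 autosomal genes of high severity and high penetrance, and 11 genes on chromosome X for which the phenotype is expected to be severe or lethal in males and moderate to severe in females. The final gene list is available at gs://gnomad-public/papers/2019-tx-annotation/data/gene_lists/hi_genes_100417.tsv.

We extracted pLoF variants, defined as essential splice acceptor, essential splice donor, stop-gained and frameshift variants, identified in the 61 haploinsufficient disease genes in the gnomAD v2.1.1 exome and genome sites tables, and considered only those passing the gnomAD dataset and annotation filters. Of the 61 genes, 55 had at least one high-quality pLoF variant. We manually curated 401 pLoF variants using a web-based curation portal, to identify any reason why a pLoF might be a variant-calling or annotation error, and to classify how likely each variant was to be a true LoF.

Evidence for classifying a LoF variant as an artefact was grouped into the following categories: mapping error, strand bias, reference error, genotyping error, homopolymer sequence, in-frame multi-nucleotide variant or frame-restoring indel, essential splice site rescue, minority of transcripts, weak exon conservation, last exon, and other annotation error. All possible reasons for rejecting a LoF consequence were flagged, even when a single criterion was enough to classify the variant as not LoF. Variants were then classified as LoF, probably LoF, probably not LoF, or not LoF according to the criteria outlined in Supplementary Table 2.

Technical errors include genotyping errors, strand bias, reference errors and repetitive regions, which can be detected by visual inspection of reads in the Integrative Genomics Viewer35 (IGV) and the UCSC Genome Browser36. Genotyping errors included skewed allele balance (conservative cutoff ≤35%), low-complexity sequences, GC-rich regions, homopolymer tracts (≥6 base pairs or ≥6 trinucleotide repeats) and low quality metrics (genotype quality <20). Strand bias was flagged when a variant was skewed preferentially on the forward or reverse strand, or when the majority (>90%) of a given strand covered a region; this was often observed around intron–exon boundaries. Strand bias despite balanced coverage of the forward and reverse strands was classified as probably not LoF, whereas strand bias arising from skewed strand coverage was weighed together with other genotyping errors. Reference errors, where present, were identified by small deletions in a given exon that acted as a <5-base-pair intron. Most genotyping errors and strand biases in isolation were not deemed critical in deciding whether a variant was probably not LoF or not LoF, with the exception of allele balance ≤25%. Mapping errors were often identified by an enrichment of complex variation surrounding a variant of interest.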
Furthermore, the UCSC browser was used to highlight mapping discrepancies, such as self-chain alignments, segmental duplications, simple tandem repeats and microsatellite regions.

In-frame multi-nucleotide variants (MNVs), essential splice site rescue and frame-restoring insertion-deletions are rescue events that are predicted to restore gene function. MNVs were visualized in IGV and cross-checked with codons from the UCSC browser; in-frame MNVs that rescued stop codons were scored as not LoF. Essential splice site rescue occurs when an in-frame alternative donor or acceptor site is present, which probably has a minimal effect on the transcript. A total of 36 base pairs upstream and downstream of the splice variant were assessed for splice site rescue. Cryptic splice sites within 6 base pairs of the splice variant were considered a complete rescue, rendering the variant not LoF. Rescue sites more than 6 base pairs away, but within ±20 base pairs, carried lower confidence and were scored as probably not LoF. All potential splice site rescues were validated using Alamut v.2.11 (https://www.interactive-biosoftware.com/alamut-visual/). Frame-restoring indels were identified by scanning approximately ±80 base pairs from the annotated indel and counting all insertions and deletions to assess whether the frame would be restored.

Transcript errors covered issues surrounding alternative transcripts, variants within the terminal coding exon, exon conservation and re-initiation events. Coding variants affecting a minority (<50%) of NCBI coding RefSeq transcripts for a given gene were considered not LoF. These variants often affected poorly conserved exons, as determined by PhyloP37, PhyloCSF19 and visualization in the UCSC browser36. The only exceptions to the minority-of-transcripts criterion were cases in which the exon was well conserved, which moved the categorization to probably not LoF. Variants within the last coding exon, or within 50 base pairs of the penultimate coding exon, were also considered not LoF, unless between 25% and 50% of the coding sequence was affected, in which case the variant was deemed probably not LoF.
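If >50% of the coding sequence was disrupted by a variant in the last exon, the variant was considered probably LoF. Other transcript errors included: re-initiation errors; stop codons upstream of a given LoF variant; variants falling in exactly 50% of coding RefSeq transcripts; and/or partial exon conservation. Re-initiation events were flagged when transcription was predicted to restart downstream of a variant in the first coding exon, and such variants were predicted to be probably not LoF. Variants occurring after the stop codon in the last coding exon were considered not LoF, particularly in the regions of the exon or transcript concerned. Error categories were grouped for Fig. 1 as follows: minority of transcripts and weak exon conservation were grouped as transcript errors; genotyping error and homopolymer as sequencing errors; essential splice rescue and MNV as rescue; and strand bias was included in other annotation errors.

The criteria above were followed strictly, and manual curation was performed by two independent reviewers to ensure maximum consistency and minimize human error. Any discrepancies between the two curators were re-examined and resolved by both curators. The full results of the manual curation are provided in Supplementary Table 3.

We first imported the GTEx v7 isoform quantifications into Hail and calculated the median expression of every transcript per tissue. This precomputed summary isoform expression matrix is available at gs://gnomad-public/papers/2019-tx-annotation/data/grch37_hg19/. We also imported and annotated a variant file with the Variant Effect Predictor (VEP) version 85 (ref. 38) against GENCODE v19 (ref. 20), implemented with the LOFTEE v1.0 plugin.

We use the transcript consequences VEP field to calculate the sum of isoform expression for variant annotations, which we define as the annotation-level expression across transcripts (ext). For a variant with multiple consequences on one transcript (for example, a single nucleotide variant that is both a missense and a splice region variant on one transcript), we use the worst consequence as ordered by VEP (in this example, missense takes precedence over splice region). We filter to consequences that occur on protein-coding transcripts only. The full ordering of VEP consequences is available in the Ensembl VEP documentation.

We then sum the expression of every transcript per variant, for every combination of consequence, LOFTEE filter and LOFTEE flag, in every tissue (Supplementary Fig. 3a). For example, if a single nucleotide variant is synonymous on ENST1, a high-confidence LOFTEE stop-gained variant on ENST3 and ENST4, and a low-confidence LOFTEE stop-gained variant on ENST5 and ENST6, the ext values per tissue will be synonymous: ENST1; stop-gained high-confidence: ENST3 + ENST4; and stop-gained low-confidence: ENST5 + ENST6. These values can be calculated with the tx_annotate() function by setting tx_annotation_type to "expression". We foresee the non-normalized ext values being useful when only one tissue of interest is considered.

To allow expression values to be averaged across tissues of interest, we normalize the expression value of a given variant annotation to the total expression of the gene in which the variant is found. This is done by dividing the ext value by the per-tissue sum of the expression of all transcripts of the gene, in transcripts per million (TPM) (Supplementary Fig. 3b). The resulting pext value can be interpreted as the proportion of the total transcriptional output of a gene that would be affected by the variant annotation in question. If the gene expression value in a given tissue (and thus the denominator) is 0, the pext value for that tissue is not available (NA).

Unavailable pext values are not considered when averaging across tissues (that is, NA values are removed before taking the mean). This value can be calculated with the tx_annotate() function by setting tx_annotation_type to "proportion". For the analyses in this manuscript, we removed reproduction-associated GTEx tissues (endocervix, ectocervix, fallopian tube, prostate, uterus, ovary, testes and vagina), cell lines (transformed fibroblasts and transformed lymphocytes) and any tissue with fewer than 100 samples (bladder, brain cervical c-1 spinal cord, brain substantia nigra, kidney cortex and minor salivary gland), resulting in the use of 38 GTEx tissues.
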
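The ext and pext definitions above can be illustrated with a minimal pure-Python sketch. This is not the tx_annotate() implementation (which runs in Hail); the function names, the dictionary layout, the tissue names and the TPM values are all hypothetical, and the transcript IDs reuse the ENST1 to ENST6 example from the text.

```python
import math
from collections import defaultdict

def gene_expression(transcript_tpm):
    """Per-tissue gene expression: sum of all transcripts of the gene."""
    total = defaultdict(float)
    for per_tissue in transcript_tpm.values():
        for tissue, tpm in per_tissue.items():
            total[tissue] += tpm
    return dict(total)

def compute_ext(transcript_annotations, transcript_tpm):
    """ext: per-tissue sum of expression over transcripts sharing an annotation."""
    ext = defaultdict(lambda: defaultdict(float))
    for transcript, annotation in transcript_annotations.items():
        for tissue, tpm in transcript_tpm[transcript].items():
            ext[annotation][tissue] += tpm
    return {annotation: dict(per_tissue) for annotation, per_tissue in ext.items()}

def compute_pext(ext, gene_tpm):
    """pext: ext divided by gene expression; NaN where the gene is not expressed."""
    return {
        annotation: {
            tissue: value / gene_tpm[tissue] if gene_tpm[tissue] > 0 else math.nan
            for tissue, value in per_tissue.items()
        }
        for annotation, per_tissue in ext.items()
    }

def mean_pext(per_tissue):
    """Mean across tissues, dropping unavailable (NaN) values first."""
    values = [v for v in per_tissue.values() if not math.isnan(v)]
    return sum(values) / len(values) if values else math.nan

# Hypothetical median TPM per transcript and tissue (illustrative values only).
transcript_tpm = {
    "ENST1": {"brain": 2.0, "heart": 1.0, "liver": 0.0},
    "ENST2": {"brain": 4.0, "heart": 5.0, "liver": 0.0},  # variant not on this transcript
    "ENST3": {"brain": 1.0, "heart": 1.0, "liver": 0.0},
    "ENST4": {"brain": 1.0, "heart": 1.0, "liver": 0.0},
    "ENST5": {"brain": 1.0, "heart": 1.0, "liver": 0.0},
    "ENST6": {"brain": 1.0, "heart": 1.0, "liver": 0.0},
}
# Worst consequence (plus LOFTEE confidence) of one variant on each transcript.
transcript_annotations = {
    "ENST1": "synonymous",
    "ENST3": "stop_gained|HC",
    "ENST4": "stop_gained|HC",
    "ENST5": "stop_gained|LC",
    "ENST6": "stop_gained|LC",
}

ext = compute_ext(transcript_annotations, transcript_tpm)
pext = compute_pext(ext, gene_expression(transcript_tpm))
print(mean_pext(pext["stop_gained|HC"]))  # liver is NA and is skipped
```

In this toy gene the liver denominator is 0, so every liver pext is NA and the cross-tissue mean is taken over brain and heart only, mirroring the NA handling described above.
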
We note that for a minority of genes, when RSEM15 assigns higher relative expression to non-coding transcripts, the sum of the coding transcript values can be much smaller than the gene expression value. This results in low pext scores for all coding variants in the gene, and therefore in the possible filtering of every variant in that gene. In many cases, this appears to be the result of spurious non-coding transcripts with a high degree of exon overlap with true coding transcripts. To prevent this artefact from affecting our analyses, we first calculated the maximum pext score across all variants in every protein-coding gene, and removed any gene whose maximum pext score was below 0.2. This resulted in the filtering of 668 genes, 3.3% of all genes analysed. We note that the 668 genes have no overlap with the haploinsufficient gene list, that 97 of the filtered genes are present in OMIM (representing 1.5% of the OMIM gene list) and that 42 of the filtered genes are considered constrained (representing 1.4% of LOEUF <0.35, or constrained, genes); the filter therefore has a low effect on variant interpretation in the context of disease associations.

The full transcript-expression-aware annotation pipeline, implemented in Hail 0.2, is available at https://github.com/macarthur-lab/tx_annotation, with commands laid out for the analyses in the manuscript. Passing a Hail table through the tx_annotate() function returns the same table with a new field entitled 'tx_annotation', which provides either the ext or the pext value per variant-annotation pair, depending on parameter choice. We provide a helper function to extract the worst consequence and the associated expression values for these annotations. All analyses in the manuscript are based on the worst consequence of a variant, as ordered by VEP38.

Conservation analysis was performed using phyloCSF scores from the same file used for the LOFTEE plugin, publicly available at gs://gnomad-public/papers/2019-tx-annotation/data/other_data/phylocsf_data.tsv.bgz. We denoted exons with a phyloCSF max open-reading frame score >1,000 as highly conserved and those with a max open-reading frame score <−100 as lowly conserved (Supplementary Fig. 5a), and evaluated their average usage in GTEx.

Using the base-level pext values that are used in the gnomAD browser, we filtered to intervals with high or low conservation, and calculated the average pext value in each interval.
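As a small illustration of the conservation binning and interval averaging just described, the following sketch applies the two phyloCSF thresholds and averages base-level pext over an interval. The function names and the position-to-pext dictionary are hypothetical and are not taken from the tx_annotation repository.

```python
def conservation_class(max_orf_score):
    """Bin an exon by its phyloCSF max open-reading frame score."""
    if max_orf_score > 1000:
        return "high"
    if max_orf_score < -100:
        return "low"
    return None  # intermediate exons are in neither group

def mean_interval_pext(base_pext, start, end):
    """Average base-level pext over [start, end], skipping missing bases."""
    values = [
        base_pext[pos]
        for pos in range(start, end + 1)
        if base_pext.get(pos) is not None
    ]
    return sum(values) / len(values) if values else None

# Hypothetical base-level pext values keyed by genomic position.
base_pext = {10: 0.2, 11: 0.4, 12: None}
print(conservation_class(1500), mean_interval_pext(base_pext, 10, 13))
```
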
To evaluate regions with low conservation but high expression, we identified genes harbouring unconserved regions with a pext value >0.9 for pathway enrichment analysis, and used the web browser for the FUMA GENE2FUNC feature39, which incorporates Reactome40, KEGG41, Gene Ontology42 (GO) and other ontologies. Default parameters were used for FUMA, with all protein-coding genes as the background list. The results of the FUMA pathway analysis are shown in Supplementary Fig. 12, with full results in Supplementary Table 7.

The analysis of pext values against LOFTEE flags and the MAPS calculations used the gnomAD v2.1.1 exome dataset. The calculation of MAPS scores, implemented as a Hail module, has been described previously1. MAPS is a relative metric and cannot be compared across datasets, but it is a useful summary metric for the frequency spectrum, indicating deleteriousness as inferred from the rarity of variation (higher MAPS values correspond to lower frequencies, suggesting the action of negative selection at more deleterious sites). MAPS scores were computed on the gnomAD v.2.1.1 dataset partitioned by LOEUF score and expression bin. The script used to generate the MAPS scores is available in the tx_annotation GitHub repository at /analyses/maps/maps_submit_per_per_class.py.

As an orthogonal evaluation of regions flagged by the pext metric, we identified any regions in the 61 haploinsufficient disease genes with an average pext value <0.1 in all GTEx tissues and in GTEx brain samples, owing to the relevance of brain tissues for these disorders, regardless of mutational burden in gnomAD. The resulting list of 128 regions was evaluated by the HAVANA manual annotation group of the GENCODE project20.

The manual evaluation first established whether the transcript model corresponding to the region in question was correct in terms of structure, comparing exon–intron combinations and the accuracy of splice sites against the RNA evidence supporting the model. Second, the functional biotype of each model was reassessed; in particular, whether the decision to annotate the model as protein-coding in GENCODE v19 was appropriate. Note that GENCODE models that incorporate alternative exons or exon combinations in comparison to the 'canonical' isoform are likely to be annotated as coding if they contain a prospective CDS that is considered biologically plausible, based on a mechanistic view of translation. These re-annotations are summarized in Supplementary Table 5.

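The first step of the orthogonal evaluation above, collecting regions whose average pext falls below 0.1, could be sketched as a merge of adjacent low-pext bases. This is an assumption about one reasonable way to do it, not the authors' procedure; the function name and input layout are hypothetical, and the same function would be applied separately to the all-tissue and brain averages.

```python
def low_pext_regions(base_pext, cutoff=0.1):
    """Merge runs of adjacent bases with pext below the cutoff into (start, end) regions.

    base_pext maps genomic position -> mean pext (None when unavailable).
    """
    regions = []
    start = prev = None
    for pos in sorted(base_pext):
        value = base_pext[pos]
        below = value is not None and value < cutoff
        if below and start is not None and pos == prev + 1:
            prev = pos  # extend the current run
            continue
        if start is not None:
            regions.append((start, prev))  # close the run that just ended
            start = prev = None
        if below:
            start = prev = pos  # open a new run
    if start is not None:
        regions.append((start, prev))
    return regions

# Hypothetical per-base mean pext values for part of one gene.
example = {100: 0.9, 101: 0.05, 102: 0.02, 103: 0.08, 104: 0.5,
           105: 0.01, 106: None, 107: 0.03, 108: 0.04, 110: 0.05}
print(low_pext_regions(example))
```

Unavailable values and gaps in position both end a run, so a region never spans bases without an observed low pext value.
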
We binned cases into three main categories, according to confidence in both the accuracy and the potential functional relevance of the overlapping models: (1) 'error', in which the model was seen to have an incorrect transcript structure and/or a CDS that conflicted with updated GENCODE annotation criteria (these annotations have been or will be changed in future GENCODE releases based on this evaluation); (2) 'putative', in which the model structure and CDS satisfied our current annotation criteria, although we judged the potential of the transcript represented to encode a protein with a functional role in cellular physiology to be nonetheless speculative (these have been maintained as putative protein-coding transcripts in GENCODE); (3) 'validated', in which we believe it is highly probable that the model represents a true protein-coding isoform. High confidence in the validity of the CDS was based on comparative annotation, that is, the observation of CDS conservation and also the existence of equivalent transcript models in other species. GENCODE also annotates transcript models as 'nonsense-mediated decay' and 'non-stop decay', in which a translation is found that is predicted to direct the RNA molecule into cellular degradation programs. Although it has been established that such 'non-productive' transcription events can have a role in gene regulation and thus disease, the interpretation of variants within nonsense-mediated decay and non-stop decay CDS regions remains challenging43. These models were therefore classed in a separate category.

To evaluate the filtering power of the pext metric for Mendelian variants, we evaluated the number of variants that would be filtered with an average GTEx pext cutoff of 0.1 (low expression) in the ClinVar and gnomAD datasets.
We downloaded the ClinVar VCF from the ClinVar FTP (version dated 28 October 2018), imported it into Hail, annotated it with VEP v85 against GENCODE v19, and added pext annotations with the tx_annotate() function. All evaluated variants were annotated as HC by LOFTEE v1.0, and ClinVar variants were filtered to those marked as pathogenic, with no conflicts, and reviewed with at least one-star status.

For variants in the 61 haploinsufficient genes, we considered any variant observed in at least one individual, with any zygosity, in both datasets. For variants in autosomal recessive disease genes, we used a list of 1,183 OMIM disease genes deemed to follow a recessive inheritance pattern by Blekhman et al.44 and Berg et al.45 (available at https://github.com/macarthur-lab/gene_lists/blob/master/lists/all_ar.tsv). We compared the pext value for all pLoF variants identified in ClinVar with that of any variant found in a homozygous state in at least one individual in the gnomAD exome or genome datasets. Finally, we used a LOEUF cutoff of 0.35 to denote constrained genes, and compared any synonymous or pLoF variant in these genes in the gnomAD exome or genome datasets.

De novo variants were collated from previously published studies. We collected de novo variants identified in 5,305 probands from trio studies of intellectual disability/developmental disorders (Hamdan et al.27: n = 41, de Ligt et al.28: n = 100, Rauch et al.29: n = 51, DDD24: n = 4,293, Lelieveld et al.26: n = 820), 1,073 probands with congenital heart disease with co-morbid developmental delay (Sifrim et al.46: n = 512, Chih Jin et al.47: n = 561), 6,430 ASD probands, and 2,179 unaffected controls from the Autism Sequencing Consortium25. We also used a previously published dataset of variants in 8,437 cases with ASD and/or attention-deficit/hyperactivity disorder (ADHD) and 5,214 controls from the Danish Neonatal Screening Biobank48.
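In this analysis, we analysed pLoF variants identified in highly constrained genes (first LOEUF decile) with a combined total allele count of ≤10 in cases and controls.

We annotated both de novo and rare variants with VEP v85 against GENCODE v19 and added pext annotations with the tx_annotate() function. We then calculated the average pext metric across 11 GTEx brain samples and binned variants as low (pext < 0.1), medium (0.1 ≤ pext ≤ 0.9) or high (pext > 0.9) expression. We then counted the number of pLoF, missense and synonymous variants in each pext expression bin. To obtain case-control rate ratios and 95% confidence intervals for the de novo variant analysis, we used a two-sided Poisson exact test on counts. To obtain odds ratios for the rare variant analysis in ASD/ADHD, we used Fisher's exact test on count data.

To evaluate whether using a different isoform quantification tool affects the results, we compared results obtained with RSEM and with salmon quantification for TCF4 base-level expression (as shown in Fig. 2b), MAPS (Fig. 3c), and the number of variants filtered in haploinsufficient developmental disease genes in ClinVar versus gnomAD. Because re-quantifying the entire GTEx dataset was infeasible, we downloaded and re-quantified 151 GTEx brain cortex CRAM files from the v7 dataset. We first converted the CRAMs to FASTQ files using Picard 2.18.20, and then ran salmon with the command "salmon quant -i index -1 fastq1 -2 fastq2 --minAssignedFrags 1 --validateMappings". The index was created with the command "salmon index -t transcript.fa --type quasi -k 31", using the GENCODE v19 protein-coding and lncRNA transcript FASTA files. The existing GTEx RSEM isoform quantifications were filtered to the same GTEx brain cortex samples. To keep the analysis consistent with the rest of the manuscript, we calculated the maximum brain cortex pext score across all variants in every protein-coding gene for both the RSEM and the salmon quantifications, and removed any gene whose maximum pext score was below 0.2. This resulted in the filtering of 691 genes in the RSEM quantification and 325 genes in the salmon quantification of the brain cortex samples, corresponding to 3.4% and 1.6% of quantified genes, respectively. We filtered these genes from the MAPS and gene list comparison analyses shown in Supplementary Fig. 11. The WDL script for the quantification pipeline is available at gs://gnomad-public/papers/2019-tx-annotation/results/salmon_rsem/salmon_rsem/salmon.wdl, and the commands used to obtain the results are available in the GitHub repository under /analyses/rsem_salmon/.

Although our analyses are based on transcript-expression-aware annotation with the GTEx v7 dataset, we provide the files necessary to annotate with the Human Brain Development Resource (HBDR) fetal brain dataset49 at gs://gnomad-public/papers/2019-tx-annotation/data/hbdr_ftr_fetal_fetal_rnaseq. HBDR comprises 558 samples from various brain subregions across developmental time points. We downloaded the HBDR sample FASTQ files from the European Nucleotide Archive (study accession PRJEB14594) and obtained RSEM isoform quantifications on the HBDR FASTQs using the GTEx v7 quantification pipeline, which is publicly available at https://github.com/broadinstitute/gtex-pipeline and includes RSEM v1.2.22. Here, we also removed genes with an average pext across HBDR below 0.2, resulting in the removal of 712 genes (3.5% of all genes analysed). This dataset was also used to analyse the base-level expression values in SCN2A shown in Supplementary Fig. 7d.

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.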
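The pext binning and the Fisher's exact test used for the rare variant comparison can be sketched in pure Python as follows. The bin boundaries come from the text; everything else is an assumption for illustration: the function names, the choice of which counts populate the 2×2 table, and the example counts. The source does not state which software performed the tests, and the Poisson exact test for rate ratios is not reproduced here.

```python
from math import comb

def pext_bin(pext):
    """Bin a mean brain pext value: low (<0.1), medium (0.1 to 0.9), high (>0.9)."""
    if pext < 0.1:
        return "low"
    if pext > 0.9:
        return "high"
    return "medium"

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the probabilities of
    all tables with the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for floating-point ties
    )
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return odds_ratio, min(p_value, 1.0)

# Hypothetical counts: rows are cases and controls, columns are variant counts
# inside and outside one pext bin.
odds_ratio, p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(pext_bin(0.95), odds_ratio, round(p_value, 4))  # high 20.0 0.035
```

Both boundary values, 0.1 and 0.9, fall in the medium bin, matching the inequalities given in the text.
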
