Multimodal cell maps as a foundation for structural and functional genomics


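U2OS cell culture for protein–protein interaction mapping by AP–MS was performed as part of the BioPlex project14 according to previously described protocols. U2OS cells were obtained from the American Type Culture Collection (ATCC) and tested for mycoplasma contamination. C-terminally HA–Flag-tagged DNA constructs for each of 2,174 bait proteins were built using clones from the human ORFeome library and introduced into U2OS cells by lentiviral transfection. Baits were selected on the basis of the success of previous pull-down experiments and to ensure broad sampling of the previously observed interactome22. Immobilized and pre-washed mouse monoclonal anti-HA agarose resin was incubated with cell lysate to extract the protein baits and their associated protein complexes, followed by elution with HA peptide and digestion with trypsin. Approximately 1 µg of peptides was loaded using a C18 microcapillary column for reversed-phase liquid chromatography, followed by selection of the top 20 precursors for MS2 analysis using data-dependent acquisition. Proteins were identified from MS2 spectra using SEQUEST58 and filtered to a 1% protein-level FDR with additional entropy-based filtering14. The CompPASS algorithm59,60 was used to select the top 2% of protein–protein interactions relative to the average across all other immunoprecipitations. Interactions were further filtered with CompPASS-Plus at a 1% FDR14,61. Quality-control steps were as previously described57. AP–MS analysis required that the bait protein be detected in the pull-down results. Moreover, the bait protein had to be more abundant (based on spectral counts) in its own pull-down than in the other pull-downs on the same 96-well plate. To remove under-loaded samples, we required each LC–MS run to contain at least around 5,000 PSMs and around 700 proteins. Enrichment of interactions within CORUM complexes (Extended Data Fig. 1c,d; CORUM v.4.1) was calculated using a one-sided binomial test, assuming a background probability equal to the network interaction density, followed by Benjamini–Hochberg (BH) FDR correction. In each case, CORUM complexes were restricted to those with at least three proteins and at least one AP–MS bait in the network. Randomized networks were constructed to preserve the total number of interactions of each bait (node degree).

U2OS cell culture for IF confocal imaging was performed as part of the Human Protein Atlas (HPA) project using previously described protocols12. U2OS cells were obtained from ATCC and authenticated by the manufacturer using morphology, karyotyping and PCR-based approaches to confirm identity and rule out intra- and interspecies contamination. U2OS cells were seeded in 96-well glass-bottom plates and grown to 60–70% confluency at 37 °C with 5% CO2 in McCoy's 5A medium supplemented with 10% fetal bovine serum (FBS). Cells were then fixed in 4% paraformaldehyde, permeabilized with Triton X-100 detergent and incubated overnight at 4 °C with the HPA primary antibody against the target protein. HPA antibodies were diluted to 2–4 μg ml−1 in blocking buffer containing 1 μg ml−1 mouse anti-tubulin and 1 μg ml−1 chicken anti-calreticulin. The next day, cells were incubated at room temperature with Alexa Fluor-conjugated goat secondary antibodies (including goat anti-rabbit Alexa Fluor 488 and an Alexa Fluor 647 conjugate) diluted to 1 μg ml−1 and counterstained with 4′,6-diamidino-2-phenylindole (DAPI). IF images were acquired using a Leica SP5 confocal microscope equipped with a ×63 HCX PL APO 1.40 oil CS objective. Each image contains four colour channels: one for the protein of interest and three for reference markers corresponding to the nucleus (DAPI), microtubules (anti-tubulin antibody) and endoplasmic reticulum (anti-calreticulin antibody). Antibody quality was scored according to the standard HPA protocol (https://www.proteinatlas.org/about/antibody+validation); the highest-scoring antibody for each protein was selected, with up to two technical replicate images.

We collected a proteomic SEC–MS dataset in the U2OS cell line according to previously described procedures62. U2OS cells were tested for mycoplasma contamination. For each replicate (n = 3), three 15 cm confluent plates of U2OS cells were washed and collected in ice-cold SEC buffer (50 mM KCl, 50 mM NaCH3COO, 50 mM Tris, pH 7.2, containing 1× EDTA-free Halt protease and phosphatase inhibitor (Thermo Fisher Scientific)). These samples were subjected to a previously described fractionation protocol63, with modifications. In brief, cells were lysed on ice using a Dounce homogenizer with a tight pestle. The lysate was centrifuged at 100,000 rcf for 15 min at 4 °C and the supernatant was concentrated over a 100 kDa molecular mass cut-off column (Sartorius). For each replicate, 600 µg of protein, as determined by a standard Bradford assay, was injected onto a single 300 × 7.8 mm BioSep-4000 column (Phenomenex) using SEC buffer without protease inhibitors. Samples were then separated at 0.6 ml min−1 and 6 °C, with 15 s per fraction, using a 1290 series semi-preparative HPLC system (Agilent Technologies). The collection end point was predetermined by measuring the end of a BSA standard peak, thereby discarding anything smaller than a single BSA protein. The resulting protein fractions were denatured by adding 2,2,2-trifluoroethanol (Sigma-Aldrich) to a final concentration of 20% (v/v), then reduced and alkylated64. Subsequently, we added an equal volume and digested the samples with trypsin (New England Biolabs) overnight at 37 °C. The resulting peptides were cleaned up using C18 stop-and-go extraction (STAGE) tips65, with 40% (v/v) acetonitrile and 0.1% (v/v) formic acid in water as the elution buffer. Peptide concentration was measured on a NanoDrop (Thermo Fisher Scientific, 205 nm, Scopes method), and approximately 50 ng of peptides was then loaded onto a timsTOF Pro 2 (Bruker Daltonics) system coupled through a CaptiveSpray source to a nanoElute UHPLC (Bruker Daltonics) device, using an Aurora Series Gen2 analytical column (25 cm × 75 μm, 1.6 μm FSC C18; IonOpticks). The instrument was set to acquire in diaPASEF mode as previously outlined66. Sample batches were randomized before injection.

The acquired SEC–MS data were searched with DIA-NN (v.1.8.1.0)67 against UniProt human sequences (UP000005640, downloaded 2 June 2023) and common contaminant sequences (229 sequences). Library-free search was enabled, with trypsin/P protease specificity and 1 missed cleavage. Other search parameters included a maximum of 1 variable modification, N-terminal M excision, carbamidomethylation of C, oxidation, a peptide length range of 7 to 30, a precursor charge range of 1–4, a precursor m/z range of 300 to 1,800 and a fragment ion m/z range of 200 to 1,800. Precursor FDR was set to 1%, and ‘Mass accuracy’, ‘MS1 accuracy’ and ‘Scan window’ were set to 0. The settings ‘Heuristic protein inference’, ‘Use isotopologues’, ‘Match between runs (MBR)’ and ‘No shared spectra’ were enabled. ‘Protein names from FASTA’ was selected for the protein inference parameter and ‘Double-pass mode’ for the neural network classifier. ‘Robust LC (high precision)’ was used for the quantification strategy, RT-dependent mode for cross-run normalization and smart profiling mode for library generation.

Analysis of the SEC–MS data used protein elution profiles, defined as the protein-level quantification values reported by DIA-NN across all fractions. The similarity between the elution profiles of each pair of proteins was calculated as the mean Pearson correlation across the three replicates. To assess the reproducibility of biological measurements (Extended Data Fig. 5b), we first selected the set of proteins present in all three replicates (n = 5,018). For each replicate, we determined the elution pattern of each protein, defined as the correlations of that protein with all others among the 5,018 proteins. We then calculated the Pearson correlation of protein elution patterns across replicates, either for the same protein or for random pairs of proteins.

Proteins were first preprocessed within the AP–MS and IF data separately. For AP–MS data, each protein i was represented as a 1,024-dimension feature vector (xi) based on its protein–protein interactions, using the node2vec68 Python3 implementation (https://github.com/eliorc/node2vec) (p = 2, q = 1, walk length = 80, number of walks = 10). For IF data, we applied DenseNet-121, a convolutional neural network pretrained to recognize protein IF confocal images69. DenseNet-121 was used to represent each protein as a 1,024-dimension feature vector (yi) derived from the four colour channels of its image.

We developed a self-supervised multimodal learning model to integrate (co-embed) the AP–MS and IF protein representations into a single low-dimensional (128-dimension) embedding space (Extended Data Fig. 2a). Our model is based on an autoencoder architecture called multimodal structured embedding70, with modifications. The parameters of the autoencoder are trained using a two-component loss function that combines a reconstruction loss and a triplet (contrastive) loss. Details are provided in the ‘Encoder/decoder architecture’, ‘Loss function’ and ‘Model training’ sections.

The separate AP–MS and IF vector inputs (xi and yi for each protein i; see above) are compressed by modality-specific encoders (fx and fy), producing 128-dimension vectors a and b:

Here, Dropout denotes a dropout layer71; Linear denotes a linear transformation layer; BatchNorm denotes batch normalization72; Tanh denotes the hyperbolic tangent function; and ELU denotes the exponential linear unit function. The a and b vectors are then input to a joint encoder fz, which learns the L2-normalized 128-dimension latent representation zi: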
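The values of zi constitute the self-supervised multimodal embedding used for subsequent cell map evaluation (see the ‘Evaluation of embedding approaches’ section below) and construction (see the ‘Pan-resolution community detection’ section below). For the decoder step, z is inversely transformed to extract 128-dimension modality-specific features through weight matrices Wx and Wy:

Finally, these features are passed to modality-specific decoders (gx and gy) to yield the 1,024-dimension reconstructed inputs (x̂i, ŷi):

To calculate the reconstruction loss R, the autoencoder outputs (x̂i, ŷi) are compared with the original input values (xi, yi) for each modality:

where n is the total number of proteins. The overall reconstruction loss is the sum of the modality-specific reconstruction losses and a regularization term, where λreg is the regularization weight and ||W||F is the Frobenius norm of a matrix:

To calculate the triplet loss T, clustering using the Louvain algorithm is performed on the (a, b) vectors of each modality (in early training, clusters are instead defined using the input (x, y) values; see the ‘Model training’ section below). This clustering defines selection functions sx and sy for each modality, with s(i, j) = 1 for proteins i, j in the same cluster and 0 otherwise. This information is used to calculate T for each modality:

where N is the set of all proteins, d denotes cosine distance (1 − cosine similarity) and m is the total number of terms within the sum that are greater than 0. The full loss function L is a weighted sum of the reconstruction and triplet losses:

Model parameters were trained using the standard neural network learning procedures provided by PyTorch74 v.2.0.1, based on backpropagation with the Adam stochastic gradient descent method75. Training proceeded in three phases: (1) for the first 200 epochs, only the reconstruction loss R was used for backpropagation; (2) for another 200 epochs, the full loss function L was used for backpropagation, with sx and sy defined using the input x, y vectors; (3) for the final 500 training epochs, the full loss function L was used for backpropagation, with sx and sy defined using the a, b vectors (redefined every 200 epochs). Hyperparameter values were set on the basis of previous work70 without fine-tuning: batch size = 64, λreg = 5, λtriplet = 5, Adam optimizer learning rate = 0.0001. The triplet-loss margin and dropout percentage (ε = 0.10, dropout = 0.25) were set on the basis of commonly recommended values76,77.

The self-supervised embedding model described above was evaluated against two alternative multimodal embedding approaches: (1) simple unsupervised concatenation of the separate AP–MS and IF inputs (x, y); and (2) a supervised random forest regression model using (x, y) to predict protein–protein semantic similarity in the Gene Ontology (June 2023 release), as previously described21 (Python scikit-learn package, fivefold cross-validation, n_estimators = 1000, max_depth = 30). Each of these embedding models was scored for recovery of interacting protein pairs documented in three complementary reference databases: (1) high-confidence protein–protein interactions in STRING78,79 (v.12, NDEx UUID 0B04E9EB-8E60-11EE-8A13-005056AE23AA; Extended Data Fig. 2b,e); (2) protein pairs assigned to the same CORUM25 complex (v.4.1, NDEx UUID 764F7471-9B79-11ED-9A1F-005056AE23AA; Extended Data Fig. 2c,e); or (3) protein pairs with high functional similarity in a genome-wide CRISPR perturbation/mRNA sequencing screen (perturb-seq80; Extended Data Fig. 2d,e). Here, high functional similarity was defined as the top 1% of protein pairs, as ranked by the Pearson correlation between the mRNA transcriptional changes induced by CRISPR disruption of the two proteins (see the ‘Perturb-seq data analysis’ section below).

The cosine similarity between the multimodal embeddings of each pair of proteins was used to generate a series of protein–protein proximity networks, in which edges were defined from the most similar 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0 or 10.0% of pairs, yielding 10 networks. Pan-resolution community detection was performed on these networks using the Hierarchical community Decoding Framework (HiDeF; https://github.com/fanzheng10/hidef)81 with a persistence threshold (k) of 10 and a maximum resolution of 80, with other parameters left at their default settings. HiDeF identifies protein communities at different resolutions and represents their hierarchical relationships as a directed acyclic graph (DAG). In this DAG, nodes represent communities and a directed edge (A → B) indicates that community A contains community B. The DAG was refined by assigning parent–child containment relationships between assemblies with a containment index ≥ 75% and by removing redundant systems with a Jaccard index ≥ 90%. The final DAG defines the cell map referenced in Fig. 2b.

A subset of 13 protein assemblies was selected from the cell map, corresponding to components with known physical diameters in the literature (Supplementary Table 2). Linear regression was used to fit the log10-transformed diameter (nm, y) against the log10-transformed size of the assembly (number of proteins, x): y = 1.27x − 0.31. This linear equation was then used to estimate the diameter ŷ of every assembly in the map. A 95% prediction interval (PI) was estimated from the standard error:

with t determined from Student's t-distribution (t = 2.2, d.f. = n − 2, n = 13 assemblies). Here, s.e. is the standard error between predicted and measured sizes, calculated as follows:

Relevant to Fig. 2c.

The robustness of protein assemblies was assessed using a statistical jackknifing approach, as previously described21. A set of 10% of the proteins was removed before multimodal embedding (see the ‘Multimodal embedding overview’ section above); integration and community detection were then performed using the same parameters described in the ‘Model training’ and ‘Pan-resolution community detection’ sections. This randomization procedure was repeated 300 times to create a set of jackknifed hierarchies. The robustness of each assembly was then calculated as the proportion of all jackknifed hierarchies containing at least one matching assembly, defined as one showing a substantial and significant overlap between the target and matching protein sets (Jaccard index ≥ 40% and hypergeometric statistic FDR < 0.001). To assess the dependence of each assembly on the protein imaging data, we created a dataset with AP–MS features randomized (1,024-dimension random vectors sampled from a normal distribution) before the statistical jackknifing procedure, and the robustness of each assembly was computed as described above. For assessing the dependence of each assembly in the map on the AP–MS data, a reciprocal procedure was performed in which image embeddings were randomized. Relevant to Extended Data Fig. 3.

The cell map was annotated by first aligning assemblies with the GO cellular component branch (June 2023 release), CORUM (4.1 human complexes) or HPA (v.23) resources. Each of these cell biology resources defines a list of protein sets (GO terms, CORUM complexes, HPA subcellular localizations), referred to here as components. Hypergeometric tests were performed for each assembly versus each component in the resource, and the FDR was determined using BH correction. The results were tabulated for all assembly–component pairs with Jaccard index ≥ 10% and hypergeometric statistic FDR < 0.01 (Supplementary Table 1). On the basis of this enrichment analysis, assemblies in the map were labelled as high overlap with known assembly (Jaccard index ≥ 50% for at least one of the three resources); substantial variation on known assembly (Jaccard index < 50% for all three resources and 20% ≤ Jaccard index < 50% for at least one of the resources); or not previously documented assembly (Jaccard index < 20% for all three resources).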
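The annotation logic described above (Jaccard overlap, a hypergeometric enrichment test, BH correction and the three-way labelling) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names and the `universe_size` argument are hypothetical, and the one-sided "overlap at least as large as observed" form of the hypergeometric test is an assumption.

```python
from scipy.stats import hypergeom


def jaccard(a, b):
    """Jaccard index between two protein sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hypergeom_pvalue(assembly, component, universe_size):
    """P(overlap >= observed) when len(assembly) proteins are drawn from the universe."""
    a, c = set(assembly), set(component)
    k = len(a & c)
    # sf(k - 1) gives P(X >= k); arguments are (k, population, successes, draws)
    return float(hypergeom.sf(k - 1, universe_size, len(c), len(a)))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def label_assembly(best_jaccards):
    """best_jaccards: best Jaccard index against each of the three resources."""
    top = max(best_jaccards)
    if top >= 0.5:
        return "high overlap with known assembly"
    if top >= 0.2:
        return "substantial variation on known assembly"
    return "not previously documented assembly"
```

For example, `label_assembly([0.10, 0.35, 0.05])` falls in the middle category because the best match is between 20% and 50%.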
We also used our recently developed Gene Set AI (GSAI) pipeline3 to guide the GPT-4 model27 (v.gpt-4-1106-preview) to annotate assemblies with <1,000 proteins (Extended Data Fig. 4a). This approach uses a well-engineered prompt that follows the chain-of-thought82 and one-shot83 strategies to query GPT-4 for a descriptive name, a confidence score and a detailed reasoning assay of the protein members from each assembly. One example is shown in Extended Data Fig. 4c, and the full result for each assembly is available in Supplementary Table 1. Literature references are provided by a separate GPT-4 based citation module developed in the previous study3 (Extended Data Fig. 4b) to aid in interpretability. The citation model extracts gene symbols and functional keywords from each paragraph of the LLM-generated analysis text; these are used to construct and execute PubMed queries that search titles and abstracts. The returned publications are prioritized based on relevance and the number of matching genes in their abstracts. Finally, a separate GPT-4 instance is asked to evaluate whether the top three publication titles and abstracts provide supporting evidence for factual statements in the original analysis paragraph, selecting those that satisfy this requirement as references. To evaluate the reproducibility of GPT-4 naming (Extended Data Fig. 4d), we performed the GSAI pipeline for five additional replicate runs of GPT-4 and calculated the semantic similarity between the assembly names generated in each of these runs versus the original run. Similarity was computed using the SapBERT model84 from huggingface (cambridgeltl/SapBERT-from-PubMedBERT-fulltext) using the transformers package85 (v.4.29.2). Assemblies that were not named by the original run were eliminated from the reproducibility test. 　　
To analyse the cell map for biological condensates, we used three resources: IUPred3.086, a sequence-based predictor of protein disorder; FuzDrop87, a sequence-based predictor for the ability of a protein to drive condensate formation; and CD-Code88, a database containing proteins known to participate in biological condensates. IUPred3.0 predicts the probability of each amino acid in a sequence as being disordered. Proteins containing a contiguous sequence of >30 amino acid residues, in which each amino acid has a >50% chance of being disordered, were annotated as likely disordered. FuzDrop assigns a probability that a sequence drives phase separation, which we thresholded at >60% to annotate proteins as ‘likely phase-separated’. Finally, we searched CD-Code for the UniProt ID of each gene under ‘Homo sapiens’ (accessed 31 May 2023), enabling us to annotate proteins as ‘associated with known condensates’. We used hypergeometric tests to assign statistical significance (P < 0.01) to each protein assembly that was enriched in proteins that were likely disordered, likely phase-separated, or associated with known condensates. Assemblies that were significant in one of these three analyses were considered possible biological condensates (Supplementary Table 3).

For the set of proteins in each assembly, we determined the Pearson correlation in SEC–MS elution profiles for all pairs of these proteins (see the ‘SEC–MS data collection’ section). This similarity distribution was then compared to a null distribution (all pairs of proteins not in any common U2OS assembly, that is, assigned to root node only) using a one-sided Wilcoxon rank-sum test with BH correction (Fig. 3d and Supplementary Table 4). Assemblies with FDR < 5% were considered validated. A similar analysis was performed using PrInCE89 (https://github.com/fosterlab/PrInCE) scores to rank protein pairs rather than Pearson correlations, with PrInCE run using the default parameters. We found that 90 assemblies were validated at 5% FDR in the complementary analysis using PrInCE, including 70 assemblies validated by both Pearson correlation and PrInCE similarity measures (Supplementary Table 4).
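The SEC–MS validation above compares within-assembly elution-profile correlations against a null set of pairs. The following is a minimal sketch of that comparison under simplifying assumptions: a single elution profile per protein (the text averages Pearson correlations over three replicates), a hypothetical `background` list standing in for proteins assigned to the root node only, and `scipy.stats.mannwhitneyu`, which is equivalent to the Wilcoxon rank-sum test, for the one-sided comparison. BH correction across assemblies is omitted.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu


def pairwise_pearson(profiles, proteins):
    """Pearson correlation of elution profiles for every pair in `proteins`.

    profiles: dict mapping protein name -> 1D array of per-fraction quantities.
    """
    return np.array([
        np.corrcoef(profiles[a], profiles[b])[0, 1]
        for a, b in combinations(proteins, 2)
    ])


def assembly_vs_null(profiles, assembly, background):
    """One-sided P value that within-assembly correlations exceed the null."""
    within = pairwise_pearson(profiles, assembly)
    null = pairwise_pearson(profiles, background)
    return float(mannwhitneyu(within, null, alternative="greater").pvalue)
```

In practice the P values for all assemblies would be collected and BH-corrected before applying the 5% FDR cut-off.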
For validation of unexpected protein subunits within assemblies, for each assembly with <50 proteins, ‘unexpected proteins’ were defined as those not included in the best-matching cellular component from any of three cell biology resources (GO, CORUM, HPA; see the ‘Annotation of cell map assemblies’ section above). For each unexpected member, its SEC–MS elution profile was compared against all other proteins in the assembly using Pearson correlation; this similarity distribution was compared to the null distribution as described above to compute an FDR. Unexpected proteins with FDR < 5% were considered validated (Supplementary Table 4).

All pairs of proteins in small assemblies (<10 proteins) were selected for AlphaFold-Multimer analysis. AlphaFold-Multimer was run on each pair using localcolabfold (https://github.com/YoshitakaMo/localcolabfold) with the default settings90. Sequences were acquired from the complete human protein UniProt FASTA file (UP000005640, reviewed sequences, downloaded 11 September 2023). For each predicted heterodimeric structure, we calculated a weighted average between the predicted template modelling score (pTM, an estimate of the similarity between the predicted and ground truth structures) and the ipTM score (the pTM score modified to score the interfaces across different proteins)31:

We calculated the median score out of five independent models generated per protein pair. A null score distribution was generated by repeating this score computation for pairs of proteins drawn randomly from those pairs that were not part of the same small assembly (<10 proteins as above). This null distribution was used to calculate an FDR for actual protein pair scores, selecting a cut-off of 30% corresponding to a weighted pTM score of 0.39. Pairs were further evaluated for the presence of a confident interface residue (within 10 Å of the other protein and pLDDT score > 80). Relevant to Fig. 4a.
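The pair-scoring step above (a weighted average of pTM and ipTM, then the median over five models) can be illustrated as follows. The weighting equation is not reproduced in this text, so the weights used here (0.8 for ipTM, 0.2 for pTM) are an assumption borrowed from the usual AlphaFold-Multimer model-confidence convention; the function names are hypothetical.

```python
from statistics import median


def weighted_ptm(ptm, iptm, iptm_weight=0.8):
    """Weighted average of ipTM and pTM (weights are an assumption, see above)."""
    return iptm_weight * iptm + (1.0 - iptm_weight) * ptm


def pair_score(models):
    """Median weighted score over the models predicted for one protein pair.

    models: iterable of (pTM, ipTM) tuples, for example five per pair.
    """
    return median(weighted_ptm(ptm, iptm) for ptm, iptm in models)


# Five hypothetical models for one protein pair
example_models = [(0.5, 0.25), (0.6, 0.3), (0.4, 0.2), (0.7, 0.5), (0.3, 0.1)]
example_score = pair_score(example_models)
```

A pair would then be compared against the cut-off quoted in the text (a weighted pTM score of 0.39), so the hypothetical pair above would not pass.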
A structural model of the Rag–Ragulator community was computed by using an integrative modelling approach35,91,92,93, proceeding through the standard four stages35,91,94 as follows. (1) Gathering input information: the Rag–Ragulator model in the cell map included LAMTOR1 through LAMTOR5, RRAGA, RRAGC, SLC38A9, BORCS6, NUDT3 and ITPA. An integrative model was computed based on the SLC38A9–RagA–RagC–Ragulator comparative model (PDB: 6WJ2 template)36, AlphaFold30 predictions for BORCS6 and ITPA, and pairwise AlphaFold-Multimer predictions31 for BORCS6 or ITPA versus all other members of the Rag–Ragulator complex. One-hundred AlphaFold-Multimer models were generated for each pair and evaluated using FoldDock95. The model excluded NUDT3 because AlphaFold-Multimer did not produce high-confidence models of NUDT3 and other Rag–Ragulator components according to FoldDock. (2) Representing subunits and translating data into spatial restraints: the components of the Rag–Ragulator community were represented as rigid bodies. Alternative models were ranked through a scoring function corresponding to a sum of terms, each one of which restrains some aspect of the model based on a subset of input information. The spatial restraints included a binary binding mode restraint on the position and orientation of pairs of proteins as derived from ensembles of AlphaFold-Multimer predictions, connectivity restraints between consecutive pairs of beads in a subunit and excluded volume restraints between non-bonded pairs of beads. (3) Configurational sampling to produce an ensemble of structures that satisfies the restraints: the initial positions and orientations of rigid bodies and flexible beads were randomized. The generation of structural models was performed using replica exchange Gibbs sampling, based on the Metropolis Monte Carlo algorithm96. Each Monte Carlo step consisted of a series of random translations of flexible beads and random translations and rotations of rigid bodies. 
(4) Analysing and validating the data and ensemble structures: model validation93,97 included selection of the models for validation; estimation of sampling precision; estimation of model precision; and quantification of the degree to which a model satisfies the information used to compute it. The above four-step modelling protocol was scripted using the Python Modelling Interface (PMI) package, a library for modelling macromolecular complexes based on the open-source Integrative Modelling Platform (IMP) package v.2.18 (https://integrativemodeling.org)91. The configuration of the rigid Rag–Ragulator complex, ITPA protein and the two BORCS6 domains was computed by minimizing the violations of the spatial restraints implied by the input information, using IMP91. Relevant to Fig. 4j.

The K562 day-8 perturb-seq dataset80 was acquired at https://gwps.wi.mit.edu (BioProject: PRJNA831566). This dataset provides single-cell transcriptional profiles for 9,867 distinct gene knockouts, which underwent filtering based on the following criteria: (1) gene knockout corresponds to a protein in our U2OS cell map; (2) gene knockout has efficient on-target mRNA reduction of >30%; (3) gene knockout induces a strong transcriptional phenotype, defined by ≥20 differentially expressed genes with significance < 0.05 on the basis of the Anderson–Darling test followed by BH correction. This filtering process resulted in a list of 1,289 gene knockouts. The functional cell states due to each of these perturbations were represented using the mean-normalized differential expression profile. Relevant to Fig. 4m and Extended Data Fig. 2d,e.

U2OS cells were seeded in triplicate at 300,000 cells per well in a six-well plate (two biological replicates). The next day, cells were treated with 1G244, a DPP9 inhibitor (HY-116304, MedChem Express) at the indicated concentrations for a total of 6 h. After treatment, the medium was aspirated and the cells were washed once with ice-cold PBS. Cells were collected in 500 µl of cold TRIzol reagent (15596026, Invitrogen) using a cell scraper.
100 µl of chloroform was added to the TRIzol lysate and vortexed for 20 s followed by a 3 min incubation at room temperature. The homogenate was centrifuged at 10,000g for 18 min at 4 °C. A total of 200 µl of aqueous phase was removed with a pipette and transferred to a new Eppendorf tube. An equal volume of 100% ethanol was slowly added to the aqueous phase and mixed by gentle pipetting. The entire sample was transferred to an RNeasy Mini spin column placed in a 2 ml collection tube (74104, Qiagen). The rest of the extraction was carried out according to the Qiagen RNeasy protocol. 2 µg of RNA per sample was reverse-transcribed according to the iScript cDNA Synthesis Kit protocol (1708890, Bio-Rad, interferon beta 1: Hs01077958_s1; interferon gamma 1, Hs00194264_m1; interferon gamma 2, Hs00988304_m1; non-ISG—18S, 4333760T; and GAPDH, Hs0275889q_g1). qPCR was carried out in triplicates in a 96-well plate according to the TaqMan Fast Advanced Master Mix protocol (4444557, Thermo Fisher Scientific) on a CFX96 Touch Real-Time PCR Detection System from Bio-Rad. The expression levels were compared against a housekeeping gene (GAPDH), and the relative expression levels were compared against the DMSO control. Relevant to Extended Data Fig. 6. 　　We downloaded the AP–MS BioPlex v3 network from NDEx (uuid 6b995fc9-2379-11ea-bb65-0ac135e8bacf), which provides high coverage of human protein interactions in a second cell type, HEK293 cells (14,033 proteins, 127,732 protein–protein interactions). Node2vec was used to represent the interaction pattern of each protein in this HEK293 network (see the ‘AP–MS and IF data preprocessing’ section). The cosine similarity in interaction patterns was then computed for all protein pairs (separately for HEK293 and U2OS). For the set of proteins included in each U2OS assembly, the distribution of pairwise protein similarities in HEK293 were compared to those in U2OS cells using the two-sided Mann–Whitney U-test. 
This test was translated to an effect size using Cliff’s delta98; assemblies with Cliff’s delta ≥ 0.5 were considered to be increasingly U2OS-specific whereas those with Cliff’s delta < 0.5 were considered to be increasingly conserved. Relevant to Extended Data Fig. 7; in Extended Data Fig. 7b, Cliff’s delta scores of <0 are set to 0. 　　For each protein, we identified its terminal locations in the cell map hierarchy, defined as assemblies (hierarchy nodes) where the protein appeared but was absent in all subassemblies (child nodes). We then counted the number of unique paths from these terminal locations to the root of the hierarchy (root node). Proteins with multiple distinct paths to the root were classified as multi-localized, indicating their presence in different branches of the cell map. Multi-localized assemblies were identified as assemblies with more than one parent node in the hierarchy. Relevant to Extended Data Fig. 8. 　　Data were obtained from a pan-paediatric cancer study4 of 914 individual patients with cancer aged under 25 years (study ID: pediatric_dkfz_2017, downloaded from cBioPortal99,100). We selected the following types of non-silent somatic mutation events: ‘Frame_Shift_Del’, ‘Frame_Shift_Ins’, ‘In_Frame_Del’, ‘In_Frame_Ins’, ‘Missense_Mutation’, ‘Nonsense_Mutation’, ‘Nonstop_Mutation’, ‘RNA’, ‘Splice_Region’, ‘Splice_Site’ and ‘Translation_Start_Site’. A total of 772 primary tumour samples, spanning 18 cancer types, were in the resulting list (Supplementary Table 9). We recorded the number of tumours in the pan-paediatric cohort, as well as each individual tumour cohort, in which each gene was observed to have at least one somatic mutation event (N(g,obs)). Moreover, we calculated the expected number of mutations for each gene in the pan-paediatric cohort (N(g,exp)) using the default setting of MutSigCV v.1.4, as described in a previous study101. 
For expected mutation counts for individual cancer cohorts, we down-scaled the pan-paediatric cancer N(g,exp) based on the proportion of patients (for example, 44 patients with Wilms’ tumours (WT) account for 5.7% of the pan-paediatric cohort, so Ng,exp,WT = 0.057 × Ng,exp,pan-paediatric). Finally, the corrected log mutation count of each gene (Mg) for each cohort was calculated as: 　　We applied a previously described statistical model, HiSig101 (https://github.com/fanzheng10/HiSig), to calculate the mutation selection pressure on assemblies with the default parameter settings. HiSig implements linear regression (with L1 lasso regularization) of the mutation count against the organization of proteins in assemblies. We calculated an empirical P value by comparing the mutational selection on assemblies against 10,000 randomly permuted assignments of proteins to assemblies. The FDR was calculated by BH correction. Recurrently mutated assemblies were selected on the basis of FDR ≤ 0.4. Assembly-level mutation frequencies were calculated from the number of distinct patients who carried at least one mutated protein in the assembly. 
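Two pieces of arithmetic in the passage above lend themselves to a short sketch: down-scaling the pan-cohort expected mutation count by the cohort's share of patients, and turning a set of permutation scores into an empirical P value. This is an illustration only. The function names are hypothetical, the pan-cohort size of 772 is taken from the number of primary tumour samples quoted earlier (44/772 ≈ 5.7%), and the +1 pseudocount in the empirical P value is a common convention assumed here, not something stated in the text.

```python
import numpy as np


def downscale_expected(n_exp_pan, cohort_patients, pan_patients):
    """Scale a pan-cohort expected mutation count by the cohort's patient share."""
    return n_exp_pan * cohort_patients / pan_patients


def empirical_pvalue(observed, permuted_scores):
    """Fraction of permuted scores at least as large as the observed score.

    The +1 in numerator and denominator (an assumption) avoids P = 0.
    """
    permuted = np.asarray(permuted_scores, dtype=float)
    return float((1 + np.sum(permuted >= observed)) / (1 + permuted.size))


# 44 Wilms' tumour patients out of 772 gives a factor of about 0.057
wilms_expected = downscale_expected(10.0, 44, 772)
```

With 10,000 permutations, as in the text, the smallest attainable empirical P value under this convention is 1/10,001.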
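Tumour types with fewer than 15 patients were excluded from analysis, as were mutated assemblies with >50 mutated proteins.

Genes mutated in multiple patients with cancer and located in significantly recurrently mutated assemblies (see above) were defined as putative cancer proteins. We obtained a large collection of transposon-based mutagenesis screens from the Candidate Cancer Gene Database (CCGD)46 (http://ccgd-starrlab.oit.umn.edu/index.html). This database comprises a total of 72 studies of mouse transposon insertional mutagenesis screens across 13 tumour categories (Extended Data Fig. 9a). We determined the number of studies in which each gene was disrupted by transposon insertion in mouse tumours. Genes mutated in cancer assemblies were designated as positives (genes expected to have high study counts because they are mutated), and all other genes were designated as negatives (genes not expected to have high study counts). We computed kernel density estimates (KDEs) of the study counts for mutated genes in cancer assemblies and for other genes using the stats.gaussian_kde function from the Python package SciPy (v.1.7.3). The area under the KDE curves was integrated using the trapz function from the Python package NumPy (v.1.21.6). The FDR was then calculated as the ratio of the area under the curve of false positives (AreaFP) to the total area under the KDE curves representing false positives and true positives (AreaTP), that is, FDR = AreaFP / (AreaFP + AreaTP). We designated a minimum of 4 screens reporting a gene (x ≥ 4), corresponding to FDR = 0.28, as the threshold cut-off for validated cancer drivers (Extended Data Fig. 9b). Adult cancer driver genes were collected from the TCGA Pan-Cancer Atlas102; significantly mutated genes in pan-paediatric cancer cohorts were collected from refs. 4,103. These genes are defined as known cancer genes in Extended Data Fig. 9c,d.

The Cell Mapping Toolkit (https://github.com/idekerlab/cellmaps_pipeline) implements a series of Python packages to execute the end-to-end pipeline described here. Specific packages include steps for processing protein imaging and biophysical interaction datasets (cellmaps_imagedownloader, cellmaps_ppidownloader) and for annotating the cell map with known resources (for example, cellmaps_hierarchyeval). Each package is pip-installable and linked to full user documentation hosted on Read the Docs (https://cellmaps-pipeline.readthedocs.io/). A step-by-step guide is provided in the GitHub repository.

Statistical tests were performed using SciPy104, with BH multiple-testing correction where appropriate. Statistics between two data distributions were calculated using the Mann–Whitney U-test or Wilcoxon rank-sum test (Figs. 2d and 3c,d and Extended Data Figs. 2b–d, 7b and 9b). Unless stated otherwise, statistics assessing the enrichment of proteins or protein pairs were calculated using the hypergeometric test (Fig. 2b and Extended Data Fig. 3). SEC–MS data were replicated in three biological replicates. IF staining was replicated in at least two different cell lines in the HPA (Fig. 4h,l and Extended Data Figs. 6b, 7d, 8c,g and 9f). qPCR experiments for DPP9 were performed with two biological replicates and three technical replicates (Extended Data Fig. 6c).

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.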
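The KDE-based FDR estimate described above for the transposon-screen validation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the integration grid are hypothetical, each KDE is an unweighted unit-area density (the text does not say whether densities were scaled by class size), and `scipy.integrate.trapezoid` is used in place of the NumPy `trapz` function named in the text because the two perform the same trapezoidal integration.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde


def kde_fdr(positive_counts, negative_counts, threshold, n_grid=2001):
    """FDR = Area_FP / (Area_FP + Area_TP) for study counts >= threshold.

    positive_counts: study counts of genes mutated in cancer assemblies.
    negative_counts: study counts of all other genes.
    """
    pos = np.asarray(positive_counts, dtype=float)
    neg = np.asarray(negative_counts, dtype=float)
    # Integrate from the threshold to a point safely beyond the data range
    upper = max(pos.max(), neg.max()) + 3.0 * max(pos.std(), neg.std())
    grid = np.linspace(threshold, upper, n_grid)
    area_tp = trapezoid(gaussian_kde(pos)(grid), grid)
    area_fp = trapezoid(gaussian_kde(neg)(grid), grid)
    return float(area_fp / (area_fp + area_tp))
```

Scanning `threshold` over candidate values and reading off the FDR at each one would reproduce the kind of cut-off selection described in the text (x ≥ 4 at FDR = 0.28 in the study's data).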

This article was submitted by the author [yjmlxc] and does not represent the position of 颐居号. If reproduced, please credit the source: https://yjmlxc.cn/zlan/202506-8632.html