A compute-in-memory chip based on resistive random-access memory

Fig. 2d and Extended Data Fig. 1 show the block diagram of a single CIM core. To support versatile MVM directions, most of the design is symmetric in the row (BLs and WLs) and column (SLs) directions. The row and column register files store the MVM inputs and outputs; they can be written externally through a serial peripheral interface (SPI) or a random-access interface that uses an 8-bit address decoder to select a register entry, or internally by the neurons. The SL peripheral circuits contain an LFSR block that generates pseudo-random sequences for probabilistic sampling. It consists of two LFSR chains propagating in opposite directions, whose registers are combined to generate spatially uncorrelated random numbers51. The controller block receives commands and generates control waveforms for the BL/WL/SL peripheral logic and the neurons. It contains a delay-line-based pulse generator with pulse width tunable from 1 ns to 10 ns. It also implements clock-gating and power-gating logic that turns off the core in idle mode. Each WL, BL and SL of the TNSA is driven by a driver consisting of multiple pass gates supplied with different voltages. The WL/BL/SL logic determines the state of each pass gate according to the values stored in the register files and the control signals issued by the controller.

The core has three main operating modes: weight-programming mode, neuron-testing mode and MVM mode (Extended Data Fig. 1). In the weight-programming mode, individual RRAM cells are selected for reading and writing. To select a cell, the registers at the corresponding row and column are programmed to '1' by random access with the help of the row and column decoders, while the other registers are reset to '0'. The WL/BL/SL logic then turns on the corresponding driver pass gates to apply the SET/RESET/READ voltages to the selected cell. In the neuron-testing mode, the WLs are kept at ground (GND). The neurons receive inputs directly from the BL or SL drivers through their BL or SL switches, bypassing the RRAM devices. This allows us to characterize the neurons independently of the RRAM array. In the MVM mode, each input BL and SL is driven to Vref − Vread, Vref + Vread or Vref, depending on the register value of that row or column. If the MVM is in the BL-to-SL direction, we activate the WLs within the input-vector length while keeping the rest at GND; if the MVM is in the SL-to-BL direction, we activate all the WLs. After the neurons finish the analogue-to-digital conversion, the pass gates from the BLs and SLs to the registers are turned on to allow readout of the neuron states.

The RRAM array in NeuRRAM is in a one-transistor–one-resistor (1T1R) configuration, in which each RRAM device is stacked on top of, and connected in series with, a selector NMOS transistor that cuts off sneak paths and provides current compliance during RRAM programming and reading. The selector n-type metal–oxide–semiconductor (NMOS) transistors are fabricated in a standard 130-nm foundry process. Owing to the higher voltages required for RRAM forming and programming, the selector NMOS transistors and the peripheral circuits that directly interface with the RRAM array use thick-oxide input/output (I/O) transistors rated for 5-V operation. All the other CMOS circuits, including the neurons, digital logic, registers and so on, use core transistors rated for 1.8-V operation.

The RRAM devices are sandwiched between the metal-4 and metal-5 layers, as shown in Fig. 2c. After the foundry completes the fabrication of the CMOS and the bottom four metal layers, we complete the fabrication of the RRAM devices and the metal-5 interconnects, together with the top metal pads and the passivation layer, using a laboratory process. The RRAM device stack consists of a titanium nitride (TiN) bottom-electrode layer, a hafnium oxide (HfOx) switching layer, a tantalum oxide (TaOx) thermal-enhancement layer52 and a TiN top-electrode layer. The layers are deposited sequentially, followed by lithography steps that pattern the lateral structure of the device array.

Each neural-network weight is encoded by the differential conductance between two RRAM cells on the same column. The first RRAM cell encodes the positive part of the weight and is programmed to the low-conductance state (gmin) if the weight is negative; the second cell encodes the negative part of the weight and is programmed to gmin if the weight is positive. Mathematically, the conductances of the two cells are g+ = gmin + (gmax − gmin)·max(w, 0)/wmax and g− = gmin + (gmax − gmin)·max(−w, 0)/wmax, respectively, where gmax and gmin are the maximum and minimum conductances of the RRAM, wmax is the maximum absolute value of the weights and w is the unquantized high-precision weight.

To program the RRAM cells to their target conductances, we use an incremental-pulse write–verify technique42. Extended Data Fig. 3a, b illustrates the procedure. We first measure the initial conductance of a cell. If the value is lower than the target conductance, we apply a weak SET pulse to slightly increase the cell conductance, and then read the cell again. If the value is still below the target, we apply another SET pulse with a slightly increased amplitude. We repeat such SET–read cycles until the cell conductance falls within an acceptance range around the target value or overshoots to the other side of the target. In the latter case, we reverse the pulse polarity to RESET and repeat the same procedure as for SET. The cell conductance may bounce up and down several times during the SET/RESET pulse trains until it finally enters the acceptance range or a timeout limit is reached.

There are trade-offs when choosing the programming conditions. (1) A smaller acceptance range and a higher timeout limit improve the programming precision but take a longer time. (2) A higher gmax improves the SNR during inference but leads to higher energy consumption and more programming failures for cells that cannot reach a high conductance. In our experiments, we set the initial SET pulse voltage to 1.2 V and the RESET pulse voltage to 1.5 V, both with 0.1-V increments and a 1-μs pulse width. The RRAM readout depends on the cell conductance. The acceptance range is ±1 μS around the target conductance, and the timeout limit is 30 SET–RESET polarity reversals. We use gmin = 1 μS for all models, gmax = 40 μS for the CNNs and gmax = 30 μS for the LSTMs and RBMs. With these settings, 99% of the RRAM cells can be programmed into the acceptance range within the timeout limit, and each cell requires 8.52 SET/RESET pulses on average. In the current implementation, the speed of this write–verify process is limited by the external control of the DACs and ADCs. With everything integrated on a single chip, such a write–verify would take on average 56 μs per cell. Having multiple copies of the DACs and ADCs to perform write–verify on multiple cells in parallel would further improve the RRAM programming throughput at a higher area cost.

Besides the longer programming time, another reason not to use an unnecessarily tight write–verify acceptance range is RRAM conductance relaxation. The RRAM conductance changes over time after programming. Most of the change takes place within a short time window (less than 1 s) after programming, after which the change becomes much slower, as shown in Fig. 3d. The abrupt initial change is known in the literature as 'conductance relaxation'41. Its statistics follow a Gaussian distribution at all conductance states, except when the conductance is close to gmin. Extended Data Fig. 3c, d shows the conductance relaxation measured across the whole conductance range from gmin to gmax. We find that the loss of programming precision caused by conductance relaxation is far greater than that caused by the write–verify acceptance range. The average standard deviation across all initial conductance levels is about 2.8 μS, and the maximum standard deviation is about 4 μS, close to 10% of gmax.

To mitigate the relaxation, we use an iterative programming technique: we iterate over the RRAM array multiple times, and in each iteration we measure all the cells and reprogram those whose conductance has drifted outside the acceptance range. Extended Data Fig. 3e shows that the standard deviation becomes smaller with more programming iterations. After 3 iterations, the standard deviation falls to about 2 μS, a 29% reduction from the initial value. We use 3 iterations for all the neural-network demonstrations, and perform inference at least 30 min after programming so that the measured inference accuracy accounts for this conductance-relaxation effect. Combined with our hardware-aware model-training approach, iterative programming largely mitigates the impact of relaxation.
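To make the flow concrete, the following minimal Python sketch walks through the write–verify and iterative-programming procedure under the parameter values quoted above. The chip-access helpers (read_conductance, apply_set, apply_reset) and the max_pulses guard are hypothetical stand-ins, not part of the chip's actual control software.

```python
# Minimal sketch of the incremental-pulse write-verify and iterative
# programming flow described above. Chip-access helpers are hypothetical
# stand-ins for the DAC/ADC-controlled operations; parameters follow the text.

ACCEPT_RANGE = 1.0            # acceptance range: +/- 1 uS around the target
MAX_REVERSALS = 30            # timeout: 30 SET-RESET polarity reversals
V_SET0, V_RESET0 = 1.2, 1.5   # initial pulse amplitudes (V)
V_STEP = 0.1                  # amplitude increment per pulse (V)

def write_verify(cell, target_us, read_conductance, apply_set, apply_reset,
                 max_pulses=1000):
    """Program one cell to target_us (in uS); returns True on success."""
    polarity, v_set, v_reset, reversals = 'SET', V_SET0, V_RESET0, 0
    for _ in range(max_pulses):
        g = read_conductance(cell)
        if abs(g - target_us) <= ACCEPT_RANGE:
            return True                               # within acceptance range
        # Overshoot to the other side of the target: reverse pulse polarity.
        if polarity == 'SET' and g > target_us:
            polarity, reversals = 'RESET', reversals + 1
        elif polarity == 'RESET' and g < target_us:
            polarity, reversals = 'SET', reversals + 1
        if reversals >= MAX_REVERSALS:
            return False                              # timeout limit reached
        if polarity == 'SET':
            apply_set(cell, v_set, pulse_width_us=1.0)
            v_set += V_STEP                           # incremental SET amplitude
        else:
            apply_reset(cell, v_reset, pulse_width_us=1.0)
            v_reset += V_STEP                         # incremental RESET amplitude
    return False

def iterative_program(cells, targets_us, read_conductance, apply_set,
                      apply_reset, iterations=3):
    """Re-check the whole array several times to counter conductance relaxation."""
    for _ in range(iterations):
        for cell, target in zip(cells, targets_us):
            if abs(read_conductance(cell) - target) > ACCEPT_RANGE:
                write_verify(cell, target, read_conductance, apply_set, apply_reset)
```

In the measurements reported here, the same loop is driven by DACs and ADCs external to the chip, which is what limits the write–verify speed discussed above.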
The neurons and the peripheral circuits support MVM with configurable input and output bit-precisions. An MVM operation consists of an initialization stage, an input stage and an output stage. Extended Data Fig. 4 illustrates the neuron circuit operations. During the initialization stage (Extended Data Fig. 4a), all the BLs and SLs are precharged to Vref. The sampling capacitors Csample of the neurons are also precharged to Vref, while the integration capacitors Cinteg are discharged.

During the input stage, each input line (BL or SL, depending on the MVM direction) is driven to one of three voltage levels, Vref − Vread, Vref and Vref + Vread, through three pass gates, as shown in Fig. 3b. During the forward MVM, under the differential weight mapping, each input is applied to a pair of adjacent BLs, which are driven to opposite voltages with respect to Vref. That is, when the input is 0, both wires are driven to Vref; when the input is +1, the two wires are driven to Vref + Vread and Vref − Vread; and when the input is −1, to Vref − Vread and Vref + Vread. During the backward MVM, each input is applied to a single SL, and the differential operation is performed digitally after the neurons finish the analogue-to-digital conversion.

After biasing the input lines, we pulse the WLs of the rows that have inputs for 10 ns while keeping the output lines floating. After the voltages on the output wires settle to the conductance-weighted average of the input voltages (that is, Vj = Σi Gij·Vi / Σi Gij, where Gij is the conductance of the RRAM at the i-th row and j-th column), we turn off the WLs to stop all current flow. We then sample the charge remaining on the output-line parasitic capacitance onto Csample inside the neuron, and integrate the charge onto Cinteg, as shown in Extended Data Fig. 4b. The sampling pulse is 10 ns (limited by the 100-MHz external clock from the FPGA); the integration pulse is 240 ns, limited by the large integration capacitor (104 fF), which was sized conservatively to ensure functional correctness and to allow testing of different neuron operating conditions.

Multi-bit input digital-to-analogue conversion is performed in a bit-serial fashion. For the n-th LSB, we apply a single pulse to the input lines and then sample and integrate the output lines onto Cinteg for 2^(n−1) cycles. At the end of the multi-bit input stage, the complete analogue MVM output is stored as charge on Cinteg. For example, as shown in Fig. 3e, when the input vector is a 4-bit signed integer with 1 sign bit and 3 magnitude bits, we first send the pulses corresponding to the first (least significant) magnitude bit to the input lines, and then sample and integrate for one cycle. For the second and third magnitude bits, we again apply one pulse to the input lines and then sample and integrate for two and four cycles, respectively. In general, for an n-bit signed-integer input, we need in total n − 1 input pulses and 2^(n−1) − 1 sampling-and-integration cycles.

Owing to the exponentially increasing number of sampling-and-integration cycles, this multi-bit input scheme becomes inefficient at high input bit-precision. Moreover, headroom clipping becomes an issue as the charge integrated on Cinteg saturates with more integration cycles. The headroom clipping can be overcome by using a lower Vread, but at the cost of a lower SNR, so the overall MVM accuracy may not improve when higher-precision inputs are used. For example, Extended Data Fig. 5a, c shows the measured root-mean-square error (r.m.s.e.) of the MVM results: owing to the lower SNR, quantizing the inputs to 6 bits (r.m.s.e. = 0.581) does not improve the MVM accuracy over 4-bit inputs (r.m.s.e. = 0.582).

To solve both issues, we use a two-phase input scheme for input bit-precisions greater than 4 bits. Fig. 3f illustrates the process. To perform an MVM with 6-bit inputs and 8-bit outputs, we divide the inputs into two segments, the first containing the three MSBs and the second containing the three LSBs. We then perform the MVM, including the output analogue-to-digital conversion, for each segment. For the MSB segment, the neurons (ADCs) are configured to output 8 bits; for the LSB segment, the neurons output 5 bits. The final result is obtained by shifting and adding the two outputs in the digital domain. Extended Data Fig. 5d shows that this scheme reduces the MVM r.m.s.e. from 0.581 to 0.519. Extended Data Fig. 12c–e further shows that the two-phase scheme both extends the input bit-precision range and improves the energy efficiency.

Finally, during the output stage, the analogue-to-digital conversion is again performed in a bit-serial fashion, through a binary-search process. First, to generate the sign bit of the output, we disconnect the feedback loop of the amplifier, turning the integrator into a comparator (Extended Data Fig. 4c). We drive the right-hand side of Cinteg to Vref. If the integrated charge is positive, the comparator output is GND; otherwise it is the supply voltage VDD. The comparator output is then inverted, latched, driven onto the BL or SL through the neuron's BL or SL switch, and written into the peripheral BL or SL registers.

To generate k magnitude bits, we add charge to or subtract charge from Cinteg (Extended Data Fig. 4d), followed by comparison and readout, for k cycles. The amount of charge added or subtracted is halved every cycle, going from the MSB to the LSB, and whether to add or to subtract is determined automatically by the comparison result stored in the latch from the previous cycle. Fig. 3g illustrates such a process. A sign bit of '1' is generated and latched in the first cycle, representing a positive output. To generate the most significant magnitude bit, the latch opens the path from Vdecr− = Vref − Vdecr to Csample. The charge sampled on Csample is then integrated onto Cinteg by turning on the negative feedback loop of the amplifier, which subtracts an amount of charge equal to Csample·Vdecr from Cinteg. In this example, Csample·Vdecr is larger than the original charge on Cinteg, so the total charge becomes negative and the comparator generates a '0' output. To generate the second magnitude bit, Vdecr is halved; this time the latch opens the path from Vdecr+ = Vref + ½Vdecr to Csample. Because the total charge on Cinteg after the integration is still negative, the comparator again outputs '0' in this cycle. We repeat this process until the least significant magnitude bit is generated. It is noted that if the initial sign bit is '0', all the subsequent magnitude bits are inverted before readout.

Such an output-conversion scheme is similar to an algorithmic ADC or a SAR ADC in the sense that a binary search is performed over n cycles for n output bits. The difference is that an algorithmic ADC uses a residue amplifier and a SAR ADC requires a multi-bit DAC for every ADC, whereas our scheme needs no residue amplifier and uses a single DAC, which outputs 2 × (n − 1) different Vdecr+ and Vdecr− levels shared by all the neurons. As a result, our scheme achieves a more compact design by reusing the amplifier for both integration and comparison, eliminating the residue amplifier and amortizing the DAC area across all the neurons of a CIM core. For CIM designs that use dense memory arrays, such a compact design allows each ADC to be time-multiplexed by a smaller number of rows and columns, which improves the throughput.

In summary, configurable MVM input and output bit-precisions and various neuron activation functions are realized with different combinations of four basic operations: sampling, integration, comparison and charge decrement. Importantly, all four operations are realized by a single amplifier configured in different feedback modes, so the design achieves both versatility and compactness.

NeuRRAM supports performing MVMs on multiple CIM cores in parallel. Multi-core MVM brings additional challenges to computational accuracy, because certain hardware non-idealities that do not manifest during single-core MVM become more severe. They include voltage drops on the input lines, core-to-core variations and supply-voltage instability. The voltage drop on the input lines (non-ideality (1) in Fig. 4a) is caused by the current drawn simultaneously by multiple cores from a shared voltage source. It makes the equivalent weights stored in each core vary with the applied inputs, and therefore has a nonlinear, input-dependent effect on the MVM outputs. Moreover, because different cores sit at different distances from the shared voltage source, they experience different amounts of voltage drop, so we cannot optimize the read-voltage amplitude of each core separately to make its MVM outputs occupy the full neuron input dynamic range.

Together, these non-idealities degrade the multi-core MVM accuracy. Extended Data Fig. 5e, f shows that the measured outputs of convolutional layer 15 have a higher r.m.s.e. of 0.383 when the convolution is performed on 3 cores in parallel, compared with 0.318 when the convolution is performed on the 3 cores sequentially. In our ResNet-20 experiments, we perform 2-core parallel MVMs within block 1 and 3-core parallel MVMs within blocks 2 and 3 (Extended Data Fig. 9a).

The voltage-drop issue can be partially mitigated by lowering the instantaneous current carried by each wire and by using a power-delivery network with a more optimized topology, but the issue will persist and become worse as more cores are used. Our experiments therefore aim to study the efficacy of algorithm–hardware co-optimization techniques in mitigating the issue. It is also noted that a full-chip implementation would need to integrate additional modules, such as intermediate-result buffers and partial-sum accumulators, to manage inter-core data transfer. The program scheduling should also be carefully optimized to minimize the buffer size and the energy spent on intermediate data movement. Although there are studies on such full-chip architectures and scheduling37,38,53, they are beyond the scope of this study.

During neural-network training, we inject noise into the weights of all the fully connected and convolutional layers during the forward pass to emulate the effects of RRAM conductance relaxation and read noise. The distribution of the injected noise is obtained from RRAM characterization: we program the RRAM cells to different initial conductance states using the iterative write–verify technique and measure their conductance relaxation after 30 min. Extended Data Fig. 3d shows that the measured conductance relaxation has an absolute mean value of <1 μS (gmin) at all conductance states. The highest standard deviation is 3.87 μS, about 10% of the gmax 40 μS, found at about 12 μS initial conductance state. Therefore, to simulate such conductance relaxation behaviour during inference, we inject a Gaussian noise with a zero mean and a standard deviation equal to 10% of the maximum weights of a layer.

We train models with different levels of noise injection from 0% to 40%, and select the model that achieves the highest inference accuracy at 10% noise level for on-chip deployment.
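The following minimal sketch illustrates this training-time noise injection, assuming a PyTorch model; the helper names (add_weight_noise, restore_weights, train_step) are ours and the 10% noise fraction follows the characterization above. It is an illustrative sketch rather than the exact training code.

```python
# Minimal sketch of training-time weight-noise injection, assuming a PyTorch
# model. Gaussian noise with sigma = noise_frac * max|W| of each layer is added
# to the conv/linear weights for the forward pass, emulating RRAM conductance
# relaxation and read noise; the update is applied to the clean weights.
import torch
import torch.nn as nn

def add_weight_noise(model: nn.Module, noise_frac: float = 0.1):
    """Perturb weights in place; return the clean copies for restoration."""
    saved = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            w = module.weight.data
            saved[name] = w.clone()
            w.add_(torch.randn_like(w) * (noise_frac * w.abs().max()))
    return saved

def restore_weights(model: nn.Module, saved: dict):
    for name, module in model.named_modules():
        if name in saved:
            module.weight.data.copy_(saved[name])

def train_step(model, batch, labels, optimizer, criterion, noise_frac=0.1):
    saved = add_weight_noise(model, noise_frac)   # noisy forward pass
    loss = criterion(model(batch), labels)
    optimizer.zero_grad()
    loss.backward()                               # gradients seen through noisy weights
    restore_weights(model, saved)                 # the step updates the clean weights
    optimizer.step()
    return loss.item()
```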
We find that injecting a higher noise during training than testing improves models’ noise resiliency. Extended Data Fig. 7a–c shows that the best test-time accuracy in the presence of 10% weight noise is obtained with 20% training-time noise injection for CIFAR-10 image classification, 15% for Google voice command classification and 35% for RBM-based image reconstruction.   For CIFAR-10, the better initial accuracy obtained by the model trained with 5% noise is most likely due to the regularization effect of noise injection. A similar phenomenon has been reported in neural-network quantization literature where a model trained with quantization occasionally outperforms a full-precision model54,55. In our experiments, we did not apply additional regularization on top of noise injection for models trained without noise, which might result in sub-optimal accuracy.   For RBM, Extended Data Fig. 7d further shows how reconstruction errors reduce with the number of Gibbs sampling steps for models trained with different noises. In general, models trained with higher noises converge faster during inference. The model trained with 20% noise reaches the lowest error at the end of 100 Gibbs sampling steps.   Extended Data Fig. 7e shows the effect of noise injection on weight distribution. Without noise injection, the weights have a Gaussian distribution. The neural-network outputs heavily depend on a small fraction of large weights, and thus become vulnerable to noise injection. With noise injection, the weights distribute more uniformly, making the model more noise resilient.   To efficiently implement the models on NeuRRAM, inputs to all convolutional and fully connected layers are quantized to 4-bit or below. The input bit-precisions of all the models are summarized in Table 1. We perform the quantized training using the parameterized clipping activation technique46. The accuracies of some of our quantized models are lower than that of the state-of-the-art quantized model because we apply <4-bit quantization to the most sensitive input and output layers of the neural networks, which have been reported to cause large accuracy degradation and are thus often excluded from low-precision quantization46,54. To obtain better accuracy for quantized models, one can use higher precision for sensitive input and output layers, apply more advanced quantization techniques, and use more optimized data preprocessing, data augmentation and regularization techniques during training. However, the focus of this work is to achieve comparable inference accuracy on hardware and on software while keeping all these variables the same, rather than to obtain state-of-the-art inference accuracy on all the tasks. The aforementioned quantization and training techniques will be equally beneficial for both our software baselines and hardware measurements.   During the progressive chip-in-the-loop fine-tuning, we use the chip-measured intermediate outputs from a layer to fine-tune the weights of the remaining layers. Importantly, to fairly evaluate the efficacy of the technique, we do not use the test-set data (for either training or selecting checkpoint) during the entire process of fine-tuning. To avoid over-fitting to a small fraction of data, measurements should be performed on the entire training-set data. We reduce the learning rate to 1/100 of the initial learning rate used for training the baseline model, and fine-tune for 30 epochs, although we observed that the accuracy generally plateaus within the first 10 epochs. 
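A minimal sketch of this progressive chip-in-the-loop fine-tuning loop is shown below, assuming a PyTorch model expressed as an ordered list of layers; measure_on_chip is a hypothetical stand-in for running the already-deployed layer on the NeuRRAM chip over the entire training set.

```python
# Minimal sketch of progressive chip-in-the-loop fine-tuning: chip-measured
# outputs of each deployed layer replace its simulated outputs, and the
# remaining (not-yet-deployed) layers are fine-tuned on them.
import torch
import torch.nn as nn

def progressive_finetune(layers, train_inputs, train_labels, measure_on_chip,
                         base_lr, criterion, epochs=30):
    chip_acts = train_inputs
    for k, layer in enumerate(layers):
        # Deploy layer k on the chip, then take its measured outputs over the
        # whole training set so downstream layers see the real hardware.
        chip_acts = measure_on_chip(layer_index=k, inputs=chip_acts)
        remaining = nn.Sequential(*layers[k + 1:])
        if len(remaining) == 0:
            break
        # Fine-tune the remaining layers at 1/100 of the original learning
        # rate; noise injection and input quantization (not shown) still apply.
        opt = torch.optim.SGD(remaining.parameters(), lr=base_lr / 100)
        for _ in range(epochs):
            opt.zero_grad()
            loss = criterion(remaining(chip_acts), train_labels)
            loss.backward()
            opt.step()
    return layers
```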
The same weight noise injection and input quantization are applied during the fine-tuning.   We use CNN models for the CIFAR-10 and MNIST image classification tasks. The CIFAR-10 dataset consists of 50,000 training images and 10,000 testing images belonging to 10 object classes. We perform image classification using the ResNet-2043, which contains 21 convolutional layers and 1 fully connected layer (Extended Data Fig. 9a), with batch normalizations and ReLU activations between the layers. The model is trained using the Keras framework. We quantize the input of all convolutional and fully connected layers to a 3-bit unsigned fixed-point format except for the first convolutional layer, where we quantize the input image to 4-bit because the inference accuracy is more sensitive to the input quantization. For the MNIST handwritten digits classification, we use a seven-layer CNN consisting of six convolutional layers and one fully connected layer, and use max-pooling between layers to down-sample feature map sizes. The inputs to all the layers, including the input image, are quantized to a 3-bit unsigned fixed-point format.   All the parameters of the CNNs are implemented on a single NeuRRAM chip including those of the convolutional layers, the fully connected layers and the batch normalization. Other operations such as partial-sum accumulation and average pooling are implemented on an FPGA integrated on the same board as the NeuRRAM. These operations amount to only a small fraction of the total computation and integrating their implementation in digital CMOS would incur negligible overhead; the FPGA implementation was chosen to provide greater flexibility during test and development.   Extended Data Fig. 9a–c illustrates the process to map a convolutional layer on a chip. To implement the weights of a four-dimensional convolutional layer with dimension H (height), W (width), I (number of input channels), O (number of output channels) on two-dimensional RRAM arrays, we flatten the first three dimensions into a one-dimensional vector, and append the bias term of each output channel to each vector. If the range of the bias values is B times of the weight range, we evenly divide the bias values and implement them using B rows. Furthermore, we merge the batch normalization parameters into convolutional weights and biases after training (Extended Data Fig. 9b), and program the merged Wʹ and bʹ onto RRAM arrays such that no explicit batch normalization needs to be performed during inference.   Under the differential-row weight-mapping scheme, the parameters of a convolutional layer are converted into a conductance matrix of size (2(HWI + B), O). If the conductance matrix fits into a single core, an input vector is applied to 2(HWI + B) rows and broadcast to O columns in a single cycle. HWIO multiply–accumulate (MAC) operations are performed in parallel. Most ResNet-20 convolutional layers have a conductance matrix height of 2(HWI + B) that is greater than the RRAM array length of 256. We therefore split them vertically into multiple segments, and map the segments either onto different cores that are accessed in parallel, or onto different columns within a core that are accessed sequentially. The details of the weight-mapping strategies are described in the next section.   The Google speech command dataset consists of 65,000 1-s-long audio recordings of voice commands, such as ‘yes’, ‘up’, ‘on’, ‘stop’ and so on, spoken by thousands of different people. The commands are categorized into 12 classes. 
Extended Data Fig. 9d illustrates the model architecture. We use the Mel-frequency cepstral coefficient encoding approach to encode every 40-ms piece of audio into a length-40 vector. With a hop length of 20 ms, we have a time series of 50 steps for each 1-s recording.   We build a model that contains four parallel LSTM cells. Each cell has a hidden state of length 112. The final classification is based on summation of outputs from the four cells. Compared with a single-cell model, the 4-cell model reduces the classification error (of an unquantized model) from 10.13% to 9.28% by leveraging additional cores on the NeuRRAM chip. Within a cell, in each time step, we compute the values of four LSTM gates (input, activation, forget and output) based on the inputs from the current step and hidden states from the previous step. We then perform element-wise operations between the four gates to compute the new hidden-state value. The final logit outputs are calculated based on the hidden states of the final time step.   Each LSTM cell has 3 weight matrices that are implemented on the chip: an input-to-hidden-state matrix with size 40 × 448, a hidden-state-to-hidden-state matrix with size 112 × 448 and a hidden-state-to-logits matrix with size 112 × 12. The element-wise operations are implemented on the FPGA. The model is trained using the PyTorch framework. The inputs to all the MVMs are quantized to 4-bit signed fixed-point formats. All the remaining operations are quantized to 8-bit.   An RBM is a type of generative probabilistic graphical model. Instead of being trained to perform discriminative tasks such as classification, it learns the statistical structure of the data itself. Extended Data Fig. 9e shows the architecture of our image-recovery RBM. The model consists of 794 fully connected visible neurons, corresponding to 784 image pixels plus 10 one-hot encoded class labels, and 120 hidden neurons. We train the RBM using the contrastive divergence learning procedure in software.   During inference, we send 3-bit images with partially corrupted or blocked pixels to the model running on a NeuRRAM chip. The model then performs back-and-forth MVMs and Gibbs sampling between visible and hidden neurons for ten cycles. In each cycle, neurons sample binary states h and v from the MVM outputs based on the probability distributions P(hi = 1 | v) = σ(Σj wij·vj + ai) and P(vj = 1 | h) = σ(Σi wij·hi + bj), where σ is the sigmoid function, ai is a bias for hidden neurons (h) and bj is a bias for visible neurons (v). After sampling, we reset the uncorrupted pixels (visible neurons) to the original pixel values. The final inference performance is evaluated by computing the average L2-reconstruction error between the original image and the recovered image. Extended Data Fig. 10 shows some examples of the measured image recovery.   When mapping the 794 × 120 weight matrix to multiple cores of the chip, we try to make the MVM output dynamic range of each core relatively consistent such that the recovery performance will not overly rely on the computational accuracy of any single core. To achieve this, we assign adjacent pixels (visible neurons) to different cores such that every core sees a down-sampled version of the whole image, as shown in Extended Data Fig. 9f. Utilizing the bidirectional MVM functionality of the TNSA, the visible-to-hidden neuron MVM is performed from the SL-to-BL direction in each core; the hidden-to-visible neuron MVM is performed from the BL-to-SL direction.
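The Gibbs-sampling inference loop described above can be summarized by the following NumPy sketch; the array shapes follow the 794 × 120 RBM, while the function and variable names are ours.

```python
# Minimal sketch of the RBM image-recovery inference loop. W (794 x 120),
# a (120,) and b (794,) are the trained weights and biases; `corrupted` holds
# the (normalized) 3-bit pixel values plus label units flattened to length 794,
# and `known` is a boolean mask of the uncorrupted visible units.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recover(W, a, b, corrupted, known, n_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    v = corrupted.copy()
    for _ in range(n_steps):
        # Visible-to-hidden MVM (on chip: SL-to-BL direction), then sampling.
        p_h = sigmoid(v @ W + a)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        # Hidden-to-visible MVM (on chip: BL-to-SL direction), then sampling.
        p_v = sigmoid(h @ W.T + b)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        # Clamp the uncorrupted pixels back to their original values.
        v[known] = corrupted[known]
    return v

# Recovery quality is scored as the average L2 error between the original
# image and the recovered visible units.
```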
To implement an AI model on a NeuRRAM chip, we convert the weights, biases and other relevant parameters (for example, batch normalization) of each model layer into a single two-dimensional conductance matrix as described in the previous section. If the height or the width of a matrix exceed the RRAM array size of a single CIM core (256 × 256), we split the matrix into multiple smaller conductance matrices, each with maximum height and width of 256.   We consider three factors when mapping these conductance matrices onto the 48 cores: resource utilization, computational load balancing and voltage drop. The top priority is to ensure that all conductance matrices of a model are mapped onto a single chip such that no re-programming is needed during inference. If the total number of conductance matrices does not exceed 48, we can map each matrix onto a single core (case (1) in Fig. 2a) or multiple cores. There are two scenarios when we map a single matrix onto multiple cores. (1) When a model has different computational intensities, defined as the amount of computation per weights, for different layers, for example, CNNs often have higher computational intensity for earlier layers owing to larger feature map dimensions, we duplicate the more computationally intensive matrices to multiple cores and operate them in parallel to increase throughput and balance the computational loads across the layers (case (2) in Fig. 2a). (2) Some models have ‘wide’ conductance matrices (output dimension >128), such as our image-recovery RBM. If mapping the entire matrix onto a single core, each input driver needs to supply large current for its connecting RRAMs, resulting in a significant voltage drop on the driver, deteriorating inference accuracy. Therefore, when there are spare cores, we can split the matrix vertically into multiple segments and map them onto different cores to mitigate the voltage drop (case (6) in Fig. 2a).   By contrast, if a model has more than 48 conductance matrices, we need to merge some matrices so that they can fit onto a single chip. The smaller matrices are merged diagonally such that they can be accessed in parallel (case (3) in Fig. 2a). The bigger matrices are merged horizontally and accessed by time-multiplexing input rows (case (4) in Fig. 2a). When selecting the matrices to merge, we want to avoid the matrices that belong to the same two categories described in the previous paragraph: (1) those that have high computational intensity (for example, early layers of ResNet-20) to minimize impact on throughput; and (2) those with ‘wide’ output dimension (for example, late layers of ResNet-20 have large number of output channels) to avoid a large voltage drop. For instance, in our ResNet-20 implementation, among a total of 61 conductance matrices (Extended Data Fig. 9a: 1 from input layer, 12 from block 1, 17 from block 2, 28 from block 3, 2 from shortcut layers and 1 from final dense layer), we map each of the conductance matrices in blocks 1 and 3 onto a single core, and merge the remaining matrices to occupy the 8 remaining cores.   Table 1 summarizes core usage for all the models. It is noted that for partially occupied cores, unused RRAM cells are either unformed or programmed to high resistance state; WLs of unused rows are not activated during inference. Therefore, they do not consume additional energy during inference.   Extended Data Fig. 11a shows the hardware test system for the NeuRRAM chip. 
The NeuRRAM chip is configured by, receives inputs from and sends outputs to a Xilinx Spartan-6 FPGA that sits on an Opal Kelly integrated FPGA board. The FPGA communicates with the PC via a USB 3.0 module. The test board also houses voltage DACs that provide various bias voltages required by RRAM programming and MVM, and ADCs to measure RRAM conductance during the write–verify programming. The power of the entire board is supplied by a standard ‘cannon style’ d.c. power connector and integrated switching regulators on the Opal Kelly board such that no external lab equipment is needed for the chip operation.   To enable fast implementation of various machine-learning applications on the NeuRRAM chip, we developed a software toolchain that provides Python-based application programming interfaces (APIs) at various levels. The low-level APIs provide access to basic operations of each chip module such as RRAM read and write and neuron analogue-to-digital conversion; the middle-level APIs include essential operations required for implementing neural-network layers such as the multi-core parallel MVMs with configurable bit-precision and RRAM write–verify programming; the high-level APIs integrate various middle-level modules to provide complete implementations of neural-network layers, such as weight mapping and batch inference of convolutional and fully connected layers. The software toolchain aims to allow software developers who are not familiar with the NeuRRAM chip design to deploy their machine-learning models on the NeuRRAM chip.   To characterize MVM energy efficiency at various input and output bit-precisions, we measure the power consumption and latency of the MVM input and output stages separately. The total energy consumption and the total time are the sum of input and output stages because the two stages are performed independently as described in the above sections. As a result, we can easily obtain the energy efficiency of any combinations of input and output bit-precisions.   To measure the input-stage energy efficiency, we generate a 256 × 256 random weight matrix with Gaussian distribution, split it into 2 segments, each with dimension 128 × 256, and program the two segments to two cores using the differential-row weight mapping. We measure the power consumption and latency for performing 10 million MVMs, or equivalently 655 billion MAC operations. The comparison with previous work shown in Fig. 1d uses the same workload as benchmark.   Extended Data Fig. 12a shows the energy per operation consumed during the input and the output stages of MVMs under various bit-precisions. The inputs are in the signed integer format, where the first bit represents the sign, and the other bits represent the magnitude. One-bit (binary) and two-bit (ternary) show similar energy because each input wire is driven to one of three voltage levels. Binary input is therefore just a special case for ternary input. It is noted that the curve shown in Extended Data Fig. 12a is obtained without the two-phase operation. As a result, we see a super-linear increase of energy as input bit-precision increases. Similar to the inputs, the outputs are also represented in the signed integer format. The output-stage energy consumption grows linearly with output bit-precision because one additional binary search cycle is needed for every additional bit. 
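Because the two stages are characterized independently, combining them for any pair of input and output bit-precisions is simple bookkeeping. The sketch below shows the idea with placeholder stage measurements; only the workload size comes from the text, and the 2-operations-per-MAC convention is an assumption.

```python
# Sketch of the energy-efficiency bookkeeping for the 256 x 256 MVM benchmark.
# The stage energies/latencies below are placeholders, not measured data.
N_MVMS = 10_000_000
MACS_PER_MVM = 256 * 256
TOTAL_MACS = N_MVMS * MACS_PER_MVM          # = 6.5536e11, the 655 billion MACs above

def efficiency(e_input_j, e_output_j, t_input_s, t_output_s):
    """Combine independently measured input-stage and output-stage totals."""
    e_total = e_input_j + e_output_j        # stages run independently, so energies add
    t_total = t_input_s + t_output_s        # and so do the latencies
    energy_per_op = e_total / (2 * TOTAL_MACS)    # assuming 1 MAC = 2 operations
    tops = 2 * TOTAL_MACS / t_total / 1e12        # peak throughput
    tops_per_w = 2 * TOTAL_MACS / e_total / 1e12  # throughput-power efficiency
    return energy_per_op, tops, tops_per_w

# Placeholder example for one input/output bit-precision combination.
print(efficiency(e_input_j=0.5, e_output_j=0.2, t_input_s=1.0, t_output_s=0.5))
```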
The output stage consumes less energy than the input stage because it does not involve toggling highly capacitive WLs that are driven at a higher voltage, as we discuss below.   For the MVM measurements shown in Extended Data Fig. 12b–e, the MVM output stage is assumed to use 2-bit-higher precision than inputs to account for the additional bit-precision required for partial-sum accumulations. The required partial-sum bit-precision for the voltage-mode sensing implemented by NeuRRAM is much lower than that required by the conventional current-mode sensing. As explained before, conventional current-sensing designs can only activate a fraction of rows each cycle, and therefore need many partial-sum accumulation steps to complete an MVM. In contrast, the proposed voltage-sensing scheme can activate all the 256 input wires in a single cycle, and therefore requires less partial-sum accumulation steps and lower partial-sum precisions.   Extended Data Fig. 12b shows the energy consumption breakdown. A large fraction of energy is spent in switching on and off the WLs that connect to gates of select transistors of RRAM devices. These transistors use thick-oxide I/O transistors to withstand high-voltage during RRAM forming and programming. They are sized large enough (width 1 µm and length 500 nm) to provide sufficient current for RRAM programming. As a result, they require high operating voltages and add large capacitance to the WLs, both contributing to high power consumption (P = fCV2, where f is the frequency at which the capacitance is charged and discharged). Simulation shows that each of the 256 access transistors contributes about 1.5 fF to a WL; WL drivers combined contribute about 48 fF to each WL; additional WL capacitance is mostly from the inter-wire capacitance from neighbouring BLs and WLs. The WL energy is expected to decrease significantly if RRAMs can be written by a lower voltage and have a lower conductance state, and if a smaller transistor with better drivability can be used.   For applications that require probabilistic sampling, the two counter-propagating LFSR chains generate random Bernoulli noises and inject the noises as voltage pulses into neurons. We measure each noise-injection step to consume on average 121 fJ per neuron, or 0.95 fJ per weight, which is small compared with other sources of energy consumption shown in Extended Data Fig. 12b.   Extended Data Fig. 12c–e shows the measured latency, peak throughput and throughput-power efficiency for performing the 256 × 256 MVMs. It is noted that we used EDP as a figure of merit for comparing designs rather than throughput-power efficiency as tera-operations per second per watt (TOPS W−1, reciprocal of energy per operation), because it captures the time-to-solution aspect in addition to energy consumption. Similar to previous work in this field, the reported throughput and energy efficiency represent their peak values when the CIM array utilization is 100%, and does not include time and energy spent at buffering and moving intermediate data. Future work that integrates intermediate data buffers, partial-sum accumulators and so on within a single complete CIM chip should show energy efficiency measured on end-to-end AI applications.   The current NeuRRAM chip is fabricated using a 130-nm CMOS technology. We expect the energy efficiency to improve with the technology scaling. Importantly, isolated scaling of CMOS transistors and interconnects is not sufficient for the overall energy-efficiency improvement. 
RRAM device characteristics must be optimized jointly with CMOS. The current RRAM array density under a 1T1R configuration is limited not by the fabrication process but by the RRAM write current and voltage. The current NeuRRAM chip uses large thick-oxide I/O transistors as the 'T' to withstand the >4-V RRAM forming voltage and to provide sufficient write current. Only by lowering the forming voltage and the write current can we achieve higher density and, in turn, lower parasitic capacitance for better energy efficiency.

Assuming that RRAM devices at newer technology nodes can be programmed at logic-compatible voltage levels, and that the required write current can be lowered so that the access-transistor size keeps shrinking, the EDP improvement will come from (1) lower operating voltages and (2) smaller wire and transistor parasitic capacitances, which reduce both the energy (CV^2) and the delay (CV/I). For example, at 7 nm we expect the WL switching energy (Extended Data Fig. 12b) to decrease by about 22.4 times, which includes a factor of 2.6 from WL voltage scaling (1.3 V → 0.8 V) and a factor of 8.5 from capacitance scaling (the capacitance of the select transistors, WL drivers and wires scales with the minimum metal pitch, 340 nm in the current process). The peripheral-circuit energy, dominated by the neuron readout process, is expected to decrease by 42 times, including a factor of 5 from VDD scaling (1.8 V → 0.8 V) and a factor of 8.5 from the smaller parasitic capacitance. The energy consumed by the MVM pulsing and charge-transfer processes is independent of the RRAM conductance range, because the power consumption and the settling time of the RRAM array scale with the conductance in ways that cancel in their product. Specifically, the energy per RRAM MAC is E_MAC = C_par·Var(V_in), limited only by the parasitic capacitance C_par of a unit RRAM cell and the variance Var(V_in) of the driven input voltages. The MVM energy consumption will therefore decrease by about 34 times, including a factor of 4 from Vread scaling (0.5 V → 0.25 V) and a factor of 8.5 from the smaller parasitic capacitance. Overall, we expect the energy consumption to decrease by about 34 times when scaling the design from 130 nm to 7 nm.

In terms of latency, the current design is limited by the long neuron integration time, which is mainly caused by the relatively large integration capacitor (104 fF), sized conservatively to ensure functional correctness and to allow testing of different neuron operating conditions. At more advanced technology nodes, one can use a smaller capacitor to achieve higher speed. The main concern with scaling down the capacitor is that fabrication-induced capacitor mismatch would account for a larger fraction of the total capacitance, resulting in a lower SNR. However, previous ADC designs have used unit capacitors as small as 50 aF (ref. 56; 340 times smaller than our Csample). For a more conservative estimate, one study showed that a 0.45-fF unit capacitor in a 32-nm process has an average standard deviation of only 1.2%. Moreover, the integration time also depends on the transistor drive current. Assuming that the transistor current density (μA μm−1) stays relatively unchanged after VDD scaling, and that the transistor widths in the neurons scale with the contacted gate pitch (310 nm → 57 nm), the total transistor drive current will decrease by 5.4 times. As a result, the latency will improve by 15.7 times when Csample is scaled from 17 fF to 0.2 fF and Cinteg from 104 fF to 1.22 fF. Conservatively, we therefore expect at least a 535-fold overall EDP improvement when scaling the design from a 130-nm to a 7-nm technology. Extended Data Table 2 shows that such scaling would enable NeuRRAM to deliver higher energy and area efficiency than today's state-of-the-art edge inference accelerators58,59,60,61.
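The scaling estimates above are products of a few ratio factors; the short script below reproduces that arithmetic, with every factor taken from the text (small rounding differences from the quoted figures are expected).

```python
# Worked arithmetic behind the 130-nm -> 7-nm scaling estimates quoted above.
# Every factor comes from the text; the script only multiplies them out.

cap_scale = 8.5                                   # parasitic-capacitance reduction

wl_energy = (1.3 / 0.8) ** 2 * cap_scale          # ~2.6 x 8.5 ~= 22.4x (WL switching)
peripheral_energy = (1.8 / 0.8) ** 2 * cap_scale  # ~5 x 8.5   ~= 42x  (neuron readout)
mvm_energy = (0.5 / 0.25) ** 2 * cap_scale        #  4 x 8.5    = 34x  (MVM pulses)

energy_improvement = 34.0                         # overall energy reduction quoted above

cap_ratio = 104 / 1.22                            # Cinteg scaling, ~85x less capacitance
drive_current_drop = 310 / 57                     # ~5.4x less transistor drive current
latency_improvement = cap_ratio / drive_current_drop   # ~15.7x faster integration

edp_improvement = energy_improvement * latency_improvement   # ~535x overall EDP

print(f"WL: {wl_energy:.1f}x, peripheral: {peripheral_energy:.1f}x, MVM: {mvm_energy:.1f}x")
print(f"latency: {latency_improvement:.1f}x, EDP: {edp_improvement:.0f}x")
```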
