WO2021168644A1 - Data processing apparatus, electronic device, and data processing method - Google Patents


Info

Publication number
WO2021168644A1
Authority
WO
WIPO (PCT)
Prior art keywords
result
output
bits
weight value
input
Prior art date
Application number
PCT/CN2020/076556
Other languages
French (fr)
Chinese (zh)
Inventor
杨康 (Yang Kang)
韩峰 (Han Feng)
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2020/076556
Priority to CN202080004607.0A (published as CN112639836A)
Publication of WO2021168644A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

A data processing apparatus, an electronic device, and a data processing method. The apparatus comprises: an input module (1), used for obtaining an input feature value matrix and an n-bit or 2n-bit weight value matrix; a calculation module (2), used for performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and an output module (3), used for outputting the output feature value matrix, wherein n is a positive integer. The present invention can perform convolution operations on data of two bit widths, improving the accuracy of deep convolutional neural networks and adapting to the design requirements of different deep convolutional neural networks.

Description

Data processing device, electronic device, and data processing method
Technical field
The embodiments of the present invention relate to the field of data processing technology, and in particular to a data processing device, an electronic device, and a data processing method.
Background
A deep convolutional neural network is a machine learning algorithm that is widely used in computer vision tasks such as object recognition, object detection, and semantic image segmentation.
Most of the computation in a deep convolutional neural network consists of convolution operations, so a dedicated hardware circuit that accelerates the convolutions of the convolutional layers can greatly reduce the network's computation time. The operands of existing convolution devices support fixed-point numbers of only one width, for example 8-bit fixed-point numbers. Such devices therefore cannot process the data of deep convolutional neural networks that require higher precision, and they struggle to meet the ever-increasing accuracy requirements of deep convolutional neural network designs.
Summary of the invention
The embodiments of the present invention provide a data processing device, an electronic device, and a data processing method, to solve the technical problem that prior-art convolution devices cannot meet the accuracy requirements of deep convolutional neural networks.
A first aspect of the embodiments of the present invention provides a data processing device, including:
an input module, configured to obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix;
a calculation module, configured to convolve the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and
an output module, configured to output the output feature value matrix;
where n is a positive integer.
A second aspect of the embodiments of the present invention provides an electronic device including the data processing device according to the first aspect.
A third aspect of the embodiments of the present invention provides a data processing method, including:
obtaining an input feature value matrix and an n-bit or 2n-bit weight value matrix;
convolving the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and
outputting the output feature value matrix;
where n is a positive integer.
The data processing device, electronic device, and data processing method provided by the embodiments of the present invention can perform convolution on data of two lengths, improving the accuracy of deep convolutional neural networks and adapting to the design requirements of different deep convolutional neural networks.
Brief description of the drawings
The drawings described here are provided for a further understanding of the present invention and constitute a part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an undue limitation of it. In the drawings:
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the convolution operation in the application scenario shown in Fig. 1;
Fig. 3 is a schematic structural diagram of a data processing device according to Embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the principle of the convolution operation performed by a data processing device according to Embodiment 1 of the present invention;
Fig. 5 is a schematic structural diagram of a data processing device according to Embodiment 2 of the present invention;
Fig. 6 is a schematic structural diagram of a systolic unit in a data processing device according to Embodiment 3 of the present invention;
Fig. 7 is a schematic structural diagram of an accumulator in a data processing device according to Embodiment 3 of the present invention;
Fig. 8 is a schematic diagram of the convolution of n-bit data performed by the data processing device according to Embodiment 3 of the present invention;
Fig. 9 is a schematic diagram of the convolution of 2n-bit data performed by the data processing device according to Embodiment 3 of the present invention;
Fig. 10 is a schematic structural diagram of a data processing device according to Embodiment 4 of the present invention;
Fig. 11 is a schematic diagram of the storage format used when a data processing device according to Embodiment 4 of the present invention stores n-bit data;
Fig. 12 is a schematic diagram of the storage format used when a data processing device according to Embodiment 4 of the present invention stores 2n-bit data;
Fig. 13 is a schematic flowchart of a data processing method according to Embodiment 5 of the present invention.
Reference signs:
1 - input module
2 - calculation module
3 - output module
4 - memory
11 - weight value loading module
12 - input feature value loading module
21 - systolic unit
22 - accumulator
23 - control unit
24 - weight value injection unit
25 - input feature value injection unit
26 - result output unit
27 - result storage unit
211 - weight value register
212 - input feature value register
213 - multiplication circuit
214 - addition circuit
215 - weight value shift register
216 - input feature value shift register
217 - multiplication result register
221 - multiply-accumulate result register
222 - previous multiply-accumulate result register
223 - vertical addition circuit
224 - first-stage addition circuit
225 - filter circuit
226 - accumulator result register
227 - sum register
228 - delay circuit
229 - second-stage addition circuit
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of the present invention. The terminology used in this specification is for the purpose of describing specific embodiments only and is not intended to limit the present invention.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention. The data processing device and data processing method provided by the embodiments of the present invention can be applied to any scenario that requires convolution operations, such as a deep convolutional neural network.
As shown in Fig. 1, a deep convolutional neural network to which the embodiments can be applied includes an input layer, an output layer, and hidden layers. Each layer of the network in Fig. 1 has one input and one output; in an actual deep convolutional neural network, each layer may have multiple inputs or multiple outputs.
The hidden layers of a deep convolutional neural network consist of a set of cascaded feature maps and operations. Hidden-layer operations include convolution, pooling, activation, and so on. The feature map of a hidden layer is produced by applying these operations to the feature map of the previous layer. In general, the layers of a convolutional neural network are named after the type of operation they perform: for example, a layer that performs convolution is called a convolutional layer, and a layer that performs pooling is called a pooling layer.
The convolution operation of a convolutional layer works as follows: a set of weight values is used to compute vector inner products with a set of input feature maps, producing a set of output feature maps. These weight values are also called a filter or a convolution kernel.
The weight values, the input feature maps, and the output feature maps can all be represented as multi-dimensional matrices. An input feature map can be represented as an input feature value matrix, whose elements are called input feature values; an output feature map can be represented as an output feature value matrix, whose elements are called output feature values.
Fig. 2 is a schematic diagram of the convolution operation in the application scenario shown in Fig. 1. As shown in Fig. 2, convolving an R×R×N weight value matrix with an H×H×N input feature value matrix yields an E×E×N output feature value matrix. Each output feature value in the output feature value matrix is obtained as the inner product of a subset of the input feature values in the input feature value matrix with the weight values of the weight value matrix.
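Stated as a formula, as a sketch only: assume stride 1, no padding, and an inner product taken over the full R×R×N window, with indices counted from 0 as in Fig. 4. The text does not state these conditions for Fig. 2, although stride 1 and no padding agree with the 3×5 input, 2×2 weight value matrix, and 2×4 output of Fig. 4 when the size relation is applied to each spatial dimension separately. Under those assumptions, the output side length and one output feature value computed with one weight value matrix would be:

$$E = H - R + 1, \qquad Y_{e_1 e_2} = \sum_{c=0}^{N-1} \sum_{r_1=0}^{R-1} \sum_{r_2=0}^{R-1} X_{(e_1+r_1)(e_2+r_2)c}\, W_{r_1 r_2 c}, \qquad 0 \le e_1, e_2 \le E-1$$

With N = 1 and R = 2, the triple sum reduces to the four-term inner product given for Fig. 4 in Embodiment 1.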
The technical solutions provided by the embodiments of the present invention support convolution operations on n-bit or 2n-bit data. They are described below with reference to the accompanying drawings.
Embodiment 1
Embodiment 1 of the present invention provides a data processing device. Fig. 3 is a schematic structural diagram of a data processing device according to Embodiment 1 of the present invention. As shown in Fig. 3, the data processing device in this embodiment may include:
an input module 1, configured to obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix, where n is a positive integer;
a calculation module 2, configured to convolve the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and
an output module 3, configured to output the output feature value matrix.
Specifically, the input module 1 may be connected to a memory or other modules to obtain the input feature value matrix and the weight value matrix to be convolved. Optionally, a connection in the embodiments of the present invention may be a physical connection or a communication connection.
The weight value matrix may be an n-bit weight value matrix or a 2n-bit weight value matrix, meaning that each weight value in the matrix is n bits or 2n bits long, respectively. Optionally, the input feature values may have the same length as the weight values: when the weight value matrix is 2n-bit, the input feature value matrix may also be 2n-bit. This allows the input feature value matrix and the weight value matrix to be convolved directly, improving computational efficiency and accuracy.
The calculation module 2 may be connected to the input module 1 to obtain the input feature value matrix and the weight value matrix and convolve them. Specifically, a subset of the input feature values in the input feature value matrix is multiplied by the corresponding weight values in the weight value matrix, and the products are accumulated to obtain the corresponding output feature value.
Fig. 4 is a schematic diagram of the principle of the convolution operation performed by a data processing device according to Embodiment 1 of the present invention. As shown in Fig. 4, the input feature value matrix is:
$$\begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} & X_{04} \\ X_{10} & X_{11} & X_{12} & X_{13} & X_{14} \\ X_{20} & X_{21} & X_{22} & X_{23} & X_{24} \end{bmatrix}$$
The weight value matrix is:
$$\begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix}$$
The output feature value matrix is:
$$\begin{bmatrix} Y_{00} & Y_{01} & Y_{02} & Y_{03} \\ Y_{10} & Y_{11} & Y_{12} & Y_{13} \end{bmatrix}$$
Here, $X_{ij}$ is the input feature value in row $i$, column $j$ of the input feature value matrix, $W_{ij}$ is the weight value in row $i$, column $j$ of the weight value matrix, and $Y_{ij}$ is the output feature value in row $i$, column $j$ of the output feature value matrix, with rows and columns counted from 0. The weight value matrix contains 2×2 weight values. It slides over every 2×2 window of the input feature value matrix and computes the inner product with that window to obtain the corresponding output feature value, namely:
$$Y_{ij} = X_{ij}W_{00} + X_{i(j+1)}W_{01} + X_{(i+1)j}W_{10} + X_{(i+1)(j+1)}W_{11}$$
As shown in Fig. 4, the weight value matrix is first applied to the 2×2 window outlined by a thick frame in the upper-left corner of the input feature value matrix, giving the output feature value $Y_{00} = X_{00}W_{00} + X_{01}W_{01} + X_{10}W_{10} + X_{11}W_{11}$. The frame then moves one column to the right, and applying the weight value matrix to the next 2×2 window gives $Y_{01} = X_{01}W_{00} + X_{02}W_{01} + X_{11}W_{10} + X_{12}W_{11}$. Continuing in this way until every 2×2 window has been visited yields all the output feature values.
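The sliding-window walk-through of Fig. 4 can be reproduced in a few lines of Python. This is a minimal sketch for illustration, not the device's implementation: `conv2d_valid` is a hypothetical name, the function assumes stride 1 and no padding (as in Fig. 4), and the numeric values are made up. Like the formula for $Y_{ij}$, it does not flip the weight value matrix.

```python
def conv2d_valid(x, w):
    """Slide weight matrix w over input matrix x (stride 1, no padding)
    and return the matrix of window inner products, as in Fig. 4."""
    rows_x, cols_x = len(x), len(x[0])
    rows_w, cols_w = len(w), len(w[0])
    out_rows, out_cols = rows_x - rows_w + 1, cols_x - cols_w + 1
    y = [[0] * out_cols for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            # Inner product of w with the window whose top-left element is x[i][j]
            y[i][j] = sum(x[i + r][j + c] * w[r][c]
                          for r in range(rows_w) for c in range(cols_w))
    return y


# 3x5 input and 2x2 weights, the same shapes as Fig. 4 (illustrative values)
x = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15]]
w = [[1, 2],
     [3, 4]]
print(conv2d_valid(x, w))  # [[51, 61, 71, 81], [101, 111, 121, 131]]
```

Each step of the window to the right adds 10 in this example, because every input value in the window grows by 1 and the weights sum to 10; each step down adds 50.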
Each matrix shown in Fig. 4 is two-dimensional. In practical applications, the weight value matrix, the input feature value matrix, and the output feature value matrix may each be two- or three-dimensional. The principle of convolving three-dimensional matrices is similar to that for two-dimensional matrices and is not repeated here.
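For the three-dimensional case, one common reading, assumed here rather than stated in the text, is that the input feature value matrix and the weight value matrix share a depth N and each output feature value is the inner product over the full R1×R2×N window. The hypothetical function below sketches that reading using nested Python lists indexed as `[row][column][channel]`, with stride 1 and no padding.

```python
def conv3d_single_filter(x, w):
    """x: H1 x H2 x N input feature values; w: R1 x R2 x N weight values.
    Returns the (H1-R1+1) x (H2-R2+1) output of one filter, stride 1, no padding."""
    h1, h2, n = len(x), len(x[0]), len(x[0][0])
    r1, r2 = len(w), len(w[0])
    if len(w[0][0]) != n:
        raise ValueError("weight depth must equal input depth")
    # Each output value sums products over rows, columns, and all N channels
    return [[sum(x[i + a][j + b][c] * w[a][b][c]
                 for a in range(r1) for b in range(r2) for c in range(n))
             for j in range(h2 - r2 + 1)]
            for i in range(h1 - r1 + 1)]


# 3x3x2 input: channel 0 holds 0..8, channel 1 is all ones (illustrative values)
x = [[[i * 3 + j, 1] for j in range(3)] for i in range(3)]
# 2x2x2 weights: 1 on channel 0, 10 on channel 1
w = [[[1, 10] for _ in range(2)] for _ in range(2)]
print(conv3d_single_filter(x, w))  # [[48, 52], [60, 64]]
```

Each output here is the channel-0 window sum plus 4 × 10 from channel 1, which is the two-dimensional computation repeated per channel and summed.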
If the obtained input feature value matrix and weight value matrix are n-bit, the output feature values in the output feature value matrix may also be n bits long; if they are 2n-bit, the output feature values may also be 2n bits long.
The output module 3 may be connected to the calculation module 2 to obtain the output feature value matrix computed by the calculation module 2 and output it. The output can take many forms: for example, the output feature value matrix may be displayed to a user, or passed to the next convolutional layer for the next level of convolution.
In practical applications, the device can support convolution of data of two lengths, n bits and 2n bits, for example 8-bit and 16-bit fixed-point numbers. When a convolutional neural network is quantized to n-bit fixed-point numbers and the network accuracy still meets the design requirements, the device can perform convolution with n-bit fixed-point numbers, so that the same hardware resources provide a higher degree of convolution parallelism. If quantizing to n-bit fixed-point numbers causes a large accuracy loss that does not meet the design requirements, the device can switch to performing convolution with 2n-bit fixed-point numbers, and the network can likewise be quantized to 2n-bit fixed-point numbers, reducing the accuracy loss caused by quantization.
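The text does not specify how the choice between n-bit and 2n-bit operation is made. The sketch below is a hypothetical policy for illustration only: it quantizes values to signed fixed point, assumes they lie in [-1, 1) so that one bit is the sign and the rest are fractional, and uses the largest rounding error as a stand-in for the network-level accuracy loss the text refers to. The function names and the `tolerance` parameter are assumptions.

```python
def quantize(value, total_bits, frac_bits):
    """Round value to signed fixed point (total_bits wide, frac_bits fractional),
    saturate to the representable range, and return the dequantized value."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(value * scale)))
    return q / scale


def max_error(values, total_bits):
    # Assumes values in [-1, 1): one sign bit, all remaining bits fractional
    return max(abs(v - quantize(v, total_bits, total_bits - 1)) for v in values)


def choose_width(values, n, tolerance):
    """Use n bits if the worst-case error is within tolerance, otherwise 2n bits."""
    return n if max_error(values, n) <= tolerance else 2 * n


weights = [0.1234, -0.5678, 0.9, -0.0042]
print(choose_width(weights, 8, 0.01))   # 8: the 8-bit error is at most 1/256
print(choose_width(weights, 8, 0.001))  # 16
```

A real deployment would more likely measure accuracy on validation data; the worst-case rounding error is used here only because it keeps the example self-contained.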
The data processing device provided in this embodiment includes the input module 1, the calculation module 2, and the output module 3. The input module 1 obtains an n-bit or 2n-bit weight value matrix and an input feature value matrix; the calculation module 2 convolves the obtained weight value matrix with the input feature value matrix to obtain an n-bit or 2n-bit output feature value matrix; and the output module 3 outputs that matrix. The device thus supports convolution of data of two lengths: when higher precision is required, 2n-bit data can be used for the convolution, improving the accuracy of the deep convolutional neural network and adapting to the design requirements of different deep convolutional neural networks.
Embodiment 2
Embodiment 2 of the present invention provides a data processing device. Building on the technical solution of the foregoing embodiment, this embodiment implements the convolution operation with a systolic array, an accumulator array, and related components. Fig. 5 is a schematic structural diagram of a data processing device according to Embodiment 2 of the present invention. As shown in Fig. 5, the data processing device in this embodiment may include:
an input module, configured to obtain an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix. The input module may specifically include a weight value loading module 11, configured to obtain the n-bit or 2n-bit weight value matrix, and an input feature value loading module 12, configured to obtain the n-bit or 2n-bit input feature value matrix;
a calculation module 2, configured to convolve the input feature value matrix with the weight value matrix to obtain an output feature value matrix; and
an output module 3, configured to output the output feature value matrix.
The calculation module 2 may include:
a systolic array, configured to multiply the n-bit or 2n-bit weight values in the weight value matrix by the corresponding input feature values and accumulate the products; and
an accumulator array, configured to compute the output feature value matrix from the multiply-accumulate results produced by the systolic array.
Specifically, the systolic array may compute a multiply-accumulate result for each column of weight values in the weight value matrix, and the accumulator array adds the results for all columns to obtain an output feature value. Alternatively, the systolic array may compute a multiply-accumulate result for each row of weight values, and the accumulator array adds the results for all rows to obtain an output feature value.
Take the matrices shown in Fig. 4 as an example, and consider computing the output for the thick-framed window in the upper-left corner. The systolic array may compute a multiply-accumulate result for each column of weight values and the corresponding input feature values. The first column of weight values consists of $W_{00}$ and $W_{10}$; multiplying them by the corresponding input feature values and accumulating gives $X_{00}W_{00} + X_{10}W_{10}$. The second column consists of $W_{01}$ and $W_{11}$, giving $X_{01}W_{01} + X_{11}W_{11}$. The accumulator array adds the results for the two columns to obtain the output feature value $Y_{00} = X_{00}W_{00} + X_{01}W_{01} + X_{10}W_{10} + X_{11}W_{11}$.
Alternatively, the systolic array may compute a multiply-accumulate result for each row of weight values. The first row consists of $W_{00}$ and $W_{01}$, giving $X_{00}W_{00} + X_{01}W_{01}$; the second row consists of $W_{10}$ and $W_{11}$, giving $X_{10}W_{10} + X_{11}W_{11}$. The accumulator array adds the results for the two rows to obtain the same output feature value $Y_{00} = X_{00}W_{00} + X_{01}W_{01} + X_{10}W_{10} + X_{11}W_{11}$.
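Both mappings give the same output feature value because they only regroup the same four products. The check below uses hypothetical helper names and made-up values for the upper-left window ($X_{00}=1$, $X_{01}=2$, $X_{10}=6$, $X_{11}=7$ and $W_{00}=1$, $W_{01}=2$, $W_{10}=3$, $W_{11}=4$); it models only the arithmetic, not the hardware.

```python
def column_mac_results(window, w):
    """One multiply-accumulate result per weight column (column-wise mapping)."""
    return [sum(window[r][c] * w[r][c] for r in range(len(w)))
            for c in range(len(w[0]))]


def row_mac_results(window, w):
    """One multiply-accumulate result per weight row (row-wise mapping)."""
    return [sum(window[r][c] * w[r][c] for c in range(len(w[0])))
            for r in range(len(w))]


window = [[1, 2],   # X00, X01
          [6, 7]]   # X10, X11
w = [[1, 2],        # W00, W01
     [3, 4]]        # W10, W11
cols = column_mac_results(window, w)  # [19, 32]
rows = row_mac_results(window, w)     # [5, 46]
# The accumulator array sums the partial results; both mappings give Y00
print(sum(cols), sum(rows))  # 51 51
```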
In the drawings of the embodiments of the present invention, MC denotes a systolic unit and ACC denotes an accumulator. As shown in Fig. 5, the systolic array may include multiple columns of systolic units 21. Each column of systolic units 21 loads weight values, multiplies them by the corresponding input feature values, and accumulates the products, producing the multiply-accumulate result for the column of weight values it has loaded.
The number of columns of systolic units 21 used during computation may equal the number of columns of the weight value matrix, with each column of systolic units 21 loading one column of weight values. Alternatively, it may equal the number of rows of the weight value matrix, with each column of systolic units 21 loading one row of weight values. For ease of description, the embodiments of the present invention use the case in which each column of systolic units 21 loads one column of weight values.
Each systolic unit 21 in a column loads a weight value and receives an input feature value. It multiplies the input feature value by the loaded weight value, adds the product to the output of the systolic unit 21 in the previous row, and outputs the sum. The output of the last systolic unit 21 in each column is the multiply-accumulate result for that column.
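The dataflow of one column can be modeled as a chain of units that each hold a weight value and pass a partial sum downward. This is a hypothetical, untimed sketch: the class and function names are illustrative, and it omits the clock-by-clock pipelining and input skewing of a real systolic array, modeling only which values each unit adds.

```python
class SystolicUnit:
    """Hypothetical model of one systolic unit (MC) holding a loaded weight value."""

    def __init__(self, weight):
        self.weight = weight

    def step(self, feature, partial_sum_in):
        # Multiply the incoming input feature value by the stored weight value
        # and add the output of the unit in the previous row.
        return partial_sum_in + feature * self.weight


def run_column(weights, features):
    """Feed one input feature value to each unit; the last unit's output
    is the column's multiply-accumulate result."""
    units = [SystolicUnit(wt) for wt in weights]
    partial_sum = 0  # the first unit in the column has no upstream neighbor
    for unit, feature in zip(units, features):
        partial_sum = unit.step(feature, partial_sum)
    return partial_sum


# A weight column (W00, W10) = (1, 3) with inputs (X00, X10) = (1, 6), illustrative values
print(run_column([1, 3], [1, 6]))  # 19 = 1*1 + 6*3
```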
累加器阵列可以包括多个累加器22,所述累加器22的个数与所述脉动单元21的列数相等,且每个累加器22与每列脉动单元21一一对应连接。具体地,假设累加器22的个数和脉动单元21的列数均为k,则第i个累加器22与第i列脉动单元21对应连接,其中,k为大于1的自然数,i=1、2、……、k。The accumulator array may include a plurality of accumulators 22, the number of the accumulators 22 is equal to the number of columns of the pulsation unit 21, and each accumulator 22 is connected to each column of the pulsation unit 21 in a one-to-one correspondence. Specifically, assuming that the number of accumulators 22 and the number of columns of pulsation units 21 are both k, then the i-th accumulator 22 is connected to the i-th column of pulsation units 21, where k is a natural number greater than 1, and i=1 , 2, ……, k.
Here, an accumulator 22 being connected to a column of systolic units 21 may mean that it is connected to the last systolic unit 21 in that column.
Each accumulator 22 takes the output of its column of systolic units 21, adds it to the output of the previous-stage accumulator 22, and passes the sum to the next-stage accumulator 22, so that the outputs of all columns are accumulated.
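The column-wise multiply-accumulate and the accumulator chain can be sketched functionally as follows. This is a minimal sketch under stated assumptions, not the apparatus itself: it ignores clock timing, assumes each column of systolic units 21 holds one column of the weight matrix, and computes one output value of a 2D convolution (cross-correlation, as used in CNNs) over a single input window. The function names are illustrative only.

```python
from typing import List

Matrix = List[List[int]]


def column_mac(weight_col: List[int], input_col: List[int]) -> int:
    """One column of systolic units: each unit adds weight * input to the
    partial sum from the unit above; the bottom unit's sum is returned."""
    partial = 0  # the top unit has no unit above it
    for w, x in zip(weight_col, input_col):
        partial += w * x
    return partial


def accumulate_columns(weights: Matrix, window: Matrix) -> int:
    """Accumulator chain: accumulator j adds column j's result to the running
    sum received from accumulator j - 1 and passes it on."""
    running = 0  # the first accumulator has no previous stage
    for j in range(len(weights[0])):
        w_col = [row[j] for row in weights]
        x_col = [row[j] for row in window]
        running += column_mac(w_col, x_col)
    return running


# 2x2 kernel applied to one 2x2 input window
print(accumulate_columns([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # 70
```

Column 0 contributes 1·5 + 3·7 = 26 and column 1 contributes 2·6 + 4·8 = 44; the chain adds them to 70, the same value as summing all four products directly.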
Optionally, the computing module 2 may further include a result output unit 26 and a result storage unit 27. When the weight matrix has more rows than the systolic array, the systolic array may load only part of the weights in the weight matrix at a time. The result storage unit 27 stores intermediate results, an intermediate result being the result obtained after computing with only part of the weights in the weight matrix.
After the outputs of the columns of systolic units 21 have been accumulated, if an intermediate result is buffered in the result storage unit 27, the accumulated value is further added to that intermediate result. If the sum is still an intermediate result of the convolution, the result output unit 26 stores it back into the result storage unit 27; if it is the final result of the convolution, the result output unit 26 sends it to the output module 3 for subsequent processing. The final result is the result obtained after computing with all the weights in the weight matrix.
With the result output unit 26 and the result storage unit 27, when the weight matrix is larger than the systolic array, the systolic array can first load part of the weights and compute an intermediate result of the convolution, then load another part of the weights and continue, until the final result is obtained and output. A smaller systolic array can thus handle a larger weight matrix, which reduces the size and cost of the apparatus.
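A functional sketch of this row tiling, assuming the weight matrix is split into consecutive groups of rows that each fit the systolic array, and that each pass reduces its tile to one partial sum. The `stored` variable merely stands in for the intermediate result held by the result storage unit 27; the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def tile_partial(w_tile: Matrix, x_tile: Matrix) -> int:
    # One pass of the (smaller) systolic array and accumulator chain
    return sum(w * x
               for w_row, x_row in zip(w_tile, x_tile)
               for w, x in zip(w_row, x_row))


def tiled_conv_point(weights: Matrix, window: Matrix,
                     array_rows: int) -> Tuple[int, int]:
    stored = 0   # stands in for the intermediate result in the result storage unit
    passes = 0
    for start in range(0, len(weights), array_rows):
        stop = start + array_rows
        stored += tile_partial(weights[start:stop], window[start:stop])
        passes += 1
        # stop >= len(weights): the sum is final and would go to the output module;
        # otherwise the result output unit would write it back as an intermediate
    return stored, passes


W = [[1, 2], [3, 4], [5, 6]]   # 3-row kernel
X = [[1, 0], [2, 1], [0, 3]]
print(tiled_conv_point(W, X, array_rows=2))  # (29, 2)
```

The total does not depend on the tile height; only the number of passes changes, which is the trade-off between array size and computation time.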
To feed the weights and input feature values into the systolic array, the data processing apparatus in this embodiment may further include a weight injection unit 24 and an input feature injection unit 25.
The input of the weight injection unit 24 may be connected to the weight loading module 11, and its output to the systolic array; specifically, the output may be connected to every systolic unit 21 in the systolic array so as to deliver the weights to the corresponding systolic units 21.
Similarly, the input of the input feature injection unit 25 may be connected to the input feature loading module 12, and its output to the systolic array; specifically, the output may be connected to every systolic unit 21 in the systolic array so as to deliver the input feature values to the systolic units 21.
The weight injection unit 24 and the input feature injection unit 25 buffer the weights and input feature values before sending them into the systolic array, which improves the stability of the apparatus.
Optionally, the weight injection unit 24 may be connected directly to every systolic unit 21, or, as shown in FIG. 5, directly to the first row of systolic units 21 only, reaching the other systolic units 21 through the intermediate systolic units 21, which pass the weights along.
Similarly, the input feature injection unit 25 may be connected directly to every systolic unit 21, or, as shown in FIG. 5, directly to the first column of systolic units 21 only, reaching the other systolic units 21 through the intermediate systolic units 21, which pass the input feature values along.
The connection scheme shown in FIG. 5 between the weight injection unit 24 or the input feature injection unit 25 and the systolic units 21 saves wiring and reduces the size of the apparatus.
To perform a convolution, the whole computation can be divided into a weight loading phase and a computation phase. In the weight loading phase, the weights of the weight matrix are loaded into the systolic units 21 of the systolic array. In the computation phase, the input feature values of the input feature matrix are fed into the systolic array, and the computation is carried out on the weights and the input feature values.
The data processing apparatus in this embodiment may further include a control unit 23, which controls the operation of the other units in the computing module 2.
Specifically, the control unit 23 may direct the weight injection unit 24 to load the weights obtained from the weight loading module 11 into the systolic array, then direct the input feature injection unit 25 to feed the input feature values obtained from the input feature loading module 12 into the systolic array, and direct the systolic array and the accumulator array to perform the convolution.
Optionally, the input feature values fed in can be reused during the convolution. The control unit 23 may specifically be configured to: in the weight loading phase, load the weights of the weight matrix into the systolic units 21 of the systolic array in sequence; and in the computation phase, pass the input feature values of the input feature matrix rightward through the systolic array in sequence, and have each systolic unit 21 compute on its loaded weight and the input feature value passed to it.
Thus, in the computation phase, an input feature value enters a row of systolic units 21 through a single interface and passes from left to right through every systolic unit 21 in that row. Every systolic unit 21 uses the value in its computation, so each input feature value is read once and reused, which reduces the data-access bandwidth required by the convolution.
Optionally, in the weight loading phase, the control unit 23 may specifically be configured to: in the shift sub-phase of the weight loading phase, for each column of systolic units 21, send the weights that column needs into the systolic array one after another through the first systolic unit 21 of the column, the received weights being passed downward from the first systolic unit 21; and in the load sub-phase of the weight loading phase, make the systolic units 21 in the systolic array store their corresponding weights.
Specifically, the weight injection unit 24 buffers the weights sent by the weight loading module 11 and, under the control of the control unit 23, loads them into the systolic array. The weight injection unit 24 has only one interface to each column of systolic units 21, and that interface may transfer only one weight per clock cycle. The weight loading phase is divided into a shift sub-phase and a load sub-phase. In the shift sub-phase, the weight injection unit 24 sends the weights needed by a column of systolic units 21 one after another through that column's single interface; inside the systolic array, each received weight is passed downward from the systolic unit 21 at the interface. In the load sub-phase, the systolic units 21 of a column simultaneously load their buffered weights into their own registers for the subsequent multiply-accumulate process. The weight injection unit 24 may load adjacent columns of systolic units 21 with a delay of one clock cycle between them.
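The shift and load sub-phases can be modeled at the cycle level. This sketch assumes that a column shifts only while it is being fed, that the weight destined for the bottom row is sent first (so that after R shifts every unit holds its own weight), and that column c starts one cycle after column c - 1; these ordering details are assumptions, not statements from the source.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def load_weights(W: Matrix) -> Tuple[Matrix, int]:
    R, C = len(W), len(W[0])
    shift = [[0] * C for _ in range(R)]  # weight shift register of each unit
    cycles = R + C - 1  # column c starts one cycle after column c - 1
    for t in range(cycles):
        new = [row[:] for row in shift]  # columns not being fed hold their values
        for c in range(C):
            k = t - c  # k-th weight sent through column c's single interface
            if 0 <= k < R:
                for r in range(R - 1, 0, -1):
                    new[r][c] = shift[r - 1][c]  # pass down one unit
                new[0][c] = W[R - 1 - k][c]  # bottom row's weight is sent first
        shift = new
    # load sub-phase: every unit latches its shift register at the same time
    return [row[:] for row in shift], cycles


print(load_weights([[1, 2], [3, 4], [5, 6]]))  # ([[1, 2], [3, 4], [5, 6]], 4)
```

With one interface per column, the last column finishes shifting after R + C - 1 cycles, after which a single load step makes every weight available to its multiplier.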
The input feature injection unit 25 buffers the input feature values sent by the input feature loading module 12 and, under the control of the control unit 23, feeds them into the systolic array. The input feature injection unit 25 has only one interface to each row of systolic units 21, and that interface may transfer only one input feature value per clock cycle. Inside the systolic array, each received input feature value is passed rightward from the systolic unit 21 at the interface to the last systolic unit 21 in the row. The input feature injection unit 25 may feed adjacent rows of systolic units 21 with a delay of one clock cycle between them.
In the systolic array, input feature values travel from left to right and weights travel from top to bottom. Since data may take one clock cycle to advance by one column or one row of systolic units 21, adjacent rows or columns of systolic units 21 can be loaded with a delay of one clock cycle between them. This ensures that the weights are loaded correctly and that each weight is combined with its corresponding input feature value.
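The effect of the one-cycle skew can be checked with a small cycle-level model. This is a sketch under stated assumptions: the weights are already loaded (stationary), input vector s supplies one value per row, row r is fed r cycles after row 0, and every unit latches its input and partial sum once per cycle. Under these assumptions the bottom of column c emits y[s][c] = Σ_r v_s[r]·W[r][c], one cycle after column c - 1, so the example shows a matrix-vector product rather than a full convolution.

```python
from typing import List

Matrix = List[List[int]]


def simulate_skewed_array(W: Matrix, vectors: Matrix) -> Matrix:
    """Cycle-level model: inputs move right, partial sums move down, one unit per cycle."""
    R, C, T = len(W), len(W[0]), len(vectors)
    x_reg = [[0] * C for _ in range(R)]  # input value latched in each unit
    p_reg = [[0] * C for _ in range(R)]  # partial sum latched in each unit
    out = [[0] * C for _ in range(T)]
    for t in range(T + C + R - 2):
        new_x = [[0] * C for _ in range(R)]
        new_p = [[0] * C for _ in range(R)]
        for r in range(R):
            for c in range(C):
                if c == 0:
                    s = t - r  # row r is fed one cycle after row r - 1
                    x_in = vectors[s][r] if 0 <= s < T else 0
                else:
                    x_in = x_reg[r][c - 1]  # from the left neighbour, last cycle
                p_in = p_reg[r - 1][c] if r > 0 else 0  # from the unit above
                new_x[r][c] = x_in
                new_p[r][c] = p_in + W[r][c] * x_in
        x_reg, p_reg = new_x, new_p
        for c in range(C):
            s = t - (R - 1) - c  # column c finishes one cycle after column c - 1
            if 0 <= s < T:
                out[s][c] = p_reg[R - 1][c]
    return out


print(simulate_skewed_array([[1, 2], [3, 4]], [[1, 0], [0, 1], [2, 3]]))
# [[1, 2], [3, 4], [11, 16]]
```

Each input value enters its row once and is reused by every column as it moves right; without the row skew, the partial sum arriving from above would belong to a different input vector.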
In practice, the control unit 23 can obtain the bit length of the weights in the weight matrix or of the input feature values in the input feature matrix, and control components such as the systolic array and the accumulator array according to that length to carry out the convolution.
For example, when the length is n bits, n-bit data can be loaded into the systolic array; when the length is 2n bits, 2n-bit data can be loaded, so that data of different precisions can be computed.
Optionally, the control unit 23 may control the units through hardware circuits such as a control state machine to carry out the convolution. The convolution can be controlled according to the data length in many ways. For example, configuration information may be stored in a register or carried in an instruction to indicate the bit length of the data to be convolved; the control unit 23 generates control signals from this configuration information to switch components such as the systolic array and the accumulator array between the n-bit and 2n-bit convolution modes.
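A hypothetical sketch of how such configuration information might be decoded into an operating mode. The `ConvConfig` fields and the returned keys are illustrative assumptions; the mapping itself (a 2n-bit weight column occupying two adjacent array columns, and a 2n-bit input being sent as two n-bit halves) follows the arrangement described for Embodiment 3.

```python
from dataclasses import dataclass


@dataclass
class ConvConfig:
    unit_bits: int  # n: operand width one systolic unit handles (assumed field)
    data_bits: int  # operand width named by the configuration information


def plan_columns(cfg: ConvConfig, weight_columns: int) -> dict:
    if cfg.data_bits == cfg.unit_bits:
        return {"mode": "n-bit", "array_columns": weight_columns,
                "beats_per_input": 1}
    if cfg.data_bits == 2 * cfg.unit_bits:
        # each 2n-bit weight column is split across two adjacent array columns,
        # and each 2n-bit input is sent as two n-bit halves
        return {"mode": "2n-bit", "array_columns": 2 * weight_columns,
                "beats_per_input": 2}
    raise ValueError("only n-bit and 2n-bit operands are described")


print(plan_columns(ConvConfig(unit_bits=8, data_bits=16), weight_columns=3))
# {'mode': '2n-bit', 'array_columns': 6, 'beats_per_input': 2}
```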
In the data processing apparatus provided in this embodiment, the computing module 2 may include a systolic array and an accumulator array that together perform the convolution. The systolic array performs the multiply-accumulate operations between the n-bit or 2n-bit weights of the weight matrix and the corresponding input feature values, and the accumulator array computes the output feature matrix from the multiply-accumulate results of the systolic array. Splitting the convolution into multiply-accumulate and accumulate operations in this way yields the exact convolution of the weight matrix and the input feature matrix, and reusing data between convolution operations reduces the data-access bandwidth required and saves resources.
Embodiment 3
Embodiment 3 of the present invention provides a data processing apparatus. Building on the technical solutions of the foregoing embodiments, this embodiment provides a specific implementation of the systolic unit and the accumulator. The overall structure of the data processing apparatus in this embodiment is shown in FIG. 5. FIG. 6 is a schematic structural diagram of a systolic unit in a data processing apparatus according to Embodiment 3 of the present invention. FIG. 7 is a schematic structural diagram of an accumulator in a data processing apparatus according to Embodiment 3 of the present invention.
As shown in FIG. 6, the systolic unit 21 may include:
a weight register 211, configured to store a weight;
an input feature register 212, configured to store an input feature value;
a multiplication circuit 213, which may be connected to both the weight register 211 and the input feature register 212 and is configured to compute the product of the weight stored in the weight register 211 and the input feature value stored in the input feature register 212;
an addition circuit 214, which may be connected to the multiplication circuit 213 and is configured to add the product from the multiplication circuit 213 to the output of the systolic unit 21 in the row above. When there is no systolic unit 21 in the row above, the addition circuit 214 may directly output the result obtained from the multiplication circuit 213.
With these components, the systolic unit 21 can load a weight, receive an input feature value, multiply the input feature value by the loaded weight, add the product to the output of the systolic unit 21 in the row above, and output the sum. The output of the addition circuit 214 can be sent to the next systolic unit 21.
Optionally, the systolic unit 21 may further include:
a weight shift register 215, configured to pass weights to the systolic unit 21 in the next row;
an input feature shift register 216, configured to pass input feature values to the systolic unit 21 in the next column.
Specifically, the weight shift register 215 buffers the weight sent from the weight injection unit 24 or from the systolic unit 21 one level above. In the shift sub-phase of weight loading, the weight buffered in the weight shift register 215 is passed down to the systolic unit 21 one level below. In the load sub-phase of weight loading, the weight buffered in the weight shift register 215 is latched into the weight register 211.
While computation proceeds with the weight in the weight register 211, the weight shift register 215 can be used to load the next weight, which improves the efficiency of computing over the whole weight matrix.
The input feature shift register 216 buffers the input feature value sent from the input feature injection unit 25 or from the systolic unit 21 on the left. The value buffered in the input feature shift register 216 is latched into the input feature register 212 and is also sent to the systolic unit 21 on the right.
While computation proceeds with the input feature value in the input feature register 212, the input feature shift register 216 can be used to load the next input feature value, which improves the efficiency of computing over the whole input feature matrix.
Optionally, the systolic unit 21 may further include a multiplication result register 217, through which the addition circuit 214 is connected to the multiplication circuit 213. The multiplication result register 217 stores the product of the weight loaded in the systolic unit 21 and the input feature value, ready to be added to the output of the systolic unit 21 one level above, which improves the stability of the apparatus.
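The registers of a systolic unit 21 can be mirrored in a small Python class. This is a behavioral sketch, not a register-transfer model: the class and method names are hypothetical, and `step` updates the registers in sequence within one call, so the pipeline latency that registers 212, 216 and 217 introduce in hardware is ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SystolicUnit:
    weight_reg: int = 0      # weight register 211 (used by the multiplier)
    feature_reg: int = 0     # input feature register 212
    weight_shift: int = 0    # weight shift register 215
    feature_shift: int = 0   # input feature shift register 216
    product_reg: int = 0     # multiplication result register 217

    def shift_weight(self, w_in: int) -> int:
        """Shift sub-phase: buffer a new weight, return the old one for the unit below."""
        passed_down = self.weight_shift
        self.weight_shift = w_in
        return passed_down

    def latch_weight(self) -> None:
        """Load sub-phase: copy the buffered weight into the weight register."""
        self.weight_reg = self.weight_shift

    def step(self, x_in: int, psum_in: int) -> Tuple[int, int]:
        """Returns (value for the unit on the right, sum for the unit below)."""
        self.feature_shift = x_in
        self.feature_reg = self.feature_shift
        self.product_reg = self.weight_reg * self.feature_reg
        return self.feature_shift, psum_in + self.product_reg


u = SystolicUnit()
u.shift_weight(7)
u.latch_weight()
print(u.step(3, 10))  # (3, 31)
u.shift_weight(9)     # next weight buffered while 7 is still in use
print(u.step(3, 0))   # (3, 21)
```

The second call shows why the separate shift register helps: the next weight can be shifted in while the current one keeps computing, and it takes effect only after `latch_weight`.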
Optionally, in the embodiments of the present invention, each systolic unit 21 performs an n-bit × n-bit multiply-accumulate operation. Specifically, the weight that each systolic unit 21 can load may be n bits long; when the weights in the weight matrix are 2n bits long, each column of systolic units 21 loads either the high n bits or the low n bits of the weights in the weight matrix.
By having two columns of systolic units 21 load the high n bits and the low n bits of the weights respectively, an n-bit apparatus can compute on 2n-bit data.
Correspondingly, when the input feature values in the input feature matrix are 2n bits long, each value that a systolic unit 21 receives at a time may be either the high n bits or the low n bits of an input feature value in the input feature matrix.
Further, the high n bits and the low n bits of a column of weights may be loaded into two adjacent columns of systolic units 21, and the high n bits of each input feature value may immediately follow its low n bits as they pass from the first column to the last column of systolic units 21. This arrangement simplifies the subsequent processing of the multiply-accumulate results by the accumulators 22 and reduces the complexity of the accumulators 22.
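A sketch of what the two adjacent columns produce for one column of 2n-bit weights, assuming unsigned operands (signed handling is not covered here) and that each input is sent as its low half followed immediately by its high half. Every multiplication involves only n-bit halves; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def split(v: int, n: int) -> Tuple[int, int]:
    """Unsigned 2n-bit value -> (high n bits, low n bits)."""
    return v >> n, v & ((1 << n) - 1)


def column_pair_outputs(weights: List[int], inputs: List[int],
                        n: int) -> Dict[str, List[int]]:
    """Bottom-of-column outputs of the two adjacent columns holding one column of
    2n-bit weights. Each input is sent as its low half, then its high half."""
    halves_hi_lo = [split(w, n) for w in weights]
    columns = {"lo_col": [lo for _, lo in halves_hi_lo],   # low weight bits
               "hi_col": [hi for hi, _ in halves_hi_lo]}   # high weight bits
    outputs: Dict[str, List[int]] = {}
    for name, w_half in columns.items():
        beats = []
        for part in (1, 0):  # split() index 1 = low half (sent first), 0 = high half
            x_half = [split(x, n)[part] for x in inputs]
            beats.append(sum(a * b for a, b in zip(w_half, x_half)))
        outputs[name] = beats  # [low-input-half result, high-input-half result]
    return outputs


print(column_pair_outputs([0xAB, 0x12], [0x34, 0xCD], n=4))
# {'lo_col': [70, 57], 'hi_col': [53, 42]}
```

These four sums are exactly the terms of w·x = (w_H·2^n + w_L)(x_H·2^n + x_L); recombining them only requires shifts and additions, which is why consecutive low/high beats keep the accumulator simple.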
As shown in FIG. 7, the accumulator 22 in this embodiment may include:
a multiply-accumulate result register 221, which may be connected to the last systolic unit 21 of the corresponding column and is configured to obtain the output of that last systolic unit 21;
a previous multiply-accumulate result register 222, which may be connected to the multiply-accumulate result register 221 and is configured, when the input feature values are 2n bits, to take an output from the multiply-accumulate result register 221 once every other clock cycle;
a vertical addition circuit 223, which may be connected to both the multiply-accumulate result register 221 and the previous multiply-accumulate result register 222 and is configured to send the output held in the multiply-accumulate result register 221 to a first-stage addition circuit 224 when the input feature values are n bits, or to send the sum of the outputs held in the multiply-accumulate result register 221 and the previous multiply-accumulate result register 222 to the first-stage addition circuit 224 when the input feature values are 2n bits;
the first-stage addition circuit 224, which may be connected to both the vertical addition circuit 223 and the previous-stage accumulator 22 and is configured to add the output of the vertical addition circuit 223 to the output of the previous-stage accumulator 22.
With these components, the accumulator 22 takes the output of its column of systolic units 21, adds it to the output of the previous-stage accumulator 22, and outputs the sum to the next-stage accumulator 22.
It should be understood that adding data in the embodiments of the present invention may mean adding two values directly, or converting the data into a certain format first and then adding. For example, data in different number bases may be converted to the same base before addition; before high-n-bit data is added to low-n-bit data, the high-n-bit data may be shifted left by n bits so that the two values are aligned.
Optionally, when the input feature values are 2n bits, the accumulator 22 may use one register to store the output corresponding to the high n bits of an input feature value obtained from the systolic units 21, and another register to store the output corresponding to the low n bits. From the high-n-bit and low-n-bit outputs it obtains the output corresponding to the full input feature value, adds it to the output of the previous-stage accumulator 22, and outputs the sum to the next-stage accumulator 22.
The outputs for the high n bits and the low n bits of the input feature values may be two consecutive outputs obtained from the systolic units 21.
Specifically, if the column of the systolic array corresponding to an accumulator 22 holds the low n bits of the weights, then, to obtain the output corresponding to the input feature value from the high-n-bit and low-n-bit outputs, the accumulator 22 may shift the high-n-bit output left by n bits and add it to the low-n-bit output.
If the column of the systolic array corresponding to an accumulator 22 holds the high n bits of the weights, the accumulator 22 may shift the high-n-bit output left by n bits, add it to the low-n-bit output, and then shift the sum left by a further n bits to obtain the output corresponding to the input feature value.
These shift operations may be implemented in the vertical addition circuit 223. Shifting the high-n-bit data left by n bits restores the high-n-bit output to its actual multiply-accumulate value, which ensures that the result is correct.
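The two shift rules can be written as one small function, assuming unsigned operands. For weights 0xAB, 0x12 and inputs 0x34, 0xCD with n = 4, the column holding the low weight bits emits 11·4 + 2·13 = 70 (low input halves) and then 11·3 + 2·12 = 57 (high input halves); the column holding the high weight bits emits 10·4 + 1·13 = 53 and then 10·3 + 1·12 = 42. The function name and the `weight_half` parameter are illustrative.

```python
def combine(lo_out: int, hi_out: int, n: int, weight_half: str) -> int:
    """Merge the two consecutive outputs of one column when inputs are 2n bits.
    lo_out / hi_out: results for the low / high n bits of the inputs.
    weight_half: 'low' or 'high', the weight bits this column holds."""
    merged = (hi_out << n) + lo_out  # align the high-input-half result, then add
    if weight_half == "high":
        merged <<= n                 # this column holds the high weight bits
    elif weight_half != "low":
        raise ValueError("weight_half must be 'low' or 'high'")
    return merged


total = combine(70, 57, 4, "low") + combine(53, 42, 4, "high")
print(total, 0xAB * 0x34 + 0x12 * 0xCD)  # 12582 12582
```

Adding the two merged column values, which the accumulator chain does anyway, reproduces the full 2n-bit dot product.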
Optionally, the accumulator 22 may further include a filter circuit 225, through which the multiply-accumulate result register 221 is connected to the last systolic unit 21 of the corresponding column. The filter circuit 225 may filter out redundant multiply-accumulate results output by the systolic array according to the stride of the convolution; results that are not filtered out are passed by the filter circuit 225 to the multiply-accumulate result register 221.
Filtering redundant data with the filter circuit 225 ensures correct convolution results under different stride requirements, so the apparatus can meet the stride needs of different scenarios and has a wider range of applications.
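One way to picture the redundant results is a 1D convolution without padding: as inputs stream past, the array naturally yields an output at every position (stride 1), and for a stride s only every s-th position is kept. This sketch assumes that model; the function names are hypothetical.

```python
from typing import List


def stride1_outputs(x: List[int], w: List[int]) -> List[int]:
    """Every output position the array produces when inputs stream past (stride 1)."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def stride_filter(results: List[int], stride: int) -> List[int]:
    """Keep only results whose output position is a multiple of the stride."""
    return [r for i, r in enumerate(results) if i % stride == 0]


full = stride1_outputs([1, 2, 3, 4, 5, 6], [1, 1])
print(full)                    # [3, 5, 7, 9, 11]
print(stride_filter(full, 2))  # [3, 7, 11]
```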
Optionally, the accumulator 22 may further include an accumulator result register 226, through which the first-stage addition circuit 224 is connected to the previous-stage accumulator 22. The accumulator result register 226 obtains the output of the previous-stage accumulator 22 and sends it to the first-stage addition circuit 224.
Optionally, the accumulator 22 may further include a sum register 227, connected to the first-stage addition circuit 224, which stores the output of the first-stage addition circuit 224 and outputs it to the next-stage accumulator 22.
The accumulator result register 226 and the sum register 227 store the output of the previous-stage accumulator 22 and the output of the first-stage addition circuit 224, respectively, so that the computation proceeds smoothly.
Optionally, the accumulator 22 may further include a delay circuit 228, through which the accumulator result register 226 is connected to the previous-stage accumulator 22. According to the dilation of the convolution, the delay circuit 228 delays the output of the previous-stage accumulator 22 by the corresponding number of clock cycles before sending it to the accumulator result register 226. The number of delay cycles is determined by the dilation of the convolution.
Delaying the output of the previous-stage accumulator 22 with the delay circuit 228 ensures correct convolution results under different dilation requirements, so the apparatus can meet the dilation needs of different scenarios and has a wider range of applications.
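A simplified model of how a delay line between accumulators realizes dilation. The assumptions are: a single-row array with one weight per column, column j's result at cycle t is w[j]·x[t], dilation d ≥ 1 with d = 1 meaning an ordinary convolution, and the column-to-column skew of the real array is ignored. Under this model a delay of d cycles in front of each accumulator (after the first) yields y[i] = Σ_j w[j]·x[i + j·d]; the real delay count may differ once the skew is included. The function names are hypothetical.

```python
from collections import deque
from typing import Deque, List


def dilated_chain(x: List[int], w: List[int], dilation: int) -> List[int]:
    """Accumulator j > 0 sees accumulator j-1's output through a delay line of
    `dilation` cycles before adding its own column result w[j] * x[t]."""
    taps = len(w)
    lines: List[Deque[int]] = [deque([0] * dilation) for _ in range(taps)]
    last_stage: List[int] = []
    for xt in x:
        acc_prev = 0
        for j in range(taps):
            col_out = w[j] * xt
            if j == 0:
                acc = col_out                       # first accumulator: no left input
            else:
                acc = col_out + lines[j].popleft()  # stage j-1 output from `dilation` cycles ago
                lines[j].append(acc_prev)           # this cycle's stage j-1 output enters the line
            acc_prev = acc
        last_stage.append(acc_prev)
    span = (taps - 1) * dilation
    return last_stage[span:]  # earlier cycles hold incomplete sums


def dilated_reference(x: List[int], w: List[int], dilation: int) -> List[int]:
    span = (len(w) - 1) * dilation
    return [sum(w[j] * x[i + j * dilation] for j in range(len(w)))
            for i in range(len(x) - span)]


x, w = [1, 2, 3, 4, 5, 6, 7], [1, 10, 100]
print(dilated_chain(x, w, 2))  # [531, 642, 753]
```

Changing only the delay length, with no change to the array, switches between ordinary and dilated convolution in this model.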
Optionally, the accumulator 22 may further include a second-stage addition circuit 229, which may be connected to the sum register 227; the result output unit 26 may be connected to the second-stage addition circuit 229.
The second-stage addition circuit 229 of the last-stage accumulator 22 adds the result in the sum register 227 to the intermediate result that the result output unit 26 reads from the result storage unit 27, and outputs the sum to the result output unit 26.
When the weight matrix of a convolution is mapped onto the systolic array, N consecutive accumulators 22 are mapped to the same weight matrix, where N may equal the width of the weight matrix. Among these N accumulators 22, the first does not need to receive the output of the accumulator 22 to its left, and the last does not pass the result buffered in its sum register 227 to the accumulator 22 to its right; instead, its second-stage addition circuit 229 adds the result buffered in the sum register 227 to the intermediate result read back from the result storage unit 27 and outputs the sum to the result output unit 26.
每一级累加器22都会与结果产出单元26连接,由权重值矩阵的宽度决定哪一级累加器22向结果产出单元26输出累加结果,例如,如果权重值矩阵的宽度为3,则第三级累加器22向结果产出单元26输出累加结果,如果权重值矩阵的宽度为4,则第四级累加器22输出累加结果。Each level of accumulator 22 is connected to the result output unit 26, and the width of the weight value matrix determines which level of accumulator 22 outputs the accumulation result to the result output unit 26. For example, if the width of the weight value matrix is 3, then The third-stage accumulator 22 outputs the accumulation result to the result output unit 26. If the width of the weight value matrix is 4, the fourth-stage accumulator 22 outputs the accumulation result.
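The following is a minimal, timing-free sketch of one group of N chained accumulators. It only mirrors the behavior described above. The function names and the list-based interface are hypothetical and do not come from the patent, and pipeline delays are ignored.

```python
def output_stage(kernel_width: int) -> int:
    """Return the 1-based accumulator stage that reports to the result output unit.

    Per the text, a width-3 kernel reports from stage 3 and a width-4 kernel from stage 4.
    """
    return kernel_width


def accumulate_chain(column_macs, stored_intermediate=0):
    """Functional model of N chained accumulators for one output position.

    column_macs[k] is the multiply-accumulate result delivered by the k-th
    column of systolic units. Stage 0 has no left neighbour; every later stage
    adds its column result to the running sum passed from the left. The last
    stage's second-stage adder also adds the intermediate result read back
    from result storage (0 if nothing is stored).
    """
    running = 0
    for stage, mac in enumerate(column_macs):
        running = mac if stage == 0 else running + mac
    return running + stored_intermediate
```

For example, `accumulate_chain([10, 20, 30])` models a width-3 kernel and evaluates to 60. Passing `stored_intermediate=5` models a last stage that also folds in a stored partial result.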
When the result output by the second-stage addition circuit 229 of an accumulator 22 is a final result, the result output unit 26 may send it to the output module 3. When that result is an intermediate result, the result output unit 26 sends it to the result storage unit 27.
Optionally, the result storage unit 27 may include multiple FIFO (First In, First Out) storage units, and the result output unit 26 may write each intermediate result into the corresponding FIFO storage unit of the result storage unit 27.
Specifically, each accumulator stage 22 may correspond to one FIFO storage unit, and each FIFO storage unit supports simultaneous reads and writes. During a convolution, the N FIFO storage units may be divided into groups according to the size of the weight value matrix, with different groups buffering the intermediate results of different weight value matrices.
As described above, the width of the weight value matrix determines which accumulator 22 outputs the accumulation result to the result output unit 26. Some accumulators 22 therefore never output to the result output unit 26, and this embodiment puts their FIFO storage units to use. A FIFO group may consist of the FIFO storage units of the outputting accumulator 22 together with those of all accumulators 22 before it, and the outputting accumulator 22 may use every buffer in that group.
For example, if the weight value matrix is 3 wide, the third-stage accumulator 22 outputs the accumulation result to the result output unit 26. The FIFOs of the first- through third-stage accumulators 22 can then be grouped together to buffer the accumulation results of the third-stage accumulator 22. This puts otherwise idle FIFO storage units to use and improves the storage efficiency for accumulation results.
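The grouping rule above can be sketched as a simple partition of FIFO indices, with one FIFO per accumulator stage. The function name is hypothetical. Leaving any trailing FIFOs that cannot form a full group unused is an assumption of this sketch, because the text does not say how they are handled.

```python
def group_fifos(num_fifos: int, kernel_width: int):
    """Partition 0-based FIFO indices into groups of kernel_width.

    Each group holds the FIFOs of one kernel's accumulators, i.e. the
    outputting (last) accumulator and every accumulator before it, so the
    outputting stage can use all buffers in the group. Trailing FIFOs that do
    not fill a whole group are left out (assumption).
    """
    groups = []
    for start in range(0, num_fifos - kernel_width + 1, kernel_width):
        groups.append(list(range(start, start + kernel_width)))
    return groups
```

With 8 FIFOs and a width-3 kernel, stages 0-2 and 3-5 each share a group, and the last two FIFOs stay idle under this assumption.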
In practice, when the convolution uses n-bit fixed-point numbers, the output of the systolic unit 21 is captured by the multiply-accumulate (MAC) result register 221. The vertical addition circuit 223 forwards that value directly to the first-stage addition circuit 224 for accumulation.
When the convolution uses 2n-bit fixed-point numbers, two consecutive outputs of the systolic array must be accumulated in the vertical addition circuit 223. The first output received by the MAC result register 221 may be moved into the previous MAC result register 222 on the next clock cycle. Once the MAC result register 221 receives the second output, the values buffered in registers 221 and 222 are accumulated in the vertical addition circuit 223. The sum is sent on to the first-stage addition circuit 224 for further addition, which implements convolution on 2n-bit fixed-point numbers.
In the data processing apparatus of this embodiment, the MAC result register 221 and the previous MAC result register 222 of the accumulator 22 hold two adjacent outputs of the corresponding column of systolic units 21. From these two values, the MAC result for a 2n-bit input feature value can be determined. Feeding n bits of the input feature value into the systolic units 21 at a time is therefore enough to convolve 2n-bit input feature values, without enlarging the storage of the systolic units 21. This balances device cost against computational efficiency and gives the design high practical value.
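The following is a minimal sketch of the register pair for a 2n-bit input with n-bit weights, assuming unsigned values. It also assumes that the result for the low n input bits arrives first and the result for the high n bits second, which is consistent with shifting the second output left by n bits before adding. The function name is hypothetical, and the two registers are modeled as plain variables.

```python
def combine_pairs(column_outputs, n):
    """Model MAC result register 221 and previous MAC result register 222.

    column_outputs is the per-cycle output stream of one systolic column,
    alternating: result for the low n input bits, then for the high n bits.
    Every second cycle one combined result is produced.
    """
    mac_reg = None        # models MAC result register 221
    prev_mac_reg = None   # models previous MAC result register 222
    combined = []
    for cycle, out in enumerate(column_outputs):
        if cycle % 2 == 0:
            prev_mac_reg = out   # first output of the pair ends up in register 222
        else:
            mac_reg = out        # second output is held in register 221
            # vertical adder: the high-half result is weighted by 2**n
            combined.append(prev_mac_reg + (mac_reg << n))
    return combined
```

This works because each 2n-bit input splits as `x = (x >> n) << n | (x & (2**n - 1))`. The column sum over the high halves therefore only needs to be scaled by 2**n before it is added to the column sum over the low halves.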
FIG. 8 is a schematic diagram of the n-bit convolution process performed by the data processing apparatus according to Embodiment 3 of the present invention, with a 3*3 weight value matrix. As shown in FIG. 8, the symbols are defined as follows:
- KhaDb is the b-th value in row a of the input feature value matrix.
- Kwc is the weight vector of column c of the weight value matrix. It is deployed to the corresponding column of systolic units when the convolution starts.
- KwcDd is the d-th MAC result of an output feature value for column c of the weight value matrix.
- Bias is the bias value input to the convolution.
- SxTy is the accumulation result output by the x-th stage accumulator at time y.
When the convolution starts, the weight vectors Kwc of the weight value matrix are fed into the systolic array over three clock cycles, and each systolic unit loads the weight at its position in the 3*3 weight value matrix. After the weights are loaded, the input feature values are fed into the systolic array in the order shown in FIG. 8, where they are multiplied and accumulated with the weights. FIG. 8 shows the outputs of the systolic array in time order.
The outputs of the systolic array are sent to the corresponding accumulators for further accumulation, and FIG. 8 shows the computation each accumulator performs at each time step. When the third-stage accumulator completes its accumulation, the final output feature value is obtained.
The process in FIG. 8 implements convolution on n-bit data. Each input feature value is multiplied by every weight in a row of the weight value matrix, so one input feature value takes part in several multiply-accumulate operations. This reuses data and reduces the data-access bandwidth the convolution requires.
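The dataflow of FIG. 8 can be checked functionally with the sketch below, which ignores timing. Column c of the kernel sits in column c of systolic units, each column sums its products top to bottom, and a chain of accumulators sums the column results. The function names are hypothetical. Injecting the bias at the first accumulator stage is an assumption, and the operation is the usual CNN cross-correlation with "valid" output and stride 1.

```python
def column_mac(weight_col, input_col):
    """One column of systolic units: sum of resident weight times passing input."""
    partial = 0
    for w, x in zip(weight_col, input_col):
        partial += w * x
    return partial


def conv2d_systolic_model(x, k, bias=0):
    """Valid, stride-1 2D convolution organised as in FIG. 8 (no timing)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = bias  # assumption: bias enters at the first accumulator stage
            for c in range(kw):
                weight_col = [k[r][c] for r in range(kh)]
                input_col = [x[i + r][j + c] for r in range(kh)]
                acc += column_mac(weight_col, input_col)  # accumulator stage c
            out[i][j] = acc
    return out
```

With a 4*4 input holding 1..16 row by row and the Sobel-style kernel `[[1, 0, -1], [2, 0, -2], [1, 0, -1]]`, every output is -8, matching a direct convolution.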
FIG. 9 is a schematic diagram of the 2n-bit convolution process performed by the data processing apparatus according to Embodiment 3 of the present invention, with 2n = 16 and a 3*3 weight value matrix. As shown in FIG. 9:
- KhaDb_LSB is the low n bits of the b-th value in row a of the input feature value matrix, and KhaDb_MSB is its high n bits.
- Kwc_LSB is the low n bits of the weight vector of column c of the weight value matrix, and Kwc_MSB is its high n bits. Both are deployed to the corresponding systolic units when the convolution starts.
Each output feature value's d-th multiply-accumulate result for the weights in column c is built from four parts:
- KwcDd_LL, the first part, is obtained by multiplying and accumulating the low n bits of the input feature values with the low n bits of the weights.
- KwcDd_ML, the second part, comes from the high n bits of the inputs and the low n bits of the weights.
- KwcDd_LM, the third part, comes from the low n bits of the inputs and the high n bits of the weights.
- KwcDd_MM, the fourth part, comes from the high n bits of the inputs and the high n bits of the weights.
Bias is the bias value input to the convolution, and SxTy is the accumulation result output by the x-th stage accumulator at time y.
When the convolution starts, the low-n-bit and high-n-bit weight vectors Kwc_LSB and Kwc_MSB are fed into the systolic array over three clock cycles, and each systolic unit loads the corresponding n bits of the weight at its position. After the weights are loaded, the input feature values are fed into the systolic array in the order shown in FIG. 9, where they are multiplied and accumulated with the weights. FIG. 9 shows the outputs of the systolic array in time order.
The outputs of the systolic array are sent to the corresponding accumulators for further accumulation, and FIG. 9 shows the computation each accumulator performs at each time step. Before accumulating, the vertical addition circuit of each accumulator shifts the second of the two outputs it receives left by n bits. The accumulator for the high-n-bit weights must additionally shift the sum of the two outputs left by n bits. Each accumulator passes one accumulation result to the next-stage accumulator once every two clock cycles. When the last-stage accumulator completes its accumulation, the final output feature value is obtained.
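The shifts described above follow from splitting each 2n-bit operand into halves. For unsigned values x = x_M*2^n + x_L and w = w_M*2^n + w_L, the product is x_L*w_L + (x_M*w_L << n) + (x_L*w_M << n) + (x_M*w_M << 2n). The sketch below rebuilds the product the way the two accumulators would and checks it against direct multiplication. The function names are hypothetical, and signed two's-complement data would need the high halves treated as signed, which this unsigned sketch does not do.

```python
def split(v, n):
    """Split an unsigned 2n-bit value into (high n bits, low n bits)."""
    return v >> n, v & ((1 << n) - 1)


def product_from_parts(x, w, n):
    """Rebuild x*w from the four n-bit x n-bit parts LL, ML, LM and MM."""
    x_m, x_l = split(x, n)
    w_m, w_l = split(w, n)
    ll = x_l * w_l   # low input bits  x low weight bits
    ml = x_m * w_l   # high input bits x low weight bits
    lm = x_l * w_m   # low input bits  x high weight bits
    mm = x_m * w_m   # high input bits x high weight bits
    # Accumulator for the low-weight column: second output shifted left by n.
    low_weight_col = ll + (ml << n)
    # Accumulator for the high-weight column: same, then the sum shifted by n.
    high_weight_col = (lm + (mm << n)) << n
    return low_weight_col + high_weight_col
```

With n = 8, `product_from_parts(0xABCD, 0x1234, 8)` equals `0xABCD * 0x1234`.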
In practice, the apparatus can compute on data of both lengths. Convolution on n-bit data provides higher convolution concurrency, while convolution on 2n-bit data can effectively improve network accuracy.
Note that FIGS. 8 and 9 contain several time axes. Each axis only illustrates the output order within its own timeline, so T0 does not denote the same instant on different axes.
Embodiment 4
Embodiment 4 of the present invention provides a data processing apparatus that builds on the technical solutions of the foregoing embodiments by adding a memory for storing data. FIG. 10 is a schematic structural diagram of a data processing apparatus according to Embodiment 4 of the present invention. As shown in FIG. 10, the data processing apparatus in this embodiment may include:
an input module, configured to obtain an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix. The input module specifically includes a weight value loading module 11, configured to obtain the n-bit or 2n-bit weight value matrix, and an input feature value loading module 12, configured to obtain the n-bit or 2n-bit input feature value matrix;
a calculation module 2, configured to convolve the input feature value matrix with the weight value matrix to obtain an output feature value matrix;
an output module 3, configured to output the output feature value matrix; and
a memory 4, configured to store at least one of the following: the input feature value matrix, the output feature value matrix, and the weight value matrix.
Optionally, the memory 4 may be a static random-access memory (SRAM). The weight value loading module 11 may be connected to the memory 4, read weight values from it, and send them to the calculation module 2 in a specific format. The input feature value loading module 12 may read input feature values from the memory 4 and send them to the calculation module 2 for convolution.
The calculation module 2 can output one output feature value of the output feature value matrix per clock cycle, and the output module 3 writes the output feature values into the memory 4. Optionally, storing output feature values in the memory 4 may be subject to format requirements. For example, output feature values may need to be aligned to 32 bits, i.e., the start address of the first byte of an output feature value must be an integer multiple of 32. The output module 3 can assemble the output feature values into the required format before sending them to the memory 4 for storage.
Optionally, when the data stored in the memory 4 is n bits long, the memory 4 may store m data items in sequence in n*m bits of storage. When the data is 2n bits long, the memory 4 may store m data items in 2n*m bits of storage, with the high n bits and low n bits of each item stored adjacently. Here n and m are both positive integers.
FIG. 11 is a schematic diagram of the storage format used when the data processing apparatus according to Embodiment 4 of the present invention stores n-bit data. As shown in FIG. 11, each box represents n bits of storage, the number above a box is the index of that storage slot, and the number inside a box is the index of the stored data item. FIG. 11 shows 2m n-bit slots, where the i-th slot stores the i-th data item.
FIG. 12 is a schematic diagram of the storage format used when the data processing apparatus according to Embodiment 4 of the present invention stores 2n-bit data. As shown in FIG. 12, each box represents n bits of storage and the number above a box is the index of that storage slot. Inside a box, i_LSB denotes the low n bits of the i-th data item and i_MSB denotes its high n bits. FIG. 12 shows 2m n-bit slots, where slot 2i stores the low n bits of the i-th data item and slot 2i+1 stores its high n bits.
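The FIG. 12 layout can be sketched as packing each unsigned 2n-bit value into two consecutive n-bit slots, low half first, with a matching unpack. Modeling memory as a Python list of n-bit slot values is an assumption made for illustration, and the function names are hypothetical.

```python
def pack_2n(values, n):
    """Slot 2i holds the low n bits of item i; slot 2i+1 holds its high n bits."""
    mask = (1 << n) - 1
    slots = []
    for v in values:
        slots.append(v & mask)          # i_LSB
        slots.append((v >> n) & mask)   # i_MSB
    return slots


def unpack_2n(slots, n):
    """Inverse of pack_2n: rebuild each 2n-bit item from its adjacent slots."""
    return [slots[2 * i] | (slots[2 * i + 1] << n) for i in range(len(slots) // 2)]
```

Storing the two halves next to each other means a loader can read one 2n-bit item, or its two n-bit halves in LSB-then-MSB order, from contiguous addresses.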
The data processing apparatus of this embodiment can use the memory 4 to store at least one of the input feature value matrix, the output feature value matrix, and the weight value matrix. When the stored data is 2n bits long, the memory 4 may store m data items in 2n*m bits of storage, with the high and low n bits of each item stored adjacently. This makes it easy to feed input feature values and weight values into the systolic array in order and improves the efficiency of the convolution.
Embodiment 5
Embodiment 5 of the present invention provides a data processing method. FIG. 13 is a schematic flowchart of a data processing method according to Embodiment 5 of the present invention. As shown in FIG. 13, the data processing method in this embodiment may include:
Step 1301: Obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix.
Step 1302: Convolve the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix.
Step 1303: Output the output feature value matrix.
Here, n is a positive integer.
The data processing method shown in FIG. 13 may be implemented on the apparatus of the embodiments shown in FIGS. 1 to 12. For the implementation principle, execution process, and technical effects, refer to the descriptions of those embodiments, which are not repeated here.
In one possible implementation, the weight values in an n-bit weight value matrix are n bits long, and the weight values in a 2n-bit weight value matrix are 2n bits long;
the input feature values in the input feature value matrix have the same length as the weight values in the weight value matrix.
In one possible implementation, the method further includes:
storing the data of a matrix, the matrix being at least one of the input feature value matrix, the output feature value matrix, and the weight value matrix;
where, when the stored data is 2n bits long, m data items are stored in 2n*m bits of storage with the high n bits and low n bits of each item stored adjacently, m being a positive integer.
In one possible implementation, convolving the input feature value matrix with the n-bit or 2n-bit weight value matrix to obtain the output feature value matrix includes:
multiplying and accumulating the n-bit or 2n-bit weight values of the weight value matrix with the corresponding input feature values; and
computing the output feature value matrix from the multiply-accumulate results.
In one possible implementation, multiplying and accumulating the n-bit or 2n-bit weight values of the weight value matrix with the corresponding input feature values includes:
loading the weight values of the weight value matrix into the systolic array; and
multiplying and accumulating the weight values loaded into each column of systolic units with the corresponding input feature values to obtain the multiply-accumulate result for each column of weight values;
where the weight values are n-bit or 2n-bit weight values.
In one possible implementation, each systolic unit can load a weight value n bits long;
when the weight values in the weight value matrix are 2n bits long, each column of systolic units loads either the high n bits or the low n bits of those weight values.
In one possible implementation, the high n bits and the low n bits of one column of weight values are loaded into two adjacent columns of systolic units, respectively.
In one possible implementation, when the input feature values in the input feature value matrix are 2n bits long, each value the systolic units receive at a time is either the high n bits or the low n bits of an input feature value of the matrix.
In one possible implementation, the high or low n bits of an input feature value are passed in turn from the first column of systolic units to the last.
In one possible implementation, multiplying and accumulating the weight values loaded into each column of systolic units with the corresponding input feature values to obtain the multiply-accumulate result for each column includes:
passing the input feature values of the input feature value matrix rightward through the systolic array one column at a time, with each column of systolic units multiplying and accumulating its loaded weight values with the input feature values passed to it, to obtain the multiply-accumulate result for each column of weight values.
In one possible implementation, loading the weight values of the weight value matrix into the systolic array includes:
in the shift phase of weight loading, for each column of systolic units, feeding the weight values that column needs into the systolic array one after another through the first systolic unit of the column, the received weight values being passed downward from the first systolic unit in turn; and
in the load phase of weight loading, storing the corresponding weight values in the systolic units of the systolic array.
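The shift and load phases for one column can be simulated as a shift register whose entries move down one unit per cycle. If unit r is to end up holding the r-th weight of the column after as many cycles as there are rows, the weights have to enter bottom-row first. That feed order is an assumption of this sketch, because the text does not specify it. The function name is hypothetical.

```python
def shift_in_weights(column_weights):
    """Simulate weight loading for one column of systolic units.

    Shift phase: one weight enters the top unit per cycle and every held
    weight moves down one unit. Load phase: each unit latches what it holds.
    Index 0 is the top unit.
    """
    rows = len(column_weights)
    regs = [None] * rows                 # per-unit shift registers
    for w in reversed(column_weights):   # assumption: bottom-row weight enters first
        regs = [w] + regs[:-1]           # shift everything down by one unit
    return list(regs)                    # load phase: latch the shifted values
```

For a 3-row column, loading takes three cycles, which matches the text's statement that the weight vectors of a 3*3 kernel are fed in over three clock cycles.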
In one possible implementation, multiplying and accumulating, in each column of systolic units, the loaded weight values with the input feature values passed to it, to obtain the multiply-accumulate result for each column, includes:
each column of systolic units performing the following: each systolic unit in the column obtains an input feature value, multiplies it by the weight value loaded in that unit, adds the product to the output of the previous systolic unit, and outputs the sum. The output of the last systolic unit is the multiply-accumulate result for the column.
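The per-unit rule above can be written as a running sum down the column, where each unit's output is its own product plus the output of the unit above it. This sketch ignores the one-cycle skew with which inputs reach successive rows in hardware. The function name is hypothetical.

```python
def column_partial_sums(weights, inputs):
    """Outputs of every systolic unit in one column, top to bottom.

    Unit r outputs inputs[r] * weights[r] + (output of unit r-1); the top unit
    has no predecessor and starts from 0. The last entry is the column's
    multiply-accumulate result.
    """
    outputs = []
    upstream = 0
    for w, x in zip(weights, inputs):
        upstream = x * w + upstream
        outputs.append(upstream)
    return outputs
```

For weights `[1, 2, 3]` and inputs `[4, 5, 6]`, the unit outputs are 4, 14 and 32, so the column result is 32.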
In one possible implementation, an accumulator is provided for each column of systolic units, and computing the output feature value matrix from the multiply-accumulate results includes:
each accumulator obtaining the output of its column of systolic units and adding it to the output of the previous-stage accumulator to produce its own output; and determining the output feature value from the output of the last-stage accumulator.
In one possible implementation, if the input feature values are 2n bits long, obtaining the output of the corresponding column of systolic units includes:
storing, in one register, the output corresponding to the high n bits of an input feature value received from the systolic units, and storing, in another register, the output corresponding to its low n bits; and
obtaining the output corresponding to the input feature value from the outputs for its high n bits and its low n bits.
In one possible implementation, if the systolic units corresponding to the accumulator hold the low n bits of the weight values, obtaining the output for the input feature value from the outputs for its high and low n bits includes:
shifting the output for the high n bits of the input feature value left by n bits and adding it to the output for the low n bits to obtain the output corresponding to the input feature value.
In one possible implementation, if the systolic units corresponding to the accumulator hold the high n bits of the weight values, obtaining the output for the input feature value from the outputs for its high and low n bits includes:
shifting the output for the high n bits of the input feature value left by n bits, adding it to the output for the low n bits, and shifting the sum left by n bits to obtain the output corresponding to the input feature value.
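The two accumulator rules above can be sketched as one function with a flag for which weight half the column holds. The sum of the low-weight column and the high-weight column then equals the full 2n-bit by 2n-bit dot product. The function name and flag are hypothetical, and values are assumed unsigned.

```python
def accumulator_output(high_out, low_out, n, holds_high_weight_bits):
    """Combine the column outputs for the inputs' high and low n bits.

    high_out / low_out: multiply-accumulate results for the high / low n bits
    of the input feature values. If the column holds the weights' high n bits,
    the combined value is shifted left by a further n bits.
    """
    combined = (high_out << n) + low_out
    return combined << n if holds_high_weight_bits else combined
```

Adding `accumulator_output(...)` for the column holding the low weight bits to the one for the column holding the high weight bits reproduces the full dot product.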
In one possible implementation, the outputs for the high n bits and the low n bits of an input feature value are two adjacent outputs received from the systolic units.
In one possible implementation, storing in one register the output for the high n bits of an input feature value received from the systolic units, and in another register the output for its low n bits, includes:
receiving outputs from the systolic units in the multiply-accumulate result register and, every other clock cycle, moving the output for the low n bits from the multiply-accumulate result register into the previous multiply-accumulate result register. The multiply-accumulate result register then holds the output for the high n bits, and the previous multiply-accumulate result register holds the output for the low n bits;
or, receiving outputs from the systolic units in the multiply-accumulate result register and, every other clock cycle, moving the output for the high n bits from the multiply-accumulate result register into the previous multiply-accumulate result register. The multiply-accumulate result register then holds the output for the low n bits, and the previous multiply-accumulate result register holds the output for the high n bits.
In one possible implementation, the method further includes:
filtering out the redundant multiply-accumulate results output by the systolic array according to the stride of the convolution.
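A minimal sketch of stride filtering follows, assuming the array produces every stride-1 "valid" result and no padding offset applies. A stride-s output (i, j) equals the stride-1 output (s*i, s*j), so the redundant results are every row and column whose index is not a multiple of s. The function name is hypothetical.

```python
def filter_by_stride(full_outputs, stride):
    """Keep only the stride-1 results that a strided convolution needs.

    full_outputs is a 2D list of stride-1 'valid' results; rows and columns
    whose index is not a multiple of stride are discarded as redundant.
    """
    return [row[::stride] for row in full_outputs[::stride]]
```

For a 5*5 grid of stride-1 results and stride 2, this keeps the 3*3 results at even row and column indices.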
In one possible implementation, obtaining the output of the corresponding column of systolic units and adding it to the output of the previous-stage accumulator includes:
delaying the result output by the previous-stage accumulator by a number of clock cycles determined by the dilation value of the convolution, and then adding it to the output obtained from the corresponding column of systolic units.
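The delay idea can be illustrated with a simplified 1D model. Column c multiplies each incoming sample by w[c], and stage c adds its product to stage c-1's output delayed by d samples. After the pipeline fills, the last stage emits y[t] = sum_c w[c] * x[t + c*d], which is a convolution with dilation d. Treating the delay as exactly d input steps is an assumption of this sketch. The real cycle count depends on how the array is pipelined, and the function name is hypothetical.

```python
from collections import deque


def dilated_conv1d_via_delays(x, w, d):
    """1D dilated convolution via chained accumulators with delay lines."""
    k = len(w)
    # Delay line between stage c-1 and stage c, pre-filled with d zeros.
    delays = [deque([0] * d) for _ in range(k - 1)]
    outputs = []
    for sample in x:
        prev = 0
        for c in range(k):
            if c == 0:
                acc = w[0] * sample
            else:
                delays[c - 1].append(prev)                    # stage c-1 output now
                acc = delays[c - 1].popleft() + w[c] * sample  # its output d steps ago
            prev = acc
        outputs.append(prev)
    # The first (k-1)*d outputs are pipeline fill and are dropped.
    return outputs[(k - 1) * d:]
```

With `x = [1, ..., 10]`, `w = [1, 2, 3]` and dilation 2, the model returns 22, 28, 34, 40, 46 and 52, matching a direct dilated convolution.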
In one implementable manner, loading the weight values of the weight value matrix into the systolic array includes:
if the number of rows of the weight value matrix is greater than the number of rows of the systolic array, loading only a portion of the weight values of the weight value matrix into the systolic array each time;
correspondingly, determining the output feature value from the accumulation result of the last-stage accumulator includes:
determining whether an intermediate result of the output feature value is stored:
if not, storing the accumulation result of the last-stage accumulator as the intermediate result;
if so, adding the accumulation result of the last-stage accumulator to the stored intermediate result. If the sum is the final result of the output feature value, sending the final result to the output module; if not, replacing the stored intermediate result with the sum;
where the intermediate result is the result computed from a portion of the weight values in the weight value matrix, and the final result is the result computed from all of the weight values in the weight value matrix.
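The intermediate/final-result flow can be sketched for a single output value. The sketch assumes that each tile of at most `array_rows` weights produces one partial sum at the last-stage accumulator. `tile_partials` and `merge_partials` are hypothetical names, and the `stored` variable stands in for the result storage unit.

```python
def tile_partials(x, w, array_rows):
    """One partial dot product per tile of at most `array_rows` weights."""
    return [
        sum(a * b for a, b in zip(x[i:i + array_rows], w[i:i + array_rows]))
        for i in range(0, len(w), array_rows)
    ]


def merge_partials(partials):
    """Combine successive last-stage accumulator results into the final output value."""
    stored = None  # stands in for the result storage unit
    for idx, partial in enumerate(partials):
        total = partial if stored is None else stored + partial
        if idx == len(partials) - 1:
            return total  # final result: sent to the output module
        stored = total    # intermediate result: written back to storage
    return None  # no tiles at all


x, w = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
print(tile_partials(x, w, 2), merge_partials(tile_partials(x, w, 2)))  # [3, 7, 5] 15
```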
An embodiment of the present invention further provides an electronic device that includes the data processing apparatus of any of the foregoing embodiments. The electronic device may be any device that can use convolution operations, such as a computer, a drone, or a handheld device.
For the implementation principle of the electronic device, refer to the related descriptions of the embodiments shown in FIG. 1 to FIG. 12. For the corresponding execution process and technical effects, refer to the descriptions of those embodiments. Details are not repeated here.
The technical solutions and technical features of the above embodiments may be used alone or in combination, provided they do not conflict. As long as such combinations do not exceed the knowledge of those skilled in the art, they are equivalent embodiments within the protection scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed remote control device and method may be implemented in other ways. For example, the remote control device embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and an actual implementation may divide them differently. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, remote control devices, or units. They may be electrical, mechanical, or of other forms.
Units described as separate components may or may not be physically separate. Components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention may be embodied as a software product. This covers the solution in essence, the part that contributes to the prior art, or all or part of the technical solution. The computer software product is stored in a storage medium and includes instructions that cause a computer processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention is likewise included in its scope of patent protection. The same applies to any direct or indirect application of them in other related technical fields.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. The present invention has been described in detail with reference to the foregoing embodiments. Those of ordinary skill in the art should nevertheless understand that the technical solutions recorded in those embodiments may still be modified, and some or all of their technical features may be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (48)

  1. A data processing apparatus, comprising:
    an input module, configured to obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix;
    a calculation module, configured to perform a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and
    an output module, configured to output the output feature value matrix;
    wherein n is a positive integer.
  2. The apparatus according to claim 1, wherein the weight values in an n-bit weight value matrix are n bits long, and the weight values in a 2n-bit weight value matrix are 2n bits long;
    the input feature values in the input feature value matrix have the same length as the weight values in the weight value matrix.
  3. The apparatus according to claim 1, further comprising a memory;
    the memory is configured to store at least one of the following: the input feature value matrix, the output feature value matrix, and the weight value matrix;
    wherein, when the data stored in the memory is 2n bits long, the memory stores m data items in 2n*m bits of storage space, with the high n bits and the low n bits of each data item stored adjacently; m is a positive integer.
  4. The apparatus according to claim 1, wherein the calculation module comprises:
    a systolic array, configured to perform multiply-accumulate operations on the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values; and
    an accumulator array, configured to calculate the output feature value matrix from the multiply-accumulate results obtained by the systolic array.
  5. The apparatus according to claim 4, wherein the calculation module further comprises:
    a control unit, configured to obtain the length of the weight values in the weight value matrix and to control the systolic array and the accumulator array, according to that length, to perform the convolution operation.
  6. The apparatus according to claim 5, wherein the systolic array comprises multiple columns of systolic units;
    each column of systolic units is configured to load weight values and to multiply-accumulate the loaded weight values with the corresponding input feature values, obtaining a multiply-accumulate result for each loaded column of weight values.
  7. The apparatus according to claim 6, wherein a systolic unit can load weight values that are n bits long;
    when the weight values in the weight value matrix are 2n bits long, each column of systolic units loads either the high n bits or the low n bits of the weight values in the weight value matrix.
  8. The apparatus according to claim 7, wherein the high n bits and the low n bits of one column of weight values are loaded into two adjacent columns of systolic units, respectively.
  9. The apparatus according to claim 6, wherein, when the input feature values in the input feature value matrix are 2n bits long, each input feature value that a systolic unit obtains at a time is either the high n bits or the low n bits of an input feature value in the input feature value matrix.
  10. The apparatus according to claim 9, wherein the high n bits or the low n bits of an input feature value are passed in sequence from the first column of systolic units to the last column of systolic units.
  11. The apparatus according to claim 6, wherein the control unit is specifically configured to:
    in a weight loading phase, control the weight values in the weight value matrix to be loaded in sequence into the systolic units of the systolic array; and
    in a calculation phase, control the input feature values in the input feature value matrix to be passed to the right in sequence through the systolic array, and control the systolic units to compute with the loaded weight values and the input feature values passed to them.
  12. The apparatus according to claim 11, wherein, in the weight loading phase, the control unit is specifically configured to:
    in a shift stage of the weight loading phase, for each column of systolic units, feed the weight values to be loaded into that column into the systolic array in sequence through the first systolic unit of the column. Within the systolic array, the received weight values are passed downward in sequence from the first systolic unit; and
    in a load stage of the weight loading phase, control the systolic units in the systolic array to store their corresponding weight values.
  13. The apparatus according to claim 6, wherein each column of systolic units comprises a plurality of systolic units;
    each systolic unit is configured to load a weight value, obtain an input feature value, multiply the input feature value by the loaded weight value, add the product to the output of the systolic unit in the previous row, and output the sum.
  14. The apparatus according to claim 13, wherein the systolic unit comprises:
    a weight value register, configured to store the weight value;
    an input feature value register, configured to store the input feature value;
    a multiplication circuit, configured to obtain the product of the weight value stored in the weight value register and the input feature value stored in the input feature value register; and
    an addition circuit, configured to add the product obtained by the multiplication circuit to the output of the systolic unit in the previous row.
  15. The apparatus according to claim 14, wherein the systolic unit further comprises:
    a weight value shift register, configured to pass weight values to the systolic unit in the next row; and
    an input feature value shift register, configured to pass input feature values to the systolic unit in the next column.
  16. The apparatus according to claim 6, wherein the accumulator array comprises a plurality of accumulators. The number of accumulators and the number of columns of systolic units are both k, and the i-th accumulator corresponds to the i-th column of systolic units, where k is a natural number greater than 1 and i = 1, 2, ..., k;
    each accumulator is configured to obtain the output result of the corresponding column of systolic units, add it to the output result of the previous-stage accumulator, and output the sum to the next-stage accumulator.
  17. The apparatus according to claim 16, wherein, when the input feature values are 2n bits long, the accumulator is specifically configured to:
    store, in one register, the output result for the high n bits of an input feature value obtained from the systolic units, and store, in another register, the output result for the low n bits of that input feature value; and
    obtain the output result for the input feature value from the output results for its high n bits and its low n bits, add it to the output result of the previous-stage accumulator, and output the sum to the next-stage accumulator.
  18. The apparatus according to claim 17, wherein, if the column of the systolic array corresponding to the accumulator holds the low n bits of the weight values, the accumulator, when obtaining the output result for the input feature value from the output results for its high n bits and its low n bits, is specifically configured to:
    shift the output result for the high n bits of the input feature value left by n bits and add it to the output result for the low n bits, yielding the output result for the input feature value.
  19. The apparatus according to claim 17, wherein, if the column of the systolic array corresponding to the accumulator holds the high n bits of the weight values, the accumulator, when obtaining the output result for the input feature value from the output results for its high n bits and its low n bits, is specifically configured to:
    shift the output result for the high n bits of the input feature value left by n bits, add it to the output result for the low n bits, and shift the sum left by n bits, yielding the output result for the input feature value.
  20. The apparatus according to claim 17, wherein the output result for the high n bits and the output result for the low n bits of the input feature value are two adjacent output results obtained from the systolic units.
  21. The apparatus according to claim 16, wherein the accumulator comprises:
    a multiply-accumulate (MAC) result register, configured to obtain the output result of the last systolic unit of the corresponding column;
    a pre-MAC result register, configured to obtain an output result from the MAC result register every other clock cycle when the input feature values are 2n bits long;
    a vertical addition circuit, configured to send to a first-stage addition circuit either the output result in the MAC result register, when the input feature values are n bits long, or the sum of the output results in the MAC result register and the pre-MAC result register, when the input feature values are 2n bits long; and
    the first-stage addition circuit, configured to add the result output by the vertical addition circuit to the result output by the previous-stage accumulator.
  22. The apparatus according to claim 21, wherein the accumulator further comprises a filter circuit;
    the filter circuit is configured to filter out, according to the stride of the convolution operation, the redundant multiply-accumulate results output by the systolic array.
  23. The apparatus according to claim 21, wherein the accumulator further comprises an accumulator result register;
    the accumulator result register is configured to obtain the result output by the previous-stage accumulator and send it to the first-stage addition circuit.
  24. The apparatus according to claim 23, wherein the accumulator further comprises a delay circuit;
    the delay circuit is configured to delay the result output by the previous-stage accumulator by the number of clock cycles set by the dilation value of the convolution operation, and then send it to the accumulator result register.
  25. The apparatus according to claim 21, wherein the accumulator further comprises:
    a sum register, configured to store the result output by the first-stage addition circuit and output it to the next-stage accumulator.
  26. The apparatus according to claim 25, further comprising a result output unit and a result storage unit, wherein the accumulator further comprises a second-stage addition circuit;
    when the number of rows of the weight value matrix is greater than the number of rows of the systolic array, the systolic array loads a portion of the weight values of the weight value matrix each time. The result storage unit is configured to store intermediate results, an intermediate result being the result computed from a portion of the weight values in the weight value matrix;
    the second-stage addition circuit of the last-stage accumulator is configured to add the result in the sum register to the intermediate result that the result output unit reads from the result storage unit, and to output the sum to the result output unit;
    the result output unit is configured to send the result obtained from the second-stage addition circuit to the output module when that result is a final result, and to the result storage unit when it is an intermediate result;
    wherein the final result is the result computed from all of the weight values in the weight value matrix.
  27. An electronic device, comprising the data processing apparatus according to any one of claims 1 to 26.
  28. A data processing method, comprising:
    obtaining an input feature value matrix and an n-bit or 2n-bit weight value matrix;
    performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and
    outputting the output feature value matrix;
    wherein n is a positive integer.
  29. The method according to claim 28, wherein the weight values in an n-bit weight value matrix are n bits long, and the weight values in a 2n-bit weight value matrix are 2n bits long;
    the input feature values in the input feature value matrix have the same length as the weight values in the weight value matrix.
  30. The method according to claim 28, further comprising:
    storing the data of a matrix, the matrix being at least one of the input feature value matrix, the output feature value matrix, and the weight value matrix;
    wherein, when the stored data is 2n bits long, m data items are stored in 2n*m bits of storage space, with the high n bits and the low n bits of each data item stored adjacently; m is a positive integer.
  31. The method according to claim 28, wherein performing the convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain the output feature value matrix comprises:
    performing multiply-accumulate operations on the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values; and
    calculating the output feature value matrix from the multiply-accumulate results obtained by the multiply-accumulate operations.
  32. The method according to claim 31, wherein performing multiply-accumulate operations on the n-bit or 2n-bit weight values in the weight value matrix and the corresponding input feature values comprises:
    loading the weight values of the weight value matrix into a systolic array; and
    multiply-accumulating the weight values loaded into each column of systolic units in the systolic array with the corresponding input feature values, obtaining the multiply-accumulate result for each column of weight values;
    wherein the weight values are n-bit or 2n-bit weight values.
  33. The method according to claim 32, wherein a systolic unit can load weight values that are n bits long;
    when the weight values in the weight value matrix are 2n bits long, each column of systolic units loads either the high n bits or the low n bits of the weight values in the weight value matrix.
  34. The method according to claim 33, wherein the high n bits and the low n bits of one column of weight values are loaded into two adjacent columns of systolic units, respectively.
  35. The method according to claim 32, wherein, when the input feature values in the input feature value matrix are 2n bits long, each input feature value that a systolic unit obtains at a time is either the high n bits or the low n bits of an input feature value in the input feature value matrix.
  36. The method according to claim 35, wherein the high n bits or the low n bits of an input feature value are passed in sequence from the first column of systolic units to the last column of systolic units.
  37. The method according to claim 32, wherein multiply-accumulating the weight values loaded into each column of systolic units in the systolic array with the corresponding input feature values, to obtain the multiply-accumulate result for each column of weight values, comprises:
    passing the input feature values in the input feature value matrix to the right in sequence through the systolic array, and having each column of systolic units multiply-accumulate its loaded weight values with the input feature values passed to it, obtaining the multiply-accumulate result for each column of weight values.
  38. The method according to claim 32, wherein loading the weight values of the weight value matrix into the systolic array comprises:
    in a shift stage of the weight loading phase, for each column of systolic units, feeding the weight values to be loaded into that column into the systolic array in sequence through the first systolic unit of the column. Within the systolic array, the received weight values are passed downward in sequence from the first systolic unit; and
    in a load stage of the weight loading phase, storing the corresponding weight values in the systolic units of the systolic array.
  39. The method according to claim 37, wherein having each column of systolic units multiply-accumulate its loaded weight values with the input feature values passed to it, to obtain the multiply-accumulate result for each column of weight values, comprises:
    performing the following for each column of systolic units:
    - obtaining an input feature value through each systolic unit in the column;
    - multiplying the obtained input feature value by the weight value loaded in that systolic unit;
    - adding the product to the output of the previous systolic unit, and outputting the sum.
    The output of the last systolic unit is the multiply-accumulate result for the column.
  40. The method according to claim 32, wherein an accumulator is provided for each column of systolic units, and calculating the output feature value matrix from the multiply-accumulate results obtained by the multiply-accumulate operations comprises:
    obtaining, by each accumulator, the output result of the corresponding column of systolic units and adding it to the output result of the previous-stage accumulator to obtain the output result of that accumulator; and determining the output feature value from the output result of the last-stage accumulator.
  41. The method according to claim 40, wherein, if the input feature values are 2n bits long, obtaining the output result of the corresponding column of systolic units comprises:
    storing, in one register, the output result for the high n bits of an input feature value obtained from the systolic units, and storing, in another register, the output result for the low n bits of that input feature value; and
    obtaining the output result for the input feature value from the output results for its high n bits and its low n bits.
  42. The method according to claim 41, wherein, if the column of the systolic array corresponding to the accumulator holds the low n bits of the weight values, obtaining the output result for the input feature value from the output results for its high n bits and its low n bits comprises:
    shifting the output result for the high n bits of the input feature value left by n bits and adding it to the output result for the low n bits, yielding the output result for the input feature value.
  43. The method according to claim 41, wherein, if the systolic array corresponding to the accumulator is loaded with the high n bits of the weight values, obtaining the output result corresponding to the input feature value from the output results for the high n bits and the low n bits of the input feature value comprises:
    shifting the output result for the high n bits of the input feature value left by n bits, adding it to the output result for the low n bits, and shifting the sum left by n bits, to obtain the output result corresponding to the input feature value.
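The extra left shift in claim 43 accounts for the weight's high half carrying a factor of 2^n, since w = w_hi · 2^n + w_lo. In the sketch below, the results of the low-weight path and the high-weight path are added to recover the full 2n-bit by 2n-bit product. That final addition is an assumption about how the two arrays' results are merged; the claims do not state it. Operands are assumed unsigned, and `combine_high_weight` and `product_from_halves` are hypothetical names.

```python
def combine_high_weight(reg_high: int, reg_low: int, n: int) -> int:
    # Claim 43: shift high-half result left by n, add low-half result, shift sum left by n.
    return ((reg_high << n) + reg_low) << n


def product_from_halves(x: int, w: int, n: int) -> int:
    mask = (1 << n) - 1
    x_hi, x_lo = (x >> n) & mask, x & mask
    w_hi, w_lo = (w >> n) & mask, w & mask
    # Array loaded with the low n bits of w (combination as in claim 42).
    low_path = ((x_hi * w_lo) << n) + x_lo * w_lo
    # Array loaded with the high n bits of w (combination as in claim 43).
    high_path = combine_high_weight(x_hi * w_hi, x_lo * w_hi, n)
    # Assumed merge step: the two paths together give x * w.
    return low_path + high_path


print(product_from_halves(0xABCD, 0x1234, 8) == 0xABCD * 0x1234)  # True
```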
  44. The method according to claim 41, wherein the output result for the high n bits and the output result for the low n bits of the input feature value are two adjacent output results obtained from the systolic units.
  45. The method according to claim 41, wherein storing, in one register, the output result obtained from the systolic units for the high n bits of the input feature value, and storing, in another register, the output result for the low n bits of the input feature value, comprises:
    obtaining output results from the systolic units through a multiply-accumulate result register and, every other clock cycle, sending the output result corresponding to the low n bits from the multiply-accumulate result register to a previous multiply-accumulate result register; storing the output result corresponding to the high n bits in the multiply-accumulate result register, and storing the output result corresponding to the low n bits in the previous multiply-accumulate result register;
    or, obtaining output results from the systolic units through the multiply-accumulate result register and, every other clock cycle, sending the output result corresponding to the high n bits from the multiply-accumulate result register to the previous multiply-accumulate result register; storing the output result corresponding to the low n bits in the multiply-accumulate result register, and storing the output result corresponding to the high n bits in the previous multiply-accumulate result register.
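The register hand-off in claims 44 and 45 can be modeled at the cycle level. The sketch assumes the systolic units emit one result per clock cycle and that the two halves of each input arrive on adjacent cycles. On every second cycle, the multiply-accumulate result register forwards its held value to the "previous" register just before capturing the next one. `pair_results` and `low_first` are hypothetical names; `low_first=True` models the first alternative of claim 45.

```python
from typing import List, Tuple


def pair_results(stream: List[int], low_first: bool = True) -> List[Tuple[int, int]]:
    # The MAC result register captures one systolic-unit output per clock cycle.
    # Every other cycle it forwards its held value to the "previous" register
    # just before capturing the next value, so the two registers then hold
    # two adjacent outputs. Returns (high_result, low_result) pairs.
    mac_reg = 0
    prev_reg = 0
    pairs: List[Tuple[int, int]] = []
    for cycle, value in enumerate(stream):
        second_of_pair = cycle % 2 == 1
        if second_of_pair:
            prev_reg = mac_reg  # forward the value captured on the previous cycle
        mac_reg = value
        if second_of_pair:
            if low_first:
                # First alternative: previous register holds low, MAC register holds high.
                pairs.append((mac_reg, prev_reg))
            else:
                # Second alternative: previous register holds high, MAC register holds low.
                pairs.append((prev_reg, mac_reg))
    return pairs


print(pair_results([10, 20, 30, 40]))                   # [(20, 10), (40, 30)]
print(pair_results([10, 20, 30, 40], low_first=False))  # [(10, 20), (30, 40)]
```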
  46. The method according to claim 41, further comprising:
    filtering the redundant multiply-accumulate results output by the systolic array according to the stride of the convolution operation.
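One way to read claim 46 is sketched below. It assumes the array produces a result at every stride-1 position. A convolution with stride s then keeps every s-th result and discards the rest as redundant. The 1-D setting and the names `conv1d` and `filter_by_stride` are illustrative assumptions.

```python
from typing import List


def conv1d(x: List[int], w: List[int], stride: int = 1) -> List[int]:
    # Valid 1-D correlation with the given stride.
    out_len = (len(x) - len(w)) // stride + 1
    return [sum(w[k] * x[i * stride + k] for k in range(len(w))) for i in range(out_len)]


def filter_by_stride(dense: List[int], stride: int) -> List[int]:
    # Keep every stride-th stride-1 result and drop the redundant ones in between.
    return dense[::stride]


x, w = [1, 2, 3, 4, 5, 6, 7], [1, 2]
print(filter_by_stride(conv1d(x, w), 2))  # [5, 11, 17]
print(conv1d(x, w, stride=2))             # [5, 11, 17]
```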
  47. The method according to claim 41, wherein obtaining the output result of the corresponding column of systolic units and adding it to the output result of the previous-stage accumulator comprises:
    delaying the output result of the previous-stage accumulator by the number of clock cycles corresponding to the dilation value of the convolution operation, and then adding it to the output result obtained from the corresponding column of systolic units.
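A simplified cycle model shows why delaying the previous accumulator's output yields a dilated convolution. The model assumes that at cycle t every column c outputs `w[c] * x[t]`, and that each accumulator adds this value to the previous accumulator's output from `dilation` cycles earlier. The last accumulator then produces the sum over k of `w[k] * x[t0 + k * dilation]` once the delay lines have filled. The 1-D setting, the FIFO delay lines and the names `dilated_chain` and `dilated_reference` are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


def dilated_chain(x: List[int], w: List[int], dilation: int) -> List[int]:
    # Accumulator c computes w[c] * x[t] + (accumulator c-1 output from `dilation` cycles ago).
    if dilation < 1:
        raise ValueError("dilation must be at least 1")
    delay_lines: List[Deque[int]] = [deque([0] * dilation) for _ in range(len(w) - 1)]
    last_stage: List[int] = []
    for xt in x:
        acc = w[0] * xt  # first accumulator has no predecessor
        for c in range(1, len(w)):
            delayed = delay_lines[c - 1].popleft()  # output of accumulator c-1, `dilation` cycles old
            delay_lines[c - 1].append(acc)
            acc = w[c] * xt + delayed
        last_stage.append(acc)
    warmup = (len(w) - 1) * dilation  # cycles before the delay lines hold real data
    return last_stage[warmup:]


def dilated_reference(x: List[int], w: List[int], dilation: int) -> List[int]:
    span = (len(w) - 1) * dilation
    return [sum(w[k] * x[i + k * dilation] for k in range(len(w))) for i in range(len(x) - span)]


x, w = list(range(1, 11)), [1, 2, 3]
print(dilated_chain(x, w, 2) == dilated_reference(x, w, 2))  # True
```

With `dilation=1`, the same chain reduces to an ordinary systolic convolution.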
  48. The method according to claim 40, wherein loading the weight values of the weight value matrix into the systolic array comprises:
    if the number of rows of the weight value matrix is greater than the number of rows of the systolic array, loading a part of the weight values of the weight value matrix into the systolic array each time;
    correspondingly, determining the output feature value from the accumulation result of the last-stage accumulator comprises:
    determining whether an intermediate result of the output feature value is stored:
    if not, storing the accumulation result of the last-stage accumulator as the intermediate result;
    if so, adding the accumulation result of the last-stage accumulator to the stored intermediate result; if the sum is the final result of the output feature value, sending the final result to an output module; if the sum is not the final result of the output feature value, updating the stored intermediate result to the sum;
    wherein the intermediate result is the result obtained after computing with a part of the weight values in the weight value matrix, and the final result is the result obtained after computing with all of the weight values in the weight value matrix.
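The tiling and intermediate-result logic of claim 48 can be sketched as follows. In this model, `weights[r][c]` pairs with `inputs[r][c]`, and only `array_rows` weight rows fit in the array per pass. Each pass's last-stage accumulator result is stood in for by a plain sum over the loaded rows. The names `tiled_output` and `array_rows` are hypothetical, and returning the value stands in for sending it to the output module.

```python
from typing import List, Optional


def tiled_output(weights: List[List[int]], inputs: List[List[int]], array_rows: int) -> int:
    total_rows = len(weights)
    intermediate: Optional[int] = None  # no intermediate result stored yet
    for start in range(0, total_rows, array_rows):
        stop = min(start + array_rows, total_rows)
        # Stand-in for the last-stage accumulator result of this pass.
        partial = sum(w * x for r in range(start, stop) for w, x in zip(weights[r], inputs[r]))
        if intermediate is None:
            intermediate = partial        # "if not": store as the intermediate result
            if stop == total_rows:        # only when the whole matrix fit in one pass
                return intermediate
            continue
        combined = intermediate + partial  # "if so": add to the stored intermediate result
        if stop == total_rows:
            return combined                # final result, sent to the output module
        intermediate = combined            # not final yet: update the stored result
    return 0  # empty weight matrix


W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
X = [[1, 1], [2, 2], [1, 0], [0, 1], [1, 1]]
print(tiled_output(W, X, array_rows=2))  # 49
```

The result does not depend on `array_rows`. Tiling changes only how many passes are needed and how often the intermediate result is read back and updated.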
PCT/CN2020/076556 2020-02-25 2020-02-25 Data processing apparatus, electronic device, and data processing method WO2021168644A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/076556 WO2021168644A1 (en) 2020-02-25 2020-02-25 Data processing apparatus, electronic device, and data processing method
CN202080004607.0A CN112639836A (en) 2020-02-25 2020-02-25 Data processing device, electronic equipment and data processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/076556 WO2021168644A1 (en) 2020-02-25 2020-02-25 Data processing apparatus, electronic device, and data processing method

Publications (1)

Publication Number Publication Date
WO2021168644A1 2021-09-02

Family

ID=75291163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/076556 WO2021168644A1 (en) 2020-02-25 2020-02-25 Data processing apparatus, electronic device, and data processing method

Country Status (2)

Country Link
CN (1) CN112639836A (en)
WO (1) WO2021168644A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113344183B (en) * 2021-06-03 2022-09-30 沐曦集成电路(上海)有限公司 Method for realizing convolution operation in computing system and computing system
CN114237551B (en) * 2021-11-26 2022-11-11 南方科技大学 Multi-precision accelerator based on pulse array and data processing method thereof

Citations (6)

Publication number Priority date Publication date Assignee Title
CN107423816A (en) * 2017-03-24 2017-12-01 中国科学院计算技术研究所 A kind of more computational accuracy Processing with Neural Network method and systems
US20180322607A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Dynamic precision management for integer deep learning primitives
CN108885714A (en) * 2017-11-30 2018-11-23 深圳市大疆创新科技有限公司 The control method of computing unit, computing system and computing unit
CN110458277A (en) * 2019-04-17 2019-11-15 上海酷芯微电子有限公司 The convolution hardware configuration of configurable precision suitable for deep learning hardware accelerator
CN110780844A (en) * 2018-07-24 2020-02-11 爱思开海力士有限公司 Neural network acceleration device and operation method thereof
CN110785778A (en) * 2018-08-14 2020-02-11 深圳市大疆创新科技有限公司 Neural network processing device based on pulse array

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
JP6794854B2 (en) * 2017-02-02 2020-12-02 富士通株式会社 Arithmetic processing unit and control method of arithmetic processing unit
CN109104876B (en) * 2017-04-20 2021-06-25 上海寒武纪信息科技有限公司 Arithmetic device and related product
TWI672643B (en) * 2018-05-23 2019-09-21 倍加科技股份有限公司 Full index operation method for deep neural networks, computer devices, and computer readable recording media

Also Published As

Publication number Publication date
CN112639836A (en) 2021-04-09

Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 20922349; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 EP: PCT application non-entry in European phase
Ref document number: 20922349; Country of ref document: EP; Kind code of ref document: A1