WO2021168644A1

WO2021168644A1 - Data processing apparatus, electronic device, and data processing method

Info

Publication number: WO2021168644A1
Application number: PCT/CN2020/076556
Authority: WO
Inventors: 杨康; 韩峰
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2021-09-02
Also published as: CN112639836A

Abstract

A data processing apparatus, an electronic device, and a data processing method. The apparatus comprises: an input module (1), used for obtaining an input feature value matrix and an n-bit or 2n-bit weight value matrix; a calculation module (2), used for performing convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix; and an output module (3), used for outputting the output feature value matrix, wherein n is a positive integer. The present invention can achieve the convolution operation of data of two lengths, improve the accuracy of a deep convolutional neural network, and adapt to design requirements of different deep convolutional neural networks.

Description

Data processing device, electronic equipment and data processing method

Technical field

The embodiments of the present invention relate to the field of data processing technology, and in particular to a data processing device, electronic equipment, and a data processing method.

Background art

A deep convolutional neural network is a machine learning algorithm that is widely used in computer vision tasks such as target recognition, target detection, and image semantic segmentation.

Most of the operations of a deep convolutional neural network are convolution operations. Designing a dedicated hardware circuit to accelerate the convolution operations of the convolutional layers can greatly reduce the computation time of the network. The operands of existing convolution operation devices support fixed-point numbers of only one width, such as 8-bit fixed-point numbers. They therefore cannot process the data of deep convolutional neural networks with higher precision requirements, and it is difficult for them to meet the design requirements that arise as the accuracy of deep convolutional neural networks keeps improving.

Summary of the invention

The embodiment of the present invention provides a data processing device, an electronic device, and a data processing method to solve the technical problem that the convolution operation device in the prior art cannot meet the accuracy requirements of a deep convolutional neural network.

The first aspect of the embodiments of the present invention provides a data processing device, including:

An input module, used to obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix;

A calculation module, used to perform a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix;

An output module, used to output the output feature value matrix;

Wherein n is a positive integer.

A second aspect of the embodiments of the present invention provides an electronic device, including the data processing apparatus described in the first aspect.

A third aspect of the embodiments of the present invention provides a data processing method, including:

Obtaining an input feature value matrix and an n-bit or 2n-bit weight value matrix;

Performing a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain an output feature value matrix;

Outputting the output feature value matrix;

Wherein n is a positive integer.

The data processing device, electronic equipment, and data processing method provided by the embodiments of the present invention can realize the convolution operation of data of two lengths, improve the precision of the deep convolutional neural network, and adapt to the design requirements of different deep convolutional neural networks.

Description of the drawings

The drawings described here are used to provide a further understanding of the present invention and constitute a part of the present invention. The exemplary embodiments of the present invention and the description thereof are used to explain the present invention, and do not constitute an improper limitation of the present invention. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of an embodiment of the present invention;

FIG. 2 is a schematic diagram of a convolution operation process in the application scenario shown in FIG. 1;

FIG. 3 is a schematic structural diagram of a data processing device provided by Embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of the principle of convolution operation performed by a data processing device according to Embodiment 1 of the present invention;

FIG. 5 is a schematic structural diagram of a data processing device according to Embodiment 2 of the present invention;

FIG. 6 is a schematic structural diagram of a pulsation unit in a data processing device according to Embodiment 3 of the present invention;

FIG. 7 is a schematic structural diagram of an accumulator in a data processing device according to Embodiment 3 of the present invention;

FIG. 8 is a schematic diagram of a convolution operation process of n-bit data performed by the data processing device according to Embodiment 3 of the present invention;

FIG. 9 is a schematic diagram of a convolution operation process of 2n-bit data performed by the data processing device according to Embodiment 3 of the present invention;

FIG. 10 is a schematic structural diagram of a data processing device according to Embodiment 4 of the present invention;

FIG. 11 is a schematic diagram of a storage format when a data processing device stores n-bit data according to Embodiment 4 of the present invention;

FIG. 12 is a schematic diagram of a storage format when a data processing device stores 2n-bit data according to Embodiment 4 of the present invention;

FIG. 13 is a schematic flowchart of a data processing method according to Embodiment 5 of the present invention.

Reference signs:

1 - input module
2 - calculation module
3 - output module
4 - memory
11 - weight value loading module
12 - input characteristic value loading module
21 - pulsation unit
22 - accumulator
23 - control unit
24 - weight value injection unit
25 - input characteristic value injection unit
26 - result output unit
27 - result storage unit
211 - weight value register
212 - input characteristic value register
213 - multiplication circuit
214 - adding circuit
215 - weight value shift register
216 - input characteristic value shift register
217 - multiplication result register
221 - multiplication and accumulation result register
222 - previous multiplication and accumulation result register
223 - vertical addition circuit
224 - first-stage addition circuit
225 - filter circuit
226 - accumulator result register
227 - sum register
228 - delay circuit
229 - second-stage addition circuit

Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification of the present invention herein are only for the purpose of describing specific embodiments, and are not intended to limit the present invention.

Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present invention. The data processing device and data processing method provided by the embodiments of the present invention can be applied to any scene that requires convolution operation, such as a deep convolutional neural network.

As shown in FIG. 1, the deep convolutional neural network to which the embodiment of the present invention can be applied includes: input, output, and hidden layers. Each layer in the network shown in Figure 1 can have one input and one output. In an actual deep convolutional neural network, each layer may have multiple inputs or multiple outputs.

The hidden layer of the deep convolutional neural network is composed of a set of cascaded feature maps and operations. The operation of the hidden layer includes convolution, pooling, activation and so on. The feature map of the hidden layer is generated after the above operation is performed on the feature map of the previous layer. In general, the layers in a convolutional neural network can be named according to the type of operation. For example, the layer that performs the convolution operation can be classified as a convolutional layer, and the layer that performs a pooling operation can be classified as a pooling layer.

The convolution operation process of the convolution layer is: use a set of weight values to perform vector inner product operation on a set of input feature maps, and then output a set of feature maps. The input weight value is also called a filter or a convolution kernel.

The weight value, the input feature map, and the output feature map can all be expressed as a multi-dimensional matrix. The input feature map can be expressed as an input feature value matrix, and the elements in the matrix are recorded as input feature values; the output feature map can be expressed as an output feature value matrix, and the elements in the matrix are recorded as output feature values.

FIG. 2 is a schematic diagram of the convolution operation process in the application scenario shown in FIG. 1. As shown in FIG. 2, a weight value matrix of R*R*N is convolved with an input feature value matrix of H*H*N to obtain an output feature value matrix of E*E*N. Each output feature value in the output feature value matrix can be obtained by an inner product operation between part of the input feature values in the input feature value matrix and the weight values of the weight value matrix.

The technical solutions provided by the embodiments of the present invention can support n-bit or 2n-bit convolution operations. The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings.

Embodiment 1

The first embodiment of the present invention provides a data processing device. FIG. 3 is a schematic structural diagram of a data processing device according to Embodiment 1 of the present invention. As shown in Figure 3, the data processing device in this embodiment may include:

The input module 1 is used to obtain an input feature value matrix and an n-bit or 2n-bit weight value matrix, where n is a positive integer;

The calculation module 2 is used to perform a convolution operation on the input feature value matrix and the n-bit or 2n-bit weight value matrix to obtain the output feature value matrix;

The output module 3 is used to output the output feature value matrix.

Specifically, the input module 1 may be connected to a memory or to other modules, and is used to obtain the input feature value matrix and the weight value matrix on which the convolution operation is to be performed. Optionally, a connection described in the embodiments of the present invention may be a physical connection or a communication connection.

The weight value matrix may be an n-bit weight value matrix or a 2n-bit weight value matrix. An n-bit weight value matrix may mean that each weight value in the matrix is n bits long; a 2n-bit weight value matrix may mean that each weight value in the matrix is 2n bits long. Optionally, the length of the input feature values in the input feature value matrix may be the same as the length of the weight values in the weight value matrix. For example, when the weight value matrix is 2n bits, the input feature value matrix may also be 2n bits. This ensures that the input feature value matrix and the weight value matrix can be convolved directly, which improves operation efficiency and accuracy.

The calculation module 2 can be connected to the input module 1 to obtain the input feature value matrix and the weight value matrix and perform the convolution operation. Specifically, part of the input feature values in the input feature value matrix can be multiplied by the corresponding weight values in the weight value matrix and accumulated to obtain the corresponding output feature value.

FIG. 4 is a schematic diagram of the principle of the convolution operation performed by the data processing device according to Embodiment 1 of the present invention. As shown in FIG. 4, the input feature value matrix is:

X₀₀    X₀₁    X₀₂    X₀₃    X₀₄
X₁₀    X₁₁    X₁₂    X₁₃    X₁₄
X₂₀    X₂₁    X₂₂    X₂₃    X₂₄

The weight value matrix is:

W₀₀    W₀₁
W₁₀    W₁₁

The output feature value matrix is:

Y₀₀    Y₀₁    Y₀₂    Y₀₃
Y₁₀    Y₁₁    Y₁₂    Y₁₃

Here, X_ij is the j-th input feature value in the i-th row of the input feature value matrix, W_ij is the j-th weight value in the i-th row of the weight value matrix, and Y_ij is the j-th output feature value in the i-th row of the output feature value matrix. The weight value matrix contains 2*2 weight values. It traverses each 2*2 part of the input feature value matrix and performs an inner product operation with that part to obtain the corresponding output feature value, namely:

Y_ij = X_ij*W₀₀ + X_i(j+1)*W₀₁ + X_(i+1)j*W₁₀ + X_(i+1)(j+1)*W₁₁

As shown in FIG. 4, the weight value matrix is first applied to the 2*2 part of the input feature value matrix framed by the thick line in the upper left corner, giving the output feature value Y₀₀ = X₀₀*W₀₀ + X₀₁*W₀₁ + X₁₀*W₁₀ + X₁₁*W₁₁. The thick-line frame then moves one column to the right, and the weight value matrix is applied to the next 2*2 part to obtain the corresponding output feature value Y₀₁ = X₀₁*W₀₀ + X₀₂*W₀₁ + X₁₁*W₁₀ + X₁₂*W₁₁. Continuing in this way until every 2*2 frame has been traversed yields all of the output feature values.
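The traversal described for FIG. 4 can be sketched in a few lines of Python. This is only an illustration: the function name `conv2d_valid`, the sample numbers, and the assumption of a stride of one with no padding (which matches the 3*5 input, 2*2 weight, and 2*4 output shapes of FIG. 4) are not taken from the original text.

```python
def conv2d_valid(x, w):
    """Slide w over x with stride 1 and no padding (assumed) and return the outputs."""
    r_h, r_w = len(w), len(w[0])
    out_h = len(x) - r_h + 1
    out_w = len(x[0]) - r_w + 1
    y = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Inner product of the weight matrix with the window whose
            # top-left corner is at (i, j): Y_ij = sum of X_(i+a)(j+b) * W_ab.
            y[i][j] = sum(
                x[i + a][j + b] * w[a][b]
                for a in range(r_h)
                for b in range(r_w)
            )
    return y


# Hypothetical sample values with the shapes used in FIG. 4.
X = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10],
     [11, 12, 13, 14, 15]]
W = [[1, 2],
     [3, 4]]
Y = conv2d_valid(X, W)
```

With these sample values, `Y[0][0]` is 1*1 + 2*2 + 6*3 + 7*4 = 51, which is the Y₀₀ expression above evaluated on the top-left window.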

Each matrix shown in FIG. 4 is a two-dimensional matrix. In practical applications, the weight value matrix, the input feature value matrix, and the output feature value matrix may each be a two-dimensional or three-dimensional matrix. The principle of the convolution operation on three-dimensional matrices is similar to that on two-dimensional matrices and is not repeated here.

If the obtained input feature value matrix and weight value matrix are n bits, the output feature values in the output feature value matrix may also be n bits long; if the obtained input feature value matrix and weight value matrix are 2n bits, the output feature values in the output feature value matrix may also be 2n bits long.

The output module 3 may be connected to the calculation module 2 to obtain the output feature value matrix calculated by the calculation module 2 and output it. There are many ways to output. For example, the output feature value matrix may be displayed to the user or passed to the next convolutional layer for the next level of convolution operation.

In practical applications, the device can support convolution operations on data of two lengths, n bits and 2n bits, for example 8-bit and 16-bit fixed-point convolution operations. When the convolutional neural network is converted to fixed point using fixed-point numbers of length n bits and the network accuracy meets the design requirements, the device can perform convolution operations with n-bit fixed-point numbers, and the same hardware resources can provide higher convolution concurrency. If the accuracy loss after fixed-point conversion with n-bit fixed-point numbers is too large to meet the design requirements, the device can be switched to perform convolution operations with 2n-bit fixed-point numbers, and the network can likewise be converted using 2n-bit fixed-point numbers, thereby reducing the accuracy loss after fixed-point conversion.
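The trade-off between the two lengths can be illustrated with a small Python sketch of fixed-point conversion for n = 8. The helper names `to_fixed` and `from_fixed`, the split between integer and fractional bits, and the sample weight 0.3 are assumptions chosen for illustration; the original text does not specify a fixed-point format.

```python
def to_fixed(value, total_bits, frac_bits):
    """Quantize a real value to a signed fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


weight = 0.3  # hypothetical weight value

# 8-bit format with 6 fractional bits versus 16-bit format with 14 (assumed split).
q8 = to_fixed(weight, 8, 6)
q16 = to_fixed(weight, 16, 14)
err8 = abs(from_fixed(q8, 6) - weight)
err16 = abs(from_fixed(q16, 14) - weight)
```

Under these assumptions the 16-bit representation reproduces the weight with a much smaller error than the 8-bit one, which is the kind of accuracy gain the 2n-bit mode is meant to provide at the cost of concurrency.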

The data processing device provided in this embodiment includes an input module 1, a calculation module 2, and an output module 3. The input module 1 can be used to obtain an n-bit or 2n-bit weight value matrix and an input feature value matrix. The calculation module 2 can convolve the obtained n-bit or 2n-bit weight value matrix with the input feature value matrix to obtain an n-bit or 2n-bit output feature value matrix. The output module 3 can output the n-bit or 2n-bit output feature value matrix. Convolution operations on data of two lengths are thereby realized. When higher precision is required, 2n-bit data can be used for the convolution operation, which improves the accuracy of the deep convolutional neural network and adapts the device to the design requirements of different deep convolutional neural networks.

Embodiment 2

The second embodiment of the present invention provides a data processing device. In this embodiment, on the basis of the technical solutions provided by the foregoing embodiments, convolution operations are implemented through systolic arrays, accumulator arrays, and the like. FIG. 5 is a schematic structural diagram of a data processing device according to Embodiment 2 of the present invention. As shown in Figure 5, the data processing device in this embodiment may include:

The input module is used to obtain an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix. The input module may specifically include a weight value loading module 11 and an input feature value loading module 12. The weight value loading module 11 is used to obtain the n-bit or 2n-bit weight value matrix, and the input feature value loading module 12 is used to obtain the n-bit or 2n-bit input feature value matrix;

The calculation module 2 is configured to perform a convolution operation on the input feature value matrix and the weight value matrix to obtain an output feature value matrix;

The output module 3 is used to output the output feature value matrix.

Wherein, the calculation module 2 may include:

A systolic array, used to implement the multiplication and accumulation of the n-bit or 2n-bit weight values in the weight value matrix with the corresponding input feature values;

An accumulator array, used to calculate the output feature value matrix from the multiplication and accumulation results obtained by the systolic array.

Specifically, the systolic array can calculate the multiplication and accumulation result corresponding to each column of weight values in the weight value matrix, and the accumulator array adds the results for all columns to obtain the output feature value. Alternatively, the systolic array can calculate the multiplication and accumulation result corresponding to each row of weight values in the weight value matrix, and the accumulator array adds the results for all rows to obtain the output feature value.

Take the matrices shown in FIG. 4 as an example, and consider the output feature value for the input feature values in the thick-line frame in the upper left corner. The systolic array can calculate the multiplication and accumulation result of each column of weight values with the corresponding input feature values. The first column of weight values contains W₀₀ and W₁₀, and its multiplication and accumulation result is X₀₀*W₀₀ + X₁₀*W₁₀. The second column of weight values contains W₀₁ and W₁₁, and its multiplication and accumulation result is X₀₁*W₀₁ + X₁₁*W₁₁. The accumulator array adds the results for the two columns to obtain the output feature value Y₀₀ = X₀₀*W₀₀ + X₀₁*W₀₁ + X₁₀*W₁₀ + X₁₁*W₁₁.

Alternatively, the systolic array can calculate the multiplication and accumulation result of each row of weight values with the corresponding input feature values. The first row of weight values contains W₀₀ and W₀₁, and its multiplication and accumulation result is X₀₀*W₀₀ + X₀₁*W₀₁. The second row of weight values contains W₁₀ and W₁₁, and its multiplication and accumulation result is X₁₀*W₁₀ + X₁₁*W₁₁. The accumulator array adds the results for the two rows to obtain the output feature value Y₀₀ = X₀₀*W₀₀ + X₀₁*W₀₁ + X₁₀*W₁₀ + X₁₁*W₁₁.
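The equivalence of the column-wise and row-wise groupings can be checked numerically. The following Python sketch uses hypothetical values for the top-left 2*2 window and the weights; the variable names are illustrative only.

```python
# Hypothetical values for the top-left 2*2 window (x) and the weight matrix (w).
x = [[1, 2],
     [6, 7]]
w = [[1, 2],
     [3, 4]]

# Column-wise grouping: one multiplication and accumulation result per weight column.
col_sums = [sum(x[r][c] * w[r][c] for r in range(2)) for c in range(2)]

# Row-wise grouping: one multiplication and accumulation result per weight row.
row_sums = [sum(x[r][c] * w[r][c] for c in range(2)) for r in range(2)]

# The accumulator array adds the partial results; both groupings give the same Y00.
y00_from_cols = sum(col_sums)
y00_from_rows = sum(row_sums)
```

Both groupings only reorder the same four products, so either mapping onto the systolic array yields the same output feature value.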

In the drawings provided by the embodiments of the present invention, MC represents a pulsation unit and ACC represents an accumulator. As shown in FIG. 5, the systolic array may include multiple columns of pulsation units 21. Each column of pulsation units 21 can be used to load weight values and to multiply and accumulate the loaded weight values with the corresponding input feature values, obtaining the multiplication and accumulation result corresponding to the weight values loaded in that column.

The number of columns of pulsation units 21 used in the calculation can be equal to the number of columns of the weight value matrix, with one column of pulsation units 21 loading one column of weight values of the weight value matrix. Alternatively, the number of columns of pulsation units 21 used in the calculation may be equal to the number of rows of the weight value matrix, with one column of pulsation units 21 loading one row of weight values of the weight value matrix. For ease of description, the embodiments of the present invention take the case in which one column of pulsation units 21 loads one column of weight values as an example.

Each pulsation unit 21 in a column of pulsation units 21 can be loaded with one weight value and can obtain one input feature value. It multiplies the input feature value by the loaded weight value, adds the product to the output of the pulsation unit 21 in the previous row of the same column, and outputs the sum. The result output by the last pulsation unit 21 of each column is the multiplication and accumulation result corresponding to that column.
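The behaviour of one column of pulsation units can be modelled as a chain in which each unit adds its own product to the partial sum arriving from the unit above it. The Python sketch below is a behavioural toy, not a description of the circuit: the class name `PulsationUnit`, the function `column_mac`, and the choice of zero as the partial sum fed to the first unit are assumptions.

```python
class PulsationUnit:
    """Toy model of one unit: holds a weight, multiplies, and adds the incoming partial sum."""

    def __init__(self, weight):
        self.weight = weight

    def step(self, feature, partial_in):
        # Product of the input feature value and the loaded weight,
        # added to the output of the unit above.
        return partial_in + feature * self.weight


def column_mac(weights, features):
    """Chain the units of one column from top to bottom and return the last output."""
    partial = 0  # assumed: the first unit of a column receives zero from above
    for weight, feature in zip(weights, features):
        partial = PulsationUnit(weight).step(feature, partial)
    return partial


# First weight column (W00, W10) with the matching inputs (X00, X10), hypothetical values.
first_column_result = column_mac([1, 3], [1, 6])
```

The value returned for the last unit corresponds to the per-column multiplication and accumulation result, here 1*1 + 6*3.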

The accumulator array may include a plurality of accumulators 22. The number of accumulators 22 is equal to the number of columns of pulsation units 21, and the accumulators 22 are connected to the columns of pulsation units 21 in one-to-one correspondence. Specifically, assuming that the number of accumulators 22 and the number of columns of pulsation units 21 are both k, the i-th accumulator 22 is connected to the i-th column of pulsation units 21, where k is a natural number greater than 1 and i = 1, 2, ..., k.

Here, an accumulator 22 being connected to a column of pulsation units 21 may mean that it is connected to the last pulsation unit 21 in that column.

The accumulator 22 is used to obtain the output result of the corresponding column of pulsation units 21, add it to the output result of the previous-stage accumulator 22, and output the sum to the next-stage accumulator 22, so that the results output by the columns of pulsation units 21 are accumulated.
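The stage-by-stage accumulation can be sketched as a running sum over the per-column results. This is a minimal Python illustration; the function name `accumulator_chain` and the assumption that the first accumulator starts from zero are not taken from the original text.

```python
def accumulator_chain(column_results):
    """Return the output of each accumulator stage, from the first to the last."""
    running = 0  # assumed: the first accumulator has no previous stage
    stage_outputs = []
    for column_result in column_results:
        # Each stage adds its own column's result to the previous stage's output.
        running += column_result
        stage_outputs.append(running)
    return stage_outputs


# Per-column results for the FIG. 4 example with hypothetical values.
stages = accumulator_chain([19, 32])
```

The output of the last stage is the sum over all columns, which in the two-column example is the output feature value.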

Optionally, the calculation module 2 may further include a result output unit 26 and a result storage unit 27. When the number of rows of the weight value matrix is greater than the number of rows of the systolic array, the systolic array can load a part of the weight values in the weight value matrix each time. The result storage unit 27 is used to store intermediate results, where an intermediate result is the result obtained after only some of the weight values in the weight value matrix have been processed.

After the output results of the columns of pulsation units 21 have been accumulated, if an intermediate result is buffered in the result storage unit 27, the accumulation result is further accumulated with that intermediate result. If the result of this accumulation is still an intermediate result of the convolution operation, the result output unit 26 stores it in the result storage unit 27. If it is the final result of the convolution operation, the result output unit 26 outputs it to the output module 3 for subsequent processing. The final result is the result obtained after all the weight values in the weight value matrix have been processed.

With the result output unit 26 and the result storage unit 27, when the weight value matrix is larger than the systolic array, the systolic array loads a part of the weight values and first calculates an intermediate result of the convolution operation, then loads another part of the weight values and continues the calculation until the final result is obtained and output. A smaller systolic array can thus complete the calculation for a larger weight value matrix, which effectively reduces the size and the cost of the device.
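The partial loading scheme can be sketched as a loop over tiles of weight rows, with a variable standing in for the result storage unit 27. The Python below is a behavioural sketch under assumptions: the function name `conv_window_tiled`, the parameter `array_rows`, and the choice to tile by rows are illustrative and not specified in this form by the original text.

```python
def conv_window_tiled(x_window, w, array_rows):
    """Compute one output value while loading at most `array_rows` weight rows at a time."""
    result_storage = 0  # stands in for the result storage unit 27
    for start in range(0, len(w), array_rows):
        tile_rows = range(start, min(start + array_rows, len(w)))
        # Result for the weights currently loaded in the (smaller) array.
        partial = sum(
            x_window[r][c] * w[r][c]
            for r in tile_rows
            for c in range(len(w[0]))
        )
        # Accumulate with the buffered intermediate result; after the last
        # tile this value is the final result.
        result_storage += partial
    return result_storage


# Hypothetical 3*3 window and weights processed on an array with only 2 rows.
x3 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w3 = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
tiled = conv_window_tiled(x3, w3, array_rows=2)
direct = sum(x3[r][c] * w3[r][c] for r in range(3) for c in range(3))
```

Because addition is associative, the tiled result equals the result of processing all weights at once; only the number of loading passes changes.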

In order to realize the sending of the weight value and the input characteristic value into the pulsation array, the data processing device in this embodiment may further include: a weight value injection unit 24 and an input characteristic value injection unit 25.

The input end of the weight value injection unit 24 can be connected to the weight value loading module 11, and its output end can be connected to the systolic array. Specifically, it can be connected to each pulsation unit 21 in the systolic array, so as to input each weight value to the corresponding pulsation unit 21.

Similarly, the input end of the input feature value injection unit 25 can be connected to the input feature value loading module 12, and the output end can be connected to the pulsation array, specifically, it can be connected to each pulsation unit 21 in the pulsation array to transfer the input feature The value is input to each pulsation unit 21.

Through the weight value injection unit 24 and the input feature value injection unit 25, the weight value and the input feature value can be buffered and sent to the systolic array, thereby improving the stability of the device.

Optionally, the weight value injection unit 24 may be directly connected to each pulsation unit 21. Alternatively, as shown in FIG. 5, it may be directly connected to the first row of pulsation units 21 and connected to the other pulsation units 21 through them, with the intermediate pulsation units 21 passing the weight values along.

Similarly, the input feature value injection unit 25 may be directly connected to each pulsation unit 21. Alternatively, as shown in FIG. 5, it may be directly connected to the first column of pulsation units 21 and connected to the other pulsation units 21 through them, with the intermediate pulsation units 21 passing the input feature values along.

The connection between the weight value injection unit 24 or the input feature value injection unit 25 and the pulsation unit 21 shown in FIG. 5 can effectively save wiring and reduce the volume of the device.

In order to realize the convolution operation, the entire convolution calculation process can be divided into a weight value loading stage and a calculation stage. In the weight value loading stage, the weight values in the weight value matrix are loaded into the pulsation units 21 of the systolic array. In the calculation stage, the input feature values in the input feature value matrix are input into the systolic array, and the calculation is performed on the weight values and the input feature values.

The data processing device in this embodiment may further include: a control unit 23. The control unit 23 is used to control the other modules in the calculation module 2 to work.

Specifically, the control unit 23 may control the weight value injection unit 24 to load the weight values obtained from the weight value loading module 11 into the systolic array, then control the input feature value injection unit 25 to send the input feature values obtained from the input feature value loading module 12 into the systolic array, and control the systolic array and the accumulator array to perform the convolution operation.

Optionally, the input feature values sent in can be reused during the convolution operation. The control unit 23 may be specifically configured to: in the weight value loading stage, control the weight values in the weight value matrix to be loaded in sequence into the pulsation units 21 of the systolic array; and in the calculation stage, control the input feature values in the input feature value matrix to be passed to the right in sequence within the systolic array, and control the pulsation units 21 to perform calculations based on the loaded weight values and the passed input feature values.

In this way, in the calculation stage, the input characteristic value enters from an interface of a row of pulsation units 21, and passes through each pulsation unit 21 of the row from left to right in turn. Each pulsation unit 21 can use the input characteristic value to perform calculations. Thus, the input feature value is reused, and the data access bandwidth required by the convolution operation is reduced.
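The reuse of input feature values can be visualised by tracking which value each unit of one row holds in each clock cycle. The Python sketch below is a simplified model: the function name `pass_right`, the use of `None` for an empty register, and the one-hop-per-cycle timing are assumptions made for illustration.

```python
def pass_right(stream, num_units):
    """Return, for every clock cycle, the feature value held by each unit of one row."""
    regs = [None] * num_units  # None marks a unit that holds no value yet
    history = []
    # Extra cycles at the end let the last value reach the rightmost unit.
    for value in list(stream) + [None] * (num_units - 1):
        regs = [value] + regs[:-1]  # every held value moves one unit to the right
        history.append(list(regs))
    return history


# Three hypothetical input feature values entering a row of two units.
history = pass_right([10, 20, 30], num_units=2)
```

Each value enters the row once through the single interface and is then seen by every unit of the row in turn, which is why the memory bandwidth needed per multiplication drops as the row gets wider.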

Optionally, the control unit 23 may be specifically used as follows in the weight value loading stage. In the shift phase of the weight value loading stage, for each column of pulsation units 21, the weight values to be loaded by that column are sent into the systolic array in sequence through the first pulsation unit 21 of the column, and within the systolic array the received weight values are passed downward in sequence from the first pulsation unit 21. In the loading phase of the weight value loading stage, the pulsation units 21 in the systolic array are controlled to store their corresponding weight values.

Specifically, the weight value injection unit 24 is responsible for buffering the weight values sent by the weight value loading module 11 and loads the weight values into the systolic array under the control of the control unit 23. The weight value injection unit 24 has only one interface with each column of pulsation units 21 of the systolic array, and the interface can transmit only one weight value per clock cycle. The weight value loading stage can be divided into two phases, shifting and loading. In the shift phase, the weight value injection unit 24 sends the weight values required by the pulsation units 21 of the same column into the systolic array in sequence through the same interface. Within the systolic array, the received weight values are passed downward in sequence from the pulsation unit 21 at the interface. In the loading phase, the pulsation units 21 of the same column simultaneously load the buffered weight values into their respective registers for use in the subsequent multiplication and accumulation. The weight value injection unit 24 may have a delay of one clock cycle between loading weight values for two adjacent columns of pulsation units 21.
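The two phases can be modelled for a single column with a shift register and a weight register per unit, loosely corresponding to the weight value shift register 215 and the weight value register 211 in the list of reference signs. The Python sketch below is an assumption-laden toy: the names `WeightCell` and `load_column`, and the order in which the weights are sent (the weight for the bottom unit first), are illustrative choices that the original text does not spell out.

```python
class WeightCell:
    """Toy model of the weight path of one pulsation unit."""

    def __init__(self):
        self.shift_reg = None   # weight passing through during the shift phase
        self.weight_reg = None  # weight used for multiplication after loading


def load_column(weights_top_to_bottom):
    """Shift the weights into one column, then latch them all at once."""
    cells = [WeightCell() for _ in weights_top_to_bottom]
    # Shift phase: one weight enters at the top per clock cycle. The weight
    # for the bottom unit is sent first because it has to travel furthest (assumed order).
    for weight in reversed(weights_top_to_bottom):
        for i in range(len(cells) - 1, 0, -1):
            cells[i].shift_reg = cells[i - 1].shift_reg
        cells[0].shift_reg = weight
    # Loading phase: every unit of the column stores its shifted weight simultaneously.
    for cell in cells:
        cell.weight_reg = cell.shift_reg
    return [cell.weight_reg for cell in cells]


loaded = load_column([1, 3, 5])
```

Separating the shift register from the weight register lets a column keep computing with its current weights while the next set is being shifted in, although the original text does not state whether the device overlaps the two.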

The input feature value injection unit 25 is responsible for buffering the input feature values sent by the input feature value loading module 12, and sends the input feature values to the systolic array under the control of the control unit 23. The input feature value injection unit 25 has only one interface with each row of pulsation units 21 of the systolic array, and the interface can transmit only one input feature value per clock cycle. In the systolic array, the received input feature values are sequentially transferred to the right from the pulsation unit 21 at the interface to the last pulsation unit 21. The input feature value injection unit 25 may have a delay of one clock cycle when sending input feature values to two adjacent rows of pulsation units 21.

In the systolic array, input feature values are transferred from left to right, and weight values are transferred from top to bottom. Data may take one clock cycle to pass through one column or row of pulsation units 21, so there can be a delay of one clock cycle between two adjacent rows or two adjacent columns of pulsation units 21 when loading data. This keeps the loading of the weight values, and the calculation between each weight value and its corresponding input feature value, correctly aligned.
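The one-cycle skew between adjacent rows can be illustrated with a small timing sketch. It assumes, purely for illustration, that row r is injected r cycles after row 0, that each hop to the right costs one cycle, and that a pulsation unit forwards its partial sum downward one cycle after receiving its input; the function names are hypothetical and the real pipeline depth may differ.

```python
def input_arrival_cycle(row: int, col: int) -> int:
    """Cycle at which the input injected for `row` reaches column `col`.

    Assumption: row r starts r cycles after row 0, one cycle per hop right.
    """
    return row + col


def partial_sum_arrival_cycle(row: int, col: int) -> int:
    """Cycle at which the partial sum from the unit above reaches (row, col).

    Assumption: the unit above forwards its result one cycle after it
    received its own input.
    """
    return input_arrival_cycle(row - 1, col) + 1


rows, cols = 3, 3
# With the skew, the input and the partial sum meet in the same cycle.
aligned = all(
    input_arrival_cycle(r, c) == partial_sum_arrival_cycle(r, c)
    for r in range(1, rows)
    for c in range(cols)
)
print(aligned)
```

Under these assumptions every unit below the first row sees its input feature value and the partial sum from above in the same cycle, which is the alignment the skew is meant to provide.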

In practical applications, the control unit 23 can obtain the length of the weight values in the weight value matrix or the length of the input feature values in the input feature value matrix, and control components such as the systolic array and the accumulator array to implement the convolution operation according to that length.

For example, when the length is n bits, n bits of data can be loaded into the systolic array; when the length is 2n bits, 2n bits of data can be loaded into the systolic array, thereby realizing calculation of data with different precisions.

Optionally, the control unit 23 may control each unit to implement the convolution operation through a hardware circuit such as a state machine. There are many ways to control the convolution operation according to the data length. For example, configuration information indicating the length of the data to be convolved can be stored in a register or carried in an instruction. According to the configuration information, the control unit 23 can generate control signals that switch components such as the systolic array and the accumulator array between n-bit and 2n-bit convolution operations.

In the data processing device provided in this embodiment, the calculation module 2 may include a systolic array and an accumulator array, and the convolution operation is realized by the two together. The systolic array can be used to implement the multiplication and accumulation of the n-bit or 2n-bit weight values in the weight value matrix with the corresponding input feature values, and the accumulator array can be used to calculate the output feature value matrix from the multiplication and accumulation results obtained by the systolic array. The convolution operation is thereby split into a multiplication and accumulation operation and an accumulation operation, so that the convolution result of the weight value matrix and the input feature value matrix is accurately calculated, while data reuse between convolution operations effectively reduces the data access bandwidth required and saves resources.

Example three

The third embodiment of the present invention provides a data processing device. This embodiment is based on the technical solution provided by the foregoing embodiment, and provides a specific implementation solution of the pulsation unit and the accumulator. For a schematic diagram of the overall structure of the data processing device in this embodiment, refer to FIG. 5. FIG. 6 is a schematic structural diagram of a pulsating unit in a data processing device according to Embodiment 3 of the present invention. FIG. 7 is a schematic structural diagram of an accumulator in a data processing device according to Embodiment 3 of the present invention.

As shown in FIG. 6, the pulsation unit 21 may include:

The weight value register 211 is used to store the weight value;

The input characteristic value register 212 is used to store the input characteristic value;

The multiplication circuit 213 can be connected to the weight value register 211 and the input feature value register 212 respectively, and is used to obtain the product of the weight value stored in the weight value register 211 and the input feature value stored in the input feature value register 212;

The adding circuit 214 may be connected to the multiplying circuit 213, and is used to add the product obtained by the multiplying circuit 213 to the output of the pulsating unit 21 in the previous row. When there is no pulsation unit 21 in the previous row, the addition circuit 214 can directly output the result obtained from the multiplication circuit 213.

Through the above components, the pulsation unit 21 can load the weight value, obtain the input feature value, multiply the input feature value by the loaded weight value, add the obtained product to the output of the pulsation unit 21 in the previous row, and output the result of the addition. The result output by the addition circuit 214 can be sent to the next pulsation unit 21.
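The behaviour of one pulsation unit, and of a column of such units, can be sketched as follows. This is a purely functional model that ignores clocking; the class and function names are hypothetical, and the attributes only stand in for the registers and circuits named above.

```python
from dataclasses import dataclass


@dataclass
class PulsationUnit:
    weight: int = 0    # stands in for the weight value register 211
    feature: int = 0   # stands in for the input feature value register 212

    def step(self, feature: int, sum_from_above: int = 0) -> int:
        """Multiply the input by the stored weight and add the sum from above."""
        self.feature = feature
        product = self.weight * self.feature   # multiplication circuit 213
        return product + sum_from_above        # addition circuit 214


def column_mac(weights, features):
    """Chain the units of one column from top to bottom."""
    units = [PulsationUnit(weight=w) for w in weights]
    partial = 0  # the first unit has no unit above it
    for unit, x in zip(units, features):
        partial = unit.step(x, partial)
    return partial


print(column_mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

The value leaving the last unit of the column is the multiplication and accumulation result that the corresponding accumulator receives.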

Optionally, the pulsation unit 21 may further include:

The weight value shift register 215 is used to transfer the weight value to the pulsating unit 21 of the next row;

The input feature value shift register 216 is used to transfer the input feature value to the pulsation unit 21 in the next column.

Specifically, the weight value shift register 215 may be responsible for buffering the weight value sent from the weight value injection unit 24 or the upper-level pulsation unit 21. In the shift phase of the weight value loading, the weight value buffered by the weight value shift register 215 will be passed down to the next-stage pulsation unit 21. In the loading phase of the weight value loading, the weight value buffered by the weight value shift register 215 will be latched into the weight value register 211.

In the calculation process based on the weight value in the weight value register 211, the weight value shift register 215 can be used to load the next weight value, which can effectively improve the calculation efficiency of the entire weight value matrix.
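The split between the shift register and the weight register amounts to double buffering, which the sketch below illustrates for one column. The class and function names are hypothetical, and sending the weight for the bottom unit first is an assumption that follows from the weights being passed downward.

```python
class WeightCell:
    def __init__(self):
        self.shift_reg = 0   # stands in for the weight value shift register 215
        self.weight_reg = 0  # stands in for the weight value register 211


def shift_in(column, weight):
    """Shift phase: push one weight in at the top; earlier ones move down."""
    for i in range(len(column) - 1, 0, -1):
        column[i].shift_reg = column[i - 1].shift_reg
    column[0].shift_reg = weight


def load(column):
    """Loading phase: every cell latches its shift register at the same time."""
    for cell in column:
        cell.weight_reg = cell.shift_reg


column = [WeightCell() for _ in range(3)]
for w in (30, 20, 10):   # assumed order: the weight for the bottom cell first
    shift_in(column, w)
load(column)
active = [cell.weight_reg for cell in column]

# While the active weights are in use, the next set can already be shifted in
# without disturbing them.
for w in (33, 22, 11):
    shift_in(column, w)
still_active = [cell.weight_reg for cell in column]
pending = [cell.shift_reg for cell in column]
print(active, still_active, pending)
```

The weights used for calculation only change on the next call to `load`, so shifting the following set can overlap with the current calculation.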

The input feature value shift register 216 is responsible for buffering the input feature value sent from the input feature value injection unit 25 or the left pulsation unit 21. The input characteristic value buffered by the input characteristic value shift register 216 will be latched to the input characteristic value register 212 and at the same time will be sent to the pulsation unit 21 on the right.

In the calculation process based on the input characteristic value in the input characteristic value register 212, the input characteristic value shift register 216 can be used to load the next input characteristic value, which can effectively improve the calculation efficiency corresponding to the entire input characteristic value matrix.

Optionally, the pulsating unit 21 may further include a multiplication result register 217. The addition circuit 214 and the multiplication circuit 213 can be connected through the multiplication result register 217. The multiplication result register 217 is used to store the multiplication result of the weight value loaded by the pulsation unit 21 and the input characteristic value, so that it can be added to the output of the previous pulsation unit 21 and improve the stability of the device.

In the embodiment of the present invention, optionally, each pulsation unit 21 can complete an n-bit * n-bit multiply and accumulate operation. Specifically, the length of the weight value that can be loaded by each pulsation unit 21 may be n bits. When the length of the weight values in the weight value matrix is 2n bits, each column of pulsation units 21 loads either the high n bits or the low n bits of the weight values in the weight value matrix.

By loading the upper n-bit weight value and the lower n-bit weight value respectively by the two columns of pulsation units 21, an n-bit device can calculate 2n-bit data.
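As a rough numerical check of this idea, the sketch below splits 2n-bit weights into two n-bit columns and recombines the two column results. It assumes unsigned values and n = 8, neither of which is stated in the text; `split_2n` is a hypothetical helper.

```python
N = 8  # example value of n; the text leaves n unspecified


def split_2n(value: int, n: int = N):
    """Return (high n bits, low n bits) of an unsigned 2n-bit value."""
    mask = (1 << n) - 1
    return (value >> n) & mask, value & mask


weights = [0x1234, 0xABCD, 0x00FF]   # 2n-bit weights of one kernel column
features = [3, 200, 17]              # n-bit input feature values

hi_col = [split_2n(w)[0] for w in weights]   # loaded into one column of units
lo_col = [split_2n(w)[1] for w in weights]   # loaded into the adjacent column

# Each column only ever performs n-bit by n-bit multiplications.
mac_hi = sum(x * w for x, w in zip(features, hi_col))
mac_lo = sum(x * w for x, w in zip(features, lo_col))

combined = (mac_hi << N) + mac_lo
reference = sum(x * w for x, w in zip(features, weights))
print(combined == reference)
```

Shifting the high-half column result left by n bits before adding restores the result that a 2n-bit multiplier would have produced.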

Correspondingly, when the length of the input feature values in the input feature value matrix is 2n bits, the input feature value acquired by the pulsation unit 21 each time may be the high n bits or the low n bits of an input feature value in the input feature value matrix.

Further, the high n bits and low n bits of a column of weight values can be loaded into two adjacent columns of pulsation units 21 respectively, and the high n bits of an input feature value can follow immediately after its low n bits as they are transferred from the first column of pulsation units 21 to the last column. This makes it convenient for the accumulator 22 to subsequently perform further calculations on the multiplication and accumulation results, thereby reducing the complexity of the accumulator 22.

As shown in FIG. 7, the accumulator 22 in this embodiment may include:

The multiplication and accumulation result register 221 may be connected to the last pulsation unit 21 of the corresponding column, and is used to obtain the output result of the last pulsation unit 21;

The pre-multiply-accumulate result register 222 may be connected to the multiply-accumulate result register 221, and is used to obtain an output result from the multiply-accumulate result register 221 every other clock cycle when the input characteristic value is 2n bits;

The vertical addition circuit 223 may be connected to the multiply-accumulate result register 221 and the pre-multiply-accumulate result register 222 respectively, and is used to send the output result in the multiply-accumulate result register 221 to the first-stage addition circuit 224 when the input feature value is n bits, or, when the input feature value is 2n bits, to send the sum of the output result in the multiply-accumulate result register 221 and the output result in the pre-multiply-accumulate result register 222 to the first-stage addition circuit 224;

The first-stage addition circuit 224 may be connected to the vertical addition circuit 223 and the upper-stage accumulator 22 respectively, and is used to add the result output from the vertical addition circuit 223 to the result output from the upper-stage accumulator 22.

Through the above components, the accumulator 22 can obtain the output result of the corresponding column of pulsation units 21, add it to the output result of the previous accumulator 22, and output the added result to the next accumulator 22.

It can be understood that the data addition involved in the embodiments of the present invention may refer to the direct addition of two data, or to addition after the data has been converted into a certain format. For example, data of different bases can be converted to the same base before being added; before the high n-bit data and the low n-bit data are added, the high n-bit data can be shifted left by n bits so that the two data are aligned and then added.

Optionally, when the input feature value is 2n bits, the accumulator 22 may store, in one register, the output result obtained from the pulsation unit 21 that corresponds to the high n bits of the input feature value, and store, in another register, the output result that corresponds to the low n bits of the input feature value. From the output result of the high n bits and the output result of the low n bits, the accumulator 22 obtains the output result corresponding to the input feature value, adds it to the output result of the previous accumulator 22, and outputs the result of the addition to the accumulator 22 of the next stage.

Wherein, the output result of the high n bits and the output result of the low n bits of the input characteristic value may be two adjacent output results obtained from the pulsation unit 21, respectively.

Specifically, if the column of the systolic array corresponding to the accumulator 22 is loaded with the low n bits of the weight value, then, when obtaining the output result corresponding to the input feature value from the output result of its high n bits and the output result of its low n bits, the accumulator 22 may be specifically used to: shift the output result of the high n bits of the input feature value left by n bits and add it to the output result of the low n bits, to obtain the output result corresponding to the input feature value.

If the column of the systolic array corresponding to the accumulator 22 is loaded with the high n bits of the weight value, then, when obtaining the output result corresponding to the input feature value from the output result of its high n bits and the output result of its low n bits, the accumulator 22 may be specifically used to: shift the output result of the high n bits of the input feature value left by n bits, add it to the output result of the low n bits, and shift the result of the addition left by n bits, to obtain the output result corresponding to the input feature value.

The above-mentioned shift operation can be implemented in the vertical addition circuit 223. By shifting the high n-bit data to the left by n bits, the high n-bit output result can be restored to the actual multiplication and accumulation result, ensuring the accuracy of the result.
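The two shift rules can be checked numerically with a single multiplication. The sketch assumes unsigned operands and n = 8; `combine` is a hypothetical helper that mirrors the rule for a column holding the low or the high half of the weight.

```python
N = 8  # example value of n


def combine(out_hi_input: int, out_lo_input: int, weight_half: str, n: int = N) -> int:
    """Merge the results for the high and low halves of one input value.

    weight_half says which half of the weight the column was loaded with.
    """
    total = (out_hi_input << n) + out_lo_input
    if weight_half == "high":
        total <<= n  # extra shift for the column holding the high weight bits
    return total


mask = (1 << N) - 1
x, w = 0xBEEF, 0x1234            # one 2n-bit input and one 2n-bit weight
x_hi, x_lo = x >> N, x & mask
w_hi, w_lo = w >> N, w & mask

low_column = combine(x_hi * w_lo, x_lo * w_lo, "low")
high_column = combine(x_hi * w_hi, x_lo * w_hi, "high")
print(low_column + high_column == x * w)
```

Adding the results of the two columns reproduces the full 2n-bit by 2n-bit product, which is what the later accumulation stages rely on.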

Optionally, the accumulator 22 may further include a filter circuit 225; the multiplication and accumulation result register 221 and the last pulsation unit 21 of the corresponding column may be connected through the filter circuit 225. The filter circuit 225 can be used to filter out the redundant multiplication and accumulation results output by the systolic array according to the stride value of the convolution operation, and the results that are not filtered out are sent by the filter circuit 225 to the multiplication and accumulation result register 221.

By providing the filter circuit 225 to filter out redundant data, the correctness of the convolution operation under different stride requirements can be ensured, the stride requirements of different occasions can be met, and the application range of the device can be widened.
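One simple way to picture stride filtering is to keep every stride-th result from the stream. The text does not specify which results count as redundant, so the selection rule below (keep indices 0, stride, 2*stride, ...) and the name `filter_by_stride` are assumptions for illustration only.

```python
def filter_by_stride(mac_results, stride: int):
    """Keep every `stride`-th result, starting with the first one."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return [r for i, r in enumerate(mac_results) if i % stride == 0]


print(filter_by_stride([10, 11, 12, 13, 14], 1))  # [10, 11, 12, 13, 14]
print(filter_by_stride([10, 11, 12, 13, 14], 2))  # [10, 12, 14]
```

With a stride of 1 nothing is dropped, so the same datapath serves both cases.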

Optionally, the accumulator 22 may further include: an accumulator result register 226; the first-stage addition circuit 224 and the upper-level accumulator 22 may be connected through the accumulator result register 226. The accumulator result register 226 can be used to obtain the result output by the previous accumulator 22 and send it to the first-stage addition circuit 224.

Optionally, the accumulator 22 may further include a sum register 227, connected to the first-stage addition circuit 224, for storing the result output by the first-stage addition circuit 224 and outputting the result to the next-stage accumulator 22.

Through the accumulator result register 226 and the sum register 227, the output result of the previous accumulator 22 and the output result of the first stage addition circuit 224 can be respectively stored, so as to ensure the smooth progress of the calculation process.

Optionally, the accumulator 22 may further include a delay circuit 228. The accumulator result register 226 and the previous accumulator 22 can be connected through the delay circuit 228. The delay circuit 228 may be used to delay the output result of the previous accumulator 22 by a corresponding number of clock cycles, according to the dilation value of the convolution operation, before sending it to the accumulator result register 226. The number of delayed clock cycles is determined by the dilation value of the convolution operation.

By providing the delay circuit 228 to delay the output result of the previous accumulator 22, the correctness of the convolution operation under different dilation value requirements can be ensured, the dilation value requirements of different occasions can be met, and the application range of the device can be widened.
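A delay of a configurable number of cycles can be modelled as a small queue. The sketch below is illustrative only: the text says the delay is determined by the dilation value but does not give the mapping, so the number of cycles is passed in directly, and `DelayLine` is a hypothetical name.

```python
from collections import deque


class DelayLine:
    """Return each value `cycles` steps after it was fed in."""

    def __init__(self, cycles: int, fill: int = 0):
        self._buf = deque([fill] * cycles)

    def step(self, value: int) -> int:
        if not self._buf:        # zero delay: pass the value straight through
            return value
        self._buf.append(value)
        return self._buf.popleft()


line = DelayLine(cycles=2)
outputs = [line.step(v) for v in [1, 2, 3, 4]]
print(outputs)  # [0, 0, 1, 2]
```

A larger dilation value would simply be configured as a longer queue between two accumulator stages.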

Optionally, the accumulator 22 may further include a second-stage addition circuit 229; the second-stage addition circuit 229 may be connected to the sum register 227, and the result generation unit 26 may be connected to the second-stage addition circuit 229.

The second-stage addition circuit 229 of the last-stage accumulator 22 is used to add the result in the sum register 227 to the intermediate result read from the result storage unit 27 by the result generation unit 26, and to output the sum to the result generation unit 26.

When the weight value matrix of the convolution operation is mapped to the systolic array, N consecutive accumulators 22 will be mapped to the same weight value matrix, and N can be equal to the width of the weight value matrix. Among the N accumulators 22, the first accumulator 22 does not need to receive the output result of an accumulator 22 on its left; likewise, the last accumulator 22 does not output the result buffered by the sum register 227 to the accumulator 22 on its right, but only accumulates, in the second-stage addition circuit 229, the result buffered by the sum register 227 and the intermediate result read back from the result storage unit 27, and outputs the result to the result output unit 26.

Each stage of accumulator 22 is connected to the result output unit 26, and the width of the weight value matrix determines which stage of accumulator 22 outputs the accumulation result to the result output unit 26. For example, if the width of the weight value matrix is 3, the third-stage accumulator 22 outputs the accumulation result to the result output unit 26. If the width of the weight value matrix is 4, the fourth-stage accumulator 22 outputs the accumulation result.

The result generation unit 26 may send the result obtained from the second-stage addition circuit 229 to the output module 3 when the result output by the second-stage addition circuit 229 of the accumulator 22 is the final result; when the result output by the second-stage addition circuit 229 is an intermediate result, the obtained result is sent to the result storage unit 27.

Optionally, the result storage unit 27 may include multiple FIFO (First In, First Out) storage units, and the result output unit 26 may send intermediate results into the corresponding FIFO storage unit in the result storage unit 27.

Specifically, each stage of accumulator 22 can correspond to a FIFO storage unit, and each FIFO storage unit can perform read and write operations at the same time. During the convolution operation, the N FIFO storage units can be divided into different groups according to the size of the weight value matrix. Different FIFO storage unit groups buffer the intermediate results of different weight value matrices.

As mentioned above, the width of the weight value matrix determines which stage of accumulator 22 outputs the accumulation result to the result generation unit 26. To make use of the FIFO storage units corresponding to accumulators 22 that do not output an accumulation result to the result generation unit 26, in this embodiment a group of FIFO storage units may include the FIFO storage units corresponding to the accumulator 22 that outputs the accumulation result to the result output unit 26 and to all accumulators 22 before it, and the accumulator 22 that outputs the accumulation result can use all the buffers of this group of FIFO storage units.

For example, if the width of the weight value matrix is 3, the third-stage accumulator 22 outputs the accumulation result to the result output unit 26. The FIFO storage units corresponding to the first-stage to third-stage accumulators 22 can therefore be grouped together and used to buffer the accumulation result output by the third-stage accumulator 22, which makes effective use of the otherwise idle FIFO storage units and improves the storage efficiency of the accumulation results.
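The grouping rule can be sketched as a short function. It assumes that accumulator stages are numbered from 1, that groups are formed from consecutive stages, and that stages left over at the end form no group; `group_fifos` is a hypothetical name.

```python
def group_fifos(num_accumulators: int, kernel_width: int):
    """Group FIFO indices so each group ends at an accumulator that outputs."""
    groups = []
    for start in range(0, num_accumulators - kernel_width + 1, kernel_width):
        # Stages start+1 .. start+kernel_width share one group; the last
        # stage of the group is the one that outputs the accumulation result.
        groups.append(list(range(start + 1, start + kernel_width + 1)))
    return groups


print(group_fifos(6, 3))  # [[1, 2, 3], [4, 5, 6]]
print(group_fifos(8, 4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In each group the last stage is the one that outputs, and it can use the combined buffer depth of every FIFO storage unit in its group.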

In practical applications, when fixed-point numbers of length n bits are used for the convolution operation, the vertical addition circuit 223 directly forwards the output result that the multiplication and accumulation result register 221 obtained from the pulsation unit 21 to the first-stage addition circuit 224 for accumulation.

When fixed-point numbers of length 2n bits are used for the convolution operation, two consecutive output results of the systolic array need to be accumulated in the vertical addition circuit 223. The first output result received by the multiplication and accumulation result register 221 may be buffered in the pre-multiply-accumulate result register 222 in the next clock cycle. When the multiplication and accumulation result register 221 receives the second output result, the output result buffered by the multiplication and accumulation result register 221 and the output result buffered by the pre-multiply-accumulate result register 222 are accumulated in the vertical addition circuit 223, and the accumulated result is sent to the first-stage addition circuit 224 for further addition, thereby realizing the convolution operation on 2n-bit fixed-point numbers.
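The pairing of two consecutive outputs can be modelled as a small streaming component. The sketch assumes unsigned values, n = 8, and that the low-half result arrives first and the high-half result second; it omits the extra shift applied by accumulators of high-weight columns. `VerticalAdder` is a hypothetical name.

```python
N = 8  # example value of n


class VerticalAdder:
    def __init__(self, wide: bool, n: int = N):
        self.wide = wide      # True for 2n-bit operation, False for n-bit
        self.n = n
        self.mac_reg = None   # stands in for result register 221
        self.prev_reg = None  # stands in for pre-result register 222
        self._count = 0

    def step(self, mac_output: int):
        """Feed one systolic-array output; return a value when one is ready."""
        if not self.wide:
            self.mac_reg = mac_output
            return mac_output  # n-bit mode: forward directly
        # 2n-bit mode: the older output moves into the "pre" register.
        self.prev_reg, self.mac_reg = self.mac_reg, mac_output
        self._count += 1
        if self._count % 2 == 1:
            return None        # first half of a pair: nothing to forward yet
        return self.prev_reg + (self.mac_reg << self.n)


x, w = 0x1234, 7                  # 2n-bit input, n-bit weight half
x_hi, x_lo = x >> N, x & 0xFF
adder = VerticalAdder(wide=True)
first = adder.step(x_lo * w)      # low half arrives first (assumed order)
second = adder.step(x_hi * w)     # high half arrives in the next cycle
print(first, second == x * w)
```

In 2n-bit mode a result is forwarded only on every second output, which matches an accumulation result being passed on every two clock cycles.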

In the data processing device provided in this embodiment, the multiply-accumulate result register 221 and the pre-multiply-accumulate result register 222 of the accumulator 22 can respectively store two adjacent output results of the corresponding column of pulsation units 21, and the multiplication and accumulation result corresponding to a 2n-bit input feature value can be determined from the data stored in these two registers. In this way, by sending n bits of the input feature value to the pulsation unit 21 each time, the convolution operation on 2n-bit input feature values can be realized without increasing the storage space of the pulsation unit 21. This balances device cost against calculation efficiency and has high application value.

FIG. 8 is a schematic diagram of a convolution operation process of n-bit data performed by the data processing device according to the third embodiment of the present invention, in which the size of the weight value matrix is 3*3. As shown in Figure 8, KhaDb is the b-th number in the a-th row of the input feature value matrix; Kwc is the weight value vector of the c-th column of the weight value matrix, which will be deployed to the corresponding column of pulsation units at the beginning of the convolution operation; KwcDd is the d-th multiplication and accumulation result of the c-th column of the weight value matrix for the corresponding output feature value; Bias is the input bias value of the convolution operation; SxTy is the accumulation result output by the x-th stage accumulator at time y.

When the convolution operation starts, the weight value vectors Kwc in the weight value matrix will be sent to the systolic array in three clock cycles, and each pulsation unit loads the weight value at the corresponding position in the 3*3 weight value matrix. After the weights are loaded, the input feature values are sequentially sent to the systolic array in the order shown in Fig. 8 and are multiplied and accumulated with the weight values in the systolic array; the results output by the systolic array in chronological order are shown in Fig. 8.

The result output from the systolic array is sent to the corresponding accumulator to continue the accumulation. The calculation performed by the accumulator at each moment is shown in Figure 8. After the third-stage accumulator completes the accumulation operation, the final output characteristic value can be obtained.

Through the process shown in FIG. 8, the convolution operation of n-bit data can be realized. Among them, an input feature value is multiplied by a row of weight values, which is equivalent to multiple multiplication and accumulation operations for one input feature value, thereby realizing data reuse and reducing data access bandwidth required for convolution operations.
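The arithmetic behind this two-step process can be sketched for one output row. The code does not reproduce the cycle-by-cycle order of Fig. 8; it only checks that per-column multiplication and accumulation followed by accumulation across columns equals a direct convolution. The function names are hypothetical, and adding the bias at the start of the accumulation is an assumption.

```python
def conv_row_by_columns(rows, kernel, bias=0):
    """One output row of a 2-D convolution, computed column by column."""
    k = len(kernel)
    width = len(rows[0])
    # Step 1 (systolic array): column c of the kernel yields one
    # multiply-accumulate result per input position d.
    col_mac = [
        [sum(rows[r][d] * kernel[r][c] for r in range(k)) for d in range(width)]
        for c in range(k)
    ]
    # Step 2 (accumulator chain): stage c adds its column's result for
    # position j + c to the running sum from the previous stage.
    out = []
    for j in range(width - k + 1):
        acc = bias
        for c in range(k):
            acc += col_mac[c][j + c]
        out.append(acc)
    return out


def conv_row_direct(rows, kernel, bias=0):
    """Reference: the same output row computed directly."""
    k = len(kernel)
    width = len(rows[0])
    return [
        bias + sum(rows[r][j + c] * kernel[r][c] for r in range(k) for c in range(k))
        for j in range(width - k + 1)
    ]


rows = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
result = conv_row_by_columns(rows, kernel, bias=1)
print(result, result == conv_row_direct(rows, kernel, bias=1))
```

Each input position d is used once by every kernel column in step 1, which is the data reuse that lowers the required access bandwidth.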

FIG. 9 is a schematic diagram of a convolution operation process of 2n-bit data performed by the data processing device according to the third embodiment of the present invention. Among them, 2n=16, and the size of the weight value matrix is 3*3. As shown in Figure 9, KhaDb_LSB is the low n bits of the b-th number in the a-th row in the input eigenvalue matrix; KhaDb_MSB is the high n bits of the b-th number in the a-th row in the input eigenvalue matrix. Kwc_LSB is the low n bits of the weight value vector in the c-th column of the weight value matrix, and Kwc_MSB is the high n bits of the weight value vector in the c-th column in the weight value matrix. They will be deployed to the corresponding systolic unit when the convolution operation starts.

KwcDd_LL is the first part of the d-th multiplication and accumulation result of the weight values in the c-th column of the weight value matrix for the corresponding output feature value, obtained by multiplying and accumulating the low n bits of the input feature values and the low n bits of the weight values; KwcDd_ML is the second part of that multiplication and accumulation result, obtained by multiplying and accumulating the high n bits of the input feature values and the low n bits of the weight values; KwcDd_LM is the third part of that multiplication and accumulation result, obtained by multiplying and accumulating the low n bits of the input feature values and the high n bits of the weight values; KwcDd_MM is the fourth part of that multiplication and accumulation result, obtained by multiplying and accumulating the high n bits of the input feature values and the high n bits of the weight values; Bias is the bias value input to the convolution operation; SxTy is the accumulation result output by the x-th stage accumulator at time y.

When the convolution operation starts, the low n-bit vectors and high n-bit vectors of the weight values in the weight value matrix, Kwc_LSB and Kwc_MSB, will be sent to the systolic array in three clock cycles, and each pulsation unit loads the corresponding n bits of the weight value at the corresponding position. After the weights are loaded, the input feature values are sequentially sent to the systolic array in the order shown in Fig. 9 and are multiplied and accumulated with the weight values in the systolic array; the results output by the systolic array in chronological order are shown in Fig. 9.

The result output from the systolic array is sent to the corresponding accumulator to continue accumulating. The calculation performed by the accumulator at each moment is shown in Figure 9. The vertical adding circuit of each accumulator needs to shift the output result sent in the second time to the left by n bits before accumulating. The accumulator corresponding to the high n-bit weight value also needs to shift the added sum by n bits to the left after adding the two output results. The accumulator transmits an accumulation result to the next accumulator every two clock cycles. After the accumulation operation of the last stage accumulator is completed, the final output characteristic value can be obtained.

In practical applications, this device can simultaneously support two lengths of data for calculation. Using n-bit data for convolution operation can provide higher convolution operation concurrency; using 2n-bit data for convolution operation can effectively improve network accuracy.

It should be noted that multiple time axes appear in FIGS. 8 and 9, and each time axis is only used to assist in displaying the output sequence in the respective timeline, and T0 in each time axis is not the same time.

Example four

The fourth embodiment of the present invention provides a data processing device. In this embodiment, on the basis of the technical solutions provided by the foregoing embodiments, a memory is added to store data. FIG. 10 is a schematic structural diagram of a data processing device according to Embodiment 4 of the present invention. As shown in FIG. 10, the data processing device in this embodiment may include:

The input module is used to obtain an n-bit or 2n-bit weight value matrix and an n-bit or 2n-bit input feature value matrix; the input module specifically includes a weight value loading module 11 and an input feature value loading module 12, where the weight value loading module 11 is used to obtain the n-bit or 2n-bit weight value matrix, and the input feature value loading module 12 is used to obtain the n-bit or 2n-bit input feature value matrix;

The output module 3 is used to output the output eigenvalue matrix;

The memory 4 is used to store at least one of the following: an input eigenvalue matrix, an output eigenvalue matrix, and a weight value matrix.

Optionally, the memory 4 may be a static random access memory (Static Random-Access Memory, SRAM). The weight value loading module 11 can be connected to the memory 4, read the weight value from the memory 4, and send it to the calculation module 2 in a specific format. The input feature value loading module 12 can read the input feature value from the memory 4 and send it to the calculation module 2 for convolution operation.

The calculation module 2 can output one output feature value of the output feature value matrix every clock cycle, and the output module 3 writes the output feature value into the memory 4. Optionally, there may be format requirements when the output feature value is stored in the memory 4. For example, the output feature value may need to be aligned to 32 bits, that is, the start address of the first byte of the output feature value is an integer multiple of 32. The output module 3 can assemble the output feature values into the corresponding format and send them to the memory 4 for storage.

Optionally, when the length of the data stored in the memory 4 is n bits, the memory 4 may sequentially store m pieces of data in a storage space of n*m bits. When the length of the data stored in the memory 4 is 2n bits, the memory 4 can store m pieces of data in a 2n*m-bit storage space, with the high n bits and low n bits of each piece of data stored adjacently; n and m are positive integers.

FIG. 11 is a schematic diagram of a storage format when a data processing device stores n-bit data according to the fourth embodiment of the present invention. As shown in Figure 11, each box represents n-bit storage space, the number on the box represents the serial number of the storage space, and the number inside the box represents the serial number of the stored data. Figure 11 shows 2m n-bit storage spaces, and the i-th n-bit storage space stores the i-th data.

FIG. 12 is a schematic diagram of a storage format when a data processing device stores 2n-bit data according to the fourth embodiment of the present invention. As shown in Figure 12, each box represents n-bit storage space, the number on the box represents the serial number of the storage space, i_LSB in the box represents the low n bits of the i-th data, and i_MSB represents the i-th data High n bits. Figure 12 shows 2m n-bit storage spaces, the 2i-th n-bit stores the low n bits of the i-th data, and the 2i+1-th n-bit stores the high n bits of the i-th data.
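The layout of Figure 12 can be sketched as a pair of pack and unpack helpers. The sketch assumes unsigned values, n = 8, and zero-based numbering of both the data and the n-bit storage spaces; `pack_2n` and `unpack_2n` are hypothetical names.

```python
N = 8  # example value of n


def pack_2n(values, n: int = N):
    """Split each 2n-bit value into two adjacent n-bit slots, low half first."""
    mask = (1 << n) - 1
    slots = []
    for v in values:
        slots.append(v & mask)         # slot 2i: low n bits of the i-th data
        slots.append((v >> n) & mask)  # slot 2i+1: high n bits of the i-th data
    return slots


def unpack_2n(slots, n: int = N):
    """Rebuild the 2n-bit values from adjacent low/high slots."""
    return [slots[i] | (slots[i + 1] << n) for i in range(0, len(slots), 2)]


data = [0x1234, 0xABCD]
slots = pack_2n(data)
print([hex(s) for s in slots])  # ['0x34', '0x12', '0xcd', '0xab']
print(unpack_2n(slots) == data)
```

Because the two halves sit next to each other, a sequential read yields the low half and then the high half of each value.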

The data processing device provided in this embodiment can store at least one of the following through the memory 4: the input eigenvalue matrix, the output eigenvalue matrix, and the weight value matrix. When the length of the data stored in the memory 4 is 2n bits, the memory 4 can store m pieces of data through a 2n*m-bit storage space, and the high n bits and low n bits of each piece of data are stored adjacently, which makes it convenient to input feature values and weight values into the systolic array in order and improves the efficiency of the convolution operation.
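
The two storage formats of FIG. 11 and FIG. 12 can be illustrated with a short software sketch. It is only a model of the layout, not the memory 4 itself; the function names `pack_slots` and `unpack_slots` are hypothetical, and the data are treated as unsigned integers for simplicity, which the text does not specify.

```python
def pack_slots(values, n, wide):
    """Lay values out into a list of n-bit storage spaces.

    wide=False: each value is n bits and fills one space (as in FIG. 11).
    wide=True:  each value is 2n bits; space 2i holds the low n bits and
                space 2i+1 holds the high n bits of value i (as in FIG. 12).
    Values are assumed to be unsigned (an assumption of this sketch).
    """
    mask = (1 << n) - 1
    limit = 1 << (2 * n if wide else n)
    slots = []
    for v in values:
        if not 0 <= v < limit:
            raise ValueError("value does not fit in the selected width")
        if wide:
            slots.append(v & mask)          # i_LSB
            slots.append((v >> n) & mask)   # i_MSB
        else:
            slots.append(v)
    return slots


def unpack_slots(slots, n, wide):
    """Inverse of pack_slots: rebuild the stored values."""
    if not wide:
        return list(slots)
    return [(slots[i + 1] << n) | slots[i] for i in range(0, len(slots), 2)]
```

With n = 8, `pack_slots([0x1234, 0xABCD], 8, True)` yields the slot sequence `[0x34, 0x12, 0xCD, 0xAB]`, so the two halves of each value sit in adjacent n-bit spaces.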

Example five

The fifth embodiment of the present invention provides a data processing method. FIG. 13 is a schematic flowchart of a data processing method according to Embodiment 5 of the present invention. As shown in FIG. 13, the data processing method in this embodiment may include:

Step 1301: Obtain an input eigenvalue matrix and an n-bit or 2n-bit weight value matrix.

Step 1302: Perform a convolution operation on the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix to obtain an output eigenvalue matrix.

Step 1303: Output the output eigenvalue matrix.

Wherein, the n is a positive integer.
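
Steps 1301 to 1303 can be pictured with a plain software reference model. The sketch below is not the hardware data path; it assumes the sliding-window multiply-accumulate form of convolution that is usual in convolutional neural networks, with stride 1 and no padding, and the name `convolve2d` and its parameters are hypothetical.

```python
def convolve2d(features, weights, n, weight_bits):
    """Reference model: obtain the matrices, convolve, return the output.

    features:    input eigenvalue matrix (list of rows of integers)
    weights:     weight value matrix (list of rows of integers)
    weight_bits: declared width of the weights; must be n or 2n
    Stride 1 and no padding are assumptions of this sketch.
    """
    if weight_bits not in (n, 2 * n):
        raise ValueError("weights must be n-bit or 2n-bit")
    kh, kw = len(weights), len(weights[0])
    out_h = len(features) - kh + 1
    out_w = len(features[0]) - kw + 1
    output = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += features[r + i][c + j] * weights[i][j]
            output[r][c] = acc   # one output eigenvalue
    return output
```

For a 3x3 input `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and a 2x2 weight matrix `[[1, 0], [0, 1]]`, the model returns `[[6, 8], [12, 14]]`.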

The data processing method shown in FIG. 13 can be implemented based on the device of the embodiment shown in FIG. For the implementation process and technical effects of this technical solution, please refer to the description in the embodiment shown in FIG. 1 to FIG. 12, which will not be repeated here.

In an implementable manner, the length of the weight value in the n-bit weight value matrix is n bits; the length of the weight value in the 2n-bit weight value matrix is 2n bits;

The length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.

In an implementable manner, the method further includes:

Store data in a matrix, the matrix being at least one of an input eigenvalue matrix, an output eigenvalue matrix, and a weight value matrix;

Wherein, when the length of the stored data is 2n bits, m data are stored in a 2n*m-bit storage space, and the high n bits and low n bits of each data are stored adjacently; the m is a positive integer.

In an implementable manner, the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix are convolved to obtain the output eigenvalue matrix, which includes:

Multiply and accumulate the n-bit or 2n-bit weight value in the weight value matrix with the corresponding input feature value;

According to the multiplication and accumulation result obtained by the multiplication and accumulation operation, the output eigenvalue matrix is calculated.

In an implementable manner, multiplying and accumulating the n-bit or 2n-bit weight value in the weight value matrix with the corresponding input feature value includes:

Load the weight value in the weight value matrix in the systolic array;

Multiply and accumulate the weight value loaded by each column of systolic cells in the systolic array and the corresponding input characteristic value to obtain the multiply and accumulate result corresponding to the weight value of each column;

Wherein, the weight value is n-bit or 2n-bit weight value.

In an implementable manner, the length of the weight value that can be loaded by the pulsation unit is n bits;

When the length of the weight value in the weight value matrix is 2n bits, each column of pulsation units loads the high n bits or the low n bits of the weight values in the weight value matrix.

In an implementable manner, the high n bits and low n bits of a column of weight values are respectively loaded in two adjacent columns of pulsation cells.
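
As an illustration of this column splitting, the following sketch expands every column of 2n-bit weights into two adjacent columns of n-bit values. The function name is hypothetical, and placing the low half in the left column is an assumption that simply mirrors the storage order of FIG. 12; the text does not fix the order. The high half keeps the sign of the weight, which is one possible convention for signed data.

```python
def split_weight_columns(weight_matrix, n):
    """Expand each column of 2n-bit weights into two adjacent n-bit columns.

    For every weight w the low half (w & mask) is unsigned and the high
    half (w >> n) carries the sign, so w == (high << n) + low always holds.
    Column order (low first, then high) is an assumption of this sketch.
    """
    mask = (1 << n) - 1
    expanded = []
    for row in weight_matrix:
        new_row = []
        for w in row:
            new_row.append(w & mask)   # low n bits
            new_row.append(w >> n)     # high n bits
        expanded.append(new_row)
    return expanded
```

A weight matrix with k columns therefore occupies 2k columns of pulsation units when the weights are 2n bits wide.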

In an implementable manner, when the length of the input eigenvalues in the input eigenvalue matrix is 2n bits, the input eigenvalue acquired by the pulsation unit each time is the high n bits or the low n bits of an input eigenvalue in the input eigenvalue matrix.

In an implementable manner, the high n bits or low n bits of the input characteristic value are sequentially transferred from the first row of pulsation units to the last row of pulsation units.

In an implementable manner, the weight value loaded by each column of the pulsation unit in the systolic array is multiplied and accumulated with the corresponding input characteristic value to obtain the multiplication and accumulation result corresponding to the weight value of each column, including:

The input eigenvalues in the input eigenvalue matrix are sequentially transferred to the right in the systolic array, and the loaded weight value and the transferred input eigenvalue are multiplied and accumulated by the pulsation units of each column to obtain the multiplication and accumulation result corresponding to the weight values of each column.

In an implementable manner, loading the weight values in the weight value matrix in the systolic array includes:

In the shift phase of the weight value loading phase, for each column of pulsation units, the weight values to be loaded by that column of pulsation units are sequentially sent into the systolic array through the first pulsation unit of the column; in the systolic array, each received weight value is passed downward in turn from the first pulsation unit;

In the loading phase of the weight value loading phase, the corresponding weight value is stored by the pulsation unit in the pulsation array.
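
The two phases of weight loading can be simulated for a single column as follows. This is a behavioural sketch only; the function name is hypothetical, and the entry order (the weight for the last pulsation unit enters first) is an assumption that makes each weight arrive at its own unit after exactly as many shifts as there are units.

```python
def load_column_weights(weights_for_column):
    """Simulate the shift phase and the loading phase for one column.

    weights_for_column[r] is the weight destined for the r-th unit from the
    top. Returns (weight_registers, trace), where trace records the contents
    of the shift registers after every shift cycle.
    """
    rows = len(weights_for_column)
    shift_regs = [None] * rows
    trace = []
    # Shift phase: one weight enters through the first unit per cycle and
    # every weight already in the column moves down by one unit.
    for w in reversed(weights_for_column):
        shift_regs = [w] + shift_regs[:-1]
        trace.append(list(shift_regs))
    # Loading phase: every unit stores the weight that is now in front of it.
    weight_regs = list(shift_regs)
    return weight_regs, trace
```

For the weights `[10, 20, 30]` the shift registers hold `[30, None, None]`, then `[20, 30, None]`, then `[10, 20, 30]`, after which all three units store their weights in the same cycle.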

In an implementable manner, the loaded weight value and the passed input feature value are multiplied and accumulated by each column of pulsation unit to obtain the multiply and accumulate result corresponding to each column of weight value, including:

Each column of pulsation units performs the following operations: each pulsation unit in the column obtains an input characteristic value, multiplies the obtained input characteristic value by the weight value loaded in the pulsation unit, adds the obtained product to the output of the previous pulsation unit, and outputs the result of the addition; the output of the last pulsation unit is the multiplication and accumulation result corresponding to the column.
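
Ignoring pipeline timing, the chain of operations inside one column reduces to the following sketch. The function name `column_mac` and the `carry_in` parameter are hypothetical; the sketch only shows how the partial sum travels from one pulsation unit to the next.

```python
def column_mac(weights, inputs, carry_in=0):
    """Multiply-accumulate along one column of pulsation units.

    weights[r] is the weight loaded in the r-th unit and inputs[r] is the
    input characteristic value that unit receives. Each unit adds its
    product to the output of the unit before it; the value leaving the
    last unit is the result for the column.
    """
    if len(weights) != len(inputs):
        raise ValueError("one input is needed per pulsation unit")
    partial = carry_in            # the first unit has no predecessor
    for w, x in zip(weights, inputs):
        partial = partial + w * x # output of this unit
    return partial
```

For example, `column_mac([1, 2, 3], [4, 5, 6])` evaluates 1*4 + 2*5 + 3*6 = 32.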

In an implementable manner, an accumulator is provided for each column of pulsation units; the calculation of the output eigenvalue matrix according to the multiplication and accumulation result obtained by the multiplication and accumulation operation includes:

Obtain the output result of the corresponding column of pulsation units through each accumulator, and add it to the output result of the previous accumulator to obtain the output result of the accumulator; determine the output characteristic value from the output result of the last accumulator.

In an implementable manner, if the input characteristic value is 2n bits, obtaining the output result of the corresponding column of pulsation units includes:

Store the output result corresponding to the high n bits of the input characteristic value obtained from the pulsating unit through one register, and store the output result corresponding to the low n bits of the input characteristic value through another register;

According to the output result of the high n bits of the input characteristic value and the output result of the low n bits, the output result corresponding to the input characteristic value is obtained.

In an implementable manner, if the systolic array corresponding to the accumulator is loaded with the low n bits of the weight value, obtaining the output result corresponding to the input characteristic value according to the output result of the high n bits of the input characteristic value and the output result of the low n bits includes:

Shift the output result of the high n bits of the input characteristic value to the left by n bits, and add it to the output result of the low n bits to obtain the output result corresponding to the input characteristic value.

In an implementable manner, if the systolic array corresponding to the accumulator is loaded with the high n bits of the weight value, obtaining the output result corresponding to the input characteristic value according to the output result of the high n bits of the input characteristic value and the output result of the low n bits includes:

Shift the output result of the high n bits of the input characteristic value to the left by n bits, add it to the output result of the low n bits, and shift the result of the addition to the left by n bits to obtain the output result corresponding to the input characteristic value.
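
The two shift-and-add rules follow from writing a 2n-bit value as (high << n) + low. The sketch below checks that applying the first rule to the column holding the low n bits of a weight, the second rule to the column holding the high n bits, and summing both results reproduces the full product. The function names are hypothetical, and the treatment of negative values (signed high half, unsigned low half) is an assumption; the text does not state the number format.

```python
def split_halves(value, n):
    """Return (high, low) with value == (high << n) + low."""
    return value >> n, value & ((1 << n) - 1)


def combine_low_weight_column(out_high, out_low, n):
    """Column loaded with the LOW n bits of the weight."""
    return (out_high << n) + out_low


def combine_high_weight_column(out_high, out_low, n):
    """Column loaded with the HIGH n bits of the weight."""
    return ((out_high << n) + out_low) << n


def product_from_halves(w, x, n):
    """Rebuild w * x from four n-bit partial products."""
    w_high, w_low = split_halves(w, n)
    x_high, x_low = split_halves(x, n)
    low_column = combine_low_weight_column(w_low * x_high, w_low * x_low, n)
    high_column = combine_high_weight_column(w_high * x_high, w_high * x_low, n)
    # Adding the two adjacent columns is assumed to happen in the accumulators.
    return low_column + high_column
```

Because the rules are linear, they can equally be applied to accumulated column outputs rather than to single products, which is how they are used in the accumulator.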

In an implementable manner, the high n-bit output result and the low n-bit output result of the input feature value are two adjacent output results obtained from the pulsation unit, respectively.

In an implementable manner, storing the output result corresponding to the high n bits of the input characteristic value obtained from the pulsation unit through one register, and storing the output result corresponding to the low n bits of the input characteristic value through another register, including:

The output result is obtained from the pulsation unit through the multiply-accumulate result register, and every other clock cycle the multiply-accumulate result register sends the output result corresponding to the low n bits to the pre-multiplication and accumulation result register; the multiply-accumulate result register then stores the output result corresponding to the high n bits, and the pre-multiplication and accumulation result register stores the output result corresponding to the low n bits;

Alternatively, the output result is obtained from the pulsation unit through the multiply-accumulate result register, and every other clock cycle the multiply-accumulate result register sends the output result corresponding to the high n bits to the pre-multiplication and accumulation result register; the multiply-accumulate result register then stores the output result corresponding to the low n bits, and the pre-multiplication and accumulation result register stores the output result corresponding to the high n bits.
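
The alternation between the two registers can be modelled cycle by cycle. The sketch below is a simplified behavioural model under several assumptions: one column output arrives per clock cycle, the two halves of one input characteristic value arrive in consecutive cycles, and the pair is combined with the rule for a column holding the low n bits of the weight. The function name and the `low_first` flag are hypothetical.

```python
def pair_half_outputs(stream, n, low_first=True):
    """Pair consecutive half-width outputs coming from one column.

    stream:    column outputs, one per clock cycle, alternating between
               the low-half and high-half results of the same input value
    low_first: True if the low-half result arrives first, False otherwise
    Returns one combined value (high << n) + low per pair.
    """
    mac_reg = None   # multiply-accumulate result register
    pre_reg = None   # pre-multiplication and accumulation result register
    combined = []
    for cycle, out in enumerate(stream):
        second_half = cycle % 2 == 1
        if second_half:
            # Every other cycle the earlier half moves to the pre register
            # before the newer half overwrites the result register.
            pre_reg = mac_reg
        mac_reg = out
        if second_half:
            low, high = (pre_reg, mac_reg) if low_first else (mac_reg, pre_reg)
            combined.append((high << n) + low)
    return combined
```

With n = 8 and the stream `[3, 1, 4, 2]`, the low-first variant produces `[259, 516]` and the high-first variant produces `[769, 1026]`.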

In an implementable manner, the method further includes:

The redundant multiplication and accumulation results output by the systolic array are filtered out according to the stride (step length) value of the convolution operation.
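
One way to read this filtering is that the array always produces results as if the stride were 1, and the results that a larger stride would never compute are discarded. The one-dimensional sketch below illustrates that reading; it is an interpretation rather than the circuit described in the embodiments, and both function names are hypothetical.

```python
def conv1d_stride1(x, w):
    """Sliding-window multiply-accumulate with stride 1 and no padding."""
    k = len(w)
    return [sum(w[j] * x[t + j] for j in range(k))
            for t in range(len(x) - k + 1)]


def filter_by_stride(results, stride):
    """Keep every stride-th result and drop the redundant ones."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return results[::stride]
```

For `x = [1, 2, 3, 4, 5, 6]` and `w = [1, 1]`, the stride-1 results are `[3, 5, 7, 9, 11]`; filtering with a stride of 2 keeps `[3, 7, 11]`, which matches a convolution evaluated directly at positions 0, 2 and 4. In two dimensions the same selection would be applied along rows and columns.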

In an implementable manner, obtaining the output result of the corresponding column of pulsation units and adding it to the output result of the previous-stage accumulator includes:

According to the dilation (expansion) value of the convolution operation, the output result of the previous accumulator is delayed by the corresponding number of clock cycles, and then added to the output result obtained from the corresponding column of pulsation units.
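
The effect of the delay can be seen in a simplified one-dimensional model of the accumulator chain. The model assumes that the "expansion value" is the dilation rate of a dilated convolution, that column j holds a single weight w[j], and that every column sees the same input sample in a given cycle; none of these simplifications is taken from the embodiments, and the function name is hypothetical.

```python
def accumulator_chain(x, w, dilation):
    """1-D model: accumulator j adds its column output to the output of
    accumulator j-1 delayed by `dilation` clock cycles.

    Returns the outputs of the last accumulator once the chain is full,
    which equal y[s] = sum_j w[j] * x[s + j * dilation].
    """
    if dilation < 1:
        raise ValueError("dilation must be at least 1")
    k = len(w)
    history = [[] for _ in range(k)]  # history[j][t]: accumulator j, cycle t
    for t, sample in enumerate(x):
        for j in range(k):
            delayed = 0               # the first accumulator has no predecessor
            if j > 0 and t - dilation >= 0:
                delayed = history[j - 1][t - dilation]
            history[j].append(w[j] * sample + delayed)
    warmup = (k - 1) * dilation       # cycles before the first valid output
    return history[-1][warmup:]
```

With `x = [1, 2, 3, 4, 5, 6, 7]`, `w = [1, 10, 100]` and a dilation of 2, the chain outputs `[531, 642, 753]`, that is, x[s] + 10*x[s+2] + 100*x[s+4] for s = 0, 1, 2.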

If the number of rows of the weight value matrix is greater than the number of rows of the systolic array, load a part of the weight values in the weight value matrix into the systolic array each time;

Correspondingly, the output characteristic value is determined by the accumulation result of the last stage accumulator, including:

Determine whether the intermediate result of the output characteristic value is stored:

If not, store the accumulation result of the last-stage accumulator as an intermediate result;

If yes, add the accumulation result of the last-stage accumulator to the stored intermediate result; if the result of the addition is the final result of the output characteristic value, send the final result to the output module; if the result of the addition is not the final result of the output characteristic value, update the intermediate result to the result of the addition and store it;

Wherein, the intermediate result is the result obtained after only part of the weight values in the weight value matrix have been calculated; the final result is the result obtained after all the weight values in the weight value matrix have been calculated.
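
The handling of intermediate and final results can be sketched for a single output characteristic value. The model below splits one long multiply-accumulate into passes of at most `array_rows` weights, which stands in for loading part of the weight value matrix each time; the function name and parameters are hypothetical, and a plain variable stands in for the stored intermediate result.

```python
def tiled_multiply_accumulate(weights, inputs, array_rows):
    """Compute sum(w * x) in passes of at most array_rows weights.

    Each pass models one load of the systolic array. The result of a pass
    is added to the stored intermediate result, if any; the result of the
    last pass is the final result.
    """
    if array_rows < 1:
        raise ValueError("array_rows must be at least 1")
    intermediate = None   # nothing stored before the first pass
    final = None
    total = len(weights)
    for start in range(0, total, array_rows):
        w_part = weights[start:start + array_rows]
        x_part = inputs[start:start + array_rows]
        acc = sum(w * x for w, x in zip(w_part, x_part))  # last-stage accumulator
        if intermediate is not None:
            acc += intermediate
        if start + array_rows >= total:
            final = acc          # would be sent to the output module
        else:
            intermediate = acc   # stored (or updated) for the next pass
    return final
```

For five weights and an array with two rows, three passes are needed, and `tiled_multiply_accumulate([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2)` returns 15, the same value as a single untiled sum.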

An embodiment of the present invention also provides an electronic device, including the data processing device described in any of the foregoing embodiments. The electronic device may be any device that may use convolution operations, such as computers, drones, and handheld devices.

For the implementation principle, execution process, and technical effects of the electronic device, reference may be made to the description of the embodiments shown in FIG. 1 to FIG. 12, which will not be repeated here.

The technical solutions and technical features in each of the above embodiments may be used alone or in combination, provided that they do not conflict with the present invention; as long as they do not exceed the cognitive scope of those skilled in the art, they all belong to equivalent embodiments within the protection scope of the present invention.

In the several embodiments provided by the present invention, it should be understood that the disclosed related remote control device and method can be implemented in other ways. For example, the embodiments of the remote control device described above are merely illustrative. For example, the division of the modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, remote control devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to make a computer processor execute all or part of the steps of the method described in each embodiment of the present invention. The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

The above are only embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made by using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of the technical features can be equivalently replaced, and that these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

A data processing device, characterized in that it comprises:

The input module is used to obtain the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix;

The calculation module is used to perform a convolution operation between the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix to obtain the output eigenvalue matrix;

An output module for outputting the output eigenvalue matrix;

Wherein, the n is a positive integer.
The device according to claim 1, wherein the length of the weight value in the n-bit weight value matrix is n bits; the length of the weight value in the 2n-bit weight value matrix is 2n bits;

The length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.
The device according to claim 1, further comprising: a memory;

The memory is used to store at least one of the following: an input eigenvalue matrix, an output eigenvalue matrix, and a weight value matrix;

Wherein, when the length of the data stored in the memory is 2n bits, the memory stores m pieces of data through a 2n*m-bit storage space, and the high n bits and low n bits of each piece of data are stored adjacently; m is a positive integer.
The device according to claim 1, wherein the calculation module comprises:

The systolic array is used to implement the multiplication and accumulation operation of the n-bit or 2n-bit weight value in the weight value matrix and the corresponding input eigenvalue;

The accumulator array is used to calculate the output eigenvalue matrix according to the multiplication and accumulation result obtained by the systolic array.
The device according to claim 4, wherein the calculation module further comprises:

The control unit is configured to obtain the length of the weight value in the weight value matrix, and control the systolic array and the accumulator array to implement a convolution operation according to the length of the weight value.
The device according to claim 5, wherein the pulsation array comprises: multiple rows of pulsation units;

The pulsation unit of each column is used to load the weight value, and the loaded weight value and the corresponding input characteristic value are multiplied and accumulated to obtain the multiplication and accumulation result corresponding to the weight value of each column loaded.
The device according to claim 6, wherein the length of the weight value that can be loaded by the pulsation unit is n bits;

When the length of the weight value in the weight value matrix is 2n bits, each column of pulsation units loads the high n bits or the low n bits of the weight values in the weight value matrix.
The device according to claim 7, wherein the high n bits and low n bits of a column of weight values are respectively loaded in two adjacent columns of pulsation units.
The device according to claim 6, wherein when the length of the input eigenvalues in the input eigenvalue matrix is 2n bits, the input eigenvalue acquired by the pulsation unit each time is the high n bits or the low n bits of an input eigenvalue in the input eigenvalue matrix.
The device according to claim 9, wherein the high n bits or low n bits of the input characteristic value are sequentially transferred from the first row of pulsation units to the last row of pulsation units.
The device according to claim 6, wherein the control unit is specifically configured to:

In the weight value loading stage, controlling the weight values in the weight value matrix to be sequentially loaded into the pulsation units of the systolic array;

In the calculation stage, controlling the input eigenvalues in the input eigenvalue matrix to be sequentially transferred to the right in the systolic array, and controlling the pulsation units to perform calculations based on the loaded weight values and the transferred input eigenvalues.
The device according to claim 11, wherein, in the weight value loading stage, the control unit is specifically configured to:

In the shift phase of the weight value loading phase, for each column of pulsation units, the weight values to be loaded by that column of pulsation units are sequentially sent into the systolic array through the first pulsation unit of the column; in the systolic array, each received weight value is passed downward in turn from the first pulsation unit;

In the loading phase of the weight value loading phase, the pulsation unit in the pulsation array is controlled to store the corresponding weight value.
The device according to claim 6, wherein each row of pulsation units includes a plurality of pulsation units;

The pulsation unit is used to load the weight value and obtain the input characteristic value, multiply the input characteristic value and the loaded weight value, add the obtained product to the output of the pulsation unit in the previous row, and output the result of the addition.
The device according to claim 13, wherein the pulsation unit comprises:

Weight value register, used to store weight value;

Input characteristic value register, used to store the input characteristic value;

A multiplication circuit for obtaining the product of the weight value and the input characteristic value according to the weight value stored in the weight register and the input characteristic value stored in the input characteristic value register;

The addition circuit is used to add the product obtained by the multiplication circuit to the output of the pulsation unit in the previous row.
The device according to claim 14, wherein the pulsation unit further comprises:

The weight value shift register is used to transfer the weight value to the pulsation unit of the next row;

The input characteristic value shift register is used to transfer the input characteristic value to the next row of pulsation units.
The device according to claim 6, wherein the accumulator array comprises a plurality of accumulators, the number of the accumulators and the number of columns of pulsation units are both k, and the i-th accumulator corresponds to the pulsation units in column i, where k is a natural number greater than 1 and i = 1, 2, ..., k;

The accumulator is used to obtain the output result of the corresponding column of pulsation units, add it to the output result of the previous accumulator, and output the result of the addition to the next accumulator.
The device according to claim 16, wherein when the input characteristic value is 2n bits, the accumulator is specifically used for:

Store the output result corresponding to the high n bits of the input characteristic value obtained from the pulsating unit through one register, and store the output result corresponding to the low n bits of the input characteristic value through another register;

According to the output result of the high n bits of the input characteristic value and the output result of the low n bits, obtain the output result corresponding to the input characteristic value, add it to the output result of the previous accumulator, and output the result of the addition to the next accumulator.
The device according to claim 17, wherein if the systolic array corresponding to the accumulator is loaded with the low n bits of the weight value, then when obtaining the output result corresponding to the input characteristic value according to the output result of the high n bits of the input characteristic value and the output result of the low n bits, the accumulator is specifically used for:

Shift the output result of the high n bits of the input characteristic value to the left by n bits, and add it to the output result of the low n bits to obtain the output result corresponding to the input characteristic value.
The device according to claim 17, wherein if the systolic array corresponding to the accumulator is loaded with the high n bits of the weight value, then when obtaining the output result corresponding to the input characteristic value according to the output result of the high n bits of the input characteristic value and the output result of the low n bits, the accumulator is specifically used for:

Shift the output result of the high n bits of the input characteristic value to the left by n bits, add it to the output result of the low n bits, and shift the result of the addition to the left by n bits to obtain the output result corresponding to the input characteristic value.
The device according to claim 17, wherein the output result of the upper n bits of the input characteristic value and the output result of the lower n bits are respectively two adjacent output results obtained from the pulsation unit.
The device of claim 16, wherein the accumulator comprises:

A multiply-accumulate result register for obtaining the output result of the last pulsating unit;

The pre-multiplication and accumulation result register is used to obtain an output result from the multiplication and accumulation result register every other clock cycle when the input characteristic value is 2n bits;

The vertical addition circuit is used to send the output result in the multiply-accumulate result register to the first-stage addition circuit when the input characteristic value is n bits, or, when the input characteristic value is 2n bits, to send the sum of the output result in the multiply-accumulate result register and the output result in the pre-multiplication and accumulation result register to the first-stage addition circuit;

The first-stage addition circuit is used to add the result output from the vertical addition circuit and the result output from the previous accumulator.
The device according to claim 21, wherein the accumulator further comprises: a filter circuit;

The filter circuit is used to filter out the redundant multiplication and accumulation results output by the systolic array according to the stride (step) value of the convolution operation.
The device according to claim 21, wherein the accumulator further comprises: an accumulator result register;

The accumulator result register is used to obtain the output result of the previous accumulator and send it to the first-stage addition circuit.
The device according to claim 23, wherein the accumulator further comprises: a delay circuit;

The delay circuit is used to delay, according to the dilation (expansion) value of the convolution operation, the output result of the previous accumulator by the corresponding number of clock cycles and send it to the accumulator result register.
The device according to claim 21, wherein the accumulator further comprises:

The sum register is used to store the result output by the first-stage addition circuit and output the result to the next-stage accumulator.
The device according to claim 25, wherein the device further comprises: a result output unit and a result storage unit; the accumulator further comprises: a second-stage addition circuit;

When the number of rows of the weight value matrix is greater than the number of rows of the systolic array, the systolic array loads part of the weight values in the weight value matrix each time; the result storage unit is used to store intermediate results, wherein the intermediate result is the result obtained after only part of the weight values in the weight value matrix have been calculated;

The second-stage addition circuit of the last-stage accumulator is used to add the result in the sum register and the intermediate result read from the result storage unit by the result output unit and output to the result output unit;

The result output unit is configured to send the result obtained from the second-stage addition circuit to the output module when the result output by the second-stage addition circuit is the final result, and to send the obtained result to the result storage unit when the result output by the second-stage addition circuit is an intermediate result;

Wherein, the final result is the corresponding result of all the weight values in the weight value matrix after calculation.
An electronic device, characterized by comprising the data processing device according to any one of claims 1-26.
A data processing method, characterized in that it comprises:

Obtain the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix;

Convolve the input eigenvalue matrix with the n-bit or 2n-bit weight value matrix to obtain the output eigenvalue matrix;

Output the output eigenvalue matrix;

Wherein, the n is a positive integer.
The method according to claim 28, wherein the length of the weight value in the n-bit weight value matrix is n bits; the length of the weight value in the 2n-bit weight value matrix is 2n bits;

The length of the input eigenvalue in the input eigenvalue matrix is the same as the length of the weight value in the weight value matrix.
The method according to claim 28, further comprising:

Store data in a matrix, the matrix being at least one of an input eigenvalue matrix, an output eigenvalue matrix, and a weight value matrix;

Wherein, when the length of the stored data is 2n bits, m data are stored in a 2n*m-bit storage space, and the high n bits and low n bits of each data are stored adjacently; the m is a positive integer.
The method according to claim 28, wherein the convolution operation of the input eigenvalue matrix and the n-bit or 2n-bit weight value matrix to obtain the output eigenvalue matrix comprises:

Multiply and accumulate the n-bit or 2n-bit weight value in the weight value matrix with the corresponding input feature value;

According to the multiplication and accumulation result obtained by the multiplication and accumulation operation, the output eigenvalue matrix is calculated.
The method according to claim 31, wherein multiplying and accumulating the n-bit or 2n-bit weight value in the weight value matrix with the corresponding input feature value comprises:

Load the weight value in the weight value matrix in the systolic array;

Multiply and accumulate the weight value loaded by each column of systolic cells in the systolic array and the corresponding input characteristic value to obtain the multiply and accumulate result corresponding to the weight value of each column;

Wherein, the weight value is n-bit or 2n-bit weight value.
The method according to claim 32, wherein the length of the weight value that can be loaded by the pulsation unit is n bits;

When the length of the weight value in the weight value matrix is 2n bits, each column of pulsation units loads the high n bits or the low n bits of the weight values in the weight value matrix.
The method according to claim 33, wherein the upper n bits and the lower n bits of the weight value of one column are respectively loaded in two adjacent columns of pulsation units.
The method according to claim 32, wherein when the length of the input eigenvalues in the input eigenvalue matrix is 2n bits, the input eigenvalue acquired by the pulsation unit each time is the high n bits or the low n bits of an input eigenvalue in the input eigenvalue matrix.
The method according to claim 35, wherein the high n bits or low n bits of the input characteristic value are sequentially transferred from the first row of pulsation units to the last row of pulsation units.
The method according to claim 32, wherein multiplying and accumulating the weight value loaded by each column of pulsation units in the systolic array and the corresponding input characteristic value to obtain the multiplication and accumulation result corresponding to the weight value of each column comprises:

The input eigenvalues in the input eigenvalue matrix are sequentially transferred to the right in the systolic array, and the loaded weight value and the transferred input eigenvalue are multiplied and accumulated by the pulsation units of each column to obtain the multiplication and accumulation result corresponding to the weight values of each column.
The method according to claim 32, wherein loading the weight values in the weight value matrix into the systolic array comprises:

In the shift phase of the weight value loading phase, for each column of systolic cells, the weight values to be loaded into that column are sent sequentially into the systolic array through the first systolic cell of the column, and within the systolic array the received weight values are passed downward in turn from the first systolic cell;

In the load phase of the weight value loading phase, each systolic cell in the systolic array stores its corresponding weight value.
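
The two phases of weight loading can be pictured with a small Python simulation of a single column. This is only a sketch: the function name `load_column_weights`, the per-cell pass-through register, and the bottom-first feed order are assumptions chosen so that each weight ends up in its own cell; the claim itself states only that weights enter through the first cell, move downward, and are then stored.

```python
def load_column_weights(target_weights):
    """Simulate the shift phase and the load phase for one column.

    target_weights[i] is the weight that cell i should end up holding
    (cell 0 is the first, topmost cell of the column).
    """
    depth = len(target_weights)
    shift_regs = [None] * depth  # one pass-through register per cell

    # Shift phase: one weight enters through the first cell per clock cycle
    # while every cell hands its current value to the cell below.
    # Assumption: weights are fed bottom-first so each lands in its own cell.
    for weight in reversed(target_weights):
        shift_regs = [weight] + shift_regs[:-1]

    # Load phase: every cell latches the value now sitting in its register.
    stored = list(shift_regs)
    return stored


print(load_column_weights([3, 7, -2, 5]))
```

In this model a column of depth D needs D clock cycles of shifting before the single load step.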
The method according to claim 37, wherein performing, by each column of systolic cells, the multiply-accumulate operation on the loaded weight values and the passed input feature values, to obtain the multiply-accumulate result corresponding to each column of weight values, comprises:

Each column of systolic cells performs the following operations: each systolic cell in the column obtains an input feature value, multiplies it by the weight value loaded into that cell, adds the product to the output of the previous systolic cell, and outputs the sum; the output of the last systolic cell is the multiply-accumulate result corresponding to the column.
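
The per-column chain described above amounts to a running dot product. The following Python sketch models it without clock-level timing; the function name `column_multiply_accumulate` and the convention that the first cell receives 0 from the missing cell above it are assumptions for illustration.

```python
def column_multiply_accumulate(column_weights, input_values):
    """Return the output of every cell in one column, top to bottom.

    Each cell multiplies its input feature value by its loaded weight and
    adds the product to the output of the previous cell. The last entry is
    the multiply-accumulate result of the column.
    """
    partial_sum = 0  # assumed output of the absent cell above the first cell
    cell_outputs = []
    for weight, value in zip(column_weights, input_values):
        partial_sum = partial_sum + weight * value
        cell_outputs.append(partial_sum)
    return cell_outputs


outputs = column_multiply_accumulate([1, 2, 3], [4, 5, 6])
print(outputs[-1])  # multiply-accumulate result of the column
```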
The method according to claim 32, wherein an accumulator is provided for each column of systolic cells, and calculating the output feature value matrix according to the multiply-accumulate results obtained by the multiply-accumulate operation comprises:

Each accumulator obtains the output result of its corresponding column of systolic cells and adds it to the output result of the previous-stage accumulator to obtain its own output result; the output feature value is determined from the output result of the last-stage accumulator.
The method according to claim 40, wherein, if the input feature values are 2n bits long, obtaining the output result of the corresponding column of systolic cells comprises:

Store, in one register, the output result obtained from the systolic cells that corresponds to the high n bits of the input feature value, and store, in another register, the output result that corresponds to the low n bits of the input feature value;

Obtain the output result corresponding to the input feature value from the output result for its high n bits and the output result for its low n bits.
The method according to claim 41, wherein, if the systolic array corresponding to the accumulator is loaded with the low n bits of the weight value, obtaining the output result corresponding to the input feature value from the output result for its high n bits and the output result for its low n bits comprises:

Shift the output result for the high n bits of the input feature value left by n bits and add it to the output result for the low n bits to obtain the output result corresponding to the input feature value.
The method according to claim 41, wherein, if the systolic array corresponding to the accumulator is loaded with the high n bits of the weight value, obtaining the output result corresponding to the input feature value from the output result for its high n bits and the output result for its low n bits comprises:

Shift the output result for the high n bits of the input feature value left by n bits, add it to the output result for the low n bits, and shift the sum left by n bits to obtain the output result corresponding to the input feature value.
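
The two recombination rules, one for a column holding the low half of the weight and one for a column holding the high half, can be checked with ordinary integer arithmetic. The Python sketch below assumes n = 8, a signed high half and an unsigned low half, and hypothetical helper names `split_halves` and `combine_input_halves`. Adding the two column results at the end is also an assumption, made here only to show that the shifts reproduce the full 2n-bit by 2n-bit product.

```python
N_BITS = 8  # assumed cell width n


def split_halves(value, n=N_BITS):
    """Split so that value == high * 2**n + low (low is unsigned)."""
    return value >> n, value & ((1 << n) - 1)


def combine_input_halves(result_high, result_low, weight_is_high_half, n=N_BITS):
    """Recombine the results obtained for the two halves of an input value.

    result_high / result_low: column results for the high / low n bits of
    the input feature value. If the column holds the high n bits of the
    weight, the sum is shifted left by a further n bits.
    """
    combined = (result_high << n) + result_low
    if weight_is_high_half:
        combined <<= n
    return combined


weight, feature = 0x1234, -300
w_high, w_low = split_halves(weight)
x_high, x_low = split_halves(feature)

low_column = combine_input_halves(w_low * x_high, w_low * x_low, False)
high_column = combine_input_halves(w_high * x_high, w_high * x_low, True)

# Assumed final step: the two adjacent columns are summed.
assert low_column + high_column == weight * feature
```

The shifts restore the binary weight of each half, so four n-bit multiplications stand in for one 2n-bit multiplication.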
The method according to claim 41, wherein the output result for the high n bits and the output result for the low n bits of the input feature value are two adjacent output results obtained from the systolic cells.
The method according to claim 41, wherein storing, in one register, the output result obtained from the systolic cells that corresponds to the high n bits of the input feature value, and storing, in another register, the output result that corresponds to the low n bits of the input feature value, comprises:

Obtain the output results from the systolic cells through a multiply-accumulate result register, and every other clock cycle have the multiply-accumulate result register send the output result corresponding to the low n bits to a previous multiply-accumulate result register; the multiply-accumulate result register stores the output result corresponding to the high n bits, and the previous multiply-accumulate result register stores the output result corresponding to the low n bits;

Alternatively, obtain the output results from the systolic cells through the multiply-accumulate result register, and every other clock cycle have the multiply-accumulate result register send the output result corresponding to the high n bits to the previous multiply-accumulate result register; the multiply-accumulate result register stores the output result corresponding to the low n bits, and the previous multiply-accumulate result register stores the output result corresponding to the high n bits.
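
A rough way to picture the two-register scheme is as a pairing of adjacent results in a stream. The Python sketch below is a behavioural model only: the names `pair_half_results`, `result_reg` and `previous_reg`, and the choice of odd-numbered cycles for the hand-over, are assumptions. The flag `high_first` selects which of the two alternatives in the claim is modelled.

```python
def pair_half_results(stream, high_first=True):
    """Group adjacent column results into (high, low) pairs.

    stream alternates between the result for one half of an input feature
    value and the result for its other half. Every other cycle the older
    result is handed from the result register to the previous register, so
    both halves are available at the same time.
    """
    result_reg = None    # models the multiply-accumulate result register
    previous_reg = None  # models the previous multiply-accumulate result register
    pairs = []
    for cycle, value in enumerate(stream):
        if cycle % 2 == 1:
            previous_reg = result_reg  # hand-over on every other cycle (assumed phase)
        result_reg = value
        if cycle % 2 == 1:
            if high_first:
                high, low = previous_reg, result_reg
            else:
                low, high = previous_reg, result_reg
            pairs.append((high, low))
    return pairs


print(pair_half_results([10, 1, 20, 2]))
```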
The method according to claim 41, further comprising:

Filter out the redundant multiply-accumulate results output by the systolic array according to the stride value of the convolution operation.
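
One plausible reading of this filtering step is that the systolic array produces a result for every input position, as in a stride-1 convolution, and only the positions that fall on the stride grid are kept. The Python sketch below follows that reading; the function name `filter_by_stride` and the use of the same stride in both dimensions are assumptions.

```python
def filter_by_stride(dense_results, stride):
    """Keep only the results that a convolution with the given stride needs.

    dense_results is a 2-D grid of stride-1 results; rows and columns whose
    index is not a multiple of the stride are treated as redundant.
    """
    return [row[::stride] for row in dense_results[::stride]]


dense = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
print(filter_by_stride(dense, 2))
```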
The method according to claim 41, wherein obtaining the output result of the corresponding column of systolic cells and adding it to the output result of the previous-stage accumulator comprises:

Delay the output result of the previous-stage accumulator by a number of clock cycles corresponding to the dilation value of the convolution operation, and then add it to the output result obtained from the corresponding column of systolic cells.
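
To see why delaying the previous accumulator implements dilation, consider a one-dimensional toy model in which column k outputs w[k] * x[t] at clock cycle t. The Python sketch below chains the accumulators with a delay equal to the dilation value. The timing model, the function name `delayed_accumulator_chain`, and the treatment of not-yet-available values as 0 are all assumptions for illustration rather than details from the claim.

```python
def delayed_accumulator_chain(column_streams, dilation):
    """Chain one accumulator per column, delaying the previous stage.

    column_streams[k][t] is the output of column k at clock cycle t.
    Returns the output stream of the last-stage accumulator.
    """
    num_cycles = len(column_streams[0])
    previous = [0] * num_cycles  # there is no accumulator before the first
    for stream in column_streams:
        current = []
        for t in range(num_cycles):
            delayed = previous[t - dilation] if t - dilation >= 0 else 0
            current.append(stream[t] + delayed)
        previous = current
    return previous


x = [1, 2, 3, 4, 5, 6]
w = [1, 10, 100]
streams = [[wk * xt for xt in x] for wk in w]
last_stage = delayed_accumulator_chain(streams, dilation=2)

# Reference dilated convolution: y[i] = sum_k w[k] * x[i + 2 * k]
reference = [sum(w[k] * x[i + 2 * k] for k in range(3)) for i in range(2)]
assert last_stage[4:] == reference
```

In this model the first (K - 1) * dilation outputs of the last stage are incomplete sums and would be discarded.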
The method according to claim 40, wherein loading the weight values in the weight value matrix into the systolic array comprises:

If the number of rows of the weight value matrix is greater than the number of rows of the systolic array, load a part of the weight values in the weight value matrix into the systolic array each time;

Correspondingly, determining the output feature value from the accumulation result of the last-stage accumulator comprises:

Determine whether an intermediate result of the output feature value has been stored:

If not, store the accumulation result of the last-stage accumulator as an intermediate result;

If yes, add the accumulation result of the last-stage accumulator to the stored intermediate result; if the sum is the final result of the output feature value, send the final result to the output module; if the sum is not the final result of the output feature value, update the intermediate result to the sum and store it;

Wherein the intermediate result is the result obtained after computing with only part of the weight values in the weight value matrix, and the final result is the result obtained after computing with all of the weight values in the weight value matrix.
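
The store-or-add decision in this claim can be sketched for a single output feature value as follows. The Python code assumes one column of weights that is taller than the array and is processed in tiles of `array_rows` rows. The function name `tiled_output_value` and the handling of the single-tile case are assumptions that go beyond the claim text.

```python
def tiled_output_value(weights, inputs, array_rows):
    """Compute one output feature value when the weights need several passes."""
    num_tiles = -(-len(weights) // array_rows)  # ceiling division
    stored = None  # no intermediate result stored yet

    for tile in range(num_tiles):
        start = tile * array_rows
        end = start + array_rows
        # Accumulation result of the last-stage accumulator for this pass.
        partial = sum(w * x for w, x in zip(weights[start:end], inputs[start:end]))
        is_last_pass = tile == num_tiles - 1

        if stored is None:
            stored = partial  # store as the intermediate result
            if is_last_pass:
                return stored  # single pass; case added for completeness
        else:
            total = stored + partial
            if is_last_pass:
                return total  # final result, sent to the output module
            stored = total    # update the intermediate result
    return stored


print(tiled_output_value([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], array_rows=2))
```

Here five weights on a two-row array take three passes, with the intermediate result updated after the second pass and the final result produced after the third.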