CN113361699B - Multiplication circuit, system on chip and electronic device - Google Patents


Info

Publication number
CN113361699B
CN113361699B (application CN202110805327.6A)
Authority
CN
China
Prior art keywords
data
convolution
PEs
multiplication circuit
channel
Prior art date
Legal status
Active
Application number
CN202110805327.6A
Other languages
Chinese (zh)
Other versions
CN113361699A (en)
Inventor
孙伟昶
Current Assignee
ARM Technology China Co Ltd
Original Assignee
ARM Technology China Co Ltd
Priority date
Filing date
Publication date
Application filed by ARM Technology China Co Ltd
Priority to CN202110805327.6A
Publication of CN113361699A
Application granted
Publication of CN113361699B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52Multiplying; Dividing
    • G06F7/523Multiplying only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computational Linguistics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Complex Calculations (AREA)

Abstract

The application relates to a multiplication circuit, a system on chip, and an electronic device. The multiplication circuit includes a PE array, a first buffer for storing input data, a second buffer for storing a plurality of channel parameters of a plurality of convolution kernels, and a switching circuit connected between the PE array and the second buffer. When the multiplication circuit performs a first convolution operation, the switching circuit outputs, in each operation period, a plurality of channel parameters of one convolution kernel stored in the second buffer to the PE array as effective data, so that when the PE array convolves the channel parameters received from the switching circuit with the input data obtained from the first buffer, only the channel parameters serving as effective data affect the convolution result. With this scheme, the multiplication circuit needs no adjustment of its input/output data format when implementing different convolution operations in different application scenarios, giving it a wide application range.

Description

Multiplication circuit, system on chip and electronic device
Technical Field
The present application relates to the field of neural networks, and in particular, to a multiplication circuit, a system on a chip, and an electronic device.
Background
In recent years, with the rapid development of artificial intelligence (Artificial Intelligence, AI) technology, applications of AI-enabled unmanned vehicles, unmanned aerial vehicles, intelligent terminals, and the like have become increasingly widespread. AI processes data from various sensors in real time through neural network technology to perceive the external environment. To improve the processing performance of an AI application terminal, a dedicated hardware platform is generally adopted to implement specific operations; for example, the convolution operations of a convolutional neural network model deployed on the AI application terminal are implemented by the dedicated hardware platform.
However, existing dedicated hardware platforms can generally implement only specific types of convolution operations, or achieve high computational efficiency only for those specific types while being inefficient for others; for example, a dedicated hardware platform may support only standard convolution operations and not deep convolution operations. The application range of existing dedicated hardware platforms is therefore narrow, which hinders the popularization and application of products.
Disclosure of Invention
The embodiments of the present application provide a multiplication circuit, a system on chip, and an electronic device. In this scheme, a switching circuit is arranged between the parameter cache of the multiplication circuit and the PE array, so that every PE in the PE array can be utilized whether the multiplication circuit performs a deep convolution operation or a standard convolution operation, giving high computational efficiency. Moreover, whether the multiplication circuit performs a deep convolution operation or a standard convolution operation on the same input data stored in the input data buffer, the final output of each column of PEs is a single feature map rather than a plurality of feature maps. Thus, when the multiplication circuit performs different convolution operations, each column of PEs corresponds to one feature map and the output data format is the same. The multiplication circuit provided by the application therefore needs no adjustment of the input/output data format when implementing different convolution operations in different application scenarios, allowing product developers/designers to adapt the same multiplication circuit to different application scenarios without changing the input/output data format.
In a first aspect, embodiments of the present application provide a multiplication circuit for convolution operations, including: the PE array, the first buffer memory used for storing input data, the second buffer memory used for storing a plurality of channel parameters of a plurality of convolution kernels, and the switching circuit connected between the PE array and the second buffer memory;
Wherein, when the multiplication circuit performs the first convolution operation:
the switching circuit is used for outputting a plurality of channel parameters of one convolution kernel stored in the second buffer memory as effective data to the PE array in each operation period, so that when the PE array carries out convolution operation on the channel parameters of the convolution kernel received from the switching circuit and the input data acquired from the first buffer memory, only the channel parameters serving as the effective data have influence on the convolution operation result.
In some embodiments, the first convolution operation is a deep convolution operation, the first buffer is an input data buffer, and the second buffer is a parameter buffer.
In a possible implementation manner of the first aspect, the PE array includes a plurality of columns of PEs, and the switching circuit includes a plurality of sub-switches in one-to-one correspondence with the columns of PEs in the PE array;
wherein, when the multiplication circuit performs the first convolution operation:
each sub-switch of the switching circuit is used for outputting, in each operation period, one channel parameter of one convolution kernel stored in the second buffer as effective data to the corresponding column of PEs in the PE array, so that when each column of PEs receiving the effective data convolves the channel parameters received from the corresponding sub-switch with the input data obtained from the first buffer, only the channel parameter serving as effective data affects the convolution result, and
the effective data acquired by each column of PEs corresponds to a different channel of the convolution kernel.
In a possible implementation of the first aspect, when the multiplication circuit performs the first convolution operation:
each sub-switch of the switching circuit is used for, in each operation period, selecting one channel parameter of one convolution kernel stored in the second buffer as effective data, replacing the other channel parameters of that convolution kernel with zeros, and outputting the one channel parameter serving as effective data together with the channel parameters replaced with zeros to the corresponding column of PEs in the PE array, so that each column of PEs receiving the effective data convolves the channel parameters received from the corresponding sub-switch (the effective data and the zero-replaced parameters) with the input data obtained from the first buffer, wherein
there is a one-to-one correspondence between the channels of the input data and the channels of the convolution kernel that participate in the convolution operation.
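The zero-replacement behaviour described above can be sketched in a few lines of Python (an illustrative model only, not the patent's hardware; the function names and the list-of-channels data layout are assumptions): a sub-switch keeps one channel's kernel parameters as effective data and zeroes the rest, so a PE that multiply-accumulates over every channel produces exactly the deep-convolution result for the selected channel.

```python
def mac_all_channels(window, kernel):
    # Standard-convolution PE behaviour: multiply-accumulate over every channel.
    # window and kernel are lists of channels; each channel is a flat list of values.
    return sum(w * k
               for ch_w, ch_k in zip(window, kernel)
               for w, k in zip(ch_w, ch_k))

def sub_switch(kernel, effective_channel):
    # Keep one channel's parameters as effective data; replace the rest with zeros,
    # as the claim describes for the first (deep) convolution operation.
    return [ch if i == effective_channel else [0] * len(ch)
            for i, ch in enumerate(kernel)]
```

Because the masked channels contribute zero products, the unmodified multiply-accumulate hardware needs no structural change to switch between the two operation types.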
In a possible implementation of the first aspect, when the multiplication circuit performs the second convolution operation:
the switching circuit is used for outputting, in each operation period, a plurality of channel parameters of a plurality of convolution kernels stored in the second buffer to the PE array as effective data; and
the PE array is used for convolving the plurality of channel parameters of the plurality of convolution kernels received from the switching circuit with the corresponding channels of the input data obtained from the first buffer, wherein,
there is a one-to-one correspondence between the channels of the input data and the channels of the convolution kernel.
In a possible implementation manner of the first aspect, the PE array includes a plurality of columns of PEs, and the switching circuit includes a plurality of sub-switches corresponding to each column of PEs in the PE array one by one;
wherein, when the multiplication circuit performs the second convolution operation:
each sub-switch of the switching circuit is used for outputting, in each operation period, the plurality of channel parameters of one of the plurality of convolution kernels stored in the second buffer as effective data to the corresponding column of PEs in the PE array, so that each column of PEs convolves the plurality of channel parameters of the one convolution kernel received from the corresponding sub-switch with the corresponding channels of the input data obtained from the first buffer, wherein
there is a one-to-one correspondence between the channels of the input data and the channels of the convolution kernel that participate in the convolution operation.
In a possible implementation manner of the first aspect, the multiplication circuit further includes a third buffer, configured to buffer a convolution operation result of the PE array.
In a possible implementation of the first aspect, when the multiplication circuit performs the first convolution operation or the second convolution operation: the operation result of each column of PEs in the PE array corresponds to one channel of the convolution operation result of the PE array.
In a possible implementation of the first aspect, when the multiplication circuit performs the first convolution operation: the partial PE and other PE except the partial PE in the PE array are used for alternately convolving the channel parameters of the convolution kernel received from the switch circuit with the input data acquired from the first buffer. For example, the PE array is divided into a first group of PEs and a second group of PEs, and the first group of PEs convolves the channel parameters of the convolution kernel received from the switching circuit with the input data obtained from the first buffer during a first operation period when the multiplication circuit performs a first convolution operation. In a second operation period, the second group of PEs perform convolution operation on the channel parameters of the convolution kernel received from the switching circuit and the input data acquired from the first buffer.
In a possible implementation of the first aspect, when the multiplication circuit performs the second convolution operation: all PEs in the PE array are used for simultaneously convolving the channel parameters of the convolution kernel received from the switching circuit with the input data acquired from the first buffer.
In a possible implementation of the first aspect, the multiplication circuit further includes a storage control circuit for reading input data stored in the external storage space into the first buffer and/or reading a plurality of channel parameters of a plurality of convolution kernels stored in the external storage space into the second buffer.
In a second aspect, embodiments of the present application provide a system on a chip, including the multiplication circuit of the first aspect and various possible implementations of the first aspect, and a processor and a memory;
a memory for storing instructions for execution by one or more processors of the system-on-chip;
the processor is one of the processors of the system on chip and is configured to, when the instructions are executed by the processor, control the multiplication circuit to perform convolution operations in different operation modes.
In a third aspect, an embodiment of the present application provides an electronic device, including the system on a chip in the second aspect, and a processor and a memory;
a memory for storing instructions for execution by one or more processors of the electronic device;
a processor for controlling multiplication circuits in the system-on-chip to perform convolution operations in different modes of operation when the instructions are executed by the one or more processors.
Drawings
FIG. 1 (a) is a schematic diagram showing the operation process of a standard convolution operation in a related art;
FIG. 1 (b) is a schematic diagram showing the operation process of a deep convolution operation in a related art;
fig. 2 (a) shows a schematic diagram of a multiplication circuit in one embodiment;
FIG. 2 (b) illustrates individual data blocks of input data stored in the input data cache illustrated in FIG. 2 (a);
FIG. 2 (c) shows the calculation process of PE10 in the multiplication circuit shown in FIG. 2 (a) performing a standard convolution operation;
FIG. 2 (d) shows the calculation process of PE11 in the multiplication circuit shown in FIG. 2 (a) performing a standard convolution operation;
FIG. 2 (e) shows the calculation process of PE20 in the multiplication circuit shown in FIG. 2 (a) performing a standard convolution operation;
FIG. 2 (f) shows the calculation process of PE21 in the multiplication circuit shown in FIG. 2 (a) performing a standard convolution operation;
FIG. 2 (g) shows the calculation process of PE10 in the multiplication circuit shown in FIG. 2 (a) performing a deep convolution operation;
FIG. 2 (h) shows the calculation process of PE20 in the multiplication circuit shown in FIG. 2 (a) performing a deep convolution operation;
FIG. 3 illustrates a block diagram of the hardware architecture of a multiplication circuit provided herein, according to some embodiments of the present application;
FIG. 4 is a schematic diagram of the hardware structure of a multiplication circuit with a 4*8 PE array provided herein, according to some embodiments of the present application;
FIG. 5 (a) shows a schematic structural diagram of a sub-switch in the switching circuit shown in FIG. 4, according to some embodiments of the present application;
FIG. 5 (b) shows the convolution kernel data output by the sub-switch 241 of the multiplication circuit of FIG. 4 to PE10, PE20, PE30, and PE40 (the first column of PEs) of FIG. 4 in one clock cycle, according to some embodiments of the present application;
FIG. 6 illustrates a timing diagram of a multiplication circuit provided herein in performing standard convolution operations, according to some embodiments of the present application;
FIG. 7 (a) is a schematic diagram illustrating the multiplication circuit of FIG. 4 convolving input data of size Df×Df×M based on N convolution kernels of size Dk×Dk×M, according to some embodiments of the present application;
FIG. 7 (b) illustrates various data blocks included in the input data stored in the input data cache illustrated in FIG. 4, according to some embodiments of the present application;
FIG. 7 (c) illustrates the convolution kernel and input data of the respective PEs in the first group of PEs 251 participating in the convolution operation during the first clock cycle clk1 of the convolution operation performed by the multiplication circuit shown in FIG. 4, in accordance with some embodiments of the present application;
FIG. 7 (d) illustrates the convolution kernel and input data of each PE in the second group of PEs 252 participating in the convolution operation in the second clock cycle clk2 during which the multiplication circuit shown in FIG. 4 performs the convolution operation, according to some embodiments of the present application;
FIG. 8 illustrates a timing diagram of a multiplication circuit provided herein in performing a deep convolution operation, according to some embodiments of the present application;
FIG. 9 illustrates a block diagram of the hardware architecture of a system-on-chip provided herein, according to some embodiments of the present application;
FIG. 10 illustrates a schematic diagram of the system-on-chip of FIG. 9 as provided herein applied to an autopilot scenario, in accordance with some embodiments of the present application;
FIG. 11 illustrates a flow chart of image recognition by the automated driving automobile of FIG. 10 using a system-on-chip, according to some embodiments of the present application;
fig. 12 is a schematic diagram illustrating application of the system-on-chip shown in fig. 9 to a face recognition access control scenario provided in the present application, according to some embodiments of the present application;
fig. 13 illustrates a flow chart of the face recognition access control system of fig. 12 employing the system-on-chip of fig. 9 for face recognition, in accordance with some embodiments of the present application;
fig. 14 illustrates a block diagram of an electronic device provided herein, according to some embodiments of the present application.
Detailed Description
Illustrative embodiments of the present application include, but are not limited to, a multiplication circuit, a system-on-chip, and an electronic device that are dedicated to implementing multiplication operations in a neural network model.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Embodiments of the present application relate to the field of neural networks, and in order to better understand the schemes of the embodiments of the present application, related terms and concepts of the neural networks to which the embodiments of the present application may relate are first described below.
(1) Standard convolution (Standard Convolution) operation
When a standard convolution operation is performed by one convolution kernel on input data having a plurality of data channels, the convolution kernel needs to convolve all the data across all channels of the input data.
Fig. 1 (a) schematically shows a standard convolution operation. For example, assume that the input data of the convolutional neural network 10 is the data of the 3 color channels Red, Green, and Blue in the RGB color space of an image, and that the image has 5 pixels in both the vertical and horizontal directions; the size of the input data can then be represented as 5×5×3. When a standard convolution operation needs to be performed on this 5×5×3 input data, a convolution kernel with the same channel number of 3 is needed to perform the convolution operation (i.e., multiply-add operation) on all the data in the input data, obtaining a corresponding convolution result (also referred to as a feature map).
For example, as shown in FIG. 1 (a), the input data of size 5×5×3 is convolved by 4 convolution kernels of size 3*3, respectively. The 4 convolution kernels are denoted as convolution kernels K1, K2, K3, and K4, and the data of the 3 channels of the 5×5×3 input data are denoted as channel C1, channel C2, and channel C3, respectively. The convolution kernel K1 convolves all the data of channels C1 to C3 of the input data to obtain a feature map P1; K2 convolves all the data of channels C1 to C3 to obtain a feature map P2; K3 convolves all the data of channels C1 to C3 to obtain a feature map P3; and K4 convolves all the data of channels C1 to C3 to obtain a feature map P4.
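As a minimal sketch of the standard convolution just described (illustrative Python, not the patent's circuit; the function name and nested-list layout are assumptions), one kernel accumulates over all channels of the input and produces a single feature map:

```python
def standard_conv(inp, kernel, k=3):
    # inp: [C][H][W]; kernel: [C][k][k]. One kernel convolves ALL channels
    # of the input and yields a single feature map (stride 1, no padding).
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    return [[sum(inp[c][i + u][j + v] * kernel[c][u][v]
                 for c in range(C) for u in range(k) for v in range(k))
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]
```

For a 5×5×3 input and a 3-channel 3*3 kernel, the result is one 3×3 feature map, matching feature maps P1 to P4 above (one map per kernel).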
(2) Deep convolution (Depthwise Convolution) operation
Unlike a standard convolution operation, when a deep convolution operation is performed by one convolution kernel on input data having a plurality of data channels, the convolution kernel convolves only the data of one channel of the input data.
Fig. 1 (b) schematically illustrates a deep convolution operation. As shown in FIG. 1 (b), 3 convolution kernels of size 3*3 are used to convolve the input data of size 5×5×3, respectively. The 3 convolution kernels are denoted as K1', K2', and K3', and the data of the three channels of the 5×5×3 input data are still denoted as channel C1, channel C2, and channel C3. The convolution kernel K1' convolves only the data of channel C1 of the input data to obtain a feature map P1'; K2' convolves only the data of channel C2 to obtain a feature map P2'; and K3' convolves only the data of channel C3 to obtain a feature map P3'.
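The deep convolution can be sketched the same way (again an illustrative model with assumed names): each kernel touches only its own channel, so a 3-channel input with 3 kernels yields 3 feature maps:

```python
def deep_conv(inp, kernels, k=3):
    # kernels[c] convolves ONLY channel c of inp, so a C-channel input with
    # C kernels yields C feature maps (stride 1, no padding).
    H, W = len(inp[0]), len(inp[0][0])
    return [[[sum(inp[c][i + u][j + v] * kernels[c][u][v]
                  for u in range(k) for v in range(k))
              for j in range(W - k + 1)]
             for i in range(H - k + 1)]
            for c in range(len(inp))]
```

Compared with the standard convolution sketch above, the sum no longer runs over channels: each output map P1', P2', P3' depends on exactly one input channel.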
In the following description, the technical solutions of the embodiments of the present application are described with respect to convolutional neural networks involving convolutional operations (i.e., multiply-add operations). It is to be understood that, besides convolutional neural networks, the technical solution of the present application may also be applied to other neural networks involving multiply-add operations, which is not limited in this application.
Furthermore, it will be appreciated that the above description of standard convolution and deep convolution operating on the data of the 3 color channels of the RGB color space with 3-channel convolution kernels merely illustrates a simple example of these general operations. In practical applications, the present technical scheme does not limit the number of channels of the input data or of the convolution kernel data involved in standard convolution and deep convolution operations. For example, in some embodiments, the number of channels of the input data and of the convolution kernel data involved in the standard convolution operation and the deep convolution operation may be an integer multiple of 64.
Fig. 2 (a) shows a multiplication circuit 200 in one related art that can implement convolution operations (i.e., multiply-add operations) in a convolutional neural network model. Referring to fig. 2 (a), the multiplication circuit 200 includes a direct memory access (Direct Memory Access, DMA) control unit 201, a parameter buffer 202, an input data buffer 203, an output buffer 204, and a PE array 205 composed of a plurality of processing units (Processing Element, PEs).
Wherein the DMA control unit 201 is configured to read input data stored in an external storage space into the input data buffer 203. The parameter buffer 202 is used to store data of a convolution kernel that participates in a convolution operation. The input data buffer 203 is used to store input data read from the external storage space by the DMA control unit 201. The PE array 205 is made up of a plurality of PEs, each for multiply-add operation on at least some of the input data and the convolution kernel data. The output buffer 204 is used for storing the result of the convolution operation output by the PE array 205.
The process of performing the standard convolution operation and the deep convolution operation by the multiplication circuit 200 shown in fig. 2 (a) will be described in detail.
(1) Standard convolution operation
Assume that the standard convolution operation shown in FIG. 1 (a) is performed by the multiplication circuit 200 shown in FIG. 2 (a), that is, 4 convolution kernels of size 3*3 are used to convolve the 3-channel input data stored in the input data buffer 203 (the size of the input data is denoted as 5×5×3). All the data of channels C1 to C3 of the input data then need to be convolved by the 4 convolution kernels of size 3*3.
Specifically, the parameter buffer 202 stores the data of the above 4 convolution kernels of size 3*3, denoted as convolution kernels K1, K2, K3, and K4. The input data buffer 203 stores the 3-channel input data, whose channels are denoted as C1, C2, and C3. When the PE array 205 shown in FIG. 2 (a) performs a convolution operation on the input data acquired from the input data buffer 203 based on the convolution kernel data acquired from the parameter buffer 202, each PE in the PE array 205 uses one of the convolution kernels K1 to K4 to convolve one data block of the 3-channel input data having the same size as the convolution kernel. Such a data block is obtained by sliding a window over the input data with a set stride; the data falling within the sliding window forms the corresponding data block. It will be appreciated that, since the input data has 3 data channels, each sliding window (i.e., data block) on the input data also has 3 data channels.
For example, as shown in FIG. 2 (b), for the input data of size 5×5×3, a 3*3 sliding window slides over the input data starting from the first data at the upper left of channel C1, with a stride of 1 in both the horizontal and vertical directions. The window first slides from left to right, yielding data blocks A1 to A3; it then moves down by one data from the position of data block A1 and again slides from left to right, yielding data blocks A4 to A6; it then moves down by one more data from the position of data block A4 and slides from left to right, yielding data blocks A7 to A9.
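The enumeration of data blocks A1 to A9 can be sketched as follows (illustrative Python with assumed names; the actual hardware streams these blocks rather than materializing them all):

```python
def sliding_windows(inp, k=3, stride=1):
    # Enumerate data blocks A1, A2, ... in row-major sliding order.
    # Each block keeps all channels, since a sliding window spans every channel.
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    blocks = []
    for i in range(0, H - k + 1, stride):        # slide downward
        for j in range(0, W - k + 1, stride):    # slide left to right
            blocks.append([[inp[c][i + u][j:j + k] for u in range(k)]
                           for c in range(C)])
    return blocks
```

For a 5×5 input with a 3*3 window and stride 1, this yields the 9 blocks A1 to A9 of FIG. 2 (b).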
The convolution kernel data participating in the convolution operation in PE10, PE20, PE30, and PE40 (the 4 PEs in the first column) of the PE array 205 shown in FIG. 2 (a) is the convolution kernel K1 described above; the convolution kernel data participating in the convolution operation in PE11, PE21, PE31, and PE41 (the 4 PEs in the second column) is the convolution kernel K2; in PE12, PE22, PE32, and PE42 (the 4 PEs in the third column) it is the convolution kernel K3; and in PE13, PE23, PE33, and PE43 (the 4 PEs in the fourth column) it is the convolution kernel K4.
The data blocks of the input data participating in the convolution operation in PE10, PE20, PE30, and PE40 (the 4 PEs in the first column) in the PE array 205 shown in fig. 2 (a) are the data blocks A1, A2, A3, and A4 shown in fig. 2 (b), respectively; likewise, the data blocks participating in the convolution operation in PE11, PE21, PE31, and PE41 (the 4 PEs in the second column), in PE12, PE22, PE32, and PE42 (the 4 PEs in the third column), and in PE13, PE23, PE33, and PE43 (the 4 PEs in the fourth column) are, in each case, the data blocks A1, A2, A3, and A4 shown in fig. 2 (b), respectively.
It will be appreciated that, in the PE array 205 shown in fig. 2 (a), the PEs in the same column all use the same convolution kernel but different data blocks when performing the convolution operation, while the PEs in the same row all use different convolution kernels but the same data block.
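The block/kernel assignment described in the preceding paragraphs can be summarized in a small sketch (the helper name `pe_assignment` is ours): the row index of a PE selects the data block, and the column index selects the convolution kernel.

```python
def pe_assignment(rows=4, cols=4):
    """Map each PE name to the (data block, kernel) pair it consumes.

    PE naming follows the text: PE{row}{col}, rows 1..4, columns 0..3.
    Row r uses data block A{r}; column c uses kernel K{c+1}.
    """
    return {f"PE{r}{c}": (f"A{r}", f"K{c + 1}")
            for r in range(1, rows + 1) for c in range(cols)}

print(pe_assignment()["PE10"])  # ('A1', 'K1')
print(pe_assignment()["PE43"])  # ('A4', 'K4')
```

This makes the symmetry explicit: walking down a column changes only the block; walking along a row changes only the kernel.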
Next, in conjunction with fig. 2 (c) to fig. 2 (f), the process by which the multiplication circuit 200 performs the standard convolution operation will be described in detail, taking the convolution operations of PE10, PE11, PE20, and PE21 in the PE array 205 shown in fig. 2 (a) as examples.
Operation of PE10, PE11, PE12, PE13 (4 PEs in the first row) in PE array 205
The PE10 in the PE array 205 shown in fig. 2 (a) performs the convolution operation (multiply-add operation) of the data block A1 and the convolution kernel K1 shown in fig. 2 (c): the data of each channel in the data block A1 is multiplied element-wise by the data of the corresponding channel in the convolution kernel K1, and the products are summed, yielding the first data 35 in the feature map P1. Specifically, as shown in fig. 2 (c), the multiply-add calculation of the data of channel C1 in data block A1 (denoted A11) and the data of channel C1 in convolution kernel K1 (denoted K11) is: 1×(-1)+0×1+1×0+2×1+1×(-1)+1×2+3×1+2×2+1×3 = 12; the multiply-add calculation of the data of channel C2 in data block A1 (denoted A12) and the data of channel C2 in convolution kernel K1 (denoted K12) is: (-1)×2+1×1+1×0+0×5+1×1+2×(-1)+1×4+2×1+1×2 = 6; the multiply-add calculation of the data of channel C3 in data block A1 (denoted A13) and the data of channel C3 in convolution kernel K1 (denoted K13) is: (-1)×0+1×(-1)+1×1+0×2+1×3+2×4+1×4+2×1+1×0 = 17. Adding the result of the multiply-add of A11 and K11, the result of the multiply-add of A12 and K12, and the result of the multiply-add of A13 and K13, namely 12+6+17 = 35, gives the first data 35 in the feature map P1.
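As a check, the per-channel multiply-adds above can be reproduced numerically; the matrices below are transcribed from the terms of the expressions (a verification sketch, not part of the patent):

```python
import numpy as np

# Channel slices of data block A1 and convolution kernel K1, row by row,
# read off from the multiply-add expressions in the text.
A11 = np.array([[1, 0, 1], [2, 1, 1], [3, 2, 1]])
A12 = np.array([[-1, 1, 1], [0, 1, 2], [1, 2, 1]])
A13 = np.array([[-1, 1, 1], [0, 1, 2], [1, 2, 1]])
K11 = np.array([[-1, 1, 0], [1, -1, 2], [1, 2, 3]])
K12 = np.array([[2, 1, 0], [5, 1, -1], [4, 1, 2]])
K13 = np.array([[0, -1, 1], [2, 3, 4], [4, 1, 0]])

# One partial sum per channel, then the cross-channel total.
partial = [int((a * k).sum()) for a, k in [(A11, K11), (A12, K12), (A13, K13)]]
print(partial, sum(partial))  # [12, 6, 17] 35
```

The three partial sums match the per-channel results 12, 6, and 17, and their total is the first data 35 of feature map P1.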
The PE11 in the PE array 205 shown in fig. 2 (a) performs the convolution operation of the data block A1 and the convolution kernel K2 shown in fig. 2 (d): the data of each channel in the data block A1 is multiplied element-wise by the data of the corresponding channel in the convolution kernel K2, and the products are summed, yielding the first data 51 in the feature map P2. Specifically, as shown in fig. 2 (d), the multiply-add calculation of the data of channel C1 in data block A1 (denoted A11) and the data of channel C1 in convolution kernel K2 (denoted K21) is: 1×(-1)+0×2+1×3+2×3+1×1+1×2+3×4+2×1+1×0 = 25; the multiply-add calculation of the data of channel C2 in data block A1 (denoted A12) and the data of channel C2 in convolution kernel K2 (denoted K22) is: (-1)×1+1×2+1×3+0×3+1×2+2×(-1)+1×1+2×0+1×4 = 9; the multiply-add calculation of the data of channel C3 in data block A1 (denoted A13) and the data of channel C3 in convolution kernel K2 (denoted K23) is: (-1)×2+1×1+1×0+0×2+1×5+2×4+1×1+2×(-1)+1×6 = 17. Adding the result of the multiply-add of A11 and K21, the result of the multiply-add of A12 and K22, and the result of the multiply-add of A13 and K23, namely 25+9+17 = 51, gives the first data 51 in the feature map P2.
Similarly, the PE12 performs a multiply-add operation on the data block A1 and the convolution kernel K3 to obtain the first data (not shown) of the feature map P3; the PE13 performs a multiply-add operation on the data block A1 and the convolution kernel K4 to obtain the first data (not shown) of the feature map P4.
PE20, PE21, PE22, PE23 (4 PEs in the second row) operation in PE array 205
The PE20 in the PE array 205 shown in fig. 2 (a) performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K1 shown in fig. 2 (e): the data of each channel in the data block A2 is multiplied element-wise by the data of the corresponding channel in the convolution kernel K1, and the products are summed, yielding the second data 47 in the feature map P1. Specifically, as shown in fig. 2 (e), the multiply-add calculation of the data of channel C1 in data block A2 (denoted A21) and the data of channel C1 in convolution kernel K1 (denoted K11) is: 0×(-1)+1×1+2×0+1×1+1×(-1)+1×2+2×1+1×2+(-1)×3 = 4; the multiply-add calculation of the data of channel C2 in data block A2 (denoted A22) and the data of channel C2 in convolution kernel K1 (denoted K12) is: 1×2+1×1+3×0+1×5+2×1+2×(-1)+2×4+1×1+1×2 = 19; the multiply-add calculation of the data of channel C3 in data block A2 (denoted A23) and the data of channel C3 in convolution kernel K1 (denoted K13) is: 1×0+1×(-1)+4×1+1×2+2×3+1×4+2×4+1×1+(-1)×0 = 24. Adding the result of the multiply-add of A21 and K11, the result of the multiply-add of A22 and K12, and the result of the multiply-add of A23 and K13, namely 4+19+24 = 47, gives the second data 47 in the feature map P1.
The PE21 in the PE array 205 shown in fig. 2 (a) performs the convolution operation of the data block A2 and the convolution kernel K2 shown in fig. 2 (f): the data of each channel in the data block A2 is multiplied element-wise by the data of the corresponding channel in the convolution kernel K2, and the products are summed, yielding the second data 60 in the feature map P2. Specifically, as shown in fig. 2 (f), the multiply-add calculation of the data of channel C1 in data block A2 (denoted A21) and the data of channel C1 in convolution kernel K2 (denoted K21) is: 0×(-1)+1×2+2×3+1×3+1×1+1×2+2×4+1×1+(-1)×0 = 23; the multiply-add calculation of the data of channel C2 in data block A2 (denoted A22) and the data of channel C2 in convolution kernel K2 (denoted K22) is: 1×1+1×2+3×3+1×3+2×2+2×(-1)+2×1+1×0+1×4 = 23; the multiply-add calculation of the data of channel C3 in data block A2 (denoted A23) and the data of channel C3 in convolution kernel K2 (denoted K23) is: 1×2+1×1+4×0+1×2+2×5+1×4+2×1+1×(-1)+(-1)×6 = 14. Adding the result of the multiply-add of A21 and K21, the result of the multiply-add of A22 and K22, and the result of the multiply-add of A23 and K23, namely 23+23+14 = 60, gives the second data 60 in the feature map P2.
Similarly, the PE22 performs a multiply-add operation on the data block A2 and the convolution kernel K3 to obtain the second data (not shown) of the feature map P3; the PE23 performs a multiply-add operation on the data block A2 and the convolution kernel K4 to obtain the second data (not shown) of the feature map P4.
Operation of PE30, PE31, PE32, PE33 (4 PEs in the third row) in PE array 205
The PE30 in the PE array 205 shown in fig. 2 (a) performs a convolution operation (multiply-add operation) of the data block A3 and the convolution kernel K1, that is, the data of each channel in the data block A3 corresponds to the data of each channel in the convolution kernel K1, and then sums the multiply-add results to obtain the third data in the feature map P1. The specific calculation process is similar to the calculation process of the PE20 described above, except that the PE30 performs the convolution operation (multiply-add operation) of the data block A3 and the convolution kernel K1, and the PE20 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K1, which will not be described herein.
Similarly, the PE31 performs a convolution operation (multiply-add operation) between the data block A3 and the convolution kernel K2, that is, the data of each channel in the data block A3 corresponds to the data of each channel in the convolution kernel K2, and sums the products of the multiply-add operations to obtain the third data in the feature map P2. The specific calculation process is similar to the calculation process of the PE21 described above, except that the PE31 performs the convolution operation (multiply-add operation) of the data block A3 and the convolution kernel K2, and the PE21 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K2, which will not be described herein.
The PE32 performs a convolution operation (multiply-add operation) between the data block A3 and the convolution kernel K3, that is, the data of each channel in the data block A3 corresponds to the data of each channel in the convolution kernel K3, and sums the products of the multiply-add operation, so as to obtain third data in the feature map P3. The specific calculation process is similar to the calculation process of the PE22 described above, except that the PE32 performs the convolution operation (multiply-add operation) of the data block A3 and the convolution kernel K3, and the PE22 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K3, which will not be described herein.
The PE33 performs a convolution operation (multiply-add operation) between the data block A3 and the convolution kernel K4, that is, the data of each channel in the data block A3 corresponds to the data of each channel in the convolution kernel K4, and sums the products of the multiply-add operations, so as to obtain third data in the feature map P4. The specific calculation process is similar to the calculation process of the PE23 described above, except that the PE33 performs the convolution operation (multiply-add operation) of the data block A3 and the convolution kernel K4, and the PE23 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K4, which will not be described herein.
Operation of PE40, PE41, PE42, PE43 (4 PEs in the fourth row) in PE array 205
The PE40 in the PE array 205 shown in fig. 2 (a) performs a convolution operation (multiply-add operation) of the data block A4 and the convolution kernel K1, that is, the data of each channel in the data block A4 corresponds to the data of each channel in the convolution kernel K1, and then sums the multiply-add results to obtain fourth data in the feature map P1. The specific calculation process is similar to the calculation process of the PE20 described above, except that the PE40 performs the convolution operation (multiply-add operation) of the data block A4 and the convolution kernel K1, and the PE20 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K1, which will not be described herein.
Similarly, the PE41 performs a convolution operation (multiply-add operation) between the data block A4 and the convolution kernel K2, that is, the data of each channel in the data block A4 corresponds to the data of each channel in the convolution kernel K2, and sums the products of the multiply-add operations to obtain the fourth data in the feature map P2. The specific calculation process is similar to the calculation process of the PE21 described above, except that the PE41 performs the convolution operation (multiply-add operation) of the data block A4 and the convolution kernel K2, and the PE21 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K2, which will not be described herein.
The PE42 performs a convolution operation (multiply-add operation) between the data block A4 and the convolution kernel K3, that is, the data of each channel in the data block A4 corresponds to the data of each channel in the convolution kernel K3, and sums the products of the multiply-add operation to obtain fourth data in the feature map P3. The specific calculation process is similar to the calculation process of the PE22 described above, except that the PE42 performs the convolution operation (multiply-add operation) of the data block A4 and the convolution kernel K3, and the PE22 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K3, which will not be described herein.
The PE43 performs a convolution operation (multiply-add operation) between the data block A4 and the convolution kernel K4, that is, the data of each channel in the data block A4 corresponds to the data of each channel in the convolution kernel K4, and sums the products of the multiply-add operations to obtain fourth data in the feature map P4. The specific calculation process is similar to the calculation process of the PE23 described above, except that the PE43 performs the convolution operation (multiply-add operation) of the data block A4 and the convolution kernel K4, and the PE23 performs the convolution operation (multiply-add operation) of the data block A2 and the convolution kernel K4, which will not be described herein.
It should be understood that the PE array 205 with its 16 PEs in 4 rows and 4 columns shown in fig. 2 (a) is merely one exemplary configuration of the PE array in the multiplication circuit 200, and the above calculation process of each PE in the PE array 205 is merely an exemplary calculation process illustrating how the multiplication circuit 200 performs the standard convolution operation; neither limits the specific configuration or calculation process of the multiplication circuit 200.
It will be appreciated, moreover, that in the above exemplary description of the multiplication circuit 200 performing a standard convolution operation, for input data of size 5×5×3 as shown in fig. 2 (b), a 3×3 sliding window slides over the input data with a step size of 1, starting from the first data at the upper left of channel C1, yielding nine data blocks, namely the data blocks A1 to A9. Only the convolution operation of the multiplication circuit 200 on the data blocks A1 to A4 with the convolution kernels K1 to K4 has been described above. In order for the multiplication circuit 200 to complete the convolution operation on all the data blocks in the input data, after the multiplication circuit 200 completes the convolution operation on the data blocks A1 to A4, the input data buffer 203 may refresh the data blocks it outputs to the PE array 205, that is, replace the original data blocks A1 to A4 with the data blocks A5 to A9, and the convolution operation on the data blocks A5 to A9 then continues in a manner similar to that described above for the data blocks A1 to A4, until the convolution operation on all the data in the input data is completed. The calculation result of each PE in each column of the PE array 205 is one data on the corresponding feature map. As can be seen from the above calculation process, when the multiplication circuit 200 performs a standard convolution operation, the number of columns of the PE array 205 corresponds to the number of feature maps finally obtained.
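The standard convolution just described, in which every kernel sums over all input channels and each kernel produces one feature map, can be sketched in NumPy as follows; the function name `standard_conv` and the random test values are ours:

```python
import numpy as np

def standard_conv(x, kernels):
    """Standard convolution with step size 1.

    Each kernel has the same channel count as the input; the products
    over all channels are summed, so each kernel yields one feature map.
    """
    h, w, _ = x.shape
    k = kernels[0].shape[0]
    out = np.zeros((h - k + 1, w - k + 1, len(kernels)), dtype=np.int64)
    for m, ker in enumerate(kernels):          # one output map per kernel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, m] = (x[i:i + k, j:j + k, :] * ker).sum()
    return out

x = np.random.randint(-2, 5, (5, 5, 3))            # 5x5x3 input
kernels = [np.random.randint(-1, 3, (3, 3, 3)) for _ in range(4)]
out = standard_conv(x, kernels)
print(out.shape)  # (3, 3, 4): four 3x3 feature maps, one per kernel
```

The four output channels correspond to the four columns of the PE array 205: each column accumulates one feature map.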
(2) Deep convolution operation
Assume that the deep convolution operation shown in fig. 1 (b) is to be performed by the multiplication circuit 200 shown in fig. 2 (a), that is, the 3-channel input data stored in the input data buffer 203 (of size 5×5×3) is to be convolved with three 3×3 convolution kernels, for example the convolution kernel K1', the convolution kernel K2', and the convolution kernel K3' shown in fig. 1 (b). In this case, each of the three 3×3 convolution kernels only needs to convolve the data of one of the channels C1 to C3 of the input data.
Since in the multiplication circuit 200 the data output from the input data buffer 203 to each PE of the PE array 205 is always a 3-channel data block (i.e., one of the data blocks A1 to A9), if the three 3×3 convolution kernels buffered in the parameter buffer 202 were respectively output to 3 columns of PEs in the PE array 205, then when each PE in those 3 columns performed a convolution operation, the data participating in the operation would be a full 3-channel data block and the 3×3 convolution kernel data. This contradicts the fact that each convolution kernel in the deep convolution operation only needs to convolve the data of one channel of the input data: each PE in the 3 columns would perform unwanted multiplications on the data of the other two channels, so that the calculation result would not match the result of a correct deep convolution operation.
Therefore, if the deep convolution operation is to be implemented in the multiplication circuit 200 without changing the form of the data output from the input data buffer 203 to the PE array 205 (i.e., the data blocks A1 to A9 each having 3 data channels), the three 3×3 convolution kernels can only be output to one column of the PE array 205, and each PE in that column, when performing one deep convolution operation, convolves one 3×3 convolution kernel with the data of only one channel of one data block. For example, as shown in fig. 2 (g), the convolution kernel K1' convolves only the data of channel C1 of the data block A1 (denoted A11), the convolution kernel K2' convolves only the data of channel C2 of the data block A1 (denoted A12), and the convolution kernel K3' convolves only the data of channel C3 of the data block A1 (denoted A13). Whenever a PE in that column completes one deep convolution operation, the convolution kernel data in the PE needs to be refreshed quickly, for example by quickly reading the data of another convolution kernel from the parameter cache 202, so that the next deep convolution operation in that PE convolves the refreshed convolution kernel data with the data of another channel of the corresponding data block. It will be appreciated that when the deep convolution operation on the input data is performed with only one column of PEs in the PE array 205 using the convolution kernels K1', K2', and K3', a high data read speed is required, which the conventional multiplication circuit 200 cannot meet. In addition, when only one column of PEs performs the deep convolution operation, the other PEs are idle and cannot be effectively utilized, so that the calculation efficiency of the multiplication circuit 200 is low.
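For contrast with the standard convolution, a deep (depthwise) convolution pairs kernel m with channel m only, producing one feature map per input channel. A minimal NumPy sketch (the function name `depthwise_conv` and the test values are ours):

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Deep (depthwise) convolution with step size 1.

    Kernel m convolves only channel m of the input, so the number of
    output feature maps equals the number of input channels.
    """
    h, w, c = x.shape
    k = kernels[0].shape[0]
    out = np.zeros((h - k + 1, w - k + 1, c), dtype=np.int64)
    for m in range(c):                         # one map per input channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, m] = (x[i:i + k, j:j + k, m] * kernels[m]).sum()
    return out

x = np.random.randint(-2, 5, (5, 5, 3))            # 5x5x3 input
kernels = [np.random.randint(-1, 3, (3, 3)) for _ in range(3)]
out = depthwise_conv(x, kernels)
print(out.shape)  # (3, 3, 3): feature maps P1', P2', P3'
```

Note that each output value uses a single-channel 3×3 slice, which is exactly why feeding full 3-channel blocks to the PEs would introduce the unwanted multiplications described above.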
In the following, taking PE10, PE20, PE30, and PE40 (the 4 PEs in the first column) of the multiplication circuit 200 shown in fig. 2 (a) as an example, the process by which these 4 PEs perform the deep convolution operation on the 3-channel input data stored in the input data buffer 203 (of size 5×5×3), using the convolution kernels K1', K2', and K3' read from the parameter buffer 202, will be described in detail in conjunction with fig. 2 (g) and fig. 2 (h).
Operation of PE10, PE20, PE30, PE40 (4 PEs in the first column) in PE array 205
The PE10 in the PE array 205 shown in fig. 2 (a) first performs the convolution operation (multiply-add operation) of the data of channel C1 in the data block A1 (denoted A11) and the convolution kernel K1' shown in fig. 2 (g): each data in A11 is multiplied by the corresponding data of the convolution kernel K1', and the products are added to obtain the first data 12 in the feature map P1'. Specifically, as shown in fig. 2 (g), the multiply-add calculation of A11 and the convolution kernel K1' is: 1×(-1)+0×1+1×0+2×1+1×(-1)+1×2+3×1+2×2+1×3 = 12.
After the PE10 completes the convolution operation on the data of channel C1 in the data block A1 (denoted A11) and the convolution kernel K1', the convolution kernel participating in the operation in the PE10 is replaced with the convolution kernel K2', and the input data participating in the operation in the PE10 is updated to the data of channel C2 in the data block A1 (denoted A12). The PE10 then performs the convolution operation (multiply-add operation) of A12 and the convolution kernel K2' shown in fig. 2 (g), multiplies each data in A12 by the corresponding data of the convolution kernel K2', and adds the products to obtain the first data 6 in the feature map P2'. Specifically, as shown in fig. 2 (g), the multiply-add calculation of A12 and the convolution kernel K2' is: (-1)×2+1×1+1×0+0×5+1×1+2×(-1)+1×4+2×1+1×2 = 6.
After the PE10 completes the convolution operation on the data of channel C2 in the data block A1 (denoted A12) and the convolution kernel K2', the convolution kernel participating in the operation in the PE10 is replaced with the convolution kernel K3', and the input data participating in the operation in the PE10 is updated to the data of channel C3 in the data block A1 (denoted A13). The PE10 then performs the convolution operation (multiply-add operation) of A13 and the convolution kernel K3' shown in fig. 2 (g), multiplies each data in A13 by the corresponding data of the convolution kernel K3', and adds the products to obtain the first data 17 in the feature map P3'. Specifically, as shown in fig. 2 (g), the multiply-add calculation of A13 and the convolution kernel K3' is: (-1)×0+1×(-1)+1×1+0×2+1×3+2×4+1×4+2×1+1×0 = 17.
The PE20 in the PE array 205 shown in fig. 2 (a) first performs the convolution operation (multiply-add operation) of the data of channel C1 in the data block A2 (denoted A21) and the convolution kernel K1' shown in fig. 2 (h): each data in A21 is multiplied by the corresponding data of the convolution kernel K1', and the products are added to obtain the second data 4 in the feature map P1'. Specifically, as shown in fig. 2 (h), the multiply-add calculation of A21 and the convolution kernel K1' is: 0×(-1)+1×1+2×0+1×1+1×(-1)+1×2+2×1+1×2+(-1)×3 = 4.
After the PE20 completes the convolution operation on the data of channel C1 in the data block A2 (denoted A21) and the convolution kernel K1', the convolution kernel participating in the operation in the PE20 is replaced with the convolution kernel K2', and the input data participating in the operation in the PE20 is updated to the data of channel C2 in the data block A2 (denoted A22). The PE20 then performs the convolution operation (multiply-add operation) of A22 and the convolution kernel K2' shown in fig. 2 (h), multiplies each data in A22 by the corresponding data of the convolution kernel K2', and adds the products to obtain the second data 19 in the feature map P2'. Specifically, as shown in fig. 2 (h), the multiply-add calculation of A22 and the convolution kernel K2' is: 1×2+1×1+3×0+1×5+2×1+2×(-1)+2×4+1×1+1×2 = 19.
After the PE20 completes the convolution operation on the data of channel C2 in the data block A2 (denoted A22) and the convolution kernel K2', the convolution kernel participating in the operation in the PE20 is replaced with the convolution kernel K3', and the input data participating in the operation in the PE20 is updated to the data of channel C3 in the data block A2 (denoted A23). The PE20 then performs the convolution operation (multiply-add operation) of A23 and the convolution kernel K3' shown in fig. 2 (h), multiplies each data in A23 by the corresponding data of the convolution kernel K3', and adds the products to obtain the second data 24 in the feature map P3'. Specifically, as shown in fig. 2 (h), the multiply-add calculation of A23 and the convolution kernel K3' is: 1×0+1×(-1)+4×1+1×2+2×3+1×4+2×4+1×1+(-1)×0 = 24.
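As with PE10, PE20's three per-channel multiply-adds can be checked numerically; the matrices below are transcribed from the terms of the expressions for data block A2 (a verification sketch, not part of the patent):

```python
import numpy as np

# Channel slices of data block A2 and the kernels K1', K2', K3',
# read off from the multiply-add expressions for PE20.
A21 = np.array([[0, 1, 2], [1, 1, 1], [2, 1, -1]])
A22 = np.array([[1, 1, 3], [1, 2, 2], [2, 1, 1]])
A23 = np.array([[1, 1, 4], [1, 2, 1], [2, 1, -1]])
K1p = np.array([[-1, 1, 0], [1, -1, 2], [1, 2, 3]])
K2p = np.array([[2, 1, 0], [5, 1, -1], [4, 1, 2]])
K3p = np.array([[0, -1, 1], [2, 3, 4], [4, 1, 0]])

results = [int((a * k).sum()) for a, k in [(A21, K1p), (A22, K2p), (A23, K3p)]]
print(results)  # [4, 19, 24]
```

These are the second data of the feature maps P1', P2', and P3', and they also equal the three per-channel partial sums of PE20's standard convolution of A2 with K1, whose total is 4+19+24 = 47.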
Similarly, in the PE array 205 shown in fig. 2 (a), the PE30 first performs a convolution operation (multiply-add operation) of the data (denoted as a 31) of the channel C1 and the convolution kernel K1' in the data block A3, multiplies each data in a31 by each data of the convolution kernel K1', and then adds the multiplied results to obtain third data (not shown) in the feature map P1 '.
After the PE30 completes the convolution operation on the data of the channel C1 in the data block A3 (denoted as a 31) and the convolution kernel K1', the convolution kernel K1' in the PE30 participating in the convolution operation is replaced with the convolution kernel K2', and the input data in the PE30 participating in the convolution operation is updated to the data of the channel C2 in the data block A3 (denoted as a 32). The PE30 performs a convolution operation (multiply-add operation) of the data (denoted as a 32) of the channel C2 and the convolution kernel K2' in the data block A3, multiplies each data in a32 by each data of the convolution kernel K2', and adds the multiplied results to obtain third data (not shown) in the feature map P2 '.
After the PE30 completes the convolution operation on the data of channel C2 in the data block A3 (denoted A32) and the convolution kernel K2', the convolution kernel participating in the operation in the PE30 is replaced with the convolution kernel K3', and the input data participating in the operation in the PE30 is updated to the data of channel C3 in the data block A3 (denoted A33). The PE30 performs the convolution operation (multiply-add operation) of A33 and the convolution kernel K3', multiplies each data in A33 by the corresponding data of the convolution kernel K3', and adds the products to obtain the third data (not shown) in the feature map P3'.
Similarly, in the PE array 205 shown in fig. 2 (a), the PE40 first performs a convolution operation (multiply-add operation) of the data (denoted as a 41) of the channel C1 and the convolution kernel K1' in the data block A4, multiplies each data in a41 by each data of the convolution kernel K1', and then adds the multiplied results to obtain fourth data (not shown) in the feature map P1 '.
After the PE40 completes the convolution operation on the data of the channel C1 in the data block A4 (denoted as a 41) and the convolution kernel K1', the convolution kernel K1' in the PE40 participating in the convolution operation is replaced with the convolution kernel K2', and the input data in the PE40 participating in the convolution operation is updated to the data of the channel C2 in the data block A4 (denoted as a 42). The PE40 performs a convolution operation (multiply-add operation) of the data (denoted as a 42) of the channel C2 and the convolution kernel K2' in the data block A4, multiplies each data in a42 by each data of the convolution kernel K2', and adds the multiplied results to obtain fourth data (not shown) in the feature map P2 '.
After the PE40 completes the convolution operation on the data of channel C2 in the data block A4 (denoted A42) and the convolution kernel K2', the convolution kernel participating in the operation in the PE40 is replaced with the convolution kernel K3', and the input data participating in the operation in the PE40 is updated to the data of channel C3 in the data block A4 (denoted A43). The PE40 performs the convolution operation (multiply-add operation) of A43 and the convolution kernel K3', multiplies each data in A43 by the corresponding data of the convolution kernel K3', and adds the products to obtain the fourth data (not shown) in the feature map P3'.
As can be seen from the above description of the calculation process, the multiplication circuit 200 only needs one column of PEs of the PE array 205 to participate in the operation when performing the deep convolution operation. Each PE in that column first calculates the first data in the feature maps P1' to P3'; then the input data and the convolution kernel data participating in the operation in each PE of that column must be quickly updated so that the second data in the feature maps P1' to P3' can be calculated; and so on, the input data and the convolution kernel data in each PE of that column are quickly updated again to calculate the third data in the feature maps P1' to P3', until the deep convolution operation on all the data in the 3-channel input data (of size 5×5×3) stored in the input data buffer 203 is completed.
Further, it is apparent that when the multiplication circuit 200 performs the deep convolution operation, each data in the feature maps P1' to P3' is calculated in turn by the one column of PEs participating in the operation; that is, all the feature maps obtained by the operation are output by that one column of PEs. When the multiplication circuit 200 performs the standard convolution operation, however, the calculation result of each PE in each column is one data on the corresponding feature map, each column of PEs finally outputs exactly one feature map, and the number of columns of the PE array 205 corresponds to the number of feature maps finally obtained. That is, the data formats output by the multiplication circuit 200 after performing the standard convolution operation and the deep convolution operation are different. This conflicts with the requirement of product developers/designers that the same multiplication circuit be able to adapt to different application scenarios without changing the input/output data format.
In addition, since the multiplication circuit 200 shown in fig. 2 needs to update the convolution kernel data participating in the operation in each PE frequently and rapidly when performing the deep convolution operation, it places a high demand on the data-read speed, and the frequent data reads may cause high latency in the multiplication circuit 200. Moreover, when the multiplication circuit 200 performs the deep convolution operation, only one row of PEs in the entire PE array 205 participates in the operation, so the utilization rate of the PE array is low.
To solve the above technical problems of the multiplication circuit 200 of the related art shown in fig. 2, an embodiment of the present application provides a multiplication circuit 300 shown in fig. 3. Compared with the multiplication circuit 200 shown in fig. 2, the multiplication circuit 300 shown in fig. 3 has a switch circuit 240 disposed between the parameter buffer 202 and the PE array 250. The switch circuit 240 selects at least part of the data from the convolution kernels stored in the parameter buffer 202 as valid data to be output to each column of PEs in the PE array 250 for convolution operation. For example, when the multiplication circuit 300 performs a standard convolution operation, the switch circuit 240 outputs all the data of the respective convolution kernels read from the parameter buffer 202 as valid data to each column of PEs in the PE array 250 to perform the convolution operation. When the multiplication circuit 300 performs the deep convolution operation, the switch circuit 240 outputs the data of one channel selected from the convolution kernels read from the parameter buffer 202 as valid data to each column of PEs in the PE array 250 to perform the convolution operation.
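The selection rule described above can be sketched in a few lines. This is a hypothetical model (function and parameter names are assumptions, not part of the patent): in standard mode every channel of the kernel data passes through as valid data, while in deep (depthwise) mode only one selected channel is kept and the others are replaced with 0.

```python
def switch_select(kernel_channels, mode, selected_channel=0):
    """Return the kernel data forwarded to a PE column as valid data."""
    if mode == "standard":
        return list(kernel_channels)          # pass every channel through
    # deep convolution mode: keep one channel, zero the others
    return [v if i == selected_channel else 0
            for i, v in enumerate(kernel_channels)]

k = [5, 7, 9]                                 # toy 3-channel kernel data
print(switch_select(k, "standard"))           # [5, 7, 9]
print(switch_select(k, "deep", 1))            # [0, 7, 0]
```

Because the zeroed channels contribute nothing to the multiply-add, each PE column effectively convolves a single kernel channel in deep mode while the hardware datapath stays unchanged.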
By disposing the switch circuit 240 between the parameter buffer 202 and the PE array 250, the multiplication circuit 300 provided by the present application keeps every PE in the PE array 250 utilized with high calculation efficiency, whether the multiplication circuit 300 performs a deep convolution operation or a standard convolution operation. Moreover, whether the multiplication circuit 300 performs the deep convolution operation or the standard convolution operation on the same input data stored in the input data buffer 203, the final output of each column of PEs of the PE array 250 is one feature map rather than a plurality of feature maps. The multiplication circuit 300 can therefore output the same data format when executing different convolution operations: each column of PEs corresponds to one feature map. Consequently, the multiplication circuit 300 provided by the present application does not need to adjust the format of input/output data when implementing different convolution operations in different application scenarios, and the requirement that a product developer/designer can adapt the same multiplication circuit 300 to different application scenarios without changing the input/output data format can be met.
The hardware configuration of the multiplication circuit 300 shown in fig. 3 provided in the present application will be described first in detail.
Fig. 3 schematically shows a block diagram of the hardware architecture of a multiplication circuit 300 provided in the present application. As shown in fig. 3, the multiplication circuit 300 includes a DMA control unit 201, an input data buffer 203, a parameter buffer 202, a switching circuit 240, a PE array 250, and an output buffer 204.
The DMA control unit 201 is configured to read input data to be convolved from an external memory space into the input data buffer 203. For example, when the multiplication circuit 300 provided in the present application is deployed on an autonomous car, the DMA control unit 201 is configured to read, from the memory of the autonomous car, the data of images containing information on the car's surroundings into the input data buffer 203, for the PE array 250 to perform the convolution operation on.
The input data buffer 203 is used to store input data read from the external storage space by the DMA control unit 201. For example, in some embodiments, when the multiplication circuit 300 provided herein is deployed on an autonomous car, the input data buffer 203 is used to store relevant data of an image relating to the information of the surroundings of the autonomous car read by the DMA control unit 201 from the memory of the autonomous car.
The parameter buffer 202 is used to store the convolution kernel data that participates in the convolution operation. For example, in some embodiments, the parameter buffer 202 is used to store the data of convolution kernels whose number of channels is the same as the number of channels of the input data to be convolved. As another example, in some embodiments, the parameter buffer 202 is configured to store the data of convolution kernels each having 1 channel, the number of such convolution kernels being the same as the number of channels of the input data to be convolved.
The switch circuit 240 is used for selecting valid data from the convolution kernels read from the parameter cache 202 and sending the valid data to the PE array 250 for convolution operation under different application scenarios. For example, when the multiplication circuit 300 performs a standard convolution operation, the switching circuit 240 outputs all data of the respective convolution kernels read from the parameter buffer 202 as valid data to the respective columns of PEs in the PE array 250 to perform the convolution operation. When the multiplication circuit 300 performs the deep convolution operation, the switching circuit 240 outputs the data of one channel selected from the convolution kernels read from the parameter buffer 202 as effective data to the PEs of each column in the PE array 250 to perform the convolution operation.
The PE array 250 is an array formed of a plurality of PEs each for multiply-add operation on convolution kernel data and input data.
The output buffer 204 is used for storing the result of the convolution operation output by the PE array 250. For example, in some embodiments, output buffer 204 is used to store one feature map for each column output of PE array 250. For example, in some embodiments, where the multiplication circuit 300 is applied in an application scenario where real-time requirements are high, the multiplication circuit 300 may be configured to perform a deep convolution operation, such that the output buffer 204 is configured to store the result of the deep convolution operation performed by the PE array 250 on the input data. For example, when the multiplication circuit 300 provided in the present application is deployed on an autopilot, since the autopilot has a high real-time requirement, the multiplication circuit 300 is required to quickly perform a convolution operation on image data including surrounding environment information acquired by the autopilot, and therefore, in an application scenario of the autopilot, the multiplication circuit 300 may be configured to perform a deep convolution operation, so that the output buffer 204 is used to store a result of the deep convolution operation performed on the image data by the PE array 250.
As another example, in other embodiments, where the multiplication circuit 300 is used in an application scenario where real-time requirements are not high, the multiplication circuit 300 may be configured to perform a standard convolution operation, such that the output buffer 204 is used to store the results of the standard convolution operation performed by the PE array 250 on the input data. For example, the multiplication circuit 300 is deployed in a face recognition access control system that requires less real-time than an autonomous car. Therefore, in the application scenario of face recognition gate inhibition, the multiplication circuit 300 may be configured to perform a standard convolution operation, so that the output buffer 204 is used to store the result of the PE array 250 performing the standard convolution operation on the face image.
It can be understood that the above automatic driving automobile with high real-time requirements and the face recognition access control scenario with low real-time requirements are merely illustrative of two exemplary application scenarios of the technical solution of the present application. The applicable scenarios of the multiplication circuit 300 provided in the embodiments of the present application include, but are not limited to, various application scenarios involving image recognition, speech recognition, natural language processing, reinforcement learning, and the like.
Further, it is to be understood that the exemplary structure of the multiplication circuit 300 provided by the present application as shown in fig. 3 does not constitute a specific limitation of the multiplication circuit 300. In other embodiments of the present application, multiplication circuit 300 may include more or fewer components than shown, or may combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Fig. 4 schematically illustrates the hardware configuration of a multiplication circuit 300 having a 4×8 PE array (i.e., an array of 32 PEs in 4 rows and 8 columns) provided in the present application. As shown in fig. 4, the multiplication circuit 300 includes a DMA control unit 201, an input data buffer 203, a parameter buffer 202, a switch circuit 240, a PE array 250, and an output buffer 204. The PE array 250 comprises 32 PEs arranged in 4 rows and 8 columns, with the first 4 columns and the last 4 columns designated as a first group of PEs 251 and a second group of PEs 252, respectively. Since the DMA control unit 201, the input data buffer 203, the parameter buffer 202, the PE array 250, and the output buffer 204 have been described above with reference to fig. 3, they are not described here again; only the switch circuit 240 shown in fig. 4 will be described in detail.
In the embodiment shown in fig. 4, the switch circuit 240 is connected between the parameter buffer 202 and the PE array 250, and is configured to send valid data, selected from the convolution kernels read from the parameter buffer 202 under different application scenarios, to the 8 columns of PEs in the PE array 250 for convolution operation. As shown in fig. 4, the switch circuit 240 includes a sub-switch 241, a sub-switch 242, a sub-switch 243, a sub-switch 244, a sub-switch 245, a sub-switch 246, a sub-switch 247, and a sub-switch 248. The number of sub-switches is the same as the number of columns of PEs in the PE array 250, i.e., one sub-switch for each column of PEs.
When the multiplication circuit 300 performs the standard convolution operation, each sub-switch directly outputs the acquired multi-channel convolution kernel data to the corresponding PE. When the multiplication circuit 300 performs the deep convolution operation, each sub-switch outputs the data of one of the acquired multi-channel convolution kernel data to the corresponding PE as valid data.
For example, when the multiplication circuit 300 performs the standard convolution operation, assuming that the convolution kernel data obtained from the parameter buffer 202 by the sub-switch 241 is the data of the 3-channel convolution kernel K1 shown in fig. 2 (c), the sub-switch 241 directly outputs all the data of the 3-channel convolution kernel K1 as valid data to the PEs 10, 20, 30, 40. Taking PE10 as an example, the process by which PE10 performs the convolution operation on the data block A1 using the convolution kernel data received from the sub-switch 241 is: A11×K11 + A12×K12 + A13×K13.
When the multiplication circuit 300 performs the deep convolution operation, still assuming that the convolution kernel data obtained from the parameter buffer 202 by the sub-switch 241 is the data of the 3-channel convolution kernel K1 shown in fig. 2 (c), the sub-switch 241 keeps the data K11 of the channel C1 of the convolution kernel K1, replaces each value of the data K12 of the channel C2 and the data K13 of the channel C3 with 0, and then outputs them to the PEs 10, 20, 30, and 40. Taking PE10 as an example, the process by which PE10 performs the convolution operation on the data block A1 using the convolution kernel data received from the sub-switch 241 is: A11×K11 + A12×0 + A13×0 = A11×K11.
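The contrast between the two modes at PE10 can be checked directly with toy numbers (all values below are assumptions for illustration, not values from the patent):

```python
# Assumed single-position values for data block A1 (channels C1..C3)
# and convolution kernel K1 (channels C1..C3)
a11, a12, a13 = 2, 3, 4
k11, k12, k13 = 5, 6, 7

standard = a11 * k11 + a12 * k12 + a13 * k13  # all three channels are valid data
deep = a11 * k11 + a12 * 0 + a13 * 0          # sub-switch zeroed channels C2, C3

assert standard == 56
assert deep == a11 * k11 == 10
```

The zeroed terms fall out of the sum, so the same multiply-add datapath yields the single-channel (depthwise) result without any change to the PE itself.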
Since the structure and principle of the sub-switches corresponding to the first group of PEs 251 and the second group of PEs 252 in the PE array 250 are the same, the structure and principle of the switching circuit 240 will be described in detail by taking only the sub-switches 241 to 244 corresponding to the 4-column PEs in the first group of PEs 251 as an example.
Fig. 5 (a) shows a schematic structural diagram of the sub-switches 241, 242, 243, 244 of the switch circuit 240, according to some embodiments of the present application. The sub-switches 241 to 244 each include a register and a MUX (multiplexer) circuit. Since the hardware structure of each sub-switch is the same, only the sub-switch 241 is used below to describe the principle.
Specifically, for the sub-switch 241, assume that the data port of the parameter buffer 202 has a width of 4B (the convolution kernel data stored in the parameter buffer 202 and the input data stored in the input data buffer 203 being quantized 8-bit integers); that is, the register 2410 of the sub-switch 241 can read 4B of data, denoted R0, R1, R2, and R3, from the parameter buffer 202 within one clock cycle. The MUX circuit 2411 then selects the valid data to output to the PEs from among R1, R2, R3 and 0. For the standard convolution and deep convolution operation modes, the inputs and outputs of the MUX circuit 2411 of the sub-switch 241 within one clock cycle are shown in table 1 below:
Operation mode        MUX circuit 2411 inputs   MUX circuit 2411 outputs
Standard convolution  R1, R2, R3 and 0          R1, R2, R3
Deep convolution      R1, R2, R3 and 0          0, 0, 0

TABLE 1
As can be seen from table 1, for the sub-switch 241, in the standard convolution operation mode the MUX circuit 2411 directly outputs the data read from the register, while in the deep convolution operation mode the MUX circuit 2411 outputs 0 in place of each data read from the register.
It will be appreciated that since R0 is not passed to the MUX circuit 2411, R0 is output directly to the PEs connected to the sub-switch 241. In the standard convolution operation mode, the data finally sent to the PEs by the sub-switch 241 are R0, R1, R2, R3; in the deep convolution operation mode, the data finally sent to the PEs by the sub-switch 241 are R0 and 0, 0, 0.
Similarly, in the standard convolution operation mode, the data finally sent to the PEs by the sub-switch 242 are R1 and R0, R2, R3; the data finally sent to the PEs by the sub-switch 243 are R2 and R0, R1, R3; and the data finally sent to the PEs by the sub-switch 244 are R3 and R0, R1, R2.
In the deep convolution operation mode, the data finally sent to the PEs by the sub-switch 242 are R1 and 0, 0, 0; the data finally sent to the PEs by the sub-switch 243 are R2 and 0, 0, 0; and the data finally sent to the PEs by the sub-switch 244 are R3 and 0, 0, 0.
It will be appreciated that, within the same clock cycle, in the standard convolution operation mode the sub-switches 241 to 244 directly send the data read from the parameter buffer 202 to the corresponding PEs; in the deep convolution operation mode, the sub-switches 241 to 244 each take 1B of the 4B of data read from the parameter buffer 202 as the valid data, replace the other 3B of data with 0, and then send the result to the corresponding PEs.
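The per-sub-switch behavior above can be sketched as follows. This is a model under assumed naming (the function, its parameters, and the bypass-index convention are illustrative): position `bypass_index` of the 4B read (R0..R3) goes straight to the PE column, and the MUX either passes the other three values (standard mode) or substitutes zeros (deep mode).

```python
def sub_switch_out(bypass_index, regs, mode):
    """Data one sub-switch forwards to its PE column in one clock cycle."""
    bypass = regs[bypass_index]
    others = [r for i, r in enumerate(regs) if i != bypass_index]
    if mode == "standard":
        return [bypass] + others              # MUX passes the register data
    return [bypass] + [0] * len(others)       # MUX substitutes 0 for each value

regs = ["R0", "R1", "R2", "R3"]
print(sub_switch_out(0, regs, "standard"))    # sub-switch 241: R0 and R1, R2, R3
print(sub_switch_out(1, regs, "deep"))        # sub-switch 242: R1 and 0, 0, 0
```

Giving each sub-switch a different bypass index is what lets the four columns pick out four different channels of the same kernel in deep mode.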
It should be noted that R0, R1, R2, R3 denote the 4B of data read from the parameter buffer 202 in one clock cycle by the register of each of the sub-switches 241 to 244. In some embodiments, when the multiplication circuit 300 is in the standard convolution operation mode, the sub-switches 241 to 244 obtain, in one clock cycle, 4B of data from each of 4 mutually independent convolution kernels from the parameter buffer 202; for example, R0, R1, R2, R3 of each independent convolution kernel are the 1B data at the same position on each of the 4 channels of that convolution kernel. In the standard convolution operation mode, within one clock cycle, the sub-switch 241 outputs to the first column of PEs the 1B data at the same position on the 4 channels (i.e., a plurality of channels) of the first convolution kernel; the sub-switch 242 outputs to the second column of PEs the 1B data at the same position on the 4 channels of the second convolution kernel; the sub-switch 243 outputs to the third column of PEs the 1B data at the same position on the 4 channels of the third convolution kernel; and the sub-switch 244 outputs to the fourth column of PEs the 1B data at the same position on the 4 channels of the fourth convolution kernel.
Specifically, for example, in the standard convolution operation mode, the convolution kernel data output from the sub-switch 241 to the PEs 10, 20, 30, and 40 (first-column PEs) in fig. 4 in one clock cycle is: as shown in fig. 5 (B), 1B data corresponding to the channels C1 to C4 (i.e., a plurality of channels) in the position B1 in the convolution kernel K1 "is R0 corresponding to the channel C1 in the position B1 in the convolution kernel K1", R1 corresponding to the channel C2 in the position B1 in the convolution kernel K1", R2 corresponding to the channel C3 in the position B1 in the convolution kernel K1", and R3 corresponding to the channel C4 in the position B1 in the convolution kernel K1 ".
The convolution kernel data output from the sub-switch 242 to the PEs 11, 21, 31, 41 (second column PE) in fig. 4 is: as shown in fig. 5 (B), 1B data corresponding to the channels C1 to C4 (i.e., a plurality of channels) in the convolution kernel K2 "at the B2 position is R0 corresponding to the channel C1 in the convolution kernel K2" at the B2 position, R1 corresponding to the channel C2 in the convolution kernel K2 "at the B2 position, R2 corresponding to the channel C3 in the convolution kernel K2" at the B2 position, and R3 corresponding to the channel C4 in the convolution kernel K2 "at the B2 position.
The convolution kernel data output from the subswitch 243 to the PEs 12, 22, 32, and 42 (third column PE) in fig. 4 is: as shown in fig. 5 (B), 1B data corresponding to the channels C1 to C4 (i.e., a plurality of channels) in the convolution kernel K3 "at the B3 position is R0 corresponding to the channel C1 in the convolution kernel K3" at the B3 position, R1 corresponding to the channel C2 in the convolution kernel K3 "at the B3 position, R2 corresponding to the channel C3 in the convolution kernel K3" at the B3 position, and R3 corresponding to the channel C4 in the convolution kernel K3 "at the B3 position.
The convolution kernel data output from the sub-switch 244 to the PEs 13, 23, 33, and 43 (fourth column PE) in fig. 4 is: as shown in fig. 5 (B), 1B data corresponding to the channels C1 to C4 (i.e., a plurality of channels) in the convolution kernel K4" at the B4 position, that is, R0 corresponding to the channel C1 in the convolution kernel K4" at the B4 position, R1 corresponding to the channel C2 in the convolution kernel K4" at the B4 position, R2 corresponding to the channel C3 in the convolution kernel K4" at the B4 position, and R3 corresponding to the channel C4 in the convolution kernel K4" at the B4 position.
In some embodiments, when the multiplication circuit 300 is in the deep convolution operation mode, the sub-switches 241 to 244 obtain the same 4B of data of the same convolution kernel from the parameter buffer 202; that is, the specific values of R0, R1, R2, R3 read by the registers of the sub-switches 241 to 244 from the parameter buffer 202 within one clock cycle are identical. For example, R0, R1, R2, R3 are the 1B data at the same position on each of the 4 channels of the convolution kernel. In the deep convolution operation mode, within one clock cycle, the valid data output by the sub-switch 241 to the first column of PEs is the 1B data on the first channel (i.e., a single channel) of the convolution kernel; the valid data output by the sub-switch 242 to the second column of PEs is the 1B data at the same position on the second channel; the valid data output by the sub-switch 243 to the third column of PEs is the 1B data at the same position on the third channel; and the valid data output by the sub-switch 244 to the fourth column of PEs is the 1B data at the same position on the fourth channel. Here, the valid data refers to the data actually participating in the deep convolution operation.
Specifically, for example, in the deep convolution operation mode, the convolution kernel data output from the sub-switch 241 to the PEs 10, 20, 30, and 40 (first-column PEs) in fig. 4 in one clock cycle is: the 1B data R0 and 0, 0 corresponding to the channel C1 (i.e., single channel) at the B1 position in the convolution kernel K1 "shown in fig. 5 (B).
The convolution kernel data output from the sub-switch 242 to the PEs 11, 21, 31, 41 (second column PE) in fig. 4 is: the channel C2 (i.e., single channel) in the convolution kernel K1 "shown in fig. 5 (B) corresponds to the 1B data R1 and 0, 0 at the B1 position.
The convolution kernel data output from the subswitch 243 to the PEs 12, 22, 32, and 42 (third column PE) in fig. 4 is: the channel C3 (i.e., single channel) in the convolution kernel K1 "shown in fig. 5 (B) corresponds to the 1B data R2 and 0, 0 at the B1 position.
The convolution kernel data output from the sub-switch 244 to the PEs 13, 23, 33, and 43 (fourth column PE) in fig. 4 is: the channel C4 (i.e., single channel) in the convolution kernel K1 "shown in fig. 5 (B) corresponds to the 1B data R3 and 0, 0 at the B1 position.
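The deep-mode distribution enumerated above can be condensed into a compact check (channel values are toy numbers assumed for illustration): all four sub-switches read the same R0..R3 (channels C1..C4 of kernel K1" at the B1 position), and the sub-switch for column c forwards only Rc, padding the rest with 0.

```python
R = [11, 22, 33, 44]                           # assumed values of R0..R3
columns = {c: [R[c], 0, 0, 0] for c in range(4)}

assert columns[0] == [11, 0, 0, 0]             # first column:  R0 and 0, 0, 0
assert columns[3] == [44, 0, 0, 0]             # fourth column: R3 and 0, 0, 0
```

Each column therefore receives a distinct single channel of the same kernel, which is exactly the per-channel filtering that the deep (depthwise) convolution requires.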
In addition, fig. 5 (a) illustrates the principle of the switch circuit 240 only by taking a data width of 4B for the data ports of the parameter buffer 202 and the input data buffer 203 as an example; in other embodiments, the data widths of the data ports of the parameter buffer 202 and the input data buffer 203 may be 8B, 16B, etc., which is not limited in this application. Also, the "B" in the 4B data width of the data ports of the parameter buffer 202 and the input data buffer 203 indicates that the convolution kernel data stored in the parameter buffer 202 and the input data stored in the input data buffer 203 are quantized 8-bit integers. In other embodiments, the weights stored in the parameter buffer 202 and the input data stored in the input data buffer 203 may also be quantized 16-bit or 32-bit integers, which is not limited in this application.
Based on the above general description of the structure of the multiplication circuit 300 and the principle of the switch circuit 240 provided in the present application, the processes by which the multiplication circuit 300 performs the deep convolution operation and the standard convolution operation will be described in detail with reference to fig. 6 to 8, taking the multiplication circuit 300 having the 4×8 PE array structure shown in fig. 4 as an example.
First, the process by which the multiplication circuit 300 shown in fig. 4 performs a standard convolution operation is described by way of example with reference to fig. 6 and 7.
Standard convolution operation
When the multiplication circuit 300 performs the standard convolution operation, all PEs in the PE array 250 are turned on. The convolution kernel data output from the parameter buffer 202 to each column of PEs in the PE array 250 are derived from different convolution kernels, and the number of channels of each convolution kernel is the same as the number of channels of the input data cached in the input data buffer 203. The parameter buffer 202 outputs the respective convolution kernel data to the sub-switches 241 to 248 corresponding to the columns of PEs in the switch circuit 240; the sub-switches 241 to 248 output the received convolution kernel data to the corresponding columns of PEs as valid data; and each column of PEs performs the convolution operation (multiply-add operation) on the received multi-channel convolution kernel data and the multi-channel input data received from the input data buffer 203.
(1) First clock cycle clk1
Both the first set of PEs 251 and the second set of PEs 252 shown in FIG. 4 are turned on. That is, 8 columns of PEs of PE array 250 can each perform a multiply-add operation during the first clock cycle clk 1.
Assuming that the data port of the parameter buffer 202 is 4B in data width, the data port of the input data buffer 203 is also 4B in data width. The input data buffer 203 outputs 4B of input data to each column PE of the first group of PEs 251 and the second group of PEs 252 during the first clock cycle clk 1. The data output from the parameter buffer 202 to the switches corresponding to the respective columns of PEs of the first group of PEs 251 and the second group of PEs 252 is convolution kernel data of 4B. The sub-switches 241 to 248, which are in one-to-one correspondence with the 8-column PEs shown in fig. 4, directly output all the 4B convolution kernel data read from the parameter buffer 202 to each column PE, and then each column PE performs a multiply-add operation on the obtained 4B convolution kernel data and the 4B input data, respectively.
For example, referring to fig. 7 (a), the multiplication circuit 300 performs a standard convolution operation on input data with M channels and size Df×Df (simply, Df×Df×M input data) by using N convolution kernels each with M channels and size Dk×Dk (simply, N Dk×Dk×M convolution kernels), to obtain a feature map with N channels and size Dp×Dp (simply, a Dp×Dp×N feature map). That is, the parameter buffer 202 stores the data of N Dk×Dk×M convolution kernels, and the input data buffer 203 stores Df×Df×M input data.
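The sizes Df, Dk and Dp above are related by the usual no-padding ("valid") convolution size formula, an assumption here that is consistent with the 5×5 input and 3×3 kernel example below, which yields a 3×3 feature map:

```python
def out_size(df: int, dk: int, stride: int = 1) -> int:
    # Output width without padding: Dp = (Df - Dk) / stride + 1
    return (df - dk) // stride + 1

print(out_size(5, 3))  # 3: a 3x3x32 kernel over 5x5x32 input gives a 3x3 feature map
```

With stride 1 this also predicts the 9 sliding-window data blocks (3 positions per axis) enumerated in the example that follows.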
Specifically, assume that the multiplication circuit 300 uses 8 convolution kernels of 3×3×32 (32 channels, size 3×3), denoted K1" to K8", to perform the standard convolution operation on input data of 5×5×32 (32 channels, size 5×5). Each PE in the PE array 250 of the multiplication circuit 300 uses one of the convolution kernels K1" to K8" to convolve one data block of the same size as the convolution kernel within the 5×5×32 input data. Here, a data block of the same size as the convolution kernel within the input data is: the data block obtained in the sliding window as the window slides over the 5×5×32 input data with a set step size. It will be appreciated that since the input data has 32 data channels, each sliding window (i.e., data block) on the input data also has 32 data channels.
For example, as shown in fig. 7 (b), for the 5×5×32 input data, a 3×3×32 sliding window is slid over the input data with a step size of 1, starting from the first data at the upper left of the channel C1. Sliding first from left to right yields the data blocks A1' to A3'; sliding down by one data from the data block A1' and then from left to right yields the data blocks A4' to A6'; and sliding down by one data from the data block A4' and then from left to right yields the data blocks A7' to A9'.
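The window sliding described above can be sketched as follows (the input values are toy numbers; only the indexing matters):

```python
# Sliding a 3x3x32 window with step size 1 over 5x5x32 input:
# 3 positions per axis -> 9 data blocks A1'..A9'.
DF, DK, C, STRIDE = 5, 3, 32, 1
x = [[[r * 1000 + c * 100 + ch for ch in range(C)]
      for c in range(DF)] for r in range(DF)]     # x[row][col][channel]

blocks = []
for r in range(0, DF - DK + 1, STRIDE):           # slide down
    for c in range(0, DF - DK + 1, STRIDE):       # slide left to right
        blocks.append([[x[r + i][c + j] for j in range(DK)] for i in range(DK)])

print(len(blocks))  # 9: A1' to A9'
```

Note that every block carries all 32 channels of the covered positions, matching the statement that each data block has the same number of channels as the input data.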
In the first clock cycle clk1, the input data buffer 203 in fig. 4 outputs 4B of data from the data blocks A1' to A4' to each row of PEs, and the parameter buffer 202 outputs 4B of convolution kernel data to each of the 8 columns of PEs. Specifically, the data output by the input data buffer 203 to each row of PEs of the multiplication circuit 300 in fig. 4 are: the data I0, I1, I2, I3 (4B in total) of the channels C1 to C4 of the data block Ai' (i taking any value of 1, 2, 3, 4) shown in fig. 7 (C). The data output by the parameter buffer 202 to the sub-switches 241 to 248 of the 8 columns of PEs are: the data R0, R1, R2, R3 (4B in total) of the channels C1 to C4 of the convolution kernel Kj" (j taking any value of 1 to 8) shown in fig. 7 (C).
The sub-switches 241 to 248 directly output the respective acquired 4B of convolution kernel data to each corresponding column of PEs. For example, the 4B of convolution kernel data acquired by the 4 PEs of the first column (PE10 to PE40) of the multiplication circuit 300 in fig. 4 is the data of the convolution kernel K1", 1B per channel for the channels C1 to C4, denoted: R10, R11, R12, R13. The 4B acquired by the 4 PEs of the second column (PE11 to PE41) is the data of the convolution kernel K2" for the channels C1 to C4, denoted: R20, R21, R22, R23; the 4B acquired by the 4 PEs of the third column (PE12 to PE42) is the data of the convolution kernel K3" for the channels C1 to C4, denoted: R30, R31, R32, R33. Similarly, the 4B acquired by the 4 PEs of the eighth column (PE17 to PE47) is the data of the convolution kernel K8" for the channels C1 to C4, denoted: R80, R81, R82, R83.
Assume that the data of a total 4B of channels C1 through C4 of data block A1' are denoted as I10, I11, I12, I13; the data of a total 4B of channels C1 to C4 of data block A2' are denoted as I20, I21, I22, I23; the data of a total 4B of channels C1 to C4 of data block A3' are denoted as I30, I31, I32, I33; the data of a total 4B of channels C1 to C4 of data block A4' are denoted as I40, I41, I42, I43.
Then, for the PEs 10 to 17 (the 8 PEs of the first row) of the PE array 250 in the multiplication circuit 300: PE10 convolves I10, I11, I12, I13 of the first 4 channels of the data block A1' with R10, R11, R12, R13 of the first 4 channels of the convolution kernel K1", and the result of the convolution operation is:
R10*I10+R11*I11+R12*I12+R13*I13。
PE11 convolves I10, I11, I12, I13 of the first 4 channels of the data block A1' with R20, R21, R22, R23 of the first 4 channels of the convolution kernel K2", and the result of the convolution operation is:
R20*I10+R21*I11+R22*I12+R23*I13。
PE12 convolves I10, I11, I12, I13 of the first 4 channels of the data block A1' with R30, R31, R32, R33 of the first 4 channels of the convolution kernel K3", and the result of the convolution operation is:
R30*I10+R31*I11+R32*I12+R33*I13。
PE13 convolves I10, I11, I12, I13 of the first 4 channels of the data block A1' with R40, R41, R42, R43 of the first 4 channels of the convolution kernel K4", and the result of the convolution operation is: R40×I10+R41×I11+R42×I12+R43×I13.
Similarly, the result of the PE14 convolution operation is: R50×I10+R51×I11+R52×I12+R53×I13. The result of the PE15 convolution operation is: R60×I10+R61×I11+R62×I12+R63×I13. The result of the PE16 convolution operation is: R70×I10+R71×I11+R72×I12+R73×I13. The result of the PE17 convolution operation is: R80×I10+R81×I11+R82×I12+R83×I13.
The calculation process of the PEs of the second to fourth rows is similar to that of the first row. The difference is that the 8 PEs of the first row (PE10 to PE17) use the 4B of data of the channels C1 to C4 of the convolution kernels K1" to K8", respectively, to perform the corresponding multiply-add operations on the data I10, I11, I12, I13 of the channels C1 to C4 of the data block A1'; the 8 PEs of the second row (PE20 to PE27) use the same 4B of data of the channels C1 to C4 of the convolution kernels K1" to K8", respectively, to perform the corresponding multiply-add operations on the data I20, I21, I22, I23 of the channels C1 to C4 of the data block A2'; the 8 PEs of the third row (PE30 to PE37) likewise operate on the data I30, I31, I32, I33 of the channels C1 to C4 of the data block A3'; and the 8 PEs of the fourth row (PE40 to PE47) operate on the data I40, I41, I42, I43 of the channels C1 to C4 of the data block A4'.
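The first-clock-cycle computations described above form a 4×8 grid of partial sums: row r holds the 4B of input data of data block A(r+1)' and column c holds the 4B of kernel data of K(c+1)". A sketch with toy values (all numbers assumed for illustration):

```python
# I[r] = channels C1..C4 of data block A(r+1)'; R[c] = channels C1..C4 of K(c+1)"
I = [[r * 10 + ch for ch in range(4)] for r in range(4)]
R = [[c * 10 + ch for ch in range(4)] for c in range(8)]

# Each PE computes a 4-channel multiply-add in one clock cycle
partial = [[sum(i * k for i, k in zip(I[r], R[c])) for c in range(8)]
           for r in range(4)]

# PE10 (row 0, column 0): R10*I10 + R11*I11 + R12*I12 + R13*I13
assert partial[0][0] == sum(i * k for i, k in zip(I[0], R[0]))
print(len(partial), len(partial[0]))  # 4 8
```

Every PE in the 4×8 array is busy in the same cycle, which is the utilization property the switch circuit is designed to preserve in the deep convolution mode as well.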
(2) The second clock cycle clk2
Both the first group of PEs 251 and the second group of PEs 252 shown in FIG. 4 are turned on. That is, all 8 columns of PEs of the PE array 250 may perform convolution operations during the second clock cycle clk2.
As before, assume that the data port of the parameter cache 202 is 4B wide and the data port of the input data cache 203 is also 4B wide. In the second clock cycle clk2, the data output from the input data buffer 203 of FIG. 4 to each row of PEs of the multiplication circuit 300 are, respectively, the data I4, I5, I6, I7 of the 4B of channels C5 to C8 of the data block Ai' (i takes any value of 1, 2, 3, 4) shown in FIG. 7 (d). The data output from the parameter buffer 202 to the sub-switches 241 to 248 of the 8 columns of PEs are, respectively, the data R4, R5, R6, R7 of the 4B of channels C5 to C8 of the convolution kernel Kj" (j takes any value of 1 to 8) shown in FIG. 7 (d).
The sub-switches 241 to 248 directly output the respective acquired 4B of convolution kernel data to each corresponding column of PEs. The 4B of convolution kernel data obtained by the 4 PEs (PE10 to PE40) in the first column of the multiplication circuit 300 in FIG. 4 is the data of each 1B of channels C5 to C8 of the convolution kernel K1", denoted as: R14, R15, R16, R17. The 4B of convolution kernel data acquired by the 4 PEs (PE11 to PE41) in the second column is the data of each 1B of channels C5 to C8 of the convolution kernel K2", denoted as: R24, R25, R26, R27. The 4B of convolution kernel data acquired by the 4 PEs (PE12 to PE42) in the third column is the data of each 1B of channels C5 to C8 of the convolution kernel K3", denoted as: R34, R35, R36, R37. Similarly, the 4B of convolution kernel data acquired by the 4 PEs (PE17 to PE47) in the eighth column is the data of each 1B of channels C5 to C8 of the convolution kernel K8", denoted as: R84, R85, R86, R87.
Assume that the data of a total 4B of channels C5 through C8 of data block A1' are denoted as I14, I15, I16, I17; the data of a total 4B of channels C5 through C8 of data block A2' are denoted as I24, I25, I26, I27; the data of a total 4B of channels C5 through C8 of data block A3' are denoted as I34, I35, I36, I37; the data of a total 4B of channels C5 through C8 of data block A4' are denoted as I44, I45, I46, I47.
Then, for PE10 to PE17 (the 8 PEs of the first row) of the PE array 250 in the multiplication circuit 300: PE10 convolves the data I14, I15, I16, I17 of channels C5 to C8 of data block A1' using the data R14, R15, R16, R17 of channels C5 to C8 of convolution kernel K1", the result of the convolution operation being: R14×I14+R15×I15+R16×I16+R17×I17.
PE11 convolves the data I14, I15, I16, I17 of channels C5 to C8 of data block A1' using the data R24, R25, R26, R27 of channels C5 to C8 of convolution kernel K2", the result of the convolution operation being: R24×I14+R25×I15+R26×I16+R27×I17.
PE12 convolves the data I14, I15, I16, I17 of channels C5 to C8 of data block A1' using the data R34, R35, R36, R37 of channels C5 to C8 of convolution kernel K3", the result of the convolution operation being: R34×I14+R35×I15+R36×I16+R37×I17.
PE13 convolves the data I14, I15, I16, I17 of channels C5 to C8 of data block A1' using the data R44, R45, R46, R47 of channels C5 to C8 of convolution kernel K4", the result of the convolution operation being: R44×I14+R45×I15+R46×I16+R47×I17.
Similarly, the result of the PE14 convolution operation is: R54×I14+R55×I15+R56×I16+R57×I17. The result of the PE15 convolution operation is: R64×I14+R65×I15+R66×I16+R67×I17. The result of the PE16 convolution operation is: R74×I14+R75×I15+R76×I16+R77×I17. The result of the PE17 convolution operation is: R84×I14+R85×I15+R86×I16+R87×I17.
The calculation process of the PEs of the second to fourth rows is similar to that of the first row. The difference lies in the data each row operates on: the 8 PEs of the first row (PE10 to PE17) use the 4B of data of channels C5 to C8 of the convolution kernels K1" to K8", respectively, to perform the corresponding multiply-add operations on the data I14, I15, I16, I17 of channels C5 to C8 of data block A1'. The 8 PEs of the second row (PE20 to PE27) use the same 4B of data of channels C5 to C8 of the convolution kernels K1" to K8", respectively, to perform the corresponding multiply-add operations on the data I24, I25, I26, I27 of channels C5 to C8 of data block A2'. The 8 PEs of the third row (PE30 to PE37) likewise operate on the data I34, I35, I36, I37 of channels C5 to C8 of data block A3', and the 8 PEs of the fourth row (PE40 to PE47) operate on the data I44, I45, I46, I47 of channels C5 to C8 of data block A4'.
It will be appreciated that, for input data with 32 channels, since only 4B of data (1B of data in each of 4 channels) can be read in one clock cycle, 8 clock cycles are required to complete the convolution calculation for the 1*1 data block with 32 channels in the upper left corner of each of the data blocks A1' to A4'. For any one of the data blocks A1' to A8' of size 3*3 with 32 channels, 8*9=72 clock cycles are required to complete the convolution operation on all the data in the data block; the calculation results of each PE in the multiplication circuit 300 over the 72 clock cycles are then correspondingly added to obtain the first four data of each of the feature maps P1" to P8", respectively.
For example, the calculation results of PE10 in the multiplication circuit 300 over the 72 clock cycles are added to obtain the first data of the feature map P1", the calculation results of PE20 over the 72 clock cycles are added to obtain the second data of the feature map P1", the calculation results of PE30 are added to obtain the third data of the feature map P1", and the calculation results of PE40 are added to obtain the fourth data of the feature map P1". Correspondingly, the calculation results of PE11 over the 72 clock cycles are added to obtain the first data of the feature map P2", the calculation results of PE21 are added to obtain the second data of the feature map P2", the calculation results of PE31 are added to obtain the third data of the feature map P2", and the calculation results of PE41 are added to obtain the fourth data of the feature map P2". By the same token, the calculation results of PE17 over the 72 clock cycles are added to obtain the first data of the feature map P8", the calculation results of PE27 are added to obtain the second data of the feature map P8", the calculation results of PE37 are added to obtain the third data of the feature map P8", and the calculation results of PE47 are added to obtain the fourth data of the feature map P8". The computation in the other clock cycles is similar to the computation in the first clock cycle clk1 and the second clock cycle clk2, and will not be described here again.
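The accumulation over the 72 clock cycles can be sketched as follows. This is an illustrative simulation under stated assumptions (made-up data values; names such as `acc` and `blocks` are not from the patent): each of 9 kernel positions of a 3*3*32 block is processed in 8 channel-group cycles of 4B each, and each PE adds its per-cycle result into one accumulator.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins for the patent's data blocks A1'..A4' and kernels
# K1"..K8": 9 spatial positions (3*3) by 32 channels each.
blocks  = rng.integers(0, 256, size=(4, 9, 32), dtype=np.int64)
kernels = rng.integers(0, 256, size=(8, 9, 32), dtype=np.int64)

acc = np.zeros((4, 8), dtype=np.int64)   # one accumulator per PE
cycles = 0
for pos in range(9):              # 9 positions of the 3*3 window
    for grp in range(8):          # 32 channels / 4B per cycle = 8 cycles
        c = slice(4 * grp, 4 * grp + 4)
        for m in range(4):
            for j in range(8):
                acc[m, j] += np.dot(blocks[m, pos, c], kernels[j, pos, c])
        cycles += 1

assert cycles == 72
# After 72 cycles, PE(m, j) holds the full 3*3*32 dot product, i.e. one
# data value of feature map P(j+1)".
full = np.einsum('mpc,jpc->mj', blocks, kernels)
assert np.array_equal(acc, full)
```

The final comparison confirms that summing the 72 partial results per PE reproduces the complete standard convolution for one output position of each feature map.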
As can be seen from the above description of the standard convolution operation performed by the multiplication circuit 300 provided in this application: when the multiplication circuit 300 performs a standard convolution operation, the parameter buffer 202 outputs convolution kernel data to the switch circuit 240, the data output by the switch circuit 240 to each PE in the multiplication circuit 300 is convolution kernel data having a plurality of channels, and the data output by the input data buffer 203 to each PE is a (multi-channel) data block of the same size as the convolution kernel. When the multiplication circuit 300 completes the convolution operation on each data block in the input data, the calculation results of the PEs in each column form the data of one feature map; that is, each column of PEs correspondingly obtains exactly one feature map.
The process of performing a deep convolution operation by the multiplication circuit 300 shown in FIG. 4 provided herein will now be described in detail by way of example with reference to FIGS. 7 and 8.
Deep convolution operation
When the multiplication circuit 300 performs a deep convolution operation, each PE in the PE array 250 participates in the operation, but only a portion of the PEs in the PE array 250 are turned on during any one clock cycle. In the same clock cycle, the convolution kernel data output from the parameter buffer 202 to each column of PEs participating in the operation in the PE array 250 is derived from the same convolution kernel, and the number of channels of that convolution kernel is the same as the number of channels of the input data buffered in the input data buffer 203. In one clock cycle, the parameter buffer 202 outputs the data of the convolution kernel to a portion of the sub-switches in the switch circuit 240; each of these sub-switches selects a portion of the received convolution kernel data as valid data and outputs it to the corresponding PEs. Each turned-on PE performs a convolution operation (multiply-add operation) on the multi-channel input data received from the input data buffer 203 based on the received convolution kernel data.
Furthermore, in some embodiments, when performing a deep convolution operation, the multiplication circuit 300 may determine, based on the size of the data ports of the parameter cache 202 and the input data cache 203, whether to turn on at least two columns of PEs or all of the PEs to participate in the operation in each clock cycle. Compared with the multiplication circuit 200 in the related art shown in FIG. 2, in which only one column of PEs is ever turned on to participate in a deep convolution operation, this improves computational efficiency.
In the case where the data ports of the parameter buffer 202 and the input data buffer 203 are small, for example 4B, then in the first clock cycle, the 4 columns of PEs in the first group of PEs 251 of the multiplication circuit 300 are turned on to perform the deep convolution operation; in the second clock cycle, the 4 columns of PEs in the second group of PEs 252 of the multiplication circuit 300 are turned on to perform the deep convolution operation.
In the case where the data ports of the parameter buffer 202 and the input data buffer 203 are large, for example 8B, all 8 columns of PEs of the PE array 250 in the multiplication circuit 300 may be turned on simultaneously in one clock cycle when the multiplication circuit 300 performs the deep convolution operation.
In some embodiments, how many columns of PEs in PE array 250 are turned on may be determined based on the size of the data ports of parameter cache 202 and input data cache 203 in one clock cycle.
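The relationship implied by the 4B and 8B examples above can be sketched as a one-line rule. This is an assumption inferred from those examples, not a formula stated in the patent, and the function name `columns_enabled` is hypothetical.

```python
def columns_enabled(port_bytes: int, total_columns: int = 8) -> int:
    # In the depthwise mode each enabled column consumes 1B (one channel) of
    # the convolution kernel data read per cycle, so the port width in bytes
    # bounds how many columns can be switched on at once (inferred rule).
    return min(port_bytes, total_columns)

# 4B ports: one 4-column group per clock cycle (first group in clk1, second
# group in clk2); 8B ports: all 8 columns at once.
assert columns_enabled(4) == 4
assert columns_enabled(8) == 8
```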
Taking data ports of 4B for the parameter buffer 202 and the input data buffer 203 as an example, the process of performing the deep convolution operation by the multiplication circuit 300 provided in the present application in the first clock cycle clk1 and the second clock cycle clk2 will be described in detail below.
(1) First clock cycle clk1
The first group of PEs 251 shown in FIG. 4 is turned on and the second group of PEs 252 is turned off. That is, only the 4 columns of PEs of the first group of PEs 251 may perform the deep convolution operation during the first clock cycle clk1.
In the first clock cycle clk1, the data output by the input data buffer 203 to each of the 4 columns of PEs of the first group of PEs 251 is 4B of input data. The parameter buffer 202 outputs 4B of convolution kernel data to the sub-switches 241 to 244. The sub-switches 241 to 244 corresponding to the 4 columns of PEs of the first group of PEs 251 each select the valid 1B of data from the acquired 4B of convolution kernel data, replace the other 3B of invalid data with 0, and then output the selected 1B of valid data together with the 3B of 0 to the corresponding PEs. Each of the 4 columns of PEs in the first group of PEs 251 then performs a multiply-add operation between the received 1B of valid data plus 3B of 0 and the 4B of input data obtained from the input data buffer 203.
Assuming that the multiplication circuit 300 uses one convolution kernel Kj" of 3×3×32 (32 channels, size 3*3) to perform a deep convolution operation on input data of 5×5×32 (32 channels, size 5*5), each PE in the first group of PEs 251 in effect uses a data block consisting of part of the data in the convolution kernel Kj" and 0s to convolve one data block of the same size as the convolution kernel in the 5×5×32 input data. Each such data block of the input data is one of the data blocks A1' to A9' shown in FIG. 7 (b).
In the first clock cycle clk1, the data output from the input data buffer 203 to the 4 columns of PEs of the first group of PEs 251 is the 4B of data in the data blocks A1' to A4'; the data output from the switch circuit 240 to each of the 4 columns of PEs of the first group of PEs 251 is the 1B of valid data in the convolution kernel Kj" together with 3B of 0.
For example, the data output from the input data buffer 203 to each row of PEs of the multiplication circuit 300 in FIG. 4 are, respectively, the data I0, I1, I2, I3 of the 4B of channels C1 to C4 of the data block Ai' (i takes any value of 1, 2, 3, 4) shown in FIG. 7 (c). The data output from the parameter buffer 202 to the sub-switches 241 to 244 are, respectively, the data R0, R1, R2, R3 of the 4B of channels C1 to C4 of the convolution kernel Kj" (j takes any value of 1 to 8) shown in FIG. 7 (c). The sub-switches 241 to 244 perform validity selection on the data R0, R1, R2, R3 of channels C1 to C4 of the convolution kernel Kj" and output the results to the corresponding PEs.
For convenience of explanation, it is assumed that Kj" is K1", and the data of each 1B of channels C1 to C4 in K1" are denoted as: R10, R11, R12, R13. Also, assume that the data of the 4B of channels C1 to C4 of data block A1' are denoted as I10, I11, I12, I13; the data of the 4B of channels C1 to C4 of data block A2' are denoted as I20, I21, I22, I23; the data of the 4B of channels C1 to C4 of data block A3' are denoted as I30, I31, I32, I33; and the data of the 4B of channels C1 to C4 of data block A4' are denoted as I40, I41, I42, I43. The convolution kernel data output by the sub-switch 241 to the 4 PEs (PE10 to PE40) of the first column of the first group of PEs 251 is R10, 0, 0, 0; the convolution kernel data output by the sub-switch 242 to the 4 PEs (PE11 to PE41) of the second column is 0, R11, 0, 0; the convolution kernel data output by the sub-switch 243 to the 4 PEs (PE12 to PE42) of the third column is 0, 0, R12, 0; and the convolution kernel data output by the sub-switch 244 to the 4 PEs (PE13 to PE43) of the fourth column is 0, 0, 0, R13.
Then, for the 4 PEs (PE10 to PE40) of the first column of the first group of PEs 251 in the multiplication circuit 300: PE10 convolves the data I10, I11, I12, I13 of channels C1 to C4 of data block A1' using R10, 0, 0, 0, the result of the convolution operation being: R10×I10+0×I11+0×I12+0×I13=R10×I10.
PE20 convolves the data I20, I21, I22, I23 of channels C1 to C4 of data block A2' using 0, R11, 0, 0, the result of the convolution operation being: 0×I20+R11×I21+0×I22+0×I23=R11×I21.
PE30 convolves the data I30, I31, I32, I33 of channels C1 to C4 of data block A3' using 0, 0, R12, 0, the result of the convolution operation being: 0×I30+0×I31+R12×I32+0×I33=R12×I32.
PE40 convolves the data I40, I41, I42, I43 of channels C1 to C4 of data block A4' using 0, 0, 0, R13, the result of the convolution operation being: 0×I40+0×I41+0×I42+R13×I43=R13×I43.
The calculation process of the PEs of the second to fourth columns in the first group of PEs 251 is similar to that of the first column of PEs. The only difference is that the 4 PEs of the first column (PE10 to PE40) in the first group of PEs 251 convolve the 4B of data of channels C1 to C4 of the respective data blocks A1' to A4' using R10, 0, 0, 0; the 4 PEs of the second column (PE11 to PE41) convolve the 4B of data of channels C1 to C4 of the respective data blocks A1' to A4' using 0, R11, 0, 0; the 4 PEs of the third column (PE12 to PE42) use 0, 0, R12, 0; and the 4 PEs of the fourth column (PE13 to PE43) use 0, 0, 0, R13. The specific operation process is not described in detail again.
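The validity selection performed by a sub-switch can be modeled as zero-masking the 4B kernel read, which makes the 4-term multiply-add collapse to a single depthwise product, exactly as in the PE20 expansion above. The sketch below is illustrative only: the numeric values are made up, and `sub_switch` is a hypothetical model of the hardware behavior described in the text.

```python
import numpy as np

# Illustrative values following the patent's notation: R10..R13 are the 1B
# weights of channels C1..C4 of kernel K1"; I20..I23 are the channel C1..C4
# data of data block A2'. The numeric values are made up.
R = np.array([7, 11, 13, 17])        # R10, R11, R12, R13
I2 = np.array([5, 6, 9, 2])          # I20, I21, I22, I23

def sub_switch(kernel_4b, valid_channel):
    # Model of a sub-switch's validity selection: keep the 1B of the valid
    # channel and replace the other 3B of invalid data with 0.
    out = np.zeros_like(kernel_4b)
    out[valid_channel] = kernel_4b[valid_channel]
    return out

# With channel C2 (index 1) selected, the 4-term multiply-add collapses to a
# single depthwise product: 0*I20 + R11*I21 + 0*I22 + 0*I23 = R11*I21.
masked = sub_switch(R, 1)
assert np.dot(masked, I2) == R[1] * I2[1]
```

Because the masked data path reuses the same multiply-add hardware as the standard convolution, the depthwise mode needs no separate arithmetic units, only the sub-switch selection.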
(2) The second clock cycle clk2
The second set of PEs 252 shown in FIG. 4 are turned on and the first set of PEs 251 are turned off. That is, during the second clock cycle clk2, only 4 columns of PEs of the second group of PEs 252 may perform the multiply-add operation.
In the second clock cycle clk2, the data output by the input data buffer 203 to each of the 4 columns of PEs of the second group of PEs 252 is 4B of input data. The parameter buffer 202 outputs 4B of convolution kernel data to the sub-switches 245 to 248. The sub-switches 245 to 248 corresponding to the 4 columns of PEs of the second group of PEs 252 each select the valid 1B of data from the acquired 4B of convolution kernel data, replace the other 3B of invalid data with 0, and then output the selected 1B of valid data together with the 3B of 0 to the corresponding PEs. Each of the 4 columns of PEs in the second group of PEs 252 then performs a multiply-add operation between the received 1B of valid data plus 3B of 0 and the 4B of input data obtained from the input data buffer 203.
In the second clock cycle clk2, the data output from the input data buffer 203 to the 4 columns of PEs of the second group of PEs 252 is the 4B of data in the data blocks A1' to A4'; the data output from the switch circuit 240 to each of the 4 columns of PEs of the second group of PEs 252 is the 1B of valid data in the convolution kernel Kj" together with 3B of 0.
For convenience of description, it is assumed that the data output from the parameter buffer 202 to the sub-switches 245 to 248 are, respectively, the data R4, R5, R6, R7 of the 4B of channels C5 to C8 of the convolution kernel Kj" (j takes any value of 1 to 8) shown in FIG. 7 (d). The sub-switches 245 to 248 perform validity selection on the data R4, R5, R6, R7 of channels C5 to C8 of the convolution kernel Kj" and then output the results to the corresponding PEs.
For convenience of description, it is also assumed that Kj" is K1", and the data of each 1B of channels C5 to C8 in K1" are denoted as: R14, R15, R16, R17. Also, assume that the data of the 4B of channels C5 to C8 of data block A1' are denoted as I14, I15, I16, I17; the data of the 4B of channels C5 to C8 of data block A2' are denoted as I24, I25, I26, I27; the data of the 4B of channels C5 to C8 of data block A3' are denoted as I34, I35, I36, I37; and the data of the 4B of channels C5 to C8 of data block A4' are denoted as I44, I45, I46, I47. The convolution kernel data output by the sub-switch 245 to the 4 PEs (PE14 to PE44) of the first column of the second group of PEs 252 is R14, 0, 0, 0; the convolution kernel data output by the sub-switch 246 to the 4 PEs (PE15 to PE45) of the second column is 0, R15, 0, 0; the convolution kernel data output by the sub-switch 247 to the 4 PEs (PE16 to PE46) of the third column is 0, 0, R16, 0; and the convolution kernel data output by the sub-switch 248 to the 4 PEs (PE17 to PE47) of the fourth column is 0, 0, 0, R17.
Then, for the 4 PEs (PE14 to PE44) of the first column of the second group of PEs 252 in the multiplication circuit 300: PE14 convolves the data I14, I15, I16, I17 of channels C5 to C8 of data block A1' using R14, 0, 0, 0, the result of the convolution operation being: R14×I14+0×I15+0×I16+0×I17=R14×I14.
PE24 convolves the data I24, I25, I26, I27 of channels C5 to C8 of data block A2' using 0, R15, 0, 0, the result of the convolution operation being: 0×I24+R15×I25+0×I26+0×I27=R15×I25.
PE34 convolves the data I34, I35, I36, I37 of channels C5 to C8 of data block A3' using 0, 0, R16, 0, the result of the convolution operation being: 0×I34+0×I35+R16×I36+0×I37=R16×I36.
PE44 convolves the data I44, I45, I46, I47 of channels C5 to C8 of data block A4' using 0, 0, 0, R17, the result of the convolution operation being: 0×I44+0×I45+0×I46+R17×I47=R17×I47.
The calculation process of the second to fourth columns of PEs in the second group of PEs 252 is similar to that of the first column of PEs in the second group of PEs 252. The only difference is that the 4 PEs of the first column (PE14 to PE44) in the second group of PEs 252 convolve the 4B of data of channels C5 to C8 of the respective data blocks A1' to A4' using R14, 0, 0, 0; the 4 PEs of the second column (PE15 to PE45) use 0, R15, 0, 0; the 4 PEs of the third column (PE16 to PE46) use 0, 0, R16, 0; and the 4 PEs of the fourth column (PE17 to PE47) use 0, 0, 0, R17. The specific operation process is not described in detail again.
It will be appreciated that, in the embodiment shown in FIGS. 7 (c) and 7 (d), since the number of channels of the input data and of the convolution kernel data is 32, and the multiplication circuit 300 can only perform the convolution operation on the 1B of data of 4 channels of the input data in one clock cycle, 8 clock cycles are required to complete the convolution operation on the 1B of data of all 32 channels of the input data. In addition, to complete the convolution operation on all the data in a 3×3×32 data block, 72 clock cycles are required. The computation in the other clock cycles is similar to that in the first clock cycle clk1 and the second clock cycle clk2, and is not described again here.
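The cycle counts above follow from a simple accounting: channel groups per kernel position times the number of kernel positions. A minimal sketch of this arithmetic (the function name `cycles_needed` is hypothetical and the rule is inferred from the patent's 4B-port examples):

```python
def cycles_needed(channels: int, port_bytes: int, kernel_h: int, kernel_w: int) -> int:
    # Channel groups read per kernel position (1B per channel per cycle),
    # multiplied by the number of kernel positions.
    return (channels // port_bytes) * kernel_h * kernel_w

assert cycles_needed(32, 4, 1, 1) == 8    # one 1*1 sub-block across 32 channels
assert cycles_needed(32, 4, 3, 3) == 72   # a full 3*3*32 data block
```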
Further, it will be appreciated that when the multiplication circuit 300 performs a deep convolution operation, although the convolution kernel data for each column of PEs is derived from the same convolution kernel, the convolution kernel data received by each column of PEs in the PE array 250 in one clock cycle is the 1B of data of one channel of that convolution kernel stored in the parameter cache 202, with the 3B of data corresponding to the other three channels replaced with 0. That is, different columns of PEs in the multiplication circuit 300 ultimately convolve the input data with different effective convolution kernels, and it is therefore readily appreciated that different columns of PEs in the multiplication circuit 300 ultimately output different feature maps.
As can be seen from the above description of the standard convolution operation and the deep convolution operation performed by the multiplication circuit 300 provided in the present application, the switch circuit 240 is disposed in the multiplication circuit 300 so that, when the multiplication circuit 300 performs a deep convolution operation, the convolution kernel data received by each column of PEs in the PE array 250 in each clock cycle is the 1B of data of one channel of the same convolution kernel stored in the parameter cache 202, with the 3B of data corresponding to the other three channels replaced with 0. That is, each column of PEs ultimately convolves the input image with only the data of one channel selected from the same convolution kernel as valid data; in other words, each column of PEs ultimately convolves the input image with a different effective convolution kernel, and each column of the PE array 250 corresponds to one feature map.
In addition, when the multiplication circuit 300 provided in the present application performs a standard convolution operation, the convolution kernel data received by each column of PEs in the PE array 250 in each clock cycle is 4B of data (1B in each of 4 channels) of a respective, mutually independent convolution kernel stored in the parameter buffer 202; each column of PEs convolves the input image with a different convolution kernel, and each column of the PE array 250 likewise corresponds to one feature map.
Therefore, whether the multiplication circuit 300 provided in the present application performs a deep convolution operation or a standard convolution operation, one column of PEs corresponds to one feature map; that is, the output data formats of the two operations are the same. Consequently, for the same multiplication circuit 300, there is no need to adjust the data output format when implementing different convolution operations in different application scenarios. This meets the requirement that a product developer/designer can adapt the same multiplication circuit 300 to different application scenarios without changing the input/output data format.
In addition, when the multiplication circuit 300 shown in FIG. 4 performs a deep convolution operation, the convolution kernel data participating in the operation in each PE does not need to be updated frequently and rapidly, so the time delay can be reduced. Also, when the multiplication circuit 300 performs a deep convolution operation, by grouping the PEs in the PE array 250 so that only a portion of the PEs are turned on to participate in the operation in different clock cycles, the power consumption of the multiplication circuit 300 can be saved. Moreover, in the process of performing a deep convolution operation on the input data, every column of PEs in the PE array 250 participates in the operation, so the utilization rate of the PE array is higher.
Having described the hardware architecture of the multiplication circuit 300 and the processes by which the multiplication circuit 300 performs the deep convolution operation and the standard convolution operation, a system on chip including the multiplication circuit 300 provided herein will now be described. For example, as shown in FIG. 9, a System on Chip (SoC) 400 includes the multiplication circuit 300, a main control central processing unit (Central Processing Unit, CPU) 410, a double data rate (Double Data Rate, DDR) memory 420, and an advanced extensible interface (Advanced eXtensible Interface, AXI) bus 430. The multiplication circuit 300, the main control CPU 410, and the DDR memory 420 communicate over the AXI bus 430. The structure and working principle of the multiplication circuit 300 are described above; refer specifically to the descriptions of FIGS. 3 to 8, which are not repeated here.
DDR memory 420 may be used to load and store data and/or instructions. For example, in some embodiments, DDR memory 420 may be used to load or store convolution kernel data, input data, convolution result data output by multiplication circuit 300, and the like, involved in the execution of a convolution operation by multiplication circuit 300.
The main control CPU 410 may include one or more single-core or multi-core processors. In some embodiments, the main control CPU 410 may include any combination of general-purpose and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the main control CPU 410 may be configured to control the multiplication circuit 300 to switch between the deep convolution operation mode and the standard convolution operation mode in different application scenarios, so that the multiplication circuit 300 performs a deep convolution operation or a standard convolution operation. For example, in some embodiments, the DDR memory 420 stores the operating programs of the system on chip 400, in which the deep convolution operation program corresponding to an autopilot scenario and the standard convolution operation program corresponding to a face recognition access control scenario are mapped to different labels. The main control CPU 410 fetches instructions from the DDR memory 420 and then controls the multiplication circuit 300 to execute the different operation modes according to the different instructions.
The following describes the application of the system on chip 400 provided in the present application to the autopilot application scenario shown in fig. 10 and the face recognition scenario shown in fig. 12, respectively. It is to be understood that the system on chip 400 for a neural network model provided in the present application may also be applied to other application scenarios, which are not listed here.
First, when the system on chip 400 provided in the present application is applied to the autopilot application scenario shown in fig. 10, the process of image recognition performed by the autopilot car using the system on chip 400 will be described in detail.
As shown in FIG. 10, the scenario includes an autonomous car 500 and a plurality of pedestrians in front of the autonomous car 500. The autonomous car 500, in which the system on chip 400 provided herein is deployed, includes a plurality of autopilot function applications, such as the autopilot function applications 1 to 3 shown in the figure, a camera 510, and a sensor module 520. Specifically, the process of image recognition by the autonomous car using the system on chip 400 is shown in FIG. 11, and includes the following steps:
Step 1101: the camera 510 captures an image of the surrounding environment. For example, in some embodiments, the camera 510 may collect image information around the autonomous car 500 in real time during travel; the image information may include vehicle information, pedestrian traffic information, road condition information such as water accumulation or ice on the road, and traffic marking information such as straight-ahead or turn markings on the road on which the autonomous car 500 travels.
Step 1102: the main control CPU 410 controls the multiplication circuit 300 to perform a deep convolution operation on the data of the image acquired by the camera 510. It can be appreciated that, because the autonomous car 500 places high demands on real-time performance, the multiplication circuit 300 in the system on chip 400 may perform a deep convolution operation when the system on chip 400 provided in the present application is applied to the autonomous car 500.
For example, referring to FIG. 4, the main control CPU 410 performs feature extraction on the image acquired by the camera 510 to obtain RGB color space data of the image, and then stores the obtained color space data (input data for short) in the input data buffer 203 of the multiplication circuit 300. The parameter buffer 202 outputs data of the same convolution kernel to the sub-switches 241 to 248 corresponding to each column of PEs in the switch circuit 240. The sub-switches 241 to 248 each select the convolution kernel data of one channel from the received convolution kernel data as valid data, replace the convolution kernel data of the other channels with 0, and then send the valid data of the selected channel and the 0-replaced data of the other channels to the corresponding column of PEs. Each column of PEs performs an effective multiply-add operation between the received valid data of that one channel (together with the 0-replaced data of the other channels) and the data of one channel of the input data received from the input data buffer 203, thereby implementing a deep convolution operation on the input data and quickly obtaining the convolution operation results (i.e., feature maps).
The process by which the multiplication circuit 300 performs the depthwise convolution operation is described above and is not repeated here.
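The zero-masking trick described above can be sketched in a few lines. This is an illustrative software model only (the function and variable names are not from the patent): assuming each PE column performs an ordinary multi-channel multiply-accumulate, zeroing every kernel channel except the selected one makes that full accumulation equal to a single-channel (depthwise) product sum.

```python
import numpy as np

def masked_column_macc(patch, kernel, channel):
    """Model of one PE column: a full multi-channel multiply-accumulate,
    after the sub-switch has zeroed every kernel channel except `channel`."""
    masked = np.zeros_like(kernel)
    masked[channel] = kernel[channel]   # sub-switch keeps one channel as effective data
    return float(np.sum(patch * masked))

# 3-channel 3x3 input patch and 3-channel 3x3 kernel
rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 3))
kernel = rng.standard_normal((3, 3, 3))

# Only the selected channel affects the result: the masked full MACC
# equals the single-channel (depthwise) convolution for that channel.
for c in range(3):
    assert np.isclose(masked_column_macc(patch, kernel, c),
                      np.sum(patch[c] * kernel[c]))
```

The point of the sketch is that the datapath never changes: the PE column always accumulates over all channels, and the sub-switch's zero replacement alone decides which channel is effective.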
Step 1103: the main control CPU 410 controls the driving state of the autonomous vehicle 500 based on the result of the depthwise convolution operation of the multiplication circuit 300.
In some embodiments, the feature map output by each column of the PE array 250 in the multiplication circuit 300 is stored in the output buffer 204. The main control CPU 410 performs image recognition based on the feature maps stored in the output buffer 204, so that vehicle information, pedestrian flow information, road condition information, and the like in the surrounding environment can be recognized quickly. The autonomous vehicle 500 can thus be controlled to adjust its driving state in time, for example, to avoid obstacles promptly, or to go straight or turn according to the road markings in the image.
The following describes in detail the process of performing image recognition when the system-on-chip 400 provided in the present application is applied to the face recognition access control scenario shown in fig. 12.
As shown in fig. 12, the face recognition access control system 600 includes a camera 610 and the system-on-chip 400. The camera 610 is used to photograph the face of a user to be identified, so that the system-on-chip 400 can determine, after face recognition, whether to open the door for that user. Specifically, the process of image recognition by the face recognition access control system 600 is shown in fig. 13 and includes the following steps:
Step 1301: the camera 610 captures a face image of a user to be identified.
Step 1302: the main control CPU410 controls the multiplication circuit 300 to perform standard convolution operation on the data of the face image acquired by the camera 610.
It can be appreciated that, because the face recognition access control system 600 has lower real-time requirements than the autonomous vehicle 500 shown in fig. 10, the multiplication circuit 300 in the system-on-chip 400 can perform a standard convolution operation when the system-on-chip 400 provided in the present application is applied to the face recognition access control system 600.
For example, referring to fig. 4, the main control CPU 410 performs feature extraction on the face image captured by the camera 610 to obtain RGB color space data of the face image, and then stores the obtained color space data (input data for short) in the input data buffer 203 of the multiplication circuit 300. When the multiplication circuit 300 performs the standard convolution operation, all PEs in the PE array 250 are turned on, the convolution kernel data output from the parameter buffer 202 to each column of PEs in the PE array 250 comes from a different convolution kernel, and the number of channels of each convolution kernel is the same as the number of channels of the input data cached in the input data buffer 203. The parameter buffer 202 outputs the convolution kernel data to the sub-switches 241 to 248 corresponding to the PE columns in the switch circuit 240, and the sub-switches 241 to 248 output all the received convolution kernel data as effective data to the corresponding column of PEs. Each column of PEs then performs a multiply-add operation on the received multi-channel convolution kernel data and the multi-channel input data received from the input data buffer 203, thereby implementing a standard convolution operation on the input data and obtaining the standard convolution result (i.e., a feature map).
The process by which the multiplication circuit 300 performs the standard convolution operation is described above and is not repeated here.
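For contrast with the depthwise mode, the standard-convolution data flow above can be sketched as follows (again with illustrative names, not the patent's): each column receives a different full multi-channel kernel, all channels pass through the sub-switch as effective data, and each column's accumulation yields one channel of the output feature map.

```python
import numpy as np

def standard_conv_pixel(patch, kernels):
    """Model of the PE array in standard-convolution mode: column j accumulates
    over ALL channels of kernel j, producing one output channel per column."""
    return [float(np.sum(patch * k)) for k in kernels]

rng = np.random.default_rng(1)
patch = rng.standard_normal((3, 3, 3))        # multi-channel input patch
kernels = rng.standard_normal((8, 3, 3, 3))   # 8 kernels -> 8 columns -> 8 output channels

out = standard_conv_pixel(patch, kernels)
assert len(out) == 8                          # one feature-map channel per PE column

# Each column's result is the sum of its per-channel partial products, i.e.
# input channel c pairs with kernel channel c (the one-to-one correspondence).
assert np.isclose(out[0], sum(np.sum(patch[c] * kernels[0][c]) for c in range(3)))
```

Note that in both modes each column emits one feature-map channel, which is why the output format stays the same when the circuit switches modes.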
Step 1303: the main control CPU 410 controls, based on the result of the standard convolution operation of the multiplication circuit 300, whether to open the door for the user to be identified.
For example, in some embodiments, the feature map output by each column of the PE array 250 in the multiplication circuit 300 is stored in the output buffer 204, and the main control CPU 410 performs image recognition based on the feature maps stored in the output buffer 204 to determine whether to open the door for the user to be identified.
As shown by the description above of the depthwise convolution operation performed by the multiplication circuit 300 in the system-on-chip 400 in the autonomous driving scenario, and of the standard convolution operation performed by the multiplication circuit 300 in the face recognition access control scenario, the multiplication circuit 300 provided in the present application can switch between different operation modes by means of the switch circuit 240 and its control logic. In addition, in both the autonomous driving scenario and the face recognition access control scenario, each column of PEs in the multiplication circuit 300 corresponds to the feature map of one channel; that is, the output data format of the multiplication circuit 300 is the same in different application scenarios, giving the circuit a wide range of applications.
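The mode switching summarized above can be captured in a small software sketch (hypothetical names; the actual circuit realizes this with the switch circuit 240 in hardware, not software): the same column datapath serves both modes, only the sub-switch masking differs, and either way each column emits one scalar, i.e. one feature-map channel.

```python
import numpy as np

def sub_switch(kernel, mode, channel=0):
    """Depthwise mode: keep one channel as effective data, zero the rest.
    Standard mode: pass all channels through unchanged."""
    if mode == "depthwise":
        masked = np.zeros_like(kernel)
        masked[channel] = kernel[channel]
        return masked
    return kernel

def pe_column(patch, kernel, mode, channel=0):
    # Identical multiply-accumulate datapath in both modes:
    # one scalar (one feature-map channel) per column either way.
    return float(np.sum(patch * sub_switch(kernel, mode, channel)))

rng = np.random.default_rng(2)
patch = rng.standard_normal((3, 3, 3))
kernel = rng.standard_normal((3, 3, 3))

dw = pe_column(patch, kernel, "depthwise", channel=2)
std = pe_column(patch, kernel, "standard")
assert np.isclose(dw, np.sum(patch[2] * kernel[2]))   # only channel 2 contributed
assert np.isclose(std, np.sum(patch * kernel))        # all channels contributed
```

Because only the masking step changes, a single control signal per mode suffices, which matches the patent's claim that the switch circuit's control logic alone selects the operation mode.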
Fig. 14 provides a block diagram of an electronic device 100, according to some embodiments of the present application. As shown in fig. 14, the electronic device 100 includes a memory 110, an input-output device 120, a processor 140, a communication module 130, and a system-on-chip 400.
The multiplication circuit 300 is used to perform different convolution operations in different scenarios, for example, a depthwise convolution operation in the autonomous driving scenario and a standard convolution operation in the face recognition access control scenario. Reference may be made to the descriptions of fig. 4 to fig. 13 above, which are not repeated here.
The processor 140 may include one or more processing units; for example, it may include processing modules or processing circuits such as a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a digital signal processor (Digital Signal Processor, DSP), a microcontroller unit (MCU), an artificial intelligence (Artificial Intelligence, AI) processor, or a programmable logic device such as a field programmable gate array (Field Programmable Gate Array, FPGA). In some embodiments, assuming that the electronic device 100 is an autonomous vehicle, the processor 140 is configured to control the driving state of the vehicle according to the image recognition result output by the multiplication circuit 300. For another example, in some embodiments, assuming that the electronic device 100 is a face recognition access control device, the processor 140 is configured to determine whether to open the door based on the face recognition result output by the multiplication circuit 300.
The memory 110 may be used to store data, software programs, and modules. It may be a volatile memory such as a random-access memory (Random-Access Memory, RAM); a non-volatile memory (Non-Volatile Memory) such as a read-only memory (Read-Only Memory, ROM), a flash memory (Flash Memory), a hard disk drive (HDD), or a solid-state drive (Solid State Drive, SSD); a combination of the above types of memory; or a removable storage medium such as a secure digital (Secure Digital, SD) memory card. For example, the memory 110 is used to store the operating program of the multiplication circuit 300, the convolution results output by the multiplication circuit 300, the captured images, the convolution kernel data involved in the convolution operations performed by the multiplication circuit 300, and the like.
The input output devices 120 may include a display screen, a touch screen, a speaker, and the like.
The communication module 130 may be, for example, a WIFI module, a universal serial bus (Universal Serial Bus, USB) module, or a 4G or 5G module, and is used for the electronic device 100 to communicate with other electronic devices.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a high-density digital video disc (Digital Video Disc, DVD)), a semiconductor medium (e.g., a solid-state drive (SSD)), or the like.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the present application may be implemented as a computer program or program code that is executed on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a digital signal processor (Digital Signal Processor, DSP), microcontroller, application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through another computer-readable medium. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including but not limited to floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or a tangible machine-readable memory used to transmit information over the Internet in the form of electrical, optical, acoustical, or other propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, a part of one physical unit/module, or a combination of multiple physical units/modules; the physical implementation of the logical units/modules is not itself essential, and the combination of functions they implement is the key to solving the technical problem addressed by the present application. Furthermore, to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are less closely related to solving that technical problem; this does not mean that the above device embodiments contain no other units/modules.
It should be noted that in the examples and descriptions of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.

Claims (12)

1. A multiplication circuit for convolution operations, comprising: a PE array, a first buffer for storing input data, a second buffer for storing a plurality of channel parameters of a plurality of convolution kernels, and a switching circuit connected between the PE array and the second buffer;
wherein, when the multiplication circuit performs a first convolution operation:
the switching circuit is configured to, in each operation cycle, take a plurality of channel parameters of the convolution kernels stored in the second buffer as effective data, replace with zero a part of the channel parameters of the convolution kernels other than the channel parameters taken as effective data, and output the channel parameters taken as effective data and the part of channel parameters replaced with zero to the PE array, so that when the PE array performs a convolution operation on the channel parameters of the convolution kernels received from the switching circuit and the input data acquired from the first buffer, only the channel parameters taken as effective data affect the result of the convolution operation.
2. The multiplication circuit of claim 1, wherein the PE array comprises a plurality of columns of PEs, and the switching circuit comprises a plurality of sub-switches in one-to-one correspondence with the columns of PEs in the PE array;
Wherein, when the multiplication circuit performs a first convolution operation:
each sub-switch of the switch circuit is used for outputting, in each operation cycle, one channel parameter of one convolution kernel stored in the second buffer as effective data to a corresponding column of PEs in the PE array, so that when each column of PEs that receives the effective data performs a convolution operation on the channel parameter of the convolution kernel received from the corresponding sub-switch and the input data acquired from the first buffer, only the channel parameter serving as effective data affects the result of the convolution operation,
and the effective data acquired by each column of PEs corresponds to a different channel of the convolution kernel.
3. The multiplication circuit of claim 2, wherein, when the multiplication circuit performs the first convolution operation:
each sub-switch of the switch circuit is used for, in each operation cycle, selecting one channel parameter from one convolution kernel stored in the second buffer as effective data, replacing with zero a part of the channel parameters of the convolution kernel other than the channel parameter serving as effective data, and outputting the one channel parameter serving as effective data and the part of channel parameters replaced with zero to a corresponding column of PEs in the PE array, so that each column of PEs that receives the effective data performs a convolution operation on the channel parameter serving as effective data and the part of channel parameters replaced with zero, received from the corresponding sub-switch, and the input data acquired from the first buffer,
And the channels of the input data and the channels of the convolution kernel which participate in the convolution operation have a one-to-one correspondence.
4. The multiplication circuit of claim 1, wherein when the multiplication circuit performs the second convolution operation:
the switching circuit is used for outputting, in each operation cycle, a plurality of channel parameters of the plurality of convolution kernels stored in the second buffer to the PE array as effective data; and
The PE array is used for carrying out convolution operation on a plurality of channel parameters of a plurality of convolution kernels received from the switch circuit and corresponding channel parameters of input data acquired from the first buffer memory, wherein,
and the channels of the input data and the channels of the convolution kernel have a one-to-one correspondence.
5. The multiplication circuit of claim 4, wherein the PE array comprises a plurality of columns of PEs, and the switching circuit comprises a plurality of sub-switches in one-to-one correspondence with the columns of PEs in the PE array;
wherein, when the multiplication circuit performs a second convolution operation:
each sub-switch of the switch circuit is configured to output, in each operation cycle, a plurality of channel parameters of a plurality of convolution kernels stored in the second buffer as effective data to a corresponding column of PEs in the PE array, so that each column of PEs performs a convolution operation on a plurality of channel parameters of one convolution kernel received from the corresponding sub-switch and a corresponding channel parameter of input data acquired from the first buffer,
And the channels of the input data and the channels of the convolution kernel which participate in the convolution operation have a one-to-one correspondence.
6. The multiplication circuit of any one of claims 1 to 5, further comprising a third buffer for buffering the convolution operation result of the PE array.
7. The multiplication circuit of claim 6, wherein, when the multiplication circuit performs the first convolution operation or the second convolution operation: the operation results of each column of PEs in the PE array respectively correspond to one channel of the convolution operation result of the PE array.
8. The multiplication circuit of claim 1, wherein, when the multiplication circuit performs the first convolution operation: a part of the PEs in the PE array and the other PEs in the PE array are used to alternately perform convolution operations on the channel parameters of the convolution kernels received from the switch circuit and the input data acquired from the first buffer.
9. The multiplication circuit of claim 1, wherein when the multiplication circuit performs the second convolution operation: all PEs in the PE array are configured to simultaneously perform a convolution operation on the channel parameters of the convolution kernel received from the switching circuit and the input data acquired from the first buffer.
10. The multiplication circuit of claim 1, further comprising a memory control circuit for reading input data stored in an external memory space into the first buffer and/or reading a plurality of channel parameters of a plurality of convolution kernels stored in the external memory space into the second buffer.
11. A system on a chip comprising the multiplication circuit of any one of claims 1 to 10, and a processor and memory;
a memory for storing instructions for execution by one or more processors of the system-on-chip;
a processor, being one of the processors of the system-on-chip, configured to, when the instructions are executed by the processor, control the multiplication circuit to perform convolution operations in different operation modes.
12. An electronic device comprising the system-on-chip of claim 11, and a processor and memory;
a memory for storing instructions for execution by one or more processors of the electronic device;
a processor configured to, when the instructions are executed by the one or more processors, control the multiplication circuit in the system-on-chip to perform convolution operations in different operation modes.
CN202110805327.6A 2021-07-16 2021-07-16 Multiplication circuit, system on chip and electronic device Active CN113361699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110805327.6A CN113361699B (en) 2021-07-16 2021-07-16 Multiplication circuit, system on chip and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110805327.6A CN113361699B (en) 2021-07-16 2021-07-16 Multiplication circuit, system on chip and electronic device

Publications (2)

Publication Number Publication Date
CN113361699A CN113361699A (en) 2021-09-07
CN113361699B true CN113361699B (en) 2023-05-26

Family

ID=77539789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110805327.6A Active CN113361699B (en) 2021-07-16 2021-07-16 Multiplication circuit, system on chip and electronic device

Country Status (1)

Country Link
CN (1) CN113361699B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784973A (en) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 Convolution operation circuit, device and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109844774B (en) * 2018-08-28 2023-01-24 深圳鲲云信息科技有限公司 Parallel deconvolution computing method, single-engine computing method and related products
US10977002B2 (en) * 2019-07-15 2021-04-13 Facebook Technologies, Llc System and method for supporting alternate number format for efficient multiplication
CN111488983B (en) * 2020-03-24 2023-04-28 哈尔滨工业大学 Lightweight CNN model calculation accelerator based on FPGA
CN111667052B (en) * 2020-05-27 2023-04-25 上海赛昉科技有限公司 Standard and nonstandard convolution consistency transformation method of special neural network accelerator
CN111898733B (en) * 2020-07-02 2022-10-25 西安交通大学 Deep separable convolutional neural network accelerator architecture
CN111931927B (en) * 2020-10-19 2021-02-19 翱捷智能科技(上海)有限公司 Method and device for reducing occupation of computing resources in NPU
CN112734020B (en) * 2020-12-28 2022-03-25 中国电子科技集团公司第十五研究所 Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784973A (en) * 2019-11-04 2021-05-11 北京希姆计算科技有限公司 Convolution operation circuit, device and method

Also Published As

Publication number Publication date
CN113361699A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
US11087433B2 (en) Convolutional neural network
CN109117948B (en) Method for converting picture style and related product
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
CN108133270B (en) Convolutional neural network acceleration method and device
US10140123B2 (en) SIMD processing lanes storing input pixel operand data in local register file for thread execution of image processing operations
US10990650B1 (en) Reducing computations for data including padding
US10936937B2 (en) Convolution operation device and convolution operation method
US11775430B1 (en) Memory access for multiple circuit components
CN109871510B (en) Two-dimensional convolution operation processing method, system, equipment and computer storage medium
KR20200066953A (en) Semiconductor memory device employing processing in memory (PIM) and operating method for the same
US20200322618A1 (en) Image processor, image processing system including image processor, system-on-chip including image processing system, and method of operating image processing system
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
US11048509B2 (en) Providing multi-element multi-vector (MEMV) register file access in vector-processor-based devices
US10169295B2 (en) Convolution operation device and method
US20200218777A1 (en) Signal Processing Method and Apparatus
CN113673701A (en) Method for operating neural network model, readable medium and electronic device
CN110633709A (en) Characteristic graph processing method based on residual error network
CN112905530A (en) On-chip architecture, pooled computational accelerator array, unit and control method
CN111651383B (en) Method and apparatus for data flow in a processor having a data flow manager
CN113361699B (en) Multiplication circuit, system on chip and electronic device
CN110533177B (en) Data read-write device, method, equipment, medium and convolution accelerator
CN112990440A (en) Data quantization method for neural network model, readable medium, and electronic device
CN110647978B (en) System and method for extracting convolution window in convolution neural network
US11467973B1 (en) Fine-grained access memory controller
US11782622B2 (en) Memory apparatus embedded with computing function and operation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant