CN114330669B - Vector processor-oriented half-precision vectorized conv1×1 convolution method and system - Google Patents


Publication number
CN114330669B
CN114330669B (granted from application CN202111681136.XA)
Authority
CN
China
Prior art keywords: data, vector, weight, space, precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111681136.XA
Other languages
Chinese (zh)
Other versions
CN114330669A (en)
Inventor
许金伟
李娅琳
姜晶菲
苏华友
乔鹏
王庆林
李荣春
高蕾
窦勇
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111681136.XA
Publication of CN114330669A
Application granted
Publication of CN114330669B
Legal status: Active

Abstract

The invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method and system, wherein the method comprises the following steps: storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR); invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively; in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for that architecture and improves the achieved FLOPS while preserving accuracy.

Description

Vector processor-oriented half-precision vectorized conv1×1 convolution method and system
Technical Field
The invention relates to the technical field of vector processors, and in particular to a vector processor-oriented half-precision vectorized conv1×1 convolution method and system.
Background
The vector processor is a novel architecture which, as shown in fig. 1, includes a scalar processing unit (SPU) for scalar operations, a vector processing unit (VPU) for vector operations, and a direct memory access (DMA) component for data transfer, among other parts. The SPU is made up of a scalar processing element (SPE) and a scalar memory (SM). The VPU is composed of L vector processing elements (VPEs) and an array memory (AM); the L VPEs cooperate in single instruction, multiple data (SIMD) mode, and each VPE integrates 3 vector arithmetic units that support both fixed-point and floating-point vector operations.
A single VPE can process one 8-byte datum (FP64, Int64), two 4-byte data (FP32, Int32), or four 2-byte data (FP16) at a time. The DMA component is responsible for data transfer between SM and DDR (double data rate synchronous dynamic random access memory) and between AM and DDR, and its minimum operation granularity is likewise 8 bytes.
Convolution is one of the core computations of neural networks, and conv1×1 is the most common kernel size in convolution operations, so convolution efficiency has a great influence on neural network performance, and optimizing the convolution computation is very important.
Disclosure of Invention
In view of this, the present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method which exploits the architectural features of the vector processor to vectorize the conv1×1 convolution for that architecture, thereby improving the achieved FLOPS while preserving accuracy.
The invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method, comprising the following steps:
storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR);
invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
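To make the matrix view concrete, the following NumPy sketch (illustrative shapes chosen here, not taken from the patent) checks that a conv1×1 over [Cin, Hi, Wi, n] data equals the M × K by K × N matrix product described above:

```python
import numpy as np

# Illustrative sizes: Co output channels, Cin input channels,
# an Hi x Wi image, n images per batch.
Co, Cin, Hi, Wi, n = 8, 16, 4, 4, 2

weight = np.random.rand(Co, Cin).astype(np.float16)      # [Co, Cin] since ks = 1
inp = np.random.rand(Cin, Hi, Wi, n).astype(np.float16)  # [Cin, Hi, Wi, n]

# Direct conv1x1: every output pixel is a dot product over Cin.
direct = np.einsum('oc,chwn->ohwn', weight.astype(np.float32),
                   inp.astype(np.float32))

# Same computation as a matrix product with M = Co, K = Cin, N = Hi * Wi * n.
M, K, N = Co, Cin, Hi * Wi * n
gemm = weight.reshape(M, K).astype(np.float32) @ inp.reshape(K, N).astype(np.float32)

assert np.allclose(direct.reshape(M, N), gemm)
```

The reshape is free because [Hi, Wi, n] is already contiguous, which is exactly why the method can treat the image dimensions as a single N.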
Preferably, invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively, includes:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, with Wb_sm = m × K and x1 = ⌈M / m⌉, where the size of m is jointly determined by the sizes of the SM space and the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, with Ib_am = K × n, i.e. N = x2 × n, where n = P × L × 4 and x2 = ⌈N / n⌉; P denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
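The block counts x1 and x2 are ceiling divisions of the matrix dimensions by the block sizes; a minimal sketch (the P and L values below are assumptions for illustration, not the patent's hardware parameters):

```python
import math

def partition(M, N, m, P, L):
    """Split the M x K weight matrix into x1 row blocks of height m, and the
    K x N input matrix into x2 column blocks of width n = P * L * 4,
    mirroring W_sm = x1 * Wb_sm and I_am = x2 * Ib_am."""
    n = P * L * 4              # P registers x L lanes x 4 halves per lane
    x1 = math.ceil(M / m)      # x1 = ceil(M / m)
    x2 = math.ceil(N / n)      # x2 = ceil(N / n)
    return x1, x2, n

# Assumed example: M = 96 output channels, N = 4096 pixels,
# m = 6, P = 4 vector functional units, L = 16 processing elements.
x1, x2, n = partition(96, 4096, 6, 4, 16)   # -> x1 = 16, x2 = 16, n = 256
```

When M or N is not an exact multiple, the final tail block is simply smaller; the ceiling keeps every element covered.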
Preferably, vectorizing the weight data loaded into the on-chip SM space in the SM space, and performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data, includes the following steps:
Step 1, initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initialize k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 the column index of the input sub-block, i.e. a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initialize the vector registers to 0 so that they can accumulate and store the calculation results;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so one load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register; load the k-th column data Wb_sm(i,0,k) … Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space into bits R[0:15] of scalar registers R30, R31 … R30+m-1 in sequence, with the (k+1)-th column data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) of Wb_sm(i) landing in bits R[16:31] of R30, R31 … R30+m-1 in sequence;
Step 6, based on the half-precision weight data held in scalar registers R30, R31 … R30+m-1, perform a low-half extension on R30, R31 … R30+m-1, replicating and extending the low 16-bit data R[0:15] of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 7, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is then complete;
Step 8, load the k-th row data Ib_am(j,k,0) … Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 9, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 10, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k) … Wb_sm(i,m-1,k), and vector registers VR0, VR1 … VRp-1 hold the k-th row data of Ib_am(j); repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11, judge whether k + 1 is smaller than K; if not, jump to step 19; if so, continue with step 12;
Step 12, based on the data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) held in bits R[16:31] of scalar registers R30, R31 … R30+m-1, perform a high-half extension on R30, R31 … R30+m-1, replicating and extending the high 16-bit data R[16:31] of the low 32 bits of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 13, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the broadcast data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is then complete;
Step 14, load the (k+1)-th row data Ib_am(j,k+1,0) … Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 15, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 16, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k+1) … Wb_sm(i,m-1,k+1), and vector registers VR0, VR1 … VRp-1 hold the (k+1)-th row data of Ib_am(j); repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17, let k = k + 2;
Step 18, judge whether k is smaller than K; if so, return to step 5; if not, execute step 19;
Step 19, temporarily store the data results held in vector registers VR10, VR11 … VR10+m×p-1 to the AM space location AM_temp;
Step 20, invoke a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the DDR memory;
Step 21, let j = j + 1;
Step 22, judge whether j is smaller than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and return to step 3 once loading completes; if not, execute step 23;
Step 23, let i = i + 1;
Step 24, judge whether i is smaller than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and return to step 2 once loading completes; if not, the conv1×1 computation of all weight data W_ddr and input data I_ddr is complete.
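Steps 1-24 above amount to a blocked matrix multiplication whose K loop is unrolled by two; the following NumPy sketch (hypothetical sizes, plain arrays standing in for the scalar and vector registers) emulates that control flow and checks it against a reference product:

```python
import numpy as np

def conv1x1_blocked(W, I, m, n):
    """Emulate the schedule of steps 1-24: loop over weight row blocks (i)
    and input column blocks (j); inside each block, walk K two columns/rows
    at a time (steps 5-18) and accumulate rank-1 updates, standing in for
    the broadcast + multiply-add over L lanes. Assumes m | M, n | N, 2 | K."""
    M, K = W.shape
    _, N = I.shape
    out = np.zeros((M, N), dtype=np.float32)
    for i in range(M // m):                          # steps 1, 23-24
        Wb = W[i * m:(i + 1) * m, :].astype(np.float32)
        for j in range(N // n):                      # steps 2, 21-22
            Ib = I[:, j * n:(j + 1) * n].astype(np.float32)
            acc = np.zeros((m, n), dtype=np.float32) # step 4: zeroed accumulators
            for k in range(0, K, 2):                 # steps 5-18: unrolled by 2
                acc += np.outer(Wb[:, k], Ib[k, :])          # column k   x row k
                acc += np.outer(Wb[:, k + 1], Ib[k + 1, :])  # column k+1 x row k+1
            out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc  # steps 19-20
    return out

W = np.random.rand(12, 8).astype(np.float16)   # M = 12, K = 8
I = np.random.rand(8, 32).astype(np.float16)   # K = 8,  N = 32
ref = W.astype(np.float32) @ I.astype(np.float32)
assert np.allclose(conv1x1_blocked(W, I, m=6, n=16), ref, atol=1e-3)
```

The unroll by two matches the register-level trick of steps 5-16: one 4-byte scalar load fetches weight columns k and k+1 together, so both are consumed before the next load.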
A vector processor-oriented half-precision vectorized conv1×1 convolution system, comprising:
a storage module for storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR);
a loading module for invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
a processing module for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
Preferably, the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, with Wb_sm = m × K and x1 = ⌈M / m⌉, where the size of m is jointly determined by the sizes of the SM space and the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, with Ib_am = K × n, i.e. N = x2 × n, where n = P × L × 4 and x2 = ⌈N / n⌉; P denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
Preferably, the processing module is specifically configured to perform the following steps:
Step 1, initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initialize k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 the column index of the input sub-block, i.e. a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initialize the vector registers to 0 so that they can accumulate and store the calculation results;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so one load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register; load the k-th column data Wb_sm(i,0,k) … Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space into bits R[0:15] of scalar registers R30, R31 … R30+m-1 in sequence, with the (k+1)-th column data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) of Wb_sm(i) landing in bits R[16:31] of R30, R31 … R30+m-1 in sequence;
Step 6, based on the half-precision weight data held in scalar registers R30, R31 … R30+m-1, perform a low-half extension on R30, R31 … R30+m-1, replicating and extending the low 16-bit data R[0:15] of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 7, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is then complete;
Step 8, load the k-th row data Ib_am(j,k,0) … Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 9, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 10, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k) … Wb_sm(i,m-1,k), and vector registers VR0, VR1 … VRp-1 hold the k-th row data of Ib_am(j); repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11, judge whether k + 1 is smaller than K; if not, jump to step 19; if so, continue with step 12;
Step 12, based on the data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) held in bits R[16:31] of scalar registers R30, R31 … R30+m-1, perform a high-half extension on R30, R31 … R30+m-1, replicating and extending the high 16-bit data R[16:31] of the low 32 bits of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 13, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the broadcast data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is then complete;
Step 14, load the (k+1)-th row data Ib_am(j,k+1,0) … Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 15, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 16, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k+1) … Wb_sm(i,m-1,k+1), and vector registers VR0, VR1 … VRp-1 hold the (k+1)-th row data of Ib_am(j); repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17, let k = k + 2;
Step 18, judge whether k is smaller than K; if so, return to step 5; if not, execute step 19;
Step 19, temporarily store the data results held in vector registers VR10, VR11 … VR10+m×p-1 to the AM space location AM_temp;
Step 20, invoke a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the DDR memory;
Step 21, let j = j + 1;
Step 22, judge whether j is smaller than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and return to step 3 once loading completes; if not, execute step 23;
Step 23, let i = i + 1;
Step 24, judge whether i is smaller than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and return to step 2 once loading completes; if not, the conv1×1 computation of all weight data W_ddr and input data I_ddr is complete.
In summary, the invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method, which first stores the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR), then invokes direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively; in the SM space, the weight data loaded into the on-chip SM space is vectorized, and in the AM space, the conv1×1 convolution operation is performed between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data; wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for that architecture and improves the achieved FLOPS while preserving accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general architecture diagram of a vector processor;
FIG. 2 is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention;
FIG. 3 is a scalar load diagram of Wb_sm(0,m1,k) disclosed in the present invention;
FIG. 4 is a schematic diagram of the low 16-bit extension of a scalar register disclosed in the present invention;
FIG. 5 is a broadcast implementation of a scalar register disclosed in the present invention;
FIG. 6 is a vector load diagram of Ib_am(0,0,n1) disclosed in the present invention;
FIG. 7 is a vector multiply-add diagram of Wb_sm(i,0,k) with the k-th input row disclosed in the present invention;
FIG. 8 is a schematic diagram of the vector multiply-add of weight column k with input row k disclosed in the present invention;
FIG. 9 is a schematic diagram of the high 16-bit extension of a scalar register disclosed in the present invention;
FIG. 10 is a broadcast implementation of a scalar register disclosed in the present invention;
FIG. 11 is a vector load diagram of Ib_am(0,1,n1) disclosed in the present invention;
FIG. 12 is a vector multiply-add diagram of Wb_sm(i,0,k+1) with input row k+1 disclosed in the present invention;
FIG. 13 is a schematic diagram of the vector multiply-add of weight column k+1 with input row k+1;
FIG. 14 is a schematic diagram of the vector multiply-add of the last weight column with the last input row;
FIG. 15 is a schematic structural diagram of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 2, which is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention, the method may include the following steps:
S201, storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory;
When vectorized convolution of half-precision data is required on a vector processor, the half-precision weight data and the half-precision input data are first stored in a DDR (double data rate synchronous dynamic random access memory). The half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can also be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
S202, calling direct memory access operation, and respectively loading semi-precision weight data and semi-precision input data from a double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
specifically, a direct memory access operation is invoked to assign a semi-precision weight matrix W ddr Loading into on-chip SM space, dividing original data into x from M dimension (output channel dimension) 1 A Wb sm Matrix, becomes W sm =x 1 ×Wb sm ,Wb sm =m×K,
Figure BDA0003447171650000121
Where the size of m is determined by the spatial size of the SM and the size of the AM space in combination. E.g. the weight data block Wb associated with m sm The size cannot be larger than the SM space; the sum of the output result of the convolution of the weight block and the input block and the size of the input data block needs to be smaller than the AM space.
A direct memory access operation is then invoked to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension (image dimension) into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n. That is, N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉.
Here p denotes the number of vector functional arithmetic units in the vector processor architecture, and L denotes the number of vector processing elements.
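Putting the two partitioning rules together, the block counts x1 and x2 are just ceiling divisions; the parameter values below (p, L, m, M, K, N) are assumed for illustration:

```python
import math

# Illustrative parameters (p, L, m are architecture/tuning values; the ones
# below are assumptions, not values fixed by the patent).
p, L = 2, 8                 # vector functional units and processing elements
m = 6                       # rows of one weight block, bounded by SM/AM sizes
M, K, N = 32, 4, 512        # weight is M x K, input is K x N

n = p * L * 4               # columns of one input block Ib_am
x1 = math.ceil(M / m)       # number of weight blocks Wb_sm along M
x2 = math.ceil(N / n)       # number of input blocks Ib_am along N
print(x1, n, x2)            # → 6 64 8
```

The ceiling handles the ragged last block when m or n does not divide the dimension evenly.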
S203, in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
Specifically, the method can comprise the following steps:
Step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e., a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initializing the vector register to 0 so that the vector register can accumulate and store the calculation result;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so two half-precision values are loaded at a time into bits R[0:15] and R[16:31] of the designated scalar register. The k-th column data Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space is loaded in sequence into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1), while the (k+1)-th column data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) of Wb_sm(i) is loaded in sequence into bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1);
for example, taking the first weight sub-block Wb_sm(0) = 6 × 4 (m = 6, K = 4) and k = 0, a scalar load instruction loads the column-1 data of Wb_sm(0) into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1) and, at the same time, the column-2 data of Wb_sm(0) into bits R[16:31] of R30, R31, ..., R(30+m-1), as shown in fig. 3.
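Step 5's packing of two adjacent fp16 weights into one 4-byte scalar load can be modeled with Python's struct module; the little-endian byte order and the sample values are assumptions for illustration:

```python
import struct

# A 4-byte scalar load naturally fetches two adjacent fp16 values: column k
# lands in bits R[0:15] and column k+1 in bits R[16:31] (little-endian layout
# assumed here).
w_col_k, w_col_k1 = 1.5, -2.25                # two consecutive fp16 weights
raw = struct.pack('<ee', w_col_k, w_col_k1)   # 4 bytes, as stored in memory
(reg,) = struct.unpack('<I', raw)             # one 32-bit scalar load

low16  = reg & 0xFFFF            # R[0:15]  -> weight of column k
high16 = (reg >> 16) & 0xFFFF    # R[16:31] -> weight of column k+1

# Decoding each half back confirms the packing.
assert struct.unpack('<e', struct.pack('<H', low16))[0]  == 1.5
assert struct.unpack('<e', struct.pack('<H', high16))[0] == -2.25
```

This is the reason a single scalar load can feed both the column-k pass (steps 6-10) and the column-(k+1) pass (steps 12-16).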
Step 6, based on the half-precision weight data stored in scalar registers R30, R31, ..., R(30+m-1), a low-bit expansion operation is performed on R30, R31, ..., R(30+m-1): the low 16-bit data R[0:15] of the low 32 bits of each register is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
for example, taking d = 64, the implementation of the expansion instruction for the low 16 bits of the low 32 bits in step 6 is shown in fig. 4.
Step 7, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), broadcast operations are performed in sequence on R40, R41, ..., R(40+m-1) and the data is stored in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data; vectorization of the k-th column data of Wb_sm(i) is thus completed;
for example, taking L = 8, scalar register R40 is broadcast to vector register VR50, as shown in fig. 5.
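Steps 6 and 7 (low-bit copy expansion followed by broadcast) can be sketched as bit manipulation; the register names follow the patent, while the Python model and the sample bit pattern are illustrative assumptions:

```python
# Steps 6-7 modeled in software: replicate the low 16 bits of a scalar
# register across a d = 64-bit register, then broadcast that register to all
# L lanes of a vector register.
d, L = 64, 8

def expand_low16(r30: int) -> int:
    """Copy bits [0:15] of r30 into every 16-bit slot of a d-bit register."""
    h = r30 & 0xFFFF
    return sum(h << (16 * s) for s in range(d // 16))

def broadcast(r40: int, lanes: int = L) -> list:
    """Store the same d-bit value in all L vector processing elements."""
    return [r40] * lanes

r30 = 0xC0803E00            # fp16 pair packed by the scalar load of step 5
r40 = expand_low16(r30)     # low half 0x3E00 replicated four times
vr50 = broadcast(r40)
assert r40 == 0x3E003E003E003E00
assert len(vr50) == L and all(v == r40 for v in vr50)
```

The high-bit variant of step 12 is the same operation applied to `(r30 >> 16) & 0xFFFF`.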
Step 8, the k-th row data Ib_am(j,k,0), ..., Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space is loaded into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
for example, taking the first input sub-block Ib_am(0) = 4 × 64 (K = 4, n = 64) and k = 0, a vector load instruction loads the k-th row of Ib_am(0) into the p vector registers VR0, VR1, ..., VR(p-1); taking L = 8 and p = 2, a specific implementation of the vector load is shown in fig. 6.
Step 9, multiply-add operations are performed between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j). Because the architecture integrates p functional vector arithmetic units, the multiply-add operations can be issued in the same cycle, with the L vector processing elements operating in parallel, and the calculation results are stored in vector registers VR10, VR11, ..., VR(10+p-1);
for example, taking L = 8 and p = 2, VR50 is multiplied-and-added with VR0 and VR1 respectively, and the results are stored in VR10 and VR11; since the initial values of VR10 and VR11 are 0, the result of the multiply-add is the multiplication itself, as shown in fig. 7.
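Numerically, step 9 (and the accumulation that later iterations perform on the same registers) is a fused multiply-add of one broadcast weight against an n-element input strip; a minimal sketch with assumed sizes and values:

```python
# Step 9 modeled numerically: one broadcast weight value multiplies a strip of
# n = p * L * 4 input values and accumulates into the result registers, which
# start at zero (sizes and values are illustrative).
p, L = 2, 8
lanes_per_reg = L * 4                 # fp16 elements held by one vector register
n = p * lanes_per_reg

w = 0.5                                       # vectorized Wb_sm(i,0,k) in VR50
row_k = [float(q) for q in range(n)]          # Ib_am(j) row k in VR0..VR(p-1)
acc = [0.0] * n                               # VR10..VR(10+p-1), initialized to 0

# Fused multiply-add: acc += w * row; with acc = 0 this is the plain product.
acc = [a + w * x for a, x in zip(acc, row_k)]
assert acc[:4] == [0.0, 0.5, 1.0, 1.5]

# The next column k+1 accumulates on top of the same registers (steps 12-16).
row_k1 = [1.0] * n
acc = [a + 0.25 * x for a, x in zip(acc, row_k1)]
assert acc[:4] == [0.25, 0.75, 1.25, 1.75]
```

Keeping the accumulators resident in VR10 onward across all K columns is what avoids intermediate stores to AM.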
Step 10, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VR(p-1) store the k-th row data of the input sub-block Ib_am(j). Step 9 is repeated to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1); in this process the L vector processing elements operate in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed; the specific implementation is shown in fig. 8;
Step 11, judging whether k+1 is smaller than K; if so, continuing to execute step 12; if not, skipping to execute step 19;
Step 12, based on the Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) data stored in bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1), a high-bit expansion operation is performed on R30, R31, ..., R(30+m-1): the high 16-bit data R[16:31] of the low 32 bits of each register is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
for example, taking d = 64, the implementation of the expansion instruction for the high 16 bits of the low 32 bits in step 12 is shown in fig. 9.
Step 13, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), broadcast operations are performed in sequence on R40, R41, ..., R(40+m-1) and the broadcast data is stored in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data; vectorization of the (k+1)-th column data of Wb_sm(i) is thus completed;
for example, when k = 0, vectorization of the (k+1)-th column data of Wb_sm(i) proceeds in the same way; the specific broadcast implementation is shown in fig. 10.
Step 14, the (k+1)-th row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space is loaded into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
for example, taking the first input sub-block Ib_am(0) = 4 × 64 (K = 4, n = 64) and k+1 = 1, a vector load instruction loads the (k+1)-th row of Ib_am(0) into the p vector registers VR0, VR1, ..., VR(p-1); taking L = 8 and p = 2, a specific implementation of the vector load is shown in fig. 11.
Step 15, multiply-add operations are performed between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j). Because the architecture integrates p functional vector arithmetic units, the multiply-add operations can be issued in the same cycle, with the L vector processing elements operating in parallel, and the calculation results are stored in vector registers VR10, VR11, ..., VR(10+p-1);
for example, taking L = 8 and p = 2, when k+1 = 1, VR50 is multiplied-and-added with VR0 and VR1 respectively, accumulating onto the row-k multiply-add data already held in VR10 and VR11, and the results continue to be saved in VR10 and VR11, as shown in fig. 12.
Step 16, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VR(p-1) store the (k+1)-th row data of the input sub-block Ib_am(j). Step 15 is repeated to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1); in this process the L vector processing elements operate in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed; the specific implementation is shown in fig. 13;
Step 17, letting k = k + 2;
Step 18, judging whether k is smaller than K; if so, returning to step 5; if not, executing step 19;
Step 19, the conv1 × 1 calculation of the weight sub-block matrix Wb_sm(i) with the input sub-block matrix Ib_am(j) is completed when Wb_sm(i) has been traversed to its last column and Ib_am(j) to its last row; the specific operation is shown in fig. 14, and the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) are temporarily stored to the AM space location AM_temp;
Step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the double-rate synchronous dynamic random access memory;
Step 21, letting j = j + 1;
Step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space, returning to step 3 after loading, and repeating the operations of scalar data loading, copy expansion, broadcasting, vector data loading, and vector multiply-add; if not, executing step 23;
Step 23, letting i = i + 1;
Step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space, returning to step 2 after loading, and repeating the operations of scalar data loading, copy expansion, broadcasting, vector data loading, and vector multiply-add; if not, the conv1 × 1 calculation of all the weight data W_ddr with the input data I_ddr is completed.
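Collapsed to its arithmetic skeleton, the steps 1-24 loop nest is a blocked matrix product over (i, j, k, m1, n1); the behavioral sketch below (plain Python with illustrative sizes, not the VLIW register code) checks it against a direct matrix multiplication:

```python
import math

def conv1x1_blocked(W, I, m, n):
    """Block the M dimension by m and the N dimension by n, accumulating
    over K, mirroring the i/j/k loop structure of steps 1-24."""
    M, K, N = len(W), len(W[0]), len(I[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(math.ceil(M / m)):            # weight blocks Wb_sm(i)
        for j in range(math.ceil(N / n)):        # input blocks Ib_am(j)
            for k in range(K):                   # columns of Wb x rows of Ib
                for m1 in range(i * m, min((i + 1) * m, M)):
                    w = W[m1][k]                 # scalar load + broadcast
                    for n1 in range(j * n, min((j + 1) * n, N)):
                        out[m1][n1] += w * I[k][n1]   # vector multiply-add
    return out

M, K, N = 5, 3, 7
W = [[(m1 + k + 1) * 0.5 for k in range(K)] for m1 in range(M)]
I = [[(k - n1) * 0.25 for n1 in range(N)] for k in range(K)]
ref = [[sum(W[m1][k] * I[k][n1] for k in range(K)) for n1 in range(N)]
       for m1 in range(M)]
assert conv1x1_blocked(W, I, m=2, n=4) == ref
```

Because every element is accumulated in the same k order regardless of the block sizes, the blocked result matches the plain product exactly for any valid m and n.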
In summary, the vector processor-oriented half-precision vectorized conv1 × 1 convolution method disclosed by the invention vectorizes the conv1 × 1 convolution calculation for the vector processor architecture by exploiting its architectural features, and improves FLOPs on the premise of guaranteeing precision.
Fig. 15 is a schematic structural diagram of an embodiment of the vector processor-oriented half-precision vectorized conv1 × 1 convolution system disclosed in the present invention; the system may include:
the storage module 1501 is configured to store the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
a loading module 1502, configured to invoke a direct memory access operation, and load the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space, respectively;
the processing module 1503 is configured to perform vectorization processing on the weight data loaded to the on-chip SM space in the SM space, and perform the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
The working principle of the vector processor-oriented half-precision vectorized conv1 × 1 convolution system disclosed by the invention is the same as that of the vector processor-oriented half-precision vectorized conv1 × 1 convolution method described above, and is not repeated here.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A vector processor-oriented half-precision vectorized conv1 × 1 convolution method, comprising:
storing the half-precision weight data and the half-precision input data in a double-rate synchronous dynamic random access memory;
calling a direct memory access operation, and respectively loading the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the convolution kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi × Wi × n, so the input data can be represented as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
2. The method of claim 1, wherein the invoking of the direct memory access operation loads the half-precision weight data and the half-precision input data from the double rate synchronous dynamic random access memory into an on-chip Scalar Memory (SM) space and an on-chip Array Memory (AM) space, respectively, comprising:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data from the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M / m⌉, wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉, p denoting the number of vector functional arithmetic units in the vector processor architecture and L denoting the number of vector processing elements.
3. The method according to claim 2, wherein vectorizing, in the SM space, the weight data loaded into the on-chip SM space and performing, in the AM space, the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data comprises the following steps:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e., a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
step 4, initializing the vector registers to 0 so that they can accumulate and store the calculation results;
step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so two half-precision values are loaded at a time into bits R[0:15] and R[16:31] of the designated scalar register; the k-th column data Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space is loaded in sequence into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1), while the (k+1)-th column data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) of Wb_sm(i) is loaded in sequence into bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1);
step 6, based on the half-precision weight data stored in scalar registers R30, R31, ..., R(30+m-1), performing a low-bit expansion operation on R30, R31, ..., R(30+m-1): the low 16-bit data R[0:15] is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
step 7, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), performing broadcast operations in sequence on R40, R41, ..., R(40+m-1) and storing the data in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data, completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0), ..., Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
step 9, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);
step 10, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VR(p-1) store the k-th row data of the input sub-block Ib_am(j); repeating step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if so, continuing to execute step 12; if not, skipping to execute step 19;
step 12, based on the Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) data stored in bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1), performing a high-bit expansion operation on R30, R31, ..., R(30+m-1): the high 16-bit data R[16:31] is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
step 13, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), performing broadcast operations in sequence on R40, R41, ..., R(40+m-1) and storing the broadcast data in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data, completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
step 15, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);
step 16, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VR(p-1) store the (k+1)-th row data of the input sub-block Ib_am(j); repeating step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5; if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading; if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading; if not, the conv1 × 1 calculation of all the weight data W_ddr with the input data I_ddr is completed.
4. A vector processor-oriented half-precision vectorized conv1 × 1 convolution system, comprising:
the storage module is used for storing the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
the loading module is used for calling a direct memory access operation and respectively loading the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
the processing module is used for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the convolution kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi × Wi × n, so the input data can be represented as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
5. The system of claim 4, wherein the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data from the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M / m⌉, wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉, p denoting the number of vector functional arithmetic units in the vector processor architecture and L denoting the number of vector processing elements.
6. The system of claim 5, wherein the processing module is specifically configured to perform the steps of:
step 1, initializing i to 0, wherein i represents a weight subblock matrix Wb sm(i) A block index in the M dimension;
step 2, initializing j to 0, wherein j represents an input sub-block matrix Ib am(j) A block index in the N dimension;
step 3, initializing k to 0, wherein k represents the weight subblock Wb sm And the input sub-block Ib am M1, and n1, i.e., the weight subblocks are denoted as Wb sm(i,m1,k) Input sub-block denoted Ib am(j,k,n1)
Step 4, initializing the vector register to 0 so as to accumulate the vector register and store the calculation result;
and 5, the minimum granularity of the scalar loading instruction is 4 bytes, the semi-precision data is 2 bytes, and two pieces of semi-precision data are loaded to the R [0:15]And R [16:31]The weight sub-block Wb in the SM space sm(i) K-th column data Wb of sm(i,0,k) ......Wb sm(i,m-1,k) Loaded into scalar registers R in sequence 30 、R 31 ...R 30+m-1 R [0:15]Middle and weight sub-block Wb sm(i) Column k +1 data Wb sm(i,0,k+1) ......Wb sm(i,m-1,k+1) Loaded into scalar registers R in sequence 30 、R 31 ...R 30+m-1 R [16:31]Performing the following steps;
step 6, based on scalar register R 30 、R 31 ...R 30+m-1 Stored semi-precision weight data for scalar register R 30 、R 31 ...R 30+m-1 And performing low-order expansion operation, namely performing low-order expansion operation on the low-order 16-order data R [0:15]Replication extension to d-bit data storage in scalar register R 40 、R 41 ...R 40+m-1 Wherein d is the bit length of a scalar register;
step 7, based on scalar register R 40 、R 41 ...R 40+m-1 Stored replicated extended data, for scalar registers R 40 、R 41 ...R 40+m-1 Broadcast operations are performed in sequence and data is stored in vector register VR 50 、VR 51 ...VR 50+m-1 In which L vector processing elements store the same data, Wb sm(i) Completing the k-th column data vectorization;
step 8, inputting the sub-block matrix Ib in the AM space am(j) Of kth line data Ib am(j,k,0) ......Ib am(j,k,n-1) Loading into p vector registers VR 0 、VR 1 ...VR p-1 Where p represents the number of functional vector arithmetic unit elements in an architecture for very long data instruction words, with a minimum granularity of one load
Figure FDA0003447171640000071
A byte, so that it can be loaded at minimum at a time
Figure FDA0003447171640000072
Half precision data;
step 9, mixing Wb sm(i,0,k) Vectorized data VR 50 Respectively react with Ib am(j) VR of the kth line 0 、VR 1 ...VR p-1 Performing multiply-add operation, simultaneously operating L vector processing units in parallel, and storing the calculation result in a vector register VR 10 、VR 11 ...VR 10+p-1 Performing the following steps;
step 10, register VR based on vector 51 ...VR 50+m-1 Stored is the weight sub-block Wb sm(i,1,k) ......Wb sm(i,m-1,k) Vectorized data, vector register VR 0 、VR 1 ...VR p-1 Stored in is an input sub-block Ib am(j) Repeating the step 9 to respectively combine each group of quantized data of the weight with the Ib am(j) And adds the multiplication result to the vector register VR 10+p 、VR 10+p+1 ...VR 10+m×p-1 In the process, L vector processing elements operate in parallel simultaneously, traversing Wb sm(i) Up to Wb sm(i) K column of (1) and Ib am(j) The multiplication and addition calculation of the k rows is completed;
step 11, judging whether K +1 is smaller than K, if so, skipping to execute step 19, and if not, continuing to execute step 12;
step 12, based on scalar register R 30 、R 31 ...R 30+m-1 R [16:31]Wb stored in sm(i,1,k+1) ......Wb sm(i,m-1,k+1) Data, to scalar register R 30 、R 31 ...R 30+m-1 And performing high bit expansion operation, and enabling 16 high bits data R [16:31]Replication extension to d-bit data storage in scalar register R 40 、R 41 ...R 40+m-1 In (d) is the bit length of a scalar register;
step 13, based on scalar register R 40 、R 41 ...R 40+m-1 Stored replicated extended data, for scalar registers R 40 、R 41 ...R 40+m-1 Broadcast operation is carried out in sequence, and the broadcasted data is stored in a vector register VR 50 、VR 51 ...VR 50+m-1 In which L vector processing elements store the same data, Wb sm(i) Completing the vectorization of the k +1 th column of data;
step 14, loading the (k+1)-th-row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p represents the number of vector arithmetic functional units in the very long instruction word architecture; the minimum load granularity is [formula FDA0003447171640000081] bytes, so that at least [formula FDA0003447171640000082] half-precision data can be loaded in one operation;
step 15, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th-row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);

step 16, based on the vectorized data of the weight sub-blocks Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1) stored in vector registers VR51, ..., VR(50+m-1) and the (k+1)-th-row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1, ..., VR(p-1), repeating step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), with the L vector processing elements operating in parallel, until the multiply-add computation of column k+1 of Wb_sm(i) with row k+1 of Ib_am(j) is complete;
step 17, setting k = k + 2;

step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
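Steps 9 through 18 traverse the K columns with the loop unrolled by a factor of two, the step-11 test guarding the case where no column k+1 remains. A control-flow sketch of this loop structure (the function and parameter names are illustrative, not from the patent):

```python
def conv_columns(K, process_column, store_results):
    """Sketch of the steps 5-19 control flow: the column loop is unrolled
    by two, with an early exit when only an odd tail column is left."""
    k = 0
    while True:
        process_column(k)             # steps 5-10: column k
        if not (k + 1 < K):           # step 11: no column k+1 remains
            break
        process_column(k + 1)         # steps 12-16: column k+1
        k += 2                        # step 17
        if not (k < K):               # step 18
            break
    store_results()                   # step 19

seen = []
conv_columns(5, seen.append, lambda: seen.append('store'))
# every column 0..4 is processed exactly once, then results are stored
assert seen == [0, 1, 2, 3, 4, 'store']
```

Unrolling by two lets the vectorization of column k+1 (steps 12-14) be scheduled alongside the multiply-adds of column k, which is a common software-pipelining motivation on VLIW machines.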
step 19, temporarily storing the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) to the AM space location AM_temp;

step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double data rate synchronous dynamic random access memory;

step 21, setting j = j + 1;

step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading completes, and if not, executing step 23;

step 23, setting i = i + 1;

step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading completes, and if not, the conv1×1 computation of all the weight data W_ddr and the input data I_ddr is complete.
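Taken together, steps 1 through 24 compute conv1×1 as a blocked matrix product: weight sub-blocks are swept against input sub-blocks while partial results accumulate over the channel dimension. A hedged numpy sketch of this overall tiling, with all shapes and block counts chosen purely for illustration:

```python
import numpy as np

# Illustrative shapes (hypothetical): a conv1x1 over C input channels and
# F filters is the matrix product W (F x C) @ I (C x HW).
F, C, HW = 8, 6, 10
x1, x2 = 2, 2                            # number of weight / input sub-blocks
rng = np.random.default_rng(1)
W = rng.standard_normal((F, C)).astype(np.float16)   # plays the role of W_ddr
I = rng.standard_normal((C, HW)).astype(np.float16)  # plays the role of I_ddr
out = np.zeros((F, HW), dtype=np.float32)

fb, hb = F // x1, HW // x2
for i in range(x1):                      # step 24 loop: weight sub-blocks
    Wb = W[i * fb:(i + 1) * fb]          # DMA Wb_sm(i) into SM space
    for j in range(x2):                  # step 22 loop: input sub-blocks
        Ib = I[:, j * hb:(j + 1) * hb]   # DMA Ib_am(j) into AM space
        # steps 5-19: accumulate over the K (= C) dimension column by column
        acc = np.zeros((fb, hb), dtype=np.float32)
        for k in range(C):
            acc += np.outer(Wb[:, k].astype(np.float32),
                            Ib[k].astype(np.float32))
        # step 20: DMA the finished sub-block back to DDR
        out[i * fb:(i + 1) * fb, j * hb:(j + 1) * hb] = acc

assert np.allclose(out, W.astype(np.float32) @ I.astype(np.float32), atol=1e-2)
```

Each rank-1 update in the inner loop corresponds to one broadcast weight column multiplied against one input row, which is exactly the register-level pattern of steps 9-16.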
CN202111681136.XA 2021-12-30 2021-12-30 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system Active CN114330669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681136.XA CN114330669B (en) 2021-12-30 2021-12-30 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system


Publications (2)

Publication Number Publication Date
CN114330669A CN114330669A (en) 2022-04-12
CN114330669B true CN114330669B (en) 2022-09-16

Family

ID=81023239


Country Status (1)

Country Link
CN (1) CN114330669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114575B (en) * 2022-08-30 2023-01-31 中国人民解放军国防科技大学 Vector processor-oriented image-to-matrix row conversion method, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Vectorization implementation method for Valid convolution of convolutional neural network
CN113626769B (en) * 2021-10-12 2022-01-21 中国人民解放军国防科技大学 Vector processor-oriented low-bit-width data matrix vectorization transposition method and system


Similar Documents

Publication Publication Date Title
US20240012644A1 (en) Efficient direct convolution using simd instructions
US6901422B1 (en) Matrix multiplication in a vector processing system
KR100329339B1 (en) An apparatus for performing multiply-add operations on packed data
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
US8935468B2 (en) Audio digital signal processor
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
JPH11511577A (en) Device for performing multiply-add operation of packed data
CN113626769B (en) Vector processor-oriented low-bit-width data matrix vectorization transposition method and system
CN114330669B (en) Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN115039121A (en) Hybrid convolution operation
CN114090954A (en) Integer matrix multiplication kernel optimization method based on FT-2000+
CN114281755B (en) Vector processor-oriented semi-precision vectorization convolution method and system
US6209012B1 (en) System and method using mode bits to support multiple coding standards
CN116842304A (en) Method and system for calculating irregular sparse matrix
CN112668709B (en) Computing device and method for data reuse
CN114329326A (en) Low-bit-width data matrix vectorization column expansion method and system of vector processor
CN112434255A (en) Vector-matrix operation and data processing method, multiplier and processor chip
CN114138692B (en) Low-bit-width data matrix vectorization column clipping method and system of vector processor
US11960856B1 (en) Multiplier-accumulator processing pipeline using filter weights having gaussian floating point data format
CN115114575B (en) Vector processor-oriented image-to-matrix row conversion method, device and medium
Damaj et al. Performance analysis of extended vector-scalar operations using reconfigurable computing
CN114139108A (en) Matrix LU decomposition vectorization calculation method of vector DSP core
CN116910432A (en) Sparse matrix vector multiplication method and device based on heterogeneous many-core processor
CN116055003A (en) Data optimal transmission method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant