CN114330669B - Vector processor-oriented half-precision vectorized conv1×1 convolution method and system - Google Patents


Publication number
CN114330669B
CN114330669B (granted from application CN202111681136.XA)
Authority
CN
China
Prior art keywords: data, vector, weight, space, precision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111681136.XA
Other languages
Chinese (zh)
Other versions
CN114330669A (en)
Inventor
许金伟
李娅琳
姜晶菲
苏华友
乔鹏
王庆林
李荣春
高蕾
窦勇
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111681136.XA
Publication of CN114330669A
Application granted
Publication of CN114330669B
Legal status: Active

Abstract

The invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method and system, wherein the method comprises the following steps: storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR); invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively; in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for that architecture and improves the achieved FLOPS while preserving accuracy.

Description

Vector processor-oriented half-precision vectorized conv1×1 convolution method and system
Technical Field
The invention relates to the technical field of vector processors, and in particular to a vector processor-oriented half-precision vectorized conv1×1 convolution method and system.
Background
The vector processor is a novel architecture which, as shown in fig. 1, includes a scalar processing unit (SPU) for scalar operations, a vector processing unit (VPU) for vector operations, and a direct memory access (DMA) component for data transfer, among other parts. The SPU is made up of a scalar processing element (SPE) and a scalar memory (SM). The VPU is composed of L vector processing elements (VPEs) and an array memory (AM); the L VPEs cooperate in single instruction, multiple data (SIMD) mode, and each VPE integrates 3 vector arithmetic units that support both fixed-point and floating-point vector operations.
A single VPE can process one 8-byte datum (FP64, Int64), two 4-byte data (FP32, Int32), or four 2-byte data (FP16) at a time. The DMA component is responsible for data transfer between SM and DDR (double data rate synchronous dynamic random access memory) and between AM and DDR, and its minimum operation granularity is likewise 8 bytes.
Convolution is one of the core computations of neural networks, and conv1×1 is the most common kernel size in convolution operations, so convolution efficiency has a great influence on neural network performance, and optimizing the convolution computation is very important.
Disclosure of Invention
In view of this, the present invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method which exploits the architectural features of the vector processor to vectorize the conv1×1 convolution for that architecture, thereby improving the achieved FLOPS while preserving accuracy.
The invention provides a vector processor-oriented half-precision vectorized conv1×1 convolution method, comprising the following steps:
storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR);
invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
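To make the matrix view concrete, the following NumPy sketch (illustrative shapes chosen here, not taken from the patent) checks that a conv1×1 over [Cin, Hi, Wi, n] data equals the M × K by K × N matrix product described above:

```python
import numpy as np

# Illustrative sizes: Co output channels, Cin input channels,
# an Hi x Wi image, n images per batch.
Co, Cin, Hi, Wi, n = 8, 16, 4, 4, 2

weight = np.random.rand(Co, Cin).astype(np.float16)      # [Co, Cin] since ks = 1
inp = np.random.rand(Cin, Hi, Wi, n).astype(np.float16)  # [Cin, Hi, Wi, n]

# Direct conv1x1: every output pixel is a dot product over Cin.
direct = np.einsum('oc,chwn->ohwn', weight.astype(np.float32),
                   inp.astype(np.float32))

# Same computation as a matrix product with M = Co, K = Cin, N = Hi * Wi * n.
M, K, N = Co, Cin, Hi * Wi * n
gemm = weight.reshape(M, K).astype(np.float32) @ inp.reshape(K, N).astype(np.float32)

assert np.allclose(direct.reshape(M, N), gemm)
```

The reshape is free because [Hi, Wi, n] is already contiguous, which is exactly why the method can treat the image dimensions as a single N.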
Preferably, invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively, includes:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, with Wb_sm = m × K and x1 = ⌈M / m⌉, where the size of m is jointly determined by the sizes of the SM space and the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, with Ib_am = K × n, i.e. N = x2 × n, where n = P × L × 4 and x2 = ⌈N / n⌉; P denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
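The block counts x1 and x2 are ceiling divisions of the matrix dimensions by the block sizes; a minimal sketch (the P and L values below are assumptions for illustration, not the patent's hardware parameters):

```python
import math

def partition(M, N, m, P, L):
    """Split the M x K weight matrix into x1 row blocks of height m, and the
    K x N input matrix into x2 column blocks of width n = P * L * 4,
    mirroring W_sm = x1 * Wb_sm and I_am = x2 * Ib_am."""
    n = P * L * 4              # P registers x L lanes x 4 halves per lane
    x1 = math.ceil(M / m)      # x1 = ceil(M / m)
    x2 = math.ceil(N / n)      # x2 = ceil(N / n)
    return x1, x2, n

# Assumed example: M = 96 output channels, N = 4096 pixels,
# m = 6, P = 4 vector functional units, L = 16 processing elements.
x1, x2, n = partition(96, 4096, 6, 4, 16)   # -> x1 = 16, x2 = 16, n = 256
```

When M or N is not an exact multiple, the final tail block is simply smaller; the ceiling keeps every element covered.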
Preferably, vectorizing the weight data loaded into the on-chip SM space in the SM space, and performing the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data, includes the following steps:
Step 1, initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initialize k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 the column index of the input sub-block, i.e. a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initialize the vector registers to 0 so that they can accumulate and store the calculation results;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so one load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register; load the k-th column data Wb_sm(i,0,k) … Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space into bits R[0:15] of scalar registers R30, R31 … R30+m-1 in sequence, with the (k+1)-th column data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) of Wb_sm(i) landing in bits R[16:31] of R30, R31 … R30+m-1 in sequence;
Step 6, based on the half-precision weight data held in scalar registers R30, R31 … R30+m-1, perform a low-half extension on R30, R31 … R30+m-1, replicating and extending the low 16-bit data R[0:15] of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 7, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is then complete;
Step 8, load the k-th row data Ib_am(j,k,0) … Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 9, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 10, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k) … Wb_sm(i,m-1,k), and vector registers VR0, VR1 … VRp-1 hold the k-th row data of Ib_am(j); repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11, judge whether k + 1 is smaller than K; if not, jump to step 19; if so, continue with step 12;
Step 12, based on the data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) held in bits R[16:31] of scalar registers R30, R31 … R30+m-1, perform a high-half extension on R30, R31 … R30+m-1, replicating and extending the high 16-bit data R[16:31] of the low 32 bits of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 13, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the broadcast data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is then complete;
Step 14, load the (k+1)-th row data Ib_am(j,k+1,0) … Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 15, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 16, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k+1) … Wb_sm(i,m-1,k+1), and vector registers VR0, VR1 … VRp-1 hold the (k+1)-th row data of Ib_am(j); repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17, let k = k + 2;
Step 18, judge whether k is smaller than K; if so, return to step 5; if not, execute step 19;
Step 19, temporarily store the data results held in vector registers VR10, VR11 … VR10+m×p-1 to the AM space location AM_temp;
Step 20, invoke a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the DDR memory;
Step 21, let j = j + 1;
Step 22, judge whether j is smaller than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and return to step 3 once loading completes; if not, execute step 23;
Step 23, let i = i + 1;
Step 24, judge whether i is smaller than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and return to step 2 once loading completes; if not, the conv1×1 computation of all weight data W_ddr and input data I_ddr is complete.
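Steps 1-24 above amount to a blocked matrix multiplication whose K loop is unrolled by two; the following NumPy sketch (hypothetical sizes, plain arrays standing in for the scalar and vector registers) emulates that control flow and checks it against a reference product:

```python
import numpy as np

def conv1x1_blocked(W, I, m, n):
    """Emulate the schedule of steps 1-24: loop over weight row blocks (i)
    and input column blocks (j); inside each block, walk K two columns/rows
    at a time (steps 5-18) and accumulate rank-1 updates, standing in for
    the broadcast + multiply-add over L lanes. Assumes m | M, n | N, 2 | K."""
    M, K = W.shape
    _, N = I.shape
    out = np.zeros((M, N), dtype=np.float32)
    for i in range(M // m):                          # steps 1, 23-24
        Wb = W[i * m:(i + 1) * m, :].astype(np.float32)
        for j in range(N // n):                      # steps 2, 21-22
            Ib = I[:, j * n:(j + 1) * n].astype(np.float32)
            acc = np.zeros((m, n), dtype=np.float32) # step 4: zeroed accumulators
            for k in range(0, K, 2):                 # steps 5-18: unrolled by 2
                acc += np.outer(Wb[:, k], Ib[k, :])          # column k   x row k
                acc += np.outer(Wb[:, k + 1], Ib[k + 1, :])  # column k+1 x row k+1
            out[i * m:(i + 1) * m, j * n:(j + 1) * n] = acc  # steps 19-20
    return out

W = np.random.rand(12, 8).astype(np.float16)   # M = 12, K = 8
I = np.random.rand(8, 32).astype(np.float16)   # K = 8,  N = 32
ref = W.astype(np.float32) @ I.astype(np.float32)
assert np.allclose(conv1x1_blocked(W, I, m=6, n=16), ref, atol=1e-3)
```

The unroll by two matches the register-level trick of steps 5-16: one 4-byte scalar load fetches weight columns k and k+1 together, so both are consumed before the next load.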
A vector processor-oriented half-precision vectorized conv1×1 convolution system, comprising:
a storage module for storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR);
a loading module for invoking direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively;
a processing module for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1×1 convolution operation between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
Preferably, the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data along the M dimension into x1 Wb_sm matrices, so that W_sm = x1 × Wb_sm, with Wb_sm = m × K and x1 = ⌈M / m⌉, where the size of m is jointly determined by the sizes of the SM space and the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data along the N dimension into x2 Ib_am matrices, so that I_am = x2 × Ib_am, with Ib_am = K × n, i.e. N = x2 × n, where n = P × L × 4 and x2 = ⌈N / n⌉; P denotes the number of vector functional arithmetic units in the architecture of the vector processor, and L denotes the number of vector processing elements.
Preferably, the processing module is specifically configured to perform the following steps:
Step 1, initialize i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initialize j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initialize k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am, m1 denotes the row index of the weight sub-block, and n1 the column index of the input sub-block, i.e. a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initialize the vector registers to 0 so that they can accumulate and store the calculation results;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so one load places two half-precision values into bits R[0:15] and R[16:31] of the designated scalar register; load the k-th column data Wb_sm(i,0,k) … Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space into bits R[0:15] of scalar registers R30, R31 … R30+m-1 in sequence, with the (k+1)-th column data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) of Wb_sm(i) landing in bits R[16:31] of R30, R31 … R30+m-1 in sequence;
Step 6, based on the half-precision weight data held in scalar registers R30, R31 … R30+m-1, perform a low-half extension on R30, R31 … R30+m-1, replicating and extending the low 16-bit data R[0:15] of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 7, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the k-th column of Wb_sm(i) is then complete;
Step 8, load the k-th row data Ib_am(j,k,0) … Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 9, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 10, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k) … Wb_sm(i,m-1,k), and vector registers VR0, VR1 … VRp-1 hold the k-th row data of Ib_am(j); repeat step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is complete;
Step 11, judge whether k + 1 is smaller than K; if not, jump to step 19; if so, continue with step 12;
Step 12, based on the data Wb_sm(i,0,k+1) … Wb_sm(i,m-1,k+1) held in bits R[16:31] of scalar registers R30, R31 … R30+m-1, perform a high-half extension on R30, R31 … R30+m-1, replicating and extending the high 16-bit data R[16:31] of the low 32 bits of each register to d bits and storing the results in scalar registers R40, R41 … R40+m-1, where d is the bit length of a scalar register;
Step 13, based on the replicated extended data held in R40, R41 … R40+m-1, perform broadcast operations on R40, R41 … R40+m-1 in sequence and store the broadcast data in vector registers VR50, VR51 … VR50+m-1, so that all L vector processing elements hold the same data; vectorization of the (k+1)-th column of Wb_sm(i) is then complete;
Step 14, load the (k+1)-th row data Ib_am(j,k+1,0) … Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1 … VRp-1, where p denotes the number of vector functional arithmetic units in the very long instruction word (VLIW) architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values are loaded at a time;
Step 15, perform multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1 … VRp-1 of Ib_am(j), with the L vector processing elements operating in parallel, and store the results in vector registers VR10, VR11 … VR10+p-1;
Step 16, vector registers VR51 … VR50+m-1 hold the vectorized weight data Wb_sm(i,1,k+1) … Wb_sm(i,m-1,k+1), and vector registers VR0, VR1 … VRp-1 hold the (k+1)-th row data of Ib_am(j); repeat step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR10+p, VR10+p+1 … VR10+m×p-1, with the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is complete;
Step 17, let k = k + 2;
Step 18, judge whether k is smaller than K; if so, return to step 5; if not, execute step 19;
Step 19, temporarily store the data results held in vector registers VR10, VR11 … VR10+m×p-1 to the AM space location AM_temp;
Step 20, invoke a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the DDR memory;
Step 21, let j = j + 1;
Step 22, judge whether j is smaller than x2; if so, invoke a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and return to step 3 once loading completes; if not, execute step 23;
Step 23, let i = i + 1;
Step 24, judge whether i is smaller than x1; if so, invoke a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and return to step 2 once loading completes; if not, the conv1×1 computation of all weight data W_ddr and input data I_ddr is complete.
In summary, the invention discloses a vector processor-oriented half-precision vectorized conv1×1 convolution method, which first stores the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory (DDR), then invokes direct memory access operations to load the half-precision weight data and the half-precision input data from the DDR into the on-chip scalar memory SM space and the on-chip array memory AM space, respectively; in the SM space, the weight data loaded into the on-chip SM space is vectorized, and in the AM space, the conv1×1 convolution operation is performed between the vectorized weight data and the input data in the AM space to obtain the convolved feature map data; wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension. By exploiting the architectural features of the vector processor, the invention vectorizes the conv1×1 convolution for that architecture and improves the achieved FLOPS while preserving accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a general architecture diagram of a vector processor;
FIG. 2 is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention;
FIG. 3 is a scalar load diagram of Wb_sm(0,m1,k) disclosed in the present invention;
FIG. 4 is a schematic diagram of the low 16-bit extension of a scalar register disclosed in the present invention;
FIG. 5 is a broadcast implementation of a scalar register disclosed in the present invention;
FIG. 6 is a vector load diagram of Ib_am(0,0,n1) disclosed in the present invention;
FIG. 7 is a vector multiply-add diagram of Wb_sm(i,0,k) with the k-th input row disclosed in the present invention;
FIG. 8 is a schematic diagram of the vector multiply-add of weight column k with input row k disclosed in the present invention;
FIG. 9 is a schematic diagram of the high 16-bit extension of a scalar register disclosed in the present invention;
FIG. 10 is a broadcast implementation of a scalar register disclosed in the present invention;
FIG. 11 is a vector load diagram of Ib_am(0,1,n1) disclosed in the present invention;
FIG. 12 is a vector multiply-add diagram of Wb_sm(i,0,k+1) with input row k+1 disclosed in the present invention;
FIG. 13 is a schematic diagram of the vector multiply-add of weight column k+1 with input row k+1;
FIG. 14 is a schematic diagram of the vector multiply-add of the last weight column with the last input row;
FIG. 15 is a schematic structural diagram of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution system disclosed in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 2, which is a flowchart of an embodiment of the vector processor-oriented half-precision vectorized conv1×1 convolution method disclosed in the present invention, the method may include the following steps:
S201, storing the half-precision weight data and the half-precision input data in a double data rate synchronous dynamic random access memory;
When vectorized convolution of half-precision data is required on a vector processor, the half-precision weight data and the half-precision input data are first stored in a DDR (double data rate synchronous dynamic random access memory). The half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the kernel size is 1 the format can also be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K. The half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], Hi and Wi being the height and width of the image and n the batch size of one convolution pass; [Hi, Wi, n] can be treated as a single dimension by letting N = Hi × Wi × n, so the input data can be expressed as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
S202, calling direct memory access operation, and respectively loading semi-precision weight data and semi-precision input data from a double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
specifically, a direct memory access operation is invoked to assign a semi-precision weight matrix W ddr Loading into on-chip SM space, dividing original data into x from M dimension (output channel dimension) 1 A Wb sm Matrix, becomes W sm =x 1 ×Wb sm ,Wb sm =m×K,
Figure BDA0003447171650000121
Where the size of m is determined by the spatial size of the SM and the size of the AM space in combination. E.g. the weight data block Wb associated with m sm The size cannot be larger than the SM space; the sum of the output result of the convolution of the weight block and the input block and the size of the input data block needs to be smaller than the AM space.
A direct memory access operation is then invoked to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension (image dimension) into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n. That is, N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉.
Here p denotes the number of vector functional arithmetic units in the vector processor architecture, and L denotes the number of vector processing elements.
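Putting the two partitioning rules together, the block counts x1 and x2 are just ceiling divisions; the parameter values below (p, L, m, M, K, N) are assumed for illustration:

```python
import math

# Illustrative parameters (p, L, m are architecture/tuning values; the ones
# below are assumptions, not values fixed by the patent).
p, L = 2, 8                 # vector functional units and processing elements
m = 6                       # rows of one weight block, bounded by SM/AM sizes
M, K, N = 32, 4, 512        # weight is M x K, input is K x N

n = p * L * 4               # columns of one input block Ib_am
x1 = math.ceil(M / m)       # number of weight blocks Wb_sm along M
x2 = math.ceil(N / n)       # number of input blocks Ib_am along N
print(x1, n, x2)            # → 6 64 8
```

The ceiling handles the ragged last block when m or n does not divide the dimension evenly.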
S203, in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
Specifically, the method can comprise the following steps:
Step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
Step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
Step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e., a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
Step 4, initializing the vector register to 0 so that the vector register can accumulate and store the calculation result;
Step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so two half-precision values are loaded at a time into bits R[0:15] and R[16:31] of the designated scalar register. The k-th column data Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space is loaded in sequence into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1), while the (k+1)-th column data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) of Wb_sm(i) is loaded in sequence into bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1);
for example, taking the first weight sub-block Wb_sm(0) = 6 × 4 (m = 6, K = 4) and k = 0, a scalar load instruction loads the column-1 data of Wb_sm(0) into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1) and, at the same time, the column-2 data of Wb_sm(0) into bits R[16:31] of R30, R31, ..., R(30+m-1), as shown in fig. 3.
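Step 5's packing of two adjacent fp16 weights into one 4-byte scalar load can be modeled with Python's struct module; the little-endian byte order and the sample values are assumptions for illustration:

```python
import struct

# A 4-byte scalar load naturally fetches two adjacent fp16 values: column k
# lands in bits R[0:15] and column k+1 in bits R[16:31] (little-endian layout
# assumed here).
w_col_k, w_col_k1 = 1.5, -2.25                # two consecutive fp16 weights
raw = struct.pack('<ee', w_col_k, w_col_k1)   # 4 bytes, as stored in memory
(reg,) = struct.unpack('<I', raw)             # one 32-bit scalar load

low16  = reg & 0xFFFF            # R[0:15]  -> weight of column k
high16 = (reg >> 16) & 0xFFFF    # R[16:31] -> weight of column k+1

# Decoding each half back confirms the packing.
assert struct.unpack('<e', struct.pack('<H', low16))[0]  == 1.5
assert struct.unpack('<e', struct.pack('<H', high16))[0] == -2.25
```

This is the reason a single scalar load can feed both the column-k pass (steps 6-10) and the column-(k+1) pass (steps 12-16).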
Step 6, based on the half-precision weight data stored in scalar registers R30, R31, ..., R(30+m-1), a low-bit expansion operation is performed on R30, R31, ..., R(30+m-1): the low 16-bit data R[0:15] of the low 32 bits of each register is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
for example, taking d = 64, the implementation of the expansion instruction for the low 16 bits of the low 32 bits in step 6 is shown in fig. 4.
Step 7, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), broadcast operations are performed in sequence on R40, R41, ..., R(40+m-1) and the data is stored in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data; vectorization of the k-th column data of Wb_sm(i) is thus completed;
for example, taking L = 8, scalar register R40 is broadcast to vector register VR50, as shown in fig. 5.
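Steps 6 and 7 (low-bit copy expansion followed by broadcast) can be sketched as bit manipulation; the register names follow the patent, while the Python model and the sample bit pattern are illustrative assumptions:

```python
# Steps 6-7 modeled in software: replicate the low 16 bits of a scalar
# register across a d = 64-bit register, then broadcast that register to all
# L lanes of a vector register.
d, L = 64, 8

def expand_low16(r30: int) -> int:
    """Copy bits [0:15] of r30 into every 16-bit slot of a d-bit register."""
    h = r30 & 0xFFFF
    return sum(h << (16 * s) for s in range(d // 16))

def broadcast(r40: int, lanes: int = L) -> list:
    """Store the same d-bit value in all L vector processing elements."""
    return [r40] * lanes

r30 = 0xC0803E00            # fp16 pair packed by the scalar load of step 5
r40 = expand_low16(r30)     # low half 0x3E00 replicated four times
vr50 = broadcast(r40)
assert r40 == 0x3E003E003E003E00
assert len(vr50) == L and all(v == r40 for v in vr50)
```

The high-bit variant of step 12 is the same operation applied to `(r30 >> 16) & 0xFFFF`.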
Step 8, the k-th row data Ib_am(j,k,0), ..., Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space is loaded into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
for example, taking the first input sub-block Ib_am(0) = 4 × 64 (K = 4, n = 64) and k = 0, a vector load instruction loads the k-th row of Ib_am(0) into the p vector registers VR0, VR1, ..., VR(p-1); taking L = 8 and p = 2, a specific implementation of the vector load is shown in fig. 6.
Step 9, multiply-add operations are performed between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j). Because the architecture integrates p functional vector arithmetic units, the multiply-add operations can be issued in the same cycle, with the L vector processing elements operating in parallel, and the calculation results are stored in vector registers VR10, VR11, ..., VR(10+p-1);
for example, taking L = 8 and p = 2, VR50 is multiplied-and-added with VR0 and VR1 respectively, and the results are stored in VR10 and VR11; since the initial values of VR10 and VR11 are 0, the result of the multiply-add is the multiplication itself, as shown in fig. 7.
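Numerically, step 9 (and the accumulation that later iterations perform on the same registers) is a fused multiply-add of one broadcast weight against an n-element input strip; a minimal sketch with assumed sizes and values:

```python
# Step 9 modeled numerically: one broadcast weight value multiplies a strip of
# n = p * L * 4 input values and accumulates into the result registers, which
# start at zero (sizes and values are illustrative).
p, L = 2, 8
lanes_per_reg = L * 4                 # fp16 elements held by one vector register
n = p * lanes_per_reg

w = 0.5                                       # vectorized Wb_sm(i,0,k) in VR50
row_k = [float(q) for q in range(n)]          # Ib_am(j) row k in VR0..VR(p-1)
acc = [0.0] * n                               # VR10..VR(10+p-1), initialized to 0

# Fused multiply-add: acc += w * row; with acc = 0 this is the plain product.
acc = [a + w * x for a, x in zip(acc, row_k)]
assert acc[:4] == [0.0, 0.5, 1.0, 1.5]

# The next column k+1 accumulates on top of the same registers (steps 12-16).
row_k1 = [1.0] * n
acc = [a + 0.25 * x for a, x in zip(acc, row_k1)]
assert acc[:4] == [0.25, 0.75, 1.25, 1.75]
```

Keeping the accumulators resident in VR10 onward across all K columns is what avoids intermediate stores to AM.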
Step 10, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VR(p-1) store the k-th row data of the input sub-block Ib_am(j). Step 9 is repeated to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1); in this process the L vector processing elements operate in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed; the specific implementation is shown in fig. 8;
Step 11, judging whether k+1 is smaller than K; if so, continuing to execute step 12; if not, skipping to execute step 19;
Step 12, based on the Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) data stored in bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1), a high-bit expansion operation is performed on R30, R31, ..., R(30+m-1): the high 16-bit data R[16:31] of the low 32 bits of each register is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
for example, taking d = 64, the implementation of the expansion instruction for the high 16 bits of the low 32 bits in step 12 is shown in fig. 9.
Step 13, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), broadcast operations are performed in sequence on R40, R41, ..., R(40+m-1) and the broadcast data is stored in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data; vectorization of the (k+1)-th column data of Wb_sm(i) is thus completed;
for example, when k = 0, vectorization of the (k+1)-th column data of Wb_sm(i) proceeds in the same way; the specific broadcast implementation is shown in fig. 10.
Step 14, the (k+1)-th row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space is loaded into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
for example, taking the first input sub-block Ib_am(0) = 4 × 64 (K = 4, n = 64) and k+1 = 1, a vector load instruction loads the (k+1)-th row of Ib_am(0) into the p vector registers VR0, VR1, ..., VR(p-1); taking L = 8 and p = 2, a specific implementation of the vector load is shown in fig. 11.
Step 15, multiply-add operations are performed between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j). Because the architecture integrates p functional vector arithmetic units, the multiply-add operations can be issued in the same cycle, with the L vector processing elements operating in parallel, and the calculation results are stored in vector registers VR10, VR11, ..., VR(10+p-1);
for example, taking L = 8 and p = 2, when k+1 = 1, VR50 is multiplied-and-added with VR0 and VR1 respectively, accumulating onto the row-k multiply-add data already held in VR10 and VR11, and the results continue to be saved in VR10 and VR11, as shown in fig. 12.
Step 16, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VR(p-1) store the (k+1)-th row data of the input sub-block Ib_am(j). Step 15 is repeated to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1); in this process the L vector processing elements operate in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed; the specific implementation is shown in fig. 13;
Step 17, letting k = k + 2;
Step 18, judging whether k is smaller than K; if so, returning to step 5; if not, executing step 19;
Step 19, the conv1 × 1 calculation of the weight sub-block matrix Wb_sm(i) with the input sub-block matrix Ib_am(j) is completed when Wb_sm(i) has been traversed to its last column and Ib_am(j) to its last row; the specific operation is shown in fig. 14, and the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) are temporarily stored to the AM space location AM_temp;
Step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the double-rate synchronous dynamic random access memory;
Step 21, letting j = j + 1;
Step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space, returning to step 3 after loading, and repeating the operations of scalar data loading, copy expansion, broadcasting, vector data loading, and vector multiply-add; if not, executing step 23;
Step 23, letting i = i + 1;
Step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space, returning to step 2 after loading, and repeating the operations of scalar data loading, copy expansion, broadcasting, vector data loading, and vector multiply-add; if not, the conv1 × 1 calculation of all the weight data W_ddr with the input data I_ddr is completed.
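Collapsed to its arithmetic skeleton, the steps 1-24 loop nest is a blocked matrix product over (i, j, k, m1, n1); the behavioral sketch below (plain Python with illustrative sizes, not the VLIW register code) checks it against a direct matrix multiplication:

```python
import math

def conv1x1_blocked(W, I, m, n):
    """Block the M dimension by m and the N dimension by n, accumulating
    over K, mirroring the i/j/k loop structure of steps 1-24."""
    M, K, N = len(W), len(W[0]), len(I[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(math.ceil(M / m)):            # weight blocks Wb_sm(i)
        for j in range(math.ceil(N / n)):        # input blocks Ib_am(j)
            for k in range(K):                   # columns of Wb x rows of Ib
                for m1 in range(i * m, min((i + 1) * m, M)):
                    w = W[m1][k]                 # scalar load + broadcast
                    for n1 in range(j * n, min((j + 1) * n, N)):
                        out[m1][n1] += w * I[k][n1]   # vector multiply-add
    return out

M, K, N = 5, 3, 7
W = [[(m1 + k + 1) * 0.5 for k in range(K)] for m1 in range(M)]
I = [[(k - n1) * 0.25 for n1 in range(N)] for k in range(K)]
ref = [[sum(W[m1][k] * I[k][n1] for k in range(K)) for n1 in range(N)]
       for m1 in range(M)]
assert conv1x1_blocked(W, I, m=2, n=4) == ref
```

Because every element is accumulated in the same k order regardless of the block sizes, the blocked result matches the plain product exactly for any valid m and n.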
In summary, the vector processor-oriented half-precision vectorized conv1 × 1 convolution method disclosed by the invention vectorizes the conv1 × 1 convolution calculation for the vector processor architecture by exploiting its architectural features, and improves FLOPs on the premise of guaranteeing precision.
Fig. 15 is a schematic structural diagram of an embodiment of the vector processor-oriented half-precision vectorized conv1 × 1 convolution system disclosed in the present invention; the system may include:
the storage module 1501 is configured to store the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
a loading module 1502, configured to invoke a direct memory access operation, and load the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space, respectively;
the processing module 1503 is configured to perform vectorization processing on the weight data loaded to the on-chip SM space in the SM space, and perform the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain the convolved feature map data.
The working principle of the vector processor-oriented half-precision vectorized conv1 × 1 convolution system disclosed by the invention is the same as that of the vector processor-oriented half-precision vectorized conv1 × 1 convolution method described above, and is not repeated here.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A vector processor-oriented half-precision vectorized conv1 × 1 convolution method, comprising:
storing the half-precision weight data and the half-precision input data in a double-rate synchronous dynamic random access memory;
calling a direct memory access operation, and respectively loading the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
in the SM space, vectorizing the weight data loaded into the on-chip SM space, and in the AM space, performing the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the convolution kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi × Wi × n, so the input data can be represented as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
2. The method of claim 1, wherein the invoking of the direct memory access operation loads the half-precision weight data and the half-precision input data from the double rate synchronous dynamic random access memory into an on-chip Scalar Memory (SM) space and an on-chip Array Memory (AM) space, respectively, comprising:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data from the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M / m⌉, wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉, p denoting the number of vector functional arithmetic units in the vector processor architecture and L denoting the number of vector processing elements.
3. The method according to claim 2, wherein vectorizing, in the SM space, the weight data loaded into the on-chip SM space and performing, in the AM space, the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data comprises the following steps:
step 1, initializing i = 0, where i denotes the block index of the weight sub-block matrix Wb_sm(i) in the M dimension;
step 2, initializing j = 0, where j denotes the block index of the input sub-block matrix Ib_am(j) in the N dimension;
step 3, initializing k = 0, where k denotes the column index of the weight sub-block Wb_sm and the row index of the input sub-block Ib_am; m1 denotes the row index of the weight sub-block and n1 the column index of the input sub-block, i.e., a weight element is denoted Wb_sm(i,m1,k) and an input element Ib_am(j,k,n1);
step 4, initializing the vector registers to 0 so that they can accumulate and store the calculation results;
step 5, the minimum granularity of a scalar load instruction is 4 bytes and half-precision data is 2 bytes, so two half-precision values are loaded at a time into bits R[0:15] and R[16:31] of the designated scalar register; the k-th column data Wb_sm(i,0,k), ..., Wb_sm(i,m-1,k) of the weight sub-block Wb_sm(i) in the SM space is loaded in sequence into bits R[0:15] of scalar registers R30, R31, ..., R(30+m-1), while the (k+1)-th column data Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) of Wb_sm(i) is loaded in sequence into bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1);
step 6, based on the half-precision weight data stored in scalar registers R30, R31, ..., R(30+m-1), performing a low-bit expansion operation on R30, R31, ..., R(30+m-1): the low 16-bit data R[0:15] is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
step 7, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), performing broadcast operations in sequence on R40, R41, ..., R(40+m-1) and storing the data in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data, completing the vectorization of the k-th column data of Wb_sm(i);
step 8, loading the k-th row data Ib_am(j,k,0), ..., Ib_am(j,k,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
step 9, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k) and the k-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);
step 10, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k), ..., Wb_sm(i,m-1,k), and vector registers VR0, VR1, ..., VR(p-1) store the k-th row data of the input sub-block Ib_am(j); repeating step 9 to multiply each group of vectorized weight data with the k-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the k-th column of Wb_sm(i) with the k-th row of Ib_am(j) is completed;
step 11, judging whether k+1 is smaller than K; if so, continuing to execute step 12; if not, skipping to execute step 19;
step 12, based on the Wb_sm(i,0,k+1), ..., Wb_sm(i,m-1,k+1) data stored in bits R[16:31] of scalar registers R30, R31, ..., R(30+m-1), performing a high-bit expansion operation on R30, R31, ..., R(30+m-1): the high 16-bit data R[16:31] is copy-expanded into d-bit data and stored in scalar registers R40, R41, ..., R(40+m-1), where d is the bit length of a scalar register;
step 13, based on the copy-expanded data stored in scalar registers R40, R41, ..., R(40+m-1), performing broadcast operations in sequence on R40, R41, ..., R(40+m-1) and storing the broadcast data in vector registers VR50, VR51, ..., VR(50+m-1), in which the L vector processing elements all store the same data, completing the vectorization of the (k+1)-th column data of Wb_sm(i);
step 14, loading the (k+1)-th row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p denotes the number of functional vector arithmetic units in the very long instruction word architecture; the minimum granularity of one vector load is L × 8 bytes, so at least L × 4 half-precision values can be loaded at a time;
step 15, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);
step 16, vector registers VR51, ..., VR(50+m-1) store the vectorized data of the weight sub-block elements Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1), and vector registers VR0, VR1, ..., VR(p-1) store the (k+1)-th row data of the input sub-block Ib_am(j); repeating step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the multiplication results into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), the L vector processing elements operating in parallel, traversing Wb_sm(i) until the multiply-add calculation of the (k+1)-th column of Wb_sm(i) with the (k+1)-th row of Ib_am(j) is completed;
step 17, letting k = k + 2;
step 18, judging whether k is smaller than K; if so, returning to step 5; if not, executing step 19;
step 19, temporarily storing the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) to the AM space location AM_temp;
step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location of the double-rate synchronous dynamic random access memory;
step 21, letting j = j + 1;
step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading; if not, executing step 23;
step 23, letting i = i + 1;
step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading; if not, the conv1 × 1 calculation of all the weight data W_ddr with the input data I_ddr is completed.
4. A vector processor-oriented half-precision vectorized conv1 × 1 convolution system, comprising:
the storage module is used for storing the half-precision weight data and the half-precision input data in the double-rate synchronous dynamic random access memory;
the loading module is used for calling a direct memory access operation and respectively loading the half-precision weight data and the half-precision input data from the double-rate synchronous dynamic random access memory to an on-chip scalar memory SM space and an on-chip array memory AM space;
the processing module is used for vectorizing, in the SM space, the weight data loaded into the on-chip SM space, and performing, in the AM space, the conv1 × 1 convolution operation on the vectorized weight data and the input data in the AM space to obtain convolved feature map data;
wherein the half-precision weight data Weight_ddr has the data format [Co, Cin, ks], Co being the number of output channels, Cin the number of input channels, and ks the convolution kernel size; when the convolution kernel size is 1, the data format can be regarded as [Co, Cin], so the weight data can be expressed as a matrix Weight_ddr = M × K; the half-precision input data Input_ddr has the data format [Cin, Hi, Wi, n], where Hi and Wi are the height and width of the image, respectively, and n is the number of samples processed in one batch of the convolution operation; [Hi, Wi, n] can be regarded as one dimension, letting N = Hi × Wi × n, so the input data can be represented as a matrix Input_ddr = K × N, where M denotes Co, K denotes Cin, and N denotes the size of the image dimension.
5. The system of claim 4, wherein the loading module is specifically configured to:
invoking a direct memory access operation to load the half-precision weight matrix W_ddr into the on-chip SM space, dividing the original data from the M dimension into x1 sub-matrices Wb_sm, so that W_sm = x1 × Wb_sm, where Wb_sm = m × K and x1 = ⌈M / m⌉, wherein the size of m is determined jointly by the size of the SM space and the size of the AM space;
invoking a direct memory access operation to load the half-precision input matrix I_ddr into the on-chip AM space, dividing the original data from the N dimension into x2 sub-matrices Ib_am, so that I_am = x2 × Ib_am, where Ib_am = K × n, i.e., N = x2 × n, where n = p × L × 4 and x2 = ⌈N / n⌉, p denoting the number of vector functional arithmetic units in the vector processor architecture and L denoting the number of vector processing elements.
6. The system of claim 5, wherein the processing module is specifically configured to perform the steps of:
step 1, initializing i to 0, wherein i represents a weight subblock matrix Wb sm(i) A block index in the M dimension;
step 2, initializing j to 0, wherein j represents an input sub-block matrix Ib am(j) A block index in the N dimension;
step 3, initializing k to 0, wherein k represents the weight subblock Wb sm And the input sub-block Ib am M1, and n1, i.e., the weight subblocks are denoted as Wb sm(i,m1,k) Input sub-block denoted Ib am(j,k,n1)
Step 4, initializing the vector register to 0 so as to accumulate the vector register and store the calculation result;
and 5, the minimum granularity of the scalar loading instruction is 4 bytes, the semi-precision data is 2 bytes, and two pieces of semi-precision data are loaded to the R [0:15]And R [16:31]The weight sub-block Wb in the SM space sm(i) K-th column data Wb of sm(i,0,k) ......Wb sm(i,m-1,k) Loaded into scalar registers R in sequence 30 、R 31 ...R 30+m-1 R [0:15]Middle and weight sub-block Wb sm(i) Column k +1 data Wb sm(i,0,k+1) ......Wb sm(i,m-1,k+1) Loaded into scalar registers R in sequence 30 、R 31 ...R 30+m-1 R [16:31]Performing the following steps;
step 6, based on scalar register R 30 、R 31 ...R 30+m-1 Stored semi-precision weight data for scalar register R 30 、R 31 ...R 30+m-1 And performing low-order expansion operation, namely performing low-order expansion operation on the low-order 16-order data R [0:15]Replication extension to d-bit data storage in scalar register R 40 、R 41 ...R 40+m-1 Wherein d is the bit length of a scalar register;
step 7, based on scalar register R 40 、R 41 ...R 40+m-1 Stored replicated extended data, for scalar registers R 40 、R 41 ...R 40+m-1 Broadcast operations are performed in sequence and data is stored in vector register VR 50 、VR 51 ...VR 50+m-1 In which L vector processing elements store the same data, Wb sm(i) Completing the k-th column data vectorization;
step 8, inputting the sub-block matrix Ib in the AM space am(j) Of kth line data Ib am(j,k,0) ......Ib am(j,k,n-1) Loading into p vector registers VR 0 、VR 1 ...VR p-1 Where p represents the number of functional vector arithmetic unit elements in an architecture for very long data instruction words, with a minimum granularity of one load
Figure FDA0003447171640000071
A byte, so that it can be loaded at minimum at a time
Figure FDA0003447171640000072
Half precision data;
step 9, mixing Wb sm(i,0,k) Vectorized data VR 50 Respectively react with Ib am(j) VR of the kth line 0 、VR 1 ...VR p-1 Performing multiply-add operation, simultaneously operating L vector processing units in parallel, and storing the calculation result in a vector register VR 10 、VR 11 ...VR 10+p-1 Performing the following steps;
step 10, register VR based on vector 51 ...VR 50+m-1 Stored is the weight sub-block Wb sm(i,1,k) ......Wb sm(i,m-1,k) Vectorized data, vector register VR 0 、VR 1 ...VR p-1 Stored in is an input sub-block Ib am(j) Repeating the step 9 to respectively combine each group of quantized data of the weight with the Ib am(j) And adds the multiplication result to the vector register VR 10+p 、VR 10+p+1 ...VR 10+m×p-1 In the process, L vector processing elements operate in parallel simultaneously, traversing Wb sm(i) Up to Wb sm(i) K column of (1) and Ib am(j) The multiplication and addition calculation of the k rows is completed;
step 11, judging whether K +1 is smaller than K, if so, skipping to execute step 19, and if not, continuing to execute step 12;
step 12, based on scalar register R 30 、R 31 ...R 30+m-1 R [16:31]Wb stored in sm(i,1,k+1) ......Wb sm(i,m-1,k+1) Data, to scalar register R 30 、R 31 ...R 30+m-1 And performing high bit expansion operation, and enabling 16 high bits data R [16:31]Replication extension to d-bit data storage in scalar register R 40 、R 41 ...R 40+m-1 In (d) is the bit length of a scalar register;
step 13, based on scalar register R 40 、R 41 ...R 40+m-1 Stored replicated extended data, for scalar registers R 40 、R 41 ...R 40+m-1 Broadcast operation is carried out in sequence, and the broadcasted data is stored in a vector register VR 50 、VR 51 ...VR 50+m-1 In which L vector processing elements store the same data, Wb sm(i) Completing the vectorization of the k +1 th column of data;
step 14, loading the (k+1)-th-row data Ib_am(j,k+1,0), ..., Ib_am(j,k+1,n-1) of the input sub-block matrix Ib_am(j) in the AM space into p vector registers VR0, VR1, ..., VR(p-1), where p represents the number of vector arithmetic functional units in the very long instruction word architecture; the minimum load granularity is [formula FDA0003447171640000081] bytes, so that at least [formula FDA0003447171640000082] half-precision data can be loaded in one operation;
step 15, performing multiply-add operations between the vectorized data VR50 of Wb_sm(i,0,k+1) and the (k+1)-th-row data VR0, VR1, ..., VR(p-1) of Ib_am(j), with the L vector processing elements operating in parallel, and storing the calculation results in vector registers VR10, VR11, ..., VR(10+p-1);

step 16, based on the vectorized data of the weight sub-blocks Wb_sm(i,1,k+1), ..., Wb_sm(i,m-1,k+1) stored in vector registers VR51, ..., VR(50+m-1) and the (k+1)-th-row data of the input sub-block Ib_am(j) stored in vector registers VR0, VR1, ..., VR(p-1), repeating step 15 to multiply each group of vectorized weight data with the (k+1)-th row of Ib_am(j) and accumulate the products into vector registers VR(10+p), VR(10+p+1), ..., VR(10+m×p-1), with the L vector processing elements operating in parallel, until the multiply-add computation of column k+1 of Wb_sm(i) with row k+1 of Ib_am(j) is complete;
step 17, setting k = k + 2;

step 18, judging whether k is smaller than K; if so, returning to step 5, and if not, executing step 19;
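Steps 9 through 18 traverse the K columns with the loop unrolled by a factor of two, the step-11 test guarding the case where no column k+1 remains. A control-flow sketch of this loop structure (the function and parameter names are illustrative, not from the patent):

```python
def conv_columns(K, process_column, store_results):
    """Sketch of the steps 5-19 control flow: the column loop is unrolled
    by two, with an early exit when only an odd tail column is left."""
    k = 0
    while True:
        process_column(k)             # steps 5-10: column k
        if not (k + 1 < K):           # step 11: no column k+1 remains
            break
        process_column(k + 1)         # steps 12-16: column k+1
        k += 2                        # step 17
        if not (k < K):               # step 18
            break
    store_results()                   # step 19

seen = []
conv_columns(5, seen.append, lambda: seen.append('store'))
# every column 0..4 is processed exactly once, then results are stored
assert seen == [0, 1, 2, 3, 4, 'store']
```

Unrolling by two lets the vectorization of column k+1 (steps 12-14) be scheduled alongside the multiply-adds of column k, which is a common software-pipelining motivation on VLIW machines.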
step 19, temporarily storing the data results held in vector registers VR10, VR11, ..., VR(10+m×p-1) to the AM space location AM_temp;

step 20, calling a direct memory access operation to store the feature map data results held at the AM space location AM_temp to the designated location in the double data rate synchronous dynamic random access memory;

step 21, setting j = j + 1;

step 22, judging whether j is smaller than x2; if so, calling a direct memory access operation to load the input sub-block matrix Ib_am(j) into the on-chip AM space and returning to step 3 after loading completes, and if not, executing step 23;

step 23, setting i = i + 1;

step 24, judging whether i is smaller than x1; if so, calling a direct memory access operation to load the weight sub-block matrix Wb_sm(i) into the on-chip SM space and returning to step 2 after loading completes, and if not, the conv1×1 computation of all the weight data W_ddr and the input data I_ddr is complete.
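Taken together, steps 1 through 24 compute conv1×1 as a blocked matrix product: weight sub-blocks are swept against input sub-blocks while partial results accumulate over the channel dimension. A hedged numpy sketch of this overall tiling, with all shapes and block counts chosen purely for illustration:

```python
import numpy as np

# Illustrative shapes (hypothetical): a conv1x1 over C input channels and
# F filters is the matrix product W (F x C) @ I (C x HW).
F, C, HW = 8, 6, 10
x1, x2 = 2, 2                            # number of weight / input sub-blocks
rng = np.random.default_rng(1)
W = rng.standard_normal((F, C)).astype(np.float16)   # plays the role of W_ddr
I = rng.standard_normal((C, HW)).astype(np.float16)  # plays the role of I_ddr
out = np.zeros((F, HW), dtype=np.float32)

fb, hb = F // x1, HW // x2
for i in range(x1):                      # step 24 loop: weight sub-blocks
    Wb = W[i * fb:(i + 1) * fb]          # DMA Wb_sm(i) into SM space
    for j in range(x2):                  # step 22 loop: input sub-blocks
        Ib = I[:, j * hb:(j + 1) * hb]   # DMA Ib_am(j) into AM space
        # steps 5-19: accumulate over the K (= C) dimension column by column
        acc = np.zeros((fb, hb), dtype=np.float32)
        for k in range(C):
            acc += np.outer(Wb[:, k].astype(np.float32),
                            Ib[k].astype(np.float32))
        # step 20: DMA the finished sub-block back to DDR
        out[i * fb:(i + 1) * fb, j * hb:(j + 1) * hb] = acc

assert np.allclose(out, W.astype(np.float32) @ I.astype(np.float32), atol=1e-2)
```

Each rank-1 update in the inner loop corresponds to one broadcast weight column multiplied against one input row, which is exactly the register-level pattern of steps 9-16.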
CN202111681136.XA 2021-12-30 2021-12-30 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system Active CN114330669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681136.XA CN114330669B (en) 2021-12-30 2021-12-30 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system


Publications (2)

Publication Number Publication Date
CN114330669A CN114330669A (en) 2022-04-12
CN114330669B true CN114330669B (en) 2022-09-16

Family

ID=81023239


Country Status (1)

Country Link
CN (1) CN114330669B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115114575B (en) * 2022-08-30 2023-01-31 中国人民解放军国防科技大学 Vector processor-oriented image-to-matrix row conversion method, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235B (en) * 2019-10-21 2022-03-18 中国人民解放军国防科技大学 Vectorization implementation method for Valid convolution of convolutional neural network
CN113626769B (en) * 2021-10-12 2022-01-21 中国人民解放军国防科技大学 Vector processor-oriented low-bit-width data matrix vectorization transposition method and system


Similar Documents

Publication Publication Date Title
US20240012644A1 (en) Efficient direct convolution using simd instructions
US6901422B1 (en) Matrix multiplication in a vector processing system
KR100329339B1 (en) An apparatus for performing multiply-add operations on packed data
WO2019205617A1 (en) Calculation method and apparatus for matrix multiplication
US8935468B2 (en) Audio digital signal processor
US20210357735A1 (en) Split accumulator for convolutional neural network accelerator
JPH11511577A (en) Device for performing multiply-add operation of packed data
CN113626769B (en) Vector processor-oriented low-bit-width data matrix vectorization transposition method and system
CN114330669B (en) Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN115039121A (en) Hybrid convolution operation
CN114090954A (en) Integer matrix multiplication kernel optimization method based on FT-2000+
CN114281755B (en) Vector processor-oriented semi-precision vectorization convolution method and system
US6209012B1 (en) System and method using mode bits to support multiple coding standards
CN116842304A (en) Method and system for calculating irregular sparse matrix
CN112668709B (en) Computing device and method for data reuse
CN114329326A (en) Low-bit-width data matrix vectorization column expansion method and system of vector processor
CN112434255A (en) Vector-matrix operation and data processing method, multiplier and processor chip
CN114138692B (en) Low-bit-width data matrix vectorization column clipping method and system of vector processor
US11960856B1 (en) Multiplier-accumulator processing pipeline using filter weights having gaussian floating point data format
CN115114575B (en) Vector processor-oriented image-to-matrix row conversion method, device and medium
Damaj et al. Performance analysis of extended vector-scalar operations using reconfigurable computing
CN114139108A (en) Matrix LU decomposition vectorization calculation method of vector DSP core
CN116910432A (en) Sparse matrix vector multiplication method and device based on heterogeneous many-core processor
CN116055003A (en) Data optimal transmission method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant