CN113888390A - Feature map processing method and device, electronic equipment and computer readable medium - Google Patents

Feature map processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN113888390A
Authority
CN
China
Prior art keywords
matrix
parameter
feature map
blocking
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010632513.XA
Other languages
Chinese (zh)
Inventor
吴博
陈其友
曾平
许欣然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN202010632513.XA
Publication of CN113888390A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The embodiment of the application discloses a feature map processing method and device, electronic equipment and a computer readable medium. An embodiment of the method comprises: selecting a first blocking parameter and a second blocking parameter based on a preset thread number, wherein the first blocking parameter is used for guiding blocking operation when column vector conversion is carried out on an input feature map, and the second blocking parameter is used for guiding blocking operation on data of a convolution kernel; arranging data in the convolution kernel in an off-line manner; through multithreading, the following operation steps are executed on line: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; multiplying the second block matrix with the first block matrix to obtain an operation result; an output feature map is generated based on the operation results obtained by the threads. The embodiment reduces the memory occupation amount and the power consumption of the device.

Description

Feature map processing method and device, electronic equipment and computer readable medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a feature map processing method, a feature map processing device, electronic equipment and a computer readable medium.
Background
Convolutional Neural Networks (CNN) have been widely used in the field of artificial intelligence. Because the convolution calculation process determines the speed at which the convolutional neural network operates, it is generally necessary to optimize the convolution calculation process to increase the speed at which the convolutional neural network operates.
In the prior art, an im2col (image-to-column) operation is generally performed to expand the input feature map patch by patch according to the shape of the convolution kernel, splice the expanded patches into columns and store them in memory, and a matrix multiplication (MatMul) operation is then performed to multiply the expanded feature map with the convolution kernel to generate an output feature map. The convolution operation is thereby converted into a matrix operation, which increases the running speed of the convolutional neural network. However, the data produced by the im2col operation is usually much larger than the input feature map, so the im2col result occupies a large amount of memory, which affects the running speed of the device. Meanwhile, when performing the MatMul operation, the data in the convolution kernel needs to be arranged (packed) into the internal memory of the MatMul operation core function (kernel), and this memory is released after the matrix multiplication finishes, causing repeated allocation and release of memory and thus high device power consumption.
Disclosure of Invention
The embodiment of the application provides a feature map processing method and device, electronic equipment and a computer readable medium, so as to reduce the memory occupation amount and reduce the equipment power consumption.
In a first aspect, an embodiment of the present application provides a feature map processing method, where the method includes: selecting a first blocking parameter and a second blocking parameter based on a preset thread number, wherein the first blocking parameter is used for guiding blocking operation when the input characteristic diagram is subjected to column vector conversion, and the second blocking parameter is used for guiding blocking operation on a convolution kernel; arranging data in the convolution kernel in an off-line manner; through multithreading, the following operation steps are executed on line: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; multiplying the second block matrix with the first block matrix to obtain an operation result; an output feature map is generated based on the operation results obtained by the threads.
In a second aspect, an embodiment of the present application provides a feature map processing apparatus, including: the selecting unit is configured to select a first blocking parameter and a second blocking parameter based on a preset thread number, wherein the first blocking parameter is used for guiding blocking operation when column vector conversion is performed on an input feature map, and the second blocking parameter is used for guiding blocking operation on a convolution kernel; an arrangement unit configured to arrange the data in the convolution kernel offline; an arithmetic unit configured to perform the following operation steps on-line by multithreading: performing column vector conversion on the input feature map blocks based on the first block dividing parameters to obtain a first block dividing matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernel based on the second partitioning parameter; matrix multiplication is carried out on the second block matrix and the first block matrix to obtain an operation result; and a generation unit configured to generate an output feature map based on the operation result obtained by each thread.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to carry out the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method as described in the first aspect.
According to the feature map processing method, the feature map processing device, the electronic equipment and the computer readable medium, a first block parameter and a second block parameter are selected based on a preset thread number; then, arranging data in the convolution kernel in an off-line mode; then, the following operation steps are executed on line through each thread respectively: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; matrix multiplication is carried out on the second block matrix and the first block matrix to obtain an operation result; thus, an output feature map is generated based on the operation results obtained by the threads. On one hand, the data volume of the first block matrix obtained after the input feature map blocks are subjected to column vector conversion is greatly reduced, so that the memory occupation of im2col operation results is reduced, and the network operation speed is increased; meanwhile, the operation amount can be reduced in the MatMul operation process by partitioning the convolution kernel, and the network operation speed is further improved. On the other hand, the data arrangement (pack) operation in the convolution kernel is completed in advance before the thread execution operation step, and the MatMul operation core function does not need to perform matrix arrangement operation, so that the data in the arranged convolution kernel does not need to be stored in the internal memory of the MatMul operation core function, the repeated distribution and release of the internal memory of the MatMul operation core function are avoided, and the equipment power consumption is reduced.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram for one embodiment of a feature map processing method according to the present application;
FIG. 2 is an exploded flow diagram of the operational steps of FIG. 1;
FIG. 3 is a schematic block diagram of one embodiment of a feature map processing apparatus according to the present application;
FIG. 4 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to FIG. 1, a flow 100 of one embodiment of a feature map processing method according to the present application is shown. The feature map processing method comprises the following steps:
Step 101: selecting a first blocking parameter and a second blocking parameter based on a preset thread number.
In this embodiment, a pre-trained model using a convolutional neural network structure may be deployed in an execution subject (e.g., an electronic device such as a server) of the feature map processing method. The model may include, but is not limited to, a classification model, a detection model, a feature extraction model, and the like. A convolutional neural network is a feed-forward neural network whose artificial neurons respond to surrounding units within a local receptive field; because it performs well on image processing, a model trained in advance with a convolutional neural network structure can be used to extract image features. Here, the convolutional neural network may include one or more convolutional layers. After a certain feature map (which may be referred to as an input feature map) is input to a convolutional layer, the convolutional layer further extracts features from it and outputs a new feature map (which may be referred to as an output feature map). The core of the convolutional layer is the convolution kernel (filter), which may be a multidimensional parameter matrix.
In the prior art, when a convolutional layer processes the input feature map, the whole input feature map is generally first expanded by column vector conversion into a matrix of another format. The column vector conversion operation may be the im2col (Image to Column) operation. The im2col operation expands the input feature map patch by patch according to the shape of the convolution kernel, splices the patches into columns, and converts the portion covered by the convolution kernel's receptive field into one column for storage. In general, the matrix obtained after the im2col operation is denoted as B (of size K × N) and the convolution kernel is denoted as A (of size M × K); after the im2col operation, A × B is computed by a matrix multiplication (MatMul) operation, thereby obtaining the output feature map. In the prior art, the convolution operation is converted into a matrix operation through this process, which can increase the running speed of the convolutional neural network. However, this approach typically requires a larger memory to store the data produced by the im2col operation.
A specific example is described below. When the input feature map is 1 × ic × ih × iw, performing the im2col operation on the whole input feature map converts it into a matrix B of size (ic × fh × fw) × (oh × ow), where K = ic × fh × fw and N = oh × ow; ic is the number of channels of the input feature map, ih is the height of the input feature map (also referred to as the number of rows of the input feature map for each channel), iw is the width of the input feature map (also referred to as the number of columns of the input feature map for each channel), fh is the height of the convolution kernel (also referred to as its number of rows), fw is the width of the convolution kernel (also referred to as its number of columns), oh is the height of the output feature map (also referred to as the number of rows of the output feature map for each channel), and ow is the width of the output feature map (also referred to as the number of columns of the output feature map for each channel). The convolution kernel A is of the form oc × ic × fh × fw, where M = oc and oc is the number of channels of the output feature map. By computing A × B through the MatMul operation, an output feature map of size M × N is obtained, which is in the form oc × oh × ow.
In the above process, the amount of memory occupied by the result of the im2col operation is ic × fh × fw × oh × ow × sizeof(datatype) bytes, where sizeof() is a memory capacity metric function that returns the size (in bytes) of a variable or data type, and datatype represents the data type. For example, when datatype is int, sizeof(int) equals 4.
Assuming ih = oh and iw = ow, the memory occupied by the im2col result is fh × fw times the memory occupied by the input feature map, i.e., much larger than the input. Therefore, after the im2col operation is performed on the input feature map in the existing manner, a large amount of memory is needed to store the im2col data, which easily affects the network running speed.
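As a rough illustration of the memory blow-up described above, the following numpy sketch performs whole-feature-map im2col followed by a matrix multiplication. It is not taken from the patent; the toy sizes, stride 1 and the absence of padding are assumptions.

```python
import numpy as np

# Assumed toy sizes: 1 x ic x ih x iw input, oc x ic x fh x fw kernel, stride 1, no padding.
ic, ih, iw, oc, fh, fw = 3, 8, 8, 4, 3, 3
oh, ow = ih - fh + 1, iw - fw + 1

x = np.random.randn(ic, ih, iw).astype(np.float32)
kernel = np.random.randn(oc, ic, fh, fw).astype(np.float32)

# im2col over the whole feature map: B has shape (K, N), with K = ic*fh*fw and N = oh*ow.
K, N = ic * fh * fw, oh * ow
B = np.empty((K, N), dtype=x.dtype)
for col, (i, j) in enumerate((i, j) for i in range(oh) for j in range(ow)):
    B[:, col] = x[:, i:i + fh, j:j + fw].reshape(-1)

# MatMul: A (M x K, with M = oc) times B (K x N) gives the oc x oh x ow output feature map.
A = kernel.reshape(oc, K)
out = (A @ B).reshape(oc, oh, ow)

# B holds ic*fh*fw*oh*ow elements versus ic*ih*iw for the input; when oh and ow are close
# to ih and iw, this is roughly fh*fw times the input's memory footprint.
print(x.nbytes, B.nbytes)
```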
In this embodiment, the execution subject may select the first blocking parameter and the second blocking parameter based on a preset number of threads. The first blocking parameter may be used to guide the blocking operation when performing column vector conversion (i.e., the im2col operation) on the input feature map. Performing the im2col operation block by block yields a smaller conversion matrix (which may be referred to as a first block matrix) than performing the im2col operation on the whole input feature map, so the memory occupation is reduced and the convolution operation speed is increased. The second blocking parameter may be used to guide the blocking operation on the data of the convolution kernel, which is divided into a plurality of block matrices (which may be referred to as second block matrices) after blocking. Therefore, by performing the MatMul operation on combinations of first block matrices and second block matrices, the amount of computation in the MatMul process is reduced, further improving the convolution operation speed.
In this embodiment, a thread pool may be preset, and a plurality of threads may be deployed in the thread pool. The number of threads may be predetermined. Based on the preset thread number, the execution main body can select the first partitioning parameter and the second partitioning parameter in various ways. As an example, the corresponding first blocking parameter and second blocking parameter may be set according to historical experience for different thread numbers, so as to make the network operate at the fastest speed. As another example, a first function for calculating an optimal first blocking parameter and a second function for calculating an optimal second blocking parameter may be preset, and the optimal first blocking parameter may be calculated by inputting a value of a required variable (e.g., the number of threads, the size of an input feature map, etc.) into the first function. Similarly, the value of the required variable (such as the number of threads, the size of the required output feature map, etc.) is input into the second function, and the optimal second partitioning parameter can be obtained through calculation. In practice, the same thread may perform im2col operations and MatMul operations multiple times. The thread number × the execution number is equal to the number of first partition matrices × the number of second partition matrices. The number of the first block matrixes is determined by the first block parameters, and the number of the second block matrixes is determined by the second block parameters, so that the memory size required by each thread and the thread execution times can be influenced by adjusting the first block parameters and the second block parameters, the memory access utilization rate of each thread is further influenced, and the optimal first block parameters and the optimal second block parameters can optimize the program performance.
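The text does not fix a particular selection rule, so the following is only a hypothetical heuristic in the spirit of the second example above: it picks, from assumed candidate values, the pair of blocking parameters whose task count balances best across the threads. The candidate lists and the scoring rule are inventions for illustration, not the patent's method.

```python
def select_block_params(thread_num, oh, ow, oc,
                        ohw_candidates=(64, 128, 256),
                        oc_candidates=(8, 16, 32)):
    """Hypothetical heuristic: choose (ohw_tile_size, oc_tile_size) so that the number
    of (first block, second block) combinations spreads evenly over the threads,
    preferring smaller per-task tiles on a tie."""
    best, best_score = None, None
    for ohw_tile_size in ohw_candidates:
        for oc_tile_size in oc_candidates:
            n_first = -(-(oh * ow) // ohw_tile_size)    # ceil((oh*ow) / ohw_tile_size)
            n_second = -(-oc // oc_tile_size)           # ceil(oc / oc_tile_size)
            tasks = n_first * n_second
            imbalance = (-tasks) % thread_num           # 0 when tasks divide evenly
            score = (imbalance, ohw_tile_size * oc_tile_size)
            if best_score is None or score < best_score:
                best, best_score = (ohw_tile_size, oc_tile_size), score
    return best

# Example call: select_block_params(thread_num=4, oh=56, ow=56, oc=64)
```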
In some optional implementations of this embodiment, the first blocking parameter may be used to set the width of the first block matrix, the width being obtained by splitting the product of the height and width of the output feature map (i.e., oh × ow). The second blocking parameter may be used to set the height of the second block matrix, the height being obtained by splitting the number of channels oc of the output feature map. The height of the first block matrix and the width of the second block matrix are both the product of the number of channels of the input feature map, the width of the convolution kernel and the height of the convolution kernel.
A specific example is described below. The first blocking parameter may set the width of the first block matrix to ohw_tile_size. Based on ohw_tile_size, the oh × ow dimension is partitioned into (oh × ow)/ohw_tile_size parts. In this case, after the im2col operation is performed block by block on the input feature map, the size of each resulting first block matrix is ic × fh × fw × ohw_tile_size, where ic × fh × fw is the height of the first block matrix. Since the size of the first block matrix is only ohw_tile_size/(oh × ow) of the result of performing the im2col operation on the entire input feature map, the memory it requires is likewise only ohw_tile_size/(oh × ow) of that required when the im2col operation is performed on the entire input feature map; the smaller ohw_tile_size is, the less memory is occupied. Therefore, partitioning the input feature map while performing the im2col operation reduces the memory occupied by the im2col results.
Continuing with the example, the second blocking parameter may set the height of the second block matrix to oc_tile_size. After the convolution kernel of the form oc × ic × fh × fw is partitioned along the oc dimension based on oc_tile_size, oc/oc_tile_size second block matrices are obtained, each of size oc_tile_size × ic × fh × fw, where ic × fh × fw is the width of the second block matrix. Therefore, the height of the first block matrix and the width of the second block matrix are both the product of the number of channels of the input feature map, the width of the convolution kernel and the height of the convolution kernel. After the convolution kernel is partitioned, the MatMul operation only needs to perform matrix multiplication for each combination of a first block matrix and a second block matrix, and each combination yields an operation result of size oc_tile_size × ohw_tile_size. The final result is obtained by combining these results. This process reduces the amount of computation of the MatMul operation, thereby further improving the network running speed.
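A compact, single-threaded sketch of the blocking scheme just described is given below. It assumes stride 1, no padding and the variable names used above; it is illustrative only and does not reproduce the patent's multithreaded implementation.

```python
import numpy as np

def blocked_conv(x, kernel, ohw_tile_size, oc_tile_size):
    # x: (ic, ih, iw); kernel: (oc, ic, fh, fw); stride 1, no padding (assumed).
    ic, ih, iw = x.shape
    oc, _, fh, fw = kernel.shape
    oh, ow = ih - fh + 1, iw - fw + 1
    K = ic * fh * fw
    A = kernel.reshape(oc, K)                        # full kernel matrix, M = oc
    out = np.empty((oc, oh * ow), dtype=x.dtype)
    coords = [(i, j) for i in range(oh) for j in range(ow)]

    for n0 in range(0, oh * ow, ohw_tile_size):          # split oh*ow by ohw_tile_size
        cols = coords[n0:n0 + ohw_tile_size]
        B_tile = np.empty((K, len(cols)), dtype=x.dtype)  # first block matrix
        for c, (i, j) in enumerate(cols):                 # block-wise im2col
            B_tile[:, c] = x[:, i:i + fh, j:j + fw].reshape(-1)
        for m0 in range(0, oc, oc_tile_size):             # split oc by oc_tile_size
            A_tile = A[m0:m0 + oc_tile_size]              # second block matrix
            out[m0:m0 + oc_tile_size, n0:n0 + len(cols)] = A_tile @ B_tile
    return out.reshape(oc, oh, ow)
```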
In some optional implementations of this embodiment, after the first blocking parameter and the second blocking parameter are selected based on the preset number of threads, the execution subject may further determine, based on the first blocking parameter and the second blocking parameter, the total amount of memory required by the threads to execute the operation steps (including the im2col operation and the MatMul operation), and allocate a target memory space based on that total amount. The target memory space may be used to store data generated while the threads execute the operation steps, and is located in the internal memory of the kernel function (kernel) of the MatMul operation. In some examples, the total amount of memory may be determined by the following steps:
In the first step, based on the first blocking parameter, a first memory amount (which may be denoted as im2col_dst_size) required for storing the first block matrix obtained by a single thread through column vector conversion is determined.
Here, the first memory amount required by the first block matrix obtained by a single thread through column vector conversion (i.e., the im2col operation) is the product of the size of the first block matrix and sizeof(datatype). As an example, if the first blocking parameter sets the width (i.e., the number of columns) of the first block matrix to ohw_tile_size, the size of the first block matrix is ic × fh × fw × ohw_tile_size, and im2col_dst_size = ic × fh × fw × ohw_tile_size × sizeof(datatype).
In the second step, based on the first blocking parameter and the second blocking parameter, a second memory amount (which may be denoted as matmul_dst_size) required for storing the operation result obtained by a single thread through matrix multiplication is determined.
Here, the second memory amount is the memory space occupied by the operation result obtained after a single thread multiplies the second block matrix by the first block matrix (i.e., the MatMul operation). As an example, if the first blocking parameter sets the width of the first block matrix to ohw_tile_size and the second blocking parameter sets the height (i.e., the number of rows) of the second block matrix to oc_tile_size, then the size of the first block matrix is ic × fh × fw × ohw_tile_size, the size of the second block matrix is oc_tile_size × ic × fh × fw, and matmul_dst_size = oc_tile_size × ohw_tile_size × sizeof(datatype).
In the third step, a third memory amount (which may be denoted as packB_size) required for the pack result of the first block matrix is determined.
Here, packB_size may be determined jointly by the size of the first block matrix and the kernel. As an example, packB_size may be calculated as round_up(size_t_n × size_t_k × sizeof(datatype), 64), where the round_up function rounds its first argument up, so that size_t_n × size_t_k × sizeof(datatype) is taken up to an integer multiple of 64; 64 is the default minimum buffer unit. Here, size_t_n = round_up(ohw_tile_size, inner_block_n) and size_t_k = round_up(ic × fh × fw, inner_block_k). inner_block_n and inner_block_k are preset parameters determined by the kernel and may take different values for different kernels. For example, for one kernel, inner_block_n may be set to 12 and inner_block_k to 1; for another kernel, inner_block_n may be set to 4 and inner_block_k to 16, and so on, without particular limitation here.
In the fourth step, the larger of the first memory amount and the second memory amount (which may be denoted as memory_size) is selected, the sum of that maximum and the third memory amount is taken as the memory amount required by a single thread, and the product of the memory amount required by a single thread and the thread number (which may be denoted as thread_num) is determined as the total memory amount.
Thus, the total amount of memory is: (packB_size + memory_size) × thread_num.
It should be noted that, because the im2col operation and the MatMul operation are executed in different time periods (the im2col operation first, then the MatMul operation), the memory can be reused: only the larger of the two memory amounts needs to be allocated, which saves memory. The total amount of memory here is the amount of memory required by the threads to perform the operation steps.
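The memory formulas above can be restated as a small helper; it is only a transcription of the example quantities (with the text's example values inner_block_n = 12 and inner_block_k = 1 as defaults), not the interface of any particular library.

```python
def round_up(x, align):
    return ((x + align - 1) // align) * align

def total_workspace_bytes(ic, fh, fw, ohw_tile_size, oc_tile_size,
                          thread_num, dtype_size,
                          inner_block_n=12, inner_block_k=1):
    # First amount: per-thread im2col block of size (ic*fh*fw) x ohw_tile_size.
    im2col_dst_size = ic * fh * fw * ohw_tile_size * dtype_size
    # Second amount: per-thread MatMul result of size oc_tile_size x ohw_tile_size.
    matmul_dst_size = oc_tile_size * ohw_tile_size * dtype_size
    # Third amount: packed copy of the first block matrix, rounded up to the
    # kernel's inner blocking and to the 64-byte minimum buffer unit.
    size_t_n = round_up(ohw_tile_size, inner_block_n)
    size_t_k = round_up(ic * fh * fw, inner_block_k)
    packB_size = round_up(size_t_n * size_t_k * dtype_size, 64)
    # im2col and MatMul run at different times, so one buffer of the larger size
    # is reused by both; the pack buffer is needed alongside it.
    memory_size = max(im2col_dst_size, matmul_dst_size)
    return (packB_size + memory_size) * thread_num
```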
Step 102: arranging the data in the convolution kernel offline.
In this embodiment, the execution subject may arrange (pack) the data in the convolution kernel offline. Arranging the data in the convolution kernel means rearranging it and writing it into memory so that the data is contiguous. With the convolution kernel data arranged, the kernel of the MatMul operation can access the data contiguously when the operation steps are executed later.
It should be noted that the memory space for storing the arranged convolution kernel data is located in the external memory of the kernel. Since the arrangement (pack) operation on the data in the convolution kernel is completed in advance, before the threads execute the operation steps, a thread does not need to pack the convolution kernel data again while executing the operation steps (including the im2col operation and the MatMul operation), and therefore the packed data does not need to be stored in the kernel's internal memory. Storing the arranged convolution kernel data in the external memory of the kernel, rather than in its internal memory, avoids the repeated allocation and release of the kernel's internal memory during the MatMul operation and reduces the power consumption of the device.
In some optional implementations of this embodiment, the execution subject may arrange the data in the convolution kernel block by block based on the second blocking parameter. For example, if the size of the convolution kernel is oc × ic × fh × fw and the second blocking parameter indicates that the convolution kernel is partitioned along the oc dimension by oc_tile_size, then blocks of data of size oc_tile_size × ic × fh × fw are packed successively. The total number of packs is oc/oc_tile_size.
In some optional implementations of this embodiment, the execution subject may instead pack the data in the convolution kernel as a whole. For example, if the size of the convolution kernel is oc × ic × fh × fw, all the data in the convolution kernel of size oc × ic × fh × fw can be packed at once. When a second block matrix is subsequently acquired, the data of the corresponding block is extracted from the pack result based on oc_tile_size.
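A minimal sketch of the block-wise offline pack follows, under the assumption that "arranging" simply means laying each oc_tile_size slice of the kernel out contiguously as an oc_tile_size × (ic·fh·fw) matrix; the exact in-memory layout used by a real MatMul kernel may differ.

```python
import numpy as np

def pack_kernel_blocks(kernel, oc_tile_size):
    # kernel: (oc, ic, fh, fw). Each block is one second block matrix, stored
    # contiguously so it can later be read sequentially by the MatMul kernel.
    oc, ic, fh, fw = kernel.shape
    K = ic * fh * fw
    A = kernel.reshape(oc, K)
    return [np.ascontiguousarray(A[m0:m0 + oc_tile_size])
            for m0 in range(0, oc, oc_tile_size)]

# Done once, offline; the list lives outside the MatMul kernel's internal workspace,
# and the i-th second block matrix is simply packed[i] at run time.
```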
Step 103: executing the operation steps online through multithreading.
In this embodiment, since the input feature map is obtained online in real time, the operation steps can be performed online by a plurality of threads in the thread pool. Each thread may perform the operation steps a plurality of times, and each time it performs them it is responsible for processing one combination of a first block matrix and a second block matrix.
Please refer to the exploded flowchart of the operation step shown in fig. 2, which may include the following sub-steps:
Substep 1031: performing column vector conversion on a block of the input feature map based on the first blocking parameter to obtain a first block matrix.
In this embodiment, each time a thread performs the operation steps, it may perform column vector conversion (i.e., the im2col operation) on a block of the input feature map based on the first blocking parameter to obtain a first block matrix. Referring to the example in step 101, the first blocking parameter may set the width of the first block matrix to ohw_tile_size and perform blocking along the oh × ow dimension. After the im2col operation is performed block by block on the input feature map, the size of the resulting first block matrix is ic × fh × fw × ohw_tile_size; its height is ic × fh × fw and its width is ohw_tile_size.
It should be noted that the embodiments of the present application may support a variety of data layout formats, including but not limited to NCHW (Batch, Channels, Height, Width), NCHW4 (Batch, Channels/4, Height, Width), and the like. If the input feature map is in the NCHW layout format, the data processed by im2col is in the NCHW layout format; if the input feature map is in the NCHW4 layout format, the data processed by im2col is in the NCHW4 layout format. When the data layout format is NCHW4, if the data type is int8 or uint8, the fetched data may contain data of 4 channels, and 4 values can be processed at one time by using a uint32.
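The remark about fetching four channels at once can be pictured with the following assumed snippet: four contiguous int8 channel values are viewed as a single uint32 word, so one 32-bit access covers four 8-bit values.

```python
import numpy as np

# Assumed illustration only: view 4 contiguous int8 channel values as one uint32.
four_channels = np.array([1, 2, 3, 4], dtype=np.int8)
word = four_channels.view(np.uint32)[0]    # single 32-bit load instead of four 8-bit loads
print(hex(int(word)))                      # 0x4030201 on a little-endian machine
```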
In some optional implementations of this embodiment, before performing substep 1031, the execution subject may first fill (pad) the edges of the input feature map. The padding operation fills the edges of the input feature map, controls its size, and avoids losing edge information. Padding the input feature map before the im2col operation means the matrix size does not need to be checked during the im2col operation, which improves efficiency. For example, suppose the im2col process needs to extract matrices of size 4 × 4 from a matrix of size 24 × 24, and the matrix before padding is 21 × 21; if the matrix is padded to 24 × 24 in advance, it is unnecessary to determine during im2col whether each extracted matrix is 4 × 4, thereby improving operation efficiency.
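A short illustration of that example, with the pad widths and a zero fill value assumed (the patent does not specify them):

```python
import numpy as np

x = np.random.randn(3, 21, 21).astype(np.float32)    # 3-channel 21x21 map
x_padded = np.pad(x, ((0, 0), (0, 3), (0, 3)))        # -> (3, 24, 24), zero fill
# Every 4x4 window extraction during im2col now stays in bounds, so no size check is needed.
```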
In some optional implementations of this embodiment, after performing substep 1031, the thread may further arrange (pack) the data in the first block matrix and store the packed data of the first block matrix into an internal memory space of the kernel (such as the above-mentioned target memory space), thereby fusing the im2col operation and the pack operation. In the prior art, after the im2col result is obtained, it usually needs to be packed and stored into a memory space, read from that memory space during the MatMul operation, and packed again into the kernel's internal memory, which requires redundant memory access operations and increases the power consumption of the device. In this implementation, the arranged data of the first block matrix is stored directly into the internal memory space of the kernel, which reduces redundant memory accesses and reduces the power consumption of the device.
It should be noted that, in the process of directly storing the arranged data of the first block matrix into the internal memory space of the kernel, a cache for transferring the im2col result may first be allocated, and the packed data of the first block matrix is then transferred to the target memory space through this cache. Since the data is contiguous and the buffer is small, the read-write time can be considered negligible.
Substep 1032: acquiring a second block matrix from the data of the arranged convolution kernel based on the second blocking parameter.
In this embodiment, after each thread obtains the first block matrix, the corresponding second block matrix may be acquired from the data of the arranged convolution kernel based on the second blocking parameter. Referring to the example in step 101, the second blocking parameter may set the height of the second block matrix to oc_tile_size, so that the size of the second block matrix is oc_tile_size × ic × fh × fw; its height is oc_tile_size and its width is ic × fh × fw.
Substep 1033: multiplying the second block matrix by the first block matrix to obtain an operation result.
In this embodiment, after each thread obtains the first block matrix through substep 1031 and the second block matrix through substep 1032, the second block matrix may be multiplied by the first block matrix through the kernel of the MatMul operation. For example, the size of the first block matrix is ic × fh × fw × ohw_tile_size and the size of the second block matrix is oc_tile_size × ic × fh × fw. Letting K = ic × fh × fw, the size of the first block matrix is K × ohw_tile_size and the size of the second block matrix is oc_tile_size × K. Multiplying the second block matrix by the first block matrix is thus multiplying a matrix of size oc_tile_size × K by a matrix of size K × ohw_tile_size, which yields an operation result of size oc_tile_size × ohw_tile_size. The amount of memory occupied by the operation result is oc_tile_size × ohw_tile_size × sizeof(datatype) bytes.
It should be noted that the kernel used to perform the MatMul operation differs for different data types (e.g., int8 and uint8); therefore, the kernel for performing the MatMul operation may be selected according to the data type, and the operation performed through that kernel.
In some optional implementations of this embodiment, after a thread performs the im2col operation to obtain the first block matrix, if the packed data of the first block matrix has been stored in the internal memory space of the MatMul operation core function (the target memory space), the first block matrix may be obtained directly from the target memory space and multiplied with the second block matrix to obtain the operation result. The pack therefore does not need to be performed again, which reduces redundant memory accesses and the power consumption of the device.
In some optional implementations of this embodiment, after a thread executes substep 1033 and obtains the operation result, the operation result may be stored into the target memory space. When storing the data into the target memory space, a cache may first be allocated; the operation results of the threads are then stored into the cache; finally, the operation results in the cache are copied to the target memory space so that they can be written into the target memory space contiguously.
It should be noted that, if only the second blocking parameter is set and the first blocking parameter is not set (for example, the convolution kernel is partitioned only along the oc dimension and the oh × ow dimension is not partitioned in the im2col process), no row interleaving is involved in the storage process and the memory can be arranged contiguously, so no cache needs to be allocated to transfer the result and the operation result may be stored directly into the target memory space.
In some optional implementations of this embodiment, storing the operation result into the target memory space may mean storing it after post-processing. Specifically, the operation result and a pre-trained offset matrix (bias) may first be summed to obtain a summed operation result; the summed operation result is then input into an activation function to obtain a final operation result; finally, the final operation result is stored into the target memory space. In the prior art, the operation result and the pre-trained offset matrix are usually summed, the summed result is first stored into memory, then read back and input into the activation function, and the final result is finally stored into memory; this process performs two rounds of memory reads and writes, so the device power consumption is high. In this embodiment, the operation result summed with the offset matrix can be input directly into the activation function, thereby reducing the number of memory reads and writes and reducing the power consumption of the device.
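A sketch of that fused post-processing follows, assuming the bias is a per-output-channel vector and using ReLU purely as a stand-in activation (the patent does not name one):

```python
import numpy as np

def postprocess_block(block_result, bias_block, activation=lambda t: np.maximum(t, 0)):
    # block_result: oc_tile_size x ohw_tile_size tile from the MatMul step.
    # bias_block:   the oc_tile_size bias entries for the same output channels.
    # Bias add and activation happen in one pass, without writing the summed
    # intermediate back to memory and reading it again.
    return activation(block_result + bias_block[:, None])
```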
Step 104: generating an output feature map based on the operation results obtained by the threads.
In this embodiment, the operation results obtained by the threads may be combined to obtain the output feature map. For example, if the size of each first block matrix is ic × fh × fw × ohw_tile_size, there are oh × ow/ohw_tile_size first block matrices; if the size of each second block matrix is oc_tile_size × ic × fh × fw, there are oc/oc_tile_size second block matrices; there are therefore (oh × ow/ohw_tile_size) × (oc/oc_tile_size) operation results in total, each of size oc_tile_size × ohw_tile_size and occupying oc_tile_size × ohw_tile_size × sizeof(datatype) bytes of memory. Each operation result can be written contiguously into its corresponding memory address space, and the output feature map is obtained by combining the operation results.
It should be noted that before the operation result is stored in the memory by each thread, post-processing may also be performed, for example, the operation result is summed with an offset matrix obtained by pre-training, and the summed operation result is input to the activation function, so as to obtain a final operation result after the post-processing. In this case, the final operation results may be combined to obtain an output feature map.
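An assembly step in the spirit of step 104 might look like the following sketch; the dictionary keyed by (channel offset, position offset) is an assumed bookkeeping device for illustration, not the patent's memory layout.

```python
import numpy as np

def assemble_output(results, oc, oh, ow):
    # results: {(m0, n0): block}, where block is the oc_tile x ohw_tile operation
    # result for output channels starting at m0 and output positions starting at n0.
    out = np.empty((oc, oh * ow), dtype=next(iter(results.values())).dtype)
    for (m0, n0), block in results.items():
        out[m0:m0 + block.shape[0], n0:n0 + block.shape[1]] = block
    return out.reshape(oc, oh, ow)
```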
In the method provided by the above embodiment of the present application, a first blocking parameter and a second blocking parameter are selected based on a preset number of threads; then, arranging data in the convolution kernel in an off-line mode; then, the following operation steps are executed on line through each thread respectively: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; matrix multiplication is carried out on the second block matrix and the first block matrix to obtain an operation result; thus, an output feature map is generated based on the operation results obtained by the threads. On one hand, the data volume of the first block matrix obtained after the input feature map blocks are subjected to column vector conversion is greatly reduced, so that the memory occupation of im2col operation results is reduced, and the network operation speed is increased; meanwhile, the operation amount can be reduced in the MatMul operation process by partitioning the convolution kernel, and the network operation speed is further improved. On the other hand, the data arrangement (pack) operation in the convolution kernel is completed in advance before the thread execution operation step, and the MatMul operation core function does not need to perform matrix arrangement operation, so that the data in the arranged convolution kernel does not need to be stored in the internal memory of the MatMul operation core function, the repeated distribution and release of the internal memory of the MatMul operation core function are avoided, and the equipment power consumption is reduced.
With further reference to fig. 3, as an implementation of the method shown in the above figures, the present application provides an embodiment of a feature map processing apparatus, which corresponds to the embodiment of the method shown in fig. 1, and which is specifically applicable to various electronic devices.
As shown in fig. 3, the feature map processing apparatus 300 of the present embodiment includes: a selecting unit 301 configured to select, based on a preset number of threads, a first blocking parameter and a second blocking parameter, where the first blocking parameter is used to guide a blocking operation when performing column vector conversion on an input feature map, and the second blocking parameter is used to guide a blocking operation on data of a convolution kernel; an arranging unit 302 configured to arrange the data in the convolution kernel offline; an operation unit 303 configured to perform the following operation steps on-line by multithreading: performing column vector conversion on the input feature map blocks based on the first block dividing parameters to obtain a first block dividing matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernel based on the second partitioning parameter; performing matrix multiplication on the second block matrix and the first block matrix to obtain an operation result; the generation unit 304 is configured to generate an output feature map based on the operation result obtained by each thread.
In some optional implementations of this embodiment, the first blocking parameter is used to set a width of the first blocking matrix, and the width of the first blocking matrix is determined by dividing a product of a height and a width of the output feature map; the second partitioning parameter is used for setting the height of the second partitioning matrix, and the height of the second partitioning matrix is determined by dividing the number of channels of the output characteristic diagram; the height of the first block matrix and the width of the second block matrix are both the product of the number of channels of the initial feature map, the width of the convolution kernel, and the height of the convolution kernel.
In some optional implementations of this embodiment, the apparatus further includes: and the allocation unit is configured to determine the total amount of memory required by the thread to execute the operation step based on the first partitioning parameter and the second partitioning parameter, and allocate a target memory space based on the total amount of memory, wherein the target memory space is used for storing data generated in the process of executing the operation step by the thread.
In some optional implementations of this embodiment, the allocation unit is further configured to: determining a first memory amount required for storing a first blocking matrix obtained by a single thread through column vector conversion based on the first blocking parameter; determining a second memory amount required for storing an operation result obtained by matrix multiplication of a single thread based on the first partitioning parameter and the second partitioning parameter; determining a third memory amount required by the single thread to arrange target data in the input characteristic diagram, wherein the target data is required by converting to obtain a first blocking matrix; selecting the maximum memory amount of the first memory amount and the second memory amount, taking the sum of the maximum memory amount and the third memory amount as the memory amount required by a single thread, and determining the product of the memory amount required by the single thread and the thread number as the total memory amount.
In some optional implementations of the present embodiment, the operation unit 303 is further configured to: and arranging the data in the first block matrix, and storing the arranged data in the first block matrix to the target memory space. And the operation unit 303 is further configured to obtain the first block matrix from the target memory space, and multiply the first block matrix by the second block matrix to obtain an operation result.
In some optional implementations of the present embodiment, the operation unit 303 is further configured to: and storing the operation result to the target memory space.
In some optional implementations of this embodiment, the operation unit 303 is further configured to store the operation result to the target memory space by: distributing a cache; storing the operation results of all threads to the cache; and copying the operation result in the cache to the target memory space so as to continuously write the operation result into the target memory space.
In some optional implementations of the present embodiment, the operation unit 303 is further configured to: summing the operation result and an offset matrix obtained by pre-training to obtain a summed operation result; inputting the summed operation result into an activation function to obtain a final operation result; and storing the final operation result to the target memory space.
In some optional implementations of this embodiment, the apparatus further includes: and the filling unit is configured to fill the edge of the input feature map.
In the apparatus provided in the above embodiment of the present application, the first partition parameter and the second partition parameter are selected based on a preset number of threads; then, arranging data in the convolution kernel in an off-line mode; then, the following operation steps are executed on line through each thread respectively: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; matrix multiplication is carried out on the second block matrix and the first block matrix to obtain an operation result; thus, an output feature map is generated based on the operation results obtained by the threads. On one hand, the data volume of the first block matrix obtained after the input feature map blocks are subjected to column vector conversion is greatly reduced, so that the memory occupation of im2col operation results is reduced, and the network operation speed is increased; meanwhile, the operation amount can be reduced in the MatMul operation process by partitioning the convolution kernel, and the network operation speed is further improved. On the other hand, the data arrangement (pack) operation in the convolution kernel is completed in advance before the thread execution operation step, and the MatMul operation core function does not need to perform matrix arrangement operation, so that the data in the arranged convolution kernel does not need to be stored in the internal memory of the MatMul operation core function, the repeated distribution and release of the internal memory of the MatMul operation core function are avoided, and the equipment power consumption is reduced.
Referring now to FIG. 4, shown is a block diagram of a computer system 400 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)401 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A driver 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is mounted into the storage section 408 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 401. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The units described may also be provided in a processor, where the names of the units do not in some cases constitute a limitation of the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: selecting a first blocking parameter and a second blocking parameter based on a preset thread number, wherein the first blocking parameter is used for guiding blocking operation when column vector conversion is carried out on an input feature map, and the second blocking parameter is used for guiding blocking operation on data of a convolution kernel; arranging data in the convolution kernel in an off-line manner; through multithreading, the following operation steps are executed on line: based on the first block parameters, performing column vector conversion on the input feature map blocks to obtain a first block matrix; acquiring a second partitioning matrix from the data of the arranged convolution kernels based on the second partitioning parameters; multiplying the second block matrix with the first block matrix to obtain an operation result; an output feature map is generated based on the operation results obtained by the threads.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention disclosed herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with, but are not limited to, features having similar functions disclosed in the present application.

Claims (12)

1. A method for feature map processing, the method comprising:
selecting a first blocking parameter and a second blocking parameter based on a preset number of threads, wherein the first blocking parameter is used to guide a blocking operation when column vector conversion is performed on an input feature map, and the second blocking parameter is used to guide a blocking operation on data of a convolution kernel;
arranging the data in the convolution kernel offline;
performing, online through multiple threads, the following operation steps: performing block-wise column vector conversion on the input feature map based on the first blocking parameter to obtain a first block matrix; obtaining a second block matrix from the arranged data of the convolution kernel based on the second blocking parameter; and multiplying the second block matrix by the first block matrix to obtain an operation result; and
generating an output feature map based on the operation results obtained by the threads.
2. The method of claim 1, wherein the first blocking parameter is used to set a width of the first block matrix, and the width of the first block matrix is determined by partitioning the product of the height and the width of the output feature map;
the second blocking parameter is used to set a height of the second block matrix, and the height of the second block matrix is determined by partitioning the number of channels of the output feature map;
the height of the first block matrix and the width of the second block matrix are both the product of the number of channels of the initial feature map, the width of the convolution kernel, and the height of the convolution kernel.
3. The method of claim 1, wherein after the selecting of the first blocking parameter and the second blocking parameter, the method further comprises:
determining a total amount of memory required by the threads to perform the operation steps based on the first blocking parameter and the second blocking parameter, and allocating a target memory space based on the total amount of memory, wherein the target memory space is used to store data generated while the threads perform the operation steps.
4. The method of claim 3, wherein determining the total amount of memory required by the threads to perform the operation steps based on the first blocking parameter and the second blocking parameter comprises:
determining, based on the first blocking parameter, a first memory amount required to store a first block matrix obtained by a single thread through column vector conversion;
determining, based on the first blocking parameter and the second blocking parameter, a second memory amount required to store an operation result obtained by a single thread through matrix multiplication;
determining a third memory amount required by a single thread to arrange target data in the input feature map, wherein the target data is the data required for conversion into the first block matrix;
selecting the larger of the first memory amount and the second memory amount, taking the sum of that larger amount and the third memory amount as the memory amount required by a single thread, and determining the product of the memory amount required by a single thread and the number of threads as the total amount of memory.
5. The method of claim 3, wherein after the block-wise column vector conversion of the input feature map based on the first blocking parameter to obtain the first block matrix, the method further comprises:
arranging the data in the first block matrix, and storing the arranged data of the first block matrix to the target memory space;
and the multiplying of the second block matrix by the first block matrix to obtain an operation result comprises:
obtaining the first block matrix from the target memory space, and multiplying the second block matrix by the first block matrix to obtain the operation result.
6. The method of claim 3, wherein after the obtaining of the operation result, the operation steps further comprise:
storing the operation result to the target memory space.
7. The method of claim 6, wherein storing the operation result to the target memory space comprises:
allocating a cache;
storing the operation results of the threads to the cache;
copying the operation results in the cache to the target memory space, so that the operation results are written into the target memory space contiguously.
8. The method of claim 6, wherein storing the operation result to the target memory space comprises:
summing the operation result and a bias matrix obtained by pre-training to obtain a summed operation result;
inputting the summed operation result into an activation function to obtain a final operation result;
storing the final operation result to the target memory space.
9. The method of claim 1, wherein before the operation steps are performed online through multiple threads, the method further comprises:
padding the edges of the input feature map.
10. A feature map processing apparatus, characterized in that the apparatus comprises:
a selecting unit configured to select a first blocking parameter and a second blocking parameter based on a preset number of threads, wherein the first blocking parameter is used to guide a blocking operation when column vector conversion is performed on an input feature map, and the second blocking parameter is used to guide a blocking operation on data of a convolution kernel;
an arrangement unit configured to arrange the data in the convolution kernel offline;
an arithmetic unit configured to perform, online through multiple threads, the following operation steps: performing block-wise column vector conversion on the input feature map based on the first blocking parameter to obtain a first block matrix; obtaining a second block matrix from the arranged data of the convolution kernel based on the second blocking parameter; and multiplying the second block matrix by the first block matrix to obtain an operation result;
a generation unit configured to generate an output feature map based on the operation results obtained by the threads.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
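As a reading aid only, and not as part of the claims, the dimension bookkeeping of claim 2 and the per-thread memory sizing of claim 4 might be sketched as follows. The function name plan_memory, the element size, and the assumption that the re-arranged input data (the third memory amount) occupies the same space as the first block matrix are illustrative only.

def plan_memory(out_h, out_w, out_c, in_c, kh, kw,
                n_block, k_block, n_threads, elem_size=4):
    # n_block: width of the first block matrix, a tile of out_h * out_w (claim 2).
    # k_block: height of the second block matrix, a tile of out_c (claim 2).
    k_dim = in_c * kh * kw                           # shared dimension of both block matrices (claim 2)
    first_amount = k_dim * n_block * elem_size       # first memory amount: im2col block of one thread
    second_amount = k_block * n_block * elem_size    # second memory amount: GEMM result of one thread
    third_amount = k_dim * n_block * elem_size       # third memory amount: re-arranged input data (assumed size)
    per_thread = max(first_amount, second_amount) + third_amount   # claim 4
    return per_thread * n_threads                    # total size of the target memory space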
CN202010632513.XA 2020-07-03 2020-07-03 Feature map processing method and device, electronic equipment and computer readable medium Pending CN113888390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010632513.XA CN113888390A (en) 2020-07-03 2020-07-03 Feature map processing method and device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010632513.XA CN113888390A (en) 2020-07-03 2020-07-03 Feature map processing method and device, electronic equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN113888390A true CN113888390A (en) 2022-01-04

Family

ID=79012624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010632513.XA Pending CN113888390A (en) 2020-07-03 2020-07-03 Feature map processing method and device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN113888390A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721006A (en) * 2022-02-28 2023-09-08 格兰菲智能科技有限公司 Feature map processing method and device

Similar Documents

Publication Publication Date Title
CN107844828B (en) Convolution calculation method in neural network and electronic device
CN108108811B (en) Convolution calculation method in neural network and electronic device
KR102492477B1 (en) Matrix multiplier
CN110689115B (en) Neural network model processing method and device, computer equipment and storage medium
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
JP7325158B2 (en) Data Representation for Dynamic Accuracy in Neural Network Cores
CN112840356A (en) Operation accelerator, processing method and related equipment
CN110415157B (en) Matrix multiplication calculation method and device
CN111465943B (en) Integrated circuit and method for neural network processing
US11763131B1 (en) Systems and methods for reducing power consumption of convolution operations for artificial neural networks
JP7332247B2 (en) Central scheduler and instruction dispatcher for neural inference processors
CN112633490B (en) Data processing device, method and related product for executing neural network model
US20230196113A1 (en) Neural network training under memory restraint
CN113313247A (en) Operation method of sparse neural network based on data flow architecture
CN110414672B (en) Convolution operation method, device and system
CN109902821B (en) Data processing method and device and related components
US11900577B2 (en) Processing apparatus for performing processing using a convolutional neural network
CN113655986B (en) FFT convolution algorithm parallel implementation method and system based on NUMA affinity
CN114461978A (en) Data processing method and device, electronic equipment and readable storage medium
CN113888390A (en) Feature map processing method and device, electronic equipment and computer readable medium
CN113885941A (en) Singular value decomposition operation implementation method, device and related equipment
CN113869495A (en) Method, device and equipment for optimizing convolutional weight layout of neural network and readable medium
CN112200310A (en) Intelligent processor, data processing method and storage medium
KR20210014561A (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN115860080A (en) Computing core, accelerator, computing method, device, equipment, medium and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination