WO2022160310A1 - Data processing method and processor - Google Patents

Data processing method and processor

Info

Publication number
WO2022160310A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
calculation
data
computing
convolution
Prior art date
Application number
PCT/CN2021/074548
Other languages
French (fr)
Chinese (zh)
Inventor
熊旭红
石洁珂
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2021/074548 (WO2022160310A1)
Priority to CN202180077853.3A (CN116472537A)
Publication of WO2022160310A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation using electronic means

Definitions

  • the embodiments of the present application relate to the technical field of artificial intelligence, and in particular, to a data processing method and processor.
  • neural networks have been widely used in image classification, video processing, speech recognition, data analysis and other scenarios.
  • Consider a processor processing an image using a neural network. Because the feature map (FM) of the image to be processed contains a large amount of data, it generally cannot be stored in the processor's local cache. The feature map data can therefore be stored in an external memory with a large storage space.
  • the processor can read the feature map data of the image (for example, called the original input feature map) from the external memory into the processor and perform calculation according to the neural network model. After obtaining the calculation results (eg, referred to as output feature maps), the processor may store the output feature maps in external memory.
  • a neural network model generally includes multiple different computing layers, such as convolutional layers, pooling/activation layers, etc.
  • the calculation process of each computing layer is different.
  • One computing layer may correspond to a kernel function (kernel); according to the kernel function, the processor may process the input feature map of that computing layer and obtain the corresponding output feature map.
  • The output feature map can be stored in the external memory, so that when the next layer's calculation is performed, the data stored in the external memory can be read as the input feature map of the current computing layer.
  • In this process, the processor needs to read and write large amounts of data from and to the external memory many times, which brings considerable power consumption to the device performing the neural network calculation.
  • This overhead grows with the number of objects to be processed (such as the number of images to be processed) and with their complexity (such as the amount of data in the feature maps of the images to be processed).
  • Embodiments of the present application provide a data processing method and processor, so as to reduce the power consumption of neural network computing.
  • the following technical solutions are adopted in the embodiments of the present application.
  • a data processing method is provided.
  • the method is applied to a processor that performs neural network computing, where the neural network includes N computing layers, where N is an integer greater than or equal to 2.
  • a local cache is provided in the processor.
  • the method includes: acquiring first data, where the first data is used to perform a first calculation process of a first calculation layer, where the first calculation layer is any one of the N calculation layers.
  • the first data is stored in the first line cache of the first computing layer, and the first line cache of the first computing layer is included in the local cache.
  • The first calculation stroke of the first computing layer is calculated to obtain second data corresponding to the first calculation stroke of the first computing layer, where the first calculation stroke of the first computing layer includes convolution calculation, using the convolution window of the first computing layer, on one or more rows of data of the first data.
  • The second data is stored in the first line cache of the second computing layer, the first line cache of the second computing layer is included in the local cache, and the second computing layer is the computing layer after the first computing layer among the N computing layers.
  • If the accumulated data stored in the first line cache of the second computing layer can support the first calculation stroke of the second computing layer, the first calculation stroke of the second computing layer is calculated to obtain fifth data corresponding to the first calculation stroke of the second computing layer, where the first calculation stroke of the second computing layer includes convolution calculation, using the convolution window of the second computing layer, on one or more rows of data of the second data.
  • the processor may acquire data required to perform one computation process when performing convolution computation of one computation layer.
  • A calculation stroke can be the calculation performed as the convolution window slides once from the leftmost position to the rightmost position.
  • Before performing the convolution calculation of a computing layer, the processor only needs to obtain A rows of data to start the calculation; it does not need to obtain the full input feature map required by the current computing layer. Since A rows of data are very small, they can be stored in the local cache instead of external memory (such as DDR).
  • The A rows of data can be read directly from the local cache, and one stroke of the current computing layer can be calculated accordingly.
  • The A rows of data may be the calculation results of the previous computing layer.
  • Since the previous computing layer only needs to compute those A rows of data, the intermediate data between the previous computing layer and the current computing layer does not need to be written into the DDR and later read back from the DDR by the processor. Instead, as the previous computing layer obtains the A rows of data by calculation, it can store them directly in the line cache configured for the current computing layer in the local cache.
  • the intermediate data does not need to be written into the DDR, and therefore does not need to be read from the DDR when performing the computation of the current layer.
  • Reading and writing data in the local cache avoids a large number of data interactions with the DDR, thereby saving power consumption.
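  • A minimal illustrative sketch of a single calculation stroke, not taken from the application itself: it assumes the A rows currently buffered for a layer are plain Python lists and that the kernel has the same number of rows as the line cache. Function and variable names are made up for illustration.

```python
def compute_stroke(line_cache, kernel, stride=1):
    """One calculation stroke: slide an A x B convolution window from the
    leftmost to the rightmost position over the A rows held in a layer's
    line cache, producing one row of that layer's output feature map."""
    a = len(kernel)            # rows of the convolution window
    b = len(kernel[0])         # columns of the convolution window
    width = len(line_cache[0])
    out_row = []
    for col in range(0, width - b + 1, stride):
        acc = 0.0
        for i in range(a):
            for j in range(b):
                acc += kernel[i][j] * line_cache[i][col + j]
        out_row.append(acc)
    return out_row
```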
  • The method further includes: if the accumulated data cannot support the first calculation stroke of the second computing layer, calculating a second calculation stroke of the first computing layer, where the second calculation stroke of the first computing layer is the calculation stroke after the first calculation stroke of the first computing layer.
  • a fallback mechanism in the inter-layer computing process is provided.
  • The processor may determine whether one stroke of the current layer can be calculated. If the data stored in the line cache corresponding to the current computing layer cannot support the calculation of one stroke of the current layer, the processor can fall back to the previous layer and continue with its next stroke, so as to obtain a new row of calculation results and update it into the line cache of the current computing layer.
  • The processor can loop over this scheme: determine whether the data stored in the current line cache can support the current computing layer in completing one stroke; if so, execute one stroke of the current computing layer; if not, return to the previous computing layer and perform its calculation.
  • A similar judgment-and-fallback mechanism can be implemented for the subsequent computing layers, so that the system's calculation does not get stuck at any one computing layer, and each computing layer only needs to occupy a number of line caches corresponding to the number of rows of its convolution window.
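  • A minimal sketch of this judgment-and-fallback flow, not from the application: each layer is modelled as a small dict, compute_stroke() is reused from the earlier sketch, and the fallback is expressed as a pull from the later layer toward the earlier ones, which produces the same ordering of strokes. For brevity the sketch keeps all produced rows in each cache; the scheme described here would instead overwrite rows that are no longer needed.

```python
def can_run_stroke(layer):
    # rows that must be buffered before this layer's next stroke can start
    needed = layer["strokes_done"] * layer["stride"] + len(layer["kernel"])
    return len(layer["cache"]) >= needed

def produce_stroke(layers, i, read_input_row):
    """Produce one output row (one stroke) of layer i. Whenever layer i's
    line cache lacks data, fall back to layer i-1, or to external memory
    (read_input_row) for layer 0."""
    layer = layers[i]
    while not can_run_stroke(layer):
        if i == 0:
            layer["cache"].append(read_input_row())                    # read from "DDR"
        else:
            layer["cache"].append(produce_stroke(layers, i - 1, read_input_row))
    top = layer["strokes_done"] * layer["stride"]                      # vertical offset of this stroke
    window = layer["cache"][top: top + len(layer["kernel"])]
    layer["strokes_done"] += 1
    return compute_stroke(window, layer["kernel"], layer["stride"])
```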
  • the number of rows in the first row cache is equal to the number of rows in the convolution window of the first computational layer.
  • the first line cache is used to store the first data, which may be the storage space configured for the first computing layer in the local cache of the processor, and is used to store the calculated data of any one stroke of the first computing layer.
  • the number of lines in the first line buffer needs to be at least equal to the number of lines in the convolution window of the first convolution calculation layer, so that enough data can be stored to perform one-stroke calculation.
  • Acquiring the first data includes: reading the first data from an external memory, where the first data is at least a portion of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. This solution provides a data acquisition mechanism for the case where the first computing layer is the first layer of the neural network. Since the amount of data of the input feature map is generally large, it can be stored in an external memory (such as DDR) that can exchange data with the processor. Before executing a calculation stroke of the first computing layer, the processor may read the corresponding data from the DDR and write it into the line cache configured for the first computing layer, so as to perform that calculation stroke.
  • In a case where the first data is a part of the input feature map stored in an external memory, the method further includes: acquiring third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation stroke of the first computing layer. The third data is stored overwriting fourth data, where the fourth data is the data in the first data that no longer participates in the calculation of the first computing layer.
  • a mechanism for dynamically adjusting the data in the line cache is provided. In this example, every time a calculation in a calculation trip is completed, some data in the first data will not be used again in subsequent calculations.
  • The processor can read some new data from the DDR and overwrite the data that will not be used in subsequent calculations. In this way, after the current calculation stroke is completed, the corresponding line cache already holds data that can be used for a new calculation stroke. It should be noted that, in some embodiments of the present application, this data replacement may be performed after a calculation stroke is completed, or during the execution of a calculation stroke.
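  • A minimal sketch of this line-cache update, not from the application: the external memory is stood in for by a callable that returns the next input row, and a deque models the physical line caches so that the S oldest rows are overwritten by freshly read rows after each stroke. Names are illustrative.

```python
from collections import deque

def refresh_line_cache(cache, stride, read_row):
    """cache: deque holding the A rows currently buffered for a layer.
    stride: vertical stride S of the layer (rows retired per stroke).
    read_row: callable returning the next row from external memory (DDR)."""
    for _ in range(stride):
        cache.popleft()            # drop a row that no longer participates
        cache.append(read_row())   # overwrite its slot with a newly read row
    return cache
```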
  • Storing the second data in the first line cache of the second computing layer includes: in the process of performing the first calculation stroke of the first computing layer, each time the calculation result of the convolution window of the first computing layer at one position is obtained, storing that calculation result in the first line cache of the second computing layer. Based on this solution, a writing mechanism for the second data is provided.
  • Each calculation result may be stored in the corresponding position of the line cache of the second computing layer. In this way, after one calculation stroke of the first computing layer is completed, one or more rows of data stored in the line cache of the second computing layer can be obtained for performing the calculation of the second computing layer.
  • The method further includes: storing the fifth data in the first line cache of the third computing layer, where the first line cache of the third computing layer is included in the local cache; the third computing layer is the computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • This provides the calculation mechanism for the other computing layers included in the neural network. For example, each time the second computing layer completes one stroke, the calculation results may be stored in the line cache corresponding to the next computing layer (eg, the third computing layer), so that one stroke of the third computing layer can be performed once enough data has been obtained.
  • In a second aspect, a data processing apparatus is provided, which is used to perform neural network computation, where the neural network includes N computing layers, and N is an integer greater than or equal to 2.
  • the data processing device is provided with a local cache.
  • the device includes: an acquisition unit for acquiring first data, where the first data is used to perform a first calculation process of a first calculation layer, where the first calculation layer is any one of the N calculation layers.
  • the storage unit is configured to store the first data in the first line cache of the first computing layer, and the first line cache of the first computing layer is included in the local cache.
  • A computing unit is configured to calculate the first calculation stroke of the first computing layer to obtain second data corresponding to the first calculation stroke of the first computing layer, where the first calculation stroke of the first computing layer includes using the convolution window of the first computing layer to perform convolution calculation on one or more rows of data of the first data.
  • The storage unit is further configured to store the second data in the first line cache of the second computing layer, where the first line cache of the second computing layer is included in the local cache, and the second computing layer is the computing layer after the first computing layer among the N computing layers.
  • The computing unit is further configured to calculate the first calculation stroke of the second computing layer when the accumulated data stored in the first line cache of the second computing layer can support the first calculation stroke of the second computing layer, to obtain fifth data corresponding to the first calculation stroke of the second computing layer, where the first calculation stroke of the second computing layer includes convolution calculation, using the convolution window of the second computing layer, on one or more rows of data of the second data.
  • The computing unit is further configured to calculate the second calculation stroke of the first computing layer when the accumulated data cannot support the first calculation stroke of the second computing layer, where the second calculation stroke of the first computing layer is the calculation stroke after the first calculation stroke of the first computing layer.
  • the number of rows in the first row cache is equal to the number of rows in the convolution window of the first computational layer.
  • The acquisition unit is configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.
  • In a case where the first data is a part of the input feature map stored in an external memory, the acquisition unit is further configured to acquire third data from the external memory, where the third data is another part of the input feature map, and the third data is used to perform the second calculation stroke of the first computing layer.
  • The third data is stored overwriting fourth data, where the fourth data is the data in the first data that no longer participates in the calculation of the first computing layer.
  • The storage unit is further configured to, in the process of performing the first calculation stroke of the first computing layer, store, each time the calculation result of the convolution window of the first computing layer at one position is obtained, that calculation result in the first line cache of the second computing layer.
  • the acquisition unit is further configured to acquire fifth data corresponding to the first calculation journey of the second calculation layer.
  • The storage unit is further configured to store the fifth data in the first line cache of the third computing layer, where the first line cache of the third computing layer is included in the local cache; the third computing layer is the computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • In a third aspect, a processor is provided, comprising one or more computing cores and a local cache, where the processor is configured to implement the data processing method of the first aspect or any one of its possible designs.
  • In a fourth aspect, an electronic device is provided, which includes one or more processors as described in the third aspect and one or more memories.
  • the memory is coupled to the processor, the memory stores computer instructions.
  • When the processor executes the computer instructions, the electronic device is caused to execute the data processing method described in the first aspect or any one of its possible designs.
  • In a fifth aspect, a computer-readable storage medium is provided, which includes computer instructions; when the computer instructions are executed, the data processing method described in the first aspect or any one of its possible designs is performed.
  • Any one of the designs and possible designs of the above second aspect to fifth aspect corresponds to the above first aspect or one of its possible designs, and can therefore bring about similar technical effects, which are not repeated here.
  • FIG. 1 is a schematic structural diagram of a convolutional neural network
  • FIG. 2 is a schematic structural diagram of a neural network computing device according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a convolution layer provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a calculation logic provided by an embodiment of the present application.
  • FIG. 6 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 7 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a line cache provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another line cache provided by an embodiment of the present application.
  • FIG. 10 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 11 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a neural network provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of a computing logic sequence in a single-core and multi-core scenario provided by an embodiment of the present application
  • FIG. 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 shows a schematic structural diagram of a convolutional neural network.
  • a convolutional layer including one or more convolutional computation layers and a pooling/activation layer including one or more computation layers may be provided in a convolutional neural network.
  • the processor can perform convolution calculation on the input feature map according to the convolution kernel corresponding to each convolution calculation layer in the convolution layer.
  • For example, the processor may slide the convolution window corresponding to a preset convolution kernel over the input feature map with a preset stride, so as to obtain the calculation result at each window position.
  • the calculation results can be combined into corresponding output feature maps.
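  • A minimal sketch of this conventional whole-layer convolution, not from the application: the full input feature map is swept by the kernel window at a given stride and every window result is collected into the output feature map. Names and values are illustrative only.

```python
def conv_layer(feature_map, kernel, stride=1):
    """Slide a kh x kw kernel over the whole input feature map and combine
    the per-position results into the output feature map."""
    h, w = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, h - kh + 1, stride):
        row = []
        for c in range(0, w - kw + 1, stride):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * feature_map[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out
```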
  • In the pooling/activation layer, the processor can process the input feature map according to a specific function, such as pooling and/or activation processing.
  • the convolutional neural network may further include an input layer.
  • the input layer can be used to store the feature map data of the image to be processed, so that the convolutional layer can obtain the input feature map from the input layer.
  • the input layer may be located in external memory connected to the processor that performs the convolution computations.
  • multiple convolutional computational layers in a convolutional layer can be set up interleaved with computational layers in a pooling/activation layer. For example, as shown in Figure 1, after performing a part of the convolution calculation, pooling or activation processing can be performed, and then the convolution calculation can be performed. In other implementations, convolution computations may be performed first, followed by selective pooling and/or activation processing. After the above calculation is completed, the result of completing one round of convolutional neural network calculation can be obtained, and the result is output through the output layer.
  • a local buffer can be set inside the processor.
  • This local cache can be used to store small amounts of data.
  • the data stored in the local cache has the characteristics of fast read and write.
  • the local cache can be used to store data such as convolution kernels corresponding to each computing layer of the convolutional neural network model.
  • the raw input feature maps can be stored in an external memory connected to the processor.
  • the external memory may be a storage medium with larger storage space.
  • the storage medium may be a double-rate synchronous dynamic random-access memory (Double Data Rate synchronous dynamic random-access memory, DDR SDRAM) and the like.
  • the DDR SDRAM may also be referred to as DDR for short.
  • the processor can read the raw input feature map from the DDR for neural network computation when it starts computation.
  • the output feature map of the previous layer can be used as the input feature map of the next layer.
  • the feature map data between two computing layers can also be called intermediate data.
  • Although the data volume of the intermediate data in the neural network calculation is generally no larger than that of the original input feature map, it still far exceeds the storage capacity of the local cache.
  • If the processor writes the intermediate data into the DDR, the processor must perform multiple read and write interactions with the DDR involving large amounts of data, which causes considerable power consumption.
  • the lack of read and write bandwidth will also limit the efficiency of neural network computing.
  • In one approach, the feature map data stored in the DDR can be split, so that the resulting feature map slices can be stored in the local cache.
  • the processor can read a slice from the DDR and store it in the local cache, and perform calculations on the slice's data.
  • the data amount of the intermediate data and the output feature map in the calculation process of a slice will not be greater than the data amount of the input feature map corresponding to the slice.
  • the processor can read the next slice from the DDR into the local cache, and repeat the above steps for calculation. This is repeated until all slice calculations are completed.
  • the output feature maps corresponding to multiple slices can be stored in the DDR.
  • the processor needs to combine the output feature maps corresponding to these slices respectively, thereby obtaining a complete output feature map corresponding to the original input feature map.
  • In addition, adjacent slices need to include duplicate data. This duplicated data is calculated multiple times, which reduces the efficiency of the entire calculation process and leaves the power consumption insufficiently optimized.
  • The data processing method provided by the embodiments of the present application establishes a pipelined computing mechanism across different computing layers, so that the calculation of one layer does not need to be fully completed before the calculation of the next layer begins.
  • Such a solution can significantly reduce the amount of intermediate data, so that the intermediate data can be stored in the local cache, avoiding the power consumption overhead of multiple read and write interactions between the processor and an external memory (such as DDR).
  • the entire computing process does not need to wait, so the computing efficiency can be effectively improved.
  • In addition, because of the pipelined computing mechanism, this solution performs no repeated, redundant calculations on the data, so its computing efficiency is significantly higher than that of the prior art solution.
  • the data processing method provided by the embodiment of the present application provides a calculation method with a fallback mechanism, which can be applied to different convolution calculation scenarios with a step size greater than or equal to 1.
  • FIG. 2 is a schematic diagram of a logical structure of a neural network computing apparatus 200 according to an embodiment of the present application.
  • the neural network computing device 200 can be used to implement the computation of a neural network model including a convolutional neural network model according to the method provided by the embodiment of the present application.
  • the external storage module 230 coupled with the neural network computing device 200 is also shown in this FIG. 2 .
  • the neural network computing device 200 can perform read-write interaction with the external storage module 230 through the interface provided thereon.
  • For example, the feature map data to be processed (eg, the original input feature map) is read from the external storage module 230, and the output feature map data produced by the completed neural network calculation is written into the external storage module 230.
  • the neural network computing device 200 may be the processor described above.
  • the external memory module 230 may include the DDR described above.
  • the external storage module 230 may further include a system cache, and the system cache may be a system cache of the device provided with the neural network computing apparatus 200 shown in FIG. 2 .
  • the system cache may realize its function through different storage media.
  • For example, the system cache may be implemented via flash memory.
  • the system cache may also be other storage media such as a solid state disk (Solid State Device, SSD).
  • the neural network computing apparatus 200 may include a computing module 210 and a local cache 220 .
  • the computing module 210 may be a module in the neural network computing device 200 for implementing various computing functions.
  • the calculation module 210 may include a convolution calculation unit 211, an activation calculation unit 212, a pooling calculation unit 213, an Eltwise calculation unit 214, and the like.
  • the convolution calculation unit 211 can be used to perform convolution calculation.
  • the convolution calculation unit 211 may include one or more multiplier-adders, or other components capable of implementing convolution calculation.
  • the activation calculation unit 212 may be used to perform activation processing.
  • the pooling computing unit 213 can be used to implement the function of pooling processing.
  • the Eltwise computing unit 214 may be used to implement elementwise computing functions.
  • The structure of the calculation module 210 in the above example is only one possible implementation; for the calculation of different neural network models, the units included in the calculation module 210 to implement each function may be the same as or different from the above example.
  • the activation calculation unit 212 and the pooling calculation unit 213 may not be set in the calculation module 210 .
  • the Eltwise computing unit 214 may not be set in the computing module 210 .
  • The computing module 210 may be a neural-network processing unit (NPU), a field programmable gate array (FPGA), a central processing unit (CPU), a graphics processing unit (GPU), or another component that can implement the corresponding computing functions.
  • the computing module 210 may be a single-core NPU or a multi-core NPU having multiple computing cores. This embodiment of the present application does not limit this.
  • the processing logic when the data processing method provided in the embodiment of the present application is applied to a single-core NPU may be multiplexed on a multi-core NPU.
  • the parallel computing mechanism of the multi-core NPU can be used to further improve the computing efficiency.
  • the interconnection of the multiple computing cores may be implemented by means of an internal interconnection (interconnect).
  • For example, a network-on-chip (NoC) structure can be used to realize the interconnection of multiple computing cores.
  • The NoC interconnection can dynamically configure the connections between different computing cores according to the network structure, so that the calculation load can be distributed according to the computing pressure on each core. This enables dynamic scheduling of the computation and improves the computing efficiency of a multi-core NPU.
  • the neural network computing device 200 provided in the embodiment of the present application may further be provided with a local cache 220 .
  • the local cache 220 can be used for fast reading and writing of data.
  • the local cache 220 may be a storage medium with a smaller storage space.
  • the local cache 220 may be an internal cache of the NPU.
  • the local cache 220 may be used to support line buffer technology.
  • the local cache 220 may be configured with multiple line caches.
  • the multiple line buffers may respectively correspond to different computing layers in the neural network model.
  • the number of line buffers corresponding to one computing layer may be determined according to the window size of the kernel function of the computing layer. For example, take the convolution window of the convolution computing layer as M rows and N columns as an example.
  • an M-line cache may be configured for the convolutional computation layer.
  • For the other computing layers, corresponding line caches may also be configured in the local cache 220. Since the number of rows of a kernel window is generally small, the total number of line caches configured for all computing layers of the neural network model is not large, and the above configuration can be implemented within the space of the current local cache 220.
  • The line cache configuration of a computing layer may also take into account the stride of the current computing layer and of related computing layers, and/or special calculations that need to be carried out in the neural network model (such as element-wise calculation).
  • the specific configuration will be described in detail in the subsequent description.
  • In the following, the data processing method provided by the embodiments of the present application is described using an example in which the neural network model is the convolutional neural network model with the structure shown in FIG. 1, the computing module is a single-core NPU, the local cache is the cache inside the NPU, and the external storage module is a DDR. The process of performing each convolution computing layer in the convolutional layer with this data processing method is described in detail; the calculation of the other layers in the convolutional neural network model can refer to the calculation process in the convolutional layer.
  • the convolutional layers included in the convolutional neural network may be provided with N convolutional computing layers as shown in FIG. 3 .
  • the N convolutional computing layers can be layer 1, layer 2, . . . , layer N.
  • the input feature map of layer 1 can be the original input feature map.
  • the output feature map obtained after completing the convolution computation at layer N can be called the convolution output feature map.
  • When starting the convolution calculation, the NPU can be initialized.
  • the NPU may read data corresponding to the number of rows of the convolution window of layer 1 from the DDR and write it into the local cache, which is the row cache configured for layer 1.
  • the data of the i-th row and the j-th column of the original input feature map is denoted as a ij , and both i and j are integers greater than or equal to 1.
  • For example, if the convolution window of layer 1 has A1 rows, A1 line caches can be configured for layer 1 in the local cache.
  • the NPU can read the first A1 row data of the original input feature map from the DDR and write it into the row cache configured for layer 1 in the local cache.
  • the NPU can perform layer 1 convolution calculations on data written to the local cache. For example, the convolution window corresponding to layer 1 is used on the A1 line buffer, and the convolution calculation is performed in turn from left to right to obtain the convolution result of the corresponding position.
  • the calculation process performed by the convolution window from the leftmost to the rightmost may be referred to as the calculation of one run.
  • the calculation of one trip includes calculation processing of part of the data located in the window in one or more rows of data.
  • the result of the first row of the output feature map corresponding to layer 1 can be obtained.
  • the output feature map of this layer 1 can be used as the input feature map of layer 2. Therefore, each time the convolution calculation of a position is completed, the NPU can store the calculation result in the local cache, which is the corresponding position in the line cache configured for layer 2. For example, after completing the convolution calculation of the first stroke of layer 1, the line cache corresponding to layer 2 can store the data of the first line of the input feature map corresponding to layer 2.
  • In some embodiments, the NPU can read new data from the DDR as the convolution calculation in layer 1 progresses, to overwrite the data that will no longer be used in the calculation of layer 1, so that after the calculation of one stroke is completed, layer 1 can continue with the calculation of the next stroke without waiting for the NPU to read data from the DDR.
  • For example, take the stride of layer 1 as S1. After layer 1 completes the first convolution calculation of a stroke, the NPU can read from the DDR the data, in the first S1 columns, of the rows of the original input feature map that will newly be needed for the next stroke, and store it in the local cache at the position holding the data of the first S1 rows and first S1 columns, which no longer participates in the calculation.
  • the NPU may also read the data required for the next trip from the DDR after completing the calculation of one trip, and store it in the A1 line cache of layer 1.
  • In this way, the convolution calculation of one stroke of layer 1 can be completed, which yields the first row of the input feature map of layer 2. Take the convolution window of layer 2 as A2 rows and B2 columns as an example.
  • the NPU can continue to perform convolution calculations for other strokes at layer 1 according to the above scheme, until the A2 line data required for convolution calculation at layer 2 is obtained. That is to say, the NPU can perform the convolution calculation of A2 strokes at layer 1, thereby obtaining the A2 row data as the input feature map of layer 2 and storing it in the local cache in the A2 row cache configured for layer 2. Then, the NPU can start to perform the convolution calculation of the first pass of layer 2. Then, the input feature map of the first line of layer 3 is obtained and stored in the line cache configured for layer 3 in the local cache.
  • After layer 2 completes one stroke, layer 1 needs to supply the new input feature map data required for the convolution calculation of layer 2's next stroke. For example, take the stride of layer 2 as S2.
  • the row cache corresponding to layer 2 stores A2 row data obtained through calculation of A2 runs of layer 1.
  • the NPU can return to layer 1 to perform the calculation of the run from (A2+1) to (A2+S2) of layer 1.
  • The new S2 rows of data are thus obtained and stored in the line caches corresponding to layer 2.
  • the NPU can continue to perform the second stroke calculation in layer 2.
  • the first row of data of the output feature map of the convolutional layer can be obtained.
  • FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method may include at least two calculation processes (eg, Process 1 and Process 2).
  • the process 1 is the flow when the neural network calculation is started.
  • Process 2 is the subsequent flow once layer 2 is able to perform the calculation of one stroke.
  • The process may specifically include: S401. Calculate stroke 1 of layer 1.
  • S402 store the output feature map of run 1 of layer 1 into the line cache corresponding to layer 2 .
  • S403. It is determined that the calculation of the stroke 1 of the layer 2 cannot be performed in the layer 2, and the calculation is returned to the layer 1.
  • S404 calculate the stroke 2 of layer 1 .
  • In FIG. 4, the convolution window of layer 2 has 3 rows for illustration. Therefore, for layer 2 to perform one calculation stroke, layer 1 needs to perform 3 calculation strokes. Correspondingly, if the convolution window of layer 2 has A2 rows, then before layer 2 performs the calculation of its first stroke, layer 1 needs to perform the calculation of A2 strokes. This completes the steps of process 1. It can be understood that in process 1, since there is no data in the line cache corresponding to layer 2 at the start of the calculation, layer 1 must perform three consecutive strokes in order to obtain the data needed for layer 2 to execute one stroke.
  • Process 2 is described below. Since data is already stored in the line caches of layer 2, every time layer 1 completes the calculation of S2 strokes, it can update S2 rows in the line caches of layer 2, so that layer 2 can execute the calculation of its next stroke.
  • S409. Calculate the stroke 1 of layer 2.
  • S410 it is determined that the calculation of the stroke 2 of the layer 2 cannot be performed in the layer 2, and the calculation is returned to the layer 1.
  • S411 calculate the stroke 4 of layer 1 .
  • S412 store the output feature map of run 4 of layer 1 into the line cache corresponding to layer 2 .
  • S413. Calculate the stroke 2 of layer 2. It can be seen that, in process 2, every time layer 1 performs the calculation of one stroke, layer 2 can continue to perform the calculation of one stroke. In this way, the pipeline processing effect between different layers can be run.
  • In the above, for ease of description, layer 1 stores the row of calculation results corresponding to a stroke into the line cache corresponding to layer 2 after completing that stroke. In other implementations of the present application, as described above, layer 1 may instead store each calculation result into the corresponding location in the line cache of layer 2 as soon as the result for one convolution window position is obtained during the stroke.
  • the input feature map of layer 3 can be the output feature map of layer 2. Therefore, when layer 2 obtains one calculation result in the process of executing one stroke each time, the result can be stored in the line cache of layer 3. After layer 2 completes one calculation trip, the row cache of layer 3 can be updated with one row of data.
  • the NPU can determine whether layer 3 can perform the calculation of a new run, and if so, execute the calculation run of layer 3.
  • the NPU can write the first row of data into the DDR.
  • the NPU may directly write the data into the DDR after acquiring one piece of data.
  • the NPU can also write into the DDR together after acquiring 1 row of data.
  • the NPU can obtain the output feature map of the convolutional layer.
  • If there is a subsequent calculation, the data is written, as it is generated, into the line cache corresponding to that subsequent calculation, and the calculation proceeds in the same way as in the convolutional layer described above.
  • In this way, the convolution calculation results of the current layer are written directly into the line cache of the next computing layer in the local cache and used for the calculation of the next computing layer, forming a pipelined calculation. Therefore, in the calculation of the next layer, the NPU does not need to read intermediate data from the DDR. In the convolution calculation of the convolutional layer shown in FIG. 3, the amount of data the NPU reads from the DDR is therefore only that of one original input feature map; similarly, if there is no subsequent calculation requirement, the amount of data written into the DDR is only that of the output of one convolutional layer.
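  • An illustrative use of the two earlier sketches, not from the application: two convolution layers are driven in the pipelined order described above. The kernels, sizes and strides below are made-up values.

```python
layers = [
    {"kernel": [[1.0, 0.0], [0.0, 1.0]], "stride": 1, "cache": [], "strokes_done": 0},      # layer 1: 2x2 window
    {"kernel": [[1.0] * 3 for _ in range(3)], "stride": 1, "cache": [], "strokes_done": 0},  # layer 2: 3x3 window
]
input_rows = iter([[float(c + 10 * r) for c in range(6)] for r in range(6)])  # stand-in for the DDR

# Asking for the first output row of layer 2 automatically triggers three
# strokes of layer 1 first (process 1); only four input rows are pulled from
# the "DDR", and no intermediate row ever leaves the Python-level "local cache".
first_row = produce_stroke(layers, 1, lambda: next(input_rows))
```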
  • The convolution calculation is now illustrated with an example. Refer to FIG. 5, which is a schematic diagram of the first stroke of the layer 1 convolution calculation in this example.
  • the NPU can read the two lines of data from a 11 to a 26 in the original input feature map from the DDR into the local cache, respectively, into line cache 1 and line cache 2 configured for layer 1.
  • the NPU can start the convolution calculation for layer 1. For example, the NPU can perform sliding calculation on the data of line buffer 1 and line buffer 2 for the convolution window corresponding to layer 1, thereby completing the calculation of one stroke.
  • The output feature map of layer 1 can be the input feature map of layer 2. Therefore, in this example, every time one data element of the output feature map of layer 1 is obtained, it can be stored in the corresponding location of the line cache configured for layer 2 in the local cache. For example, in conjunction with FIG. 6 , take the calculation results of a 11 to a 26 as b 11 to b 15 . The result obtained by applying the layer 1 convolution window at positions a 11 to a 22 is b 11 . Each time the window slides within the calculation stroke, a new result is obtained.
  • the layer 1 convolution window slides one data to the right, and the convolution calculation can continue to obtain b 12 .
  • the layer 1 convolution window is slid to the far right to calculate and obtain b 15 .
  • the NPU can write the result to the first column in the first row cache (eg, row cache 3) configured for layer 2.
  • the NPU can write the result to column 2 in line cache 3.
  • all data in line cache 3 (eg, b 11 to b 15 ) can be obtained.
  • As shown in FIG. 7, after completing the 1st convolution calculation of run 1, the NPU can read from the DDR the data that needs to be supplemented for the 1st convolution calculation of run 2, ie a 31 .
  • The NPU can store a 31 in the line cache of layer 1 (such as line cache 1) in place of a 11 , so that the convolution calculation of the second run can be performed later.
  • the data stored in the line cache configured for layer 1 is shown in Figure 8. It can be seen that when the NPU performs the second convolution calculation of the first run of layer 1 (such as called run 1), a 31 has been stored in the NPU for the first convolution calculation of run 2. It should be noted that in this example, the NPU reads new data from the DDR after completing one convolution calculation and replaces the data that will not participate in subsequent calculations as an example. In other implementations of the present application, the NPU may also read multiple data from the DDR at one time after completing all the convolution calculations of the first run, and replace the data in the line cache that will not participate in the calculation. This can reduce the number of times the NPU reads data from the DDR.
  • the line cache of layer 1 may store data required for the convolution calculation of the next run (eg run 2). For example, the result of data replacement is shown in (a) of FIG. 9 .
  • the NPU can appropriately adjust the position of the data in the line buffer, so that the correct data can be covered during the sliding of the convolution window. For example, the NPU can rewind the data stored in the line cache in units of rows, so as to achieve the effect of exchanging the data stored in the line cache 1 and the line cache 2. That is, after rewinding, the data in the line buffer of layer 1 can be converted from the distribution shown in (a) in Fig. 9 to the distribution shown in (b) in Fig. 9 .
  • The rewinding operation shown in (b) of FIG. 9 is an optional step; in some implementations of the present application, the data does not need to be rearranged. A convolution calculation can be understood as multiplying the data at each position of the convolution window by the data at the corresponding position of the input feature map and then adding these products to obtain the result of that convolution. Therefore, during the convolution calculation, as long as the convolution window data and the input feature map data used in the multiplication have the correct correspondence, the order of the data in the line caches does not need to be adjusted.
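  • A minimal sketch of this point, not from the application: a rolling start index maps the logical rows of the window onto the physical line caches, so rows can be overwritten in place without any rewinding or copying. Names are illustrative.

```python
def window_rows(line_caches, start, a):
    """Return the A logical rows of the current window without moving data.
    line_caches: list of physical row buffers; start: index of the physical
    buffer holding the logical top row; a: rows in the convolution window."""
    return [line_caches[(start + i) % len(line_caches)] for i in range(a)]

# After a stroke with vertical stride S, the S oldest physical rows are
# overwritten in place with new data and the logical start advances:
#     start = (start + S) % len(line_caches)
```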
  • the convolution window can be re-moved to the far left of line buffer 1 and line buffer 2 to start performing layer 1 run 2 calculations. After each calculation, slide to the right according to the step size corresponding to the convolution window of layer 1 for the next calculation. And so on until all convolution calculations in run 2 are completed. In this way, the second row data of the input feature map of layer 2 can be obtained.
  • The NPU can write these data (such as b 21 to b 25 ) into line cache 4, the line cache that stores the second row of the input feature map of layer 2.
  • the NPU can perform the convolution calculation from layer 1 run 1 to layer 1 run 3 to obtain the data required for layer 2 to perform run 1 calculation.
  • the NPU can perform the calculation of run 1 of layer 2, that is, enter the calculation of process 2.
  • the convolution computation process in layer 2 is similar to the convolution computation process in layer 1.
  • The run 1 calculation of layer 2 can obtain input feature map data (such as c 11 to c 13 ) for layer 3 (if present), and the NPU can store this data in the line cache configured for layer 3 in the local cache (eg, line cache 6).
  • the convolution window in layer 2 will slide down by 2 data to start the calculation of run 2.
  • In run 1, the input feature map data covered by the convolution window of layer 2 is b 11 to b 35 ; in run 2, it is b 31 to b 55 .
  • the NPU needs to perform the calculation of 2 strokes of layer 1 (such as layer 1 stroke 4 and layer 1 stroke 5) in the calculation of process 2.
  • That is, the current layer needs to consecutively perform a number of strokes equal to the stride of the next layer in order to obtain the input feature map data that can support the calculation of one stroke of the next layer.
  • the calculation of the layer 2 run 1 may be performed.
  • When the NPU determines that it cannot continue with the calculation of stroke 2 of layer 2, it can fall back to layer 1 and execute the calculation of stroke 4 of layer 1.
  • The NPU can then determine whether the calculation of layer 2 run 2 can be performed. Since the stride of layer 2 is 2, the calculation of layer 2 stroke 2 cannot be performed with the current data. The NPU can therefore continue to fall back to layer 1 and perform the calculation of stroke 5 of layer 1.
  • the NPU can determine whether the calculation of layer 2 run 2 can be performed. Since the input feature map data (eg, b 31 -b 55 ) required for the layer 2 run 2 calculation has been acquired, the NPU can then perform the layer 2 run 2 calculation. The subsequent process is similar and will not be repeated here.
  • In other implementations, the NPU can directly perform the calculation of stroke 5 of layer 1 after completing the calculation of stroke 4 of layer 1, so that the data that can support the calculation of layer 2 run 2 is acquired in one go. This can reduce the number of times the NPU executes the judgment logic.
  • However, this requires configuring more line caches for layer 1, so that the data needed for layer 1 to quickly perform two strokes can be stored in the local cache at the same time.
  • the above description of the calculation logic for different step sizes only takes adjacent layers 1 and 2 as examples.
  • the computing logic can be extended to the implementation of more layers.
  • the calculation of layer 3 needs to be performed after layer 2, and the step size of layer 3 is 3.
  • Then the NPU can consecutively perform more stroke calculations at layer 1 in process 2, so that more strokes can be calculated consecutively at layer 2, and the data required for the next stroke of layer 3 can be obtained without going through the judgment logic.
  • Of course, this requires more line caches for layer 1 and layer 2.
  • This solution can be applied to the case where the storage space of the local cache is relatively sufficient, which can reduce the judgment logic of the NPU and improve the computing efficiency of the system.
  • the calculation can be performed according to a method involving judgment logic. For example, after layer 2 performs the calculation of one stroke, the NPU determines whether layer 3 can perform the calculation of the next stroke, and if so, continues to perform the calculation of the next stroke of layer 3. Conversely, if the calculation of the next trip under layer 3 cannot be performed, it will fall back to layer 2 to perform the calculation of the next trip. Similarly, if the data in the line cache corresponding to the current layer 2 cannot support layer 2 to perform the calculation of the next stroke, continue to fall back to the previous layer (eg, layer 1) to perform the calculation of the next stroke.
  • In this solution, the local cache only needs to be configured, for each computing layer, with a number of line caches corresponding to the number of rows of the window of that layer's kernel function (such as the convolution window of the convolution kernel in a convolution calculation), and the pipelined calculation effect can be obtained by following the method flow shown in FIG. 4 and the solution described above. The NPU then does not need to read data from the DDR many times, nor write data to the DDR many times. This saves the power consumption overhead introduced by reading and writing data; at the same time, because the intermediate data is stored in the line caches of the local cache, the computing efficiency can be significantly improved.
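  • A rough sizing sketch for the local cache under this scheme, not from the application: each computing layer holds only as many row buffers as its kernel window has rows. The window sizes, row width and element size below are assumed values for illustration.

```python
window_rows = [3, 3, 5, 3]            # kernel rows per computing layer (assumed)
row_width_elems = 224                 # elements per feature-map row (assumed)
bytes_per_elem = 2                    # e.g. 16-bit activations (assumed)

line_cache_bytes = sum(window_rows) * row_width_elems * bytes_per_elem
print(line_cache_bytes)               # 14 rows * 448 bytes = 6272 bytes
```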
  • The following takes, as an example, a case in which an element-wise calculation needs to be performed in the convolutional neural network model.
  • FIG. 12 shows a schematic diagram of the calculation logic in a convolutional neural network in which, in addition to the convolution calculations, an element-wise calculation is also required.
  • the elementwise calculation may include an addition operation.
  • the object of the addition operation may be the output feature map of the convolution layer and the output feature map W obtained after the original input feature map is calculated by the computing layer W.
  • the computing layer W may be the same computing layer as any one of the convolutional computing layers in the convolutional layers, or may be a computing layer different from any one of the convolutional computing layers in the convolutional layers.
  • elementwise addition can be performed in the Eltwise computation layer.
  • A corresponding line cache may be configured for the computing layer W in the local cache of the NPU. For example, suppose a convolution calculation is performed in computing layer W with a window of A w rows and B w columns; then A w line caches can be configured for computing layer W in the local cache.
  • the Eltwise computing layer can also be configured with a corresponding line cache in the local cache.
  • the number of cache lines may be an integer greater than or equal to 1.
  • Before performing the element-wise addition operation, the NPU can perform the convolution calculation of the convolutional layer and the convolution calculation of the computing layer W in a time-shared manner. For example, the NPU can perform the convolution calculation of run 1 of the computing layer W to obtain the first row of the output feature map W, and store this first row of the output feature map W in the line cache corresponding to the Eltwise computing layer. After that, the NPU can perform the convolution calculation in the convolutional layer.
  • Once obtained, the first row of the output feature map of the convolutional layer can be input into the Eltwise computing layer, where the NPU adds it to the stored first row of the output feature map W, so that the first row of the Eltwise output feature map is obtained. If there is no other layer of calculation, the NPU can write this first row of the Eltwise output feature map into the DDR as part of the output feature map of one round of convolutional neural network calculation.
  • After acquiring the first row of the Eltwise output feature map, the NPU can update the data in the line caches corresponding to the computing layer W according to the method in the preceding example, so as to perform the convolution calculation of the second stroke. In addition, the NPU can perform the convolution calculation of the convolutional layer according to the aforementioned method to obtain the second row of the output feature map of the convolutional layer, and perform the addition operation in the Eltwise computing layer, thereby obtaining the second row of the Eltwise output feature map. By analogy, the complete output feature map of one round of convolutional neural network calculation can be obtained.
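  • A minimal sketch of this element-wise pipeline, not from the application: one row of the output feature map W is buffered (standing in for the Eltwise line cache), then added element-wise to the matching row of the convolutional layer's output as soon as that row is available. The row sources are placeholder callables.

```python
def eltwise_add_row(row_w, row_conv):
    # element-wise addition of two matching output rows
    return [x + y for x, y in zip(row_w, row_conv)]

def eltwise_pipeline(next_row_w, next_row_conv, num_rows):
    """next_row_w / next_row_conv: callables producing the next row of the
    output feature map W and of the convolutional layer's output feature map.
    Yields rows of the Eltwise output feature map one at a time."""
    for _ in range(num_rows):
        buffered = next_row_w()               # held in the Eltwise line cache
        yield eltwise_add_row(buffered, next_row_conv())
```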
The description above takes the case where a separate line cache is configured for computing layer W as an example. In other implementations, the line cache of computing layer W may also be multiplexed with that of layer 1. The convolution in computing layer W and the convolution in layer 1 are alike in that both operate on data of the original input feature map, although their convolution kernels may differ. Therefore, a common line buffer can be configured for computing layer W and layer 1, and the number of line buffers can be determined by whichever of their convolution windows has the larger number of rows. For example, suppose the convolution window of computing layer W has A_w = 3 rows and the convolution window of layer 1 has A_1 = 2 rows. Then local storage containing 3 line buffers can be configured to support the convolutions of both computing layer W and layer 1.
In this case, the update of the data in the shared line cache is performed only after both computing layer W and layer 1 have completed the convolution at the corresponding position. For example, the NPU reads a31 from the DDR to replace a11 only after a11 is no longer needed: a11 must participate not only in the first convolution of run 1 of layer 1 but also in the first convolution of run 1 of computing layer W. Once both of those calculations have been performed, the NPU can read a31 from the DDR to replace a11. In this way, line cache multiplexing is achieved and storage space in the local cache is saved.
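The following sketch illustrates one possible reading of this multiplexing; the names, the lockstep run ordering, and the fixed A_1 = 2 / A_w = 3 window sizes are assumptions for illustration. A single 3-row buffer feeds both layer 1 and computing layer W, and a buffered row is only overwritten once both consumers have finished the runs that use it.

```python
import numpy as np


def conv1d_rows(rows, kernel):
    """Slide a window of len(rows) x kernel-width across the buffered rows."""
    stacked = np.stack(rows)
    kw = kernel.shape[1]
    return np.array([np.sum(stacked[:, c:c + kw] * kernel)
                     for c in range(stacked.shape[1] - kw + 1)])


def shared_line_buffer_runs(ifm, k_layer1, k_w):
    """One 3-row buffer (the larger window, A_w = 3) serves both layer 1 (A_1 = 2)
    and computing layer W; a row such as a11 is replaced by a31 only after both
    consumers have finished the runs that use it."""
    n_rows = ifm.shape[0]
    buf = list(ifm[:3])                        # the 3 shared line buffers
    rows_l1, rows_w = [], []
    for r in range(n_rows - 1):                # r = top row currently held in buf
        if len(buf) >= 2:
            rows_l1.append(conv1d_rows(buf[:2], k_layer1))   # layer 1 run at row r
        if len(buf) >= 3:
            rows_w.append(conv1d_rows(buf[:3], k_w))         # layer W run at row r
        nxt = r + 3
        buf = buf[1:] + ([ifm[nxt]] if nxt < n_rows else [])  # only now drop the oldest row
    return np.stack(rows_l1), np.stack(rows_w)


if __name__ == "__main__":
    ifm = np.arange(30, dtype=float).reshape(6, 5)
    out1, out_w = shared_line_buffer_runs(ifm, np.ones((2, 2)), np.ones((3, 3)))
    print(out1.shape, out_w.shape)             # (5, 4) (4, 3)
```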
The description above takes a single-core NPU as an example. In that case, the NPU executes process 1 and process 2 in time sequence, and the execution order of some steps in process 1 and process 2 may differ from that shown in FIG. 4, which is not repeated here. In practice, multi-core NPUs are often used. With multiple cores, computing concurrency can be achieved, further improving computing efficiency.
FIG. 13 is a schematic diagram comparing the calculation flow over time in a single-core scenario and a multi-core scenario. In the single-core scenario, the computing core of the NPU (for example, core 1) performs the computation of run 4 of layer 1 at time T1, the computation of run 1 of layer 2 at time T2, and the computation of run 5 of layer 1 at time T3. When the NPU is a dual-core processor (that is, in the dual-core scenario), one computing core (for example, core 1) performs run 4 of layer 1 at time T1, run 5 of layer 1 at time T2, and run 6 of layer 1 at time T3, while the layer 2 computations are performed by another computing core (for example, core 2) of the NPU. For example, core 2 performs run 1 of layer 2 while core 1 performs run 5 of layer 1, and core 2 performs run 2 of layer 2 while core 1 performs run 6 of layer 1.
In this way, concurrency between multiple calculation processes is realized: once the data required for run 1 of layer 2 has been obtained, the NPU can execute the next run of layer 1 and run 1 of layer 2 simultaneously, instead of waiting for run 1 of layer 2 to finish before returning to layer 1 for its next run. The description above takes the case where core 1 performs the layer 1 computation and core 2 performs the layer 2 computation as an example; this application places no restriction on the correspondence between computing cores and computing layers. The same computing core may also perform computations of different computing layers. For example, when the computing capability of core 1 (for example, measured by throughput) is relatively large and the throughput of core 2 is relatively small, core 1 can not only complete the layer 1 computation but also, by time-division multiplexing, process part of the layer 2 computation so as to keep the throughput consistent. In this way, the computing bandwidth is fully utilized and the working efficiency of the multi-core NPU is improved. A sketch of such run-level scheduling follows.
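As an illustration only, the sketch below contrasts a single-core schedule, in which one core alternates between layer 1 and layer 2 runs, with a dual-core schedule, in which core 2 starts a layer 2 run as soon as enough layer 1 output rows have accumulated. The time slots, the 3-row / stride-2 window of layer 2, and the readiness rule are assumptions rather than the scheduler of FIG. 13.

```python
def schedule(total_l1_runs, l2_window_rows, l2_stride, cores):
    """Greedy run-level schedule: every run occupies one time slot on one core."""
    timeline = []        # timeline[t] = list of "core: work" strings for slot t
    l1_done = 0          # layer 1 runs finished so far (= layer 1 output rows buffered)
    l2_top, l2_run = 0, 1
    while l1_done < total_l1_runs or l2_top + l2_window_rows <= l1_done:
        slot = []
        l2_ready = l2_top + l2_window_rows <= l1_done   # rows finished in earlier slots
        if cores == 2:
            if l1_done < total_l1_runs:
                l1_done += 1
                slot.append(f"core1: layer1 run {l1_done}")
            if l2_ready:                                # core 2 works concurrently
                slot.append(f"core2: layer2 run {l2_run}")
                l2_run, l2_top = l2_run + 1, l2_top + l2_stride
        else:                                           # single core: one run per slot
            if l2_ready:
                slot.append(f"core1: layer2 run {l2_run}")
                l2_run, l2_top = l2_run + 1, l2_top + l2_stride
            else:
                l1_done += 1
                slot.append(f"core1: layer1 run {l1_done}")
        timeline.append(slot)
    return timeline


if __name__ == "__main__":
    for n_cores in (1, 2):
        print(f"--- {n_cores}-core schedule ---")
        for t, slot in enumerate(schedule(6, 3, 2, n_cores), start=1):
            print(f"T{t}: " + " | ".join(slot))
```

With these assumed parameters the dual-core timeline finishes in fewer slots because the layer 2 runs overlap with layer 1 runs instead of occupying their own slots.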
To implement the above functions, the above-mentioned processor includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should readily appreciate that the units of the examples described in conjunction with the embodiments disclosed herein can be implemented in hardware, or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments of the present application, the data processing apparatus corresponding to the processor may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated modules can be implemented in the form of hardware or in the form of software functional modules. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; another division manner may be used in actual implementation.
FIG. 14 is a schematic structural diagram of a data processing apparatus 1400 according to an embodiment of the present application. The data processing apparatus 1400 can be applied to perform neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and the data processing apparatus 1400 is provided with a local cache. The data processing apparatus 1400 includes: an obtaining unit 1401, configured to acquire first data, where the first data is used to perform a first calculation run of a first computing layer and the first computing layer is any one of the N computing layers; a storage unit 1402, configured to store the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a calculation unit 1403, configured to perform the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer. The storage unit 1402 is further configured to store the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer.
The calculation unit 1403 is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform the first calculation run of the second computing layer, perform the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.

The calculation unit 1403 is further configured to, when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer.

In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.
In a possible design, the obtaining unit 1401 is further configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.

In a possible design in which the first data is a part of the input feature map stored in the external memory, the obtaining unit 1401 is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer, and to store the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer.
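As a minimal sketch (the class name and the circular-buffer policy are illustrative assumptions, not the claimed structure), the following shows how newly fetched rows of the input feature map can overwrite rows that no longer participate in the first computing layer's calculations, so the layer never holds more rows than its convolution window has.

```python
import numpy as np


class LayerLineBuffer:
    """Holds exactly `window_rows` rows; each new row overwrites the row that no
    longer participates in this layer's calculations."""

    def __init__(self, window_rows, row_width):
        self.rows = np.zeros((window_rows, row_width))
        self.filled = 0            # rows fetched so far
        self.oldest = 0            # index of the row that will be overwritten next

    def push(self, new_row):
        """Store a newly fetched row (the 'third data') over the expired row (the 'fourth data')."""
        self.rows[self.oldest] = new_row
        self.oldest = (self.oldest + 1) % self.rows.shape[0]
        self.filled += 1

    def ready(self):
        """True once enough rows are buffered for one calculation run."""
        return self.filled >= self.rows.shape[0]

    def window(self):
        """Rows in top-to-bottom order of the current convolution window."""
        k = self.rows.shape[0]
        start = self.oldest if self.filled > k else 0
        return np.roll(self.rows, -start, axis=0)


if __name__ == "__main__":
    buf = LayerLineBuffer(window_rows=3, row_width=5)
    for i, row in enumerate(np.arange(20.0).reshape(4, 5)):
        buf.push(row)              # the 4th row overwrites the 1st, which is no longer needed
        if buf.ready():
            print(i, buf.window()[:, 0])   # leading element of each row in the window
```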
In a possible design, the storage unit 1402 is further configured to, during the first calculation run of the first computing layer, store each calculation result obtained by the convolution window of the first computing layer at a given position into the first line buffer of the second computing layer.

In a possible design, the obtaining unit 1401 is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit 1402 stores the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • any one of the above units can be implemented in software, hardware or a combination of the two, so as to realize the functions shown in the method.
  • the data processing apparatus 1400 including the above units may be a part integrated in the above-mentioned processor, such as functional hardware in the processor or functional software running in the processor. For example, if any one of the units is implemented as a software module, it runs on the neural network computing device 200 as shown in FIG. 2 .
  • FIG. 15 is a schematic structural diagram of an electronic device 1500 according to an embodiment of the present application.
  • the electronic device 1500 may include: a processor 1501 and a memory 1502 .
The memory 1502 is used to store computer-executable instructions. Exemplarily, in some embodiments, when the processor 1501 executes the instructions stored in the memory 1502, the electronic device 1500 is caused to perform one or more of steps S401-S413 shown in FIG. 4, as well as other operations that the electronic device needs to perform. In some embodiments, the electronic device 1500 may be provided with the neural network computing apparatus 200 described in FIG. 2; for the processor 1501, reference may be made to the neural network computing apparatus 200 of FIG. 2. The processor of this embodiment includes, but is not limited to, one or more of the aforementioned CPU, NPU, FPGA, GPU, and DSP (digital signal processor).
  • the above processors may be implemented in one or more chips.
  • the chip is also referred to as a system-on-chip (SoC).
The electronic device provided by the embodiments of the present application is configured to perform the above data processing method, and can therefore achieve the same effects as the above data processing method.
The functions, actions, operations, or steps in the above embodiments may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, solid state disks (SSDs)), and the like.

Abstract

The embodiments of the present application relate to the field of artificial intelligence, and disclose a data processing method and a processor, which address the large power consumption caused by a processor having to read and write data multiple times. The solution involves: acquiring first data used for a first calculation run of a first computing layer; storing the first data in the first line buffer of the first computing layer, where that line buffer is included in a local cache; performing the first calculation run of the first computing layer to obtain second data; storing the second data in the first line buffer of a second computing layer, where the second computing layer is a computing layer, among N computing layers, that follows the first computing layer; and, when the data accumulated in the first line buffer of the second computing layer is sufficient for a first calculation run of the second computing layer, performing that run to obtain fifth data corresponding to the first calculation run of the second computing layer.

Description

A data processing method and processor

Background
In the process of completing one pass of neural network model computation, the processor needs to read a large amount of data from, and write a large amount of data to, the external memory many times, which brings considerable power consumption to the device performing the neural network computation. In addition, as the number of objects to be processed (for example, the number of images) and their complexity (for example, the data volume of the feature maps of the images to be processed) increase, the computational power consumption also increases.
SUMMARY OF THE INVENTION

Embodiments of the present application provide a data processing method and a processor, so as to reduce the power consumption of neural network computation. To achieve the above objective, the following technical solutions are adopted in the embodiments of the present application.
In a first aspect, a data processing method is provided. The method is applied to a processor that performs neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and a local cache is provided in the processor. The method includes: acquiring first data, where the first data is used to perform a first calculation run of a first computing layer, and the first computing layer is any one of the N computing layers; storing the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; performing the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer; storing the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer; and, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first calculation run of the second computing layer, performing the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.

Based on this solution, a pipelined computing mechanism across multiple computing layers is provided. In this example, when performing the convolution of one computing layer, the processor only needs to acquire the data required for one calculation run. Taking convolution as an example, a calculation run is one pass of the convolution window sliding from the leftmost position to the rightmost position. For instance, if the convolution window has A rows, the processor only needs A rows of data to start the calculation of that layer, rather than the full input feature map required by the layer. Because the data volume of A rows is very small, it can be stored in the local cache instead of in external memory (for example, DDR). When the current layer is calculated, these A rows can be read directly from the local cache and used to perform one run of the current layer. It can be understood that, when the current computing layer is not the first computing layer of the neural network, these A rows of data are the calculation results of the previous computing layer. Compared with the prior art, in this solution the previous computing layer only needs to compute and produce A rows of data, so the intermediate data between the previous computing layer and the current computing layer does not need to be written into the DDR and read back from the DDR again; instead, the previous computing layer stores the data, as it is produced, into the line buffers configured for the current computing layer in the local cache. In other words, the intermediate data does not need to be written to the DDR, and therefore does not need to be read from the DDR when the current layer is calculated. Reading and writing data in the local cache avoids repeated, large data exchanges with the DDR, thereby saving power.

In a possible design, the method further includes: when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, performing a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer. Based on this solution, a fallback mechanism in inter-layer computation is provided. In this example, after the previous computing layer completes one run, the processor checks whether one run of the current layer can be performed. If the line buffer of the current computing layer does not yet hold enough data to support one run of the current layer, the processor falls back to the previous layer and performs its next run, so that a new row of results is produced and written into the current layer's line buffer. The processor then repeats this scheme: it determines whether the data stored in the current line buffer is sufficient for the current computing layer to complete one run; if so, it performs one run of the current computing layer, and if not, it returns to the previous computing layer to continue computing. The same check-and-fall-back mechanism applies to subsequent computing layers, so that the computation never stalls at any one computing layer, and at any moment each computing layer only occupies a number of line buffers corresponding to the number of rows of its convolution window. A short worked example follows.
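As a purely illustrative, worked example (the window size and stride are assumptions), the snippet below counts how many runs of the previous layer must complete before each run of the next layer becomes possible, which is exactly the readiness check behind the fallback decision.

```python
def rows_needed(next_layer_run, window_rows, stride):
    """Previous-layer output rows needed before run `next_layer_run` of the next layer can start."""
    # run k of the next layer consumes rows (k-1)*stride + 1 .. (k-1)*stride + window_rows
    return (next_layer_run - 1) * stride + window_rows


if __name__ == "__main__":
    # assumed: the next layer uses a 3-row window with stride 1
    for k in range(1, 5):
        print(f"run {k} of the next layer needs {rows_needed(k, 3, 1)} previous-layer rows")
    # run 1 needs 3 rows, run 2 needs 4, and so on: after the 3rd previous-layer run,
    # each additional run of the previous layer enables exactly one run of the next layer.
```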
In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer. This provides a specific limit on the size of the first line buffer. It can be understood that the first line buffer, used to store the first data, may be storage space configured for the first computing layer in the processor's local cache, holding the data used for any one calculation run of the first computing layer. Its number of rows needs to be at least equal to the number of rows of the convolution window of the first computing layer, so that enough data can be held to perform one run.

In a possible design, when the first computing layer is the first computing layer of the neural network, acquiring the first data includes: reading the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. This provides a data acquisition mechanism for the case where the first computing layer is the first computing layer of the neural network. It can be understood that, because the data volume of the input feature map is generally large, it can be stored in an external memory (for example, DDR) with which the processor can exchange data. Before executing a calculation run of the first computing layer, the processor can read the corresponding data from the DDR and write it into the line buffers configured for the first computing layer, so as to perform one calculation run of the first computing layer.

In a possible design, the first data is a part of the input feature map stored in the external memory, and the method further includes: obtaining third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer; and storing the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer. Based on this solution, a mechanism for dynamically updating the data in the line buffers is provided. In this example, each time one calculation within a run is completed, part of the first data will not be used again in subsequent calculations. The processor can therefore read some new data from the DDR to overwrite the data that will no longer be used, thereby updating the buffer, so that after the current run is completed, the corresponding line buffers already hold data that can be used for a new calculation run. It should be noted that, in some embodiments of the present application, this replacement may be performed after a calculation run is completed, or during the execution of a calculation run.

In a possible design, storing the second data in the first line buffer of the second computing layer includes: during the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, storing that result in the first line buffer of the second computing layer. Based on this solution, a writing mechanism for the second data is provided. In this example, each time the first computing layer produces one calculation result, the result can be stored at the corresponding position in the line buffer of the second computing layer. In this way, after one calculation run of the first computing layer is completed, one or more rows of data stored in the line buffer of the second computing layer are available for the calculation of the second computing layer.

In a possible design, after the fifth data corresponding to the first calculation run of the second computing layer is obtained, the method further includes: storing the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer. Based on this solution, the computing mechanism of the other computing layers included in the neural network is provided. For example, each time the second computing layer completes one run, the calculation result can be stored in the line buffer corresponding to the next computing layer (for example, the third computing layer), so that once enough data has been obtained, one run of the third computing layer can be executed.

In a second aspect, a data processing apparatus is provided, which is applied to perform neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and a local cache is provided in the data processing apparatus. The apparatus includes: an obtaining unit, configured to acquire first data, where the first data is used to perform a first calculation run of a first computing layer and the first computing layer is any one of the N computing layers; a storage unit, configured to store the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a calculation unit, configured to perform the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer. The storage unit is further configured to store the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer. The calculation unit is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first calculation run of the second computing layer, perform the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.
In a possible design, the calculation unit is further configured to, when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer.

In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.

In a possible design, the obtaining unit is configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.

In a possible design, the first data is a part of the input feature map stored in the external memory, and the obtaining unit is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer, and to store the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer.

In a possible design, the storage unit is further configured to, during the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, store that result in the first line buffer of the second computing layer.

In a possible design, the obtaining unit is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit stores the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
In a third aspect, a processor is provided, where the processor includes one or more computing cores and a local cache, and the processor is configured to implement the data processing method according to any one of the first aspect and its possible designs.

In a fourth aspect, an electronic device is provided, where the electronic device includes one or more processors according to the third aspect and one or more memories. The memory is coupled to the processor and stores computer instructions. When the processor executes the computer instructions, the electronic device is caused to perform the data processing method according to any one of the first aspect and its possible designs.

In a fifth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium includes computer instructions, and when the computer instructions are run, the data processing method according to any one of the first aspect and its possible designs is performed.

Exemplarily, any one of the designs of the second aspect to the fifth aspect and their possible designs may correspond to the first aspect and any one of its possible designs, and therefore can bring similar technical effects, which are not repeated here.
Description of Drawings
FIG. 1 is a schematic structural diagram of a convolutional neural network;
FIG. 2 is a schematic structural diagram of a neural network computing apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional layer according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 6 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 7 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 8 is a schematic diagram of line buffers according to an embodiment of the present application;
FIG. 9 is another schematic diagram of line buffers according to an embodiment of the present application;
FIG. 10 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 11 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a neural network according to an embodiment of the present application;
FIG. 13 is a schematic timing diagram of calculation logic in single-core and multi-core scenarios according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Neural networks commonly used in the field of artificial intelligence include convolutional neural networks (CNN) and recursive neural networks (RNN). Exemplarily, FIG. 1 shows a schematic structural diagram of a convolutional neural network. As shown in FIG. 1, a convolutional neural network may be provided with a convolutional layer that includes one or more convolution computing layers, and a pooling/activation layer that includes one or more computing layers. When data enters the convolutional layer, the processor performs convolution on the input feature map according to the convolution kernel corresponding to each convolution computing layer. Exemplarily, using a preset convolution kernel, the processor slides the convolution window corresponding to that kernel over the input feature map with a preset stride, obtaining a calculation result at each window position; these results are combined into the corresponding output feature map. When data enters the pooling/activation layer, the processor processes the input feature map according to a specific function, for example pooling and/or activation. As shown in FIG. 1, the convolutional neural network may further include an input layer. The input layer may be used to store the feature map data of the image to be processed, so that the convolutional layer can obtain the input feature map from the input layer. In some implementations, the input layer may be located in an external memory connected to the processor that performs the convolution calculations.

It should be noted that the structure of the convolutional neural network may differ between application scenarios. In some implementations, the convolution computing layers of the convolutional layer may be interleaved with the computing layers of the pooling/activation layer; for example, as shown in FIG. 1, pooling or activation may be performed after part of the convolution calculations, followed by further convolution. In other implementations, the convolution calculations may be performed first, optionally followed by pooling and/or activation. After the above calculations are completed, the result of one round of convolutional neural network computation is obtained and output through the output layer.

It should be noted that a local buffer may be provided inside the processor. The local cache can store a small amount of data, and data stored in it can be read and written quickly. For example, the local cache can store data such as the convolution kernels corresponding to the computing layers of the convolutional neural network model. When performing neural network computation, the original input feature map generally has too large a data volume to fit in the local cache, so it can be stored in an external memory connected to the processor. The external memory may be a storage medium with a large storage space, for example double data rate synchronous dynamic random-access memory (DDR SDRAM), which in this example is also referred to as DDR for short. Taking DDR as the external memory, the processor can read the original input feature map from the DDR at the start of computation in order to perform the neural network calculation. When the neural network model includes multiple computing layers, the output feature map of one layer serves as the input feature map of the next layer; the feature map data between two computing layers is also called intermediate data. Generally, the data volume of the intermediate data in the neural network calculation is not larger than that of the original input feature map, but it still far exceeds the storage capacity of the local cache. If the processor writes the intermediate data into the DDR, it must repeatedly exchange large amounts of data with the DDR, which causes a large amount of power consumption. In addition, as the computing capability of processors keeps improving, insufficient read/write bandwidth also limits the efficiency of neural network computation.

To address this problem, the feature map data stored in the DDR can be split so that the resulting feature map slices can be stored in the local cache. When performing neural network computation, the processor reads one slice from the DDR into the local cache and computes on the data of that slice. Following the description above, the intermediate data and the output feature map produced while computing one slice are no larger than the input feature map of that slice. After completing the computation of one slice, the processor reads the next slice from the DDR into the local cache and repeats the above steps, and so on, until all slices have been computed. The DDR then holds the output feature maps corresponding to the individual slices, and the processor must merge them to obtain the complete output feature map corresponding to the original input feature map. To avoid gaps between the output feature maps of the slices, adjacent slices must include overlapping data when the original input feature map is partitioned. As a result, the overlapping data is computed multiple times, which reduces the efficiency of the whole computation, and the power consumption is still not well optimized.
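As an aside, the extra work introduced by such overlapping slices can be estimated with a small calculation. The sketch below is illustrative only; it assumes the slices are cut along the row dimension and that the combined halo required by the layer stack is `overlap_rows` rows per slice boundary.

```python
def redundant_rows(total_rows, slice_rows, overlap_rows):
    """Estimate rows computed more than once when an input of `total_rows` rows is cut
    into slices of `slice_rows` rows that must overlap by `overlap_rows` rows."""
    if slice_rows <= overlap_rows:
        raise ValueError("slice must be taller than the overlap")
    step = slice_rows - overlap_rows
    num_slices = max(1, -(-(total_rows - overlap_rows) // step))   # ceiling division
    return (num_slices - 1) * overlap_rows, num_slices


if __name__ == "__main__":
    extra, n = redundant_rows(total_rows=1080, slice_rows=128, overlap_rows=8)
    print(f"{n} slices, about {extra} rows recomputed "
          f"({100 * extra / 1080:.1f}% of the input height)")
```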
To address these problems, the data processing method provided in the embodiments of the present application establishes a pipelined computing mechanism between the computations of different computing layers, so that the computation of one layer does not have to be fully completed before the computation of the next layer begins. Such a solution significantly reduces the volume of intermediate data, so that the intermediate data can be stored in the local cache, avoiding the power consumption of repeated, large read/write exchanges between the processor and the external memory (for example, DDR). At the same time, because of this pipelined mechanism, the computation does not need to wait, so computing efficiency is effectively improved. Also because of the pipelined mechanism, no data is repeatedly and needlessly recomputed, so computing efficiency is significantly higher than in the prior-art solution. In particular, the data processing method provided in the embodiments of the present application provides a calculation method with a fallback mechanism, which is applicable to convolution scenarios with a stride greater than or equal to 1. The solution provided in the embodiments of the present application is described in detail below with reference to the accompanying drawings.

Please refer to FIG. 2, which is a schematic diagram of the logical structure of a neural network computing apparatus 200 according to an embodiment of the present application. The neural network computing apparatus 200 can be used to perform the computation of neural network models, including convolutional neural network models, according to the method provided in the embodiments of the present application. For ease of description, FIG. 2 also shows an external storage module 230 coupled to the neural network computing apparatus 200. The neural network computing apparatus 200 can read from and write to the external storage module 230 through an interface provided on it, for example reading the feature map data to be processed (such as the original input feature map) from the external storage module 230, or writing the output feature map data of a completed neural network computation into the external storage module 230. The neural network computing apparatus 200 may be the processor described above.

In some implementations, the external storage module 230 may include the DDR described above. In other implementations, the external storage module 230 may also include a system cache, which may be the system cache of the device in which the neural network computing apparatus 200 shown in FIG. 2 is provided. In different devices, the system cache may be implemented by different storage media; for example, it may be implemented by flash memory, or by another storage medium such as a solid state drive (SSD).

As shown in FIG. 2, the neural network computing apparatus 200 provided in this embodiment of the present application may include a computing module 210 and a local cache 220. The computing module 210 is the module of the neural network computing apparatus 200 that implements the various computing functions. For example, the computing module 210 may include a convolution computing unit 211, an activation computing unit 212, a pooling computing unit 213, an Eltwise computing unit 214, and the like. The convolution computing unit 211 is used to perform convolution; as a possible implementation, it may include one or more multiplier-accumulators or other components capable of performing convolution. The activation computing unit 212 performs activation processing, the pooling computing unit 213 implements pooling, and the Eltwise computing unit 214 implements elementwise computation.
需要说明的是,上述示例中对于计算模块210的结构示例仅为一种可能的实现,在针对不同神经网络模型的计算中,计算模块210包括的用于实现各个功能的单元可以与上述示例相同,也可以不同。比如,在神经网络模型中没有激活计算以及池化计算的需求时,该计算模块210中就可以不设置激活计算单元212以及池化计算单元213。又如,在神经网络模型中没有elementwise的计算需求时,该计算模块210中就可以不设置Eltwise计算单元214。It should be noted that the structural example of the calculation module 210 in the above example is only a possible implementation, and in the calculation for different neural network models, the units included in the calculation module 210 for implementing each function may be the same as the above example. , can also be different. For example, when there is no requirement for activation calculation and pooling calculation in the neural network model, the activation calculation unit 212 and the pooling calculation unit 213 may not be set in the calculation module 210 . For another example, when there is no elementwise computing requirement in the neural network model, the Eltwise computing unit 214 may not be set in the computing module 210 .
作为一种可能的实现,该计算模块210可以为神经网络处理器(neural-network processing units,NPU),或者现场可编程门阵列(Field Programmable Gate Array,FPGA),或者中央处理器(Central Process ing Unit/Processor,CPU),或者图形处理器(Graphics Processing Unit,GPU)等能够实现对应计算功能的部件。需要说明的是,以计算模块210为NPU为例,该NPU可以是单核NPU,也可以是具有多个计算核心的多核NPU。本申请实施例对此不作限制。可以理解的是,本申请实施例提供的数据处理方法应用在单核NPU上时的处理逻辑,可以复用在多核NPU上。当该方案应用于多核NPU时,可以利用多核NPU的并行计算机制,进一步提升计算效率。示例性的,当NPU中存在多个计算核心时,可以采用内部互联(interconnect)的方式实现多个计算核心的互联。比如,可以采用片上网络(Network on Chip,NOC)的结构,实现多个计算核心的互联。可以理解的是,采用NOC的互联方式,可以使得根据网络结构动态地配置不同计算核心之间的互联关系,以便于根据各个计算核心的计算压力动态配置计算量,实现计算的动态调度,由此提升多核NPU的计算效率。As a possible implementation, the computing module 210 may be a neural network processor (neural-network processing units, NPU), or a field programmable gate array (Field Programmable Gate Array, FPGA), or a central processing unit (Central Processing) Unit/Processor, CPU), or graphics processor (Graphics Processing Unit, GPU) and other components that can implement corresponding computing functions. It should be noted that, taking the computing module 210 as an NPU as an example, the NPU may be a single-core NPU or a multi-core NPU having multiple computing cores. This embodiment of the present application does not limit this. It can be understood that the processing logic when the data processing method provided in the embodiment of the present application is applied to a single-core NPU may be multiplexed on a multi-core NPU. When the solution is applied to a multi-core NPU, the parallel computing mechanism of the multi-core NPU can be used to further improve the computing efficiency. Exemplarily, when there are multiple computing cores in the NPU, the interconnection of the multiple computing cores may be implemented by means of an internal interconnection (interconnect). For example, a network on chip (NOC) structure can be used to realize the interconnection of multiple computing cores. It can be understood that the interconnection method of NOC can dynamically configure the interconnection between different computing cores according to the network structure, so as to dynamically configure the calculation amount according to the computing pressure of each computing core, and realize the dynamic scheduling of computing. Improve the computing efficiency of multi-core NPU.
Continuing with FIG. 2, the neural network computing apparatus 200 provided in this embodiment of the present application may further be provided with a local cache 220. The local cache 220 can be used for fast reading and writing of data. In some implementations of the present application, in order to save the cost of the neural network computing apparatus 200 while taking its size requirements into account, the local cache 220 may be a storage medium with a relatively small storage space. For example, when the functions of the computing module 210 are implemented by an NPU, the local cache 220 may be the internal cache of the NPU. In the present application, the local cache 220 may be used to support the line buffer technique. For example, as shown in FIG. 2, multiple line buffers may be configured in the local cache 220.
As an example, the multiple line buffers may respectively correspond to different computing layers in the neural network model. The number of line buffers corresponding to one computing layer may be determined according to the window size of the kernel function of that computing layer. For example, take a convolution computing layer whose convolution window has M rows and N columns: M line buffers may be configured for that convolution computing layer in the local cache 220. Similarly, corresponding line buffers may be configured in the local cache 220 for the other computing layers in the neural network model. It can be understood that, since the number of rows of a kernel function is generally small, the total number of line buffers configured for all computing layers of the neural network model is not excessively large, and this configuration can be accommodated by the space of a typical local cache 220.
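The sizing rule described above (M window rows, hence M line buffers) can be illustrated with a small sketch. The following Python snippet is only an illustration under the assumption that each layer is characterized by the row count of its convolution window; the names LayerSpec and allocate_line_buffers are hypothetical and do not come from the patent.

```python
# Illustrative sketch only: size the line buffers of each computing layer by the
# number of rows of that layer's convolution window (M rows -> M line buffers).
from dataclasses import dataclass

@dataclass
class LayerSpec:
    window_rows: int   # rows of the layer's convolution window (M)
    row_width: int     # number of elements one line buffer must hold

def allocate_line_buffers(layer_specs):
    """Return, per layer index, a list of line buffers (rows of zeros)."""
    return {
        idx: [[0.0] * spec.row_width for _ in range(spec.window_rows)]
        for idx, spec in enumerate(layer_specs, start=1)
    }

# Example: layer 1 has a 2-row window over 6-wide rows, layer 2 a 3-row window
# over 5-wide rows (the width shrinks as each layer consumes its input).
buffers = allocate_line_buffers([LayerSpec(2, 6), LayerSpec(3, 5)])
print({layer: len(rows) for layer, rows in buffers.items()})   # {1: 2, 2: 3}
```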
It should be noted that, in other implementations of the present application, the line buffer configuration of a computing layer may also take into account the strides of the current computing layer and of the related computing layers, and/or special computing requirements in the neural network model (such as elementwise computation). The specific configurations are described in detail below.
In the following examples, the data processing method provided in this embodiment of the present application is described by taking the following as an example: the neural network model is a convolutional neural network model having the structure shown in FIG. 1, the computing module is a single-core NPU, the local cache is the local cache in the NPU, and the external storage module is a DDR memory. The following description details the process of computing each convolution computing layer of the convolutional layer using the data processing method provided in this embodiment of the present application. It can be understood that the computation of the other layers in the convolutional neural network model can refer to the computation process of the convolutional layer. As an example, the convolutional layer of the convolutional neural network may be provided with N convolution computing layers as shown in FIG. 3. As shown in FIG. 3, the N convolution computing layers may be layer 1, layer 2, ..., layer N. During the convolution computation, the input feature map of layer 1 may be the original input feature map, and the output feature map obtained after layer N completes its convolution computation may be referred to as the convolution output feature map.
When the convolution computation starts, the NPU may perform initialization. Exemplarily, during this initialization, the NPU may read, from the DDR, a number of rows of data corresponding to the number of rows of the convolution window of layer 1 and write them into the line buffers configured for layer 1 in the local cache. Exemplarily, in the following description, the datum in row i and column j of the original input feature map is denoted aij, where i and j are both integers greater than or equal to 1. Take the convolution kernel of layer 1 having A1 rows and B1 columns as an example: A1 line buffers may be configured for layer 1 in the local cache. During initialization, the NPU may read the first A1 rows of the original input feature map from the DDR and write them into the line buffers configured for layer 1 in the local cache. After the initialization is completed, the NPU may perform the layer 1 convolution computation on the data written into the local cache, for example by sliding the convolution window corresponding to layer 1 over the A1 line buffers from left to right and performing the convolution computation at each position to obtain the result for that position.
In the present application, the computation performed as the convolution window moves from the leftmost position to the rightmost position may be referred to as the computation of one run. The computation of one run includes the computation of the portions of one or more rows of data that fall within the window. After the computation of one run is completed, the first row of the output feature map corresponding to layer 1 is obtained. The output feature map of layer 1 serves as the input feature map of layer 2. Therefore, each time the convolution computation at one position is completed, the NPU may store the result at the corresponding position of the line buffers configured for layer 2 in the local cache. For example, after the convolution computation of the first run of layer 1 is completed, the line buffers corresponding to layer 2 store the first row of the input feature map of layer 2. It should be noted that, in some implementations of the present application, as the convolution computation of layer 1 proceeds, the NPU may read new data from the DDR to overwrite data that will no longer be used in the computation of layer 1, so that after one run is completed, layer 1 can continue with the next run without waiting for the NPU to read data from the DDR.
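A minimal sketch of one run, assuming plain Python lists as line buffers, a dense 2-D kernel and a single horizontal stride, is given below; it illustrates the idea of sliding the window from the leftmost to the rightmost position and emitting one output row, and is not the patented implementation.

```python
# Illustrative sketch: one "run" of a convolution computing layer.
# line_buffers: list of A rows (each a list of numbers) currently held for the layer.
# kernel: A x B list of lists. stride: horizontal step of the window within the run.
def compute_run(line_buffers, kernel, stride=1):
    a, b = len(kernel), len(kernel[0])
    width = len(line_buffers[0])
    out_row = []
    for col in range(0, width - b + 1, stride):
        acc = 0.0
        for r in range(a):            # multiply window entries with the data they cover
            for c in range(b):
                acc += kernel[r][c] * line_buffers[r][col + c]
        out_row.append(acc)           # one datum of the layer's output feature map
    return out_row

# Example: a first run over two buffered rows of width 6 with an assumed 2x2 kernel.
print(compute_run([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]], [[1, 0], [0, 1]]))
# [9.0, 11.0, 13.0, 15.0, 17.0]
```

In the scheme described above, each element of out_row would be written into the corresponding column of a line buffer configured for the next computing layer, either as soon as it is produced or once the run finishes.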
Exemplarily, take the convolution stride of layer 1 being S1 as an example. After the convolution window of layer 1 completes the first convolution computation of a run, the data in the first S1 rows and first S1 columns of the original input feature map (that is, data a11 to a(S1,S1)) will not be used again. Therefore, after layer 1 completes the first convolution computation, the NPU may read, from the DDR, the data in rows S1+1 to 2×S1 and the first S1 columns of the original input feature map, and store them at the positions in the local cache where the data of the first S1 rows and S1 columns were originally stored. By analogy, after the convolution computation of the first run of layer 1 is completed, the A1 line buffers configured for layer 1 already store the data required for the convolution computation of the next run. Of course, in other implementations of the present application, the NPU may instead read the data required for the next run from the DDR after the computation of one run is completed, and store it into the A1 line buffers of layer 1.
Through the above steps, one run of the convolution computation of layer 1 is completed, and the first row of the input feature map of layer 2 is obtained. Take the convolution window of layer 2 having A2 rows and B2 columns as an example. The NPU may continue to perform the convolution computation of further runs at layer 1 according to the above scheme until the A2 rows of data required for the convolution computation of layer 2 are obtained. That is, the NPU may perform A2 runs of convolution computation at layer 1, thereby obtaining A2 rows of data that serve as the input feature map of layer 2 and are stored in the A2 line buffers configured for layer 2 in the local cache. The NPU may then start the convolution computation of the first run of layer 2, thereby obtaining the first row of the input feature map of layer 3, which is stored in the line buffers configured for layer 3 in the local cache.
It can be understood that, after layer 2 completes the convolution computation of one run, in order to continue with the second run of layer 2, layer 1 must perform the corresponding convolution computations to obtain the new input feature map data required for the second run of layer 2. For example, take the stride of layer 2 being S2. While layer 2 performs the convolution computation of run 1, the line buffers corresponding to layer 2 store the A2 rows of data obtained from A2 runs of layer 1. After layer 2 completes the convolution computation of run 1, the NPU may return to layer 1 and perform runs (A2+1) to (A2+S2) of layer 1, thereby obtaining S2 new rows of data that are stored in the line buffers configured for layer 2, so that the NPU can continue with the second run of layer 2. By analogy, in the subsequent computation, each time the NPU completes S2 runs of layer 1, it can perform one run of layer 2, and the other layers behave similarly. When one run of layer N is completed, the first row of the convolution output feature map of the convolutional layer is obtained.
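The cadence described here (a full window of rows for the first run of a layer, then a further stride's worth of new rows for every later run) can be expressed as a small readiness check. The sketch below is hypothetical; rows_available, window_rows and runs_done are illustrative names rather than terms from the patent.

```python
# Illustrative sketch: can a computing layer start its next run, given how many
# rows of its input feature map have accumulated in its line buffers so far?
def can_start_next_run(rows_available, window_rows, stride, runs_done):
    # The first run needs a full window of rows (A rows); each later run needs
    # a further `stride` rows on top of what the previous runs consumed.
    rows_needed = window_rows + runs_done * stride
    return rows_available >= rows_needed

# Layer 2 with a 3-row window and stride 2: run 1 needs 3 rows, run 2 needs 5 rows.
print(can_start_next_run(rows_available=3, window_rows=3, stride=2, runs_done=0))  # True
print(can_start_next_run(rows_available=4, window_rows=3, stride=2, runs_done=1))  # False
print(can_start_next_run(rows_available=5, window_rows=3, stride=2, runs_done=1))  # True
```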
As an example, FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 4, the method may include at least two computation processes (for example, process 1 and process 2). Process 1 is the flow executed when the neural network computation starts. Process 2 is the subsequent flow that begins once layer 2 is able to perform one run. As shown in the figure, the flow may specifically include: S401, compute run 1 of layer 1. S402, store the output feature map of run 1 of layer 1 into the line buffers corresponding to layer 2. S403, determine that run 1 of layer 2 cannot yet be performed, and fall back to layer 1. S404, compute run 2 of layer 1. S405, store the output feature map of run 2 of layer 1 into the line buffers corresponding to layer 2. S406, determine that run 1 of layer 2 still cannot be performed, and fall back to layer 1. S407, compute run 3 of layer 1. S408, store the output feature map of run 3 of layer 1 into the line buffers corresponding to layer 2.
It can be understood that this example is described by taking the convolution window of layer 2 having 3 rows as an example. Therefore, one run of layer 2 requires layer 1 to perform 3 runs. Correspondingly, if the convolution window of layer 2 has A2 rows, then before layer 2 can perform its first run, layer 1 needs to perform A2 runs. The steps of process 1 are then complete. It can be understood that, in process 1, since the line buffers corresponding to layer 2 contain no data when the computation starts, layer 1 must perform 3 consecutive runs before the data needed for one run of layer 2 is available.
Process 2 is described below. Since data is already stored in the line buffers of layer 2, from this point on, each time layer 1 completes S2 runs, it can update S2 rows of the line buffers of layer 2, so that layer 2 can perform its next run. Here, S2 is the stride of layer 2, and S2 is an integer greater than or equal to 1; the following takes S2 = 1 as an example. S409, compute run 1 of layer 2. S410, determine that run 2 of layer 2 cannot yet be performed, and fall back to layer 1. S411, compute run 4 of layer 1. S412, store the output feature map of run 4 of layer 1 into the line buffers corresponding to layer 2. S413, compute run 2 of layer 2. It can be seen that, in process 2, each time layer 1 performs one run, layer 2 can perform one further run. A pipelined processing effect between different layers is thereby formed.
The above description takes as an example the case in which, after layer 1 completes one run, the row of results corresponding to that run is stored into the line buffers corresponding to layer 2. In other implementations of the present application, as described above, layer 1 may instead, during the computation of a run, store each result into the corresponding position of the line buffers of layer 2 as soon as the result for one convolution window position is obtained.
It should be noted that FIG. 4 shows only the computation logic of layer 1 and layer 2; the computation logic of the other layers may follow the steps illustrated by this flow. For example, the input feature map of layer 3 may be the output feature map of layer 2. Therefore, each result obtained while layer 2 performs a run can be stored into the line buffers of layer 3. After layer 2 completes one run, one row of data in the line buffers of layer 3 has been updated. The NPU may then determine whether layer 3 can perform a new run; if it can, the run of layer 3 is performed. If not, the NPU returns to layer 2; if layer 2 can perform its next run, that run is performed; if not, the NPU continues to fall back towards earlier layers until it reaches a layer that can perform its next run. After one run of that layer is performed, one row of data is updated in the line buffers of the next layer, and the next run of that next layer can be attempted. By analogy, the convolution computation of the N computing layers shown in FIG. 3 can be completed.
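The fallback logic of FIG. 4, generalized to N layers, can be modeled as a simple scheduler that always tries the deepest layer whose buffered input suffices for its next run and otherwise falls back towards layer 1. The following code is only a schematic model under several simplifying assumptions: whole rows are produced per run, the line buffers are modeled as growing Python lists rather than fixed-size buffers, the horizontal and vertical strides of a layer are taken to be equal, and loading the original input from the DDR is reduced to a pre-built list of rows. None of the function names come from the patent.

```python
# Illustrative model of the pipelined, fallback-driven scheduling across N layers.
def conv_rows(rows, kernel, stride):
    """Convolve `kernel` over the stacked `rows`, sliding horizontally by `stride`."""
    a, b = len(kernel), len(kernel[0])
    width = len(rows[0])
    return [sum(kernel[r][c] * rows[r][col + c] for r in range(a) for c in range(b))
            for col in range(0, width - b + 1, stride)]

def run_pipeline(input_rows, kernels, strides):
    """kernels[k] and strides[k] describe layer k+1; returns the last layer's output rows."""
    n = len(kernels)
    feature_rows = [list(input_rows)] + [[] for _ in range(n)]  # rows available per stage
    runs_done = [0] * n
    while True:
        progressed = False
        # Try the deepest layer first; fall back towards layer 1 when data is missing.
        for layer in reversed(range(n)):
            window_rows = len(kernels[layer])
            rows_needed = window_rows + runs_done[layer] * strides[layer]
            if len(feature_rows[layer]) >= rows_needed:
                top = runs_done[layer] * strides[layer]
                window = feature_rows[layer][top:top + window_rows]
                feature_rows[layer + 1].append(conv_rows(window, kernels[layer], strides[layer]))
                runs_done[layer] += 1
                progressed = True
                break
        if not progressed:
            break   # no layer can advance: the input is exhausted
    return feature_rows[n]

# Example loosely matching FIG. 5 to FIG. 11: a 6x6 input, layer 1 with a 2x2 kernel
# (stride 1), layer 2 with a 3x3 kernel (stride 2); the kernel weights are made up.
inp = [[r * 6 + c for c in range(6)] for r in range(6)]
k1 = [[1, 0], [0, 1]]
k2 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(run_pipeline(inp, [k1, k2], [1, 2]))
```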
It can be seen from the above description that, if in the convolutional neural network model no other layers need to be computed after the convolutional layer, the NPU can write this first row of data into the DDR. The NPU may write each datum into the DDR directly as it is obtained, or may write a whole row into the DDR once the row has been obtained. If, in the convolutional neural network model, other layers still need to be computed after the convolutional layer, for example activation/pooling computation or elementwise computation, then when the NPU obtains data of the convolution output feature map, it can write that data into the line buffers corresponding to the subsequent computation and proceed in the same way as the computation within the convolutional layer described above.
It can be understood, based on the description of FIG. 4 above, that because of the fallback mechanism (that is, the NPU can determine whether the current layer can perform one run and, if not, fall back to the previous layer), the pipelined computation mechanism of this solution can also be used when the stride of a computing layer is not 1. In this way, all convolution computations of a computing layer can be performed by configuring, for each computing layer, only the number of line buffers corresponding to the number of rows of its convolution window (for example, A1 line buffers for layer 1). Moreover, since the convolution results of the current layer are written directly into the line buffers of the next computing layer in the local cache and used for the computation of that next layer, a pipelined computation effect is formed, and the NPU does not need to read intermediate data from the DDR when computing the next layer. As a result, during the convolution computation of the convolutional layer shown in FIG. 3, the amount of data the NPU reads from the DDR is only that of one original input feature map and, if there is no subsequent computation, the amount of data written into the DDR is only that of one convolution output feature map. Clearly, the solution in the above example can significantly reduce the read/write pressure between the NPU and the DDR, and can therefore markedly limit the power consumption overhead caused by repeatedly reading and writing large amounts of data. In addition, since the NPU only needs to read A1 rows of data at a time when reading the original input feature map from the DDR, the read/write bandwidth limit does not degrade the computing efficiency of the whole system. Compared with current schemes that slice the feature map and compute each slice, the data processing method provided in this embodiment of the present application does not recompute any data, and therefore saves the computation bandwidth and the corresponding power consumption overhead of recomputing data.
In order to describe the solution provided by the embodiments of the present application more clearly, FIG. 5 to FIG. 13 below illustrate the convolution computation performed with this solution by taking the following example: the original input feature map contains 6×6 data, that is, i = j = 6; the convolution window of layer 1 has size 2×2, that is, A1 = B1 = 2, and the stride of the layer 1 convolution window is S1 = 1; for layer 2, A2 = B2 = 3 and S2 = 2. Refer to FIG. 5, which is a schematic diagram of the first run of the layer 1 convolution computation in this example. During initialization, the NPU may read the two rows of data a11 to a26 of the original input feature map from the DDR into line buffer 1 and line buffer 2, which are configured for layer 1 in the local cache. After this initialization, the NPU can start the convolution computation for layer 1. For example, the NPU may slide the convolution window corresponding to layer 1 over the data in line buffer 1 and line buffer 2, thereby completing the computation of one run.
It can be understood that each time the computation at one convolution window position is completed, one datum at the corresponding position of the output feature map is obtained. In addition, the output feature map of layer 1 is the input feature map of layer 2; therefore, in this example, each time one datum of the output feature map of layer 1 is obtained, it can be stored at the corresponding position of the line buffers configured for layer 2 in the local cache. For example, with reference to FIG. 6, take the results computed from a11 to a26 as b11 to b15. The result obtained by the layer 1 convolution window at the position covering a11 to a22 is b11. Each time the window slides within the run, a new result is obtained; for example, after b11 is computed, the layer 1 convolution window slides one datum to the right, and the next convolution computation yields b12. By analogy, when the layer 1 convolution window has slid to the rightmost position, b15 is computed. In this example, after obtaining b11, the NPU writes this result into column 1 of the first line buffer configured for layer 2 (for example, line buffer 3); after obtaining b12, the NPU writes it into column 2 of line buffer 3. By analogy, after one run of layer 1 is completed, all the data in line buffer 3 (b11 to b15) has been obtained.
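The column-by-column writing of the run 1 results into the line buffer that holds the first row of layer 2's input (line buffer 3 above) can be sketched as follows. The 2×2 kernel of ones and the input values are made up for the illustration, so the printed numbers are not the b11 to b15 of the figures.

```python
# Illustrative sketch: during run 1 of layer 1, each window result is written into
# the corresponding column of the line buffer that holds row 1 of layer 2's input.
row_a1 = [1, 2, 3, 4, 5, 6]          # a11 .. a16 held in line buffer 1
row_a2 = [7, 8, 9, 10, 11, 12]       # a21 .. a26 held in line buffer 2
kernel = [[1, 1], [1, 1]]            # assumed 2x2 layer 1 kernel (weights are made up)

line_buffer_3 = [0] * 5              # will hold b11 .. b15, i.e. row 1 of layer 2's input
for col in range(5):                 # window positions of run 1, from left to right
    result = sum(kernel[r][c] * [row_a1, row_a2][r][col + c]
                 for r in range(2) for c in range(2))
    line_buffer_3[col] = result      # store the result as soon as it is obtained
print(line_buffer_3)                 # [18, 22, 26, 30, 34]
```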
It can be understood that, when layer 1 performs the convolution computation of its first run, once the convolution window has completed the computation at the first position (that is, the position covering a11 to a22), a11 no longer participates in any subsequent computation. Therefore, in this example, after the first convolution computation of the first run is completed, the NPU may, as shown in FIG. 7, read from the DDR the datum that must be added for the first convolution computation of the second run, namely a31. The NPU may store a31 in place of a11 in the line buffer of layer 1 (for example, line buffer 1), so that the convolution computation of the second run can be performed later.
After a11 is replaced by a31, the data stored in the line buffers configured for layer 1 is as shown in FIG. 8. It can be seen that, by the time the NPU performs the second convolution computation of the first run of layer 1 (referred to as run 1), a31 is already stored in the NPU for the first convolution computation of run 2. It should be noted that this example takes the case in which the NPU reads new data from the DDR to replace data that no longer participates in subsequent computation each time one convolution computation is completed. In other implementations of the present application, the NPU may instead read multiple data from the DDR at once after completing all convolution computations of the first run, replacing the data in the line buffers that no longer participate in the computation. This reduces the number of times the NPU reads data from the DDR.
After the convolution computation of run 1 of layer 1 is completed, the line buffers of layer 1 can hold the data required for the convolution computation of the next run (run 2); the result of the data replacement is shown, for example, in (a) of FIG. 9. It can be understood that, to ensure that run 2 of layer 1 proceeds correctly, the NPU may adjust the positions of the data in the line buffers so that the convolution window covers the correct data while sliding. For example, the NPU may rewind the data stored in the line buffers in units of rows, achieving the effect of swapping the data stored in line buffer 1 and line buffer 2. That is, after the rewind, the data in the line buffers of layer 1 is converted from the distribution shown in (a) of FIG. 9 to the distribution shown in (b) of FIG. 9.
It should be noted that the rewind operation in (b) of FIG. 9 is an optional step; in some implementations of the present application, this rewind processing of the data is not required. It can be understood that a convolution computation essentially multiplies the data at each position of the convolution window by the data at the corresponding position of the input feature map and then sums these products to obtain the result of that convolution. Therefore, during the convolution computation, as long as the data of the convolution window and the data of the input feature map are multiplied with the correct correspondence, the order of the data in the line buffers need not be adjusted.
After the processing shown in FIG. 8 to (b) of FIG. 9, all the data needed for the convolution computation of run 2 of layer 1 is stored in the line buffers of layer 1. Thus, if the convolution computation of run 2 of layer 1 is then performed, the second row of the input feature map of layer 2 is obtained. For example, referring to FIG. 10, the convolution window can be moved back to the leftmost position of line buffer 1 and line buffer 2 to start the computation of run 2 of layer 1. After each computation, the window slides to the right by the stride corresponding to the layer 1 convolution window for the next computation, and so on until all convolution computations of run 2 are completed. The second row of the input feature map of layer 2 is thereby obtained; for example, the NPU may write these data (b21 to b25) into line buffer 4, which stores the second row of the input feature map of layer 2.
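Between two runs, the only change to layer 1's line buffers is that data which will no longer be used is overwritten with the next data read from the DDR, optionally followed by the rewind described above. A hedged sketch of this update is given below; for simplicity it replaces whole rows at once rather than datum by datum as in FIG. 7, and the function name is invented for the example.

```python
# Illustrative sketch: update a layer's line buffers between two runs.
# With a vertical stride of S, the oldest S rows are overwritten by the next S rows
# of the input; the optional "rewind" restores ascending row order inside the buffers.
def advance_line_buffers(line_buffers, next_rows, rewind=True):
    stride = len(next_rows)
    for k in range(stride):
        line_buffers[k] = next_rows[k]       # overwrite rows that are no longer needed
    if rewind:
        # Rotate so that the buffer order again matches the row order of the input.
        line_buffers[:] = line_buffers[stride:] + line_buffers[:stride]
    return line_buffers

bufs = [[11, 12, 13], [21, 22, 23]]          # rows 1 and 2 of the input (width 3)
advance_line_buffers(bufs, [[31, 32, 33]])   # run 2 needs rows 2 and 3
print(bufs)                                  # [[21, 22, 23], [31, 32, 33]]
```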
It can be understood that, to perform the computation of run 1, layer 2 needs at least 3 rows of input feature map data. Therefore, in process 1, the NPU may perform the convolution computations of runs 1 to 3 of layer 1 to obtain the data required for run 1 of layer 2. After obtaining the data required for run 1 of layer 2 (b11 to b35), the NPU can perform the computation of run 1 of layer 2, that is, enter the computation of process 2. The convolution computation process in layer 2 is similar to that in layer 1. For example, referring to FIG. 11, run 1 of layer 2 yields the input feature map data of layer 3 (if present), for example c11 to c13, and the NPU may store these data in the line buffers configured for layer 3 in the local cache (for example, line buffer 6).
It should be noted that, in this example, the stride of layer 2 is 2. That is, after the computation of run 1 of layer 2 is completed, the convolution window of layer 2 slides down by 2 rows of data to start the computation of run 2. For example, during the first convolution computation of run 1 of layer 2, the input feature map data covered by the convolution window of layer 2 is b11 to b35; during the first convolution computation of run 2 of layer 2, the data covered is b31 to b55. To obtain the data b31 to b55, the NPU must, in the computation of process 2, perform 2 runs of layer 1 (for example, run 4 and run 5 of layer 1). In other words, when the stride of the next layer is greater than 1, in the computation of process 2 shown in FIG. 4, the current layer must consecutively perform a number of runs equal to the stride of the next layer before the input feature map data needed to support one run of the next layer is obtained.
Exemplarily, in some implementations of the present application, after the computation of run 3 of layer 1 is completed, the computation of run 1 of layer 2 may be performed. After that, the NPU determines that the computation of run 2 of layer 2 cannot yet be performed, falls back to layer 1, and performs run 4 of layer 1. After run 4 of layer 1 is completed, the NPU determines whether run 2 of layer 2 can be performed; since the stride of layer 2 is 2, run 2 of layer 2 cannot be performed with the data available so far. The NPU therefore falls back again to layer 1 and performs run 5 of layer 1. After run 5 of layer 1 is completed, the NPU again determines whether run 2 of layer 2 can be performed; since the input feature map data required for run 2 of layer 2 (b31 to b55) has now been obtained, the NPU can then perform run 2 of layer 2. The subsequent process is similar and is not repeated here.
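The back-off sequence just described (run 1 of layer 2, then runs 4 and 5 of layer 1, then run 2 of layer 2) can be reproduced with a self-contained model of the readiness rule; the trace printed below is only a model of the decision order, not output of a real NPU.

```python
# Illustrative trace of the fallback decisions for layer 1 (2-row window, stride 1)
# and layer 2 (3-row window, stride 2) over a 6-row original input feature map.
layers = [  # (window rows, stride, rows currently buffered for this layer's input)
    {"name": "layer 1", "window": 2, "stride": 1, "rows_in": 6, "runs": 0},
    {"name": "layer 2", "window": 3, "stride": 2, "rows_in": 0, "runs": 0},
]
trace = []
while True:
    for k in (1, 0):   # try the deeper layer first, then fall back
        layer = layers[k]
        if layer["rows_in"] >= layer["window"] + layer["runs"] * layer["stride"]:
            layer["runs"] += 1
            trace.append(f'{layer["name"]}, run {layer["runs"]}')
            if k + 1 < len(layers):
                layers[k + 1]["rows_in"] += 1   # one new row for the next layer
            break
    else:
        break
print(trace)
# ['layer 1, run 1', 'layer 1, run 2', 'layer 1, run 3', 'layer 2, run 1',
#  'layer 1, run 4', 'layer 1, run 5', 'layer 2, run 2']
```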
In other implementations of the present application, since the stride of each computing layer is already determined when the convolutional neural network model starts computing, the NPU may perform run 5 of layer 1 directly after completing run 4 of layer 1, thereby obtaining in one go the data needed to support run 2 of layer 2. This reduces the number of times the NPU executes its judgment logic; however, more line buffers must then be configured for layer 1 so that the local cache can simultaneously hold the data needed for layer 1 to perform two runs in quick succession.
It can be understood that the above description of the computation logic for differing strides only takes the adjacent layer 1 and layer 2 as an example. For a neural network model provided with more computing layers, this computation logic can be extended to more layers. For example, suppose the computation of layer 3 follows layer 2, and the stride of layer 3 is 3. Then, in process 2, the NPU can perform more consecutive runs at layer 1, so that layer 2 can perform more consecutive runs, and the data needed for layer 3 to perform its next run can be obtained without going through the judgment logic. Of course, this requires more line buffers to be configured for layer 1 and layer 2. This approach can be applied when the local cache has ample spare storage space; it reduces the NPU's judgment logic and improves the computing efficiency of the system.
In other scenarios, in order to save storage space in the local cache, the computation can be performed with the judgment logic involved. For example, after layer 2 performs one run, the NPU determines whether layer 3 can perform its next run; if it can, the next run of layer 3 is performed. Conversely, if the next run of layer 3 cannot be performed, the NPU falls back to layer 2 and performs the next run of layer 2. Similarly, if the data currently in the line buffers corresponding to layer 2 cannot support layer 2 performing its next run, the NPU continues to fall back to the previous layer (for example, layer 1) and performs the next run there.
Thus, from the above description, it can be understood that with the data processing method provided by the embodiments of the present application, the local cache only needs to be configured, for each computing layer, with a number of line buffers corresponding to the number of rows of that layer's kernel function (for example, the convolution window of the convolution kernel in a convolution computation); the pipelined computation effect can then be obtained by following the method flow shown in FIG. 4 and the scheme described above. The NPU neither needs to read large amounts of data from the DDR repeatedly nor write large amounts of data into the DDR repeatedly. This saves the power consumption overhead introduced by reading and writing data and, since all intermediate data is stored in the line buffers of the local cache, significantly improves computing efficiency.
It should be noted that the solution provided by the embodiments of the present application can also be applied to scenarios with special computing requirements. Exemplarily, with reference to FIG. 3, take as an example a convolutional neural network model that also requires elementwise computation. FIG. 12 shows a schematic diagram of the computation logic in such a convolutional neural network. As shown in FIG. 12, in this convolutional neural network, in addition to the convolutional layer computation shown in FIG. 3, an elementwise computation is also required. Exemplarily, the elementwise computation may include an addition operation. The operands of this addition may be the convolution output feature map and the output feature map W obtained after the original input feature map is computed by computing layer W. The computing layer W may be the same computing layer as one of the convolution computing layers in the convolutional layer, or may be a computing layer different from every convolution computing layer in the convolutional layer. In this example, the elementwise addition may be performed in an Eltwise computing layer. In line with the foregoing description, corresponding line buffers may be configured for computing layer W in the local cache of the NPU. For example, take the case in which a convolution computation with a window of Aw rows and Bw columns is performed in computing layer W: Aw line buffers may then be configured for computing layer W in the local cache. Similarly, the Eltwise computing layer may also be configured with corresponding line buffers in the local cache; exemplarily, the number of these line buffers may be an integer greater than or equal to 1.
Before performing the elementwise addition, the NPU may perform, in a time-shared manner, the convolution computation of the convolutional layer and the convolution computation in computing layer W. For example, the NPU may perform the convolution computation of run 1 of computing layer W to obtain the first row of output feature map W, and store this first row in the line buffers corresponding to the Eltwise computing layer. The NPU may then perform the convolution computation of the convolutional layer; when the first row of the convolution output feature map is obtained, it can be input into the Eltwise computing layer, so that the NPU adds, in the Eltwise computing layer, the already stored first row of output feature map W and the first row of the convolution output feature map, thereby obtaining the first row of the Eltwise output feature map. If no other layers need to be computed, the NPU can output this first row of the Eltwise output feature map to the DDR as part of the output feature map of one round of convolutional neural network computation.
After obtaining the first row of the Eltwise output feature map, the NPU may update the data in the line buffers corresponding to computing layer W according to the method in the preceding examples, so as to perform the convolution computation of its second run. In addition, the NPU may perform the convolution computation of the convolutional layer according to the foregoing method to obtain the second row of the convolution output feature map, and perform the addition in the Eltwise computing layer, thereby obtaining the second row of the Eltwise output feature map. By analogy, the complete data of the output feature map of one round of convolutional neural network computation can be obtained. It can be seen that, according to the method provided by the embodiments of the present application, when performing special operations such as elementwise operations, it is only necessary to configure, for the corresponding computing layer, line buffers capable of storing the data required for one run of its computation. In this way, large amounts of data need not be read from and written to the DDR repeatedly, and the power consumption overhead introduced by those data transfers is avoided.
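The row-wise interleaving of computing layer W, the convolutional layer and the Eltwise addition can be sketched as follows. This is a simplified illustration that assumes both branches produce rows of equal width; the helper names are invented for the example, and only the row-wise buffering and element-wise addition reflect the scheme described here.

```python
# Illustrative sketch: produce the Eltwise output feature map row by row.
# branch_w_rows  - rows of output feature map W (computed by computing layer W)
# conv_out_rows  - rows of the convolution output feature map (layers 1..N)
def eltwise_pipeline(branch_w_rows, conv_out_rows):
    eltwise_out = []
    for w_row, conv_row in zip(branch_w_rows, conv_out_rows):
        # The row of output feature map W is buffered in the Eltwise line buffer first,
        # then added element by element to the matching convolution output row.
        eltwise_line_buffer = list(w_row)
        eltwise_out.append([w + c for w, c in zip(eltwise_line_buffer, conv_row)])
    return eltwise_out

# Example with two 3-wide rows per branch (the values are arbitrary).
print(eltwise_pipeline([[1, 2, 3], [4, 5, 6]],
                       [[10, 20, 30], [40, 50, 60]]))   # [[11, 22, 33], [44, 55, 66]]
```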
It should be noted that the above example is described by taking the case in which line buffers are configured separately for computing layer W. In other implementations of the present application, the line buffers of computing layer W may also be shared with layer 1. Exemplarily, in this case, the convolution computation in computing layer W and the convolution computation in layer 1 have in common that both are convolution computations on the data of the original input feature map, although their convolution kernels may differ.
In this example, when computing layer W and layer 1 have different convolution kernels, a common set of line buffers may be configured for computing layer W and layer 1, and the number of these line buffers may be determined by the convolution window with the larger number of rows among the convolution windows corresponding to the convolution kernels of computing layer W and layer 1. For example, if the number of rows Aw of the convolution window of computing layer W is 3 and the number of rows A1 of the convolution window of layer 1 is 2, then a local storage including 3 line buffers may be configured jointly for computing layer W and layer 1 to support the convolution computations of both.
It should be noted that, since computing layer W and layer 1 may need to perform convolution computations on the data stored in the line buffers using different convolution kernels, the data in the line buffers may be updated only after both computing layer W and layer 1 have completed the convolution computation at the corresponding position. For example, with reference to the description of FIG. 7: when only the layer 1 convolution computation needs to be performed on the input feature map, the NPU reads a31 from the DDR to replace a11 after the first convolution computation of run 1 of layer 1 is completed, as shown in FIG. 7. In the present example, however, a11 must participate not only in the first convolution computation of run 1 of layer 1 but also in the first convolution computation of run 1 of computing layer W. Therefore, the NPU may read a31 from the DDR to replace a11 only after a11 has been used in both of these computations. In this way, the line buffers are reused, saving storage space in the local cache.
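Sharing one set of line buffers between computing layer W and layer 1 amounts to sizing the buffers by the larger of the two window heights and applying both kernels to the same buffered rows before any datum is overwritten. The following is only a sketch of that idea with assumed 2-row and 3-row windows and made-up kernel weights; it is not the patented hardware.

```python
# Illustrative sketch: one shared set of line buffers serves layer 1 and layer W.
def shared_buffer_count(rows_layer1, rows_layer_w):
    return max(rows_layer1, rows_layer_w)    # e.g. max(2, 3) = 3 line buffers

def run_over_shared_buffers(shared_rows, kernel_layer1, kernel_layer_w, stride=1):
    """Apply both kernels to the same buffered rows; only afterwards may those rows
    be overwritten with new data read from the DDR."""
    def one_run(kernel):
        a, b = len(kernel), len(kernel[0])
        width = len(shared_rows[0])
        return [sum(kernel[r][c] * shared_rows[r][col + c]
                    for r in range(a) for c in range(b))
                for col in range(0, width - b + 1, stride)]
    return one_run(kernel_layer1), one_run(kernel_layer_w)

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # 3 shared line buffers
k1 = [[1, 1], [1, 1]]                                   # layer 1: 2-row window
kw = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                  # layer W: 3-row window
print(shared_buffer_count(2, 3))                        # 3
print(run_over_shared_buffers(rows, k1, kw))            # ([14, 18, 22], [18, 21])
```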
It can be understood that the above examples are all described by taking a single-core NPU as an example. For instance, with reference to the schematic flowchart shown in FIG. 4, the NPU may execute process 1 and process 2 therein in sequence. Of course, in connection with the description of the solution shown in FIG. 4, the execution order of some steps in process 1 and process 2 may differ from FIG. 4, which is not repeated here. At present, with the improvement of chip manufacturing processes, multi-core NPUs are also frequently used. When the data processing method provided by the embodiments of the present application is used in a multi-core NPU, the computations can be performed concurrently, further improving computing efficiency.
Exemplarily, take an NPU having two computing cores as an example. FIG. 13 shows a schematic comparison of the computation flows over time in the single-core scenario and the multi-core scenario. As shown in FIG. 13, in the single-core scenario, a computing core (for example, core 1) in the NPU performs the computation of run 4 of layer 1 at time T1, the computation of run 1 of layer 2 at time T2, and the computation of run 5 of layer 1 at time T3. Correspondingly, when the NPU is a dual-core processor (that is, in the dual-core scenario), one computing core (for example, core 1) of the NPU may perform the computation of run 4 of layer 1 at time T1, the computation of run 5 of layer 1 at time T2, and the computation of run 6 of layer 1 at time T3, while the computation of layer 2 is performed by another computing core (for example, core 2) of the NPU. For example, at time T2, core 2 may perform the computation of run 1 of layer 2 while core 1 performs the computation of run 5 of layer 1; at time T3, core 2 may perform the computation of run 2 of layer 2 while core 1 performs the computation of run 6 of layer 1.
Obviously, compared with the computation flow of a single-core NPU, the computation process of a multi-core NPU allows multiple computation processes to run concurrently, so that after the data required for run 1 of layer 2 has been obtained, the NPU can perform the next run of layer 1 and run 1 of layer 2 at the same time, instead of waiting for run 1 of layer 2 to finish before falling back to layer 1 to perform the next run. It should be noted that the above example is described with core 1 performing the computation of layer 1 and core 2 performing the computation of layer 2. The present application does not restrict the correspondence between computing cores and the computation of computing layers. That is, in other implementations of the present application, computations of different computing layers may also be performed on the same computing core. For example, when the computing capability of core 1 (for example, measured by throughput) is large and the throughput of core 2 is small, core 1 may, in addition to completing the computation of layer 1, also process part of the computation of layer 2 through time-division multiplexing, so as to keep the throughput consistent. This makes full use of the computing bandwidth and improves the working efficiency of the multi-core NPU.
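The dual-core timeline of FIG. 13 can be modeled with a trivial schedule table in which core 1 keeps producing layer 1 runs while core 2 consumes them one time slot later for layer 2. This is only an illustrative model of the overlap; the time slots T1 to T3 are treated as discrete steps, and the NoC-based scheduling of a real multi-core NPU is not represented.

```python
# Illustrative sketch: overlap of layer 1 and layer 2 runs on a dual-core NPU.
def dual_core_schedule(num_slots, first_layer1_run=4, first_layer2_run=1):
    timeline = []
    for t in range(num_slots):
        slot = {"time": f"T{t + 1}",
                "core 1": f"layer 1, run {first_layer1_run + t}"}
        if t >= 1:   # core 2 starts one slot later, once layer 2's input is ready
            slot["core 2"] = f"layer 2, run {first_layer2_run + t - 1}"
        timeline.append(slot)
    return timeline

for slot in dual_core_schedule(3):
    print(slot)
# T1: core 1 -> layer 1 run 4
# T2: core 1 -> layer 1 run 5, core 2 -> layer 2 run 1
# T3: core 1 -> layer 1 run 6, core 2 -> layer 2 run 2
```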
The foregoing describes the solutions provided by the embodiments of the present application mainly from the perspective of the processor. It can be understood that, in order to implement the above functions, the processor includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to be beyond the scope of the present application.
In the embodiments of the present application, the data processing apparatus corresponding to the processor may be divided into functional modules according to the foregoing method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. Optionally, the division of modules in the embodiments of the present application is schematic and is merely a division by logical function; other division manners are possible in actual implementation.
Refer to FIG. 14, which is a schematic structural diagram of a data processing apparatus 1400 provided by an embodiment of the present application. The data processing apparatus 1400 can be applied to performing neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2. The data processing apparatus 1400 is provided with a local cache. As shown in FIG. 14, the data processing apparatus 1400 includes: an obtaining unit 1401, configured to obtain first data, where the first data is used to perform a first computation run of a first computing layer, and the first computing layer is any one of the N computing layers; a storage unit 1402, configured to store the first data in a first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a computing unit 1403, configured to compute the first computation run of the first computing layer to obtain second data corresponding to the first computation run of the first computing layer, where the first computation run of the first computing layer includes convolution computation, using the convolution window of the first computing layer, on one or more rows of the first data. The storage unit 1402 is further configured to store the second data in a first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer after the first computing layer among the N computing layers. The computing unit 1403 is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first computation run of the second computing layer, compute the first computation run of the second computing layer to obtain fifth data corresponding to the first computation run of the second computing layer, where the first computation run of the second computing layer includes convolution computation, using the convolution window of the second computing layer, on one or more rows of the second data.
In a possible design, the calculation unit 1403 is further configured to, when the accumulated data is not sufficient for the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is a calculation run after the first calculation run of the first computing layer. In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer. In a possible design, the obtaining unit 1401 is further configured to read the first data from an external memory, where the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. In a possible design, the first data is part of the input feature map stored in the external memory, and the obtaining unit 1401 is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform the second calculation run of the first computing layer; the third data is stored by overwriting fourth data, where the fourth data is data in the first data that no longer participates in the calculation of the first computing layer. In a possible design, the storage unit 1402 is further configured to, in the process of performing the first calculation run of the first computing layer, store the calculation result of the convolution window of the first computing layer at each position in the first line buffer of the second computing layer as soon as that result is obtained. In a possible design, the obtaining unit 1401 is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit 1402 is configured to store the fifth data in a first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
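The overwrite-storage design described above (the fourth data being replaced by the third data once it no longer participates in the computation) behaves like a ring buffer whose depth equals the number of rows of the convolution window. A minimal sketch follows; the class and method names (LineBuffer, push, ready, ordered_rows) are hypothetical and illustrate only the buffering policy, not the patented hardware.

```python
import numpy as np

class LineBuffer:
    """Ring buffer whose depth equals the number of rows of the layer's
    convolution window; a newly fetched row overwrites the row that no
    longer participates in the layer's computation."""

    def __init__(self, num_rows, width):
        self.rows = np.zeros((num_rows, width))
        self.next_slot = 0               # slot holding the oldest row
        self.filled = 0

    def push(self, new_row):
        # overwrite-store: the incoming row (third data) replaces the
        # oldest buffered row (fourth data)
        self.rows[self.next_slot, :] = new_row
        self.next_slot = (self.next_slot + 1) % self.rows.shape[0]
        self.filled = min(self.filled + 1, self.rows.shape[0])

    def ready(self):
        """True when enough rows have accumulated for one calculation run."""
        return self.filled == self.rows.shape[0]

    def ordered_rows(self):
        """Return the buffered rows in arrival order for the convolution."""
        if not self.ready():
            return self.rows[:self.filled]
        return np.roll(self.rows, -self.next_slot, axis=0)
```

Under this policy, only as many rows as the convolution window is tall ever reside in the local cache for a given layer, which is what allows the feature map to be streamed from external memory piece by piece.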
For all related content of the steps involved in the foregoing method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, and details are not repeated here. That is, any of the above units may be implemented in software, in hardware, or in a combination of the two, so as to realize the functions shown in the method. The data processing apparatus 1400 including the above units may be a part integrated in the above-mentioned processor, such as functional hardware in the processor or functional software running in the processor. For example, if any of the units is implemented as a software module, it runs on the neural network computing apparatus 200 shown in FIG. 2.
Please refer to FIG. 15, which is a schematic structural diagram of an electronic device 1500 according to an embodiment of this application. The electronic device 1500 may include a processor 1501 and a memory 1502. The memory 1502 is configured to store computer-executable instructions. Exemplarily, in some embodiments, when the processor 1501 executes the instructions stored in the memory 1502, the electronic device 1500 is caused to perform one or more of steps S401-S413 shown in FIG. 4, as well as other operations that the electronic device needs to perform. In some embodiments, the electronic device 1500 may be provided with the neural network computing apparatus 200 described in FIG. 2. For the processor 1501, reference may be made to the neural network computing apparatus 200 in FIG. 2.
It should be understood that the processor in this embodiment includes, but is not limited to, one or more of the aforementioned CPU, NPU, FPGA, GPU, and DSP (digital signal processor). The above processor may be implemented as one or more chips. When the processor is integrated into a single chip, the chip is also referred to as a system-on-chip (SoC).
It should be noted that, for all related content of the steps involved in the foregoing method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, and details are not repeated here. The data processing apparatus provided in the embodiments of this application is used to perform the functions of the processor in the foregoing data processing method, and can therefore achieve the same effects as the foregoing method.
The functions, actions, operations, or steps in the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Although this application has been described with reference to specific features and embodiments, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of this application. Accordingly, the specification and drawings are merely exemplary descriptions of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Obviously, a person skilled in the art can make various changes and modifications to this application without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.

Claims (10)

  1. A data processing method, wherein the method is applied to a processor performing neural network computation, the neural network comprises N computing layers, N is an integer greater than or equal to 2, and the processor is provided with a local cache; the method comprises:
    obtaining first data, wherein the first data is used to perform a first calculation run of a first computing layer, and the first computing layer is any one of the N computing layers;
    storing the first data in a first line buffer of the first computing layer, wherein the first line buffer of the first computing layer is included in the local cache;
    performing the first calculation run of the first computing layer to obtain second data corresponding to the first calculation run of the first computing layer, wherein the first calculation run of the first computing layer comprises convolution of one or more rows of the first data with a convolution window of the first computing layer;
    storing the second data in a first line buffer of a second computing layer, wherein the first line buffer of the second computing layer is included in the local cache, and the second computing layer is a computing layer after the first computing layer among the N computing layers; and
    when the data accumulated in the first line buffer of the second computing layer is sufficient for a first calculation run of the second computing layer, performing the first calculation run of the second computing layer to obtain fifth data corresponding to the first calculation run of the second computing layer, wherein the first calculation run of the second computing layer comprises convolution of one or more rows of the second data with a convolution window of the second computing layer.
  2. The method according to claim 1, wherein the method further comprises:
    when the accumulated data is not sufficient for the first calculation run of the second computing layer, performing a second calculation run of the first computing layer, wherein the second calculation run of the first computing layer is a calculation run after the first calculation run of the first computing layer.
  3. The method according to claim 1 or 2, wherein
    the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.
  4. The method according to any one of claims 1-3, wherein, when the first computing layer is the first computing layer of the neural network, the obtaining first data comprises:
    reading the first data from an external memory, wherein the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.
  5. The method according to claim 4, wherein the first data is part of the input feature map stored in the external memory, and the method further comprises:
    obtaining third data from the external memory, wherein the third data is another part of the input feature map, and the third data is used to perform a second calculation run of the first computing layer; and
    storing the third data by overwriting fourth data, wherein the fourth data is data in the first data that no longer participates in the calculation of the first computing layer.
  6. The method according to any one of claims 1-5, wherein the storing the second data in a first line buffer of a second computing layer comprises:
    in the process of performing the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, storing the calculation result in the first line buffer of the second computing layer.
  7. The method according to any one of claims 1-6, wherein, after obtaining the fifth data corresponding to the first calculation run of the second computing layer, the method further comprises:
    storing the fifth data in a first line buffer of a third computing layer, wherein the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform convolution calculation of the third computing layer.
  8. A processor, wherein the processor comprises one or more computing cores and a local cache, and the processor is configured to implement the data processing method according to any one of claims 1-7.
  9. An electronic device, wherein the electronic device comprises one or more processors according to claim 8 and one or more memories, the memory is coupled to the processor, and the memory stores computer instructions;
    when the processor executes the computer instructions, the electronic device is caused to perform the data processing method according to any one of claims 1-7.
  10. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer instructions, and when the computer instructions are run, the data processing method according to any one of claims 1-7 is performed.
PCT/CN2021/074548 2021-01-30 2021-01-30 Data processing method and processor WO2022160310A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor
CN202180077853.3A CN116472537A (en) 2021-01-30 2021-01-30 Data processing method and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor

Publications (1)

Publication Number Publication Date
WO2022160310A1 true WO2022160310A1 (en) 2022-08-04

Family

ID=82652937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor

Country Status (2)

Country Link
CN (1) CN116472537A (en)
WO (1) WO2022160310A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111767986A (en) * 2020-06-24 2020-10-13 深兰人工智能芯片研究院(江苏)有限公司 Operation method and device based on neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 30 June 2020, UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA, article BAICHENG LIU: "VLSI Architecture Design for Binary Convolutional Neural Network Accelerator", pages: 1 - 90, XP055954903 *

Also Published As

Publication number Publication date
CN116472537A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US10803379B2 (en) Multi-memory on-chip computational network
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
US11775430B1 (en) Memory access for multiple circuit components
EP3664093B1 (en) Semiconductor memory device employing processing in memory (pim) and method of operating the semiconductor memory device
US10846621B2 (en) Fast context switching for computational networks
US9129674B2 (en) Hybrid memory device
JP7179853B2 (en) On-chip computational network
US10783104B2 (en) Memory request management system
JP2003504757A (en) Buffering system bus for external memory access
JP7201802B2 (en) Data read/write method and system in 3D image processing, storage medium and terminal
US20200192803A1 (en) Method and apparatus for accessing tensor data
CN109491934B (en) Storage management system control method integrating computing function
EP3844610B1 (en) Method and system for performing parallel computation
JP2022137247A (en) Processing for a plurality of input data sets
WO2022160310A1 (en) Data processing method and processor
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
US11093276B2 (en) System and method for batch accessing
CN111756802A (en) Method and system for scheduling data stream tasks on NUMA platform
JP7177948B2 (en) Information processing device and information processing method
WO2021244045A1 (en) Neural network data processing method and apparatus
CN116360672A (en) Method and device for accessing memory and electronic equipment
CN104615557B (en) A kind of DMA transfer method that multinuclear fine granularity for GPDSP synchronizes
US11907144B1 (en) Early semaphore update
WO2023115529A1 (en) Data processing method in chip, and chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21921911

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180077853.3

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21921911

Country of ref document: EP

Kind code of ref document: A1