WO2022160310A1 - Data processing method and processor - Google Patents

Data processing method and processor

Info

Publication number
WO2022160310A1
Authority
WO
WIPO (PCT)
Prior art keywords
layer
calculation
data
computing
convolution
Prior art date
Application number
PCT/CN2021/074548
Other languages
French (fr)
Chinese (zh)
Inventor
熊旭红
石洁珂
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2021/074548 (WO2022160310A1)
Priority to CN202180077853.3A (CN116472537A)
Publication of WO2022160310A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation using electronic means

Definitions

  • the embodiments of the present application relate to the technical field of artificial intelligence, and in particular, to a data processing method and processor.
  • neural networks have been widely used in image classification, video processing, speech recognition, data analysis and other scenarios.
  • Consider a processor processing an image using a neural network. Because the feature map (FM) of the image to be processed contains a large amount of data, it generally cannot be stored in the processor's local cache. The feature map data can therefore be stored in an external memory with a large storage space.
  • the processor can read the feature map data of the image (for example, called the original input feature map) from the external memory into the processor and perform calculation according to the neural network model. After obtaining the calculation results (eg, referred to as output feature maps), the processor may store the output feature maps in external memory.
  • a neural network model generally includes multiple different computing layers, such as convolutional layers, pooling/activation layers, etc.
  • the calculation process of each computing layer is different.
  • One computing layer may correspond to a kernel function (kernel); according to the kernel function, the processor may process the input feature map of that computing layer and obtain the corresponding output feature map.
  • The output feature map can be stored in the external memory, so that when the next layer's calculation is performed, the data stored in the external memory can be read as the input feature map of the current computing layer.
  • In this process, the processor needs to read and write large amounts of data from and to the external memory many times, which brings considerable power consumption to the device performing the neural network calculation.
  • This overhead grows with the number of objects to be processed (such as the number of images to be processed) and with their complexity (such as the amount of data in the feature maps of the images to be processed).
  • Embodiments of the present application provide a data processing method and processor, so as to reduce the power consumption of neural network computing.
  • the following technical solutions are adopted in the embodiments of the present application.
  • a data processing method is provided.
  • the method is applied to a processor that performs neural network computing, where the neural network includes N computing layers, where N is an integer greater than or equal to 2.
  • a local cache is provided in the processor.
  • the method includes: acquiring first data, where the first data is used to perform a first calculation process of a first calculation layer, where the first calculation layer is any one of the N calculation layers.
  • the first data is stored in the first line cache of the first computing layer, and the first line cache of the first computing layer is included in the local cache.
  • The first calculation stroke of the first computing layer is calculated to obtain second data corresponding to the first calculation stroke of the first computing layer, where the first calculation stroke of the first computing layer includes convolution calculation, using the convolution window of the first computing layer, on one or more rows of data of the first data.
  • The second data is stored in the first line cache of the second computing layer, the first line cache of the second computing layer is included in the local cache, and the second computing layer is the computing layer after the first computing layer among the N computing layers.
  • If the accumulated data stored in the first line cache of the second computing layer can support the first calculation stroke of the second computing layer, the first calculation stroke of the second computing layer is calculated to obtain fifth data corresponding to the first calculation stroke of the second computing layer, where the first calculation stroke of the second computing layer includes convolution calculation, using the convolution window of the second computing layer, on one or more rows of data of the second data.
  • the processor may acquire data required to perform one computation process when performing convolution computation of one computation layer.
  • A calculation stroke can be the calculation performed as the convolution window slides once from the leftmost position to the rightmost position.
  • Before performing the convolution calculation of a computing layer, the processor only needs to obtain A rows of data to start the calculation; it does not need to obtain the full input feature map required by the current computing layer. Since A rows of data are very small, they can be stored in the local cache instead of external memory (such as DDR).
  • The A rows of data can be read directly from the local cache, and one stroke of the current computing layer can be calculated accordingly.
  • The A rows of data may be the calculation results of the previous computing layer.
  • Since the previous computing layer only needs to compute those A rows of data, the intermediate data between the previous computing layer and the current computing layer does not need to be written into the DDR and later read back from the DDR by the processor. Instead, as the previous computing layer obtains the A rows of data by calculation, it can store them directly in the line cache configured for the current computing layer in the local cache.
  • the intermediate data does not need to be written into the DDR, and therefore does not need to be read from the DDR when performing the computation of the current layer.
  • Reading and writing data in the local cache avoids a large number of data interactions with the DDR, thereby saving power consumption.
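  • A minimal illustrative sketch of a single calculation stroke, not taken from the application itself: it assumes the A rows currently buffered for a layer are plain Python lists and that the kernel has the same number of rows as the line cache. Function and variable names are made up for illustration.

```python
def compute_stroke(line_cache, kernel, stride=1):
    """One calculation stroke: slide an A x B convolution window from the
    leftmost to the rightmost position over the A rows held in a layer's
    line cache, producing one row of that layer's output feature map."""
    a = len(kernel)            # rows of the convolution window
    b = len(kernel[0])         # columns of the convolution window
    width = len(line_cache[0])
    out_row = []
    for col in range(0, width - b + 1, stride):
        acc = 0.0
        for i in range(a):
            for j in range(b):
                acc += kernel[i][j] * line_cache[i][col + j]
        out_row.append(acc)
    return out_row
```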
  • The method further includes: if the accumulated data cannot support the first calculation stroke of the second computing layer, calculating a second calculation stroke of the first computing layer, where the second calculation stroke of the first computing layer is the calculation stroke after the first calculation stroke of the first computing layer.
  • a fallback mechanism in the inter-layer computing process is provided.
  • The processor may determine whether one stroke of the current layer can be calculated. If the data stored in the line cache corresponding to the current computing layer cannot support the calculation of one stroke of the current layer, the processor can fall back to the previous layer and continue with its next stroke, so as to obtain a new row of calculation results and update it into the line cache of the current computing layer.
  • The processor can loop over this scheme: determine whether the data stored in the current line cache can support the current computing layer in completing one stroke; if so, execute one stroke of the current computing layer; if not, return to the previous computing layer and perform its calculation.
  • A similar judgment-and-fallback mechanism can be implemented for the subsequent computing layers, so that the system's calculation does not get stuck at any one computing layer, and each computing layer only needs to occupy a number of line caches corresponding to the number of rows of its convolution window.
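  • A minimal sketch of this judgment-and-fallback flow, not from the application: each layer is modelled as a small dict, compute_stroke() is reused from the earlier sketch, and the fallback is expressed as a pull from the later layer toward the earlier ones, which produces the same ordering of strokes. For brevity the sketch keeps all produced rows in each cache; the scheme described here would instead overwrite rows that are no longer needed.

```python
def can_run_stroke(layer):
    # rows that must be buffered before this layer's next stroke can start
    needed = layer["strokes_done"] * layer["stride"] + len(layer["kernel"])
    return len(layer["cache"]) >= needed

def produce_stroke(layers, i, read_input_row):
    """Produce one output row (one stroke) of layer i. Whenever layer i's
    line cache lacks data, fall back to layer i-1, or to external memory
    (read_input_row) for layer 0."""
    layer = layers[i]
    while not can_run_stroke(layer):
        if i == 0:
            layer["cache"].append(read_input_row())                    # read from "DDR"
        else:
            layer["cache"].append(produce_stroke(layers, i - 1, read_input_row))
    top = layer["strokes_done"] * layer["stride"]                      # vertical offset of this stroke
    window = layer["cache"][top: top + len(layer["kernel"])]
    layer["strokes_done"] += 1
    return compute_stroke(window, layer["kernel"], layer["stride"])
```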
  • the number of rows in the first row cache is equal to the number of rows in the convolution window of the first computational layer.
  • the first line cache is used to store the first data, which may be the storage space configured for the first computing layer in the local cache of the processor, and is used to store the calculated data of any one stroke of the first computing layer.
  • the number of lines in the first line buffer needs to be at least equal to the number of lines in the convolution window of the first convolution calculation layer, so that enough data can be stored to perform one-stroke calculation.
  • Acquiring the first data includes: reading the first data from an external memory, where the first data is at least a portion of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. This solution provides a data acquisition mechanism for the case where the first computing layer is the first layer of the neural network. Since the amount of data of the input feature map is generally large, it can be stored in an external memory (such as DDR) that can exchange data with the processor. Before executing a calculation stroke of the first computing layer, the processor may read the corresponding data from the DDR and write it into the line cache configured for the first computing layer, so as to perform that calculation stroke.
  • In a case where the first data is a part of the input feature map stored in an external memory, the method further includes: acquiring third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation stroke of the first computing layer. The third data is stored overwriting fourth data, where the fourth data is the data in the first data that no longer participates in the calculation of the first computing layer.
  • a mechanism for dynamically adjusting the data in the line cache is provided. In this example, every time a calculation in a calculation trip is completed, some data in the first data will not be used again in subsequent calculations.
  • The processor can read some new data from the DDR and overwrite the data that will not be used in subsequent calculations. In this way, after the current calculation stroke is completed, the corresponding line cache already holds data that can be used for a new calculation stroke. It should be noted that, in some embodiments of the present application, this data replacement may be performed after a calculation stroke is completed, or during the execution of a calculation stroke.
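  • A minimal sketch of this line-cache update, not from the application: the external memory is stood in for by a callable that returns the next input row, and a deque models the physical line caches so that the S oldest rows are overwritten by freshly read rows after each stroke. Names are illustrative.

```python
from collections import deque

def refresh_line_cache(cache, stride, read_row):
    """cache: deque holding the A rows currently buffered for a layer.
    stride: vertical stride S of the layer (rows retired per stroke).
    read_row: callable returning the next row from external memory (DDR)."""
    for _ in range(stride):
        cache.popleft()            # drop a row that no longer participates
        cache.append(read_row())   # overwrite its slot with a newly read row
    return cache
```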
  • Storing the second data in the first line cache of the second computing layer includes: in the process of performing the first calculation stroke of the first computing layer, each time the calculation result of the convolution window of the first computing layer at one position is obtained, storing that calculation result in the first line cache of the second computing layer. Based on this solution, a writing mechanism for the second data is provided.
  • Each calculation result may be stored in the corresponding position of the line cache of the second computing layer. In this way, after one calculation stroke of the first computing layer is completed, one or more rows of data stored in the line cache of the second computing layer can be obtained for performing the calculation of the second computing layer.
  • The method further includes: storing the fifth data in the first line cache of the third computing layer, where the first line cache of the third computing layer is included in the local cache; the third computing layer is the computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • This provides the calculation mechanism for the other computing layers included in the neural network. For example, each time the second computing layer completes one stroke, the calculation results may be stored in the line cache corresponding to the next computing layer (eg, the third computing layer), so that one stroke of the third computing layer can be performed once enough data has been obtained.
  • In a second aspect, a data processing apparatus is provided, which is used to perform neural network computation, where the neural network includes N computing layers, and N is an integer greater than or equal to 2.
  • the data processing device is provided with a local cache.
  • the device includes: an acquisition unit for acquiring first data, where the first data is used to perform a first calculation process of a first calculation layer, where the first calculation layer is any one of the N calculation layers.
  • the storage unit is configured to store the first data in the first line cache of the first computing layer, and the first line cache of the first computing layer is included in the local cache.
  • A computing unit is configured to calculate the first calculation stroke of the first computing layer to obtain second data corresponding to the first calculation stroke of the first computing layer, where the first calculation stroke of the first computing layer includes using the convolution window of the first computing layer to perform convolution calculation on one or more rows of data of the first data.
  • The storage unit is further configured to store the second data in the first line cache of the second computing layer, where the first line cache of the second computing layer is included in the local cache, and the second computing layer is the computing layer after the first computing layer among the N computing layers.
  • The computing unit is further configured to calculate the first calculation stroke of the second computing layer when the accumulated data stored in the first line cache of the second computing layer can support the first calculation stroke of the second computing layer, to obtain fifth data corresponding to the first calculation stroke of the second computing layer, where the first calculation stroke of the second computing layer includes convolution calculation, using the convolution window of the second computing layer, on one or more rows of data of the second data.
  • The computing unit is further configured to calculate the second calculation stroke of the first computing layer when the accumulated data cannot support the first calculation stroke of the second computing layer, where the second calculation stroke of the first computing layer is the calculation stroke after the first calculation stroke of the first computing layer.
  • the number of rows in the first row cache is equal to the number of rows in the convolution window of the first computational layer.
  • The acquisition unit is configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.
  • In a case where the first data is a part of the input feature map stored in an external memory, the acquisition unit is further configured to acquire third data from the external memory, where the third data is another part of the input feature map, and the third data is used to perform the second calculation stroke of the first computing layer.
  • The third data is stored overwriting fourth data, where the fourth data is the data in the first data that no longer participates in the calculation of the first computing layer.
  • The storage unit is further configured to, in the process of performing the first calculation stroke of the first computing layer, store, each time the calculation result of the convolution window of the first computing layer at one position is obtained, that calculation result in the first line cache of the second computing layer.
  • the acquisition unit is further configured to acquire fifth data corresponding to the first calculation journey of the second calculation layer.
  • The storage unit is further configured to store the fifth data in the first line cache of the third computing layer, where the first line cache of the third computing layer is included in the local cache; the third computing layer is the computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • In a third aspect, a processor is provided, comprising one or more computing cores and a local cache, where the processor is configured to implement the data processing method of the first aspect or any one of its possible designs.
  • In a fourth aspect, an electronic device is provided, which includes one or more processors as described in the third aspect and one or more memories.
  • the memory is coupled to the processor, the memory stores computer instructions.
  • When the processor executes the computer instructions, the electronic device is caused to execute the data processing method described in the first aspect or any one of its possible designs.
  • In a fifth aspect, a computer-readable storage medium is provided, which includes computer instructions; when the computer instructions are executed, the data processing method described in the first aspect or any one of its possible designs is performed.
  • Any one of the designs and possible designs of the above second aspect to fifth aspect corresponds to the above first aspect or one of its possible designs, and can therefore bring about similar technical effects, which are not repeated here.
  • FIG. 1 is a schematic structural diagram of a convolutional neural network
  • FIG. 2 is a schematic structural diagram of a neural network computing device according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a convolution layer provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a calculation logic provided by an embodiment of the present application.
  • FIG. 6 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 7 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a line cache provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another line cache provided by an embodiment of the present application.
  • FIG. 10 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 11 is another schematic diagram of calculation logic provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a neural network provided by an embodiment of the application.
  • FIG. 13 is a schematic diagram of a computing logic sequence in a single-core and multi-core scenario provided by an embodiment of the present application
  • FIG. 14 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 shows a schematic structural diagram of a convolutional neural network.
  • a convolutional layer including one or more convolutional computation layers and a pooling/activation layer including one or more computation layers may be provided in a convolutional neural network.
  • the processor can perform convolution calculation on the input feature map according to the convolution kernel corresponding to each convolution calculation layer in the convolution layer.
  • For example, the processor may slide the convolution window corresponding to a preset convolution kernel over the input feature map with a preset stride, so as to obtain the calculation result at each window position.
  • the calculation results can be combined into corresponding output feature maps.
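  • A minimal sketch of this conventional whole-layer convolution, not from the application: the full input feature map is swept by the kernel window at a given stride and every window result is collected into the output feature map. Names and values are illustrative only.

```python
def conv_layer(feature_map, kernel, stride=1):
    """Slide a kh x kw kernel over the whole input feature map and combine
    the per-position results into the output feature map."""
    h, w = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, h - kh + 1, stride):
        row = []
        for c in range(0, w - kw + 1, stride):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * feature_map[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out
```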
  • In the pooling/activation layer, the processor can process the input feature map according to a specific function, such as pooling and/or activation processing.
  • the convolutional neural network may further include an input layer.
  • the input layer can be used to store the feature map data of the image to be processed, so that the convolutional layer can obtain the input feature map from the input layer.
  • the input layer may be located in external memory connected to the processor that performs the convolution computations.
  • multiple convolutional computational layers in a convolutional layer can be set up interleaved with computational layers in a pooling/activation layer. For example, as shown in Figure 1, after performing a part of the convolution calculation, pooling or activation processing can be performed, and then the convolution calculation can be performed. In other implementations, convolution computations may be performed first, followed by selective pooling and/or activation processing. After the above calculation is completed, the result of completing one round of convolutional neural network calculation can be obtained, and the result is output through the output layer.
  • a local buffer can be set inside the processor.
  • This local cache can be used to store small amounts of data.
  • the data stored in the local cache has the characteristics of fast read and write.
  • the local cache can be used to store data such as convolution kernels corresponding to each computing layer of the convolutional neural network model.
  • the raw input feature maps can be stored in an external memory connected to the processor.
  • the external memory may be a storage medium with larger storage space.
  • the storage medium may be a double-rate synchronous dynamic random-access memory (Double Data Rate synchronous dynamic random-access memory, DDR SDRAM) and the like.
  • the DDR SDRAM may also be referred to as DDR for short.
  • the processor can read the raw input feature map from the DDR for neural network computation when it starts computation.
  • the output feature map of the previous layer can be used as the input feature map of the next layer.
  • the feature map data between two computing layers can also be called intermediate data.
  • Although the data volume of the intermediate data in the neural network calculation is generally no larger than that of the original input feature map, it still far exceeds the storage capacity of the local cache.
  • If the processor writes the intermediate data into the DDR, the processor must perform multiple read and write interactions with the DDR involving large amounts of data, which causes considerable power consumption.
  • the lack of read and write bandwidth will also limit the efficiency of neural network computing.
  • In one approach, the feature map data stored in the DDR can be split, so that the resulting feature map slices can be stored in the local cache.
  • the processor can read a slice from the DDR and store it in the local cache, and perform calculations on the slice's data.
  • the data amount of the intermediate data and the output feature map in the calculation process of a slice will not be greater than the data amount of the input feature map corresponding to the slice.
  • the processor can read the next slice from the DDR into the local cache, and repeat the above steps for calculation. This is repeated until all slice calculations are completed.
  • the output feature maps corresponding to multiple slices can be stored in the DDR.
  • the processor needs to combine the output feature maps corresponding to these slices respectively, thereby obtaining a complete output feature map corresponding to the original input feature map.
  • In addition, adjacent slices need to include duplicate data. This duplicated data is calculated multiple times, which reduces the efficiency of the entire calculation process and leaves the power consumption insufficiently optimized.
  • The data processing method provided by the embodiments of the present application establishes a pipelined computing mechanism across different computing layers, so that the calculation of one layer does not need to be fully completed before the calculation of the next layer begins.
  • Such a solution can significantly reduce the amount of intermediate data, so that the intermediate data can be stored in the local cache, avoiding the power consumption overhead of multiple read and write interactions between the processor and an external memory (such as DDR).
  • the entire computing process does not need to wait, so the computing efficiency can be effectively improved.
  • In addition, because of the pipelined computing mechanism, this solution performs no repeated, redundant calculations on the data, so its computing efficiency is significantly higher than that of the prior art solution.
  • the data processing method provided by the embodiment of the present application provides a calculation method with a fallback mechanism, which can be applied to different convolution calculation scenarios with a step size greater than or equal to 1.
  • FIG. 2 is a schematic diagram of a logical structure of a neural network computing apparatus 200 according to an embodiment of the present application.
  • the neural network computing device 200 can be used to implement the computation of a neural network model including a convolutional neural network model according to the method provided by the embodiment of the present application.
  • the external storage module 230 coupled with the neural network computing device 200 is also shown in this FIG. 2 .
  • the neural network computing device 200 can perform read-write interaction with the external storage module 230 through the interface provided thereon.
  • For example, the feature map data to be processed (eg, the original input feature map) is read from the external storage module 230, and the output feature map data produced by the completed neural network calculation is written into the external storage module 230.
  • the neural network computing device 200 may be the processor described above.
  • the external memory module 230 may include the DDR described above.
  • the external storage module 230 may further include a system cache, and the system cache may be a system cache of the device provided with the neural network computing apparatus 200 shown in FIG. 2 .
  • the system cache may realize its function through different storage media.
  • For example, the system cache may be implemented via flash memory.
  • the system cache may also be other storage media such as a solid state disk (Solid State Device, SSD).
  • the neural network computing apparatus 200 may include a computing module 210 and a local cache 220 .
  • the computing module 210 may be a module in the neural network computing device 200 for implementing various computing functions.
  • the calculation module 210 may include a convolution calculation unit 211, an activation calculation unit 212, a pooling calculation unit 213, an Eltwise calculation unit 214, and the like.
  • the convolution calculation unit 211 can be used to perform convolution calculation.
  • the convolution calculation unit 211 may include one or more multiplier-adders, or other components capable of implementing convolution calculation.
  • the activation calculation unit 212 may be used to perform activation processing.
  • the pooling computing unit 213 can be used to implement the function of pooling processing.
  • the Eltwise computing unit 214 may be used to implement elementwise computing functions.
  • The structure of the calculation module 210 in the above example is only one possible implementation; for the calculation of different neural network models, the units included in the calculation module 210 to implement each function may be the same as or different from the above example.
  • the activation calculation unit 212 and the pooling calculation unit 213 may not be set in the calculation module 210 .
  • the Eltwise computing unit 214 may not be set in the computing module 210 .
  • The computing module 210 may be a neural-network processing unit (NPU), a field programmable gate array (FPGA), a central processing unit (CPU), a graphics processing unit (GPU), or another component that can implement the corresponding computing functions.
  • the computing module 210 may be a single-core NPU or a multi-core NPU having multiple computing cores. This embodiment of the present application does not limit this.
  • the processing logic when the data processing method provided in the embodiment of the present application is applied to a single-core NPU may be multiplexed on a multi-core NPU.
  • the parallel computing mechanism of the multi-core NPU can be used to further improve the computing efficiency.
  • the interconnection of the multiple computing cores may be implemented by means of an internal interconnection (interconnect).
  • For example, a network-on-chip (NoC) structure can be used to realize the interconnection of multiple computing cores.
  • The NoC interconnection can dynamically configure the connections between different computing cores according to the network structure, so that the calculation load can be distributed according to the computing pressure on each core. This enables dynamic scheduling of the computation and improves the computing efficiency of a multi-core NPU.
  • the neural network computing device 200 provided in the embodiment of the present application may further be provided with a local cache 220 .
  • the local cache 220 can be used for fast reading and writing of data.
  • the local cache 220 may be a storage medium with a smaller storage space.
  • the local cache 220 may be an internal cache of the NPU.
  • the local cache 220 may be used to support line buffer technology.
  • the local cache 220 may be configured with multiple line caches.
  • the multiple line buffers may respectively correspond to different computing layers in the neural network model.
  • the number of line buffers corresponding to one computing layer may be determined according to the window size of the kernel function of the computing layer. For example, take the convolution window of the convolution computing layer as M rows and N columns as an example.
  • an M-line cache may be configured for the convolutional computation layer.
  • For the other computing layers, corresponding line caches may also be configured in the local cache 220. Since the number of rows of a kernel window is generally small, the total number of line caches configured for all computing layers of the neural network model is not large, and the above configuration can be implemented within the space of the current local cache 220.
  • The line cache configuration of a computing layer may also take into account the stride of the current computing layer and of related computing layers, and/or special calculations that need to be carried out in the neural network model (such as element-wise calculation).
  • the specific configuration will be described in detail in the subsequent description.
  • In the following, the data processing method provided by the embodiments of the present application is described using an example in which the neural network model is the convolutional neural network model with the structure shown in FIG. 1, the computing module is a single-core NPU, the local cache is the cache inside the NPU, and the external storage module is a DDR. The process of performing each convolution computing layer in the convolutional layer with this data processing method is described in detail; the calculation of the other layers in the convolutional neural network model can refer to the calculation process in the convolutional layer.
  • the convolutional layers included in the convolutional neural network may be provided with N convolutional computing layers as shown in FIG. 3 .
  • the N convolutional computing layers can be layer 1, layer 2, . . . , layer N.
  • the input feature map of layer 1 can be the original input feature map.
  • the output feature map obtained after completing the convolution computation at layer N can be called the convolution output feature map.
  • When starting the convolution calculation, the NPU can be initialized.
  • the NPU may read data corresponding to the number of rows of the convolution window of layer 1 from the DDR and write it into the local cache, which is the row cache configured for layer 1.
  • the data of the i-th row and the j-th column of the original input feature map is denoted as a ij , and both i and j are integers greater than or equal to 1.
  • For example, if the convolution window of layer 1 has A1 rows, A1 line caches can be configured for layer 1 in the local cache.
  • the NPU can read the first A1 row data of the original input feature map from the DDR and write it into the row cache configured for layer 1 in the local cache.
  • the NPU can perform layer 1 convolution calculations on data written to the local cache. For example, the convolution window corresponding to layer 1 is used on the A1 line buffer, and the convolution calculation is performed in turn from left to right to obtain the convolution result of the corresponding position.
  • the calculation process performed by the convolution window from the leftmost to the rightmost may be referred to as the calculation of one run.
  • the calculation of one trip includes calculation processing of part of the data located in the window in one or more rows of data.
  • the result of the first row of the output feature map corresponding to layer 1 can be obtained.
  • the output feature map of this layer 1 can be used as the input feature map of layer 2. Therefore, each time the convolution calculation of a position is completed, the NPU can store the calculation result in the local cache, which is the corresponding position in the line cache configured for layer 2. For example, after completing the convolution calculation of the first stroke of layer 1, the line cache corresponding to layer 2 can store the data of the first line of the input feature map corresponding to layer 2.
  • In some embodiments, the NPU can read new data from the DDR as the convolution calculation in layer 1 progresses, to overwrite the data that will no longer be used in the calculation of layer 1, so that after the calculation of one stroke is completed, layer 1 can continue with the calculation of the next stroke without waiting for the NPU to read data from the DDR.
  • For example, take the stride of layer 1 as S1. After layer 1 completes the first convolution calculation of a stroke, the NPU can read from the DDR the data, in the first S1 columns, of the rows of the original input feature map that will newly be needed for the next stroke, and store it in the local cache at the position holding the data of the first S1 rows and first S1 columns, which no longer participates in the calculation.
  • the NPU may also read the data required for the next trip from the DDR after completing the calculation of one trip, and store it in the A1 line cache of layer 1.
  • In this way, the convolution calculation of one stroke of layer 1 can be completed, which yields the first row of the input feature map of layer 2. Take the convolution window of layer 2 as A2 rows and B2 columns as an example.
  • the NPU can continue to perform convolution calculations for other strokes at layer 1 according to the above scheme, until the A2 line data required for convolution calculation at layer 2 is obtained. That is to say, the NPU can perform the convolution calculation of A2 strokes at layer 1, thereby obtaining the A2 row data as the input feature map of layer 2 and storing it in the local cache in the A2 row cache configured for layer 2. Then, the NPU can start to perform the convolution calculation of the first pass of layer 2. Then, the input feature map of the first line of layer 3 is obtained and stored in the line cache configured for layer 3 in the local cache.
  • After layer 2 completes one stroke, layer 1 needs to supply the new input feature map data required for the convolution calculation of layer 2's next stroke. For example, take the stride of layer 2 as S2.
  • the row cache corresponding to layer 2 stores A2 row data obtained through calculation of A2 runs of layer 1.
  • the NPU can return to layer 1 to perform the calculation of the run from (A2+1) to (A2+S2) of layer 1.
  • The new S2 rows of data are thus obtained and stored in the line caches corresponding to layer 2.
  • the NPU can continue to perform the second stroke calculation in layer 2.
  • the first row of data of the output feature map of the convolutional layer can be obtained.
  • FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • the method may include at least two calculation processes (eg, Process 1 and Process 2).
  • the process 1 is the flow when the neural network calculation is started.
  • Process 2 is the subsequent flow once layer 2 is able to perform the calculation of one stroke.
  • The process may specifically include: S401. Calculate stroke 1 of layer 1.
  • S402 store the output feature map of run 1 of layer 1 into the line cache corresponding to layer 2 .
  • S403. It is determined that the calculation of the stroke 1 of the layer 2 cannot be performed in the layer 2, and the calculation is returned to the layer 1.
  • S404 calculate the stroke 2 of layer 1 .
  • In FIG. 4, the convolution window of layer 2 has 3 rows for illustration. Therefore, for layer 2 to perform one calculation stroke, layer 1 needs to perform 3 calculation strokes. Correspondingly, if the convolution window of layer 2 has A2 rows, then before layer 2 performs the calculation of its first stroke, layer 1 needs to perform the calculation of A2 strokes. This completes the steps of process 1. It can be understood that in process 1, since there is no data in the line cache corresponding to layer 2 at the start of the calculation, layer 1 must perform three consecutive strokes in order to obtain the data needed for layer 2 to execute one stroke.
  • Process 2 is described below. Since data is already stored in the line caches of layer 2, every time layer 1 completes the calculation of S2 strokes, it can update S2 rows in the line caches of layer 2, so that layer 2 can execute the calculation of its next stroke.
  • S409. Calculate the stroke 1 of layer 2.
  • S410 it is determined that the calculation of the stroke 2 of the layer 2 cannot be performed in the layer 2, and the calculation is returned to the layer 1.
  • S411 calculate the stroke 4 of layer 1 .
  • S412 store the output feature map of run 4 of layer 1 into the line cache corresponding to layer 2 .
  • S413. Calculate the stroke 2 of layer 2. It can be seen that, in process 2, every time layer 1 performs the calculation of one stroke, layer 2 can continue to perform the calculation of one stroke. In this way, the pipeline processing effect between different layers can be run.
  • In the above, for ease of description, layer 1 stores the row of calculation results corresponding to a stroke into the line cache corresponding to layer 2 after completing that stroke. In other implementations of the present application, as described above, layer 1 may instead store each calculation result into the corresponding location in the line cache of layer 2 as soon as the result for one convolution window position is obtained during the stroke.
  • the input feature map of layer 3 can be the output feature map of layer 2. Therefore, when layer 2 obtains one calculation result in the process of executing one stroke each time, the result can be stored in the line cache of layer 3. After layer 2 completes one calculation trip, the row cache of layer 3 can be updated with one row of data.
  • the NPU can determine whether layer 3 can perform the calculation of a new run, and if so, execute the calculation run of layer 3.
  • the NPU can write the first row of data into the DDR.
  • the NPU may directly write the data into the DDR after acquiring one piece of data.
  • the NPU can also write into the DDR together after acquiring 1 row of data.
  • the NPU can obtain the output feature map of the convolutional layer.
  • If there is a subsequent calculation, the data is written, as it is generated, into the line cache corresponding to that subsequent calculation, and the calculation proceeds in the same way as in the convolutional layer described above.
  • In this way, the convolution calculation results of the current layer are written directly into the line cache of the next computing layer in the local cache and used for the calculation of the next computing layer, forming a pipelined calculation. Therefore, in the calculation of the next layer, the NPU does not need to read intermediate data from the DDR. In the convolution calculation of the convolutional layer shown in FIG. 3, the amount of data the NPU reads from the DDR is therefore only that of one original input feature map; similarly, if there is no subsequent calculation requirement, the amount of data written into the DDR is only that of the output of one convolutional layer.
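  • An illustrative use of the two earlier sketches, not from the application: two convolution layers are driven in the pipelined order described above. The kernels, sizes and strides below are made-up values.

```python
layers = [
    {"kernel": [[1.0, 0.0], [0.0, 1.0]], "stride": 1, "cache": [], "strokes_done": 0},      # layer 1: 2x2 window
    {"kernel": [[1.0] * 3 for _ in range(3)], "stride": 1, "cache": [], "strokes_done": 0},  # layer 2: 3x3 window
]
input_rows = iter([[float(c + 10 * r) for c in range(6)] for r in range(6)])  # stand-in for the DDR

# Asking for the first output row of layer 2 automatically triggers three
# strokes of layer 1 first (process 1); only four input rows are pulled from
# the "DDR", and no intermediate row ever leaves the Python-level "local cache".
first_row = produce_stroke(layers, 1, lambda: next(input_rows))
```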
  • The convolution calculation is now illustrated with an example. Refer to FIG. 5, which is a schematic diagram of the first stroke of the layer 1 convolution calculation in this example.
  • the NPU can read the two lines of data from a 11 to a 26 in the original input feature map from the DDR into the local cache, respectively, into line cache 1 and line cache 2 configured for layer 1.
  • the NPU can start the convolution calculation for layer 1. For example, the NPU can perform sliding calculation on the data of line buffer 1 and line buffer 2 for the convolution window corresponding to layer 1, thereby completing the calculation of one stroke.
  • The output feature map of layer 1 can be the input feature map of layer 2. Therefore, in this example, every time one data element of the output feature map of layer 1 is obtained, it can be stored in the corresponding location of the line cache configured for layer 2 in the local cache. For example, in conjunction with FIG. 6 , take the calculation results of a 11 to a 26 as b 11 to b 15 . The result obtained by applying the layer 1 convolution window at positions a 11 to a 22 is b 11 . Each time the window slides within the calculation stroke, a new result is obtained.
  • the layer 1 convolution window slides one data to the right, and the convolution calculation can continue to obtain b 12 .
  • the layer 1 convolution window is slid to the far right to calculate and obtain b 15 .
  • the NPU can write the result to the first column in the first row cache (eg, row cache 3) configured for layer 2.
  • the NPU can write the result to column 2 in line cache 3.
  • all data in line cache 3 (eg, b 11 to b 15 ) can be obtained.
  • As shown in FIG. 7, after completing the 1st convolution calculation of run 1, the NPU can read from the DDR the data that needs to be supplemented for the 1st convolution calculation of run 2, ie a 31 .
  • The NPU can store a 31 in the line cache of layer 1 (such as line cache 1) in place of a 11 , so that the convolution calculation of the second run can be performed later.
  • the data stored in the line cache configured for layer 1 is shown in Figure 8. It can be seen that when the NPU performs the second convolution calculation of the first run of layer 1 (such as called run 1), a 31 has been stored in the NPU for the first convolution calculation of run 2. It should be noted that in this example, the NPU reads new data from the DDR after completing one convolution calculation and replaces the data that will not participate in subsequent calculations as an example. In other implementations of the present application, the NPU may also read multiple data from the DDR at one time after completing all the convolution calculations of the first run, and replace the data in the line cache that will not participate in the calculation. This can reduce the number of times the NPU reads data from the DDR.
  • the line cache of layer 1 may store data required for the convolution calculation of the next run (eg run 2). For example, the result of data replacement is shown in (a) of FIG. 9 .
  • the NPU can appropriately adjust the position of the data in the line buffer, so that the correct data can be covered during the sliding of the convolution window. For example, the NPU can rewind the data stored in the line cache in units of rows, so as to achieve the effect of exchanging the data stored in the line cache 1 and the line cache 2. That is, after rewinding, the data in the line buffer of layer 1 can be converted from the distribution shown in (a) in Fig. 9 to the distribution shown in (b) in Fig. 9 .
  • The rewinding operation shown in (b) of FIG. 9 is an optional step; in some implementations of the present application, the data does not need to be rearranged. A convolution calculation can be understood as multiplying the data at each position of the convolution window by the data at the corresponding position of the input feature map and then adding these products to obtain the result of that convolution. Therefore, during the convolution calculation, as long as the convolution window data and the input feature map data used in the multiplication have the correct correspondence, the order of the data in the line caches does not need to be adjusted.
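  • A minimal sketch of this point, not from the application: a rolling start index maps the logical rows of the window onto the physical line caches, so rows can be overwritten in place without any rewinding or copying. Names are illustrative.

```python
def window_rows(line_caches, start, a):
    """Return the A logical rows of the current window without moving data.
    line_caches: list of physical row buffers; start: index of the physical
    buffer holding the logical top row; a: rows in the convolution window."""
    return [line_caches[(start + i) % len(line_caches)] for i in range(a)]

# After a stroke with vertical stride S, the S oldest physical rows are
# overwritten in place with new data and the logical start advances:
#     start = (start + S) % len(line_caches)
```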
  • the convolution window can be re-moved to the far left of line buffer 1 and line buffer 2 to start performing layer 1 run 2 calculations. After each calculation, slide to the right according to the step size corresponding to the convolution window of layer 1 for the next calculation. And so on until all convolution calculations in run 2 are completed. In this way, the second row data of the input feature map of layer 2 can be obtained.
  • The NPU can write these data (such as b 21 to b 25 ) into line cache 4, the line cache that stores the second row of the input feature map of layer 2.
  • the NPU can perform the convolution calculation from layer 1 run 1 to layer 1 run 3 to obtain the data required for layer 2 to perform run 1 calculation.
  • the NPU can perform the calculation of run 1 of layer 2, that is, enter the calculation of process 2.
  • the convolution computation process in layer 2 is similar to the convolution computation process in layer 1.
  • The run 1 calculation of layer 2 can obtain input feature map data (such as c 11 to c 13 ) for layer 3 (if present), and the NPU can store this data in the line cache configured for layer 3 in the local cache (eg, line cache 6).
  • the convolution window in layer 2 will slide down by 2 data to start the calculation of run 2.
  • In run 1, the input feature map data covered by the convolution window of layer 2 is b 11 to b 35 ; in run 2, it is b 31 to b 55 .
  • the NPU needs to perform the calculation of 2 strokes of layer 1 (such as layer 1 stroke 4 and layer 1 stroke 5) in the calculation of process 2.
  • That is, the current layer needs to consecutively perform a number of strokes equal to the stride of the next layer in order to obtain the input feature map data that can support the calculation of one stroke of the next layer.
  • the calculation of the layer 2 run 1 may be performed.
  • When the NPU determines that it cannot continue with the calculation of stroke 2 of layer 2, it can fall back to layer 1 and execute the calculation of stroke 4 of layer 1.
  • The NPU can then determine whether the calculation of layer 2 run 2 can be performed. Since the stride of layer 2 is 2, the calculation of layer 2 stroke 2 cannot be performed with the current data. The NPU can therefore continue to fall back to layer 1 and perform the calculation of stroke 5 of layer 1.
  • the NPU can determine whether the calculation of layer 2 run 2 can be performed. Since the input feature map data (eg, b 31 -b 55 ) required for the layer 2 run 2 calculation has been acquired, the NPU can then perform the layer 2 run 2 calculation. The subsequent process is similar and will not be repeated here.
  • In other implementations, the NPU can directly perform the calculation of stroke 5 of layer 1 after completing the calculation of stroke 4 of layer 1, so that the data that can support the calculation of layer 2 run 2 is acquired in one go. This can reduce the number of times the NPU executes the judgment logic.
  • However, this requires configuring more line caches for layer 1, so that the data needed for layer 1 to quickly perform two strokes can be stored in the local cache at the same time.
  • the above description of the calculation logic for different step sizes only takes adjacent layers 1 and 2 as examples.
  • the computing logic can be extended to the implementation of more layers.
  • the calculation of layer 3 needs to be performed after layer 2, and the step size of layer 3 is 3.
  • Then the NPU can consecutively perform more stroke calculations at layer 1 in process 2, so that more strokes can be calculated consecutively at layer 2, and the data required for the next stroke of layer 3 can be obtained without going through the judgment logic.
  • Of course, this requires more line caches for layer 1 and layer 2.
  • This solution can be applied to the case where the storage space of the local cache is relatively sufficient, which can reduce the judgment logic of the NPU and improve the computing efficiency of the system.
  • the calculation can be performed according to a method involving judgment logic. For example, after layer 2 performs the calculation of one stroke, the NPU determines whether layer 3 can perform the calculation of the next stroke, and if so, continues to perform the calculation of the next stroke of layer 3. Conversely, if the calculation of the next trip under layer 3 cannot be performed, it will fall back to layer 2 to perform the calculation of the next trip. Similarly, if the data in the line cache corresponding to the current layer 2 cannot support layer 2 to perform the calculation of the next stroke, continue to fall back to the previous layer (eg, layer 1) to perform the calculation of the next stroke.
  • In this solution, the local cache only needs to be configured, for each computing layer, with a number of line caches corresponding to the number of rows of the window of that layer's kernel function (such as the convolution window of the convolution kernel in a convolution calculation), and the pipelined calculation effect can be obtained by following the method flow shown in FIG. 4 and the solution described above. The NPU then does not need to read data from the DDR many times, nor write data to the DDR many times. This saves the power consumption overhead introduced by reading and writing data; at the same time, because the intermediate data is stored in the line caches of the local cache, the computing efficiency can be significantly improved.
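  • A rough sizing sketch for the local cache under this scheme, not from the application: each computing layer holds only as many row buffers as its kernel window has rows. The window sizes, row width and element size below are assumed values for illustration.

```python
window_rows = [3, 3, 5, 3]            # kernel rows per computing layer (assumed)
row_width_elems = 224                 # elements per feature-map row (assumed)
bytes_per_elem = 2                    # e.g. 16-bit activations (assumed)

line_cache_bytes = sum(window_rows) * row_width_elems * bytes_per_elem
print(line_cache_bytes)               # 14 rows * 448 bytes = 6272 bytes
```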
  • The following takes, as an example, a case in which an element-wise calculation needs to be performed in the convolutional neural network model.
  • FIG. 12 shows a schematic diagram of the calculation logic in a convolutional neural network in which, in addition to the convolution calculations, an element-wise calculation is also required.
  • the elementwise calculation may include an addition operation.
  • the object of the addition operation may be the output feature map of the convolution layer and the output feature map W obtained after the original input feature map is calculated by the computing layer W.
  • the computing layer W may be the same computing layer as any one of the convolutional computing layers in the convolutional layers, or may be a computing layer different from any one of the convolutional computing layers in the convolutional layers.
  • elementwise addition can be performed in the Eltwise computation layer.
  • A corresponding line cache may be configured for the computing layer W in the local cache of the NPU. For example, suppose a convolution calculation is performed in computing layer W with a window of A w rows and B w columns; then A w line caches can be configured for computing layer W in the local cache.
  • the Eltwise computing layer can also be configured with a corresponding line cache in the local cache.
  • the number of cache lines may be an integer greater than or equal to 1.
  • Before performing the element-wise addition operation, the NPU can perform the convolution calculation of the convolutional layer and the convolution calculation of the computing layer W in a time-shared manner. For example, the NPU can perform the convolution calculation of run 1 of the computing layer W to obtain the first row of the output feature map W, and store this first row of the output feature map W in the line cache corresponding to the Eltwise computing layer. After that, the NPU can perform the convolution calculation in the convolutional layer.
  • Once obtained, the first row of the output feature map of the convolutional layer can be input into the Eltwise computing layer, where the NPU adds it to the stored first row of the output feature map W, so that the first row of the Eltwise output feature map is obtained. If there is no other layer of calculation, the NPU can write this first row of the Eltwise output feature map into the DDR as part of the output feature map of one round of convolutional neural network calculation.
  • After acquiring the first row of the Eltwise output feature map, the NPU can update the data in the line caches corresponding to the computing layer W according to the method in the preceding example, so as to perform the convolution calculation of the second stroke. In addition, the NPU can perform the convolution calculation of the convolutional layer according to the aforementioned method to obtain the second row of the output feature map of the convolutional layer, and perform the addition operation in the Eltwise computing layer, thereby obtaining the second row of the Eltwise output feature map. By analogy, the complete output feature map of one round of convolutional neural network calculation can be obtained.
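  • A minimal sketch of this element-wise pipeline, not from the application: one row of the output feature map W is buffered (standing in for the Eltwise line cache), then added element-wise to the matching row of the convolutional layer's output as soon as that row is available. The row sources are placeholder callables.

```python
def eltwise_add_row(row_w, row_conv):
    # element-wise addition of two matching output rows
    return [x + y for x, y in zip(row_w, row_conv)]

def eltwise_pipeline(next_row_w, next_row_conv, num_rows):
    """next_row_w / next_row_conv: callables producing the next row of the
    output feature map W and of the convolutional layer's output feature map.
    Yields rows of the Eltwise output feature map one at a time."""
    for _ in range(num_rows):
        buffered = next_row_w()               # held in the Eltwise line cache
        yield eltwise_add_row(buffered, next_row_conv())
```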
The description above takes the case where a separate line cache is configured for computing layer W as an example. In other implementations, the line cache of computing layer W may also be multiplexed with that of layer 1. The convolution in computing layer W and the convolution in layer 1 are alike in that both operate on data of the original input feature map, although their convolution kernels may differ. Therefore, a common line buffer can be configured for computing layer W and layer 1, and the number of line buffers can be determined by whichever of their convolution windows has the larger number of rows. For example, suppose the convolution window of computing layer W has A_w = 3 rows and the convolution window of layer 1 has A_1 = 2 rows. Then local storage containing 3 line buffers can be configured to support the convolutions of both computing layer W and layer 1.
In this case, the update of the data in the shared line cache is performed only after both computing layer W and layer 1 have completed the convolution at the corresponding position. For example, the NPU reads a31 from the DDR to replace a11 only after a11 is no longer needed: a11 must participate not only in the first convolution of run 1 of layer 1 but also in the first convolution of run 1 of computing layer W. Once both of those calculations have been performed, the NPU can read a31 from the DDR to replace a11. In this way, line cache multiplexing is achieved and storage space in the local cache is saved.
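The following sketch illustrates one possible reading of this multiplexing; the names, the lockstep run ordering, and the fixed A_1 = 2 / A_w = 3 window sizes are assumptions for illustration. A single 3-row buffer feeds both layer 1 and computing layer W, and a buffered row is only overwritten once both consumers have finished the runs that use it.

```python
import numpy as np


def conv1d_rows(rows, kernel):
    """Slide a window of len(rows) x kernel-width across the buffered rows."""
    stacked = np.stack(rows)
    kw = kernel.shape[1]
    return np.array([np.sum(stacked[:, c:c + kw] * kernel)
                     for c in range(stacked.shape[1] - kw + 1)])


def shared_line_buffer_runs(ifm, k_layer1, k_w):
    """One 3-row buffer (the larger window, A_w = 3) serves both layer 1 (A_1 = 2)
    and computing layer W; a row such as a11 is replaced by a31 only after both
    consumers have finished the runs that use it."""
    n_rows = ifm.shape[0]
    buf = list(ifm[:3])                        # the 3 shared line buffers
    rows_l1, rows_w = [], []
    for r in range(n_rows - 1):                # r = top row currently held in buf
        if len(buf) >= 2:
            rows_l1.append(conv1d_rows(buf[:2], k_layer1))   # layer 1 run at row r
        if len(buf) >= 3:
            rows_w.append(conv1d_rows(buf[:3], k_w))         # layer W run at row r
        nxt = r + 3
        buf = buf[1:] + ([ifm[nxt]] if nxt < n_rows else [])  # only now drop the oldest row
    return np.stack(rows_l1), np.stack(rows_w)


if __name__ == "__main__":
    ifm = np.arange(30, dtype=float).reshape(6, 5)
    out1, out_w = shared_line_buffer_runs(ifm, np.ones((2, 2)), np.ones((3, 3)))
    print(out1.shape, out_w.shape)             # (5, 4) (4, 3)
```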
The description above takes a single-core NPU as an example. In that case, the NPU executes process 1 and process 2 in time sequence, and the execution order of some steps in process 1 and process 2 may differ from that shown in FIG. 4, which is not repeated here. In practice, multi-core NPUs are often used. With multiple cores, computing concurrency can be achieved, further improving computing efficiency.
FIG. 13 is a schematic diagram comparing the calculation flow over time in a single-core scenario and a multi-core scenario. In the single-core scenario, the computing core of the NPU (for example, core 1) performs the computation of run 4 of layer 1 at time T1, the computation of run 1 of layer 2 at time T2, and the computation of run 5 of layer 1 at time T3. When the NPU is a dual-core processor (that is, in the dual-core scenario), one computing core (for example, core 1) performs run 4 of layer 1 at time T1, run 5 of layer 1 at time T2, and run 6 of layer 1 at time T3, while the layer 2 computations are performed by another computing core (for example, core 2) of the NPU. For example, core 2 performs run 1 of layer 2 while core 1 performs run 5 of layer 1, and core 2 performs run 2 of layer 2 while core 1 performs run 6 of layer 1.
In this way, concurrency between multiple calculation processes is realized: once the data required for run 1 of layer 2 has been obtained, the NPU can execute the next run of layer 1 and run 1 of layer 2 simultaneously, instead of waiting for run 1 of layer 2 to finish before returning to layer 1 for its next run. The description above takes the case where core 1 performs the layer 1 computation and core 2 performs the layer 2 computation as an example; this application places no restriction on the correspondence between computing cores and computing layers. The same computing core may also perform computations of different computing layers. For example, when the computing capability of core 1 (for example, measured by throughput) is relatively large and the throughput of core 2 is relatively small, core 1 can not only complete the layer 1 computation but also, by time-division multiplexing, process part of the layer 2 computation so as to keep the throughput consistent. In this way, the computing bandwidth is fully utilized and the working efficiency of the multi-core NPU is improved. A sketch of such run-level scheduling follows.
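As an illustration only, the sketch below contrasts a single-core schedule, in which one core alternates between layer 1 and layer 2 runs, with a dual-core schedule, in which core 2 starts a layer 2 run as soon as enough layer 1 output rows have accumulated. The time slots, the 3-row / stride-2 window of layer 2, and the readiness rule are assumptions rather than the scheduler of FIG. 13.

```python
def schedule(total_l1_runs, l2_window_rows, l2_stride, cores):
    """Greedy run-level schedule: every run occupies one time slot on one core."""
    timeline = []        # timeline[t] = list of "core: work" strings for slot t
    l1_done = 0          # layer 1 runs finished so far (= layer 1 output rows buffered)
    l2_top, l2_run = 0, 1
    while l1_done < total_l1_runs or l2_top + l2_window_rows <= l1_done:
        slot = []
        l2_ready = l2_top + l2_window_rows <= l1_done   # rows finished in earlier slots
        if cores == 2:
            if l1_done < total_l1_runs:
                l1_done += 1
                slot.append(f"core1: layer1 run {l1_done}")
            if l2_ready:                                # core 2 works concurrently
                slot.append(f"core2: layer2 run {l2_run}")
                l2_run, l2_top = l2_run + 1, l2_top + l2_stride
        else:                                           # single core: one run per slot
            if l2_ready:
                slot.append(f"core1: layer2 run {l2_run}")
                l2_run, l2_top = l2_run + 1, l2_top + l2_stride
            else:
                l1_done += 1
                slot.append(f"core1: layer1 run {l1_done}")
        timeline.append(slot)
    return timeline


if __name__ == "__main__":
    for n_cores in (1, 2):
        print(f"--- {n_cores}-core schedule ---")
        for t, slot in enumerate(schedule(6, 3, 2, n_cores), start=1):
            print(f"T{t}: " + " | ".join(slot))
```

With these assumed parameters the dual-core timeline finishes in fewer slots because the layer 2 runs overlap with layer 1 runs instead of occupying their own slots.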
To implement the above functions, the above-mentioned processor includes corresponding hardware structures and/or software modules for executing each function. Those skilled in the art should readily appreciate that the units of the examples described in conjunction with the embodiments disclosed herein can be implemented in hardware, or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments of the present application, the data processing apparatus corresponding to the processor may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated modules can be implemented in the form of hardware or in the form of software functional modules. The division of modules in the embodiments of the present application is schematic and is only a logical functional division; another division manner may be used in actual implementation.
FIG. 14 is a schematic structural diagram of a data processing apparatus 1400 according to an embodiment of the present application. The data processing apparatus 1400 can be applied to perform neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and the data processing apparatus 1400 is provided with a local cache. The data processing apparatus 1400 includes: an obtaining unit 1401, configured to acquire first data, where the first data is used to perform a first calculation run of a first computing layer and the first computing layer is any one of the N computing layers; a storage unit 1402, configured to store the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a calculation unit 1403, configured to perform the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer. The storage unit 1402 is further configured to store the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer.
The calculation unit 1403 is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform the first calculation run of the second computing layer, perform the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.

The calculation unit 1403 is further configured to, when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer.

In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.
In a possible design, the obtaining unit 1401 is further configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.

In a possible design in which the first data is a part of the input feature map stored in the external memory, the obtaining unit 1401 is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer, and to store the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer.
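As a minimal sketch (the class name and the circular-buffer policy are illustrative assumptions, not the claimed structure), the following shows how newly fetched rows of the input feature map can overwrite rows that no longer participate in the first computing layer's calculations, so the layer never holds more rows than its convolution window has.

```python
import numpy as np


class LayerLineBuffer:
    """Holds exactly `window_rows` rows; each new row overwrites the row that no
    longer participates in this layer's calculations."""

    def __init__(self, window_rows, row_width):
        self.rows = np.zeros((window_rows, row_width))
        self.filled = 0            # rows fetched so far
        self.oldest = 0            # index of the row that will be overwritten next

    def push(self, new_row):
        """Store a newly fetched row (the 'third data') over the expired row (the 'fourth data')."""
        self.rows[self.oldest] = new_row
        self.oldest = (self.oldest + 1) % self.rows.shape[0]
        self.filled += 1

    def ready(self):
        """True once enough rows are buffered for one calculation run."""
        return self.filled >= self.rows.shape[0]

    def window(self):
        """Rows in top-to-bottom order of the current convolution window."""
        k = self.rows.shape[0]
        start = self.oldest if self.filled > k else 0
        return np.roll(self.rows, -start, axis=0)


if __name__ == "__main__":
    buf = LayerLineBuffer(window_rows=3, row_width=5)
    for i, row in enumerate(np.arange(20.0).reshape(4, 5)):
        buf.push(row)              # the 4th row overwrites the 1st, which is no longer needed
        if buf.ready():
            print(i, buf.window()[:, 0])   # leading element of each row in the window
```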
In a possible design, the storage unit 1402 is further configured to, during the first calculation run of the first computing layer, store each calculation result obtained by the convolution window of the first computing layer at a given position into the first line buffer of the second computing layer.

In a possible design, the obtaining unit 1401 is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit 1402 stores the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
  • any one of the above units can be implemented in software, hardware or a combination of the two, so as to realize the functions shown in the method.
  • the data processing apparatus 1400 including the above units may be a part integrated in the above-mentioned processor, such as functional hardware in the processor or functional software running in the processor. For example, if any one of the units is implemented as a software module, it runs on the neural network computing device 200 as shown in FIG. 2 .
  • FIG. 15 is a schematic structural diagram of an electronic device 1500 according to an embodiment of the present application.
  • the electronic device 1500 may include: a processor 1501 and a memory 1502 .
The memory 1502 is used to store computer-executable instructions. Exemplarily, in some embodiments, when the processor 1501 executes the instructions stored in the memory 1502, the electronic device 1500 is caused to perform one or more of steps S401-S413 shown in FIG. 4, as well as other operations that the electronic device needs to perform. In some embodiments, the electronic device 1500 may be provided with the neural network computing apparatus 200 described in FIG. 2; for the processor 1501, reference may be made to the neural network computing apparatus 200 of FIG. 2. The processor of this embodiment includes, but is not limited to, one or more of the aforementioned CPU, NPU, FPGA, GPU, and DSP (digital signal processor).
  • the above processors may be implemented in one or more chips.
  • the chip is also referred to as a system-on-chip (SoC).
The electronic device provided by the embodiments of the present application is configured to perform the above data processing method, and can therefore achieve the same effects as the above data processing method.
The functions, actions, operations, or steps in the above embodiments may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), or semiconductor media (for example, solid state disks (SSDs)), and the like.

Abstract

The embodiments of the present application relate to the field of artificial intelligence, and disclose a data processing method and a processor, which address the large power consumption caused by a processor having to read and write data multiple times. The solution involves: acquiring first data used for a first calculation run of a first computing layer; storing the first data in the first line buffer of the first computing layer, where that line buffer is included in a local cache; performing the first calculation run of the first computing layer to obtain second data; storing the second data in the first line buffer of a second computing layer, where the second computing layer is a computing layer, among N computing layers, that follows the first computing layer; and, when the data accumulated in the first line buffer of the second computing layer is sufficient for a first calculation run of the second computing layer, performing that run to obtain fifth data corresponding to the first calculation run of the second computing layer.

Description

A data processing method and processor

Background
In the process of completing one pass of neural network model computation, the processor needs to read a large amount of data from, and write a large amount of data to, the external memory many times, which brings considerable power consumption to the device performing the neural network computation. In addition, as the number of objects to be processed (for example, the number of images) and their complexity (for example, the data volume of the feature maps of the images to be processed) increase, the computational power consumption also increases.
SUMMARY OF THE INVENTION

Embodiments of the present application provide a data processing method and a processor, so as to reduce the power consumption of neural network computation. To achieve the above objective, the following technical solutions are adopted in the embodiments of the present application.
In a first aspect, a data processing method is provided. The method is applied to a processor that performs neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and a local cache is provided in the processor. The method includes: acquiring first data, where the first data is used to perform a first calculation run of a first computing layer, and the first computing layer is any one of the N computing layers; storing the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; performing the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer; storing the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer; and, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first calculation run of the second computing layer, performing the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.

Based on this solution, a pipelined computing mechanism across multiple computing layers is provided. In this example, when performing the convolution of one computing layer, the processor only needs to acquire the data required for one calculation run. Taking convolution as an example, a calculation run is one pass of the convolution window sliding from the leftmost position to the rightmost position. For instance, if the convolution window has A rows, the processor only needs A rows of data to start the calculation of that layer, rather than the full input feature map required by the layer. Because the data volume of A rows is very small, it can be stored in the local cache instead of in external memory (for example, DDR). When the current layer is calculated, these A rows can be read directly from the local cache and used to perform one run of the current layer. It can be understood that, when the current computing layer is not the first computing layer of the neural network, these A rows of data are the calculation results of the previous computing layer. Compared with the prior art, in this solution the previous computing layer only needs to compute and produce A rows of data, so the intermediate data between the previous computing layer and the current computing layer does not need to be written into the DDR and read back from the DDR again; instead, the previous computing layer stores the data, as it is produced, into the line buffers configured for the current computing layer in the local cache. In other words, the intermediate data does not need to be written to the DDR, and therefore does not need to be read from the DDR when the current layer is calculated. Reading and writing data in the local cache avoids repeated, large data exchanges with the DDR, thereby saving power.

In a possible design, the method further includes: when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, performing a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer. Based on this solution, a fallback mechanism in inter-layer computation is provided. In this example, after the previous computing layer completes one run, the processor checks whether one run of the current layer can be performed. If the line buffer of the current computing layer does not yet hold enough data to support one run of the current layer, the processor falls back to the previous layer and performs its next run, so that a new row of results is produced and written into the current layer's line buffer. The processor then repeats this scheme: it determines whether the data stored in the current line buffer is sufficient for the current computing layer to complete one run; if so, it performs one run of the current computing layer, and if not, it returns to the previous computing layer to continue computing. The same check-and-fall-back mechanism applies to subsequent computing layers, so that the computation never stalls at any one computing layer, and at any moment each computing layer only occupies a number of line buffers corresponding to the number of rows of its convolution window. A short worked example follows.
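As a purely illustrative, worked example (the window size and stride are assumptions), the snippet below counts how many runs of the previous layer must complete before each run of the next layer becomes possible, which is exactly the readiness check behind the fallback decision.

```python
def rows_needed(next_layer_run, window_rows, stride):
    """Previous-layer output rows needed before run `next_layer_run` of the next layer can start."""
    # run k of the next layer consumes rows (k-1)*stride + 1 .. (k-1)*stride + window_rows
    return (next_layer_run - 1) * stride + window_rows


if __name__ == "__main__":
    # assumed: the next layer uses a 3-row window with stride 1
    for k in range(1, 5):
        print(f"run {k} of the next layer needs {rows_needed(k, 3, 1)} previous-layer rows")
    # run 1 needs 3 rows, run 2 needs 4, and so on: after the 3rd previous-layer run,
    # each additional run of the previous layer enables exactly one run of the next layer.
```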
In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer. This provides a specific limit on the size of the first line buffer. It can be understood that the first line buffer, used to store the first data, may be storage space configured for the first computing layer in the processor's local cache, holding the data used for any one calculation run of the first computing layer. Its number of rows needs to be at least equal to the number of rows of the convolution window of the first computing layer, so that enough data can be held to perform one run.

In a possible design, when the first computing layer is the first computing layer of the neural network, acquiring the first data includes: reading the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. This provides a data acquisition mechanism for the case where the first computing layer is the first computing layer of the neural network. It can be understood that, because the data volume of the input feature map is generally large, it can be stored in an external memory (for example, DDR) with which the processor can exchange data. Before executing a calculation run of the first computing layer, the processor can read the corresponding data from the DDR and write it into the line buffers configured for the first computing layer, so as to perform one calculation run of the first computing layer.

In a possible design, the first data is a part of the input feature map stored in the external memory, and the method further includes: obtaining third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer; and storing the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer. Based on this solution, a mechanism for dynamically updating the data in the line buffers is provided. In this example, each time one calculation within a run is completed, part of the first data will not be used again in subsequent calculations. The processor can therefore read some new data from the DDR to overwrite the data that will no longer be used, thereby updating the buffer, so that after the current run is completed, the corresponding line buffers already hold data that can be used for a new calculation run. It should be noted that, in some embodiments of the present application, this replacement may be performed after a calculation run is completed, or during the execution of a calculation run.

In a possible design, storing the second data in the first line buffer of the second computing layer includes: during the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, storing that result in the first line buffer of the second computing layer. Based on this solution, a writing mechanism for the second data is provided. In this example, each time the first computing layer produces one calculation result, the result can be stored at the corresponding position in the line buffer of the second computing layer. In this way, after one calculation run of the first computing layer is completed, one or more rows of data stored in the line buffer of the second computing layer are available for the calculation of the second computing layer.

In a possible design, after the fifth data corresponding to the first calculation run of the second computing layer is obtained, the method further includes: storing the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer. Based on this solution, the computing mechanism of the other computing layers included in the neural network is provided. For example, each time the second computing layer completes one run, the calculation result can be stored in the line buffer corresponding to the next computing layer (for example, the third computing layer), so that once enough data has been obtained, one run of the third computing layer can be executed.

In a second aspect, a data processing apparatus is provided, which is applied to perform neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2, and a local cache is provided in the data processing apparatus. The apparatus includes: an obtaining unit, configured to acquire first data, where the first data is used to perform a first calculation run of a first computing layer and the first computing layer is any one of the N computing layers; a storage unit, configured to store the first data in the first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a calculation unit, configured to perform the first calculation run of the first computing layer to obtain second data corresponding to that run, where the first calculation run of the first computing layer includes convolution of one or more rows of the first data using the convolution window of the first computing layer. The storage unit is further configured to store the second data in the first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer, among the N computing layers, after the first computing layer. The calculation unit is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first calculation run of the second computing layer, perform the first calculation run of the second computing layer to obtain fifth data corresponding to that run, where the first calculation run of the second computing layer includes convolution of one or more rows of the second data using the convolution window of the second computing layer.
In a possible design, the calculation unit is further configured to, when the accumulated data is not sufficient to perform the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is the calculation run after the first calculation run of the first computing layer.

In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.

In a possible design, the obtaining unit is configured to read the first data from an external memory, where the first data is at least a part of the input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.

In a possible design, the first data is a part of the input feature map stored in the external memory, and the obtaining unit is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform a second calculation run of the first computing layer, and to store the third data so that it overwrites fourth data, where the fourth data is the data, within the first data, that no longer participates in the calculation of the first computing layer.

In a possible design, the storage unit is further configured to, during the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, store that result in the first line buffer of the second computing layer.

In a possible design, the obtaining unit is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit stores the fifth data in the first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
In a third aspect, a processor is provided, where the processor includes one or more computing cores and a local cache, and the processor is configured to implement the data processing method according to any one of the first aspect and its possible designs.

In a fourth aspect, an electronic device is provided, where the electronic device includes one or more processors according to the third aspect and one or more memories. The memory is coupled to the processor and stores computer instructions. When the processor executes the computer instructions, the electronic device is caused to perform the data processing method according to any one of the first aspect and its possible designs.

In a fifth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium includes computer instructions, and when the computer instructions are run, the data processing method according to any one of the first aspect and its possible designs is performed.

Exemplarily, any one of the designs of the second aspect to the fifth aspect and their possible designs may correspond to the first aspect and any one of its possible designs, and therefore can bring similar technical effects, which are not repeated here.
Description of Drawings
FIG. 1 is a schematic structural diagram of a convolutional neural network;
FIG. 2 is a schematic structural diagram of a neural network computing apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a convolutional layer according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 6 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 7 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 8 is a schematic diagram of line buffers according to an embodiment of the present application;
FIG. 9 is another schematic diagram of line buffers according to an embodiment of the present application;
FIG. 10 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 11 is another schematic diagram of calculation logic according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a neural network according to an embodiment of the present application;
FIG. 13 is a schematic timing diagram of calculation logic in single-core and multi-core scenarios according to an embodiment of the present application;
FIG. 14 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Neural networks commonly used in the field of artificial intelligence include convolutional neural networks (CNN) and recursive neural networks (RNN). Exemplarily, FIG. 1 shows a schematic structural diagram of a convolutional neural network. As shown in FIG. 1, a convolutional neural network may be provided with a convolutional layer that includes one or more convolution computing layers, and a pooling/activation layer that includes one or more computing layers. When data enters the convolutional layer, the processor performs convolution on the input feature map according to the convolution kernel corresponding to each convolution computing layer. Exemplarily, using a preset convolution kernel, the processor slides the convolution window corresponding to that kernel over the input feature map with a preset stride, obtaining a calculation result at each window position; these results are combined into the corresponding output feature map. When data enters the pooling/activation layer, the processor processes the input feature map according to a specific function, for example pooling and/or activation. As shown in FIG. 1, the convolutional neural network may further include an input layer. The input layer may be used to store the feature map data of the image to be processed, so that the convolutional layer can obtain the input feature map from the input layer. In some implementations, the input layer may be located in an external memory connected to the processor that performs the convolution calculations.

It should be noted that the structure of the convolutional neural network may differ between application scenarios. In some implementations, the convolution computing layers of the convolutional layer may be interleaved with the computing layers of the pooling/activation layer; for example, as shown in FIG. 1, pooling or activation may be performed after part of the convolution calculations, followed by further convolution. In other implementations, the convolution calculations may be performed first, optionally followed by pooling and/or activation. After the above calculations are completed, the result of one round of convolutional neural network computation is obtained and output through the output layer.

It should be noted that a local buffer may be provided inside the processor. The local cache can store a small amount of data, and data stored in it can be read and written quickly. For example, the local cache can store data such as the convolution kernels corresponding to the computing layers of the convolutional neural network model. When performing neural network computation, the original input feature map generally has too large a data volume to fit in the local cache, so it can be stored in an external memory connected to the processor. The external memory may be a storage medium with a large storage space, for example double data rate synchronous dynamic random-access memory (DDR SDRAM), which in this example is also referred to as DDR for short. Taking DDR as the external memory, the processor can read the original input feature map from the DDR at the start of computation in order to perform the neural network calculation. When the neural network model includes multiple computing layers, the output feature map of one layer serves as the input feature map of the next layer; the feature map data between two computing layers is also called intermediate data. Generally, the data volume of the intermediate data in the neural network calculation is not larger than that of the original input feature map, but it still far exceeds the storage capacity of the local cache. If the processor writes the intermediate data into the DDR, it must repeatedly exchange large amounts of data with the DDR, which causes a large amount of power consumption. In addition, as the computing capability of processors keeps improving, insufficient read/write bandwidth also limits the efficiency of neural network computation.

To address this problem, the feature map data stored in the DDR can be split so that the resulting feature map slices can be stored in the local cache. When performing neural network computation, the processor reads one slice from the DDR into the local cache and computes on the data of that slice. Following the description above, the intermediate data and the output feature map produced while computing one slice are no larger than the input feature map of that slice. After completing the computation of one slice, the processor reads the next slice from the DDR into the local cache and repeats the above steps, and so on, until all slices have been computed. The DDR then holds the output feature maps corresponding to the individual slices, and the processor must merge them to obtain the complete output feature map corresponding to the original input feature map. To avoid gaps between the output feature maps of the slices, adjacent slices must include overlapping data when the original input feature map is partitioned. As a result, the overlapping data is computed multiple times, which reduces the efficiency of the whole computation, and the power consumption is still not well optimized.
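As an aside, the extra work introduced by such overlapping slices can be estimated with a small calculation. The sketch below is illustrative only; it assumes the slices are cut along the row dimension and that the combined halo required by the layer stack is `overlap_rows` rows per slice boundary.

```python
def redundant_rows(total_rows, slice_rows, overlap_rows):
    """Estimate rows computed more than once when an input of `total_rows` rows is cut
    into slices of `slice_rows` rows that must overlap by `overlap_rows` rows."""
    if slice_rows <= overlap_rows:
        raise ValueError("slice must be taller than the overlap")
    step = slice_rows - overlap_rows
    num_slices = max(1, -(-(total_rows - overlap_rows) // step))   # ceiling division
    return (num_slices - 1) * overlap_rows, num_slices


if __name__ == "__main__":
    extra, n = redundant_rows(total_rows=1080, slice_rows=128, overlap_rows=8)
    print(f"{n} slices, about {extra} rows recomputed "
          f"({100 * extra / 1080:.1f}% of the input height)")
```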
To address these problems, the data processing method provided in the embodiments of the present application establishes a pipelined computing mechanism between the computations of different computing layers, so that the computation of one layer does not have to be fully completed before the computation of the next layer begins. Such a solution significantly reduces the volume of intermediate data, so that the intermediate data can be stored in the local cache, avoiding the power consumption of repeated, large read/write exchanges between the processor and the external memory (for example, DDR). At the same time, because of this pipelined mechanism, the computation does not need to wait, so computing efficiency is effectively improved. Also because of the pipelined mechanism, no data is repeatedly and needlessly recomputed, so computing efficiency is significantly higher than in the prior-art solution. In particular, the data processing method provided in the embodiments of the present application provides a calculation method with a fallback mechanism, which is applicable to convolution scenarios with a stride greater than or equal to 1. The solution provided in the embodiments of the present application is described in detail below with reference to the accompanying drawings.

Please refer to FIG. 2, which is a schematic diagram of the logical structure of a neural network computing apparatus 200 according to an embodiment of the present application. The neural network computing apparatus 200 can be used to perform the computation of neural network models, including convolutional neural network models, according to the method provided in the embodiments of the present application. For ease of description, FIG. 2 also shows an external storage module 230 coupled to the neural network computing apparatus 200. The neural network computing apparatus 200 can read from and write to the external storage module 230 through an interface provided on it, for example reading the feature map data to be processed (such as the original input feature map) from the external storage module 230, or writing the output feature map data of a completed neural network computation into the external storage module 230. The neural network computing apparatus 200 may be the processor described above.

In some implementations, the external storage module 230 may include the DDR described above. In other implementations, the external storage module 230 may also include a system cache, which may be the system cache of the device in which the neural network computing apparatus 200 shown in FIG. 2 is provided. In different devices, the system cache may be implemented by different storage media; for example, it may be implemented by flash memory, or by another storage medium such as a solid state drive (SSD).

As shown in FIG. 2, the neural network computing apparatus 200 provided in this embodiment of the present application may include a computing module 210 and a local cache 220. The computing module 210 is the module of the neural network computing apparatus 200 that implements the various computing functions. For example, the computing module 210 may include a convolution computing unit 211, an activation computing unit 212, a pooling computing unit 213, an Eltwise computing unit 214, and the like. The convolution computing unit 211 is used to perform convolution; as a possible implementation, it may include one or more multiplier-accumulators or other components capable of performing convolution. The activation computing unit 212 performs activation processing, the pooling computing unit 213 implements pooling, and the Eltwise computing unit 214 implements elementwise computation.
需要说明的是,上述示例中对于计算模块210的结构示例仅为一种可能的实现,在针对不同神经网络模型的计算中,计算模块210包括的用于实现各个功能的单元可以与上述示例相同,也可以不同。比如,在神经网络模型中没有激活计算以及池化计算的需求时,该计算模块210中就可以不设置激活计算单元212以及池化计算单元213。又如,在神经网络模型中没有elementwise的计算需求时,该计算模块210中就可以不设置Eltwise计算单元214。It should be noted that the structural example of the calculation module 210 in the above example is only a possible implementation, and in the calculation for different neural network models, the units included in the calculation module 210 for implementing each function may be the same as the above example. , can also be different. For example, when there is no requirement for activation calculation and pooling calculation in the neural network model, the activation calculation unit 212 and the pooling calculation unit 213 may not be set in the calculation module 210 . For another example, when there is no elementwise computing requirement in the neural network model, the Eltwise computing unit 214 may not be set in the computing module 210 .
作为一种可能的实现,该计算模块210可以为神经网络处理器(neural-network processing units,NPU),或者现场可编程门阵列(Field Programmable Gate Array,FPGA),或者中央处理器(Central Process ing Unit/Processor,CPU),或者图形处理器(Graphics Processing Unit,GPU)等能够实现对应计算功能的部件。需要说明的是,以计算模块210为NPU为例,该NPU可以是单核NPU,也可以是具有多个计算核心的多核NPU。本申请实施例对此不作限制。可以理解的是,本申请实施例提供的数据处理方法应用在单核NPU上时的处理逻辑,可以复用在多核NPU上。当该方案应用于多核NPU时,可以利用多核NPU的并行计算机制,进一步提升计算效率。示例性的,当NPU中存在多个计算核心时,可以采用内部互联(interconnect)的方式实现多个计算核心的互联。比如,可以采用片上网络(Network on Chip,NOC)的结构,实现多个计算核心的互联。可以理解的是,采用NOC的互联方式,可以使得根据网络结构动态地配置不同计算核心之间的互联关系,以便于根据各个计算核心的计算压力动态配置计算量,实现计算的动态调度,由此提升多核NPU的计算效率。As a possible implementation, the computing module 210 may be a neural network processor (neural-network processing units, NPU), or a field programmable gate array (Field Programmable Gate Array, FPGA), or a central processing unit (Central Processing) Unit/Processor, CPU), or graphics processor (Graphics Processing Unit, GPU) and other components that can implement corresponding computing functions. It should be noted that, taking the computing module 210 as an NPU as an example, the NPU may be a single-core NPU or a multi-core NPU having multiple computing cores. This embodiment of the present application does not limit this. It can be understood that the processing logic when the data processing method provided in the embodiment of the present application is applied to a single-core NPU may be multiplexed on a multi-core NPU. When the solution is applied to a multi-core NPU, the parallel computing mechanism of the multi-core NPU can be used to further improve the computing efficiency. Exemplarily, when there are multiple computing cores in the NPU, the interconnection of the multiple computing cores may be implemented by means of an internal interconnection (interconnect). For example, a network on chip (NOC) structure can be used to realize the interconnection of multiple computing cores. It can be understood that the interconnection method of NOC can dynamically configure the interconnection between different computing cores according to the network structure, so as to dynamically configure the calculation amount according to the computing pressure of each computing core, and realize the dynamic scheduling of computing. Improve the computing efficiency of multi-core NPU.
Continuing with FIG. 2, the neural network computing apparatus 200 provided in this embodiment of the present application may further be provided with a local cache 220. The local cache 220 can be used for fast reading and writing of data. In some implementations of the present application, in order to save the cost of the neural network computing apparatus 200 while taking its size requirements into account, the local cache 220 may be a storage medium with a relatively small storage space. For example, when the functions of the computing module 210 are implemented by an NPU, the local cache 220 may be the internal cache of the NPU. In the present application, the local cache 220 may be used to support the line buffer technique. For example, as shown in FIG. 2, multiple line buffers may be configured in the local cache 220.
As an example, the multiple line buffers may respectively correspond to different computing layers in the neural network model. The number of line buffers corresponding to one computing layer may be determined according to the window size of the kernel function of that computing layer. For example, take a convolution computing layer whose convolution window has M rows and N columns: M line buffers may be configured for that convolution computing layer in the local cache 220. Similarly, corresponding line buffers may be configured in the local cache 220 for the other computing layers in the neural network model. It can be understood that, since the number of rows of a kernel function is generally small, the total number of line buffers configured for all computing layers of the neural network model is not excessively large, and this configuration can be accommodated by the space of a typical local cache 220.
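The sizing rule described above (M window rows, hence M line buffers) can be illustrated with a small sketch. The following Python snippet is only an illustration under the assumption that each layer is characterized by the row count of its convolution window; the names LayerSpec and allocate_line_buffers are hypothetical and do not come from the patent.

```python
# Illustrative sketch only: size the line buffers of each computing layer by the
# number of rows of that layer's convolution window (M rows -> M line buffers).
from dataclasses import dataclass

@dataclass
class LayerSpec:
    window_rows: int   # rows of the layer's convolution window (M)
    row_width: int     # number of elements one line buffer must hold

def allocate_line_buffers(layer_specs):
    """Return, per layer index, a list of line buffers (rows of zeros)."""
    return {
        idx: [[0.0] * spec.row_width for _ in range(spec.window_rows)]
        for idx, spec in enumerate(layer_specs, start=1)
    }

# Example: layer 1 has a 2-row window over 6-wide rows, layer 2 a 3-row window
# over 5-wide rows (the width shrinks as each layer consumes its input).
buffers = allocate_line_buffers([LayerSpec(2, 6), LayerSpec(3, 5)])
print({layer: len(rows) for layer, rows in buffers.items()})   # {1: 2, 2: 3}
```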
It should be noted that, in other implementations of the present application, the line buffer configuration of a computing layer may also take into account the strides of the current computing layer and of the related computing layers, and/or special computing requirements in the neural network model (such as elementwise computation). The specific configurations are described in detail below.
In the following examples, the data processing method provided in this embodiment of the present application is described by taking the following as an example: the neural network model is a convolutional neural network model having the structure shown in FIG. 1, the computing module is a single-core NPU, the local cache is the local cache in the NPU, and the external storage module is a DDR memory. The following description details the process of computing each convolution computing layer of the convolutional layer using the data processing method provided in this embodiment of the present application. It can be understood that the computation of the other layers in the convolutional neural network model can refer to the computation process of the convolutional layer. As an example, the convolutional layer of the convolutional neural network may be provided with N convolution computing layers as shown in FIG. 3. As shown in FIG. 3, the N convolution computing layers may be layer 1, layer 2, ..., layer N. During the convolution computation, the input feature map of layer 1 may be the original input feature map, and the output feature map obtained after layer N completes its convolution computation may be referred to as the convolution output feature map.
When the convolution computation starts, the NPU may perform initialization. Exemplarily, during this initialization, the NPU may read, from the DDR, a number of rows of data corresponding to the number of rows of the convolution window of layer 1 and write them into the line buffers configured for layer 1 in the local cache. Exemplarily, in the following description, the datum in row i and column j of the original input feature map is denoted aij, where i and j are both integers greater than or equal to 1. Take the convolution kernel of layer 1 having A1 rows and B1 columns as an example: A1 line buffers may be configured for layer 1 in the local cache. During initialization, the NPU may read the first A1 rows of the original input feature map from the DDR and write them into the line buffers configured for layer 1 in the local cache. After the initialization is completed, the NPU may perform the layer 1 convolution computation on the data written into the local cache, for example by sliding the convolution window corresponding to layer 1 over the A1 line buffers from left to right and performing the convolution computation at each position to obtain the result for that position.
In the present application, the computation performed as the convolution window moves from the leftmost position to the rightmost position may be referred to as the computation of one run. The computation of one run includes the computation of the portions of one or more rows of data that fall within the window. After the computation of one run is completed, the first row of the output feature map corresponding to layer 1 is obtained. The output feature map of layer 1 serves as the input feature map of layer 2. Therefore, each time the convolution computation at one position is completed, the NPU may store the result at the corresponding position of the line buffers configured for layer 2 in the local cache. For example, after the convolution computation of the first run of layer 1 is completed, the line buffers corresponding to layer 2 store the first row of the input feature map of layer 2. It should be noted that, in some implementations of the present application, as the convolution computation of layer 1 proceeds, the NPU may read new data from the DDR to overwrite data that will no longer be used in the computation of layer 1, so that after one run is completed, layer 1 can continue with the next run without waiting for the NPU to read data from the DDR.
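A minimal sketch of one run, assuming plain Python lists as line buffers, a dense 2-D kernel and a single horizontal stride, is given below; it illustrates the idea of sliding the window from the leftmost to the rightmost position and emitting one output row, and is not the patented implementation.

```python
# Illustrative sketch: one "run" of a convolution computing layer.
# line_buffers: list of A rows (each a list of numbers) currently held for the layer.
# kernel: A x B list of lists. stride: horizontal step of the window within the run.
def compute_run(line_buffers, kernel, stride=1):
    a, b = len(kernel), len(kernel[0])
    width = len(line_buffers[0])
    out_row = []
    for col in range(0, width - b + 1, stride):
        acc = 0.0
        for r in range(a):            # multiply window entries with the data they cover
            for c in range(b):
                acc += kernel[r][c] * line_buffers[r][col + c]
        out_row.append(acc)           # one datum of the layer's output feature map
    return out_row

# Example: a first run over two buffered rows of width 6 with an assumed 2x2 kernel.
print(compute_run([[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]], [[1, 0], [0, 1]]))
# [9.0, 11.0, 13.0, 15.0, 17.0]
```

In the scheme described above, each element of out_row would be written into the corresponding column of a line buffer configured for the next computing layer, either as soon as it is produced or once the run finishes.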
Exemplarily, take the convolution stride of layer 1 being S1 as an example. After the convolution window of layer 1 completes the first convolution computation of a run, the data in the first S1 rows and first S1 columns of the original input feature map (that is, data a11 to a(S1,S1)) will not be used again. Therefore, after layer 1 completes the first convolution computation, the NPU may read, from the DDR, the data in rows S1+1 to 2×S1 and the first S1 columns of the original input feature map, and store them at the positions in the local cache where the data of the first S1 rows and S1 columns were originally stored. By analogy, after the convolution computation of the first run of layer 1 is completed, the A1 line buffers configured for layer 1 already store the data required for the convolution computation of the next run. Of course, in other implementations of the present application, the NPU may instead read the data required for the next run from the DDR after the computation of one run is completed, and store it into the A1 line buffers of layer 1.
Through the above steps, one run of the convolution computation of layer 1 is completed, and the first row of the input feature map of layer 2 is obtained. Take the convolution window of layer 2 having A2 rows and B2 columns as an example. The NPU may continue to perform the convolution computation of further runs at layer 1 according to the above scheme until the A2 rows of data required for the convolution computation of layer 2 are obtained. That is, the NPU may perform A2 runs of convolution computation at layer 1, thereby obtaining A2 rows of data that serve as the input feature map of layer 2 and are stored in the A2 line buffers configured for layer 2 in the local cache. The NPU may then start the convolution computation of the first run of layer 2, thereby obtaining the first row of the input feature map of layer 3, which is stored in the line buffers configured for layer 3 in the local cache.
It can be understood that, after layer 2 completes the convolution computation of one run, in order to continue with the second run of layer 2, layer 1 must perform the corresponding convolution computations to obtain the new input feature map data required for the second run of layer 2. For example, take the stride of layer 2 being S2. While layer 2 performs the convolution computation of run 1, the line buffers corresponding to layer 2 store the A2 rows of data obtained from A2 runs of layer 1. After layer 2 completes the convolution computation of run 1, the NPU may return to layer 1 and perform runs (A2+1) to (A2+S2) of layer 1, thereby obtaining S2 new rows of data that are stored in the line buffers configured for layer 2, so that the NPU can continue with the second run of layer 2. By analogy, in the subsequent computation, each time the NPU completes S2 runs of layer 1, it can perform one run of layer 2, and the other layers behave similarly. When one run of layer N is completed, the first row of the convolution output feature map of the convolutional layer is obtained.
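The cadence described here (a full window of rows for the first run of a layer, then a further stride's worth of new rows for every later run) can be expressed as a small readiness check. The sketch below is hypothetical; rows_available, window_rows and runs_done are illustrative names rather than terms from the patent.

```python
# Illustrative sketch: can a computing layer start its next run, given how many
# rows of its input feature map have accumulated in its line buffers so far?
def can_start_next_run(rows_available, window_rows, stride, runs_done):
    # The first run needs a full window of rows (A rows); each later run needs
    # a further `stride` rows on top of what the previous runs consumed.
    rows_needed = window_rows + runs_done * stride
    return rows_available >= rows_needed

# Layer 2 with a 3-row window and stride 2: run 1 needs 3 rows, run 2 needs 5 rows.
print(can_start_next_run(rows_available=3, window_rows=3, stride=2, runs_done=0))  # True
print(can_start_next_run(rows_available=4, window_rows=3, stride=2, runs_done=1))  # False
print(can_start_next_run(rows_available=5, window_rows=3, stride=2, runs_done=1))  # True
```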
As an example, FIG. 4 shows a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 4, the method may include at least two computation processes (for example, process 1 and process 2). Process 1 is the flow executed when the neural network computation starts. Process 2 is the subsequent flow that begins once layer 2 is able to perform one run. As shown in the figure, the flow may specifically include: S401, compute run 1 of layer 1. S402, store the output feature map of run 1 of layer 1 into the line buffers corresponding to layer 2. S403, determine that run 1 of layer 2 cannot yet be performed, and fall back to layer 1. S404, compute run 2 of layer 1. S405, store the output feature map of run 2 of layer 1 into the line buffers corresponding to layer 2. S406, determine that run 1 of layer 2 still cannot be performed, and fall back to layer 1. S407, compute run 3 of layer 1. S408, store the output feature map of run 3 of layer 1 into the line buffers corresponding to layer 2.
It can be understood that this example is described by taking the convolution window of layer 2 having 3 rows as an example. Therefore, one run of layer 2 requires layer 1 to perform 3 runs. Correspondingly, if the convolution window of layer 2 has A2 rows, then before layer 2 can perform its first run, layer 1 needs to perform A2 runs. The steps of process 1 are then complete. It can be understood that, in process 1, since the line buffers corresponding to layer 2 contain no data when the computation starts, layer 1 must perform 3 consecutive runs before the data needed for one run of layer 2 is available.
Process 2 is described below. Since data is already stored in the line buffers of layer 2, from this point on, each time layer 1 completes S2 runs, it can update S2 rows of the line buffers of layer 2, so that layer 2 can perform its next run. Here, S2 is the stride of layer 2, and S2 is an integer greater than or equal to 1; the following takes S2 = 1 as an example. S409, compute run 1 of layer 2. S410, determine that run 2 of layer 2 cannot yet be performed, and fall back to layer 1. S411, compute run 4 of layer 1. S412, store the output feature map of run 4 of layer 1 into the line buffers corresponding to layer 2. S413, compute run 2 of layer 2. It can be seen that, in process 2, each time layer 1 performs one run, layer 2 can perform one further run. A pipelined processing effect between different layers is thereby formed.
The above description takes as an example the case in which, after layer 1 completes one run, the row of results corresponding to that run is stored into the line buffers corresponding to layer 2. In other implementations of the present application, as described above, layer 1 may instead, during the computation of a run, store each result into the corresponding position of the line buffers of layer 2 as soon as the result for one convolution window position is obtained.
It should be noted that FIG. 4 shows only the computation logic of layer 1 and layer 2; the computation logic of the other layers may follow the steps illustrated by this flow. For example, the input feature map of layer 3 may be the output feature map of layer 2. Therefore, each result obtained while layer 2 performs a run can be stored into the line buffers of layer 3. After layer 2 completes one run, one row of data in the line buffers of layer 3 has been updated. The NPU may then determine whether layer 3 can perform a new run; if it can, the run of layer 3 is performed. If not, the NPU returns to layer 2; if layer 2 can perform its next run, that run is performed; if not, the NPU continues to fall back towards earlier layers until it reaches a layer that can perform its next run. After one run of that layer is performed, one row of data is updated in the line buffers of the next layer, and the next run of that next layer can be attempted. By analogy, the convolution computation of the N computing layers shown in FIG. 3 can be completed.
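The fallback logic of FIG. 4, generalized to N layers, can be modeled as a simple scheduler that always tries the deepest layer whose buffered input suffices for its next run and otherwise falls back towards layer 1. The following code is only a schematic model under several simplifying assumptions: whole rows are produced per run, the line buffers are modeled as growing Python lists rather than fixed-size buffers, the horizontal and vertical strides of a layer are taken to be equal, and loading the original input from the DDR is reduced to a pre-built list of rows. None of the function names come from the patent.

```python
# Illustrative model of the pipelined, fallback-driven scheduling across N layers.
def conv_rows(rows, kernel, stride):
    """Convolve `kernel` over the stacked `rows`, sliding horizontally by `stride`."""
    a, b = len(kernel), len(kernel[0])
    width = len(rows[0])
    return [sum(kernel[r][c] * rows[r][col + c] for r in range(a) for c in range(b))
            for col in range(0, width - b + 1, stride)]

def run_pipeline(input_rows, kernels, strides):
    """kernels[k] and strides[k] describe layer k+1; returns the last layer's output rows."""
    n = len(kernels)
    feature_rows = [list(input_rows)] + [[] for _ in range(n)]  # rows available per stage
    runs_done = [0] * n
    while True:
        progressed = False
        # Try the deepest layer first; fall back towards layer 1 when data is missing.
        for layer in reversed(range(n)):
            window_rows = len(kernels[layer])
            rows_needed = window_rows + runs_done[layer] * strides[layer]
            if len(feature_rows[layer]) >= rows_needed:
                top = runs_done[layer] * strides[layer]
                window = feature_rows[layer][top:top + window_rows]
                feature_rows[layer + 1].append(conv_rows(window, kernels[layer], strides[layer]))
                runs_done[layer] += 1
                progressed = True
                break
        if not progressed:
            break   # no layer can advance: the input is exhausted
    return feature_rows[n]

# Example loosely matching FIG. 5 to FIG. 11: a 6x6 input, layer 1 with a 2x2 kernel
# (stride 1), layer 2 with a 3x3 kernel (stride 2); the kernel weights are made up.
inp = [[r * 6 + c for c in range(6)] for r in range(6)]
k1 = [[1, 0], [0, 1]]
k2 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(run_pipeline(inp, [k1, k2], [1, 2]))
```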
It can be seen from the above description that, if in the convolutional neural network model no other layers need to be computed after the convolutional layer, the NPU can write this first row of data into the DDR. The NPU may write each datum into the DDR directly as it is obtained, or may write a whole row into the DDR once the row has been obtained. If, in the convolutional neural network model, other layers still need to be computed after the convolutional layer, for example activation/pooling computation or elementwise computation, then when the NPU obtains data of the convolution output feature map, it can write that data into the line buffers corresponding to the subsequent computation and proceed in the same way as the computation within the convolutional layer described above.
It can be understood, based on the description of FIG. 4 above, that because of the fallback mechanism (that is, the NPU can determine whether the current layer can perform one run and, if not, fall back to the previous layer), the pipelined computation mechanism of this solution can also be used when the stride of a computing layer is not 1. In this way, all convolution computations of a computing layer can be performed by configuring, for each computing layer, only the number of line buffers corresponding to the number of rows of its convolution window (for example, A1 line buffers for layer 1). Moreover, since the convolution results of the current layer are written directly into the line buffers of the next computing layer in the local cache and used for the computation of that next layer, a pipelined computation effect is formed, and the NPU does not need to read intermediate data from the DDR when computing the next layer. As a result, during the convolution computation of the convolutional layer shown in FIG. 3, the amount of data the NPU reads from the DDR is only that of one original input feature map and, if there is no subsequent computation, the amount of data written into the DDR is only that of one convolution output feature map. Clearly, the solution in the above example can significantly reduce the read/write pressure between the NPU and the DDR, and can therefore markedly limit the power consumption overhead caused by repeatedly reading and writing large amounts of data. In addition, since the NPU only needs to read A1 rows of data at a time when reading the original input feature map from the DDR, the read/write bandwidth limit does not degrade the computing efficiency of the whole system. Compared with current schemes that slice the feature map and compute each slice, the data processing method provided in this embodiment of the present application does not recompute any data, and therefore saves the computation bandwidth and the corresponding power consumption overhead of recomputing data.
In order to describe the solution provided by the embodiments of the present application more clearly, FIG. 5 to FIG. 13 below illustrate the convolution computation performed with this solution by taking the following example: the original input feature map contains 6×6 data, that is, i = j = 6; the convolution window of layer 1 has size 2×2, that is, A1 = B1 = 2, and the stride of the layer 1 convolution window is S1 = 1; for layer 2, A2 = B2 = 3 and S2 = 2. Refer to FIG. 5, which is a schematic diagram of the first run of the layer 1 convolution computation in this example. During initialization, the NPU may read the two rows of data a11 to a26 of the original input feature map from the DDR into line buffer 1 and line buffer 2, which are configured for layer 1 in the local cache. After this initialization, the NPU can start the convolution computation for layer 1. For example, the NPU may slide the convolution window corresponding to layer 1 over the data in line buffer 1 and line buffer 2, thereby completing the computation of one run.
It can be understood that each time the computation at one convolution window position is completed, one datum at the corresponding position of the output feature map is obtained. In addition, the output feature map of layer 1 is the input feature map of layer 2; therefore, in this example, each time one datum of the output feature map of layer 1 is obtained, it can be stored at the corresponding position of the line buffers configured for layer 2 in the local cache. For example, with reference to FIG. 6, take the results computed from a11 to a26 as b11 to b15. The result obtained by the layer 1 convolution window at the position covering a11 to a22 is b11. Each time the window slides within the run, a new result is obtained; for example, after b11 is computed, the layer 1 convolution window slides one datum to the right, and the next convolution computation yields b12. By analogy, when the layer 1 convolution window has slid to the rightmost position, b15 is computed. In this example, after obtaining b11, the NPU writes this result into column 1 of the first line buffer configured for layer 2 (for example, line buffer 3); after obtaining b12, the NPU writes it into column 2 of line buffer 3. By analogy, after one run of layer 1 is completed, all the data in line buffer 3 (b11 to b15) has been obtained.
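The column-by-column writing of the run 1 results into the line buffer that holds the first row of layer 2's input (line buffer 3 above) can be sketched as follows. The 2×2 kernel of ones and the input values are made up for the illustration, so the printed numbers are not the b11 to b15 of the figures.

```python
# Illustrative sketch: during run 1 of layer 1, each window result is written into
# the corresponding column of the line buffer that holds row 1 of layer 2's input.
row_a1 = [1, 2, 3, 4, 5, 6]          # a11 .. a16 held in line buffer 1
row_a2 = [7, 8, 9, 10, 11, 12]       # a21 .. a26 held in line buffer 2
kernel = [[1, 1], [1, 1]]            # assumed 2x2 layer 1 kernel (weights are made up)

line_buffer_3 = [0] * 5              # will hold b11 .. b15, i.e. row 1 of layer 2's input
for col in range(5):                 # window positions of run 1, from left to right
    result = sum(kernel[r][c] * [row_a1, row_a2][r][col + c]
                 for r in range(2) for c in range(2))
    line_buffer_3[col] = result      # store the result as soon as it is obtained
print(line_buffer_3)                 # [18, 22, 26, 30, 34]
```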
It can be understood that, when layer 1 performs the convolution computation of its first run, once the convolution window has completed the computation at the first position (that is, the position covering a11 to a22), a11 no longer participates in any subsequent computation. Therefore, in this example, after the first convolution computation of the first run is completed, the NPU may, as shown in FIG. 7, read from the DDR the datum that must be added for the first convolution computation of the second run, namely a31. The NPU may store a31 in place of a11 in the line buffer of layer 1 (for example, line buffer 1), so that the convolution computation of the second run can be performed later.
After a11 is replaced by a31, the data stored in the line buffers configured for layer 1 is as shown in FIG. 8. It can be seen that, by the time the NPU performs the second convolution computation of the first run of layer 1 (referred to as run 1), a31 is already stored in the NPU for the first convolution computation of run 2. It should be noted that this example takes the case in which the NPU reads new data from the DDR to replace data that no longer participates in subsequent computation each time one convolution computation is completed. In other implementations of the present application, the NPU may instead read multiple data from the DDR at once after completing all convolution computations of the first run, replacing the data in the line buffers that no longer participate in the computation. This reduces the number of times the NPU reads data from the DDR.
After the convolution computation of run 1 of layer 1 is completed, the line buffers of layer 1 can hold the data required for the convolution computation of the next run (run 2); the result of the data replacement is shown, for example, in (a) of FIG. 9. It can be understood that, to ensure that run 2 of layer 1 proceeds correctly, the NPU may adjust the positions of the data in the line buffers so that the convolution window covers the correct data while sliding. For example, the NPU may rewind the data stored in the line buffers in units of rows, achieving the effect of swapping the data stored in line buffer 1 and line buffer 2. That is, after the rewind, the data in the line buffers of layer 1 is converted from the distribution shown in (a) of FIG. 9 to the distribution shown in (b) of FIG. 9.
It should be noted that the rewind operation in (b) of FIG. 9 is an optional step; in some implementations of the present application, this rewind processing of the data is not required. It can be understood that a convolution computation essentially multiplies the data at each position of the convolution window by the data at the corresponding position of the input feature map and then sums these products to obtain the result of that convolution. Therefore, during the convolution computation, as long as the data of the convolution window and the data of the input feature map are multiplied with the correct correspondence, the order of the data in the line buffers need not be adjusted.
After the processing shown in FIG. 8 to (b) of FIG. 9, all the data needed for the convolution computation of run 2 of layer 1 is stored in the line buffers of layer 1. Thus, if the convolution computation of run 2 of layer 1 is then performed, the second row of the input feature map of layer 2 is obtained. For example, referring to FIG. 10, the convolution window can be moved back to the leftmost position of line buffer 1 and line buffer 2 to start the computation of run 2 of layer 1. After each computation, the window slides to the right by the stride corresponding to the layer 1 convolution window for the next computation, and so on until all convolution computations of run 2 are completed. The second row of the input feature map of layer 2 is thereby obtained; for example, the NPU may write these data (b21 to b25) into line buffer 4, which stores the second row of the input feature map of layer 2.
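Between two runs, the only change to layer 1's line buffers is that data which will no longer be used is overwritten with the next data read from the DDR, optionally followed by the rewind described above. A hedged sketch of this update is given below; for simplicity it replaces whole rows at once rather than datum by datum as in FIG. 7, and the function name is invented for the example.

```python
# Illustrative sketch: update a layer's line buffers between two runs.
# With a vertical stride of S, the oldest S rows are overwritten by the next S rows
# of the input; the optional "rewind" restores ascending row order inside the buffers.
def advance_line_buffers(line_buffers, next_rows, rewind=True):
    stride = len(next_rows)
    for k in range(stride):
        line_buffers[k] = next_rows[k]       # overwrite rows that are no longer needed
    if rewind:
        # Rotate so that the buffer order again matches the row order of the input.
        line_buffers[:] = line_buffers[stride:] + line_buffers[:stride]
    return line_buffers

bufs = [[11, 12, 13], [21, 22, 23]]          # rows 1 and 2 of the input (width 3)
advance_line_buffers(bufs, [[31, 32, 33]])   # run 2 needs rows 2 and 3
print(bufs)                                  # [[21, 22, 23], [31, 32, 33]]
```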
It can be understood that, to perform the computation of run 1, layer 2 needs at least 3 rows of input feature map data. Therefore, in process 1, the NPU may perform the convolution computations of runs 1 to 3 of layer 1 to obtain the data required for run 1 of layer 2. After obtaining the data required for run 1 of layer 2 (b11 to b35), the NPU can perform the computation of run 1 of layer 2, that is, enter the computation of process 2. The convolution computation process in layer 2 is similar to that in layer 1. For example, referring to FIG. 11, run 1 of layer 2 yields the input feature map data of layer 3 (if present), for example c11 to c13, and the NPU may store these data in the line buffers configured for layer 3 in the local cache (for example, line buffer 6).
It should be noted that, in this example, the stride of layer 2 is 2. That is, after the computation of run 1 of layer 2 is completed, the convolution window of layer 2 slides down by 2 rows of data to start the computation of run 2. For example, during the first convolution computation of run 1 of layer 2, the input feature map data covered by the convolution window of layer 2 is b11 to b35; during the first convolution computation of run 2 of layer 2, the data covered is b31 to b55. To obtain the data b31 to b55, the NPU must, in the computation of process 2, perform 2 runs of layer 1 (for example, run 4 and run 5 of layer 1). In other words, when the stride of the next layer is greater than 1, in the computation of process 2 shown in FIG. 4, the current layer must consecutively perform a number of runs equal to the stride of the next layer before the input feature map data needed to support one run of the next layer is obtained.
Exemplarily, in some implementations of the present application, after the computation of run 3 of layer 1 is completed, the computation of run 1 of layer 2 may be performed. After that, the NPU determines that the computation of run 2 of layer 2 cannot yet be performed, falls back to layer 1, and performs run 4 of layer 1. After run 4 of layer 1 is completed, the NPU determines whether run 2 of layer 2 can be performed; since the stride of layer 2 is 2, run 2 of layer 2 cannot be performed with the data available so far. The NPU therefore falls back again to layer 1 and performs run 5 of layer 1. After run 5 of layer 1 is completed, the NPU again determines whether run 2 of layer 2 can be performed; since the input feature map data required for run 2 of layer 2 (b31 to b55) has now been obtained, the NPU can then perform run 2 of layer 2. The subsequent process is similar and is not repeated here.
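The back-off sequence just described (run 1 of layer 2, then runs 4 and 5 of layer 1, then run 2 of layer 2) can be reproduced with a self-contained model of the readiness rule; the trace printed below is only a model of the decision order, not output of a real NPU.

```python
# Illustrative trace of the fallback decisions for layer 1 (2-row window, stride 1)
# and layer 2 (3-row window, stride 2) over a 6-row original input feature map.
layers = [  # (window rows, stride, rows currently buffered for this layer's input)
    {"name": "layer 1", "window": 2, "stride": 1, "rows_in": 6, "runs": 0},
    {"name": "layer 2", "window": 3, "stride": 2, "rows_in": 0, "runs": 0},
]
trace = []
while True:
    for k in (1, 0):   # try the deeper layer first, then fall back
        layer = layers[k]
        if layer["rows_in"] >= layer["window"] + layer["runs"] * layer["stride"]:
            layer["runs"] += 1
            trace.append(f'{layer["name"]}, run {layer["runs"]}')
            if k + 1 < len(layers):
                layers[k + 1]["rows_in"] += 1   # one new row for the next layer
            break
    else:
        break
print(trace)
# ['layer 1, run 1', 'layer 1, run 2', 'layer 1, run 3', 'layer 2, run 1',
#  'layer 1, run 4', 'layer 1, run 5', 'layer 2, run 2']
```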
In other implementations of the present application, since the stride of each computing layer is already determined when the convolutional neural network model starts computing, the NPU may perform run 5 of layer 1 directly after completing run 4 of layer 1, thereby obtaining in one go the data needed to support run 2 of layer 2. This reduces the number of times the NPU executes its judgment logic; however, more line buffers must then be configured for layer 1 so that the local cache can simultaneously hold the data needed for layer 1 to perform two runs in quick succession.
It can be understood that the above description of the computation logic for differing strides only takes the adjacent layer 1 and layer 2 as an example. For a neural network model provided with more computing layers, this computation logic can be extended to more layers. For example, suppose the computation of layer 3 follows layer 2, and the stride of layer 3 is 3. Then, in process 2, the NPU can perform more consecutive runs at layer 1, so that layer 2 can perform more consecutive runs, and the data needed for layer 3 to perform its next run can be obtained without going through the judgment logic. Of course, this requires more line buffers to be configured for layer 1 and layer 2. This approach can be applied when the local cache has ample spare storage space; it reduces the NPU's judgment logic and improves the computing efficiency of the system.
In other scenarios, in order to save storage space in the local cache, the computation can be performed with the judgment logic involved. For example, after layer 2 performs one run, the NPU determines whether layer 3 can perform its next run; if it can, the next run of layer 3 is performed. Conversely, if the next run of layer 3 cannot be performed, the NPU falls back to layer 2 and performs the next run of layer 2. Similarly, if the data currently in the line buffers corresponding to layer 2 cannot support layer 2 performing its next run, the NPU continues to fall back to the previous layer (for example, layer 1) and performs the next run there.
Thus, from the above description, it can be understood that with the data processing method provided by the embodiments of the present application, the local cache only needs to be configured, for each computing layer, with a number of line buffers corresponding to the number of rows of that layer's kernel function (for example, the convolution window of the convolution kernel in a convolution computation); the pipelined computation effect can then be obtained by following the method flow shown in FIG. 4 and the scheme described above. The NPU neither needs to read large amounts of data from the DDR repeatedly nor write large amounts of data into the DDR repeatedly. This saves the power consumption overhead introduced by reading and writing data and, since all intermediate data is stored in the line buffers of the local cache, significantly improves computing efficiency.
It should be noted that the solution provided by the embodiments of the present application can also be applied to scenarios with special computing requirements. Exemplarily, with reference to FIG. 3, take as an example a convolutional neural network model that also requires elementwise computation. FIG. 12 shows a schematic diagram of the computation logic in such a convolutional neural network. As shown in FIG. 12, in this convolutional neural network, in addition to the convolutional layer computation shown in FIG. 3, an elementwise computation is also required. Exemplarily, the elementwise computation may include an addition operation. The operands of this addition may be the convolution output feature map and the output feature map W obtained after the original input feature map is computed by computing layer W. The computing layer W may be the same computing layer as one of the convolution computing layers in the convolutional layer, or may be a computing layer different from every convolution computing layer in the convolutional layer. In this example, the elementwise addition may be performed in an Eltwise computing layer. In line with the foregoing description, corresponding line buffers may be configured for computing layer W in the local cache of the NPU. For example, take the case in which a convolution computation with a window of Aw rows and Bw columns is performed in computing layer W: Aw line buffers may then be configured for computing layer W in the local cache. Similarly, the Eltwise computing layer may also be configured with corresponding line buffers in the local cache; exemplarily, the number of these line buffers may be an integer greater than or equal to 1.
Before performing the elementwise addition, the NPU may perform, in a time-shared manner, the convolution computation of the convolutional layer and the convolution computation in computing layer W. For example, the NPU may perform the convolution computation of run 1 of computing layer W to obtain the first row of output feature map W, and store this first row in the line buffers corresponding to the Eltwise computing layer. The NPU may then perform the convolution computation of the convolutional layer; when the first row of the convolution output feature map is obtained, it can be input into the Eltwise computing layer, so that the NPU adds, in the Eltwise computing layer, the already stored first row of output feature map W and the first row of the convolution output feature map, thereby obtaining the first row of the Eltwise output feature map. If no other layers need to be computed, the NPU can output this first row of the Eltwise output feature map to the DDR as part of the output feature map of one round of convolutional neural network computation.
After obtaining the first row of the Eltwise output feature map, the NPU may update the data in the line buffers corresponding to computing layer W according to the method in the preceding examples, so as to perform the convolution computation of its second run. In addition, the NPU may perform the convolution computation of the convolutional layer according to the foregoing method to obtain the second row of the convolution output feature map, and perform the addition in the Eltwise computing layer, thereby obtaining the second row of the Eltwise output feature map. By analogy, the complete data of the output feature map of one round of convolutional neural network computation can be obtained. It can be seen that, according to the method provided by the embodiments of the present application, when performing special operations such as elementwise operations, it is only necessary to configure, for the corresponding computing layer, line buffers capable of storing the data required for one run of its computation. In this way, large amounts of data need not be read from and written to the DDR repeatedly, and the power consumption overhead introduced by those data transfers is avoided.
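The row-wise interleaving of computing layer W, the convolutional layer and the Eltwise addition can be sketched as follows. This is a simplified illustration that assumes both branches produce rows of equal width; the helper names are invented for the example, and only the row-wise buffering and element-wise addition reflect the scheme described here.

```python
# Illustrative sketch: produce the Eltwise output feature map row by row.
# branch_w_rows  - rows of output feature map W (computed by computing layer W)
# conv_out_rows  - rows of the convolution output feature map (layers 1..N)
def eltwise_pipeline(branch_w_rows, conv_out_rows):
    eltwise_out = []
    for w_row, conv_row in zip(branch_w_rows, conv_out_rows):
        # The row of output feature map W is buffered in the Eltwise line buffer first,
        # then added element by element to the matching convolution output row.
        eltwise_line_buffer = list(w_row)
        eltwise_out.append([w + c for w, c in zip(eltwise_line_buffer, conv_row)])
    return eltwise_out

# Example with two 3-wide rows per branch (the values are arbitrary).
print(eltwise_pipeline([[1, 2, 3], [4, 5, 6]],
                       [[10, 20, 30], [40, 50, 60]]))   # [[11, 22, 33], [44, 55, 66]]
```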
It should be noted that the above example is described by taking the case in which line buffers are configured separately for computing layer W. In other implementations of the present application, the line buffers of computing layer W may also be shared with layer 1. Exemplarily, in this case, the convolution computation in computing layer W and the convolution computation in layer 1 have in common that both are convolution computations on the data of the original input feature map, although their convolution kernels may differ.
In this example, when computing layer W and layer 1 have different convolution kernels, a common set of line buffers may be configured for computing layer W and layer 1, and the number of these line buffers may be determined by the convolution window with the larger number of rows among the convolution windows corresponding to the convolution kernels of computing layer W and layer 1. For example, if the number of rows Aw of the convolution window of computing layer W is 3 and the number of rows A1 of the convolution window of layer 1 is 2, then a local storage including 3 line buffers may be configured jointly for computing layer W and layer 1 to support the convolution computations of both.
It should be noted that, since computing layer W and layer 1 may need to perform convolution computations on the data stored in the line buffers using different convolution kernels, the data in the line buffers may be updated only after both computing layer W and layer 1 have completed the convolution computation at the corresponding position. For example, with reference to the description of FIG. 7: when only the layer 1 convolution computation needs to be performed on the input feature map, the NPU reads a31 from the DDR to replace a11 after the first convolution computation of run 1 of layer 1 is completed, as shown in FIG. 7. In the present example, however, a11 must participate not only in the first convolution computation of run 1 of layer 1 but also in the first convolution computation of run 1 of computing layer W. Therefore, the NPU may read a31 from the DDR to replace a11 only after a11 has been used in both of these computations. In this way, the line buffers are reused, saving storage space in the local cache.
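Sharing one set of line buffers between computing layer W and layer 1 amounts to sizing the buffers by the larger of the two window heights and applying both kernels to the same buffered rows before any datum is overwritten. The following is only a sketch of that idea with assumed 2-row and 3-row windows and made-up kernel weights; it is not the patented hardware.

```python
# Illustrative sketch: one shared set of line buffers serves layer 1 and layer W.
def shared_buffer_count(rows_layer1, rows_layer_w):
    return max(rows_layer1, rows_layer_w)    # e.g. max(2, 3) = 3 line buffers

def run_over_shared_buffers(shared_rows, kernel_layer1, kernel_layer_w, stride=1):
    """Apply both kernels to the same buffered rows; only afterwards may those rows
    be overwritten with new data read from the DDR."""
    def one_run(kernel):
        a, b = len(kernel), len(kernel[0])
        width = len(shared_rows[0])
        return [sum(kernel[r][c] * shared_rows[r][col + c]
                    for r in range(a) for c in range(b))
                for col in range(0, width - b + 1, stride)]
    return one_run(kernel_layer1), one_run(kernel_layer_w)

rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # 3 shared line buffers
k1 = [[1, 1], [1, 1]]                                   # layer 1: 2-row window
kw = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                  # layer W: 3-row window
print(shared_buffer_count(2, 3))                        # 3
print(run_over_shared_buffers(rows, k1, kw))            # ([14, 18, 22], [18, 21])
```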
It can be understood that the above examples are all described by taking a single-core NPU as an example. For instance, with reference to the schematic flowchart shown in FIG. 4, the NPU may execute process 1 and process 2 therein in sequence. Of course, in connection with the description of the solution shown in FIG. 4, the execution order of some steps in process 1 and process 2 may differ from FIG. 4, which is not repeated here. At present, with the improvement of chip manufacturing processes, multi-core NPUs are also frequently used. When the data processing method provided by the embodiments of the present application is used in a multi-core NPU, the computations can be performed concurrently, further improving computing efficiency.
Exemplarily, take an NPU having two computing cores as an example. FIG. 13 shows a schematic comparison of the computation flows over time in the single-core scenario and the multi-core scenario. As shown in FIG. 13, in the single-core scenario, a computing core (for example, core 1) in the NPU performs the computation of run 4 of layer 1 at time T1, the computation of run 1 of layer 2 at time T2, and the computation of run 5 of layer 1 at time T3. Correspondingly, when the NPU is a dual-core processor (that is, in the dual-core scenario), one computing core (for example, core 1) of the NPU may perform the computation of run 4 of layer 1 at time T1, the computation of run 5 of layer 1 at time T2, and the computation of run 6 of layer 1 at time T3, while the computation of layer 2 is performed by another computing core (for example, core 2) of the NPU. For example, at time T2, core 2 may perform the computation of run 1 of layer 2 while core 1 performs the computation of run 5 of layer 1; at time T3, core 2 may perform the computation of run 2 of layer 2 while core 1 performs the computation of run 6 of layer 1.
Obviously, compared with the computation flow of a single-core NPU, the computation process of a multi-core NPU allows multiple computation processes to run concurrently, so that after the data required for run 1 of layer 2 has been obtained, the NPU can perform the next run of layer 1 and run 1 of layer 2 at the same time, instead of waiting for run 1 of layer 2 to finish before falling back to layer 1 to perform the next run. It should be noted that the above example is described with core 1 performing the computation of layer 1 and core 2 performing the computation of layer 2. The present application does not restrict the correspondence between computing cores and the computation of computing layers. That is, in other implementations of the present application, computations of different computing layers may also be performed on the same computing core. For example, when the computing capability of core 1 (for example, measured by throughput) is large and the throughput of core 2 is small, core 1 may, in addition to completing the computation of layer 1, also process part of the computation of layer 2 through time-division multiplexing, so as to keep the throughput consistent. This makes full use of the computing bandwidth and improves the working efficiency of the multi-core NPU.
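The dual-core timeline of FIG. 13 can be modeled with a trivial schedule table in which core 1 keeps producing layer 1 runs while core 2 consumes them one time slot later for layer 2. This is only an illustrative model of the overlap; the time slots T1 to T3 are treated as discrete steps, and the NoC-based scheduling of a real multi-core NPU is not represented.

```python
# Illustrative sketch: overlap of layer 1 and layer 2 runs on a dual-core NPU.
def dual_core_schedule(num_slots, first_layer1_run=4, first_layer2_run=1):
    timeline = []
    for t in range(num_slots):
        slot = {"time": f"T{t + 1}",
                "core 1": f"layer 1, run {first_layer1_run + t}"}
        if t >= 1:   # core 2 starts one slot later, once layer 2's input is ready
            slot["core 2"] = f"layer 2, run {first_layer2_run + t - 1}"
        timeline.append(slot)
    return timeline

for slot in dual_core_schedule(3):
    print(slot)
# T1: core 1 -> layer 1 run 4
# T2: core 1 -> layer 1 run 5, core 2 -> layer 2 run 1
# T3: core 1 -> layer 1 run 6, core 2 -> layer 2 run 2
```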
The foregoing describes the solutions provided by the embodiments of the present application mainly from the perspective of the processor. It can be understood that, in order to implement the above functions, the processor includes corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered to be beyond the scope of the present application.
In the embodiments of the present application, the data processing apparatus corresponding to the processor may be divided into functional modules according to the foregoing method examples. For example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. Optionally, the division of modules in the embodiments of the present application is schematic and is merely a division by logical function; other division manners are possible in actual implementation.
Refer to FIG. 14, which is a schematic structural diagram of a data processing apparatus 1400 provided by an embodiment of the present application. The data processing apparatus 1400 can be applied to performing neural network computation, where the neural network includes N computing layers and N is an integer greater than or equal to 2. The data processing apparatus 1400 is provided with a local cache. As shown in FIG. 14, the data processing apparatus 1400 includes: an obtaining unit 1401, configured to obtain first data, where the first data is used to perform a first computation run of a first computing layer, and the first computing layer is any one of the N computing layers; a storage unit 1402, configured to store the first data in a first line buffer of the first computing layer, where the first line buffer of the first computing layer is included in the local cache; and a computing unit 1403, configured to compute the first computation run of the first computing layer to obtain second data corresponding to the first computation run of the first computing layer, where the first computation run of the first computing layer includes convolution computation, using the convolution window of the first computing layer, on one or more rows of the first data. The storage unit 1402 is further configured to store the second data in a first line buffer of a second computing layer, where the first line buffer of the second computing layer is included in the local cache and the second computing layer is a computing layer after the first computing layer among the N computing layers. The computing unit 1403 is further configured to, when the accumulated data stored in the first line buffer of the second computing layer is sufficient to perform a first computation run of the second computing layer, compute the first computation run of the second computing layer to obtain fifth data corresponding to the first computation run of the second computing layer, where the first computation run of the second computing layer includes convolution computation, using the convolution window of the second computing layer, on one or more rows of the second data.
In a possible design, the calculation unit 1403 is further configured to, when the accumulated data is not sufficient for the first calculation run of the second computing layer, perform a second calculation run of the first computing layer, where the second calculation run of the first computing layer is a calculation run after the first calculation run of the first computing layer. In a possible design, the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer. In a possible design, the obtaining unit 1401 is further configured to read the first data from an external memory, where the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor. In a possible design, the first data is part of the input feature map stored in the external memory, and the obtaining unit 1401 is further configured to obtain third data from the external memory, where the third data is another part of the input feature map and is used to perform the second calculation run of the first computing layer; the third data is stored by overwriting fourth data, where the fourth data is data in the first data that no longer participates in the calculation of the first computing layer. In a possible design, the storage unit 1402 is further configured to, in the process of performing the first calculation run of the first computing layer, store the calculation result of the convolution window of the first computing layer at each position in the first line buffer of the second computing layer as soon as that result is obtained. In a possible design, the obtaining unit 1401 is further configured to obtain the fifth data corresponding to the first calculation run of the second computing layer, and the storage unit 1402 is configured to store the fifth data in a first line buffer of a third computing layer, where the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform the convolution calculation of the third computing layer.
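The overwrite-storage design described above (the fourth data being replaced by the third data once it no longer participates in the computation) behaves like a ring buffer whose depth equals the number of rows of the convolution window. A minimal sketch follows; the class and method names (LineBuffer, push, ready, ordered_rows) are hypothetical and illustrate only the buffering policy, not the patented hardware.

```python
import numpy as np

class LineBuffer:
    """Ring buffer whose depth equals the number of rows of the layer's
    convolution window; a newly fetched row overwrites the row that no
    longer participates in the layer's computation."""

    def __init__(self, num_rows, width):
        self.rows = np.zeros((num_rows, width))
        self.next_slot = 0               # slot holding the oldest row
        self.filled = 0

    def push(self, new_row):
        # overwrite-store: the incoming row (third data) replaces the
        # oldest buffered row (fourth data)
        self.rows[self.next_slot, :] = new_row
        self.next_slot = (self.next_slot + 1) % self.rows.shape[0]
        self.filled = min(self.filled + 1, self.rows.shape[0])

    def ready(self):
        """True when enough rows have accumulated for one calculation run."""
        return self.filled == self.rows.shape[0]

    def ordered_rows(self):
        """Return the buffered rows in arrival order for the convolution."""
        if not self.ready():
            return self.rows[:self.filled]
        return np.roll(self.rows, -self.next_slot, axis=0)
```

Under this policy, only as many rows as the convolution window is tall ever reside in the local cache for a given layer, which is what allows the feature map to be streamed from external memory piece by piece.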
For all related content of the steps involved in the foregoing method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, and details are not repeated here. That is, any of the above units may be implemented in software, in hardware, or in a combination of the two, so as to realize the functions shown in the method. The data processing apparatus 1400 including the above units may be a part integrated in the above-mentioned processor, such as functional hardware in the processor or functional software running in the processor. For example, if any of the units is implemented as a software module, it runs on the neural network computing apparatus 200 shown in FIG. 2.
Please refer to FIG. 15, which is a schematic structural diagram of an electronic device 1500 according to an embodiment of this application. The electronic device 1500 may include a processor 1501 and a memory 1502. The memory 1502 is configured to store computer-executable instructions. Exemplarily, in some embodiments, when the processor 1501 executes the instructions stored in the memory 1502, the electronic device 1500 is caused to perform one or more of steps S401-S413 shown in FIG. 4, as well as other operations that the electronic device needs to perform. In some embodiments, the electronic device 1500 may be provided with the neural network computing apparatus 200 described in FIG. 2. For the processor 1501, reference may be made to the neural network computing apparatus 200 in FIG. 2.
It should be understood that the processor in this embodiment includes, but is not limited to, one or more of the aforementioned CPU, NPU, FPGA, GPU, and DSP (digital signal processor). The above processor may be implemented as one or more chips. When the processor is integrated into a single chip, the chip is also referred to as a system-on-chip (SoC).
It should be noted that, for all related content of the steps involved in the foregoing method embodiments, reference may be made to the functional descriptions of the corresponding functional modules, and details are not repeated here. The data processing apparatus provided in the embodiments of this application is used to perform the functions of the processor in the foregoing data processing method, and can therefore achieve the same effects as the foregoing method.
The functions, actions, operations, or steps in the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), or the like.
Although this application has been described with reference to specific features and embodiments, it is apparent that various modifications and combinations may be made without departing from the spirit and scope of this application. Accordingly, the specification and drawings are merely exemplary descriptions of this application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Obviously, a person skilled in the art can make various changes and modifications to this application without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to include them.

Claims (10)

  1. A data processing method, wherein the method is applied to a processor performing neural network computation, the neural network comprises N computing layers, N is an integer greater than or equal to 2, and the processor is provided with a local cache; the method comprises:
    obtaining first data, wherein the first data is used to perform a first calculation run of a first computing layer, and the first computing layer is any one of the N computing layers;
    storing the first data in a first line buffer of the first computing layer, wherein the first line buffer of the first computing layer is included in the local cache;
    performing the first calculation run of the first computing layer to obtain second data corresponding to the first calculation run of the first computing layer, wherein the first calculation run of the first computing layer comprises convolution of one or more rows of the first data with a convolution window of the first computing layer;
    storing the second data in a first line buffer of a second computing layer, wherein the first line buffer of the second computing layer is included in the local cache, and the second computing layer is a computing layer after the first computing layer among the N computing layers; and
    when the data accumulated in the first line buffer of the second computing layer is sufficient for a first calculation run of the second computing layer, performing the first calculation run of the second computing layer to obtain fifth data corresponding to the first calculation run of the second computing layer, wherein the first calculation run of the second computing layer comprises convolution of one or more rows of the second data with a convolution window of the second computing layer.
  2. The method according to claim 1, wherein the method further comprises:
    when the accumulated data is not sufficient for the first calculation run of the second computing layer, performing a second calculation run of the first computing layer, wherein the second calculation run of the first computing layer is a calculation run after the first calculation run of the first computing layer.
  3. The method according to claim 1 or 2, wherein
    the number of rows of the first line buffer is equal to the number of rows of the convolution window of the first computing layer.
  4. The method according to any one of claims 1-3, wherein, when the first computing layer is the first computing layer of the neural network, the obtaining first data comprises:
    reading the first data from an external memory, wherein the first data is at least part of an input feature map stored in the external memory, and the external memory is a storage medium coupled to the processor.
  5. The method according to claim 4, wherein the first data is part of the input feature map stored in the external memory, and the method further comprises:
    obtaining third data from the external memory, wherein the third data is another part of the input feature map, and the third data is used to perform a second calculation run of the first computing layer; and
    storing the third data by overwriting fourth data, wherein the fourth data is data in the first data that no longer participates in the calculation of the first computing layer.
  6. The method according to any one of claims 1-5, wherein the storing the second data in a first line buffer of a second computing layer comprises:
    in the process of performing the first calculation run of the first computing layer, each time a calculation result of the convolution window of the first computing layer at one position is obtained, storing the calculation result in the first line buffer of the second computing layer.
  7. The method according to any one of claims 1-6, wherein, after obtaining the fifth data corresponding to the first calculation run of the second computing layer, the method further comprises:
    storing the fifth data in a first line buffer of a third computing layer, wherein the first line buffer of the third computing layer is included in the local cache, the third computing layer is a computing layer after the second computing layer among the N computing layers, and the fifth data is used to perform convolution calculation of the third computing layer.
  8. A processor, wherein the processor comprises one or more computing cores and a local cache, and the processor is configured to implement the data processing method according to any one of claims 1-7.
  9. An electronic device, wherein the electronic device comprises one or more processors according to claim 8 and one or more memories, the memory is coupled to the processor, and the memory stores computer instructions;
    when the processor executes the computer instructions, the electronic device is caused to perform the data processing method according to any one of claims 1-7.
  10. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer instructions, and when the computer instructions are run, the data processing method according to any one of claims 1-7 is performed.
PCT/CN2021/074548 2021-01-30 2021-01-30 Data processing method and processor WO2022160310A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor
CN202180077853.3A CN116472537A (en) 2021-01-30 2021-01-30 Data processing method and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor

Publications (1)

Publication Number Publication Date
WO2022160310A1 true WO2022160310A1 (en) 2022-08-04

Family

ID=82652937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/074548 WO2022160310A1 (en) 2021-01-30 2021-01-30 Data processing method and processor

Country Status (2)

Country Link
CN (1) CN116472537A (en)
WO (1) WO2022160310A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862374A (en) * 2017-10-30 2018-03-30 中国科学院计算技术研究所 Processing with Neural Network system and processing method based on streamline
CN111582451A (en) * 2020-05-08 2020-08-25 中国科学技术大学 Image recognition interlayer parallel pipeline type binary convolution neural network array architecture
CN111767986A (en) * 2020-06-24 2020-10-13 深兰人工智能芯片研究院(江苏)有限公司 Operation method and device based on neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Master's Thesis", 30 June 2020, UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA, article BAICHENG LIU: "VLSI Architecture Design for Binary Convolutional Neural Network Accelerator", pages: 1 - 90, XP055954903 *

Also Published As

Publication number Publication date
CN116472537A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
US10803379B2 (en) Multi-memory on-chip computational network
CN107657581B (en) Convolutional neural network CNN hardware accelerator and acceleration method
US11775430B1 (en) Memory access for multiple circuit components
EP3664093B1 (en) Semiconductor memory device employing processing in memory (pim) and method of operating the semiconductor memory device
US10846621B2 (en) Fast context switching for computational networks
US9129674B2 (en) Hybrid memory device
JP7179853B2 (en) On-chip computational network
US10783104B2 (en) Memory request management system
JP2003504757A (en) Buffering system bus for external memory access
JP7201802B2 (en) Data read/write method and system in 3D image processing, storage medium and terminal
US20200192803A1 (en) Method and apparatus for accessing tensor data
CN109491934B (en) Storage management system control method integrating computing function
EP3844610B1 (en) Method and system for performing parallel computation
JP2022137247A (en) Processing for a plurality of input data sets
WO2022160310A1 (en) Data processing method and processor
CN116431562B (en) Multi-head attention mechanism fusion calculation distribution method based on acceleration processor
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
US11093276B2 (en) System and method for batch accessing
CN111756802A (en) Method and system for scheduling data stream tasks on NUMA platform
JP7177948B2 (en) Information processing device and information processing method
WO2021244045A1 (en) Neural network data processing method and apparatus
CN116360672A (en) Method and device for accessing memory and electronic equipment
CN104615557B (en) A kind of DMA transfer method that multinuclear fine granularity for GPDSP synchronizes
US11907144B1 (en) Early semaphore update
WO2023115529A1 (en) Data processing method in chip, and chip

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21921911

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180077853.3

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21921911

Country of ref document: EP

Kind code of ref document: A1