CN111105023A

CN111105023A - Data stream reconstruction method and reconfigurable data stream processor

Info

Publication number: CN111105023A
Application number: CN201911087000.9A
Authority: CN
Inventors: 王峥; 周丽冰; 陈伟光; 谢文婷; 粟金源
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Zhongke Yuanwuxin Technology Co ltd
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2020-05-05
Anticipated expiration: 2039-11-08
Also published as: WO2021089009A1; CN111105023B

Abstract

The invention discloses a data stream reconstruction method and a reconfigurable data stream processor, in particular to data stream reconstruction oriented to a hybrid artificial neural network, which dynamically changes corresponding function configuration on resources such as a computing unit, a storage unit, a data flow unit and the like according to different neural network layers, multiplexes hardware in a large scale to realize the neural network layers with different functions, and obtains the effects of improving the hardware utilization rate, improving the operation speed, reducing the power consumption and the like aiming at a hybrid neural network structure formed by a plurality of neural network layers. Particularly, the reusable configuration is confirmed by acquiring the characteristic information of other novel neural network layers, so that the resource reuse foundation can be provided for the construction of other novel neural network layers and the realization of a hybrid neural network based on the novel neural network layers in the follow-up research, and the universality is extremely strong.

Description

Data stream reconstruction method and reconfigurable data stream processor

Technical Field

The present invention relates to the field of data stream technology of neural networks, and in particular, to a data stream reconstruction method and a reconfigurable data stream processor.

Background

Neural networks are widely used in the fields of computer vision, natural language processing, game engines and the like, and with the rapid development of neural network structures, the demands of the neural networks on the computing power of different data streams are increased continuously. Therefore, the future hybrid neural network is trended, and the compact algorithm kernel can support the end-to-end tasks in the aspects of perception, control and even driving. Meanwhile, dedicated hardware accelerator structures have been proposed to accelerate the inference phase of neural networks, such as eyeris, Google TPU-I and DaDianNao, which achieve high performance and high resource utilization by the cooperative design technique of algorithms and architectures, such as dedicated data stream and systolic array multipliers, but these architectures and neural networks are tightly coupled and cannot accelerate for different neural networks. Therefore, corresponding data stream schemes need to be designed according to different neural networks, and a key data stream reconstruction method is the design key point of the hybrid artificial neural network.

In the prior art, a scheme for performing resource multiplexing by data stream reconstruction aiming at a hybrid neural network structure composed of different neural network layers such as a pooling layer, a full-link layer, a cyclic network LSTM layer, a deep reinforcement learning layer, a residual error layer and the like is lacked, so that the scheme in the prior art often has the defects of high hardware cost, complex structure, low operation speed, large operation power consumption and the like.

Disclosure of Invention

In view of the above, the present invention provides a data stream reconstruction method and a reconfigurable data stream processor to solve the above problems.

In order to achieve the purpose, the invention adopts the following technical scheme:

the invention provides a data stream reconstruction method, which comprises the following steps: acquiring characteristic information of a target neural network layer; determining a data flow mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip according to the characteristic information of the target neural network layer; performing function configuration corresponding to the processing unit of the target neural network layer and the system on chip on the reusable processing unit and the system on chip, and performing network configuration corresponding to the target neural network layer according to a data flow mode of the target neural network layer to construct the target neural network layer; and obtaining an output result by adopting the constructed target neural network layer.

Preferably, when the target neural network layer is a convolutional layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission, a static memory of the system on chip is configured to buffer an activation function of an input feature diagram on a thread, a weight and the activation function are shared among a plurality of threads, and serial output of each thread is output in parallel after output buffering.

Preferably, when the target neural network layer is a pooling layer, the processing unit is configured as a comparator; the input or output of the data streams is a parallel transmission.

Preferably, when the target neural network layer is a fully-connected layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads; the input or output of the data stream is serial transmission of thread-level parallel, a static memory of the system on chip is configured as a weight buffer, and the activation function is in serial streaming transmission through a plurality of threads.

Preferably, when the target neural network layer is a residual layer, the processing unit is configured as an adder; the input or output of the data stream is parallel transmission, and the input and output shift registers of the system on chip are used for storing operands.

Preferably, when the target neural network layer is a long-term and short-term memory layer, the processing units are divided into four groups, each group of processing units is used for instantiating a sigmoid function and a tanh function, and the input or output of the data stream is serial transmission.

Preferably, when the target neural network layer is a reinforcement learning layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads; the input or output of the data stream is serial transmission of thread-level parallel, and the cache of the system on chip is used for state activation and iteration operation.

The invention provides a reconfigurable data stream processor which is used for executing the data stream reconfiguration method. The system on chip is used for controlling each group of processing units to be matched with the corresponding hardware thread, adjusting the processing units to be matched with the functional configuration of the target neural network layer, and constructing the target neural network layer.

Preferably, the system on chip comprises an execution controller, a direct memory access controller, an execution thread and a buffer, wherein the execution controller is used for extracting network instructions of a target neural network layer from an external off-chip memory, configuring the network instructions into a static memory, and performing decoding analysis on the network instructions one by one to drive the execution thread; the direct memory access controller is used for controlling reading and writing between the system on the chip and the memory under the chip; the execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer; the buffer area comprises a static memory pool formed by a plurality of static memories.

Preferably, the hardware thread comprises a core state machine and a shift register, the core state machine is used for controlling data input and output, activation function allocation and weight allocation of the processing units on the same thread, and the shift register is used for constructing input and output of an activation function.

The data stream reconstruction method and the reconfigurable data stream processor provided by the invention support the functions of different neural network operators through calculation, storage and dynamic function change of the data flow unit, carry out large-scale multiplexing on hardware resources, realize the adaptation to various neural networks, particularly novel hybrid neural networks, and achieve the effects of improving the hardware utilization rate, improving the operation speed, reducing the power consumption and the like.

Drawings

FIG. 1 is an exemplary block diagram of a hybrid neural network architecture;

fig. 2 is a flowchart of a data stream reconstruction method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the structure of convolutional layers as target neural network layers;

FIG. 4 is a schematic diagram of the structure of a pooling layer as a target neural network layer;

FIG. 5 is a schematic diagram of a fully-connected layer as a target neural network layer;

FIG. 6 is a schematic diagram of a structure of a residual layer as a target neural network layer;

FIG. 7 is a schematic structural diagram of a long-short term memory layer as a target neural network layer;

FIG. 8 is a schematic diagram of a reinforcement learning layer as a target neural network layer;

FIG. 9 is a block diagram of a reconfigurable data stream processor according to an embodiment of the present invention;

fig. 10 is a comparison graph of Q iteration time between the architecture designed in the verification experiment of example 2 and the host.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. Examples of these preferred embodiments are illustrated in the accompanying drawings. The embodiments of the invention shown in the drawings and described in accordance with the drawings are exemplary only, and the invention is not limited to these embodiments.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps that are closely related to the solution according to the present invention are shown in the drawings, and other details that are not relevant are omitted.

Example 1

Referring to fig. 1, the important network layers and their interconnections in a hybrid neural network architecture are depicted, forming an end-to-end network that targets sensing and control. In particular, graphical inputs, cascaded convolutional and pooling layers are used as perceptual modules for visual feature extraction. Model networks such as Yolo-v3 and Resnet-50 can reach tens of layers to mimic the human visual system. For the application of video context understanding and language processing, the time-related feature sequence is used as the input of the LSTM long-short term memory layer, and the LSTM layer re-extracts the feature output related to the sequence. Unlike the previous layer, the LSTM network is a special network structure of the recurrent neural network, and four basic gates are formed, namely an input gate (I), an output gate (O), and cell state gates (C) and (F). When I, O and F gates compute layer outputs by vector operations, the C gate will hold the current map s layer state and serve as the recursive input for the next time series.

The control network layer is carried out after feature extraction, extracted feature parameters are regarded as state nodes in the deep reinforcement learning neural network DQN, and the optimal decision needs to be selected through action nodes. The method is to traverse all possible actions in the current state and execute a regression strategy according to a reinforcement learning strategy to find the maximum or minimum output value (Q value). Since the action nodes need to be iterated, all computations in subsequent layers also need to be iterated, which is indicated by the dashed box. Multilayer sensors employ the most common fully connected layer. The short-circuit mode is also commonly used in the residual error network, and the accuracy of classification and regression is improved by providing key elements in an image layer form before the current input.

For the artificial neural network, each neural network layer is different not only in network structure but also in operand, operator and nonlinear function. In different neural network layers, the data stream attributes are reconfigurable according to different network structures, so that resources which can be reused in the construction process of the hybrid neural network structure, such as, for example, a processing unit PE, data input/output, a static memory SRAM used as a buffer, an interface with an off-chip DRAM memory, and the like, can be found according to common points among the characteristics of the neural network layers by analyzing the characteristics of various neural network layers, such as, for example, data stream access modes, functions of computing resources, and the like, and the data stream reconfiguration is performed according to the idea to solve the prior art problem, and referring to table 1, characteristic information of a plurality of standard neural network layers is summarized.

TABLE 1. characteristic information of multiple standard neural network layers

It can be seen that the pooling and shorting layers are vector operations, while other inliers are matrix operations, where the convolution process is sparse and the remaining network layers are dense. In each network layer, different activation functions are employed, with respect to its non-linear function, the LSTM network employs both sigmoid and tangent, while the remaining matrix kernels use the ReLU function or sigmoid.

Network data in the convolutional layer and the fully-connected layer needs to be shared between the nodes that output the feature map. The LSTM layer employs a similar serial flow, but in particular its active flow needs to be shared among multiple gates. And the state action layer needs to generate the data stream quickly based on the iteration of the action node. The pooling layer and the residual layer operating on vectors need not share activation functions for the feature map. Thus, the activated vector types can be transmitted in parallel.

Furthermore, after analyzing the function of intermediate data for multiple network layers, the convolutional and pooling layers are mainly determined by activation, while the FC fully-connected and LSTM layers are determined by weight due to the nature of data sparsity. In the residual layer, a pointer to the previous layer to activate needs to be kept in order for the network to process the previous data.

Combining the above analysis ideas, as shown in fig. 2, the present invention provides a data stream reconstruction method, which includes:

s1, acquiring characteristic information of a target neural network layer;

s2, determining a data flow mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip (SoC) according to the characteristic information of the target neural network layer;

s3, performing function configuration of the processing unit and the system on chip corresponding to the target neural network layer on the reusable processing unit and the system on chip, and performing network configuration corresponding to the target neural network layer according to a data flow mode of the target neural network layer to construct the target neural network layer;

and S4, obtaining an output result by adopting the constructed target neural network layer.

The data stream reconstruction method provided by the invention can perform corresponding function configuration dynamic change on resources such as a computing unit, a storage unit, a data flow unit and the like according to different neural network layers, realize the neural network layers with different functions by multiplexing hardware in a large scale, obtain the effects of improving the utilization rate of the hardware, improving the operation speed, reducing the power consumption and the like aiming at a mixed neural network structure formed by a plurality of neural network layers, and provide a resource multiplexing basis for the realization of constructing other novel neural network layers in subsequent research. Compared with the reuse data stream reuse scheme which only aims at the fine-grained data of the standard convolution network operator, such as weight fixation, output fixation, row fixation and the like in the prior art, the reuse data stream reuse scheme has stronger universality and better obtained effect.

The following describes, with reference to fig. 3 to fig. 8 (in which the dashed lines indicate resources that do not need to be multiplexed), a method for managing data streams and sharing resources by taking an important neural network layer as an example, specifically as follows:

referring to fig. 3, when the target neural network layer is a convolutional layer, the processing unit includes a multiply-accumulate operation unit and a modified linear unit which are grouped and configured in a plurality of threads, wherein each thread processes data using the same row and column on a plurality of channels of an output feature map; the input or output of the data stream is thread-level parallel serial transmission, a static memory of the system on chip is configured to be used for buffering an activation function of an input feature graph on a thread, a weight and the activation function are shared among a plurality of threads, the activation function is in serial streaming transmission from a single buffer area to realize sharing among processing units, and serial output of each thread is output in parallel through a serial deserializer SERDES and a DRAM controller after being output and buffered.

As shown in fig. 4, when the target neural network layer is a pooling layer, the processing unit is configured as a comparator to implement maximum and minimum operators; the input or output of the data stream is parallel transmission, and the pooling layer directly operates on the vector, so that the activation function acquired from the DRAM is directly provided to the processing unit array without buffering, thereby greatly saving dynamic power consumption, and the activation function compares the passing time by modifying the DRAM access address.

Referring to fig. 5, when the target neural network layer is a fully-connected layer, the output and processing units are configured like convolutional layers, and the processing units include multiply-accumulate operation units and modified linear units which are grouped and configured in a plurality of threads; the input or output of the data stream is serial transmission in parallel at the thread level, and for the kernel network with the weight dominance, the static memory of the system on chip is configured as a weight buffer area, and the activation function is in serial streaming transmission through a plurality of threads.

As shown in fig. 6, when the target neural network layer is a residual layer, which is similar to the pooling layer, and the kernel directly works on the parameters, the processing unit is configured as an adder; the input or output of the data stream is a parallel transfer, the input and output shift registers of the system on chip are used to store operands due to the addition of two vectors, the output result is written to the output shift register and written to the DRAM in parallel, and a pointer buffer is instantiated to address both operands in the DRAM.

Referring to fig. 7, when the target neural network layer is a long-term and short-term memory layer, the network layer multiplexes four processing units, the processing units are divided into four groups, each group of processing units is used for instantiating a sigmoid function and a tanh function, the addition vector operation and the tanh function operation are performed later, the input or output of the data stream is serial transmission, and a mixed input mode is adopted for providing fast data between the shared activation function and different groups in each group of gates. The state unit cache is instantiated for retaining intermediate state information.

Referring to fig. 8, when the target neural network layer is a reinforcement learning layer, the input, output and processing units are configured similarly to a fully connected layer, including various active sources such as DRAM for conventional activation, the processing units include a multiply-accumulate unit and a modified linear unit which are grouped and configured in a plurality of threads, the input or output of data streams is serial transmission in thread-level parallel, and the cache of the system on chip is used for state activation and iterative operation.

In addition to the above-mentioned neural network layer, the target neural network layer may also be applied to other novel neural network layers, and similarly, as long as characteristic information of the novel neural network layer is analyzed, reusable resources are known according to the characteristic information and corresponding configuration is performed, which is significant for the construction of a novel hybrid neural network structure in the future.

Example 2

Referring to fig. 9, based on the data stream reconstruction method described in embodiment 1, the present invention further provides a reconfigurable data stream processor, which is configured to execute the data stream reconstruction method described above, where the reconfigurable data stream processor adopts a hierarchical design and includes a system on chip 1, hardware threads 2, and multiple sets of processing units 3,

the system on chip 1 is configured to control each group of processing units 3 to cooperate with a corresponding hardware thread, adjust the processing units to match with the functional configuration of the target neural network layer, and construct the target neural network layer.

Further, the system on chip 1 comprises an execution controller PCI-e, a direct memory access controller DMA, an execution thread and a buffer. The execution controller coordinates the processing unit 3 and the buffer area according to the network instruction, and is used for extracting the network instruction of the target neural network layer from the external off-chip memory 4, configuring the network instruction into the static memory, and decoding and analyzing the network instruction one by one to drive the execution thread, so that the execution controller plays a role in centralized control, and is beneficial to reducing logic overhead and improving performance.

The direct memory access controller is used for controlling reading and writing between the system-on-chip 1 and the off-chip DRAM memory 4, realizing multiple reading and writing modes between the system-on-chip 1 and the off-chip DRAM memory 4, and being capable of smoothly transmitting network configuration, weight, activation and results. DDR burst mode is largely used to supply data quickly and reduce DRAM access power. Since memory bandwidth can limit computational throughput. Thus, the DMA is configured according to the algorithm attributes, controlling the memory bandwidth to match the corresponding amount of data, e.g., the element size of the data bundle used for PW and DW convolution is equal to the number of bytes per transfer under a particular DRAM protocol. Thus, it is achieved that consecutive burst reads and writes can be performed without further data buffering.

The execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer.

The buffer area comprises a static memory pool formed by a plurality of static memories, wherein the size of each SRAM is 8KB, and different algorithm kernels are configured with different buffer schemes. With the assistance of the execution controller, the SRAM can be instantiated on the fly with various buffer functions, which are determined by the algorithm kernel.

Further, the hardware thread facilitates resource sharing of data flow and weights, and comprises a core state machine and a shift register, wherein the core state machine is used for controlling data input and output, activation function allocation and weight allocation of processing units on the same thread, the shift register is used for constructing input and output of activation functions so as to realize data sharing and reduced power overhead due to single fan-out and reduced load capacitance, and the shift register can be dynamically configured in a cascade connection or parallel connection mode. Since some target neural network layers involve computation of vectors, the output data stream is bi-directional, as opposed to the uni-directional direction of the input data stream, thereby facilitating computation using vectors in, for example, the residual layer kernel. The multiple processing units coordinate through the core state machine FSM of the thread stage to handle the activation and weights in a pipelined manner. The weights are streamed in from a pool of static memory in the system-on-chip 1, wherein the respective processing units may receive different streams of weights.

Further, in order to efficiently compute kernel-dependent functions, the processing unit 3 is designed compactly to implement the required operators. The data input port and the weight input port are used for facilitating matrix calculation and vector calculation. The Sigmoid and tagent modules are designed based on the medium linear approximation technique. The control input receives opcodes from the thread-level FSM, configuring the multiplexers to implement the operators associated with the cores.

The feasibility of the reconfigurable data stream processor provided by the invention is verified through experiments, the architecture of the reconfigurable data stream processor is that a thread is formed by an SRAM (static random access memory) and 16 PE (provider edge) on a 108KB chip, the experiments are realized by adopting a Verilog HDL (hardware description language), and a Modelsim simulation tool is adopted to perform simulation verification on the feasibility and the running time of a design scheme. And network performance analysis was performed at the NVIDIAGTX GPU of MATLAB using a neural network tool library. Three network architectures are demonstrated below to analyze the performance of the proposed architecture.

MobileNet has a hybrid kernel network of standard PW and DW convolutions, pooling and full connectivity, MobileNet employs iteratively compact convolution kernels, accounting for 97.91% of the MAC number calculation. Table 2 shows the execution delay of a proposed design of the layers of MobileNet benchmarked between multi-threaded and single-threaded architectures using an FPGA prototype with 256 PE and DRAM support.

TABLE 2 Performance analysis based on MobileNet architecture

Deep reinforcement learning: a typical use of DQN is maze walking, where an intelligent processing unit learns to go to a destination by choosing the right direction at an intersection and avoiding obstacles. As shown in fig. 10, the reinforcement learning action space has been tested on a

layer

2, 5 and 10 network for 1, 2, 4 and 6 node data, while the state space has been chosen between 128 and 256 nodes. For all tested motion spaces, the on-chip Q iteration time for all three network structures is less than 2 ms. This iteration time increases slightly with the size of the action space and the size of the network.

And (3) sequence classification: the test example uses sensor data obtained from a smartphone worn on the body, the data being trained using the LSTM network to identify the wearer's activity given a time series representing accelerometer readings in three different directions. Referring to table 3, without considering the simulation results under data transfer between disk storage and DRAM, it can be seen that the proposed LSTM network design achieves improved performance compared to both CPU and GPU. However, the MATLAB measurements account for the large latency of data transfers between disk, main memory, and operating system, whereas the present design is currently set as a stand-alone system. However, future LSTM networks tend to be deployed on sensors and fetch data directly from DRAM for direct processing, which is very close to the design principles of the present invention. Compared with the CPU and GPU power consumption, the power consumption of the ASIC hybrid neural network is three orders of magnitude higher, so that the efficiency of the ASIC hybrid neural network is proved to be excellent, and the feasibility of the invention is proved.

TABLE 3 performance benchmarking of LSTM networks in three processing architectures

In summary, the data stream reconstruction method and the reconfigurable data stream processor provided by the invention systematically multiplex hardware to perform dynamic configuration by adjusting the data stream mode, the processing unit and the on-chip stored function mode, achieve the effects of improving the hardware utilization rate, improving the operation speed, reducing the power consumption and the like for the hybrid neural network, and can provide a resource multiplexing basis for the construction of other novel neural network layers and the implementation of the hybrid neural network based on the novel neural network layers for subsequent research.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims

1. A method for reconstructing a data stream, comprising:

acquiring characteristic information of a target neural network layer;

determining a data flow mode corresponding to the target neural network layer, the functional configuration of the processing unit and the functional configuration of the system on chip according to the characteristic information of the target neural network layer;

performing function configuration corresponding to the processing unit of the target neural network layer and the system on chip on the reusable processing unit and the system on chip, and performing network configuration corresponding to the target neural network layer according to a data flow mode of the target neural network layer to construct the target neural network layer;

and obtaining an output result by adopting the constructed target neural network layer.

2. The data stream reconstructing method as claimed in claim 1, wherein when said target neural network layer is a convolutional layer, said processing unit includes a multiply-accumulate operation unit and a modified linear unit grouped and configured in a plurality of threads; the input or output of the data stream is thread-level parallel serial transmission, a static memory of the system on chip is configured to buffer an activation function of an input feature diagram on a thread, a weight and the activation function are shared among a plurality of threads, and serial output of each thread is output in parallel after output buffering.

3. The data stream reconstruction method of claim 1, wherein when the target neural network layer is a pooling layer, the processing unit is configured as a comparator; the input or output of the data streams is a parallel transmission.

4. The data stream reconstructing method as claimed in claim 1, wherein when said target neural network layer is a fully-connected layer, said processing unit includes a multiply-accumulate operation unit and a modified linear unit grouped and configured in a plurality of threads; the input or output of the data stream is serial transmission of thread-level parallel, a static memory of the system on chip is configured as a weight buffer, and the activation function is in serial streaming transmission through a plurality of threads.

5. The data stream reconstructing method as claimed in claim 1, wherein when said target neural network layer is a residual layer, said processing unit is configured as an adder; the input or output of the data stream is parallel transmission, and the input and output shift registers of the system on chip are used for storing operands.

6. The data stream reconstructing method as claimed in claim 1, wherein when the target neural network layer is a long-term and short-term memory layer, the processing units are divided into four groups, each group of processing units is used for instantiating a sigmoid function and a tanh function, and the input or output of the data stream is serial transmission.

7. The data stream reconstructing method as claimed in claim 1, wherein when said target neural network layer is a reinforcement learning layer, said processing unit includes a multiply-accumulate operation unit and a modified linear unit grouped and configured in a plurality of threads; the input or output of the data stream is serial transmission of thread-level parallel, and the cache of the system on chip is used for state activation and iteration operation.

8. A reconfigurable data stream processor for performing the method of data stream reconstruction according to any of claims 1-7, the reconfigurable data stream processor comprising a system on a chip, hardware threads and sets of processing units,

the system on chip is used for controlling each group of processing units to be matched with the corresponding hardware thread, adjusting the processing units to be matched with the functional configuration of the target neural network layer, and constructing the target neural network layer.

9. The reconfigurable data stream processor of claim 8, wherein the system on a chip comprises an execution controller, a direct memory access controller, a thread of execution, and a buffer,

the execution controller is used for extracting network instructions of a target neural network layer from an external off-chip memory, configuring the network instructions into a static memory, and decoding and analyzing the network instructions one by one to drive an execution thread;

the direct memory access controller is used for controlling reading and writing between the system on the chip and the memory under the chip;

the execution thread is used for running under the control of the execution controller to realize the function of a target neural network layer;

the buffer area comprises a static memory pool formed by a plurality of static memories.

10. The reconfigurable data stream processor according to claim 8 or 9, characterized in that the hardware threads comprise a core state machine for controlling data input and output, activation function assignment and weight assignment of processing units on the same thread and a shift register for building the input and output of activation functions.