CN106875012B - An FPGA-based pipelined acceleration system for deep convolutional neural networks - Google Patents

An FPGA-based pipelined acceleration system for deep convolutional neural networks

Info

Publication number
CN106875012B
CN106875012B CN201710072223.2A CN201710072223A CN106875012B
Authority
CN
China
Prior art keywords
module
address
value
feature map
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710072223.2A
Other languages
Chinese (zh)
Other versions
CN106875012A (en)
Inventor
李开
邹复好
章国良
黄浩
杨帆
孙浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Charm Pupil Technology Co Ltd
Original Assignee
Wuhan Charm Pupil Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Charm Pupil Technology Co Ltd filed Critical Wuhan Charm Pupil Technology Co Ltd
Priority to CN201710072223.2A priority Critical patent/CN106875012B/en
Publication of CN106875012A publication Critical patent/CN106875012A/en
Application granted granted Critical
Publication of CN106875012B publication Critical patent/CN106875012B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention proposes an FPGA-based pipelined acceleration system for deep convolutional neural networks. The system is mainly composed of an input data distribution control module, an output data distribution control module, a convolution computation order serialization module, a pooling computation order serialization module, a convolution computation module, a pooling computation module, and a convolution result distribution control module; in addition, the system includes an internal system cascade port. The acceleration system can be realized as an efficient parallel pipeline on an FPGA; it effectively eliminates the resource waste and computation delays caused by the various padding elements in the computation process, significantly reduces system power consumption, and greatly increases processing speed.

Description

An FPGA-based pipelined acceleration system for deep convolutional neural networks
Technical field
The invention belongs to the field of neural network computing, and in particular relates to an FPGA-based pipelined acceleration system for deep convolutional neural networks.
Background art
With the new wave of machine learning driven by deep learning, deep convolutional neural networks have been widely applied to large-scale machine learning problems such as speech recognition, image recognition, and natural language processing, and have achieved a series of breakthrough research results. Their powerful feature learning and classification capabilities have attracted wide attention, and they have important analytical and research value.
Deep convolutional neural network models are characterized by deep hierarchies, complex structure, large data volumes, high parallelism, and intensive computation and storage. In applications, large batches of convolution and pooling operations often become a major computational bottleneck, and the storage of a large number of intermediate results places high demands on the computer storage architecture. This is very unfavorable for application scenarios with strong real-time requirements and limited cost budgets.
The two accelerators most commonly used at present are the CPU and the GPU. The CPU, whose architecture is designed around serial execution, cannot adequately meet the computational performance requirements. The GPU, although it has obvious advantages in computational performance, cannot break through the power consumption barrier any better than the CPU, and both CPU and GPU have serious limitations in scalability. In view of the above factors, the FPGA, with its flexible configuration, high parallelism, flexible design, low power consumption, and cost-effectiveness, has become a very attractive alternative accelerator for deep convolutional neural network models. However, how to combine the characteristics of the FPGA chip and platform to fully exploit the parallelism and pipelining of the deep convolutional neural network computation model, and to make rational and efficient use of the limited on-chip resources of the FPGA, remains an open problem.
Summary of the invention
The present invention provides an FPGA-based pipelined acceleration system for deep convolutional neural networks. Its purpose is to combine the structural features of deep convolutional neural network models with the characteristics and platform advantages of FPGA chips: the computation structure of deep convolutional neural networks as traditionally implemented in software is readjusted and corresponding modules are designed, fully exploiting the latent parallelism within the computation and the pipelining between computation layers, so that the computation better matches the architectural features of the FPGA; together with the corresponding FPGA design, the computing resources of the FPGA are used rationally and efficiently, providing a high-performance pipelined acceleration scheme for the implementation of deep convolutional neural networks.
The present invention provides an FPGA-based pipelined acceleration system for deep convolutional neural networks, characterized in that the system comprises:
an input data distribution control module, an output data distribution control module, a convolution computation order serialization module, a pooling computation order serialization module, a convolution computation module, a pooling computation module, and a convolution result distribution control module; in addition, the system further comprises an internal system cascade port;
the input data distribution control module is connected simultaneously to the FPGA peripheral interface, the internal system cascade port, and the convolution computation order serialization module; the output data distribution control module is connected simultaneously to the FPGA peripheral interface, the internal system cascade port, the convolution result distribution control module, and the pooling computation module; the convolution result distribution control module is connected simultaneously to the convolution computation module, the output data distribution control module, and the pooling computation order serialization module; the convolution computation order serialization module is directly connected to the convolution computation module; the pooling computation order serialization module is directly connected to the pooling computation module;
the input data distribution control module monitors in real time the data consumption of the convolution computation order serialization module, issues the corresponding read commands to the external DDR memory as required, and promptly receives the input data transmitted through the FPGA peripheral interface and the internal system cascade port; in addition, the input data distribution control module forwards the received data to the convolution computation order serialization module;
the output data distribution control module receives the data transmitted by the pooling computation module or the convolution result distribution control module and, according to the current computation stage, forwards the received data to the internal system cascade port or the FPGA peripheral interface, sending the corresponding write commands and interrupt notifications to the external DDR memory; in addition, the output data distribution control module responds in real time to commands sent through the FPGA peripheral interface;
the convolution computation order serialization module, in combination with the relevant adjustment parameters, serializes the structured computation order of the convolution operations in the deep convolutional neural network and promptly delivers the reordered data to the convolution computation module; the pooling computation order serialization module, in combination with the relevant adjustment parameters, serializes the structured computation order of the pooling operations in the deep convolutional neural network and promptly delivers the reordered data to the pooling computation module;
the convolution computation module performs the convolution computations of the deep convolutional neural network and promptly delivers its results to the convolution result distribution control module; the pooling computation module is mainly responsible for performing the pooling operations of the deep convolutional neural network and promptly delivers its results to the output data distribution control module;
the convolution result distribution control module receives the result data transmitted by the convolution computation module and, according to the current computation stage, forwards the received data in an organized and well-specified manner to the pooling computation order serialization module or the output data distribution control module;
the internal system cascade port is mainly responsible for providing a valid interface for cascading between subsystems within the FPGA system-on-chip, or for connections between internal modules; it is used to connect the output data distribution control module with the input data distribution control module.
Further,
The convolution computation order serialization module is composed of a feature map tuple selection submodule and a convolution kernel parameter selection submodule. The feature map tuple selection submodule implements the selection of feature map tuples; the convolution kernel parameter selection submodule implements the selection of convolution kernel parameters.
The pooling computation order serialization module is essentially similar in composition and function to the feature map tuple selection submodule of the convolution computation order serialization module.
Further,
The feature map tuple selection submodule is mainly composed of a feature map tuple memory, a new/old-value selector, a flag parameter memory, an address parameter memory, a calculation window buffer memory, and a feature map tuple counter;
the feature map tuple memory is implemented with a dual-port RAM and stores the feature map tuples delivered by the input data distribution control module; the new/old-value selector maintains two address registers, a new-value address register and an old-value address register, and selects the corresponding feature map tuples from the feature map tuple memory for output to the convolution computation module; the flag parameter memory stores the new/old-value selection flags and the early-window-termination flags of the valid analysis sequence; the address parameter memory stores the old-value selection addresses of the valid analysis sequence; for a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically many times; the calculation window buffer memory is implemented with a dual-port RAM, caches the feature map tuples output by the new/old-value selector, and outputs them to the convolution computation module; the feature map tuple counter counts the number of feature map tuples selected and output by the new/old-value selector;
in each beat, the feature map tuple selection submodule obtains from the input data distribution control module KFP feature values of one feature map tuple, and these KFP feature values form one input feature value group; each time the new/old-value selector selects a feature map tuple for output, it checks the current new/old-value selection flag. If the flag indicates a new value, the selector outputs feature map tuples one feature value group at a time, starting from the address given by the new-value address register; after each feature value group is output, the new-value address register is automatically incremented, and once the currently selected feature map tuple has been fully output, the next new/old-value selection flag is fetched in sequence from the flag parameter memory and becomes the current flag. If the flag indicates an old value, the current old-value selection address is loaded into the old-value address register and feature map tuples are output one feature value group at a time starting from that address; after each feature value group is output, the old-value address register is automatically incremented, and once the currently selected feature map tuple has been fully output, the next new/old-value selection flag is fetched in sequence from the flag parameter memory and becomes the current flag, while the next old-value selection address is fetched in sequence from the address parameter memory and becomes the current old-value selection address;
each time the new/old-value selector finishes outputting a feature map tuple, the feature map tuple counter is automatically incremented. If the feature map tuples selected and output by the new/old-value selector reach the size of a calculation window without padding elements, the selector pauses its output until the feature map tuples of the current calculation window held in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times. If the feature map tuples output so far have not yet reached the size of a calculation window without padding elements, but the current value of the feature map tuple counter equals the current early-window-termination flag value, the new/old-value selector likewise pauses its output early, again until the feature map tuples of the current calculation window held in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times; while the selector is paused early, the next early-window-termination flag is fetched in sequence from the flag parameter memory and becomes the current early-window-termination flag.
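The new/old-value selection scheme above can be sketched in software. This is a minimal illustrative model, not the hardware design: each feature map tuple is treated as one atomic element (the per-beat KFP grouping is elided), the early-termination path is omitted, and all names (`serialize_windows`, `flags`, `old_addrs`) are assumptions of this sketch. New values are read sequentially, old values are re-fetched from stored addresses, and each completed window is replayed for the required reuse count.

```python
def serialize_windows(tuple_mem, flags, old_addrs, window_size, reuse):
    """Emit feature map tuples in serialized calculation-window order.

    tuple_mem:   feature map tuple memory (tuples in arrival order)
    flags:       per-selection 'new'/'old' flags (flag parameter memory)
    old_addrs:   old-value selection addresses (address parameter memory)
    window_size: tuples per calculation window, padding excluded
    reuse:       window reuse count, ((DON-1)//KGP + 1) in the patent
    """
    new_addr = 0             # new-value address register
    old_it = iter(old_addrs)
    window = []              # calculation window buffer memory
    out = []
    for flag in flags:
        if flag == 'new':
            t = tuple_mem[new_addr]
            new_addr += 1    # register auto-increments after each output
        else:
            t = tuple_mem[next(old_it)]  # re-fetch an already stored tuple
        window.append(t)
        if len(window) == window_size:   # window complete: replay it
            for _ in range(reuse):
                out.extend(window)
            window = []
    return out
```

For a one-dimensional example with windows [a,b] and [b,c] (stride 1), tuple `b` is stored once as a new value and re-fetched once as an old value, which is exactly the storage saving the time-shared tuple memory relies on.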
Further,
In the convolution kernel parameter selection submodule, the output of convolution kernel parameter array groups proceeds in synchrony with the output of feature value groups in the feature map tuple selection submodule;
the convolution kernel parameter selection submodule is mainly composed of a first convolution kernel parameter memory, a second convolution kernel parameter memory, a selector, a flag parameter memory, an address parameter memory, and a kernel parameter array group counter;
the first and second convolution kernel parameter memories are both implemented with dual-port RAMs and store the convolution kernel parameters delivered by the input data distribution control module; the flag parameter memory stores the kernel parameter address jump flag parameters, and the address parameter memory stores the jump destination kernel parameter addresses; for a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically many times; the selector maintains an address register and a jump address generator, and selects the corresponding convolution kernel parameter array group from the first or second convolution kernel parameter memory for output to the convolution computation module, the jump address generator fetching jump destination kernel parameter addresses from the address parameter memory to provide the selector with the corresponding jump destinations; the kernel parameter array group counter counts the number of convolution kernel parameter array groups output;
each time the selector selects a convolution kernel parameter array group for output, it compares the current kernel parameter address jump flag value with the current value of the kernel parameter array group counter. If they are equal, the current jump address of the jump address generator is loaded into the address register and, starting from that address, the convolution kernel parameter array group is output one convolution kernel parameter array at a time, the address register being automatically incremented after each array; when the currently selected array group has been fully output, the kernel parameter array group counter is automatically incremented and the jump address generator computes the next jump address as the current jump address. If they are not equal, output starts directly from the address given by the address register, one convolution kernel parameter array at a time, the address register being automatically incremented after each array; when the currently selected array group has been fully output, the kernel parameter array group counter is automatically incremented. While the selector outputs convolution kernel parameter array groups, the first and second convolution kernel parameter memories take turns providing the selector with parameter array groups, the switch occurring at the end of the current computation layer; correspondingly, the convolution kernel parameters delivered by the input data distribution control module are also written, layer by layer, alternately into the first and second convolution kernel parameter memories.
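The alternation between the two kernel parameter memories is a classic ping-pong (double-buffering) scheme: while one memory feeds the current layer's computation, the next layer's parameters are loaded into the other, hiding the load latency. A minimal sketch, with illustrative names only (`run_layers`, `mems`, `active` are not from the patent):

```python
def run_layers(layer_params):
    """Ping-pong two parameter memories across computation layers.

    layer_params: per-layer lists of kernel parameters, in layer order.
    Returns the parameter stream actually consumed per layer.
    """
    mems = [None, None]   # first and second convolution kernel parameter memories
    active = 0
    outputs = []
    if layer_params:
        mems[active] = layer_params[0]              # preload the first layer
    for i, _ in enumerate(layer_params):
        if i + 1 < len(layer_params):
            mems[1 - active] = layer_params[i + 1]  # background-load next layer
        outputs.append(list(mems[active]))          # compute with current layer
        active = 1 - active                         # switch at layer boundary
    return outputs
```

This matches embodiment characteristic (5) below: on-chip storage only ever needs to hold the parameters of two consecutive convolutional layers.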
Further,
The feature map tuple memories in the convolution computation order serialization module and in the pooling computation order serialization module are reused in a time-shared, cyclic manner within their respective computation layers; the memory does not allocate a separate storage location for every feature map tuple delivered from the layer above, and its capacity is determined by the maximum address distance, within the computation domain it serves, between the storing of a feature map tuple's new value and the last re-fetch of its old value;
before being transferred to the external DDR memory, the old-value selection address parameters must accordingly be reduced by the host with a modulo operation, the modulus being the capacity of the feature map tuple memory of the computation domain concerned.
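The host-side address reduction is a plain modulo wrap into the circular buffer. A trivial sketch (the function name is illustrative):

```python
def wrap_addresses(old_addrs, mem_capacity):
    """Reduce absolute old-value addresses into the time-shared
    feature map tuple memory of the given capacity."""
    return [a % mem_capacity for a in old_addrs]
```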
Further,
The convolution computation module is composed of multiple convolution kernel computation units operating side by side; each convolution kernel computation unit is mainly composed of a multiply-accumulate tree, an adder tree, a bias unit, and an activation unit; the multiply-accumulate tree is an interconnection of several multipliers and adders, and the adder tree is an interconnection of several adders;
the multiply-accumulate tree and the adder tree together perform the multiply-accumulate operations of the convolution kernel computation unit; the bias unit performs its bias addition, and the activation unit performs its activation operation.
In each valid beat, the convolution kernel computation unit simultaneously obtains KFP feature values from the feature map tuple selection submodule and KFP convolution kernel parameters from the convolution kernel parameter selection submodule; the multiply-accumulate tree multiplies and accumulates the KFP feature values with the KFP convolution kernel parameters and feeds the partial sums, in order, into the adder tree for a second, consolidated accumulation. Once the operands at the first level of the adder tree are all ready, or the last group of feature values of the current calculation window is ready, the adder tree starts computing and completes the secondary accumulation. When all accumulations of the current calculation window are complete, the adder tree feeds the final accumulation result into an adder for the bias addition; after the bias addition, the sum is fed into the activation unit. The activated value is the final result of the convolution kernel computation unit and is delivered to the convolution result distribution control module.
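The arithmetic of one convolution kernel computation unit can be sketched as follows. This is a behavioral model under stated assumptions: each beat's KFP-wide multiply-accumulate is a `sum` of products, the tree structure and pipelining are flattened into a loop, and ReLU is used because it is the activation named in the embodiment; the function name and arguments are illustrative.

```python
def conv_kernel_unit(window_groups, param_groups, bias):
    """One convolution kernel computation unit over one calculation window.

    window_groups: per-beat groups of KFP feature values
    param_groups:  matching per-beat groups of KFP kernel parameters
    """
    acc = 0
    for feats, params in zip(window_groups, param_groups):
        # multiply-accumulate tree: KFP products summed in one beat,
        # partial sums consolidated across beats (the adder tree's role)
        acc += sum(f * p for f, p in zip(feats, params))
    pre_act = acc + bias        # bias unit: bias addition
    return max(pre_act, 0)      # activation unit: ReLU
```

In the hardware, several such units run concurrently, so one calculation window is convolved with multiple kernels in the same beats.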
Further, the pooling computation module is mainly composed of a distributor, a max-pooling unit, an average-pooling unit, and a selector;
in each valid beat, the pooling computation module simultaneously obtains PFP feature values from the pooling computation order serialization module and feeds this input feature value group into the distributor; according to the pooling mode of the current computation layer, the distributor dispatches the input feature value groups to the max-pooling unit or the average-pooling unit. The max-pooling unit takes, for each feature map, the maximum feature element of the current calculation window; the average-pooling unit takes the average of all feature elements of the current calculation window in each feature map. When a pooling operation completes, the selector, again according to the pooling mode of the current layer, passes the result of the max-pooling unit or the average-pooling unit to the output data distribution control module.
Further:
The acceleration system formed by extending the FPGA system is obtained by cascading multiple FPGA systems of identical structure.
Compared with existing acceleration systems for deep convolutional neural networks, the important innovations of the pipelined acceleration system provided by the invention are as follows:
(1) The complex computation of the deep convolutional neural network is divided into modules, with a pipelined design between the modules, giving fast processing; each module is realized on an FPGA, with high integration and parallelism, stable performance, low power consumption, and low cost.
(2) The convolution computation order serialization module and the pooling computation order serialization module are added to readjust the computation structure of the deep convolutional neural network. This not only breaks the fixed constraint of the calculation window structure in traditional convolutional neural networks, so that the data of each computation layer that participates in computation first arrives first, but also fully exploits the computational parallelism within the network and the pipelining between computation layers, effectively reducing the storage of intermediate results; moreover, the various padding elements present in the computation are filtered out automatically, avoiding investment in invalid computation and saving on-chip FPGA resources.
(3) Both the convolution computation module and the pooling computation module adopt a parallel design, obtaining multiple feature elements in each valid beat for parallel processing. The convolution computation module is further composed of several convolution kernel computation units, which simultaneously and concurrently complete the convolution of one calculation window with multiple convolution kernels, greatly increasing processing speed. In addition, the adder tree in the convolution computation module performs a secondary, consolidated accumulation, effectively relieving the convolution part, a major computational bottleneck of deep convolutional neural networks.
(4) The acceleration system is flexible: all model configuration parameters on the FPGA are delivered by the host, so the system has a degree of generality across different deep convolutional neural network models; the parallelism of the convolution and pooling computation modules can be set flexibly according to the available on-chip FPGA resources and the concrete model, to make fuller use of those resources; and the pooling computation module implements multiple pooling modes to suit a variety of deep convolutional neural network models.
(5) The acceleration system is scalable: when realized on a mid- or high-end FPGA, the system can be extended through the internal system cascade port, and the extended FPGA system-on-chip can multiply its computational parallelism, making full use of the abundant on-chip logic resources while using the on-chip storage resources of the FPGA more rationally and efficiently.
Description of the drawings
Fig. 1 is a schematic diagram of the interaction between the FPGA system-on-chip realized by the invention and the host;
Fig. 2 is a structural block diagram of the computation structure adjustment parameters of the deep convolutional neural network proposed by the invention;
Fig. 3 is a data processing flow chart of the computation structure adjustment algorithm for deep convolutional neural networks proposed by the invention;
Fig. 4 is a schematic diagram of the overall module composition of the FPGA-based pipelined acceleration system for deep convolutional neural networks realized by the invention;
Fig. 5 is a data processing schematic of the convolution computation module in the FPGA system-on-chip realized by the invention;
Fig. 6 is a data processing schematic of the pooling computation module in the FPGA system-on-chip realized by the invention;
Fig. 7 is a schematic workflow diagram of the feature map tuple selection submodule of the convolution computation order serialization module in the FPGA system-on-chip realized by the invention;
Fig. 8 is a schematic workflow diagram of the convolution kernel parameter selection submodule of the convolution computation order serialization module in the FPGA system-on-chip realized by the invention;
Fig. 9 is a schematic diagram of the composition of the pooling computation order serialization module in the FPGA system-on-chip realized by the invention;
Fig. 10 is a schematic workflow diagram of the convolution computation module in the FPGA system-on-chip realized by the invention;
Fig. 11 is a schematic diagram of the implementation principle of the convolution kernel computation unit in the FPGA system-on-chip realized by the invention;
Fig. 12 is a schematic workflow diagram of the pooling computation module in the FPGA system-on-chip realized by the invention;
Fig. 13 is a schematic diagram of the implementation principle of the max-pooling unit in the FPGA system-on-chip realized by the invention;
Fig. 14 is a schematic diagram of the implementation principle of the average-pooling unit in the FPGA system-on-chip realized by the invention.
Specific embodiment
The present invention is described in more detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The deep convolutional neural network model used as the specific embodiment has the following characteristics:
(1) Within every computation layer (computation layers include the initial input image layer, the convolutional layers, the pooling layers, and the fully connected layers), every individual feature map has equal length and width, and every calculation window has equal length and width.
(2) connection type of each computation layer is successively are as follows: starting input picture layer, convolutional layer 1, pond layer 1, convolutional layer 2, pond Change layer 2, convolutional layer 3, pond layer 3, full articulamentum 1 and full articulamentum 2.
(3) there is only two ways for pondization operation: being maximized Chi Huahe and is averaged pond;Activation operation uses Relu Active mode.
(4) tomographic image size, image completion size, calculation window size, calculation window moving step length and pond are respectively calculated Mode information is as shown in the table:
(5) FPGA on piece storage resource can store two convolutional layers of arbitrary continuation and use all convolution nuclear parameters, but not The convolution nuclear parameter of all convolutional layers can be accommodated simultaneously.
As shown in Fig. 1, the overall processing flow of the deep convolutional neural network, from the generation of the model parameters to the return of the final calculation results, is as follows:
A1. The upper-layer host obtains, through a suitable training method, all convolution kernel parameters of the deep convolutional neural network model; these kernel parameters will later participate in computation as part of the input data to the convolution operations implemented in the FPGA system-on-chip.
A2. The upper-layer host calls the computation structure adjustment algorithm for deep convolutional neural networks proposed by the present invention to generate all required adjusting parameters, as shown at ① and ② in Fig. 1. Here ① indicates that the model parameters of the given deep convolutional neural network model are fed into the adjustment algorithm as input data; these model parameters specifically include: the number of computation layers of the deep convolutional neural network, the width of an individual feature map of each computation layer (the initial input image layer is likewise regarded as consisting of multiple feature maps), the width of each computation layer's calculation window, the stride of each computation layer's calculation window, the feature map padding size of each computation layer, the feature map tuple size of each computation layer (the ordered collection of all feature values at one and the same two-dimensional position across all feature maps participating in a layer's computation is called the feature map tuple at that position, and the number of feature values a tuple contains is called its size), and the pooling mode of each pooling layer. ② indicates that all related adjusting parameters are generated by the adjustment algorithm.
A3. The upper-layer host transfers the generated adjusting parameters over the PCIe bus into the on-board off-chip DDR memory and, once the transfer has finished, sends a read-adjusting-parameters command to the FPGA system-on-chip, as shown at ③ in Fig. 1. After receiving the command, the FPGA system-on-chip starts a DMA read operation, fetches the adjusting parameters from the off-chip DDR memory over the PCIe bus and stores them in the corresponding FPGA on-chip memories.
A4. The trained convolution kernel parameters are transferred over the PCIe bus into the on-board off-chip DDR memory and, once the transfer has finished, a read-kernel-parameters command is sent to the FPGA system-on-chip, as shown at ④ in Fig. 1. Because the on-chip storage resources cannot hold all convolution kernel parameters at once, the FPGA system-on-chip, after receiving the read-kernel-parameters command, starts a DMA read operation that prefetches the kernel parameters of the first two convolutional layers from the off-chip DDR memory over the PCIe bus and stores them in the on-chip convolution kernel parameter memory; the kernel parameters of the remaining convolutional layers are loaded in batches, just in time, during computation.
A5. Using the original-input-image position rearrangement parameters among the generated adjusting parameters, the upper-layer host rearranges the pixel positions of every input image, as shown at ⑤ in Fig. 1, and transfers the rearranged images over the PCIe bus into the on-board off-chip DDR memory; once the transfer has finished, it sends a computation start command to the FPGA system-on-chip, as shown at ⑥ in Fig. 1.
A6. After receiving the computation start command, the FPGA system-on-chip starts a DMA read operation that fetches the rearranged image data from the off-chip DDR memory over the PCIe bus and begins computing. During computation, the FPGA system-on-chip must repeatedly and punctually continue to fetch the kernel parameters of the remaining convolutional layers from the off-chip DDR memory, and completes the related computation under the joint participation of the adjusting parameters and the kernel parameters. Once the final results have been produced, it starts a DMA write operation that returns the results to the off-chip DDR memory and sends a computation-complete interrupt notification to the upper-layer host, as shown at ⑦ in Fig. 1.
A7. After receiving the computation-complete interrupt notification sent by the FPGA system-on-chip, the upper-layer host reads the results from the designated location in the off-chip DDR memory and then performs whatever subsequent operations are required, as shown at ⑧ in Fig. 1.
As shown in Fig. 2, the adjusting parameters fall into two classes: computation order serialization parameters and padding filter parameters. The computation order serialization parameters are further subdivided into the original-input-image position rearrangement parameters, the new/old value selection flag parameters and the old-value selection address parameters. In the convolutional layers of the deep convolutional neural network, the padding filter parameters are further subdivided into the kernel-parameter address jump flag parameters, the jump-destination kernel-parameter address parameters and the early window termination flag parameters; in the pooling layers, the padding filter parameters consist solely of the early window termination flag parameters.
The computation order serialization parameters break the fixed calculation window structure of traditional convolutional neural networks, so that the data each computation layer needs first arrive first. This fully exploits the computational parallelism within the deep convolutional neural network and the pipelining between its layers, greatly reduces the storage of intermediate results, and makes the network far better suited to an efficient, highly parallel pipelined realization on the FPGA. Specifically, the original-input-image position rearrangement parameters are used on the upper-layer host to rearrange the pixel positions of the input images and obtain the rearranged images. The new/old value selection flag parameters supply new/old data selection flags for the computation order serialization of their layer: each flag value specifies whether the next datum participating in computation is a new value obtained in sequence from the feature maps of the layer above (the initial input image layer is likewise regarded as consisting of multiple feature maps) or an old value selected from among the new values already obtained. Whenever a flag specifies that an old value is to be selected from the already obtained new values, the old-value selection address parameters supply the address at which that old value is found.
The padding filter parameters address the invalid computations caused by feature map padding that may exist in the convolutional layers and by calculation windows crossing the boundary in the pooling layers: during the FPGA-based realization they automatically filter out padding elements and avoid investing any work in invalid computation, thereby effectively eliminating the resource waste and the sluggishness of effective computation that all kinds of padding cause in deep convolutional neural networks. The kernel-parameter address jump flag parameters indicate, in a convolutional layer, whether padding elements follow the currently computed position; when they do, a jump filter operation must be executed, and the jump-destination kernel-parameter address parameters supply the destination address of the convolution kernel parameter for that jump. When an original calculation window contains padding elements, the jump filter operation makes the number of elements actually entering the computation smaller than the original window size; for this case the early window termination flag parameters supply the window with an early termination flag.
The computation structure adjustment algorithm for deep convolutional neural networks analyzes the neuron structure characteristics of each convolutional layer and pooling layer and derives, backwards from the desired element ordering of an individual feature map of a later layer, the element ordering of the corresponding individual feature map participating in computation in the preceding layer; an ordering is represented as a sequence of one-dimensional position numbers. The algorithm uses a queue (denoted Q) as its key data structure to traverse the layers, taking the first fully connected layer as its starting point and the initial input image layer as its end point, and generates the adjusting parameters related to each layer during the traversal; in the subsequent computation, all feature maps within one layer share the one set of adjusting parameters generated for that layer.
The algorithm takes the element ordering of an individual feature map input to the first fully connected layer as the initial ordering and stores the one-dimensional position number sequence representing it into the queue in order. At each step the algorithm takes the position number at the head of the queue and expands it: according to the neuron structure of the layer it belongs to, it finds the calculation window position in the upper-layer feature map that corresponds to the element at that position number, and analyzes in turn the position each element of that calculation window occupies in its own individual feature map. Each such analysis action within a layer is assigned a unique analysis number. When the element being analyzed lies at a padding position of its individual feature map, its analysis number is called an invalid analysis number; otherwise it is called a valid analysis number.
Thus every invalid analysis number corresponds to an element at a padding position of an individual upper-layer feature map, and every valid analysis number corresponds to an element at a non-padding position that actually participates in computation.
Every valid analysis number owns a corresponding new/old value selection flag, which takes one of two values: select-new-value and select-old-value. Every valid analysis number whose flag is select-old-value additionally owns a corresponding old-value selection address, and the last valid analysis number of every calculation window containing padding elements additionally owns a corresponding early window termination flag. The ordered collection of all new/old value selection flags of a layer is that layer's sought new/old value selection flag parameters; the ordered collection of all old-value selection addresses of a layer is its sought old-value selection address parameters; and the ordered collection of all early window termination flags of a layer is its sought early window termination flag parameters.
If the layer is a convolutional layer of the deep convolutional neural network, then for each run of consecutive invalid analysis numbers (or each isolated invalid analysis number) in the layer, the valid analysis number immediately preceding the run must additionally generate a kernel-parameter address jump flag together with a jump-destination kernel-parameter address; the jump-destination kernel-parameter address is the position number, within its calculation window, of the element corresponding to the valid analysis number immediately following the run. The ordered collection of all kernel-parameter address jump flags of a layer is that layer's sought kernel-parameter address jump flag parameters; the ordered collection of all jump-destination kernel-parameter addresses of a layer is its sought jump-destination kernel-parameter address parameters.
Because different calculation windows of the upper layer may intersect, different analysis numbers may correspond to the element at one and the same position of an individual upper-layer feature map.
When the element at the position corresponding to a valid analysis number is being analyzed for the first time within its individual feature map, the new/old value selection flag of that valid analysis number is set to select-new-value, and the one-dimensional position number of that element within its individual feature map is appended to the tail of the queue. The ordered collection of the one-dimensional position numbers of all first-analyzed elements of the upper layer is exactly the desired element ordering of an individual upper-layer feature map. From the desired element ordering so obtained, the method above yields, in turn, the desired element ordering of the layer above it, and so on until the desired element ordering of the initial input image layer has been obtained; that ordering is precisely the sought original-input-image position rearrangement parameters.
When the element at the position corresponding to a valid analysis number is not being analyzed for the first time within its individual feature map, the new/old value selection flag of that valid analysis number is set to select-old-value, and the position at which that element's one-dimensional position number occurs in the desired element ordering of the whole feature map is found; this position is exactly the old-value selection address that the valid analysis number additionally owns.
As shown in Fig. 3, the data processing flow of the algorithm is as follows:
A1. Take the element ordering of an individual feature map input to the first fully connected layer as the initial ordering, and store the one-dimensional position number sequence representing it into the queue Q in order. In this embodiment the size of an individual feature map input to the first fully connected layer corresponds to the two-dimensional feature map size 4*4 generated by the preceding pooling layer 3; since the fully connected layer has only one calculation window, the element ordering of an individual input feature map is 1 to 16, and 1 to 16 are therefore stored into Q in order.
A2. Judge whether the queue Q is empty. If it is, the algorithm terminates; otherwise go to A3.
A3. Take the position number at the head of the queue Q and expand it: according to the neuron structure of the layer it belongs to, find the calculation window position in the upper-layer feature map that corresponds to the element at that position number, and analyze in turn the position each element of that calculation window occupies in its own individual feature map. For example, the head number 1 taken out first corresponds, in the feature maps generated by convolutional layer 3, to calculation window No. 1 of size 3*3 and stride 1; the elements of window No. 1 will therefore be analyzed in turn, specifically the elements with one-dimensional position numbers 1, 2, 3, 9, 10, 11, 17, 18, 19 in an individual feature map generated by convolutional layer 3.
A4. Judge whether the current window has been fully analyzed. If not, go to A5; otherwise go to A10.
A5. Analyze the next element of the current window and judge whether it lies at a padding position of its feature map. If not, go to A6; otherwise go to A9.
A6. Assign this analysis action a valid analysis number unique within the layer (valid analysis numbers are assigned incrementally starting from 1), and judge whether the element at the position corresponding to this valid analysis number is being analyzed for the first time within its individual feature map. If so, go to A7; otherwise go to A8.
A7. Set the new/old value selection flag of the current valid analysis number to 1 (flag value 1 means select a new value; flag value 0 means select an old value). Then judge whether the element at the corresponding position lies in the initial input image layer: if so, add its one-dimensional position number to the original-input-image position rearrangement parameters; otherwise append its one-dimensional position number to the tail of the queue Q. Go to A4.
A8. Set the new/old value selection flag of the current valid analysis number to 0, and go to A4.
A9. Assign this analysis action an invalid analysis number unique within the layer (invalid analysis numbers are assigned incrementally starting from 1), and judge whether this invalid analysis number is the head of a run of consecutive invalid analysis numbers. If so, add the valid analysis number immediately preceding the run to the kernel-parameter address jump flag parameters, add the valid analysis number immediately following the end of the run to the jump-destination kernel-parameter address parameters, and go to A4; otherwise go directly to A4.
A10. Judge whether the fully analyzed calculation window contains any element at a padding position. If so, add the last valid analysis number of that window to the early window termination flag parameters and go to A2; otherwise go directly to A2.
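The flow A1 to A10 can be condensed, for a single layer with one-dimensional windows and no padding, into the following loose Python sketch. It is a hypothetical model, not the patented algorithm itself: the helper `adjust_layer` and its `window_of` callback are invented here, and the kernel-jump and early-termination bookkeeping of steps A9 and A10 is omitted. It shows only how the new/old selection flags, old-value addresses and the desired input ordering arise from first-time versus repeated analysis:

```python
def adjust_layer(desired_order, window_of):
    """For one layer: desired_order is the wanted 1-D element order of this
    layer's output; window_of(pos) returns the list of upper-layer input
    positions read by output position pos. Returns the new/old flags
    (1 = new value, 0 = old value), the old-value addresses, and the
    desired element ordering of the upper layer."""
    flags, old_addrs, prev_order = [], [], []
    first_seen = {}                      # input position -> index in prev_order
    for pos in desired_order:
        for p in window_of(pos):
            if p not in first_seen:      # analyzed for the first time (A7)
                first_seen[p] = len(prev_order)
                prev_order.append(p)
                flags.append(1)
            else:                        # already analyzed (A8)
                flags.append(0)
                old_addrs.append(first_seen[p])
    return flags, old_addrs, prev_order

# 1-D layer, window size 2, stride 1: output pos i reads inputs [i, i+1]
flags, old_addrs, prev = adjust_layer([0, 1, 2], lambda i: [i, i + 1])
```

Applied backwards layer by layer, the `prev_order` of one call would become the `desired_order` of the next, terminating at the initial input image layer exactly as the traversal in the text describes.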
As shown in Fig. 4, the FPGA system-on-chip realized according to the present invention consists mainly of seven large modules: the input data distribution control module, the output data distribution control module, the convolution computation order serialization module, the pooling computation order serialization module, the convolution computation module, the pooling computation module and the convolution result distribution control module; in addition, the FPGA system-on-chip contains an internal system cascade port.
The input data distribution control module is connected simultaneously to the FPGA peripheral interface, the internal system cascade port and the convolution computation order serialization module. The output data distribution control module is connected simultaneously to the FPGA peripheral interface, the internal system cascade port, the convolution result distribution control module and the pooling computation module. The convolution result distribution control module is connected simultaneously to the convolution computation module, the output data distribution control module and the pooling computation order serialization module. The convolution computation order serialization module is directly connected to the convolution computation module, and the pooling computation order serialization module is directly connected to the pooling computation module.
The input data distribution control module is mainly responsible for monitoring in real time the data consumption of the convolution computation order serialization module, for sending read-data commands to the off-chip DDR memory promptly and appropriately, and for receiving in time the input data transmitted by the FPGA peripheral interface and the internal system cascade port; beyond this, it must forward the received data to the convolution computation order serialization module in a standardized, organized manner.
The output data distribution control module is mainly responsible for receiving in time the data transmitted by the pooling computation module or the convolution result distribution control module, for passing the received data on, in a standardized and organized manner according to the current calculation stage, to the internal system cascade port or the FPGA peripheral interface, and for sending write-data commands and related interrupt notifications to the off-chip DDR memory promptly and appropriately. Beyond this, it is also responsible for responding in real time to the various related commands transmitted by the FPGA peripheral interface.
The convolution computation order serialization module is mainly responsible for serializing, with the help of the related adjusting parameters, the structured computation order of the convolution operations of the deep convolutional neural network, and for delivering the serialized data sets to the convolution computation module in time; the pooling computation order serialization module is mainly responsible for serializing, with the help of the related adjusting parameters, the structured computation order of the pooling operations, and for delivering the serialized data sets to the pooling computation module in time.
The convolution computation module is mainly responsible for completing the convolution computations of the deep convolutional neural network and for sending the results to the convolution result distribution control module in time; the pooling computation module is mainly responsible for completing the pooling operations of the deep convolutional neural network and for sending the results to the output data distribution control module in time.
The convolution result distribution control module is mainly responsible for receiving in time the result data transmitted by the convolution computation module and for passing the received data on, in a standardized and organized manner according to the current calculation stage, to the pooling computation order serialization module or the output data distribution control module.
The internal system cascade port is mainly responsible for providing a valid interface for cascading FPGA system-on-chip subsystems or for connections between internal modules; it connects the output data distribution control module to the input data distribution control module.
During the computation of each layer in the FPGA system-on-chip, the ordered collection of all feature values at one and the same two-dimensional position across all feature maps participating in that layer's computation is called the feature map tuple at that position, and the number of feature values the tuple contains is called its size. Feature map tuples participate in computation one after another as indivisible wholes, and the original input image layer is likewise processed as though it consisted of feature maps. The movement of the two-dimensional computation point is determined jointly by the data emission order of the preceding computation layer (or of the initial input image layer) and by the convolution computation order serialization module or the pooling computation order serialization module. Each computation layer also generates all of its feature maps tuple by tuple: generation of the next feature map tuple begins only after the previous one has been produced. The size of the input feature map tuple is denoted DIN, and the size of the generated feature map tuple is denoted DON.
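The feature map tuple defined above is simply a cross-section through all of a layer's feature maps at one position; a minimal sketch (hypothetical helper name, plain nested lists in place of the hardware memories):

```python
def feature_map_tuple(maps, row, col):
    """The feature-map tuple at (row, col): the ordered collection of the
    values that every feature map of the layer holds at that position."""
    return [fm[row][col] for fm in maps]

# Two 2x2 feature maps; the tuple at position (0, 1) has size DIN = 2.
maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
t = feature_map_tuple(maps, 0, 1)
```

Because the tuple moves through the pipeline as a whole, one tuple carries everything the convolution or pooling module needs for one two-dimensional position of every map at once.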
The upper-layer host rearranges the pixel positions of the input images according to the original-input-image position rearrangement parameters supplied among the adjusting parameters; both during rearrangement and during the subsequent transfer of the rearranged images, each three-dimensional pixel component is operated on as a whole. The rearranged images are transmitted to the off-chip DDR memory in two-dimensional image order, left to right and top to bottom. The convolution kernel parameters on the upper-layer host are reorganized according to the computation order established by the convolution computation module before being transferred to the off-chip DDR memory.
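A minimal sketch of the host-side rearrangement step, under the assumption (made here for illustration; the patent does not fix the encoding) that the rearrangement parameter lists, for each output position, the flat source position of the pixel to place there; each pixel, a full multi-channel component, moves as a whole:

```python
def rearrange_image(image, order):
    """Apply the original-input-image position rearrangement parameter:
    order[k] is assumed to give the flat source position of the k-th
    output pixel. Pixels (whole channel tuples) are moved intact."""
    h, w = len(image), len(image[0])
    flat = [image[i][j] for i in range(h) for j in range(w)]
    return [flat[p] for p in order]

# 2x2 single-channel image; a reversed order stands in for a real
# rearrangement parameter produced by the adjustment algorithm.
img = [[(1,), (2,)], [(3,), (4,)]]
out = rearrange_image(img, [3, 2, 1, 0])
```

The real parameter is the desired element ordering of the initial input image layer computed by the backward traversal, not an arbitrary permutation as in this toy example.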
The input data distribution control module, the output data distribution control module and the convolution result distribution control module all preserve the order in which they receive data when retransmitting it; they forward data to the connected modules only after the received data form a data unit of a certain size.
On every pass the convolution computation module processes multiple feature maps in parallel and convolves each feature map with multiple convolution kernels in parallel, so on every pass it can generate multiple new feature maps in parallel; likewise, on every pass the pooling computation module processes multiple feature maps in parallel. The maximum number of feature maps the convolution computation module processes simultaneously is called the convolutional layer feature map parallelism, denoted KFP; the maximum number of feature maps the convolution computation module generates simultaneously is called the convolution kernel group parallelism, denoted KGP; the maximum number of feature maps the pooling computation module processes simultaneously is called the pooling layer feature map parallelism, denoted PFP.
The data processing principle of the convolution computation module is shown in Fig. 5, where if1 to ifn represent the n input feature maps generated by the upper layer and of1 to ofn represent the n feature maps generated by this layer; the symbols connecting the input feature maps with the convolution kernel parameter arrays denote multiplication operations, and the symbols connecting those multiplications with the generated feature map elements denote addition operations. In the fully connected layers of the deep convolutional neural network, each input feature map and each generated feature map in the figure contain only a single feature map element, and the calculation window size equals the size of the whole input feature map.
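The multiply-then-sum data path of Fig. 5 can be sketched for one step as follows; this is a hypothetical scalar model (one kernel weight per input map per output map, names invented here), not the parallel FPGA datapath with its KFP and KGP lanes:

```python
def conv_tuple_step(in_tuple, kernels, acc):
    """One multiply-accumulate step of the Fig. 5 data path: every value
    of the input feature-map tuple is multiplied by the matching kernel
    weight of each output map, and the products are added into that
    output map's accumulator."""
    for g, weights in enumerate(kernels):   # one weight vector per output map
        acc[g] += sum(v * w for v, w in zip(in_tuple, weights))
    return acc

# Input tuple of size 2 (two input maps), two output maps (two kernels):
acc = conv_tuple_step([1, 2], [[3, 4], [5, 6]], [0, 0])
```

Repeating the step over every position of a calculation window, with fresh weights each step, accumulates one output element per output feature map.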
The data processing principle of the pooling computation module is shown in Fig. 6, where if1 to ifn represent the n input feature maps generated by the upper layer and of1 to ofn represent the n feature maps generated by this layer; the symbols connecting the calculation windows of the input feature maps with the generated feature map elements denote pooling operations.
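Since the embodiment supports exactly two pooling modes, a window-level sketch is short (helper name invented; the hardware units of Figs. 13 and 14 realize the same two reductions in logic):

```python
def pool_window(window, mode):
    """Pool one calculation window in one of the two supported modes:
    max pooling or average pooling."""
    if mode == "max":
        return max(window)
    return sum(window) / len(window)   # "avg"

m = pool_window([1, 5, 3, 2], "max")
a = pool_window([1, 5, 3, 2], "avg")
```

When a window carries an early termination flag, only the elements actually delivered before the flag would enter `window`, so no padding value ever influences the maximum or the average.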
The convolution computation order serialization module consists of a feature map tuple selection function submodule and a convolution kernel parameter selection function submodule; the former realizes the feature map tuple selection function and the latter realizes the convolution kernel parameter selection function.
Each selection of a feature map tuple corresponds one-to-one to a valid analysis number.
1. The feature map tuple selection function submodule
As shown in Fig. 7, the feature map tuple selection function submodule consists mainly of a feature map tuple memory, a new/old selector, a flag parameter memory, an address parameter memory, a calculation window buffer memory and a feature map tuple counter.
The feature map tuple memory is realized with a dual-port RAM and stores the feature map tuples fed in by the input data distribution control module. The new/old selector maintains two address registers, the new-value address register and the old-value address register, and selects the appropriate feature map tuple from the feature map tuple memory for output toward the convolution computation module. The flag parameter memory stores the new/old value selection flags and the early window termination flags of the valid analysis numbers; the address parameter memory stores the old-value selection addresses of the valid analysis numbers; for a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and read cyclically many times. The calculation window buffer memory is realized with a dual-port RAM; it buffers the feature map tuples output by the new/old selector and outputs them to the convolution computation module. The feature map tuple counter counts the feature map tuples the new/old selector selects and outputs.
In every beat the feature pixel group selection submodule obtains one feature-map pixel group of KFP feature values from the input data distribution control module; these KFP feature values form one input feature value group. Each time the new/old selector outputs a feature-map pixel group, it checks the current new/old-value select flag. If the flag selects the new value, the selector outputs feature-map pixel groups, one feature value group per step, starting from the start address held in the new-value address register; after each feature group is output the new-value address register increments automatically, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory to become the current flag. If the flag selects the old value, the current old-value select address is loaded into the old-value address register and output proceeds from that start address, again one feature value group per step; after each feature group the old-value address register increments automatically, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory to become the current flag, and the next old-value select address is fetched in order from the address parameter memory to become the current old-value select address. After the new/old selector has output a feature-map pixel group, the feature-map pixel group counter increments automatically. If the feature-map pixel groups output by the new/old selector now fill one calculation window of non-padding elements, the selector pauses its output until the feature-map pixel groups of the current calculation window held in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times. If the output groups have not yet filled a calculation window of non-padding elements but the current feature-map pixel group counter value equals the current window-calculation early-termination flag value, the new/old selector likewise pauses its output early, again until the groups of the current calculation window in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times; while the selector is paused early, the next window-calculation early-termination flag is fetched in order from the flag parameter memory to become the current one.
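As an illustrative software model (not the hardware itself; all names are invented here), the replay behavior of the new/old selector can be sketched as follows: "new" flags consume pixel groups in arrival order, while "old" flags re-read a group already resident in the feature-map pixel group memory, so overlapping sliding windows never refetch data from the DDR off-chip memory.

```python
def select_pixel_groups(memory, flags, old_addrs):
    """Model of the new/old selector: 'new' flags stream pixel groups in
    arrival order via the new-value address register; 'old' flags replay
    a previously stored group via a precomputed old-value address."""
    new_addr = 0                  # new-value address register
    old_iter = iter(old_addrs)    # old-value select addresses, in order
    output = []
    for flag in flags:
        if flag == "new":
            output.append(memory[new_addr])
            new_addr += 1         # auto-increment after each group
        else:                     # "old": reuse a resident group
            output.append(memory[next(old_iter)])
    return output

# A sliding window with stride 1 revisits overlapping pixels, so the
# fourth selection replays group g1 instead of refetching it:
mem = ["g0", "g1", "g2", "g3"]
flags = ["new", "new", "new", "old", "new"]
print(select_pixel_groups(mem, flags, old_addrs=[1]))
# -> ['g0', 'g1', 'g2', 'g1', 'g3']
```

The flag and address streams correspond to the contents of the flag parameter memory and address parameter memory, which the host precomputes once per model.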
2. Kernel parameter selection submodule
In the kernel parameter selection submodule, the output of kernel parameter arrays proceeds in synchrony with the output of feature value groups in the feature pixel group selection submodule.
As shown in Fig. 8, the kernel parameter selection submodule mainly consists of kernel parameter memory (a), kernel parameter memory (b), a selector, a flag parameter memory, an address parameter memory, and a kernel parameter array group counter.
Kernel parameter memory (a) and kernel parameter memory (b) are implemented with dual-port RAMs and store the kernel parameters fed in by the input data distribution control module. The flag parameter memory stores the kernel parameter address jump flag parameters; the address parameter memory stores the jump-target kernel parameter address parameters. For a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically. The selector maintains an address register and a jump address generator and selects the corresponding kernel parameter array group from kernel parameter memory (a) or kernel parameter memory (b) for output to the convolution computation module (the set of all kernel parameter arrays corresponding to one feature-map pixel group output by the feature pixel group selection submodule is collectively called one kernel parameter array group); the jump address generator obtains the jump-target kernel parameter address parameters from the address parameter memory and computes the corresponding jump-target kernel parameter addresses for the selector. The kernel parameter array group counter counts the number of kernel parameter array groups output.
Each time the selector outputs a kernel parameter array group, it compares the current kernel parameter address jump flag value with the current kernel parameter array group counter value. If they are equal, the jump address generator's current jump address is loaded into the address register, and output of the kernel parameter array group proceeds from that start address, one kernel parameter array per step; the address register increments automatically after each kernel parameter array, and once the currently selected kernel parameter array group has been output, the kernel parameter array group counter increments automatically and the jump address generator computes and outputs the next jump address as the current jump address. If they are unequal, output proceeds directly from the start address held in the address register, again one kernel parameter array per step, the address register incrementing after each array and the counter incrementing once the group completes. While the selector outputs kernel parameter array groups, kernel parameter memory (a) and kernel parameter memory (b) take turns supplying the selector with kernel parameter array groups; the switch occurs at the end of each computation layer, and the kernel parameters fed in by the input data distribution control module are likewise sent, layer by layer, alternately to kernel parameter memory (a) and kernel parameter memory (b).
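The alternation between the two kernel parameter memories is a classic ping-pong (double-buffer) scheme. It might be modeled as below (a hypothetical sketch; the class and method names are not from the patent): one bank serves reads for the current layer while the other is being filled with the next layer's kernels, and the roles swap at each layer boundary so the selector never waits on parameter loading.

```python
class PingPongKernelStore:
    """Model of kernel parameter memories (a)/(b): while the selector
    reads one bank, the next layer's kernels stream into the other."""

    def __init__(self):
        self.banks = [[], []]
        self.read_bank = 0            # bank currently serving the selector

    def load_next_layer(self, params):
        # The input data distribution control module writes the idle bank.
        self.banks[1 - self.read_bank] = list(params)

    def end_of_layer(self):
        self.read_bank = 1 - self.read_bank  # swap roles at layer boundary

    def read(self, addr):
        return self.banks[self.read_bank][addr]

store = PingPongKernelStore()
store.banks[0] = ["k0", "k1"]        # layer-1 kernels already resident
store.load_next_layer(["m0", "m1"])  # layer-2 kernels stream in meanwhile
assert store.read(0) == "k0"         # layer 1 still being read
store.end_of_layer()
assert store.read(1) == "m1"         # layer 2 available immediately
```

The design choice here is latency hiding: because writes target the idle bank, DDR transfer time for the next layer's kernels overlaps entirely with the current layer's computation.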
The way the pooling computation order serialization module obtains feature-map pixel groups is similar to that of the convolution computation order serialization module, except that each beat obtains a group of PFP feature values, and when the current window calculation ends, none of the feature-map pixel groups in the calculation window needs to participate in further calculations.
As shown in Fig. 9, the pooling computation order serialization module mainly consists of a feature-map pixel group memory, a new/old selector, a flag parameter memory, an address parameter memory, and a feature-map pixel group counter.
The feature-map pixel group memory is implemented with a dual-port RAM and stores the feature-map pixel groups fed in by the input data distribution control module. The new/old selector maintains two address registers, a new-value address register and an old-value address register, and selects the corresponding feature-map pixel group from the feature-map pixel group memory for output to the pooling computation module. The flag parameter memory stores the new/old-value select flag and the window-calculation early-termination flag for each valid parsing sequence number; the address parameter memory stores the old-value select address for each valid parsing sequence number. For a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically. The feature-map pixel group counter counts the number of feature-map pixel groups the new/old selector has selected for output.
In every beat the pooling computation order serialization module obtains one feature-map pixel group of PFP feature values from the input data distribution control module; these PFP feature values form one input feature value group. Each time the new/old selector outputs a feature-map pixel group, it checks the current new/old-value select flag. If the flag selects the new value, output proceeds, one feature value group per step, from the start address held in the new-value address register; the new-value address register increments automatically after each feature group, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory to become the current flag. If the flag selects the old value, the current old-value select address is loaded into the old-value address register and output proceeds from that start address, one feature value group per step; the old-value address register increments automatically after each feature group, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory and the next old-value select address is fetched in order from the address parameter memory. After the new/old selector has output a feature-map pixel group, the feature-map pixel group counter increments automatically. If the groups output by the new/old selector have not yet filled one calculation window of non-padding elements but the current feature-map pixel group counter value equals the current window-calculation early-termination flag value, the pooling computation order serialization module sends a window-calculation early-termination signal to the pooling computation module, and the next window-calculation early-termination flag is fetched in order from the flag parameter memory to become the current one.
The feature-map pixel group memories in the convolution computation order serialization module and the pooling computation order serialization module are time-shared within their respective computation layers: the memory does not set aside a separate storage unit for every feature-map pixel group sent down from the previous layer, and its capacity is set according to the maximum address interval, within its computation domain, between the new-value write of a feature-map pixel group and the last old-value re-read of that same group;
before the old-value select address parameters are transferred to the DDR off-chip memory by the upper-layer host, a corresponding remainder operation must be applied to them, the modulus being the feature-map pixel group memory capacity of the target computation domain.
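A minimal sketch of that host-side remainder operation, under illustrative values (the function name and capacity are assumptions, not from the patent): reducing each precomputed old-value address modulo the pixel-group memory capacity lets the fixed-size dual-port RAM behave as a circular buffer over a much larger logical address space.

```python
def wrap_old_addresses(raw_addrs, mem_capacity):
    """Host-side preprocessing: reduce each precomputed old-value select
    address modulo the feature-map pixel group memory capacity of its
    computation domain, so the on-chip RAM is reused as a ring buffer."""
    return [a % mem_capacity for a in raw_addrs]

# With a hypothetical 512-entry pixel-group memory, logical addresses
# 515 and 1024 wrap back onto physical slots 3 and 0:
print(wrap_old_addresses([3, 515, 1024], mem_capacity=512))
# -> [3, 3, 0]
```

This works precisely because of the capacity rule stated above: the gap between a group's write and its last re-read never exceeds the memory capacity, so no live group is overwritten before its final reuse.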
As shown in Fig. 10, the convolution computation module consists of KGP (m = KGP in the figure) kernel computing units arranged side by side.
In each effective beat the convolution computation module simultaneously obtains the KFP feature values and the KFP*KGP kernel parameters passed in by the convolution computation order serialization module; the kernel parameters come from KGP different kernels. The KFP feature values are convolved with these KGP kernels simultaneously; after the corresponding bias is added to each convolution result and a ReLU activation is applied, KGP feature-map elements are obtained. These KGP feature-map elements belong to KGP different generated feature maps and are finally sent in turn to the convolution result distribution control module.
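Assuming small illustrative values of KFP and KGP, this parallel structure might be modeled in software as follows (a sketch of the dataflow, not the RTL): every beat broadcasts KFP feature values to KGP kernel computing units, each unit multiply-accumulates against its own parameters across the window's beats, and the bias addition and ReLU are applied once per completed window.

```python
def convolve_window(beats, kernels, biases):
    """Model of the KGP parallel kernel computing units: each beat
    delivers KFP feature values; every unit multiply-accumulates them
    against its own KFP parameters for that beat, and after the whole
    window the bias is added and ReLU applied."""
    acc = [0.0] * len(kernels)            # one accumulator per kernel unit
    for beat_idx, feats in enumerate(beats):
        for k, kern in enumerate(kernels):
            acc[k] += sum(f * p for f, p in zip(feats, kern[beat_idx]))
    return [max(a + b, 0.0) for a, b in zip(acc, biases)]  # bias + ReLU

# KFP = 2 values per beat, 2 beats per window, KGP = 2 kernels:
feats = [[1.0, 2.0], [3.0, 4.0]]
k0 = [[1.0, 1.0], [1.0, 1.0]]      # kernel 0 parameters, per beat
k1 = [[-1.0, -1.0], [-1.0, -1.0]]  # kernel 1 parameters, per beat
print(convolve_window(feats, [k0, k1], biases=[0.5, 0.5]))
# -> [10.5, 0.0]  (the negative accumulation is clipped by ReLU)
```

In the hardware each kernel unit's accumulation is itself split between the multiply-add tree and the adder tree, as described next.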
As shown in Fig. 11, the kernel computing unit mainly consists of a multiply-add tree, an adder tree, a bias unit, and an activation unit. The multiply-add tree is built from several interconnected multipliers and adders; the adder tree is built from several interconnected adders.
The multiply-add tree and the adder tree together perform the multiply-accumulate operations of the kernel computing unit; the bias unit performs the bias addition of the kernel computing unit, and the activation unit performs its activation operation.
In each effective beat the kernel computing unit simultaneously obtains KFP feature values from the feature pixel group selection submodule and KFP kernel parameters from the kernel parameter selection submodule. The multiply-add tree multiply-accumulates the KFP feature values with the KFP kernel parameters and sends the partial sums, in order, into the adder tree for a second, batched accumulation. Once the operands at the first level of the adder tree are all ready, or the last feature value group of the current calculation window is ready, the adder tree starts computing and completes the batched accumulation. When all accumulations of the current calculation window are done, the final accumulation result of the adder tree is fed into an adder for the bias addition; after the bias addition completes, the sum is fed into the activation unit, and the activated result is the final result of the kernel computing unit, which is sent to the convolution result distribution control module.
The adder tree in the kernel computing unit mainly buffers the partial sums fed in by the multiply-add tree and accumulates them in concentrated batches. This batched secondary accumulation effectively eliminates the pipeline stalls and access blocking in the kernel computing unit that would otherwise be caused by the data dependency between successive operands during floating-point accumulation, thereby relieving a major computational bottleneck of the convolution stage in deep convolutional neural networks.
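The benefit of batched accumulation can be illustrated in software (the batch size and names here are assumptions for illustration): folding each new partial sum into a running total serializes dependent floating-point adds, each waiting on the previous adder latency, whereas buffering partial sums and reducing them a batch at a time keeps the pipeline full.

```python
def batched_accumulate(partials, batch):
    """Model of the adder tree's deferred accumulation: partial sums are
    buffered and reduced one batch at a time, so no floating-point add
    depends on the result of the immediately preceding add within a
    batch (the data hazard that would otherwise stall the pipeline)."""
    total = 0.0
    for i in range(0, len(partials), batch):
        total += sum(partials[i:i + batch])  # one batched reduction
    return total

# Five multiply-add-tree partial sums reduced in batches of two:
print(batched_accumulate([1.0, 2.0, 3.0, 4.0, 5.0], batch=2))
# -> 15.0
```

In hardware the reduction within a batch is spatial (a tree of adders), so its depth is logarithmic in the batch size rather than linear as in a naive running sum.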
As shown in Fig. 12, the pooling computation module mainly consists of a dispatcher, a max-pooling unit, an average-pooling unit, and a selector;
In each effective beat the pooling computation module simultaneously obtains PFP feature values from the pooling computation order serialization module and feeds this input feature value group into the dispatcher. According to the pooling mode of the current computation layer, the dispatcher routes the input feature pixel groups to either the max-pooling unit or the average-pooling unit: the max-pooling unit pools by taking the largest feature-map element of the current calculation window in each feature map, while the average-pooling unit pools by averaging all feature-map elements of the current calculation window in each feature map. When pooling is complete, the selector, again according to the pooling mode of the current computation layer, passes the result of the max-pooling unit or the average-pooling unit to the output data distribution control module.
As shown in Fig. 13, the max-pooling unit mainly consists of a comparator array, an intermediate result buffer queue, a dispatcher, and a feature-map pixel group counter. The comparator array is built from several comparators.
The comparator array compares all feature value elements of the current calculation window in each feature map to find their maximum. The intermediate result buffer queue caches the comparator array's intermediate comparison results. The dispatcher, according to the relevant control conditions, either sends the intermediate results from the buffer queue back into the comparator array for further iterations of comparison or outputs them as the final result to the selector in the pooling computation module. The feature-map pixel group counter counts the number of feature-map pixel groups fed into the comparator array for comparison.
In each effective beat the max-pooling unit simultaneously obtains the PFP feature values from the pooling computation module's dispatcher and feeds this input feature value group into the comparator array; each time a feature-map pixel group is fed in, the feature-map pixel group counter increments automatically. At the same time, the dispatcher fetches from the intermediate result buffer queue the intermediate-result feature value group corresponding to the input feature values and feeds it into the comparator array. Once the comparator array's operands are ready, it starts computing, compares the corresponding components of the two feature value groups, and writes the larger of each pair into the intermediate result buffer queue. When the feature-map pixel group counter value reaches the current calculation window size, the dispatcher sends the result held in the intermediate result buffer queue as output to the selector in the pooling computation module.
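The iterative compare-against-buffer loop above can be sketched as follows (an illustrative model; the function name and window shape are assumptions): each arriving pixel group is compared element-wise against the buffered intermediate result, and once the counter reaches the window size, the buffer holds the per-feature-map maxima.

```python
def max_pool_window(groups):
    """Model of the max-pooling unit: every incoming pixel group is
    compared element-wise against the intermediate-result buffer; when
    the window is complete, the buffer contents are the output."""
    inter = list(groups[0])                    # intermediate-result buffer
    for g in groups[1:]:
        inter = [max(a, b) for a, b in zip(inter, g)]
    return inter

# A 2x2 window over PFP = 2 feature maps, fed in group by group:
print(max_pool_window([[1, 8], [5, 2], [3, 9], [4, 0]]))
# -> [5, 9]  (the window maximum of each of the two feature maps)
```

Note that the element-wise structure is what lets one comparator array serve PFP feature maps at once: component i of every group always belongs to feature map i.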
As shown in Fig. 14, the average-pooling unit mainly consists of an adder array, an intermediate result buffer queue, a dispatcher, a feature-map pixel group counter, and a divider array. The adder array is built from several adders; the divider array is built from several dividers.
The adder array accumulates the input feature-map pixel groups. The intermediate result buffer queue caches the adder array's running sums. The dispatcher, according to the relevant control conditions, either sends the intermediate results from the buffer queue back into the adder array for further accumulation or outputs them as the final result to the selector in the pooling computation module. The feature-map pixel group counter counts the number of feature-map pixel groups fed into the adder array for accumulation. The divider array averages the accumulated results sent out by the dispatcher.
In each effective beat the average-pooling unit simultaneously obtains the PFP feature values from the pooling computation module's dispatcher and feeds this input feature value group into the adder array; each time a feature-map pixel group is fed in, the feature-map pixel group counter increments automatically. At the same time, the dispatcher fetches from the intermediate result buffer queue the intermediate-result feature value group corresponding to the input feature values and feeds it into the adder array. Once the adder array's operands are ready, it starts computing, adds the corresponding components of the two feature value groups, and writes the sums into the intermediate result buffer queue. When the feature-map pixel group counter value reaches the current calculation window size, the dispatcher sends the result in the intermediate result buffer queue into the divider array; at the same moment the current value of the feature-map pixel group counter is also fed into the divider array as an operand, and the averages output by the divider array are sent as output to the selector in the pooling computation module.
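The average-pooling dataflow mirrors the max-pooling one with the comparators replaced by adders plus a final divide; a minimal illustrative model (names assumed, not from the patent):

```python
def avg_pool_window(groups):
    """Model of the average-pooling unit: pixel groups are accumulated
    element-wise, and the final value of the pixel-group counter is the
    divisor fed to the divider array."""
    acc = [0.0] * len(groups[0])   # intermediate-result buffer
    count = 0                      # feature-map pixel group counter
    for g in groups:
        acc = [a + b for a, b in zip(acc, g)]
        count += 1
    return [a / count for a in acc]

# A 2-element window over PFP = 2 feature maps:
print(avg_pool_window([[1.0, 4.0], [3.0, 0.0]]))
# -> [2.0, 2.0]
```

Using the counter itself as the divisor is what makes the early-termination signal from the serialization module work: a window cut short at the feature-map border is averaged over the elements actually received, not the nominal window size.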
The values of KFP and KGP are set according to the DON of each convolutional layer in the given deep convolutional neural network model together with the quantities of the various FPGA on-chip system resources: within what the various on-chip resources allow, KFP and KGP are brought as close as possible to the maximum DON over all convolutional layers. The value of PFP is minimized subject to never leaving the immediately following convolutional layer idle. In the present embodiment, KFP and KGP are set to 8 and PFP is set to 1.
When the value of KFP has grown to a certain point and the relevant FPGA on-chip resources are still ample, the existing FPGA on-chip system can be further extended through the built-in system cascade port. The extended FPGA on-chip system is a cascade of multiple FPGA on-chip subsystems, each consisting of the seven main modules and one built-in system cascade port; the cascade port connects the output data distribution control module of one subsystem to the input data distribution control module of the next. Apart from reduced computation and parsing domains, the connections and implementation of the seven main modules are identical to those of the FPGA on-chip system before extension.
The extended FPGA on-chip system not only multiplies the computation parallelism and makes reasonable use of the remaining FPGA on-chip resources, but also exploits the layer-by-layer computation of deep convolutional neural networks to pipeline more fully across layers, effectively shortening the unnecessary waiting time between pooling layers and convolutional layers caused by the computational bottleneck of the convolutional layers. Shortening this waiting time means further reducing the unnecessary intermediate results, so the FPGA on-chip storage resources are used more efficiently and fully.

Claims (8)

1. A pipelined acceleration system for deep convolutional neural networks based on FPGA, characterized in that the system comprises:
an input data distribution control module, an output data distribution control module, a convolution computation order serialization module, a pooling computation order serialization module, a convolution computation module, a pooling computation module, and a convolution result distribution control module; in addition, the pipelined acceleration system also comprises a built-in system cascade port;
the input data distribution control module is simultaneously connected with the FPGA peripheral interface, the built-in system cascade port, and the convolution computation order serialization module; the output data distribution control module is simultaneously connected with the FPGA peripheral interface, the built-in system cascade port, the convolution result distribution control module, and the pooling computation module; the convolution result distribution control module is simultaneously connected with the convolution computation module, the output data distribution control module, and the pooling computation order serialization module; the convolution computation order serialization module and the convolution computation module are directly connected; the pooling computation order serialization module and the pooling computation module are directly connected;
the input data distribution control module monitors in real time the data consumption of the convolution computation order serialization module, sends the relevant read-data commands to the DDR off-chip memory in time, and receives the input data transferred in through the FPGA peripheral interface and the built-in system cascade port; in addition, the input data distribution control module passes the received data on to the convolution computation order serialization module;
the output data distribution control module receives the input data transferred in by the pooling computation module or the convolution result distribution control module and, according to the current calculation stage, transfers the received data to the built-in system cascade port or the FPGA peripheral interface, sending the relevant write-data commands and interrupt notifications to the DDR off-chip memory; in addition, the output data distribution control module responds in real time to commands sent by the FPGA peripheral interface;
the convolution computation order serialization module, in combination with the relevant tuning parameters, serializes the structured computation order of the convolution operations in the deep convolutional neural network and transfers the serialized data sets to the convolution computation module in time; the pooling computation order serialization module, in combination with the relevant tuning parameters, serializes the structured computation order of the pooling operations in the deep convolutional neural network and transfers the serialized data sets to the pooling computation module in time;
the convolution computation module performs the relevant convolution calculations of the deep convolutional neural network and transfers the results to the convolution result distribution control module in time; the pooling computation module performs the relevant pooling operations of the deep convolutional neural network and sends the results to the output data distribution control module in time;
the convolution result distribution control module receives the calculation result data transferred by the convolution computation module and, according to the current calculation stage, sends the received data in an organized and regular manner to the pooling computation order serialization module or the output data distribution control module;
the built-in system cascade port provides a valid interface for cascading the FPGA on-chip subsystems or for connections between internal modules, connecting an output data distribution control module with an input data distribution control module.
2. The pipelined acceleration system for deep convolutional neural networks based on FPGA according to claim 1, characterized in that
the convolution computation order serialization module consists of a feature pixel group selection submodule and a kernel parameter selection submodule, the feature pixel group selection submodule implementing the feature pixel group selection function and the kernel parameter selection submodule implementing the kernel parameter selection function.
3. The pipelined acceleration system for deep convolutional neural networks based on FPGA according to claim 2, characterized in that
the feature pixel group selection submodule consists of a feature-map pixel group memory, a new/old selector, a flag parameter memory, an address parameter memory, a calculation window buffer memory, and a feature-map pixel group counter;
the feature-map pixel group memory is implemented with a dual-port RAM and stores the feature-map pixel groups fed in by the input data distribution control module; the new/old selector maintains two address registers, a new-value address register and an old-value address register, and selects the corresponding feature-map pixel group from the feature-map pixel group memory for output to the convolution computation module; the flag parameter memory stores the new/old-value select flag and the window-calculation early-termination flag for each valid parsing sequence number; the address parameter memory stores the old-value select address for each valid parsing sequence number; for a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically; the calculation window buffer memory is implemented with a dual-port RAM, buffers the feature-map pixel groups output by the new/old selector, and outputs them to the convolution computation module; the feature-map pixel group counter counts the number of feature-map pixel groups the new/old selector has selected for output;
in every beat the feature pixel group selection submodule obtains one feature-map pixel group of KFP feature values from the input data distribution control module, these KFP feature values forming one input feature value group, where KFP is the maximum number of feature maps the convolution computation module processes simultaneously at a time; each time the new/old selector outputs a feature-map pixel group it checks the current new/old-value select flag: if the flag selects the new value, output of feature-map pixel groups proceeds, one feature value group per step, from the start address held in the new-value address register, the new-value address register incrementing automatically after each feature group, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory to become the current new/old-value select flag; if the flag selects the old value, the current old-value select address is loaded into the old-value address register and output of feature-map pixel groups proceeds, one feature value group per step, from that start address, the old-value address register incrementing automatically after each feature group, and once the currently selected feature-map pixel group has been output, the next new/old-value select flag is fetched in order from the flag parameter memory to become the current new/old-value select flag, and the next old-value select address is fetched in order from the address parameter memory to become the current old-value select address;
after the new/old selector has output a feature-map pixel group, the feature-map pixel group counter increments automatically; if the feature-map pixel groups output by the new/old selector now fill one calculation window of non-padding elements, the new/old selector pauses its output until the feature-map pixel groups of the current calculation window held in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times; if the feature-map pixel groups output by the new/old selector have not yet filled one calculation window of non-padding elements but the current feature-map pixel group counter value equals the current window-calculation early-termination flag value, the new/old selector likewise pauses its output early, until the feature-map pixel groups of the current calculation window in the calculation window buffer memory have been reused ((DON-1)/KGP+1) times, and while the new/old selector is paused early, the next window-calculation early-termination flag is fetched in order from the flag parameter memory to become the current window-calculation early-termination flag, where DON is the number of feature maps generated and KGP is the maximum number of feature maps the convolution computation module generates simultaneously at a time.
4. The streamlined acceleration system for FPGA-based deep convolutional neural networks as claimed in claim 3, characterized in that:
The output of the convolution kernel parameter arrays in the convolution kernel parameter selection function submodule is carried out in synchrony with the output of the feature value groups in the characteristic pixel group selection function submodule;
The convolution kernel parameter selection function submodule consists of a first convolution kernel parameter memory, a second convolution kernel parameter memory, a selector, a flag parameter memory, an address parameter memory and a kernel-parameter array group counter;
The first and second convolution kernel parameter memories are both implemented as dual-port RAM and store the convolution kernel parameters fed in by the input data distribution control module. The flag parameter memory stores the kernel-parameter address jump flag parameters, and the address parameter memory stores the jump-destination kernel-parameter address parameters; for a given deep convolutional neural network model, the flag parameter memory and the address parameter memory are written once and then read cyclically many times. The selector maintains an address register and a jump address generator, and selects the corresponding convolution kernel parameter array group from the first or the second convolution kernel parameter memory for output to the convolutional calculation module; the jump address generator obtains the jump-destination kernel-parameter address parameters from the address parameter memory and supplies the selector with the corresponding jump destination addresses. The kernel-parameter array group counter counts the number of convolution kernel parameter array groups output;
Each time the selector selects a convolution kernel parameter array group for output, it compares the current kernel-parameter address jump flag value with the current kernel-parameter array group counter value. If they are equal, the jump address generator loads the current jump address into the address register, and output of the convolution kernel parameter array group proceeds, in units of convolution kernel parameter arrays, with this address as the initial address; after each convolution kernel parameter array is output, the address register auto-increments by one, and after the currently selected group has been output, the kernel-parameter array group counter increments by one and the jump address generator computes and outputs the next jump address as the current jump address. If they are unequal, output proceeds directly from the initial address held in the address register, in units of convolution kernel parameter arrays; after each convolution kernel parameter array is output, the address register auto-increments by one, and after the currently selected group has been output, the kernel-parameter array group counter increments by one. While the selector is outputting convolution kernel parameter array groups, the first and second convolution kernel parameter memories take turns supplying parameter array groups to the selector; the switch takes place at the end of the current computation layer, and the convolution kernel parameters fed in from the input data distribution control module are likewise written, computation layer by computation layer, alternately into the first and the second convolution kernel parameter memory.
5. The streamlined acceleration system for FPGA-based deep convolutional neural networks as claimed in claim 4, characterized in that:
The feature pixel group memories in the convolutional calculation sequence serialization module and in the pooling computation sequence serialization module are reused in a time-shared, cyclic fashion within their respective computation layers; the feature map tuple memory does not allocate a separate storage unit for every feature map tuple sent down from the previous layer, and its capacity is set according to the maximum address interval, within the owning computation domain, between the storing of a feature map tuple's new value and the refetching of its old value;
Before being transmitted through the upper-layer host to the off-chip DDR memory, each old-value selection address parameter must undergo the corresponding modulo operation, the modulus being the capacity of the feature map tuple memory of the computation domain to which it belongs.
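The modulo operation described above simply wraps an absolute old-value address into the circular, time-shared tuple memory. A one-line Python illustration (function name and capacity value are illustrative, not from the patent):

```python
def wrap_old_address(absolute_addr, mem_capacity):
    """Map an absolute old-value selection address into the circular
    feature-map-tuple memory of the owning computation domain."""
    return absolute_addr % mem_capacity
```

For example, with a tuple memory of capacity 1024, an absolute address of 1034 lands on physical slot 10, because the memory has already been recycled once.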
6. The streamlined acceleration system for FPGA-based deep convolutional neural networks as claimed in claim 5, characterized in that:
The convolutional calculation module consists of multiple convolution kernel computing units operating side by side; each convolution kernel computing unit consists of a multiply-add tree, an adder tree, a bias unit and an activation unit; the multiply-add tree is formed by interconnecting several multipliers and adders, and the adder tree is formed by interconnecting several adders;
The multiply-add tree and the adder tree together perform the multiply-accumulate operations of the convolution kernel computing unit; the bias unit performs the bias addition operation of the unit, and the activation unit performs its activation operation;
In each effective beat, the convolution kernel computing unit simultaneously obtains KFP characteristic values from the characteristic pixel group selection function submodule and KFP convolution kernel parameters from the convolution kernel parameter selection function submodule. The multiply-add tree multiply-accumulates the KFP characteristic values with the KFP convolution kernel parameters, and the multiply-accumulate results are sent in order into the adder tree for a secondary, concentrated accumulation; once all operands at the first-level inputs of the adder tree are ready, or the last group of characteristic values of the current calculation window is ready, the adder tree starts computing and completes the secondary accumulation. When all accumulation operations of the current calculation window are complete, the adder tree sends the final accumulation result into the bias adder for the bias addition; after the bias addition is complete, the sum is sent to the activation unit for activation, and the activated result is the final calculation result of the convolution kernel computing unit, which is sent to the convolutional calculation result distribution control module.
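Functionally, one convolution kernel computing unit reduces each beat's KFP products, accumulates across beats, then applies bias and activation. A minimal Python sketch of that dataflow follows; the function name is hypothetical, and ReLU is assumed as the activation since the patent does not name one here.

```python
def conv_kernel_unit(feature_beats, param_beats, bias):
    """Model one convolution-kernel computing unit over a calculation window.

    feature_beats / param_beats: per-beat lists of KFP characteristic values
    and KFP kernel parameters. The multiply-add tree reduces each beat to one
    partial sum; the adder tree accumulates partial sums across beats; the
    bias adder and the activation unit (ReLU assumed) finish the result.
    """
    acc = 0
    for feats, params in zip(feature_beats, param_beats):
        # multiply-add tree: KFP products reduced to one partial sum per beat
        acc += sum(f * p for f, p in zip(feats, params))
    acc += bias            # bias addition stage
    return max(acc, 0)     # activation stage (ReLU assumed)
```

Two beats with KFP = 2, as in the test below, correspond to a window whose multiply-accumulate spans two effective beats before the bias and activation stages fire.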
7. The streamlined acceleration system for FPGA-based deep convolutional neural networks as claimed in claim 6, characterized in that the pooling computation module consists of a distributor, a max-pooling unit, an average-pooling unit and a selector;
In each effective beat, the pooling computation module simultaneously obtains PFP characteristic values from the pooling computation sequence serialization module, and this input feature value group is sent to the distributor for dispatch. According to the pooling mode of the current computation layer, the distributor routes the input feature pixel group to the max-pooling unit or to the average-pooling unit. The max-pooling unit takes, in each feature map, the maximum feature element of the current calculation window as the pooling result; the average-pooling unit takes, in each feature map, the average of all feature elements of the current calculation window as the pooling result. When the pooling operation completes, the selector, according to the pooling mode of the current computation layer, forwards the result of the max-pooling unit or the average-pooling unit to the output data distribution control module. PFP is the maximum number of feature maps the pooling computation module processes simultaneously at a time.
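The distributor/selector pair reduces to a mode switch per window. A small Python sketch of the two pooling units described above (function name and mode strings are illustrative, not from the patent):

```python
def pool_window(window_values, mode):
    """Pool one calculation window of a single feature map.

    mode selects the max-pooling or average-pooling unit, mirroring the
    distributor on input and the selector on output.
    """
    if mode == "max":
        return max(window_values)            # max-pooling unit
    if mode == "avg":
        return sum(window_values) / len(window_values)  # average-pooling unit
    raise ValueError("unknown pooling mode")
```

In the hardware, PFP such windows (one per feature map) are pooled in parallel each beat; the sketch shows the per-window arithmetic only.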
8. The streamlined acceleration system for FPGA-based deep convolutional neural networks as described in any one of claims 1-7, characterized in that:
The acceleration system formed by extending the FPGA system is obtained by cascading multiple FPGA systems of identical structure.
CN201710072223.2A 2017-02-09 2017-02-09 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA Active CN106875012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710072223.2A CN106875012B (en) 2017-02-09 2017-02-09 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA

Publications (2)

Publication Number Publication Date
CN106875012A CN106875012A (en) 2017-06-20
CN106875012B true CN106875012B (en) 2019-09-20

Family

ID=59165669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710072223.2A Active CN106875012B (en) 2017-02-09 2017-02-09 A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA

Country Status (1)

Country Link
CN (1) CN106875012B (en)

Families Citing this family (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229969A (en) * 2017-06-21 2017-10-03 郑州云海信息技术有限公司 A kind of convolutional neural networks implementation method and device based on FPGA
CN107862379A (en) * 2017-07-21 2018-03-30 电子科技大学 Neutral net FPGA
CN107392309A (en) * 2017-09-11 2017-11-24 东南大学—无锡集成电路技术研究所 A kind of general fixed-point number neutral net convolution accelerator hardware structure based on FPGA
CN107749044A (en) * 2017-10-19 2018-03-02 珠海格力电器股份有限公司 The pond method and device of image information
CN107862374B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing system and processing method based on assembly line
WO2019084788A1 (en) * 2017-10-31 2019-05-09 深圳市大疆创新科技有限公司 Computation apparatus, circuit and relevant method for neural network
CN109754359B (en) * 2017-11-01 2021-12-07 腾讯科技(深圳)有限公司 Pooling processing method and system applied to convolutional neural network
CN109784125A (en) * 2017-11-10 2019-05-21 福州瑞芯微电子股份有限公司 Deep learning network processing device, method and image processing unit
CN108228969A (en) * 2017-12-07 2018-06-29 中国航空工业集团公司西安航空计算技术研究所 A kind of double FPGA collaboration working method towards deep neural network
CN107909148B (en) * 2017-12-12 2020-10-20 南京地平线机器人技术有限公司 Apparatus for performing convolution operations in a convolutional neural network
WO2019127517A1 (en) * 2017-12-29 2019-07-04 深圳市大疆创新科技有限公司 Data processing method and device, dma controller, and computer readable storage medium
CN108416422B (en) * 2017-12-29 2024-03-01 国民技术股份有限公司 FPGA-based convolutional neural network implementation method and device
CN108256644B (en) * 2018-01-05 2021-06-22 上海兆芯集成电路有限公司 Microprocessor circuit and method for executing neural network operation
CN108376283B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling device and pooling method for neural network
CN108304925B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 Pooling computing device and method
CN108388943B (en) * 2018-01-08 2020-12-29 中国科学院计算技术研究所 Pooling device and method suitable for neural network
US11874898B2 (en) 2018-01-15 2024-01-16 Shenzhen Corerain Technologies Co., Ltd. Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN109416755B (en) * 2018-01-15 2021-11-23 深圳鲲云信息科技有限公司 Artificial intelligence parallel processing method and device, readable storage medium and terminal
CN109313723B (en) * 2018-01-15 2022-03-15 深圳鲲云信息科技有限公司 Artificial intelligence convolution processing method and device, readable storage medium and terminal
CN108491929A (en) * 2018-03-20 2018-09-04 南开大学 A kind of structure of the configurable parallel fast convolution core based on FPGA
CN108520296B (en) * 2018-03-20 2020-05-15 福州瑞芯微电子股份有限公司 Deep learning chip-based dynamic cache allocation method and device
CN110210610B (en) * 2018-03-27 2023-06-20 腾讯科技(深圳)有限公司 Convolution calculation accelerator, convolution calculation method and convolution calculation device
CN110322389B (en) * 2018-03-29 2023-03-21 上海熠知电子科技有限公司 Pooling method, apparatus and system, computer readable storage medium
CN108537334A (en) * 2018-04-26 2018-09-14 济南浪潮高新科技投资发展有限公司 A kind of acceleration array design methodology for CNN convolutional layer operations
CN108805272A (en) * 2018-05-03 2018-11-13 东南大学 A kind of general convolutional neural networks accelerator based on FPGA
US11443176B2 (en) * 2018-05-17 2022-09-13 International Business Machines Corporation Acceleration of convolutional neural networks on analog arrays
CN108805274B (en) * 2018-05-28 2022-02-18 重庆大学 FPGA (field programmable Gate array) -based acceleration method and system for hardware of Tiny-yolo convolutional neural network
CN108805285B (en) * 2018-05-30 2022-03-29 山东浪潮科学研究院有限公司 Convolutional neural network pooling unit design method
CN108776833B (en) * 2018-06-05 2021-08-31 郑州云海信息技术有限公司 Data processing method, system and computer readable storage medium
CN109117940B (en) * 2018-06-19 2020-12-15 腾讯科技(深圳)有限公司 Target detection method, device, terminal and storage medium based on convolutional neural network
CN109002884A (en) * 2018-07-20 2018-12-14 郑州云海信息技术有限公司 A kind of pond processing unit and pond processing method
CN110766127B (en) * 2018-07-25 2022-09-23 赛灵思电子科技(北京)有限公司 Neural network computing special circuit and related computing platform and implementation method thereof
CN109284824B (en) * 2018-09-04 2021-07-23 复旦大学 Reconfigurable technology-based device for accelerating convolution and pooling operation
CN109447257B (en) * 2018-09-18 2021-08-17 复旦大学 Operation device of deep neural network acceleration chip with self-organized channels
CN109508782B (en) * 2018-10-09 2022-05-24 瑞芯微电子股份有限公司 Neural network deep learning-based acceleration circuit and method
CN109376843B (en) * 2018-10-12 2021-01-08 山东师范大学 FPGA-based electroencephalogram signal rapid classification method, implementation method and device
CN109948777A (en) * 2018-11-14 2019-06-28 深圳大学 The implementation method of convolutional neural networks is realized based on the FPGA convolutional neural networks realized and based on FPGA
CN111260536B (en) * 2018-12-03 2022-03-08 中国科学院沈阳自动化研究所 Digital image multi-scale convolution processor with variable parameters and implementation method thereof
CN109711533B (en) * 2018-12-20 2023-04-28 西安电子科技大学 Convolutional neural network acceleration system based on FPGA
CN109740619B (en) * 2018-12-27 2021-07-13 北京航天飞腾装备技术有限责任公司 Neural network terminal operation method and device for target recognition
CN109685209B (en) * 2018-12-29 2020-11-06 瑞芯微电子股份有限公司 Device and method for accelerating operation speed of neural network
DE102020100209A1 (en) * 2019-01-21 2020-07-23 Samsung Electronics Co., Ltd. Neural network device, neural network system and method for processing a neural network model by using a neural network system
US11550709B2 (en) * 2019-04-03 2023-01-10 Macronix International Co., Ltd. Memory device and wear leveling method for the same
CN110097174B (en) * 2019-04-22 2021-04-20 西安交通大学 Method, system and device for realizing convolutional neural network based on FPGA and row output priority
CN110334801A (en) * 2019-05-09 2019-10-15 苏州浪潮智能科技有限公司 A kind of hardware-accelerated method, apparatus, equipment and the system of convolutional neural networks
CN110276444B (en) * 2019-06-04 2021-05-07 北京清微智能科技有限公司 Image processing method and device based on convolutional neural network
CN110276345B (en) * 2019-06-05 2021-09-17 北京字节跳动网络技术有限公司 Convolutional neural network model training method and device and computer readable storage medium
CN110399883A (en) * 2019-06-28 2019-11-01 苏州浪潮智能科技有限公司 Image characteristic extracting method, device, equipment and computer readable storage medium
CN112446458A (en) * 2019-08-27 2021-03-05 北京灵汐科技有限公司 Global pooling method of neural network and many-core system
CN110569970B (en) * 2019-09-12 2022-03-15 合肥工业大学 Data transmission method applied to hardware accelerator in convolutional neural network
CN110942145A (en) * 2019-10-23 2020-03-31 南京大学 Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
CN110807522B (en) * 2019-10-31 2022-05-06 合肥工业大学 General calculation circuit of neural network accelerator
CN110913227B (en) * 2019-11-28 2021-12-21 山东浪潮科学研究院有限公司 Edge-end image compression system and method of heterogeneous computing architecture
CN111027682A (en) * 2019-12-09 2020-04-17 Oppo广东移动通信有限公司 Neural network processor, electronic device and data processing method
CN113169989A (en) * 2019-12-31 2021-07-23 华为技术有限公司 Device and method for realizing data synchronization in neural network inference
CN111242295B (en) * 2020-01-20 2022-11-25 清华大学 Method and circuit capable of configuring pooling operator
CN111416743B (en) * 2020-03-19 2021-09-03 华中科技大学 Convolutional network accelerator, configuration method and computer readable storage medium
CN111814957B (en) * 2020-06-28 2024-04-02 深圳云天励飞技术股份有限公司 Neural network operation method and related equipment
CN112001492B (en) * 2020-08-07 2023-06-23 中山大学 Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
CN112149814A (en) * 2020-09-23 2020-12-29 哈尔滨理工大学 Convolutional neural network acceleration system based on FPGA
CN112288082B (en) * 2020-11-23 2023-06-13 天津大学 HLS-based reconfigurable universal standard convolution accelerator design method
CN112230884B (en) * 2020-12-17 2021-04-20 季华实验室 Target detection hardware accelerator and acceleration method
CN113112002A (en) * 2021-04-06 2021-07-13 济南大学 Design method of lightweight convolution accelerator based on FPGA
CN113419702B (en) * 2021-06-21 2022-11-22 安谋科技(中国)有限公司 Data accumulation method, processor, electronic device and readable medium
CN113240103B (en) * 2021-06-25 2022-10-04 清华大学 Neural network pooling circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
CN106228240A (en) * 2016-07-30 2016-12-14 复旦大学 Degree of depth convolutional neural networks implementation method based on FPGA
CN106355244A (en) * 2016-08-30 2017-01-25 深圳市诺比邻科技有限公司 CNN (convolutional neural network) construction method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10311357B2 (en) * 2014-06-19 2019-06-04 Knowmtech, Llc Thermodynamic-RAM technology stack

Also Published As

Publication number Publication date
CN106875012A (en) 2017-06-20

Similar Documents

Publication Publication Date Title
CN106875012B (en) A kind of streamlined acceleration system of the depth convolutional neural networks based on FPGA
CN106779060B (en) A kind of calculation method for the depth convolutional neural networks realized suitable for hardware design
Geng et al. A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing
CN108241890B (en) Reconfigurable neural network acceleration method and architecture
Moini et al. A resource-limited hardware accelerator for convolutional neural networks in embedded vision applications
Wang et al. Sparse-YOLO: Hardware/software co-design of an FPGA accelerator for YOLOv2
CN104899182B (en) A kind of Matrix Multiplication accelerated method for supporting variable partitioned blocks
Huang et al. FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio
US20200293379A1 (en) Convolutional computing accelerator, convolutional computing method, and computer-readable storage medium
CN105487838B (en) The task-level parallelism dispatching method and system of a kind of dynamic reconfigurable processor
CN107301456B (en) Deep neural network multi-core acceleration implementation method based on vector processor
CN110390384A (en) A kind of configurable general convolutional neural networks accelerator
CN108280514A (en) Sparse neural network acceleration system based on FPGA and design method
CN109328361A (en) Accelerator for deep neural network
KR20200067915A (en) Accelerator for deep neural networks
Fan et al. F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition
CN103049241B (en) A kind of method improving CPU+GPU isomery device calculated performance
Zhao et al. Automatic generation of multi-precision multi-arithmetic CNN accelerators for FPGAs
CN106951926A (en) The deep learning systems approach and device of a kind of mixed architecture
CN109472356A (en) A kind of accelerator and method of restructural neural network algorithm
CN109376843A (en) EEG signals rapid classification method, implementation method and device based on FPGA
CN110222818A (en) A kind of more bank ranks intertexture reading/writing methods for the storage of convolutional neural networks data
CN109447241A (en) A kind of dynamic reconfigurable convolutional neural networks accelerator architecture in internet of things oriented field
CN110046704A (en) Depth network accelerating method, device, equipment and storage medium based on data flow
CN108652661A (en) The FPGA medical ultrasonic imaging systems accelerated using CAPI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant