CN108090565A - A convolutional neural network parallelized training acceleration method
- Publication number: CN108090565A (application CN201810037896.9A)
- Authority: CN (China)
- Prior art keywords: layer, batch, local error, error, sample
- Prior art date: 2018-01-16
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06N3/08 — Learning methods (G: Physics; G06: Computing, calculating or counting; G06N: Computing arrangements based on specific computational models; G06N3/00: based on biological models; G06N3/02: Neural networks)
- G06N3/045 — Combinations of networks (G06N3/04: Architecture, e.g. interconnection topology)
Abstract
The present invention provides a convolutional neural network parallelized training acceleration method based on a mixed-batch idea, applied to a heterogeneous system composed of a CPU and an FPGA. It mainly addresses the problem that, for large-scale convolutional neural network structures, FPGA memory is insufficient when the samples of a batch are trained in parallel on the FPGA, and it can be applied to image recognition and object detection in computer vision. The method comprises the following steps: 1. In the data preprocessing stage, the samples of the original training set are randomly reshuffled. 2. In the feed-forward stage, data are written to shared memory in batches and processed in parallel by each layer of the convolutional neural network implemented in the OpenCL language; the first fully connected layer randomly reads the data of one sample from the previous layer's batch and computes the output of this layer. 3. In the local-error update stage, the local error of the first fully connected layer randomly updates the local error of one sample in the previous layer's batch, and the remaining layers compute their local errors in parallel.
Description
Technical field
The invention belongs to the field of computing, and in particular relates to an FPGA-based convolutional neural network parallelized training acceleration method.
Background technology
An FPGA (field-programmable gate array) is a high-performance, low-power, programmable digital circuit chip. An FPGA mainly consists of an array of configurable logic blocks (CLBs) and interconnect, together with modules such as DSP blocks and block RAM (BRAM). The logic blocks can be configured to perform complex combinational logic functions, and the interconnect links the logic blocks, DSP blocks and inputs into a complete circuit. For computation-intensive algorithms, a general-purpose processor relies on the von Neumann architecture and must fetch instructions, decode them and finally execute machine code; its computing resources are built from hardware units at the granularity of multipliers and adders, so if the architecture does not match the mathematical model of the algorithm, hardware resources are wasted. An FPGA, by contrast, is programmable: developers can repeatedly reconfigure the underlying circuitry and provision exactly the hardware resources the computation needs, achieving higher utilization. For a given application, an FPGA therefore usually offers better energy efficiency than a general-purpose processor.
Traditional FPGA development uses hardware description languages (Verilog, VHDL, etc.) and requires completing the RTL logic design. Developers need a deep understanding of hardware circuits, so the approach has a high entry barrier, long development cycles, and is difficult to upgrade and maintain. Moreover, deep learning algorithms currently evolve and are updated continuously, making development in the traditional way costly. A technique is therefore needed that can quickly implement the training of convolutional neural networks and keep pace with constantly changing algorithms.
Convolutional neural networks are a classical type of artificial neural network and are widely used in image classification, object detection, speech recognition, video recognition, natural language processing and other fields. In recent years, with the rapid development of artificial intelligence, both the generalization ability and the recognition accuracy of convolutional neural networks have greatly improved. The document "Wang D, An J, Xu K. PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks [J]. arXiv preprint arXiv:1611.02450, 2016." proposes executing OpenCL kernel functions in a pipelined fashion, but the drawback is that the kernel functions can only execute single-threaded. The document "Liu L, Luo J, Deng X, et al. FPGA-based Acceleration of Deep Neural Networks Using High Level Method [C]// P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), 2015 10th International Conference on. IEEE, 2015: 824-827." describes applying the mini-batch stochastic gradient descent method to the parallel training of deep neural networks on an FPGA. However, that document only studies the mini-batch gradient descent method for neural networks; as network structures grow more complex, networks become deeper and the variety of layer types increases, the input data of a batch under the mini-batch gradient descent method can exceed the FPGA's global memory capacity and increase memory read/write time, while plain stochastic gradient descent, which trains on a single sample at a time, is inefficient. A training method applicable to FPGA devices is therefore needed that reduces training time without significantly sacrificing training accuracy.
The content of the invention
The object of the invention is to address the above problems in the prior art by providing a convolutional neural network training method that can complete the fast training of a convolutional neural network model under relatively low memory bandwidth, where memory bandwidth refers to the number of bytes read and written per unit time.
The present invention provides a training method of a convolutional neural network model, the method comprising:
On an embedded FPGA platform, the CPU acts as the control device and the FPGA as the computing device. Each layer of the convolutional neural network is processed in parallel on the FPGA, and shared memory accessible to both the CPU and the FPGA is allocated for the model's structural parameters and trainable parameters. The structural parameters include the number of convolution kernels, the convolution kernel size, the average-pooling factor size and similar parameters; the trainable parameters refer to the network weights, biases and similar parameters.
According to the type of each layer in the convolutional neural network to be trained, feature-image outputs and local errors of different batch scales are defined, and memory space is allocated for them. The batch scale refers to the number of samples selected from the training set at a time; multiple samples form one batch.
The shared memory is allocated in an aligned manner, and data are transferred from the host to the FPGA device by DMA (direct memory access). Throughout the entire training process, the data in shared memory are continuously computed on and passed between network layers.
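As an illustration of this step, the following is a minimal host-side sketch in C using the OpenCL API, assuming a valid context and command queue already exist; the function name, the 64-byte alignment and the buffer handling are illustrative choices, not details fixed by the invention.

```c
#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: 64-byte-aligned staging memory plus a blocking write, which lets
 * the FPGA OpenCL runtime move the batch by DMA. ctx and queue are assumed
 * to be a valid context and command queue; names and sizes are illustrative. */
cl_mem create_shared_batch_buffer(cl_context ctx, cl_command_queue queue,
                                  const float *samples, size_t bytes)
{
    void *staging = NULL;
    posix_memalign(&staging, 64, bytes);   /* alignment enables DMA transfers */
    memcpy(staging, samples, bytes);

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, staging, 0, NULL, NULL);

    free(staging);
    return buf;
}
```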
During the feed-forward computation, the fully connected layer randomly reads the data of one feature image from the previous layer's batch and records its index within the batch; during the back-propagation computation, the output-layer error is calculated using the label data corresponding to that index.
When updating local errors, according to the chain rule of the error back-propagation algorithm, the network layers that hold a single sample directly update the local error back-propagated from the output layer; the last batch-scale network layer uses the local error of the following layer to randomly update the local error of one of its current samples; and the preceding batch-scale network layers update the local errors of their corresponding multiple samples in parallel, layer by layer.
When calculating the local error of a convolutional layer whose next layer is a pooling layer, average pooling is used, and the local error of the pooling layer is multiplied by an error scaling factor λ to obtain the local error value of the corresponding convolutional-layer neuron, so as to fine-tune the convolution kernel parameters and biases.
For a batch-scale convolutional layer, the average gradient over the batch is calculated and the convolution kernel parameters are updated in parallel; the average local error over the batch is calculated and the bias parameters are updated in parallel.
For a single-sample fully connected layer, the gradient of the single feature image is calculated and the weight parameters are updated in parallel; the local error of the single feature image is calculated and the bias parameters are updated in parallel.
After the current batch has been updated, the data of the next batch are transferred, and this is repeated until a preset number of iterations is reached or the error falls below a threshold, at which point training stops.
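A condensed host-side sketch of this control flow is given below; the helper functions (load_next_batch, feed_forward, back_propagate, update_parameters) are hypothetical stand-ins for the OpenCL kernel launches detailed in the embodiment section.

```c
/* Hypothetical helpers standing in for the OpenCL kernel launches
 * described in the detailed embodiment below. */
void  load_next_batch(void);       /* DMA one batch into shared memory         */
int   feed_forward(void);          /* returns the randomly chosen sample index */
float back_propagate(int sample);  /* output error + layer-wise local errors   */
void  update_parameters(void);     /* batch-averaged conv / single-sample FC   */

void train(int max_iters, float err_threshold)
{
    for (int n = 0; n < max_iters; ++n) {
        load_next_batch();
        int s = feed_forward();            /* index recorded for the FC layers */
        float err = back_propagate(s);     /* uses the label of sample s       */
        update_parameters();
        if (err < err_threshold)
            break;                         /* stop once the error is small enough */
    }
}
```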
Description of the drawings
Fig. 1 is the overall flow chart of the convolutional neural network parallelized training method of the present invention;
Fig. 2 is the flow chart of a single iteration in the convolutional neural network parallelized training method of the present invention;
Fig. 3 is a schematic diagram of the data flow in the convolutional neural network parallelized training method of the present invention;
Fig. 4 is a schematic diagram of the implementation principle of a convolutional-layer local-error update method according to an exemplary embodiment.
Specific embodiment
The method of the present invention is described in further detail below with reference to the accompanying drawings.
The implementation flow of the FPGA-based convolutional neural network parallelized training method in the embodiment of the present invention, shown in Fig. 1, comprises the following steps:
The FPGA device communicates with the CPU over the PCIe bus. On the CPU side, the samples of the training set are randomly rearranged. Following the OpenCL standard, shared memory accessible to both the CPU and the FPGA is allocated for the outputs and local errors of each layer of the convolutional neural network model to be trained. The allocated memory sizes fall into two batch scales: for convolutional layers and pooling layers, each neuron stores the outputs and local errors of a fixed number (greater than 1) of samples, while for fully connected layers each neuron only stores the output and local error of a single sample. In addition, for each convolutional layer, memory space must also be allocated for the convolution kernels and biases, its size computed from the previous layer's image size, the convolution kernel size and the stride. For each fully connected layer, memory space must be allocated for the weights and biases, its size computed from the number of neurons in the previous layer and in the current layer. For the output layer, memory space must also be allocated for the label data.
For the implementation of a single parallel training iteration, refer to Fig. 2; for the feature-image outputs and local errors of each layer, refer to Fig. 3. The specific implementation is as follows:
A fixed number of samples is defined as one batch, and the sample data of one batch are read in. Random numbers in [-0.5, 0.5] are used to initialize the initial convolution kernels and biases of each convolutional layer and the initial weights and biases of each fully connected layer.
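A minimal C sketch of this initialization, assuming the parameters are stored as flat float arrays:

```c
#include <stdlib.h>

/* Sketch: fill a parameter array with uniform random values in [-0.5, 0.5],
 * as used to initialise convolution kernels, FC weights and biases. */
void init_uniform(float *p, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        p[i] = (float)rand() / (float)RAND_MAX - 0.5f;
}
```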
For the feed-forward computation of a convolutional layer, an OpenCL kernel function over a three-dimensional index space performs the convolution and activation operations on the feature images of one batch in parallel; the parallel granularity is that each neuron independently fetches and computes on the data of its own local receptive field, producing the output feature images.
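The following OpenCL C kernel is a sketch of this step under assumed conventions: flat NCHW-style layouts, stride 1, no padding, and tanh as the activation function; one work-item per output neuron, launched over a three-dimensional index space (output width, output height, batch × output maps).

```c
/* OpenCL C sketch of the batched convolution + activation kernel.
 * One work-item per output neuron; names and layouts are illustrative. */
__kernel void conv_forward(__global const float *in,      /* B x IC x in_h x in_w   */
                           __global const float *weights, /* OC x IC x K x K        */
                           __global const float *bias,    /* OC                     */
                           __global float *out,           /* B x OC x out_h x out_w */
                           const int in_w, const int in_h,
                           const int in_maps, const int out_maps,
                           const int K)                    /* kernel size, stride 1 */
{
    const int x  = get_global_id(0);
    const int y  = get_global_id(1);
    const int z  = get_global_id(2);
    const int oc = z % out_maps;            /* output feature map  */
    const int b  = z / out_maps;            /* sample within batch */
    const int out_w = in_w - K + 1;
    const int out_h = in_h - K + 1;

    float acc = bias[oc];
    for (int ic = 0; ic < in_maps; ++ic)            /* loop over input maps  */
        for (int ky = 0; ky < K; ++ky)              /* local receptive field */
            for (int kx = 0; kx < K; ++kx) {
                int in_idx = ((b * in_maps + ic) * in_h + (y + ky)) * in_w + (x + kx);
                int w_idx  = ((oc * in_maps + ic) * K + ky) * K + kx;
                acc += in[in_idx] * weights[w_idx];
            }

    int o_idx = ((b * out_maps + oc) * out_h + y) * out_w + x;
    out[o_idx] = tanh(acc);                         /* activation (assumed tanh) */
}
```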
For the feed-forward computation of a pooling layer, an OpenCL kernel function over a three-dimensional index space performs average pooling in parallel on the convolved and activated feature images of one batch; the parallel granularity is that each neuron independently fetches and computes on the data of its own local receptive field, producing the output feature images.
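A corresponding OpenCL C sketch of the batched average-pooling kernel, under the same assumed layout, with one work-item per pooled output neuron:

```c
/* OpenCL C sketch of batched average pooling. One work-item per pooled
 * output neuron; global dims (out_w, out_h, batch * maps). Names illustrative. */
__kernel void avg_pool_forward(__global const float *in,   /* B x M x in_h x in_w   */
                               __global float *out,         /* B x M x out_h x out_w */
                               const int in_w, const int in_h,
                               const int maps,
                               const int pool)               /* window = stride = pool */
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int z = get_global_id(2);          /* encodes (sample, map) */
    const int out_w = in_w / pool;
    const int out_h = in_h / pool;

    float acc = 0.0f;
    for (int py = 0; py < pool; ++py)        /* local receptive field */
        for (int px = 0; px < pool; ++px) {
            int ix = x * pool + px;
            int iy = y * pool + py;
            acc += in[(z * in_h + iy) * in_w + ix];
        }

    out[(z * out_h + y) * out_w + x] = acc / (pool * pool);
}
```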
For the feed-forward computation of a fully connected layer, one feature image of the previous layer's batch is selected at random and its index within the current batch is recorded; an OpenCL kernel function over a one-dimensional index space processes it in parallel, and the parallel granularity is that each neuron independently fetches and computes on the outputs of all previous-layer neurons connected to it, producing the neuron outputs.
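A sketch of this step in OpenCL C: the host draws the sample index s at random (for example s = rand() % B) and records it, and the one-dimensional kernel below, with one work-item per output neuron, reads only that sample's slice of the previous layer's batch output; tanh is assumed as the activation.

```c
/* OpenCL C sketch of the fully connected feed-forward step for one
 * randomly chosen sample of the previous layer's batch. */
__kernel void fc_forward(__global const float *prev_out, /* B x prev_n            */
                         __global const float *weights,  /* cur_n x prev_n        */
                         __global const float *bias,     /* cur_n                 */
                         __global float *out,            /* cur_n, single sample  */
                         const int prev_n,
                         const int s)                     /* sample index chosen on host */
{
    const int j = get_global_id(0);                       /* output neuron j        */
    float acc = bias[j];
    for (int i = 0; i < prev_n; ++i)                      /* all prev-layer neurons */
        acc += weights[j * prev_n + i] * prev_out[s * prev_n + i];
    out[j] = tanh(acc);
}
```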
For the feed-forward computation of the output layer, an OpenCL kernel function over a one-dimensional index space processes it in parallel; the parallel granularity is that each neuron independently fetches and computes on the outputs of all previous-layer neurons connected to it, producing the neuron outputs. At the same time, the label data of the corresponding sample are read according to the recorded index, and the output error is computed in parallel by an OpenCL kernel function over a one-dimensional index space, the parallel granularity being the computation of the local error of each individual neuron.
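A sketch of the output-error kernel, assuming a squared-error loss and a tanh output activation, so that the local error is (y − t)·(1 − y²); this sign convention matches the minus signs used in the update formulas (4)-(7) below.

```c
/* OpenCL C sketch of the output-layer error: one work-item per output
 * neuron, using the label vector of the recorded sample index. The local
 * error is defined as the derivative of the squared-error loss, so the
 * parameter updates later subtract the gradient. */
__kernel void output_error(__global const float *out,    /* network output       */
                           __global const float *label,  /* label for sample s   */
                           __global float *delta)        /* local error per neuron */
{
    const int j = get_global_id(0);
    const float y = out[j];
    delta[j] = (y - label[j]) * (1.0f - y * y);           /* (y - t) * f'(y) */
}
```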
For the local-error update of a single-sample fully connected layer, an OpenCL kernel function over a one-dimensional index space directly updates the local error of the fully connected layer. The update uses the following formula (1):

    δ_i^k = f'(y_i^k) · Σ_j w_ij · δ_j^(k+1)    (1)

where δ_i^k denotes the local error of the i-th neuron of layer k, δ_j^(k+1) denotes the local error of the j-th neuron of layer k+1, w_ij denotes the weight connecting the two neurons, and f'(y_i^k) denotes the derivative of the layer-k activation function with respect to the output value.
For a batch-scale convolutional layer whose following layer is a single-sample fully connected layer, the local error of the following layer is used to randomly update the local error of one of the current samples, and an OpenCL kernel function over a one-dimensional index space updates the local error of the current layer.
For the local-error update of a convolutional layer whose next layer is a pooling layer (see Fig. 4), average pooling is used, and the local error of the corresponding pooling-layer neuron is multiplied by the error scaling factor λ to obtain the local error value of the corresponding convolutional-layer neuron. The update uses the following formula (2):

    δ_i^k = λ · (δ_j^(k+1) ⊗ 1_(n×n)) ∘ f'(y_i^k)    (2)

where δ_i^k denotes the local error of the i-th feature image of layer k, δ_j^(k+1) denotes the local error of the corresponding j-th feature image of layer k+1, ⊗ denotes the Kronecker product (here upsampling the pooled error by the pooling factor n), ∘ denotes element-wise multiplication, and f'(y_i^k) denotes the derivative of the layer-k activation function with respect to the feature-image output value.
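An OpenCL C sketch of formula (2): each work-item handles one convolutional-layer neuron, reads the pooled error that covers it (the Kronecker upsampling by the pooling factor), scales it by λ and multiplies by the activation derivative; a one-to-one correspondence between convolutional and pooling feature maps and a tanh activation are assumed.

```c
/* OpenCL C sketch of formula (2). Global dims: (conv_w, conv_h, batch * maps). */
__kernel void conv_error_from_pool(__global const float *delta_pool, /* B x M x ph x pw */
                                   __global const float *conv_out,   /* B x M x ch x cw */
                                   __global float *delta_conv,       /* B x M x ch x cw */
                                   const int conv_w, const int conv_h,
                                   const int pool,                    /* pooling factor n */
                                   const float lambda)                /* error scale      */
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const int z = get_global_id(2);                 /* encodes (sample, map) */
    const int pw = conv_w / pool;
    const int ph = conv_h / pool;

    const float d = delta_pool[(z * ph + y / pool) * pw + x / pool];
    const int idx = (z * conv_h + y) * conv_w + x;
    const float o = conv_out[idx];
    delta_conv[idx] = lambda * d * (1.0f - o * o);  /* lambda * up(delta) * f'(y) */
}
```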
The remaining batch-scale network layers successively update the local errors of their corresponding multiple samples in parallel using OpenCL kernel functions over a three-dimensional index space. For a pooling layer whose next layer is a convolutional layer, the following formula (3) is used:

    δ_i^k = (Σ_j extend(δ_j^(k+1)) * rot180(W_ij)) ∘ f'(y_i^k)    (3)

where δ_i^k denotes the local error of the i-th feature image of layer k, δ_j^(k+1) denotes the local error of the j-th feature image of layer k+1, the extend function zero-pads the feature-image local error to the expanded size, the rot180 function rotates the convolution kernel by 180 degrees, * denotes the convolution operation, ∘ denotes element-wise multiplication, W_ij denotes the convolution kernel connecting the two feature images, and f'(y_i^k) denotes the derivative of the layer-k activation function with respect to the feature-image output value.
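An OpenCL C sketch of formula (3): one work-item per pooling-layer neuron performs the full convolution of the next layer's zero-padded ("extend") error maps with the 180-degree-rotated kernels; the rotation is realised here through the reversed index arithmetic rather than by materialising rot180(W). A tanh-style derivative is assumed; if the pooling layer applies no activation, the factor f'(y) is simply 1.

```c
/* OpenCL C sketch of formula (3). Global dims: (w, h, batch * in_maps). */
__kernel void pool_error_from_conv(__global const float *delta_next, /* B x OC x oh x ow */
                                   __global const float *weights,    /* OC x IC x K x K  */
                                   __global const float *pool_out,   /* B x IC x h x w   */
                                   __global float *delta_pool,       /* B x IC x h x w   */
                                   const int w, const int h,         /* pooling map size */
                                   const int in_maps, const int out_maps,
                                   const int K)
{
    const int x  = get_global_id(0);
    const int y  = get_global_id(1);
    const int z  = get_global_id(2);
    const int ic = z % in_maps;               /* pooling-layer map    */
    const int b  = z / in_maps;               /* sample within batch  */
    const int ow = w - K + 1;                 /* next conv layer size */
    const int oh = h - K + 1;

    float acc = 0.0f;
    for (int oc = 0; oc < out_maps; ++oc)
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx) {
                int ox = x - kx, oy = y - ky;          /* full convolution   */
                if (ox < 0 || oy < 0 || ox >= ow || oy >= oh)
                    continue;                          /* zero-padded region */
                int d_idx = ((b * out_maps + oc) * oh + oy) * ow + ox;
                int w_idx = ((oc * in_maps + ic) * K + ky) * K + kx;
                acc += delta_next[d_idx] * weights[w_idx];
            }

    int idx = ((b * in_maps + ic) * h + y) * w + x;
    const float o = pool_out[idx];
    delta_pool[idx] = acc * (1.0f - o * o);            /* multiply by f'(y) */
}
```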
For a batch-scale convolutional layer, the average gradient over the feature images of the batch is calculated, and an OpenCL kernel function over a three-dimensional index space updates the convolution kernel parameters in parallel. The update uses the following formula (4):

    W_ij(n+1) = W_ij(n) − (α / B) · Σ_(b=1..B) (x_i^b * δ_j^b)    (4)
For a batch-scale convolutional layer, the average local error over the feature images of the batch is calculated, and an OpenCL kernel function over a one-dimensional index space updates the biases in parallel. The update uses the following formula (5):

    b_j(n+1) = b_j(n) − (α / B) · Σ_(b=1..B) Σ_(u,v) δ_j^b(u,v)    (5)
For a single-sample fully connected layer, the gradient of the single feature image is calculated, and an OpenCL kernel function over a two-dimensional index space updates the weight parameters in parallel. The update uses the following formula (6):

    W_ij(n+1) = W_ij(n) − α · x_i · δ_j    (6)
For a single-sample fully connected layer, the local error of the single feature image is calculated, and an OpenCL kernel function over a one-dimensional index space updates the bias parameters in parallel. The update uses the following formula (7):

    b_j(n+1) = b_j(n) − α · δ_j    (7)
In formulas (4)-(7), n denotes the iteration number, α denotes the network learning rate, and B denotes the number of samples in one batch. For a convolutional layer, W_ij denotes the convolution kernel connecting the i-th feature map of the previous layer and the j-th feature map of the current layer, and x_i^b * δ_j^b denotes the convolution of the output of the i-th feature map of the previous layer with the local error of the j-th feature map of the current layer for sample b. For a fully connected layer, W_ij denotes the weight connecting the i-th neuron of the previous layer and the j-th neuron of the current layer, and x_i · δ_j denotes the product of the output of the i-th neuron of the previous layer and the local error of the j-th neuron of the current layer.
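An OpenCL C sketch of the batch-averaged convolution-kernel update of formula (4); one work-item per kernel weight accumulates the gradient over all B samples and then applies the averaged update. The bias update of formula (5) and the single-sample fully connected updates of formulas (6) and (7) follow the same pattern with one- and two-dimensional index spaces. Layouts and names are illustrative.

```c
/* OpenCL C sketch of formula (4). Global dims: (K, K, in_maps * out_maps). */
__kernel void update_conv_weights(__global float *w,              /* OC x IC x K x K  */
                                  __global const float *prev_out, /* B x IC x ih x iw */
                                  __global const float *delta,    /* B x OC x oh x ow */
                                  const int in_w, const int in_h,
                                  const int out_w, const int out_h,
                                  const int in_maps, const int out_maps,
                                  const int K, const int B,
                                  const float alpha)              /* learning rate */
{
    const int kx = get_global_id(0);
    const int ky = get_global_id(1);
    const int z  = get_global_id(2);
    const int ic = z % in_maps;
    const int oc = z / in_maps;

    float grad = 0.0f;
    for (int b = 0; b < B; ++b)                       /* accumulate over the batch */
        for (int y = 0; y < out_h; ++y)
            for (int x = 0; x < out_w; ++x) {
                int d_idx = ((b * out_maps + oc) * out_h + y) * out_w + x;
                int i_idx = ((b * in_maps + ic) * in_h + (y + ky)) * in_w + (x + kx);
                grad += delta[d_idx] * prev_out[i_idx];
            }

    int w_idx = ((oc * in_maps + ic) * K + ky) * K + kx;
    w[w_idx] -= alpha * grad / (float)B;              /* average-gradient update */
}
```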
After the current batch has been updated, the data of the next batch are transferred, and this is repeated until a preset number of iterations is reached or the error falls below a threshold, at which point training stops.
The training method of the convolutional neural network is applicable to, but not limited to, any of the following models: LeNet, AlexNet, VGG-Net, GoogLeNet, ResNet.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the invention may be subject to various modifications and variations. Any modification, equivalent substitution, improvement or the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (2)
1. A convolutional neural network parallelized training method, characterized by comprising the following steps:
1) implementing the parallel processing of each layer of the convolutional neural network on an FPGA (field-programmable gate array), and creating shared memory accessible to both the CPU and the FPGA for the model's structural parameters and trainable parameters, the structural parameters including the outputs and local errors of the network layers at all levels, and the trainable parameters including the convolution kernels and bias vectors of the convolutional layers at all levels and the weight matrices and bias vectors of the fully connected layers;
2) creating, according to the type of each layer in the convolutional neural network to be trained, memory space for the feature-image outputs and local errors of different batch scales;
3) creating the shared memory in an aligned manner and transferring data between the host and the FPGA device by DMA (direct memory access), the data in shared memory being continuously computed on and passed between network layers throughout the training process;
4) during the feed-forward computation, randomly reading in the fully connected layer the data of one feature image from the previous layer's batch and recording its index within the batch, and during back-propagation calculating the output-layer error using the label data corresponding to that index;
5) when updating local errors, according to the chain rule of the error back-propagation algorithm, having the network layers that hold a single sample directly update the local error back-propagated from the output layer, having the last batch-scale network layer use the local error of the following layer to randomly update the local error of one of its current samples, and having the preceding batch-scale network layers update the local errors of their corresponding multiple samples in parallel, layer by layer;
6) for a batch-scale convolutional layer, calculating the average gradient over the feature images of the batch and updating the convolution kernel parameters in parallel, and calculating the average local error over the batch and updating the bias parameters in parallel;
7) for a single-sample fully connected layer, calculating the gradient of the single feature image and updating the weight parameters in parallel, and calculating the local error of the single feature image and updating the bias parameters in parallel;
8) after the current batch has been updated, transferring the data of the next batch, and repeating until a preset number of iterations is reached or the error falls below a threshold, whereupon training stops.
2. The method according to claim 1, characterized in that the layers of the convolutional neural network store outputs and local errors of different batch scales, the batch scale referring to the number of samples selected from the training set at a time; in the training method, the convolutional layers and pooling layers store the outputs and local errors of batch-scale samples, while the fully connected layers store the output and local error of a single sample;
during the feed-forward computation of the convolutional neural network, when computation passes from a batch-scale layer to a single-sample layer, one sample is selected at random and its index is recorded, and the output-layer error is calculated using the label data corresponding to that index;
during the backward computation of the convolutional neural network, when computation passes from a single-sample layer to a batch-scale layer, the local-error update of the batch-scale layer is completed according to the sample index recorded during the feed-forward computation;
when the local error of a convolutional layer is calculated and the next layer is a pooling layer, average pooling is used, and the local error of the pooling layer is multiplied by the error scaling factor λ to obtain the local error value of the corresponding convolutional-layer neuron, so as to fine-tune the convolution kernel parameters and biases.
Priority Application (1)
- Application number: CN201810037896.9A; priority date: 2018-01-16; filing date: 2018-01-16; title: A convolutional neural network parallelized training acceleration method
Publication (1)
- Publication number: CN108090565A; publication date: 2018-05-29
Family
- ID: 62182295; country: CN; status: Pending
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- WD01: Invention patent application deemed withdrawn after publication (application publication date: 2018-05-29)