CN107657233A - Real-time static sign language recognition method based on an improved single-shot multi-target detector - Google Patents

Real-time static sign language recognition method based on an improved single-shot multi-target detector

Info

Publication number
CN107657233A
CN107657233A (application CN201710899126.0A)
Authority
CN
China
Prior art keywords
sign language
network
image
static sign
detection device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710899126.0A
Other languages
Chinese (zh)
Inventor
张勋
陈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN201710899126.0A
Publication of CN107657233A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a real-time static sign language recognition method based on an improved single-shot multi-target detector, comprising the following steps: preprocessing static sign language sample images; building and augmenting a static sign language image data set; constructing a deep learning network based on the improved single-shot multi-target detector, the network being divided into a base network layer and extra convolutional feature layers, where the base network layer performs feature extraction and converts the input image into a multi-dimensional feature representation, and the extra convolutional layers act as a feature selection strategy that uses small convolutional filters to predict the class scores and position offsets of a fixed set of default bounding boxes on the feature maps, while predictions at different scales are produced from feature maps of different scales; training the network on the static sign language data set, and feeding sign language video captured by a camera in real time into the trained network to realize real-time static sign language recognition. The present invention greatly improves recognition speed while maintaining recognition accuracy.

Description

Real-time static sign language recognition method based on an improved single-shot multi-target detector
Technical field
The present invention relates to the technical field of sign language recognition, and more particularly to a real-time static sign language recognition method based on an improved single-shot multi-target detector.
Background technology
Sign language is an effective means by which deaf people communicate with gestures instead of speech. Research on sign language recognition can help communication between deaf people who have not received a good education, as well as communication between deaf people and hearing people. Sign language recognition is also a convenient mode of human-computer interaction, and research on it can advance fields such as intelligent machine operation, mobile terminal operation, access control systems and remote control. Furthermore, studying sign language recognition can aid computer understanding of human language.
Sign language recognition based on monocular vision uses an ordinary camera for information input and a computer algorithm for recognition. Compared with methods that feed information from sensors or other digital devices into a computer, it has the advantages of low equipment requirements, convenient operation and low cost, and it has attracted increasing interest from researchers. In the field of sign language recognition, a traditional complete recognition method usually comprises three stages: segmentation, feature extraction and recognition. 1) Segmentation: common methods include models based on motion information, motion templates and skin color information. 2) Feature extraction: common methods include feature extraction based on histograms of oriented gradients (HOG), on local binary pattern (LBP) textures, and on convolutional neural network (CNN) features. 3) Gesture recognition: common methods include the multilayer perceptron (MLP) based on artificial neural networks and the support vector machine (SVM) based on supervised learning models.
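For orientation only, a minimal sketch of such a traditional HOG plus SVM pipeline (not the method claimed by this patent) could look as follows in Python; the feature and classifier parameters are illustrative assumptions:

```python
# Illustrative sketch of a traditional HOG + SVM sign classifier.
# All parameter values here are assumptions, not taken from the patent.
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def extract_hog(gray_image):
    # gray_image: 2-D numpy array, already segmented and resized (e.g. 64x64)
    return hog(gray_image, orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_traditional_classifier(images, labels):
    # images: list of equally sized grayscale hand images; labels: class ids
    features = np.array([extract_hog(img) for img in images])
    clf = SVC(kernel="rbf", C=10.0)   # hyperparameters are assumptions
    clf.fit(features, labels)
    return clf
```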
Although static sign language recognition has been studied for a long time, the human hand skeleton is not uniform, hand shapes are highly variable and the sign language vocabulary is large, so the characteristic information is difficult to obtain flexibly. Describing sign language features with hand-designed descriptors is cumbersome and cannot mine deep-level feature information, which results in models with poor adaptability that have difficulty achieving good real-time performance and accurate recognition in vision-based sign language recognition.
Deep learning methods address exactly these pain points. Deep learning models are regarded as a disruptive technology in the field of machine learning; they realize supervised and unsupervised feature extraction and transformation through combinations of multiple nonlinear layers, so as to achieve pattern analysis and classification. Researchers at a large number of scientific institutions and enterprises have conducted extensive research on deep learning technology and its applications and have achieved remarkable results in fields such as speech and images. Deeper network structures can learn more numerous and more complex features, and these abstract representations can describe the variation of images more flexibly and more accurately.
In order to achieve both good real-time performance and high accuracy, researchers have made various efforts. Ross B. Girshick et al. proposed the region-based convolutional neural network (R-CNN), which performs convolutional feature extraction on candidate regions generated from the image and then classifies them to obtain bounding boxes, converting object detection into a classification problem. Although this was a breakthrough for object detection, training the feature extraction network and the classifier network separately is quite time-consuming, so real-time performance cannot be guaranteed. Ross B. Girshick improved the R-CNN network by merging feature extraction and classification into a single network together with a selective search algorithm, publishing the fast region-based convolutional neural network (Fast R-CNN), which further improved training speed and detection accuracy. Later, a region proposal network (RPN) was proposed to optimize the generation of candidate regions, further improving speed, and the accelerated region-based convolutional neural network (Faster R-CNN) was published.
The above methods have become milestones in the field of detection and recognition, but although their accuracy is good, their computational cost is too high for embedded systems and they are too slow for real-time or near-real-time applications even on high-end hardware, or they sacrifice detection accuracy in exchange for speed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a real-time static sign language recognition method based on an improved single-shot multi-target detector that greatly improves recognition speed while maintaining recognition accuracy, so as to meet real-time requirements.
The technical solution adopted by the present invention to solve this technical problem is to provide a real-time static sign language recognition method based on an improved single-shot multi-target detector, comprising the following steps:
(1) preprocessing static sign language sample images;
(2) building and augmenting a static sign language image data set;
(3) constructing a deep learning network based on the improved single-shot multi-target detector, the deep learning network being divided into a base network layer and extra convolutional feature layers; the base network layer performs feature extraction and converts the input image into a multi-dimensional feature representation; the extra convolutional layers act as a feature selection strategy that uses small convolutional filters to predict the class scores and position offsets of a fixed set of default bounding boxes on the feature maps, while predictions at different scales are produced from feature maps of different scales;
(4) training the network on the static sign language data set, and feeding sign language video captured by a camera in real time into the trained network to realize real-time static sign language recognition.
Step (1) is specifically: recording static sign language video, extracting frames from the video as images, manually removing images with severe motion blur or severe occlusion, and enhancing the images with a high-pass filtering method.
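As an illustration only, the frame extraction and high-pass enhancement described in step (1) could be sketched in Python with OpenCV as follows; the frame step and kernel size are assumptions, not values given in the patent:

```python
# Illustrative sketch: extract frames from a recorded sign language video and
# apply a simple high-pass style sharpening. Frame step and kernel are assumptions.
import cv2

def extract_and_enhance(video_path, step=10):
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # High-pass enhancement: add the high-frequency residual back to the image.
            blurred = cv2.GaussianBlur(frame, (5, 5), 0)
            high_pass = cv2.subtract(frame, blurred)
            frames.append(cv2.add(frame, high_pass))
        idx += 1
    cap.release()
    return frames
```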
In step (2), the constructed static sign language data set includes the original sample images and the label images obtained by manually annotating the original sample images, and the image annotation boxes recorded in the annotation information correspond one-to-one with the original images; the data set is augmented by mirroring the original images and re-annotating the corresponding images.
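A minimal sketch of this mirror augmentation with bounding-box re-annotation, assuming boxes stored as (xmin, ymin, xmax, ymax) pixel coordinates, might look as follows:

```python
# Illustrative sketch: horizontal mirror augmentation with box re-annotation.
# Boxes are assumed to be (xmin, ymin, xmax, ymax) in pixel coordinates.
import cv2

def mirror_sample(image, boxes):
    height, width = image.shape[:2]
    flipped = cv2.flip(image, 1)  # flip around the vertical axis
    flipped_boxes = [(width - xmax, ymin, width - xmin, ymax)
                     for (xmin, ymin, xmax, ymax) in boxes]
    return flipped, flipped_boxes
```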
The base network layer in step (3) is an AlexNet network with the fully connected layers removed, five layers in total, and max pooling is used for pooling; the extra convolutional network has nine layers, of which eight are convolutional layers and one is an average pooling layer.
The extra convolutional feature layers are added to the end of the truncated base network and decrease in size layer by layer, yielding predictions at multiple detection scales; a set of convolutional filters is applied to each added feature layer to produce a prediction set, giving class scores or coordinate offsets relative to the default boxes; the coordinate offsets are measured relative to the default boxes, while the default box positions are relative to the feature map.
The rule for producing the prediction set is: for a feature layer of size m*n with p channels, a 3*3*p convolution kernel is applied to produce a class score or a coordinate offset relative to a default box, and one output value is produced at each of the m*n positions where the kernel is applied.
The training of the deep learning network of step (3) comprises the following steps: (31) matching strategy: during training, the correspondence between ground-truth labels and default boxes must be established, and a default box is matched when its overlap with a ground-truth label is higher than a certain threshold; (32) training objective: the objective function is derived from MultiBox, and the overall loss function is the weighted sum of the localization loss and the confidence loss, where the localization loss is the Smooth L1 loss between the predicted box and ground-truth box parameters, and the confidence loss is a softmax loss over the multi-class confidences, with the weight term set to 1 by cross-validation; (33) selection of default box scales and aspect ratios: predictions from several feature maps over default boxes of different sizes and aspect ratios at all positions are combined to cover a variety of input object sizes and shapes.
Step (4) is specifically: acquiring sign language images in real time with a monocular camera, feeding the images into the deep learning network of the improved single-shot multi-target detector, obtaining the classification and detection results, and thereby realizing real-time static sign language recognition.
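A minimal sketch of such a real-time recognition loop, assuming a hypothetical trained model assd_model whose forward pass returns boxes, class labels and scores in frame coordinates, might be:

```python
# Illustrative sketch of real-time recognition from a monocular camera.
# `assd_model`, its input size and its output format are assumptions.
import cv2
import torch

def run_realtime(assd_model, class_names, input_size=300):
    assd_model.eval()
    cap = cv2.VideoCapture(0)  # monocular webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        blob = cv2.resize(frame, (input_size, input_size)).astype("float32") / 255.0
        tensor = torch.from_numpy(blob).permute(2, 0, 1).unsqueeze(0)
        with torch.no_grad():
            boxes, labels, scores = assd_model(tensor)  # assumed output format
        for box, label, score in zip(boxes, labels, scores):
            if score > 0.5:
                x1, y1, x2, y2 = [int(v) for v in box]
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, f"{class_names[int(label)]} {score:.2f}",
                            (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.imshow("static sign language recognition", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()
```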
Beneficial effect
Owing to the above technical scheme, the present invention has the following advantages and positive effects compared with the prior art: the present invention does not need hand-designed descriptors of static sign language features, and the convolutional neural network used can obtain deeper-level feature information, so the model adapts well; moreover, small convolutional filters are used to predict the class scores and position offsets of a fixed set of default bounding boxes on the feature maps, while predictions at different scales are produced from feature maps of different scales, which greatly improves recognition speed and meets the real-time requirement of static sign language recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 is the overall structure diagram of ASSD, the deep learning network of the improved single-shot multi-target detector;
Fig. 3 is a schematic diagram of the base network layer of the ASSD network of the invention;
Fig. 4 shows experimental results of static sign language recognition according to the present invention.
Embodiment
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are only intended to illustrate the present invention and not to limit its scope. In addition, it should be understood that, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications to the present invention, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
The embodiment of the present invention relates to a real-time static sign language recognition method based on an improved single-shot multi-target detector. As shown in Fig. 1, it comprises the following steps: first, the static sign language images are manually labeled to obtain the label maps corresponding to the sign language images; then the deep learning network of the improved single-shot multi-target detector is built, and the training images together with their label maps are fed into the constructed network for iterative learning to obtain the model parameters of the network; then a test image is input, and the deep learning network of the improved single-shot multi-target detector processes the test image according to the model parameters trained above; finally, the class labels of the static signs appearing in the test image and the corresponding probability values are obtained. The details are as follows:
Step 1: preprocess the static sign language sample images. The experimental data were collected with a high-definition monocular camera. Five representative letters were chosen from the 26 letter signs for static sign language recognition in the experiment, namely A, B, C, D and E. The experimental data were recorded by 8 people, each person recording a video for each letter; frames were then extracted with a Matlab frame-extraction program, images with severe motion blur or severe occlusion were removed manually, and images with poor display quality were enhanced with a high-pass filtering method to facilitate target recognition, giving a preliminary data set with an image size of 640*480.
Step 2: build and augment the static sign language image data set. The constructed static sign language data set includes the original sample images and the label images obtained by manually annotating the original sample images; the image annotation boxes recorded in the annotation information correspond one-to-one with the original images. The data set is augmented by mirroring the original images and re-annotating the corresponding images. The final data set is shown in Table 1: the training set contains 2311 images for letter A, 2606 for letter B, 2581 for letter C, 2667 for letter D and 2659 for letter E, 13024 in total; the test set contains 500 images for each of the letters A, B, C, D and E, 2500 in total. The LabelImg program was used for manual annotation to obtain the ground-truth label files.
Table 1. Static sign language data set
Step 3: build ASSD, the deep learning network based on the improved single-shot multi-target detector. The static sign language image data set obtained in step 2 is used to train this network. As shown in Fig. 2, the network structure consists of two parts: the base network layer, i.e. the feature extraction layer, which is a 5-layer network, and the extra convolutional feature layers, which form a 9-layer network. The role of the base network layer is to extract the features of the original image through a series of convolution, activation and pooling operations, thereby obtaining feature maps; the role of the extra convolutional layers is to predict positions and obtain confidences.
As shown in Fig. 3, the role of the base network layer is feature extraction, and the feature extraction network f can be regarded as a series of convolution, activation and pooling operations. AlexNet with the fully connected layers removed is used as the convolutional network, so the convolutional network of the present invention has five layers. Assuming the base network is f with parameters θ, the mathematical expression of f is:

f(X; θ) = W_L · H_{L-1}

H_l = pool(relu(W_l · H_{l-1} + b_l))

where X is the input image of the convolutional neural network, H_l is the output of the hidden units of layer l, b_l is the bias of layer l, W_l is the weight of layer l, b_l and W_l together form the trainable parameters θ, pool(·) denotes the pooling operation and relu(·) denotes the activation operation. The pooling operation aggregates the feature points in a small neighborhood into new features, so the features are reduced, the number of parameters is reduced, and the pooling unit has translation invariance. Pooling methods mainly include average pooling and max pooling; the present invention mainly uses max pooling.
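For illustration, a base network of this kind (AlexNet with the fully connected layers removed, five convolutional layers, max pooling) can be obtained in PyTorch roughly as follows; the use of torchvision and the 300*300 input size are assumptions, not details given in the patent:

```python
# Illustrative sketch: AlexNet base network with the fully connected layers removed.
# torchvision's AlexNet feature stack has exactly five convolutional layers and uses max pooling.
import torch
import torchvision

def build_base_network():
    alexnet = torchvision.models.alexnet(weights=None)   # torchvision >= 0.13 API
    return alexnet.features  # conv/ReLU/max-pool stack only, no fully connected layers

if __name__ == "__main__":
    base = build_base_network()
    dummy = torch.randn(1, 3, 300, 300)   # assumed input size
    print(base(dummy).shape)              # multi-dimensional feature representation
```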
The extra convolutional layers are added at the end of the AlexNet base network from which the fully connected layers were removed; they form a 9-layer network, of which 8 are convolutional layers and 1 is an average pooling layer, and they have the following characteristics:
(1) Multi-scale feature map detection: extra convolutional feature layers are added to the end of the truncated base network, and these layers decrease in size progressively, yielding predictions at multiple detection scales. The convolutional detection model is different for each feature layer.
(2) Convolutional predictors for detection: in each added feature layer (as well as the existing feature layer of the base network), a set of convolutional filters can produce a prediction set, as shown in Fig. 2. The rule for producing the prediction set is: for a feature layer of size m*n with p channels, a 3*3*p convolution kernel is applied to produce a class score or a coordinate offset relative to a default box, and one output value is produced at each of the m*n positions where the kernel is applied. The bounding box offset output values are measured relative to the default boxes, while the default box positions are relative to the feature map.
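As a sketch of such a convolutional predictor, a 3*3 convolution attached to one feature layer can output, at every feature map position, the class scores and the four box offsets for each default box; the channel counts and class count below are assumed values:

```python
# Illustrative sketch: a 3x3 convolutional predictor attached to one feature layer.
# For k default boxes per position and C classes, it outputs k*C class scores and
# k*4 coordinate offsets at every m*n location of the p-channel feature map.
import torch
import torch.nn as nn

class ConvPredictor(nn.Module):
    def __init__(self, in_channels, num_default_boxes, num_classes):
        super().__init__()
        self.cls_head = nn.Conv2d(in_channels, num_default_boxes * num_classes,
                                  kernel_size=3, padding=1)
        self.loc_head = nn.Conv2d(in_channels, num_default_boxes * 4,
                                  kernel_size=3, padding=1)

    def forward(self, feature_map):   # feature_map: (batch, p, m, n)
        return self.cls_head(feature_map), self.loc_head(feature_map)

# Example with assumed sizes: p = 256 channels, 6 default boxes, 6 classes (5 letters + background).
predictor = ConvPredictor(256, 6, 6)
scores, offsets = predictor(torch.randn(1, 256, 8, 8))
```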
(3) Default boxes and aspect ratios: a set of default bounding boxes is associated with each feature map cell of the overlaid network. The default boxes tile the feature map in a convolutional manner, so that the position of each box instance relative to its corresponding cell is fixed. In each feature map cell, the offsets relative to the default box shapes in the cell and the per-class scores of the instance in each box are predicted.
The network structure ASSD of the present invention is shown in Fig. 2; the layers Conv5, Conv7 (originally the 7th, fully connected, layer), Conv8_2, Conv9_2, Conv10_2 and pool11 are used to predict positions and compute confidences. The present invention uses the "xavier" method to initialize the parameters of all newly added convolutional layers. Because the size of Conv4_3 is relatively large (38 × 38), only 3 default boxes are placed on it, comprising a box of scale 0.1 and boxes with aspect ratios 0.5 and 2. For all other layers, 6 default boxes are set. Since Conv4_3 has a feature scale different from the other layers, the present invention uses the L2 normalization technique to scale the feature norm at each position of the feature map to 20, and this scale is learned during back-propagation. The present invention uses a learning rate of 10^-3 for 40k iterations of learning, and then learning rates of 10^-4 and 10^-5 for a further 10k iterations.
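A minimal sketch of the L2 normalization with a learnable scale initialized to 20, as described above, might be (the layer name and framework are assumptions):

```python
# Illustrative sketch: L2 normalization of a feature map with a learnable
# per-channel scale, initialized to 20 as described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2Norm(nn.Module):
    def __init__(self, num_channels, initial_scale=20.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((num_channels,), initial_scale))

    def forward(self, x):                  # x: (batch, channels, h, w)
        x = F.normalize(x, p=2, dim=1)     # unit L2 norm along the channel axis
        return x * self.scale.view(1, -1, 1, 1)
```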
The key to training the ASSD network of the present invention is to assign the ground-truth labels in the training images to those fixed output default boxes; this has the following characteristics:
(1) Matching strategy: during training, the correspondence between ground-truth labels and default boxes must be established, and a default box is matched when its Jaccard overlap with a ground-truth label is higher than a certain threshold (0.5).
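For illustration, the Jaccard overlap computation and threshold matching could be sketched as follows; the (xmin, ymin, xmax, ymax) box format is an assumption:

```python
# Illustrative sketch: match default boxes to ground-truth labels by Jaccard overlap.
# Boxes are assumed to be (xmin, ymin, xmax, ymax).

def jaccard(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, ground_truths, threshold=0.5):
    # Returns, for every default box, the index of the matched ground truth or -1.
    matches = []
    for d in default_boxes:
        overlaps = [jaccard(d, g) for g in ground_truths]
        best = max(range(len(overlaps)), key=lambda i: overlaps[i]) if overlaps else -1
        matches.append(best if best >= 0 and overlaps[best] > threshold else -1)
    return matches
```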
(2) Training objective: the objective function of ASSD training is derived from MultiBox. Let x_ij^p = 1 indicate that the i-th default box is matched to the j-th ground-truth box of class p, and x_ij^p = 0 otherwise; according to the matching strategy there is necessarily Σ_i x_ij^p ≥ 1, which means that the j-th ground truth may be matched to more than one default box. The overall loss function L(x, c, l, g) is the weighted sum of the localization loss L_loc and the confidence loss L_conf, as shown below:

L(x, c, l, g) = (1/N) (L_conf(x, c) + α · L_loc(x, l, g))

where N is the number of matched default boxes and x is the indicator variable defined over the input image.

The localization loss L_loc is the Smooth L1 loss between the predicted box l and the ground-truth box g parameters, regressing the offsets of the center (cx, cy) and the width w and height h of the bounding box d:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_ij^k · smooth_L1(l_i^m − ĝ_j^m)

The confidence loss L_conf is a softmax loss over the multi-class confidences c, and the weight term α is set to 1 by cross-validation, as shown below:

L_conf(x, c) = − Σ_{i∈Pos} x_ij^p · log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0), where ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p).
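A compact PyTorch-style sketch of this combined objective (tensor shapes are assumptions and the hard-negative handling of the confidence loss is simplified away) might be:

```python
# Illustrative sketch: MultiBox-style loss = SmoothL1 localization loss + softmax
# confidence loss, weighted by alpha = 1 and normalized by the number of matches N.
import torch
import torch.nn.functional as F

def multibox_loss(pred_offsets, pred_logits, gt_offsets, gt_labels, alpha=1.0):
    # pred_offsets: (num_boxes, 4); pred_logits: (num_boxes, num_classes)
    # gt_offsets:   (num_boxes, 4); gt_labels:   (num_boxes,) long tensor, 0 = background
    pos = gt_labels > 0
    num_matched = pos.sum().clamp(min=1).float()

    loc_loss = F.smooth_l1_loss(pred_offsets[pos], gt_offsets[pos], reduction="sum")
    conf_loss = F.cross_entropy(pred_logits, gt_labels, reduction="sum")  # softmax loss

    return (conf_loss + alpha * loc_loss) / num_matched
```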
(3) Selection of default box scales and aspect ratios: making predictions on the feature maps of different layers within a single network and sharing parameters across all object scales reduces computation and memory requirements; moreover, in this network a specific position on a feature map is responsible for a specific region of the image and for specific object sizes, so the default boxes do not need to correspond to the receptive field in each layer. By combining the predictions of several feature maps over default boxes of all the different sizes and aspect ratios at all positions, a diversified set of predictions is obtained that covers a variety of input object sizes and shapes.
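For illustration only, default boxes for one feature map can be generated per cell from a scale and a set of aspect ratios; the scales and aspect ratios below are assumptions rather than the exact configuration used by the patent:

```python
# Illustrative sketch: generate default boxes (cx, cy, w, h), normalized to [0, 1],
# for an m x n feature map with a given scale and aspect ratios.
import math

def default_boxes_for_layer(m, n, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    for i in range(m):
        for j in range(n):
            cy = (i + 0.5) / m
            cx = (j + 0.5) / n
            for ar in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

# Example with assumed sizes: a coarse 8x8 map with larger boxes, a fine 38x38 map with smaller ones.
coarse = default_boxes_for_layer(8, 8, scale=0.6)
fine = default_boxes_for_layer(38, 38, scale=0.1)
```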
Step 4: train the network with the static sign language data set obtained in step 2. The present invention fine-tunes the model with SGD (stochastic gradient descent), using the initial learning rate given above, a momentum of 0.9, a weight decay of 0.0005, a batch size of 32, and so on. A model parameter set with good performance is selected for the experimental tests.
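A minimal PyTorch sketch of this fine-tuning configuration (the model, dataset, loss function and the 10^-3 initial learning rate are assumptions based on the values mentioned above) might be:

```python
# Illustrative sketch: SGD fine-tuning with momentum 0.9, weight decay 0.0005,
# batch size 32 and an assumed initial learning rate of 1e-3.
import torch
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset, loss_fn, num_iterations=40000):
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=0.0005)
    model.train()
    it = 0
    while it < num_iterations:
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)   # assumed call signature
            loss.backward()
            optimizer.step()
            it += 1
            if it >= num_iterations:
                break
    return model
```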
Fig. 4 shows the experimental results of static sign language recognition. The recognition results for the 5 manual letters A, B, C, D and E are shown, including the class label and probability of the sign to be recognized and the current recognition frame rate. It can be seen that the deep learning network of the improved single-shot multi-target detector of this embodiment performs very well on static sign language, and the recognition speed is also fast, meeting the requirement of real-time recognition.

Claims (8)

  1. A real-time static sign language recognition method based on an improved single-shot multi-target detector, characterized by comprising the following steps:
    (1) preprocessing static sign language sample images;
    (2) building and augmenting a static sign language image data set;
    (3) constructing a deep learning network based on the improved single-shot multi-target detector, the deep learning network being divided into a base network layer and extra convolutional feature layers; wherein the base network layer performs feature extraction and converts the input image into a multi-dimensional feature representation; and the extra convolutional layers act as a feature selection strategy that uses small convolutional filters to predict the class scores and position offsets of a fixed set of default bounding boxes on the feature maps, while predictions at different scales are produced from feature maps of different scales;
    (4) training the network on the static sign language data set, and feeding sign language video captured by a camera in real time into the trained network to realize real-time static sign language recognition.
  2. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 1, characterized in that step (1) is specifically: recording static sign language video, extracting frames from the video as images, manually removing images with severe motion blur or severe occlusion, and enhancing the images with a high-pass filtering method.
  3. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 1, characterized in that in step (2) the constructed static sign language data set includes the original sample images and the label images obtained by manually annotating the original sample images, the image annotation boxes recorded in the annotation information corresponding one-to-one with the original images; and the data set is augmented by mirroring the original images and re-annotating the corresponding images.
  4. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 1, characterized in that the base network layer in step (3) is an AlexNet network with the fully connected layers removed, five layers in total, with max pooling used for pooling; and the extra convolutional network has nine layers, of which eight are convolutional layers and one is an average pooling layer.
  5. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 4, characterized in that the extra convolutional feature layers are added to the end of the truncated base network and decrease in size layer by layer, yielding predictions at multiple detection scales; a set of convolutional filters is applied to each added feature layer to produce a prediction set, giving class scores or coordinate offsets relative to the default boxes; and the coordinate offsets are measured relative to the default boxes, while the default box positions are relative to the feature map.
  6. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 5, characterized in that the rule for producing the prediction set is: for a feature layer of size m*n with p channels, a 3*3*p convolution kernel is applied to produce a class score or a coordinate offset relative to a default box, and one output value is produced at each of the m*n positions where the kernel is applied.
  7. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 4, characterized in that the training of the deep learning network of step (3) comprises the following steps: (31) matching strategy: during training, the correspondence between ground-truth labels and default boxes must be established, and a default box is matched when its overlap with a ground-truth label is higher than a certain threshold; (32) training objective: the objective function is derived from MultiBox, and the overall loss function is the weighted sum of the localization loss and the confidence loss, wherein the localization loss is the Smooth L1 loss between the predicted box and ground-truth box parameters, and the confidence loss is a softmax loss over the multi-class confidences, with the weight term set to 1 by cross-validation; (33) selection of default box scales and aspect ratios: predictions from several feature maps over default boxes of different sizes and aspect ratios at all positions are combined to cover a variety of input object sizes and shapes.
  8. The real-time static sign language recognition method based on an improved single-shot multi-target detector according to claim 1, characterized in that step (4) is specifically: acquiring sign language images in real time with a monocular camera, feeding the images into the deep learning network of the improved single-shot multi-target detector, obtaining classification and detection results, and realizing real-time static sign language recognition.
CN201710899126.0A 2017-09-28 2017-09-28 Static sign language real-time identification method based on modified single multi-target detection device Pending CN107657233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710899126.0A CN107657233A (en) 2017-09-28 2017-09-28 Static sign language real-time identification method based on modified single multi-target detection device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710899126.0A CN107657233A (en) 2017-09-28 2017-09-28 Static sign language real-time identification method based on modified single multi-target detection device

Publications (1)

Publication Number Publication Date
CN107657233A 2018-02-02

Family

ID=61116880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710899126.0A Pending CN107657233A (en) 2017-09-28 2017-09-28 Static sign language real-time identification method based on modified single multi-target detection device

Country Status (1)

Country Link
CN (1) CN107657233A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846826A (en) * 2018-04-24 2018-11-20 深圳大学 Object detecting method, device, image processing equipment and storage medium
CN108875537A (en) * 2018-02-28 2018-11-23 北京旷视科技有限公司 Method for checking object, device and system and storage medium
CN109271901A (en) * 2018-08-31 2019-01-25 武汉大学 A kind of sign Language Recognition Method based on Multi-source Information Fusion
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109508715A (en) * 2018-10-30 2019-03-22 南昌大学 A kind of License Plate and recognition methods based on deep learning
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A kind of compound convolutional neural networks images of gestures recognition methods under complex background
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN110032980A (en) * 2019-04-18 2019-07-19 天津工业大学 A kind of organ detection and recognition positioning method based on deep learning
CN110288030A (en) * 2019-06-27 2019-09-27 重庆大学 Image-recognizing method, device and equipment based on lightweight network model
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110555371A (en) * 2019-07-19 2019-12-10 华瑞新智科技(北京)有限公司 Wild animal information acquisition method and device based on unmanned aerial vehicle
CN110717422A (en) * 2019-09-25 2020-01-21 北京影谱科技股份有限公司 Method and system for identifying interactive action based on convolutional neural network
CN111523530A (en) * 2020-04-13 2020-08-11 南京行者易智能交通科技有限公司 Mapping method of score map in target detection and target detection method
CN111562815A (en) * 2020-05-04 2020-08-21 北京花兰德科技咨询服务有限公司 Wireless head-mounted device and language translation system
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN117830859A (en) * 2024-03-05 2024-04-05 农业农村部南京农业机械化研究所 Automatic fruit tree target recognition method and system based on image processing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608432A (en) * 2015-12-21 2016-05-25 浙江大学 Instantaneous myoelectricity image based gesture identification method
CN106598226A (en) * 2016-11-16 2017-04-26 天津大学 UAV (Unmanned Aerial Vehicle) man-machine interaction method based on binocular vision and deep learning
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN106980365A (en) * 2017-02-21 2017-07-25 华南理工大学 The first visual angle dynamic gesture identification method based on depth convolutional neural networks framework

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608432A (en) * 2015-12-21 2016-05-25 浙江大学 Instantaneous myoelectricity image based gesture identification method
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN106598226A (en) * 2016-11-16 2017-04-26 天津大学 UAV (Unmanned Aerial Vehicle) man-machine interaction method based on binocular vision and deep learning
CN106980365A (en) * 2017-02-21 2017-07-25 华南理工大学 The first visual angle dynamic gesture identification method based on depth convolutional neural networks framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wei Liu et al.: "SSD: Single Shot MultiBox Detector", arXiv *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875537A (en) * 2018-02-28 2018-11-23 北京旷视科技有限公司 Method for checking object, device and system and storage medium
CN108875537B (en) * 2018-02-28 2022-11-08 北京旷视科技有限公司 Object detection method, device and system and storage medium
CN108846826A (en) * 2018-04-24 2018-11-20 深圳大学 Object detecting method, device, image processing equipment and storage medium
CN109376571A (en) * 2018-08-03 2019-02-22 西安电子科技大学 Estimation method of human posture based on deformation convolution
CN109271901A (en) * 2018-08-31 2019-01-25 武汉大学 A kind of sign Language Recognition Method based on Multi-source Information Fusion
CN109508715A (en) * 2018-10-30 2019-03-22 南昌大学 A kind of License Plate and recognition methods based on deep learning
CN109508715B (en) * 2018-10-30 2022-11-08 南昌大学 License plate positioning and identifying method based on deep learning
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A kind of compound convolutional neural networks images of gestures recognition methods under complex background
CN110008848A (en) * 2019-03-13 2019-07-12 华南理工大学 A kind of travelable area recognizing method of the road based on binocular stereo vision
CN110032980B (en) * 2019-04-18 2023-04-25 天津工业大学 Organ detection and identification positioning method based on deep learning
CN110032980A (en) * 2019-04-18 2019-07-19 天津工业大学 A kind of organ detection and recognition positioning method based on deep learning
CN110288030A (en) * 2019-06-27 2019-09-27 重庆大学 Image-recognizing method, device and equipment based on lightweight network model
CN110288030B (en) * 2019-06-27 2023-04-07 重庆大学 Image identification method, device and equipment based on lightweight network model
CN110555371A (en) * 2019-07-19 2019-12-10 华瑞新智科技(北京)有限公司 Wild animal information acquisition method and device based on unmanned aerial vehicle
CN110399850A (en) * 2019-07-30 2019-11-01 西安工业大学 A kind of continuous sign language recognition method based on deep neural network
CN110717422A (en) * 2019-09-25 2020-01-21 北京影谱科技股份有限公司 Method and system for identifying interactive action based on convolutional neural network
CN111523530B (en) * 2020-04-13 2021-04-02 南京行者易智能交通科技有限公司 Mapping method of score map in target detection and target detection method
CN111523530A (en) * 2020-04-13 2020-08-11 南京行者易智能交通科技有限公司 Mapping method of score map in target detection and target detection method
CN111562815B (en) * 2020-05-04 2021-07-13 北京花兰德科技咨询服务有限公司 Wireless head-mounted device and language translation system
CN111562815A (en) * 2020-05-04 2020-08-21 北京花兰德科技咨询服务有限公司 Wireless head-mounted device and language translation system
CN112614121A (en) * 2020-12-29 2021-04-06 国网青海省电力公司海南供电公司 Multi-scale small-target equipment defect identification and monitoring method
CN117830859A (en) * 2024-03-05 2024-04-05 农业农村部南京农业机械化研究所 Automatic fruit tree target recognition method and system based on image processing
CN117830859B (en) * 2024-03-05 2024-05-03 农业农村部南京农业机械化研究所 Automatic fruit tree target recognition method and system based on image processing

Similar Documents

Publication Publication Date Title
CN107657233A (en) Static sign language real-time identification method based on modified single multi-target detection device
CN108509839A (en) One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks
CN104537393B (en) A kind of traffic sign recognition method based on multiresolution convolutional neural networks
CN103605972B (en) Non-restricted environment face verification method based on block depth neural network
CN110287960A (en) The detection recognition method of curve text in natural scene image
Hoque et al. Real time bangladeshi sign language detection using faster r-cnn
CN106960206A (en) Character identifying method and character recognition system
Huang et al. Development and validation of a deep learning algorithm for the recognition of plant disease
CN107133616A (en) A kind of non-division character locating and recognition methods based on deep learning
CN106845430A (en) Pedestrian detection and tracking based on acceleration region convolutional neural networks
CN106845499A (en) A kind of image object detection method semantic based on natural language
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN106096557A (en) A kind of semi-supervised learning facial expression recognizing method based on fuzzy training sample
CN106372581A (en) Method for constructing and training human face identification feature extraction network
CN107529650A (en) The structure and closed loop detection method of network model, related device and computer equipment
CN107239762A (en) Patronage statistical method in a kind of bus of view-based access control model
CN112784763A (en) Expression recognition method and system based on local and overall feature adaptive fusion
Hao Multimedia English teaching analysis based on deep learning speech enhancement algorithm and robust expression positioning
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN110503090A (en) Character machining network training method, character detection method and character machining device based on limited attention model
Wang et al. Scene text recognition algorithm based on faster RCNN
Shinde et al. Math accessibility for blind people in society using machine learning
CN109284752A (en) A kind of rapid detection method of vehicle
Patel et al. Multiresolution technique to handwritten English character recognition using learning rule and Euclidean distance metric
US11521427B1 (en) Ear detection method with deep learning pairwise model based on contextual information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180202