CN115953839B

CN115953839B - Real-time 2D gesture estimation method based on loop architecture and key point regression

Info

Publication number: CN115953839B
Application number: CN202211675766.0A
Authority: CN
Inventors: 李观喜; 张磊; 梁倬华
Original assignee: Guangzhou Ziweiyun Technology Co ltd
Current assignee: Guangzhou Ziweiyun Technology Co ltd
Priority date: 2022-12-26
Filing date: 2022-12-26
Publication date: 2024-04-12
Anticipated expiration: 2042-12-26
Also published as: CN115953839A

Abstract

The invention provides a real-time 2D gesture estimation method based on a loop (recurrent) architecture and key point regression, belonging to the technical field of real-time 2D gesture estimation. The core modules of the method comprise an image acquisition module, a lightweight neural network algorithm module, a loop architecture module and a key point regression module. The key point regression approach consumes little time and few resources, and allows real-time operation and fully differentiable end-to-end training on mobile, embedded or low-cost hardware platforms. The loop architecture module enhances the model's performance on dynamic gesture estimation in video. The method thus achieves real-time, high-precision detection on mobile, embedded or low-cost hardware, effectively alleviates the degradation of detection performance caused by motion blur and self-occlusion in video, and enables rapid product deployment.

Description

Real-time 2D gesture estimation method based on loop architecture and key point regression

Technical Field

The invention belongs to the technical field of real-time 2D gesture estimation, and particularly relates to a real-time 2D gesture estimation method based on a loop architecture and key point regression.

Background

2D gesture estimation mainly detects 21 key points of the hand, through which the information expressed by different gestures can be described. 2D hand key point detection is one of the basic algorithms of computer vision and plays an important role in research in other related fields of computer vision. At present, the main hardware carriers of the metaverse are AR/VR devices and the like, which acquire images through a camera and produce corresponding feedback by analyzing the information expressed by the user's gestures.

Compared with body pose estimation, 2D gesture estimation is in fact a very challenging task. Its accuracy may be reduced because the hand joints are more flexible, are sensitive to motion, and are affected by self-occlusion. At present, methods based on Gaussian heat maps are the mainstream of this technology and are widely recognized for their accuracy. In the era of the industrial Internet, however, the combination of embedded systems and artificial intelligence is an inevitable trend, and 2D gesture estimation algorithms based on Gaussian heat maps usually cannot run in real time on mobile, embedded or low-cost hardware platforms; their detection results are also unsatisfactory under the motion blur and self-occlusion caused by dynamic gestures. Because Gaussian heat map methods consume a large amount of memory and have low inference speed, running them on low-cost hardware introduces large delays, which often degrades the experience of the whole product.

Therefore, it is necessary to invent a real-time 2D gesture estimation method based on a loop architecture and key point regression.

Disclosure of Invention

In order to solve the above-mentioned problems, the present invention provides a real-time 2D gesture estimation method based on a loop architecture and key point regression. The core modules of the method comprise an image acquisition module, a lightweight neural network algorithm module, a loop architecture module and a key point regression module, wherein the image acquisition module is a monocular camera;

the lightweight neural network algorithm module adopts MobileNetV3 as a lightweight backbone model to extract features; the backbone consists of a plurality of stages, within which several groups of depthwise separable convolutions are arranged;

the loop architecture module receives the feature information extracted by the MobileNetV3 backbone network; the recurrent mechanism can learn by itself which information should be retained in a continuous video stream, adaptively preserving both long-term and short-term temporal information, which suits the requirements of this method;

the key point regression module takes the feature map output by the loop architecture module as its input and passes it through two fully connected (FC) layers respectively; FC1 outputs the coordinate information of the 2D skeleton key points, and FC2 outputs the score information of the 2D skeleton key points; because the regression results need to be supervised, a normalizing flow module is added to assist training.

Preferably, in the lightweight neural network algorithm module, the depthwise separable convolution is divided into two processes, namely depthwise (channel-by-channel) convolution and pointwise (point-by-point) convolution; in the depthwise convolution, each convolution kernel is responsible for exactly one channel and each channel is convolved by exactly one kernel, so the number of channels of the resulting feature map equals the number of input channels; the pointwise convolution uses a 1x1 convolution to form weighted combinations, along the depth direction, of the feature maps output by the depthwise convolution, producing a new feature map;
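The two processes described above can be illustrated with a minimal sketch in plain Python. The function names, the nested-list tensor layout, and the choice of stride 1 without padding are assumptions made for illustration; they are not taken from the patent, and a real implementation would use an optimized library.

```python
def depthwise_conv(x, kernels):
    """Channel-by-channel convolution: kernel c is applied only to channel c.

    x: list of C channels, each an H x W list of lists.
    kernels: list of C kernels, each k x k (stride 1, no padding).
    The output keeps exactly C channels.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)
        ])
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: every output channel is a weighted sum over input channels.

    weights: list of C_out rows, each holding C_in weights.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise convolution followed by pointwise convolution."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


def param_counts(c_in, c_out, k):
    """Weight counts (no bias) of a standard k x k convolution and of the separable version."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable
```

For example, `param_counts(16, 32, 3)` compares the weight count of a standard 3x3 convolution from 16 to 32 channels with that of its separable counterpart, which is where the saving in computation on low-cost hardware comes from.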

preferably, an SE (squeeze-and-excitation) module is added to obtain a new feature matrix; when the stride is 1 and the input and output feature matrices have the same size, a shortcut connection is applied; after the MobileNetV3 backbone model outputs the feature map, an LR-ASPP module is added to enlarge the receptive field and improve the accuracy of the whole model: the input feature map is fed into two branches, the left branch outputs a feature map $P_1$ through a 1x1 convolution kernel, the right branch outputs a feature map $P_2$ after a global average pooling layer, a 1x1 convolution kernel and a Sigmoid module, and the feature maps $P_1$ and $P_2$ are multiplied to output a new feature map;
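The two-branch LR-ASPP step can be sketched as follows. This is a simplified illustration in plain Python, assuming nested-list feature maps and hypothetical function names; it follows only the operations named in the text (1x1 convolution, global average pooling, sigmoid, multiplication) and omits any normalization or extra activation a production implementation might include.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1x1(x, weights):
    """1x1 convolution over x (C_in channels of H x W lists); weights is C_out x C_in."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def global_average_pool(x):
    """Reduce every channel to its mean, kept as a 1 x 1 map."""
    return [[[sum(map(sum, ch)) / (len(ch) * len(ch[0]))]] for ch in x]


def lr_aspp_gate(x, w_left, w_right):
    """Left branch gives P1, right branch gives a per-channel gate P2; output is P1 * P2."""
    p1 = conv1x1(x, w_left)                               # left branch: 1x1 convolution
    pooled = global_average_pool(x)                       # right branch: global average pooling
    p2 = [sigmoid(ch[0][0]) for ch in conv1x1(pooled, w_right)]  # 1x1 convolution, then sigmoid
    # Broadcast each channel's scalar gate over the spatial positions of P1.
    return [[[v * gate for v in row] for row in ch] for ch, gate in zip(p1, p2)]
```

Because the right branch pools over the whole image, each gate value summarizes global context, which is how the module enlarges the effective receptive field at very low cost.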

preferably, in the last stage of the model, all of the original activation functions are replaced with the SiLU activation function; the activation $a_k$ of the $k$-th SiLU unit with input $z_k$ is calculated by multiplying the input by the sigmoid function: $a_k(z_k) = z_k\,\sigma(z_k)$ (Equation 1), where $\sigma$ is the sigmoid function; for large values of $z_k$, the SiLU activation is approximately equal to that of ReLU, but unlike ReLU, the SiLU activation is not monotonically increasing: it has a global minimum of approximately $-0.28$ at $z_k \approx -1.28$; the SiLU is self-stabilizing: the global minimum, where the derivative is zero, acts as a soft floor on the weights and serves as an implicit regularizer that inhibits the learning of weights of large magnitude; in actual tests the model performance improved, and replacing the activation functions in all stages with SiLU gave an equivalent effect, so SiLU is used only in the last stage.
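The properties of Equation 1 can be checked numerically with a short sketch. The helper names and the grid search are illustrative choices, not part of the patent; the derivative used is the standard one for $z\,\sigma(z)$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def silu(z):
    """Equation 1: a(z) = z * sigmoid(z)."""
    return z * sigmoid(z)


def silu_grad(z):
    """Derivative of z * sigmoid(z): sigmoid(z) * (1 + z * (1 - sigmoid(z)))."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))


def argmin_on_grid(f, lo, hi, steps):
    """Return the grid point in [lo, hi] at which f is smallest."""
    best = min(range(steps + 1), key=lambda i: f(lo + (hi - lo) * i / steps))
    return lo + (hi - lo) * best / steps


# The minimum lies near z = -1.28 with a value near -0.28, and the derivative vanishes there.
z_min = argmin_on_grid(silu, -5.0, 5.0, 100000)
print(z_min, silu(z_min), silu_grad(z_min))
```

For large positive inputs `silu(z)` approaches `z`, matching ReLU, while for negative inputs it first decreases and then rises back toward zero, which is the non-monotonic behavior described above.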

Preferably, in the loop architecture module, when the feature map is input into the loop architecture module, its channels are split equally into a feature map $P_3$ and a feature map $P_4$; the feature map $P_4$ is passed through a ConvGRU, which outputs an output feature map $P_5$ and a memory cell feature map $h_t$; the feature maps $P_3$ and $P_4$ are spliced, and a feature map $P_6$ is output.

Preferably, in the key point regression module, the normalizing flow module can transform a basic simple distribution into an arbitrarily complex distribution; in theory, if the transformation is sufficiently complex, any target distribution can be fitted. In actual training a neural network is used, and since a neural network can in theory approximate any function, the series of complex transformations in the normalizing flow model can be implemented by stacking FC layers. During model training, the regression module fits the output values of a simple distribution, and the normalizing flow module transforms the fitted values so that the transformed result is closer to the target distribution P.

Preferably, the training is divided into four stages, namely stage 1, stage 2, stage 3 and stage 4. Stage 1 trains the model on scattered (single-image) data sets without the loop architecture module to obtain a suitable pre-trained model. Comparison in actual tests showed that, although the MobileNetV3 classification pre-trained model can be used as the pre-trained model for the key point model, using a pre-trained key point model gives faster loss convergence and some improvement in model performance in later training.

Preferably, stage 2 trains on video stream data with a short sequence length of T = 15 frames, so that the network can be updated quickly; stage 3 increases T to 50 frames, halves the learning rate, and retains the hyperparameter settings of stage 1 for training, which allows the model to see longer sequences and learn long-range dependencies.

Preferably, stage 4 trains jointly on video stream data and scattered data for a small number of iterations; each scattered sample is treated as a video sequence of only 1 frame, which forces the model to remain robust even without repeated or continuous information.

Here, robustness means being sturdy and strong: it is the ability of a system to survive abnormal and dangerous conditions. For example, computer software is robust if it does not hang or crash under input errors, disk faults, network overload or deliberate attack. Robustness also refers to the ability of a control system to maintain certain performance characteristics under perturbation of certain parameters.

Compared with the prior art, the invention has the following beneficial effects:

firstly, the 2D gesture estimation method based on key point regression consumes little time and few resources, and allows real-time operation and fully differentiable end-to-end training on mobile, embedded or low-cost hardware platforms. A Gaussian heat map method is not an end-to-end differentiable model from image input to coordinate output, because the coordinates must be obtained from the heat map by an argmax operation, which is not differentiable. Coordinate regression, by contrast, converts position information into coordinate values from the output of the fully convolutional network; this implicit conversion is highly nonlinear, and the model does not converge easily in training, so the normalizing flow module is used to address this problem, achieving fast and high-precision results on embedded devices;

secondly, regarding the motion blur and self-occlusion caused by dynamic gestures in video: although many methods are designed for video applications, they treat each frame as an independent image and ignore the temporal information that is pervasive in video; a loop architecture module is therefore used to enhance the model's performance on dynamic gesture estimation in video. In video, the model can see the previous frames when predicting the current frame, so when a single frame is blurred the model can refer to the better key point predictions of the previous frame, which greatly improves the quality of its predictions; the method can be applied to any video without auxiliary input; with the model training strategy described, a high-precision model can be effectively produced; and the motion blur and self-occlusion problems caused by dynamic gestures are alleviated to a large extent;

therefore, the real-time 2D gesture estimation method based on the loop architecture and key point regression achieves real-time, high-precision detection on mobile, embedded or low-cost hardware, effectively alleviates the degradation of detection performance caused by motion blur and self-occlusion in video, and enables rapid product deployment.

Drawings

Fig. 1 is a block diagram of the modules of the present invention.

Fig. 2 is a flow block diagram of the lightweight neural network algorithm module of the present invention.

Fig. 3 is a flow block diagram of the overall model of the present invention.

Fig. 4 is a flow block diagram of the training strategy of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

Embodiment:

As shown in Figs. 1 to 4:

The invention provides a real-time 2D gesture estimation method based on a loop architecture and key point regression, whose core modules comprise an image acquisition module, a lightweight neural network algorithm module, a loop architecture module and a key point regression module. The image acquisition module is a monocular camera. The lightweight neural network algorithm module adopts MobileNetV3 as a lightweight backbone model to extract features; the backbone consists of a plurality of stages, within which several groups of depthwise separable convolutions are arranged. The loop architecture module receives the feature information extracted by the MobileNetV3 backbone network; the recurrent mechanism can learn by itself which information should be retained in a continuous video stream, adaptively preserving both long-term and short-term temporal information, which suits the requirements of this method. The key point regression module takes the feature map output by the loop architecture module as its input and passes it through two fully connected (FC) layers respectively; FC1 outputs the coordinate information of the 2D skeleton key points, and FC2 outputs the score information of the 2D skeleton key points. Because the regression results need to be supervised, a normalizing flow module is added to assist training.

In this embodiment, in the lightweight neural network algorithm module, the depthwise separable convolution is divided into two processes, namely depthwise (channel-by-channel) convolution and pointwise (point-by-point) convolution. In the depthwise convolution, each convolution kernel is responsible for exactly one channel and each channel is convolved by exactly one kernel, so the number of channels of the resulting feature map equals the number of input channels. The pointwise convolution uses a 1x1 convolution to form weighted combinations, along the depth direction, of the feature maps output by the depthwise convolution, producing a new feature map. An SE (squeeze-and-excitation) module is added to obtain a new feature matrix; when the stride is 1 and the input and output feature matrices have the same size, a shortcut connection is applied. After the MobileNetV3 backbone model outputs the feature map, an LR-ASPP module is added to enlarge the receptive field and improve the accuracy of the whole model: the input feature map is fed into two branches, the left branch outputs a feature map $P_1$ through a 1x1 convolution kernel, the right branch outputs a feature map $P_2$ after a global average pooling layer, a 1x1 convolution kernel and a Sigmoid module, and the feature maps $P_1$ and $P_2$ are multiplied to output a new feature map.

In the last stage of the model, all of the original activation functions are replaced with the SiLU activation function. The activation $a_k$ of the $k$-th SiLU unit with input $z_k$ is calculated by multiplying the input by the sigmoid function: $a_k(z_k) = z_k\,\sigma(z_k)$ (Equation 1), where $\sigma$ is the sigmoid function. For large values of $z_k$, the SiLU activation is approximately equal to that of ReLU, but unlike ReLU, the SiLU activation is not monotonically increasing: it has a global minimum of approximately $-0.28$ at $z_k \approx -1.28$. The SiLU is self-stabilizing: the global minimum, where the derivative is zero, acts as a soft floor on the weights and serves as an implicit regularizer that inhibits the learning of weights of large magnitude. In actual tests the model performance improved, and replacing the activation functions in all stages with SiLU gave an equivalent effect, so SiLU is used only in the last stage.

In this embodiment, in the loop architecture module, when the feature map is input into the loop architecture module, its channels are split equally into a feature map $P_3$ and a feature map $P_4$; the feature map $P_4$ is passed through a ConvGRU, which outputs an output feature map $P_5$ and a memory cell feature map $h_t$; the feature maps $P_3$ and $P_4$ are spliced, and a feature map $P_6$ is output. Formally, the ConvGRU is defined as follows:

$$z_t = \sigma(w_{zx} * x_t + w_{zh} * h_{t-1} + b_z)$$

$$r_t = \sigma(w_{rx} * x_t + w_{rh} * h_{t-1} + b_r)$$

where $*$ and $\odot$ denote convolution and element-wise product respectively, and $\tanh$ and $\sigma$ denote the hyperbolic tangent and sigmoid functions; $w$ and $b$ are the convolution kernels and bias terms. The hidden state $h_t$ is used both as the output and as the recurrent state $h_{t-1}$ for the next time step; the initial recurrent state $h_0$ is an all-zero tensor.
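The recurrence can be sketched in plain Python under several loudly stated assumptions. Only the two gate equations for $z_t$ and $r_t$ appear in the text; the candidate state and the blending step below use the standard GRU form, which is an assumption. Each "convolution" is reduced to a 1x1 kernel with a single scalar weight applied per channel, and the sketch concatenates $P_3$ with the ConvGRU output $P_5$, which is also an assumption about the intended data flow. All names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_gru_step(x, h_prev, p):
    """One ConvGRU step on a single channel stored as a flat list of positions.

    Simplification: every convolution is a 1x1 kernel, i.e. one scalar weight.
    """
    h_new = []
    for x_i, h_i in zip(x, h_prev):
        z = sigmoid(p["w_zx"] * x_i + p["w_zh"] * h_i + p["b_z"])  # update gate z_t
        r = sigmoid(p["w_rx"] * x_i + p["w_rh"] * h_i + p["b_r"])  # reset gate r_t
        # ASSUMPTION: candidate state and blend follow the standard GRU form.
        o = math.tanh(p["w_ox"] * x_i + p["w_oh"] * (r * h_i) + p["b_o"])
        h_new.append(z * h_i + (1.0 - z) * o)
    return h_new


def loop_block(feature_map, state, params):
    """Split channels in half, run the second half through the ConvGRU, concatenate."""
    half = len(feature_map) // 2
    p3, p4 = feature_map[:half], feature_map[half:]
    if state is None:
        # The initial recurrent state h_0 is an all-zero tensor.
        state = [[0.0] * len(ch) for ch in p4]
    p5 = [conv_gru_step(ch, h, params) for ch, h in zip(p4, state)]
    p6 = p3 + p5  # ASSUMPTION: P3 is concatenated with the ConvGRU output P5
    return p6, p5  # P5 doubles as the state h_t carried to the next frame
```

Calling `loop_block` once per frame and feeding the returned state back in is what lets information from earlier frames influence the prediction for a blurred or occluded frame.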

In the key point regression module, the normalizing flow module can transform a basic simple distribution into an arbitrarily complex distribution; in theory, if the transformation is sufficiently complex, any target distribution can be fitted. In actual training a neural network is used, and since a neural network can in theory approximate any function, the series of complex transformations in the normalizing flow model can be implemented by stacking FC layers. During model training, the regression module fits the output values of a simple distribution, and the normalizing flow module transforms the fitted values so that the transformed result is closer to the target distribution P. The loss function $L_{mle}$ of the normalizing flow module can then be set as follows:

[Equation for $L_{mle}$: image not reproduced]

where $\phi$ denotes the learnable parameters of the normalizing flow model, $\mu_g$ denotes the skeletal key point coordinates from the data, and the other two quantities are the skeletal key point coordinates and the skeletal key point scores predicted by the regression module.
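Since the formula itself is not legible in this text, the following is only a sketch of how a maximum-likelihood loss of this kind is commonly built with the change-of-variables rule, not the patent's formula. It assumes one coordinate, a standard normal base density, a predicted coordinate `mu_hat` and a predicted scale `sigma_hat` (standing in for the score), and a hypothetical invertible flow with two parameters `a` and `b` playing the role of $\phi$.

```python
import math


def flow_forward(x, a, b):
    """Hypothetical invertible 1D flow u = a*x + b*tanh(x), with a > 0 and b >= 0.

    Returns u and log|du/dx|; the derivative a + b*(1 - tanh(x)^2) is always positive.
    """
    t = math.tanh(x)
    return a * x + b * t, math.log(a + b * (1.0 - t * t))


def flow_nll(mu_g, mu_hat, sigma_hat, a=1.0, b=0.0):
    """Negative log-likelihood of the ground-truth coordinate mu_g.

    The residual is standardized with the predicted mean and scale, mapped through
    the flow to the base variable, and scored under a standard normal density.
    """
    x_bar = (mu_g - mu_hat) / sigma_hat
    u, log_det = flow_forward(x_bar, a, b)
    log_base = -0.5 * u * u - 0.5 * math.log(2.0 * math.pi)
    return -(log_base + log_det) + math.log(sigma_hat)
```

With `a=1.0` and `b=0.0` the flow is the identity and the loss reduces to the ordinary Gaussian negative log-likelihood; other values of `a` and `b` reshape the assumed error distribution, which is the role the text assigns to the normalizing flow.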

In this embodiment, the training is divided into four stages, namely stage 1, stage 2, stage 3 and stage 4. Stage 1 trains the model on scattered (single-image) data sets without the loop architecture module to obtain a suitable pre-trained model; comparison in actual tests showed that, although the MobileNetV3 classification pre-trained model can be used as the pre-trained model for the key point model, using a pre-trained key point model gives faster loss convergence and some improvement in model performance in later training. Stage 2 trains on video stream data with a short sequence length of T = 15 frames, so that the network can be updated quickly. Stage 3 increases T to 50 frames, halves the learning rate, and retains the hyperparameter settings of stage 1 for training, which allows the model to see longer sequences and learn long-range dependencies. Stage 4 trains jointly on video stream data and scattered data for a small number of iterations; each scattered sample is treated as a video sequence of only 1 frame, which forces the model to remain robust even without repeated or continuous information.
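The four stages can be written down as a small configuration sketch. The field names, the `BASE_LR` value and the helper `as_sequence` are hypothetical; the text gives no absolute learning rate, and the values it does not state are left as `None` rather than guessed.

```python
BASE_LR = 1e-3  # hypothetical placeholder; the text does not state a learning rate

TRAINING_STAGES = [
    # Stage 1: single images, no loop architecture module (pre-training).
    {"stage": 1, "data": ["image"], "seq_len": 1, "use_loop_module": False, "lr": BASE_LR},
    # Stage 2: short video clips so the network updates quickly.
    {"stage": 2, "data": ["video"], "seq_len": 15, "use_loop_module": True, "lr": BASE_LR},
    # Stage 3: longer clips, learning rate halved.
    {"stage": 3, "data": ["video"], "seq_len": 50, "use_loop_module": True, "lr": BASE_LR / 2},
    # Stage 4: video plus single images for a few iterations; the clip length
    # and learning rate for this stage are not stated in the text.
    {"stage": 4, "data": ["video", "image"], "seq_len": None, "use_loop_module": True, "lr": None},
]


def as_sequence(sample):
    """Treat a single image as a 1-frame video; leave frame sequences unchanged."""
    return list(sample) if isinstance(sample, (list, tuple)) else [sample]
```

Wrapping single images with `as_sequence` lets stage 4 feed both kinds of data through the same recurrent model, so the model also sees inputs with no preceding frames.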

In the invention, firstly, the 2D gesture estimation method based on key point regression consumes little time and few resources, and allows real-time operation and fully differentiable end-to-end training on mobile, embedded or low-cost hardware platforms. A Gaussian heat map method is not an end-to-end differentiable model from image input to coordinate output, because the coordinates must be obtained from the heat map by an argmax operation, which is not differentiable. Coordinate regression, by contrast, converts position information into coordinate values from the output of the fully convolutional network; this implicit conversion is highly nonlinear, and the model does not converge easily in training, so the normalizing flow module is used to address this problem, achieving fast and high-precision results on embedded devices.
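The remark about argmax can be made concrete with a one-dimensional toy example. The heat map size, the Gaussian width and the function names are illustrative assumptions; the point is only that argmax decoding returns an integer bin, so small shifts of the true position leave the output unchanged and carry no gradient.

```python
import math


def gaussian_heatmap(center, size, sigma=1.0):
    """1D Gaussian heat map with one value per integer bin."""
    return [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]


def argmax_decode(heatmap):
    """Heat map decoding: index of the largest bin (piecewise constant in the input)."""
    return max(range(len(heatmap)), key=heatmap.__getitem__)


# True positions 3.2 and 3.4 decode to the same bin; 3.6 jumps to the next one.
for center in (3.2, 3.4, 3.6):
    print(center, argmax_decode(gaussian_heatmap(center, 8)))
```

A regression head, by contrast, outputs a real-valued coordinate directly, so the loss on that coordinate can be back-propagated through the whole network.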

Secondly, regarding the motion blur and self-occlusion caused by dynamic gestures in video: although many methods are designed for video applications, they treat each frame as an independent image and ignore the temporal information that is pervasive in video. A loop architecture module is therefore used to enhance the model's performance on dynamic gesture estimation in video. In video, the model can see the previous frames when predicting the current frame, so when a single frame is blurred the model can refer to the better key point predictions of the previous frame, which greatly improves the quality of its predictions. The method can be applied to any video without auxiliary input; with the model training strategy described, a high-precision model can be effectively produced; and the motion blur and self-occlusion problems caused by dynamic gestures are alleviated to a large extent.

Any similar technical solution that a person skilled in the art designs by using, or under the inspiration of, the technical solution of the invention, and that achieves the above technical effects, falls within the scope of protection of the invention.

Claims

1. A real-time 2D gesture estimation method based on a loop architecture and key point regression, characterized in that: the core modules comprise an image acquisition module, a lightweight neural network algorithm module, a loop architecture module and a key point regression module, wherein the image acquisition module is a monocular camera;

the lightweight neural network algorithm module adopts MobileNetV3 as a lightweight backbone model to extract features; the backbone consists of a plurality of stages, within which several groups of depthwise separable convolutions are arranged;

the loop architecture module receives the feature information extracted by the MobileNetV3 backbone network; the recurrent mechanism can learn by itself which information should be retained in a continuous video stream, adaptively preserving both long-term and short-term temporal information, which suits the requirements of the method; in the loop architecture module, when the feature map is input into the loop architecture module, its channels are split equally into a feature map $P_3$ and a feature map $P_4$, and the feature map $P_4$ is spliced with the output feature map $P_5$ and the memory cell feature map $h_t$ output by the ConvGRU, so as to output a feature map $P_6$;

the key point regression module takes the feature map output by the loop architecture module as its input and passes it through two fully connected (FC) layers respectively; FC1 outputs the coordinate information of the 2D skeleton key points, and FC2 outputs the score information of the 2D skeleton key points; because the regression results need to be supervised, a normalizing flow module is added to assist training; the normalizing flow module can transform a basic simple distribution into an arbitrarily complex distribution, and the loss function $L_{mle}$ of the normalizing flow module may be set as follows:

[Equation for $L_{mle}$: image not reproduced]

where $\phi$ denotes the learnable parameters of the normalizing flow model, $\mu_g$ denotes the skeletal key point coordinates from the data, and the other two quantities are the skeletal key point coordinates and the skeletal key point scores predicted by the regression module;

in the lightweight neural network algorithm module, the depthwise separable convolution is divided into two processes, namely depthwise (channel-by-channel) convolution and pointwise (point-by-point) convolution; in the depthwise convolution, each convolution kernel is responsible for exactly one channel and each channel is convolved by exactly one kernel, so the number of channels of the resulting feature map equals the number of input channels; the pointwise convolution uses a 1x1 convolution to form weighted combinations, along the depth direction, of the feature maps output by the depthwise convolution, producing a new feature map; the lightweight neural network algorithm module obtains a new feature matrix by adding an SE (squeeze-and-excitation) module; when the stride is 1 and the input and output feature matrices have the same size, a shortcut connection is applied; after the MobileNetV3 backbone model outputs the feature map, an LR-ASPP module is added to enlarge the receptive field and improve the accuracy of the whole model: the input feature map is fed into two branches, the left branch outputs a feature map $P_1$ through a 1x1 convolution kernel, the right branch outputs a feature map $P_2$ after a global average pooling layer, a 1x1 convolution kernel and a Sigmoid module, and the feature maps $P_1$ and $P_2$ are multiplied to output a new feature map.

2. The real-time 2D gesture estimation method based on a loop architecture and key point regression of claim 1, wherein: in the last stage of the lightweight neural network algorithm module that uses MobileNetV3 as its lightweight backbone model, all of the original activation functions are replaced with the SiLU activation function; the activation $a_k$ of the $k$-th SiLU unit with input $z_k$ is calculated by multiplying the input by the sigmoid function: $a_k(z_k) = z_k\,\sigma(z_k)$ (Equation 1), where $\sigma$ is the sigmoid function; for large values of $z_k$, the SiLU activation is approximately equal to that of ReLU, but unlike ReLU, the SiLU activation is not monotonically increasing: it has a global minimum of approximately $-0.28$ at $z_k \approx -1.28$; the SiLU is self-stabilizing: the global minimum, where the derivative is zero, acts as a soft floor on the weights and serves as an implicit regularizer that inhibits the learning of weights of large magnitude; in actual tests the model performance improved, and replacing the activation functions in all stages with SiLU gave an equivalent effect, so SiLU is used only in the last stage.

3. The real-time 2D gesture estimation method based on a loop architecture and key point regression of claim 1, wherein: the training strategy of the model is divided into four stages, namely stage 1, stage 2, stage 3 and stage 4; stage 1 trains the model on scattered (single-image) data sets without the loop architecture module to obtain a suitable pre-trained model; comparison in actual tests showed that, although the MobileNetV3 classification pre-trained model can be used as the pre-trained model for the key point model, using a pre-trained key point model gives faster loss convergence and some improvement in model performance in later training.

4. The real-time 2D gesture estimation method based on a loop architecture and key point regression of claim 3, wherein: stage 2 trains on video stream data with a short sequence length of T = 15 frames, so that the network can be updated quickly; stage 3 increases T to 50 frames, halves the learning rate, and retains the hyperparameter settings of stage 1 for training, which allows the model to see longer sequences and learn long-range dependencies.

5. The real-time 2D gesture estimation method based on a loop architecture and key point regression of claim 3, wherein: stage 4 trains jointly on video stream data and scattered data for a small number of iterations; each scattered sample is treated as a video sequence of only 1 frame, which forces the model to remain robust even without repeated or continuous information.