WO2021227933A1 - Image processing apparatus, image processing method, and computer-readable storage medium - Google Patents

Image processing apparatus, image processing method, and computer-readable storage medium

Info

Publication number
WO2021227933A1 (PCT/CN2021/092004)
Authority
WO
WIPO (PCT)
Prior art keywords
convolutional
neural network
image processing
network model
network
Application number
PCT/CN2021/092004
Other languages
French (fr)
Chinese (zh)
Inventor
吴松涛 (Wu Songtao)
许宽宏 (Xu Kuanhong)
Original Assignee
索尼集团公司 (Sony Group Corporation)
吴松涛 (Wu Songtao)
Application filed by 索尼集团公司 (Sony Group Corporation) and 吴松涛 (Wu Songtao)
Priority to CN202180023365.4A (published as CN115349142A)
Publication of WO2021227933A1

Classifications

    • G06F18/00 Pattern recognition (Section G: Physics; G06: Computing; calculating or counting; G06F: Electric digital data processing)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N: Computing arrangements based on specific computational models; G06N3/00: Computing arrangements based on biological models; G06N3/02: Neural networks; G06N3/04: Architecture, e.g. interconnection topology)
    • G06N3/045 Combinations of networks (same hierarchy as above)
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion (G06V: Image or video recognition or understanding; G06V10/00: Arrangements for image or video recognition or understanding; G06V10/20: Image preprocessing)

Definitions

  • the embodiments of the present disclosure generally relate to the field of image processing, and in particular to an image processing apparatus, an image processing method, and a computer-readable storage medium. More specifically, the embodiments of the present disclosure relate to an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing gestures included in a plurality of images that are continuously input.
  • Dynamic gesture recognition refers to a technology that recognizes a sequence of dynamic gestures composed of consecutively input multiple frames of images. Due to the flexibility and convenience of gestures, dynamic gesture recognition has broad application prospects in human-computer interaction, AR (Augmented Reality)/VR (Virtual Reality) and other environments.
  • Online dynamic gesture recognition is a technology for segmenting and recognizing multiple continuous dynamic gestures. Compared with offline dynamic gesture recognition, online dynamic gesture recognition is very challenging, mainly in two respects: distinguishing the start frame and end frame of a gesture, and recognizing the gesture.
  • For online dynamic gesture recognition, different gestures can be distinguished by selecting one or several key frames for each type of gesture, but because the key frames must be selected manually, this approach carries strong uncertainty.
  • In addition, when there are many types of gestures, it is difficult to select an appropriate key frame for each type of gesture.
  • The purpose of the present disclosure is to provide an image processing device, an image processing method, and a computer-readable storage medium that quickly and accurately recognize dynamic gestures.
  • According to an aspect of the present disclosure, an image processing device is provided, including a processing circuit configured to: divide a plurality of consecutively input images into a plurality of image blocks; extract the spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network; and determine, using a recurrent neural network (RNN) model, the gestures included in the plurality of images according to the spatiotemporal features of each image block.
  • According to another aspect of the present disclosure, an image processing method is provided, including: dividing a plurality of consecutively input images into a plurality of image blocks; extracting the spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and determining, using a recurrent neural network model, the gestures included in the plurality of images according to the spatiotemporal features of each image block.
  • According to another aspect of the present disclosure, a computer-readable storage medium is provided, including executable computer instructions that, when executed by a computer, cause the computer to execute the image processing method according to the present disclosure.
  • According to another aspect of the present disclosure, a computer program is provided that, when executed by a computer, causes the computer to execute the image processing method according to the present disclosure.
  • With the image processing device, image processing method, and computer-readable storage medium according to the present disclosure, the spatiotemporal features of image blocks can be extracted using a convolutional neural network that includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network, so that a recurrent neural network can recognize gestures from the extracted spatiotemporal features. Because a separable convolutional network is combined with a pointwise or dilated convolutional network, the amount of computation required for gesture recognition can be reduced, allowing dynamic gestures to be recognized quickly and accurately.
  • FIG. 1 is a schematic diagram showing gestures included in consecutive multiple images;
  • FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram showing a process of extracting key points in an image according to an embodiment of the present disclosure;
  • FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model;
  • FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure;
  • FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 15 is a flowchart showing an image processing method according to an embodiment of the present disclosure; and
  • FIG. 16 is a block diagram showing an example of an electronic device that can implement the image processing apparatus according to the present disclosure.
  • Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of specific components, devices, and methods, are described to provide a thorough understanding of the embodiments of the present disclosure. It will be apparent to those skilled in the art that the specific details need not be employed, that example embodiments may be implemented in many different forms, and that none of them should be construed to limit the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
  • FIG. 1 is a schematic diagram showing gestures included in consecutive multiple images. As shown in FIG. 1, the upper row shows an example in which the multiple images include a "double tap" gesture, and the lower row shows an example in which the multiple images include a "squeeze" gesture.
  • In view of the above, the present disclosure aims to provide an image processing device, an image processing method, and a computer-readable storage medium that quickly and accurately identify various dynamic gestures.
  • FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus 200 according to an embodiment of the present disclosure.
  • The image processing apparatus 200 may recognize gestures included in a plurality of continuously input images. The continuously input images may be, for example, a video, a moving image, or a group of still images input in rapid succession.
  • the image processing device 200 can recognize dynamic gestures in real time, that is, can recognize dynamic gestures online.
  • the image processing apparatus 200 may include a preprocessing unit 210, an extraction unit 220, and a determination unit 230.
  • each unit of the image processing apparatus 200 may be included in the processing circuit.
  • the image processing device 200 may include one processing circuit or multiple processing circuits.
  • the processing circuit may include various discrete functional units to perform various different functions and/or operations. It should be noted that these functional units may be physical entities or logical entities, and units with different titles may be implemented by the same physical entity.
  • the preprocessing unit 210 may divide a plurality of images continuously input into a plurality of image blocks.
  • the extraction unit 220 may use a convolutional neural network model to extract the spatiotemporal features of each image block.
  • the convolutional neural network model may include a separable convolutional network and a pointwise convolutional network.
  • According to an embodiment of the present disclosure, the convolutional neural network model may instead include a separable convolutional network and a dilated convolutional network.
  • the determining unit 230 may use a recurrent neural network model to determine the gestures included in the multiple images according to the temporal and spatial characteristics of each image block.
  • the spatiotemporal features of the image block can be extracted using the convolutional neural network model.
  • As described above, the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network, so that the recurrent neural network can recognize gestures from the extracted spatiotemporal features. Because a separable convolutional network is combined with a pointwise or dilated convolutional network, the amount of computation required for gesture recognition can be reduced, allowing dynamic gestures to be recognized quickly and accurately.
  • Separable convolution, also called depthwise separable convolution, reduces the number of parameters required for the convolution computation by decoupling the spatial dimensions from the channel (depth) dimension.
  • The computation of a depthwise separable convolution is divided into two parts: first, each channel is convolved spatially on its own and the outputs are concatenated; then a unit (1×1) convolution kernel is applied across channels to obtain the feature map.
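  • As a brief worked comparison (standard textbook figures, not taken from the patent text): for a K×K kernel, C_in input channels, and C_out output channels, the parameter counts compare as follows.

```latex
% Parameter count: standard convolution vs. depthwise separable convolution.
% This is the standard comparison; the numbers are not from the patent.
\begin{align}
  \text{standard:}\quad  & K^2 \, C_{\mathrm{in}} \, C_{\mathrm{out}} \\
  \text{separable:}\quad & \underbrace{K^2 \, C_{\mathrm{in}}}_{\text{depthwise}}
                         + \underbrace{C_{\mathrm{in}} \, C_{\mathrm{out}}}_{\text{pointwise}} \\
  \text{ratio:}\quad     & \frac{1}{C_{\mathrm{out}}} + \frac{1}{K^2}
      \approx 0.14 \quad \text{for } K = 3,\ C_{\mathrm{out}} = 32
\end{align}
```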
  • Pointwise convolution uses a 1×1 convolution kernel, that is, a kernel that traverses every point of the feature map.
  • The depth of this convolution kernel equals the number of channels of the input to the pointwise convolutional network.
  • Dilated convolution, sometimes literally translated as hole convolution, injects holes into the convolution kernel.
  • Dilated convolution can enlarge the receptive field and capture multi-scale context information.
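  • The three building blocks above can be sketched minimally in PyTorch as follows (an illustration only, not part of the patent; the channel counts and kernel sizes are arbitrary assumptions):

```python
# Hedged sketch of the three convolution types discussed above (PyTorch).
# Channel counts and kernel sizes are illustrative assumptions, not values
# taken from the patent.
import torch
import torch.nn as nn

in_ch, out_ch = 16, 32

# Depthwise separable convolution: per-channel spatial convolution
# (groups=in_ch) followed by a 1x1 pointwise convolution across channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

# Dilated convolution: holes injected into the kernel enlarge the
# receptive field without adding parameters.
dilated = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, in_ch, 28, 28)     # dummy feature map
y_sep = pointwise(depthwise(x))       # separable + pointwise path
y_dil = dilated(x)                    # dilated path
print(y_sep.shape, y_dil.shape)       # both: torch.Size([1, 32, 28, 28])
```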
  • the input of the image processing apparatus is multiple images (or multiple frames of images) including gestures.
  • the image may be any one of an RGB image and a depth image.
  • the preprocessing unit 210 may divide a plurality of images input to the image processing apparatus 200 into a plurality of image blocks. Specifically, the preprocessing unit 210 may divide the consecutively input M images from the multiple images input to the image processing device 200 into one image block, where M is an integer greater than or equal to 2. That is, with M images as a unit, the preprocessing unit 210 may divide a plurality of images input to the image processing apparatus into a plurality of image blocks.
  • each image block including M images can be regarded as a spatio-temporal unit.
  • M can be 4, 8, 16, 32, and the like.
  • According to an embodiment of the present disclosure, the preprocessing unit 210 may start from an arbitrary position and group every 8 consecutively input images among the plurality of images input to the image processing apparatus 200 into one image block. For example, the preprocessing unit 210 may divide the 1st to 8th images into the first image block, the 9th to 16th images into the second image block, and so on.
  • the preprocessing unit 210 may also determine the feature of each of the divided image blocks, and may input the feature of each image block to the extraction unit 220.
  • the preprocessing unit 210 may extract features of a plurality of key points of each of a plurality of images input to the image processing apparatus 200. Further, the preprocessing unit 210 may use the feature of each key point of each of the M images included in the image block as the feature of the image block.
  • the key point may be, for example, a joint point of the hand that makes the gesture.
  • the present disclosure does not limit the number of key points included in each image.
  • FIG. 3 is a schematic diagram showing a process of extracting key points in an image according to an embodiment of the present disclosure.
  • The upper part of FIG. 3 shows three of the images input to the image processing device 200, and the lower part shows the process of extracting key points from these three images.
  • In this example, 14 key points are extracted from each image.
  • the feature of each key point may include features of multiple dimensions.
  • the characteristic of each key point may be the spatial characteristic of the key point.
  • the feature of each key point may include Y spatial features of the key point. Y is 3, for example. That is, the feature of each key point may include three coordinate features of the key point in the three-dimensional space.
  • According to an embodiment of the present disclosure, one image block includes M images, each image includes X key points, and each key point includes Y spatial features. Each image block may therefore include M×X×Y features.
  • The preprocessing unit 210 may input the M×X×Y features included in each image block, as the features of the image block, to the convolutional neural network model in the extraction unit 220. Further, the preprocessing unit 210 may input the features of each image block to the extraction unit 220 sequentially, according to the order of the image blocks. In other words, the features of an image block that is earlier in time are input to the extraction unit 220 before those of a later image block.
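  • A minimal sketch of this preprocessing step follows (illustrative only; the keypoint extractor below is a placeholder assumption, since the patent does not prescribe a specific detector):

```python
# Hedged sketch of the preprocessing step: grouping M consecutive frames
# into one image block and packing keypoint features into an M x X x Y
# tensor fed to the convolutional model in temporal order.
import torch

M, X, Y = 8, 14, 3   # frames per block, keypoints per frame, coords per keypoint

def extract_keypoints(frame):
    """Placeholder for a hand-keypoint detector returning (X, Y) coordinates."""
    return torch.rand(X, Y)

def frames_to_blocks(frames):
    """Split a frame sequence into blocks of M frames; each block becomes
    an M x X x Y feature tensor."""
    blocks = []
    for i in range(0, len(frames) - M + 1, M):
        block = torch.stack([extract_keypoints(f) for f in frames[i:i + M]])
        blocks.append(block)            # shape: (M, X, Y)
    return blocks

frames = [torch.zeros(224, 224, 3) for _ in range(32)]   # dummy input video
print(len(frames_to_blocks(frames)))    # 4 blocks, each of shape (8, 14, 3)
```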
  • the extraction unit 220 may use a convolutional neural network model to extract the spatiotemporal features of each image block.
  • As described above, the convolutional neural network model may include a separable convolutional network and a pointwise convolutional network, or may include a separable convolutional network and a dilated convolutional network.
  • the convolutional neural network model in the extraction unit 220 may also include a fully connected network.
  • Each node of the fully connected network is connected to all nodes of the previous network, and is used to integrate the features extracted from the previous network.
  • FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 4, the convolutional neural network model may include a separable convolutional network, a pointwise or dilated convolutional network, and a fully connected network.
  • The convolutional neural network model may include N separable convolutional networks, N pointwise or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  • That is, the numbers of separable convolutional networks, pointwise or dilated convolutional networks, and fully connected networks included in the convolutional neural network model are the same. The input of the model thus passes through N groups, each of which includes, in order from input to output, a separable convolutional network, a pointwise or dilated convolutional network, and a fully connected network.
  • If the separable convolutional network is denoted A, the pointwise or dilated convolutional network B, and the fully connected network C, then the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, C or A, B, C, A, B, C, ...
  • FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 5, the convolutional neural network model in the extraction unit 220 may include multiple separable convolutional networks, one or more pointwise or dilated convolutional networks, and one fully connected network.
  • In this case, the number of separable convolutional networks is one more than the number of pointwise or dilated convolutional networks. For example, if the number of pointwise or dilated convolutional networks is V, where V is a positive integer, then the number of separable convolutional networks is V+1.
  • In order from input to output, the convolutional neural network model may include V groups, each consisting of a separable convolutional network followed by a pointwise or dilated convolutional network, then one further separable convolutional network, and then the fully connected network. That is, the structure before the fully connected network begins and ends with a separable convolutional network, and the separable convolutional networks are interleaved with the pointwise or dilated convolutional networks.
  • With the same notation (A for a separable convolutional network, B for a pointwise or dilated convolutional network, C for a fully connected network), the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, A, C or A, B, A, B, ..., A, B, A, C.
  • FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 6, the convolutional neural network model may include multiple separable convolutional networks, multiple pointwise or dilated convolutional networks, and one fully connected network.
  • Here the number of separable convolutional networks equals the number of pointwise or dilated convolutional networks, say Z, where Z is an integer greater than or equal to 2.
  • In order from input to output, the convolutional neural network model may include Z groups followed by a fully connected network, where each of the Z groups includes, in order from input to output, a separable convolutional network and then a pointwise or dilated convolutional network.
  • With the same notation, the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, A, B, C or A, B, A, B, ..., A, B, C.
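  • The following is a minimal sketch (not part of the patent) of how the A/B/C stackings of FIGS. 4 to 6 could be assembled in PyTorch; the pattern strings, channel counts, and layer sizes are illustrative assumptions, since the patent fixes only the ordering of the networks:

```python
# Hedged sketch: assembling the A/B/C stackings of FIGS. 4-6 as a PyTorch
# nn.Sequential. Channel counts and kernel sizes are illustrative
# assumptions; patterns with an interior 'C' would also need a reshape.
import torch
import torch.nn as nn

def make_model(pattern, ch=32, use_dilated=False):
    """pattern: string over {'A','B','C'} ending in 'C',
    e.g. 'ABC' (FIG. 4, N=1), 'ABAC' (FIG. 5, V=1), 'ABABC' (FIG. 6, Z=2)."""
    layers = []
    for p in pattern:
        if p == "A":    # separable (depthwise) convolution, per-channel
            layers.append(nn.Conv2d(ch, ch, 3, padding=1, groups=ch))
        elif p == "B":  # pointwise OR dilated convolution
            layers.append(nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
                          if use_dilated else nn.Conv2d(ch, ch, 1))
        elif p == "C":  # fully connected network over flattened features
            layers += [nn.Flatten(), nn.LazyLinear(128)]
    return nn.Sequential(*layers)

model = make_model("ABAC")                 # FIG. 5 style with V = 1
out = model(torch.randn(1, 32, 8, 14))     # dummy block-feature input
print(out.shape)                           # torch.Size([1, 128])
```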
  • FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • According to an embodiment of the present disclosure, the stride of the separable convolutional network in the convolutional neural network model can be 1, and a pointwise convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 8, the convolutional neural network model may include a separable convolutional network with a stride of 1, a pointwise convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the pointwise convolutional network, and Q represents the number of convolution kernels in the pointwise convolutional network.
  • With this structure, the local spatiotemporal information of the image block can be extracted.
  • the spatio-temporal information may include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 8 shows an example in which the convolutional neural network model includes a separable convolutional network, a pointwise convolutional network, and a fully connected network.
  • The structure shown in FIG. 8 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • Alternatively, the stride of the separable convolutional network in the convolutional neural network model can be greater than 1, and a pointwise convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 9, the convolutional neural network model may include a separable convolutional network with a stride greater than 1, a pointwise convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the pointwise convolutional network, and Q represents the number of convolution kernels in the pointwise convolutional network.
  • With this structure, the medium-range spatiotemporal information of the image block can be extracted.
  • Medium-range spatiotemporal information lies between local spatiotemporal information and global spatiotemporal information; its range depends on the stride.
  • spatiotemporal information can include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 9 shows an example in which the convolutional neural network model includes a separable convolutional network, a pointwise convolutional network, and a fully connected network.
  • The structure shown in FIG. 9 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • Alternatively, the stride of the separable convolutional network in the convolutional neural network model can be 1, and a dilated convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 10, the convolutional neural network model may include a separable convolutional network with a stride of 1, a dilated convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the dilated convolutional network, and Q represents the number of convolution kernels in the dilated convolutional network.
  • With this structure, the global spatiotemporal information of the image block can be extracted.
  • spatiotemporal information can include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 10 shows an example in which the convolutional neural network model includes a separable convolutional network, a dilated convolutional network, and a fully connected network.
  • The structure shown in FIG. 10 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • the various examples of the convolutional neural network model in the extraction unit 220 according to an embodiment of the present disclosure have been described above. The above examples are merely illustrative, and the present disclosure is not limited to these structures.
  • the determination unit 230 according to an embodiment of the present disclosure will be described below.
  • the determination unit 230 may use a recurrent neural network model to determine the gestures included in the plurality of images according to the spatiotemporal characteristics of each image block output by the extraction unit 220. Specifically, the determining unit 230 may determine (model) the temporal relationship between each image block according to the temporal and spatial characteristics of each image block output by the extraction unit 220, thereby outputting a state vector representing the gesture.
  • FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model.
  • The recurrent neural network model shown in FIG. 11 is a currently common recurrent neural network model.
  • In FIG. 11, the output o_t of the recurrent neural network model at time t is related to the input x_t at time t and the output h_{t-1} at the previous time t-1.
  • In such a network, neurons receive not only information from other neurons but also their own previous information, forming a network structure with loops; for this reason it is also described as a neural network with short-term memory.
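  • For reference, the update of this common recurrent model can be written in the standard textbook form below (the patent's formula images are not reproduced in this text, so the weight symbols are conventional assumptions):

```latex
% Standard recurrent update corresponding to FIG. 11 (textbook form;
% weight names W_xh and W_hh are conventional, not taken from the patent).
h_t = \varphi\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h \right),
\qquad o_t = h_t
```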
  • According to an embodiment of the present disclosure, the recurrent neural network model can determine the output information at the current time based on the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or differential information of the output at the previous time.
  • The proportional information of the output at the previous time may be, for example, the output at the previous time itself, or information calculated from that output according to a certain ratio.
  • the integration information of the output at the previous time indicates the information obtained by integrating the output at the previous time.
  • the differential information of the output at the previous time indicates information obtained by performing a differential operation on the output at the previous time.
  • The differential information of the output at the previous time may include the first-order through K-th-order differential information of that output, that is, the information obtained by applying first-order through K-th-order difference operations to the output at the previous time.
  • K is an integer greater than or equal to 2.
  • FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure.
  • In FIG. 12, x_t represents the input information at time t, and o_t represents the output information at time t, which is equal to h_t.
  • h_{t-1} represents the output information at time t-1 and also serves as the proportional information of that output, and S_{t-1} represents the integral information of the output information at time t-1.
  • The integral information S_{t-1} of the output information at time t-1, as well as its first-order and second-order differential information, can each be calculated by a corresponding formula, and the K-th-order differential information of the output information at time t-1 can be calculated in the same way (a hedged reconstruction of these formulas is given after this list).
  • The output information h_t at time t can be calculated by an update formula in which:
  • W_he represents the state update matrix;
  • φ is the activation function, including but not limited to the ReLU (Rectified Linear Unit) function;
  • b_h is the bias vector, which can be set according to empirical values; and
  • E_t represents the state, that is, the memory of the recurrent neural network at time t, which is itself calculated by a formula combining the quantities above.
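  • The formula images of the original publication do not survive in this text; the following LaTeX is a plausible reconstruction consistent with the surrounding description, with all weight-matrix names other than W_he and b_h being illustrative assumptions:

```latex
% Hedged reconstruction of the proportional/integral/differential (PID-style)
% recurrent update described above. Only W_he, b_h, E_t, S_{t-1}, and h_t are
% named in the text; W_x, W_p, W_i, and W_d^{(k)} are assumed symbols.
\begin{align}
  S_{t-1} &= \sum_{i=1}^{t-1} h_i
      && \text{integral information of the past outputs} \\
  D^{(1)}_{t-1} &= h_{t-1} - h_{t-2}
      && \text{first-order differential information} \\
  D^{(2)}_{t-1} &= D^{(1)}_{t-1} - D^{(1)}_{t-2}
      && \text{second-order differential information} \\
  E_t &= W_x x_t + W_p h_{t-1} + W_i S_{t-1}
         + \sum_{k=1}^{K} W_d^{(k)} D^{(k)}_{t-1}
      && \text{state (memory) at time } t \\
  h_t &= \varphi\!\left( W_{he} E_t + b_h \right), \qquad o_t = h_t
      && \text{output at time } t
\end{align}
```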
  • In this way, the recurrent neural network model can determine the state at the current time, and thus the output information at the current time, based on the input information at the current time, the proportional information of the output at the previous time, and the integral and differential information of the output at the previous time. It is worth noting that although FIG. 12 shows an example in which the output at the current time is determined from all of these inputs, the output information at the current time may also be determined from the input information at the current time, the proportional information of the output at the previous time, and only the integral information of the output at the previous time, or only the differential information of the output at the previous time.
  • That is, the recurrent neural network in the determination unit 230 can determine the output at the current time not only from the input information at the current time and the output at the previous time, but also from at least one of the integral information and the differential information of the output at the previous time.
  • Since the proportional information of the output focuses on the state of the current image block, the differential information focuses on changes of that state, and the integral information focuses on the accumulation of the state, the determination unit 230 according to the embodiment of the present disclosure can comprehensively capture the changes and trends of gestures on the time scale, thereby achieving better recognition accuracy.
  • the extraction unit 220 can obtain the temporal and spatial characteristics of each image block. Since the gesture may include multiple image blocks, the determination unit 230 can model the temporal relationship between different image blocks. Thus, gestures can be recognized accurately and quickly.
  • the image processing apparatus 200 may further include a decision unit 240 for determining the final gesture according to the output of the determination unit 230.
  • the output of the recurrent neural network in the determining unit 230 may be a 128-dimensional state vector corresponding to different gestures determined according to the spatiotemporal characteristics of each image block.
  • the decision unit 240 may include a classifier for determining the state vector output by the determining unit 230 as a gesture.
  • the extraction unit 220 may include a convolutional neural network model
  • the determination unit 230 may include a recurrent neural network model, so that the decision unit 240 can determine the final gesture according to the output of the recurrent neural network model.
  • FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure.
  • As shown in FIG. 13, the input of the image processing device 200 sequentially passes through the convolutional neural network model in the extraction unit 220, the recurrent neural network model in the determination unit 230, and the classifier in the decision unit 240, thereby outputting the gesture recognition result.
  • According to an embodiment of the present disclosure, the extraction unit 220 may include a plurality of convolutional neural network models, and the determination unit 230 may include a plurality of recurrent neural network models, so that the decision unit 240 can determine the final gesture based on the output result of each of the plurality of recurrent neural network models.
  • Here, the inputs of the multiple convolutional neural network models are the same, namely the multiple images input to the image processing device 200. Each pair of a convolutional neural network model and a recurrent neural network model is used to determine a state vector for the gesture, and the classifier in the decision unit 240 then determines the final gesture. For example, the classifier can average the state vectors output by the recurrent neural network models and then determine the final gesture.
  • FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure.
  • the image processing device 200 includes R convolutional neural network models, R recurrent neural network models, and a classifier.
  • R is an integer greater than or equal to 2.
  • As shown in FIG. 14, the input multiple images are fed to convolutional neural network model 1 and recurrent neural network model 1 to obtain the first group of 128-dimensional state vectors, fed to convolutional neural network model 2 and recurrent neural network model 2 to obtain the second group of 128-dimensional state vectors, ..., and fed to convolutional neural network model R and recurrent neural network model R to obtain the R-th group of 128-dimensional state vectors.
  • the classifier can synthesize the output results of the R recurrent neural network models to obtain the final gesture recognition result.
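  • A minimal sketch (not from the patent) of this decision step follows; the use of GRUs as stand-ins for the recurrent models, the branch internals, and the number of gesture classes are illustrative assumptions:

```python
# Hedged sketch of the ensemble decision step: R CNN+RNN branches each
# produce a 128-dimensional state vector, and the classifier averages
# them before picking a gesture class.
import torch
import torch.nn as nn

R, num_classes = 3, 10

branches = nn.ModuleList(
    [nn.GRU(input_size=128, hidden_size=128, batch_first=True)
     for _ in range(R)])                   # stand-ins for the R RNN models
classifier = nn.Linear(128, num_classes)

def predict(cnn_features):
    """cnn_features: list of R tensors, each (batch, blocks, 128),
    one per convolutional branch."""
    states = []
    for rnn, feats in zip(branches, cnn_features):
        _, h = rnn(feats)                  # final state of each branch
        states.append(h[-1])               # (batch, 128)
    avg = torch.stack(states).mean(dim=0)  # average the R state vectors
    return classifier(avg).argmax(dim=-1)  # final gesture class

feats = [torch.randn(1, 4, 128) for _ in range(R)]
print(predict(feats))
```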
  • In this way, multiple pairs of convolutional neural network models and recurrent neural network models can be used to recognize gestures, making the recognition more accurate.
  • As described above, a convolutional neural network model including a separable convolutional network with a stride of 1 and a pointwise convolutional network can extract the local spatiotemporal information of an image block; a model including a separable convolutional network with a stride greater than 1 and a pointwise convolutional network can extract medium-range spatiotemporal information; and a model including a separable convolutional network with a stride of 1 and a dilated convolutional network can extract the global spatiotemporal information of an image block.
  • the R convolutional neural network models may include convolutional neural network models capable of extracting spatiotemporal information of different scales.
  • the R convolutional neural network models may include at least two of the above three neural network models.
  • For example, a first convolutional neural network model among the R convolutional neural network models may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, while a second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, while the second includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, while the second includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, the second a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and the third a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • In this way, the multiple convolutional neural network models can extract spatiotemporal information of the image blocks at different scales, and can thus satisfy the requirements of recognizing gestures both quickly and accurately.
  • the process of training the image processing apparatus 200 can be divided into two stages.
  • In the first stage, manually labeled gestures and a cross-entropy loss function can be used to pre-train the entire network, so that the network is trained for the case where the multiple images include only one gesture.
  • In the second stage, augmented gestures (that is, gestures with noise added on the time axis to increase or decrease the length of the image sequence corresponding to the gesture) and a connectionist temporal classification (CTC) loss function can be used to adjust the pre-trained network, so that the entire network is trained for the case where the multiple images include multiple gestures and the image length of each gesture is increased or decreased.
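  • The following sketch illustrates the two training stages with standard PyTorch loss functions; the stand-in model, data shapes, class count, and optimizer settings are illustrative assumptions, not values from the patent:

```python
# Hedged sketch of the two-stage training described above: cross-entropy
# pre-training on single-gesture clips, then CTC fine-tuning on sequences
# of several time-warped gestures.
import torch
import torch.nn as nn

model = nn.GRU(128, 64, batch_first=True)    # stand-in for the CNN+RNN stack
head = nn.Linear(64, 11)                     # 10 gesture classes + CTC blank (0)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

# Stage 1: cross-entropy loss on clips containing exactly one gesture.
x, labels = torch.randn(4, 16, 128), torch.randint(1, 11, (4,))
out, _ = model(x)
loss = nn.CrossEntropyLoss()(head(out[:, -1]), labels)
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: CTC loss on longer sequences containing several gestures whose
# lengths were augmented (stretched or shrunk) on the time axis.
x2 = torch.randn(4, 32, 128)
out2, _ = model(x2)
log_probs = head(out2).log_softmax(-1).transpose(0, 1)   # (T, N, C) for CTC
targets = torch.randint(1, 11, (4, 3))                   # 3 gestures/sequence
loss2 = nn.CTCLoss(blank=0)(log_probs, targets,
                            input_lengths=torch.full((4,), 32, dtype=torch.long),
                            target_lengths=torch.full((4,), 3, dtype=torch.long))
loss2.backward(); opt.step()
```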
  • the image processing apparatus 200 can quickly and accurately recognize dynamic gestures.
  • According to an embodiment of the present disclosure, multiple input images can be divided into multiple image blocks, and the spatiotemporal features of the image blocks can be extracted using a separable convolutional network together with a pointwise or dilated convolutional network, which greatly reduces the amount of computation in the gesture recognition process.
  • In addition, spatiotemporal features of the image blocks at different scales can be extracted, ensuring both the accuracy and the speed of recognition.
  • Furthermore, a recurrent neural network processes the spatiotemporal features of each image block while taking into account the proportional, integral, and/or differential information of the accumulated output, making the recognition result more accurate.
  • the image processing apparatus 200 according to an embodiment of the present disclosure can quickly and accurately recognize dynamic gestures.
  • FIG. 15 is a flowchart illustrating an image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure.
  • In step S1510, a plurality of consecutively input images are divided into a plurality of image blocks.
  • In step S1520, a convolutional neural network model is used to extract the spatiotemporal features of each image block; the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network.
  • In step S1530, a recurrent neural network model is used to determine the gestures included in the multiple images according to the spatiotemporal features of each image block.
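  • A hedged end-to-end sketch of these three steps follows; all modules, shapes, and the treatment of the M frames as input channels are illustrative assumptions rather than the patent's concrete implementation:

```python
# Hedged end-to-end sketch of the method of FIG. 15 (steps S1510-S1530).
import torch
import torch.nn as nn

M, X, Y, feat_dim, num_classes = 8, 14, 3, 128, 10

cnn = nn.Sequential(                            # stand-in for the CNN model
    nn.Conv2d(M, M, 3, padding=1, groups=M),    # depthwise over (X, Y) grid
    nn.Conv2d(M, 32, 1),                        # pointwise convolution
    nn.Flatten(), nn.LazyLinear(feat_dim))      # fully connected network
rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
classifier = nn.Linear(feat_dim, num_classes)

def recognize(blocks):
    """blocks: tensor (num_blocks, M, X, Y) produced in step S1510."""
    feats = cnn(blocks)                          # S1520: one feature per block
    _, h = rnn(feats.unsqueeze(0))               # S1530: temporal modeling
    return classifier(h[-1]).argmax(dim=-1)      # final gesture label

print(recognize(torch.randn(4, M, X, Y)))
```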
  • According to an embodiment of the present disclosure, dividing the consecutively input multiple images into multiple image blocks includes dividing M consecutively input images into one image block, where M is an integer greater than or equal to 2; and using the convolutional neural network model to extract the spatiotemporal features of each image block includes inputting the features of each key point of each of the M images, as the features of the image block, to the convolutional neural network model.
  • the convolutional neural network model also includes a fully connected network.
  • According to an embodiment of the present disclosure, the convolutional neural network model includes: multiple separable convolutional networks, one or more pointwise or dilated convolutional networks, and one fully connected network; or N separable convolutional networks, N pointwise or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  • According to an embodiment of the present disclosure, the image processing method further includes: determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and determining the final gesture according to the output result of each recurrent neural network model.
  • According to an embodiment of the present disclosure, a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network; a second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network; and a third convolutional neural network model includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • According to an embodiment of the present disclosure, using the recurrent neural network model to determine the gestures included in the multiple images includes determining the output information at the current time according to the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or differential information of the output at the previous time.
  • The entity that executes the above method may be the image processing device 200 according to an embodiment of the present disclosure, so all of the foregoing embodiments regarding the image processing device 200 apply here as well.
  • the present disclosure can be applied to various scenarios.
  • the image processing apparatus 200 of the present disclosure can be used for gesture recognition, and specifically can perform online dynamic gesture recognition.
  • Although the present disclosure takes online dynamic gesture recognition as an example, the present disclosure is not limited thereto and can be applied to other scenarios involving the processing of time-series signals.
  • FIG. 16 is a block diagram showing an example of an electronic device 1600 that can implement the image processing apparatus 200 according to the present disclosure.
  • the electronic device 1600 may be, for example, a user equipment, for example, may be implemented as a mobile terminal (such as a smart phone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle type mobile router, and a digital camera) or a vehicle terminal.
  • a mobile terminal such as a smart phone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle type mobile router, and a digital camera
  • the electronic device 1600 includes a processor 1601, a memory 1602, a storage device 1603, a network interface 1604, and a bus 1606.
  • the processor 1601 may be, for example, a central processing unit (CPU) or a digital signal processor (DSP), and controls the functions of the electronic device 1600.
  • the memory 1602 includes random access memory (RAM) and read only memory (ROM), and stores data and programs executed by the processor 1601.
  • the storage device 1603 may include a storage medium such as a semiconductor memory and a hard disk.
  • the network interface 1604 is a wired communication interface for connecting the electronic device 1600 to the wired communication network 1605.
  • the wired communication network 1605 may be a core network such as an evolved packet core network (EPC) or a packet data network (PDN) such as the Internet.
  • the bus 1606 connects the processor 1601, the memory 1602, the storage device 1603, and the network interface 1604 to each other.
  • the bus 1606 may include two or more buses (such as a high-speed bus and a low-speed bus) each having a different speed.
  • the preprocessing unit 210, the extraction unit 220, the determination unit 230, and the decision unit 240 described in FIG. 2 can be implemented by the processor 1601.
  • For example, by executing instructions stored in the memory 1602 or the storage device 1603, the processor 1601 may perform the functions of dividing the continuously input multiple images into multiple image blocks, extracting the spatiotemporal features of each image block using a convolutional neural network model, and determining the gestures included in the multiple images using a recurrent neural network.
  • In addition, the units shown in dashed boxes in the functional block diagrams in the drawings indicate that the corresponding functional unit is optional in the corresponding device, and the optional functional units can be combined in an appropriate manner to achieve the required functions.
  • a plurality of functions included in one unit in the above embodiments may be realized by separate devices.
  • the multiple functions implemented by multiple units in the above embodiments may be implemented by separate devices, respectively.
  • one of the above functions can be implemented by multiple units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
  • the steps described in the flowchart include not only processing performed in time series in the described order, but also processing performed in parallel or individually rather than necessarily in time series.
  • the order can be changed appropriately.

Abstract

The present disclosure relates to an image processing apparatus, an image processing method, and a computer-readable storage medium. The image processing apparatus according to the present disclosure comprises a processing circuit configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model comprising a separable convolutional network and a pointwise convolutional network, or comprising a separable convolutional network and a dilated convolutional network; and determine gestures in the plurality of images according to the spatio-temporal features of each image block using a recurrent neural network model. By means of the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, a dynamic gesture can be quickly and accurately recognized.

Description

Image processing apparatus, image processing method, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 14, 2020 under application number 202010407312.X and entitled "Image Processing Apparatus, Image Processing Method, and Computer-Readable Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present disclosure generally relate to the field of image processing, and in particular to an image processing apparatus, an image processing method, and a computer-readable storage medium. More specifically, the embodiments of the present disclosure relate to an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing gestures included in a plurality of continuously input images.
Background
Dynamic gesture recognition refers to a technology that recognizes a sequence of dynamic gestures composed of consecutively input multiple frames of images. Due to the flexibility and convenience of gestures, dynamic gesture recognition has broad application prospects in human-computer interaction, AR (Augmented Reality)/VR (Virtual Reality), and other environments.
Online dynamic gesture recognition is a technology for segmenting and recognizing multiple continuous dynamic gestures. Compared with offline dynamic gesture recognition, online dynamic gesture recognition is very challenging, mainly in two respects: distinguishing the start frame and end frame of a gesture, and recognizing the gesture. For online dynamic gesture recognition, different gestures can be distinguished by selecting one or several key frames for each type of gesture, but because the key frames must be selected manually, this approach carries strong uncertainty. In addition, when there are many types of gestures, it is difficult to select an appropriate key frame for each type of gesture. It is also possible to model adjacent image frames with a hidden Markov model to distinguish different gestures; however, because the expressive power of the hidden Markov model is relatively weak, only a few categories of gestures can be recognized in this way.
Therefore, it is necessary to propose a technical solution that recognizes dynamic gestures quickly and accurately.
发明内容Summary of the invention
This section provides a general summary of the present disclosure, rather than a comprehensive disclosure of its full scope or of all of its features.

An object of the present disclosure is to provide an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing dynamic gestures quickly and accurately.

According to an aspect of the present disclosure, there is provided an image processing apparatus including processing circuitry configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or including a separable convolution network and a dilated convolution network; and determine, using a recurrent neural network (RNN) model, a gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.

According to another aspect of the present disclosure, there is provided an image processing method including: dividing a plurality of continuously input images into a plurality of image blocks; extracting spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or including a separable convolution network and a dilated convolution network; and determining, using a recurrent neural network model, a gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.

According to another aspect of the present disclosure, there is provided a computer-readable storage medium including executable computer instructions that, when executed by a computer, cause the computer to execute the image processing method according to the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program that, when executed by a computer, causes the computer to execute the image processing method according to the present disclosure.

With the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, the spatio-temporal features of the image blocks can be extracted using a convolutional neural network that includes a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network, so that a recurrent neural network can recognize the gesture from the extracted spatio-temporal features. Because a separable convolution network is combined with a pointwise or dilated convolution network, the computational cost of gesture recognition is reduced, and dynamic gestures can be recognized quickly and accurately.

Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Description of the Drawings
The drawings described herein are for illustrative purposes only of selected embodiments, not of all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram showing gestures included in a plurality of consecutive images;

FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram showing a process of extracting key points from images according to an embodiment of the present disclosure;

FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model;

FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure;

FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 15 is a flowchart showing an image processing method according to an embodiment of the present disclosure; and

FIG. 16 is a block diagram showing an example of an electronic device that can implement the image processing apparatus according to the present disclosure.
While the present disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the description of specific embodiments herein is not intended to limit the present disclosure to the particular forms disclosed; on the contrary, the present disclosure is intended to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure. It should be noted that corresponding reference numerals indicate corresponding components throughout the several drawings.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, its application, or its uses.

Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of specific components, devices, and methods, are set forth to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that the specific details need not be employed, that example embodiments may be embodied in many different forms, and that none of them should be construed to limit the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
The description will proceed in the following order:

1. Configuration example of the image processing apparatus;

2. Example of the image processing method;

3. Application examples.
<1. Configuration example of the image processing apparatus>
FIG. 1 is a schematic diagram showing gestures included in a plurality of consecutive images. As shown in FIG. 1, the upper row shows an example in which the images include a "double tap" gesture, and the lower row shows an example in which the images include a "squeeze" gesture.

As mentioned above, as the number of gesture types grows, it is difficult for existing gesture recognition technology to recognize the various types of gestures quickly and accurately. The present disclosure therefore aims to provide an image processing apparatus, an image processing method, and a computer-readable storage medium capable of quickly and accurately recognizing various dynamic gestures.
FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus 200 according to an embodiment of the present disclosure. Here, the image processing apparatus 200 can recognize gestures included in a plurality of continuously input images, for example a video, a moving image, or a rapidly input sequence of still images. Specifically, the image processing apparatus 200 can recognize dynamic gestures in real time, that is, online.

As shown in FIG. 2, the image processing apparatus 200 may include a preprocessing unit 210, an extraction unit 220, and a determination unit 230.

Here, each unit of the image processing apparatus 200 may be included in processing circuitry. It should be noted that the image processing apparatus 200 may include either one processing circuit or a plurality of processing circuits. Further, the processing circuitry may include various discrete functional units that perform different functions and/or operations. It should be noted that these functional units may be physical entities or logical entities, and that units with different names may be implemented by the same physical entity.
According to an embodiment of the present disclosure, the preprocessing unit 210 may divide a plurality of continuously input images into a plurality of image blocks.

According to an embodiment of the present disclosure, the extraction unit 220 may extract the spatio-temporal features of each image block using a convolutional neural network model. According to an embodiment of the present disclosure, the convolutional neural network model may include a separable convolution network and a pointwise convolution network; alternatively, it may include a separable convolution network and a dilated convolution network.

According to an embodiment of the present disclosure, the determination unit 230 may determine, using a recurrent neural network model, the gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.
As described above, the image processing apparatus 200 according to an embodiment of the present disclosure can extract the spatio-temporal features of the image blocks using a convolutional neural network model that includes a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network, so that a recurrent neural network can recognize the gesture from the extracted spatio-temporal features. Because a separable convolution network is combined with a pointwise or dilated convolution network, the computational cost of gesture recognition is reduced, and dynamic gestures can be recognized quickly and accurately.

In the present disclosure, separable convolution is also referred to as depthwise separable convolution. By decoupling the spatial dimension from the channel (depth) dimension, it reduces the number of parameters required for the convolution computation. The computation of a depthwise separable convolution is divided into two parts: first, each channel (depth) is convolved spatially on its own and the outputs are concatenated; then a unit convolution kernel performs a channel-wise convolution to obtain the feature map.

In the present disclosure, pointwise convolution uses a 1×1 convolution kernel, that is, a kernel that traverses every point, where the depth of the kernel equals the number of channels of the image input to the pointwise convolution network.

In the present disclosure, dilated convolution, also called atrous convolution, injects holes into the convolution kernel. A dilated convolution has a parameter, the dilation rate, whose meaning is that (dilation rate - 1) zeros are inserted into the convolution kernel. Different dilation rates yield different receptive fields. Dilated convolution can therefore enlarge the receptive field and obtain multi-scale context information.
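As a minimal sketch of the three convolution types just described, assuming a PyTorch-style API (the channel counts and tensor sizes are placeholders, not values from this disclosure):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64                      # placeholder channel counts

# Depthwise separable convolution: a per-channel spatial convolution
# (groups=in_ch), followed by a 1x1 unit-kernel channel convolution.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

# Dilated (atrous) convolution: dilation=2 inserts one zero between kernel
# taps, enlarging the receptive field without adding parameters.
dilated = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, in_ch, 56, 56)           # (batch, channels, height, width)
y = pointwise(depthwise(x))                 # depthwise separable pipeline
z = dilated(x)
print(y.shape, z.shape)                     # both: torch.Size([1, 64, 56, 56])
```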
According to an embodiment of the present disclosure, the input to the image processing apparatus is a plurality of images (or frames of images) including a gesture. According to an embodiment of the present disclosure, each image may be either an RGB image or a depth image.

According to an embodiment of the present disclosure, the preprocessing unit 210 may divide the plurality of images input to the image processing apparatus 200 into a plurality of image blocks. Specifically, the preprocessing unit 210 may assign M continuously input images to one image block, M being an integer greater than or equal to 2. That is, taking M images as a unit, the preprocessing unit 210 divides the images input to the image processing apparatus into a plurality of image blocks, and each image block of M images can be regarded as one spatio-temporal unit. Preferably, M may be 4, 8, 16, 32, or the like. For example, when M is 8, the preprocessing unit 210 may, starting from an arbitrary position, assign 8 continuously input images to one image block: the 1st to 8th images input to the image processing apparatus 200 form the first image block, the 9th to 16th images form the second image block, and so on.
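A minimal sketch of this blocking step, assuming the frames arrive as a Python sequence (the helper name is illustrative):

```python
# Groups every M consecutive frames into one image block, as described above;
# M=8 mirrors the example in the text. A trailing partial block is dropped.
def split_into_blocks(frames, M=8):
    for start in range(0, len(frames) - M + 1, M):
        yield frames[start:start + M]

# Usage: frames 1-8 form block 1, frames 9-16 form block 2, and so on.
blocks = list(split_into_blocks(list(range(1, 25)), M=8))
print(blocks)  # [[1..8], [9..16], [17..24]]
```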
According to an embodiment of the present disclosure, the preprocessing unit 210 may further determine the features of each of the divided image blocks and may input the features of each image block to the extraction unit 220.

According to an embodiment of the present disclosure, the preprocessing unit 210 may extract the features of a plurality of key points from each of the images input to the image processing apparatus 200. Further, the preprocessing unit 210 may use the features of the key points of each of the M images included in an image block as the features of that image block.

Here, in the case of gesture recognition, the key points may be, for example, the joint points of the hand making the gesture. The present disclosure does not limit the number of key points per image. For example, the preprocessing unit 210 may extract the features of X key points from each image, X being an integer greater than or equal to 2. For example, in the case of X=14, the preprocessing unit 210 may use the features of the 14 key points of each of the M images in an image block as the features of that image block, so that the image block has 14×M key points in total.

FIG. 3 is a schematic diagram showing a process of extracting key points from images according to an embodiment of the present disclosure. The upper part of FIG. 3 shows three of the images input to the image processing apparatus 200, and the lower part shows the process of extracting key points from these three images. As shown in FIG. 3, 14 key points are extracted from each image.

According to an embodiment of the present disclosure, the features of each key point may include features of multiple dimensions. In addition, the features of each key point may be the spatial features of that key point. For example, the features of each key point may include Y spatial features of the key point, Y being, for example, 3; that is, the features of each key point may include the three coordinates of the key point in three-dimensional space.

As described above, according to an embodiment of the present disclosure, one image block includes M images, each image includes X key points, and each key point has Y spatial features, so each image block includes M×X×Y features. The preprocessing unit 210 may input the M×X×Y features of each image block, as the features of that image block, to the convolutional neural network model in the extraction unit 220. Further, the preprocessing unit 210 may input the features of the image blocks to the extraction unit 220 in block order; that is, the features of an image block that is earlier in time are input to the extraction unit 220 before those of an image block that is later in time.
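A short sketch of the resulting per-block feature layout, using the example values M=8, X=14, Y=3 from the text (the tensor contents here are random placeholder data):

```python
import torch

M, X, Y = 8, 14, 3                       # frames per block, keypoints, coordinates
block_features = torch.randn(M, X, Y)    # one spatio-temporal unit
print(block_features.numel())            # M*X*Y = 336 features fed to the CNN model
```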
According to an embodiment of the present disclosure, the extraction unit 220 may extract the spatio-temporal features of each image block using a convolutional neural network model. The convolutional neural network model may include a separable convolution network and a pointwise convolution network, or may include a separable convolution network and a dilated convolution network.

According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may further include a fully connected network. Every node of the fully connected network is connected to all nodes of the preceding network, and the fully connected network serves to combine the features extracted by the preceding network.

FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 4, the convolutional neural network model may include a separable convolution network, a pointwise convolution network or a dilated convolution network, and a fully connected network.
According to an embodiment of the present disclosure, the convolutional neural network model may include N separable convolution networks, N pointwise convolution networks or dilated convolution networks, and N fully connected networks, where N is a positive integer. In other words, the model contains equal numbers of separable convolution networks, pointwise or dilated convolution networks, and fully connected networks. That is, the input of the convolutional neural network model passes in sequence through N groups each consisting of a separable convolution network, a pointwise or dilated convolution network, and a fully connected network, and within each group the order from input to output is the separable convolution network, then the pointwise or dilated convolution network, then the fully connected network.

For convenience of description, the separable convolution network may be denoted A, the pointwise or dilated convolution network B, and the fully connected network C. The convolutional neural network model in the extraction unit 220 may then include, in order from input to output, A, B, C, or A, B, C, A, B, C, and so on.

FIG. 4 shows the case of N=1, that is, the convolutional neural network model includes one group consisting of a separable convolution network, a pointwise or dilated convolution network, and a fully connected network.

FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 5, the convolutional neural network model may include a separable convolution network, a pointwise or dilated convolution network, a fully connected network, a separable convolution network, a pointwise or dilated convolution network, and a fully connected network. That is, FIG. 5 shows the case of N=2, in which the model includes two such groups. The case where N is greater than 2 is similar and is not repeated here.
According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may include a plurality of separable convolution networks, one or more pointwise or dilated convolution networks, and one fully connected network.

In this case, the number of separable convolution networks may be one more than the number of pointwise or dilated convolution networks. For example, if the number of pointwise or dilated convolution networks is V, V being a positive integer, the number of separable convolution networks is V+1. From input to output, the convolutional neural network model may then include V groups each consisting of a separable convolution network followed by a pointwise or dilated convolution network, then one further separable convolution network, and then one fully connected network. That is, the structure preceding the fully connected network begins with a separable convolution network, ends with a separable convolution network, and alternates separable convolution networks with pointwise or dilated convolution networks.

With the notation above, the convolutional neural network model in the extraction unit 220 may include, in order from input to output, A, B, A, C, or A, B, A, B, ..., A, B, A, C.

FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 6, the convolutional neural network model in the extraction unit 220 may include a separable convolution network, a pointwise or dilated convolution network, a separable convolution network, and a fully connected network; that is, FIG. 6 shows the case of V=1. The case where V is greater than 1 is similar and is not repeated here.
According to an embodiment of the present disclosure, the convolutional neural network model may include a plurality of separable convolution networks, a plurality of pointwise or dilated convolution networks, and one fully connected network, the number of separable convolution networks being equal to the number of pointwise or dilated convolution networks, for example Z, Z being an integer greater than or equal to 2. From input to output, the convolutional neural network model may then include Z groups each consisting of a separable convolution network followed by a pointwise or dilated convolution network, and then one fully connected network. That is, the structure preceding the fully connected network begins with a separable convolution network, ends with a pointwise or dilated convolution network, and alternates the two network types.

With the notation above, the convolutional neural network model in the extraction unit 220 may include, in order from input to output, A, B, A, B, C, or A, B, A, B, ..., A, B, C.

FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 7, the convolutional neural network model in the extraction unit 220 may include a separable convolution network, a pointwise or dilated convolution network, a separable convolution network, a pointwise or dilated convolution network, and a fully connected network; that is, FIG. 7 shows the case of Z=2. The case where Z is greater than 2 is similar and is not repeated here.
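The A/B/C patterns enumerated above can be sketched with a small pattern-driven builder. The helper names, channel sizes, and the 128-dimensional fully connected output below are illustrative assumptions, not part of the disclosure, and A is represented only by the depthwise spatial stage of the separable convolution:

```python
import torch
import torch.nn as nn

def make_block(kind, ch=32):
    if kind == "A":      # A: separable convolution (depthwise spatial stage)
        return nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)
    if kind == "B":      # B: pointwise convolution (a dilated Conv2d could be used instead)
        return nn.Conv2d(ch, ch, kernel_size=1)
    if kind == "C":      # C: fully connected layer
        return nn.Sequential(nn.Flatten(), nn.LazyLinear(128))
    raise ValueError(kind)

def build(pattern):      # e.g. "ABC" (FIG. 4), "ABAC" (FIG. 6), "ABABC" (FIG. 7)
    return nn.Sequential(*[make_block(k) for k in pattern])

model = build("ABABC")                      # the Z=2 structure of FIG. 7
out = model(torch.randn(1, 32, 24, 24))
print(out.shape)                            # torch.Size([1, 128])
```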
The foregoing has described, by way of example, the structure of the convolutional neural network model in the extraction unit 220. Several specific examples of convolutional neural network models according to embodiments of the present disclosure are described below.

According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be 1, and the pointwise-or-dilated stage of the model may be a pointwise convolution network.

FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 8, the convolutional neural network model may include a separable convolution network with a stride of 1, a pointwise convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the pointwise convolution network and Q denotes the number of such kernels; preferably, S=T=1.

According to an embodiment of the present disclosure, in FIG. 8, because the stride of the separable convolution network is 1, local spatio-temporal information of the image block can be extracted. Here, spatio-temporal information may include temporal information and spatial information. Because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block; and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 8 shows an example in which the convolutional neural network model includes one separable convolution network, one pointwise convolution network, and one fully connected network. However, FIG. 8 may be arbitrarily modified according to the model structures described above.
According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be greater than 1, and the pointwise-or-dilated stage of the model may be a pointwise convolution network.

FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 9, the convolutional neural network model may include a separable convolution network with a stride greater than 1, a pointwise convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the pointwise convolution network and Q denotes the number of such kernels; preferably, S=T=1.

According to an embodiment of the present disclosure, in FIG. 9, because the stride of the separable convolution network is greater than 1, spatio-temporal information of the image block related to a medium distance can be extracted. Spatio-temporal information related to a medium distance is information intermediate between local and global spatio-temporal information, and depends on the size of the stride. As before, spatio-temporal information may include temporal information and spatial information: because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block, and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 9 shows an example in which the convolutional neural network model includes one separable convolution network, one pointwise convolution network, and one fully connected network. However, FIG. 9 may be arbitrarily modified according to the model structures described above.
According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be 1, and the pointwise-or-dilated stage of the model may be a dilated convolution network.

FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 10, the convolutional neural network model may include a separable convolution network with a stride of 1, a dilated convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the dilated convolution network and Q denotes the number of such kernels; preferably, S=5 and T=3.

According to an embodiment of the present disclosure, in FIG. 10, because the dilated convolution network has a large receptive field, global spatio-temporal information of the image block can be extracted. As before, spatio-temporal information may include temporal information and spatial information: because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block, and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 10 shows an example in which the convolutional neural network model includes one separable convolution network, one dilated convolution network, and one fully connected network. However, FIG. 10 may be arbitrarily modified according to the model structures described above.
The foregoing has described various examples of the convolutional neural network model in the extraction unit 220 according to embodiments of the present disclosure. These examples are merely illustrative, and the present disclosure is not limited to these structures. The determination unit 230 according to an embodiment of the present disclosure is described below.

According to an embodiment of the present disclosure, the determination unit 230 may determine the gesture included in the plurality of images from the spatio-temporal features of the image blocks output by the extraction unit 220, using a recurrent neural network model. Specifically, the determination unit 230 may determine (model) the temporal relationship between the image blocks according to their spatio-temporal features, and thereby output a state vector representing the gesture.

FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model. The model shown in FIG. 11 is a common recurrent neural network model. As shown in FIG. 11, the output o_t of the model at time t depends on the input x_t at time t and on the output h_{t-1} at the previous time t-1. That is, in a recurrent neural network, a neuron can receive not only information from other neurons but also its own information, forming a network structure with loops; such a network is therefore also called a neural network with short-term memory.
According to an embodiment of the present disclosure, the recurrent neural network model may determine the output information at the current time from the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or the differential information of the output at the previous time.

According to an embodiment of the present disclosure, the proportional information of the output at the previous time may be, for example, the output at the previous time itself, or information computed from that output in a fixed proportion.

According to an embodiment of the present disclosure, the integral information of the output at the previous time is information obtained by integrating the output at the previous time.

According to an embodiment of the present disclosure, the differential information of the output at the previous time is information obtained by differentiating the output at the previous time. For example, the differential information of the output at the previous time may include the first-order to K-order differential information of that output, that is, the information obtained by applying first-order to K-order differential operations to it, where K is an integer greater than or equal to 2.
FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure. In FIG. 12, x_t denotes the input information at time t; o_t denotes the output information at time t, which equals h_t; h_{t-1} denotes the output information at time t-1, which also serves as the proportional information of that output; S_{t-1} denotes the integral information of the output information at time t-1; h^(1)_{t-1} denotes the first-order differential information of the output information at time t-1; and h^(K)_{t-1} denotes the K-order differential information of the output information at time t-1.
According to an embodiment of the present disclosure, the integral information S_{t-1} of the output information at time t-1 can be calculated, for example, as the accumulation of the past outputs:

S_{t-1} = Σ_{i=1}^{t-1} h_i
According to an embodiment of the present disclosure, the first-order differential information h^(1)_{t-1} of the output information at time t-1 can be calculated, for example, as the difference between successive outputs:

h^(1)_{t-1} = h_{t-1} - h_{t-2}
According to an embodiment of the present disclosure, the second-order differential information h^(2)_{t-1} of the output information at time t-1 can be calculated, for example, as the difference between successive first-order differentials:

h^(2)_{t-1} = h^(1)_{t-1} - h^(1)_{t-2}
In a similar manner, the K-order differential information of the output information at time t-1 can be calculated as h^(K)_{t-1} = h^(K-1)_{t-1} - h^(K-1)_{t-2}.
According to an embodiment of the present disclosure, the output information h_t at time t can be calculated according to the following formula:

h_t = σ(W_he E_t + b_h)

where W_he denotes the state update matrix, σ is an activation function, including but not limited to the ReLU (Rectified Linear Unit) function, and b_h is a bias vector that can be set according to empirical values. E_t denotes the state formula, that is, the memory of the recurrent neural network at time t, and can be calculated, for example, as a weighted combination of the current input and the proportional, integral, and differential information of the previous output:

E_t = W_ex x_t + W_eh h_{t-1} + W_es S_{t-1} + Σ_{k=1}^{K} W_ek h^(k)_{t-1}

where W_ex, W_eh, W_es, and W_ek (k = 1, ..., K) are weight matrices.
As described above, in FIG. 12, the recurrent neural network model can determine the state at the current time, and thereby the output information at the current time, from the input information at the current time, the proportional information of the output at the previous time, and the integral and differential information of the output at the previous time. It is worth noting that, although FIG. 12 shows an example in which the current output is determined from the current input, the proportional information of the previous output, and both the integral information and the differential information of the previous output, the current output may also be determined from the current input, the proportional information of the previous output, and only the integral information of the previous output, or from the current input, the proportional information of the previous output, and only the differential information of the previous output.
As described above, according to an embodiment of the present disclosure, the recurrent neural network in the determination unit 230 can determine the current output not only from the current input and the previous output, but also from at least one of the integral information and the differential information of the previous output. Because the proportional information of the output attends to the state of the current image block, the differential information attends to changes of state, and the integral information attends to the accumulation of state, the determination unit 230 according to an embodiment of the present disclosure can capture the variation and trend of a gesture on the time scale comprehensively, and thereby achieve better recognition accuracy.
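A hedged sketch of such a recurrent cell, following the reconstructed formulas above; the weight names, the choice of ReLU for σ, and the restriction to a first-order differential term are assumptions for illustration, not a verbatim implementation of the disclosure:

```python
import torch
import torch.nn as nn

class PIDRNNCell(nn.Module):
    """State E_t combines the current input with proportional, integral and
    first-order differential information of past outputs, then applies
    h_t = sigma(W_he @ E_t + b_h)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W_ex = nn.Linear(in_dim, hid_dim, bias=False)   # current input x_t
        self.W_eh = nn.Linear(hid_dim, hid_dim, bias=False)  # proportional: h_{t-1}
        self.W_es = nn.Linear(hid_dim, hid_dim, bias=False)  # integral: S_{t-1}
        self.W_ed = nn.Linear(hid_dim, hid_dim, bias=False)  # differential: h_{t-1}-h_{t-2}
        self.W_he = nn.Linear(hid_dim, hid_dim)              # state update matrix + bias b_h

    def forward(self, x_t, h_prev, h_prev2, S_prev):
        d_prev = h_prev - h_prev2                            # first-order differential
        E_t = (self.W_ex(x_t) + self.W_eh(h_prev)
               + self.W_es(S_prev) + self.W_ed(d_prev))
        h_t = torch.relu(self.W_he(E_t))                     # sigma chosen as ReLU here
        return h_t, S_prev + h_t                             # S_t accumulates outputs

cell = PIDRNNCell(in_dim=336, hid_dim=128)   # 336 = M*X*Y block features
h_prev = h_prev2 = S = torch.zeros(1, 128)
for x_t in torch.randn(5, 1, 336):           # five image blocks in sequence
    h_t, S_new = cell(x_t, h_prev, h_prev2, S)
    h_prev2, h_prev, S = h_prev, h_t, S_new
print(h_t.shape)                             # torch.Size([1, 128]) state vector
```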
According to an embodiment of the present disclosure, the extraction unit 220 obtains the spatio-temporal features of each image block; because a gesture may span multiple image blocks, the determination unit 230 can model the temporal relationship between different image blocks, so that the gesture can be recognized accurately and quickly.

According to an embodiment of the present disclosure, as shown in FIG. 2, the image processing apparatus 200 may further include a decision unit 240 for determining the final gesture according to the output of the determination unit 230.

According to an embodiment of the present disclosure, the output of the recurrent neural network in the determination unit 230 may be a 128-dimensional state vector, determined from the spatio-temporal features of the image blocks, that corresponds to the different gestures. The decision unit 240 may include a classifier for mapping the state vector output by the determination unit 230 to a gesture.

According to an embodiment of the present disclosure, the extraction unit 220 may include one convolutional neural network model and the determination unit 230 may include one recurrent neural network model, so that the decision unit 240 determines the final gesture from the output of that recurrent neural network model.
FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 13, the input of the image processing apparatus 200 passes in sequence through the convolutional neural network model in the extraction unit 220, the recurrent neural network model in the determination unit 230, and the classifier in the decision unit 240, which outputs the gesture recognition result.

According to an embodiment of the present disclosure, the extraction unit 220 may include a plurality of convolutional neural network models and the determination unit 230 may include a plurality of recurrent neural network models, so that the decision unit 240 determines the final gesture from the output of each of the recurrent neural network models. Here, the inputs of the convolutional neural network models are all the same, namely the plurality of images input to the image processing apparatus 200. That is, each pair of a convolutional neural network model and a recurrent neural network model is used to determine a state vector of the gesture, and the classifier in the decision unit 240 then determines the final gesture. For example, the classifier may average the state vectors output by the recurrent neural network models and then determine the final gesture.

FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 14, the image processing apparatus 200 includes R convolutional neural network models, R recurrent neural network models, and one classifier, where R is an integer greater than or equal to 2. Specifically, the input images are fed to convolutional neural network model 1 and recurrent neural network model 1 to obtain a first 128-dimensional state vector; the same input images are fed to convolutional neural network model 2 and recurrent neural network model 2 to obtain a second 128-dimensional state vector; and so on, up to convolutional neural network model R and recurrent neural network model R, which yield the R-th 128-dimensional state vector. The classifier combines the outputs of the R recurrent neural network models to obtain the final gesture recognition result.
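A minimal sketch of this decision step (the averaging classifier), with illustrative values for R and the number of gesture classes; the linear classifier is a stand-in assumption:

```python
import torch
import torch.nn as nn

R, num_gestures = 3, 10                                  # illustrative values
state_vectors = [torch.randn(1, 128) for _ in range(R)]  # outputs of the R RNN models
fused = torch.stack(state_vectors).mean(dim=0)           # average over the R branches
classifier = nn.Linear(128, num_gestures)
gesture_id = classifier(fused).argmax(dim=-1).item()     # final recognized gesture
print(gesture_id)
```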
As described above, according to embodiments of the present disclosure, multiple pairs of convolutional neural network models and recurrent neural network models can be used to recognize gestures, making the recognized gestures more accurate.

As described above, a convolutional neural network model including a separable convolution network with a stride of 1 and a pointwise convolution network can extract local spatio-temporal information of an image block; a model including a separable convolution network with a stride greater than 1 and a pointwise convolution network can extract spatio-temporal information related to a medium distance; and a model including a separable convolution network with a stride of 1 and a dilated convolution network can extract global spatio-temporal information of an image block. Therefore, according to an embodiment of the present disclosure, the R convolutional neural network models may include models capable of extracting spatio-temporal information at different scales; that is, the R models may include at least two of the above three types.

For example, in the case of R=2, the first of the R convolutional neural network models may include a separable convolution network with a stride of 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride greater than 1 and a pointwise convolution network. Alternatively, in the case of R=2, the first model may include a separable convolution network with a stride of 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride of 1 and a dilated convolution network. Alternatively, in the case of R=2, the first model may include a separable convolution network with a stride greater than 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride of 1 and a dilated convolution network. In the case of R=3, the first model may include a separable convolution network with a stride of 1 and a pointwise convolution network, the second a separable convolution network with a stride greater than 1 and a pointwise convolution network, and the third a separable convolution network with a stride of 1 and a dilated convolution network.

As described above, according to embodiments of the present disclosure, when the extraction unit 220 includes a plurality of convolutional neural network models, these models can extract spatio-temporal information of the image blocks at different scales, thereby satisfying the requirements of both fast and accurate gesture recognition.
According to an embodiment of the present disclosure, the training of the image processing apparatus 200 may be divided into two stages. In the first stage, manually labelled gestures and a cross-entropy loss function are used to pre-train the entire network, so that the network is trained on sequences in which the images include only one gesture. In the second stage, augmented gestures (that is, gestures with noise added on the time axis, so that the length of the image sequence corresponding to a gesture is increased or decreased) and a connectionist temporal classification loss function are used to fine-tune the pre-trained network, so that the network is trained on sequences in which the images include multiple gestures and the length of each gesture's image sequence may increase or decrease. According to an embodiment of the present disclosure, after these two training stages, the image processing apparatus 200 can recognize dynamic gestures quickly and accurately.
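A hedged sketch of the two loss functions named above, assuming a PyTorch-style API; all shapes, lengths, and label values below are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Stage 1: pre-training, one manually labelled gesture per sequence.
logits = torch.randn(4, 10)                   # (batch, num_gesture_classes)
labels = torch.tensor([2, 0, 7, 7])           # manually labelled gesture classes
stage1_loss = nn.CrossEntropyLoss()(logits, labels)

# Stage 2: fine-tuning, several gestures per stream with time-axis jitter.
log_probs = torch.randn(50, 4, 11).log_softmax(dim=-1)  # (T, batch, classes+blank)
targets = torch.randint(1, 11, (4, 6))                   # gesture label sequences
in_lens = torch.full((4,), 50, dtype=torch.long)
tgt_lens = torch.full((4,), 6, dtype=torch.long)
stage2_loss = nn.CTCLoss(blank=0)(log_probs, targets, in_lens, tgt_lens)
print(stage1_loss.item(), stage2_loss.item())
```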
As described above, the image processing apparatus 200 according to the embodiments of the present disclosure can divide multiple input images into multiple image blocks and extract the spatiotemporal features of the image blocks using a separable convolutional network together with a pointwise convolutional network or a dilated convolutional network, which greatly reduces the amount of computation in gesture recognition. Further, when the image processing apparatus 200 includes multiple convolutional neural network models, spatiotemporal features of the image blocks can be extracted at different scales, ensuring both the accuracy and the speed of recognition. In addition, a recurrent neural network that takes into account the proportional information, integral information, and/or differential information of the accumulated outputs is used to process the spatiotemporal features of each image block, making the recognition result more precise. In summary, the image processing apparatus 200 according to the embodiments of the present disclosure can recognize dynamic gestures quickly and accurately.
<2. Example of the image processing method>
Next, the image processing method executed by the image processing apparatus 200 according to an embodiment of the present disclosure will be described in detail.
FIG. 15 is a flowchart illustrating the image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure.
As shown in FIG. 15, in step S1510, multiple consecutively input images are divided into multiple image blocks.
Next, in step S1520, the spatiotemporal features of each image block are extracted using a convolutional neural network model, where the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network.
Next, in step S1530, a recurrent neural network model is used to determine the gestures included in the multiple images according to the spatiotemporal features of the image blocks.
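The three steps can be read together as one forward pass. The following minimal sketch assumes a generic feature extractor standing in for the convolutional model of step S1520 and substitutes a standard GRU for the recurrent model of step S1530; all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GesturePipeline(nn.Module):
    def __init__(self, extractor, feat_dim, hidden_dim, num_classes, block_len=4):
        super().__init__()
        self.block_len = block_len     # M frames per image block (step S1510)
        self.extractor = extractor     # convolutional model (step S1520); assumed to
                                       # map one block of frames to a feature vector
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # step S1530
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):         # frames: (batch, T, C, H, W), T divisible by M
        b, t = frames.shape[:2]
        blocks = frames.view(b, t // self.block_len, self.block_len, *frames.shape[2:])
        feats = torch.stack(
            [self.extractor(blk) for blk in blocks.unbind(dim=1)], dim=1)
        out, _ = self.rnn(feats)       # temporal reasoning over the block features
        return self.head(out)          # per-block gesture logits
```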
Preferably, dividing the consecutively input multiple images into multiple image blocks includes dividing M consecutively input images into one image block, where M is an integer greater than or equal to 2; and extracting the spatiotemporal features of each image block using the convolutional neural network model includes inputting the features of the key points of each of the M images into the convolutional neural network model as the features of the image block.
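A sketch of this grouping is given below, assuming each frame is represented by K key points with three values each; the key-point count and the feature layout are assumptions for the example.

```python
import numpy as np

def frames_to_blocks(keypoints, m=4):
    """keypoints: array of shape (T, K, 3), e.g. (x, y, confidence) per key point.
    Returns blocks of shape (T // m, m, K, 3); each block of M consecutive frames
    is fed to the convolutional neural network model as one input."""
    t = (keypoints.shape[0] // m) * m   # drop trailing frames that do not fill a block
    return keypoints[:t].reshape(-1, m, *keypoints.shape[1:])

# Example: 32 frames, 21 hand key points
blocks = frames_to_blocks(np.zeros((32, 21, 3)), m=4)   # -> shape (8, 4, 21, 3)
```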
Preferably, the convolutional neural network model further includes a fully connected network.
Preferably, the convolutional neural network model includes: multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
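For the first of these variants, a minimal sketch might look as follows; the channel counts and the spatial size feeding the fully connected layer are assumptions.

```python
import torch.nn as nn

# Several separable (depthwise) convolutions, one pointwise convolution,
# and one fully connected network, per the first variant above.
cnn_model = nn.Sequential(
    nn.Conv2d(16, 16, 3, padding=1, groups=16),  # separable convolution 1
    nn.Conv2d(16, 16, 3, padding=1, groups=16),  # separable convolution 2
    nn.Conv2d(16, 32, kernel_size=1),            # pointwise convolution
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128),                  # fully connected (8x8 input assumed)
)
```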
Preferably, the image processing method further includes: determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and determining the final gesture according to the output result of each recurrent neural network model.
Preferably, the first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, the second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and the third convolutional neural network model includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
Preferably, determining the gestures included in the multiple images using the recurrent neural network model includes: determining the output information at the current time instant according to the input information at the current time instant, the proportional information of the output at the previous time instant, and the integral information and/or the differential information of the output at the previous time instant.
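A hedged sketch of such a recurrent cell is given below. The way the three terms are combined, the learned projections, and the gains kp, ki, and kd are assumptions made for the example, since the disclosure only specifies that proportional, integral, and/or differential information of previous outputs enters the computation.

```python
import torch
import torch.nn as nn

class PIDRecurrentCell(nn.Module):
    def __init__(self, in_dim, out_dim, kp=1.0, ki=0.1, kd=0.1):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, out_dim)    # current input term
        self.fb_proj = nn.Linear(out_dim, out_dim)   # feedback of previous outputs
        self.kp, self.ki, self.kd = kp, ki, kd

    def forward(self, x_t, y_prev, integral, y_prev2):
        p = self.kp * y_prev                    # proportional information
        i = self.ki * (integral + y_prev)       # integral of accumulated outputs
        d = self.kd * (y_prev - y_prev2)        # differential information
        y_t = torch.tanh(self.in_proj(x_t) + self.fb_proj(p + i + d))
        return y_t, integral + y_prev, y_prev   # output, updated integral, new y_prev2

# Usage over a sequence xs of block features, each of shape (batch, in_dim):
# y = torch.zeros(batch, out_dim); integral = torch.zeros_like(y); y2 = torch.zeros_like(y)
# for x_t in xs:
#     y, integral, y2 = cell(x_t, y, integral, y2)
```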
According to the embodiments of the present disclosure, the subject executing the above method may be the image processing apparatus 200 according to the embodiments of the present disclosure, so all of the foregoing embodiments of the image processing apparatus 200 apply here.
<3. Application examples>
The present disclosure can be applied to various scenarios. For example, the image processing apparatus 200 of the present disclosure can be used for gesture recognition, and in particular for online dynamic gesture recognition. Furthermore, although the present disclosure is introduced with online dynamic gesture recognition as an example, the present disclosure is not limited thereto and can be applied to other scenarios related to the processing of time-series signals.
FIG. 16 is a block diagram showing an example of an electronic device 1600 that can implement the image processing apparatus 200 according to the present disclosure. The electronic device 1600 may be, for example, user equipment, and may be implemented as a mobile terminal (such as a smartphone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle-type mobile router, or a digital camera) or a vehicle-mounted terminal.
The electronic device 1600 includes a processor 1601, a memory 1602, a storage device 1603, a network interface 1604, and a bus 1606.
The processor 1601 may be, for example, a central processing unit (CPU) or a digital signal processor (DSP), and controls the functions of the electronic device 1600. The memory 1602 includes random access memory (RAM) and read-only memory (ROM), and stores data and programs executed by the processor 1601. The storage device 1603 may include a storage medium such as a semiconductor memory or a hard disk.
The network interface 1604 is a wired communication interface for connecting the electronic device 1600 to the wired communication network 1605. The wired communication network 1605 may be a core network such as an evolved packet core (EPC) or a packet data network (PDN) such as the Internet.
The bus 1606 connects the processor 1601, the memory 1602, the storage device 1603, and the network interface 1604 to one another. The bus 1606 may include two or more buses with different speeds (such as a high-speed bus and a low-speed bus).
In the electronic device 1600 shown in FIG. 16, the preprocessing unit 210, the extraction unit 220, the determination unit 230, and the decision unit 240 described with reference to FIG. 2 may be implemented by the processor 1601. For example, by executing instructions stored in the memory 1602 or the storage device 1603, the processor 1601 may perform the functions of dividing the consecutively input multiple images into multiple image blocks, extracting the spatiotemporal features of each image block using the convolutional neural network model, and determining the gestures included in the multiple images using the recurrent neural network.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, but the present disclosure is of course not limited to the above examples. Those skilled in the art may make various changes and modifications within the scope of the appended claims, and it should be understood that such changes and modifications naturally fall within the technical scope of the present disclosure.
For example, the units shown in dashed boxes in the functional block diagrams in the drawings indicate that the corresponding functional unit is optional in the corresponding apparatus, and the optional functional units may be combined in an appropriate manner to achieve the required functions.
For example, multiple functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, multiple functions implemented by multiple units in the above embodiments may each be implemented by separate devices. In addition, one of the above functions may be implemented by multiple units. Needless to say, such configurations are included within the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only processing performed in time series in the described order, but also processing performed in parallel or individually rather than necessarily in time series. Furthermore, even for steps processed in time series, the order may, needless to say, be changed appropriately.
Although the embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, it should be understood that the above-described embodiments are only used to illustrate the present disclosure and do not constitute a limitation of the present disclosure. Those skilled in the art may make various modifications and changes to the above embodiments without departing from the spirit and scope of the present disclosure. Therefore, the scope of the present disclosure is defined only by the appended claims and their equivalents.

Claims (15)

  1. An image processing apparatus, comprising processing circuitry configured to:
    divide multiple consecutively input images into multiple image blocks;
    extract spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and
    determine, using a recurrent neural network model, gestures included in the multiple images according to the spatiotemporal features of the image blocks.
  2. The image processing apparatus according to claim 1, wherein the processing circuitry is further configured to:
    divide M consecutively input images into one image block, M being an integer greater than or equal to 2; and
    input features of the key points of each of the M images into the convolutional neural network model as features of the image block.
  3. The image processing apparatus according to claim 1, wherein the convolutional neural network model further includes a fully connected network.
  4. The image processing apparatus according to claim 3, wherein the convolutional neural network model includes:
    multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or
    N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  5. The image processing apparatus according to claim 1, wherein the processing circuitry is further configured to:
    determine the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and
    determine a final gesture according to an output result of each recurrent neural network model.
  6. The image processing apparatus according to claim 5, wherein a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, a second convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and a third convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  7. The image processing apparatus according to claim 1, wherein
    the recurrent neural network model determines output information at a current time instant according to input information at the current time instant, proportional information of an output at a previous time instant, and integral information and/or differential information of the output at the previous time instant.
  8. An image processing method, comprising:
    dividing multiple consecutively input images into multiple image blocks;
    extracting spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and
    determining, using a recurrent neural network model, gestures included in the multiple images according to the spatiotemporal features of the image blocks.
  9. The image processing method according to claim 8, wherein dividing the multiple consecutively input images into multiple image blocks comprises: dividing M consecutively input images into one image block, M being an integer greater than or equal to 2, and
    wherein extracting the spatiotemporal features of each image block using the convolutional neural network model comprises: inputting features of the key points of each of the M images into the convolutional neural network model as features of the image block.
  10. The image processing method according to claim 8, wherein the convolutional neural network model further includes a fully connected network.
  11. The image processing method according to claim 10, wherein the convolutional neural network model includes:
    multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or
    N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  12. The image processing method according to claim 8, further comprising:
    determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and
    determining a final gesture according to an output result of each recurrent neural network model.
  13. The image processing method according to claim 12, wherein a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, a second convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and a third convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  14. The image processing method according to claim 8, wherein determining the gestures included in the multiple images using the recurrent neural network model comprises:
    determining output information at a current time instant according to input information at the current time instant, proportional information of an output at a previous time instant, and integral information and/or differential information of the output at the previous time instant.
  15. A computer-readable storage medium comprising executable computer instructions that, when executed by a computer, cause the computer to perform the image processing method according to any one of claims 8-14.
PCT/CN2021/092004 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium WO2021227933A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180023365.4A CN115349142A (en) 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010407312.X 2020-05-14
CN202010407312.XA CN113673280A (en) 2020-05-14 2020-05-14 Image processing apparatus, image processing method, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021227933A1 (en)

Family

ID=78526428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/092004 WO2021227933A1 (en) 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium

Country Status (2)

Country Link
CN (2) CN113673280A (en)
WO (1) WO2021227933A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113888541B (en) * 2021-12-07 2022-03-25 南方医科大学南方医院 Image identification method, device and storage medium for laparoscopic surgery stage

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN108846440A (en) * 2018-06-20 2018-11-20 腾讯科技(深圳)有限公司 Image processing method and device, computer-readable medium and electronic equipment
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110889387A (en) * 2019-12-02 2020-03-17 浙江工业大学 Real-time dynamic gesture recognition method based on multi-track matching
CN111160114A (en) * 2019-12-10 2020-05-15 深圳数联天下智能科技有限公司 Gesture recognition method, device, equipment and computer readable storage medium
CN112036261A (en) * 2020-08-11 2020-12-04 海尔优家智能科技(北京)有限公司 Gesture recognition method and device, storage medium and electronic device
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Also Published As

Publication number Publication date
CN115349142A (en) 2022-11-15
CN113673280A (en) 2021-11-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21803352; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21803352; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1