WO2021227933A1 - Image processing apparatus, image processing method, and computer-readable storage medium - Google Patents

Image processing apparatus, image processing method, and computer-readable storage medium

Info

Publication number
WO2021227933A1 (PCT/CN2021/092004)
Authority
WO
WIPO (PCT)
Prior art keywords
convolutional
neural network
image processing
network model
network
Application number
PCT/CN2021/092004
Other languages
French (fr)
Chinese (zh)
Inventor
吴松涛 (Wu Songtao)
许宽宏 (Xu Kuanhong)
Original Assignee
索尼集团公司 (Sony Group Corporation)
吴松涛 (Wu Songtao)
Application filed by 索尼集团公司 (Sony Group Corporation) and 吴松涛 (Wu Songtao)
Priority to CN202180023365.4A (published as CN115349142A)
Publication of WO2021227933A1

Classifications

    • G06F18/00 Pattern recognition (Section G: Physics; G06: Computing; calculating or counting; G06F: Electric digital data processing)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N: Computing arrangements based on specific computational models; G06N3/00: Computing arrangements based on biological models; G06N3/02: Neural networks; G06N3/04: Architecture, e.g. interconnection topology)
    • G06N3/045 Combinations of networks (same hierarchy as above)
    • G06V10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion (G06V: Image or video recognition or understanding; G06V10/00: Arrangements for image or video recognition or understanding; G06V10/20: Image preprocessing)

Definitions

  • the embodiments of the present disclosure generally relate to the field of image processing, and in particular to an image processing apparatus, an image processing method, and a computer-readable storage medium. More specifically, the embodiments of the present disclosure relate to an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing gestures included in a plurality of images that are continuously input.
  • Dynamic gesture recognition refers to a technology that recognizes a sequence of dynamic gestures composed of consecutively input multiple frames of images. Due to the flexibility and convenience of gestures, dynamic gesture recognition has broad application prospects in human-computer interaction, AR (Augmented Reality)/VR (Virtual Reality) and other environments.
  • Online dynamic gesture recognition is a technology for segmenting and recognizing multiple continuous dynamic gestures. Compared with offline dynamic gesture recognition, online dynamic gesture recognition is very challenging, mainly in two respects: distinguishing the start frame and end frame of a gesture, and recognizing the gesture.
  • For online dynamic gesture recognition, different gestures can be distinguished by selecting one or several key frames for each type of gesture, but because the key frames must be selected manually, this approach carries strong uncertainty.
  • In addition, when there are many types of gestures, it is difficult to select an appropriate key frame for each type of gesture.
  • The purpose of the present disclosure is to provide an image processing device, an image processing method, and a computer-readable storage medium that quickly and accurately recognize dynamic gestures.
  • According to an aspect of the present disclosure, an image processing device is provided, including a processing circuit configured to: divide a plurality of consecutively input images into a plurality of image blocks; extract the spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network; and determine, using a recurrent neural network (RNN) model, the gestures included in the plurality of images according to the spatiotemporal features of each image block.
  • According to another aspect of the present disclosure, an image processing method is provided, including: dividing a plurality of consecutively input images into a plurality of image blocks; extracting the spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and determining, using a recurrent neural network model, the gestures included in the plurality of images according to the spatiotemporal features of each image block.
  • According to another aspect of the present disclosure, a computer-readable storage medium is provided, including executable computer instructions that, when executed by a computer, cause the computer to execute the image processing method according to the present disclosure.
  • According to another aspect of the present disclosure, a computer program is provided that, when executed by a computer, causes the computer to execute the image processing method according to the present disclosure.
  • With the image processing device, image processing method, and computer-readable storage medium according to the present disclosure, the spatiotemporal features of image blocks can be extracted using a convolutional neural network that includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network, so that a recurrent neural network can recognize gestures from the extracted spatiotemporal features. Because a separable convolutional network is combined with a pointwise or dilated convolutional network, the amount of computation required for gesture recognition can be reduced, allowing dynamic gestures to be recognized quickly and accurately.
  • FIG. 1 is a schematic diagram showing gestures included in consecutive multiple images;
  • FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 3 is a schematic diagram showing a process of extracting key points in an image according to an embodiment of the present disclosure;
  • FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;
  • FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model;
  • FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure;
  • FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;
  • FIG. 15 is a flowchart showing an image processing method according to an embodiment of the present disclosure; and
  • FIG. 16 is a block diagram showing an example of an electronic device that can implement the image processing apparatus according to the present disclosure.
  • Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of specific components, devices, and methods, are described to provide a thorough understanding of the embodiments of the present disclosure. It will be apparent to those skilled in the art that the specific details need not be employed, that example embodiments may be implemented in many different forms, and that none of them should be construed to limit the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
  • FIG. 1 is a schematic diagram showing gestures included in consecutive multiple images. As shown in FIG. 1, the upper row shows an example in which the multiple images include a "double tap" gesture, and the lower row shows an example in which the multiple images include a "squeeze" gesture.
  • In view of the above, the present disclosure aims to provide an image processing device, an image processing method, and a computer-readable storage medium that quickly and accurately identify various dynamic gestures.
  • FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus 200 according to an embodiment of the present disclosure.
  • The image processing apparatus 200 may recognize gestures included in a plurality of continuously input images. The continuously input images may be, for example, a video, a moving image, or a group of still images input in rapid succession.
  • the image processing device 200 can recognize dynamic gestures in real time, that is, can recognize dynamic gestures online.
  • the image processing apparatus 200 may include a preprocessing unit 210, an extraction unit 220, and a determination unit 230.
  • each unit of the image processing apparatus 200 may be included in the processing circuit.
  • the image processing device 200 may include one processing circuit or multiple processing circuits.
  • the processing circuit may include various discrete functional units to perform various different functions and/or operations. It should be noted that these functional units may be physical entities or logical entities, and units with different titles may be implemented by the same physical entity.
  • the preprocessing unit 210 may divide a plurality of images continuously input into a plurality of image blocks.
  • the extraction unit 220 may use a convolutional neural network model to extract the spatiotemporal features of each image block.
  • the convolutional neural network model may include a separable convolutional network and a pointwise convolutional network.
  • According to an embodiment of the present disclosure, the convolutional neural network model may instead include a separable convolutional network and a dilated convolutional network.
  • the determining unit 230 may use a recurrent neural network model to determine the gestures included in the multiple images according to the temporal and spatial characteristics of each image block.
  • the spatiotemporal features of the image block can be extracted using the convolutional neural network model.
  • As described above, the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network, so that the recurrent neural network can recognize gestures from the extracted spatiotemporal features. Because a separable convolutional network is combined with a pointwise or dilated convolutional network, the amount of computation required for gesture recognition can be reduced, allowing dynamic gestures to be recognized quickly and accurately.
  • Separable convolution, also called depthwise separable convolution, reduces the number of parameters required for the convolution computation by decoupling the spatial dimensions from the channel (depth) dimension.
  • The computation of a depthwise separable convolution is divided into two parts: first, each channel is convolved spatially on its own and the outputs are concatenated; then a unit (1×1) convolution kernel is applied across channels to obtain the feature map.
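  • As a brief worked comparison (standard textbook figures, not taken from the patent text): for a K×K kernel, C_in input channels, and C_out output channels, the parameter counts compare as follows.

```latex
% Parameter count: standard convolution vs. depthwise separable convolution.
% This is the standard comparison; the numbers are not from the patent.
\begin{align}
  \text{standard:}\quad  & K^2 \, C_{\mathrm{in}} \, C_{\mathrm{out}} \\
  \text{separable:}\quad & \underbrace{K^2 \, C_{\mathrm{in}}}_{\text{depthwise}}
                         + \underbrace{C_{\mathrm{in}} \, C_{\mathrm{out}}}_{\text{pointwise}} \\
  \text{ratio:}\quad     & \frac{1}{C_{\mathrm{out}}} + \frac{1}{K^2}
      \approx 0.14 \quad \text{for } K = 3,\ C_{\mathrm{out}} = 32
\end{align}
```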
  • Pointwise convolution uses a 1×1 convolution kernel, that is, a kernel that traverses every point of the feature map.
  • The depth of this convolution kernel equals the number of channels of the input to the pointwise convolutional network.
  • Dilated convolution, sometimes literally translated as hole convolution, injects holes into the convolution kernel.
  • Dilated convolution can enlarge the receptive field and capture multi-scale context information.
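  • The three building blocks above can be sketched minimally in PyTorch as follows (an illustration only, not part of the patent; the channel counts and kernel sizes are arbitrary assumptions):

```python
# Hedged sketch of the three convolution types discussed above (PyTorch).
# Channel counts and kernel sizes are illustrative assumptions, not values
# taken from the patent.
import torch
import torch.nn as nn

in_ch, out_ch = 16, 32

# Depthwise separable convolution: per-channel spatial convolution
# (groups=in_ch) followed by a 1x1 pointwise convolution across channels.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

# Dilated convolution: holes injected into the kernel enlarge the
# receptive field without adding parameters.
dilated = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, in_ch, 28, 28)     # dummy feature map
y_sep = pointwise(depthwise(x))       # separable + pointwise path
y_dil = dilated(x)                    # dilated path
print(y_sep.shape, y_dil.shape)       # both: torch.Size([1, 32, 28, 28])
```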
  • the input of the image processing apparatus is multiple images (or multiple frames of images) including gestures.
  • the image may be any one of an RGB image and a depth image.
  • the preprocessing unit 210 may divide a plurality of images input to the image processing apparatus 200 into a plurality of image blocks. Specifically, the preprocessing unit 210 may divide the consecutively input M images from the multiple images input to the image processing device 200 into one image block, where M is an integer greater than or equal to 2. That is, with M images as a unit, the preprocessing unit 210 may divide a plurality of images input to the image processing apparatus into a plurality of image blocks.
  • each image block including M images can be regarded as a spatio-temporal unit.
  • M can be 4, 8, 16, 32, and the like.
  • According to an embodiment of the present disclosure, the preprocessing unit 210 may start from an arbitrary position and group every 8 consecutively input images among the plurality of images input to the image processing apparatus 200 into one image block. For example, the preprocessing unit 210 may divide the 1st to 8th images into the first image block, the 9th to 16th images into the second image block, and so on.
  • the preprocessing unit 210 may also determine the feature of each of the divided image blocks, and may input the feature of each image block to the extraction unit 220.
  • the preprocessing unit 210 may extract features of a plurality of key points of each of a plurality of images input to the image processing apparatus 200. Further, the preprocessing unit 210 may use the feature of each key point of each of the M images included in the image block as the feature of the image block.
  • the key point may be, for example, a joint point of the hand that makes the gesture.
  • the present disclosure does not limit the number of key points included in each image.
  • FIG. 3 is a schematic diagram showing a process of extracting key points in an image according to an embodiment of the present disclosure.
  • The upper part of FIG. 3 shows three of the images input to the image processing device 200, and the lower part shows the process of extracting key points from these three images.
  • In this example, 14 key points are extracted from each image.
  • the feature of each key point may include features of multiple dimensions.
  • the characteristic of each key point may be the spatial characteristic of the key point.
  • the feature of each key point may include Y spatial features of the key point. Y is 3, for example. That is, the feature of each key point may include three coordinate features of the key point in the three-dimensional space.
  • According to an embodiment of the present disclosure, one image block includes M images, each image includes X key points, and each key point includes Y spatial features. Each image block may therefore include M×X×Y features.
  • The preprocessing unit 210 may input the M×X×Y features included in each image block, as the features of the image block, to the convolutional neural network model in the extraction unit 220. Further, the preprocessing unit 210 may input the features of each image block to the extraction unit 220 sequentially, according to the order of the image blocks. In other words, the features of an image block that is earlier in time are input to the extraction unit 220 before those of a later image block.
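  • A minimal sketch of this preprocessing step follows (illustrative only; the keypoint extractor below is a placeholder assumption, since the patent does not prescribe a specific detector):

```python
# Hedged sketch of the preprocessing step: grouping M consecutive frames
# into one image block and packing keypoint features into an M x X x Y
# tensor fed to the convolutional model in temporal order.
import torch

M, X, Y = 8, 14, 3   # frames per block, keypoints per frame, coords per keypoint

def extract_keypoints(frame):
    """Placeholder for a hand-keypoint detector returning (X, Y) coordinates."""
    return torch.rand(X, Y)

def frames_to_blocks(frames):
    """Split a frame sequence into blocks of M frames; each block becomes
    an M x X x Y feature tensor."""
    blocks = []
    for i in range(0, len(frames) - M + 1, M):
        block = torch.stack([extract_keypoints(f) for f in frames[i:i + M]])
        blocks.append(block)            # shape: (M, X, Y)
    return blocks

frames = [torch.zeros(224, 224, 3) for _ in range(32)]   # dummy input video
print(len(frames_to_blocks(frames)))    # 4 blocks, each of shape (8, 14, 3)
```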
  • the extraction unit 220 may use a convolutional neural network model to extract the spatiotemporal features of each image block.
  • As described above, the convolutional neural network model may include a separable convolutional network and a pointwise convolutional network, or may include a separable convolutional network and a dilated convolutional network.
  • the convolutional neural network model in the extraction unit 220 may also include a fully connected network.
  • Each node of the fully connected network is connected to all nodes of the previous network, and is used to integrate the features extracted from the previous network.
  • FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 4, the convolutional neural network model may include a separable convolutional network, a pointwise or dilated convolutional network, and a fully connected network.
  • The convolutional neural network model may include N separable convolutional networks, N pointwise or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  • That is, the numbers of separable convolutional networks, pointwise or dilated convolutional networks, and fully connected networks included in the convolutional neural network model are the same. The input of the model thus passes through N groups, each of which includes, in order from input to output, a separable convolutional network, a pointwise or dilated convolutional network, and a fully connected network.
  • If the separable convolutional network is denoted A, the pointwise or dilated convolutional network B, and the fully connected network C, then the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, C or A, B, C, A, B, C, ...
  • FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 5, the convolutional neural network model in the extraction unit 220 may include multiple separable convolutional networks, one or more pointwise or dilated convolutional networks, and one fully connected network.
  • In this case, the number of separable convolutional networks is one more than the number of pointwise or dilated convolutional networks. For example, if the number of pointwise or dilated convolutional networks is V, where V is a positive integer, then the number of separable convolutional networks is V+1.
  • In order from input to output, the convolutional neural network model may include V groups, each consisting of a separable convolutional network followed by a pointwise or dilated convolutional network, then one further separable convolutional network, and then the fully connected network. That is, the structure before the fully connected network begins and ends with a separable convolutional network, and the separable convolutional networks are interleaved with the pointwise or dilated convolutional networks.
  • With the same notation (A for a separable convolutional network, B for a pointwise or dilated convolutional network, C for a fully connected network), the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, A, C or A, B, A, B, ..., A, B, A, C.
  • FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 6, the convolutional neural network model may include multiple separable convolutional networks, multiple pointwise or dilated convolutional networks, and one fully connected network.
  • Here the number of separable convolutional networks equals the number of pointwise or dilated convolutional networks, say Z, where Z is an integer greater than or equal to 2.
  • In order from input to output, the convolutional neural network model may include Z groups followed by a fully connected network, where each of the Z groups includes, in order from input to output, a separable convolutional network and then a pointwise or dilated convolutional network.
  • With the same notation, the convolutional neural network model in the extraction unit 220 may, in order from input to output, take the form A, B, A, B, C or A, B, A, B, ..., A, B, C.
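  • The following is a minimal sketch (not part of the patent) of how the A/B/C stackings of FIGS. 4 to 6 could be assembled in PyTorch; the pattern strings, channel counts, and layer sizes are illustrative assumptions, since the patent fixes only the ordering of the networks:

```python
# Hedged sketch: assembling the A/B/C stackings of FIGS. 4-6 as a PyTorch
# nn.Sequential. Channel counts and kernel sizes are illustrative
# assumptions; patterns with an interior 'C' would also need a reshape.
import torch
import torch.nn as nn

def make_model(pattern, ch=32, use_dilated=False):
    """pattern: string over {'A','B','C'} ending in 'C',
    e.g. 'ABC' (FIG. 4, N=1), 'ABAC' (FIG. 5, V=1), 'ABABC' (FIG. 6, Z=2)."""
    layers = []
    for p in pattern:
        if p == "A":    # separable (depthwise) convolution, per-channel
            layers.append(nn.Conv2d(ch, ch, 3, padding=1, groups=ch))
        elif p == "B":  # pointwise OR dilated convolution
            layers.append(nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
                          if use_dilated else nn.Conv2d(ch, ch, 1))
        elif p == "C":  # fully connected network over flattened features
            layers += [nn.Flatten(), nn.LazyLinear(128)]
    return nn.Sequential(*layers)

model = make_model("ABAC")                 # FIG. 5 style with V = 1
out = model(torch.randn(1, 32, 8, 14))     # dummy block-feature input
print(out.shape)                           # torch.Size([1, 128])
```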
  • FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • According to an embodiment of the present disclosure, the stride of the separable convolutional network in the convolutional neural network model can be 1, and a pointwise convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 8, the convolutional neural network model may include a separable convolutional network with a stride of 1, a pointwise convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the pointwise convolutional network, and Q represents the number of convolution kernels in the pointwise convolutional network.
  • With this structure, the local spatiotemporal information of the image block can be extracted.
  • the spatio-temporal information may include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 8 shows an example in which the convolutional neural network model includes a separable convolutional network, a pointwise convolutional network, and a fully connected network.
  • The structure shown in FIG. 8 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • Alternatively, the stride of the separable convolutional network in the convolutional neural network model can be greater than 1, and a pointwise convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 9, the convolutional neural network model may include a separable convolutional network with a stride greater than 1, a pointwise convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the pointwise convolutional network, and Q represents the number of convolution kernels in the pointwise convolutional network.
  • With this structure, the medium-range spatiotemporal information of the image block can be extracted.
  • Medium-range spatiotemporal information lies between local spatiotemporal information and global spatiotemporal information; its range depends on the stride.
  • spatiotemporal information can include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 9 shows an example in which the convolutional neural network model includes a separable convolutional network, a pointwise convolutional network, and a fully connected network.
  • The structure shown in FIG. 9 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • Alternatively, the stride of the separable convolutional network in the convolutional neural network model can be 1, and a dilated convolutional network can be selected as the pointwise or dilated convolutional network in the model.
  • FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure.
  • As shown in FIG. 10, the convolutional neural network model may include a separable convolutional network with a stride of 1, a dilated convolutional network, and a fully connected network.
  • M×N represents the size of the convolution kernels in the separable convolutional network, and P represents the number of convolution kernels in the separable convolutional network.
  • S×T represents the size of the convolution kernels in the dilated convolutional network, and Q represents the number of convolution kernels in the dilated convolutional network.
  • With this structure, the global spatiotemporal information of the image block can be extracted.
  • spatiotemporal information can include time information and spatial information. Since the feature of the image block includes the spatial feature of each key point, the extraction unit 220 can extract the spatial feature of the image block. Since each image block includes a plurality of images that are continuous in time, the extraction unit 220 may extract the temporal characteristics of the image block.
  • FIG. 10 shows an example in which the convolutional neural network model includes a separable convolutional network, a dilated convolutional network, and a fully connected network.
  • The structure shown in FIG. 10 can be modified arbitrarily according to the structures of the convolutional neural network model described above.
  • the various examples of the convolutional neural network model in the extraction unit 220 according to an embodiment of the present disclosure have been described above. The above examples are merely illustrative, and the present disclosure is not limited to these structures.
  • the determination unit 230 according to an embodiment of the present disclosure will be described below.
  • the determination unit 230 may use a recurrent neural network model to determine the gestures included in the plurality of images according to the spatiotemporal characteristics of each image block output by the extraction unit 220. Specifically, the determining unit 230 may determine (model) the temporal relationship between each image block according to the temporal and spatial characteristics of each image block output by the extraction unit 220, thereby outputting a state vector representing the gesture.
  • FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model.
  • The recurrent neural network model shown in FIG. 11 is a currently common recurrent neural network model.
  • In FIG. 11, the output o_t of the recurrent neural network model at time t is related to the input x_t at time t and the output h_{t-1} at the previous time t-1.
  • In such a network, neurons receive not only information from other neurons but also their own previous information, forming a network structure with loops; for this reason it is also described as a neural network with short-term memory.
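  • For reference, the update of this common recurrent model can be written in the standard textbook form below (the patent's formula images are not reproduced in this text, so the weight symbols are conventional assumptions):

```latex
% Standard recurrent update corresponding to FIG. 11 (textbook form;
% weight names W_xh and W_hh are conventional, not taken from the patent).
h_t = \varphi\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h \right),
\qquad o_t = h_t
```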
  • According to an embodiment of the present disclosure, the recurrent neural network model can determine the output information at the current time based on the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or differential information of the output at the previous time.
  • The proportional information of the output at the previous time may be, for example, the output at the previous time itself, or information calculated from that output according to a certain ratio.
  • the integration information of the output at the previous time indicates the information obtained by integrating the output at the previous time.
  • the differential information of the output at the previous time indicates information obtained by performing a differential operation on the output at the previous time.
  • The differential information of the output at the previous time may include the first-order through K-th-order differential information of that output, that is, the information obtained by applying first-order through K-th-order difference operations to the output at the previous time.
  • K is an integer greater than or equal to 2.
  • FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure.
  • In FIG. 12, x_t represents the input information at time t, and o_t represents the output information at time t, which is equal to h_t.
  • h_{t-1} represents the output information at time t-1 and also serves as the proportional information of that output, and S_{t-1} represents the integral information of the output information at time t-1.
  • The integral information S_{t-1} of the output information at time t-1, as well as its first-order and second-order differential information, can each be calculated by a corresponding formula, and the K-th-order differential information of the output information at time t-1 can be calculated in the same way (a hedged reconstruction of these formulas is given after this list).
  • The output information h_t at time t can be calculated by an update formula in which:
  • W_he represents the state update matrix;
  • φ is the activation function, including but not limited to the ReLU (Rectified Linear Unit) function;
  • b_h is the bias vector, which can be set according to empirical values; and
  • E_t represents the state, that is, the memory of the recurrent neural network at time t, which is itself calculated by a formula combining the quantities above.
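  • The formula images of the original publication do not survive in this text; the following LaTeX is a plausible reconstruction consistent with the surrounding description, with all weight-matrix names other than W_he and b_h being illustrative assumptions:

```latex
% Hedged reconstruction of the proportional/integral/differential (PID-style)
% recurrent update described above. Only W_he, b_h, E_t, S_{t-1}, and h_t are
% named in the text; W_x, W_p, W_i, and W_d^{(k)} are assumed symbols.
\begin{align}
  S_{t-1} &= \sum_{i=1}^{t-1} h_i
      && \text{integral information of the past outputs} \\
  D^{(1)}_{t-1} &= h_{t-1} - h_{t-2}
      && \text{first-order differential information} \\
  D^{(2)}_{t-1} &= D^{(1)}_{t-1} - D^{(1)}_{t-2}
      && \text{second-order differential information} \\
  E_t &= W_x x_t + W_p h_{t-1} + W_i S_{t-1}
         + \sum_{k=1}^{K} W_d^{(k)} D^{(k)}_{t-1}
      && \text{state (memory) at time } t \\
  h_t &= \varphi\!\left( W_{he} E_t + b_h \right), \qquad o_t = h_t
      && \text{output at time } t
\end{align}
```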
  • In this way, the recurrent neural network model can determine the state at the current time, and thus the output information at the current time, based on the input information at the current time, the proportional information of the output at the previous time, and the integral and differential information of the output at the previous time. It is worth noting that although FIG. 12 shows an example in which the output at the current time is determined from all of these inputs, the output information at the current time may also be determined from the input information at the current time, the proportional information of the output at the previous time, and only the integral information of the output at the previous time, or only the differential information of the output at the previous time.
  • That is, the recurrent neural network in the determination unit 230 can determine the output at the current time not only from the input information at the current time and the output at the previous time, but also from at least one of the integral information and the differential information of the output at the previous time.
  • Since the proportional information of the output focuses on the state of the current image block, the differential information focuses on changes of that state, and the integral information focuses on the accumulation of the state, the determination unit 230 according to the embodiment of the present disclosure can comprehensively capture the changes and trends of gestures on the time scale, thereby achieving better recognition accuracy.
  • the extraction unit 220 can obtain the temporal and spatial characteristics of each image block. Since the gesture may include multiple image blocks, the determination unit 230 can model the temporal relationship between different image blocks. Thus, gestures can be recognized accurately and quickly.
  • the image processing apparatus 200 may further include a decision unit 240 for determining the final gesture according to the output of the determination unit 230.
  • the output of the recurrent neural network in the determining unit 230 may be a 128-dimensional state vector corresponding to different gestures determined according to the spatiotemporal characteristics of each image block.
  • the decision unit 240 may include a classifier for determining the state vector output by the determining unit 230 as a gesture.
  • the extraction unit 220 may include a convolutional neural network model
  • the determination unit 230 may include a recurrent neural network model, so that the decision unit 240 can determine the final gesture according to the output of the recurrent neural network model.
  • FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure.
  • As shown in FIG. 13, the input of the image processing device 200 sequentially passes through the convolutional neural network model in the extraction unit 220, the recurrent neural network model in the determination unit 230, and the classifier in the decision unit 240, thereby outputting the gesture recognition result.
  • According to an embodiment of the present disclosure, the extraction unit 220 may include a plurality of convolutional neural network models, and the determination unit 230 may include a plurality of recurrent neural network models, so that the decision unit 240 can determine the final gesture based on the output result of each of the plurality of recurrent neural network models.
  • Here, the inputs of the multiple convolutional neural network models are the same, namely the multiple images input to the image processing device 200. Each pair of a convolutional neural network model and a recurrent neural network model is used to determine a state vector for the gesture, and the classifier in the decision unit 240 then determines the final gesture. For example, the classifier can average the state vectors output by the recurrent neural network models and then determine the final gesture.
  • FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure.
  • the image processing device 200 includes R convolutional neural network models, R recurrent neural network models, and a classifier.
  • R is an integer greater than or equal to 2.
  • As shown in FIG. 14, the input multiple images are fed to convolutional neural network model 1 and recurrent neural network model 1 to obtain the first group of 128-dimensional state vectors, fed to convolutional neural network model 2 and recurrent neural network model 2 to obtain the second group of 128-dimensional state vectors, ..., and fed to convolutional neural network model R and recurrent neural network model R to obtain the R-th group of 128-dimensional state vectors.
  • the classifier can synthesize the output results of the R recurrent neural network models to obtain the final gesture recognition result.
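  • A minimal sketch (not from the patent) of this decision step follows; the use of GRUs as stand-ins for the recurrent models, the branch internals, and the number of gesture classes are illustrative assumptions:

```python
# Hedged sketch of the ensemble decision step: R CNN+RNN branches each
# produce a 128-dimensional state vector, and the classifier averages
# them before picking a gesture class.
import torch
import torch.nn as nn

R, num_classes = 3, 10

branches = nn.ModuleList(
    [nn.GRU(input_size=128, hidden_size=128, batch_first=True)
     for _ in range(R)])                   # stand-ins for the R RNN models
classifier = nn.Linear(128, num_classes)

def predict(cnn_features):
    """cnn_features: list of R tensors, each (batch, blocks, 128),
    one per convolutional branch."""
    states = []
    for rnn, feats in zip(branches, cnn_features):
        _, h = rnn(feats)                  # final state of each branch
        states.append(h[-1])               # (batch, 128)
    avg = torch.stack(states).mean(dim=0)  # average the R state vectors
    return classifier(avg).argmax(dim=-1)  # final gesture class

feats = [torch.randn(1, 4, 128) for _ in range(R)]
print(predict(feats))
```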
  • In this way, multiple pairs of convolutional neural network models and recurrent neural network models can be used to recognize gestures, making the recognition more accurate.
  • As described above, a convolutional neural network model including a separable convolutional network with a stride of 1 and a pointwise convolutional network can extract the local spatiotemporal information of an image block; a model including a separable convolutional network with a stride greater than 1 and a pointwise convolutional network can extract medium-range spatiotemporal information; and a model including a separable convolutional network with a stride of 1 and a dilated convolutional network can extract the global spatiotemporal information of an image block.
  • the R convolutional neural network models may include convolutional neural network models capable of extracting spatiotemporal information of different scales.
  • the R convolutional neural network models may include at least two of the above three neural network models.
  • For example, a first convolutional neural network model among the R convolutional neural network models may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, while a second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, while the second includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, while the second includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • Alternatively, the first convolutional neural network model may include a separable convolutional network with a stride of 1 and a pointwise convolutional network, the second a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and the third a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • In this way, the multiple convolutional neural network models can extract spatiotemporal information of the image blocks at different scales, and can thus satisfy the requirements of recognizing gestures both quickly and accurately.
  • the process of training the image processing apparatus 200 can be divided into two stages.
  • In the first stage, manually labeled gestures and a cross-entropy loss function can be used to pre-train the entire network, so that the network is trained for the case where the multiple images include only one gesture.
  • In the second stage, augmented gestures (that is, gestures with noise added on the time axis to increase or decrease the length of the image sequence corresponding to the gesture) and a connectionist temporal classification (CTC) loss function can be used to adjust the pre-trained network, so that the entire network is trained for the case where the multiple images include multiple gestures and the image length of each gesture is increased or decreased.
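  • The following sketch illustrates the two training stages with standard PyTorch loss functions; the stand-in model, data shapes, class count, and optimizer settings are illustrative assumptions, not values from the patent:

```python
# Hedged sketch of the two-stage training described above: cross-entropy
# pre-training on single-gesture clips, then CTC fine-tuning on sequences
# of several time-warped gestures.
import torch
import torch.nn as nn

model = nn.GRU(128, 64, batch_first=True)    # stand-in for the CNN+RNN stack
head = nn.Linear(64, 11)                     # 10 gesture classes + CTC blank (0)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

# Stage 1: cross-entropy loss on clips containing exactly one gesture.
x, labels = torch.randn(4, 16, 128), torch.randint(1, 11, (4,))
out, _ = model(x)
loss = nn.CrossEntropyLoss()(head(out[:, -1]), labels)
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: CTC loss on longer sequences containing several gestures whose
# lengths were augmented (stretched or shrunk) on the time axis.
x2 = torch.randn(4, 32, 128)
out2, _ = model(x2)
log_probs = head(out2).log_softmax(-1).transpose(0, 1)   # (T, N, C) for CTC
targets = torch.randint(1, 11, (4, 3))                   # 3 gestures/sequence
loss2 = nn.CTCLoss(blank=0)(log_probs, targets,
                            input_lengths=torch.full((4,), 32, dtype=torch.long),
                            target_lengths=torch.full((4,), 3, dtype=torch.long))
loss2.backward(); opt.step()
```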
  • the image processing apparatus 200 can quickly and accurately recognize dynamic gestures.
  • According to an embodiment of the present disclosure, multiple input images can be divided into multiple image blocks, and the spatiotemporal features of the image blocks can be extracted using a separable convolutional network together with a pointwise or dilated convolutional network, which greatly reduces the amount of computation in the gesture recognition process.
  • In addition, spatiotemporal features of the image blocks at different scales can be extracted, ensuring both the accuracy and the speed of recognition.
  • Furthermore, a recurrent neural network processes the spatiotemporal features of each image block while taking into account the proportional, integral, and/or differential information of the accumulated output, making the recognition result more accurate.
  • the image processing apparatus 200 according to an embodiment of the present disclosure can quickly and accurately recognize dynamic gestures.
  • FIG. 15 is a flowchart illustrating an image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure.
  • In step S1510, a plurality of consecutively input images are divided into a plurality of image blocks.
  • In step S1520, a convolutional neural network model is used to extract the spatiotemporal features of each image block; the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network.
  • In step S1530, a recurrent neural network model is used to determine the gestures included in the multiple images according to the spatiotemporal features of each image block.
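  • A hedged end-to-end sketch of these three steps follows; all modules, shapes, and the treatment of the M frames as input channels are illustrative assumptions rather than the patent's concrete implementation:

```python
# Hedged end-to-end sketch of the method of FIG. 15 (steps S1510-S1530).
import torch
import torch.nn as nn

M, X, Y, feat_dim, num_classes = 8, 14, 3, 128, 10

cnn = nn.Sequential(                            # stand-in for the CNN model
    nn.Conv2d(M, M, 3, padding=1, groups=M),    # depthwise over (X, Y) grid
    nn.Conv2d(M, 32, 1),                        # pointwise convolution
    nn.Flatten(), nn.LazyLinear(feat_dim))      # fully connected network
rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
classifier = nn.Linear(feat_dim, num_classes)

def recognize(blocks):
    """blocks: tensor (num_blocks, M, X, Y) produced in step S1510."""
    feats = cnn(blocks)                          # S1520: one feature per block
    _, h = rnn(feats.unsqueeze(0))               # S1530: temporal modeling
    return classifier(h[-1]).argmax(dim=-1)      # final gesture label

print(recognize(torch.randn(4, M, X, Y)))
```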
  • According to an embodiment of the present disclosure, dividing the consecutively input multiple images into multiple image blocks includes dividing M consecutively input images into one image block, where M is an integer greater than or equal to 2; and using the convolutional neural network model to extract the spatiotemporal features of each image block includes inputting the features of each key point of each of the M images, as the features of the image block, to the convolutional neural network model.
  • the convolutional neural network model also includes a fully connected network.
  • According to an embodiment of the present disclosure, the convolutional neural network model includes: multiple separable convolutional networks, one or more pointwise or dilated convolutional networks, and one fully connected network; or N separable convolutional networks, N pointwise or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  • According to an embodiment of the present disclosure, the image processing method further includes: determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and determining the final gesture according to the output result of each recurrent neural network model.
  • According to an embodiment of the present disclosure, a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network; a second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network; and a third convolutional neural network model includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  • According to an embodiment of the present disclosure, using the recurrent neural network model to determine the gestures included in the multiple images includes determining the output information at the current time according to the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or differential information of the output at the previous time.
  • The entity that executes the above method may be the image processing device 200 according to an embodiment of the present disclosure, so all of the foregoing embodiments regarding the image processing device 200 apply here as well.
  • the present disclosure can be applied to various scenarios.
  • the image processing apparatus 200 of the present disclosure can be used for gesture recognition, and specifically can perform online dynamic gesture recognition.
  • Although the present disclosure takes online dynamic gesture recognition as an example, the present disclosure is not limited thereto and can be applied to other scenarios involving the processing of time-series signals.
  • FIG. 16 is a block diagram showing an example of an electronic device 1600 that can implement the image processing apparatus 200 according to the present disclosure.
  • the electronic device 1600 may be, for example, a user equipment, for example, may be implemented as a mobile terminal (such as a smart phone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle type mobile router, and a digital camera) or a vehicle terminal.
  • a mobile terminal such as a smart phone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle type mobile router, and a digital camera
  • the electronic device 1600 includes a processor 1601, a memory 1602, a storage device 1603, a network interface 1604, and a bus 1606.
  • the processor 1601 may be, for example, a central processing unit (CPU) or a digital signal processor (DSP), and controls the functions of the electronic device 1600.
  • the memory 1602 includes random access memory (RAM) and read only memory (ROM), and stores data and programs executed by the processor 1601.
  • the storage device 1603 may include a storage medium such as a semiconductor memory and a hard disk.
  • the network interface 1604 is a wired communication interface for connecting the electronic device 1600 to the wired communication network 1605.
  • the wired communication network 1605 may be a core network such as an evolved packet core network (EPC) or a packet data network (PDN) such as the Internet.
  • the bus 1606 connects the processor 1601, the memory 1602, the storage device 1603, and the network interface 1604 to each other.
  • the bus 1606 may include two or more buses (such as a high-speed bus and a low-speed bus) each having a different speed.
  • the preprocessing unit 210, the extraction unit 220, the determination unit 230, and the decision unit 240 described in FIG. 2 can be implemented by the processor 1601.
  • For example, by executing instructions stored in the memory 1602 or the storage device 1603, the processor 1601 may perform the functions of dividing the continuously input multiple images into multiple image blocks, extracting the spatiotemporal features of each image block using a convolutional neural network model, and determining the gestures included in the multiple images using a recurrent neural network.
  • In addition, the units shown in dashed boxes in the functional block diagrams in the drawings indicate that the corresponding functional unit is optional in the corresponding device, and the optional functional units can be combined in an appropriate manner to achieve the required functions.
  • a plurality of functions included in one unit in the above embodiments may be realized by separate devices.
  • the multiple functions implemented by multiple units in the above embodiments may be implemented by separate devices, respectively.
  • one of the above functions can be implemented by multiple units. Needless to say, such a configuration is included in the technical scope of the present disclosure.
  • the steps described in the flowchart include not only processing performed in time series in the described order, but also processing performed in parallel or individually rather than necessarily in time series.
  • the order can be changed appropriately.

Abstract

The present disclosure relates to an image processing apparatus, an image processing method, and a computer-readable storage medium. The image processing apparatus according to the present disclosure comprises a processing circuit configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model comprising a separable convolutional network and a pointwise convolutional network, or comprising a separable convolutional network and a dilated convolutional network; and determine gestures in the plurality of images according to the spatio-temporal features of each image block using a recurrent neural network model. By means of the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, a dynamic gesture can be quickly and accurately recognized.

Description

Image processing apparatus, image processing method, and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 14, 2020 under application number 202010407312.X and entitled "Image Processing Apparatus, Image Processing Method, and Computer-Readable Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present disclosure generally relate to the field of image processing, and in particular to an image processing apparatus, an image processing method, and a computer-readable storage medium. More specifically, the embodiments of the present disclosure relate to an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing gestures included in a plurality of continuously input images.
Background
Dynamic gesture recognition refers to a technology that recognizes a sequence of dynamic gestures composed of consecutively input multiple frames of images. Due to the flexibility and convenience of gestures, dynamic gesture recognition has broad application prospects in human-computer interaction, AR (Augmented Reality)/VR (Virtual Reality), and other environments.
Online dynamic gesture recognition is a technology for segmenting and recognizing multiple continuous dynamic gestures. Compared with offline dynamic gesture recognition, online dynamic gesture recognition is very challenging, mainly in two respects: distinguishing the start frame and end frame of a gesture, and recognizing the gesture. For online dynamic gesture recognition, different gestures can be distinguished by selecting one or several key frames for each type of gesture, but because the key frames must be selected manually, this approach carries strong uncertainty. In addition, when there are many types of gestures, it is difficult to select an appropriate key frame for each type of gesture. It is also possible to model adjacent image frames with a hidden Markov model to distinguish different gestures; however, because the expressive power of the hidden Markov model is relatively weak, only a few categories of gestures can be recognized in this way.
Therefore, it is necessary to propose a technical solution that recognizes dynamic gestures quickly and accurately.
发明内容Summary of the invention
This section provides a general summary of the present disclosure, rather than a comprehensive disclosure of its full scope or of all of its features.

An object of the present disclosure is to provide an image processing apparatus, an image processing method, and a computer-readable storage medium capable of recognizing dynamic gestures quickly and accurately.

According to an aspect of the present disclosure, there is provided an image processing apparatus including processing circuitry configured to: divide a plurality of continuously input images into a plurality of image blocks; extract spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or including a separable convolution network and a dilated convolution network; and determine, using a recurrent neural network (RNN) model, a gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.

According to another aspect of the present disclosure, there is provided an image processing method including: dividing a plurality of continuously input images into a plurality of image blocks; extracting spatio-temporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolution network and a pointwise convolution network, or including a separable convolution network and a dilated convolution network; and determining, using a recurrent neural network model, a gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.

According to another aspect of the present disclosure, there is provided a computer-readable storage medium including executable computer instructions that, when executed by a computer, cause the computer to execute the image processing method according to the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program that, when executed by a computer, causes the computer to execute the image processing method according to the present disclosure.

With the image processing apparatus, the image processing method, and the computer-readable storage medium according to the present disclosure, the spatio-temporal features of the image blocks can be extracted using a convolutional neural network that includes a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network, so that a recurrent neural network can recognize the gesture from the extracted spatio-temporal features. Because a separable convolution network is combined with a pointwise or dilated convolution network, the computational cost of gesture recognition is reduced, and dynamic gestures can be recognized quickly and accurately.

Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Description of the Drawings
The drawings described herein are for illustrative purposes only of selected embodiments, not of all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram showing gestures included in a plurality of consecutive images;

FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram showing a process of extracting key points from images according to an embodiment of the present disclosure;

FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model;

FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure;

FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure;

FIG. 15 is a flowchart showing an image processing method according to an embodiment of the present disclosure; and

FIG. 16 is a block diagram showing an example of an electronic device that can implement the image processing apparatus according to the present disclosure.
While the present disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the description of specific embodiments herein is not intended to limit the present disclosure to the particular forms disclosed; on the contrary, the present disclosure is intended to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure. It should be noted that corresponding reference numerals indicate corresponding components throughout the several drawings.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, its application, or its uses.

Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of specific components, devices, and methods, are set forth to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that the specific details need not be employed, that example embodiments may be embodied in many different forms, and that none of them should be construed to limit the scope of the present disclosure. In some example embodiments, well-known processes, well-known structures, and well-known technologies are not described in detail.
The description will proceed in the following order:

1. Configuration example of the image processing apparatus;

2. Example of the image processing method;

3. Application examples.
<1. Configuration example of the image processing apparatus>
FIG. 1 is a schematic diagram showing gestures included in a plurality of consecutive images. As shown in FIG. 1, the upper row shows an example in which the images include a "double tap" gesture, and the lower row shows an example in which the images include a "squeeze" gesture.

As mentioned above, as the number of gesture types grows, it is difficult for existing gesture recognition technology to recognize the various types of gestures quickly and accurately. The present disclosure therefore aims to provide an image processing apparatus, an image processing method, and a computer-readable storage medium capable of quickly and accurately recognizing various dynamic gestures.
FIG. 2 is a block diagram showing an example of the configuration of an image processing apparatus 200 according to an embodiment of the present disclosure. Here, the image processing apparatus 200 can recognize gestures included in a plurality of continuously input images, for example a video, a moving image, or a rapidly input sequence of still images. Specifically, the image processing apparatus 200 can recognize dynamic gestures in real time, that is, online.

As shown in FIG. 2, the image processing apparatus 200 may include a preprocessing unit 210, an extraction unit 220, and a determination unit 230.

Here, each unit of the image processing apparatus 200 may be included in processing circuitry. It should be noted that the image processing apparatus 200 may include either one processing circuit or a plurality of processing circuits. Further, the processing circuitry may include various discrete functional units that perform different functions and/or operations. It should be noted that these functional units may be physical entities or logical entities, and that units with different names may be implemented by the same physical entity.
According to an embodiment of the present disclosure, the preprocessing unit 210 may divide a plurality of continuously input images into a plurality of image blocks.

According to an embodiment of the present disclosure, the extraction unit 220 may extract the spatio-temporal features of each image block using a convolutional neural network model. According to an embodiment of the present disclosure, the convolutional neural network model may include a separable convolution network and a pointwise convolution network; alternatively, it may include a separable convolution network and a dilated convolution network.

According to an embodiment of the present disclosure, the determination unit 230 may determine, using a recurrent neural network model, the gesture included in the plurality of images according to the spatio-temporal features of the respective image blocks.
As described above, the image processing apparatus 200 according to an embodiment of the present disclosure can extract the spatio-temporal features of the image blocks using a convolutional neural network model that includes a separable convolution network and a pointwise convolution network, or a separable convolution network and a dilated convolution network, so that a recurrent neural network can recognize the gesture from the extracted spatio-temporal features. Because a separable convolution network is combined with a pointwise or dilated convolution network, the computational cost of gesture recognition is reduced, and dynamic gestures can be recognized quickly and accurately.

In the present disclosure, separable convolution is also referred to as depthwise separable convolution. By decoupling the spatial dimension from the channel (depth) dimension, it reduces the number of parameters required for the convolution computation. The computation of a depthwise separable convolution is divided into two parts: first, each channel (depth) is convolved spatially on its own and the outputs are concatenated; then a unit convolution kernel performs a channel-wise convolution to obtain the feature map.

In the present disclosure, pointwise convolution uses a 1×1 convolution kernel, that is, a kernel that traverses every point, where the depth of the kernel equals the number of channels of the image input to the pointwise convolution network.

In the present disclosure, dilated convolution, also called atrous convolution, injects holes into the convolution kernel. A dilated convolution has a parameter, the dilation rate, whose meaning is that (dilation rate - 1) zeros are inserted into the convolution kernel. Different dilation rates yield different receptive fields. Dilated convolution can therefore enlarge the receptive field and obtain multi-scale context information.
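As a minimal sketch of the three convolution types just described, assuming a PyTorch-style API (the channel counts and tensor sizes are placeholders, not values from this disclosure):

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64                      # placeholder channel counts

# Depthwise separable convolution: a per-channel spatial convolution
# (groups=in_ch), followed by a 1x1 unit-kernel channel convolution.
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

# Dilated (atrous) convolution: dilation=2 inserts one zero between kernel
# taps, enlarging the receptive field without adding parameters.
dilated = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, in_ch, 56, 56)           # (batch, channels, height, width)
y = pointwise(depthwise(x))                 # depthwise separable pipeline
z = dilated(x)
print(y.shape, z.shape)                     # both: torch.Size([1, 64, 56, 56])
```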
According to an embodiment of the present disclosure, the input to the image processing apparatus is a plurality of images (or frames of images) including a gesture. According to an embodiment of the present disclosure, each image may be either an RGB image or a depth image.

According to an embodiment of the present disclosure, the preprocessing unit 210 may divide the plurality of images input to the image processing apparatus 200 into a plurality of image blocks. Specifically, the preprocessing unit 210 may assign M continuously input images to one image block, M being an integer greater than or equal to 2. That is, taking M images as a unit, the preprocessing unit 210 divides the images input to the image processing apparatus into a plurality of image blocks, and each image block of M images can be regarded as one spatio-temporal unit. Preferably, M may be 4, 8, 16, 32, or the like. For example, when M is 8, the preprocessing unit 210 may, starting from an arbitrary position, assign 8 continuously input images to one image block: the 1st to 8th images input to the image processing apparatus 200 form the first image block, the 9th to 16th images form the second image block, and so on.
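A minimal sketch of this blocking step, assuming the frames arrive as a Python sequence (the helper name is illustrative):

```python
# Groups every M consecutive frames into one image block, as described above;
# M=8 mirrors the example in the text. A trailing partial block is dropped.
def split_into_blocks(frames, M=8):
    for start in range(0, len(frames) - M + 1, M):
        yield frames[start:start + M]

# Usage: frames 1-8 form block 1, frames 9-16 form block 2, and so on.
blocks = list(split_into_blocks(list(range(1, 25)), M=8))
print(blocks)  # [[1..8], [9..16], [17..24]]
```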
According to an embodiment of the present disclosure, the preprocessing unit 210 may further determine the features of each of the divided image blocks and may input the features of each image block to the extraction unit 220.

According to an embodiment of the present disclosure, the preprocessing unit 210 may extract the features of a plurality of key points from each of the images input to the image processing apparatus 200. Further, the preprocessing unit 210 may use the features of the key points of each of the M images included in an image block as the features of that image block.

Here, in the case of gesture recognition, the key points may be, for example, the joint points of the hand making the gesture. The present disclosure does not limit the number of key points per image. For example, the preprocessing unit 210 may extract the features of X key points from each image, X being an integer greater than or equal to 2. For example, in the case of X=14, the preprocessing unit 210 may use the features of the 14 key points of each of the M images in an image block as the features of that image block, so that the image block has 14×M key points in total.

FIG. 3 is a schematic diagram showing a process of extracting key points from images according to an embodiment of the present disclosure. The upper part of FIG. 3 shows three of the images input to the image processing apparatus 200, and the lower part shows the process of extracting key points from these three images. As shown in FIG. 3, 14 key points are extracted from each image.

According to an embodiment of the present disclosure, the features of each key point may include features of multiple dimensions. In addition, the features of each key point may be the spatial features of that key point. For example, the features of each key point may include Y spatial features of the key point, Y being, for example, 3; that is, the features of each key point may include the three coordinates of the key point in three-dimensional space.

As described above, according to an embodiment of the present disclosure, one image block includes M images, each image includes X key points, and each key point has Y spatial features, so each image block includes M×X×Y features. The preprocessing unit 210 may input the M×X×Y features of each image block, as the features of that image block, to the convolutional neural network model in the extraction unit 220. Further, the preprocessing unit 210 may input the features of the image blocks to the extraction unit 220 in block order; that is, the features of an image block that is earlier in time are input to the extraction unit 220 before those of an image block that is later in time.
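A short sketch of the resulting per-block feature layout, using the example values M=8, X=14, Y=3 from the text (the tensor contents here are random placeholder data):

```python
import torch

M, X, Y = 8, 14, 3                       # frames per block, keypoints, coordinates
block_features = torch.randn(M, X, Y)    # one spatio-temporal unit
print(block_features.numel())            # M*X*Y = 336 features fed to the CNN model
```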
According to an embodiment of the present disclosure, the extraction unit 220 may extract the spatio-temporal features of each image block using a convolutional neural network model. The convolutional neural network model may include a separable convolution network and a pointwise convolution network, or may include a separable convolution network and a dilated convolution network.

According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may further include a fully connected network. Every node of the fully connected network is connected to all nodes of the preceding network, and the fully connected network serves to combine the features extracted by the preceding network.

FIG. 4 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 4, the convolutional neural network model may include a separable convolution network, a pointwise convolution network or a dilated convolution network, and a fully connected network.
According to an embodiment of the present disclosure, the convolutional neural network model may include N separable convolution networks, N pointwise convolution networks or dilated convolution networks, and N fully connected networks, where N is a positive integer. In other words, the model contains equal numbers of separable convolution networks, pointwise or dilated convolution networks, and fully connected networks. That is, the input of the convolutional neural network model passes in sequence through N groups each consisting of a separable convolution network, a pointwise or dilated convolution network, and a fully connected network, and within each group the order from input to output is the separable convolution network, then the pointwise or dilated convolution network, then the fully connected network.

For convenience of description, the separable convolution network may be denoted A, the pointwise or dilated convolution network B, and the fully connected network C. The convolutional neural network model in the extraction unit 220 may then include, in order from input to output, A, B, C, or A, B, C, A, B, C, and so on.

FIG. 4 shows the case of N=1, that is, the convolutional neural network model includes one group consisting of a separable convolution network, a pointwise or dilated convolution network, and a fully connected network.

FIG. 5 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 5, the convolutional neural network model may include a separable convolution network, a pointwise or dilated convolution network, a fully connected network, a separable convolution network, a pointwise or dilated convolution network, and a fully connected network. That is, FIG. 5 shows the case of N=2, in which the model includes two such groups. The case where N is greater than 2 is similar and is not repeated here.
According to an embodiment of the present disclosure, the convolutional neural network model in the extraction unit 220 may include a plurality of separable convolution networks, one or more pointwise or dilated convolution networks, and one fully connected network.

In this case, the number of separable convolution networks may be one more than the number of pointwise or dilated convolution networks. For example, if the number of pointwise or dilated convolution networks is V, V being a positive integer, the number of separable convolution networks is V+1. From input to output, the convolutional neural network model may then include V groups each consisting of a separable convolution network followed by a pointwise or dilated convolution network, then one further separable convolution network, and then one fully connected network. That is, the structure preceding the fully connected network begins with a separable convolution network, ends with a separable convolution network, and alternates separable convolution networks with pointwise or dilated convolution networks.

With the notation above, the convolutional neural network model in the extraction unit 220 may include, in order from input to output, A, B, A, C, or A, B, A, B, ..., A, B, A, C.

FIG. 6 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 6, the convolutional neural network model in the extraction unit 220 may include a separable convolution network, a pointwise or dilated convolution network, a separable convolution network, and a fully connected network; that is, FIG. 6 shows the case of V=1. The case where V is greater than 1 is similar and is not repeated here.
According to an embodiment of the present disclosure, the convolutional neural network model may include a plurality of separable convolution networks, a plurality of pointwise or dilated convolution networks, and one fully connected network, the number of separable convolution networks being equal to the number of pointwise or dilated convolution networks, for example Z, Z being an integer greater than or equal to 2. From input to output, the convolutional neural network model may then include Z groups each consisting of a separable convolution network followed by a pointwise or dilated convolution network, and then one fully connected network. That is, the structure preceding the fully connected network begins with a separable convolution network, ends with a pointwise or dilated convolution network, and alternates the two network types.

With the notation above, the convolutional neural network model in the extraction unit 220 may include, in order from input to output, A, B, A, B, C, or A, B, A, B, ..., A, B, C.

FIG. 7 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 7, the convolutional neural network model in the extraction unit 220 may include a separable convolution network, a pointwise or dilated convolution network, a separable convolution network, a pointwise or dilated convolution network, and a fully connected network; that is, FIG. 7 shows the case of Z=2. The case where Z is greater than 2 is similar and is not repeated here.
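The A/B/C patterns enumerated above can be sketched with a small pattern-driven builder. The helper names, channel sizes, and the 128-dimensional fully connected output below are illustrative assumptions, not part of the disclosure, and A is represented only by the depthwise spatial stage of the separable convolution:

```python
import torch
import torch.nn as nn

def make_block(kind, ch=32):
    if kind == "A":      # A: separable convolution (depthwise spatial stage)
        return nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)
    if kind == "B":      # B: pointwise convolution (a dilated Conv2d could be used instead)
        return nn.Conv2d(ch, ch, kernel_size=1)
    if kind == "C":      # C: fully connected layer
        return nn.Sequential(nn.Flatten(), nn.LazyLinear(128))
    raise ValueError(kind)

def build(pattern):      # e.g. "ABC" (FIG. 4), "ABAC" (FIG. 6), "ABABC" (FIG. 7)
    return nn.Sequential(*[make_block(k) for k in pattern])

model = build("ABABC")                      # the Z=2 structure of FIG. 7
out = model(torch.randn(1, 32, 24, 24))
print(out.shape)                            # torch.Size([1, 128])
```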
The foregoing has described, by way of example, the structure of the convolutional neural network model in the extraction unit 220. Several specific examples of convolutional neural network models according to embodiments of the present disclosure are described below.

According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be 1, and the pointwise-or-dilated stage of the model may be a pointwise convolution network.

FIG. 8 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 8, the convolutional neural network model may include a separable convolution network with a stride of 1, a pointwise convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the pointwise convolution network and Q denotes the number of such kernels; preferably, S=T=1.

According to an embodiment of the present disclosure, in FIG. 8, because the stride of the separable convolution network is 1, local spatio-temporal information of the image block can be extracted. Here, spatio-temporal information may include temporal information and spatial information. Because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block; and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 8 shows an example in which the convolutional neural network model includes one separable convolution network, one pointwise convolution network, and one fully connected network. However, FIG. 8 may be arbitrarily modified according to the model structures described above.
According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be greater than 1, and the pointwise-or-dilated stage of the model may be a pointwise convolution network.

FIG. 9 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 9, the convolutional neural network model may include a separable convolution network with a stride greater than 1, a pointwise convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the pointwise convolution network and Q denotes the number of such kernels; preferably, S=T=1.

According to an embodiment of the present disclosure, in FIG. 9, because the stride of the separable convolution network is greater than 1, spatio-temporal information of the image block related to a medium distance can be extracted. Spatio-temporal information related to a medium distance is information intermediate between local and global spatio-temporal information, and depends on the size of the stride. As before, spatio-temporal information may include temporal information and spatial information: because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block, and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 9 shows an example in which the convolutional neural network model includes one separable convolution network, one pointwise convolution network, and one fully connected network. However, FIG. 9 may be arbitrarily modified according to the model structures described above.
According to an embodiment of the present disclosure, the stride of the separable convolution network in the convolutional neural network model may be 1, and the pointwise-or-dilated stage of the model may be a dilated convolution network.

FIG. 10 is a block diagram showing an example of the structure of a convolutional neural network model according to an embodiment of the present disclosure. As shown in FIG. 10, the convolutional neural network model may include a separable convolution network with a stride of 1, a dilated convolution network, and a fully connected network. Here, M×N denotes the size of the convolution kernels in the separable convolution network and P denotes the number of such kernels; preferably, M=N=3. S×T denotes the size of the convolution kernels in the dilated convolution network and Q denotes the number of such kernels; preferably, S=5 and T=3.

According to an embodiment of the present disclosure, in FIG. 10, because the dilated convolution network has a large receptive field, global spatio-temporal information of the image block can be extracted. As before, spatio-temporal information may include temporal information and spatial information: because the features of an image block include the spatial features of its key points, the extraction unit 220 can extract the spatial features of the image block, and because each image block includes a plurality of temporally consecutive images, the extraction unit 220 can extract the temporal features of the image block.

It should be noted that, for ease of description, FIG. 10 shows an example in which the convolutional neural network model includes one separable convolution network, one dilated convolution network, and one fully connected network. However, FIG. 10 may be arbitrarily modified according to the model structures described above.
The foregoing has described various examples of the convolutional neural network model in the extraction unit 220 according to embodiments of the present disclosure. These examples are merely illustrative, and the present disclosure is not limited to these structures. The determination unit 230 according to an embodiment of the present disclosure is described below.

According to an embodiment of the present disclosure, the determination unit 230 may determine the gesture included in the plurality of images from the spatio-temporal features of the image blocks output by the extraction unit 220, using a recurrent neural network model. Specifically, the determination unit 230 may determine (model) the temporal relationship between the image blocks according to their spatio-temporal features, and thereby output a state vector representing the gesture.

FIG. 11 is a schematic diagram showing the structure of a recurrent neural network model. The model shown in FIG. 11 is a common recurrent neural network model. As shown in FIG. 11, the output o_t of the model at time t depends on the input x_t at time t and on the output h_{t-1} at the previous time t-1. That is, in a recurrent neural network, a neuron can receive not only information from other neurons but also its own information, forming a network structure with loops; such a network is therefore also called a neural network with short-term memory.
According to an embodiment of the present disclosure, the recurrent neural network model may determine the output information at the current time from the input information at the current time, the proportional information of the output at the previous time, and the integral information and/or the differential information of the output at the previous time.

According to an embodiment of the present disclosure, the proportional information of the output at the previous time may be, for example, the output at the previous time itself, or information computed from that output in a fixed proportion.

According to an embodiment of the present disclosure, the integral information of the output at the previous time is information obtained by integrating the output at the previous time.

According to an embodiment of the present disclosure, the differential information of the output at the previous time is information obtained by differentiating the output at the previous time. For example, the differential information of the output at the previous time may include the first-order to K-order differential information of that output, that is, the information obtained by applying first-order to K-order differential operations to it, where K is an integer greater than or equal to 2.
FIG. 12 is a schematic diagram showing the structure of a recurrent neural network model according to an embodiment of the present disclosure. In FIG. 12, x_t denotes the input information at time t; o_t denotes the output information at time t, which equals h_t; h_{t-1} denotes the output information at time t-1, which also serves as the proportional information of that output; S_{t-1} denotes the integral information of the output information at time t-1; h^(1)_{t-1} denotes the first-order differential information of the output information at time t-1; and h^(K)_{t-1} denotes the K-order differential information of the output information at time t-1.
According to an embodiment of the present disclosure, the integral information S_{t-1} of the output information at time t-1 can be calculated, for example, as the accumulation of the past outputs:

S_{t-1} = Σ_{i=1}^{t-1} h_i
According to an embodiment of the present disclosure, the first-order differential information h^(1)_{t-1} of the output information at time t-1 can be calculated, for example, as the difference between successive outputs:

h^(1)_{t-1} = h_{t-1} - h_{t-2}
According to an embodiment of the present disclosure, the second-order differential information h^(2)_{t-1} of the output information at time t-1 can be calculated, for example, as the difference between successive first-order differentials:

h^(2)_{t-1} = h^(1)_{t-1} - h^(1)_{t-2}
In a similar manner, the K-order differential information of the output information at time t-1 can be calculated as h^(K)_{t-1} = h^(K-1)_{t-1} - h^(K-1)_{t-2}.
According to an embodiment of the present disclosure, the output information h_t at time t can be calculated according to the following formula:

h_t = σ(W_he E_t + b_h)

where W_he denotes the state update matrix, σ is an activation function, including but not limited to the ReLU (Rectified Linear Unit) function, and b_h is a bias vector that can be set according to empirical values. E_t denotes the state formula, that is, the memory of the recurrent neural network at time t, and can be calculated, for example, as a weighted combination of the current input and the proportional, integral, and differential information of the previous output:

E_t = W_ex x_t + W_eh h_{t-1} + W_es S_{t-1} + Σ_{k=1}^{K} W_ek h^(k)_{t-1}

where W_ex, W_eh, W_es, and W_ek (k = 1, ..., K) are weight matrices.
As described above, in FIG. 12, the recurrent neural network model can determine the state at the current time, and thereby the output information at the current time, from the input information at the current time, the proportional information of the output at the previous time, and the integral and differential information of the output at the previous time. It is worth noting that, although FIG. 12 shows an example in which the current output is determined from the current input, the proportional information of the previous output, and both the integral information and the differential information of the previous output, the current output may also be determined from the current input, the proportional information of the previous output, and only the integral information of the previous output, or from the current input, the proportional information of the previous output, and only the differential information of the previous output.
As described above, according to an embodiment of the present disclosure, the recurrent neural network in the determination unit 230 can determine the current output not only from the current input and the previous output, but also from at least one of the integral information and the differential information of the previous output. Because the proportional information of the output attends to the state of the current image block, the differential information attends to changes of state, and the integral information attends to the accumulation of state, the determination unit 230 according to an embodiment of the present disclosure can capture the variation and trend of a gesture on the time scale comprehensively, and thereby achieve better recognition accuracy.
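A hedged sketch of such a recurrent cell, following the reconstructed formulas above; the weight names, the choice of ReLU for σ, and the restriction to a first-order differential term are assumptions for illustration, not a verbatim implementation of the disclosure:

```python
import torch
import torch.nn as nn

class PIDRNNCell(nn.Module):
    """State E_t combines the current input with proportional, integral and
    first-order differential information of past outputs, then applies
    h_t = sigma(W_he @ E_t + b_h)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W_ex = nn.Linear(in_dim, hid_dim, bias=False)   # current input x_t
        self.W_eh = nn.Linear(hid_dim, hid_dim, bias=False)  # proportional: h_{t-1}
        self.W_es = nn.Linear(hid_dim, hid_dim, bias=False)  # integral: S_{t-1}
        self.W_ed = nn.Linear(hid_dim, hid_dim, bias=False)  # differential: h_{t-1}-h_{t-2}
        self.W_he = nn.Linear(hid_dim, hid_dim)              # state update matrix + bias b_h

    def forward(self, x_t, h_prev, h_prev2, S_prev):
        d_prev = h_prev - h_prev2                            # first-order differential
        E_t = (self.W_ex(x_t) + self.W_eh(h_prev)
               + self.W_es(S_prev) + self.W_ed(d_prev))
        h_t = torch.relu(self.W_he(E_t))                     # sigma chosen as ReLU here
        return h_t, S_prev + h_t                             # S_t accumulates outputs

cell = PIDRNNCell(in_dim=336, hid_dim=128)   # 336 = M*X*Y block features
h_prev = h_prev2 = S = torch.zeros(1, 128)
for x_t in torch.randn(5, 1, 336):           # five image blocks in sequence
    h_t, S_new = cell(x_t, h_prev, h_prev2, S)
    h_prev2, h_prev, S = h_prev, h_t, S_new
print(h_t.shape)                             # torch.Size([1, 128]) state vector
```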
According to an embodiment of the present disclosure, the extraction unit 220 obtains the spatio-temporal features of each image block; because a gesture may span multiple image blocks, the determination unit 230 can model the temporal relationship between different image blocks, so that the gesture can be recognized accurately and quickly.

According to an embodiment of the present disclosure, as shown in FIG. 2, the image processing apparatus 200 may further include a decision unit 240 for determining the final gesture according to the output of the determination unit 230.

According to an embodiment of the present disclosure, the output of the recurrent neural network in the determination unit 230 may be a 128-dimensional state vector, determined from the spatio-temporal features of the image blocks, that corresponds to the different gestures. The decision unit 240 may include a classifier for mapping the state vector output by the determination unit 230 to a gesture.

According to an embodiment of the present disclosure, the extraction unit 220 may include one convolutional neural network model and the determination unit 230 may include one recurrent neural network model, so that the decision unit 240 determines the final gesture from the output of that recurrent neural network model.
FIG. 13 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 13, the input of the image processing apparatus 200 passes in sequence through the convolutional neural network model in the extraction unit 220, the recurrent neural network model in the determination unit 230, and the classifier in the decision unit 240, which outputs the gesture recognition result.

According to an embodiment of the present disclosure, the extraction unit 220 may include a plurality of convolutional neural network models and the determination unit 230 may include a plurality of recurrent neural network models, so that the decision unit 240 determines the final gesture from the output of each of the recurrent neural network models. Here, the inputs of the convolutional neural network models are all the same, namely the plurality of images input to the image processing apparatus 200. That is, each pair of a convolutional neural network model and a recurrent neural network model is used to determine a state vector of the gesture, and the classifier in the decision unit 240 then determines the final gesture. For example, the classifier may average the state vectors output by the recurrent neural network models and then determine the final gesture.

FIG. 14 is a schematic diagram showing the structure of an image processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 14, the image processing apparatus 200 includes R convolutional neural network models, R recurrent neural network models, and one classifier, where R is an integer greater than or equal to 2. Specifically, the input images are fed to convolutional neural network model 1 and recurrent neural network model 1 to obtain a first 128-dimensional state vector; the same input images are fed to convolutional neural network model 2 and recurrent neural network model 2 to obtain a second 128-dimensional state vector; and so on, up to convolutional neural network model R and recurrent neural network model R, which yield the R-th 128-dimensional state vector. The classifier combines the outputs of the R recurrent neural network models to obtain the final gesture recognition result.
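A minimal sketch of this decision step (the averaging classifier), with illustrative values for R and the number of gesture classes; the linear classifier is a stand-in assumption:

```python
import torch
import torch.nn as nn

R, num_gestures = 3, 10                                  # illustrative values
state_vectors = [torch.randn(1, 128) for _ in range(R)]  # outputs of the R RNN models
fused = torch.stack(state_vectors).mean(dim=0)           # average over the R branches
classifier = nn.Linear(128, num_gestures)
gesture_id = classifier(fused).argmax(dim=-1).item()     # final recognized gesture
print(gesture_id)
```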
As described above, according to embodiments of the present disclosure, multiple pairs of convolutional neural network models and recurrent neural network models can be used to recognize gestures, making the recognized gestures more accurate.

As described above, a convolutional neural network model including a separable convolution network with a stride of 1 and a pointwise convolution network can extract local spatio-temporal information of an image block; a model including a separable convolution network with a stride greater than 1 and a pointwise convolution network can extract spatio-temporal information related to a medium distance; and a model including a separable convolution network with a stride of 1 and a dilated convolution network can extract global spatio-temporal information of an image block. Therefore, according to an embodiment of the present disclosure, the R convolutional neural network models may include models capable of extracting spatio-temporal information at different scales; that is, the R models may include at least two of the above three types.

For example, in the case of R=2, the first of the R convolutional neural network models may include a separable convolution network with a stride of 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride greater than 1 and a pointwise convolution network. Alternatively, in the case of R=2, the first model may include a separable convolution network with a stride of 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride of 1 and a dilated convolution network. Alternatively, in the case of R=2, the first model may include a separable convolution network with a stride greater than 1 and a pointwise convolution network, and the second may include a separable convolution network with a stride of 1 and a dilated convolution network. In the case of R=3, the first model may include a separable convolution network with a stride of 1 and a pointwise convolution network, the second a separable convolution network with a stride greater than 1 and a pointwise convolution network, and the third a separable convolution network with a stride of 1 and a dilated convolution network.

As described above, according to embodiments of the present disclosure, when the extraction unit 220 includes a plurality of convolutional neural network models, these models can extract spatio-temporal information of the image blocks at different scales, thereby satisfying the requirements of both fast and accurate gesture recognition.
According to an embodiment of the present disclosure, the training of the image processing apparatus 200 may be divided into two stages. In the first stage, manually labelled gestures and a cross-entropy loss function are used to pre-train the entire network, so that the network is trained on sequences in which the images include only one gesture. In the second stage, augmented gestures (that is, gestures with noise added on the time axis, so that the length of the image sequence corresponding to a gesture is increased or decreased) and a connectionist temporal classification loss function are used to fine-tune the pre-trained network, so that the network is trained on sequences in which the images include multiple gestures and the length of each gesture's image sequence may increase or decrease. According to an embodiment of the present disclosure, after these two training stages, the image processing apparatus 200 can recognize dynamic gestures quickly and accurately.
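A hedged sketch of the two loss functions named above, assuming a PyTorch-style API; all shapes, lengths, and label values below are placeholders for illustration:

```python
import torch
import torch.nn as nn

# Stage 1: pre-training, one manually labelled gesture per sequence.
logits = torch.randn(4, 10)                   # (batch, num_gesture_classes)
labels = torch.tensor([2, 0, 7, 7])           # manually labelled gesture classes
stage1_loss = nn.CrossEntropyLoss()(logits, labels)

# Stage 2: fine-tuning, several gestures per stream with time-axis jitter.
log_probs = torch.randn(50, 4, 11).log_softmax(dim=-1)  # (T, batch, classes+blank)
targets = torch.randint(1, 11, (4, 6))                   # gesture label sequences
in_lens = torch.full((4,), 50, dtype=torch.long)
tgt_lens = torch.full((4,), 6, dtype=torch.long)
stage2_loss = nn.CTCLoss(blank=0)(log_probs, targets, in_lens, tgt_lens)
print(stage1_loss.item(), stage2_loss.item())
```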
As described above, the image processing apparatus 200 according to the embodiments of the present disclosure can divide multiple input images into multiple image blocks and extract the spatiotemporal features of the image blocks using a separable convolutional network together with a pointwise convolutional network or a dilated convolutional network, which greatly reduces the amount of computation in gesture recognition. Further, when the image processing apparatus 200 includes multiple convolutional neural network models, spatiotemporal features of the image blocks can be extracted at different scales, ensuring both the accuracy and the speed of recognition. In addition, a recurrent neural network that takes into account the proportional information, integral information, and/or differential information of the accumulated outputs is used to process the spatiotemporal features of each image block, making the recognition result more precise. In summary, the image processing apparatus 200 according to the embodiments of the present disclosure can recognize dynamic gestures quickly and accurately.
<2. Example of the image processing method>
Next, the image processing method executed by the image processing apparatus 200 according to an embodiment of the present disclosure will be described in detail.
FIG. 15 is a flowchart illustrating the image processing method performed by the image processing apparatus 200 according to an embodiment of the present disclosure.
As shown in FIG. 15, in step S1510, multiple consecutively input images are divided into multiple image blocks.
Next, in step S1520, the spatiotemporal features of each image block are extracted using a convolutional neural network model, where the convolutional neural network model includes a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network.
Next, in step S1530, a recurrent neural network model is used to determine the gestures included in the multiple images according to the spatiotemporal features of the image blocks.
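The three steps can be read together as one forward pass. The following minimal sketch assumes a generic feature extractor standing in for the convolutional model of step S1520 and substitutes a standard GRU for the recurrent model of step S1530; all names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GesturePipeline(nn.Module):
    def __init__(self, extractor, feat_dim, hidden_dim, num_classes, block_len=4):
        super().__init__()
        self.block_len = block_len     # M frames per image block (step S1510)
        self.extractor = extractor     # convolutional model (step S1520); assumed to
                                       # map one block of frames to a feature vector
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # step S1530
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):         # frames: (batch, T, C, H, W), T divisible by M
        b, t = frames.shape[:2]
        blocks = frames.view(b, t // self.block_len, self.block_len, *frames.shape[2:])
        feats = torch.stack(
            [self.extractor(blk) for blk in blocks.unbind(dim=1)], dim=1)
        out, _ = self.rnn(feats)       # temporal reasoning over the block features
        return self.head(out)          # per-block gesture logits
```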
Preferably, dividing the consecutively input multiple images into multiple image blocks includes dividing M consecutively input images into one image block, where M is an integer greater than or equal to 2; and extracting the spatiotemporal features of each image block using the convolutional neural network model includes inputting the features of the key points of each of the M images into the convolutional neural network model as the features of the image block.
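A sketch of this grouping is given below, assuming each frame is represented by K key points with three values each; the key-point count and the feature layout are assumptions for the example.

```python
import numpy as np

def frames_to_blocks(keypoints, m=4):
    """keypoints: array of shape (T, K, 3), e.g. (x, y, confidence) per key point.
    Returns blocks of shape (T // m, m, K, 3); each block of M consecutive frames
    is fed to the convolutional neural network model as one input."""
    t = (keypoints.shape[0] // m) * m   # drop trailing frames that do not fill a block
    return keypoints[:t].reshape(-1, m, *keypoints.shape[1:])

# Example: 32 frames, 21 hand key points
blocks = frames_to_blocks(np.zeros((32, 21, 3)), m=4)   # -> shape (8, 4, 21, 3)
```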
Preferably, the convolutional neural network model further includes a fully connected network.
Preferably, the convolutional neural network model includes: multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
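For the first of these variants, a minimal sketch might look as follows; the channel counts and the spatial size feeding the fully connected layer are assumptions.

```python
import torch.nn as nn

# Several separable (depthwise) convolutions, one pointwise convolution,
# and one fully connected network, per the first variant above.
cnn_model = nn.Sequential(
    nn.Conv2d(16, 16, 3, padding=1, groups=16),  # separable convolution 1
    nn.Conv2d(16, 16, 3, padding=1, groups=16),  # separable convolution 2
    nn.Conv2d(16, 32, kernel_size=1),            # pointwise convolution
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128),                  # fully connected (8x8 input assumed)
)
```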
Preferably, the image processing method further includes: determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and determining the final gesture according to the output result of each recurrent neural network model.
Preferably, the first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, the second convolutional neural network model includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and the third convolutional neural network model includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
Preferably, determining the gestures included in the multiple images using the recurrent neural network model includes: determining the output information at the current time instant according to the input information at the current time instant, the proportional information of the output at the previous time instant, and the integral information and/or the differential information of the output at the previous time instant.
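A hedged sketch of such a recurrent cell is given below. The way the three terms are combined, the learned projections, and the gains kp, ki, and kd are assumptions made for the example, since the disclosure only specifies that proportional, integral, and/or differential information of previous outputs enters the computation.

```python
import torch
import torch.nn as nn

class PIDRecurrentCell(nn.Module):
    def __init__(self, in_dim, out_dim, kp=1.0, ki=0.1, kd=0.1):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, out_dim)    # current input term
        self.fb_proj = nn.Linear(out_dim, out_dim)   # feedback of previous outputs
        self.kp, self.ki, self.kd = kp, ki, kd

    def forward(self, x_t, y_prev, integral, y_prev2):
        p = self.kp * y_prev                    # proportional information
        i = self.ki * (integral + y_prev)       # integral of accumulated outputs
        d = self.kd * (y_prev - y_prev2)        # differential information
        y_t = torch.tanh(self.in_proj(x_t) + self.fb_proj(p + i + d))
        return y_t, integral + y_prev, y_prev   # output, updated integral, new y_prev2

# Usage over a sequence xs of block features, each of shape (batch, in_dim):
# y = torch.zeros(batch, out_dim); integral = torch.zeros_like(y); y2 = torch.zeros_like(y)
# for x_t in xs:
#     y, integral, y2 = cell(x_t, y, integral, y2)
```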
According to the embodiments of the present disclosure, the subject executing the above method may be the image processing apparatus 200 according to the embodiments of the present disclosure, so all of the foregoing embodiments of the image processing apparatus 200 apply here.
<3. Application examples>
The present disclosure can be applied to various scenarios. For example, the image processing apparatus 200 of the present disclosure can be used for gesture recognition, and in particular for online dynamic gesture recognition. Furthermore, although the present disclosure is introduced with online dynamic gesture recognition as an example, the present disclosure is not limited thereto and can be applied to other scenarios related to the processing of time-series signals.
FIG. 16 is a block diagram showing an example of an electronic device 1600 that can implement the image processing apparatus 200 according to the present disclosure. The electronic device 1600 may be, for example, user equipment, and may be implemented as a mobile terminal (such as a smartphone, a tablet personal computer (PC), a notebook PC, a portable game terminal, a portable/dongle-type mobile router, or a digital camera) or a vehicle-mounted terminal.
The electronic device 1600 includes a processor 1601, a memory 1602, a storage device 1603, a network interface 1604, and a bus 1606.
The processor 1601 may be, for example, a central processing unit (CPU) or a digital signal processor (DSP), and controls the functions of the electronic device 1600. The memory 1602 includes random access memory (RAM) and read-only memory (ROM), and stores data and programs executed by the processor 1601. The storage device 1603 may include a storage medium such as a semiconductor memory or a hard disk.
The network interface 1604 is a wired communication interface for connecting the electronic device 1600 to the wired communication network 1605. The wired communication network 1605 may be a core network such as an evolved packet core (EPC) or a packet data network (PDN) such as the Internet.
The bus 1606 connects the processor 1601, the memory 1602, the storage device 1603, and the network interface 1604 to one another. The bus 1606 may include two or more buses with different speeds (such as a high-speed bus and a low-speed bus).
In the electronic device 1600 shown in FIG. 16, the preprocessing unit 210, the extraction unit 220, the determination unit 230, and the decision unit 240 described with reference to FIG. 2 may be implemented by the processor 1601. For example, by executing instructions stored in the memory 1602 or the storage device 1603, the processor 1601 may perform the functions of dividing the consecutively input multiple images into multiple image blocks, extracting the spatiotemporal features of each image block using the convolutional neural network model, and determining the gestures included in the multiple images using the recurrent neural network.
The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, but the present disclosure is of course not limited to the above examples. Those skilled in the art may make various changes and modifications within the scope of the appended claims, and it should be understood that such changes and modifications naturally fall within the technical scope of the present disclosure.
For example, the units shown in dashed boxes in the functional block diagrams in the drawings indicate that the corresponding functional unit is optional in the corresponding apparatus, and the optional functional units may be combined in an appropriate manner to achieve the required functions.
For example, multiple functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, multiple functions implemented by multiple units in the above embodiments may each be implemented by separate devices. In addition, one of the above functions may be implemented by multiple units. Needless to say, such configurations are included within the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only processing performed in time series in the described order, but also processing performed in parallel or individually rather than necessarily in time series. Furthermore, even for steps processed in time series, the order may, needless to say, be changed appropriately.
Although the embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, it should be understood that the above-described embodiments are only used to illustrate the present disclosure and do not constitute a limitation of the present disclosure. Those skilled in the art may make various modifications and changes to the above embodiments without departing from the spirit and scope of the present disclosure. Therefore, the scope of the present disclosure is defined only by the appended claims and their equivalents.

Claims (15)

  1. An image processing apparatus, comprising processing circuitry configured to:
    divide multiple consecutively input images into multiple image blocks;
    extract spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and
    determine, using a recurrent neural network model, gestures included in the multiple images according to the spatiotemporal features of the image blocks.
  2. The image processing apparatus according to claim 1, wherein the processing circuitry is further configured to:
    divide M consecutively input images into one image block, M being an integer greater than or equal to 2; and
    input features of the key points of each of the M images into the convolutional neural network model as features of the image block.
  3. The image processing apparatus according to claim 1, wherein the convolutional neural network model further includes a fully connected network.
  4. The image processing apparatus according to claim 3, wherein the convolutional neural network model includes:
    multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or
    N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  5. The image processing apparatus according to claim 1, wherein the processing circuitry is further configured to:
    determine the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and
    determine a final gesture according to an output result of each recurrent neural network model.
  6. The image processing apparatus according to claim 5, wherein a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, a second convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and a third convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  7. The image processing apparatus according to claim 1, wherein
    the recurrent neural network model determines output information at a current time instant according to input information at the current time instant, proportional information of an output at a previous time instant, and integral information and/or differential information of the output at the previous time instant.
  8. An image processing method, comprising:
    dividing multiple consecutively input images into multiple image blocks;
    extracting spatiotemporal features of each image block using a convolutional neural network model, the convolutional neural network model including a separable convolutional network and a pointwise convolutional network, or a separable convolutional network and a dilated convolutional network; and
    determining, using a recurrent neural network model, gestures included in the multiple images according to the spatiotemporal features of the image blocks.
  9. The image processing method according to claim 8, wherein dividing the multiple consecutively input images into multiple image blocks comprises: dividing M consecutively input images into one image block, M being an integer greater than or equal to 2, and
    wherein extracting the spatiotemporal features of each image block using the convolutional neural network model comprises: inputting features of the key points of each of the M images into the convolutional neural network model as features of the image block.
  10. The image processing method according to claim 8, wherein the convolutional neural network model further includes a fully connected network.
  11. The image processing method according to claim 10, wherein the convolutional neural network model includes:
    multiple separable convolutional networks, one or more pointwise convolutional networks or dilated convolutional networks, and one fully connected network; or
    N separable convolutional networks, N pointwise convolutional networks or dilated convolutional networks, and N fully connected networks, where N is a positive integer.
  12. The image processing method according to claim 8, further comprising:
    determining the gestures included in the multiple images using multiple convolutional neural network models and multiple recurrent neural network models, respectively; and
    determining a final gesture according to an output result of each recurrent neural network model.
  13. The image processing method according to claim 12, wherein a first convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a pointwise convolutional network, a second convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride greater than 1 and a pointwise convolutional network, and a third convolutional neural network model among the multiple convolutional neural network models includes a separable convolutional network with a stride of 1 and a dilated convolutional network.
  14. The image processing method according to claim 8, wherein determining the gestures included in the multiple images using the recurrent neural network model comprises:
    determining output information at a current time instant according to input information at the current time instant, proportional information of an output at a previous time instant, and integral information and/or differential information of the output at the previous time instant.
  15. A computer-readable storage medium comprising executable computer instructions that, when executed by a computer, cause the computer to perform the image processing method according to any one of claims 8-14.
PCT/CN2021/092004 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium WO2021227933A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180023365.4A CN115349142A (en) 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010407312.X 2020-05-14
CN202010407312.XA CN113673280A (en) 2020-05-14 2020-05-14 Image processing apparatus, image processing method, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021227933A1 (en)

Family

ID=78526428

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/092004 WO2021227933A1 (en) 2020-05-14 2021-05-07 Image processing apparatus, image processing method, and computer-readable storage medium

Country Status (2)

Country Link
CN (2) CN113673280A (en)
WO (1) WO2021227933A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113888541B (en) * 2021-12-07 2022-03-25 南方医科大学南方医院 Image identification method, device and storage medium for laparoscopic surgery stage

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170206405A1 (en) * 2016-01-14 2017-07-20 Nvidia Corporation Online detection and classification of dynamic gestures with recurrent convolutional neural networks
CN106991372A (en) * 2017-03-02 2017-07-28 北京工业大学 A kind of dynamic gesture identification method based on interacting depth learning model
CN107180226A (en) * 2017-04-28 2017-09-19 华南理工大学 A kind of dynamic gesture identification method based on combination neural net
CN108846440A (en) * 2018-06-20 2018-11-20 腾讯科技(深圳)有限公司 Image processing method and device, computer-readable medium and electronic equipment
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110889387A (en) * 2019-12-02 2020-03-17 浙江工业大学 Real-time dynamic gesture recognition method based on multi-track matching
CN111160114A (en) * 2019-12-10 2020-05-15 深圳数联天下智能科技有限公司 Gesture recognition method, device, equipment and computer readable storage medium
CN112036261A (en) * 2020-08-11 2020-12-04 海尔优家智能科技(北京)有限公司 Gesture recognition method and device, storage medium and electronic device
CN112507898A (en) * 2020-12-14 2021-03-16 重庆邮电大学 Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN

Also Published As

Publication number Publication date
CN115349142A (en) 2022-11-15
CN113673280A (en) 2021-11-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21803352; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21803352; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1