CN114648650A - Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium



Publication number
CN114648650A
Authority
CN
China
Prior art keywords
image
network
neural network
trained
training
Prior art date
Legal status
Pending
Application number
CN202210333676.7A
Other languages
Chinese (zh)
Inventor
Gao Mengya (高梦雅)
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202210333676.7A
Publication of CN114648650A


Classifications

    • G - PHYSICS
      • G06 - COMPUTING; CALCULATING OR COUNTING
        • G06F - ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 - Pattern recognition
            • G06F 18/20 - Analysing
              • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/22 - Matching criteria, e.g. proximity measures
        • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 - Computing arrangements based on biological models
            • G06N 3/02 - Neural networks
              • G06N 3/04 - Architecture, e.g. interconnection topology
                • G06N 3/045 - Combinations of networks
              • G06N 3/08 - Learning methods

Abstract

The present disclosure provides a neural network training method, a target detection method, and corresponding apparatuses, devices and storage media. The training method comprises: acquiring a first image sample collected in an upstream task, a first neural network to be trained for a downstream task, and a second neural network and an image generation network that are both obtained by training based on the first image sample, where the second neural network is used for feature extraction and the image generation network is used for generating new images that conform to the overall distribution of the first image sample; performing feature extraction on a new image generated by the image generation network with the trained second neural network and with the first neural network to be trained, respectively; and training the first neural network to be trained based on the extracted first image feature and second image feature to obtain the trained first neural network. The two image features can better guide the training of the first neural network, thereby improving performance on the downstream task.

Description

Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for neural network training and target detection.
Background
With the rapid development of artificial intelligence technology, end-to-end deep learning is becoming mature. Using a large-scale data set, a pre-trained neural network (i.e., a pre-trained model) can be learned for various upstream tasks; such a model can directly share its pre-trained early-layer weights and has strong feature representation capability.
However, when the pre-trained model is migrated to a specific downstream task, the amount of data that can actually be collected downstream is relatively small, so the pre-trained model performs poorly in the downstream task regardless of whether it is migrated directly or migrated after fine-tuning.
Disclosure of Invention
The embodiment of the disclosure at least provides a method, a device, equipment and a storage medium for neural network training and target detection.
In a first aspect, an embodiment of the present disclosure provides a method for neural network training, where the method includes:
acquiring a first image sample collected in an upstream task, a first neural network to be trained for a downstream task, and a second neural network and an image generation network that are both obtained by training based on the first image sample; the second neural network is used for feature extraction, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
performing feature extraction on a new image generated by the image generation network with the trained second neural network and with the first neural network to be trained, respectively, to obtain a first image feature and a second image feature;
and training the first neural network to be trained based on the first image feature and the second image feature to obtain the trained first neural network.
With the above neural network training method, feature extraction can be performed on a new image generated by the image generation network, using both the first neural network to be trained for the downstream task and the second neural network trained on the first image sample collected in the upstream task; the first neural network can then be trained based on the resulting first image feature and second image feature. Because the new image generated by the image generation network conforms to the overall distribution of the first image sample, it is better adapted to the network environment of the second neural network. Meanwhile, the first image feature output by the trained second neural network and the second image feature output by the first neural network to be trained can better guide the training of the first neural network, thereby further improving performance on the downstream task.
In one possible embodiment, the image generation network is trained as follows:
acquiring a first image output by a codebook generation network; the codebook generation network is configured to generate a codebook that decomposes the first image sample into a plurality of primitives;
inputting the first image into the image generation network to be trained to obtain a second image output by the image generation network;
determining a loss function value of the image generation network to be trained based on the image similarity between the second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
The image generation network to be trained can thus be trained on the first image produced through codebook coding, so that the trained image generation network better conforms to the overall distribution of the first image sample, which benefits subsequent network training performance.
In a possible implementation, the inputting the first image to the image generation network to be trained includes:
covering a partial image area in the first image to obtain a covered first image;
and inputting the first image after the covering processing into the image generation network to be trained.
In one possible embodiment, the codebook generating network comprises an encoder and a decoder, and is trained according to the following steps:
repeatedly performing the following steps until the similarity between the image output by the decoder and the first image sample input into the encoder is greater than a preset threshold:
inputting the first image sample to an encoder to be trained to obtain a codebook output by the encoder; and inputting the codebook output by the encoder into a decoder to be trained to obtain an image output by the decoder.
Here, the codebook is obtained through image coding realized by an adversarial network formed by an encoder and a decoder, which achieves high accuracy.
In one possible embodiment, a first image output of a codebook-based generating network is obtained as follows:
inputting the first image sample to an encoder included in the codebook generating network to obtain a codebook output by the encoder;
and inputting the codebook output by the encoder into a decoder included in the codebook generating network to obtain the first image output by the decoder.
Here, the first image sample may be re-characterized by using the codebook output by the encoder, and the characterized first image may be more suitable for subsequent network training.
In one possible embodiment, the image generation network comprises a first generation subnetwork for generating a codebook that decomposes the first image samples into a plurality of primitives, and a second generation subnetwork that generates the new image based on the image output by the first generation subnetwork; training the image generation network according to the following steps:
inputting the first image sample into a trained first generation sub-network to obtain a first image output by the first generation sub-network;
inputting the first image into a second generation sub-network to be trained to obtain a second image output by the second generation sub-network;
determining a loss function value of the image generation network to be trained based on a first image similarity between the first image and the input first image sample and a second image similarity between the second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
Here, the image generation network is trained with the first generation sub-network and the second generation sub-network combined, so that the trained image generation network better balances generation quality and efficiency for both codebook generation and image generation.
In a possible embodiment, the training the first neural network to be trained based on the first image feature and the second image feature to obtain a trained first neural network includes:
determining a loss function value of the first neural network to be trained based on image similarity between the first image feature and the second image feature;
and under the condition that the loss function value corresponding to the current round is larger than a preset threshold value, adjusting the network parameter value of the first neural network based on the loss function value, and performing the next round of training according to the adjusted first neural network until the loss function value is smaller than or equal to the preset threshold value.
In a possible implementation, after the obtaining of the trained first neural network, the method further includes:
acquiring a second image sample collected in a downstream task;
and performing network training again on the trained first neural network based on the second image sample to obtain the finally trained first neural network.
Here, the first neural network may be fine-tuned based on second image samples collected in the downstream task, improving the generalization performance of the network in the downstream task.
In a possible implementation manner, the performing network training again on the trained first neural network based on the second image sample to obtain the finally trained first neural network includes:
inputting the second image sample into the first neural network to obtain a task output result of the network;
determining a loss function value of the first neural network based on a comparison between the task output result and a task labeling result for labeling the second image sample;
and performing network training on the first neural network again based on the loss function value to obtain the finally trained first neural network.
In one possible embodiment, the second neural network is trained as follows:
acquiring an original neural network; the original neural network at least comprises a feature extraction layer;
performing feature extraction on the first image sample based on a feature extraction layer included by the original neural network to obtain image feature information output by the feature extraction layer;
adjusting the network parameter value of the feature extraction layer based on the image feature information to obtain an adjusted feature extraction layer;
and determining the original neural network containing the adjusted feature extraction layer as a second neural network obtained by training.
Here, a second neural network may be obtained based on training of the original neural network including the feature extraction layer, and the network may output more general feature information, which is convenient for subsequent task migration.
In a second aspect, an embodiment of the present disclosure further provides a method for target detection, where the method includes:
acquiring a target image acquired in a downstream task;
inputting the target image into a first neural network trained by the neural network training method according to the first aspect and any one of the various embodiments thereof, and obtaining a detection result of the target object in the target image.
In a third aspect, an embodiment of the present disclosure further provides an apparatus for neural network training, where the apparatus includes:
the acquisition module is used for acquiring a first image sample collected in an upstream task, a first neural network to be trained for a downstream task, and a second neural network and an image generation network that are both obtained by training based on the first image sample; the second neural network is used for feature extraction, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
the extraction module is used for performing feature extraction on a new image generated by the image generation network with the trained second neural network and with the first neural network to be trained, respectively, to obtain a first image feature and a second image feature;
and the training module is used for training the first neural network to be trained on the basis of the first image characteristics and the second image characteristics to obtain the trained first neural network.
In a fourth aspect, an embodiment of the present disclosure further provides an apparatus for target detection, where the apparatus includes:
the acquisition module is used for acquiring a target image acquired in a downstream task;
a detection module, configured to input the target image to a first neural network trained by the neural network training method according to the first aspect and any one of the various embodiments thereof, so as to obtain a detection result of the target object in the target image.
In a fifth aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of neural network training according to the first aspect and any of its various embodiments or the steps of the method of object detection according to the second aspect.
In a sixth aspect, the disclosed embodiments also provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the steps of the method for neural network training according to the first aspect and any one of the various embodiments thereof or the steps of the method for object detection according to the second aspect.
For the description of the effects of the above apparatus, electronic device, and computer-readable storage medium, reference is made to the description of the above method, which is not repeated here.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 illustrates a flow chart of a method of neural network training provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a method of object detection provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an apparatus for neural network training provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an apparatus for target detection provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
It has been found through research that in the process of migrating the pre-trained model to a downstream specific task, the performance in the downstream task is generally improved through model fine tuning in the related art.
Existing fine-tuning methods mainly fall into the following two categories. The first category screens and maps the features extracted by the backbone network. In a specific application, this screening may be implemented by adding an extra network layer behind the backbone network; that is, the extra layer screens and maps the general features extracted by the backbone network and retains and reinforces the features required by the downstream task, where the extra layer may be a convolutional layer, a normalization layer, or the like. The second category operates on the weight parameters of the backbone network. In a specific application, downstream task migration is not carried out directly through back propagation; instead, a weight increment and an offset value are predicted in a designated weight parameter space for the downstream task, thereby assisting the backbone network to adapt to the downstream task.
However, both of the above methods have disadvantages: the first category may cause the feature mapping layer to overfit when the amount of data in the downstream task is small; in the second category, the range of weight updates is limited by the designated weight parameter space, and the weights are not guaranteed to be optimized to the optimal state. There is therefore room to improve the model migration performance achieved by these methods.
In addition, the relevant pre-training model often has a specific network structure, and in an actual migrated downstream scenario, the model may need to be migrated to a different network structure, which puts a higher requirement on the model migration performance.
Based on this research, the present disclosure provides a scheme that realizes network migration through knowledge distillation, in which a trained teacher network guides a student network to be trained, so as to improve the performance of the pre-trained model in the downstream task.
To facilitate understanding of the present embodiment, first, a method for neural network training disclosed in the embodiments of the present disclosure is described in detail, where an execution subject of the method for neural network training provided in the embodiments of the present disclosure is generally an electronic device with certain computing capability, and the electronic device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a terminal, or other processing devices. In some possible implementations, the method of neural network training may be implemented by a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, which is a flowchart of a method for neural network training provided in the embodiment of the present disclosure, the method includes steps S101 to S103, where:
s101: acquiring a first image sample acquired in an upstream task, a first neural network to be trained in a downstream task, a second neural network obtained based on the training of the first image sample and an image generation network; the second neural network is used for extracting features, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
s102: respectively extracting features of a new image generated based on the image generation network according to a second neural network obtained by training and a first neural network to be trained to obtain a first image feature and a second image feature;
s103: and training the first neural network to be trained based on the first image characteristics and the second image characteristics to obtain the trained first neural network.
In order to facilitate understanding of the neural network training method provided by the embodiments of the present disclosure, a brief description of an application scenario of the method is first provided below. The neural network training method in the embodiment of the disclosure can be mainly applied to network training in related downstream tasks under visual scene migration, where the downstream tasks may be related tasks based on a currently migrated scene, for example, may be target detection tasks under a natural scene, or may be semantic segmentation tasks under an acquisition scene.
Here, the number of training samples that can be collected in the downstream task is relatively small, while the corresponding upstream task is a related task with more training samples. Taking a target classification task as an example, a target classification neural network trained on a training database composed of target objects may already be available; however, for a downstream task in a specific application scenario such as automatic driving, the training data corresponding to that scenario is relatively scarce, so the pre-trained model obtained upstream is often needed to support downstream training, for example, by fine-tuning the pre-trained model before migration.
However, due to a series of problems of the fine tuning scheme in the related art, the pre-training model has poor performance in downstream tasks. Meanwhile, in the migration process, the related art has the limitation that the same model structure must be used for migration to a downstream data set by using an upstream pre-trained model, so that the pre-trained model cannot be well migrated to a downstream task with a different model structure.
To solve the above problems, the embodiments of the present disclosure provide a scheme that realizes network migration through knowledge distillation, in which a trained teacher network guides a student network to be trained, so as to improve the performance of the pre-trained model in the downstream task.
In the embodiment of the present disclosure, the pre-training model may be a second neural network trained in an upstream task by using the first image sample acquired in the upstream task. In addition, the image generation network herein may be a correlation network that generates a new image that conforms to the overall distribution of the first image sample.
In a specific application, an upstream data set for an upstream task and a downstream data set for a downstream task may be prepared in advance, the upstream data set having a large number of first image samples as a large-scale pre-training data set, and the downstream data set having a small number of second image samples as a data set to be migrated.
The first image sample may be an image collected across multiple tasks in multiple application scenarios; the application scenarios may be natural scenes, monitoring scenes, acquisition scenes, and the like, and the tasks may be image classification, target detection, semantic segmentation, and the like. The second image sample is an image captured in the specific scenario and task to be migrated, such as an image of street pedestrians of interest in a detection task.
The original neural network including the feature extraction layer may be trained based on the first image sample, where feature extraction may be performed on the first image sample based on the feature extraction layer, and then network parameter values of the feature extraction layer are adjusted through image feature information output by the feature extraction layer, so that the trained original neural network may be determined as the trained second neural network.
The original neural network may be any network structure with a feature extraction function. The second neural network is obtained by training the original neural network with large-scale upstream data (corresponding to the first image sample); for any image, its backbone part (corresponding to the feature extraction layer) can output a general feature representation.
It should be noted that the original neural network may further include a task layer for task processing after the feature extraction layer. In that case, the degree of matching between the task output result of the task layer and the task labeling result for the large-scale upstream data may be used to train the whole original neural network, which is not described in detail here.
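For illustration, the original neural network just described can be sketched as a backbone (the feature extraction layer) with an optional task head; the class below is an assumed structure, not the concrete architecture of the disclosure.

```python
import torch.nn as nn
from typing import Optional

class OriginalNet(nn.Module):
    """Original neural network: a feature extraction layer (backbone) plus an
    optional task layer, as described above (structure illustrative)."""
    def __init__(self, backbone: nn.Module, task_head: Optional[nn.Module] = None):
        super().__init__()
        self.backbone = backbone    # outputs a general feature representation
        self.task_head = task_head  # optional task layer used for upstream training

    def forward(self, x):
        feats = self.backbone(x)
        return self.task_head(feats) if self.task_head is not None else feats
```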
To better train the first neural network for the downstream task, the trained second neural network can be used as the teacher model in knowledge distillation, and the first neural network to be migrated downstream can be used as the student model; the first neural network can then be trained by fixing the teacher model and training the student model.
During training, feature extraction is performed with the second neural network and the first neural network respectively. The training of the first neural network is then guided by the similarity between the first image feature output by the teacher model (the second neural network) and the second image feature output by the student model (the first neural network), so that the student model's output representation is as close as possible to the teacher model's, and the downstream task can be better adapted to.
In practical applications, the first image samples come from a large-scale pre-training data set, so different first image samples may have been collected in different application scenarios. The characteristics of first image samples collected in different scenarios may differ to some extent, which can interfere with network training. Here, to reduce the interference of irrelevant information while fully mining the feature information contained in the upstream data set, the image generation network can be used to generate new images that conform to the overall distribution of the first image samples, and the teacher-student training is then carried out on these generated images. In this way, the pre-trained model, i.e., the second neural network, can be efficiently and accurately migrated to the specific downstream task field, and a good migration effect can be achieved even when the amount of downstream data is small.
Considering the key role the training process of the image generation network plays in model migration, the following focuses on schemes for training the image generation network. The image generation network in the embodiments of the present disclosure may be trained on top of an already-trained codebook generation network, or trained synchronously with the codebook generation network. These two aspects are described below.
In a first aspect: the image generation network can be trained according to the following steps:
step one, acquiring a first image output by a codebook generation network; the codebook generation network is used for generating a codebook that decomposes the first image sample into a plurality of primitives;
step two, inputting the first image into the image generation network to be trained to obtain a second image output by the image generation network;
step three, determining a loss function value of the image generation network to be trained based on the image similarity between the second image and the first image;
and step four, training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
Here, the first image output by the codebook generation network is used as the input image of the image generation network to be trained, and the loss function value of the image generation network is determined based on the image similarity between the second image output by the image generation network and the first image. A larger loss function value indicates a larger difference between the output second image and the input first image, in which case network training must continue; a smaller loss function value indicates a smaller difference, and once the difference is sufficiently small, the output image can be regarded as substantially consistent with the input image and training can end.
To train the image generation network better, before the first image is input to the image generation network to be trained, a partial image area in the first image may be covered to obtain a covered first image. When the covered first image is input to the image generation network to be trained, the uncovered image areas guide the generation of the covered areas, and network training can then be realized based on the closeness between the generated image and the original first image.
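A minimal sketch of this covering step is given below, masking random patch-sized regions of the first image; the patch size and covering ratio are illustrative values, not ones fixed by the disclosure.

```python
import torch

def cover_regions(first_image: torch.Tensor, patch: int = 16, ratio: float = 0.4):
    """Randomly cover patch-sized regions of the first image (values illustrative)."""
    b, c, h, w = first_image.shape          # assumes h and w are divisible by patch
    keep = (torch.rand(b, 1, h // patch, w // patch,
                       device=first_image.device) > ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return first_image * keep               # covered areas are zeroed out
```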
In the embodiment of the present disclosure, the codebook generating network may also be obtained based on the first image sample training. The codebook generation network is mainly used for training a codebook capable of coding visual features in upstream data, and then image restoration can be performed through a plurality of primitives contained in the codebook generated in the codebook generation network, so that a first image output by the codebook generation network is obtained.
The training process and the application process of the codebook generating network will be described in detail below.
In the embodiments of the present disclosure, an adversarial network formed by a paired encoder and decoder can be used to train the codebook generation network. Here, the first image sample is input to the encoder to be trained to obtain the codebook output by the encoder; the codebook output by the encoder is input to the decoder to be trained to obtain the image output by the decoder; it is then verified whether the similarity between the image output by the decoder and the first image sample input to the encoder is greater than a preset threshold. If not, the process of inputting the first image sample to the encoder to be trained is repeated until the similarity between the two images exceeds the preset threshold.
Here, the trained codebook generation network can make an image be decomposed into a codebook composed of several primitives by an encoder, and then the primitives can be restored into the image by a decoder.
Here, the first image sample may be input to an encoder included in the codebook generation network to obtain a codebook output by the encoder, and when the codebook output by the encoder is input to a decoder included in the codebook generation network, the image restoration may be performed using each element included in the codebook to obtain a re-characterized first image.
It should be noted that, in practical applications, the first image may be determined jointly with the training process of the codebook generation network. That is, the steps of inputting the first image sample to the encoder to be trained to obtain the codebook output by the encoder, and inputting that codebook to the decoder to be trained to obtain the image output by the decoder, may be repeated until the similarity between the image output by the decoder and the first image sample input to the encoder exceeds the preset threshold, at which point the image output by the decoder is determined to be the first image.
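The paired encoder/decoder with a codebook of primitives can be sketched along the lines of a vector-quantized autoencoder. The disclosure does not fix the architecture or the similarity measure, so the layer shapes, the VQ losses, and the cosine-similarity stopping test below are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodebookNet(nn.Module):
    """Paired encoder/decoder with a codebook of primitives (VQ-VAE-style sketch)."""
    def __init__(self, dim: int = 64, n_primitives: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(), nn.Conv2d(dim, dim, 4, 2, 1))
        self.codebook = nn.Embedding(n_primitives, dim)   # the primitives
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, 2, 1))

    def forward(self, x):
        z = self.encoder(x)                                        # B x D x H x W
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)                # one vector per location
        codes = torch.cdist(flat, self.codebook.weight).argmin(1)  # nearest primitive
        quant = self.codebook(codes).view(b, h, w, d).permute(0, 3, 1, 2)
        # VQ losses keep the codebook and the encoder output consistent
        vq_loss = F.mse_loss(quant, z.detach()) + 0.25 * F.mse_loss(z, quant.detach())
        quant = z + (quant - z).detach()                           # straight-through gradient
        return self.decoder(quant), vq_loss

def train_codebook(net, first_sample, optimizer, threshold: float = 0.95):
    """Repeat until the decoder output is similar enough to the input sample."""
    while True:
        recon, vq_loss = net(first_sample)
        sim = F.cosine_similarity(recon.flatten(1), first_sample.flatten(1)).mean()
        if sim > threshold:
            return recon   # the re-characterized "first image"
        loss = F.mse_loss(recon, first_sample) + vq_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```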
In a second aspect: in the case where the image generation network includes a first generation subnetwork for generating a codebook that decomposes a first image sample into a plurality of primitives, and a second generation subnetwork that generates a new image based on an image output by the first generation subnetwork, the disclosed embodiments may train the image generation network as follows:
step one, inputting the first image sample into the trained first generation sub-network to obtain a first image output by the first generation sub-network;
step two, inputting the first image into the second generation sub-network to be trained to obtain a second image output by the second generation sub-network;
step three, determining a loss function value of the image generation network to be trained based on a first image similarity between the first image and the input first image sample and a second image similarity between the second image and the first image;
and step four, training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
Here, the loss function value of the image generation network to be trained is determined by combining the first image similarity between the first image and the input first image sample with the second image similarity between the second image and the first image. Both similarities affect the adjustment of the relevant network parameter values; that is, the first generation sub-network and the second generation sub-network are trained synchronously, which is more efficient.
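A compact sketch of this combined loss follows, with mean-squared error standing in for both unspecified similarity measures; minimizing it back-propagates into both sub-networks at once.

```python
import torch.nn.functional as F

def joint_generation_loss(first_image, second_image, first_sample):
    """Combine both similarities so the two sub-networks train synchronously
    (MSE is one possible stand-in for the similarity measures)."""
    first_term = F.mse_loss(first_image, first_sample)    # first sub-network vs. input sample
    second_term = F.mse_loss(second_image, first_image)   # second sub-network vs. first image
    return first_term + second_term
```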
Once the first generation sub-network and the second generation sub-network are trained, a corresponding new image can be generated for any input first image sample; the new image carries the rich data information of the upstream task and is therefore better suited to the training requirements of the network.
When a large number of new images generated by the image generation network are input into the second neural network trained in the upstream task (namely, the pre-trained model), general representations can be obtained; knowledge distillation is then carried out with these general representations, distilling their knowledge into the downstream first neural network to be migrated.
In the embodiment of the present disclosure, training on a first neural network in a downstream task may be guided based on an image similarity between two image features (i.e., a first image feature and a second image feature) extracted by a trained second neural network and a first neural network to be trained, which may specifically be implemented by the following steps:
step one, determining a loss function value of a first neural network to be trained based on image similarity between a first image feature and a second image feature;
and step two, under the condition that the loss function value corresponding to the current round is larger than a preset threshold value, adjusting the network parameter value of the first neural network based on the loss function value, and performing the next round of training according to the adjusted first neural network until the loss function value is smaller than or equal to the preset threshold value.
Here, the image similarity between the two image features is inversely related to the loss function value of the first neural network, that is, the determined loss function value is large in the case where the image similarity is small, and is small in the case where the image similarity is large. The purpose of training the first neural network in the disclosed embodiments is to make the representations of the outputs of the two neural networks (the second neural network and the first neural network) as close as possible.
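Steps one and two can be sketched as the round-based loop below. Cosine similarity is one possible choice for the image-feature similarity, so the loss 1 - sim is large when the features disagree and small when they agree, and the loop stops once the loss falls to the preset threshold. All names and the threshold value are illustrative.

```python
import torch
import torch.nn.functional as F

def train_first_network(student, teacher, images, optimizer, threshold: float = 0.05):
    """Round-based training: the loss falls as feature similarity rises; stop
    once the loss reaches the preset threshold."""
    while True:
        with torch.no_grad():
            first_feat = teacher(images)       # teacher features, kept fixed
        second_feat = student(images)          # student features, being trained
        sim = F.cosine_similarity(first_feat.flatten(1),
                                  second_feat.flatten(1)).mean()
        loss = 1.0 - sim                       # similarity and loss inversely related
        if loss.item() <= threshold:
            return student
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```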
To further improve the generalization performance of the first neural network in the downstream task field, the first neural network may be fine-tuned with the second image sample collected in the downstream task, specifically through the following steps:
step one, inputting a second image sample into a first neural network to obtain a task output result of the network;
secondly, determining a loss function value of the first neural network based on a comparison relation between a task output result and a task marking result for marking the second image sample;
and thirdly, network training is carried out on the first neural network again based on the loss function value, and the finally trained first neural network is obtained.
Here, feature extraction may be performed by the feature extraction layer included in the first neural network. When the feature information output by the feature extraction layer is input to the task layer included in the first neural network, multiple rounds of training of the first neural network may be performed based on how well the task output result matches the task labeling result for the second image sample.
In the embodiments of the present disclosure, when the task output result and the task labeling result do not match, the current network performance is inadequate, and the network parameter values need to be adjusted for the next round of training, until the two results match or other network convergence conditions are met, for example, the number of iteration rounds reaching a preset value, or the loss function value falling below a preset threshold.
The task labeling results differ for different downstream tasks. For example, for a target detection task, image samples may be labeled with information such as the position and size of the target object; for a semantic segmentation task, they may be labeled with object semantic information. The labeling can be performed for different downstream tasks and is not specifically limited here.
The fine-tuning based on the second image sample is an overall adjustment of every network layer included in the network: all parameters of each network layer are released, and the final adjustment of the network is performed with a small learning rate, which can significantly improve the generalization performance of the network in the downstream task field.
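The following sketch combines the two points above: every parameter is released before fine-tuning, and a deliberately small learning rate is used for the final adjustment. The optimizer choice, learning rate and epoch count are assumptions for illustration.

```python
import torch

def finetune(student, loader, criterion, epochs: int = 3, lr: float = 1e-4):
    """Release all parameters of every layer and fine-tune with a small learning
    rate on downstream samples (optimizer, lr and epochs are illustrative)."""
    for p in student.parameters():
        p.requires_grad = True                       # release all parameters
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    student.train()
    for _ in range(epochs):
        for images, labels in loader:                # second image samples + task labels
            loss = criterion(student(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```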
Based on the neural network training method provided by the embodiment of the present disclosure, the embodiment of the present disclosure also provides a target detection method, as shown in fig. 2, which specifically includes the following steps:
s201: acquiring a target image acquired in a downstream task;
s202: and inputting the target image into a first neural network trained by using a neural network training method to obtain a detection result of the target object in the target image.
Here, in the case of acquiring a target image acquired in a downstream task, a target object in the target image may be detected based on the trained first neural network for target detection, so as to obtain a detection result of the target object in the target image.
The detection result of the target object in the target image may be information of the position, size, and the like of the target object in the target image.
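A minimal inference sketch, assuming the fine-tuned first neural network is a callable module whose task head returns the detection result directly:

```python
import torch

@torch.no_grad()
def detect(model, target_image):
    """Run the trained first neural network on a downstream target image; the
    output format (e.g. boxes with position and size) depends on the task head."""
    model.eval()
    return model(target_image.unsqueeze(0))   # add a batch dimension
```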
In the embodiment of the present disclosure, target images acquired by different downstream tasks are also different, and reference may be specifically made to an acquisition process of a second image sample, which is not described herein again. For the training process of the first neural network, reference is made to the related description in the above embodiments, and further description is omitted here.
It should be noted that the neural network training method provided by the embodiment of the present disclosure may be applied not only to the field of target detection, but also to the fields of image classification, semantic segmentation, and the like, and is not described herein again.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict execution order or any limitation on the implementation; the specific execution order of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a device corresponding to the method, and since the principle of solving the problem of the device in the embodiment of the present disclosure is similar to that of the method in the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Referring to fig. 3, a schematic diagram of an apparatus for neural network training provided in an embodiment of the present disclosure is shown. The apparatus includes: an acquisition module 301, an extraction module 302 and a training module 303; wherein:
an obtaining module 301, configured to obtain a first image sample collected in an upstream task, a first neural network to be trained for a downstream task, and a second neural network and an image generation network that are both obtained by training based on the first image sample; the second neural network is used for feature extraction, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
an extraction module 302, configured to perform feature extraction on a new image generated by the image generation network with the trained second neural network and with the first neural network to be trained, respectively, to obtain a first image feature and a second image feature;
the training module 303 is configured to train the first neural network to be trained based on the first image feature and the second image feature, so as to obtain a trained first neural network.
With the above neural network training apparatus, feature extraction can be performed on a new image generated by the image generation network, using both the first neural network to be trained for the downstream task and the second neural network trained on the first image sample collected in the upstream task; the first neural network can then be trained based on the resulting first image feature and second image feature. Because the new image generated by the image generation network conforms to the overall distribution of the first image sample, it is better adapted to the network environment of the second neural network. Meanwhile, the first image feature output by the trained second neural network and the second image feature output by the first neural network to be trained can better guide the training of the first neural network, thereby further improving performance on the downstream task.
In one possible implementation, the obtaining module 301 is configured to train the image generation network according to the following steps:
acquiring a first image output by a codebook generation network; the codebook generation network is used for generating a codebook that decomposes the first image sample into a plurality of primitives;
inputting the first image into an image generation network to be trained to obtain a second image output by the image generation network;
determining a loss function value of an image generation network to be trained based on the image similarity between the second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
In one possible implementation, the obtaining module 301 is configured to input the first image to the image generation network to be trained according to the following steps:
covering a partial image area in the first image to obtain a first image after covering;
and inputting the first image subjected to covering processing into an image generation network to be trained.
In a possible implementation, the codebook generating network includes an encoder and a decoder, and the obtaining module 301 is configured to train the codebook generating network according to the following steps:
repeatedly executing the following steps until the similarity between the image output by the decoder and the first image sample input into the encoder is greater than a preset threshold value:
inputting the first image sample into an encoder to be trained to obtain a codebook output by the encoder; and inputting the codebook output by the encoder into a decoder to be trained to obtain an image output by the decoder.
In one possible implementation, the obtaining module 301 is configured to obtain a first image output from a codebook-based generating network according to the following steps:
inputting the first image sample into an encoder included in a codebook generating network to obtain a codebook output by the encoder;
and inputting the codebook output by the coder into a decoder included in a codebook generating network to obtain a first image output by the decoder.
In one possible embodiment, the image generation network comprises a first generation subnetwork for generating a codebook that decomposes the first image samples into a plurality of primitives, and a second generation subnetwork that generates new images based on images output by the first generation subnetwork; an obtaining module 301, configured to train an image generation network according to the following steps:
inputting the first image sample into a trained first generation sub-network to obtain a first image output by the first generation sub-network;
inputting the first image into a second generation sub-network to be trained to obtain a second image output by the second generation sub-network;
determining a loss function value of an image generation network to be trained based on a first image similarity between a first image and an input first image sample and a second image similarity between a second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
In a possible implementation manner, the training module 303 is configured to train the first neural network to be trained based on the first image feature and the second image feature according to the following steps, so as to obtain a trained first neural network:
determining a loss function value of a first neural network to be trained based on image similarity between the first image feature and the second image feature;
and under the condition that the loss function value corresponding to the current round is larger than the preset threshold value, adjusting the network parameter value of the first neural network based on the loss function value, and performing the next round of training according to the adjusted first neural network until the loss function value is smaller than or equal to the preset threshold value.
In a possible implementation, the training module 303 is further configured to:
after the trained first neural network is obtained, a second image sample collected in a downstream task is obtained; and carrying out network training again on the trained first neural network based on the second image sample to obtain the finally trained first neural network.
In a possible implementation manner, the training module 303 is configured to perform network training again on the trained first neural network based on the second image sample to obtain a finally trained first neural network according to the following steps:
inputting the second image sample into the first neural network to obtain a task output result of the network;
determining a loss function value of the first neural network based on a comparison relationship between the task output result and a task labeling result for labeling the second image sample;
and carrying out network training on the first neural network again based on the loss function value to obtain the finally trained first neural network.
In one possible implementation, the obtaining module 301 is configured to train a second neural network according to the following steps:
acquiring an original neural network; the original neural network at least comprises a feature extraction layer;
performing feature extraction on the first image sample based on a feature extraction layer included by the original neural network to obtain image feature information output by the feature extraction layer;
adjusting the network parameter value of the feature extraction layer based on the image feature information to obtain an adjusted feature extraction layer;
and determining the original neural network containing the adjusted feature extraction layer as a second neural network obtained by training.
Referring to fig. 4, a schematic diagram of an apparatus for target detection provided by an embodiment of the present disclosure is shown. The apparatus includes: an acquisition module 401 and a detection module 402; wherein:
an obtaining module 401, configured to obtain a target image collected in a downstream task;
the detection module 402 is configured to input the target image to the first neural network trained by using the neural network training method, so as to obtain a detection result of the target object in the target image.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Corresponding to the methods in fig. 1 and fig. 2, an embodiment of the present disclosure further provides an electronic device, as shown in fig. 5, which is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, and includes:
a processor 501, a memory 502, and a bus 503. The memory 502 is used for storing execution instructions and includes an internal memory 5021 and an external storage 5022; the internal memory 5021 temporarily stores operation data in the processor 501 and data exchanged with the external storage 5022 such as a hard disk, and the processor 501 exchanges data with the external storage 5022 through the internal memory 5021. When the electronic device runs, the processor 501 communicates with the memory 502 through the bus 503, so that the processor 501 executes the steps of the neural network training method shown in fig. 1 or the steps of the target detection method shown in fig. 2.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure further provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the method described in the foregoing method embodiments, which may be specifically referred to in the foregoing method embodiments and are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK) or the like.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only one kind of logical division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-described embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present disclosure and shall be covered by it. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A method of neural network training, the method comprising:
acquiring a first image sample acquired in an upstream task, a first neural network to be trained in a downstream task, and a second neural network and an image generation network that are obtained by training based on the first image sample; the second neural network is used for feature extraction, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
respectively performing feature extraction on a new image generated by the image generation network according to the trained second neural network and the first neural network to be trained, so as to obtain a first image feature and a second image feature;
and training the first neural network to be trained based on the first image characteristic and the second image characteristic to obtain the trained first neural network.
2. The method of claim 1, wherein the image generation network is trained according to the steps of:
acquiring a first image output by a codebook generation network; the codebook generation network is configured to generate a codebook that decomposes the first image sample into a plurality of primitives;
inputting the first image into the image generation network to be trained to obtain a second image output by the image generation network;
determining a loss function value of the image generation network to be trained based on image similarity between the second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
3. The method of claim 2, wherein inputting the first image to the image generation network to be trained comprises:
covering a partial image area in the first image to obtain a covered first image;
and inputting the first image after the covering processing into the image generation network to be trained.
4. The method according to claim 2 or 3, wherein the codebook generation network comprises an encoder and a decoder, and the codebook generation network is trained according to the following steps:
repeatedly performing the following steps until the similarity between the image output by the decoder and the first image sample input into the encoder is greater than a preset threshold:
inputting the first image sample to an encoder to be trained to obtain a codebook output by the encoder; and inputting the codebook output by the encoder into a decoder to be trained to obtain an image output by the decoder.
5. The method of claim 4, wherein the first image output by the codebook generation network is obtained by:
inputting the first image sample into an encoder included in the codebook generation network to obtain a codebook output by the encoder;
and inputting the codebook output by the encoder into a decoder included in the codebook generation network to obtain the first image output by the decoder.
6. The method of claim 1, wherein the image generation network comprises a first generation sub-network and a second generation sub-network, the first generation sub-network being configured to generate a codebook that decomposes the first image sample into a plurality of primitives, and the second generation sub-network generating the new image based on an image output by the first generation sub-network; the image generation network is trained according to the following steps:
inputting the first image sample into a trained first generation sub-network to obtain a first image output by the first generation sub-network;
inputting the first image into a second generation sub-network to be trained to obtain a second image output by the second generation sub-network;
determining a loss function value of the image generation network to be trained based on a first image similarity between the first image and the input first image sample and a second image similarity between the second image and the first image;
and training the image generation network to be trained based on the loss function value to obtain the trained image generation network.
7. The method according to any one of claims 1 to 6, wherein the training the first neural network to be trained based on the first image feature and the second image feature to obtain a trained first neural network comprises:
determining a loss function value of the first neural network to be trained based on image similarity between the first image feature and the second image feature;
and under the condition that the loss function value corresponding to the current round is larger than a preset threshold value, adjusting the network parameter value of the first neural network based on the loss function value, and performing the next round of training according to the adjusted first neural network until the loss function value is smaller than or equal to the preset threshold value.
8. The method of any one of claims 1 to 7, wherein after the obtaining of the trained first neural network, the method further comprises:
acquiring a second image sample acquired in a downstream task;
and carrying out network training again on the trained first neural network based on the second image sample to obtain the finally trained first neural network.
9. The method of claim 8, wherein the network training the trained first neural network again based on the second image sample to obtain a final trained first neural network, comprising:
inputting the second image sample into the first neural network to obtain a task output result of the network;
determining a loss function value for the first neural network based on a comparison between the task output result and the task labeling result used to annotate the second image sample;
and performing network training on the first neural network again based on the loss function value to obtain the finally trained first neural network.
10. The method of any one of claims 1 to 9, wherein the second neural network is trained by:
acquiring an original neural network; the original neural network at least comprises a feature extraction layer;
performing feature extraction on the first image sample based on a feature extraction layer included in the original neural network to obtain image feature information output by the feature extraction layer;
adjusting the network parameter value of the feature extraction layer based on the image feature information to obtain an adjusted feature extraction layer;
and determining the original neural network containing the adjusted feature extraction layer as a second neural network obtained by training.
11. A method of target detection, the method comprising:
acquiring a target image acquired in a downstream task;
inputting the target image into a first neural network trained by the neural network training method of any one of claims 1 to 10, to obtain a detection result of a target object in the target image.
12. An apparatus for neural network training, the apparatus comprising:
the acquisition module is used for acquiring a first image sample acquired in an upstream task, a first neural network to be trained in a downstream task, and a second neural network and an image generation network that are obtained by training based on the first image sample; the second neural network is used for feature extraction, the image generation network is used for generating a new image, and the new image conforms to the overall distribution of the first image sample;
the extraction module is used for respectively extracting the characteristics of a new image generated based on the image generation network according to the second neural network obtained by training and the first neural network to be trained to obtain a first image characteristic and a second image characteristic;
and the training module is used for training the first neural network to be trained on the basis of the first image characteristics and the second image characteristics to obtain the trained first neural network.
13. An apparatus for object detection, the apparatus comprising:
the acquisition module is used for acquiring a target image acquired in a downstream task;
a detection module, configured to input the target image into a first neural network trained by using the neural network training method according to any one of claims 1 to 10, so as to obtain a detection result of a target object in the target image.
14. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is run, the machine-readable instructions when executed by the processor performing the steps of the method of neural network training of any one of claims 1 to 10 or the steps of the method of object detection of claim 11.
15. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the method of neural network training as set forth in any one of claims 1 to 10 or the steps of the method of object detection as set forth in claim 11.
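For orientation, the core training loop recited in claims 1 and 7 can be sketched as the teacher-student (feature-distillation) loop below. This is a minimal, non-authoritative illustration: the PyTorch identifiers, the sampling interface image_gen_net.sample(), and the use of cosine similarity as the image-feature similarity are all assumptions of the example, not part of the claims.

```python
import torch
import torch.nn.functional as F

def train_first_network(first_net, second_net, image_gen_net,
                        optimizer, threshold: float, max_rounds: int = 1000):
    # Sketch of claims 1 and 7: the trained second network (teacher) guides the
    # first network to be trained (student) on images from the generation network.
    second_net.eval()
    for _ in range(max_rounds):
        with torch.no_grad():
            new_image = image_gen_net.sample()   # new image conforming to the
                                                 # distribution of the first image sample
            first_feat = second_net(new_image)   # first image feature (teacher)
        second_feat = first_net(new_image)       # second image feature (student)
        # Loss function value from the similarity between the two image
        # features; cosine distance is an assumed choice of similarity.
        loss = 1.0 - F.cosine_similarity(
            first_feat.flatten(1), second_feat.flatten(1)).mean()
        if loss.item() <= threshold:             # stopping rule of claim 7
            break
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return first_net                             # the trained first neural network
```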
CN202210333676.7A 2022-03-30 2022-03-30 Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium Pending CN114648650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210333676.7A CN114648650A (en) 2022-03-30 2022-03-30 Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210333676.7A CN114648650A (en) 2022-03-30 2022-03-30 Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114648650A (en) 2022-06-21

Family

ID=81996274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210333676.7A Pending CN114648650A (en) 2022-03-30 2022-03-30 Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114648650A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142918A1 (en) * 2022-01-28 2023-08-03 华为云计算技术有限公司 Image processing method based on pre-trained large model, and related apparatus



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination