CN116563660A - Image processing method and related device based on pre-training large model

Image processing method and related device based on pre-training large model

Info

Publication number
CN116563660A
Authority
CN
China
Prior art keywords
image
training
target image
large model
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210109103.6A
Other languages
Chinese (zh)
Inventor
常建龙
张恒亨
陈鑫
史佳欣
王志宇
宁可
田奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN202210109103.6A priority Critical patent/CN116563660A/en
Priority to PCT/CN2023/070316 priority patent/WO2023142918A1/en
Publication of CN116563660A publication Critical patent/CN116563660A/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The application discloses an image processing method and a related device based on a pre-training large model. A feature image of a training image is obtained through a generation network, the resolution of the feature image being the same as that of the training image; the training image and the feature image are fused to obtain a target image; the target image is input into the pre-training large model to obtain a processing result; and, according to the processing result, the parameters of the generation network are updated while the parameters of the pre-training large model are kept unchanged. In the method, a generation network is configured for each downstream task, and during training only the parameters of the generation network are updated while the parameters of the pre-training large model remain fixed, which reduces the training overhead of deploying the pre-training large model on downstream tasks and lowers the difficulty of model training and iterative updating.

Description

Image processing method and related device based on pre-training large model
Technical Field
Embodiments of the application relate to the field of artificial intelligence, and in particular to an image processing method based on a pre-training large model and a related device.
Background
Pre-training large models have largely changed the situation in which traditional artificial intelligence (AI) models had to be specially designed for a single task scenario. By virtue of their huge scale and large amounts of computing resources, pre-training large models can mine knowledge from massive data, making it possible to serve many fragmented AI tasks with a general-purpose AI and ultimately alleviating the fragmentation problem encountered when AI is deployed in practice. As computing power grows and the capacity of large vision models keeps increasing, how to efficiently migrate a large vision model to downstream tasks has become an important topic in the field.
In the scenario where the pre-training large model is a large vision model, the mainstream approach is to fine-tune the parameters of the whole large vision model based on the training data of a downstream task, and to store the updated parameters of the large vision model so that it can be applied to the downstream task.
Since all parameters of the whole large vision model need to be fine-tuned independently for each downstream task, such fine-tuning methods incur a huge training overhead as the capacity of current large vision models keeps growing.
Disclosure of Invention
The embodiments of the application provide an image processing method based on a pre-training large model and a related device, which are used for reducing the training overhead of deploying the pre-training large model on downstream tasks.
In a first aspect, an embodiment of the present application provides an image processing method based on a pre-training large model, in which a corresponding generation network is configured for each downstream task. A training image is first input into the generation network, and a target image corresponding to the training image is generated by the generation network. The generation network is itself a neural network unit that can perform feature extraction and feature transformation according to the requirements of the downstream task, thereby generating prompt information specific to the downstream task (i.e., the target image).
After the target image is input into the pre-training large model, the pre-training large model performs image processing, and the processing result is obtained through a head classifier.
In the application, the parameters of the pre-training large model are frozen, that is, kept unchanged, while the parameters of the generation network are trained. After training of the generation network is completed, it is put into practical use for the downstream task.
Specifically, throughout the training process, the parameters of all neural network units other than the pre-training large model are updatable, including the parameters of the generation network, the up-sampling layer, the convolution layer, and the head classifier. In the method, a generation network is configured for each downstream task, and during training the parameters of the generation network are updated while the parameters of the pre-training large model are kept unchanged, which reduces the training overhead of deploying the pre-training large model on downstream tasks and lowers the difficulty of model training and iterative updating. On the other hand, because the generation network is trainable, the input image of the downstream task first passes through the generation network to obtain prompt information specific to the downstream task (i.e., the target image), which better fits the scenario of the downstream task, so that the whole image processing flow based on the pre-training large model has better generalization capability and adapts to the requirements of different downstream tasks. A minimal sketch of this frozen-backbone setup follows.
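As an illustration of this training setup, the following minimal PyTorch sketch shows how the pre-training large model can be frozen while the per-task units remain trainable. All module definitions here are toy placeholders introduced only for this example; the application does not prescribe these architectures, names, or hyper-parameters.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; the application does not prescribe concrete architectures.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # pre-training large model
gen_net = nn.Conv2d(3, 3, 3, padding=1)   # generation network (toy placeholder)
channel_conv = nn.Conv2d(6, 3, 1)         # convolution layer before the backbone
head = nn.Linear(16, 10)                  # head classifier

# Freeze every parameter of the pre-training large model.
for p in backbone.parameters():
    p.requires_grad_(False)

# Build the optimizer over the per-task trainable units only.
optimizer = torch.optim.AdamW(
    [*gen_net.parameters(), *channel_conv.parameters(), *head.parameters()],
    lr=1e-4,
)
```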
Based on the first aspect, in an optional implementation, after the training image is input into the generation network, the generation network may output a feature image of the training image. The feature image and the training image are then fused to obtain the target image. It should be noted that, to facilitate image fusion, the feature image should have the same resolution as the training image.
Based on the first aspect, in an optional implementation, the pre-training large model specifies the number of color channels of its input image, while image fusion often changes the number of color channels of the generated target image. In this case, a convolution layer can be added before the pre-training large model to perform convolution processing on the target image, so that the updated target image meets the pre-training large model's specification of the number of color channels before being input into the pre-training large model.
Based on the first aspect, in an optional implementation, the generation network may be a vision transformer (ViT) model, a convolutional neural network (CNN), or a recurrent neural network (RNN), or may be another trainable neural network model, which is not limited herein.
Further, after the training image is input into the generation network, the generation network may output a feature image of the training image. The feature image and the training image are then fused to obtain the target image. It should be noted that, to facilitate image fusion, the feature image should have the same resolution as the training image. In practical applications, if the resolution of the image output by the generation network is lower than that of the training image, the output can be up-sampled to obtain a feature image whose resolution matches the training image, for use in the subsequent image fusion.
Based on the first aspect, in an optional implementation, the generation network is built as a lightweight vision transformer (ViT) model consisting of a linear projection layer and three (L=3) stacked transformer layers. Specifically, after the training image is input into the generation network, the linear projection layer splits the training image into a series of local image blocks, additional positional encoding information is added to the image block embeddings, and the vision transformer model finally outputs a feature extraction result for the training image. Since the resolution of this feature extraction result is lower than that of the training image, the generation network further includes an up-sampling layer for up-sampling the feature extraction result output by the vision transformer model, thereby obtaining a feature image whose resolution matches the training image.
In the present application, the up-sampling means is not limited: the up-sampling may be performed by a deconvolution operation, or by other means such as bilinear interpolation or un-pooling, which is not specifically limited herein.
In a second aspect, an embodiment of the present application provides an image processing method based on a pre-training large model, including:
acquiring a target image of an input image through a generation network;
and inputting the target image into the pre-training large model to obtain a processing result.
Based on the second aspect, in an optional implementation, acquiring the target image of the input image through the generation network includes:
acquiring a feature image of the input image through the generation network, wherein the resolution of the feature image is the same as that of the input image;
and fusing the input image and the feature image to obtain the target image.
Since the information interaction and execution process of the embodiments shown in this aspect are based on the same concept as the embodiments shown in the first aspect, for the beneficial effects of this aspect, reference may be made to the description of the first aspect above; details are not repeated here.
Based on the second aspect, in an optional implementation, before inputting the target image into the pre-training large model to obtain the processing result, the method further includes:
performing convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
Based on the second aspect, in an optional implementation, the generation network includes a vision transformer (ViT) model and an up-sampling layer, and acquiring the feature image of the input image through the generation network includes:
extracting features of the input image through the ViT model to obtain a feature extraction result;
and inputting the feature extraction result into the up-sampling layer to obtain the feature image.
Based on the second aspect, in an optional implementation, the generation network is a convolutional neural network (CNN).
Based on the second aspect, in an optional implementation, the generation network is a recurrent neural network (RNN).
In a third aspect, an embodiment of the present application provides an image processing apparatus, including:
an acquisition unit, configured to acquire a target image of a training image through a generation network;
an input unit, configured to input the target image into a pre-training large model to obtain a processing result;
and an updating unit, configured to update the parameters of the generation network, according to the processing result, while keeping the parameters of the pre-training large model unchanged.
Based on the third aspect, in an optional implementation, the acquisition unit is specifically configured to:
acquire a feature image of the training image through the generation network, wherein the resolution of the feature image is the same as that of the training image;
and fuse the training image and the feature image to obtain the target image.
Based on the third aspect, in an optional implementation, the image processing apparatus further includes:
a convolution unit, configured to perform convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
Based on the third aspect, in an optional implementation, the acquisition unit is specifically configured to:
extract features of the training image through a ViT model to obtain a feature extraction result;
and input the feature extraction result into an up-sampling layer to obtain the feature image.
Based on the third aspect, in an optional implementation, the generation network is a convolutional neural network (CNN).
Based on the third aspect, in an optional implementation, the generation network is a recurrent neural network (RNN).
In a fourth aspect, an embodiment of the present application provides an image processing apparatus, including:
an acquisition unit, configured to acquire a target image of an input image through a generation network;
and an input unit, configured to input the target image into a pre-training large model to obtain a processing result.
Based on the fourth aspect, in an optional implementation, the acquisition unit is specifically configured to:
acquire a feature image of the input image through the generation network, wherein the resolution of the feature image is the same as that of the input image;
and fuse the input image and the feature image to obtain the target image.
Based on the fourth aspect, in an optional implementation, the image processing apparatus further includes:
a convolution unit, configured to perform convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
Based on the fourth aspect, in an optional implementation, the acquisition unit is specifically configured to:
extract features of the input image through a ViT model to obtain a feature extraction result;
and input the feature extraction result into an up-sampling layer to obtain the feature image.
Based on the fourth aspect, in an optional implementation, the generation network is a convolutional neural network (CNN).
Based on the fourth aspect, in an optional implementation, the generation network is a recurrent neural network (RNN).
In a fifth aspect, embodiments of the present invention provide a computer device comprising a memory, a communication interface, and a processor coupled to the memory and the communication interface; the memory is used for storing instructions, the processor is used for executing the instructions, and the communication interface is used for communicating with other devices under the control of the processor; wherein the processor, when executing the instructions, performs the method of any of the above aspects.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the method of any one of the above aspects.
In a seventh aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions which, when run on a computer, cause the computer to perform the method of any of the above aspects.
From the above technical solutions, the embodiments of the present application have the following advantages:
The application discloses an image processing method and a related device based on a pre-training large model. A feature image of a training image is obtained through a generation network, the resolution of the feature image being the same as that of the training image; the training image and the feature image are fused to obtain a target image; the target image is input into the pre-training large model to obtain a processing result; and, according to the processing result, the parameters of the generation network are updated while the parameters of the pre-training large model are kept unchanged. In the method, a generation network is configured for each downstream task, and during training only the parameters of the generation network are updated while the parameters of the pre-training large model remain fixed, which reduces the training overhead of deploying the pre-training large model on downstream tasks and lowers the difficulty of model training and iterative updating. On the other hand, because the generation network is trainable, the input image of the downstream task first passes through the generation network to obtain prompt information specific to the downstream task (i.e., the target image), so that the whole image processing flow based on the pre-training large model has better generalization capability and adapts to the requirements of different downstream tasks.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic structural diagram of an artificial intelligence main framework;
FIG. 2 is a flow chart of a fine-tuning method for a pre-training large model in the prior art;
FIG. 3 is a flow chart of another fine-tuning method for a pre-training large model in the prior art;
FIG. 4 is a flow chart of an image processing method based on a pre-training large model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of generating a target image through a transformer network in the present application;
FIG. 6 is a schematic diagram of a scenario in which a plurality of different downstream tasks share a pre-training large model in the present application;
FIG. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another image processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide an image processing method based on a pre-training large model and a related device, which are used for reducing the training overhead of deploying the pre-training large model on downstream tasks.
Embodiments of the present application will be described below with reference to the accompanying drawings. The terminology used in the description of the embodiments is for the purpose of describing particular embodiments only and is not intended to limit the application. As a person of ordinary skill in the art can appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
In the present application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be singular or plural.
The terms "first", "second", "third", "fourth" and the like (if any) in the specification, the claims, and the above drawings are used for distinguishing between similar objects and are not necessarily used for describing a particular order or sequence. It should be understood that the data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms "comprise", "include", "have" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product, or device comprising a list of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such process, method, product, or device.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an artificial intelligence main framework. The framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to processing, for example the general flow of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" condensation process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (provision and processing technology implementation) to the industrial ecology of the system.
(1) Infrastructure.
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and provides support through a base platform. It communicates with the outside through sensors; computing power is provided by smart chips, for example hardware acceleration chips such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC), and field programmable gate arrays (FPGA); the base platform includes a distributed computing framework, networks, and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, data obtained by sensors through external communication is provided to smart chips in the distributed computing system provided by the base platform for computation.
(2) Data.
Data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involve graphics, images, voice, and text, as well as Internet-of-Things data from traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing.
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training, and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to reasoning control strategies; typical functions are searching and matching.
Decision-making refers to the process of making decisions after reasoning over intelligent information, and generally provides functions such as classification, sorting, and prediction.
(4) General capabilities.
After the data have been processed as above, some general capabilities can be formed based on the results of the data processing, for example algorithms or general-purpose systems, such as translation, text analysis, computer vision processing, speech recognition, and image recognition.
(5) Intelligent products and industry applications.
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical deployment. The application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The image processing method based on the pre-training large model provided in the present application can be applied to scenarios including, but not limited to, the above examples. Specifically, the method can be applied to data processing approaches such as data training, machine learning, and deep learning, performing symbolized and formalized intelligent information modeling, extraction, preprocessing, and training on training data, and finally obtaining a trained neural network model (such as the target neural network model in the embodiments of the present application). The target neural network model can then be used for model inference: input data can be fed into the target neural network model to obtain output data.
Next, the pre-training large model is described.
Deep learning models keep growing deeper and wider, with ever more parameters. Traditional artificial intelligence (AI) models are basically small models trained for the requirements of a specific application scenario. Such a small model is trained with labeled data in a specific field and generalizes poorly: when moved to another application scenario it is often no longer applicable and needs to be retrained. In addition, training a traditional AI model involves a great deal of manual parameter adjustment and tuning and requires a large number of AI engineering professionals. Meanwhile, training a traditional AI model requires large-scale annotated data; if the amount of data in an application scenario is small, the accuracy of the trained model is not ideal.
Pre-training large models have largely changed the situation in which a network had to be specially designed for a single task scenario in traditional AI models. By virtue of their huge scale and large amounts of computing resources, pre-training large models can mine knowledge from massive data, making it possible to serve many fragmented AI tasks with a general-purpose AI and ultimately alleviating the fragmentation problem encountered when AI is deployed in practice. As computing power grows and the capacity of large vision models keeps increasing, how to efficiently migrate a large vision model to downstream tasks has become an important topic in the field.
Currently, there are mainly two fine-tuning methods for pre-training large models.
1: Referring to FIG. 2, FIG. 2 is a flow chart of a fine-tuning method for a pre-training large model in the prior art. As shown in FIG. 2, the parameters of both the pre-training large model and the head classifier are adjustable. The process performs end-to-end training based on the training data of the downstream task so as to fine-tune the parameters of the whole large vision model, and the updated parameters of the large vision model are stored for application to the downstream task.
Since all parameters of the whole large vision model need to be fine-tuned independently for each downstream task, such fine-tuning methods incur a huge training overhead as the capacity of current large vision models keeps growing. On the other hand, the amount of data for a downstream task is significantly smaller than that used for pre-training the large model, so fine-tuning the whole large model on the downstream task carries a serious risk of overfitting, and the generalization capability of the large model on the downstream task cannot be guaranteed.
2: Referring to FIG. 3, FIG. 3 is a flow chart of another fine-tuning method for a pre-training large model in the prior art. As shown in FIG. 3, the pre-training large model is frozen and does not participate in fine-tuning, while the parameters of the head classifier are adjustable. The process performs end-to-end training based on the training data of the downstream task to fine-tune the parameters of the head classifier.
In this fine-tuning method, only the parameters of the head classifier are fine-tuned on the downstream data, while the parameters of the pre-training large model are kept frozen. As a result, the pre-training large model cannot fully learn the downstream task, and the knowledge in the pre-training large model is difficult to migrate to the downstream task.
In view of this, the present application provides an image processing method based on a pre-training large model, for reducing the training overhead of deploying the pre-training large model on downstream tasks. Referring to FIG. 4, FIG. 4 is a flow chart of an image processing method based on a pre-training large model according to an embodiment of the present application. As shown in FIG. 4, the method includes:
101. Acquire a target image of a training image through a generation network.
In the embodiment of the present application, a corresponding generation network is configured for each downstream task. The training image is first input into the generation network, and a target image corresponding to the training image is generated by the generation network. The generation network is itself a neural network unit that can perform feature extraction and feature transformation according to the requirements of the downstream task, thereby generating prompt information specific to the downstream task (i.e., the target image).
In particular, the generation network may be a vision transformer (ViT) model, a convolutional neural network (CNN), or a recurrent neural network (RNN), or may be another trainable neural network model, which is not limited herein.
Further, after the training image is input into the generation network, the generation network may output a feature image of the training image. The feature image and the training image are then fused to obtain the target image. It should be noted that, to facilitate image fusion, the feature image should have the same resolution as the training image. In practical applications, if the resolution of the image output by the generation network is lower than that of the training image, the output can be up-sampled to obtain a feature image whose resolution matches the training image, for use in the subsequent image fusion. A short sketch of this fusion follows.
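To make the fusion step concrete, the following is a minimal PyTorch sketch. The shapes, the bilinear up-sampling choice, and the two fusion operators (addition and concatenation) are assumptions for illustration; the application does not fix a particular fusion operator.

```python
import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 224, 224)   # training image (B, C, H, W)
raw = torch.randn(1, 3, 112, 112)     # generation-network output at a lower resolution

# Up-sample so the feature image matches the training image's resolution.
feat = F.interpolate(raw, size=image.shape[-2:], mode="bilinear", align_corners=False)

# Two plausible fusion operators; the application leaves the operator open.
target_add = image + feat                     # element-wise addition: channel count unchanged
target_cat = torch.cat([image, feat], dim=1)  # concatenation: channel count becomes 6
```

Note that concatenation changes the channel count, which is exactly why the convolution layer described further below may be needed before the pre-training large model.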
For example, referring to FIG. 5, FIG. 5 is a schematic flow chart of generating a target image through a transformer network in the present application. As shown in FIG. 5, the generation network is built as a lightweight vision transformer (ViT) model consisting of a linear projection layer and three (L=3) stacked transformer layers. Specifically, after the training image is input into the generation network, the linear projection layer splits the training image into a series of local image blocks, additional positional encoding information is added to the image block embeddings, and the vision transformer model finally outputs a feature extraction result for the training image. Since the resolution of this feature extraction result is lower than that of the training image, the generation network further includes an up-sampling layer (such as the deconvolution layer in FIG. 5) for up-sampling the feature extraction result output by the vision transformer model, thereby obtaining a feature image whose resolution matches the training image.
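A minimal sketch of such a generation network is given below, assuming a 224x224 input, 16x16 patches, and an embedding width of 192; these hyper-parameters, and the class name itself, are illustrative assumptions rather than values from this application.

```python
import torch
import torch.nn as nn

class LightweightViTGenerator(nn.Module):
    """Sketch of the generation network described above: a linear projection
    (patch embedding), positional encodings, L=3 transformer layers, and a
    deconvolution up-sampling layer back to the input resolution."""

    def __init__(self, img_size=224, patch=16, dim=192, layers=3):
        super().__init__()
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        # Linear projection: split the image into patches and embed them.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Additional positional encodings added to the patch embeddings
        # (zero-initialized here purely for brevity).
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(enc, num_layers=layers)
        # Deconvolution up-sampling to the original resolution, 3 channels.
        self.upsample = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.proj(x).flatten(2).transpose(1, 2) + self.pos
        tokens = self.blocks(tokens)
        grid = tokens.transpose(1, 2).reshape(b, -1, h // self.patch, w // self.patch)
        return self.upsample(grid)  # feature image, same resolution as the input

feat = LightweightViTGenerator()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 3, 224, 224])
```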
In this application, the up-sampling means is not limited: the up-sampling may be performed by the deconvolution operation in FIG. 5, or by other means such as bilinear interpolation or un-pooling, which is not specifically limited herein.
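The following sketch illustrates how these up-sampling choices are interchangeable in practice (all shapes are assumed for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 192, 14, 14)  # low-resolution feature extraction result

# Deconvolution (transposed convolution), as in FIG. 5: 14x14 -> 224x224.
y_deconv = nn.ConvTranspose2d(192, 3, kernel_size=16, stride=16)(x)

# Bilinear interpolation followed by a 1x1 projection to 3 channels.
y_bilinear = nn.Conv2d(192, 3, kernel_size=1)(
    F.interpolate(x, scale_factor=16, mode="bilinear", align_corners=False))

# Max un-pooling: restores the resolution of a matching max-pooling step
# using the indices that the pooling layer recorded.
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
pooled, indices = pool(x)
y_unpool = nn.MaxUnpool2d(2, stride=2)(pooled, indices)  # back to 14x14
```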
Further, in practical applications, the pre-training large model specifies the number of color channels of its input image, while image fusion often changes the number of color channels of the generated target image. In this case, a convolution layer can be added before the pre-training large model to perform convolution processing on the target image, so that the updated target image meets the pre-training large model's specification of the number of color channels before being input into the pre-training large model.
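For example, assuming concatenation-based fusion doubled the channel count to 6 while the backbone expects 3-channel inputs, a 1x1 convolution can restore the expected channel count (a sketch; the kernel size and channel numbers are assumptions):

```python
import torch
import torch.nn as nn

# Fused target image whose channel count (6, after concatenation) no longer
# matches the 3 input channels assumed here for the pre-training large model.
target = torch.randn(1, 6, 224, 224)

channel_conv = nn.Conv2d(in_channels=6, out_channels=3, kernel_size=1)
updated_target = channel_conv(target)  # (1, 3, 224, 224): meets the channel specification
```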
102. Input the target image into the pre-training large model to obtain a processing result.
After the target image is input into the pre-training large model, the pre-training large model performs image processing, and the processing result is obtained through a head classifier.
103. According to the processing result, update the parameters of the generation network while keeping the parameters of the pre-training large model unchanged.
In the application, the parameters of the pre-training large model are frozen, that is, kept unchanged, while the parameters of the generation network are trained. After training of the generation network is completed, it is put into practical use for the downstream task. The model inference process is similar to steps 101 and 102 and is not repeated here.
Specifically, throughout the training process, the parameters of all neural network units other than the pre-training large model are updatable, including the parameters of the generation network, the up-sampling layer, the convolution layer, and the head classifier. In the method, a generation network is configured for each downstream task, and during training the parameters of the generation network are updated while the parameters of the pre-training large model are kept unchanged, which reduces the training overhead of deploying the pre-training large model on downstream tasks and lowers the difficulty of model training and iterative updating. On the other hand, because the generation network is trainable, the input image of the downstream task first passes through the generation network to obtain prompt information specific to the downstream task (i.e., the target image), which better fits the scenario of the downstream task, so that the whole image processing flow based on the pre-training large model has better generalization capability and adapts to the requirements of different downstream tasks. A single training iteration is sketched below.
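The sketch below assembles steps 101 to 103 into one training iteration. Every module here is a toy stand-in chosen so the snippet runs end to end; the actual generation network, pre-training large model, head classifier, and loss depend on the downstream task.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the components named above (assumptions, not the patent's models).
gen_net = nn.Conv2d(3, 3, 3, padding=1)                      # generation network
channel_conv = nn.Conv2d(6, 3, 1)                            # channel-adjusting convolution
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),     # frozen pre-training large model
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
head = nn.Linear(16, 10)                                     # head classifier

for p in backbone.parameters():                              # freeze the large model
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(
    [*gen_net.parameters(), *channel_conv.parameters(), *head.parameters()], lr=1e-4)

image = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))

feat = gen_net(image)                        # step 101: feature image, same resolution
target = torch.cat([image, feat], dim=1)     # step 101: fuse into the target image
target = channel_conv(target)                # match the backbone's channel specification
logits = head(backbone(target))              # step 102: frozen backbone + head classifier
loss = F.cross_entropy(logits, labels)

optimizer.zero_grad()
loss.backward()                              # step 103: gradients reach only trainable units
optimizer.step()
```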
Referring to FIG. 6, FIG. 6 is a schematic diagram of a scenario in which a plurality of different downstream tasks share a pre-training large model. As shown in FIG. 6, in the image processing method based on the pre-training large model, the parameters of the pre-training large model are shared among different downstream tasks; only a task-specific generation network needs to be trained for each downstream task, and no brand-new model needs to be trained per task, so that the pre-training large model with frozen parameters can adapt to different downstream tasks in actual scenarios.
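In such a deployment, the frozen backbone weights can be held once and shared, while each downstream task stores only its own small modules. A sketch (all names and architectures are illustrative):

```python
import torch.nn as nn

# A single frozen backbone shared by all downstream tasks.
shared_backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
for p in shared_backbone.parameters():
    p.requires_grad_(False)

# Each task stores and trains only its own lightweight modules; the backbone
# weights are never duplicated or modified.
per_task_modules = {
    "classification": {"gen_net": nn.Conv2d(3, 3, 3, padding=1),
                       "head": nn.Linear(16, 10)},
    "defect_detection": {"gen_net": nn.Conv2d(3, 3, 3, padding=1),
                         "head": nn.Linear(16, 2)},
}
```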
Table 1 below compares, across multiple dimensions, the traditional methods of fine-tuning a pre-training large model with the fine-tuning method of the present application:
TABLE 1
In order to better implement the above solutions of the embodiments of the present application, related devices for implementing the above solutions are also provided below. Specifically, referring to FIG. 7, FIG. 7 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. The image processing apparatus includes:
an acquisition unit 201, configured to acquire a target image of a training image through a generation network;
an input unit 202, configured to input the target image into a pre-training large model to obtain a processing result;
and an updating unit 203, configured to update the parameters of the generation network, according to the processing result, while keeping the parameters of the pre-training large model unchanged.
In one possible design, the acquisition unit 201 is specifically configured to:
acquire a feature image of the training image through the generation network, wherein the resolution of the feature image is the same as that of the training image;
and fuse the training image and the feature image to obtain the target image.
In one possible design, the image processing apparatus further includes:
a convolution unit 204, configured to perform convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
In one possible design, the acquisition unit 201 is specifically configured to:
extract features of the training image through a ViT model to obtain a feature extraction result;
and input the feature extraction result into an up-sampling layer to obtain the feature image.
In one possible design, the generation network is a convolutional neural network (CNN).
In one possible design, the generation network is a recurrent neural network (RNN).
It should be noted that the information interaction and execution processes between the modules/units in the image processing apparatus are based on the same concept as the method embodiment corresponding to FIG. 4 in the present application; for specific content, reference may be made to the description in the foregoing method embodiment, and details are not repeated here.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of another image processing apparatus according to an embodiment of the present application. The image processing apparatus includes:
an acquisition unit 301, configured to acquire a target image of an input image through a generation network;
and an input unit 302, configured to input the target image into a pre-training large model to obtain a processing result.
In one possible design, the acquisition unit 301 is specifically configured to:
acquire a feature image of the input image through the generation network, wherein the resolution of the feature image is the same as that of the input image;
and fuse the input image and the feature image to obtain the target image.
In one possible design, the image processing apparatus further includes:
a convolution unit 303, configured to perform convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
In one possible design, the acquisition unit 301 is specifically configured to:
extract features of the input image through a ViT model to obtain a feature extraction result;
and input the feature extraction result into an up-sampling layer to obtain the feature image.
In one possible design, the generation network is a convolutional neural network (CNN).
In one possible design, the generation network is a recurrent neural network (RNN).
It should be noted that the information interaction and execution processes between the modules/units in the image processing apparatus are based on the same concept as the method embodiment corresponding to FIG. 4 in the present application; for specific content, reference may be made to the description in the foregoing method embodiment, and details are not repeated here.
An embodiment of the present application also provides a computer device. Referring to FIG. 9, FIG. 9 is a schematic structural diagram of the computer device provided in the embodiment of the present application. The image processing apparatus described in the embodiment corresponding to FIG. 7 or FIG. 8 may be deployed on the computer device 400. Specifically, the computer device 400 is implemented by one or more servers and may vary considerably in configuration and performance. It may include one or more central processing units (CPU) 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444, where the memory 432 and the storage medium 430 may be transitory or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown), each of which may include a series of instruction operations on the computer device. Furthermore, the central processing unit 422 may be configured to communicate with the storage medium 430 and execute, on the computer device 400, the series of instruction operations in the storage medium 430.
The computer device 400 may also include one or more power supplies 426, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, and/or one or more operating systems 441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
It should be noted that the information interaction and execution processes between the modules/units in the image processing apparatus are based on the same concept as the method embodiment corresponding to FIG. 4 in the present application; for specific content, reference may be made to the description in the foregoing method embodiment, and details are not repeated here.
An embodiment of the present application also provides a computer program product which, when run on a computer, causes the computer to perform the method described in the embodiment shown in FIG. 4.
An embodiment of the present application also provides a computer-readable storage medium in which a program for signal processing is stored; when the program is run on a computer, it causes the computer to perform the method described in the embodiment shown in FIG. 4.
The image processing apparatus provided in the embodiment of the present application may specifically be a chip, the chip including a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or a circuit. The processing unit may execute the computer-executable instructions stored in a storage unit, so that the chip performs the method described in the embodiment shown in FIG. 4. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM), etc.
It should further be noted that the apparatus embodiments described above are merely schematic. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in the present application, the connection relationship between modules indicates that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. Generally, functions performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be varied, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is the preferred embodiment in most cases. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product. The software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the implementation may take the form of a computer program product, in whole or in part.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.

Claims (27)

1. An image processing method based on a pre-trained large model, comprising:
acquiring a target image of a training image through a generation network;
inputting the target image into a pre-training large model to obtain a processing result;
and updating the parameters of the generation network, according to the processing result, while keeping the parameters of the pre-training large model unchanged.
2. The method of claim 1, wherein the acquiring the target image of the training image via the generation network comprises:
acquiring a feature image of the training image through the generation network, wherein the resolution of the feature image is the same as that of the training image;
and fusing the training image and the feature image to obtain the target image.
3. The method according to claim 1 or 2, wherein before inputting the target image into the pre-training large model, the method further comprises:
performing convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
4. The method of claim 2, wherein the generation network comprises a vision transformer ViT model and an up-sampling layer, and wherein the acquiring the feature image of the training image through the generation network comprises:
extracting features of the training image through the ViT model to obtain a feature extraction result;
and inputting the feature extraction result into the up-sampling layer to obtain the feature image.
5. A method according to claim 1, 2 or 3, wherein the generation network is a convolutional neural network CNN.
6. A method according to claim 1, 2 or 3, wherein the generation network is a recurrent neural network RNN.
7. An image processing method based on a pre-trained large model, comprising:
acquiring a target image of an input image through a generation network;
and inputting the target image into a pre-training large model to obtain a processing result.
8. The method of claim 7, wherein the acquiring the target image of the input image through the generation network comprises:
acquiring a feature image of the input image through the generation network, wherein the resolution of the feature image is the same as that of the input image;
and fusing the input image and the feature image to obtain the target image.
9. The method according to claim 7 or 8, wherein before inputting the target image into a pre-trained large model to obtain a processing result, the method further comprises:
performing convolution processing on the target image to obtain an updated target image, wherein the updated target image is used to be input into the pre-training large model.
10. The method of claim 8, wherein the generation network comprises a vision transformer ViT model and an up-sampling layer, and wherein the acquiring the feature image of the input image through the generation network comprises:
extracting features of the input image through the ViT model to obtain a feature extraction result;
and inputting the feature extraction result to the up-sampling layer to obtain a feature image.
11. The method according to claim 7, 8 or 9, wherein the generation network is a convolutional neural network CNN.
12. The method of claim 7, 8 or 9, wherein the generation network is a recurrent neural network RNN.
13. An image processing apparatus, comprising:
an acquisition unit configured to acquire a target image of a training image through a generation network;
an input unit, configured to input the target image into a pre-training large model to obtain a processing result;
and an updating unit, configured to update the parameters of the generation network, according to the processing result, while keeping the parameters of the pre-training large model unchanged.
14. The image processing apparatus according to claim 13, wherein the acquisition unit is specifically configured to:
acquire a feature image of the training image through the generation network, wherein the resolution of the feature image is the same as that of the training image;
and fuse the training image and the feature image to obtain the target image.
15. The image processing apparatus according to claim 13 or 14, wherein the image processing apparatus further comprises:
a convolution unit configured to perform convolution processing on the target image to obtain an updated target image, the updated target image being used as the input to the pre-trained large model.
16. The image processing apparatus according to claim 14, wherein the generation network comprises a vision transformer (ViT) model and an up-sampling layer, and the acquisition unit is specifically configured to:
extract features of the training image through the ViT model to obtain a feature extraction result;
and input the feature extraction result to the up-sampling layer to obtain the feature image.
17. The image processing apparatus according to claim 13, 14 or 15, wherein the generation network is a convolutional neural network (CNN).
18. The image processing apparatus according to claim 13, 14 or 15, wherein the generation network is a recurrent neural network (RNN).
19. An image processing apparatus, comprising:
an acquisition unit configured to acquire a target image of an input image through a generation network;
and an input unit configured to input the target image into a pre-trained large model to obtain a processing result.
20. The image processing apparatus according to claim 19, wherein the acquisition unit is specifically configured to:
acquire a feature image of the input image through the generation network, wherein the resolution of the feature image is the same as that of the input image;
and fuse the input image and the feature image to obtain the target image.
21. The image processing apparatus according to claim 19 or 20, wherein the image processing apparatus further comprises:
a convolution unit configured to perform convolution processing on the target image to obtain an updated target image, the updated target image being used as the input to the pre-trained large model.
22. The image processing apparatus according to claim 20, wherein the generation network comprises a vision transformer (ViT) model and an up-sampling layer, and the acquisition unit is specifically configured to:
extract features of the input image through the ViT model to obtain a feature extraction result;
and input the feature extraction result to the up-sampling layer to obtain the feature image.
23. The image processing apparatus according to claim 19, 20 or 21, wherein the generation network is a convolutional neural network (CNN).
24. The image processing apparatus according to claim 19, 20 or 21, wherein the generation network is a recurrent neural network (RNN).
25. A computer device, comprising a processor and a memory, the processor being coupled to the memory, wherein:
the memory is configured to store a program; and
the processor is configured to execute the program in the memory, so as to cause the computer device to perform the method according to any one of claims 1 to 6, or to cause the computer device to perform the method according to any one of claims 7 to 12.
26. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6 or the method according to any one of claims 7 to 12.
27. A computer program product having computer-readable instructions stored therein which, when executed by a processor, implement the method according to any one of claims 1 to 6 or the method according to any one of claims 7 to 12.
CN202210109103.6A 2022-01-28 2022-01-28 Image processing method and related device based on pre-training large model Pending CN116563660A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210109103.6A CN116563660A (en) 2022-01-28 2022-01-28 Image processing method and related device based on pre-training large model
PCT/CN2023/070316 WO2023142918A1 (en) 2022-01-28 2023-01-04 Image processing method based on pre-trained large model, and related apparatus

Publications (1)

Publication Number Publication Date
CN116563660A (en) 2023-08-08

Family

ID=87470533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210109103.6A Pending CN116563660A (en) 2022-01-28 2022-01-28 Image processing method and related device based on pre-training large model

Country Status (2)

Country Link
CN (1) CN116563660A (en)
WO (1) WO2023142918A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557871B (en) * 2024-01-11 2024-03-19 子亥科技(成都)有限公司 Three-dimensional model labeling method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11386302B2 (en) * 2020-04-13 2022-07-12 Google Llc Systems and methods for contrastive learning of visual representations
CN113065635A (en) * 2021-02-27 2021-07-02 华为技术有限公司 Model training method, image enhancement method and device
CN113486162A (en) * 2021-06-04 2021-10-08 北京大学 Large-scale pre-training model fine-tuning method and device
CN113947196A (en) * 2021-10-25 2022-01-18 中兴通讯股份有限公司 Network model training method and device and computer readable storage medium
CN114120032A (en) * 2021-11-03 2022-03-01 奇酷软件(深圳)有限公司 Image classification method, system, storage medium and computer equipment for semi-supervised learning
CN114648650A (en) * 2022-03-30 2022-06-21 北京市商汤科技开发有限公司 Neural network training method, neural network training device, target detection method, target detection device, equipment and storage medium

Also Published As

Publication number Publication date
WO2023142918A1 (en) 2023-08-03


Legal Events

Date Code Title Description
PB01 Publication