CN116225699A - Resource allocation method and device, electronic equipment and storage medium - Google Patents

Resource allocation method and device, electronic equipment and storage medium

Info

Publication number
CN116225699A
CN116225699A (application CN202310107190.6A)
Authority
CN
China
Prior art keywords: gpu, target, drive, devices, driver
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310107190.6A
Other languages
Chinese (zh)
Inventor
曹恢龙
朱晓扬
李想成
赵增
刘柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202310107190.6A priority Critical patent/CN116225699A/en
Publication of CN116225699A publication Critical patent/CN116225699A/en
Pending legal-status Critical Current

Classifications

    • G06F 9/5044: Allocation of resources (e.g. of the CPU) to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering hardware capabilities
    • G06F 9/5072: Partitioning or combining of resources; grid computing
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a resource allocation method in the field of computer technology, comprising the following steps: determining target GPU driver allocation information, which indicates the demand for different GPU drivers; obtaining the currently loaded driver of each GPU device; determining, from the plurality of GPU devices and according to the target GPU driver allocation information and the currently loaded drivers, the target GPU devices whose drivers are to be converted and the target GPU driver each will be converted to; and converting the currently loaded driver of each target GPU device into the target GPU driver. With this method, the drivers loaded by multiple GPU devices in the same electronic device can be switched automatically on demand, so that GPU resources within one electronic device can be allocated flexibly as needed, improving the utilization efficiency of GPU resources in a single electronic device.

Description

Resource allocation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular to a resource allocation method and apparatus, an electronic device, and a storage medium.
Background
In a cluster technology such as K8S (Kubernetes, a resource orchestration system for managing containerized applications), GPU (Graphics Processing Unit) resources, that is, the computing power of a GPU, may be used on a node either through containers or through virtual machines. A cluster contains multiple nodes, each node contains multiple GPU devices, and each GPU device can be used only after loading the GPU driver corresponding to a container or a virtual machine, so GPU resources in a cluster are inherently complex to manage. How to allocate GPU resources in a cluster reasonably is therefore critical to the cluster's GPU utilization efficiency.
Existing cluster technology focuses on GPU resource allocation only at the granularity of whole nodes. Specifically, in the prior art, a single node is configured with only one type of driver, so that all GPU devices on that node run the same GPU driver type and all GPU resources on the node are used in a single mode, either containers or virtual machines.
However, when all GPU devices on a node are used in a single mode, the devices easily fall idle when workloads are insufficient or a service ends, wasting GPU resources. How to improve GPU resource utilization efficiency within a single node therefore becomes a new requirement.
Disclosure of Invention
The present application provides a resource allocation method and apparatus, an electronic device, and a storage medium, which allow multiple GPU devices within a single node to load different drivers and to switch driver types automatically on demand, so that GPU resources within a single node can be allocated flexibly as needed, improving GPU resource utilization efficiency.
A first aspect of the embodiments of the present application provides a resource allocation method applied to an electronic device, where the electronic device includes a plurality of GPU devices and at least two different GPU drivers are installed on each GPU device. The method includes:
determining target GPU resource ratio information, where the target GPU resource ratio information includes target GPU driver allocation information;
obtaining the currently loaded driver of each GPU device;
determining, from the plurality of GPU devices and according to the target GPU driver allocation information and the currently loaded driver of each GPU device, the target GPU devices whose drivers are to be converted and the target GPU driver each will be converted to;
and converting the currently loaded driver of each target GPU device into the target GPU driver.
A second aspect of the embodiments of the present application provides a resource allocation apparatus applied to an electronic device, where the electronic device includes a plurality of GPU devices and at least two different GPU drivers are installed on each GPU device. The apparatus includes:
a determining unit, configured to determine target GPU driver allocation information, where the target GPU driver allocation information indicates the demand for different GPU drivers;
an obtaining unit, configured to obtain the currently loaded driver of each GPU device;
the determining unit being further configured to determine, from the plurality of GPU devices and according to the target GPU driver allocation information and the currently loaded driver of each GPU device, the target GPU devices whose drivers are to be converted and the target GPU driver each will be converted to;
and a conversion unit, configured to convert the currently loaded driver of each target GPU device into the target GPU driver.
A third aspect of the embodiments of the present application provides an electronic device including a plurality of GPU devices, each GPU device having at least two different GPU drivers installed. The electronic device includes:
a memory and a processor, the memory and the processor being coupled;
the memory being configured to store one or more computer instructions;
the processor being configured to execute the one or more computer instructions to implement the method provided in the first aspect above.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more computer instructions, for use in an electronic device that includes a plurality of GPU devices with at least two different GPU drivers installed on each, the computer instructions being executable by a processor to implement the method provided in the first aspect above.
According to the technical solution provided in the embodiments of the present application, the target GPU driver allocation information, which indicates the demand for different GPU drivers, is determined first. The currently loaded driver of each GPU device is then obtained. Next, the target GPU devices whose drivers are to be converted, and the target GPU driver each will be converted to, are determined from the plurality of GPU devices according to the target GPU driver allocation information and the currently loaded drivers. Finally, the currently loaded driver of each target GPU device is converted into the target GPU driver.
This resource allocation method gives a node the ability to convert GPU device drivers automatically on demand. When the required GPU resource ratio in the node changes, the node automatically adjusts the drivers currently loaded by its GPU devices according to the determined target GPU resource ratio information, so that the actual ratio matches the preset ratio. Because the method can flexibly switch the drivers of GPU devices within a single node, GPU resources can be used flexibly through either containers or virtual machines, large numbers of GPU devices are prevented from sitting idle, and GPU resource utilization efficiency is improved.
Drawings
Fig. 1 is a schematic flowchart of a resource allocation method provided in an embodiment of the present application;
Fig. 2 is a flowchart of a method for determining a target GPU device according to the idle state of GPU devices, provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of a GPU resource application on a K8S node according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the allocation of GPU resources in a K8S node;
Fig. 5 is a schematic structural diagram of a resource allocation apparatus according to an embodiment of the present application;
Fig. 6 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the present application, the present application is described clearly and completely below with reference to the accompanying drawings of its embodiments. The described embodiments are only some, not all, of the embodiments of the present application; the application is not limited to the particular embodiments disclosed, and is intended to cover all embodiments falling within the scope of the appended claims.
It should be noted that the terms "first," "second," "third," and the like in the claims, specification, and drawings are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. Data so labeled may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises," "comprising," and their variants are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to it.
In a cluster technology such as K8S, multiple servers are used together in one cluster, and each server is called a node. The large pool of computing resources within a cluster provides the operating conditions for a large number of applications. GPU resources, as one kind of computing resource, are an indispensable condition for running GPU applications, and a GPU application may run either in a container or in a virtual machine. For these two different application environments, the GPU device must load the GPU driver corresponding to either the container or the virtual machine, so that the GPU resources of that device match the container or virtual-machine environment and can be allocated for use.
Because a cluster has multiple nodes, each node has multiple GPU devices, and different drivers must be loaded when GPU resources are configured, GPU resource management in a cluster is complicated. In the prior art, to simplify allocation, GPU resources are configured only in units of whole nodes: one driver type is configured for all GPU devices in a node, so all GPU resources on the node are used in a single mode, either containers or virtual machines, and overall GPU resources are managed by managing the driver types of the different nodes.
With this approach, however, the GPU resources of a node are allocated to only one kind of GPU application. If the running GPU applications need far fewer GPU resources than the node provides, or if they finish their work, the node's GPU devices fall idle, GPU resources are wasted, and utilization is low. How to improve GPU resource utilization efficiency within a single node therefore becomes a new requirement.
To meet this requirement, the present application provides a resource allocation method and apparatus, an electronic device, and a storage medium. The method automatically converts the driver types of some GPU devices according to target GPU driver allocation information that indicates the demand for each GPU driver, so that the number of GPU devices loading each target GPU driver matches the target allocation information. GPU resources on a single device can then be allocated flexibly on demand and used efficiently. The method, apparatus, electronic device, and computer-readable medium of the present application are described in further detail below with reference to specific embodiments and the accompanying drawings.
The resource allocation method provided in the present application is described below with reference to fig. 1. Note that the steps illustrated in the flowchart may be performed in a computer system as a set of computer-executable instructions and, in some cases, in a logical order different from the one illustrated.
As shown in fig. 1, the resource allocation method includes steps S101-S104, as follows.
The method may be applied to multiple node devices in a cluster technology such as K8S, to a specific device in a cluster, or to a single electronic device only; no limitation is imposed here.
The method is applied to an electronic device that includes a plurality of GPU devices, with at least two different GPU drivers installed on each GPU device. The conversion process switches between installed GPU drivers; that is, the target GPU driver is already installed on the GPU device. Once the GPU drivers to be used have been installed on each device, steps S101-S104 can be executed at any time to begin converting GPU drivers on demand.
S101: determine target GPU driver allocation information.
First, the target GPU driver allocation information is determined in this step. GPU driver allocation information describes the drivers loaded by the devices, that is, the GPU driver types and the number of GPU devices loading each driver. The target GPU driver allocation information indicates the demand for the different GPU drivers, that is, the GPU driver types to convert to and the corresponding quantities.
For example, suppose the electronic device contains 8 GPU devices, each with driver A and driver B installed, and the user wants 3 devices loading driver A and 5 devices loading driver B. The target GPU driver allocation information then specifies 3 GPU devices for driver A and 5 GPU devices for driver B.
The target GPU driver allocation information includes at least the target GPU drivers and the quantity corresponding to each. A target GPU driver is a GPU driver type to be converted to, and its corresponding quantity indicates how many GPU devices should load it.
Specifically, the target GPU driver allocation information contains either the number of each target GPU driver or the ratio between target GPU drivers. Number information gives specific counts, for example, 3 GPU devices loading driver A and 5 loading driver B. Ratio information gives the proportion between target GPU drivers, for example a 1:1 ratio of driver A to driver B; the specific counts then depend on the total number of GPU devices. With 8 GPU devices in total, 4 load A and 4 load B; with 16 devices, 8 load A and 8 load B.
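As a sketch of how ratio information might be resolved into concrete device counts as described above (the function name and data shapes are illustrative assumptions, not from the patent):

```python
def resolve_target_counts(ratio, total_devices):
    """Turn a driver ratio, e.g. {"A": 1, "B": 1}, into concrete
    device counts for a given total number of GPU devices."""
    weight_sum = sum(ratio.values())
    if total_devices % weight_sum != 0:
        raise ValueError("total devices not evenly divisible by the ratio")
    unit = total_devices // weight_sum
    return {driver: weight * unit for driver, weight in ratio.items()}

# A 1:1 ratio over 8 devices yields 4 loading driver A and 4 loading B;
# over 16 devices it yields 8 and 8, matching the example in the text.
print(resolve_target_counts({"A": 1, "B": 1}, 8))   # {'A': 4, 'B': 4}
print(resolve_target_counts({"A": 1, "B": 1}, 16))  # {'A': 8, 'B': 8}
```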
After the target GPU driver allocation information is determined, the driver currently loaded by each GPU device must be obtained, as described in step S102.
S102: obtain the currently loaded driver of each GPU device.
Obtaining the currently loaded driver of each GPU device means obtaining the type of driver each GPU device currently loads. Note that when the currently loaded drivers are obtained, information about each GPU device is obtained at the same time, and the number of devices loading each driver can then be counted. The purpose of this step is to obtain the current GPU driver allocation information, so it is not limited to reading each device's driver type; it should be understood as simultaneously gathering GPU device information that can be summarized in various ways.
There are several ways to obtain the currently loaded driver of each GPU device. For example, the loaded driver may be read from each GPU device's current information and the devices then classified, yielding the current GPU driver allocation information. Alternatively, each GPU driver may be used as an index value for a traversal search over all GPU devices, yielding the number of devices loading each driver and the information of those devices. Other methods of obtaining the currently loaded drivers are also possible and are not limited here.
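The classification variant above can be sketched as follows; the device-to-driver mapping stands in for whatever per-device query a real node would perform:

```python
from collections import defaultdict

def group_devices_by_driver(devices):
    """Classify GPU devices by their currently loaded driver.

    `devices` is an illustrative mapping of device id -> loaded driver
    name; a real implementation would query each device instead."""
    groups = defaultdict(list)
    for device_id, driver in devices.items():
        groups[driver].append(device_id)
    # Plain dict: driver type -> list of devices loading it; the list
    # lengths give the current GPU driver allocation information.
    return dict(groups)

current = {"a": "A", "b": "A", "c": "A", "d": "B",
           "e": "B", "f": "B", "g": "B", "h": "B"}
print(group_devices_by_driver(current))
# {'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f', 'g', 'h']}
```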
After the currently loaded driver of each GPU device is obtained, step S103 determines the target GPU devices whose drivers will be converted and the target GPU driver each will be converted to.
S103: determine, from the plurality of GPU devices and according to the target GPU driver allocation information and the currently loaded driver of each GPU device, the target GPU devices for driver conversion and the target GPU driver each will be converted to.
Once the target GPU driver allocation information and the currently loaded driver of each GPU device are known, the drivers to be converted away from, the target GPU drivers to convert to, and the number of conversions required can be determined from the current driver distribution so as to satisfy the target allocation. After the target drivers and required quantities are determined, the target GPU devices for driver conversion, and the target GPU driver each will be converted to, are selected from the GPU devices.
Since the conversion must increase the number of devices loading each target GPU driver, the process is specifically: compare the target GPU driver allocation information with the currently loaded driver of each GPU device to determine the target GPU drivers that need more devices and the number required; then, according to those drivers and quantities, select the target GPU devices for conversion and the target GPU driver each will be converted to. Note that once a target GPU device is chosen, the GPU driver it converts to is also fixed.
For example, there are 8 GPU devices in total, and step S102 reveals that 3 devices currently load driver A (a, b, c) and 5 load driver B (d, e, f, g, h). If the target GPU driver allocation information indicates a 1:1 ratio between devices loading driver A and devices loading driver B, then one device currently loading driver B must be converted to driver A. Here driver B is the driver converted away from, driver A is the target GPU driver, and the target GPU device is whichever device among d-h is selected for conversion.
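The selection in this example can be mirrored in a few lines; the device names and the pick-the-first selection policy are illustrative assumptions:

```python
# Mirror of the example in the text: 8 devices, a-c load driver A,
# d-h load driver B, and the target is a 1:1 ratio (4 of each).
current = {"a": "A", "b": "A", "c": "A", "d": "B",
           "e": "B", "f": "B", "g": "B", "h": "B"}
target = {"A": 4, "B": 4}

loaded_a = [d for d, drv in current.items() if drv == "A"]
loaded_b = [d for d, drv in current.items() if drv == "B"]

# One more device must load driver A, so one device loading B
# becomes the target GPU device (here simply the first found).
deficit_a = target["A"] - len(loaded_a)
targets = loaded_b[:deficit_a]
print(targets)  # ['d']
```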
The purpose of comparing the target GPU driver allocation information with the currently loaded drivers is to obtain the difference information and, according to it, transform the current driver distribution into the distribution indicated by the target allocation information. The target allocation may be satisfied either completely or gradually.
Satisfying it completely means a single determination selects enough target GPU devices, and their target GPU drivers, for one round of conversion to satisfy the target allocation information.
Satisfying it gradually means the target GPU devices and their target drivers are determined over several determination rounds; that is, the drivers of some GPU devices are converted several times until the target allocation information is satisfied.
Satisfying the target allocation gradually can be a process of reducing the difference step by step. Specifically, the difference between the currently running driver of each GPU device and the target GPU driver allocation information is determined, and to reduce that difference, target GPU devices and their target drivers are selected from the GPU devices.
Illustratively, there are 8 GPU devices in total, currently 4 loading driver A and 4 loading driver B, and the target allocation requires 6 devices loading driver A and 2 loading driver B. To satisfy the target completely, 2 target GPU devices loading driver B must be selected in one determination, so that enough devices are converted at once.
To satisfy the target gradually, one device loading driver B is determined and converted to driver A, and a further target device is then determined from the 3 remaining devices loading driver B. In this process the difference information is that 2 more devices should load driver A and 2 fewer should load driver B, so to reduce the difference, 2 devices loading driver B must be converted to driver A.
When determining the target GPU devices and their target drivers, the target GPU driver allocation information may be compared with current allocation information derived from the currently loaded drivers, yielding for each driver the difference between its required quantity and the number of devices currently loading it; target devices and drivers are then chosen according to this difference information. Alternatively, the current count for each GPU driver and the required count from the target allocation information may be subtracted directly, and the target devices and drivers determined from the resulting differences, which are in essence the same difference information. Other ways of obtaining the difference information also apply to this embodiment and are not limited here.
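A minimal sketch of the difference-information computation, under the assumption that current state is a device-to-driver mapping (names are illustrative):

```python
from collections import Counter

def allocation_difference(target_counts, current_drivers):
    """Per-driver difference between the target allocation and the
    current one: positive means that many more devices must be
    converted to that driver, negative means that many are surplus."""
    current_counts = Counter(current_drivers.values())
    drivers = set(target_counts) | set(current_counts)
    return {d: target_counts.get(d, 0) - current_counts.get(d, 0)
            for d in sorted(drivers)}

# The example from the text: 4 devices each on A and B, target 6:2.
current = {f"gpu{i}": ("A" if i < 4 else "B") for i in range(8)}
print(allocation_difference({"A": 6, "B": 2}, current))
# {'A': 2, 'B': -2} -> convert 2 devices from driver B to driver A
```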
Besides determining the target GPU drivers and required quantities through difference information, other methods may be used, such as condition functions or modular judgments. Illustratively, there are 8 GPU devices, currently 4 loading driver A and 4 loading driver B, and the target allocation requires 6 devices loading driver A and 2 loading driver B. The devices loading driver A are counted by traversal; after all devices are traversed the count is 4, which does not trigger the condition set from the target allocation information, A = 6. Devices loading driver B are then converted to driver A one by one; after 2 successful conversions the condition A = 6 is triggered, indicating the conversion is complete, and counting and conversion stop. Alternatively, the devices loading driver B may be counted by traversal; when the 2nd such device is reached, the condition B = 2 is triggered, counting stops, and all remaining uncounted devices loading driver B are converted to driver A.
The distinction from the difference-information method is only one of implementation: these methods determine the target GPU drivers and required quantities without explicitly computing a difference through comparison. They still embody the same idea of reducing the difference and fall within the scope of the embodiments described here.
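The condition-triggered variant above can be sketched as follows; function and callback names are illustrative assumptions:

```python
def convert_until_condition(devices, from_driver, to_driver,
                            target_count, convert):
    """Convert devices loading `from_driver` one by one until the
    number of devices loading `to_driver` reaches `target_count`,
    i.e. the condition-function variant described in the text."""
    count = sum(1 for drv in devices.values() if drv == to_driver)
    for dev, drv in list(devices.items()):
        if count >= target_count:   # condition triggered: stop converting
            break
        if drv == from_driver:
            convert(dev, to_driver)  # stand-in for the real driver switch
            devices[dev] = to_driver
            count += 1
    return devices

devices = {f"g{i}": ("A" if i < 4 else "B") for i in range(8)}
result = convert_until_condition(devices, "B", "A", 6, lambda d, t: None)
print(sum(1 for v in result.values() if v == "A"))  # 6
```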
After the target GPU devices and their target drivers are determined, step S104 converts the drivers loaded by the target GPU devices, bringing the allocation of GPU resources into line with the demand.
S104: convert the currently loaded driver of each target GPU device into the target GPU driver.
Once the target GPU devices and their target drivers are determined, the driver currently loaded by each target GPU device is converted into the target GPU driver. Specifically, the loaded GPU driver is replaced with the target GPU driver, or the GPU driver is reloaded on the target GPU device, so that the device loads the target driver and the conversion completes.
Note that a GPU device may comprise multiple sub-devices; when converting its GPU driver, the drivers of all its sub-devices must be converted.
In addition, using the GPU resources of a GPU device means allocating its computing power to GPU applications. If those resources remain available for allocation, the device may stay in an allocatable state, in which case its loaded driver cannot be replaced successfully and the driver conversion fails. Therefore: before the currently loaded driver of a target GPU device is converted into the target GPU driver, the allocation of GPU resources corresponding to each GPU device is suspended; after the conversion completes, the allocation of GPU resources corresponding to each GPU device is re-enabled.
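The suspend-convert-resume sequence can be sketched with illustrative callbacks standing in for the node's scheduler and driver-management hooks (none of these names come from the patent):

```python
def convert_with_allocation_paused(target_devices, target_driver,
                                   pause, convert, resume):
    """Suspend GPU resource allocation, switch the drivers of the
    target devices, then re-enable allocation so conversion cannot
    race against resources being handed out."""
    pause()                      # stop handing out GPU resources
    try:
        for device in target_devices:
            convert(device, target_driver)
    finally:
        resume()                 # always re-enable allocation

log = []
convert_with_allocation_paused(
    ["d", "e"], "A",
    pause=lambda: log.append("paused"),
    convert=lambda dev, drv: log.append(f"{dev}->{drv}"),
    resume=lambda: log.append("resumed"))
print(log)  # ['paused', 'd->A', 'e->A', 'resumed']
```

Wrapping the resume call in `finally` keeps allocation from staying suspended if a single conversion raises.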
Through steps S101-S104, GPU drivers loaded by a plurality of GPU devices can be converted according to requirements, so that the distribution condition of the GPU drivers is consistent with the expected distribution, each GPU device can be used by an application corresponding to the GPU drivers, and therefore flexible distribution of GPU resources is achieved.
The method is a basic method of the embodiment of the application, and on the basis of the method, corresponding changes can be performed according to different scene requirements. The above method is described in more detail below in connection with specific scenarios.
In the above method, there are many specific ways to determine the target GPU driver allocation information. This embodiment provides two: the first determines the target GPU driver allocation information according to a user's setting instruction, and the second determines it according to the amount of GPU resources required by the application to be run. These two methods are described below.
The method comprises the following steps: and responding to the drive setting instruction, determining target GPU drive allocation information, wherein the drive setting instruction comprises the requirement conditions of different GPU drives.
The user may set the allocation of GPU drivers within the electronic device. Specifically, the user may input the quantity information or the proportioning information of the target GPU drivers through a human-computer interaction function on the electronic device, or may set this information on another device, which then sends a setting instruction to the electronic device through an information sending function. The specific setting method may vary with the human-computer interaction mode and is not limited here.
And after the electronic equipment receives the drive setting instruction, responding to the drive setting instruction, and acquiring target GPU drive allocation information indicated by the drive setting instruction. And when the information is acquired, the requirements of different GPU drivers are acquired from the driver setting instruction, and the information is used as target GPU driver allocation information.
For example, the electronic device has 8 GPU devices in total, loading driver A and driver B respectively. The user sets the GPU resources on the graphical user interface of the electronic device and inputs A=2, B=6. After obtaining the user's drive setting instruction, the electronic device determines the target GPU driver allocation information as: 2 GPU devices loading driver A and 6 GPU devices loading driver B. Alternatively, the user issues a drive setting instruction of A=2, B=6 to the target node M through a node management apparatus; when the target node M recognizes the drive setting instruction from the received information, the target GPU driver allocation information is determined accordingly.
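A minimal sketch of turning such a setting instruction into allocation information follows. The "A=2,B=6" text format and the function name are hypothetical; the patent does not prescribe an instruction encoding.

```python
def parse_drive_setting(instruction):
    """Parse a drive setting instruction such as "A=2,B=6" into
    target GPU driver allocation information: {driver: device count}."""
    allocation = {}
    for item in instruction.split(","):
        driver, count = item.split("=")       # e.g. "A" and "2"
        allocation[driver.strip()] = int(count)
    return allocation
```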
The second method is as follows: and determining target GPU driving allocation information according to the GPU resource demand of the application to be operated and the current GPU resources.
This method determines the target GPU driver allocation information according to the requirements of the application to be run on GPU resources. First, it is judged, based on the GPU resource demand of the application to be run and the current GPU resources, whether the current GPU resources meet that demand; the target GPU driver allocation information is then determined according to the result.
For example, the GPU devices in the electronic device are all of the same model, and there are two applications to be run, application a and application b. Application a requires a GPU device loading driver A, and normally running application a requires 50% of the computing power resources of one GPU device. Application b requires a GPU device loading driver B, and normally running application b requires 30% of the computing power resources of one GPU device. Currently there are 3 GPU devices loaded with driver A, and only one of them has 20% of its computing power resources free; there are 5 GPU devices loaded with driver B, of which 2 are entirely unoccupied. It can therefore be determined that the GPU resources of the devices loading driver A cannot meet the demand of the application to be run, while the GPU resources of the devices loading driver B are more than sufficient. Accordingly, target GPU driver allocation information is determined under which a GPU device loaded with driver B is converted to load driver A.
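The comparison in this example can be sketched as below. This is an assumed scoring scheme, not the patent's: compute power is expressed as a fraction where 1.0 equals one whole idle device, and a driver only donates a device when it still has a whole device to spare after its own demand.

```python
def plan_conversions(demand, free):
    """demand / free: driver name -> compute fraction required / available
    (1.0 == one whole idle GPU device).
    Returns (source driver, target driver) pairs: a device loading the
    source driver should be converted to load the target driver."""
    short = [d for d, need in demand.items() if free.get(d, 0.0) < need]
    spare = [d for d, have in free.items()
             if have - demand.get(d, 0.0) >= 1.0]
    return [(src, dst) for dst in short for src in spare if src != dst]
```

With the figures from the example above (driver A: 0.5 demanded, 0.2 free; driver B: 0.3 demanded, 2.0 free), a B device is converted to A.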
In addition to these two methods, the target GPU driver allocation information may be determined in other ways, for example by setting fixed automatic switching rules, such as a threshold for the minimum number of devices loading a certain driver, or a threshold for the minimum ratio between two drivers. The specific setting may be modified according to different requirements and is not described further here.
In addition, in step S103 of the basic method described above, when determining, from the GPU devices, the target GPU device for performing driver conversion and the target GPU driver into which it is converted according to the target GPU driver to be added and its required quantity, different determination methods may be set according to different requirements. This embodiment provides a method for determining the target GPU device according to the idle state of the GPU devices. Referring to fig. 2, fig. 2 is a flow chart of this method; the specific steps S201-S202 are as follows.
S201, determining at least one first candidate GPU device in an idle state from a plurality of GPU devices.
First, an idle state of each GPU device is determined, where the idle state indicates that the GPU device is unoccupied, or that the GPU device is not assigned a GPU application.
If there are GPU devices in the idle state, all of them are determined as first candidate GPU devices, and step S202 is executed. If there is no GPU device in the idle state, a prompt message indicating insufficient idle devices is sent, or the method waits until a GPU device becomes idle, determines it as a first candidate GPU device, and then executes step S202.
S202, determining target GPU equipment used for carrying out drive conversion and target GPU drive converted by the target GPU equipment from first candidate GPU equipment according to target GPU drive and demand to be increased.
After the first candidate GPU device is determined, a target GPU device for performing drive conversion and a target GPU drive converted by the target GPU device may be determined from the first candidate GPU device according to the target GPU drive and the demand to be increased. That is, the determination of the target GPU device is based on the type of target GPU driver and its corresponding number.
The method for determining the target GPU device differs depending on whether the number of idle GPU devices can satisfy the required quantity of the target GPU driver to be added.
If the number of the first candidate GPU devices is greater than or equal to the required quantity of the target GPU driver to be added, the GPU devices in the idle state are sufficient to meet the conversion requirement, and the required number of GPU devices is selected from them for driver conversion. Specifically: if the number of the first candidate GPU devices is greater than or equal to the required quantity of the target GPU driver to be added, the required number of GPU devices is selected from the first candidate GPU devices as the target GPU devices for driver conversion.
If the number of the first candidate GPU devices is smaller than the required quantity of the target GPU driver to be added, the GPU devices in the idle state cannot meet the conversion requirement. In that case, to reduce the difference, all the GPU devices in the idle state may be converted first, and further conversion performed as new devices become idle. The method comprises: if the number of the first candidate GPU devices is smaller than the required quantity of the target GPU driver to be added, determining each first candidate GPU device as a target GPU device, and, whenever a second candidate GPU device in an idle state is detected, determining it as a target GPU device, until the number of target GPU devices equals the required quantity of the target GPU driver to be added.
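Both branches of S202 reduce to the same selection rule, sketched below under assumed names: take as many idle devices as are needed now, and record how many conversions must wait for second candidate devices to appear.

```python
def select_targets(idle_devices, required):
    """S201-S202 sketch: take up to `required` idle devices as conversion
    targets; any shortfall is deferred until further devices become idle."""
    targets = list(idle_devices)[:required]
    deferred = required - len(targets)   # conversions waiting for second
    return targets, deferred             # candidate devices to appear
```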
In addition to determining the target GPU device according to the idle state of the GPU devices, the target GPU device may also be determined according to their usage. For example, if a GPU device has been assigned a GPU application that occupies its computing power but has remained in a closed, non-running state for longer than a time threshold, the GPU application may be deleted and the GPU device determined as a target GPU device. Many other ways of determining the target GPU device are also possible and are not specifically limited here.
In addition, on the basis of the above methods, a function of automatically allocating GPU resources may be set for the electronic device. For example, when an electronic device has 8 GPU devices in total, all 4 GPU devices loading GPU driver A are occupied, and none of the 4 GPU devices loading GPU driver B is in use, the GPU resource allocation is unreasonable and the devices loading GPU driver A risk being insufficient; at this time, some or all of the GPU devices loading GPU driver B are automatically converted to load driver A.
The case of converting all GPU devices loaded with GPU driver B into loading driver A can be summarized as follows: determine each first GPU device currently loaded with a first GPU driver and each second GPU device currently loaded with a second GPU driver; when every second GPU device is in a non-idle state and idle devices exist among the first GPU devices, convert the GPU driver currently loaded by the idle first GPU devices into the second GPU driver. Correspondingly, when only part of the GPU devices are converted, the number of converted devices may be reduced according to the different modes of the method outlined above.
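The "convert all" case above can be sketched as follows; the function name and the representation of occupancy as a set of busy device ids are illustrative assumptions.

```python
def auto_rebalance(devices, busy):
    """devices: device id -> loaded driver; busy: ids of occupied devices.
    If every device of some driver is busy while other drivers have idle
    devices, convert those idle devices to the exhausted driver."""
    by_driver = {}
    for dev, drv in devices.items():
        by_driver.setdefault(drv, []).append(dev)
    exhausted = [drv for drv, devs in by_driver.items()
                 if all(d in busy for d in devs)]
    for drv in exhausted:                   # e.g. driver A fully occupied
        for other, devs in by_driver.items():
            if other == drv:
                continue
            for dev in devs:
                if dev not in busy:         # idle device loading driver B
                    devices[dev] = drv      # convert it to driver A
    return devices
```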
The above embodiments describe the resource allocation method in detail. The method can be applied to different application scenarios; the specific implementation process of applying it in a K8S cluster environment is described in detail below.
First, it should be clear that the present method is applied to a single node of the K8S cluster, i.e. a single server. Each node of the K8S cluster has multiple GPU devices, and each GPU device can run a GPU application in a container or a GPU application in a virtual machine by loading a GPU container driver or a GPU virtual machine driver, respectively.
As shown in fig. 3, fig. 3 is a schematic diagram of GPU resource applications on a K8S node. For the requirements of a GPU container application, the GPU device loads a driver on the host, such as the NVIDIA GPU Driver, which is called a GPU container driver in this embodiment; the host's library (Library) is shared into the container environment, so that GPU resources can be used by the GPU container application. For the requirements of a GPU virtual machine application, the GPU device loads a GPU virtual machine driver (VFIO-PCI) and is passed directly into the virtual machine over PCI (Peripheral Component Interconnect) in PCI passthrough mode; a corresponding GPU driver is then installed inside the virtual machine, so that GPU resources can be used by the GPU virtual machine application.
Thus, on a single node of the K8S cluster, a GPU container driver and the GPU virtual machine driver VFIO-PCI need to be installed for each GPU device. Specifically, the GPU Driver and the VFIO-PCI driver may both be deployed on the host, and each GPU device loads one of them as required.
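The patent does not specify the host-side mechanism for switching a device between the two drivers. On a Linux host, the standard kernel interface for this is the sysfs unbind / `driver_override` / reprobe sequence; the sketch below assumes that mechanism and only plans the writes (executing them requires root).

```python
from pathlib import Path

SYSFS = Path("/sys/bus/pci")

def plan_rebind(bdf, new_driver):
    """Return the (sysfs path, value) writes that rebind PCI device `bdf`
    (e.g. "0000:3b:00.0") to `new_driver` on a Linux host: unbind from the
    current driver, set driver_override, then reprobe."""
    return [
        (str(SYSFS / "devices" / bdf / "driver" / "unbind"), bdf),
        (str(SYSFS / "devices" / bdf / "driver_override"), new_driver),
        (str(SYSFS / "drivers_probe"), bdf),
    ]

def rebind(bdf, new_driver):
    """Apply the planned writes (requires root; error handling omitted)."""
    for path, value in plan_rebind(bdf, new_driver):
        Path(path).write_text(value)
```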
In addition, when the resource allocation method is used to manage and allocate GPU container resources and GPU virtual machine resources in a K8S cluster, the running environment needs to be configured for normal use of containers and virtual machines, i.e. an initialization configuration process. Specifically, a GPU mounting component is installed on the node to mount GPU devices into containers, and a container engine, such as Docker or containerd, is configured to complete the container environment configuration. The node also enables the IOMMU (Input/Output Memory Management Unit) function, which maps virtual memory addresses to physical memory addresses, and loads the vfio-pci kernel module to complete the virtual machine environment configuration.
When the above resource allocation method is used, a data volume ConfigMap may be created in an initial state, where the data volume exists in a file form and is used to store the target GPU driver allocation information. The ConfigMap may include only one data volume, so that GPU driving ratio information of all nodes is consistent, or may include data volumes corresponding to each node, so as to configure GPU driving allocation information of each node respectively.
Furthermore, a GPU driver conversion module, such as a GPU-driver-controller, may also be created. The GPU driving conversion module is a program module, and is configured to implement the resource allocation method provided by the basic embodiment. The GPU driver conversion module may be deployed for each GPU node in the form of a daemon set (daemon set). Specifically, when a node is newly added, the daemon set adds a GPU driver translation module for the newly added node. When a node is deleted, the GPU driving conversion module of the deleted node is recovered.
Meanwhile, the ConfigMap is to be mounted in the GPU driving conversion module, and more specifically, the ConfigMap is mounted in a basic unit Pod of the GPU driving conversion module so that the GPU driving conversion module can read target GPU driving proportioning information of the node. And when the target GPU driving ratio information is changed, the GPU driving conversion module can monitor the change information, so that the current GPU configuration is obtained again, and GPU driving conversion is carried out according to the new GPU ratio. The process of applying the method described above by the GPU driver switching module is described in detail below in conjunction with fig. 4, where fig. 4 is a schematic diagram of GPU resource allocation in a K8S node.
First, the GPU driver switching module determines target GPU driver allocation information from the data volume.
As shown in fig. 4, after the ConfigMap is mounted on the GPU driver conversion module, the module monitors the ConfigMap. There are two cases in which the GPU driver conversion module determines the target GPU driver allocation information from the data volume: the case where the data volume is initially created and the target GPU driver allocation information is initially set, and the case where the already-set target GPU driver allocation information is changed.
In the case of initially setting the target GPU driver allocation information, after the ConfigMap is created, all GPU devices in the node load GPU drivers according to the GPU driver proportioning information with which the ConfigMap was initialized. For example, the initial GPU driver proportioning information carried when the ConfigMap is created is: GPU container driver to GPU virtual machine driver is 1:1. Since the GPU devices in the initial state have not loaded any driver, when the loaded driver of each GPU device is configured according to the target GPU driver proportioning information, half of the GPU devices load the GPU container driver and the other half load the GPU virtual machine driver.
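Turning a proportioning such as 1:1 into per-driver device counts can be sketched as below; how the remainder of an uneven division is distributed is an assumption (here it goes to the last listed driver).

```python
def apply_ratio(num_devices, ratio):
    """ratio: driver -> weight, e.g. {"container": 1, "vfio-pci": 1}.
    Return driver -> number of devices, giving any remainder from
    integer division to the last listed driver."""
    total = sum(ratio.values())
    drivers = list(ratio)
    counts, assigned = {}, 0
    for drv in drivers[:-1]:
        counts[drv] = num_devices * ratio[drv] // total
        assigned += counts[drv]
    counts[drivers[-1]] = num_devices - assigned   # remainder absorbed here
    return counts
```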
Under the condition that the set target GPU driving allocation information is changed, the files in the ConfigMap are modified corresponding to the modification command. At this time, the GPU driver converting module monitors the information of the ConfigMap change, and determines the latest GPU driver allocation information as the target GPU driver allocation information.
And the GPU drive conversion module acquires the currently loaded drive of each GPU device after determining the target GPU drive allocation information from the data volume.
Specifically, when the GPU driver conversion module detects a ConfigMap change, this indicates that the required GPU resources have changed. The module then obtains the GPU driver loaded by each GPU device in the node, yielding the current GPU driver loading situation, in preparation for determining the target GPU devices for driver conversion.
Furthermore, the GPU driving conversion module determines target GPU equipment for driving conversion and target GPU driving converted by the target GPU equipment from the plurality of GPU equipment according to the target GPU driving distribution information and the current loaded driving of each GPU equipment.
For example, the GPU driver converting module may compare the target GPU driver allocation information with the current GPU driver loading condition, and use the difference information of the two as the comparison result. And then determining a target GPU device for performing drive conversion and a converted target GPU drive from the idle GPU devices according to the comparison result. Alternatively, the GPU driver conversion module may use other methods for determining the target GPU device in the above embodiments, which are not described herein.
And after the target GPU equipment and the target GPU drive are determined, the GPU drive conversion module converts the drive loaded by the target GPU equipment currently into the target GPU drive. In addition, as the GPU equipment comprises a plurality of sub-equipment, when the drive loaded by the target GPU equipment is converted into the target GPU drive, all the GPU sub-equipment in the target GPU equipment is converted, and the conversion process can be completed.
In addition, within the node, the process of scheduling and allocating GPU resources to GPU applications is essential when GPU resources are allocated. These scheduling and allocation functions are implemented by GPU resource management plugins, such as a container GPU resource management plugin and a virtual machine GPU resource management plugin. Therefore, to avoid tasks being allocated to GPU devices during driver conversion, which would affect the conversion, the GPU driver conversion module suspends the container GPU resource management plugin and the virtual machine GPU resource management plugin after detecting the ConfigMap change, and restarts them after the conversion is completed, ensuring that the GPU resources allocatable by the node are consistent with the actual GPU resources.
In addition, after the driver conversion of the target GPU devices is completed, the GPU resource management plugins can rediscover and report the GPU resources. Specifically, the container GPU resource management plugin and the virtual machine GPU resource management plugin respectively check the GPU resources corresponding to the GPU devices loaded with the GPU container driver and the GPU devices loaded with the GPU virtual machine driver, count the total amount of GPU resources, and report the GPU resources usable by GPU container applications and GPU virtual machine applications respectively. After the container GPU resources and virtual machine GPU resources are reported, they are allocated to GPU container applications and GPU virtual machine applications respectively, so that the converted GPU resources can be fully utilized.
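The rediscovery step amounts to counting allocatable devices per loaded driver so that each plugin can report its own total; a minimal sketch (function name assumed):

```python
def report_resources(devices):
    """devices: device id -> loaded driver. Count allocatable devices per
    driver, so that each GPU resource management plugin can report the
    total usable by its kind of GPU application."""
    totals = {}
    for drv in devices.values():
        totals[drv] = totals.get(drv, 0) + 1
    return totals
```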
Fig. 5 is a schematic structural diagram of a resource allocation device provided in an embodiment of the present application, which is described in detail below with reference to fig. 5. It should be clear that the resource allocation device is used to implement the methods provided in the above method embodiments; the embodiments referred to in the following description explain the technical solutions of the present application and do not limit their practical use.
As shown in the resource allocation apparatus 500 provided in fig. 5, the apparatus is applied to an electronic device, where the electronic device includes a plurality of GPU devices, and at least two different GPU drivers are installed on each GPU device, and the apparatus includes:
a determining unit 501, configured to determine target GPU driving allocation information, where the target GPU driving allocation information is used to indicate requirements of different GPU driving;
an obtaining unit 502, configured to obtain a driver currently loaded by each GPU device;
the determining unit 501 is further configured to determine, from the multiple GPU devices, a target GPU device for performing drive conversion and a target GPU drive converted by the target GPU device according to the target GPU drive allocation information and the currently loaded drive of each GPU device;
the conversion unit 503 is configured to convert a driver currently loaded by the target GPU device into a target GPU driver.
In a possible implementation manner, the determining unit 501 is further configured to compare the target GPU driver allocation information with the currently loaded driver of each GPU device, and determine the target GPU driver to be added and the required amount of the target GPU driver;
and determining the target GPU equipment for performing drive conversion and the target GPU drive converted by the target GPU equipment from the GPU equipment according to the target GPU drive and the demand to be increased.
The determining unit 501 is further configured to determine, from the plurality of GPU devices, at least one first candidate GPU device in an idle state;
and determining a target GPU device for performing drive conversion and a target GPU drive converted by the target GPU device from the first candidate GPU devices according to the target GPU drive and the demand to be increased.
The determining unit 501 is further specifically configured to:
if the number of the first candidate GPU devices is greater than or equal to the required amount of the target GPU drive to be increased, selecting the GPU device with the required amount from the first candidate GPU devices as the target GPU device for driving conversion.
If the number of the first candidate GPU devices is smaller than the required amount of the target GPU drive to be increased, determining each first candidate GPU device as the target GPU device, and determining the second candidate GPU device as the target GPU device when detecting that the second candidate GPU device in an idle state exists, until the number of the target GPU devices is equal to the required amount of the target GPU drive to be increased.
The determining unit 501 is further configured to determine, in response to a driving setting instruction, target GPU driving allocation information, where the driving setting instruction includes requirements for different GPU driving;
or,
and the method is used for determining target GPU driving allocation information according to the GPU resource demand of the application to be operated and the current GPU resources.
In one possible implementation, before the conversion unit 503 converts the currently loaded driver of the target GPU device into the target GPU driver, the allocation of GPU resources corresponding to the target GPU device is suspended;
after the conversion unit 503 converts the currently loaded driver of the target GPU device into the target GPU driver, a function of allocating GPU resources corresponding to each GPU device is started.
In one possible implementation, the target GPU driver allocation information includes:
the number information of the target GPU drives or the proportioning information of the target GPU drives.
In a possible implementation manner, the determining unit 501 is further configured to determine each first GPU device currently loaded with a first GPU driver, and each second GPU device currently loaded with a second GPU driver;
the converting unit 503 is further configured to, when each second GPU device is in a non-idle state and there is an idle state device in each first GPU device, convert a GPU driver currently loaded by the first GPU device in the idle state into a second GPU driver.
It should be noted that, in the foregoing information interaction between each module/unit in each device, the executing process, etc., the embodiments of the method corresponding to fig. 1 to fig. 4 in the present application are based on the same concept, and specific content may be referred to the description in the foregoing illustrated embodiment of the method in the present application, which is not repeated herein.
Next, referring to fig. 6, fig. 6 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, where the electronic device includes a processor 601 and a memory 600, and the memory 600 stores computer executable instructions that can be executed by the processor 601, and the processor 601 executes the computer executable instructions to implement a method implemented on a server, a method implemented on a labeling terminal, or a method implemented on a service terminal.
In the embodiment shown in fig. 6, the electronic device further comprises a bus 602 and a communication interface 603, wherein the processor 601, the communication interface 603 and the memory 600 are connected by the bus 602.
The memory 600 may include high-speed random access memory (RAM, Random Access Memory), and may further include non-volatile memory, such as at least one disk memory. The communication connection between the system network element and at least one other network element is implemented via at least one communication interface 603 (which may be wired or wireless), using the Internet, a wide area network, a local network, a metropolitan area network, etc. Bus 602 may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 602 may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one bus line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.
The processor 601 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 601 or by instructions in the form of software. The processor 601 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor 601 reads the information in the memory and, in combination with its hardware, performs the steps of the method of the foregoing embodiments.
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (12)

1. A method for allocating resources, applied to an electronic device, the electronic device including a plurality of GPU devices, each GPU device having at least two different GPU drivers installed thereon, the method comprising:
determining target GPU driving allocation information, wherein the target GPU driving allocation information is used for indicating the requirement conditions of different GPU driving;
acquiring a current loaded drive of each GPU device;
determining target GPU equipment for performing drive conversion and target GPU drive converted by the target GPU equipment from the plurality of GPU equipment according to the target GPU drive distribution information and the currently loaded drive of each GPU equipment; and converting the currently loaded drive of the target GPU equipment into the target GPU drive.
2. The method according to claim 1, wherein the determining, from the plurality of GPU devices, a target GPU device for performing a drive conversion, and a target GPU drive into which the target GPU device converts, according to the target GPU drive allocation information and the currently loaded drive of each of the GPU devices, comprises:
comparing the target GPU drive allocation information with the currently loaded drive of each GPU device, and determining a target GPU drive to be added and the required amount of the target GPU drive;
and determining a target GPU device for performing drive conversion and a target GPU drive converted by the target GPU device from the GPU devices according to the target GPU drive to be increased and the demand.
3. The method according to claim 2, wherein determining, from the GPU devices, the target GPU device on which to perform the driver conversion and the target GPU driver to which the target GPU device is to be converted, according to the target GPU driver to be added and the required quantity, comprises:
determining, from the plurality of GPU devices, at least one first candidate GPU device in an idle state; and
determining, from the first candidate GPU devices, the target GPU device on which to perform the driver conversion and the target GPU driver to which the target GPU device is to be converted, according to the target GPU driver to be added and the required quantity.
4. The method according to claim 3, wherein determining, from the first candidate GPU devices, the target GPU device on which to perform the driver conversion, according to the target GPU driver to be added and the required quantity, comprises:
if the number of first candidate GPU devices is greater than or equal to the required quantity of the target GPU driver to be added, selecting the required quantity of GPU devices from the first candidate GPU devices as the target GPU devices on which to perform the driver conversion.
5. The method according to claim 4, wherein determining, from the first candidate GPU devices, the target GPU device on which to perform the driver conversion, according to the target GPU driver to be added and the required quantity, further comprises:
if the number of first candidate GPU devices is smaller than the required quantity of the target GPU driver to be added, determining each first candidate GPU device as a target GPU device, and, whenever a second candidate GPU device in an idle state is detected, determining that second candidate GPU device as a target GPU device as well, until the number of target GPU devices equals the required quantity of the target GPU driver to be added.
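The idle-first selection of claims 3 to 5 reduces to a simple rule: take the required quantity if enough devices are idle, otherwise take every idle device now and carry a shortfall to be filled as further devices become idle. A minimal sketch (function name and return shape are illustrative assumptions):

```python
def select_conversion_targets(idle_devices, required):
    """Return (devices selected now, remaining shortfall).

    If enough devices are idle, exactly `required` of them are selected and
    the shortfall is zero; otherwise all idle devices are selected and the
    caller keeps watching for devices that later become idle.
    """
    if len(idle_devices) >= required:
        return list(idle_devices[:required]), 0
    return list(idle_devices), required - len(idle_devices)
```

A nonzero shortfall tells the caller to repeat the selection whenever a previously busy device transitions to the idle state, as in claim 5.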
6. The method according to claim 1, wherein determining the target GPU driver allocation information comprises:
determining the target GPU driver allocation information in response to a driver setting instruction, wherein the driver setting instruction comprises the demand for the different GPU drivers; or
determining the target GPU driver allocation information according to the GPU resource demand of an application to be run and the currently available GPU resources.
7. The method according to claim 1, wherein before converting the driver currently loaded by the target GPU device into the target GPU driver, the method further comprises:
suspending the distribution of the GPU resources corresponding to the GPU devices;
and after converting the driver currently loaded by the target GPU device into the target GPU driver, the method further comprises:
resuming the distribution of the GPU resources corresponding to the GPU devices.
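The suspend/resume pairing in claim 7 is naturally expressed as a context manager, so that resource distribution is re-enabled even if the driver switch fails partway. A sketch under the assumption of a hypothetical `scheduler` object exposing `pause()` and `resume()` hooks:

```python
from contextlib import contextmanager

@contextmanager
def allocation_paused(scheduler):
    """Suspend GPU-resource distribution while drivers are being switched,
    and resume it afterwards even if the switch raises an exception."""
    scheduler.pause()
    try:
        yield
    finally:
        scheduler.resume()  # always restore distribution
```

Usage: `with allocation_paused(scheduler): do_driver_conversion()` guarantees no GPU resources are handed out while a device's driver is mid-conversion.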
8. The method according to claim 1, wherein the target GPU driver allocation information comprises:
quantity information of the target GPU drivers, or ratio information of the target GPU drivers.
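When the allocation information is given as a ratio between drivers (claim 8) rather than absolute counts, it must be converted to per-driver device counts that sum to the total number of GPU devices. One standard way is largest-remainder rounding; this sketch and its function name are illustrative, not taken from the patent:

```python
def ratio_to_counts(ratios, total_devices):
    """Convert per-driver ratio information, e.g. {"A": 1, "B": 1}, into
    per-driver device counts summing exactly to `total_devices`, using
    largest-remainder rounding."""
    weight = sum(ratios.values())
    exact = {drv: total_devices * r / weight for drv, r in ratios.items()}
    counts = {drv: int(v) for drv, v in exact.items()}  # floor each share
    leftover = total_devices - sum(counts.values())
    # Hand the remaining devices to the largest fractional parts.
    for drv in sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)[:leftover]:
        counts[drv] += 1
    return counts
```

The resulting counts can then feed the quantity-based comparison of claim 2 unchanged.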
9. The method according to any one of claims 1 to 8, further comprising:
determining each first GPU device currently loading a first GPU driver and each second GPU device currently loading a second GPU driver; and
when every second GPU device is in a non-idle state and a first GPU device in an idle state exists, converting the GPU driver currently loaded by the idle first GPU device into the second GPU driver.
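The rebalancing condition of claim 9 — convert an idle device away from an under-used driver when every device on the other driver is busy — can be sketched as a predicate over the device states. The record shape (`{"driver": ..., "busy": ...}`) is a hypothetical representation for illustration:

```python
def rebalance_targets(devices, first_driver, second_driver):
    """Given device records {device_id: {"driver": ..., "busy": ...}},
    return the idle devices on `first_driver` to switch to `second_driver`
    when every `second_driver` device is busy; otherwise return []."""
    second = [d for d in devices.values() if d["driver"] == second_driver]
    if not second or not all(d["busy"] for d in second):
        return []  # second driver still has capacity (or no such devices)
    return [dev_id for dev_id, d in devices.items()
            if d["driver"] == first_driver and not d["busy"]]
```

Running this check periodically shifts idle capacity toward whichever driver is saturated, which is the load-following behavior the claim describes.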
10. A resource allocation apparatus, applied to an electronic device, the electronic device comprising a plurality of GPU devices, each GPU device having at least two different GPU drivers installed thereon, the apparatus comprising:
a determining unit, configured to determine target GPU driver allocation information, wherein the target GPU driver allocation information indicates the demand for the different GPU drivers;
an acquisition unit, configured to acquire the driver currently loaded by each GPU device;
the determining unit being further configured to determine, from the plurality of GPU devices, a target GPU device on which to perform a driver conversion and the target GPU driver to which the target GPU device is to be converted, according to the target GPU driver allocation information and the driver currently loaded by each GPU device; and
a conversion unit, configured to convert the driver currently loaded by the target GPU device into the target GPU driver.
11. An electronic device, the electronic device comprising a plurality of GPU devices, each GPU device having at least two different GPU drivers installed thereon, the electronic device further comprising:
a memory and a processor, the memory and the processor being coupled;
the memory being configured to store one or more computer instructions; and
the processor being configured to execute the one or more computer instructions to implement the method according to any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon one or more computer instructions for an electronic device comprising a plurality of GPU devices, each GPU device having at least two different GPU drivers installed thereon, the computer instructions being executable by a processor to implement the method according to any one of claims 1 to 9.
CN202310107190.6A 2023-02-01 2023-02-01 Resource allocation method and device, electronic equipment and storage medium Pending CN116225699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310107190.6A CN116225699A (en) 2023-02-01 2023-02-01 Resource allocation method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116225699A true CN116225699A (en) 2023-06-06

Family

ID=86590462



Similar Documents

Publication Publication Date Title
US20160378570A1 (en) Techniques for Offloading Computational Tasks between Nodes
CN103970520A (en) Resource management method and device in MapReduce framework and framework system with device
CN108141471B (en) Method, device and equipment for compressing data
CN111338779B (en) Resource allocation method, device, computer equipment and storage medium
US20230305980A1 (en) Flexibly configured multi-computing-node server mainboard structure and program
CN111061570B (en) Image calculation request processing method and device and terminal equipment
WO2021114025A1 (en) Incremental data determination method, incremental data determination apparatus, server and terminal device
CN104239150A (en) Method and device for adjusting hardware resources
US20210406053A1 (en) Rightsizing virtual machine deployments in a cloud computing environment
CN114401250A (en) Address allocation method and device
CN116185599A (en) Heterogeneous server system and method of use thereof
CN112860387A (en) Distributed task scheduling method and device, computer equipment and storage medium
CN114817107B (en) PCIE equipment switching system, method and device, computer equipment and storage medium
CN113886058A (en) Cross-cluster resource scheduling method and device
CN115878333A (en) Method, device and equipment for judging consistency between process groups
CN115039091A (en) Multi-key-value command processing method and device, electronic equipment and storage medium
CN111158905A (en) Method and device for adjusting resources
CN109962941B (en) Communication method, device and server
CN116225699A (en) Resource allocation method and device, electronic equipment and storage medium
CN114020464B (en) Method, device and storage medium for dynamically processing virtualized resources
CN112597080B (en) Read request control device and method and memory controller
CN110058866B (en) Cluster component installation method and device
CN114327259A (en) Flash memory channel controller operation method, device, equipment and storage medium
CN109614242B (en) Computing capacity sharing method, device, equipment and medium
CN114077493A (en) Resource allocation method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination