CN116303314A

CN116303314A - Log storage method and device for GPU, electronic equipment and storage medium

Info

Publication number: CN116303314A
Application number: CN202211626760.4A
Authority: CN
Inventors: Name withheld at the inventor's request
Original assignee: Moore Threads Technology Co Ltd
Current assignee: Moore Threads Technology Co Ltd
Priority date: 2022-12-16
Filing date: 2022-12-16
Publication date: 2023-06-23

Abstract

The disclosure relates to a log storage method and device for a GPU, an electronic device, and a storage medium. The method is applied to GPU firmware and comprises: storing a first log generated during GPU operation in a flash memory space on the GPU board, the first log recording information that characterizes abnormal GPU operation; storing a second log generated during GPU operation in an internal storage space of the GPU chip, the second log recording information that characterizes the running state of the GPU; and, when an access request for a target log is received, returning the target log according to its storage location. Embodiments of the disclosure can improve the efficiency of GPU power consumption management.

Description

Log storage method and device for GPU, electronic equipment and storage medium

Technical Field

The disclosure relates to the field of computer technology, and in particular, to a log storage method and device for a GPU, an electronic device and a storage medium.

Background

A graphics processing unit (GPU) is a microprocessor dedicated to image and graphics-related operations on a device. While high GPU performance is pursued, the safety and stability of GPU operation must also be guaranteed. When the GPU has too many processing tasks, problems such as rising temperature and increased power consumption often follow, which in turn affect the safety and stability of GPU operation.

GPU power management techniques balance the performance and safety of GPU operation and involve several aspects, such as temperature control, power consumption control, and performance control. Supporting these functions requires not only a robust set of debugging facilities but also a log system that helps locate problems, so that GPU power consumption management can be carried out using the recorded log information.

However, the related art has no log system adapted to GPU power consumption management, and existing log systems exhibit various problems when used for that purpose, resulting in inefficient GPU power consumption management.

Disclosure of Invention

The disclosure provides a log storage technical scheme for a GPU.

According to an aspect of the present disclosure, there is provided a log storage method for a GPU, including:

storing a first log generated during the running of the GPU into a flash memory space on a GPU board, wherein the first log is used for recording information representing the running abnormality of the GPU;

storing a second log generated during the running of the GPU into an internal storage space of the GPU chip, wherein the second log is used for recording information representing the running state of the GPU;

and returning the target log according to the storage position of the target log under the condition that the access request to the target log is received.

In one possible implementation manner, the storing the first log generated by the GPU in the flash memory space on the GPU board includes:

in a GPU starting stage, under the condition that the self-checking operation of the GPU firmware detects that an abnormality exists, generating a first log for recording the abnormality information in the starting stage, and storing the first log into a flash memory space on a GPU board;

in the GPU operation stage, under the condition that the GPU operation has faults, a first log for recording fault information is generated and stored in a flash memory space on a GPU board;

in the GPU operation stage, under the condition that the operation state parameters exceeding the preset normal range exist, a first log used for recording the operation state parameters exceeding the preset normal range is generated and stored in a flash memory space on the GPU board.

In one possible implementation manner, in a case that an access request to the target log is received, returning the target log according to a storage location of the target log includes:

under the condition that an access request for a target log is received, analyzing a log identifier in the access request;

determining a storage position of the target log according to the log identification;

under the condition that the determined storage position is the storage space on the GPU board card, reading a target log from the storage space on the GPU board card and returning the target log to a log requester;

and under the condition that the determined storage position is the internal storage space of the GPU chip, reading the target log from the internal storage space of the GPU chip and returning the target log to the log requester.

In one possible implementation, the storage space on the GPU board includes a first storage space and a second storage space, and the data in the second storage space is a backup of the data in the first storage space.

In one possible implementation manner, the reading the target log from the storage space on the GPU board card and returning the target log to the log requester includes:

reading a first target log in the first storage space, and performing integrity check on the first target log;

returning the read first target log to the log requester under the condition that the integrity check of the first target log passes;

under the condition that the integrity check of the first target log fails, reading a second target log in the second storage space, and carrying out the integrity check on the second target log;

and returning the read second target log to the log requester under the condition that the integrity check of the second target log passes.

In one possible implementation, reading the target log from the internal memory space of the GPU chip and returning to the log requester includes:

reading a target log from an internal storage space of the GPU chip, and performing integrity check;

and returning the target log to the log requester in the condition that the integrity check is passed.

In one possible implementation, the first log is used for fault location and/or alarm analysis of the GPU, and the second log is used for development and debugging of GPU power consumption management functions.

In one possible implementation, the subdivision dimension of the parameter in the first log includes: the dimensionality of parameters required by fault location and/or alarm analysis;

the subdivision dimension of the second log includes: dimension of parameters for power management function debugging.

According to an aspect of the present disclosure, there is provided a log storage device for a GPU, including:

the first storage unit is used for storing a first log generated during the running of the GPU into a flash memory space on the GPU board, and the first log is used for recording information representing the running abnormality of the GPU;

the second storage unit is used for storing a second log generated during the running of the GPU into an internal storage space of the GPU chip, and the second log is used for recording information representing the running state of the GPU;

and the log returning unit is used for returning the target log according to the storage position of the target log under the condition that the access request to the target log is received.

In one possible implementation manner, the first storage unit is configured to:

In a possible implementation manner, the log returning unit is configured to:

According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.

According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

In the embodiments of the disclosure, the log storage method is applied to GPU firmware; that is, the log storage operations are executed by the GPU firmware, which prevents the incomplete log information that an operating system crash can cause. The GPU firmware stores a first log generated during GPU operation in a flash memory space on the GPU board. The first log records information that characterizes abnormal GPU operation, and because the flash memory space on the GPU board does not lose stored log information after power is lost, problems can be located conveniently from the abnormality information recorded in the first log. In addition, a second log generated during GPU operation is stored in an internal storage space of the GPU chip. The second log records information that characterizes the running state of the GPU and facilitates the development and debugging of power consumption management. Because that development and debugging takes place while the GPU is running, the second log need not occupy the flash memory space on the GPU board, and losing it after power-down does not affect the development and debugging of power consumption management. The log storage method provided by the embodiments of the disclosure is therefore well adapted to GPU power consumption management and improves its efficiency.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.

Fig. 1 illustrates a flowchart of a log storage method for a GPU according to an embodiment of the present disclosure.

Fig. 2 illustrates a flowchart of another log storage method for a GPU according to an embodiment of the present disclosure.

Fig. 3 shows a flowchart of a log access process provided in accordance with an embodiment of the present disclosure.

Fig. 4 illustrates an overall framework diagram of a system for performing a log storage method according to an embodiment of the present disclosure.

Fig. 5 illustrates a block diagram of a log storage device for a GPU according to an embodiment of the present disclosure.

Fig. 6 shows a block diagram of an electronic device, according to an embodiment of the disclosure.

Fig. 7 shows a block diagram of an electronic device, according to an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C.

Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

As described in the background art, various problems may occur in the log system when the GPU power consumption management is performed by the log system, resulting in low efficiency of the GPU power consumption management.

For example, in the related art, log information is recorded in a volatile storage medium, and after a power failure the stored log information can no longer be read, so the power consumption problem of the GPU cannot be located. The volatile storage medium may be, for example, double data rate (DDR) memory or static random-access memory (SRAM).

In another example, in the related art, log information is recorded on a hard disk by the operating system of the terminal. If the operating system crashes, the log information may be incomplete, and if the hard disk is replaced or damaged, the log is lost, so the power consumption problem of the GPU cannot be located.

In the embodiments of the disclosure, the log storage method is applied to GPU firmware; that is, the log storage operations are executed by the GPU firmware, which prevents the incomplete log information that an operating system crash can cause. The GPU firmware stores a first log generated during GPU operation in a flash memory space on the GPU board. The first log records information that characterizes abnormal GPU operation, and because the flash memory space on the GPU board does not lose stored log information after power is lost, problems can be located conveniently from the abnormality information recorded in the first log. In addition, a second log generated during GPU operation is stored in an internal storage space of the GPU chip. The second log records information that characterizes the running state of the GPU and facilitates the development and debugging of power consumption management. Because that development and debugging takes place while the GPU is running, the second log need not occupy the flash memory space on the GPU board, and losing it after power-down does not affect the development and debugging of power consumption management. The log storage method for the GPU provided by the embodiments of the disclosure is therefore well adapted to GPU power consumption management and improves its efficiency.

The log storage method for the GPU is applied to GPU firmware. The GPU firmware resides inside the GPU hardware device and is independent of the operating system of the terminal device, so even if that operating system crashes, the log recording performed by the GPU firmware is not affected.

In one possible implementation manner, the log storing method for the GPU may be executed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc., and the method may be implemented by a processor invoking computer readable instructions stored in a memory. Alternatively, the method may be performed by a server.

Fig. 1 shows a flowchart of a log storage method for a GPU according to an embodiment of the present disclosure, which is applied to GPU firmware, as shown in fig. 1, the log storage method includes:

in step S11, a first log generated during operation of the GPU is stored in a flash memory space on the GPU board, where the first log is used to record information representing abnormal operation of the GPU;

The first log records information that characterizes abnormal GPU operation, such as numerical parameters outside the stable operating range and fault codes; that is, fault information and/or alarm information of the GPU. Fault information may be, for example, a fan fault or a power supply fault; alarm information may be, for example, an abnormality alarm for items such as the voltage, current, power consumption, temperature, and frequency of the GPU.

In one possible implementation, the first log includes at least one of: the number of fan failures and their causes; overvoltage, undervoltage, and voltage instability; overcurrent, undercurrent, and current instability; the number of times power consumption exceeds the maximum threshold; the number of over-temperature warnings and the number of over-temperature shutdowns; the number of GPU overclocking events.

Because the first log records information that characterizes abnormal GPU operation, it can be used for fault location and/or alarm analysis of the GPU. Faults and alarms may not be noticed and handled promptly when they occur, so the cause of a fault often has to be located, and alarms analyzed, after the GPU has been powered down. Given this application scenario, the first log is stored in the flash memory space on the GPU board.

The flash memory (FLASH) space on the GPU board is a storage space that does not lose stored data after a power failure. Even if the GPU board is powered down, the data in the flash memory space is retained, so the first log can still be read after the GPU is powered down to locate the cause of a fault and analyze alarms.

In addition, because the flash memory space is positioned on the GPU board, the situation that log information is lost after the hard disk is replaced or damaged is avoided, and the fault reason and the power consumption problem of the GPU are conveniently positioned.

In step S12, a second log generated during the running of the GPU is stored in the internal storage space of the GPU chip, where the second log is used to record information representing the running state of the GPU;

The second log records information that characterizes the running state of the GPU. It can record information both when the GPU runs normally and when it runs abnormally, for example the fan speed, voltage, current, power consumption, temperature, and frequency of the GPU in operation, as well as the automatic control parameters and operation information of a proportional-integral-derivative (PID) control system.

In one possible implementation, the second log includes at least one of: fan speed information; the voltage value of each module, historical voltage change information, and the reasons for voltage changes; the current value of each module, the total board current, the historical minimum current, and the historical maximum current; the power consumption of each module, the total board power consumption, the historical minimum power consumption, and the historical maximum power consumption; the temperature value of each temperature sensor, the historical maximum temperature, and the average temperature; the real-time frequency value of each module, historical frequency change information, and the reasons for frequency changes; PID automatic control parameter information and operation information.
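The historical minimum, historical maximum, and average values listed above can be kept as running statistics, so that each new sample updates the record without storing every past reading. The following is a minimal Python sketch of that idea, not the firmware's actual data layout; the class name `SensorHistory` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SensorHistory:
    """Running statistics for one monitored item (hypothetical layout)."""
    current: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0
    count: int = 0

    def update(self, value: float) -> None:
        # Record the newest sample and fold it into the historical values.
        self.current = value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value
        self.count += 1

    @property
    def average(self) -> float:
        # Average over all samples seen so far; 0.0 before the first sample.
        return self.total / self.count if self.count else 0.0


# Example: three temperature readings in degrees Celsius (made-up values).
temperature = SensorHistory()
for reading in (60.0, 80.0, 70.0):
    temperature.update(reading)
```

After the three updates, `temperature` holds the latest reading together with the lowest, highest, and mean of all readings seen so far.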

Because the second log records information that characterizes the running state of the GPU, it can be used for developing and debugging the power consumption management function of the GPU. During that development and debugging, the trends over time of, and the correlations among, the power consumption, temperature, clock, and fan speed of the GPU are analyzed to determine the parameters of the PID automatic control system, so as to implement power consumption management functions such as power reduction and temperature control while the GPU is running.

The development and debugging of GPU power consumption management can be performed while the GPU is running, so, given this application scenario, the second log can be stored in the internal storage space of the GPU chip; even if the data in that internal storage space is lost after a power failure, the development and debugging are not affected. In addition, because the second log can record information during both normal and abnormal operation, it may occupy more storage space than the first log. Storing it in the internal storage space of the GPU chip means that it is lost when power is removed rather than occupying storage permanently, which saves storage space.
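The division of labor between the two storage spaces can be illustrated with a toy model. The Python sketch below is illustrative only: `LogStore`, its method names, and the fixed-capacity ring buffer used for the on-chip space are assumptions made for the example and are not described in the disclosure.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogStore:
    """Toy model of the two storage tiers (all names are hypothetical)."""
    # Stands in for the flash memory space on the GPU board: survives power loss.
    flash: list = field(default_factory=list)
    # Stands in for the internal storage space of the GPU chip: volatile.
    # The capacity of 4 entries is an arbitrary assumption for the example.
    on_chip: deque = field(default_factory=lambda: deque(maxlen=4))

    def store_first_log(self, entry: str) -> None:
        # Abnormality information goes to non-volatile storage.
        self.flash.append(entry)

    def store_second_log(self, entry: str) -> None:
        # Running-state information goes to volatile storage; when the
        # buffer is full, the oldest entry is discarded automatically.
        self.on_chip.append(entry)

    def power_cycle(self) -> None:
        # Simulate power loss: only the volatile tier is cleared.
        self.on_chip.clear()


store = LogStore()
store.store_first_log("fan fault")
for i in range(6):
    store.store_second_log(f"temperature sample {i}")
entries_before_power_loss = len(store.on_chip)
store.power_cycle()
```

After `power_cycle()`, the fault entry is still present in `store.flash`, while the running-state samples are gone, mirroring the behavior described above.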

In step S13, when an access request to the target log is received, the target log is returned according to the storage location of the target log.

After the first log and the second log are respectively stored in different storage spaces, the target log can be returned according to the storage position of the target log under the condition that the external application of the GPU requests to access the target log. The process may specifically refer to possible implementation manners provided in the present disclosure, which are not described herein in detail.

In one possible implementation manner, the storing the first log generated by the GPU in the flash memory space on the GPU board includes: in a GPU starting stage, under the condition that the self-checking operation of the GPU firmware detects that an abnormality exists, generating a first log for recording the abnormality information in the starting stage, and storing the first log into a flash memory space on a GPU board; in the GPU operation stage, under the condition that the GPU operation has faults, a first log for recording fault information is generated and stored in a flash memory space on a GPU board; in the GPU operation stage, under the condition that the operation state parameters exceeding the preset normal range exist, a first log used for recording the operation state parameters exceeding the preset normal range is generated and stored in a flash memory space on the GPU board.

Referring to fig. 2, fig. 2 illustrates a flowchart of another log storage method for a GPU according to an embodiment of the present disclosure. As shown in fig. 2, in the GPU starting stage, a firmware self-checking operation is performed to determine whether an abnormal condition exists, for example an insufficient power supply, a fan failure, or an abnormal starting temperature. When an abnormal condition exists, a first log recording the abnormality information of the starting stage is generated and stored in the flash memory space on the GPU board.

During the GPU running phase, the firmware may store the second log into an internal memory space of the GPU chip when the GPU is running.

In the GPU operation stage, when a fault occurs while the GPU is running, the firmware can generate a first log recording the fault information and store it in the flash memory space on the GPU board; the fault code may be stored as the first log in the flash memory space on the GPU board.

In addition, when an operation state parameter exceeds its preset normal range, a first log recording that parameter can be generated and stored in the flash memory space on the GPU board. A normal range can be preset for each item, such as fan speed, voltage, current, power consumption, temperature, and frequency; when an operation state parameter acquired in real time falls outside its normal range, a first log is generated that records the out-of-range parameter.

In the embodiments of the disclosure, a first log is generated and stored in the flash memory space on the GPU board in three cases: in the GPU starting stage, when the self-checking operation of the GPU firmware detects an abnormality, the first log records the abnormality information of the starting stage; in the GPU operation stage, when a fault occurs, the first log records the fault information; and in the GPU operation stage, when an operation state parameter exceeds its preset normal range, the first log records that parameter. Because the flash memory space on the GPU board does not lose stored log information after power is lost, problems in the starting stage and the operation stage can be located conveniently from the abnormality information, fault information, and out-of-range operation state parameters recorded in the first log.
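The check against preset normal ranges can be sketched as follows. This Python example is a simplified illustration under stated assumptions: the parameter names, the limit values in `NORMAL_RANGES`, and the entry format are all hypothetical and are not taken from the disclosure.

```python
# Hypothetical preset normal ranges: (lower limit, upper limit) per item.
NORMAL_RANGES = {
    "temperature_c": (0.0, 95.0),
    "voltage_v": (0.70, 1.10),
    "fan_rpm": (500.0, 4500.0),
}


def out_of_range_entries(sample: dict) -> list:
    """Return one first-log entry per parameter outside its preset range."""
    entries = []
    for name, value in sample.items():
        low, high = NORMAL_RANGES[name]
        if value < low or value > high:
            # Record the offending value together with the range it violated.
            entries.append({"param": name, "value": value, "range": (low, high)})
    return entries


# One real-time sample (made-up values): temperature too high, fan too slow.
sample = {"temperature_c": 101.5, "voltage_v": 0.90, "fan_rpm": 300.0}
first_log_entries = out_of_range_entries(sample)
```

In this sketch only the out-of-range parameters produce entries; in-range parameters such as the voltage above would be left to the second log.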

In one possible implementation manner, in a case that an access request to the target log is received, returning the target log according to a storage location of the target log includes: under the condition that an access request for a target log is received, analyzing a log identifier in the access request; determining a storage position of the target log according to the log identification; under the condition that the determined storage position is the storage space on the GPU board card, reading a target log from the storage space on the GPU board card and returning the target log to a log requester; and under the condition that the determined storage position is the internal storage space of the GPU chip, reading the target log from the internal storage space of the GPU chip and returning the target log to the log requester.

When requesting access to a stored log, the log requester adds the log identifier of the target log to the access request and then sends the access request to the GPU firmware to obtain the target log. Different types of target logs can be distinguished by their log identifiers. For example, for the second log, the log identifier of the fan speed log can be set to 01, that of the temperature log to 02, that of the voltage log to 03, that of the current log to 04, and so on; for the first log, the log identifier of the abnormal fan speed log may be set to 11, that of the abnormal temperature log to 12, that of the abnormal voltage log to 13, that of the abnormal current log to 14, and so on.

Fig. 3 shows a flowchart of a log access process provided in accordance with an embodiment of the present disclosure. As shown in fig. 3, when the GPU firmware receives an access request for the target log, it can parse the log identifier in the access request; because the log identifier characterizes the type of log, the storage location of the target log can be determined from it. For example, when the parsed log identifier is 02, it can be determined that the log requester wants the temperature log, which belongs to the second log and is stored in the internal storage space of the GPU chip; when the parsed log identifier is 12, it can be determined that the log requester wants the abnormal temperature log, which belongs to the first log and is stored in the flash memory space on the GPU board.

Then, under the condition that the determined storage position is the storage space on the GPU board card, the target log can be read from the storage space on the GPU board card and returned to the log requester; and under the condition that the determined storage position is the internal storage space of the GPU chip, the target log can be read from the internal storage space of the GPU chip and returned to the log requester.

In the embodiment of the disclosure, when the logs are stored, the first log and the second log are respectively stored in different storage spaces, so that when an access request to the target log is received, the log identification in the access request is analyzed, then the storage position of the target log is determined according to the log identification, and then the target log is read from the storage position and returned to the log requester. Thus, the storage location of the target log can be accurately located and returned to the log requester.
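The access flow described above (parse the log identifier, determine the storage location, read from that location) can be sketched as a table lookup. In the Python example below, the identifier values follow the example given in the text, while `LOG_TABLE`, `handle_access_request`, and the dictionaries standing in for the two storage spaces are hypothetical constructs for illustration.

```python
from enum import Enum


class Location(Enum):
    BOARD_FLASH = "flash memory space on the GPU board"
    CHIP_INTERNAL = "internal storage space of the GPU chip"


# Identifier values follow the example in the text; the table layout is assumed.
LOG_TABLE = {
    "01": ("fan speed", Location.CHIP_INTERNAL),
    "02": ("temperature", Location.CHIP_INTERNAL),
    "03": ("voltage", Location.CHIP_INTERNAL),
    "04": ("current", Location.CHIP_INTERNAL),
    "11": ("abnormal fan speed", Location.BOARD_FLASH),
    "12": ("abnormal temperature", Location.BOARD_FLASH),
    "13": ("abnormal voltage", Location.BOARD_FLASH),
    "14": ("abnormal current", Location.BOARD_FLASH),
}


def handle_access_request(request: dict, flash: dict, on_chip: dict):
    """Parse the log identifier, resolve the storage location, read the log."""
    log_id = request["log_id"]
    name, location = LOG_TABLE[log_id]
    # Pick the storage space that the identifier maps to.
    store = flash if location is Location.BOARD_FLASH else on_chip
    return name, location, store.get(log_id)


# Dictionaries standing in for the two storage spaces (made-up contents).
flash = {"12": ["over-temperature warning"]}
on_chip = {"02": [61.5, 63.0]}

second_log_result = handle_access_request({"log_id": "02"}, flash, on_chip)
first_log_result = handle_access_request({"log_id": "12"}, flash, on_chip)
```

An explicit table is used here rather than inferring the location from the leading digit, since the text gives the identifiers only as examples.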

In one possible implementation manner, the reading the target log from the storage space on the GPU board card and returning the target log to the log requester includes: reading a first target log in the first storage space, and performing integrity check on the first target log; returning the read first target log to the log requester under the condition that the integrity check of the first target log passes; under the condition that the integrity check of the first target log fails, reading a second target log in the second storage space, and carrying out the integrity check on the second target log; and returning the read second target log to the log requester under the condition that the integrity check of the second target log passes.

In this implementation manner, the storage space on the GPU board is divided into two parts, and the sizes of the two parts of storage space may be the same, where one part of storage space is used as a backup of another part of storage space, and after some data in one part of storage space is lost, the backup data in another part of storage space may be read, so as to improve the security of the data. For convenience of description, the two divided storage spaces are described herein as a first storage space and a second storage space, wherein the second storage space is a backup space of the first storage space.

As shown in fig. 3, when the target log is read from the storage space on the GPU board, the first target log in the first storage space may be read first, and an integrity check may be performed on it. The integrity check is used to check whether the data is complete, i.e., whether any data is missing. In one example, the integrity check may be based on Message Digest Algorithm 5 (MD5): when the first target log is stored in the storage space on the GPU board, its MD5 value is calculated and stored; when the first target log is read, its MD5 value is calculated again and the two MD5 values are compared. If the two MD5 values are the same, the data of the first target log can be determined to be complete.
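A minimal sketch of such an MD5-based check, using Python's standard `hashlib` module, might look as follows. The function names and the dictionary standing in for the storage space are hypothetical and chosen only for illustration.

```python
import hashlib


def store_log(storage, key, data):
    """Store the log bytes together with the MD5 digest computed at write time."""
    storage[key] = (data, hashlib.md5(data).hexdigest())


def read_and_check(storage, key):
    """Read the log, recompute its MD5 digest, and compare it with the stored one.

    Returns (data, passed), where passed is True if the two digests match.
    """
    data, stored_digest = storage[key]
    passed = hashlib.md5(data).hexdigest() == stored_digest
    return data, passed
```

A mismatch between the two digests indicates that the stored bytes changed after they were written, for example because part of the log was lost.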

Under the condition that the integrity check of the first target log passes, the first target log is indicated to be complete and not missing, and the read first target log can be returned to a log requester at the moment; and under the condition that the integrity check of the first target log fails, indicating that the first target log is missing, at the moment, the backed-up target log can be read, namely, the second target log in the second storage space can be read.

It should be noted that, the second target log is backup data of the first target log, and in the case that both target logs are complete and not missing, the first target log and the second target log may be identical two logs. In consideration of the fact that the second target log may have data missing, after the second target log in the second storage space is read, the second target log may also be subjected to integrity check. The process of performing the integrity check on the second target log may be the same as the process of performing the integrity check on the first target log, which is not described herein.

Under the condition that the integrity check of the second target log passes, the second target log is indicated to be complete and not missing, and the read second target log can be returned to the log requester at the moment; in the case that the integrity check of the second target log also fails, a related prompt that the target log is not acquired is returned to the requester, for example, a prompt message of "log data is missing, acquisition fails" is returned to the requester.
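The read path with backup fallback described above could be sketched as follows. This is an assumption-laden illustration: each storage space is modeled as a dictionary of `(data, md5_digest)` records, an absent record is treated the same as a failed check, and the names `read_from_board` and `MISSING_PROMPT` are hypothetical.

```python
import hashlib

# Prompt returned when neither copy passes the integrity check.
MISSING_PROMPT = "log data is missing, acquisition fails"


def _is_intact(record):
    """Return True if the recomputed MD5 digest matches the stored digest."""
    data, digest = record
    return hashlib.md5(data).hexdigest() == digest


def read_from_board(first_space, second_space, log_id):
    """Try the first storage space, then its backup; return (data, prompt)."""
    for space in (first_space, second_space):
        record = space.get(log_id)
        if record is not None and _is_intact(record):
            return record[0], None
    return None, MISSING_PROMPT
```

Because the second storage space is consulted only after the first copy fails its check, the common case costs a single read and a single digest computation.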

In the embodiment of the disclosure, the storage space of the GPU board is divided into two parts, one of which serves as a backup, so that the second target log backed up in the second storage space can be read when the first target log in the first storage space is missing. This improves the safety of the stored log and increases information redundancy. In addition, performing the integrity check on both the first target log and the second target log ensures that the log sent to the requester has no missing data, which ensures the accuracy of the data.

In one possible implementation, reading the target log from the internal storage space of the GPU chip and returning it to the log requester includes: reading the target log from the internal storage space of the GPU chip and performing an integrity check; and returning the target log to the log requester if the integrity check passes.

The process of performing the integrity check on the target log in the internal storage space may be the same as the process of performing the integrity check on the first target log, which is not described herein.

In the embodiment of the disclosure, when the target log is read from the internal storage space of the GPU chip, an integrity check is performed on it, and the target log is returned to the log requester only if the integrity check passes. This ensures that the log sent to the requester has no missing data and that the data is accurate.

In one possible implementation, the subdivision dimension of the parameter in the first log includes: the dimensionality of parameters required by fault location and/or alarm analysis; the subdivision dimension of the second log includes: dimension of parameters for power management function debugging.

Because the first log can be used for fault location and/or alarm analysis of the GPU, in order to facilitate fault location based on the first log, the subdivision dimension of the parameters in the first log can be divided according to the dimension of the parameters required by fault location; in order to facilitate the alarm analysis based on the first log, the subdivision dimension of the parameters in the first log may be divided according to the dimension of the parameters required for the alarm analysis.

Referring to table 1, a subdivision dimension of parameters in a first log is provided in an embodiment of the disclosure.

TABLE 1 subdivision dimension of parameters in first log

Entry	Description
Fan	Including the number of faults, the cause of the faults, etc.
Voltage	Including overvoltage, undervoltage, instability, etc.
Current	Including overcurrent, undercurrent, instability, etc.
Power consumption	Including power consumption exceeding the maximum threshold, the number of times the warning threshold is exceeded, etc.
Temperature	Including the number of times the temperature exceeds the warning threshold, the number of times the shutdown threshold is exceeded, etc.
Frequency	Including the number of overclocking events, etc.

Referring to table 2, a subdivision dimension of parameters in a second log is provided in an embodiment of the disclosure.

TABLE 2 subdivision dimension of parameters in second log

Because the second log can be used for developing and debugging the GPU power consumption management function, in order to facilitate developing and debugging the GPU power consumption management function based on the second log, the subdivision dimension of the parameters in the second log can be divided according to the dimension of the parameters debugged by the power consumption management function.

In the embodiment of the disclosure, the parameters in the first log are divided according to the dimensions of the parameters required by fault location and/or alarm analysis, so that the fault location and the alarm analysis are convenient to perform; the second log is divided according to the dimension of the parameter debugged by the power consumption management function, so that the development and debugging of the power consumption management function are facilitated.

Fig. 4 shows an overall framework diagram of a system for performing the log storage method for a GPU provided by the present disclosure, according to an embodiment of the present disclosure. The GPU firmware stores the logs into a flash memory space on the GPU board and an internal storage space of the GPU chip, respectively. The log requesters in the system include: a requester running in a terminal operating system, a Universal Asynchronous Receiver Transmitter (UART) of an external tool, and a Baseboard Management Controller (BMC). The requester in the operating system interacts with the GPU firmware through inter-process communication (Inter-Process Communication, IPC) to obtain a stored log; the UART interacts with the GPU firmware through transmit/receive (TX/RX) ports to obtain a stored log; and the BMC interacts with the GPU firmware over a system management bus (System Management Bus, SMBUS) to obtain a stored log.

The log requester performs a series of function development and analysis based on the read log, including:

a) Analyzing the trend of power consumption, temperature, clock and fan speed over time.

b) Analyzing the health state of the GPU board according to the read information, such as the number of over-temperature, over-power and over-frequency events, whether a fan is abnormal, and the like.
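As a rough illustration of the two kinds of analysis listed above, the following Python sketch works on a list of samples parsed from a log. The sample layout (`t`, `temp_c`, `power_w`), the thresholds, and the function names are all assumptions for this sketch; the disclosure does not define a log record format.

```python
def trend(samples, field):
    """Average change of `field` per unit time between the first and last sample."""
    first, last = samples[0], samples[-1]
    return (last[field] - first[field]) / (last["t"] - first["t"])


def summarize_health(samples, temp_limit, power_limit):
    """Count how many samples exceed the assumed temperature and power limits."""
    over_temp = sum(1 for s in samples if s["temp_c"] > temp_limit)
    over_power = sum(1 for s in samples if s["power_w"] > power_limit)
    return {"over_temp_count": over_temp, "over_power_count": over_power}
```

A requester such as a BMC could run this kind of summary periodically and raise an alert when a count or a trend crosses a limit of its own choosing.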

It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; for brevity, these combinations are not described in detail in the present disclosure. It will be appreciated by those skilled in the art that in the above-described methods of the embodiments, the particular order of execution of the steps should be determined by their function and possible inherent logic.

In addition, the disclosure further provides a log storage device for a GPU, an electronic device, a computer readable storage medium and a program, each of which can be used to implement any one of the log storage methods for the GPU provided in the disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.

Fig. 5 shows a block diagram of a log storage device for a GPU according to an embodiment of the present disclosure. As shown in fig. 5, the device 20 includes:

a first storage unit 21, configured to store a first log generated during the running of the GPU into a flash memory space on a GPU board, where the first log is used to record information representing the running abnormality of the GPU;

the second storage unit 22 is configured to store a second log generated during the running of the GPU into an internal storage space of the GPU chip, where the second log is used to record information representing the running state of the GPU;

the log returning unit 23 is configured to return the target log according to the storage location of the target log when receiving the access request to the target log.

In one possible implementation manner, the first storage unit is configured to:

In a possible implementation manner, the log returning unit is configured to:

The method has specific technical association with the internal structure of the computer system, and can solve the technical problems of improving the hardware operation efficiency or the execution effect (including reducing the data storage amount, reducing the data transmission amount, improving the hardware processing speed and the like), thereby obtaining the technical effect of improving the internal performance of the computer system which accords with the natural law.

In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.

The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.

Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.

The electronic device may be provided as a terminal, server or other form of device.

Fig. 6 shows a block diagram of an electronic device 800, according to an embodiment of the disclosure. For example, the electronic device 800 may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like.

Referring to fig. 6, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

The power component 806 provides power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen between the electronic device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as a display and keypad of the electronic device 800. The sensor component 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a photosensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (Wi-Fi), a second generation mobile communication technology (2G), a third generation mobile communication technology (3G), a fourth generation mobile communication technology (4G), Long Term Evolution (LTE) of universal mobile communication technology, a fifth generation mobile communication technology (5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.

Fig. 7 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 7, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Microsoft's server operating system (Windows Server™), the graphical-user-interface-based operating system developed by Apple Inc. (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.

The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, Random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for performing the operations of the present disclosure can be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, so that the electronic circuitry can execute the computer readable program instructions.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.

The foregoing description of the various embodiments is intended to highlight the differences between them; for parts that are the same or similar, the embodiments may be referred to each other, and these parts are not repeated herein for the sake of brevity.

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs the individual of the personal information processing rules and obtains the individual's independent consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying the technical solution obtains the individual's independent consent before processing the sensitive personal information and also meets the requirement of "explicit consent". For example, a clear and prominent sign is set at a personal information collection device such as a camera to inform individuals that they have entered the personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, this is regarded as consent to the collection of the individual's personal information. Alternatively, on a device that processes personal information, where the personal information processing rules are communicated by means of obvious signs or information, the individual's authorization is obtained through pop-up messages or by asking the individual to upload personal information. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information processed.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A log storage method for a GPU, applied to GPU firmware, comprising:

2. The method of claim 1, wherein storing the first log generated by the GPU in the flash memory space on the GPU board comprises:

3. The method according to claim 1, wherein the returning the target log according to the storage location of the target log in the case of receiving the access request to the target log comprises:

4. A method according to claim 3, wherein the memory space on the GPU board comprises a first memory space and a second memory space, the data in the second memory space being a backup of the data in the first memory space.

5. The method of claim 4, wherein the reading the target log from the memory space on the GPU board and returning to the log requester comprises:

6. A method according to claim 3, wherein reading the target log from the internal memory space of the GPU chip and returning to the log requester comprises:

7. The method according to claim 1, wherein the first log is used for fault localization and/or alarm analysis of the GPU and the second log is used for development and debugging of GPU power consumption management functions.

8. The method of claim 7, wherein the subdivision dimension of the parameter in the first log comprises: the dimensionality of parameters required by fault location and/or alarm analysis;

9. A log storage device for a GPU, comprising:

10. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 8.

11. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 8.