CN111209243A - Data processing device, method and related product - Google Patents

Data processing device, method and related product Download PDF

Info

Publication number
CN111209243A
Authority
CN
China
Prior art keywords
data
operation signal
machine learning
data operation
transmission circuit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811392279.7A
Other languages
Chinese (zh)
Other versions
CN111209243B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201811392279.7A priority Critical patent/CN111209243B/en
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to EP21217802.4A priority patent/EP4009185A1/en
Priority to KR1020207033053A priority patent/KR20200139829A/en
Priority to EP21217804.0A priority patent/EP4009186A1/en
Priority to EP19873122.6A priority patent/EP3869352A4/en
Priority to JP2020569113A priority patent/JP7060720B2/en
Priority to EP21217811.5A priority patent/EP4009184A1/en
Priority to EP21217809.9A priority patent/EP4009183A1/en
Priority to PCT/CN2019/111977 priority patent/WO2020078470A1/en
Priority to US17/278,812 priority patent/US20220035762A1/en
Priority to KR1020207034133A priority patent/KR102539572B1/en
Publication of CN111209243A publication Critical patent/CN111209243A/en
Priority to JP2020206293A priority patent/JP7074832B2/en
Priority to JP2020206281A priority patent/JP7074831B2/en
Priority to JP2020206306A priority patent/JP7074833B2/en
Priority to JP2020206272A priority patent/JP7053775B2/en
Priority to US17/564,366 priority patent/US11971836B2/en
Priority to US17/564,411 priority patent/US11809360B2/en
Priority to US17/564,398 priority patent/US11880328B2/en
Priority to US17/564,431 priority patent/US11880329B2/en
Priority to US17/564,579 priority patent/US11960431B2/en
Priority to US17/564,492 priority patent/US11880330B2/en
Priority to US17/564,529 priority patent/US11868299B2/en
Priority to US17/564,389 priority patent/US11841816B2/en
Priority to US17/564,509 priority patent/US11797467B2/en
Priority to US17/564,560 priority patent/US20220121603A1/en
Application granted granted Critical
Publication of CN111209243B publication Critical patent/CN111209243B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/167Interprocessor communication using a common memory, e.g. mailbox
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7825Globally asynchronous, locally synchronous, e.g. network on chip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present application relates to a data processing apparatus, a data processing method, and related products. A transmission circuit in the apparatus acquires, from a shared memory, the input data required by a machine learning device according to a data operation signal sent by the machine learning device, and returns the input data to the machine learning device; the machine learning device then operates on the input data to obtain output data and transmits the output data to the shared memory for storage. Because the data operation signal carries a type flag bit of the data operation signal and information about the data to be operated on, the transmission circuit can determine the type of the data operation signal according to the type flag bit and then perform the operation in combination with the information about the data to be operated on. In this way, the corresponding operation is quickly located by classification on the type flag bit of the data operation signal, which simplifies the data access logic, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.

Description

Data processing device, method and related product
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a data processing apparatus and method, and a related product.
Background
With the continuous development of information technology and ever-increasing demand, requirements for data access and data processing grow higher and higher, and the requirements placed on processors that process and access data become stricter and stricter. Taking general-purpose processors as an example, multi-core processors composed of a plurality of general-purpose processor cores (e.g., CPU cores) have become mainstream because of their powerful parallel computing capability.
However, with the continuous development of artificial neural networks, machine learning chips with more and more architectures have gradually appeared. During operation, these machine learning chips need to access or process data in shared storage according to instructions. When there is a large amount of data access or shared-storage data, the instructions of the machine learning chip gradually become complex, which in turn affects the speed of reading shared storage through instructions and results in low processing efficiency for neuron data.
Therefore, how to improve the access speed of the machine learning chip during data access has become an urgent technical problem for those skilled in the art.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data processing apparatus, a data processing method, and related products that solve the above technical problem of improving the access speed of the machine learning chip during data access.
In a first aspect, an embodiment of the present invention provides a data processing apparatus for performing processing of machine learning data, the data processing apparatus including: the machine learning device is connected with the transmission circuit, and the transmission circuit is connected with the shared memory;
the transmission circuit is used for acquiring input data required by the machine learning device from the shared memory according to a data operation signal sent by the machine learning device and returning the input data to the machine learning device; the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated.
In one embodiment, the machine learning apparatus is configured to perform a machine learning operation according to the input data to obtain output data.
In one embodiment, the machine learning device is further configured to use the output data as new input data, and transmit the new input data to the shared memory through the transmission circuit for data storage.
In one embodiment, the machine learning device comprises at least one machine learning unit, and the data operation signal further comprises a data receiving flag bit for characterizing a target machine learning unit receiving the input data.
In one embodiment, the value of the type flag bit of the data operation signal includes CAST, which characterizes the data operation signal as a broadcast or multicast instruction.
In one embodiment, the type flag bits of the data operation signal comprise a first type flag bit and a second type flag bit; the value of the first type flag bit includes I/O, which characterizes the data operation signal as an I/O instruction;
the second type flag bit is used for characterizing that the data operation signal is a broadcast or multicast instruction among the I/O instructions.
In one embodiment, the information of the data to be operated includes at least one of a source address of the data to be operated in the shared memory, a length of the data to be operated, and a return address of the data after the data is operated.
In one embodiment, the data operation signal further includes jump information, and the jump information includes a jump step size and a data length operated after each jump.
In one embodiment, the jump information includes stride jump information and/or segment jump information; the stride jump information is used for characterizing the jump step size of the data operation signal each time; the segment jump information is used for characterizing the preset segment size by which the data operation signal operates each time.
In one embodiment, the data operation signal further includes a functional flag bit for characterizing a processing operation performed by the transmission circuit on the read data.
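The fields enumerated in the embodiments above can be collected into a single record. The following Python sketch is illustrative only; the field names, types, and example values are assumptions for clarity, not part of the claimed signal format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataOperationSignal:
    """Illustrative record of the fields a data operation signal may carry."""
    type_flag: str                       # e.g. "CAST" for a broadcast/multicast instruction
    recv_flag: int                       # which target machine learning unit receives the data
    src_addr: int                        # source address of the data in shared memory
    length: int                          # length of the data to be operated on
    return_addr: int = 0                 # return address for the data after the operation
    jump_stride: Optional[int] = None    # jump step size (stride/segment jump information)
    jump_seg_len: Optional[int] = None   # length of the data operated on after each jump
    func_flag: Optional[str] = None      # processing the transmission circuit applies to read data

sig = DataOperationSignal(type_flag="CAST", recv_flag=0,
                          src_addr=0x1000, length=256)
assert sig.type_flag == "CAST"
```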
In one embodiment, the transmission circuit includes:
the instruction storage unit is used for storing the data operation signal;
the instruction processing unit is used for analyzing the data operation signal to obtain the type zone bit of the data operation signal and the information of the data to be operated;
a store queue unit to store an instruction queue, the instruction queue comprising: a plurality of data operation signals to be executed in the front-to-back order of the instruction queue.
In one embodiment, the transmission circuit further includes:
the dependency relationship processing unit is used for determining whether the (s-1)-th data operation signal before the s-th data operation signal is associated with the s-th data operation signal; if so, the s-th data operation signal is cached in the instruction storage unit, and after the (s-1)-th data operation signal is executed, the s-th data operation signal is extracted from the instruction storage unit and transmitted to the instruction processing unit;
wherein the determining whether the s-1 th data operation signal before the s-th data operation signal and the s-th data operation signal have an association relationship comprises:
extracting a first storage address interval of the data required by the s-th data operation signal according to the s-th data operation signal, and extracting a zeroth storage address interval of the data required by the (s-1)-th data operation signal according to the (s-1)-th data operation signal; if the first storage address interval and the zeroth storage address interval have an overlapping region, determining that the s-th data operation signal and the (s-1)-th data operation signal have an association relationship; if the first storage address interval and the zeroth storage address interval have no overlapping region, determining that the s-th data operation signal and the (s-1)-th data operation signal have no association relationship.
In a second aspect, an embodiment of the present invention provides a data processing method, which is applied to the data processing apparatus according to any one of the embodiments of the first aspect, where the method includes:
a transmission circuit in the data processing device receives a data operation signal sent by a machine learning device in the data processing device, wherein the data operation signal carries a type flag bit of the data operation signal and information of data to be operated;
the transmission circuit determines the operation executed on the data in the shared memory according to the type flag bit of the data operation signal, executes the operation on the data to be operated according to the information of the data to be operated to obtain the input data required by the machine learning device, and returns the input data to the machine learning device;
and the machine learning device executes machine learning operation according to the input data to obtain output data, takes the output data as new input data and transmits the new input data to a shared memory through the transmission circuit for data storage.
In one embodiment, the machine learning device comprises at least one machine learning unit, the data operation signal further comprises a data reception flag, and the returning the input data to the machine learning device comprises:
and the transmission circuit determines, according to the value of the data receiving flag bit, the target machine learning unit that receives the input data, and sends the input data to the target machine learning unit.
In one embodiment, the method further comprises:
and if the value of the type flag bit of the data operation signal is CAST, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction.
In one embodiment, the type flag bits of the data operation signal include a first type flag bit and a second type flag bit, the first type flag bit is used for indicating whether the data operation signal is an I/O instruction, and the second type flag bit is used for indicating whether the data operation signal is a broadcast or multicast instruction in the I/O instruction; the method further comprises the following steps:
if the value of the first type flag bit is I/O, the transmission circuit determines that the data operation signal is an I/O instruction;
if the value of the second type flag bit is 1, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction in the I/O instruction.
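The two-level classification above can be sketched as follows. The concrete flag values (the string "I/O" and the bit value 1) follow the embodiment, while the function name and the returned labels are illustrative assumptions.

```python
def classify_signal(first_flag: str, second_flag: int) -> str:
    """Two-level type decoding: the first type flag bit marks an I/O
    instruction; within I/O instructions, the second type flag bit marks
    a broadcast or multicast instruction."""
    if first_flag == "I/O":
        if second_flag == 1:
            return "broadcast/multicast"
        return "I/O"
    return "other"

assert classify_signal("I/O", 1) == "broadcast/multicast"
assert classify_signal("I/O", 0) == "I/O"
```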
In one embodiment, the information of the data to be operated includes a source address of the data to be operated in the shared memory, a length of the data to be operated, and a data return address after the data is operated, and the performing the operation on the data to be operated according to the information of the data to be operated obtains input data required by the machine learning device, and returns the input data to the machine learning device, including:
the transmission circuit starts to read the shared memory from the source address to acquire the input data meeting the data length;
and the transmission circuit returns the input data to the target machine learning unit according to the data return address and the data receiving flag bit.
In one embodiment, the data operation signal further includes jump information, where the jump information includes a jump step size and the length of the jump data operated on after each jump; the transmission circuit reading the shared memory from the source address and acquiring the input data satisfying the data length includes:
the transmission circuit reads the shared memory from the source address, and acquires first jump data according to the jump data length after the current jump;
the transmission circuit acquires the last address of the jump data and jumps from the last address to a target jump address according to the jump step length;
and the transmission circuit acquires second jump data from the target jump address according to the post-jump jump data length, and so on, until the total length of the jump data acquired after each jump satisfies the required data length.
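The jump-based read described in these steps can be sketched as follows, with a flat Python list standing in for the shared memory. The function and parameter names are illustrative assumptions.

```python
def strided_read(memory, src_addr, total_length, seg_len, stride):
    """Read seg_len items (one segment of jump data), then jump `stride`
    items past the segment's last address, and repeat until total_length
    items have been gathered."""
    out, addr = [], src_addr
    while len(out) < total_length:
        take = min(seg_len, total_length - len(out))
        out.extend(memory[addr:addr + take])   # acquire one jump-data segment
        addr = addr + take + stride            # jump from the last address by the step size
    return out

mem = list(range(20))
# read 2 items, skip 3, repeat: 0,1 then 5,6 then 10,11
assert strided_read(mem, 0, 6, 2, 3) == [0, 1, 5, 6, 10, 11]
```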
In one embodiment, the jump information includes stride jump information and/or segment jump information.
In one embodiment, the transmission circuit in the data processing apparatus receives a data operation signal sent by a machine learning apparatus in the data processing apparatus, and includes:
the transmission circuit analyzes the data operation signal to obtain a type flag bit of the data operation signal and information of data to be operated;
the transmission circuit executes the analyzed data operation signal according to the instruction queue; the instruction queue is used for representing the execution sequence of the data operation signals.
In one embodiment, before the transmitting circuit executes the parsed data operation signal in accordance with an instruction queue, the method further comprises:
the transmission circuit judges the dependency relationship of the adjacent analyzed data operation signals to obtain a judgment result; the dependency relationship represents whether an association relationship exists between the s-th data operation signal and the s-1 th data operation signal before the s-th data operation signal;
and if the judgment result shows that the s-th data operation signal and the s-1 th data operation signal have a dependency relationship, the transmission circuit caches the s-th data operation signal, and extracts the s-th data operation signal after the s-1 th data operation signal is executed.
In one embodiment, the determining the dependency relationship between the adjacent parsed data operation signals by the transmission circuit includes:
the transmission circuit respectively acquires, according to the s-th data operation signal, a first storage address interval of the data required by the s-th data operation signal, and, according to the (s-1)-th data operation signal, a zeroth storage address interval of the data required by the (s-1)-th data operation signal;
if the first storage address interval and the zeroth storage address interval have an overlapped area, the transmission circuit determines that the s-th data operation signal and the s-1 th data operation signal have a dependency relationship;
if the first storage address interval and the zeroth storage address interval do not have an overlapped area, the transmission circuit determines that the s-th data operation signal and the s-1 th data operation signal do not have a dependency relationship.
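The overlap test above amounts to a standard interval-intersection check. In the following sketch a data operation signal is reduced to a (source address, length) pair; this representation is an assumption for illustration.

```python
def intervals_overlap(a_start, a_len, b_start, b_len):
    """Overlap test on half-open address intervals [start, start + len)."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def has_dependency(sig_s, sig_s_minus_1):
    """The s-th signal depends on the (s-1)-th signal if their storage
    address intervals have an overlapping region."""
    return intervals_overlap(sig_s[0], sig_s[1],
                             sig_s_minus_1[0], sig_s_minus_1[1])

assert has_dependency((0x100, 64), (0x120, 64))      # overlapping: dependent
assert not has_dependency((0x100, 64), (0x200, 64))  # disjoint: independent
```

When a dependency is found, the later signal is cached and only executed after the earlier one completes, matching the buffering behavior described above.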
In a third aspect, an embodiment of the present invention provides a combined processing device, where the combined processing device includes the data processing device described in any embodiment of the first aspect, a universal interconnection interface, and a processing device other than the data processing device; the data processing device interacts with the other processing device.
In one embodiment, the combination processing apparatus further includes: and the storage device is respectively connected with the data processing device and the other processing devices and is used for storing the data of the data processing device and the other processing devices.
In a fourth aspect, an embodiment of the present invention provides a machine learning chip, where the machine learning chip includes the combined processing apparatus described in any one of the embodiments of the third aspect.
In a fifth aspect, an embodiment of the present invention provides a machine learning chip package structure, where the machine learning chip package structure includes the machine learning chip described in the embodiment provided in the fourth aspect.
In a sixth aspect, an embodiment of the present invention provides a board, where the board includes the machine learning chip package structure according to the embodiment of the fifth aspect.
In a seventh aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes the board card in the embodiment provided in the sixth aspect.
According to the data processing apparatus, data processing method, and related products above, the transmission circuit acquires, from the shared memory, the input data required by the machine learning device according to a data operation signal sent by the machine learning device and returns the input data to the machine learning device; the machine learning device then performs a machine learning operation on the input data to obtain output data, and the output data is transmitted, as new input data, through the transmission circuit to the shared memory for storage. In this embodiment, because the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated on, the transmission circuit may, upon receiving the data operation signal, determine the type of the data operation signal according to the type flag bit and then perform the operation in combination with the information of the data to be operated on carried in the signal. In this way, the corresponding operation can be quickly located by classification on the type flag bit of the data operation signal, which simplifies the data access logic, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.
Drawings
FIG. 1 is a schematic diagram of a data processing apparatus according to an embodiment;
FIG. 2 is a flow diagram illustrating a data processing method, according to an embodiment;
FIG. 3 is a flowchart illustrating a data processing method according to an embodiment;
FIG. 4 is a flowchart illustrating a data processing method according to an embodiment;
FIG. 5 is a flowchart illustrating a data processing method according to an embodiment;
FIG. 6 is a flowchart illustrating a data processing method according to an embodiment;
FIG. 7 is a schematic structural diagram of a combined processing apparatus according to an embodiment;
FIG. 8 is a schematic diagram of another combined processing apparatus according to an embodiment;
fig. 9 is a schematic structural diagram of a board card in an embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In one embodiment, as shown in fig. 1, a data processing apparatus provided in an embodiment of the present application may be part or all of that shown in fig. 1, and may be implemented by software, hardware, or a combination of software and hardware. The data processing apparatus 10 is for performing processing of machine learning data, the data processing apparatus 10 including: a machine learning device 11, a transmission circuit 12, and a shared memory 13, wherein the machine learning device 11 is connected to the transmission circuit 12, and the transmission circuit 12 is connected to the shared memory 13; the transmission circuit 12 is configured to obtain input data required by the machine learning apparatus 11 from the shared memory 13 according to a data operation signal sent by the machine learning apparatus 11, and return the input data to the machine learning apparatus 11; the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated. Optionally, the machine learning device 11 is configured to perform a machine learning operation according to the input data to obtain output neuron data, and optionally, the machine learning device 11 is further configured to use the output neuron data as new input neuron data and transmit the new input neuron data to the shared memory 13 through the transmission circuit 12 for data storage.
The machine learning device, the transmission circuit, and the shared memory may each be implemented by hardware circuits. Illustratively, the machine learning device may be a device with an arithmetic function composed of a plurality of Machine Learning Units (MLUs), the transmission circuit may be a broadcast bus, and the shared memory may be a non-volatile and/or volatile memory, including but not limited to a Random Access Memory (RAM), a cache memory, and the like. The machine learning device, the transmission circuit, and the shared memory transmit data to one another through interfaces; for example, the machine learning device may send a data operation signal through an interface, and may also send or receive data through an interface. Accordingly, an interface may be either a transmitting interface or a receiving interface; that is, when the interface is a transmitting interface, the machine learning device may transmit a data operation signal or data to the transmission circuit, and when the interface is a receiving interface, the machine learning device may receive a data operation signal or data transmitted by the transmission circuit. The interfaces may take various forms and may be implemented by hardware circuits; this embodiment does not limit the specific hardware form of the interfaces, as long as the data-signal interaction among the machine learning device, the transmission circuit, and the shared memory can be implemented through them. The input data is the data that the machine learning device needs as input when performing a machine learning operation, and may be, for example, input neuron data and weight data.
The above data may be data stored in advance in the shared memory, or may be data output by the machine learning apparatus after performing the machine learning operation; optionally, the machine learning device may be directly connected to the shared memory through a plurality of data I/O interfaces or I/O pins to obtain the data, or optionally, the machine learning device may be connected to the transmission circuit through a plurality of data I/O interfaces or I/O pins, and then connected to the shared memory through the transmission circuit to obtain the data.
The data operation signal may represent that the transmission circuit performs a read operation on data in the shared memory, or that it performs a write operation on data in the shared memory. When the data operation signal sent by the machine learning device is a read operation, the transmission circuit can find the input data at the corresponding address in the shared memory, read it, and return the data to the machine learning device that sent the data operation signal; when the data operation signal sent by the machine learning device is a write operation, the transmission circuit can write the data output by the machine learning device into the shared memory. The data operation signal carries a type flag bit of the data operation signal and information of the data to be operated on, where the type flag bit characterizes the type of the data operation signal; for example, a type flag bit of CAST indicates that the data operation signal is a broadcast or multicast instruction. The information of the data to be operated on indicates the data that the transmission circuit needs when performing the corresponding operation according to the data operation signal. The specific form of the type flag bit and the specific contents of the information of the data to be operated on are not limited in this embodiment and may be determined according to the actual situation.
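The read and write behaviors described above can be sketched as a small dispatcher. Representing the shared memory as a Python list and the operations as strings is an assumption for illustration; the actual circuit operates on hardware addresses.

```python
def handle_signal(op, shared_memory, addr, length, data=None):
    """Dispatch one data operation signal: a read returns the addressed
    input data; a write stores the machine learning device's output data
    into the shared memory."""
    if op == "read":
        return shared_memory[addr:addr + length]     # find and return the input data
    elif op == "write":
        shared_memory[addr:addr + len(data)] = data  # write output data into shared memory
        return None
    raise ValueError(f"unknown operation: {op}")

mem = list(range(8))
assert handle_signal("read", mem, 2, 3) == [2, 3, 4]
handle_signal("write", mem, 0, 2, data=[9, 9])
assert mem[:2] == [9, 9]
```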
The data processing apparatus provided in the present application is applied to machine learning operations, where the machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like. Taking a neural network operation as an example, the operation in the neural network executed by the machine learning device may be an operation in one layer of the neural network. For a multilayer neural network, the implementation process is as follows: in the forward operation, after the execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neuron data calculated in the operation unit as the input neuron data of the next layer for operation (or performs some operations on the output neuron data before taking it as the input neuron data of the next layer), and at the same time the weight data is replaced by the weight data of the next layer; in the inverse operation, after the inverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradient (which may also serve as input neuron data) calculated in the operation unit as the output neuron gradient of the next layer for operation (or performs some operations on the input neuron gradient before taking it as the output neuron gradient of the next layer), and at the same time replaces the weight data with the weight data of the next layer. Optionally, the neural network related to the embodiments of the present application may be not only an artificial neural network but also a spiking neural network, which is not limited in this embodiment.
The machine learning apparatus according to this embodiment may perform the machine learning operation based on the input data, for example, in the machine learning operation, the machine learning apparatus may calculate neuron data output by each layer of neural network for a multi-layer neural network, and may perform an operation set included in a series of machine learning operations such as a multiplication operation, a summation operation, and a function operation on a plurality of input data corresponding to input terminals of each layer of neural network. After the machine learning device obtains the output neuron data of the current layer through the machine learning operation, the output neuron data can be used as the input neuron data of the next layer of neural network to perform the machine learning operation again, and before that, the output neuron data of the current layer can be written into the shared memory through the transmission circuit to be stored, so that the machine learning device can read the data at any time to perform the machine learning operation.
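The layer-by-layer flow described above can be sketched in a few lines of code. This is an illustrative assumption, not the patent's implementation: the ReLU activation, the list-based layer parameters, and the `shared_memory` dictionary standing in for the shared memory are all hypothetical, chosen only to show output neuron data of one layer being stored and then reused as input neuron data of the next layer.

```python
# Illustrative sketch (not the patent's implementation): each layer's output
# neuron data is written to a stand-in shared memory, then becomes the next
# layer's input neuron data. Activation choice and data layout are assumed.

def layer_forward(inputs, weights, bias):
    """One layer: weighted sums followed by a ReLU activation."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]

def multilayer_forward(inputs, layers, shared_memory):
    """Run layers in sequence; each layer's output is stored before reuse."""
    data = inputs
    for i, (weights, bias) in enumerate(layers):
        data = layer_forward(data, weights, bias)
        shared_memory[f"layer_{i}_out"] = data  # stored so it can be re-read
    return data

shared = {}
layers = [
    ([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),  # layer 0: 2 inputs -> 2 outputs
    ([[1.0, 1.0]], [0.0]),                    # layer 1: 2 inputs -> 1 output
]
result = multilayer_forward([1.0, 2.0], layers, shared)
```

The intermediate result `shared["layer_0_out"]` corresponds to the current layer's output neuron data that the text says is written to the shared memory before the next layer's operation begins.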
Specifically, in practical applications, the transmission circuit acquires the input data required by the machine learning device from the shared memory according to the data operation signal sent by the machine learning device, and returns the input data to the machine learning device through the receiving interface; the machine learning device then executes a machine learning operation on the input data to obtain output data, and transmits the output data as new input data to the shared memory through the transmission circuit for storage. In this embodiment, since the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated, the transmission circuit can determine the type of the data operation signal according to the type flag bit after receiving the signal, and then perform the operation in combination with the information of the data to be operated carried in the signal. Therefore, by classifying according to the type flag bit of the data operation signal, the corresponding operation can be located quickly, which simplifies the data access logic, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.
In one embodiment, as shown in fig. 2, a data processing apparatus provided in an embodiment of the present application includes: the machine learning device 11 comprises at least one machine learning unit 14, and the data operation signal further comprises a data receiving flag bit for characterizing a target machine learning unit receiving the input data.
At least one machine learning unit (i.e., MLU) included in the machine learning apparatus may share one data receiving interface for its data signal operations, and each machine learning unit may be connected to the transmission circuit through the sending interface or the shared data receiving interface. It should be noted that both the sending interface and the shared data receiving interface may be implemented by hardware circuits, and their types are not limited in this embodiment. The data operation signal includes a data receiving flag bit indicating the target machine learning unit that can receive the input data. The data receiving flag bit may be marked in the following manner: a target machine learning unit that can receive the input data is marked as 1, and correspondingly a target machine learning unit that cannot receive the input data is marked as 0. It is to be understood that marking a receiving target machine learning unit as 1 is only one possible convention; in practical applications, a target machine learning unit that can receive data may instead be marked as 0 and one that cannot receive data marked as 1. This embodiment does not limit the specific marking form of the data receiving flag bit.
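As a minimal sketch of this marking convention (assuming a packed bit field, one bit per MLU, with 1 meaning "receives" as in the text), the target machine learning units can be recovered from the data receiving flag bits like this:

```python
# Hypothetical decode of a data-receiving flag field: bit i marks whether
# MLUi may receive the returned input data (1 = receives, per the text's
# convention). The packed-integer representation is an assumption.

def decode_receive_flags(flags, num_units):
    """Return the indices of target MLUs whose flag bit is set."""
    return [i for i in range(num_units) if (flags >> i) & 1]

# Flags 0b1011 mark MLU3, MLU1 and MLU0 as targets; MLU2 is skipped.
targets = decode_receive_flags(0b1011, 4)
```

The opposite convention the paragraph mentions (0 = receives) would simply invert the bit test.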
In this embodiment, according to the marking of the data receiving flag bit carried in the data operation signal, the target machine learning unit capable of receiving the input data in the machine learning device can be determined, so that when data is received, the receiving machine learning units are determined according to the data receiving flag bit in the data operation signal. This simplifies the memory access logic in the data access process, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.
The following describes the relationship between the type flag bit of the data operation signal and the information of the data to be operated and the data receiving flag bit in the above embodiments, respectively, by several embodiments.
In one embodiment, the value of the type flag bit of the data operation signal comprises CAST, which characterizes the data operation signal as a broadcast or multicast instruction. Optionally, the information of the data to be operated includes a source address of the data to be operated in the shared memory, a length of the data to be operated, and a data return address after the data is operated.
In this embodiment, the type flag bit of the data operation signal is used to indicate the operation type of the data operation signal. Illustratively, as shown in Table 1 below, the type flag bit of the data operation signal is CAST, which indicates that the data operation signal is a broadcast or multicast instruction, and the information of the data to be operated includes a source address 0x110011, a destination address 0x000100, and a data length 0x0100. The data length is a length set by the user, who may set it to one value or multiple values; this embodiment does not limit the specific value or number of the set length. The data receiving flag bit is marked 1 for three MLUs, indicating that these three MLUs can receive data, and 0 for one MLU, indicating that it cannot. Specifically, the transmission circuit reads data of length 0x0100 starting from the address 0x110011 in the shared memory according to the data operation signal, and then writes the data to the address 0x000100 in each of MLU3, MLU1, and MLU0 in the machine learning device.
TABLE 1

Type flag bit   Source address   Destination address   Data length   MLU3   MLU2   MLU1   MLU0
CAST            0x110011         0x000100              0x0100        1      0      1      1
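The CAST broadcast of Table 1 can be simulated with a toy model. This is an assumption-laden sketch, not the patent's circuit: the shared memory and each MLU's local memory are modeled as flat Python lists, and the receive flags as a packed bit field.

```python
# Toy simulation (assumptions, not the patent's implementation) of the CAST
# broadcast: read `length` words starting at `src` in the shared memory,
# then write them to address `dst` in every MLU whose receive flag is 1.

def cast_broadcast(shared_mem, mlus, src, dst, length, recv_flags):
    data = shared_mem[src:src + length]          # read from shared memory
    for i, mlu in enumerate(mlus):
        if (recv_flags >> i) & 1:                # only flagged targets
            mlu[dst:dst + length] = data         # write to local memory
    return data

shared_mem = list(range(64))                     # stand-in shared memory
mlus = [[0] * 64 for _ in range(4)]              # MLU0..MLU3 local memories
# Flags 0b1011: MLU3, MLU1 and MLU0 receive; MLU2 does not (as in Table 1).
cast_broadcast(shared_mem, mlus, src=16, dst=4, length=8, recv_flags=0b1011)
```

After the call, MLU0, MLU1, and MLU3 hold a copy of the source region at their destination address, while MLU2's memory is untouched, mirroring the behavior described for Table 1.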
In another embodiment, the type flag bits of the data operation signal may include a first type flag bit and a second type flag bit. Optionally, a value of the first type flag bit includes I/O, and the data operation signal is characterized as an I/O instruction; the second type flag bit is used for representing that the data operation signal is a broadcast or multicast instruction in the I/O instruction.
In this embodiment, the data operation signal includes two type flag bits. The first type flag bit indicates the type of the data operation signal; the second type flag bit is arranged in the operation information of the data operation signal and represents a specific subtype of the data operation signal. The data receiving flag bit is the same as in the above embodiment and indicates the target machine learning unit capable of receiving the input data. Illustratively, as shown in Table 2 below, the first type flag bit has the value I/O, which indicates that the data operation signal is an I/O instruction, and the second type flag bit has the value 1, which indicates that the data operation signal is a broadcast or multicast instruction within the I/O instruction; accordingly, when the second type flag bit has the value 0, the data operation signal is not a broadcast or multicast instruction. The information of the data to be operated includes a source address 0x110011, a destination address 0x000100, and a data length 0x0100, where the data length is a length set by the user, who may set it to one value or multiple values; this embodiment does not limit this. Specifically, the transmission circuit reads data of length 0x0100 starting from 0x110011 in the shared memory according to the data operation signal, and then writes the data to 0x000100 in each of MLU3, MLU1, and MLU0 in the machine learning device.
TABLE 2

First type flag bit   Second type flag bit   Source address   Destination address   Data length   MLU3   MLU2   MLU1   MLU0
I/O                   1                      0x110011         0x000100              0x0100        1      0      1      1
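The two-level classification in Table 2 amounts to a nested check: first on the type flag bit, then on the subtype flag inside the operation information. The sketch below is illustrative only; the dictionary field names are assumptions, not the patent's signal encoding.

```python
# Hedged sketch of the two-level type check from Table 2: the first flag
# says whether the signal is an I/O instruction, and a second flag inside
# the operation info marks the broadcast/multicast subtype. Field names
# ("type", "subtype") are hypothetical.

def classify_signal(signal):
    if signal.get("type") != "I/O":
        return "other"
    return "broadcast/multicast" if signal.get("subtype") == 1 else "plain I/O"

kinds = [classify_signal(s) for s in (
    {"type": "I/O", "subtype": 1},   # I/O instruction, broadcast/multicast
    {"type": "I/O", "subtype": 0},   # I/O instruction, not broadcast
    {"type": "CAST"},                # handled by the single-flag scheme
)]
```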
In still another embodiment, based on the above table 1 or table 2, the data operation signal may further include jump information, where the jump information includes a jump step size and a data length operated after each jump. Optionally, the jump information includes stride jump information and/or segment jump information.
In this embodiment, the jump information included in the data operation signal instructs the transmission circuit how to read the data to be operated according to the data operation signal. The reading process is as follows: the transmission circuit starts reading data in the shared memory from the source address in the information of the data to be operated; after the current jump, the data read with the jump data length is taken as the first jump data. The transmission circuit then obtains the last address of the first jump data and, according to the jump step size in the jump information, jumps from that last address by the jump step size to a target jump address; it can be understood that the length between the last address of the first jump data and the target jump address is the jump step size in the jump information. Next, the transmission circuit reads data of a preset length starting from the target jump address and takes it as the second jump data. If the length between the address of the second jump data and the source address at which the jumping started satisfies the data length required by the machine learning device, the reading of the required data is complete; otherwise, jumping continues in the above order from the last address of the second jump data, and data is read until the length between the address of the second jump data and the source address at which jumping started satisfies the required data length, i.e., until all data required by the machine learning device has been read.
Illustratively, as shown in Table 3 below, the process of reading data by the transmission circuit in this embodiment is as follows. If the jump information includes stride jump information, which represents the step size of each jump of the data operation signal, the transmission circuit starts reading data in the shared memory from the source address 0x110011 in the data information, reads data of a preset length (smaller than the data length 0x0100 in the data information in Table 3 below), jumps over an address span of the stride length (0x0008), reads data of the preset length again, and continues in this order until the total length of the read data equals the data length 0x0100, which indicates that the data has been completely read. If the jump information further includes segment jump information, which represents a preset segment size of the data operation signal, the transmission circuit starts reading data in the shared memory from the source address 0x110011, first reads data of the segment length (0x0010), then jumps over an address span of the stride length (0x0008), then reads data of the segment length (0x0010) again, and continues in this order until the total length of the read data equals the data length 0x0100 in Table 3 below, which indicates that the data has been completely read. When the jump information includes only segment jump information and no stride jump information, the transmission circuit reads data of the segment length (0x0010) consecutively from the source address 0x110011 until the total length of the read data equals the data length 0x0100 in Table 3 below, which indicates that the data has been completely read.

It should be further noted that the lengths and names of the stride jump information and segment jump information listed in the embodiment of the present application are only examples, and the embodiment of the present application does not limit them.
TABLE 3

Type flag bit   Source address   Destination address   Data length   Stride length   Segment length   MLU3   MLU2   MLU1   MLU0
CAST            0x110011         0x000100              0x0100        0x0008          0x0010           1      0      1      1
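The segment-plus-stride reading loop described for Table 3 can be sketched as follows. The flat-list memory model and parameter names are assumptions for illustration; the patent does not specify this code.

```python
# Illustrative reading loop for the jump information of Table 3: read one
# segment of data, skip `stride` addresses, and repeat until `total_len`
# units have been read. Memory is modeled as a flat list (an assumption).

def strided_read(memory, src, total_len, segment, stride):
    out, addr = [], src
    while len(out) < total_len:
        out.extend(memory[addr:addr + segment])  # read one segment
        addr += segment + stride                 # jump past the gap
    return out[:total_len]

memory = list(range(100))
# Read 8 units from address 10, in segments of 2 with a stride of 3.
data = strided_read(memory, src=10, total_len=8, segment=2, stride=3)
```

Setting `stride=0` gives the last case in the text, where only segment jump information is present and the segments are read consecutively.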
Considering that the data read by the transmission circuit according to the data operation signal may not be in the format required by the machine learning device, the transmission circuit may need to perform certain processing on the read data before transmitting it to the machine learning device. Optionally, in the data processing device provided in the embodiment of the present application, the data operation signal further includes a function flag bit for characterizing the processing operation that the transmission circuit performs on the read data. The function flag bit indicates that the transmission circuit needs to perform corresponding processing on the read data according to it; there may be one or more function flag bits, which this embodiment does not limit. Illustratively, the transmission circuit may process the read data according to the function flag bit of the data operation signal and then transmit the data to the machine learning device, so that the machine learning device can immediately recognize the data and perform operations upon receiving it, thereby improving data processing efficiency and greatly improving the access speed of the machine learning chip during data access.
Generally, before the data processing apparatus provided in the embodiment of the present application performs read/write processing according to a data operation signal, the data operation signal needs to be parsed. Optionally, the transmission circuit includes: an instruction storage unit for storing the data operation signal; an instruction processing unit for parsing the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated; and a store queue unit for storing an instruction queue, the instruction queue comprising a plurality of data operation signals to be executed in the order of the queue. In a typical data processing flow the number of data operation signals is large, and while one of them is being processed, the others need to be stored in the instruction storage unit. The instruction processing unit parses the data operation signal, extracting the data information carried in it. In addition, when the fetching, decoding, and issuing of the data operation signals are pipelined, all data operation signals need to complete these stages in sequence, and the instruction queue is therefore maintained by the store queue unit.
Because the instruction processing unit processes the next data operation signal in the queue after finishing the current one, it must be ensured that a data operation signal that depends on the preceding one is not executed prematurely. Optionally, the transmission circuit further includes a dependency relationship processing unit, configured to determine whether the s-th data operation signal is associated with the (s-1)-th data operation signal that precedes it; if so, the s-th data operation signal is cached in the instruction storage unit, and after the (s-1)-th data operation signal has been executed, the s-th data operation signal is extracted from the instruction storage unit and transmitted to the instruction processing unit. Determining whether the (s-1)-th data operation signal and the s-th data operation signal are associated includes: extracting a first storage address interval of the data required by the s-th data operation signal according to the s-th data operation signal, and extracting a zeroth storage address interval of the data required by the (s-1)-th data operation signal according to the (s-1)-th data operation signal. If the first storage address interval and the zeroth storage address interval have an overlapping region, the s-th and (s-1)-th data operation signals are determined to be associated; if they have no overlapping region, the two signals are determined not to be associated.
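The overlap test at the heart of this dependency check is simple interval arithmetic. A minimal sketch under stated assumptions (half-open intervals and a dictionary signal representation, neither of which the patent specifies):

```python
# Minimal sketch of the dependency check: two data operation signals are
# associated when their operand address intervals overlap; a dependent
# signal must wait until its predecessor completes. The half-open interval
# representation (start, end) is an assumption for illustration.

def intervals_overlap(a, b):
    """True if half-open address intervals (start, end) share any address."""
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(signal_s, signal_prev):
    first = signal_s["addr_interval"]       # s-th signal's interval
    zeroth = signal_prev["addr_interval"]   # (s-1)-th signal's interval
    return intervals_overlap(first, zeroth)

dep = has_dependency({"addr_interval": (0x100, 0x180)},
                     {"addr_interval": (0x140, 0x200)})    # overlapping
indep = has_dependency({"addr_interval": (0x100, 0x180)},
                       {"addr_interval": (0x200, 0x280)})  # disjoint
```

In the first pair the intervals share the region 0x140 to 0x180, so the s-th signal would be cached until its predecessor finishes; in the second pair the signals can proceed independently.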
In this embodiment, before the data operation device operates according to the data processing signals, the not-yet-executed data processing signals are stored in order, and the device parses and decodes them in sequence during use. When a data operation signal is parsed and decoded, the continuity of the data operation signals is ensured by judging the association between two adjacent data operation signals, so that this preliminary, ordered preparation guarantees the smooth execution of the corresponding operations in the later stage, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.
The embodiment of the present application further provides a data processing method, which can be applied to the hardware circuit shown in fig. 1, where the circuit includes: the device comprises a machine learning device 11, a transmission circuit 12 and a shared memory 13, wherein the machine learning device 11 and the transmission circuit 12, and the transmission circuit 12 and the shared memory 13 are connected through interfaces, which can be realized through a hardware circuit, and the embodiment does not limit the specific hardware forms of the interfaces. The transmission circuit 12 is configured to obtain input data required by the machine learning apparatus 11 from the shared memory 13 according to a data operation signal sent by the machine learning apparatus 11, and return the input data to the machine learning apparatus 11, and the machine learning apparatus 11 is configured to perform a machine learning operation according to the input data to obtain output neuron data, and transmit the output neuron data as new input neuron data to the shared memory 13 through the transmission circuit 12 for data storage.
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit it. The data processing method provided by the embodiment of the present application aims to solve the technical problem of how to improve the access speed of a machine learning chip during data access when there is a large amount of data access or shared storage data. The following describes in detail the technical solutions of the present application, and how they solve the above technical problem, through embodiments and with reference to the drawings. The following specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. It should be noted that, in the data processing method provided by the present application, the execution subject is a transmission circuit; the execution subject may also be a data processing apparatus, and the apparatus may be implemented as part or all of a data analysis terminal by software, hardware, or a combination of the two.
In one embodiment, fig. 2 provides a data processing method, and the embodiment relates to a specific process in which the transmission circuit determines the type of the data operation signal according to the type flag bit of the data operation signal, so as to locate the corresponding operation, and then acquires data required by the machine learning device from the shared memory according to the operation, so as to improve the access speed. As shown in fig. 2, the method includes:
s101, a transmission circuit in the data processing device receives a data operation signal sent by a machine learning device in the data processing device, wherein the data operation signal carries a type flag bit of the data operation signal and information of data to be operated.
The machine learning device may be a device with an arithmetic function formed by a plurality of MLUs, the transmission circuit may be a broadcast bus, and the shared memory may be a nonvolatile and/or volatile memory including, but not limited to, a Random Access Memory (RAM), a cache memory, and the like. In this embodiment, the transmission circuit in the data processing apparatus receives a data operation signal sent by a machine learning apparatus in the data processing apparatus, where the data operation signal carries information of a type flag of the data operation signal and data to be operated, and the data operation signal transmission between the transmission circuit and the machine learning apparatus may be through an interface. The transmission circuit can determine the type of the data operation signal and the data information needed to be used in the operation according to the type flag bit of the data operation signal carried by the data operation signal and the data information to be operated.
And S102, the transmission circuit determines the operation executed on the data in the shared memory according to the type flag bit of the data operation signal, executes the operation on the data to be operated according to the information of the data to be operated, obtains input data required by the machine learning device, and returns the input data to the machine learning device.
Based on the data operation signal sent by the machine learning device and received by the transmission circuit in step S101, the transmission circuit determines, according to the type flag bit of the data operation signal, the operation to be performed on the data in the shared memory, determines, according to the information of the data to be operated in the data operation signal, which data in the shared memory is to be operated on (namely, the data to be operated), obtains the input data required by the machine learning device, and returns the input data to the machine learning device. The input data is the data that the machine learning device needs as input when performing the machine learning computation. The data may be stored in the shared memory in advance, or may be output by the machine learning device after it executes a machine learning operation.
And S103, executing machine learning operation by the machine learning device according to the input data to obtain output data, taking the output data as new input data, and transmitting the new input data to the shared memory through the transmission circuit for data storage.
In this step, the machine learning device performs a machine learning operation on the input data transmitted by the transmission circuit in step S102 to obtain output data, and then transmits the output data as new input data to the shared memory through the transmission circuit for storage. In the forward operation, after the operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neuron data calculated in the operation unit as the input neuron data of the next layer (or performs some operations on the output neuron data and then takes the result as the input neuron data of the next layer), and at the same time replaces the weight data with the weight data of the next layer; in the inverse operation, after the inverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradient (which can also serve as input neuron data) calculated in the operation unit as the output neuron gradient of the next layer (which can also serve as output neuron data) (or performs some operations on the input neuron gradient and then takes the result as the output neuron gradient of the next layer), and at the same time replaces the weight data with the weight data of the next layer. Optionally, the neural network involved in the embodiment of the present application may be not only an artificial neural network but also a spiking neural network, which is not limited in this embodiment.
The machine learning apparatus according to this embodiment may perform the machine learning operation based on the input data, for example, in the machine learning operation, the machine learning apparatus may calculate neuron data output by each layer of neural network for a multi-layer neural network, and may perform an operation set included in a series of machine learning operations such as a multiplication operation, a summation operation, and a function operation on a plurality of input data corresponding to input terminals of each layer of neural network. After the machine learning device obtains the output neuron data of the current layer through the machine learning operation, the output neuron data can be used as the input neuron data of the next layer of neural network to perform the machine learning operation again, and before that, the output neuron data of the current layer can be written into the shared memory through the transmission circuit to be stored, so that the machine learning device can read the data at any time to perform the machine learning operation.
In the data processing method provided by this embodiment, the transmission circuit obtains the input data required by the machine learning device from the shared memory according to a data operation signal that is sent by the machine learning device through the sending interface and that carries the type flag bit of the data operation signal and the information of the data to be operated, and returns the input data to the machine learning device through the receiving interface; the machine learning device then performs a machine learning operation on the input data to obtain output data and transmits the output data as new input data to the shared memory through the transmission circuit for storage. In this embodiment, since the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated, the transmission circuit can determine the type of the data operation signal according to the type flag bit after receiving the signal, and then perform the corresponding operation according to the information of the data to be operated carried in the signal. Therefore, by classifying according to the type flag bit of the data operation signal, the corresponding operation can be located quickly, which simplifies the data access logic, improves data access efficiency, and greatly improves the access speed of the machine learning chip during data access.
In one embodiment, the machine learning apparatus includes at least one machine learning unit, the data operation signal further includes a data receiving flag, and the returning the input data to the machine learning apparatus includes: and the transmission circuit determines a target machine learning unit for receiving the input data according to the value of the data receiving zone bit and sends the input data to the target machine learning unit.
In this embodiment, at least one machine learning unit (i.e., MLU) included in the machine learning apparatus may share one data receiving interface for its data signal operations. The MLU may perform signal or data transmission with the transmission circuit through the sending interface or the shared data receiving interface. It should be noted that both the sending interface and the shared data receiving interface may be implemented by hardware circuits, and their types are not limited in this embodiment. The data operation signal includes a data receiving flag bit indicating the target machine learning unit that can receive the input data. The data receiving flag bit may be marked in the following manner: a target machine learning unit that can receive the input data is marked as 1. It is to be understood that marking the target machine learning unit as 1 is only one possible convention; in practical applications, a target machine learning unit that can receive data may instead be marked as 0. This embodiment does not limit the specific marking form of the data receiving flag bit. Specifically, the transmission circuit determines the target MLU that receives the input data according to the value of the data receiving flag bit in the data operation signal, and transmits the input data to the target MLU.
In the embodiment, the transmission circuit can determine the target machine learning unit which can receive the input data in the machine learning device according to the marking condition of the data receiving zone bit carried in the data operation signal, so that each machine learning unit of the machine learning device is determined according to the data receiving zone bit in the data operation signal when receiving data, the memory access logic in the data memory access process is simplified, the data access efficiency is improved, and the access speed of the machine learning chip in the data access process is greatly improved.
Optionally, when the value of the type flag bit of the data operation signal is CAST, the transmission circuit determines that the data operation signal is a broadcast or multicast command. In this alternative, the type flag bit of the data operation signal is used to indicate the operation type of the data operation signal, and the type flag bit of the data operation signal is CAST, which indicates that the data operation signal is a broadcast or multicast instruction.
Optionally, the type flag bits of the data operation signal may include a first type flag bit and a second type flag bit, where the first type flag bit is used to indicate whether the data operation signal is an I/O instruction, and the second type flag bit is used to indicate whether the data operation signal is a broadcast or multicast instruction in the I/O instruction. Therefore, when the value of the first type flag bit is I/O, the transmission circuit determines that the data operation signal is an I/O instruction; if the value of the second type flag bit is 1, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction in the I/O instruction.
In this alternative, the data operation signal includes two data type data flag bits, where the first type data flag bit indicates the type of the data operation signal; the second type flag is set in the operation information of the data operation signal to indicate a specific subtype of the data operation signal, and specifically, when a value of the first type flag in the data operation signal is I/O, the transmission circuit determines that the data operation signal is an input/output command, and if the value of the second type flag in the data operation signal is 1, the transmission circuit determines that the data operation signal is a broadcast or multicast command in the input/output command.
In one embodiment, fig. 3 provides a data processing method, and this embodiment relates to a specific process in which the transmission circuit reads data from the shared memory according to the data information carried by the data operation signal, and returns the read data to the target machine learning unit according to the data operation information. As shown in fig. 3, if the information of the data to be operated includes a source address of the data to be operated in the shared memory, a length of the data to be operated, and a data return address after the data is operated, step S103 includes:
S201, the transmission circuit reads the shared memory starting from the source address, and acquires the input data satisfying the data length.
In this embodiment, the information of the data to be operated carried by the data operation signal includes the source address of the data to be operated in the shared memory, the length of the data to be operated, and the data return address after the data is operated. The transmission circuit therefore starts reading from the source address in the shared memory and reads data satisfying the required data length according to a preset rule. The length of the data to be operated is set by the user according to the actual situation, which is not limited in this embodiment. The preset rule by which the transmission circuit reads data satisfying the data length from the shared memory is likewise established by the user according to the actual situation and is not limited in this embodiment; for example, the data may be read address by address starting from the source address until the amount of data read satisfies the data length.
S202, the transmission circuit returns the input data to the target machine learning unit according to the data return address and the data receiving flag bit.
Based on the input data satisfying the data length acquired by the transmission circuit in step S201, the data is returned to the data return address in the information of the data to be operated, where this data return address may be an address in one or more target machine learning units of the machine learning device. The transmission circuit determines the target machine learning unit in the machine learning device to which the data is returned according to the data receiving flag bit carried in the data operation signal.
In the data processing method provided by this embodiment, the transmission circuit reads the shared memory starting from the source address, acquires the input data satisfying the data length, and returns the input data to the target machine learning unit according to the data return address and the data receiving flag bit.
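Steps S201 and S202 can be sketched as follows. This is an illustrative Python sketch: the flat byte array standing in for the shared memory, the one-bit-per-unit encoding of the data receiving flag, and all names are assumptions for illustration, not the embodiment's actual hardware behavior.

```python
def read_and_return(shared_memory: bytes, source_address: int, data_length: int,
                    return_address: int, receive_flags: int, num_units: int) -> dict:
    """S201: read data_length bytes starting at source_address;
    S202: select the target machine learning units from the receive flag bits."""
    input_data = shared_memory[source_address:source_address + data_length]
    # Assumed encoding: bit i of receive_flags marks machine learning unit i as a target.
    targets = [i for i in range(num_units) if (receive_flags >> i) & 1]
    return {"return_address": return_address, "data": input_data, "targets": targets}

mem = bytes(range(16))
result = read_and_return(mem, 4, 3, 0x100, 0b0101, 4)
print(list(result["data"]), result["targets"])  # [4, 5, 6] [0, 2]
```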
In an embodiment, fig. 4 provides a data processing method, and based on any of the above embodiments, the operation information in this embodiment may further include jump information, where the jump information includes a jump step size and a jump data length operated after each jump. The embodiment relates to a specific process that the transmission circuit reads data in the shared memory according to jump information in the operation information. As shown in fig. 4, S201 includes:
S301, the transmission circuit reads the shared memory starting from the source address, and acquires first jump data according to the jump data length after the current jump.
In this embodiment, the operation information of the data operation signal includes jump information, which instructs the transmission circuit to read the data to be operated according to the rule described by the jump information. The jump information includes a jump step size and a jump data length operated after each jump, where the jump data length may be a preset data length. Optionally, the jump information includes stride jump information and/or segment jump information, where the stride jump information represents the jump step size of the data operation signal each time, and the segment jump information represents the preset segment size of the data operation signal each time.
Specifically, the transmission circuit reads the shared memory starting from the source address in the information of the data to be operated, and after the current jump determines the data of the read jump data length as the first jump data. The first jump data is the data obtained after the transmission circuit jumps over data of a preset length during reading, where the preset length is set by the user according to the actual situation and is not limited in this embodiment.
S302, the transmission circuit acquires the last address of the first jump data and jumps from the last address to a target jump address according to the jump step length.
Based on the first jump data read in the step S301, the transmission circuit obtains the last address of the first jump data, and jumps from the last address of the first jump data to the target jump address according to the jump step (e.g., stride step) in the jump information, where it is understood that the length between the last address of the first jump data and the target jump address is the jump step in the jump information.
S303, the transmission circuit acquires second jump data from the target jump address according to the jump data length after the jump, until the total length of the jump data obtained after each jump satisfies the required data length.
In this step, when the transmission circuit reads data, it reads data of the preset length starting from the target jump address determined in step S302 and determines that data as the second jump data. If the length between the address of the second jump data and the source address at which the jumping began satisfies the data length required by the machine learning device, the reading of the required data is complete. If it does not, the transmission circuit continues jumping and reading from the last address of the second jump data according to the jump sequence of steps S301 to S303, until the length between the address of the data read and the starting source address satisfies the required data length, which means that the machine learning device has finished reading the required data.
The implementation principle and technical effect of the data processing method provided by this embodiment are similar to those of the data processing apparatus described above, and are not repeated here. In this method, the transmission circuit reads the shared memory starting from the source address, acquires the first jump data according to the jump data length after the current jump, jumps from the last address of the first jump data to the target jump address according to the jump step size, and then acquires the second jump data from the target jump address according to the jump data length after the jump, until the total length of the jump data obtained after each jump satisfies the required data length. When the operation information includes jump information, the transmission circuit reads data according to the jump rule, which simplifies the read logic of the transmission circuit, improves the data access efficiency, and greatly increases the access speed of the machine learning chip during data access.
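The jump-read loop of steps S301 to S303 can be sketched as follows. This is an illustrative Python sketch: the flat byte array standing in for the shared memory and all parameter names are assumptions, and the embodiment's transmission circuit is of course hardware rather than software.

```python
def strided_read(shared_memory: bytes, source_address: int, total_length: int,
                 jump_stride: int, jump_data_length: int) -> bytes:
    """Read total_length bytes as fixed-size segments separated by a stride,
    mimicking the S301-S303 jump-read loop."""
    out = bytearray()
    addr = source_address
    while len(out) < total_length:
        # S301/S303: read a segment of jump_data_length bytes at the current address
        out += shared_memory[addr:addr + jump_data_length]
        # S302: jump from the last address of the segment by the jump step size
        addr += jump_data_length + jump_stride
    return bytes(out[:total_length])

mem = bytes(range(32))
# Read 6 bytes as 2-byte segments, skipping 3 bytes between segments.
print(list(strided_read(mem, 0, 6, 3, 2)))  # [0, 1, 5, 6, 10, 11]
```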
Since the data operation signal received by the transmission circuit is an encoded instruction, it must first be decoded and analyzed before the transmission circuit can operate according to it. An embodiment of the present application therefore provides a data processing method in which, as shown in fig. 5, the receiving, by the transmission circuit in the data processing apparatus, of the data operation signal sent by the machine learning apparatus in the data processing apparatus includes:
S401, the transmission circuit analyzes the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated.
It should be noted that the number of data operation signals is generally large during data processing; while one data operation signal is being processed by the transmission circuit, the other data operation signals need to be stored in the transmission circuit. The data operation information may include information such as the length of the data to be operated, a target address, and a source address, which is not limited in this embodiment.
S402, the transmission circuit executes the analyzed data operation signal according to the instruction queue; the instruction queue is used for representing the execution sequence of the data operation signals.
It should be understood that the data operation signals must be completed sequentially in order during execution. Based on the data operation information and the type flag bit obtained after the transmission circuit analyzes the data operation signals in step S401, the transmission circuit executes the analyzed data operation signals according to the instruction queue.
In the data processing method provided by this embodiment, the transmission circuit analyzes the data operation signal to obtain its type flag bit and the information of the data to be operated, and then executes the analyzed data operation signal according to the instruction queue. Because each data operation signal is analyzed before being executed in sequence, the speed at which the transmission circuit performs operations according to the data operation signals is greatly increased.
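Steps S401 and S402 amount to decoding each signal and then executing it strictly in queue order, which can be sketched as follows. This is an illustrative Python sketch; the dictionary field names are assumptions for illustration.

```python
from collections import deque

def analyze(raw_signal: dict) -> dict:
    """S401: split a signal into its type flag bit and operation information."""
    return {"type_flag": raw_signal["type_flag"],
            "operand_info": raw_signal["operand_info"]}

def run_queue(signals: list, execute) -> None:
    """S402: execute the analyzed signals strictly in instruction-queue order."""
    queue = deque(analyze(s) for s in signals)
    while queue:
        execute(queue.popleft())  # front-of-queue signal executes first

order = []
run_queue([{"type_flag": "I/O", "operand_info": n} for n in range(3)],
          lambda s: order.append(s["operand_info"]))
print(order)  # [0, 1, 2]
```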
Considering that the transmission circuit may need to execute data operation signals that are associated with each other when executing them in queue order, this embodiment of the present application provides another embodiment. As shown in fig. 6, before the transmission circuit executes the analyzed data operation signals according to the instruction queue, the method further includes:
S501, the transmission circuit judges the dependency relationship of the adjacent analyzed data operation signals to obtain a judgment result; the dependency relationship represents whether the s-1 th data operation signal before the s-th data operation signal is associated with the s-th data operation signal.
The transmission circuit needs to judge the dependency relationship between the adjacent analyzed data operation signals and, according to the judgment result, determine whether the two adjacent data operation signals are associated. Here the s-th data operation signal represents an arbitrary data operation signal rather than a specific one, and the s-1 th data operation signal represents the signal immediately preceding the s-th data operation signal.
Optionally, one way in which the transmission circuit determines the dependency relationship between the adjacent analyzed data operation signals may be implemented as follows: the transmission circuit acquires a first storage address interval of the data required by the s-th data operation signal according to the s-th data operation signal, and a zeroth storage address interval of the data required by the s-1 th data operation signal according to the s-1 th data operation signal. If the first storage address interval and the zeroth storage address interval have an overlapping region, the transmission circuit determines that the s-th data operation signal and the s-1 th data operation signal have a dependency relationship; if the first storage address interval and the zeroth storage address interval have no overlapping region, the transmission circuit determines that the s-th data operation signal and the s-1 th data operation signal have no dependency relationship. That is, the transmission circuit judges the dependency relationship of the adjacent analyzed data operation signals according to whether the first storage address interval of the s-th data operation signal and the zeroth storage address interval of the s-1 th data operation signal overlap.
S502, if the judgment result is that the s-th data operation signal and the s-1 th data operation signal have a dependency relationship, the transmission circuit caches the s-th data operation signal, and extracts the s-th data operation signal after the s-1 th data operation signal is executed.
Based on the dependency relationship between two adjacent data operation signals judged by the transmission circuit in the above step, the data operation signals are executed in sequence. If the judgment result is that the s-th data operation signal and the s-1 th data operation signal have a dependency relationship, the transmission circuit first caches the s-th data operation signal, and extracts it only after the s-1 th data operation signal has been executed.
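The address-interval overlap test of steps S501 and S502 can be sketched as follows. This is an illustrative Python sketch; half-open intervals are an assumption, since the embodiment does not specify whether interval bounds are inclusive.

```python
def has_dependency(interval_s, interval_prev) -> bool:
    """True if the first storage address interval of the s-th signal overlaps
    the zeroth storage address interval of the s-1 th signal.
    Intervals are (start, end) pairs with the end address exclusive."""
    start_a, end_a = interval_s
    start_b, end_b = interval_prev
    return start_a < end_b and start_b < end_a

print(has_dependency((100, 164), (128, 192)))  # True: the s-th signal must wait
print(has_dependency((0, 64), (64, 128)))      # False: disjoint, may proceed
```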
In the data processing method provided by this embodiment, by judging the association between two adjacent data operation signals, the transmission circuit ensures the continuity of the data operation signals, so that orderly preparation in advance guarantees the smooth execution of the corresponding operations later. This improves the data access efficiency and greatly increases the access speed of the machine learning chip during data access.
It should be understood that although the steps in the flowcharts of figs. 2-6 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2-6 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of their performance is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 7, an embodiment of the present application further provides a combined processing apparatus, which includes the above data processing apparatus, a universal interconnect interface, and other processing apparatuses; the data processing apparatus interacts with the other processing apparatuses to jointly complete the computing operation specified by the user. The other processing apparatuses include one or more types of general-purpose or special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), a neural network processor, and the like; the number of processors they include is not limited. The other processing apparatuses serve as the interface between the data processing apparatus and external data and control, performing data transfer and basic control such as starting and stopping the data processing apparatus; they may also cooperate with the data processing apparatus to complete computational tasks. The universal interconnect interface is used for transmitting data and control instructions between the data processing apparatus and the other processing apparatuses. The data processing apparatus acquires the required input data from the other processing apparatuses and writes it into the shared memory on the data processing apparatus chip; the machine learning apparatus may acquire control instructions from the other processing apparatuses and write them into the data processing apparatus chip; and the data in the shared memory of the data processing apparatus may also be read out and transmitted to the other processing apparatuses.
Optionally, as shown in fig. 8, the combined processing apparatus may further include a storage apparatus connected to the data processing apparatus and the other processing apparatuses respectively. The storage apparatus is used for storing data of the data processing apparatus and the other processing apparatuses, and is particularly suitable for data that cannot be entirely held within the internal storage of the data processing apparatus or the other processing apparatuses.
The combined processing apparatus can serve as the SoC (system on chip) of devices such as a mobile phone, a robot, an unmanned aerial vehicle, or video monitoring equipment, effectively reducing the core area of the control portion, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In one embodiment, the present application further provides a machine learning chip, which includes the above data processing apparatus and/or the above combined processing apparatus.
In an embodiment, an embodiment of the present application further provides a chip packaging structure, which includes the above chip.
In an embodiment, an embodiment of the present application further provides a board card, which includes the above chip package structure. Referring to fig. 9, besides the chip package structure 81, the board card may include other components, including but not limited to: a memory device 82, an interface device 83, and a control device 84. The memory device 82 is connected through a bus to the machine learning chip 811 in the chip package structure 81 and is used for storing data; it may include a plurality of groups of storage units 821, and each group of storage units 821 is connected to the machine learning chip 811 by a bus. It is understood that each group of storage units 821 may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of storage units, and each group may include a plurality of DDR4 chips. In one embodiment, the machine learning chip may internally include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s. In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
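The quoted 25600 MB/s figure follows directly from the DDR4-3200 transfer rate and the 64 data bits per transfer (the 8 ECC bits carry no payload); a quick check:

```python
# DDR4-3200: 3200 mega-transfers per second; 64 of the controller's 72 bits
# carry data (the remaining 8 are ECC), i.e. 8 bytes per transfer.
transfers_per_second = 3200e6
bytes_per_transfer = 64 / 8
bandwidth_mb_s = transfers_per_second * bytes_per_transfer / 1e6
print(bandwidth_mb_s)  # 25600.0
```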
The interface device 83 is electrically connected to the machine learning chip 811 in the chip package structure 81 and is used for data transmission between the machine learning chip 811 and an external device (such as a server or a computer). For example, in one embodiment, the interface device 83 may be a standard PCIe (Peripheral Component Interconnect Express) interface: the data to be processed is transmitted from the server to the machine learning chip through the standard PCIe interface to implement the data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 83 may be another interface; this embodiment of the present application does not limit the concrete form of that interface, as long as the interface device can implement the transfer function. In addition, the calculation results of the machine learning chip 811 are transmitted back to the external device (e.g., the server) by the interface device 83.
The control device 84 is electrically connected to the machine learning chip 811 and is used to monitor the state of the chip. Specifically, the machine learning chip 811 and the control device 84 may be electrically connected through an SPI (Serial Peripheral Interface) interface. The control device may include a microcontroller unit (MCU). Since the machine learning chip may include a plurality of data processing apparatuses and/or combined processing apparatuses, it can drive a plurality of loads and can therefore be in different working states such as heavy load and light load. The control device 84 can be used to regulate the working states of the plurality of data processing apparatuses and/or combined processing apparatuses in the machine learning chip.
In some embodiments, an electronic device is provided that includes the above board card. The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Those skilled in the art should also appreciate that the embodiments described in this specification are all alternative embodiments and that the acts and modules involved are not necessarily required for this application. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is only one kind of logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not carried out. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present application, or the part of it contributing to the prior art, may in essence be embodied wholly or partly in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
It will be understood by those skilled in the art that all or part of the processing of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable memory, which may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiments of the present application have been described in detail above; specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (25)

1. A data processing apparatus for performing processing of machine learning data, the data processing apparatus comprising: the machine learning device is connected with the transmission circuit, and the transmission circuit is connected with the shared memory;
the transmission circuit is used for acquiring input data required by the machine learning device from the shared memory according to a data operation signal sent by the machine learning device and returning the input data to the machine learning device; the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated.
2. The data processing apparatus of claim 1, wherein the machine learning apparatus is configured to perform a machine learning operation based on the input data to obtain output data.
3. The data processing apparatus of claim 2, wherein the machine learning apparatus is further configured to transmit the output data to the shared memory for data storage via the transmission circuit.
4. A data processing apparatus according to any one of claims 1 to 3, wherein the machine learning apparatus comprises at least one machine learning unit;
the data operation signal further comprises a data receiving flag bit for characterizing a target machine learning unit receiving the input data.
5. The data processing apparatus of claim 4, wherein the value of the type flag bit of the data operation signal comprises CAST, characterizing the data operation signal as a broadcast or multicast instruction.
6. The data processing apparatus of claim 5, wherein the type flag bits of the data operation signal comprise a first type flag bit and a second type flag bit;
the value of the first type zone bit comprises I/O, and the data operation signal is characterized as an I/O instruction;
the second type flag bit is used for representing whether the data operation signal is a broadcast or multicast instruction in the I/O instruction.
7. The data processing apparatus according to claim 6, wherein the information of the data to be operated on includes at least one of a source address of the data to be operated on in the shared memory, a length of the data to be operated on, and a return address of the data after the data is operated on.
8. The data processing apparatus of claim 7, wherein the data operation signal further comprises jump information, the jump information comprising a jump step size and a data length operated after each jump.
9. The data processing apparatus according to claim 8, wherein the jump information comprises stride jump information and/or segment jump information;
the stride jump information is used for representing the jump step length of the data operation signal each time;
the segment jump information is used for representing the preset segmentation size of the data operation signal each time.
10. The data processing apparatus of claim 9, wherein the data operation signal further comprises a functional flag bit for characterizing a processing operation performed by the transmission circuit on the read data.
11. A data processing apparatus according to any one of claims 1-3, wherein the transmission circuit comprises:
the instruction storage unit is used for storing the data operation signal;
the instruction processing unit is used for analyzing the data operation signal to obtain the type zone bit of the data operation signal and the information of the data to be operated;
a store queue unit to store an instruction queue, the instruction queue comprising: and a plurality of data operation signals to be executed according to the front and back sequence of the instruction queue.
12. The data processing apparatus of claim 11, wherein the transmission circuit further comprises:
the dependency relationship processing unit is used for determining whether an s-1 th data operation signal before an s-th data operation signal is associated with the s-th data operation signal, if so, caching the s-th data operation signal in the instruction storage unit, and after the s-1 th data operation signal is executed, extracting the s-th data operation signal from the instruction storage unit and transmitting the s-th data operation signal to the instruction processing unit;
wherein the determining whether the s-1 th data operation signal before the s-th data operation signal and the s-th data operation signal have an association relationship comprises:
extracting a first storage address interval of required data in the s-th data operation signal according to the s-th data operation signal, extracting a zero storage address interval of the required data in the s-1-th data operation signal according to the s-1-th data operation signal, and if the first storage address interval and the zero storage address interval have an overlapped region, determining that the s-th data operation signal and the s-1-th data operation signal have an association relationship, and if the first storage address interval and the zero storage address interval do not have an overlapped region, determining that the s-th data operation signal and the s-1-th data operation signal do not have an association relationship.
13. A data processing method applied to the data processing apparatus of any one of claims 1 to 12, the method comprising:
a transmission circuit in the data processing device receives a data operation signal sent by a machine learning device in the data processing device, wherein the data operation signal carries the type flag bit of the data operation signal and the information of the data to be operated on;
the transmission circuit determines, according to the type flag bit of the data operation signal, the operation to be performed on the data in the shared memory, performs that operation on the data to be operated on according to the information of the data to be operated on so as to obtain the input data required by the machine learning device, and returns the input data to the machine learning device;
and the machine learning device performs a machine learning operation on the input data to obtain output data, takes the output data as new input data, and transmits it through the transmission circuit to the shared memory for storage.
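A toy end-to-end rendering of the method above may help fix ideas. Nothing here is part of the claims: a dict stands in for the shared memory, and the machine learning operation is a placeholder doubling function:

```python
# Illustrative data flow of claim 13: the transmission circuit reads the
# operand from the shared memory, the machine learning device applies
# its operation, and the output is written back as new input data.
# `src` and `ret` are hypothetical field names for the source address
# and the return address carried by the signal.

def process(shared_memory, signal, ml_op):
    data = shared_memory[signal["src"]]    # read selected by the type flag bit
    output = ml_op(data)                   # machine learning operation on the input
    shared_memory[signal["ret"]] = output  # output stored as new input data
    return output
```

A single round trip: reading `[1, 2, 3]`, doubling it, and storing `[2, 4, 6]` at the return address.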
14. The method of claim 13, wherein the machine learning device includes at least one machine learning unit, wherein the data operation signal further includes a data reception flag bit, and wherein returning the input data to the machine learning device comprises:
the transmission circuit determines, according to the value of the data reception flag bit, the target machine learning unit to receive the input data, and sends the input data to the target machine learning unit.
15. The method according to claim 13 or 14, characterized in that the method further comprises:
if the value of the type flag bit of the data operation signal is CAST, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction.
16. The method of claim 15, wherein the type flag bits of the data operation signal comprise a first type flag bit and a second type flag bit, the first type flag bit being used for indicating whether the data operation signal is an I/O instruction, and the second type flag bit being used for indicating whether the data operation signal is a broadcast or multicast instruction among the I/O instructions; the method further comprises:
if the value of the first type flag bit is I/O, the transmission circuit determines that the data operation signal is an I/O instruction;
if the value of the second type flag bit is 1, the transmission circuit determines that the data operation signal is a broadcast or multicast instruction among the I/O instructions.
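The decoding order implied by claims 15 and 16 can be sketched as follows. Treating the flags as dict fields with string/integer values is an assumption; the CAST and I/O values follow the claim wording:

```python
# Illustrative decoder for the type flag bits of claims 15-16: a CAST
# type flag marks a broadcast/multicast instruction; otherwise the
# first type flag bit marks an I/O instruction, whose second type flag
# bit distinguishes broadcast/multicast I/O.

def classify(signal):
    if signal.get("type_flag") == "CAST":
        return "broadcast/multicast"
    if signal.get("first_type_flag") == "I/O":
        if signal.get("second_type_flag") == 1:
            return "broadcast/multicast I/O"
        return "I/O"
    return "other"
```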
17. The method according to claim 16, wherein the information of the data to be operated on comprises a source address of the data to be operated on in the shared memory, a length of the data to be operated on, and a return address for the data after it has been operated on; and wherein performing the operation on the data to be operated on according to the information of the data to be operated on to obtain the input data required by the machine learning device, and returning the input data to the machine learning device, comprises:
the transmission circuit reads the shared memory starting from the source address to acquire the input data satisfying the data length;
and the transmission circuit returns the input data to the target machine learning unit according to the data return address and the data reception flag bit.
18. The method of claim 17, wherein the data operation signal further includes jump information, the jump information comprising a jump step size and a jump data length to be operated on after each jump; and wherein the transmission circuit reading the shared memory starting from the source address to acquire the input data satisfying the data length comprises:
the transmission circuit reads the shared memory starting from the source address and acquires first jump data according to the jump data length for the current jump;
the transmission circuit acquires the last address of the first jump data and jumps from that last address by the jump step size to a target jump address;
and the transmission circuit acquires second jump data starting from the target jump address according to the jump data length, and so on, until the total length of the jump data acquired after the jumps satisfies the data length.
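The jump read above is essentially a strided read. A minimal sketch, assuming a bytes object stands in for the shared memory and that intervals are byte-addressed:

```python
# Sketch of the claim-18 jump read: starting at the source address,
# read `jump_len` bytes (the jump data length), then jump `jump_step`
# bytes past the last address read, and repeat until `total_len` bytes
# (the required data length) have been collected.

def strided_read(memory, src, total_len, jump_step, jump_len):
    out = bytearray()
    addr = src
    while len(out) < total_len:
        out += memory[addr:addr + jump_len]  # jump data for this jump
        addr += jump_len + jump_step         # last address + jump step size
    return bytes(out[:total_len])
```

For example, reading 4 bytes from `bytes(range(20))` with a jump length of 2 and a jump step of 2 collects addresses 0-1 and 4-5.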
19. The method of claim 18, wherein the transmission circuit in the data processing device receiving the data operation signal sent by the machine learning device in the data processing device comprises:
the transmission circuit parses the data operation signal to obtain the type flag bit of the data operation signal and the information of the data to be operated on;
the transmission circuit executes the parsed data operation signal according to an instruction queue, the instruction queue being used to represent the execution order of the data operation signals.
20. The method of claim 19, wherein before the transmission circuit executes the parsed data operation signal according to the instruction queue, the method further comprises:
the transmission circuit determines the dependency relationship between adjacent parsed data operation signals to obtain a determination result; the dependency relationship indicates whether the s-th data operation signal is associated with the (s-1)-th data operation signal preceding it;
and if the determination result indicates that the s-th data operation signal and the (s-1)-th data operation signal have a dependency relationship, the transmission circuit caches the s-th data operation signal, and extracts it for execution after the (s-1)-th data operation signal has been executed.
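The ordering rule of claims 19 and 20 can be sketched as a pass over the queue that marks which signals must be buffered. Representing signals as dicts with half-open `(start, end)` address intervals is an assumption made for illustration:

```python
# Sketch of claims 19-20: signals execute in queue order, and a signal
# whose required-data interval overlaps that of its immediate
# predecessor is buffered until the predecessor completes.

def buffered_indices(signals):
    """Indices of signals that must wait on their predecessor."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    return [s for s in range(1, len(signals))
            if overlaps(signals[s - 1]["addr_interval"],
                        signals[s]["addr_interval"])]
```

With intervals (0, 10), (5, 15), (20, 30), only the second signal is buffered: it overlaps its predecessor, while the third is independent and can issue immediately.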
21. The method of claim 20, wherein the transmission circuit determining the dependency relationship between adjacent parsed data operation signals comprises:
the transmission circuit acquires, according to the s-th data operation signal, a first storage address interval of the data required by the s-th data operation signal, and, according to the (s-1)-th data operation signal, a zeroth storage address interval of the data required by the (s-1)-th data operation signal;
if the first storage address interval and the zeroth storage address interval have an overlapping region, the transmission circuit determines that the s-th data operation signal and the (s-1)-th data operation signal have a dependency relationship;
if the first storage address interval and the zeroth storage address interval have no overlapping region, the transmission circuit determines that the s-th data operation signal and the (s-1)-th data operation signal have no dependency relationship.
22. A combined processing device, characterized in that it comprises the data processing device of any one of claims 1 to 12, a universal interconnection interface, and one or more processing devices other than said data processing device; the data processing device interacts with the other processing devices.
23. A machine learning chip, characterized in that it comprises a combined processing device according to claim 22.
24. A board comprising the machine learning chip of claim 23.
25. An electronic device, characterized in that it comprises the board of claim 24.
CN201811392279.7A 2018-10-18 2018-11-21 Data processing device, method and related product Active CN111209243B (en)

Priority Applications (25)

Application Number Priority Date Filing Date Title
CN201811392279.7A CN111209243B (en) 2018-11-21 2018-11-21 Data processing device, method and related product
KR1020207033053A KR20200139829A (en) 2018-10-18 2019-10-18 Network on-chip data processing method and device
EP21217804.0A EP4009186A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP19873122.6A EP3869352A4 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP21217802.4A EP4009185A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
JP2020569113A JP7060720B2 (en) 2018-10-18 2019-10-18 Network-on-chip data processing methods and equipment
EP21217811.5A EP4009184A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
EP21217809.9A EP4009183A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
PCT/CN2019/111977 WO2020078470A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
US17/278,812 US20220035762A1 (en) 2018-10-18 2019-10-18 Network-on-chip data processing method and device
KR1020207034133A KR102539572B1 (en) 2018-11-21 2019-10-18 Network-on-chip data processing method and device
JP2020206281A JP7074831B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206293A JP7074832B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206306A JP7074833B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
JP2020206272A JP7053775B2 (en) 2018-10-18 2020-12-11 Network-on-chip data processing methods and equipment
US17/564,492 US11880330B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,389 US11841816B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,398 US11880328B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,431 US11880329B2 (en) 2018-10-18 2021-12-29 Arbitration based machine learning data processor
US17/564,579 US11960431B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,366 US11971836B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,529 US11868299B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,411 US11809360B2 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device
US17/564,509 US11797467B2 (en) 2018-10-18 2021-12-29 Data processing device with transmission circuit
US17/564,560 US20220121603A1 (en) 2018-10-18 2021-12-29 Network-on-chip data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811392279.7A CN111209243B (en) 2018-11-21 2018-11-21 Data processing device, method and related product

Publications (2)

Publication Number Publication Date
CN111209243A true CN111209243A (en) 2020-05-29
CN111209243B CN111209243B (en) 2022-12-02

Family

ID=70783940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811392279.7A Active CN111209243B (en) 2018-10-18 2018-11-21 Data processing device, method and related product

Country Status (2)

Country Link
KR (1) KR102539572B1 (en)
CN (1) CN111209243B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580485A * 2020-12-14 2021-03-30 Zhuhai Zero Boundary Integrated Circuit Co., Ltd. Image reading and writing method and device, electronic equipment and storage medium
CN117033298A * 2022-10-21 2023-11-10 Shanghai Tianshu Zhixin Semiconductor Co., Ltd. Tile processor, SOC chip and electronic equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103988559A * 2011-12-20 2014-08-13 Intel Corporation Multicast service using unicast subframe
CN105446888A * 2014-05-30 2016-03-30 Huawei Technologies Co., Ltd. Data transfer method between storage devices, controller, and storage system
US20180121795A1 * 2016-10-28 2018-05-03 Canon Kabushiki Kaisha Data processing apparatus, method for controlling the same, and storage medium storing program
CN107992329A * 2017-07-20 2018-05-04 Shanghai Cambricon Information Technology Co., Ltd. Computing method and related product

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2738141B2 * 1990-10-05 1998-04-08 NEC Corporation Single chip microcomputer
JP5115033B2 (en) 2007-05-30 2013-01-09 富士通株式会社 Packet relay method and apparatus
US10535014B2 (en) * 2014-03-10 2020-01-14 California Institute Of Technology Alternative training distribution data in machine learning
US9965824B2 (en) * 2015-04-23 2018-05-08 Google Llc Architecture for high performance, power efficient, programmable image processing
CN107920025B 2017-11-20 2021-09-14 Beijing University of Technology Dynamic routing method for CPU-GPU heterogeneous network on chip

Non-Patent Citations (1)

Title
Ding Ding et al.: "Multi-Message Broadcast Algorithms and Analysis in Multi-Port Mode", Computer Science *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117341261A * 2023-12-04 2024-01-05 Zibo Chengtuo Machinery Co., Ltd. Intelligent control method and system for servo direct-drive screw press
CN117341261B * 2023-12-04 2024-02-23 Zibo Chengtuo Machinery Co., Ltd. Intelligent control method and system for servo direct-drive screw press

Also Published As

Publication number Publication date
KR20200138413A (en) 2020-12-09
CN111209243B (en) 2022-12-02
KR102539572B1 (en) 2023-06-01

Similar Documents

Publication Publication Date Title
CN111209243B (en) Data processing device, method and related product
CN111209231B (en) Data processing method and device and related products
CN111210012B (en) Data processing method and device and related products
CN111400341B (en) Scalar lookup instruction processing method and device and related product
CN111382850A (en) Operation method, device and related product
CN111401536A (en) Operation method, device and related product
CN111381873A (en) Operation method, device and related product
CN111399905B (en) Operation method, device and related product
CN111381872A (en) Operation method, device and related product
CN111382851A (en) Operation method, device and related product
CN111260045B (en) Decoder and atomic instruction analysis method
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111723920B (en) Artificial intelligence computing device and related products
CN111723921B (en) Artificial intelligence computing device and related products
CN111275197B (en) Operation method, device, computer equipment and storage medium
CN111325331B (en) Operation method, device and related product
CN111210011B (en) Data processing device and related product
CN111382390B (en) Operation method, device and related product
CN111290789B (en) Operation method, operation device, computer equipment and storage medium
CN111353125B (en) Operation method, operation device, computer equipment and storage medium
CN111078125B (en) Operation method, device and related product
CN111079910B (en) Operation method, device and related product
CN111079913B (en) Operation method, device and related product
CN111078293B (en) Operation method, device and related product
CN111723921A (en) Artificial intelligence computing device and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant