CN113536219A - Operation method, processor and related product

Info

Publication number: CN113536219A
Application number: CN202010317734.8A
Authority: CN
Inventors: Not disclosed (name withheld from publication)
Original assignee: Cambricon Technologies Corp Ltd
Current assignee: Cambricon Technologies Corp Ltd
Priority date: 2020-04-21
Filing date: 2020-04-21
Publication date: 2021-10-22
Anticipated expiration: 2040-04-21
Also published as: CN113536219B

Abstract

The disclosure relates to an operation method, a processor and a related product. The product comprises a storage device, an interface device, a control device and an artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used for storing data; the interface device is used for data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip. The method and the product can improve the operation efficiency of the related product when matrix multiplication is carried out.

Description

Operation method, processor and related product

Technical Field

The present disclosure relates to the field of information processing technologies, and in particular, to an operation method, a processor, and a related product.

Background

In the field of artificial intelligence, neural network algorithms have become very popular machine learning algorithms in recent years and perform well in fields such as image recognition, speech recognition and natural language processing. As neural network algorithms develop, their complexity keeps growing, and model sizes keep increasing in order to improve recognition accuracy. Processing these large-scale models with GPUs and CPUs takes a lot of computation time and consumes a lot of power.

Disclosure of Invention

In view of the foregoing, it is desirable to provide an operation method, a processor and a related product.

According to a first aspect of the present disclosure, there is provided a processor comprising two or more processing elements arranged in a two-dimensional matrix, a processing element comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,

the processor further comprises a controller, wherein the controller is used for loading each element of a transposed matrix and a second matrix of the first matrix into a register of each processing element respectively, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;

the controller is used for controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result;

the controller is further configured to process the first intermediate result to obtain a product of the first matrix and the second matrix.

According to a second aspect of the present disclosure, there is provided a method of operation based on matrix multiplication of a matrix of processing elements, applied to a processor including two or more processing elements arranged in a two-dimensional matrix, the processing elements including at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:

transposing the first matrix to obtain a transposed matrix, loading each element of the transposed matrix and the second matrix into a register of each processing element respectively, and storing the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element;

controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling a processing element to perform multiplication operation on elements in a corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;

and processing the first intermediate result to obtain a product of the first matrix and the second matrix.

According to a third aspect of the present disclosure, there is provided an artificial intelligence chip, the chip comprising a processor as described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the artificial intelligence chip as described above.

According to the matrix multiplication operation method, the processor and the other products of the embodiments of the present disclosure, the result of the matrix multiplication can be obtained for an input matrix of any scale that satisfies the arrangement of the processing elements. Compared with the matrix multiplication operation in the related art, the frequency and number of memory accesses can be reduced, which relieves bandwidth pressure and improves operation efficiency.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.

Figs. 2a and 2b each show an example of a different way of partitioning.

FIG. 3 shows a flow diagram of a method of operation according to an embodiment of the present disclosure.

FIG. 4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of chunking according to an embodiment of the present disclosure.

Fig. 6 illustrates an example of partitioning a matrix according to an embodiment of the present disclosure.

Fig. 7 shows a block diagram of a board card according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".

When information is processed using artificial intelligence, matrix operations account for a large share of the computation. A conventional processor decomposes a matrix operation into multiplication operations and addition operations, so data must be read from memory frequently and the operation efficiency is very low.

In order to solve the above technical problem, the present disclosure provides an operation method and a processor for executing the operation method. The processor may comprise a plurality of processing elements (two or more), which may be arranged in a two-dimensional matrix, and each processing element may comprise at least one register.

FIG. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure. As shown in FIG. 1, a plurality of processing elements (PEs) are arranged in a two-dimensional matrix, each processing element is connected to its adjacent processing elements, and at least one register (not shown) may be provided in each PE. The processor may further include a controller and a memory, both of which are connected to the plurality of processing elements, and the controller may be connected to the memory. The controller is configured to load input data from the memory into the registers of the processing elements and to control the processing elements to process the input data. For example, the memory may store a first matrix and a second matrix on which the processor is to perform a matrix multiplication operation; the controller may then load the first matrix and the second matrix into the registers of the processing elements and control the processing elements to perform the matrix multiplication operation.

In a possible implementation manner, the memory may further store an executable program that includes instructions which, when executed, implement the matrix multiplication operation on the first matrix and the second matrix. The controller may be provided with a loader, a decoder, and the like. The loader may be configured to load input data from the memory into the registers of the processing elements. The decoder may decode the data-access instructions in the executable program according to the storage addresses of the loaded input data: for a data-access instruction, the address at which the data is stored in the register is obtained by decoding and assigned to the instruction. The decoded instruction is then sent to the processing element, which executes it to process the data, for example to perform the matrix multiplication operation on the first matrix and the second matrix.

In one possible implementation, the memory may be an on-chip cache. The controller may load the executable program and the input data (e.g., the input matrices, including the left and right multiplication matrices) from the off-chip flash memory into the memory (on-chip cache) and then perform the subsequent matrix multiplication.

In one possible implementation, the controller may also load the input matrix and the executable program directly from the off-chip memory into the register of the processing element, which is not limited by the present disclosure.

The PE may further include an operator to complete a specified operation, for example, a matrix operation, and the PE may include, for example, a multiplier, an adder, and the like, and the specific structures of the PEs may be the same or different, which is not limited in this disclosure. Other types of operators may be included in the PE to accommodate various different operation processes, and the number and types of operators included in the PE are not limited by the present disclosure.

The input matrices for the matrix multiplication operation may include a left multiplication matrix and a right multiplication matrix, where the left multiplication matrix is the matrix to the left of the multiplication sign and the right multiplication matrix is the matrix to the right of the multiplication sign.

Since the number and arrangement of PEs in the processor are fixed, the controller may determine whether to block the input matrix based on the arrangement of the processing elements and the row rank and column rank of the input matrix before loading data into the registers of the processing elements and performing computations. The arrangement of the processing elements refers to the number of rows and columns of the processing elements, and the row rank and column rank of the input matrix refer to the number of rows and columns of the left and right multiplication matrices.

The controller determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and the column rank of the input matrix may refer to: the controller determines whether the number of rows of the input matrix or the transpose of the input matrix is greater than the number of rows of the processing elements and whether the number of columns is greater than the number of columns of the processing elements, and determines whether to block the input matrix according to the result of the determination.

The input matrix may be left unpartitioned if, for one of the input matrices, the number of rows is not greater than the number of rows of processing elements and the number of columns is not greater than the number of columns of processing elements, and, for the transpose of the other input matrix, the number of rows is not greater than the number of rows of processing elements and the number of columns is not greater than the number of columns of processing elements.

The controller may partition the input matrix if, for any of the input matrices or the transpose of any of the input matrices, the number of rows is greater than the number of rows of processing elements or the number of columns is greater than the number of columns of processing elements.

For example, assume that the array of processing elements is denoted $PEs_{MN}$, indicating that the processing elements form an $M \times N$ matrix, where $M$ is the number of rows and $N$ is the number of columns. Assume that one input matrix $A_{mn}$ is an $m \times n$ matrix ($m$ rows, $n$ columns) and that the other input matrix $B_{nk}$ is an $n \times k$ matrix ($n$ rows, $k$ columns). If the number of rows $m$ of $A_{mn}$ is not greater than $M$ and its number of columns $n$ is not greater than $N$, and the number of rows $k$ of the transposed matrix $B_{nk}^{T}$ of $B_{nk}$ is not greater than the number of rows $M$ of processing elements and its number of columns $n$ is not greater than the number of columns $N$ of processing elements, the input matrix need not be partitioned. Or, if the number of rows $n$ of the transposed matrix $A_{mn}^{T}$ of $A_{mn}$ is not greater than the number of rows $M$ of processing elements and its number of columns $m$ is not greater than the number of columns $N$ of processing elements, and the number of rows $n$ of $B_{nk}$ is not greater than the number of rows $M$ of processing elements and its number of columns $k$ is not greater than the number of columns $N$ of processing elements, the input matrix need not be partitioned.

If the number of rows $m$ of $A_{mn}$ is greater than the number of rows $M$ of processing elements or its number of columns $n$ is greater than the number of columns $N$ of processing elements, or the number of rows $k$ of the transposed matrix $B_{nk}^{T}$ is greater than the number of rows $M$ of processing elements or its number of columns $n$ is greater than the number of columns $N$ of processing elements, the input matrix may be partitioned. Or, if the number of rows $n$ of the transposed matrix $A_{mn}^{T}$ is greater than the number of rows $M$ of processing elements or its number of columns $m$ is greater than the number of columns $N$ of processing elements, or the number of rows $n$ of $B_{nk}$ is greater than the number of rows $M$ of processing elements or its number of columns $k$ is greater than the number of columns $N$ of processing elements, the input matrix may be partitioned.
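As a minimal illustrative sketch (not part of the disclosed embodiments), the judgment described above can be written in Python. The function names `fits` and `needs_blocking` are hypothetical, and the sketch assumes that blocking is performed exactly when neither of the two "no partition" conditions holds, which is one reading of the conditions above.

```python
def fits(rows, cols, pe_rows, pe_cols):
    """Return True if a rows x cols matrix fits in a pe_rows x pe_cols PE array."""
    return rows <= pe_rows and cols <= pe_cols


def needs_blocking(pe_shape, a_shape, b_shape):
    """Decide whether the input matrices A (m x n) and B (n x k) must be blocked.

    Assumption: blocking is needed when neither "no partition" condition holds.
    """
    pe_rows, pe_cols = pe_shape
    m, n = a_shape
    n_b, k = b_shape
    if n != n_b:
        raise ValueError("inner dimensions of A and B must match")
    # Condition 1: A (m x n) and the transpose of B (k x n) both fit.
    a_and_bt_fit = fits(m, n, pe_rows, pe_cols) and fits(k, n, pe_rows, pe_cols)
    # Condition 2: the transpose of A (n x m) and B (n x k) both fit.
    at_and_b_fit = fits(n, m, pe_rows, pe_cols) and fits(n, k, pe_rows, pe_cols)
    return not (a_and_bt_fit or at_and_b_fit)


# A 2 x 4 matrix times a 4 x 3 matrix on a 2 x 2 PE array must be blocked.
print(needs_blocking((2, 2), (2, 4), (4, 3)))
# A 3 x 2 matrix times a 2 x 3 matrix fits in a 4 x 4 PE array without blocking.
print(needs_blocking((4, 4), (3, 2), (2, 3)))
```

The two calls at the end use a 2 x 2 and a 4 x 4 array of processing elements purely as example sizes.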

To block only one of the input matrices, the controller may split the rows of the left multiplication matrix or the columns of the right multiplication matrix according to the arrangement of the processing elements.

For example, assume that the array of processing elements is PE₂₂, the left multiplication matrix is A₃₂, and the right multiplication matrix is B₂₂. Then A₃₂ may be split into A₁₂ and A₂₂, each of which is multiplied by B₂₂. If the left multiplication matrix is A₂₂ and the right multiplication matrix is B₃₂, then B₃₂ may be split into B₁₂ and B₂₂.
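The row split of the left multiplication matrix can be sketched as follows, assuming that A₁₂ and A₂₂ denote a 1 × 2 block and a 2 × 2 block of A₃₂ and that the partial products are stacked row-wise to form the final product. The function name `multiply_by_row_blocks` and the element values are hypothetical; NumPy's `@` operator stands in for the multiplication that the processing elements would perform.

```python
import numpy as np


def multiply_by_row_blocks(a, b, block_rows):
    """Split the rows of `a` into blocks, multiply each block by `b`, and stack."""
    if sum(block_rows) != a.shape[0]:
        raise ValueError("block sizes must add up to the number of rows of a")
    partial_products = []
    start = 0
    for rows in block_rows:
        block = a[start:start + rows, :]    # one row block of the left matrix
        partial_products.append(block @ b)  # block product with the right matrix
        start += rows
    return np.vstack(partial_products)      # row blocks of the result, in order


a = np.arange(6).reshape(3, 2)   # plays the role of A32 (3 rows, 2 columns)
b = np.array([[1, 2], [3, 4]])   # plays the role of B22
result = multiply_by_row_blocks(a, b, (1, 2))
print(np.array_equal(result, a @ b))
```

Because each row of the product depends only on the corresponding row of the left multiplication matrix, stacking the block products in order reproduces the full product.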

If both of the input matrices are to be partitioned, the controller may partition the left multiplication matrix in the column direction and the right multiplication matrix in the row direction in the same manner, according to the arrangement of the processing elements and the row rank and column rank of the input matrices.

That is, the left multiplication matrix and the transposed right multiplication matrix may be partitioned in the same manner in the column direction, or in the same manner in the row direction. Here, "the same manner" means that the first matrices and the second matrices obtained by partitioning have the same number of columns or the same number of rows, which ensures that the matrix operation can be completed correctly.

It is assumed that two or more first matrices are obtained after the left multiplication matrix is blocked and two or more second matrices are obtained after the right multiplication matrix is blocked, or that two or more first matrices are obtained after the right multiplication matrix is blocked and two or more second matrices are obtained after the left multiplication matrix is blocked.

When the left multiplication matrix in the column direction and the right multiplication matrix in the row direction are partitioned in the same manner according to the arrangement of the processing elements and the row rank and column rank of the input matrices, both the first matrices and the second matrices obtained after partitioning need to satisfy the condition under which no partitioning is required. That is, neither the transpose of the first matrix nor the second matrix may have more rows than the number of rows of processing elements or more columns than the number of columns of processing elements.

In a possible implementation manner, the controller may divide the first matrix or the second matrix so that the row rank and the column rank of each divided first matrix or second matrix are as close as possible to the number of rows and the number of columns of the processing elements, which improves the operation efficiency and shortens the operation time. For example, if the processing elements form a 4 × 4 array, the division may be performed so that the divided matrices are 4 × 4, which makes the most efficient use of the processing elements.

For example, assume that the processing elements form a 2 × 2 array and the input matrices are one 2 × 4 matrix and one 4 × 3 matrix. The division can be done in many ways. Figs. 2a and 2b show different ways of dividing, in each of which matrix A₂₄ in the column direction and matrix B₄₃ in the row direction are blocked in the same manner. FIG. 2a is one example of partitioning: matrix A₂₄ is divided into two parts in the column direction, each part comprising two columns, and matrix B₄₃ is divided into two parts in the row direction, each part comprising two rows. FIG. 2b is another example of partitioning: matrix A₂₄ is divided into three parts in the column direction, one part comprising two columns and the other two parts comprising one column each, and matrix B₄₃ is divided into three parts in the row direction, one part comprising two rows and the other two parts comprising one row each. The arrangement of the above processing elements and the manner of dividing the input matrix are merely one example of the present disclosure, and do not limit the present disclosure in any way.

The present disclosure does not specifically limit the manner of dividing the left multiplication matrix in the row direction and the right multiplication matrix in the column direction, as long as every divided matrix satisfies the condition under which no further partitioning is required.
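The two partitions of Figs. 2a and 2b can be checked with a short Python sketch. It relies on the standard identity that, when the columns of the left multiplication matrix and the rows of the right multiplication matrix are split in the same manner, the full product equals the sum of the block products. The function name `blocked_product` and the element values are hypothetical, and NumPy's `@` stands in for the per-block multiplication on the processing elements.

```python
import numpy as np


def blocked_product(a, b, sizes):
    """Split a by columns and b by rows with the same sizes; sum the block products."""
    if a.shape[1] != b.shape[0] or sum(sizes) != a.shape[1]:
        raise ValueError("sizes must partition the shared inner dimension")
    total = np.zeros((a.shape[0], b.shape[1]), dtype=np.result_type(a, b))
    start = 0
    for width in sizes:
        first = a[:, start:start + width]    # a "first matrix": column block of A
        second = b[start:start + width, :]   # matching "second matrix": row block of B
        total += first @ second              # accumulate the block product
        start += width
    return total


a = np.arange(8).reshape(2, 4)    # plays the role of A24
b = np.arange(12).reshape(4, 3)   # plays the role of B43
print(np.array_equal(blocked_product(a, b, (2, 2)), a @ b))     # split as in Fig. 2a
print(np.array_equal(blocked_product(a, b, (2, 1, 1)), a @ b))  # split as in Fig. 2b
```

Both partitions give the same result, which is why the choice between them can be made purely on how well the blocks match the arrangement of the processing elements.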

According to the operation rule of matrix multiplication, elements in rows of a left multiplication matrix and elements in columns of a right multiplication matrix are multiplied one by one and then summed. Therefore, in a possible implementation manner, for the case of non-blocking, or the first matrix and the corresponding second matrix after blocking, the controller is configured to load each element of the transposed matrix and the second matrix of the first matrix into the register of each processing element, respectively, and store the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element. According to the matrix multiplication rule, the elements at the corresponding positions of the transposed matrix and the second matrix may refer to elements in the transposed matrix and the second matrix that need to be multiplied.

In a possible implementation manner, the controller may first transpose the first matrix to obtain a transposed matrix and then load the elements of the transposed matrix into the registers of the processing elements. In another possible implementation manner, the controller may transpose the first matrix during loading. For example, assuming that the first matrix is a right multiplication matrix, the controller may load each column of elements of the first matrix into the registers of one row of processing elements, so that the first matrix is transposed while its elements are being loaded into the registers of the processing elements.

In one possible implementation, the transposed matrix and the second matrix are aligned in the row or column direction. Specifically, if the left multiplication matrix is transposed, then after loading, the rows of the transposed matrix are aligned with the rows of the second matrix in the column direction; if the right multiplication matrix is transposed, then after loading, the columns of the transposed matrix are aligned with the columns of the second matrix in the row direction.

After the transpose matrix and the second matrix are loaded, the controller is further configured to control elements in the transpose matrix or the second matrix to roll in a row direction or a column direction, control the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and sum up the element products in the same row or the same column to obtain a first intermediate result. Specifically, the controller controls the processing element, the transposed matrix stored in the register, and the second matrix to repeat the following process until the elements in the transposed matrix or the second matrix are restored to the non-scrolled position: the controller controls the processing element to multiply the elements in the corresponding register to obtain an element product, sums the element products in the same row or the same column to obtain a first intermediate result, and controls the transposed matrix or the second matrix stored in the register to roll by one row or one column in the row direction or the column direction.

That is to say, the processing element is first controlled to perform multiplication operation on elements in the corresponding register to obtain an element product, the element products in the same row or the same column are summed to obtain a first intermediate result, and then the elements in the transposed matrix or the second matrix are controlled to roll by one row or one column in the row direction or the column direction. It is then judged whether the positions of the elements in the transposed matrix or the second matrix after the rolling are the same as their initial positions. If they are the same, the process is ended. If they are different, the multiplication, summation, rolling and judgment are repeated, and the above process is circulated until the elements in the transposed matrix or the second matrix are back at their initial positions after a roll.

In one example, the first matrix is a left multiplication matrix and the second matrix is a right multiplication matrix. In another example, the first matrix is a right multiplication matrix and the second matrix is a left multiplication matrix.

When the first matrix is a left multiplication matrix and the second matrix is a right multiplication matrix, the controller controls the elements of the transposed matrix or the elements of the second matrix to roll in the row direction, controls the processing element to perform multiplication operation on the elements in the corresponding register to obtain an element product, and sums the element products in the same column to obtain a first intermediate result.

When the first matrix is a right multiplication matrix and the second matrix is a left multiplication matrix, the controller controls the elements of the transposed matrix or the elements of the second matrix to roll in the column direction, controls the processing element to perform multiplication operation on the elements in the corresponding register to obtain an element product, and sums the element products in the same row to obtain a first intermediate result.
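The multiply, sum and roll loop for the first of the two cases above (first matrix is the left multiplication matrix) can be simulated with a NumPy sketch. It is only an illustration under stated assumptions: the first matrix A is m × n and the second matrix B is n × m, so that the transposed matrix and B occupy the same processing elements; the transposed matrix is the one that rolls, to the left; and each pass is stored as one row of the first intermediate result. The function name `first_intermediate` is hypothetical.

```python
import numpy as np


def first_intermediate(a, b):
    """Simulate the multiply / column-sum / roll loop for C = A @ B.

    Assumption: a is m x n and b is n x m, so a.T and b share PE positions.
    """
    m, n = a.shape
    if b.shape != (n, m):
        raise ValueError("this sketch expects b to have shape (n, m)")
    transposed = a.T.copy()                  # registers holding the transposed matrix
    passes = []
    for _ in range(m):
        products = transposed * b            # each PE multiplies its two registers
        passes.append(products.sum(axis=0))  # sum element products in the same column
        transposed = np.roll(transposed, -1, axis=1)  # roll left by one column
    # After m rolls the transposed matrix is back at its un-rolled position.
    assert np.array_equal(transposed, a.T)
    return np.stack(passes)                  # one row per pass


a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(first_intermediate(a, b))
```

In this simulation the first pass yields the diagonal of the product, and each later pass yields a cyclically shifted diagonal, so the rows of the first intermediate result still have to be rearranged before they equal the product.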

In one possible implementation, the scrolling described above moves the matrix by one row or one column at a time. A closed loop is formed between the processing elements that store the elements of the matrix. Since adjacent processing elements are connected together, the controller can determine the form of the loop according to the dimensions of the matrix. For example, if the matrix is to be scrolled by rows (in the column direction), the first row and the last row of processing elements storing elements of the matrix are connected, and if the matrix is scrolled upward, the first row of elements of the matrix moves from its original storage position to the storage position of the last row of elements. If the matrix is to be scrolled by columns (in the row direction), the first column and the last column of processing elements storing elements of the matrix are connected, and if the matrix is scrolled to the left, the first column of elements of the matrix moves from its original storage position to the storage position of the last column of elements. The connection between these processing elements may be a virtual connection, that is, there is no actual connection line, but the controller records the corresponding processing elements so that a closed loop is formed during the scrolling process.

Once the scrolling is complete, that is, once the elements in the transposed matrix or the second matrix have been restored to their un-scrolled positions and the first intermediate results have been calculated, the controller may process the first intermediate result to obtain a product of the first matrix and the second matrix.
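The closed loop can be illustrated with a small simulation in which the loop spans only the processing elements that actually hold the matrix. The sketch assumes, purely for illustration, that the matrix is stored in the top-left corner of the register array; the function name `roll_region` and the array sizes are hypothetical.

```python
import numpy as np


def roll_region(registers, rows, cols, axis, step):
    """Cyclically roll only the rows x cols region that stores the matrix.

    axis=0 rolls by rows (column direction), axis=1 rolls by columns (row direction).
    A negative step moves elements up or to the left.
    """
    rolled = registers.copy()
    rolled[:rows, :cols] = np.roll(registers[:rows, :cols], step, axis=axis)
    return rolled


registers = np.zeros((4, 4), dtype=int)           # a 4 x 4 array of PE registers
registers[:3, :2] = [[1, 2], [3, 4], [5, 6]]      # a 3 x 2 matrix in the corner

up = roll_region(registers, 3, 2, axis=0, step=-1)
print(up[:3, :2])    # the first row has moved to the last row's position

left = roll_region(registers, 3, 2, axis=1, step=-1)
print(left[:3, :2])  # the first column has moved to the last column's position
```

Registers outside the 3 × 2 region are left untouched, which mirrors the idea that the loop is determined by the dimensions of the matrix rather than by the size of the whole array.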

In one possible implementation, the controller stores the first intermediate result in rows or columns, and the first intermediate result is scrolled in a row direction or a column direction to obtain a product of the first matrix and the second matrix. The specific processing method is related to the transposed matrix and the scrolling direction, for example:

when the first matrix is a right multiplication matrix and the second matrix is a left multiplication matrix, and the transposed matrix is scrolled in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result may be scrolled to the right in the row direction; for example, the ith row element scrolls to the right in the row direction by i-1 steps;

when the first matrix is a right multiplication matrix and the second matrix is a left multiplication matrix, and the transposed matrix is scrolled downward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result may be scrolled leftward in the row direction; for example, the ith row element scrolls left in the row direction by i-1 steps;

when the first matrix is a left multiplication matrix and the second matrix is a right multiplication matrix, and the transposed matrix is scrolled to the left in the row direction, the first intermediate result can be stored in rows, and the ith column element in the first intermediate result scrolls downward in the column direction by i-1 steps to obtain the product of the input matrices;

when the first matrix is a left multiplication matrix and the second matrix is a right multiplication matrix, and the transposed matrix is scrolled to the right in the row direction, the first intermediate result may be stored in rows, and the ith column element in the first intermediate result is scrolled in the column direction by i-1 steps to obtain the product of the input matrices.
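One of the four cases above can be checked numerically: the first matrix is the left multiplication matrix, the transposed matrix scrolls to the left, the first intermediate result is stored in rows, and the ith column then scrolls downward by i-1 steps. The sketch below is a simulation under the assumptions that both input matrices are square and of the same size and that pass s of the loop is stored as row s; the function name `unskew_columns` and the element values are hypothetical.

```python
import numpy as np


def unskew_columns(intermediate):
    """Roll column i (1-based) of the intermediate result downward by i - 1 steps."""
    result = np.empty_like(intermediate)
    for j in range(intermediate.shape[1]):   # j is the 0-based column index
        result[:, j] = np.roll(intermediate[:, j], j)
    return result


a = np.arange(9).reshape(3, 3)        # first matrix (left multiplication matrix)
b = np.arange(9, 18).reshape(3, 3)    # second matrix (right multiplication matrix)
m = a.shape[0]

# Row s holds the column sums after the transposed matrix has rolled left s times.
intermediate = np.stack(
    [(np.roll(a.T, -s, axis=1) * b).sum(axis=0) for s in range(m)]
)

product = unskew_columns(intermediate)
print(np.array_equal(product, a @ b))
```

In this setting entry (s, j) of the intermediate result equals entry ((j + s) mod m, j) of the product, so rolling column j downward by j steps (0-based) moves every entry to its final row.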

In the related art, for matrix multiplication with a large input matrix size, in order to improve the efficiency of matrix operation, the operation process is usually implemented in a multi-stage pipeline manner, but each stage of the multi-stage pipeline processes a part of input data, so that data needs to be frequently read from a memory, and the requirement on bandwidth is high due to frequent access to the memory. In order to solve the technical problem, the processor provided by the disclosure can perform block storage on an input matrix and then perform matrix multiplication on a corresponding matrix after the input matrix is blocked, so that the memory access frequency can be reduced, and the operation efficiency can be improved.

If the first matrix is obtained by blocking the left-multiplication matrix, or the second matrix is obtained by blocking the right-multiplication matrix, then in a possible implementation the controller is further configured to calculate the product of the left-multiplication matrix and the right-multiplication matrix from the product of the first matrix and the second matrix. That is, the product is first calculated for each blocked first matrix and its corresponding second matrix, and the product of the left-multiplication matrix and the right-multiplication matrix is then calculated from these block products. The memory access frequency can therefore be reduced and the operation efficiency improved.

In another possible implementation, the processor includes multiple sets of registers. That is, the controller may divide the registers of the processing elements into a plurality of groups according to the case of blocking the matrix.

In this way, the controller may transpose two or more first matrices to obtain a transposed matrix after the input matrix is partitioned; the controller loads the transposed matrix and more than two second matrixes into the plurality of groups of registers for stacking storage, and the transposed matrix and the second matrixes at corresponding positions are stored in one group of registers.

Each time before the elements in the transposed matrix or the second matrix are scrolled once in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; after controlling the elements in a group of registers to scroll by one row or one column of the transposed matrix in the row direction or the column direction, the controller also corrects the scrolling result.

In one possible implementation, correcting the scrolling result includes:

if the data is scrolled leftwards in the row direction, the correction mode is that the last column of data in each transposed matrix after scrolling is scrolled to the last column of the data of the previous adjacent transposed matrix;

if the data is scrolled rightwards in the row direction, the correction mode is that the first column of the data in each transposed matrix after scrolling is scrolled to the first column of the data in the next adjacent transposed matrix;

if the data is rolled upwards in the column direction, the correction mode is that the last line of data in each transposed matrix after rolling is rolled to the last line of the data of the previous adjacent transposed matrix;

if the data is rolled downwards in the column direction, the correction mode is that the first row of data in each transposed matrix after rolling is rolled to the first row of the next transposed matrix data;

each block transpose matrix is a matrix obtained by transposing each block matrix after being partitioned. The specific calculation and correction processes will be described in detail in the examples below.

The present disclosure also provides an operation method for implementing matrix multiplication.

For the case of no blocking, or the first matrix and the second matrix after blocking, fig. 3 shows a flowchart of an operation method according to an embodiment of the present disclosure. For the case of no partitioning, the left-multiplication matrix may be directly used as the first matrix and the right-multiplication matrix may be directly used as the second matrix, or the left-multiplication matrix may be directly used as the second matrix and the right-multiplication matrix may be used as the first matrix, which is not limited in this disclosure.

As shown in fig. 3, the operation method provided by the present disclosure may include the following steps:

Step S11, transposing the first matrix to obtain a transposed matrix, loading the transposed matrix and the second matrix into the registers of the processing elements, and storing the elements at corresponding positions of the transposed matrix and the second matrix in the register of the same processing element.

According to the matrix multiplication rule, the elements at the corresponding positions of the transposed matrix and the second matrix may refer to elements in the transposed matrix and the second matrix that need to be multiplied.

Step S12, controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on the elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result.

In a possible implementation, step S12 may specifically include repeating the following process until the elements in the transposed matrix or the second matrix are restored to their unscrolled positions: the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; the transposed matrix or the second matrix is then scrolled within the array of processing elements by one row or one column in the row direction or the column direction.

Step S13, the first intermediate result is processed to obtain a product of the first matrix and the second matrix.

That is to say, for steps S12 and S13, the processing elements are first controlled to multiply the elements in the corresponding registers to obtain element products, the element products in the same row or the same column are summed to obtain a first intermediate result, and the elements in the transposed matrix or the second matrix are then controlled to scroll by one row or one column in the row direction or the column direction. It is then judged whether the positions of the elements in the transposed matrix or the second matrix after scrolling are the same as their initial positions. If they are the same, the loop ends and the process continues to step S13. If they are different, the multiplication, summation, scrolling and judgment are repeated, and this process is cycled until the positions of the elements in the transposed matrix or the second matrix after scrolling are the same as the initial positions.
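The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, assuming square n × n inputs, the case where the first matrix is the right-multiplication matrix, and upward scrolling of the transposed matrix; the function names `transpose`, `scroll_up` and `scroll_matmul` are hypothetical and do not appear in the disclosure. The return to the initial position is modelled by counting n scrolls rather than by comparing element values.

```python
def transpose(m):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def scroll_up(m):
    """Cyclically scroll the rows up by one; the first row wraps to the last."""
    return m[1:] + m[:1]


def scroll_matmul(second, first):
    """Compute second x first for n x n inputs following steps S11-S13."""
    n = len(second)
    t = transpose(first)                # S11: transposed first matrix
    inter = [[] for _ in range(n)]      # first intermediate result, built column by column
    for _ in range(n):                  # S12: after n scrolls the rows are back in place
        for i in range(n):
            # Processing element row i multiplies aligned elements and sums along the row.
            inter[i].append(sum(a * b for a, b in zip(second[i], t[i])))
        t = scroll_up(t)
    # S13: row i (0-based) is scrolled right by i steps with wrap-around.
    return [row[n - i:] + row[:n - i] for i, row in enumerate(inter)]
```

Calling `scroll_matmul(A, B)` with two 3 × 3 lists of rows returns the same values as the ordinary product A × B, which is a quick way to check the stated scroll direction and step count.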

When the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, in step S12, the elements in the transposed matrix are controlled to roll in the row direction, or the elements in the second matrix are controlled to roll in the row direction, the processing element is controlled to perform multiplication operation on the elements in the corresponding register to obtain an element product, and the element products in the same column are summed to obtain a first intermediate result.

When the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, in step S12, the elements in the transposed matrix are controlled to scroll in the column direction, or the elements in the second matrix are controlled to scroll in the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.

In one possible implementation, the scrolling described above proceeds one row or one column at a time.

For step S13, processing the first intermediate result may refer to storing the first intermediate result in rows or in columns and scrolling it in the row direction or the column direction to obtain the product of the first matrix and the second matrix. The specific processing depends on which matrix is transposed and on the scrolling direction, for example:

when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, and the transposed matrix is scrolled downward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result are scrolled leftward in the row direction; for example, the ith row of elements is scrolled leftward in the row direction by i-1 steps.

The process of steps S11-S13 is described below for two cases: the first matrix being a right-multiplication matrix and the second matrix a left-multiplication matrix, and the first matrix being a left-multiplication matrix and the second matrix a right-multiplication matrix.

Example 1: the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, that is, the right-multiplication matrix is transposed.

Suppose the first matrix b_nk and the second matrix a_mn are both 3 × 3 matrices and the processing elements form a 4 × 4 array.

FIG. 4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure. The calculation method of the present disclosure will be described with reference to fig. 4 and 3.

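Suppose the first matrix is

$$b_{33} = \begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$$

and the second matrix is

$$a_{33} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$

Then transposing the first matrix gives the transposed matrix

$$b_{33}^{T} = \begin{pmatrix} B_{11} & B_{21} & B_{31} \\ B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \end{pmatrix}$$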

The second matrix is loaded into the registers of the processing elements in such a way that the rows and columns of the second matrix are arranged in the registers of the processing elements, i.e. the elements of the second matrix are arranged in the same way in the matrix as in the registers of the processing elements.

In one possible implementation, the number of rows and columns of the elements in the second matrix is the same as the number of rows and columns of the processing elements loaded with the elements in the array of processing elements.

For example, in one example, A₁₁ may be loaded into the register of PE₁₁, A₁₂ into the register of PE₁₂, A₁₃ into the register of PE₁₃, A₂₁ into the register of PE₂₁, … and A₃₃ into the register of PE₃₃; that is, the index of an element in the second matrix may be identical to the index of the processing element in which it is located.

In another example, A₁₁ may be loaded into the register of PE₁₂, A₁₂ into the register of PE₁₃, A₁₃ into the register of PE₁₄, A₂₁ into the register of PE₂₂, … and A₃₃ into the register of PE₃₄; that is, the elements of the second matrix are arranged in the registers of the processing elements in the same way as they are arranged in the matrix.

It should be noted that the above are only some examples of loading the second matrix and do not limit the disclosure in any way; those skilled in the art will understand that it suffices for the elements of the second matrix to be arranged in the registers of the processing elements in the same way as in the matrix.

The transposed matrix may be loaded into the registers of the processing elements in the same manner as the second matrix, so that after loading the columns of the second matrix are aligned with the columns of the transposed matrix, and the elements at corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element.

For example, suppose A₁₁ is loaded into the register of PE₁₁, A₁₂ into the register of PE₁₂, A₁₃ into the register of PE₁₃, A₂₁ into the register of PE₂₁, … and A₃₃ into the register of PE₃₃; that is, the index of an element in the second matrix is identical to the index of the processing element in which it is located. Then B₁₁ may be loaded into the register of PE₁₁, B₂₁ into the register of PE₁₂, B₃₁ into the register of PE₁₃, B₁₂ into the register of PE₂₁, B₂₂ into the register of PE₂₂, B₃₂ into the register of PE₂₃, … and B₃₃ into the register of PE₃₃. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is column-aligned with the second matrix.

In a possible implementation manner, the transposed matrix may be loaded first and then the second matrix is loaded, or the loading is performed simultaneously, and the specific loading manner is not limited in the present disclosure as long as it is ensured that the transposed matrix and the second matrix are aligned in the row direction after the loading, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element.

In one possible implementation, after the input matrices are loaded, for the case of transposing the right-multiplication matrix, the processing elements storing the first row of elements of the transposed matrix and the processing elements storing the last row of elements of the transposed matrix may be connected in the column direction to form a ring, and data may flow within the ring to scroll the matrix in the column direction. As shown in FIG. 1, PE₁₁ and PE₃₁ may be connected to form a ring, PE₁₂ and PE₃₂ may be connected to form a ring, and PE₁₃ and PE₃₃ may be connected to form a ring. Thus, when data flows upward within a ring, the data of the first row flows to the third row, the data of the second row flows to the first row, and the data of the third row flows to the second row; when data flows downward, the data of the first row flows to the second row, the data of the second row flows to the third row, and the data of the third row flows to the first row.
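The effect of the ring on row positions can be illustrated with a small Python sketch. It assumes the matrix is modelled as a plain list of rows; the function name `scroll_column_ring` and the string direction arguments are hypothetical and chosen only for illustration.

```python
def scroll_column_ring(rows, direction):
    """Scroll a matrix (list of rows) by one row around the column-direction ring."""
    if direction == "up":
        # Row 2 moves to row 1, row 3 to row 2, and row 1 wraps to the last row.
        return rows[1:] + rows[:1]
    if direction == "down":
        # Row 1 moves to row 2, row 2 to row 3, and the last row wraps to row 1.
        return rows[-1:] + rows[:-1]
    raise ValueError("direction must be 'up' or 'down'")


ring = ["row1", "row2", "row3"]
after_up = scroll_column_ring(ring, "up")      # ['row2', 'row3', 'row1']
after_down = scroll_column_ring(ring, "down")  # ['row3', 'row1', 'row2']
```

Applying the same direction three times to a three-row matrix returns it to its initial arrangement, which is the termination condition used in step S12.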

In this embodiment, only the transposed matrix may be scrolled. Before the transposed matrix is scrolled for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row to obtain a first intermediate result. Taking the above example, the controller may control PE₁₁ to multiply the elements A₁₁ and B₁₁ stored in its register to obtain the element product A₁₁×B₁₁; likewise, the controller may control PE₁₂ and PE₁₃ to obtain A₁₂×B₂₁ and A₁₃×B₃₁.

The controller may then sum the element products in the same row to obtain C₁₁ = A₁₁×B₁₁ + A₁₂×B₂₁ + A₁₃×B₃₁.

In the same manner, C₂₂ and C₃₃ can be obtained.

In one possible implementation, C₁₁, C₂₂ and C₃₃ may be temporarily stored in a buffer as the first column of the first intermediate result. The buffer may be located in the processor at a location other than the plurality of processing elements.

Next, in one possible implementation, the transposed matrix may be scrolled up by one row, with the elements of the first row scrolling to the last row (of the processing elements storing the elements of the matrix). Alternatively, the transposed matrix may be scrolled down by one row; the present disclosure does not limit the specific scrolling direction, and in this example the matrix is scrolled in the column direction in units of rows.

As shown in FIG. 1, when scrolling up, the data of the first row scrolls to the third row, so that the scrolled transposed matrix is:

$$\begin{pmatrix} B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \\ B_{11} & B_{21} & B_{31} \end{pmatrix}$$

In one possible implementation, the scrolling of data in the matrix may be implemented using redundant registers within the processing elements or an on-chip cache within the processor. This implementation is applicable to the scrolling in examples 1 and 2 of the present disclosure.

For example, as in example 1, a first row of elements of the transpose matrix may be temporarily stored in an extra register, the processing element in the second row may be controlled to send a second row of elements of the transpose matrix stored in the corresponding register to the processing element in the first row, then the processing element in the third row may be controlled to send a third row of elements of the transpose matrix stored in the corresponding register to the processing element in the second row, and finally, the temporarily stored first row of elements may be stored in a register corresponding to the processing element in the third row, so as to implement a scrolling process of a row of data of the transpose matrix. The above process is only one example of the present disclosure and does not limit the present disclosure in any way.
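The send sequence just described can be mimicked in Python as follows. This is a sketch under the assumption that each row of processing-element registers is modelled as one list entry and the extra register as a local variable; the name `scroll_up_with_spare` is hypothetical.

```python
def scroll_up_with_spare(pe_rows):
    """Scroll the stored rows up by one in place, using one spare register row."""
    spare = pe_rows[0]                  # park the first row in the spare register
    for r in range(1, len(pe_rows)):
        pe_rows[r - 1] = pe_rows[r]     # row r sends its elements to row r-1
    pe_rows[-1] = spare                 # the parked row lands in the last row
    return pe_rows


registers = [
    ["B11", "B21", "B31"],
    ["B12", "B22", "B32"],
    ["B13", "B23", "B33"],
]
scroll_up_with_spare(registers)
```

After the call, `registers` holds the second row of the transposed matrix in the first position and the original first row in the last position; only one row of temporary storage is needed regardless of the number of rows.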

The processing elements are then controlled again to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain a first intermediate result: the first row of a₃₃ multiplied by the second row of the transposed matrix yields C₁₂, the second row of a₃₃ multiplied by the third row of the transposed matrix yields C₂₃, and the third row of a₃₃ multiplied by the first row of the transposed matrix yields C₃₁. C₁₂, C₂₃ and C₃₁ are temporarily stored in the buffer as the second column of the first intermediate result.

The transposed matrix is scrolled up again, the elements in the corresponding registers are multiplied to obtain element products, and the element products in the same row are summed to obtain the first intermediate results C₁₃, C₂₁ and C₃₂, which are temporarily stored in the buffer as the third column of the first intermediate result.
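Which product element each row sum produces can be checked with a short derivation. The sketch below assumes, as in this example, that C denotes the product of the second matrix (elements A) and the first matrix (elements B); the scroll counter s and the wrapped index k are notation introduced here for illustration only.

$$\sum_{j=1}^{3} A_{ij}\,B_{jk} = C_{ik}, \qquad k = \bigl((i + s - 1) \bmod 3\bigr) + 1$$

Here s = 0, 1, 2 is the number of upward scrolls already applied, so that processing element row i holds row k of the transposed matrix, whose jth entry is B_{jk}. For s = 0 the row sums are C₁₁, C₂₂, C₃₃; for s = 1 they are C₁₂, C₂₃, C₃₁; and for s = 2 they are C₁₃, C₂₁, C₃₂, in agreement with the three columns obtained above.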

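That is, the first intermediate result stored in the buffer is

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{22} & C_{23} & C_{21} \\ C_{33} & C_{31} & C_{32} \end{pmatrix}$$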

For step S13, in the case where the transposed matrix is scrolled upward, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the ith row of elements in the first intermediate result rightward in the row direction by i-1 steps to obtain the product of the input matrices. Scrolling here again means scrolling around a closed loop in the row direction, in which the first column and the last column of processing elements storing the elements of the matrix are connected to form the loop. During rightward scrolling, the elements stored in the last column of processing elements scroll into the first column of processing elements.

Alternatively, for step S13, in the case where the transposed matrix is scrolled downward, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the ith row of elements in the first intermediate result leftward in the row direction by i-1 steps to obtain the product of the input matrices.

It will be appreciated by those skilled in the art that, for step S13, the product of the input matrices may also be obtained by the controller scrolling the elements in the first intermediate result in the row direction (e.g., rightward or leftward) according to the row and column identifiers of the first intermediate result. In this embodiment, every element stored in a register may carry the row and column identifiers of that element in its matrix, and during scrolling the row and column identifiers of the elements in the first intermediate result are determined from the row and column identifiers of the matrix elements, so that the controller may scroll the elements in the first intermediate result in the row direction according to these identifiers to obtain the product of the first matrix and the second matrix.

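Taking the above example, row 1 scrolls to the right by 0 steps, i.e., does not scroll. Row 2 scrolls to the right by 1 step; that is, C₂₁ scrolls right 1 step to column 1, C₂₃ scrolls right 1 step to column 3, and C₂₂ scrolls right 1 step to column 2, and the result is:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{33} & C_{31} & C_{32} \end{pmatrix}$$

Row 3 is then scrolled to the right by 2 steps to obtain the product of the input matrices:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$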
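The bookkeeping of Example 1 can be traced symbolically, without numeric data, to confirm where each product element sits before and after the final scrolls. The following Python sketch assumes upward scrolling of the transposed matrix and tracks only element labels; `trace_example_1` and the list `held` are hypothetical names used for illustration.

```python
def trace_example_1(n=3):
    """Return (intermediate, product) as grids of labels such as 'C23'."""
    # held[i]: which row of the transposed matrix processing element row i currently stores.
    held = list(range(n))
    inter = [[] for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            # Row i of the second matrix times row held[i] of the transposed matrix
            # (column held[i] of the first matrix) gives element (i, held[i]) of the product.
            inter[i].append(f"C{i + 1}{held[i] + 1}")
        held = held[1:] + held[:1]      # scroll the transposed matrix up by one row
    # Step S13: scroll row i (0-based) right by i steps with wrap-around.
    product = [row[n - i:] + row[:n - i] for i, row in enumerate(inter)]
    return inter, product


intermediate, product = trace_example_1()
```

For n = 3, `intermediate` lists C11, C12, C13 in the first row, C22, C23, C21 in the second and C33, C31, C32 in the third, and `product` has every label C_ij at row i, column j.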

In one possible implementation, in step S12, the second matrix may instead be scrolled in the column direction. The specific process is similar to that of scrolling the transposed matrix, except that the way the elements are processed and scrolled in step S13 differs slightly. The present disclosure does not repeat the derivation; reference may be made to the process above.

It should be noted that the arrangement of the processing elements, the input matrix, and the like in the above examples are only for clearly illustrating the process of the operation method of the present disclosure, and do not limit the present disclosure in any way.

Example 2: the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, that is, the left-multiplication matrix is transposed.

It is still assumed that the first matrix a_mn and the second matrix b_nk are both 3 × 3 matrices and that the processing elements form a 4 × 4 array.

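Suppose the first matrix is

$$a_{33} = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}$$

Then transposing the first matrix gives the transposed matrix

$$a_{33}^{T} = \begin{pmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{pmatrix}$$

and the second matrix is

$$b_{33} = \begin{pmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{pmatrix}$$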

The second matrix is loaded into the registers of the processing elements; the loading manner may follow the manner of loading the second matrix in example 1 and is not described again. The transposed matrix is then loaded into the registers of the processing elements in the same manner as the second matrix, so that after loading the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix.

For example, suppose B₁₁ is loaded into the register of PE₁₁, B₁₂ into the register of PE₁₂, B₁₃ into the register of PE₁₃, B₂₁ into the register of PE₂₁, … and B₃₃ into the register of PE₃₃; that is, the index of an element in the second matrix is identical to the index of the processing element in which it is located. Then A₁₁ may be loaded into the register of PE₁₁, A₂₁ into the register of PE₁₂, A₃₁ into the register of PE₁₃, A₁₂ into the register of PE₂₁, A₂₂ into the register of PE₂₂, A₃₂ into the register of PE₂₃, … and A₃₃ into the register of PE₃₃. That is, the transposed matrix is loaded into the registers of the processing elements in an order that is row-aligned with the other matrix (the second matrix).

In one possible implementation, after the input matrices are loaded, for the case of transposing the first matrix, the processing elements storing the first column of elements of the transposed matrix and the processing elements storing the last column of elements of the transposed matrix may be connected in the row direction to form a ring, within which data can flow so that the matrix is scrolled in the row direction in units of columns. As shown in FIG. 4, connecting PE₁₁ and PE₁₃ can form a ring, connecting PE₂₁ and PE₂₃ can form a ring, and connecting PE₃₁ and PE₃₃ can form a ring. When data flows leftward within a ring, the data of the first column flows to the third column, the data of the second column flows to the first column, and the data of the third column flows to the second column; when data flows rightward, the data of the first column flows to the second column, the data of the second column flows to the third column, and the data of the third column flows to the first column.

In this embodiment, before the transposed matrix is scrolled to the left or right in the row direction for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same column to obtain a first intermediate result. Taking the above example, PE₁₁ multiplies the elements A₁₁ and B₁₁ stored in its register to obtain the element product A₁₁×B₁₁, and A₁₂×B₂₁ and A₁₃×B₃₁ can be obtained in the same manner.

Summing the element products of the first column gives C₁₁ = A₁₁×B₁₁ + A₁₂×B₂₁ + A₁₃×B₃₁.

In the same way, summing the element products of the second column gives C₂₂, and summing the element products of the third column gives C₃₃.

In one possible implementation, C₁₁, C₂₂ and C₃₃ may be temporarily stored in the buffer as the first row of the first intermediate result.

The transposed matrix may then be scrolled one column to the left, with the elements of the first column scrolling to the last column, or it may be scrolled one column to the right; the present disclosure is not limited in this respect.

As shown in FIG. 1, when scrolling to the left, the data of the first column scrolls to the third column, so that the scrolled transposed matrix is:

$$\begin{pmatrix} A_{21} & A_{31} & A_{11} \\ A_{22} & A_{32} & A_{12} \\ A_{23} & A_{33} & A_{13} \end{pmatrix}$$

The processing elements are then controlled again to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result: the second column of the transposed matrix multiplied by the first column of b₃₃ yields C₂₁, the third column of the transposed matrix multiplied by the second column of b₃₃ yields C₃₂, and the first column of the transposed matrix multiplied by the third column of b₃₃ yields C₁₃. C₂₁, C₃₂ and C₁₃ are temporarily stored in the buffer as the second row of the first intermediate result.

The transposed matrix is scrolled one column to the left again, the elements in the corresponding registers are multiplied to obtain element products, and the element products in the same column are summed to obtain the first intermediate results C₃₁, C₁₂ and C₂₃, which are temporarily stored in the buffer as the third row of the first intermediate result.

That is, the first intermediate result stored in the buffer is

$$\begin{pmatrix} C_{11} & C_{22} & C_{33} \\ C_{21} & C_{32} & C_{13} \\ C_{31} & C_{12} & C_{23} \end{pmatrix}$$

In step S13, for the case where the transposed matrix is scrolled to the left, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result may be scrolled downward in the column direction by i-1 steps to obtain the product of the input matrices.

Alternatively, in the case where the transposed matrix is scrolled to the right, the controller may store the first intermediate result in rows, and the ith column of elements in the first intermediate result is scrolled in the column direction by i-1 steps to obtain the product of the input matrices. The specific steps are similar to those for leftward scrolling and are not described again here.

It will be appreciated by those skilled in the art that, for step S13, the controller may also scroll the elements in the first intermediate result in the column direction (e.g., up or down) according to the row and column identification of the first intermediate result to obtain the product of the input matrix. In this embodiment, all the elements stored in the register may carry the row and column identifiers of the elements in the matrix, and during the scrolling, the row and column identifiers of the elements in the first intermediate result are determined according to the row and column identifiers of the elements in the matrix, so that the controller may scroll the elements in the first intermediate result in the column direction according to the row and column identifiers of the first intermediate result to obtain the product of the input matrix.

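Taking the above example, column 1 scrolls down by 0 steps, i.e., does not scroll. Column 2 scrolls down by 1 step; that is, C₁₂ scrolls down 1 step to row 1, C₃₂ scrolls down 1 step to row 3, and C₂₂ scrolls down 1 step to row 2, and the result is:

$$\begin{pmatrix} C_{11} & C_{12} & C_{33} \\ C_{21} & C_{22} & C_{13} \\ C_{31} & C_{32} & C_{23} \end{pmatrix}$$

Column 3 is then scrolled down by 2 steps to obtain the product of the input matrices:

$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$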
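Example 2 can likewise be checked numerically. The Python sketch below assumes square n × n inputs, leftward scrolling of the transposed first matrix, and downward realignment of the columns in step S13; the function names `transpose`, `scroll_left` and `scroll_matmul_left` are hypothetical.

```python
def transpose(m):
    """Return the transpose of a square matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def scroll_left(m):
    """Cyclically scroll the columns left by one; the first column wraps to the last."""
    return [row[1:] + row[:1] for row in m]


def scroll_matmul_left(first, second):
    """Compute first x second for n x n inputs by transposing the left-multiplication matrix."""
    n = len(first)
    t = transpose(first)                            # S11
    inter = []                                      # first intermediate result, built row by row
    for _ in range(n):                              # S12
        # Each column of processing elements sums its element products.
        inter.append([sum(t[i][j] * second[i][j] for i in range(n)) for j in range(n)])
        t = scroll_left(t)
    out = [[None] * n for _ in range(n)]
    for j in range(n):                              # S13: column j (0-based) scrolls down by j
        for r in range(n):
            out[(r + j) % n][j] = inter[r][j]
    return out
```

The result of `scroll_matmul_left(A, B)` equals the ordinary product A × B, which confirms that the realignment in this case acts on columns rather than rows.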

In one possible implementation, in step S12, the second matrix may also be scrolled in the row direction, and the specific process is similar to that of the transposed matrix scrolling, but slightly different from the way of processing and scrolling the elements in step S13. The present disclosure does not repeat the specific derivation process, and refers to the above process.

The operation method of matrix multiplication according to the above embodiments of the present disclosure is more suitable for a processor composed of processing elements arranged in an array. For any scale of input matrix satisfying the arrangement of processing elements, the operation result of matrix multiplication can be obtained, and compared with the matrix multiplication operation in the related art, the access and storage times can be reduced, the bandwidth pressure is reduced, and the operation efficiency is improved.

For the case of no blocking, the result of the matrix multiplication can be obtained directly according to the above example. For the situation that the blocking is required, for the first matrix and the second matrix after the blocking, the result obtained by multiplying the first matrix and the corresponding second matrix according to the rule of matrix multiplication is used as a second intermediate result, that is, the operation process of matrix multiplication can be executed by using the first matrix and the second matrix obtained after the blocking as one element of the matrix to obtain a second intermediate result, and the product of the input matrix can be obtained by calculating according to the second intermediate result.

Fig. 5 shows a schematic diagram of blocking according to an embodiment of the present disclosure. As shown in FIG. 5, the controller may block the matrices D and E in the manner described above to obtain first matrices D₁₁, D₁₂, D₂₁, D₂₂ and second matrices E₁₁, E₁₂, E₂₁, E₂₂. The controller may perform the matrix multiplication operation treating each first matrix and each second matrix as one element of a matrix. For example, multiplying the first row of the matrix D by the first column of the matrix E gives F₁₁ = D₁₁×E₁₁ + D₁₂×E₂₁; multiplying the first row of D by the second column of E gives F₁₂ = D₁₁×E₁₂ + D₁₂×E₂₂; multiplying the second row of D by the first column of E gives F₂₁ = D₂₁×E₁₁ + D₂₂×E₂₁; and multiplying the second row of D by the second column of E gives F₂₂ = D₂₁×E₁₂ + D₂₂×E₂₂. That is, to obtain the final result of the matrix multiplication, it is necessary to first obtain the second intermediate results:

D₁₁×E₁₁, D₁₂×E₂₁, D₁₁×E₁₂, D₁₂×E₂₂,

D₂₁×E₁₁, D₂₂×E₂₁, D₂₁×E₁₂, D₂₂×E₂₂.

The second intermediate results may be obtained by operating on the corresponding first matrices and second matrices according to the process of steps S11-S13, respectively.

The input matrix is partitioned, the matrix multiplication operation of the method is respectively carried out on the partitioned matrix to obtain a second intermediate result, and the product of the input matrix can be obtained through calculation according to the second intermediate result. According to the operation method of the above embodiment of the present disclosure, the matrix multiplication process can be rapidly realized for any dimension of the matrix.
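The combination of second intermediate results into the final product can be sketched as follows. In this Python sketch a plain matrix product (`matmul`) stands in for steps S11-S13 applied to one pair of blocks; that substitution, the helper names `add`, `split` and `blocked_matmul`, and the restriction to square matrices whose size is a multiple of the block size are all assumptions made for illustration.

```python
def matmul(x, y):
    """Plain matrix product; stands in for steps S11-S13 on one pair of blocks."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def split(m, size):
    """Split a square matrix into a grid of size x size blocks."""
    n = len(m)
    return [[[row[c:c + size] for row in m[r:r + size]]
             for c in range(0, n, size)]
            for r in range(0, n, size)]


def blocked_matmul(d, e, size):
    """Multiply d by e block by block, summing the second intermediate results."""
    db, eb = split(d, size), split(e, size)
    grid, n = len(db), len(d)
    out = [[0] * n for _ in range(n)]
    for i in range(grid):
        for j in range(grid):
            f = [[0] * size for _ in range(size)]
            for k in range(grid):
                # D_ik x E_kj is one second intermediate result.
                f = add(f, matmul(db[i][k], eb[k][j]))
            for r in range(size):
                out[i * size + r][j * size:(j + 1) * size] = f[r]
    return out
```

For 4 × 4 inputs and a block size of 2, `blocked_matmul(D, E, 2)` evaluates exactly the eight block products listed above and returns the same matrix as the unblocked product.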

In an optional embodiment, the first matrix and the second matrix after being partitioned may be stored in the processing element in sequence for calculation, or may be stored in the processing element in a stacked manner.

Example 3: stacked storage in combination with steps S11-S13

For example, the operation method of the present disclosure is described by taking an array of processing elements as 2 × 2 and input matrices as 4 × 4 matrices.

Suppose the left-multiplication matrix and the right-multiplication matrix are both 4 × 4 matrices. Then the controller can divide both the left-multiplication matrix and the right-multiplication matrix into 2 × 2 matrices.

Fig. 6 illustrates an example of partitioning a matrix according to an embodiment of the present disclosure. As shown in fig. 6, the controller may divide both the left-multiplication matrix and the right-multiplication matrix into 2 × 2 sub-matrices. Dividing the left-multiplication matrix yields four matrices a₁₁, a₁₂, a₂₁ and a₂₂, and dividing the right-multiplication matrix yields four matrices b₁₁, b₁₂, b₂₁ and b₂₂, as shown in fig. 6.

For the blocking case, if the number of registers included in the processing element can meet the requirement of storing the input matrix, the input matrix may also be stored in the registers of the processing element in a stacked storage manner to implement the multiplication operation of the input matrix. When the input matrix is stored in a stacked storage manner, the controller may divide the registers in the processing elements into a plurality of different groups, each group storing a first matrix after being blocked and a corresponding second matrix.

In the example of storing the input matrix in the stacked storage manner, one possible calculation manner is to scroll the matrices in units of the first matrix and the second matrix obtained by the block division, and in calculating the second intermediate result, perform the operation using the processes of steps S11 to S13.

Taking the calculation of the second intermediate results by steps S11-S13 as an example, assume that the processing elements form a 2 × 2 array and use the partition shown in Fig. 6. In the operation method of the present disclosure, the first matrix may be obtained by blocking either the left-multiplication matrix or the right-multiplication matrix.

The present disclosure explains the operation method by taking as an example the case where the first matrices are obtained by blocking the right-multiplication matrix: the second matrices are loaded, and the corresponding first matrices are transposed and then loaded. The loading results are shown in Tables 1 and 2. Here, Reg0, Reg1, Reg2, and Reg3 each denote a group of registers in the processing elements. The processing elements form a 2 × 2 array, each processing element includes a plurality of registers, and the controller can divide these registers into a plurality of groups; in the present embodiment, for example, they are divided into 4 groups. The registers in the same group store a transposed matrix and the corresponding second matrix. As shown in Tables 1 and 2, Reg0 stores a₁₁ and b₁₁, Reg1 stores a₁₂ and b₂₁, Reg2 stores a₂₁ and b₁₂, and Reg3 stores a₂₂ and b₂₂. That is, the first-row blocks of the blocked left-multiplication matrix are paired with the first-column blocks of the blocked right-multiplication matrix, and the second-row blocks are paired with the second-column blocks.

Table 1 element storage example

Table 2 element storage example
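The grouping described for Tables 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions, not the disclosed hardware behaviour: the helper names `split_blocks` and `transpose`, the dictionary layout of `groups`, and the 4 × 4 sample values are all hypothetical and chosen for the example.

```python
def split_blocks(m, size):
    """Split a square matrix (list of rows) into size x size blocks.

    Returns a dict keyed by 1-based block indices (i, j), so that
    blocks[(1, 2)] corresponds to the sub-matrix written a12 in the text.
    """
    count = len(m) // size
    blocks = {}
    for bi in range(count):
        for bj in range(count):
            rows = m[bi * size:(bi + 1) * size]
            blocks[(bi + 1, bj + 1)] = [r[bj * size:(bj + 1) * size] for r in rows]
    return blocks


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Hypothetical 4 x 4 inputs (the values used in the patent are not reproduced here).
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]]

a = split_blocks(A, 2)  # blocks of the left-multiplication matrix (second matrices)
b = split_blocks(B, 2)  # blocks of the right-multiplication matrix (first matrices)

# Pairing stated in the text:
# Reg0 <- (a11, b11), Reg1 <- (a12, b21), Reg2 <- (a21, b12), Reg3 <- (a22, b22)
pairing = [((1, 1), (1, 1)), ((1, 2), (2, 1)), ((2, 1), (1, 2)), ((2, 2), (2, 2))]

groups = {
    f"Reg{g}": {"second": a[ai], "transposed_first": transpose(b[bi])}
    for g, (ai, bi) in enumerate(pairing)
}
```

Each entry of `groups` models one register group holding a block of the left-multiplication matrix together with the transposed block of the right-multiplication matrix that it is multiplied with.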

In the calculation process, for the elements within a group of registers, the processing element may calculate the second intermediate results a₁₁×b₁₁, a₁₂×b₂₁, a₂₁×b₁₂, and a₂₂×b₂₂ according to steps S11-S13. The detailed process is not described again. From these second intermediate results, C₁₁ = a₁₁×b₁₁ + a₁₂×b₂₁ and C₂₂ = a₂₁×b₁₂ + a₂₂×b₂₂ can be calculated.

After the second intermediate results are calculated, the transposed matrices may be scrolled in units of groups. Specifically, the blocked transposed matrix is scrolled up by one row: the transposed elements in Reg2 are scrolled into Reg0, those in Reg0 into Reg2, those in Reg3 into Reg1, and those in Reg1 into Reg3, which yields Table 3.

Table 3 element storage example

In conjunction with Tables 1 and 3, during the calculation, for the elements in a group of registers, the processing element may calculate the second intermediate results a₁₁×b₁₂, a₁₂×b₂₂, a₂₁×b₁₁, and a₂₂×b₂₁ according to steps S11-S13. The detailed process is not described again. From these second intermediate results, C₁₂ = a₁₁×b₁₂ + a₁₂×b₂₂ and C₂₁ = a₂₁×b₁₁ + a₂₂×b₂₁ can be calculated.

According to the above process, the product of the input matrix can be calculated in a block-wise manner.
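The two rounds of Example 3 can be sketched end to end as below. This is a sketch under assumptions rather than the disclosed implementation: the function names are hypothetical, the per-group product uses an ordinary matrix multiplication as a stand-in for steps S11-S13, and the blocks of the right-multiplication matrix are kept untransposed for brevity (the text stores them transposed).

```python
def matmul(x, y):
    """Ordinary matrix product; stands in for the per-group steps S11-S13."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(m, bi, bj, size=2):
    """Return the size x size block at 0-based block position (bi, bj)."""
    return [row[bj * size:(bj + 1) * size] for row in m[bi * size:(bi + 1) * size]]


def stacked_block_matmul(left, right):
    """4 x 4 product using four register groups Reg0..Reg3 (list index = group)."""
    lhs = [block(left, 0, 0), block(left, 0, 1),
           block(left, 1, 0), block(left, 1, 1)]      # a11, a12, a21, a22
    rhs = [block(right, 0, 0), block(right, 1, 0),
           block(right, 0, 1), block(right, 1, 1)]    # b11, b21, b12, b22

    # Round 1: second intermediate results give C11 and C22.
    p = [matmul(l, r) for l, r in zip(lhs, rhs)]
    c11, c22 = add(p[0], p[1]), add(p[2], p[3])

    # Scroll in units of groups: Reg0 <-> Reg2 and Reg1 <-> Reg3.
    rhs = [rhs[2], rhs[3], rhs[0], rhs[1]]            # b12, b22, b11, b21

    # Round 2: second intermediate results give C12 and C21.
    p = [matmul(l, r) for l, r in zip(lhs, rhs)]
    c12, c21 = add(p[0], p[1]), add(p[2], p[3])

    # Assemble the 4 x 4 product from its four blocks.
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom


# Hypothetical sample inputs.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
C = stacked_block_matmul(A, B)
```

Comparing `C` with `matmul(A, B)` is a convenient way to check that one group-level scroll is sufficient for a 2 × 2 arrangement of blocks.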

Therefore, the matrix multiplication operation method of the present disclosure can realize matrix multiplication of any size.

Example 4: Stacked storage in combination with whole scrolling

In another possible implementation, a different scrolling manner may be adopted. In this embodiment, step S12 in Fig. 3 may be implemented as follows: each time before the transposed matrix is scrolled once in the row direction or the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row (or of the same column, in the example of transposing the first matrix) are summed to obtain the first intermediate results C₁₁, C₂₂, C₃₃, and C₄₄.

When the input matrix is partitioned and stored in a stacked manner, the data of one original row or column is stored in the registers of different groups, so that a row or column of originally contiguous data becomes at least two independent rows or columns held in different register groups. The last data item of one such segment and the first data item of the next segment were stored contiguously before stacking but are no longer contiguous after stacking. Therefore, after the elements in a group of registers are scrolled once in the row or column direction, the scrolling result must be corrected to obtain the correct result. The correction may proceed as follows:

each block of the transposed matrix is scrolled once in the row or column direction;

if scrolling leftward in the row direction, the correction is to move the last column of data in each block after scrolling to the last column of the adjacent previous block;

if scrolling rightward in the row direction, the correction is to move the first column of data in each block after scrolling to the first column of the adjacent next block;

if scrolling upward in the column direction, the correction is to move the last row of data in each block after scrolling to the last row of the adjacent previous block;

if scrolling downward in the column direction, the correction is to move the first row of data in each block after scrolling to the first row of the adjacent next block.

Each block mentioned above refers to each block transpose matrix, and each block transpose matrix refers to a matrix obtained by transposing each block matrix after being partitioned.
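The four correction rules above can be sketched as follows. This is a minimal illustration under assumptions: the function names, the nested-list representation of the register groups as a grid of blocks, and the cyclic treatment of "previous" and "next" blocks are choices made for the example and are not taken from the disclosure.

```python
def roll(m, direction):
    """Cyclically roll a matrix (list of rows) by one row or column."""
    if direction == "up":
        return m[1:] + m[:1]
    if direction == "down":
        return m[-1:] + m[:-1]
    if direction == "left":
        return [row[1:] + row[:1] for row in m]
    if direction == "right":
        return [row[-1:] + row[:-1] for row in m]
    raise ValueError(direction)


def split(m, size):
    """Split a matrix into a grid (list of lists) of size x size blocks."""
    return [[[row[bj * size:(bj + 1) * size] for row in m[bi * size:(bi + 1) * size]]
             for bj in range(len(m[0]) // size)] for bi in range(len(m) // size)]


def join(grid):
    """Reassemble a grid of blocks into one matrix."""
    return [[x for blk in block_row for x in blk[r]]
            for block_row in grid for r in range(len(block_row[0]))]


def corrected_roll(grid, direction):
    """Roll every block independently, then apply the correction rule."""
    p, q = len(grid), len(grid[0])
    rolled = [[roll(blk, direction) for blk in block_row] for block_row in grid]
    out = [[[row[:] for row in blk] for blk in block_row] for block_row in rolled]
    for bi in range(p):
        for bj in range(q):
            src = rolled[bi][bj]
            if direction == "up":       # last row -> last row of previous block
                out[(bi - 1) % p][bj][-1] = src[-1][:]
            elif direction == "down":   # first row -> first row of next block
                out[(bi + 1) % p][bj][0] = src[0][:]
            elif direction == "left":   # last column -> last column of previous block
                for r, row in enumerate(src):
                    out[bi][(bj - 1) % q][r][-1] = row[-1]
            else:                       # first column -> first column of next block
                for r, row in enumerate(src):
                    out[bi][(bj + 1) % q][r][0] = row[0]
    return out


# Hypothetical 4 x 4 transposed matrix stored as a 2 x 2 grid of 2 x 2 blocks.
T = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
results = {d: join(corrected_roll(split(T, 2), d)) for d in ("up", "down", "left", "right")}
```

In this sketch the corrected per-block roll gives the same result as rolling the unpartitioned matrix once, which is the property the correction is meant to restore.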

In the present embodiment, the right-multiplication matrix is transposed and scrolled in the row direction. Because of the stacked storage, two rows that should be adjacent are separated by at least one row and are treated as independent rows. Scrolling in the row direction within each group of registers alone therefore cannot achieve the correct scrolling, and a correction is needed.

Taking table 2 as an example, inside each group of registers, one row is scrolled upwards, the scrolling result is shown in table 4, and in table 4, the first row element in one group of registers is scrolled to the last row. But as shown in table 2, the first row elements of Reg0 and Reg1 should scroll to the last row of Reg2 and Reg3, but now lie in the last row of Reg0 and Reg1 (as shown in table 4); as shown in table 2, the first row elements of Reg2 and Reg3 should scroll to the last row of Reg0 and Reg1, but now lie in the last row of Reg2 and Reg3 (as shown in table 4); that is, in table 4, now the last row elements of Reg0 and Reg1 should be located in the last row of Reg2 and Reg3, and the last row elements of Reg2 and Reg3 should be located in the last row of Reg0 and Reg1, then swapping the last row elements of Reg2 and Reg0, and swapping the last row elements of Reg3 and Reg1 can implement the scrolling process, as shown in table 5.

Table 4 element storage example

Table 5 element storage example

By comparing Table 1 and Table 5, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain the first intermediate results C₁₂, C₂₃, C₃₄, and C₄₁.

Repeating the above process for a total of 4 calculations and 3 scrolls completes the matrix multiplication, and the product of the input matrices can be obtained from the first intermediate results.
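The alternation of calculation and scrolling can be sketched as below for square matrices. This is a simplified sketch under assumptions: the names `rolling_matmul`, `transpose`, and `roll_up` are hypothetical, the transposed right-multiplication matrix is modelled as one unpartitioned list of rows (so the stacked-storage correction is not shown), and the sample values are invented.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def roll_up(m):
    """Cyclically scroll all rows up by one."""
    return m[1:] + m[:1]


def rolling_matmul(left, right):
    """Multiply two n x n matrices with n calculations and n - 1 scrolls."""
    n = len(left)
    rolled = transpose(right)          # row i initially holds column i of `right`
    product = [[0] * n for _ in range(n)]
    for step in range(n):
        for i in range(n):
            # Multiply corresponding elements and sum along row i; after
            # `step` scrolls this yields C[i][(i + step) % n].
            product[i][(i + step) % n] = sum(
                x * y for x, y in zip(left[i], rolled[i]))
        if step < n - 1:               # no scroll is needed after the last calculation
            rolled = roll_up(rolled)
    return product


# Hypothetical sample inputs.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[2, 0, 1, 3], [1, 1, 0, 2], [0, 4, 1, 1], [3, 2, 2, 0]]
C = rolling_matmul(A, B)
```

For 4 × 4 inputs the first pass fills the diagonal (C₁₁, C₂₂, C₃₃, C₄₄), the second fills C₁₂, C₂₃, C₃₄, C₄₁, and so on, matching the 4 calculations and 3 scrolls stated above.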

In an alternative embodiment, the stacked storage may follow the above blocking manner. It is not limited to each register storing one element of the matrix, to the numbers of rows and columns of the multiplied matrices being integer multiples of the numbers of rows and columns of the processing element array, or to a single stacked storage method. In every case the correction process is the same: after the correction, the elements of each original row or column only need to be connected in series again. The specific stacked storage process is not limited herein.

It should be noted that the above manner of stacking storage and scrolling elements is only one example of the disclosure, and other manners may also be adopted, and the disclosure does not limit this.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.

It should be further noted that, although the steps in the flowchart are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and described and may be performed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The present disclosure also provides an arithmetic device based on matrix multiplication of a matrix of processing elements, which can be applied to a processor. Fig. 1 shows an example of a processor, which may comprise more than two processing elements arranged in a two-dimensional matrix, each processing element comprising at least one register, said arithmetic means being arranged to perform a matrix multiplication operation on a first matrix and a second matrix.

It should be understood that the above-described apparatus embodiments are merely illustrative and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.

In addition, unless otherwise specified, each functional unit/module in each embodiment of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.

If the integrated unit/module is implemented in hardware, the hardware may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the register may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and so on.

The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.

The embodiment of the present disclosure further provides an artificial intelligence chip, where the chip includes the processor as described above.

In a possible implementation manner, a board card is further disclosed, which comprises a storage device, an interface device, a control device and the artificial intelligence chip; wherein, the artificial intelligence chip is respectively connected with the storage device, the control device and the interface device; the storage device is used for storing data; the interface device is used for realizing data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip.

Fig. 7 shows a block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 7, in addition to the chip 389, the board card may include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392.

The storage device 390 is connected to the artificial intelligence chip through a bus and is used for storing data. The storage device may include a plurality of groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It is understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).

DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking.

In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and is used for controlling the data transmission and data storage of each storage unit.

The interface device is electrically connected with the artificial intelligence chip. The interface device is used for realizing data transmission between the artificial intelligence chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transmitted from the server to the chip through the standard PCIe interface, thereby implementing the data transfer. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such an interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the artificial intelligence chip is transmitted back to the external equipment (e.g., a server) by the interface device.

The control device is electrically connected with the artificial intelligence chip. The control device is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Because the artificial intelligence chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, it can drive a plurality of loads. The artificial intelligence chip can therefore be in different working states, such as multi-load and light-load states. The control device can regulate the working states of the plurality of processing chips, processing cores, and/or processing circuits in the artificial intelligence chip.

An embodiment of the present disclosure further provides an electronic device including a processor. As described above, the processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element including at least one register, and a controller that controls the processing elements.

The electronic device further includes a memory for storing processor-executable instructions, wherein the processor is configured to invoke the instructions stored in the memory to perform the above-described method.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in a given embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features are described, but any combination should be considered within the scope of the present specification as long as it involves no contradiction.

The foregoing may be better understood in light of the following clauses:

Clause A1. A processor, the processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,

Clause A2. The processor of clause A1,

the controller controls the processing element, the transposed matrix stored in the register, and the second matrix to repeat the following process until the elements in the transposed matrix or the second matrix are restored to the non-scrolled position:

the controller is used for controlling the processing element to multiply elements in the corresponding register to obtain element products, summing the element products of the same row or the same column to obtain a first intermediate result, and controlling the transposed matrix or the second matrix stored in the register to roll by one row or one column in the row direction or the column direction.

Clause A3. The processor of clause A1 or A2,

when the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, the controller controls elements in the transposed matrix to roll in the row direction, or controls elements in the second matrix to roll in the row direction; the controller controls the processing element to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result;

Clause A4. The processor of clause A1 or A2,

and the controller stores the first intermediate result in rows or columns, and rolls in the row direction or the column direction to obtain the product of the first matrix and the second matrix.

Clause A5. The processor of any of clauses A1-A4, the controller further configured to determine whether to block the input matrix according to the arrangement of the processing elements and the row rank and column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;

if one matrix in the input matrix is to be partitioned, the controller splits the row of the left-multiplying matrix or splits the column of the right-multiplying matrix according to the arrangement of the processing elements;

if the two matrixes in the input matrix are to be blocked, the controller blocks the column direction of the left-multiplying matrix and the row direction of the right-multiplying matrix in the same way according to the arrangement of the processing elements and the row rank and the column rank of the input matrix;

the left multiplication matrix is partitioned to obtain more than two first matrixes, the right multiplication matrix is partitioned to obtain more than two second matrixes, or the left multiplication matrix is partitioned to obtain more than two second matrixes, and the right multiplication matrix is partitioned to obtain more than two first matrixes.

Clause A6. The processor of clause A5,

the controller is further configured to calculate a product of the left-multiplication matrix and the right-multiplication matrix based on a product of the first matrix and the second matrix.

Clause A7. The processor of clause A5, the processor comprising a plurality of sets of registers,

the controller is further configured to transpose the more than two first matrices to obtain a transposed matrix after the input matrix is partitioned;

the controller loads the transposed matrix and more than two second matrixes into the plurality of groups of registers for stacking storage, and the transposed matrix and the second matrixes at corresponding positions are stored in one group of registers;

before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction each time, the controller controls the processing element to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;

after controlling the elements in a set of registers to scroll through a row or column of the transpose matrix in the row or column direction, the controller also modifies the scrolling results.

Clause A8. The processor of clause A7, wherein modifying the scrolling results comprises:

each block transpose matrix is a matrix obtained by transposing each block matrix after being partitioned.

Clause A9. A method of matrix multiplication based on a matrix of processing elements for use in a processor, the processor including two or more processing elements arranged in a two-dimensional matrix, the processing elements including at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:

transposing a first matrix to obtain a transposed matrix, loading each element of the transposed matrix and the second matrix into a register of each processing element respectively, and storing the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element;

Clause A10. The method of clause A9, wherein controlling the transposed matrix or the second matrix to scroll in the row direction or the column direction, and controlling the processing element to multiply the elements in the corresponding registers to obtain element products and to sum the element products in the same row or the same column to obtain the first intermediate result, comprises repeating the following process until the elements in the transposed matrix or the second matrix are restored to their non-scrolled positions:

controlling the processing element to multiply the elements in the corresponding registers to obtain element products, summing the element products in the same row or column to obtain a first intermediate result, and rolling the transposed matrix or the second matrix in the matrix of processing elements by one row or column in the row direction or the column direction.

Clause A11. The method of clause A9 or A10,

when the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, controlling elements in the transposed matrix to roll in the row direction, or controlling elements in the second matrix to roll in the row direction; controlling the processing element to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same column to obtain a first intermediate result;

when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, controlling elements in the transposed matrix to roll in the column direction, or controlling elements in the second matrix to roll in the column direction; controlling the processing element to multiply the elements in the corresponding registers to obtain element products, and summing the element products in the same row to obtain a first intermediate result.

Clause A12. The method of clause A9 or A10, wherein processing the first intermediate result to obtain a product of the first matrix and the second matrix comprises:

and storing the first intermediate result in a row or column manner, and rolling in the row direction or the column direction to obtain a product of the first matrix and the second matrix.

Clause A13. The method of any one of clauses A9-A12, further comprising:

determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and the column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;

if one matrix in the input matrix is to be partitioned, splitting rows of a left-multiplying matrix or splitting columns of a right-multiplying matrix according to the arrangement of the processing elements;

if the two matrixes in the input matrix are to be partitioned, partitioning the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way according to the arrangement of the processing elements and the row rank and the column rank of the input matrix;

Clause A14. The method of clause A13, further comprising:

calculating the product of the left-multiplication matrix and the right-multiplication matrix according to the product of the first matrix and the second matrix.

Clause A15. The method of clause A13, wherein the processor includes a plurality of sets of registers,

the method further comprises the following steps:

after the input matrix is partitioned, transposing more than two first matrixes to obtain a transposed matrix;

the transposition matrixes and more than two second matrixes are stacked and stored in the plurality of groups of registers, and the transposition matrixes and the second matrixes at corresponding positions are stored in one group of registers;

before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction each time, the processing element is controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;

after controlling the elements in a set of registers to scroll through a row or column of transpose matrices in the row or column direction, the scrolling results are modified.

Clause A16. The method of clause A15, wherein modifying the scrolling results comprises:

Clause A17. An artificial intelligence chip comprising the processor of any one of clauses A1-A8.

Clause A18. An electronic device comprising the artificial intelligence chip of clause A17.

The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure. The description of the above embodiments is provided only to help understand the method and the core idea of the present disclosure. Meanwhile, a person skilled in the art may, based on the idea of the present disclosure, make changes or modifications to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the present disclosure.

Claims

1. A processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being configured to perform a matrix multiplication operation on a first matrix and a second matrix,

2. The processor of claim 1,

3. The processor according to claim 1 or 2,

4. The processor according to claim 1 or 2,

5. The processor of any one of claims 1-4, wherein the controller is further configured to determine whether to block the input matrix according to the arrangement of the processing elements and a row rank and a column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;

6. The processor of claim 5,

7. The processor of claim 5, wherein the processor comprises a plurality of sets of registers,

8. The processor of claim 7, wherein modifying the scrolling results comprises:

9. A method of matrix multiplication based on a matrix of processing elements, for use in a processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:

10. The method of claim 9, wherein controlling the transpose matrix or the second matrix to scroll in a row direction or a column direction, and controlling the processing element to multiply the elements in the corresponding registers to obtain the element products and sum the element products in the same row or the same column to obtain the first intermediate result comprises repeating the following processes until the elements in the transpose matrix or the second matrix are restored to the non-scrolled positions:

11. The method according to claim 9 or 10,

12. The method of claim 9 or 10, wherein processing the first intermediate result to obtain a product of the first matrix and the second matrix comprises:

13. The method according to any one of claims 9-12, further comprising:

14. The method of claim 13, further comprising:

15. The method of claim 13, wherein the processor includes a plurality of sets of registers,

the method further comprises the following steps:

16. The method of claim 15, wherein modifying the scrolling results comprises:

17. An artificial intelligence chip, wherein the chip comprises a processor according to any one of claims 1 to 8.

18. An electronic device comprising the artificial intelligence chip of claim 17.