CN113536219A - Operation method, processor and related product - Google Patents

Publication number
CN113536219A
Authority
CN
China
Prior art keywords
matrix
column
row
elements
transposed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010317734.8A
Other languages
Chinese (zh)
Other versions
CN113536219B (en)
Inventor
Inventor name not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd
Priority to CN202010317734.8A (CN113536219B)
Priority to PCT/CN2021/075957 (WO2021212972A1)
Priority to US17/920,372 (US20230169144A1)
Publication of CN113536219A
Application granted
Publication of CN113536219B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements

Abstract

The disclosure relates to an operation method, a processor and a related product. The product comprises a storage device, an interface device, a control device and an artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used for storing data; the interface device is used for data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip. The method and the product can improve operation efficiency when matrix multiplication is performed.

Description

Operation method, processor and related product
Technical Field
The present disclosure relates to the field of information processing technologies, and in particular, to an arithmetic method, a processor, and a related product.
Background
In the technical field of artificial intelligence, neural network algorithms have become very popular machine learning algorithms in recent years and perform well in various fields, such as image recognition, speech recognition and natural language processing. As neural network algorithms develop, their complexity keeps increasing, and model sizes grow steadily in order to improve recognition accuracy. Processing these large-scale models with GPUs and CPUs takes a lot of computation time and consumes a lot of power.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an arithmetic method, a processor and a related product.
According to a first aspect of the present disclosure, there is provided a processor comprising two or more processing elements arranged in a two-dimensional matrix, a processing element comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,
the processor further comprises a controller, wherein the controller is used for loading each element of a transposed matrix and a second matrix of the first matrix into a register of each processing element respectively, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;
the controller is used for controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result;
the controller is further configured to process the first intermediate result to obtain a product of the first matrix and the second matrix.
According to a second aspect of the present disclosure, there is provided a method of operation based on matrix multiplication of a matrix of processing elements, applied to a processor including two or more processing elements arranged in a two-dimensional matrix, the processing elements including at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:
transposing the first matrix to obtain a transposed matrix, loading each element of the transposed matrix and the second matrix into a register of each processing element respectively, and storing the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element;
controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling a processing element to perform multiplication operation on elements in a corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;
and processing the first intermediate result to obtain a product of the first matrix and the second matrix.
According to a third aspect of the present disclosure, there is provided an artificial intelligence chip, the chip comprising a processor as described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the artificial intelligence chip as described above.
According to the matrix multiplication operation method, the processor and the related products of the embodiments of the present disclosure, the result of the matrix multiplication can be obtained for an input matrix of any size that satisfies the arrangement of the processing elements. Compared with matrix multiplication in the related art, the frequency and number of memory accesses can be reduced, which reduces bandwidth pressure and improves operation efficiency.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
FIGS. 2a and 2b each show an example of a different way of partitioning.
FIG. 3 shows a flow diagram of a method of operation according to an embodiment of the present disclosure.
FIG. 4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of chunking according to an embodiment of the present disclosure.
Fig. 6 illustrates an example of partitioning a matrix according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of a board card according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
In the process of processing information by using artificial intelligence, matrix operation occupies a large amount of calculation, and in the process of processing the matrix operation, the conventional processor decomposes the matrix operation into multiplication operation and addition operation, so that data needs to be frequently read from a memory, and the operation efficiency is very low.
In order to solve the above technical problem, the present disclosure provides an arithmetic method and a processor for executing the arithmetic method. A processor may comprise a plurality of processing elements (more than two) which may be arranged in a two-dimensional matrix, each processing element may comprise at least one register.
FIG. 1 shows a schematic diagram of a processor according to an embodiment of the present disclosure. As shown in FIG. 1, a plurality of processing elements (PEs) are arranged in a two-dimensional matrix, each processing element is connected to its adjacent processing elements, and at least one register (not shown) may be provided in each PE. The processor may further include a controller and a memory, both of which are connected to the plurality of processing elements; the controller may also be connected to the memory. The controller is configured to load input data from the memory into the registers of the processing elements and to control the processing elements to process the input data. For example, the memory may store a first matrix and a second matrix on which the processor is to perform a matrix multiplication operation, in which case the controller may load the first matrix and the second matrix into the registers of the processing elements and control the processing elements to perform the matrix multiplication operation.
In a possible implementation manner, the memory may further store an executable program. The executable program may include instructions which, when executed, implement the matrix multiplication operation on the first matrix and the second matrix. The controller may be provided with a loader, a decoder, and the like. The loader may be configured to load input data from the memory into the registers of the processing elements. The decoder may decode the data-access instructions in the executable program according to the storage addresses of the loaded input data: for a data-access instruction, the address at which the data is stored in the register is obtained by decoding and assigned to the instruction, and the decoded instruction is sent to the processing element. The processing element executes the instruction and thereby processes the data, for example to perform the matrix multiplication operation on the first matrix and the second matrix.
In one possible implementation, the memory may be an on-chip cache. The controller may load the executable program and the input data (e.g., the input matrices, including the left-multiplication and right-multiplication matrices) from off-chip flash memory into the memory (the on-chip cache), and then perform the subsequent matrix multiplication.
In one possible implementation, the controller may also load the input matrix and the executable program directly from the off-chip memory into the register of the processing element, which is not limited by the present disclosure.
The PE may further include an operator to complete a specified operation, for example, a matrix operation, and the PE may include, for example, a multiplier, an adder, and the like, and the specific structures of the PEs may be the same or different, which is not limited in this disclosure. Other types of operators may be included in the PE to accommodate various different operation processes, and the number and types of operators included in the PE are not limited by the present disclosure.
The input matrices for the matrix multiplication operation may include a left-multiplication matrix and a right-multiplication matrix, where the left-multiplication matrix is the matrix to the left of the multiplication sign and the right-multiplication matrix is the matrix to the right of the multiplication sign.
Since the number and arrangement of PEs in the processor are fixed, the controller may determine, before loading data into the registers of the processing elements and performing computations, whether to partition the input matrices based on the arrangement of the processing elements and the numbers of rows and columns of the input matrices. The arrangement of the processing elements refers to the numbers of rows and columns of the processing element array, and the numbers of rows and columns of the input matrices refer to those of the left-multiplication and right-multiplication matrices.
Specifically, the controller determines whether the number of rows of an input matrix, or of the transpose of an input matrix, is greater than the number of rows of processing elements, and whether its number of columns is greater than the number of columns of processing elements, and decides whether to partition the input matrices according to the result.
The input matrices need not be partitioned if, for one of the input matrices, the number of rows is not greater than the number of rows of processing elements and the number of columns is not greater than the number of columns of processing elements, and, for the transpose of the other input matrix, the number of rows is not greater than the number of rows of processing elements and the number of columns is not greater than the number of columns of processing elements.
The controller may partition the input matrices if, for any input matrix or for the transpose of any input matrix, the number of rows is greater than the number of rows of processing elements or the number of columns is greater than the number of columns of processing elements.
For example, assume that the array of processing elements is denoted $PEs_{MN}$, indicating that the processing elements form an $M \times N$ matrix, where $M$ is the number of rows and $N$ is the number of columns. Assume that one input matrix $A_{mn}$ is an $m \times n$ matrix, where $m$ is the number of rows and $n$ is the number of columns, and that the other input matrix $B_{nk}$ is an $n \times k$ matrix, where $n$ is the number of rows and $k$ is the number of columns. If the number of rows $m$ of $A_{mn}$ is not greater than $M$ and its number of columns $n$ is not greater than $N$, and the transposed matrix $B_{nk}^{T}$ of $B_{nk}$ has a number of rows $k$ not greater than the number of rows $M$ of processing elements and a number of columns $n$ not greater than the number of columns $N$ of processing elements, the input matrices need not be partitioned. Alternatively, if the transposed matrix $A_{mn}^{T}$ of $A_{mn}$ has a number of rows $n$ not greater than the number of rows $M$ of processing elements and a number of columns $m$ not greater than the number of columns $N$ of processing elements, and $B_{nk}$ has a number of rows $n$ not greater than $M$ and a number of columns $k$ not greater than $N$, the input matrices need not be partitioned.
If the number of rows $m$ of $A_{mn}$ is greater than the number of rows $M$ of processing elements or its number of columns $n$ is greater than the number of columns $N$ of processing elements, or the transposed matrix $B_{nk}^{T}$ of $B_{nk}$ has a number of rows $k$ greater than $M$ or a number of columns $n$ greater than $N$, the input matrices may be partitioned. Alternatively, if $A_{mn}^{T}$ has a number of rows $n$ greater than $M$ or a number of columns $m$ greater than $N$, or $B_{nk}$ has a number of rows $n$ greater than $M$ or a number of columns $k$ greater than $N$, the input matrices may be partitioned.
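The size check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `fits` and `needs_partition` are hypothetical, and the sketch assumes that partitioning is required exactly when neither of the two no-partition conditions holds.

```python
def fits(rows, cols, pe_rows, pe_cols):
    """Return True if a rows x cols matrix fits on a pe_rows x pe_cols PE array."""
    return rows <= pe_rows and cols <= pe_cols


def needs_partition(a_shape, b_shape, pe_shape):
    """Decide whether inputs A (m x n) and B (n x k) must be partitioned.

    Assumption of this sketch: partitioning is needed exactly when neither
    no-partition condition from the text is satisfied.
    """
    (m, n), (n_b, k) = a_shape, b_shape
    if n != n_b:
        raise ValueError("inner dimensions of A and B must match")
    M, N = pe_shape
    # Option 1: load A as-is together with the transpose of B (k x n).
    a_with_bt = fits(m, n, M, N) and fits(k, n, M, N)
    # Option 2: load the transpose of A (n x m) together with B as-is.
    at_with_b = fits(n, m, M, N) and fits(n, k, M, N)
    return not (a_with_bt or at_with_b)


print(needs_partition((2, 2), (2, 2), (2, 2)))  # False
print(needs_partition((3, 2), (2, 2), (2, 2)))  # True
print(needs_partition((2, 4), (4, 3), (2, 2)))  # True
```

The three calls correspond to a case that fits a 2 x 2 array directly and to two cases that need partitioning.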
To partition only one of the input matrices, the controller may split the rows of the left-multiplication matrix or the columns of the right-multiplication matrix according to the arrangement of the processing elements.
For example, assume that the array of processing elements is $PE_{22}$, the left-multiplication matrix is $A_{32}$, and the right-multiplication matrix is $B_{22}$. Then $A_{32}$ may be split into $A_{12}$ and $A_{22}$, each of which is multiplied by $B_{22}$. If the left-multiplication matrix is $A_{22}$ and the right-multiplication matrix is $B_{32}$, then $B_{32}$ may be split into $B_{12}$ and $B_{22}$.
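As a rough illustration of splitting the rows of the left-multiplication matrix, the following Python sketch cuts a 3 x 2 matrix into row blocks of at most two rows, multiplies each block by the 2 x 2 right-multiplication matrix, and stacks the partial results. The helper names `matmul` and `split_rows` and the numeric values are made up for this sketch; the order of the blocks (two rows first, then one) is an arbitrary choice.

```python
def matmul(a, b):
    """Plain reference matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(matrix, max_rows):
    """Cut a matrix into consecutive row blocks of at most max_rows rows."""
    return [matrix[i:i + max_rows] for i in range(0, len(matrix), max_rows)]


A = [[1, 2], [3, 4], [5, 6]]   # 3 x 2 left-multiplication matrix
B = [[7, 8], [9, 10]]          # 2 x 2 right-multiplication matrix

blocks = split_rows(A, 2)      # one 2 x 2 block and one 1 x 2 block
# Each block is multiplied by B separately; the results are stacked row-wise.
stacked = [row for block in blocks for row in matmul(block, B)]
print(stacked == matmul(A, B))  # True
```

Because each row of the product depends on only one row of the left-multiplication matrix, the row blocks can be processed independently.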
If both input matrices are to be partitioned, the controller may partition the left-multiplication matrix in the column direction and the right-multiplication matrix in the row direction in the same manner, according to the arrangement of the processing elements and the numbers of rows and columns of the input matrices.
That is, the left-multiplication matrix and the transposed right-multiplication matrix may be partitioned in the same manner in the column direction, or in the same manner in the row direction. Here, partitioning in the same manner means that the first matrices and the second matrices obtained by partitioning have the same number of columns or the same number of rows, which ensures that the matrix operation can be completed correctly.
It is assumed that two or more first matrices are obtained after the left-multiplication matrix is partitioned and two or more second matrices are obtained after the right-multiplication matrix is partitioned, or that two or more first matrices are obtained after the right-multiplication matrix is partitioned and two or more second matrices are obtained after the left-multiplication matrix is partitioned.
When the left-multiplication matrix is partitioned in the column direction and the right-multiplication matrix in the row direction in the same manner, according to the arrangement of the processing elements and the numbers of rows and columns of the input matrices, every first matrix and second matrix obtained must satisfy the condition under which no partitioning is needed. That is, the transpose of the first matrix and the second matrix must each have no more rows than the processing element array has rows and no more columns than the processing element array has columns.
In a possible implementation manner, the controller may partition the matrices so that the numbers of rows and columns of each resulting first matrix or second matrix are as close as possible to the numbers of rows and columns of the processing elements, which improves operation efficiency and shortens operation time. For example, if the processing elements form a 4 × 4 array, the matrices may be divided into 4 × 4 blocks so that the processing elements are used as fully as possible.
For example, assume that the processing elements form a 2 × 2 array and the input matrices are one 2 × 4 matrix and one 4 × 3 matrix. The division can be done in many ways. FIGS. 2a and 2b show different ways of dividing, in which matrix $A_{24}$ in the column direction and matrix $B_{43}$ in the row direction are partitioned in the same manner. FIG. 2a is one example: matrix $A_{24}$ is divided into two parts in the column direction, each part comprising two columns, and matrix $B_{43}$ is divided into two parts in the row direction, each part comprising two rows. FIG. 2b is another example: matrix $A_{24}$ is divided into three parts in the column direction, one part comprising two columns and the other two parts comprising one column each, and matrix $B_{43}$ is divided into three parts in the row direction, one part comprising two rows and the other two parts comprising one row each. The above arrangement of the processing elements and manner of dividing the input matrices are merely one example of the present disclosure and do not limit the present disclosure in any way.
The present disclosure does not specifically limit the manner of dividing the row direction of the left-multiplication matrix and the column direction of the right-multiplication matrix, as long as all of the divided matrices satisfy the condition under which no further partitioning is needed.
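The two divisions of FIGS. 2a and 2b can be checked numerically with a short Python sketch. It splits the columns of the 2 x 4 matrix and the rows of the 4 x 3 matrix with the same widths and sums the partial products. The function name `blocked_product` and the matrix values are hypothetical, and the sketch only checks the arithmetic; it does not model whether each block fits on the processing element array.

```python
def matmul(a, b):
    """Plain reference matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def blocked_product(a, b, widths):
    """Split columns of a and rows of b by the same widths and sum the partial products."""
    if sum(widths) != len(b) or any(len(row) != len(b) for row in a):
        raise ValueError("widths must cover the shared dimension of a and b")
    total = [[0] * len(b[0]) for _ in a]
    start = 0
    for w in widths:
        a_part = [row[start:start + w] for row in a]   # column block of a
        b_part = b[start:start + w]                    # matching row block of b
        partial = matmul(a_part, b_part)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, partial)]
        start += w
    return total


A = [[1, 2, 3, 4], [5, 6, 7, 8]]                    # 2 x 4
B = [[1, 0, 2], [0, 1, 3], [4, 0, 1], [2, 2, 0]]    # 4 x 3
print(blocked_product(A, B, [2, 2]) == matmul(A, B))     # True, division as in FIG. 2a
print(blocked_product(A, B, [2, 1, 1]) == matmul(A, B))  # True, division as in FIG. 2b
```

Both divisions give the same product because the blocks are cut along the shared dimension in the same way on both sides.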
According to the operation rule of matrix multiplication, elements in rows of a left multiplication matrix and elements in columns of a right multiplication matrix are multiplied one by one and then summed. Therefore, in a possible implementation manner, for the case of non-blocking, or the first matrix and the corresponding second matrix after blocking, the controller is configured to load each element of the transposed matrix and the second matrix of the first matrix into the register of each processing element, respectively, and store the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element. According to the matrix multiplication rule, the elements at the corresponding positions of the transposed matrix and the second matrix may refer to elements in the transposed matrix and the second matrix that need to be multiplied.
In a possible implementation manner, the controller may first transpose the first matrix to obtain the transposed matrix and then load the elements of the transposed matrix into the registers of the processing elements. In another possible implementation manner, the controller may perform the transposition during loading. For example, assuming that the first matrix is the right-multiplication matrix, the controller may, while loading the elements of the first matrix into the registers of the processing elements, load each column of elements of the first matrix into the registers of one row of processing elements, thereby transposing the first matrix.
In one possible implementation, the transposed matrix and the second matrix are aligned in the row direction or the column direction. Specifically, if the left-multiplication matrix is transposed, then after loading, the rows of the transposed matrix are aligned with the rows of the second matrix in the column direction; if the right-multiplication matrix is transposed, then after loading, the columns of the transposed matrix are aligned with the columns of the second matrix in the row direction.
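Transposing during loading can be pictured with the following Python sketch, in which the register file of the processing element array is modelled as a nested list with one value per processing element. The function name `load_transposed` and the use of `None` for unused registers are assumptions of this sketch, not details given in the disclosure.

```python
def load_transposed(first, pe_rows, pe_cols):
    """Write column j of `first` into PE row j; unused registers stay None."""
    rows, cols = len(first), len(first[0])
    if cols > pe_rows or rows > pe_cols:
        raise ValueError("transposed matrix does not fit on the PE array")
    registers = [[None] * pe_cols for _ in range(pe_rows)]
    for j in range(cols):                  # column j of the first matrix ...
        for i in range(rows):
            registers[j][i] = first[i][j]  # ... lands in row j of the PE array
    return registers


first = [[1, 2, 3], [4, 5, 6]]             # 2 x 3 first matrix
print(load_transposed(first, 4, 4))
# [[1, 4, None, None], [2, 5, None, None], [3, 6, None, None], [None, None, None, None]]
```

No separate transposed copy of the matrix is built; the transposition happens purely through the choice of destination register.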
After the transpose matrix and the second matrix are loaded, the controller is further configured to control elements in the transpose matrix or the second matrix to roll in a row direction or a column direction, control the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and sum up the element products in the same row or the same column to obtain a first intermediate result. Specifically, the controller controls the processing element, the transposed matrix stored in the register, and the second matrix to repeat the following process until the elements in the transposed matrix or the second matrix are restored to the non-scrolled position: the controller controls the processing element to multiply the elements in the corresponding register to obtain an element product, sums the element products in the same row or the same column to obtain a first intermediate result, and controls the transposed matrix or the second matrix stored in the register to roll by one row or one column in the row direction or the column direction.
That is to say, the processing elements are first controlled to multiply the elements in their registers to obtain element products, the element products in the same row or the same column are summed to obtain a first intermediate result, and the elements of the transposed matrix or the second matrix are then rolled by one row or one column in the row direction or the column direction. It is then judged whether the positions of the elements of the transposed matrix or the second matrix after rolling are the same as their initial positions. If they are the same, the process ends. If they differ, the multiplication, summation, rolling and judgment are repeated, and this cycle continues until the positions of the elements of the transposed matrix or the second matrix after rolling are the same as their initial positions.
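The loop described above can be simulated in Python for one concrete configuration. The sketch below assumes that the first matrix is the left-multiplication matrix, that the transposed matrix is rolled to the left by one column per step, and that the first matrix is m x n while the second is n x m; these choices and the function names are assumptions made for illustration. The return to the initial position is detected by counting steps, since m single-column rolls bring an m-column matrix back to where it started.

```python
def roll_columns_left(grid):
    """Move every column one position to the left; the first column wraps to the end."""
    return [row[1:] + row[:1] for row in grid]


def first_intermediate_results(a, b):
    """Rolling multiply-accumulate for a (m x n, left) and b (n x m, right)."""
    m, n = len(a), len(a[0])
    if len(b) != n or len(b[0]) != m:
        raise ValueError("this sketch assumes a is m x n and b is n x m")
    t = [list(col) for col in zip(*a)]   # transposed first matrix, n x m
    results = []
    for _ in range(m):                   # after m rolls t is back at its start
        # Every PE multiplies its two register values; each column is then summed.
        results.append([sum(t[i][j] * b[i][j] for i in range(n)) for j in range(m)])
        t = roll_columns_left(t)
    return results


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(first_intermediate_results(A, B))  # [[19, 50], [43, 22]]
```

Here A times B equals [[19, 22], [43, 50]], so the first step yields the diagonal of the product and the second step yields the remaining entries, which still have to be rearranged.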
In one example, the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix. In another example, the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix.
When the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, the controller controls the elements of the transposed matrix to roll in the row direction, or controls the elements of the second matrix to roll in the row direction; the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result.
When the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, the controller controls the elements of the transposed matrix to roll in the column direction, or controls the elements of the second matrix to roll in the column direction; the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.
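For the case in which the right-multiplication matrix is transposed, the following Python sketch sums the element products along each row and rolls the transposed matrix by one row per step. The upward rolling direction, the shapes (left matrix m x n, right matrix n x m) and the function names are assumptions chosen for this illustration.

```python
def roll_rows_up(grid):
    """Move every row one position up; the first row wraps to the bottom."""
    return grid[1:] + grid[:1]


def row_sums_with_rolling(left, right):
    """Rolling multiply-accumulate when the right-multiplication matrix is transposed."""
    m, n = len(left), len(left[0])
    if len(right) != n or len(right[0]) != m:
        raise ValueError("this sketch assumes left is m x n and right is n x m")
    t = [list(col) for col in zip(*right)]   # transposed right matrix, m x n
    results = []
    for _ in range(m):                       # m rolls restore the initial position
        # Every PE multiplies its two register values; each row is then summed.
        results.append([sum(left[i][j] * t[i][j] for j in range(n)) for i in range(m)])
        t = roll_rows_up(t)
    return results


A = [[1, 2], [3, 4]]   # left-multiplication matrix
B = [[5, 6], [7, 8]]   # right-multiplication matrix
print(row_sums_with_rolling(A, B))  # [[19, 50], [22, 43]]
```

In this sketch, entry i of the result of step s equals element (i, (i + s) mod m) of the product, so each step contributes one wrapped diagonal of the product.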
In one possible implementation, the scrolling described above moves the matrix by one row or one column at a time. A closed loop is formed between the processing elements that store the elements of the matrix. Since adjacent processing elements are connected, the controller can determine the form of the loop according to the dimensions of the matrix. For example, if the matrix is to be scrolled by rows (in the column direction), the first row and the last row of processing elements storing elements of the matrix are connected; if the matrix is scrolled upward, the first row of elements of the matrix moves from its original storage position to the position where the last row of elements was stored. If the matrix is to be scrolled by columns (in the row direction), the first column and the last column of processing elements storing elements of the matrix are connected; if the matrix is scrolled to the left, the first column of elements of the matrix moves from its original storage position to the position where the last column of elements was stored. The connection between the first and last rows (or columns) of processing elements may be a virtual connection, that is, there is no physical connection line; instead, the controller records the corresponding processing elements so that a closed loop is formed during scrolling.
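The closed loop can be modelled in Python with modular indexing over the part of the array that actually holds matrix elements. The sketch assumes the matrix occupies the top-left corner of the processing element array; that placement and the function names `roll_up_within` and `roll_left_within` are assumptions of this illustration.

```python
def roll_up_within(registers, n_rows, n_cols):
    """Roll the top-left n_rows x n_cols region up by one row, wrapping inside it."""
    out = [row[:] for row in registers]
    for r in range(n_rows):
        src = (r + 1) % n_rows       # the modulo plays the role of the virtual link
        out[r][:n_cols] = registers[src][:n_cols]
    return out


def roll_left_within(registers, n_rows, n_cols):
    """Roll the top-left n_rows x n_cols region left by one column, wrapping inside it."""
    out = [row[:] for row in registers]
    for r in range(n_rows):
        for c in range(n_cols):
            out[r][c] = registers[r][(c + 1) % n_cols]
    return out


# A 3 x 2 matrix stored on a 4 x 3 PE array; zeros mark unused registers.
regs = [[1, 2, 0], [3, 4, 0], [5, 6, 0], [0, 0, 0]]
print(roll_up_within(regs, 3, 2))    # [[3, 4, 0], [5, 6, 0], [1, 2, 0], [0, 0, 0]]
print(roll_left_within(regs, 3, 2))  # [[2, 1, 0], [4, 3, 0], [6, 5, 0], [0, 0, 0]]
```

Only the registers inside the recorded region take part in the loop, so the loop length follows the matrix dimensions rather than the size of the array.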
Once the elements of the transposed matrix or the second matrix have returned to their un-scrolled positions, the scrolling and the calculation of the first intermediate result are complete, and the controller may process the first intermediate result to obtain the product of the first matrix and the second matrix.
In one possible implementation, the controller stores the first intermediate result in rows or columns, and the first intermediate result is scrolled in a row direction or a column direction to obtain a product of the first matrix and the second matrix. The specific processing method is related to the transposed matrix and the scrolling direction, for example:
when the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, and the transposed matrix is scrolled upward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result may be scrolled to the right in the row direction; for example, the ith row of elements scrolls to the right in the row direction by i-1 steps;
when the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, and the transposed matrix is scrolled downward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result may be scrolled to the left in the row direction; for example, the ith row of elements scrolls to the left in the row direction by i-1 steps;
when the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, and the transposed matrix is scrolled to the left in the row direction, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result scrolls downward in the column direction by i-1 steps to obtain the product of the input matrices;
when the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, and the transposed matrix is scrolled to the right in the row direction, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result is scrolled in the column direction by i-1 steps to obtain the product of the input matrices.
In the related art, for matrix multiplication with a large input matrix size, in order to improve the efficiency of matrix operation, the operation process is usually implemented in a multi-stage pipeline manner, but each stage of the multi-stage pipeline processes a part of input data, so that data needs to be frequently read from a memory, and the requirement on bandwidth is high due to frequent access to the memory. In order to solve the technical problem, the processor provided by the disclosure can perform block storage on an input matrix and then perform matrix multiplication on a corresponding matrix after the input matrix is blocked, so that the memory access frequency can be reduced, and the operation efficiency can be improved.
If the first matrix is obtained by partitioning the left-multiplication matrix, or the second matrix is obtained by partitioning the right-multiplication matrix, then in one possible implementation the controller is further configured to calculate the product of the left-multiplication matrix and the right-multiplication matrix from the products of the first matrices and the second matrices. That is, for each first matrix and its corresponding second matrix obtained by partitioning, the product of the first matrix and the second matrix is calculated, and the product of the left-multiplication matrix and the right-multiplication matrix is then calculated from these block products. In this way, the memory access frequency can be reduced and the operation efficiency improved.
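As a rough illustration of how the full product can be assembled from block products, the sketch below partitions two square matrices into equal square blocks, multiplies corresponding blocks, and accumulates the results. It assumes square inputs whose size is divisible by the block size. The helper names (`split`, `blocked_matmul`) are hypothetical, and the plain `matmul` function merely stands in for the per-block multiplication that the array of processing elements would perform.

```python
def matmul(a, b):
    """Reference product of two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m, size):
    """Partition a square matrix into size x size blocks."""
    n = len(m)
    return [[[row[c:c + size] for row in m[r:r + size]]
             for c in range(0, n, size)] for r in range(0, n, size)]


def blocked_matmul(left, right, size):
    """Multiply left by right block by block and reassemble the product."""
    lb, rb = split(left, size), split(right, size)
    nb, n = len(lb), len(left)
    out = [[0] * n for _ in range(n)]
    for bi in range(nb):
        for bj in range(nb):
            acc = [[0] * size for _ in range(size)]
            for bk in range(nb):
                # Product of one first matrix and its corresponding second matrix.
                acc = add(acc, matmul(lb[bi][bk], rb[bk][bj]))
            for r in range(size):
                out[bi * size + r][bj * size:(bj + 1) * size] = acc[r]
    return out


left = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
right = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
product = blocked_matmul(left, right, 2)
```

Each block of the output is a sum of block products, which is the ordinary matrix multiplication rule applied with blocks in place of scalar elements.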
In another possible implementation, the processor includes multiple sets of registers. That is, the controller may divide the registers of the processing elements into a plurality of groups according to the case of blocking the matrix.
In this way, after the input matrices are partitioned, the controller may transpose two or more first matrices to obtain transposed matrices; the controller loads the transposed matrices and the two or more second matrices into the plurality of groups of registers for stacked storage, with each transposed matrix and the second matrix at the corresponding position stored in the same group of registers.
Each time before the elements of the transposed matrix or the second matrix are scrolled once in the row direction or the column direction, the controller controls the processing elements to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; after controlling the elements in a group of registers to scroll by one row or one column of the transposed matrix in the row direction or the column direction, the controller also corrects the scrolling result.
In one possible implementation, correcting the scrolling result includes:
if the data is scrolled to the left in the row direction, the correction is to scroll the last column of data of each transposed matrix after scrolling into the last column of the preceding adjacent transposed matrix;
if the data is scrolled to the right in the row direction, the correction is to scroll the first column of data of each transposed matrix after scrolling into the first column of the next adjacent transposed matrix;
if the data is scrolled upward in the column direction, the correction is to scroll the last row of data of each transposed matrix after scrolling into the last row of the preceding adjacent transposed matrix;
if the data is scrolled downward in the column direction, the correction is to scroll the first row of data of each transposed matrix after scrolling into the first row of the next adjacent transposed matrix;
where each transposed matrix is the matrix obtained by transposing the corresponding block matrix after partitioning. The specific calculation and correction processes are described in detail in the examples below.
The present disclosure also provides an operation method for implementing matrix multiplication.
For the case of no blocking, or the first matrix and the second matrix after blocking, fig. 3 shows a flowchart of an operation method according to an embodiment of the present disclosure. For the case of no partitioning, the left-multiplication matrix may be directly used as the first matrix and the right-multiplication matrix may be directly used as the second matrix, or the left-multiplication matrix may be directly used as the second matrix and the right-multiplication matrix may be used as the first matrix, which is not limited in this disclosure.
As shown in fig. 3, the operation method provided by the present disclosure may include the following steps:
Step S11, transposing the first matrix to obtain a transposed matrix, loading the transposed matrix and the second matrix into the registers of the processing elements, and storing elements at corresponding positions of the transposed matrix and the second matrix in the register of the same processing element.
According to the rule of matrix multiplication, the elements at corresponding positions of the transposed matrix and the second matrix are the elements of the transposed matrix and the second matrix that need to be multiplied together.
In one possible implementation, the transposed matrix and the second matrix are aligned in the row direction or the column direction. Specifically, if the left-multiplication matrix is transposed, then after loading the rows of the transposed matrix and of the second matrix are aligned in the column direction; if the right-multiplication matrix is transposed, then after loading the columns of the transposed matrix and of the second matrix are aligned in the row direction.
Step S12, controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on the elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result.
In a possible implementation, step S12 may specifically include repeating the following process until the elements of the transposed matrix or the second matrix are restored to their un-scrolled positions: the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; the transposed matrix or the second matrix is then scrolled within the array of processing elements by one row or one column in the row direction or the column direction.
Step S13, the first intermediate result is processed to obtain a product of the first matrix and the second matrix.
That is to say, for steps S12 and S13, the processing elements are first controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result; the elements of the transposed matrix or the second matrix are then controlled to scroll by one row or one column in the row direction or the column direction, and it is judged whether the positions of the elements after scrolling are the same as their initial positions. If they are the same, the loop ends and the process continues to step S13. If they are different, the multiplication, summation, scrolling and judgment are repeated, and this cycle continues until, after a scroll, the elements of the transposed matrix or the second matrix are back at their initial positions.
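The multiply, sum, scroll and judge loop of step S12 can be sketched as follows. This is a minimal sketch, not the disclosed hardware behaviour. It assumes the case in which element products in the same row are summed and the transposed matrix is scrolled upward by one row per iteration. The function name `run_step_s12` is hypothetical, and the position of the rows is tracked with an index list so that the "restored to the initial position" check does not depend on the element values.

```python
def run_step_s12(second, transposed):
    """Return the columns of the first intermediate result.

    second and transposed are n x n nested lists that share the same
    processing elements; transposed is scrolled up one row per pass.
    """
    n = len(second)
    initial = list(range(n))
    order = list(range(n))  # order[i]: which transposed row PE row i holds
    columns = []
    while True:
        # Multiply the two elements in each processing element and
        # sum the element products along each row.
        columns.append([
            sum(second[i][j] * transposed[order[i]][j] for j in range(n))
            for i in range(n)
        ])
        # Scroll up by one row; the first row wraps to the last row.
        order = order[1:] + order[:1]
        # Judge whether the elements are back at their initial positions.
        if order == initial:
            break
    return columns


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]          # second matrix
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]          # first matrix
b_t = [list(col) for col in zip(*b)]           # transposed matrix
columns = run_step_s12(a, b_t)
```

For an n x n matrix the loop body runs n times, so the first intermediate result has n columns, each produced before the corresponding scroll.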
In one example, the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix. In another example, the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix.
When the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, in step S12 the elements of the transposed matrix, or the elements of the second matrix, are controlled to scroll in the row direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result.
When the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, in step S12 the elements of the transposed matrix, or the elements of the second matrix, are controlled to scroll in the column direction; the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.
In one possible implementation, the scrolling described above proceeds one row or one column at a time.
For step S13, processing the first intermediate result may refer to storing the first intermediate result in rows or in columns and scrolling it in the row direction or the column direction to obtain the product of the first matrix and the second matrix. The specific processing depends on which matrix is transposed and on the scrolling direction, for example:
when the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, and the transposed matrix is scrolled upward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result may be scrolled to the right in the row direction; for example, the ith row of elements scrolls to the right in the row direction by i-1 steps;
when the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix, and the transposed matrix is scrolled downward in the column direction, the first intermediate result may be stored in columns, and the elements in the first intermediate result are scrolled to the left in the row direction; for example, the ith row of elements scrolls to the left in the row direction by i-1 steps;
when the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, and the transposed matrix is scrolled to the left in the row direction, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result scrolls downward in the column direction by i-1 steps to obtain the product of the input matrices;
when the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix, and the transposed matrix is scrolled to the right in the row direction, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result is scrolled in the column direction by i-1 steps to obtain the product of the input matrices.
The process of steps S11-S13 is described below for two cases: the first matrix being the right-multiplication matrix and the second matrix being the left-multiplication matrix, and the first matrix being the left-multiplication matrix and the second matrix being the right-multiplication matrix.
Example 1: the first matrix is the right-multiplication matrix and the second matrix is the left-multiplication matrix; that is, the right-multiplication matrix is transposed.
Suppose the first matrix bnk and the second matrix amn are both 3 × 3 matrices, and the processing elements form a 4 × 4 array.
FIG. 4 shows a schematic diagram of an array of processing elements according to an embodiment of the present disclosure. The calculation method of the present disclosure will be described with reference to fig. 4 and 3.
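Suppose the first matrix is
$$\begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix}$$
and the second matrix is
$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}$$
Then transposing the first matrix gives the transposed matrix
$$\begin{bmatrix} B_{11} & B_{21} & B_{31} \\ B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \end{bmatrix}$$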
The second matrix is loaded into the registers of the processing elements in such a way that the rows and columns of the second matrix are arranged in the registers of the processing elements, i.e. the elements of the second matrix are arranged in the same way in the matrix as in the registers of the processing elements.
In one possible implementation, the number of rows and columns of the elements in the second matrix is the same as the number of rows and columns of the processing elements loaded with the elements in the array of processing elements.
For example, in one example, A11 may be loaded into the register of PE11, A12 into the register of PE12, A13 into the register of PE13, A21 into the register of PE21, …, and A33 into the register of PE33; that is, the index of each element in the second matrix may be identical to the index of the processing element in which it is stored.
In another example, A11 may be loaded into the register of PE12, A12 into the register of PE13, A13 into the register of PE14, A21 into the register of PE22, …, and A33 into the register of PE34; that is, the elements of the second matrix are arranged in the registers of the processing elements in the same way as they are arranged in the matrix.
It should be noted that the above are only some examples of loading the second matrix and do not limit the present disclosure in any way; those skilled in the art will understand that it suffices for the elements of the second matrix to be arranged in the registers of the processing elements in the same way as in the matrix.
The transposed matrix may be loaded into the registers of the processing elements in the same manner as the second matrix, such that after loading the columns of the second matrix are aligned with the columns of the transposed matrix, and the elements at corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element.
For example, suppose A11 is loaded into the register of PE11, A12 into the register of PE12, A13 into the register of PE13, A21 into the register of PE21, …, and A33 into the register of PE33; that is, the index of each element in the second matrix is identical to the index of the processing element in which it is stored. Then B11 may be loaded into the register of PE11, B21 into the register of PE12, B31 into the register of PE13, B12 into the register of PE21, B22 into the register of PE22, B32 into the register of PE23, …, and B33 into the register of PE33. That is, the transposed matrix is loaded into the registers of the processing elements column-aligned with the second matrix.
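The loading arrangement just described can be checked with a small symbolic sketch. It assumes a 4 × 4 array of processing elements modelled as a nested list, with each occupied cell holding a pair (element of the second matrix, element of the transposed matrix). The function names `transpose` and `load` are hypothetical, and the element labels are plain strings.

```python
def transpose(m):
    """Return the transpose of a matrix given as a nested list."""
    return [list(col) for col in zip(*m)]


def load(second, transposed, pe_rows=4, pe_cols=4):
    """Load both matrices so that corresponding elements share one PE.

    Returns a pe_rows x pe_cols grid; unused processing elements are None.
    """
    grid = [[None] * pe_cols for _ in range(pe_rows)]
    for i, (row_s, row_t) in enumerate(zip(second, transposed)):
        for j, (s, t) in enumerate(zip(row_s, row_t)):
            grid[i][j] = (s, t)
    return grid


A = [[f"A{i}{j}" for j in range(1, 4)] for i in range(1, 4)]  # second matrix
B = [[f"B{i}{j}" for j in range(1, 4)] for i in range(1, 4)]  # first matrix
pe = load(A, transpose(B))

print(pe[0][1])  # register contents of PE12
```

With this arrangement every processing element in row i holds one factor pair of the diagonal entry Cii, which is why the first row sum yields C11, C22 and C33.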
In a possible implementation manner, the transposed matrix may be loaded first and then the second matrix is loaded, or the loading is performed simultaneously, and the specific loading manner is not limited in the present disclosure as long as it is ensured that the transposed matrix and the second matrix are aligned in the row direction after the loading, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element.
In one possible implementation, after the input matrices are loaded, for the case in which the right-multiplication matrix is transposed, the processing elements storing the first row of the transposed matrix and the processing elements storing the last row of the transposed matrix may be connected in the column direction to form a ring, and data may flow within the ring so that the matrix scrolls in the column direction. As shown in FIG. 1, connecting PE11 and PE31 forms a ring, connecting PE12 and PE32 forms a ring, and connecting PE13 and PE33 forms a ring. Thus, when data flows upward within a ring, the data of the first row flows to the third row, the data of the second row flows to the first row, and the data of the third row flows to the second row; when data flows downward, the data of the first row flows to the second row, the data of the second row flows to the third row, and the data of the third row flows to the first row.
In this embodiment, only the transposed matrix may be scrolled. Before the transposed matrix is scrolled for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products, and sum the element products in the same row to obtain a first intermediate result. Taking the above example, the controller may control PE11 to multiply the elements A11 and B11 stored in its register to obtain the element product A11×B11; likewise, the controller may control PE12 and PE13 to obtain A12×B21 and A13×B31.
The controller may then sum the element products in the same row to obtain C11=A11×B11+A12×B21+A13×B31.
In the same manner, C22 and C33 can be obtained.
In one possible implementation, C11, C22 and C33 may be temporarily stored in a buffer as the first column of the first intermediate result. The buffer may be located in the processor at a location other than the plurality of processing elements.
Next, in one possible implementation, the transposed matrix may be scrolled up by one row, with the elements of the first row scrolling to the last row (of the processing elements storing elements of the matrix). Alternatively, the transposed matrix may be scrolled down by one row; the present disclosure does not limit the specific scrolling direction. In this example, scrolling is performed in the column direction, one row at a time.
As shown in FIG. 1, when scrolling up, the data of the first row may scroll to the third row as follows:
$$\begin{bmatrix} B_{12} & B_{22} & B_{32} \\ B_{13} & B_{23} & B_{33} \\ B_{11} & B_{21} & B_{31} \end{bmatrix}$$
In one possible implementation, the scrolling of data in the matrix may be implemented using redundant registers within the processing elements or an on-chip cache within the processor. This implementation is applicable to the scrolling processes in Examples 1 and 2 of the present disclosure.
For example, as in example 1, a first row of elements of the transpose matrix may be temporarily stored in an extra register, the processing element in the second row may be controlled to send a second row of elements of the transpose matrix stored in the corresponding register to the processing element in the first row, then the processing element in the third row may be controlled to send a third row of elements of the transpose matrix stored in the corresponding register to the processing element in the second row, and finally, the temporarily stored first row of elements may be stored in a register corresponding to the processing element in the third row, so as to implement a scrolling process of a row of data of the transpose matrix. The above process is only one example of the present disclosure and does not limit the present disclosure in any way.
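The register-based scrolling procedure above maps directly onto a short routine. The sketch below is illustrative only: each row of registers is modelled as a list, the extra register is modelled as a local variable named `spare`, and the function name `scroll_up_with_spare_register` is hypothetical.

```python
def scroll_up_with_spare_register(rows):
    """Scroll the stored rows up by one, in place, using one spare buffer."""
    spare = rows[0]                # park the first row in the extra register
    for r in range(1, len(rows)):
        rows[r - 1] = rows[r]      # row r is sent to the processing elements of row r-1
    rows[-1] = spare               # the parked first row lands in the last row
    return rows


b_t = [
    ["B11", "B21", "B31"],
    ["B12", "B22", "B32"],
    ["B13", "B23", "B33"],
]
scroll_up_with_spare_register(b_t)
```

The moves are performed from the second row downward so that no row is overwritten before it has been forwarded, which is why a single spare buffer is enough.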
The processing elements are again controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain a first intermediate result: the first row of a33 multiplied by the second row of the transposed matrix yields C12, the second row of a33 multiplied by the third row of the transposed matrix yields C23, and the third row of a33 multiplied by the first row of the transposed matrix yields C31. C12, C23 and C31 are temporarily stored in the buffer as the second column of the first intermediate result.
The transposed matrix is scrolled up by one row again, the elements in the corresponding registers are multiplied to obtain element products, and the element products in the same row are summed to obtain the first intermediate results C13, C21 and C32, which are temporarily stored in the buffer as the third column of the first intermediate result.
That is, the first intermediate result stored in the buffer is
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{22} & C_{23} & C_{21} \\ C_{33} & C_{31} & C_{32} \end{bmatrix}$$
For step S13, in the case where the transposed matrix is scrolled upward, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the ith row of elements in the first intermediate result to the right in the row direction by i-1 steps to obtain the product of the input matrices. Scrolling here likewise means scrolling within a closed loop in the row direction: the first column and the last column of processing elements storing elements of the matrix are connected to form a closed loop, so that when scrolling to the right, the elements stored in the last column of processing elements scroll into the first column of processing elements.
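The reason a shift of i-1 steps suffices in the upward-scrolling case can be written out as a short derivation. The symbols n (matrix size), s (number of upward scrolls already completed, starting from 0) and R (the first intermediate result stored in columns) are introduced here only for illustration and do not appear in the original text. After s upward scrolls, the processing elements in row i hold row ((i-1+s) mod n)+1 of the transposed matrix, so the row sum is

$$R_{i,\,s+1} = \sum_{k=1}^{n} A_{ik} B_{kj} = C_{ij}, \qquad j = \big((i-1+s) \bmod n\big) + 1 .$$

Scrolling row i to the right by i-1 steps within the closed loop moves the entry in column s+1 to column

$$\big((s + i - 1) \bmod n\big) + 1 = j ,$$

so every entry C_{ij} ends up in column j of row i. For n = 3 this reproduces the shifts of 0, 1 and 2 steps applied to rows 1, 2 and 3.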
Alternatively, for step S13, in the case where the transposed matrix is scrolled downward, processing the first intermediate result means that the controller stores the obtained first intermediate result in columns and then scrolls the ith row of elements in the first intermediate result to the left in the row direction by i-1 steps to obtain the product of the input matrices.
It will be appreciated by those skilled in the art that, for step S13, the product of the input matrices may also be obtained by the controller scrolling the elements in the first intermediate result in the row direction (for example, to the right or to the left) according to the row and column identifiers of the first intermediate result. In this implementation, every element stored in a register may carry the row and column identifiers of its position in its matrix; during scrolling, the row and column identifiers of each element in the first intermediate result are determined from the row and column identifiers of the matrix elements, so that the controller may scroll the elements in the first intermediate result in the row direction according to those identifiers to obtain the product of the first matrix and the second matrix.
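One way to picture the identifier-based variant is sketched below. This is an interpretation, not the disclosed mechanism: every element is stored as a tuple (value, row identifier, column identifier), each row sum inherits the row identifier of the left-multiplication elements and the column identifier of the right-multiplication elements, and the result is written straight to the position named by its identifiers instead of being scrolled there. The names `tagged` and `multiply_by_tags` are hypothetical.

```python
def tagged(m):
    """Attach 1-based (row, column) identifiers to every element."""
    return [[(v, i + 1, j + 1) for j, v in enumerate(row)]
            for i, row in enumerate(m)]


def multiply_by_tags(a, b, scroll_down=False):
    """Compute a x b with row sums, placing results by their identifiers."""
    n = len(a)
    ta = tagged(a)
    # Transposed right-multiplication matrix; tags still refer to b itself.
    tb_t = [list(col) for col in zip(*tagged(b))]
    product = [[None] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            total = sum(x[0] * y[0] for x, y in zip(ta[i], tb_t[i]))
            row_id = ta[i][0][1]     # row identifier from the left matrix
            col_id = tb_t[i][0][2]   # column identifier from the right matrix
            product[row_id - 1][col_id - 1] = total
        # Scroll the transposed matrix by one row in the chosen direction.
        tb_t = tb_t[-1:] + tb_t[:-1] if scroll_down else tb_t[1:] + tb_t[:1]
    return product


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
result_up = multiply_by_tags(a, b)
result_down = multiply_by_tags(a, b, scroll_down=True)
```

Because the destination is read from the identifiers, the same routine gives the product whichever way the transposed matrix is scrolled.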
Taking the above example, row 1 scrolls to the right by 0 steps, i.e., it does not scroll. Row 2 scrolls to the right by 1 step: C21 scrolls to column 1, C23 scrolls to column 3, and C22 scrolls to column 2, and the result is:
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{33} & C_{31} & C_{32} \end{bmatrix}$$
Row 3 then scrolls to the right by 2 steps, giving the product of the input matrices:
$$\begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{bmatrix}$$
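The realignment in step S13 of Example 1 can be replayed symbolically. The sketch below starts from the first intermediate result as it is accumulated in the text (C11, C22, C33 in the first column, C12, C23, C31 in the second, C13, C21, C32 in the third) and scrolls each row to the right within a closed loop. The function name `scroll_row_right` is hypothetical and the labels are plain strings.

```python
def scroll_row_right(row, steps):
    """Cyclically scroll one row to the right by the given number of steps."""
    steps %= len(row)
    return row[-steps:] + row[:-steps] if steps else row[:]


intermediate = [
    ["C11", "C12", "C13"],
    ["C22", "C23", "C21"],
    ["C33", "C31", "C32"],
]

# Row i (1-based) scrolls right by i-1 steps; enumerate is 0-based,
# so the 0-based index is already the number of steps.
product = [scroll_row_right(row, i) for i, row in enumerate(intermediate)]

for row in product:
    print(" ".join(row))
```

After the shifts, each label Cij sits in row i and column j, matching the layout of the product of the input matrices.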
In one possible implementation, in step S12 the second matrix may be scrolled in the column direction instead; the specific process is similar to that of scrolling the transposed matrix, except that the way the elements are processed and scrolled in step S13 differs slightly. The derivation is not repeated here; reference may be made to the process above.
It should be noted that the arrangement of the processing elements, the input matrix, and the like in the above examples are only for clearly illustrating the process of the operation method of the present disclosure, and do not limit the present disclosure in any way.
Example 2: the first matrix is the left-multiplication matrix and the second matrix is the right-multiplication matrix; that is, the left-multiplication matrix is transposed.
It is still assumed that the first matrix amn and the second matrix bnk are both 3 × 3 matrices, and that the processing elements form a 4 × 4 array.
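Suppose the first matrix is
$$\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}$$
Then transposing the first matrix gives the transposed matrix
$$\begin{bmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{bmatrix}$$
and the second matrix is
$$\begin{bmatrix} B_{11} & B_{12} & B_{13} \\ B_{21} & B_{22} & B_{23} \\ B_{31} & B_{32} & B_{33} \end{bmatrix}$$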
The second matrix is loaded into the registers of the processing elements; the loading manner may follow that of the second matrix in Example 1 and is not described again. The transposed matrix is then loaded into the registers of the processing elements in the same manner as the second matrix, so that after loading the rows of the transposed matrix of the first matrix are aligned with the rows of the second matrix.
For example, suppose B11 is loaded into the register of PE11, B12 into the register of PE12, B13 into the register of PE13, B21 into the register of PE21, …, and B33 into the register of PE33; that is, the index of each element in the second matrix is identical to the index of the processing element in which it is stored. Then A11 may be loaded into the register of PE11, A21 into the register of PE12, A31 into the register of PE13, A12 into the register of PE21, A22 into the register of PE22, A32 into the register of PE23, …, and A33 into the register of PE33. That is, the transposed matrix is loaded into the registers of the processing elements row-aligned with the other matrix (the second matrix).
In one possible implementation, after the input matrices are loaded, for the case in which the first matrix is transposed, the processing elements storing the first column of the transposed matrix and the processing elements storing the last column of the transposed matrix may be connected in the row direction to form a ring, within which data can flow so that the matrix scrolls in the row direction, one column at a time. As shown in FIG. 4, connecting PE11 and PE13 forms a ring, connecting PE21 and PE23 forms a ring, and connecting PE31 and PE33 forms a ring. When data flows to the left within a ring, the data of the first column flows to the third column, the data of the second column flows to the first column, and the data of the third column flows to the second column; when data flows to the right, the data of the first column flows to the second column, the data of the second column flows to the third column, and the data of the third column flows to the first column.
In this embodiment, before the transposed matrix is scrolled to the left or right in the row direction for the first time, the controller may control the processing elements to multiply the elements in the corresponding registers to obtain element products and sum the element products in the same column to obtain a first intermediate result. Taking the above example, PE11 multiplies the elements A11 and B11 stored in its register to obtain the element product A11×B11; A12×B21 and A13×B31 can be obtained in the same manner.
Summing the element products of the first column gives C11=A11×B11+A12×B21+A13×B31.
In the same way, summing the element products of the second column gives C22, and summing the element products of the third column gives C33.
In one possible implementation, C11, C22 and C33 may be temporarily stored in the buffer as the first row of the first intermediate result.
The transposed matrix may then be scrolled one column to the left, with the elements of the first column moving to the last column; it may equally be scrolled one column to the right, and the present disclosure is not limited in this respect.
As shown in FIG. 1, when scrolling to the left is performed, the data of the first column may scroll to the third column as follows:
$$\begin{bmatrix} A_{21} & A_{31} & A_{11} \\ A_{22} & A_{32} & A_{12} \\ A_{23} & A_{33} & A_{13} \end{bmatrix}$$
The processing elements are again controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same column are summed to obtain a first intermediate result: the second column of the transposed matrix multiplied by the first column of b33 yields C21, the third column of the transposed matrix multiplied by the second column of b33 yields C32, and the first column of the transposed matrix multiplied by the third column of b33 yields C13. C21, C32 and C13 are temporarily stored in the buffer as the second row of the first intermediate result.
The transposed matrix is scrolled one column to the left again, the elements in the corresponding registers are multiplied to obtain element products, and the element products in the same column are summed to obtain the first intermediate results C31, C12 and C23, which are temporarily stored in the buffer as the third row of the first intermediate result.
That is, the first intermediate result stored in the buffer is
$$\begin{bmatrix} C_{11} & C_{22} & C_{33} \\ C_{21} & C_{32} & C_{13} \\ C_{31} & C_{12} & C_{23} \end{bmatrix}$$
In step S13, for the case in which the transposed matrix of the first matrix is scrolled to the left, the first intermediate result may be stored in rows, and the ith column of elements in the first intermediate result may be scrolled downward in the column direction by i-1 steps to obtain the product of the input matrices.
Alternatively, for the case in which the transposed matrix of the first matrix is scrolled to the right, the controller may store the first intermediate result in rows, and the ith column of elements in the first intermediate result is scrolled in the column direction by i-1 steps to obtain the product of the input matrices. The specific steps are similar to those for scrolling to the left and are not described again here.
It will be appreciated by those skilled in the art that, for step S13, the controller may also scroll the elements in the first intermediate result in the column direction (e.g., up or down) according to the row and column identification of the first intermediate result to obtain the product of the input matrix. In this embodiment, all the elements stored in the register may carry the row and column identifiers of the elements in the matrix, and during the scrolling, the row and column identifiers of the elements in the first intermediate result are determined according to the row and column identifiers of the elements in the matrix, so that the controller may scroll the elements in the first intermediate result in the column direction according to the row and column identifiers of the first intermediate result to obtain the product of the input matrix.
Taking the above example, column 1 scrolls down 0 steps, that is, it does not scroll. Column 2 scrolls down 1 step: C12 scrolls down 1 step (wrapping around) to row 1, C32 scrolls down 1 step to row 3, and C22 scrolls down 1 step to row 2. The result is:
$$\begin{pmatrix} C_{11} & C_{12} & C_{33} \\ C_{21} & C_{22} & C_{13} \\ C_{31} & C_{32} & C_{23} \end{pmatrix}$$
Column 3 is then scrolled down 2 steps, which gives the product of the input matrices:
$$\begin{pmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \end{pmatrix}$$
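The column-wise realignment in this example can be checked with a short Python sketch. The helper name `roll_column_down` is hypothetical, and string labels stand in for the numeric first intermediate results so that the movement of each element is visible.

```python
def roll_column_down(m, col, steps):
    """Cyclically shift one column of m downward by the given number of steps."""
    n = len(m)
    column = [m[r][col] for r in range(n)]
    for r in range(n):
        m[(r + steps) % n][col] = column[r]


# First intermediate result as stored in the buffer, one row per round.
buffer = [["C11", "C22", "C33"],
          ["C21", "C32", "C13"],
          ["C31", "C12", "C23"]]

# Column j (0-based) rolls down j steps: column 1 stays, column 2 moves
# 1 step, column 3 moves 2 steps.
for j in range(3):
    roll_column_down(buffer, j, j)
```

After the loop, every element `Cij` sits in row i and column j, which is the layout of the product matrix.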
It should be noted that the arrangement of the processing elements, the input matrices, and so on in the above examples serve only to illustrate the operation method of the present disclosure clearly and do not limit the present disclosure in any way.
In one possible implementation, in step S12 the second matrix may instead be scrolled in the row direction. The process is similar to scrolling the transposed matrix, although the way the elements are processed and scrolled in step S13 differs slightly. The derivation is not repeated here; refer to the process above.
The matrix multiplication operation method according to the above embodiments of the present disclosure is well suited to a processor composed of processing elements arranged in an array. For an input matrix of any scale that fits the arrangement of the processing elements, the result of the matrix multiplication can be obtained. Compared with matrix multiplication in the related art, the number of memory accesses can be reduced, which relieves bandwidth pressure and improves operation efficiency.
When no blocking is needed, the result of the matrix multiplication can be obtained directly as in the above example. When blocking is needed, each first matrix obtained by blocking is multiplied by the corresponding second matrix according to the rule of matrix multiplication, and the result is used as a second intermediate result. In other words, the matrix multiplication is carried out with each first matrix and second matrix obtained by blocking treated as a single element of a matrix, and the product of the input matrices is then calculated from the second intermediate results.
Fig. 5 shows a schematic diagram of blocking according to an embodiment of the present disclosure. As shown in Fig. 5, the controller may block the matrices D and E in the manner described above to obtain first matrices D11, D12, D21, D22 and second matrices E11, E12, E21, E22. The controller may then perform the matrix multiplication with each first matrix and second matrix treated as a single element of a matrix. For example, multiplying the first row of the matrix D by the first column of the matrix E gives F11 = D11×E11 + D12×E21; multiplying the first row of D by the second column of E gives F12 = D11×E12 + D12×E22; multiplying the second row of D by the first column of E gives F21 = D21×E11 + D22×E21; and multiplying the second row of D by the second column of E gives F22 = D21×E12 + D22×E22. That is, to obtain the final result of the matrix multiplication, the following second intermediate results must first be obtained:
D11×E11,D12×E21,D11×E12,D12×E22
D21×E11,D22×E21,D21×E12,D22×E22
Each second intermediate result may be obtained by operating on the corresponding first matrix and second matrix according to steps S11-S13.
The input matrices are blocked, the matrix multiplication described above is performed on each pair of blocks to obtain the second intermediate results, and the product of the input matrices is then calculated from the second intermediate results. According to the operation method of the above embodiment of the present disclosure, matrix multiplication can be carried out quickly for matrices of any dimension.
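The way the second intermediate results combine into the final product can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names (`matmul`, `add`, `split`, `blocked_product`) and the sample values are hypothetical, the matrices are square with a size divisible by the block size, and an ordinary matrix product stands in for steps S11-S13 on each pair of blocks.

```python
def matmul(x, y):
    """Plain matrix product; stands in for steps S11-S13 on one block pair."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def split(m, size):
    """Cut a square matrix into a grid of size x size blocks."""
    n = len(m) // size
    return [[[row[c * size:(c + 1) * size]
              for row in m[r * size:(r + 1) * size]]
             for c in range(n)] for r in range(n)]


def blocked_product(d, e, size):
    db, eb = split(d, size), split(e, size)
    n = len(db)
    result = []
    for i in range(n):
        block_row = []
        for j in range(n):
            # Second intermediate results D_ik x E_kj, accumulated into F_ij.
            acc = matmul(db[i][0], eb[0][j])
            for k in range(1, n):
                acc = add(acc, matmul(db[i][k], eb[k][j]))
            block_row.append(acc)
        # Stitch the blocks of this block row back into full-width rows.
        for r in range(size):
            result.append([v for blk in block_row for v in blk[r]])
    return result


# Hypothetical 4 x 4 inputs, blocked into 2 x 2 sub-matrices.
d = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
e = [[(r + 2 * c) % 5 for c in range(4)] for r in range(4)]
f = blocked_product(d, e, 2)
```

For the 2 × 2 blocking of Fig. 5, the inner loop computes exactly F11 = D11×E11 + D12×E21 and its three counterparts.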
In an optional embodiment, the first matrix and the second matrix after being partitioned may be stored in the processing element in sequence for calculation, or may be stored in the processing element in a stacked manner.
Example 3 Stacked storage in combination with steps S11-S13
The operation method of the present disclosure is described below for a 2 × 2 array of processing elements and 4 × 4 input matrices.
Suppose the left-multiplication matrix is
Figure BDA0002460080140000131
and the right-multiplication matrix is
Figure BDA0002460080140000132
Then the controller may divide both the left-multiplication matrix and the right-multiplication matrix into 2 × 2 matrices.
Fig. 6 illustrates an example of partitioning a matrix according to an embodiment of the present disclosure. As shown in Fig. 6, the controller may divide both the left-multiplication matrix and the right-multiplication matrix into 2 × 2 sub-matrices. Dividing the left-multiplication matrix gives four matrices a11, a12, a21, a22, where a11 is
Figure BDA0002460080140000133
a12 is
Figure BDA0002460080140000134
a21 is
Figure BDA0002460080140000135
and a22 is
Figure BDA0002460080140000136
Dividing the right-multiplication matrix gives four matrices b11, b12, b21, b22, where b11 is
Figure BDA0002460080140000137
b12 is
Figure BDA0002460080140000138
b21 is
Figure BDA0002460080140000139
and b22 is
Figure BDA00024600801400001310
For the blocking case, if the number of registers included in the processing element can meet the requirement of storing the input matrix, the input matrix may also be stored in the registers of the processing element in a stacked storage manner to implement the multiplication operation of the input matrix. When the input matrix is stored in a stacked storage manner, the controller may divide the registers in the processing elements into a plurality of different groups, each group storing a first matrix after being blocked and a corresponding second matrix.
In the example of storing the input matrix in the stacked storage manner, one possible calculation manner is to scroll the matrices in units of the first matrix and the second matrix obtained by the block division, and in calculating the second intermediate result, perform the operation using the processes of steps S11 to S13.
Taking the calculation of the second intermediate results by steps S11-S13 as an example, assume that the processing elements form a 2 × 2 array and use the matrices shown in Fig. 6. In the operation method of the present disclosure, the first matrices may be obtained by blocking either the left-multiplication matrix or the right-multiplication matrix.
The present disclosure explains the operation method for the case in which the first matrices are obtained by blocking the right-multiplication matrix: the second matrices are loaded, and the corresponding first matrices are transposed and then loaded. The loading results are shown in Tables 1 and 2. Here, Reg0, Reg1, Reg2, and Reg3 each denote a group of registers in the processing elements. The processing elements form a 2 × 2 array, each processing element includes a plurality of registers, and the controller can divide these registers into a plurality of groups (4 groups in this embodiment). The registers in one group store a transposed matrix and the corresponding second matrix. As shown in Tables 1 and 2, Reg0 stores a11 and b11, Reg1 stores a12 and b21, Reg2 stores a21 and b12, and Reg3 stores a22 and b22, that is, the first-row elements of the matrix
Figure BDA0002460080140000141
multiplied by the first-column elements of the matrix
Figure BDA0002460080140000142
and the second-row elements multiplied by the second-column elements.
Table 1 element storage example
Figure BDA0002460080140000143
Table 2 element storage example
Figure BDA0002460080140000144
During the calculation, for the elements within one group of registers, the processing elements may calculate the second intermediate results a11×b11, a12×b21, a21×b12, and a22×b22 according to steps S11-S13. The detailed process is not repeated. From these second intermediate results, C11 = a11×b11 + a12×b21 and C22 = a21×b12 + a22×b22 can be calculated.
After the second intermediate results are calculated, the transposed matrices may be scrolled in units of groups. Specifically, the transposed matrix
Figure BDA0002460080140000145
is scrolled up by one row, that is, the elements of the transposed matrix in Reg2 are scrolled into Reg0, those in Reg0 into Reg2, those in Reg3 into Reg1, and those in Reg1 into Reg3, which gives Table 3.
Table 3 element storage example
Figure BDA0002460080140000151
With the elements arranged as in Tables 1 and 3, for the elements within one group of registers, the processing elements may calculate the second intermediate results a11×b12, a12×b22, a21×b11, and a22×b21 according to steps S11-S13. The detailed process is not repeated. From these second intermediate results, C12 = a11×b12 + a12×b22 and C21 = a21×b11 + a22×b21 can be calculated.
According to the above process, the product of the input matrix can be calculated in a block-wise manner.
Therefore, the matrix multiplication operation method disclosed by the invention can realize matrix operation of any size.
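The two rounds of Example 3 can be traced with a small Python sketch. It is only an illustration under stated assumptions: the helper names and sample values are hypothetical, each list entry models one register group Reg0 to Reg3, an ordinary matrix product stands in for steps S11-S13 inside a group, and the first-matrix blocks are therefore kept untransposed in the model.

```python
def matmul(x, y):
    """Plain matrix product; stands in for steps S11-S13 in one register group."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def block(m, r, c, size=2):
    """Return block (r, c) of m, with 0-based block indices."""
    return [row[c * size:(c + 1) * size] for row in m[r * size:(r + 1) * size]]


# Hypothetical 4 x 4 left- and right-multiplication matrices.
a = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
b = [[(3 * r + c) % 7 for c in range(4)] for r in range(4)]

# Second matrices held in Reg0..Reg3: a11, a12, a21, a22.
second = [block(a, 0, 0), block(a, 0, 1), block(a, 1, 0), block(a, 1, 1)]
# First matrices held in Reg0..Reg3: b11, b21, b12, b22
# (the hardware would hold their transposes).
first = [block(b, 0, 0), block(b, 1, 0), block(b, 0, 1), block(b, 1, 1)]

# Round 1: second intermediate results a11*b11, a12*b21, a21*b12, a22*b22.
p = [matmul(x, y) for x, y in zip(second, first)]
c11, c22 = add(p[0], p[1]), add(p[2], p[3])

# Group-level roll: Reg0 <-> Reg2 and Reg1 <-> Reg3.
first = [first[2], first[3], first[0], first[1]]

# Round 2: second intermediate results a11*b12, a12*b22, a21*b11, a22*b21.
p = [matmul(x, y) for x, y in zip(second, first)]
c12, c21 = add(p[0], p[1]), add(p[2], p[3])

# Assemble the four result blocks into the full product.
product = ([c11[r] + c12[r] for r in range(2)] +
           [c21[r] + c22[r] for r in range(2)])
```

Only the first-matrix blocks move between register groups; the second-matrix blocks stay in place for both rounds.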
Example 4 Stacked storage in combination with whole scrolling
In another possible implementation, a different scrolling manner may be adopted. In this embodiment, step S12 in Fig. 3 may be implemented as follows: each time before the transposed matrix is scrolled once in the row direction or the column direction, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products of the same row (or of the same column, in the example in which the first matrix is transposed) are summed to obtain the first intermediate results C11, C22, C33, and C44.
When the input matrix is blocked and stored in a stacked manner, an original row or column of data is stored in registers of different groups, so that what was one row or column of contiguously stored data becomes at least two independent rows or columns held in different register groups. The last element of one such row or column and the first element of the next were stored contiguously before stacking but are no longer contiguous after stacking. Therefore, after the elements in each group of registers are scrolled once in the row or column direction, the scrolling result must be corrected to obtain the correct result. The correction may proceed as follows:
each block of the transposed matrix is scrolled once in the row or column direction;
if the scrolling is to the left in the row direction, the correction is to move the last column of each scrolled block to the last column of the adjacent previous block;
if the scrolling is to the right in the row direction, the correction is to move the first column of each scrolled block to the first column of the adjacent next block;
if the scrolling is upward in the column direction, the correction is to move the last row of each scrolled block to the last row of the adjacent previous block;
if the scrolling is downward in the column direction, the correction is to move the first row of each scrolled block to the first row of the adjacent next block.
Each block mentioned above is a block transposed matrix, that is, the matrix obtained by transposing one block of the partitioned matrix.
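The upward-scrolling rule can be checked with a short Python sketch. The helper names (`split`, `join`, `stacked_roll_up`) and the sample matrix are hypothetical, each block models one register group, and "adjacent previous block" is assumed to wrap around cyclically, so the topmost block hands its last row to the bottom block.

```python
def split(m, size):
    """Cut a square matrix into a grid of size x size blocks (register groups)."""
    n = len(m) // size
    return [[[row[c * size:(c + 1) * size]
              for row in m[r * size:(r + 1) * size]]
             for c in range(n)] for r in range(n)]


def join(grid):
    """Reassemble a grid of blocks into one matrix."""
    rows = []
    for block_row in grid:
        for r in range(len(block_row[0])):
            rows.append([v for blk in block_row for v in blk[r]])
    return rows


def stacked_roll_up(grid):
    # Step 1: scroll every block up by one row inside its own register group.
    rolled = [[blk[1:] + blk[:1] for blk in block_row] for block_row in grid]
    # Step 2 (correction): the last row of each scrolled block belongs in the
    # last row of the block directly above it, wrapping around at the top.
    n = len(rolled)
    last_rows = [[blk[-1] for blk in block_row] for block_row in rolled]
    for r in range(n):
        for c, blk in enumerate(rolled[r]):
            blk[-1] = last_rows[(r + 1) % n][c]
    return rolled


# Hypothetical 4 x 4 transposed matrix stored as four 2 x 2 blocks.
t = [[r * 4 + c for c in range(4)] for r in range(4)]
stacked = join(stacked_roll_up(split(t, 2)))
# Reference: scrolling the whole, unblocked matrix up by one row.
whole = t[1:] + t[:1]
```

With two block rows the correction reduces to swapping the last rows of vertically adjacent blocks; the other three directions follow the same pattern with the first or last column, or the first row, in place of the last row.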
In the present embodiment, the right-multiplication matrix is transposed and the scrolling is performed by rows. Because of the stacked storage, two rows that should be adjacent are separated by at least one row and are treated as independent rows, so scrolling within each group of registers alone does not produce the correct result and must be corrected.
Taking Table 2 as an example, one row is scrolled upward inside each group of registers. The result is shown in Table 4, where the first row of elements in each group of registers has moved to the last row of that group. Relative to Table 2, however, the first-row elements of Reg0 and Reg1 should have scrolled to the last row of Reg2 and Reg3, but are now in the last row of Reg0 and Reg1 (Table 4). Likewise, the first-row elements of Reg2 and Reg3 should have scrolled to the last row of Reg0 and Reg1, but are now in the last row of Reg2 and Reg3 (Table 4). In other words, in Table 4 the last-row elements of Reg0 and Reg1 belong in the last row of Reg2 and Reg3, and the last-row elements of Reg2 and Reg3 belong in the last row of Reg0 and Reg1. Swapping the last-row elements of Reg2 and Reg0, and swapping the last-row elements of Reg3 and Reg1, therefore completes the scrolling, as shown in Table 5.
Table 4 element storage example
Figure BDA0002460080140000161
Table 5 element storage example
Figure BDA0002460080140000162
With the elements arranged as in Tables 1 and 5, the processing elements are controlled to multiply the elements in the corresponding registers to obtain element products, and the element products in the same row are summed to obtain the first intermediate results C12, C23, C34, and C41.
Repeating this process, with 4 calculations and 3 scrolls in total, completes the matrix multiplication, and the product of the input matrices can be obtained from the first intermediate results.
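The complete sequence of 4 calculations and 3 scrolls can be sketched in Python on the logical (unblocked) 4 × 4 matrices. This is an illustration under stated assumptions: the function name `whole_scrolling_product` and the sample values are hypothetical, the right-multiplication matrix is the one transposed, and the upward scroll is written as a single list rotation, whereas with stacked storage each scroll would be a per-group scroll followed by the correction described above.

```python
def whole_scrolling_product(a, b):
    n = len(a)
    # Transposed right-multiplication matrix: t[i][j] = b[j][i].
    t = [[b[j][i] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            # Row i of a meets column (i + k) % n of b, so the row sum of the
            # element products is C[i][(i + k) % n] (0-based indices).
            c[i][(i + k) % n] = sum(a[i][j] * t[i][j] for j in range(n))
        if k < n - 1:
            # Scroll the whole transposed matrix up by one row.
            t = t[1:] + t[:1]
    return c


# Hypothetical 4 x 4 inputs.
a = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
b = [[(2 * r + 3 * c) % 5 for c in range(4)] for r in range(4)]
product = whole_scrolling_product(a, b)
reference = [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
             for i in range(4)]
```

The first pass fills C11, C22, C33, C44 and the second fills C12, C23, C34, C41, in the same order as the first intermediate results named in the text.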
In an alternative embodiment, the stacked storage may follow the blocking described above. It is not limited to each register storing one element of the matrix, to the numbers of rows and columns of the multiplied matrices being integer multiples of the numbers of rows and columns of the processing elements, or to a single stacked-storage method. The correction process is the same in each case: after correction, the elements of each original row or column only need to be connected end to end. The specific stacked-storage process is not limited here.
It should be noted that the above manner of stacking storage and scrolling elements is only one example of the disclosure, and other manners may also be adopted, and the disclosure does not limit this.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, although the steps in the flowchart are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described and may be performed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The present disclosure also provides an arithmetic device for matrix multiplication based on a matrix of processing elements, which can be applied to a processor. Fig. 1 shows an example of such a processor, which may comprise two or more processing elements arranged in a two-dimensional matrix, each processing element comprising at least one register. The arithmetic device is used to perform a matrix multiplication operation on a first matrix and a second matrix.
It should be understood that the above-described apparatus embodiments are merely illustrative and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in each embodiment of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be digital circuits, analog circuits, and so on. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the register may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and so on.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other media capable of storing program code.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the present disclosure further provides an artificial intelligence chip, where the chip includes the processor as described above.
In a possible implementation manner, a board card is further disclosed, which comprises a storage device, an interface device, a control device and the artificial intelligence chip; wherein, the artificial intelligence chip is respectively connected with the storage device, the control device and the interface device; the storage device is used for storing data; the interface device is used for realizing data transmission between the artificial intelligence chip and external equipment; and the control device is used for monitoring the state of the artificial intelligence chip.
Fig. 7 shows a block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 7, in addition to the chip 389, the board card may include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392.
The storage device 390 is connected to the artificial intelligence chip through a bus and is used for storing data. The storage device may include a plurality of groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It is understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse. DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the artificial intelligence chip may include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking.
In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip and controls the data transmission and data storage of each storage unit.
The interface device is electrically connected with the artificial intelligence chip. The interface device is used for data transmission between the artificial intelligence chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such an interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the artificial intelligence chip is transmitted back to the external equipment (such as a server) by the interface device.
The control device is electrically connected with the artificial intelligence chip. The control device is used for monitoring the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device can be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Because the artificial intelligence chip may comprise a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, it can drive a plurality of loads and can therefore be in different working states, such as heavy load and light load. The control device can regulate the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the artificial intelligence chip.
An embodiment of the present disclosure further provides an electronic device including a processor. As described above, the processor includes two or more processing elements arranged in a two-dimensional matrix, each processing element including at least one register, and a controller that controls the processing elements.
The electronic device further includes a memory for storing processor-executable instructions, wherein the processor is configured to invoke the instructions stored in the memory to perform the above-described method.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The foregoing may be better understood in light of the following clauses:
Clause A1. A processor, the processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being for performing a matrix multiplication operation on a first matrix and a second matrix,
the processor further comprises a controller, wherein the controller is used for loading each element of a transposed matrix and a second matrix of the first matrix into a register of each processing element respectively, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;
the controller is used for controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result;
the controller is further configured to process the first intermediate result to obtain a product of the first matrix and the second matrix.
Clause A2. The processor of clause A1,
the controller controls the processing element, the transposed matrix stored in the register, and the second matrix to repeat the following process until the elements in the transposed matrix or the second matrix are restored to the non-scrolled position:
the controller is used for controlling the processing element to multiply elements in the corresponding register to obtain element products, summing the element products of the same row or the same column to obtain a first intermediate result, and controlling the transposed matrix or the second matrix stored in the register to roll by one row or one column in the row direction or the column direction.
Clause A3. The processor of clause A1 or A2,
when the first matrix is a left-multiplying matrix and the second matrix is a right-multiplying matrix, the controller controls elements in the transposed matrix to roll in the row direction, or controls elements in the second matrix to roll in the row direction; the control processing element carries out multiplication operation on elements in the corresponding register to obtain element products, and the element products in the same column are summed to obtain a first intermediate result;
when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, the controller controls elements in the transposed matrix to roll in the column direction or controls elements in the second matrix to roll in the column direction; the control processing element performs multiplication operation on elements in the corresponding register to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.
Clause A4. The processor of clause A1 or A2,
the controller stores the first intermediate result in rows or columns, and rolls in the row direction or the column direction to obtain the product of the first matrix and the second matrix.
Clause A5. The processor of any of clauses A1-A4, the controller further configured to determine whether to block the input matrix according to the arrangement of processing elements and the row and column ranks of the input matrix, wherein the input matrix comprises a left-multiplying matrix and a right-multiplication matrix;
if one matrix in the input matrix is to be partitioned, the controller splits the row of the left-multiplying matrix or splits the column of the right-multiplying matrix according to the arrangement of the processing elements;
if the two matrixes in the input matrix are to be blocked, the controller blocks the column direction of the left-multiplying matrix and the row direction of the right-multiplying matrix in the same way according to the arrangement of the processing elements and the row rank and the column rank of the input matrix;
the left multiplication matrix is partitioned to obtain more than two first matrixes, the right multiplication matrix is partitioned to obtain more than two second matrixes, or the left multiplication matrix is partitioned to obtain more than two second matrixes, and the right multiplication matrix is partitioned to obtain more than two first matrixes.
Clause A6. The processor of clause A5,
the controller is further configured to calculate a product of the left-handed matrix and the right-handed matrix based on a product of the first matrix and the second matrix.
Clause A7. the processor of clause a5, the processor comprising a plurality of sets of registers,
the controller is further configured to transpose the more than two first matrices to obtain a transposed matrix after the input matrix is partitioned;
the controller loads the transposed matrix and more than two second matrixes into the plurality of groups of registers for stacking storage, and the transposed matrix and the second matrixes at corresponding positions are stored in one group of registers;
before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction each time, the controller controls the processing element to perform multiplication operation on the elements in the corresponding register to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;
after controlling the elements in a set of registers to scroll through a row or column of the transpose matrix in the row or column direction, the controller also modifies the scrolling results.
Clause A8. The processor of clause A7, wherein modifying the scrolling results comprises:
if the data is scrolled leftwards in the row direction, the correction mode is that the last column of data in each transposed matrix after scrolling is scrolled to the last column of the data of the previous adjacent transposed matrix;
if the data is scrolled rightwards in the row direction, the correction mode is that the first column of the data in each transposed matrix after scrolling is scrolled to the first column of the data in the next adjacent transposed matrix;
if the data is rolled upwards in the column direction, the correction mode is that the last line of data in each transposed matrix after rolling is rolled to the last line of the data of the previous adjacent transposed matrix;
if the data is rolled downwards in the column direction, the correction mode is that the first row of data in each transposed matrix after rolling is rolled to the first row of the next transposed matrix data;
each block transpose matrix is a matrix obtained by transposing each block matrix after being partitioned.
Clause A9. A method of matrix multiplication based on a matrix of processing elements for use in a processor, the processor including two or more processing elements arranged in a two-dimensional matrix, the processing elements including at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:
transposing a first matrix to obtain a transposed matrix, loading each element of the transposed matrix and the second matrix into a register of each processing element respectively, and storing the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element;
controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling a processing element to perform multiplication operation on elements in a corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;
and processing the first intermediate result to obtain a product of the first matrix and the second matrix.
Clause A10. The method of clause A9, wherein controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same row or the same column to obtain the first intermediate result comprises repeating the following process until the elements in the transposed matrix or the second matrix are restored to their unrolled positions:
controlling the processing element to multiply the elements in the corresponding register to obtain element products, summing the element products in the same row or the same column to obtain a first intermediate result, and rolling the transposed matrix or the second matrix in the matrix of processing elements by one row or one column in the row direction or the column direction.
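As an illustration only, the repeated multiply, sum and roll process of clause A10 can be sketched in Python. The sketch assumes square n x n matrices, a first matrix that is the left-multiplication matrix, and a leftward roll of the transposed matrix; the function names are hypothetical, and nested lists stand in for the registers of the processing element matrix.

```python
def roll_rows_left(mat, shift=1):
    """Cyclically roll every row of mat left by shift columns."""
    return [row[shift:] + row[:shift] for row in mat]


def first_intermediate_results(first, second):
    """Return one list of column sums per roll step.

    Assumption: position (i, j) of the nested lists models the processing
    element that holds transposed[i][j] and second[i][j].
    """
    n = len(first)
    transposed = [[first[j][i] for j in range(n)] for i in range(n)]
    start = [row[:] for row in transposed]
    results = []
    for _ in range(n):
        # Each processing element multiplies the two values it holds.
        products = [[transposed[i][j] * second[i][j] for j in range(n)]
                    for i in range(n)]
        # Element products in the same column are summed.
        results.append([sum(products[i][j] for i in range(n))
                        for j in range(n)])
        # The transposed matrix is rolled by one column in the row direction.
        transposed = roll_rows_left(transposed)
    # After n steps the elements are back in their unrolled positions.
    assert transposed == start
    return results
```

Under these assumptions, entry j of the result of step s equals element ((j + s) mod n, j) of the product of the first matrix and the second matrix, so the n steps together cover every element of the product once.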
Clause A11. The method of clause A9 or A10, wherein:
when the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, elements in the transposed matrix are controlled to roll in the row direction, or elements in the second matrix are controlled to roll in the row direction; the processing element is controlled to multiply the elements in the corresponding register to obtain element products, and the element products in the same column are summed to obtain a first intermediate result;
when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, elements in the transposed matrix are controlled to roll in the column direction, or elements in the second matrix are controlled to roll in the column direction; the processing element is controlled to multiply the elements in the corresponding register to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.
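The second case of clause A11 can be illustrated in the same way. The following Python sketch assumes square n x n matrices and an upward roll of the transposed matrix; the names are hypothetical. Here the first matrix is the right-multiplication matrix, so the target product is the second matrix times the first matrix, and the sums are taken along rows.

```python
def roll_columns_up(mat, shift=1):
    """Cyclically roll every column of mat up by shift rows."""
    return mat[shift:] + mat[:shift]


def row_sum_intermediate_results(first, second):
    """Return one list of row sums per roll step.

    Assumption: first is the right-multiplication matrix and second is
    the left-multiplication matrix, both square n x n.
    """
    n = len(first)
    transposed = [[first[j][i] for j in range(n)] for i in range(n)]
    results = []
    for _ in range(n):
        # Multiply element-wise, then sum the products of each row.
        results.append([sum(transposed[i][j] * second[i][j]
                            for j in range(n))
                        for i in range(n)])
        # Roll the transposed matrix by one row in the column direction.
        transposed = roll_columns_up(transposed)
    return results
```

Under these assumptions, entry i of the result of step s equals element (i, (i + s) mod n) of the product of the second matrix and the first matrix.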
Clause A12. The method of clause A9 or A10, wherein processing the first intermediate result to obtain a product of the first matrix and the second matrix comprises:
storing the first intermediate result in rows or columns, and rolling in the row direction or the column direction to obtain the product of the first matrix and the second matrix.
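One possible reading of this post-processing step is sketched below in Python. It assumes that the first intermediate results were obtained by summing columns while rolling the transposed matrix leftwards, and that the result of step s is stored as row s, so that stored[s][j] holds element ((j + s) mod n, j) of the product. The function name and the storage layout are assumptions made for the illustration.

```python
def unskew_columns(stored):
    """Roll column j of stored downwards by j rows to recover the product.

    Assumption: stored[s][j] == product[(j + s) % n][j] for an n x n
    product, as produced by leftward rolling with column sums.
    """
    n = len(stored)
    product = [[0] * n for _ in range(n)]
    for j in range(n):
        column = [stored[s][j] for s in range(n)]
        shift = j % n
        # Cyclic downward roll: rolled[i] == column[(i - shift) % n].
        rolled = column[-shift:] + column[:-shift] if shift else column
        for i in range(n):
            product[i][j] = rolled[i]
    return product
```

With a different roll direction or a column-wise storage order, the same idea applies with the roll direction and the roll distance adjusted accordingly.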
Clause A13. The method of any one of clauses A9-A12, further comprising:
determining whether to partition the input matrices according to the arrangement of the processing elements and the row rank and the column rank of the input matrices, wherein the input matrices comprise a left-multiplication matrix and a right-multiplication matrix;
if one of the input matrices is to be partitioned, splitting the rows of the left-multiplication matrix or splitting the columns of the right-multiplication matrix according to the arrangement of the processing elements;
if both input matrices are to be partitioned, partitioning the left-multiplication matrix in the column direction and the right-multiplication matrix in the row direction in the same way, according to the arrangement of the processing elements and the row rank and the column rank of the input matrices;
wherein the left-multiplication matrix is partitioned to obtain two or more first matrices and the right-multiplication matrix is partitioned to obtain two or more second matrices, or the left-multiplication matrix is partitioned to obtain two or more second matrices and the right-multiplication matrix is partitioned to obtain two or more first matrices.
Clause A14. The method of clause A13, further comprising:
calculating the product of the left-multiplication matrix and the right-multiplication matrix according to the product of the first matrix and the second matrix.
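How block products can be combined into the full product is sketched below in Python, for illustration only. The helper names and the fixed block size are hypothetical, and a plain triple-loop multiplication stands in for the processing element matrix. Splitting the rows of the left-multiplication matrix gives partial products that are stacked; splitting the columns of the left-multiplication matrix and the rows of the right-multiplication matrix in the same way gives partial products that are added.

```python
def matmul(x, y):
    """Naive product, used here as a stand-in for the processing element array."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def split_rows(mat, size):
    """Split mat into blocks of at most size rows."""
    return [mat[r:r + size] for r in range(0, len(mat), size)]


def split_cols(mat, size):
    """Split mat into blocks of at most size columns."""
    return [[row[c:c + size] for row in mat]
            for c in range(0, len(mat[0]), size)]


def product_by_row_blocks(left, right, size):
    """Only the left matrix is partitioned: block products are stacked."""
    out = []
    for block in split_rows(left, size):
        out.extend(matmul(block, right))
    return out


def product_by_inner_blocks(left, right, size):
    """Both matrices are partitioned the same way: block products are added."""
    total = None
    for lb, rb in zip(split_cols(left, size), split_rows(right, size)):
        partial = matmul(lb, rb)
        if total is None:
            total = partial
        else:
            total = [[p + q for p, q in zip(pr, qr)]
                     for pr, qr in zip(total, partial)]
    return total
```

Splitting only the columns of the right-multiplication matrix is the mirror case of `product_by_row_blocks`, with the block products concatenated side by side instead of stacked.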
Clause A15. The method of clause A13, wherein the processor includes a plurality of sets of registers,
the method further comprising:
after the input matrices are partitioned, transposing the two or more first matrices to obtain transposed matrices;
stacking and storing the transposed matrices and the two or more second matrices in the plurality of sets of registers, wherein the transposed matrix and the second matrix at corresponding positions are stored in one set of registers;
each time before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction, controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same row or the same column to obtain a first intermediate result;
after the elements in a set of registers are controlled to scroll by one row or one column of the transposed matrix in the row direction or the column direction, modifying the scrolling results.
Clause A16. The method of clause A15, wherein modifying the scrolling results comprises:
if the data is scrolled leftwards in the row direction, moving the last column of data in each transposed matrix after scrolling to the last column of the previous adjacent transposed matrix;
if the data is scrolled rightwards in the row direction, moving the first column of data in each transposed matrix after scrolling to the first column of the next adjacent transposed matrix;
if the data is scrolled upwards in the column direction, moving the last row of data in each transposed matrix after scrolling to the last row of the previous adjacent transposed matrix;
if the data is scrolled downwards in the column direction, moving the first row of data in each transposed matrix after scrolling to the first row of the next adjacent transposed matrix;
wherein each transposed matrix is obtained by transposing the corresponding block matrix produced by the partitioning.
Clause A17. An artificial intelligence chip comprising the processor of any one of clauses A1-A8.
Clause A18. An electronic device comprising the artificial intelligence chip of clause A17.
The embodiments of the present disclosure have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present disclosure; the description of the embodiments is provided only to help understand the method and the core idea of the present disclosure. Meanwhile, a person skilled in the art may, based on the idea of the present disclosure, make changes to the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the present disclosure.

Claims (18)

1. A processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the processor being configured to perform a matrix multiplication operation on a first matrix and a second matrix,
the processor further comprises a controller, wherein the controller is used for loading each element of a transposed matrix and a second matrix of the first matrix into a register of each processing element respectively, and the elements at the corresponding positions of the transposed matrix and the second matrix are stored in the register of the same processing element;
the controller is used for controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling the processing element to perform multiplication operation on elements in the corresponding register to obtain an element product, and summing the element products in the same row or the same column to obtain a first intermediate result;
the controller is further configured to process the first intermediate result to obtain a product of the first matrix and the second matrix.
2. The processor of claim 1,
the controller controls the processing element, the transposed matrix stored in the register, and the second matrix to repeat the following process until the elements in the transposed matrix or the second matrix are restored to the non-scrolled position:
the controller is used for controlling the processing element to multiply elements in the corresponding register to obtain element products, summing the element products of the same row or the same column to obtain a first intermediate result, and controlling the transposed matrix or the second matrix stored in the register to roll by one row or one column in the row direction or the column direction.
3. The processor according to claim 1 or 2,
when the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, the controller controls elements in the transposed matrix to roll in the row direction, or controls elements in the second matrix to roll in the row direction; the controller controls the processing element to multiply the elements in the corresponding register to obtain element products, and the element products in the same column are summed to obtain a first intermediate result;
when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, the controller controls elements in the transposed matrix to roll in the column direction, or controls elements in the second matrix to roll in the column direction; the controller controls the processing element to multiply the elements in the corresponding register to obtain element products, and the element products in the same row are summed to obtain a first intermediate result.
4. The processor according to claim 1 or 2,
and the controller stores the first intermediate result in rows or columns, and rolls in the row direction or the column direction to obtain the product of the first matrix and the second matrix.
5. The processor of any one of claims 1-4, wherein the controller is further configured to determine whether to block the input matrix according to the arrangement of the processing elements and a row rank and a column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;
if one matrix in the input matrix is to be partitioned, the controller splits the row of the left-multiplying matrix or splits the column of the right-multiplying matrix according to the arrangement of the processing elements;
if the two matrixes in the input matrix are to be blocked, the controller blocks the column direction of the left-multiplying matrix and the row direction of the right-multiplying matrix in the same way according to the arrangement of the processing elements and the row rank and the column rank of the input matrix;
wherein the left-multiplication matrix is partitioned to obtain two or more first matrices and the right-multiplication matrix is partitioned to obtain two or more second matrices, or the left-multiplication matrix is partitioned to obtain two or more second matrices and the right-multiplication matrix is partitioned to obtain two or more first matrices.
6. The processor of claim 5,
the controller is further configured to calculate a product of the left-multiplication matrix and the right-multiplication matrix based on a product of the first matrix and the second matrix.
7. The processor of claim 5, wherein the processor comprises a plurality of sets of registers,
the controller is further configured to transpose the two or more first matrices to obtain transposed matrices after the input matrix is partitioned;
the controller loads the transposed matrices and the two or more second matrices into the plurality of sets of registers for stacked storage, and the transposed matrix and the second matrix at corresponding positions are stored in one set of registers;
each time before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction, the controller controls the processing element to multiply the elements in the corresponding register to obtain element products, and the element products in the same row or the same column are summed to obtain a first intermediate result;
after controlling the elements in a set of registers to scroll through a row or column of the transpose matrix in the row or column direction, the controller also modifies the scrolling results.
8. The processor of claim 7, wherein modifying the scrolling results comprises:
if the data is scrolled leftwards in the row direction, moving the last column of data in each transposed matrix after scrolling to the last column of the previous adjacent transposed matrix;
if the data is scrolled rightwards in the row direction, moving the first column of data in each transposed matrix after scrolling to the first column of the next adjacent transposed matrix;
if the data is scrolled upwards in the column direction, moving the last row of data in each transposed matrix after scrolling to the last row of the previous adjacent transposed matrix;
if the data is scrolled downwards in the column direction, moving the first row of data in each transposed matrix after scrolling to the first row of the next adjacent transposed matrix;
wherein each transposed matrix is obtained by transposing the corresponding block matrix produced by the partitioning.
9. A method of matrix multiplication based on a matrix of processing elements, for use in a processor comprising two or more processing elements arranged in a two-dimensional matrix, the processing elements comprising at least one register, the method implementing a matrix multiplication operation on a first matrix and a second matrix, the method comprising:
transposing a first matrix to obtain a transposed matrix, loading each element of the transposed matrix and the second matrix into a register of each processing element respectively, and storing the elements at the corresponding positions of the transposed matrix and the second matrix in the register of the same processing element;
controlling the transposed matrix or the second matrix to roll in the row direction or the column direction, controlling a processing element to perform multiplication operation on elements in a corresponding register to obtain element products, and summing the element products of the same row or the same column to obtain a first intermediate result;
and processing the first intermediate result to obtain a product of the first matrix and the second matrix.
10. The method of claim 9, wherein controlling the transpose matrix or the second matrix to scroll in a row direction or a column direction, and controlling the processing element to multiply the elements in the corresponding registers to obtain the element products and sum the element products in the same row or the same column to obtain the first intermediate result comprises repeating the following processes until the elements in the transpose matrix or the second matrix are restored to the non-scrolled positions:
controlling the processing element to multiply the elements in the corresponding register to obtain element products, summing the element products in the same row or the same column to obtain a first intermediate result, and rolling the transposed matrix or the second matrix in the matrix of processing elements by one row or one column in the row direction or the column direction.
11. The method according to claim 9 or 10,
when the first matrix is a left-multiplication matrix and the second matrix is a right-multiplication matrix, controlling elements in the transposed matrix to roll in the row direction, or controlling elements in the second matrix to roll in the row direction; controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same column to obtain a first intermediate result;
when the first matrix is a right-multiplication matrix and the second matrix is a left-multiplication matrix, controlling elements in the transposed matrix to roll in the column direction, or controlling elements in the second matrix to roll in the column direction; controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same row to obtain a first intermediate result.
12. The method of claim 9 or 10, wherein processing the first intermediate result to obtain a product of the first matrix and the second matrix comprises:
and storing the first intermediate result in a row or column manner, and rolling in the row direction or the column direction to obtain a product of the first matrix and the second matrix.
13. The method according to any one of claims 9-12, further comprising:
determining whether to block the input matrix according to the arrangement of the processing elements and the row rank and the column rank of the input matrix, wherein the input matrix comprises a left-multiplication matrix and a right-multiplication matrix;
if one matrix in the input matrix is to be partitioned, splitting rows of a left-multiplying matrix or splitting columns of a right-multiplying matrix according to the arrangement of the processing elements;
if the two matrixes in the input matrix are to be partitioned, partitioning the column direction of the left multiplication matrix and the row direction of the right multiplication matrix in the same way according to the arrangement of the processing elements and the row rank and the column rank of the input matrix;
wherein the left-multiplication matrix is partitioned to obtain two or more first matrices and the right-multiplication matrix is partitioned to obtain two or more second matrices, or the left-multiplication matrix is partitioned to obtain two or more second matrices and the right-multiplication matrix is partitioned to obtain two or more first matrices.
14. The method of claim 13, further comprising:
calculating the product of the left-multiplication matrix and the right-multiplication matrix according to the product of the first matrix and the second matrix.
15. The method of claim 13, wherein the processor includes a plurality of sets of registers,
the method further comprises the following steps:
after the input matrix is partitioned, transposing the two or more first matrices to obtain transposed matrices;
stacking and storing the transposed matrices and the two or more second matrices in the plurality of sets of registers, wherein the transposed matrix and the second matrix at corresponding positions are stored in one set of registers;
each time before the elements in the transposed matrix or the second matrix are rolled once in the row direction or the column direction, controlling the processing element to multiply the elements in the corresponding register to obtain element products, and summing the element products in the same row or the same column to obtain a first intermediate result;
after the elements in a set of registers are controlled to scroll by one row or one column of the transposed matrix in the row direction or the column direction, modifying the scrolling results.
16. The method of claim 15, wherein modifying the scrolling results comprises:
if the data is scrolled leftwards in the row direction, moving the last column of data in each transposed matrix after scrolling to the last column of the previous adjacent transposed matrix;
if the data is scrolled rightwards in the row direction, moving the first column of data in each transposed matrix after scrolling to the first column of the next adjacent transposed matrix;
if the data is scrolled upwards in the column direction, moving the last row of data in each transposed matrix after scrolling to the last row of the previous adjacent transposed matrix;
if the data is scrolled downwards in the column direction, moving the first row of data in each transposed matrix after scrolling to the first row of the next adjacent transposed matrix;
wherein each transposed matrix is obtained by transposing the corresponding block matrix produced by the partitioning.
17. An artificial intelligence chip, wherein the chip comprises a processor according to any one of claims 1 to 8.
18. An electronic device comprising the artificial intelligence chip of claim 17.
CN202010317734.8A 2020-04-21 2020-04-21 Operation method, processor and related products Active CN113536219B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010317734.8A CN113536219B (en) 2020-04-21 2020-04-21 Operation method, processor and related products
PCT/CN2021/075957 WO2021212972A1 (en) 2020-04-21 2021-02-08 Operation method, processor, and related product
US17/920,372 US20230169144A1 (en) 2020-04-21 2021-02-08 Operation method, processor, and related product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010317734.8A CN113536219B (en) 2020-04-21 2020-04-21 Operation method, processor and related products

Publications (2)

Publication Number Publication Date
CN113536219A true CN113536219A (en) 2021-10-22
CN113536219B CN113536219B (en) 2024-01-26

Family

ID=78093921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010317734.8A Active CN113536219B (en) 2020-04-21 2020-04-21 Operation method, processor and related products

Country Status (1)

Country Link
CN (1) CN113536219B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030088600A1 (en) * 2001-08-13 2003-05-08 Sun Microsystems, Inc. A Delaware Corporation Matrix transposition in a computer system
CN109213962A (en) * 2017-07-07 2019-01-15 华为技术有限公司 Arithmetic accelerator
CN109992743A (en) * 2017-12-29 2019-07-09 华为技术有限公司 Matrix multiplier
CN110415157A (en) * 2018-04-26 2019-11-05 华为技术有限公司 A kind of calculation method and device of matrix multiplication
US20190392297A1 (en) * 2016-12-30 2019-12-26 Intel Corporation Deep learning hardware
WO2020027386A1 (en) * 2018-07-30 2020-02-06 부산대학교 산학협력단 Mass encryption matrix calculation-optimization processing method in power device environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李海霞;: "Cannon算法在并行计算机上的应用", 黄石理工学院学报, no. 03, pages 17 - 20 *
陈传伟;: "并行计算机系统下的矩阵乘法", 武汉科技学院学报, no. 11, pages 10 - 12 *
陈雪: "基于DataMPI的并行矩阵乘法计算模型研究", 中国优秀硕士学位论文全文数据库 (信息科技辑), pages 137 - 199 *

Also Published As

Publication number Publication date
CN113536219B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
US8051124B2 (en) High speed and efficient matrix multiplication hardware module
US6901422B1 (en) Matrix multiplication in a vector processing system
CN111859273A (en) Matrix multiplier
CN114391135A (en) Method for performing in-memory processing operations on contiguously allocated data, and related memory device and system
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
US20230068450A1 (en) Method and apparatus for processing sparse data
WO2023065983A1 (en) Computing apparatus, neural network processing device, chip, and data processing method
CN110554854A (en) Data processor, method, chip and electronic equipment
JPH06502265A (en) Calculation circuit device for matrix operations in signal processing
CN113885831A (en) Storage and calculation integrated circuit based on mixed data input, chip and calculation device
CN112765540B (en) Data processing method and device and related products
US20230253032A1 (en) In-memory computation device and in-memory computation method to perform multiplication operation in memory cell array according to bit orders
EP4206996A1 (en) Neural network accelerator with configurable pooling processing unit
CN113536219A (en) Operation method, processor and related product
CN112784951A (en) Winograd convolution operation method and related product
CN112766471B (en) Computing device and related product
CN113536221B (en) Operation method, processor and related products
JP7251354B2 (en) Information processing device, information processing program, and information processing method
US11961420B2 (en) Efficient squaring with loop equalization in arithmetic logic units
US20230169144A1 (en) Operation method, processor, and related product
CN112784206A (en) Winograd convolution operation method, device, equipment and storage medium
CN115576895B (en) Computing device, computing method, and computer-readable storage medium
CN110688087A (en) Data processor, method, chip and electronic equipment
US11823764B2 (en) Processing-in-memory devices for element-wise multiplication
US20230418600A1 (en) Non-volatile memory die with latch-based multiply-accumulate components

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant