CN109753319B - Device for releasing dynamic link library and related product - Google Patents

Publication number: CN109753319B (grant); CN109753319A (application)
Application number: CN201811629632.9A
Authority: CN (China)
Legal status: Active (granted)
Inventor: not disclosed (不公告发明人)
Current assignee: Cambricon Technologies Corp Ltd
Original assignee: Cambricon Technologies Corp Ltd
Application filed by Cambricon Technologies Corp Ltd; priority to CN201811629632.9A
Prior art keywords: dynamic link library, processing circuit, thread, data
Other languages: Chinese (zh)

Abstract

The application provides a device for releasing a dynamic link library and a related product, wherein the device is applied to a processor unit. The processor unit is used for configuring a first process, and the first process comprises a first thread and a second thread. The first thread is called to load a Caffe dynamic link library, use the Caffe dynamic link library, and perform a destructor operation on a first object; after the first thread has executed the destructor operation on the first object, the second thread is called to release the Caffe dynamic link library from the memory. In this way, both loading and releasing of the Caffe dynamic link library can be realized, the memory can be prevented from being permanently occupied, and the operation speed is increased.

Description

Device for releasing dynamic link library and related product
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a device for releasing a dynamic link library and a related product.
Background
With the rapid development of software and hardware technology, various dynamic link library (DLL) files are widely used. Taking the dynamic link library of the convolutional neural network framework Caffe (Convolutional Architecture for Fast Feature Embedding) as an example, the Caffe dynamic link library is a third-party dynamic link library mainly applied in video and image processing applications. During loading, it needs to occupy a large amount of memory on the terminal.
In the prior art, an application program is directly linked with the Caffe dynamic link library, so that when the application program is started, the processor loads the Caffe dynamic link library into the memory, and the processor can only unload the Caffe dynamic link library when the application exits. That is, as long as the application is running, the Caffe dynamic link library occupies the memory of the terminal; since the memory on the terminal is limited, this may reduce the processing speed of the terminal. Therefore, how to load and release the Caffe dynamic link library so that the memory on the terminal is not permanently occupied by it while the program runs is a research hotspot for technicians in the field.
Disclosure of Invention
The embodiments of the application provide a device for releasing a dynamic link library and a related product. While an application program is running, loading and releasing of the Caffe dynamic link library can both be performed, so the memory on the terminal is not permanently occupied by the Caffe dynamic link library during program execution, thereby improving the operation speed.
In a first aspect, an embodiment of the present application provides an apparatus for releasing a dynamic link library, where the apparatus is applied to a processor unit; the processor unit is used for receiving a first loading request of a first dynamic link library file; the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process; the first process comprises a first thread and a second thread;
the processor unit is further configured to invoke the first thread according to the first load request to load the Caffe dynamic link library into the memory, and create a first object;
the processor unit is further configured to call the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, to perform a destructor operation on the first object;
the processor unit is further configured to invoke the second thread to release the Caffe dynamic link library from the memory after the first thread is invoked to execute the destruct on the first object.
In a second aspect, an embodiment of the present application provides a machine learning arithmetic device, where the machine learning arithmetic device includes the apparatus for releasing a dynamic link library according to the first aspect, and the apparatus for releasing a dynamic link library includes one or more MLU computing units. The machine learning arithmetic device is used for acquiring input data to be operated on and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of MLU computing units, the MLU computing units can be connected through a specific structure and transmit data;
specifically, the MLU computing units are interconnected and transmit data through a peripheral component interconnect express (PCIE) bus, so as to support larger-scale machine learning operations. The multiple MLU computing units may share the same control system or have their own respective control systems; they may share a memory or have their own respective memories; and their interconnection mode may be any interconnection topology.
In a third aspect, an embodiment of the present application provides a combined processing device, which includes the machine learning arithmetic device according to the second aspect, a universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with the other processing devices to jointly complete the operation designated by the user. The combined processing device may further include a storage device, which is connected to the machine learning arithmetic device and the other processing devices, respectively, and stores data of the machine learning arithmetic device and the other processing devices.
In a fourth aspect, an embodiment of the present application provides a neural network chip, where the chip includes the apparatus for releasing a dynamic link library according to the first aspect, the machine learning operation apparatus according to the second aspect, or the combination processing apparatus according to the third aspect.
In a fifth aspect, an embodiment of the present application provides a neural network chip package structure, where the neural network chip package structure includes the neural network chip described in the fourth aspect;
in a sixth aspect, an embodiment of the present application provides a board card, where the board card includes the neural network chip package structure described in the fifth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the neural network chip described in the fourth aspect or the board card described in the sixth aspect.
In an eighth aspect, an embodiment of the present application further provides a method for releasing a dynamically linked library, where the method is applied to an apparatus for releasing a dynamically linked library, where the apparatus includes a processor unit; the method comprises the following steps:
the processor unit receives a first loading request of a first dynamic link library file; the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process; the first process comprises a first thread and a second thread;
the processor unit calls the first thread according to the first loading request to load the Caffe dynamic link library into the memory and create a first object;
the processor unit calls the first thread to execute the first dynamic link library file and, after the first function of the first application program has been executed, performs a destructor operation on the first object;
and after the processor unit calls the first thread to execute the destruct on the first object, calling the second thread to release the Caffe dynamic link library from the memory.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic structural diagram of an apparatus for releasing a dynamically linked library according to an embodiment of the present application;
FIG. 2 is a block diagram of an MLU computing unit provided in an embodiment of the present application;
FIG. 3 is a block diagram of a main processing circuit provided in an embodiment of the present application;
FIG. 4 is a block diagram of another MLU computing unit provided in an embodiment of the present application;
FIG. 5 is a block diagram of another MLU computing unit provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a tree module provided in an embodiment of the present application;
FIG. 7 is a block diagram of another MLU computing unit provided in an embodiment of the present application;
FIG. 8 is a block diagram of a further MLU computing unit provided in an embodiment of the present application;
FIG. 9 is a block diagram of an MLU computing unit provided in another embodiment of the present application;
FIG. 10 is a block diagram of a combined processing device according to an embodiment of the present application;
FIG. 11 is a block diagram of another combined processing device provided in an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a board card provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
First, a device for releasing a dynamically linked library used in the present application will be described. Referring to fig. 1, there is provided an apparatus for releasing a dynamically linked library, the apparatus including: a processor unit 13;
the processor unit 13 is configured to receive a first loading request of a first dynamic link library file; the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process; the first process comprises a first thread and a second thread;
in the embodiment of the invention, the dynamic link library file DLL (dynamic link library file) cannot run independently, and is mainly responsible for providing certain services for the application program. That is, the application needs to perform certain functions under the loading of the dynamically linked library file. For example, the first dynamic link library file referred to in the embodiment of the present invention may be used to implement a face recognition function of a pay bank application.
In the embodiment of the present invention, the first dynamic link library file referred to herein is stored in the Caffe dynamic link library.
In practical applications, the dynamically linked library file may contain a function under the Caffe framework.
In a specific implementation, the functions under the Caffe framework may include: the Caffe Blob function, the Caffe Layer function, and the Caffe Net function. Blob is used to store, exchange, and process the data and derivative information of forward and backward iterations in the network. Layer is used to perform calculations, which may include nonlinear operations such as convolution (convolve), pooling (pool), inner product (inner product), rectified-linear (ReLU), and sigmoid, as well as element-level data transformations, normalization (normalize), data loading (load data), classification (softmax), and loss calculations (losses). In a specific implementation, each Layer defines 3 important operations: initialization setting (setup), forward propagation (forward), and backward propagation (backward). Setup resets the layers and the connections between them at model initialization; forward receives input data from the bottom layer and, after calculation, outputs the result to the top layer; backward takes the output gradient of the top layer, calculates the gradient of its input, and passes it to the bottom layer. For example, the Layers may include Data Layers, Convolution Layers, Pooling Layers, InnerProduct Layers, ReLU Layers, Sigmoid Layers, LRN Layers, Dropout Layers, SoftmaxWithLoss Layers, Softmax Layers, Accuracy Layers, and the like. A Net starts with a data layer, which loads data from disk, and ends with a loss layer, which computes the objective function for tasks such as classification and reconstruction. Specifically, a Net is a directed acyclic computational graph composed of a series of Layers, and Caffe preserves all intermediate values in the computational graph to ensure the accuracy of forward and backward iterations.
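To make the three operations concrete, the following C++ outline mirrors the setup/forward/backward division described above. It is a schematic sketch, not the actual Caffe class declarations; the type names and signatures are illustrative assumptions.

```cpp
#include <vector>

struct Blob;  // stands in for Caffe's data container (data and gradients)

// Each Layer defines three operations, as described above.
struct Layer {
    virtual ~Layer() = default;
    // setup: (re)initialize the layer and its connections at model init time.
    virtual void SetUp(const std::vector<Blob*>& bottom,
                       std::vector<Blob*>& top) = 0;
    // forward: take input from the bottom blobs, compute, write to the top blobs.
    virtual void Forward(const std::vector<Blob*>& bottom,
                         std::vector<Blob*>& top) = 0;
    // backward: given the top layer's output gradient, compute the input
    // gradient and pass it down to the bottom blobs.
    virtual void Backward(const std::vector<Blob*>& top,
                          std::vector<Blob*>& bottom) = 0;
};
```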
The processor unit 13 is further configured to invoke the first thread according to the first load request to load the Caffe dynamic link library into the memory, and create a first object;
as previously described, the first process includes a first thread and a second thread. Specifically, the first process is a process configured by the processor unit for the first application.
In a specific implementation, the first thread loads the Caffe dynamic link library into the memory by executing a dlopen function, and creates an object, that is, a first object, for the first process. Here, the dlopen function opens the specified dynamic link library in the specified mode and loads it into memory. When the dynamically linked library is loaded into memory, a handle is returned to the calling process.
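As a minimal sketch of this loading step (assuming a POSIX environment; the library file name libcaffe.so is an illustrative assumption, not taken from the patent):

```cpp
#include <dlfcn.h>
#include <cstdio>

// Called on the first thread: open the Caffe dynamic link library in the
// specified mode and load it into memory; on success, dlopen returns a
// handle to the calling process.
void* load_caffe_library() {
    void* handle = dlopen("libcaffe.so", RTLD_NOW);
    if (handle == nullptr) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    }
    return handle;
}
```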
The processor unit 13 is further configured to invoke the first thread to execute the first dynamic link library file, and after the first function of the first application program is executed, perform destruct on the first object;
as described above, the first dynamic link library file itself cannot run independently, and the first dynamic link library file is executed by calling the first thread, that is, the function provided by the first dynamic link library file for the application program is realized, and after the function is executed, the destruct operation is performed on the object created for the first process (that is, the first object). For example, the first dynamic link library file is used for providing a face recognition function for the wechat application program, and when the first thread is called to execute the first dynamic library file, and after the identity of the current object to be recognized is successfully recognized by the wechat application program, the first object is destructed.
In an embodiment of the invention, the destruct operation is performed on the first object by executing a destructor function. Specifically, a destructor is the opposite of a constructor: the system automatically executes the destructor when the object's lifecycle ends (for example, when the function in which the object was defined has finished executing). The destructor is often used for "clean-up" work; for example, if a block of memory was allocated with "new" when the object was built, the destructor releases it with "delete" before exiting.
In the embodiment of the present invention, the program code of the destructor is stored in the Caffe dynamic link library. It is understood that after the first object is destructed, the memory space occupied by the first object can be freed.
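A toy C++ illustration of the new/delete pairing just described (the class name is hypothetical):

```cpp
class FirstObject {
public:
    FirstObject() : buffer_(new char[1024]) {}  // constructor: acquire memory with "new"
    ~FirstObject() { delete[] buffer_; }        // destructor: "clean-up" work, release with "delete"
private:
    char* buffer_;
};
```

Since the destructor code for such an object lives in the Caffe dynamic link library, the library must remain loaded until the destructor has finished running.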
The processor unit 13 is further configured to invoke the second thread to release the Caffe dynamic link library from the memory after the first thread is invoked to execute the destruct on the first object.
Specifically, after the first thread executes the destruct operation on the first object, the Caffe dynamic link library is released from the memory by calling the second thread to execute a dlclose function. Here, a dlclose function is used to unload the open dynamic link library.
In the conventional case, only one thread is used. This single thread executes the dlopen function, which opens the specified dynamic link library in the specified mode and loads it into memory; the thread then uses the Caffe dynamic link library to provide the corresponding function for the application. When that function is completed, the single thread executes the dlclose function, at which point the thread exits, even though the destructor has not yet been executed. Since the Caffe dynamic link library has already been released, by the time the destructor executes, the corresponding code has been unloaded from memory, resulting in a program error at run time. In practice, the destructor executes after all user code, so with only one thread it is impossible to call the dlclose function after the destructor has run. Therefore, by implementing the embodiment of the invention, loading and releasing of the Caffe dynamic link library can both be achieved while the application program is running, and the memory on the terminal can be prevented from being permanently occupied by the Caffe dynamic link library, thereby improving the operation speed.
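The two-thread scheme can be sketched as follows. This is a minimal sketch assuming POSIX dlopen/dlclose; the library name and symbol use are illustrative, and std::thread::join stands in for whatever ordering guarantee the device enforces between the two threads:

```cpp
#include <dlfcn.h>
#include <thread>

void run_first_process() {
    void* handle = nullptr;

    // First thread: load the Caffe dynamic link library, use it, and
    // destruct the first object before exiting.
    std::thread first([&handle] {
        handle = dlopen("libcaffe.so", RTLD_NOW);
        if (handle == nullptr) return;
        // ... resolve symbols with dlsym(), execute the first function of
        // the first application program, then destruct the first object ...
    });
    first.join();  // the destructor has finished by the time join() returns

    // Second thread: only now release the library, so the destructor code
    // was still mapped in memory when it ran.
    std::thread second([&handle] {
        if (handle != nullptr) dlclose(handle);
    });
    second.join();
}
```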
In one embodiment, the Caffe dynamic link library referred to in this application may be used by multiple processes of the same application at the same time, where the multiple processes respectively call different dynamic link library files, and in this case, the processor unit is further configured to configure a second process, where the second process includes a third thread and a fourth thread;
the processor unit is further configured to receive a second loading request of a second dynamic link library file while receiving a first loading request of the first dynamic link library file; wherein the second dynamic link library file is used for realizing a first function of the first application program;
the processor unit is further configured to invoke the third thread according to the second load request to load the Caffe dynamic link library into the memory, and create a second object;
the processor unit is further configured to invoke the third thread to execute the second dynamic link library file, and after the first function of the first application program is executed, perform destruct on the second object;
the processor unit is further configured to invoke the fourth thread to release the Caffe dynamic link library from the memory after invoking the third thread to execute the destruct on the second object.
In the current application scenario, the first process and the second process are two processes configured for the first application program by the processor unit. The two processes are mutually independent: when the first process is running, the running of the second process is not affected, and when the two processes load the Caffe dynamic link library simultaneously, the memory regions corresponding to the two processes do not intersect.
In the embodiment of the present invention, please refer to the foregoing description for the specific implementation of the first process, which is not repeated here. As for the second process, it comprises a third thread and a fourth thread: the third thread executes "load the Caffe dynamic link library - use the Caffe dynamic link library - perform the destruct on the second object", and the fourth thread releases the Caffe dynamic link library from the memory after the third thread has destructed the second object. In this case, when the first dynamic link library file and the second dynamic link library file are used to implement the same function of the first application program, for example the face recognition function of an Alipay application, this may be regarded as re-authentication of the object currently to be recognized, so that security may be further improved.
In one embodiment, the Caffe dynamic link library referred to in this application may be called by different applications at the same time, and the different applications call different dynamic link library files, in which case, the processor unit is further configured to configure a third process, where the third process includes a fifth thread and a sixth thread;
the processor unit is further configured to: receiving a first loading request of the first dynamic link library file and a third loading request of a third dynamic link library file at the same time; wherein the third dynamic link library file is used for realizing a second function of a second application program;
the processor unit is further configured to invoke the fifth thread according to the third load request to load the Caffe dynamic link library into the memory, and create a third object;
the processor unit is further configured to invoke the fifth thread to execute the third dynamic link library file, and after the second function of the second application program is executed, perform destruct on the third object;
the processor unit is further configured to invoke the sixth thread to release the Caffe dynamic link library from the memory after the fifth thread is invoked to execute the destruct on the third object.
In the current application scenario, the first process is a process configured by the processor unit for the first application program, and the third process is a process configured by the processor unit for the second application program. The two processes are mutually independent: when the first process is running, the running of the third process is not affected, and when the two processes load the Caffe dynamic link library simultaneously, the memory regions corresponding to the two processes do not intersect.
In the embodiment of the present invention, please refer to the foregoing description for the specific implementation of the first process, which is not repeated here. As for the third process, it includes a fifth thread and a sixth thread: the fifth thread performs "load the Caffe dynamic link library - use the Caffe dynamic link library - perform the destruct on the third object", and the sixth thread releases the Caffe dynamic link library from the memory after the fifth thread has destructed the third object. In this case, real-time sharing of the Caffe dynamic link library by different applications can be achieved. In addition, different functions of different applications may also be implemented.
In one embodiment, as shown in fig. 1, the apparatus for releasing a dynamic link library further includes an MLU (Machine Learning Processing Unit) computing unit, wherein the processor unit 13 is connected to the MLU computing unit.
As mentioned above, each dynamic link library file in the Caffe dynamic link library includes a function under the Caffe framework;
the processor unit is further configured to input a function under the Caffe framework into the MLU computing unit in a process of loading the Caffe dynamic link library; the MLU computing unit is used for computing according to the function under the Caffe framework and the operation instruction to obtain a computing result and sending the computing result to the processor unit;
the processor unit is further configured to receive the calculation result.
By implementing the embodiment of the invention, the loading speed of the Caffe dynamic link library can be increased.
In a specific implementation, the MLU calculation unit includes a controller unit 11 and an arithmetic unit 12; wherein, controller unit 11 is connected with arithmetic unit 12, and arithmetic unit 12 includes: a master processing circuit and a plurality of slave processing circuits;
a controller unit 11 for acquiring input data and a calculation instruction; wherein the input data comprises function data under the Caffe framework; in an alternative, the input data and the calculation instruction may be obtained through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
The above calculation instructions include, but are not limited to, a convolution operation instruction, a forward training instruction, or another neural network operation instruction; the present invention does not limit the specific expression of the above-mentioned computation instruction.
The controller unit 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
a master processing circuit 101 configured to perform preliminary processing on the input data and to transmit data and operation instructions to and from the plurality of slave processing circuits;
a plurality of slave processing circuits 102 configured to perform an intermediate operation in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
and the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In the technical solution provided by this application, the arithmetic unit is arranged in a one-master multi-slave structure. For the computation instruction of a forward operation, the data can be split according to that computation instruction, so that the part with the larger amount of computation can be operated on in parallel by the plurality of slave processing circuits, thereby increasing the operation speed, saving operating time, and in turn reducing power consumption.
In one embodiment, when the processor unit and the MLU calculation unit are included in the apparatus for releasing the dynamically linked library, the apparatus may perform machine learning calculation. In an optional implementation, the machine learning calculation may include a convolutional neural network calculation, the input data may include a function under a Caffe framework, input neuron data, and weight data, where the function under the Caffe framework may include Caffe Blob, Caffe Layer, and Caffe Net functions, and the calculation result may specifically be: the result of the convolutional neural network operation is output neuron data.
In the forward operation, after the execution of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the output neuron calculated in the operation unit as the input neuron of the next layer to perform operation (or performs some operation on the output neuron and then takes the output neuron as the input neuron of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer; in the reverse operation, after the reverse operation of the artificial neural network of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer to perform operation (or performs some operation on the input neuron gradient and then takes the input neuron gradient as the output neuron gradient of the next layer), and at the same time, the weight value is replaced by the weight value of the next layer.
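A simplified sketch of the forward chaining just described: each layer's output neurons become the next layer's input neurons, and the weights are replaced by the next layer's weights at every step (the flat vector/matrix shapes and the omission of the activation are simplifying assumptions):

```cpp
#include <vector>

// One layer: y[j] = sum_i w[j][i] * x[i] (activation omitted for brevity).
std::vector<float> run_layer(const std::vector<float>& x,
                             const std::vector<std::vector<float>>& w) {
    std::vector<float> y(w.size(), 0.0f);
    for (size_t j = 0; j < w.size(); ++j)
        for (size_t i = 0; i < x.size(); ++i)
            y[j] += w[j][i] * x[i];
    return y;
}

// Forward pass: the output neurons calculated for one layer are used as the
// input neurons of the next layer, while the weights are swapped for the
// next layer's weights.
std::vector<float> forward(std::vector<float> neurons,
                           const std::vector<std::vector<std::vector<float>>>& weights) {
    for (const auto& w : weights)
        neurons = run_layer(neurons, w);
    return neurons;
}
```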
In one embodiment, the MLU calculating unit may further include: the storage unit 10 and the direct memory access unit 50, the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used for storing the calculation instruction; the register is used for storing the input data and a scalar; the cache is a scratch pad cache. The direct memory access unit 50 is used to read or store data from the storage unit 10.
Optionally, the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to analyze the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising a plurality of operation instructions or calculation instructions to be executed in the front-to-back order of the queue.
For example, in an alternative embodiment, the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, specifically configured to decode instructions into microinstructions. Of course, in another alternative, the slave arithmetic processing circuit may also include another controller unit that includes a slave instruction processing unit, specifically for receiving and processing microinstructions. The micro instruction may be a next-stage instruction of the instruction, and the micro instruction may be obtained by splitting or decoding the instruction, and may be further decoded into control signals of each component, each unit, or each processing circuit.
In one alternative, the structure of the calculation instruction may be as shown in Table 1 below.
TABLE 1
Operation code | Register or immediate | Register/immediate | ...

The ellipsis in the above table indicates that multiple registers or immediates may be included.
In another alternative, the computing instructions may include: one or more operation domains and an opcode. The computation instructions may include neural network operation instructions. Taking the neural network operation instruction as an example, as shown in table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be a number of one or more registers. Specifically, please see table 2:
TABLE 2
[Table 2 is reproduced as an image in the original publication; per the surrounding text, it lists the opcode together with register number 0 through register number 4 as the operation domains of a neural network operation instruction.]
The register may be an off-chip memory; in practical applications, it may also be an on-chip memory for storing data. The data may specifically be n-dimensional data, where n is an integer greater than or equal to 1: when n = 1, the data is 1-dimensional, that is, a vector; when n = 2, the data is 2-dimensional, that is, a matrix; and when n is 3 or more, the data is a multidimensional tensor.
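An illustrative encoding of this instruction format (an assumption for illustration, not the patent's actual bit layout):

```cpp
#include <cstdint>

// One computation instruction: an opcode followed by operation domains,
// here register number 0 .. register number 4 (each of which may instead
// hold an immediate), matching the layout sketched in Tables 1 and 2.
struct ComputeInstruction {
    uint16_t opcode;
    uint16_t operand[5];  // register numbers or immediates
};
```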
In another alternative embodiment, the arithmetic unit 12 may include a master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 2. In one embodiment, as shown in fig. 2, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected with the adjacent slave processing circuits, and the master processing circuit is connected with k slave processing circuits among the plurality of slave processing circuits. As shown in fig. 2, the k slave processing circuits comprise only the n slave processing circuits in the 1st row, the n slave processing circuits in the mth row, and the m slave processing circuits in the 1st column; that is, the k slave processing circuits are the slave processing circuits, among the plurality of slave processing circuits, that are directly connected to the master processing circuit.
The k slave processing circuits are used for forwarding data and instructions between the master processing circuit and the remaining slave processing circuits.
Optionally, as shown in fig. 3, the main processing circuit may further include: one or any combination of the conversion processing circuit 110, the activation processing circuit 111, and the addition processing circuit 112;
a conversion processing circuit 110 for performing, on the data block or intermediate result received by the main processing circuit, an interchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data), or an interchange between a first data type and a second data type (e.g., a fixed-point to floating-point conversion), as illustrated in the sketch following this list;
an activation processing circuit 111 for performing an activation operation of data in the main processing circuit;
and an addition processing circuit 112 for performing addition operation or accumulation operation.
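As a toy illustration of the conversion processing circuit's data-type role, the following converts between a fixed-point format and floating point (the Q8.8 format is an assumed example; the patent does not specify one):

```cpp
#include <cstdint>

// Fixed point <-> floating point interchange, using Q8.8 (8 integer bits,
// 8 fractional bits) as the assumed fixed-point data type.
float   fixed_to_float(int16_t q8_8) { return q8_8 / 256.0f; }
int16_t float_to_fixed(float f)      { return static_cast<int16_t>(f * 256.0f); }
```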
The master processing circuit is configured to determine that the input neuron is broadcast data, determine that a weight is distribution data, distribute the distribution data into a plurality of data blocks, and send at least one data block of the plurality of data blocks and at least one operation instruction of the plurality of operation instructions to the slave processing circuit;
the plurality of slave processing circuits are used for executing operations on the received data blocks according to the operation instruction to obtain intermediate results, and for transmitting the intermediate results to the main processing circuit;
and the main processing circuit is used for processing the intermediate results sent by the plurality of slave processing circuits to obtain the result of the calculation instruction and sending the result of the calculation instruction to the controller unit.
The slave processing circuit includes: a multiplication processing circuit;
the multiplication processing circuit is used for executing multiplication operation on the received data block to obtain a product result;
forwarding processing circuitry (optional) for forwarding the received data block or the product result.
And the accumulation processing circuit is used for performing accumulation operation on the product result to obtain the intermediate result.
In another embodiment, the operation instruction is a matrix by matrix instruction, an accumulation instruction, an activation instruction, or the like.
The specific calculation method of the MLU computing unit shown in fig. 1 is described below through a neural network operation instruction. For a neural network operation instruction, the formula that actually needs to be executed may be s = s(Σ w·x_i + b): the weight w is multiplied by the input data x_i and summed, the bias b is added, and the activation operation s(h) is then performed to obtain the final output result s.
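Read directly as code, the formula above could be sketched as follows (the sigmoid is chosen here as one possible activation s(h); the patent leaves the activation unspecified):

```cpp
#include <cmath>
#include <vector>

// s = s(sum_i w[i] * x[i] + b)
float neuron(const std::vector<float>& w, const std::vector<float>& x, float b) {
    float h = b;
    for (size_t i = 0; i < x.size(); ++i)
        h += w[i] * x[i];                 // multiply weights by inputs and sum
    return 1.0f / (1.0f + std::exp(-h));  // activation s(h), here a sigmoid
}
```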
In an alternative embodiment, as shown in fig. 4, the arithmetic unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected with the main processing circuit, and the branch ports of the tree module are respectively connected with one of the plurality of slave processing circuits;
the tree module has a transceiving function, for example, as shown in fig. 4, the tree module is a transmitting function, and as shown in fig. 5, the tree module is a receiving function.
And the tree module is used for forwarding data blocks, weights and operation instructions between the main processing circuit and the plurality of slave processing circuits.
Optionally, the tree module is an optional component of the MLU computing unit and may include at least 1 layer of nodes, where the nodes are line structures with a forwarding function and may themselves have no computing function. If the tree module has zero layers of nodes, the tree module is not needed.
Optionally, the tree module may have an n-ary tree structure, for example the binary tree structure shown in fig. 6, or a ternary tree structure, where n may be an integer greater than or equal to 2. The present embodiment does not limit the specific value of n; the number of layers may also be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in fig. 6.
Optionally, the arithmetic unit may carry a separate cache; as shown in fig. 7, it may include a neuron buffer unit 63, which buffers the input neuron vector data and the output neuron value data of the slave processing circuits.
As shown in fig. 8, the arithmetic unit may further include: and a weight buffer unit 64, configured to buffer weight data required by the slave processing circuit in the calculation process.
In an alternative embodiment, the arithmetic unit 12, as shown in fig. 9, may include a branch processing circuit 103; the specific connection structure is shown in fig. 9, wherein,
the main processing circuit 101 is connected to branch processing circuit(s) 103, the branch processing circuit 103 being connected to one or more slave processing circuits 102;
a branch processing circuit 103 for forwarding data or instructions between the main processing circuit 101 and the slave processing circuits 102.
In an alternative embodiment, taking the fully-connected operation in the neural network operation as an example, the process may be: y = f(wx + b), where x is the input neuron matrix, w is the weight matrix, b is the bias scalar, and f is the activation function, which may specifically be a sigmoid, tanh, relu, or softmax function. Here, a binary tree structure is assumed, with 8 slave processing circuits, and the implementation method may be:
the controller unit acquires an input neuron matrix x, a weight matrix w and a full-connection operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the full-connection operation instruction to the main processing circuit;
the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, divides the weight matrix w into 8 sub-matrixes, then distributes the 8 sub-matrixes to 8 slave processing circuits through a tree module, broadcasts the input neuron matrix x to the 8 slave processing circuits,
the slave processing circuit executes multiplication and accumulation operation of the 8 sub-matrixes and the input neuron matrix x in parallel to obtain 8 intermediate results, and the 8 intermediate results are sent to the master processing circuit;
and the main processing circuit is used for ordering the 8 intermediate results to obtain the wx operation result, performing the bias b operation on this result, performing the activation operation to obtain the final result y, and sending the final result y to the controller unit, which outputs it or stores it in the storage unit.
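A single-threaded simulation of this split may help make it concrete (the 8-way row split, the data types, and the tanh activation are illustrative assumptions; in the device, the 8 partial products would run in parallel on the 8 slave processing circuits):

```cpp
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<float>>;

// Master: treat x as broadcast data and split w row-wise into 8 sub-matrices.
// Slaves: each computes its sub-matrix times x (multiply-and-accumulate).
// Master: concatenate the 8 intermediate results, add bias b, then activate.
std::vector<float> fully_connected(const Mat& w, const std::vector<float>& x, float b) {
    const size_t rows_per_slave = w.size() / 8;  // assumes 8 divides w.size()
    std::vector<float> y;
    for (int slave = 0; slave < 8; ++slave) {
        for (size_t r = slave * rows_per_slave; r < (slave + 1) * rows_per_slave; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < x.size(); ++c)
                acc += w[r][c] * x[c];   // one slave's partial wx
            y.push_back(acc);            // intermediate result sent to the master
        }
    }
    for (float& v : y) v = std::tanh(v + b);  // bias b, then activation f
    return y;
}
```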
The method for the MLU computing unit shown in fig. 1 to execute the neural network forward operation instruction may specifically be:
the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit.
The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted), transmits the weight w and the offset b to the main processing circuit of the arithmetic unit, extracts the input data Xi from the storage unit, and transmits the input data Xi to the main processing circuit.
The main processing circuit determines multiplication operation according to the at least one operation code, determines input data Xi as broadcast data, determines weight data as distribution data, and splits the weight w into n data blocks;
the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code, and sends the multiplication instruction, the offset instruction and the accumulation instruction to the master processing circuit, the master processing circuit sends the multiplication instruction and the input data Xi to a plurality of slave processing circuits in a broadcasting mode, and distributes the n data blocks to the plurality of slave processing circuits (for example, if the plurality of slave processing circuits are n, each slave processing circuit sends one data block); the plurality of slave processing circuits are used for executing multiplication operation on the input data Xi and the received data block according to the multiplication instruction to obtain an intermediate result, sending the intermediate result to the master processing circuit, executing accumulation operation on the intermediate result sent by the plurality of slave processing circuits according to the accumulation instruction by the master processing circuit to obtain an accumulation result, executing offset b on the accumulation result according to the offset instruction to obtain a final result, and sending the final result to the controller unit.
In addition, the order of addition and multiplication may be reversed.
According to the technical scheme, the multiplication and bias operations of the neural network are achieved through a single instruction, namely the neural network operation instruction; the intermediate results of the neural network calculation do not need to be stored or extracted, which reduces the storage and extraction operations for intermediate data. The method therefore has the advantages of reducing the corresponding operation steps and improving the calculation performance of the neural network.
The application also discloses a machine learning arithmetic device, which comprises a device for releasing the dynamic link library, wherein the device for releasing the dynamic link library comprises one or more MLU computing units. The machine learning arithmetic device is used for acquiring data to be operated on and control information from other processing devices, executing the specified machine learning operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or a server. When more than one MLU computing unit is included, the MLU computing units can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, so as to support larger-scale machine learning operations. In this case, the units may share the same control system or have separate control systems; they may share a memory, or each accelerator may have its own memory. In addition, the interconnection mode can be any interconnection topology.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device which comprises the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 10 is a schematic view of a combined treatment apparatus.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, performing data transfer and completing basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
And the universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and other processing devices. The machine learning arithmetic device acquires required input data from other processing devices and writes the input data into a storage device on the machine learning arithmetic device; control instructions can be obtained from other processing devices and written into a control cache on a machine learning arithmetic device chip; the data in the storage module of the machine learning arithmetic device can also be read and transmitted to other processing devices.
Alternatively, as shown in fig. 11, the configuration may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing device, respectively. The storage device is used for storing data in the machine learning arithmetic device and the other processing device, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing device.
The combined processing device can be used as the system-on-chip (SOC) of equipment such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a wifi interface.
In some embodiments, a chip is also claimed, which includes the above machine learning arithmetic device or the combined processing device.
In some embodiments, a chip package structure is provided, which includes the above chip.
In some embodiments, a board card is provided, which includes the above chip package structure. Referring to fig. 12, fig. 12 provides a card that may include other kits in addition to the chip 389, including but not limited to: memory device 390, interface device 391 and control device 392;
the memory device 390 is connected to the chip in the chip package structure through a bus for storing data. The memory device may include a plurality of groups of memory cells 393. Each group of the storage units is connected with the chip through a bus. It is understood that each group of the memory cells may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, since DDR allows data to be read out on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of the storage units. Each group of storage units may include a plurality of DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are adopted in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s (3200 MT/s multiplied by the 64-bit, i.e. 8-byte, data path).
In one embodiment, each group of the memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each memory unit.
The interface device is electrically connected with the chip in the chip package structure. The interface device is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted from the server to the chip through the standard PCIE interface, realizing the data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the concrete form of such other interfaces, as long as the interface unit can realize the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected with the chip and is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). Since the chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, it can drive multiple loads and can therefore be in different working states such as multi-load and light load. The control device can regulate and control the working states of the plurality of processing chips, the plurality of processing cores, and/or the plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by related hardware instructed by a program, and the program may be stored in a computer-readable memory, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is only intended to help understand the method and core concept of the present application. Meanwhile, a person skilled in the art may, according to the concept of the present application, make changes to the specific implementations and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (22)

1. An apparatus for releasing a dynamic link library, wherein the apparatus is applied to a processor unit;
the processor unit is used for receiving a first loading request of a first dynamic link library file; the first dynamic link library file is used for realizing a first function of a first application program; the processor unit is used for configuring a first process; the first process comprises a first thread and a second thread;
the processor unit is further configured to invoke the first thread according to the first load request to load the Caffe dynamic link library into the memory, and create a first object;
the processor unit is further configured to call the first thread to execute the first dynamic link library file, and to perform a destruction operation on the first object after the first function of the first application program is executed;
the processor unit is further configured to invoke the second thread to release the Caffe dynamic link library from the memory after the first thread is invoked to perform the destruction operation on the first object.
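By way of illustration (not part of the claims), the sequence recited in claim 1 can be mimicked on a POSIX system roughly as follows. The library path and the symbol names (create_first_object, destroy_first_object) are assumptions introduced only for this sketch. Build with: g++ example.cpp -ldl -pthread.

```cpp
#include <dlfcn.h>
#include <cstdio>
#include <thread>

int main() {
    void* handle = nullptr;

    // First thread: load the library into memory, create the first object,
    // execute the first function, then destruct the object.
    std::thread first([&handle] {
        handle = dlopen("libcaffe.so", RTLD_NOW);
        if (!handle) { std::fprintf(stderr, "%s\n", dlerror()); return; }
        using Create  = void* (*)();
        using Destroy = void  (*)(void*);
        auto create  = reinterpret_cast<Create>(dlsym(handle, "create_first_object"));
        auto destroy = reinterpret_cast<Destroy>(dlsym(handle, "destroy_first_object"));
        if (!create || !destroy) return;
        void* obj = create();          // create the first object
        /* ... execute the first function of the first application ... */
        destroy(obj);                  // destruct the first object
    });
    first.join();  // the destruction is guaranteed to be finished here

    // Second thread: only now release the dynamic link library from memory.
    std::thread second([&handle] {
        if (handle) dlclose(handle);
    });
    second.join();
    return 0;
}
```

The join before the second thread starts realizes the claimed ordering: the library is never unmapped while code inside it, such as the object's destructor, may still run.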
2. The apparatus of claim 1, wherein the processor unit is further configured to configure a second process, the second process comprising a third thread and a fourth thread;
the processor unit is further configured to receive a second loading request of a second dynamic link library file while receiving a first loading request of the first dynamic link library file; wherein the second dynamic link library file is used for realizing a first function of the first application program;
the processor unit is further configured to invoke the third thread according to the second load request to load the Caffe dynamic link library into the memory, and create a second object;
the processor unit is further configured to invoke the third thread to execute the second dynamic link library file, and to perform a destruction operation on the second object after the first function of the first application program is executed;
the processor unit is further configured to invoke the fourth thread to release the Caffe dynamic link library from the memory after invoking the third thread to perform the destruction operation on the second object.
3. The apparatus of claim 1, wherein the processor unit is further configured to configure a third process, the third process comprising a fifth thread and a sixth thread;
the processor unit is further configured to: receiving a first loading request of the first dynamic link library file and a third loading request of a third dynamic link library file at the same time; wherein the third dynamic link library file is used for realizing a second function of a second application program;
the processor unit is further configured to invoke the fifth thread according to the third load request to load the Caffe dynamic link library into the memory, and create a third object;
the processor unit is further configured to invoke the fifth thread to execute the third dynamic link library file, and to perform a destruction operation on the third object after the second function of the second application program is executed;
the processor unit is further configured to invoke the sixth thread to release the Caffe dynamic link library from the memory after the fifth thread is invoked to perform the destruction operation on the third object.
4. The apparatus of claim 1, wherein the apparatus further comprises an MLU computing unit, and each dynamic link library file in the Caffe dynamic link library comprises a function under the Caffe framework;
the processor unit is further configured to input a function under the Caffe framework into the MLU computing unit in a process of loading the Caffe dynamic link library; the MLU computing unit is used for computing according to the function under the Caffe framework and the operation instruction to obtain a computing result and sending the computing result to the processor unit;
the processor unit is further configured to receive the calculation result.
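A rough software analogue of the claim-4 handshake between the processor unit and the MLU computing unit is sketched below. The MluUnit and FrameworkFunction types and the meaning of op_instruction are assumptions made up for this illustration; the claims do not define a concrete API.

```cpp
#include <cstddef>
#include <vector>

struct FrameworkFunction {          // "a function under the Caffe framework"
    std::vector<float> weights;
    std::vector<float> inputs;
};

class MluUnit {
public:
    // Compute according to the function data and an operation instruction,
    // then return the result to the caller (the processor unit).
    std::vector<float> run(const FrameworkFunction& fn, int op_instruction) const {
        std::vector<float> result(fn.inputs.size(), 0.0f);
        if (op_instruction == 0) {  // e.g. scale every input by the first weight
            float w = fn.weights.empty() ? 1.0f : fn.weights[0];
            for (std::size_t i = 0; i < fn.inputs.size(); ++i)
                result[i] = fn.inputs[i] * w;
        }
        return result;
    }
};

int main() {
    MluUnit mlu;
    FrameworkFunction fn{{2.0f}, {1.0f, 2.0f, 3.0f}};
    std::vector<float> out = mlu.run(fn, 0);  // processor unit receives the result
    return out.empty() ? 1 : 0;
}
```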
5. The apparatus of claim 4, wherein the MLU computing unit comprises a controller unit and an arithmetic unit; the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the controller unit is used for acquiring input data and a calculation instruction, wherein the input data comprises function data under the Caffe framework;
the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the master processing circuit;
the master processing circuit is used for performing pre-processing on the input data and for transmitting data and operation instructions to and from the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing intermediate operations in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and for transmitting the plurality of intermediate results to the master processing circuit; and
the master processing circuit is used for performing subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
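The division of labor in claim 5 can be mimicked in host code: one calculation instruction is split into per-slave work, the slaves run intermediate operations in parallel, and the master performs the subsequent combining step. A minimal sketch, assuming the input length divides evenly among the slaves:

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

std::vector<float> slave_op(std::vector<float> chunk) {
    for (float& x : chunk) x *= 2.0f;   // the intermediate operation (illustrative)
    return chunk;
}

float master_compute(const std::vector<float>& input, int num_slaves) {
    const std::size_t per = input.size() / num_slaves;   // pre-processing: split
    std::vector<std::future<std::vector<float>>> slaves;
    for (int s = 0; s < num_slaves; ++s)
        slaves.push_back(std::async(std::launch::async, slave_op,
            std::vector<float>(input.begin() + s * per,
                               input.begin() + (s + 1) * per)));
    float acc = 0.0f;                                    // subsequent processing
    for (auto& f : slaves) {
        std::vector<float> mid = f.get();                // an intermediate result
        acc = std::accumulate(mid.begin(), mid.end(), acc);
    }
    return acc;
}

int main() {
    std::vector<float> data(8, 1.0f);
    return master_compute(data, 4) == 16.0f ? 0 : 1;
}
```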
6. The apparatus according to claim 5, wherein the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports, wherein the root port of the tree module is connected with the master processing circuit, and each branch port of the tree module is connected with one of the plurality of slave processing circuits; and
the tree module is used for forwarding data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.
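A tree module of this kind behaves like a reduction tree: operands fan out from the root port toward the slaves, and intermediate results are combined on the way back up. A toy recursive sketch of the upward combining pass (illustrative only, not the claimed circuit):

```cpp
#include <cstddef>
#include <vector>

float tree_reduce(const std::vector<float>& leaves, std::size_t lo, std::size_t hi) {
    if (hi - lo == 1) return leaves[lo];          // one slave's intermediate result
    std::size_t mid = lo + (hi - lo) / 2;         // an internal branch port
    return tree_reduce(leaves, lo, mid) + tree_reduce(leaves, mid, hi);
}

int main() {
    std::vector<float> leaves{1.0f, 2.0f, 3.0f, 4.0f};
    return tree_reduce(leaves, 0, leaves.size()) == 10.0f ? 0 : 1;
}
```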
7. The apparatus of claim 5, wherein the arithmetic unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
the master processing circuit is specifically configured to determine that input neurons are broadcast data and that weights are distribution data, to divide the distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits are used for forwarding the data blocks, the broadcast data, and the operation instructions between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks and the broadcast data according to the operation instructions to obtain intermediate results, and for transmitting the intermediate results to the branch processing circuits; and
the master processing circuit is used for performing subsequent processing on the intermediate results sent by the branch processing circuits to obtain a result of the calculation instruction, and for sending the result of the calculation instruction to the controller unit.
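The broadcast/distribution split in claim 7 corresponds to a familiar matrix-vector decomposition: the input neuron vector is broadcast (reused by every slave), while the weight matrix is divided row-wise into distribution data blocks. A sequential sketch of that decomposition, assuming the row count divides evenly among the slaves:

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;

Vec forward(const Mat& weights, const Vec& input, int num_slaves) {
    Vec output(weights.size(), 0.0f);
    const std::size_t rows_per_slave = weights.size() / num_slaves;
    for (int s = 0; s < num_slaves; ++s) {                 // one "slave" per block
        for (std::size_t r = s * rows_per_slave; r < (s + 1) * rows_per_slave; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < input.size(); ++c)
                acc += weights[r][c] * input[c];           // broadcast input reused
            output[r] = acc;                               // intermediate result
        }
    }
    return output;   // subsequent processing by the master would follow
}

int main() {
    Mat w{{1, 0}, {0, 1}, {1, 1}, {2, 2}};
    Vec out = forward(w, {3, 4}, 2);
    return out[3] == 14.0f ? 0 : 1;
}
```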
8. The apparatus of claim 5, wherein the plurality of slave processing circuits are distributed in an array of m rows and n columns; each slave processing circuit is connected with the adjacent slave processing circuits; the master processing circuit is connected with K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column;
the K slave processing circuits are used for forwarding data and instructions between the master processing circuit and the remaining slave processing circuits;
the master processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, dividing the distribution data into a plurality of data blocks, and sending at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for forwarding the data between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instructions to obtain intermediate results, and for transmitting the intermediate results to the K slave processing circuits; and
the master processing circuit is used for performing subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a result of the calculation instruction, and for sending the result of the calculation instruction to the controller unit.
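In the claim-8 topology, only the border circuits of the m-by-n array (row 1, row m, and column 1) talk to the master directly. The sketch below enumerates that border set (0-indexed); note that as a set it contains n + n + m - 2 circuits, because the two corner circuits of column 1 also belong to rows 1 and m:

```cpp
#include <cstdio>

bool talks_to_master(int row, int col, int m) {
    return row == 0 || row == m - 1 || col == 0;   // row 1, row m, column 1
}

int main() {
    const int m = 4, n = 5;
    int k = 0;
    for (int r = 0; r < m; ++r)
        for (int c = 0; c < n; ++c)
            if (talks_to_master(r, c, m)) ++k;
    std::printf("K = %d (expected %d)\n", k, n + n + m - 2);  // K = 12 here
    return 0;
}
```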
9. A machine learning operation device, comprising the apparatus for releasing a dynamic link library according to any one of claims 1 to 8, wherein the apparatus for releasing the dynamic link library comprises one or more MLU computing units; the machine learning operation device is configured to obtain input data to be operated on and control information from other processing devices, to execute a specified machine learning operation, and to transmit the execution result to the other processing devices through an I/O interface;
when the machine learning operation device comprises a plurality of MLU computing units, the plurality of MLU computing units can be connected through a specific structure and transmit data among themselves;
wherein the plurality of MLU computing units are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of MLU computing units share the same control system or have their own control systems; the plurality of MLU computing units share a memory or have their own memories; and the interconnection mode of the plurality of MLU computing units is an arbitrary interconnection topology.
10. A combined processing apparatus, characterized in that the combined processing apparatus comprises the machine learning operation device according to claim 9, a universal interconnect interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete computing operations designated by a user.
11. The combined processing apparatus according to claim 10, further comprising: a storage device connected to the machine learning operation device and the other processing devices, respectively, and used for storing data of the machine learning operation device and the other processing devices.
12. A neural network chip, characterized in that the neural network chip comprises the machine learning operation device of claim 9 or the combined processing apparatus of claim 10.
13. An electronic device, characterized in that the electronic device comprises the neural network chip according to claim 12.
14. A board card, characterized in that the board card comprises: a storage device, an interface apparatus, a control device, and the neural network chip of claim 12;
wherein the neural network chip is connected with the storage device, the control device, and the interface apparatus, respectively;
the storage device is used for storing data;
the interface apparatus is used for realizing data transmission between the neural network chip and an external device; and
the control device is used for monitoring the state of the neural network chip.
15. A method for releasing a dynamic link library, the method being applied to an apparatus for releasing a dynamic link library, wherein the apparatus comprises a processor unit, and the method comprises the following steps:
the processor unit receives a first loading request for a first dynamic link library file, wherein the first dynamic link library file is used for realizing a first function of a first application program; the processor unit configures a first process, the first process comprising a first thread and a second thread;
the processor unit calls the first thread according to the first loading request to load the Caffe dynamic link library into the memory and create a first object;
the processor unit calls the first thread to execute the first dynamic link library file, and performs a destruction operation on the first object after the first function of the first application program is executed; and
after calling the first thread to perform the destruction operation on the first object, the processor unit calls the second thread to release the Caffe dynamic link library from the memory.
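The ordering in claim 15 matters because an object's destructor code resides inside the loaded library: releasing the library first would leave that code unmapped and crash the process. A minimal condition-variable sketch of the first-thread/second-thread handoff, with all names assumed for illustration:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool destructed = false;   // set by the first thread after the destruction

void first_thread_body() {
    /* ... execute the dynamic link library file, then destruct the object ... */
    {
        std::lock_guard<std::mutex> lk(mu);
        destructed = true;
    }
    cv.notify_one();       // hand off to the second thread
}

void second_thread_body() {
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [] { return destructed; });
    /* ... only now release the Caffe dynamic link library (e.g. dlclose) ... */
}

int main() {
    std::thread t1(first_thread_body), t2(second_thread_body);
    t1.join();
    t2.join();
    return 0;
}
```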
16. The method of claim 15, further comprising: the processor unit configures a second process, the second process comprising a third thread and a fourth thread;
the processor unit receives a first loading request of the first dynamic link library file and simultaneously receives a second loading request of a second dynamic link library file; wherein the second dynamic link library file is used for realizing a first function of the first application program;
the processor unit calls the third thread to load the Caffe dynamic link library into the memory according to the second loading request, and creates a second object;
the processor unit calls the third thread to execute the second dynamic link library file, and performs a destruction operation on the second object after the first function of the first application program is executed; and
after calling the third thread to perform the destruction operation on the second object, the processor unit calls the fourth thread to release the Caffe dynamic link library from the memory.
17. The method of claim 15, further comprising: the processor unit configures a third process, the third process comprising a fifth thread and a sixth thread;
the processor unit receives a first loading request of the first dynamic link library file and simultaneously receives a third loading request of a third dynamic link library file; wherein the third dynamic link library file is used for realizing a second function of a second application program;
the processor unit calls the fifth thread according to the third loading request to load the Caffe dynamic link library into the memory and create a third object;
the processor unit calls the fifth thread to execute the third dynamic link library file, and performs a destruction operation on the third object after the second function of the second application program is executed; and
after calling the fifth thread to perform the destruction operation on the third object, the processor unit calls the sixth thread to release the Caffe dynamic link library from the memory.
18. The method of claim 15, wherein the apparatus further comprises an MLU computing unit, and each dynamic link library file in the Caffe dynamic link library comprises a function under the Caffe framework;
the processor unit inputs a function under the Caffe framework into the MLU computing unit in the process of loading the Caffe dynamic link library; the MLU computing unit is used for computing according to the function under the Caffe framework and the operation instruction to obtain a computing result and sending the computing result to the processor unit;
the processor unit receives the calculation result.
19. The method of claim 18, wherein the MLU computing unit comprises a controller unit and an arithmetic unit; the arithmetic unit comprises a master processing circuit and a plurality of slave processing circuits;
the controller unit is used for acquiring input data and a calculation instruction, wherein the input data comprises function data under the Caffe framework;
the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the plurality of operation instructions and the input data to the master processing circuit;
the master processing circuit is used for performing pre-processing on the input data and for transmitting data and operation instructions to and from the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing intermediate operations in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and for transmitting the plurality of intermediate results to the master processing circuit; and
the master processing circuit is used for performing subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
20. The method of claim 19, wherein the arithmetic unit comprises a tree module, the tree module comprising a root port and a plurality of branch ports, wherein the root port of the tree module is connected with the master processing circuit, and each branch port of the tree module is connected with one of the plurality of slave processing circuits; and
the tree module is used for forwarding data blocks, weights, and operation instructions between the master processing circuit and the plurality of slave processing circuits.
21. The method of claim 19, wherein the arithmetic unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
the master processing circuit is specifically configured to determine that input neurons are broadcast data and that weights are distribution data, to divide the distribution data into a plurality of data blocks, and to send at least one of the plurality of data blocks, the broadcast data, and at least one of the plurality of operation instructions to the branch processing circuits;
the branch processing circuits are used for forwarding the data blocks, the broadcast data, and the operation instructions between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks and the broadcast data according to the operation instructions to obtain intermediate results, and for transmitting the intermediate results to the branch processing circuits; and
the master processing circuit is used for performing subsequent processing on the intermediate results sent by the branch processing circuits to obtain a result of the calculation instruction, and for sending the result of the calculation instruction to the controller unit.
22. The method of claim 19, wherein the plurality of slave processing circuits are distributed in an array of m rows and n columns; each slave processing circuit is connected with the adjacent slave processing circuits; the master processing circuit is connected with K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the 1st row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the 1st column;
the K slave processing circuits are used for forwarding data and instructions between the master processing circuit and the remaining slave processing circuits;
the master processing circuit is used for determining that the input neurons are broadcast data and the weights are distribution data, dividing the distribution data into a plurality of data blocks, and sending at least one of the plurality of data blocks and at least one of the plurality of operation instructions to the K slave processing circuits;
the K slave processing circuits are used for forwarding the data between the master processing circuit and the plurality of slave processing circuits;
the plurality of slave processing circuits are used for performing operations on the received data blocks according to the operation instructions to obtain intermediate results, and for transmitting the intermediate results to the K slave processing circuits; and
the master processing circuit is used for performing subsequent processing on the intermediate results sent by the K slave processing circuits to obtain a result of the calculation instruction, and for sending the result of the calculation instruction to the controller unit.
CN201811629632.9A (Family ID: 66403216; Country: CN)

Priority Application (1)
CN201811629632.9A, filed 2018-12-28 (priority date 2018-12-28): Device for releasing dynamic link library and related product — Active

Publications (2)
CN109753319A (application publication), published 2019-05-14
CN109753319B (granted publication), published 2020-01-17

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information — applicant changed from Beijing Zhongke Cambrian Technology Co., Ltd. to Zhongke Cambrian Technology Co., Ltd.; address: Room 644, No. 6, South Road, Beijing Academy of Sciences, 100000 (unchanged)
GR01: Patent grant