JP6532334B2

JP6532334B2 - Parallel computing device, image processing device and parallel computing method

Info

Publication number: JP6532334B2
Application number: JP2015144411A
Authority: JP
Inventors: 山本　貴久; 加藤　政美; 伊藤　嘉則; 野村　修; 森　克彦
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2015-07-21
Filing date: 2015-07-21
Publication date: 2019-06-19
Anticipated expiration: 2035-07-21
Also published as: JP2017027314A

Description

The present invention relates to a technique for performing a plurality of arithmetic operations in parallel.

Convolution filtering is one of the most widely used operations in image processing; typical examples are Gaussian filtering for smoothing and Sobel filtering for edge extraction. Another operation often used in image processing is the matrix product. Known processes that involve matrix products include dimensionality reduction using principal component analysis (PCA), affine transformation, perceptrons, and support vector machines. Such filter operations and matrix product operations require a large number of product-sum (multiply-accumulate) operations and therefore take considerable processing time. Shortening the processing time with general-purpose processors requires providing a plurality of processors, which increases the circuit scale.

For this reason, techniques that use dedicated hardware to perform product-sum operations in parallel have been studied from the viewpoints of circuit scale and processing time. Patent Document 1 discloses a technique for performing various types of operations by changing the connections between arithmetic units.

Patent Document 1: Japanese Patent Laid-Open No. 2009-38758 (JP 2009-38758 A)

However, in the technique disclosed in Patent Document 1, there is a path between data input and output along which arithmetic units are arranged in series, and the output of one arithmetic unit is input to another. That is, data must pass through a plurality of arithmetic units before the final output is obtained, which makes it difficult to process a plurality of product-sum operations at high speed. An object of the present invention is therefore to enable a plurality of product-sum operations to be processed at high speed.

To solve the above problem, the present invention comprises: a plurality of arithmetic processing means that perform operations in parallel based on first data and second data; first supply means for supplying the first data to the plurality of arithmetic processing means; second supply means for supplying the second data to the plurality of arithmetic processing means; and supply control means that controls the first supply means so as to supply first data with mutually different content to each of the plurality of arithmetic processing means at the same timing, and controls the second supply means so as to supply second data with identical content to each of the plurality of arithmetic processing means at the same timing. The supply control means executes a first supply mode, in which the first data is supplied so that first data with identical content is shared among the plurality of arithmetic processing means at different timings, and a second supply mode, in which first data with different content is supplied to the plurality of arithmetic processing means at each of a plurality of timings.

With the above configuration, the present invention makes it possible to process a plurality of product-sum operations at high speed.

FIG. 1 is a block diagram of the parallel computing device according to the first embodiment.
FIG. 2 is a diagram showing a configuration example of the first data supply unit 105 according to the first embodiment.
FIG. 3 is a flowchart of a filter operation in the first embodiment.
FIG. 4 is a diagram explaining the filter kernel in the first embodiment.
FIG. 5 is a schematic view of the first and second data storage units in the filter operation of the first embodiment.
FIG. 6 is a flowchart of a matrix product operation in the first embodiment.
FIG. 7 is a schematic view of the first and second data storage units in the matrix product operation of the first embodiment.
FIG. 8 is a diagram explaining an operation example of Deep Learning in the second embodiment.
FIG. 9 is a diagram showing an example of a convolution filter in filter processing.
FIG. 10 is a block diagram of the parallel computing device according to the second embodiment.
FIG. 11 is a block diagram of the image processing apparatus according to the second embodiment.

[First Embodiment]
An outline of the first embodiment of the present invention is described first. The parallel computing device of this embodiment executes a plurality of different types of product-sum operations. As described above, the types of product-sum operation handled in this embodiment are the product-sum operations of filter processing and of matrix product operations.

Product-sum operations in filter processing are described here. FIG. 9 shows an example of a convolution filter. FIG. 9A shows an example in which a filter operation is performed on image data 11 of a processing target image using a filter kernel 10 with a kernel size of 3 × 3. In this example, the filter operation result is calculated by the product-sum operation shown in Equation 1 below.

Here, d_{i,j} denotes the pixel value of the processing target image at coordinates (i, j), and f_{i,j} denotes the filter operation result at coordinates (i, j). w_{s,t} denotes the filter kernel value (filter coefficient) applied to coordinates (i + s - 1, j + t - 1), and columnSize and rowSize denote the filter kernel size.
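Equation 1 is not reproduced in this text. A form consistent with the symbol definitions above is sketched below. The assignment of columnSize to the index s and rowSize to the index t is an assumption, since the definitions only state that both give the filter kernel size.

```latex
f_{i,j} = \sum_{s=1}^{\mathrm{columnSize}} \; \sum_{t=1}^{\mathrm{rowSize}} d_{\,i+s-1,\; j+t-1} \, w_{s,t}
```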

In a matrix product operation, the m × p matrix C obtained as the product of an m × n matrix A and an n × p matrix B is given by Equation 2 below.

The element c_{i,j} of the matrix C is then calculated by Equation 3 below.
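Equations 2 and 3 are likewise not reproduced in this text. By the standard definition of the matrix product they can be written as follows. The element names a_{i,k} and b_{k,j} for the matrices A and B are assumed by analogy with c_{i,j}.

```latex
\begin{aligned}
C &= A B, \qquad A:\ m \times n,\quad B:\ n \times p,\quad C:\ m \times p \\
c_{i,j} &= \sum_{k=1}^{n} a_{i,k} \, b_{k,j}, \qquad 1 \le i \le m,\ 1 \le j \le p
\end{aligned}
```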

As described above, product-sum operations of the form of Equation 1 or Equation 3 are frequently used in common image processing. Both the filter operation and the matrix product operation are product-sum operations (operations that successively add multiplication results). In the filter operation, however, the data being filtered (the image data within the scan window) may partially overlap for the same filter kernel. In other words, a filter operation may use partly common data.

FIGS. 9B to 9D show how data partially overlaps in a filter operation. They are schematic views of the filter kernel 10 being applied to the image data 11 to be processed. Between the filter kernel 10 at the position in FIG. 9B and at the position in FIG. 9C, the image data in the hatched portion of FIG. 9C overlaps. Similarly, between the position in FIG. 9C and the position in FIG. 9D, the image data in the hatched portion of FIG. 9D overlaps. Thus, in a filter operation, the data being filtered (image data) may partially overlap for the same filter kernel data.

In a matrix product operation, by contrast, for a given row vector of one matrix (for example, a row vector of matrix A in Equation 2), no partial overlap occurs among the column vectors of the other matrix (for example, the column vectors of matrix B in Equation 2), and no data is common to multiple column vectors. That is, even when product-sum operations are performed between one row vector of one matrix and a plurality of column vectors of the other matrix, the column vectors contain no overlapping data.
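The difference between the two cases can be checked with a small sketch. The helper names `window_indices` and `column_indices` are hypothetical, indices are 0-based, and the one-pixel horizontal step between kernel positions is an assumption made for illustration.

```python
def window_indices(top, left, rows=3, cols=3):
    """Set of (row, col) image coordinates covered by a kernel placed at (top, left)."""
    return {(top + s, left + t) for s in range(rows) for t in range(cols)}


def column_indices(j, n):
    """Set of (row, col) positions that make up column j of a matrix with n rows."""
    return {(k, j) for k in range(n)}


# Two horizontally adjacent 3x3 kernel positions share two columns of pixels.
shared_pixels = window_indices(0, 0) & window_indices(0, 1)

# Two different columns of a matrix never share an element.
shared_elements = column_indices(0, 3) & column_indices(1, 3)

print(len(shared_pixels))    # 6
print(len(shared_elements))  # 0
```

The six shared pixels are what the filter supply mode can reuse across arithmetic units; the empty intersection is why no such reuse exists for the matrix product.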

As described above, the filter operation and the matrix product operation differ in whether, when one operand of the product-sum operation is held fixed, the other operands overlap one another. In this embodiment, different types of product-sum operations are performed on the same parallel computing device. To that end, the data supply mode (supply order) is switched when operand data is supplied to the arithmetic processing units arranged in parallel, so that multiple types of product-sum operations can be processed. This embodiment describes, as examples of product-sum operations, the case where both the filter operation and the matrix product operation are processed by a single parallel computing device.

The details of this embodiment are described below with reference to the drawings. FIG. 1 is a block diagram of the parallel computing device according to this embodiment. Operand data for a product-sum operation is input to the parallel computing device 101. For a filter operation, image data and filter kernel data are input as the operand data. For a matrix product operation, the two matrices to be multiplied are input as the operand data. The parallel computing device 101 performs the product-sum operation on this operand data and outputs the result.

When operand data for an upcoming product-sum operation is input to the parallel computing device 101, it is stored in the first data storage unit 102 and the second data storage unit 103. For a filter operation, the image data is stored in the first data storage unit 102 and the filter kernel data is stored in the second data storage unit 103. For a matrix product operation, one of the matrices to be multiplied (for example, matrix A in Equation 2) is stored in the first data storage unit 102, and the other (for example, matrix B in Equation 2) is stored in the second data storage unit 103. In this embodiment, the first data storage unit 102 and the second data storage unit 103 are RAMs (random access memories), but they may be implemented by other means such as a register file (a collection of registers).

The data supply control unit 104 supplies first data (data output from the first data storage unit 102) with mutually different content to each of the plurality of arithmetic processing units 107 at the same timing. At the same time, it supplies second data (data output from the second data storage unit 103) with identical content to each of the plurality of arithmetic processing units 107 at the same timing. To realize this function, the data supply control unit 104 has a first data supply unit 105, which supplies each of a plurality of first data items to the corresponding arithmetic processing unit 107, and a second data supply unit 106, which supplies the same second data to all the arithmetic processing units 107. The first data supply unit 105 temporarily holds the first data output from the first data storage unit 102 and supplies it to the arithmetic processing units 107. In this embodiment, the first data supply unit 105 is a shift register that supports data loading.

FIG. 2 shows a configuration example of the first data supply unit 105, namely a four-stage shift register. The shift register has four multi-bit registers 801a to 801d, which latch data of a predetermined number of bits in synchronization with a CLOCK signal. An enable signal (Enable signal) is supplied to the registers 801a to 801d. When the Enable signal is 1, the registers latch data at the rising edge of the CLOCK signal. When the Enable signal is 0, they keep the data latched at the previous clock, so the data held in the registers 801a to 801d does not change.

Three selectors 802a to 802c are also provided. They select the signal OUTx (x: 0 to 2) when the selection signal (Load signal) is 0, and the signal INx (x: 1 to 3) when the Load signal is 1. That is, the selectors 802a to 802c select a shift operation or a load operation according to the Load signal.
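The behavior of the Enable and Load signals can be modeled in a few lines. This is a behavioral sketch only, not the circuit of FIG. 2: the class name `LoadableShiftRegister` is hypothetical, and it assumes that the first register is fed directly from IN0 (the text describes only three selectors for four registers) and that a shift moves data from stage x to stage x + 1.

```python
class LoadableShiftRegister:
    """Behavioral model of a shift register with Load and Enable inputs."""

    def __init__(self, stages=4):
        self.regs = [0] * stages  # contents of the registers, stage 0 first

    def clock(self, ins, load, enable=1):
        """Apply one rising CLOCK edge and return the register contents."""
        if not enable:
            # Enable = 0: every register keeps its previous value.
            return list(self.regs)
        if load:
            # Load = 1: each selector picks INx, so all stages load in parallel.
            self.regs = list(ins)
        else:
            # Load = 0: each selector picks OUTx of the previous stage (shift).
            self.regs = [ins[0]] + self.regs[:-1]
        return list(self.regs)


sr = LoadableShiftRegister()
print(sr.clock([1, 2, 3, 4], load=1))            # [1, 2, 3, 4]
print(sr.clock([9, 0, 0, 0], load=0))            # [9, 1, 2, 3]
print(sr.clock([7, 7, 7, 7], load=1, enable=0))  # [9, 1, 2, 3]
```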

The bit width of the multi-bit registers 801a to 801d need only equal the bit width of the image data or matrix data stored in the first data storage unit 102 (for example, 8 bits). The number of stages of the shift register need only be (number of arithmetic processing units 107, i.e. the degree of parallelism) + (horizontal size of the filter kernel) - 1. For example, with four arithmetic processing units 107 and a filter kernel that is three pixels wide, 4 + 3 - 1 = 6 stages suffice. If filter operations with filter kernels of various sizes are anticipated, however, it is desirable to provide the number of stages required by the largest anticipated horizontal kernel size.
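The stage-count rule is simple arithmetic, shown here as a sketch. The function name and the list of supported kernel widths (3, 5, 7) are hypothetical and used only for illustration.

```python
def shift_register_stages(parallelism, kernel_width):
    """Stages needed so every arithmetic unit sees all kernel_width shifted inputs."""
    return parallelism + kernel_width - 1


# The example from the text: 4 arithmetic units, 3-pixel-wide kernel.
print(shift_register_stages(4, 3))  # 6

# Sizing for several anticipated kernel widths: use the largest one.
supported_widths = [3, 5, 7]  # hypothetical set of kernel widths
print(max(shift_register_stages(4, w) for w in supported_widths))  # 10
```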

An operation type signal from the operation type switching unit 110 is also input to the first data supply unit 105. The first data supply unit 105 can supply data in a plurality of modes and switches the data supply mode according to the input operation type signal. The data supply mode here refers to the order and timing in which data loaded from the first data storage unit 102 is supplied to the arithmetic processing units 107.

For example, when an instruction to perform a filter operation is input to the first data supply unit 105 as the operation type signal, the first data supply unit 105 operates so that the same data is supplied to each of the plurality of arithmetic processing units 107 at different timings. That is, it operates in a data supply mode in which data supplied to one arithmetic processing unit 107 at one timing is supplied to another arithmetic processing unit 107 at the next timing. In this case, the same supplied data is shared among the plurality of arithmetic processing units 107 at different timings.

Conversely, when an instruction to perform a matrix product operation is input to the first data supply unit 105 as the operation type signal, the first data supply unit 105 operates so that the same data is not supplied to more than one of the arithmetic processing units 107. That is, it operates in a data supply mode in which data supplied to one arithmetic processing unit 107 at a given timing is never supplied to another arithmetic processing unit 107, either at the same timing or at a different one. In this case, supplied data is not shared among the plurality of arithmetic processing units 107.
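The two supply modes can be contrasted as schedules that list which data item each arithmetic unit receives at each timing. This is a sketch under assumptions: the function names are hypothetical, and the specific ordering used for the matrix product mode (consecutive blocks of data) is chosen only to satisfy the no-sharing property described above, not taken from the text.

```python
def filter_mode_schedule(data, parallelism, steps):
    """Unit k receives data[k + t] at timing t, as with a shifting register."""
    return [[data[k + t] for k in range(parallelism)] for t in range(steps)]


def matrix_mode_schedule(data, parallelism, steps):
    """Unit k receives data[t * parallelism + k] at timing t, so nothing is reused."""
    return [[data[t * parallelism + k] for k in range(parallelism)]
            for t in range(steps)]


labels = [f"d{n}" for n in range(12)]

for row in filter_mode_schedule(labels, 4, 3):
    print(row)
# ['d0', 'd1', 'd2', 'd3']
# ['d1', 'd2', 'd3', 'd4']
# ['d2', 'd3', 'd4', 'd5']

for row in matrix_mode_schedule(labels, 4, 3):
    print(row)
# ['d0', 'd1', 'd2', 'd3']
# ['d4', 'd5', 'd6', 'd7']
# ['d8', 'd9', 'd10', 'd11']
```

In the filter schedule, `d1` goes to unit 1 at timing 0 and to unit 0 at timing 1, which is the sharing at different timings. In the matrix schedule every item appears exactly once.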

The second data supply unit 106 temporarily holds the second data output from the second data storage unit 103 and supplies it to the arithmetic processing units 107. In this embodiment, the second data supply unit 106 is a register. The bit width of this register need only equal the bit width of the filter kernel or matrix data stored in the second data storage unit 103 (for example, 8 bits).

Each arithmetic processing unit 107 performs a product-sum operation using the first data supplied from the first data supply unit 105 and the second data supplied from the second data supply unit 106, and outputs the result. The parallel computing device 101 has a plurality of arithmetic processing units 107 so that product-sum operations can be performed in parallel. The first data supply unit 105 supplies different first data to each of the arithmetic processing units 107 at the same timing, and at the same time the second data supply unit 106 supplies the same second data to all of them at the same timing. To perform the product-sum operation on the supplied first and second data, each arithmetic processing unit 107 includes a multiplier 108 and a cumulative adder 109.
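A minimal software model of one arithmetic processing unit and of a parallel step is sketched below. The class and function names are hypothetical, and overflow or bit-width effects of real hardware are not modeled.

```python
class ArithmeticProcessingUnit:
    """Models a multiplier followed by a cumulative adder (multiply-accumulate)."""

    def __init__(self):
        self.acc = 0  # running sum held by the cumulative adder

    def step(self, first, second):
        """Multiply the two inputs and add the product to the running sum."""
        self.acc += first * second
        return self.acc


def parallel_step(units, first_data, second):
    """Give each unit its own first data and broadcast the same second data."""
    return [u.step(d, second) for u, d in zip(units, first_data)]


units = [ArithmeticProcessingUnit() for _ in range(4)]
print(parallel_step(units, [1, 2, 3, 4], 10))  # [10, 20, 30, 40]
print(parallel_step(units, [2, 3, 4, 5], 1))   # [12, 23, 34, 45]
```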

The operation type switching unit 110 outputs the operation type signal based on an operation type setting made from outside. For example, when the parallel computing device is incorporated in an image processing apparatus, the operation type is set according to the image processing that the apparatus is about to perform. When the upcoming operation is a filter operation, information to that effect is set in the operation type switching unit 110, which then instructs the first data supply unit 105 and the read control unit 111 to operate in the operation mode for filter operations. Likewise, when the upcoming operation is a matrix product operation, information to that effect is set in the operation type switching unit 110, which then instructs the first data supply unit 105 and the read control unit 111 to operate in the operation mode for matrix product operations.

The read control unit 111 controls the read mode used when the first data supply unit 105 reads data from the first data storage unit 102. The read mode here refers to which locations in the first data storage unit 102 the first data supply unit 105 reads, and in what order.

An operation type signal from the operation type switching unit 110 is also input to the read control unit 111. When an operation type signal indicating a filter operation is input, the read control unit 111 sets a mode in which part of the data read from the first data storage unit 102 for a past product-sum operation of the filter operation is read again. That is, it sets a read mode in which data that includes part of the data read from the first data storage unit 102 for a product-sum operation executed earlier in a series of filter operations is read for another product-sum operation.

Conversely, when an operation type signal indicating a matrix product operation is input, the read control unit 111 sets a mode in which data read for a past product-sum operation of the matrix product operation is not read again in overlapping form. That is, the read control unit 111 sets an operation mode in which each read fetches either new data or the same data as a past operation.

FIG. 3 is a flowchart showing the processing procedure when the parallel computing device of this embodiment performs a filter operation. First, in step S301, the operation type of the operation type switching unit 110 is set; here it is set to the filter operation. Next, in step S302, the filter kernel data is stored in the second data storage unit 103. FIG. 4 illustrates the filter kernel in this embodiment and shows the filter kernel 10 and the image data 11 to be processed. The data of each element of the filter kernel 10 is denoted w_{i,j}, and the data of each pixel of the image data 11 is denoted d_{i,j} (i and j are indices indicating the coordinate position). Before the operation starts, the filter kernel 10 and the image data 11 are held, for example, outside the parallel computing device of this embodiment (such as in storage means of an image processing apparatus that incorporates the parallel computing device).

The storing of the filter kernel data in the second data storage unit 103 in step S302 is described here. FIG. 5 shows how data is stored in the first data storage unit 102 and the second data storage unit 103 in the filter operation; FIG. 5A shows data storage in the second data storage unit 103. The second data storage unit 103 of this embodiment is a RAM, and in step S302 the filter kernel data is stored in the area of each address as shown in FIG. 5A.

Next, in step S303, the image data to be processed is stored in the first data storage unit 102. FIG. 5B shows data storage in the first data storage unit 102. The first data storage unit 102 of this embodiment is a RAM, and in step S303 the image data 11 is stored in the RAM in raster order as shown in FIG. 5B. That is, d_{1,1} to d_{1,4} are stored in the area at address 1, and d_{1,5} to d_{1,8} in the area at address 2. Likewise, d_{2,1} to d_{2,4} are stored in the area at address p, d_{3,1} to d_{3,4} at address q, and d_{4,1} to d_{4,4} at address r.
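The raster-order layout can be expressed as an address calculation. This is a sketch under assumptions: the function name `word_address` is hypothetical, the image width is assumed to be a multiple of the word width, and the concrete width of 16 pixels is chosen only so that the symbolic addresses p, q, and r take example values.

```python
def word_address(row, col, image_width, word_width=4):
    """Map pixel d_{row,col} (1-based) to (RAM address, position within the word)."""
    words_per_row = image_width // word_width
    address = (row - 1) * words_per_row + (col - 1) // word_width + 1
    lane = (col - 1) % word_width
    return address, lane


WIDTH = 16  # hypothetical image width in pixels

print(word_address(1, 1, WIDTH))  # (1, 0)  start of address 1
print(word_address(1, 5, WIDTH))  # (2, 0)  start of address 2
print(word_address(2, 1, WIDTH))  # (5, 0)  address p for this width
print(word_address(3, 1, WIDTH))  # (9, 0)  address q for this width
print(word_address(4, 1, WIDTH))  # (13, 0) address r for this width
```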

The shape (width) of the RAM constituting the first data storage unit 102 is described here. In general, for the same capacity, a RAM that is wide and shallow has a larger circuit scale than one that is narrow and deep. From the viewpoint of processing speed, however, it is desirable that the first data storage unit 102 of this embodiment be wide enough to read out at once a number of data items equal to the number of arithmetic processing units 107 (the degree of parallelism). For example, if the degree of parallelism is 4, a RAM wide enough to store four data items at the same address, as shown in FIG. 5B, is suitable.

Next, in step S304, data is output from the first data storage unit 102 to the first data supply unit 105. The first data supply unit 105 of this embodiment is a shift register and temporarily holds the data output from the first data storage unit 102. Specifically, the data at address 1 and address 2 is first output from the first data storage unit 102 and held in the first data supply unit 105. This reading of data from the first data storage unit 102 is controlled by the read control unit 111. FIG. 5C is a schematic view of the shift register serving as the first data supply unit 105. Because the number of stages of the shift register (six here) exceeds the width of the first data storage unit 102 (four data items in this embodiment), the shift register is loaded with data read from two addresses (two words).

Next, in step S305, data is output from the second data storage unit 103 to the second data supply unit 106. The second data supply unit 106 of this embodiment is a register and temporarily holds the data output from the second data storage unit 103. FIG. 5D is a schematic view of the register serving as the second data supply unit 106. As shown there, the data at address 1 is output from the second data storage unit 103 and held in the second data supply unit 106. In the following step S306, the arithmetic processing units 107 perform product-sum operations in parallel.

The product-sum operation then continues through repeated operations in steps S307, S308, and S309. That is, the shift register of the first data supply unit 105 is shifted while the next kernel data is output from the second data storage unit 103, so that the arithmetic processing units 107 perform product-sum operations in parallel (the product-sum operation along the horizontal direction of the filter kernel).

Similarly, the product-sum operation is continued by repeating the operation in steps S310 and S311. The horizontal product-sum operation of the filter kernel is repeated a number of times equal to the vertical size of the kernel, and the final result is obtained.

The filter operation of the present embodiment is executed by the above processing flow. To perform the filter operation for the next horizontal row, the processing is executed again from step S304. The same applies to filter operations at other positions. In particular, when performing the filter operation for the next horizontal row, in step S304 the read control unit 111 controls the first data storage unit 102 so that part of the data read for a past product-sum operation of the filter operation is read again. For example, suppose that the previous product-sum operation read the data stored at address 1, address 2, address p, address p+1, address q, and address q+1. The current product-sum operation then reads the data stored at address p, address p+1, address q, address q+1, address r, and address r+1. In this case, part (four words) of the previously read data (six words) is read again. The same partial overlap occurs when the filter operation for the following horizontal row is performed. Thus, when the filter operation is designated as the operation type, the read control unit 111 operates in a read mode in which data that partly overlaps the data read from the first data storage unit 102 for one product-sum operation in a series of filter operations is read for another product-sum operation.
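
The overlapping read pattern of the filter operation can be illustrated with a minimal sketch. The memory layout is an assumption made only for illustration and is not taken from the figures: image row y is assumed to start at address 1 + y * words_per_image_row, and the function name and parameters are hypothetical. With a stride of 8 words, p, q, and r correspond to addresses 9, 17, and 25.

```python
def filter_read_addresses(out_row, kernel_height=3,
                          words_per_image_row=8, words_per_load=2):
    """Addresses read from the first data storage unit for one output row.

    Assumed (hypothetical) layout: image row y starts at address
    1 + y * words_per_image_row, and each load takes two consecutive words.
    """
    addresses = []
    for ky in range(kernel_height):
        base = 1 + (out_row + ky) * words_per_image_row
        addresses.extend(base + i for i in range(words_per_load))
    return addresses


previous = filter_read_addresses(0)  # addresses 1, 2, p, p+1, q, q+1
current = filter_read_addresses(1)   # addresses p, p+1, q, q+1, r, r+1
shared = [a for a in current if a in previous]
print(previous, current, len(shared))
```

Under these assumptions, four of the six words read for the current row were also read for the previous row, which matches the four-word overlap described above.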

The first data supply unit 105 supplies the held data to the arithmetic processing units 107 while shifting it. That is, when the filter operation is designated as the operation type, the first data supply unit 105 operates so as to supply the same data to different arithmetic processing units 107 at different timings.
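
The following is a behavioral sketch, not a description of the actual circuit, of how a shifting data supply lets four arithmetic processing units compute one horizontal kernel row in parallel. The function name and the example values are hypothetical; the six-stage register and four units follow the numbers given above, and a kernel width of three is assumed.

```python
def horizontal_mac(pixels, kernel_row, num_pes=4):
    """Simulate one horizontal pass of the filter operation.

    pixels:     data loaded into the shift register (6 stages in the text)
    kernel_row: kernel coefficients, one broadcast to all units per cycle
    """
    if len(pixels) < num_pes + len(kernel_row) - 1:
        raise ValueError("shift register holds too few data items")
    shift_register = list(pixels)
    accumulators = [0] * num_pes
    for coeff in kernel_row:
        # Every unit multiplies the stage it is wired to by the same coefficient.
        for pe in range(num_pes):
            accumulators[pe] += shift_register[pe] * coeff
        # Shift by one stage: unit i now sees the datum unit i+1 saw before.
        shift_register = shift_register[1:] + [0]
    return accumulators


result = horizontal_mac([1, 2, 3, 4, 5, 6], [1, 2, 3])
print(result)  # [14, 20, 26, 32]
```

In this sketch the datum 3 is used by unit 2 in the first cycle, by unit 1 in the second, and by unit 0 in the third, which is the "same data, different units, different timings" behavior described above. Repeating the pass for each kernel row without clearing the accumulators would give the full two-dimensional result.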

Next, the processing flow for performing a matrix product operation with the parallel computing device 101 of the present embodiment will be described. Here, the case of performing the product-sum operation of Equation 2 described above is explained. FIG. 6 is a flowchart showing the processing procedure when the parallel computing device of the present embodiment performs a matrix product operation.

First, in step S601, the operation type is set in the operation type switching unit 110. Here, the matrix product operation is set as the operation type. Next, in step S602, the matrix B is stored in the second data storage unit 103. FIG. 7 shows how data is stored in the first data storage unit 102 and the second data storage unit 103 for the matrix product operation, and FIG. 7A shows the data stored in the second data storage unit 103. The second data storage unit 103 of the present embodiment is a RAM, and as shown in FIG. 7A, the element data of the matrix B is stored in the area of each address.

Subsequently, in step S603, the matrix A is stored in the first data storage unit 102. FIG. 7B shows the data stored in the first data storage unit 102. The first data storage unit 102 of the present embodiment is a RAM, and as shown in FIG. 7B, the data of the matrix A is stored in the RAM.

Subsequently, in step S604, data is output from the first data storage unit 102 to the first data supply unit 105. First, the data at address 1 is output from the first data storage unit 102 and held in the first data supply unit 105. Reading of data from the first data storage unit 102 is controlled by the read control unit 111. FIG. 7C is a schematic view of the shift register corresponding to the first data supply unit 105. The number of stages of the shift register (six here) is larger than the width of the first data storage unit 102 (four data items), but in the matrix product operation data is not shared among different arithmetic processing units 107. In other words, no shift operation is performed in the shift register, so it suffices for the shift register to hold as many data items as there are arithmetic processing units 107 (four in the present embodiment).

Subsequently, in step S605, data is output from the second data storage unit 103 to the second data supply unit 106. The second data supply unit 106 of the present embodiment is a register and temporarily holds the data output from the second data storage unit 103. FIG. 7D is a schematic view of the register corresponding to the second data supply unit 106. First, the data at address 1 is output from the second data storage unit 103, and the output data is held in the second data supply unit 106.

Subsequently, in step S606, the arithmetic processing units 107 perform product-sum operations in parallel. The product-sum operation is then continued by repeating the operation in steps S607, S608, and S609. That is, each time the next data is output from the first data storage unit 102, the next data of the matrix B is output from the second data storage unit 103, and the arithmetic processing units 107 perform product-sum operations in parallel. The final result is obtained by performing the product-sum operation a predetermined number of times (n times in the present embodiment, where n is the number of columns of the matrix A).
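
The matrix product loop can be sketched in the same behavioral style. The data layout is an assumption, since FIG. 7 is not reproduced here: the word at the k-th address of the first data storage unit is assumed to hold the k-th column element of four rows of the matrix A (one element per arithmetic processing unit), and one element of a column of the matrix B is broadcast per step. All names are hypothetical.

```python
def matrix_mac(a_words, b_column, num_pes=4):
    """Simulate n product-sum steps of the matrix product operation.

    a_words[k]:  word read at the k-th address, one element per unit
                 (assumed layout, not taken from the figures)
    b_column[k]: element of matrix B broadcast to all units at step k
    """
    accumulators = [0] * num_pes
    for a_word, b in zip(a_words, b_column):
        for pe in range(num_pes):
            # No shifting: unit pe only ever sees element pe of each word.
            accumulators[pe] += a_word[pe] * b
    return accumulators


a_rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 rows, n = 3
a_words = list(zip(*a_rows))   # one word per address: a column slice of A
b_column = [1, 0, 2]
product = matrix_mac(a_words, b_column)
print(product)  # [7, 16, 25, 34]
```

Each unit ends up with the inner product of one row of A and the chosen column of B after n = 3 steps, and no datum is used by more than one unit.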

The matrix product operation according to the present embodiment is executed by the above processing flow. To perform the matrix product operation for another horizontal row of the matrix A, the processing is executed again from step S604. The same applies to matrix product operations at other positions. When performing the matrix product operation for another horizontal row of the matrix A, in step S604 the read control unit 111 operates so as to read, from the first data storage unit 102, the data that follows the data read for the past product-sum operation of the matrix product operation. That is, whereas the previous product-sum operation read the data stored at address 1, address 2, ..., address n, the current product-sum operation reads the data stored at address n+1, address n+2, ..., address 2n. Thus, when the data of a horizontal row of the matrix A is read, the previously read data and the currently read data do not overlap.

Likewise, to perform the matrix product operation for another vertical column of the matrix B, the processing is executed again from step S604. In this case, in step S604 the read control unit 111 operates so as to read from the first data storage unit 102 the same data as was read for the past product-sum operation of the matrix product operation. However, the data read from the second data storage unit 103 in step S605 differs from the data used in the past product-sum operation. For example, suppose that in the past product-sum operation the data stored at address 1, address 2, ..., address n was read from both the first data storage unit 102 and the second data storage unit 103. This time, the same data is read from the first data storage unit 102, whereas the data stored at address n+1, address n+2, ..., address 2n is read from the second data storage unit 103. Thus, when the matrix product operation is designated as the operation type, the read control unit 111 operates in a read mode in which either data different from, or the same data as, the data read from the first data storage unit 102 for one product-sum operation in a series of matrix product operations is read for another product-sum operation.

As described in the processing flow of the matrix product operation above, the first data supply unit 105 supplies data to the arithmetic processing units 107 without shifting the held data. Therefore, when the matrix product operation is designated as the operation type, data is not shared among the arithmetic processing units 107. Furthermore, when the data read from the second data storage unit 103 is the same as in a past product-sum operation, the read control unit 111 operates in a read mode in which data different from the data read for that past product-sum operation in the series of matrix product operations is read for another product-sum operation.

As described above, in the filter operation, the parallel computing device of the present embodiment partially shares the image data to be processed among the arithmetic processing units 107 for the same filter kernel. In the matrix product operation, by contrast, the data of the row vectors of the matrix A is not shared for the same column vector of the matrix B. In accordance with this difference, a plurality of modes are provided for supplying data from the first data supply unit 105 to the arithmetic processing units 107, and a plurality of modes are also provided for reading data from the first data storage unit 102 into the first data supply unit 105. Switching between these modes makes it possible to execute different types of product-sum operations. According to the present embodiment, therefore, a plurality of types of product-sum operations can be processed at high speed without passing through a plurality of computing units before the final output is obtained.

[Second Embodiment]
Next, a second embodiment of the present invention will be described. In the present embodiment, a non-linear transformation processing unit is added to the functions of the parallel computing device of the first embodiment described above. Components already described in the first embodiment are denoted by the same reference numerals, and their description is omitted.

In image recognition processing, non-linear transformation is often applied to filter operation results and product-sum operation results. For example, in CNNs (Convolutional Neural Networks), it is common to apply a sigmoid transformation (or a hyperbolic tangent transformation) to the result of the filter operation (convolution operation). It is also common to apply a softmax function to the output of a perceptron, which can be expressed as a matrix product, in order to perform multi-class classification of input data. Providing the parallel computing device with a non-linear transformation processing unit therefore allows it to handle a wider range of processing and increases its usefulness as a parallel computing device.

In the present embodiment, the case where the parallel computing device performs Deep Learning processing is described as an example. Deep Learning is a technical field in which research and development is currently very active, and it generally refers to hierarchical processing of input data (for example, image data), in which the processing result of one layer is used as the input to the processing of the next higher layer. Here, a typical Deep Learning configuration is taken up in which CNNs are used to extract feature quantities from an image and a matrix product, as typified by a perceptron, is used for identification processing based on the extracted feature quantities, and an example in which the parallel computing device performs these operations is described. The feature extraction is often a multi-layer process in which the CNN operation is repeated many times, and a fully connected multi-layer perceptron is sometimes used for the identification processing.

An example of a Deep Learning operation will now be described with reference to FIG. 8. FIG. 8 shows processing in which feature extraction is performed on an input layer (input image) 801 by CNNs to obtain a feature quantity 807, and identification processing is performed on that feature quantity to obtain an identification result 814. The CNN operation is repeated several times (three times here) before the feature quantity 807 is obtained from the input image 801. Fully connected perceptron processing is then applied to the feature quantity 807 to obtain the final identification result 814.

First, the CNN processing in the first half is described. In FIG. 8, the input layer 801 represents raster-scanned image data of a predetermined size on which the CNN operation is performed. The feature planes 803a to 803c are the feature planes of the first layer 808. A feature plane is a data plane representing the detection result of a predetermined feature extraction filter (convolution filter operation and non-linear processing). Because it is the detection result for raster-scanned image data, the detection result is also represented as a plane. The feature planes 803a to 803c are generated by convolution filter operations and non-linear processing on the input layer 801. For example, the feature plane 803a is obtained by a convolution filter operation using the filter kernel 8021a followed by non-linear transformation of the operation result. The filter kernels 8021b and 8021c in FIG. 8 are the filter kernels used to generate the feature planes 803b and 803c, respectively.

Next, the operation for generating the feature plane 805a of the second layer 809 is described. The feature plane 805a is connected to the three feature planes 803a to 803c of the preceding layer 808. Therefore, to calculate the data of the feature plane 805a, a convolution filter operation using the filter kernel 8041a is performed on the feature plane 803a, and the result is held. Similarly, convolution filter operations using the filter kernels 8042a and 8043a are performed on the feature planes 803b and 803c, respectively, and the results are held. After these three filter operations are completed, the results are added together and non-linear transformation is applied. The feature plane 805a is generated by performing this processing over the entire image.
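
The generation of one second-layer feature plane can be sketched as follows. This is a plain software illustration of the arithmetic, not of the hardware: the function names are hypothetical, the convolution is written without kernel flipping and without padding (both assumptions), and the sigmoid is used as the non-linear transformation.

```python
import math


def conv2d_valid(plane, kernel):
    """Convolution filter operation over positions where the kernel fits."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(plane) - kh + 1
    out_w = len(plane[0]) - kw + 1
    return [[sum(plane[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def feature_plane(prev_planes, kernels):
    """Filter each connected plane, add the results, then apply sigmoid."""
    partials = [conv2d_valid(p, k) for p, k in zip(prev_planes, kernels)]
    h, w = len(partials[0]), len(partials[0][0])
    return [[sigmoid(sum(part[y][x] for part in partials))
             for x in range(w)]
            for y in range(h)]


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
kernels = [[[1, 0], [0, 0]], [[0, 1], [0, 0]], [[-2, 0], [0, 0]]]
plane_805a = feature_plane([ones, ones, ones], kernels)
print(plane_805a)  # [[0.5, 0.5], [0.5, 0.5]]
```

In the toy example the three partial results are 1, 1, and -2 at every position, so their sum is 0 and the sigmoid output is 0.5 everywhere.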

Similarly, to generate the feature plane 805b, three convolution filter operations using the filter kernels 8041b, 8042b, and 8043b are performed on the feature planes 803a to 803c of the preceding layer 808. To generate the feature plane 807 of the third layer 810, two convolution filter operations using the filter kernels 8061 and 8062 are performed on the feature planes 805a and 805b of the preceding layer 809.

Next, the perceptron processing in the second half is described. FIG. 8 shows a two-layer perceptron. A perceptron applies a non-linear transformation to the weighted sum of the elements of the input feature quantity. Therefore, the result 813 can be obtained by performing a matrix product on the feature quantity 807 and applying non-linear transformation to the product. Repeating the same processing yields the final identification result 814.
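
A two-layer perceptron of this kind can be sketched in a few lines. The weights, the choice of sigmoid for the first layer and softmax for the second, and all names are illustrative assumptions; bias terms are omitted because the text does not mention them.

```python
import math


def sigmoid_all(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def perceptron_layer(features, weights, nonlinearity):
    """Matrix product (one weighted sum per output) plus non-linearity."""
    sums = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return nonlinearity(sums)


features_807 = [1.0, 2.0]                   # hypothetical feature vector
w1 = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical weights
w2 = [[1.0, 0.0], [0.0, 1.0]]
result_813 = perceptron_layer(features_807, w1, sigmoid_all)
result_814 = perceptron_layer(result_813, w2, softmax)
print(result_814)
```

The final vector sums to 1 and can be read as class scores; the index of its largest element gives the identified category in this toy setup.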

FIG. 10 is a block diagram of the parallel computing device of the present embodiment. In addition to the configuration described in the first embodiment, the parallel computing device 101 includes an operation result storage unit 1001 and a non-linear transformation processing unit 1002. The operation type switching unit 110 of the present embodiment outputs a non-linear transformation switching signal, which instructs the non-linear transformation processing unit 1002 whether to execute non-linear transformation and, if so, which non-linear transformation to perform. When the non-linear transformation switching signal instructs that non-linear transformation be performed, the parallel computing device 101 outputs the non-linearly transformed product-sum operation result as the final operation result. Depending on the non-linear transformation setting in the operation type switching unit 110, the parallel computing device 101 can also output a result that has not been non-linearly transformed, for example the product-sum operation result as in the first embodiment.

The operation result storage unit 1001 temporarily stores the plurality of product-sum operation results output from the arithmetic processing units 107 and outputs them sequentially to the non-linear transformation processing unit 1002. The operation result storage unit 1001 is composed of a loadable shift register or the like.

The non-linear transformation processing unit 1002 applies non-linear transformation, such as a sigmoid transformation, to the input product-sum operation result. The non-linear transformation is realized, for example, by using a lookup table stored in a storage means of the parallel computing device. The non-linear transformation switching signal from the operation type switching unit 110 is input to the non-linear transformation processing unit 1002, and various non-linear transformations are executed by rewriting the lookup table based on that signal. When the non-linear transformation switching signal instructs that no non-linear transformation be performed, the non-linear transformation by the non-linear transformation processing unit 1002 is skipped.
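
A lookup-table based non-linear transformation with a switchable table might look like the following sketch. The table size (256 entries), the input range (-8 to 8), nearest-entry indexing, and all names are assumptions chosen for illustration; the text does not specify them.

```python
import math

LO, HI, ENTRIES = -8.0, 8.0, 256   # hypothetical table parameters


def build_lut(func):
    """Sample func at evenly spaced points to fill the table."""
    step = (HI - LO) / (ENTRIES - 1)
    return [func(LO + i * step) for i in range(ENTRIES)]


def apply_nonlinearity(lut, value):
    """Look up the nearest table entry; pass through if no table is set."""
    if lut is None:                      # "no non-linear transformation"
        return value
    clamped = min(max(value, LO), HI)
    index = round((clamped - LO) / (HI - LO) * (ENTRIES - 1))
    return lut[index]


sigmoid_lut = build_lut(lambda v: 1.0 / (1.0 + math.exp(-v)))
tanh_lut = build_lut(math.tanh)          # "rewriting" the table

print(apply_nonlinearity(sigmoid_lut, 0.0))   # close to 0.5
print(apply_nonlinearity(tanh_lut, -100.0))   # saturates near -1
print(apply_nonlinearity(None, 3.5))          # skipped: returns 3.5
```

Swapping `sigmoid_lut` for `tanh_lut` plays the role of rewriting the table in response to the switching signal, and passing `None` models the case where the transformation is skipped.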

In the parallel computing device 101 of the present embodiment, the product-sum operation results calculated by the arithmetic processing units 107 are first held in the operation result storage unit 1001 and then input to the non-linear transformation processing unit 1002. This is because the non-linear transformation processing unit 1002 generally has a large circuit scale, and it is often difficult to provide more than one. The configuration in which the parallel computing device 101 includes the operation result storage unit 1001 is therefore shown as an example in which a single non-linear transformation processing unit 1002 processes a plurality of product-sum operation results one after another. However, if as many non-linear transformation processing units 1002 as arithmetic processing units 107 can be provided, the operation result storage unit 1001 is not necessarily required.

Next, an image processing apparatus of the present embodiment that includes the parallel computing device 101 described above will be described. FIG. 11 is a block diagram of the image processing apparatus 200 of the present embodiment. The image processing apparatus 200 includes an image input unit 20, the parallel computing device 101, a bridge 24, a preprocessing unit 25, a DMAC (Direct Memory Access Controller) 26, and a RAM 21. It further includes a CPU (Central Processing Unit) 27, a ROM 28, and a RAM 29. The image input unit 20, the parallel computing device 101, the preprocessing unit 25, and the DMAC 26 are connected to one another via an image bus 23, and the CPU 27, the ROM 28, and the RAM 29 are connected to one another via a CPU bus 30. The bridge 24 enables data transfer between the image bus 23 and the CPU bus 30.

The image input unit 20 is composed of an optical system and a photoelectric conversion device such as a CCD (Charge-Coupled Device) or CMOS (Complementary Metal Oxide Semiconductor) sensor. It also includes a driver circuit that controls the sensor, an AD converter, a signal processing circuit that handles various image corrections, a frame buffer, and the like. The image input unit 20 may also process, as the target image, image data input from a device or medium other than a camera, or image data stored in advance in the image processing apparatus 200. The parallel computing device 101 executes hierarchical CNN operations and matrix product operations. The RAM 21 is used as a working buffer for the operations of the parallel computing device 101.

The preprocessing unit 25 performs various kinds of preprocessing so that the image recognition processing can be performed effectively. For example, it performs image data conversion such as color conversion and contrast correction in hardware. The DMAC 26 handles data transfer between the CPU bus 30 and the image input unit 20, the parallel computing device 101, and the preprocessing unit 25 on the image bus 23. The ROM (Read Only Memory) 28 stores instructions and parameter data that define the operation of the CPU 27, and the CPU 27 controls the overall operation of the image processing apparatus 200 while reading them. The RAM 29 is used as the work area of the CPU 27 during this operation. The CPU 27 can also access the RAM 21 on the image bus 23 via the bridge 24.

As described above, Deep Learning extracts feature quantities by repeating the CNN operation over several layers, performs identification processing based on the extracted feature quantities, and obtains the final result. When Deep Learning is executed in the image processing apparatus 200 of the present embodiment, CNN is first set as the operation type for the parallel computing device 101. The CNN operation applies non-linear processing (a sigmoid transformation here) to the result of the convolution filter operation. The operation type switching unit 110 therefore outputs information indicating the filter operation as the operation type signal and information indicating the sigmoid transformation as the non-linear transformation switching signal.

The details of the processing of the parallel computing device 101 when information indicating the filter operation is output as the operation type signal have already been described in the first embodiment and are omitted here. The filter operation result calculated as in the first embodiment is output to the operation result storage unit 1001, non-linear transformation is applied to it, and the result is output. Repeating this filter operation with non-linear processing while changing the filter kernel extracts the Deep Learning feature quantity (vector data).

Next, identification processing is performed on the output feature quantity (vector data). In this case, identification processing is set as the operation type for the parallel computing device 101. The identification processing performed in the present embodiment applies non-linear processing (a softmax transformation here) to the result of a matrix product operation on the feature quantity, so the operation type switching unit 110 outputs information indicating the matrix product operation as the operation type signal. It also outputs information indicating the softmax transformation as the non-linear transformation switching signal.

The details of the processing of the parallel computing device 101 when information indicating the matrix product operation is output as the operation type signal have already been described in the first embodiment and are omitted here. The matrix product operation result calculated as in the first embodiment is output to the operation result storage unit 1001, and the identification result is output by applying non-linear transformation to it. The final result (for example, the category of an object present in the input image) is obtained from this identification result.

As described above, the image processing apparatus 200 of the present embodiment includes the parallel computing device 101, which makes it possible to execute Deep Learning processing, which consists of various operations, on a single parallel computing device.
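How one array of multiply-accumulate units can serve both operation types may be easier to see in a small behavioural model. The sketch below follows the two supply modes recited in the claims (load and shift for the first mode, load without shifting for the second), but it rests on assumptions: the filter is one-dimensional, the matrix elements are treated as the first data and the vector elements as the broadcast second data, and all class and function names are hypothetical.

```python
class ProcessingElement:
    """One arithmetic processing unit: a multiplier feeding a cumulative adder."""

    def __init__(self):
        self.acc = 0

    def step(self, first, second):
        self.acc += first * second


def filter_operation(pixels, coeffs, num_pes):
    """First supply mode (assumed 1-D filter): load once, then shift.

    Requires len(pixels) >= num_pes + len(coeffs) - 1.
    """
    pes = [ProcessingElement() for _ in range(num_pes)]
    shift_reg = list(pixels)              # first data, loaded once
    for coeff in coeffs:                  # second data, broadcast to all units
        for k, pe in enumerate(pes):
            pe.step(shift_reg[k], coeff)  # each unit sees a different pixel
        shift_reg = shift_reg[1:]         # shift: a neighbouring unit reuses the data
    return [pe.acc for pe in pes]


def matrix_product(matrix, vector):
    """Second supply mode: load fresh first data every step, never shift."""
    pes = [ProcessingElement() for _ in matrix]
    for t, v in enumerate(vector):        # second data, broadcast to all units
        column = [row[t] for row in matrix]   # first data, not shared between units
        for pe, m in zip(pes, column):
            pe.step(m, v)
    return [pe.acc for pe in pes]
```

In `filter_operation` each pixel reaches several units at different steps, which is what lets one load serve several outputs; in `matrix_product` no element is reused, so shifting would bring no benefit.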

[Other Embodiments]
In the above description, an example in which the first data supply unit 105 is configured as a shift register has been described, but the first data supply unit 105 of the present invention is not limited to a shift register. Any means can be adopted as the first data supply unit 105 as long as it can supply the same data to different arithmetic processing units 107 at different timings and can switch, according to the operation type signal, between permitting and prohibiting the supply of the same data to a plurality of arithmetic processing units 107. For example, a configuration in which the outputs of a plurality of registers are selected by selectors may be used. In this case, sequentially switching the selector control signals yields the same operation as a shift register. Fixing the selector signals, in turn, ensures that the same data is not supplied to a plurality of arithmetic processing units 107.
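The selector-based alternative can be modelled in a few lines. The sketch below is an assumption-laden illustration, not a description of an actual circuit: it assumes one selector per arithmetic processing unit, represents the control signal as a register index, and uses hypothetical function names.

```python
def selector_supply(registers, select):
    """Unit k receives the register chosen by its selector signal select[k]."""
    return [registers[s] for s in select]


def run_sequential(registers, num_pes, num_steps):
    """Step the selector signals so the wiring behaves like a shift register.

    Requires len(registers) >= num_pes + num_steps - 1.
    """
    return [selector_supply(registers, [k + t for k in range(num_pes)])
            for t in range(num_steps)]


def run_fixed(loads, num_pes):
    """Hold the selector signals fixed; the registers are reloaded every step,
    so no value is ever supplied to more than one unit."""
    fixed = list(range(num_pes))
    return [selector_supply(registers, fixed) for registers in loads]
```

With `run_sequential([10, 20, 30, 40, 50], 3, 3)` the value 30 reaches unit 2 at step 0, unit 1 at step 1, and unit 0 at step 2, matching the shift-register behaviour; `run_fixed` passes each newly loaded value to exactly one unit.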

The present invention may be applied to a system composed of a plurality of devices or to an apparatus consisting of a single device. The present invention is not limited to the above embodiments; various modifications (including organic combinations of the embodiments) are possible based on the spirit of the present invention, and these are not excluded from the scope of the present invention. That is, all configurations that combine the above-described embodiments and their modifications are also included in the present invention.

101 Parallel computing device
102 First data storage unit
103 Second data storage unit
104 Data supply control unit
105 First data supply unit
106 Second data supply unit
107 Arithmetic processing unit
108 Multiplier
109 Cumulative adder
110 Operation type switching unit
111 Read control unit

Claims

1. A parallel computing device comprising:
a plurality of arithmetic processing means for performing operations in parallel based on first data and second data;
first supply means for supplying the first data to the plurality of arithmetic processing means;
second supply means for supplying the second data to the plurality of arithmetic processing means; and
supply control means for controlling the first supply means so as to supply first data having different contents at the same timing to the plurality of arithmetic processing means, and for controlling the second supply means so as to supply second data having the same content at the same timing to the plurality of arithmetic processing means,
wherein the supply control means executes:
a first supply mode in which the first data is supplied such that first data having the same content is shared at different timings among the plurality of arithmetic processing means; and
a second supply mode in which first data having different contents is supplied to the plurality of arithmetic processing means at a plurality of timings.

2. The parallel computing device according to claim 1, further comprising:
first storage means for storing the first data; and
read control means for controlling reading of the first data from the first storage means to the first supply means,
wherein the read control means executes:
a first read mode in which a part of the first data read from the first storage means in a past operation is read again in duplicate; and
a second read mode in which the first data is read without duplicating the first data read from the first storage means in a past operation.

3. The parallel computing device according to claim 2, further comprising second storage means for storing the second data,
wherein the supply control means controls the second supply means so as to receive the second data read from the second storage means and to supply second data having the same content at the same timing to each of the plurality of arithmetic processing means.

4. The parallel computing device according to claim 3, wherein the first supply means comprises a shift register, and
the first supply means:
in the first supply mode, loads the first data read from the first storage means and supplies the loaded first data to the plurality of arithmetic processing means while shifting it a predetermined number of times; and
in the second supply mode, loads the first data read from the first storage means and supplies the loaded first data to the plurality of arithmetic processing means without shifting it.

5. The parallel computing device according to any one of claims 2 to 4, further comprising operation type switching means for controlling the supply control means and the read control means according to the type of operation,
wherein the operation type switching means:
when a first operation is performed, causes the supply control means to execute the first supply mode and causes the read control means to execute the first read mode; and
when a second operation is performed, causes the supply control means to execute the second supply mode and causes the read control means to execute the second read mode.

6. The parallel computing device according to claim 5, wherein the first operation is a filter operation.

7. The parallel computing device according to claim 5, wherein the second operation is a matrix product operation.

8. The parallel computing device according to any one of claims 5 to 7, further comprising conversion processing means for performing a non-linear transformation on the operation result of each of the plurality of arithmetic processing means,
wherein the operation type switching means instructs the conversion processing means whether or not to perform the non-linear transformation according to the type of the operation.

9. The parallel computing device according to claim 8, wherein the operation type switching means instructs the conversion processing means of the type of the non-linear transformation when the conversion processing means performs the non-linear transformation.

10. The parallel computing device according to any one of claims 1 to 9, wherein the arithmetic processing means includes a multiplier and a cumulative adder.

11. An image processing apparatus comprising the parallel computing device according to any one of claims 1 to 10,
wherein the image processing apparatus performs image processing on a processing target by using the parallel computing device.

12. A parallel computing method comprising:
performing, by a plurality of arithmetic processing means, operations in parallel based on first data and second data;
supplying the first data from first supply means to the plurality of arithmetic processing means;
supplying the second data from second supply means to the plurality of arithmetic processing means; and
controlling the first supply means so as to supply first data having different contents at the same timing to the plurality of arithmetic processing means, and controlling the second supply means so as to supply second data having the same content at the same timing to the plurality of arithmetic processing means,
wherein the controlling includes executing:
a first supply mode in which the first data is supplied such that first data having the same content is shared at different timings among the plurality of arithmetic processing means; and
a second supply mode in which first data having different contents is supplied to the plurality of arithmetic processing means at a plurality of timings.