JP2018206195A

JP2018206195A - Calculation system, and control method and program of calculation system

Info

Publication number: JP2018206195A
Application number: JP2017112721A
Authority: JP
Inventors: 徹保米本; Toru Homemoto; 久治石井; Hisaharu Ishii
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-06-07
Filing date: 2017-06-07
Publication date: 2018-12-27

Abstract

PROBLEM TO BE SOLVED: To provide a calculation system, and a control method and a program for the calculation system, that improve accelerator utilization efficiency while preserving the existing programming method.

SOLUTION: A calculation system 100 comprises an input device 160 that receives commands, and a support device 170 that maps each input command to a partial reconfiguration region of an FPGA accelerator 140 and controls the connection of partial reconfiguration regions by data paths. When a pipe instruction is issued, the support device 170 maps the preceding command to a preceding partial reconfiguration region, maps the succeeding command to a succeeding partial reconfiguration region, and connects the two regions by a data path so that the regions form a one-dimensional chain. After this mapping and connecting have been repeated for all pipe instructions, the support device 170 configures the chain so that the input of the first command is fed from a CPU 110 and the output of the last command is returned to the CPU 110.

SELECTED DRAWING: Figure 1

Description

The present invention relates to an arithmetic system, a control method for the arithmetic system, and a program.

First, the characteristics of the arithmetic devices inside a computer are outlined. The computer assumed here is a known general-purpose computer that includes an arithmetic device built from electronic integrated circuits, a main storage device (memory), an auxiliary storage device, and the like.
The computer operates by receiving as input a group of instructions (a program) that defines numerical values and types of operations; the arithmetic device interprets and executes the instructions and outputs the result.
A CPU (Central Processing Unit), the representative arithmetic device, is built to perform well on any kind of processing rather than one specific kind. It therefore contains many dedicated processing circuits for complex arithmetic instructions (for example, division and floating-point operations) and switches among them according to the type of instruction. CPUs also often include a mechanism that predicts the branch destination (branch prediction mechanism) so that performance does not drop around a branch instruction (an instruction that changes which instruction is executed next depending on a condition), and a mechanism (eager execution mechanism) that executes in advance both the processing for the case where the branch is taken and the processing for the case where it is not, and then selects one of the computed results once the branch outcome is known.

When a CPU performs a given process, the dedicated processing circuits not used by that process sit idle. The branch prediction mechanism is a circuit that inherently does not contribute to the computation itself. Furthermore, in the eager execution mechanism, results that turn out not to be needed after the branch are discarded, so the circuits that produced them did not contribute to the computation either. Consequently, when the circuit scale is limited by cost and by the difficulty of fabricating large-scale integrated circuits, the processing performance of a CPU inevitably remains relatively low.
Among arithmetic devices, some have a structure and an instruction set specialized for specific processing. For example, a device specialized for image processing is called a GPU (Graphics Processing Unit), and a device specialized for packet processing in a network is called an NPU (Network Processing Unit).
Such devices typically contain many instances of a processing circuit related to their application and, when given the corresponding processing, operate those circuits in parallel to achieve high speed.

In addition, because programs for image processing and packet processing contain relatively few conditional branches, such devices are sometimes designed without a branch prediction mechanism, an eager execution mechanism, and the like, so that as many resources as possible go to the main arithmetic processing.
An integrated circuit specialized for such specific processing can generally deliver much higher performance than a CPU of the same circuit scale. On the other hand, when it is given processing it does not support, the processing is either impossible or, if possible, very slow.
To avoid this situation, the form shown in FIG. 5 is often used, in which a GPU or an NPU is connected as an accelerator, through an expansion bus of the CPU or the like, to a general-purpose computer equipped with a CPU.

FIG. 5 is a schematic configuration diagram of an arithmetic system in which an arithmetic device specialized for specific processing is added to a general-purpose computer.
As shown in FIG. 5, the arithmetic system 10 includes a CPU 11, a memory 12, an auxiliary storage device 13, an accelerator 14, and an expansion bus 15 that connects them. The accelerator 14 includes an arithmetic device consisting of a GPU/NPU/FPGA (Field Programmable Gate Array); hereinafter in this specification, "/" denotes "and/or".
In the arithmetic system 10, the CPU 11 instructs the use of the accelerator 14, the data to be processed is sent to the accelerator 14 through the expansion bus 15, the arithmetic device on the accelerator 14 performs the computation, and finally the result is returned to the CPU 11.

The structure of the arithmetic devices described above, which are implemented as integrated circuits, is normally fixed at the time of manufacture, and the executable instructions, operating speed, and so on cannot be changed afterwards.
Therefore, when an arithmetic device specialized for specific processing is used, an appropriate device is usually selected and installed according to the intended use of the computer.

Next, reconfigurable integrated circuit devices are described.
A reconfigurable integrated circuit, typified by the FPGA (Field Programmable Gate Array), has an LUT (Look Up Table) structure based on rewritable memory (SRAM, Static Random Access Memory). Writing a bit string that represents the structure of an electronic circuit (a circuit configuration) into the built-in LUTs temporarily configures an arbitrary electronic circuit on the device. In an SRAM-based FPGA, the LUTs and the routing switches are built from SRAM, so a step that writes this information into the SRAM (configuration) is required at power-on.
The rest of this specification uses an SRAM-based FPGA, a typical implementation, as the example, but the same applies to devices that configure electronic circuits by other methods.
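As an illustration of the LUT structure described above, the following minimal Python sketch models a single LUT as a small memory addressed by its inputs. The class name `LUT` and its methods are hypothetical and are not taken from the specification; a real FPGA contains many such tables plus routing switches, which this sketch omits.

```python
from itertools import product


class LUT:
    """Toy k-input look-up table: the stored bits are the truth table."""

    def __init__(self, num_inputs: int):
        self.num_inputs = num_inputs
        # One memory cell per input combination, cleared as at power-on.
        self.bits = [0] * (2 ** num_inputs)

    def configure(self, bitstring: str) -> None:
        """Write a configuration bit string into the table."""
        if len(bitstring) != len(self.bits) or set(bitstring) - {"0", "1"}:
            raise ValueError("bit string does not fit this LUT")
        self.bits = [int(b) for b in bitstring]

    def evaluate(self, *inputs: int) -> int:
        """Use the input values as an address into the table."""
        address = 0
        for bit in inputs:
            address = address * 2 + (bit & 1)
        return self.bits[address]


lut = LUT(2)
lut.configure("0110")  # entries for inputs 00, 01, 10, 11: behaves as XOR
xor_table = [lut.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
lut.configure("0001")  # same table rewritten: now behaves as AND
and_table = [lut.evaluate(a, b) for a, b in product((0, 1), repeat=2)]
```

Rewriting the stored bits changes the logic function without changing the hardware, which is the property that reconfiguration relies on.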

Conventionally, FPGAs have been used in the design stage of an ASIC (Application Specific Integrated Circuit) as devices in which the circuit designed for the ASIC is configured, so that its operation can be verified and the design corrected before costly ASIC production begins.
In addition, when a product is manufactured only in small numbers, the equipment is built to load a fixed circuit configuration from an external storage element or the like at start-up, so that while the equipment is running it behaves as if it were equipped with an ASIC for that application.

Reconfigurable computing is described next.
In recent years, a way of using FPGAs for general-purpose computing called "reconfigurable computing" has attracted attention (see, for example, Non-Patent Document 1). Reconfigurable computing combines the flexibility of software with high-performance hardware processing on highly flexible high-speed computing fabrics such as FPGAs. One form of it actively exploits the reconfigurable structure of the FPGA: instead of selecting and building an appropriate arithmetic device for each computer, a computer equipped with an FPGA as an accelerator is prepared, and processing is performed by loading a circuit configuration that represents the circuit design of an arithmetic device into the FPGA to form an ad hoc arithmetic circuit.
If circuit configurations specialized for various kinds of processing are prepared in advance and the loaded circuit is changed to match the processing to be performed, it should be possible to keep using, for a wide variety of processing, an arithmetic device whose structure and instruction set are specialized for that processing. Note, however, that an FPGA carries overhead because it emulates an electronic circuit on top of another electronic circuit, and it may be inferior in operating speed, power efficiency, and the like to a GPU or NPU of the same circuit scale.

Program description and compilation are described next.
A conventional executable program for a CPU/GPU/NPU is usually machine code in a binary format defined for each arithmetic device. Instead of writing machine code directly, it is common to write the program in a programming language such as C and convert it into a binary with a compiler.
Compilation typically takes on the order of seconds to minutes. A programmer therefore develops by repeating a cycle: write the program up to a natural boundary such as a functional unit, compile it, run it to check that the results are correct, and move on to the next function. When a program becomes very large, compilation may take longer than a few minutes, but many compilers compile in separate units such as source files. Many can also skip files that have not changed, so a programmer who splits the program into files by function or the like can keep compilation time from growing.
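The skipping of unchanged files mentioned above can be sketched as follows. This is a simplified illustration that compares content hashes; the function names and the in-memory `cache` are assumptions made for the example, and real build tools may instead compare timestamps or track dependencies between files.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of a source file's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def files_to_recompile(sources: dict[str, str], cache: dict[str, str]) -> list[str]:
    """Return the files whose content changed since the last build.

    `cache` maps file name to the hash seen at the previous build and is
    updated in place.
    """
    changed = []
    for name, text in sorted(sources.items()):
        digest = content_hash(text)
        if cache.get(name) != digest:
            changed.append(name)
            cache[name] = digest
    return changed


cache: dict[str, str] = {}
sources = {"parse.c": "int parse(void);", "emit.c": "int emit(void);"}
first_build = files_to_recompile(sources, cache)   # every file is new
sources["emit.c"] = "int emit(int flags);"          # edit one file
second_build = files_to_recompile(sources, cache)  # only the edited file
```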

A program can also be executed directly, without compilation, by a special program called an interpreter. Execution is usually slower than with a compiled binary, but the program can be run immediately after it is written. Because of this advantage, an interpreter suits uses such as writing a series of operations on a computer as a simple program (also called a script) and invoking it whenever the operations need to be repeated. In this specification, such usage is called ad hoc programming.

On the other hand, when an arithmetic device is built on an FPGA, the structure of the arithmetic circuit itself must be designed in addition to the program that directs the operation of that circuit.
The structure of the arithmetic circuit is defined by the circuit configuration described above, which is typically in binary format. Instead of writing a binary circuit configuration directly, it is usual to describe the circuit in a hardware description language (HDL) such as Verilog HDL or VHDL, or in C or a similar language, and then convert the description into a circuit configuration with a compiler.
However, the FPGA synthesis tool chain that creates a circuit configuration is far slower than an ordinary program compiler, typically taking from several tens of minutes to several hours.
One reason is thought to be that the "place and route" step of the FPGA synthesis tool chain involves the problem of determining which circuit function to assign to which position among the many LUTs arranged two-dimensionally on the FPGA, and how to connect them, so that the circuit operates most efficiently. This problem belongs to the class of NP-hard problems, for which no polynomial-time solution is guaranteed.
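To give a rough sense of why placement is expensive, the following sketch counts only the number of ways to assign distinct circuit functions to distinct positions. It is an illustration of search-space growth under that simplifying assumption, not a model of how synthesis tools work, and it ignores routing entirely; the function name is hypothetical.

```python
import math


def placement_count(num_positions: int, num_functions: int) -> int:
    """Ways to assign distinct circuit functions to distinct positions."""
    return math.perm(num_positions, num_functions)


small = placement_count(9, 4)      # 9 * 8 * 7 * 6 candidate placements
larger = placement_count(100, 10)  # already far beyond exhaustive search
```

Even before routing is considered, the number of candidate placements grows factorially, which is consistent with the long run times described above.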

This long turnaround severely disrupts the write-and-test cycle of conventional programming. To avoid this in the development of large circuit configurations, developers often do not generate an actual FPGA configuration after every change; instead, they use a circuit simulator that imitates on a CPU how the circuit configuration would behave when loaded onto the FPGA. Although the simulator runs at limited speed, the time until it starts running is comparable to that of compiling a program.

In contrast, performing reconfigurable computing through the ad hoc programming described above requires the circuit to actually run, so the step of creating a circuit configuration is unavoidable.
If creating a script that assists computer operation and administration takes a very long time, that can be an obstacle to its use. For example, suppose it takes a few minutes to write a program that records a series of operations that would otherwise be performed by manually issuing several commands on an OS (Operating System). In many such cases, rather than waiting several hours for the tool chain to finish before execution can begin, the total processing time is shorter if the program simply runs on the CPU, even at lower execution performance.

Next, reconfigurable computing using partial dynamic reconfiguration is described.
Some of the FPGAs that are mainstream today implement a partial dynamic reconfiguration mechanism. This mechanism allows the circuit configuration of one part of the FPGA to be rewritten while circuits written in other parts are operating. In some implementations, one or more partial reconfiguration areas for dynamic reconfiguration are defined on the FPGA in advance, and a partial circuit configuration can be loaded into such an area independently of the other parts (see, for example, Non-Patent Documents 2 and 3).
Using this scheme is expected to partly solve the problem described above, namely that in reconfigurable computing the tool chain takes a long time to create a circuit configuration, which makes it unsuitable for ad hoc programming.

To that end, the functions used in programs, OS system calls, OS commands, and the like are first created in advance as partial circuit configurations that can be executed by changing only their arguments, and are saved in a storage area. Next, the program is executed directly by an interpreter or the like (on the main CPU). When the program calls a function or the like for which a partial circuit configuration has been prepared, that partial circuit configuration is written into an empty partial reconfiguration area on the FPGA and run.
The data to be processed is written to the FPGA through the expansion bus of the CPU or the like, and the result is read back through the expansion bus in the same way. This operation is described concretely below.

FIG. 6 shows the arithmetic system 10 of FIG. 5 with the above operation added. FIG. 6 shows an example in which an FPGA accelerator or the like is attached to a general-purpose computer and arithmetic circuits are built on the FPGA from a group of partial circuit configurations prepared in advance, without spending time on circuit configuration generation.
As shown in FIG. 6, the accelerator 14A of the arithmetic system 10 is an FPGA accelerator equipped with an FPGA, and the FPGA accelerator 14A has partial reconfiguration areas (1) to (9). The auxiliary storage device 13 of the arithmetic system 10 stores a partial circuit configuration group 13A created in advance; here, the group 13A consists of partial circuit configurations A to D. The memory 12 of the arithmetic system 10 temporarily stores the data 12A to be processed.

The CPU 11 of the arithmetic system 10 executes the program. When the program calls a function or the like for which a partial circuit configuration has been prepared, the CPU 11 writes the corresponding configuration from the partial circuit configuration group 13A into empty areas among the partial reconfiguration areas (1) to (9) on the FPGA accelerator 14A and runs it. As indicated by symbol a in FIG. 6, during program execution the CPU 11 writes partial circuit configuration A into partial reconfiguration areas (1) and (2), partial circuit configuration B into partial reconfiguration area (5), and partial circuit configuration C into partial reconfiguration area (9) of the FPGA accelerator 14A. As indicated by symbol b in FIG. 6, the data 12A to be processed, temporarily stored in the memory 12, is written to the FPGA accelerator 14A through the expansion bus 15, and the result is read back through the expansion bus 15 in the same way.
In the arithmetic system 10, after the data to be processed has been sent to the FPGA accelerator 14A through the expansion bus 15, the arithmetic circuit on the FPGA accelerator 14A performs the computation, and finally the result is returned to the CPU 11.
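The loading step of FIG. 6 can be pictured with the following toy model. The class `FpgaAccelerator` and its methods are hypothetical names introduced for this sketch; the model only tracks which prebuilt configuration occupies which area and does not represent bitstreams or any vendor interface.

```python
class FpgaAccelerator:
    """Toy model of an FPGA with numbered partial reconfiguration areas."""

    def __init__(self, num_areas: int):
        # None marks an empty area; otherwise the name of the loaded config.
        self.areas = {i: None for i in range(1, num_areas + 1)}

    def free_areas(self) -> list[int]:
        """Areas that currently hold no partial circuit configuration."""
        return [i for i, cfg in self.areas.items() if cfg is None]

    def load(self, config_name: str, targets: list[int]) -> None:
        """Write a prebuilt partial circuit configuration into empty areas."""
        unavailable = [
            i for i in targets if self.areas.get(i, "missing") is not None
        ]
        if unavailable:
            raise RuntimeError(f"areas {unavailable} are not available")
        for i in targets:
            self.areas[i] = config_name


# Placement as described for FIG. 6: A in (1) and (2), B in (5), C in (9).
fpga = FpgaAccelerator(9)
fpga.load("A", [1, 2])
fpga.load("B", [5])
fpga.load("C", [9])
remaining = fpga.free_areas()
```

Because each `load` only copies a configuration that already exists, no synthesis time is spent at the moment the function is called.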

In this way, even for ad hoc programming, the arithmetic system 10 can use the FPGA to always employ the arithmetic device best suited to the type of computation, without having the FPGA synthesis tool chain generate a circuit configuration on the spot (for this kind of usage, see Non-Patent Document 4).

However, in the structure of the arithmetic system 10 of FIGS. 5 and 6, data is transferred from the memory 12 to the FPGA accelerator 14A through the expansion bus 15, and the result is written back to the memory 12 through the expansion bus 15, once for every function, command, or similar unit in the program. The time spent exchanging data over the expansion bus 15 therefore becomes the rate-limiting factor for execution speed.
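The effect of these repeated transfers can be estimated with a simple count. The sketch below is a back-of-the-envelope model; the function names, the 100 MB data size, and the 1 GB/s bus rate are illustrative assumptions, not figures from the specification. For comparison it also counts the case in which only the first input and the last output cross the bus.

```python
import math


def transfers_per_call_offload(num_stages: int) -> int:
    """Each offloaded unit sends its input over the bus and returns its output."""
    return 2 * num_stages


def transfers_first_in_last_out(num_stages: int) -> int:
    """Only the first input and the last output cross the bus."""
    return 2 if num_stages > 0 else 0


def bus_time_seconds(transfers: int, bytes_per_transfer: int,
                     bus_bytes_per_second: float) -> float:
    """Time spent on the bus, ignoring latency and protocol overhead."""
    return transfers * bytes_per_transfer / bus_bytes_per_second


STAGES = 4                 # assumed number of offloaded functions
DATA_BYTES = 100_000_000   # assumed size of each transfer (100 MB)
BUS_RATE = 1e9             # assumed bus throughput (1 GB/s)

per_call = bus_time_seconds(transfers_per_call_offload(STAGES), DATA_BYTES, BUS_RATE)
end_to_end = bus_time_seconds(transfers_first_in_last_out(STAGES), DATA_BYTES, BUS_RATE)
```

Under these assumed numbers the per-call scheme spends about four times as long on the bus, and the gap widens linearly with the number of stages.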

Next, avoiding the bus bottleneck described above is discussed.
To avoid the expansion bus bottleneck and get the most out of reconfigurable computing, consider a structure (reconfiguration area map) in which the partial reconfiguration areas inside the FPGA are directly connected to one another by multiple FIFO (First In First Out)-like data structures.
First, the connection relationships between the input data and output data of the functions, commands, and the like for which partial circuit configurations have been prepared are extracted from the structure of the program entered by the user. Placement positions are then chosen so that each data connection is carried by one of the FIFO-like data structures, the partial circuit configurations are loaded at those positions, and all of them are executed in parallel.
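The extraction of connection relationships can be sketched for the simplest case, a linear chain of commands. The example below assumes that the user's program is a shell-style pipeline written with `|`; that syntax, the function names, and the sample commands are assumptions for illustration, and the naive split does not handle quoting.

```python
def parse_pipeline(command_line: str) -> list[str]:
    """Split a shell-style pipeline into its stage commands (no quoting support)."""
    stages = [stage.strip() for stage in command_line.split("|")]
    if any(not stage for stage in stages):
        raise ValueError("empty stage in pipeline")
    return stages


def plan_connections(stages: list[str]) -> list[tuple[str, str]]:
    """List the data-path hops: CPU to first stage, stage to stage, last stage to CPU."""
    nodes = ["CPU", *stages, "CPU"]
    return list(zip(nodes, nodes[1:]))


stages = parse_pipeline("grep error | sort | uniq")
hops = plan_connections(stages)
```

Each stage-to-stage hop in `hops` is a candidate for a direct FIFO connection between two partial reconfiguration areas, while the first and last hops are the only ones that need the expansion bus.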

In this way, computations are executed one after another (pipeline processing) without intermediate results that do not appear in the final result being saved to memory. Data exchange over the expansion bus is kept to a minimum, and the system gains the speed benefit of the FPGA's ability to always use the arithmetic device best suited to the type of computation.
Non-Patent Document 5 shows a configuration that avoids the bottleneck of the bus used to access nearby DRAM (Dynamic Random Access Memory) by connecting functional modules inside the FPGA through directly connected FIFO structures. Non-Patent Document 5 does not mention dynamic reconfiguration of the functional modules, but with reference to Non-Patent Documents 2 and 3 and the like, combining the two is considered feasible.
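A software analogy may help picture the pipeline processing described above. In the sketch below, threads stand in for concurrently operating circuits and bounded queues stand in for the FIFO structures; all names are hypothetical, and Python threads only approximate the true parallelism of hardware.

```python
import queue
import threading

END = object()  # sentinel marking the end of the data stream


def stage(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Apply func to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)
            return
        outbox.put(func(item))


def run_pipeline(funcs, data, fifo_depth: int = 4) -> list:
    """Chain the stages with bounded FIFOs and collect the final outputs."""
    fifos = [queue.Queue(maxsize=fifo_depth) for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, fifos[i], fifos[i + 1]))
        for i, f in enumerate(funcs)
    ]
    results = []

    def drain() -> None:
        # Only the output of the last stage is collected.
        while (item := fifos[-1].get()) is not END:
            results.append(item)

    collector = threading.Thread(target=drain)
    for t in workers + [collector]:
        t.start()
    for item in data:
        fifos[0].put(item)  # only the first stage is fed from outside
    fifos[0].put(END)
    for t in workers + [collector]:
        t.join()
    return results


outputs = run_pipeline([lambda x: x + 1, lambda x: x * 2, str], range(5))
```

Intermediate values exist only inside the queues between stages, mirroring how intermediate results stay on the FPGA instead of returning to memory.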

FIG. 7 shows an example in which the arithmetic system 10 of FIG. 6 is modified by adding the above operation. FIG. 7 shows an example in which directly connected FIFO-type data structures are provided between the partial circuit configurations in the FPGA to avoid the expansion bus bottleneck.
As shown in FIG. 7, the arithmetic system 10A has a program 20 that specifies the placement positions (load positions) of the partial circuit configuration group 13A and the data paths to be enabled for partial circuit configurations A to D. Here, the program 20 contains the placement positions of partial circuit configurations A to D (see the reconfiguration area map in FIG. 7 on which partial circuit configurations A to D are placed) and a data path 21 that links those positions by FIFO-like routes (see the arrows inside the program 20 in FIG. 7).

As indicated by symbol c in FIG. 7, in the arithmetic system 10A the placement positions of the partial circuit configurations and the data path 21 are specified from the structure of the program 20 entered by the user. Following the instructions of the program 20, the CPU 11 reads the partial circuit configuration group 13A from the auxiliary storage device 13 and writes it to specified positions among the partial reconfiguration areas (1) to (9) of the FPGA accelerator 14A (see symbol d in FIG. 7). That is, partial circuit configurations A to D are written at the placement positions that match the FIFO-like data structure specified by the program 20, and data is input and output along the route of the data path 21, which likewise matches that structure.

In the case of FIG. 7, during program execution the CPU 11 writes partial circuit configuration A into partial reconfiguration area (1), partial circuit configuration B into partial reconfiguration areas (3), (5), and (7), partial circuit configuration C into partial reconfiguration areas (6) and (8), and partial circuit configuration D into partial reconfiguration area (9) of the FPGA accelerator 14A.
As indicated by symbol e in FIG. 7, the data 12A to be processed, temporarily stored in the memory 12, is written to the FPGA accelerator 14A through the expansion bus 15. In the case of FIG. 7, the FPGA accelerator 14A inputs and outputs data along the route of the data path 21, which matches the FIFO-like data structure specified by the program 20 (see the thick arrows at symbol f in FIG. 7). As indicated by symbol g in FIG. 7, the result computed by the arithmetic devices on the FPGA accelerator 14A is read out through the expansion bus 15.

Non-Patent Document 1: "再構成可能コンピューティング - Wikipedia" (Reconfigurable Computing - Wikipedia), [online], [retrieved March 24, 2017], Internet 〈URL: https://ja.wikipedia.org/wiki/%E5%86%8D%E6%A7%8B%E6%88%90%E5%8F%AF%E8%83%BD%E3%82%B3%E3%83%B3%E3%83%94%E3%83%A5%E3%83%BC%E3%83%86%E3%82%A3%E3%83%B3%E3%82%B0〉
Non-Patent Document 2: Zhenzhong Xiao, et al., "A Partial Reconfiguration Controller for Altera Stratix V FPGAs," FPL2016, 29 Aug.-2 Sept. 2016, Lausanne, Switzerland
Non-Patent Document 3: Xie Di, "A Design Flow for FPGA Partial Dynamic Reconfiguration," IMCCC2012, 8-10 Dec. 2012, 10.1109/IMCCC.2012.35
Non-Patent Document 4: Kazuhiro Yamato (大和一洋), "Acceleration of character string division using FPGA suitable for IoT devices" (IoT機器に適したFPGAを用いた文字列分割の高速化), Miracle Linux Co., 2016/12/1
Non-Patent Document 5: Z. Wang, "Multikernel Data Partitioning With Channel on OpenCL-Based FPGAs."

However, when programming for a reconfiguration area map whose areas are directly connected by FIFOs, the structure of the map is fixed in advance. Consequently, unlike a program that runs on a CPU, the programmer or computer user must obtain the structure of the map beforehand and write the program so that the flow of data between functions and commands can be realized on that map. Such a program is expected to be written in a unique language, with extensions that existing languages lack, in order to specify where each circuit configuration is loaded and which data paths on the map are enabled.

For example, in FIG. 7, partial circuit configuration B is loaded into partial reconfiguration areas (3), (5), and (7) and used three ways in parallel. A program that uses it four ways in parallel cannot run, because there is no partial reconfiguration area left to accommodate the fourth instance. Thus, to apply reconfigurable computing, the user must newly learn a programming method that takes the structure of the map into account. Moreover, the program must be rewritten every time the computer in use or the state of the map changes.

Depending on the characteristics of the program, part of the reconfiguration area map may also go unused and be wasted. For example, a FIFO-like data structure is usually realized as an electronic circuit configured on the FPGA. Every area of the FPGA that corresponds to a FIFO structure the program does not use therefore remains idle.

Furthermore, to allow free programming, the partial reconfiguration areas must be set up so that any partial circuit configuration can be loaded into any area, yet the scale of a partial circuit configuration is expected to vary greatly with the content of the function or command. To allow any partial circuit configuration to be loaded into any area, each partial reconfiguration area must be designed to accommodate every candidate partial circuit configuration, and is therefore typically sized to the largest one. When a command with a small circuit scale is loaded into such an area, the surplus part of the area becomes unusable.

FIG. 8 is a diagram illustrating the decrease in FPGA utilization efficiency when the circuit scale differs greatly between partial circuit configurations.
The command 31 with a large circuit scale and the command 32 with a small circuit scale shown in FIG. 8(a) are loaded into the partial reconfiguration areas (1), (2), and (3) shown in FIG. 8(b). As described above, the partial reconfiguration areas (1), (2), and (3) are designed so that each can accommodate any of the partial circuit configurations. As shown in FIG. 8(b), when the command 32 with a small circuit scale is loaded into the partial reconfiguration areas (2) and (3), the surplus part (see the hatched portion in FIG. 8(b)) becomes unusable.
Taken together, these issues mean that the conventional technique suffers from low FPGA utilization efficiency.
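The utilization loss described for FIG. 8 can be illustrated with a small calculation. The following is only a sketch under stated assumptions: the function name `utilization` and the circuit sizes are hypothetical and do not come from the specification, and every partial reconfiguration area is assumed to be sized to the largest partial circuit configuration.

```python
def utilization(config_sizes):
    """Fraction of the reserved FPGA area that the loaded circuits occupy.

    Assumption: every partial reconfiguration area has the size of the
    largest partial circuit configuration, as in the conventional design.
    """
    region_size = max(config_sizes)
    reserved = region_size * len(config_sizes)
    return sum(config_sizes) / reserved


# Hypothetical sizes in arbitrary units: one large command and two small ones.
uneven = utilization([100, 20, 20])
# Hypothetical sizes for commands of nearly uniform scale.
even = utilization([40, 38, 42])
print(f"uneven: {uneven:.2f}, even: {even:.2f}")
```

With these illustrative numbers, the uneven case leaves more than half of the reserved area idle, while the nearly uniform case leaves only a small surplus.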

The present invention has been made in view of this background, and an object of the present invention is to provide an arithmetic system, a control method for the arithmetic system, and a program that make it possible to improve accelerator utilization efficiency while maintaining the existing programming method.

To solve the above problems, the invention according to claim 1 is an arithmetic system comprising a CPU serving as a host, an external storage device that stores a group of partial circuit configurations, an accelerator device that loads the group of partial circuit configurations and can dynamically reconfigure its arithmetic circuits, and a bus connecting these, the arithmetic system further comprising: an input device that displays a command line prompt and accepts commands entered on the command line; and a support device that maps each entered command to a partial reconfiguration area, which is a partially reconfigurable area of the accelerator device, and performs control so that the partial reconfiguration areas are connected by data paths. When a pipe instruction, which is an instruction that connects the output data of one command as the input data of the next command, is issued, the support device maps the preceding command to a preceding partial reconfiguration area, maps the succeeding command to a succeeding partial reconfiguration area, and connects the preceding and succeeding partial reconfiguration areas by a data path in a single line without branching. After repeating this mapping and connection for all pipe instructions, the support device configures the input of the first command to be fed from the CPU and the output of the last command to be returned to the CPU. The accelerator device starts all target commands in parallel and executes the processing in the FIFO-type data structure formed by the mapping of the partial reconfiguration areas set by the support device and by the branch-free, single-line connection of the preceding and succeeding partial reconfiguration areas by data paths.

The invention according to claim 6 is a control method for an arithmetic system comprising a CPU serving as a host, an external storage device that stores a group of partial circuit configurations, an accelerator device that loads the group of partial circuit configurations and can dynamically reconfigure its arithmetic circuits, and a bus connecting these, the arithmetic system having: an input device that displays a command line prompt and accepts commands entered on the command line; and a support device that maps each entered command to a partial reconfiguration area, which is a partially reconfigurable area of the accelerator device, and performs control so that the partial reconfiguration areas are connected by data paths. When a pipe instruction, which is an instruction that connects the output data of one command as the input data of the next command, is issued, the support device executes a step of mapping the preceding command to a preceding partial reconfiguration area, mapping the succeeding command to a succeeding partial reconfiguration area, connecting the preceding and succeeding partial reconfiguration areas by a data path in a single line without branching, and, after repeating this mapping and connection for all pipe instructions, configuring the input of the first command to be fed from the CPU and the output of the last command to be returned to the CPU. The accelerator device starts all target commands in parallel and executes the processing in the FIFO-type data structure formed by the mapping of the partial reconfiguration areas set by the support device and by the branch-free, single-line connection of the preceding and succeeding partial reconfiguration areas by data paths.

The invention according to claim 7 is a program for causing a computer of an arithmetic system, the arithmetic system comprising a CPU serving as a host, an external storage device that stores a group of partial circuit configurations, an accelerator device that loads the group of partial circuit configurations and can dynamically reconfigure its arithmetic circuits, and a support device that maps entered commands to partial reconfiguration areas of the accelerator device and performs control so that the partial reconfiguration areas are connected by data paths, to function as control means that, when a pipe instruction, which is an instruction that connects the output data of one command as the input data of the next command, is issued, maps the preceding command to a preceding partial reconfiguration area, maps the succeeding command to a succeeding partial reconfiguration area, connects the preceding and succeeding partial reconfiguration areas by a data path in a single line without branching, and, after repeating this mapping and connection for all pipe instructions, configures the input of the first command to be fed from the CPU and the output of the last command to be returned to the CPU.

In this way, the input and output data pass through the expansion bus at most once, which suppresses the loss of execution speed caused by data exchange between the CPU and the accelerator. In addition, while the user is presented with the same interface as a shell, the system gains the benefits of high-speed operation using dedicated arithmetic circuits, parallel execution of the group of operations, and faster execution through fewer memory accesses.
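The mapping and connection performed by the support device can be sketched in software as follows. This is a minimal illustration and not the claimed implementation: the names `Stage`, `HOST`, and `map_pipeline` are hypothetical, the partial reconfiguration areas are assumed to be indexed 0, 1, 2, ... along the chain, and the command line is split naively on “|” without handling quoting.

```python
from dataclasses import dataclass

HOST = "CPU"  # hypothetical label for the host end of the chain


@dataclass
class Stage:
    region: int      # index of the partial reconfiguration area (assumed numbering)
    command: str     # command mapped to that area
    source: object   # HOST, or the index of the upstream area
    sink: object     # HOST, or the index of the downstream area


def map_pipeline(command_line, num_regions):
    """Map piped commands to consecutive areas joined in one unbranched line."""
    # Naive split: a "|" inside quotes would be mis-split (illustration only).
    commands = [c.strip() for c in command_line.split("|")]
    if len(commands) > num_regions:
        raise ValueError("more commands than partial reconfiguration areas")
    last = len(commands) - 1
    stages = []
    for i, command in enumerate(commands):
        source = HOST if i == 0 else i - 1    # first command is fed from the CPU
        sink = HOST if i == last else i + 1   # last command returns to the CPU
        stages.append(Stage(i, command, source, sink))
    return stages


stages = map_pipeline("bc | grep 1 | wc -l", num_regions=4)
for stage in stages:
    print(stage)
```

Only the first stage reads from the host and only the last stage writes back to it, which corresponds to the data crossing the expansion bus once in each direction.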

The invention according to claim 2 is characterized in that the external storage device stores the commands or programs that run on the CPU, compiled in advance into a format that can be loaded at any position among the partial reconfiguration areas.

In this way, a circuit configuration can be loaded into whichever partial reconfiguration area is specified, and the user can be presented with the same interface as a shell.

The invention according to claim 3 is characterized in that the support device instructs the accelerator device to perform an operation by calling an external program corresponding to the command name of the entered command.

In this way, the user can program in the same way as when running a one-liner on the host, without being aware of how the partial reconfiguration areas of the equipment in use are connected. The same usage as before is realized, and the existing programming method can be maintained.

The invention according to claim 4 is characterized in that the accelerator device includes an FPGA, and the support device gives the reconfiguration area map configured in advance on the FPGA the shape of a dynamic reconfiguration area chain: a FIFO-type data structure with one input and one output per partial reconfiguration area, in which the outputs and inputs are interconnected in order so as to form a single line.

In this way, because the data path of the processing is a straight line, a high proportion of the hardware pipes in the dynamic reconfiguration area chain are actually used, and the utilization efficiency of the FPGA accelerator improves. When the partial circuit configurations of the commands differ little in scale and are relatively uniform in size, the surplus left when they are loaded into partial reconfiguration areas sized to the largest of them can be kept small. As a result, it becomes more likely that a larger number of partial reconfiguration areas can be provided on an FPGA accelerator of the same scale.
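As a software analogy for the dynamic reconfiguration area chain, each stage below has exactly one input stream and one output stream, and the stages are joined in a single unbranched line. This sketch uses Python generators; the stage functions `square`, `keep_even`, and `count` and the helper `run_chain` are hypothetical and only mimic the one-input, one-output shape. Generators pass items along lazily, which is an analogy and does not model the true parallel operation of circuits on an FPGA.

```python
def square(values):
    # One input stream, one output stream.
    for v in values:
        yield v * v


def keep_even(values):
    for v in values:
        if v % 2 == 0:
            yield v


def count(values):
    # Consumes the whole stream and emits a single number.
    yield sum(1 for _ in values)


def run_chain(stages, host_input):
    """Connect each stage's output to the next stage's input, in order."""
    stream = iter(host_input)      # data enters the chain from the host
    for stage in stages:
        stream = stage(stream)
    return list(stream)            # data returns to the host from the last stage


result = run_chain([square, keep_even, count], range(10))
print(result)
```

Because every stage exposes the same single-input, single-output interface, any stage can be placed at any position in the chain.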

The invention according to claim 5 is the arithmetic system according to claim 1, characterized in that, when the number of commands connected by pipe instructions is smaller than the number of dynamic reconfiguration chains prepared on the FPGA, the support device inserts bypass commands after the final-stage command and pipes them together so that the number of dynamic reconfiguration chains equals the number of piped commands.

In this way, the bypass command is written into the unused partial reconfiguration areas, where it passes the data along to the end of the chain and finally returns it to the host through the terminal hardware pipe. Even when the number of commands connected by pipe instructions is smaller than the number of dynamic reconfiguration chains prepared on the FPGA, a FIFO-type data structure with one input and one output per partial reconfiguration area (the dynamic reconfiguration area chain) can be formed. Any method, such as a shortcut, may be used as long as no processing is performed after the final-stage command (that is, the data is bypassed).
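The padding with bypass commands can be sketched as follows. This is an illustration under assumptions: the function name `pad_with_bypass` is hypothetical, and the “number of dynamic reconfiguration chains” is interpreted here as the number of partial reconfiguration areas in the chain (`chain_length`).

```python
BYPASS = "bypass"  # command that forwards its input unchanged


def pad_with_bypass(commands, chain_length):
    """Append bypass commands so that every area in the chain is occupied.

    chain_length is assumed to be the number of partial reconfiguration
    areas in the chain prepared on the FPGA.
    """
    if len(commands) > chain_length:
        raise ValueError("pipeline is longer than the chain")
    return list(commands) + [BYPASS] * (chain_length - len(commands))


padded = pad_with_bypass(["bc", "grep 1", "wc -l"], chain_length=5)
print(" | ".join(padded))
```

The padded pipeline has the same length as the chain, so the output of the last real command is carried through the remaining areas to the terminal hardware pipe.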

According to the present invention, it is possible to provide an arithmetic system, a control method for the arithmetic system, and a program that improve accelerator utilization efficiency while maintaining the existing programming method.

FIG. 1 is a configuration diagram showing an arithmetic system according to an embodiment of the present invention.
FIG. 2 is a diagram explaining the operation of the arithmetic system according to the embodiment of the present invention, in which (a) shows a command line as input to the shell, (b) shows the blocks of the support device, which is the software that provides the shell, and (c) shows the connection of the partial reconfiguration areas on the FPGA accelerator.
FIG. 3 is a flowchart showing the operation of the support device (shell software) of the arithmetic system according to the embodiment of the present invention.
FIG. 4 is a diagram explaining that the FPGA accelerator can be used without surplus area when the partial circuit configurations of the commands of the arithmetic system according to the embodiment of the present invention are uniform in scale, in which (a) shows commands with a uniform circuit scale, together with a view of the commands loaded into the partial reconfiguration areas.
FIG. 5 is a schematic configuration diagram of an arithmetic system in which an arithmetic unit specialized for specific processing is added to a general-purpose computer.
FIG. 6 is a diagram showing the arithmetic system of FIG. 5 with its operation added.
FIG. 7 is a diagram showing an example in which the arithmetic system of FIG. 6 is modified, with its operation added.
FIG. 8 is a diagram showing the decrease in FPGA utilization efficiency when the circuit scale differs greatly between partial circuit configurations, in which (a) shows a command with a large circuit scale and a command with a small circuit scale, together with a view of the commands loaded into the partial reconfiguration areas.

Hereinafter, an arithmetic system and the like according to a mode for carrying out the present invention (hereinafter referred to as “the present embodiment”) will be described with reference to the drawings.
(Background explanation)
[Existing configuration 1]
Execution of a group of commands using a shell on a general-purpose computer will be described.
A shell is a program that interprets commands and the like typed by the user on the command line and has the computer execute them. While the shell is running, the user can type commands on the command line and execute them. The standard output of one command can also be connected directly to the standard input of another command instead of to a file; this operation is called a pipe. On the command line, a pipe is written by separating a series of commands with the pipe symbol “|”. A series of commands connected by pipes is called a pipeline.
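The pipe operation described above, in which the standard output of one command is connected directly to the standard input of the next, can be reproduced with Python's standard `subprocess` module. In this sketch the two “commands” are hypothetical stand-ins run as child Python processes (a producer that prints three lines and a consumer that counts lines), so that the example does not depend on any particular Unix command being installed.

```python
import subprocess
import sys

# Hypothetical stand-ins for two shell commands, run as child processes.
PRODUCER = "print('alpha'); print('beta'); print('gamma')"
CONSUMER = "import sys; print(sum(1 for _ in sys.stdin))"


def run_two_stage_pipeline():
    """Connect the producer's standard output to the consumer's standard input."""
    first = subprocess.Popen(
        [sys.executable, "-c", PRODUCER],
        stdout=subprocess.PIPE,
    )
    second = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=first.stdout,        # the pipe: output of first feeds second
        stdout=subprocess.PIPE,
        text=True,
    )
    first.stdout.close()           # the parent no longer needs its copy of the pipe
    output, _ = second.communicate()
    first.wait()
    return int(output.strip())


line_count = run_two_stage_pipeline()
print(line_count)
```

Both child processes are started before any data flows, and only the output of the final stage is returned to the caller, in the same way that a shell displays only the output of the last command in a pipeline.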

To obtain a list of the files on a general-purpose computer, for example, the user types the following into the command line shell via a keyboard or the like:
$ ls /
(The $ at the beginning of the line is the prompt symbol, which is displayed from the start and indicates that the shell is waiting for user input.)
(ls is the command that displays information about files and directories on Unix-like OS systems.)
(/ denotes the top level of the hierarchy among the character strings that indicate a position in the file system.)

As a result, the user receives a file listing such as:
bin etc initrd.img lib64 media proc sbin tmp vmlinuz
boot lib libx32 mnt root srv usr dev home
lib32 lost+found opt run sys var
The listing shown here is an example from the Linux (registered trademark) OS.

By using a pipe on the shell (the function of Unix-like operating systems that connects the standard output of one command to the standard input of the next), the user can also type, for example:
$ jot -r 1000000 0 65535 | paste -d ' ' - - \
| awk '{print $1,"^2+",$2,"^2<65535^2"}' \
| bc | grep 1 | wc -l | awk '{print $1,"/125000"}' | bc -l

(On the shell, use of a pipe is specified by connecting commands in order with the “|” symbol; the output of the command placed before the pipe becomes, unchanged, the input of the command after the pipe.)
(The \ symbol at the end of a line means that the line break is ignored and the next line is treated as a direct continuation, so the whole is handled as one line.)
(jot outputs a group of numbers that follow a given rule. With these arguments, it outputs 1000000 numbers between 0 and 65535 in succession, separated by line breaks.)
(paste joins one or more files horizontally, line by line. Here it takes the input numbers two at a time, separates each pair with a space, and inserts a line break after each pair.)

(awk is a command that processes text according to a program. The first awk builds, from the two numbers x and y supplied on each line, a character string representing the expression “x^2 + y^2 < 65536^2”.)
(bc interprets the input character string as a mathematical expression and displays the result. The first bc evaluates the inequality “x^2 + y^2 < 65536^2” and outputs 1 when it holds and 0 otherwise.)
(grep checks whether each input line contains the substring given as its argument and outputs the whole line only if it does. Here only the lines containing 1 are output.)
(wc counts, among other things, the number of lines in its input. Here the number of lines received from the previous stage is displayed as a number.)
(The second awk builds the character string “z / 125000” from the line count z that it receives.)
(The second bc displays the result of evaluating “z / 125000” as a mathematical expression.)
(This series of commands is intended to derive an approximation of pi by the Monte Carlo method.)

As a result, the user receives output such as:
3.14116000000000000000
In this case, an approximation accurate to roughly the first four digits of pi, 3.14159..., was obtained.
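The arithmetic behind the pipeline can be checked with an equivalent single-process sketch. The 1000000 random numbers form 500000 points (x, y); the fraction of points that fall inside the quarter circle approximates pi/4, so pi is approximately 4 × hits / 500000 = hits / 125000, which is the division performed by the final awk and bc stages. The function name `estimate_pi` and the fixed random seed are assumptions made for this illustration.

```python
import random


def estimate_pi(num_values=1_000_000, limit=65535, seed=0):
    """Monte Carlo estimate of pi that mirrors the stages of the shell pipeline."""
    rng = random.Random(seed)          # fixed seed only to make the run repeatable
    num_points = num_values // 2       # paste: two numbers per line
    hits = 0
    for _ in range(num_points):
        x = rng.randint(0, limit)      # jot: random integers in 0..65535
        y = rng.randint(0, limit)
        if x * x + y * y < limit * limit:   # bc: evaluate the inequality
            hits += 1                       # grep 1 | wc -l: count the 1s
    return hits / (num_points / 4)     # 500000 / 4 = 125000 for the defaults


pi_estimate = estimate_pi()
print(pi_estimate)
```

Unlike the shell version, this sketch runs every stage inside one process, so it shows only the calculation and not the parallel execution of the piped commands.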

With existing configuration 1, operations that cannot be achieved with a single existing command can be achieved by describing how commands are connected and adjusting their arguments. In other words, the configuration lets the user program arbitrary operations on a general-purpose computer.

<Performance degradation in existing configuration 1>
When a shell (here called the parent shell, the one into which the user types) receives and executes a pipe-connected command sequence such as the one above, it normally starts, in parallel, as many child shells as there are commands, each as a process that executes a single command.
Next, their inputs and outputs are connected according to the connections specified by the pipe symbol “|”, and only the output of the final command is displayed as the output of the parent shell.
CPUs in common use today have multiple execution cores. The child shell processes may be assigned to different cores of the CPU and executed in parallel. When more commands are piped together than there are CPU cores, the resulting child shell processes are in many cases executed virtually in parallel by time sharing, with the time for which each process occupies a CPU core divided into fine slices.
Switching between the processes that occupy a CPU core is called a context switch, and during the switch both the outgoing and the incoming process are stalled. In an environment where many context switches occur, that is, when a command line that pipes together far more commands than there are cores is issued, execution is expected to become very slow.

Typically, a pipe is placed in memory as a FIFO (First In First Out) type data structure or the like, replicated once per pipe. Each child shell process reads data from the in-memory FIFO structure corresponding to the pipe on its input side and writes data to the in-memory FIFO structure corresponding to the pipe on its output side. The bus connecting the CPU and the memory is shared among the processes, which are made to appear to run in parallel by granting each of them time-sliced (time-division) access.
When commands are piped together, each connection incurs one pair of operations, a memory write by the preceding command and a memory read by the succeeding command, and these share the capacity of a single bus.
Therefore, even when there are enough CPU cores for the number of piped commands, the transfer capacity available to each pipe is limited, and execution is expected to slow down.

[Existing configuration 2]
Existing configuration 2 adds an accelerator equipped with reconfigurable arithmetic circuits.
In addition to existing configuration 1, existing configuration 2 includes the accelerator and the accelerator card described below.
The accelerator is a known device, typified by an FPGA, on which multiple dedicated arithmetic circuits can be virtually configured and whose arithmetic circuits can be partially reconfigured during operation.
The accelerator card is a known card that carries the accelerator together with the peripheral circuits for exchanging data through, for example, a general-purpose bus provided in the general-purpose computer.

In existing configuration 2, a single command is realized by an arithmetic circuit temporarily configured on the FPGA. Instead of a group of instructions that execute the command's processing directly on the CPU, the file defining the command contains the configuration information of the arithmetic circuit that realizes the same processing (referred to as the circuit configuration), together with the procedures for installing and erasing the arithmetic circuit.

In the above configuration, when a command is started, its circuit configuration is written through the expansion bus into a free area of the FPGA, and the arithmetic circuit is temporarily installed on the FPGA. At the same time, the system is set up so that the input to the command is fed to the arithmetic circuit through the expansion bus and the output of the command is taken out through the expansion bus. After the command finishes, the configuration information of the arithmetic circuit is erased from the FPGA.
Existing configuration 2 maps each pipe onto memory as a FIFO structure. The processing that connects one command to the next is the same as in existing configuration 1.

<Effects of existing configuration 2>
Existing configuration 2 can perform operations using dedicated arithmetic circuits, with pipeline processing optimized for the data. For this reason, when compared under a fixed cost criterion such as the same power or the same circuit area, the operation speed can in many cases be made much higher than with processing by the CPU.
In addition, when a group of pipe-connected commands is executed in existing configuration 2, all of the commands run in parallel as long as the FPGA has enough free area. This suppresses the loss of execution speed caused by context switches and the like. That is, compared with a CPU-only configuration, existing configuration 2 improves the processing performance of each individual command and prevents the performance loss that accompanies parallel execution.

On the other hand, because existing configuration 2 implements pipes in the same way as existing configuration 1, every time commands are connected by a pipe, a large number of memory accesses and many data exchanges between the CPU and the FPGA through the expansion bus occur. As in existing configuration 1, this lowers execution speed because of the limited capacity of the bus between the CPU and the memory.
Furthermore, the transfer capacity of the CPU expansion bus is in many cases smaller than the usual transfer capacity between the CPU and the memory. For this reason, the data exchange between the CPU and the FPGA becomes a bottleneck, and execution speed may be even lower than in existing configuration 1.
(Embodiment)

FIG. 1 is a configuration diagram showing an arithmetic system according to an embodiment of the present invention.
As shown in FIG. 1, the arithmetic system 100 includes a CPU 110 serving as the host, a memory 120 that stores data, an auxiliary storage device 130 (external storage device) that stores a group of partial circuit configurations, an FPGA accelerator 140 (accelerator device) that can load the partial circuit configurations and dynamically reconfigure its arithmetic circuits, an expansion bus 150 that connects these components, an input device 160, and a support device 170.

<CPU 110>
The CPU 110 executes a program. When the program calls a function or the like for which a partial circuit configuration has been prepared, the CPU 110 starts the support device 170 and sends it the partial circuit configuration group 130A prepared in advance in the auxiliary storage device 130. The CPU 110 also executes the shell software that implements the control method of the support device 170.

<Memory 120>
The memory 120 temporarily stores the data 120A to be processed.
The data 120A temporarily stored in the memory 120 is written to the FPGA accelerator 140 through the expansion bus 150, and the result is likewise taken out through the expansion bus 150. However, as shown in FIG. 1, a single input of the data 120A is sent to the FPGA accelerator 140 through the expansion bus 150 (see symbol h in FIG. 1), and a single output of the result computed by the arithmetic circuits on the FPGA accelerator 140 is returned to the CPU 110 (see symbol j in FIG. 1).

<Auxiliary storage device 130>
The auxiliary storage device 130 stores commands or program groups for the CPU 110 that have been compiled in advance into a form that can be loaded at an arbitrary position among the partial reconfiguration regions. As shown in FIG. 1, the auxiliary storage device 130 stores the partial circuit configuration group 130A created in advance. Here, the partial circuit configuration group 130A consists of partial circuit configurations A to D.

<FPGA accelerator 140>
The FPGA accelerator 140 is connected to the CPU 110 via the expansion bus 150.
The FPGA accelerator 140 is an FPGA that can dynamically reconfigure its arithmetic circuits by loading the net 170A, set up by the support device 170, in which the partial circuit configuration group 130A is connected. In the case of FIG. 1, the FPGA accelerator 140 has partial reconfiguration regions (1) to (9).

Using the mapping onto the partial reconfiguration regions (1) to (9) set by the support device 170, and a FIFO-type data structure in which adjacent partial reconfiguration regions are connected one-dimensionally by one data path 141 each (symbol i in FIG. 1) (hereinafter referred to as a hardware pipe where appropriate), the FPGA accelerator 140 starts all target commands in parallel and executes the processing.
The reconfiguration region map that the support device 170 configures on the FPGA in advance takes the form of a "dynamic reconfiguration region chain": each partial reconfiguration region has a FIFO-type data structure with one input and one output, and the regions are interconnected so that outputs and inputs line up in a single row.

That is, the shape of the reconfiguration region map is configured on the FPGA accelerator 140 in advance by the support device 170. The map has a FIFO-type data structure with one input and one output per partial reconfiguration region, and the regions are interconnected so that they form a single row: a "dynamic reconfiguration region chain". As indicated by symbol h in FIG. 1, a single input arrives via the expansion bus 150 at partial reconfiguration region (1); the single output of region (1) feeds the single input of region (2) (see symbol i in FIG. 1); the single output of region (2) feeds the single input of region (3); the same FIFO-type connection repeats in order; and the single output of region (9) is sent to the CPU 110 via the expansion bus 150 (see symbol j in FIG. 1).
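The wiring of the chain can be sketched as a small data structure. This is an illustration only: the names `Region`, `build_chain`, `expansion_bus`, and `region_N` are assumptions introduced here and do not appear in the publication.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """One partial reconfiguration region with one input and one output."""
    index: int                      # 1-based position in the chain
    source: str                     # where the single input comes from
    sink: str                       # where the single output goes
    command: Optional[str] = None   # circuit currently loaded, if any


def build_chain(num_regions: int = 9) -> List[Region]:
    """Wire regions (1)..(N) into one unbranched line.

    Region (1) reads from the expansion bus, region (N) writes back to it,
    and every other link is an on-chip hardware pipe.
    """
    if num_regions < 1:
        raise ValueError("the chain needs at least one region")
    chain = []
    for i in range(1, num_regions + 1):
        source = "expansion_bus" if i == 1 else f"region_{i - 1}"
        sink = "expansion_bus" if i == num_regions else f"region_{i + 1}"
        chain.append(Region(index=i, source=source, sink=sink))
    return chain


chain = build_chain(9)
for region in chain:
    print(f"({region.index}) {region.source} -> {region.sink}")
```

Only the first and last entries touch the expansion bus, which mirrors symbols h and j in FIG. 1; every intermediate link corresponds to a data path 141.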

<Expansion bus 150>
The expansion bus 150 is a connection bus that interconnects the CPU 110, the memory 120, the auxiliary storage device 130, and the FPGA accelerator 140 (accelerator device).
However, the arithmetic system 100 differs from the arithmetic system 10 of FIG. 6 in the number of data streams that enter and leave through the expansion bus 150. In the arithmetic system 100, a single input of the data to be processed is sent to the FPGA accelerator 140 through the expansion bus 150 (see symbol h in FIG. 1), the arithmetic circuits on the FPGA accelerator 140 connected by the dynamic reconfiguration region chain perform the computation, and finally a single output of the result is returned to the CPU 110 (see symbol j in FIG. 1).

<Input device 160>
The input device 160 displays a command line prompt on a display unit and accepts commands entered on the command line.
The input device 160 provides the user with the same interface as an ordinary shell. It accepts program input from the user in the form of a shell one-liner. A one-liner is a composite command in which commands are joined with the pipe symbol "|" and written on one line.
Many programming languages and environments can be used to program computation on a general-purpose computer. The present embodiment uses one-liners for a shell, which is the software that provides an interactive user interface on Unix-like operating systems.
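To make the one-liner format concrete, the following sketch splits a command line into its piped commands. It is not taken from the publication: the function name `split_one_liner` is hypothetical, and the handling of quotes is a deliberate simplification.

```python
from typing import List


def split_one_liner(line: str) -> List[str]:
    """Split a shell one-liner on unquoted '|' characters.

    Simplification: only single and double quotes are handled; backslash
    escapes, '||' and other shell syntax are out of scope for this sketch.
    """
    commands = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    for ch in line:
        if quote is not None:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == "|":
            commands.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if quote is not None:
        raise ValueError("unterminated quote in command line")
    commands.append("".join(current).strip())
    return commands


line = "jot -r 1000000 0 65535 | paste -d ' ' - - | grep 1 | wc -l"
print(split_one_liner(line))
```

Tracking quotes matters because a "|" inside a quoted argument belongs to that command and must not be treated as a pipe.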

The input is not limited to shell one-liners; any language characterized by a notation that connects the output of a function or the like unidirectionally to the next function may be used. One example is the programming language Elixir. Elixir provides a pipeline operator, which passes the result of one function call as the first argument of the next function call, so it can be used in the same way as a shell one-liner.

When programming with shell one-liners, the partial circuit configurations for the FPGA are prepared per command, that is, per unit that is invoked from the shell or the like on a Unix-like OS. Only the commands provided by the OS may be used, or commands designed by the user may be registered. In the latter case, the registration method depends on the shell. Typically, a command is registered by adding its location to a shell environment variable.

<Support device 170>
The support device 170 is shell software that maps each input command to a partial reconfiguration region, which is a partially reconfigurable region of the FPGA accelerator 140, and controls the regions so that they are connected by data paths. In the present embodiment, this shell software is executed by the CPU 110.
The support device 170 replaces the pipe symbols of the one-liner format with hardware pipes. Specifically, while providing the user with the same interface as an ordinary shell on the CPU 110, the support device 170 ensures that, when a one-liner command is entered at the input device 160, the commands before and after each pipe are mapped to two regions in the FPGA that are joined by a hardware pipe.

When a pipe instruction is issued, that is, an instruction that connects the output data of one command to the input data of the next command, the support device 170 maps the preceding command to the preceding partial reconfiguration region and the following command to the following partial reconfiguration region, and connects the two regions by a data path so that the regions form a single unbranched line (for example, a one-dimensional connection with one data path per link). After repeating this mapping and connection for every pipe instruction, the support device 170 configures the input of the first command to be supplied from the CPU 110 and the output of the last command to be returned to the CPU 110.

The support device 170 instructs the FPGA accelerator 140 to perform computation by calling the external program that corresponds to the command name of the input command.
When the number of commands connected by pipe instructions is smaller than the number of regions in the dynamic reconfiguration region chain prepared on the FPGA, the support device 170 inserts bypass commands after the final-stage command and pipes them together, so that the number of piped commands equals the number of regions in the chain.

The operation of the arithmetic system 100 configured as described above is explained below.
<Preparation>
As in existing configuration 2 described above, the commands used on the Unix-like OS are precompiled into files that contain a circuit configuration together with the processing for installing and erasing that configuration.
However, the input and output of each command are not connected directly to the expansion bus 150 that links the CPU 110 and the FPGA accelerator 140. Instead, each command is set up to take its input from one FIFO-type data structure (hardware pipe) and to send its output to one hardware pipe.
In parallel with this, one command is defined that only passes the input from the preceding stage to the output for the following stage and does nothing else. Because the user never calls it explicitly, it needs no name, but it is referred to here as the bypass command.
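The one-input, one-output convention and the bypass command can be modeled in software with generators. This is a sketch under assumptions: the `Stage` type alias, the `grep` stand-in, and the `connect` helper are hypothetical names, and Python iterators only imitate the behavior of hardware pipes.

```python
from typing import Callable, Iterable, Iterator, List

# A stage consumes one stream (its input pipe) and produces one stream
# (its output pipe). This type alias is an assumption of the sketch.
Stage = Callable[[Iterable[str]], Iterator[str]]


def bypass(lines: Iterable[str]) -> Iterator[str]:
    """Forward every item unchanged, as the bypass command does."""
    for line in lines:
        yield line


def grep(pattern: str) -> Stage:
    """Hypothetical stand-in for a precompiled command: keep matching lines."""
    def stage(lines: Iterable[str]) -> Iterator[str]:
        for line in lines:
            if pattern in line:
                yield line
    return stage


def connect(stages: List[Stage], source: Iterable[str]) -> Iterator[str]:
    """Feed each stage's output into the next stage's input, in order."""
    stream: Iterable[str] = source
    for stage in stages:
        stream = stage(stream)
    return iter(stream)


result = list(connect([grep("1"), bypass, bypass], ["10", "0", "21"]))
print(result)  # ['10', '21']
```

Because `bypass` has the same interface as any other stage, it can fill unused positions without changing the data that reaches the end of the chain.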

As shown in FIG. 1, the area of the FPGA accelerator 140 into which arithmetic circuits are loaded is divided in advance into a group of dynamically reconfigurable partial reconfiguration regions (regions (1) to (9)) and is prepared as a reconfiguration chain structure in which hardware pipes connect regions (1) to (9).
Each of the partial reconfiguration regions (1) to (9) in the dynamic reconfiguration region chain is sized so that any region can hold any one of the precompiled circuit configurations described above.
The precompiled circuits can each be expected to have their own shape. One way to meet this condition is to build regions (1) to (9) using a shape derived from the common part of all the circuit groups.
Accordingly, a circuit configuration is loaded in a form that targets the position of whichever partial reconfiguration region is designated.

The hardware pipe that feeds the first partial reconfiguration region (1) of the dynamic reconfiguration region chain is connected to the input from the CPU 110 to the FPGA accelerator 140 through the expansion bus 150 (see symbol h in FIG. 1). Likewise, the hardware pipe that carries the output of the last partial reconfiguration region (9) is connected as the output from the FPGA accelerator 140 to the CPU 110 through the expansion bus 150 (see symbol j in FIG. 1).
The arrangement of these hardware pipes is itself implemented as part of the circuit configuration for the FPGA accelerator 140. If, for example, the program that loads this circuit configuration is set to start automatically when the OS boots, the preparation can be completed without waiting for an instruction from the user.

<Description of operation>
FIG. 2 illustrates the operation of the arithmetic system 100: (a) shows the command line entered into the shell, (b) shows the block of the support device 170, which is the software that provides the shell, and (c) shows the connections of the partial reconfiguration regions on the FPGA accelerator 140. Components identical to those in FIG. 1 are given the same reference numerals.
The input device 160 (see FIG. 1) accepts program input from the user in the form of a shell one-liner.
Following the command prompt ">", the user types the command line shown in the frame of FIG. 2(a) into the shell and then presses Enter to execute it. The user writes the series of commands separated by the pipe symbol "|".

In the example of FIG. 2, the user enters the command “$ jot -r 1000000 0 65535” (symbol a1 in FIG. 2), followed by the command “paste -d ' ' - -” (symbol a2) after a pipe symbol "|". In the same way, each separated by "|", the user enters “awk '{print $1,"^2+",$2,"^2<65535^2"}'” (symbol a3), “bc” (symbol a4), “grep 1” (symbol a5), “wc -l” (symbol a6), “awk '{print $1,"^2+",$2,"^2<65535^2"}'” (symbol a7), and finally “bc ?” (symbol a8), and then presses Enter to execute the command line. As shown in FIG. 2(a), the eight commands (symbols a1 to a8 in FIG. 2) are connected in order with the "|" symbol, and the output of the command placed before each pipe becomes, unchanged, the input of the command after it.
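As a reading aid, the first six stages of this one-liner (a1 to a6) can be mirrored in ordinary software. The sketch below is an interpretation rather than part of the publication: the function name and the `seed` parameter are assumptions, and stages a7 and a8 are left out because their arguments are not fully legible in the text.

```python
import random


def count_inside_quarter_circle(num_values: int, limit: int = 65535,
                                seed: int = 0) -> int:
    """Mirror stages a1-a6: random values, pair them, test, count."""
    rng = random.Random(seed)
    # jot -r num_values 0 limit: random integers between 0 and limit
    values = [rng.randint(0, limit) for _ in range(num_values)]
    # paste -d ' ' - -: join consecutive values into (x, y) pairs
    pairs = zip(values[0::2], values[1::2])
    # awk + bc: evaluate x^2 + y^2 < limit^2 (bc prints 1 or 0)
    flags = (1 if x * x + y * y < limit * limit else 0 for x, y in pairs)
    # grep 1 | wc -l: count the lines that evaluated to 1
    return sum(flags)


num_values = 1_000_000
inside = count_inside_quarter_circle(num_values)
pairs = num_values // 2
print(inside, 4 * inside / pairs)
```

The count is the number of random points that fall inside a quarter circle of radius 65535. Four times its ratio to the number of pairs approximates pi, which is presumably what the later stages of the one-liner go on to compute.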

As indicated by symbol k in FIG. 2(b), the support device 170 replaces the pipe symbols "|" of the input command line shown in FIG. 2(a) with hardware pipes and writes (loads) the result into the FPGA accelerator 140 as the net 170A (see FIG. 1) in which the configurations are connected. Here, as indicated by symbol l in FIG. 2(b), the support device 170 maps the input commands to the partial reconfiguration regions (1) to (9) of the FPGA accelerator 140 and controls regions (1) to (9) so that they are connected by the data paths 141. That is, the commands before and after each pipe are mapped to adjacent partial reconfiguration regions joined by a hardware pipe in the FPGA accelerator 140. The details are as follows.

The command “$ jot -r 1000000 0 65535” in FIG. 2(a) (symbol a1) is loaded into partial reconfiguration region (1) of the FPGA accelerator 140 in FIG. 2(c), and the command “paste -d ' ' - -” (symbol a2) is loaded into region (2). Next, the command “awk '{print $1,"^2+",$2,"^2<65535^2"}'” (symbol a3) is loaded into region (3), and the command “bc” (symbol a4) into region (4). In the same way, all of the commands in FIG. 2(a) are loaded into partial reconfiguration regions of the FPGA accelerator 140. Because there are eight commands, the command “bc ?” (symbol a8) is the final stage and is loaded into region (8).

In the present embodiment, the FPGA accelerator 140 has nine partial reconfiguration regions (1) to (9), so with eight commands one region, region (9), is left over. The pipe connection, however, must follow the dynamic reconfiguration region chain, so a bypass command is loaded into the region after the final-stage command (here, region (9)), making the number of piped commands equal to the number of regions in the chain.

In addition to the mapping onto partial reconfiguration regions (1) to (9), adjacent regions are connected one-dimensionally by one data path 141 each (the dynamic reconfiguration region chain). As indicated by symbol h in FIG. 2(c), a single input enters region (1) on the FPGA accelerator 140; the single output of region (1) enters region (2) (see symbol i in FIG. 2); the single output of region (2) enters region (3); the same FIFO-type connection repeats in order; and the single output of region (9) leaves the FPGA accelerator 140 (see symbol j in FIG. 2).

Each individual command is realized by an arithmetic circuit temporarily configured on the FPGA accelerator 140. After the command finishes, the configuration information of the arithmetic circuit is erased from the FPGA accelerator 140.
In the arithmetic system 100, pipes are implemented as hardware pipes on the FPGA accelerator 140, and all hardware pipes operate in parallel. As a result, the input and output data pass through the expansion bus 150 at most once. This suppresses the slowdown that stems from data exchange between the CPU 110 and the FPGA accelerator 140.
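The difference in bus traffic can be illustrated with a simple count. The model below is an assumption made for this sketch, not a figure from the publication: it counts one transfer per data stream entering or leaving the FPGA and ignores the loading of circuit configurations.

```python
def bus_transfers_per_command_roundtrip(num_commands: int) -> int:
    """Existing configuration 2: every command's input and output cross the bus."""
    return 2 * num_commands


def bus_transfers_hardware_pipe(num_commands: int) -> int:
    """Embodiment: only the head input and the tail output cross the bus."""
    return 2 if num_commands > 0 else 0


for k in (1, 4, 8):
    print(k, bus_transfers_per_command_roundtrip(k), bus_transfers_hardware_pipe(k))
```

Under this model the traffic of existing configuration 2 grows with the length of the pipeline, while the hardware-pipe arrangement stays at one inbound and one outbound stream regardless of how many commands are chained.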

<Operation flowchart>
FIG. 3 is a flowchart showing the operation of the support device 170 (shell software). This shell software is executed by the CPU 110 of FIG. 1.
Processing starts when a command string is received.
In step S11, the CPU 110 sets the maximum number of accepted commands N and the current number of accepted commands n. For example, N = 9 and n = 0.
In step S12, the CPU 110 determines whether the end of the command string has been reached.
If the end has not been reached (step S12: No), in step S13 the CPU 110 determines whether n is equal to N.
If n is not equal to N (step S13: No), in step S14 the CPU 110 reads the command string until a pipe symbol "|" or the end of the string appears.

In step S15, the CPU 110 increments n by 1, loads the command just read into partial reconfiguration region n, and returns to step S12.
The processing of steps S11 to S15 is explained with reference to FIG. 2.
As shown in FIG. 2(c), partial reconfiguration regions (1) to (9) are configured on the FPGA accelerator 140, so the maximum number of accepted commands is set to N = 9. As long as n = N does not hold in step S13, the following processing is repeated until the command string reaches its end in step S12.

For example, the command “$ jot -r 1000000 0 65535” in FIG. 2(a) (symbol a1) is first loaded into partial reconfiguration region (1) of the FPGA accelerator 140 in FIG. 2(c), and the command “paste -d ' ' - -” (symbol a2) is loaded into region (2). Then, as indicated by symbol h in FIG. 2(c), a single input enters region (1) on the FPGA accelerator 140, and the single output of region (1) enters region (2) (see symbol i in FIG. 2). The same processing is repeated, and when the command string reaches its end in step S12, the loop exits and processing moves to step S17. In the example of FIG. 2 there are eight commands, so the command “bc ?” (symbol a8) is the final stage: it is loaded into region (8), the single output of region (8) is fed to region (9), and processing moves to step S17.

Returning to the flow of FIG. 3, if n is equal to N in step S13 (step S13: Yes), in step S16 the CPU 110 displays a message that the number of connected commands exceeds the limit and ends the processing (waits for the next command). The number of connected commands exceeds the limit when more commands are connected by pipes than there are regions in the dynamic reconfiguration region chain prepared on the FPGA. In this case, the user is informed by an error message or the like that the operation cannot be performed, and the processing of this flow ends.

If, on the other hand, the command string has reached its end in step S12 (step S12: Yes), processing moves to step S17. In step S17, the CPU 110 determines whether n is equal to N.
If n is not equal to N (step S17: No), in step S18 the CPU 110 increments n by 1, loads (inserts) a bypass command into partial reconfiguration region n, and returns to step S17. In the case of FIG. 2, for example, there are eight commands for regions (1) to (9), so a bypass command is loaded into the region after the final-stage command (region (9)), making the number of piped commands equal to the number of regions in the chain.

If the current acceptance count n equals the maximum command acceptance count N (step S17: Yes), the process proceeds to step S19.
In step S19, the CPU 110 runs the commands introduced into all the partial reconfiguration areas in parallel.
In the case of FIG. 2, the commands introduced into partial reconfiguration areas (1) to (9) in the dynamic reconfiguration area chain run in parallel, and the single output of partial reconfiguration area (9), which is the operation result, is output from the FPGA accelerator 140 (see symbol j in FIG. 2).
In step S20, the process pauses until all the commands have finished. When all the commands have finished, the processing of this flow ends (waits for the next command).
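Steps S19 and S20 can be imitated in software as a rough analogy. In the sketch below, threads and queues stand in for the partial reconfiguration areas and hardware pipes; this is not how the FPGA operates. The names `run_chain`, `bypass`, and `_stage` are hypothetical, and each stage is assumed to transform one item at a time, which is a simplification compared with real shell commands.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the data stream


def bypass(item):
    """Pass data through unchanged, like the bypass command."""
    return item


def _stage(func, inbox, outbox):
    """Read items from inbox, apply func, and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)
            return
        outbox.put(func(item))


def run_chain(stage_funcs, inputs):
    """Start every stage at once (S19), wait for all to finish (S20), collect output."""
    pipes = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
    workers = [
        threading.Thread(target=_stage, args=(func, pipes[i], pipes[i + 1]))
        for i, func in enumerate(stage_funcs)
    ]
    for worker in workers:
        worker.start()
    for item in inputs:          # input enters only at the head of the chain
        pipes[0].put(item)
    pipes[0].put(_END)
    for worker in workers:       # pause until all stages have finished
        worker.join()
    results = []
    while True:                  # output leaves only at the tail of the chain
        item = pipes[-1].get()
        if item is _END:
            break
        results.append(item)
    return results
```

For example, `run_chain([lambda x: x + 1, lambda x: x * 2, bypass], [1, 2, 3])` returns `[4, 6, 8]`; the trailing `bypass` stage forwards the data to the end of the chain without changing it.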

Thus, in the above flow, when the shell software receives an input command, it first determines whether the number of commands connected by pipes is larger than, smaller than, or equal to the number of dynamic reconfiguration area chains prepared on the FPGA accelerator 140. If the number of commands is larger than the number of dynamic reconfiguration area chains, the user is notified, for example by an error message, that the operation cannot be performed, and the process ends.

If the determination shows that the number of commands connected by pipes is smaller than the number of reconfiguration chains, an appropriate number of bypass commands are introduced (inserted) after the final-stage command and connected by pipes, so that the number of dynamic reconfiguration chains equals the number of commands connected by pipes.
When the partial reconfiguration area into which the last command of the input command sequence is written differs from the partial reconfiguration area at the end of the dynamic reconfiguration area chain, the inserted bypass commands are written into the unused partial reconfiguration areas. They pass the data along to the end of the chain and finally return it to the CPU through the hardware pipe at the end.

Next, the parent shell starts child shells, each of which executes only one of the individual commands separated by pipes. In doing so, it instructs them to introduce the circuit configurations in order from the start of the dynamic reconfiguration area chain, following the order of the connected commands.
Finally, the input from the parent shell is connected to the hardware pipe at the start of the dynamic reconfiguration area chain, and the result output from the hardware pipe at the end is presented to the user as standard output.
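The ordering rule for introducing circuit configurations can be illustrated with a small planning function. Everything specific here is an assumption made for illustration: the function name `plan_chain`, the `.bit` suffix, and the idea that a configuration is looked up by command name plus suffix do not come from the source, which only states that configurations are introduced in command order from the start of the chain.

```python
import shlex


def plan_chain(commands, config_suffix=".bit"):
    """Assign each piped command to a chain position, starting from area 1.

    Returns (area number, hypothetical configuration name, full command) tuples.
    Assumes every command string is non-empty.
    """
    plan = []
    for area, command in enumerate(commands, start=1):
        name = shlex.split(command)[0]  # command name without its arguments
        plan.append((area, name + config_suffix, command))
    return plan
```

A parent shell could iterate over such a plan and hand one entry to each child shell, so that the first command lands in area 1, the second in area 2, and so on.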

As described above, the arithmetic system 100 (see FIG. 1) according to the present embodiment includes an input device 160 that displays a command line prompt and accepts commands entered on the command line, and a support device 170 that maps the input commands to partial reconfiguration areas of the FPGA accelerator 140 and controls the partial reconfiguration areas so that they are connected by data paths.

When a pipe instruction, which is an instruction that connects the output data of one command as the input data of the next command, is issued, the support device 170 maps the preceding command to the preceding partial reconfiguration area, maps the following command to the following partial reconfiguration area, and connects the preceding and following partial reconfiguration areas by a data path so that they form a single line without branching. After repeating this mapping and connection for all pipe instructions, it configures the input of the first command to be introduced from the CPU 110 and the output of the last command to be returned to the CPU 110.
The FPGA accelerator 140 starts all the target commands in parallel and executes the processing in a FIFO-type data structure defined by the mapping of the partial reconfiguration areas set by the support device 170 and by the one-dimensional connection of adjacent partial reconfiguration areas through one data path 141 each (see FIG. 1).

In the present embodiment, pipes are implemented as hardware pipes on the FPGA, and all the hardware pipes operate in parallel. As with existing configuration 2 described above, the present embodiment provides high-speed computation using dedicated arithmetic circuits and fully parallel execution of the command group.

In addition, as indicated by symbols h and j in FIG. 1, input and output data pass through the expansion bus 150 at most once, which suppresses the drop in execution speed caused by data exchange between the CPU and the FPGA that was a problem with existing configuration 2 described above.
That is, the present embodiment solves the problems of existing configuration 1 described above and, while presenting the user with the same interface as a shell, provides, as with existing configuration 2 described above, high-speed computation using dedicated arithmetic circuits, parallel execution of the operations, and suppression of the execution speed drop through fewer memory accesses.

In addition to the above effects, the present embodiment has the following effects.
<Maintaining the programming method>
The user enters a program into the arithmetic system 100 (see FIG. 1) as a one-liner command group. A shell one-liner using pipes is inherently an environment in which programs are written using only the unidirectional data flow of pipes, so its result is easy to map onto a reconfiguration chain.
The user can therefore program in the same way as when running a one-liner on the CPU 110 (see FIG. 1), without being aware of how the partial reconfiguration areas of the equipment in use are connected.
Note that commands exceeding the size of the dynamic reconfiguration area chain configured in advance in the FPGA accelerator 140 cannot be connected, so the support device 170 need only keep in mind the maximum number of connectable commands, which differs from one piece of equipment to another.

<Improving FPGA utilization efficiency>
In the arithmetic system 100 (see FIG. 1), the processing data path 141 (see FIGS. 1 and 2) is a straight line, as indicated by symbols h, i, and j in FIGS. 1 and 2. A high proportion of the hardware pipes in the dynamic reconfiguration area chain is therefore actually used, which improves the utilization efficiency of the FPGA accelerator 140.
In particular, in Unix-like operating systems, a design philosophy behind many commonly used commands is to build each command to perform a single function efficiently. Not all commands are written according to this philosophy, but the internal processing of commands written under this guideline tends to be of similar complexity, and when they are converted into partial circuit configurations their circuit sizes are also thought to tend toward a certain value.

When the differences in size between the partial circuit configurations of the commands are small and the sizes are relatively uniform, the surplus can be kept small when the configurations are introduced into partial reconfiguration areas sized to the largest of them. As a result, it becomes more likely that a larger number of partial reconfiguration areas can be placed on an FPGA accelerator 140 of the same size. This means that the condition that "commands exceeding the size of the dynamic reconfiguration area chain configured in advance in the FPGA cannot be connected" can be relaxed by increasing the number of dynamic reconfiguration area chains, giving the user an experience closer to that of execution on the CPU 110.
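The sizing argument can be made concrete with a small calculation. This is a sketch under stated assumptions: circuit sizes are given in arbitrary units, every partial reconfiguration area is sized to the largest configuration, and the function names `surplus_ratio` and `areas_that_fit` are hypothetical. Routing and other overheads are ignored.

```python
def surplus_ratio(config_sizes):
    """Fraction of allocated area left unused when every partial
    reconfiguration area is sized to the largest configuration."""
    area_size = max(config_sizes)
    allocated = area_size * len(config_sizes)
    return 1 - sum(config_sizes) / allocated


def areas_that_fit(fpga_size, config_sizes):
    """Number of equally sized areas that fit on an FPGA of the given size."""
    return fpga_size // max(config_sizes)
```

With uniform sizes such as `[100, 100, 100]` the surplus ratio is 0, whereas `[50, 100]` leaves a quarter of the allocated area unused. Likewise, a device of size 1000 holds 10 areas when the largest configuration is 100 but only 4 when it is 250.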

FIG. 4 is a diagram explaining that the FPGA accelerator 140 can be used without wasted area when the partial circuit configurations of the commands are uniform in size.
As shown in FIG. 4(a), the command group 41 of uniform circuit size is introduced into partial reconfiguration areas (1) to (6) shown in FIG. 4(b). Partial reconfiguration areas (1) to (6) are designed to accommodate the command group 41 of uniform circuit size. As shown in FIG. 4(b), the command group 41 is introduced into partial reconfiguration areas (1) to (6) without wasted area. Because little area is left over, the utilization efficiency of the FPGA accelerator 140 can be increased.

Of the processes described in the above embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed as desired unless otherwise specified.
The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to that illustrated, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

Some or all of the above configurations, functions, processing units, processing means, and the like may be implemented in hardware, for example by designing them as integrated circuits. The above configurations, functions, and the like may also be implemented in software, with a processor interpreting and executing programs that implement the respective functions. Information such as programs, tables, and files that implement each function can be held in memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC (Integrated Circuit) card, SD (Secure Digital) card, or optical disc.

100 Arithmetic system
110 CPU (host)
120 Memory
130 Auxiliary storage device (external storage device)
130A Partial circuit configuration group
140 FPGA accelerator (accelerator device)
141 Data path
150 Expansion bus
160 Input device
170 Support device

Claims

An arithmetic system comprising a CPU serving as a host, an external storage device that stores a partial circuit configuration group, an accelerator device capable of dynamically reconfiguring arithmetic circuits by loading the partial circuit configuration group, and a bus connecting them, the arithmetic system further comprising:
an input device that displays a command line prompt and accepts commands entered on the command line; and
a support device that maps an input command to a partial reconfiguration area, which is a partially reconfigurable area of the accelerator device, and controls the partial reconfiguration areas so that they are connected by data paths,
wherein the support device,
when a pipe instruction, which is an instruction that connects the output data of a command as the input data of the next command, is issued, maps the preceding command to a preceding partial reconfiguration area, maps the following command to a following partial reconfiguration area, connects the preceding and following partial reconfiguration areas by a data path so that they form a single line without branching, and, after repeating the mapping and the connection for all pipe instructions, configures the input of the first command to be introduced from the CPU and the output of the last command to be returned to the CPU, and
wherein the accelerator device
starts all target commands in parallel and executes processing in a FIFO-type data structure defined by the mapping of the partial reconfiguration areas set by the support device and by the connection of the preceding and following partial reconfiguration areas by data paths into a single line without branching.

The arithmetic system according to claim 1, wherein the external storage device stores a group of commands or programs for the CPU that has been compiled in advance into a format that can be introduced at an arbitrary position among the partial reconfiguration areas.

The arithmetic system according to claim 1, wherein the support device instructs the accelerator device to perform an operation by calling an external program corresponding to the command name of the input command.

The arithmetic system according to claim 1, wherein the accelerator device includes an FPGA (Field Programmable Gate Array), and
the reconfiguration area map configured in advance on the FPGA by the support device has the shape of a dynamic reconfiguration area chain in which each partial reconfiguration area is a FIFO-type data structure with one input and one output, and the outputs and inputs are interconnected so that the areas are arranged in sequence in a single line.

The arithmetic system according to claim 4, wherein, when the number of commands connected by the pipe instruction is smaller than the number of dynamic reconfiguration chains prepared on the FPGA, the support device inserts bypass commands after the final-stage command and connects them, so that the number of dynamic reconfiguration chains equals the number of commands connected by pipes.

A method for controlling an arithmetic system comprising a CPU serving as a host, an external storage device that stores a partial circuit configuration group, an accelerator device capable of dynamically reconfiguring arithmetic circuits by loading the partial circuit configuration group, and a bus connecting them, the arithmetic system further comprising:
an input device that displays a command line prompt and accepts commands entered on the command line; and
a support device that maps an input command to a partial reconfiguration area, which is a partially reconfigurable area of the accelerator device, and controls the partial reconfiguration areas so that they are connected by data paths,
wherein the support device executes a step of,
when a pipe instruction, which is an instruction that connects the output data of a command as the input data of the next command, is issued, mapping the preceding command to a preceding partial reconfiguration area, mapping the following command to a following partial reconfiguration area, connecting the preceding and following partial reconfiguration areas by a data path so that they form a single line without branching, and, after repeating the mapping and the connection for all pipe instructions, configuring the input of the first command to be introduced from the CPU and the output of the last command to be returned to the CPU, and
wherein the accelerator device
starts all target commands in parallel and executes processing in a FIFO-type data structure defined by the mapping of the partial reconfiguration areas set by the support device and by the connection of the preceding and following partial reconfiguration areas by data paths into a single line without branching.

A program for causing a computer of an arithmetic system, the arithmetic system comprising a CPU serving as a host, an external storage device that stores a partial circuit configuration group, an accelerator device capable of dynamically reconfiguring arithmetic circuits by loading the partial circuit configuration group, and a support device that maps an input command to a partial reconfiguration area of the accelerator device and controls the partial reconfiguration areas so that they are connected by data paths,
to function as control means that, when a pipe instruction, which is an instruction that connects the output data of a command as the input data of the next command, is issued, maps the preceding command to a preceding partial reconfiguration area, maps the following command to a following partial reconfiguration area, connects the preceding and following partial reconfiguration areas by a data path so that they form a single line without branching, and, after repeating the mapping and the connection for all pipe instructions, configures the input of the first command to be introduced from the CPU and the output of the last command to be returned to the CPU.