JP3547316B2

JP3547316B2 - Processor

Info

Publication number: JP3547316B2
Application number: JP15888798A
Authority: JP
Inventors: 広明磯野; 淳一木村; 芳典鈴木
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-06-08
Filing date: 1998-06-08
Publication date: 2004-07-28
Anticipated expiration: 2018-06-08
Also published as: JPH11353154A

Description

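[0001]
[Technical Field of the Invention]
The present invention relates to a processor that can process, rather than ignore, the carry signal produced when an arithmetic unit performs an addition, and in particular to a processor suited to image processing.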
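[0002]
[Prior Art]
SIMD processors, which process multiple data items in parallel with a single instruction, are well suited to high-speed digital image processing. Some SIMD processors logically partition each SIMD register so that the data in each partition can be handled independently. For example, the processor described in "MMX Technology Optimization Techniques" (by 小鷲英一, published by ASCII) holds eight 8-bit data items in a 64-bit register and uses a parallel arithmetic unit made up of eight 8-bit arithmetic units to perform the same operation on all eight items in a register in parallel. Each partitioned data item is called an element, data made up of several elements is called packed data, and a register holding such data is called a packed data register.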
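[0003]
Pixel data generally takes values from 0 to 255 and is represented in 8 bits, so eight consecutive pixels can be stored in one packed data register and the same operation can be applied to all eight in parallel.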
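[0004]
In the above processor, an addition may produce a carry and a subtraction may produce a borrow. In wrap-around mode, which ignores carries and borrows, the result is then incorrect. The processor therefore also supports saturating arithmetic: when an operation produces a carry or a borrow, the result is clamped to the maximum or minimum representable value. For example, adding 5 to a pixel value of 254 yields 255. For a simple addition like this, the saturation error is small and can sometimes be ignored. For other additions it cannot. For example, computing the average of n pixel values a, b, c, d, ... as x = (a + b + c + d + ...) / n first sums the pixel values by repeated addition and then divides the sum by n. When n is 2 to the power m (m a positive integer), the division is implemented by shifting the sum m bits toward the low-order side. If a carry occurs during the repeated additions and saturation clamps the intermediate result to the maximum value, the error in the sum becomes large, and so does the error in the final average.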
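The size of the error described above can be illustrated with a small sketch. It compares an exact average of four 8-bit pixels with the averages obtained when the running sum is kept in 8 bits using saturating or wrap-around addition. The function names and pixel values are illustrative assumptions, not taken from the patent.

```python
def sat_add8(a, b):
    """Unsigned 8-bit add that clamps to 255 instead of wrapping."""
    return min(a + b, 255)

def wrap_add8(a, b):
    """Unsigned 8-bit add that discards the carry (wrap-around)."""
    return (a + b) & 0xFF

def average4_with(add, pixels):
    """Average four pixels, keeping the running sum in 8 bits via `add`."""
    total = 0
    for p in pixels:
        total = add(total, p)
    return total >> 2  # divide by 4 = shift right by 2 bits

pixels = [200, 210, 220, 230]        # hypothetical pixel values
exact = sum(pixels) >> 2             # full-precision reference
saturated = average4_with(sat_add8, pixels)
wrapped = average4_with(wrap_add8, pixels)
print(exact, saturated, wrapped)
```

With these values both 8-bit variants are far from the exact average, which is the problem the later paragraphs set out to solve.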
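[0005]
One way to avoid this error is to increase the number of significant bits per pixel. Each pixel is handled as 16 bits, each element is 16 bits wide, one register holds four elements, and operations on the four elements run in parallel. The final result is converted back to 8 bits and stored in memory. The processor can operate on data widened in this way: each register can hold four 16-bit elements or two 32-bit elements, and the eight 8-bit arithmetic units are then reconfigured, to match the element size, into four 16-bit units or two 32-bit units.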
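[0006]
[Problems to Be Solved by the Invention]
Making each element 16 bits long, as above, preserves arithmetic accuracy but halves the number of operations that can run in parallel, that is, the number of elements (pixels) processed in parallel. The processing speed of the processor drops substantially as a result.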
[0007]
One conceivable remedy is to make each register larger in advance. For example, the smallest element held in a register could be 12 or 16 bits. If each register still holds eight elements, the register becomes 96 or 128 bits wide. Widening the elements also requires widening the arithmetic unit for each element: each unit must operate on 12-bit or 16-bit data and output a 12-bit or 16-bit result. Because the above processor has eight such units, their total size grows considerably.
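[0008]
Thus, with conventional methods, handling carry signals accurately increases the circuit size of the registers and arithmetic units. Moreover, in a SIMD processor such as the one above, each register holds multiple elements and there are as many arithmetic units as elements, so increasing the element size enlarges both the arithmetic units and the registers.
[0009]
Accordingly, an object of the present invention is to provide a processor that can, with a relatively simple circuit, correctly handle the carries that arise when additions are performed repeatedly, as when computing the average of multiple unsigned data items.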
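[0010]
[Means for Solving the Problems]
In image processing, image data is unsigned. When computing the sum needed to average multiple unsigned data items, additions in the arithmetic unit produce carry signals but never borrow signals. Computing the sum therefore requires the accumulated value of the carry signals produced by the additions, yet that accumulated value is not used during the summation itself. It is needed only later, when the sum is divided by the number of data items. So if the accumulated carries produced during the additions are saved, and the later division is performed on the combination of the accumulated value and the sum, the carries are handled correctly. With this approach neither the bit width of the data being added nor the bit width of the arithmetic unit needs to grow. The division can be carried out by shifting the combined data toward the low-order side with a shifter.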
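[0011]
The bit width of the combined data is the bit width of the sum plus the bit width of the accumulated carry value, so the shifter must be able to shift data of this extended width. The resulting increase in shifter circuit size is nevertheless expected to be smaller than the increase that would be needed to hold data of the extended width in every register and process it in the arithmetic units. Accumulating the carry signals produced during summation and using the accumulated value at division time therefore handles the carries correctly while keeping the circuitry needed for averaging small.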
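The idea in the two paragraphs above can be sketched in a few lines: keep the running sum in 8 bits, count the carries separately, and apply the shift to the combined value at the end. This is a minimal software model under stated assumptions (8-bit data, a divisor of 2 to the power m); the function names are hypothetical.

```python
def add8_counting_carry(acc, carries, x):
    """8-bit add: returns the 8-bit sum and the updated carry count."""
    s = acc + x
    return s & 0xFF, carries + (s >> 8)

def mean_pow2(values, m):
    """Average len(values) == 2**m unsigned 8-bit values."""
    acc, carries = 0, 0
    for v in values:
        acc, carries = add8_counting_carry(acc, carries, v)
    combined = (carries << 8) | acc   # carry count sits above the 8-bit sum
    return (combined >> m) & 0xFF     # the shift performs the division

print(mean_pow2([200, 210, 220, 230], 2))
```

The adder never sees more than 8 bits; only the final shift works on the wider combined value.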
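[0012]
This applies not only to averaging but to other processing as well. In general, a carry signal produced by adding unsigned data need only be saved until the addition result is used, and then be processed together with that result.
[0013]
The present invention is based on this property of unsigned data processing. A processor according to the invention includes a circuit that generates the accumulated value of the carry signals output when an arithmetic unit performs additions, and another arithmetic unit that operates on that accumulated value.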
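[0014]
More specifically, to achieve the above object, in the processor according to the invention the bit width of the data processed by the arithmetic unit and the bit width of the data it outputs remain free of any carry portion.
[0015]
A carry signal accumulation circuit is provided that generates carry signal accumulation data representing the accumulated value of the carry signals produced while the arithmetic unit performs multiple additions. The carry signal accumulation data consists of multiple bits.
[0016]
In addition, another arithmetic unit is provided that operates on the carry signal accumulation data generated by the carry signal accumulation circuit.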
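[0017]
In a preferred aspect of the invention, the carry signal accumulation circuit is a counter.
[0018]
In a specific aspect, the other arithmetic unit includes a shifter that shifts, toward the low-order side, the combination of the carry signal accumulation data and, appended on its low-order side, the addition result data obtained from the multiple additions. The other arithmetic unit operates in response to a specific instruction other than the add instruction, namely a shift instruction.
[0019]
In a more specific aspect, one or more carry signal accumulation circuits are shared by multiple registers in the processor.
[0020]
In a preferred aspect, the number of carry signal accumulation circuits is smaller than the number of registers in the processor.
[0021]
In a preferred aspect, the number of bits of carry signal accumulation data held in each carry signal accumulation circuit is smaller than the above predetermined number of bits.
[0022]
In a still more specific aspect, one or more carry signal accumulation circuits are shared by multiple packed data registers in a SIMD processor.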
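[0023]
[Embodiments of the Invention]
The processor according to the invention is described below in more detail with reference to several embodiments shown in the drawings. The same reference numerals denote the same or similar items. From the second embodiment onward, the description mainly covers the differences from the first embodiment.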
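[0024]
[Embodiment 1]
FIG. 1 is a block diagram of a SIMD processor according to the invention. The packed data register group 120 is assumed to consist of, for example, eight 64-bit registers, each with eight fields that each hold one 8-bit element. Operation units 100, 100', ..., 100" correspond to the eight fields and process the eight elements held in one register. Eight operation units are used in total; for simplicity FIG. 1 shows only three. All have the same circuit configuration: each operation unit 100, 100', ..., 100" consists of an arithmetic unit 130, 130', ..., or 130", a carry signal accumulation circuit 140, 140', ..., or 140", and a multiplexer 150, 150', ..., or 150".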
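[0025]
The carry signal accumulation circuit 140 is newly provided in this embodiment. It accumulates the carry signals that the arithmetic logic unit (ALU) 320 (FIG. 2) in arithmetic unit 130 outputs while performing repeated additions. Specifically, circuit 140 consists of a counter 410 (FIG. 3). The same applies to the other carry signal accumulation circuits 140', ..., 140". A shifter 330 (FIG. 2) is provided in arithmetic unit 130 as the other arithmetic unit that, in response to a specific instruction described later, uses the accumulated carry value generated by circuit 140. The same applies to the other arithmetic units 130', ..., 130". In this embodiment, the carry signal accumulation circuit 140 and the shifter 330 make it possible to handle correctly the carry signals generated when ALU 320 repeatedly adds unsigned data.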
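[0026]
The instruction fetch circuit 162 fetches instructions sequentially from memory 163, the instruction decoder 161 decodes each fetched instruction, and the control circuit 160 generates the control signals needed to execute the decoded instruction and controls each device through control signal 170. When the decoded instruction loads data from memory 163 into a register in packed data register group 120, or stores data from such a register into memory 163, the memory access circuit 164 performs the load or store. Data moves from memory 163 to packed data register group 120 over data buses 114, 110, and 113. The data is 64 bits wide and normally contains eight 8-bit elements, though it may instead contain four 16-bit elements. Data moves from packed data register group 120 to memory 163 over data buses 112 and 115.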
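[0027]
When the decoded instruction is an arithmetic instruction that uses packed data register group 120, eight elements are read from one of the pair of packed data registers specified by the instruction and transferred to the eight operation units 100, 100', ..., 100" over buses 112 and 101. Likewise, eight elements are read from the other register of the pair and transferred to the eight operation units over buses 111 and 102. Each arithmetic unit operates on the elements it receives and transfers its 8-bit result over bus 109, 109', ..., or 109" and the shared data buses 110 and 113 to the register in packed data register group 120 specified by the instruction. This embodiment is thus a SIMD processor that processes multiple elements in parallel with a single instruction.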
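[0028]
Data buses 101, 102, 110, 111, 112, 113, and 114 are each 64 bits wide. The 64-bit data supplied over data buses 101 and 102 is split into 8-bit pieces and supplied in parallel, from the most significant bits downward, to arithmetic units 130" through 130 over data buses 103" through 103 and 104" through 104. The results are placed on data bus 110 via data buses 109" through 109, with data bus 109" carrying the most significant bits, and are stored as 64-bit data in the packed data register group via data bus 113.
[0029]
As shown in FIG. 4, packed data register group 120 consists of eight 64-bit packed data registers 200 through 207, a write register selection circuit 210, and a read register selection circuit 220. Below, packed data registers 200 through 207 are sometimes simply called registers, and in the instructions described below they are written R0 through R7.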
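[0030]
In operation unit 100, two 64-bit data items are read from two registers in packed data register group 120 and supplied over data buses 101 and 102; the least significant 8 bits of each are supplied to arithmetic unit 130 over 8-bit data buses 103 and 104. Arithmetic unit 130 is an 8-bit arithmetic unit: it operates on these data and outputs an 8-bit result on 8-bit data bus 109.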
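[0031]
When a carry occurs in arithmetic unit 130, the carry bit is sent over 1-bit bus 105 to multiplexer 150, which is newly provided in this embodiment. When the instruction being executed is an add instruction that processes elements of 16 bits or more, the multiplexer sends the carry bit on bus 105 over data bus 106 to the next higher arithmetic unit 130'. When the instruction being executed is an add instruction that processes 8-bit elements without ignoring carries, the multiplexer sends the carry bit on bus 105 over data bus 108 to carry signal accumulation circuit 140. This instruction is newly introduced in this embodiment; its use is described later. For any instruction other than these two kinds of add instruction, multiplexer 150 connects bus 105 to neither data bus 106 nor data bus 108.
[0032]
The accumulated carry data held in carry signal accumulation circuit 140 is used when a specific instruction is executed. In this embodiment, when a specific kind of shift instruction described later is executed, the bits stored in the circuit are supplied to arithmetic unit 130 over 4-bit data bus 107. Arithmetic unit 130, carry signal accumulation circuit 140, and multiplexer 150 are controlled by control circuit 160.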
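[0033]
FIG. 2 shows arithmetic unit 130 in detail. It consists of an arithmetic logic unit (ALU) 320 that performs addition, subtraction, logical operations, and the like; a shifter 331 with 8-bit input and 8-bit output that performs conventional shift operations; a shifter 330 with 12-bit input and 8-bit output, newly provided in this embodiment; and multiplexers 310 and 311. Arithmetic unit 130 also contains other units not shown, such as a multiplier, which can multiply while ignoring carries. Because these units are unrelated to the features of the invention, they are neither shown nor described here. Multiplexer 310 selects whether the data supplied on data bus 104 is passed to ALU 320 over data bus 300, to shifter 330 over data bus 301 as the low-order 8 bits, or to shifter 331.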
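[0034]
When the instruction being executed is an add or subtract instruction, ALU 320 adds or subtracts the two data items supplied on data buses 103 and 300 and outputs the result on data bus 303. When the instruction is an add instruction that requires carries to be handled correctly and the addition produces a carry, ALU 320 outputs the carry bit on bus 105. When the instruction is a logical instruction, ALU 320 performs the logical operation on the two data items supplied on data buses 103 and 300 and outputs the result on data bus 303. When the instruction operates on elements of 16 bits or more, arithmetic units other than the lowest unit 130, such as 130', receive the carry bit from the next lower arithmetic unit over data bus 106', and their ALUs use it in the addition or subtraction.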
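[0035]
When the instruction being executed shifts 8-bit data held in a register in packed data register group 120, shifter 331 shifts the data supplied from that register via multiplexer 310 toward the high-order or low-order side, as the instruction specifies, by the number of bits the instruction specifies, and outputs the 8-bit result on data bus 305.
[0036]
This shifter may be a serial shifter or a barrel shifter, though the latter is preferable for speed. When shifting toward the low-order side, shifter 331 repeatedly supplies the original most significant bit as the new most significant bit. When shifting toward the high-order side, it repeatedly supplies '0' as the least significant bit.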
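[0037]
The shifter 331 in operation unit 100 and the corresponding shifters (not shown) in the other operation units 100' through 100" are connected to one another. For a 16-bit shift instruction, for example, the low-order end of the shifter (not shown) in operation unit 100' is connected to the high-order end of shifter 331 in operation unit 100. When shifting toward the low-order side, the shifter in operation unit 100' repeatedly supplies its original most significant bit as the new most significant bit, while the most significant bit of shifter 331 in operation unit 100 is fed from the least significant bit of the shifter in operation unit 100'. Conversely, when shifting toward the high-order side, the least significant bit of the shifter in operation unit 100' is fed from the most significant bit of shifter 331 in operation unit 100. The corresponding shifters in the higher operation units are connected in pairs in the same way. For 32-bit shift instructions the shifters are connected in groups of four.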
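[0038]
Shifter 330 is newly provided in this embodiment as the arithmetic unit that uses the accumulated data held in carry signal accumulation circuit 140. When the instruction being executed is the specific shift instruction described later, which shifts toward the low-order side the combination of 8-bit data held in a register in packed data register group 120 and the data accumulated by carry signal accumulation circuit 140, shifter 330 forms combined data whose low-order bits are the 8-bit data supplied from that register over data bus 301 and whose high-order bits are the 4-bit accumulated carry data supplied over data bus 107. It shifts the combined data toward the low-order side by the number of bits the instruction specifies and outputs the low-order 8 bits of the shifted data as the shift result on data bus 304. This shifter, too, may be a serial shifter or a barrel shifter, the latter being preferable for speed.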
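The two shifters described above can be modeled in software as follows. This is only a sketch for right (low-order) shifts on a single 8-bit lane. It assumes that the useful output of shifter 330 is the low-order 8 bits of the shifted 12-bit value, which is what makes the averaging use case work; the function names are hypothetical.

```python
def shifter331_right(x, n):
    """Conventional 8-bit right shift that refills the top bit with the
    original most significant bit on every step."""
    msb = (x >> 7) & 1
    for _ in range(n):
        x = (x >> 1) | (msb << 7)
    return x & 0xFF

def shifter330_right(carry4, elem8, n):
    """12-bit-in, 8-bit-out right shift: the 4-bit carry count forms the
    high-order bits above the 8-bit element."""
    combined = ((carry4 & 0xF) << 8) | (elem8 & 0xFF)
    return (combined >> n) & 0xFF

# An 8-bit sum of 92 with 3 recorded carries stands for 3*256 + 92 = 860.
print(shifter331_right(92, 2), shifter330_right(3, 92, 2))
```

The conventional shifter sees only the truncated 8-bit sum, while the 12-bit shifter recovers the quotient of the full sum.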
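[0039]
Like shifter 331, shifter 330 supports 16-bit and 32-bit instructions as well as 8-bit instructions. For example, for a 16-bit shift instruction that uses shifter 330 (described later), the corresponding shifter (not shown) in operation unit 100' is connected to shifter 330 in operation unit 100. When shifting toward the low-order side, the most significant bit of shifter 330 in operation unit 100 is fed from the least significant bit of the shifter in operation unit 100'. Conversely, when shifting toward the high-order side, the least significant bit of the shifter in operation unit 100' is fed from the most significant bit of shifter 330 in operation unit 100. The corresponding shifters in the higher operation units are connected in pairs in the same way, and for 32-bit shift instructions they are connected in groups of four.
[0040]
Multiplexer 311 selects the result data on one of data buses 303, 304, and 305 and outputs it on data bus 109.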
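[0041]
As shown in FIG. 3, carry signal accumulation circuit 140 in this embodiment consists of a 4-bit counter 410. When a carry bit is supplied on data bus 108, counter 410 increments its count by one and outputs the incremented count on data bus 107. When a clear signal is applied on line 170, it resets the count to 0.
[0042]
Arithmetic unit 130 and the other arithmetic units 130', 130", ... have the same circuit configuration and operate in parallel, and their carry signal accumulation circuits 140, 140', 140", ... likewise all have the same configuration and operate in parallel. Operation units 100, 100', 100", ... therefore all have the same configuration and operate in parallel.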
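[0043]
To operate the newly provided devices, this embodiment adds a new add instruction, a new shift instruction, and a counter clear instruction to the conventional instructions. Below, instructions are written as an opcode and operands in mnemonic form. The mnemonics are defined for convenience of explanation and may differ from those conventionally used for existing instructions.
[0044]
A conventional add instruction, for example "ADD8 Rx, Ry" (x, y = 0 to 7), logically partitions packed data registers Rx and Ry into 8-bit elements, adds each pair of corresponding elements as unsigned data independently of the other elements, and stores the result in packed data register Ry. This add instruction ignores carry signals. It may also be a saturating add.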
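[0045]
This instruction is fetched by instruction fetch circuit 162 and decoded by instruction decoder 161. From the decoded instruction, control circuit 160 generates control signal 170, which controls read register selection circuit 220, write register selection circuit 210, and multiplexers 310, 311, and 150. Under this control, read register selection circuit 220 reads the eight elements of each of registers Rx and Ry in parallel and outputs them on data buses 111 and 112. Multiplexer 310 passes the element supplied to it to ALU 320 over data bus 300, and multiplexer 311 outputs the addition result from ALU 320 on bus 109. Multiplexer 150 (FIG. 1) connects its input 105 to nothing and is off. The carry bit generated in ALU 320 is therefore ignored.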
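[0046]
The new add instruction, for example "ADD8C Rx, Ry", differs from the conventional add instruction in that multiplexer 150 is controlled to connect to data bus 108; otherwise it is controlled in the same way. Consequently, when executing the new add instruction produces a carry in ALU 320, the carry bit is supplied to counter 410 (FIG. 3) in carry signal accumulation circuit 140, and the count in counter 410 increases by one.
[0047]
The new add instruction is used in place of the conventional add instruction when carry signals must be handled correctly. For example, when averaging multiple data items, it is used for the additions that compute their sum. Counter 410 then holds the total number of carry signals produced during those additions.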
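[0048]
A conventional shift instruction, for example "SH8R n Rx", logically partitions packed data register Rx into 8-bit elements, shifts each element right by n bits independently, and stores the shifted 8-bit data in packed data register Rx. The instruction is fetched by instruction fetch circuit 162 and decoded by instruction decoder 161; from the decoded instruction, control circuit 160 generates control signal 170, which controls read register selection circuit 220, write register selection circuit 210, and multiplexers 310, 311, and 150. Read register selection circuit 220 outputs Rx on data bus 112. Multiplexer 310 passes the element supplied to it to shifter 331 over data bus 302. Multiplexer 311 passes the shifter output from bus 305 to bus 109. Multiplexer 150 connects its input 105 to nothing and is off.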
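[0049]
The new shift instruction, for example "SH8R n C Rx", differs from the conventional shift instruction in that multiplexer 310 feeds the element read from register Rx over bus 104 into the low-order side of shifter 330 over bus 301, and multiplexer 311 transfers the shifted data output by shifter 330 on bus 304 to bus 109; otherwise it is controlled in the same way. Because the 4 bits of accumulated data held in counter 410 are fed in parallel into the high-order side of shifter 330, the shifter shifts the combination of the accumulated data and the element from register Rx by n bits toward the low-order side.
[0050]
The new shift instruction is used in place of the conventional shift instruction when the accumulated carry value held in carry signal accumulation circuit 140 is to be used. In the averaging process above, it is used for the division by the number of data items after the sum has been computed. Assuming the sum is held in register Rx, the data that is shifted is the sum with the accumulated carries produced during its computation appended on the high-order side. The shifted result is therefore correct, taking into account the carry signals produced while the sum was computed.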
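[0051]
The new counter clear instruction, for example "CLRC", sets the count in counter 410 to 0. When the instruction is fetched by instruction fetch circuit 162, instruction decoder 161 decodes it, control circuit 160 generates control signal 170 from the decoded instruction, and control signal 170 clears counter 410.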
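[0052]
The averaging process on the processor of this embodiment is described in detail below. The eight source data items Ai (i = 0 to 7) are each 8 bits and, as shown in packed data register 200 in FIG. 4, are elements loaded into the eight fields of the same register 200; i can thus be called the element number. In the figure, the most significant bit of each source data item is at the left end of the field that holds it. The other source data Bi, Ci, and Di (i = 0 to 7) are likewise 8 bits each and are held in registers 201, 202, and 203 respectively. All data is assumed to be unsigned. Using this data, the average Xi = (Ai + Bi + Ci + Di) / 4 (i = 0 to 7) of the four data items with the same element number i is to be computed.
[0053]
In this embodiment, the instruction sequence for computing the averages Xi (i = 0 to 7) is as follows.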
[0054]
#1 CLRC
#2 LOAD (ma), R0
#3 LOAD (mb), R1
#4 LOAD (mc), R2
#5 LOAD (md), R3
#6 ADD8C R1, R0
#7 ADD8C R2, R0
#8 ADD8C R3, R0
#9 SH8R 2 C R0
#10 STORE R0, (md)
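First, the clear instruction clears counter 410. The next four instructions are loads: "LOAD (ma), R0", for example, loads the 64-bit data at memory address ma into register R0. Here the image data A0 to A7 is stored at memory address ma and is loaded into register R0 by one load instruction. Similarly, image data B0 to B7, C0 to C7, and D0 to D7 is loaded from memory 163 into registers R1, R2, and R3 by the second, third, and fourth load instructions. The first add instruction computes A0 + B0, A1 + B1, ..., A7 + B7 and stores the eight resulting sums X0 to X7 in register R0. The second add instruction adds the sums X0 to X7 in register R0 to the data C0 to C7 in register R2, giving A0 + B0 + C0, A1 + B1 + C1, ..., A7 + B7 + C7, which is stored in register R0 and again denoted X0 to X7. The last add instruction gives the final sums A0 + B0 + C0 + D0, A1 + B1 + C1 + D1, ..., A7 + B7 + C7 + D7, which are stored in register R0 and again denoted X0 to X7.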
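[0055]
If a carry is generated during these three add instructions by the ALU 320 in any operation unit, say unit 100, the counter 410 in that operation unit counts up. The same holds for the other operation units 100', 100". The counter 410 in each operation unit thus holds the total number of carry bits generated by the ALU 320 in the corresponding arithmetic unit 130. When the shift instruction following the add instructions is executed, the shifter 330 in each operation unit takes the 12-bit data formed by appending the sum Xi (i = 0, 1, ..., or 7) held in register R0 to the low-order side of the accumulated data in that unit's counter 410, and shifts it 2 bits toward the low-order side. The data output by shifter 330 is therefore the average of Ai, Bi, Ci, and Di, computed with the accumulated carries correctly reflected. The instruction "STORE R0, (md)" stores the averages in register R0 at memory address md.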
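The instruction sequence walked through above can be mimicked with a small eight-lane software model. This is a sketch under stated assumptions: the class and method names are hypothetical, memory is replaced by Python lists, and each lane has one 4-bit carry counter as in this embodiment.

```python
LANES = 8

class PackedModel:
    """Toy model: eight 8-bit lanes, one 4-bit carry counter per lane."""
    def __init__(self):
        self.regs = {}
        self.counters = [0] * LANES

    def clrc(self):                       # models CLRC
        self.counters = [0] * LANES

    def load(self, reg, data):            # models LOAD (m), R
        self.regs[reg] = list(data)

    def add8c(self, rx, ry):              # models ADD8C Rx, Ry
        for i in range(LANES):
            s = self.regs[rx][i] + self.regs[ry][i]
            self.regs[ry][i] = s & 0xFF
            self.counters[i] = (self.counters[i] + (s >> 8)) & 0xF

    def sh8r_c(self, n, rx):              # models SH8R n C Rx
        for i in range(LANES):
            combined = (self.counters[i] << 8) | self.regs[rx][i]
            self.regs[rx][i] = (combined >> n) & 0xFF

A = [0, 255, 200, 10, 128, 255, 1, 77]    # hypothetical pixel data
B = [0, 255, 210, 20, 128, 0, 2, 78]
C = [0, 255, 220, 30, 128, 255, 3, 79]
D = [0, 255, 230, 40, 128, 0, 4, 80]

cpu = PackedModel()
cpu.clrc()
for reg, data in zip(("R0", "R1", "R2", "R3"), (A, B, C, D)):
    cpu.load(reg, data)
cpu.add8c("R1", "R0")
cpu.add8c("R2", "R0")
cpu.add8c("R3", "R0")
cpu.sh8r_c(2, "R0")
print(cpu.regs["R0"])
```

Each lane ends up holding the floor of (Ai + Bi + Ci + Di) / 4 even though no lane ever stores more than 8 bits of sum.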
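[0056]
In this way the embodiment computes the eight averages Xi in parallel. By adding simple circuitry to a conventional arithmetic unit, carry bits are held in counter 410 and can be referenced by the new shifter 330, so an operation that averages several 8-bit source data items, such as x = (a + b + c + d) / 4, can be executed without ignoring the carry signals it generates. Neither the element size nor the bit width handled by the arithmetic units needs to be extended, so the newly added circuitry is small.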
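[0057]
[Variations of Embodiment 1]
(1) In Embodiment 1, data bus 107 and the corresponding input of shifter 330 are 4 bits wide, but the width may be chosen freely according to circuit size and performance, and the maximum value of counter 410 may be chosen to match. For the average of four 8-bit data items above, the count in counter 410 never exceeds 3 and fits in 2 bits, so if this is the only use, 2 bits suffice for both data bus 107 and shifter 330. This variation also applies to the other embodiments below.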
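[0058]
(2) Embodiment 1 uses eight 64-bit packed data registers, but the number and width may be chosen freely according to circuit size, with data buses 101, 102, 110, 111, 112, and 113 sized accordingly. This variation also applies to the other embodiments below.
[0059]
(3) In variation (2), the number of circuits 100 through 100" is arbitrary. For example, if the registers in packed data register group 120 are 128 bits wide, using sixteen circuits 100 through 100" allows sixteen 8-bit operations to run in parallel. This variation also applies to the other embodiments below.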
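[0060]
(4) Embodiment 1 was described mainly for an 8-bit element size, but it also applies to 16-bit and 32-bit elements. The only difference in operation for 16-bit elements is that multiplexer 150 is always connected to data bus 106; everything else is as in Embodiment 1. New instructions "ADD16C Rx, Ry" and "SH16R n C Rx" therefore provide the same behavior for 16-bit elements. These instructions are decoded by instruction decoder 161, and control circuit 160 generates control signal 170 from the decoded instruction. For instructions with an element size of 16, control signal 170 keeps multiplexer 150 connected to data bus 106 and leaves the setting of multiplexer 150' free. Although not shown, every other multiplexer among 150 through 150" is controlled in this way. The same holds for 32 bits, where the control is applied at intervals of three multiplexers. This variation also applies to the other embodiments below.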
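[0061]
(5) Arithmetic unit 130 has two inputs in Embodiment 1, but three or four inputs are also possible; to process them in parallel, the number of data buses 101, 102, 111, 112, 103, and 104 is chosen accordingly. This also applies to the other embodiments below.
[0062]
(6) If the new shift instruction "SH8R n C Rx" of Embodiment 1 also clears counter 410 as it shifts, the new clear instruction "CLRC" can be omitted. This reduces the number of instructions to execute and helps speed up processing.
[0063]
(7) Embodiment 1 shows a SIMD processor, but the invention is not limited to SIMD processors; it naturally also applies to a SISD processor with only one arithmetic unit. Because a SIMD processor has many arithmetic units, however, the benefit of handling carry signals correctly without enlarging the arithmetic circuitry is especially large there.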
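[0064]
[Embodiment 2]
This embodiment differs from Embodiment 1 mainly in that multiple carry accumulators are provided: carry signal accumulation circuit 140 contains multiple counters, and an instruction can select which of them accumulates the carry signal.
As shown in FIG. 5, carry signal accumulation circuit 140 consists of counters 410 through 413, each of which increments its count by one when a carry bit is supplied and outputs its count; a multiplexer 421 that selects which one of counters 410 through 413 receives the carry bit; and a multiplexer 422 that selects which of counters 410 through 413 drives data bus 107. As in Embodiment 1, counters 410 through 413 have a clear function.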
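[0065]
To access counters 410 through 413 individually, the instructions introduced in Embodiment 1 are extended. First, in place of the add instruction "ADD8C" of Embodiment 1, an add instruction "ADD8Cn Rx, Ry" (n = 0 to 3) is introduced so that the counter receiving the carry bit can be selected; n = 0 to 3 correspond to counters 410 to 413 respectively.
[0066]
The instruction "ADD8C0 Rx, Ry" differs from "ADD8C Rx, Ry" of Embodiment 1 in that it designates counter 410 by controlling multiplexer 421. When the instruction is decoded by instruction decoder 161 and control circuit 160 generates control signal 170 from it, multiplexer 421 is controlled in addition to the control performed for "ADD8C Rx, Ry". Multiplexer 421 connects to counter 410, the counter designated by the instruction, and the carry bit on data bus 108 is applied to counter 410. Because the counter output is not affected, multiplexer 422 need not operate. Similarly, the add instructions ADD8C1, ADD8C2, and ADD8C3 select counters 411, 412, and 413 respectively.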
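[0067]
In place of the shift instruction of Embodiment 1, a shift instruction "SH8R m Cn Rx" (m: number of bits to shift, n: counter selector, x: packed data register selector) is introduced so that the counter whose output is placed on data bus 107 can be specified. For example, "SH8R m C0 Rx" differs from "SH8R n C Rx" of Embodiment 1 in that multiplexer 422 selects which counter's output is placed on data bus 107. When the instruction is decoded by instruction decoder 161 and control circuit 160 generates control signal 170 from it, control of multiplexer 422 and control that makes counter 410 output its value are added to the control performed for "SH8R n C Rx". Multiplexer 422 connects to counter 410 and places its output on data bus 107. Otherwise the instruction behaves like "SH8R n C Rx". Because a shift instruction receives no input from data bus 108, multiplexer 421 need not operate. Similarly, the shift instructions SH8R m C1, SH8R m C2, and SH8R m C3 select counters 411, 412, and 413 respectively.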
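[0068]
Further, a clear instruction "CLRCn" (n = 0 to 3) is introduced so that counters 410 through 413 can be cleared individually. When the instruction is decoded by instruction decoder 161, control circuit 160 generates control signal 170 from it, which designates and clears one of counters 410 through 413.
[0069]
Providing multiple counters to hold carry signals in this way lets the program choose which counter accumulates carries when more data is processed, which speeds up processing or simplifies programming. For example, the processor can be a superscalar processor that executes several, say two, scalar instructions in parallel. In such a processor each instruction is executed in pipelined fashion as a series of stages, and the same stage of two instructions executes concurrently. For example, each instruction executes in three stages: fetch, decode, and execute.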
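[0070]
Such a processor needs two sets of decode and execution circuits, and preferably two fetch circuits as well. To increase processing speed, it is desirable that many combinations of instructions can execute in parallel, and for two instructions to execute in parallel there should be no conflict between them. In a superscalar processor, providing multiple counters in carry signal accumulation circuit 140, as in this embodiment, increases the number of instruction pairs that can execute in parallel and so improves processing speed. For example, to run the program of Embodiment 1 on such a superscalar processor, the instructions are preferably ordered as follows.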
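[0071]
#1 CLRC
#2 LOAD (ma), R0
#3 LOAD (mb), R1
#4 LOAD (mc), R2
#5 ADD8C R1, R0
#6 LOAD (md), R3
#7 ADD8C R2, R0
#8 ADD8C R3, R0
#9 SH8R 2 C R0
#10 STORE R0, (md)
In this case instructions #4 and #5 can execute in parallel, as can #6 and #7. Whether #2 and #3 can execute in parallel depends on whether there are two fetch circuits.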
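[0072]
The following example program for this embodiment divides eight source data items into two sets and computes the averages of the four source data items in each set in parallel. The program uses two counters, 410 and 411. The first average uses registers R0 to R3 and the second uses R4 to R7. ma through mj are memory addresses.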
[0073]
#1 CLRC0
#2 CLRC1
#3 LOAD (ma), R0
#4 LOAD (mb), R1
#5 LOAD (me), R4
#6 ADD8C0 R1, R0
#7 LOAD (mf), R5
#8 LOAD (mc), R2
#9 ADD8C1 R5, R4
#10 LOAD (mg), R6
#11 ADD8C0 R2, R0
#12 LOAD (md), R3
#13 ADD8C1 R6, R4
#14 LOAD (mh), R7
#15 ADD8C0 R3, R0
#16 ADD8C1 R7, R4
#17 SH8R 2 C0 R0
#18 SH8R 2 C1 R4
#19 STORE R0, (mh)
#20 STORE R4, (mi)
In this program the following instruction pairs can execute in parallel: #5 and #6, #8 and #9, #10 and #11, #12 and #13, #14 and #15, #16 and #17, and #18 and #19. More instructions can therefore execute in parallel than with a single counter.
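The benefit of selectable counters can be illustrated with a single-lane sketch in which two independent running sums are interleaved, each charging its carries to its own counter. The class and method names are hypothetical stand-ins for the CLRCn, ADD8Cn, and SH8R m Cn instructions, and the input values are made up.

```python
class CarryLane:
    """One 8-bit lane with several selectable 4-bit carry counters."""
    def __init__(self, n_counters=4):
        self.counters = [0] * n_counters

    def clrc(self, n):                    # models CLRCn
        self.counters[n] = 0

    def add8c(self, n, x, y):             # models ADD8Cn: carry goes to counter n
        s = x + y
        self.counters[n] = (self.counters[n] + (s >> 8)) & 0xF
        return s & 0xFF

    def sh8r(self, m, n, x):              # models SH8R m Cn: counter n above x
        return (((self.counters[n] << 8) | x) >> m) & 0xFF

lane = CarryLane()
lane.clrc(0)
lane.clrc(1)
first = [200, 210, 220, 230]              # averaged with counter 0
second = [255, 255, 255, 254]             # averaged with counter 1
r0, r4 = first[0], second[0]
for k in range(1, 4):                     # interleave the two summations
    r0 = lane.add8c(0, first[k], r0)
    r4 = lane.add8c(1, second[k], r4)
avg0 = lane.sh8r(2, 0, r0)
avg1 = lane.sh8r(2, 1, r4)
print(avg0, avg1)
```

Because the two sums never share a counter, their additions can be freely interleaved without one sum's carries contaminating the other.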
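[0074]
[Variations of Embodiment 2]
(1) In Embodiment 2 the number of counters 410 through 413 is arbitrary, and the range of the counter selector n in the instructions introduced in Embodiment 2 changes accordingly.
[0075]
(2) In Embodiment 2, multiplexer 422 can be omitted by controlling counters 410 through 413 so that each outputs individually.
[0076]
(3) Conversely to variation (2), if counters 410 through 413 all output their values and multiplexer 422 selects among them, the control signal that designates a counter can be omitted.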
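[0077]
[Embodiment 3]
This embodiment replaces the carry signal accumulation circuit 140 with multiple counters used in Embodiment 2 with a circuit that has multiple registers and an arithmetic unit.
[0078]
In FIG. 6, carry signal accumulation circuit 140 uses registers 430 through 433 in place of counters 410 through 413 of Embodiment 2. Registers 430 through 433 are assumed here to be 4 bits each and are numbered 0 to 3 starting from register 430. Arithmetic unit 440 operates on the carry bit supplied on data bus 108 and the data supplied on data bus 403 and outputs the result on data bus 401. It can perform at least addition, and may of course be made to perform other operations. Write register selection circuit 423 selects the register in which the input from data bus 401 is stored. Read register selection circuit 424 selects which of registers 430 through 433 is read onto data bus 402. Multiplexer 425 selects whether the data read is sent to arithmetic unit 130 over data bus 107 or to arithmetic unit 440 over data bus 403.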
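[0079]
The case where arithmetic unit 440 is a simple adder is described. As in Embodiment 2, new instructions are introduced so that registers 430 through 433 can be referenced individually. In the same format as Embodiment 2, a new add instruction "ADD8Gn Rx, Ry" (n = 0 to 3) is introduced, where n corresponds to the number of the register among 430 through 433. Take "ADD8G0 Rx, Ry" first. Outside carry signal accumulation circuit 140 it behaves like the add instruction introduced in Embodiment 2, so only the operation inside circuit 140 is described. When the instruction is decoded by instruction decoder 161, control circuit 160 generates control signal 170 from it and controls read register selection circuit 424, write register selection circuit 423, and multiplexer 425. Write register selection circuit 423 and read register selection circuit 424 each select register 430, and multiplexer 425 connects to data bus 403, so the data read from register 430 is supplied to arithmetic unit 440, combined with the data supplied on data bus 108, and the result is stored in register 430. Instructions for the remaining values of n up to 3 are defined in the same way.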
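[0080]
Next, the counterpart of the shift instruction of Embodiment 2, "SH8R m Gn Rx", is introduced in this embodiment. Like the new add instruction, it behaves outside carry signal accumulation circuit 140 like the shift instruction of Embodiment 2, so only the operation inside circuit 140 is described. Take "SH8R m G0 Rx" first. When the instruction is decoded by instruction decoder 161, control circuit 160 generates control signal 170 from it and controls read register selection circuit 424 and multiplexer 425. Read register selection circuit 424 selects register 430 and multiplexer 425 connects to data bus 107, so the data in register 430 is supplied over data bus 107 to arithmetic unit 130. Instructions for the remaining values of n up to 3 are defined in the same way. When arithmetic unit 440 is an adder, as above, the operation is almost the same as in Embodiment 2.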
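[0081]
If, without relying on this embodiment, the ALU 320 used for addition were to handle carries itself, the field holding one element in each register of packed data register group 120 would have to be widened from 8 bits to, say, 12 or 16 bits, and the part of ALU 320 that adds two data items would have to be changed to add two 12-bit data items.
[0082]
Because arithmetic unit 440 is provided, the circuit size of this embodiment is larger than that of Embodiment 2. It is nevertheless smaller than with the change just described. Arithmetic unit 440 adds 4-bit data from registers 430 through 433 and the 1-bit carry supplied on line 108, so it can be simpler than an adder for two 4-bit data items. The combined circuit size of arithmetic unit 440 and the adder portion of ALU 320 in this embodiment can therefore be smaller than the adder portion of ALU 320 would be after such a change. Furthermore, the number of registers 430 through 433 may be smaller than the number of registers in packed data register group 120. The combined circuit size of packed data register group 120 and registers 430 through 433 is therefore smaller than if the bit width of every register in packed data register group 120 were changed as described.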
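[0083]
Even if the number of registers 430 through 433 equals the number of packed data registers, arithmetic unit 440 is simpler than an ordinary 4-bit adder, as noted, so the circuit size of the processor of this embodiment can still be smaller than that of a processor changed as described without relying on this embodiment. From the standpoint of reducing circuit size, however, the number of registers 430 through 433 is preferably smaller than the number of packed data registers. For the same reason as for the multiple counters of Embodiment 2, a superscalar processor preferably has more than one such register. The preferred number depends on the number of packed data registers, but is normally no more than half and no less than one quarter of it.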
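[0084]
This embodiment also allows operations inside the carry signal accumulation circuit to be executed independently. For example, a new instruction is defined that adds the data in register 430 to the data in register 431 and stores the result back in register 431. Then, when two data items in packed data register group 120 are added and both have associated carry data, the result is still correct. For example, in the averaging operation y = ((a + b) + (c + d)) / 4, even if both a + b and c + d produce carry bits, adding the two carry counts together yields the correct average y.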
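The tree-shaped average mentioned above can be sketched as follows. Two carry registers collect the carries of the two partial sums, a register-to-register add merges them, and the final shift uses the merged count. The list `g` and the merge step are illustrative stand-ins for registers 430 and 431 and for the new instruction, whose mnemonic the text does not give.

```python
def add8(x, y):
    """8-bit add returning (8-bit sum, carry bit)."""
    s = x + y
    return s & 0xFF, s >> 8

def mean4_tree(a, b, c, d):
    """y = ((a + b) + (c + d)) / 4 with carries kept in two 4-bit registers."""
    g = [0, 0]                            # stand-ins for registers 430 and 431
    ab, carry = add8(a, b)
    g[0] += carry                         # carry of a + b goes to register 0
    cd, carry = add8(c, d)
    g[1] += carry                         # carry of c + d goes to register 1
    total, carry = add8(ab, cd)
    g[1] += carry                         # carry of the final add, register 1
    g[1] = (g[1] + g[0]) & 0xF            # merge: register 1 = register 0 + register 1
    return (((g[1] << 8) | total) >> 2) & 0xFF

print(mean4_tree(200, 210, 220, 230))
```

Without the merge step the carry from a + b would be lost, and the result would be too small by 64.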
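[0085]
[Variations of Embodiment 3]
(1) Just as Embodiment 1 has a single counter, Embodiment 3 may have a single register in place of registers 430 through 433.
[0086]
(2) Arithmetic unit 440 is basically used as an incrementer that increases the contents of one of registers 430 through 433 by one in response to a carry signal. If such an incrementer can be implemented by a circuit that is not structured as an adder, it can be used in place of arithmetic unit 440. In this specification such an incrementer is also regarded as an arithmetic unit for addition.
[0087]
(3) Registers 430 through 433 are assumed to be 4 bits in Embodiment 3, but their width is arbitrary, as is their number. The widths of data buses 402, 403, 401, and 107, which depend on the register width, are therefore also arbitrary.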
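[0088]
(4) Buses 105 and 108, which are 1-bit data buses in Embodiment 3, may be anywhere from 1 to 8 bits wide. For example, if ALU 320 is changed to an arithmetic unit that performs three-input, one-output addition, more than one carry bit, for example two, can be generated. In that case data buses 105 and 108 are made 2 bits wide, and 2 bits of carry data can be supplied in parallel to carry signal accumulation circuit 140 over data buses 105 and 108. Embodiments 1 and 2 use counters in the carry signal accumulation circuit, whereas Embodiment 3 uses an arithmetic unit and registers, so this change makes it possible to handle multiple carry bits. In this variation too, the circuit size remains small as long as the total number of registers 430 through 433 is smaller than the number of packed data registers.
[0089]
(5) If data buses 401 through 403 and registers 430 through 433 of variation (3) of Embodiment 3, data buses 105 and 108 of variation (4) of Embodiment 3, and data bus 107 and the input of shifter 330 of variation (1) of Embodiment 1 are all made 8 bits wide, carry signal accumulation circuit 140 can also be used for multiplication in ALU 320. A multiply instruction is newly introduced for this purpose. Apart from ALU 320, its operation is the same as that of the add instruction introduced in Embodiment 3 and is not described further.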
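[0090]
(6) In Embodiment 3, a subtractor, logic unit, shifter, or the like can be added to arithmetic unit 440 in addition to the adder.
[0091]
(7) In the case of variation (6), it is useful to introduce instructions that operate on the accumulated data in registers 430 through 433. With such instructions, operations on the accumulated data alone can be executed independently of the data in the packed data registers.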
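[0092]
[Embodiment 4]
In this embodiment a single shifter performs the work of the two shifters 330 and 331 used in Embodiment 1, which makes the processor circuitry simpler than in Embodiment 1. The technique of this embodiment also applies to Embodiments 2 and 3.
[0093]
FIG. 7 shows the configuration of arithmetic unit 130 in this embodiment. Multiplexer 312 selects whether the data from data bus 104 is supplied to ALU 320 over data bus 306 or to shifter 332 over data bus 307. Multiplexer 314 selects either the 4-bit accumulated carry data on data bus 107 or the 4-bit constant '0'. Shifter 332 shifts combined data whose low-order bits are the 8-bit data supplied from multiplexer 312 over data bus 307 and whose high-order bits are the 4-bit data supplied from multiplexer 314 over data bus 500, and outputs the low-order 8 bits of the result on data bus 309. Multiplexer 313 selects either data bus 308 or data bus 309.
The instructions introduced in Embodiments 1 to 3 can be handled in the same way in this embodiment. Multiplexer 314 selects data bus 107 when one of the shift instructions introduced in Embodiments 1 to 3 is executed and selects the constant '0' for all other instructions. The high-order 4 bits input to shifter 332 are therefore 0 for shift instructions other than the new ones, and carry the accumulated carry data from bus 107 only when a new shift instruction is executed. The processor of this embodiment is thus simpler in circuitry than that of Embodiment 1.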
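The single-shifter arrangement can be sketched as one function with a selector that plays the role of multiplexer 314. Only right shifts are modeled, and the parameter names are hypothetical.

```python
def shifter332_right(elem8, n, carry4=0, use_carry=False):
    """Single 12-bit right shifter: the high-order 4 bits are either the
    accumulated carry count or the constant 0, chosen by `use_carry`."""
    upper = (carry4 & 0xF) if use_carry else 0
    combined = (upper << 8) | (elem8 & 0xFF)
    return (combined >> n) & 0xFF

# Ordinary shift: the carry count is ignored even if one is present.
print(shifter332_right(92, 2, carry4=3))
# New shift instruction: the carry count is placed above the element.
print(shifter332_right(92, 2, carry4=3, use_carry=True))
```

With the constant 0 selected, the function behaves as a plain logical 8-bit right shift, so one shifter can serve both kinds of shift instruction.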
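[0094]
[Variations of Embodiment 4]
(1) This embodiment can be combined with Embodiment 2 or its variations, or with Embodiment 3 or its variations.
[0095]
(2) In Embodiment 4 the carry input of the shifter is 4 bits wide, but the width is arbitrary.
[0096]
(3) In Embodiment 4, multiplexer 314 can be omitted if the input to data bus 107 is controlled within carry signal accumulation circuit 140.
[0097]
The invention is not limited to the above embodiments and variations. It can also be realized by combining them, and it goes without saying that it can be realized by other embodiments as well.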
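[0098]
[Effects of the Invention]
As is clear from the above description, the invention provides a processor that can, with a relatively simple circuit, correctly handle the carries that arise when additions are performed repeatedly, as when computing the average of multiple unsigned data items.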
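[Brief Description of the Drawings]
[FIG. 1] Schematic block diagram of a processor according to the invention.
[FIG. 2] Schematic block diagram of an arithmetic unit used in the apparatus of FIG. 1.
[FIG. 3] Schematic block diagram of a carry signal accumulation circuit used in the apparatus of FIG. 1.
[FIG. 4] Schematic block diagram of a packed data register group used in the apparatus of FIG. 1.
[FIG. 5] Schematic block diagram of a carry signal accumulation circuit used in another processor according to the invention.
[FIG. 6] Schematic block diagram of a carry signal accumulation circuit used in still another processor according to the invention.
[FIG. 7] Schematic block diagram of an arithmetic unit used in still another processor according to the invention.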
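[Explanation of Reference Numerals]
100, 100', 100" ... operation units
210 ... write register selection circuit
220 ... read register selection circuit
310 to 314 ... multiplexers
423 ... write register selection circuit
424 ... read register selection circuit
[0001]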
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a processor capable of processing a carry signal generated when an arithmetic unit performs addition without ignoring the signal, and particularly to a processor suitable for image processing.
[0002]
[Prior art]
SIMD processors, which process multiple data items in parallel with a single instruction, are well suited to high-speed digital image processing. Some SIMD processors logically partition each SIMD register so that the data in each partition can be handled independently. For example, the processor described in "MMX Technology Optimization Techniques" (by Eiichi Kogashi, published by ASCII) holds eight 8-bit data items in a 64-bit register and uses a parallel arithmetic unit made up of eight 8-bit arithmetic units to perform the same operation on all eight items in a register in parallel. Each partitioned data item is called an element, data made up of several elements is called packed data, and a register holding such data is called a packed data register.
[0003]
Pixel data generally takes values from 0 to 255 and is represented in 8 bits, so eight consecutive pixels can be stored in one packed data register and the same operation can be applied to all eight in parallel.
[0004]
In the above processor, an addition may produce a carry and a subtraction may produce a borrow. In wrap-around mode, which ignores carries and borrows, the result is then incorrect. The processor therefore also supports saturating arithmetic: when an operation produces a carry or a borrow, the result is clamped to the maximum or minimum representable value. For example, adding 5 to a pixel value of 254 yields 255. For a simple addition like this, the saturation error is small and can sometimes be ignored. For other additions it cannot. For example, computing the average of n pixel values a, b, c, d, ... as x = (a + b + c + d + ...) / n first sums the pixel values by repeated addition and then divides the sum by n. When n is 2 to the power m (m a positive integer), the division is implemented by shifting the sum m bits toward the low-order side. If a carry occurs during the repeated additions and saturation clamps the intermediate result to the maximum value, the error in the sum becomes large, and so does the error in the final average.
[0005]
In order to prevent the above error, a method of calculating by increasing the number of effective bits of pixel data as follows can be adopted. The data of each pixel is treated as 16 bits, the size of each element is set to 16 bits, one register holds four elements, and the operations on these four elements are executed in parallel. The result of the final operation is returned to 8 bits and stored in the memory. This processor is also capable of executing an operation on data having an increased effective bit width. That is, each register can hold four 16-bit elements or two 32-bit elements. At this time, the eight 8-bit operation units are reconfigured into four 16-bit operation units or two 32-bit operation units according to the size of the element.
[0006]
[Problems to be solved by the invention]
Making each element 16 bits long, as above, preserves arithmetic accuracy but halves the number of operations that can run in parallel, that is, the number of elements (pixels) processed in parallel. The processing speed of the processor drops substantially as a result.
[0007]
In order to prevent such a problem, it is conceivable to increase the size of each register in advance. For example, the size of the smallest element held in each register can be 12 bits or 16 bits. In this case, assuming that each register holds eight elements as in the conventional case, the size of the register becomes 96 bits or 128 bits. Further, in order to increase the element size in this way, the bit width that can be processed by the arithmetic unit for each element must also be increased. That is, it is necessary that each arithmetic unit is configured to perform an operation on 16-bit or 12-bit data and output 16-bit or 12-bit data as operation result data. Since there are eight such computing units in the above processor, the total size of these computing units is considerably increased.
[0008]
Thus, with conventional methods, handling carry signals accurately increases the circuit size of the registers and arithmetic units. Moreover, in a SIMD processor such as the one above, each register holds multiple elements and there are as many arithmetic units as elements, so increasing the element size enlarges both the arithmetic units and the registers.
[0009]
Accordingly, an object of the present invention is to provide a processor that can, with a relatively simple circuit, correctly handle the carries that arise when additions are performed repeatedly, as when computing the average of multiple unsigned data items.
[0010]
[Means for Solving the Problems]
In the image data processing, the image data is unsigned data, and in the calculation processing of the total sum data required in the average value processing of a plurality of unsigned data, the carry signal is generated by the addition in the arithmetic unit, but the carry signal is generated. Is not. Therefore, in the process of calculating the sum data of these data, it is necessary to calculate the cumulative value of a plurality of carry signals generated by the addition of the plurality of data. Not used in The accumulated value is required later in the division process of dividing the total data by the number of data. Therefore, if the accumulated value of the carry generated during the addition of such data is stored and the division is performed on the set data of the accumulated value and the total data when the division is performed on the sum data later, the digit Can be handled correctly. With this method, it is not necessary to increase the bit width of the data to be added and the number of bits of the arithmetic unit. The division of the obtained sum data can be executed by shifting the set data to the lower side by a shifter.
[0011]
The number of bits of the set data is the sum of the number of bits of the sum data and the number of bits of the accumulated value of the carry signals. Therefore, the shifter needs to be able to shift data of this extended number of bits. However, the resulting increase in the circuit scale of the shifter is expected to be smaller than the increase in circuit scale that would be needed to hold data of the extended number of bits in each register and to process it in the arithmetic units. Therefore, with the method of accumulating the carry signals generated during the summation and using the accumulated value in the later division, the carries generated while calculating the sum can be processed correctly, and the circuit scale required for the average value processing can be reduced.
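The method described above can be illustrated with a minimal Python sketch for a single element. The function names (`add8_with_carry`, `average_pow2`, `average_saturating`) are hypothetical and chosen only for this illustration. The sketch assumes that the number of data is 2 to the power m and that the carry count is wide enough not to overflow.

```python
def add8_with_carry(acc, value, carry_count):
    """8-bit wraparound add that also counts the carry out of bit 7."""
    total = acc + value
    return total & 0xFF, carry_count + (total >> 8)


def average_pow2(values, m):
    """Average 2**m unsigned 8-bit values using only an 8-bit accumulator."""
    assert len(values) == 1 << m
    acc, carries = 0, 0
    for v in values:
        acc, carries = add8_with_carry(acc, v, carries)
    # Set data: accumulated carries on the upper side, 8-bit sum on the lower side.
    set_data = (carries << 8) | acc
    # Division by 2**m is a shift of the set data to the lower side.
    return set_data >> m


def average_saturating(values, m):
    """Same average, but every addition saturates at 255 (for comparison)."""
    acc = 0
    for v in values:
        acc = min(acc + v, 0xFF)
    return acc >> m


print(average_pow2([200, 180, 220, 240], 2))        # 210, the exact average
print(average_saturating([200, 180, 220, 240], 2))  # 63, large error
```

The comparison shows why saturation is inadequate for repeated addition: the saturated sum is pinned at 255 after the second addition, whereas the carry count preserves the full sum without widening the 8-bit accumulator.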
[0012]
The above applies not only to average value processing but also to other processing. That is, in general, a carry signal generated by an addition of unsigned data need only be stored until the result data of the addition is used, and then be processed together with the addition result data at that time.
[0013]
The present invention has been made by paying attention to the above-mentioned feature of the processing of unsigned data. A processor according to the present invention is provided with a circuit for generating an accumulated value of the carry signals output when an arithmetic unit performs addition, and with another arithmetic unit for executing an operation on the accumulated value.
[0014]
More specifically, in order to achieve the above object, in the processor according to the present invention, the bit width of the data processed by the arithmetic unit and the bit width of the data output by the arithmetic unit do not include the carry signal portion.
[0015]
A carry signal accumulating circuit is provided for generating carry signal accumulated data representing an accumulated value of a plurality of carry signals generated while the arithmetic unit performs a plurality of additions. This carry signal accumulation data is composed of a plurality of bits.
[0016]
Further, another arithmetic unit is provided for executing an operation on the carry signal accumulation data generated by the carry signal accumulation circuit.
[0017]
In a preferred embodiment of the present invention, the carry signal accumulating circuit is constituted by a counter.
[0018]
In a specific aspect of the present invention, the other arithmetic unit includes a shifter that shifts to the lower side a set consisting of the carry signal accumulation data generated by the carry signal accumulation circuit and, appended to its lower side, the addition result data obtained by the plurality of additions. The other arithmetic unit operates in response to a specific instruction different from the addition instruction, specifically a shift instruction.
[0019]
In a more specific aspect of the present invention, one or more carry signal accumulating circuits are provided in common for a plurality of registers in the processor.
[0020]
In a preferred aspect of the present invention, the number of the plurality of carry signal accumulation circuits is smaller than the number of the plurality of registers in the processor.
[0021]
In a preferred aspect of the present invention, the number of bits of the carry signal accumulation data held in each carry signal accumulation circuit is smaller than the predetermined number of bits.
[0022]
In a further specific aspect of the present invention, at least one carry signal accumulating circuit is provided in common for a plurality of packed data registers in a SIMD type processor.
[0023]
[Best Mode for Carrying Out the Invention]
Hereinafter, a processor according to the present invention will be described in more detail with reference to some embodiments shown in the drawings. In the following, the same reference numbers represent the same or similar ones. Further, in the second and subsequent embodiments, only differences from the first embodiment will be mainly described.
[0024]
<First Embodiment of the Invention>
FIG. 1 is a block diagram of a SIMD type processor according to the present invention. In FIG. 1, it is assumed that the packed data register group 120 is composed of, for example, eight 64-bit registers, and that each register includes eight fields, each of which can hold 8-bit element data. Arithmetic units 100, 100 ', ..., 100 " are provided corresponding to the eight fields of one register, and each processes one of the eight elements held in the same register. A total of eight arithmetic units are used, but only three are shown in FIG. 1 for simplicity and the others are omitted. The arithmetic units 100, 100 ', ..., 100 " comprise arithmetic units 130, 130 ', ..., 130 ", carry signal accumulating circuits 140, 140 ', ..., 140 ", and multiplexers 150, 150 ', ..., 150 ", respectively.
[0025]
The carry signal accumulating circuit 140 is newly provided in the present embodiment. It is a circuit for accumulating the plurality of carry signals output while the arithmetic and logic unit (ALU) 320 (FIG. 2) in the arithmetic unit 130 repeatedly executes addition. Specifically, the circuit 140 includes a counter 410 (FIG. 3). The same applies to the other carry signal accumulation circuits 140 ', ..., 140 ". As another arithmetic unit that uses the accumulated value of the carry signals generated by the circuit 140 in response to a specific instruction described later, a shifter 330 (FIG. 2) is provided in the arithmetic unit 130. The same applies to the other arithmetic units 130 ', ..., 130 ". In this embodiment, the carry signal accumulation circuit 140 and the shifter 330 make it possible to correctly process the carry signals generated when the ALU 320 repeatedly performs addition on unsigned data.
[0026]
The instruction fetch circuit 162 sequentially fetches instructions from the memory 163, the instruction decoder 161 decodes each fetched instruction, and the control circuit 160 generates a control signal 170 for executing the decoded instruction and controls each device with this control signal. When the instruction decoded by the instruction decoder 161 is an instruction for loading data from the memory 163 into any register in the packed data register group 120, or an instruction for storing the data in any register in the packed data register group 120 into the memory 163, the memory access circuit 164 loads or stores the data. Data is moved from the memory 163 to the packed data register group 120 via the data buses 114, 110, and 113. This data is 64 bits long and usually contains eight 8-bit elements. The 64-bit data may instead contain four 16-bit elements. Data is moved from the packed data register group 120 to the memory 163 via the data buses 112 and 115.
[0027]
When the instruction decoded by the instruction decoder 161 is an operation instruction using the packed data register group 120, eight elements are read from one of the pair of packed data registers specified by the instruction and transferred to the arithmetic units 100, 100 ', ..., 100 " via the buses 112 and 101. Similarly, eight elements are read from the other of the pair of packed data registers and transferred to the arithmetic units 100, 100 ', ..., 100 " via the buses 111 and 102. These arithmetic units execute operations on the elements transferred to them, and the resulting 8-bit operation result data are transferred from the buses 109, 109 ', ..., 109 " via the common data buses 110 and 113 to the one register in the packed data register group 120 designated by the instruction. As described above, the processor of the present embodiment is a SIMD type processor that processes a plurality of elements in parallel with a single instruction.
[0028]
Each of the data buses 101, 102, 110, 111, 112, 113, 114 has a 64-bit width. The 64-bit data supplied via the data buses 101 and 102 are divided into 8-bit units and supplied in parallel, starting from the upper bits, to the arithmetic units 130 " to 130 by the data buses 103 " to 103 and 104 " to 104, respectively. The operation results are arranged on the data bus 110 via the data buses 109 " to 109 so that the data on the data bus 109 " occupy the most significant bits, and are stored in the packed data register group via the data bus 113 as 64-bit data.
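The division of a 64-bit word into eight 8-bit elements can be sketched as follows. This is only an illustration: the helper names `pack` and `unpack` are hypothetical, and element 0 is assumed to occupy the lowest 8 bits, corresponding to the arithmetic unit 130, with element 7 in the highest 8 bits, corresponding to the arithmetic unit 130 ".

```python
LANES = 8       # number of 8-bit elements per register
LANE_BITS = 8   # element size in bits


def pack(elements):
    """Pack eight unsigned 8-bit elements into one 64-bit word (element 0 lowest)."""
    assert len(elements) == LANES
    word = 0
    for i, e in enumerate(elements):
        assert 0 <= e <= 0xFF
        word |= e << (LANE_BITS * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit elements."""
    return [(word >> (LANE_BITS * i)) & 0xFF for i in range(LANES)]
```

For example, `pack([0x11, 0, 0, 0, 0, 0, 0, 0x88])` places `0x88` in the most significant byte of the word.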
[0029]
As shown in FIG. 4, the packed data register group 120 includes eight 64-bit packed data registers 200 to 207, a write register selection circuit 210, and a read register selection circuit 220. Hereinafter, the packed data registers 200 to 207 may be referred to simply as registers. In the instructions below, the respective registers are denoted by R0 to R7.
[0030]
In the arithmetic unit 100, the lower 8 bits of each of the two 64-bit data, which are read from two registers in the packed data register group 120 and supplied via the data buses 101 and 102, are supplied to the arithmetic unit 130 via the 8-bit data buses 103 and 104. The arithmetic unit 130 is an 8-bit arithmetic unit that performs an operation on these data and outputs the 8-bit operation result data to the 8-bit data bus 109.
[0031]
When a carry occurs in the arithmetic unit 130, the carry bit data is sent to the multiplexer 150 via the 1-bit bus 105. The multiplexer 150 is newly provided in the present embodiment. When the instruction being executed is an addition instruction requesting processing of elements of 16 bits or more, the multiplexer 150 sends the carry bit data on the bus 105 to the next higher-order arithmetic unit 130 ' via the data bus 106. When the instruction being executed is an addition instruction requesting that 8-bit elements be processed without ignoring the carry, the multiplexer 150 sends the carry bit data on the bus 105 to the carry signal accumulating circuit 140 via the data bus 108. The latter instruction is newly established in the present embodiment, and its use will be described later. For any instruction other than these two types of addition instructions, the multiplexer 150 connects the bus 105 to neither the data bus 106 nor the data bus 108.
[0032]
The accumulated data of the carry signal held in the carry signal accumulating circuit 140 is used when a specific instruction is executed. In the present embodiment, when a specific type of shift instruction described later is executed, the bits stored therein are supplied to arithmetic unit 130 via 4-bit data bus 107. The arithmetic unit 130, the carry signal accumulating circuit 140 and the multiplexer 150 are controlled by the control circuit 160.
[0033]
FIG. 2 shows the details of the arithmetic unit 130. This arithmetic unit comprises an arithmetic and logic unit (ALU) 320 for performing addition, subtraction, and logical operations; a shifter 331 with 8-bit input and 8-bit output for performing shift operations as in the conventional art; a shifter 330 with 12-bit input and 8-bit output, newly provided in the present embodiment; a multiplexer 310; and a multiplexer 311. The arithmetic unit 130 also includes other arithmetic units such as a multiplier, and the multiplier can execute multiplication ignoring the carry. However, since these arithmetic units are not related to the feature of the present invention, they are not shown and their description is omitted. The multiplexer 310 selects whether the data supplied by the data bus 104 is supplied to the ALU 320 via the data bus 300, to the shifter 330 via the data bus 301 as the lower 8 bits of its input, or to the shifter 331.
[0034]
When the instruction being executed is an addition / subtraction instruction, the ALU 320 performs addition / subtraction on the two data supplied by the data buses 103 and 300, and outputs the operation result to the data bus 303. The ALU 320 outputs a carry bit to the bus 105 when the instruction being executed is an addition instruction requesting that the carry be processed correctly. When the instruction being executed is a logical operation instruction, the ALU 320 performs a logical operation on the two data supplied from the data buses 103 and 300, and outputs the operation result to the data bus 303. When the instruction being executed requests addition / subtraction on elements of 16 bits or more, the ALUs in the arithmetic units other than the lowest arithmetic unit 130, such as the arithmetic unit 130 ', use the carry bit data supplied from the lower arithmetic unit via the data bus 106 ' in their addition / subtraction.
[0035]
When the instruction being executed requests a shift of 8-bit data held in any of the registers in the packed data register group 120, the shifter 331 shifts the data supplied from the register via the multiplexer 310 to the upper side or the lower side, as specified by the instruction, by the number of bits specified by the instruction, and outputs the 8-bit shift result data to the data bus 305.
[0036]
This shifter may be either a serial shifter or a barrel shifter, but the latter is preferable in terms of speed. When the shift direction is the lower side, the shifter 331 repeatedly supplies the original most significant bit of the data to be shifted as a new most significant bit. The shifter 331 repeatedly supplies the value “0” as the least significant bit of the data to be shifted when the shift direction is on the upper side.
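The fill behaviour of the shifter 331 described above can be sketched for a single 8-bit element as follows. The function names are hypothetical, and the sketch models one shifter in isolation, without the connection to neighbouring shifters used for 16-bit and 32-bit elements.

```python
def shift8_lower(data, n):
    """Shift an 8-bit value n bits to the lower side.

    The original most significant bit is supplied repeatedly as the new
    most significant bit, as described for the shifter 331.
    """
    msb = (data >> 7) & 1
    for _ in range(n):
        data = (data >> 1) | (msb << 7)
    return data


def shift8_upper(data, n):
    """Shift an 8-bit value n bits to the upper side, supplying 0 as the new LSB."""
    return (data << n) & 0xFF
```

For example, `shift8_lower(0b10010000, 2)` gives `0b11100100`, while `shift8_upper(0b10000001, 1)` gives `0b00000010`.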
[0037]
The shifter 331 and the corresponding shifters (not shown) in the arithmetic units 100 to 100 " can be connected to one another. For example, in the case of a 16-bit shift instruction, the shifter (not shown) in the arithmetic unit 100 ' is connected to the upper side of the shifter 331 in the arithmetic unit 100. When the shift direction is the lower side, the original most significant bit of the shifter (not shown) in the arithmetic unit 100 ' is repeatedly supplied as its new most significant bit, and the least significant bit of that shifter is repeatedly supplied as the most significant bit of the shifter 331 in the arithmetic unit 100. Conversely, when the shift direction is the upper side, the most significant bit of the shifter 331 in the arithmetic unit 100 is repeatedly supplied as the least significant bit of the shifter (not shown) in the arithmetic unit 100 '. The corresponding shifters (not shown) in the higher arithmetic units are likewise connected in pairs. For a 32-bit shift instruction, four shifters are connected.
[0038]
Shifter 330 is newly provided in the present embodiment as an arithmetic unit that uses the accumulated data held in the carry signal accumulation circuit 140. The shifter 330 operates when the instruction being executed is a specific shift instruction, described later, that requests a shift to the lower side of a set consisting of 8-bit data held in one of the registers in the packed data register group 120 and the data accumulated by the carry signal accumulation circuit 140. In that case, the shifter 330 forms set data having the 8-bit data supplied from the register via the data bus 301 as its lower bits and the 4-bit carry accumulation data supplied via the data bus 107 as its upper bits, shifts the set data to the lower side by the number of bits specified by the instruction, and outputs 8-bit shift result data taken from the shifted data to the data bus 304. This shifter may be either a serial shifter or a barrel shifter, but the latter is preferable in terms of speed.
[0039]
Like the shifter 331, the shifter 330 supports not only 8-bit instructions but also 16-bit and 32-bit instructions. For example, for a 16-bit shift instruction using the shifter 330, described later, the shifter (not shown) in the arithmetic unit 100 ' is connected to the shifter 331 in the arithmetic unit 100. When the shift direction is the lower side, the least significant bit of the shifter (not shown) in the arithmetic unit 100 ' is repeatedly supplied as the most significant bit of the shifter 331 in the arithmetic unit 100. Conversely, when the shift direction is the upper side, the most significant bit of the shifter 331 in the arithmetic unit 100 is repeatedly supplied as the least significant bit of the shifter (not shown) in the arithmetic unit 100 '. The corresponding shifters (not shown) in the higher arithmetic units are likewise connected in pairs. For a 32-bit shift instruction, four shifters are connected.
[0040]
The multiplexer 311 selects operation result data on any one of the data buses 303, 304, and 305 and outputs the data to the data bus 109.
[0041]
As shown in FIG. 3, in the present embodiment, the carry signal accumulating circuit 140 includes a 4-bit counter 410. When carry bit data is supplied from the data bus 108, the counter 410 increases its counter value by one and outputs the incremented counter value to the data bus 107. When a clear signal is given on the line 170, the counter 410 clears its value to 0.
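The behaviour of the counter 410 can be modelled with a small class. The class and method names are hypothetical. The text does not state what happens when the counter overflows, so the wraparound used here is an assumption made only to keep the model well defined.

```python
class CarryCounter:
    """Behavioural model of the counter 410 (4 bits wide by default)."""

    def __init__(self, bits=4):
        self.bits = bits
        self.value = 0

    def carry_in(self, carry_bit):
        """Count one carry arriving on data bus 108; return the value seen on bus 107."""
        if carry_bit:
            # Assumption: the counter wraps modulo 2**bits on overflow.
            self.value = (self.value + 1) & ((1 << self.bits) - 1)
        return self.value

    def clear(self):
        """Model the clear signal on line 170."""
        self.value = 0
```

A short usage: after `c = CarryCounter()`, calling `c.carry_in(1)`, `c.carry_in(0)`, `c.carry_in(1)` leaves `c.value` at 2, and `c.clear()` returns it to 0.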
[0042]
The arithmetic unit 130 and the other arithmetic units 130 ′, 130 ″,... Have the same circuit configuration and operate in parallel, and carry signal accumulation circuits 140, 140 ′, 140 ″ provided in these arithmetic units. ,... All have the same circuit configuration and operate in parallel. From the above, it can be seen that the arithmetic units 100, 100 ', 100 ",... All have the same configuration and operate in parallel.
[0043]
In this embodiment, in order to operate the newly provided devices, an addition instruction, a shift instruction, and a counter clear instruction are newly provided in addition to the conventional instructions. In the following, each instruction is written with its opcode and operands as mnemonics. The mnemonics used below are defined for convenience of explanation, and for some instructions they may differ from the mnemonics conventionally used.
[0044]
A conventional addition instruction, for example, "ADD8 Rx, Ry" (x, y = 0 to 7), is an instruction that logically divides the packed data registers Rx and Ry into 8-bit fields, regards each pair of corresponding elements in those registers as unsigned data, adds each pair independently of the other elements, and stores the result in the packed data register Ry. This addition instruction ignores the carry signal. It may instead be an instruction that performs a saturation operation.
[0045]
This instruction is fetched by the instruction fetch circuit 162 and decoded by the instruction decoder 161. The control circuit 160 generates a control signal 170 from the decoded instruction and uses it to control the read register selection circuit 220, the write register selection circuit 210, and the multiplexers 310, 311, and 150. The read register selection circuit 220 controlled by the control signal 170 reads out eight elements in parallel from each of the registers Rx and Ry and outputs them to the data buses 111 and 112. The similarly controlled multiplexer 310 supplies the element supplied to it to the ALU 320 via the data bus 300, and the similarly controlled multiplexer 311 outputs the addition result data provided by the ALU 320 to the bus 109. The multiplexer 150 (FIG. 1), also controlled by the control signal 170, is turned off and connects its input 105 nowhere. Therefore, the carry bit generated in the ALU 320 is ignored.
[0046]
On the other hand, a new addition instruction, for example, "ADD8C Rx, Ry", differs from the above-mentioned conventional addition instruction in that the multiplexer 150 is controlled so as to connect its input to the data bus 108. In all other respects it is controlled in the same way. Therefore, when a carry occurs in the ALU 320 as a result of executing the new addition instruction, the generated carry bit is supplied to the counter 410 (FIG. 3) in the carry signal accumulating circuit 140, and the value of the counter 410 increases by one.
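The difference between the two addition instructions can be sketched on 64-bit packed words as follows. The function name `add8_lanes` and its `counters` parameter are hypothetical. Passing no counters models the conventional "ADD8" in its wraparound form, and passing a list of eight integers models "ADD8C", with one counter per arithmetic unit.

```python
def add8_lanes(rx, ry, counters=None):
    """Lane-wise 8-bit addition of two 64-bit packed words; returns the new Ry.

    counters=None      : carries are ignored (models ADD8, wraparound).
    counters=[...] * 8 : each lane's carry increments its counter (models ADD8C).
    """
    result = 0
    for lane in range(8):
        a = (rx >> (8 * lane)) & 0xFF
        b = (ry >> (8 * lane)) & 0xFF
        total = a + b
        result |= (total & 0xFF) << (8 * lane)
        if counters is not None and total > 0xFF:
            counters[lane] += 1   # carry routed to this lane's counter
    return result
```

In both cases the stored 8-bit results are identical; only the fate of the carry bit differs, which matches the statement that the two instructions are otherwise controlled in the same way.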
[0047]
This new addition instruction is used in place of the conventional addition instruction when correct handling of the carry signal is required. For example, when calculating the average value of a plurality of data, the new addition instruction is used for the plurality of additions performed to obtain the sum of the data. In that case, the total number of carry signals generated during the execution of the plurality of additions is held in the counter 410.
[0048]
A conventional shift instruction, for example, “SH8Rn Rx”, is an instruction that logically divides the packed data register Rx into 8-bit fields, shifts each field independently to the right by n bits, and stores the shifted 8-bit data in the packed data register Rx. This instruction is fetched by the instruction fetch circuit 162 and decoded by the instruction decoder 161. The control circuit 160 generates a control signal 170 from the decoded instruction and uses it to control the read register selection circuit 220, the write register selection circuit 210, and the multiplexers 310, 311, and 150. The read register selection circuit 220 controlled by the control signal 170 outputs Rx to the data bus 112. The similarly controlled multiplexer 310 supplies the element supplied to it to the shifter 331 via the data bus 302. The similarly controlled multiplexer 311 passes the output of this shifter, received via the bus 305, to the bus 109. The similarly controlled multiplexer 150 is turned off and connects its input 105 nowhere.
[0049]
On the other hand, a new shift instruction, for example, “SH8RnC Rx”, differs from the above-mentioned conventional shift instruction in that the multiplexer 310 transfers the element read from the register Rx via the bus 104 to the shifter 330 via the bus 301, and the multiplexer 311 transfers the shifted data output by the shifter 330 on the bus 304 to the bus 109. In all other respects it is controlled in the same way as the conventional shift instruction. Since the four bits constituting the accumulated data held in the counter 410 are input in parallel to the upper side of the shifter 330, the shifter 330 shifts the set of the accumulated data and the element data in the register Rx to the lower side by n bits.
[0050]
This new shift instruction is used instead of the conventional shift instruction when the accumulated value of the carry signals held in the carry signal accumulation circuit 140 is to be used. In the above-mentioned average value processing, it is used, after the sum of a plurality of data has been obtained, to divide the total data by the number of data. Assuming that the total data is held in the register Rx, the data obtained by appending the accumulated value of the carry signals generated while calculating the total data to the upper side of the total data is shifted. Therefore, the shift result data is a correct result that takes into account the carry signals generated during the calculation of the total data.
[0051]
The newly provided counter clear instruction, for example, “CLRC” sets the counter value of the counter 410 to 0. When this instruction is fetched by the instruction fetch circuit 162, it is decoded by the instruction decoder 161 and the control circuit 160 generates a control signal 170 from the decoded instruction, and the counter 410 is cleared by the control signal 170.
[0052]
Hereinafter, details of the average value calculation processing in the processor of the present embodiment will be described. Each of the eight source data Ai (i = 0 to 7) is 8-bit data, and the eight data are loaded into the eight fields of the same register 200, as shown for the packed data register 200 in the figure, each forming one element. Therefore, i can be called an element number. In the figure, it is assumed that the most significant bit of each source data is located at the leftmost end of the field holding the data. Similarly, the other source data Bi, Ci, Di (i = 0 to 7) are 8-bit data, and the eight source data Bi (i = 0 to 7), Ci (i = 0 to 7), and Di (i = 0 to 7) are held in the registers 201, 202, and 203, respectively. It is assumed that these data are all unsigned data. Using the above data, the average value Xi = (Ai + Bi + Ci + Di) / 4 (i = 0 to 7) of the four data having the same element number i is to be obtained.
[0053]
The instruction sequence for obtaining the average value Xi (i = 0 to 7) is as follows in the present embodiment.
[0054]
# 1 CLRC
# 2 LOAD (ma), R0
# 3 LOAD (mb), R1
# 4 LOAD (mc), R2
# 5 LOAD (md), R3
# 6 ADD8C R1, R0
# 7 ADD8C R2, R0
# 8 ADD8C R3, R0
# 9 SH8R2C R0
# 10 STORE R0, (md)
First, the counter 410 is cleared by the first instruction, the clear instruction. The next four instructions are load instructions. For example, LOAD (ma), R0 is an instruction for loading the 64-bit data at the memory address ma into the register R0. Here, the image data group A0 to A7 is stored at the memory address ma, and these data are loaded into the register R0 by one load instruction. Similarly, the image data groups B0 to B7, C0 to C7, and D0 to D7 are loaded from the memory 163 into the registers R1, R2, and R3 by the second, third, and fourth load instructions, respectively. By the first addition instruction, the data groups in the registers R0 and R1 are added as A0 + B0, A1 + B1, ..., A7 + B7, and the eight resulting sum data X0 to X7 are stored in the register R0. By the second addition instruction, the sum data X0 to X7 in the register R0 and the data C0 to C7 in the register R2 are added, and the resulting sum data A0 + B0 + C0, A1 + B1 + C1, ..., A7 + B7 + C7 are stored in the register R0. These sum data are again denoted by X0 to X7. By the last addition instruction, the data group representing the final sums A0 + B0 + C0 + D0, A1 + B1 + C1 + D1, ..., A7 + B7 + C7 + D7 is obtained and stored in the register R0. These sum data are also denoted by X0 to X7.
[0055]
If a carry is generated in any of the arithmetic units during the execution of these three addition instructions, for example by the ALU 320 in the arithmetic unit 100, the counter 410 in that arithmetic unit counts up. The same applies to the other arithmetic units 100 ' and 100 ". Thus, the counter 410 in each arithmetic unit holds the total number of carry bits generated by the ALU 320 in the corresponding arithmetic unit 130. When the shift instruction following the three addition instructions is executed, the shifter 330 in each arithmetic unit takes the total data Xi (i = 0, 1, ..., 7) held in the register R0, forms 12-bit data by appending Xi to the lower side of the accumulated data in the counter 410 of the corresponding arithmetic unit, and shifts this 12-bit data to the lower side by 2 bits. The data output by the shifter 330 as a result represents the average value of the data Ai, Bi, Ci, and Di, calculated correctly by reflecting the accumulated data. The last instruction, STORE R0, (md), stores the average value data in the register R0 at the memory address md.
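The instruction sequence above can be traced with a short behavioural simulation. This is only a sketch: the function name `run_average` is hypothetical, registers are modelled as Python lists of eight 8-bit values rather than 64-bit words, and the shift result is assumed to be the low 8 bits of the shifted 12-bit set data.

```python
def run_average(a, b, c, d):
    """Simulate CLRC, three ADD8C instructions and SH8R2C on eight lanes."""
    counters = [0] * 8              # CLRC: clear the carry counter of every lane
    r0 = list(a)                    # LOAD (ma), R0
    for src in (b, c, d):           # ADD8C R1,R0 / ADD8C R2,R0 / ADD8C R3,R0
        for i in range(8):
            total = r0[i] + src[i]
            r0[i] = total & 0xFF            # 8-bit result written back to R0
            counters[i] += total >> 8       # carry accumulated per lane
    # SH8R2C R0: shift the 12-bit set (counter on the upper side, sum on the
    # lower side) by 2 bits to the lower side and keep 8 bits.
    return [(((counters[i] << 8) | r0[i]) >> 2) & 0xFF for i in range(8)]


a = [0, 10, 100, 200, 250, 255, 128, 64]
b = [0, 20, 110, 210, 251, 255, 127, 64]
c = [0, 30, 120, 220, 252, 255, 128, 65]
d = [0, 40, 130, 230, 253, 255, 127, 66]
print(run_average(a, b, c, d))
```

For every lane the result equals `(Ai + Bi + Ci + Di) // 4`, including the lanes whose intermediate sums exceed 255, because at most three carries occur and they are all retained in the per-lane counter.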
[0056]
Thus, in the present embodiment, eight average values Xi can be obtained in parallel. As can be seen from the above description, in the present embodiment, by adding a simple circuit to the conventional arithmetic unit, the carry bit can be held by the counter 410 and referenced by the new shifter 330, so that "x = (a + b + c + d ) / 4 "can be executed without ignoring a carry signal generated in an operation for calculating an average of a plurality of 8-bit source data. In this case, there is no need to increase the element size, and it is not necessary to increase the bit width handled by the arithmetic unit. Therefore, the scale of the newly added circuit in the present embodiment can be small.
[0057]
<Modification of First Embodiment of the Invention>
(1) In the first embodiment, the data bus 107 and the corresponding input of the shifter 330 are each 4 bits wide, but this width can be chosen freely according to the circuit scale and the required performance. The maximum value of the counter 410 is then determined by this number of bits. In the calculation of the average of four 8-bit data, at most three carries can occur, so the value of the counter 410 fits in 2 bits. Therefore, for this type of application only, 2 bits are sufficient for both the data bus 107 and the shifter 330. This modification can be applied to other embodiments described below.
[0058]
(2) In the first embodiment, there are eight packed data registers of 64 bits each. However, the number and width of the packed data registers can be chosen freely according to the circuit scale, and the widths of the data buses 101, 102, 110, 111, 112, and 113 are then chosen accordingly. This modification can be applied to other embodiments described below.
[0059]
(3) In the modified example (2), the number of the arithmetic units 100 to 100 " is arbitrary. For example, when the registers of the packed data register group 120 have 128 bits, the number of the arithmetic units 100 to 100 " is set to 16, and sixteen 8-bit operations are performed in parallel. This modification can be applied to other embodiments described below.
[0060]
(4) In the first embodiment, the description has been made mainly on the assumption that the element size is 8 bits. The operation at an element size of 16 bits differs from the operation shown in the first embodiment only in that the multiplexer 150 is always connected to the data bus 106; the other operations are the same as in the first embodiment. Therefore, the same processing can be performed with an element size of 16 bits by newly providing the instructions “ADD16C Rx, Ry” and “SH16RnC Rx”. These instructions are decoded by the instruction decoder 161, and the control circuit 160 generates a control signal 170 from the decoded instructions. Here, for an instruction with an element size of 16 bits, the multiplexer 150 is always connected to the data bus 106, and the multiplexer 150 ' is controlled as required by the control signal 170. Although not shown, this control is applied to every other one of the multiplexers 150 to 150 ". The same applies to the case of 32 bits, where the control is applied once for every group of four multiplexers. This modification is also applicable to other embodiments described below.
[0061]
(5) In the first embodiment, the arithmetic unit 130 has two inputs. However, the arithmetic unit 130 may be adapted to three or four inputs, and the number of data buses 101, 102, 111, 112, 103, and 104 provided in parallel is then chosen according to the number of inputs. This also applies to other embodiments described below.
[0062]
(6) If the counter 410 is cleared at the same time as the shift is performed by the shift instruction “SH8RnC Rx” newly provided in the first embodiment, the clear instruction “CLRC” newly provided in the first embodiment can be omitted. As a result, the number of instructions to be executed is reduced, which is useful for speeding up processing.
[0063]
(7) In the first embodiment, a SIMD type processor to which the present invention is applied has been described. However, the present invention is not limited to SIMD type processors and is, needless to say, also applicable to a SISD type processor having only one arithmetic unit. Since a SIMD type processor has a large number of arithmetic units, however, the advantage that the present invention can correctly process the carry signal without increasing the circuit scale of the arithmetic circuits is particularly large there.
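Modification (4) above, in which two 8-bit arithmetic units handle one 16-bit element, can be sketched as follows. The function name `add16c` is hypothetical. The sketch assumes that the carry of the lower unit is forwarded to the upper unit, corresponding to the multiplexer 150 being connected to the data bus 106, and that the carry of the upper unit is the one counted by the carry signal accumulating circuit; the latter is an interpretation of the text, not something it states explicitly.

```python
def add16c(x, y, counter):
    """Add two unsigned 16-bit elements with two chained 8-bit additions.

    Returns the 16-bit wraparound sum and the updated carry count.
    """
    lo = (x & 0xFF) + (y & 0xFF)
    carry_lo = lo >> 8                    # forwarded to the upper unit
    hi = (x >> 8) + (y >> 8) + carry_lo
    carry_hi = hi >> 8                    # assumed to go to the carry counter
    return ((hi & 0xFF) << 8) | (lo & 0xFF), counter + carry_hi
```

For example, `add16c(0x00FF, 0x0001, 0)` propagates the lower carry into the upper byte and leaves the counter unchanged, while `add16c(0xFFFF, 0x0001, 0)` wraps to 0 and increments the counter.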
[0064]
<Second Embodiment of the Invention>
This embodiment differs from the first embodiment mainly in that a plurality of counters are provided in the carry signal accumulating circuit 140, and the counter that accumulates the carry signal can be selected from among them by an instruction.
That is, as shown in FIG. 5, the carry signal accumulating circuit 140 comprises counters 410 to 413, each of which increases its counter value by one when carry bit data is supplied and outputs that value; a multiplexer 421 for selecting which of the counters 410 to 413 is supplied with the carry bit data; and a multiplexer 422 for selecting which of the counters 410 to 413 supplies its output to the data bus 107. Each of the counters 410 to 413 has a counter clear function as in the first embodiment.
[0065]
Here, in order to individually access the counters 410 to 413, the instruction newly provided in the first embodiment is further extended. First, instead of the addition instruction “ADD8C” newly provided in the first embodiment, the addition instruction “ADD8Cn Rx, Ry” (n = 0-3) is newly established. n = 0 to 3 correspond to the counters 410 to 413, respectively.
[0066]
The difference between the instruction “ADD8C0 Rx, Ry” and the “ADD8C Rx, Ry” newly provided in the first embodiment is that the multiplexer 421 is controlled so as to specify the counter 410. When the control circuit 160 generates the control signal 170 from the decoded instruction, the multiplexer 421 is controlled in addition to the control performed for the “ADD8C Rx, Ry” of the first embodiment. As a result, the multiplexer 421 is connected to the counter 410 specified by this instruction, and the carry bit data on the data bus 108 is added to the counter 410. Since the output of the counter is not affected, it is not necessary to operate the multiplexer 422. Similarly, the addition instructions ADD8C1, ADD8C2, and ADD8C3 select the counters 411, 412, and 413, respectively.
[0067]
In place of the shift instruction newly provided in the first embodiment, a shift instruction “SH8RmCn Rx” (m: number of bits to shift, n: counter selection value, x: packed data register selection value) is newly provided so that the counter among 410 to 413 whose value is output to the data bus 107 can be specified. For example, the instruction “SH8RmC0 Rx” differs from the “SH8RnC Rx” newly provided in the first embodiment in that the multiplexer 422 selects which of the counters 410 to 413 outputs to the data bus 107. This instruction is decoded by the instruction decoder 161, and the control circuit 160 generates the control signal 170 from the decoded instruction. In addition to the control performed for the “SH8RnC Rx” of the first embodiment, control of the multiplexer 422 for outputting the counter 410 is added. Thus, the multiplexer 422 is connected to the counter 410 and outputs the value of the counter 410 to the data bus 107. Otherwise, the operation is the same as that of “SH8RnC Rx”. Since the shift instruction takes no input from the data bus 108, the multiplexer 421 does not need to be operated. Similarly, the shift instructions SH8RmC1, SH8RmC2, and SH8RmC3 select the counters 411, 412, and 413, respectively.
[0068]
Further, a clear instruction “CLRCn” (n = 0 to 3) is newly provided so that the counters 410 to 413 can be individually designated and cleared. This instruction is decoded by the instruction decoder 161, the control circuit 160 generates the control signal 170 from the decoded instruction, and the designated one of the counters 410 to 413 is cleared.
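The behaviour of the counter-selecting addition, shift, and clear instructions described above can be sketched as a small software model. This is an illustrative sketch for a single 8-bit element, not the circuit itself; the class `CarryAccumulator` and the functions `add8c` and `shift_with_counter` are hypothetical names, and a right shift is assumed.

```python
class CarryAccumulator:
    """Software stand-in for a carry accumulating circuit with several counters."""

    def __init__(self, n_counters=4):
        self.counters = [0] * n_counters

    def clear(self, n):
        # Corresponds to the clear instruction for counter n.
        self.counters[n] = 0

    def add_carry(self, n, carry):
        # Input side: the role the text assigns to the multiplexer 421.
        self.counters[n] += carry

    def read(self, n):
        # Output side: the role the text assigns to the multiplexer 422.
        return self.counters[n]


def add8c(acc, n, rx, ry):
    """8-bit addition whose carry is accumulated in counter n."""
    total = rx + ry
    acc.add_carry(n, total >> 8)
    return total & 0xFF


def shift_with_counter(acc, n, m, rx):
    """Place counter n above the 8-bit value, shift right by m, keep 8 bits."""
    combined = (acc.read(n) << 8) | rx
    return (combined >> m) & 0xFF
```

For example, accumulating 200 + 100 + 250 + 90 with counter 0 and then shifting by 2 bits yields 160, the exact average, whereas the other counters remain untouched.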
[0069]
As described above, when a plurality of counters holding the carry signal are provided, the counter that accumulates the carry signal can be selected when a larger amount of data is processed, so that processing can be speeded up or programming can be made easier. For example, the present processor may be a superscalar processor that executes a plurality of (for example, two) scalar instructions in parallel. In such a processor, each instruction is executed as a pipeline of a plurality of stages, and the same stage of two instructions is executed in parallel. For example, each instruction is executed in three stages: fetch, decode, and operation.
[0070]
In order to realize such a processor, it is necessary to provide two sets of decode circuits and arithmetic circuits. It is desirable to provide two fetch circuits if possible. To increase the processing speed of such a processor, it is desirable that there be many combinations of instructions that can be executed in parallel. For two instructions to be executed in parallel, there must be no conflict between them. In a superscalar processor, when a plurality of counters are provided in the carry signal accumulating circuit 140 as in this embodiment, the number of pairs of instructions that can be executed in parallel increases, and the processing speed can be improved. For example, when the program shown in the first embodiment is executed by the superscalar method, it is desirable to arrange the instruction sequence as follows.
[0071]
# 1 CLRC
# 2 LOAD (ma), R0
# 3 LOAD (mb), R1
# 4 LOAD (mc), R2
# 5 ADD8C R1, R0
# 6 LOAD (md), R3
# 7 ADD8C R2, R0
# 8 ADD8C R3, R0
# 9 SH8R2C R0
# 10 STORE R0, (md)
In this case, instructions # 4 and # 5 can be executed in parallel, and instructions # 6 and # 7 can be executed in parallel. Whether instructions # 2 and # 3 can be executed in parallel depends on whether there are two fetch circuits.
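What "no conflict between two instructions" means can be made concrete with a simple resource check. The sketch below is an assumption-laden illustration, not the processor's issue logic: instructions are modelled as hypothetical tuples, memory operands are ignored, and two instructions are treated as pairable when neither writes a register or counter that the other reads or writes.

```python
def resources(instr):
    """Return (reads, writes) for a hypothetical instruction tuple."""
    op, *args = instr
    if op == "LOAD":            # ("LOAD", address, destination register)
        return set(), {args[1]}
    if op == "STORE":           # ("STORE", source register, address)
        return {args[0]}, set()
    if op == "ADD":             # ("ADD", counter, source reg, destination reg)
        counter, src, dst = args
        return {src, dst, counter}, {dst, counter}
    if op == "SHIFT":           # ("SHIFT", counter, register)
        counter, reg = args
        return {reg, counter}, {reg}
    if op == "CLEAR":           # ("CLEAR", counter)
        return set(), {args[0]}
    raise ValueError(f"unknown operation: {op}")


def can_pair(a, b):
    """True if a and b touch no common resource that either of them writes."""
    reads_a, writes_a = resources(a)
    reads_b, writes_b = resources(b)
    return not (writes_a & (reads_b | writes_b)) and not (writes_b & reads_a)
```

Under this model a load into R2 pairs with an addition into R0, while two additions that share one counter do not pair even when their registers differ; giving each addition its own counter removes that conflict.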
[0072]
In the present embodiment, an example of a program that divides eight source data into two sets and executes two processes for calculating the average value of the four source data in each set in parallel is as follows. This program uses two counters 410 and 411. The first average uses registers R0-R3 and the second average uses R4-R7. Note that ma to mj are memory addresses.
[0073]
# 1 CLRC0
# 2 CLRC1
# 3 LOAD (ma), R0
# 4 LOAD (mb), R1
# 5 LOAD (me), R4
# 6 ADD8C0 R1, R0
# 7 LOAD (mf), R5
# 8 LOAD (mc), R2
# 9 ADD8C1 R5, R4
# 10 LOAD (mg), R6
# 11 ADD8C0 R2, R0
# 12 LOAD (md), R3
# 13 ADD8C1 R6, R4
# 14 LOAD (mh), R7
# 15 ADD8C0 R3, R0
# 16 ADD8C1 R7, R4
# 17 SH8R2C0 R0
# 18 SH8R2C1 R4
# 19 STORE R0, (mi)
# 20 STORE R4, (mj)
In this program, the pairs of instructions that can be executed in parallel are as follows: instructions # 5 and # 6, # 8 and # 9, # 10 and # 11, # 12 and # 13, # 14 and # 15, # 16 and # 17, and # 18 and # 19. Therefore, the number of instructions that can be executed in parallel increases as compared with the case where the number of counters is one.
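The effect of the two-counter program can be checked with a software model of packed 8-bit data. The following is a sketch under stated assumptions: each packed register is modelled as a list of eight lanes, each counter keeps one count per lane, and the function names are hypothetical. It interleaves the two accumulations in the same way as the listing above.

```python
LANES = 8  # eight 8-bit elements per packed register (assumed)


def add8c(counter, src, dst):
    """Lane-wise 8-bit add of src into dst; carries are counted per lane."""
    for i in range(LANES):
        total = dst[i] + src[i]
        dst[i] = total & 0xFF
        counter[i] += total >> 8


def shift_with_counter(counter, m, reg):
    """Place each lane's count above its 8-bit value, then shift right by m."""
    for i in range(LANES):
        reg[i] = (((counter[i] << 8) | reg[i]) >> m) & 0xFF


def average_two_sets(set0, set1):
    """Average four packed values per set, one carry counter per set."""
    c0 = [0] * LANES                 # both counters start cleared
    c1 = [0] * LANES
    r0 = list(set0[0])
    r4 = list(set1[0])
    for a, b in zip(set0[1:], set1[1:]):
        add8c(c0, a, r0)             # first set uses the first counter
        add8c(c1, b, r4)             # second set uses the second counter
    shift_with_counter(c0, 2, r0)    # divide by 4
    shift_with_counter(c1, 2, r4)
    return r0, r4
```

Because the two accumulations never touch the same counter, their additions are free to interleave, and each result equals the truncated average of its own four inputs.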
[0074]
<Modification of Embodiment 2 of the Invention>
(1) In the second embodiment, the number of the counters 410 to 413 is arbitrary, and accordingly, the counter selection value n of the instruction newly provided in the second embodiment is also arbitrary.
[0075]
(2) In the second embodiment, the multiplexer 422 can be omitted by controlling the counters 410 to 413 to output individually.
[0076]
(3) Conversely to modification (2) of the second embodiment, when all the counters 410 to 413 output their values and the output value is selected by the multiplexer 422, the control signal for specifying the counter that outputs can be omitted.
[0077]
<Third Embodiment of the Invention>
In the present embodiment, a circuit having a plurality of registers and an arithmetic unit is used instead of the carry signal accumulating circuit 140 having a plurality of counters used in the second embodiment.
[0078]
In FIG. 6, registers 430 to 433 are used in the carry signal accumulating circuit 140 instead of the counters 410 to 413 of the second embodiment. Here, it is assumed that the registers 430 to 433 each have 4 bits, and that numbers 0 to 3 are assigned to them in order from the register 430. The arithmetic unit 440 performs an operation on the carry bit supplied from the data bus 108 and the data supplied from the data bus 403, and outputs the operation result to the data bus 401. This arithmetic unit can perform at least addition; of course, it may also perform other operations. The write register selection circuit 423 selects which register stores the input from the data bus 401. The read register selection circuit 424 selects which of the registers 430 to 433 is read out to the data bus 402. The multiplexer 425 selects whether to send the read data to the ALU 320 via the data bus 107 or to the arithmetic unit 440 via the data bus 403.
[0079]
A case where the arithmetic unit 440 is a single adder will be described. Here, as in the second embodiment, instructions are newly provided so that each of the registers 430 to 433 can be referred to. An addition instruction “ADD8Gn Rx, Ry” (n = 0 to 3) is newly provided in the same format as in the second embodiment, where n corresponds to the numbers of the registers 430 to 433. First, “ADD8G0 Rx, Ry” is taken up. “ADD8G0 Rx, Ry” operates in the same way as the addition instruction newly provided in the second embodiment except in the carry signal accumulating circuit 140, so only the operation in the carry signal accumulating circuit 140 will be described. When this instruction is decoded by the instruction decoder 161, the control circuit 160 generates a control signal 170 from the decoded instruction and controls the read register selection circuit 424, the write register selection circuit 423, and the multiplexer 425. The write register selection circuit 423 and the read register selection circuit 424 each select the register 430, and the multiplexer 425 is connected to the data bus 403, so that the data read from the register 430 is supplied to the arithmetic unit 440, which performs the operation on it and the data supplied from the data bus 108, and the operation result is stored in the register 430. The instructions for n = 1 to 3 are provided in the same way.
[0080]
Next, a shift instruction “SH8RmGn Rx”, corresponding to the shift instruction newly provided in the second embodiment, is also newly provided in the present embodiment. Like the new addition instruction described above, this instruction operates in the same way as the shift instruction of the second embodiment except in the carry signal accumulating circuit 140, so the following description is limited to the operation in the carry signal accumulating circuit 140. Here, “SH8RmG0 Rx” is first taken up. When this instruction is decoded by the instruction decoder 161, the control circuit 160 generates a control signal 170 from the decoded instruction and controls the read register selection circuit 424 and the multiplexer 425. The read register selection circuit 424 selects the register 430, and the multiplexer 425 connects to the data bus 107, so that the data in the register 430 is supplied to the arithmetic unit 130 via the data bus 107. The instructions for n = 1 to 3 are provided in the same way. When the arithmetic unit 440 is an adder as described above, the operation is almost the same as in the second embodiment.
[0081]
If the ALU 320 were to process the carry during addition without relying on the present embodiment, the field holding one element of each register in the packed data register group 120 could be widened from 8 bits to, for example, 12 bits or 16 bits, and the circuit portion of the ALU 320 that adds two data could be changed so as to add two 12-bit data.
[0082]
In the present embodiment, since the arithmetic unit 440 is provided, the circuit scale is larger than that of the second embodiment. However, the circuit scale required in the present embodiment can be smaller than in the case of the above change. That is, the arithmetic unit 440 adds the 4-bit data in the registers 430 to 433 and the 1-bit carry given from the data bus 108, so it may have a simpler configuration than an adder that adds two 4-bit data. Therefore, the total circuit scale of the arithmetic unit 440 and the adding part of the ALU 320 in the present embodiment can be smaller than the circuit scale required by the adder part of the ALU 320 when such a change is made. Further, the number of registers 430 to 433 used in the present embodiment may be smaller than the number of registers in the packed data register group 120. Therefore, in the present embodiment, the total circuit scale of the packed data register group 120 and the registers 430 to 433 can be smaller than when the bit widths of all the registers of the packed data register group 120 are changed as described above.
[0083]
Even when the number of registers 430 to 433 is equal to the number of all packed data registers, the arithmetic unit 440 in this embodiment is, as described above, simpler than a normal 4-bit adder. The circuit scale of the processor according to the present embodiment can therefore still be smaller than when the processor is changed as described above without relying on the present embodiment. However, from the viewpoint of reducing the circuit scale, it is desirable that the number of registers 430 to 433 be smaller than the number of all packed data registers. For the same reason as for the plurality of counters used in the second embodiment, it is desirable that a superscalar processor have a plurality of registers 430 to 433. Although the appropriate number also depends on the number of all packed data registers, it is generally desirable that it be equal to or less than half and equal to or greater than １／ of that number.
[0084]
Further, according to the present embodiment, an operation within the carry signal accumulating circuit can be executed independently. For example, a new instruction is provided that adds the data in the register 430 and the data in the register 431 and stores the result back in the register 431. As a result, when two data in the packed data register group 120 are added, the calculation is performed correctly even if both have accumulated carry data. For example, when performing the average value calculation “y = ((a + b) + (c + d)) / 4”, if carry bits occur in both a + b and c + d, the two accumulated carries are added together, so that the average value y can be obtained correctly.
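The average calculation just described can be traced with a small model in which the carries of a + b and c + d are held in separate 4-bit carry registers and then merged. This is a sketch for a single 8-bit element; the function names are hypothetical, and the merge function stands in for the new register-to-register addition instruction, whose mnemonic the text does not give.

```python
def add8g(carry_regs, n, x, y):
    """8-bit addition; the carry is added into 4-bit carry register n."""
    total = x + y
    carry_regs[n] = (carry_regs[n] + (total >> 8)) & 0xF
    return total & 0xFF


def merge_carry_regs(carry_regs, src, dst):
    """Stand-in for the new instruction: carry_regs[dst] += carry_regs[src]."""
    carry_regs[dst] = (carry_regs[dst] + carry_regs[src]) & 0xF


def average4(a, b, c, d):
    """Compute ((a + b) + (c + d)) / 4 without losing any carry."""
    carry_regs = [0, 0, 0, 0]                # models registers 430 to 433
    ab = add8g(carry_regs, 0, a, b)          # carry of a + b goes to register 0
    cd = add8g(carry_regs, 1, c, d)          # carry of c + d goes to register 1
    total = add8g(carry_regs, 1, ab, cd)     # final add also uses register 1
    merge_carry_regs(carry_regs, 0, 1)       # combine both accumulated carries
    return (((carry_regs[1] << 8) | total) >> 2) & 0xFF
```

Without the merge step, the carry produced by a + b would be missing from the value that is shifted, and the average would be too small by 64.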
[0085]
<Modification of Embodiment 3 of the Invention>
(1) The number of registers 430 to 433 in the third embodiment can be set to one, as in the case of the single counter in the first embodiment.
[0086]
(2) The arithmetic unit 440 is basically used as an incrementer for increasing the content of any of the registers 430 to 433 by one by a carry signal. Therefore, when such an incrementer can be realized by a circuit having a structure other than the adder, such an incrementer can be used instead of the arithmetic unit 440. In this specification, such an incrementer is also regarded as an arithmetic unit for addition.
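One way such an incrementer could be structured is as a ripple chain of half adders, each consisting of one XOR and one AND. The sketch below models that structure bit by bit; it is an illustration of the general idea only, not the circuit of the arithmetic unit 440, and the function name is hypothetical.

```python
def increment4(value, carry):
    """Add a 1-bit carry to a 4-bit value with a chain of half adders."""
    result = 0
    c = carry & 1
    for bit in range(4):
        v = (value >> bit) & 1
        result |= (v ^ c) << bit   # half-adder sum: XOR
        c = v & c                  # half-adder carry: AND
    return result                  # any carry out of bit 3 is dropped
```

Each bit position needs only a half adder because the second operand is a single bit, which is why such a unit can be simpler than an adder for two 4-bit data.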
[0087]
(3) In the third embodiment, the registers 430 to 433 are assumed to be 4 bits, but the size of the registers is arbitrary. The number of registers 430 to 433 is also arbitrary. Therefore, the sizes of the data buses 402, 403, 401, and 107, which change depending on the size of the register, are also arbitrary.
[0088]
(4) The data buses 105 and 108, which are 1 bit wide in the third embodiment, can have any width from 1 to 8 bits. For example, when the ALU 320 is changed to an arithmetic unit that performs addition with three inputs and one output, a plurality of carry bits, for example two, may be generated. In this case, the data buses 105 and 108 are 2 bits wide, and the 2-bit carry data can be supplied to the carry signal accumulating circuit 140 in parallel via the data buses 105 and 108. In the first and second embodiments, a counter is used in the carry signal accumulating circuit, whereas in the third embodiment the circuit consists of an arithmetic unit and registers, so this change makes it possible to cope with a plurality of carry bits. Even in such a modification, when the total number of registers 430 to 433 is smaller than the number of all packed data registers, there is the advantage that the circuit scale remains small.
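The need for a 2-bit carry with a three-input addition follows from simple arithmetic: three 8-bit values sum to at most 765, so the part above bit 7 can be 0, 1, or 2. A minimal sketch, with hypothetical function names and a 4-bit carry register assumed:

```python
def add3_8bit(a, b, c):
    """Three-input 8-bit addition: returns (8-bit sum, carry in the range 0..2)."""
    total = a + b + c
    return total & 0xFF, total >> 8


def accumulate(carry_reg, carry, width=4):
    """Add a multi-bit carry into a carry register of the given width."""
    return (carry_reg + carry) & ((1 << width) - 1)
```

A counter that can only step by one could not absorb a carry value of 2 in a single operation, whereas an adder paired with a register can.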
[0089]
(5) By making the data buses 401 to 403 and the registers 430 to 433 of modification (3) of the third embodiment, the data buses 105 and 108 of modification (4) of the third embodiment, and the data bus 107 and the input part of the shifter 330 of the modification of the first embodiment all 8 bits wide, the carry signal accumulating circuit 140 can also be used when the ALU 320 performs multiplication. For this purpose, a multiplication instruction is newly provided. Its operation is the same as that of the addition instruction newly provided in the third embodiment except in the ALU 320, so its description is omitted.
[0090]
(6) In the third embodiment, the arithmetic unit 440 may include a subtractor, a logical arithmetic unit, a shifter, and the like in addition to the adder.
[0091]
(7) In the case of the sixth modification, it is useful to newly provide an instruction for executing an operation on the accumulated data in the registers 430 to 433. By using such an instruction, an operation on only the accumulated data in the registers 430 to 433 can be executed independently of the data in the packed data register.
[0092]
<Fourth Embodiment of the Invention>
In the present embodiment, the operations of the two shifters 330 and 331 used in the first embodiment are realized by one shifter. This makes the circuit of the processor simpler than in the first embodiment. Note that the technology of the present embodiment can be applied to the second and third embodiments.
[0093]
FIG. 7 shows a configuration of the arithmetic unit 130 according to the present embodiment. The multiplexer 312 selects whether the data from the data bus 104 is supplied to the ALU 320 via the data bus 306 or to the shifter 332 via the data bus 307. The multiplexer 314 selects either the accumulated data of the 4-bit carry signal on the data bus 107 or the 4-bit fixed data '0'. The shifter 332 shifts the combination of the 8-bit data supplied from the multiplexer 312 via the data bus 307 as the lower bits and the 4-bit data supplied from the multiplexer 314 via the data bus 500 as the upper bits, and outputs the lower 8 bits of the shift result to the data bus 309. The multiplexer 313 selects either the data bus 308 or the data bus 309.
The instructions newly provided in the first to third embodiments can be handled in the same way in the present embodiment. The multiplexer 314 selects the data bus 107 when executing a shift instruction newly provided in the first to third embodiments, and selects the fixed data '0' for other instructions. Therefore, the upper 4 bits input to the shifter 332 are 0 for shift instructions other than the newly provided ones, and the accumulated data of the carry signal on the data bus 107 is input only when a newly provided shift instruction is executed. From the above, it can be seen that the circuit of the processor of this embodiment is simpler than that of the first embodiment.
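The single-shifter arrangement can be modelled in a few lines. This is a sketch that assumes a right shift and a single 8-bit element; `shifter_332` and `shift_instruction` are hypothetical names, and the `uses_carry` flag stands in for the selection made by the multiplexer 314.

```python
def shifter_332(data8, upper4, m):
    """Shift the 12-bit combination right by m and keep the lower 8 bits."""
    combined = ((upper4 & 0xF) << 8) | (data8 & 0xFF)
    return (combined >> m) & 0xFF


def shift_instruction(data8, carry_acc, m, uses_carry):
    """Feed either the accumulated carries or fixed 0 into the upper bits."""
    upper = carry_acc if uses_carry else 0   # selection made by multiplexer 314
    return shifter_332(data8, upper, m)
```

With an 8-bit value of 128 and two accumulated carries, a 2-bit shift gives 160 when the carries are selected and 32 when the fixed 0 is selected, so ordinary shift instructions behave as before.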
[0094]
<Modification of Embodiment 4 of the Invention>
(1) A combination of the present embodiment with Embodiment 2 or a modification thereof, and a combination of the present embodiment with Embodiment 3 or a modification thereof, are also possible.
[0095]
(2) In the fourth embodiment, the input part of the shifter that receives the accumulated carry data is 4 bits wide, but this width is arbitrary.
[0096]
(3) In the fourth embodiment, the multiplexer 314 can be omitted when the carry signal accumulating circuit 140 controls the input to the data bus 107.
[0097]
Note that the present invention is not limited to the above-described embodiment or its modified example. The present invention can also be realized by a combination of the above-described embodiments or modifications thereof. Needless to say, the present invention can be realized by other embodiments.
[0098]
[Effect of the invention]
As is apparent from the above description, the present invention provides a processor that can correctly process, with a relatively simple circuit, the carries generated when addition is performed repeatedly, as in the process of calculating the average value of a plurality of unsigned data.
[Brief description of the drawings]
FIG. 1 is a schematic block diagram of a processor according to the present invention.
FIG. 2 is a schematic block diagram of a computing unit used in the apparatus of FIG. 1.
FIG. 3 is a schematic block diagram of a carry signal accumulation circuit used in the apparatus of FIG. 1.
FIG. 4 is a schematic block diagram of a packed data register group used in the device of FIG. 1;
FIG. 5 is a schematic block diagram of a carry signal accumulating circuit used in another processor according to the present invention.
FIG. 6 is a schematic block diagram of a carry signal accumulating circuit used in still another processor according to the present invention.
FIG. 7 is a schematic block diagram of a computing unit used in still another processor according to the present invention.
[Explanation of symbols]
100, 100', 100'': Arithmetic unit
210: Write register selection circuit
220: Read register selection circuit
310-314: Multiplexer
423: Write register selection circuit
424: Read register selection circuit

Claims

A processor comprising:
a first arithmetic unit that performs addition on at least two pieces of data having a predetermined bit width;
a carry signal accumulation circuit that receives a carry signal each time the first arithmetic unit generates one, and generates carry signal accumulation data comprising a plurality of bits that represents the cumulative value of the carry signals generated while the first arithmetic unit performs a plurality of additions; and
a second arithmetic unit that performs an operation on the carry signal accumulation data and the addition result data obtained by the first arithmetic unit.

The processor according to claim 1, wherein the second arithmetic unit includes a shifter that shifts, toward the lower side, a set consisting of the carry signal accumulation data generated by the carry signal accumulation circuit for a plurality of additions performed by the first arithmetic unit and, appended to its lower side, the addition result data obtained as a result of the plurality of additions, and outputs data of the predetermined number of bits.

A processor comprising:
first and second arithmetic units;
a plurality of registers connected to the first arithmetic unit, each capable of holding data of at least a predetermined number of bits;
at least one carry signal accumulation circuit connected to the first arithmetic unit; and
a first selection circuit connected to the first and second arithmetic units, for executing writing to or reading from the plurality of registers, wherein
the first selection circuit supplies the data held in the plurality of registers to the first arithmetic unit, transfers the addition result data supplied from the first arithmetic unit to one of the plurality of registers, supplies the data held in that one register to the second arithmetic unit, and transfers the operation result data supplied from the second arithmetic unit to any one of the plurality of registers,
the first arithmetic unit performs addition on a plurality of data of the predetermined number of bits held in registers selected from among the plurality of registers by the first selection circuit,
the carry signal accumulation circuit receives a carry signal each time the first arithmetic unit generates one, and generates carry signal accumulation data comprising a plurality of bits that represents the accumulated value of the carry signals generated by the first arithmetic unit, and
the second arithmetic unit executes an operation on a set of the generated carry signal accumulation data and the addition result data held in one register selected by the first selection circuit.

The processor according to claim 3, further comprising:
a plurality of the carry signal accumulation circuits; and
a second selection circuit that selects, from among the plurality of carry signal accumulation circuits, the carry signal accumulation circuit to which the carry signal output from the first arithmetic unit is to be input, wherein
the second arithmetic unit includes a shifter that shifts, toward the lower side, a set consisting of the carry signal accumulation data generated by the carry signal accumulation circuit and, appended to its lower side, the addition result data held in one of the registers selected by the first selection circuit, and outputs shift result data having the predetermined number of bits.

The processor according to claim 3, further comprising:
a plurality of the carry signal accumulation circuits;
a second selection circuit that selects, from among the plurality of carry signal accumulation circuits, the carry signal accumulation circuit to which the carry signal output from the first arithmetic unit is to be input; and
a carry signal arithmetic unit, provided in correspondence with the first arithmetic unit, that performs an operation on the carry signal generated by that arithmetic unit and the carry signal accumulation data in the carry signal accumulation circuit, and outputs the result to the carry signal accumulation circuit.

A processor comprising an arithmetic unit and a carry signal accumulation circuit, wherein
the arithmetic unit performs addition on at least two data of a predetermined bit width,
the carry signal accumulation circuit accumulates the carry signal each time the arithmetic unit generates one, thereby generating carry signal accumulation data composed of a plurality of bits, and
the arithmetic unit further performs an operation on its addition result data and the data from the carry signal accumulation circuit.