WO2002015000A2 - General purpose processor with graphics/media support - Google Patents

General purpose processor with graphics/media support

Info

Publication number
WO2002015000A2
WO2002015000A2 PCT/US2001/024778 US0124778W WO0215000A2 WO 2002015000 A2 WO2002015000 A2 WO 2002015000A2 US 0124778 W US0124778 W US 0124778W WO 0215000 A2 WO0215000 A2 WO 0215000A2
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
value
processor
result
parallel
Prior art date
Application number
PCT/US2001/024778
Other languages
French (fr)
Other versions
WO2002015000A3 (en
Inventor
Subramania Sudharsanan
Jeffrey Meng Wah Chan
Michael F. Deering
Marc Tremblay
Scott R. Nelson
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/640,901 external-priority patent/US7587582B1/en
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to AU2001281162A priority Critical patent/AU2001281162A1/en
Publication of WO2002015000A2 publication Critical patent/WO2002015000A2/en
Publication of WO2002015000A3 publication Critical patent/WO2002015000A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • G06F9/30014Arithmetic instructions with variable precision
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30072Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/3016Decoding the operand specifier, e.g. specifier format
    • G06F9/30167Decoding the operand specifier, e.g. specifier format of immediate specifier, e.g. constants
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3851Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3877Concurrent instruction execution, e.g. pipeline, look ahead using a slave processor, e.g. coprocessor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3887Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F9/3891Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804Details
    • G06F2207/3808Details concerning the type of numbers or the way they are handled
    • G06F2207/3828Multigauge devices, i.e. capable of handling packed numbers without unpacking them
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/3804Details
    • G06F2207/386Special constructional features
    • G06F2207/3884Pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/552Indexing scheme relating to groups G06F7/552 - G06F7/5525
    • G06F2207/5521Inverse root of a number or a function, e.g. the reciprocal of a Pythagorean sum

Definitions

  • the present invention relates generally to processors and, more particularly, to instructions for use with processors.
  • In order to support speech and audio processing, signal processing and 2-D and 3-D graphics, processors must be able to support fast graphics operations.
  • prior art general purpose processors have provided little or no hardware support for this type of operation.
  • special purpose graphics and media processors provide hardware support for specialized operations.
  • graphical operations were performed mostly with the aid of a specialized graphics/media processor.
  • the present invention provides a method and apparatus for efficiently performing graphic operations. This is accomplished by providing a processor that supports any combination of the following instructions: parallel multiply-add, conditional pick, parallel averaging, parallel power, parallel reciprocal square root and parallel shifts. In some embodiments, the results of these operations are further saturated within specified numerical ranges.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Fig. 1A is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
  • Fig. 1B is a schematic block diagram showing the core of the processor.
  • Fig. 2A is a block diagram of a register file of the processor of Fig. 1B.
  • Fig. 2B is a block diagram of a register of the register file of Fig. 2A.
  • Fig. 3A is a block diagram showing instruction formats for four-operand instructions supported by the processor of
  • Fig. 3B is a block diagram showing instruction formats for three-operand instructions supported by the processor of
  • Fig. 4A is a block diagram showing an instruction format for a parallel multiply-add instruction supported by the processor of Fig. 1B.
  • Fig. 4B is a block diagram showing an instruction format for a conditional pick instruction supported by the processor of Fig. 1B.
  • Fig. 4C is a block diagram showing an instruction format for a parallel mean instruction supported by the processor of Fig. 1B.
  • Fig. 4D is a block diagram showing instruction formats for a parallel logical shift left instruction supported by the processor of Fig. 1B.
  • Fig. 4E is a block diagram showing instruction formats for a parallel arithmetic shift right instruction supported by the processor of Fig. 1B.
  • Fig. 4F is a block diagram showing instruction formats for a parallel logical shift right instruction supported by the processor of Fig. 1B.
  • Fig. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the pmuladd instruction of Fig. 4A.
  • Fig. 6 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the cpickz instruction of Fig. 4B.
  • Fig. 7 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the pmean instruction of Fig. 4C.
  • Fig. 8A is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing any of the parallel shift instructions of Figs. 4D, 4E or 4F, when operands are register references.
  • Fig. 8B is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing any of the parallel shift instructions of Figs. 4D, 4E or 4F, when one of the operands is an immediate value.
  • A processor in accordance with the principles of the present invention is illustrated in Fig. 1.
  • a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers.
  • the interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die.
  • the components are mutually linked and closely linked to the processor core with high bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time.
  • the interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120.
  • the illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller.
  • the shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit.
  • the data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown).
  • the data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
  • Two media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously.
  • the threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment.
  • Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code.
  • the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions.
  • Typical "general-purpose" processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time.
  • the illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
  • Although the processor 100 shown in Fig. 1A includes two processing units on an integrated circuit chip, the architecture is highly scalable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution.
  • In the processor 100, a limitation on the number of processors formed on a single die thus arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
  • the media processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218.
  • the media processing units 110 and 112 use a plurality of execution units for executing instructions.
  • the execution units for a media processing unit 110 include three media functional units (MFU) 222 and one general functional unit (GFU) 220.
  • the media functional units 222 are multiple single-instruction-multiple-data (MSIMD) functional units. Each of the media functional units 222 is capable of processing parallel 16-bit components.
  • Various parallel 16-bit operations supply the single-instruction-multiple-data capability for the processor 100 including add, multiply-add, shift, compare, and the like.
  • the media functional units 222 operate in combination as tightly-coupled digital signal processors.
  • Each media functional unit 222 has a separate and individual sub-instruction stream, but all three media functional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages.
  • the general functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others.
  • the general functional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction.
  • the pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units.
  • the pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions.
  • the pipeline control unit 226 maintains a scoreboard and generates stalls and bypass controls.
  • the pipeline control unit 226 also generates traps and maintains special registers.
  • Each of the media processing units 110 and 112 includes a split register file 216, a single logical register file including 224 32-bit registers.
  • the split register file 216 is split into a plurality of register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time.
  • a separate register file segment 224 is allocated to each of the media functional units 222 and the general functional unit 220.
  • each register file segment 224 has 128 32-bit registers.
  • the first 96 registers (0-95) in the register file segment 224 are global registers. All functional units can write to the 96 global registers.
  • the global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224.
  • Registers 96-127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or "visible" to other functional units.
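  • The global/local split described above can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the patented circuit: it assumes four register file segments per media processing unit (one for each of the three MFUs and one for the GFU), models the broadcast of global writes as a simple loop, and uses hypothetical class and method names. Under these assumptions, 96 global registers plus 4 x 32 local registers account for the 224 logical registers mentioned above.

```python
NUM_GLOBAL = 96      # registers 0-95 are global (from the text)
NUM_REGS = 128       # registers per segment (from the text)
MASK32 = 0xFFFFFFFF  # each register holds 32 bits


class SplitRegisterFile:
    """Toy model: one 128-entry segment per functional unit (assumed: 4 units)."""

    def __init__(self, num_units=4):
        self.segments = [[0] * NUM_REGS for _ in range(num_units)]

    def write(self, unit, reg, value):
        value &= MASK32
        if reg < NUM_GLOBAL:
            # Global write: broadcast so every segment stays coherent.
            for segment in self.segments:
                segment[reg] = value
        else:
            # Local write: only the writing unit's segment is updated.
            self.segments[unit][reg] = value

    def read(self, unit, reg):
        # A unit only ever reads its own segment.
        return self.segments[unit][reg]


rf = SplitRegisterFile()
rf.write(0, 5, 0x1234)    # global register, visible to all units
rf.write(0, 100, 0xBEEF)  # local register, visible to unit 0 only
logical_registers = NUM_GLOBAL + len(rf.segments) * (NUM_REGS - NUM_GLOBAL)
```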
  • the media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time.
  • the operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly. Such hardware interlocks are typically complex, error-prone, and create multiple critical paths.
  • A VLIW instruction word always includes one instruction that executes in the general functional unit (GFU) 220 and from zero to three instructions that execute in the media functional units (MFU) 222.
  • a MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
  • Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory.
  • the execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor.
  • Processor 100 is further described in co-pending application Ser. No. 09/204,480, entitled “A Multiple-Thread Processor for Threaded Software Applications” by Marc Tremblay and William Joy, filed on Dec. 3, 1998, which is herein incorporated by reference in its entirety.
  • the structure of a register file of the processor of Fig. 1B is illustrated in Fig. 2A.
  • the register file is made up of an arbitrary number of registers R0, R1, R2 . . . Rn.
  • Each of registers R0, R1, R2 . . . Rn in turn has an arbitrary number of bits n, as shown in Fig. 2B.
  • the number of bits in each of registers R0, R1, R2 . . . Rn is 32.
  • the principles of the present invention can be applied to an arbitrary number of registers each having an arbitrary number of bits. Accordingly, the present invention is not limited to any particular number of registers or bits per register.
  • Fig. 3A illustrates an instruction format for four-operand instructions supported by the processor of Fig. 1B.
  • the instruction format has a 4-bit opcode and four 7-bit operands.
  • the first of the operands is a reference to a destination register (RD) for the instruction.
  • the second operand is a reference to a first source register for the instruction (RS1).
  • the third operand is a reference to a second source register for the instruction (RS2) and the fourth operand is a reference to a third source register for the instruction (RS3).
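  • The four-operand format above (a 4-bit opcode and four 7-bit operands, 32 bits in total) can be sketched as a pack/unpack pair. The text gives the field widths and their order of description but not their bit positions, so the layout used here (opcode in the most significant bits, followed by RD, RS1, RS2, RS3) is an assumption, and the function names are hypothetical.

```python
def encode4(opcode, rd, rs1, rs2, rs3):
    """Pack a four-operand instruction into 32 bits (assumed field order)."""
    if not 0 <= opcode < 16:
        raise ValueError("opcode must fit in 4 bits")
    word = opcode
    for field in (rd, rs1, rs2, rs3):
        if not 0 <= field < 128:
            raise ValueError("register reference must fit in 7 bits")
        word = (word << 7) | field
    return word


def decode4(word):
    """Unpack a 32-bit word into (opcode, rd, rs1, rs2, rs3)."""
    fields = []
    for _ in range(4):
        fields.append(word & 0x7F)  # peel off 7-bit fields from the low end
        word >>= 7
    rs3, rs2, rs1, rd = fields
    return word & 0xF, rd, rs1, rs2, rs3


example_word = encode4(0xA, 1, 2, 3, 127)
```

  • A 7-bit operand field can address 128 registers, which matches the 128 registers of a register file segment.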
  • Fig. 3B illustrates two instruction formats for three-operand instructions supported by the processor of Fig. 1B.
  • Each instruction format has an 11-bit opcode and three 7-bit operands.
  • the first of the operands is a reference to a destination register (RD) for the instruction.
  • the second operand is a reference to a first source register for the instruction (RS1).
  • the third operand can be a reference to a second source register (RS2) or an immediate value to be used in the instruction.
  • Fig. 4A illustrates an instruction format for a parallel multiply-add instruction (pmuladd) supported by the processor of Fig. 1B, in accordance with the present invention.
  • the pmuladd instruction uses the four-operand instruction format of Fig. 3A, namely a format in which no immediate values are used. Rather, all operands are references to registers in the register file of the processor.
  • Fig. 4B illustrates an instruction format for a conditional pick instruction (cpickz) supported by the processor of Fig. 1B.
  • the cpickz instruction uses the four-operand instruction format of Fig. 3A.
  • Fig. 4C illustrates an instruction format for a parallel mean instruction (pmean) supported by the processor of Fig. 1B.
  • the pmean instruction uses the first of the three-operand instruction formats of Fig. 3B, namely a format in which no immediate values are used.
  • Fig. 4D illustrates instruction formats for a pshll instruction supported by the processor of Fig. 1B.
  • the pshll instruction uses either of the three-operand instruction formats of Fig. 3B.
  • Fig. 4E illustrates instruction formats for a pshra instruction supported by the processor of Fig. 1B.
  • the pshra instruction uses either of the three-operand instruction formats of Fig. 3B.
  • Fig. 4F illustrates instruction formats for a pshrl instruction supported by the processor of Fig. 1B.
  • the pshrl instruction uses either of the three-operand instruction formats of Fig. 3B.
  • Fig. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a parallel multiply-add operation.
  • the pmuladd instruction treats values stored in the source registers as each having two 16-bit fixed-point components. For example, in Fig. 5, bits 0..15 of the values stored in registers RS1, RS2 and RS3 comprise the first fixed-point operands and bits 16..31 comprise the second fixed-point operands.
  • the multiply-add operation is then carried out separately on the first operands and on the second operands.
  • the value stored in register RD represents two 16-bit fixed-point values, one representing a value calculated by multiplying the first fixed-point operand of RS1 by the first fixed-point operand of RS2 and adding the first fixed-point operand of RS3, and the other representing a value calculated by multiplying the second fixed-point operand of RS1 by the second fixed-point operand of RS2 and adding the second fixed-point operand of RS3.
  • the processor, when executing a pmuladd instruction, routes the values of bits 0..15 (high-order bits) of RS1 and RS2 to respective input ports of multiplier 510, while the values of bits 16..31 (low-order bits) of RS1 and RS2 are routed to respective input ports of multiplier 520.
  • values on respective output ports of multipliers 510 and 520 are routed to respective input ports of adders 530 and 540.
  • the value of bits 0..15 of RS3 is then routed to the other input port of adder 530 and the value of bits 16..31 of RS3 is routed to the other input port of adder 540. After a time delay for propagating the input values through adders 530 and 540, a value on an output port of adder 530 is stored in bits 0..15 of register RD, while a value on an output port of adder 540 is stored in bits 16..31 of register RD.
  • the results depend on the values of two mode/format bits.
  • the operands can be either in fixed-point format or in integer format. As shown in Table 1, when the mode bits have 00 and 01 values, both the operands and the result are treated as two's complement 16-bit integer values. When the mode bits have a 10 value, the operands and the result are treated as S.15 fixed-point values. Finally, if the mode bits have a 11 value, the operands and the result are treated as S2.13 fixed point values. Hence, depending on the value of the mode bits the appropriate bits from the multiplier results are supplied to the adder.
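  • The data path and format modes described above can be sketched in Python as follows. This is an illustrative model, not the patented circuit. It follows the text in treating bits 0..15 as the high-order half of the register. Which product bits reach the adder is an assumption: the product is shifted right by the number of fraction bits (0 for integers, 15 for S.15, 13 for S2.13) so that it lines up with the addend. Saturation is deliberately left out here, so the sum simply wraps to 16 bits. All function names are hypothetical.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def lanes(word):
    """Split a 32-bit register value into (bits 0..15, bits 16..31).

    Following the text, bits 0..15 are the high-order half.
    """
    return to_signed16(word >> 16), to_signed16(word)


# Fraction bits per mode (assumption derived from the S.15 / S2.13 formats).
FRACTION_BITS = {0b00: 0, 0b01: 0, 0b10: 15, 0b11: 13}


def pmuladd(rs1, rs2, rs3, mode=0b00):
    """Per-lane rs1 * rs2 + rs3 without saturation (result wraps to 16 bits)."""
    shift = FRACTION_BITS[mode]
    result = 0
    for a, b, c in zip(lanes(rs1), lanes(rs2), lanes(rs3)):
        product = (a * b) >> shift  # Python's >> floors, like an arithmetic shift
        result = (result << 16) | ((product + c) & 0xFFFF)
    return result


integer_result = pmuladd(0x00020003, 0x00040005, 0x00010001)      # 2*4+1, 3*5+1
s15_result = pmuladd(0x40004000, 0x40004000, 0x00000000, 0b10)    # 0.5 * 0.5
```

  • In the S.15 example, 0x4000 represents 0.5, and the result 0x2000 represents 0.25 in each lane.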
  • the processor of Fig. 1B supports saturation functions to be performed during pmuladd, padd and psub operations.
  • Table 1 shows four different saturation modes.
  • Saturation modes 00 and 01 in Table 1 represent two's complement 16-bit integers.
  • Mode 10 represents an S.15 fixed point notation and mode 11 represents an S2.13 fixed point format.
  • the S bit is part of the integer part of the fixed point number.
  • an S2.13 number has a 3-bit integer part and a 13-bit fractional part.
  • In mode 00, the parallel muladd with saturation instruction will produce a value between 0 and 2^15 - 1, inclusive. If the results exceed these bounds, they are "capped" at the upper bound. Similarly, mode 01 limits the result to between -2^15 and 2^15 - 1, inclusive. Modes 10 and 11 represent saturation for fixed point formats. Table 1 summarizes the limits or bounds for all four modes.
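  • The two integer saturation modes whose bounds are stated in the text can be sketched as a simple clamp. Only modes 00 and 01 are modeled, because the bounds for modes 10 and 11 are given in Table 1, which is not reproduced here. Clamping results below the lower bound to that bound is an assumption (the text spells out only the capping at the upper bound), and the names are hypothetical.

```python
INT16_MIN = -(1 << 15)     # -2^15
INT16_MAX = (1 << 15) - 1  # 2^15 - 1

# Bounds per saturation mode, taken from the text for modes 00 and 01 only.
BOUNDS = {
    0b00: (0, INT16_MAX),
    0b01: (INT16_MIN, INT16_MAX),
}


def saturate(value, mode):
    """Clamp a per-lane result to the bounds of the given saturation mode."""
    low, high = BOUNDS[mode]
    return max(low, min(high, value))


capped = saturate(200 * 200 + 5, 0b00)  # 40005 exceeds 2^15 - 1
```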
  • Fig. 6 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. IB for performing a conditional pick operation.
  • the cpickz instruction compares a value stored in register RSI to a zero value and depending on the outcome of the comparison copies the values stored in either register RS2 or register RS3 into register RD.
  • the processor when executing a cpickz instruction, routes a value stored in register RSI to an input port of comparator 610. A zero value is supplied on the other input port of comparator 610. After a time delay for propagating the input values through comparator 610, a value on an output port of comparator 610 is routed to a control port of multiplexer 620. Meanwhile, values stored in registers RS2 and RS3 are routed by the processor to respective input ports of multiplexer 620. After a time delay for propagating input values through multiplexer 620, a value on an output port of multiplexer 620 is stored in register RD.
  • the value stored in register RD is a copy of the value stored in register RS2 if the value stored in register RSI is not equal to 0.
  • the value stored in register RD is a copy of the value stored in register RS3.
  • Fig. 7 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. IB for performing a parallel averaging operation.
  • the pmean instruction treats values stored in the source registers as each having two 16-bit integer components. For example, in Fig. 7, bits 0..15 of the values stored in registers RSI and RS2 comprise the first integer operands and bits 16..31 comprise the second integer operands .
  • the averaging operation is then carried out separately on the first operands and on the second operands.
  • the value stored in register RD represents two 16 bit integer values, one representing a value calculated by averaging the first integer operand of RSI with the first integer operand of RS2, and the other representing a value calculated by averaging the second integer operand of RSI with the second integer operand of RS2.
  • the processor when executing a pmean instruction, routes values stored in bits 0..15 of registers RSI and RS2 to respective input ports of adder 710. Meanwhile, values stored in bits 16..31 of registers RSI and RS2 are routed to respective input ports of adder 720.
  • values on respective output ports of adders 710 and 720 are routed to respective input ports of adders 730 and 740.
  • a 1 value is supplied on respective input ports of adders 730 and 740.
  • output values on respective ports of adders 730 and 740 are routed to respective input ports of right shifters 750 and 760.
  • a logical one value is supplied on respective control ports of right shifters 750 and 760.
  • Fig. 8A is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. IB for performing a parallel shift operation, when all operands are provided as register references.
  • the pshll instruction treats values stored in the source registers as each having two 16-bit integer components. For example, in Fig. 8A, bits 0..15 of the values stored in registers RSI and RS2 comprise the first integer operands and bits 16..31 comprise the second integer operands. The logical shift left operation is then carried out separately on the first operands and on the second operands.
  • the value stored in register RD represents two 16 bit integer values, one representing a value calculated by performing a logical shift left of the first integer operand of RSI by a number of bits specified by the first integer operand of RS2, and the other representing a value calculated by performing a logical shift left on the second integer operand of RSI by a number of bits specified by the second integer operand of RS2.
  • the processor when executing the pshll instruction, the processor routes the value stored in bits 0..15 of register RSI to an input port of shifter 810. Meanwhile, the value stored in bits 16..31 of register RSI are routed to an input port of shifter 820.
  • the processor also routes bits 0..3 of registers RSI and RS2 to respective select ports of shifters 810 and 820. After a time delay for propagating the input values through shifters 810 and
  • a value on an output port of shifter 810 is copied into bits 0..15 of register RD and a value on an output port of shifter 820 is copied into bits 16..31 of register RD.
  • Fig. 8B is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. IB for performing a parallel shift operation, when the second source operand is provided as an immediate value.
  • the functioning of the circuitry of Fig. 8B is identical to that of the circuitry of Fig. 8A, except that bits 0..3 of the immediate value are routed to both input ports of shifters 810 and 820.
  • the operation of the circuitry of Figs. 8A and 8B during execution of a pshra or a pshrl instructions is identical to the one described for the execution of a pshll instruction, except that shifters 810 and 820 perform an arithmetic shift right or a logical shift right operations, respectively.
  • the processor of Fig. IB supports a parallel power instruction, ppower and a parallel reciprocal square root instruction precsqrt .
  • the ppower and precsqrt instruction treat the values stored in the source registers as a pair of fixed-point (rather than integer) components. Therefore, the value stored in bits 0..15 of register RD after the execution of a ppower instruction represent a value calculated by raising the value stored in bits 0..15 of register RSI to a power specified by the value stored in bits 0..15 of register RS2.
  • the value stored in bits 16..31 of register RD after the execution of a ppower instruction represent a value calculated by raising the value stored in bits 16..31 of register RSI to a power specified by the value stored in bits 16..31 of register RS2.
  • the pair of values stored in register RD after the execution of a precsqrt (instruction are calculated using a similar process to the one described for the ppower instruction, except that the reciprocal square roots of the pairs of values stored in register RSI are computed, rather than a power.
  • precsqrt instruction is further described in co-pending application Ser. No. 09/240,977 titled "Parallel Fixed Point Square Root And Reciprocal Square Root Computation Unit In A Processor” by Ravi Shankar and Subramania Sudharsanan, which is incorporated by reference herein in its entirety.
  • Embodiments described above illustrate but do not limit the invention.
  • the invention is not limited by any number of registers or immediate values specified by the instructions.
  • the invention is not limited to any particular hardware implementation. Those skilled in the art realize that alternative hardware implementation can be employed in lieu of the one described herein in accordance to the principles of the present invention. Other embodiments and variations are within the scope of the invention, as defined by the following claims.

Abstract

A method and apparatus for efficiently performing graphic operations are provided. This is accomplished by providing a processor that supports any combination of the following instructions: parallel multiply-add, conditional pick, parallel averaging, parallel power, parallel reciprocal square root and parallel shifts.

Description

ACCELERATED GRAPHIC OPERATIONS
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates generally to processors and, more particularly, to instructions for use with processors.
Related Art
In order to support speech and audio processing, signal processing and 2-D and 3-D graphics, processors must be able to support fast graphics operations. However, prior art general purpose processors have provided little or no hardware support for operations of this type. By contrast, special purpose graphics and media processors provide hardware support for specialized operations. As a result, using prior art processors, graphical operations were performed mostly with the aid of a specialized graphics/media processor.
As the demand for graphics/media support in general purpose processors rises, hardware acceleration of these operations becomes more and more important.
As a result, there is a need for a general purpose processor that allows for efficient processing of these operations.
SUMMARY OF THE INVENTION
The present invention provides a method and apparatus for efficiently performing graphic operations. This is accomplished by providing a processor that supports any combination of the following instructions: parallel multiply-add, conditional pick, parallel averaging, parallel power, parallel reciprocal square root and parallel shifts. In some embodiments, the results of these operations are further saturated within specified numerical ranges.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1A is a schematic block diagram illustrating a single integrated circuit chip implementation of a processor in accordance with an embodiment of the present invention.
Fig. 1B is a schematic block diagram showing the core of the processor.
Fig. 2A is a block diagram of a register file of the processor of Fig. 1B.
Fig. 2B is a block diagram of a register of the register file of Fig. 2A.
Fig. 3A is a block diagram showing instruction formats for four-operand instructions supported by the processor of Fig. 1B.
Fig. 3B is a block diagram showing instruction formats for three-operand instructions supported by the processor of Fig. 1B.
Fig. 4A is a block diagram showing an instruction format for a parallel multiply-add instruction supported by the processor of Fig. 1B.
Fig. 4B is a block diagram showing an instruction format for a conditional pick instruction supported by the processor of Fig. 1B.
Fig. 4C is a block diagram showing an instruction format for a parallel mean instruction supported by the processor of Fig. 1B.
Fig. 4D is a block diagram showing instruction formats for a parallel logical shift left instruction supported by the processor of Fig. 1B.
Fig. 4E is a block diagram showing instruction formats for a parallel arithmetic shift right instruction supported by the processor of Fig. 1B.
Fig. 4F is a block diagram showing instruction formats for a parallel logical shift right instruction supported by the processor of Fig. 1B.
Fig. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the pmuladd instruction of Fig. 4A.
Fig. 6 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the cpickz instruction of Fig. 4B.
Fig. 7 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing the pmean instruction of Fig. 4C.
Fig. 8A is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing any of the parallel shift instructions of Figs. 4D, 4E or 4F, when operands are register references.
Fig. 8B is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing any of the parallel shift instructions of Figs. 4D, 4E or 4F, when one of the operands is an immediate value.
DETAILED DESCRIPTION OF THE INVENTION
A processor in accordance with the principles of the present invention is illustrated in Fig. 1. Referring to Fig. 1A, a schematic block diagram illustrates a single integrated circuit chip implementation of a processor 100 that includes a memory interface 102, a geometry decompressor 104, two media processing units 110 and 112, a shared data cache 106, and several interface controllers. The interface controllers support an interactive graphics environment with real-time constraints by integrating fundamental components of memory, graphics, and input/output bridge functionality on a single die. The components are mutually linked and closely linked to the processor core with high-bandwidth, low-latency communication channels to manage multiple high-bandwidth data streams efficiently and with a low response time. The interface controllers include an UltraPort Architecture Interconnect (UPA) controller 116 and a peripheral component interconnect (PCI) controller 120. The illustrative memory interface 102 is a direct Rambus dynamic RAM (DRDRAM) controller. The shared data cache 106 is a dual-ported storage that is shared among the media processing units 110 and 112 with one port allocated to each media processing unit. The data cache 106 is four-way set associative, follows a write-back protocol, and supports hits in the fill buffer (not shown). The data cache 106 allows fast data sharing and eliminates the need for a complex, error-prone cache coherency protocol between the media processing units 110 and 112.
Two media processing units 110 and 112 are included in a single integrated circuit chip to support an execution environment exploiting thread level parallelism in which two independent threads can execute simultaneously. The threads may arise from any sources such as the same application, different applications, the operating system, or the runtime environment. Parallelism is exploited at the thread level since parallelism is rare beyond four, or even two, instructions per cycle in general purpose code. For example, the illustrative processor 100 is an eight-wide machine with eight execution units for executing instructions. Typical "general-purpose" processing code has an instruction level parallelism of about two so that, on average, most (about six) of the eight execution units would be idle at any time. The illustrative processor 100 employs thread level parallelism and operates on two independent threads, possibly attaining twice the performance of a processor having the same resources and clock rate but utilizing traditional non-thread parallelism.
Although the processor 100 shown in Fig. 1A includes two processing units on an integrated circuit chip, the architecture is highly scalable so that one to several closely-coupled processors may be formed in a message-based coherent architecture and resident on the same die to process multiple threads of execution. Thus, in the processor 100, a limitation on the number of processors formed on a single die arises from capacity constraints of integrated circuit technology rather than from architectural constraints relating to the interactions and interconnections between processors.
Referring to Fig. 1B, a schematic block diagram shows the core of the processor 100. The media processing units 110 and 112 each include an instruction cache 210, an instruction aligner 212, an instruction buffer 214, a pipeline control unit 226, a split register file 216, a plurality of execution units, and a load/store unit 218. In the illustrative processor 100, the media processing units 110 and 112 use a plurality of execution units for executing instructions. The execution units for a media processing unit 110 include three media functional units (MFU) 222 and one general functional unit (GFU) 220. The media functional units 222 are multiple single-instruction-multiple-data (MSIMD) functional units. Each of the media functional units 222 is capable of processing parallel 16-bit components. Various parallel 16-bit operations supply the single-instruction-multiple-data capability for the processor 100 including add, multiply-add, shift, compare, and the like. The media functional units 222 operate in combination as tightly-coupled digital signal processors (DSPs). Each media functional unit 222 has a separate and individual sub-instruction stream, but all three media functional units 222 execute synchronously so that the subinstructions progress lock-step through pipeline stages. The general functional unit 220 is a RISC processor capable of executing arithmetic logic unit (ALU) operations, loads and stores, branches, and various specialized and esoteric functions such as parallel power operations, reciprocal square root operations, and many others. The general functional unit 220 supports less common parallel operations such as the parallel reciprocal square root instruction.
The pipeline control unit 226 is connected between the instruction buffer 214 and the functional units and schedules the transfer of instructions to the functional units. The pipeline control unit 226 also receives status signals from the functional units and the load/store unit 218 and uses the status signals to perform several control functions. The pipeline control unit 226 maintains a scoreboard, generates stalls and bypass controls. The pipeline control unit 226 also generates traps and maintains special registers.
Each media processing unit 110 and 112 includes a split register file 216, a single logical register file including 224 32-bit registers. The split register file 216 is split into a plurality of register file segments 224 to form a multi-ported structure that is replicated to reduce the integrated circuit die area and to reduce access time. A separate register file segment 224 is allocated to each of the media functional units 222 and the general functional unit 220. In the illustrative embodiment, each register file segment 224 has 128 32-bit registers. The first 96 registers (0-95) in the register file segment 224 are global registers. All functional units can write to the 96 global registers. The global registers are coherent across all functional units (MFU and GFU) so that any write operation to a global register by any functional unit is broadcast to all register file segments 224. Registers 96-127 in the register file segments 224 are local registers. Local registers allocated to a functional unit are not accessible or "visible" to other functional units.
The media processing units 110 and 112 are highly structured computation blocks that execute software-scheduled data computation operations with fixed, deterministic and relatively short instruction latencies, operational characteristics yielding simplification in both function and cycle time. The operational characteristics support multiple instruction issue through a pragmatic very large instruction word (VLIW) approach that avoids hardware interlocks to account for software that does not schedule operations properly. Such hardware interlocks are typically complex, error-prone, and create multiple critical paths. A VLIW instruction word always includes one instruction that executes in the general functional unit (GFU) 220 and from zero to three instructions that execute in the media functional units (MFU) 222. An MFU instruction field within the VLIW instruction word includes an operation code (opcode) field, three source register (or immediate) fields, and one destination register field.
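The global and local register behaviour of the split register file described above can be modelled in a few lines of Python. This is only a behavioural sketch, not the disclosed hardware: the class name SplitRegisterFile, its method names, and the choice of four segments (three MFUs plus one GFU) are assumptions made for illustration.

```python
NUM_GLOBAL = 96      # registers 0-95 are global
SEGMENT_SIZE = 128   # each segment holds 128 32-bit registers


class SplitRegisterFile:
    """Behavioural model: one register file segment per functional unit (hypothetical names)."""

    def __init__(self, num_units=4):
        self.segments = [[0] * SEGMENT_SIZE for _ in range(num_units)]

    def write(self, unit, reg, value):
        value &= 0xFFFFFFFF
        if reg < NUM_GLOBAL:
            # A write to a global register is broadcast to every segment.
            for segment in self.segments:
                segment[reg] = value
        else:
            # A local register is visible only to the unit that owns it.
            self.segments[unit][reg] = value

    def read(self, unit, reg):
        return self.segments[unit][reg]


rf = SplitRegisterFile()
rf.write(0, 5, 42)    # global: visible to all four units
rf.write(1, 100, 7)   # local: visible to unit 1 only
```

With four segments, 96 shared global registers plus 4 x 32 local registers gives the 224 logical registers mentioned in the text.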
Instructions are executed in-order in the processor 100 but loads can finish out-of-order with respect to other instructions and with respect to other loads, allowing loads to be moved up in the instruction stream so that data can be streamed from main memory. The execution model eliminates the usage and overhead resources of an instruction window, reservation stations, a re-order buffer, or other blocks for handling instruction ordering. Elimination of the instruction ordering structures and overhead resources is highly advantageous since the eliminated blocks typically consume a large portion of an integrated circuit die. For example, the eliminated blocks consume about 30% of the die area of a Pentium II processor.
Processor 100 is further described in co-pending application Ser. No. 09/204,480, entitled "A Multiple-Thread Processor for Threaded Software Applications" by Marc Tremblay and William Joy, filed on Dec. 3, 1998, which is herein incorporated by reference in its entirety.
The structure of a register file of the processor of Fig. 1B is illustrated in Fig. 2A. The register file is made up of an arbitrary number of registers R0, R1, R2 . . . Rn. Each of registers R0, R1, R2 . . . Rn, in turn, has an arbitrary number of bits n, as shown in Fig. 2B. In one embodiment, the number of bits in each of registers R0, R1, R2 . . . Rn is 32. However, those skilled in the art realize that the principles of the present invention can be applied to an arbitrary number of registers each having an arbitrary number of bits. Accordingly, the present invention is not limited to any particular number of registers or bits per register.
Fig. 3A illustrates an instruction format for four-operand instructions supported by the processor of Fig. 1B. The instruction format has a 4-bit opcode and four 7-bit operands. The first of the operands is a reference to a destination register (RD) for the instruction. The second operand, in turn, is a reference to a first source register for the instruction (RS1). The third operand is a reference to a second source register for the instruction (RS2) and the fourth operand is a reference to a third source register for the instruction (RS3).
Fig. 3B illustrates two instruction formats for three-operand instructions supported by the processor of Fig. 1B. Each instruction format has an 11-bit opcode and three 7-bit operands. The first of the operands is a reference to a destination register (RD) for the instruction. The second operand, in turn, is a reference to a first source register for the instruction (RS1). Finally, the third operand can be a reference to a second source register (RS2) or an immediate value to be used in the instruction.
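Both formats fill exactly 32 bits (4 + 4 x 7 and 11 + 3 x 7). The following Python sketch packs and unpacks such words. The text gives only the field widths, so the placement of the opcode in the most significant bits and the field order opcode, RD, RS1, RS2, RS3 are assumptions, as are the function names.

```python
def _check(value, bits, name):
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} must fit in {bits} bits")


def encode_four_operand(opcode, rd, rs1, rs2, rs3):
    """4-bit opcode followed by four 7-bit register references (assumed order)."""
    _check(opcode, 4, "opcode")
    for name, reg in (("rd", rd), ("rs1", rs1), ("rs2", rs2), ("rs3", rs3)):
        _check(reg, 7, name)
    return (opcode << 28) | (rd << 21) | (rs1 << 14) | (rs2 << 7) | rs3


def decode_four_operand(word):
    """Inverse of encode_four_operand: returns (opcode, rd, rs1, rs2, rs3)."""
    return (
        (word >> 28) & 0xF,
        (word >> 21) & 0x7F,
        (word >> 14) & 0x7F,
        (word >> 7) & 0x7F,
        word & 0x7F,
    )


def encode_three_operand(opcode, rd, rs1, rs2_or_imm):
    """11-bit opcode followed by three 7-bit fields; the last is RS2 or an immediate."""
    _check(opcode, 11, "opcode")
    for name, field in (("rd", rd), ("rs1", rs1), ("rs2_or_imm", rs2_or_imm)):
        _check(field, 7, name)
    return (opcode << 21) | (rd << 14) | (rs1 << 7) | rs2_or_imm
```

A 7-bit operand field can address up to 128 registers, which matches the size of one register file segment.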
Fig. 4A illustrates an instruction format for a parallel multiply-add instruction (pmuladd) supported by the processor of Fig. 1B, in accordance with the present invention. The pmuladd instruction uses the four-operand instruction format of Fig. 3A, namely a format in which no immediate values are used. Rather, all operands are references to registers in the register file of the processor.
Fig. 4B illustrates an instruction format for a conditional pick instruction (cpickz) supported by the processor of Fig. 1B. The cpickz instruction uses the four-operand instruction format of Fig. 3A.
Fig. 4C illustrates an instruction format for a parallel mean instruction (pmean) supported by the processor of Fig. 1B. The pmean instruction uses the first of the three-operand instruction formats of Fig. 3B, namely a format in which no immediate values are used.
Fig. 4D illustrates instruction formats for a pshll instruction supported by the processor of Fig. 1B. The pshll instruction uses either of the three-operand instruction formats of Fig. 3B.
Fig. 4E illustrates instruction formats for a pshra instruction supported by the processor of Fig. 1B. The pshra instruction uses either of the three-operand instruction formats of Fig. 3B.
Fig. 4F illustrates instruction formats for a pshrl instruction supported by the processor of Fig. 1B. The pshrl instruction uses either of the three-operand instruction formats of Fig. 3B.
Fig. 5 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a parallel multiply-add operation. The pmuladd instruction treats values stored in the source registers as each having two 16-bit fixed-point components. For example, in Fig. 5, bits 0..15 of the values stored in registers RS1, RS2 and RS3 comprise the first fixed-point operands and bits 16..31 comprise the second fixed-point operands. The multiply-add operation is then carried out separately on the first operands and on the second operands. As a result, after the execution of a pmuladd instruction, the value stored in register RD represents two 16-bit fixed-point values, one representing a value calculated by multiplying the first fixed-point operand of RS1 by the first fixed-point operand of RS2 and adding the first fixed-point operand of RS3, and the other representing a value calculated by multiplying the second fixed-point operand of RS1 by the second fixed-point operand of RS2 and adding the second fixed-point operand of RS3.
In the implementation shown in Fig. 5, when executing a pmuladd instruction, the processor routes the value of bits 0..15 (high-order bits) of RS1 and RS2 to respective input ports of multiplier 510, while the value of bits 16..31 (low-order bits) of RS1 and RS2 is routed to respective input ports of multiplier 520. After a time delay for propagating the input values through multipliers 510 and 520, values on respective output ports of multipliers 510 and 520 are routed to respective input ports of adders 530 and 540. The value of bits 0..15 of RS3 is then routed to the other input port of adder 530 and the value of bits 16..31 of RS3 is routed to the other input port of adder 540. After a time delay for propagating the input values through adders 530 and 540, a value on an output port of adder 530 is stored in bits 0..15 of register RD, while a value on an output port of adder 540 is stored in bits 16..31 of register RD.
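The data flow of Fig. 5 can be sketched in Python for the integer case. This is an illustrative model, not the circuit itself: the helper names are hypothetical, the first component is taken from the upper half of the 32-bit word (following the "high-order" remark above), and results that do not fit in 16 bits simply wrap, since saturation is treated separately in the text.

```python
MASK16 = 0xFFFF


def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement integer."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x


def split(word):
    """Return (first, second) 16-bit components; first = upper half (assumption)."""
    return (word >> 16) & MASK16, word & MASK16


def join(first, second):
    """Pack two 16-bit components back into one 32-bit word."""
    return ((first & MASK16) << 16) | (second & MASK16)


def pmuladd_int(rs1, rs2, rs3):
    """Per component: RD = RS1 * RS2 + RS3 (integer mode, wrapping, no saturation)."""
    a1, a2 = split(rs1)
    b1, b2 = split(rs2)
    c1, c2 = split(rs3)
    r1 = to_signed16(a1) * to_signed16(b1) + to_signed16(c1)  # multiplier 510, adder 530
    r2 = to_signed16(a2) * to_signed16(b2) + to_signed16(c2)  # multiplier 520, adder 540
    return join(r1, r2)


# First components: 3 * 4 + 1 = 13; second components: -2 * 5 + 1 = -9
rd = pmuladd_int(join(3, -2), join(4, 5), join(1, 1))
```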
The results depend on the values of two mode/format bits. The operands can be either in fixed-point format or in integer format. As shown in Table 1, when the mode bits have a 00 or 01 value, both the operands and the result are treated as two's complement 16-bit integer values. When the mode bits have a 10 value, the operands and the result are treated as S.15 fixed-point values. Finally, if the mode bits have an 11 value, the operands and the result are treated as S2.13 fixed-point values. Hence, depending on the value of the mode bits, the appropriate bits from the multiplier results are supplied to the adder.
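One way to read "the appropriate bits from the multiplier results" is that the full product is shifted right by the number of fractional bits of the selected format, so that the product has the same scale as the addend. The Python sketch below illustrates that reading. The function names are hypothetical, and truncation by arithmetic shift (rather than rounding) is an assumption, since the text does not specify it.

```python
# Fractional bits implied by each mode value (integer modes have none).
FRAC_BITS = {0b00: 0, 0b01: 0, 0b10: 15, 0b11: 13}


def to_signed16(x):
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def to_real(raw, mode):
    """Real value represented by a raw 16-bit pattern in the given mode."""
    return to_signed16(raw) / (1 << FRAC_BITS[mode])


def from_real(value, mode):
    """Raw 16-bit pattern for a real value (no range check in this sketch)."""
    return int(round(value * (1 << FRAC_BITS[mode]))) & 0xFFFF


def scaled_product(a_raw, b_raw, mode):
    """Product rescaled to the operand format before it reaches the adder."""
    return (to_signed16(a_raw) * to_signed16(b_raw)) >> FRAC_BITS[mode]


# S.15: 0.5 * 0.5 = 0.25
p15 = scaled_product(from_real(0.5, 0b10), from_real(0.5, 0b10), 0b10)
# S2.13: 1.5 * 2.0 = 3.0
p13 = scaled_product(from_real(1.5, 0b11), from_real(2.0, 0b11), 0b11)
```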
Moreover, the processor of Fig. 1B supports saturation functions that are performed during pmuladd, padd and psub operations. Four different saturation modes are provided, as shown in Table 1 below.
Table 1. Saturation modes and their bounds (table image not reproduced).
Saturation modes 00 and 01 in Table 1 represent two's complement 16-bit integers. Mode 10 represents an S.15 fixed-point notation, while mode 11 represents an S2.13 fixed-point format. In both of these notations, the S bit is part of the integer part of the fixed-point number. For example, an S2.13 number has a 3-bit integer part and a 13-bit fractional part.
Using mode 00, the parallel muladd with saturation instruction will produce a value between 0 and 2^15 - 1, inclusive. If the results exceed these bounds, they are "capped" at the upper bound. Similarly, mode 01 limits the result to between -2^15 and 2^15 - 1, inclusive. Modes 10 and 11 represent saturation for fixed-point formats. Table 1 summarizes the limits or bounds for all four modes.
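The two integer saturation modes can be expressed as a simple clamp. The sketch below covers only modes 00 and 01, whose bounds are stated in the text; the bounds for modes 10 and 11 appear only in Table 1 and are deliberately left out. Clamping negative results to 0 in mode 00 is an assumption, since the text describes capping at the upper bound only, and the function name is hypothetical.

```python
# (lower, upper) bounds for the integer saturation modes stated in the text.
BOUNDS = {
    0b00: (0, 2**15 - 1),
    0b01: (-(2**15), 2**15 - 1),
}


def saturate(value, mode):
    """Clamp an intermediate result to the range of the given integer mode."""
    lower, upper = BOUNDS[mode]  # KeyError for modes 10/11: not modelled here
    return max(lower, min(upper, value))
```

For example, an intermediate sum of 40000 is capped at 32767 in either mode, while -40000 becomes -32768 in mode 01.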
Execution of these instructions is pipelined to achieve a throughput of one instruction per cycle.
Fig. 6 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a conditional pick operation. The cpickz instruction compares a value stored in register RS1 to a zero value and, depending on the outcome of the comparison, copies the value stored in either register RS2 or register RS3 into register RD.
In the implementation of Fig. 6, when executing a cpickz instruction, the processor routes a value stored in register RS1 to an input port of comparator 610. A zero value is supplied on the other input port of comparator 610. After a time delay for propagating the input values through comparator 610, a value on an output port of comparator 610 is routed to a control port of multiplexer 620. Meanwhile, values stored in registers RS2 and RS3 are routed by the processor to respective input ports of multiplexer 620. After a time delay for propagating input values through multiplexer 620, a value on an output port of multiplexer 620 is stored in register RD.
As a result, after the execution of a cpickz instruction, the value stored in register RD is a copy of the value stored in register RS2 if the value stored in register RS1 is not equal to 0. Alternatively, if the value stored in register RS1 is equal to 0, the value stored in register RD is a copy of the value stored in register RS3.
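The behaviour of cpickz reduces to a compare-with-zero followed by a two-way select, which makes it a branch-free substitute for a conditional assignment. A minimal Python sketch follows; the function and variable names are hypothetical.

```python
def cpickz(rs1, rs2, rs3):
    """Return RS2 when RS1 is non-zero, otherwise RS3 (comparator 610 + multiplexer 620)."""
    is_zero = (rs1 & 0xFFFFFFFF) == 0   # comparator 610: compare RS1 with zero
    return rs3 if is_zero else rs2      # multiplexer 620: select one source


picked_nonzero = cpickz(7, 0x1111, 0x2222)  # RS1 != 0, so RS2 is picked
picked_zero = cpickz(0, 0x1111, 0x2222)     # RS1 == 0, so RS3 is picked
```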
Fig. 7 is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a parallel averaging operation. The pmean instruction treats values stored in the source registers as each having two 16-bit integer components. For example, in Fig. 7, bits 0..15 of the values stored in registers RS1 and RS2 comprise the first integer operands and bits 16..31 comprise the second integer operands. The averaging operation is then carried out separately on the first operands and on the second operands. As a result, after the execution of a pmean instruction, the value stored in register RD represents two 16-bit integer values, one representing a value calculated by averaging the first integer operand of RS1 with the first integer operand of RS2, and the other representing a value calculated by averaging the second integer operand of RS1 with the second integer operand of RS2.
In the implementation of Fig. 7, when executing a pmean instruction, the processor routes values stored in bits 0..15 of registers RS1 and RS2 to respective input ports of adder 710. Meanwhile, values stored in bits 16..31 of registers RS1 and RS2 are routed to respective input ports of adder 720. After a time delay for propagating the input values through adders 710 and 720, values on respective output ports of adders 710 and 720 are routed to respective input ports of adders 730 and 740. A 1 value is supplied on respective input ports of adders 730 and 740. After a time delay for propagating the input values through adders 730 and 740, output values on respective ports of adders 730 and 740 are routed to respective input ports of right shifters 750 and 760. A logical one value is supplied on respective control ports of right shifters 750 and 760. After a time delay for propagating the input values through right shifters 750 and 760, a value on an output port of right shifter 750 is copied into bits 0..15 of register RD and a value on an output port of right shifter 760 is copied into bits 16..31 of register RD.
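The adder, add-one and shift-right-by-one chain of Fig. 7 computes a rounded average, (a + b + 1) >> 1, on each component. The Python sketch below models it under two assumptions that the text leaves open: the components are treated as unsigned, and the intermediate sum keeps its carry bit so that the average cannot overflow. Placing the first component in the upper half of the word is likewise an assumption, and the helper names are hypothetical.

```python
MASK16 = 0xFFFF


def split(word):
    """Return (first, second) 16-bit components; first = upper half (assumption)."""
    return (word >> 16) & MASK16, word & MASK16


def join(first, second):
    return ((first & MASK16) << 16) | (second & MASK16)


def pmean(rs1, rs2):
    """Per component: (a + b + 1) >> 1, with unsigned components assumed."""
    a1, a2 = split(rs1)
    b1, b2 = split(rs2)
    m1 = (a1 + b1 + 1) >> 1   # adder 710, adder 730, right shifter 750
    m2 = (a2 + b2 + 1) >> 1   # adder 720, adder 740, right shifter 760
    return join(m1, m2)


# (3 + 4 + 1) >> 1 = 4 and (4 + 6 + 1) >> 1 = 5
rd = pmean(join(3, 4), join(4, 6))
```

Adding 1 before the shift rounds halves upward, so the mean of 3 and 4 is 4 rather than 3.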
Fig. 8A is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a parallel shift operation, when all operands are provided as register references. The pshll instruction treats values stored in the source registers as each having two 16-bit integer components. For example, in Fig. 8A, bits 0..15 of the values stored in registers RS1 and RS2 comprise the first integer operands and bits 16..31 comprise the second integer operands. The logical shift left operation is then carried out separately on the first operands and on the second operands. As a result, after the execution of a pshll instruction, the value stored in register RD represents two 16-bit integer values, one representing a value calculated by performing a logical shift left of the first integer operand of RS1 by a number of bits specified by the first integer operand of RS2, and the other representing a value calculated by performing a logical shift left on the second integer operand of RS1 by a number of bits specified by the second integer operand of RS2.
In the implementation of Fig. 8A, when executing the pshll instruction, the processor routes the value stored in bits 0..15 of register RS1 to an input port of shifter 810. Meanwhile, the value stored in bits 16..31 of register RS1 is routed to an input port of shifter 820. The processor also routes bits 0..3 of registers RS1 and RS2 to respective select ports of shifters 810 and 820. After a time delay for propagating the input values through shifters 810 and 820, a value on an output port of shifter 810 is copied into bits 0..15 of register RD and a value on an output port of shifter 820 is copied into bits 16..31 of register RD.
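The register form of pshll can be sketched in Python as two independent 16-bit shifts. The model assumes that each shifter receives a 4-bit shift count taken from the low four bits of the corresponding component of RS2, which is one reading of the select-port routing described above, and that bits shifted beyond a 16-bit component are discarded. The helper names and the placement of the first component in the upper half of the word are also assumptions.

```python
MASK16 = 0xFFFF


def split(word):
    """Return (first, second) 16-bit components; first = upper half (assumption)."""
    return (word >> 16) & MASK16, word & MASK16


def join(first, second):
    return ((first & MASK16) << 16) | (second & MASK16)


def pshll(rs1, rs2):
    """Per component logical shift left by a 4-bit count from the matching RS2 component."""
    a1, a2 = split(rs1)
    n1, n2 = split(rs2)
    r1 = a1 << (n1 & 0xF)   # shifter 810
    r2 = a2 << (n2 & 0xF)   # shifter 820
    return join(r1, r2)     # join() discards bits shifted past bit 15


# 1 << 4 = 16; 0x8001 << 1 loses its top bit and becomes 0x0002
rd = pshll(join(1, 0x8001), join(4, 1))
```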
Fig. 8B is a block diagram of one implementation of the circuitry within MFUs 222 of the processor of Fig. 1B for performing a parallel shift operation, when the second source operand is provided as an immediate value. The functioning of the circuitry of Fig. 8B is identical to that of the circuitry of Fig. 8A, except that bits 0..3 of the immediate value are routed to both input ports of shifters 810 and 820.
The operation of the circuitry of Figs. 8A and 8B during execution of a pshra or a pshrl instruction is identical to the one described for the execution of a pshll instruction, except that shifters 810 and 820 perform an arithmetic shift right or a logical shift right operation, respectively.
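The difference between the two right shifts, and the effect of sharing one immediate count between both shifters, can be illustrated with the following Python sketch. The function names pshrl_imm and pshra_imm are hypothetical, as are the use of the low four bits of the immediate as the count and the placement of the first component in the upper half of the word.

```python
MASK16 = 0xFFFF


def to_signed16(x):
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x


def split(word):
    """Return (first, second) 16-bit components; first = upper half (assumption)."""
    return (word >> 16) & MASK16, word & MASK16


def join(first, second):
    return ((first & MASK16) << 16) | (second & MASK16)


def pshrl_imm(rs1, imm):
    """Logical shift right: zeros enter from the left of each component."""
    n = imm & 0xF                # the same count feeds both shifters
    a1, a2 = split(rs1)
    return join(a1 >> n, a2 >> n)


def pshra_imm(rs1, imm):
    """Arithmetic shift right: the sign bit of each component is replicated."""
    n = imm & 0xF
    a1, a2 = split(rs1)
    return join(to_signed16(a1) >> n, to_signed16(a2) >> n)


src = join(0x8000, 0x0010)
logical = pshrl_imm(src, 4)      # components become 0x0800 and 0x0001
arithmetic = pshra_imm(src, 4)   # components become 0xF800 and 0x0001
```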
In addition, the processor of Fig. 1B supports a parallel power instruction, ppower, and a parallel reciprocal square root instruction, precsqrt. The ppower and precsqrt instructions treat the value stored in each source register as a pair of fixed-point (rather than integer) components. Therefore, the value stored in bits 0..15 of register RD after the execution of a ppower instruction represents a value calculated by raising the value stored in bits 0..15 of register RS1 to a power specified by the value stored in bits 0..15 of register RS2. Similarly, the value stored in bits 16..31 of register RD after the execution of a ppower instruction represents a value calculated by raising the value stored in bits 16..31 of register RS1 to a power specified by the value stored in bits 16..31 of register RS2. The pair of values stored in register RD after the execution of a precsqrt instruction is calculated using a process similar to the one described for the ppower instruction, except that the reciprocal square roots of the pair of values stored in register RS1 are computed, rather than a power.
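The component-wise nature of ppower and precsqrt can be illustrated with a floating-point reference model. This is only a sketch under several assumptions that the passage above does not state: each component is taken to be an unsigned fixed-point number with 8 fractional bits (FRAC_BITS is hypothetical), results are rounded to nearest and clamped to the 16-bit range, and a zero input to the reciprocal square root returns the largest representable value. The hardware described in the referenced co-pending application may differ in all of these respects.

```python
import math

FRAC_BITS = 8              # assumed format: unsigned 8.8 fixed point per component
ONE = 1 << FRAC_BITS
MASK16 = 0xFFFF


def _to_fixed(x: float) -> int:
    # Round to nearest and clamp to 16 bits (clamping is an assumption).
    return max(0, min(MASK16, round(x * ONE)))


def _power(base: int, exp: int) -> int:
    try:
        return _to_fixed(math.pow(base / ONE, exp / ONE))
    except OverflowError:
        return MASK16      # result too large for a float: clamp (assumption)


def _recsqrt(x: int) -> int:
    if x == 0:
        return MASK16      # 1/sqrt(0) is undefined: clamp (assumption)
    return _to_fixed(1.0 / math.sqrt(x / ONE))


def ppower(rs1: int, rs2: int) -> int:
    """Raise each component of rs1 to the power held in the matching component of rs2."""
    lo = _power(rs1 & MASK16, rs2 & MASK16)
    hi = _power((rs1 >> 16) & MASK16, (rs2 >> 16) & MASK16)
    return (hi << 16) | lo


def precsqrt(rs1: int) -> int:
    """Compute the reciprocal square root of each component of rs1."""
    lo = _recsqrt(rs1 & MASK16)
    hi = _recsqrt((rs1 >> 16) & MASK16)
    return (hi << 16) | lo
```

Under the assumed 8.8 format, 0x0200 encodes 2.0 and 0x0080 encodes 0.5, so ppower(0x04000200, 0x00800200) computes 4.0 to the power 0.5 in the upper component and 2.0 squared in the lower component.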
The precsqrt instruction is further described in co-pending application Ser. No. 09/240,977 titled "Parallel Fixed Point Square Root And Reciprocal Square Root Computation Unit In A Processor" by Ravi Shankar and Subramania Sudharsanan, which is incorporated by reference herein in its entirety.
Embodiments described above illustrate but do not limit the invention. In particular, the invention is not limited by any number of registers or immediate values specified by the instructions. In addition, the invention is not limited to any particular hardware implementation. Those skilled in the art realize that alternative hardware implementations can be employed in lieu of the one described herein in accordance with the principles of the present invention. Other embodiments and variations are within the scope of the invention, as defined by the following claims.

Claims

1. A method of executing a single instruction parallel multiply-add function on a processor, the method comprising: providing the processor with an opcode indicating a parallel multiply-add instruction; providing the processor with a first, a second and a third value, wherein each of the values comprises two or more operand components; multiplying first operand components of the first and the second values to generate a first intermediate value; multiplying second operand components of the first and the second values to generate a second intermediate value; adding a first operand component of the third value to the first intermediate value to generate a first result value; adding a second operand component of the third value to the second intermediate value to generate a second result value; storing the first result value in a first portion of a result location; and storing the second result value in a second portion of the result location.
2. The method of claim 1, wherein the first, second and third values are stored in respective source registers of the processor specified by the parallel multiply-add instruction, and the first and the second result values are stored in a destination register of the processor specified by the parallel multiply-add instruction.
3. The method of claim 2, wherein the first result value is stored in the high-order bits of the destination register and the second result value is stored in the low-order bits of the destination register.
4. The method of claim 1, wherein the processor is pipelined and the single instruction is executed with a throughput of one instruction every 2 cycles.
5. A method of executing a single instruction conditional pick function on a processor, the method comprising: providing the processor with an opcode indicating a conditional pick instruction; providing the processor with a first, a second and a third value; comparing the first value to a reference value; determining, based upon the comparing, whether the first value is equal to the reference value; storing the second value in a result location if the first value is equal to the reference value; and storing the third value in a result location if the first value is not equal to the reference value.
6. The method of claim 5, wherein the first, second and third values are stored in respective source registers of the processor specified by the conditional pick instruction, and the second and the third values are stored in a destination register of the processor specified by the conditional pick instruction.
7. The method of claim 5, wherein the processor is pipelined and the single instruction is executed with a throughput of one instruction per cycle.
8. A method of executing a single instruction parallel averaging function on a processor, the method comprising: providing the processor with an opcode indicating a parallel averaging instruction; providing the processor with a first and a second value, wherein each of the values comprises two or more operand components; adding first operand components of the first and the second values to generate a first intermediate value; adding second operand components of the first and the second values to generate a second intermediate value; incrementing the first intermediate value by one to generate a third intermediate value; incrementing the second intermediate value by one to generate a fourth intermediate value; shifting the third intermediate value to generate a first result value; shifting the fourth intermediate value to generate a second result value; storing the first result value in a first portion of a result location; and storing the second result value in a second portion of the result location.
9. The method of claim 8, wherein the first and the second values are stored in respective source registers of the processor specified by the parallel averaging instruction.
10. The method of claim 8, wherein the first and the second result values are stored in a destination register of the processor specified by the parallel averaging instruction.
11. The method of claim 10, wherein the first result value is stored in the high-order bits of the destination register and the second result value is stored in the low-order bits of the destination register.
12. The method of claim 8, wherein the processor is pipelined and the single instruction is executed with a throughput of one instruction per cycle.
13. A method of executing a single instruction parallel shift function on a processor, the method comprising: providing the processor with an opcode indicating a parallel shift instruction; providing the processor with a first and a second value, wherein each of the values comprises two or more operand components; shifting the first operand component of the first value by a number of bits equal to a value of the first operand component of the second value to generate a first result value; shifting the second operand component of the first value by a number of bits equal to a value of the second operand component of the second value to generate a second result value; storing the first result value in a first portion of a result location; and storing the second result value in a second portion of the result location.
14. The method of claim 13, wherein the first and the second values are stored in respective source registers of the processor specified by the parallel shift instruction.
15. The method of claim 13, wherein the first and the second result values are stored in a destination register of the processor specified by the parallel shift instruction.
16. The method of claim 15, wherein the first result value is stored in the high-order bits of the destination register and the second result value is stored in the low-order bits of the destination register.
17. The method of claim 13, wherein the processor is pipelined and the single instruction is executed with a throughput of one instruction per cycle.
18. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a parallel multiply-add instruction.
19. The general purpose processor of claim 18, wherein the parallel multiply-add instruction operates on either integer or fixed point operands.
20. The general purpose processor of claim 19, wherein the results of the parallel multiply-add instruction are saturated.
21. The general purpose processor of claim 19, wherein the parallel multiply-add instruction further provides multiple saturation modes.
22. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a conditional pick instruction.
23. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a parallel averaging instruction.
24. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a parallel shift instruction.
25. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a parallel power instruction.
26. A general purpose processor comprising: a file register; an instruction fetch unit; and decoding circuitry; wherein the processor supports a parallel reciprocal square root instruction.
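For illustration only, the steps recited in method claims 1, 5 and 8 can be sketched as a software model. The sketch does not define or limit the claims. All function and helper names are hypothetical, and several details are assumptions: operand components are taken to be unsigned 16-bit halves of a 32-bit value with the first component in bits 0..15, results wrap to 16 bits rather than saturate, the reference value of the conditional pick defaults to zero, and the shift in the averaging step is a right shift by one bit.

```python
MASK16 = 0xFFFF  # hypothetical helper constant: selects one 16-bit component


def _split(r: int):
    """Return the (first, second) 16-bit components of a 32-bit value."""
    return r & MASK16, (r >> 16) & MASK16


def _join(first: int, second: int) -> int:
    """Pack two results into one 32-bit value, wrapping each to 16 bits."""
    return ((second & MASK16) << 16) | (first & MASK16)


def pmuladd(a: int, b: int, c: int) -> int:
    """Claim 1 sketch: multiply matching components of a and b, then add c's."""
    a1, a2 = _split(a)
    b1, b2 = _split(b)
    c1, c2 = _split(c)
    return _join(a1 * b1 + c1, a2 * b2 + c2)


def pick(first: int, second: int, third: int, reference: int = 0) -> int:
    """Claim 5 sketch: store the second value if first equals the reference, else the third."""
    return second if first == reference else third


def pavg(a: int, b: int) -> int:
    """Claim 8 sketch: add, increment by one, then shift (assumed right by one bit)."""
    a1, a2 = _split(a)
    b1, b2 = _split(b)
    return _join((a1 + b1 + 1) >> 1, (a2 + b2 + 1) >> 1)
```

The increment before the shift in pavg makes the average round up rather than truncate: components 1 and 2 average to 2 in this model, not 1.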

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001281162A AU2001281162A1 (en) 2000-08-16 2001-08-06 General purpose processor with graphics/media support

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/640,901 2000-08-16
US09/640,901 US7587582B1 (en) 1998-12-03 2000-08-16 Method and apparatus for parallel arithmetic operations

Publications (2)

Publication Number Publication Date
WO2002015000A2 (en) 2002-02-21
WO2002015000A3 WO2002015000A3 (en) 2003-08-07

Family

ID=24570142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/024778 WO2002015000A2 (en) 2000-08-16 2001-08-06 General purpose processor with graphics/media support

Country Status (2)

Country Link
AU (1) AU2001281162A1 (en)
WO (1) WO2002015000A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996017293A1 (en) * 1994-12-01 1996-06-06 Intel Corporation A microprocessor having a multiply operation
WO2000033185A2 (en) * 1998-12-03 2000-06-08 Sun Microsystems, Inc. A multiple-thread processor for threaded software applications
WO2000045251A2 (en) * 1999-01-29 2000-08-03 Sun Microsystems, Inc. Floating and parallel fixed point square root and reciprocal point square computation unit in a processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3579087B2 (en) * 1994-07-08 2004-10-20 株式会社日立製作所 Arithmetic unit and microprocessor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NADEHARA K ET AL: "LOW-POWER MULTIMEDIA RISC" IEEE MICRO, IEEE INC. NEW YORK, US, vol. 15, no. 6, 1 December 1995 (1995-12-01), pages 20-29, XP000538227 ISSN: 0272-1732 *
PATENT ABSTRACTS OF JAPAN vol. 1996, no. 05, 31 May 1996 (1996-05-31) & JP 08 022451 A (HITACHI LTD), 23 January 1996 (1996-01-23) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8766995B2 (en) 2006-04-26 2014-07-01 Qualcomm Incorporated Graphics system with configurable caches
WO2007140338A2 (en) 2006-05-25 2007-12-06 Qualcomm Incorporated Graphics processor with arithmetic and elementary function units
WO2007140338A3 (en) * 2006-05-25 2008-03-06 Qualcomm Inc Graphics processor with arithmetic and elementary function units
KR101012625B1 (en) 2006-05-25 2011-02-09 퀄컴 인코포레이티드 Graphics processor with arithmetic and elementary function units
US8869147B2 (en) 2006-05-31 2014-10-21 Qualcomm Incorporated Multi-threaded processor with deferred thread output control
US8644643B2 (en) 2006-06-14 2014-02-04 Qualcomm Incorporated Convolution filtering in a graphics processor
US8766996B2 (en) 2006-06-21 2014-07-01 Qualcomm Incorporated Unified virtual addressed register file

Also Published As

Publication number Publication date
AU2001281162A1 (en) 2002-02-25
WO2002015000A3 (en) 2003-08-07

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP