Coprocessors and Attached Processors

This lecture is based mostly on material from Tanenbaum’s textbook
Structured Computer Organization (Ref. 4).

We shall begin with a refresher on VLIW (Very Long Instruction Word)
designs and then examine a number of coprocessors, several of which are VLIW.

Topics:

    1.   The VLIW design and its use in single processors.

    2.   The TriMedia VLIW CPU.

    3.   Heterogeneous multiprocessors on a chip: the DVD player.

    4.   The Global Internet, Ethernet, and Attached Network Cards.

    5.   The Nexperia Media Coprocessor.

    6.   Other high–end video graphics cards.

    7.   High–end coprocessors for audio production.

    8.   Cryptoprocessors.

As we shall see, the economics of the mass market often favor the production of
highly specialized attached processors to share the computing load with the CPU.


The Very Long Instruction Word Design

The VLIW design is one that we first encountered when discussing high–performance
single processor computing systems.  The design assumes a superscalar CPU and calls for machine code words that hold multiple instructions, one per CPU function unit.

Each machine code word might have two integer instructions, one floating point instruction, and so forth.  Modern designs instead issue instructions in bundles, with a marker that indicates the end of each bundle.


The TriMedia VLIW Central Processing Unit

The TriMedia processor was designed by Philips, the Dutch electronics company that also designed the CD and the CD–ROM (Ref. 4).  It is intended for media–intensive applications, such as image processing, CD and DVD recorders and players, digital video cameras, and digital television sets.  The TriMedia is a true VLIW processor.

Each machine language instruction commonly specifies five operations.  The machine word is divided into five slots, one per operation to be issued.  Each slot commands one or more function units, and some function units respond only to specific slots, so that some slots are “special purpose”.

Here is the format of a typical TriMedia machine instruction.

The TM3260 implementation runs at 250 MHz.  Since it can issue five operations per clock cycle, it has an effective maximum rating of 5 × 250 = 1250 MIPS.

The TriMedia has a byte–oriented memory.  It uses memory–mapped I/O, in which each I/O device is accessed through registers mapped into the memory address space.


The TriMedia Processors

Here is a table taken from the Wikipedia article on the history of TriMedia processors.

Core       Year 1st   ISA          Cache (I/D)          Frequency      Introduction   Features
           silicon                 KB                   (worst case)   technology

TM1000     1997       TMA0         32/16                100 MHz        500 nm
TM1100     1998       TMA1         32/16                133 MHz        350 nm
TM1300     1999       TMA1         32/16                166 MHz        250 nm
TM3260     2002       TMA2         64/16                250 MHz        130 nm         binary compatible with TM1300
TM5250     2004       TMA3         64/16                450 MHz        130 nm         128 KB L2 data cache, allocate on write miss, hardware prefetching, super pipelined (high speed)
TM2270     2006       TMA3         32/16                290 MHz        90 nm          96 GPRs (small area)
TM3270/1   2006       TMA4 + ASE   64/128 64/32 32/16   350 MHz        90 nm          low power

The Tanenbaum textbook is based on the TM3260.  Note the successor processors.

    1.   The TM5250, operating at 450 MHz.  It is more powerful.

    2.   The TM2270 and TM3270, designed to be small and/or low in power consumption.

The two common market pressures are high performance and low power usage.


The TriMedia CPU: Details

The CPU has 128 general purpose registers, each holding a 32–bit number.  Two of the registers store constant values: R0 stores 0 and R1 stores 1.  All others are general purpose and can store integers (8, 16, or 32 bits) or IEEE–754 floating point values.

The TM3260 has 12 functional units: one control unit and eleven units for arithmetical, logical, and control flow operations.  Some of these units respond only to instructions in specific instruction slots; others can be commanded from any instruction slot.

The latency is the number of steps to move a result through the functional unit.
The last five columns show the placement of commands for each functional unit.


The TriMedia CPU: Mathematical Units

The standard arithmetic units use the two’s–complement standard for integer arithmetic, but the DSP (Digital Signal Processor) units use saturation arithmetic.

In saturation arithmetic, an operation whose result overflows the representable range saturates at the maximum (or minimum) representable value rather than wrapping around or generating an exception.

For example, the range of numbers representable by 8–bit unsigned integer arithmetic is
0 through 255 inclusive.  In saturation arithmetic, 180 + 180 = 255, the maximum value.
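
The 8–bit example above can be sketched in a few lines of Python.  This is only an illustration of the idea, not TriMedia code; the function names are made up for this sketch, and a wraparound add is included purely for contrast.

```python
U8_MAX = 255  # largest value representable in 8 unsigned bits


def saturating_add_u8(a: int, b: int) -> int:
    """Add two unsigned 8-bit values, clamping the sum at 255."""
    return min(a + b, U8_MAX)


def wrapping_add_u8(a: int, b: int) -> int:
    """Add two unsigned 8-bit values, discarding the carry out of bit 7."""
    return (a + b) & 0xFF


print(saturating_add_u8(180, 180))  # 255: the sum saturates
print(wrapping_add_u8(180, 180))    # 104: the sum wraps (360 - 256)
```

For media data saturation is usually the better behavior: an overly bright pixel stays white instead of wrapping around to a dark value.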

With two minor exceptions, all operations in the TriMedia are predicated.

In a predicated instruction, each operation specifies a register that is to be tested before the operation is executed.  The low–order bit of the register is examined.

    0.   If that bit is 0, the operation is skipped.

    1.   If that bit is 1, the operation is executed.

    IF R2 IADD R4, R5 → R8         // Add R4 to R5 and place the result into R8,
                                   // but only if bit 0 of R2 is a 1; otherwise do nothing.

Using R1 as the predicate register makes the operation unconditional, as R1 ≡ 1.

Using R0 as the predicate register makes it a no–op, as R0 ≡ 0.
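
The behavior of a predicated add can be modeled with a small Python sketch.  The register file is modeled here as a plain list, and the helper names and the 32–bit wraparound on the sum are assumptions of this sketch, not details taken from the TriMedia documentation.

```python
NUM_REGS = 128
MASK32 = 0xFFFFFFFF  # registers hold 32-bit values


def make_register_file():
    """Return 128 registers with R0 = 0 and R1 = 1 (the constant registers)."""
    regs = [0] * NUM_REGS
    regs[1] = 1
    return regs


def predicated_iadd(regs, pred, src1, src2, dest):
    """Model of: IF Rpred IADD Rsrc1, Rsrc2 -> Rdest.

    The add is performed only when the low-order bit of Rpred is 1.
    Returns True if the operation executed, False if it was skipped.
    """
    if regs[pred] & 1:
        regs[dest] = (regs[src1] + regs[src2]) & MASK32
        return True
    return False


regs = make_register_file()
regs[4], regs[5] = 7, 35

regs[2] = 6                         # low-order bit is 0, so the add is skipped
predicated_iadd(regs, 2, 4, 5, 8)

regs[2] = 3                         # low-order bit is 1, so the add executes
predicated_iadd(regs, 2, 4, 5, 9)

print(regs[8], regs[9])             # 0 42
```

Calling the helper with predicate register 1 always executes the add, and with predicate register 0 it never does, which matches the two special cases noted above.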


Heterogeneous Processor Example: The DVD Player

The computer controlling the DVD player has a number of very different functions.
Each of these is assigned to a specialized processor.

This design uses multiple cores on a single large chip.  A core is a large circuit, such as a CPU, I/O controller, or cache, that can be placed on a chip in a modular way.  Some modern processors are dual–core in that they have two cores, each being a full CPU.

This design might be called “heterogeneous multi–core”.  Each of the closely–coupled cores has a dedicated function related to the format of the data it must process.  This design was found to be more economical than a single general–purpose CPU.


Computers From “Piece Parts”

We now face the issue of how to design computers and their major components.

Main components, such as the CPU, will continue to be designed from basic gates in the traditional way for some time.  Here the advantage in performance gained from a single integrated design justifies the cost and effort involved.

We now have another attractive option for the design of computing machines.  This one is made attractive by the availability of a variety of cores, each with a dedicated function.  This collection of cores can be considered essentially as a set of libraries of functions, except that these functions are implemented in hardware.

IBM has produced a design, called CoreConnect, which is an architecture for connecting cores on a single–chip heterogeneous multiprocessor.  Here is an example.

Note the two busses; one is faster than the other.


The Global Internet and the Network Interface Card (NIC)

You may think that your computer is connected to the Internet, but it is not.  The computer is connected to a NIC; it is that NIC that is connected to the Internet.

The NIC is a dedicated I/O coprocessor, which communicates with the computer’s CPU via interrupts and DMA (Direct Memory Access).  Except when the NIC is operated in “promiscuous mode” (for network snooping), it filters all packets by MAC address.

The transmission standard that we shall discuss is called “Ethernet”.  Packets in this protocol carry two 48–bit MAC (Media Access Control) addresses, one for the source NIC (Network Interface Card) and one for the destination NIC.

Here is the format of an Ethernet packet containing an IP packet.

The Ethernet header contains the two MAC addresses.  Each NIC has a unique MAC address assigned to it under a protocol administered by the IEEE.  In normal use, the NIC will recognize messages sent to its MAC address and pass only those to the CPU.
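
The address filtering described above can be sketched in Python.  The sketch assumes the standard Ethernet header layout (6–byte destination address, 6–byte source address, 2–byte type field); the function names and the two MAC addresses are hypothetical, and the broadcast and multicast addresses that a real NIC also accepts are left out to keep the example short.

```python
ETHERNET_HEADER_LEN = 14  # 6 (destination) + 6 (source) + 2 (type)


def parse_ethernet_header(frame: bytes):
    """Split the 14-byte Ethernet header into its three fields."""
    if len(frame) < ETHERNET_HEADER_LEN:
        raise ValueError("frame is shorter than an Ethernet header")
    dest = frame[0:6]
    src = frame[6:12]
    ether_type = int.from_bytes(frame[12:14], "big")
    return dest, src, ether_type


def nic_passes_to_cpu(frame: bytes, nic_mac: bytes, promiscuous: bool = False) -> bool:
    """Return True if the NIC would hand this frame to the CPU."""
    dest, _src, _ether_type = parse_ethernet_header(frame)
    return promiscuous or dest == nic_mac


# Hypothetical addresses, chosen only for this illustration.
MY_MAC = bytes.fromhex("020000000001")
OTHER_MAC = bytes.fromhex("020000000002")
IPV4_TYPE = (0x0800).to_bytes(2, "big")  # type value for an IP (version 4) payload

frame_for_me = MY_MAC + OTHER_MAC + IPV4_TYPE + b"ip packet bytes"
frame_for_other = OTHER_MAC + MY_MAC + IPV4_TYPE + b"ip packet bytes"

print(nic_passes_to_cpu(frame_for_me, MY_MAC))                       # True
print(nic_passes_to_cpu(frame_for_other, MY_MAC))                    # False
print(nic_passes_to_cpu(frame_for_other, MY_MAC, promiscuous=True))  # True
```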

 


The NIC (Network Processor)

The NIC is a programmable device that can handle incoming and outgoing packets
at the full network speed.  It is plugged into a standard slot in the computer motherboard.

One or more network lines connect to the board and are routed to the network processor.
Most setups have only a single network line attached, but computers used as switches, routers, and the like must have at least two network lines attached.

Here is a diagram of a typical network processor, using a PCI slot on the motherboard.

Note the multiple PPE (Packet Processing Engines).  Each is a specialized core with a dedicated task; the set forms a packet processing pipeline.


The Nexperia Media Processor

Ordinary general–purpose processors are not especially good at the massively parallel computations required to process high–resolution audio and video streams.

The Nexperia is a single–chip heterogeneous multiprocessor designed by Philips, using its TriMedia chip.  It comprises a heterogeneous collection of cores, each with a dedicated function for which it has been optimized.  Here is the PNX 1500.

More on the Nexperia

The Nexperia is designed for use either as a coprocessor in a PC or as a stand–alone main processor in an appliance such as a DVD player, digital TV set, video camera, etc.

Other than the SRAM and SDRAM internal to the TriMedia processor, the Nexperia contains no main memory on the chip.  The PNX 1500 implementation has an interface to external memory, allowing for 8 to 256 MB of DDR SDRAM.

The width of the memory interface is 32 bits (4 bytes).  This allows the DDR memory to transfer 8 bytes per clock pulse; at 200 MHz the data rate is 1.6 GB/second.
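
The arithmetic behind this data rate can be written as a short Python sketch.  The function name is invented for this illustration; the factor of two reflects the fact that DDR memory transfers data on both edges of the clock.

```python
def ddr_peak_bandwidth(bus_width_bits: int, clock_hz: int, transfers_per_clock: int = 2) -> int:
    """Peak data rate in bytes per second for a DDR memory interface."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * transfers_per_clock * clock_hz


# PNX 1500 figures from the text: 32-bit interface clocked at 200 MHz.
rate = ddr_peak_bandwidth(32, 200_000_000)
print(f"{rate / 1e9:.1f} GB/second")  # 1.6 GB/second
```

The same formula applies to wider interfaces: only the bus width and clock rate change.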

The processing units (DVD Descrambler, Length Decoder, Video Scaler, and Graphics Engine) perform computations related to the display of encrypted video as found on a commercial DVD.

Note that there is a core dedicated to debugging.  It follows the JTAG (Joint Test Action Group) protocols, defined in IEEE Standard 1149.1 – the industry standard.


A High–End Graphics Coprocessor

Here are some data on the NVIDIA GeForce 9 Series (9800 GX2 and 9800 GTX).  The table is taken from the web site (Ref. 6).

 

Model       Core Clock   Shader Clock   Memory Clock   Memory        Memory      Memory Bandwidth   Texture Fill Rate
            (MHz)        (MHz)          (MHz)          Amount        Interface   (GB/sec)           (billion/sec)

9800 GX2    600          1500           1000           1 GB          512-bit     128                76.8
9800 GTX    675          1688           1100           512MB GDDR3   256-bit     70.4               43.2
9600 GT     650          1625           900            512MB         256-bit     57.6               20.8

The 9800 GX2 is a multi–core design with 256 stream processors.  It has a 512 bit
(64 byte) memory interface operating at a peak rate of 128 gigabytes per second.

The card produces video at resolutions up to 2560 by 1600 pixels.

The cost of the 9800 GX2 is $520 (Ref. 6, 4/16/2008).


A High–End Audio Processor

Here are some data on the SoundBlaster XtremeGamer Fatal1ty Pro Series.
It is an audio attached coprocessor for use with a PC.

24–bit Analog to Digital conversion      96 kHz sample rate

24–bit Digital to Analog conversion      96 kHz rate to either 7.1 audio or standard stereo.

64 MB random access memory, called “XRAM”.

Signal–to–Noise Ratio           109 dB for stereo output

Total Harmonic Distortion     0.004%

Frequency Response               10 Hz to 46 kHz (–3 dB points)

 

Note: These audio specifications would be considered extremely good
          for a high–priced audio system for home use.

The cost of this coprocessor is $150.00 (Ref. 7, 4/16/2008)

 


Cryptographic Coprocessors

Suppose that two workstations are to communicate over the public Internet in a secure mode.  The provision of industrial–grade cryptography is very compute intensive.

Again, cryptography does not lend itself to efficient solution by a general–purpose processor.  For this reason, and also to offload the computational burden from the primary CPU, many secure communication systems use attached cryptographic processors.

Here are some data on a cryptographic processor marketed by IBM (Ref. 8).  The product described is the IBM PCI Cryptographic Coprocessor.

The coprocessor provides DES, triple–DES, and RSA encryption and DSA digital signatures, all national standards.  The hardware is certified under FIPS PUB 140–1 (Security Requirements for Cryptographic Modules), at level 3.  The mainframe version is certified to level 4.

The coprocessor has a “tamper–sensing and tamper–responding environment” to limit and report unauthorized access to the processor itself.

The price of this unit was not quoted.


Game Engines as Supercomputers

It may surprise students to learn that many of these high–end graphics processors are actually export controlled as munitions.  In this case, the control is due to the possibility of using these processors as high–performance computers.

In the next slide, we present a high–end graphics coprocessor that can be viewed as a vector processor.  It is capable of a sustained rate of 430 Gigaflops (430,000 Megaflops).

Compare this to the CRAY–1 supercomputer of 1976, with a sustained computing
rate of 136 Megaflops and a peak rate of 250 Megaflops.  This is about 0.03% of the performance of the current graphics coprocessor at about 500 times the cost.

The Cray Y–MP was a supercomputer sold by Cray Research beginning in 1988.
Its peak performance was 2.66 Gigaflops (8 processors at 333 Megaflops each).
Its memory comprised 128, 256, or 512 MB of static RAM.

The earliest supercomputer that could outperform the current graphics processor seems to have been the Cray T3E–1200E™, an MPP (Massively Parallel Processor) introduced in 1995 (Ref. 9).  In 1998, a joint scientific team from Oak Ridge National Lab, the University of Bristol (UK) and others ran a simulation related to controlled fusion at a sustained rate of 1.02 Teraflops (1020 Gigaflops).

The next slide shows this current graphics coprocessor.


The NVIDIA Tesla C870

Data here are from the NVIDIA web site (Ref. 6).  I quote from their advertising copy.

The C870 processor is a “massively multi–threaded processor architecture that is ideal for high performance computing (HPC) applications”.

The C870 has 128 processor cores, each operating at 1.35 GHz.  It supports the IEEE–754 single–precision standard, and operates at a sustained rate of 430 gigaflops (512 GFlops peak).

The typical power usage is 120 watts.  Note the dedicated fan for cooling.

The price is $1300, with an introductory offer at $650.

The processor has 1.5 gigabytes of DDR SDRAM, operating at 800 MHz.  The data bus to memory is 384 bits (48 bytes) wide, so that the maximum sustained data rate is
48 × 2 × 800 × 10^6 bytes per second = 76.8 Gigabytes per second.


References

In this lecture, material from one or more of the following references has been used.

1.     Computer Organization and Design, David A. Patterson & John L. Hennessy,
        Morgan Kaufmann, (3rd Edition, Revised Printing) 2007, (The course textbook)
        ISBN 978 – 0 – 12 – 370606 – 5.

2.     Computer Architecture: A Quantitative Approach, John L. Hennessy and
        David A. Patterson, Morgan Kaufmann, 1990.  There is a later edition.
        ISBN 1 – 55860 – 069 – 8.

3.     High–Performance Computer Architecture, Harold S. Stone,
        Addison–Wesley (Third Edition), 1993.  ISBN 0 – 201 – 52688 – 3.

4.     Structured Computer Organization, Andrew S. Tanenbaum,
        Pearson/Prentice–Hall (Fifth Edition), 2006.  ISBN 0 – 13 – 148521 – 0

5.     http://en.wikipedia.org/wiki/TriMedia

6.     http://www.nvidia.com

7.     http://www.soundblaster.com

8.     http://www-03.ibm.com/security/cryptocards/pcicc.shtml

9.     http://www.cray.com