Bull Gamma 60 (1958)

Mark Smotherman

Summary

The Bull Gamma 60 was the first multithreaded computer. It was designed to tolerate ALU operations that were expected to be long-latency relative to its fast central memory: a central controller distributed several instructions and their data to different functional units during the latency of a previously distributed operation, by loading pointers into memory-mapped control registers of the functional units. Programs supported this arrangement by using fork and join instructions.

Compagnie des Machines Bull of France announced the Gamma 60 in 1958. A total of twenty systems were built. The logical designers were Pierre Chenus, Jean Bosset, and J.P. Cottet [1]. The machine was one of the first multiprocessors and was the first to be composed of multiple, functionally specialized processing units along with multiple I/O processing units.

The processing units include a BCD ALU, a binary ALU, a general comparison unit (to compare binary strings up to 255 words in length and to perform memory-to-memory block transfers), and a code translation unit (to translate between I/O device code and internal character code and to edit records for output).

Each of the processing units has an identical instruction request interface to a single supervisory control unit, which is called the program distributor and which controls parallel processing among the other units. A memory arbiter, called the transfer distributor, allocates on a priority basis the usage of a single shared input bus (data collection channel) and a single shared output bus (data distribution channel); all units, including the program distributor, must request memory cycles through the memory arbiter.
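
As an illustration of this arbitration, the following sketch (Python) grants one memory cycle at a time to the highest-priority pending request; the unit names and the fixed priority order are assumptions, since the references say only that allocation is on a priority basis.

    class TransferDistributor:
        """Toy model of the memory arbiter: one request granted per memory cycle."""

        def __init__(self, units_in_priority_order):
            # Assumed fixed priority, highest first.
            self.priority = {name: i for i, name in enumerate(units_in_priority_order)}
            self.pending = []   # outstanding requests: (unit, bus)

        def request(self, unit, bus):
            # bus is one of the two shared channels (collection or distribution).
            self.pending.append((unit, bus))

        def grant_next(self):
            # Grant the highest-priority pending request for the next memory cycle.
            if not self.pending:
                return None
            best = min(self.pending, key=lambda r: self.priority[r[0]])
            self.pending.remove(best)
            return best

    arbiter = TransferDistributor(["drum", "program distributor", "arithmetic", "comparison"])
    arbiter.request("arithmetic", "collection")
    arbiter.request("drum", "distribution")
    print(arbiter.grant_next())   # ('drum', 'distribution'): drum has the higher assumed priority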

Highlights

Instruction Formats

A complete instruction is composed of a series of 24-bit words, carrying one to four address words depending on the processing unit. The word types include cut words, blank words, address words, and directive words, as seen in the example code below.

A complete instruction must end in a directive word. An elementary sequence of complete instructions follows a cut word, and all the instructions apply to the processing unit specified in the cut word.
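
As a rough illustration only (the references give no bit-level layout here), a complete instruction can be modeled as a list of 24-bit words that carries one to four address words and must end in a directive word; the function and field names below are invented for the sketch.

    # Toy model of a complete instruction: a series of 24-bit words ending in a directive.
    def make_instruction(address_words, directive_word):
        """Build a complete instruction: 1-4 address words followed by a directive word."""
        if not 1 <= len(address_words) <= 4:
            raise ValueError("a complete instruction carries 1-4 addresses")
        words = [("address", a & 0xFFFFFF) for a in address_words]    # 24-bit words
        words.append(("directive", directive_word & 0xFFFFFF))
        return words

    def ends_in_directive(words):
        # A complete instruction must end in a directive word.
        return bool(words) and words[-1][0] == "directive"

    add = make_instruction([0x0100, 0x0104, 0x0108], directive_word=0x01)  # hypothetical operands/result
    assert ends_in_directive(add)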

Memory Spaces and Addressing

Data Formats

Operations

  • Data handling - translate, edit, block transfer
  • Logic
  • Fixed-point - includes loop counts in binary
  • Floating-point - 13 operations for arithmetic unit

Instruction Execution Cycle

Normal operation is:

Sequencing

Unconditional and conditional branches are available to change the program sequence, but they do not change the selected processing unit.

A cut word indicates that a different processing unit should be selected. Thus the previously selected processing unit is marked as available. If the newly selected processing unit is available, it is marked as busy, its PAR is loaded with the address of the word immediately following the cut word (i.e. the blank word), and the program distributor then scans for highest priority work (possibly deciding upon this newly selected processing unit). If the newly selected processing unit is not available, the address of the blank word is enqueued into a FIFO "control return chain" [2] (also known as "pick-up chain" [7]) using the processing unit's queue head pointer and tail pointer, and the program distributor then scans for other work.
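
The selection and queueing just described can be sketched as follows (Python); a list stands in for the head/tail-pointer chain that the real machine threads through memory, and the release step is inferred from the description rather than quoted from the references.

    class ProcessingUnit:
        def __init__(self, name):
            self.name = name
            self.busy = False
            self.par = None      # the unit's program address register
            self.chain = []      # FIFO control return chain (head/tail pointers in the real machine)

    def handle_cut(unit, blank_word_addr):
        """Program distributor's handling of a cut word that selects `unit`."""
        if not unit.busy:
            unit.busy = True
            unit.par = blank_word_addr           # unit will fetch from just after the cut word
        else:
            unit.chain.append(blank_word_addr)   # enqueue the sequence on the unit's chain
        # either way, the program distributor then scans for the highest-priority work

    def release(unit):
        """When the unit becomes available, resume the oldest queued sequence, if any."""
        if unit.chain:
            unit.par = unit.chain.pop(0)         # FIFO order
        else:
            unit.busy = False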

The SIMU word causes the current process to be queued on the program distributor for later execution. ([2] states that a control return chain is used, while [7] describes the machine as able to do no more than keep a single pointer to the word following the SIMU, i.e. saving the updated APAR.) The program distributor then starts the execution of a new process at the branch address. When the program distributor is released from the new process (e.g. by issuing a directive), the original process is restored. This can occur several times, so that an indefinite number of parallel processes may be initiated. (While not stated in the references, if both a forked process and the original process retain the currently selected processing unit, the original process should be enqueued on that unit's control return chain. However, it seems unfair to enqueue it at the back of a possibly nonempty queue.)
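
A minimal sketch of the fork, assuming the control-return-chain reading of [2] (the single saved pointer of [7] would amount to a one-entry queue); the class, addresses, and field names are invented for the sketch.

    class ProgramDistributor:
        def __init__(self):
            self.apar = None          # address the distributor will fetch from next
            self.return_chain = []    # queued continuation addresses ([2]'s control return chain)

        def simu(self, branch_addr, word_after_simu):
            # Fork: queue the current sequence, then continue at the branch address.
            self.return_chain.append(word_after_simu)
            self.apar = branch_addr

        def released(self):
            # When the distributor is released from the new sequence (e.g. after a
            # directive), the queued sequence is restored.
            if self.return_chain:
                self.apar = self.return_chain.pop(0)

    pd = ProgramDistributor()
    pd.apar = 0x0200                                        # hypothetical addresses
    pd.simu(branch_addr=0x0300, word_after_simu=0x0202)     # fork
    pd.released()                                           # execution resumes at 0x0202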

A cut word can specify the number of calls required before the cut is executed. Normally, only one call is needed; however, when multiple calls are specified, the cut word functions as a join operation. When the cut word is fetched, a count within the word is incremented and compared to the specified number of calls. If the count does not match, the updated cut word is merely returned to memory and the program distributor scans for other work. A single cut word can join up to four parallel processes; additional processes require more cut words. All joining process sequences except the one that falls through sequentially into the cut must, of course, end in an unconditional branch to this cut.
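
The join counter then behaves roughly as follows; the in-word field names and the resetting of the counter are assumptions, while the increment, compare, and write-back follow the description above.

    def handle_joining_cut(memory, cut_addr):
        """Cut word whose 'calls required' field makes it act as a join."""
        word = memory[cut_addr]          # e.g. {"unit": "compare", "calls": 2, "count": 0}
        word["count"] += 1
        if word["count"] != word["calls"]:
            memory[cut_addr] = word      # write the updated cut word back to memory
            return False                 # this arrival stops; distributor scans for other work
        word["count"] = 0                # assumed: counter reset so the cut word can be reused
        memory[cut_addr] = word
        return True                      # last arrival: the cut proceeds as usual

    memory = {0x0400: {"unit": "compare", "calls": 2, "count": 0}}
    assert handle_joining_cut(memory, 0x0400) is False   # first arrival waits
    assert handle_joining_cut(memory, 0x0400) is True    # second arrival completes the join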

Shared subroutines are treated as virtual units to provide mutual exclusion. This is because the return address is typically stored into a fixed memory location for that subroutine, and multiple processes executing that subroutine would overwrite the saved return addresses. A subroutine call then appears as a cut word to a virtual unit number followed by a blank word and then by an unconditional branch to the subroutine start address. The virtual element number is merely the memory address of a two-word block that acts as a control return chain queue head pointer and queue tail pointer. The head pointer also contains a busy bit to indicate how the cut should be processed: if reset, the bit is set and the process continues executing the subroutine; if set, the process is enqueued using the two-word block. This queueing is the same as for the physical processing units. A special cut word is used after the return from the subroutine to release virtual elements. (However, it is not clear how the program distributor then deals with two active processes at the point of the release. It would seem natural to place one of them on a control return chain for the program distributor, just as the processing of a SIMU word is described in [2], and continue with the other. None of the references reveals the actual operations.)
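
A sketch of the virtual element acting as a lock: the two-word block supplies a busy bit plus the head and tail pointers of a waiting queue. The release path below models only one plausible answer to the open question above, namely that the oldest waiter simply resumes at its saved address.

    class VirtualElement:
        """Stand-in for the two-word block: busy bit plus queue head/tail pointers."""
        def __init__(self):
            self.busy = False
            self.waiters = []        # control return chain of waiting blank-word addresses

    def cut_to_virtual(elem, blank_word_addr):
        """Entry cut: acquire the subroutine or queue behind its current user."""
        if not elem.busy:
            elem.busy = True
            return blank_word_addr            # continue into the subroutine
        elem.waiters.append(blank_word_addr)  # enqueue; distributor scans for other work
        return None

    def release_virtual(elem):
        """Special cut after the return: pass the element to the next waiter, if any."""
        if elem.waiters:
            return elem.waiters.pop(0)        # assumed: oldest waiter resumes here
        elem.busy = False
        return None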

Supervision

Example Code

    X:  CUT virtual element Z                   (* Z is address of two-word block *)
        blank word                              (* used for queue link *)
        BRANCH unconditionally to address A
        CUT release virtual element Z
        blank word
        ...
    Z:  blank word                              (* contains busy bit and serves as queue head pointer *)
        blank word                              (* serves as queue tail pointer *)
        ...
    A:  DIRECTIVE store next instruction address into location H
    B:  SIMU address E                          (* fork new process at E *)
        blank word
    C:  CUT arithmetic
        blank word                              (* used for queue link *)
        address   location of operand 1
        address   location of operand 2
        address   location of result
        DIRECTIVE add
    D:  BRANCH unconditionally to address F
    E:  CUT drum
        blank word
        address   source address
        address   destination address
        DIRECTIVE transfer drum to memory, N number of words
    F:  CUT compare, 2 calls to proceed         (* acts as join *)
        blank word
        address   source address
        address   destination address
        DIRECTIVE block move, N number of words to transfer
    G:  BRANCH unconditionally to address stored in location H
    H:  blank word                              (* save location for return address *)
        ...

Thus the data flow in the subroutine is:

    A; cobegin C | E coend; F; G;
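
In modern terms this flow is an ordinary fork/join. A rough Python equivalent of the example, with the Gamma 60 operations replaced by placeholder functions, might read:

    import threading

    def step_A(): pass     # DIRECTIVE: store the return address into H
    def step_C(): pass     # arithmetic unit: add
    def step_E(): pass     # drum unit: transfer drum to memory
    def step_F(): pass     # comparison unit: block move (runs only after the join)
    def step_G(): pass     # return through the address saved in H

    step_A()
    t = threading.Thread(target=step_E)   # SIMU: fork the drum transfer at E
    t.start()
    step_C()                              # the original sequence continues with the add
    t.join()                              # the 2-call cut word at F acts as the join
    step_F()
    step_G()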

Input/Output

Peripherals include magnetic drum (25,600 words, average access time of 10 milliseconds, transfer time of 100 microseconds per word), magnetic tapes (7 tracks, 200 bits per inch, 75 inches per second), card readers (300 cards per minute), card punches (300 cards per minute), line printers (300 lines per minute), paper tape readers (200 characters per second), paper tape punches (25 characters per second), and a console typewriter. The magnetic tape units used phase modulation rather than NRZ (non-return-to-zero) recording.

Comments

As discussed in [3], the "granularity" of parallel process specification in the Gamma 60 is too fine. Each complete instruction, rather than each logical collection of work, is essentially defined to be a separate process that the machine must initiate and coordinate, which results in a great deal of overhead. The cause of this can be traced to the memory-oriented design, as opposed to a traditional CPU-oriented design; everything possible was done to use every memory cycle. The designers assumed that memory was much faster (10 microsecond cycle time) than any of the processing units (floating-point addition requires 88-150 microseconds, while fixed-point addition requires 100 microseconds). Unfortunately, by attempting to meet this goal through parallel operation of multiple processing units at such a low level, it appears that many memory cycles were consumed by the supervisory overhead of processes switching back and forth between units or were lost to congestion at critical, bottleneck units.
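
The latency gap behind this design goal is easy to quantify from the figures above (a back-of-the-envelope calculation, not taken from the references):

    memory_cycle_us = 10
    fp_add_us = (88, 150)    # floating-point addition
    fixed_add_us = 100       # fixed-point addition

    # Memory cycles that could in principle be overlapped with a single addition:
    print(fp_add_us[0] // memory_cycle_us, "to", fp_add_us[1] // memory_cycle_us)   # 8 to 15
    print(fixed_add_us // memory_cycle_us)                                          # 10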

References

[1] Anon., "France's Gamma 60: A step forward in data processing?" Datamation, vol. 4, no. 3, May-June 1958, pp. 34-35.

[2] M. Bataille, "The Gamma 60: The computer that was ahead of its time," Honeywell Computer Journal, vol. 5, no. 3, 1971, pp. 99-105.

[3] Section 13.2 of G.A. Blaauw and F.P. Brooks, Jr., Computer Architecture: Concepts and Evolution. Reading, MA: Addison-Wesley, 1997.

[4] W.A. Curtin, "Multiple computer systems," pp. 245-303, in F.L. Alt (ed.), Advances in Computers, vol. 4. New York: Academic Press, 1964.

[5] P. Dreyfus, "System design of the Gamma 60," Proceedings of the Western Joint Computer Conference, Los Angeles, CA, 1958, pp. 130-133.

[6] P. Dreyfus, "Programming design features of the Gamma 60 computer," Proceedings of the Eastern Joint Computer Conference, 1959, pp. 174-181.

[7] P. Dreyfus, "Programming on a concurrent digital computer," in Frontier Research on Digital Computers, vol. 1, University of North Carolina Summer Institute, 1959, section V, pp. 1-49. [Same paper as reference 17 of paper [2] above: P. Dreyfus, "Programming on a concurrent digital computer," Notes of University of Michigan Engineering Summer Conference on Theory of Computing Machine Design, 1961.]

Design analysis of French Bull Gamma-60 design (postscript) (appears in 3eme Colloque Histoire de l'Informatique, Sophia Antipolis, France, October 1993)


