Bull Gamma 60 (1958)

Mark Smotherman

Summary

The Bull Gamma 60 was the first multithreaded computer. It was designed to tolerate ALU operations that were expected to be long-latency relative to its fast central memory: a central controller distributed several instructions and their data to different functional units during the latency of a previously distributed operation, by loading pointers into memory-mapped control registers of the functional units. Programs supported this arrangement by using fork and join instructions.

Compagnie des Machines Bull of France announced the Gamma 60 in 1958. A total of twenty systems were built. The logical designers were Pierre Chenus, Jean Bosset, and J.P. Cottet [1]. The machine was one of the first multiprocessors and was the first to be composed of multiple, functionally specialized processing units along with multiple I/O processing units.

The processing units include a BCD ALU, a binary ALU, a general comparison unit (to compare binary strings up to 255 words in length and to perform memory-to-memory block transfers), and a code translation unit (to translate between I/O device code and internal character code and to edit records for output).

Each of the processing units has an identical instruction request interface to a single supervisory control unit, which is called the program distributor and which controls parallel processing among the other units. A memory arbiter, called the transfer distributor, allocates on a priority basis the usage of a single shared input bus (data collection channel) and a single shared output bus (data distribution channel); all units, including the program distributor, must request memory cycles through the memory arbiter.
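
As an illustration of this arbitration, the following sketch (Python) grants one memory cycle at a time to the highest-priority pending request; the unit names and the fixed priority order are assumptions, since the references say only that allocation is on a priority basis.

    class TransferDistributor:
        """Toy model of the memory arbiter: one request granted per memory cycle."""

        def __init__(self, units_in_priority_order):
            # Assumed fixed priority, highest first.
            self.priority = {name: i for i, name in enumerate(units_in_priority_order)}
            self.pending = []   # outstanding requests: (unit, bus)

        def request(self, unit, bus):
            # bus is one of the two shared channels (collection or distribution).
            self.pending.append((unit, bus))

        def grant_next(self):
            # Grant the highest-priority pending request for the next memory cycle.
            if not self.pending:
                return None
            best = min(self.pending, key=lambda r: self.priority[r[0]])
            self.pending.remove(best)
            return best

    arbiter = TransferDistributor(["drum", "program distributor", "arithmetic", "comparison"])
    arbiter.request("arithmetic", "collection")
    arbiter.request("drum", "distribution")
    print(arbiter.grant_next())   # ('drum', 'distribution'): drum has the higher assumed priority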

Highlights

Instruction Formats

A complete instruction is composed of a series of 24-bit words, carrying one to four address words depending on the processing unit. The word types include cut words, blank words, address words, and directive words, as seen in the example code below.

A complete instruction must end in a directive word. An elementary sequence of complete instructions follows a cut word, and all the instructions apply to the processing unit specified in the cut word.
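
As a rough illustration only (the references give no bit-level layout here), a complete instruction can be modeled as a list of 24-bit words that carries one to four address words and must end in a directive word; the function and field names below are invented for the sketch.

    # Toy model of a complete instruction: a series of 24-bit words ending in a directive.
    def make_instruction(address_words, directive_word):
        """Build a complete instruction: 1-4 address words followed by a directive word."""
        if not 1 <= len(address_words) <= 4:
            raise ValueError("a complete instruction carries 1-4 addresses")
        words = [("address", a & 0xFFFFFF) for a in address_words]    # 24-bit words
        words.append(("directive", directive_word & 0xFFFFFF))
        return words

    def ends_in_directive(words):
        # A complete instruction must end in a directive word.
        return bool(words) and words[-1][0] == "directive"

    add = make_instruction([0x0100, 0x0104, 0x0108], directive_word=0x01)  # hypothetical operands/result
    assert ends_in_directive(add)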

Memory Spaces and Addressing

Data Formats

Operations

  • Data handling - translate, edit, block transfer
  • Logic
  • Fixed-point - includes loop counts in binary
  • Floating-point - 13 operations for arithmetic unit

Instruction Execution Cycle

Normal operation is:

Sequencing

Unconditional and conditional branches are available to change the program sequence, but they do not change the selected processing unit.

A cut word indicates that a different processing unit should be selected. Thus the previously selected processing unit is marked as available. If the newly selected processing unit is available, it is marked as busy, its PAR is loaded with the address of the word immediately following the cut word (i.e. the blank word), and the program distributor then scans for highest priority work (possibly deciding upon this newly selected processing unit). If the newly selected processing unit is not available, the address of the blank word is enqueued into a FIFO "control return chain" [2] (also known as "pick-up chain" [7]) using the processing unit's queue head pointer and tail pointer, and the program distributor then scans for other work.
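
The selection and queueing just described can be sketched as follows (Python); a list stands in for the head/tail-pointer chain that the real machine threads through memory, and the release step is inferred from the description rather than quoted from the references.

    class ProcessingUnit:
        def __init__(self, name):
            self.name = name
            self.busy = False
            self.par = None      # the unit's program address register
            self.chain = []      # FIFO control return chain (head/tail pointers in the real machine)

    def handle_cut(unit, blank_word_addr):
        """Program distributor's handling of a cut word that selects `unit`."""
        if not unit.busy:
            unit.busy = True
            unit.par = blank_word_addr           # unit will fetch from just after the cut word
        else:
            unit.chain.append(blank_word_addr)   # enqueue the sequence on the unit's chain
        # either way, the program distributor then scans for the highest-priority work

    def release(unit):
        """When the unit becomes available, resume the oldest queued sequence, if any."""
        if unit.chain:
            unit.par = unit.chain.pop(0)         # FIFO order
        else:
            unit.busy = False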

The SIMU word causes the current process to be queued on the program distributor for later execution. ([2] states that a control return chain is used, while [7] describes the machine as able to do no more than keep a single pointer to the word following the SIMU, i.e. saving the updated APAR.) The program distributor then starts the execution of a new process at the branch address. When the program distributor is released from the new process (e.g. by issuing a directive), the original process is restored. This can occur several times, so that an indefinite number of parallel processes may be initiated. (While not stated in the references, if both a forked process and the original process retain the currently selected processing unit, the original process should be enqueued on that unit's control return chain. However, it seems unfair to enqueue it at the back of a possibly nonempty queue.)
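
A minimal sketch of the fork, assuming the control-return-chain reading of [2] (the single saved pointer of [7] would amount to a one-entry queue); the class, addresses, and field names are invented for the sketch.

    class ProgramDistributor:
        def __init__(self):
            self.apar = None          # address the distributor will fetch from next
            self.return_chain = []    # queued continuation addresses ([2]'s control return chain)

        def simu(self, branch_addr, word_after_simu):
            # Fork: queue the current sequence, then continue at the branch address.
            self.return_chain.append(word_after_simu)
            self.apar = branch_addr

        def released(self):
            # When the distributor is released from the new sequence (e.g. after a
            # directive), the queued sequence is restored.
            if self.return_chain:
                self.apar = self.return_chain.pop(0)

    pd = ProgramDistributor()
    pd.apar = 0x0200                                        # hypothetical addresses
    pd.simu(branch_addr=0x0300, word_after_simu=0x0202)     # fork
    pd.released()                                           # execution resumes at 0x0202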

A cut word can specify the number of calls required before the cut is executed. Normally, only one call is needed; however, when multiple calls are specified, the cut word functions as a join operation. When the cut word is fetched, a count within the word is incremented and compared to the specified number of calls. If the count does not match, the updated cut word is merely returned to memory and the program distributor scans for other work. A single cut word can join up to four parallel processes; additional processes require more cut words. All joining process sequences except the one that falls through sequentially into the cut must, of course, end in an unconditional branch to this cut.
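
The join counter then behaves roughly as follows; the in-word field names and the resetting of the counter are assumptions, while the increment, compare, and write-back follow the description above.

    def handle_joining_cut(memory, cut_addr):
        """Cut word whose 'calls required' field makes it act as a join."""
        word = memory[cut_addr]          # e.g. {"unit": "compare", "calls": 2, "count": 0}
        word["count"] += 1
        if word["count"] != word["calls"]:
            memory[cut_addr] = word      # write the updated cut word back to memory
            return False                 # this arrival stops; distributor scans for other work
        word["count"] = 0                # assumed: counter reset so the cut word can be reused
        memory[cut_addr] = word
        return True                      # last arrival: the cut proceeds as usual

    memory = {0x0400: {"unit": "compare", "calls": 2, "count": 0}}
    assert handle_joining_cut(memory, 0x0400) is False   # first arrival waits
    assert handle_joining_cut(memory, 0x0400) is True    # second arrival completes the join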

Shared subroutines are treated as virtual units to provide mutual exclusion. This is because the return address is typically stored into a fixed memory location for that subroutine, and multiple processes executing that subroutine would overwrite the saved return addresses. A subroutine call then appears as a cut word to a virtual unit number followed by a blank word and then by an unconditional branch to the subroutine start address. The virtual element number is merely the memory address of a two-word block that acts as a control return chain queue head pointer and queue tail pointer. The head pointer also contains a busy bit to indicate how the cut should be processed: if reset, the bit is set and the process continues executing the subroutine; if set, the process is enqueued using the two-word block. This queueing is the same as for the physical processing units. A special cut word is used after the return from the subroutine to release virtual elements. (However, it is not clear how the program distributor then deals with two active processes at the point of the release. It would seem natural to place one of them on a control return chain for the program distributor, just as the processing of a SIMU word is described in [2], and continue with the other. None of the references reveals the actual operations.)
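
A sketch of the virtual element acting as a lock: the two-word block supplies a busy bit plus the head and tail pointers of a waiting queue. The release path below models only one plausible answer to the open question above, namely that the oldest waiter simply resumes at its saved address.

    class VirtualElement:
        """Stand-in for the two-word block: busy bit plus queue head/tail pointers."""
        def __init__(self):
            self.busy = False
            self.waiters = []        # control return chain of waiting blank-word addresses

    def cut_to_virtual(elem, blank_word_addr):
        """Entry cut: acquire the subroutine or queue behind its current user."""
        if not elem.busy:
            elem.busy = True
            return blank_word_addr            # continue into the subroutine
        elem.waiters.append(blank_word_addr)  # enqueue; distributor scans for other work
        return None

    def release_virtual(elem):
        """Special cut after the return: pass the element to the next waiter, if any."""
        if elem.waiters:
            return elem.waiters.pop(0)        # assumed: oldest waiter resumes here
        elem.busy = False
        return None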

Supervision

Example Code

    X:  CUT virtual element Z                   (* Z is address of two-word block *)
        blank word                              (* used for queue link *)
        BRANCH unconditionally to address A
        CUT release virtual element Z
        blank word
        ...
    Z:  blank word                              (* contains busy bit and serves as queue head pointer *)
        blank word                              (* serves as queue tail pointer *)
        ...
    A:  DIRECTIVE store next instruction address into location H
    B:  SIMU address E                          (* fork new process at E *)
        blank word
    C:  CUT arithmetic
        blank word                              (* used for queue link *)
        address   location of operand 1
        address   location of operand 2
        address   location of result
        DIRECTIVE add
    D:  BRANCH unconditionally to address F
    E:  CUT drum
        blank word
        address   source address
        address   destination address
        DIRECTIVE transfer drum to memory, N number of words
    F:  CUT compare, 2 calls to proceed         (* acts as join *)
        blank word
        address   source address
        address   destination address
        DIRECTIVE block move, N number of words to transfer
    G:  BRANCH unconditionally to address stored in location H
    H:  blank word                              (* save location for return address *)
        ...

Thus the data flow in the subroutine is:

    A; cobegin C | E coend; F; G;
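
In modern terms this flow is an ordinary fork/join. A rough Python equivalent of the example, with the Gamma 60 operations replaced by placeholder functions, might read:

    import threading

    def step_A(): pass     # DIRECTIVE: store the return address into H
    def step_C(): pass     # arithmetic unit: add
    def step_E(): pass     # drum unit: transfer drum to memory
    def step_F(): pass     # comparison unit: block move (runs only after the join)
    def step_G(): pass     # return through the address saved in H

    step_A()
    t = threading.Thread(target=step_E)   # SIMU: fork the drum transfer at E
    t.start()
    step_C()                              # the original sequence continues with the add
    t.join()                              # the 2-call cut word at F acts as the join
    step_F()
    step_G()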

Input/Output

Peripherals include magnetic drum (25,600 words, average access time of 10 milliseconds, transfer time of 100 microseconds per word), magnetic tapes (7 tracks, 200 bits per inch, 75 inches per second), card readers (300 cards per minute), card punches (300 cards per minute), line printers (300 lines per minute), paper tape readers (200 characters per second), paper tape punches (25 characters per second), and a console typewriter. The magnetic tape units used phase modulation rather than NRZ (non-return-to-zero) recording.

Comments

As discussed in [3], the "granularity" of parallel process specification in the Gamma 60 is too fine. Each complete instruction, rather than each logical collection of work, is essentially defined to be a separate process that the machine must initiate and coordinate, which results in a great deal of overhead. The cause of this can be traced to the memory-oriented design, as opposed to a traditional CPU-oriented design; everything possible was done to use every memory cycle. The designers assumed that memory was much faster (10 microsecond cycle time) than any of the processing units (floating-point addition requires 88-150 microseconds, while fixed-point addition requires 100 microseconds). Unfortunately, by attempting to meet this goal through parallel operation of multiple processing units at such a low level, it appears that many memory cycles were consumed by the supervisory overhead of processes switching back and forth between units or were lost to congestion at critical, bottleneck units.
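
The latency gap behind this design goal is easy to quantify from the figures above (a back-of-the-envelope calculation, not taken from the references):

    memory_cycle_us = 10
    fp_add_us = (88, 150)    # floating-point addition
    fixed_add_us = 100       # fixed-point addition

    # Memory cycles that could in principle be overlapped with a single addition:
    print(fp_add_us[0] // memory_cycle_us, "to", fp_add_us[1] // memory_cycle_us)   # 8 to 15
    print(fixed_add_us // memory_cycle_us)                                          # 10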

References

[1] Anon., "France's Gamma 60: A step forward in data processing?" Datamation, vol. 4, no. 3, May-June 1958, pp. 34-35.

[2] M. Bataille, "The Gamma 60: The computer that was ahead of its time," Honeywell Computer Journal, vol. 5, no. 3, 1971, pp. 99-105.

[3] Section 13.2 of G.A. Blaauw and F.P. Brooks, Jr., Computer Architecture: Concepts and Evolution. Reading, MA: Addison-Wesley, 1997.

[4] W.A. Curtin, "Multiple computer systems," pp. 245-303, in F.L. Alt (ed.), Advances in Computers, vol. 4. New York: Academic Press, 1964.

[5] P. Dreyfus, "System design of the Gamma 60," Proceedings of the Western Joint Computer Conference, Los Angeles, CA, 1958, pp. 130-133.

[6] P. Dreyfus, "Programming design features of the Gamma 60 computer," Proceedings of the Eastern Joint Computer Conference, 1959, pp. 174-181.

[7] P. Dreyfus, "Programming on a concurrent digital computer," in Frontier Research on Digital Computers, vol. 1, University of North Carolina Summer Institute, 1959, section V, pp. 1-49. [Same paper as reference 17 of paper [2] above: P. Dreyfus, "Programming on a concurrent digital computer," Notes of University of Michigan Engineering Summer Conference on Theory of Computing Machine Design, 1961.]

Design analysis of French Bull Gamma-60 design (postscript) (appears in 3eme Colloque Histoire de l'Informatique, Sophia Antipolis, France, October 1993)


