Structured Computer Organization, by Tanenbaum
PARALLELISM (Chapter 2)
- Computer architects look to parallelism (doing two or more things at once)
as a way to get more performance for a given clock speed.
- Instruction-level and processor-level parallelism are the two forms of parallelism.
- Instruction-level: parallelism is exploited within individual instructions
to get more instructions/sec
- Processor-level parallelism: multiple CPUs work together on the same
problem.
- INSTRUCTION LEVEL PARALLELISM – PIPELINING
- The actual fetching of instructions from memory is a major bottleneck.
- Pipelining: instruction execution is divided into many parts, each one
handled by a dedicated piece of hardware, all of which can run in parallel.
- Figure 2-4a: a 5-stage pipeline.
- Stage 1: fetches the instruction from memory and places it in a buffer
until it is needed.
- Stage 2: decodes the instruction, determining its type and what operands
it needs.
- Stage 3: locates and fetches the operands, either from registers or from
memory.
- Stage 4: does the actual work of carrying out the instruction, such as
running the operands through the data path of Figure 2-2.
- Stage 5: writes the result back to the proper register.
- Figure 2-4 b: The state of each stage as a function of time. Nine clock
cycles are illustrated.
- Clock cycle 1: S1 fetches instr. 1 from memory.
- Clock cycle 2: S2 decodes instr. 1, S1 fetches instr. 2.
- Clock cycle 3: S3 fetches the operands for instr. 1, S2 decodes instr. 2,
S1 fetches instr. 3.
- Clock cycle 4: S4 executes instr. 1, S3 fetches operands for instr. 2,
S2 decodes instr. 3, and S1 fetches instr. 4.
- Clock cycle 5: S5 writes the result of instr.1 back, …
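The stage-by-stage progression above can be sketched as a small timing simulation (an illustrative Python sketch, not from the book; the stage names follow Figure 2-4a):

```python
# Which instruction occupies each pipeline stage on each clock cycle.
STAGES = ["S1 fetch", "S2 decode", "S3 operands", "S4 execute", "S5 write back"]

def pipeline_schedule(num_instructions, num_cycles):
    """table[cycle][stage] = instruction number in that stage, or None if idle."""
    table = []
    for cycle in range(1, num_cycles + 1):
        # The instruction in stage s entered S1 exactly s cycles earlier.
        row = [cycle - s if 1 <= cycle - s <= num_instructions else None
               for s in range(len(STAGES))]
        table.append(row)
    return table

for cycle, row in enumerate(pipeline_schedule(9, 9), start=1):
    busy = ", ".join(f"{name}: instr {i}" for name, i in zip(STAGES, row) if i)
    print(f"cycle {cycle}: {busy}")
```

Cycle 5 is the first cycle in which all five stages are busy; from then on, one instruction completes every cycle.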
- Analogy: an assembly line at a cake factory - a long conveyor belt with 5
workers (processing units). Each cake is like an instruction. Every 10
sec. (the clock cycle) W1 places an empty cake box on the belt. The box
is carried to W2, who places a cake in it. Box goes to W3 where it is
closed and sealed. W4 puts a label on the box, W5 removes box from the
belt and puts it in a large container for later shipment to a supermarket.
- Latency: how long it takes to execute an instruction.
- Processor bandwidth: how many MIPS (millions of instructions per second) the CPU can sustain.
- Formulas (with a cycle time of T nsec and n stages in the pipeline):
- Latency: nT nsec (the time for one instruction to traverse the whole pipeline)
- Bandwidth: 1000/T MIPS (one instruction completes every cycle once the
pipeline is full)
- Ex: If the cycle time of the machine is 2 nsec, it takes 10 nsec for an
instruction to progress through a 5-stage pipeline (the latency), and the
rate of processing is 500 MIPS.
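The two formulas can be checked with a tiny calculation (a sketch; the function names are mine, the numbers come from the example above):

```python
def latency_ns(n_stages, cycle_ns):
    # Time for one instruction to traverse all n stages: nT nsec.
    return n_stages * cycle_ns

def bandwidth_mips(cycle_ns):
    # One instruction completes per cycle once the pipeline is full:
    # 1 instruction per T nsec = 1000/T million instructions per second.
    return 1000 / cycle_ns

print(latency_ns(5, 2))    # 10 nsec
print(bandwidth_mips(2))   # 500.0 MIPS
```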
- Superscalar architectures: two pipelines at once.
Figure 2-5: Dual 5-stage pipelines with a common instruction fetch unit.
- A single instruction fetch unit fetches pairs of instructions
together and puts each one into its own pipeline, with its own ALU for parallel
operation.
- To be able to run in parallel, the two instructions must not
conflict over resource usage (registers) and neither must depend upon
the result of the other.
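That issue rule can be sketched as a check over two 3-address instructions (the tuple format and register names here are hypothetical, for illustration only):

```python
# Hypothetical instruction format: (dest, src1, src2),
# e.g. ("r1", "r2", "r3") meaning r1 = r2 op r3.
def can_issue_together(first, second):
    """True if the second instruction may run in parallel with the first:
    it must not read the first's result (data dependence), and the two must
    not write the same register (resource conflict)."""
    f_dest, _, _ = first
    s_dest, s_src1, s_src2 = second
    depends = f_dest in (s_src1, s_src2)
    conflicts = f_dest == s_dest
    return not (depends or conflicts)

print(can_issue_together(("r1", "r2", "r3"), ("r4", "r5", "r6")))  # True
print(can_issue_together(("r1", "r2", "r3"), ("r4", "r1", "r6")))  # False: reads r1
```

Real superscalar hardware makes this decision with dedicated issue logic; the check itself is the same idea.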
- Intel 486: one pipeline
Pentium: two 5-stage pipelines roughly like Figure 2-5
- Figure 2-6: Superscalar architecture: a single pipeline with multiple functional units.
- PROCESSOR-LEVEL PARALLELISM: SIMD and MIMD
(see chart on p.552)
- Pipelining achieves speed increases of factors of five or ten. To get gains
of 50, 100, or more, computers are designed with multiple CPUs.
- Array processor (SIMD): large number of identical processors that perform
the same sequence of instructions on different sets of data.
- SIMD -> Single Instruction stream Multiple Data stream
- Fig. 2-7: array processor of the ILLIAC IV (1972) type
- Vector processor (SIMD) - e.g., the Cray computers. Very much like an array
processor, but all the addition operations are performed in a single,
heavily pipelined adder; the operands are held in vector registers.
- Array processor vs. vector processor:
- Array processor: has as many adders as there are elements in the vector.
- Vector processor: a single, pipelined adder; a vector register (a set of
conventional registers that can be loaded from memory in a single
instruction) supplies the operands.
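The SIMD idea, one instruction stream applied to many data elements at once, can be sketched in plain Python (each list position stands for one lane of the array processor):

```python
def simd_add(a, b):
    """Conceptually one adder per element: every position adds 'at once'."""
    assert len(a) == len(b)
    return [x + y for x, y in zip(a, b)]

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

A vector processor would produce the same result, but by streaming the element pairs through one pipelined adder instead of using a separate adder per element.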
- Multiprocessors: MIMD: Multiple Instruction stream Multiple Data stream
- The processing elements in an array processor are not independent CPUs;
there is only one control unit shared among all of them.
- Multiprocessor: a system with more than one CPU sharing a common
memory, like a group of people in a room sharing a common blackboard. Since
each CPU can read or write any part of memory, they must coordinate (in
software) to avoid getting in each other's way.
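A minimal sketch of that software coordination, using Python threads and a lock around a shared counter (the "blackboard"); this illustrates the idea, not how real multiprocessor hardware is programmed:

```python
import threading

shared = {"counter": 0}     # the common "blackboard" every thread can touch
lock = threading.Lock()

def worker(increments):
    for _ in range(increments):
        with lock:          # coordinate so updates don't trample each other
            shared["counter"] += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])    # 40000: the lock prevents lost updates
```

Without the lock, two workers could read the same old value and both write back value+1, losing one of the two increments.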
- Fig. 2-8a: A single-bus multiprocessor: a single bus with multiple CPUs
and one memory all plugged into it.
- Conflicts: with a large number of fast processors constantly trying to
access memory over the same bus, conflicts will result.
- Fig. 2-8b: Each processor has some local memory of its own, not accessible
to the others. This memory can be used for program code and those data items
that need not be shared.
- Access to the private memory does not use the main bus, reducing bus
traffic.
- Other schemes, such as caching, are also possible.
- Example of multiprocessor use: analyzing a cancer cell - a single digitized
photograph is kept in common memory, with each processor assigned a different
area of the photo to analyze.
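Splitting one shared photograph among processors, as in that example, can be sketched as a row partition (the function and its interface are hypothetical):

```python
def partition_rows(num_rows, num_cpus):
    """Assign each CPU a contiguous band of image rows; any remainder rows
    go to the first CPUs, so bands differ in size by at most one row."""
    base, extra = divmod(num_rows, num_cpus)
    bands, start = [], 0
    for cpu in range(num_cpus):
        size = base + (1 if cpu < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands

print(partition_rows(1000, 4))  # [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

Each CPU then reads and analyzes only its own band of the shared image.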
- Small multiprocessors (<= 64 CPUs) are relatively easy to build; large
multiprocessors are difficult, because connecting all the processors to
memory is hard.
- Multicomputers: large numbers of interconnected computers, each having its
own private memory; there is no common memory.
- Multicomputers communicate by sending each other messages.
- Topologies such as 2D and 3D grids, trees, and rings are used.
- Messages from one computer to another often must pass through one or more
intermediate computers or switches to get from the source to the destination.
- Message-passing times are on the order of a few microseconds.
- Multicomputers with nearly 10,000 CPUs have been built.
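Message passing between nodes that share no memory can be sketched with per-node mailboxes (a toy model in Python; real multicomputers use interconnect hardware and switches):

```python
from queue import Queue

# Each "computer" owns private state and a mailbox; the ONLY way to
# communicate is to put a message in another node's mailbox.
mailbox = {0: Queue(), 1: Queue()}

def node(my_id, peer_id, value=None):
    if value is not None:
        mailbox[peer_id].put(("request", value))   # send
    else:
        tag, v = mailbox[my_id].get()              # receive
        mailbox[peer_id].put(("reply", v + 1))     # send the answer back

node(0, 1, value=41)     # node 0 asks node 1 to increment 41
node(1, 0)               # node 1 receives the request and replies
print(mailbox[0].get())  # ('reply', 42)
```

Note that neither node ever touches the other's state directly; all coordination happens through the messages themselves.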
- Multiprocessors are easy to program; multicomputers are easy to build.
- Much research focuses on designing hybrid systems that combine the good
properties of each.