EE380 Mapping ISA Instructions Into RTL Operations

Let's consider a few simple example instructions.

Consider the MIPS instruction add $8,$2,$5, which, when executed, should make register $8 = ($2 + $5). The 32-bit pattern encoding this instruction is:

op      rs      rt      rd      shamt   funct
0       2       5       8       0       32

The simulator would allow you to specify this Add instruction's bit pattern as (rs(2)+rt(5)+rd(8)+funct(32)), which is just a pretty way to say 0x00454020. (Note that 0x is used to prefix a hexadecimal value, just as in the C language.)
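
As a sanity check, here is a minimal C sketch that packs the fields into that bit pattern; the field widths and positions are the standard MIPS R-type ones (op:6, rs:5, rt:5, rd:5, shamt:5, funct:6), and the function name is just for illustration:

#include <inttypes.h>
#include <stdio.h>

/* Pack a MIPS R-type instruction from its fields. */
static uint32_t rtype(uint32_t op, uint32_t rs, uint32_t rt,
                      uint32_t rd, uint32_t shamt, uint32_t funct)
{
    return (op << 26) | (rs << 21) | (rt << 16) |
           (rd << 11) | (shamt << 6) | funct;
}

int main(void)
{
    /* add $8,$2,$5: op=0, rs=2, rt=5, rd=8, shamt=0, funct=32 */
    printf("0x%08" PRIx32 "\n", rtype(0, 2, 5, 8, 0, 32)); /* 0x00454020 */
    return 0;
}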

What Register Transfer Level (RTL) operations would we need to perform in order to get this work done?

First, let's think about the "guts" of the operation: the add. In order to add, we have to get the appropriate values at the ALU inputs, tell the ALU to add, latch the result, and copy the latched result into the destination register. That work can be done by:

  1. Get the left source operand into Y:
    SELrs, REGout, Yin
  2. Add right source operand:
    SELrt, REGout, ALUadd, Zin
  3. Put result in destination register:
    Zout, SELrd, REGin
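
In C terms, the effect of those three states is roughly the following sketch, where reg[], y, and z are merely stand-ins for the register file and the Y and Z latches:

/* A rough C model of the three Add states. */
static void add_rtl(unsigned reg[32], int rs, int rt, int rd)
{
    unsigned y, z;
    y = reg[rs];        /* 1: SELrs, REGout, Yin         */
    z = y + reg[rt];    /* 2: SELrt, REGout, ALUadd, Zin */
    reg[rd] = z;        /* 3: Zout, SELrd, REGin         */
}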

That's all well and good, but how did the processor know it was supposed to add? In order to know that, it must first fetch the instruction from main memory (or a cache) and decode it:

  1. Fetch the Add instruction from memory:
    PCout, MARin, MEMread
  2. Wait for memory to respond:
    UNTILmfc
  3. Copy the instruction into the IR:
    MDRout, IRin
  4. Decode the op by jumping to the appropriate RTL set:
    JUMPonop

Let's consider that last operation more closely. How does JUMPonop know how to decode the instruction? Well, for our simulator, the answer is quite simple: you tell it. Basically, you label each location that you might want JUMPonop to jump to, and you specify when each label is to be the target. The when specifications for the simulator are very straightforward, giving mask and match values: if ((IR & mask) == match), then label is the next state. So, we could specify the add decode as:

when (op(0x3f)+funct(0x3f)) funct(32) Add

Basically, this says that when the op and funct fields of the IR are, correspondingly, 0 and 32, we are looking at a MIPS add instruction and JUMPonop should go to the label Add.
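
In C terms, and assuming the standard R-type field positions, that when-clause amounts to a mask/match test like this sketch:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint32_t ir    = 0x00454020;             /* add $8,$2,$5            */
    uint32_t mask  = (0x3fu << 26) | 0x3fu;  /* op and funct fields set */
    uint32_t match = 32;                     /* op == 0, funct == 32    */

    if ((ir & mask) == match)
        printf("JUMPonop goes to Add\n");
    return 0;
}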

Obviously, the above control sequence takes a minimum of 4 clock cycles to execute (1 tick per set of RTL operations). However, the UNTILmfc operation may cause state 2 above to be repeated many times if the memory read is relatively slow, so 4 is just the minimum number of ticks. The JUMPonop would then bring us to the set of RTL operations that implements the instruction (in our case, the RTL labeled "Add"). These first 4 states really have nothing to do with the opcode being Add -- they simply fetch the instruction. All instructions can share these first 4 states.

Thus, we now know how the processor will get to our add code. However, once the add is done, how will the processor get to the next instruction? One answer is to do this:

Start:  PCout, MARin, MEMread
        UNTILmfc
        MDRout, IRin
        JUMPonop

	/* other stuff goes here... */

Add:    SELrs, REGout, Yin
        SELrt, REGout, ALUadd, Zin
        Zout, SELrd, REGin
        /* increment PC by 4 and start next instruction */
        PCout, Yin
        CONST(4), ALUadd, Zin
        Zout, PCin, JUMP(Start)

Ok, that's a working solution. However, it would result in just about every instruction type ending with copies of those same last 3 states. Aside from saving some microprogram space, factoring out this sequence should allow us to take advantage of the fact that the PC value was already on the bus in the state labeled Start. It is important to note that incrementing the PC early is harmless because MIPS operations that store other values into the PC (e.g., jumps and branches) will simply overwrite the incremented value. Consider this clever interleaving of the next instruction sequence with the start of instruction RTL:

Start:  PCout, MARin, MEMread, Yin
        CONST(4), ALUadd, Zin, UNTILmfc
        MDRout, IRin
        Zout, PCin, JUMPonop

	/* other stuff goes here... */

Add:    SELrs, REGout, Yin
        SELrt, REGout, ALUadd, Zin
        Zout, SELrd, REGin, JUMP(Start)

We've just saved 3 clock ticks. One might be concerned that the UNTILmfc operation causes "CONST(4), ALUadd, Zin" to be repeated, perhaps many times. However, this is harmless: the exact same result is computed in each repeat of the RTL set, with no harmful side effects. In contrast, it would not necessarily be acceptable to repeat a state that contains MEMread, because issuing multiple MEMread strobe signals could confuse the memory system.

There are two lessons to be learned here:

  1. There are many ways to implement the same ISA
  2. The efficiency of an ISA implementation depends on the effectiveness of the control as well as the hardware used

Cycles Per Instruction (CPI)

Ok. So now we know how to execute an Add. How many clock cycles will it take to execute one instruction -- i.e., what is the CPI?

Clearly, each state (line of RTL) that is entered will take at least one clock cycle: 10 for our first version and 7 for the improved version. However, the UNTILmfc operation is essentially a repeat-until loop that keeps the processor in the same control state until memory has responded to a MEMread request. In our examples above, the UNTILmfc occurs in the state immediately after the MEMread is issued. Thus, if memory takes M clock cycles to respond, the line containing UNTILmfc will be repeated until memory responds, M clock cycles after the MEMread was issued, not just executed once. If M is 5, our examples respectively would yield 14 CPI and 11 CPI.
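
A minimal C sketch of this bookkeeping (the function and its parameters are illustrative, not part of the simulator):

#include <stdio.h>

/* Each state costs one tick; the UNTILmfc state repeats for whatever
   part of the M-cycle memory latency is not overlapped by useful
   states between the MEMread and the UNTILmfc. */
static int cycles(int states, int m, int overlap)
{
    int wait = m - 1 - overlap;  /* extra repeats of the UNTILmfc state */
    return states + (wait > 0 ? wait : 0);
}

int main(void)
{
    printf("%d\n", cycles(10, 5, 0)); /* first version:        14 CPI */
    printf("%d\n", cycles(7, 5, 0));  /* improved version:     11 CPI */
    printf("%d\n", cycles(7, 5, 1));  /* version shown below:  10 CPI */
    return 0;
}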

As we just demonstrated, merely rearranging RTL operations can significantly change the CPI. Basically, getting more work done per clock cycle is the trick, and that's largely a matter of using parallelism. What parallelism? Well, in the above example we achieved some speedup by repacking RTL operations to have the processor literally do more useful work in each state, but we also can improve parallelism by increasing the overlap period in which both the processor and memory are working:

Start:  PCout, MARin, MEMread, Yin
        CONST(4), ALUadd, Zin
        Zout, PCin, UNTILmfc
        MDRout, IRin, JUMPonop
        ...

Add:    SELrs, REGout, Yin
        SELrt, REGout, ALUadd, Zin
        Zout, SELrd, REGin, JUMP(Start)

By rearranging things a bit more, we did not reduce the total number of states used in describing the Add instruction, but we did increase overlap with the memory operation. In this version, the UNTILmfc is separated from the MEMread by an additional state. Thus, the first clock cycle of memory delay overlaps the execution of CONST(4), ALUadd, Zin, saving one clock cycle of UNTILmfc repetition. The result would be 10 CPI.

Of course, when we talk about performance of a processor, we have to consider CPI for all types of instructions, not just Add. Many modern processors do so much work in each clock cycle that the CPI is often a fraction less than 1; in such cases, we talk about IPC: Instructions Per Cycle.

Clock Speed

How fast can the processor clock be? The answer is determined by the propagation delays in the processor's circuits. Consider the state:

SELrt, REGout, ALUadd, Zin

Let's use times approximating what you would get physically building the above circuits using logic parts like those employed in a course like EE481. SELrt is running through a decoder, which might take 10ns. When the REGout control signal is set at the start of this clock cycle, it might take about 5ns for a value to propagate to the bus. Thus, some value is on the bus after 5ns, but it might take 10ns+5ns=15ns before the correct value is on the bus. Similarly, the ALU is adding incorrect values until 15ns; so a 30ns ALU would yield correct results at the 15ns+30ns=45ns mark. Zin happens at the end of the clock period, but might require its input to have been stable for 5ns in order to latch it. Thus, we get 45ns+5ns=50ns as the fastest allowable clock period, for a maximum clock frequency of 20MHz.

In the above example, all the operations were on the same path; in general, there can be multiple separate or branched circuit paths within a single clock cycle:

SELrt, REGout, ALUadd, Zin, MARin

This example has two paths: one is the SELrt->REGout->ALUadd->Zin path just described; the other is the SELrt->REGout->MARin path. In such a case, the slowest path determines the maximum clock frequency. In fact, because the same clock frequency is used for all states, the slowest path in any state is the critical path that determines the maximum clock frequency. Keep in mind that the slowest path might not be the path affected by the most control signals: if MARin took more than 35ns (which is admittedly very unlikely), then SELrt->REGout->MARin would be slower than SELrt->REGout->ALUadd->Zin.
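
Using the same assumed EE481-style part delays (the numbers are the estimates above, not measured values), the path arithmetic is easy to sketch in C:

#include <stdio.h>

int main(void)
{
    int sel = 10, regout = 5, alu = 30, latch = 5;  /* assumed delays, ns */

    int path_alu = sel + regout + alu + latch; /* SELrt->REGout->ALUadd->Zin: 50ns */
    int path_mar = sel + regout + latch;       /* SELrt->REGout->MARin:       20ns */

    /* The slowest path sets the clock period. */
    int period = (path_alu > path_mar) ? path_alu : path_mar;
    printf("period %dns, max clock %.0fMHz\n", period, 1000.0 / period);
    return 0;
}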

Obviously, using faster logic parts can shorten the critical path and thus permit a higher clock frequency. However, merely rearranging how RTL operations are allocated to states also might allow use of a higher clock frequency. Ideally, we want operations to be arranged so that all states have about the same slowest path time.

Clock Speed vs. Cycles Per Instruction

How might we rearrange things to increase the clock speed? Well, consider re-writing the above single state as two states:

SELrt, REGout, MARin
MARout, ALUadd, Zin

Using our EE481-based times, SELrt->REGout->MARin would take about 10ns+5ns+5ns=20ns and MARout->ALUadd->Zin would take about 5ns+30ns+5ns=40ns. Thus, if our original state was the clock limiting factor, we could now increase the clock frequency from 20MHz to 25MHz! Of course, the problem with that is that we now have 11 CPI instead of 10 CPI, so things didn't speed up as much as one might have hoped.... This is a very common problem: clock speed and CPI tend to trade off against each other such that improving one hurts the other. A good real-world example is the AMD Athlon vs. the Intel Pentium 4: the Athlon maximizes work done per clock at the expense of a slower clock, while the Pentium 4 trades a faster clock for getting less work done per tick.
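
The trade-off is easy to quantify: what matters is the product of CPI and clock period. A quick C check using the numbers above:

#include <stdio.h>

int main(void)
{
    /* time per instruction = CPI * clock period */
    printf("original: %dns\n", 10 * 50); /* 10 CPI at 20MHz: 500ns */
    printf("split:    %dns\n", 11 * 40); /* 11 CPI at 25MHz: 440ns */
    return 0;
}

The split version still wins (440ns vs. 500ns per instruction), but the speedup is about 1.14, not the 1.25 that the clock ratio alone would suggest.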

Suppose that we wanted to go even further in accelerating the clock? The ALU is probably the slowest thing in any pipeline, so why not try to make it faster? As we'll discuss later in the course, you can get some speedup by throwing more circuitry at the problem. However, you also can break up the operation of the ALU over multiple clock cycles. For example, instead of the above split of that original state into two, one could reduce the slowest path to a mere 10ns by something very much like:

SELrt
SELrt, REGout, ALUadd, MARin
SELrt, REGout, ALUadd
SELrt, REGout, ALUadd
SELrt, REGout, ALUadd, Zin

Here, SELrt completes in the first cycle. Thus, REGout can have a stable correct decoder value from the start of the second cycle, allowing REGout and MARin to complete -- along with the first 5ns of ALUadd delay. The third and fourth cycles just hold inputs steady for the adder to continue its work. Finally, halfway through the fifth cycle, the adder completes and Zin latches the result. Of course, dividing a 50ns path into 5 cycles with 10ns paths by itself doesn't improve the processor's performance at all, but you do get to brag about having a 100MHz clock.... If you're lucky, some other original states will turn into fewer than 5 states each, so you will get speedup on them.

Actually, the principle of subdividing long logic paths is not as scalable as it first appears. One purely analog problem is that a clock signal that must be distributed to many circuit components cannot be made arbitrarily fast. A more digital problem is that, as modules are subdivided, it may be necessary to latch the partial results at the end of each clock and re-issue them at the start of the next clock in order to avoid glitches. The portion of a module between latches is called a stage. When we carve up a 30ns module, we might get 5ns+5ns of latch time added for each stage. Thus, our original 30ns module might yield two 25ns pieces, three 20ns pieces, six 15ns pieces, etc. In fact, this is precisely the kind of thing pipelining does -- but pipelining also takes advantage of the fact that those latches allow the now-isolated stages to work on different data at the same time: parallel execution of portions of six operations for a six-stage pipeline. That's what the more modern processor architecture later in the course is all about....
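
A sketch of that latch-overhead arithmetic, assuming the 5ns+5ns latch figures above:

#include <stdio.h>

int main(void)
{
    int module = 30;        /* ns of logic to be subdivided        */
    int overhead = 5 + 5;   /* ns of latch cost added per stage    */
    int ks[] = {2, 3, 6};   /* stage counts from the text          */

    for (int i = 0; i < 3; ++i)
        printf("%d stages of %dns each\n", ks[i], module / ks[i] + overhead);
    return 0;
}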


EE380 Computer Organization and Design.