Why were bitwise operations slightly faster than addition/subtraction operations on older microprocessors?

Vilhelm Gray picture Vilhelm Gray · Mar 27, 2013 · Viewed 16.6k times · Source

I came across this excerpt today:

On most older microprocessors, bitwise operations are slightly faster than addition and subtraction operations and usually significantly faster than multiplication and division operations. On modern architectures, this is not the case: bitwise operations are generally the same speed as addition (though still faster than multiplication).

I'm curious about why bitwise operations were slightly faster than addition/subtraction operations on older microprocessors.

All I can think of that would cause the latency is that the circuits to implement addition/subtraction depend on several levels of logic gates (parallel adders and whatnot), whereas the bitwise operations have far simpler circuit implementations. Is this the reason?

I know arithmetic and bitwise operations both execute within one clock-cyle on modern processors, but speaking purely about propagation time for the circuit, is the latency still theoretically there in modern processors?

Finally, I had a conceptual C question about the execution of the bitwise shift operation:

unsigned x = 1;
x <<= 5;

unsigned y = 0;
y += 32;

Both x and y should hold the value 32, but did it take 5 separate left shifts to get x to that value (as in are bitwise shifts implemented via pipes)? To clarify, I'm asking purely about the circuit behavior not the number of clock cycles.

Answer

Eric Postpischil picture Eric Postpischil · Mar 27, 2013

In any binary bitwise operation, each output bit depends only on the two corresponding bits in the inputs. In an add operation, each output bit depends on the corresponding bits in the inputs and all the bits to the right (toward lower values).

For example, the leftmost bit of 01111111 + 00000001 is 1, but the leftmost bit of 01111110 + 00000001 is 0.

In its simplest form, an adder adds the two low bits and produces one output bit and a carry. Then the next two lowest bits are added, and the carry is added in, producing another output bit and another carry. This repeats. So the highest output bit is at the end of a chain of adds. If you do the operation bit by bit, as older processors did, then it takes time to get to the end.

There are ways to speed this up some, by feeding several input bits into more complicated logic arrangements. But that of course requires more area in the chip and more power.

Today‘s processors have many different units for performing various sorts of work—loads, stores, addition, multiplication, floating-point operations, and more. Given today’s capabilities, the work of doing an add is small compared to other tasks, so it fits within a single processor cycle.

Perhaps in theory you could make a processor that did a bitwise operation more quickly than an add. (And there are, at least on paper, exotic processors that operate asynchronously, with different units doing work at their own paces.) However, with the designs in use, you need some regular fixed cycle to coordinate many things in the processor—loading instructions, dispatching them to execution units, sending results from execution units to registers, and much, much more. Some execution units do require multiple cycles to complete their jobs (e.g., some floating-point units take about four cycles to do a floating-point add). So you can have a mix. However, with current scales, making the cycle time smaller so that it fits a bitwise operation but not an add is likely not economical.