Processor Instruction Cycle Execution Time

bunkerdive · Aug 14, 2013 · Viewed 19.1k times
  • My guess is that the __no_operation() intrinsic (an ARM NOP instruction) should take 1/(168 MHz) to execute, provided that each NOP executes in one clock cycle. I would like to verify this via documentation.

  • Is there a standard location for information regarding the instruction cycle execution time for a processor? I am trying to determine how long an STM32F407IGH6 processor running at 168 MHz should take to execute a NOP instruction.

  • Some processors require multiple oscillations per instruction cycle; others are 1-to-1 between clock cycles and instruction cycles.

  • The term "instruction cycle" does not appear anywhere in the datasheet provided by STMicro, nor in their programming manual (which lists the processor's instruction set). The 8051 documentation, however, clearly defines its instruction cycle execution times, in addition to its machine cycle characteristics.
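For reference, the arithmetic behind the 1/(168 MHz) guess can be sketched in Python. This is only the ideal case: it assumes one NOP per clock cycle, which is exactly the part that needs verifying. The function names are made up for illustration.

```python
CPU_HZ = 168_000_000  # core clock from the question


def cycle_time_ns(cpu_hz: int) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / cpu_hz


def duration_ns(cycles: int, cpu_hz: int) -> float:
    """Time taken by a given number of clock cycles, in nanoseconds."""
    return cycles * cycle_time_ns(cpu_hz)


if __name__ == "__main__":
    # Ideal case only: one NOP per clock cycle (an assumption, not a documented fact).
    print(f"one cycle at 168 MHz: {cycle_time_ns(CPU_HZ):.3f} ns")
```

So under the single-cycle assumption a NOP would take a little under 6 ns.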

Answer

old_timer · Aug 14, 2013

ALL instructions require more than one clock cycle to execute: fetch, decode, execute. If you are running on an STM32, you are likely taking several clocks per fetch just due to the slowness of the PROM (flash); if running from RAM, who knows whether it runs at 168 MHz or slower. The ARM buses generally take a number of clock cycles to do anything.
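To get a feel for why slow memory matters, here is a deliberately crude toy model in Python. It assumes every memory access costs one cycle plus some wait states and delivers a fixed number of instructions, and it ignores prefetch buffers, caches, and branches entirely. The function name and the parameter values are hypothetical, not taken from the STM32F407 documentation.

```python
def fetch_bound_cycles_per_instruction(wait_states: int,
                                       instructions_per_fetch: int) -> float:
    """Toy estimate of average cycles per instruction when fetch is the bottleneck.

    Assumes each fetch costs (1 + wait_states) cycles and that the core
    cannot complete more than one instruction per cycle.
    """
    cycles_per_fetch = 1 + wait_states
    return max(1.0, cycles_per_fetch / instructions_per_fetch)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the trend.
    for ws, width in ((0, 1), (5, 1), (5, 2), (5, 8)):
        cpi = fetch_bound_cycles_per_instruction(ws, width)
        print(f"wait states={ws}, instructions per fetch={width}: {cpi:.1f} cycles")
```

The point is only that the same instruction stream can cost anywhere from one to several clocks per instruction depending on how the memory is set up.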

Nobody talks about instruction cycles anymore because they are not deterministic. The answer is always "it depends".

It may take X hours to build a single car, but if you start building a car, then 30 seconds later start building another, and keep starting another every 30 seconds, then after X hours you will have a new car coming off the line every 30 seconds. Does that mean it takes 30 seconds to make a car? Of course not. But it does mean that once up and running you can average a new car every 30 seconds on that production line.

That is exactly how processors work. It takes a number of clocks per instruction to run, but you pipeline them so that many are in the pipe at once. The result is that the core, if fed the right instructions one per clock, can complete those instructions one per clock on average. With branching and slow memory/ROM, you can't even expect to get that.
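The production line argument can be put into numbers with a small Python sketch. It models an ideal pipeline with no stalls, using a depth of 3 to match the fetch, decode, execute stages named above. The function names are made up for illustration.

```python
def total_cycles(n_instructions: int, pipeline_depth: int) -> int:
    """Cycles to finish n instructions in an ideal, stall-free pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs the full depth; every later one
    # completes one cycle after the one before it.
    return pipeline_depth + (n_instructions - 1)


def average_cycles(n_instructions: int, pipeline_depth: int) -> float:
    """Average cycles per instruction over the whole run."""
    return total_cycles(n_instructions, pipeline_depth) / n_instructions


if __name__ == "__main__":
    for n in (1, 10, 1000, 100000):
        print(f"{n:>6} instructions: {average_cycles(n, 3):.5f} cycles each on average")
```

A single instruction still takes 3 cycles from start to finish, but the average tends toward one per cycle as the run gets longer. Real hardware only approaches that when nothing stalls the pipe.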

If you want to do an experiment on your processor, make a loop with a few hundred NOPs:

beg = read timer
load r0 = 100000
top:
  nop
  nop
  nop
  nop
  nop
  nop
  ...
  nop
  nop
  nop
  r0 = r0 - 1
  bne top
end = read timer
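Typing a few hundred NOPs by hand is tedious, so one option is to generate the loop body. The Python sketch below emits assembly text for the loop, assuming GNU assembler unified Thumb syntax. The function name is hypothetical, and the timer reads are left out because they depend on which timer you use.

```python
def nop_loop_source(nops_per_loop: int = 200, loops: int = 100_000) -> str:
    """Return assembly text for a counted loop containing nops_per_loop NOPs.

    Assumes GNU assembler syntax ('@' starts a comment on ARM targets).
    """
    lines = [
        "    .syntax unified",
        "    .thumb",
        f"    ldr   r0, ={loops}    @ loop counter",
        "top:",
    ]
    lines += ["    nop"] * nops_per_loop
    lines += [
        "    subs  r0, r0, #1    @ decrement and set flags",
        "    bne   top           @ repeat until r0 reaches zero",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(nop_loop_source(nops_per_loop=8, loops=100_000))
```

Paste the output between your own timer reads. The more NOPs per pass, the less the decrement and branch distort the average.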

If the loop completes in a fraction of a second, either increase the number of NOPs or run an order of magnitude more loops. What actually matters is the number of timer ticks, not seconds or minutes on a wall clock: you want the measurement to span a good-sized number of ticks.
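Why the tick count matters: the start and end readings are only good to about one tick, so the relative uncertainty is roughly one over the number of ticks elapsed. A rough Python sketch of that rule of thumb (the function name is made up, and this ignores every other error source):

```python
def tick_quantization_error(elapsed_ticks: int) -> float:
    """Approximate relative error from timer granularity alone (about 1 tick)."""
    if elapsed_ticks <= 0:
        raise ValueError("elapsed_ticks must be positive")
    return 1.0 / elapsed_ticks


if __name__ == "__main__":
    for ticks in (10, 1_000, 100_000):
        err = tick_quantization_error(ticks)
        print(f"{ticks:>7} ticks -> up to about {err:.3%} error")
```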

Then do the math and compute the average.
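The math is just a unit conversion, sketched here in Python. It assumes the timer frequency and core clock are both known, and it counts the decrement and branch as two extra instructions per pass. All names are hypothetical and the numbers in the example call are made up, not measured.

```python
def average_cycles_per_instruction(elapsed_ticks: int,
                                   timer_hz: float,
                                   cpu_hz: float,
                                   loops: int,
                                   nops_per_loop: int,
                                   overhead_instructions: int = 2) -> float:
    """Average core clock cycles per instruction over the whole test loop."""
    elapsed_seconds = elapsed_ticks / timer_hz
    cpu_cycles = elapsed_seconds * cpu_hz
    # Each pass runs the NOPs plus the loop overhead (decrement and branch).
    instructions = loops * (nops_per_loop + overhead_instructions)
    return cpu_cycles / instructions


if __name__ == "__main__":
    # Made-up numbers purely to show the call; not a measurement.
    cpi = average_cycles_per_instruction(
        elapsed_ticks=250_000, timer_hz=1_000_000, cpu_hz=168_000_000,
        loops=100_000, nops_per_loop=200)
    print(f"average cycles per instruction: {cpi:.2f}")
```

Multiply the result by the clock period to get an average time per instruction for that particular configuration.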

Repeat the experiment with the program sitting in RAM instead of ROM.

Slow the processor clock down to the fastest speed that does not require a flash divisor, then repeat running from flash.

Since this is a Cortex-M4, turn the I cache on, then repeat using flash and repeat using RAM (at 168 MHz).

If you didn't get a range of different results from all of these experiments using the same test loop, you are probably doing something wrong.