I want to delay an ARM Cortex M0+ for n cycles, without using a timer, with the smallest possible code size. (I think this mandates use of assembly.)
A delay of 0 cycles is simply no code. A delay of 1 cycle is a single NOP. A delay of 2 cycles is two NOPs.
At what point is it (code-size) efficient to start looping?
How many cycles does the tightest possible loop take? What is the setup time?
Post answer notes:
The following C code:
register unsigned char counter = 100;
while (counter-- > 0) {
    asm("");
}
when compiled with gcc and -O3 gives:
mov r3, #100
.L5:
sub r3, r3, #1
uxtb r3, r3
cmp r3, #0
bne .L5
This either illustrates that there is still purpose in hand-coding ARM assembly, or (much more likely) that the C code above is not the best way of conveying to the compiler what I want to do.
Comments?
The code is going to depend on exactly what n is, and whether it needs to be dynamically variable, but given the M0+ core's instruction timings, establishing bounds for a particular routine is pretty straightforward.
For the smallest possible (6-byte) complete loop with a fixed 8-bit immediate counter:
    movs r0, #NUM     @ 1 cycle
1:  subs r0, r0, #1   @ 1 cycle
    bne 1b            @ 2 cycles if taken, 1 otherwise
With NUM=1 we get a minimum of 3 cycles, plus 3 cycles for every extra iteration, up to 765 cycles at NUM=255 (of course, you could have 2^32 iterations from NUM=0, but that seems a bit silly). That puts the lower bound for a loop being practical at about 6 cycles: each Thumb NOP is 2 bytes, so up to 5 cycles a plain run of NOPs is no bigger than the loop plus padding, whereas 6 cycles costs 12 bytes of NOPs against 6 bytes for the loop. With a fixed loop it's easy to pad NOPs (or even nested loops) inside it to lengthen each iteration, and before/after to align to a non-multiple of the loop length.
If you can arrange for the number of iterations to be ready in a register before you need to start waiting, then you can lose the initial movs and get a delay of 3*NUM - 1 cycles, i.e. pretty much any multiple of 3, minus one. If you need single-cycle resolution for a variable delay, the initial setup cost is going to be somewhat higher to correct for the remainder (a computed branch into a NOP sled is what I'd do for that).
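To make the arithmetic above concrete, here is a minimal host-side sketch in C (it runs on a PC, not on the target). It steps through the movs/subs/bne loop using the cycle costs quoted above, and splits a fixed delay of n cycles into loop iterations plus padding NOPs. The names loop_cycles, plan_delay and struct delay_plan are made up for this illustration, and the model assumes zero wait states and no interrupts.

```c
#include <assert.h>
#include <stdint.h>

/* Cycle and size figures taken from the answer text (Cortex-M0+,
   zero wait states assumed). */
enum { MOVS_CYCLES = 1, SUBS_CYCLES = 1, BNE_TAKEN = 2, BNE_NOT_TAKEN = 1 };
enum { LOOP_BYTES = 6, NOP_BYTES = 2, MAX_NUM = 255 };

/* Step through "movs; 1: subs; bne 1b" and count cycles.
   num must be 1..255 (the range of the 8-bit immediate, excluding 0). */
static uint32_t loop_cycles(uint32_t num)
{
    assert(num >= 1 && num <= MAX_NUM);
    uint32_t r0 = num;
    uint32_t cycles = MOVS_CYCLES;
    do {
        r0 -= 1;
        cycles += SUBS_CYCLES;
        cycles += (r0 != 0) ? BNE_TAKEN : BNE_NOT_TAKEN;
    } while (r0 != 0);
    return cycles;                  /* works out to 3 * num */
}

/* Hypothetical planner for a fixed delay of n cycles. */
struct delay_plan {
    uint32_t num;    /* loop iterations, 0 means "no loop, NOPs only" */
    uint32_t nops;   /* padding NOPs placed before or after the loop */
    uint32_t bytes;  /* total code size of the delay */
};

static struct delay_plan plan_delay(uint32_t n)
{
    struct delay_plan p;
    p.num = n / 3;
    if (p.num > MAX_NUM)
        p.num = MAX_NUM;            /* a nested loop would do better here */
    p.nops = n - 3 * p.num;
    p.bytes = LOOP_BYTES + p.nops * NOP_BYTES;

    /* Fall back to plain NOPs when the loop is not strictly smaller. */
    if (p.num == 0 || p.bytes >= n * NOP_BYTES) {
        p.num = 0;
        p.nops = n;
        p.bytes = n * NOP_BYTES;
    }
    return p;
}
```

Under these assumptions plan_delay(5) still picks five NOPs (10 bytes either way), while plan_delay(6) switches to the loop with NUM=2 at 6 bytes, which matches the break-even point of about 6 cycles.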
I'm assuming that if you're at the point of cycle-critical timing you've already got interrupts off (otherwise throw in another cycle somewhere for CPSID), and that you don't have any bus wait states adding extra cycles to instruction fetches.
As for trying to do it in C: the fact that you have to hack in an empty asm to keep the "useless" loop from being optimised away is a tip-off. The abstract C machine has no notion of "instructions" or "cycles", so there is simply no way to reliably express this in the language. Trying to rely on particular C constructs to compile to suitable instructions is extremely fragile: change a compiler flag, upgrade the compiler, or change some distant code which affects register allocation (which in turn affects instruction selection), and the generated code can change unexpectedly. So I'd say hand-coded assembly is the only sensible approach for cycle-accurate code.