REPNZ SCAS Assembly Instruction Specifics

Michael Scott · Nov 6, 2014

I am trying to reverse engineer a binary, and the following instruction is confusing me. Can anyone clarify exactly what it does?

=>0x804854e:    repnz scas al,BYTE PTR es:[edi]
  0x8048550:    not    ecx

Where:

EAX: 0x0
ECX: 0xffffffff
EDI: 0xbffff3dc ("aaaaaa\n")
ZF:  1

I can see that ECX is decremented by 1 on each iteration and that EDI advances along the string. I know the code calculates the length of the string, but I'm not sure exactly how it does so, or why "al" is involved.

Answer

QuasarDonkey · Nov 8, 2014

I'll try to explain it by reversing the code back into C.

Intel's Instruction Set Reference (Volume 2 of the Software Developer's Manual) is invaluable for this kind of reverse engineering.

REPNE SCASB

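REPNZ is another mnemonic for REPNE, and scas al,BYTE PTR es:[edi] is the explicit-operand spelling of SCASB: it compares AL with the byte at ES:EDI, which is why al appears in the disassembly. The logic for REPNE and SCASB combined: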

while (ecx != 0) {
    temp = al - *(BYTE *)edi;
    SetStatusFlags(temp);
    if (DF == 0)   // DF = Direction Flag
        edi = edi + 1;
    else
        edi = edi - 1;
    ecx = ecx - 1;
    if (ZF == 1) break;
}

Or more simply:

while (ecx != 0) {
    ZF = (al == *(BYTE *)edi);
    if (DF == 0)
        edi++;
    else
        edi--;
    ecx--;
    if (ZF) break;
}
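
The simplified loop can be turned into a small, compilable emulation. This is only a sketch: the Regs struct and the repne_scasb function are hypothetical names made up for illustration, and a plain C pointer stands in for ES:EDI.

```c
#include <stdint.h>

/* Hypothetical register file: only the state REPNE SCASB touches. */
typedef struct {
    uint32_t ecx;        /* remaining iteration count */
    uint8_t al;          /* byte being searched for */
    const uint8_t *edi;  /* scan pointer (stands in for ES:EDI) */
    int zf;              /* zero flag: 1 when the last compare matched */
    int df;              /* direction flag: 0 = forward, 1 = backward */
} Regs;

/* Emulates REPNE SCASB: scan until AL is found or ECX reaches zero. */
static void repne_scasb(Regs *r)
{
    while (r->ecx != 0) {
        r->zf = (r->al == *r->edi);   /* compare AL with the current byte */
        r->edi += r->df ? -1 : 1;     /* step the pointer according to DF */
        r->ecx--;                     /* count this comparison */
        if (r->zf)
            break;                    /* stop once a match was found */
    }
}
```

With the question's values (AL = 0, ECX = 0xffffffff, EDI pointing at "aaaaaa\n"), the emulated scan makes eight comparisons, seven characters plus the terminator, leaving ECX at 0xfffffff7 and EDI one byte past the terminator.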

String Length

However, the above is insufficient to explain how it computes the length of a string. Based on the presence of the not ecx in your question, I'm assuming the snippet belongs to this idiom (or similar) for computing string length using REPNE SCASB:

sub ecx, ecx
sub al, al
not ecx
cld
repne scasb
not ecx
dec ecx

Translating to C and using our logic from the previous section, we get:

ecx = (unsigned)-1;
al = 0;
DF = 0;
while (ecx != 0) {
    ZF = (al == *(BYTE *)edi);
    if (DF == 0)
        edi++;
    else
        edi--;
    ecx--;
    if (ZF) break;
}
ecx = ~ecx;
ecx--;

Simplifying using al = 0 and DF = 0:

ecx = (unsigned)-1;
while (ecx != 0) {
    ZF = (0 == *(BYTE *)edi);
    edi++;
    ecx--;
    if (ZF) break;
}
ecx = ~ecx;
ecx--;

Things to note:

  • in two's complement notation, flipping the bits of ecx is equivalent to -1 - ecx.
  • in the loop, ecx is decremented before the loop breaks, so it decrements by length(edi) + 1 in total.
  • in practice ecx never reaches zero in the loop, since the string would have to occupy the entire 4 GiB address space.

So after the loop above, ecx contains -1 - (length(edi) + 1), which is the same as -(length(edi) + 2). Flipping the bits gives length(edi) + 1, and the final decrement gives length(edi).
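
To check the arithmetic with concrete numbers, here is a short sketch using 32-bit unsigned values. The helper names ecx_after_scan and recover_length are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* ECX after REPNE SCASB has scanned a string of length n plus its
   terminator, starting from ECX = 0xFFFFFFFF (that is, -1). */
static uint32_t ecx_after_scan(uint32_t n)
{
    return 0xFFFFFFFFu - (n + 1u);
}

/* The trailing "not ecx" / "dec ecx" pair from the idiom. */
static uint32_t recover_length(uint32_t ecx)
{
    return ~ecx - 1u;
}
```

For the seven-character string "aaaaaa\n" from the question, ecx_after_scan(7) is 0xfffffff7, flipping the bits gives 8, and the decrement gives 7.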

Or rearranging the loop and simplifying:

const char *s = edi;
size_t c = (size_t)-1;      // c == -1 (all bits set)
while (*s++ != '\0') c--;   // c == -1 - length(edi)
c = ~c;                     // c == length(edi)

And inverting the count:

size_t c = 0;
while (*s++ != '\0') c++;

which is the strlen function from C:

size_t strlen(const char *s) {
    size_t c = 0;
    while (*s++ != '\0') c++;
    return c;
}