What do C and Assembler actually compile to?

lamas picture lamas · Jan 25, 2010 · Viewed 22k times · Source

So I found out that C(++) programs actually don't compile to plain "binary" (I may have gotten some things wrong here, in that case I'm sorry :D) but to a range of things (symbol table, os-related stuff,...) but...

  • Does assembler "compile" to pure binary? That means no extra stuff besides resources like predefined strings, etc.

  • If C compiles to something else than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean if the OS kernel, which is probably written in C, compiles to something different than plain binary - how does the bootloader handle it?

edit: I know that assembler doesn't "compile" because it only has your machine's instruction set - I didn't find a good word for what assembler "assembles" to. If you have one, leave it here as comment and I'll change it.

Answer

Norman Ramsey picture Norman Ramsey · Jan 26, 2010

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

  • Code and data appear in named "sections".

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

For example, if you compile and assemble (but don't link) this program

int main () { printf("Hello, world\n"); }

you are likely to wind up with a relocatable object file with

  • A text section containing the machine code for main

  • A label definition for main which points to the beginning of the text section

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.

If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.