What makes a system little-endian or big-endian?

kev · Feb 11, 2012

I'm confused about the byte order of a system/CPU/program,
so I need to ask a few questions to clear things up.

Question 1

If I only use type char in my C++ program:

int main()
{
    char c = 'A';
    const char* s = "XYZ";
}

Then I compile this program to an executable binary file called a.out.
Can a.out both run on little-endian and big-endian systems?

Question 2

If my Windows XP system is little-endian, can I install a big-endian Linux system in VMWare/VirtualBox? What makes a system little-endian or big-endian?

Question 3

If I want to write a byte-order-independent C++ program, what do I need to take into account?

Answer

Nicol Bolas · Feb 11, 2012

Can a.out both run on little-endian and big-endian systems?

No, because pretty much any two CPUs that are so different as to have different endianness will not run the same instruction set. C++ isn't Java; you don't compile to an intermediate bytecode that is later compiled or interpreted. You compile to the assembly for a specific CPU. And endianness is part of the CPU.

But that's outside of endian issues. You can compile that program for different CPUs and those executables will work fine on their respective CPUs.

What makes a system little-endian or big-endian?

As far as C or C++ is concerned, the CPU. Different processing units in a computer can actually have different endianness (the GPU could be big-endian while the CPU is little-endian), but that's somewhat uncommon.

If I want to write a byte-order independent C++ program, what do I need to take into account?

As long as you play by the rules of C or C++, you don't have to care about endian issues.

Of course, you also won't be able to load files directly into POD structs. Or read a series of bytes, pretend it is a series of unsigned shorts, and then process it as a UTF-16-encoded string. All of those things step into the realm of implementation-defined behavior.

There's a difference between "undefined" and "implementation-defined" behavior. When the C and C++ specs say something is "undefined", it basically means all manner of brokenness can ensue. If you keep doing it (and your program doesn't crash), you could get inconsistent results. When they say that something is defined by the implementation, you will get consistent results for that implementation.

If you compile for x86 in VC2010, what happens when you pretend a byte array is an unsigned short array (i.e., unsigned char *byteArray = ...; unsigned short *usArray = (unsigned short*)byteArray) is defined by the implementation. When compiling for big-endian CPUs, you'll get a different answer than when compiling for little-endian CPUs.

In general, endian issues are things you can localize to input/output systems: networking, file reading, and so on. They should be taken care of at the boundaries of your codebase.