I've been reading up on Google Protocol Buffers recently, which allow a variety of scalar value types to be used in messages.
According to the documentation, there are three variable-length integer primitives: int32, uint32, and sint32. The documentation notes that int32 is "Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead." But if a field never holds negative numbers, I assume that uint32 would be a better type to use than int32 anyway (due to the extra bit and the decreased CPU cost of processing negative numbers).
So when would int32 be a good scalar to use? Is the documentation implying that it's most efficient only when you rarely get negative numbers? Or is it always preferable to use sint32 or uint32, depending on the contents of the field?
(The same questions apply to the 64-bit versions of these scalars as well: int64, uint64, and sint64; I left them out of the problem description for readability's sake.)
I'm not familiar with Google Protocol Buffers, but my interpretation of the documentation is:
- uint32 if the value cannot be negative
- sint32 if the value is pretty much as likely to be negative as not (for some fuzzy definition of "as likely to be")
- int32 if the value could be negative, but that's much less likely than the value being positive (for example, if the application sometimes uses -1 to indicate an error or 'unknown' value and this is a relatively uncommon situation)

Here's what the docs have to say about the encodings (http://code.google.com/apis/protocolbuffers/docs/encoding.html#types):
there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on...
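To make the mapping in that quote concrete, here is a minimal Python sketch of the 32-bit ZigZag transform. The function names zigzag_encode32 and zigzag_decode32 are my own and are not part of any protobuf library; the shift-and-XOR form is the usual way to express the mapping, written under the assumption that the input fits in a signed 32-bit integer.

```python
def zigzag_encode32(n: int) -> int:
    """Map a signed 32-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    if not -(1 << 31) <= n < (1 << 31):
        raise ValueError("value out of range for a signed 32-bit integer")
    # Python ints are unbounded and >> is an arithmetic shift,
    # so mask the result back down to 32 bits.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z: int) -> int:
    """Inverse of zigzag_encode32 for values in the unsigned 32-bit range."""
    return (z >> 1) ^ -(z & 1)


if __name__ == "__main__":
    for value in (0, -1, 1, -2, 2):
        print(value, "->", zigzag_encode32(value))
```

Small magnitudes of either sign end up as small unsigned values, which is what lets the varint stay short.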
So it looks like even if your use of negative numbers is rare, as long as the magnitude of the numbers (including non-negative numbers) you're passing in the protocol is on the smaller side, you might be better off using sint32. If you're unsure, profiling would be in order.
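As a rough starting point for that kind of comparison, the sketch below estimates how many payload bytes a single value would take as int32 versus sint32. It is not taken from the protobuf library: the helper names are hypothetical, it counts only the varint payload (not the field tag), and it assumes, per the quoted docs, that a negative int32 is encoded as a ten-byte varint, i.e. as if sign-extended to 64 bits. For non-negative values, uint32 should match the int32 column, since both are written as plain varints.

```python
def varint_size(u: int) -> int:
    """Bytes needed to store a non-negative integer as a base-128 varint."""
    size = 1
    while u >= 0x80:
        u >>= 7
        size += 1
    return size


def int32_size(n: int) -> int:
    # Assumption: negative values are treated as very large unsigned
    # 64-bit integers, as the quoted documentation describes.
    return varint_size(n & 0xFFFFFFFFFFFFFFFF)


def sint32_size(n: int) -> int:
    # ZigZag-encode first, then measure the varint.
    return varint_size(((n << 1) ^ (n >> 31)) & 0xFFFFFFFF)


if __name__ == "__main__":
    for value in (-1, 1, 63, 64, 127, 300, -300):
        print(f"{value:>5}: int32={int32_size(value)} sint32={sint32_size(value)}")
```

Under these assumptions, -1 costs ten bytes as int32 and one byte as sint32, while a value such as 64 costs one byte as int32 and two as sint32, because ZigZag doubles the magnitude of non-negative values. Running the helpers over a sample of your real data should show which type wins overall.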