I saw a question on a math forum a while back where a person was discussing adding up the digits in a number over and over again until a single digit is achieved. (i.e. "362" would become "3+6+2" which would become "11"... then "11" would become "1+1" would would become "2" therefor "362" would return 2... I wrote some nice code to get an answer to this and posted it only to be outdone by a user who suggested that any number in modulo 9 is equal to this "infinite digit sum", I checked it an he was right... well almost right, if zero was returned you had to switch it out with a "9" but that was a very quick fix...
362 = 3+6+2 = 11 = 1+1 = 2
or...
362%9 = 2
Anways, the mod9 method works fantastic for infinitely adding the sum of the digits until you are left with just a single digit... but what about only doing it once (i.e. 362 would just return "11")... Can anyone think of fast algorithms?
There's a cool trick for summing the 1 digits in binary, and with a fixed-width integer. At each iteration, you separate out half the digits each into two values, bit-shift one value down, then add. First iteration, separate ever other digit. Second iteration, pairs of digits, and so on.
Given that 27 is 00011011 as 8-bit binary, the process is...
00010001 + 00000101 = 00010110 <- every other digit step
00010010 + 00000001 = 00010011 <- pairs of digits
00000011 + 00000001 = 00000100 <- quads, giving final result 4
You could do a similar trick with decimal, but it would be less efficient than a simple loop unless you had a direct representation of decimal numbers with fast operations to zero out selected digits and to do digit-shifting. So for 12345678 you get...
02040608 + 01030507 = 03071115 <- every other digit
00070015 + 00030011 = 00100026 <- pairs
00000026 + 00000010 = 00000036 <- quads, final result
So 1+2+3+4+5+6+7+8 = 36, which is correct, but you can only do this efficiently if your number representation is fixed-width decimal. It always takes lg(n) iterations, where lg means the base two logarithm, and you round upwards.
To expand on this a little (based on in-comments discussions), let's pretend this was sane, for a bit...
If you count single-digit additions, there's actually more work than a simple loop here. The idea, as with the bitwise trick for counting bits, is to re-order those additions (using associativity) and then to compute as many as possible in parallel, using a single full-width addition to implement two half-width additions, four quarter-width additions etc. There's significant overhead for the digit-clearing and digit-shifting operations, and even more if you implement this as a loop (calculating or looking up the digit-masking and shift-distance values for each step). The "loop" should probably be fully unrolled and those masks and shift-distances be included as constants in the code to avoid that.
A processor with support for Binary Coded Decimal (BCD) could handle this. Digit masking and digit shifting would be implemented using bit masking and bit shifting, as each decimal digit would be encoded in 4 (or more) bits, independent of the encoding of other digits.
One issue is that BCD support is quite rare these days. It used to be fairly common in the 8 bit and 16 bit days, but as far as I'm aware, processors that still support it now do so mainly for backward compatibility. Reasons include...
Very early processors didn't include hardware multiplication and division. Hardware support for these operations means it's easier and more efficient to convert binary to decimal now. Binary is used for almost everything now, and BCD is mostly forgotten.
There are decimal number representations around in libraries, but few if any high level languages ever provided portable support to hardware BCD, so since assembler stopped being a real-world option for most developers BCD support simply stopped being used.
As numbers get larger, even packed BCD is quite inefficiently packed. Number representations base 10^x have the most important properties of base 10, and are easily decoded as decimal. Base 1000 only needs 10 bits per three digits, not 12, because 2^10 is 1024. That's enough to show you get an extra decimal digit for 32 bits - 9 digits instead of 8 - and you've still got 2 bits left over, e.g. for a sign bit.
The thing is, for this digit-totalling algorithm to be worthwhile at all, you need to be working with fixed-width decimal of probably at least 32 bits (8 digits). That gives 12 operations (6 masks, 3 shifts, 3 additions) rather than 15 additions for the (fully unrolled) simple loop. That's a borderline gain, though - and other issues in the code could easily mean it's actually slower.
The efficiency gain is clearer at 64 bits (16 decimal digits) as there's still only 16 operations (8 masks, 4 shifts, 4 additions) rather than 31, but the odds of finding a processor that supports 64-bit BCD operations seems slim. And even if you did, how often do you need this anyway? It seems unlikely that it could be worth the effort and loss of portability.