How to count each digit in a range of integers?

Carlos Gutiérrez picture Carlos Gutiérrez · Jan 13, 2010 · Viewed 24.3k times · Source

Imagine you sell those metallic digits used to number houses, locker doors, hotel rooms, etc. You need to find how many of each digit to ship when your customer needs to number doors/houses:

  • 1 to 100
  • 51 to 300
  • 1 to 2,000 with zeros to the left

The obvious solution is to do a loop from the first to the last number, convert the counter to a string with or without zeros to the left, extract each digit and use it as an index to increment an array of 10 integers.

I wonder if there is a better way to solve this, without having to loop through the entire integers range.

Solutions in any language or pseudocode are welcome.


Edit:

Answers review
John at CashCommons and Wayne Conrad comment that my current approach is good and fast enough. Let me use a silly analogy: If you were given the task of counting the squares in a chess board in less than 1 minute, you could finish the task by counting the squares one by one, but a better solution is to count the sides and do a multiplication, because you later may be asked to count the tiles in a building.
Alex Reisner points to a very interesting mathematical law that, unfortunately, doesn’t seem to be relevant to this problem.
Andres suggests the same algorithm I’m using, but extracting digits with %10 operations instead of substrings.
John at CashCommons and phord propose pre-calculating the digits required and storing them in a lookup table or, for raw speed, an array. This could be a good solution if we had an absolute, unmovable, set in stone, maximum integer value. I’ve never seen one of those.
High-Performance Mark and strainer computed the needed digits for various ranges. The result for one millon seems to indicate there is a proportion, but the results for other number show different proportions.
strainer found some formulas that may be used to count digit for number which are a power of ten. Robert Harvey had a very interesting experience posting the question at MathOverflow. One of the math guys wrote a solution using mathematical notation.
Aaronaught developed and tested a solution using mathematics. After posting it he reviewed the formulas originated from Math Overflow and found a flaw in it (point to Stackoverflow :).
noahlavine developed an algorithm and presented it in pseudocode.

A new solution
After reading all the answers, and doing some experiments, I found that for a range of integer from 1 to 10n-1:

  • For digits 1 to 9, n*10(n-1) pieces are needed
  • For digit 0, if not using leading zeros, n*10n-1 - ((10n-1) / 9) are needed
  • For digit 0, if using leading zeros, n*10n-1 - n are needed

The first formula was found by strainer (and probably by others), and I found the other two by trial and error (but they may be included in other answers).

For example, if n = 6, range is 1 to 999,999:

  • For digits 1 to 9 we need 6*105 = 600,000 of each one
  • For digit 0, without leading zeros, we need 6*105 – (106-1)/9 = 600,000 - 111,111 = 488,889
  • For digit 0, with leading zeros, we need 6*105 – 6 = 599,994

These numbers can be checked using High-Performance Mark results.

Using these formulas, I improved the original algorithm. It still loops from the first to the last number in the range of integers, but, if it finds a number which is a power of ten, it uses the formulas to add to the digits count the quantity for a full range of 1 to 9 or 1 to 99 or 1 to 999 etc. Here's the algorithm in pseudocode:

integer First,Last //First and last number in the range
integer Number     //Current number in the loop
integer Power      //Power is the n in 10^n in the formulas
integer Nines      //Nines is the resut of 10^n - 1, 10^5 - 1 = 99999
integer Prefix     //First digits in a number. For 14,200, prefix is 142
array 0..9  Digits //Will hold the count for all the digits

FOR Number = First TO Last
  CALL TallyDigitsForOneNumber WITH Number,1  //Tally the count of each digit 
                                              //in the number, increment by 1
  //Start of optimization. Comments are for Number = 1,000 and Last = 8,000.
  Power = Zeros at the end of number //For 1,000, Power = 3
  IF Power > 0                       //The number ends in 0 00 000 etc 
    Nines = 10^Power-1                 //Nines = 10^3 - 1 = 1000 - 1 = 999
    IF Number+Nines <= Last            //If 1,000+999 < 8,000, add a full set
      Digits[0-9] += Power*10^(Power-1)  //Add 3*10^(3-1) = 300 to digits 0 to 9
      Digits[0]   -= -Power              //Adjust digit 0 (leading zeros formula)
      Prefix = First digits of Number    //For 1000, prefix is 1
      CALL TallyDigitsForOneNumber WITH Prefix,Nines //Tally the count of each 
                                                     //digit in prefix,
                                                     //increment by 999
      Number += Nines                    //Increment the loop counter 999 cycles
    ENDIF
  ENDIF 
  //End of optimization
ENDFOR  

SUBROUTINE TallyDigitsForOneNumber PARAMS Number,Count
  REPEAT
    Digits [ Number % 10 ] += Count
    Number = Number / 10
  UNTIL Number = 0

For example, for range 786 to 3,021, the counter will be incremented:

  • By 1 from 786 to 790 (5 cycles)
  • By 9 from 790 to 799 (1 cycle)
  • By 1 from 799 to 800
  • By 99 from 800 to 899
  • By 1 from 899 to 900
  • By 99 from 900 to 999
  • By 1 from 999 to 1000
  • By 999 from 1000 to 1999
  • By 1 from 1999 to 2000
  • By 999 from 2000 to 2999
  • By 1 from 2999 to 3000
  • By 1 from 3000 to 3010 (10 cycles)
  • By 9 from 3010 to 3019 (1 cycle)
  • By 1 from 3019 to 3021 (2 cycles)

Total: 28 cycles Without optimization: 2,235 cycles

Note that this algorithm solves the problem without leading zeros. To use it with leading zeros, I used a hack:

If range 700 to 1,000 with leading zeros is needed, use the algorithm for 10,700 to 11,000 and then substract 1,000 - 700 = 300 from the count of digit 1.

Benchmark and Source code

I tested the original approach, the same approach using %10 and the new solution for some large ranges, with these results:

Original             104.78 seconds
With %10              83.66
With Powers of Ten     0.07

A screenshot of the benchmark application:
alt text
(source: clarion.sca.mx)

If you would like to see the full source code or run the benchmark, use these links:

Accepted answer

noahlavine solution may be correct, but l just couldn’t follow the pseudo code, I think there are some details missing or not completely explained.

Aaronaught solution seems to be correct, but the code is just too complex for my taste.

I accepted strainer’s answer, because his line of thought guided me to develop this new solution.

Answer

Aaronaught picture Aaronaught · Jan 14, 2010

There's a clear mathematical solution to a problem like this. Let's assume the value is zero-padded to the maximum number of digits (it's not, but we'll compensate for that later), and reason through it:

  • From 0-9, each digit occurs once
  • From 0-99, each digit occurs 20 times (10x in position 1 and 10x in position 2)
  • From 0-999, each digit occurs 300 times (100x in P1, 100x in P2, 100x in P3)

The obvious pattern for any given digit, if the range is from 0 to a power of 10, is N * 10N-1, where N is the power of 10.

What if the range is not a power of 10? Start with the lowest power of 10, then work up. The easiest case to deal with is a maximum like 399. We know that for each multiple of 100, each digit occurs at least 20 times, but we have to compensate for the number of times it appears in the most-significant-digit position, which is going to be exactly 100 for digits 0-3, and exactly zero for all other digits. Specifically, the extra amount to add is 10N for the relevant digits.

Putting this into a formula, for upper bounds that are 1 less than some multiple of a power of 10 (i.e. 399, 6999, etc.) it becomes: M * N * 10N-1 + iif(d <= M, 10N, 0)

Now you just have to deal with the remainder (which we'll call R). Take 445 as an example. This is whatever the result is for 399, plus the range 400-445. In this range, the MSD occurs R more times, and all digits (including the MSD) also occur at the same frequencies they would from range [0 - R].

Now we just have to compensate for the leading zeros. This pattern is easy - it's just:

10N + 10N-1 + 10N-2 + ... + **100

Update: This version correctly takes into account "padding zeros", i.e. the zeros in middle positions when dealing with the remainder ([400, 401, 402, ...]). Figuring out the padding zeros is a bit ugly, but the revised code (C-style pseudocode) handles it:

function countdigits(int d, int low, int high) {
    return countdigits(d, low, high, false);
}

function countdigits(int d, int low, int high, bool inner) {
    if (high == 0)
        return (d == 0) ? 1 : 0;

    if (low > 0)
        return countdigits(d, 0, high) - countdigits(d, 0, low);

    int n = floor(log10(high));
    int m = floor((high + 1) / pow(10, n));
    int r = high - m * pow(10, n);
    return
        (max(m, 1) * n * pow(10, n-1)) +                             // (1)
        ((d < m) ? pow(10, n) : 0) +                                 // (2)
        (((r >= 0) && (n > 0)) ? countdigits(d, 0, r, true) : 0) +   // (3)
        (((r >= 0) && (d == m)) ? (r + 1) : 0) +                     // (4)
        (((r >= 0) && (d == 0)) ? countpaddingzeros(n, r) : 0) -     // (5)
        (((d == 0) && !inner) ? countleadingzeros(n) : 0);           // (6)
}

function countleadingzeros(int n) {
      int tmp= 0;
      do{
         tmp= pow(10, n)+tmp;
         --n;
         }while(n>0);
         return tmp;
         }

function countpaddingzeros(int n, int r) {
    return (r + 1) * max(0, n - max(0, floor(log10(r))) - 1);
}

As you can see, it's gotten a bit uglier but it still runs in O(log n) time, so if you need to handle numbers in the billions, this will still give you instant results. :-) And if you run it on the range [0 - 1000000], you get the exact same distribution as the one posted by High-Performance Mark, so I'm almost positive that it's correct.

FYI, the reason for the inner variable is that the leading-zero function is already recursive, so it can only be counted in the first execution of countdigits.

Update 2: In case the code is hard to read, here's a reference for what each line of the countdigits return statement means (I tried inline comments but they made the code even harder to read):

  1. Frequency of any digit up to highest power of 10 (0-99, etc.)
  2. Frequency of MSD above any multiple of highest power of 10 (100-399)
  3. Frequency of any digits in remainder (400-445, R = 45)
  4. Additional frequency of MSD in remainder
  5. Count zeros in middle position for remainder range (404, 405...)
  6. Subtract leading zeros only once (on outermost loop)