I'm writing a small article on humanly readable alternatives to Guids/UIDs, for example those used on TinyURL for the url hashes (which are often printed in magazines, so need to be short).
The simple uid I'm generating is - 6 characters: either a lowercase letter (a-z) or 0-9.
"According to my calculations captain", that's 6 mutually exclusive events, although calculating the probability of a clash gets a little harder than P(A or B) = P(A) + P(B), as obviously it includes numbers and from the code below, you can see it works out whether to use a number or letter using 50/50.
I'm interested in the clash rate and if the code below is a realistic simulation of anticipated clash rate you'd get from generating a hash. On average I get 40-50 clashes per million, however bare in mind the uid wouldn't be generated a million times at once, but probably only around 10-1000 times a minute.
What is the probability of a clash each time, and can anyone suggest a better way of doing it?
static Random _random = new Random();
public static void main()
{
// Size of the key, 6
HashSet<string> set = new HashSet<string>();
int clashes = 0;
for (int n=0;n < 1000000;n++)
{
StringBuilder builder = new StringBuilder();
for (int i =0;i < 7;i++)
{
if (_random.NextDouble() > 0.5)
{
builder.Append((char)_random.Next(97,123));
}
else
{
builder.Append(_random.Next(0,9).ToString());
}
}
if (set.Contains(builder.ToString()))
{
clashes++;
Console.WriteLine("clash: (" +n+ ")" +builder.ToString());
}
set.Add(builder.ToString());
_random.Next();
//Console.Write(builder.ToString());
}
Console.WriteLine("Clashes: " +clashes);
Console.ReadLine();
}
UPDATE: Here's the resulting article from this question
I really asked two questions here so I was cheating. The answer I was after was rcar's, however Sklivvz's is also the answer to the 2nd part (an alternative). Is it possible to make a custom unique id generator in a database, or would it be client side (which would be 2 possible reads first)?
The general idea I was after was using Ids in databases or other stores that can be used by phone or printed material, not a giant 16 byte guid.
UPDATE 2: I put the formula for two mutually exclusive events above instead of 2 independent ones (as getting an 'a' the first time doesn't mean you can't get an 'a' the second time). Should've been P(A and B) = P(A) x P(B)
Why do you want to use a random function? I always assumed that tinyurl used a base 62 (0-9A-Za-z) representation of a sequential Id. No clashes and the urls are always as short as possible.
You would have a DB table like
Id URL
1 http://google.com
2 ...
... ...
156 ...
... ...
and the corresponding URLs would be:
http://example.com/1
http://example.com/2
...
http://example.com/2W
...