I was working on my simple-id library and was curious about different methods to generate an unbiased random number in a given range. As a way to test whether a random number generator (RNG) produces good results, one can calculate the expected number of repeats based on the birthday problem and compare it with the output of the RNG. The formula can be found online in several places as it can also be used for counting expected hash collisions or cache hits. The expected number of repeats can be calculated as follows:
$$n - d + d \left( \frac{d - 1}{d} \right)^{n}$$
where $n$ is the number of generated values and $d$ is the range.
Converting this to JavaScript code:
function repeats(n, d) {
  return n - d + d * ((d - 1) / d)**n;
}
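As a quick sanity check (my own aside, using the classic birthday setting of 23 people and 365 possible birthdays):

console.log(repeats(23, 365)); // ≈ 0.68 expected repeated birthdays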
In my case, I wanted to check the expected repeats after generating a million random IDs, where each ID is 8 characters long and there are 31 characters in the alphabet. That is, $n = 10^6$ and $d = 31^8$. Running this in JavaScript:
console.log(repeats(1e6, 31**8));
The result is -19.72998046875.
Wait, what? Negative repeats? I've known about the infamous 0.1 + 0.2 inaccuracy, but this is different; it's a much larger error, and from only a handful of operations.
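For reference, that infamous case is off by only a single unit in the last place:

console.log(0.1 + 0.2); // 0.30000000000000004 rather than 0.3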
I double (and triple) checked that I typed the formula correctly, and tried it in both C and Python and got the same result. I then put it into WolframAlpha, since it supports arbitrary precision, and it returned something longer, but much more sensible:
0.5862405426220060595736096268572069594630842175323822481323009
That seemed right. I started tweaking the formula to try to get rid of the error. I figured that, most likely, the division leads to a loss of accuracy that's exacerbated by raising to a very high power. To avoid this, I reached for BigInt to calculate the powers of the numerator and denominator separately, then divide. Since BigInt only supports integers, I also multiply by a "precision" before converting to a number, then divide by it after to get a decimal value:
function repeats(n, d) {
  // Precision factor: scales the ratio up so the integer division
  // keeps enough fractional digits.
  const p = n * d;
  // ((d - 1) / d)**n as a ratio of exact BigInt powers.
  const e = Number(BigInt(p) * BigInt(d - 1)**BigInt(n) / BigInt(d)**BigInt(n)) / p;
  return n - d + d * e;
}
That returns 0.5863037109375. On the one hand, the result is correct to three significant figures, which is an improvement over the previous zero. On the other, it takes a second and a half to finish computing; the intermediate BigInt powers run to tens of millions of bits. After trying a few different things, I got a tip from a user on the ##math IRC channel to try log1p for the calculation. What does log have to do with anything? The way computers calculate the power is like so:

$$x^n = e^{n \ln x}$$
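A quick way to see the identity in action (my own aside; the exact last digits depend on how each intermediate step rounds):

console.log(2 ** 10);                    // 1024
console.log(Math.exp(10 * Math.log(2))); // ≈ 1024, possibly off by an ulp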
So we can rewrite the formula to:

$$n - d + d \, e^{n \ln\left(\frac{d-1}{d}\right)}$$
The log1p function returns the result of $\ln(1 + x)$. The reason it exists is because floating-point does not work well when adding a small $x$ to $1$. Since $d$ is very large, $-1/d$ is exactly such a small value, so this is the case in our formula. We can rewrite the expression to make it usable with log1p:

$$\ln\left(\frac{d-1}{d}\right) = \ln\left(1 - \frac{1}{d}\right) = \operatorname{log1p}\left(-\frac{1}{d}\right)$$
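Here is the failure mode in isolation, with a toy input small enough to vanish entirely:

console.log(Math.log(1 + 1e-16)); // 0, because 1 + 1e-16 rounds to exactly 1
console.log(Math.log1p(1e-16));   // 1e-16

In our formula the small value is $-1/d \approx -1.2 \times 10^{-12}$. It doesn't vanish outright, but rounding $1 - 1/d$ to the nearest double introduces an absolute error of up to about $5.6 \times 10^{-17}$, and raising to the $n$th power and multiplying by $d$ amplifies that by a factor of roughly $n \cdot d \approx 10^{18}$, which is more than enough to explain the negative result we started with.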
The code then becomes:
function repeats(n, d) {
  return n - d + d * Math.exp(n * Math.log1p(-1 / d));
}
Running it returns 0.5863037109375, the same value as using BigInt, only now it doesn't take a second and a half to calculate. However, it still wasn't as accurate as I'd like, presumably because the final subtraction cancels two numbers of size around $d \approx 10^{12}$ to produce a result smaller than one. While looking into this, I came across another function that also works better for small values, expm1, which returns $e^x - 1$. We can factor out $d$ to make this usable too:

$$n - d + d \, e^{x} = n + d\left(e^{x} - 1\right) = n + d \cdot \operatorname{expm1}(x), \qquad x = n \ln\left(\frac{d-1}{d}\right)$$
And in code:
function repeats(n, d) {
  return n + d * Math.expm1(n * Math.log1p(-1 / d));
}
Running this, we get 0.5862405423540622, a much improved result accurate to nine significant figures. The remaining error presumably comes from the final addition, where $n$ and $d \cdot \operatorname{expm1}(x)$ are nearly equal in magnitude and cancel. This was as good as it was going to get in JavaScript.
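For intuition on what expm1 buys, here it is in isolation, again with a toy input:

console.log(Math.exp(1e-16) - 1); // 0, the answer is destroyed by the subtraction
console.log(Math.expm1(1e-16));   // 1e-16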
Out of curiosity, I wanted to see if a better result was possible using C, since it provides yet another function for preserving accuracy, fma, also known as fused multiply-add. Rather than losing accuracy twice, on both the multiplication and the addition, rounding happens only once, after the calculation, when using fma. It also happens to be faster, since it uses a single native CPU instruction for both operations. C also provides the long double type, which is as precise as a double or more, depending on the implementation. On my machine, sizeof(long double) returns 16, i.e. it is a 128-bit floating-point number, also known as a quad. Trying them both out:
double repeats(double n, double d)
{
    return fma(d, expm1(n * log1p(-1 / d)), n);
}

long double repeatsl(long double n, long double d)
{
    return fmal(d, expm1l(n * log1pl(-1 / d)), n);
}
This yields 0.58624054240892864431 and 0.58624054258953528507, respectively: a very slight improvement in accuracy, but still only accurate to nine significant figures. At this point, I was out of my depth. I tried the Herbie project, which automatically rewrites floating-point expressions; it gave some good results, but the suggestions were rather unwieldy. Let me know if you can find another way to compute this with better accuracy; perhaps there is yet another trick floating out there.