Cryptography
Updated Sat, 03 Sep 2022 22:04:57 GMT

# What is / How would I calculate the number of bits(?) I need to make the probability of a collision very small?

If I wanted to create a TOTP-esque algorithm that generated a string n characters long, with each character being a base64 character, generated from a user secret r Base64 characters long, every s seconds, for each of u number of users, where the probability of collision is very small, how would I calculate the length of the string that was needed?

In case my wording is bad (I'm new to this), here's an example: Say it's generated every 30 seconds from the user's secret, I have 100,000,000 users, the algorithm output is 8 characters long, and the secret for each user is a 32 character Base64 string randomly generated when the user signed up. What would the probability be that two users got the same algorithm output as each other in any one 30s* period? How would I figure out how many characters I would need to make it a one in a million chance?

* Like TOTP, for all users, the 30s period would be the same, i.e. between seconds 0-29 or 30-59 on the clock.

## Solution

If I wanted to create a TOTP-esque algorithm that generated a string n characters long, with each character being a base64 character, generated from a user secret r Base64 characters long, every s seconds, for each of u number of users, where the probability of collision is very small, how would I calculate the length of the string that was needed?

Well, the probability that two specific users get the same r secret is $$2^{-6r}$$; there are $$u(u-1)/2$$ distinct pairs of users, hence the probability that there exists a pair somewhere with the same secret is no more than $$2^{-6r} u(u-1)/2$$ - with the values of $$r, u$$ you mentioned ($$r=32, u=100,000,000$$), this is tiny (circa $$2^{-138.8}$$, and so we can safely ignore that.

Now, the probability that, during a specific s second time interval, that two specific users get the same n character sequence is $$2^{-6n}$$; again, there are $$u(u-1)/2$$ distinct pairs of users, hence the probability that there exists a pair somewhere with the same output is no more than $$2^{-6n} u(u-1)/2$$

Now, you suggest $$n=8$$, if you plug that into the above equation with $$u=100,000,000$$, you get a value larger than 1 (hence that provides no meaningful bound on the collision probability, and in practice means that there likely exists several collisions).

To deduce what $$n$$ would need to be to make the probability bounded by $$10^{-6}$$, we would need to solve the problem $$10^{-6} \ge 2^{-6n} u(u-1)/2$$; solving this numerically for $$u=100,000,000$$ gives $$n \ge 12.014...$$; that is, an $$n$$ of 12 isn't quite enough to give you the probability you're looking for (but it's close).

Now, if you want that $$10^{-6}$$ probability to span over $$k$$ intervals (that is, that there is a collision anywhere in the $$k$$ $$s$$-second span, you'd need to solve $$10^{-6} \ge k \cdot 2^{-6n} u(u-1)/2$$

• +0 – Thanks for the detailed answer! Can you expand on where the $2^{-6r}$ comes from please? I am trying to figure out where the in the equation the character set (Base64) for the secret and output comes into play, and I feel it should factor in here, but I don't see how it does. — Aug 15, 2019 at 21:04
• +1 – the probability that $k$ bits match is $(1/2)^k$ if you have $r$ base 64 characters, $k=6r$ since base 64 characters are represented with $\log_2(64)=6$ bits. So $(1/2)^k=2^{-6r}.$ — Aug 15, 2019 at 21:45