Mathematical calculation to determine probability of observing the character space used for a set of string inputs

Question

I am designing a methodology to determine the composition of wireless router security passphrases, specifically the character space used to generate them.

My problem is that I am unsure which formula or technique to use to calculate the number of strings I would need to observe in order to determine the full character set that was used to generate these passphrases.

I was thinking along the lines of the birthday paradox but I am not sure if this is applicable for my problem.

The algorithm to deduce the string composition is simple: the passphrase strings will be read and every character will be processed. If the character hasn't been seen before, it will be added to a temporary array and if the character has been seen before, it won't be added. Thus, at the end the array will store all the unique seen characters from the processed strings (the passphrases in question).
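A minimal Python sketch of the collection step described above. It uses a set in place of the temporary array (a substitution for convenience, since a set ignores repeats automatically), and the function name `observed_alphabet` is made up for illustration:

```python
def observed_alphabet(passphrases):
    """Return the set of distinct characters seen across all passphrases."""
    seen = set()
    for passphrase in passphrases:
        # Adding a character that is already present leaves the set unchanged,
        # which matches the "only add if not seen before" rule.
        seen.update(passphrase)
    return seen


if __name__ == "__main__":
    samples = ["2UAYnCtL", "YR3kLX49", "MJLDJIgt", "xouq9KOV"]
    alphabet = observed_alphabet(samples)
    print(len(alphabet), "".join(sorted(alphabet)))
```

With only a handful of 8-character samples the recovered set is necessarily far smaller than a 62-character space, which is why the number of observations matters.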

Let's say the character space that the router manufacturer used to generate these passphrases was 62 different characters in size. For example:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

(a-z lowercase, A-Z uppercase and 0-9)

These generated passphrases are just strings like these for example:

2UAYnCtL
YR3kLX49
MJLDJIgt
xouq9KOV

Considering that each passphrase is 8 characters long and is generated by picking characters at random from the example character space above, how many strings will my algorithm need to see before it can build a complete representation of that character space?

In other words, how can I determine the minimum number of strings that must be observed for my algorithm to have a high probability (close to 1) of recreating that character space?

Please let me know if anything is unclear. Thanks for your input!

This is related to the coupon collector's problem, although the question you ask is kind of the inverse of that. (In the coupon collector's problem, you know how many coupons there are, and want to know how long it'll take to collect them all; in your actual problem, you've collected some set of characters, and want to know how likely it is that there are no more.) — Ilmari Karonen, Aug 23 '16 at 23:21
Thanks for the additional context, wasn't aware of this probability problem. — Jukan Manya, Aug 24 '16 at 01:21

poncho · Accepted Answer · 2016-08-23T22:21:22.017

Well, assuming that the router selects each character of the password in a uniformly and independently distributed fashion, then the probability of an individual character not being a specific character $c$ is:

$$1 - 1/62$$

(where the alphabet size is 62).

Hence, if we generate $n$ characters, the probability $c$ never occurs is:

$$(1 - 1/62)^n$$

Now, we have 62 possible values of $c$, and we want the probability that all of them occur somewhere to be high (say, $>0.99$), and so we have:

$$62 (1 - 1/62)^n < 1 - 0.99$$

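Evaluating this, we get $n > 537$, i.e. at least 538 characters, or 68 8-character passwords (67 passwords supply only 536 characters, which falls just short of the bound).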
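The inequality can be solved for $n$ by taking logarithms. The following Python sketch does that for an arbitrary alphabet size and confidence level; the function names and default arguments are illustrative choices, and it assumes uniform, independent characters as stated above:

```python
import math


def chars_needed(alphabet_size=62, confidence=0.99):
    """Smallest integer n with alphabet_size * (1 - 1/alphabet_size)**n < 1 - confidence."""
    failure = 1.0 - confidence
    # log(1 - 1/N) is negative, so dividing by it flips the inequality:
    # n > log(failure / N) / log(1 - 1/N)
    bound = math.log(failure / alphabet_size) / math.log(1.0 - 1.0 / alphabet_size)
    return math.floor(bound) + 1


def passwords_needed(alphabet_size=62, confidence=0.99, length=8):
    """Number of fixed-length passwords that supply at least chars_needed() characters."""
    return math.ceil(chars_needed(alphabet_size, confidence) / length)


if __name__ == "__main__":
    print(chars_needed(), passwords_needed())
```

For the 62-character alphabet and a 0.99 target, the threshold lies just above 537, so rounding up to whole characters and whole passwords gives 538 characters and 68 passwords; 67 passwords would supply 536 characters, marginally below the threshold.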

The exact value you obtain will depend on the size of the alphabet (which you might not know precisely when you start) and on the desired failure probability. In addition, this is pessimistic in the sense that if you see every alphabetic character other than the letter 'n', it's plausible that 'n' can occur too (and you just didn't happen to see it). On the other hand, the router may be programmed to avoid the characters 1, I and l (because they are easily confused in some fonts), so the estimate might not be quite so pessimistic after all.

Pedants will note that this analysis isn't precisely correct (the probabilities aren't independent: each character of the password is some $c$, and so multiplying the non-occurrence probability of an individual character by the number of characters in the alphabet isn't actually justified); however, I believe that it's close.
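For comparison, the exact probability can be computed with inclusion-exclusion over the characters that are missing, which is the standard coupon-collector formula rather than anything specific to this answer:

$$P(\text{all seen}) = \sum_{k=0}^{N} (-1)^k \binom{N}{k} \left(1 - \frac{k}{N}\right)^n$$

The product $62 (1 - 1/62)^n$ is the $k = 1$ term of this sum, and it is an upper bound on the failure probability (the union bound), so it errs on the safe side. A rough Python sketch, using exact integer arithmetic to avoid cancellation in the alternating sum (the function name is illustrative, and `math.comb` needs Python 3.8 or later):

```python
from fractions import Fraction
from math import comb


def prob_all_seen(alphabet_size, n_chars):
    """Exact probability that n_chars uniform, independent draws hit every symbol."""
    N = alphabet_size
    # Inclusion-exclusion: choose k symbols to exclude, then count the strings
    # of length n_chars that avoid all k of them.
    numerator = sum((-1) ** k * comb(N, k) * (N - k) ** n_chars for k in range(N + 1))
    return Fraction(numerator, N ** n_chars)


if __name__ == "__main__":
    for passwords in (67, 68):
        n = 8 * passwords
        exact_failure = 1 - float(prob_all_seen(62, n))
        union_bound = 62 * (1 - 1 / 62) ** n
        print(passwords, exact_failure, union_bound)
```

Printing the exact failure probability next to the bound for a given number of passwords shows how small the gap is at this sample size.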

Thank you very much for your detailed and insightful answer. I really appreciate your help. Can't thank you enough ! — Jukan Manya, Aug 24 '16 at 01:22
