I am designing a methodology to determine the composition of wireless router security passphrases, specifically the character space used to generate them.
My problem is that I am unsure which formula or technique to use to calculate how many strings I would need to observe in order to recover the full character set that was used to generate these passphrases.
I was thinking along the lines of the birthday paradox, but I am not sure whether it applies to my problem.
The algorithm to deduce the string composition is simple: each passphrase string is read and every character in it is processed. If the character has not been seen before, it is added to a temporary array; if it has been seen before, it is skipped. At the end, the array holds every unique character seen in the processed strings (the passphrases in question).
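To make the algorithm concrete, here is a minimal Python sketch of it. The function name `collect_unique_characters` is only illustrative, and the extra set is an assumption I added purely to make the "seen before" check fast; the list plays the role of the temporary array.

```python
def collect_unique_characters(passphrases):
    """Return the unique characters seen across all passphrases, in first-seen order."""
    seen = []            # the "temporary array" of unique characters
    seen_lookup = set()  # helper for fast membership tests (my addition)
    for passphrase in passphrases:
        for ch in passphrase:
            if ch not in seen_lookup:
                seen_lookup.add(ch)
                seen.append(ch)
    return seen


if __name__ == "__main__":
    sample = ["2UAYnCtL", "YR3kLX49", "MJLDJIgt", "xouq9KOV"]
    observed = collect_unique_characters(sample)
    print(len(observed), "".join(observed))
```

With only a handful of passphrases the recovered set is far from complete, which is exactly why I need to know how many strings to collect.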
Let's say the character space that the router manufacturer used to generate these passphrases contains 62 distinct characters. For example:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
(a-z lowercase, A-Z uppercase and 0-9)
The generated passphrases are simply strings such as:
- 2UAYnCtL
- YR3kLX49
- MJLDJIgt
- xouq9KOV
Given that each passphrase is 8 characters long and is generated by picking characters at random from the example character space above, how many strings will my algorithm need to see before it can build a complete representation of that character space?
In other words, how can I determine the minimum number of strings that must be observed for my algorithm to have a high probability (close to 1) of recreating the full character space?
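To frame what I am after, here is a rough Python sketch that estimates the number empirically. It rests on assumptions that are mine, not known facts about any router: every character is drawn uniformly and independently from the 62-character set, and passphrases are exactly 8 characters long. All function names are hypothetical. `estimate_required_strings` runs a Monte Carlo simulation and returns an empirical quantile, while `union_bound_missing` gives a simple upper bound, 62 * (61/62)^(8n), on the probability that at least one character is still unseen after n strings.

```python
import random
import string

# Assumed character space: a-z, A-Z, 0-9 (62 characters).
CHARSET = string.ascii_lowercase + string.ascii_uppercase + string.digits
PASSPHRASE_LENGTH = 8  # assumed fixed length


def strings_until_complete(rng):
    """Count random passphrases drawn until every character in CHARSET has appeared."""
    seen = set()
    count = 0
    while len(seen) < len(CHARSET):
        # Uniform, independent draws with replacement (an assumption).
        seen.update(rng.choices(CHARSET, k=PASSPHRASE_LENGTH))
        count += 1
    return count


def estimate_required_strings(trials=2000, confidence=0.99, seed=0):
    """Empirical quantile: strings needed so that roughly `confidence` of trials are complete."""
    rng = random.Random(seed)
    counts = sorted(strings_until_complete(rng) for _ in range(trials))
    index = min(trials - 1, int(confidence * trials))
    return counts[index]


def union_bound_missing(n_strings):
    """Upper bound on P(at least one character unseen) after n_strings passphrases."""
    m = len(CHARSET)
    return min(1.0, m * (1 - 1 / m) ** (PASSPHRASE_LENGTH * n_strings))


if __name__ == "__main__":
    print("empirical 99% estimate:", estimate_required_strings())
    for n in (25, 50, 75, 100):
        print(n, "strings -> bound on failure probability:", union_bound_missing(n))
```

This only gives me numbers for one specific configuration, though; what I would really like is the formula or named technique behind them, so that I can apply it to other character set sizes and passphrase lengths.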
Please let me know if anything is unclear. Thanks for your input!