An Observation About Passphrases: Syntax vs Entropy



In the article I suggested using passphrases instead of «traditional» passwords, for multiple reasons, including sheer strength, memorability, and conformance to idiotic password creation policies without actually following the detrimental recommendations of the policy authors.

This recommendation gives rise to a reasonable doubt: «what if syntactically correct phrases are as weak, in comparison to a random string of symbols, as dictionary words are?». Indeed, syntax itself should weaken a passphrase, as it lends some «predictability» to the phrase. I want to address this concern by comparing syntactically correct passphrases to random collections of words (which we all consider sufficiently strong… hopefully).

Before we begin, it is important to explain why and how we use entropy.


Although Shannon's entropy has been shown (both experimentally and theoretically) NOT to be a measure of password strength, and is massively misused by almost every «computer scientist» and «security expert», it has some practical value: it plays the role of the most optimistic estimate of a password's strength.

Provided the attacker knows the defender's password-choosing strategy, the password strength cannot exceed 2^entropy. This is easy to see: password strength is canonically defined as the expected length of a guessing attack, and the attacker's knowledge of the defender's strategy allows them to limit the search space to the defender's pool of passwords, whose cardinality is by definition 2^entropy.

From now on we use the search space cardinality as an expression of entropy, as the two are in direct correspondence and we do not need the logarithmic scale of entropy.
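
As a quick illustration (a minimal Python sketch, not part of the argument): converting between the two views is trivial.

    import math

    # Entropy (in bits) and search space cardinality are two views of the
    # same quantity: cardinality = 2^entropy, entropy = log2(cardinality).
    def entropy_bits(cardinality: int) -> float:
        # Entropy of a secret drawn uniformly from a pool of the given size.
        return math.log2(cardinality)

    print(entropy_bits(2 ** 40))  # -> 40.0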

Since we are going to investigate a negative impact on password strength, i.e. how much the syntax of a passphrase reduces it, the search space cardinality is a good tool for the job, as it bounds the strength from above.

Here we go!


Let's say we have W words in our dictionary. The search space for a random sequence of n words is W^n. We are to estimate the search space for a grammatically correct sentence in relation to W^n.
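
For a sense of scale, here is a small Python sketch of that search space; the dictionary size W = 195207 is the figure quoted at the end of this article.

    import math

    W = 195_207  # dictionary size for modern English (quoted later in the article)

    for n in range(3, 8):
        space = W ** n  # search space for a random sequence of n words
        print(f"n = {n}: {math.log2(space):.1f} bits")  # n = 5 gives ~87.9 bits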

If we fix the sentence's grammatical structure, singling it out without specifying it, i.e. we only assume our passphrase is known to adhere to some single arbitrary structure (this assumption does not extend the search space in question; on the contrary, it makes the attacker's task easier), then the search space cardinality will be:

W^n * fraction_1 * fraction_2 * … * fraction_n,

where fraction_i represents the fraction of the dictionary constituting the search space for the i-th word.

Now we are to estimate those fractions.


In his article, Eric Lease Morgan lists the average frequencies of parts of speech across a set of 9 very different books. The list goes as follows:

noun 19%,
verb 15%,
punctuation 14%,
preposition 13%,
determiner 10%,
pronoun 9%,
adverb 7%,
adjective 6%,
conjunction 4%,
other 3%,
symbol 1%.

To simplify things a little (once again, in favor of the attacker), we postulate that all adverbs can be produced from adjectives, so that adverbs are chosen from the pool of adjectives; therefore, we may unite the two: adverb + adjective = 13%.

Then nouns, adjectives, and verbs total 19 + 15 + 13 = 47%. This sum represents 100% of our «refined» dictionary; therefore the frequencies of each of these 3 privileged parts of speech are:

freq_noun = 19/47
freq_adj = 13/47
freq_verb = 15/47

In other words, provided we are reading a natural text (any piece of meaningful English randomly selected from a corpus of all English texts), every time we encounter a noun, adjective, or verb, there is a 19/47 chance the word is a noun, a 13/47 chance it is an adjective, and a 15/47 chance it is a verb.
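
These conditional frequencies are easy to reproduce; a minimal Python sketch using Morgan's percentages from above:

    from fractions import Fraction

    # Morgan's text frequencies for the three parts of speech we keep
    # (adverbs already merged into adjectives: 7% + 6% = 13%).
    raw = {"noun": 19, "adj": 13, "verb": 15}
    total = sum(raw.values())  # 47

    # Conditional frequency: given that the next word is a noun, adjective,
    # or verb, how likely is it to be each of the three?
    freq = {pos: Fraction(share, total) for pos, share in raw.items()}
    print(freq)  # {'noun': Fraction(19, 47), 'adj': Fraction(13, 47), 'verb': Fraction(15, 47)}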

Now, assuming that our password sentence is meaningful and grammatically correct, we may expect parts of speech to appear in it with the given frequencies. Therefore, the expected value of the search space cardinality for each subsequent word in the phrase is the following constant:

freq_noun * nouns_in_dictionary +
freq_adj * adjs_in_dictionary +
freq_verb * verbs_in_dictionary,

obtained by weighting the amount of each part of speech in the dictionary by the frequency of this part of speech in a random natural text.

According to the Oxford English Dictionary, roughly 1/2 of the dictionary consists of nouns, 1/4 of adjectives, and 1/7 of verbs. If we then refine the dictionary to nouns, adjectives, and verbs only, the relative amounts of these parts of speech become:

nouns_in_dictionary = 98/175
adjs_in_dictionary = 49/175
verbs_in_dictionary = 28/175

Using these relative amounts and the frequencies obtained earlier, we get the expected value of each fraction_i, which is:

19/47 * 98/175 + 13/47 * 49/175 + 15/47 * 28/175 = 2919/8225

approximately: 0.35
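
The same arithmetic in Python, with exact fractions (the numbers are exactly those derived above):

    from fractions import Fraction

    # Conditional text frequencies of the three kept parts of speech.
    freq = {"noun": Fraction(19, 47), "adj": Fraction(13, 47), "verb": Fraction(15, 47)}
    # Shares of the refined dictionary, from the OED proportions 1/2, 1/4, 1/7.
    share = {"noun": Fraction(98, 175), "adj": Fraction(49, 175), "verb": Fraction(28, 175)}

    # Expected fraction of the dictionary searched per word of the phrase.
    fraction_i = sum(freq[pos] * share[pos] for pos in freq)
    print(fraction_i, float(fraction_i))  # 417/1175 (= 2919/8225) ≈ 0.3549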

Finally, the expected value of the search space cardinality for a grammatically correct sentence of n words (restricted to nouns, adjectives, and verbs, with all the grammar glue and the structure known to the attacker beforehand) is:

(W * 0.35)^n

versus W^n for a random collection of words.

Thus we may say that an average grammatically correct sentence of n keywords is weaker than a random sequence of n words roughly as if the dictionary were 3 times smaller. This may seem significant, but it is not.

Given that for modern English W = 195207, (W * 0.35)^n is greater than W^(n-1) for all n less than 12. In other words:
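
The claim is easy to check with a few lines of Python (a sketch of the check, not a proof):

    W = 195_207  # dictionary size for modern English

    # Find the largest n for which a syntactically correct n-word sentence,
    # (0.35 * W)^n, still beats a random phrase one word shorter, W^(n - 1).
    n = 1
    while (0.35 * W) ** n > W ** (n - 1):
        n += 1
    print(n - 1)  # -> 11: the inequality holds for all n < 12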

It is enough to make your passphrase one word longer in order to compensate for the syntax weakness.
