Four Word Phrase: Pseudorandom Generation of Word Lists from a Dictionary
The ideal password is impractical for an attacker to guess, but nonetheless easy to remember. Any sufficiently long random string of characters satisfies the first criterion, but "sufficiently long" is well past the point of easy recollection these days, and the needed length will only grow with time as the processing power available to attackers continues to increase. Choices such as your birth date or a single dictionary word with a few strategic alterations or additions are easy to remember, but are also easily guessed. A compromise position is a non-grammatical passphrase: a set of four or five dictionary words selected more or less at random. This is weaker than a completely random string of the same length, but well beyond the reach of nearly all attackers who would need to guess passwords.
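To put rough numbers on that trade-off - assuming, say, a 10,000-word dictionary and four words averaging seven letters - the comparison looks something like this:

```js
// Rough entropy comparison: a four-word phrase drawn from a 10,000-word
// dictionary versus a random lowercase string of similar length.
var dictionarySize = 10000;
var wordsInPhrase = 4;
var phraseBits = wordsInPhrase * (Math.log(dictionarySize) / Math.LN2);

// Four words averaging seven letters is roughly 28 characters.
var stringLength = 28;
var stringBits = stringLength * (Math.log(26) / Math.LN2);

console.log('Passphrase: ~' + Math.round(phraseBits) + ' bits');    // ~53 bits
console.log('Random string: ~' + Math.round(stringBits) + ' bits'); // ~132 bits
```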
Password-guessing attackers are typically trying to get into your online accounts, and to a lesser degree into stolen computers or encrypted data lifted from online services. They are usually the less sophisticated attackers, and in the case of online accounts they cannot deploy a very large number of guesses in a short period of time. So defending against these villains is simply a matter of setting the bar high enough, and that bar isn't really all that high: don't reuse passwords, and use non-grammatical passphrases for those you must remember. If you use a password manager, you can have the best of all worlds: you access it with one passphrase and it stores long random strings, different for each service that needs a password.
The attackers you should probably be more concerned about are (a) those who copy your data from online services with the full cooperation of those services, because their position in governmental organizations empowers them to do so, and (b) those who live in the more distant future and thus have sufficient computational power to break all encryption from our era. That second threat may well arrive before you age to death, given advances in medicine: assume that nothing you store or send online will remain private forever. On the one hand the panopticon presently under construction is storing it all, and on the other hand progress will happen - look at how we treat governmental records of the past today. The present NSA databases and their like will be opened to the public in times to come.
But I digress. A little while back I was looking into the practical limits for generation of useful passphrases: how large can the possible space be?
| | 3 words | 4 words | 6 words | 10 words |
| --- | --- | --- | --- | --- |
| 1,000 word dictionary | ~10<sup>9</sup> | ~10<sup>12</sup> | ~10<sup>18</sup> | ~10<sup>30</sup> |
| 10,000 word dictionary | ~10<sup>12</sup> | ~10<sup>16</sup> | ~10<sup>24</sup> | ~10<sup>40</sup> |
| 100,000 word dictionary | ~10<sup>15</sup> | ~10<sup>20</sup> | ~10<sup>30</sup> | ~10<sup>50</sup> |
| 1,000,000 word dictionary | ~10<sup>18</sup> | ~10<sup>24</sup> | ~10<sup>36</sup> | ~10<sup>60</sup> |
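Each entry in the table is simply the dictionary size raised to the power of the phrase length; a quick sketch to reproduce the orders of magnitude:

```js
// Each table entry is dictionarySize ^ phraseLength; print the order of
// magnitude for each combination shown above.
var dictionarySizes = [1000, 10000, 100000, 1000000];
var phraseLengths = [3, 4, 6, 10];

dictionarySizes.forEach(function (size) {
  phraseLengths.forEach(function (length) {
    // log10(size ^ length) = length * log10(size)
    var exponent = length * (Math.log(size) / Math.LN10);
    console.log(size + '-word dictionary, ' + length + ' words: ~10^' + Math.round(exponent));
  });
});
```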
For comparison, UUIDs have a space of ~10<sup>38</sup> combinations, so there is a vanishing chance of collision even if you generate a great many of them. Word sequences short enough to memorize reliably can't approach that, even with very large dictionaries. For reference, a dictionary generated from a fair-sized book written by an author with an exceptional command of the language might contain about 10,000 distinct words and compound words of six letters or longer. The English language as a whole might have as many as 1,000,000 words, but far from all of those are useful for this purpose. Matters are improved by using sources that include proper names, made-up words, and obsolete words: these are all just as easy to remember.
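Running the comparison the other way around, this sketch shows how many words a phrase would need before its space rivals that of a UUID:

```js
// How many words must a phrase contain before its space reaches ~10^38,
// roughly the size of the UUID space? The answer is ceil(38 / log10(dictionarySize)).
[1000, 10000, 100000, 1000000].forEach(function (size) {
  var wordsNeeded = Math.ceil(38 / (Math.log(size) / Math.LN10));
  console.log(size + '-word dictionary: ' + wordsNeeded + ' words per phrase');
});
```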
Generating a large dictionary turns out to have its complications. A good resource is the Project Gutenberg collection, and I have tinkered with that to the point of discovering that deciding which words to include and exclude is a real challenge. Non-English words creep in, as well as words you might not want because they are offensive or hard to remember. Working with a single book as a source is easy enough, however, and I put together the four-word-phrase Node.js package for the purposes of experimentation.
The contents of this post use four-word-phrase v0.0.1 - note that later versions are somewhat differently structured. The core functionality pseudorandomly generates phrases from a dictionary, and the various classes are extensible, allowing additional implementations for different types of data storage and access. The output of generated phrases is deterministic for a given seed value and dictionary, which makes testing that much easier and expands the range of possible uses for such a system.
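As a rough illustration of how deterministic generation can work - the general idea only, not the four-word-phrase package's actual algorithm - a seeded, counter-based generator might hash the seed and a counter and use the digest to index into the dictionary:

```js
var crypto = require('crypto');

// A minimal sketch of deterministic phrase generation: hash the seed plus an
// incrementing counter, and reduce the digest to a dictionary index. Modulo
// bias is ignored here for the sake of brevity.
function phraseFromSeed(dictionary, seed, counter, phraseLength) {
  var words = [];
  for (var i = 0; i < phraseLength; i++) {
    var digest = crypto.createHash('sha256')
      .update(seed + ':' + counter + ':' + i)
      .digest();
    // Use the first four bytes of the digest to pick a word.
    var index = digest.readUInt32BE(0) % dictionary.length;
    words.push(dictionary[index]);
  }
  return words;
}

// The same seed and counter always produce the same phrase.
var dictionary = ['maimed', 'jeopardize', 'hilarity', 'infected', 'dreamy', 'confluent'];
console.log(phraseFromSeed(dictionary, 'a random seed', 0, 4));
```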
The examples I provide use Moby-Dick as the source text for generation, for no reason other than the fact that the results are fairly distinctive:
maimed jeopardize concentrations hilarity infected dreamy confluent support revered assented mid-day valves bow-window festoons lying calabash marlingspike scrupulous whisper dared
A dictionary is generated by running the text of Moby-Dick through a stream that tokenizes it and then accepts or rejects individual words by matching against regular expressions. No attempt is made to be any smarter than this:
```js
// Look for words that are neither too short nor too long. We'll accept some compound
// words, since they're fairly common in this text.
this.acceptanceRegExp = /^[a-z-]{5,14}$/;
// Reject overly hyphenated words. Herman Melville liked his hyphens.
this.rejectionRegExp = /-{2,}|-.*-/;

if (this.acceptanceRegExp.test(word) && !this.rejectionRegExp.test(word)) {
  this.push(word);
}
```
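For context, those checks run inside a stream. The following is a minimal sketch of a word-filtering transform stream along those lines; it is illustrative only, and not necessarily how the package's own tokenizer class is put together:

```js
var stream = require('stream');
var util = require('util');

// Split incoming text into candidate words and pass through only those that
// match the acceptance expression and fail the rejection expression. A real
// implementation would also buffer partial words across chunk boundaries.
function WordFilterStream() {
  stream.Transform.call(this, { objectMode: true });
  this.acceptanceRegExp = /^[a-z-]{5,14}$/;
  this.rejectionRegExp = /-{2,}|-.*-/;
}
util.inherits(WordFilterStream, stream.Transform);

WordFilterStream.prototype._transform = function (chunk, encoding, callback) {
  var words = chunk.toString().toLowerCase().split(/[^a-z-]+/);
  words.forEach(function (word) {
    if (this.acceptanceRegExp.test(word) && !this.rejectionRegExp.test(word)) {
      this.push(word);
    }
  }, this);
  callback();
};
```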
With a dictionary generated and written to file, it can be loaded into memory as needed and any number of phrases created from its contents. The following example is more direct and self-contained: it loads the raw text of the book, feeds it into the tokenizer to generate a dictionary in memory, and then produces a set of pseudorandom phrases from that dictionary. The output is deterministic for a given seed:
```js
var async = require('async');
var fs = require('fs');
var path = require('path');
var fwp = require('four-word-phrase');

var generator = new fwp.generator.MemoryGenerator({
  baseSeed: 'a random seed'
});

var filePath = 'path/to/four-word-phrase/example/mobyDick.txt';
var readStream = fs.createReadStream(filePath);
var importer = new fwp.importer.ReadStreamImporter(generator);

importer.import({
  readStream: readStream
}, function (error) {
  if (error) {
    return console.error(error);
  }

  var phraseLength = 4;
  async.times(10, function (index, asyncCallback) {
    generator.nextPhrase(phraseLength, asyncCallback);
  }, function (error, phrases) {
    if (error) {
      return console.error(error);
    }
    phrases.forEach(function (phrase, index) {
      console.log('Phrase ' + (index + 1) + ': ' + phrase.join(' '));
    });
  });
});
```
As it stands this Node.js code is something of a toy; it is efficient and usable, but just a step on the way to further explorations using larger chunks of the Project Gutenberg texts. The real challenge here is, as I mentioned, the various forms of gardening needed to generate a useful dictionary from a very large set of raw texts. For example, it is not as straightforward as you might imagine to restrict yourself to English words when naively processing tens of thousands of books written over the past couple of centuries. Considered in the aggregate, most of the common words of every European language turn up in the asides, dialogue, and quotations present in ostensibly English works. Given this, and the bad word problem - also not as simple to manage as you might imagine - I am curious as to how much work would in fact be involved in creating a 1,000,000-word more-or-less-English dictionary that lacks all of the offensive and unhelpful words likely to cause issues in commercial use. Is it even possible to approach that word count? Such a dictionary might be useful in some endeavors for generating identifiers, reminders, passwords, and so forth.
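As a sketch of the sort of gardening involved, one simple approach is to keep only candidate words found in a reference English wordlist and then subtract a blocklist; the file paths below are placeholders rather than anything shipped with the package:

```js
var fs = require('fs');

// Keep only candidate words that appear in a reference English wordlist, then
// drop anything on a blocklist of offensive or otherwise unwanted words. The
// wordlist and blocklist files are assumed to contain one word per line.
function loadWordSet(filePath) {
  var words = fs.readFileSync(filePath, { encoding: 'utf8' }).split(/\r?\n/);
  var set = {};
  words.forEach(function (word) {
    if (word) { set[word.toLowerCase()] = true; }
  });
  return set;
}

var englishWords = loadWordSet('path/to/english-wordlist.txt');
var blockedWords = loadWordSet('path/to/blocklist.txt');

function gardenDictionary(candidateWords) {
  return candidateWords.filter(function (word) {
    return englishWords[word] && !blockedWords[word];
  });
}
```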