Simpler Forms of Harmful Regular Expression

Most regular expression implementations are unbounded in time, meaning that it is possible to construct numerous combinations of a fairly short regular expression and a fairly short string that will run near indefinitely. Most people will only rarely find themselves entering that territory, as most day to day regular expression use cases don't involve backreferences, recursive operations, or the like. Still, it is perfectly possible to construct quite simple but nonetheless harmful regular expressions.

Consider the common scenario in which regular expressions are running against user input, for example. It doesn't take all that many characters before regular expressions containing unbounded captures such as [abc]+ can run for tens of seconds, given a suitably structured string. This is the basis for a crude but effective denial of service attack against a web application, and is one of the reasons why all such regular expression operations should be timeboxed for safety.

Java's core regular expression implementation is a good testbed to run experiments in order to find out just how bad things can become given suitably structured strings. The following is a trivial main class to do just that for the question of unbounded captures, those using the * or + operators. It tests a couple of different regular expressions against strings containing large numbers of the same repeated character:

import java.util.Arrays;
import java.util.regex.Pattern;

public class RegexTest {

  public static void main (String[] args) {

    // In rough order of how long they will take to run given a string with
    // a lot of exclamation marks.
    Pattern[] patterns = {
      // Word boundaries.
      Pattern.compile(".*\\b(words|and|more)\\b.*"),
      // Single character matches.
      Pattern.compile(".*[!](words|and|more)[!].*"),
      Pattern.compile(".*[\\p{Punct}](words|and|more)[\\p{Punct}].*"),
      // Unbounded captures using +.
      Pattern.compile(".*[!]+(words|and|more)[!]+.*"),
      Pattern.compile(".*[\\p{Punct}]+(words|and|more)[\\p{Punct}]+.*")
    };

    int[] repeatCounts = {
      1000,
      2000,
      4000,
      8000,
      16000
    };

    // Building strings of the form "some blurb!!!!...!!! more stuff" with a lot
    // of exclamation marks.
    for (int repeatCount: repeatCounts) {
      String string = "some blurb"
        + String.format("%0" + repeatCount + "d", 0).replace("0", "!")
        + " more stuff";

      System.out.println("Exclamation marks count: " + repeatCount);

      for (Pattern pattern: patterns) {
        long start = System.currentTimeMillis();
        pattern.matcher(string).matches();
        long finish = System.currentTimeMillis();
        System.out.println("" + (finish - start) + "ms: " + pattern.toString());
      }

      System.out.println();
    }
  }
}

On a low-powered development laptop, it produces the following output:

Exclamation marks count: 1000
1ms: .*\b(words|and|more)\b.*
2ms: .*[!](words|and|more)[!].*
3ms: .*[\p{Punct}](words|and|more)[\p{Punct}].*
47ms: .*[!]+(words|and|more)[!]+.*
38ms: .*[\p{Punct}]+(words|and|more)[\p{Punct}]+.*

Exclamation marks count: 2000
2ms: .*\b(words|and|more)\b.*
2ms: .*[!](words|and|more)[!].*
3ms: .*[\p{Punct}](words|and|more)[\p{Punct}].*
144ms: .*[!]+(words|and|more)[!]+.*
139ms: .*[\p{Punct}]+(words|and|more)[\p{Punct}]+.*

Exclamation marks count: 4000
0ms: .*\b(words|and|more)\b.*
1ms: .*[!](words|and|more)[!].*
0ms: .*[\p{Punct}](words|and|more)[\p{Punct}].*
566ms: .*[!]+(words|and|more)[!]+.*
1425ms: .*[\p{Punct}]+(words|and|more)[\p{Punct}]+.*

Exclamation marks count: 8000
0ms: .*\b(words|and|more)\b.*
2ms: .*[!](words|and|more)[!].*
2ms: .*[\p{Punct}](words|and|more)[\p{Punct}].*
3262ms: .*[!]+(words|and|more)[!]+.*
4268ms: .*[\p{Punct}]+(words|and|more)[\p{Punct}]+.*

Exclamation marks count: 16000
1ms: .*\b(words|and|more)\b.*
5ms: .*[!](words|and|more)[!].*
5ms: .*[\p{Punct}](words|and|more)[\p{Punct}].*
13452ms: .*[!]+(words|and|more)[!]+.*
14192ms: .*[\p{Punct}]+(words|and|more)[\p{Punct}]+.*

This should hopefully be a convincing argument that, in addition to the common wisdom regarding regular expression speed and optimization, one really has to consider the edge cases for any capture group. An attacker can flatten a server with ten thousand suitably chosen characters delivered via a POST request body for near any casual use of unbounded captures such as [abc]+ in a mixed regular expression such as those above.