Preventing Unbounded Regular Expression Operations in Java

June 1st, 2017 Permalink

Accepting and running user-defined regular expressions is an unfortunate position to be put in. Perhaps the expressions are allowed as a form of configuration for a platform or service offering, or perhaps they are accepted directly as part of a search or filter tool in an application. Now you are faced with defending an extensive attack surface that requires a fairly deep understanding of the details, and that you would no doubt rather didn't exist. Putting everything else to one side, one the more pressing problems with regular expressions is that many of the libraries are unbounded or near-unbounded in execution time: even comparatively short suitably crafted regular expressions can run essentially indefinitely. Certainly longer than you can keep the servers running to support such an operation. Unfortunately, the core Java regular expression implementation is one such unbounded implementation.

If you put a web application out there that accepts regular expressions as user input, how long will it take before some kind individual submits requests that will lock up the server? This is why all user-defined regular expression operations should either run wrapped in infrastructure that errors on hitting a suitable time limit, or use a more friendly regular expression library that is incapable of producing unbounded run times. RE2 is one example with a Java implementation. Unless you are starting out on a new project, this usually isn't an option, however: the syntax is different enough to make it hard to swap out libraries for an existing product that is already hip-deep in user-defined regular expressions.

Fortunately, there are some fairly simple and workable ways to timeout and error when using the core Java regular expression package. They revolve around extending CharSequence to enable a checkpoint in a frequently called method. Given that checkpoint, it is then possible to halt the progress of a regular expression operation based on any useful criteria: elapsed time, a thread interrupt, or similar. If you are prepared to accept the constraint that a Matcher instance must be used immediately following instantiation and then discarded, which seems reasonable enough, then you don't even need to wrap the regular expression operation in its own thread.

Here is an example of the tooling needed:

/*
MIT License

Copyright (c) 2017 Reason

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * A factory class used to generate Matcher instances that will
 * throw if in use after their timeout.
 */
public abstract class TimeLimitedMatcherFactory {
  // If a regular expression requires more than a couple of seconds
  // to complete, then it has no place in polite society.
  private static final long RE_TIMEOUT = 2000L;

  /**
   * Generate a Matcher instance that will throw if used or still
   * in use more than timeoutInMilliseconds after its instantiation.
   *
   * Use the instance immediately and then discard it.
   *
   * @param pattern The Pattern instance.
   * @param charSequence The CharSequence to operate on.
   * @param timeoutInMilliseconds Throw after this timeout is reached.
   */
  public static Matcher matcher(
    Pattern pattern,
    CharSequence charSequence,
    long timeoutInMilliseconds
  ) {
    // Substitute in our exploding CharSequence implementation.
    if (!(charSequence instanceof TimeLimitedCharSequence)) {
      charSequence = new TimeLimitedCharSequence(
        charSequence,
        timeoutInMilliseconds,
        pattern,
        charSequence
      );
    }

    return pattern.matcher(charSequence);
  }

  /**
   * Generate a Matcher instance that will throw if used or still
   * in use more than 2 seconds after its instantiation.
   *
   * Use the instance immediately and then discard it.
   *
   * @param pattern The Pattern instance.
   * @param charSequence The CharSequence to operate on.
   */
  public static Matcher matcher(
    Pattern pattern,
    CharSequence charSequence
  ) {
    return matcher(pattern, charSequence, RE_TIMEOUT);
  }

  /**
   * An exception to indicate that a regular expression operation 
   * timed out.
   */
  public static class RegExpTimeoutException extends RuntimeException {
    public RegExpTimeoutException(String message) { 
      super(message);
    }
  }

  /**
   * A CharSequence implementation that throws when charAt() is called 
   * after a given timeout.
   *
   * Since charAt() is invoked frequently in regular expression operations
   * on a string, this gives a way to abort long-running regular expression
   * operations.
   */
  private static class TimeLimitedCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long timeoutInMilliseconds;
    private final long timeoutAfterTimestamp;
    private final Pattern pattern;
    private final CharSequence originalCharSequence;

    /**
     * Default constructor.
     *
     * @param inner The CharSequence to wrap. This may be a subsequence of the original.
     * @param timeoutInMilliseconds How long before calls to charAt() throw.
     * @param pattern The Pattern instance; only used for logging purposes.
     * @param originalCharSequence originalCharSequence The original sequence, used for logging purposes.
     */
    public TimeLimitedCharSequence(
      CharSequence inner,
      long timeoutInMilliseconds,
      Pattern pattern,
      CharSequence originalCharSequence
    ) {
      super();
      this.inner = inner;
      this.timeoutInMilliseconds = timeoutInMilliseconds;
      // Carry out this calculation here, once, rather than every time
      // charAt() is invoked. Little optimizations make the world turn.
      timeoutAfterTimestamp = System.currentTimeMillis() + timeoutInMilliseconds;
      this.pattern = pattern;
      this.originalCharSequence = originalCharSequence;
    }

    public char charAt(int index) {
      // This is an unavoidable slowdown, but what can you do?
      if (System.currentTimeMillis() > timeoutTime) {
        // Note that we add the original charsequence to the exception
        // message. This condition can be met on a subsequence of the 
        // original sequence, and the subsequence string is rarely 
        // anywhere near as helpful.
        throw new RuntimeException(
          "Regular expression timeout after " + timeoutInMilliseconds
            + "ms for [ " + pattern.pattern() + " ] operating on [ "
            + originalCharSequence + " ]"
        );
      }
      return inner.charAt(index);
    }

    public int length() {
      return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
      // Ensure that any subsequence generated during a regular expression
      // operation is still going to explode on time.
      return new TimeLimitedCharSequence(
        inner.subSequence(start, end),
        timeoutAfterTimestamp - System.currentTimeMillis(),
        pattern,
        originalCharSequence
      );
    }

    @Override
    public String toString() {
      return inner.toString();
    }
  }
}

An example of use:

Pattern pattern = Pattern.compile("[\\s\\w]+");
String str = "long wordy string";
boolean matched;

try {
  matched = TimeLimitedMatcherFactory.matcher(pattern, str).match();
} catch (TimeLimitedMatcherFactory.RegExpTimeoutException e) {
  // Take appropriate action here. 
}