What Constitutes an Acceptable Email Regex?

September 1st, 2012

Matching and validating email addresses with regular expressions is a narrow problem space, but a deep one. You can spend a fair amount of time there if you are so inclined, thinking about the trade-offs and application functionality involved when users enter email addresses. If you are working with a framework or language that has a built-in and commonly used form of email address validation, then the hidden depths of doing it yourself are a great reason to just use the built-in approach. For example, in PHP:

$is_valid_email = filter_var($email, FILTER_VALIDATE_EMAIL);

If you don't have a detailed understanding as to exactly why the above function won't work for your particular use-case, then you should use it. It will be good enough for the vast majority of web and mail-related applications, and a range of folk have put more time into making sure it works than you will ever be able to spare.
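
As a minimal sketch of how that call might be wrapped in practice: filter_var() returns the filtered string on success and false on failure rather than a plain boolean, so a strict comparison keeps things tidy. The wrapper name is_valid_email() is my own invention here, not something PHP provides.

```php
<?php
// is_valid_email() is a hypothetical wrapper name, not a PHP built-in.
function is_valid_email(string $email): bool
{
    // filter_var() returns the filtered string on success and false on
    // failure, so compare against false to get a real boolean.
    return filter_var($email, FILTER_VALIDATE_EMAIL) !== false;
}

var_dump(is_valid_email('user@example.com')); // bool(true)
var_dump(is_valid_email('not an email'));     // bool(false)
```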

The RFC822 Exact Match

When it comes to matching emails in arbitrary strings, however, you're usually on your own, and will have to make a sensible choice among the many possible regexes. To start with, matching emails exactly according to the RFC822 specification is finicky and usually the wrong thing to do. The regular expression that does so is enormous and very complex, so tweaking it is out of the question for any ordinary individual with project time constraints. Further, it isn't at all future-proof, and many forms of mail software allow addresses that deviate from RFC822 in small but significant ways.

Given these points, most developers end up aiming for "good enough" - choosing any one of the possible email-matching regular expressions that are somewhat future-proof against the introduction of new top level domains, short enough to be comprehensible and thus can be tweaked in a reasonable amount of time, and which will match few sorts of invalid non-email-address strings.

Something Simple in ASCII

Here is a compact and straightforward example that should be clear even to those folk with less experience in using regular expressions:

/([^\s]+@[a-z0-9.-]+\.[a-z]{2,6})/i

This will match anything that looks like "user@example.com", provided that the top level domain has between 2 and 6 letters. Note that there are some complexities in what constitutes a valid character in the username section of an email address, so matching a wider-than-valid range of characters there will probably save you some grief in the long term.

The example above is fine if you live in an ASCII world and keep some secondary check in place for validating the domain - e.g. run a DNS lookup on "domain-this-is.not" when someone submits "not-a-real-email@domain-this-is.not".
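
To make the ASCII pattern concrete, here is a rough sketch of pulling every email-like substring out of a block of text with preg_match_all(). The function name find_ascii_emails() and the sample sentence are assumptions for illustration only.

```php
<?php
// find_ascii_emails() is a hypothetical helper name used for illustration.
function find_ascii_emails(string $text): array
{
    // The simple ASCII pattern from above; group 1 captures the address.
    $pattern = '/([^\s]+@[a-z0-9.-]+\.[a-z]{2,6})/i';
    preg_match_all($pattern, $text, $matches);

    // $matches[1] holds the group 1 captures, or an empty array if none.
    return $matches[1];
}

print_r(find_ascii_emails('Write to user@example.com or admin@example.org today.'));
// Yields user@example.com and admin@example.org
```

Because the username part accepts any run of non-whitespace characters, a prefix glued to the address (such as "mailto:") will be swept into the match. That is the price of the loose username matching described above.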

Then Switch to Unicode

We don't, however, live in an ASCII world. All sensible development these days should assume strings to be some form of Unicode - UTF-8 seems to have become the encoding of choice for much PHP framework development, for example. Internationalized domain names and usernames in emails are only to be expected. The regular expression above fails to match non-ASCII characters, and so we should use something more like this instead:

/([^\p{Z}]+@[\p{L}\p{N}.-]+\.[\p{L}]{2,6})/ui

To a first approximation, \p{Z} matches a whitespace character (strictly, a Unicode separator, which excludes tabs and newlines), \p{L} matches any Unicode letter, and \p{N} any Unicode number. There are some complications introduced in practice by the way in which Unicode deals with combining marks such as accents and umlauts, however. Many symbols can be expressed either as a single character (which will be matched by \p{L}) or as a character and combining mark (which will be matched by \p{L}\p{M}* or \X). So we should probably use a regular expression more like this:

/([^\p{Z}]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})/ui

Note that the top level domain matching clause (\p{L}\p{M}*){2,6} is still set up to check for a length of 2 to 6 characters, but here is checking for 2 to 6 instances of a character and any combining marks it may have. For the other parts of the email address, we just throw a match on combining marks into the set of characters to be matched.
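
The difference combining marks make is easier to see with a small experiment. The sketch below runs both Unicode patterns against the same invented address written two ways: once with a precomposed a-umlaut, and once as a plain "a" followed by a combining diaeresis. The helper name matches_email() and the sample address are assumptions for illustration.

```php
<?php
// Requires PHP 7.0+ for the "\u{...}" string escapes used below.
$withoutMarks = '/([^\p{Z}]+@[\p{L}\p{N}.-]+\.[\p{L}]{2,6})/ui';
$withMarks    = '/([^\p{Z}]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})/ui';

// The same address written two ways: precomposed a-umlaut (U+00E4),
// and plain "a" followed by a combining diaeresis (U+0308).
$composed   = "jose@ex\u{00E4}mple.de";
$decomposed = "jose@exa\u{0308}mple.de";

// matches_email() is a hypothetical helper name used for illustration.
function matches_email(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

var_dump(matches_email($withoutMarks, $composed));   // bool(true)
var_dump(matches_email($withoutMarks, $decomposed)); // bool(false)
var_dump(matches_email($withMarks, $composed));      // bool(true)
var_dump(matches_email($withMarks, $decomposed));    // bool(true)
```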

Now Add in Word Boundary Matching

The next thing to think about is word boundaries. The original simple ASCII example offered above should probably look like this:

/([^\s]+@[a-z0-9.-]+\.[a-z]{2,6})\b/i

That will avoid matching against "username@example.comblahblahblah" to obtain "username@example.combla" or other similar possibilities. Of course, your usage may or may not care about word boundaries, but many will. Unfortunately the default word boundary match \b works in potentially confusing ways when Unicode is involved: what it does may vary with your regex implementation, and it will certainly present pitfalls where the definition of a word character differs from your expectation. There is also the matter of email addresses bounded as <username@example.com>. So it might be sensible to use (>|\p{Z}|$) or similar in place of \b, as follows:

/([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)/ui
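
Here is a quick sketch of how that explicit boundary behaves, assuming a hypothetical helper named find_bounded_emails() and some invented sample strings.

```php
<?php
// find_bounded_emails() is a hypothetical helper name used for illustration.
function find_bounded_emails(string $text): array
{
    // Group 1 is the address; the trailing group requires ">", a Unicode
    // separator, or the end of the string immediately after the address.
    $pattern = '/([^\p{Z}<]+@[\p{L}\p{M}\p{N}.-]+\.(\p{L}\p{M}*){2,6})(>|\p{Z}|$)/ui';
    preg_match_all($pattern, $text, $matches);

    return $matches[1];
}

// The angle brackets are left out of the captured address.
print_r(find_bounded_emails('Reply to <username@example.com> please'));

// No match: the run of letters after the dot is longer than 6 and
// nothing acceptable follows any 2 to 6 letter prefix of it.
print_r(find_bounded_emails('username@example.comblahblahblah'));
```

One consequence worth knowing: an address followed directly by a comma or full stop is not matched by this form, since neither character appears in the boundary group. Whether to add punctuation to that group depends on the text you expect to process.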

Using \b and \w with the /u Modifier

If you're working in PHP or otherwise using PCRE as a regular expression engine, you can cut the size of this regular expression by using the /u modifier to ensure that \w and \b match the appropriate Unicode characters - at least in later versions of the engine. Note that the following regular expression isn't exactly the same as the longer one above. For example, \w matches digits and the underscore while \p{L}\p{M}* does not:

/([^\p{Z}<]+@[\w.-]+\.\w{2,6})(>|\b)/ui

So a closer equivalent would be as follows:

/([^\p{Z}<]+@[\w.-]+\.(\p{L}\p{M}*){2,6})(>|\b)/ui

If you are working with another regex engine, however, then using \b and \w is a bad idea unless you know exactly what you are doing - so you may be stuck with the longest and most explicit version above.
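
To illustrate the difference between the two shorter patterns in PHP, the sketch below compares them on a few invented inputs. It assumes a PHP build where the /u modifier also switches \w and \b to Unicode semantics, and the helper name first_email() is hypothetical.

```php
<?php
// Requires PHP 7.1+ (nullable return type, "\u{...}" string escapes).
$loose  = '/([^\p{Z}<]+@[\w.-]+\.\w{2,6})(>|\b)/ui';
$strict = '/([^\p{Z}<]+@[\w.-]+\.(\p{L}\p{M}*){2,6})(>|\b)/ui';

// first_email() is a hypothetical helper: returns the first captured
// address, or null when the pattern does not match.
function first_email(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

// Non-ASCII letters count as word characters under /u.
var_dump(first_email($strict, "<jos\u{00E9}@ex\u{00E4}mple.de>"));

// \w accepts the digit in the top level domain; \p{L}\p{M}* does not.
var_dump(first_email($loose, 'user@example.c0m'));
var_dump(first_email($strict, 'user@example.c0m'));

// The word boundary still rejects an over-long run of letters.
var_dump(first_email($strict, 'username@example.comblahblahblah'));
```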

You'll Still Have to Validate the Domain

However you go about building your regular expression, you will still have to validate the domain provided in the email address. This is one of the reasons why it's perfectly fine to use fairly short and loose regexes to check the validity of emails: the regular expression is only the first line of validation, as it cannot tell you whether or not a domain and its mail server actually exist.
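
A rough sketch of that second line of validation might look like the following. It uses PHP's checkdnsrr() to look for MX or A records, and accepts an optional lookup callable so the logic can be exercised without touching the network. The function name email_domain_resolves() and the injectable $lookup parameter are my own assumptions, and the sketch makes no attempt to convert internationalized domain names to their ASCII form before the lookup.

```php
<?php
// email_domain_resolves() and its $lookup parameter are hypothetical names.
// Requires PHP 7.1+ for the nullable parameter type.
function email_domain_resolves(string $email, ?callable $lookup = null): bool
{
    // Default to a real DNS query: accept the domain if it has an MX
    // record, or failing that an A record.
    $lookup = $lookup ?? function (string $host): bool {
        return checkdnsrr($host, 'MX') || checkdnsrr($host, 'A');
    };

    // Take everything after the last "@" as the domain.
    $at = strrpos($email, '@');
    if ($at === false || $at === strlen($email) - 1) {
        return false;
    }

    return (bool) $lookup(substr($email, $at + 1));
}

// Stub resolver so the example runs offline; omit the second argument
// to query DNS for real.
$stub = function (string $host): bool {
    return $host === 'example.com';
};

var_dump(email_domain_resolves('user@example.com', $stub));                    // bool(true)
var_dump(email_domain_resolves('not-a-real-email@domain-this-is.not', $stub)); // bool(false)
```

A successful lookup only shows that the domain exists, not that the mailbox does.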

Then There is This: "i am a terrible email address"@example.com

Yes, that's a valid email address. If the name is quoted, pretty much anything goes, whitespace included. I haven't seen one of those in the wild yet, but it's only a matter of time, I suppose. The regular expressions above can't handle this form of email address, and I leave it as an exercise for the reader to try forming a matching expression - and then decide whether the possible existence of this thing merits throwing out all of the regular expressions in favor of just checking for the existence of an "@" character and then validating by trying to send an email to the user.
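
For completeness, the throw-it-all-out approach can be sketched in a few lines. The function name looks_like_email() is hypothetical, and requiring at least one character on each side of the "@" is my own small addition to the bare check described above. Actually sending the confirmation email is left out.

```php
<?php
// looks_like_email() is a hypothetical helper name used for illustration.
function looks_like_email(string $input): bool
{
    // Find the last "@" and require something before and after it.
    $at = strrpos($input, '@');

    return $at !== false && $at > 0 && $at < strlen($input) - 1;
}

var_dump(looks_like_email('"i am a terrible email address"@example.com')); // bool(true)
var_dump(looks_like_email('no at sign here'));                             // bool(false)
```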