Skip to content

Update php_filter_validate_email regex #243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Update php_filter_validate_email regex #243

wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Dec 21, 2012

Updating to more accurate and optimized regex.

a@b format email address also re-allowed as they are valid:

RFC 5321 Section 2.3.5:

A domain name (or often just a "domain") consists of one or more
   components, separated by dots if more than one appears.  In the case
   of a top-level domain used by itself in an email address, a single
   string is used without any dots.

...

Only resolvable, fully-qualified domain names (FQDNs) are permitted
   when domain names are used in SMTP.  In other words, names that can
   be resolved to MX RRs or address (i.e., A or AAAA) RRs (as discussed
   in Section 5) are permitted, as are CNAME RRs whose targets can be
   resolved, in turn, to MX or address RRs.

ac, uz, and ai are all FQDNs, resolving to A RRs. Additionally, ai resolves to MX RRs. This meets the requirements of RFC 5321 (which even explicitly mentions TLD-only domain names).

Additionally, changed the max length from the incorrect 320 to the correct 254 (although the regex itself does this while also taking into account the semantically invisible backslashes of a quoted pair).

Updating to more accurate and optimized regex.

a@b format email address also re-allowed as they are valid:

RFC 5321 Section 2.3.5:

A domain name (or often just a "domain") consists of one or more
   components, separated by dots if more than one appears.  In the case
   of a top-level domain used by itself in an email address, a single
   string is used without any dots.

...

Only resolvable, fully-qualified domain names (FQDNs) are permitted
   when domain names are used in SMTP.  In other words, names that can
   be resolved to MX RRs or address (i.e., A or AAAA) RRs (as discussed
   in Section 5) are permitted, as are CNAME RRs whose targets can be
   resolved, in turn, to MX or address RRs.

ac, uz, and ai are all FQDNs, resolving to A RRs. Additionally, ai resolves to MX RRs. This meets the requirements of RFC 5321 (which even explicitly mentions TLD-only domain names).

Additionally, changed the max length from the incorrect 320 to the correct 254.
@nikic
Copy link
Member

nikic commented Dec 22, 2012

Do you have the regex in a formatted form so it is easier to review?

Btw, even though a@b is a valid mail address as per the RFCs, it is not a valid mail for most practical purposes. Quoting Rasmus from https://bugs.php.net/bug.php?id=49576:

I am not disagreeing that local domains are invalid per the RFC, but I do think
that in most cases Web apps probably don't have a use for these cases since they
don't resolve outside of the local environment. I suppose some Intranet web apps
would find this useful, but the bulk of Internet apps would need to add a second
check to make sure that it wasn't a non external SMTP-able address that
validated. I would suggest that the few cases where you do want local single-
domain addresses to validate you add a simple check in front of filter_var. They
are easy to check for.

I don't think we should change this behavior. Maybe add a flag to toggle the behavior?

Edit: Just realized that you are not talking about local domain names, but about TLD-only domains. That does make the issue harder.

@ghost
Copy link
Author

ghost commented Dec 22, 2012

Yes; there's a breakdown of each component at http://squiloople.com/2009/12/20/email-address-validation/.

To disallow TLD-only domain names (which I strongly disagree with as they are perfectly valid), we could use this:

'/^(?!(?>"?(?>\\\[ -~]|[^"])"?){255,})(?!"?(?>\\\[ -~]|[^"]){65,}"?@)(?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))*|"(?>[ !#-\[\]-~]|\\\[ -~])*")@(?!.*[^.]{64,})(?>([a-z0-9](?>[a-z0-9-]*[a-z0-9])?)(?>\.(?2)){1,126}|\[(?:(?>IPv6:(?>([a-f0-9]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f0-9][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?))|(?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f0-9]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))?(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(?>\.(?6)){3}))\])$/iD'

({0,126} changed to {1,126} at offset 198)

There's also a function there that validates according to RFC 5322 (allows obsolete local-parts and comments/folding white spaces). However, as (almost) any obsolete local-part can be represented as a single quoted sting, and as CFWS is semantically invisible, these are unnecessary additions. Perhaps we could add a flag to determine which of RFC 5321 and 5322 is used (like how the IP address filter has a flag to determine whether or not to allow IPv6 addresses)?

Edit: By formatted, did you mean like this:

'/^ (?!(?>"?(?>\\\[ -~]|[^"])"?){255,}) # Limit address to 254 characters (?!"?(?>\\\[ -~]|[^"]){65,}"?@) # Limit local-part to 64 characters (?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))* # Dot-atom local-part | "(?>[ !#-\[\]-~]|\\\[ -~])*") # Quoted-string local-part @ (?!.*[^.]{64,}) # Limit labels to 63 characters (?>([a-z0-9](?>[a-z0-9-]*[a-z0-9])?)(?>\.(?2)){0,126} # Domain-name | \[(?:(?>IPv6:(?>([a-f0-9]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f0-9][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?)) # IPv6 address | (?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f0-9]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))? # Compressed IPv6 address (25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(?>\.(?6)){3}))\]) # IPv4 address $/ixD'

@lstrojny
Copy link
Contributor

@MichaelRushton it’s a really good idea to make email validation much stricter. To make the regex more readable, consider using macros to concatinate the different parts. Something like this:

#define EMAIL_REGEX_ADDRESS_PART (?!(?>"?(?>\\\[ -~]|[^"])"?){255,})
#define EMAIL_REGEX_LOCAL_PART (?!"?(?>\\\[ -~]|[^"]){65,}"?@)
...
#define EMAIL_REGEX /^ EMAIL_REGEX_ADDDRESS_PART EMAIL_REGEX_LOCAL_PART ... $/ixD

What do you think?

@ghost
Copy link
Author

ghost commented Dec 22, 2012

Sounds like a good idea. However, I know nothing about the C language so that's probably best left to a better programmer.

@lstrojny
Copy link
Contributor

lstrojny commented Jan 6, 2013

I don’t have time to do it (note: the current patch doesn’t even compile on my machine). Any takers?

@Synchro
Copy link
Contributor

Synchro commented Jan 15, 2014

A few things that I mentioned on Michael's blog some time ago. The RFCs do say that single-word domain parts are ok, so long as they are routable domains; at the same time they say that local aliases are specifically not valid. There's no way of testing that within a regex - you have to do a DNS lookup to tell.

ICANN now mandates (following on from their earlier study) that email to dotless domains is contractually prohibited for all new gTLDs (including the 2000 or so new ones appearing at the moment), and is effectively deprecated for all other domains. On those pages they have a couple of other studies which highlight various usability and security issues with allowing dotless domains.

Regarding the WHATWG HTML5 email validation pattern that Rasmus mentioned in the comments, I fixed that so it is no longer an overlapping set with the RFCs, but a subset, as it should be.

As for practicality, I run an ESP (smartmessages.net), and historically we have allowed dotless domains. We've never seen a valid one out of 200 million or so addresses we have handled, but have rejections for tens of thousands of bogus ones (mainly users typing things like 'asdf@asdf' to make a validator shut up), so any dotless domain seen in the wild is far more likely to be bogus than not.

On balance I think we are better off not allowing them, so the PHP validator is fine as-is.

@php-pulls
Copy link

Comment on behalf of krakjoe at php.net:

Further to the discussion that took place here, particularly the point that this is not reasonably doable with regex alone, and that validation should be left as is, I'm closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants