-
Notifications
You must be signed in to change notification settings - Fork 7.8k
Update php_filter_validate_email regex #243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updating to more accurate and optimized regex. a@b format email address also re-allowed as they are valid: RFC 5321 Section 2.3.5: A domain name (or often just a "domain") consists of one or more components, separated by dots if more than one appears. In the case of a top-level domain used by itself in an email address, a single string is used without any dots. ... Only resolvable, fully-qualified domain names (FQDNs) are permitted when domain names are used in SMTP. In other words, names that can be resolved to MX RRs or address (i.e., A or AAAA) RRs (as discussed in Section 5) are permitted, as are CNAME RRs whose targets can be resolved, in turn, to MX or address RRs. ac, uz, and ai are all FQDNs, resolving to A RRs. Additionally, ai resolves to MX RRs. This meets the requirements of RFC 5321 (which even explicitly mentions TLD-only domain names). Additionally, changed the max length from the incorrect 320 to the correct 254.
Do you have the regex in a formatted form so it is easier to review? Btw, even though a@b is a valid mail address as per the RFCs, it is not a valid mail for most practical purposes. Quoting Rasmus from https://bugs.php.net/bug.php?id=49576:
I don't think we should change this behavior. Maybe add a flag to toggle the behavior? Edit: Just realized that you are not talking about local domain names, but about TLD-only domains. That does make the issue harder. |
Yes; there's a breakdown of each component at http://squiloople.com/2009/12/20/email-address-validation/. To disallow TLD-only domain names (which I strongly disagree with as they are perfectly valid), we could use this:
( There's also a function there that validates according to RFC 5322 (allows obsolete local-parts and comments/folding white spaces). However, as (almost) any obsolete local-part can be represented as a single quoted sting, and as CFWS is semantically invisible, these are unnecessary additions. Perhaps we could add a flag to determine which of RFC 5321 and 5322 is used (like how the IP address filter has a flag to determine whether or not to allow IPv6 addresses)? Edit: By formatted, did you mean like this:
|
@MichaelRushton it’s a really good idea to make email validation much stricter. To make the regex more readable, consider using macros to concatinate the different parts. Something like this: #define EMAIL_REGEX_ADDRESS_PART (?!(?>"?(?>\\\[ -~]|[^"])"?){255,})
#define EMAIL_REGEX_LOCAL_PART (?!"?(?>\\\[ -~]|[^"]){65,}"?@)
...
#define EMAIL_REGEX /^ EMAIL_REGEX_ADDDRESS_PART EMAIL_REGEX_LOCAL_PART ... $/ixD What do you think? |
Sounds like a good idea. However, I know nothing about the C language so that's probably best left to a better programmer. |
I don’t have time to do it (note: the current patch doesn’t even compile on my machine). Any takers? |
A few things that I mentioned on Michael's blog some time ago. The RFCs do say that single-word domain parts are ok, so long as they are routable domains; at the same time they say that local aliases are specifically not valid. There's no way of testing that within a regex - you have to do a DNS lookup to tell. ICANN now mandates (following on from their earlier study) that email to dotless domains is contractually prohibited for all new gTLDs (including the 2000 or so new ones appearing at the moment), and is effectively deprecated for all other domains. On those pages they have a couple of other studies which highlight various usability and security issues with allowing dotless domains. Regarding the WHATWG HTML5 email validation pattern that Rasmus mentioned in the comments, I fixed that so it is no longer an overlapping set with the RFCs, but a subset, as it should be. As for practicality, I run an ESP (smartmessages.net), and historically we have allowed dotless domains. We've never seen a valid one out of 200 million or so addresses we have handled, but have rejections for tens of thousands of bogus ones (mainly users typing things like 'asdf@asdf' to make a validator shut up), so any dotless domain seen in the wild is far more likely to be bogus than not. On balance I think we are better off not allowing them, so the PHP validator is fine as-is. |
Comment on behalf of krakjoe at php.net: Further to the discussion that took place here, particularly the point that this is not reasonably doable with regex alone, and that validation should be left as is, I'm closing this PR. |
Updating to more accurate and optimized regex.
a@b format email address also re-allowed as they are valid:
RFC 5321 Section 2.3.5:
A domain name (or often just a "domain") consists of one or more
components, separated by dots if more than one appears. In the case
of a top-level domain used by itself in an email address, a single
string is used without any dots.
...
Only resolvable, fully-qualified domain names (FQDNs) are permitted
when domain names are used in SMTP. In other words, names that can
be resolved to MX RRs or address (i.e., A or AAAA) RRs (as discussed
in Section 5) are permitted, as are CNAME RRs whose targets can be
resolved, in turn, to MX or address RRs.
ac, uz, and ai are all FQDNs, resolving to A RRs. Additionally, ai resolves to MX RRs. This meets the requirements of RFC 5321 (which even explicitly mentions TLD-only domain names).
Additionally, changed the max length from the incorrect 320 to the correct 254 (although the regex itself does this while also taking into account the semantically invisible backslashes of a quoted pair).