So a friend happened to show me how odd and specific the general email syntax rules are. For instance, emails can have "comments": basically, you can put characters in parentheses that are simply ignored. So not only is email(this seems extremely redundant)@email.com valid, it is the same address as email@email.com.
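As a minimal sketch of the comment rule described above (not the full RFC grammar), a non-greedy regular expression can drop each parenthesised run. The helper name `drop_comments` is made up for illustration, and the sketch assumes comments are not nested and never appear inside quoted strings.

```python
import re

# An opening parenthesis, the shortest possible run of characters, a closing one.
COMMENT = re.compile(r'\(.*?\)')


def drop_comments(address):
    """Return address with every parenthesised comment removed."""
    return COMMENT.sub('', address)


print(drop_comments('email(this seems extremely redundant)@email.com'))
# prints: email@email.com
```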
Now, most email providers have simpler, easier-to-work-with restrictions (like only ASCII letters, digits, dots and dashes). But I thought it'd be a fun exercise to follow the exact guidelines as best I could. I won't delineate every specific here, as I have (hopefully) made it all clear in the code itself.
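To illustrate the kind of simplified rule a provider might apply, here is a rough sketch that accepts only ASCII letters, digits, dots and dashes on each side of a single @. The function name and the exact rule are assumptions for illustration, not any particular provider's policy.

```python
from string import ascii_letters, digits

# Assumed simplified alphabet: ASCII letters, digits, dots and dashes only.
SIMPLE_CHARACTERS = set(ascii_letters + digits + '.-')


def is_simple_address(address):
    """Return True if address fits the simplified, hypothetical rule."""
    if address.count('@') != 1:
        return False
    local, domain = address.split('@')
    # Both parts must be non-empty and drawn from the restricted alphabet.
    return all(part and set(part) <= SIMPLE_CHARACTERS
               for part in (local, domain))
```

Under this rule an address with a comment, such as email(comment)@email.com, is rejected even though the full syntax allows it.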
I did heavily consult the font of all knowledge, Wikipedia, for its summary of the rules.
I'm particularly interested in feedback on how robust I made this, and on how I did the testing and separation of functions. In theory this should be a module people could import and call (though I have no idea when someone would actually want to use it), so I'd like reviews to focus on that. Feedback about better or more efficient methods is, of course, welcome.
"""This module will evaluate whether a string is a valid email or not.
It is based on the criteria laid out in RFC documents, summarised here:
https://en.wikipedia.org/wiki/Email_address#Syntax
Many email providers will restrict these further, but this module is primarily
for testing whether an email is syntactically valid or not.
Calling validate() will run all tests in intelligent order.
Any error found will raise an InvalidEmail error, but this also inherits from
ValueError, so errors can be caught with either of them.
If you're using any other functions, note that some of the tests will return
a modified string for the convenience of how the default tests are structured.
Just calling valid_quotes(string) will work fine, just don't use the assigned
value unless you want the quoted sections removed.
Errors will be raised from the function regardless.
>>> validate("local-part@domain")
>>> validate("[email protected]")
>>> validate("[email protected]")
Traceback (most recent call last):
...
InvalidEmail: Consecutive periods are not permitted.
>>> validate("[email protected]")
>>> validate("[email protected]")
>>> validate("john.smith(comment)@example.com")
>>> validate("(comment)[email protected]")
>>> validate("(comment)john.smith@example(comment).com")
>>> validate('"abcdefghixyz"@example.com')
>>> validate('abc."defghi"[email protected]')
Traceback (most recent call last):
...
InvalidEmail: Local may neither start nor end with a period.
>>> validate('abc."def<>ghi"[email protected]')
Traceback (most recent call last):
...
InvalidEmail: Incorrect double quotes formatting.
>>> validate('abc."def<>ghi"[email protected]')
>>> validate('jsmith@[192.168.2.1]')
>>> validate('jsmith@[192.168.12.2.1]')
Traceback (most recent call last):
...
InvalidEmail: IPv4 domain must have 4 period separated numbers.
>>> validate('jsmith@[IPv6:2001:db8::1]')
>>> validate('john.smith@(comment)example.com')
"""

import re

from string import ascii_letters, digits


HEX_BASE = 16
MAX_ADDRESS_LEN = 256
MAX_LOCAL_LEN = 64
MAX_DOMAIN_LEN = 253
MAX_DOMAIN_SECTION_LEN = 63

MIN_UTF8_CODE = 128
MAX_UTF8_CODE = 65536
MAX_IPV4_NUM = 256

IPV6_PREFIX = 'IPv6:'
VALID_CHARACTERS = ascii_letters + digits + "!#$%&'*+-/=?^_`{|}~"
EXTENDED_CHARACTERS = VALID_CHARACTERS + r' "(),:;<>@[\]'
DOMAIN_CHARACTERS = ascii_letters + digits + '-.'

# Find parenthesised comments (non-greedy, so each comment is matched separately).
COMMENT_PATTERN = re.compile(r'\(.*?\)')
# Find quote enclosed sections, but ignore \" patterns.
QUOTE_PATTERN = re.compile(r'(^(?<!\\)".*?(?<!\\)"$|\.(?<!\\)".*?(?<!\\)"\.)')


class InvalidEmail(ValueError):
    """String is not a valid Email."""


def strip_comments(s):
    """Return s with comments removed.

    Comments in an email address are any characters enclosed in parentheses.
    These are essentially ignored, and do not affect what the address is.

    >>> strip_comments('exam(alammma)ple@e(lectronic)mail.com')
    'example@email.com'"""

    return re.sub(COMMENT_PATTERN, "", s)


def valid_quotes(local):
    r"""Parse a section of the local part that's in double quotation marks.

    There's an extended range of characters permitted inside double quotes.
    Including: "(),:;<>@[\] and space.
    However " and \ must be escaped by a backslash to be valid.

    >>> valid_quotes('"any special characters <>"')
    ''
    >>> valid_quotes('this."is".quoted')
    'this.quoted'
    >>> valid_quotes('this"wrongly"quoted')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting.
    >>> valid_quotes('still."wrong"')
    Traceback (most recent call last):
      ...
    InvalidEmail: Incorrect double quotes formatting."""

    quotes = re.findall(QUOTE_PATTERN, local)
    if not quotes and '"' in local:
        raise InvalidEmail("Incorrect double quotes formatting.")

    for quote in quotes:
        if any(char not in EXTENDED_CHARACTERS for char in quote.strip('.')):
            raise InvalidEmail("Invalid characters used in quotes.")

        # Remove valid escape characters, and see if any invalid ones remain
        stripped = quote.replace('\\\\', '').replace('\\"', '"').strip('".')
        if '\\' in stripped:
            raise InvalidEmail('\\ must be paired with " or another \\.')
        if '"' in stripped:
            raise InvalidEmail('Unescaped " found.')

        # Test if start and end are both periods
        # If so, one of them should be removed to prevent double quote errors
        if quote.endswith('.'):
            quote = quote[:-1]
        local = local.replace(quote, '')
    return local


def valid_period(local):
    """Raise error for invalid period, return local without any periods.

    Raises InvalidEmail if local starts or ends with a period or
    if local has consecutive periods.

    >>> valid_period('example.email')
    'exampleemail'
    >>> valid_period('.example')
    Traceback (most recent call last):
      ...
    InvalidEmail: Local may neither start nor end with a period."""

    if local.startswith('.') or local.endswith('.'):
        raise InvalidEmail("Local may neither start nor end with a period.")

    if '..' in local:
        raise InvalidEmail("Consecutive periods are not permitted.")

    return local.replace('.', '')


def valid_local_characters(local):
    """Raise error if char isn't in VALID_CHARACTERS or the UTF8 code range"""

    if any(not MIN_UTF8_CODE <= ord(char) <= MAX_UTF8_CODE
           and char not in VALID_CHARACTERS for char in local):
        raise InvalidEmail("Invalid character in local.")


def valid_local(local):
    """Raise error if any syntax rules are broken in the local part."""

    local = valid_quotes(local)
    local = valid_period(local)
    valid_local_characters(local)


def valid_domain_lengths(domain):
    """Raise error if the domain or any section of it is too long.

    >>> valid_domain_lengths('long.' * 52)
    Traceback (most recent call last):
      ...
    InvalidEmail: Domain length must not exceed 253 characters.
    >>> valid_domain_lengths('proper.example.com')"""

    if len(domain.rstrip('.')) > MAX_DOMAIN_LEN:
        raise InvalidEmail("Domain length must not exceed {} characters."
                           .format(MAX_DOMAIN_LEN))

    sections = domain.split('.')
    # Each section must be 1 to MAX_DOMAIN_SECTION_LEN characters long.
    # (A chained `1 > len(section) > MAX` can never be true, so test the range.)
    if any(not 1 <= len(section) <= MAX_DOMAIN_SECTION_LEN
           for section in sections):
        raise InvalidEmail("Invalid section length between domain periods.")


def valid_ipv4(ip):
    """Raise error if ip doesn't match IPv4 syntax rules.

    IPv4 is in the format xxx.xxx.xxx.xxx
    Where each xxx is a number 0 - 255 (with no leading zeroes).

    >>> valid_ipv4('255.12.1.12')
    >>> valid_ipv4('256.12.1.312')
    Traceback (most recent call last):
      ...
    InvalidEmail: IPv4 domain must be numbers 0-255 and periods only"""

    numbers = ip.split('.')
    if len(numbers) != 4:
        raise InvalidEmail("IPv4 domain must have 4 period separated numbers.")
    try:
        # MAX_IPV4_NUM is 256, so it acts as an exclusive upper bound here.
        if any(not 0 <= int(num) < MAX_IPV4_NUM for num in numbers):
            raise InvalidEmail
    except ValueError:
        # InvalidEmail subclasses ValueError, so out-of-range numbers land here too.
        raise InvalidEmail("IPv4 domain must be numbers 0-255 and periods only")


def valid_ipv6(ip):
    """Raise error if ip doesn't match IPv6 syntax rules.

    IPv6 is in the format xxxx:xxxx::xxxx::xxxx
    Where each xxxx is a hexcode, though they can be 0-4 characters inclusive.

    Additionally there can be empty spaces, and codes can be omitted entirely
    if they are just 0 (or 0000). To accommodate this, validation just checks
    for valid hex codes, and ensures that lengths never exceed max values.
    But no minimums are enforced.

    >>> valid_ipv6('314::ac5:1:bf23:412')
    >>> valid_ipv6('IPv6:314::ac5:1:bf23:412')
    >>> valid_ipv6('314::ac5:1:bf23:412g')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domain: '412g' is invalid hex value.
    >>> valid_ipv6('314::ac5:1:bf23:314::ac5:1:bf23:314::ac5:1:bf23:41241')
    Traceback (most recent call last):
      ...
    InvalidEmail: Invalid IPv6 domain"""

    if ip.startswith(IPV6_PREFIX):
        ip = ip.replace(IPV6_PREFIX, '')
    hex_codes = ip.split(':')
    if len(hex_codes) > 8 or any(len(code) > 4 for code in hex_codes):
        raise InvalidEmail("Invalid IPv6 domain")

    for code in hex_codes:
        try:
            # Empty codes come from '::' and are skipped.
            if code:
                int(code, HEX_BASE)
        except ValueError:
            raise InvalidEmail("Invalid IPv6 domain: '{}' is invalid hex value."
                               .format(code))


def valid_domain_characters(domain):
    """Raise error if any invalid characters are used in domain."""

    if any(char not in DOMAIN_CHARACTERS for char in domain):
        raise InvalidEmail("Invalid character in domain.")


def valid_domain(domain):
    """Raise error if domain is neither a valid domain nor IP.

    Domains (sections after the @) can be either a traditional domain or an IP
    wrapped in square brackets. The IP can be IPv4 or IPv6.
    All these possibilities are accounted for."""

    # Check if it's an IP literal
    if domain.startswith('[') and domain.endswith(']'):
        ip = domain[1:-1]
        if '.' in ip:
            valid_ipv4(ip)
        elif ':' in ip:
            valid_ipv6(ip)
        else:
            raise InvalidEmail("IP domain not in either IPv4 or IPv6 format.")
    else:
        valid_domain_lengths(domain)


def validate(address):
    """Raises an error if address is an invalid email string."""

    try:
        local, domain = strip_comments(address).split('@')
    except ValueError:
        raise InvalidEmail("Address must have one '@' only.")

    if len(local) > MAX_LOCAL_LEN:
        raise InvalidEmail("Only {} characters allowed before the @"
                           .format(MAX_LOCAL_LEN))
    if len(domain) > MAX_ADDRESS_LEN:
        raise InvalidEmail("Only {} characters allowed in address"
                           .format(MAX_ADDRESS_LEN))

    valid_local(strip_comments(local))
    valid_domain(strip_comments(domain))


if __name__ == "__main__":
    import doctest
    doctest.testmod()
    raw_input('>DONE<')
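
As the module docstring says, InvalidEmail inherits from ValueError, so a caller can catch either one. A minimal sketch of that calling pattern follows; `validate_stub` and `is_valid` are hypothetical names, and the stub checks only the '@' count so that the example runs on its own rather than depending on the module above.

```python
class InvalidEmail(ValueError):
    """Stand-in for the module's exception of the same name."""


def validate_stub(address):
    """Hypothetical placeholder that only checks for exactly one '@'."""
    if address.count('@') != 1:
        raise InvalidEmail("Address must have one '@' only.")


def is_valid(address):
    """Return False instead of raising, catching the broader ValueError."""
    try:
        validate_stub(address)
    except ValueError:  # also catches InvalidEmail, which is a subclass
        return False
    return True
```

Catching ValueError keeps calling code independent of the module's own exception class, while catching InvalidEmail would avoid swallowing unrelated ValueErrors.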
IndentationError), but I suspect that it might fail even on some of the more simple examples from RFC3696.
quoted-string can only contain FWS between the quotes, not CFWS, so anything that looks like a comment inside a quoted-string isn't a comment, and shouldn't be removed. Something similar is true for domain-literals inside square brackets. Neither is likely to have much real-world impact, but if you want to be absolutely correct you might want to think about how to handle that.