
Source Code for Module lepl.apps.rfc3696


# The contents of this file are subject to the Mozilla Public License
# (MPL) Version 1.1 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License
# at http://www.mozilla.org/MPL/
#
# Software distributed under the License is distributed on an "AS IS"
# basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
# the License for the specific language governing rights and
# limitations under the License.
#
# The Original Code is LEPL (http://www.acooke.org/lepl)
# The Initial Developer of the Original Code is Andrew Cooke.
# Portions created by the Initial Developer are Copyright (C) 2009-2010
# Andrew Cooke (andrew@acooke.org). All Rights Reserved.
#
# Alternatively, the contents of this file may be used under the terms
# of the LGPL license (the GNU Lesser General Public License,
# http://www.gnu.org/licenses/lgpl.html), in which case the provisions
# of the LGPL License are applicable instead of those above.
#
# If you wish to allow use of your version of this file only under the
# terms of the LGPL License and not to allow others to use your version
# of this file under the MPL, indicate your decision by deleting the
# provisions above and replace them with the notice and other provisions
# required by the LGPL License.  If you do not delete the provisions
# above, a recipient may use your version of this file under either the
# MPL or the LGPL License.

'''
Matchers for validating URIs and related objects, taken from RFC3696.

IMPORTANT - the emphasis here is on validation of user input.
These matchers are not exact matches for the underlying specs - they are
just useful practical approximations.  Read RFC3696 to see what I mean
(or the quotes from that doc in the source below).
'''

from re import compile as compile_
from string import ascii_letters, digits, printable, whitespace

from lepl import *


_HEX = digits + 'abcdef' + 'ABCDEF'


def _guarantee_bool(function):
    '''
    A decorator that guarantees a true/false response.
    '''
    def wrapper(*args, **kargs):
        try:
            return bool(function(*args, **kargs))
        # any failure, including a failed assertion in the wrapped
        # function, counts as False
        except Exception:
            return False
    return wrapper


def _matcher_to_validator(factory):
    '''
    Generate a validator based on the given matcher factory.
    '''
    matcher = factory()
    matcher.config.compile_to_re().no_memoize()

    @_guarantee_bool
    def validator(value):
        for char in '\n\r':
            assert char not in value
        return matcher.parse(value)

    return validator


def _LimitLength(matcher, length):
    '''
    Reject a match if it exceeds a certain length.
    '''
    return PostCondition(matcher, lambda results: len(results[0]) <= length)

def _RejectRegexp(matcher, pattern):
    '''
    Reject a match if it matches a (ie some other) regular expression
    '''
    regexp = compile_(pattern)
    return PostCondition(matcher, lambda results: not regexp.match(results[0]))

def _LimitIntValue(matcher, max):
    '''
    Reject a match if the value exceeds some value.
    '''
    return PostCondition(matcher, lambda results: int(results[0]) <= max)

def _LimitCount(matcher, char, max):
    '''
    Reject a match if the number of times a particular character occurs exceeds
    some value.
    '''
    return PostCondition(matcher, lambda results: results[0].count(char) <= max)


def _PreferredFullyQualifiedDnsName():
    '''
    A matcher for DNS names.

    RFC 3696:

    Any characters, or combination of bits (as octets), are permitted in
    DNS names.  However, there is a preferred form that is required by
    most applications.  This preferred form has been the only one
    permitted in the names of top-level domains, or TLDs.  In general, it
    is also the only form permitted in most second-level names registered
    in TLDs, although some names that are normally not seen by users obey
    other rules.  It derives from the original ARPANET rules for the
    naming of hosts (i.e., the "hostname" rule) and is perhaps better
    described as the "LDH rule", after the characters that it permits.
    The LDH rule, as updated, provides that the labels (words or strings
    separated by periods) that make up a domain name must consist of only
    the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen.
    No other symbols or punctuation characters are permitted, nor is
    blank space.  If the hyphen is used, it is not permitted to appear at
    either the beginning or end of a label.  There is an additional rule
    that essentially requires that top-level domain names not be
    all-numeric.
    [...]

    Most internet applications that reference other hosts or systems
    assume they will be supplied with "fully-qualified" domain names,
    i.e., ones that include all of the labels leading to the root,
    including the TLD name.  Those fully-qualified domain names are then
    passed to either the domain name resolution protocol itself or to the
    remote systems.  Consequently, purported DNS names to be used in
    applications and to locate resources generally must contain at least
    one period (".") character.
    [...]

    [...] It is
    likely that the better strategy has now become to make the "at least
    one period" test, to verify LDH conformance (including verification
    that the apparent TLD name is not all-numeric), and then to use the
    DNS to determine domain name validity, rather than trying to maintain
    a local list of valid TLD names.
    [...]

    A DNS label may be no more than 63 octets long.  This is in the form
    actually stored; if a non-ASCII label is converted to encoded
    "punycode" form (see Section 5), the length of that form may restrict
    the number of actual characters (in the original character set) that
    can be accommodated.  A complete, fully-qualified, domain name must
    not exceed 255 octets.
    '''
    ld = Any(ascii_letters + digits)
    ldh = ld | '-'
    label = ld + Optional(ldh[:] + ld)
    short_label = _LimitLength(label, 63)
    tld = _RejectRegexp(short_label, r'^[0-9]+$')
    any_name = short_label[1:, r'\.', ...] + '.' + tld
    non_numeric = _RejectRegexp(any_name, r'^[0-9\.]+$')
    short_name = _LimitLength(non_numeric, 255)
    return short_name

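The checks quoted in the docstring above (at least one period, LDH labels of at most 63 octets, a TLD that is not all-numeric, at most 255 octets overall) can be sketched without LEPL. This is only an illustration using the standard `re` module; the names `_LABEL` and `looks_like_fqdn` are hypothetical and are not part of lepl, and the sketch is not claimed to accept exactly the same strings as the matcher above.

```python
import re

# Hypothetical helper, not part of lepl: one LDH label, with no hyphen at
# either end.
_LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?')


def looks_like_fqdn(name):
    """Approximate the RFC 3696 checks quoted above using only `re`."""
    # Whole name at most 255 octets, and at least one period.
    if len(name) > 255 or '.' not in name:
        return False
    labels = name.split('.')
    # Every label must be LDH and at most 63 octets; empty labels fail.
    for label in labels:
        if len(label) > 63 or not _LABEL.fullmatch(label):
            return False
    # The apparent TLD must not be all-numeric.
    return not labels[-1].isdigit()
```

For example, `looks_like_fqdn('example.com')` is accepted, while `'localhost'` (no period) and `'example.123'` (numeric TLD) are rejected.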

def _IpV4Address():
    '''
    A matcher for IPv4 addresses.

    RFC 3696 doesn't say much about these; RFC 2396 doesn't mention limits
    on numerical values, but each of the four values must be at most 255.
    '''
    octet = _LimitIntValue(Any(digits)[1:, ...], 255)
    address = octet[4, '.', ...]
    return address

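The same rule (four dot-separated decimal values, each no larger than 255) can be written in plain Python for comparison. This is a minimal sketch, assuming Python 3.7 or later for `str.isascii`; the name `looks_like_ipv4` is hypothetical and not part of lepl. Like the matcher above, it does not reject leading zeros.

```python
def looks_like_ipv4(text):
    """Four dot-separated runs of ASCII digits, each with value <= 255."""
    parts = text.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # isdigit() is False for the empty string, so '1..2.3' is rejected.
        if not (part.isascii() and part.isdigit()):
            return False
        if int(part) > 255:
            return False
    return True
```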

def _Ipv6Address():
    '''
    A matcher for IPv6 addresses.

    Again, RFC 3696 says little; RFC 2373 (addresses) and 2732 (URLs) have
    much more information:

    1. The preferred form is x:x:x:x:x:x:x:x, where the 'x's are the
       hexadecimal values of the eight 16-bit pieces of the address.
       Examples:

          FEDC:BA98:7654:3210:FEDC:BA98:7654:3210

          1080:0:0:0:8:800:200C:417A

       Note that it is not necessary to write the leading zeros in an
       individual field, but there must be at least one numeral in every
       field (except for the case described in 2.).

    2. Due to some methods of allocating certain styles of IPv6
       addresses, it will be common for addresses to contain long strings
       of zero bits.  In order to make writing addresses containing zero
       bits easier a special syntax is available to compress the zeros.
       The use of "::" indicates multiple groups of 16-bits of zeros.
       The "::" can only appear once in an address.  The "::" can also be
       used to compress the leading and/or trailing zeros in an address.

       For example the following addresses:

          1080:0:0:0:8:800:200C:417A  a unicast address
          FF01:0:0:0:0:0:0:101        a multicast address
          0:0:0:0:0:0:0:1             the loopback address
          0:0:0:0:0:0:0:0             the unspecified addresses

       may be represented as:

          1080::8:800:200C:417A       a unicast address
          FF01::101                   a multicast address
          ::1                         the loopback address
          ::                          the unspecified addresses

    3. An alternative form that is sometimes more convenient when dealing
       with a mixed environment of IPv4 and IPv6 nodes is
       x:x:x:x:x:x:d.d.d.d, where the 'x's are the hexadecimal values of
       the six high-order 16-bit pieces of the address, and the 'd's are
       the decimal values of the four low-order 8-bit pieces of the
       address (standard IPv4 representation).  Examples:

          0:0:0:0:0:0:13.1.68.3

          0:0:0:0:0:FFFF:129.144.52.38

       or in compressed form:

          ::13.1.68.3

          ::FFFF:129.144.52.38
    '''
    piece = Any(_HEX)[1:4, ...]
    preferred = piece[8, ':', ...]

    # we need to be careful about how we match the compressed form, since we
    # have a limit on the total number of pieces.  the simplest approach seems
    # to be to limit the final number of ':' characters, but we must take
    # care to treat the cases where '::' is at one end separately:
    # 1::2:3:4:5:6:7 has 7 ':' characters
    # 1:2:3:4:5:6:7:: has 8 ':' characters
    compact = Or(_LimitCount(piece[1:6, ':', ...] + '::' + piece[1:6, ':', ...],
                             ':', 7),
                 '::' + piece[1:7, ':', ...],
                 piece[1:7, ':', ...] + '::',
                 '::')

    # similar to above, but we need to also be careful about the separator
    # between the v6 and v4 parts
    alternate = \
        Or(piece[6, ':', ...] + ':',
           _LimitCount(piece[1:4, ':', ...] + '::' + piece[1:4, ':', ...],
                       ':', 5),
           '::' + piece[1:5, ':', ...] + ':',
           piece[1:5, ':', ...] + '::',
           '::') + _IpV4Address()

    return (preferred | compact | alternate)


def _EmailLocalPart():
    '''
    A matcher for the local part ("username") of an email address.

    RFC 3696:

    Contemporary email addresses consist of a "local part" separated from
    a "domain part" (a fully-qualified domain name) by an at-sign ("@").
    The syntax of the domain part corresponds to that in the previous
    section.  The concerns identified in that section about filtering and
    lists of names apply to the domain names used in an email context as
    well.  The domain name can also be replaced by an IP address in
    square brackets, but that form is strongly discouraged except for
    testing and troubleshooting purposes.

    The local part may appear using the quoting conventions described
    below.  The quoted forms are rarely used in practice, but are
    required for some legitimate purposes.  Hence, they should not be
    rejected in filtering routines but, should instead be passed to the
    email system for evaluation by the destination host.

    The exact rule is that any ASCII character, including control
    characters, may appear quoted, or in a quoted string.  When quoting
    is needed, the backslash character is used to quote the following
    character.
    [...]
    In addition to quoting using the backslash character, conventional
    double-quote characters may be used to surround strings.
    [...]
    Without quotes, local-parts may consist of any combination of
    alphabetic characters, digits, or any of the special characters

       ! # $ % & ' * + - / = ? ^ _ ` . { | } ~

    period (".") may also appear, but may not be used to start or end the
    local part, nor may two or more consecutive periods appear.  Stated
    differently, any ASCII graphic (printing) character other than the
    at-sign ("@"), backslash, double quote, comma, or square brackets may
    appear without quoting.  If any of that list of excluded characters
    are to appear, they must be quoted.
    [...]
    In addition to restrictions on syntax, there is a length limit on
    email addresses.  That limit is a maximum of 64 characters (octets)
    in the "local part" (before the "@") and a maximum of 255 characters
    (octets) in the domain part (after the "@") for a total length of 320
    characters.  Systems that handle email should be prepared to process
    addresses which are that long, even though they are rarely
    encountered.
    '''
    unescaped_chars = ascii_letters + digits + "!#$%&'*+-/=?^_`.{|}~"
    escapable_chars = unescaped_chars + r'@\",[] '
    quotable_chars = unescaped_chars + r'@\,[] '
    unquoted_string = (('\\' + Any(escapable_chars))
                       | Any(unescaped_chars))[1:, ...]
    quoted_string = '"' + Any(quotable_chars)[1:, ...] + '"'
    local_part = quoted_string | unquoted_string
    no_extreme_dot = _RejectRegexp(local_part, r'"?\..*\."?')
    no_double_dot = _RejectRegexp(no_extreme_dot, r'.*\."*\..*')
    short_local_part = _LimitLength(no_double_dot, 64)
    return short_local_part


def _Email():
    '''
    A matcher for email addresses.
    '''
    return _EmailLocalPart() + '@' + _PreferredFullyQualifiedDnsName()


def Email():
    '''
    Generate a validator for emails, according to RFC3696, which returns True
    if the email is valid, and False otherwise.
    '''
    return _matcher_to_validator(_Email)


def _HttpUrl():
    '''
    A matcher for HTTP URLs.

    RFC 3696:

    The following characters are reserved in many URIs -- they must be
    used for either their URI-intended purpose or must be encoded.  Some
    particular schemes may either broaden or relax these restrictions
    (see the following sections for URLs applicable to "web pages" and
    electronic mail), or apply them only to particular URI component
    parts.

       ; / ? : @ & = + $ , ?

    In addition, control characters, the space character, the
    double-quote (") character, and the following special characters

       < > # %

    are generally forbidden and must either be avoided or escaped, as
    discussed below.
    [...]
    When it is necessary to encode these, or other, characters, the
    method used is to replace it with a percent-sign ("%") followed by
    two hexidecimal digits representing its octet value.  See section
    2.4.1 of [RFC2396] for an exact definition.  Unless it is used as a
    delimiter of the URI scheme itself, any character may optionally be
    encoded this way; systems that are testing URI syntax should be
    prepared for these encodings to appear in any component of the URI
    except the scheme name itself.
    [...]
    Absolute HTTP URLs consist of the scheme name, a host name (expressed
    as a domain name or IP address), and optional port number, and then,
    optionally, a path, a search part, and a fragment identifier.  These
    are separated, respectively, by a colon and the two slashes that
    precede the host name, a colon, a slash, a question mark, and a hash
    mark ("#").  So we have

       http://host:port/path?search#fragment

       http://host/path/

       http://host/path#fragment

       http://host/path?search

       http://host

    and other variations on that form.  There is also a "relative" form,
    but it almost never appears in text that a user might, e.g., enter
    into a form.  See [RFC2616] for details.
    [...]
    The characters

       / ; ?

    are reserved within the path and search parts and must be encoded;
    the first of these may be used unencoded, and is often used within
    the path, to designate hierarchy.
    '''
    path_chars = ''.join(set(printable).difference(set(whitespace))
                                       .difference('/;?<>#%'))
    other_chars = path_chars + '/'
    path_string = ('%' + Any(_HEX)[2, ...] | Any(path_chars))[1:, ...]
    other_string = ('%' + Any(_HEX)[2, ...] | Any(other_chars))[1:, ...]

    host = _IpV4Address() | ('[' + _Ipv6Address() + ']') | \
        _PreferredFullyQualifiedDnsName()

    url = 'http://' + host + \
        Optional(':' + Any(digits)[1:, ...]) + \
        Optional('/' +
                 Optional(path_string[1:, '/', ...] + Optional('/')) +
                 Optional('?' + other_string) +
                 Optional('#' + other_string))

    return url


def HttpUrl():
    '''
    Generate a validator for HTTP URLs, according to RFC3696, which returns
    True if the URL is valid, and False otherwise.
    '''
    return _matcher_to_validator(_HttpUrl)


def MailToUrl():
    '''
    Generate a validator for MAILTO URLs, according to RFC3696, which
    returns True if the URL is valid, and False otherwise.

    RFC 3696:

    The following characters may appear in MAILTO URLs only with the
    specific defined meanings given.  If they appear in an email address
    (i.e., for some other purpose), they must be encoded:

       :       The colon in "mailto:"

       < > # " % { } | \ ^ ~ `

               These characters are "unsafe" in any URL, and must always be
               encoded.

    The following characters must also be encoded if they appear in a
    MAILTO URL

       ? & =
               Used to delimit headers and their values when these are encoded
               into URLs.
    ----------
    The RFC isn't that great a guide here.  The best approach, I think, is
    to check the URL for "forbidden" characters, then decode it, and finally
    validate the decoded email.  So we implement the validator directly (ie
    this is not a matcher).
    '''

    MAIL_TO = 'mailto:'
    encoded_token = compile_('(%.{0,2})')
    email = _Email()
    email.config.compile_to_re().no_memoize()

    @_guarantee_bool
    def validator(url):
        assert url.startswith(MAIL_TO)
        url = url[len(MAIL_TO):]
        for char in r':<>#"{}|\^~`':
            assert char not in url
        def unpack(chunk):
            # a chunk is either plain text or a '%' token; a valid token is
            # '%' followed by exactly two hex digits
            if chunk.startswith('%'):
                assert len(chunk) == 3
                return chr(int(chunk[1:], 16))
            else:
                return chunk
        url = ''.join(unpack(chunk) for chunk in encoded_token.split(url))
        assert url
        return email.parse(url)

    return validator

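The first two steps described in the MailToUrl docstring (reject forbidden characters, then percent-decode) can be tried on their own without LEPL. The sketch below is an assumption-laden illustration, not the library's API: `decode_mailto` is a hypothetical name, it returns `None` instead of relying on assertions, and it stops before the final step of validating the decoded address.

```python
import re

MAIL_TO = 'mailto:'
# Same splitting idea as the validator above: the capturing group keeps the
# '%xx' tokens in the result of split().
_ENCODED = re.compile('(%.{0,2})')
_FORBIDDEN = ':<>#"{}|\\^~`'


def decode_mailto(url):
    """Return the percent-decoded address, or None if a check fails."""
    if not url.startswith(MAIL_TO):
        return None
    rest = url[len(MAIL_TO):]
    if any(char in rest for char in _FORBIDDEN):
        return None
    decoded = []
    for chunk in _ENCODED.split(rest):
        if chunk.startswith('%'):
            # A token must be '%' plus exactly two hex digits.
            if len(chunk) != 3:
                return None
            try:
                decoded.append(chr(int(chunk[1:], 16)))
            except ValueError:
                return None
        else:
            decoded.append(chunk)
    return ''.join(decoded) or None
```

With this sketch, `decode_mailto('mailto:joe%40example.com')` yields `'joe@example.com'`, which would then be handed to an email validator.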