'''
Matchers for validating URIs and related objects, taken from RFC3696.

IMPORTANT - the emphasis here is on validation of user input.
These matchers are not exact matches for the underlying specs - they are
just useful practical approximations.  Read RFC3696 to see what I mean
(or the quotes from that doc in the source below).
'''

from re import compile as compile_
from string import ascii_letters, digits, printable, whitespace

from lepl import *


_HEX = digits + 'abcdef' + 'ABCDEF'


def _guarantee_bool(function):
    '''
    A decorator that guarantees a true/false response.
    '''
    def wrapper(*args, **kargs):
        try:
            return bool(function(*args, **kargs))
        except Exception:
            # any failure (including a failed assert) means "not valid"
            return False
    return wrapper
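
# Illustrative sketch (not part of this module): the same "always answer
# True or False" idea as a self-contained decorator.  The names
# guarantee_bool and is_positive_int are hypothetical, chosen only for
# this example.

```python
import functools


def guarantee_bool(function):
    """Wrap `function` so that it always returns True or False."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return bool(function(*args, **kwargs))
        except Exception:
            # A raised error (for example a failed check) counts as False.
            return False
    return wrapper


@guarantee_bool
def is_positive_int(text):
    # int() raises ValueError for non-numeric input; the wrapper turns
    # that into False.
    return int(text) > 0
```

# With this wrapper a caller never needs a try/except around a validator:
# a parse error and an ordinary "no" look the same.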


    return validator


def _LimitLength(matcher, length):
    '''
    Reject a match if it exceeds a certain length.
    '''
    return PostCondition(matcher, lambda results: len(results[0]) <= length)


def _RejectRegexp(matcher, pattern):
    '''
    Reject a match if it matches a (i.e. some other) regular expression.
    '''
    regexp = compile_(pattern)
    return PostCondition(matcher, lambda results: not regexp.match(results[0]))


def _LimitIntValue(matcher, max):
    '''
    Reject a match if the value exceeds some value.
    '''
    return PostCondition(matcher, lambda results: int(results[0]) <= max)


def _LimitCount(matcher, char, max):
    '''
    Reject a match if the number of times a particular character occurs exceeds
    some value.
    '''
    return PostCondition(matcher, lambda results: results[0].count(char) <= max)


def _PreferredFullyQualifiedDnsName():
    '''
    A matcher for DNS names.

    RFC 3696:

      Any characters, or combination of bits (as octets), are permitted in
      DNS names.  However, there is a preferred form that is required by
      most applications.  This preferred form has been the only one
      permitted in the names of top-level domains, or TLDs.  In general, it
      is also the only form permitted in most second-level names registered
      in TLDs, although some names that are normally not seen by users obey
      other rules.  It derives from the original ARPANET rules for the
      naming of hosts (i.e., the "hostname" rule) and is perhaps better
      described as the "LDH rule", after the characters that it permits.
      The LDH rule, as updated, provides that the labels (words or strings
      separated by periods) that make up a domain name must consist of only
      the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen.
      No other symbols or punctuation characters are permitted, nor is
      blank space.  If the hyphen is used, it is not permitted to appear at
      either the beginning or end of a label.  There is an additional rule
      that essentially requires that top-level domain names not be all-
      numeric.
      [...]

      Most internet applications that reference other hosts or systems
      assume they will be supplied with "fully-qualified" domain names,
      i.e., ones that include all of the labels leading to the root,
      including the TLD name.  Those fully-qualified domain names are then
      passed to either the domain name resolution protocol itself or to the
      remote systems.  Consequently, purported DNS names to be used in
      applications and to locate resources generally must contain at least
      one period (".") character.
      [...]

      [...]It is
      likely that the better strategy has now become to make the "at least
      one period" test, to verify LDH conformance (including verification
      that the apparent TLD name is not all-numeric), and then to use the
      DNS to determine domain name validity, rather than trying to maintain
      a local list of valid TLD names.
      [...]

      A DNS label may be no more than 63 octets long.  This is in the form
      actually stored; if a non-ASCII label is converted to encoded
      "punycode" form (see Section 5), the length of that form may restrict
      the number of actual characters (in the original character set) that
      can be accommodated.  A complete, fully-qualified, domain name must
      not exceed 255 octets.
    '''
    # letters and digits, optionally with a hyphen (the "LDH" characters)
    ld = Any(ascii_letters + digits)
    ldh = ld | '-'
    # a label may contain hyphens, but not at either end
    label = ld + Optional(ldh[:] + ld)
    short_label = _LimitLength(label, 63)
    # the TLD must not be all-numeric
    tld = _RejectRegexp(short_label, r'^[0-9]+$')
    # at least one label, then a period, then the TLD
    any_name = short_label[1:, r'\.', ...] + '.' + tld
    non_numeric = _RejectRegexp(any_name, r'^[0-9\.]+$')
    short_name = _LimitLength(non_numeric, 255)
    return short_name
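
# Illustrative sketch (not part of this module): the checks quoted above
# (LDH labels, at least one period, non-numeric TLD, 63 and 255 octet
# limits) written with only the standard library.  The name
# looks_like_dns_name is hypothetical, and this is an approximation of
# the rules, not a replacement for the matcher.

```python
import re

# One label: letters or digits at both ends, hyphens allowed in between.
_LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?')


def looks_like_dns_name(name):
    """Approximate the LDH, period and length checks from RFC 3696."""
    if len(name) > 255:
        return False
    labels = name.split('.')
    if len(labels) < 2:
        # the "at least one period" test
        return False
    for label in labels:
        if len(label) > 63 or not _LABEL.fullmatch(label):
            return False
    # the apparent TLD must not be all-numeric
    return not labels[-1].isdigit()
```

# As the RFC suggests, a name that passes these checks would still need a
# DNS lookup to confirm that it exists.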


def _IpV4Address():
    '''
    A matcher for IPv4 addresses.

    RFC 3696 doesn't say much about these; RFC 2396 doesn't mention limits
    on numerical values, but each octet must be no greater than 255.
    '''
    octet = _LimitIntValue(Any(digits)[1:, ...], 255)
    address = octet[4, '.', ...]
    return address
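
# Illustrative sketch (not part of this module): the same "four decimal
# octets, each no greater than 255" rule using only the standard library.
# The name looks_like_ipv4 is hypothetical.  Like the matcher above, it
# does not reject leading zeros.

```python
from string import digits


def looks_like_ipv4(address):
    """Check for four period-separated decimal values, each <= 255."""
    parts = address.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        # each part needs at least one character, all ASCII digits
        if not part or any(char not in digits for char in part):
            return False
        if int(part) > 255:
            return False
    return True
```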


def _Ipv6Address():
    '''
    A matcher for IPv6 addresses.

    Again, RFC 3696 says little; RFC 2373 (addresses) and 2732 (URLs) have
    much more information:

      1. The preferred form is x:x:x:x:x:x:x:x, where the 'x's are the
         hexadecimal values of the eight 16-bit pieces of the address.
         Examples:

            FEDC:BA98:7654:3210:FEDC:BA98:7654:3210

            1080:0:0:0:8:800:200C:417A

         Note that it is not necessary to write the leading zeros in an
         individual field, but there must be at least one numeral in every
         field (except for the case described in 2.).

      2. Due to some methods of allocating certain styles of IPv6
         addresses, it will be common for addresses to contain long strings
         of zero bits.  In order to make writing addresses containing zero
         bits easier a special syntax is available to compress the zeros.
         The use of "::" indicates multiple groups of 16-bits of zeros.
         The "::" can only appear once in an address.  The "::" can also be
         used to compress the leading and/or trailing zeros in an address.

         For example the following addresses:

            1080:0:0:0:8:800:200C:417A  a unicast address
            FF01:0:0:0:0:0:0:101        a multicast address
            0:0:0:0:0:0:0:1             the loopback address
            0:0:0:0:0:0:0:0             the unspecified addresses

         may be represented as:

            1080::8:800:200C:417A       a unicast address
            FF01::101                   a multicast address
            ::1                         the loopback address
            ::                          the unspecified addresses

      3. An alternative form that is sometimes more convenient when dealing
         with a mixed environment of IPv4 and IPv6 nodes is
         x:x:x:x:x:x:d.d.d.d, where the 'x's are the hexadecimal values of
         the six high-order 16-bit pieces of the address, and the 'd's are
         the decimal values of the four low-order 8-bit pieces of the
         address (standard IPv4 representation).  Examples:

            0:0:0:0:0:0:13.1.68.3

            0:0:0:0:0:FFFF:129.144.52.38

         or in compressed form:

            ::13.1.68.3

            ::FFFF:129.144.52.38
    '''
    piece = Any(_HEX)[1:4, ...]
    preferred = piece[8, ':', ...]

    compact = Or(_LimitCount(piece[1:6, ':', ...] + '::' + piece[1:6, ':', ...],
                             ':', 7),
                 '::' + piece[1:7, ':', ...],
                 piece[1:7, ':', ...] + '::',
                 '::')

    alternate = \
        Or(piece[6, ':', ...] + ':',
           _LimitCount(piece[1:4, ':', ...] + '::' + piece[1:4, ':', ...],
                       ':', 5),
           '::' + piece[1:5, ':', ...] + ':',
           piece[1:5, ':', ...] + '::',
           '::') + _IpV4Address()

    return (preferred | compact | alternate)
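
# Illustrative sketch (not part of this module): the three textual forms
# quoted above can be cross-checked with the standard library's ipaddress
# module, which parses the full, compressed and mixed IPv4 notations.
# The helper names same_ipv6 and is_ipv6 are hypothetical.  ipaddress
# applies its own rules, so it may not agree with the matcher above on
# every input.

```python
import ipaddress

# (long form, compressed form) pairs taken from the RFC 2373 examples
RFC_EXAMPLES = [
    ('1080:0:0:0:8:800:200C:417A', '1080::8:800:200C:417A'),
    ('FF01:0:0:0:0:0:0:101', 'FF01::101'),
    ('0:0:0:0:0:0:0:1', '::1'),
    ('0:0:0:0:0:0:0:0', '::'),
    ('0:0:0:0:0:FFFF:129.144.52.38', '::FFFF:129.144.52.38'),
]


def same_ipv6(first, second):
    """True if both strings denote the same IPv6 address."""
    return ipaddress.IPv6Address(first) == ipaddress.IPv6Address(second)


def is_ipv6(text):
    """True if ipaddress accepts the text as an IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
        return True
    except ValueError:
        return False
```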


    '''
    A matcher for the local part ("username") of an email address.

    RFC 3696:

      Contemporary email addresses consist of a "local part" separated from
      a "domain part" (a fully-qualified domain name) by an at-sign ("@").
      The syntax of the domain part corresponds to that in the previous
      section.  The concerns identified in that section about filtering and
      lists of names apply to the domain names used in an email context as
      well.  The domain name can also be replaced by an IP address in
      square brackets, but that form is strongly discouraged except for
      testing and troubleshooting purposes.

      The local part may appear using the quoting conventions described
      below.  The quoted forms are rarely used in practice, but are
      required for some legitimate purposes.  Hence, they should not be
      rejected in filtering routines but, should instead be passed to the
      email system for evaluation by the destination host.

      The exact rule is that any ASCII character, including control
      characters, may appear quoted, or in a quoted string.  When quoting
      is needed, the backslash character is used to quote the following
      character.
      [...]
      In addition to quoting using the backslash character, conventional
      double-quote characters may be used to surround strings.
      [...]
      Without quotes, local-parts may consist of any combination of
      alphabetic characters, digits, or any of the special characters

         ! # $ % & ' * + - / = ?  ^ _ ` . { | } ~

      period (".") may also appear, but may not be used to start or end the
      local part, nor may two or more consecutive periods appear.  Stated
      differently, any ASCII graphic (printing) character other than the
      at-sign ("@"), backslash, double quote, comma, or square brackets may
      appear without quoting.  If any of that list of excluded characters
      are to appear, they must be quoted.
      [...]
      In addition to restrictions on syntax, there is a length limit on
      email addresses.  That limit is a maximum of 64 characters (octets)
      in the "local part" (before the "@") and a maximum of 255 characters
      (octets) in the domain part (after the "@") for a total length of 320
      characters.  Systems that handle email should be prepared to process
      addresses which are that long, even though they are rarely
      encountered.
    '''
    unescaped_chars = ascii_letters + digits + "!#$%&'*+-/=?^_`.{|}~"
    escapable_chars = unescaped_chars + r'@\",[] '
    quotable_chars = unescaped_chars + r'@\,[] '
    unquoted_string = (('\\' + Any(escapable_chars))
                       | Any(unescaped_chars))[1:, ...]
    quoted_string = '"' + Any(quotable_chars)[1:, ...] + '"'
    local_part = quoted_string | unquoted_string
    # reject a period at either the start or the end (possibly inside quotes)
    no_extreme_dot = _RejectRegexp(local_part, r'"?\.|.*\."?$')
    no_double_dot = _RejectRegexp(no_extreme_dot, r'.*\."*\..*')
    short_local_part = _LimitLength(no_double_dot, 64)
    return short_local_part
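
# Illustrative sketch (not part of this module): the rules for an
# *unquoted* local part (permitted characters, no leading, trailing or
# doubled period, at most 64 characters) using only the standard library.
# The name looks_like_unquoted_local_part is hypothetical.  Quoted and
# backslash-escaped forms are deliberately not handled here.

```python
import re
from string import ascii_letters, digits

_ALLOWED = ascii_letters + digits + "!#$%&'*+-/=?^_`.{|}~"
# One or more characters from the permitted set, and nothing else.
_UNQUOTED = re.compile('[' + re.escape(_ALLOWED) + ']+')


def looks_like_unquoted_local_part(text):
    """Approximate the RFC 3696 rules for an unquoted local part."""
    if len(text) > 64 or not _UNQUOTED.fullmatch(text):
        return False
    if text.startswith('.') or text.endswith('.'):
        return False
    # no two (or more) consecutive periods
    return '..' not in text
```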


    '''
    Generate a validator for emails, according to RFC3696, which returns True
    if the email is valid, and False otherwise.
    '''
    return _matcher_to_validator(_Email)


def _HttpUrl():
    '''
    A matcher for HTTP URLs.

    RFC 3696:

      The following characters are reserved in many URIs -- they must be
      used for either their URI-intended purpose or must be encoded.  Some
      particular schemes may either broaden or relax these restrictions
      (see the following sections for URLs applicable to "web pages" and
      electronic mail), or apply them only to particular URI component
      parts.

         ; / ? : @ & = + $ , ?

      In addition, control characters, the space character, the double-
      quote (") character, and the following special characters

         < > # %

      are generally forbidden and must either be avoided or escaped, as
      discussed below.
      [...]
      When it is necessary to encode these, or other, characters, the
      method used is to replace it with a percent-sign ("%") followed by
      two hexidecimal digits representing its octet value.  See section
      2.4.1 of [RFC2396] for an exact definition.  Unless it is used as a
      delimiter of the URI scheme itself, any character may optionally be
      encoded this way; systems that are testing URI syntax should be
      prepared for these encodings to appear in any component of the URI
      except the scheme name itself.
      [...]
      Absolute HTTP URLs consist of the scheme name, a host name (expressed
      as a domain name or IP address), and optional port number, and then,
      optionally, a path, a search part, and a fragment identifier.  These
      are separated, respectively, by a colon and the two slashes that
      precede the host name, a colon, a slash, a question mark, and a hash
      mark ("#").  So we have

         http://host:port/path?search#fragment

         http://host/path/

         http://host/path#fragment

         http://host/path?search

         http://host

      and other variations on that form.  There is also a "relative" form,
      but it almost never appears in text that a user might, e.g., enter
      into a form.  See [RFC2616] for details.
      [...]
      The characters

         / ; ?

      are reserved within the path and search parts and must be encoded;
      the first of these may be used unencoded, and is often used within
      the path, to designate hierarchy.
    '''
    # printable, non-whitespace characters that may appear unencoded
    path_chars = ''.join(set(printable).difference(set(whitespace))
                                       .difference('/;?<>#%'))
    other_chars = path_chars + '/'
    # either a %XX escape or a permitted literal character
    path_string = ('%' + Any(_HEX)[2, ...] | Any(path_chars))[1:, ...]
    other_string = ('%' + Any(_HEX)[2, ...] | Any(other_chars))[1:, ...]

    host = _IpV4Address() | ('[' + _Ipv6Address() + ']') | \
           _PreferredFullyQualifiedDnsName()

    url = 'http://' + host + \
          Optional(':' + Any(digits)[1:, ...]) + \
          Optional('/' +
                   Optional(path_string[1:, '/', ...] + Optional('/')) +
                   Optional('?' + other_string) +
                   Optional('#' + other_string))

    return url
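
# Illustrative sketch (not part of this module): the component layout
# described above (scheme, host, optional port, path, search part and
# fragment) as seen through the standard library's urllib.parse.urlsplit.
# The name split_http_url is hypothetical.  urlsplit only separates the
# parts; unlike the matcher above, it does not validate their characters.

```python
from urllib.parse import urlsplit


def split_http_url(url):
    """Split an absolute HTTP URL into the parts named in RFC 3696."""
    parts = urlsplit(url)
    return {
        'scheme': parts.scheme,
        'host': parts.hostname,
        'port': parts.port,          # None when no port is given
        'path': parts.path,
        'search': parts.query,
        'fragment': parts.fragment,
    }
```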


    '''
    Generate a validator for HTTP URLs, according to RFC3696, which returns
    True if the URL is valid, and False otherwise.
    '''
    return _matcher_to_validator(_HttpUrl)


    '''
    Generate a validator for MAILTO URLs, according to RFC3696, which
    returns True if the URL is valid, and False otherwise.

    RFC 3696:

      The following characters may appear in MAILTO URLs only with the
      specific defined meanings given.  If they appear in an email address
      (i.e., for some other purpose), they must be encoded:

         :       The colon in "mailto:"

         < > # " % { } | \ ^ ~ `

         These characters are "unsafe" in any URL, and must always be
         encoded.

      The following characters must also be encoded if they appear in a
      MAILTO URL

         ? & =
            Used to delimit headers and their values when these are encoded
            into URLs.
    ----------
    The RFC isn't that great a guide here.  The best approach, I think, is
    to check the URL for "forbidden" characters, then decode it, and finally
    validate the decoded email.  So we implement the validator directly
    (i.e. this is not a matcher).
    '''

    MAIL_TO = 'mailto:'
    # splits the URL into plain text and '%'-prefixed chunks
    encoded_token = compile_('(%.{0,2})')
    email = _Email()
    email.config.compile_to_re().no_memoize()

    @_guarantee_bool
    def validator(url):
        # a failed assert raises, which _guarantee_bool reports as False
        assert url.startswith(MAIL_TO)
        url = url[len(MAIL_TO):]
        # unsafe characters must not appear unencoded
        for char in r':<>#"{}|\^~`':
            assert char not in url
        def unpack(chunk):
            # decode a %XX chunk; pass any other text through unchanged
            if chunk.startswith('%'):
                assert len(chunk) == 3
                return chr(int(chunk[1:], 16))
            else:
                return chunk
        url = ''.join(unpack(chunk) for chunk in encoded_token.split(url))
        assert url
        # finally, validate the decoded address
        return email.parse(url)

    return validator
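
# Illustrative sketch (not part of this module): the "check for forbidden
# characters, then decode" steps described above, using only the standard
# library.  The name decode_mailto is hypothetical.  It stops short of the
# final step (validating the decoded address), and it relies on
# urllib.parse.unquote, which decodes escapes as UTF-8 rather than one
# character per octet.

```python
import re
from urllib.parse import unquote

MAIL_TO = 'mailto:'
UNSAFE = ':<>#"{}|\\^~`'
# A '%' that is not followed by exactly two hexadecimal digits.
_BAD_ESCAPE = re.compile(r'%(?![0-9A-Fa-f]{2})')


def decode_mailto(url):
    """Return the decoded address from a MAILTO URL, or None if malformed."""
    if not url.startswith(MAIL_TO):
        return None
    body = url[len(MAIL_TO):]
    if not body:
        return None
    # unsafe characters must not appear unencoded
    if any(char in body for char in UNSAFE):
        return None
    if _BAD_ESCAPE.search(body):
        return None
    return unquote(body)
```

# The decoded string would then be passed to an email validator.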