Matchers

The API Documentation contains an exhaustive list of the matchers available. This chapter describes only the most important ones.

The final section gives some implementation details.

Note

The examples here are fragments that illustrate some small detail. They often include .config.no_full_first_match() so that a partial match can be displayed instead of an error message.

Literal

[API] This matcher identifies a given string. For example, Literal('hello') will give the result “hello” when that text is at the start of a stream:

>>> matcher = Literal('hello')
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['hello']

In many cases it is not necessary to use Literal() explicitly. Most matchers, when they receive a string as a constructor argument, will automatically create a literal match from the given text.

Any

[API] This matcher identifies any single character. It can be restricted to match only characters that appear in a given string. For example:

>>> matcher = Any()
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['h']

>>> matcher = Any('abcdefghijklm')[0:]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['h', 'e', 'l', 'l']

And (&)

[API] This matcher combines other matchers in order. For example:

>>> matcher = And(Any('h'), Any())
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['h', 'e']

All matchers must succeed for And() as a whole to succeed:

>>> matcher = And(Any('a'), Any('b'))
>>> matcher.parse('pq')
[...]
lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at 'q' (line 1, character 2).
>>> matcher.parse('ax')
[...]
lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at '' (line 1, character 3).
>>> matcher.parse('ab')
['a', 'b']

Note

It's worth noting that, because the error message is based on the deepest match, it can sometimes be "off by one" if a character was read but failed to match.

The & operator is equivalent unless a separator is being used:

>>> matcher = Any('a') & Any('b')
>>> matcher.parse('ab')
['a', 'b']

Or (|)

[API] This matcher searches through a list of other matchers to find a successful match. For example:

>>> matcher = Or(Any('x'), Any('h'), Any('z'))
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['h']

The first match found is the one returned:

>>> matcher = Or(Any('h'), Any()[3])
>>> matcher.config.no_full_first_match()
>>> matcher.parse('hello world')
['h']

But subsequent calls return other possibilities:

>>> list(matcher.parse_all('hello world'))
[['h'], ['h', 'e', 'l']]

This shows how Lepl supports “backtracking”: a matcher may be called several times before a result is found that “fits” with the rest of the grammar. All matchers support this behaviour, but it is easiest to see with Or().

The matcher.parse_all() method is similar to matcher.match() introduced in the previous section, but returns only the results (it discards the remaining streams). Using list() converts the iterator returned by the parser into a list that can be displayed.
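
The relationship between a first result and the later alternatives can be sketched without Lepl at all. The following is a toy model, not Lepl's implementation: a "matcher" is assumed to be a plain function that takes a string and yields (results, remaining text) pairs, and the names any_char, repeat_exactly, or_, parse and parse_all are hypothetical helpers chosen to mirror the example above.

```python
# Toy model (not Lepl's code): a matcher is a function that takes a string
# and yields (results, remaining_text) pairs, one per possible match.

def any_char(allowed=None):
    """Match one character, optionally restricted to those in `allowed`."""
    def match(text):
        if text and (allowed is None or text[0] in allowed):
            yield [text[0]], text[1:]
    return match

def repeat_exactly(matcher, count):
    """Match `matcher` exactly `count` times in sequence."""
    def match(text):
        if count == 0:
            yield [], text
            return
        for head, rest in matcher(text):
            for tail, remaining in repeat_exactly(matcher, count - 1)(rest):
                yield head + tail, remaining
    return match

def or_(*matchers):
    """Yield every match of every alternative, in the order given."""
    def match(text):
        for matcher in matchers:
            yield from matcher(text)
    return match

def parse_all(matcher, text):
    """Iterate over the results only, discarding the remaining text."""
    for results, _rest in matcher(text):
        yield results

def parse(matcher, text):
    """Return the first result, or None if nothing matches."""
    return next(parse_all(matcher, text), None)

toy = or_(any_char('h'), repeat_exactly(any_char(), 3))
first = parse(toy, 'hello world')                 # ['h']
everything = list(parse_all(toy, 'hello world'))  # [['h'], ['h', 'e', 'l']]
```

In this sketch parse() is simply the first item of parse_all(), which is why the second alternative only appears when the iterator is asked for more.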

Repeat ([...])

[API] Although Repeat() can be used directly, it’s normal to use the [] array syntax instead (which, when used on a matcher, is automatically translated into Repeat()).

At its simplest, [] indicates that a matcher should repeat a given number of times:

>>> matcher = Any()[3]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('12345')
['1', '2', '3']
>>> list(matcher.parse_all('12345'))
[['1', '2', '3']]

>>> matcher = Any()[3:3]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('12345')
['1', '2', '3']

If only a lower bound to the number of repeats is given the match will be repeated as often as possible:

>>> matcher = Any()[3:]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('12345')
['1', '2', '3', '4', '5']
>>> list(matcher.parse_all('12345'))
[['1', '2', '3', '4', '5'], ['1', '2', '3', '4'], ['1', '2', '3']]

If the match cannot be repeated the requested number of times no result is returned:

>>> matcher = Any()[3:]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('12')
None
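
The ordering of results for an open-ended repeat (longest first, then progressively shorter) falls out naturally from a depth-first generator. The sketch below is a toy model under the assumption that a matcher is a function yielding (results, remaining text) pairs; repeat_at_least and any_char are hypothetical names, and Lepl's own Repeat() is considerably more general.

```python
def any_char(text):
    """Toy matcher: yield the first character and the rest, if there is one."""
    if text:
        yield [text[0]], text[1:]

def repeat_at_least(matcher, minimum):
    """Greedy repetition: longest match first, shorter ones on backtracking."""
    def match(text, count=0):
        # Try to extend the match first, so the longest result comes out first.
        for head, rest in matcher(text):
            for tail, remaining in match(rest, count + 1):
                yield head + tail, remaining
        # Then offer stopping here, provided enough repeats have been seen.
        if count >= minimum:
            yield [], text
    return match

at_least_three = repeat_at_least(any_char, 3)
all_results = [results for results, _rest in at_least_three('12345')]
too_short = [results for results, _rest in at_least_three('12')]
```

Here all_results holds the five-, four- and three-character matches in that order, and too_short is empty because the lower bound is never reached.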

As well as repetition, [] can also indicate that results should be joined together. This is done by adding ...:

>>> matcher = Any()[3, ...]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('12345')
['123']

You can also specify a separator that must occur between repetitions (usually this is used with Drop(), which discards the value):

>>> matcher = Any()[3, ..., Drop('x')]
>>> matcher.config.no_full_first_match()
>>> matcher.parse('1x2x3x4x5')
['123']

Lookahead

[API] This matcher checks whether another matcher — its argument — would succeed, but doesn’t actually match anything. If the argument doesn’t match then it fails, so any following matchers joined with And() will not be called.

For example, to only parse numbers that begin with “2” (specifying a string as matcher is equivalent to using Literal()):

>>> matcher = Lookahead('2') & Integer()
>>> matcher.parse('234')
['234']
>>> matcher.parse('123')
[...]
lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at '23' (line 1, character 2).

When preceded by a ~ the logic is reversed:

>>> matcher = ~Lookahead('2') & Integer()
>>> matcher.parse('234')
[...]
lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at '34' (line 1, character 2).
>>> matcher.parse('123')
['123']

Note

This change in behaviour is specific to Lookahead() — usually ~ applies Drop() as described below.

Drop (~)

[API] This matcher calls another matcher, but discards the results:

>>> (Drop('hello') / 'world').parse('hello world')
[' ', 'world']
>>> (~Literal('hello') / 'world').parse('hello world')
[' ', 'world']

(The single space in the result is from /, which joins two matchers together with optional spaces between them.)

This is different to Lookahead() because the matcher after Drop() receives a stream that has “moved on” to the next part of the input. With Lookahead() the stream is not advanced and so this example will fail:

>>> (Lookahead('hello') / 'world').parse('hello world')
[...]
lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at ' world' (line 1, character 6).

Note

The error message is misleading here because it is based on the deepest match in the stream, which in this case is due to Lookahead().
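
The difference between the two matchers comes down to which stream is handed on. As a minimal sketch (a toy model, not Lepl's code, with matchers assumed to be functions that yield (results, remaining text) pairs and with literal, drop and lookahead as hypothetical names):

```python
def literal(expected):
    """Toy matcher: match `expected` at the start of the text."""
    def match(text):
        if text.startswith(expected):
            yield [expected], text[len(expected):]
    return match

def drop(matcher):
    """Consume whatever `matcher` consumes, but return no results."""
    def match(text):
        for _results, rest in matcher(text):
            yield [], rest
    return match

def lookahead(matcher):
    """Succeed if `matcher` would match, without consuming any text."""
    def match(text):
        for _ in matcher(text):
            yield [], text
            return  # one success is enough
    return match

dropped = list(drop(literal('hello'))('hello world'))
peeked = list(lookahead(literal('hello'))('hello world'))
```

Both return an empty result list, but dropped carries the remaining text ' world' while peeked still carries the whole of 'hello world', so a following matcher for 'world' can only succeed after drop().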

Apply (>, >=, args)

[API] This matcher passes the results of another matcher to a function, then returns the value from the function as a new result:

>>> def show(results):
...     print('results:', results)
...     return results
>>> Apply(Any()[:,...], show).parse('hello world')
results: ['hello world']
[['hello world']]

The > operator is equivalent:

>>> (Any()[:,...] > show).parse('hello world')
results: ['hello world']
[['hello world']]

The returned result is placed in a new list, which is not always what is wanted (it is useful when you want Nested Lists); setting raw=True uses the result directly:

>>> Apply(Any()[:,...], show, raw=True).parse('hello world')
results: ['hello world']
['hello world']
>>> (Any()[:,...] >= show).parse('hello world')
results: ['hello world']
['hello world']

Setting another optional argument, args, to True changes the way the function is called. Instead of passing the results as a single list, each result is passed as a separate argument. This is familiar as the way *args works in Python:

>>> def format3(a, b, c):
...     return 'a: {0}; b: {1}; c: {2}'.format(a, b, c)
>>> Apply(Any()[3], format3, args=True).parse('xyz')
['a: x; b: y; c: z']

There’s no operator equivalent for this, but a little helper function called args() allows > to be reused:

>>> (Any()[3] > args(format3)).parse('xyz')
['a: x; b: y; c: z']
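
The effect of the two flags on the results can be summarised in a few lines of plain Python. This is only a sketch of the behaviour described above, not Lepl's implementation; toy_apply is a hypothetical function that works on an already-matched list of results rather than on a matcher.

```python
def toy_apply(results, function, raw=False, args=False):
    """Mimic how the flags change the call and the returned value."""
    # args=True spreads the results over the function's parameters.
    value = function(*results) if args else function(results)
    # By default the value is wrapped in a new list; raw=True uses it as-is.
    return value if raw else [value]

def identity(results):
    return results

def format3(a, b, c):
    return 'a: {0}; b: {1}; c: {2}'.format(a, b, c)

nested = toy_apply(['hello world'], identity)
direct = toy_apply(['hello world'], identity, raw=True)
spread = toy_apply(['x', 'y', 'z'], format3, args=True)
```

The three values correspond to the outputs shown for Apply(), for raw=True (or >=), and for args=True respectively.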

KApply (**)

[API] This matcher passes the results of another matcher to a function, along with additional information about the match, then returns the value from the function as a new result. Unlike Apply(), this names the arguments as follows:

stream_in
The stream passed to the matcher before matching.
stream_out
The stream returned from the matcher after matching.
results
A list of the results returned.
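
A function suitable for this style of call receives the three values by name. The sketch below is a toy model rather than Lepl's KApply(): matchers are assumed to be functions yielding (results, remaining text) pairs, strings stand in for streams, and toy_kapply, literal and describe are hypothetical names. Wrapping the returned value in a list is an assumption carried over from the default behaviour of Apply().

```python
def literal(expected):
    """Toy matcher: match `expected` at the start of the text."""
    def match(text):
        if text.startswith(expected):
            yield [expected], text[len(expected):]
    return match

def toy_kapply(matcher, function):
    """Call `function` with named arguments describing each match."""
    def match(stream_in):
        for results, stream_out in matcher(stream_in):
            value = function(stream_in=stream_in,
                             stream_out=stream_out,
                             results=results)
            yield [value], stream_out
    return match

def describe(stream_in, stream_out, results):
    """Report what was matched and how much input it consumed."""
    consumed = len(stream_in) - len(stream_out)
    return '{0!r} consumed {1} characters'.format(results, consumed)

described = list(toy_kapply(literal('hello'), describe)('hello world'))
```

Because the arguments are passed by keyword, describe() could declare them in any order.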

Implementation Details

All matchers accept a stream of data and return an iterator over possible ([results], stream) pairs, where the new stream continues from after the matched text (and which may then be passed to another matcher to continue the process of parsing). These iterators are typically implemented as Python generators [*].

A matcher may succeed, but provide no results — the iterator will include a tuple containing an empty list and the new stream. When there are no more possible matches, the iterator will terminate.

Simple matchers will return an iterator containing a single entry. Matchers that return multiple values support backtracking. For example, the Or() generator may yield once for each sub-match in turn (in practice some sub-matchers may return generators that themselves return many values, while others may fail immediately, so it is not a direct one-to-one correspondence).

(It is probably obvious if you have used combinator libraries before, but all matchers implement this same interface, whether they are “fundamental” — do the real work of matching against the stream — or delegate work to other sub–matchers, or modify results. This consistency is the source of their expressive power.)
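
To make the shared interface concrete, here is a minimal sketch in plain Python. It is a toy model, not Lepl's code (it ignores trampolining and uses strings in place of streams), and the names literal, or_ and and_ are hypothetical. Each matcher is a function that takes a stream and yields (results, stream) pairs.

```python
def literal(expected):
    """A "fundamental" matcher: does the real work against the stream."""
    def match(stream):
        if stream.startswith(expected):
            yield [expected], stream[len(expected):]
    return match

def or_(*matchers):
    """Delegate to each alternative in turn, yielding all of their matches."""
    def match(stream):
        for matcher in matchers:
            yield from matcher(stream)
    return match

def and_(first, second):
    """Match `first`, then `second` on whatever stream `first` leaves."""
    def match(stream):
        for first_results, middle in first(stream):
            # If `second` yields nothing here, the outer loop simply asks
            # `first` for its next match: this is backtracking.
            for second_results, rest in second(middle):
                yield first_results + second_results, rest
    return match

grammar = and_(or_(literal('a'), literal('ab')), literal('c'))
matches = list(grammar('abc'))
```

The first alternative, 'a', leaves 'bc' and so cannot be followed by 'c'; the generator then moves on to 'ab', which can, giving the single match ['ab', 'c'] with an empty remaining stream.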

Lepl includes several function decorators that help simplify the creation of new matchers. See Simple Functional Matchers and following sections.

[*] I am intentionally omitting details about trampolining here to focus on the process of matching. A more complete description of the entire implementation can be found in Trampolining.
