.. index:: matchers, no_full_first_match()
.. _matchers:

Matchers
========

The `API Documentation `_ contains an exhaustive list of the matchers
available. This chapter describes only the most important. The final
section gives some `implementation details`_.

.. note::

   The examples here are fragments that illustrate some small detail. They
   often include `.config.no_full_first_match() `_ so that a partial match
   can be displayed instead of an error message.


.. index:: Literal()

Literal
-------

`[API] `_
This matcher identifies a given string. For example,
`Literal('hello') `_ will give the result "hello" when that text is at the
start of a stream::

  >>> matcher = Literal('hello')
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['hello']

In many cases it is not necessary to use `Literal() `_ explicitly. Most
matchers, when they receive a string as a constructor argument, will
automatically create a literal match from the given text.


.. index:: Any()

Any
---

`[API] `_
This matcher identifies any single character. It can be restricted to match
only characters that appear in a given string. For example::

  >>> matcher = Any()
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['h']

  >>> matcher = Any('abcdefghijklm')[0:]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['h', 'e', 'l', 'l']


.. index:: And(), &

And (&)
-------

`[API] `_
This matcher combines other matchers in order. For example::

  >>> matcher = And(Any('h'), Any())
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['h', 'e']

All matchers must succeed for `And() `_ as a whole to succeed::

  >>> matcher = And(Any('a'), Any('b'))
  >>> matcher.parse('pq')
  [...]
  lepl.stream.maxdepth.FullFirstMatchException: The match failed in at 'q' (line 1, character 2).

  >>> matcher.parse('ax')
  [...]
  lepl.stream.maxdepth.FullFirstMatchException: The match failed in at '' (line 1, character 3).

  >>> matcher.parse('ab')
  ['a', 'b']

.. note::

   Because the error message is based on the deepest match, it can
   sometimes be "off by one" if a character was read but failed to match.

The ``&`` operator is equivalent unless a :ref:`separator ` is being used::

  >>> matcher = Any('a') & Any('b')
  >>> matcher.parse('ab')
  ['a', 'b']


.. index:: Or(), |, parse_all()

Or (|)
------

`[API] `_
This matcher searches through a list of other matchers to find a successful
match. For example::

  >>> matcher = Or(Any('x'), Any('h'), Any('z'))
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['h']

The first match found is the one returned::

  >>> matcher = Or(Any('h'), Any()[3])
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('hello world')
  ['h']

But subsequent calls return other possibilities::

  >>> list(matcher.parse_all('hello world'))
  [['h'], ['h', 'e', 'l']]

This shows how Lepl supports "backtracking" --- a matcher may be called
several times before a result is found that "fits" with the rest of the
grammar. All matchers support this behaviour, but it is easiest to see with
`Or() `_.

The `matcher.parse_all() `_ method is similar to `matcher.match() `_
introduced in the previous section, but returns only the results (it
discards the remaining streams). Using ``list()`` converts the iterator
returned by the parser into a list that can be displayed.


.. index:: Repeat(), [], backtracking, breadth-first, depth-first
.. _repeat:

Repeat ([...])
--------------

`[API] `_
Although `Repeat() `_ can be used directly, it's normal to use the ``[]``
array syntax instead (which, when used on a matcher, is automatically
translated into `Repeat() `_).

At its simplest, ``[]`` indicates that a matcher should repeat a given
number of times::

  >>> matcher = Any()[3]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('12345')
  ['1', '2', '3']
  >>> list(matcher.parse_all('12345'))
  [['1', '2', '3']]

  >>> matcher = Any()[3:3]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('12345')
  ['1', '2', '3']

If only a lower bound on the number of repeats is given, the match will be
repeated as often as possible::

  >>> matcher = Any()[3:]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('12345')
  ['1', '2', '3', '4', '5']
  >>> list(matcher.parse_all('12345'))
  [['1', '2', '3', '4', '5'], ['1', '2', '3', '4'], ['1', '2', '3']]

If the match cannot be repeated the requested number of times, no result is
returned::

  >>> matcher = Any()[3:]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('12')
  None

As well as repetition, ``[]`` can also indicate that results should be
joined together. This is done by adding ``...``::

  >>> matcher = Any()[3, ...]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('12345')
  ['123']

You can also specify a separator that must occur between repetitions
(usually this is used with `Drop() `_, which discards the value)::

  >>> matcher = Any()[3, ..., Drop('x')]
  >>> matcher.config.no_full_first_match()
  >>> matcher.parse('1x2x3x4x5')
  ['123']


.. index:: Lookahead(), ~
.. _lookahead:

Lookahead
---------

`[API] `_
This matcher checks whether another matcher --- its argument --- would
succeed, but doesn't actually match anything. If the argument doesn't match
then it fails, so any following matchers joined with `And() `_ will not be
called.

For example, to only parse numbers that begin with "2" (specifying a string
as matcher is equivalent to using `Literal() `_)::

  >>> matcher = Lookahead('2') & Integer()
  >>> matcher.parse('234')
  ['234']

  >>> matcher.parse('123')
  [...]
  lepl.stream.maxdepth.FullFirstMatchException: The match failed in at '23' (line 1, character 2).

When preceded by a ``~`` the logic is reversed::

  >>> matcher = ~Lookahead('2') & Integer()
  >>> matcher.parse('234')
  [...]
  lepl.stream.maxdepth.FullFirstMatchException: The match failed in at '34' (line 1, character 2).

  >>> matcher.parse('123')
  ['123']

.. note::

   This change in behaviour is specific to `Lookahead() `_ --- usually
   ``~`` applies `Drop() `_ as described below.


.. index:: Drop(), ~

Drop (~)
--------

`[API] `_
This matcher calls another matcher, but discards the results::

  >>> (Drop('hello') / 'world').parse('hello world')
  [' ', 'world']

  >>> (~Literal('hello') / 'world').parse('hello world')
  [' ', 'world']

(The space in the result comes from ``/``, which joins two matchers
together with optional spaces between them.)

This is different to `Lookahead() `_ because the matcher after `Drop() `_
receives a stream that has "moved on" to the next part of the input. With
`Lookahead() `_ the stream is not advanced and so this example will fail::

  >>> (Lookahead('hello') / 'world').parse('hello world')
  [...]
  lepl.stream.maxdepth.FullFirstMatchException: The match failed in at ' world' (line 1, character 6).

.. note::

   The error message is misleading here because it is based on the deepest
   match in the stream, which in this case is due to `Lookahead() `_.


.. index:: Apply(), >, >=, args()

Apply (>, >=, args)
-------------------

.. note::

   See also :ref:`faq_apply`

`[API] `_
This matcher passes the results of another matcher to a function, then
returns the value from the function as a new result::

  >>> def show(results):
  ...     print('results:', results)
  ...     return results
  >>> Apply(Any()[:,...], show).parse('hello world')
  results: ['hello world']
  [['hello world']]

The ``>`` operator is equivalent::

  >>> (Any()[:,...] > show).parse('hello world')
  results: ['hello world']
  [['hello world']]

The returned result is placed in a new list, which is not always what is
wanted (it is useful when you want :ref:`nestedlists`); setting
``raw=True`` uses the result directly::

  >>> Apply(Any()[:,...], show, raw=True).parse('hello world')
  results: ['hello world']
  ['hello world']

  >>> (Any()[:,...] >= show).parse('hello world')
  results: ['hello world']
  ['hello world']

Setting another optional argument, `args `_, to ``True`` changes the way
the function is called. Instead of passing the results as a single list,
each is treated as a separate argument. This is familiar as the way
``*args`` works in Python::

  >>> def format3(a, b, c):
  ...     return 'a: {0}; b: {1}; c: {2}'.format(a, b, c)
  >>> Apply(Any()[3], format3, args=True).parse('xyz')
  ['a: x; b: y; c: z']

There's no operator equivalent for this, but a little helper function
called `args() `_ allows ``>`` to be reused::

  >>> (Any()[3] > args(format3)).parse('xyz')
  ['a: x; b: y; c: z']


.. index:: **

KApply (**)
-----------

`[API] `_
This matcher passes the results of another matcher to a function, along
with additional information about the match, then returns the value from
the function as a new result. Unlike `Apply() `_, this names the arguments
as follows:

  stream_in
    The stream passed to the matcher before matching.

  stream_out
    The stream returned from the matcher after matching.

  results
    A list of the results returned.


.. index:: First(), Empty(), Regexp(), Delayed(), Trace(), AnyBut(), Optional(), Star(), ZeroOrMore(), Plus(), OneOrMore(), Map(), Add(), Substitute(), Name(), Eof(), Eos(), Identity(), Newline(), Space(), Whitespace(), Digit(), Letter(), Upper(), Lower(), Printable(), Punctuation(), UnsignedInteger(), SignedInteger(), Integer(), UnsignedFloat(), SignedFloat(), SignedEFloat(), Float(), Word(), String().

More
----

Many more matchers are described in the `API Documentation `_, including
`Add() `_, `AnyBut() `_, `Columns() `_, `Delayed() `_, `Digit() `_,
`Empty() `_, `Eof() `_, `Eos() `_, `First() `_, `Float() `_,
`Identity() `_, `Integer() `_, `Letter() `_, `Lower() `_, `Map() `_,
`Name() `_, `Newline() `_, `OneOrMore() `_, `Optional() `_, `Plus() `_,
`Printable() `_, `Punctuation() `_, `Regexp() `_, `SignedEFloat() `_,
`SignedFloat() `_, `SignedInteger() `_, `SkipTo() `_, `Space() `_,
`Star() `_, `String() `_, `Substitute() `_, `Trace() `_,
`UnsignedFloat() `_, `UnsignedInteger() `_, `Upper() `_, `Whitespace() `_,
`Word() `_ and `ZeroOrMore() `_.


.. index:: generator, results, failure, implementation, Matcher, BaseMatcher, ABC
.. _implementation_details:

Implementation Details
----------------------

All matchers accept a stream of data and return an iterator over possible
``([results], stream)`` pairs, where the new stream continues from after
the matched text (and which may then be passed to another matcher to
continue the process of parsing). These iterators are typically implemented
as Python generators [*]_.

A matcher may succeed, but provide no results --- the iterator will include
a tuple containing an empty list and the new stream. When there are no more
possible matches, the iterator will terminate.

Simple matchers will return an iterator containing a single entry. Matchers
that return multiple values support backtracking. For example, the `Or() `_
generator may yield once for each sub--match in turn (in practice some
sub-matchers may return generators that themselves return many values,
while others may fail immediately, so it is not a direct 1--to--1
correspondence).

(It is probably obvious if you have used combinator libraries before, but
all matchers implement this same interface, whether they are "fundamental"
--- do the real work of matching against the stream --- or delegate work to
other sub--matchers, or modify results. This consistency is the source of
their expressive power.)

Lepl includes several function decorators that help simplify the creation
of new matchers. See :ref:`new_matchers` and following sections.

.. [*] I am intentionally omitting details about trampolining here to focus
   on the process of matching. A more complete description of the entire
   implementation can be found in :ref:`trampolining`.

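
The generator interface described under Implementation Details can be sketched in plain Python, without Lepl. The sketch below is only an illustration of the idea: the helper names ``any_char``, ``and_`` and ``or_`` are hypothetical, a plain string stands in for the stream, and none of this is Lepl's actual implementation (which also involves trampolining). Each matcher is a function that takes a stream and yields ``(results, stream)`` pairs; yielding nothing means failure, and yielding more than once is what makes backtracking possible.

```python
# Illustrative sketch only: these helpers are hypothetical, not Lepl's API.
# A "matcher" is a function from a stream (here just a str) to a generator
# of (results, remaining_stream) pairs.

def any_char(allowed=None):
    """Match one character, optionally restricted to those in `allowed`."""
    def match(stream):
        if stream and (allowed is None or stream[0] in allowed):
            yield [stream[0]], stream[1:]
        # Falling through without yielding signals failure.
    return match

def and_(*matchers):
    """Match each sub-matcher in order, concatenating their results."""
    def match(stream):
        if not matchers:
            # Success with no results: an empty list and the same stream.
            yield [], stream
            return
        first, rest = matchers[0], and_(*matchers[1:])
        for results, remaining in first(stream):
            # If `rest` fails, the loop asks `first` for its next
            # alternative; this is the backtracking step.
            for more, final in rest(remaining):
                yield results + more, final
    return match

def or_(*matchers):
    """Yield every match from every alternative, in order."""
    def match(stream):
        for matcher in matchers:
            yield from matcher(stream)
    return match

# The first alternative ('h') succeeds, but then 'l' cannot match 'e', so
# the second alternative (any two characters) is tried instead.
grammar = and_(or_(any_char('h'), and_(any_char(), any_char())),
               any_char('l'))
all_matches = list(grammar('hello'))
print(all_matches)
```

Because every function has the same shape, combinators such as ``and_`` and ``or_`` can wrap either "fundamental" matchers or other combinators without distinguishing between them.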