Lepl’s lexer has not been used in earlier chapters of this manual. However, for many applications it will both simplify the grammar and give a more efficient parser. Some extensions, like Line-Aware Parsing, assume its use.
The lexer pre-processes the stream that is being parsed, dividing it into tokens that correspond to regular expressions. The tokens, and their contents, can then be matched in the grammar using the Token() matcher.
Tokens give a “rough” description of the text to be parsed. A simple parser of numerical expressions might have the following different token types:
- Numeric values (e.g. 1, 2.3, 4e5).
- Function names (e.g. sin, cos).
- Symbols (e.g. (, +).
Note that in the sketch above, no token is defined for spaces. The lexer includes a separate pattern that is used only if all others fail; if this matches, the matching text is discarded (otherwise, the lexer raises an error). By default this “discard pattern” matches whitespace, so spaces are automatically discarded and do not need to be specified in the grammar.
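As a rough sketch of the token types listed above (the regular expressions and names here are illustrative, not taken from any particular Lepl example), the whole grammar can be built from tokens, with no token defined for spaces:

    >>> value = Token(Integer())               # numeric values (integers only, for simplicity)
    >>> name = Token('[a-z]+')                 # function names such as sin, cos
    >>> symbol = Token('[^a-zA-Z0-9 \t\r\n]')  # single character symbols such as ( and +
    >>> expr = (value | name | symbol)[:] & Eos()
    >>> expr.parse('sin (1 + 2)')
    ['sin', '(', '1', '+', '2', ')']

The spaces in the input do not appear in the result because they are consumed by the default discard pattern.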
However, to use tokens correctly, it is necessary to understand how they work in a little more detail.
Tokens are used in two ways.
First, they work like matchers (with the exceptions noted below) for the regular expression given as their argument. For example:
    >>> name = Token('[A-Z][a-z]*')
    >>> number = Token(Integer())
Here, name will match a string that starts with a capital letter and then has zero or more lower case letters. The second Token(), number, is similar, but uses another matcher (Integer()) to define the regular expression that is matched.
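For example (a minimal sketch continuing the definitions above), the two tokens behave much like ordinary matchers, returning the text they matched:

    >>> (name & number).parse('Abc 123')
    ['Abc', '123']

(The space between the two values is handled by the discard pattern described earlier.)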
Not all matchers can be used to define the pattern for a Token() — only those that Lepl knows how to convert into regular expressions.
The second way in which tokens are used is by specialisation, which generates a new matcher with an additional constraint. This is easiest to see with another example:
    >>> params = ...
    >>> function = Token('[a-z]*')
    >>> sin = function('sine')
    >>> cos = function('cosine')
    >>> call = (sin | cos) & params
Here the function Token() is restricted to match the strings “sine” and “cosine”, creating two new matchers, sin and cos.
The example above used simple strings (which are converted to Literal() matchers) as the additional constraints, but any matcher expression can be used.
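For example (a sketch using the same function token as before), a specialisation can be given a full matcher expression:

    >>> function = Token('[a-z]*')
    >>> trig = function(Literal('sine') | Literal('cosine'))
    >>> trig.parse('cosine')
    ['cosine']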
For a more realistic example of multiple specialisations, see the use of symbol in this example.
Note that specialisation is optional. It’s OK to use Token() as a simple matcher without adding any more constraints (providing that the use is consistent with the limitations outlined below).
The lexer has two limitations: it cannot backtrack and Token() matchers cannot be mixed with non–Token matchers that consume input.
The first limitation means that the regular expressions associated with tokens only match text once. The lexer divides the input into a fixed sequence of tokens, each matching the largest possible fragment; no alternatives are considered.
This does not mean that a particular Token() is only called once, or that it will return only a single result. Backtracking is still possible, but the parser is restricted to exploring different interpretations of a single, unchanging token stream.
For example, consider “1-2”, which might be parsed as two integers (1 and -2), or as a subtraction expression (1 minus 2). An appropriate matcher will give both results, through backtracking:
    >>> matchers = (Integer() | Literal('-'))[:] & Eos()
    >>> list(matchers.parse_all('1-2'))
    [['1', '-2'], ['1', '-', '2']]
But when tokens are used, “-2” is preferred to “-”, because it is a longer match, so we get only the single result:
    >>> tokens = (Token(Integer()) | Token(r'\-'))[:] & Eos()
    >>> list(tokens.parse_all('1-2'))
    [['1', '-2']]
(In the examples above, list() is used to expand the generator and the Token() is given r'\-' because its argument is a regular expression, not a literal value.)
The second limitation is more subtle. The lexer is implemented via a rewriter which adds a Lexer() instance to the head of the matcher graph. This divides the input into the “pieces” that the Token() matchers expect. Because the lexer consumes the input and passes on only tokens, a non-Token matcher that consumes input directly cannot work alongside it, and mixing the two at the top level is an error:
    >>> matcher = Token(Any()) & Any()
    ...
    >>> matcher.parse(...)
    lepl.lexer.support.LexerError: The grammar contains a mix of Tokens and non-Token matchers at the top level. If Tokens are used then non-token matchers that consume input must only appear "inside" Tokens. The non-Token matchers include: Any(None).
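One way to avoid this error (a minimal sketch) is to make sure that every matcher which consumes input is itself a Token(), so that only tokens appear at the top level of the grammar:

    >>> matcher = Token('[a-z]') & Token('[0-9]')
    >>> matcher.parse('a1')
    ['a', '1']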
The lexer can be configured using .config.lexer() (see Configuration). This can take an additional argument that specifies the discard pattern. Note that the lexer is included in the default configuration (it does no harm if tokens are not used).
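For instance (a sketch only; the keyword name discard is an assumption here, and the pattern is a regular expression), the default whitespace pattern could be replaced by one that also discards commas:

    >>> matcher = Token(Integer())[:] & Eos()
    >>> matcher.config.lexer(discard='[, ]+')  # 'discard' keyword assumed; replaces the whitespace-only default
    >>> matcher.parse('1, 2, 3')
    ['1', '2', '3']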
By default Tokens require that any sub–expression consumes the entire contents:

    >>> abc = Token('abc')
    >>> incomplete = abc(Literal('ab'))
    >>> incomplete.parse('abc')
    [...]
    lepl.stream.maxdepth.FullFirstMatchException: The match failed in <string> at 'c' (line 1, character 3).
However, this constraint can be relaxed, in which case the matched portion is returned as a result:

    >>> abc = Token('abc')
    >>> incomplete = abc(Literal('ab'), complete=False)
    >>> incomplete.parse('abc')
    ['ab']
In the explanations above I try to describe the Token() matcher in a fairly declarative way. However, I know that it is sometimes easier to understand how to use a tool by first understanding how the tool itself works. So here I will sketch how the lexer is implemented by describing the steps involved when a Python program uses the Lepl parser, with the lexer, to parse some text.
The program containing the Lepl code (and the Lepl library itself) is compiled.
The program is then run.
Creation of matcher graph
A function, or set of statements, that generates the Lepl matchers is evaluated. Matchers like Token(), And(), etc., are objects that link to each other. The objects and their links form a graph (with a matcher object at each node).
Each time a Token() is created it is assigned a unique number, which I will call the “tag”.
Regular expression extraction
Whenever a Token() is created with another matcher as an argument, Lepl attempts to convert the matcher to a regular expression. If it cannot do so, it raises an error.
A token is “specialised” when it is given a sub–matcher:
    >>> function = Token('[a-z]*')
    >>> sin = function('sine')
In the example above, the first line creates a new Token(), with a unique tag and a regular expression, as explained just above. On the second line the token is specialised. This creates another Token(), which contains the given sub–matcher (a Literal() in this case), but with the same tag and regular expression as the “parent”.
I call a token like this, which has the same tag and regular expression as the parent, but also contains a sub–matcher, a “specialised token” in the description below.
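The practical consequence (a sketch, with the error message abbreviated) is that the lexer will accept any text that matches the shared regular expression, but the specialised token only succeeds when its sub-matcher also matches:

    >>> function = Token('[a-z]*')
    >>> sin = function('sine')
    >>> sin.parse('sine')
    ['sine']
    >>> sin.parse('cosine')
    [...]
    lepl.stream.maxdepth.FullFirstMatchException: The match failed ...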
At some point Lepl internally “compiles” the matcher graph to generate a parser. Exactly when this happens depends on how the matchers are used, but at the latest it happens just before the first match is calculated.
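For example (a sketch, assuming the get_parse() method, which returns the compiled parser as a plain function), compilation can be forced explicitly rather than waiting for the first parse:

    >>> matcher = Token(Integer())[:] & Eos()
    >>> parser = matcher.get_parse()  # the matcher graph is processed and the parser built here
    >>> parser('1 2 3')
    ['1', '2', '3']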
“Compilation” is perhaps misleading: the parser is not compiled to Python byte codes, for example. What happens is that the matcher graph is processed in various ways. The most important processing, in terms of the lexer, is the lexer rewriting described next.
The lexer rewriter uses the matcher graph to construct a Lexer() instance.
The modified matcher graph is then complete and returned for evaluation.
When the parser is finally called, by passing it some text to process, the matcher graph has already been prepared for lexing, as described above. The following processes then occur: