I’ve simplified how Lepl handles strings (or lists, or iterables, or...). As a consequence, some of the API has changed. So moving from Lepl 4 to 5 may break your code. I’m sorry! But hopefully it should be easy to fix (see below) and, once running again, should be a little faster.
In the longer term these changes help me maintain Lepl and continue to expand it.
As part of the stream work I also revisited how Lepl handles memoisation, particularly for left-recursive grammars.
Understanding what happened “under the hood” may explain the API changes. But if you’d rather just have a list of issues please skip to “Changes” below.
Originally, Lepl pretended that all inputs were strings. Indeed, it was possible (using parse_null()) to send a string directly to the matchers: each matcher then passed a new string, containing whatever remained to be parsed, to the next matcher. One advantage of this was that it allowed Lepl’s regular expression code to be used with general sequences.
In practice, however, this approach wasn’t so useful. For example, there was no way to tell how far you are through the data. So in most cases (ie. using parse()) Lepl constructed a wrapper that carried the string and the extra location information together, but which still looked sufficiently string-like for the matchers to work correctly.
This solved the problem of knowing where you were, but made the code more complicated. To avoid duplicating that complex code I used the same wrapper for many different input types. And that only made things more complicated. Worse, the “strings” slowly accumulated extra non-string methods to do other useful things.
Eventually it became clear that always pretending that things were strings was a bad idea. The extra complexity of defining an interface for “a stream of input data” might complicate matchers a little, but made everything else too brittle.
Hence Lepl 5, in which I dropped the pretence that the input is “just” a string. Instead, matchers work with a “stream” using functions like s_next() to get the next character, or s_line() to read a line.
You may be surprised to hear that those are functions, and not methods, so I will take a moment to sketch the implementation. A stream is now implemented as a tuple; a pair of values. The first value is “state” and the second value is a “helper”. Exactly what the state and helper are depend on the input, but the helper must implement a standard interface and the s_xxx() functions call that interface with the state.
That sounds more complex than it is. Essentially:
def s_next(stream):
(state, helper) = stream
return helper.next(state)
and the reason for that is efficiency. For a string, for example, state is the current index into the string. So every time the stream advances (and this happens a lot) we only have to construct a new tuple with a different integer value. The helper remains the same (in the case of a string it’s a wrapper round the the string itself). This is much more efficient than constructing a new, heavyweight object at every step (there is a downside, which is an extra function invocation, but I think that will be insignificant once PyPy becomes widely adopted).
Once I had simplified the streams code I started working through the unit tests, fixing bugs. At first, of course, I was fixing bugs in my new code and because that was simpler than before it wasn’t too hard to fix. Then I started hitting bugs that didn’t seem to be in the new code, but in the old. There were several mistakes / confusions / bugs in the old stream and memoisation code that, together, “cancelled each other out” for at least some inputs.
A less charitable way of describing the above is that I had “balanced” the code well enough to pass the tests I had, but that the end result was still broken.
In retrospect this isn’t a huge surprise - worrying about this kind of problem is what motivated the work to simplify stream handling.
So I have also rewritten much of the memoisation code, trying to simplify it, adding new tests, and generally trying to understand in more detail what is happening. It now appears to be working, but I am still not convinced that I understand everything, so I have changed the default configuration to more actively discourage left-recursion. These changes are described below.
While working on the above I also needed to check how the new streams interacted with Python’s garbage collection. It turned out that, despite the intention of the original design, Lepl had never been able to parse input larger than the available memory. So I added additional tests for this and fixed things as necessary.
I also added support for avoiding consuming too much memory with the output. See Resource Management for more details.
The left-memoisation code does the following:
This is my interpretation of the approach described in Frost and Hafiz 2006. However, the extra complexity implied by the generated / objects based approach used here means that I am not completely sure that it is correct.