## Processing Large Volumes of Data in Lepl

From: andrew cooke <andrew@...>

Date: Tue, 1 Mar 2011 21:35:03 -0300

Lepl was designed to have the ability to parse large (as in, larger than the
amount of available memory) volumes of data.  However, I shamefully admit that
I never tested this (until now, I assumed no-one would care, and I hadn't had
a need for it).

But someone on the mailing list said that it was important to them so, as part
of the Lepl 5 work, I wrote some code to check that large inputs were handled
correctly.

It turns out, of course, that they weren't.  And so I have been spending way
too much time trying to debug memory use.  But the good news is that, with the
very latest code, and some care, it is possible to process arbitrary amounts
of input.

To show how this works I will work through the test code I am running (which
is processing several GB of data, but using only 0.5% of my laptop's memory to
do so!).

First, for my test, I needed a problem that was easy to describe, and which
"used", in a sense, all the input data, but didn't have a huge result (even
Lepl can't make a large result take no room...).  So I decided to use data in
which each "line" was a string containing a number, incrementing from 1 ("1",
"2", ...).  The result would be the sum of the first digit of each of the
numbers.

To avoid keeping all the input data in memory at once we must use an iterator
as the data source.  In normal use this would probably be a file (the result
of "open(...)" can be passed to Lepl's parse() methods), but for the test the
source looks like:

def source(n_max):
return imap(str, takewhile(lambda n: n <= n_max, count(1)))

(Note that this is Python 2 code, so I am using imap - in Python 3 that would
be a normal map).

I also need a matcher that returns the first digit:

@sequence_matcher
def Digits(support, stream):
(number, next_stream) = s_line(stream, False)
for digit in number:
yield ([int(digit)], next_stream)

At this point I hit my first problem: how to add numbers?  Normally I would
use "sum()" as a transformation of the results.  But that implies that the
results must first be generated (as a list).  For a large input we cannot do
this (the list could get too large).

Lepl with a "reduce" parameter.  This can be specified with Override:

with Override(reduce=...):
matcher = foo[:]

or as an argument for Repeat (see below).  The value of reduce is the tuple
(zero, join) where "zero" is the initial value and join acts as an iterative
fold.  So for no repeats, the result is "zero"; for one repeat it is the
result of "join(zero, result1)", for two "join(join(zero, result1), result2)".

For my example I used:

sum = ([0], lambda a, b: [a[0] + b[0]])
Repeat(Digits(), reduce=sum, ...)

At this point I thought things should work.  Of course, I still had to add the
appropriate magic to the configuration:

which would restrict the number of "pending" generators to just 10.  This is
the functionality that has always been in Lepl and that was intended for
cases like this.  It is actually the source of a fair amount of complexity in
the implementation (to keep track of the generators and to discard the "least
used" at any point).

Unfortunately, as I have already said, this didn't work.  Memory use exploded
and my laptop crashed.

It is probably worth going into a little detail about how to debug memory in
Python.  The word "debug" is a little misleading, because there is no bug in
Python - just in my code.

But before I do that I guess I also need to explain a little more about how I
handle the input.  The iterator (a file or, in this test, the number source)
is wrapped in some classes that create a linked list of lines: each line knows
about the line that follows (if necessary, calling the iterator to return the
next value).  The happy consequence of this is that when the first line has
been parsed, and no object has a reference to it, it can be garbage collected
by Python (the list links to lower lines, but not back up to the previous
lines).

So "all" that is needed for us to handle a large amount of input is for us to
use this linked-list "trick", and to make sure that there are no references to
lines once they have been parsed.

Debugging the problem, then, means finding out what objects exist, and who has
a reference to them.  There are two tools that help.

First, there's a package called Guppy, with a module called hpy, which will
show heap usage.  You use it like this:

from guppy import hpy
h = hpy()
hp = h.heap()
print(hp)

and typical output might be:

Partition of a set of 335881 objects. Total size = 48804432 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0  10008   3 10488384  21  10488384  21 dict of lepl.core.manager.GeneratorRef
1  10000   3 10480000  21  20968384  43 dict of 0x846620
2  61337  18  4753648  10  25722032  53 tuple
3  10505   3  3897560   8  29619592  61 dict (no owner)
4  40714  12  3257120   7  32876712  67 types.MethodType
5  29645   9  2265384   5  35142096  72 str
6  20020   6  1923152   4  37065248  76 unicode
7  20596   6  1812448   4  38877696  80 __builtin__.weakref
8  10897   3  1466120   3  40343816  83 list
9  59638  18  1431312   3  41775128  86 int
<182 more rows. Type e.g. '_.more' to view.>

You can then play around with the results.  But what was most important was
just to get some idea of what types are using up memory.  Once you have that,
you can go hunting for them with the second tool: Python's gc module.

For example, this code will get all the GeneratorRef instances:

o = gc.get_objects()
gr = [oo for oo in o if isinstance(oo, GeneratorRef)]

and then you can find out who has a reference to one of those with
gc.get_referrers():

gr_owners = gc.get_referrers(gr[0])

In this way uou can study the objects in memory and work out why they are
still sticking around.

It's unfortunate, but Lepl's backtracking makes it *really* sticky when it
comes to holding on to objects.  The problem is that you never know when you
might need to go all the way back to the start.

While GeneratorManager reduces stickiness in the "stack", it turns out that
Lepl had two other places where references to the input are retained.

The first was on the "topmost" matchers.  My parser was very simple:

parser = Digits() & Eos()

but the topmost And() causes a problem: when it is first called, it is passed
the first line, and it keeps that reference until the parser completes.  The
lower level Digits() doesn't have the same problem: it is called for each line
in turn; for each line it does the work and then cleans up.  But And() cannot
clean up because it never finishes (well, not until all the input has been
consumed).

And it's not just And().  The top level Repeat() has the same problem.

Because this problem is for a very special case - the topmost matchers when
parsing huge amounts of data - I don't think a general solution is appropriate
(and I can't find one that doesn't affect Lepl deeply in other ways).  So
instead, I have a simple, ugly solution, that does the job.  The parser
becomes:

NO_STATE(And(NO_STATE(Repeat(Digits(), reduce=sum)), Eos()))

Those two NO_STATE matchers strip the stream reference and remove this
problem.  Brutal but effective.

The second source of stickiness was the "secondary" stacks that are used for
backtracking during search.  In a sense this is the same issue as above: the
topmost Repeat() maintains a reference to the earlier input in an internal
stack that is used on failure.

Again, the solution is fairly brutal: I force the stacks to be limited to a
certain size.  In the future it might be nice to integrate this with
GeneratorManager, but for now it is specified directly:

NO_STATE(And(NO_STATE(Repeat(Digits(), reduce=sum, clip=10)), Eos()))

Again, note that although this is ugly it is only needed at the very top level
of the parser - the part that exists for all lines.  Anything that is called
for a single line is not affected.

With those two fixes (and some related internal re-arrangements), memory does
not leak for this test.  There may be some other issues with other matchers,
but I think that the general problems are solved.

So some modification of the "topmost" parsers is needed, and the work takes
time and an understanding of both Python and Lepl, but it is now possible to
parse large inputs.

This work will be released as Lepl 5 (probably before April - the main task
now is updating the documentation).

Andrew

PS Lepl 5 also has improved stream handling which cuts parsing time by
15-35% - https://groups.google.com/d/msg/lepl/lGwLEs3MFFg/Yz4c6vAz90YJ

### Evidence :o)

From: andrew cooke <andrew@...>

Date: Tue, 1 Mar 2011 21:48:51 -0300

Here's the output from a test that just finished running:

>>> print(pp.matcher.tree())
NO_STATE
- matcher UntransformableTrampolineWrapper<And>
+- NO_STATE
|   - matcher UntransformableTrampolineWrapper<DepthFirst>
|       +- clip 10
|       +- stop None
|       +- reduce ([0], <function <lambda> at 0xa8f758>)
|       +- rest SequenceWrapper<Digits:<>>
|       +- start 0
|       - first SequenceWrapper<Digits:<>>
- FunctionWrapper<Eof:<>>
>>> r = pp(source(10**7)) # keep generators open by not expanding
>>> print(next(r))
[49999996]
>>> h = hpy()
>>> hp = h.heap()
>>> hp
Partition of a set of 45266 objects. Total size = 6254072 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
0  18479  41  1654624  26   1654624  26 str
1  11357  25   914840  15   2569464  41 tuple
2    149   0   437240   7   3006704  48 dict of module
3   3013   7   361560   6   3368264  54 types.CodeType
4   2922   6   350640   6   3718904  59 function
5    509   1   319928   5   4038832  65 dict (no owner)
6    332   1   303008   5   4341840  69 dict of type
7    332   1   297448   5   4639288  74 type
8    154   0   161392   3   4800680  77 dict of lepl.core.config.ConfigBuilder
9    141   0   150840   2   4951520  79 dict of class
<180 more rows. Type e.g. '_.more' to view.>

The input stream went from "1" to "10000000" (10e7).  With unicode that's
about 2 * 10e7 * 7 bytes (about 100MB).  The total memory used (note that the
snapshot above still has pending results from backtracking, so the parser
still "exists" in memory) is 6MB.

So this is equivalent to parsing a 100MB file using just 6MB of memory.

(Another simple way to see that input is not in memory is to note that there
are not 10 million instances of any one object!)

Andrew

### Fix 1 - No need for NO_STATE

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 07:45:30 -0300

I've dreamt up a fix that avoids the need for NO_STATE.  The reference to
stream that was being kept is used only for debugging so a good compromise (I
hope) is to use a weak reference.

In fact, a simple weak reference won't do, because a stream is a tuple in Lepl
5, so it's actually a wek reference to an anonymous function that returns the
stream.  Despite this being ubiquitous code (on the "inner loop" in a broad
sense) that doesn't affect performance (at least, not much - it's more than
offset by another fix that I made, which I didn't have space to explain above,
that makes (a lack of) transformations more efficient).

Andrew

### Fix 2 - No need for explicit clip

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 08:57:36 -0300

And, as suggested above, I can tie GeneratorMatcher together with the search
so that clip is set automatically.  The parser now looks like:

sum = ([0], lambda a, b: [a[0] + b[0]])
with Override(reduce=sum):
total = Digits()[:] & Eos()

Well, almost - I am using a bunch more config options and need to check what
is really necessary...

Andrew

### Final Code

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 21:21:22 -0300

This shows the final code and configuration needed:

Andrew