| Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next

C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

Last 100 entries

Interesting (But Largely Illegible) Typeface; Retiring Essentialism; Poorest in UK, Poorest in N Europe; I Want To Be A Redneck!; Reverse Racism; The Lost Art Of Nomography; IBM Data Center (Photo); Interesting Account Of Gamma Hack; The Most Interesting Audiophile In The World; How did the first world war actually end?; Ky - Restaurant Santiago; The Black Dork Lives!; The UN Requires Unaninmous Decisions; LPIR - Steganography in Practice; How I Am 6; Clear Explanation of Verizon / Level 3 / Netflix; Teenage Girls; Formalising NSA Attacks; Switching Brakes (Tektro Hydraulic); Naim NAP 100 (Power Amp); AKG 550 First Impressions; Facebook manipulates emotions (no really); Map Reduce "No Longer Used" At Google; Removing RAID metadata; New Bike (Good Bike Shop, Santiago Chile); Removing APE Tags in Linux; Compiling Python 3.0 With GCC 4.8; Maven is Amazing; Generating Docs from a GitHub Wiki; Modular Shelves; Bash Best Practices; Good Emergency Gasfiter (Santiago, Chile); Readings in Recent Architecture; Roger Casement; Integrated Information Theory (Or Not); Possibly undefined macro AC_ENABLE_SHARED; Update on Charges; Sunburst Visualisation; Spectral Embeddings (Distances -> Coordinates); Introduction to Causality; Filtering To Help Colour-Blindness; ASUS 1015E-DS02 Too; Ready Player One; Writing Clear, Fast Julia Code; List of LatAm Novels; Running (for women); Building a Jenkins Plugin and a Jar (for Command Line use); Headphone Test Recordings; Causal Consistency; The Quest for Randomness; Chat Wars; Real-life Financial Co Without ACID Database...; Flexible Muscle-Based Locomotion for Bipedal Creatures; SQL Performance Explained; The Little Manual of API Design; Multiple Word Sizes; CRC - Next Steps; FizzBuzz; Update on CRCs; Decent Links / Discussion Community; Automated Reasoning About LLVM Optimizations and Undefined Behavior; A Painless Guide To CRC Error Detection Algorithms; Tests in Julia; Dave Eggers: what's so funny about peace, love and Starship?; Cello - High Level C Programming; autoreconf needs tar; Will Self Goes To Heathrow; Top 5 BioInformatics Papers; Vasovagal Response; Good Food in Vina; Chilean Drug Criminals Use Subsitution Cipher; Adrenaline; Stiglitz on the Impact of Technology; Why Not; How I Am 5; Lenovo X240 OpenSuse 13.1; NSA and GCHQ - Psychological Trolls; Finite Fields in Julia (Defining Your Own Number Type); Julian Assange; Starting Qemu on OpenSuse; Noisy GAs/TMs; Venezuela; Reinstalling GRUB with EFI; Instructions For Disabling KDE Indexing; Evolving Speakers; Changing Salt Size in Simple Crypt 3.0.0; Logarithmic Map (Moved); More Info; Words Found in Voynich Manuscript; An Inventory Of 3D Space-Filling Curves; Foxes Using Magnetic Fields To Hunt; 5 Rounds RC5 No Rotation; JP Morgan and Madoff; Ori - Secure, Distributed File System; Physical Unclonable Functions (PUFs); Prejudice on Reddit; Recursion OK; Optimizing Julia Code; Cash Handouts in Brazil; Couple Nice Music Videos; It Also Works!

© 2006-2013 Andrew Cooke (site) / post authors (content).

Processing Large Volumes of Data in Lepl

From: andrew cooke <andrew@...>

Date: Tue, 1 Mar 2011 21:35:03 -0300

Lepl was designed to have the ability to parse large (as in, larger than the
amount of available memory) volumes of data.  However, I shamefully admit that
I never tested this (until now, I assumed no-one would care, and I hadn't had
a need for it).

But someone on the mailing list said that it was important to them so, as part
of the Lepl 5 work, I wrote some code to check that large inputs were handled
correctly.

It turns out, of course, that they weren't.  And so I have been spending way
too much time trying to debug memory use.  But the good news is that, with the
very latest code, and some care, it is possible to process arbitrary amounts
of input.

To show how this works I will work through the test code I am running (which
is processing several GB of data, but using only 0.5% of my laptop's memory to
do so!).


First, for my test, I needed a problem that was easy to describe, and which
"used", in a sense, all the input data, but didn't have a huge result (even
Lepl can't make a large result take no room...).  So I decided to use data in
which each "line" was a string containing a number, incrementing from 1 ("1",
"2", ...).  The result would be the sum of the first digit of each of the
numbers.

To avoid keeping all the input data in memory at once we must use an iterator
as the data source.  In normal use this would probably be a file (the result
of "open(...)" can be passed to Lepl's parse() methods), but for the test the
source looks like:

  def source(n_max):
      return imap(str, takewhile(lambda n: n <= n_max, count(1)))

(Note that this is Python 2 code, so I am using imap - in Python 3 that would
be a normal map).

I also need a matcher that returns the first digit:

  @sequence_matcher
  def Digits(support, stream):
      (number, next_stream) = s_line(stream, False)
      for digit in number:
	  yield ([int(digit)], next_stream)


At this point I hit my first problem: how to add numbers?  Normally I would
use "sum()" as a transformation of the results.  But that implies that the
results must first be generated (as a list).  For a large input we cannot do
this (the list could get too large).

After thinking about this I decided to extend the way that repetition works in
Lepl with a "reduce" parameter.  This can be specified with Override:

  with Override(reduce=...):
      matcher = foo[:]

or as an argument for Repeat (see below).  The value of reduce is the tuple
(zero, join) where "zero" is the initial value and join acts as an iterative
fold.  So for no repeats, the result is "zero"; for one repeat it is the
result of "join(zero, result1)", for two "join(join(zero, result1), result2)".

For my example I used:

  sum = ([0], lambda a, b: [a[0] + b[0]])
  Repeat(Digits(), reduce=sum, ...)


At this point I thought things should work.  Of course, I still had to add the
appropriate magic to the configuration:

  p.config.add_monitor(GeneratorManager(10))

which would restrict the number of "pending" generators to just 10.  This is
the functionality that has always been in Lepl and that was intended for
cases like this.  It is actually the source of a fair amount of complexity in
the implementation (to keep track of the generators and to discard the "least
used" at any point).

Unfortunately, as I have already said, this didn't work.  Memory use exploded
and my laptop crashed.


It is probably worth going into a little detail about how to debug memory in
Python.  The word "debug" is a little misleading, because there is no bug in
Python - just in my code.

But before I do that I guess I also need to explain a little more about how I
handle the input.  The iterator (a file or, in this test, the number source)
is wrapped in some classes that create a linked list of lines: each line knows
about the line that follows (if necessary, calling the iterator to return the
next value).  The happy consequence of this is that when the first line has
been parsed, and no object has a reference to it, it can be garbage collected
by Python (the list links to lower lines, but not back up to the previous
lines).

So "all" that is needed for us to handle a large amount of input is for us to
use this linked-list "trick", and to make sure that there are no references to
lines once they have been parsed.

Debugging the problem, then, means finding out what objects exist, and who has
a reference to them.  There are two tools that help.

First, there's a package called Guppy, with a module called hpy, which will
show heap usage.  You use it like this:

  from guppy import hpy
  h = hpy()
  hp = h.heap()
  print(hp)

and typical output might be:

Partition of a set of 335881 objects. Total size = 48804432 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  10008   3 10488384  21  10488384  21 dict of lepl.core.manager.GeneratorRef
     1  10000   3 10480000  21  20968384  43 dict of 0x846620
     2  61337  18  4753648  10  25722032  53 tuple
     3  10505   3  3897560   8  29619592  61 dict (no owner)
     4  40714  12  3257120   7  32876712  67 types.MethodType
     5  29645   9  2265384   5  35142096  72 str
     6  20020   6  1923152   4  37065248  76 unicode
     7  20596   6  1812448   4  38877696  80 __builtin__.weakref
     8  10897   3  1466120   3  40343816  83 list
     9  59638  18  1431312   3  41775128  86 int
<182 more rows. Type e.g. '_.more' to view.>

You can then play around with the results.  But what was most important was
just to get some idea of what types are using up memory.  Once you have that,
you can go hunting for them with the second tool: Python's gc module.

For example, this code will get all the GeneratorRef instances:

  o = gc.get_objects()
  gr = [oo for oo in o if isinstance(oo, GeneratorRef)]

and then you can find out who has a reference to one of those with
gc.get_referrers():

  gr_owners = gc.get_referrers(gr[0])

In this way uou can study the objects in memory and work out why they are
still sticking around.


It's unfortunate, but Lepl's backtracking makes it *really* sticky when it
comes to holding on to objects.  The problem is that you never know when you
might need to go all the way back to the start.

While GeneratorManager reduces stickiness in the "stack", it turns out that
Lepl had two other places where references to the input are retained.


The first was on the "topmost" matchers.  My parser was very simple:

  parser = Digits() & Eos()

but the topmost And() causes a problem: when it is first called, it is passed
the first line, and it keeps that reference until the parser completes.  The
lower level Digits() doesn't have the same problem: it is called for each line
in turn; for each line it does the work and then cleans up.  But And() cannot
clean up because it never finishes (well, not until all the input has been
consumed).

And it's not just And().  The top level Repeat() has the same problem.

Because this problem is for a very special case - the topmost matchers when
parsing huge amounts of data - I don't think a general solution is appropriate
(and I can't find one that doesn't affect Lepl deeply in other ways).  So
instead, I have a simple, ugly solution, that does the job.  The parser
becomes:

  NO_STATE(And(NO_STATE(Repeat(Digits(), reduce=sum)), Eos()))

Those two NO_STATE matchers strip the stream reference and remove this
problem.  Brutal but effective.


The second source of stickiness was the "secondary" stacks that are used for
backtracking during search.  In a sense this is the same issue as above: the
topmost Repeat() maintains a reference to the earlier input in an internal
stack that is used on failure.

Again, the solution is fairly brutal: I force the stacks to be limited to a
certain size.  In the future it might be nice to integrate this with
GeneratorManager, but for now it is specified directly:

  NO_STATE(And(NO_STATE(Repeat(Digits(), reduce=sum, clip=10)), Eos()))

Again, note that although this is ugly it is only needed at the very top level
of the parser - the part that exists for all lines.  Anything that is called
for a single line is not affected.


With those two fixes (and some related internal re-arrangements), memory does
not leak for this test.  There may be some other issues with other matchers,
but I think that the general problems are solved.

So some modification of the "topmost" parsers is needed, and the work takes
time and an understanding of both Python and Lepl, but it is now possible to
parse large inputs.

This work will be released as Lepl 5 (probably before April - the main task
now is updating the documentation).

Andrew


PS Lepl 5 also has improved stream handling which cuts parsing time by
15-35% - https://groups.google.com/d/msg/lepl/lGwLEs3MFFg/Yz4c6vAz90YJ

Evidence :o)

From: andrew cooke <andrew@...>

Date: Tue, 1 Mar 2011 21:48:51 -0300

Here's the output from a test that just finished running:

  >>> print(pp.matcher.tree())
  NO_STATE
   `- matcher UntransformableTrampolineWrapper<And>
       +- NO_STATE
       |   `- matcher UntransformableTrampolineWrapper<DepthFirst>
       |       +- clip 10
       |       +- stop None
       |       +- reduce ([0], <function <lambda> at 0xa8f758>)
       |       +- rest SequenceWrapper<Digits:<>>
       |       +- start 0
       |       `- first SequenceWrapper<Digits:<>>
       `- FunctionWrapper<Eof:<>>
  >>> r = pp(source(10**7)) # keep generators open by not expanding
  >>> print(next(r))
  [49999996]
  >>> h = hpy()
  >>> hp = h.heap()
  >>> hp
  Partition of a set of 45266 objects. Total size = 6254072 bytes.
   Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
       0  18479  41  1654624  26   1654624  26 str
       1  11357  25   914840  15   2569464  41 tuple
       2    149   0   437240   7   3006704  48 dict of module
       3   3013   7   361560   6   3368264  54 types.CodeType
       4   2922   6   350640   6   3718904  59 function
       5    509   1   319928   5   4038832  65 dict (no owner)
       6    332   1   303008   5   4341840  69 dict of type
       7    332   1   297448   5   4639288  74 type
       8    154   0   161392   3   4800680  77 dict of lepl.core.config.ConfigBuilder
       9    141   0   150840   2   4951520  79 dict of class
  <180 more rows. Type e.g. '_.more' to view.>

The input stream went from "1" to "10000000" (10e7).  With unicode that's
about 2 * 10e7 * 7 bytes (about 100MB).  The total memory used (note that the
snapshot above still has pending results from backtracking, so the parser
still "exists" in memory) is 6MB.

So this is equivalent to parsing a 100MB file using just 6MB of memory.

(Another simple way to see that input is not in memory is to note that there
are not 10 million instances of any one object!)

Andrew

Fix 1 - No need for NO_STATE

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 07:45:30 -0300

I've dreamt up a fix that avoids the need for NO_STATE.  The reference to
stream that was being kept is used only for debugging so a good compromise (I
hope) is to use a weak reference.

In fact, a simple weak reference won't do, because a stream is a tuple in Lepl
5, so it's actually a wek reference to an anonymous function that returns the
stream.  Despite this being ubiquitous code (on the "inner loop" in a broad
sense) that doesn't affect performance (at least, not much - it's more than
offset by another fix that I made, which I didn't have space to explain above,
that makes (a lack of) transformations more efficient).

Andrew

Fix 2 - No need for explicit clip

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 08:57:36 -0300

And, as suggested above, I can tie GeneratorMatcher together with the search
so that clip is set automatically.  The parser now looks like:

    sum = ([0], lambda a, b: [a[0] + b[0]])
    with Override(reduce=sum):
        total = Digits()[:] & Eos()
    total.config.add_monitor(GeneratorManager(10))

Well, almost - I am using a bunch more config options and need to check what
is really necessary...

Andrew

Final Code

From: andrew cooke <andrew@...>

Date: Wed, 2 Mar 2011 21:21:22 -0300

This shows the final code and configuration needed:

https://code.google.com/p/lepl/source/browse/src/lepl/_performance/memory.py

Andrew

Comment on this post