C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Offside Parsing Works in LEPL

From: andrew cooke <andrew@...>

Date: Sat, 12 Sep 2009 12:14:54 -0400

I just got a complete test for offside (whitespace/indentation
sensitive) parsing working in LEPL (my Python parser -
http://www.acooke.org/lepl)

What follows is a re-formatted version of a test from this file -
http://code.google.com/p/lepl/source/browse/src/lepl/offside/_test/pithon.py?spec=svn362f24c528fa6988e13953eebb1325956295696b&r=362f24c528fa6988e13953eebb1325956295696b


Here's the grammar (note that I have hardly any structure - there's no
clear definition of statements or commands or variables, it's just
enough to use the indentation-aware code):

# these are the basic tokens that the lexer
# recognises - whitespace is then handled
# automatically
word = Token(Word(Lower()))
continuation = Token(r'\\')
symbol = Token(Any('()'))

# the ~ here means these are used to match
# but discarded from the results
introduce = ~Token(':')
comma = ~Token(',')

# first we need to define how a single
# logical line can continue over many
# lines in the text
CLine = CLineFactory(continuation)

# if we don't want lines to continue,
# we could just use the BLine() matcher

# next a minimal language definition that
# says statements are a sequence of words
statement = word[1:]

# argument lists can extend over multiple
# lines (the parser will "know" their extent
# because they are inside (...))
args = Extend(word[:, comma]) > tuple

# and a function header is some words followed
# by the argument list
function = \
  word[1:] & ~symbol('(') & args & ~symbol(')')


# now we get to the interesting part.  we
# introduce blocks, which are indented
# relative to the surrounding text
block = Delayed()

# and lines which are what are inside blocks.
# note that a block is a valid line
# because we can nest blocks, and an empty
# line can appear too.  finally we collect
# the output in a Python list so we can
# see the structure in the result
line = Or(CLine(statement),
          block,
          Line(Empty()))        > list

# now we can define the block: it comes
# after a function header or statement
# (both those end in introduce - ":") and
# contains lines.
block += \
  CLine((function | statement) & introduce) \
  & Block(line[1:])

# and a program is a list of lines.
program = (line[:] & Eos())

# the usual LEPL way to make a parser,
# with a new configuration type. the
# policy argument is the number of spaces
# needed in an indent for a single block.
return program.string_parser(
  OffsideConfiguration(policy=2))


And here's the text that we will parse:

this is a grammar with a similar
line structure to python

if something:
  then we indent
else:
  something else

def function(a, b, c):
  we can nest blocks:
    like this
  and we can also \
    have explicit continuations \
    with \
any \
       indentation

same for (argument,
          lists):
  which do not need the
  continuation marker


Running the parser against that text gives the following, where the
nested lists indicate that we have matched the block structure
correctly:

[ [],
  ['this', 'is', 'a', 'grammar', 'with', 'a', 'similar'],
  ['line', 'structure', 'to', 'python'],
  [],
  ['if', 'something',
    ['then', 'we', 'indent']],
  ['else',
    ['something', 'else'],
  []],
  ['def', 'function', ('a', 'b', 'c'),
    ['we', 'can', 'nest', 'blocks',
      ['like', 'this']],
    ['and', 'we', 'can', 'also', 'have', 'explicit',
     'continuations', 'with', 'any', 'indentation'],
    []],
  ['same', 'for', ('argument', 'lists'),
    ['which', 'do', 'not', 'need', 'the'],
    ['continuation', 'marker']]]
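
To make the nesting above easier to read, here is a rough plain-Python
sketch of the same idea.  It does not use LEPL and is not how LEPL works
internally.  The function name parse_blocks and its rules are assumptions
made for illustration: a line ending in ":" opens a block, each block is
indented by a fixed number of spaces, blank lines are skipped, and
continuations and argument lists are ignored.

```python
def parse_blocks(text, indent=2):
    """Turn indentation into nested lists of words (illustrative only)."""
    root = []
    stack = [root]  # stack[d] is the list that receives lines at depth d
    for number, raw in enumerate(text.splitlines(), 1):
        if not raw.strip():
            continue  # this sketch simply skips blank lines
        spaces = len(raw) - len(raw.lstrip(' '))
        depth, remainder = divmod(spaces, indent)
        # a line may be deeper than its predecessor only if that
        # predecessor opened a block (and then only by one level)
        if remainder or depth >= len(stack):
            raise ValueError('bad indentation on line %d' % number)
        del stack[depth + 1:]  # leave any deeper blocks
        content = raw.strip()
        opens_block = content.endswith(':')
        line = content.rstrip(':').split()
        stack[depth].append(line)
        if opens_block:
            stack.append(line)  # indented lines nest inside this one
    return root


sample = (
    "if something:\n"
    "  then we indent\n"
    "else:\n"
    "  something else\n"
)
tree = parse_blocks(sample)
print(tree)
```

As in the LEPL output, the lines of a block end up nested inside the list
for the line that introduced them.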


I hope to release a beta containing this in the next few days, and
will then start working on documentation.  When the docs are done I
will release a new version.

If you want to try this now, you can get the code from the hg repo -
http://code.google.com/p/lepl/source/checkout

Andrew

What's so Neat...

From: andrew cooke <andrew@...>

Date: Sat, 12 Sep 2009 12:44:10 -0400

...about this is that - despite some need to rewrite things - it all
fits into the existing LEPL architecture.  This is a "big deal"
because whitespace parsing mixes information between different levels
of the parser.  The presence of "(...)" or a continuation marker like
"\" influences what the whitespace "means", so while we can detect
indentation in the lexer, we cannot interpret it until the parser
itself is running.  But at the same time, we want to avoid the need to
explicitly add tokens for continuation markers and indentations
"inside" the definitions for statements, expressions etc - the line
structure should be as isolated as possible (imagine having to write a
grammar where between each word you need to include the possibility
that the continuation character appears at that particular point).
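
To see how "(...)" and "\" change what the whitespace means, here is a
small standalone sketch, again not LEPL code.  The function name
logical_lines is hypothetical.  It joins physical lines into logical
lines, keeping the indentation of the first physical line and discarding
the indentation of any continued line.  A pre-pass like this has to know
about brackets, which already belong to the grammar, and that is exactly
the mixing of levels described above.

```python
def logical_lines(text, marker='\\'):
    """Join physical lines into logical lines (illustrative only).

    A physical line continues onto the next when it ends with `marker`
    or when a "(" is still open.
    """
    lines = []
    current = None  # logical line being built, or None between lines
    depth = 0       # number of unclosed "("
    for raw in text.splitlines():
        piece = raw.rstrip()
        explicit = piece.endswith(marker)
        if explicit:
            piece = piece[:-len(marker)].rstrip()
        if current is None:
            current = piece  # indentation of the first line is kept
        else:
            current += ' ' + piece.strip()  # indentation means nothing here
        depth += piece.count('(') - piece.count(')')
        if not explicit and depth <= 0:
            lines.append(current)
            current, depth = None, 0
    if current is not None:
        lines.append(current)  # text ended in the middle of a continuation
    return lines


text = (
    "  and we can also \\\n"
    "    have explicit continuations \\\n"
    "any \\\n"
    "       indentation\n"
    "same for (argument,\n"
    "          lists):\n"
)
joined = logical_lines(text)
print(joined)
```

Only the indentation that survives in each joined line would then be
compared against the current block level.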

Another problem was the "global" state required to handle the current
indentation.  It turns out that LEPL's concept of monitors was a
perfect match for this.

Related to the above was the issue of how to provide a clean,
declarative syntax.  To do this I built on the ideas already
implemented for tokens, and extended streams with filters.  It took a
few iterations, but I am really happy with the final result.

And using LEPL's generic configuration and graph rewriting means that
these new extensions can be integrated with the existing code without
breaking other modules....

I'm *so* pleased this has worked :o)

Andrew

Delayed due to State

From: andrew cooke <andrew@...>

Date: Sat, 19 Sep 2009 09:25:55 -0400

Offside support has been delayed slightly because it breaks when used
with memoisation.  This is because (I think) the current indentation
level is not taken into account by memoizers.

Consider the end of a block.  At the end of the block another line is
attempted.  This fails because the indentation is incorrect.  So the
block ends, decrementing the indentation level, and the line is tried
again outside the block.  However, *exactly* the same stream is used
for the line matcher in both cases.  So the second time the memoizer
for the line says "nope, we already know this failed", when it should
have succeeded, because the indentation level is now correct.

The only clean solution I can see is to introduce the concept of
global (ie per thread) state (a dictionary) in which values (like
the current indentation) can be stored.  Memoizers then combine the
hash of that state with the hash of the stream to detect repetition.
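
A toy version of the problem and of the proposed fix might look like the
sketch below.  Nothing here is LEPL's API: make_line_matcher, memoize and
the state dictionary are hypothetical names, the "stream" is just a
string, and the state is snapshotted into the cache key with a frozenset
rather than by combining hashes by hand.

```python
def make_line_matcher(state):
    """Return a matcher that succeeds only at the current indentation."""
    def line(stream):
        spaces = len(stream) - len(stream.lstrip(' '))
        return spaces == state['indent']
    return line


def memoize(matcher, state=None):
    """Cache results by stream, and also by state if one is given."""
    cache = {}

    def memoized(stream):
        if state is None:
            key = stream  # the failing case: state is ignored
        else:
            # the dict hashes this tuple, so the state and the stream
            # both contribute to the lookup
            key = (stream, frozenset(state.items()))
        if key not in cache:
            cache[key] = matcher(stream)
        return cache[key]

    return memoized


state = {'indent': 2}  # stand-in for the global (per thread) state
line = make_line_matcher(state)
naive = memoize(line)
aware = memoize(line, state)

stream = 'line after the block'
inside = (naive(stream), aware(stream))   # (False, False): wrong indent
state['indent'] = 0                       # the block has ended
outside = (naive(stream), aware(stream))  # (False, True): naive is stale
print(inside, outside)
```

The naive memoizer repeats its cached failure after the block ends, while
the state-aware one re-runs the matcher because the key has changed.  The
sketch says nothing about the nested case.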

But will that be sufficient?  What about when two of the cases above
are nested?  Will the "inner" case be expanded?  I think so, but am not
100% sure.

Andrew
