
© 2006-2013 Andrew Cooke (site) / post authors (content).

Representing Indentations for Parsing

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 13:06:31 -0400

What's the best way to represent indentation information for parsing?
There are two sides to this question.  First, from a UI perspective -
how best to support clear, concise specification of parsers for a wide
variety of approaches.  Second, from an implementation perspective -
how to do that efficiently within LEPL.

Taking the second first: LEPL now supports filtering of streams, but
filters are costly in terms of memory (apart from an initial filter
applied to the entire stream).  That means context-sensitive filtering
of streams is possible, but better avoided.  A more efficient, but
less flexible, approach is to alter "global" state in a way that
depends on scope - this can be done with "monitors" and has constant
time and space costs.
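To make the monitor idea concrete, here is a minimal sketch (names and
structure are my own illustration, not LEPL's actual monitor API): a shared
indentation level that is bumped on entering a scope and restored on exit,
so the cost per scope is constant and no copy of the stream is needed.

```python
from contextlib import contextmanager

class IndentMonitor:
    """Toy 'monitor': scope-dependent global state with O(1) cost."""
    def __init__(self):
        self.level = 0          # current indentation depth

    @contextmanager
    def deeper(self):
        # entering an indented scope bumps the level; it is restored
        # on exit even if matching inside the scope fails
        self.level += 1
        try:
            yield self.level
        finally:
            self.level -= 1

monitor = IndentMonitor()
with monitor.deeper():
    with monitor.deeper():
        assert monitor.level == 2   # two nested scopes deep
assert monitor.level == 0           # state restored on exit
```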

One UI issue seems to revolve around finding a good representation of
the data in the stream.  In particular, avoiding indentation-related
tokens in the stream where they are not needed (for example, in
Python, indentation between parentheses is not significant).  If - as
in that example - this is restricted to a small fraction of the
stream, then filtering is a reasonable approach (since memory usage is
not a problem).

Another issue involves the difference between spaces/tabs and "depth".
One scheme might assign meaningful indentation to exact multiples of
4 spaces; another might only require that spaces "be consistent"; in
either case tabs may or may not be acceptable, and might have some
fixed size, or shift to some fixed set of columns.
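As a sketch, one of those policies - exact multiples of 4 spaces, with
tabs advancing to the next tab stop - can be written as a plain function
from leading whitespace to a depth (the function name is hypothetical):

```python
def depth_fixed4(indent, tab_size=8):
    """Tabs shift to the next tab stop; depth is exact multiples of 4."""
    col = 0
    for ch in indent:
        if ch == '\t':
            col += tab_size - col % tab_size   # advance to next tab stop
        else:
            col += 1
    if col % 4:
        raise ValueError('indentation is not a multiple of 4 spaces')
    return col // 4

assert depth_fixed4('        ') == 2   # 8 spaces -> depth 2
assert depth_fixed4('\t') == 2         # one tab -> column 8 -> depth 2
```

The "be consistent" policy would instead compare each line's spaces against
those seen earlier, which needs state rather than a pure function.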

Yet another issue is blank lines.  If these are significant, they need
to be present in the stream.  If not, they are a nuisance and should
be filtered.

Putting all this together, my first reaction was to calculate a
"depth" for each indentation (using some user defined function of the
space matched at the start of the line) and to place that value in the
(tagged token) stream.  Furthermore, blank lines would be discarded
and a global (but stack-aware) "current indentation level" set as
"zero".

A NoIndent matcher (token) would then match an indentation token with
a depth equal to the global indentation level.  This would allow a
line with no indentation to be matched.

In addition, an Indent matcher would exist, which takes a list of
matchers (tokens) as an argument.  This matcher: matches an indent
"one more" than the global value; increments the global value;
evaluates the internal matchers; on exit, decrements the global value.

Within the scope of the first Indent matcher, then, a NoIndent matcher
would match lines indented by "one".

That has at least two problems: first, contiguous nested Indents (the
second indent has nothing to match because the indent token was
consumed by the first); second, it assumes it's possible to calculate
the depth before matching.

The first problem can be addressed by making the Indent match but not
consume the token if some indentation "remains".  The second can be
fixed by using "number of spaces" rather than depth (and having Indent
either consume some fixed number of spaces (FixedIndent) or simply
take whatever it's given (VariableIndent)).
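A sketch of that fix (the function name is hypothetical): each Indent
inspects the spaces available but only "uses up" its own share, so a
contiguous nested Indent can still see the remainder.

```python
def fixed_indent(spaces_available, spaces_used, step=4):
    """Return the new `spaces_used` if `step` more spaces remain, else None."""
    if spaces_available - spaces_used >= step:
        return spaces_used + step   # consume our share, leave the rest
    return None                     # not enough indentation: no match

assert fixed_indent(8, 0) == 4      # outer Indent takes 4 of 8 spaces
assert fixed_indent(8, 4) == 8      # inner Indent takes the remaining 4
assert fixed_indent(8, 8) is None   # nothing left: no further Indent matches
```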

There's also the case of wanting to ignore indentation.  That is, I
think, going to require a filter.  And NoIndent might be the right
name for that matcher, in which case perhaps Line would be a good
choice for what I was calling NoIndent?  Or Sol?


After writing the above I tried writing a parser (without having any
implementation) and learnt that defining what a line is matters, but
using a separate Eol() is messy.  It seems better to have various
ways of defining lines.  Related to that, one use case I had missed is
automatic continuation over multiple lines (indicated by indentation).

So perhaps the basic grouping is Line(), which has a related
Multiline() for automatic continuation.  This can be combined with
Indent(), NoIndent() and AnyIndent() (the last ignores indentation).
But Multiline and AnyIndent are both so closely related to Line that
perhaps they should be options on Line.  And if we do that, shouldn't
Indent be, too?

Is there some other syntax we can use?  I don't think so - we really
need this to group matches, so it has to have a (...) form, which
means a function invocation.   Perhaps a single letter function name,
plus flags?  There's always _(...), although that's traditionally for
i18n.

Maybe it's better to separate the main parser from the structure, and
then have some kind of complex layout expression that is itself
parsed?  I'm thinking of nested lists.  But as soon as you get any
complexity (alternatives, for example), you're better off using
functions and the standard approach (ie Or()).


Here's a later attempt.  I'm now using Line... prefixes to make
related matchers appear related.

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = LineAny(symbol('#') + Star(Any()))[:]

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + LineIgnore(atom[:,symbol(',')]) + symbol(')')
        expr = ... # this includes nested function definitions

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr)
        func = Line(word('def') + word + args + symbol('=') +
                    LineIndent(expr)[:])

        program = (func|func1)[:]

Inline comments would require separating out the "#..." part and
adding it to the end of an expr.

But there's still a problem with nested definitions.  If LineIndent
has consumed the indentation token, what does the Line() outside a
function definition match?

I guess this indicates that LineIndent doesn't consume indentation, it
just modifies the current level.  So perhaps it should be called
Indent.  And perhaps LineIgnore is better called Freeform?

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr +
                     Opt(comment))
        func = Line(word('def') + word + args + symbol('=') + Opt(comment)
                    Indent(expr|func|func1)[:])

        program = (func|func1)[:]

And it looks like it would be useful for a user to be able to easily
define something that was like Line but added an optional comment (and
in practice this should indeed be possible).

I haven't used LineMulti() above, but that would support implicit
continuation through indented lines.

Andrew

More Indentation

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 18:57:52 -0400

So looking at that I noticed that there's an inconsistent mess in how
Line and Indent work together.  Above I had:

        func = Line(word('def') + word + args + symbol('=') + Opt(comment)
                    Indent(expr|func|func1)[:])

There's no operator between the comment and the Indent.  My initial
reaction was that I had missed a &, but then Line would extend past
the end of the line (dropping Eol and inferring it from the next
indentation doesn't solve the underlying problem, I think, which is
that this is messy).

Looking at the problem again, there are several different "things
being done":
1 - Defining what constitutes a single line.
2 - Grouping lines with the same indentation together.
3 - Indicating how one *group* is indented relative to another.
The confusion above comes from conflating (3) with indicating how
*lines* indent relative to each other.

Using this new approach is going to be much more verbose, but also
clearer.  On a more positive note, it might help separate two distinct
tasks - the definition of lines and of blocks - and might even allow
more useful combinators to be defined for specific cases.

It's worth clarifying the difference between line and block.  A line
starts with an empty indentation and ends with the end of line marker.
There is an exception with Freeform(), which discards indentation and
eol markers - if that is used inside a line then the line can extend
over several lines of text.

A block contains one or more matchers that consume lines (these can
be simple lines, or they may be combined using Or, And etc).  Because
blocks work by incrementing/decrementing the number of spaces "used",
they do not alter the stream flow in any way.  This means there is
very little constraint on how lines inside a block are combined.  If a
block contains several lines, separated by commas, then they are
joined by And().
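One way to picture that bookkeeping (a hypothetical sketch, not LEPL's
implementation): a block never consumes stream tokens itself; it only
raises or lowers the number of spaces each line inside it must account
for.

```python
class Blocks:
    """Toy model: blocks adjust the spaces each Line must account for."""
    def __init__(self, step=4):
        self.step = step
        self.required = 0           # spaces every line must currently show

    def line_matches(self, line):
        # a Line matcher would check indentation against `required`
        return len(line) - len(line.lstrip(' ')) == self.required

    def enter(self):
        self.required += self.step  # block entry: `step` more spaces used

    def exit(self):
        self.required -= self.step  # block exit: give them back

b = Blocks()
assert b.line_matches('top level')
b.enter()
assert b.line_matches('    indented')   # 4 spaces now required
b.exit()
assert b.line_matches('top level')      # stream itself never altered
```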

Given the above, Indent() seems like it is really an initial parameter
to a block, which sets the space to consume (perhaps by reading the
indentation, perhaps by simply incrementing by some fixed amount).

So the syntax would be Block(indent, Line(), Line()...) where indent
is some value (number of spaces, or some function)?  But this value is
likely global, so we can set it in config and use a keyword argument
for exceptional cases.

The example then becomes:

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') +
                     expr + Opt(comment))
        func = Line(word('def') + word + args + symbol('=') +
                    Opt(comment)) + Block((expr|func|func1)[:])

        program = (func|func1)[:]

Andrew
