Andrew Cooke | Contents | Latest | RSS | Previous | Next


Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Choochoo Training Diary

Last 100 entries

[Bike] Gearing On The Back Of An Envelope; [Computing] Okular and Postscript in OpenSuse; There's a fix!; [Computing] Fail2Ban on OpenSuse Leap 15.3 (NFTables); [Cycling, Computing] Power Calculation and Brakes; [Hardware, Computing] Amazing Pockit Computer; Bullying; How I Am - 3 Years Post Accident, 8+ Years With MS; [USA Politics] In America's Uncivil War Republicans Are The Aggressors; [Programming] Selenium and Python; Better Walking Data; [Bike] How Fast Before Walking More Efficient Than Cycling?; [COVID] Coronavirus And Cycling; [Programming] Docker on OpenSuse; Cadence v Speed; [Bike] Gearing For Real Cyclists; [Programming] React plotting - visx; [Programming] React Leaflet; AliExpress Independent Sellers; Applebaum - Twilight of Democracy; [Politics] Back + US Elections; [Programming,Exercise] Simple Timer Script; [News] 2019: The year revolt went global; [Politics] The world's most-surveilled cities; [Bike] Hope Freehub; [Restaurant] Mama Chau's (Chinese, Providencia); [Politics] Brexit Podcast; [Diary] Pneumonia; [Politics] Britain's Reichstag Fire moment; install cairo; [Programming] GCC Sanitizer Flags; [GPU, Programming] Per-Thread Program Counters; My Bike Accident - Looking Back One Year; [Python] Geographic heights are incredibly easy!; [Cooking] Cookie Recipe; Efficient, Simple, Directed Maximisation of Noisy Function; And for argparse; Bash Completion in Python; [Computing] Configuring Github Jekyll Locally; [Maths, Link] The Napkin Project; You can Masquerade in Firewalld; [Bike] Servicing Budget (Spring) Forks; [Crypto] CIA Internet Comms Failure; [Python] Cute Rate Limiting API; [Causality] Judea Pearl Lecture; [Security, Computing] Chinese Hardware Hack Of Supermicro Boards; SQLAlchemy Joined Table Inheritance and Delete Cascade; [Translation] The Club; [Computing] Super Potato Bruh; [Computing] Extending Jupyter; Further HRM Details; [Computing, Bike] Activities in ch2; [Books, Link] Modern Japanese Lit; What ended up there; [Link, Book] Logic Book; Update - Garmin Express / Connect; Garmin Forerunner 35 v 230; [Link, Politics, Internet] Government Trolls; [Link, Politics] Why identity politics benefits the right more than the left; SSH Forwarding; A Specification For Repeating Events; A Fight for the Soul of Science; [Science, Book, Link] Lost In Math; OpenSuse Leap 15 Network Fixes; Update; [Book] Galileo's Middle Finger; [Bike] Chinese Carbon Rims; [Bike] Servicing Shimano XT Front Hub HB-M8010; [Bike] Aliexpress Cycling Tops; [Computing] Change to ssh handling of multiple identities?; [Bike] Endura Hummvee Lite II; [Computing] Marble Based Logic; [Link, Politics] Sanity Check For Nuclear Launch; [Link, Science] Entropy and Life; [Link, Bike] Cheap Cycling Jerseys; [Link, Music] Music To Steal 2017; [Link, Future] Simulated Brain Drives Robot; [Link, Computing] Learned Index Structures; Solo Air Equalization; Update: Higher Pressures; Psychology; [Bike] Exercise And Fuel; Continental Race King 2.2; Removing Lowers; Mnesiacs; [Maths, Link] Dividing By Zero; [Book, Review] Ray Monk - Ludwig Wittgenstein: The Duty Of Genius; [Link, Bike, Computing] Evolving Lacing Patterns; [Jam] Strawberry and Orange Jam; [Chile, Privacy] Biometric Check During Mail Delivery; [Link, Chile, Spanish] Article on the Chilean Drought; [Bike] Extended Gear Ratios, Shimano XT M8000 (24/36 Chainring); [Link, Politics, USA] The Future Of American Democracy; Mass Hysteria; [Review, Books, Links] Kazuo Ishiguro - Never Let Me Go; [Link, Books] David Mitchell's Favourite Japanese Fiction; [Link, Bike] Rear Suspension Geometry; [Link, Cycling, Art] Strava Artwork; [Link, Computing] Useful gcc flags; [Link] Voynich Manuscript Decoded; [Bike] Notes on Servicing Suspension Forks

© 2006-2017 Andrew Cooke (site) / post authors (content).

Representing Indentations for Parsing

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 13:06:31 -0400

What's the best way to represent indentation information for parsing?
There are two side to this question.  First, from a UI perspective -
how best to support clear, concise specification of parsers for a wide
variety of different approaches.  Second, from an implementation
perspective - how to do that efficiently within LEPL.

Taking the second first, LEPL now supports filtering of streams, but
these are costly in terms of memory (apart from an initial filtering
on the entire stream).  That means that context-sensitive filtering of
streams is possible, but better avoided.  A more efficient, but less
flexible, approach is to alter "global" state in a way that depends on
scope - this can be done with "monitors" and has constant time and
space costs.

One issue with UI seems to resolve around finding a good
representation of the data in the stream.  In particular, avoiding
having tokens related to indentation in the stream where they are not
needed (for example, in Python, between parentheses indentation is not
significant).  If - as in that example - this is restricted to a small
fraction of the stream, then filtering is a reasonable approach (since
memory usage is not a problem).

Another issue involves the difference between spaces/tabs and "depth".
 One scheme might assign meaningful indentation to exact multiples of
4 spaces; another might only require that spaces "be consistent"; in
either case tabs may or may not be acceptable, and might have some
fixed size, or shift to the some fixed set of columns.

Yet another issue is blank lines.  If these are significant, they need
to be present in the stream.  If not, they are a nuisance and should
be filtered.

Putting all this together, my first reaction was to calculate a
"depth" for each indentation (using some user defined function of the
space matched at the start of the line) and to place that value in the
(tagged token) stream.  Furthermore, blank lines would be discarded
and a global (but stack-aware) "current indentation level" set as

A NoIndent matcher (token) would then match an indentation token with
a depth equal to the global indentation level.  This would allow a
line with no indentation to be matched.

In addition, an Indent matcher would exist, which takes a list of
matchers (tokens) as an argument.  This matcher: matches an indent
"one more" than the global value; increments the global value;
evaluates the internal matchers; on exit, decrements the global value.

Within the scope of the first Indent matcher, then, a NoIndent matcher
would match lines indented by "one".

That has at least two problems: continguous nested Indents (the second
indent has nothing to match because the indent token was consumed by
the first); second it assumes there's it's possible to calculate the
depth before matching.

The first problem can be addressed by making the Indent match but not
consume the token if some indentation "remains".  The second can be
fixed by using "number of spaces" rather than depth (and having Indent
consume some number, perhaps (FixedIndent), or simply take what it's
given (VariableIndent)).

There's also the case of wanting to ignore indentation.  That is, I
think, going to require a filter.  And NoIndent might be the right
name for that matcher, in which case perhaps Line would be a good
choice for what I was calling NoIndent?  Or Sol?

After writing the above I tried writing a parser (without having any
implementation) and learnt that defining what a line is, is important,
but using a separate Eol() is messy.  It seems better to have various
ways of defining lines.  Related to that, one example use case I had
missed is automatic continuation over multiple lines (indicated by

So perhaps the basic grouping is Line(), which has a related
Multiline() for automatic continuation.  This can be combined with
Indent(), NoIndent() and AnyIndent() (the last ignores indentation).
But Multiline and AnyIndent are both so closely related to Line that
perhaps they should be options on Line.  And if we do that, shouldn't
Indent be, too?

Is there some other syntax we can use?  I don't think so - we really
need this to group matches, so it has to have a (...) form, which
means a function invocation.   Perhaps a single letter function name,
plus flags?  There's always _(...), although that's traditionally for

Maybe it's better to separate the main parser from the structure, and
then have some kind of complex layout expression that is itself
parsed?  I'm thinking of nested lists.  But as soon as you get any
complexity (alternatives for example), you're better using functions
and the standard approach (ie Or()).

Here's a later attempt.  I'm now using Line... to make things appear related.

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = LineAny(symbol('#') + Star(Any()))[:]

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + LineIgnore(atom[:,symbol(',')]) + symbol(')')
        expr = ... # this includes nested function definitions

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr)
        func = Line(word('def') + word + args + symbol('=') +

        program = (func|func1)[:]

Inline comments would require separating out the "#..." part and
adding it to the end of an expr.

But there's still a problem with nested definitions.  If LineIndent
has consumed the indentation token, what does the Line() outside a
function definition match?

I guess this indicates that LineIndent doesn't consume indentation, it
just modifies the current level.  So perhaps should be called Indent.
And perhaps LineIgnore is better called Freeform?

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + expr +
        func = Line(word('def') + word + args + symbol('=') + Opt(comment)

        program = (func|func1)[:]

And it looks like it would be useful for a user to be able to easily
define something that was like Line but added an optional comment (and
in practice this should indeed be possible).

I haven't used LineMulti() above, but that would support implicit
continuation through indented lines.


More Indentation

From: andrew cooke <andrew@...>

Date: Sun, 30 Aug 2009 18:57:52 -0400

So looking at that I noticed that there's an inconsistent mess in how
Line and Indent work together.  Above I had:

        func = Line(word('def') + word + args + symbol('=') + Opt(comment)

There's no operator between the comment and the Indent.  My initial
reaction was that I had missed a &, but then Line would extend pass
end of line (dropping Eol and inferring it from the next indentation
doesn't solve the underlying problem, I think, which is that this is

Looking at the problem again, there are several different "things being done":
1 - Defining what constitutes a single line.
2 - Grouping lines with the same indentation together
3 - Indicating how one *group* is indented relative to another.
The confusion above seems to be confusing (3) with indicating how
*lines* indent relative to each other.

Using this new approach is going to be much more verbose, but also
clearer.  On a more positive note, it might help separate two separate
tasks - definition of lines and blocks - and might even allow more
useful combinators to be defined for specific cases.

It's worth clarifying the difference between line and block.  A line
starts with an empty indentation and ends with the end of line marker.
 There is an exception with Freeformat() which discards indentation
and eol markers - if that is used inside a line then the line can
extend over several lines of text.

A block contains a one or more matchers that consume lines (these can
be simple lines, or they may be combined using Or, And etc).  Because
they work by incrementing/decrementing the number of spaces "used"
they do not alter the stream flow in any way.  This means that there
is very little constraint on how lines inside a block are combined.
If a block contains several lines, separated by commas, then they are
joined by And().

Given the above, Indent() seems like it is really an initial parameter
to a block, which sets the space to consume (perhaps by reading the
indentation, perhaps by simply incrementing by some fixed amount).

So the syntax would be Block(indent, Line(), Line()...) where indent
is some value (number of spaces, or some function)?  But this value is
likely global, so we can set it in config and use a keyword argument
for exceptional cases.

The example then becomes:

        word = Token('[a-z_][a-z0-9_]*')
        number = Token(Integer)
        symbol = Token('[^a-z0-9_]')

        # any indent, entire line
        comment = symbol('#') + Star(Any())

        atom = number | word
        # ignore line related tokens
        args = symbol('(') + Freeform(atom[:,symbol(',')]) + symbol(')')
        simple_expr = ...
        expr = Line(simple_expr + Opt(comment))

        line_comment = LineAny(comment)

        # single line function is distinct
        func1 = Line(word('def') + word + args + symbol('=') + \
                    expr + Opt(comment))
        func = Line(word('def') + word + args + symbol('=') + \
                  Opt(comment)) + Block((expr|func|func1)[:])

        program = (func|func1)[:]


Comment on this post