| Andrew Cooke | Contents | Latest | RSS | Twitter | Previous | Next

C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

Personal Projects

Lepl parser for Python.

Colorless Green.

Photography around Santiago.

SVG experiment.

Professional Portfolio

Calibration of seismometers.

Data access via web services.

Cache rewrite.

Extending OpenSSH.

C-ORM: docs, API.

Last 100 entries

Weird Finite / Infinite Result; Your diamond is a beaten up mess; Maths Books; Good Bike Route from Providencia / Las Condes to Panul\; Iain Pears (Author of Complex Plots); Plum Jam; Excellent; More Recently; For a moment I forgot StackOverflow sucked; A Few Weeks On...; Chilean Book Recommendations; How To Write Shared Libraries; Jenny Erpenbeck (Author); Dijkstra, Coins, Tables; Python libraries error on OpenSuse; Deserving Trump; And Smugness; McCloskey Economics Trilogy; cmocka - Mocks for C; Concept Creep (Americans); Futhark - OpenCL Language; Moved / Gone; Fan and USB issues; Burgers in Santiago; The Origin of Icosahedral Symmetry in Viruses; autoenum on PyPI; Jars Explains; Tomato Chutney v3; REST; US Elections and Gender: 24 Point Swing; PPPoE on OpenSuse Leap 42.1; SuperMicro X10SDV-TLN4F/F with Opensuse Leap 42.1; Big Data AI Could Be Very Bad Indeed....; Cornering; Postcapitalism (Paul Mason); Black Science Fiction; Git is not a CDN; Mining of Massive Data Sets; Rachel Kaadzi Ghansah; How great republics meet their end; Raspberry, Strawberry and Banana Jam; Interesting Dead Areas of Math; Later Taste; For Sale; Death By Bean; It's Good!; Tomato Chutney v2; Time ATAC MX 2 Pedals - First Impressions; Online Chilean Crafts; Intellectual Variety; Taste + Texture; Time Invariance and Gauge Symmetry; Jodorowsky; Tomato Chutney; Analysis of Support for Trump; Indian SF; TP-Link TL-WR841N DNS TCP Bug; TP-Link TL-WR841N as Wireless Bridge; Sending Email On Time; Maybe run a command; Sterile Neutrinos; Strawberry and Banana Jam; The Best Of All Possible Worlds; Kenzaburo Oe: The Changeling; Peach Jam; Taste Test; Strawberry and Raspberry Jam; flac to mp3 on OpenSuse 42.1; Also, Sebald; Kenzaburo Oe Interview; Otake (Kitani Minoru) move Black 121; Is free speech in British universities under threat?; I am actually good at computers; Was This Mansplaining?; WebFaction / LetsEncrypt / General Disappointment; Sensible Philosophy of Science; George Ellis; Misplaced Intuition and Online Communities; More Reading About Japan; Visibilty / Public Comments / Domestic Violence; Ferias de Santiago; More (Clearly Deliberate); Deleted Obit Post; And then a 50 yo male posts this...; We Have Both Kinds Of Contributors; Free Springer Books; Books on Religion; Books on Linguistics; Palestinan Electronica; Books In Anthropology; Taylor Expansions of Spacetime; Info on Juniper; Efficient Stream Processing; The Moral Character of Crypto; Hearing Aid Info; Small Success With Go!; Re: Quick message - This link is broken; Adding Reverb To The Echo Chamber; Sox Audio Tools; Would This Have Been OK?; Honesty only important economically before institutions develop

© 2006-2015 Andrew Cooke (site) / post authors (content).

Next Step for RXPY/Lepl integration

From: andrew cooke <andrew@...>

Date: Sun, 29 May 2011 11:22:09 -0400

RXPY is a Python 2 project that implements the re2 approach to regexps in pure
Python.  Lepl is a recursive descent parser that can delegate to a regular
expression library in various areas (for example, it can compile some
sub-parsers to regular expressions).

Lepl contains two regexp implementations - NFA and DFA - but they have some
problems:
 - They don't implement all regexp features
 - They don't implement the standard Python interface for regular
   expressions (re package)

RXPY addresses both of these issues - all the engines are feature complete and
have the Python interface (note that the docs for RXPY are very incomplete -
the package is in a much better state than described).

Both Lepl and RXPY support both backtracking and "stable" algorithms (stable
in that they don't suffer from combinatorial explosion on certain patterns).
However, the Lepl backtracking engine is particular slow at handling at
pre-compiling a common pattern (for float values).  RXPY has not been tested
on this.

However, Lepl needs more than the standard Python interface.  It requires:
 1 Input to be wrapped in a "stream" abstraction.  This allows "infinite"
   data to be handled with finite resources.
 2 Python 3 support
 3 Multiple and ordered matches
 4 Fast matching of entire groups and lazy matching of sub-groups

The last two points need more explanation:

(3) Multiple and ordered matches are used by the tokenizer.  A set of
different regular expressions are matched against the input and the longest
match wins.  This match must be associated with input pattern (ie we must know
which pattern won); if multiple patterns tie with the longest match we want to
know all patterns; sometimes we may want to tie this back to the ordering of
patterns (in particular, we want the "first" pattern that matches).

This can be achieved by combining named groups with "or" (the | symbol), but
named groups require a slower approach in RXPY.  It may be that the special
case of groups that match the entire string can be optimised into the faster
approach (the main RXPY matcher uses a fast initial match and the re-processes
for groups).  For example, the group name can be embedded in the final (exit)
node(s) of the graph that describe the regular expression.

(4) People tend to use "(...)" to group patterns in regular expressions, even
when they do not want to access the results.  Above I described how RXPY uses
two approaches - a faster, group-free stage and then re-processing for groups.
Currently the second stage is triggered by the expression itself, as it
matches.  Instead, I believe that we can wait until the user requests a named
group.  Since, in many situations, the fast stage can tell whether the
expression matched, this avoids the second stage unless it is "really needed".

One final aim: I would like to make the code particularly well documented.
According to my website logs there is quite a lot of interest in implementing
regular expressions (from students, I guess).  So a clear, well documented
implementation could be useful.

Andrew

More on Lepl + RXPY

From: andrew cooke <andrew@...>

Date: Sun, 26 Jun 2011 20:09:47 -0400

None of the work above is done yet; I am still moving code from RXPY to Lepl
(I'm modifying it slightly as I go and have been slowed considerably by
the switch to Intellij Idea / PyCharm).

One change I need to make to the code is to make it run with Python 3.  RXPY
was developed under Python 2.  The difference between P 2 and 3 is significant
for code that deals with text because the string/Unicode types changed
significantly.

RXPY contains the idea of "alphabets" (also present in Lepl).  These define
which "characters" can be matched in the input.  This is not restricted to
text - you could use RXPY to match lists of integers, for example.  But these
are *decoupled* from the input.  RXPY assumes that the input is text - one of
the tasks of an alphabet is to translate from the text representation to the
types that will be matched.

The closes that Python 2's re package has to this is the Unicode/ASCII flag.
So the old RXPY code had ASCII and Unicode alphabets.  But in Pythin 3 there
is a new, and orthogonal twist - re works with both strings and bytes.

More confusingly, the ASCII/Unicode distinction is separate from the
bytes/string distinction.  ASCII/Unicode only affects (as far as I can see)
how character classes are interpreted (things like upper and lower case, or
what is a digit, and what not).

Also, Python 3's re package expects the input expression to match the input
data.  So if you want to match bytes, you give a byte string regexp.

How can this fit with RXPY?  In my opinion RXPY's approach is much cleaner and
I don't want to lose that.  At the same time, compatability is relatively
important.

One easy change is to align alphabets with byte/string rather than
ASCII/Unicode.  So I am going to change RXPY so that it has alphabets for
Unicode and for bytes.  At the same time, the ASCII/Unicode flag is going to
be just a flag - something that is read by both alphabets and changes their
behaviour accordingly.

But what about the coupling between the data format and the format used to
describe the regexp?  That seems completely dumb to me.  What I will do, I
think, is add a special test that duplicates Python's behaviour, but which can
be disabled via an option.  If enabled, it will require that alphabets match
the regexp.  If not, it will handle both unicode and bytes (where the bytes
are representing ASCII characters, where necessary).

Andrew

Yet More...

From: andrew cooke <andrew@...>

Date: Mon, 27 Jun 2011 00:19:11 -0400

OK, after thinking some more:

- Alphabets will define the type of the input they expect.  All but the bytes
  alphabet will inherit from a base class that expects normal strings
  (unicode).  The bytes alphabet will expect bytes.

- Alphabets will also define the constants used in the parsing (things like
  "?" and "*").  The parser will make comparisons against these.

- Alphabets will be extended, where necesary, to take the ASCII/Unicode flag
  and act accordingly.

This makes RXPY more general, while bringing it closer to Python.

Andrew

Comment on this post