© 2006-2013 Andrew Cooke (site) / post authors (content).

Next Step for RXPY/Lepl integration

From: andrew cooke <andrew@...>

Date: Sun, 29 May 2011 11:22:09 -0400

RXPY is a Python 2 project that implements the re2 approach to regexps in pure
Python.  Lepl is a recursive descent parser that can delegate to a regular
expression library in various areas (for example, it can compile some
sub-parsers to regular expressions).

Lepl contains two regexp implementations - NFA and DFA - but they have some
limitations:
 - They don't implement all regexp features
 - They don't implement the standard Python interface for regular
   expressions (re package)

RXPY addresses both of these issues - all the engines are feature complete and
have the Python interface (note that the docs for RXPY are very incomplete -
the package is in a much better state than described).

Both Lepl and RXPY support both backtracking and "stable" algorithms (stable
in that they don't suffer from combinatorial explosion on certain patterns).
However, the Lepl backtracking engine is particularly slow at pre-compiling a
common pattern (for float values).  RXPY has not been tested
on this.

However, Lepl needs more than the standard Python interface.  It requires:
 1 Input to be wrapped in a "stream" abstraction.  This allows "infinite"
   data to be handled with finite resources.
 2 Python 3 support
 3 Multiple and ordered matches
 4 Fast matching of entire groups and lazy matching of sub-groups
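
As an illustration of point 1, here is a minimal sketch of a lazy stream.  It
is not Lepl's actual stream code: the Stream class and the take helper are
hypothetical names, and the design (a linked list whose tail is read on
demand) is only one way of getting "infinite" input with finite resources.

```python
from itertools import count


class Stream:
    """Hypothetical lazy stream: a linked list whose tail is read on demand."""

    def __init__(self, source):
        self._source = source   # shared iterator supplying the items
        self._loaded = False
        self._head = None
        self._tail = None

    def _load(self):
        # Pull exactly one item from the source, the first time it is needed.
        if not self._loaded:
            self._loaded = True
            try:
                self._head = next(self._source)
                self._tail = Stream(self._source)
            except StopIteration:
                pass  # leave _tail as None to mark the end of the input

    @property
    def empty(self):
        self._load()
        return self._tail is None

    @property
    def head(self):
        self._load()
        if self._tail is None:
            raise IndexError("end of stream")
        return self._head

    @property
    def tail(self):
        self._load()
        if self._tail is None:
            raise IndexError("end of stream")
        return self._tail


def take(stream, n):
    """Read up to n items; return them with the stream that follows."""
    items = []
    while n > 0 and not stream.empty:
        items.append(stream.head)
        stream = stream.tail
        n -= 1
    return items, stream


numbers = Stream(count())        # an "infinite" input
first, rest = take(numbers, 3)   # reads only three items from the source
again, _ = take(numbers, 2)      # backtracking re-reads buffered items
```

Only the nodes that are still referenced stay in memory, so a matcher that
drops its reference to the start of the stream lets the consumed prefix be
garbage collected.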

The last two points need more explanation:

(3) Multiple and ordered matches are used by the tokenizer.  A set of
different regular expressions is matched against the input and the longest
match wins.  This match must be associated with its pattern (i.e. we must know
which pattern won); if multiple patterns tie for the longest match we want to
know all of them; sometimes we may want to tie this back to the ordering of
patterns (in particular, we want the "first" pattern that matches).
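
The tokenizer requirement can be sketched with the standard re package.  This
is only an illustration of the behaviour wanted, not Lepl or RXPY code: the
longest_matches function and the PATTERNS table are hypothetical, and each
pattern is tried separately, which is exactly the cost a combined matcher
would avoid.  Note also that re returns the first match it finds for each
pattern, which is not always that pattern's longest possible match.

```python
import re


def longest_matches(patterns, text, pos=0):
    """Return (length, names) for the longest match at pos.

    names lists every pattern that ties for the longest match, in the
    order the patterns were given, so names[0] is the "first" winner.
    """
    best_len = -1
    winners = []
    for name, regex in patterns:
        m = regex.match(text, pos)
        if m is None:
            continue
        length = m.end() - pos
        if length > best_len:
            best_len, winners = length, [name]
        elif length == best_len:
            winners.append(name)
    return best_len, winners


# Hypothetical token definitions, in priority order.
PATTERNS = [
    ("keyword", re.compile(r"if|else")),
    ("identifier", re.compile(r"[a-z]+")),
    ("integer", re.compile(r"[0-9]+")),
]

tie = longest_matches(PATTERNS, "if x")      # keyword and identifier tie
longer = longest_matches(PATTERNS, "ifx")    # identifier is longer
nothing = longest_matches(PATTERNS, "+")     # no pattern matches
```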

This can be achieved by combining named groups with "or" (the | symbol), but
named groups require a slower approach in RXPY.  It may be that the special
case of groups that match the entire string can be optimised into the faster
approach (the main RXPY matcher uses a fast initial match and then re-processes
for groups).  For example, the group name can be embedded in the final (exit)
node(s) of the graph that describe the regular expression.
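
For comparison, this is what combining named groups with "|" looks like in
the standard re package (not RXPY).  The group names are made up for the
example.  Python's re is a backtracking engine that takes the first
alternative that succeeds, so this gives the "first pattern" ordering but not
the longest match.

```python
import re

# One named group per token type, joined with "|" in priority order.
TOKENS = re.compile(
    r"(?P<keyword>if|else)|(?P<identifier>[a-z]+)|(?P<integer>[0-9]+)"
)


def first_match(text, pos=0):
    """Return (group name, matched text) for the first alternative that matches."""
    m = TOKENS.match(text, pos)
    if m is None:
        return None
    # lastgroup is the name of the last capturing group that matched; with
    # one top-level group per alternative it identifies the winning pattern.
    return m.lastgroup, m.group()


keyword_wins = first_match("ifx")   # the keyword alternative is tried first
number = first_match("42")
```

Here "ifx" is reported as the keyword "if", although the identifier pattern
would match more of the input.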

(4) People tend to use "(...)" to group patterns in regular expressions, even
when they do not want to access the results.  Above I described how RXPY uses
two approaches - a faster, group-free stage and then re-processing for groups.
Currently the second stage is triggered by the expression itself, as it
matches.  Instead, I believe that we can wait until the user requests a named
group.  Since, in many situations, the fast stage can tell whether the
expression matched, this avoids the second stage unless it is "really needed".
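
A rough sketch of the lazy second stage, assuming a match object that
remembers the input and only re-processes it when a group is requested.
LazyPattern and LazyMatch are hypothetical names, and both stages use the
standard re package as stand-ins (in RXPY the first stage would be the fast,
group-free engine).

```python
import re


class LazyMatch:
    """Hypothetical match object: groups are computed on first request."""

    def __init__(self, grouped, text, end):
        self._grouped = grouped   # compiled pattern that records groups
        self._text = text
        self.end = end            # known from the fast stage alone
        self._full = None
        self.slow_runs = 0        # counts how often the slow stage ran

    def group(self, name):
        if self._full is None:
            # Second stage: re-process the input, this time tracking groups.
            self.slow_runs += 1
            self._full = self._grouped.match(self._text)
        return self._full.group(name)


class LazyPattern:
    def __init__(self, fast, grouped):
        self._fast = re.compile(fast)        # stand-in for the group-free stage
        self._grouped = re.compile(grouped)  # stand-in for the group stage

    def match(self, text):
        m = self._fast.match(text)
        if m is None:
            return None
        return LazyMatch(self._grouped, text, m.end())


FLOAT = LazyPattern(
    fast=r"[0-9]+(?:\.[0-9]+)?",
    grouped=r"(?P<whole>[0-9]+)(?:\.(?P<frac>[0-9]+))?",
)

m = FLOAT.match("3.14")
matched_without_groups = (m.end, m.slow_runs)   # slow stage not yet run
frac = m.group("frac")                          # triggers the slow stage once
whole = m.group("whole")                        # reuses the cached result
```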

One final aim: I would like to make the code particularly well documented.
According to my website logs there is quite a lot of interest in implementing
regular expressions (from students, I guess).  So a clear, well documented
implementation could be useful.


More on Lepl + RXPY

From: andrew cooke <andrew@...>

Date: Sun, 26 Jun 2011 20:09:47 -0400

None of the work above is done yet; I am still moving code from RXPY to Lepl
(I'm modifying it slightly as I go and have been slowed considerably by
the switch to IntelliJ IDEA / PyCharm).

One change I need to make to the code is to make it run with Python 3.  RXPY
was developed under Python 2.  The difference between Python 2 and 3 is
significant for code that deals with text because the string/Unicode types
changed.

RXPY contains the idea of "alphabets" (also present in Lepl).  These define
which "characters" can be matched in the input.  This is not restricted to
text - you could use RXPY to match lists of integers, for example.  But these
are *decoupled* from the input.  RXPY assumes that the input is text - one of
the tasks of an alphabet is to translate from the text representation to the
types that will be matched.
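
To make the decoupling concrete, here is a toy sketch in which the pattern is
written as text but the data is a list of integers.  None of this is the RXPY
API: IntegerAlphabet, compile_literal and matches_prefix are invented for the
example, and the "regexp" is just a literal sequence.

```python
class IntegerAlphabet:
    """Hypothetical alphabet whose "characters" are integers."""

    def parse(self, text):
        # Translate the textual form used in the pattern into the type
        # that will actually be matched.
        return int(text)


def compile_literal(pattern, alphabet):
    """Turn a text pattern such as "1 22 3" into a list of "characters"."""
    return [alphabet.parse(token) for token in pattern.split()]


def matches_prefix(compiled, data):
    """True if data starts with the compiled sequence."""
    return list(data[:len(compiled)]) == compiled


compiled = compile_literal("1 22 3", IntegerAlphabet())
hit = matches_prefix(compiled, [1, 22, 3, 4])
miss = matches_prefix(compiled, [1, 2, 3])
```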

The closest that Python 2's re package has to this is the Unicode/ASCII flag.
So the old RXPY code had ASCII and Unicode alphabets.  But in Python 3 there
is a new, orthogonal twist - re works with both strings and bytes.

More confusingly, the ASCII/Unicode distinction is separate from the
bytes/string distinction.  ASCII/Unicode only affects (as far as I can see)
how character classes are interpreted (things like upper and lower case, or
what is a digit, and what not).
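
A small check with Python 3's standard re package shows the effect of the
flag on character classes.  The sample characters are my own choice.

```python
import re

ARABIC_THREE = "\u0663"   # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit

# Without re.ASCII, \d and \w follow the Unicode character properties.
unicode_digit = re.fullmatch(r"\d", ARABIC_THREE) is not None
unicode_word = re.fullmatch(r"\w", "\u00e9") is not None   # e with acute accent

# With re.ASCII the same classes are restricted to the ASCII range.
ascii_digit = re.fullmatch(r"\d", ARABIC_THREE, re.ASCII) is not None
ascii_word = re.fullmatch(r"\w", "\u00e9", re.ASCII) is not None
```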

Also, Python 3's re package expects the input expression to match the input
data.  So if you want to match bytes, you give a byte string regexp.
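
The coupling is easy to see with the standard re package in Python 3.  The
describe helper is just for this example.

```python
import re


def describe(pattern, data):
    """Report what re.search does for a given pattern/data combination."""
    try:
        found = re.search(pattern, data)
    except TypeError:
        return "TypeError"
    return "match" if found else "no match"


str_on_str = describe(r"\d+", "abc123")        # string pattern, string data
bytes_on_bytes = describe(rb"\d+", b"abc123")  # bytes pattern, bytes data
str_on_bytes = describe(r"\d+", b"abc123")     # mixed: rejected by re
bytes_on_str = describe(rb"\d+", "abc123")     # mixed: rejected by re
```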

How can this fit with RXPY?  In my opinion RXPY's approach is much cleaner and
I don't want to lose that.  At the same time, compatability is relatively

One easy change is to align alphabets with byte/string rather than
ASCII/Unicode.  So I am going to change RXPY so that it has alphabets for
Unicode and for bytes.  At the same time, the ASCII/Unicode flag is going to
be just a flag - something that is read by both alphabets and changes their
behaviour accordingly.

But what about the coupling between the data format and the format used to
describe the regexp?  That seems completely dumb to me.  What I will do, I
think, is add a special test that duplicates Python's behaviour, but which can
be disabled via an option.  If enabled, it will require that alphabets match
the regexp.  If not, it will handle both unicode and bytes (where the bytes
are representing ASCII characters, where necessary).
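
One possible reading of that optional test, as a sketch only.  The function
name adapt_pattern and the strict parameter are assumptions, not existing
RXPY options.

```python
def adapt_pattern(pattern, data_type, strict=True):
    """Return the pattern in the type the alphabet expects (str or bytes).

    With strict=True this mimics Python's behaviour and rejects a mismatch.
    With strict=False the pattern is converted, treating bytes as ASCII.
    """
    if data_type not in (str, bytes):
        raise ValueError("data_type must be str or bytes")
    if isinstance(pattern, data_type):
        return pattern
    if strict:
        raise TypeError(
            "pattern is %s but the data is %s"
            % (type(pattern).__name__, data_type.__name__)
        )
    if data_type is bytes:
        # Fails with UnicodeEncodeError if the pattern is not pure ASCII.
        return pattern.encode("ascii")
    return pattern.decode("ascii")


lenient_bytes = adapt_pattern("a+", bytes, strict=False)
lenient_str = adapt_pattern(b"a+", str, strict=False)
try:
    adapt_pattern("a+", bytes)   # strict by default
    strict_result = "accepted"
except TypeError:
    strict_result = "rejected"
```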


Yet More...

From: andrew cooke <andrew@...>

Date: Mon, 27 Jun 2011 00:19:11 -0400

OK, after thinking some more:

- Alphabets will define the type of the input they expect.  All but the bytes
  alphabet will inherit from a base class that expects normal strings
  (unicode).  The bytes alphabet will expect bytes.

- Alphabets will also define the constants used in the parsing (things like
  "?" and "*").  The parser will make comparisons against these.

- Alphabets will be extended, where necessary, to take the ASCII/Unicode flag
  and act accordingly.

This makes RXPY more general, while bringing it closer to Python.
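
The three points above could look something like the following.  This is a
sketch of the plan, not the final RXPY code: the class names, attribute names
and helper functions are all placeholders.

```python
import re  # used only for the re.ASCII flag constant


class StringAlphabet:
    """Base for alphabets that expect normal (unicode) strings."""

    input_type = str
    # Constants the parser compares against.
    STAR = "*"
    QUESTION = "?"

    def __init__(self, flags=0):
        self.ascii = bool(flags & re.ASCII)

    def is_digit(self, char):
        if self.ascii:
            return "0" <= char <= "9"
        return char.isdecimal()   # any Unicode decimal digit


class UnicodeAlphabet(StringAlphabet):
    """Plain text; inherits everything from the string base class."""


class BytesAlphabet:
    """The one alphabet that expects bytes rather than strings."""

    input_type = bytes
    STAR = b"*"
    QUESTION = b"?"

    def __init__(self, flags=0):
        self.ascii = True   # assumption: bytes are always treated as ASCII

    def is_digit(self, char):
        return b"0" <= char <= b"9"   # char is a length-1 bytes object


def is_quantifier(alphabet, token):
    """What the parser would do: compare against the alphabet's constants."""
    return token in (alphabet.STAR, alphabet.QUESTION)


def check_input(alphabet, data):
    """Reject input that is not of the type the alphabet expects."""
    if not isinstance(data, alphabet.input_type):
        raise TypeError("expected %s" % alphabet.input_type.__name__)
    return data


unicode_digit = UnicodeAlphabet().is_digit("\u0663")        # Arabic-Indic 3
ascii_digit = UnicodeAlphabet(re.ASCII).is_digit("\u0663")
bytes_star = is_quantifier(BytesAlphabet(), b"*")
```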

