© 2006-2015 Andrew Cooke (site) / post authors (content).

I Just Wrote a Regular Expression Engine!

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:02:50 -0300 (CLST)

Heh.  I just finished a regular expression engine in Python.  It's the
"real deal" in that it "compiles" to a finite state machine (so it runs in
time proportional to the length of the string to be matched).  It doesn't
compress multiple character jumps into a single step, but it does
otherwise generate a compact machine (as far as I understand these
things).
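
As a rough illustration of why a compiled machine runs in linear time, here
is a minimal sketch of a table-driven matcher.  It is not the LEPL code: the
pattern 'ab*', the table DFA_TABLE and the function dfa_match are all
hypothetical, and the table is written by hand rather than compiled.

```python
# Hand-built DFA for the pattern 'ab*' (illustrative only, not LEPL code).
# State 0 is the start; state 1 means "seen 'a' followed by zero or more 'b'".
DFA_TABLE = {
    (0, 'a'): 1,
    (1, 'b'): 1,
}
DFA_ACCEPTING = {1}


def dfa_match(text):
    """Return True if the whole of text matches 'ab*'."""
    state = 0
    for char in text:
        # Exactly one table lookup per input character, so the running
        # time is proportional to len(text).
        state = DFA_TABLE.get((state, char))
        if state is None:
            # No transition for this character: the match has failed.
            return False
    return state in DFA_ACCEPTING
```

There is no backing up over the input: each character is read once and
either moves the machine to a new state or ends the match.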

Being pure Python it's both better and worse than the standard "re"
package.  In fact it's mainly worse - it must be slower, it doesn't match
sub-expressions, and it has a very, very simple syntax.

But it does have a few advantages.  First, it's a generator, so it yields
each match as it finds it.  Second, it takes a sequence, rather than a
string as an argument, which means that the entire string doesn't have to
be read into memory.  Third, I understand it and can take it apart and
extend it, which means I can add Python functions to it.  I could even
make it work with arbitrary lists (non-characters) pretty easily.
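
The first two points can be sketched as below.  This is only my guess at the
general shape, not the real implementation: the combined machine for the
tagged patterns 1: 'a*' and 2: 'ab' is built by hand, and the names
TAGGED_TABLE, TAGGED_ACCEPTING and all_prefix_matches are made up for the
example.  It assumes the items in the sequence are characters (so they can
be joined into a string for the result).

```python
# Hand-built combined DFA for two tagged patterns (illustrative only):
#   tag 1: 'a*'    tag 2: 'ab'
TAGGED_TABLE = {
    (0, 'a'): 1,   # 'a' seen: 'a*' matches, 'ab' still possible
    (1, 'a'): 2,   # 'aa...' seen: only 'a*' can still match
    (2, 'a'): 2,
    (1, 'b'): 3,   # 'ab' seen: only 'ab' matches
}
# Which tags are matched on arrival at each state.
TAGGED_ACCEPTING = {0: [1], 1: [1], 2: [1], 3: [2]}


def all_prefix_matches(table, accepting, stream):
    """Yield (tag, matched_text) as soon as each match is found.

    stream can be any iterable; items are pulled one at a time, so the
    whole input never needs to be in memory.
    """
    state, seen = 0, []
    for tag in accepting.get(state, ()):
        yield tag, ''                      # the empty prefix may match
    for item in stream:
        state = table.get((state, item))
        if state is None:                  # no transition: nothing more to find
            return
        seen.append(item)
        for tag in accepting.get(state, ()):
            yield tag, ''.join(seen)
```

For example, list(all_prefix_matches(TAGGED_TABLE, TAGGED_ACCEPTING,
iter('abx'))) gives [(1, ''), (1, 'a'), (2, 'ab')], and asking for only the
first result with next() reads nothing from the iterator.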

Actually, as I implemented this, I realised that there were various things
about the standard Python regexp implementation that I didn't understand
that well, so some of the above may be wrong.  Next thing to do is to look
more closely at the standard library (yes, perhaps I should have started
that way, but way back then I didn't know what to ask).

Here's the test I just got running.  Note that the matcher (the FSM) takes
a list of regexps, and that each has a tag (here, integers).  The results
include the tags.  Also, that's the full regexp syntax - all I support is
literal characters, ranges, and "*".

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a[a-cx]*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aa'), (3, 'aax'),
                       (2, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxc'),
                       (2, 'aaxbxcx')]

The source will be in the next LEPL release.


Corrected Test

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:11:57 -0300 (CLST)

The test I meant to write (includes () grouping for *):

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a([a-c]x)*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aax'), (3, 'aax'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')]
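
The expected results can be cross-checked against the standard "re" package
with a brute-force sketch like the one below.  This is not how the FSM works
(it tries every pattern against every prefix, so it is at least quadratic in
the length of the input), and the helper name all_prefix_matches_re is made
up for the example.  It assumes results are ordered by prefix length and
then by position in the list of patterns.

```python
import re


def all_prefix_matches_re(tagged_patterns, text):
    """Yield (tag, prefix) for every prefix of text matched by a pattern.

    Brute force using the standard library: each pattern is tried
    against each prefix in turn.
    """
    compiled = [(tag, re.compile(pattern))
                for tag, pattern in tagged_patterns]
    for end in range(len(text) + 1):
        prefix = text[:end]
        for tag, regexp in compiled:
            # fullmatch requires the pattern to consume the whole prefix.
            if regexp.fullmatch(prefix):
                yield tag, prefix


patterns = [(1, 'a*'), (2, 'a([a-c]x)*'), (3, 'aax')]
results = list(all_prefix_matches_re(patterns, 'aaxbxcxdx'))
```

Here results holds the same eight pairs, in the same order, as the assertion
in the test above.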

Also, apologies for typos in text/title.  Given the nature of this
(email-based) blog it's too much effort to always be correcting things...



From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 10:34:59 -0400 (CLT)

Ooops.  I had ignored embedded alternatives, only allowing a choice at the
start, thinking that I was not losing anything.  But in fact that means
the current implementation has no backtracking.  Fortunately, I don't
think it will be hard to extend.


Possibly Complete

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:08:14 -0400 (CLT)

Hmm.  I implemented choices without thinking that much, and now it strikes
me that there is no backtracking - the FSM just transitions away...  I
guess that makes sense?  I need to sleep on it.

Anyway, here's the current test:

    regexps = [unicode_parser(1, 'a*'),
               unicode_parser(2, 'a([a-c]x|axb)*'),
               unicode_parser(3, 'aax')]
    fsm = SimpleFsm(regexps, UNICODE)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'), (1, 'aa'),
                       (2, 'aax'), (3, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')], results

(The initial list could now be written as a single expression, except that
there is no way to specify a label in-line).


Does Make Sense

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:34:24 -0400 (CLT)

Of course it makes sense - my FSM is deterministic (which means it may
need exponential size for the lookup table in certain cases).
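
A standard textbook case of that blow-up is "the n-th symbol from the end is
'a'", i.e. (a|b)*a(a|b)^(n-1).  The sketch below is not taken from my FSM
code; nth_from_end_nfa and count_dfa_states are hypothetical names, and the
non-deterministic machine is written out by hand.  It uses the usual subset
construction to count how many deterministic states are reachable.

```python
def nth_from_end_nfa(n):
    """NFA for (a|b)*a(a|b)^(n-1): the n-th symbol from the end is 'a'.

    States are 0..n; state 0 is the start and state n is accepting.
    """
    nfa = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, n):
        nfa[(i, 'a')] = {i + 1}
        nfa[(i, 'b')] = {i + 1}
    return nfa


def count_dfa_states(nfa, start=0, alphabet='ab'):
    """Subset construction: count DFA states reachable from {start}."""
    initial = frozenset([start])
    seen, todo = {initial}, [initial]
    while todo:
        current = todo.pop()
        for char in alphabet:
            # A DFA state is the set of NFA states reachable on char.
            target = frozenset(t for s in current
                               for t in nfa.get((s, char), ()))
            if target and target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)
```

For n from 1 to 5 the counts are 2, 4, 8, 16 and 32: the deterministic
machine has to remember which of the last n symbols were 'a', so the table
doubles with each extra symbol.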

Also, I don't have "epsilon"?  Am I still incomplete?  I think so....
Perhaps best to add it with "?"?  In fact, perhaps I do have epsilon if I
just relax the parser to accept, for example, "(a|)"...
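
The idea that an empty alternative is the same thing as "?" can at least be
checked against the standard "re" package, which already accepts "(a|)".
This says nothing about my own parser; the pattern 'b' tacked on the end and
the helper same_language are just for the example.

```python
import re

# An empty branch in the alternation plays the role of epsilon.
with_empty_branch = re.compile('(a|)b')
with_optional = re.compile('a?b')


def same_language(samples):
    """True if both patterns agree on every sample string."""
    return all(
        bool(with_empty_branch.fullmatch(s)) == bool(with_optional.fullmatch(s))
        for s in samples
    )
```

Both patterns accept 'b' and 'ab' and reject 'aab', so same_language(['',
'b', 'ab', 'aab']) is True.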

Should probably add "." and "^" too (although both of those are clearly sugar).

