© 2006-2015 Andrew Cooke (site) / post authors (content).

I Just Wrote a Regular Expression Engine!

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:02:50 -0300 (CLST)

Heh.  I just finished a regular expression engine in Python.  It's the
"real deal" in that it "compiles" to a finite state machine (so it runs in
time proportional to the length of the string to be matched).  It doesn't
compress multiple character jumps into a single step, but it does
otherwise generate a compact machine (as far as I understand these
things).
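
To illustrate the linear-time claim, here is a minimal sketch of a table-driven matcher.  The table is written by hand for the pattern a[a-cx]* (it is not produced by a compiler and is not the LEPL code); the names TRANSITIONS, ACCEPTING and longest_match are made up for the example.  Each input character costs one dictionary lookup, so the work grows with the length of the input.

```python
# Hand-built transition table for the pattern a[a-cx]* (an assumption for
# illustration; a real engine would generate this from the regexp).
TRANSITIONS = {
    0: {'a': 1},                   # start: must see an 'a'
    1: {c: 1 for c in 'abcx'},     # then any number of a, b, c or x
}
ACCEPTING = {1}


def longest_match(text):
    """Return the longest prefix of text accepted by the table, or None."""
    state, best = 0, None
    for i, ch in enumerate(text):
        state = TRANSITIONS[state].get(ch)
        if state is None:
            break                  # no transition: the machine is dead
        if state in ACCEPTING:
            best = text[:i + 1]
    return best


print(longest_match('aaxbxcxdx'))  # aaxbxcx
print(longest_match('xyz'))        # None
```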

Being pure Python it's both better and worse than the standard "re"
package.  In fact it's mainly worse - it must be slower, it doesn't match
sub-expressions, and it has a very, very simple syntax.

But it does have a few advantages.  First, it's a generator, so it yields
each match as it finds it.  Second, it takes a sequence, rather than a
string as an argument, which means that the entire string doesn't have to
be read into memory.  Third, I understand it and can take it apart and
extend it, which means I can add Python functions to it.  I could even
make it work with arbitrary lists (non-characters) pretty easily.
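
The generator and sequence points can be sketched together.  What follows is an assumption-laden toy, not the engine itself: TABLE is a hand-written machine for the token pattern "1 2*" over integers (so non-characters), and prefix_matches is a hypothetical name.  It yields each accepted prefix as soon as it is reached and reads from any iterable, so the input never has to be held in memory as a whole.

```python
import itertools

# Hypothetical table for the token pattern "1 2*": a 1, then any number of 2s.
TABLE = {0: {1: 1}, 1: {2: 1}}
ACCEPTING = {1}


def prefix_matches(tokens):
    """Lazily yield each accepted prefix of an iterable of tokens."""
    state, seen = 0, []
    if state in ACCEPTING:
        yield list(seen)
    for token in tokens:
        state = TABLE[state].get(token)
        if state is None:
            return                 # dead state: stop reading input
        seen.append(token)
        if state in ACCEPTING:
            yield list(seen)


# An endless stream: only the first four items are ever consumed.
stream = itertools.chain([1, 2, 2, 3], itertools.count(4))
print(list(prefix_matches(stream)))  # [[1], [1, 2], [1, 2, 2]]
```

Because the matcher returns as soon as no transition exists, it terminates even though the stream never ends.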

Actually, as I implemented this, I realised that there were various things
about the standard Python regexp implementation that I didn't understand
that well, so some of the above may be wrong.  Next thing to do is to look
more closely at the standard library (yes, perhaps I should have started
that way, but way back then I didn't know what to ask).

Here's the test I just got running.  Note that the matcher (the FSM) takes
a list of regexps, and that each has a tag (here, integers).  The results
include the tags.  Also, that's the full regexp syntax - all I support is
literal characters, ranges, and "*".

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a[a-cx]*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aa'), (3, 'aax'),
                       (2, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxc'),
                       (2, 'aaxbxcx')]
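
Judging from the expected results above, the matcher reports every prefix of the input that some tagged regexp matches completely.  That reading can be cross-checked with the standard "re" module.  The sketch below is a brute-force reference, not the FSM: it re-tests each prefix, so it is quadratic, and all_prefix_matches is a made-up name.  The order of results within one prefix length is not assumed, so the comparison is on sorted lists.

```python
import re


def all_prefix_matches(tagged_patterns, text):
    """Yield (tag, prefix) for every prefix of text fully matched by a pattern."""
    compiled = [(tag, re.compile(pattern)) for tag, pattern in tagged_patterns]
    for end in range(len(text) + 1):
        prefix = text[:end]
        for tag, regex in compiled:
            if regex.fullmatch(prefix):
                yield tag, prefix


patterns = [(1, 'a*'), (2, 'a[a-cx]*'), (3, 'aax')]
expected = [(1, ''), (1, 'a'), (2, 'a'),
            (1, 'aa'), (2, 'aa'), (3, 'aax'),
            (2, 'aax'), (2, 'aaxb'),
            (2, 'aaxbx'), (2, 'aaxbxc'),
            (2, 'aaxbxcx')]
results = list(all_prefix_matches(patterns, 'aaxbxcxdx'))
assert sorted(results) == sorted(expected)
```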

The source will be in the next LEPL release.


Corrected Test

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:11:57 -0300 (CLST)

The test I meant to write (includes () grouping for *):

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a([a-c]x)*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aax'), (3, 'aax'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')]

Also, apologies for typos in text/title.  Given the nature of this
(email-based) blog it's too much effort to always be correcting things...



From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 10:34:59 -0400 (CLT)

Ooops.  I had ignored embedded alternatives, only allowing a choice at the
start, thinking that I was not losing anything.  But in fact that means
the current implementation has no backtracking.  Fortunately, I don't
think it will be hard to extend.


Possibly Complete

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:08:14 -0400 (CLT)

Hmm.  I implemented choices without thinking that much, and now it strikes
me that there is no backtracking - the FSM just transitions away...  I
guess that makes sense?  I need to sleep on it.

Anyway, here's the current test:

    regexps = [unicode_parser(1, 'a*'),
               unicode_parser(2, 'a([a-c]x|axb)*'),
               unicode_parser(3, 'aax')]
    fsm = SimpleFsm(regexps, UNICODE)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'), (1, 'aa'),
                       (2, 'aax'), (3, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')], results

(The initial list could now be written as a single expression, except that
there is no way to specify a label in-line).


Does Make Sense

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:34:24 -0400 (CLT)

Of course it makes sense - my FSM is deterministic (which means it may
need a lookup table exponential in the size of the regexp in certain cases).
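
The exponential case can be seen with a textbook example rather than this engine: the language "the n-th symbol from the end is a", written here with counted repetition as (a|b)*a(a|b){n-1}.  The sketch below hard-codes the obvious nondeterministic machine (state 0 loops, states 1..n count the remaining symbols) and applies the standard subset construction; count_dfa_states is a hypothetical name.  The number of reachable deterministic states doubles with each increment of n.

```python
def count_dfa_states(n):
    """Count DFA states reached by subset construction for (a|b)*a(a|b){n-1}."""

    def step(nfa_states, symbol):
        # Advance every NFA state in the set by one input symbol.
        nxt = set()
        for s in nfa_states:
            if s == 0:
                nxt.add(0)            # state 0 loops on both symbols
                if symbol == 'a':
                    nxt.add(1)        # guess that this 'a' is n-th from the end
            elif s < n:
                nxt.add(s + 1)        # count off the remaining symbols
            # state n is accepting and has no outgoing transitions
        return frozenset(nxt)

    start = frozenset([0])
    seen = {start}
    todo = [start]
    while todo:
        current = todo.pop()
        for symbol in 'ab':
            target = step(current, symbol)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


for n in range(1, 6):
    print(n, count_dfa_states(n))     # 2, 4, 8, 16, 32 states
```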

Also, I don't have "epsilon"?  Am I still incomplete?  I think so....
Perhaps best to add it with "?"?  In fact, perhaps I do have epsilon if I
just relax the parser to accept, for example, "(a|)"...
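
The "(a|)" idea can be checked against the standard "re" module (not the engine described here): an empty alternative behaves like an optional item, so (a|)b and a?b should accept the same strings.  A small sketch, with made-up variable names:

```python
import re

# An empty alternative plays the role of epsilon: (a|) is "a or nothing".
with_empty_alternative = re.compile(r'(a|)b')
with_question_mark = re.compile(r'a?b')

for text in ['b', 'ab', 'aab', '']:
    first = bool(with_empty_alternative.fullmatch(text))
    second = bool(with_question_mark.fullmatch(text))
    assert first == second
    print(repr(text), first)
```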

Should probably add "." and "^" too (although both of those are clearly sugar).

