

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

© 2006-2015 Andrew Cooke (site) / post authors (content).

I Just Wrote a Regular Expression Engine!

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:02:50 -0300 (CLST)

Heh.  I just finished a regular expression engine in Python.  It's the
"real deal" in that it "compiles" to a finite state machine (so it runs in
time proportional to the length of the string to be matched).  It doesn't
compress multiple character jumps into a single step, but it does
otherwise generate a compact machine (as far as I understand these things).
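
To make the "time proportional to the length" point concrete, here is a
minimal sketch of a table-driven matcher.  The transition table is built by
hand for the pattern a[a-cx]*, and the names (TRANSITIONS, ACCEPTING,
matches) are illustrative assumptions, not the LEPL code.

```python
# Hand-built DFA for the pattern a[a-cx]* (an illustrative assumption,
# not the machine that LEPL generates).
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'a'): 1, (1, 'b'): 1, (1, 'c'): 1, (1, 'x'): 1,
}
ACCEPTING = {1}


def matches(text):
    """Return True if the whole of text is accepted by the DFA."""
    state = 0
    for char in text:                 # one table lookup per character
        state = TRANSITIONS.get((state, char))
        if state is None:             # no transition: the match has failed
            return False
    return state in ACCEPTING
```

Each character costs a single dictionary lookup, and nothing is ever
re-read, so the work grows linearly with the input.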

Being pure Python it's both better and worse than the standard "re"
package.  In fact it's mainly worse - it must be slower, it doesn't match
sub-expressions, and it has a very, very simple syntax.

But it does have a few advantages.  First, it's a generator, so it yields
each match as it finds it.  Second, it takes a sequence, rather than a
string as an argument, which means that the entire string doesn't have to
be read into memory.  Third, I understand it and can take it apart and
extend it, which means I can add Python functions to it.  I could even
make it work with arbitrary lists (non-characters) pretty easily.
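
The first two advantages can be sketched as follows.  This is only an
illustration under assumed names (prefix_matches, A_STAR, chars are all
hypothetical); it yields each accepted prefix as soon as it is seen and
reads from any iterable, stopping at the first dead end.

```python
def prefix_matches(transitions, accepting, stream, start=0):
    """Yield each prefix of stream that the DFA accepts, as it is found."""
    state, seen = start, []
    if state in accepting:
        yield ''
    for item in stream:               # stream may be any iterable
        state = transitions.get((state, item))
        if state is None:             # dead end: stop reading input
            return
        seen.append(item)
        if state in accepting:
            yield ''.join(seen)


# Hypothetical table for a* (state 0 is both start and accepting).
A_STAR = {(0, 'a'): 0}


def chars():
    """A lazy source that fails if anything reads past the mismatch."""
    yield from 'aab'
    raise RuntimeError('read too far')


print(list(prefix_matches(A_STAR, {0}, chars())))   # ['', 'a', 'aa']
```

The join assumes the items are characters; for arbitrary items the matches
could be yielded as tuples instead.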

Actually, as I implemented this, I realised that there were various things
about the standard Python regexp implementation that I didn't understand
that well, so some of the above may be wrong.  Next thing to do is to look
more closely at the standard library (yes, perhaps I should have started
that way, but way back then I didn't know what to ask).

Here's the test I just got running.  Note that the matcher (the FSM) takes
a list of regexps, and that each has a tag (here, integers).  The results
include the tags.  Also, the test uses the full regexp syntax: all I support
is literal characters, ranges, and "*".

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a[a-cx]*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aa'), (3, 'aax'),
                       (2, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxc'),
                       (2, 'aaxbxcx')]

The source will be in the next LEPL release.
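
As a cross-check on the expected results, the same set of (tag, match) pairs
can be reproduced, slowly, with the standard "re" package.  This sketch is
not the FSM: it re-tests every prefix against every pattern, so it is
quadratic in the input length.  The function name is borrowed from the test
purely for illustration, and the order of results may differ, so they are
compared as a set.

```python
import re


def all_for_string(tagged_patterns, text):
    """Yield (tag, prefix) for every prefix of text matched in full."""
    for end in range(len(text) + 1):
        prefix = text[:end]
        for tag, pattern in tagged_patterns:
            if re.fullmatch(pattern, prefix):
                yield tag, prefix


patterns = [(1, 'a*'), (2, 'a[a-cx]*'), (3, 'aax')]
results = set(all_for_string(patterns, 'aaxbxcxdx'))
```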


Corrected Test

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:11:57 -0300 (CLST)

The test I meant to write (includes () grouping for *):

  def test_all(self):
    regexps = [_parser(1, 'a*'),
               _parser(2, 'a([a-c]x)*'),
               _parser(3, 'aax')]
    fsm = Fsm(regexps)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'),
                       (1, 'aa'), (2, 'aax'), (3, 'aax'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')]

Also, apologies for typos in text/title.  Given the nature of this
(email-based) blog it's too much effort to always be correcting things...



From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 10:34:59 -0400 (CLT)

Ooops.  I had ignored embedded alternatives, only allowing a choice at the
start, thinking that I was not losing anything.  But in fact that means
the current implementation has no backtracking.  Fortunately, I don't
think it will be hard to extend.


Possibly Complete

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:08:14 -0400 (CLT)

Hmm.  I implemented choices without thinking that much, and now it strikes
me that there is no backtracking - the FSM just transitions away...  I
guess that makes sense?  I need to sleep on it.

Anyway, here's the current test:

    regexps = [unicode_parser(1, 'a*'),
               unicode_parser(2, 'a([a-c]x|axb)*'),
               unicode_parser(3, 'aax')]
    fsm = SimpleFsm(regexps, UNICODE)
    results = list(fsm.all_for_string('aaxbxcxdx'))
    assert results == [(1, ''), (1, 'a'), (2, 'a'), (1, 'aa'),
                       (2, 'aax'), (3, 'aax'), (2, 'aaxb'),
                       (2, 'aaxbx'), (2, 'aaxbxcx')], results

(The initial list could now be written as a single expression, except that
there is no way to specify a label in-line).


Does Make Sense

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:34:24 -0400 (CLT)

Of course it makes sense - my FSM is deterministic (which means it may
need exponential size for the lookup table in certain cases).
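
A standard illustration of that exponential case is the pattern "the n-th
character from the end is a", i.e. (a|b)*a followed by n-1 copies of (a|b).
The sketch below is an assumption-laden toy, not the LEPL code: it
hard-codes that NFA in step() and counts the states produced by the subset
construction.

```python
def dfa_size(n):
    """Count DFA states for 'the n-th character from the end is a'."""
    def step(states, char):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)             # the (a|b)* loop
                if char == 'a':
                    nxt.add(1)         # guess that this 'a' is the one
            elif s < n:
                nxt.add(s + 1)         # count off the remaining characters
        return frozenset(nxt)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                        # subset construction
        current = todo.pop()
        for char in 'ab':
            target = step(current, char)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return len(seen)


print([dfa_size(n) for n in range(1, 6)])   # [2, 4, 8, 16, 32]
```

The NFA has n+1 states, but the deterministic machine has to remember
which of the last n characters were "a", hence 2**n states.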

Also, I don't have "epsilon"?  Am I still incomplete?  I think so....
Perhaps best to add it with "?"?  In fact, perhaps I do have epsilon if I
just relax the parser to accept, for example, "(a|)"...
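
For comparison, the standard "re" package does accept an empty alternative,
and there it behaves like "?".  This small check only shows the standard
library's behaviour; whether the LEPL parser can be relaxed in the same way
is an assumption.

```python
import re

# An empty alternative plays the role of epsilon: "(a|)" is "a or nothing".
with_empty = re.compile('(a|)b')
with_option = re.compile('a?b')

for text in ['b', 'ab', 'aab', '']:
    # Both patterns should agree on every input.
    assert bool(with_empty.fullmatch(text)) == bool(with_option.fullmatch(text))
```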

Should probably add "." and "^" too (although both of those are clearly sugar).

