# C[omp]ute

Welcome to my blog, which was once a mailing list of the same name and is still generated by mail. Please reply via the "comment" links.

Always interested in offers/projects/new ideas. Eclectic experience in fields like: numerical computing; Python web; Java enterprise; functional languages; GPGPU; SQL databases; etc. Based in Santiago, Chile; telecommute worldwide. CV; email.

© 2006-2015 Andrew Cooke (site) / post authors (content).

## I Just Wrote a Regular Exression Engine!

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:02:50 -0300 (CLST)

Heh.  I just finished a regular expression engine in Python.  It's the
"real deal" in that it "compiles" to a finite state machine (so it runs in
time proportional to the length of the string to be matcher).  It doesn't
compress multiple character jumps into a single step, but it does
otherwise generate a compact machine (as far as I understand these
things).

Being pure Python it's both better and worse than the standard "re"
package.  In fact it's mainly worse - it must be slower, it doesn't match
sub-expressions, and it has a very, very simple syntax.

But it does have a few advantages.  First, it's a generator, so it yields
each match as it finds it.  Second, it takes a sequence, rather than a
string as an argument, which means that the entire string doesn't have to
be read into memory.  Third, I understand it and can take it apart and
extend it, which means I can add Python functions to it.  I could even
make it work with arbitrary lists (non-characters) pretty easily.

Actually, as I implemented this, I realised that there were various things
about the standard Python regexp implementation that I didn't understand
that well, so some of the above may be wrong.  Next thing to do is to look
more closely at the standard library (yes, perhaps I should have started
that way, but way back then I didn't know what to ask).

Here's the test I just got running.  Note that the matcher (the FSM) takes
a list of regexps, and that each has a tag (here, integers).  The results
include the tags.  Also, that's the full regexp syntax - all I support is
literal characters, ranges, and "*".

def test_all(self):
regexps = [_parser(1, 'a*'),
_parser(2, 'a[a-cx]*'),
_parser(3, 'aax')]
fsm = Fsm(regexps)
results = list(fsm.all_for_string('aaxbxcxdx'))
assert results == [(1, ''), (1, 'a'), (2, 'a'),
(1, 'aa'), (2, 'aa'), (3, 'aax'),
(2, 'aax'), (2, 'aaxb'),
(2, 'aaxbx'), (2, 'aaxbxc'),
(2, 'aaxbxcx')]

The source will be in the next LEPL release.

Andrew

### Corrected Test

From: "andrew cooke" <andrew@...>

Date: Sat, 14 Mar 2009 22:11:57 -0300 (CLST)

The test I meant to write (includes () grouping for *):

def test_all(self):
regexps = [_parser(1, 'a*'),
_parser(2, 'a([a-c]x)*'),
_parser(3, 'aax')]
fsm = Fsm(regexps)
results = list(fsm.all_for_string('aaxbxcxdx'))
assert results == [(1, ''), (1, 'a'), (2, 'a'),
(1, 'aa'), (2, 'aax'), (3, 'aax'),
(2, 'aaxbx'), (2, 'aaxbxcx')]

Also, apologies for typos in text/title.  Given the nature of this
(email-based) blog it's too much effort to always be correcting things...

Andrew

### Incomplete

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 10:34:59 -0400 (CLT)

Ooops.  I had ignored embedded alternatives, only allowing a choice at the
start, thinking that I was not losing anything.  But in fact that means
the current implementation has no backtracking.  Fortunately, I don't
think it will be hard to extend.

Andrew

### Possibly Complete

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:08:14 -0400 (CLT)

Hmm.  I implemented choices without thinking that much, and now it strikes
me that there is no backtracking - the FSM just transitions away...  I
guess that makes sense?  I need to sleep on it.

Anyway, here's the current test:

regexps = [unicode_parser(1, 'a*'),
unicode_parser(2, 'a([a-c]x|axb)*'),
unicode_parser(3, 'aax')]
fsm = SimpleFsm(regexps, UNICODE)
results = list(fsm.all_for_string('aaxbxcxdx'))
assert results == [(1, ''), (1, 'a'), (2, 'a'), (1, 'aa'),
(2, 'aax'), (3, 'aax'), (2, 'aaxb'),
(2, 'aaxbx'), (2, 'aaxbxcx')], results

(The initial list could now be written as a single expression, except that
there is no way to specify a label in-line).

Andrew

### Does Make Sense

From: "andrew cooke" <andrew@...>

Date: Sun, 15 Mar 2009 21:34:24 -0400 (CLT)

Of course it makes sense - my FSM is deterministic (which means it may
need exponential size for the lookup table in certain cases).

Also, I don't have "epsilon"?  Am I still incomplete?  I think so....
Perhaps best to add it with "?"?  In fact, perhaps I do have epsilon if I
just relax the parser to accept, for example, "(a|)"...

Should probably add "." and "^" too (although both those clearly sugar).

Andrew