## And - A Natural Extension To Regexps

From: andrew cooke <andrew@...>

Date: Tue, 10 Feb 2015 22:33:09 -0300

I found an old book of cryptic crossword puzzles the other day.  Sitting down
to do one I found an anagram.  So I wondered how to "grep"
/usr/share/dict/words.

It turns out that regeexps for anagrams aren't that great.  If you restrict
yourself to a traditional ERE (without lookbacks / lookforwards) then it seems
like the best you can do is a tree of alternatives.

I wrote a little program to generate this:

#!/usr/bin/python3

from sys import argv

assert len(argv) == 2, "usage: regexp LETTERS"

def alternatives(letters):
if len(letters) == 1:
yield letters
else:
yield '('
for (i, letter) in enumerate(letters):
if i: yield '|'
yield letter
yield from alternatives(letters[:i] + letters[i+1:])
yield ')'

print('^' + ''.join(alternatives(argv[1])) + '$') Which works ok: andrew@...:~/bin> regexp dog ^(d(og|go)|o(dg|gd)|g(do|od))$

andrew@...:~/bin> egrep -i regexp dog /usr/share/dict/words
dog
God

except that, of course, it's O(n!).  Which makes it useless.

So then I poked around a bit and found that the standard approach is to sort
the characters in each word to give a "key", and use that to hash / lookup the
dictionary.

Which is neat, but isn't a regexp.

So I left things there until I was lying in bed just now.  And it struck me
that if regexps had &, which works much like you'd expect from |, then
anagrams would be much easier.

For example, the regexp for "poodle" would be

^(.*p.*&.*o.*o.*&.*d.*&.*l.*&.*e.*&.{6})\$

which is way more compact than the tree of alternatives.

Before you get too carried away, however, it's worth noting that this seems to
be equivalent to lookahead.  But even so... why isn't this considered a
"natural" part of regexps?  I have no idea.

Andrew