Components

There are three main components in RXPY:

  • The parser, which constructs a graph that represents the regular expression.
  • An engine, which evaluates the regular expression (expressed as a graph) against an input string to find a match.
  • An alphabet, which allows both the parser and engine to work with a variety of different input types.

In addition, there are two supporting components:

  • A library of graph nodes, assembled by the parser and used as opcodes by the engine.
  • Support code that emulates the Python re library, given an engine.

For example, the expression (?P<number>[0-9]+)|\w* is compiled to the graph shown below. The entry point is not indicated, but in this case it would be the ...|... node. How [0-9] and \w are interpreted depends on the alphabet used: ASCII and Unicode alphabets do not give the same result.

[Figure: example-graph.png, the graph for the expression above]
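
The dependence on the alphabet can be seen without RXPY at all. The following sketch uses only Python's standard re module (not RXPY) with the same expression; the helper name match_kind is an assumption made for this illustration. With Unicode rules \w accepts "é", while with ASCII rules it does not.

```python
import re

PATTERN = r"(?P<number>[0-9]+)|\w*"

def match_kind(text, flags=0):
    """Return ('number', text) or ('word', text) for a full match, else None."""
    m = re.fullmatch(PATTERN, text, flags)
    if m is None:
        return None
    # The named group is only set when the first alternative matched.
    kind = "number" if m.group("number") is not None else "word"
    return (kind, m.group(0))

print(match_kind("123"))           # ('number', '123')
print(match_kind("é"))             # ('word', 'é'), Unicode rules by default
print(match_kind("é", re.ASCII))   # None, 'é' is not a word character in ASCII
```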

Parser

The parser is a hand-written state machine. States are classes and the code is (in my opinion) quite simple and easy to extend.
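
To give a feel for the "states are classes" style, here is a minimal sketch of a state machine that parses only literals and simple character classes. It is not the RXPY parser: the class names (State, TopLevel, InClass), the feed and finish methods, and the tuple output are all assumptions chosen for illustration.

```python
class State:
    """Base class: each state consumes one character and returns the next state."""

    def __init__(self, nodes):
        self.nodes = nodes  # output list shared between states

    def feed(self, ch):
        raise NotImplementedError

    def finish(self):
        """Called at end of input; returns the parsed nodes."""
        return self.nodes


class TopLevel(State):
    def feed(self, ch):
        if ch == "[":
            return InClass(self.nodes)      # switch state on '['
        self.nodes.append(("literal", ch))
        return self


class InClass(State):
    def __init__(self, nodes):
        super().__init__(nodes)
        self.members = []

    def feed(self, ch):
        if ch == "]":
            self.nodes.append(("class", "".join(self.members)))
            return TopLevel(self.nodes)     # back to the top level on ']'
        self.members.append(ch)
        return self

    def finish(self):
        raise ValueError("unterminated character class")


def parse(pattern):
    state = TopLevel([])
    for ch in pattern:
        state = state.feed(ch)
    return state.finish()


print(parse("a[bc]d"))  # [('literal', 'a'), ('class', 'bc'), ('literal', 'd')]
```

Adding a new construct in this style means adding a new state class and a transition into it, which is one reason such a parser is easy to extend.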

The source and API docs for the parser are here.

Graph

Each graph node represents a single opcode for the engine (although engines are free to rewrite the graph and/or change the interpretation).

The source and API docs for the graph are here.

Graph nodes also support the visitor pattern. Calling any node's visit method with a Visitor() instance triggers the callback appropriate to that node, with the correct parameters.
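
The visitor mechanism can be sketched as follows. The node classes (String, Alternatives), the callback names, and the Describe visitor are hypothetical and only mirror the idea described above; the real node and callback names are in the API docs.

```python
class String:
    """Hypothetical node that matches a literal string."""

    def __init__(self, text):
        self.text = text

    def visit(self, visitor):
        # Each node calls the callback that corresponds to its own type.
        return visitor.string(self.text)


class Alternatives:
    """Hypothetical node for a choice between sub-graphs."""

    def __init__(self, options):
        self.options = options

    def visit(self, visitor):
        return visitor.alternatives(self.options)


class Describe:
    """A visitor that renders the graph back to pattern-like text."""

    def string(self, text):
        return text

    def alternatives(self, options):
        return "|".join(node.visit(self) for node in options)


graph = Alternatives([String("abc"), String("xyz")])
print(graph.visit(Describe()))  # abc|xyz
```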

Engine

An engine must use the graph (generated by the parser) to find a match in the input text.

RXPY currently has only one engine. The source and API docs for the simple engine are here.

The simple engine works as an interpreter, using the Visitor() interface (see comments on graphs above). But more complex approaches are also possible. For example, the graph could be used to generate opcodes in a more traditional form for a C-based engine, or the graph could undergo further analysis and rewriting.
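
As a rough illustration of the second approach, a visitor can emit a flat list of opcodes instead of interpreting the nodes directly. Everything in this sketch (the Char and Match nodes, the Compiler visitor, and the opcode tuples) is an assumption made for the example, not part of RXPY.

```python
class Char:
    """Hypothetical node: match one character, then continue with `following`."""

    def __init__(self, ch, following=None):
        self.ch = ch
        self.following = following

    def visit(self, visitor):
        return visitor.char(self.ch, self.following)


class Match:
    """Hypothetical terminal node: the expression has matched."""

    def visit(self, visitor):
        return visitor.match()


class Compiler:
    """Visitor that appends opcodes to a program rather than matching input."""

    def __init__(self):
        self.program = []

    def char(self, ch, following):
        self.program.append(("CHAR", ch))
        return following  # the node to visit next

    def match(self):
        self.program.append(("MATCH",))
        return None       # nothing follows a match


def compile_graph(node):
    compiler = Compiler()
    while node is not None:
        node = node.visit(compiler)
    return compiler.program


graph = Char("a", Char("b", Match()))
program = compile_graph(graph)
print(program)  # [('CHAR', 'a'), ('CHAR', 'b'), ('MATCH',)]
```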

Alphabet

The alphabet is used in two separate ways.

  1. While parsing the regular expression, it is used by the parser/graph to (i) generate values from numeric escapes and (ii) construct character ranges.
  2. While matching input, it is used by the engine to check membership of character groups (e.g. whitespace).
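
The two roles above can be sketched with a pair of toy alphabets. The class and method names here (AsciiAlphabet, UnicodeAlphabet, code_to_char, char_range, is_space) are hypothetical and do not reflect the actual RXPY alphabet API.

```python
class AsciiAlphabet:
    """Toy alphabet limited to 7-bit characters."""

    # Used while parsing: turn a numeric escape into a character.
    def code_to_char(self, code):
        if not 0 <= code < 128:
            raise ValueError("code point outside ASCII")
        return chr(code)

    # Used while parsing: expand a range such as 0-9 into its members.
    def char_range(self, low, high):
        return [chr(c) for c in range(ord(low), ord(high) + 1)]

    # Used while matching: test membership of a character group.
    def is_space(self, ch):
        return ch in " \t\n\r\f\v"


class UnicodeAlphabet(AsciiAlphabet):
    """Toy alphabet that accepts any code point Python's str supports."""

    def code_to_char(self, code):
        return chr(code)

    def is_space(self, ch):
        return ch.isspace()


ascii_alphabet = AsciiAlphabet()
unicode_alphabet = UnicodeAlphabet()

print(ascii_alphabet.char_range("0", "3"))   # ['0', '1', '2', '3']
print(ascii_alphabet.is_space("\u00a0"))     # False (no-break space)
print(unicode_alphabet.is_space("\u00a0"))   # True
```

Because both the parser and the engine go through the same object, swapping the alphabet changes how ranges are built and how groups are tested without touching either component.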

The source and API docs for the alphabets are here.

Compatibility Support

An engine needs to define only one method. The compatibility library provides everything else needed to match the interface of the existing re package.

The source and API docs for the compatibility support are here. An example of its use is here.
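
The division of labour can be sketched as below: a toy engine exposes a single method, and a wrapper builds re-style match and search on top of it. The names LiteralEngine, CompatPattern and run, and the (start, end) return value, are assumptions for illustration only; the actual method an engine must define is described in the API docs.

```python
class LiteralEngine:
    """Toy engine: its single method matches a fixed literal at an offset."""

    def __init__(self, literal):
        self.literal = literal

    def run(self, text, pos=0):
        """Return the end offset of a match starting at pos, or None."""
        if text.startswith(self.literal, pos):
            return pos + len(self.literal)
        return None


class CompatPattern:
    """Hypothetical wrapper offering re-style match() and search()."""

    def __init__(self, engine):
        self.engine = engine

    def match(self, text, pos=0):
        # Anchored at pos, like re's match().
        end = self.engine.run(text, pos)
        return None if end is None else (pos, end)

    def search(self, text, pos=0):
        # Try each start offset in turn, like re's search().
        for start in range(pos, len(text) + 1):
            found = self.match(text, start)
            if found is not None:
                return found
        return None


pattern = CompatPattern(LiteralEngine("ab"))
print(pattern.match("abc"))    # (0, 2)
print(pattern.match("xab"))    # None
print(pattern.search("xxab"))  # (2, 4)
```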
