Propane User Guide - 4.6 Regular expression syntax

4.6 Regular expression syntax

A regular expression ("regex") is used to define lexer patterns in token, pattern, and drop statements. A regular expression begins and ends with a / character.

Example:

/#.*$/

Regular expressions can include many special characters:

The . character matches any input character other than a newline.
The * character matches zero or more of the previous regex element.
The + character matches one or more of the previous regex element.
The ? character matches 0 or 1 of the previous regex element.
The [ character begins a character class.
The ( character begins a matching group.
The { character begins a count qualifier.
The \ character escapes the following character and changes its meaning:
- The \a sequence matches an ASCII bell character (0x07).
- The \b sequence matches an ASCII backspace character (0x08).
- The \d sequence matches any character 0 through 9.
- The \f sequence matches an ASCII form feed character (0x0C).
- The \n sequence matches an ASCII new line character (0x0A).
- The \r sequence matches an ASCII carriage return character (0x0D).
- The \s sequence matches a space, a horizontal tab (\t), a carriage return (\r), a form feed (\f), or a vertical tab (\v) character.
- The \t sequence matches an ASCII tab character (0x09).
- The \v sequence matches an ASCII vertical tab character (0x0B).
- Any other character following the \ matches itself. For example, \* matches a literal * character.
The | character creates an alternate match.

Any other character just matches itself in the input stream.
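The behavior of the ., *, +, and ? characters can be explored outside of Propane. The sketch below uses Python's re module as a stand-in engine, on the assumption that it agrees with Propane for these particular constructs; the matches helper is hypothetical and not part of Propane. Note that the patterns are written without the surrounding / characters, which are Propane grammar syntax.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# * allows zero or more of the preceding element.
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]
# + requires at least one of the preceding element.
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]
# ? allows zero or one of the preceding element.
optional = [matches(r"ab?", s) for s in ("a", "ab", "abb")]
# . matches any character other than a newline.
dot = [matches(r"a.c", s) for s in ("abc", "a c", "a\nc")]
# An escaped special character matches itself.
escaped = [matches(r"a\.c", s) for s in ("a.c", "abc")]

print(star, plus, optional, dot, escaped)
```

Python's own escape sequences are not identical to the list above (for instance, Python's \s also matches a newline), so this sketch is only a guide for the constructs it exercises.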

A character class consists of a list of individual characters or character ranges, any one of which can be matched by the character class. For example, [a-zA-Z_] matches any lowercase letter a through z, any uppercase letter A through Z, or the underscore _ character. A character class is negated if the first character after the [ is a ^ character. In this case, the set of characters matched by the character class is the inverse of what it otherwise would have been. For example, [^0-9] matches any character other than 0 through 9.
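As an illustration, the two character classes above can be tried against single characters. This is a sketch using Python's re module as a stand-in, assuming it treats these classes the same way Propane does; the variable names are hypothetical.

```python
import re

identifier_char = re.compile(r"[a-zA-Z_]")  # letters or underscore
non_digit = re.compile(r"[^0-9]")           # negated class: anything but 0-9

# Keep only the characters that each class accepts.
accepted = [c for c in "aZ_9-" if identifier_char.fullmatch(c)]
not_digits = [c for c in "a5_0" if non_digit.fullmatch(c)]

print(accepted)
print(not_digits)
```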

A matching group can be used to override the pattern sequence that multiplicity specifiers apply to. For example, the pattern /foo+/ matches "foo" or "foooo", while the pattern /(foo)+/ matches "foo" or "foofoofoo", but not "foooo".
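The difference between /foo+/ and /(foo)+/ can be checked with a short sketch. It uses Python's re module as a stand-in engine, assuming grouping behaves the same way there as in Propane; the matches helper is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

inputs = ("foo", "foooo", "foofoofoo")

# Without a group, + applies only to the final "o".
ungrouped = [matches(r"foo+", s) for s in inputs]
# With a group, + applies to the whole sequence "foo".
grouped = [matches(r"(foo)+", s) for s in inputs]

print(ungrouped)
print(grouped)
```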

A count qualifier in curly braces can be used to restrict the number of matches of the preceding regex element to an explicit count or range. For example, the pattern /\d{3}/ matches exactly 3 digits 0-9. A minimum and a maximum count can both be specified, separated by a comma. For example, /a{1,5}/ matches between 1 and 5 a characters. Either the minimum or the maximum count can be omitted to remove the corresponding restriction. For example, /a{2,}/ matches 2 or more a characters.
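The count qualifier forms can be compared side by side. The sketch below uses Python's re module as a stand-in, assuming its {n}, {m,n}, {m,}, and {,n} forms behave as described here for Propane; the matches helper is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Exact count: exactly three digits.
exact = [matches(r"\d{3}", s) for s in ("12", "123", "1234")]
# Minimum and maximum: between one and five "a" characters.
bounded = [matches(r"a{1,5}", s) for s in ("", "a", "aaaaa", "aaaaaa")]
# Maximum omitted: two or more "a" characters.
at_least = [matches(r"a{2,}", s) for s in ("a", "aa", "aaaaaaa")]
# Minimum omitted: at most two "a" characters (including none).
at_most = [matches(r"a{,2}", s) for s in ("", "aa", "aaa")]

print(exact, bounded, at_least, at_most)
```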

An alternate match is created with the | character. For example, the pattern /foo|bar/ matches either the sequence "foo" or the sequence "bar".
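A quick way to see the alternation in action, again using Python's re module as a stand-in on the assumption that | behaves the same way there as in Propane:

```python
import re

alternation = re.compile(r"foo|bar")

# Either alternative matches on its own; the concatenation of both does not.
results = [alternation.fullmatch(s) is not None for s in ("foo", "bar", "foobar")]

print(results)
```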
