Propane User Guide - 4.8 Regular expression syntax

4.8 Regular expression syntax

A regular expression ("regex") is used to define lexer patterns in token, pattern, and drop statements. A regular expression begins and ends with a / character.

Example:

/#.*/
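As a rough illustration of what the example pattern matches, the sketch below uses Python's re module as a stand-in. This is an assumption made for demonstration only: Propane's lexer is not Python's regex engine, and the delimiting / characters are not part of the pattern itself.

```python
import re

# The Propane pattern /#.*/ written without its / delimiters.
COMMENT = re.compile(r"#.*")

source = "x = 1 # set x\ny = 2"

# "." does not match a newline, so the match stops at the end of the line.
match = COMMENT.search(source)
comment_text = match.group(0)
print(comment_text)  # prints "# set x"
```

A pattern like this one would typically describe a comment that runs from # to the end of the line.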

Regular expressions can include many special characters/sequences:

The . character matches any input character other than a newline.
The * character matches zero or more occurrences of the previous regex element.
The + character matches one or more occurrences of the previous regex element.
The ? character matches zero or one occurrence of the previous regex element.
The [ character begins a character class.
The ( character begins a matching group.
The { character begins a count qualifier.
The \ character escapes the following character and changes its meaning:
- The \a sequence matches an ASCII bell character (0x07).
- The \b sequence matches an ASCII backspace character (0x08).
- The \d sequence is shorthand for the [0-9] character class.
- The \D sequence matches every code point not matched by \d.
- The \f sequence matches an ASCII form feed character (0x0C).
- The \n sequence matches an ASCII new line character (0x0A).
- The \r sequence matches an ASCII carriage return character (0x0D).
- The \s sequence is shorthand for the [ \t\r\n\f\v] character class.
- The \S sequence matches every code point not matched by \s.
- The \t sequence matches an ASCII tab character (0x09).
- The \v sequence matches an ASCII vertical tab character (0x0B).
- The \w sequence is shorthand for the [a-zA-Z0-9_] character class.
- The \W sequence matches every code point not matched by \w.
- Any other character following the \ matches itself literally (for example, \+ matches a + character).
The | character creates an alternate match.

Any character not listed above matches itself in the input stream.
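The multiplicity characters and shorthand sequences listed above can be tried out with a short sketch. It uses Python's re module as a stand-in engine, which is an assumption and not how Propane itself evaluates patterns. The re.ASCII flag keeps \d, \s, and \w restricted to the ASCII ranges given above. The \b sequence is left out because Python treats it as a word boundary rather than a backspace character. The matches helper is a hypothetical name introduced for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# "*" allows zero or more of the previous element.
star = [matches(r"ab*", s) for s in ("a", "ab", "abbb")]
# "+" requires at least one of the previous element.
plus = [matches(r"ab+", s) for s in ("a", "ab", "abbb")]
# "?" allows zero or one of the previous element.
optional = [matches(r"ab?", s) for s in ("a", "ab", "abb")]

digits = matches(r"\d+", "2024")       # \d is shorthand for [0-9]
word = matches(r"\w+", "foo_bar1")     # \w is shorthand for [a-zA-Z0-9_]
escaped_plus = matches(r"\+", "+")     # an escaped "+" matches a literal plus

print(star, plus, optional)
```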

A character class consists of a list of individual characters or character ranges, any one of which can be matched by the character class. For example, [a-zA-Z_] matches any lowercase letter between a and z, any uppercase letter between A and Z, or the underscore _ character. A character class is negated if the first character after the [ is a ^ character. In this case, the set of characters matched by the character class is the inverse of what it otherwise would have been. For example, [^0-9] matches any character other than 0 through 9.
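The two character classes from the paragraph above can be checked against a few sample characters. This is a minimal sketch that assumes Python's re module behaves the same way as Propane for these simple classes; the variable names are illustrative only.

```python
import re

# Patterns taken from the text above, written without / delimiters.
IDENT_START = re.compile(r"[a-zA-Z_]")
NON_DIGIT = re.compile(r"[^0-9]")

# Keep only the sample characters that each class matches.
start_ok = [c for c in "aZ_9-" if IDENT_START.fullmatch(c)]
non_digits = [c for c in "a5_0 " if NON_DIGIT.fullmatch(c)]

print(start_ok)    # prints ['a', 'Z', '_']
print(non_digits)  # prints ['a', '_', ' ']
```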

A matching group can be used to override the pattern sequence that multiplicity specifiers apply to. For example, the pattern /foo+/ matches "foo" or "foooo", while the pattern /(foo)+/ matches "foo" or "foofoofoo", but not "foooo".
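The difference between /foo+/ and /(foo)+/ can be seen by testing both against the strings mentioned above. The sketch uses Python's re module as a stand-in engine (an assumption, not Propane's implementation), and matches is a hypothetical helper name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ("foo", "foooo", "foofoofoo")

# Without a group, "+" applies only to the final "o".
ungrouped = {s: matches(r"foo+", s) for s in samples}
# With a group, "+" applies to the whole sequence "foo".
grouped = {s: matches(r"(foo)+", s) for s in samples}

print(ungrouped)
print(grouped)
```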

A count qualifier in curly braces restricts the number of matches of the preceding atom to an explicit count or range. For example, the pattern /\d{3}/ matches exactly 3 digits 0-9. A minimum and a maximum count can both be specified, separated by a comma. For example, /a{1,5}/ matches between 1 and 5 a characters. Either the minimum or the maximum count can be omitted, which removes the corresponding restriction on the number of matches allowed.
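The count qualifier forms described above can be exercised as follows. Python's re module is used as a stand-in, and it is an assumption that its handling of an omitted minimum or maximum agrees with Propane's. The matches helper is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# Exact count: exactly three digits.
exact = [matches(r"\d{3}", s) for s in ("12", "123", "1234")]
# Minimum and maximum: between one and five "a" characters.
ranged = [matches(r"a{1,5}", s) for s in ("", "a", "aaaaa", "aaaaaa")]
# Maximum omitted: two or more "a" characters.
at_least = [matches(r"a{2,}", s) for s in ("a", "aa", "aaaaaaa")]
# Minimum omitted: at most two "a" characters.
at_most = [matches(r"a{,2}", s) for s in ("", "aa", "aaa")]

print(exact, ranged, at_least, at_most)
```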

An alternate match is created with the | character. For example, the pattern /foo|bar/ matches either the sequence "foo" or the sequence "bar".
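A short check of the /foo|bar/ example is sketched below, again with Python's re module standing in for Propane's lexer (an assumption). It also shows that the alternation separates the two whole sequences rather than only the characters next to the | character. The matches helper is a hypothetical name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

# Either complete alternative is accepted.
foo_ok = matches(r"foo|bar", "foo")
bar_ok = matches(r"foo|bar", "bar")
# The alternatives are not concatenated or mixed.
both = matches(r"foo|bar", "foobar")
mixed = matches(r"foo|bar", "fooar")

print(foo_ok, bar_ok, both, mixed)  # prints "True True False False"
```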
