| « Prev 4.5 Ignoring input sections - the drop statement | Table of Contents | Next » 4.7 Lexer modes |
A regular expression ("regex") is used to define lexer patterns in token,
pattern, and drop statements.
A regular expression begins and ends with a / character.
Example:
/#.*$/
Regular expressions can include many special characters:
- . character matches any input character other than a newline.
- * character matches zero or more of the previous regex element.
- + character matches one or more of the previous regex element.
- ? character matches 0 or 1 of the previous regex element.
- [ character begins a character class.
- ( character begins a matching group.
- { character begins a count qualifier.
- \ character escapes the following character and changes its meaning:
  - \d sequence matches any character 0 through 9.
  - \s sequence matches a space, a horizontal tab \t, a carriage return \r, a form feed \f, or a vertical tab \v character.
- | character creates an alternate match.
- Any other character just matches itself in the input stream.
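The behavior of the most common special characters can be tried out in any regex engine that shares these conventions. The following is a minimal sketch using Python's standard `re` module as a stand-in; it is not the lexer generator's own engine, the `matches` helper is a hypothetical name introduced only for illustration, and Python's `\s` and `\d` are broader than the definitions above unless restricted (here with `re.ASCII`).

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# '.' matches any character other than a newline
assert matches(r"a.c", "abc")
assert not matches(r"a.c", "a\nc")

# '*' is zero or more, '+' is one or more, '?' is zero or one
assert matches(r"ab*", "a")
assert matches(r"ab*", "abbb")
assert not matches(r"ab+", "a")
assert matches(r"ab?", "ab")
assert not matches(r"ab?", "abb")

# '\d' matches a digit; '\.' escapes the dot so it matches a literal '.'
assert matches(r"\d\.\d", "3.7")
assert not matches(r"\d\.\d", "3x7")
```

The patterns are written without the surrounding / delimiters because Python takes the pattern as a plain string.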
A character class consists of a list of character alternates or character
ranges that can be matched by the character class.
For example, [a-zA-Z_] matches any single lowercase character between a and z,
any uppercase character between A and Z, or the underscore _ character.
Character classes can also be negative character classes if the first character
after the [ is a ^ character.
In this case, the set of characters matched by the character class is the
inverse of what it otherwise would have been.
For example, [^0-9] matches any character other than 0 through 9.
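As a rough illustration of the two character classes mentioned above, the sketch below checks single characters against them with Python's `re` module. Python is only an assumed stand-in for the lexer's regex engine, and the names `IDENT_START` and `NON_DIGIT` are made up for this example.

```python
import re

# A character class matches exactly one character from its set.
IDENT_START = re.compile(r"[a-zA-Z_]")
# A leading '^' inverts the set: anything except 0 through 9.
NON_DIGIT = re.compile(r"[^0-9]")

assert IDENT_START.fullmatch("q") is not None
assert IDENT_START.fullmatch("Q") is not None
assert IDENT_START.fullmatch("_") is not None
assert IDENT_START.fullmatch("7") is None

assert NON_DIGIT.fullmatch("x") is not None
assert NON_DIGIT.fullmatch("5") is None
```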
A matching group can be used to override the pattern sequence that multiplicity
specifiers apply to.
For example, the pattern /foo+/ matches "foo" or "foooo", while the pattern
/(foo)+/ matches "foo" or "foofoofoo", but not "foooo".
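The difference between the two patterns can be checked with a short sketch. It uses Python's `re` module as an assumed stand-in for the lexer's engine, and `full` is a hypothetical helper name.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Without a group, '+' applies only to the final 'o'.
assert full(r"foo+", "foo")
assert full(r"foo+", "foooo")
assert not full(r"foo+", "foofoo")

# With a group, '+' applies to the whole sequence "foo".
assert full(r"(foo)+", "foo")
assert full(r"(foo)+", "foofoofoo")
assert not full(r"(foo)+", "foooo")
```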
A count qualifier in curly braces can be used to restrict the number of matches
of the preceding atom to an explicit minimum and maximum range.
For example, the pattern /\d{3}/ matches exactly 3 digits 0-9.
Both a minimum and maximum multiplicity count can be specified and separated by
a comma.
For example, /a{1,5}/ matches between 1 and 5 a characters.
Either the minimum or the maximum count can be omitted to remove the
corresponding restriction on the number of matches allowed.
For example, /a{2,}/ matches two or more a characters.
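The count qualifier forms described above can be exercised with the following sketch. It relies on Python's `re` module, which happens to accept the same `{n}`, `{min,max}`, `{min,}`, and `{,max}` forms; whether the lexer's engine treats every edge case identically is an assumption. The helper name `full` is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags=re.ASCII) is not None

# {3}: exactly three repetitions
assert full(r"\d{3}", "123")
assert not full(r"\d{3}", "12")
assert not full(r"\d{3}", "1234")

# {1,5}: between one and five repetitions
assert full(r"a{1,5}", "a")
assert full(r"a{1,5}", "aaaaa")
assert not full(r"a{1,5}", "")
assert not full(r"a{1,5}", "aaaaaa")

# {2,}: maximum omitted, so two or more
assert not full(r"a{2,}", "a")
assert full(r"a{2,}", "aaaaaaa")

# {,3}: minimum omitted, so at most three
assert full(r"a{,3}", "aaa")
assert not full(r"a{,3}", "aaaa")
```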
An alternate match is created with the | character.
For example, the pattern /foo|bar/ matches either the sequence "foo" or the
sequence "bar".
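A brief sketch of alternation, again using Python's `re` module as an assumed stand-in for the lexer's engine (`full` is a hypothetical helper name). The last two checks combine alternation with a matching group and the + character, following the grouping rule described earlier.

```python
import re

def full(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# '|' matches either the left or the right sequence, not both at once.
assert full(r"foo|bar", "foo")
assert full(r"foo|bar", "bar")
assert not full(r"foo|bar", "foobar")

# Wrapping the alternation in a group lets '+' repeat either choice.
assert full(r"(foo|bar)+", "foobarfoo")
assert not full(r"(foo|bar)+", "foobaz")
```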