A regular expression ("regex") is used to define lexer patterns in token, pattern, and drop statements.
A regular expression begins and ends with a / character.
Example:
/#.*/
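To give a rough feel for what the example pattern matches, the sketch below uses Python's re module as a stand-in for the lexer's regex engine. This is an assumption made for illustration only: the generated lexer is not implemented this way, and the helper name match_prefix is hypothetical. Only the pattern body between the two / delimiters is passed to Python.

```python
import re

# Body of the pattern /#.*/ (the text between the two / delimiters).
COMMENT = re.compile(r"#.*")

def match_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    m = pattern.match(text)
    return m.group(0) if m else None

# The match stops at the end of the line because . does not match a newline.
print(match_prefix(COMMENT, "# a comment\nx = 1"))
# Input that does not start with # is not matched at all.
print(match_prefix(COMMENT, "x = 1"))
```

A pattern like this is a natural fit for a drop statement, since comment text is usually discarded rather than turned into a token.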
Regular expressions can include many special characters/sequences:

- The . character matches any input character other than a newline.
- The * character matches any number of the previous regex element.
- The + character matches one or more of the previous regex element.
- The ? character matches 0 or 1 of the previous regex element.
- The [ character begins a character class.
- The ( character begins a matching group.
- The { character begins a count qualifier.
- The \ character escapes the following character and changes its meaning:
  - The \a sequence matches an ASCII bell character (0x07).
  - The \b sequence matches an ASCII backspace character (0x08).
  - The \d sequence is shorthand for the [0-9] character class.
  - The \D sequence matches every code point not matched by \d.
  - The \f sequence matches an ASCII form feed character (0x0C).
  - The \n sequence matches an ASCII new line character (0x0A).
  - The \r sequence matches an ASCII carriage return character (0x0D).
  - The \s sequence is shorthand for the [ \t\r\n\f\v] character class.
  - The \S sequence matches every code point not matched by \s.
  - The \t sequence matches an ASCII tab character (0x09).
  - The \v sequence matches an ASCII vertical tab character (0x0B).
  - The \w sequence is shorthand for the [a-zA-Z0-9_] character class.
  - The \W sequence matches every code point not matched by \w.
- The | character creates an alternate match.

Any other character just matches itself in the input stream.
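A few of the elements listed above can be tried out with Python's re module, used here only as a stand-in engine (an assumption, not how the generated lexer works). The re.ASCII flag is passed so that \d, \s, and \w mean exactly the ASCII classes given above. The \b sequence is left out because Python gives it a different meaning (word boundary). The helper name full is hypothetical.

```python
import re

def full(body, text):
    """True if the whole of `text` matches the regex body (ASCII classes)."""
    return re.fullmatch(body, text, flags=re.ASCII) is not None

# Shorthand classes and their complements.
assert full(r"\d+", "2024") and not full(r"\d+", "20x4")
assert full(r"\w+", "foo_bar9")
assert full(r"\s*", " \t\r\n\f\v")
assert full(r"\D", "x") and not full(r"\D", "7")

# Multiplicity specifiers: * allows zero repeats, + needs at least one,
# and ? makes the previous element optional.
assert full(r"ab*", "a") and not full(r"ab+", "a")
assert full(r"colou?r", "color") and full(r"colou?r", "colour")

# . matches any character except a newline.
assert full(r"a.c", "abc") and not full(r"a.c", "a\nc")

print("all checks passed")
```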
A character class consists of a list of character alternates or character
ranges that can be matched by the character class.
For example, [a-zA-Z_] matches any lowercase character between a and z, any uppercase character between A and Z, or the underscore _ character.
Character classes can also be negative character classes if the first character after the [ is a ^ character.
In this case, the set of characters matched by the character class is the
inverse of what it otherwise would have been.
For example, [^0-9]
matches any character other than 0 through 9.
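The two character classes above can be checked one character at a time. This is a minimal sketch that uses Python's re module as a stand-in engine (an assumption), and the names IDENT_START, NON_DIGIT, and matches_char are hypothetical.

```python
import re

IDENT_START = re.compile(r"[a-zA-Z_]")  # letters or underscore
NON_DIGIT = re.compile(r"[^0-9]")       # negated class: anything but 0-9

def matches_char(pattern, ch):
    """True if the single character `ch` is matched by the class."""
    return pattern.fullmatch(ch) is not None

# Columns: character, in [a-zA-Z_], in [^0-9]
for ch in ("a", "Z", "_", "5", "-"):
    print(repr(ch), matches_char(IDENT_START, ch), matches_char(NON_DIGIT, ch))
```

Note that "-" is rejected by [a-zA-Z_] but accepted by [^0-9]: a negated class matches everything outside its listed set, not just letters.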
A matching group can be used to override the pattern sequence that multiplicity
specifiers apply to.
For example, the pattern /foo+/ matches "foo" or "foooo", while the pattern /(foo)+/ matches "foo" or "foofoofoo", but not "foooo".
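The difference between the two patterns can be seen by testing each against the same inputs. The sketch below assumes Python's re module as a stand-in engine, and the helper name full is hypothetical.

```python
import re

def full(body, text):
    """True if the whole of `text` matches the regex body."""
    return re.fullmatch(body, text) is not None

# Without a group, + applies only to the final "o".
# With a group, + applies to the whole sequence "foo".
for text in ("foo", "foooo", "foofoofoo"):
    print(text, full(r"foo+", text), full(r"(foo)+", text))
```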
A count qualifier in curly braces can be used to restrict the number of matches
of the preceding atom to an explicit minimum and maximum range.
For example, the pattern /\d{3}/ matches exactly three digits in the range 0-9.
Both a minimum and maximum multiplicity count can be specified and separated by
a comma.
For example, /a{1,5}/ matches between 1 and 5 a characters.
Either the minimum or the maximum count can be omitted, which removes the corresponding restriction on the number of matches allowed.
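The count qualifier forms described above can be sketched as follows. Python's re module is used as a stand-in engine, and it is an assumption that the open-ended forms are spelled {2,} and {,3} exactly as in Python. The helper name full is hypothetical.

```python
import re

def full(body, text):
    """True if the whole of `text` matches the regex body (ASCII classes)."""
    return re.fullmatch(body, text, flags=re.ASCII) is not None

# Exact count: exactly three digits.
assert full(r"\d{3}", "123")
assert not full(r"\d{3}", "12") and not full(r"\d{3}", "1234")

# Minimum and maximum: between 1 and 5 "a" characters.
assert full(r"a{1,5}", "a") and full(r"a{1,5}", "aaaaa")
assert not full(r"a{1,5}", "") and not full(r"a{1,5}", "aaaaaa")

# Maximum omitted: two or more "a" characters.
assert full(r"a{2,}", "aaaaaaa") and not full(r"a{2,}", "a")

# Minimum omitted: at most three "a" characters (including none).
assert full(r"a{,3}", "") and not full(r"a{,3}", "aaaa")

print("all checks passed")
```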
An alternate match is created with the | character.
For example, the pattern /foo|bar/ matches either the sequence "foo" or the sequence "bar".
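As a minimal sketch of alternation, again assuming Python's re module as a stand-in engine (the helper name full is hypothetical), the example below also shows how a matching group limits the scope of the | character. The grouped pattern fo(o|b)ar is an added illustration and does not come from the text above.

```python
import re

def full(body, text):
    """True if the whole of `text` matches the regex body."""
    return re.fullmatch(body, text) is not None

# | separates whole sequences: "foo" or "bar".
for text in ("foo", "bar", "foobar", "fooar"):
    print(text, full(r"foo|bar", text), full(r"fo(o|b)ar", text))
```

In the second pattern the alternation is confined to the group, so it matches "fooar" and "fobar" rather than "foo" and "bar".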