« Prev 4.5 Ignoring input sections - the drop statement | Table of Contents | Next » 4.7 Lexer modes |
A regular expression ("regex") is used to define lexer patterns in token,
pattern, and drop statements.
A regular expression begins and ends with a / character.
Example:
/#.*$/
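As a rough illustration, the example pattern can be tried out with Python's re module. This is only a sketch: Python's regex engine stands in for the lexer's own and may differ in details such as how $ is treated, and strip_slashes is a hypothetical helper that removes the / delimiters by hand, since Python does not use them.

```python
import re

def strip_slashes(pattern: str) -> str:
    """Remove the leading and trailing / delimiters (hypothetical helper)."""
    if len(pattern) < 2 or not (pattern.startswith("/") and pattern.endswith("/")):
        raise ValueError("pattern must begin and end with /")
    return pattern[1:-1]

# re.MULTILINE lets $ match at the end of every line (a Python-specific choice).
comment = re.compile(strip_slashes("/#.*$/"), re.MULTILINE)

source = "x = 1  # set x\ny = 2\n# full-line comment\n"
comments = comment.findall(source)
print(comments)
```

Each match starts at a # character and runs to the end of that line, because . does not match a newline.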
Regular expressions can include many special characters:
- The . character matches any input character other than a newline.
- The * character matches any number of the previous regex element.
- The + character matches one or more of the previous regex element.
- The ? character matches 0 or 1 of the previous regex element.
- The [ character begins a character class.
- The ( character begins a matching group.
- The { character begins a count qualifier.
- The \ character escapes the following character and changes its meaning:
  - The \a sequence matches an ASCII bell character (0x07).
  - The \b sequence matches an ASCII backspace character (0x08).
  - The \d sequence matches any character 0 through 9.
  - The \f sequence matches an ASCII form feed character (0x0C).
  - The \n sequence matches an ASCII new line character (0x0A).
  - The \r sequence matches an ASCII carriage return character (0x0D).
  - The \s sequence matches a space, a horizontal tab \t, a carriage return
    \r, a form feed \f, or a vertical tab \v character.
  - The \t sequence matches an ASCII tab character (0x09).
  - The \v sequence matches an ASCII vertical tab character (0x0B).
- The | character creates an alternate match.
- Any other character just matches itself in the input stream.
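The behavior of several of these special characters can be checked with a short Python sketch. Python's re module is used here only as a stand-in and is not the lexer's own engine. Because Python's \d and \s are broader than the definitions above, the sketch spells them out as explicit character classes (DIGIT and SPACE, names chosen for this example). The escaped-dot check reflects Python's behavior and is an assumption with respect to this tool.

```python
import re

# Explicit classes mirroring the escapes as described in the list above.
DIGIT = r"[0-9]"          # \d: any character 0 through 9
SPACE = r"[ \t\r\f\v]"    # \s: space, tab, carriage return, form feed, vertical tab

def full(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "dot_non_newline": full(r"a.c", "abc"),         # . matches an ordinary character
    "dot_newline": full(r"a.c", "a\nc"),            # . does not match a newline
    "star_zero": full(r"ab*", "a"),                 # * allows zero occurrences
    "plus_zero": full(r"ab+", "a"),                 # + needs at least one
    "plus_many": full(r"ab+", "abbb"),
    "optional_absent": full(r"colou?r", "color"),   # ? allows zero occurrences
    "optional_twice": full(r"colou?r", "colouur"),  # ? allows at most one
    "digits": full(DIGIT + "+", "2024"),
    "spaces": full(SPACE + "+", " \t\r"),
    "escaped_dot": full(r"1\.5", "1x5"),            # in Python, \. is a literal dot
}

for name, result in checks.items():
    print(f"{name}: {result}")
```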
A character class consists of a list of character alternates or character
ranges that can be matched by the character class.
For example, [a-zA-Z_] matches any lowercase character between a and z, any
uppercase character between A and Z, or the underscore _ character.
Character classes can also be negative character classes if the first character
after the [ is a ^ character.
In this case, the set of characters matched by the character class is the
inverse of what it otherwise would have been.
For example, [^0-9] matches any character other than 0 through 9.
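The two character classes from the examples can be exercised with a small Python sketch. Python's re module is assumed here as a stand-in for the lexer's engine, and the accepted helper is a name invented for this illustration.

```python
import re

identifier_char = re.compile(r"[a-zA-Z_]")  # letters or underscore
non_digit = re.compile(r"[^0-9]")           # negative class: anything but 0-9

def accepted(regex, candidates):
    """Return the single-character candidates that the class matches."""
    return [c for c in candidates if regex.fullmatch(c)]

candidates = ["a", "Q", "_", "7", "-"]
in_identifier_class = accepted(identifier_char, candidates)
in_non_digit_class = accepted(non_digit, candidates)

print(in_identifier_class)
print(in_non_digit_class)
```

The digit 7 is rejected by both classes, while - is rejected by the first class and accepted by the negative one.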
A matching group, written in parentheses, changes which part of the pattern a
multiplicity specifier applies to.
For example, in the pattern /foo+/ the + applies only to the final o, so the
pattern matches "foo" or "foooo", while in the pattern /(foo)+/ it applies to
the whole group, so the pattern matches "foo" or "foofoofoo", but not "foooo".
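The difference between the two patterns can be seen in a short Python sketch, with Python's re module assumed as a stand-in for the lexer's engine and the matches helper introduced only for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

inputs = ["foo", "foooo", "foofoofoo"]

# Without a group, + repeats only the last "o".
ungrouped = {text: matches(r"foo+", text) for text in inputs}
# With a group, + repeats the whole sequence "foo".
grouped = {text: matches(r"(foo)+", text) for text in inputs}

for text in inputs:
    print(f"{text}: foo+ -> {ungrouped[text]}, (foo)+ -> {grouped[text]}")
```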
A count qualifier in curly braces can be used to restrict the number of matches
of the preceding atom to an explicit minimum and maximum range.
For example, the pattern /\d{3}/ matches exactly 3 digits 0-9.
Both a minimum and maximum multiplicity count can be specified and separated by
a comma.
For example, /a{1,5}/ matches between 1 and 5 a characters.
Either the minimum or the maximum count can be omitted, which removes the
corresponding limit on the number of matches allowed.
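The count qualifier forms described above can be tried with Python's re module, which is assumed here as a stand-in for the lexer's engine. The re.ASCII flag keeps Python's \d limited to 0 through 9, and the behavior of the open-ended forms {2,} and {,3} is Python's; the helper name matches is invented for this sketch.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (ASCII digits only)."""
    return re.fullmatch(pattern, text, re.ASCII) is not None

results = {
    "exact_3_ok": matches(r"\d{3}", "123"),
    "exact_3_short": matches(r"\d{3}", "12"),
    "exact_3_long": matches(r"\d{3}", "1234"),
    "range_min": matches(r"a{1,5}", "a"),
    "range_max": matches(r"a{1,5}", "aaaaa"),
    "range_over": matches(r"a{1,5}", "aaaaaa"),
    "range_empty": matches(r"a{1,5}", ""),
    "no_max_under": matches(r"a{2,}", "a"),     # maximum omitted: two or more
    "no_max_many": matches(r"a{2,}", "aaaa"),
    "no_min_empty": matches(r"a{,3}", ""),      # minimum omitted: zero to three
    "no_min_over": matches(r"a{,3}", "aaaa"),
}

for name, result in results.items():
    print(f"{name}: {result}")
```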
An alternate match is created with the | character.
For example, the pattern /foo|bar/ matches either the sequence "foo" or the
sequence "bar".
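A brief Python sketch of alternation follows, again assuming Python's re module as a stand-in for the lexer's engine. It also contrasts the pattern with a grouped variant, fo(o|b)ar, which is an extra pattern added for this illustration, to show that | separates whole sequences unless a group narrows its scope.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

inputs = ["foo", "bar", "foobar", "fobar"]

# | splits the whole pattern: either "foo" or "bar".
whole = {text: matches(r"foo|bar", text) for text in inputs}
# Inside a group, | only chooses between "o" and "b".
scoped = {text: matches(r"fo(o|b)ar", text) for text in inputs}

for text in inputs:
    print(f"{text}: foo|bar -> {whole[text]}, fo(o|b)ar -> {scoped[text]}")
```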