You can combine regular expressions with the following characters, called regular expression operators, or metacharacters, to increase the power and versatility of regular expressions.
Here is a table of these metacharacters. All characters that are not listed in the table stand for themselves.
\
A backslash suppresses the special meaning of the character that follows it. For example, `\$' matches the character `$'.
^
`^@chapter' matches `@chapter' at the beginning of a string, and can be used to identify chapter beginnings in Texinfo source files. The `^' is known as an anchor, since it anchors the pattern to match only at the beginning of the string.
$
`p$' matches a string that ends with a `p'. The `$' is also an anchor.
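The two anchors can be tried out with a short sketch. It uses Python's `re' module purely as a stand-in; this is an assumption made for illustration, since gcal uses its own regexp library routines, and the sample strings are made up.

```python
import re

# '^' anchors the match to the beginning of the string.
at_start = re.search(r"^@chapter", "@chapter Regular Expressions")
not_at_start = re.search(r"^@chapter", "see @chapter 2")

# '$' anchors the match to the end of the string.
ends_with_p = re.search(r"p$", "stop")
p_in_middle = re.search(r"p$", "stops")

print(bool(at_start), bool(not_at_start))    # True False
print(bool(ends_with_p), bool(p_in_middle))  # True False
```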
.
`.P' matches any single character followed by a `P' in a string. Using concatenation we can make a regular expression like `U.A', which matches any three-character sequence that begins with `U' and ends with `A'.
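As a quick check of the `U.A' example, the following sketch lists every three-character sequence the pattern finds in a made-up string. Python's `re' module is assumed here only as a convenient test bed; it is not the library gcal uses.

```python
import re

# '.' stands for exactly one character, whatever it is.
found = re.findall(r"U.A", "USA UKA UA U-A")
print(found)  # ['USA', 'UKA', 'U-A']

# '.P' needs some character in front of the 'P'.
one_before = re.search(r".P", "UP")
nothing_before = re.search(r".P", "P")
print(bool(one_before), bool(nothing_before))  # True False
```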
[...]
`[MVX]' matches any one of the characters `M', `V', or `X' in a string. Ranges of characters are indicated by using a hyphen between the beginning and ending characters, and enclosing the whole thing in brackets. For example, `[0-9]' matches any digit. Multiple ranges are allowed. E.g., the list `[A-Za-z0-9]' is a common way to express the idea of "all alphanumeric characters."
To include one of the characters `\', `]', `-' or `^' in a character list, put a `\' in front of it. For example, `[d\]]' matches either `d' or `]'.
Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but where the actual characters themselves can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs in the U.S.A. and in France.
A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of `[:', a keyword denoting the class, and `:]'. Here are the character classes defined by the POSIX standard:
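[:alnum:]
Alphanumeric characters.
[:alpha:]
Alphabetic characters.
[:blank:]
Space and tab characters.
[:cntrl:]
Control characters.
[:digit:]
Numeric characters.
[:graph:]
Characters that are both printable and visible. (A space is printable, but not visible, while an `a' is both.)
[:lower:]
Lower-case alphabetic characters.
[:print:]
Printable characters (characters that are not control characters).
[:punct:]
Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]
Space characters (such as space, tab, and formfeed, to name a few).
[:upper:]
Upper-case alphabetic characters.
[:xdigit:]
Characters that are hexadecimal digits.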
For example, without the POSIX character classes, matching alphanumeric characters meant writing `[A-Za-z0-9]'. If your character set had other alphabetic characters in it, this would not match them. With the POSIX character classes, you can write `[[:alnum:]]', and this will match all the alphabetic and numeric characters in your character set.
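The limitation of a hand-written range can be seen in a short sketch. Note the assumptions: Python's `re' module does not accept the `[[:alnum:]]' notation, so the built-in `str.isalnum()' method is used below as a rough analogue of a character class that follows the character set. The sample characters are illustrative only.

```python
import re

# The explicit ranges only cover unaccented ASCII letters and digits.
ascii_alnum = re.compile(r"[A-Za-z0-9]")

for ch in ["a", "7", "é"]:
    # str.isalnum() is a stand-in here for a class like [:alnum:]:
    # it also accepts alphabetic characters outside A-Z and a-z.
    print(ch, bool(ascii_alnum.fullmatch(ch)), ch.isalnum())
```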
Two additional special sequences can appear in character lists.
These apply to non-ASCII character sets, which can have single symbols
(called collating elements) that are represented with more than one
character, as well as several characters that are equivalent for
collating, or sorting, purposes. (E.g., in French, a plain `e'
and a grave-accented `è' are equivalent.)
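A collating symbol is a multi-character collating element enclosed in `[.' and `.]'. For example, if `ch' is a collating element, then `[[.ch.]]' is a regexp that matches this collating element, while `[ch]' is a regexp that matches either `c' or `h'.
An equivalence class is a list of characters that are equivalent, enclosed in `[=' and `=]'. For example, `[[=eè=]]' is a regexp that matches either `e' or `è'.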
The regexp library routines that gcal uses for regular expression matching currently recognize only POSIX character classes (possibly); they do not recognize collating symbols or equivalence classes.
[^ ...]
`[^0-9]' matches any character that is not a digit.
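The character lists described above, including the complemented form, can be exercised with a few one-liners. This is only a sketch: Python's `re' module is assumed as a stand-in for the regexp routines gcal uses, and the input strings are invented.

```python
import re

# A plain list matches any one of the listed characters.
roman = re.findall(r"[MVX]", "MCMXCV")
print(roman)  # ['M', 'M', 'X', 'V']

# A range matches any character between its two end points.
digits = re.findall(r"[0-9]", "gcal 4.1")
print(digits)  # ['4', '1']

# A backslash makes ']' an ordinary member of the list.
d_or_bracket = re.findall(r"[d\]]", "d]e")
print(d_or_bracket)  # ['d', ']']

# A leading '^' complements the list.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)  # ['a', 'b']
```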
|
`^P|[0-9]' matches any string that matches either `^P' or `[0-9]'. This means it matches any string that starts with `P' or contains a digit. The alternation applies to the largest possible regexps on either side. In other words, `|' has the lowest precedence of all the regular expression operators.
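A brief sketch of how the alternation splits the pattern into `^P' on one side and `[0-9]' on the other. Python's `re' module is assumed as a test bed only, and the sample strings are hypothetical.

```python
import re

pattern = re.compile(r"^P|[0-9]")

starts_with_p = pattern.search("Paris")  # matches through the '^P' branch
has_digit = pattern.search("room 7")     # matches through the '[0-9]' branch
neither = pattern.search("a Plan")       # 'P' is not at the start, no digit

print(bool(starts_with_p), bool(has_digit), bool(neither))  # True True False
```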
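(...)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, `|'.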
*
`ph*' applies the `*' symbol to the preceding `h' and looks for matches of one `p' followed by any number of `h's. This will also match just `p' if no `h's are present. The `*' repeats the smallest possible preceding expression. (Use parentheses if you wish to repeat a larger expression.) It finds as many repetitions as possible. For example:
gcal --filter-text='\(c[ad][ad]*r x\)' -f sample.rc -y
prints every fixed date in `sample.rc' whose text contains a string of the form `(car x)', `(cdr x)', `(cadr x)', and so on. Notice the escaping of the parentheses by preceding them with backslashes.
+
`wh+y' would match `why' and `whhy' but not `wy', whereas `wh*y' would match all three of these strings. This is a simpler way of writing the last `*' example:
gcal --filter-text='\(c[ad]+r x\)' -f sample.rc -y
?
`fe?d' will match `fed' and `fd', but nothing else.
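The difference between `*', `+', and `?' is easiest to see side by side. The sketch below assumes Python's `re' module as a stand-in for gcal's regexp routines; the helper `matches' is a hypothetical name introduced for this illustration, and it tests whole words rather than substrings.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in their entirety."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["wy", "why", "whhy"]
star = matches(r"wh*y", words)   # zero or more 'h'
plus = matches(r"wh+y", words)   # one or more 'h'
optional = matches(r"fe?d", ["fd", "fed", "feed"])  # zero or one 'e'

print(star)      # ['wy', 'why', 'whhy']
print(plus)      # ['why', 'whhy']
print(optional)  # ['fd', 'fed']

# The filter text from the gcal examples: escaped parentheses are literal.
lisp = re.compile(r"\(c[ad]+r x\)")
print(bool(lisp.search("eval (cadr x) now")))  # True
print(bool(lisp.search("eval (cr x) now")))    # False
```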
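{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. A single number n repeats the preceding regexp exactly n times; two numbers n and m separated by a comma repeat it n to m times; a number n followed by a comma repeats it at least n times.
`wh{3}y' matches `whhhy', but not `why' or `whhhhy'.
`wh{3,5}y' matches only `whhhy', `whhhhy', or `whhhhhy'.
`wh{2,}y' matches `whhy', `whhhy', and so on.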
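To see which strings each interval expression accepts, the sketch below builds words with one to six `h's and reports the number of `h's in every word that matches completely. It assumes Python's `re' module, which handles these three interval forms; `h_counts' is a hypothetical helper name.

```python
import re

# Words with 1 to 6 'h' characters between 'w' and 'y'.
words = ["w" + "h" * n + "y" for n in range(1, 7)]

def h_counts(pattern):
    """Return the number of 'h's in every word matched in full."""
    return [w.count("h") for w in words if re.fullmatch(pattern, w)]

exactly_three = h_counts(r"wh{3}y")
three_to_five = h_counts(r"wh{3,5}y")
two_or_more = h_counts(r"wh{2,}y")

print(exactly_three)  # [3]
print(three_to_five)  # [3, 4, 5]
print(two_or_more)    # [2, 3, 4, 5, 6]
```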
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described here.
Most of the additional operators are for dealing with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (`_').
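\w
This operator matches any word-constituent character, i.e. any letter, digit, or underscore. Think of it as a short-hand for `[A-Za-z0-9_]' or `[[:alnum:]_]'.
\W
This operator matches any character that is not word-constituent. Think of it as a short-hand for `[^A-Za-z0-9_]' or `[^[:alnum:]_]'.
\<
This operator matches the empty string at the beginning of a word. For example, `\<away' matches `away', but not `stowaway'.
\>
This operator matches the empty string at the end of a word. For example, `stow\>' matches `stow', but not `stowaway'.
\b
This operator matches the empty string at either the beginning or the end of a word (the word boundary).
\B
This operator matches the empty string within a word, i.e. between two word-constituent characters. For example, `\Brat\B' matches `crate', but it does not match `dirty rat'. `\B' is essentially the opposite of `\b'.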
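The word operators can be approximated in a short sketch, with two caveats. Python's `re' module, assumed here as a stand-in, has no `\<' or `\>', so `\b' is used in their place. Python's `\w' is also Unicode-aware by default, so the `re.ASCII' flag is passed to keep it equivalent to `[A-Za-z0-9_]'.

```python
import re

# '\b' before the word plays the role of '\<' here.
away_word = re.search(r"\baway", "far away")
away_inside = re.search(r"\baway", "stowaway")

# '\b' after the word plays the role of '\>' here.
stow_word = re.search(r"stow\b", "stow it")
stow_inside = re.search(r"stow\b", "stowaway")

# '\B' only matches between two word-constituent characters.
rat_inside = re.search(r"\Brat\B", "crate")
rat_word = re.search(r"\Brat\B", "dirty rat")

# '\w' and '\W' split text into word and non-word characters.
word_runs = re.findall(r"\w+", "gcal_rc: 42!", flags=re.ASCII)
non_word = re.findall(r"\W", "a-b c", flags=re.ASCII)

print(bool(away_word), bool(away_inside))  # True False
print(bool(stow_word), bool(stow_inside))  # True False
print(bool(rat_inside), bool(rat_word))    # True False
print(word_runs, non_word)                 # ['gcal_rc', '42'] ['-', ' ']
```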
There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, the regexp library routines that gcal uses consider the entire string to be matched as the buffer.
For gcal, since `^' and `$' always work in terms of the beginning and end of strings, these operators don't add any new capabilities. They are provided for compatibility with other GNU software.
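\`
This operator matches the empty string at the beginning of the buffer.
\'
This operator matches the empty string at the end of the buffer.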
In regular expressions, the `*', `+', and `?' operators, as well as the braces `{' and `}', have the highest precedence, followed by concatenation, and finally by `|'. As in arithmetic, parentheses can change how operators are grouped.
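These precedence rules can be checked with a few whole-string matches. The patterns and strings below are made up for illustration, and Python's `re' module is assumed only as a test bed that follows the same precedence order.

```python
import re

# Repetition binds tighter than concatenation: 'ab*' means a(b*).
rep_binds_tighter = re.fullmatch(r"ab*", "abbb")
not_group_repeat = re.fullmatch(r"ab*", "abab")
group_repeat = re.fullmatch(r"(ab)*", "abab")

# Concatenation binds tighter than '|': 'ab|cd' means (ab)|(cd).
whole_branch = re.fullmatch(r"ab|cd", "cd")
not_inner_choice = re.fullmatch(r"ab|cd", "acd")
inner_choice = re.fullmatch(r"a(b|c)d", "acd")

print(bool(rep_binds_tighter), bool(not_group_repeat), bool(group_repeat))
# True False True
print(bool(whole_branch), bool(not_inner_choice), bool(inner_choice))
# True False True
```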
Case is normally significant in regular expressions, both when matching ordinary characters (i.e. not metacharacters), and inside character sets. Thus a `w' in a regular expression matches only a lower-case `w' and not an upper-case `W'.
The simplest way to do a case-independent match is to use a character list: `[Ww]'. However, this can be cumbersome if you need to use it often; and unfortunately, it can make the regular expressions harder to read.
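The character-list approach can be automated when it becomes cumbersome to write by hand. The sketch below is an illustration only: `case_insensitive' is a hypothetical helper, not part of gcal, and Python's `re' module is assumed as the test bed.

```python
import re

def case_insensitive(word):
    """Turn each letter of word into a two-character list such as [Ww]."""
    parts = []
    for ch in word:
        if ch.isalpha():
            parts.append("[" + ch.upper() + ch.lower() + "]")
        else:
            parts.append(re.escape(ch))  # keep other characters literal
    return "".join(parts)

pattern = case_insensitive("why")
print(pattern)  # [Ww][Hh][Yy]

# A plain 'w' is case-sensitive; the generated lists are not.
plain = re.search(r"w", "W")
listed = re.search(pattern, "WHY not?")
print(bool(plain), bool(listed))  # False True
```

The generated pattern shows the readability cost mentioned above: three letters already expand to twelve characters.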