A group in a regular expression can match a (possibly empty) substring of the string that the regular expression as a whole matched. The matcher remembers the beginning and end of the substring matched by each group.
To find out what the groups matched, pass a nonzero regs argument to a GNU matching or searching function (see section GNU Matching and section GNU Searching), i.e., the address of a structure of this type, as defined in `regex.h':
    struct re_registers
    {
      unsigned num_regs;
      regoff_t *start;
      regoff_t *end;
    };
Except for (possibly) the num_regs'th element (see below), the ith element of the start and end arrays records information about the ith group in the pattern. (They're declared as C pointers, but this is only because not all C compilers accept zero-length arrays; conceptually, it is simplest to think of them as arrays.)
The start and end arrays are allocated in various ways, depending on the value of the regs_allocated field in the pattern buffer passed to the matcher.
The simplest and perhaps most useful way is to let the matcher (re)allocate enough space to record information for all the groups in the regular expression. If regs_allocated is REGS_UNALLOCATED, the matcher allocates 1 + re_nsub elements (re_nsub is another field in the pattern buffer; see section GNU Pattern Buffers). It sets the extra element to -1, and sets regs_allocated to REGS_REALLOCATE. Then on subsequent calls with the same pattern buffer and regs arguments, the matcher reallocates more space if necessary.
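The default allocation scheme described above can be sketched as follows. This is a minimal illustration, assuming glibc's `regex.h', which exposes the GNU interface only when _GNU_SOURCE is defined before the first include. The helper name match_with_auto_regs and the -3 return value are hypothetical, and the sketch selects RE_SYNTAX_POSIX_EXTENDED so that a plain `(' opens a group.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN, match it at the start of STRING,
   and let the matcher allocate the register arrays.  Returns what
   re_match returns, or -3 if the pattern does not compile.  After a
   successful match the caller owns regs->start and regs->end and must
   free() them.  STATE_AFTER receives regs_allocated after the call.  */
static int
match_with_auto_regs (const char *pattern, const char *string,
                      struct re_registers *regs, unsigned *state_after)
{
  struct re_pattern_buffer buf;
  int matched;

  assert (regs != NULL && state_after != NULL);
  memset (&buf, 0, sizeof buf);

  /* Note: re_set_syntax changes a global setting.  */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -3;

  /* re_compile_pattern leaves regs_allocated at REGS_UNALLOCATED, so
     REGS does not have to be initialized before this call.  */
  matched = re_match (&buf, string, (regoff_t) strlen (string), 0, regs);
  *state_after = buf.regs_allocated;

  regfree (&buf);
  return matched;
}
```

After a successful call, the caller can read regs->start[i] and regs->end[i] for i from 0 to re_nsub and should then free both arrays. Because re_set_syntax alters global state, real code would probably set the syntax once at startup rather than in a helper like this.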
It would perhaps be more logical to make the regs_allocated field part of the re_registers structure, instead of part of the pattern buffer. But in that case the caller would be forced to initialize the structure before passing it. Much existing code doesn't do this initialization, and it's arguably better to avoid it anyway.
re_compile_pattern sets regs_allocated to REGS_UNALLOCATED, so if you use the GNU regular expression functions, you get this behavior by default.
xx document re_set_registers
POSIX, on the other hand, requires a different interface: the caller is supposed to pass in a fixed-length array which the matcher fills. Therefore, if regs_allocated is REGS_FIXED, the matcher simply fills that array.
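A sketch of the fixed-length variant follows, again assuming glibc's `regex.h' with _GNU_SOURCE defined before the first include. The helper match_with_fixed_regs, the MAX_REGS capacity, and the -3 return value are hypothetical. The sketch assigns REGS_FIXED to the regs_allocated field directly after compiling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_REGS 8   /* Hypothetical capacity chosen for this sketch.  */

/* Hypothetical helper: match PATTERN at the start of STRING, recording
   registers in the caller's arrays STARTS and ENDS, each of MAX_REGS
   elements.  Returns what re_match returns, or -3 if the pattern does
   not compile or has too many groups for the arrays.  */
static int
match_with_fixed_regs (const char *pattern, const char *string,
                       regoff_t *starts, regoff_t *ends)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int matched;

  assert (starts != NULL && ends != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -3;
  if (buf.re_nsub + 1 > MAX_REGS)
    {
      regfree (&buf);
      return -3;
    }

  /* Point the structure at the caller's storage and tell the matcher
     not to allocate or resize anything.  */
  regs.num_regs = MAX_REGS;
  regs.start = starts;
  regs.end = ends;
  buf.regs_allocated = REGS_FIXED;

  matched = re_match (&buf, string, (regoff_t) strlen (string), 0, &regs);
  regfree (&buf);
  return matched;
}
```

Since the arrays belong to the caller, nothing has to be freed afterwards. The explicit capacity check guards against patterns with more groups than the arrays can hold.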
The following examples illustrate the information recorded in the re_registers structure. (In all of them, `(' represents the open-group and `)' the close-group operator. The first character in the string string is at index 0.)
If the regular expression has an i-th group not contained within another group that matches a substring of string, then the function sets regs->start[i] to the index in string where the substring matched by the i-th group begins, and regs->end[i] to the index just beyond that substring's end. The function sets regs->start[0] and regs->end[0] to analogous information about the entire pattern.
For example, when you match `((a)(b))' against `ab', you get:
0 in regs->start[0] and 2 in regs->end[0]
0 in regs->start[1] and 2 in regs->end[1]
0 in regs->start[2] and 1 in regs->end[2]
1 in regs->start[3] and 2 in regs->end[3]
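Since regs->end[i] is the index just beyond the substring, the length of what a group matched is the difference of its two registers. The following sketch copies that substring out. It assumes glibc's `regex.h' with _GNU_SOURCE defined before the first include, and the helper name group_text is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc'ed copy of what group GROUP of
   PATTERN matched at the start of STRING, or NULL if the pattern has
   no such group, the match failed, or the group did not participate.  */
static char *
group_text (const char *pattern, const char *string, unsigned group)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  char *copy = NULL;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return NULL;

  if (group <= buf.re_nsub
      && re_match (&buf, string, (regoff_t) strlen (string), 0, &regs) >= 0)
    {
      if (regs.start[group] >= 0)
        {
          /* end[group] is one past the last matched character.  */
          size_t len = (size_t) (regs.end[group] - regs.start[group]);
          copy = malloc (len + 1);
          if (copy != NULL)
            {
              memcpy (copy, string + regs.start[group], len);
              copy[len] = '\0';
            }
        }
      /* The matcher allocated these arrays; release them.  */
      free (regs.start);
      free (regs.end);
    }
  regfree (&buf);
  return copy;
}
```

With the pattern `((a)(b))' and the string `ab', asking for group 3 yields a copy of `b', and asking for group 1 yields `ab'.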
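If a group matches more than once (as it might if followed by, e.g., a repetition operator), then the function reports the information about what the group last matched.
For example, when you match the pattern `(a)*' against the string `aa', you get:
0 in regs->start[0] and 2 in regs->end[0]
1 in regs->start[1] and 2 in regs->end[1]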
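If the i-th group does not participate in a successful match, e.g., it is an alternative not taken or a repetition operator allows zero repetitions of it, then the function sets regs->start[i] and regs->end[i] to -1.
For example, when you match the pattern `(a)*b' against the string `b', you get:
0 in regs->start[0] and 1 in regs->end[0]
-1 in regs->start[1] and -1 in regs->end[1]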
If the i-th group matches a zero-length string, then the function sets regs->start[i] and regs->end[i] to the index just beyond that zero-length string.
For example, when you match the pattern `(a*)b' against the string `b', you get:
0 in regs->start[0] and 1 in regs->end[0]
0 in regs->start[1] and 0 in regs->end[1]
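The difference between a group that did not participate (both registers -1) and a group that matched a zero-length string (both registers equal and nonnegative) is easy to overlook. A hedged sketch of how a caller might tell them apart, assuming glibc's `regex.h' with _GNU_SOURCE defined before the first include; the enum and the helper classify_group are hypothetical names:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical classification of one group after a match attempt.  */
enum group_state
{
  GROUP_NO_MATCH,   /* the pattern as a whole did not match */
  GROUP_UNUSED,     /* matched, but the group did not participate */
  GROUP_EMPTY,      /* the group matched a zero-length string */
  GROUP_NONEMPTY    /* the group matched at least one character */
};

/* Hypothetical helper: match PATTERN at the start of STRING and
   classify group GROUP from its start and end registers.  */
static enum group_state
classify_group (const char *pattern, const char *string, unsigned group)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  enum group_state state = GROUP_NO_MATCH;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return GROUP_NO_MATCH;
  assert (group <= buf.re_nsub);

  if (re_match (&buf, string, (regoff_t) strlen (string), 0, &regs) >= 0)
    {
      if (regs.start[group] < 0)
        state = GROUP_UNUSED;
      else if (regs.start[group] == regs.end[group])
        state = GROUP_EMPTY;
      else
        state = GROUP_NONEMPTY;
      free (regs.start);
      free (regs.end);
    }
  regfree (&buf);
  return state;
}
```

For group 1, matching `(a)*b' against `b' classifies as GROUP_UNUSED, while matching `(a*)b' against `b' classifies as GROUP_EMPTY.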
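If an i-th group contains a j-th group in turn not contained within any other group within group i and the function reports a match of the i-th group, then it records in regs->start[j] and regs->end[j] the last match (if it matched) of the j-th group.
For example, when you match the pattern `((a*)b)*' against the string `abb', group 2 last matches the empty string, so you get what it previously matched:
0 in regs->start[0] and 3 in regs->end[0]
2 in regs->start[1] and 3 in regs->end[1]
2 in regs->start[2] and 2 in regs->end[2]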
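When you match the pattern `((a)*b)*' against the string `abb', group 2 doesn't participate in the last match, so you get:
0 in regs->start[0] and 3 in regs->end[0]
2 in regs->start[1] and 3 in regs->end[1]
0 in regs->start[2] and 1 in regs->end[2]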
If an i-th group contains a j-th group in turn not contained within any other group within group i and the function sets regs->start[i] and regs->end[i] to -1, then it also sets regs->start[j] and regs->end[j] to -1.
For example, when you match the pattern `((a)*b)*c' against the string `c', you get:
0 in regs->start[0] and 1 in regs->end[0]
-1 in regs->start[1] and -1 in regs->end[1]
-1 in regs->start[2] and -1 in regs->end[2]
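To inspect every group at once, a caller can walk the registers from 1 to re_nsub and skip the entries set to -1. The sketch below counts the groups that participated. It assumes glibc's `regex.h' with _GNU_SOURCE defined before the first include, and the helper name count_participating_groups is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of STRING and return
   how many groups (not counting group 0, the whole pattern) have
   registers other than -1.  Returns -1 if the pattern does not compile
   or does not match.  */
static int
count_participating_groups (const char *pattern, const char *string)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int count = -1;

  assert (pattern != NULL && string != NULL);
  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);
  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;

  if (re_match (&buf, string, (regoff_t) strlen (string), 0, &regs) >= 0)
    {
      size_t i;

      count = 0;
      /* Groups are numbered 1 through re_nsub.  */
      for (i = 1; i <= buf.re_nsub; i++)
        if (regs.start[i] >= 0)
          count++;
      free (regs.start);
      free (regs.end);
    }
  regfree (&buf);
  return count;
}
```

Matching `((a)*b)*c' against `c' leaves both groups at -1, so the count is 0. Against `abc', both groups participate and the count is 2.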