Regex - Using Registers


Using Registers

A group in a regular expression can match a (possibly empty) substring of the string that the regular expression as a whole matched. The matcher remembers the beginning and end of the substring matched by each group.

To find out what each group matched, pass a non-null regs argument to a GNU matching or searching function (see section GNU Matching and section GNU Searching), i.e., the address of a structure of this type, as defined in `regex.h':

struct re_registers
{
  unsigned num_regs;
  regoff_t *start;
  regoff_t *end;
};

Except for (possibly) the num_regs'th element (see below), the ith element of the start and end arrays records information about the ith group in the pattern. (They're declared as C pointers, but this is only because not all C compilers accept zero-length arrays; conceptually, it is simplest to think of them as arrays.)
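As a minimal sketch of reading the arrays, the following assumes a GNU C library whose `regex.h' exposes the GNU interface when _GNU_SOURCE is defined, and it selects RE_SYNTAX_POSIX_EXTENDED so that `(' and `)' act as group operators. The helper name group_offsets is hypothetical and is not part of the library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc, which hides the GNU interface otherwise */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: match PATTERN at the start of TEXT and copy the
   offsets recorded for group GROUP into *START and *END.
   Returns 0 on success, -1 on a compile error, a failed match, or a
   group number larger than the pattern's group count.  */
int
group_offsets (const char *pattern, const char *text, unsigned group,
               regoff_t *start, regoff_t *end)
{
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int status = -1;

  memset (&buf, 0, sizeof buf);    /* no buffer, fastmap, or translate table */
  memset (&regs, 0, sizeof regs);  /* so the free calls below are always safe */

  /* Note: this changes the global syntax used by re_compile_pattern.  */
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -1;

  if (re_match (&buf, text, (regoff_t) strlen (text), 0, &regs) >= 0
      && group <= buf.re_nsub)
    {
      *start = regs.start[group];  /* index where the group's match begins */
      *end = regs.end[group];      /* index just beyond the group's match */
      status = 0;
    }

  free (regs.start);               /* arrays were allocated by the matcher */
  free (regs.end);
  regfree (&buf);
  return status;
}
```

Calling group_offsets ("((a)(b))", "ab", 3, &s, &e) would be expected to store 1 in s and 2 in e, in line with the first example later in this section.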

The start and end arrays are allocated in various ways, depending on the value of the regs_allocated field in the pattern buffer passed to the matcher.

The simplest and perhaps most useful is to let the matcher (re)allocate enough space to record information for all the groups in the regular expression. If regs_allocated is REGS_UNALLOCATED, the matcher allocates 1 + re_nsub elements for each array (re_nsub is another field in the pattern buffer; see section GNU Pattern Buffers). The matcher sets the extra element to -1 and sets regs_allocated to REGS_REALLOCATE. Then on subsequent calls with the same pattern buffer and regs arguments, the matcher reallocates more space if necessary.

It would perhaps be more logical to make the regs_allocated field part of the re_registers structure, instead of part of the pattern buffer. But in that case the caller would be forced to initialize the structure before passing it. Much existing code doesn't do this initialization, and it's arguably better to avoid it anyway.

re_compile_pattern sets regs_allocated to REGS_UNALLOCATED, so if you use the GNU regular expression functions, you get this behavior by default.
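The default allocation behavior described above can be observed with a short check. This is a sketch under the same assumptions as before (glibc with _GNU_SOURCE, POSIX extended syntax); the function name check_default_allocation and its return codes are made up for illustration. It does not assume an exact value for num_regs, only that the arrays are large enough for every group.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc, which hides the GNU interface otherwise */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical check: returns 0 if the allocation states were observed
   as described in the text, or a nonzero code naming the failed step.  */
int
check_default_allocation (void)
{
  const char *pattern = "(a)(b)?";
  struct re_pattern_buffer buf;
  struct re_registers regs;
  int result = 0;

  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return 1;

  if (buf.regs_allocated != REGS_UNALLOCATED)
    result = 2;                       /* state right after compilation */
  else if (re_match (&buf, "ab", 2, 0, &regs) != 2)
    result = 3;                       /* first call allocates the arrays */
  else if (buf.regs_allocated != REGS_REALLOCATE)
    result = 4;
  else if (regs.num_regs < 1 + buf.re_nsub)
    result = 5;                       /* room for group 0 and every group */
  else if (re_match (&buf, "a", 1, 0, &regs) != 1)
    result = 6;                       /* second call reuses the same regs */
  else if (regs.start[2] != -1 || regs.end[2] != -1)
    result = 7;                       /* group 2 did not participate */

  free (regs.start);                  /* the caller owns the arrays */
  free (regs.end);
  regfree (&buf);
  return result;
}
```

Because the matcher allocates the arrays in this mode, the caller is responsible for freeing regs.start and regs.end once the registers are no longer needed.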

xx document re_set_registers

POSIX, on the other hand, requires a different interface: the caller is supposed to pass in a fixed-length array which the matcher fills. Therefore, if regs_allocated is REGS_FIXED, the matcher simply fills that array.
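A sketch of the fixed-length case follows. It assumes that assigning REGS_FIXED to the regs_allocated field after compilation, and pointing start and end at caller-owned arrays, is an acceptable way to select this mode; the helper name match_fixed, the constant NREGS, and the pattern are illustrative only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc, which hides the GNU interface otherwise */
#endif
#include <regex.h>
#include <string.h>

#define NREGS 4   /* hypothetical capacity: group 0, two groups, one spare */

/* Hypothetical helper: match `(a)(b)' at the start of TEXT, letting the
   matcher fill the caller's arrays START and END (NREGS elements each).
   Returns what re_match returns, or -2 if the pattern fails to compile.  */
regoff_t
match_fixed (const char *text, regoff_t start[NREGS], regoff_t end[NREGS])
{
  const char *pattern = "(a)(b)";
  struct re_pattern_buffer buf;
  struct re_registers regs;
  regoff_t matched;

  memset (&buf, 0, sizeof buf);
  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  if (re_compile_pattern (pattern, strlen (pattern), &buf) != NULL)
    return -2;

  /* Describe the caller-owned storage and ask the matcher not to
     allocate or resize anything.  */
  regs.num_regs = NREGS;
  regs.start = start;
  regs.end = end;
  buf.regs_allocated = REGS_FIXED;

  matched = re_match (&buf, text, (regoff_t) strlen (text), 0, &regs);

  regfree (&buf);   /* the register arrays belong to the caller */
  return matched;
}
```

Here the arrays can live on the caller's stack, and nothing needs to be freed apart from the pattern buffer itself.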

The following examples illustrate the information recorded in the re_registers structure. (In all of them, `(' represents the open-group and `)' the close-group operator. The first character of the string being matched, string, is at index 0.)

If the regular expression has an i-th group not contained within another group that matches a substring of string, then the function sets regs->start[i] to the index in string where the substring matched by the i-th group begins, and regs->end[i] to the index just beyond that substring's end. The function sets regs->start[0] and regs->end[0] to analogous information about the entire pattern. For example, when you match `((a)(b))' against `ab', you get:
- 0 in regs->start[0] and 2 in regs->end[0]
- 0 in regs->start[1] and 2 in regs->end[1]
- 0 in regs->start[2] and 1 in regs->end[2]
- 1 in regs->start[3] and 2 in regs->end[3]
If a group matches more than once (as it might if followed by, e.g., a repetition operator), then the function reports the information about what the group last matched. For example, when you match the pattern `(a)*' against the string `aa', you get:
- 0 in regs->start[0] and 2 in regs->end[0]
- 1 in regs->start[1] and 2 in regs->end[1]
If the i-th group does not participate in a successful match, e.g., it is an alternative not taken or a repetition operator allows zero repetitions of it, then the function sets regs->start[i] and regs->end[i] to -1. For example, when you match the pattern `(a)*b' against the string `b', you get:
- 0 in regs->start[0] and 1 in regs->end[0]
- -1 in regs->start[1] and -1 in regs->end[1]
If the i-th group matches a zero-length string, then the function sets both regs->start[i] and regs->end[i] to the index at which that zero-length match occurs. For example, when you match the pattern `(a*)b' against the string `b', you get:
- 0 in regs->start[0] and 1 in regs->end[0]
- 0 in regs->start[1] and 0 in regs->end[1]
If an i-th group directly contains a j-th group (that is, group j is not contained within any other group inside group i) and the function reports a match of the i-th group, then it records in regs->start[j] and regs->end[j] the last match (if it matched) of the j-th group. For example, when you match the pattern `((a*)b)*' against the string `abb', group 2 last matches the empty string at index 2, so that empty match is what is reported:
- 0 in regs->start[0] and 3 in regs->end[0]
- 2 in regs->start[1] and 3 in regs->end[1]
- 2 in regs->start[2] and 2 in regs->end[2]
When you match the pattern `((a)*b)*' against the string `abb', group 2 doesn't participate in the last match, so you get:
- 0 in regs->start[0] and 3 in regs->end[0]
- 2 in regs->start[1] and 3 in regs->end[1]
- 0 in regs->start[2] and 1 in regs->end[2]
If an i-th group contains a j-th group in turn not contained within any other group within group i and the function sets regs->start[i] and regs->end[i] to -1, then it also sets regs->start[j] and regs->end[j] to -1. For example, when you match the pattern `((a)*b)*c' against the string `c', you get:
- 0 in regs->start[0] and 1 in regs->end[0]
- -1 in regs->start[1] and -1 in regs->end[1]
- -1 in regs->start[2] and -1 in regs->end[2]
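
Several of the examples above can be checked mechanically. The following is a sketch rather than a test suite: it assumes glibc with _GNU_SOURCE and POSIX extended syntax, and the names reg_case, cases, and count_mismatches are hypothetical. The two `abb' examples with nested repetition are left out, since results for those may vary between implementations of the matcher.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc, which hides the GNU interface otherwise */
#endif
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* One documented example: a pattern, a string, and the expected
   registers for group 0 and each group in the pattern.  */
struct reg_case
{
  const char *pattern;
  const char *text;
  unsigned ngroups;        /* number of registers to compare */
  regoff_t start[4];
  regoff_t end[4];
};

static const struct reg_case cases[] = {
  { "((a)(b))",  "ab", 4, { 0,  0,  0, 1 }, { 2,  2,  1, 2 } },
  { "(a)*",      "aa", 2, { 0,  1 },        { 2,  2 } },
  { "(a)*b",     "b",  2, { 0, -1 },        { 1, -1 } },
  { "(a*)b",     "b",  2, { 0,  0 },        { 1,  0 } },
  { "((a)*b)*c", "c",  3, { 0, -1, -1 },    { 1, -1, -1 } },
};

/* Returns the number of cases whose registers differ from the table.  */
int
count_mismatches (void)
{
  size_t ncases = sizeof cases / sizeof cases[0];
  int mismatches = 0;

  re_set_syntax (RE_SYNTAX_POSIX_EXTENDED);

  for (size_t c = 0; c < ncases; c++)
    {
      struct re_pattern_buffer buf;
      struct re_registers regs;
      int ok;

      memset (&buf, 0, sizeof buf);
      memset (&regs, 0, sizeof regs);

      ok = (re_compile_pattern (cases[c].pattern,
                                strlen (cases[c].pattern), &buf) == NULL
            && re_match (&buf, cases[c].text,
                         (regoff_t) strlen (cases[c].text), 0, &regs) >= 0);

      /* Compare every register the example documents.  */
      for (unsigned g = 0; ok && g < cases[c].ngroups; g++)
        if (regs.start[g] != cases[c].start[g]
            || regs.end[g] != cases[c].end[g])
          ok = 0;

      if (!ok)
        mismatches++;

      free (regs.start);
      free (regs.end);
      regfree (&buf);
    }

  return mismatches;
}
```

A return value of 0 from count_mismatches would indicate that the matcher in use agrees with the listed examples.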
