\input texinfo
|
|
@c %**start of header
|
|
@setfilename regex.info
|
|
@settitle Regex
|
|
@c %**end of header
|
|
|
|
@c \\{fill-paragraph} works better (for me, anyway) if the text in the
|
|
@c source file isn't indented.
|
|
@paragraphindent 2
|
|
|
|
@c Define a new index for our magic constants.
|
|
@defcodeindex cn
|
|
|
|
@c Put everything in one index (arbitrarily chosen to be the concept index).
|
|
@syncodeindex cn cp
|
|
@syncodeindex ky cp
|
|
@syncodeindex pg cp
|
|
@syncodeindex tp cp
|
|
@syncodeindex vr cp
|
|
|
|
@c Here is what we use in the Info `dir' file:
|
|
@c * Regex: (regex). Regular expression library.
|
|
|
|
|
|
@ifinfo
|
|
This file documents the GNU regular expression library.
|
|
|
|
Copyright (C) 1992, 1993 Free Software Foundation, Inc.
|
|
|
|
Permission is granted to make and distribute verbatim copies of this
|
|
manual provided the copyright notice and this permission notice are
|
|
preserved on all copies.
|
|
|
|
@ignore
|
|
Permission is granted to process this file through TeX and print the
|
|
results, provided the printed document carries a copying permission
|
|
notice identical to this one except for the removal of this paragraph
|
|
(this paragraph not being relevant to the printed manual).
|
|
@end ignore
|
|
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided also that the
|
|
section entitled ``GNU General Public License'' is included exactly as
|
|
in the original, and provided that the entire resulting derived work is
|
|
distributed under the terms of a permission notice identical to this one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that the section entitled ``GNU General Public License'' may be
|
|
included in a translation approved by the Free Software Foundation
|
|
instead of in the original English.
|
|
@end ifinfo
|
|
|
|
|
|
@titlepage
|
|
|
|
@title Regex
|
|
@subtitle edition 0.12a
|
|
@subtitle 19 September 1992
|
|
@author Kathryn A. Hargreaves
|
|
@author Karl Berry
|
|
|
|
@page
|
|
|
|
@vskip 0pt plus 1filll
|
|
Copyright @copyright{} 1992 Free Software Foundation.
|
|
|
|
Permission is granted to make and distribute verbatim copies of this
|
|
manual provided the copyright notice and this permission notice are
|
|
preserved on all copies.
|
|
|
|
Permission is granted to copy and distribute modified versions of this
|
|
manual under the conditions for verbatim copying, provided also that the
|
|
section entitled ``GNU General Public License'' is included exactly as
|
|
in the original, and provided that the entire resulting derived work is
|
|
distributed under the terms of a permission notice identical to this
|
|
one.
|
|
|
|
Permission is granted to copy and distribute translations of this manual
|
|
into another language, under the above conditions for modified versions,
|
|
except that the section entitled ``GNU General Public License'' may be
|
|
included in a translation approved by the Free Software Foundation
|
|
instead of in the original English.
|
|
|
|
@end titlepage
|
|
|
|
|
|
@ifinfo
|
|
@node Top, Overview, (dir), (dir)
|
|
@top Regular Expression Library
|
|
|
|
This manual documents how to program with the GNU regular expression
|
|
library. This is edition 0.12a of the manual, 19 September 1992.
|
|
|
|
The first part of this master menu lists the major nodes in this Info
|
|
document, including the index. The rest of the menu lists all the
|
|
lower level nodes in the document.
|
|
|
|
@menu
|
|
* Overview::
|
|
* Regular Expression Syntax::
|
|
* Common Operators::
|
|
* GNU Operators::
|
|
* GNU Emacs Operators::
|
|
* What Gets Matched?::
|
|
* Programming with Regex::
|
|
* Copying:: Copying and sharing Regex.
|
|
* Index:: General index.
|
|
--- The Detailed Node Listing ---
|
|
|
|
Regular Expression Syntax
|
|
|
|
* Syntax Bits::
|
|
* Predefined Syntaxes::
|
|
* Collating Elements vs. Characters::
|
|
* The Backslash Character::
|
|
|
|
Common Operators
|
|
|
|
* Match-self Operator:: Ordinary characters.
|
|
* Match-any-character Operator:: .
|
|
* Concatenation Operator:: Juxtaposition.
|
|
* Repetition Operators:: * + ? @{@}
|
|
* Alternation Operator:: |
|
|
* List Operators:: [...] [^...]
|
|
* Grouping Operators:: (...)
|
|
* Back-reference Operator:: \digit
|
|
* Anchoring Operators:: ^ $
|
|
|
|
Repetition Operators
|
|
|
|
* Match-zero-or-more Operator:: *
|
|
* Match-one-or-more Operator:: +
|
|
* Match-zero-or-one Operator:: ?
|
|
* Interval Operators:: @{@}
|
|
|
|
List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]})
|
|
|
|
* Character Class Operators:: [:class:]
|
|
* Range Operator:: start-end
|
|
|
|
Anchoring Operators
|
|
|
|
* Match-beginning-of-line Operator:: ^
|
|
* Match-end-of-line Operator:: $
|
|
|
|
GNU Operators
|
|
|
|
* Word Operators::
|
|
* Buffer Operators::
|
|
|
|
Word Operators
|
|
|
|
* Non-Emacs Syntax Tables::
|
|
* Match-word-boundary Operator:: \b
|
|
* Match-within-word Operator:: \B
|
|
* Match-beginning-of-word Operator:: \<
|
|
* Match-end-of-word Operator:: \>
|
|
* Match-word-constituent Operator:: \w
|
|
* Match-non-word-constituent Operator:: \W
|
|
|
|
Buffer Operators
|
|
|
|
* Match-beginning-of-buffer Operator:: \`
|
|
* Match-end-of-buffer Operator:: \'
|
|
|
|
GNU Emacs Operators
|
|
|
|
* Syntactic Class Operators::
|
|
|
|
Syntactic Class Operators
|
|
|
|
* Emacs Syntax Tables::
|
|
* Match-syntactic-class Operator:: \sCLASS
|
|
* Match-not-syntactic-class Operator:: \SCLASS
|
|
|
|
Programming with Regex
|
|
|
|
* GNU Regex Functions::
|
|
* POSIX Regex Functions::
|
|
* BSD Regex Functions::
|
|
|
|
GNU Regex Functions
|
|
|
|
* GNU Pattern Buffers:: The re_pattern_buffer type.
|
|
* GNU Regular Expression Compiling:: re_compile_pattern ()
|
|
* GNU Matching:: re_match ()
|
|
* GNU Searching:: re_search ()
|
|
* Matching/Searching with Split Data:: re_match_2 (), re_search_2 ()
|
|
* Searching with Fastmaps:: re_compile_fastmap ()
|
|
* GNU Translate Tables:: The `translate' field.
|
|
* Using Registers:: The re_registers type and related fns.
|
|
* Freeing GNU Pattern Buffers:: regfree ()
|
|
|
|
POSIX Regex Functions
|
|
|
|
* POSIX Pattern Buffers:: The regex_t type.
|
|
* POSIX Regular Expression Compiling:: regcomp ()
|
|
* POSIX Matching:: regexec ()
|
|
* Reporting Errors:: regerror ()
|
|
* Using Byte Offsets:: The regmatch_t type.
|
|
* Freeing POSIX Pattern Buffers:: regfree ()
|
|
|
|
BSD Regex Functions
|
|
|
|
* BSD Regular Expression Compiling:: re_comp ()
|
|
* BSD Searching:: re_exec ()
|
|
@end menu
|
|
@end ifinfo
|
|
@node Overview, Regular Expression Syntax, Top, Top
|
|
@chapter Overview
|
|
|
|
A @dfn{regular expression} (or @dfn{regexp}, or @dfn{pattern}) is a text
|
|
string that describes some (mathematical) set of strings. A regexp
|
|
@var{r} @dfn{matches} a string @var{s} if @var{s} is in the set of
|
|
strings described by @var{r}.
|
|
|
|
Using the Regex library, you can:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
see if a string matches a specified pattern as a whole, and
|
|
|
|
@item
|
|
search within a string for a substring matching a specified pattern.
|
|
|
|
@end itemize
|
|
|
|
Some regular expressions match only one string, i.e., the set they
|
|
describe has only one member. For example, the regular expression
|
|
@samp{foo} matches the string @samp{foo} and no others. Other regular
|
|
expressions match more than one string, i.e., the set they describe has
|
|
more than one member. For example, the regular expression @samp{f*}
|
|
matches the set of strings made up of any number (including zero) of
|
|
@samp{f}s. As you can see, some characters in regular expressions match
|
|
themselves (such as @samp{f}) and some don't (such as @samp{*}); the
|
|
ones that don't match themselves instead let you specify patterns that
|
|
describe many different strings.
|
|
|
|
To either match or search for a regular expression with the Regex
|
|
library functions, you must first compile it with a Regex pattern
|
|
compiling function. A @dfn{compiled pattern} is a regular expression
|
|
converted to the internal format used by the library functions. Once
|
|
you've compiled a pattern, you can use it for matching or searching any
|
|
number of times.
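
The compile-once, use-many-times pattern described above can be sketched
in C with the @sc{posix} functions (@pxref{POSIX Regex Functions}).  This
is only an illustration under stated assumptions: the helper name
@code{count_matches} and its parameters are ours, not part of Regex.
Because @code{regexec} searches within a string, the caller anchors the
pattern with @samp{^} and @samp{$} when it wants to test a string as a
whole.

@verbatim
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN once as a POSIX extended regexp,
   then try it against each of the N strings in INPUTS.  Returns the
   number of strings in which a match is found, or -1 if the pattern
   does not compile.  */
int
count_matches (const char *pattern, const char *const *inputs, size_t n)
{
  regex_t compiled;
  size_t i;
  int count = 0;

  if (regcomp (&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  /* The same compiled pattern serves every call to regexec.  */
  for (i = 0; i < n; i++)
    if (regexec (&compiled, inputs[i], 0, NULL, 0) == 0)
      count++;

  regfree (&compiled);
  return count;
}
```
@end verbatim

Called with the pattern @samp{^f*$} and the four strings @samp{f},
@samp{fff}, @samp{foo} and the empty string, this sketch counts three
matches; only @samp{foo} is rejected.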
|
|
|
|
The Regex library consists of two source files: @file{regex.h} and
|
|
@file{regex.c}.
|
|
@pindex regex.h
|
|
@pindex regex.c
|
|
Regex provides three groups of functions with which you can operate on
|
|
regular expressions. One group---the @sc{gnu} group---is more powerful
|
|
but not completely compatible with the other two, namely the @sc{posix}
|
|
and Berkeley @sc{unix} groups; its interface was designed specifically
|
|
for @sc{gnu}. The other groups have the same interfaces as do the
|
|
regular expression functions in @sc{posix} and Berkeley
|
|
@sc{unix}.
|
|
|
|
We wrote this manual with programmers in mind, not users of
|
|
programs---such as Emacs---that use Regex. We describe the Regex
|
|
library in its entirety, not how to write regular expressions that a
|
|
particular program understands.
|
|
|
|
|
|
@node Regular Expression Syntax, Common Operators, Overview, Top
|
|
@chapter Regular Expression Syntax
|
|
|
|
@cindex regular expressions, syntax of
|
|
@cindex syntax of regular expressions
|
|
|
|
@dfn{Characters} are things you can type. @dfn{Operators} are things in
|
|
a regular expression that match one or more characters. You compose
|
|
regular expressions from operators, which in turn you specify using one
|
|
or more characters.
|
|
|
|
Most characters represent what we call the match-self operator, i.e.,
|
|
they match themselves; we call these characters @dfn{ordinary}. Other
|
|
characters represent either all or parts of fancier operators; e.g.,
|
|
@samp{.} represents what we call the match-any-character operator
|
|
(which, no surprise, matches (almost) any character); we call these
|
|
characters @dfn{special}. Two different things determine what
|
|
characters represent what operators:
|
|
|
|
@enumerate
|
|
@item
|
|
the regular expression syntax your program has told the Regex library to
|
|
recognize, and
|
|
|
|
@item
|
|
the context of the character in the regular expression.
|
|
@end enumerate
|
|
|
|
In the following sections, we describe these things in more detail.
|
|
|
|
@menu
|
|
* Syntax Bits::
|
|
* Predefined Syntaxes::
|
|
* Collating Elements vs. Characters::
|
|
* The Backslash Character::
|
|
@end menu
|
|
|
|
|
|
@node Syntax Bits, Predefined Syntaxes, , Regular Expression Syntax
|
|
@section Syntax Bits
|
|
|
|
@cindex syntax bits
|
|
|
|
In any particular syntax for regular expressions, some characters are
|
|
always special, others are sometimes special, and others are never
|
|
special. The particular syntax that Regex recognizes for a given
|
|
regular expression depends on the value in the @code{syntax} field of
|
|
the pattern buffer of that regular expression.
|
|
|
|
You get a pattern buffer by compiling a regular expression. @xref{GNU
|
|
Pattern Buffers}, and @ref{POSIX Pattern Buffers}, for more information
|
|
on pattern buffers. @xref{GNU Regular Expression Compiling}, @ref{POSIX
|
|
Regular Expression Compiling}, and @ref{BSD Regular Expression
|
|
Compiling}, for more information on compiling.
|
|
|
|
Regex considers the value of the @code{syntax} field to be a collection
|
|
of bits; we refer to these bits as @dfn{syntax bits}. In most cases,
|
|
they affect what characters represent what operators. We describe the
|
|
meanings of the operators to which we refer in @ref{Common Operators},
|
|
@ref{GNU Operators}, and @ref{GNU Emacs Operators}.
|
|
|
|
For reference, here is the complete list of syntax bits, in alphabetical
|
|
order:
|
|
|
|
@table @code
|
|
|
|
@cnindex RE_BACKSLASH_ESCAPE_IN_LISTS
|
|
@item RE_BACKSLASH_ESCAPE_IN_LISTS
|
|
If this bit is set, then @samp{\} inside a list (@pxref{List Operators})
|
|
quotes (makes ordinary, if it's special) the following character; if
|
|
this bit isn't set, then @samp{\} is an ordinary character inside lists.
|
|
(@xref{The Backslash Character}, for what @samp{\} does outside of lists.)
|
|
|
|
@cnindex RE_BK_PLUS_QM
|
|
@item RE_BK_PLUS_QM
|
|
If this bit is set, then @samp{\+} represents the match-one-or-more
|
|
operator and @samp{\?} represents the match-zero-or-one operator; if
|
|
this bit isn't set, then @samp{+} represents the match-one-or-more
|
|
operator and @samp{?} represents the match-zero-or-one operator. This
|
|
bit is irrelevant if @code{RE_LIMITED_OPS} is set.
|
|
|
|
@cnindex RE_CHAR_CLASSES
|
|
@item RE_CHAR_CLASSES
|
|
If this bit is set, then you can use character classes in lists; if this
|
|
bit isn't set, then you can't.
|
|
|
|
@cnindex RE_CONTEXT_INDEP_ANCHORS
|
|
@item RE_CONTEXT_INDEP_ANCHORS
|
|
If this bit is set, then @samp{^} and @samp{$} are special anywhere outside
|
|
a list; if this bit isn't set, then these characters are special only in
|
|
certain contexts. @xref{Match-beginning-of-line Operator}, and
|
|
@ref{Match-end-of-line Operator}.
|
|
|
|
@cnindex RE_CONTEXT_INDEP_OPS
|
|
@item RE_CONTEXT_INDEP_OPS
|
|
If this bit is set, then certain characters are special anywhere outside
|
|
a list; if this bit isn't set, then those characters are special only in
|
|
some contexts and are ordinary elsewhere. Specifically, if this bit
|
|
isn't set then @samp{*}, and (if the syntax bit @code{RE_LIMITED_OPS}
|
|
isn't set) @samp{+} and @samp{?} (or @samp{\+} and @samp{\?}, depending
|
|
on the syntax bit @code{RE_BK_PLUS_QM}) represent repetition operators
|
|
only if they're not first in a regular expression or just after an
|
|
open-group or alternation operator. The same holds for @samp{@{} (or
|
|
@samp{\@{}, depending on the syntax bit @code{RE_NO_BK_BRACES}) if
|
|
it is the beginning of a valid interval and the syntax bit
|
|
@code{RE_INTERVALS} is set.
|
|
|
|
@cnindex RE_CONTEXT_INVALID_OPS
|
|
@item RE_CONTEXT_INVALID_OPS
|
|
If this bit is set, then repetition and alternation operators can't be
|
|
in certain positions within a regular expression. Specifically, the
|
|
regular expression is invalid if it has:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
a repetition operator first in the regular expression or just after a
|
|
match-beginning-of-line, open-group, or alternation operator; or
|
|
|
|
@item
|
|
an alternation operator first or last in the regular expression, just
|
|
before a match-end-of-line operator, or just after an alternation or
|
|
open-group operator.
|
|
|
|
@end itemize
|
|
|
|
If this bit isn't set, then you can put the characters representing the
|
|
repetition and alternation operators anywhere in a regular expression.
|
|
Whether or not they will in fact be operators in certain positions
|
|
depends on other syntax bits.
|
|
|
|
@cnindex RE_DOT_NEWLINE
|
|
@item RE_DOT_NEWLINE
|
|
If this bit is set, then the match-any-character operator matches
|
|
a newline; if this bit isn't set, then it doesn't.
|
|
|
|
@cnindex RE_DOT_NOT_NULL
|
|
@item RE_DOT_NOT_NULL
|
|
If this bit is set, then the match-any-character operator doesn't match
|
|
a null character; if this bit isn't set, then it does.
|
|
|
|
@cnindex RE_INTERVALS
|
|
@item RE_INTERVALS
|
|
If this bit is set, then Regex recognizes interval operators; if this bit
|
|
isn't set, then it doesn't.
|
|
|
|
@cnindex RE_LIMITED_OPS
|
|
@item RE_LIMITED_OPS
|
|
If this bit is set, then Regex doesn't recognize the match-one-or-more,
|
|
match-zero-or-one or alternation operators; if this bit isn't set, then
|
|
it does.
|
|
|
|
@cnindex RE_NEWLINE_ALT
|
|
@item RE_NEWLINE_ALT
|
|
If this bit is set, then newline represents the alternation operator; if
|
|
this bit isn't set, then newline is ordinary.
|
|
|
|
@cnindex RE_NO_BK_BRACES
|
|
@item RE_NO_BK_BRACES
|
|
If this bit is set, then @samp{@{} represents the open-interval operator
|
|
and @samp{@}} represents the close-interval operator; if this bit isn't
|
|
set, then @samp{\@{} represents the open-interval operator and
|
|
@samp{\@}} represents the close-interval operator. This bit is relevant
|
|
only if @code{RE_INTERVALS} is set.
|
|
|
|
@cnindex RE_NO_BK_PARENS
|
|
@item RE_NO_BK_PARENS
|
|
If this bit is set, then @samp{(} represents the open-group operator and
|
|
@samp{)} represents the close-group operator; if this bit isn't set, then
|
|
@samp{\(} represents the open-group operator and @samp{\)} represents
|
|
the close-group operator.
|
|
|
|
@cnindex RE_NO_BK_REFS
|
|
@item RE_NO_BK_REFS
|
|
If this bit is set, then Regex doesn't recognize @samp{\}@var{digit} as
|
|
the back reference operator; if this bit isn't set, then it does.
|
|
|
|
@cnindex RE_NO_BK_VBAR
|
|
@item RE_NO_BK_VBAR
|
|
If this bit is set, then @samp{|} represents the alternation operator;
|
|
if this bit isn't set, then @samp{\|} represents the alternation
|
|
operator. This bit is irrelevant if @code{RE_LIMITED_OPS} is set.
|
|
|
|
@cnindex RE_NO_EMPTY_RANGES
|
|
@item RE_NO_EMPTY_RANGES
|
|
If this bit is set, then a regular expression with a range whose ending
|
|
point collates lower than its starting point is invalid; if this bit
|
|
isn't set, then Regex considers such a range to be empty.
|
|
|
|
@cnindex RE_UNMATCHED_RIGHT_PAREN_ORD
|
|
@item RE_UNMATCHED_RIGHT_PAREN_ORD
|
|
If this bit is set and the regular expression has no matching open-group
|
|
operator, then Regex considers what would otherwise be a close-group
|
|
operator (based on how @code{RE_NO_BK_PARENS} is set) to match @samp{)}.
|
|
|
|
@end table
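
To see a syntax bit change what a character means, here is a minimal
sketch that compiles one pattern under a caller-supplied set of bits.
It assumes the @sc{gnu} interface as declared by the @sc{gnu} C
Library's @file{regex.h} when @code{_GNU_SOURCE} is defined, and it
assumes that @code{re_set_syntax} is the way to choose the syntax used
by the next call to @code{re_compile_pattern}; neither assumption comes
from this chapter.  The helper name @code{matches_whole} is
hypothetical.

@verbatim
```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   /* assumption: glibc hides the GNU interface otherwise */
#endif
#include <regex.h>
#include <string.h>

/* Hypothetical helper: return 1 if PATTERN, compiled under the syntax
   bits SYNTAX, matches all of TEXT; 0 if it does not; -1 if PATTERN is
   invalid under SYNTAX.  */
int
matches_whole (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *error;
  int length = (int) strlen (text);
  int matched;

  /* A zeroed buffer asks the compiler to allocate what it needs.  */
  memset (&buf, 0, sizeof buf);

  /* The syntax is global state, so restore it after compiling.  */
  saved = re_set_syntax (syntax);
  error = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (error != NULL)
    return -1;

  /* re_match reports how many characters matched, starting at 0.  */
  matched = re_match (&buf, text, length, 0, NULL);
  regfree (&buf);
  return matched == length;
}
```
@end verbatim

Under these assumptions, @code{matches_whole (RE_BK_PLUS_QM, "ca\\+r",
"caar")} returns 1, while the same call with the pattern @code{"ca+r"}
returns 0, because with @code{RE_BK_PLUS_QM} set an unquoted @samp{+} is
ordinary.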
|
|
|
|
|
|
@node Predefined Syntaxes, Collating Elements vs. Characters, Syntax Bits, Regular Expression Syntax
|
|
@section Predefined Syntaxes
|
|
|
|
If you're programming with Regex, you can set a pattern buffer's
|
|
(@pxref{GNU Pattern Buffers}, and @ref{POSIX Pattern Buffers})
|
|
@code{syntax} field either to an arbitrary combination of syntax bits
|
|
(@pxref{Syntax Bits}) or else to one of the configurations defined by Regex.
|
|
These configurations define the syntaxes used by certain
|
|
programs---@sc{gnu} Emacs,
|
|
@cindex Emacs
|
|
@sc{posix} Awk,
|
|
@cindex POSIX Awk
|
|
traditional Awk,
|
|
@cindex Awk
|
|
Grep,
|
|
@cindex Grep
|
|
@cindex Egrep
|
|
Egrep---in addition to syntaxes for @sc{posix} basic and extended
|
|
regular expressions.
|
|
|
|
The predefined syntaxes---taken directly from @file{regex.h}---are:
|
|
|
|
@example
#define RE_SYNTAX_EMACS 0

#define RE_SYNTAX_AWK \
  (RE_BACKSLASH_ESCAPE_IN_LISTS | RE_DOT_NOT_NULL \
   | RE_NO_BK_PARENS | RE_NO_BK_REFS \
   | RE_NO_BK_VBAR | RE_NO_EMPTY_RANGES \
   | RE_UNMATCHED_RIGHT_PAREN_ORD)

#define RE_SYNTAX_POSIX_AWK \
  (RE_SYNTAX_POSIX_EXTENDED | RE_BACKSLASH_ESCAPE_IN_LISTS)

#define RE_SYNTAX_GREP \
  (RE_BK_PLUS_QM | RE_CHAR_CLASSES \
   | RE_HAT_LISTS_NOT_NEWLINE | RE_INTERVALS \
   | RE_NEWLINE_ALT)

#define RE_SYNTAX_EGREP \
  (RE_CHAR_CLASSES | RE_CONTEXT_INDEP_ANCHORS \
   | RE_CONTEXT_INDEP_OPS | RE_HAT_LISTS_NOT_NEWLINE \
   | RE_NEWLINE_ALT | RE_NO_BK_PARENS \
   | RE_NO_BK_VBAR)

#define RE_SYNTAX_POSIX_EGREP \
  (RE_SYNTAX_EGREP | RE_INTERVALS | RE_NO_BK_BRACES)

/* P1003.2/D11.2, section 4.20.7.1, lines 5078ff. */
#define RE_SYNTAX_ED RE_SYNTAX_POSIX_BASIC

#define RE_SYNTAX_SED RE_SYNTAX_POSIX_BASIC

/* Syntax bits common to both basic and extended POSIX regex syntax. */
#define _RE_SYNTAX_POSIX_COMMON \
  (RE_CHAR_CLASSES | RE_DOT_NEWLINE | RE_DOT_NOT_NULL \
   | RE_INTERVALS | RE_NO_EMPTY_RANGES)

#define RE_SYNTAX_POSIX_BASIC \
  (_RE_SYNTAX_POSIX_COMMON | RE_BK_PLUS_QM)

/* Differs from ..._POSIX_BASIC only in that RE_BK_PLUS_QM becomes
   RE_LIMITED_OPS, i.e., \? \+ \| are not recognized. Actually, this
   isn't minimal, since other operators, such as \`, aren't disabled. */
#define RE_SYNTAX_POSIX_MINIMAL_BASIC \
  (_RE_SYNTAX_POSIX_COMMON | RE_LIMITED_OPS)

#define RE_SYNTAX_POSIX_EXTENDED \
  (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
   | RE_CONTEXT_INDEP_OPS | RE_NO_BK_BRACES \
   | RE_NO_BK_PARENS | RE_NO_BK_VBAR \
   | RE_UNMATCHED_RIGHT_PAREN_ORD)

/* Differs from ..._POSIX_EXTENDED in that RE_CONTEXT_INVALID_OPS
   replaces RE_CONTEXT_INDEP_OPS and RE_NO_BK_REFS is added. */
#define RE_SYNTAX_POSIX_MINIMAL_EXTENDED \
  (_RE_SYNTAX_POSIX_COMMON | RE_CONTEXT_INDEP_ANCHORS \
   | RE_CONTEXT_INVALID_OPS | RE_NO_BK_BRACES \
   | RE_NO_BK_PARENS | RE_NO_BK_REFS \
   | RE_NO_BK_VBAR | RE_UNMATCHED_RIGHT_PAREN_ORD)
@end example
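
One way to observe the difference between two predefined syntaxes
without touching the bits directly is through the @sc{posix} interface.
The sketch below assumes that @code{regcomp} selects
@code{RE_SYNTAX_POSIX_EXTENDED} when given @code{REG_EXTENDED} and
@code{RE_SYNTAX_POSIX_BASIC} otherwise (@pxref{POSIX Regular Expression
Compiling}); the helper name @code{posix_search} is ours.

@verbatim
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if PATTERN, compiled with the regcomp
   flags CFLAGS, matches somewhere in TEXT; 0 if it does not; -1 if the
   pattern does not compile.  */
int
posix_search (const char *pattern, const char *text, int cflags)
{
  regex_t compiled;
  int status;

  /* REG_NOSUB: we only want a yes/no answer, not match offsets.  */
  if (regcomp (&compiled, pattern, cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```
@end verbatim

Since only the extended syntax sets @code{RE_NO_BK_PARENS}, the pattern
@samp{^a(b)c$} finds @samp{abc} under @code{REG_EXTENDED}; under the
basic syntax the parentheses are ordinary, so the same pattern finds
@samp{a(b)c} instead, and grouping is written @samp{\(b\)}.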
|
|
|
|
@node Collating Elements vs. Characters, The Backslash Character, Predefined Syntaxes, Regular Expression Syntax
|
|
@section Collating Elements vs.@: Characters
|
|
|
|
@sc{posix} generalizes the notion of a character to that of a
|
|
collating element. It defines a @dfn{collating element} to be ``a
|
|
sequence of one or more bytes defined in the current collating sequence
|
|
as a unit of collation.''
|
|
|
|
This generalizes the notion of a character in
|
|
two ways. First, a single character can map into two or more collating
|
|
elements. For example, the German
|
|
@tex
|
|
`\ss'
|
|
@end tex
|
|
@ifinfo
|
|
``es-zet''
|
|
@end ifinfo
|
|
collates as the collating element @samp{s} followed by another collating
|
|
element @samp{s}. Second, two or more characters can map into one
|
|
collating element. For example, the Spanish @samp{ll} collates after
|
|
@samp{l} and before @samp{m}.
|
|
|
|
Since @sc{posix}'s ``collating element'' preserves the essential idea of
|
|
a ``character,'' we use the latter, more familiar, term in this document.
|
|
|
|
@node The Backslash Character, , Collating Elements vs. Characters, Regular Expression Syntax
|
|
@section The Backslash Character
|
|
|
|
@cindex \
|
|
The @samp{\} character has one of four different meanings, depending on
|
|
the context in which you use it and what syntax bits are set
|
|
(@pxref{Syntax Bits}). It can: 1) stand for itself, 2) quote the next
|
|
character, 3) introduce an operator, or 4) do nothing.
|
|
|
|
@enumerate
|
|
@item
|
|
It stands for itself inside a list
|
|
(@pxref{List Operators}) if the syntax bit
|
|
@code{RE_BACKSLASH_ESCAPE_IN_LISTS} is not set. For example, @samp{[\]}
|
|
would match @samp{\}.
|
|
|
|
@item
|
|
It quotes (makes ordinary, if it's special) the next character when you
|
|
use it either:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
outside a list,@footnote{Sometimes
|
|
you don't have to explicitly quote special characters to make
|
|
them ordinary. For instance, most characters lose any special meaning
|
|
inside a list (@pxref{List Operators}). In addition, if the syntax bits
|
|
@code{RE_CONTEXT_INVALID_OPS} and @code{RE_CONTEXT_INDEP_OPS}
|
|
aren't set, then (for historical reasons) the matcher considers special
|
|
characters ordinary if they are in contexts where the operations they
|
|
represent make no sense; for example, then the match-zero-or-more
|
|
operator (represented by @samp{*}) matches itself in the regular
|
|
expression @samp{*foo} because there is no preceding expression on which
|
|
it can operate. It is poor practice, however, to depend on this
|
|
behavior; if you want a special character to be ordinary outside a list,
|
|
it's better to always quote it, regardless.} or
|
|
|
|
@item
|
|
inside a list and the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is set.
|
|
|
|
@end itemize
|
|
|
|
@item
|
|
It introduces an operator when followed by certain ordinary
|
|
characters---sometimes only when certain syntax bits are set. See the
|
|
cases @code{RE_BK_PLUS_QM}, @code{RE_NO_BK_BRACES}, @code{RE_NO_BK_VBAR},
@code{RE_NO_BK_PARENS}, @code{RE_NO_BK_REFS} in @ref{Syntax Bits}. Also:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@samp{\b} represents the match-word-boundary operator
|
|
(@pxref{Match-word-boundary Operator}).
|
|
|
|
@item
|
|
@samp{\B} represents the match-within-word operator
|
|
(@pxref{Match-within-word Operator}).
|
|
|
|
@item
|
|
@samp{\<} represents the match-beginning-of-word operator @*
|
|
(@pxref{Match-beginning-of-word Operator}).
|
|
|
|
@item
|
|
@samp{\>} represents the match-end-of-word operator
|
|
(@pxref{Match-end-of-word Operator}).
|
|
|
|
@item
|
|
@samp{\w} represents the match-word-constituent operator
|
|
(@pxref{Match-word-constituent Operator}).
|
|
|
|
@item
|
|
@samp{\W} represents the match-non-word-constituent operator
|
|
(@pxref{Match-non-word-constituent Operator}).
|
|
|
|
@item
|
|
@samp{\`} represents the match-beginning-of-buffer
|
|
operator and @samp{\'} represents the match-end-of-buffer operator
|
|
(@pxref{Buffer Operators}).
|
|
|
|
@item
|
|
If Regex was compiled with the C preprocessor symbol @code{emacs}
|
|
defined, then @samp{\s@var{class}} represents the match-syntactic-class
|
|
operator and @samp{\S@var{class}} represents the
|
|
match-not-syntactic-class operator (@pxref{Syntactic Class Operators}).
|
|
|
|
@end itemize
|
|
|
|
@item
|
|
In all other cases, Regex ignores @samp{\}. For example,
|
|
@samp{\n} matches @samp{n}.
|
|
|
|
@end enumerate
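
The first two cases can be checked with a short sketch.  It assumes the
@sc{posix} basic syntax, in which @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is
not set (@pxref{Predefined Syntaxes}); the helper name
@code{basic_search} is hypothetical.  Remember that a C string literal
needs two backslashes to put one @samp{\} into the pattern.

@verbatim
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if PATTERN, compiled as a POSIX basic
   regexp, matches somewhere in TEXT; 0 if not; -1 if the pattern does
   not compile.  */
int
basic_search (const char *pattern, const char *text)
{
  regex_t compiled;
  int status;

  /* No REG_EXTENDED flag, so regcomp uses the basic syntax.  */
  if (regcomp (&compiled, pattern, REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```
@end verbatim

With this helper, the C literal @code{"^[\\]$"} (the regexp
@samp{^[\]$}) finds a string consisting of a single backslash, because
@samp{\} stands for itself inside the list.  The literal
@code{"^a\\.b$"} finds @samp{a.b} but not @samp{axb}, because outside a
list @samp{\} quotes the following @samp{.}.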
|
|
|
|
@node Common Operators, GNU Operators, Regular Expression Syntax, Top
|
|
@chapter Common Operators
|
|
|
|
You compose regular expressions from operators. In the following
|
|
sections, we describe the regular expression operators specified by
|
|
@sc{posix}; @sc{gnu} also uses these. Most operators have more than one
|
|
representation as characters. @xref{Regular Expression Syntax}, for
|
|
what characters represent what operators under what circumstances.
|
|
|
|
For most operators that can be represented in two ways, one
|
|
representation is a single character and the other is that character
|
|
preceded by @samp{\}. For example, either @samp{(} or @samp{\(}
|
|
represents the open-group operator. Which one does depends on the
|
|
setting of a syntax bit, in this case @code{RE_NO_BK_PARENS}. Why is
|
|
this so? Historical reasons dictate some of the varying
|
|
representations, while @sc{posix} dictates others.
|
|
|
|
Finally, almost all characters lose any special meaning inside a list
|
|
(@pxref{List Operators}).
|
|
|
|
@menu
|
|
* Match-self Operator:: Ordinary characters.
|
|
* Match-any-character Operator:: .
|
|
* Concatenation Operator:: Juxtaposition.
|
|
* Repetition Operators:: * + ? @{@}
|
|
* Alternation Operator:: |
|
|
* List Operators:: [...] [^...]
|
|
* Grouping Operators:: (...)
|
|
* Back-reference Operator:: \digit
|
|
* Anchoring Operators:: ^ $
|
|
@end menu
|
|
|
|
@node Match-self Operator, Match-any-character Operator, , Common Operators
|
|
@section The Match-self Operator (@var{ordinary character})
|
|
|
|
This operator matches the character itself. All ordinary characters
|
|
(@pxref{Regular Expression Syntax}) represent this operator. For
|
|
example, @samp{f} is always an ordinary character, so the regular
|
|
expression @samp{f} matches only the string @samp{f}. In
|
|
particular, it does @emph{not} match the string @samp{ff}.
|
|
|
|
@node Match-any-character Operator, Concatenation Operator, Match-self Operator, Common Operators
|
|
@section The Match-any-character Operator (@code{.})
|
|
|
|
@cindex @samp{.}
|
|
|
|
This operator matches any single printing or nonprinting character
|
|
except it won't match a:
|
|
|
|
@table @asis
|
|
@item newline
|
|
if the syntax bit @code{RE_DOT_NEWLINE} isn't set.
|
|
|
|
@item null
|
|
if the syntax bit @code{RE_DOT_NOT_NULL} is set.
|
|
|
|
@end table
|
|
|
|
The @samp{.} (period) character represents this operator. For example,
|
|
@samp{a.b} matches any three-character string beginning with @samp{a}
|
|
and ending with @samp{b}.
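
The effect of @code{RE_DOT_NEWLINE} can be observed through the
@sc{posix} interface.  This sketch assumes that the @sc{posix} syntaxes
set that bit (@pxref{Predefined Syntaxes}) and that the @code{regcomp}
flag @code{REG_NEWLINE} clears it, among its other effects; the helper
name @code{a_dot_b_matches} is ours.

@verbatim
```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the regexp "^a.b$" matches TEXT when
   compiled with the regcomp flags CFLAGS, 0 if it does not, and -1 if
   compilation fails.  */
int
a_dot_b_matches (const char *text, int cflags)
{
  regex_t compiled;
  int status;

  if (regcomp (&compiled, "^a.b$", cflags | REG_NOSUB) != 0)
    return -1;
  status = regexec (&compiled, text, 0, NULL, 0);
  regfree (&compiled);
  return status == 0;
}
```
@end verbatim

Under these assumptions @samp{axb} matches with either setting, whereas
the three-character string made of @samp{a}, a newline and @samp{b}
matches only when @code{REG_NEWLINE} is not given.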
|
|
|
|
@node Concatenation Operator, Repetition Operators, Match-any-character Operator, Common Operators
|
|
@section The Concatenation Operator
|
|
|
|
This operator concatenates two regular expressions @var{a} and @var{b}.
|
|
No character represents this operator; you simply put @var{b} after
|
|
@var{a}. The result is a regular expression that will match a string if
|
|
@var{a} matches its first part and @var{b} matches the rest. For
|
|
example, @samp{xy} (two match-self operators) matches @samp{xy}.
|
|
|
|
@node Repetition Operators, Alternation Operator, Concatenation Operator, Common Operators
|
|
@section Repetition Operators
|
|
|
|
Repetition operators repeat the preceding regular expression a specified
|
|
number of times.
|
|
|
|
@menu
|
|
* Match-zero-or-more Operator:: *
|
|
* Match-one-or-more Operator:: +
|
|
* Match-zero-or-one Operator:: ?
|
|
* Interval Operators:: @{@}
|
|
@end menu
|
|
|
|
@node Match-zero-or-more Operator, Match-one-or-more Operator, , Repetition Operators
|
|
@subsection The Match-zero-or-more Operator (@code{*})
|
|
|
|
@cindex @samp{*}
|
|
|
|
This operator repeats the smallest possible preceding regular expression
|
|
as many times as necessary (including zero) to match the pattern.
|
|
@samp{*} represents this operator. For example, @samp{o*}
|
|
matches any string made up of zero or more @samp{o}s. Since this
|
|
operator operates on the smallest preceding regular expression,
|
|
@samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}. So,
|
|
@samp{fo*} matches @samp{f}, @samp{fo}, @samp{foo}, and so on.
|
|
|
|
Since the match-zero-or-more operator is a suffix operator, it may be
|
|
useless as such when no regular expression precedes it. This is the
|
|
case when it:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
is first in a regular expression, or
|
|
|
|
@item
|
|
follows a match-beginning-of-line, open-group, or alternation
|
|
operator.
|
|
|
|
@end itemize
|
|
|
|
@noindent
|
|
Three different things can happen in these cases:
|
|
|
|
@enumerate
|
|
@item
|
|
If the syntax bit @code{RE_CONTEXT_INVALID_OPS} is set, then the
|
|
regular expression is invalid.
|
|
|
|
@item
|
|
If @code{RE_CONTEXT_INVALID_OPS} isn't set, but
|
|
@code{RE_CONTEXT_INDEP_OPS} is, then @samp{*} represents the
|
|
match-zero-or-more operator (which then operates on the empty string).
|
|
|
|
@item
|
|
Otherwise, @samp{*} is ordinary.
|
|
|
|
@end enumerate
|
|
|
|
@cindex backtracking
|
|
The matcher processes a match-zero-or-more operator by first matching as
|
|
many repetitions of the smallest preceding regular expression as it can.
|
|
Then it continues to match the rest of the pattern.
|
|
|
|
If it can't match the rest of the pattern, it backtracks (as many times
|
|
as necessary), each time discarding one of the matches until it can
|
|
either match the entire pattern or be certain that it cannot get a
|
|
match. For example, when matching @samp{ca*ar} against @samp{caaar},
|
|
the matcher first matches all three @samp{a}s of the string with the
|
|
@samp{a*} of the regular expression. However, it cannot then match the
|
|
final @samp{ar} of the regular expression against the final @samp{r} of
|
|
the string. So it backtracks, discarding the match of the last @samp{a}
|
|
in the string. It can then match the remaining @samp{ar}.
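
The outcome of the @samp{ca*ar} example can be checked by asking for the
offsets of the overall match.  The following sketch uses the @sc{posix}
functions and the @code{regmatch_t} type (@pxref{Using Byte Offsets});
the helper name @code{match_span} is hypothetical.  An implementation
may use a strategy other than backtracking internally, but the reported
match should be the one described above.

@verbatim
```c
#include <regex.h>

/* Hypothetical helper: search TEXT for the POSIX extended regexp
   PATTERN.  On success, store the byte offsets of the overall match in
   *START and *END and return 0; return -1 if there is no match or the
   pattern does not compile.  */
int
match_span (const char *pattern, const char *text, int *start, int *end)
{
  regex_t compiled;
  regmatch_t whole[1];
  int status;

  if (regcomp (&compiled, pattern, REG_EXTENDED) != 0)
    return -1;
  status = regexec (&compiled, text, 1, whole, 0);
  regfree (&compiled);
  if (status != 0)
    return -1;

  /* Element 0 of the array describes the match of the whole pattern.  */
  *start = (int) whole[0].rm_so;
  *end = (int) whole[0].rm_eo;
  return 0;
}
```
@end verbatim

For @samp{ca*ar} and @samp{caaar} the sketch reports the span from
offset 0 to offset 5: @samp{a*} ends up consuming two of the three
@samp{a}s, leaving the last one for the final @samp{ar}.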
|
|
|
|
|
|
@node Match-one-or-more Operator, Match-zero-or-one Operator, Match-zero-or-more Operator, Repetition Operators
|
|
@subsection The Match-one-or-more Operator (@code{+} or @code{\+})
|
|
|
|
@cindex @samp{+}
|
|
|
|
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't recognize
|
|
this operator. Otherwise, if the syntax bit @code{RE_BK_PLUS_QM} isn't
|
|
set, then @samp{+} represents this operator; if it is, then @samp{\+}
|
|
does.
|
|
|
|
This operator is similar to the match-zero-or-more operator except that
|
|
it repeats the preceding regular expression at least once;
|
|
@pxref{Match-zero-or-more Operator}, for what it operates on, how some
|
|
syntax bits affect it, and how Regex backtracks to match it.
|
|
|
|
For example, supposing that @samp{+} represents the match-one-or-more
|
|
operator; then @samp{ca+r} matches, e.g., @samp{car} and
|
|
@samp{caaaar}, but not @samp{cr}.
|
|
|
|
@node Match-zero-or-one Operator, Interval Operators, Match-one-or-more Operator, Repetition Operators
|
|
@subsection The Match-zero-or-one Operator (@code{?} or @code{\?})
|
|
@cindex @samp{?}
|
|
|
|
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
|
|
recognize this operator. Otherwise, if the syntax bit
|
|
@code{RE_BK_PLUS_QM} isn't set, then @samp{?} represents this operator;
|
|
if it is, then @samp{\?} does.
|
|
|
|
This operator is similar to the match-zero-or-more operator except that
|
|
it repeats the preceding regular expression once or not at all;
|
|
@pxref{Match-zero-or-more Operator}, to see what it operates on, how
|
|
some syntax bits affect it, and how Regex backtracks to match it.
|
|
|
|
For example, supposing that @samp{?} represents the match-zero-or-one
|
|
operator; then @samp{ca?r} matches both @samp{car} and @samp{cr}, but
|
|
nothing else.
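The examples given for this operator and for the match-one-or-more operator can be checked with a short sketch. It assumes the @sc{posix} interface with @code{REG_EXTENDED}, a syntax in which @samp{+} and @samp{?} are operators; @code{whole_match} is a hypothetical helper, not a Regex function.

```c
#include <assert.h>
#include <regex.h>

/* Return 1 if the extended regular expression PATTERN matches all of
   TEXT, and 0 otherwise (including when PATTERN does not compile).
   The check relies on regexec reporting the longest leftmost match.  */
int
whole_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int ok = 0;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return 0;
  if (regexec (&re, text, 1, m, 0) == 0)
    ok = m[0].rm_so == 0 && text[m[0].rm_eo] == '\0';
  regfree (&re);
  return ok;
}
```

With this helper, @samp{ca+r} should accept @samp{car} and @samp{caaaar} but reject @samp{cr}, while @samp{ca?r} should accept only @samp{car} and @samp{cr}.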
|
|
|
|
@node Interval Operators, , Match-zero-or-one Operator, Repetition Operators
|
|
@subsection Interval Operators (@code{@{} @dots{} @code{@}} or @code{\@{} @dots{} @code{\@}})
|
|
|
|
@cindex interval expression
|
|
@cindex @samp{@{}
|
|
@cindex @samp{@}}
|
|
@cindex @samp{\@{}
|
|
@cindex @samp{\@}}
|
|
|
|
If the syntax bit @code{RE_INTERVALS} is set, then Regex recognizes
|
|
@dfn{interval expressions}. They repeat the smallest possible preceding
|
|
regular expression a specified number of times.
|
|
|
|
If the syntax bit @code{RE_NO_BK_BRACES} is set, @samp{@{} represents
|
|
the @dfn{open-interval operator} and @samp{@}} represents the
|
|
@dfn{close-interval operator}; otherwise, @samp{\@{} and @samp{\@}} do.
|
|
|
|
Specifically, supposing that @samp{@{} and @samp{@}} represent the
|
|
open-interval and close-interval operators; then:
|
|
|
|
@table @code
|
|
@item @{@var{count}@}
|
|
matches exactly @var{count} occurrences of the preceding regular
|
|
expression.
|
|
|
|
@item @{@var{min},@}
|
|
matches @var{min} or more occurrences of the preceding regular
|
|
expression.
|
|
|
|
@item @{@var{min},@var{max}@}
|
|
matches at least @var{min} but no more than @var{max} occurrences of
|
|
the preceding regular expression.
|
|
|
|
@end table
|
|
|
|
The interval expression (but not necessarily the regular expression that
|
|
contains it) is invalid if:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
@var{min} is greater than @var{max}, or
|
|
|
|
@item
|
|
any of @var{count}, @var{min}, or @var{max} is outside the range
|
|
zero to @code{RE_DUP_MAX} (which symbol @file{regex.h}
|
|
defines).
|
|
|
|
@end itemize
|
|
|
|
If the interval expression is invalid and the syntax bit
|
|
@code{RE_NO_BK_BRACES} is set, then Regex considers all the
|
|
characters in the would-be interval to be ordinary. If that bit
|
|
isn't set, then the regular expression is invalid.
|
|
|
|
If the interval expression is valid but there is no preceding regular
|
|
expression on which to operate, then if the syntax bit
|
|
@code{RE_CONTEXT_INVALID_OPS} is set, the regular expression is invalid.
|
|
If that bit isn't set, then Regex considers all the characters---other
|
|
than backslashes, which it ignores---in the would-be interval to be
|
|
ordinary.
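The three valid interval forms can be illustrated with a short sketch. It assumes the @sc{posix} interface with @code{REG_EXTENDED}, where unescaped braces are the interval operators; @code{interval_match} is a hypothetical helper, and only valid intervals are exercised here.

```c
#include <assert.h>
#include <regex.h>

/* Return 1 if the extended regular expression PATTERN matches all of
   TEXT, 0 if it does not, and -1 if PATTERN does not compile.
   (Hypothetical helper; relies on leftmost-longest matching.)  */
int
interval_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int ok = 0;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, m, 0) == 0)
    ok = m[0].rm_so == 0 && text[m[0].rm_eo] == '\0';
  regfree (&re);
  return ok;
}
```

For instance, @samp{a@{2,3@}} should accept @samp{aa} and @samp{aaa} but neither @samp{a} nor @samp{aaaa}.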
|
|
|
|
|
|
@node Alternation Operator, List Operators, Repetition Operators, Common Operators
|
|
@section The Alternation Operator (@code{|} or @code{\|})
|
|
|
|
@kindex |
|
|
@kindex \|
|
|
@cindex alternation operator
|
|
@cindex or operator
|
|
|
|
If the syntax bit @code{RE_LIMITED_OPS} is set, then Regex doesn't
|
|
recognize this operator. Otherwise, if the syntax bit
|
|
@code{RE_NO_BK_VBAR} is set, then @samp{|} represents this operator;
|
|
otherwise, @samp{\|} does.
|
|
|
|
Alternatives match one of a choice of regular expressions:
|
|
if you put the character(s) representing the alternation operator between
|
|
any two regular expressions @var{a} and @var{b}, the result matches
|
|
the union of the strings that @var{a} and @var{b} match. For
|
|
example, supposing that @samp{|} is the alternation operator, then
|
|
@samp{foo|bar|quux} would match any of @samp{foo}, @samp{bar} or
|
|
@samp{quux}.
|
|
|
|
@ignore
|
|
@c Nobody needs to disallow empty alternatives any more.
|
|
If the syntax bit @code{RE_NO_EMPTY_ALTS} is set, then if either of the regular
|
|
expressions @var{a} or @var{b} is empty, the
|
|
regular expression is invalid. More precisely, if this syntax bit is
|
|
set, then the alternation operator can't:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
be first or last in a regular expression;
|
|
|
|
@item
|
|
follow either another alternation operator or an open-group operator
|
|
(@pxref{Grouping Operators}); or
|
|
|
|
@item
|
|
precede a close-group operator.
|
|
|
|
@end itemize
|
|
|
|
@noindent
|
|
For example, supposing @samp{(} and @samp{)} represent the open and
|
|
close-group operators, then @samp{|foo}, @samp{foo|}, @samp{foo||bar},
|
|
@samp{foo(|bar)}, and @samp{(foo|)bar} would all be invalid.
|
|
@end ignore
|
|
|
|
The alternation operator operates on the @emph{largest} possible
|
|
surrounding regular expressions. (Put another way, it has the lowest
|
|
precedence of any regular expression operator.)
|
|
Thus, the only way you can
|
|
delimit its arguments is to use grouping. For example, if @samp{(} and
|
|
@samp{)} are the open and close-group operators, then @samp{fo(o|b)ar}
|
|
would match either @samp{fooar} or @samp{fobar}. (@samp{foo|bar} would
|
|
match @samp{foo} or @samp{bar}.)
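The effect of grouping on the alternation operator's arguments can be checked with a short sketch. It assumes the @sc{posix} interface with @code{REG_EXTENDED}, where @samp{|}, @samp{(} and @samp{)} are operators; @code{alt_whole_match} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return 1 if the extended regular expression PATTERN matches all of
   TEXT, and 0 otherwise.  (Hypothetical helper; the check relies on
   regexec reporting the longest leftmost match.)  */
int
alt_whole_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int ok = 0;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return 0;
  if (regexec (&re, text, 1, m, 0) == 0)
    ok = m[0].rm_so == 0 && text[m[0].rm_eo] == '\0';
  regfree (&re);
  return ok;
}
```

Without the group, the alternation splits the whole pattern, so @samp{foo|bar} accepts @samp{foo} but not @samp{fooar}; with the group, @samp{fo(o|b)ar} accepts @samp{fooar} but not @samp{foo}.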
|
|
|
|
@cindex backtracking
|
|
The matcher usually tries all combinations of alternatives so as to
|
|
match the longest possible string. For example, when matching
|
|
@samp{(fooq|foo)*(qbarquux|bar)} against @samp{fooqbarquux}, it cannot
|
|
take, say, the first (``depth-first'') combination it could match, since
|
|
then it would be content to match just @samp{fooqbar}.
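The longest-match behavior in this example can be observed with a short sketch. It assumes a @sc{posix}-conforming @code{regexec}, which reports the longest of the leftmost matches; @code{longest_end} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return the end offset of the match that regexec reports for the
   extended regular expression PATTERN in TEXT, or -1 if there is no
   match or PATTERN does not compile.  (Hypothetical helper.)  */
int
longest_end (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int end = -1;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, m, 0) == 0)
    end = (int) m[0].rm_eo;
  regfree (&re);
  return end;
}
```

For the pattern and string above, a conforming matcher should report an end offset of 11 (all of @samp{fooqbarquux}) rather than 7 (just @samp{fooqbar}).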
|
|
|
|
@comment xx something about leftmost-longest
|
|
|
|
|
|
@node List Operators, Grouping Operators, Alternation Operator, Common Operators
|
|
@section List Operators (@code{[} @dots{} @code{]} and @code{[^} @dots{} @code{]})
|
|
|
|
@cindex matching list
|
|
@cindex @samp{[}
|
|
@cindex @samp{]}
|
|
@cindex @samp{^}
|
|
@cindex @samp{-}
|
|
@cindex @samp{\}
|
|
@cindex @samp{[^}
|
|
@cindex nonmatching list
|
|
@cindex matching newline
|
|
@cindex bracket expression
|
|
|
|
@dfn{Lists}, also called @dfn{bracket expressions}, are a set of one or
|
|
more items. An @dfn{item} is a character,
|
|
@ignore
|
|
(These get added when they get implemented.)
|
|
a collating symbol, an equivalence class expression,
|
|
@end ignore
|
|
a character class expression, or a range expression. The syntax bits
|
|
affect which kinds of items you can put in a list. We explain the last
|
|
two items in subsections below. Empty lists are invalid.
|
|
|
|
A @dfn{matching list} matches a single character represented by one of
|
|
the list items. You form a matching list by enclosing one or more items
|
|
within an @dfn{open-matching-list operator} (represented by @samp{[})
|
|
and a @dfn{close-list operator} (represented by @samp{]}).
|
|
|
|
For example, @samp{[ab]} matches either @samp{a} or @samp{b}.
|
|
@samp{[ad]*} matches the empty string and any string composed of just
|
|
@samp{a}s and @samp{d}s in any order. Regex considers invalid a regular
|
|
expression with a @samp{[} but no matching
|
|
@samp{]}.
|
|
|
|
@dfn{Nonmatching lists} are similar to matching lists except that they
|
|
match a single character @emph{not} represented by one of the list
|
|
items. You use an @dfn{open-nonmatching-list operator} (represented by
|
|
@samp{[^}@footnote{Regex therefore doesn't consider the @samp{^} to be
|
|
the first character in the list. If you put a @samp{^} character first
|
|
in (what you think is) a matching list, you'll turn it into a
|
|
nonmatching list.}) instead of an open-matching-list operator to start a
|
|
nonmatching list.
|
|
|
|
For example, @samp{[^ab]} matches any character except @samp{a} or
|
|
@samp{b}.
|
|
|
|
If the @code{posix_newline} field in the pattern buffer (@pxref{GNU
|
|
Pattern Buffers}) is set, then nonmatching lists do not match a newline.
|
|
|
|
Most characters lose any special meaning inside a list. The special
|
|
characters inside a list follow.
|
|
|
|
@table @samp
|
|
@item ]
|
|
ends the list if it's not the first list item. So, if you want to make
|
|
the @samp{]} character a list item, you must put it first.
|
|
|
|
@item \
|
|
quotes the next character if the syntax bit @code{RE_BACKSLASH_ESCAPE_IN_LISTS} is
|
|
set.
|
|
|
|
@ignore
|
|
Put these in if they get implemented.
|
|
|
|
@item [.
|
|
represents the open-collating-symbol operator (@pxref{Collating Symbol
|
|
Operators}).
|
|
|
|
@item .]
|
|
represents the close-collating-symbol operator.
|
|
|
|
@item [=
|
|
represents the open-equivalence-class operator (@pxref{Equivalence Class
|
|
Operators}).
|
|
|
|
@item =]
|
|
represents the close-equivalence-class operator.
|
|
|
|
@end ignore
|
|
|
|
@item [:
|
|
represents the open-character-class operator (@pxref{Character Class
|
|
Operators}) if the syntax bit @code{RE_CHAR_CLASSES} is set and what
|
|
follows is a valid character class expression.
|
|
|
|
@item :]
|
|
represents the close-character-class operator if the syntax bit
|
|
@code{RE_CHAR_CLASSES} is set and what precedes it is an
|
|
open-character-class operator followed by a valid character class name.
|
|
|
|
@item -
|
|
represents the range operator (@pxref{Range Operator}) if it's
|
|
not first or last in a list or the ending point of a range.
|
|
|
|
@end table
|
|
|
|
@noindent
|
|
All other characters are ordinary. For example, @samp{[.*]} matches
|
|
@samp{.} and @samp{*}.
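The list rules above can be exercised with a short sketch. It assumes the @sc{posix} interface with @code{REG_EXTENDED}; @code{list_whole_match} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return 1 if the extended regular expression PATTERN matches all of
   TEXT, and 0 otherwise.  (Hypothetical helper; the check relies on
   regexec reporting the longest leftmost match.)  */
int
list_whole_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int ok = 0;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return 0;
  if (regexec (&re, text, 1, m, 0) == 0)
    ok = m[0].rm_so == 0 && text[m[0].rm_eo] == '\0';
  regfree (&re);
  return ok;
}
```

Note in particular that @samp{[]a]} treats the leading @samp{]} as a list item, and that @samp{.} and @samp{*} are ordinary inside @samp{[.*]}.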
|
|
|
|
@menu
|
|
* Character Class Operators:: [:class:]
|
|
* Range Operator:: start-end
|
|
@end menu
|
|
|
|
@ignore
|
|
(If collating symbols and equivalence class expressions get implemented,
|
|
then add this.)
|
|
|
|
node Collating Symbol Operators
|
|
subsubsection Collating Symbol Operators (@code{[.} @dots{} @code{.]})
|
|
|
|
If the syntax bit @code{XX} is set, then you can represent
|
|
collating symbols inside lists. You form a @dfn{collating symbol} by
|
|
putting a collating element between an @dfn{open-collating-symbol
|
|
operator} and a @dfn{close-collating-symbol operator}. @samp{[.}
|
|
represents the open-collating-symbol operator and @samp{.]} represents
|
|
the close-collating-symbol operator. For example, if @samp{ll} is a
|
|
collating element, then @samp{[[.ll.]]} would match @samp{ll}.
|
|
|
|
node Equivalence Class Operators
|
|
subsubsection Equivalence Class Operators (@code{[=} @dots{} @code{=]})
|
|
@cindex equivalence class expression in regex
|
|
@cindex @samp{[=} in regex
|
|
@cindex @samp{=]} in regex
|
|
|
|
If the syntax bit @code{XX} is set, then Regex recognizes equivalence class
|
|
expressions inside lists. An @dfn{equivalence class expression} is a set
|
|
of collating elements which all belong to the same equivalence class.
|
|
You form an equivalence class expression by putting a collating
|
|
element between an @dfn{open-equivalence-class operator} and a
|
|
@dfn{close-equivalence-class operator}. @samp{[=} represents the
|
|
open-equivalence-class operator and @samp{=]} represents the
|
|
close-equivalence-class operator. For example, if @samp{a} and @samp{A}
|
|
were an equivalence class, then both @samp{[[=a=]]} and @samp{[[=A=]]}
|
|
would match both @samp{a} and @samp{A}. If the collating element in an
|
|
equivalence class expression isn't part of an equivalence class, then
|
|
the matcher considers the equivalence class expression to be a collating
|
|
symbol.
|
|
|
|
@end ignore
|
|
|
|
@node Character Class Operators, Range Operator, , List Operators
|
|
@subsection Character Class Operators (@code{[:} @dots{} @code{:]})
|
|
|
|
@cindex character classes
|
|
@cindex @samp{[:} in regex
|
|
@cindex @samp{:]} in regex
|
|
|
|
If the syntax bit @code{RE_CHAR_CLASSES} is set, then Regex
|
|
recognizes character class expressions inside lists. A @dfn{character
|
|
class expression} matches one character from a given class. You form a
|
|
character class expression by putting a character class name between an
|
|
@dfn{open-character-class operator} (represented by @samp{[:}) and a
|
|
@dfn{close-character-class operator} (represented by @samp{:]}). The
|
|
character class names and their meanings are:
|
|
|
|
@table @code
|
|
|
|
@item alnum
|
|
letters and digits
|
|
|
|
@item alpha
|
|
letters
|
|
|
|
@item blank
|
|
system-dependent; for @sc{gnu}, a space or tab
|
|
|
|
@item cntrl
|
|
control characters (in the @sc{ascii} encoding, code 0177 and codes
|
|
less than 040)
|
|
|
|
@item digit
|
|
digits
|
|
|
|
@item graph
|
|
same as @code{print} except omits space
|
|
|
|
@item lower
|
|
lowercase letters
|
|
|
|
@item print
|
|
printable characters (in the @sc{ascii} encoding, space through
|
|
tilde---codes 040 through 0176)
|
|
|
|
@item punct
|
|
neither control nor alphanumeric characters
|
|
|
|
@item space
|
|
space, horizontal tab, carriage return, newline, vertical tab, and form feed
|
|
|
|
@item upper
|
|
uppercase letters
|
|
|
|
@item xdigit
|
|
hexadecimal digits: @code{0}--@code{9}, @code{a}--@code{f}, @code{A}--@code{F}
|
|
|
|
@end table
|
|
|
|
@noindent
|
|
These correspond to the definitions in the C library's @file{<ctype.h>}
|
|
facility. For example, @samp{[:alpha:]} corresponds to the standard
|
|
facility @code{isalpha}. Regex recognizes character class expressions
|
|
only inside of lists; so @samp{[[:alpha:]]} matches any letter, but
|
|
@samp{[:alpha:]} outside of a bracket expression and not followed by a
|
|
repetition operator matches just itself.
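The correspondence with @file{<ctype.h>} can be checked directly. The following sketch assumes the @sc{posix} interface and the default @samp{C} locale; @code{alpha_class_agrees} is a hypothetical helper that compares @samp{[[:alpha:]]} with @code{isalpha} for every nonzero @sc{ascii} code.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if "[[:alpha:]]" matches exactly those one-character
   strings whose character satisfies isalpha, for codes 1 through 127,
   and 0 otherwise.  (Hypothetical helper for illustration.)  */
int
alpha_class_agrees (void)
{
  regex_t re;
  char s[2] = { 0, 0 };
  int c, agree = 1;

  if (regcomp (&re, "[[:alpha:]]", REG_NOSUB) != 0)
    return 0;
  for (c = 1; c < 128; c++)
    {
      int matched;

      s[0] = (char) c;
      matched = regexec (&re, s, 0, NULL, 0) == 0;
      if (matched != (isalpha (c) != 0))
        agree = 0;
    }
  regfree (&re);
  return agree;
}
```

The same comparison could be repeated for the other class names by substituting the matching @file{<ctype.h>} function.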
|
|
|
|
@node Range Operator, , Character Class Operators, List Operators
|
|
@subsection The Range Operator (@code{-})
|
|
|
|
Regex recognizes @dfn{range expressions} inside a list. They represent
|
|
those characters
|
|
that fall between two elements in the current collating sequence. You
|
|
form a range expression by putting a @dfn{range operator} between two
|
|
@ignore
|
|
(If these get implemented, then substitute this for ``characters.'')
|
|
of any of the following: characters, collating elements, collating symbols,
|
|
and equivalence class expressions. The starting point of the range and
|
|
the ending point of the range don't have to be the same kind of item,
|
|
e.g., the starting point could be a collating element and the ending
|
|
point could be an equivalence class expression. If a range's ending
|
|
point is an equivalence class, then all the collating elements in that
|
|
class will be in the range.
|
|
@end ignore
|
|
characters.@footnote{You can't use a character class for the starting
|
|
or ending point of a range, since a character class is not a single
|
|
character.} @samp{-} represents the range operator. For example,
|
|
@samp{a-f} within a list represents all the characters from @samp{a}
|
|
through @samp{f}
|
|
inclusively.
|
|
|
|
If the syntax bit @code{RE_NO_EMPTY_RANGES} is set, then if the range's
|
|
ending point collates less than its starting point, the range (and the
|
|
regular expression containing it) is invalid. For example, the regular
|
|
expression @samp{[z-a]} would be invalid. If this bit isn't set, then
|
|
Regex considers such a range to be empty.
|
|
|
|
Since @samp{-} represents the range operator, if you want to make a
|
|
@samp{-} character itself
|
|
a list item, you must do one of the following:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
Put the @samp{-} either first or last in the list.
|
|
|
|
@item
|
|
Include a range whose starting point collates strictly lower than
|
|
@samp{-} and whose ending point collates equal or higher. Unless a
|
|
range is the first item in a list, a @samp{-} can't be its starting
|
|
point, but @emph{can} be its ending point. That is because Regex
|
|
considers @samp{-} to be the range operator unless it is preceded by
|
|
another @samp{-}. For example, in the @sc{ascii} encoding, @samp{)},
|
|
@samp{*}, @samp{+}, @samp{,}, @samp{-}, @samp{.}, and @samp{/} are
|
|
contiguous characters in the collating sequence. You might think that
|
|
@samp{[)-+--/]} has two ranges: @samp{)-+} and @samp{--/}. Rather, it
|
|
has the ranges @samp{)-+} and @samp{+--}, plus the character @samp{/}, so
|
|
it matches, e.g., @samp{,}, not @samp{.}.
|
|
|
|
@item
|
|
Put a range whose starting point is @samp{-} first in the list.
|
|
|
|
@end itemize
|
|
|
|
For example, @samp{[-a-z]} matches a lowercase letter or a hyphen (in
|
|
English, in @sc{ascii}).
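The placement rules for @samp{-} and the treatment of a reversed range can be tried with a short sketch. It assumes the @sc{posix} interface in the @samp{C} locale, whose syntaxes set @code{RE_NO_EMPTY_RANGES}; @code{range_match} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return 1 if the extended regular expression PATTERN matches all of
   TEXT, 0 if it does not, and -1 if PATTERN does not compile.
   (Hypothetical helper; relies on leftmost-longest matching.)  */
int
range_match (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int ok = 0;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED) != 0)
    return -1;
  if (regexec (&re, text, 1, m, 0) == 0)
    ok = m[0].rm_so == 0 && text[m[0].rm_eo] == '\0';
  regfree (&re);
  return ok;
}
```

Here @samp{[-a-z]} and @samp{[a-z-]} should both accept a hyphen, while @samp{[z-a]} should be rejected at compile time.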
|
|
|
|
|
|
@node Grouping Operators, Back-reference Operator, List Operators, Common Operators
|
|
@section Grouping Operators (@code{(} @dots{} @code{)} or @code{\(} @dots{} @code{\)})
|
|
|
|
@kindex (
|
|
@kindex )
|
|
@kindex \(
|
|
@kindex \)
|
|
@cindex grouping
|
|
@cindex subexpressions
|
|
@cindex parenthesizing
|
|
|
|
A @dfn{group}, also known as a @dfn{subexpression}, consists of an
|
|
@dfn{open-group operator}, any number of other operators, and a
|
|
@dfn{close-group operator}. Regex treats this sequence as a unit, just
|
|
as mathematics and programming languages treat a parenthesized
|
|
expression as a unit.
|
|
|
|
Therefore, using @dfn{groups}, you can:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
delimit the argument(s) to an alternation operator (@pxref{Alternation
|
|
Operator}) or a repetition operator (@pxref{Repetition
|
|
Operators}).
|
|
|
|
@item
|
|
keep track of the indices of the substring that matched a given group.
|
|
@xref{Using Registers}, for a precise explanation.
|
|
This lets you:
|
|
|
|
@itemize @bullet
|
|
@item
|
|
use the back-reference operator (@pxref{Back-reference Operator}).
|
|
|
|
@item
|
|
use registers (@pxref{Using Registers}).
|
|
|
|
@end itemize
|
|
|
|
@end itemize
|
|
|
|
If the syntax bit @code{RE_NO_BK_PARENS} is set, then @samp{(} represents
|
|
the open-group operator and @samp{)} represents the
|
|
close-group operator; otherwise, @samp{\(} and @samp{\)} do.
|
|
|
|
If the syntax bit @code{RE_UNMATCHED_RIGHT_PAREN_ORD} is set and a
|
|
close-group operator has no matching open-group operator, then Regex
|
|
considers it to match @samp{)}.
|
|
|
|
|
|
@node Back-reference Operator, Anchoring Operators, Grouping Operators, Common Operators
|
|
@section The Back-reference Operator (@code{\}@var{digit})
|
|
|
|
@cindex back references
|
|
|
|
If the syntax bit @code{RE_NO_BK_REFS} isn't set, then Regex recognizes
|
|
back references. A back reference matches a specified preceding group.
|
|
The back reference operator is represented by @samp{\@var{digit}}
|
|
anywhere after the end of a regular expression's @w{@var{digit}-th}
|
|
group (@pxref{Grouping Operators}).
|
|
|
|
@var{digit} must be between @samp{1} and @samp{9}. The matcher assigns
|
|
numbers 1 through 9 to the first nine groups it encounters. By using
|
|
one of @samp{\1} through @samp{\9} after the corresponding group's
|
|
close-group operator, you can match a substring identical to the
|
|
one that the group does.
|
|
|
|
Back references match according to the following (in all examples below,
|
|
@samp{(} represents the open-group, @samp{)} the close-group, @samp{@{}
|
|
the open-interval and @samp{@}} the close-interval operator):
|
|
|
|
@itemize @bullet
|
|
@item
|
|
If the group matches a substring, the back reference matches an
|
|
identical substring. For example, @samp{(a)\1} matches @samp{aa} and
|
|
@samp{(bana)na\1bo\1} matches @samp{bananabanabobana}. Likewise,
|
|
@samp{(.*)\1} matches any (newline-free if the syntax bit
|
|
@code{RE_DOT_NEWLINE} isn't set) string that is composed of two
|
|
identical halves; the @samp{(.*)} matches the first half and the
|
|
@samp{\1} matches the second half.
|
|
|
|
@item
|
|
If the group matches more than once (as it might if followed
|
|
by, e.g., a repetition operator), then the back reference matches the
|
|
substring the group @emph{last} matched. For example,
|
|
@samp{((a*)b)*\1\2} matches @samp{aabababa}; first @w{group 1} (the
|
|
outer one) matches @samp{aab} and @w{group 2} (the inner one) matches
|
|
@samp{aa}. Then @w{group 1} matches @samp{ab} and @w{group 2} matches
|
|
@samp{a}. So, @samp{\1} matches @samp{ab} and @samp{\2} matches
|
|
@samp{a}.
|
|
|
|
@item
|
|
If the group doesn't participate in a match, i.e., it is part of an
|
|
alternative not taken or a repetition operator allows zero repetitions
|
|
of it, then the back reference makes the whole match fail. For example,
|
|
@samp{(one()|two())-and-(three\2|four\3)} matches @samp{one-and-three}
|
|
and @samp{two-and-four}, but not @samp{one-and-four} or
|
|
@samp{two-and-three}. To illustrate, if the pattern matches
|
|
@samp{one-and-}, then its @w{group 2} matches the empty string and its
|
|
@w{group 3} doesn't participate in the match. So, if it then matches
|
|
@samp{four}, then when it tries to back reference @w{group 3}---which it
|
|
will attempt to do because @samp{\3} follows the @samp{four}---the match
|
|
will fail because @w{group 3} didn't participate in the match.
|
|
|
|
@end itemize
|
|
|
|
You can use a back reference as an argument to a repetition operator. For
|
|
example, @samp{(a(b))\2*} matches @samp{a} followed by one or more
|
|
@samp{b}s. Similarly, @samp{(a(b))\2@{3@}} matches @samp{abbbb}.
|
|
|
|
If there is no preceding @w{@var{digit}-th} subexpression, the regular
|
|
expression is invalid.
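Several of the back-reference examples in this section can be tried with a short sketch. It assumes the @sc{posix} interface with basic regular expressions, so groups are written @samp{\(} @dots{} @samp{\)} and intervals @samp{\@{} @dots{} @samp{\@}}, and each pattern is anchored with @samp{^} and @samp{$} to test the whole string. @code{bre_match} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the basic regular expression PATTERN matches somewhere
   in TEXT, 0 if it does not, and -1 if PATTERN does not compile.
   (Hypothetical helper for illustration.)  */
int
bre_match (const char *pattern, const char *text)
{
  regex_t re;
  int ok;

  assert (pattern && text);
  if (regcomp (&re, pattern, 0) != 0)
    return -1;
  ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

A reference to a group that does not precede it, as in @samp{\(a\)\2}, should make compilation fail, which the helper reports as -1.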
|
|
|
|
|
|
@node Anchoring Operators, , Back-reference Operator, Common Operators
|
|
@section Anchoring Operators
|
|
|
|
@cindex anchoring
|
|
@cindex regexp anchoring
|
|
|
|
These operators can constrain a pattern to match only at the beginning or
|
|
end of the entire string or at the beginning or end of a line.
|
|
|
|
@menu
|
|
* Match-beginning-of-line Operator:: ^
|
|
* Match-end-of-line Operator:: $
|
|
@end menu
|
|
|
|
|
|
@node Match-beginning-of-line Operator, Match-end-of-line Operator, , Anchoring Operators
|
|
@subsection The Match-beginning-of-line Operator (@code{^})
|
|
|
|
@kindex ^
|
|
@cindex beginning-of-line operator
|
|
@cindex anchors
|
|
|
|
This operator can match the empty string either at the beginning of the
|
|
string or after a newline character. Thus, it is said to @dfn{anchor}
|
|
the pattern to the beginning of a line.
|
|
|
|
In the cases following, @samp{^} represents this operator. (Otherwise,
|
|
@samp{^} is ordinary.)
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
It (the @samp{^}) is first in the pattern, as in @samp{^foo}.
|
|
|
|
@cnindex RE_CONTEXT_INDEP_ANCHORS @r{(and @samp{^})}
|
|
@item
|
|
The syntax bit @code{RE_CONTEXT_INDEP_ANCHORS} is set, and it is outside
|
|
a bracket expression.
|
|
|
|
@cindex open-group operator and @samp{^}
|
|
@cindex alternation operator and @samp{^}
|
|
@item
|
|
It follows an open-group or alternation operator, as in @samp{a\(^b\)}
|
|
and @samp{a\|^b}. @xref{Grouping Operators}, and @ref{Alternation
|
|
Operator}.
|
|
|
|
@end itemize
|
|
|
|
These rules imply that some valid patterns containing @samp{^} cannot be
|
|
matched; for example, @samp{foo^bar} if @code{RE_CONTEXT_INDEP_ANCHORS}
|
|
is set.
|
|
|
|
@vindex not_bol @r{field in pattern buffer}
|
|
If the @code{not_bol} field is set in the pattern buffer (@pxref{GNU
|
|
Pattern Buffers}), then @samp{^} fails to match at the beginning of the
|
|
string. @xref{POSIX Matching}, for when you might find this useful.
|
|
|
|
@vindex newline_anchor @r{field in pattern buffer}
|
|
If the @code{newline_anchor} field isn't set in the pattern buffer, then
|
|
@samp{^} fails to match after a newline. This is useful when you do not
|
|
regard the string to be matched as broken into lines.
|
|
|
|
|
|
@node Match-end-of-line Operator, , Match-beginning-of-line Operator, Anchoring Operators
|
|
@subsection The Match-end-of-line Operator (@code{$})
|
|
|
|
@kindex $
|
|
@cindex end-of-line operator
|
|
@cindex anchors
|
|
|
|
This operator can match the empty string either at the end of
|
|
the string or before a newline character in the string. Thus, it is
|
|
said to @dfn{anchor} the pattern to the end of a line.
|
|
|
|
It is always represented by @samp{$}. For example, @samp{foo$} usually
|
|
matches, e.g., @samp{foo} and, e.g., the first three characters of
|
|
@samp{foo\nbar}.
|
|
|
|
Its interaction with the syntax bits and pattern buffer fields is
|
|
exactly the dual of @samp{^}'s; see the previous section. (That is,
|
|
``beginning'' becomes ``end'', ``next'' becomes ``previous'', and
|
|
``after'' becomes ``before''.)
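The anchoring behavior can be observed through the @sc{posix} interface. The following sketch assumes that the @code{REG_NEWLINE} compile flag makes anchors match at newlines and that the @code{REG_NOTBOL} execution flag stops @samp{^} from matching at the start of the string; @code{search_start} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return the start offset of the leftmost match of the basic regular
   expression PATTERN in TEXT, or -1 if nothing matches or PATTERN
   does not compile.  CFLAGS is passed to regcomp and EFLAGS to
   regexec.  (Hypothetical helper for illustration.)  */
int
search_start (const char *pattern, int cflags, const char *text, int eflags)
{
  regex_t re;
  regmatch_t m[1];
  int start = -1;

  assert (pattern && text);
  if (regcomp (&re, pattern, cflags) != 0)
    return -1;
  if (regexec (&re, text, 1, m, eflags) == 0)
    start = (int) m[0].rm_so;
  regfree (&re);
  return start;
}
```

Searching the two-line string @code{"foo\nbar"} for @samp{^bar} should succeed at offset 4 only when @code{REG_NEWLINE} is given, and likewise @samp{foo$} should match its first line only with that flag.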
|
|
|
|
|
|
@node GNU Operators, GNU Emacs Operators, Common Operators, Top
|
|
@chapter GNU Operators
|
|
|
|
Following are operators that @sc{gnu} defines (and @sc{posix} doesn't).
|
|
|
|
@menu
|
|
* Word Operators::
|
|
* Buffer Operators::
|
|
@end menu
|
|
|
|
@node Word Operators, Buffer Operators, , GNU Operators
|
|
@section Word Operators
|
|
|
|
The operators in this section require Regex to recognize parts of words.
|
|
Regex uses a syntax table to determine whether or not a character is
|
|
part of a word, i.e., whether or not it is @dfn{word-constituent}.
|
|
|
|
@menu
|
|
* Non-Emacs Syntax Tables::
|
|
* Match-word-boundary Operator:: \b
|
|
* Match-within-word Operator:: \B
|
|
* Match-beginning-of-word Operator:: \<
|
|
* Match-end-of-word Operator:: \>
|
|
* Match-word-constituent Operator:: \w
|
|
* Match-non-word-constituent Operator:: \W
|
|
@end menu
|
|
|
|
@node Non-Emacs Syntax Tables, Match-word-boundary Operator, , Word Operators
|
|
@subsection Non-Emacs Syntax Tables
|
|
|
|
A @dfn{syntax table} is an array indexed by the characters in your
|
|
character set. In the @sc{ascii} encoding, therefore, a syntax table
|
|
has 256 elements. Regex always uses a @code{char *} variable
|
|
@code{re_syntax_table} as its syntax table. In some cases, it
|
|
initializes this variable and in others it expects you to initialize it.
|
|
|
|
@itemize @bullet
|
|
@item
|
|
If Regex is compiled with the preprocessor symbols @code{emacs} and
|
|
@code{SYNTAX_TABLE} both undefined, then Regex allocates
|
|
@code{re_syntax_table} and initializes an element @var{i} either to
|
|
@code{Sword} (which it defines) if @var{i} is a letter, number, or
|
|
@samp{_}, or to zero if it's not.
|
|
|
|
@item
|
|
If Regex is compiled with @code{emacs} undefined but @code{SYNTAX_TABLE}
|
|
defined, then Regex expects you to define a @code{char *} variable
|
|
@code{re_syntax_table} to be a valid syntax table.
|
|
|
|
@item
|
|
@xref{Emacs Syntax Tables}, for what happens when Regex is compiled with
|
|
the preprocessor symbol @code{emacs} defined.
|
|
|
|
@end itemize
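The default initialization described in the first case above can be sketched as follows. This is an illustration only, not the code Regex uses: @code{EXAMPLE_SWORD}, @code{example_syntax_table} and @code{init_example_syntax_table} are hypothetical names, and the value 1 merely stands in for whatever @code{Sword} is defined to be.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical stand-in for Regex's Sword constant.  */
#define EXAMPLE_SWORD 1

/* One entry per character code, as the text above describes.  */
char example_syntax_table[256];

/* Mark letters, digits and '_' as word-constituent and every other
   character as zero.  */
void
init_example_syntax_table (void)
{
  int c;

  for (c = 0; c < 256; c++)
    example_syntax_table[c] = (isalnum (c) || c == '_') ? EXAMPLE_SWORD : 0;
  assert (example_syntax_table['_'] == EXAMPLE_SWORD);
}
```

A program that compiles Regex with @code{SYNTAX_TABLE} defined would fill in the table that @code{re_syntax_table} points to in a similar way, using its own notion of which characters belong to words.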
|
|
|
|
@node Match-word-boundary Operator, Match-within-word Operator, Non-Emacs Syntax Tables, Word Operators
|
|
@subsection The Match-word-boundary Operator (@code{\b})
|
|
|
|
@cindex @samp{\b}
|
|
@cindex word boundaries, matching
|
|
|
|
This operator (represented by @samp{\b}) matches the empty string at
|
|
either the beginning or the end of a word. For example, @samp{\brat\b}
|
|
matches the separate word @samp{rat}.
|
|
|
|
@node Match-within-word Operator, Match-beginning-of-word Operator, Match-word-boundary Operator, Word Operators
|
|
@subsection The Match-within-word Operator (@code{\B})
|
|
|
|
@cindex @samp{\B}
|
|
|
|
This operator (represented by @samp{\B}) matches the empty string within
|
|
a word. For example, @samp{c\Brat\Be} matches @samp{crate}, but
|
|
@samp{dirty \Brat} doesn't match @samp{dirty rat}.
|
|
|
|
@node Match-beginning-of-word Operator, Match-end-of-word Operator, Match-within-word Operator, Word Operators
|
|
@subsection The Match-beginning-of-word Operator (@code{\<})
|
|
|
|
@cindex @samp{\<}
|
|
|
|
This operator (represented by @samp{\<}) matches the empty string at the
|
|
beginning of a word.
|
|
|
|
@node Match-end-of-word Operator, Match-word-constituent Operator, Match-beginning-of-word Operator, Word Operators
|
|
@subsection The Match-end-of-word Operator (@code{\>})
|
|
|
|
@cindex @samp{\>}
|
|
|
|
This operator (represented by @samp{\>}) matches the empty string at the
|
|
end of a word.
|
|
|
|
@node Match-word-constituent Operator, Match-non-word-constituent Operator, Match-end-of-word Operator, Word Operators
|
|
@subsection The Match-word-constituent Operator (@code{\w})
|
|
|
|
@cindex @samp{\w}
|
|
|
|
This operator (represented by @samp{\w}) matches any word-constituent
|
|
character.
|
|
|
|
@node Match-non-word-constituent Operator, , Match-word-constituent Operator, Word Operators
|
|
@subsection The Match-non-word-constituent Operator (@code{\W})
|
|
|
|
@cindex @samp{\W}
|
|
|
|
This operator (represented by @samp{\W}) matches any character that is
|
|
not word-constituent.
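The word operators in this section can be tried with a short sketch. It assumes a C library whose @sc{posix} functions accept these @sc{gnu} operators, as the @sc{gnu} C library is believed to do; other libraries may reject or misinterpret them. @code{word_match} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if the extended regular expression PATTERN matches
   somewhere in TEXT, and 0 otherwise (including when PATTERN does
   not compile).  (Hypothetical helper for illustration.)  */
int
word_match (const char *pattern, const char *text)
{
  regex_t re;
  int ok;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return 0;
  ok = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return ok;
}
```

Under that assumption, @samp{\brat\b} should be found in @samp{a rat here} but not in @samp{crate}, and @samp{c\Brat\Be} should be found in @samp{crate}.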
|
|
|
|
|
|
@node Buffer Operators, , Word Operators, GNU Operators
|
|
@section Buffer Operators
|
|
|
|
Following are operators which work on buffers. In Emacs, a @dfn{buffer}
|
|
is, naturally, an Emacs buffer. For other programs, Regex considers the
|
|
entire string to be matched as the buffer.
|
|
|
|
@menu
|
|
* Match-beginning-of-buffer Operator:: \`
|
|
* Match-end-of-buffer Operator:: \'
|
|
@end menu
|
|
|
|
|
|
@node Match-beginning-of-buffer Operator, Match-end-of-buffer Operator, , Buffer Operators
|
|
@subsection The Match-beginning-of-buffer Operator (@code{\`})
|
|
|
|
@cindex @samp{\`}
|
|
|
|
This operator (represented by @samp{\`}) matches the empty string at the
|
|
beginning of the buffer.
|
|
|
|
@node Match-end-of-buffer Operator, , Match-beginning-of-buffer Operator, Buffer Operators
|
|
@subsection The Match-end-of-buffer Operator (@code{\'})
|
|
|
|
@cindex @samp{\'}
|
|
|
|
This operator (represented by @samp{\'}) matches the empty string at the
|
|
end of the buffer.
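The difference between the buffer operators and the line anchors shows up when the string contains a newline. The following sketch assumes a C library whose @sc{posix} functions accept the @sc{gnu} buffer operators, as the @sc{gnu} C library is believed to do, and compiles every pattern with @code{REG_NEWLINE} so that @samp{^} and @samp{$} match at newlines. @code{buffer_search} is a hypothetical helper.

```c
#include <assert.h>
#include <regex.h>

/* Return the start offset of the leftmost match of the basic regular
   expression PATTERN in TEXT, or -1 if nothing matches or PATTERN
   does not compile.  Anchors are allowed to match at newlines.
   (Hypothetical helper for illustration.)  */
int
buffer_search (const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[1];
  int start = -1;

  assert (pattern && text);
  if (regcomp (&re, pattern, REG_NEWLINE) != 0)
    return -1;
  if (regexec (&re, text, 1, m, 0) == 0)
    start = (int) m[0].rm_so;
  regfree (&re);
  return start;
}
```

In the string @code{"foo\nbar"}, @samp{^bar} should match at offset 4, whereas a pattern that begins with the match-beginning-of-buffer operator can match only at offset 0.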
|
|
|
|
|
|
@node GNU Emacs Operators, What Gets Matched?, GNU Operators, Top
|
|
@chapter GNU Emacs Operators
|
|
|
|
Following are operators that @sc{gnu} defines (and @sc{posix} doesn't)
|
|
that you can use only when Regex is compiled with the preprocessor
|
|
symbol @code{emacs} defined.
|
|
|
|
@menu
|
|
* Syntactic Class Operators::
|
|
@end menu
|
|
|
|
|
|
@node Syntactic Class Operators, , , GNU Emacs Operators
|
|
@section Syntactic Class Operators
|
|
|
|
The operators in this section require Regex to recognize the syntactic
|
|
classes of characters. Regex uses a syntax table to determine this.
|
|
|
|
@menu
|
|
* Emacs Syntax Tables::
|
|
* Match-syntactic-class Operator:: \sCLASS
|
|
* Match-not-syntactic-class Operator:: \SCLASS
|
|
@end menu
|
|
|
|
@node Emacs Syntax Tables, Match-syntactic-class Operator, , Syntactic Class Operators
|
|
@subsection Emacs Syntax Tables
|
|
|
|
A @dfn{syntax table} is an array indexed by the characters in your
|
|
character set. In the @sc{ascii} encoding, therefore, a syntax table
|
|
has 256 elements.
|
|
|
|
If Regex is compiled with the preprocessor symbol @code{emacs} defined,
|
|
then Regex expects you to define and initialize the variable
|
|
@code{re_syntax_table} to be an Emacs syntax table. Emacs' syntax
|
|
tables are more complicated than Regex's own (@pxref{Non-Emacs Syntax
|
|
Tables}). @xref{Syntax, , Syntax, emacs, The GNU Emacs User's Manual},
|
|
for a description of Emacs' syntax tables.
|
|
|
|
@node Match-syntactic-class Operator, Match-not-syntactic-class Operator, Emacs Syntax Tables, Syntactic Class Operators
|
|
@subsection The Match-syntactic-class Operator (@code{\s}@var{class})
|
|
|
|
@cindex @samp{\s}
|
|
|
|
This operator matches any character whose syntactic class is represented
|
|
by a specified character. @samp{\s@var{class}} represents this operator
|
|
where @var{class} is the character representing the syntactic class you
|
|
want. For example, @samp{w} represents the syntactic
|
|
class of word-constituent characters, so @samp{\sw} matches any
|
|
word-constituent character.
|
|
|
|
@node Match-not-syntactic-class Operator, , Match-syntactic-class Operator, Syntactic Class Operators
|
|
@subsection The Match-not-syntactic-class Operator (@code{\S}@var{class})
|
|
|
|
@cindex @samp{\S}
|
|
|
|
This operator is similar to the match-syntactic-class operator except
|
|
that it matches any character whose syntactic class is @emph{not}
|
|
represented by the specified character. @samp{\S@var{class}} represents
|
|
this operator. For example, @samp{w} represents the syntactic class of
|
|
word-constituent characters, so @samp{\Sw} matches any character that is
|
|
not word-constituent.
|
|
|
|
|
|
@node What Gets Matched?, Programming with Regex, GNU Emacs Operators, Top
|
|
@chapter What Gets Matched?
|
|
|
|
Regex usually matches strings according to the ``leftmost longest''
|
|
rule; that is, it chooses the longest of the leftmost matches. This
|
|
does not mean that for a regular expression containing subexpressions
|
|
it simply chooses the longest match for each subexpression, left to
|
|
right; the overall match must also be the longest possible one.
|
|
|
|
For example, @samp{(ac*)(c*d[ac]*)\1} matches @samp{acdacaaa}, not
|
|
@samp{acdac}, as it would if it were to choose the longest match for the
|
|
first subexpression.
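
The same rule can be observed with a smaller pattern that needs no back
reference. The following is only a sketch: it uses the @sc{posix}
functions @code{regcomp} and @code{regexec} of the system C library as
a stand-in for Regex, and the helper name @code{overall_match_length}
is an assumption made for this illustration.

```c
#include <assert.h>
#include <regex.h>

/* Return the length of the overall match of the POSIX extended regular
   expression PATTERN in TEXT, or -1 if PATTERN does not compile or
   does not match.  */
int
overall_match_length (const char *pattern, const char *text)
{
  regex_t preg;
  regmatch_t whole[1];
  int length = -1;

  assert (pattern != 0 && text != 0);
  if (regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return -1;
  /* whole[0] receives the offsets of the entire match.  */
  if (regexec (&preg, text, 1, whole, 0) == 0)
    length = (int) (whole[0].rm_eo - whole[0].rm_so);
  regfree (&preg);
  return length;
}
```

With the pattern @samp{(a|ab)(c|bcd)} and the string @samp{abcd}, a
leftmost-longest matcher should report a length of 4 (@samp{a} followed
by @samp{bcd}). Taking the longest choice for the first group,
@samp{ab}, would leave only @samp{c} for the second group and an
overall match of 3 characters.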
|
|
|
|
|
|
@node Programming with Regex, Copying, What Gets Matched?, Top
|
|
@chapter Programming with Regex
|
|
|
|
Here we describe how you use the Regex data structures and functions in
|
|
C programs. Regex has three interfaces: one designed for @sc{gnu}, one
|
|
compatible with @sc{posix} and one compatible with Berkeley @sc{unix}.
|
|
|
|
@menu
|
|
* GNU Regex Functions::
|
|
* POSIX Regex Functions::
|
|
* BSD Regex Functions::
|
|
@end menu
|
|
|
|
|
|
@node GNU Regex Functions, POSIX Regex Functions, , Programming with Regex
|
|
@section GNU Regex Functions
|
|
|
|
If you're writing code that doesn't need to be compatible with either
|
|
@sc{posix} or Berkeley @sc{unix}, you can use these functions. They
|
|
provide more options than the other interfaces.
|
|
|
|
@menu
|
|
* GNU Pattern Buffers:: The re_pattern_buffer type.
|
|
* GNU Regular Expression Compiling:: re_compile_pattern ()
|
|
* GNU Matching:: re_match ()
|
|
* GNU Searching:: re_search ()
|
|
* Matching/Searching with Split Data:: re_match_2 (), re_search_2 ()
|
|
* Searching with Fastmaps:: re_compile_fastmap ()
|
|
* GNU Translate Tables:: The `translate' field.
|
|
* Using Registers:: The re_registers type and related fns.
|
|
* Freeing GNU Pattern Buffers:: regfree ()
|
|
@end menu
|
|
|
|
|
|
@node GNU Pattern Buffers, GNU Regular Expression Compiling, , GNU Regex Functions
|
|
@subsection GNU Pattern Buffers
|
|
|
|
@cindex pattern buffer, definition of
|
|
@tindex re_pattern_buffer @r{definition}
|
|
@tindex struct re_pattern_buffer @r{definition}
|
|
|
|
To compile, match, or search for a given regular expression, you must
|
|
supply a pattern buffer. A @dfn{pattern buffer} holds one compiled
|
|
regular expression.@footnote{Regular expressions are also referred to as
|
|
``patterns,'' hence the name ``pattern buffer.''}
|
|
|
|
You can have several different pattern buffers simultaneously, each
|
|
holding a compiled pattern for a different regular expression.
|
|
|
|
@file{regex.h} defines the pattern buffer @code{struct} as follows:
|
|
|
|
@example
|
|
/* Space that holds the compiled pattern. It is declared as
|
|
`unsigned char *' because its elements are
|
|
sometimes used as array indexes. */
|
|
unsigned char *buffer;
|
|
|
|
/* Number of bytes to which `buffer' points. */
|
|
unsigned long allocated;
|
|
|
|
/* Number of bytes actually used in `buffer'. */
|
|
unsigned long used;
|
|
|
|
/* Syntax setting with which the pattern was compiled. */
|
|
reg_syntax_t syntax;
|
|
|
|
/* Pointer to a fastmap, if any, otherwise zero. re_search uses
|
|
the fastmap, if there is one, to skip over impossible
|
|
starting points for matches. */
|
|
char *fastmap;
|
|
|
|
/* Either a translate table to apply to all characters before
|
|
comparing them, or zero for no translation. The translation
|
|
is applied to a pattern when it is compiled and to a string
|
|
when it is matched. */
|
|
char *translate;
|
|
|
|
/* Number of subexpressions found by the compiler. */
|
|
size_t re_nsub;
|
|
|
|
/* Zero if this pattern cannot match the empty string, one else.
|
|
Well, in truth it's used only in `re_search_2', to see
|
|
whether or not we should use the fastmap, so we don't set
|
|
this absolutely perfectly; see `re_compile_fastmap' (the
|
|
`duplicate' case). */
|
|
unsigned can_be_null : 1;
|
|
|
|
/* If REGS_UNALLOCATED, allocate space in the `regs' structure
|
|
for `max (RE_NREGS, re_nsub + 1)' groups.
|
|
If REGS_REALLOCATE, reallocate space if necessary.
|
|
If REGS_FIXED, use what's there. */
|
|
#define REGS_UNALLOCATED 0
|
|
#define REGS_REALLOCATE 1
|
|
#define REGS_FIXED 2
|
|
unsigned regs_allocated : 2;
|
|
|
|
/* Set to zero when `regex_compile' compiles a pattern; set to one
|
|
by `re_compile_fastmap' if it updates the fastmap. */
|
|
unsigned fastmap_accurate : 1;
|
|
|
|
/* If set, `re_match_2' does not return information about
|
|
subexpressions. */
|
|
unsigned no_sub : 1;
|
|
|
|
/* If set, a beginning-of-line anchor doesn't match at the
|
|
beginning of the string. */
|
|
unsigned not_bol : 1;
|
|
|
|
/* Similarly for an end-of-line anchor. */
|
|
unsigned not_eol : 1;
|
|
|
|
/* If true, an anchor at a newline matches. */
|
|
unsigned newline_anchor : 1;
|
|
|
|
@end example
|
|
|
|
|
|
@node GNU Regular Expression Compiling, GNU Matching, GNU Pattern Buffers, GNU Regex Functions
|
|
@subsection GNU Regular Expression Compiling
|
|
|
|
In @sc{gnu}, you can both match and search for a given regular
|
|
expression. To do either, you must first compile it in a pattern buffer
|
|
(@pxref{GNU Pattern Buffers}).
|
|
|
|
@cindex syntax initialization
|
|
@vindex re_syntax_options @r{initialization}
|
|
Regular expressions match according to the syntax with which they were
|
|
compiled; with @sc{gnu}, you indicate what syntax you want by setting
|
|
the variable @code{re_syntax_options} (declared in @file{regex.h} and
|
|
defined in @file{regex.c}) before calling the compiling function,
|
|
@code{re_compile_pattern} (see below). @xref{Syntax Bits}, and
|
|
@ref{Predefined Syntaxes}.
|
|
|
|
You can change the value of @code{re_syntax_options} at any time.
|
|
Usually, however, you set its value once and then never change it.
|
|
|
|
@cindex pattern buffer initialization
|
|
@code{re_compile_pattern} takes a pattern buffer as an argument. You
|
|
must initialize the following fields:
|
|
|
|
@table @code
|
|
|
|
|
|
|
|
@item translate
|
|
@vindex translate @r{initialization}
|
|
Initialize this to point to a translate table if you want one, or to
|
|
zero if you don't. We explain translate tables in @ref{GNU Translate
|
|
Tables}.
|
|
|
|
@item fastmap
|
|
@vindex fastmap @r{initialization}
|
|
Initialize this to nonzero if you want a fastmap, or to zero if you
|
|
don't.
|
|
|
|
@item buffer
|
|
@itemx allocated
|
|
@vindex buffer @r{initialization}
|
|
@vindex allocated @r{initialization}
|
|
@findex malloc
|
|
If you want @code{re_compile_pattern} to allocate memory for the
|
|
compiled pattern, set both of these to zero. If you have an existing
|
|
block of memory (allocated with @code{malloc}) you want Regex to use,
|
|
set @code{buffer} to its address and @code{allocated} to its size (in
|
|
bytes).
|
|
|
|
@code{re_compile_pattern} uses @code{realloc} to extend the space for
|
|
the compiled pattern as necessary.
|
|
|
|
@end table
|
|
|
|
To compile a pattern buffer, use:
|
|
|
|
@findex re_compile_pattern
|
|
@example
|
|
char *
|
|
re_compile_pattern (const char *@var{regex}, const int @var{regex_size},
|
|
struct re_pattern_buffer *@var{pattern_buffer})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{regex} is the regular expression's address, @var{regex_size} is its
|
|
length, and @var{pattern_buffer} is the pattern buffer's address.
|
|
|
|
If @code{re_compile_pattern} successfully compiles the regular
|
|
expression, it returns zero and sets @code{*@var{pattern_buffer}} to the
|
|
compiled pattern. It sets the pattern buffer's fields as follows:
|
|
|
|
@table @code
|
|
@item buffer
|
|
@vindex buffer @r{field, set by @code{re_compile_pattern}}
|
|
to the compiled pattern.
|
|
|
|
@item used
|
|
@vindex used @r{field, set by @code{re_compile_pattern}}
|
|
to the number of bytes the compiled pattern in @code{buffer} occupies.
|
|
|
|
@item syntax
|
|
@vindex syntax @r{field, set by @code{re_compile_pattern}}
|
|
to the current value of @code{re_syntax_options}.
|
|
|
|
@item re_nsub
|
|
@vindex re_nsub @r{field, set by @code{re_compile_pattern}}
|
|
to the number of subexpressions in @var{regex}.
|
|
|
|
@item fastmap_accurate
|
|
@vindex fastmap_accurate @r{field, set by @code{re_compile_pattern}}
|
|
to zero on the theory that the pattern you're compiling is different
|
|
from the one previously compiled into @code{buffer}; in that case (since
|
|
you can't make a fastmap without a compiled pattern),
|
|
@code{fastmap} would either contain an incompatible fastmap, or nothing
|
|
at all.
|
|
|
|
@c xx what else?
|
|
@end table
|
|
|
|
If @code{re_compile_pattern} can't compile @var{regex}, it returns an
|
|
error string corresponding to one of the errors listed in @ref{POSIX
|
|
Regular Expression Compiling}.
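
The initialization and compilation steps above might be sketched as
follows. This assumes the GNU C Library's @file{regex.h}, which
declares the @sc{gnu} functions only when @code{_GNU_SOURCE} is defined
before the first include; the helper name @code{count_groups} and its
return convention are assumptions made for this illustration.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN with the POSIX extended syntax and return the number
   of subexpressions it contains, or -1 if it does not compile.  */
int
count_groups (const char *pattern)
{
  struct re_pattern_buffer pattern_buffer;
  const char *error;
  int groups;

  assert (pattern != 0);

  /* Zero `translate', `fastmap', `buffer' and `allocated': no translate
     table, no fastmap, and let the compiler allocate the buffer.  */
  memset (&pattern_buffer, 0, sizeof pattern_buffer);

  /* The syntax is global state; set it before compiling.  */
  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;

  error = re_compile_pattern (pattern, strlen (pattern), &pattern_buffer);
  if (error != 0)
    return -1;                  /* `error' points to a message string */

  groups = (int) pattern_buffer.re_nsub;
  regfree (&pattern_buffer);
  return groups;
}
```

Because @code{re_syntax_options} is a global variable, a real program
would normally set it once at startup rather than in every helper.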
|
|
|
|
|
|
@node GNU Matching, GNU Searching, GNU Regular Expression Compiling, GNU Regex Functions
|
|
@subsection GNU Matching
|
|
|
|
@cindex matching with GNU functions
|
|
|
|
Matching the @sc{gnu} way means trying to match as much of a string as
|
|
possible starting at a position within it you specify. Once you've compiled
|
|
a pattern into a pattern buffer (@pxref{GNU Regular Expression
|
|
Compiling}), you can ask the matcher to match that pattern against a
|
|
string using:
|
|
|
|
@findex re_match
|
|
@example
|
|
int
|
|
re_match (struct re_pattern_buffer *@var{pattern_buffer},
|
|
const char *@var{string}, const int @var{size},
|
|
const int @var{start}, struct re_registers *@var{regs})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{pattern_buffer} is the address of a pattern buffer containing a
|
|
compiled pattern. @var{string} is the string you want to match; it can
|
|
contain newline and null characters. @var{size} is the length of that
|
|
string. @var{start} is the string index at which you want to
|
|
begin matching; the first character of @var{string} is at index zero.
|
|
@xref{Using Registers}, for an explanation of @var{regs}; you can safely
|
|
pass zero.
|
|
|
|
@code{re_match} matches the regular expression in @var{pattern_buffer}
|
|
against the string @var{string} according to the syntax in
|
|
@var{pattern_buffer}'s @code{syntax} field. (@xref{GNU Regular
|
|
Expression Compiling}, for how to set it.) The function returns
|
|
@math{-1} if the compiled pattern does not match any part of
|
|
@var{string} and @math{-2} if an internal error happens; otherwise, it
|
|
returns how many (possibly zero) characters of @var{string} the pattern
|
|
matched.
|
|
|
|
An example: suppose @var{pattern_buffer} points to a pattern buffer
|
|
containing the compiled pattern for @samp{a*}, and @var{string} points
|
|
to @samp{aaaaab} (whereupon @var{size} should be 6). Then if @var{start}
|
|
is 2, @code{re_match} returns 3, i.e., @samp{a*} would have matched the
|
|
last three @samp{a}s in @var{string}. If @var{start} is 0,
|
|
@code{re_match} returns 5, i.e., @samp{a*} would have matched all the
|
|
@samp{a}s in @var{string}. If @var{start} is either 5 or 6, it returns
|
|
zero.
|
|
|
|
If @var{start} is not between zero and @var{size}, then
|
|
@code{re_match} returns @math{-1}.
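
The @samp{a*} example above can be tried with a small helper. This is a
sketch assuming the GNU C Library's @file{regex.h} (with
@code{_GNU_SOURCE} defined before the first include); the name
@code{match_at} and the value @math{-3} it returns for a pattern that
fails to compile are assumptions of the sketch, not part of Regex.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Compile PATTERN with the current syntax and match it against TEXT
   starting at index START.  Returns what re_match returns, or -3 (a
   value chosen for this sketch) if PATTERN does not compile.  */
int
match_at (const char *pattern, const char *text, int start)
{
  struct re_pattern_buffer pattern_buffer;
  int result;

  assert (pattern != 0 && text != 0);

  /* No translate table, no fastmap; the compiler allocates `buffer'.  */
  memset (&pattern_buffer, 0, sizeof pattern_buffer);

  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -3;

  /* Passing zero for `regs' means we do not want group offsets.  */
  result = re_match (&pattern_buffer, text, (int) strlen (text), start, 0);
  regfree (&pattern_buffer);
  return result;
}
```

Calling @code{match_at ("a*", "aaaaab", 2)} corresponds to the case
described above in which @var{start} is 2.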
|
|
|
|
|
|
@node GNU Searching, Matching/Searching with Split Data, GNU Matching, GNU Regex Functions
|
|
@subsection GNU Searching
|
|
|
|
@cindex searching with GNU functions
|
|
|
|
@dfn{Searching} means trying to match starting at successive positions
|
|
within a string. The function @code{re_search} does this.
|
|
|
|
Before calling @code{re_search}, you must compile your regular
|
|
expression. @xref{GNU Regular Expression Compiling}.
|
|
|
|
Here is the function declaration:
|
|
|
|
@findex re_search
|
|
@example
|
|
int
|
|
re_search (struct re_pattern_buffer *@var{pattern_buffer},
|
|
const char *@var{string}, const int @var{size},
|
|
const int @var{start}, const int @var{range},
|
|
struct re_registers *@var{regs})
|
|
@end example
|
|
|
|
@noindent
|
|
@vindex start @r{argument to @code{re_search}}
|
|
@vindex range @r{argument to @code{re_search}}
|
|
whose arguments are the same as those to @code{re_match} (@pxref{GNU
|
|
Matching}) except that the two arguments @var{start} and @var{range}
|
|
replace @code{re_match}'s argument @var{start}.
|
|
|
|
If @var{range} is positive, then @code{re_search} attempts a match
|
|
starting first at index @var{start}, then at @math{@var{start} + 1} if
|
|
that fails, and so on, up to @math{@var{start} + @var{range}}; if
|
|
@var{range} is negative, then it attempts a match starting first at
|
|
index @var{start}, then at @math{@var{start} - 1} if that fails, and so
|
|
on.
|
|
|
|
If @var{start} is not between zero and @var{size}, then @code{re_search}
|
|
returns @math{-1}. When @var{range} is positive, @code{re_search}
|
|
adjusts @var{range} so that @math{@var{start} + @var{range} - 1} is
|
|
between zero and @var{size}, if necessary; that way it won't search
|
|
outside of @var{string}. Similarly, when @var{range} is negative,
|
|
@code{re_search} adjusts @var{range} so that @math{@var{start} +
|
|
@var{range} + 1} is between zero and @var{size}, if necessary.
|
|
|
|
If the @code{fastmap} field of @var{pattern_buffer} is zero,
|
|
@code{re_search} matches starting at consecutive positions; otherwise,
|
|
it uses @code{fastmap} to make the search more efficient.
|
|
@xref{Searching with Fastmaps}.
|
|
|
|
If no match is found, @code{re_search} returns @math{-1}. If
|
|
a match is found, it returns the index where the match began. If an
|
|
internal error happens, it returns @math{-2}.
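
A forward and a backward search might be sketched as follows, again
assuming the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined. The helper name @code{search_from} and its @math{-3} return
value for a pattern that does not compile are assumptions made for this
illustration.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Search TEXT for PATTERN, trying START first and then moving through
   RANGE further positions (backward if RANGE is negative).  Returns
   what re_search returns, or -3 if PATTERN does not compile.  */
int
search_from (const char *pattern, const char *text, int start, int range)
{
  struct re_pattern_buffer pattern_buffer;
  int result;

  assert (pattern != 0 && text != 0);
  memset (&pattern_buffer, 0, sizeof pattern_buffer);

  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -3;

  result = re_search (&pattern_buffer, text, (int) strlen (text),
                      start, range, 0);
  regfree (&pattern_buffer);
  return result;
}
```

For the string @samp{banana} and the pattern @samp{na}, a forward
search from index 0 finds the first occurrence, while a search with
@var{start} equal to the string's length and a negative @var{range}
finds the last one.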
|
|
|
|
|
|
@node Matching/Searching with Split Data, Searching with Fastmaps, GNU Searching, GNU Regex Functions
|
|
@subsection Matching and Searching with Split Data
|
|
|
|
Using the functions @code{re_match_2} and @code{re_search_2}, you can
|
|
match or search in data that is divided into two strings.
|
|
|
|
The function:
|
|
|
|
@findex re_match_2
|
|
@example
|
|
int
|
|
re_match_2 (struct re_pattern_buffer *@var{buffer},
|
|
const char *@var{string1}, const int @var{size1},
|
|
const char *@var{string2}, const int @var{size2},
|
|
const int @var{start},
|
|
struct re_registers *@var{regs},
|
|
const int @var{stop})
|
|
@end example
|
|
|
|
@noindent
|
|
is similar to @code{re_match} (@pxref{GNU Matching}) except that you
|
|
pass @emph{two} data strings and sizes, and an index @var{stop} beyond
|
|
which you don't want the matcher to try matching. As with
|
|
@code{re_match}, if it succeeds, @code{re_match_2} returns how many
|
|
characters of the data it matched. Regard @var{string1} and
|
|
@var{string2} as concatenated when you set the arguments @var{start} and
|
|
@var{stop} and use the contents of @var{regs}; @code{re_match_2} never
|
|
returns a value larger than @math{@var{size1} + @var{size2}}.
|
|
|
|
The function:
|
|
|
|
@findex re_search_2
|
|
@example
|
|
int
|
|
re_search_2 (struct re_pattern_buffer *@var{buffer},
|
|
const char *@var{string1}, const int @var{size1},
|
|
const char *@var{string2}, const int @var{size2},
|
|
const int @var{start}, const int @var{range},
|
|
struct re_registers *@var{regs},
|
|
const int @var{stop})
|
|
@end example
|
|
|
|
@noindent
|
|
is similarly related to @code{re_search}.
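
As a sketch of searching data held in two pieces, the helper below
looks for a pattern anywhere in the concatenation of two strings. It
assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined; the name @code{search_split} and its @math{-3} return value
for a pattern that does not compile are assumptions of the sketch.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Search for PATTERN in STRING1 followed by STRING2, treated as one
   piece of data.  Returns the index of the match in the concatenated
   data, -1 if there is none, or -3 if PATTERN does not compile.  */
int
search_split (const char *pattern, const char *string1, const char *string2)
{
  struct re_pattern_buffer pattern_buffer;
  int size1, size2, result;

  assert (pattern != 0 && string1 != 0 && string2 != 0);
  memset (&pattern_buffer, 0, sizeof pattern_buffer);

  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -3;

  size1 = (int) strlen (string1);
  size2 = (int) strlen (string2);

  /* `start', `range' and `stop' all refer to the concatenated data.  */
  result = re_search_2 (&pattern_buffer, string1, size1, string2, size2,
                        0, size1 + size2, 0, size1 + size2);
  regfree (&pattern_buffer);
  return result;
}
```

A match may straddle the boundary between the two strings, as
@samp{o w} does in @samp{hello} followed by @samp{ world}.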
|
|
|
|
|
|
@node Searching with Fastmaps, GNU Translate Tables, Matching/Searching with Split Data, GNU Regex Functions
|
|
@subsection Searching with Fastmaps
|
|
|
|
@cindex fastmaps
|
|
If you're searching through a long string, you should use a fastmap.
|
|
Without one, the searcher tries to match at consecutive positions in the
|
|
string. Generally, most of the characters in the string could not start
|
|
a match. It takes much longer to try matching at a given position in the
|
|
string than it does to check in a table whether or not the character at
|
|
that position could start a match. A @dfn{fastmap} is such a table.
|
|
|
|
More specifically, a fastmap is an array indexed by the characters in
|
|
your character set. Under the @sc{ascii} encoding, therefore, a fastmap
|
|
has 256 elements. If you want the searcher to use a fastmap with a
|
|
given pattern buffer, you must allocate the array and assign the array's
|
|
address to the pattern buffer's @code{fastmap} field. You either can
|
|
compile the fastmap yourself or have @code{re_search} do it for you;
|
|
when @code{fastmap} is nonzero, it automatically compiles a fastmap the
|
|
first time you search using a particular compiled pattern.
|
|
|
|
To compile a fastmap yourself, use:
|
|
|
|
@findex re_compile_fastmap
|
|
@example
|
|
int
|
|
re_compile_fastmap (struct re_pattern_buffer *@var{pattern_buffer})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{pattern_buffer} is the address of a pattern buffer. If the
|
|
character @var{c} could start a match for the pattern,
|
|
@code{re_compile_fastmap} makes
|
|
@code{@var{pattern_buffer}->fastmap[@var{c}]} nonzero. It returns
|
|
@math{0} if it can compile a fastmap and @math{-2} if there is an
|
|
internal error. For example, if @samp{|} is the alternation operator
|
|
and @var{pattern_buffer} holds the compiled pattern for @samp{a|b}, then
|
|
@code{re_compile_fastmap} sets @code{fastmap['a']} and
|
|
@code{fastmap['b']} (and no others).
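
The @samp{a|b} example can be checked with a sketch like the one below.
It assumes the GNU C Library's @file{regex.h} with @code{_GNU_SOURCE}
defined, and the helper name @code{could_start_match} is an assumption
made for this illustration. The fastmap lives on the stack here, so the
sketch clears the @code{fastmap} field before calling @code{regfree},
which may otherwise try to free it.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Return 1 if character C could start a match of PATTERN (POSIX
   extended syntax), 0 if it could not, and -1 on any error.  */
int
could_start_match (const char *pattern, unsigned char c)
{
  struct re_pattern_buffer pattern_buffer;
  char fastmap[256];
  int answer = -1;

  assert (pattern != 0);
  memset (&pattern_buffer, 0, sizeof pattern_buffer);
  pattern_buffer.fastmap = fastmap;     /* we want a fastmap */

  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -1;

  /* Fill in the table ourselves instead of waiting for re_search.  */
  if (re_compile_fastmap (&pattern_buffer) == 0)
    answer = fastmap[c] != 0;

  pattern_buffer.fastmap = 0;           /* stack array: not regfree's to free */
  regfree (&pattern_buffer);
  return answer;
}
```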
|
|
|
|
@code{re_search} uses a fastmap as it moves along in the string: it
|
|
checks the string's characters until it finds one that's in the fastmap.
|
|
Then it tries matching at that character. If the match fails, it
|
|
repeats the process. So, by using a fastmap, @code{re_search} doesn't
|
|
waste time trying to match at positions in the string that couldn't
|
|
start a match.
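
The saving can be illustrated without Regex at all. The following
sketch is not how @code{re_search} is implemented; it only mimics the
skipping step with a hypothetical matcher, @code{matches_ab_at}, that
recognizes the literal @samp{ab}, and it counts how many full match
attempts are made.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real matcher: succeeds when the two
   characters "ab" begin at POS.  */
static int
matches_ab_at (const char *string, size_t size, size_t pos)
{
  return pos + 1 < size && string[pos] == 'a' && string[pos + 1] == 'b';
}

/* Search STRING for "ab", consulting FASTMAP (if nonzero) before each
   attempt.  Stores the number of full match attempts in *ATTEMPTS and
   returns the index of the match, or -1 if there is none.  */
long
search_ab (const char *string, size_t size, const char *fastmap,
           size_t *attempts)
{
  size_t pos;

  assert (string != 0 && attempts != 0);
  *attempts = 0;
  for (pos = 0; pos < size; pos++)
    {
      /* Cheap table lookup: skip characters that cannot start a match.  */
      if (fastmap != 0 && !fastmap[(unsigned char) string[pos]])
        continue;
      ++*attempts;
      if (matches_ab_at (string, size, pos))
        return (long) pos;
    }
  return -1;
}
```

With a fastmap in which only the entry for @samp{a} is nonzero, the
search tries a full match only at the positions holding an @samp{a};
without one, it tries at every position up to the match.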
|
|
|
|
If you don't want @code{re_search} to use a fastmap,
|
|
store zero in the @code{fastmap} field of the pattern buffer before
|
|
calling @code{re_search}.
|
|
|
|
Once you've initialized a pattern buffer's @code{fastmap} field, you
|
|
need never do so again---even if you compile a new pattern in
|
|
it---provided the way the field is set still reflects whether or not you
|
|
want a fastmap. @code{re_search} will still either do nothing if
|
|
@code{fastmap} is null or, if it isn't, compile a new fastmap for the
|
|
new pattern.
|
|
|
|
@node GNU Translate Tables, Using Registers, Searching with Fastmaps, GNU Regex Functions
|
|
@subsection GNU Translate Tables
|
|
|
|
If you set the @code{translate} field of a pattern buffer to a translate
|
|
table, then the @sc{gnu} Regex functions to which you've passed that
|
|
pattern buffer use it to apply a simple transformation
|
|
to all the regular expression and string characters at which they look.
|
|
|
|
A @dfn{translate table} is an array indexed by the characters in your
|
|
character set. Under the @sc{ascii} encoding, therefore, a translate
|
|
table has 256 elements. The array's elements are also characters in
|
|
your character set. When the Regex functions see a character @var{c},
|
|
they use @code{translate[@var{c}]} in its place, with one exception: the
|
|
character after a @samp{\} is not translated. (This ensures that the
|
|
operators, e.g., @samp{\B} and @samp{\b}, are always distinguishable.)
|
|
|
|
For example, a table that maps all lowercase letters to the
|
|
corresponding uppercase ones would cause the matcher to ignore
|
|
differences in case.@footnote{A table that maps all uppercase letters to
|
|
the corresponding lowercase ones would work just as well for this
|
|
purpose.} Such a table would map all characters except lowercase letters
|
|
to themselves, and lowercase letters to the corresponding uppercase
|
|
ones. Under the @sc{ascii} encoding, here's how you could initialize
|
|
such a table (we'll call it @code{case_fold}):
|
|
|
|
@example
|
|
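char case_fold[256];
int i;
/* Start from the identity mapping.  */
for (i = 0; i < 256; i++)
  case_fold[i] = i;
/* Then fold each lowercase letter onto its uppercase counterpart.  */
for (i = 'a'; i <= 'z'; i++)
  case_fold[i] = i - ('a' - 'A');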
|
|
@end example
|
|
|
|
You tell Regex to use a translate table on a given pattern buffer by
|
|
assigning that table's address to the @code{translate} field of that
|
|
buffer. If you don't want Regex to do any translation, put zero into
|
|
this field. You'll get weird results if you change the table's contents
|
|
anytime between compiling the pattern buffer, compiling its fastmap, and
|
|
matching or searching with the pattern buffer.
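
A case-insensitive search built on such a table might look like the
sketch below. It assumes the GNU C Library's @file{regex.h} with
@code{_GNU_SOURCE} defined, where the @code{translate} field is
declared as @code{unsigned char *}; the helper name
@code{search_ignoring_case} and its @math{-3} return value are
assumptions made for this illustration. The table is static, so the
sketch clears the @code{translate} field before calling @code{regfree},
which may otherwise try to free it.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Search TEXT for PATTERN, ignoring differences in ASCII letter case.
   Returns the index of the match, -1 if there is none, or -3 if
   PATTERN does not compile.  */
int
search_ignoring_case (const char *pattern, const char *text)
{
  static unsigned char case_fold[256];
  struct re_pattern_buffer pattern_buffer;
  int i, size, result;

  assert (pattern != 0 && text != 0);

  /* Identity mapping, then lowercase letters onto uppercase ones.  */
  for (i = 0; i < 256; i++)
    case_fold[i] = (unsigned char) i;
  for (i = 'a'; i <= 'z'; i++)
    case_fold[i] = (unsigned char) (i - ('a' - 'A'));

  memset (&pattern_buffer, 0, sizeof pattern_buffer);
  pattern_buffer.translate = case_fold; /* set before compiling */

  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -3;

  size = (int) strlen (text);
  result = re_search (&pattern_buffer, text, size, 0, size, 0);

  pattern_buffer.translate = 0;         /* static table: not regfree's to free */
  regfree (&pattern_buffer);
  return result;
}
```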
|
|
|
|
@node Using Registers, Freeing GNU Pattern Buffers, GNU Translate Tables, GNU Regex Functions
|
|
@subsection Using Registers
|
|
|
|
A group in a regular expression can match a (possibly empty) substring
|
|
of the string that the regular expression as a whole matched. The matcher
|
|
remembers the beginning and end of the substring matched by
|
|
each group.
|
|
|
|
To find out what they matched, pass a nonzero @var{regs} argument to a
|
|
@sc{gnu} matching or searching function (@pxref{GNU Matching} and
|
|
@ref{GNU Searching}), i.e., the address of a structure of this type, as
|
|
defined in @file{regex.h}:
|
|
|
|
@c We don't bother to include this directly from regex.h,
|
|
@c since it changes so rarely.
|
|
@example
|
|
@tindex re_registers
|
|
@vindex num_regs @r{in @code{struct re_registers}}
|
|
@vindex start @r{in @code{struct re_registers}}
|
|
@vindex end @r{in @code{struct re_registers}}
|
|
struct re_registers
|
|
@{
|
|
unsigned num_regs;
|
|
regoff_t *start;
|
|
regoff_t *end;
|
|
@};
|
|
@end example
|
|
|
|
Except for (possibly) the @var{num_regs}'th element (see below), the
|
|
@var{i}th element of the @code{start} and @code{end} arrays records
|
|
information about the @var{i}th group in the pattern. (They're declared
|
|
as C pointers, but this is only because not all C compilers accept
|
|
zero-length arrays; conceptually, it is simplest to think of them as
|
|
arrays.)
|
|
|
|
The @code{start} and @code{end} arrays are allocated in various ways,
|
|
depending on the value of the @code{regs_allocated}
|
|
@vindex regs_allocated
|
|
field in the pattern buffer passed to the matcher.
|
|
|
|
The simplest and perhaps most useful is to let the matcher (re)allocate
|
|
enough space to record information for all the groups in the regular
|
|
expression. If @code{regs_allocated} is @code{REGS_UNALLOCATED},
|
|
@vindex REGS_UNALLOCATED
|
|
the matcher allocates @math{1 + @var{re_nsub}} elements (@code{re_nsub} is
another field in the pattern buffer; @pxref{GNU Pattern Buffers}), sets
the extra element to @math{-1}, and sets @code{regs_allocated} to
@code{REGS_REALLOCATE}.
|
|
@vindex REGS_REALLOCATE
|
|
Then on subsequent calls with the same pattern buffer and @var{regs}
|
|
arguments, the matcher reallocates more space if necessary.
|
|
|
|
It would perhaps be more logical to make the @code{regs_allocated} field
|
|
part of the @code{re_registers} structure, instead of part of the
|
|
pattern buffer. But in that case the caller would be forced to
|
|
initialize the structure before passing it. Much existing code doesn't
|
|
do this initialization, and it's arguably better to avoid it anyway.
|
|
|
|
@code{re_compile_pattern} sets @code{regs_allocated} to
|
|
@code{REGS_UNALLOCATED},
|
|
so if you use the GNU regular expression
|
|
functions, you get this behavior by default.
|
|
|
|
@c xx document re_set_registers
|
|
|
|
@sc{posix}, on the other hand, requires a different interface: the
|
|
caller is supposed to pass in a fixed-length array which the matcher
|
|
fills. Therefore, if @code{regs_allocated} is @code{REGS_FIXED},
|
|
@vindex REGS_FIXED
|
|
the matcher simply fills that array.
|
|
|
|
The following examples illustrate the information recorded in the
|
|
@code{re_registers} structure. (In all of them, @samp{(} represents the
|
|
open-group and @samp{)} the close-group operator. The first character
|
|
in the string @var{string} is at index 0.)
|
|
|
|
@c xx i'm not sure this is all true anymore.
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
If the regular expression has an @w{@var{i}-th}
|
|
group not contained within another group that matches a
|
|
substring of @var{string}, then the function sets
|
|
@code{@w{@var{regs}->}start[@var{i}]} to the index in @var{string} where
|
|
the substring matched by the @w{@var{i}-th} group begins, and
|
|
@code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
|
|
substring's end. The function sets @code{@w{@var{regs}->}start[0]} and
|
|
@code{@w{@var{regs}->}end[0]} to analogous information about the entire
|
|
pattern.
|
|
|
|
For example, when you match @samp{((a)(b))} against @samp{ab}, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
|
|
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
|
|
|
|
@item
|
|
1 in @code{@w{@var{regs}->}start[3]} and 2 in @code{@w{@var{regs}->}end[3]}
|
|
@end itemize
|
|
|
|
@item
|
|
If a group matches more than once (as it might if followed by,
|
|
e.g., a repetition operator), then the function reports the information
|
|
about what the group @emph{last} matched.
|
|
|
|
For example, when you match the pattern @samp{(a)*} against the string
|
|
@samp{aa}, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 2 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
1 in @code{@w{@var{regs}->}start[1]} and 2 in @code{@w{@var{regs}->}end[1]}
|
|
@end itemize
|
|
|
|
@item
|
|
If the @w{@var{i}-th} group does not participate in a
|
|
successful match, e.g., it is an alternative not taken or a
|
|
repetition operator allows zero repetitions of it, then the function
|
|
sets @code{@w{@var{regs}->}start[@var{i}]} and
|
|
@code{@w{@var{regs}->}end[@var{i}]} to @math{-1}.
|
|
|
|
For example, when you match the pattern @samp{(a)*b} against
|
|
the string @samp{b}, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
@math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
|
|
@end itemize
|
|
|
|
@item
|
|
If the @w{@var{i}-th} group matches a zero-length string, then the
|
|
function sets @code{@w{@var{regs}->}start[@var{i}]} and
|
|
@code{@w{@var{regs}->}end[@var{i}]} to the index just beyond that
|
|
zero-length string.
|
|
|
|
For example, when you match the pattern @samp{(a*)b} against the string
|
|
@samp{b}, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]}
|
|
@end itemize
|
|
|
|
@ignore
|
|
The function sets @code{@w{@var{regs}->}start[0]} and
|
|
@code{@w{@var{regs}->}end[0]} to analogous information about the entire
|
|
pattern.
|
|
|
|
For example, when you match the pattern @samp{(a*)} against the empty
|
|
string, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 0 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[1]} and 0 in @code{@w{@var{regs}->}end[1]}
|
|
@end itemize
|
|
@end ignore
|
|
|
|
@item
|
|
If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
|
|
in turn not contained within any other group within group @var{i} and
|
|
the function reports a match of the @w{@var{i}-th} group, then it
|
|
records in @code{@w{@var{regs}->}start[@var{j}]} and
|
|
@code{@w{@var{regs}->}end[@var{j}]} the last match (if it matched) of
|
|
the @w{@var{j}-th} group.
|
|
|
|
For example, when you match the pattern @samp{((a*)b)*} against the
|
|
string @samp{abb}, @w{group 2} last matches the empty string, so you
|
|
get what it previously matched:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
|
|
|
|
@item
|
|
2 in @code{@w{@var{regs}->}start[2]} and 2 in @code{@w{@var{regs}->}end[2]}
|
|
@end itemize
|
|
|
|
When you match the pattern @samp{((a)*b)*} against the string
|
|
@samp{abb}, @w{group 2} doesn't participate in the last match, so you
|
|
get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 3 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
2 in @code{@w{@var{regs}->}start[1]} and 3 in @code{@w{@var{regs}->}end[1]}
|
|
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[2]} and 1 in @code{@w{@var{regs}->}end[2]}
|
|
@end itemize
|
|
|
|
@item
|
|
If an @w{@var{i}-th} group contains a @w{@var{j}-th} group
|
|
in turn not contained within any other group within group @var{i}
|
|
and the function sets
|
|
@code{@w{@var{regs}->}start[@var{i}]} and
|
|
@code{@w{@var{regs}->}end[@var{i}]} to @math{-1}, then it also sets
|
|
@code{@w{@var{regs}->}start[@var{j}]} and
|
|
@code{@w{@var{regs}->}end[@var{j}]} to @math{-1}.
|
|
|
|
For example, when you match the pattern @samp{((a)*b)*c} against the
|
|
string @samp{c}, you get:
|
|
|
|
@itemize
|
|
@item
|
|
0 in @code{@w{@var{regs}->}start[0]} and 1 in @code{@w{@var{regs}->}end[0]}
|
|
|
|
@item
|
|
@math{-1} in @code{@w{@var{regs}->}start[1]} and @math{-1} in @code{@w{@var{regs}->}end[1]}
|
|
|
|
@item
|
|
@math{-1} in @code{@w{@var{regs}->}start[2]} and @math{-1} in @code{@w{@var{regs}->}end[2]}
|
|
@end itemize
|
|
|
|
@end itemize
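
The simpler cases in the list above can be reproduced with a sketch
such as the following. It assumes the GNU C Library's @file{regex.h}
with @code{_GNU_SOURCE} defined and compiles with the @sc{posix}
extended syntax so that @samp{(} and @samp{)} are the group operators;
the helper name @code{group_span} is an assumption made for this
illustration. The matcher allocates the @code{start} and @code{end}
arrays, and the sketch frees them itself afterwards.

```c
#define _GNU_SOURCE 1           /* glibc: expose the GNU interface */
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Match PATTERN (POSIX extended syntax) at the start of TEXT and store
   the offsets recorded for group GROUP in *START and *END.  Returns 0
   on success, or -1 if PATTERN does not compile, does not match, or
   GROUP is out of range.  */
int
group_span (const char *pattern, const char *text, unsigned group,
            int *start, int *end)
{
  struct re_pattern_buffer pattern_buffer;
  struct re_registers regs;
  int status = -1;

  assert (pattern != 0 && text != 0 && start != 0 && end != 0);
  memset (&pattern_buffer, 0, sizeof pattern_buffer);
  memset (&regs, 0, sizeof regs);

  re_syntax_options = RE_SYNTAX_POSIX_EXTENDED;
  if (re_compile_pattern (pattern, strlen (pattern), &pattern_buffer) != 0)
    return -1;

  /* regs_allocated is REGS_UNALLOCATED here, so the matcher allocates
     the `start' and `end' arrays on a successful match.  */
  if (re_match (&pattern_buffer, text, (int) strlen (text), 0, &regs) >= 0
      && group < regs.num_regs)
    {
      *start = (int) regs.start[group];
      *end = (int) regs.end[group];
      status = 0;
    }

  free (regs.start);            /* both are null if nothing was allocated */
  free (regs.end);
  regfree (&pattern_buffer);
  return status;
}
```

For instance, @code{group_span ("(a)*b", "b", 1, &s, &e)} corresponds to
the case of a group that does not participate in the match.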
|
|
|
|
@node Freeing GNU Pattern Buffers, , Using Registers, GNU Regex Functions
|
|
@subsection Freeing GNU Pattern Buffers
|
|
|
|
To free any allocated fields of a pattern buffer, you can use the
|
|
@sc{posix} function described in @ref{Freeing POSIX Pattern Buffers},
|
|
since the type @code{regex_t}---the type for @sc{posix} pattern
|
|
buffers---is equivalent to the type @code{re_pattern_buffer}. After
|
|
freeing a pattern buffer, you need to again compile a regular expression
|
|
in it (@pxref{GNU Regular Expression Compiling}) before passing it to
|
|
a matching or searching function.
|
|
|
|
|
|
@node POSIX Regex Functions, BSD Regex Functions, GNU Regex Functions, Programming with Regex
|
|
@section POSIX Regex Functions
|
|
|
|
If you're writing code that has to be @sc{posix} compatible, you'll need
|
|
to use these functions. Their interfaces are as specified by @sc{posix},
|
|
draft 1003.2/D11.2.
|
|
|
|
@menu
|
|
* POSIX Pattern Buffers:: The regex_t type.
|
|
* POSIX Regular Expression Compiling:: regcomp ()
|
|
* POSIX Matching:: regexec ()
|
|
* Reporting Errors:: regerror ()
|
|
* Using Byte Offsets:: The regmatch_t type.
|
|
* Freeing POSIX Pattern Buffers:: regfree ()
|
|
@end menu
|
|
|
|
|
|
@node POSIX Pattern Buffers, POSIX Regular Expression Compiling, , POSIX Regex Functions
|
|
@subsection POSIX Pattern Buffers
|
|
|
|
To compile or match a given regular expression the @sc{posix} way, you
|
|
must supply a pattern buffer exactly the way you do for @sc{gnu}
|
|
(@pxref{GNU Pattern Buffers}). @sc{posix} pattern buffers have type
|
|
@code{regex_t}, which is equivalent to the @sc{gnu} pattern buffer
|
|
type @code{re_pattern_buffer}.
|
|
|
|
|
|
@node POSIX Regular Expression Compiling, POSIX Matching, POSIX Pattern Buffers, POSIX Regex Functions
|
|
@subsection POSIX Regular Expression Compiling
|
|
|
|
With @sc{posix}, you can only search for a given regular expression; you
|
|
can't match it. To do this, you must first compile it in a
|
|
pattern buffer, using @code{regcomp}.
|
|
|
|
@ignore
|
|
Before calling @code{regcomp}, you must initialize this pattern buffer
|
|
as you do for @sc{gnu} (@pxref{GNU Regular Expression Compiling}). See
|
|
below, however, for how to choose a syntax with which to compile.
|
|
@end ignore
|
|
|
|
To compile a regular expression into a pattern buffer, use:
|
|
|
|
@findex regcomp
|
|
@example
|
|
int
|
|
regcomp (regex_t *@var{preg}, const char *@var{regex}, int @var{cflags})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{preg} is the initialized pattern buffer's address, @var{regex} is
|
|
the regular expression's address, and @var{cflags} is the compilation
|
|
flags, which Regex considers as a collection of bits. Here are the
|
|
valid bits, as defined in @file{regex.h}:
|
|
|
|
@table @code
|
|
|
|
@item REG_EXTENDED
|
|
@vindex REG_EXTENDED
|
|
says to use @sc{posix} Extended Regular Expression syntax; if this isn't
set, @code{regcomp} uses @sc{posix} Basic Regular Expression syntax instead.
@code{regcomp} sets @var{preg}'s @code{syntax} field accordingly.
|
|
|
|
@item REG_ICASE
|
|
@vindex REG_ICASE
|
|
@cindex ignoring case
|
|
says to ignore case; @code{regcomp} sets @var{preg}'s @code{translate}
|
|
field to a translate table which ignores case, replacing anything you've
|
|
put there before.
|
|
|
|
@item REG_NOSUB
|
|
@vindex REG_NOSUB
|
|
says to set @var{preg}'s @code{no_sub} field; @pxref{POSIX Matching},
|
|
for what this means.
|
|
|
|
@item REG_NEWLINE
|
|
@vindex REG_NEWLINE
|
|
says that a:
|
|
|
|
@itemize @bullet
|
|
|
|
@item
|
|
match-any-character operator (@pxref{Match-any-character
|
|
Operator}) doesn't match a newline.
|
|
|
|
@item
|
|
nonmatching list not containing a newline (@pxref{List
Operators}) doesn't match a newline.
|
|
|
|
@item
|
|
match-beginning-of-line operator (@pxref{Match-beginning-of-line
|
|
Operator}) matches the empty string immediately after a newline,
|
|
regardless of how @code{REG_NOTBOL} is set (@pxref{POSIX Matching}, for
|
|
an explanation of @code{REG_NOTBOL}).
|
|
|
|
@item
|
|
match-end-of-line operator (@pxref{Match-end-of-line
Operator}) matches the empty string immediately before a newline,
regardless of how @code{REG_NOTEOL} is set (@pxref{POSIX Matching},
for an explanation of @code{REG_NOTEOL}).
|
|
|
|
@end itemize
|
|
|
|
@end table
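
The effect of @code{REG_NEWLINE} can be observed with a small wrapper
around the @sc{posix} functions. This is only a sketch: the helper name
@code{search_with_cflags} and its return convention are assumptions made
for illustration and are not part of Regex.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of Regex: report whether the basic
   regular expression PATTERN matches somewhere in TEXT when compiled
   with CFLAGS.  Returns 1 for a match, 0 for no match, and -1 if the
   pattern does not compile.  */
int
search_with_cflags (const char *pattern, const char *text, int cflags)
{
  regex_t preg;
  int status;

  if (regcomp (&preg, pattern, cflags | REG_NOSUB) != 0)
    return -1;

  /* REG_NOSUB was set, so nmatch and pmatch are ignored.  */
  status = regexec (&preg, text, 0, NULL, 0);
  regfree (&preg);
  return status == 0;
}
```

With the string @samp{a}, newline, @samp{b}, the pattern @samp{a.b}
should match only when @code{REG_NEWLINE} is clear, and @samp{^b} should
match only when it is set.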
|
|
|
|
If @code{regcomp} successfully compiles the regular expression, it
returns zero and sets @code{*@var{preg}} to the compiled
pattern. Except for @code{syntax} (which it sets as explained above), it
also sets the same fields the same way as does the @sc{gnu} compiling
function (@pxref{GNU Regular Expression Compiling}).
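
Because @var{cflags} is a collection of bits, several flags can be
combined with bitwise OR. The following sketch compiles with
@code{REG_EXTENDED}, @code{REG_ICASE} and @code{REG_NOSUB} together and
checks @code{regcomp}'s return value before using the buffer. The helper
name @code{contains_ignoring_case} is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of Regex: returns 1 if TEXT contains a
   match for the extended regular expression PATTERN, ignoring case;
   0 if it does not; -1 if PATTERN fails to compile.  */
int
contains_ignoring_case (const char *pattern, const char *text)
{
  regex_t preg;
  int status;

  /* A nonzero result from regcomp means nothing was compiled.  */
  if (regcomp (&preg, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
    return -1;

  status = regexec (&preg, text, 0, NULL, 0);
  regfree (&preg);
  return status == 0 ? 1 : 0;
}
```

For instance, @code{contains_ignoring_case ("gnu|posix", "The POSIX way")}
should report a match, since the alternation operator is available in the
extended syntax and case is ignored.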
|
|
|
|
If @code{regcomp} can't compile the regular expression, it returns one
of the error codes listed here. (Except when noted differently, the
syntax in all examples below is basic regular expression syntax.)
|
|
|
|
@table @code
|
|
|
|
@comment repetitions
|
|
@item REG_BADRPT
|
|
For example, the consecutive repetition operators @samp{**} in
|
|
@samp{a**} are invalid. As another example, if the syntax is extended
|
|
regular expression syntax, then the repetition operator @samp{*} with
|
|
nothing on which to operate in @samp{*} is invalid.
|
|
|
|
@item REG_BADBR
|
|
For example, the @var{count} @samp{-1} in @samp{a\@{-1} is invalid.
|
|
|
|
@item REG_EBRACE
|
|
For example, @samp{a\@{1} is missing a close-interval operator.
|
|
|
|
@comment lists
|
|
@item REG_EBRACK
|
|
For example, @samp{[a} is missing a close-list operator.
|
|
|
|
@item REG_ERANGE
|
|
For example, the range ending point @samp{a} that collates lower than
does its starting point @samp{z} in @samp{[z-a]} is invalid. Also, the
range with the character class @samp{[:alpha:]} as its starting point in
@samp{[[:alpha:]-|]} is invalid.
|
|
|
|
@item REG_ECTYPE
|
|
For example, the character class name @samp{foo} in @samp{[[:foo:]]} is
invalid.
|
|
|
|
@comment groups
|
|
@item REG_EPAREN
|
|
For example, @samp{a\)} is missing an open-group operator and @samp{\(a}
|
|
is missing a close-group operator.
|
|
|
|
@item REG_ESUBREG
|
|
For example, the back reference @samp{\2} that refers to a nonexistent
|
|
subexpression in @samp{\(a\)\2} is invalid.
|
|
|
|
@comment unfinished business
|
|
|
|
@item REG_EEND
|
|
Returned when a regular expression causes no other more specific error.
|
|
|
|
@item REG_EESCAPE
|
|
For example, the trailing backslash @samp{\} in @samp{a\} is invalid, as is the
|
|
one in @samp{\}.
|
|
|
|
@comment kitchen sink
|
|
@item REG_BADPAT
|
|
For example, in the extended regular expression syntax, the empty group
|
|
@samp{()} in @samp{a()b} is invalid.
|
|
|
|
@comment internal
|
|
@item REG_ESIZE
|
|
Returned when a regular expression needs a pattern buffer larger than
|
|
65536 bytes.
|
|
|
|
@item REG_ESPACE
|
|
Returned when a regular expression causes Regex to run out of memory.
|
|
|
|
@end table
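
A program can compare @code{regcomp}'s return value against these
constants. The sketch below assumes a hypothetical helper,
@code{compile_status}, that compiles a pattern only to find out which
code results; it is an illustration, not part of Regex.

```c
#include <regex.h>

/* Hypothetical helper, not part of Regex: compile PATTERN with CFLAGS
   and return whatever regcomp returned, i.e. zero on success or one of
   the REG_* error codes on failure.  */
int
compile_status (const char *pattern, int cflags)
{
  regex_t preg;
  int errcode = regcomp (&preg, pattern, cflags);

  /* Only a successfully compiled buffer holds storage to release.  */
  if (errcode == 0)
    regfree (&preg);
  return errcode;
}
```

For example, @code{compile_status ("[a", 0)} should yield
@code{REG_EBRACK}. Which code a given malformed pattern provokes can
differ between implementations of the @sc{posix} interface, so portable
code is safest testing only for zero versus nonzero.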
|
|
|
|
|
|
@node POSIX Matching, Reporting Errors, POSIX Regular Expression Compiling, POSIX Regex Functions
|
|
@subsection POSIX Matching
|
|
|
|
Matching the @sc{posix} way means trying to match a null-terminated
|
|
string starting at its first character. Once you've compiled a pattern
|
|
into a pattern buffer (@pxref{POSIX Regular Expression Compiling}), you
|
|
can ask the matcher to match that pattern against a string using:
|
|
|
|
@findex regexec
|
|
@example
|
|
int
|
|
regexec (const regex_t *@var{preg}, const char *@var{string},
|
|
size_t @var{nmatch}, regmatch_t @var{pmatch}[], int @var{eflags})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{preg} is the address of a pattern buffer for a compiled pattern.
|
|
@var{string} is the string you want to match.
|
|
|
|
@xref{Using Byte Offsets}, for an explanation of @var{pmatch}. If you
pass zero for @var{nmatch} or you compiled @var{preg} with the
compilation flag @code{REG_NOSUB} set, then @code{regexec} will ignore
@var{pmatch}; otherwise, you must allocate it to have at least
@var{nmatch} elements. @code{regexec} will record up to @var{nmatch}
pairs of byte offsets in @var{pmatch}, and set to @math{-1} the offsets
in any unused elements up to @code{@var{pmatch}[@var{nmatch} - 1]}.
|
|
|
|
@var{eflags} specifies @dfn{execution flags}---namely, the two bits
|
|
@code{REG_NOTBOL} and @code{REG_NOTEOL} (defined in @file{regex.h}). If
|
|
you set @code{REG_NOTBOL}, then the match-beginning-of-line operator
|
|
(@pxref{Match-beginning-of-line Operator}) always fails to match.
|
|
This lets you match against pieces of a line, as you would need to if,
|
|
say, searching for repeated instances of a given pattern in a line; it
|
|
would work correctly for patterns both with and without
|
|
match-beginning-of-line operators. @code{REG_NOTEOL} works analogously
|
|
for the match-end-of-line operator (@pxref{Match-end-of-line
|
|
Operator}); it exists for symmetry.
|
|
|
|
@code{regexec} tries to find a match for @var{preg} in @var{string}
|
|
according to the syntax in @var{preg}'s @code{syntax} field.
|
|
(@xref{POSIX Regular Expression Compiling}, for how to set it.) The
|
|
function returns zero if the compiled pattern matches @var{string} and
|
|
@code{REG_NOMATCH} (defined in @file{regex.h}) if it doesn't.
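
The use of @code{REG_NOTBOL} for finding repeated instances of a pattern
in a line can be sketched as follows. The helper name
@code{count_matches} is hypothetical, and the code relies on the
@var{pmatch} offsets described in @ref{Using Byte Offsets}.

```c
#include <regex.h>

/* Hypothetical helper, not part of Regex: count the non-overlapping
   matches of the basic regular expression PATTERN in LINE.  Returns
   -1 if PATTERN does not compile.  */
int
count_matches (const char *pattern, const char *line)
{
  regex_t preg;
  regmatch_t pmatch[1];
  const char *p = line;
  int eflags = 0;
  int count = 0;

  if (regcomp (&preg, pattern, 0) != 0)
    return -1;

  while (regexec (&preg, p, 1, pmatch, eflags) == 0)
    {
      count++;
      /* Offsets are relative to P, the string passed to regexec.  */
      if (pmatch[0].rm_eo == pmatch[0].rm_so)
        {
          /* Empty match: step over one character to make progress.  */
          if (p[pmatch[0].rm_eo] == '\0')
            break;
          p += pmatch[0].rm_eo + 1;
        }
      else
        p += pmatch[0].rm_eo;
      /* P no longer points at the beginning of the line.  */
      eflags = REG_NOTBOL;
    }

  regfree (&preg);
  return count;
}
```

Without @code{REG_NOTBOL} on the later calls, a pattern such as
@samp{^ab} would wrongly match again at each restart point in
@samp{abab}.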
|
|
|
|
@node Reporting Errors, Using Byte Offsets, POSIX Matching, POSIX Regex Functions
|
|
@subsection Reporting Errors
|
|
|
|
If either @code{regcomp} or @code{regexec} fails, it returns a nonzero
error code, the possibilities for which are defined in @file{regex.h}.
@xref{POSIX Regular Expression Compiling}, and @ref{POSIX Matching}, for
what these codes mean. To get an error string corresponding to these
codes, you can use:
|
|
|
|
@findex regerror
|
|
@example
|
|
size_t
|
|
regerror (int @var{errcode},
|
|
const regex_t *@var{preg},
|
|
char *@var{errbuf},
|
|
size_t @var{errbuf_size})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{errcode} is an error code, @var{preg} is the address of the pattern
|
|
buffer which provoked the error, @var{errbuf} is the error buffer, and
|
|
@var{errbuf_size} is @var{errbuf}'s size.
|
|
|
|
@code{regerror} returns the size in bytes of the error string
|
|
corresponding to @var{errcode} (including its terminating null). If
|
|
@var{errbuf} and @var{errbuf_size} are nonzero, it also returns in
|
|
@var{errbuf} the first @math{@var{errbuf_size} - 1} characters of the
|
|
error string, followed by a null.
|
|
@var{errbuf_size} must be a nonnegative number less than or equal to the
|
|
size in bytes of @var{errbuf}.
|
|
|
|
You can call @code{regerror} with a null @var{errbuf} and a zero
|
|
@var{errbuf_size} to determine how large @var{errbuf} need be to
|
|
accommodate @code{regerror}'s error string.
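
The two-call idiom just described might look like the following sketch.
The helper name @code{regex_error_string} is hypothetical; the caller is
assumed to pass an error code actually returned by @code{regcomp} or
@code{regexec} and to free the result.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Regex: return a newly allocated
   copy of the error string for ERRCODE, or NULL if memory runs out.
   The caller frees the result.  */
char *
regex_error_string (int errcode, const regex_t *preg)
{
  /* First call: a null buffer and zero size only report the size
     needed, including the terminating null.  */
  size_t needed = regerror (errcode, preg, NULL, 0);
  char *errbuf = malloc (needed);

  /* Second call: fill the buffer now that it is large enough.  */
  if (errbuf != NULL)
    regerror (errcode, preg, errbuf, needed);
  return errbuf;
}
```

A fixed-size buffer also works when truncated messages are acceptable,
since @code{regerror} never writes more than @var{errbuf_size} bytes.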
|
|
|
|
@node Using Byte Offsets, Freeing POSIX Pattern Buffers, Reporting Errors, POSIX Regex Functions
|
|
@subsection Using Byte Offsets
|
|
|
|
In @sc{posix}, variables of type @code{regmatch_t} hold information
analogous to, but not identical to, that in @sc{gnu}'s registers
(@pxref{Using Registers}). To get information about registers in
@sc{posix}, pass to @code{regexec} a nonzero @var{pmatch}, i.e., the
address of an array of structures of this type, defined in
@file{regex.h}:
|
|
|
|
@tindex regmatch_t
|
|
@example
|
|
typedef struct
|
|
@{
|
|
regoff_t rm_so;
|
|
regoff_t rm_eo;
|
|
@} regmatch_t;
|
|
@end example
|
|
|
|
When reading in @ref{Using Registers}, about how the matching function
stores the information into the registers, substitute @var{pmatch} for
@var{regs}, @code{@w{@var{pmatch}[@var{i}].}rm_so} for
@code{@w{@var{regs}->}start[@var{i}]} and
@code{@w{@var{pmatch}[@var{i}].}rm_eo} for
@code{@w{@var{regs}->}end[@var{i}]}.
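
As an illustration of reading the byte offsets, the following sketch
copies out the text matched by one parenthesized group. The helper name
@code{copy_group} and the limit @code{MAX_GROUPS} are assumptions chosen
for the example, not part of Regex.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Regex: search TEXT for the extended
   regular expression PATTERN and copy the text matched by group GROUP
   (0 is the whole match) into OUT, which holds OUTSIZE bytes.  Returns
   the number of bytes copied, or -1 if there is no match, the group
   took no part in the match, OUT is too small, or PATTERN does not
   compile.  */
int
copy_group (const char *pattern, const char *text, size_t group,
            char *out, size_t outsize)
{
  enum { MAX_GROUPS = 10 };
  regex_t preg;
  regmatch_t pmatch[MAX_GROUPS];
  int len = -1;

  if (group >= (size_t) MAX_GROUPS
      || regcomp (&preg, pattern, REG_EXTENDED) != 0)
    return -1;

  /* An rm_so of -1 marks an element regexec left unused.  */
  if (regexec (&preg, text, MAX_GROUPS, pmatch, 0) == 0
      && pmatch[group].rm_so != -1)
    {
      size_t n = (size_t) (pmatch[group].rm_eo - pmatch[group].rm_so);
      if (n < outsize)
        {
          memcpy (out, text + pmatch[group].rm_so, n);
          out[n] = '\0';
          len = (int) n;
        }
    }

  regfree (&preg);
  return len;
}
```

Here @code{rm_so} is the offset of the first byte of the group's match
and @code{rm_eo} is the offset just past its last byte, so their
difference is the length of the matched text.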
|
|
|
|
@node Freeing POSIX Pattern Buffers, , Using Byte Offsets, POSIX Regex Functions
|
|
@subsection Freeing POSIX Pattern Buffers
|
|
|
|
To free any allocated fields of a pattern buffer, use:
|
|
|
|
@findex regfree
|
|
@example
|
|
void
|
|
regfree (regex_t *@var{preg})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{preg} is the pattern buffer whose allocated fields you want freed.
|
|
@code{regfree} also sets @var{preg}'s @code{allocated} and @code{used}
|
|
fields to zero. After freeing a pattern buffer, you need to again
|
|
compile a regular expression in it (@pxref{POSIX Regular Expression
|
|
Compiling}) before passing it to the matching function (@pxref{POSIX
|
|
Matching}).
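
The compile, match, free cycle on a single pattern buffer can be
sketched as below. The helper name @code{count_matching_patterns} and
its parameters are hypothetical and serve only to show the ordering of
the calls.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of Regex: return how many of the
   NPATTERNS basic regular expressions in PATTERNS match somewhere in
   TEXT.  Patterns that do not compile are skipped.  */
int
count_matching_patterns (const char *const patterns[], size_t npatterns,
                         const char *text)
{
  regex_t preg;                 /* One buffer, reused for each pattern.  */
  size_t i;
  int count = 0;

  for (i = 0; i < npatterns; i++)
    {
      /* If compilation fails there is nothing to free.  */
      if (regcomp (&preg, patterns[i], REG_NOSUB) != 0)
        continue;
      if (regexec (&preg, text, 0, NULL, 0) == 0)
        count++;
      /* Release this pattern's storage before compiling the next
         pattern into the same buffer.  */
      regfree (&preg);
    }
  return count;
}
```

Calling @code{regfree} once per successful @code{regcomp} keeps the
allocated fields of the buffer from leaking when it is reused.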
|
|
|
|
|
|
@node BSD Regex Functions, , POSIX Regex Functions, Programming with Regex
|
|
@section BSD Regex Functions
|
|
|
|
If you're writing code that has to be Berkeley @sc{unix} compatible,
you'll need to use these functions, whose interfaces are the same as those
in Berkeley @sc{unix}.
|
|
|
|
@menu
|
|
* BSD Regular Expression Compiling:: re_comp ()
|
|
* BSD Searching:: re_exec ()
|
|
@end menu
|
|
|
|
@node BSD Regular Expression Compiling, BSD Searching, , BSD Regex Functions
|
|
@subsection BSD Regular Expression Compiling
|
|
|
|
With Berkeley @sc{unix}, you can only search for a given regular
expression; you can't match one. To search for it, you must first
compile it. Before you compile it, you must indicate the regular
expression syntax you want it compiled according to by setting the
variable @code{re_syntax_options} (declared in @file{regex.h}) to some
syntax (@pxref{Regular Expression Syntax}).
|
|
|
|
To compile a regular expression use:
|
|
|
|
@findex re_comp
|
|
@example
|
|
char *
|
|
re_comp (char *@var{regex})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{regex} is the address of a null-terminated regular expression.
|
|
@code{re_comp} uses an internal pattern buffer, so you can use only the
most recently compiled pattern. This means that if you want to
use a given regular expression that you've already compiled---but it
isn't the latest one you've compiled---you'll have to recompile it. If
you call @code{re_comp} with a null pointer (@emph{not} the empty
string) as the argument, it doesn't change the contents of the pattern
buffer.
|
|
|
|
If @code{re_comp} successfully compiles the regular expression, it
returns zero (a null pointer). If it can't compile the regular
expression, it returns an error string. @code{re_comp}'s error messages
are identical to those of @code{re_compile_pattern} (@pxref{GNU Regular
Expression Compiling}).
|
|
|
|
@node BSD Searching, , BSD Regular Expression Compiling, BSD Regex Functions
|
|
@subsection BSD Searching
|
|
|
|
Searching the Berkeley @sc{unix} way means searching in a string
|
|
starting at its first character and trying successive positions within
|
|
it to find a match. Once you've compiled a pattern using @code{re_comp}
|
|
(@pxref{BSD Regular Expression Compiling}), you can ask Regex
|
|
to search for that pattern in a string using:
|
|
|
|
@findex re_exec
|
|
@example
|
|
int
|
|
re_exec (char *@var{string})
|
|
@end example
|
|
|
|
@noindent
|
|
@var{string} is the address of the null-terminated string in which you
|
|
want to search.
|
|
|
|
@code{re_exec} returns either 1 for success or 0 for failure. It
|
|
automatically uses a @sc{gnu} fastmap (@pxref{Searching with Fastmaps}).
|
|
|
|
|
|
@node Copying, Index, Programming with Regex, Top
|
|
@appendix GNU GENERAL PUBLIC LICENSE
|
|
@center Version 2, June 1991
|
|
|
|
@display
|
|
Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc.
|
|
675 Mass Ave, Cambridge, MA 02139, USA
|
|
|
|
Everyone is permitted to copy and distribute verbatim copies
|
|
of this license document, but changing it is not allowed.
|
|
@end display
|
|
|
|
@unnumberedsec Preamble
|
|
|
|
The licenses for most software are designed to take away your
|
|
freedom to share and change it. By contrast, the GNU General Public
|
|
License is intended to guarantee your freedom to share and change free
|
|
software---to make sure the software is free for all its users. This
|
|
General Public License applies to most of the Free Software
|
|
Foundation's software and to any other program whose authors commit to
|
|
using it. (Some other Free Software Foundation software is covered by
|
|
the GNU Library General Public License instead.) You can apply it to
|
|
your programs, too.
|
|
|
|
When we speak of free software, we are referring to freedom, not
|
|
price. Our General Public Licenses are designed to make sure that you
|
|
have the freedom to distribute copies of free software (and charge for
|
|
this service if you wish), that you receive source code or can get it
|
|
if you want it, that you can change the software or use pieces of it
|
|
in new free programs; and that you know you can do these things.
|
|
|
|
To protect your rights, we need to make restrictions that forbid
|
|
anyone to deny you these rights or to ask you to surrender the rights.
|
|
These restrictions translate to certain responsibilities for you if you
|
|
distribute copies of the software, or if you modify it.
|
|
|
|
For example, if you distribute copies of such a program, whether
|
|
gratis or for a fee, you must give the recipients all the rights that
|
|
you have. You must make sure that they, too, receive or can get the
|
|
source code. And you must show them these terms so they know their
|
|
rights.
|
|
|
|
We protect your rights with two steps: (1) copyright the software, and
|
|
(2) offer you this license which gives you legal permission to copy,
|
|
distribute and/or modify the software.
|
|
|
|
Also, for each author's protection and ours, we want to make certain
|
|
that everyone understands that there is no warranty for this free
|
|
software. If the software is modified by someone else and passed on, we
|
|
want its recipients to know that what they have is not the original, so
|
|
that any problems introduced by others will not reflect on the original
|
|
authors' reputations.
|
|
|
|
Finally, any free program is threatened constantly by software
|
|
patents. We wish to avoid the danger that redistributors of a free
|
|
program will individually obtain patent licenses, in effect making the
|
|
program proprietary. To prevent this, we have made it clear that any
|
|
patent must be licensed for everyone's free use or not licensed at all.
|
|
|
|
The precise terms and conditions for copying, distribution and
|
|
modification follow.
|
|
|
|
@iftex
|
|
@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
|
@end iftex
|
|
@ifinfo
|
|
@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
|
|
@end ifinfo
|
|
|
|
@enumerate
|
|
@item
|
|
This License applies to any program or other work which contains
|
|
a notice placed by the copyright holder saying it may be distributed
|
|
under the terms of this General Public License. The ``Program'', below,
|
|
refers to any such program or work, and a ``work based on the Program''
|
|
means either the Program or any derivative work under copyright law:
|
|
that is to say, a work containing the Program or a portion of it,
|
|
either verbatim or with modifications and/or translated into another
|
|
language. (Hereinafter, translation is included without limitation in
|
|
the term ``modification''.) Each licensee is addressed as ``you''.
|
|
|
|
Activities other than copying, distribution and modification are not
|
|
covered by this License; they are outside its scope. The act of
|
|
running the Program is not restricted, and the output from the Program
|
|
is covered only if its contents constitute a work based on the
|
|
Program (independent of having been made by running the Program).
|
|
Whether that is true depends on what the Program does.
|
|
|
|
@item
|
|
You may copy and distribute verbatim copies of the Program's
|
|
source code as you receive it, in any medium, provided that you
|
|
conspicuously and appropriately publish on each copy an appropriate
|
|
copyright notice and disclaimer of warranty; keep intact all the
|
|
notices that refer to this License and to the absence of any warranty;
|
|
and give any other recipients of the Program a copy of this License
|
|
along with the Program.
|
|
|
|
You may charge a fee for the physical act of transferring a copy, and
|
|
you may at your option offer warranty protection in exchange for a fee.
|
|
|
|
@item
|
|
You may modify your copy or copies of the Program or any portion
|
|
of it, thus forming a work based on the Program, and copy and
|
|
distribute such modifications or work under the terms of Section 1
|
|
above, provided that you also meet all of these conditions:
|
|
|
|
@enumerate a
|
|
@item
|
|
You must cause the modified files to carry prominent notices
|
|
stating that you changed the files and the date of any change.
|
|
|
|
@item
|
|
You must cause any work that you distribute or publish, that in
|
|
whole or in part contains or is derived from the Program or any
|
|
part thereof, to be licensed as a whole at no charge to all third
|
|
parties under the terms of this License.
|
|
|
|
@item
|
|
If the modified program normally reads commands interactively
|
|
when run, you must cause it, when started running for such
|
|
interactive use in the most ordinary way, to print or display an
|
|
announcement including an appropriate copyright notice and a
|
|
notice that there is no warranty (or else, saying that you provide
|
|
a warranty) and that users may redistribute the program under
|
|
these conditions, and telling the user how to view a copy of this
|
|
License. (Exception: if the Program itself is interactive but
|
|
does not normally print such an announcement, your work based on
|
|
the Program is not required to print an announcement.)
|
|
@end enumerate
|
|
|
|
These requirements apply to the modified work as a whole. If
|
|
identifiable sections of that work are not derived from the Program,
|
|
and can be reasonably considered independent and separate works in
|
|
themselves, then this License, and its terms, do not apply to those
|
|
sections when you distribute them as separate works. But when you
|
|
distribute the same sections as part of a whole which is a work based
|
|
on the Program, the distribution of the whole must be on the terms of
|
|
this License, whose permissions for other licensees extend to the
|
|
entire whole, and thus to each and every part regardless of who wrote it.
|
|
|
|
Thus, it is not the intent of this section to claim rights or contest
|
|
your rights to work written entirely by you; rather, the intent is to
|
|
exercise the right to control the distribution of derivative or
|
|
collective works based on the Program.
|
|
|
|
In addition, mere aggregation of another work not based on the Program
|
|
with the Program (or with a work based on the Program) on a volume of
|
|
a storage or distribution medium does not bring the other work under
|
|
the scope of this License.
|
|
|
|
@item
|
|
You may copy and distribute the Program (or a work based on it,
|
|
under Section 2) in object code or executable form under the terms of
|
|
Sections 1 and 2 above provided that you also do one of the following:
|
|
|
|
@enumerate a
|
|
@item
|
|
Accompany it with the complete corresponding machine-readable
|
|
source code, which must be distributed under the terms of Sections
|
|
1 and 2 above on a medium customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with a written offer, valid for at least three
|
|
years, to give any third party, for a charge no more than your
|
|
cost of physically performing source distribution, a complete
|
|
machine-readable copy of the corresponding source code, to be
|
|
distributed under the terms of Sections 1 and 2 above on a medium
|
|
customarily used for software interchange; or,
|
|
|
|
@item
|
|
Accompany it with the information you received as to the offer
|
|
to distribute corresponding source code. (This alternative is
|
|
allowed only for noncommercial distribution and only if you
|
|
received the program in object code or executable form with such
|
|
an offer, in accord with Subsection b above.)
|
|
@end enumerate
|
|
|
|
The source code for a work means the preferred form of the work for
|
|
making modifications to it. For an executable work, complete source
|
|
code means all the source code for all modules it contains, plus any
|
|
associated interface definition files, plus the scripts used to
|
|
control compilation and installation of the executable. However, as a
|
|
special exception, the source code distributed need not include
|
|
anything that is normally distributed (in either source or binary
|
|
form) with the major components (compiler, kernel, and so on) of the
|
|
operating system on which the executable runs, unless that component
|
|
itself accompanies the executable.
|
|
|
|
If distribution of executable or object code is made by offering
|
|
access to copy from a designated place, then offering equivalent
|
|
access to copy the source code from the same place counts as
|
|
distribution of the source code, even though third parties are not
|
|
compelled to copy the source along with the object code.
|
|
|
|
@item
|
|
You may not copy, modify, sublicense, or distribute the Program
|
|
except as expressly provided under this License. Any attempt
|
|
otherwise to copy, modify, sublicense or distribute the Program is
|
|
void, and will automatically terminate your rights under this License.
|
|
However, parties who have received copies, or rights, from you under
|
|
this License will not have their licenses terminated so long as such
|
|
parties remain in full compliance.
|
|
|
|
@item
|
|
You are not required to accept this License, since you have not
|
|
signed it. However, nothing else grants you permission to modify or
|
|
distribute the Program or its derivative works. These actions are
|
|
prohibited by law if you do not accept this License. Therefore, by
|
|
modifying or distributing the Program (or any work based on the
|
|
Program), you indicate your acceptance of this License to do so, and
|
|
all its terms and conditions for copying, distributing or modifying
|
|
the Program or works based on it.
|
|
|
|
@item
|
|
Each time you redistribute the Program (or any work based on the
|
|
Program), the recipient automatically receives a license from the
|
|
original licensor to copy, distribute or modify the Program subject to
|
|
these terms and conditions. You may not impose any further
|
|
restrictions on the recipients' exercise of the rights granted herein.
|
|
You are not responsible for enforcing compliance by third parties to
|
|
this License.
|
|
|
|
@item
|
|
If, as a consequence of a court judgment or allegation of patent
|
|
infringement or for any other reason (not limited to patent issues),
|
|
conditions are imposed on you (whether by court order, agreement or
|
|
otherwise) that contradict the conditions of this License, they do not
|
|
excuse you from the conditions of this License. If you cannot
|
|
distribute so as to satisfy simultaneously your obligations under this
|
|
License and any other pertinent obligations, then as a consequence you
|
|
may not distribute the Program at all. For example, if a patent
|
|
license would not permit royalty-free redistribution of the Program by
|
|
all those who receive copies directly or indirectly through you, then
|
|
the only way you could satisfy both it and this License would be to
|
|
refrain entirely from distribution of the Program.
|
|
|
|
If any portion of this section is held invalid or unenforceable under
|
|
any particular circumstance, the balance of the section is intended to
|
|
apply and the section as a whole is intended to apply in other
|
|
circumstances.
|
|
|
|
It is not the purpose of this section to induce you to infringe any
|
|
patents or other property right claims or to contest validity of any
|
|
such claims; this section has the sole purpose of protecting the
|
|
integrity of the free software distribution system, which is
|
|
implemented by public license practices. Many people have made
|
|
generous contributions to the wide range of software distributed
|
|
through that system in reliance on consistent application of that
|
|
system; it is up to the author/donor to decide if he or she is willing
|
|
to distribute software through any other system and a licensee cannot
|
|
impose that choice.
|
|
|
|
This section is intended to make thoroughly clear what is believed to
|
|
be a consequence of the rest of this License.
|
|
|
|
@item
|
|
If the distribution and/or use of the Program is restricted in
|
|
certain countries either by patents or by copyrighted interfaces, the
|
|
original copyright holder who places the Program under this License
|
|
may add an explicit geographical distribution limitation excluding
|
|
those countries, so that distribution is permitted only in or among
|
|
countries not thus excluded. In such case, this License incorporates
|
|
the limitation as if written in the body of this License.
|
|
|
|
@item
|
|
The Free Software Foundation may publish revised and/or new versions
|
|
of the General Public License from time to time. Such new versions will
|
|
be similar in spirit to the present version, but may differ in detail to
|
|
address new problems or concerns.
|
|
|
|
Each version is given a distinguishing version number. If the Program
|
|
specifies a version number of this License which applies to it and ``any
|
|
later version'', you have the option of following the terms and conditions
|
|
either of that version or of any later version published by the Free
|
|
Software Foundation. If the Program does not specify a version number of
|
|
this License, you may choose any version ever published by the Free Software
|
|
Foundation.
|
|
|
|
@item
|
|
If you wish to incorporate parts of the Program into other free
|
|
programs whose distribution conditions are different, write to the author
|
|
to ask for permission. For software which is copyrighted by the Free
|
|
Software Foundation, write to the Free Software Foundation; we sometimes
|
|
make exceptions for this. Our decision will be guided by the two goals
|
|
of preserving the free status of all derivatives of our free software and
|
|
of promoting the sharing and reuse of software generally.
|
|
|
|
@iftex
|
|
@heading NO WARRANTY
|
|
@end iftex
|
|
@ifinfo
|
|
@center NO WARRANTY
|
|
@end ifinfo
|
|
|
|
@item
|
|
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
|
|
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
|
|
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
|
|
PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
|
|
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
|
|
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
|
|
TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
|
|
PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
|
|
REPAIR OR CORRECTION.
|
|
|
|
@item
|
|
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
|
|
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
|
|
REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
|
|
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
|
|
OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
|
|
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
|
|
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
|
|
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
|
|
POSSIBILITY OF SUCH DAMAGES.
|
|
@end enumerate
|
|
|
|
@iftex
|
|
@heading END OF TERMS AND CONDITIONS
|
|
@end iftex
|
|
@ifinfo
|
|
@center END OF TERMS AND CONDITIONS
|
|
@end ifinfo
|
|
|
|
@page
|
|
@unnumberedsec Appendix: How to Apply These Terms to Your New Programs
|
|
|
|
If you develop a new program, and you want it to be of the greatest
|
|
possible use to the public, the best way to achieve this is to make it
|
|
free software which everyone can redistribute and change under these terms.
|
|
|
|
To do so, attach the following notices to the program. It is safest
|
|
to attach them to the start of each source file to most effectively
|
|
convey the exclusion of warranty; and each file should have at least
|
|
the ``copyright'' line and a pointer to where the full notice is found.
|
|
|
|
@smallexample
|
|
@var{one line to give the program's name and a brief idea of what it does.}
|
|
Copyright (C) 19@var{yy} @var{name of author}
|
|
|
|
This program is free software; you can redistribute it and/or modify
|
|
it under the terms of the GNU General Public License as published by
|
|
the Free Software Foundation; either version 2 of the License, or
|
|
(at your option) any later version.
|
|
|
|
This program is distributed in the hope that it will be useful,
|
|
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
|
GNU General Public License for more details.
|
|
|
|
You should have received a copy of the GNU General Public License
|
|
along with this program; if not, write to the Free Software
|
|
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
|
|
@end smallexample
|
|
|
|
Also add information on how to contact you by electronic and paper mail.
|
|
|
|
If the program is interactive, make it output a short notice like this
|
|
when it starts in an interactive mode:
|
|
|
|
@smallexample
|
|
Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author}
|
|
Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
|
|
This is free software, and you are welcome to redistribute it
|
|
under certain conditions; type `show c' for details.
|
|
@end smallexample
|
|
|
|
The hypothetical commands @samp{show w} and @samp{show c} should show
|
|
the appropriate parts of the General Public License. Of course, the
|
|
commands you use may be called something other than @samp{show w} and
|
|
@samp{show c}; they could even be mouse-clicks or menu items---whatever
|
|
suits your program.
|
|
|
|
You should also get your employer (if you work as a programmer) or your
|
|
school, if any, to sign a ``copyright disclaimer'' for the program, if
|
|
necessary. Here is a sample; alter the names:
|
|
|
|
@example
|
|
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
|
|
`Gnomovision' (which makes passes at compilers) written by James Hacker.
|
|
|
|
@var{signature of Ty Coon}, 1 April 1989
|
|
Ty Coon, President of Vice
|
|
@end example
|
|
|
|
This General Public License does not permit incorporating your program into
|
|
proprietary programs. If your program is a subroutine library, you may
|
|
consider it more useful to permit linking proprietary applications with the
|
|
library. If this is what you want to do, use the GNU Library General
|
|
Public License instead of this License.
|
|
|
|
|
|
@node Index, , Copying, Top
|
|
@unnumbered Index
|
|
|
|
@printindex cp
|
|
|
|
@contents
|
|
|
|
@bye
|