15.7 Backslash in Regular Expressions
For the most part, ‘ \
’ followed by any character matches only
that character. However, there are several exceptions: two-character
sequences starting with ‘ \
’ that have special meanings. The
second character in the sequence is always an ordinary character when
used on its own. Here is a table of ‘ \
’ constructs.
\|
specifies an alternative. Two regular expressions a and b
with ‘ \|
’ in between form an expression that matches some text if
either a matches it or b matches it. It works by trying to
match a, and if that fails, by trying to match b.
Thus, ‘ foo\|bar
’ matches either ‘ foo
’ or ‘ bar
’
but no other string.
‘ \|
’ applies to the largest possible surrounding expressions. Only a
surrounding ‘ \( … \)
’ grouping can limit the grouping power of
‘ \|
’.
Full backtracking capability exists to handle multiple uses of ‘ \|
’.
\( … \)
is a grouping construct that serves three purposes:
-
To enclose a set of ‘
\|
’ alternatives for other operations. Thus, ‘\(foo\|bar\)x
’ matches either ‘foox
’ or ‘barx
’. -
To enclose a complicated expression for the postfix operators ‘
*
’, ‘+
’ and ‘?
’ to operate on. Thus, ‘ba\(na\)*
’ matches ‘bananana
’, etc., with any (zero or more) number of ‘na
’ strings. -
To record a matched substring for future reference.
This last application is not a consequence of the idea of a
parenthetical grouping; it is a separate feature that is assigned as a
second meaning to the same ‘ \( … \)
’ construct. In practice
there is usually no conflict between the two meanings; when there is
a conflict, you can use a shy group, described below.
\(?: … \)
¶
specifies a shy group that does not record the matched substring;
you can’t refer back to it with ‘ \d
’ (see below). This is
useful in mechanically combining regular expressions, so that you can
add groups for syntactic purposes without interfering with the
numbering of the groups that are meant to be referred to.
\d
¶
matches the same text that matched the dth occurrence of a
‘ \( … \)
’ construct. This is called a back
reference.
After the end of a ‘ \( … \)
’ construct, the matcher remembers
the beginning and end of the text matched by that construct. Then,
later on in the regular expression, you can use ‘ \
’ followed by the
digit d to mean “match the same text matched the dth time
by the ‘ \( … \)
’ construct”.
The strings matching the first nine ‘ \( … \)
’ constructs
appearing in a regular expression are assigned numbers 1 through 9 in
the order that the open-parentheses appear in the regular expression.
So you can use ‘ \1
’ through ‘ \9
’ to refer to the text matched
by the corresponding ‘ \( … \)
’ constructs.
For example, ‘ \(.*\)\1
’ matches any newline-free string that is
composed of two identical halves. The ‘ \(.*\)
’ matches the first
half, which may be anything, but the ‘ \1
’ that follows must match
the same exact text.
If a particular ‘ \( … \)
’ construct matches more than once
(which can easily happen if it is followed by ‘ *
’), only the last
match is recorded.
\`
matches the empty string, but only at the beginning of the string or buffer (or its accessible portion) being matched against.
\'
matches the empty string, but only at the end of the string or buffer (or its accessible portion) being matched against.
\=
matches the empty string, but only at point.
\b
matches the empty string, but only at the beginning or
end of a word. Thus, ‘ \bfoo\b
’ matches any occurrence of
‘ foo
’ as a separate word. ‘ \bballs?\b
’ matches
‘ ball
’ or ‘ balls
’ as a separate word.
‘ \b
’ matches at the beginning or end of the buffer
regardless of what text appears next to it.
\B
matches the empty string, but not at the beginning or end of a word.
\<
matches the empty string, but only at the beginning of a word.
‘ \<
’ matches at the beginning of the buffer only if a
word-constituent character follows.
\>
matches the empty string, but only at the end of a word. ‘ \>
’
matches at the end of the buffer only if the contents end with a
word-constituent character.
\w
matches any word-constituent character. The syntax table determines which characters these are. See Syntax Tables in The Emacs Lisp Reference Manual.
\W
matches any character that is not a word-constituent.
\_<
matches the empty string, but only at the beginning of a symbol.
A symbol is a sequence of one or more symbol-constituent characters.
A symbol-constituent character is a character whose syntax is either
‘ w
’ or ‘ _
’. ‘ \_<
’ matches at the beginning of the
buffer only if a symbol-constituent character follows. As with words,
the syntax table determines which characters are symbol-constituent.
\_>
matches the empty string, but only at the end of a symbol. ‘ \_>
’
matches at the end of the buffer only if the contents end with a
symbol-constituent character.
\sc
matches any character whose syntax is c. Here c is a
character that designates a particular syntax class: thus, ‘ w
’
for word constituent, ‘ -
’ or ‘
’ for whitespace, ‘ .
’
for ordinary punctuation, etc. See Syntax Class Table in The Emacs Lisp Reference Manual.
\Sc
matches any character whose syntax is not c.
\cc
matches any character that belongs to the category c. For
example, ‘ \cc
’ matches Chinese characters, ‘ \cg
’ matches
Greek characters, etc. For the description of the known categories,
type M-x describe-categories RET
.
\Cc
matches any character that does not belong to category c.
The constructs that pertain to words and syntax are controlled by the setting of the syntax table. See Syntax Tables in The Emacs Lisp Reference Manual.