Various minor edits

Andrew M. Kuchling 2007-01-29 20:55:40 +00:00
parent 85acbca511
commit 5781dd2d7c
3 changed files with 73 additions and 59 deletions


@ -1,7 +1,7 @@
Short-term tasks: Short-term tasks:
Quick revision pass to make HOWTOs match the current state of Python: Quick revision pass to make HOWTOs match the current state of Python
doanddont regex sockets sorting doanddont regex sockets
Medium-term tasks: Medium-term tasks:
Revisit the regex howto. Revisit the regex howto.


@ -32,7 +32,7 @@ plain dangerous.
\subsubsection{Inside Function Definitions} \subsubsection{Inside Function Definitions}
\code{from module import *} is {\em invalid} inside function definitions. \code{from module import *} is {\em invalid} inside function definitions.
While many versions of Python do no check for the invalidity, it does not While many versions of Python do not check for the invalidity, it does not
make it more valid, no more than having a smart lawyer makes a man innocent. make it more valid, no more than having a smart lawyer makes a man innocent.
Do not use it like that ever. Even in versions where it was accepted, it made Do not use it like that ever. Even in versions where it was accepted, it made
the function execution slower, because the compiler could not be certain the function execution slower, because the compiler could not be certain


@ -34,17 +34,18 @@ This document is available from
The \module{re} module was added in Python 1.5, and provides The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns. Earlier versions of Python Perl-style regular expression patterns. Earlier versions of Python
came with the \module{regex} module, which provided Emacs-style came with the \module{regex} module, which provided Emacs-style
patterns. \module{regex} module was removed in Python 2.5. patterns. The \module{regex} module was removed completely in Python 2.5.
Regular expressions (or REs) are essentially a tiny, highly Regular expressions (called REs, or regexes, or regex patterns) are
specialized programming language embedded inside Python and made essentially a tiny, highly specialized programming language embedded
available through the \module{re} module. Using this little language, inside Python and made available through the \module{re} module.
you specify the rules for the set of possible strings that you want to Using this little language, you specify the rules for the set of
match; this set might contain English sentences, or e-mail addresses, possible strings that you want to match; this set might contain
or TeX commands, or anything you like. You can then ask questions English sentences, or e-mail addresses, or TeX commands, or anything
such as ``Does this string match the pattern?'', or ``Is there a match you like. You can then ask questions such as ``Does this string match
for the pattern anywhere in this string?''. You can also use REs to the pattern?'', or ``Is there a match for the pattern anywhere in this
modify a string or to split it apart in various ways. string?''. You can also use REs to modify a string or to split it
apart in various ways.
Regular expression patterns are compiled into a series of bytecodes Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C. For which are then executed by a matching engine written in C. For
@ -80,11 +81,12 @@ example, the regular expression \regexp{test} will match the string
would let this RE match \samp{Test} or \samp{TEST} as well; more would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.) about this later.)
There are exceptions to this rule; some characters are There are exceptions to this rule; some characters are special
special, and don't match themselves. Instead, they signal that some \dfn{metacharacters}, and don't match themselves. Instead, they
out-of-the-ordinary thing should be matched, or they affect other signal that some out-of-the-ordinary thing should be matched, or they
portions of the RE by repeating them. Much of this document is affect other portions of the RE by repeating them or changing their
devoted to discussing various metacharacters and what they do. meaning. Much of this document is devoted to discussing various
metacharacters and what they do.
Here's a complete list of the metacharacters; their meanings will be Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO. discussed in the rest of this HOWTO.
@ -111,9 +113,10 @@ Metacharacters are not active inside classes. For example,
usually a metacharacter, but inside a character class it's stripped of usually a metacharacter, but inside a character class it's stripped of
its special nature. its special nature.
You can match the characters not within a range by \dfn{complementing} You can match the characters not listed within the class by
the set. This is indicated by including a \character{\^} as the first \dfn{complementing} the set. This is indicated by including a
character of the class; \character{\^} elsewhere will simply match the \character{\^} as the first character of the class; \character{\^}
outside a character class will simply match the
\character{\^} character. For example, \verb|[^5]| will match any \character{\^} character. For example, \verb|[^5]| will match any
character except \character{5}. character except \character{5}.
@ -176,7 +179,7 @@ or more times, instead of exactly once.
For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth. The RE engine has various internal characters), and so forth. The RE engine has various internal
limitations stemming from the size of C's \code{int} type, that will limitations stemming from the size of C's \code{int} type that will
prevent it from matching over 2 billion \samp{a} characters; you prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit. you shouldn't run into that limit.
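A minimal sketch of the \regexp{ca*t} example, assuming Python 3 (where \method{fullmatch()} is available). The extra string \samp{cbt} is an added counterexample, not part of the original text.

```python
import re

pattern = re.compile(r'ca*t')

# fullmatch() requires the whole string to match the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in ['ct', 'cat', 'caaat', 'cbt']}
print(results)
```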
@ -238,9 +241,9 @@ will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
You can omit either \var{m} or \var{n}; in that case, a reasonable You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value. Omitting \var{m} is value is assumed for the missing value. Omitting \var{m} is
interpreted as a lower limit of 0, while omitting \var{n} results in an interpreted as a lower limit of 0, while omitting \var{n} results in
upper bound of infinity --- actually, the 2 billion limit mentioned an upper bound of infinity --- actually, the upper bound is the
earlier, but that might as well be infinity. 2-billion limit mentioned earlier, but that might as well be infinity.
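The behaviour of omitted bounds can be sketched as follows, assuming Python 3. The helper name \code{matches} is hypothetical, and the pattern \regexp{a/\{1,3\}b} is inferred from the strings quoted in the hunk header rather than shown in this excerpt.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Between one and three slashes.
bounded = [matches(r'a/{1,3}b', s) for s in ['ab', 'a/b', 'a///b', 'a////b']]

# Omitting m gives a lower limit of 0; omitting n leaves the upper bound open.
no_lower = [matches(r'a/{,2}b', s) for s in ['ab', 'a//b', 'a///b']]
no_upper = [matches(r'a/{2,}b', s) for s in ['a/b', 'a//b', 'a' + '/' * 50 + 'b']]

print(bounded, no_lower, no_upper)
```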
Readers of a reductionist bent may notice that the three other qualifiers Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation. \regexp{\{0,\}} is the same can all be expressed using this notation. \regexp{\{0,\}} is the same
@ -285,7 +288,7 @@ them. (There are applications that don't need REs at all, so there's
no need to bloat the language specification by including them.) no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib} included with Python, just like the \module{socket} or \module{zlib}
module. modules.
Putting REs in strings keeps the Python language simpler, but has one Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section. disadvantage which is the topic of the next section.
@ -326,7 +329,7 @@ expressions; backslashes are not handled in any special way in
a string literal prefixed with \character{r}, so \code{r"\e n"} is a a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n}, two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline. while \code{"\e n"} is a one-character string containing a newline.
Frequently regular expressions will be expressed in Python Regular expressions will often be written in Python
code using this raw string notation. code using this raw string notation.
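The difference between regular and raw string literals can be confirmed directly. This is a small sketch assuming Python 3; the \samp{\e section} sample text is illustrative only.

```python
import re

regular = "\n"    # one character: a newline
raw = r"\n"       # two characters: a backslash followed by 'n'
lengths = (len(regular), len(raw))

# To match a literal backslash, a regular string needs four backslashes
# where a raw string needs two; both spell the same pattern.
same_pattern = "\\\\section" == r"\\section"
found = re.search(r"\\section", r"\section{Intro}") is not None

print(lengths, same_pattern, found)
```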
\begin{tableii}{c|c}{code}{Regular String}{Raw string} \begin{tableii}{c|c}{code}{Regular String}{Raw string}
@ -368,9 +371,9 @@ strings, and displays whether the RE matches or fails.
\file{redemo.py} can be quite useful when trying to debug a \file{redemo.py} can be quite useful when trying to debug a
complicated RE. Phil Schwartz's complicated RE. Phil Schwartz's
\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive \ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
tool for developing and testing RE patterns. This HOWTO will use the tool for developing and testing RE patterns.
standard Python interpreter for its examples.
This HOWTO uses the standard Python interpreter for its examples.
First, run the Python interpreter, import the \module{re} module, and First, run the Python interpreter, import the \module{re} module, and
compile a RE: compile a RE:
@ -401,7 +404,7 @@ Now, let's try it on a string that it should match, such as
later use. later use.
\begin{verbatim} \begin{verbatim}
>>> m = p.match( 'tempo') >>> m = p.match('tempo')
>>> print m >>> print m
<_sre.SRE_Match object at 80c4f68> <_sre.SRE_Match object at 80c4f68>
\end{verbatim} \end{verbatim}
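For readers on Python 3, where \code{print} is a function, the session above might look like the sketch below. The pattern \regexp{[a-z]+} is an assumption, because the \code{re.compile()} call that defines \code{p} falls outside this hunk.

```python
import re

p = re.compile('[a-z]+')   # assumed pattern; not shown in this hunk

m = p.match('tempo')
print(m)                   # a match object; its repr differs between versions
matched_text = m.group()
span = m.span()

# '+' needs at least one character, so the empty string does not match.
no_match = p.match('')
```

Saving the result in a variable, as the original does with \code{m}, lets you test for \code{None} before calling methods on it.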
@ -472,9 +475,9 @@ Two \class{RegexObject} methods return all of the matches for a pattern.
\end{verbatim} \end{verbatim}
\method{findall()} has to create the entire list before it can be \method{findall()} has to create the entire list before it can be
returned as the result. In Python 2.2, the \method{finditer()} method returned as the result. The \method{finditer()} method returns a
is also available, returning a sequence of \class{MatchObject} instances sequence of \class{MatchObject} instances as an
as an iterator. iterator.\footnote{Introduced in Python 2.2.2.}
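The contrast between \method{findall()} and \method{finditer()} can be sketched as follows, assuming Python 3. The digit pattern \regexp{\e d+} is an assumption, since the definition of \code{p} is not part of this hunk.

```python
import re

p = re.compile(r'\d+')   # assumed pattern: runs of digits
text = '12 drummers drumming, 11 ... 10 ...'

# findall() builds the complete list of matching strings up front.
all_at_once = p.findall(text)

# finditer() yields match objects one at a time.
spans = [m.span() for m in p.finditer(text)]

print(all_at_once)
print(spans)
```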
\begin{verbatim} \begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@ -491,13 +494,13 @@ as an iterator.
\subsection{Module-Level Functions} \subsection{Module-Level Functions}
You don't have to produce a \class{RegexObject} and call its methods; You don't have to create a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called the \module{re} module also provides top-level functions called
\function{match()}, \function{search()}, \function{sub()}, and so \function{match()}, \function{search()}, \function{findall()},
forth. These functions take the same arguments as the corresponding \function{sub()}, and so forth. These functions take the same
\class{RegexObject} method, with the RE string added as the first arguments as the corresponding \class{RegexObject} method, with the RE
argument, and still return either \code{None} or a \class{MatchObject} string added as the first argument, and still return either
instance. \code{None} or a \class{MatchObject} instance.
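A short sketch of the equivalence between module-level functions and \class{RegexObject} methods, assuming Python 3. The sample header string is illustrative.

```python
import re

sample = 'From amk Thu May 14 19:12:10 1998'

# Module-level call: the pattern string is the first argument.
direct = re.match(r'From\s+', sample)

# Equivalent call through a compiled pattern object.
compiled = re.compile(r'From\s+')
via_object = compiled.match(sample)

same_span = direct.span() == via_object.span()
no_match = re.match(r'From\s+', 'Fromage amk')   # no whitespace after 'From'
digits = re.findall(r'\d+', '12 drummers, 11 pipers')

print(same_span, no_match, digits)
```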
\begin{verbatim} \begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk') >>> print re.match(r'From\s+', 'Fromage amk')
@ -514,7 +517,7 @@ RE are faster.
Should you use these module-level functions, or should you get the Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself? That choice \class{RegexObject} and call its methods yourself? That choice
depends on how frequently the RE will be used, and on your personal depends on how frequently the RE will be used, and on your personal
coding style. If a RE is being used at only one point in the code, coding style. If the RE is being used at only one point in the code,
then the module functions are probably more convenient. If a program then the module functions are probably more convenient. If a program
contains a lot of regular expressions, or re-uses the same ones in contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the several locations, then it might be worthwhile to collect all the
@ -537,7 +540,7 @@ as I am.
Compilation flags let you modify some aspects of how regular Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the \module{re} module under expressions work. Flags are available in the \module{re} module under
two names, a long name such as \constant{IGNORECASE}, and a short, two names, a long name such as \constant{IGNORECASE} and a short,
one-letter form such as \constant{I}. (If you're familiar with Perl's one-letter form such as \constant{I}. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
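The long and short flag names can be compared directly. This sketch assumes Python 3, and the test strings are made up for illustration.

```python
import re

# The long and short names refer to the same flag.
same_flag = re.IGNORECASE == re.I and re.VERBOSE == re.X

p = re.compile(r'test', re.IGNORECASE)
hits = [bool(p.match(s)) for s in ['test', 'Test', 'TEST', 'toast']]

# Several flags can be combined with bitwise OR.
q = re.compile(r'^test$', re.I | re.M)
lines = q.findall('Test\nTEST\nnot a test')

print(same_flag, hits, lines)
```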
@ -617,7 +620,7 @@ that are more readable by granting you more flexibility in how you can
format them. When this flag has been specified, whitespace within the format them. When this flag has been specified, whitespace within the
RE string is ignored, except when the whitespace is in a character RE string is ignored, except when the whitespace is in a character
class or preceded by an unescaped backslash; this lets you organize class or preceded by an unescaped backslash; this lets you organize
and indent the RE more clearly. It also enables you to put comments and indent the RE more clearly. This flag also lets you put comments
within a RE that will be ignored by the engine; comments are marked by within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash. unescaped backslash.
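A minimal sketch of these \constant{VERBOSE} rules, assuming Python 3. The decimal-number pattern is hypothetical and chosen only to keep the example small.

```python
import re

# Hypothetical pattern for illustration: a simple decimal number.
number = re.compile(r"""
    \d+        # integer part
    (          # optional fractional part
        [.]    # a literal dot
        \d+
    )?
""", re.VERBOSE)

# The same pattern written without whitespace or comments.
compact = re.compile(r"\d+([.]\d+)?")

samples = ['3', '3.14', '3.', 'x']
agree = all(bool(number.fullmatch(s)) == bool(compact.fullmatch(s))
            for s in samples)

# Whitespace inside a character class stays significant in verbose mode.
space_class = re.compile(r"a [ ] b", re.VERBOSE)
with_space = bool(space_class.fullmatch('a b'))
without_space = bool(space_class.fullmatch('ab'))

print(agree, with_space, without_space)
```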
@ -629,18 +632,19 @@ much easier it is to read?
charref = re.compile(r""" charref = re.compile(r"""
&[#] # Start of a numeric entity reference &[#] # Start of a numeric entity reference
( (
[0-9]+[^0-9] # Decimal form 0[0-7]+ # Octal form
| 0[0-7]+[^0-7] # Octal form | [0-9]+ # Decimal form
| x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form | x[0-9a-fA-F]+ # Hexadecimal form
) )
; # Trailing semicolon
""", re.VERBOSE) """, re.VERBOSE)
\end{verbatim} \end{verbatim}
Without the verbose setting, the RE would look like this: Without the verbose setting, the RE would look like this:
\begin{verbatim} \begin{verbatim}
charref = re.compile("&#([0-9]+[^0-9]" charref = re.compile("&#(0[0-7]+"
"|0[0-7]+[^0-7]" "|[0-9]+"
"|x[0-9a-fA-F]+[^0-9a-fA-F])") "|x[0-9a-fA-F]+);")
\end{verbatim} \end{verbatim}
In the above example, Python's automatic concatenation of string In the above example, Python's automatic concatenation of string
@ -722,12 +726,12 @@ inside a character class, as in \regexp{[\$]}.
\item[\regexp{\e A}] Matches only at the start of the string. When \item[\regexp{\e A}] Matches only at the start of the string. When
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
effectively the same. In \constant{MULTILINE} mode, however, they're effectively the same. In \constant{MULTILINE} mode, they're
different; \regexp{\e A} still matches only at the beginning of the different: \regexp{\e A} still matches only at the beginning of the
string, but \regexp{\^} may match at any location inside the string string, but \regexp{\^} may match at any location inside the string
that follows a newline character. that follows a newline character.
\item[\regexp{\e Z}]Matches only at the end of the string. \item[\regexp{\e Z}] Matches only at the end of the string.
\item[\regexp{\e b}] Word boundary. \item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the This is a zero-width assertion that matches only at the
@ -782,14 +786,23 @@ RE matched or not. Regular expressions are often used to dissect
strings by writing a RE divided into several subgroups which strings by writing a RE divided into several subgroups which
match different components of interest. For example, an RFC-822 match different components of interest. For example, an RFC-822
header line is divided into a header name and a value, separated by a header line is divided into a header name and a value, separated by a
\character{:}. This can be handled by writing a regular expression \character{:}, like this:
\begin{verbatim}
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
\end{verbatim}
This can be handled by writing a regular expression
which matches an entire header line, and has one group which matches the which matches an entire header line, and has one group which matches the
header name, and another group which matches the header's value. header name, and another group which matches the header's value.
Groups are marked by the \character{(}, \character{)} metacharacters. Groups are marked by the \character{(}, \character{)} metacharacters.
\character{(} and \character{)} have much the same meaning as they do \character{(} and \character{)} have much the same meaning as they do
in mathematical expressions; they group together the expressions in mathematical expressions; they group together the expressions
contained inside them. For example, you can repeat the contents of a contained inside them, and you can repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+}, group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, \regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}. \regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
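Both uses of groups mentioned in this hunk can be sketched together, assuming Python 3. The header pattern is a hypothetical simplification, not a full RFC-822 parser.

```python
import re

# A repeated group: zero or more repetitions of 'ab'.
p = re.compile(r'(ab)*')
span = p.match('ababababab').span()

# Hypothetical header pattern: group 1 is the name, group 2 the value.
header = re.compile(r'([^:]+):\s*(.*)')
m = header.match('User-Agent: Thunderbird 1.5.0.9 (X11/20061227)')
name, value = m.group(1), m.group(2)

print(span, name, value)
```

Splitting on the first colon works for the sample headers in this hunk; real header parsing would also need to handle folded continuation lines.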
@ -882,11 +895,12 @@ syntax for regular expression extensions, so we'll look at that first.
Perl 5 added several additional features to standard regular Perl 5 added several additional features to standard regular
expressions, and the Python \module{re} module supports most of them. expressions, and the Python \module{re} module supports most of them.
It would have been difficult to choose new single-keystroke It would have been difficult to choose new
metacharacters or new special sequences beginning with \samp{\e} to single-keystroke metacharacters or new special sequences beginning
represent the new features without making Perl's regular expressions with \samp{\e} to represent the new features without making Perl's
confusingly different from standard REs. If you chose \samp{\&} as a regular expressions confusingly different from standard REs. If you
new metacharacter, for example, old expressions would be assuming that chose \samp{\&} as a new metacharacter, for example, old expressions
would be assuming that
\samp{\&} was a regular character and wouldn't have escaped it by \samp{\&} was a regular character and wouldn't have escaped it by
writing \regexp{\e \&} or \regexp{[\&]}. writing \regexp{\e \&} or \regexp{[\&]}.