[3.12] GH-107678: Improve Unicode handling clarity in `library/re.rst` (GH-107679) (#113965)

GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679)
(cherry picked from commit c9b8a22f34)

Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Miss Islington (bot) 2024-01-12 01:02:28 +01:00 committed by GitHub
parent b902671d36
commit bd9ea91e5f


@ -17,7 +17,7 @@ those found in Perl.
Both patterns and strings to be searched can be Unicode strings (:class:`str`) Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`). as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed: However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or that is, you cannot match a Unicode string with a bytes pattern or
vice-versa; similarly, when asking for a substitution, the replacement vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string. string must be of the same type as both the pattern and the search string.
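Illustrative note on this hunk: a minimal sketch of the str/bytes rule described above. The sample text and variable names are made up for illustration and are not part of the patch.

```python
import re

text = "spam and eggs"

# A str pattern against str text works, as does bytes against bytes.
str_match = re.search(r"eggs", text)
bytes_match = re.search(rb"eggs", text.encode("ascii"))

# Mixing a bytes pattern with a str subject raises TypeError.
try:
    re.search(rb"eggs", text)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_match.group(), bytes_match.group(), type(mixed_error).__name__)
```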
@ -257,8 +257,7 @@ The special characters are:
.. index:: single: \ (backslash); in regular expressions .. index:: single: \ (backslash); in regular expressions
* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
inside a set, although the characters they match depends on whether inside a set, although the characters they match depend on the flags_ used.
:const:`ASCII` or :const:`LOCALE` mode is in force.
.. index:: single: ^ (caret); in regular expressions .. index:: single: ^ (caret); in regular expressions
@ -326,18 +325,24 @@ The special characters are:
currently supported extensions. currently supported extensions.
``(?aiLmsux)`` ``(?aiLmsux)``
(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, (One or more letters from the set
``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
letters set the corresponding flags: :const:`re.A` (ASCII-only matching), The group matches the empty string;
:const:`re.I` (ignore case), :const:`re.L` (locale dependent), the letters set the corresponding flags for the entire regular expression:
:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
:const:`re.U` (Unicode matching), and :const:`re.X` (verbose), * :const:`re.A` (ASCII-only matching)
for the entire regular expression. * :const:`re.I` (ignore case)
* :const:`re.L` (locale dependent)
* :const:`re.M` (multi-line)
* :const:`re.S` (dot matches all)
* :const:`re.U` (Unicode matching)
* :const:`re.X` (verbose)
(The flags are described in :ref:`contents-of-module-re`.) (The flags are described in :ref:`contents-of-module-re`.)
This is useful if you wish to include the flags as part of the This is useful if you wish to include the flags as part of the
regular expression, instead of passing a *flag* argument to the regular expression, instead of passing a *flag* argument to the
:func:`re.compile` function. Flags should be used first in the :func:`re.compile` function.
expression string. Flags should be used first in the expression string.
.. versionchanged:: 3.11 .. versionchanged:: 3.11
This construction can only be used at the start of the expression. This construction can only be used at the start of the expression.
@ -351,14 +356,20 @@ The special characters are:
pattern. pattern.
``(?aiLmsux-imsx:...)`` ``(?aiLmsux-imsx:...)``
(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, (Zero or more letters from the set
``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
optionally followed by ``'-'`` followed by
one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.) one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
The letters set or remove the corresponding flags: The letters set or remove the corresponding flags for the part of the expression:
:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
:const:`re.L` (locale dependent), :const:`re.M` (multi-line), * :const:`re.A` (ASCII-only matching)
:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching), * :const:`re.I` (ignore case)
and :const:`re.X` (verbose), for the part of the expression. * :const:`re.L` (locale dependent)
* :const:`re.M` (multi-line)
* :const:`re.S` (dot matches all)
* :const:`re.U` (Unicode matching)
* :const:`re.X` (verbose)
(The flags are described in :ref:`contents-of-module-re`.) (The flags are described in :ref:`contents-of-module-re`.)
The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@ -366,7 +377,7 @@ The special characters are:
when one of them appears in an inline group, it overrides the matching mode when one of them appears in an inline group, it overrides the matching mode
in the enclosing group. In Unicode patterns ``(?a:...)`` switches to in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
ASCII-only matching, and ``(?u:...)`` switches to Unicode matching ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
(default). In byte pattern ``(?L:...)`` switches to locale depending (default). In bytes patterns ``(?L:...)`` switches to locale dependent
matching, and ``(?a:...)`` switches to ASCII-only matching (default). matching, and ``(?a:...)`` switches to ASCII-only matching (default).
This override is only in effect for the narrow inline group, and the This override is only in effect for the narrow inline group, and the
original matching mode is restored outside of the group. original matching mode is restored outside of the group.
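Illustrative note on this hunk: a small sketch of how an inline ``(?a:...)`` group overrides the matching mode only inside the group. The sample word is an assumption chosen for illustration.

```python
import re

word = "caf\u00e9"  # 'café', ending in a non-ASCII letter

# Default for str patterns: Unicode matching, so \w+ covers the whole word.
unicode_default = re.findall(r"\w+", word)

# Inside (?a:...), \w is ASCII-only, so the match stops before 'é'.
ascii_group = re.findall(r"(?a:\w+)", word)

# Outside the group Unicode matching is restored: the trailing \w matches 'é'.
restored = re.fullmatch(r"(?a:\w+)\w", word)

print(unicode_default, ascii_group, restored is not None)
```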
@ -529,47 +540,61 @@ character ``'$'``.
``\b`` ``\b``
Matches the empty string, but only at the beginning or end of a word. Matches the empty string, but only at the beginning or end of a word.
A word is defined as a sequence of word characters. Note that formally, A word is defined as a sequence of word characters.
``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character Note that formally, ``\b`` is defined as the boundary
(or vice versa), or between ``\w`` and the beginning/end of the string. between a ``\w`` and a ``\W`` character (or vice versa),
This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``, or between ``\w`` and the beginning or end of the string.
``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``. This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
By default Unicode alphanumerics are the ones used in Unicode patterns, but The default word characters in Unicode (str) patterns
this can be changed by using the :const:`ASCII` flag. Word boundaries are are Unicode alphanumerics and the underscore,
determined by the current locale if the :const:`LOCALE` flag is used. but this can be changed by using the :py:const:`~re.ASCII` flag.
Inside a character range, ``\b`` represents the backspace character, for Word boundaries are determined by the current locale
compatibility with Python's string literals. if the :py:const:`~re.LOCALE` flag is used.
.. note::
Inside a character range, ``\b`` represents the backspace character,
for compatibility with Python's string literals.
.. index:: single: \B; in regular expressions .. index:: single: \B; in regular expressions
``\B`` ``\B``
Matches the empty string, but only when it is *not* at the beginning or end Matches the empty string,
of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, but only when it is *not* at the beginning or end of a word.
``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``. This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
``\B`` is just the opposite of ``\b``, so word characters in Unicode ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
patterns are Unicode alphanumerics or the underscore, although this can ``\B`` is the opposite of ``\b``,
be changed by using the :const:`ASCII` flag. Word boundaries are so word characters in Unicode (str) patterns
determined by the current locale if the :const:`LOCALE` flag is used. are Unicode alphanumerics or the underscore,
although this can be changed by using the :py:const:`~re.ASCII` flag.
Word boundaries are determined by the current locale
if the :py:const:`~re.LOCALE` flag is used.
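Illustrative note on this hunk: the ``\b`` and ``\B`` examples from the new wording can be checked with a short sketch like the following. The helper lists are hypothetical names used only here.

```python
import re

boundary = re.compile(r"\bat\b")
candidates = ["at", "at.", "(at)", "as at ay", "attempt", "atlas"]
# Keep only the strings in which 'at' appears as a whole word.
whole_word = [s for s in candidates if boundary.search(s)]

non_boundary = re.compile(r"at\B")
candidates_b = ["athens", "atom", "attorney", "at", "at.", "at!"]
# Keep only the strings in which 'at' is followed by another word character.
inside_word = [s for s in candidates_b if non_boundary.search(s)]

# Inside a character class, \b means the backspace character (0x08).
backspace = re.fullmatch(r"[\b]", "\x08")

print(whole_word, inside_word, backspace is not None)
```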
.. index:: single: \d; in regular expressions .. index:: single: \d; in regular expressions
``\d`` ``\d``
For Unicode (str) patterns: For Unicode (str) patterns:
Matches any Unicode decimal digit (that is, any character in Matches any Unicode decimal digit
Unicode character category [Nd]). This includes ``[0-9]``, and (that is, any character in Unicode character category `[Nd]`__).
also many other digit characters. If the :const:`ASCII` flag is This includes ``[0-9]``, and also many other digit characters.
used only ``[0-9]`` is matched.
Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
For 8-bit (bytes) patterns: For 8-bit (bytes) patterns:
Matches any decimal digit; this is equivalent to ``[0-9]``. Matches any decimal digit in the ASCII character set;
this is equivalent to ``[0-9]``.
.. index:: single: \D; in regular expressions .. index:: single: \D; in regular expressions
``\D`` ``\D``
Matches any character which is not a decimal digit. This is Matches any character which is not a decimal digit.
the opposite of ``\d``. If the :const:`ASCII` flag is used this This is the opposite of ``\d``.
becomes the equivalent of ``[^0-9]``.
Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
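Illustrative note on this hunk: a sketch of ``\d`` and ``\D`` with and without the ASCII flag. The sample string mixes an ASCII digit with two other characters from Unicode category Nd; it is an assumption chosen for illustration.

```python
import re

# '1', ARABIC-INDIC DIGIT TWO (U+0662), FULLWIDTH DIGIT THREE (U+FF13)
digits = "1\u0662\uff13"

unicode_digits = re.findall(r"\d", digits)           # all three match
ascii_digits = re.findall(r"\d", digits, re.ASCII)   # only '1' matches
ascii_non_digits = re.findall(r"\D", digits, re.ASCII)

# For bytes patterns, \d is always equivalent to [0-9].
bytes_digits = re.findall(rb"\d", b"1a2")

print(unicode_digits, ascii_digits, ascii_non_digits, bytes_digits)
```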
.. index:: single: \s; in regular expressions .. index:: single: \s; in regular expressions
@ -578,8 +603,9 @@ character ``'$'``.
Matches Unicode whitespace characters (which includes Matches Unicode whitespace characters (which includes
``[ \t\n\r\f\v]``, and also many other characters, for example the ``[ \t\n\r\f\v]``, and also many other characters, for example the
non-breaking spaces mandated by typography rules in many non-breaking spaces mandated by typography rules in many
languages). If the :const:`ASCII` flag is used, only languages).
``[ \t\n\r\f\v]`` is matched.
Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns: For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set; Matches characters considered whitespace in the ASCII character set;
@ -589,30 +615,39 @@ character ``'$'``.
``\S`` ``\S``
Matches any character which is not a whitespace character. This is Matches any character which is not a whitespace character. This is
the opposite of ``\s``. If the :const:`ASCII` flag is used this the opposite of ``\s``.
becomes the equivalent of ``[^ \t\n\r\f\v]``.
Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
.. index:: single: \w; in regular expressions .. index:: single: \w; in regular expressions
``\w`` ``\w``
For Unicode (str) patterns: For Unicode (str) patterns:
Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`) Matches Unicode word characters;
this includes all Unicode alphanumeric characters
(as defined by :py:meth:`str.isalnum`),
as well as the underscore (``_``). as well as the underscore (``_``).
If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns: For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set; Matches characters considered alphanumeric in the ASCII character set;
this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is this is equivalent to ``[a-zA-Z0-9_]``.
used, matches characters considered alphanumeric in the current locale If the :py:const:`~re.LOCALE` flag is used,
and the underscore. matches characters considered alphanumeric in the current locale and the underscore.
.. index:: single: \W; in regular expressions .. index:: single: \W; in regular expressions
``\W`` ``\W``
Matches any character which is not a word character. This is Matches any character which is not a word character.
the opposite of ``\w``. If the :const:`ASCII` flag is used this This is the opposite of ``\w``.
becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is By default, matches non-underscore (``_``) characters
used, matches characters which are neither alphanumeric in the current locale for which :py:meth:`str.isalnum` returns ``False``.
Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
If the :py:const:`~re.LOCALE` flag is used,
matches characters which are neither alphanumeric in the current locale
nor the underscore. nor the underscore.
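Illustrative note on this hunk: a sketch of ``\w`` and ``\W`` under the default Unicode mode and under the ASCII flag. The sample strings are assumptions chosen for illustration.

```python
import re

name = "na\u00efve_user"  # 'naïve_user'

# Unicode mode: 'ï' is alphanumeric and '_' is a word character.
unicode_words = re.findall(r"\w+", name)

# ASCII mode: 'ï' is not in [a-zA-Z0-9_], so the word is split around it.
ascii_words = re.findall(r"\w+", name, re.ASCII)

# \W is the complement: here only the hyphen is a non-word character.
non_word = re.findall(r"\W", "a-b_c")

print(unicode_words, ascii_words, non_word)
```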
.. index:: single: \Z; in regular expressions .. index:: single: \Z; in regular expressions
@ -644,9 +679,11 @@ string literals are also accepted by the regular expression parser::
(Note that ``\b`` is used to represent word boundaries, and means "backspace" (Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.) only inside character classes.)
``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode ``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
patterns. In bytes patterns they are errors. Unknown escapes of ASCII only recognized in Unicode (str) patterns.
letters are reserved for future use and treated as errors. In bytes patterns they are errors.
Unknown escapes of ASCII letters are reserved
for future use and treated as errors.
Octal escapes are included in a limited form. If the first digit is a 0, or if Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is there are three octal digits, it is considered an octal escape. Otherwise, it is
@ -694,30 +731,37 @@ Flags
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
perform ASCII-only matching instead of full Unicode matching. This is only perform ASCII-only matching instead of full Unicode matching. This is only
meaningful for Unicode patterns, and is ignored for byte patterns. meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
Corresponds to the inline flag ``(?a)``. Corresponds to the inline flag ``(?a)``.
Note that for backward compatibility, the :const:`re.U` flag still .. note::
exists (as well as its synonym :const:`re.UNICODE` and its embedded
counterpart ``(?u)``), but these are redundant in Python 3 since The :py:const:`~re.U` flag still exists for backward compatibility,
matches are Unicode by default for strings (and Unicode matching but is redundant in Python 3 since
isn't allowed for bytes). matches are Unicode by default for ``str`` patterns,
and Unicode matching isn't allowed for bytes patterns.
:py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
.. data:: DEBUG .. data:: DEBUG
Display debug information about compiled expression. Display debug information about compiled expression.
No corresponding inline flag. No corresponding inline flag.
.. data:: I .. data:: I
IGNORECASE IGNORECASE
Perform case-insensitive matching; expressions like ``[A-Z]`` will also Perform case-insensitive matching;
match lowercase letters. Full Unicode matching (such as ``Ü`` matching expressions like ``[A-Z]`` will also match lowercase letters.
``ü``) also works unless the :const:`re.ASCII` flag is used to disable Full Unicode matching (such as ``Ü`` matching ``ü``)
non-ASCII matches. The current locale does not change the effect of this also works unless the :py:const:`~re.ASCII` flag
flag unless the :const:`re.LOCALE` flag is also used. is used to disable non-ASCII matches.
The current locale does not change the effect of this flag
unless the :py:const:`~re.LOCALE` flag is also used.
Corresponds to the inline flag ``(?i)``. Corresponds to the inline flag ``(?i)``.
Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@ -725,29 +769,35 @@ Flags
letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
If the :const:`ASCII` flag is used, only letters 'a' to 'z' If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
and 'A' to 'Z' are matched. and 'A' to 'Z' are matched.
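Illustrative note on this hunk: a sketch of how IGNORECASE interacts with the ASCII flag, using the Kelvin sign and the ``Ü``/``ü`` pair mentioned in the text. The variable names are hypothetical.

```python
import re

kelvin = "\u212a"  # KELVIN SIGN, which case-folds to 'k'

# In Unicode mode, [a-z] with IGNORECASE also matches the Kelvin sign.
kelvin_unicode = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE)

# With ASCII added, only 'a'-'z' and 'A'-'Z' are matched.
kelvin_ascii = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE | re.ASCII)

# 'Ü' matches 'ü' case-insensitively unless ASCII disables non-ASCII matches.
umlaut_unicode = re.fullmatch("\u00dc", "\u00fc", re.IGNORECASE)
umlaut_ascii = re.fullmatch("\u00dc", "\u00fc", re.IGNORECASE | re.ASCII)

print(kelvin_unicode is not None, kelvin_ascii is None)
print(umlaut_unicode is not None, umlaut_ascii is None)
```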
.. data:: L .. data:: L
LOCALE LOCALE
Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
dependent on the current locale. This flag can be used only with bytes dependent on the current locale.
patterns. The use of this flag is discouraged as the locale mechanism This flag can be used only with bytes patterns.
is very unreliable, it only handles one "culture" at a time, and it only
works with 8-bit locales. Unicode matching is already enabled by default
in Python 3 for Unicode (str) patterns, and it is able to handle different
locales/languages.
Corresponds to the inline flag ``(?L)``. Corresponds to the inline flag ``(?L)``.
.. warning::
This flag is discouraged; consider Unicode matching instead.
The locale mechanism is very unreliable
as it only handles one "culture" at a time
and only works with 8-bit locales.
Unicode matching is enabled by default for Unicode (str) patterns
and it is able to handle different locales and languages.
.. versionchanged:: 3.6 .. versionchanged:: 3.6
:const:`re.LOCALE` can be used only with bytes patterns and is :py:const:`~re.LOCALE` can be used only with bytes patterns
not compatible with :const:`re.ASCII`. and is not compatible with :py:const:`~re.ASCII`.
.. versionchanged:: 3.7 .. versionchanged:: 3.7
Compiled regular expression objects with the :const:`re.LOCALE` flag no Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
longer depend on the locale at compile time. Only the locale at no longer depend on the locale at compile time.
matching time affects the result of matching. Only the locale at matching time affects the result of matching.
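Illustrative note on this hunk: a sketch of the two LOCALE restrictions noted under ``versionchanged:: 3.6``, assuming Python 3.6 or later. The helper ``compile_error`` is a hypothetical name introduced only for this example.

```python
import re

def compile_error(pattern, flags):
    """Return the exception raised by re.compile, or None if it succeeds."""
    try:
        re.compile(pattern, flags)
    except ValueError as exc:
        return exc
    return None

# LOCALE is rejected for str patterns.
str_locale = compile_error(r"\w+", re.LOCALE)

# LOCALE and ASCII cannot be combined, even for bytes patterns.
both_flags = compile_error(rb"\w+", re.LOCALE | re.ASCII)

# LOCALE on its own with a bytes pattern is accepted.
bytes_locale = compile_error(rb"\w+", re.LOCALE)

print(str_locale, both_flags, bytes_locale)
```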
.. data:: M .. data:: M
@ -759,6 +809,7 @@ Flags
end of each line (immediately preceding each newline). By default, ``'^'`` end of each line (immediately preceding each newline). By default, ``'^'``
matches only at the beginning of the string, and ``'$'`` only at the end of the matches only at the beginning of the string, and ``'$'`` only at the end of the
string and immediately before the newline (if any) at the end of the string. string and immediately before the newline (if any) at the end of the string.
Corresponds to the inline flag ``(?m)``. Corresponds to the inline flag ``(?m)``.
.. data:: NOFLAG .. data:: NOFLAG
@ -778,19 +829,19 @@ Flags
Make the ``'.'`` special character match any character at all, including a Make the ``'.'`` special character match any character at all, including a
newline; without this flag, ``'.'`` will match anything *except* a newline. newline; without this flag, ``'.'`` will match anything *except* a newline.
Corresponds to the inline flag ``(?s)``. Corresponds to the inline flag ``(?s)``.
.. data:: U .. data:: U
UNICODE UNICODE
In Python 2, this flag made :ref:`special sequences <re-special-sequences>` In Python 3, Unicode characters are matched by default
include Unicode characters in matches. Since Python 3, Unicode characters for ``str`` patterns.
are matched by default. This flag is therefore redundant with **no effect**
and is only kept for backward compatibility.
See :const:`A` for restricting matching on ASCII characters instead. See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
This flag is only kept for backward compatibility.
.. data:: X .. data:: X
VERBOSE VERBOSE
@ -914,6 +965,8 @@ Functions
Empty matches for the pattern split the string only when not adjacent Empty matches for the pattern split the string only when not adjacent
to a previous empty match. to a previous empty match.
.. code:: pycon
>>> re.split(r'\b', 'Words, words, words.') >>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.'] ['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...') >>> re.split(r'\W*', '...words...')
@ -1231,7 +1284,7 @@ Regular Expression Objects
The regex matching flags. This is a combination of the flags given to The regex matching flags. This is a combination of the flags given to
:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
flags such as :data:`UNICODE` if the pattern is a Unicode string. flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
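Illustrative note on this hunk: a sketch showing the implicit ``UNICODE`` flag on compiled str patterns, and that passing ``re.U`` explicitly changes nothing. The patterns are arbitrary examples.

```python
import re

str_pattern = re.compile(r"\w+")
bytes_pattern = re.compile(rb"\w+")
explicit_u = re.compile(r"\w+", re.UNICODE)
inline_flag = re.compile(r"(?i)\w+")

# str patterns carry re.UNICODE implicitly; bytes patterns do not.
str_has_unicode = bool(str_pattern.flags & re.UNICODE)
bytes_has_unicode = bool(bytes_pattern.flags & re.UNICODE)

# Passing re.UNICODE explicitly gives the same flags as the default.
same_flags = explicit_u.flags == str_pattern.flags

# Inline flags show up in Pattern.flags as well.
inline_has_ignorecase = bool(inline_flag.flags & re.IGNORECASE)

print(str_has_unicode, bytes_has_unicode, same_flags, inline_has_ignorecase)
```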
.. attribute:: Pattern.groups .. attribute:: Pattern.groups