[3.12] GH-107678: Improve Unicode handling clarity in `library/re.rst` (GH-107679) (#113965)

GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (GH-107679) (cherry picked from commit c9b8a22f34) Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
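For readers skimming the change, the behaviour that the reworded documentation describes can be illustrated with a minimal sketch. The sample strings and variable names below are illustrative assumptions and do not appear in the commit; only `re.findall`, `re.search`, `re.compile`, `re.ASCII`, `re.UNICODE` and the inline `(?a)` flag come from the documented module.

```python
import re

# Illustrative sample (an assumption): ASCII digits followed by
# ARABIC-INDIC DIGIT FOUR and TWO, which belong to Unicode category Nd.
text = "42 \u0664\u0662"

# str pattern, default (Unicode) matching: \d matches both runs of digits.
unicode_digits = re.findall(r"\d+", text)

# str pattern with re.ASCII, or the inline (?a) flag: \d is restricted to [0-9].
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)
inline_ascii_digits = re.findall(r"(?a)\d+", text)

# \w follows the same rule: non-ASCII letters count only in Unicode mode.
words = "caf\u00e9 na\u00efve"
unicode_words = re.findall(r"\w+", words)
ascii_words = re.findall(r"\w+", words, flags=re.ASCII)

# bytes pattern: \d is always [0-9], so the UTF-8 encoded digits are skipped.
bytes_digits = re.findall(rb"\d+", text.encode("utf-8"))

# A bytes pattern cannot be applied to a str (or vice versa).
try:
    re.search(rb"\d", text)
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# For str patterns compiled without re.ASCII, re.UNICODE is set implicitly.
implicit_unicode = bool(re.compile(r"\d").flags & re.UNICODE)
```

Under these assumptions, the Unicode searches return both digit runs and whole words, while the ASCII and bytes searches stop at the first non-ASCII character.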
2025-09-28 11:15:17 +00:00 · 2024-01-12 01:02:28 +01:00 · 2024-01-12 01:02:28 +01:00 · bd9ea91e5f
commit bd9ea91e5f
parent b902671d36
1 changed files with 144 additions and 91 deletions
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -17,7 +17,7 @@ those found in Perl.
 Both patterns and strings to be searched can be Unicode strings (:class:`str`)
 as well as 8-bit strings (:class:`bytes`).
 However, Unicode strings and 8-bit strings cannot be mixed:
-that is, you cannot match a Unicode string with a byte pattern or
+that is, you cannot match a Unicode string with a bytes pattern or
 vice-versa; similarly, when asking for a substitution, the replacement
 string must be of the same type as both the pattern and the search string.
@@ -257,8 +257,7 @@ The special characters are:
   .. index:: single: \ (backslash); in regular expressions
   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
-     inside a set, although the characters they match depends on whether
-     :const:`ASCII` or :const:`LOCALE` mode is in force.
+     inside a set, although the characters they match depend on the flags_ used.
   .. index:: single: ^ (caret); in regular expressions
@@ -326,18 +325,24 @@ The special characters are:
   currently supported extensions.
 ``(?aiLmsux)``
-   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+   (One or more letters from the set
-   ``'s'``, ``'u'``, ``'x'``.)  The group matches the empty string; the
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
-   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
+   The group matches the empty string;
-   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
+   the letters set the corresponding flags for the entire regular expression:
-   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
+
-   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
+   * :const:`re.A` (ASCII-only matching)
-   for the entire regular expression.
+   * :const:`re.I` (ignore case)
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
-   :func:`re.compile` function.  Flags should be used first in the
+   :func:`re.compile` function.
-   expression string.
+   Flags should be used first in the expression string.
   .. versionchanged:: 3.11
      This construction can only be used at the start of the expression.
@@ -351,14 +356,20 @@ The special characters are:
   pattern.
 ``(?aiLmsux-imsx:...)``
-   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+   (Zero or more letters from the set
-   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+   ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
+   optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
-   The letters set or remove the corresponding flags:
+   The letters set or remove the corresponding flags for the part of the expression:
-   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
+
-   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
+   * :const:`re.A` (ASCII-only matching)
-   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
+   * :const:`re.I` (ignore case)
-   and :const:`re.X` (verbose), for the part of the expression.
+   * :const:`re.L` (locale dependent)
+   * :const:`re.M` (multi-line)
+   * :const:`re.S` (dot matches all)
+   * :const:`re.U` (Unicode matching)
+   * :const:`re.X` (verbose)
   (The flags are described in :ref:`contents-of-module-re`.)
   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@@ -366,7 +377,7 @@ The special characters are:
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group.  In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
-   (default).  In byte pattern ``(?L:...)`` switches to locale depending
+   (default).  In bytes patterns ``(?L:...)`` switches to locale dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.
@@ -529,47 +540,61 @@ character ``'$'``.
 ``\b``
   Matches the empty string, but only at the beginning or end of a word.
-   A word is defined as a sequence of word characters.  Note that formally,
+   A word is defined as a sequence of word characters.
-   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
+   Note that formally, ``\b`` is defined as the boundary
-   (or vice versa), or between ``\w`` and the beginning/end of the string.
+   between a ``\w`` and a ``\W`` character (or vice versa),
-   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
+   or between ``\w`` and the beginning or end of the string.
-   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
+   This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
+   and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
-   By default Unicode alphanumerics are the ones used in Unicode patterns, but
+   The default word characters in Unicode (str) patterns
-   this can be changed by using the :const:`ASCII` flag.  Word boundaries are
+   are Unicode alphanumerics and the underscore,
-   determined by the current locale if the :const:`LOCALE` flag is used.
+   but this can be changed by using the :py:const:`~re.ASCII` flag.
-   Inside a character range, ``\b`` represents the backspace character, for
+   Word boundaries are determined by the current locale
-   compatibility with Python's string literals.
+   if the :py:const:`~re.LOCALE` flag is used.
+   .. note::
+      Inside a character range, ``\b`` represents the backspace character,
+      for compatibility with Python's string literals.
 .. index:: single: \B; in regular expressions
 ``\B``
-   Matches the empty string, but only when it is *not* at the beginning or end
+   Matches the empty string,
-   of a word.  This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
+   but only when it is *not* at the beginning or end of a word.
-   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
+   This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
-   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
+   ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
-   patterns are Unicode alphanumerics or the underscore, although this can
+   ``\B`` is the opposite of ``\b``,
-   be changed by using the :const:`ASCII` flag.  Word boundaries are
+   so word characters in Unicode (str) patterns
-   determined by the current locale if the :const:`LOCALE` flag is used.
+   are Unicode alphanumerics or the underscore,
+   although this can be changed by using the :py:const:`~re.ASCII` flag.
+   Word boundaries are determined by the current locale
+   if the :py:const:`~re.LOCALE` flag is used.
 .. index:: single: \d; in regular expressions
 ``\d``
   For Unicode (str) patterns:
-      Matches any Unicode decimal digit (that is, any character in
+      Matches any Unicode decimal digit
-      Unicode character category [Nd]).  This includes ``[0-9]``, and
+      (that is, any character in Unicode character category `[Nd]`__).
-      also many other digit characters.  If the :const:`ASCII` flag is
+      This includes ``[0-9]``, and also many other digit characters.
-      used only ``[0-9]`` is matched.
+
+      Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
+      __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
   For 8-bit (bytes) patterns:
-      Matches any decimal digit; this is equivalent to ``[0-9]``.
+      Matches any decimal digit in the ASCII character set;
+      this is equivalent to ``[0-9]``.
 .. index:: single: \D; in regular expressions
 ``\D``
-   Matches any character which is not a decimal digit. This is
+   Matches any character which is not a decimal digit.
-   the opposite of ``\d``. If the :const:`ASCII` flag is used this
+   This is the opposite of ``\d``.
-   becomes the equivalent of ``[^0-9]``.
+
+   Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
 .. index:: single: \s; in regular expressions
@@ -578,8 +603,9 @@ character ``'$'``.
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
-      languages). If the :const:`ASCII` flag is used, only
+      languages).
-      ``[ \t\n\r\f\v]`` is matched.
+
+      Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
@@ -589,30 +615,39 @@ character ``'$'``.
 ``\S``
   Matches any character which is not a whitespace character. This is
-   the opposite of ``\s``. If the :const:`ASCII` flag is used this
+   the opposite of ``\s``.
-   becomes the equivalent of ``[^ \t\n\r\f\v]``.
+
+   Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
 .. index:: single: \w; in regular expressions
 ``\w``
   For Unicode (str) patterns:
-      Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
+      Matches Unicode word characters;
+      this includes all Unicode alphanumeric characters
+      (as defined by :py:meth:`str.isalnum`),
+      as well as the underscore (``_``).
-      If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
+
+      Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
-      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
+      this is equivalent to ``[a-zA-Z0-9_]``.
-      used, matches characters considered alphanumeric in the current locale
+      If the :py:const:`~re.LOCALE` flag is used,
-      and the underscore.
+      matches characters considered alphanumeric in the current locale and the underscore.
 .. index:: single: \W; in regular expressions
 ``\W``
-   Matches any character which is not a word character. This is
+   Matches any character which is not a word character.
-   the opposite of ``\w``. If the :const:`ASCII` flag is used this
+   This is the opposite of ``\w``.
-   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
+   By default, matches non-underscore (``_``) characters
-   used, matches characters which are neither alphanumeric in the current locale
+   for which :py:meth:`str.isalnum` returns ``False``.
+   Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
+   If the :py:const:`~re.LOCALE` flag is used,
+   matches characters which are neither alphanumeric in the current locale
   nor the underscore.
 .. index:: single: \Z; in regular expressions
@@ -644,9 +679,11 @@ string literals are also accepted by the regular expression parser::
 (Note that ``\b`` is used to represent word boundaries, and means "backspace"
 only inside character classes.)
-``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
+``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
-patterns.  In bytes patterns they are errors.  Unknown escapes of ASCII
+only recognized in Unicode (str) patterns.
-letters are reserved for future use and treated as errors.
+In bytes patterns they are errors.
+Unknown escapes of ASCII letters are reserved
+for future use and treated as errors.
 Octal escapes are included in a limited form.  If the first digit is a 0, or if
 there are three octal digits, it is considered an octal escape. Otherwise, it is
@@ -694,30 +731,37 @@ Flags
   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
-   meaningful for Unicode patterns, and is ignored for byte patterns.
+   meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
   Corresponds to the inline flag ``(?a)``.
-   Note that for backward compatibility, the :const:`re.U` flag still
+   .. note::
-   exists (as well as its synonym :const:`re.UNICODE` and its embedded
+
-   counterpart ``(?u)``), but these are redundant in Python 3 since
+      The :py:const:`~re.U` flag still exists for backward compatibility,
-   matches are Unicode by default for strings (and Unicode matching
+      but is redundant in Python 3 since
-   isn't allowed for bytes).
+      matches are Unicode by default for ``str`` patterns,
+      and Unicode matching isn't allowed for bytes patterns.
+      :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
 .. data:: DEBUG
   Display debug information about compiled expression.
   No corresponding inline flag.
 .. data:: I
          IGNORECASE
-   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
+   Perform case-insensitive matching;
-   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
+   expressions like ``[A-Z]`` will also  match lowercase letters.
-   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
+   Full Unicode matching (such as ``Ü`` matching ``ü``)
-   non-ASCII matches.  The current locale does not change the effect of this
+   also works unless the :py:const:`~re.ASCII` flag
-   flag unless the :const:`re.LOCALE` flag is also used.
+   is used to disable non-ASCII matches.
+   The current locale does not change the effect of this flag
+   unless the :py:const:`~re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.
   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@@ -725,29 +769,35 @@ Flags
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
-   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
+   If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.
 .. data:: L
          LOCALE
   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
-   dependent on the current locale.  This flag can be used only with bytes
+   dependent on the current locale.
-   patterns.  The use of this flag is discouraged as the locale mechanism
+   This flag can be used only with bytes patterns.
-   is very unreliable, it only handles one "culture" at a time, and it only
+
-   works with 8-bit locales.  Unicode matching is already enabled by default
-   in Python 3 for Unicode (str) patterns, and it is able to handle different
-   locales/languages.
   Corresponds to the inline flag ``(?L)``.
+   .. warning::
+      This flag is discouraged; consider Unicode matching instead.
+      The locale mechanism is very unreliable
+      as it only handles one "culture" at a time
+      and only works with 8-bit locales.
+      Unicode matching is enabled by default for Unicode (str) patterns
+      and it is able to handle different locales and languages.
   .. versionchanged:: 3.6
-      :const:`re.LOCALE` can be used only with bytes patterns and is
+      :py:const:`~re.LOCALE` can be used only with bytes patterns
-      not compatible with :const:`re.ASCII`.
+      and is not compatible with :py:const:`~re.ASCII`.
   .. versionchanged:: 3.7
-      Compiled regular expression objects with the :const:`re.LOCALE` flag no
+      Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
-      longer depend on the locale at compile time.  Only the locale at
+      no longer depend on the locale at compile time.
-      matching time affects the result of matching.
+      Only the locale at matching time affects the result of matching.
 .. data:: M
@@ -759,6 +809,7 @@ Flags
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.
 .. data:: NOFLAG
@@ -778,19 +829,19 @@ Flags
   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.
 .. data:: U
          UNICODE
-   In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
+   In Python 3, Unicode characters are matched by default
-   include Unicode characters in matches. Since Python 3, Unicode characters
+   for ``str`` patterns.
-   are matched by default.
+   This flag is therefore redundant with **no effect**
+   and is only kept for backward compatibility.
-   See :const:`A` for restricting matching on ASCII characters instead.
+   See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
-   This flag is only kept for backward compatibility.
 .. data:: X
          VERBOSE
@@ -914,6 +965,8 @@ Functions
   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.
+   .. code:: pycon
      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
@@ -1231,7 +1284,7 @@ Regular Expression Objects
   The regex matching flags.  This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
-   flags such as :data:`UNICODE` if the pattern is a Unicode string.
+   flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
 .. attribute:: Pattern.groups