mirror of https://github.com/python/cpython.git
synced 2025-11-01 02:38:53 +00:00
Added a note in each regarding the fact that unicode strings that look the same
may not compare equal (due to the possibility of multiple representations).
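
The behaviour this note describes can be checked with a short sketch. It is written for Python 3, where every ``str`` is Unicode, rather than with the Python 2 ``u"..."`` literals used in the patch, and the variable names are illustrative only. The decomposed string here places the base letter before the combining mark, which is the order Unicode's canonical decomposition of U+00C7 uses.

```python
import unicodedata

# One code point: LATIN CAPITAL LETTER C WITH CEDILLA.
precomposed = "\u00C7"
# Two code points: LATIN CAPITAL LETTER C followed by COMBINING CEDILLA.
decomposed = "\u0043\u0327"

# Both render as the same glyph, but the code point sequences differ,
# so a plain comparison reports them as unequal.
plain_equal = precomposed == decomposed

# Normalizing both sides to the same form makes them comparable.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", precomposed)
             == unicodedata.normalize("NFD", decomposed))

print(plain_equal, nfc_equal, nfd_equal)
```

Normalizing both operands to one form before comparing is one way to avoid the surprise; which form to pick depends on the application.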
parent 5c404aed0e
commit 216ad337bd
2 changed files with 12 additions and 2 deletions
@@ -107,7 +107,7 @@ the following functions:
 based on the definition of canonical equivalence and compatibility equivalence.
 In Unicode, several characters can be expressed in various ways. For example, the
 character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
 
 For each character, there are two normal forms: normal form C and normal form D.
 Normal form D (NFD) is also known as canonical decomposition, and translates
@@ -126,6 +126,10 @@ the following functions:
 (NFKC) first applies the compatibility decomposition, followed by the canonical
 composition.
 
+Even if two unicode strings are normalized and look the same to
+a human reader, if one has combining characters and the other
+doesn't, they may not compare equal.
+
 .. versionadded:: 2.3
 
 In addition, the module exposes the following constant:

@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
 
 * Strings are compared lexicographically using the numeric equivalents (the
 result of the built-in function :func:`ord`) of their characters. Unicode and
-8-bit strings are fully interoperable in this behavior.
+8-bit strings are fully interoperable in this behavior. [#]_
 
 * Tuples and lists are compared lexicographically using comparison of
 corresponding elements. This means that to compare equal, each element must
@@ -1328,6 +1328,12 @@ groups from right to left).
 cases, Python returns the latter result, in order to preserve that
 ``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
 
+.. [#] While comparisons between unicode strings make sense at the byte
+level, they may be counter-intuitive to users. For example, the
+strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+even though they both represent the same unicode character (LATIN
+CAPITAL LETTER C WITH CEDILLA).
+
 .. [#] The implementation computes this efficiently, without constructing lists or
 sorting.
 