mirror of
https://github.com/python/cpython.git
synced 2025-11-13 23:46:24 +00:00
Added a note in each regarding the fact that unicode strings that look the same
may not compare equal (due to the possibility of multiple representations).
parent
5c404aed0e
commit
216ad337bd
2 changed files with 12 additions and 2 deletions
@@ -107,7 +107,7 @@ the following functions:
 based on the definition of canonical equivalence and compatibility equivalence.
 In Unicode, several characters can be expressed in various way. For example, the
 character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
-the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+the sequence U+0327 (COMBINING CEDILLA) U+0043 (LATIN CAPITAL LETTER C).
 
 For each character, there are two normal forms: normal form C and normal form D.
 Normal form D (NFD) is also known as canonical decomposition, and translates
@@ -126,6 +126,10 @@ the following functions:
 (NFKC) first applies the compatibility decomposition, followed by the canonical
 composition.
 
+Even if two unicode strings are normalized and look the same to
+a human reader, if one has combining characters and the other
+doesn't, they may not compare equal.
+
 .. versionadded:: 2.3
 
 In addition, the module exposes the following constant:

@@ -1040,7 +1040,7 @@ Comparison of objects of the same type depends on the type:
 
 * Strings are compared lexicographically using the numeric equivalents (the
 result of the built-in function :func:`ord`) of their characters. Unicode and
-8-bit strings are fully interoperable in this behavior.
+8-bit strings are fully interoperable in this behavior. [#]_
 
 * Tuples and lists are compared lexicographically using comparison of
 corresponding elements. This means that to compare equal, each element must
@@ -1328,6 +1328,12 @@ groups from right to left).
 cases, Python returns the latter result, in order to preserve that
 ``divmod(x,y)[0] * y + x % y`` be very close to ``x``.
 
+.. [#] While comparisons between unicode strings make sense at the byte
+level, they may be counter-intuitive to users. For example, the
+strings ``u"\u00C7"`` and ``u"\u0327\u0043"`` compare differently,
+even though they both represent the same unicode character (LATIN
+CAPTITAL LETTER C WITH CEDILLA).
+
 .. [#] The implementation computes this efficiently, without constructing lists or
 sorting.
 
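The behaviour these notes describe can be checked with a short sketch. It is an illustration only, written for Python 3 (where every ``str`` is Unicode), and the helper name ``same_text`` is hypothetical rather than part of the commit or of the standard library. It assumes the decomposed form is written base letter first, then the combining mark (U+0043 followed by U+0327), which is the order Unicode canonical decomposition uses for U+00C7.

```python
import unicodedata

# Two spellings of the same visible character.
composed = "\u00C7"     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Plain comparison works on code points, so the two strings differ.
raw_equal = composed == decomposed


def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both
    to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Normalizing both sides to one form makes the comparison succeed.
nfc_equal = same_text(composed, decomposed, "NFC")
nfd_equal = same_text(composed, decomposed, "NFD")

print(raw_equal, nfc_equal, nfd_equal)
```

Normalizing both operands to the same form before comparing is one way to avoid the surprise; which form to pick (NFC or NFD) matters less than applying it consistently.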