mirror of https://github.com/python/cpython.git
synced 2025-11-03 03:22:27 +00:00

Issue #23181: More "codepoint" -> "code point".

This commit is contained in:
parent b2653b344e
commit d3faf43f9b

24 changed files with 46 additions and 46 deletions
@@ -1134,7 +1134,7 @@ These are the UTF-32 codec APIs:
 mark (U+FEFF). In the other two modes, no BOM mark is prepended.

 If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
-as a single codepoint.
+as a single code point.

 Return *NULL* if an exception was raised by the codec.
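The BOM behaviour this hunk documents is easy to check from Python itself; a small sketch (not part of the commit) using the stdlib codecs:

```python
import codecs

# The endianness-specific codecs never prepend a byte order mark:
assert "a".encode("utf-32-le") == b"a\x00\x00\x00"
assert "a".encode("utf-32-be") == b"\x00\x00\x00a"

# The generic 'utf-32' codec writes a BOM (U+FEFF) first.
bom = "a".encode("utf-32")[:4]
assert bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
```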
@@ -827,7 +827,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------

-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -838,23 +838,23 @@ There are a variety of different text serialisation codecs, which are
 collectively referred to as :term:`text encodings <text encoding>`.

 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.

 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.

-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
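The :exc:`UnicodeEncodeError` quoted in the hunk above can be reproduced directly; a minimal sketch:

```python
# latin-1 only covers code points 0-255, so U+1234 cannot be encoded.
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    message = str(exc)

# The wording may vary between versions, but the ordinal range is fixed.
assert "ordinal not in range(256)" in message
```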
@@ -194,7 +194,7 @@ Here are the classes:
 minor type and defaults to :mimetype:`plain`. *_charset* is the character
 set of the text and is passed as an argument to the
 :class:`~email.mime.nonmultipart.MIMENonMultipart` constructor; it defaults
-to ``us-ascii`` if the string contains only ``ascii`` codepoints, and
+to ``us-ascii`` if the string contains only ``ascii`` code points, and
 ``utf-8`` otherwise.

 Unless the *_charset* argument is explicitly set to ``None``, the
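The charset selection this hunk documents can be observed with :class:`~email.mime.text.MIMEText`; a small sketch of the behaviour, assuming a modern CPython:

```python
from email.mime.text import MIMEText

# Pure-ASCII text defaults to us-ascii; one non-ASCII code point
# is enough to switch the default charset to utf-8.
assert MIMEText("hello").get_content_charset() == "us-ascii"
assert MIMEText("h\u00e9llo").get_content_charset() == "utf-8"
```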
@@ -156,7 +156,7 @@ are always available. They are listed here in alphabetical order.

 .. function:: chr(i)

-   Return the string representing a character whose Unicode codepoint is the integer
+   Return the string representing a character whose Unicode code point is the integer
    *i*. For example, ``chr(97)`` returns the string ``'a'``. This is the
    inverse of :func:`ord`. The valid range for the argument is from 0 through
    1,114,111 (0x10FFFF in base 16). :exc:`ValueError` will be raised if *i* is
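The documented :func:`chr`/:func:`ord` behaviour, including the valid range, in a quick sketch:

```python
assert chr(97) == "a"
assert ord("a") == 97              # ord is the inverse of chr
assert chr(0x10FFFF) == "\U0010FFFF"

try:
    chr(0x110000)                  # one past the valid range
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError")
```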
@@ -33,12 +33,12 @@ This module defines four dictionaries, :data:`html5`,

 .. data:: name2codepoint

-   A dictionary that maps HTML entity names to the Unicode codepoints.
+   A dictionary that maps HTML entity names to the Unicode code points.


 .. data:: codepoint2name

-   A dictionary that maps Unicode codepoints to HTML entity names.
+   A dictionary that maps Unicode code points to HTML entity names.


 .. rubric:: Footnotes
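The two mappings described above are inverses of each other; a small usage sketch:

```python
from html.entities import codepoint2name, name2codepoint

assert name2codepoint["amp"] == 38        # '&' is U+0026
assert codepoint2name[38] == "amp"
assert chr(name2codepoint["eacute"]) == "\u00e9"
```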
@@ -512,7 +512,7 @@ The RFC does not explicitly forbid JSON strings which contain byte sequences
 that don't correspond to valid Unicode characters (e.g. unpaired UTF-16
 surrogates), but it does note that they may cause interoperability problems.
 By default, this module accepts and outputs (when present in the original
-:class:`str`) codepoints for such sequences.
+:class:`str`) code points for such sequences.


 Infinite and NaN Number Values
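The surrogate handling this hunk describes can be demonstrated with a round trip; a sketch assuming the default ``ensure_ascii=True``:

```python
import json

# An unpaired UTF-16 surrogate survives a dumps/loads round trip,
# even though it is not a valid Unicode character.
lone = "\ud800"
encoded = json.dumps(lone)
assert encoded == '"\\ud800"'
assert json.loads(encoded) == lone
```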
@@ -684,7 +684,7 @@ the same type, the lexicographical comparison is carried out recursively. If
 all items of two sequences compare equal, the sequences are considered equal.
 If one sequence is an initial sub-sequence of the other, the shorter sequence is
 the smaller (lesser) one. Lexicographical ordering for strings uses the Unicode
-codepoint number to order individual characters. Some examples of comparisons
+code point number to order individual characters. Some examples of comparisons
 between sequences of the same type::

    (1, 2, 3) < (1, 2, 4)
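The comparison rules quoted above, checked directly:

```python
# Tuples compare item by item; strings compare by Unicode code point.
assert (1, 2, 3) < (1, 2, 4)
assert (1, 2) < (1, 2, 3)          # shorter initial sub-sequence is lesser
assert "ABC" < "C" < "Pascal" < "Python"
assert "\u00e9" > "z"              # U+00E9 orders after U+007A
```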
@@ -228,7 +228,7 @@ Functionality

 Changes introduced by :pep:`393` are the following:

-* Python now always supports the full range of Unicode codepoints, including
+* Python now always supports the full range of Unicode code points, including
  non-BMP ones (i.e. from ``U+0000`` to ``U+10FFFF``). The distinction between
  narrow and wide builds no longer exists and Python now behaves like a wide
  build, even under Windows.
@@ -246,7 +246,7 @@ Changes introduced by :pep:`393` are the following:
  so ``'\U0010FFFF'[0]`` now returns ``'\U0010FFFF'`` and not ``'\uDBFF'``;

 * all other functions in the standard library now correctly handle
-  non-BMP codepoints.
+  non-BMP code points.

 * The value of :data:`sys.maxunicode` is now always ``1114111`` (``0x10FFFF``
   in hexadecimal). The :c:func:`PyUnicode_GetMax` function still returns
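The non-BMP indexing behaviour and the fixed :data:`sys.maxunicode` value mentioned in this hunk, in a quick check:

```python
import sys

# Since PEP 393 a non-BMP character is a single code point on every build:
assert len("\U0010FFFF") == 1
assert "\U0010FFFF"[0] == "\U0010FFFF"
assert sys.maxunicode == 0x10FFFF == 1114111
```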
@@ -258,13 +258,13 @@ Changes introduced by :pep:`393` are the following:
 Performance and resource usage
 ------------------------------

-The storage of Unicode strings now depends on the highest codepoint in the string:
+The storage of Unicode strings now depends on the highest code point in the string:

-* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per codepoint;
+* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per code point;

-* BMP strings (``U+0000-U+FFFF``) use 2 bytes per codepoint;
+* BMP strings (``U+0000-U+FFFF``) use 2 bytes per code point;

-* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per codepoint.
+* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per code point.

 The net effect is that for most applications, memory usage of string
 storage should decrease significantly - especially compared to former
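The 1/2/4 bytes-per-code-point storage described in this hunk can be measured with :func:`sys.getsizeof`; a sketch that is CPython-specific (it assumes the PEP 393 compact representation of CPython 3.3+):

```python
import sys

def bytes_per_code_point(ch, n=1000):
    # Doubling the length isolates the per-character cost from the
    # fixed object header.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

assert bytes_per_code_point("a") == 1           # Latin-1 range
assert bytes_per_code_point("\u20ac") == 2      # BMP
assert bytes_per_code_point("\U0001F600") == 4  # non-BMP
```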