mirror of
https://github.com/python/cpython.git
synced 2025-09-26 18:29:57 +00:00
Closes #23181: codepoint -> code point
parent
1a8ada89f9
commit
3be472b5f7
7 changed files with 18 additions and 18 deletions
@@ -1141,7 +1141,7 @@ These are the UTF-32 codec APIs:
 mark (U+FEFF). In the other two modes, no BOM mark is prepended.
 
 If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
-as a single codepoint.
+as a single code point.
 
 Return *NULL* if an exception was raised by the codec.
 
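The hunk above concerns the C-level UTF-32 encoder. As a rough Python-level illustration of the same BOM behaviour (an assumption on our part that the ``utf-32`` family of Python codecs mirrors the three byte-order modes described), the sketch below compares the generic codec with the two explicit-endianness ones:

```python
import codecs

text = "a"

# The generic "utf-32" codec writes in native byte order and prepends a BOM.
with_bom = text.encode("utf-32")

# The explicit-endianness codecs do not prepend a BOM.
little = text.encode("utf-32-le")
big = text.encode("utf-32-be")

starts_with_bom = with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
print(starts_with_bom)                       # True
print(len(with_bom), len(little), len(big))  # 8 4 4
```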
@@ -841,7 +841,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------
 
-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -852,23 +852,23 @@ There are a variety of different text serialisation codecs, which are
 collectivity referred to as :term:`text encodings <text encoding>`.
 
 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.
 
 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
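The encodings described in this hunk can be tried out directly. The following is a minimal sketch, not part of the commit; the sample strings are our own, and the exact wording of the error message may differ between Python versions:

```python
# latin-1 maps code points 0-255 straight to the bytes 0x00-0xff.
latin = "caf\u00e9".encode("latin-1")
print(latin)  # b'caf\xe9'

# A code point above U+00FF cannot be encoded with latin-1.
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
    print(exc.start, exc.reason)  # 3 ordinal not in range(256)

# A charmap encoding such as cp1252 picks a different 256-entry subset:
# the euro sign (U+20AC) is mapped to the single byte 0x80.
euro = "\u20ac".encode("cp1252")
print(euro)  # b'\x80'

# UTF-32 stores every code point as four bytes, in either byte order.
be = "A".encode("utf-32-be")
le = "A".encode("utf-32-le")
print(be, le)  # b'\x00\x00\x00A' b'A\x00\x00\x00'
```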
@@ -194,7 +194,7 @@ Here are the classes:
 minor type and defaults to :mimetype:`plain`. *_charset* is the character
 set of the text and is passed as an argument to the
 :class:`~email.mime.nonmultipart.MIMENonMultipart` constructor; it defaults
-to ``us-ascii`` if the string contains only ``ascii`` codepoints, and
+to ``us-ascii`` if the string contains only ``ascii`` code points, and
 ``utf-8`` otherwise. The *_charset* parameter accepts either a string or a
 :class:`~email.charset.Charset` instance.
 
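The default described above can be observed with a short sketch. The message bodies here are made up for illustration, and the behaviour shown assumes a Python 3 version in which ``MIMEText`` picks the charset automatically when *_charset* is omitted:

```python
from email.mime.text import MIMEText

# Only ASCII code points: the charset defaults to us-ascii.
ascii_msg = MIMEText("hello")

# At least one non-ASCII code point: the charset defaults to utf-8.
non_ascii_msg = MIMEText("h\u00e9llo")

# An explicit _charset overrides the default.
explicit_msg = MIMEText("hello", _charset="utf-8")

print(ascii_msg.get_content_charset())      # us-ascii
print(non_ascii_msg.get_content_charset())  # utf-8
print(explicit_msg.get_content_charset())   # utf-8
```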
@@ -156,7 +156,7 @@ are always available. They are listed here in alphabetical order.
 
 .. function:: chr(i)
 
-   Return the string representing a character whose Unicode codepoint is the
+   Return the string representing a character whose Unicode code point is the
    integer *i*. For example, ``chr(97)`` returns the string ``'a'``, while
    ``chr(931)`` returns the string ``'Σ'``. This is the inverse of :func:`ord`.
 
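A quick round trip through ``chr`` and ``ord`` illustrates the relationship stated in this hunk. The out-of-range check at the end is an extra detail not mentioned in the diff:

```python
# chr() turns a code point into a one-character string.
a = chr(97)
sigma = chr(931)
print(a, sigma)  # a Σ

# ord() is the inverse: it returns the code point of a character.
print(ord(a), ord(sigma))  # 97 931

# Values beyond U+10FFFF are not valid code points.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)  # True
```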
@@ -33,12 +33,12 @@ This module defines four dictionaries, :data:`html5`,
 
 .. data:: name2codepoint
 
-   A dictionary that maps HTML entity names to the Unicode codepoints.
+   A dictionary that maps HTML entity names to the Unicode code points.
 
 
 .. data:: codepoint2name
 
-   A dictionary that maps Unicode codepoints to HTML entity names.
+   A dictionary that maps Unicode code points to HTML entity names.
 
 
 .. rubric:: Footnotes
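The two dictionaries touched by this hunk are inverses of each other for the entities they cover. A small sketch, using ``eacute`` as an arbitrary example entity:

```python
from html.entities import codepoint2name, name2codepoint

# Entity name -> integer code point.
code_point = name2codepoint["eacute"]
print(code_point, chr(code_point))  # 233 é

# Integer code point -> entity name.
name = codepoint2name[code_point]
print(name)  # eacute
```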
@@ -685,7 +685,7 @@ the same type, the lexicographical comparison is carried out recursively. If
 all items of two sequences compare equal, the sequences are considered equal.
 If one sequence is an initial sub-sequence of the other, the shorter sequence is
 the smaller (lesser) one. Lexicographical ordering for strings uses the Unicode
-codepoint number to order individual characters. Some examples of comparisons
+code point number to order individual characters. Some examples of comparisons
 between sequences of the same type::
 
    (1, 2, 3) < (1, 2, 4)
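Because strings are ordered by code point number, uppercase ASCII letters sort before lowercase ones, and accented letters sort after both. The sample values below are our own and only illustrate the rule quoted in the hunk:

```python
# 'Z' is U+005A and 'a' is U+0061, so 'Z' compares as smaller.
upper_first = "Z" < "a"
print(upper_first)  # True

# Sorting follows the same code point order; 'é' (U+00E9) comes last.
ordered = sorted(["b", "a", "Z", "\u00e9"])
print(ordered)  # ['Z', 'a', 'b', 'é']

# An initial sub-sequence is the smaller one.
prefix_smaller = "abc" < "abcd"
print(prefix_smaller)  # True
```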
@@ -228,7 +228,7 @@ Functionality
 
 Changes introduced by :pep:`393` are the following:
 
-* Python now always supports the full range of Unicode codepoints, including
+* Python now always supports the full range of Unicode code points, including
   non-BMP ones (i.e. from ``U+0000`` to ``U+10FFFF``). The distinction between
   narrow and wide builds no longer exists and Python now behaves like a wide
   build, even under Windows.
@@ -246,7 +246,7 @@ Changes introduced by :pep:`393` are the following:
   so ``'\U0010FFFF'[0]`` now returns ``'\U0010FFFF'`` and not ``'\uDBFF'``;
 
 * all other functions in the standard library now correctly handle
-  non-BMP codepoints.
+  non-BMP code points.
 
 * The value of :data:`sys.maxunicode` is now always ``1114111`` (``0x10FFFF``
   in hexadecimal). The :c:func:`PyUnicode_GetMax` function still returns
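The behaviour listed in these two hunks can be checked on any interpreter that implements :pep:`393` (Python 3.3 or later). A minimal sketch:

```python
import sys

s = "\U0010FFFF"

# A non-BMP character is a single code point, not a surrogate pair.
length = len(s)
first = s[0]
print(length)      # 1
print(first == s)  # True

# sys.maxunicode no longer depends on a narrow or wide build.
print(sys.maxunicode == 0x10FFFF)  # True
```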
@@ -258,13 +258,13 @@ Changes introduced by :pep:`393` are the following:
 Performance and resource usage
 ------------------------------
 
-The storage of Unicode strings now depends on the highest codepoint in the string:
+The storage of Unicode strings now depends on the highest code point in the string:
 
-* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per codepoint;
+* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per code point;
 
-* BMP strings (``U+0000-U+FFFF``) use 2 bytes per codepoint;
+* BMP strings (``U+0000-U+FFFF``) use 2 bytes per code point;
 
-* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per codepoint.
+* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per code point.
 
 The net effect is that for most applications, memory usage of string
 storage should decrease significantly - especially compared to former
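The per-code-point storage cost listed above can be estimated on CPython with ``sys.getsizeof``. This is only a sketch: ``bytes_per_code_point`` is a hypothetical helper of our own, and the figures are an implementation detail of CPython rather than a language guarantee. Comparing two lengths of the same string cancels out the fixed object header:

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point for a string made only of ``ch``.

    Subtracting the size of an n-character string from that of a
    2n-character string removes the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


samples = [
    ("Latin-1", "\u00e9"),      # highest code point <= U+00FF
    ("BMP", "\u20ac"),          # highest code point <= U+FFFF
    ("non-BMP", "\U0001F600"),  # highest code point > U+FFFF
]

for label, ch in samples:
    print(label, bytes_per_code_point(ch))
```

On a CPython build that implements :pep:`393`, the three estimates should come out as 1, 2 and 4 bytes respectively.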