Mirror of https://github.com/python/cpython.git (synced 2025-11-02 03:01:58 +00:00)
Issue #23181: More "codepoint" -> "code point".
parent b2653b344e
commit d3faf43f9b
24 changed files with 46 additions and 46 deletions

@@ -827,7 +827,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------
 
-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -838,23 +838,23 @@ There are a variety of different text serialisation codecs, which are
 collectivity referred to as :term:`text encodings <text encoding>`.
 
 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.
 
 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you

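Note on the hunk above: the encoding behaviour this documentation text describes can be checked with a short Python sketch. The sample strings and variable names are illustrative assumptions, not part of the commit; only standard ``str.encode`` calls with built-in codecs are used.

```python
# latin-1 maps code points 0-255 directly onto the bytes 0x00-0xff.
encoded = "caf\u00e9".encode("latin-1")

# A code point above U+00FF cannot be encoded with latin-1.
latin1_error = None
try:
    "abc\u1234".encode("latin-1")
except UnicodeEncodeError as exc:
    latin1_error = exc
    print(exc)  # the exact wording of the message may differ between versions

# Charmap encodings such as cp1252 choose a different set of 256 code points:
# the euro sign (U+20AC) is a single byte in cp1252.
euro = "\u20ac".encode("cp1252")

# UTF-32 stores each code point as four bytes; only the byte order differs.
big = "a".encode("utf-32-be")
little = "a".encode("utf-32-le")
print(big, little)
```

The exception object carries the failing position in its ``start`` attribute, which is usually more convenient than parsing the message text.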
@@ -194,7 +194,7 @@ Here are the classes:
 minor type and defaults to :mimetype:`plain`. *_charset* is the character
 set of the text and is passed as an argument to the
 :class:`~email.mime.nonmultipart.MIMENonMultipart` constructor; it defaults
-to ``us-ascii`` if the string contains only ``ascii`` codepoints, and
+to ``us-ascii`` if the string contains only ``ascii`` code points, and
 ``utf-8`` otherwise.
 
 Unless the *_charset* argument is explicitly set to ``None``, the

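Note on the hunk above: the charset default described in this text can be observed with a minimal sketch, assuming a Python 3 version in which ``MIMEText`` picks the charset from the string content when *_charset* is not given. The message bodies and variable names are made up for illustration.

```python
from email.mime.text import MIMEText

# Body with only ASCII code points: the charset should default to us-ascii.
ascii_msg = MIMEText("plain ASCII body")

# Body with a non-ASCII code point (U+00E9): the charset should be utf-8.
unicode_msg = MIMEText("caf\u00e9")

ascii_charset = ascii_msg.get_content_charset()
unicode_charset = unicode_msg.get_content_charset()
print(ascii_charset, unicode_charset)
```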
@@ -156,7 +156,7 @@ are always available. They are listed here in alphabetical order.
 
 .. function:: chr(i)
 
-Return the string representing a character whose Unicode codepoint is the integer
+Return the string representing a character whose Unicode code point is the integer
 *i*. For example, ``chr(97)`` returns the string ``'a'``. This is the
 inverse of :func:`ord`. The valid range for the argument is from 0 through
 1,114,111 (0x10FFFF in base 16). :exc:`ValueError` will be raised if *i* is

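Note on the hunk above: a small sketch of the ``chr``/``ord`` round trip and the range limit described in this text. The variable names are illustrative only.

```python
# chr() and ord() are inverses of each other.
letter = chr(97)
code_point = ord(letter)

# The highest valid argument is 0x10FFFF (1,114,111).
highest = chr(0x10FFFF)

# One past the valid range raises ValueError.
out_of_range_error = None
try:
    chr(0x110000)
except ValueError as exc:
    out_of_range_error = exc

print(letter, code_point, hex(ord(highest)))
```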
@@ -33,12 +33,12 @@ This module defines four dictionaries, :data:`html5`,
 
 .. data:: name2codepoint
 
-A dictionary that maps HTML entity names to the Unicode codepoints.
+A dictionary that maps HTML entity names to the Unicode code points.
 
 
 .. data:: codepoint2name
 
-A dictionary that maps Unicode codepoints to HTML entity names.
+A dictionary that maps Unicode code points to HTML entity names.
 
 
 .. rubric:: Footnotes

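Note on the hunk above: the two dictionaries keep their ``codepoint`` spelling as identifiers; only the prose changes. A brief sketch of how they map in each direction, with the chosen entities picked arbitrarily for illustration:

```python
from html.entities import codepoint2name, name2codepoint

# Entity name -> Unicode code point (an integer).
amp_code_point = name2codepoint["amp"]
euro_char = chr(name2codepoint["euro"])

# Unicode code point -> entity name.
amp_name = codepoint2name[38]

print(amp_code_point, euro_char, amp_name)
```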
@@ -512,7 +512,7 @@ The RFC does not explicitly forbid JSON strings which contain byte sequences
 that don't correspond to valid Unicode characters (e.g. unpaired UTF-16
 surrogates), but it does note that they may cause interoperability problems.
 By default, this module accepts and outputs (when present in the original
-:class:`str`) codepoints for such sequences.
+:class:`str`) code points for such sequences.
 
 
 Infinite and NaN Number Values
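Note on the hunk above: the default handling of unpaired surrogates that this text describes can be seen in a short round trip through the ``json`` module. The lone surrogate ``U+D800`` is an arbitrary example value.

```python
import json

# A str containing a single unpaired surrogate code point.
lone_surrogate = "\ud800"

# With the default ensure_ascii=True, the surrogate is written as an escape.
dumped = json.dumps(lone_surrogate)

# Loading the escape gives back the same one-code-point string.
loaded = json.loads(dumped)

print(dumped, len(loaded))
```

Other JSON implementations may reject such input, which is the interoperability concern the text mentions.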