mirror of https://github.com/python/cpython.git (synced 2025-09-17 14:16:02 +00:00)
docs 36789: resolve incorrect note regarding UTF-8 (GH-13111)
parent af8646c805
commit f98c3c59c0
1 changed file with 10 additions and 5 deletions
@@ -135,17 +135,22 @@ used than UTF-8.) UTF-8 uses the following rules:
 UTF-8 has several convenient properties:
 
 1. It can handle any Unicode code point.
-2. A Unicode string is turned into a sequence of bytes containing no embedded zero
-   bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
-   processed by C functions such as ``strcpy()`` and sent through protocols that
-   can't handle zero bytes.
+2. A Unicode string is turned into a sequence of bytes that contains embedded
+   zero bytes only where they represent the null character (U+0000). This means
+   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
+   through protocols that can't handle zero bytes for anything other than
+   end-of-string markers.
 3. A string of ASCII text is also valid UTF-8 text.
 4. UTF-8 is fairly compact; the majority of commonly used characters can be
    represented with one or two bytes.
 5. If bytes are corrupted or lost, it's possible to determine the start of the
    next UTF-8-encoded code point and resynchronize. It's also unlikely that
    random 8-bit data will look like valid UTF-8.
-
+6. UTF-8 is a byte oriented encoding. The encoding specifies that each
+   character is represented by a specific sequence of one or more bytes. This
+   avoids the byte-ordering issues that can occur with integer and word oriented
+   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
+   on the hardware on which the string was encoded.
 
 
 References
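The two reworded claims (items 2 and 6 in the new text) can be checked from a Python prompt. The snippet below is only an illustrative sketch and is not part of the commit; the sample strings and variable names are assumptions chosen for the demonstration.

```python
# Illustrative check of items 2 and 6; the sample text is an assumption.

# Item 2: UTF-8 output contains a zero byte only where the text has U+0000.
text = "a\u0000b\u20ac"            # 'a', NUL, 'b', EURO SIGN
utf8 = text.encode("utf-8")
zero_positions = [i for i, byte in enumerate(utf8) if byte == 0]

# By contrast, UTF-16 emits zero bytes even for plain ASCII characters.
ascii_as_utf16 = "a".encode("utf-16-le")

# Item 6: UTF-8 is a fixed byte sequence, while UTF-16 depends on byte order.
euro = "\u20ac"
utf8_euro = euro.encode("utf-8")
utf16_le = euro.encode("utf-16-le")   # little-endian code units
utf16_be = euro.encode("utf-16-be")   # big-endian code units

print(utf8, zero_positions)
print(ascii_as_utf16)
print(utf8_euro, utf16_le, utf16_be)
```

The only zero byte in ``utf8`` sits at the position of the explicit null character, while the two UTF-16 encodings of the same code point differ only in byte order.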