mirror of https://github.com/python/cpython.git (synced 2025-08-02 16:13:13 +00:00)
Update to properly explain that the default Unicode encoding is ASCII, &c.
parent ea4f931cb9
commit 5401996638
1 changed files with 39 additions and 21 deletions
@@ -772,17 +772,17 @@ u'Hello World !'
 \end{verbatim}
 
 The escape sequence \code{\e u0020} indicates to insert the Unicode
-character with the HEX ordinal 0x0020 (the space character) at the
+character with the ordinal value 0x0020 (the space character) at the
 given position.
 
 Other characters are interpreted by using their respective ordinal
-value directly as Unicode ordinal. Due to the fact that the lower 256
-Unicode are the same as the standard Latin-1 encoding used in many
-western countries, the process of entering Unicode is greatly
-simplified.
+values directly as Unicode ordinals. If you have literal strings
+in the standard Latin-1 encoding that is used in many Western countries,
+you will find it convenient that the lower 256 characters
+of Unicode are the same as the 256 characters of Latin-1.
 
-For experts, there is also a raw mode just like for normal
-strings. You have to prepend the string with a small 'r' to have
+For experts, there is also a raw mode just like the one for normal
+strings. You have to prefix the opening quote with 'ur' to have
 Python use the \emph{Raw-Unicode-Escape} encoding. It will only apply
 the above \code{\e uXXXX} conversion if there is an uneven number of
 backslashes in front of the small 'u'.
@@ -801,32 +801,50 @@ Apart from these standard encodings, Python provides a whole set of
 other ways of creating Unicode strings on the basis of a known
 encoding.
 
-The built-in function \function{unicode()}\bifuncindex{unicode} provides access
-to all registered Unicode codecs (COders and DECoders). Some of the
-more well known encodings which these codecs can convert are
-\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8} and \emph{UTF-16}. The latter two
-are variable-length encodings which store Unicode characters
-in blocks of 8 or 16 bits. To print a Unicode string or write it to a file,
-you must convert it to a string with the \method{encode()} method.
+The built-in function \function{unicode()}\bifuncindex{unicode} provides
+access to all registered Unicode codecs (COders and DECoders). Some of
+the more well known encodings which these codecs can convert are
+\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8}, and \emph{UTF-16}.
+The latter two are variable-length encodings that store each Unicode
+character in one or more bytes. The default encoding is
+normally set to ASCII, which passes through characters in the range
+0 to 127 and rejects any other characters with an error.
+When a Unicode string is printed, written to a file, or converted
+with \function{str()}, conversion takes place using this default encoding.
 
 \begin{verbatim}
+>>> u"abc"
+u'abc'
+>>> str(u"abc")
+'abc'
 >>> u"äöü"
-u'\344\366\374'
->>> u"äöü".encode('UTF-8')
-'\303\244\303\266\303\274'
+u'\xe4\xf6\xfc'
+>>> str(u"äöü")
+Traceback (most recent call last):
+  File "<stdin>", line 1, in ?
+UnicodeError: ASCII encoding error: ordinal not in range(128)
+\end{verbatim}
+
+To convert a Unicode string into an 8-bit string using a specific
+encoding, Unicode objects provide an \function{encode()} method
+that takes one argument, the name of the encoding. Lowercase names
+for encodings are preferred.
+
+\begin{verbatim}
+>>> u"äöü".encode('utf-8')
+'\xc3\xa4\xc3\xb6\xc3\xbc'
 \end{verbatim}
 
 If you have data in a specific encoding and want to produce a
 corresponding Unicode string from it, you can use the
-\function{unicode()} function with the encoding name as second
+\function{unicode()} function with the encoding name as the second
 argument.
 
 \begin{verbatim}
->>> unicode('\303\244\303\266\303\274','UTF-8')
-u'\344\366\374'
+>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
+u'\xe4\xf6\xfc'
 \end{verbatim}
 
-
 \subsection{Lists \label{lists}}
 
 Python knows a number of \emph{compound} data types, used to group
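For readers who want to try the behaviour this change documents, here is a rough sketch in present-day Python. It is not part of the commit. It assumes Python 3, where `str` is already a Unicode string, the `unicode()` built-in no longer exists, and `bytes.decode()` plays its role. All variable names are made up for the illustration.

```python
# Illustrative Python 3 translation of the tutorial examples in the diff.
# Assumption: Python 3 semantics (str is Unicode, bytes is the 8-bit string).

# The same three characters as u"äöü", written with escapes so the
# example does not depend on the source file's encoding.
text = "\xe4\xf6\xfc"

# The lower 256 Unicode code points match Latin-1, so each character
# encodes to the single byte equal to its ordinal.
latin1_bytes = text.encode("latin-1")
ordinals = [ord(ch) for ch in text]

# ASCII only covers ordinals 0 to 127; anything else is rejected.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc
else:
    ascii_error = None

# UTF-8 is variable-length: each of these characters takes two bytes.
utf8_bytes = text.encode("utf-8")

# Decoding with the matching codec name recovers the original string,
# which is what unicode(data, 'utf-8') did in the documented version.
round_tripped = utf8_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes, round_tripped == text)
```

Pure ASCII text such as `"abc"` encodes without error under any of the three codecs, which matches the `str(u"abc")` example in the diff.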