Patch #1177307: UTF-8-Sig codec.
Commit 412ed3b8a7 (parent fd9a72ad89)
4 changed files with 212 additions and 1 deletion

Doc/lib/libcodecs.tex

@@ -522,6 +522,113 @@ the \function{lookup()} function to construct the instance.
 \class{StreamReader} and \class{StreamWriter} classes. They inherit
 all other methods and attributes from the underlying stream.
 
+\subsection{Encodings and Unicode\label{encodings-overview}}
+
+Unicode strings are stored internally as sequences of codepoints (to
+be precise, as Py_UNICODE arrays). Depending on the way Python is
+compiled (either via --enable-unicode=ucs2 or --enable-unicode=ucs4,
+with the former being the default), Py_UNICODE is either a 16-bit or
+a 32-bit data type. Once a Unicode object leaves CPU and memory, the
+CPU's endianness and how these arrays are stored as bytes become an
+issue. Transforming a unicode object into a sequence of bytes is
+called encoding, and recreating the unicode object from the sequence
+of bytes is known as decoding. There are many different methods by
+which this transformation can be done (these methods are also called
+encodings). The simplest method is to map the codepoints 0-255 to
+the bytes 0x0-0xff. This means that a unicode object that contains
+codepoints above U+00FF can't be encoded with this method (which is
+called 'latin-1' or 'iso-8859-1'). unicode.encode() will raise a
+UnicodeEncodeError that looks like this:
+
+  UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1234'
+  in position 3: ordinal not in range(256)
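
For example, an interpreter session added here purely for
illustration (it is not part of the patch):

    >>> u"abc\u1234".encode("latin-1")
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'latin-1' codec can't encode character
    u'\u1234' in position 3: ordinal not in range(256)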
+
+There's another group of encodings (the so-called charmap encodings)
+that choose a different subset of all unicode code points and define
+how these codepoints are mapped to the bytes 0x0-0xff. To see how
+this is done, simply open e.g. encodings/cp1252.py (an encoding that
+is used primarily on Windows). There's a string constant with 256
+characters that shows you which character is mapped to which byte
+value.
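
For instance, the same byte decodes to different characters under two
charmap encodings (again an illustrative session, not patch content):

    >>> "\x80".decode("cp1252")   # 0x80 is the euro sign in cp1252
    u'\u20ac'
    >>> "\x80".decode("latin-1")  # but maps to U+0080 in latin-1
    u'\x80'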
+
+All of these encodings can only encode 256 of the 65536 (or 1114112)
+codepoints defined in unicode. A simple and straightforward way to
+store each Unicode code point is to store each codepoint as two
+consecutive bytes. There are two possibilities: store the bytes in
+big endian or in little endian order. These two encodings are called
+UTF-16-BE and UTF-16-LE respectively. Their disadvantage is that if
+e.g. you use UTF-16-BE on a little endian machine you will always
+have to swap bytes on encoding and decoding. UTF-16 avoids this
+problem: bytes will always be in natural endianness. When these
+bytes are read by a CPU with a different endianness, the bytes have
+to be swapped though. To be able to detect the endianness of a
+UTF-16 byte sequence, there's the so-called BOM (the "Byte Order
+Mark"), the Unicode character U+FEFF. This character is prepended to
+every UTF-16 byte sequence. The byte-swapped version of this
+character (0xFFFE) is an illegal character that may not appear in a
+Unicode text. So when the first character in a UTF-16 byte sequence
+appears to be a U+FFFE, the bytes have to be swapped on decoding.
+Unfortunately, up to Unicode 4.0 the character U+FEFF had a second
+purpose as a ZERO WIDTH NO-BREAK SPACE: a character that has no
+width and doesn't allow a word to be split. It can e.g. be used to
+give hints to a ligature algorithm. With Unicode 4.0 using U+FEFF as
+a ZERO WIDTH NO-BREAK SPACE has been deprecated (with U+2060 (WORD
+JOINER) assuming this role). Nevertheless Unicode software still
+must be able to handle U+FEFF in both roles: as a BOM it's a device
+to determine the storage layout of the encoded bytes, and vanishes
+once the byte sequence has been decoded into a Unicode string; as a
+ZERO WIDTH NO-BREAK SPACE it's a normal character that will be
+decoded like any other.
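
A short session showing the three UTF-16 variants (illustration only;
the utf-16 output shown assumes a little endian machine):

    >>> u"ab".encode("utf-16-be")
    '\x00a\x00b'
    >>> u"ab".encode("utf-16-le")
    'a\x00b\x00'
    >>> u"ab".encode("utf-16")   # BOM first, then natural endianness
    '\xff\xfea\x00b\x00'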
+
+There's another encoding that is able to encode the full range of
+Unicode characters: UTF-8. UTF-8 is an 8-bit encoding, which means
+there are no issues with byte order in UTF-8. Each byte in a UTF-8
+byte sequence consists of two parts: marker bits (the most
+significant bits) and payload bits. The marker bits are a sequence
+of zero to six 1 bits followed by a 0 bit. Unicode characters are
+encoded like this (with x being a payload bit, which when
+concatenated give the Unicode character):
+
+\begin{tableii}{l|l}{textrm}{Range}{Encoding}
+\lineii{U-00000000 ... U-0000007F}{0xxxxxxx}
+\lineii{U-00000080 ... U-000007FF}{110xxxxx 10xxxxxx}
+\lineii{U-00000800 ... U-0000FFFF}{1110xxxx 10xxxxxx 10xxxxxx}
+\lineii{U-00010000 ... U-001FFFFF}{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{U-00200000 ... U-03FFFFFF}{111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\lineii{U-04000000 ... U-7FFFFFFF}{1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx}
+\end{tableii}
+
+The least significant bit of the Unicode character is the rightmost
+x bit.
+
+As UTF-8 is an 8-bit encoding, no BOM is required and any U+FEFF
+character in the decoded Unicode string (even if it's the first
+character) is treated as a ZERO WIDTH NO-BREAK SPACE.
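
For example (illustration only), characters from the first three rows
of the table encode to one, two and three bytes respectively:

    >>> u"a\u07ff\u1234".encode("utf-8")
    'a\xdf\xbf\xe1\x88\xb4'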
+
+Without external information it's impossible to reliably determine
+which encoding was used for encoding a Unicode string. Each charmap
+encoding can decode any random byte sequence. However, that's not
+possible with UTF-8, as UTF-8 byte sequences have a structure that
+doesn't allow arbitrary byte sequences. To increase the reliability
+with which a UTF-8 encoding can be detected, Microsoft invented a
+variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad
+program: before any of the Unicode characters is written to the
+file, a UTF-8 encoded BOM (which looks like this as a byte sequence:
+0xef, 0xbb, 0xbf) is written. As it's rather improbable that any
+charmap encoded file starts with these byte values (which would
+e.g. map to
+
+   LATIN SMALL LETTER I WITH DIAERESIS
+   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
+   INVERTED QUESTION MARK
+
+in iso-8859-1), this increases the probability that a utf-8-sig
+encoding can be correctly guessed from the byte sequence. So here
+the BOM is not used to determine the byte order used for generating
+the byte sequence, but as a signature that helps in guessing the
+encoding. On encoding, the utf-8-sig codec will write 0xef, 0xbb,
+0xbf as the first three bytes to the file. On decoding, utf-8-sig
+will skip those three bytes if they appear as the first three bytes
+in the file.
+
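
A short session with the new codec (illustration only, not patch
content):

    >>> u"spam".encode("utf-8-sig")            # BOM prepended on encoding
    '\xef\xbb\xbfspam'
    >>> "\xef\xbb\xbfspam".decode("utf-8-sig") # and skipped on decoding
    u'spam'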
 \subsection{Standard Encodings\label{standard-encodings}}
 
 Python comes with a number of codecs built in, either implemented as C

@@ -890,6 +997,10 @@ exist:
 {U8, UTF, utf8}
 {all languages}
+
+\lineiii{utf_8_sig}
+{}
+{all languages}
+
 \end{longtableiii}
 
 A number of codecs are specific to Python, so their codec names have

@@ -1058,3 +1169,17 @@ Convert a label to \ASCII, as specified in \rfc{3490}.
 \begin{funcdesc}{ToUnicode}{label}
 Convert a label to Unicode, as specified in \rfc{3490}.
 \end{funcdesc}
+
+\subsection{\module{encodings.utf_8_sig} ---
+            UTF-8 codec with BOM signature}
+\declaremodule{standard}{encodings.utf-8-sig} % XXX utf_8_sig gives TeX errors
+\modulesynopsis{UTF-8 codec with BOM signature}
+\moduleauthor{Walter D\"orwald}
+
+\versionadded{2.5}
+
+This module implements a variant of the UTF-8 codec: On encoding, a
+UTF-8 encoded BOM will be prepended to the UTF-8 encoded bytes. For
+the stateful encoder this is only done once (on the first write to
+the byte stream). On decoding, an optional UTF-8 encoded BOM at the
+start of the data will be skipped.
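
A minimal sketch of the stateful behaviour (illustration only; it
uses StringIO as a stand-in for a file):

    >>> import codecs
    >>> from StringIO import StringIO
    >>> stream = StringIO()
    >>> writer = codecs.getwriter("utf-8-sig")(stream)
    >>> writer.write(u"spam")
    >>> writer.write(u"eggs")
    >>> stream.getvalue()        # the BOM was written only once
    '\xef\xbb\xbfspameggs'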

Lib/encodings/utf_8_sig.py (new file, 57 lines)

@@ -0,0 +1,57 @@
""" Python 'utf-8-sig' Codec
|
||||||
|
This work similar to UTF-8 with the following changes:
|
||||||
|
|
||||||
|
* On encoding/writing a UTF-8 encoded BOM will be prepended/written as the
|
||||||
|
first three bytes.
|
||||||
|
|
||||||
|
* On decoding/reading if the first three bytes are a UTF-8 encoded BOM, these
|
||||||
|
bytes will be skipped.
|
||||||
|
"""
|
||||||
|
import codecs
|
||||||
|
|
||||||
|
### Codec APIs
|
||||||
|
|
||||||
|
def encode(input, errors='strict'):
|
||||||
|
return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
|
||||||
|
|
||||||
|
def decode(input, errors='strict'):
|
||||||
|
prefix = 0
|
||||||
|
if input.startswith(codecs.BOM_UTF8):
|
||||||
|
input = input[3:]
|
||||||
|
prefix = 3
|
||||||
|
(output, consumed) = codecs.utf_8_decode(input, errors, True)
|
||||||
|
return (output, consumed+prefix)
|
||||||
|
|
||||||
|
class StreamWriter(codecs.StreamWriter):
|
||||||
|
def reset(self):
|
||||||
|
codecs.StreamWriter.reset(self)
|
||||||
|
try:
|
||||||
|
del self.encode
|
||||||
|
except AttributeError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def encode(self, input, errors='strict'):
|
||||||
|
self.encode = codecs.utf_8_encode
|
||||||
|
return encode(input, errors)
|
||||||
|
|
||||||
|
class StreamReader(codecs.StreamReader):
|
||||||
|
def reset(self):
|
||||||
|
codecs.StreamReader.reset(self)
|
||||||
|
try:
|
||||||
|
del self.decode
|
||||||
|
except AttributeError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
def decode(self, input, errors='strict'):
|
||||||
|
if len(input) < 3 and codecs.BOM_UTF8.startswith(input):
|
||||||
|
# not enough data to decide if this is a BOM
|
||||||
|
# => try again on the next call
|
||||||
|
return (u"", 0)
|
||||||
|
self.decode = codecs.utf_8_decode
|
||||||
|
return decode(input, errors)
|
||||||
|
|
||||||
|
### encodings module API
|
||||||
|
|
||||||
|
def getregentry():
|
||||||
|
|
||||||
|
return (encode,decode,StreamReader,StreamWriter)
|

Lib/test/test_codecs.py

@@ -367,6 +367,33 @@ class CharBufferTest(unittest.TestCase):
         self.assertRaises(TypeError, codecs.charbuffer_encode)
         self.assertRaises(TypeError, codecs.charbuffer_encode, 42)
+
+class UTF8SigTest(ReadTest):
+    encoding = "utf-8-sig"
+
+    def test_partial(self):
+        self.check_partial(
+            u"\ufeff\x00\xff\u07ff\u0800\uffff",
+            [
+                u"",
+                u"",
+                u"", # First BOM has been read and skipped
+                u"",
+                u"",
+                u"\ufeff", # Second BOM has been read and emitted
+                u"\ufeff\x00", # "\x00" read and emitted
+                u"\ufeff\x00", # First byte of encoded u"\xff" read
+                u"\ufeff\x00\xff", # Second byte of encoded u"\xff" read
+                u"\ufeff\x00\xff", # First byte of encoded u"\u07ff" read
+                u"\ufeff\x00\xff\u07ff", # Second byte of encoded u"\u07ff" read
+                u"\ufeff\x00\xff\u07ff",
+                u"\ufeff\x00\xff\u07ff",
+                u"\ufeff\x00\xff\u07ff\u0800",
+                u"\ufeff\x00\xff\u07ff\u0800",
+                u"\ufeff\x00\xff\u07ff\u0800",
+                u"\ufeff\x00\xff\u07ff\u0800\uffff",
+            ]
+        )
+
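
For context, check_partial comes from the ReadTest base class defined
earlier in test_codecs.py. Roughly (a paraphrase under assumptions,
not the patch's code; Queue is a small helper class in the test
module), it feeds the encoded input to a stream reader one byte at a
time and checks the decoded prefix after every byte:

    def check_partial(self, input, partialresults):
        # feed the byte sequence one byte at a time to the reader and
        # compare the accumulated result with the expected partial result
        q = Queue()
        r = codecs.getreader(self.encoding)(q)
        result = u""
        for (c, partialresult) in zip(input.encode(self.encoding), partialresults):
            q.write(c)
            result += r.read()
            self.assertEqual(result, partialresult)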
 class EscapeDecodeTest(unittest.TestCase):
     def test_empty(self):
         self.assertEquals(codecs.escape_decode(""), ("", 0))

@@ -1044,6 +1071,7 @@ def test_main():
         UTF16LETest,
         UTF16BETest,
         UTF8Test,
+        UTF8SigTest,
         UTF7Test,
         UTF16ExTest,
         ReadBufferTest,

Misc/NEWS

@@ -319,6 +319,8 @@ Extension Modules
 Library
 -------
 
+- Patch #1177307: Added a new codec utf_8_sig for UTF-8 with a BOM signature.
+
 - Patch #1157027: cookielib mishandles RFC 2109 cookies in Netscape mode
 
 - Patch #1117398: cookielib.LWPCookieJar and .MozillaCookieJar now raise

@@ -674,7 +676,6 @@ Build
 Tests for sanity in tzname when HAVE_TZNAME defined were also defined.
 Closes bug #1096244. Thanks Gregory Bond.
 
-
 C API
 -----