This patch changes the behaviour of the UTF-16 codec family.

Only the UTF-16 codec will now interpret and remove a *leading* BOM.
Subsequent BOM characters are no longer interpreted or removed.
UTF-16-LE and UTF-16-BE pass all BOM characters through unchanged.

These changes should bring the UTF-16 codec more in line with what
the Unicode FAQ recommends with respect to BOMs.
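
The behaviour described above can be illustrated from the Python side. The snippet below is a minimal sketch, assuming a current CPython 3 interpreter whose codecs still follow the rules introduced here; the sample bytes and variable names are illustrative and not part of the patch.

```python
import codecs

# Illustrative input: a little-endian BOM, "A", a second BOM, then "B".
bom = codecs.BOM_UTF16_LE
data = bom + "A".encode("utf-16-le") + bom + "B".encode("utf-16-le")

# Generic codec: the leading BOM selects the byte order and is dropped;
# the later BOM is not interpreted and stays in the text as U+FEFF.
generic = data.decode("utf-16")

# Explicit-endian codec: no BOM interpretation at all, so both survive.
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

Under those assumptions, `generic` keeps only the mid-stream U+FEFF, while `explicit` keeps both the leading and the mid-stream one.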
Marc-André Lemburg 2001-05-21 20:30:15 +00:00
parent f52d27e52d
commit 489b56e044
2 changed files with 31 additions and 22 deletions

@@ -459,10 +459,11 @@ extern DL_IMPORT(PyObject*) PyUnicode_EncodeUTF8(
 *byteorder == 0: native order
 *byteorder == 1: big endian
-and then switches according to all BOM marks it finds in the input
-data. BOM marks are not copied into the resulting Unicode string.
-After completion, *byteorder is set to the current byte order at
-the end of input data.
+In native mode, the first two bytes of the stream are checked for a
+BOM mark. If found, the BOM mark is analysed, the byte order
+adjusted and the BOM skipped. In the other modes, no BOM mark
+interpretation is done. After completion, *byteorder is set to the
+current byte order at the end of input data.
 If byteorder is NULL, the codec starts in native order mode.