Issue #22286: The "backslashreplace" error handler now works with decoding and translating.
Serhiy Storchaka 2015-01-25 22:56:57 +02:00
parent 58f02019e0
commit 07985ef387
10 changed files with 200 additions and 87 deletions


@ -280,8 +280,9 @@ and optionally an *errors* argument.
The *errors* argument specifies the response when the input string can't be The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use ``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the ``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result). character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences:: The following examples show the differences::
>>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE >>> b'\x80abc'.decode("utf-8", "strict") #doctest: +NORMALIZE_WHITESPACE
@ -291,6 +292,8 @@ The following examples show the differences::
invalid start byte invalid start byte
>>> b'\x80abc'.decode("utf-8", "replace") >>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc' '\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "ignore") >>> b'\x80abc'.decode("utf-8", "ignore")
'abc' 'abc'
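As a rough illustration of the documentation change in this hunk, the sketch below decodes the same malformed byte string under each handler and collects the results. The `results` dictionary and the loop are illustrative scaffolding, not part of the patch, and the `'backslashreplace'` case assumes Python 3.5 or later.

```python
# Decode one malformed UTF-8 byte string under each built-in error handler.
data = b'\x80abc'  # 0x80 is not a valid UTF-8 start byte

results = {}
for handler in ("strict", "replace", "backslashreplace", "ignore"):
    try:
        results[handler] = data.decode("utf-8", handler)
    except UnicodeDecodeError as exc:
        # 'strict' raises instead of returning a string
        results[handler] = "UnicodeDecodeError: " + exc.reason

for handler, text in results.items():
    print(f"{handler:>16}: {text!r}")
```

Only `'backslashreplace'` keeps the value of the offending byte visible in the decoded text, which can make it a convenient choice for logs and diagnostics.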


@ -314,8 +314,8 @@ The following error handlers are only applicable to
| | reference (only for encoding). Implemented | | | reference (only for encoding). Implemented |
| | in :func:`xmlcharrefreplace_errors`. | | | in :func:`xmlcharrefreplace_errors`. |
+-------------------------+-----------------------------------------------+ +-------------------------+-----------------------------------------------+
| ``'backslashreplace'`` | Replace with backslashed escape sequences | | ``'backslashreplace'`` | Replace with backslashed escape sequences. |
| | (only for encoding). Implemented in | | | Implemented in |
| | :func:`backslashreplace_errors`. | | | :func:`backslashreplace_errors`. |
+-------------------------+-----------------------------------------------+ +-------------------------+-----------------------------------------------+
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences | | ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
@ -350,6 +350,10 @@ In addition, the following error handler is specific to the given codecs:
.. versionadded:: 3.5 .. versionadded:: 3.5
The ``'namereplace'`` error handler. The ``'namereplace'`` error handler.
.. versionchanged:: 3.5
The ``'backslashreplace'`` error handler now works with decoding and
translating.
The set of allowed values can be extended by registering a new named error The set of allowed values can be extended by registering a new named error
handler: handler:
@ -417,9 +421,9 @@ functions:
.. function:: backslashreplace_errors(exception) .. function:: backslashreplace_errors(exception)
Implements the ``'backslashreplace'`` error handling (for encoding with Implements the ``'backslashreplace'`` error handling (for
:term:`text encodings <text encoding>` only): the :term:`text encodings <text encoding>` only): malformed data is
unencodable character is replaced by a backslashed escape sequence. replaced by a backslashed escape sequence.
.. function:: namereplace_errors(exception) .. function:: namereplace_errors(exception)
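A minimal sketch of calling the handler function directly with each of the three exception types it accepts after this change. The exception arguments mirror the ones used in the test suite further down; the variable names are illustrative only, and the decode and translate cases assume Python 3.5 or later.

```python
import codecs

# Decoding error: the offending bytes are rendered as \xNN escapes.
dec_exc = UnicodeDecodeError("ascii", b"\xff", 0, 1, "ouch")
dec_result = codecs.backslashreplace_errors(dec_exc)

# Translating error: the offending characters are rendered as \x, \u or \U escapes.
tr_exc = UnicodeTranslateError("\u3042", 0, 1, "ouch")
tr_result = codecs.backslashreplace_errors(tr_exc)

# Encoding error: unchanged behaviour, shown for comparison.
enc_exc = UnicodeEncodeError("ascii", "\u3042", 0, 1, "ouch")
enc_result = codecs.backslashreplace_errors(enc_exc)

print(dec_result, tr_result, enc_result)
```

Each call returns a `(replacement, resume_position)` tuple, which is the contract every handler registered through `codecs.register_error` has to follow.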


@ -973,9 +973,8 @@ are always available. They are listed here in alphabetical order.
Characters not supported by the encoding are replaced with the Characters not supported by the encoding are replaced with the
appropriate XML character reference ``&#nnn;``. appropriate XML character reference ``&#nnn;``.
* ``'backslashreplace'`` (also only supported when writing) * ``'backslashreplace'`` replaces malformed data by Python's backslashed
replaces unsupported characters with Python's backslashed escape escape sequences.
sequences.
* ``'namereplace'`` (also only supported when writing) * ``'namereplace'`` (also only supported when writing)
replaces unsupported characters with ``\N{...}`` escape sequences. replaces unsupported characters with ``\N{...}`` escape sequences.


@ -825,11 +825,12 @@ Text I/O
exception if there is an encoding error (the default of ``None`` has the same exception if there is an encoding error (the default of ``None`` has the same
effect), or pass ``'ignore'`` to ignore errors. (Note that ignoring encoding effect), or pass ``'ignore'`` to ignore errors. (Note that ignoring encoding
errors can lead to data loss.) ``'replace'`` causes a replacement marker errors can lead to data loss.) ``'replace'`` causes a replacement marker
(such as ``'?'``) to be inserted where there is malformed data. When (such as ``'?'``) to be inserted where there is malformed data.
writing, ``'xmlcharrefreplace'`` (replace with the appropriate XML character ``'backslashreplace'`` causes malformed data to be replaced by a
reference), ``'backslashreplace'`` (replace with backslashed escape backslashed escape sequence. When writing, ``'xmlcharrefreplace'``
sequences) or ``'namereplace'`` (replace with ``\N{...}`` escape sequences) (replace with the appropriate XML character reference) or ``'namereplace'``
can be used. Any other error handling name that has been registered with (replace with ``\N{...}`` escape sequences) can be used. Any other error
handling name that has been registered with
:func:`codecs.register_error` is also valid. :func:`codecs.register_error` is also valid.
.. index:: .. index::
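The reading side of the change described in this hunk can be sketched with an in-memory stream. The byte string and variable names below are made up for illustration, and the behaviour assumes Python 3.5 or later.

```python
import io

# A byte stream containing two bytes that are never valid in UTF-8.
raw = io.BytesIO(b"ok \xff\xfe end\n")

# With errors="backslashreplace" the text layer no longer raises on read;
# each undecodable byte shows up as a \xNN escape in the returned string.
reader = io.TextIOWrapper(raw, encoding="utf-8", errors="backslashreplace")
text = reader.read()

print(repr(text))
```

The same `errors` value can be passed to the built-in `open()`, since it forwards the argument to the text layer.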


@ -118,7 +118,9 @@ Other Language Changes
Some smaller changes made to the core Python language are: Some smaller changes made to the core Python language are:
* None yet. * Added the ``'namereplace'`` error handler. The ``'backslashreplace'``
error handler now works with decoding and translating.
(Contributed by Serhiy Storchaka in :issue:`19676` and :issue:`22286`.)


@ -127,7 +127,8 @@ class Codec:
'surrogateescape' - replace with private code points U+DCnn. 'surrogateescape' - replace with private code points U+DCnn.
'xmlcharrefreplace' - Replace with the appropriate XML 'xmlcharrefreplace' - Replace with the appropriate XML
character reference (only for encoding). character reference (only for encoding).
'backslashreplace' - Replace with backslashed escape sequences 'backslashreplace' - Replace with backslashed escape sequences.
'namereplace' - Replace with \\N{...} escape sequences
(only for encoding). (only for encoding).
The set of allowed values can be extended via register_error. The set of allowed values can be extended via register_error.
@ -359,7 +360,8 @@ class StreamWriter(Codec):
'xmlcharrefreplace' - Replace with the appropriate XML 'xmlcharrefreplace' - Replace with the appropriate XML
character reference. character reference.
'backslashreplace' - Replace with backslashed escape 'backslashreplace' - Replace with backslashed escape
sequences (only for encoding). sequences.
'namereplace' - Replace with \\N{...} escape sequences.
The set of allowed parameter values can be extended via The set of allowed parameter values can be extended via
register_error. register_error.
@ -429,7 +431,8 @@ class StreamReader(Codec):
'strict' - raise a ValueError (or a subclass) 'strict' - raise a ValueError (or a subclass)
'ignore' - ignore the character and continue with the next 'ignore' - ignore the character and continue with the next
'replace'- replace with a suitable replacement character; 'replace'- replace with a suitable replacement character
'backslashreplace' - Replace with backslashed escape sequences;
The set of allowed parameter values can be extended via The set of allowed parameter values can be extended via
register_error. register_error.


@ -246,6 +246,11 @@ class CodecCallbackTest(unittest.TestCase):
"\u0000\ufffd" "\u0000\ufffd"
) )
self.assertEqual(
b"\x00\x00\x00\x00\x00".decode("unicode-internal", "backslashreplace"),
"\u0000\\x00"
)
codecs.register_error("test.hui", handler_unicodeinternal) codecs.register_error("test.hui", handler_unicodeinternal)
self.assertEqual( self.assertEqual(
@ -565,17 +570,6 @@ class CodecCallbackTest(unittest.TestCase):
codecs.backslashreplace_errors, codecs.backslashreplace_errors,
UnicodeError("ouch") UnicodeError("ouch")
) )
# "backslashreplace" can only be used for encoding
self.assertRaises(
TypeError,
codecs.backslashreplace_errors,
UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")
)
self.assertRaises(
TypeError,
codecs.backslashreplace_errors,
UnicodeTranslateError("\u3042", 0, 1, "ouch")
)
# Use the correct exception # Use the correct exception
self.assertEqual( self.assertEqual(
codecs.backslashreplace_errors( codecs.backslashreplace_errors(
@ -701,6 +695,16 @@ class CodecCallbackTest(unittest.TestCase):
UnicodeEncodeError("ascii", "\udfff", 0, 1, "ouch")), UnicodeEncodeError("ascii", "\udfff", 0, 1, "ouch")),
("\\udfff", 1) ("\\udfff", 1)
) )
self.assertEqual(
codecs.backslashreplace_errors(
UnicodeDecodeError("ascii", bytearray(b"\xff"), 0, 1, "ouch")),
("\\xff", 1)
)
self.assertEqual(
codecs.backslashreplace_errors(
UnicodeTranslateError("\u3042", 0, 1, "ouch")),
("\\u3042", 1)
)
def test_badhandlerresults(self): def test_badhandlerresults(self):
results = ( 42, "foo", (1,2,3), ("foo", 1, 3), ("foo", None), ("foo",), ("foo", 1, 3), ("foo", None), ("foo",) ) results = ( 42, "foo", (1,2,3), ("foo", 1, 3), ("foo", None), ("foo",), ("foo", 1, 3), ("foo", None), ("foo",) )


@ -378,6 +378,10 @@ class ReadTest(MixInCheckStateHandling):
before + after) before + after)
self.assertEqual(test_sequence.decode(self.encoding, "replace"), self.assertEqual(test_sequence.decode(self.encoding, "replace"),
before + self.ill_formed_sequence_replace + after) before + self.ill_formed_sequence_replace + after)
backslashreplace = ''.join('\\x%02x' % b
for b in self.ill_formed_sequence)
self.assertEqual(test_sequence.decode(self.encoding, "backslashreplace"),
before + backslashreplace + after)
class UTF32Test(ReadTest, unittest.TestCase): class UTF32Test(ReadTest, unittest.TestCase):
encoding = "utf-32" encoding = "utf-32"
@ -1300,14 +1304,19 @@ class UnicodeInternalTest(unittest.TestCase):
"unicode_internal") "unicode_internal")
if sys.byteorder == "little": if sys.byteorder == "little":
invalid = b"\x00\x00\x11\x00" invalid = b"\x00\x00\x11\x00"
invalid_backslashreplace = r"\x00\x00\x11\x00"
else: else:
invalid = b"\x00\x11\x00\x00" invalid = b"\x00\x11\x00\x00"
invalid_backslashreplace = r"\x00\x11\x00\x00"
with support.check_warnings(): with support.check_warnings():
self.assertRaises(UnicodeDecodeError, self.assertRaises(UnicodeDecodeError,
invalid.decode, "unicode_internal") invalid.decode, "unicode_internal")
with support.check_warnings(): with support.check_warnings():
self.assertEqual(invalid.decode("unicode_internal", "replace"), self.assertEqual(invalid.decode("unicode_internal", "replace"),
'\ufffd') '\ufffd')
with support.check_warnings():
self.assertEqual(invalid.decode("unicode_internal", "backslashreplace"),
invalid_backslashreplace)
@unittest.skipUnless(SIZEOF_WCHAR_T == 4, 'specific to 32-bit wchar_t') @unittest.skipUnless(SIZEOF_WCHAR_T == 4, 'specific to 32-bit wchar_t')
def test_decode_error_attributes(self): def test_decode_error_attributes(self):
@ -2042,6 +2051,16 @@ class CharmapTest(unittest.TestCase):
("ab\ufffd", 3) ("ab\ufffd", 3)
) )
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab"),
("ab\\x02", 3)
)
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace", "ab\ufffe"),
("ab\\x02", 3)
)
self.assertEqual( self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "ignore", "ab"), codecs.charmap_decode(b"\x00\x01\x02", "ignore", "ab"),
("ab", 3) ("ab", 3)
@ -2118,6 +2137,25 @@ class CharmapTest(unittest.TestCase):
("ab\ufffd", 3) ("ab\ufffd", 3)
) )
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
{0: 'a', 1: 'b'}),
("ab\\x02", 3)
)
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
{0: 'a', 1: 'b', 2: None}),
("ab\\x02", 3)
)
# Issue #14850
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
{0: 'a', 1: 'b', 2: '\ufffe'}),
("ab\\x02", 3)
)
self.assertEqual( self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "ignore", codecs.charmap_decode(b"\x00\x01\x02", "ignore",
{0: 'a', 1: 'b'}), {0: 'a', 1: 'b'}),
@ -2194,6 +2232,18 @@ class CharmapTest(unittest.TestCase):
("ab\ufffd", 3) ("ab\ufffd", 3)
) )
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
{0: a, 1: b}),
("ab\\x02", 3)
)
self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "backslashreplace",
{0: a, 1: b, 2: 0xFFFE}),
("ab\\x02", 3)
)
self.assertEqual( self.assertEqual(
codecs.charmap_decode(b"\x00\x01\x02", "ignore", codecs.charmap_decode(b"\x00\x01\x02", "ignore",
{0: a, 1: b}), {0: a, 1: b}),
@ -2253,9 +2303,13 @@ class TypesTest(unittest.TestCase):
self.assertRaises(UnicodeDecodeError, codecs.unicode_escape_decode, br"\U00110000") self.assertRaises(UnicodeDecodeError, codecs.unicode_escape_decode, br"\U00110000")
self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10)) self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
self.assertEqual(codecs.unicode_escape_decode(r"\U00110000", "backslashreplace"),
(r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))
self.assertRaises(UnicodeDecodeError, codecs.raw_unicode_escape_decode, br"\U00110000") self.assertRaises(UnicodeDecodeError, codecs.raw_unicode_escape_decode, br"\U00110000")
self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10)) self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "replace"), ("\ufffd", 10))
self.assertEqual(codecs.raw_unicode_escape_decode(r"\U00110000", "backslashreplace"),
(r"\x5c\x55\x30\x30\x31\x31\x30\x30\x30\x30", 10))
class UnicodeEscapeTest(unittest.TestCase): class UnicodeEscapeTest(unittest.TestCase):
@ -2894,11 +2948,13 @@ class CodePageTest(unittest.TestCase):
(b'[\xff]', 'strict', None), (b'[\xff]', 'strict', None),
(b'[\xff]', 'ignore', '[]'), (b'[\xff]', 'ignore', '[]'),
(b'[\xff]', 'replace', '[\ufffd]'), (b'[\xff]', 'replace', '[\ufffd]'),
(b'[\xff]', 'backslashreplace', '[\\xff]'),
(b'[\xff]', 'surrogateescape', '[\udcff]'), (b'[\xff]', 'surrogateescape', '[\udcff]'),
(b'[\xff]', 'surrogatepass', None), (b'[\xff]', 'surrogatepass', None),
(b'\x81\x00abc', 'strict', None), (b'\x81\x00abc', 'strict', None),
(b'\x81\x00abc', 'ignore', '\x00abc'), (b'\x81\x00abc', 'ignore', '\x00abc'),
(b'\x81\x00abc', 'replace', '\ufffd\x00abc'), (b'\x81\x00abc', 'replace', '\ufffd\x00abc'),
(b'\x81\x00abc', 'backslashreplace', '\\x81\x00abc'),
)) ))
def test_cp1252(self): def test_cp1252(self):
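The charmap tests added in this file can be condensed into a small sketch. It uses `codecs.charmap_decode`, the same low-level helper the tests call; the variable names are illustrative, and the results assume Python 3.5 or later.

```python
import codecs

data = b"\x00\x01\x02"

# String mapping: byte 2 has no entry, so it is undecodable.
from_str_map = codecs.charmap_decode(data, "backslashreplace", "ab")

# Dict mapping with a missing key.
from_dict_map = codecs.charmap_decode(data, "backslashreplace",
                                      {0: "a", 1: "b"})

# Dict mapping where the key is explicitly mapped to None ("undefined").
from_none_map = codecs.charmap_decode(data, "backslashreplace",
                                      {0: "a", 1: "b", 2: None})

print(from_str_map, from_dict_map, from_none_map)
```

In all three cases the unmapped byte is kept as a `\x02` escape, and the second tuple element reports that all three input bytes were consumed.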


@ -10,6 +10,9 @@ Release date: TBA
Core and Builtins Core and Builtins
----------------- -----------------
- Issue #22286: The "backslashreplace" error handler now works with
decoding and translating.
- Issue #23253: Delay-load ShellExecute[AW] in os.startfile for reduced - Issue #23253: Delay-load ShellExecute[AW] in os.startfile for reduced
startup overhead on Windows. startup overhead on Windows.


@ -864,74 +864,112 @@ PyObject *PyCodec_XMLCharRefReplaceErrors(PyObject *exc)
PyObject *PyCodec_BackslashReplaceErrors(PyObject *exc) PyObject *PyCodec_BackslashReplaceErrors(PyObject *exc)
{ {
PyObject *object;
Py_ssize_t i;
Py_ssize_t start;
Py_ssize_t end;
PyObject *res;
unsigned char *outp;
Py_ssize_t ressize;
Py_UCS4 c;
if (PyObject_IsInstance(exc, PyExc_UnicodeDecodeError)) {
unsigned char *p;
if (PyUnicodeDecodeError_GetStart(exc, &start))
return NULL;
if (PyUnicodeDecodeError_GetEnd(exc, &end))
return NULL;
if (!(object = PyUnicodeDecodeError_GetObject(exc)))
return NULL;
if (!(p = (unsigned char*)PyBytes_AsString(object))) {
Py_DECREF(object);
return NULL;
}
res = PyUnicode_New(4 * (end - start), 127);
if (res == NULL) {
Py_DECREF(object);
return NULL;
}
outp = PyUnicode_1BYTE_DATA(res);
for (i = start; i < end; i++, outp += 4) {
unsigned char c = p[i];
outp[0] = '\\';
outp[1] = 'x';
outp[2] = Py_hexdigits[(c>>4)&0xf];
outp[3] = Py_hexdigits[c&0xf];
}
assert(_PyUnicode_CheckConsistency(res, 1));
Py_DECREF(object);
return Py_BuildValue("(Nn)", res, end);
}
if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) { if (PyObject_IsInstance(exc, PyExc_UnicodeEncodeError)) {
PyObject *restuple;
PyObject *object;
Py_ssize_t i;
Py_ssize_t start;
Py_ssize_t end;
PyObject *res;
unsigned char *outp;
Py_ssize_t ressize;
Py_UCS4 c;
if (PyUnicodeEncodeError_GetStart(exc, &start)) if (PyUnicodeEncodeError_GetStart(exc, &start))
return NULL; return NULL;
if (PyUnicodeEncodeError_GetEnd(exc, &end)) if (PyUnicodeEncodeError_GetEnd(exc, &end))
return NULL; return NULL;
if (!(object = PyUnicodeEncodeError_GetObject(exc))) if (!(object = PyUnicodeEncodeError_GetObject(exc)))
return NULL; return NULL;
if (end - start > PY_SSIZE_T_MAX / (1+1+8)) }
end = start + PY_SSIZE_T_MAX / (1+1+8); else if (PyObject_IsInstance(exc, PyExc_UnicodeTranslateError)) {
for (i = start, ressize = 0; i < end; ++i) { if (PyUnicodeTranslateError_GetStart(exc, &start))
/* object is guaranteed to be "ready" */ return NULL;
c = PyUnicode_READ_CHAR(object, i); if (PyUnicodeTranslateError_GetEnd(exc, &end))
if (c >= 0x10000) { return NULL;
ressize += 1+1+8; if (!(object = PyUnicodeTranslateError_GetObject(exc)))
}
else if (c >= 0x100) {
ressize += 1+1+4;
}
else
ressize += 1+1+2;
}
res = PyUnicode_New(ressize, 127);
if (res == NULL) {
Py_DECREF(object);
return NULL; return NULL;
}
for (i = start, outp = PyUnicode_1BYTE_DATA(res);
i < end; ++i) {
c = PyUnicode_READ_CHAR(object, i);
*outp++ = '\\';
if (c >= 0x00010000) {
*outp++ = 'U';
*outp++ = Py_hexdigits[(c>>28)&0xf];
*outp++ = Py_hexdigits[(c>>24)&0xf];
*outp++ = Py_hexdigits[(c>>20)&0xf];
*outp++ = Py_hexdigits[(c>>16)&0xf];
*outp++ = Py_hexdigits[(c>>12)&0xf];
*outp++ = Py_hexdigits[(c>>8)&0xf];
}
else if (c >= 0x100) {
*outp++ = 'u';
*outp++ = Py_hexdigits[(c>>12)&0xf];
*outp++ = Py_hexdigits[(c>>8)&0xf];
}
else
*outp++ = 'x';
*outp++ = Py_hexdigits[(c>>4)&0xf];
*outp++ = Py_hexdigits[c&0xf];
}
assert(_PyUnicode_CheckConsistency(res, 1));
restuple = Py_BuildValue("(Nn)", res, end);
Py_DECREF(object);
return restuple;
} }
else { else {
wrong_exception_type(exc); wrong_exception_type(exc);
return NULL; return NULL;
} }
if (end - start > PY_SSIZE_T_MAX / (1+1+8))
end = start + PY_SSIZE_T_MAX / (1+1+8);
for (i = start, ressize = 0; i < end; ++i) {
/* object is guaranteed to be "ready" */
c = PyUnicode_READ_CHAR(object, i);
if (c >= 0x10000) {
ressize += 1+1+8;
}
else if (c >= 0x100) {
ressize += 1+1+4;
}
else
ressize += 1+1+2;
}
res = PyUnicode_New(ressize, 127);
if (res == NULL) {
Py_DECREF(object);
return NULL;
}
outp = PyUnicode_1BYTE_DATA(res);
for (i = start; i < end; ++i) {
c = PyUnicode_READ_CHAR(object, i);
*outp++ = '\\';
if (c >= 0x00010000) {
*outp++ = 'U';
*outp++ = Py_hexdigits[(c>>28)&0xf];
*outp++ = Py_hexdigits[(c>>24)&0xf];
*outp++ = Py_hexdigits[(c>>20)&0xf];
*outp++ = Py_hexdigits[(c>>16)&0xf];
*outp++ = Py_hexdigits[(c>>12)&0xf];
*outp++ = Py_hexdigits[(c>>8)&0xf];
}
else if (c >= 0x100) {
*outp++ = 'u';
*outp++ = Py_hexdigits[(c>>12)&0xf];
*outp++ = Py_hexdigits[(c>>8)&0xf];
}
else
*outp++ = 'x';
*outp++ = Py_hexdigits[(c>>4)&0xf];
*outp++ = Py_hexdigits[c&0xf];
}
assert(_PyUnicode_CheckConsistency(res, 1));
Py_DECREF(object);
return Py_BuildValue("(Nn)", res, end);
} }
static _PyUnicode_Name_CAPI *ucnhash_CAPI = NULL; static _PyUnicode_Name_CAPI *ucnhash_CAPI = NULL;
@ -1444,8 +1482,8 @@ static int _PyCodecRegistry_Init(void)
backslashreplace_errors, backslashreplace_errors,
METH_O, METH_O,
PyDoc_STR("Implements the 'backslashreplace' error handling, " PyDoc_STR("Implements the 'backslashreplace' error handling, "
"which replaces an unencodable character with a " "which replaces malformed data with a backslashed "
"backslashed escape sequence.") "escape sequence.")
} }
}, },
{ {
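The control flow of the reworked `PyCodec_BackslashReplaceErrors` can be approximated in pure Python. This is a sketch for reading alongside the C code, not the actual implementation: the function name `backslashreplace_sketch` is hypothetical, and the clamp on very long error ranges (`PY_SSIZE_T_MAX / (1+1+8)`) is deliberately left out.

```python
import codecs


def backslashreplace_sketch(exc):
    """Pure-Python approximation of the handler logic (illustrative only)."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: exc.object is bytes; every offending byte becomes \xNN.
        bad = exc.object[exc.start:exc.end]
        return "".join("\\x%02x" % b for b in bad), exc.end
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        # Encoding/translating: exc.object is str; the escape width depends
        # on the code point, as in the shared C loop.
        parts = []
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if cp >= 0x10000:
                parts.append("\\U%08x" % cp)
            elif cp >= 0x100:
                parts.append("\\u%04x" % cp)
            else:
                parts.append("\\x%02x" % cp)
        return "".join(parts), exc.end
    raise TypeError("don't know how to handle %s in error callback"
                    % type(exc).__name__)


# Compare the sketch with the built-in handler on a few sample errors.
samples = [
    UnicodeDecodeError("ascii", b"a\xff\xfeb", 1, 3, "ouch"),
    UnicodeEncodeError("ascii", "\xe9\u3042\U0001f600", 0, 3, "ouch"),
    UnicodeTranslateError("\u3042", 0, 1, "ouch"),
]
for exc in samples:
    print(backslashreplace_sketch(exc), codecs.backslashreplace_errors(exc))
```

The decode branch returns early because its input is a bytes object, while the encode and translate branches share one loop over code points, which matches the way the C function was restructured in this commit.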