mirror of
https://github.com/python/cpython.git
synced 2025-12-23 09:19:18 +00:00
gh-88886: Remove excessive encoding name normalization (GH-137167)
The codecs lookup function now performs only minimal normalization of
the encoding name before passing it to the search functions:
all ASCII letters are converted to lower case, spaces are replaced
with hyphens.
Excessive normalization broke third-party codecs providers, like
python-iconv.
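
The minimal normalization described in this message can be sketched in Python. This is an illustration only, not the CPython implementation (which is the C function `normalizestring()` in the diff); the helper name `normalize_minimal` is hypothetical.

```python
def normalize_minimal(name: str) -> str:
    """Lower-case ASCII letters and turn spaces into hyphens.

    Hypothetical helper for illustration. Every other character,
    including underscores, dots and non-ASCII text, is left untouched.
    """
    out = []
    for ch in name:
        if ch == " ":
            out.append("-")
        elif "A" <= ch <= "Z":
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)


print(normalize_minimal("US ASCII"))    # us-ascii
print(normalize_minimal("ISO 8859-1"))  # iso-8859-1
print(normalize_minimal("UTF_8"))       # utf_8
```

Because hyphens, underscores and dots pass through unchanged, a third-party search function can tell such spellings apart again.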
Revert "bpo-37751: Fix codecs.lookup() normalization (GH-15092)"
This reverts commit 20f59fe1f7.
parent 6b7b9d00a9
commit af58a6f883
5 changed files with 53 additions and 44 deletions
@@ -68,11 +68,21 @@ The full details for each codec can also be looked up directly:
    Looks up the codec info in the Python codec registry and returns a
    :class:`CodecInfo` object as defined below.

-   Encodings are first looked up in the registry's cache. If not found, the list of
+   This function first normalizes the *encoding*: all ASCII letters are
+   converted to lower case, spaces are replaced with hyphens.
+   Then encoding is looked up in the registry's cache. If not found, the list of
    registered search functions is scanned. If no :class:`CodecInfo` object is
    found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
    is stored in the cache and returned to the caller.

+   .. versionchanged:: 3.9
+      Any characters except ASCII letters and digits and a dot are converted to underscore.
+
+   .. versionchanged:: next
+      No characters are converted to underscore anymore.
+      Spaces are converted to hyphens.
+
+
 .. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)

    Codec details when looking up the codec registry. The constructor
@@ -167,14 +177,11 @@ function:
 .. function:: register(search_function, /)

    Register a codec search function. Search functions are expected to take one
-   argument, being the encoding name in all lower case letters with hyphens
-   and spaces converted to underscores, and return a :class:`CodecInfo` object.
+   argument, being the encoding name in all lower case letters with spaces
+   converted to hyphens, and return a :class:`CodecInfo` object.
    In case a search function cannot find a given encoding, it should return
    ``None``.

-   .. versionchanged:: 3.9
-      Hyphens and spaces are converted to underscore.
-

 .. function:: unregister(search_function, /)

@@ -1065,7 +1072,7 @@ or with dictionaries as mapping tables. The following table lists the codecs by
 name, together with a few common aliases, and the languages for which the
 encoding is likely used. Neither the list of aliases nor the list of languages
 is meant to be exhaustive. Notice that spelling alternatives that only differ in
-case or use a hyphen instead of an underscore are also valid aliases
+case or use a space or a hyphen instead of an underscore are also valid aliases
 because they are equivalent when normalized by
 :func:`~encodings.normalize_encoding`. For example, ``'utf-8'`` is a valid
 alias for the ``'utf_8'`` codec.
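
The alias equivalence described in the paragraph above can be checked with the standard `encodings` and `codecs` modules. This is a small sketch using only the built-in UTF-8 codec; the particular spellings tried are arbitrary examples.

```python
import codecs
import encodings

# normalize_encoding() maps the hyphen to an underscore.
print(encodings.normalize_encoding("utf-8"))

# Spellings that differ only in case, or in using a space or hyphen
# instead of an underscore, all resolve to the same built-in codec.
names = {codecs.lookup(alias).name
         for alias in ("utf_8", "utf-8", "UTF-8", "UTF 8")}
print(names)
```

The built-in search function in the `encodings` package applies `normalize_encoding()` itself, which is why these spellings keep working for standard codecs even when `codecs.lookup()` normalizes less.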
@@ -630,7 +630,6 @@ class CAPICodecs(unittest.TestCase):
         for name in [
             encoding_name,
             encoding_name.upper(),
-            encoding_name.replace('_', '-'),
         ]:
             with self.subTest(name):
                 self.assertTrue(_testcapi.codec_known_encoding(name))
@@ -3873,26 +3873,22 @@ class Rot13UtilTest(unittest.TestCase):
 class CodecNameNormalizationTest(unittest.TestCase):
     """Test codec name normalization"""
     def test_codecs_lookup(self):
         FOUND = (1, 2, 3, 4)
-        NOT_FOUND = (None, None, None, None)
         def search_function(encoding):
-            if encoding == "aaa_8":
-                return FOUND
+            if encoding.startswith("test."):
+                return (encoding, 2, 3, 4)
             else:
-                return NOT_FOUND
+                return None

         codecs.register(search_function)
         self.addCleanup(codecs.unregister, search_function)
-        self.assertEqual(FOUND, codecs.lookup('aaa_8'))
-        self.assertEqual(FOUND, codecs.lookup('AAA-8'))
-        self.assertEqual(FOUND, codecs.lookup('AAA---8'))
-        self.assertEqual(FOUND, codecs.lookup('AAA 8'))
-        self.assertEqual(FOUND, codecs.lookup('aaa\xe9\u20ac-8'))
-        self.assertEqual(NOT_FOUND, codecs.lookup('AAA.8'))
-        self.assertEqual(NOT_FOUND, codecs.lookup('AAA...8'))
-        self.assertEqual(NOT_FOUND, codecs.lookup('BBB-8'))
-        self.assertEqual(NOT_FOUND, codecs.lookup('BBB.8'))
-        self.assertEqual(NOT_FOUND, codecs.lookup('a\xe9\u20ac-8'))
+        self.assertEqual(codecs.lookup('test.aaa_8'), ('test.aaa_8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA-8'), ('test.aaa-8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA 8'), ('test.aaa-8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA---8'), ('test.aaa---8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA   8'), ('test.aaa---8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA\xe9\u20ac-8'), ('test.aaa\xe9\u20ac-8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA.8'), ('test.aaa.8', 2, 3, 4))
+        self.assertEqual(codecs.lookup('TEST.AAA...8'), ('test.aaa...8', 2, 3, 4))

     def test_encodings_normalize_encoding(self):
         # encodings.normalize_encoding() ignores non-ASCII characters.
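
For contrast with the new assertions, the older test cases in this hunk indicate what the Python 3.9 normalization did to a name before search functions saw it. The following is a rough Python approximation inferred from those assertions, not the C helper `_Py_normalize_encoding()` itself; the function name `normalize_py39_style` is hypothetical.

```python
import re


def normalize_py39_style(name: str) -> str:
    """Approximate the normalization being reverted (hypothetical helper).

    Runs of characters other than ASCII letters, digits and '.' collapse
    into a single underscore, and ASCII letters are lower-cased.
    """
    collapsed = re.sub(r"[^A-Za-z0-9.]+", "_", name)
    # Only ASCII letters, digits, '.' and '_' remain, so lower() is safe here.
    return collapsed.strip("_").lower()


for raw in ("AAA-8", "AAA---8", "AAA 8", "aaa\xe9\u20ac-8", "AAA.8"):
    print(repr(raw), "->", normalize_py39_style(raw))
```

Under this sketch the first four spellings all collapse to `aaa_8`, so a search function could no longer tell them apart, which is the loss of information the commit message describes.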
@@ -0,0 +1,4 @@
+The codecs lookup function now again performs only minimal normalization of
+the encoding name before passing it to the search functions: all ASCII
+letters are converted to lower case, spaces are replaced with hyphens.
+This restores the pre-Python 3.9 behavior.
@@ -85,14 +85,15 @@ PyCodec_Unregister(PyObject *search_function)

-extern int _Py_normalize_encoding(const char *, char *, size_t);

-/* Convert a string to a normalized Python string(decoded from UTF-8): all characters are
-   converted to lower case, spaces and hyphens are replaced with underscores. */
+/* Convert a string to a normalized Python string: all ASCII letters are
+   converted to lower case, spaces are replaced with hyphens. */

-static
-PyObject *normalizestring(const char *string)
+static PyObject*
+normalizestring(const char *string)
 {
+    size_t i;
     size_t len = strlen(string);
-    char *encoding;
+    char *p;
     PyObject *v;

     if (len > PY_SSIZE_T_MAX) {
@@ -100,28 +101,30 @@ PyObject *normalizestring(const char *string)
         return NULL;
     }

-    encoding = PyMem_Malloc(len + 1);
-    if (encoding == NULL)
+    p = PyMem_Malloc(len + 1);
+    if (p == NULL)
         return PyErr_NoMemory();
-
-    if (!_Py_normalize_encoding(string, encoding, len + 1))
-    {
-        PyErr_SetString(PyExc_RuntimeError, "_Py_normalize_encoding() failed");
-        PyMem_Free(encoding);
-        return NULL;
+    for (i = 0; i < len; i++) {
+        char ch = string[i];
+        if (ch == ' ')
+            ch = '-';
+        else
+            ch = Py_TOLOWER(Py_CHARMASK(ch));
+        p[i] = ch;
     }
-
-    v = PyUnicode_FromString(encoding);
-    PyMem_Free(encoding);
+    p[i] = '\0';
+    v = PyUnicode_FromString(p);
+    PyMem_Free(p);
     return v;
 }

 /* Lookup the given encoding and return a tuple providing the codec
    facilities.

-   The encoding string is looked up converted to all lower-case
-   characters. This makes encodings looked up through this mechanism
-   effectively case-insensitive.
+   ASCII letters in the encoding string is looked up converted to all
+   lower case. This makes encodings looked up through this mechanism
+   effectively case-insensitive. Spaces are replaced with hyphens for
+   names like "US ASCII" and "ISO 8859-1".

    If no codec is found, a LookupError is set and NULL returned.

@@ -142,8 +145,8 @@ PyObject *_PyCodec_Lookup(const char *encoding)
     assert(interp->codecs.initialized);

     /* Convert the encoding to a normalized Python string: all
-       characters are converted to lower case, spaces and hyphens are
-       replaced with underscores. */
+       ASCII letters are converted to lower case, spaces are
+       replaced with hyphens. */
     PyObject *v = normalizestring(encoding);
     if (v == NULL) {
         return NULL;