gh-88886: Remove excessive encoding name normalization (GH-137167)

The codecs lookup function now performs only minimal normalization of
the encoding name before passing it to the search functions:
all ASCII letters are converted to lower case and spaces are replaced
with hyphens.
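
As a rough illustration, the rule described above can be modelled in pure Python. The helper name `normalize_lookup_name` is hypothetical and exists only for this sketch; the actual implementation is the C function `normalizestring()` changed in this commit, which operates on the bytes of a C string rather than on a `str`.

```python
def normalize_lookup_name(encoding: str) -> str:
    """Hypothetical pure-Python model of the minimal normalization.

    Only ASCII upper-case letters are lowered and only spaces are
    replaced; every other character is passed through unchanged.
    """
    out = []
    for ch in encoding:
        if ch == " ":
            out.append("-")          # spaces become hyphens
        elif "A" <= ch <= "Z":
            out.append(ch.lower())   # ASCII letters only
        else:
            out.append(ch)           # dots, underscores, non-ASCII, ... kept
    return "".join(out)


print(normalize_lookup_name("US ASCII"))      # us-ascii
print(normalize_lookup_name("TEST.AAA---8"))  # test.aaa---8
```

Note that hyphens, underscores and dots are not touched, so a search function receives them exactly as the caller wrote them.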

Excessive normalization broke third-party codec providers, such as
python-iconv.

Revert "bpo-37751: Fix codecs.lookup() normalization (GH-15092)"

This reverts commit 20f59fe1f7.
Serhiy Storchaka 2025-09-09 21:07:21 +03:00 committed by GitHub
parent 6b7b9d00a9
commit af58a6f883
GPG key ID: B5690EEEBB952194
5 changed files with 53 additions and 44 deletions

View file

@@ -68,11 +68,21 @@ The full details for each codec can also be looked up directly:
Looks up the codec info in the Python codec registry and returns a
:class:`CodecInfo` object as defined below.
Encodings are first looked up in the registry's cache. If not found, the list of
This function first normalizes the *encoding*: all ASCII letters are
converted to lower case, spaces are replaced with hyphens.
Then encoding is looked up in the registry's cache. If not found, the list of
registered search functions is scanned. If no :class:`CodecInfo` object is
found, a :exc:`LookupError` is raised. Otherwise, the :class:`CodecInfo` object
is stored in the cache and returned to the caller.
.. versionchanged:: 3.9
Any characters except ASCII letters and digits and a dot are converted to underscore.
.. versionchanged:: next
No characters are converted to underscore anymore.
Spaces are converted to hyphens.
.. class:: CodecInfo(encode, decode, streamreader=None, streamwriter=None, incrementalencoder=None, incrementaldecoder=None, name=None)
Codec details when looking up the codec registry. The constructor
@@ -167,14 +177,11 @@ function:
.. function:: register(search_function, /)
Register a codec search function. Search functions are expected to take one
argument, being the encoding name in all lower case letters with hyphens
and spaces converted to underscores, and return a :class:`CodecInfo` object.
argument, being the encoding name in all lower case letters with spaces
converted to hyphens, and return a :class:`CodecInfo` object.
In case a search function cannot find a given encoding, it should return
``None``.
.. versionchanged:: 3.9
Hyphens and spaces are converted to underscore.
.. function:: unregister(search_function, /)
@@ -1065,7 +1072,7 @@ or with dictionaries as mapping tables. The following table lists the codecs by
name, together with a few common aliases, and the languages for which the
encoding is likely used. Neither the list of aliases nor the list of languages
is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases
case or use a space or a hyphen instead of an underscore are also valid aliases
because they are equivalent when normalized by
:func:`~encodings.normalize_encoding`. For example, ``'utf-8'`` is a valid
alias for the ``'utf_8'`` codec.

View file

@@ -630,7 +630,6 @@ class CAPICodecs(unittest.TestCase):
for name in [
encoding_name,
encoding_name.upper(),
encoding_name.replace('_', '-'),
]:
with self.subTest(name):
self.assertTrue(_testcapi.codec_known_encoding(name))

View file

@@ -3873,26 +3873,22 @@ class Rot13UtilTest(unittest.TestCase):
class CodecNameNormalizationTest(unittest.TestCase):
"""Test codec name normalization"""
def test_codecs_lookup(self):
FOUND = (1, 2, 3, 4)
NOT_FOUND = (None, None, None, None)
def search_function(encoding):
if encoding == "aaa_8":
return FOUND
if encoding.startswith("test."):
return (encoding, 2, 3, 4)
else:
return NOT_FOUND
return None
codecs.register(search_function)
self.addCleanup(codecs.unregister, search_function)
self.assertEqual(FOUND, codecs.lookup('aaa_8'))
self.assertEqual(FOUND, codecs.lookup('AAA-8'))
self.assertEqual(FOUND, codecs.lookup('AAA---8'))
self.assertEqual(FOUND, codecs.lookup('AAA 8'))
self.assertEqual(FOUND, codecs.lookup('aaa\xe9\u20ac-8'))
self.assertEqual(NOT_FOUND, codecs.lookup('AAA.8'))
self.assertEqual(NOT_FOUND, codecs.lookup('AAA...8'))
self.assertEqual(NOT_FOUND, codecs.lookup('BBB-8'))
self.assertEqual(NOT_FOUND, codecs.lookup('BBB.8'))
self.assertEqual(NOT_FOUND, codecs.lookup('a\xe9\u20ac-8'))
self.assertEqual(codecs.lookup('test.aaa_8'), ('test.aaa_8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA-8'), ('test.aaa-8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA 8'), ('test.aaa-8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA---8'), ('test.aaa---8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA   8'), ('test.aaa---8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA\xe9\u20ac-8'), ('test.aaa\xe9\u20ac-8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA.8'), ('test.aaa.8', 2, 3, 4))
self.assertEqual(codecs.lookup('TEST.AAA...8'), ('test.aaa...8', 2, 3, 4))
def test_encodings_normalize_encoding(self):
# encodings.normalize_encoding() ignores non-ASCII characters.

View file

@@ -0,0 +1,4 @@
The codecs lookup function now again performs only minimal normalization of
the encoding name before passing it to the search functions: all ASCII
letters are converted to lower case, spaces are replaced with hyphens.
This restores the pre-Python 3.9 behavior.

View file

@@ -85,14 +85,15 @@ PyCodec_Unregister(PyObject *search_function)
extern int _Py_normalize_encoding(const char *, char *, size_t);
/* Convert a string to a normalized Python string(decoded from UTF-8): all characters are
converted to lower case, spaces and hyphens are replaced with underscores. */
/* Convert a string to a normalized Python string: all ASCII letters are
converted to lower case, spaces are replaced with hyphens. */
static
PyObject *normalizestring(const char *string)
static PyObject*
normalizestring(const char *string)
{
size_t i;
size_t len = strlen(string);
char *encoding;
char *p;
PyObject *v;
if (len > PY_SSIZE_T_MAX) {
@@ -100,28 +101,30 @@ PyObject *normalizestring(const char *string)
return NULL;
}
encoding = PyMem_Malloc(len + 1);
if (encoding == NULL)
p = PyMem_Malloc(len + 1);
if (p == NULL)
return PyErr_NoMemory();
if (!_Py_normalize_encoding(string, encoding, len + 1))
{
PyErr_SetString(PyExc_RuntimeError, "_Py_normalize_encoding() failed");
PyMem_Free(encoding);
return NULL;
for (i = 0; i < len; i++) {
char ch = string[i];
if (ch == ' ')
ch = '-';
else
ch = Py_TOLOWER(Py_CHARMASK(ch));
p[i] = ch;
}
v = PyUnicode_FromString(encoding);
PyMem_Free(encoding);
p[i] = '\0';
v = PyUnicode_FromString(p);
PyMem_Free(p);
return v;
}
/* Lookup the given encoding and return a tuple providing the codec
facilities.
The encoding string is looked up converted to all lower-case
characters. This makes encodings looked up through this mechanism
effectively case-insensitive.
The encoding string is looked up with its ASCII letters converted to
lower case. This makes encodings looked up through this mechanism
effectively case-insensitive. Spaces are replaced with hyphens for
names like "US ASCII" and "ISO 8859-1".
If no codec is found, a LookupError is set and NULL returned.
@@ -142,8 +145,8 @@ PyObject *_PyCodec_Lookup(const char *encoding)
assert(interp->codecs.initialized);
/* Convert the encoding to a normalized Python string: all
characters are converted to lower case, spaces and hyphens are
replaced with underscores. */
ASCII letters are converted to lower case, spaces are
replaced with hyphens. */
PyObject *v = normalizestring(encoding);
if (v == NULL) {
return NULL;