bpo-42236: Use UTF-8 encoding if nl_langinfo(CODESET) fails (GH-23086)

If the nl_langinfo(CODESET) function returns an empty string, Python
now uses UTF-8 as the filesystem encoding.

In May 2010 (commit b744ba1d14), I
modified Python to log a warning and use UTF-8 as the filesystem
encoding (instead of None) if nl_langinfo(CODESET) returns an empty
string.

In August 2020 (commit 94908bbc15), I
modified Python startup to fail with a fatal error and a specific
error message if nl_langinfo(CODESET) returns an empty string. The
intent was to prevent guessing the encoding and also investigate user
configuration where this case happens.

In 10 years (2010 to 2020), I saw zero user reports about the error
message related to nl_langinfo(CODESET) returning an empty string.

Today, UTF-8 has become the de facto standard, and it is safe to
assume that the user expects UTF-8. For example,
nl_langinfo(CODESET) can return an empty string on macOS if the
LC_CTYPE locale is not supported, and UTF-8 is the default encoding
on macOS.

While this change is unlikely to affect anyone in practice, it
should make UTF-8 lovers happy ;-)

Also rewrite the documentation explaining how Python selects the
filesystem encoding and error handler.
Victor Stinner 2020-11-01 23:07:23 +01:00 committed by GitHub
parent 82458b6cdb
commit e662c398d8
8 changed files with 87 additions and 89 deletions

@@ -253,10 +253,16 @@ PyPreConfig
See :c:member:`PyConfig.isolated`.
.. c:member:: int legacy_windows_fs_encoding (Windows only)
.. c:member:: int legacy_windows_fs_encoding
If non-zero, disable UTF-8 Mode, set the Python filesystem encoding to
``mbcs``, set the filesystem error handler to ``replace``.
If non-zero:
* Set :c:member:`PyPreConfig.utf8_mode` to ``0``,
* Set :c:member:`PyConfig.filesystem_encoding` to ``"mbcs"``,
* Set :c:member:`PyConfig.filesystem_errors` to ``"replace"``.
Initialized from the :envvar:`PYTHONLEGACYWINDOWSFSENCODING` environment
variable value.
Only available on Windows. ``#ifdef MS_WINDOWS`` macro can be used for
Windows specific code.
@@ -499,11 +505,47 @@ PyConfig
.. c:member:: wchar_t* filesystem_encoding
Filesystem encoding, :func:`sys.getfilesystemencoding`.
Filesystem encoding: :func:`sys.getfilesystemencoding`.
On macOS, Android and VxWorks: use ``"utf-8"`` by default.
On Windows: use ``"utf-8"`` by default, or ``"mbcs"`` if
:c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
:c:type:`PyPreConfig` is non-zero.
Default encoding on other platforms:
* ``"utf-8"`` if :c:member:`PyPreConfig.utf8_mode` is non-zero.
* ``"ascii"`` if Python detects that ``nl_langinfo(CODESET)`` announces
the ASCII encoding (or Roman8 encoding on HP-UX), whereas the
``mbstowcs()`` function decodes from a different encoding (usually
Latin1).
* ``"utf-8"`` if ``nl_langinfo(CODESET)`` returns an empty string.
* Otherwise, use the LC_CTYPE locale encoding:
``nl_langinfo(CODESET)`` result.
At Python startup, the encoding name is normalized to the Python codec
name. For example, ``"ANSI_X3.4-1968"`` is replaced with ``"ascii"``.
See also the :c:member:`~PyConfig.filesystem_errors` member.
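The selection rules above can be modelled as a small pure function. This is only an illustrative sketch, not CPython's implementation: the function name, the platform labels (``"macos"``, ``"android"``, ``"vxworks"``, ``"windows"``) and the ``ascii_mismatch`` flag are hypothetical stand-ins for checks that Python performs in C at startup, and ``codeset`` stands in for the ``nl_langinfo(CODESET)`` result.

```python
import codecs


def default_fs_encoding(platform, *, utf8_mode=False,
                        legacy_windows_fs_encoding=False,
                        codeset="", ascii_mismatch=False):
    """Model the documented default filesystem encoding (illustration only).

    platform: hypothetical label, not a sys.platform value.
    codeset: stand-in for the nl_langinfo(CODESET) result.
    ascii_mismatch: stand-in for "the locale announces ASCII, but
        mbstowcs() decodes from a different encoding".
    """
    # macOS, Android and VxWorks: always UTF-8 by default.
    if platform in ("macos", "android", "vxworks"):
        return "utf-8"
    # Windows: UTF-8, unless the legacy flag selects mbcs.
    if platform == "windows":
        return "mbcs" if legacy_windows_fs_encoding else "utf-8"
    # Other platforms, in the documented order.
    if utf8_mode:
        return "utf-8"
    if ascii_mismatch:
        return "ascii"
    if not codeset:
        # The case changed by this commit: empty string means UTF-8.
        return "utf-8"
    try:
        # Normalize the locale name to the Python codec name.
        return codecs.lookup(codeset).name
    except LookupError:
        # Assumption of this sketch: return unknown names unchanged.
        return codeset


print(default_fs_encoding("other", codeset=""))
print(default_fs_encoding("other", codeset="ANSI_X3.4-1968"))
```

The normalization step uses ``codecs.lookup()``, which resolves aliases such as ``"ANSI_X3.4-1968"`` to the codec name ``"ascii"``.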
.. c:member:: wchar_t* filesystem_errors
Filesystem encoding errors, :func:`sys.getfilesystemencodeerrors`.
Filesystem error handler: :func:`sys.getfilesystemencodeerrors`.
On Windows: use ``"surrogatepass"`` by default, or ``"replace"`` if
:c:member:`~PyPreConfig.legacy_windows_fs_encoding` of
:c:type:`PyPreConfig` is non-zero.
On other platforms: use ``"surrogateescape"`` by default.
Supported error handlers:
* ``"strict"``
* ``"surrogateescape"``
* ``"surrogatepass"`` (only supported with the UTF-8 encoding)
See also the :c:member:`~PyConfig.filesystem_encoding` member.
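To see what these error handlers do in practice, here is a short Python sketch. The file name bytes are made up for illustration; the example only relies on the standard ``bytes.decode()`` and ``str.encode()`` methods and on the two ``sys`` functions named above.

```python
import sys

# Values chosen by the running interpreter (platform dependent).
print(sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())

# A made-up file name encoded in Latin-1: b"\xe9" is not valid UTF-8.
raw = b"caf\xe9.txt"

# "strict" refuses to decode the invalid byte.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "surrogateescape" maps the undecodable byte 0xE9 to the lone
# surrogate U+DCE9, so the original bytes can be restored on encode.
name = raw.decode("utf-8", "surrogateescape")
round_tripped = name.encode("utf-8", "surrogateescape")

# "surrogatepass" encodes a lone surrogate as if it were a regular
# code point, using the 3-byte UTF-8 form.
passed = "\udce9".encode("utf-8", "surrogatepass")
```

The round trip through ``"surrogateescape"`` is what lets Python pass undecodable file names back to the operating system unchanged.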
.. c:member:: unsigned long hash_seed
.. c:member:: int use_hash_seed