bpo-36502: Correct documentation of str.isspace() (GH-15019)

The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so.  The `unicodedata` module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace() test a thorough check that the
implementation agrees with the intended definition.
This commit is contained in:
Greg Price 2019-08-14 04:05:19 -07:00 committed by Victor Stinner
parent 077af8c2c9
commit 6bccbe7dfb
2 changed files with 19 additions and 4 deletions

View file

@ -12,6 +12,7 @@ import operator
import struct
import sys
import textwrap
import unicodedata
import unittest
import warnings
from test import support, string_tests
@ -617,11 +618,21 @@ class UnicodeTest(string_tests.CommonTest,
self.checkequalnofix(True, '\u2000', 'isspace')
self.checkequalnofix(True, '\u200a', 'isspace')
self.checkequalnofix(False, '\u2014', 'isspace')
# apparently there are no non-BMP spaces chars in Unicode 6
# There are no non-BMP whitespace chars as of Unicode 12.
for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E',
'\U0001F40D', '\U0001F46F']:
self.assertFalse(ch.isspace(), '{!a} is not space.'.format(ch))
@support.requires_resource('cpu')
def test_isspace_invariant(self):
for codepoint in range(sys.maxunicode + 1):
char = chr(codepoint)
bidirectional = unicodedata.bidirectional(char)
category = unicodedata.category(char)
self.assertEqual(char.isspace(),
(bidirectional in ('WS', 'B', 'S')
or category == 'Zs'))
def test_isalnum(self):
super().test_isalnum()
for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E',