Commit graph

139 commits

Author SHA1 Message Date
Victor Stinner
84f7382215
bpo-42157: Rename unicodedata.ucnhash_CAPI (GH-22994)
Removed the unicodedata.ucnhash_CAPI attribute which was an internal
PyCapsule object. The related private _PyUnicode_Name_CAPI structure
was moved to the internal C API.

Rename unicodedata.ucnhash_CAPI as unicodedata._ucnhash_CAPI.
2020-10-27 04:36:22 +01:00
Victor Stinner
c8c4200b65
bpo-42157: Convert unicodedata.UCD to heap type (GH-22991)
Convert the unicodedata extension module to the multiphase
initialization API (PEP 489) and convert the unicodedata.UCD static
type to a heap type.

Co-Authored-By: Mohamed Koubaa <koubaa.m@gmail.com>
2020-10-26 23:19:22 +01:00
Victor Stinner
920cb647ba
bpo-42157: unicodedata avoids references to UCD_Type (GH-22990)
* UCD_Check() uses PyModule_Check()
* Simplify the internal _PyUnicode_Name_CAPI structure:

  * Remove size and state members
  * Remove state and self parameters of getcode() and getname()
    functions

* Remove global_module_state
2020-10-26 19:19:36 +01:00
Victor Stinner
47e1afd2a1
bpo-1635741: _PyUnicode_Name_CAPI moves to internal C API (GH-22713)
The private _PyUnicode_Name_CAPI structure of the PyCapsule API
unicodedata.ucnhash_CAPI moves to the internal C API. Moreover, the
structure gets a new state member which must be passed to the
getcode() and getname() functions.

* Move Include/ucnhash.h to Include/internal/pycore_ucnhash.h
* unicodedata module is now built with Py_BUILD_CORE_MODULE.
* unicodedata: move hashAPI variable into unicodedata_module_state.
2020-10-26 16:43:47 +01:00
Victor Stinner
e6b8c5263a
bpo-1635741: Add a global module state to unicodedata (GH-22712)
Prepare unicodedata to add a state per module: start with a global
"module" state, pass it to subfunctions which access &UCD_Type. This
change also prepares the conversion of the UCD_Type static type to a
heap type.
2020-10-15 16:22:19 +02:00
Mohamed Koubaa
ddc0dd001a
bpo-1635741, unicodedata: add ucd_type parameter to UCD_Check() macro (GH-22328)
Co-authored-by: Victor Stinner <vstinner@python.org>
2020-09-23 12:38:16 +02:00
Victor Stinner
4a21e57fe5
bpo-40268: Remove unused structmember.h includes (GH-19530)
If only offsetof() is needed: include stddef.h instead.

When structmember.h is used, add a comment explaining that
PyMemberDef is used.
2020-04-15 02:35:41 +02:00
Serhiy Storchaka
cd8295ff75
bpo-39943: Add the const qualifier to pointers on non-mutable PyUnicode data. (GH-19345) 2020-04-11 10:48:40 +03:00
Andy Lester
982307b9cc
bpo-39943: Remove unused self from find_nfc_index() (GH-18973) 2020-03-17 17:38:12 +01:00
Benjamin Peterson
051b9d08d1
closes bpo-39926: Update Unicode to 13.0.0. (GH-18910) 2020-03-10 20:41:34 -07:00
Dong-hee Na
1b55b65638
bpo-39573: Clean up modules and headers to use Py_IS_TYPE() function (GH-18521) 2020-02-17 11:09:15 +01:00
Victor Stinner
d2ec81a8c9
bpo-39573: Add Py_SET_TYPE() function (GH-18394)
Add Py_SET_TYPE() function to set the type of an object.
2020-02-07 09:17:07 +01:00
Jordon Xu
2ec7010206 bpo-37752: Delete redundant Py_CHARMASK in normalizestring() (GH-15095) 2019-09-10 17:04:08 +01:00
Greg Price
7669cb8b21 bpo-38043: Use bool for boolean flags on is_normalized_quickcheck. (GH-15711) 2019-09-09 02:16:31 -07:00
Greg Price
2f09413947 closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558)
The purpose of the `unicodedata.is_normalized` function is to answer
the question `str == unicodedata.normalized(form, str)` more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm.  This greatly speeds up
`unicodedata.is_normalized` in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

  $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

  $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
      -- 'unicodedata.is_normalized("NFD", s)'
  5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this
code had for the `unicodedata.normalize` use case.

With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
2019-09-03 19:45:44 -07:00
Jeroen Demeyer
530f506ac9 bpo-36974: tp_print -> tp_vectorcall_offset and tp_reserved -> tp_as_async (GH-13464)
Automatically replace
tp_print -> tp_vectorcall_offset
tp_compare -> tp_as_async
tp_reserved -> tp_as_async
2019-05-30 19:13:39 -07:00
Inada Naoki
6fec905de5
bpo-36642: make unicodedata const (GH-12855) 2019-04-17 08:40:34 +09:00
Max Bélanger
2810dd7be9 closes bpo-32285: Add unicodedata.is_normalized. (GH-4806) 2018-11-04 15:58:24 -08:00
Wonsup Yoon
d134809cd3 bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
Hangul composition check boundaries are wrong for the second character
([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3)
instead of [0x11A7, 0x11C3]).
2018-06-15 20:03:14 +08:00
Benjamin Peterson
7c69c1c0fb
update to Unicode 11.0.0 (closes bpo-33778) (GH-7439)
Also, standardize indentation of generated tables.
2018-06-06 20:14:28 -07:00
luzpaz
a5293b4ff2 Fix miscellaneous typos (#4275) 2017-11-05 15:37:50 +02:00
Benjamin Peterson
279a96206f bpo-30736: upgrade to Unicode 10.0 (#2344)
Straightforward. While we're at it, though, strip trailing whitespace from generated tables.
2017-06-22 22:31:08 -07:00
Serhiy Storchaka
f8d7d41507 Issue #28511: Use the "U" format instead of "O!" in PyArg_Parse*. 2016-10-23 15:12:25 +03:00
Christian Heimes
0202c347bc Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:21:20 +02:00
Christian Heimes
2f366cab48 Add an extra byte for null in case we ever get very long unicode names. 2016-09-23 20:20:27 +02:00
Benjamin Peterson
6775231597 Unicode 9.0.0
Not completely mechanical since support for East Asian Width changes—emoji
codepoints became Wide—had to be added to unicodedata.
2016-09-14 23:53:47 -07:00
Christian Heimes
8cee10386e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:54 +02:00
Christian Heimes
7ce201322e Restrict name_length to NAME_MAXLEN in unicodedata_UCD_lookup() 2016-09-14 10:25:46 +02:00
Serhiy Storchaka
2d06e84455 Issue #25923: Added the const qualifier to static constant arrays. 2015-12-25 19:53:18 +02:00
Benjamin Peterson
4801383c29 upgrade to Unicode 8.0.0 2015-06-27 15:45:56 -05:00
Larry Hastings
38337d1e15 Issue #24000: Improved Argument Clinic's mapping of converters to legacy
"format units".  Updated the documentation to match.
2015-05-07 23:30:09 -07:00
Larry Hastings
dbfdc380df Issue #24001: Argument Clinic converters now use accept={type}
instead of types={'type'} to specify the types the converter accepts.
2015-05-04 06:59:46 -07:00
Serhiy Storchaka
6359641bcd Issue #20181: Converted the unicodedata module to Argument Clinic. 2015-04-17 21:18:49 +03:00
Larry Hastings
89964c48d1 Issue #23944: Argument Clinic now wraps long impl prototypes at column 78. 2015-04-14 18:07:59 -04:00
Serhiy Storchaka
1009bf18b3 Issue #23501: Argumen Clinic now generates code into separate files by default. 2015-04-03 23:53:51 +03:00
Benjamin Peterson
5061e67f0f merge 3.3 (#23367) 2015-03-02 11:18:40 -05:00
Benjamin Peterson
b779bfba45 fix possible overflow bugs in unicodedata (closes #23367) 2015-03-02 11:17:05 -05:00
Serhiy Storchaka
1a1ff29659 Issue #23446: Use PyMem_New instead of PyMem_Malloc to avoid possible integer
overflows.  Added few missed PyErr_NoMemory().
2015-02-16 13:28:22 +02:00
Serhiy Storchaka
d3faf43f9b Issue #23181: More "codepoint" -> "code point". 2015-01-18 11:28:37 +02:00
Victor Stinner
65a3144e54 Closes #21780: make the unicodedata module "ssize_t clean" for parsing parameters 2014-07-01 16:45:52 +02:00
Larry Hastings
2623c8c23c Issue #20530: Argument Clinic's signature format has been revised again.
The new syntax is highly human readable while still preventing false
positives.  The syntax also extends Python syntax to denote "self" and
positional-only parameters, allowing inspect.Signature objects to be
totally accurate for all supported builtins in Python 3.4.
2014-02-08 22:15:29 -08:00
Larry Hastings
581ee3618c Issue #20326: Argument Clinic now uses a simple, unique signature to
annotate text signatures in docstrings, resulting in fewer false
positives.  "self" parameters are also explicitly marked, allowing
inspect.Signature() to authoritatively detect (and skip) said parameters.

Issue #20326: Argument Clinic now generates separate checksums for the
input and output sections of the block, allowing external tools to verify
that the input has not changed (and thus the output is not out-of-date).
2014-01-28 05:00:08 -08:00
Larry Hastings
c20472640c Issue #20390: Small fixes and improvements for Argument Clinic. 2014-01-25 20:43:29 -08:00
Larry Hastings
5c66189e88 Issue #20189: Four additional builtin types (PyTypeObject,
PyMethodDescr_Type, _PyMethodWrapper_Type, and PyWrapperDescr_Type)
have been modified to provide introspection information for builtins.
Also: many additional Lib, test suite, and Argument Clinic fixes.
2014-01-24 06:17:25 -08:00
Larry Hastings
61272b77b0 Issue #19273: The marker comments Argument Clinic uses have been changed
to improve readability.
2014-01-07 12:41:53 -08:00
Larry Hastings
77561cccb2 Issue #20141: Improved Argument Clinic's support for the PyArg_Parse "O!"
format unit.
2014-01-07 12:13:13 -08:00
Larry Hastings
44e2eaab54 Issue #19674: inspect.signature() now produces a correct signature
for some builtins.
2013-11-23 15:37:55 -08:00
Larry Hastings
ed4a1c5703 Argument Clinic: rename "self" to "module" for module-level functions. 2013-11-18 09:32:13 -08:00
Larry Hastings
3182680210 Issue #16612: Add "Argument Clinic", a compile-time preprocessor
for C files to generate argument parsing code.  (See PEP 436.)
2013-10-19 00:09:25 -07:00
Benjamin Peterson
a76795bf53 merge 3.3 2013-10-10 20:22:39 -04:00