re.error is now raised instead of OverflowError or RuntimeError when the
width of a look-behind pattern is too large.
The limit is increased to 2**32-1 (was 2**31-1).
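
A minimal sketch of how this can be observed from Python code. The helper
name compile_outcome and the nested-repetition pattern are illustrative
assumptions, not part of the change itself. Older interpreters may still
report one of the legacy exception types, so the helper accepts all three.

```python
import re

def compile_outcome(pattern):
    """Return None if the pattern compiles, else the kind of exception raised."""
    try:
        re.compile(pattern)
    except re.error:
        return "re.error"
    except (OverflowError, RuntimeError) as exc:
        return type(exc).__name__
    return None

# Nested repetition keeps the pattern text short while the look-behind
# width is 2**22 * 2**22 * 2**22 == 2**66, far above the limit.
too_wide = r"(?<=((.{%d}){%d}){%d})" % (2**22, 2**22, 2**22)

small_outcome = compile_outcome(r"(?<=a{3})b")  # a narrow look-behind compiles
wide_outcome = compile_outcome(too_wide)        # expected to be "re.error" with this change
print(small_outcome, wide_outcome)
```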
(cherry picked from commit e2b3d831fd)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
TypeError would be overwritten by OverflowError
if 'code' param contained non-ints.
(cherry picked from commit 344d3a222a)
Co-authored-by: Nikita Sobolev <mail@sobolevn.me>
Restore the global Input Stream pointer after trying to match a sub-pattern.
Co-authored-by: Ma Lin <animalize@users.noreply.github.com>
(cherry picked from commit abd9cc52d9)
Co-authored-by: SKO <41810398+uyw4687@users.noreply.github.com>
It did not work in the case of a subpattern containing backtracking.
Temporarily implement possessive quantifiers as equivalent greedy quantifiers
in atomic groups.
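
The equivalence can be illustrated with a small sketch. It assumes
Python 3.11 or later, where possessive quantifiers and atomic groups are
available. The patterns and variable names are hypothetical examples, not
taken from the test suite.

```python
import re
import sys

# Possessive quantifiers and atomic groups were added in Python 3.11.
SUPPORTS_POSSESSIVE = sys.version_info >= (3, 11)

if SUPPORTS_POSSESSIVE:
    text = "aaa"
    # Greedy a* can give one 'a' back, so the trailing 'a' still matches.
    greedy = re.fullmatch(r"a*a", text)
    # Possessive a*+ never gives characters back, so the trailing 'a' fails.
    possessive = re.fullmatch(r"a*+a", text)
    # The atomic group around a greedy a* behaves the same way.
    atomic = re.fullmatch(r"(?>a*)a", text)
    print(bool(greedy), bool(possessive), bool(atomic))
```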
(cherry picked from commit 7b6e34e5ba)
In very rare circumstances the JUMP opcode could be confused with the
argument of the opcode in the "then" part which doesn't end with the
JUMP opcode. This led to incorrect detection of the final JUMP opcode
and incorrect calculation of the size of the subexpression.
NOTE: Changed return value of functions _validate_inner() and
_validate_charset() in Modules/_sre/sre.c. Now they return 0 on success,
-1 on failure, and 1 if the last op is JUMP (which usually is a failure).
Previously they returned 1 on success and 0 on failure.
(cherry picked from commit e9ac890c02)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Adds a regression test for an re slowdown observed by rjsmin.
Uses multiprocessing to kill the test after SHORT_TIMEOUT.
Co-authored-by: Oleg Iarygin <dralife@yandex.ru>
Co-authored-by: Christian Heimes <christian@python.org>
(cherry picked from commit fe23c0061d)
Co-authored-by: Miro Hrončok <miro@hroncok.cz>
Revert "bpo-23689: re module, fix memory leak when a match is terminated by a signal or memory allocation failure (GH-32283)"
This reverts commit 6e3eee5c11.
Manual fixups to increase the MAGIC number and to handle conflicts with
a couple of changes that landed after that.
Thanks for reviews by Ma Lin and Serhiy Storchaka.
(cherry picked from commit 4beee0c7b0)
Co-authored-by: Gregory P. Smith <greg@krypto.org>
Revert "bpo-47211: Remove function re.template() and flag re.TEMPLATE (GH-32300)"
This reverts commit b09184bf05.
(cherry picked from commit 16a7e4a0b7)
Co-authored-by: Miro Hrončok <miro@hroncok.cz>
Only a sequence of ASCII digits will be accepted as a numerical reference.
Group names in bytes patterns and replacement strings may only
contain ASCII letters, digits and underscores.
In the expression (?(group)...) an appropriate re.error is now
raised if the group number refers to an undefined group.
Previously it raised RuntimeError: invalid SRE code.
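
A short sketch of the difference, assuming a pattern with one group that
tests for a second, nonexistent group. The helper name conditional_outcome
is hypothetical. The RuntimeError branch is kept only so the sketch also
runs on interpreters without this change.

```python
import re

def conditional_outcome(pattern):
    """Return None if the pattern compiles, else the kind of exception raised."""
    try:
        re.compile(pattern)
    except re.error:
        return "re.error"
    except RuntimeError:
        return "RuntimeError"
    return None

# Group 1 exists, so the conditional is valid.
valid_outcome = conditional_outcome(r"(a)?(?(1)b|c)")
# Only one group is defined, so (?(2)...) refers to an undefined group.
invalid_outcome = conditional_outcome(r"()(?(2)a)")
print(valid_outcome, invalid_outcome)
```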
In rare cases, a capturing group could get the wrong result.
Regular expression engines in Perl and Java have similar bugs.
The new behavior matches that of more modern
RE engines: the regex module, PHP, Ruby and Node.js.
Flag members are now divided into one-bit versus multi-bit, with multi-bit members being treated as aliases. Iterating over a flag only returns the contained single-bit flags.
Iterating, repr(), and str() show members in definition order.
When constructing combined-member flags, any extra integer values are either discarded (CONFORM), turned into ints (EJECT) or treated as errors (STRICT). Flag classes can specify which of those three behaviors is desired:
>>> from enum import Flag, CONFORM
>>> class Test(Flag, boundary=CONFORM):
...     ONE = 1
...     TWO = 2
...
>>> Test(5)
<Test.ONE: 1>
Besides the three above behaviors, there is also KEEP, which should not be used unless necessary -- for example, _convert_ specifies KEEP as there are flag sets in the stdlib that are incomplete and/or inconsistent (e.g. ssl.Options). KEEP will, as the name suggests, keep all bits; however, iterating over a flag with extra bits will only return the canonical flags contained, not the extra bits.
Iteration is now in member definition order. If member definition order
matches increasing value order, then a more efficient method of flag
decomposition is used; otherwise, sort() is called on the results of
that method to get definition order.
``re`` module:
repr() has been modified to support as closely as possible its previous
output; the big difference is that inverted flags cannot be output as
before because the inversion operation now always returns the comparable
positive result; i.e.
re.A|re.I|re.M|re.S is ~(re.L|re.U|re.X|re.T|re.DEBUG)
In both of the above terms, the ``value`` is 282.
re's tests have been updated to reflect the modifications to repr().
* bpo-36929: Modify io/re tests to allow for missing mod name
For a vanishingly small number of internal types, CPython sets the
tp_name slot to mod_name.type_name, either in the PyTypeObject or the
PyType_Spec. There are a few minor places where this surfaces:
* Custom repr functions for those types (some of which ignore the
tp_name in favor of using a string literal, such as _io.TextIOWrapper)
* Pickling error messages
The test suite only tests the former. This commit modifies the test
suite to allow Python implementations to omit the module prefix.
https://bugs.python.org/issue36929
Use locale.getpreferredencoding() rather than locale.getlocale() to
get the locale encoding. With some locales, locale.getlocale()
returns the wrong encoding.
For example, on Fedora 29, locale.getlocale() returns ISO-8859-1
encoding for the "en_IN" locale, whereas
locale.getpreferredencoding() reports the correct encoding: UTF-8.
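
The two calls can be compared with a sketch like the following. The
helper name locale_encodings is hypothetical, and the values printed
depend entirely on the environment, so no particular output is implied.

```python
import locale

def locale_encodings():
    """Return (preferred encoding, encoding reported by getlocale())."""
    # Passing False avoids calling setlocale() as a side effect.
    preferred = locale.getpreferredencoding(False)
    try:
        _language, from_getlocale = locale.getlocale()
    except ValueError:
        # getlocale() can fail to parse some locale names.
        from_getlocale = None
    return preferred, from_getlocale

preferred, from_getlocale = locale_encodings()
print(preferred, from_getlocale)
```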
Capturing groups need to be reset between two SRE(match) calls in loops; this fixes wrong capturing groups in rare cases.
Also add a missing index in re.rst.
Warnings emitted when compiling a regular expression now always point
to the line in the user code. Previously they could point into the internals
of the re module if emitted from inside groups or conditionals.
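
One way to inspect where such a warning points is to record it and look
at its filename and line number. This is a sketch under assumptions: the
helper name compile_warnings is hypothetical, and the pattern relies on
the "possible nested set" FutureWarning, which may not be emitted by
every Python version.

```python
import re
import warnings

def compile_warnings(pattern):
    """Compile a pattern and return (category, filename, lineno) per warning."""
    re.purge()  # make sure the pattern is really compiled, not served from the cache
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        re.compile(pattern)
    return [(w.category, w.filename, w.lineno) for w in caught]

# The '[[' inside a non-capturing group may trigger a FutureWarning.
report = compile_warnings(r"(?:[[a-z]x)")
for category, filename, lineno in report:
    print(category.__name__, filename, lineno)
```

With this change the reported filename should be the file that called
re.compile() rather than a file inside the re package.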