Commit graph

125 commits

Author SHA1 Message Date
Gregory P. Smith
2e279e85fe
gh-88500: Reduce memory use of urllib.unquote (#96763)
`urllib.unquote_to_bytes` and `urllib.unquote` could both potentially generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()` style operations.

Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or lack much Unicode or % escaping. The functions are already quite fast, so this is not a big deal. The slowdown scales linearly with input size, as expected.

Memory usage observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unittesting memory consumption is difficult and does not seem worthwhile.

Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
2022-12-10 16:17:39 -08:00
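The expanding-`bytearray` idea from gh-88500 can be illustrated roughly as follows. This is a simplified sketch and not the actual CPython implementation; `unquote_to_bytes_sketch` and `_HEX` are hypothetical names introduced here. It appends decoded output to a single growing buffer instead of building a list of split pieces first.

```python
from urllib.parse import unquote_to_bytes

_HEX = "0123456789ABCDEFabcdef"  # hypothetical helper constant


def unquote_to_bytes_sketch(string: str) -> bytes:
    """Percent-decode `string` into bytes using one growing bytearray."""
    data = string.encode("utf-8")
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        pos = data.find(b"%", i)
        if pos < 0:
            out += data[i:]          # no more escapes: copy the tail
            break
        out += data[i:pos]           # copy the literal run before '%'
        pair = data[pos + 1:pos + 3]
        if len(pair) == 2 and all(chr(c) in _HEX for c in pair):
            out.append(int(pair, 16))  # valid %XX escape
            i = pos + 3
        else:
            out.append(0x25)         # stray '%' is kept as-is
            i = pos + 1
    return bytes(out)


mess = "\u0141%%%20a%fe" * 10
print(unquote_to_bytes_sketch(mess) == unquote_to_bytes(mess))
```

The sketch only aims to match the standard library's output on inputs like the antagonistic one quoted above; it makes no claim about matching its performance.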
Ben Kallus
439b9cfaf4
gh-99418: Make urllib.parse.urlparse enforce that a scheme must begin with an alphabetical ASCII character. (#99421)
Prevent urllib.parse.urlparse from accepting schemes that don't begin with an alphabetical ASCII character.

RFC 3986 defines a scheme like this: `scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`
RFC 2234 defines an ALPHA like this: `ALPHA = %x41-5A / %x61-7A`

The WHATWG URL spec defines a scheme like this:
`"A URL-scheme string must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, U+002B (+), U+002D (-), and U+002E (.)."`
2022-11-13 10:25:55 -08:00
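The scheme grammar quoted in gh-99418 translates directly into a regular expression. The helper below is a minimal sketch of that grammar only; `is_valid_scheme` is a hypothetical name and is not part of `urllib.parse`.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), with ALPHA limited to ASCII
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*")


def is_valid_scheme(scheme: str) -> bool:
    """Return True if `scheme` matches the RFC 3986 scheme production."""
    return _SCHEME_RE.fullmatch(scheme) is not None


for candidate in ("http", "svn+ssh", "1http", "+x", ""):
    print(repr(candidate), is_valid_scheme(candidate))
```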
Ben Kallus
6f15ca8c7a
gh-96035: Make urllib.parse.urlparse reject non-numeric ports (#98273)
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
2022-10-20 14:00:56 -07:00
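A strict port check along the lines of gh-96035 might look like the sketch below. It assumes the motivation is that a bare `int()` call is too permissive for ports (it accepts strings such as `"+80"`, `" 80 "`, and `"8_0"`); the commit message itself only says that non-numeric ports are rejected. `parse_port_strict` is a hypothetical helper, not a `urllib.parse` function.

```python
def parse_port_strict(port: str) -> int:
    """Convert a port string to int, accepting ASCII digits only."""
    # str.isdigit() alone would also accept non-ASCII digits,
    # so the ASCII check is applied as well.
    if not (port.isascii() and port.isdigit()):
        raise ValueError(f"Port could not be cast to integer value as {port!r}")
    value = int(port)
    if not 0 <= value <= 65535:
        raise ValueError("Port out of range 0-65535")
    return value


print(parse_port_strict("8080"))
```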
Gregory P. Smith
e61ca22431
gh-95865: Further reduce quote_from_bytes memory consumption (#96860)
on large input values.  Based on Dennis Sweeney's chunking idea.
2022-09-19 16:06:25 -07:00
Dennis Sweeney
8ba22b90ca
gh-95865: Speed up urllib.parse.quote_from_bytes() (GH-95872) 2022-08-30 21:39:51 -04:00
Victor Stinner
259dd71c32
gh-84623: Remove unused imports in stdlib (#93773) 2022-06-13 16:28:41 +02:00
Oleg Iarygin
a03a09e068
Replace with_traceback() with exception chaining and reraising (GH-32074) 2022-03-30 15:28:20 +03:00
Christian Sattler
e6fe10d340
bpo-45874: Handle empty query string correctly in urllib.parse.parse_qsl (#29716) 2021-12-12 10:41:12 +02:00
Gregory P. Smith
d597fdc5fd
bpo-44002: Switch to lru_cache in urllib.parse. (GH-25798)
Switch to lru_cache in urllib.parse.

urllib.parse now uses functools.lru_cache for its internal URL splitting and
quoting caches instead of rolling its own like it's the 90s.

The undocumented internal Quoted class API is now deprecated
as it had no reason to be public and no existing OSS users were found.

The clear_cache() API remains undocumented but gets an explicit test as it
is used in a few projects' (twisted, gevent) tests as well as our own regrtest.
2021-05-11 17:01:44 -07:00
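The general pattern from bpo-44002 (an `lru_cache`-wrapped helper plus a function that clears it) can be sketched as below. `split_scheme` is a hypothetical stand-in for an internal splitting function, and the cache size is an arbitrary assumption; only `functools.lru_cache` and its `cache_clear()` / `cache_info()` methods are real standard-library API here.

```python
import functools


@functools.lru_cache(maxsize=128)  # size chosen arbitrarily for this sketch
def split_scheme(url: str) -> tuple[str, str]:
    """Hypothetical cached helper: split 'scheme:rest' into its two parts."""
    scheme, sep, rest = url.partition(":")
    return (scheme.lower(), rest) if sep else ("", url)


def clear_cache() -> None:
    """Drop all cached results, e.g. between tests."""
    split_scheme.cache_clear()
```

Tests that need a cold cache can call `clear_cache()` first and then inspect `split_scheme.cache_info()` to confirm hits and misses.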
Senthil Kumaran
985ac01637
bpo-43882: Remove newlines and tabs early, from query and fragments. (GH-25921) 2021-05-05 15:50:05 -07:00
Dong-hee Na
6143fcdf8b
bpo-43979: Remove unnecessary operation from urllib.parse.parse_qsl (GH-25756)
Automerge-Triggered-By: GH:gpshead
2021-04-30 12:01:55 -07:00
Senthil Kumaran
76cd81d603
bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)
* issue43882 - urllib.parse should sanitize urls containing ASCII newline and tabs.

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
2021-04-29 10:16:50 -07:00
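On interpreters that include the bpo-43882 change, the effect can be observed with a short check like the one below. The URL is made up for illustration.

```python
from urllib.parse import urlsplit

# Tab, newline, and carriage return embedded in path, query, and fragment.
parts = urlsplit("http://example.com/pa\tth?que\nry#frag\rment")
print(repr(parts.path), repr(parts.query), repr(parts.fragment))
```

With the sanitizing behaviour in place, none of the three printed components contains the embedded control characters.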
Ken Jin
b38601d496
bpo-42967: coerce bytes separator to string in urllib.parse_qs(l) (#24818)
* coerce bytes separator to string

* Add news

* Update Misc/NEWS.d/next/Library/2021-03-11-00-31-41.bpo-42967.2PeQRw.rst
2021-04-11 06:26:09 -07:00
Ken Jin
a2f0654b0a
bpo-42967: Fix urllib.parse docs and make logic clearer (GH-24536) 2021-02-15 09:00:20 -08:00
Adam Goldschmidt
fcbe0cb04d
bpo-42967: only use '&' as a query string separator (#24297)
bpo-42967: [security] Address a web cache-poisoning issue reported in urllib.parse.parse_qsl().

urllib.parse will only use "&" as the query string separator by default, instead of both ";" and "&" as allowed in earlier versions. An optional argument `separator` with default value "&" is added to specify the separator.


Co-authored-by: Éric Araujo <merwok@netwok.org>
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Ken Jin <28750310+Fidget-Spinner@users.noreply.github.com>
2021-02-14 14:41:57 -08:00
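The `separator` behaviour described in bpo-42967 can be checked with a small example. The query string is invented for illustration.

```python
from urllib.parse import parse_qsl

query = "a=1;b=2&c=3"

# Default: only '&' separates fields, so ';' stays inside the value.
default_pairs = parse_qsl(query)

# Opting in to ';' as the separator instead.
semicolon_pairs = parse_qsl(query, separator=";")

print(default_pairs)
print(semicolon_pairs)
```

Only one separator is used per call, so passing `separator=";"` leaves the `&` inside the second value.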
Batuhan Taşkaya
0361556537
bpo-39481: PEP 585 for a variety of modules (GH-19423)
- concurrent.futures
- ctypes
- http.cookies
- multiprocessing
- queue
- tempfile
- unittest.case
- urllib.parse
2020-04-10 07:46:36 -07:00
idomic
c33bdbb20c
bpo-37970: update and improve urlparse and urlsplit doc-strings (GH-16458) 2020-02-16 21:17:58 +02:00
Serhiy Storchaka
6a265f0d0c
bpo-39057: Fix urllib.request.proxy_bypass_environment(). (GH-17619)
Ignore leading dots and no longer ignore a trailing newline.
2020-01-05 14:14:31 +02:00
Tim Graham
5a88d50ff0 bpo-27657: Fix urlparse() with numeric paths (#661)
* bpo-27657: Fix urlparse() with numeric paths

Revert parsing decision from bpo-754016 in favor of the documented
consensus in bpo-16932 of how to treat strings without a // to
designate the netloc.

* bpo-22891: Remove urlsplit() optimization for 'http' prefixed inputs.
2019-10-18 06:07:20 -07:00
Stein Karlsen
aad2ee0156 bpo-32498: urllib.parse.unquote also accepts bytes (GH-7768) 2019-10-14 13:36:29 +03:00
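A brief illustration of bpo-32498, assuming an interpreter that includes the change: `unquote` can be given `bytes` as well as `str`, and returns `str` in both cases.

```python
from urllib.parse import unquote

from_str = unquote("caf%C3%A9")
from_bytes = unquote(b"caf%C3%A9")  # bytes input is decoded as UTF-8 by default

print(from_str, from_bytes)
```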
Steve Dower
8d0ef0b5ed bpo-36742: Corrects fix to handle decomposition in usernames (#13812) 2019-06-04 17:55:29 +02:00
Rémi Lapeyre
674ee12600 bpo-35397: Remove deprecation and document urllib.parse.unwrap (GH-11481) 2019-05-27 09:43:45 -04:00
Steve Dower
d537ab0ff9
bpo-36742: Fixes handling of pre-normalization characters in urlsplit() (GH-13017) 2019-04-30 12:03:02 +00:00
Jörn Hees
750d74fac5 bpo-12910: update and correct quote docstring (#2568)
Fixes some mistakes and misleading statements in the quote function docstring:
- reserved chars are never actually used by quote code, unreserved chars are
- reserved chars were wrong and incomplete
- mentioned that use-case is not minimal quoting wrt. RFC, but cautious quoting
2019-04-09 17:31:18 -07:00
Steve Dower
16e6f7dee7
bpo-36216: Add check for characters in netloc that normalize to separators (GH-12201) 2019-03-07 08:02:26 -08:00
matthewbelisle-wf
209144831b bpo-34866: Adding max_num_fields to cgi.FieldStorage (GH-9660)
Adding `max_num_fields` to `cgi.FieldStorage` to make DOS attacks harder by
limiting the number of `MiniFieldStorage` objects created by `FieldStorage`.
2018-10-19 03:52:59 -07:00
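The commit above concerns `cgi.FieldStorage`, but `urllib.parse.parse_qsl` accepts a parameter of the same name, which makes the limit easy to demonstrate without the `cgi` module. The helper `count_fields` below is hypothetical.

```python
from urllib.parse import parse_qsl


def count_fields(query: str, limit: int) -> int:
    """Return the number of parsed fields, or -1 if the limit is exceeded."""
    try:
        return len(parse_qsl(query, max_num_fields=limit))
    except ValueError:
        # Raised when the query contains more fields than `limit`.
        return -1


print(count_fields("a=1&b=2&c=3", limit=3))
print(count_fields("a=1&b=2&c=3", limit=2))
```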
Cheryl Sabella
0250de4819 bpo-27485: Rename and deprecate undocumented functions in urllib.parse (GH-2205) 2018-04-25 16:51:54 -07:00
Matt Eaton
2cb4661707 bpo-33034: Improve exception message when cast fails for {Parse,Split}Result.port (GH-6078) 2018-03-20 09:41:37 +03:00
Коренберг Марк
fbd605151f bpo-32323: urllib.parse.urlsplit() must not lowercase() IPv6 scope value (#4867) 2017-12-21 14:16:17 +02:00
Oren Milman
8df44ee8e0 remove a redundant lower in urllib.parse.urlsplit (#3008) 2017-09-02 21:51:39 -07:00
postmasters
90e01e50ef urllib: Simplify splithost by calling into urlparse. (#1849)
The current regex-based splitting produces a wrong result. For example::

  http://abc#@def

Web browsers parse that URL as ``http://abc/#@def``, that is, the host
is ``abc``, the path is ``/``, and the fragment is ``#@def``.
2017-06-20 15:02:44 +02:00
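The URL from the commit message above can be fed to `urlsplit` to see how the standard parser divides it. Note that `urlsplit` reports the fragment without its leading `#` and leaves an absent path empty rather than normalizing it to `/`.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://abc#@def")
print(parts.hostname, repr(parts.path), parts.fragment)
```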
Senthil Kumaran
906f5330b9 bpo-29976: urllib.parse clarify '' in scheme values. (GH-984) 2017-05-17 21:48:59 -07:00
Senthil Kumaran
257b980b31 correct parse_qs and parse_qsl test case descriptions. (#968)
* correct parse_qs and parse_qsl test case descriptions.
2017-04-04 21:19:43 -07:00
Ratnadeep Debnath
21024f0662 bpo-16285: Update urllib quoting to RFC 3986 (#173)
* bpo-16285: Update urllib quoting to RFC 3986

urllib.parse.quote is now based on RFC 3986, and hence
includes `'~'` in the set of characters that is not escaped
by default.

Patch by Christian Theune and Ratnadeep Debnath.
2017-02-25 19:00:28 +10:00
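A quick check of the bpo-16285 behaviour, assuming an interpreter that includes it: `~` is left alone while a space is still percent-encoded.

```python
from urllib.parse import quote

quoted = quote("~user/a b")
print(quoted)
```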
Serhiy Storchaka
8cbd3df3ce Issue #28992: Use bytes.fromhex(). 2016-12-21 12:59:28 +02:00
Berker Peksag
f8479eeb34 Issue #25895: Merge from 3.5 2016-09-16 14:45:15 +03:00
Berker Peksag
f676748a05 Issue #25895: Enable WebSocket URL schemes in urllib.parse.urljoin
Patch by Gergely Imreh and Markus Holtermann.
2016-09-16 14:43:58 +03:00
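With WebSocket schemes enabled in `urljoin` (Issue #25895), relative references resolve against `ws://` and `wss://` bases the same way they do for `http://`. The host and paths below are made up.

```python
from urllib.parse import urljoin

ws_joined = urljoin("ws://example.com/chat/room", "../lobby")
wss_joined = urljoin("wss://example.com/chat/room", "other")
print(ws_joined)
print(wss_joined)
```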
Senthil Kumaran
0b57f0adde merge from 3.5
Remove unnecessary test case comment in urllib.parse.py. These are asserted as test cases.
2016-01-25 18:54:37 -08:00
Senthil Kumaran
d4e51f45a9 Remove unnecessary test case comment in urllib.parse.py. These are asserted as test cases. 2016-01-25 18:53:34 -08:00
Senthil Kumaran
86f7109dad Issue #25822: Add docstrings to the fields of urllib.parse results.
Patch contributed by Swati Jaiswal.
2016-01-14 00:11:39 -08:00
Robert Collins
dfa95c9a8f Issue #20059: urllib.parse raises ValueError on all invalid ports.
Patch by Martin Panter.
2015-08-10 09:53:30 +12:00
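The `ValueError` behaviour from Issue #20059 shows up when the `port` attribute is read. The wrapper `port_or_error` is a hypothetical helper used only to make the outcome printable.

```python
from urllib.parse import urlsplit


def port_or_error(url: str):
    """Return the parsed port (or None), or a string describing the ValueError."""
    try:
        return urlsplit(url).port
    except ValueError as exc:
        return f"ValueError: {exc}"


for url in (
    "http://example.com:8080/",
    "http://example.com/",
    "http://example.com:99999/",
    "http://example.com:80x/",
):
    print(url, "->", port_or_error(url))
```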
R David Murray
c17686f071 Issue #13866: add *quote_via* argument to urlencode.
Patch by samwyse, completed by Arnon Yaari, and reviewed by
Martin Panter.
2015-05-17 20:44:50 -04:00
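The `quote_via` argument from Issue #13866 lets the caller choose the quoting function. By default `urlencode` uses `quote_plus`, so spaces become `+`; passing `quote` yields `%20` instead.

```python
from urllib.parse import quote, urlencode

params = {"q": "a b"}

plus_encoded = urlencode(params)                      # default: quote_plus
percent_encoded = urlencode(params, quote_via=quote)  # spaces as %20

print(plus_encoded)
print(percent_encoded)
```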
Berker Peksag
20416f7994 Issue #23703: Fix a regression in urljoin() introduced in 901e4e52b20a.
Patch by Demian Brecht.
2015-04-16 02:31:14 +03:00
Serhiy Storchaka
1515450440 Issue #23411: Added DefragResult, ParseResult, SplitResult, DefragResultBytes,
ParseResultBytes, and SplitResultBytes to urllib.parse.__all__.
Patch by Martin Panter.
2015-04-07 19:09:01 +03:00
Serhiy Storchaka
44eceb6e2a Issue #23563: Optimized utility functions in urllib.parse. 2015-03-03 20:21:35 +02:00
R David Murray
3ab6ba4744 Merge: #23040: Clarify treatment of encoding and errors when component is bytes. 2014-12-24 21:24:07 -05:00
R David Murray
8c4e112afc #23040: Clarify treatment of encoding and errors when component is bytes.
Patch by Wojtek Ruszczewski.
2014-12-24 21:23:18 -05:00
Senthil Kumaran
a66e3885fb Issue #22278: Fix urljoin problem with relative urls, a regression observed
after changes to issue22118 were submitted.

Patch contributed by Demian Brecht and reviewed by Antoine Pitrou.
2014-09-22 15:49:16 +08:00
Antoine Pitrou
55ac5b3f7b Issue #22118: Switch urllib.parse to use RFC 3986 semantics for the resolution of relative URLs, rather than RFCs 1808 and 2396.
Patch by Demian Brecht.
2014-08-21 19:16:17 -04:00
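The RFC 3986 resolution semantics adopted in Issue #22118 can be spot-checked against the reference examples in RFC 3986 section 5.4, which use the base URL shown below. The last case climbs above the root: under RFC 3986 the excess `..` segments are dropped.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URL from the RFC 3986 examples

results = {ref: urljoin(BASE, ref) for ref in ("g", "../g", "../../../g")}
for ref, resolved in results.items():
    print(f"{ref!r:14} -> {resolved}")
```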
Serhiy Storchaka
465e60e654 Issue #22033: Reprs of most Python implemented classes now contain actual
class name instead of hardcoded one.
2014-07-25 23:36:00 +03:00