[3.9] bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595) (GH-25725)

* bpo-43882 - urllib.parse should sanitize urls containing ASCII newline and tabs. (GH-25595)

Co-authored-by: Gregory P. Smith <greg@krypto.org>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
(cherry picked from commit 76cd81d603)
Co-authored-by: Senthil Kumaran <skumaran@gatech.edu>
Miss Islington (bot) 2021-04-29 10:57:31 -07:00 committed by GitHub
parent 8d47f92d46
commit 491fde0161
4 changed files with 54 additions and 0 deletions

@@ -311,6 +311,9 @@ or on combining URL components into a URL string.
    ``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
    decomposed before parsing, no error will be raised.
+   Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
+   ``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
    .. versionchanged:: 3.6
       Out-of-range port numbers now raise :exc:`ValueError`, instead of
       returning :const:`None`.
@@ -319,6 +322,10 @@ or on combining URL components into a URL string.
       Characters that affect netloc parsing under NFKC normalization will
       now raise :exc:`ValueError`.
+   .. versionchanged:: 3.9.5
+      ASCII newline and tab characters are stripped from the URL.
+.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
 .. function:: urlunsplit(parts)
@@ -673,6 +680,10 @@ task isn't already covered by the URL parsing functions above.
 .. seealso::
+   `WHATWG`_ - URL Living standard
+      Working Group for the URL Standard that defines URLs, domains, IP addresses, the
+      application/x-www-form-urlencoded format, and their API.
    :rfc:`3986` - Uniform Resource Identifiers
       This is the current standard (STD66). Any changes to urllib.parse module
       should conform to this. Certain deviations could be observed, which are
@@ -696,3 +707,5 @@ task isn't already covered by the URL parsing functions above.
    :rfc:`1738` - Uniform Resource Locators (URL)
       This specifies the formal syntax and semantics of absolute URLs.
+.. _WHATWG: https://url.spec.whatwg.org/
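
The documented behavior can be checked with a short script. This is a minimal sketch, assuming an interpreter that already includes this change (3.9.5 or later on the 3.9 branch); the host `example.com` and the path and query values are placeholders, not taken from the commit.

```python
from urllib.parse import urlsplit

# example.com and the path/query values are illustrative placeholders.
raw = "https://example.com/pa\tth?q=1\n#frag"
parts = urlsplit(raw)

# With the fix in place, the tab and the newline no longer appear in any
# component, nor in the URL reassembled by geturl().
print(repr(parts.path))
print(repr(parts.query))
print(repr(parts.geturl()))
```

On an interpreter without the fix, the control characters would remain in the parsed components, so the printed values depend on the Python version in use.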

@@ -612,6 +612,35 @@ class UrlParseTestCase(unittest.TestCase):
         with self.assertRaisesRegex(ValueError, "out of range"):
             p.port
+    def test_urlsplit_remove_unsafe_bytes(self):
+        # Remove ASCII tabs and newlines from input
+        url = "http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, "http")
+        self.assertEqual(p.netloc, "www.python.org")
+        self.assertEqual(p.path, "/javascript:alert('msg')/")
+        self.assertEqual(p.query, "")
+        self.assertEqual(p.fragment, "frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, "www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), "http://www.python.org/javascript:alert('msg')/#frag")
+        # Remove ASCII tabs and newlines from input as bytes.
+        url = b"http://www.python.org/java\nscript:\talert('msg\r\n')/#frag"
+        p = urllib.parse.urlsplit(url)
+        self.assertEqual(p.scheme, b"http")
+        self.assertEqual(p.netloc, b"www.python.org")
+        self.assertEqual(p.path, b"/javascript:alert('msg')/")
+        self.assertEqual(p.query, b"")
+        self.assertEqual(p.fragment, b"frag")
+        self.assertEqual(p.username, None)
+        self.assertEqual(p.password, None)
+        self.assertEqual(p.hostname, b"www.python.org")
+        self.assertEqual(p.port, None)
+        self.assertEqual(p.geturl(), b"http://www.python.org/javascript:alert('msg')/#frag")
     def test_attributes_bad_port(self):
         """Check handling of invalid ports."""
         for bytes in (False, True):

@@ -78,6 +78,9 @@ scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
                 '0123456789'
                 '+-.')
+# Unsafe bytes to be removed per WHATWG spec
+_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']
 # XXX: Consider replacing with functools.lru_cache
 MAX_CACHE_SIZE = 20
 _parse_cache = {}
@@ -469,6 +472,9 @@ def urlsplit(url, scheme='', allow_fragments=True):
         else:
             scheme, url = url[:i].lower(), url[i+1:]
+    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
+        url = url.replace(b, "")
     if url[:2] == '//':
         netloc, url = _splitnetloc(url, 2)
         if (('[' in netloc and ']' not in netloc) or
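
The loop added to `urlsplit()` can be read in isolation as the sketch below. The names `UNSAFE_CHARS` and `strip_unsafe` are hypothetical stand-ins for illustration and are not part of `urllib.parse`; the sketch also assumes `str` input, as the loop replaces with the `str` literal `""`.

```python
# Hypothetical stand-ins for _UNSAFE_URL_BYTES_TO_REMOVE and the loop that
# this commit adds to urlsplit(); neither name exists in urllib.parse.
UNSAFE_CHARS = ['\t', '\r', '\n']


def strip_unsafe(url: str) -> str:
    """Return url with every ASCII tab, carriage return, and line feed removed."""
    for ch in UNSAFE_CHARS:
        url = url.replace(ch, "")
    return url


# Same input as the test added in this commit.
cleaned = strip_unsafe("http://www.python.org/java\nscript:\talert('msg\r\n')/#frag")
print(cleaned)
```

In the hunk above the loop runs after the scheme has been split off, so in this version of the code only the remainder of the URL passes through it.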

@@ -0,0 +1,6 @@
+The presence of newline or tab characters in parts of a URL could allow
+some forms of attacks.
+Following the controlling specification for URLs defined by WHATWG
+:func:`urllib.parse` now removes ASCII newlines and tabs from URLs,
+preventing such attacks.