#1486713: Add a tolerant mode to HTMLParser.

The motivation for adding this option is that the functionality it
provides used to be provided by sgmllib in Python2, and was used by,
for example, BeautifulSoup.  Without this option, the Python3 version
of BeautifulSoup and the many programs that use it are crippled.

The original patch was by 'kxroberto'.  I modified it heavily but kept his
heuristics and test.  I also added further heuristics to fix #975556,
#1046092, and part of #6191.  This patch should be completely backward
compatible:  the behavior with the default strict=True is unchanged.
R. David Murray 2010-12-03 04:06:39 +00:00
parent 79cdb661f5
commit b579dba119
4 changed files with 139 additions and 24 deletions

@@ -12,9 +12,13 @@
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
-.. class:: HTMLParser()
+.. class:: HTMLParser(strict=True)
-The :class:`HTMLParser` class is instantiated without arguments.
+Create a parser instance. If *strict* is ``True`` (the default), invalid
+html results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
+*strict* is ``False``, the parser uses heuristics to make a best guess at
+the intention of any invalid html it encounters, similar to the way most
+browsers do.
 An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
 begin and end. The :class:`HTMLParser` class is meant to be overridden by the
@@ -191,3 +195,8 @@ As a basic example, below is a very basic HTML parser that uses the
 Encountered a html end tag
+.. rubric:: Footnotes
+.. [#] For backward compatibility reasons *strict* mode does not throw
+errors for all non-compliant HTML. That is, some invalid HTML
+is tolerated even in *strict* mode.
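
A minimal sketch of how the tolerant mode documented above might be used. The
names ``TagCollector`` and ``collect_tags`` are hypothetical and are not part of
this patch. The sketch passes ``strict=False`` as described in the new
documentation; because later Python releases removed the *strict* keyword and
always parse tolerantly, it assumes a fallback to the default constructor is
acceptable when the keyword is rejected.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start and end tags in document order."""

    def __init__(self):
        try:
            # Interface documented in this patch: ask for tolerant parsing.
            super().__init__(strict=False)
        except TypeError:
            # Assumption: on Python releases that no longer accept the
            # keyword, the default constructor already parses tolerantly.
            super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def collect_tags(markup):
    """Feed markup to a tolerant parser and return the recorded tag events."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Missing whitespace between attributes: invalid markup that a
    # tolerant parser is expected to accept rather than reject.
    print(collect_tags('<div class="a"id="b">text</div>'))
```

Subclassing and overriding the ``handle_*`` methods follows the usage the
module documentation already describes; only the constructor argument is
specific to this change.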