#14020: improve HTMLParser documentation.

2025-11-24 20:30:18 +00:00 · 2012-02-18 01:46:04 +02:00 · 2012-02-18 01:46:04 +02:00 · c39b552603
commit c39b552603
parent 437b149b0c
1 changed files with 212 additions and 75 deletions
--- a/Doc/library/htmlparser.rst
+++ b/Doc/library/htmlparser.rst
@@ -22,7 +22,7 @@
 --------------
-This module defines a class :class:`HTMLParser` which serves as the basis for
+This module defines a class :class:`.HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
 Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
 in :mod:`sgmllib`.
@@ -30,11 +30,12 @@ in :mod:`sgmllib`.
 .. class:: HTMLParser()
-   The :class:`HTMLParser` class is instantiated without arguments.
+   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.
-   An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
+   The :class:`.HTMLParser` class is instantiated without arguments.
   begin and end.  The :class:`HTMLParser` class is meant to be overridden by the
   user to provide a desired behavior.
   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
@@ -42,22 +43,59 @@ in :mod:`sgmllib`.
 An exception is defined as well:
 .. exception:: HTMLParseError
-   Exception raised by the :class:`HTMLParser` class when it encounters an error
+   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
-   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
+   might raise this exception when it encounters an error while parsing.
-   message explaining the error, :attr:`lineno` is the number of the line on which
+   This exception provides three attributes: :attr:`msg` is a brief
-   the broken construct was detected, and :attr:`offset` is the number of
+   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.
-:class:`HTMLParser` instances have the following methods:
+
 Example HTML Parser Application
 -------------------------------
 As a basic example, below is a simple HTML parser that uses the
 :class:`.HTMLParser` class to print out start tags, end tags and data
 as they are encountered::
   from HTMLParser import HTMLParser
   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data  :", data
   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')
 The output will then be::
   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html
-.. method:: HTMLParser.reset()
+:class:`.HTMLParser` Methods
 ----------------------------
-   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
+:class:`.HTMLParser` instances have the following methods:
   instantiation time.
 .. method:: HTMLParser.feed(data)
@@ -73,7 +111,13 @@ An exception is defined as well:
   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
-   the :class:`HTMLParser` base class method :meth:`close`.
+   the :class:`.HTMLParser` base class method :meth:`close`.
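As a minimal sketch of the advice above, a subclass can do its own end-of-input work in ``close()`` as long as it calls the base class method first. The class name ``CountingParser`` and its attributes are hypothetical, and the fallback import is an assumption added so the snippet also runs on Python 3, where the class lives in ``html.parser``:

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module documented here
except ImportError:
    from html.parser import HTMLParser  # assumed Python 3 location

class CountingParser(HTMLParser):
    """Hypothetical subclass: counts start tags and finalizes on close()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.start_tags = 0
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1

    def close(self):
        # Always let the base class process any buffered data first.
        HTMLParser.close(self)
        # Additional end-of-input processing goes after the base call.
        self.finished = True

parser = CountingParser()
parser.feed('<html><body><p>Hello</p>')
parser.close()
```

The explicit ``HTMLParser.close(self)`` call (rather than ``super()``) is a deliberate choice here, since it works the same way on both Python versions.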
 .. method:: HTMLParser.reset()
   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.
 .. method:: HTMLParser.getpos()
@@ -89,22 +133,34 @@ An exception is defined as well:
   attributes can be preserved, etc.).
 The following methods are called when data or markup elements are encountered
 and they are meant to be overridden in a subclass.  The base class
 implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
 .. method:: HTMLParser.handle_starttag(tag, attrs)
-   This method is called to handle the start of a tag.  It is intended to be
+   This method is called to handle the start of a tag (e.g. ``<div id="main">``).
   overridden by a derived class; the base class implementation does nothing.
   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
-   have been replaced.  For instance, for the tag ``<A
+   have been replaced.
-   HREF="http://www.cwi.nl/">``, this method would be called as
+
-   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
+   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
   .. versionchanged:: 2.6
-      All entity references from :mod:`htmlentitydefs` are now replaced in the attribute
+      All entity references from :mod:`htmlentitydefs` are now replaced in the
-      values.
+      attribute values.
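To illustrate how the *attrs* list of ``(name, value)`` pairs might be consumed, here is a small sketch that collects ``href`` values from ``<a>`` tags. ``LinkCollector`` and its ``links`` attribute are hypothetical names, and the fallback import is an assumption for running the snippet on Python 3:

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module documented here
except ImportError:
    from html.parser import HTMLParser  # assumed Python 3 location

class LinkCollector(HTMLParser):
    """Hypothetical subclass that records the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lower-cased.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href':
                self.links.append(value)

collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> <a name="top">anchor</a>')
collector.close()
```

Because names are lower-cased before the handler is called, the comparison against ``'a'`` and ``'href'`` matches the upper-case markup in the input.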
 .. method:: HTMLParser.handle_endtag(tag)
   This method is called to handle the end tag of an element (e.g. ``</div>``).
   The *tag* argument is the name of the tag converted to lower case.
 .. method:: HTMLParser.handle_startendtag(tag, attrs)
@@ -115,94 +171,175 @@ An exception is defined as well:
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
 .. method:: HTMLParser.handle_endtag(tag)
   This method is called to handle the end tag of an element.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.  The
   *tag* argument is the name of the tag converted to lower case.
 .. method:: HTMLParser.handle_data(data)
-   This method is called to process arbitrary data (e.g. the content of
+   This method is called to process arbitrary data (e.g. text nodes and the
-   ``<script>...</script>`` and ``<style>...</style>``).  It is intended to be
+   content of ``<script>...</script>`` and ``<style>...</style>``).
   overridden by a derived class; the base class implementation does nothing.
 .. method:: HTMLParser.handle_charref(name)
   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.
 .. method:: HTMLParser.handle_entityref(name)
-   This method is called to process a general entity reference of the form
+   This method is called to process a named character reference of the form
-   ``&name;`` where *name* is an general entity reference.  It is intended to be
+   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
-   overridden by a derived class; the base class implementation does nothing.
+   (e.g. ``'gt'``).
 .. method:: HTMLParser.handle_charref(name)
   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.
 .. method:: HTMLParser.handle_comment(data)
-   This method is called when a comment is encountered.  The *comment* argument is
+   This method is called when a comment is encountered (e.g. ``<!--comment-->``).
-   a string containing the text between the ``--`` and ``--`` delimiters, but not
+
-   the delimiters themselves.  For example, the comment ``<!--text-->`` will cause
+   For example, the comment ``<!-- comment -->`` will cause this method to be
-   this method to be called with the argument ``'text'``.  It is intended to be
+   called with the argument ``' comment '``.
-   overridden by a derived class; the base class implementation does nothing.
+
   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
 .. method:: HTMLParser.handle_decl(decl)
-   Method called when an SGML ``doctype`` declaration is read by the parser.
+   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).
   The *decl* parameter will be the entire contents of the declaration inside
-   the ``<!...>`` markup.  It is intended to be overridden by a derived class;
+   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
   the base class implementation does nothing.
 .. method:: HTMLParser.unknown_decl(data)
   Method called when an unrecognized SGML declaration is read by the parser.
   The *data* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup.  It is sometimes useful to be overridden by a
   derived class; the base class implementation throws an :exc:`HTMLParseError`.
 .. method:: HTMLParser.handle_pi(data)
-   Method called when a processing instruction is encountered.  The *data*
+   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
-   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
+   ``handle_pi("proc color='red'")``.
   class; the base class implementation does nothing.
   .. note::
-      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
+      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
-.. _htmlparser-example:
+.. method:: HTMLParser.unknown_decl(data)
-Example HTML Parser Application
+   This method is called when an unrecognized declaration is read by the parser.
 -------------------------------
-As a basic example, below is a simple HTML parser that uses the
+   The *data* parameter will be the entire contents of the declaration inside
-:class:`HTMLParser` class to print out start tags, end tags and data
+   the ``<![...]>`` markup.  Overriding this method is sometimes useful in a
-as they are encountered::
+   derived class.
 .. _htmlparser-examples:
 Examples
 --------
 The following class implements a parser that will be used to illustrate more
 examples::
   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
-           print "Encountered a start tag:", tag
+           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr
       def handle_endtag(self, tag):
-           print "Encountered  an end tag:", tag
+           print "End tag  :", tag
       def handle_data(self, data):
-           print "Encountered   some data:", data
+           print "Data     :", data
-
+       def handle_comment(self, data):
           print "Comment  :", data
       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c
       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c
       def handle_decl(self, data):
           print "Decl     :", data
   parser = MyHTMLParser()
-   parser.feed('<html><head><title>Test</title></head>'
+
-               '<body><h1>Parse me!</h1></body></html>')
+Parsing a doctype::
   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
 Parsing an element with a few attributes and a title::
   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1
 The content of ``script`` and ``style`` elements is returned as is, without
 further parsing::
   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script
 Parsing comments::
   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]
 Parsing named and numeric character references and converting them to the
 correct char (note: these 3 references are all equivalent to ``'>'``)::
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >
 Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
 :meth:`~HTMLParser.handle_data` might be called more than once::
   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span
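Since ``handle_data`` may be called several times for one run of text, a handler that needs the complete text can accumulate the pieces and join them afterwards. The following is a sketch under that assumption; ``TextCollector`` and its ``text()`` helper are hypothetical names, and the fallback import is only there so the snippet also runs on Python 3:

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module documented here
except ImportError:
    from html.parser import HTMLParser  # assumed Python 3 location

class TextCollector(HTMLParser):
    """Hypothetical subclass that accumulates text across handle_data calls."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pieces = []

    def handle_data(self, data):
        # Each call may carry only a fragment of the text.
        self.pieces.append(data)

    def text(self):
        # Join the fragments once parsing is finished.
        return ''.join(self.pieces)

collector = TextCollector()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    collector.feed(chunk)
collector.close()
```

The number of fragments depends on how the input happens to be chunked, so the sketch relies only on the joined result from ``collector.text()``.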
 Parsing invalid HTML (e.g. unquoted attributes) also works::
   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a
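Because the parser reports end tags without checking that they match the open start tags, a subclass that cares about nesting has to track it itself. The sketch below keeps a simple stack; ``NestingChecker`` and its attributes are hypothetical, the fallback import is an assumption for Python 3, and void elements such as ``<img>`` are deliberately not handled:

```python
try:
    from HTMLParser import HTMLParser   # Python 2 module documented here
except ImportError:
    from html.parser import HTMLParser  # assumed Python 3 location

class NestingChecker(HTMLParser):
    """Hypothetical subclass that records end tags closing the wrong element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []    # stack of currently open elements
        self.mismatches = []   # end tags that did not match the stack top

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Out-of-order end tag: note it and leave the stack untouched.
            self.mismatches.append(tag)

checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
checker.close()
```

For the tag soup above, ``</p >`` arrives while ``a`` is still on top of the stack, so it is recorded in ``checker.mismatches``, and ``p`` is left in ``checker.open_tags`` once ``</a>`` has been popped.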