mirror of
https://github.com/python/cpython.git
synced 2025-08-04 08:59:19 +00:00
Merged revisions 63438 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk ........ r63438 | georg.brandl | 2008-05-17 23:54:03 +0200 (Sat, 17 May 2008) | 3 lines Rename html.parser file, and split html.entities from htmllib to ease removal of the latter in Py3k. ........
This commit is contained in:
parent
bf93b0470a
commit
9087b7f83b
4 changed files with 39 additions and 40 deletions
179
Doc/library/html.parser.rst
Normal file
@@ -0,0 +1,179 @@
:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   The :class:`HTMLParser` class is instantiated without arguments.

   An :class:`HTMLParser` instance is fed HTML data and calls handler methods when
   tags begin and end.  The :class:`HTMLParser` class is meant to be subclassed,
   with the handler methods overridden to provide the desired behavior.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags, and it does not call the end-tag handler for elements that
   are closed implicitly by closing an outer element.

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.  This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on which
   the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.reset()

   Reset the instance.  All unprocessed data is lost.  This is called implicitly
   at instantiation time.


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


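The buffering behavior of :meth:`feed` can be illustrated with a minimal sketch.
The ``TagLogger`` subclass and its ``seen`` attribute are hypothetical names
introduced only for this illustration, and the sketch assumes a current
Python 3 interpreter.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical subclass that records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(tag)


parser = TagLogger()

# "<bo" is an incomplete start tag, so it is buffered rather than reported.
parser.feed("<html><bo")
after_first = list(parser.seen)

# The next chunk completes "<body>" and supplies a further "<p>" tag.
parser.feed("dy><p>text")
parser.close()
after_second = list(parser.seen)
```

After the first call only the ``html`` tag has been reported; the ``body`` tag
is reported once the second chunk completes it.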
.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


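A redefined :meth:`close` that still calls the base class method might look
like the following sketch. ``TagCounter``, ``count`` and ``summary`` are
hypothetical names chosen for this example, not part of the module.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical subclass that summarizes its input when closed."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.summary = None

    def handle_starttag(self, tag, attrs):
        self.count += 1

    def close(self):
        # Let the base class flush any buffered data first.
        super().close()
        # Additional end-of-input processing goes after the base call.
        self.summary = "%d start tags" % self.count


counter = TagCounter()
counter.feed("<ul><li>a<li>b</ul>")
counter.close()
```

Calling the base method first ensures that every buffered tag has been handled
before the summary is computed.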
.. method:: HTMLParser.getpos()

   Return the current line number and offset.


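As a rough illustration, :meth:`getpos` can be called from inside a handler to
record where each tag starts. The ``PositionLogger`` class is a hypothetical
name used only here; the sketch assumes the current implementation, in which
line numbers are counted from 1 and offsets from 0.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical subclass that notes the position of each start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed("<html>\n  <body>\n")
logger.close()
```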
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case;
   quotes around the *value* are removed, and character and entity references in
   it are replaced.  For instance, for the tag ``<A HREF="http://www.cwi.nl/">``,
   this method would be called as
   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


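The shape of the *attrs* argument can be seen in a short sketch that collects
link targets. ``LinkCollector`` and ``links`` are hypothetical names, and the
second URL is made up purely to show an entity reference being replaced.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag arrives in lower case; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
               '<a href="?a=1&amp;b=2">query</a>')
collector.close()
```

Both the upper-case tag and the upper-case attribute name are matched, and the
``&amp;`` in the second value arrives as a plain ``&``.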
.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<a .../>``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


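The default delegation can be observed with a sketch that overrides only the
start and end handlers. ``EventRecorder`` and ``events`` are hypothetical names
used for illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that logs start and end events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


recorder = EventRecorder()
# handle_startendtag is not overridden, so "<br/>" is reported as a start
# event immediately followed by an end event.
recorder.feed("<p>line one<br/>line two</p>")
recorder.close()
```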
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.  The
   *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data, such as the text between
   tags.  It is intended to be overridden by a derived class; the base class
   implementation does nothing.


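A common use of :meth:`handle_data` is extracting the text of a document. The
following is a minimal sketch; ``TextExtractor`` and ``chunks`` are hypothetical
names, and the text may be delivered in several pieces, so the pieces are
joined afterwards.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical subclass that keeps the text and drops the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed("<h1>Title</h1><p>Some <b>bold</b> text.</p>")
extractor.close()

# Join the pieces; the parser does not promise one call per text node.
text = "".join(extractor.chunks)
```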
.. method:: HTMLParser.handle_charref(name)

   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a general entity reference of the form
   ``&name;``, where *name* is the name of a general entity (for example,
   ``'amp'`` for ``&amp;``).  It is intended to be overridden by a derived class;
   the base class implementation does nothing.


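The two reference handlers can be sketched together. Note an assumption that
goes beyond the text above: in current Python 3 releases the constructor
accepts a ``convert_charrefs`` argument that defaults to true, in which case
references in text are converted automatically and these handlers are not
called, so the sketch passes ``convert_charrefs=False``. ``ReferenceRecorder``
and ``refs`` are hypothetical names.

```python
from html.parser import HTMLParser


class ReferenceRecorder(HTMLParser):
    """Hypothetical subclass that logs character and entity references."""

    def __init__(self):
        # Assumption: a Python version where convert_charrefs exists.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name is the text between "&#" and ";", e.g. "233" or "xE8".
        self.refs.append(("charref", name))

    def handle_entityref(self, name):
        # name is the text between "&" and ";", e.g. "amp".
        self.refs.append(("entityref", name))


recorder = ReferenceRecorder()
recorder.feed("<p>caf&#233; &amp; cr&#xE8;me</p>")
recorder.close()
```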
.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered.  The *data* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves.  For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``.  It is intended to
   be overridden by a derived class; the base class implementation does nothing.


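A short sketch of a comment handler follows; ``CommentCollector`` and
``comments`` are hypothetical names. Whitespace inside the comment is passed
through unchanged.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the body of each comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the "<!--" and "-->" delimiters.
        self.comments.append(data)


collector = CommentCollector()
collector.feed("<p>visible</p><!--text--><!-- spaced note -->")
collector.close()
```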
.. method:: HTMLParser.handle_decl(decl)

   This method is called when an SGML declaration is read by the parser.  The
   *decl* parameter will be the entire contents of the declaration inside the
   ``<!``...\ ``>`` markup.  It is intended to be overridden by a derived class;
   the base class implementation does nothing.


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


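The declaration and processing-instruction handlers, including the trailing
``'?'`` behavior described in the note, can be sketched as follows.
``MarkupRecorder``, ``decls`` and ``pis`` are hypothetical names, and the
``<?xml ...?>`` input is an illustrative example rather than something taken
from the text above.

```python
from html.parser import HTMLParser


class MarkupRecorder(HTMLParser):
    """Hypothetical subclass that logs declarations and processing
    instructions."""

    def __init__(self):
        super().__init__()
        self.decls = []
        self.pis = []

    def handle_decl(self, decl):
        self.decls.append(decl)

    def handle_pi(self, data):
        self.pis.append(data)


recorder = MarkupRecorder()
recorder.feed("<!DOCTYPE html><?proc color='red'><?xml version='1.0'?>")
recorder.close()
```

The XHTML-style instruction keeps its trailing ``?`` in *data*, so a subclass
that needs the bare content has to strip it itself.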
.. _htmlparser-example:

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out tags as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):

       def handle_starttag(self, tag, attrs):
           print("Encountered the beginning of a %s tag" % tag)

       def handle_endtag(self, tag):
           print("Encountered the end of a %s tag" % tag)

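A parser of this kind might be driven as in the following sketch. The class is
repeated in Python 3 syntax so that the block is self-contained; the sample
markup and the use of ``redirect_stdout`` to capture the printed lines are
choices made for this illustration only.

```python
import io
from contextlib import redirect_stdout
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print("Encountered the beginning of a %s tag" % tag)

    def handle_endtag(self, tag):
        print("Encountered the end of a %s tag" % tag)


parser = MyHTMLParser()

# Capture what the handlers print so the result can be inspected.
buffer = io.StringIO()
with redirect_stdout(buffer):
    parser.feed("<html><body><h1>Hi</h1></body></html>")
    parser.close()

output_lines = buffer.getvalue().splitlines()
```

One line is printed per tag, in document order: three "beginning" lines
followed by three "end" lines for this input.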