mirror of
https://github.com/python/cpython.git
synced 2025-10-10 17:02:46 +00:00
#14020: improve HTMLParser documentation.
parent
437b149b0c
commit
c39b552603
1 changed files with 212 additions and 75 deletions
|
@ -22,7 +22,7 @@
|
||||||
|
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
This module defines a class :class:`HTMLParser` which serves as the basis for
|
This module defines a class :class:`.HTMLParser` which serves as the basis for
|
||||||
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
|
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
|
||||||
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
|
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
|
||||||
in :mod:`sgmllib`.
|
in :mod:`sgmllib`.
|
||||||
|
@ -30,11 +30,12 @@ in :mod:`sgmllib`.
|
||||||
|
|
||||||
.. class:: HTMLParser()
|
.. class:: HTMLParser()
|
||||||
|
|
||||||
The :class:`HTMLParser` class is instantiated without arguments.
|
An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
|
||||||
|
when start tags, end tags, text, comments, and other markup elements are
|
||||||
|
encountered. The user should subclass :class:`.HTMLParser` and override its
|
||||||
|
methods to implement the desired behavior.
|
||||||
|
|
||||||
An :class:`HTMLParser` instance is fed HTML data and calls handler functions when tags
|
The :class:`.HTMLParser` class is instantiated without arguments.
|
||||||
begin and end. The :class:`HTMLParser` class is meant to be overridden by the
|
|
||||||
user to provide a desired behavior.
|
|
||||||
|
|
||||||
Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
|
Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
|
||||||
match start tags or call the end-tag handler for elements which are closed
|
match start tags or call the end-tag handler for elements which are closed
|
||||||
|
@ -42,22 +43,59 @@ in :mod:`sgmllib`.
|
||||||
|
|
||||||
An exception is defined as well:
|
An exception is defined as well:
|
||||||
|
|
||||||
|
|
||||||
.. exception:: HTMLParseError
|
.. exception:: HTMLParseError
|
||||||
|
|
||||||
Exception raised by the :class:`HTMLParser` class when it encounters an error
|
:class:`.HTMLParser` is able to handle broken markup, but in some cases it
|
||||||
while parsing. This exception provides three attributes: :attr:`msg` is a brief
|
might raise this exception when it encounters an error while parsing.
|
||||||
message explaining the error, :attr:`lineno` is the number of the line on which
|
This exception provides three attributes: :attr:`msg` is a brief
|
||||||
the broken construct was detected, and :attr:`offset` is the number of
|
message explaining the error, :attr:`lineno` is the number of the line on
|
||||||
|
which the broken construct was detected, and :attr:`offset` is the number of
|
||||||
characters into the line at which the construct starts.
|
characters into the line at which the construct starts.
|
||||||
|
|
||||||
:class:`HTMLParser` instances have the following methods:
|
|
||||||
|
Example HTML Parser Application
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
As a basic example, below is a simple HTML parser that uses the
|
||||||
|
:class:`.HTMLParser` class to print out start tags, end tags and data
|
||||||
|
as they are encountered::
|
||||||
|
|
||||||
|
   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')
|
||||||
|
|
||||||
|
The output will then be::
|
||||||
|
|
||||||
|
Encountered a start tag: html
|
||||||
|
Encountered a start tag: head
|
||||||
|
Encountered a start tag: title
|
||||||
|
Encountered some data : Test
|
||||||
|
Encountered an end tag : title
|
||||||
|
Encountered an end tag : head
|
||||||
|
Encountered a start tag: body
|
||||||
|
Encountered a start tag: h1
|
||||||
|
Encountered some data : Parse me!
|
||||||
|
Encountered an end tag : h1
|
||||||
|
Encountered an end tag : body
|
||||||
|
Encountered an end tag : html
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.reset()
|
:class:`.HTMLParser` Methods
|
||||||
|
----------------------------
|
||||||
|
|
||||||
Reset the instance. Loses all unprocessed data. This is called implicitly at
|
:class:`.HTMLParser` instances have the following methods:
|
||||||
instantiation time.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.feed(data)
|
.. method:: HTMLParser.feed(data)
|
||||||
|
@ -73,7 +111,13 @@ An exception is defined as well:
|
||||||
Force processing of all buffered data as if it were followed by an end-of-file
|
Force processing of all buffered data as if it were followed by an end-of-file
|
||||||
mark. This method may be redefined by a derived class to define additional
|
mark. This method may be redefined by a derived class to define additional
|
||||||
processing at the end of the input, but the redefined version should always call
|
processing at the end of the input, but the redefined version should always call
|
||||||
the :class:`HTMLParser` base class method :meth:`close`.
|
the :class:`.HTMLParser` base class method :meth:`close`.
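As a minimal sketch of redefining ``close()``, the hypothetical subclass below does extra work at the end of the input but still calls the base class method first. The class name ``CountingParser`` and its attributes are illustrative only, and the ``try``/``except`` import is an assumption added so the sketch also runs where the module lives at ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class CountingParser(HTMLParser):
    """Hypothetical subclass that counts start tags and finalizes in close()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.start_tags = 0
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1

    def close(self):
        # Let the base class process any buffered data first ...
        HTMLParser.close(self)
        # ... then run the additional end-of-input processing.
        self.finished = True


parser = CountingParser()
parser.feed('<p>one</p><p>two</p>')
parser.close()
```

Calling the base class method first means any handler calls triggered by buffered data happen before the subclass marks itself as finished.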
|
||||||
|
|
||||||
|
|
||||||
|
.. method:: HTMLParser.reset()
|
||||||
|
|
||||||
|
Reset the instance. Loses all unprocessed data. This is called implicitly at
|
||||||
|
instantiation time.
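The effect of losing unprocessed data can be sketched as follows. ``TagLogger`` is a hypothetical subclass, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class TagLogger(HTMLParser):
    """Hypothetical subclass that records the names of start tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


parser = TagLogger()
parser.feed('<di')     # incomplete tag: held in the parser's buffer
parser.reset()         # the buffered '<di' is discarded
parser.feed('v><p>')   # 'v>' is now treated as text, '<p>' as a start tag
parser.close()
```

Without the ``reset()`` call the two chunks would have been joined into ``<div>``. Note that ``reset()`` only clears the parser's own state; attributes added by a subclass, such as ``tags`` here, are left alone unless ``reset()`` is overridden.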
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.getpos()
|
.. method:: HTMLParser.getpos()
|
||||||
|
@ -89,22 +133,34 @@ An exception is defined as well:
|
||||||
attributes can be preserved, etc.).
|
attributes can be preserved, etc.).
|
||||||
|
|
||||||
|
|
||||||
|
The following methods are called when data or markup elements are encountered
|
||||||
|
and they are meant to be overridden in a subclass. The base class
|
||||||
|
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_starttag(tag, attrs)
|
.. method:: HTMLParser.handle_starttag(tag, attrs)
|
||||||
|
|
||||||
This method is called to handle the start of a tag. It is intended to be
|
This method is called to handle the start of a tag (e.g. ``<div id="main">``).
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
The *tag* argument is the name of the tag converted to lower case. The *attrs*
|
The *tag* argument is the name of the tag converted to lower case. The *attrs*
|
||||||
argument is a list of ``(name, value)`` pairs containing the attributes found
|
argument is a list of ``(name, value)`` pairs containing the attributes found
|
||||||
inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
|
inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
|
||||||
and quotes in the *value* have been removed, and character and entity references
|
and quotes in the *value* have been removed, and character and entity references
|
||||||
have been replaced. For instance, for the tag ``<A
|
have been replaced.
|
||||||
HREF="http://www.cwi.nl/">``, this method would be called as
|
|
||||||
``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
|
For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
|
||||||
|
would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
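A small sketch of using the *attrs* list: the hypothetical ``LinkCollector`` below gathers ``href`` values from ``a`` tags. The class name and the sample markup are illustrative, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class LinkCollector(HTMLParser):
    """Hypothetical subclass that collects the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; values arrive
        # unquoted with entity references already replaced.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


parser = LinkCollector()
parser.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
            '<a href="/search?a=1&amp;b=2">search</a>')
parser.close()
```

Because names are lower-cased before the handler is called, the comparison against ``'a'`` and ``'href'`` also matches the upper-case markup in the first tag.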
|
||||||
|
|
||||||
.. versionchanged:: 2.6
|
.. versionchanged:: 2.6
|
||||||
All entity references from :mod:`htmlentitydefs` are now replaced in the attribute
|
All entity references from :mod:`htmlentitydefs` are now replaced in the
|
||||||
values.
|
attribute values.
|
||||||
|
|
||||||
|
|
||||||
|
.. method:: HTMLParser.handle_endtag(tag)
|
||||||
|
|
||||||
|
This method is called to handle the end tag of an element (e.g. ``</div>``).
|
||||||
|
|
||||||
|
The *tag* argument is the name of the tag converted to lower case.
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_startendtag(tag, attrs)
|
.. method:: HTMLParser.handle_startendtag(tag, attrs)
|
||||||
|
@ -115,94 +171,175 @@ An exception is defined as well:
|
||||||
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
|
implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
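The default behavior can be observed with a sketch like the one below, where a self-closing tag produces both a start and an end event. ``EventLogger`` is a hypothetical name, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class EventLogger(HTMLParser):
    """Hypothetical subclass that logs start and end tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


parser = EventLogger()
# handle_startendtag() is not overridden, so '<br/>' is reported
# through handle_starttag() followed by handle_endtag().
parser.feed('<br/><p>text</p>')
parser.close()
```

A subclass that needs to tell ``<br/>`` apart from ``<br></br>`` would override ``handle_startendtag()`` instead of relying on this default.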
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_endtag(tag)
|
|
||||||
|
|
||||||
This method is called to handle the end tag of an element. It is intended to be
|
|
||||||
overridden by a derived class; the base class implementation does nothing. The
|
|
||||||
*tag* argument is the name of the tag converted to lower case.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_data(data)
|
.. method:: HTMLParser.handle_data(data)
|
||||||
|
|
||||||
This method is called to process arbitrary data (e.g. the content of
|
This method is called to process arbitrary data (e.g. text nodes and the
|
||||||
``<script>...</script>`` and ``<style>...</style>``). It is intended to be
|
content of ``<script>...</script>`` and ``<style>...</style>``).
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_charref(name)
|
|
||||||
|
|
||||||
This method is called to process a character reference of the form ``&#ref;``.
|
|
||||||
It is intended to be overridden by a derived class; the base class
|
|
||||||
implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_entityref(name)
|
.. method:: HTMLParser.handle_entityref(name)
|
||||||
|
|
||||||
This method is called to process a general entity reference of the form
|
This method is called to process a named character reference of the form
|
||||||
``&name;`` where *name* is an general entity reference. It is intended to be
|
``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
(e.g. ``'gt'``).
|
||||||
|
|
||||||
|
|
||||||
|
.. method:: HTMLParser.handle_charref(name)
|
||||||
|
|
||||||
|
This method is called to process decimal and hexadecimal numeric character
|
||||||
|
references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
|
||||||
|
equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
|
||||||
|
in this case the method will receive ``'62'`` or ``'x3E'``.
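The raw names passed to the two reference handlers can be recorded with a sketch like the following. ``RefRecorder`` and ``PARSER_KWARGS`` are hypothetical names. The ``convert_charrefs=False`` argument is an assumption that applies only where the module is imported from ``html.parser``, whose parser otherwise converts references itself instead of calling these handlers.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
    # Assumption: needed there so references reach the handlers unconverted.
    PARSER_KWARGS = {'convert_charrefs': False}
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here
    PARSER_KWARGS = {}


class RefRecorder(HTMLParser):
    """Hypothetical subclass that records the references it is given."""

    def __init__(self):
        HTMLParser.__init__(self, **PARSER_KWARGS)
        self.refs = []

    def handle_entityref(self, name):
        self.refs.append(('entity', name))

    def handle_charref(self, name):
        # A leading 'x' (or 'X') marks a hexadecimal reference.
        if name[:1] in ('x', 'X'):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name)
        self.refs.append(('char', name, codepoint))


parser = RefRecorder()
parser.feed('&gt;&#62;&#x3E;')
parser.close()
```

Both numeric forms decode to code point 62, while the handler itself only ever sees the text between ``&#`` and ``;``.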
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_comment(data)
|
.. method:: HTMLParser.handle_comment(data)
|
||||||
|
|
||||||
This method is called when a comment is encountered. The *comment* argument is
|
This method is called when a comment is encountered (e.g. ``<!--comment-->``).
|
||||||
a string containing the text between the ``--`` and ``--`` delimiters, but not
|
|
||||||
the delimiters themselves. For example, the comment ``<!--text-->`` will cause
|
For example, the comment ``<!-- comment -->`` will cause this method to be
|
||||||
this method to be called with the argument ``'text'``. It is intended to be
|
called with the argument ``' comment '``.
|
||||||
overridden by a derived class; the base class implementation does nothing.
|
|
||||||
|
The content of Internet Explorer conditional comments (condcoms) will also be
|
||||||
|
sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
|
||||||
|
this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
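A minimal sketch of collecting comment text, including a conditional comment, is shown below. ``CommentCollector`` is a hypothetical name, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        # data excludes the '<!--' and '-->' delimiters but keeps
        # any whitespace found between them.
        self.comments.append(data)


parser = CommentCollector()
parser.feed('<!-- comment -->'
            '<!--[if IE 9]>IE9-specific content<![endif]-->')
parser.close()
```

The surrounding spaces in ``' comment '`` are preserved, so callers that want the bare text would need to strip it themselves.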
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_decl(decl)
|
.. method:: HTMLParser.handle_decl(decl)
|
||||||
|
|
||||||
Method called when an SGML ``doctype`` declaration is read by the parser.
|
This method is called to handle an HTML doctype declaration (e.g.
|
||||||
|
``<!DOCTYPE html>``).
|
||||||
|
|
||||||
The *decl* parameter will be the entire contents of the declaration inside
|
The *decl* parameter will be the entire contents of the declaration inside
|
||||||
the ``<!...>`` markup. It is intended to be overridden by a derived class;
|
the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
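As a short sketch, the hypothetical ``DoctypeRecorder`` below keeps the declaration it receives. The ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class DoctypeRecorder(HTMLParser):
    """Hypothetical subclass that remembers the doctype declaration."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between '<!' and '>'.
        self.doctype = decl


parser = DoctypeRecorder()
parser.feed('<!DOCTYPE html><html></html>')
parser.close()
```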
|
||||||
the base class implementation does nothing.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.unknown_decl(data)
|
|
||||||
|
|
||||||
Method called when an unrecognized SGML declaration is read by the parser.
|
|
||||||
The *data* parameter will be the entire contents of the declaration inside
|
|
||||||
the ``<!...>`` markup. It is sometimes useful to be overridden by a
|
|
||||||
derived class; the base class implementation throws an :exc:`HTMLParseError`.
|
|
||||||
|
|
||||||
|
|
||||||
.. method:: HTMLParser.handle_pi(data)
|
.. method:: HTMLParser.handle_pi(data)
|
||||||
|
|
||||||
Method called when a processing instruction is encountered. The *data*
|
This method is called when a processing instruction is encountered. The *data*
|
||||||
parameter will contain the entire processing instruction. For example, for the
|
parameter will contain the entire processing instruction. For example, for the
|
||||||
processing instruction ``<?proc color='red'>``, this method would be called as
|
processing instruction ``<?proc color='red'>``, this method would be called as
|
||||||
``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
|
``handle_pi("proc color='red'")``.
|
||||||
class; the base class implementation does nothing.
|
|
||||||
|
|
||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
The :class:`HTMLParser` class uses the SGML syntactic rules for processing
|
The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
|
||||||
instructions. An XHTML processing instruction using the trailing ``'?'`` will
|
instructions. An XHTML processing instruction using the trailing ``'?'`` will
|
||||||
cause the ``'?'`` to be included in *data*.
|
cause the ``'?'`` to be included in *data*.
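The difference between the two forms can be sketched as follows. ``PICollector`` is a hypothetical name, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class PICollector(HTMLParser):
    """Hypothetical subclass that stores processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PICollector()
parser.feed("<?proc color='red'>")      # SGML-style instruction
parser.feed('<?xml version="1.0"?>')    # XHTML-style, trailing '?'
parser.close()
```

A handler that must accept both styles could strip a trailing ``'?'`` from *data* before using it.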
|
||||||
|
|
||||||
|
|
||||||
.. _htmlparser-example:
|
.. method:: HTMLParser.unknown_decl(data)
|
||||||
|
|
||||||
Example HTML Parser Application
|
This method is called when an unrecognized declaration is read by the parser.
|
||||||
-------------------------------
|
|
||||||
|
|
||||||
As a basic example, below is a simple HTML parser that uses the
|
The *data* parameter will be the entire contents of the declaration inside
|
||||||
:class:`HTMLParser` class to print out start tags, end tags and data
|
the ``<![...]>`` markup. It is sometimes useful to be overridden by a
|
||||||
as they are encountered::
|
derived class.
|
||||||
|
|
||||||
|
|
||||||
|
.. _htmlparser-examples:
|
||||||
|
|
||||||
|
Examples
|
||||||
|
--------
|
||||||
|
|
||||||
|
The following class implements a parser that will be used to illustrate more
|
||||||
|
examples::
|
||||||
|
|
||||||
from HTMLParser import HTMLParser
|
from HTMLParser import HTMLParser
|
||||||
|
from htmlentitydefs import name2codepoint
|
||||||
|
|
||||||
class MyHTMLParser(HTMLParser):
|
class MyHTMLParser(HTMLParser):
|
||||||
def handle_starttag(self, tag, attrs):
|
def handle_starttag(self, tag, attrs):
|
||||||
print "Encountered a start tag:", tag
|
print "Start tag:", tag
|
||||||
|
for attr in attrs:
|
||||||
|
print " attr:", attr
|
||||||
def handle_endtag(self, tag):
|
def handle_endtag(self, tag):
|
||||||
print "Encountered an end tag:", tag
|
print "End tag :", tag
|
||||||
def handle_data(self, data):
|
def handle_data(self, data):
|
||||||
print "Encountered some data:", data
|
print "Data :", data
|
||||||
|
def handle_comment(self, data):
|
||||||
|
print "Comment :", data
|
||||||
|
def handle_entityref(self, name):
|
||||||
|
c = unichr(name2codepoint[name])
|
||||||
|
print "Named ent:", c
|
||||||
|
def handle_charref(self, name):
|
||||||
|
if name.startswith('x'):
|
||||||
|
c = unichr(int(name[1:], 16))
|
||||||
|
else:
|
||||||
|
c = unichr(int(name))
|
||||||
|
print "Num ent :", c
|
||||||
|
def handle_decl(self, data):
|
||||||
|
print "Decl :", data
|
||||||
|
|
||||||
parser = MyHTMLParser()
|
parser = MyHTMLParser()
|
||||||
parser.feed('<html><head><title>Test</title></head>'
|
|
||||||
'<body><h1>Parse me!</h1></body></html>')
|
Parsing a doctype::
|
||||||
|
|
||||||
|
>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
|
||||||
|
... '"http://www.w3.org/TR/html4/strict.dtd">')
|
||||||
|
Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
|
||||||
|
|
||||||
|
Parsing an element with a few attributes and a title::
|
||||||
|
|
||||||
|
>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
|
||||||
|
Start tag: img
|
||||||
|
attr: ('src', 'python-logo.png')
|
||||||
|
attr: ('alt', 'The Python logo')
|
||||||
|
>>>
|
||||||
|
>>> parser.feed('<h1>Python</h1>')
|
||||||
|
Start tag: h1
|
||||||
|
Data : Python
|
||||||
|
End tag : h1
|
||||||
|
|
||||||
|
The content of ``script`` and ``style`` elements is returned as is, without
|
||||||
|
further parsing::
|
||||||
|
|
||||||
|
>>> parser.feed('<style type="text/css">#python { color: green }</style>')
|
||||||
|
Start tag: style
|
||||||
|
attr: ('type', 'text/css')
|
||||||
|
Data : #python { color: green }
|
||||||
|
End tag : style
|
||||||
|
>>>
|
||||||
|
>>> parser.feed('<script type="text/javascript">'
|
||||||
|
... 'alert("<strong>hello!</strong>");</script>')
|
||||||
|
Start tag: script
|
||||||
|
attr: ('type', 'text/javascript')
|
||||||
|
Data : alert("<strong>hello!</strong>");
|
||||||
|
End tag : script
|
||||||
|
|
||||||
|
Parsing comments::
|
||||||
|
|
||||||
|
>>> parser.feed('<!-- a comment -->'
|
||||||
|
... '<!--[if IE 9]>IE-specific content<![endif]-->')
|
||||||
|
Comment : a comment
|
||||||
|
Comment : [if IE 9]>IE-specific content<![endif]
|
||||||
|
|
||||||
|
Parsing named and numeric character references and converting them to the
|
||||||
|
correct char (note: these 3 references are all equivalent to ``'>'``)::
|
||||||
|
|
||||||
|
>>> parser.feed('&gt;&#62;&#x3E;')
|
||||||
|
Named ent: >
|
||||||
|
Num ent : >
|
||||||
|
Num ent : >
|
||||||
|
|
||||||
|
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
|
||||||
|
:meth:`~HTMLParser.handle_data` might be called more than once::
|
||||||
|
|
||||||
|
>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
|
||||||
|
...     parser.feed(chunk)
|
||||||
|
...
|
||||||
|
Start tag: span
|
||||||
|
Data : buff
|
||||||
|
Data : ered
|
||||||
|
Data : text
|
||||||
|
End tag : span
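Since the text of one element may arrive in several ``handle_data()`` calls, a subclass that needs the whole string can accumulate the pieces and join them at the end tag. The sketch below assumes flat (non-nested) elements. ``TextJoiner`` is a hypothetical name, and the ``try``/``except`` import is an assumption for interpreters where the module is named ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class TextJoiner(HTMLParser):
    """Hypothetical subclass that joins data pieces per element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._pieces = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self._pieces = []            # start collecting for this element

    def handle_data(self, data):
        self._pieces.append(data)    # may be called several times

    def handle_endtag(self, tag):
        self.texts.append(''.join(self._pieces))
        self._pieces = []


parser = TextJoiner()
for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
    parser.feed(chunk)
parser.close()
```

Joining at the end tag makes the result independent of how the input happened to be split into chunks.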
|
||||||
|
|
||||||
|
Parsing invalid HTML (e.g. unquoted attributes) also works::
|
||||||
|
|
||||||
|
>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
|
||||||
|
Start tag: p
|
||||||
|
Start tag: a
|
||||||
|
attr: ('class', 'link')
|
||||||
|
attr: ('href', '#main')
|
||||||
|
Data : tag soup
|
||||||
|
End tag : p
|
||||||
|
End tag : a
|
||||||
|
|