mirror of
https://github.com/python/cpython.git
synced 2025-11-03 03:22:27 +00:00
Remove the htmllib and sgmllib modules as per PEP 3108.
parent
6b38daa80d
commit
877b10add4
15 changed files with 14 additions and 1969 deletions
@@ -1,4 +1,3 @@
-
 :mod:`formatter` --- Generic output formatting
 ==============================================
 
@@ -6,12 +5,9 @@
    :synopsis: Generic output formatter and device interface.
 
 
-.. index:: single: HTMLParser (class in htmllib)
-
 This module supports two interface definitions, each with multiple
-implementations. The *formatter* interface is used by the :class:`HTMLParser`
-class of the :mod:`htmllib` module, and the *writer* interface is required by
-the formatter interface.
+implementations: The *formatter* interface, and the *writer* interface which is
+required by the formatter interface.
 
 Formatter objects transform an abstract flow of formatting events into specific
 output events on writer objects. Formatters manage several stack structures to

@@ -7,11 +7,10 @@
 
 
 This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
-and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
-provide the :attr:`entitydefs` member of the :class:`html.parser.HTMLParser`
-class. The definition provided here contains all the entities defined by XHTML
-1.0 that can be handled using simple textual substitution in the Latin-1
-character set (ISO-8859-1).
+and ``entitydefs``. ``entitydefs`` is used to provide the :attr:`entitydefs`
+member of the :class:`html.parser.HTMLParser` class. The definition provided
+here contains all the entities defined by XHTML 1.0 that can be handled using
+simple textual substitution in the Latin-1 character set (ISO-8859-1).
 
 
 .. data:: entitydefs

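The relationship between the three dictionaries named in the hunk above can be illustrated with a short sketch. It assumes a current Python 3 interpreter, where the values of ``entitydefs`` are plain strings; the helper name ``replacement_text`` is hypothetical and not part of :mod:`html.entities`.

```python
from html.entities import codepoint2name, entitydefs, name2codepoint


def replacement_text(name):
    """Return the replacement text for an entity name, or None if unknown.

    Hypothetical helper for illustration; not part of html.entities.
    """
    return entitydefs.get(name)


# name2codepoint and codepoint2name are inverse mappings of each other.
codepoint = name2codepoint["eacute"]  # entity name -> Unicode code point
name = codepoint2name[codepoint]      # code point -> entity name

print(hex(codepoint), name, replacement_text("amp"))
```

Using ``dict.get`` keeps an unknown entity name from raising :exc:`KeyError`; callers decide how to treat the ``None`` result.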
@@ -11,9 +11,6 @@
 
 This module defines a class :class:`HTMLParser` which serves as the basis for
 parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
-Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
-in :mod:`sgmllib`.
-
 
 .. class:: HTMLParser()
 
@@ -23,9 +20,8 @@ in :mod:`sgmllib`.
    begin and end. The :class:`HTMLParser` class is meant to be overridden by the
    user to provide a desired behavior.
 
-   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
-   match start tags or call the end-tag handler for elements which are closed
-   implicitly by closing an outer element.
+   This parser does not check that end tags match start tags or call the end-tag
+   handler for elements which are closed implicitly by closing an outer element.
 
 An exception is defined as well:
 

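The statement that the parser neither matches end tags against start tags nor reports implicitly closed elements can be observed with a small subclass. This is a minimal sketch; ``EventLogger`` and ``log_events`` are hypothetical names introduced here for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the handler calls made by the parser (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only runs to keep the log short.
        if data.strip():
            self.events.append(("data", data.strip()))


def log_events(markup):
    """Feed markup to a fresh EventLogger and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# <li> is closed implicitly by </ul>; no ("end", "li") event is reported.
print(log_events("<ul><li>one</ul>"))
```

A subclass that needs balanced elements has to keep its own stack of open tags.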
@@ -1,147 +0,0 @@
-
-:mod:`htmllib` --- A parser for HTML documents
-==============================================
-
-.. module:: htmllib
-   :synopsis: A parser for HTML documents.
-
-
-.. index::
-   single: HTML
-   single: hypertext
-
-.. index::
-   module: sgmllib
-   module: formatter
-   single: SGMLParser (in module sgmllib)
-
-This module defines a class which can serve as a base for parsing text files
-formatted in the HyperText Mark-up Language (HTML). The class is not directly
-concerned with I/O --- it must be provided with input in string form via a
-method, and makes calls to methods of a "formatter" object in order to produce
-output. The :class:`HTMLParser` class is designed to be used as a base class
-for other classes in order to add functionality, and allows most of its methods
-to be extended or overridden. In turn, this class is derived from and extends
-the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
-:class:`HTMLParser` implementation supports the HTML 2.0 language as described
-in :rfc:`1866`. Two implementations of formatter objects are provided in the
-:mod:`formatter` module; refer to the documentation for that module for
-information on the formatter interface.
-
-The following is a summary of the interface defined by
-:class:`sgmllib.SGMLParser`:
-
-* The interface to feed data to an instance is through the :meth:`feed` method,
-  which takes a string argument. This can be called with as little or as much
-  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
-  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
-  are processed immediately; incomplete constructs are saved in a buffer. To
-  force processing of all unprocessed data, call the :meth:`close` method.
-
-  For example, to parse the entire contents of a file, use::
-
-     parser.feed(open('myfile.html').read())
-     parser.close()
-
-* The interface to define semantics for HTML tags is very simple: derive a class
-  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
-  The parser will call these at appropriate moments: :meth:`start_tag` or
-  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
-  encountered; :meth:`end_tag` is called when a closing tag of the form ``<tag>``
-  is encountered. If an opening tag requires a corresponding closing tag, like
-  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
-  a tag requires no closing tag, like ``<P>``, the class should define the
-  :meth:`do_tag` method.
-
-The module defines a parser class and an exception:
-
-
-.. class:: HTMLParser(formatter)
-
-   This is the basic HTML parser class. It supports all entity names required by
-   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
-   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
-
-
-.. exception:: HTMLParseError
-
-   Exception raised by the :class:`HTMLParser` class when it encounters an error
-   while parsing.
-
-
-.. seealso::
-
-   Module :mod:`formatter`
-      Interface definition for transforming an abstract flow of formatting events into
-      specific output events on writer objects.
-
-   Module :mod:`html.parser`
-      Alternate HTML parser that offers a slightly lower-level view of the input, but
-      is designed to work with XHTML, and does not implement some of the SGML syntax
-      not used in "HTML as deployed" and which isn't legal for XHTML.
-
-   Module :mod:`html.entities`
-      Definition of replacement text for XHTML 1.0 entities.
-
-   Module :mod:`sgmllib`
-      Base class for :class:`HTMLParser`.
-
-
-.. _html-parser-objects:
-
-HTMLParser Objects
-------------------
-
-In addition to tag methods, the :class:`HTMLParser` class provides some
-additional methods and instance variables for use within tag methods.
-
-
-.. attribute:: HTMLParser.formatter
-
-   This is the formatter instance associated with the parser.
-
-
-.. attribute:: HTMLParser.nofill
-
-   Boolean flag which should be true when whitespace should not be collapsed, or
-   false when it should be. In general, this should only be true when character
-   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
-   The default value is false. This affects the operation of :meth:`handle_data`
-   and :meth:`save_end`.
-
-
-.. method:: HTMLParser.anchor_bgn(href, name, type)
-
-   This method is called at the start of an anchor region. The arguments
-   correspond to the attributes of the ``<A>`` tag with the same names. The
-   default implementation maintains a list of hyperlinks (defined by the ``HREF``
-   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
-   available as the data attribute :attr:`anchorlist`.
-
-
-.. method:: HTMLParser.anchor_end()
-
-   This method is called at the end of an anchor region. The default
-   implementation adds a textual footnote marker using an index into the list of
-   hyperlinks created by :meth:`anchor_bgn`.
-
-
-.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])
-
-   This method is called to handle images. The default implementation simply
-   passes the *alt* value to the :meth:`handle_data` method.
-
-
-.. method:: HTMLParser.save_bgn()
-
-   Begins saving character data in a buffer instead of sending it to the formatter
-   object. Retrieve the stored data via :meth:`save_end`. Use of the
-   :meth:`save_bgn` / :meth:`save_end` pair may not be nested.
-
-
-.. method:: HTMLParser.save_end()
-
-   Ends buffering character data and returns all data saved since the preceding
-   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
-   collapsed to single spaces. A call to this method without a preceding call to
-   :meth:`save_bgn` will raise a :exc:`TypeError` exception.

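Code that relied on the :attr:`anchorlist` attribute maintained by :meth:`anchor_bgn` in the removed module can be approximated with :mod:`html.parser`. The sketch below is an assumption about how such a migration might look, not the removed module's implementation; ``AnchorCollector`` and ``collect_anchors`` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect HREF values of <A> tags, similar in spirit to anchorlist."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.anchorlist.append(href)


def collect_anchors(markup):
    """Return the list of hyperlink targets found in markup."""
    parser = AnchorCollector()
    parser.feed(markup)
    parser.close()
    return parser.anchorlist


print(collect_anchors('<A HREF="http://www.cwi.nl/">CWI</A>'))
```

Anchors that carry only a ``name`` attribute are skipped, since they define a target rather than a hyperlink.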
@@ -23,8 +23,6 @@ definition of the Python bindings for the DOM and SAX interfaces.
 
    html.parser.rst
    html.entities.rst
-   sgmllib.rst
-   htmllib.rst
    pyexpat.rst
    xml.dom.rst
    xml.dom.minidom.rst

@@ -1,253 +0,0 @@
-
-:mod:`sgmllib` --- Simple SGML parser
-=====================================
-
-.. module:: sgmllib
-   :synopsis: Only as much of an SGML parser as needed to parse HTML.
-
-
-.. index:: single: SGML
-
-This module defines a class :class:`SGMLParser` which serves as the basis for
-parsing text files formatted in SGML (Standard Generalized Mark-up Language).
-In fact, it does not provide a full SGML parser --- it only parses SGML insofar
-as it is used by HTML, and the module only exists as a base for the
-:mod:`htmllib` module. Another HTML parser which supports XHTML and offers a
-somewhat different interface is available in the :mod:`HTMLParser` module.
-
-
-.. class:: SGMLParser()
-
-   The :class:`SGMLParser` class is instantiated without arguments. The parser is
-   hardcoded to recognize the following constructs:
-
-   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
-     ``</tag>``, respectively.
-
-   * Numeric character references of the form ``&#name;``.
-
-   * Entity references of the form ``&name;``.
-
-   * SGML comments of the form ``<!--text-->``. Note that spaces, tabs, and
-     newlines are allowed between the trailing ``>`` and the immediately preceding
-     ``--``.
-
-A single exception is defined as well:
-
-
-.. exception:: SGMLParseError
-
-   Exception raised by the :class:`SGMLParser` class when it encounters an error
-   while parsing.
-
-:class:`SGMLParser` instances have the following methods:
-
-
-.. method:: SGMLParser.reset()
-
-   Reset the instance. Loses all unprocessed data. This is called implicitly at
-   instantiation time.
-
-
-.. method:: SGMLParser.setnomoretags()
-
-   Stop processing tags. Treat all following input as literal input (CDATA).
-   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)
-
-
-.. method:: SGMLParser.setliteral()
-
-   Enter literal mode (CDATA mode).
-
-
-.. method:: SGMLParser.feed(data)
-
-   Feed some text to the parser. It is processed insofar as it consists of
-   complete elements; incomplete data is buffered until more data is fed or
-   :meth:`close` is called.
-
-
-.. method:: SGMLParser.close()
-
-   Force processing of all buffered data as if it were followed by an end-of-file
-   mark. This method may be redefined by a derived class to define additional
-   processing at the end of the input, but the redefined version should always call
-   :meth:`close`.
-
-
-.. method:: SGMLParser.get_starttag_text()
-
-   Return the text of the most recently opened start tag. This should not normally
-   be needed for structured processing, but may be useful in dealing with HTML "as
-   deployed" or for re-generating input with minimal changes (whitespace between
-   attributes can be preserved, etc.).
-
-
-.. method:: SGMLParser.handle_starttag(tag, method, attributes)
-
-   This method is called to handle start tags for which either a :meth:`start_tag`
-   or :meth:`do_tag` method has been defined. The *tag* argument is the name of
-   the tag converted to lower case, and the *method* argument is the bound method
-   which should be used to support semantic interpretation of the start tag. The
-   *attributes* argument is a list of ``(name, value)`` pairs containing the
-   attributes found inside the tag's ``<>`` brackets.
-
-   The *name* has been translated to lower case. Double quotes and backslashes in
-   the *value* have been interpreted, as well as known character references and
-   known entity references terminated by a semicolon (normally, entity references
-   can be terminated by any non-alphanumerical character, but this would break the
-   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
-   entity name).
-
-   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method would
-   be called as ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])``. The
-   base implementation simply calls *method* with *attributes* as the only
-   argument.
-
-
-.. method:: SGMLParser.handle_endtag(tag, method)
-
-   This method is called to handle endtags for which an :meth:`end_tag` method has
-   been defined. The *tag* argument is the name of the tag converted to lower
-   case, and the *method* argument is the bound method which should be used to
-   support semantic interpretation of the end tag. If no :meth:`end_tag` method is
-   defined for the closing element, this handler is not called. The base
-   implementation simply calls *method*.
-
-
-.. method:: SGMLParser.handle_data(data)
-
-   This method is called to process arbitrary data. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.handle_charref(ref)
-
-   This method is called to process a character reference of the form ``&#ref;``.
-   The base implementation uses :meth:`convert_charref` to convert the reference to
-   a string. If that method returns a string, it is passed to :meth:`handle_data`,
-   otherwise ``unknown_charref(ref)`` is called to handle the error.
-
-
-.. method:: SGMLParser.convert_charref(ref)
-
-   Convert a character reference to a string, or ``None``. *ref* is the reference
-   passed in as a string. In the base implementation, *ref* must be a decimal
-   number in the range 0-255. It converts the code point found using the
-   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
-   method returns ``None``. This method is called by the default
-   :meth:`handle_charref` implementation and by the attribute value parser.
-
-
-.. method:: SGMLParser.convert_codepoint(codepoint)
-
-   Convert a codepoint to a :class:`str` value. Encodings can be handled here if
-   appropriate, though the rest of :mod:`sgmllib` is oblivious on this matter.
-
-
-.. method:: SGMLParser.handle_entityref(ref)
-
-   This method is called to process a general entity reference of the form
-   ``&ref;`` where *ref* is an general entity reference. It converts *ref* by
-   passing it to :meth:`convert_entityref`. If a translation is returned, it calls
-   the method :meth:`handle_data` with the translation; otherwise, it calls the
-   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
-   translations for ``&amp;``, ``&apos``, ``&gt;``, ``&lt;``, and ``&quot;``.
-
-
-.. method:: SGMLParser.convert_entityref(ref)
-
-   Convert a named entity reference to a :class:`str` value, or ``None``. The
-   resulting value will not be parsed. *ref* will be only the name of the entity.
-   The default implementation looks for *ref* in the instance (or class) variable
-   :attr:`entitydefs` which should be a mapping from entity names to corresponding
-   translations. If no translation is available for *ref*, this method returns
-   ``None``. This method is called by the default :meth:`handle_entityref`
-   implementation and by the attribute value parser.
-
-
-.. method:: SGMLParser.handle_comment(comment)
-
-   This method is called when a comment is encountered. The *comment* argument is
-   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
-   not the delimiters themselves. For example, the comment ``<!--text-->`` will
-   cause this method to be called with the argument ``'text'``. The default method
-   does nothing.
-
-
-.. method:: SGMLParser.handle_decl(data)
-
-   Method called when an SGML declaration is read by the parser. In practice, the
-   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
-   not discriminate among different (or broken) declarations. Internal subsets in
-   a ``DOCTYPE`` declaration are not supported. The *data* parameter will be the
-   entire contents of the declaration inside the ``<!``...\ ``>`` markup. The
-   default implementation does nothing.
-
-
-.. method:: SGMLParser.report_unbalanced(tag)
-
-   This method is called when an end tag is found which does not correspond to any
-   open element.
-
-
-.. method:: SGMLParser.unknown_starttag(tag, attributes)
-
-   This method is called to process an unknown start tag. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.unknown_endtag(tag)
-
-   This method is called to process an unknown end tag. It is intended to be
-   overridden by a derived class; the base class implementation does nothing.
-
-
-.. method:: SGMLParser.unknown_charref(ref)
-
-   This method is called to process unresolvable numeric character references.
-   Refer to :meth:`handle_charref` to determine what is handled by default. It is
-   intended to be overridden by a derived class; the base class implementation does
-   nothing.
-
-
-.. method:: SGMLParser.unknown_entityref(ref)
-
-   This method is called to process an unknown entity reference. It is intended to
-   be overridden by a derived class; the base class implementation does nothing.
-
-Apart from overriding or extending the methods listed above, derived classes may
-also define methods of the following form to define processing of specific tags.
-Tag names in the input stream are case independent; the *tag* occurring in
-method names must be in lower case:
-
-
-.. method:: SGMLParser.start_tag(attributes)
-   :noindex:
-
-   This method is called to process an opening tag *tag*. It has preference over
-   :meth:`do_tag`. The *attributes* argument has the same meaning as described for
-   :meth:`handle_starttag` above.
-
-
-.. method:: SGMLParser.do_tag(attributes)
-   :noindex:
-
-   This method is called to process an opening tag *tag* for which no
-   :meth:`start_tag` method is defined. The *attributes* argument has the same
-   meaning as described for :meth:`handle_starttag` above.
-
-
-.. method:: SGMLParser.end_tag()
-   :noindex:
-
-   This method is called to process a closing tag *tag*.
-
-Note that the parser maintains a stack of open elements for which no end tag has
-been found yet. Only tags processed by :meth:`start_tag` are pushed on this
-stack. Definition of an :meth:`end_tag` method is optional for these tags. For
-tags processed by :meth:`do_tag` or by :meth:`unknown_tag`, no :meth:`end_tag`
-method must be defined; if defined, it will not be used. If both
-:meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
-:meth:`start_tag` method takes precedence.
-

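The ``start_tag`` / ``do_tag`` / ``end_tag`` naming convention and the stack of open elements described in the removed text can be sketched on top of :mod:`html.parser`. This is a simplified illustration under stated assumptions, not the removed :mod:`sgmllib` code: ``TagDispatcher``, ``Outline`` and ``trace`` are hypothetical names, and character and entity reference handling is left out.

```python
from html.parser import HTMLParser


class TagDispatcher(HTMLParser):
    """Dispatch to start_<tag>, do_<tag> and end_<tag> methods (sketch)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements that were handled by start_<tag>

    def handle_starttag(self, tag, attrs):
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            # start_<tag> takes precedence and pushes the element.
            self.stack.append(tag)
            start(attrs)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            do(attrs)  # no closing tag expected, nothing is pushed
        else:
            self.unknown_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Close the element and anything still open inside it.
        while self.stack:
            open_tag = self.stack.pop()
            end = getattr(self, "end_" + open_tag, None)
            if end is not None:
                end()  # end_<tag> is optional
            if open_tag == tag:
                break

    def unknown_starttag(self, tag, attrs):
        pass  # meant to be overridden

    def report_unbalanced(self, tag):
        pass  # meant to be overridden


class Outline(TagDispatcher):
    """Example subclass that logs which tag methods were called."""

    def __init__(self):
        super().__init__()
        self.log = []

    def start_h1(self, attrs):
        self.log.append("start h1")

    def do_h1(self, attrs):
        self.log.append("do h1")  # never reached: start_h1 wins

    def end_h1(self):
        self.log.append("end h1")

    def do_p(self, attrs):
        self.log.append("do p")

    def report_unbalanced(self, tag):
        self.log.append("unbalanced " + tag)


def trace(markup):
    """Parse markup with Outline and return the logged method calls."""
    parser = Outline()
    parser.feed(markup)
    parser.close()
    return parser.log


print(trace("<H1>Title</H1><P>text</B>"))
```

Looking methods up with ``getattr`` at parse time keeps the dispatcher free of any tag table; adding support for a tag means adding a method to the subclass.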
@@ -389,14 +389,13 @@ URL Opener objects
 .. index::
    single: HTML
    pair: HTTP; protocol
-   module: htmllib
 
 * The data returned by :func:`urlopen` or :func:`urlretrieve` is the raw data
   returned by the server. This may be binary data (such as an image), plain text
   or (for example) HTML. The HTTP protocol provides type information in the reply
   header, which can be inspected by looking at the :mailheader:`Content-Type`
-  header. If the returned data is HTML, you can use the module :mod:`htmllib` to
-  parse it.
+  header. If the returned data is HTML, you can use the module
+  :mod:`html.parser` to parse it.
 
 .. index:: single: FTP
 
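The advice in the hunk above, to inspect :mailheader:`Content-Type` and hand HTML to :mod:`html.parser`, might look as follows. The sketch avoids network access by taking the raw bytes and the header value as arguments; ``TitleParser`` and ``extract_title`` are hypothetical names, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from email.message import Message
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Accumulate the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(raw, content_type):
    """Return the document title, or None if the data is not HTML."""
    header = Message()
    header["Content-Type"] = content_type
    if header.get_content_type() != "text/html":
        return None  # binary data, plain text, ...
    # Assumption: default to UTF-8 when the server declares no charset.
    charset = header.get_content_charset() or "utf-8"
    parser = TitleParser()
    parser.feed(raw.decode(charset, errors="replace"))
    parser.close()
    return parser.title


page = b"<html><head><title>Caf\xe9</title></head><body>x</body></html>"
print(extract_title(page, "text/html; charset=ISO-8859-1"))
```

With a real response object, the bytes would come from ``response.read()`` and the header value from ``response.headers``.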