mirror of
				https://github.com/python/cpython.git
				synced 2025-11-03 11:23:31 +00:00 
			
		
		
		
	the parse() and parseString() functions use a separate parser, not actually implement a parser. (This is a common question.)
		
			
				
	
	
		
			286 lines
		
	
	
	
		
			10 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
			
		
		
	
	
			286 lines
		
	
	
	
		
			10 KiB
		
	
	
	
		
			TeX
		
	
	
	
	
	
\section{\module{xml.dom.minidom} ---
 | 
						|
         Lightweight DOM implementation}
 | 
						|
 | 
						|
\declaremodule{standard}{xml.dom.minidom}
 | 
						|
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
 | 
						|
\moduleauthor{Paul Prescod}{paul@prescod.net}
 | 
						|
\sectionauthor{Paul Prescod}{paul@prescod.net}
 | 
						|
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}
 | 
						|
 | 
						|
\versionadded{2.0}
 | 
						|
 | 
						|
\module{xml.dom.minidom} is a light-weight implementation of the
 | 
						|
Document Object Model interface.  It is intended to be
 | 
						|
simpler than the full DOM and also significantly smaller.
 | 
						|
 | 
						|
DOM applications typically start by parsing some XML into a DOM.  With
 | 
						|
\module{xml.dom.minidom}, this is done through the parse functions:
 | 
						|
 | 
						|
\begin{verbatim}
 | 
						|
from xml.dom.minidom import parse, parseString
 | 
						|
 | 
						|
dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name
 | 
						|
 | 
						|
datasource = open('c:\\temp\\mydata.xml')
 | 
						|
dom2 = parse(datasource)   # parse an open file
 | 
						|
 | 
						|
dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
 | 
						|
\end{verbatim}
 | 
						|
 | 
						|
The \function{parse()} function can take either a filename or an open
 | 
						|
file object.
 | 
						|
 | 
						|
\begin{funcdesc}{parse}{filename_or_file{, parser}}
 | 
						|
  Return a \class{Document} from the given input. \var{filename_or_file}
 | 
						|
  may be either a file name, or a file-like object. \var{parser}, if
 | 
						|
  given, must be a SAX2 parser object. This function will change the
 | 
						|
  document handler of the parser and activate namespace support; other
 | 
						|
  parser configuration (like setting an entity resolver) must have been
 | 
						|
  done in advance.
 | 
						|
\end{funcdesc}
 | 
						|
 | 
						|
If you have XML in a string, you can use the
 | 
						|
\function{parseString()} function instead:
 | 
						|
 | 
						|
\begin{funcdesc}{parseString}{string\optional{, parser}}
 | 
						|
  Return a \class{Document} that represents the \var{string}. This
 | 
						|
  method creates a \class{StringIO} object for the string and passes
 | 
						|
  that on to \function{parse}.
 | 
						|
\end{funcdesc}
 | 
						|
 | 
						|
Both functions return a \class{Document} object representing the
 | 
						|
content of the document.
 | 
						|
 | 
						|
What the \function{parse()} and \function{parseString()} functions do
 | 
						|
is connect an XML parser with a ``DOM builder'' that can accept parse
 | 
						|
events from any SAX parser and convert them into a DOM tree.  The name
 | 
						|
of the functions are perhaps misleading, but are easy to grasp when
 | 
						|
learning the interfaces.  The parsing of the document will be
 | 
						|
completed before these functions return; it's simply that these
 | 
						|
functions do not provide a parser implementation themselves.
 | 
						|
 | 
						|
You can also create a \class{Document} by calling a method on a ``DOM
 | 
						|
Implementation'' object.  You can get this object either by calling
 | 
						|
the \function{getDOMImplementation()} function in the
 | 
						|
\refmodule{xml.dom} package or the \module{xml.dom.minidom} module.
 | 
						|
Using the implementation from the \module{xml.dom.minidom} module will
 | 
						|
always return a \class{Document} instance from the minidom
 | 
						|
implementation, while the version from \refmodule{xml.dom} may provide
 | 
						|
an alternate implementation (this is likely if you have the
 | 
						|
\ulink{PyXML package}{http://pyxml.sourceforge.net/} installed).  Once
 | 
						|
you have a \class{Document}, you can add child nodes to it to populate
 | 
						|
the DOM:
 | 
						|
 | 
						|
\begin{verbatim}
 | 
						|
from xml.dom.minidom import getDOMImplementation
 | 
						|
 | 
						|
impl = getDOMImplementation()
 | 
						|
 | 
						|
newdoc = impl.createDocument(None, "some_tag", None)
 | 
						|
top_element = newdoc.documentElement
 | 
						|
text = newdoc.createTextNode('Some textual content.')
 | 
						|
top_element.appendChild(text)
 | 
						|
\end{verbatim}
 | 
						|
 | 
						|
Once you have a DOM document object, you can access the parts of your
 | 
						|
XML document through its properties and methods.  These properties are
 | 
						|
defined in the DOM specification.  The main property of the document
 | 
						|
object is the \member{documentElement} property.  It gives you the
 | 
						|
main element in the XML document: the one that holds all others.  Here
 | 
						|
is an example program:
 | 
						|
 | 
						|
\begin{verbatim}
 | 
						|
dom3 = parseString("<myxml>Some data</myxml>")
 | 
						|
assert dom3.documentElement.tagName == "myxml"
 | 
						|
\end{verbatim}
 | 
						|
 | 
						|
When you are finished with a DOM, you should clean it up.  This is
 | 
						|
necessary because some versions of Python do not support garbage
 | 
						|
collection of objects that refer to each other in a cycle.  Until this
 | 
						|
restriction is removed from all versions of Python, it is safest to
 | 
						|
write your code as if cycles would not be cleaned up.
 | 
						|
 | 
						|
The way to clean up a DOM is to call its \method{unlink()} method:
 | 
						|
 | 
						|
\begin{verbatim}
 | 
						|
dom1.unlink()
 | 
						|
dom2.unlink()
 | 
						|
dom3.unlink()
 | 
						|
\end{verbatim}
 | 
						|
 | 
						|
\method{unlink()} is a \module{xml.dom.minidom}-specific extension to
 | 
						|
the DOM API.  After calling \method{unlink()} on a node, the node and
 | 
						|
its descendents are essentially useless.
 | 
						|
 | 
						|
\begin{seealso}
 | 
						|
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
 | 
						|
            Model (DOM) Level 1 Specification}
 | 
						|
           {The W3C recommendation for the
 | 
						|
            DOM supported by \module{xml.dom.minidom}.}
 | 
						|
\end{seealso}
 | 
						|
 | 
						|
 | 
						|
\subsection{DOM Objects \label{dom-objects}}
 | 
						|
 | 
						|
The definition of the DOM API for Python is given as part of the
 | 
						|
\refmodule{xml.dom} module documentation.  This section lists the
 | 
						|
differences between the API and \refmodule{xml.dom.minidom}.
 | 
						|
 | 
						|
 | 
						|
\begin{methoddesc}{unlink}{}
 | 
						|
Break internal references within the DOM so that it will be garbage
 | 
						|
collected on versions of Python without cyclic GC.  Even when cyclic
 | 
						|
GC is available, using this can make large amounts of memory available
 | 
						|
sooner, so calling this on DOM objects as soon as they are no longer
 | 
						|
needed is good practice.  This only needs to be called on the
 | 
						|
\class{Document} object, but may be called on child nodes to discard
 | 
						|
children of that node.
 | 
						|
\end{methoddesc}
 | 
						|
 | 
						|
\begin{methoddesc}{writexml}{writer}
 | 
						|
Write XML to the writer object.  The writer should have a
 | 
						|
\method{write()} method which matches that of the file object
 | 
						|
interface.
 | 
						|
 | 
						|
\versionadded[To support pretty output, new keyword parameters indent,
 | 
						|
addindent, and newl have been added]{2.1}
 | 
						|
 | 
						|
\versionadded[For the \class{Document} node, an additional keyword
 | 
						|
argument encoding can be used to specify the encoding field of the XML
 | 
						|
header]{2.3}
 | 
						|
 | 
						|
\end{methoddesc}
 | 
						|
 | 
						|
\begin{methoddesc}{toxml}{\optional{encoding}}
 | 
						|
Return the XML that the DOM represents as a string.
 | 
						|
 | 
						|
\versionadded[the \var{encoding} argument]{2.3}
 | 
						|
 | 
						|
With no argument, the XML header does not specify an encoding, and the
 | 
						|
result is Unicode string if the default encoding cannot represent all
 | 
						|
characters in the document. Encoding this string in an encoding other
 | 
						|
than UTF-8 is likely incorrect, since UTF-8 is the default encoding of
 | 
						|
XML.
 | 
						|
 | 
						|
With an explicit \var{encoding} argument, the result is a byte string
 | 
						|
in the specified encoding. It is recommended that this argument is
 | 
						|
always specified. To avoid UnicodeError exceptions in case of
 | 
						|
unrepresentable text data, the encoding argument should be specified
 | 
						|
as "utf-8".
 | 
						|
 | 
						|
\end{methoddesc}
 | 
						|
 | 
						|
\begin{methoddesc}{toprettyxml}{\optional{indent\optional{, newl}}}
 | 
						|
 | 
						|
Return a pretty-printed version of the document. \var{indent} specifies
 | 
						|
the indentation string and defaults to a tabulator; \var{newl} specifies
 | 
						|
the string emitted at the end of each line and defaults to \\n.
 | 
						|
 | 
						|
\versionadded{2.1}
 | 
						|
 | 
						|
\versionadded[the encoding argument; see \method{toxml}]{2.3}
 | 
						|
 | 
						|
\end{methoddesc}
 | 
						|
 | 
						|
The following standard DOM methods have special considerations with
 | 
						|
\refmodule{xml.dom.minidom}:
 | 
						|
 | 
						|
\begin{methoddesc}{cloneNode}{deep}
 | 
						|
Although this method was present in the version of
 | 
						|
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
 | 
						|
broken.  This has been corrected for subsequent releases.
 | 
						|
\end{methoddesc}
 | 
						|
 | 
						|
 | 
						|
\subsection{DOM Example \label{dom-example}}
 | 
						|
 | 
						|
This example program is a fairly realistic example of a simple
 | 
						|
program. In this particular case, we do not take much advantage
 | 
						|
of the flexibility of the DOM.
 | 
						|
 | 
						|
\verbatiminput{minidom-example.py}
 | 
						|
 | 
						|
 | 
						|
\subsection{minidom and the DOM standard \label{minidom-and-dom}}
 | 
						|
 | 
						|
The \refmodule{xml.dom.minidom} module is essentially a DOM
 | 
						|
1.0-compatible DOM with some DOM 2 features (primarily namespace
 | 
						|
features).
 | 
						|
 | 
						|
Usage of the DOM interface in Python is straight-forward.  The
 | 
						|
following mapping rules apply:
 | 
						|
 | 
						|
\begin{itemize}
 | 
						|
\item Interfaces are accessed through instance objects. Applications
 | 
						|
      should not instantiate the classes themselves; they should use
 | 
						|
      the creator functions available on the \class{Document} object.
 | 
						|
      Derived interfaces support all operations (and attributes) from
 | 
						|
      the base interfaces, plus any new operations.
 | 
						|
 | 
						|
\item Operations are used as methods. Since the DOM uses only
 | 
						|
      \keyword{in} parameters, the arguments are passed in normal
 | 
						|
      order (from left to right).   There are no optional
 | 
						|
      arguments. \keyword{void} operations return \code{None}.
 | 
						|
 | 
						|
\item IDL attributes map to instance attributes. For compatibility
 | 
						|
      with the OMG IDL language mapping for Python, an attribute
 | 
						|
      \code{foo} can also be accessed through accessor methods
 | 
						|
      \method{_get_foo()} and \method{_set_foo()}.  \keyword{readonly}
 | 
						|
      attributes must not be changed; this is not enforced at
 | 
						|
      runtime.
 | 
						|
 | 
						|
\item The types \code{short int}, \code{unsigned int}, \code{unsigned
 | 
						|
      long long}, and \code{boolean} all map to Python integer
 | 
						|
      objects.
 | 
						|
 | 
						|
\item The type \code{DOMString} maps to Python strings.
 | 
						|
      \refmodule{xml.dom.minidom} supports either byte or Unicode
 | 
						|
      strings, but will normally produce Unicode strings.  Values
 | 
						|
      of type \code{DOMString} may also be \code{None} where allowed
 | 
						|
      to have the IDL \code{null} value by the DOM specification from
 | 
						|
      the W3C.
 | 
						|
 | 
						|
\item \keyword{const} declarations map to variables in their
 | 
						|
      respective scope
 | 
						|
      (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
 | 
						|
      they must not be changed.
 | 
						|
 | 
						|
\item \code{DOMException} is currently not supported in
 | 
						|
      \refmodule{xml.dom.minidom}.  Instead,
 | 
						|
      \refmodule{xml.dom.minidom} uses standard Python exceptions such
 | 
						|
      as \exception{TypeError} and \exception{AttributeError}.
 | 
						|
 | 
						|
\item \class{NodeList} objects are implemented using Python's built-in
 | 
						|
      list type.  Starting with Python 2.2, these objects provide the
 | 
						|
      interface defined in the DOM specification, but with earlier
 | 
						|
      versions of Python they do not support the official API.  They
 | 
						|
      are, however, much more ``Pythonic'' than the interface defined
 | 
						|
      in the W3C recommendations.
 | 
						|
\end{itemize}
 | 
						|
 | 
						|
 | 
						|
The following interfaces have no implementation in
 | 
						|
\refmodule{xml.dom.minidom}:
 | 
						|
 | 
						|
\begin{itemize}
 | 
						|
\item \class{DOMTimeStamp}
 | 
						|
 | 
						|
\item \class{DocumentType} (added in Python 2.1)
 | 
						|
 | 
						|
\item \class{DOMImplementation} (added in Python 2.1)
 | 
						|
 | 
						|
\item \class{CharacterData}
 | 
						|
 | 
						|
\item \class{CDATASection}
 | 
						|
 | 
						|
\item \class{Notation}
 | 
						|
 | 
						|
\item \class{Entity}
 | 
						|
 | 
						|
\item \class{EntityReference}
 | 
						|
 | 
						|
\item \class{DocumentFragment}
 | 
						|
\end{itemize}
 | 
						|
 | 
						|
Most of these reflect information in the XML document that is not of
 | 
						|
general utility to most DOM users.
 |