\chapter{Python compiler package \label{compiler}}

\sectionauthor{Jeremy Hylton}{jeremy@zope.com}


The Python compiler package is a tool for analyzing Python source code
and generating Python bytecode.  The compiler contains libraries to
generate an abstract syntax tree from Python source code and to
generate Python bytecode from the tree.

The \refmodule{compiler} package is a Python source to bytecode
translator written in Python.  It uses the built-in parser and
standard \refmodule{parser} module to generate a concrete syntax
tree.  This tree is used to generate an abstract syntax tree (AST) and
then Python bytecode.

The full functionality of the package duplicates the built-in compiler
provided with the Python interpreter.  It is intended to match its
behavior almost exactly.  Why implement another compiler that does the
same thing?  The package is useful for a variety of purposes.  It can
be modified more easily than the built-in compiler.  The AST it
generates is useful for analyzing Python source code.

This chapter explains how the various components of the
\refmodule{compiler} package work.  It blends reference material with
a tutorial.

The following modules are part of the \refmodule{compiler} package:

\localmoduletable

\section{The basic interface}

\declaremodule{}{compiler}

The top level of the package defines five functions.  If you import
\module{compiler}, you will get these functions and a collection of
modules contained in the package.

\begin{funcdesc}{parse}{buf}
Return an abstract syntax tree for the Python source code in \var{buf}.
The function raises \exception{SyntaxError} if there is an error in the
source code.  The return value is a \class{compiler.ast.Module} instance
that contains the tree.
\end{funcdesc}

\begin{funcdesc}{parseFile}{path}
Return an abstract syntax tree for the Python source code in the file
specified by \var{path}.  It is equivalent to
\code{parse(open(\var{path}).read())}.
\end{funcdesc}

\begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
Do a pre-order walk over the abstract syntax tree \var{ast}.  Call the
appropriate method on the \var{visitor} instance for each node
encountered.
\end{funcdesc}

\begin{funcdesc}{compile}{source, filename, mode, flags=None, 
			dont_inherit=None}
Compile the string \var{source}, a Python module, statement or
expression, into a code object that can be executed by the
\keyword{exec} statement or \function{eval()}.  This function is a
replacement for the built-in \function{compile()} function.

The \var{filename} will be used for run-time error messages.

The \var{mode} must be \code{'exec'} to compile a module,
\code{'single'} to compile a single (interactive) statement, or
\code{'eval'} to compile an expression.

The \var{flags} and \var{dont_inherit} arguments affect future-related
statements, but are not supported yet.
\end{funcdesc}

\begin{funcdesc}{compileFile}{source}
Compile the file \var{source} and generate a \file{.pyc} file.
\end{funcdesc}

The \module{compiler} package contains the following modules:
\refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
\module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
\module{transformer}, and \refmodule[compiler.visitor]{visitor}.

\section{Limitations}

There are some problems with the error checking of the compiler
package.  The interpreter detects syntax errors in two distinct
phases.  One set of errors is detected by the interpreter's parser,
the other set by the compiler.  The compiler package relies on the
interpreter's parser, so it gets the first phase of error checking for
free.  It implements the second phase itself, and that implementation
is incomplete.  For example, the compiler package does not raise an
error if a name appears more than once in an argument list: 
\code{def f(x, x): ...}

A future version of the compiler should fix these problems.

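The kind of second-phase check described above is simple to state.  As
a rough sketch only, the function below shows what a test for
duplicate argument names might look like.  The name
\function{check_duplicate_args()} is hypothetical and is not part of
the \module{compiler} package, and the sketch assumes that the
argument list is a flat list of strings.

\begin{verbatim}
def check_duplicate_args(argnames):
    """Raise SyntaxError if any name occurs twice in argnames.

    Hypothetical helper, not part of the compiler package.
    """
    seen = set()
    for name in argnames:
        if name in seen:
            raise SyntaxError("duplicate argument %r in function definition"
                              % name)
        seen.add(name)


check_duplicate_args(["x", "y"])      # distinct names: passes silently
try:
    check_duplicate_args(["x", "x"])  # mirrors def f(x, x): ...
except SyntaxError as exc:
    print(exc)
\end{verbatim}

A real implementation would also have to decide where in the
compilation process to run such a check; the sketch says nothing about
that.
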
\section{Python Abstract Syntax}

The \module{compiler.ast} module defines an abstract syntax for
Python.  In the abstract syntax tree, each node represents a syntactic
construct.  The root of the tree is a \class{Module} object.

The abstract syntax offers a higher-level interface to parsed Python
source code.  The \ulink{\module{parser}}
{http://www.python.org/doc/current/lib/module-parser.html}
module and the compiler written in C for the Python interpreter use a
concrete syntax tree.  The concrete syntax is tied closely to the
grammar description used for the Python parser.  Instead of a single
node for a construct, there are often several levels of nested nodes
that are introduced by Python's precedence rules.

The abstract syntax tree is created by the
\module{compiler.transformer} module.  The transformer relies on the
built-in Python parser to generate a concrete syntax tree.  It
generates an abstract syntax tree from the concrete tree.  

The \module{transformer} module was created by Greg
Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
experimental Python-to-C compiler.  The current version contains a
number of modifications and improvements, but the basic form of the
abstract syntax and of the transformer are due to Stein and Tutt.

\subsection{AST Nodes}

\declaremodule{}{compiler.ast}

The \module{compiler.ast} module is generated from a text file that
describes each node type and its elements.  Each node type is
represented as a class that inherits from the abstract base class
\class{compiler.ast.Node} and defines a set of named attributes for
child nodes.

\begin{classdesc}{Node}{}
  
  The \class{Node} instances are created automatically by the parser
  generator.  The recommended interface for specific \class{Node}
  instances is to use the public attributes to access child nodes.  A
  public attribute may be bound to a single node or to a sequence of
  nodes, depending on the \class{Node} type.  For example, the
  \member{bases} attribute of the \class{Class} node is bound to a
  list of base class nodes, and the \member{doc} attribute is bound to
  a single node.
  
  Each \class{Node} instance has a \member{lineno} attribute which may
  be \code{None}.  XXX Not sure what the rules are for which nodes
  will have a useful lineno.
\end{classdesc}

All \class{Node} objects offer the following methods:

\begin{methoddesc}{getChildren}{}
  Returns a flattened list of the child nodes and objects in the
  order they occur.  Specifically, the order of the nodes is the
  order in which they appear in the Python grammar.  Not all of the
  children are \class{Node} instances.  The names of functions and
  classes, for example, are plain strings.
\end{methoddesc}

\begin{methoddesc}{getChildNodes}{}
  Returns a flattened list of the child nodes in the order they
  occur.  This method is like \method{getChildren()}, except that it
  only returns those children that are \class{Node} instances.
\end{methoddesc}

Two examples illustrate the general structure of \class{Node}
classes.  The \keyword{while} statement is defined by the following
grammar production: 

\begin{verbatim}
while_stmt:     "while" expression ":" suite
               ["else" ":" suite]
\end{verbatim}

The \class{While} node has three attributes: \member{test},
\member{body}, and \member{else_}.  (If the natural name for an
attribute is also a Python reserved word, it can't be used as an
attribute name.  An underscore is appended to the word to make it a
legal identifier, hence \member{else_} instead of \keyword{else}.)

The \keyword{if} statement is more complicated because it can include
several tests.  

\begin{verbatim}
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
\end{verbatim}

The \class{If} node only defines two attributes: \member{tests} and
\member{else_}.  The \member{tests} attribute is a sequence of
(test expression, consequent body) pairs.  There is one pair for each
\keyword{if}/\keyword{elif} clause.  The first element of the pair is
the test expression.  The second element is a \class{Stmt} node that
contains the code to execute if the test is true.

The \method{getChildren()} method of \class{If} returns a flat list of
child nodes.  If there are three \keyword{if}/\keyword{elif} clauses
and no \keyword{else} clause, then \method{getChildren()} will return
a list of six elements: the first test expression, the first
\class{Stmt}, the second test expression, etc.

The following table lists each of the \class{Node} subclasses defined
in \module{compiler.ast} and each of the public attributes available
on their instances.  The values of most of the attributes are
themselves \class{Node} instances or sequences of instances.  When the
value is something other than an instance, the type is noted in the
comment.  The attributes are listed in the order in which they are
returned by \method{getChildren()} and \method{getChildNodes()}.

\input{asttable}


\subsection{Assignment nodes}

There is a collection of nodes used to represent assignments.  Each
assignment statement in the source code becomes a single
\class{Assign} node in the AST.  The \member{nodes} attribute is a
list that contains a node for each assignment target.  This is
necessary because assignments can be chained, e.g. \code{a = b = 2}.
Each \class{Node} in the list will be one of the following classes: 
\class{AssAttr}, \class{AssList}, \class{AssName}, or
\class{AssTuple}. 

Each target assignment node will describe the kind of object being
assigned to:  \class{AssName} for a simple name, e.g. \code{a = 1};
\class{AssAttr} for an attribute assignment, e.g. \code{a.x = 1};
\class{AssList} and \class{AssTuple} for list and tuple expansion
respectively, e.g. \code{a, b, c = a_tuple}.

The target assignment nodes also have a \member{flags} attribute that
indicates whether the node is being used for assignment or in a delete
statement.  The \class{AssName} is also used to represent a delete
statement, e.g. \code{del x}.

When an expression contains several attribute references, an
assignment or delete statement will contain only one \class{AssAttr}
node -- for the final attribute reference.  The other attribute
references will be represented as \class{Getattr} nodes in the
\member{expr} attribute of the \class{AssAttr} instance.

\subsection{Examples}

This section shows several simple examples of ASTs for Python source
code.  The examples demonstrate how to use the \function{parse()}
function, what the repr of an AST looks like, and how to access
attributes of an AST node.

The first module defines a single function.  Assume it is stored in
\file{/tmp/doublelib.py}. 

\begin{verbatim}
"""This is an example module.

This is the docstring.
"""

def double(x):
    "Return twice the argument"
    return x * 2
\end{verbatim}

In the interactive interpreter session below, I have reformatted the
long AST reprs for readability.  The AST reprs use unqualified class
names.  If you want to create an instance from a repr, you must import
the class names from the \module{compiler.ast} module.

\begin{verbatim}
>>> import compiler
>>> mod = compiler.parseFile("/tmp/doublelib.py")
>>> mod
Module('This is an example module.\n\nThis is the docstring.\n', 
       Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', 
       Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> from compiler.ast import *
>>> Module('This is an example module.\n\nThis is the docstring.\n', 
...    Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', 
...    Stmt([Return(Mul((Name('x'), Const(2))))]))]))
Module('This is an example module.\n\nThis is the docstring.\n', 
       Stmt([Function('double', ['x'], [], 0, 'Return twice the argument', 
       Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> mod.doc
'This is an example module.\n\nThis is the docstring.\n'
>>> for node in mod.node.nodes:
...     print node
... 
Function('double', ['x'], [], 0, 'Return twice the argument',
         Stmt([Return(Mul((Name('x'), Const(2))))]))
>>> func = mod.node.nodes[0]
>>> func.code
Stmt([Return(Mul((Name('x'), Const(2))))])
\end{verbatim}

\section{Using Visitors to Walk ASTs}

\declaremodule{}{compiler.visitor}

The visitor pattern is ...  The \refmodule{compiler} package uses a
variant on the visitor pattern that takes advantage of Python's
introspection features to eliminate the need for much of the visitor's
infrastructure.

The classes being visited do not need to be programmed to accept
visitors.  The visitor need only define visit methods for classes it
is specifically interested in; a default visit method can handle the
rest. 

XXX The magic \method{visit()} method for visitors.

\begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
\end{funcdesc}

\begin{classdesc}{ASTVisitor}{}

The \class{ASTVisitor} is responsible for walking over the tree in the
correct order.  A walk begins with a call to \method{preorder()}.  For
each node, it checks the \var{visitor} argument to \method{preorder()}
for a method named `visitNodeType', where NodeType is the name of the
node's class, e.g. for a \class{While} node a \method{visitWhile()}
method would be called.  If the method exists, it is called with the
node as its first argument.

The visitor method for a particular node type can control how child
nodes are visited during the walk.  The \class{ASTVisitor} modifies
the visitor argument by adding a visit method to the visitor; this
method can be used to visit a particular child node.  If no visitor
method is found for a particular node type, the \method{default()}
method is called. 
\end{classdesc}

\class{ASTVisitor} objects have the following methods:

XXX describe extra arguments

\begin{methoddesc}{default}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{preorder}{tree, visitor}
\end{methoddesc}


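The name-based dispatch described in this section can be illustrated
with a small self-contained sketch.  It does not use the
\module{compiler} package: the classes \class{Node}, \class{While},
\class{Name}, \class{Const}, \class{SketchWalker} and
\class{CountingVisitor} are hypothetical stand-ins invented for the
illustration, and the real \class{ASTVisitor} may differ in detail.

\begin{verbatim}
class Node(object):
    """Stand-in for an AST node; not compiler.ast.Node."""
    def __init__(self, *children):
        self.children = list(children)

class While(Node):
    pass

class Name(Node):
    pass

class Const(Node):
    pass


class SketchWalker(object):
    """Dispatches on the name of the node's class."""

    def preorder(self, tree, visitor):
        self.visitor = visitor
        # Give the visitor a method it can use to visit child nodes.
        visitor.visit = self.dispatch
        self.dispatch(tree)

    def dispatch(self, node):
        # Look up visitWhile, visitName, ... by class name;
        # fall back to default() if the visitor has no such method.
        method = getattr(self.visitor,
                         "visit" + node.__class__.__name__,
                         self.default)
        return method(node)

    def default(self, node):
        # No specific visit method: continue into the children.
        for child in node.children:
            self.dispatch(child)


class CountingVisitor(object):
    """Only defines visit methods for While and Name nodes."""

    def __init__(self):
        self.seen = []

    def visitWhile(self, node):
        self.seen.append("While")
        # The visit method decides how the children are visited.
        for child in node.children:
            self.visit(child)

    def visitName(self, node):
        self.seen.append("Name")


tree = While(Name(), Node(Const(), Name()))
visitor = CountingVisitor()
SketchWalker().preorder(tree, visitor)
print(visitor.seen)
\end{verbatim}

Note that the visited classes contain no visitor-specific code, and
that the \class{Const} and plain \class{Node} instances are handled by
the default method because the visitor defines no method for them.
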
\section{Bytecode Generation}

The code generator is a visitor that emits bytecodes.  Each visit method
can call the \method{emit()} method to emit a new bytecode.  The basic
code generator is specialized for modules, classes, and functions.  An
assembler converts the emitted instructions to the low-level bytecode
format.  It handles things like generation of constant lists of code
objects and calculation of jump offsets.
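
The division of labor between emitting instructions and assembling
them can be sketched with a toy example.  Everything in it is an
assumption made for illustration: the class \class{SketchCodeGen}, the
function \function{assemble()} and the \code{LABEL} pseudo-instruction
are hypothetical and are not the interfaces of \module{pycodegen} or
\module{pyassem}.  The opcode names are used only as strings, and jump
targets are resolved to instruction indexes rather than to the byte
offsets used by real bytecode.

\begin{verbatim}
class SketchCodeGen(object):
    """Collects symbolic instructions; hypothetical, not pycodegen."""

    def __init__(self):
        self.insts = []

    def emit(self, opname, arg=None):
        self.insts.append((opname, arg))

    def label(self, name):
        # Pseudo-instruction marking a jump target; it takes no space.
        self.insts.append(("LABEL", name))


def assemble(insts):
    """Replace label names in jump arguments with instruction indexes."""
    # First pass: record the index each label refers to.
    positions = {}
    index = 0
    for opname, arg in insts:
        if opname == "LABEL":
            positions[arg] = index
        else:
            index += 1
    # Second pass: drop the labels and patch the jump targets.
    resolved = []
    for opname, arg in insts:
        if opname == "LABEL":
            continue
        if opname.startswith("JUMP"):
            arg = positions[arg]
        resolved.append((opname, arg))
    return resolved


gen = SketchCodeGen()
gen.label("top")
gen.emit("LOAD_NAME", "x")
gen.emit("JUMP_IF_FALSE", "done")
gen.emit("LOAD_NAME", "body")
gen.emit("JUMP_ABSOLUTE", "top")
gen.label("done")
gen.emit("RETURN_VALUE")
for inst in assemble(gen.insts):
    print(inst)
\end{verbatim}

Two passes are used because a forward jump such as the one to
\code{done} refers to a position that is not yet known when the jump
is emitted.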