Commit the howto source to the main Python repository, with Fred's approval

2025-11-03 03:22:27 +00:00 · 2005-08-30 01:25:05 +00:00 · 2005-08-30 01:25:05 +00:00 · e8f44d683e
commit e8f44d683e
parent f1b2ba6aa1
9 changed files with 4340 additions and 0 deletions
--- a/Doc/howto/Makefile
+++ b/Doc/howto/Makefile
@@ -0,0 +1,88 @@
 MKHOWTO=../tools/mkhowto
 WEBDIR=.
 RSTARGS = --input-encoding=utf-8
 VPATH=.:dvi:pdf:ps:txt
 # List of HOWTOs that aren't to be processed
 REMOVE_HOWTO =
 # Determine list of files to be built
 HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
 RST_SOURCES =	$(shell echo *.rst)
 DVI  =$(patsubst %.tex,%.dvi,$(HOWTO))
 PDF  =$(patsubst %.tex,%.pdf,$(HOWTO))
 PS   =$(patsubst %.tex,%.ps,$(HOWTO))
 TXT  =$(patsubst %.tex,%.txt,$(HOWTO))
 HTML =$(patsubst %.tex,%,$(HOWTO))
 # Rules for building various formats
 %.dvi : %.tex
 	$(MKHOWTO) --dvi $<
 	mv $@ dvi
 %.pdf : %.tex
 	$(MKHOWTO) --pdf $<
 	mv $@ pdf
 %.ps : %.tex
 	$(MKHOWTO) --ps $<
 	mv $@ ps
 %.txt : %.tex
 	$(MKHOWTO) --text $<
 	mv $@ txt
 % : %.tex
 	$(MKHOWTO) --html --iconserver="." $<
 	tar -zcvf html/$*.tgz $*
 	#zip -r html/$*.zip $*
 default:
 	@echo "'all'    -- build all files"
 	@echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"
 all: $(HTML)
 .PHONY : dvi pdf ps txt html rst
 dvi: $(DVI)
 pdf: $(PDF)
 ps:  $(PS)
 txt: $(TXT)
 html:$(HTML)
 # Rule to build collected tar files
 dist: #all
 	for i in dvi pdf ps txt ; do \
 	    cd $$i ; \
 	    tar -zcf All.tgz *.$$i ;\
 	    cd .. ;\
 	done
 # Rule to copy files to the Web tree on AMK's machine
 web: dist
 	cp dvi/* $(WEBDIR)/dvi
 	cp ps/* $(WEBDIR)/ps
 	cp pdf/* $(WEBDIR)/pdf
 	cp txt/* $(WEBDIR)/txt
 	for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
 	for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done
 rst: unicode.html
 %.html: %.rst
 	rst2html $(RSTARGS) $< >$@
 clean:
 	rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
 	rm -f *.dvi *.ps *.pdf *.bkm
 	rm -f unicode.html
 clobber:
 	rm dvi/* ps/* pdf/* txt/* html/*
--- a/Doc/howto/advocacy.tex
+++ b/Doc/howto/advocacy.tex
@@ -0,0 +1,405 @@
 \documentclass{howto}
 \title{Python Advocacy HOWTO}
 \release{0.03}
 \author{A.M. Kuchling}
 \authoraddress{\email{amk@amk.ca}}
 \begin{document}
 \maketitle
 \begin{abstract}
 \noindent
 It's usually difficult to get your management to accept open source
 software, and Python is no exception to this rule.  This document
 discusses reasons to use Python, strategies for winning acceptance,
 facts and arguments you can use, and cases where you \emph{shouldn't}
 try to use Python.
 This document is available from the Python HOWTO page at
 \url{http://www.python.org/doc/howto}.
 \end{abstract}
 \tableofcontents
 \section{Reasons to Use Python}
 There are several reasons to incorporate a scripting language into
 your development process, and this section will discuss them, and why
 Python has some properties that make it a particularly good choice.
 \subsection{Programmability}
 Programs are often organized in a modular fashion.  Lower-level
 operations are grouped together, and called by higher-level functions,
 which may in turn be used as basic operations by still further upper
 levels.  
 For example, the lowest level might define a very low-level
 set of functions for accessing a hash table.  The next level might use
 hash tables to store the headers of a mail message, mapping a header
 name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
 20:00:54 -0400}.  A yet higher level may operate on message objects,
 without knowing or caring that message headers are stored in a hash
 table, and so forth.  
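 The layering described above can be sketched in a few lines of Python.
 This is only an illustration: the \class{Headers} and \class{Message}
 classes are hypothetical names invented for this sketch, and a plain
 Python dictionary stands in for the low-level hash table.

```python
class Headers:
    """Middle level: stores mail headers in a hash table (a dict)."""

    def __init__(self):
        self._table = {}  # lowest level: the hash table itself

    def set(self, name, value):
        # Header names are treated as case-insensitive in this sketch.
        self._table[name.lower()] = value

    def get(self, name, default=None):
        return self._table.get(name.lower(), default)


class Message:
    """Top level: works with messages, not with hash tables."""

    def __init__(self, body=""):
        self.headers = Headers()
        self.body = body

    def date(self):
        return self.headers.get("Date")


msg = Message("Hello")
msg.headers.set("Date", "Tue, 13 May 1997 20:00:54 -0400")
print(msg.date())
```

 Code that calls \code{msg.date()} never needs to know how the headers
 are stored, so the storage could later be swapped for something else
 without touching the higher level.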
 Often, the lowest levels do very simple things; they implement a data
 structure such as a binary tree or hash table, or they perform some
 simple computation, such as converting a date string to a number.  The
 higher levels then contain logic connecting these primitive
 operations.  Using this approach, the primitives can be seen as basic
 building blocks which are then glued together to produce the complete
 product.  
 Why is this design approach relevant to Python?  Because Python is
 well suited to functioning as such a glue language.  A common approach
 is to write a Python module that implements the lower level
 operations; for the sake of speed, the implementation might be in C,
 Java, or even Fortran.  Once the primitives are available to Python
 programs, the logic underlying higher level operations is written in
 the form of Python code.  The high-level logic is then more
 understandable, and easier to modify.
 John Ousterhout wrote a paper that explains this idea at greater
 length, entitled ``Scripting: Higher Level Programming for the 21st
 Century''.  I recommend that you read this paper; see the references
 for the URL.  Ousterhout is the inventor of the Tcl language, and
 therefore argues that Tcl should be used for this purpose; he only
 briefly refers to other languages such as Python, Perl, and
 Lisp/Scheme, but in reality, Ousterhout's argument applies to
 scripting languages in general, since you could equally write
 extensions for any of the languages mentioned above.
 \subsection{Prototyping}
 In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
 following rule when planning software projects: ``Plan to throw one
 away; you will anyway.''  Brooks is saying that the first attempt at a
 software design often turns out to be wrong; unless the problem is
 very simple or you're an extremely good designer, you'll find that new
 requirements and features become apparent once development has
 actually started.  If these new requirements can't be cleanly
 incorporated into the program's structure, you're presented with two
 unpleasant choices: hammer the new features into the program somehow,
 or scrap everything and write a new version of the program, taking the
 new features into account from the beginning.
 Python provides you with a good environment for quickly developing an
 initial prototype.  That lets you get the overall program structure
 and logic right, and you can fine-tune small details in the fast
 development cycle that Python provides.  Once you're satisfied with
 the GUI interface or program output, you can translate the Python code
 into C++, Fortran, Java, or some other compiled language.
 Prototyping means you have to be careful not to use too many Python
 features that are hard to implement in your other language.  Using
 \code{eval()}, or regular expressions, or the \module{pickle} module,
 means that you're going to need C or Java libraries for formula
 evaluation, regular expressions, and serialization, for example.  But
 it's not hard to avoid such tricky code, and in the end the
 translation usually isn't very difficult.  The resulting code can be
 rapidly debugged, because any serious logical errors will have been
 removed from the prototype, leaving only more minor slip-ups in the
 translation to track down.  
 This strategy builds on the earlier discussion of programmability.
 Using Python as glue to connect lower-level components has obvious
 relevance for constructing prototype systems.  In this way Python can
 help you with development, even if end users never come in contact
 with Python code at all.  If the performance of the Python version is
 adequate and corporate politics allow it, you may not need to do a
 translation into C or Java, but it can still be faster to develop a
 prototype and then translate it, instead of attempting to produce the
 final version immediately.
 One example of this development strategy is Microsoft Merchant Server.
 Version 1.0 was written in pure Python, by a company that subsequently
 was purchased by Microsoft.  Version 2.0 began to translate the code
 into \Cpp, shipping with some \Cpp code and some Python code.  Version
 3.0 didn't contain any Python at all; all the code had been translated
 into \Cpp.  Even though the product doesn't contain a Python
 interpreter, the Python language has still served a useful purpose by
 speeding up development.  
 This is a very common use for Python.  Past conference papers have
 also described this approach for developing high-level numerical
 algorithms; see David M. Beazley and Peter S. Lomdahl's paper
 ``Feeding a Large-scale Physics Application to Python'' in the
 references for a good example.  If an algorithm's basic operations are
 things like ``Take the inverse of this 4000x4000 matrix'', and are
 implemented in some lower-level language, then Python has almost no
 additional performance cost; the extra time required for Python to
 evaluate an expression like \code{m.invert()} is dwarfed by the cost
 of the actual computation.  Python is particularly good for applications
 where seemingly endless tweaking is required to get things right. GUI
 interfaces and Web sites are prime examples.
 The Python code is also shorter and faster to write (once you're
 familiar with Python), so it's easier to throw it away if you decide
 your approach was wrong; if you'd spent two weeks working on it
 instead of just two hours, you might waste time trying to patch up
 what you've got out of a natural reluctance to admit that those two
 weeks were wasted.  Truthfully, those two weeks haven't been wasted,
 since you've learnt something about the problem and the technology
 you're using to solve it, but it's human nature to view this as a
 failure of some sort.
 \subsection{Simplicity and Ease of Understanding}
 Python is definitely \emph{not} a toy language that's only usable for
 small tasks.  The language features are general and powerful enough to
 enable it to be used for many different purposes.  It's useful at the
 small end, for 10- or 20-line scripts, but it also scales up to larger
 systems that contain thousands of lines of code.
 However, this expressiveness doesn't come at the cost of an obscure or
 tricky syntax.  While Python has some dark corners that can lead to
 obscure code, there are relatively few such corners, and proper design
 can isolate their use to only a few classes or modules.  It's
 certainly possible to write confusing code by using too many features
 with too little concern for clarity, but most Python code can look a
 lot like a slightly-formalized version of human-understandable
 pseudocode.
 In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
 definition for ``compact'':
 \begin{quotation}
 	Compact \emph{adj.}  Of a design, describes the valuable property
 	that it can all be apprehended at once in one's head. This
 	generally means the thing created from the design can be used
 	with greater facility and fewer errors than an equivalent tool
 	that is not compact. Compactness does not imply triviality or
 	lack of power; for example, C is compact and FORTRAN is not,
 	but C is more powerful than FORTRAN. Designs become
 	non-compact through accreting features and cruft that don't
 	merge cleanly into the overall design scheme (thus, some fans
 	of Classic C maintain that ANSI C is no longer compact).
 \end{quotation}
 (From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})
 In this sense of the word, Python is quite compact, because the
 language has just a few ideas, which are used in lots of places.  Take
 namespaces, for example.  Import a module with \code{import math}, and
 you create a new namespace called \samp{math}.  Classes are also
 namespaces that share many of the properties of modules, and have a
 few of their own; for example, you can create instances of a class.
 Instances?  They're yet another namespace.  Namespaces are currently
 implemented as Python dictionaries, so they have the same methods as
 the standard dictionary data type: \method{keys()} returns all the keys, and
 so forth.
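 The point about namespaces can be checked interactively.  The sketch
 below assumes a current CPython, where the built-in \function{vars()}
 returns the mapping behind a module, class, or instance (for a class
 this is a read-only view rather than a plain dictionary); the
 \class{Point} class is a made-up example.

```python
import math


class Point:
    dimensions = 2          # stored in the class namespace

    def __init__(self, x, y):
        self.x = x          # stored in the instance namespace
        self.y = y


p = Point(3, 4)

# Each namespace can be inspected as a dictionary-like mapping.
print("sqrt" in vars(math))          # module namespace
print("dimensions" in vars(Point))   # class namespace
print(sorted(vars(p).keys()))        # instance namespace
```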
 This simplicity arises from Python's development history.  The
 language syntax derives from different sources; ABC, a relatively
 obscure teaching language, is one primary influence, and Modula-3 is
 another.  (For more information about ABC and Modula-3, consult their
 respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
 \url{http://www.m3.org}.)  Other features have come from C, Icon,
 Algol-68, and even Perl.  Python hasn't really innovated very much,
 but instead has tried to keep the language small and easy to learn,
 building on ideas that have been tried in other languages and found
 useful.
 Simplicity is a virtue that should not be underestimated.  It lets you
 learn the language more quickly, and then rapidly write code, code
 that often works the first time you run it.
 \subsection{Java Integration}
 If you're working with Java, Jython
 (\url{http://www.jython.org/}) is definitely worth your
 attention.  Jython is a re-implementation of Python in Java that
 compiles Python code into Java bytecodes.  The resulting environment
 has very tight, almost seamless, integration with Java.  It's trivial
 to access Java classes from Python, and you can write Python classes
 that subclass Java classes.  Jython can be used for prototyping Java
 applications in much the same way CPython is used, and it can also be
 used for test suites for Java code, or embedded in a Java application
 to add scripting capabilities.  
 \section{Arguments and Rebuttals}
 Let's say that you've decided upon Python as the best choice for your
 application.  How can you convince your management, or your fellow
 developers, to use Python?  This section lists some common arguments
 against using Python, and provides some possible rebuttals.
 \emph{Python is freely available software that doesn't cost anything.
 How good can it be?}
 Very good, indeed.  These days Linux and Apache, two other pieces of
 open source software, are becoming more respected as alternatives to
 commercial software, but Python hasn't had all the publicity.
 Python has been around for several years, with many users and
 developers.  Accordingly, the interpreter has been used by many
 people, and has gotten most of the bugs shaken out of it.  While bugs
 are still discovered at intervals, they're usually either quite
 obscure (they'd have to be, for no one to have run into them before)
 or they involve interfaces to external libraries.  The internals of
 the language itself are quite stable.
 Having the source code should be viewed as making the software
 available for peer review; people can examine the code, suggest (and
 implement) improvements, and track down bugs.  To find out more about
 the idea of open source code, along with arguments and case studies
 supporting it, go to \url{http://www.opensource.org}.
 \emph{Who's going to support it?}
 Python has a sizable community of developers, and the number is still
 growing.  The Internet community surrounding the language is an active
 one, and is worth being considered another one of Python's advantages.
 Most questions posted to the comp.lang.python newsgroup are quickly
 answered by someone.
 Should you need to dig into the source code, you'll find it's clear
 and well-organized, so it's not very difficult to write extensions and
 track down bugs yourself.  If you'd prefer to pay for support, there
 are companies and individuals who offer commercial support for Python.
 \emph{Who uses Python for serious work?}
 Lots of people; one interesting thing about Python is the surprising
 diversity of applications that it's been used for.  People are using
 Python to:
 \begin{itemize}
 \item Run Web sites
 \item Write GUI interfaces
 \item Control number-crunching code on supercomputers
 \item Make a commercial application scriptable by embedding the Python
 interpreter inside it
 \item Process large XML data sets
 \item Build test suites for C or Java code
 \end{itemize}
 Whatever your application domain is, there's probably someone who's
 used Python for something similar.  Yet, despite being usable for
 such high-end applications, Python's still simple enough to use for
 little jobs.
 See \url{http://www.python.org/psa/Users.html} for a list of some of the 
 organizations that use Python.
 \emph{What are the restrictions on Python's use?}
 They're practically nonexistent.  Consult the \file{Misc/COPYRIGHT}
 file in the source distribution, or
 \url{http://www.python.org/doc/Copyright.html} for the full license text,
 but it boils down to three conditions.
 \begin{itemize}
 \item You have to leave the copyright notice on the software; if you
 don't include the source code in a product, you have to put the
 copyright notice in the supporting documentation.  
 \item Don't claim that the institutions that have developed Python
 endorse your product in any way.
 \item If something goes wrong, you can't sue for damages.  Practically
 all software licences contain this condition.
 \end{itemize}
 Notice that you don't have to provide source code for anything that
 contains Python or is built with it.  Also, the Python interpreter and
 accompanying documentation can be modified and redistributed in any
 way you like, and you don't have to pay anyone any licensing fees at
 all.
 \emph{Why should we use an obscure language like Python instead of
 well-known language X?}
 I hope this HOWTO, and the documents listed in the final section, will
 help convince you that Python isn't obscure, and has a healthily
 growing user base.  One word of advice: always present Python's
 positive advantages, instead of concentrating on language X's
 failings.  People want to know why a solution is good, rather than why
 all the other solutions are bad.  So instead of attacking a competing
 solution on various grounds, simply show how Python's virtues can
 help.
 \section{Useful Resources}
 \begin{definitions}
 \term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
 The first chapter of \emph{Internet Programming with Python} also
 examines some of the reasons for using Python.  The book is well worth
 buying, but the publishers have made the first chapter available on
 the Web.
 \term{\url{http://home.pacbell.net/ouster/scripting.html}}
 John Ousterhout's white paper on scripting is a good argument for the
 utility of scripting languages, though naturally enough, he emphasizes
 Tcl, the language he developed.  Most of the arguments would apply to
 any scripting language.
 \term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}
 The authors, David M. Beazley and Peter S. Lomdahl, 
 describe their use of Python at Los Alamos National Laboratory.
 It's another good example of how Python can help get real work done.
 This quotation from the paper has been echoed by many people:
 \begin{quotation}
       Originally developed as a large monolithic application for
       massively parallel processing systems, we have used Python to
       transform our application into a flexible, highly modular, and
       extremely powerful system for performing simulation, data
       analysis, and visualization. In addition, we describe how Python
       has solved a number of important problems related to the
       development, debugging, deployment, and maintenance of scientific
       software.
 \end{quotation}
 %\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}
 %This interview with Andy Feit, discussing Infoseek's use of Python, can be
 %used to show that choosing Python didn't introduce any difficulties
 %into a company's development process, and provided some substantial benefits.
 \term{\url{http://www.python.org/psa/Commercial.html}} 
 Robin Friedrich wrote this document on how to support Python's use in
 commercial projects.
 \term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}
 For the 6th Python conference, Greg Stein presented a paper that
 traced Python's adoption and usage at a startup called eShop, and
 later at Microsoft.
 \term{\url{http://www.opensource.org}} 
 Management may be doubtful of the reliability and usefulness of
 software that wasn't written commercially.  This site presents
 arguments that show how open source software can have considerable
 advantages over closed-source software.
 \term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}
 The Linux Advocacy mini-HOWTO was the inspiration for this document,
 and is also well worth reading for general suggestions on winning
 acceptance for a new technology, such as Linux or Python.  In general,
 you won't make much progress by simply attacking existing systems and
 complaining about their inadequacies; this often ends up looking like
 unfocused whining.  It's much better to point out some of the many
 areas where Python is an improvement over other systems.  
 \end{definitions}
 \end{document}
--- a/Doc/howto/curses.tex
+++ b/Doc/howto/curses.tex
@@ -0,0 +1,485 @@
 \documentclass{howto}
 \title{Curses Programming with Python}
 \release{2.01}
 \author{A.M. Kuchling, Eric S. Raymond}
 \authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}
 \begin{document}
 \maketitle
 \begin{abstract}
 \noindent
 This document describes how to write text-mode programs with Python 2.x,
 using the \module{curses} extension module to control the display.   
 This document is available from the Python HOWTO page at
 \url{http://www.python.org/doc/howto}.
 \end{abstract}
 \tableofcontents
 \section{What is curses?}
 The curses library supplies a terminal-independent screen-painting and
 keyboard-handling facility for text-based terminals; such terminals
 include VT100s, the Linux console, and the simulated terminal provided
 by X11 programs such as xterm and rxvt.  Display terminals support
 various control codes to perform common operations such as moving the
 cursor, scrolling the screen, and erasing areas.  Different terminals
 use widely differing codes, and often have their own minor quirks.
 In a world of X displays, one might ask ``why bother''?  It's true
 that character-cell display terminals are an obsolete technology, but
 there are niches in which being able to do fancy things with them is
 still valuable.  One is on small-footprint or embedded Unixes that 
 don't carry an X server.  Another is for tools like OS installers
 and kernel configurators that may have to run before X is available.
 The curses library hides all the details of different terminals, and
 provides the programmer with an abstraction of a display, containing
 multiple non-overlapping windows.  The contents of a window can be
 changed in various ways--adding text, erasing it, changing its
 appearance--and the curses library will automagically figure out what
 control codes need to be sent to the terminal to produce the right
 output.
 The curses library was originally written for BSD Unix; the later System V
 versions of Unix from AT\&T added many enhancements and new functions.
 BSD curses is no longer maintained, having been replaced by ncurses,
 which is an open-source implementation of the AT\&T interface.  If you're
 using an open-source Unix such as Linux or FreeBSD, your system almost
 certainly uses ncurses.  Since most current commercial Unix versions
 are based on System V code, all the functions described here will
 probably be available.  The older versions of curses carried by some
 proprietary Unixes may not support everything, though.
 No one has made a Windows port of the curses module.  On a Windows
 platform, try the Console module written by Fredrik Lundh.  The
 Console module provides cursor-addressable text output, plus full
 support for mouse and keyboard input, and is available from
 \url{http://effbot.org/efflib/console}.
 \subsection{The Python curses module}
 The Python module is a fairly simple wrapper over the C functions
 provided by curses; if you're already familiar with curses programming
 in C, it's really easy to transfer that knowledge to Python.  The
 biggest difference is that the Python interface makes things simpler,
 by merging different C functions such as \function{addstr},
 \function{mvaddstr}, \function{mvwaddstr}, into a single
 \method{addstr()} method.  You'll see this covered in more detail
 later.
 This HOWTO is simply an introduction to writing text-mode programs
 with curses and Python. It doesn't attempt to be a complete guide to
 the curses API; for that, see the Python library guide's section on
 ncurses, and the C manual pages for ncurses.  It will, however, give
 you the basic ideas.
 \section{Starting and ending a curses application}
 Before doing anything, curses must be initialized.  This is done by
 calling the \function{initscr()} function, which will determine the
 terminal type, send any required setup codes to the terminal, and
 create various internal data structures.  If successful,
 \function{initscr()} returns a window object representing the entire
 screen; this is usually called \code{stdscr}, after the name of the
 corresponding C
 variable.
 \begin{verbatim}
 import curses
 stdscr = curses.initscr()
 \end{verbatim}
 Usually curses applications turn off automatic echoing of keys to the
 screen, in order to be able to read keys and only display them under
 certain circumstances.  This requires calling the \function{noecho()}
 function.
 \begin{verbatim}
 curses.noecho()
 \end{verbatim}
 Applications will also commonly need to react to keys instantly,
 without requiring the Enter key to be pressed; this is called cbreak
 mode, as opposed to the usual buffered input mode.
 \begin{verbatim}
 curses.cbreak()
 \end{verbatim}
 Terminals usually return special keys, such as the cursor keys or
 navigation keys such as Page Up and Home, as a multibyte escape
 sequence.  While you could write your application to expect such
 sequences and process them accordingly, curses can do it for you,
 returning a special value such as \constant{curses.KEY_LEFT}.  To get
 curses to do the job, you'll have to enable keypad mode.
 \begin{verbatim}
 stdscr.keypad(1)
 \end{verbatim}
 Terminating a curses application is much easier than starting one.
 You'll need to call 
 \begin{verbatim}
 curses.nocbreak(); stdscr.keypad(0); curses.echo()
 \end{verbatim}
 to reverse the curses-friendly terminal settings. Then call the
 \function{endwin()} function to restore the terminal to its original
 operating mode.
 \begin{verbatim}
 curses.endwin()
 \end{verbatim}
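 Putting the setup and teardown calls together, a minimal sketch might
 look like the following.  The \function{run()} and
 \function{wait_for_key()} helpers are hypothetical names used only for
 this example, the code assumes Python 3 (where \method{keypad()} accepts
 \code{True} and \code{False}), and the final guard simply avoids
 starting curses when no real terminal is attached.

```python
import curses
import sys


def run(func):
    """Set up curses, call func(stdscr), and always restore the terminal."""
    stdscr = curses.initscr()
    try:
        curses.noecho()
        curses.cbreak()
        stdscr.keypad(True)
        return func(stdscr)
    finally:
        # Reverse the curses-friendly settings, even if func() raised.
        stdscr.keypad(False)
        curses.nocbreak()
        curses.echo()
        curses.endwin()


def wait_for_key(stdscr):
    stdscr.addstr(0, 0, "Press any key to exit")
    stdscr.refresh()
    return stdscr.getch()


# Only start curses when attached to a real terminal.
if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    run(wait_for_key)
```

 The \code{finally} clause is what keeps the terminal usable when
 \code{func} raises an exception.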
 A common problem when debugging a curses application is to get your
 terminal messed up when the application dies without restoring the
 terminal to its previous state.  In Python this commonly happens when
 your code is buggy and raises an uncaught exception.  Keys are no
 longer echoed to the screen when you type them, for example, which
 makes using the shell difficult.
 In Python you can avoid these complications and make debugging much
 easier by importing the module \module{curses.wrapper}.  It supplies a
 function \function{wrapper} that takes a hook argument.  It does the
 initializations described above, and also initializes colors if color
 support is present.  It then runs your hook, and then finally
 deinitializes appropriately.  The hook is called inside a try...except
 clause which catches exceptions, performs curses deinitialization, and
 then passes the exception upwards.  Thus, your terminal won't be left
 in a funny state on exception.
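 As a rough illustration, here is how a hook might be passed to the
 wrapper.  This sketch assumes Python 3, where the function is available
 directly as \function{curses.wrapper()} and returns whatever the hook
 returns; the \function{main()} hook and its message are invented for the
 example.

```python
import curses
import sys


def main(stdscr):
    # wrapper() has already initialized curses and passes in the
    # window object for the whole screen.
    stdscr.addstr(0, 0, "Press q to quit")
    stdscr.refresh()
    while stdscr.getch() != ord("q"):
        pass
    return "done"


# Only start curses when attached to a real terminal.
if __name__ == "__main__" and sys.stdin.isatty() and sys.stdout.isatty():
    print(curses.wrapper(main))
```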
 \section{Windows and Pads}
 Windows are the basic abstraction in curses.  A window object
 represents a rectangular area of the screen, and supports various
 methods to display text, erase it, allow the user to input strings,
 and so forth.
 The \code{stdscr} object returned by the \function{initscr()} function
 is a window object that covers the entire screen.  Many programs may
 need only this single window, but you might wish to divide the screen
 into smaller windows, in order to redraw or clear them separately.
 The \function{newwin()} function creates a new window of a given size,
 returning the new window object.
 \begin{verbatim}
 begin_x = 20 ; begin_y = 7
 height = 5 ; width = 40
 win = curses.newwin(height, width, begin_y, begin_x)
 \end{verbatim}
 A word about the coordinate system used in curses: coordinates are
 always passed in the order \emph{y,x}, and the top-left corner of a
 window is coordinate (0,0).  This breaks a common convention for
 handling coordinates, where the \emph{x} coordinate usually comes
 first.  This is an unfortunate difference from most other computer
 applications, but it's been part of curses since it was first written,
 and it's too late to change things now.
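 Because the \emph{y,x} order is easy to get wrong, it can help to keep
 the arithmetic in one small function.  The sketch below computes the
 origin of a centered window; \function{centered_origin()} is a
 hypothetical helper, and the 24x80 screen size is just an assumed
 example (in a running program the size would come from
 \code{stdscr.getmaxyx()}, which also returns height first).

```python
def centered_origin(screen_height, screen_width, height, width):
    """Return (begin_y, begin_x) that centers a height x width window."""
    begin_y = max(0, (screen_height - height) // 2)
    begin_x = max(0, (screen_width - width) // 2)
    return begin_y, begin_x   # y first, then x, as curses expects


# For an assumed 24x80 terminal and a 5x40 window:
begin_y, begin_x = centered_origin(24, 80, 5, 40)
print(begin_y, begin_x)
# With curses running, the window would then be created with:
#   win = curses.newwin(5, 40, begin_y, begin_x)
```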
 When you call a method to display or erase text, the effect doesn't
 immediately show up on the display.  This is because curses was
 originally written with slow 300-baud terminal connections in mind;
 with these terminals, minimizing the time required to redraw the
 screen is very important.  This lets curses accumulate changes to the
 screen, and display them in the most efficient manner.  For example,
 if your program displays some characters in a window, and then clears
 the window, there's no need to send the original characters because
 they'd never be visible.  
 Accordingly, curses requires that you explicitly tell it to redraw
 windows, using the \function{refresh()} method of window objects.  In
 practice, this doesn't really complicate programming with curses much.
 Most programs go into a flurry of activity, and then pause waiting for
 a keypress or some other action on the part of the user.  All you have
 to do is to be sure that the screen has been redrawn before pausing to
 wait for user input, by simply calling \code{stdscr.refresh()} or the
 \function{refresh()} method of some other relevant window.
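 A typical draw-then-wait sequence might be sketched as follows.  The
 \function{show_message()} helper is a hypothetical name; it works on any
 window object, for instance the \code{stdscr} passed to a wrapper hook.

```python
def show_message(win, text):
    """Queue some text, then make it visible before waiting for a key."""
    win.erase()              # changes are only recorded internally...
    win.addstr(0, 0, text)
    win.refresh()            # ...until refresh() sends them to the terminal
    return win.getch()       # safe to block: the screen is up to date
```

 Leaving out the \method{refresh()} call is a common mistake: the text
 would then not be explicitly drawn before the program starts waiting.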
 A pad is a special case of a window; it can be larger than the actual
 display screen, and only a portion of it is displayed at a time.
 Creating a pad simply requires the pad's height and width, while
 refreshing a pad requires giving the coordinates of the on-screen
 area where a subsection of the pad will be displayed.  
 \begin{verbatim}
 pad = curses.newpad(100, 100)
 #  These loops fill the pad with letters; this is
 # explained in the next section
 for y in range(0, 100):
    for x in range(0, 100):
        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
        except curses.error: pass
 #  Displays a section of the pad in the middle of the screen
 pad.refresh( 0,0, 5,5, 20,75)
 \end{verbatim}
 The \function{refresh()} call displays a section of the pad in the
 rectangle extending from coordinate (5,5) to coordinate (20,75) on the
 screen; the upper left corner of the displayed section is coordinate
 (0,0) on the pad.  Beyond that difference, pads are exactly like
 ordinary windows and support the same methods.
 If you have multiple windows and pads on screen there is a more
 efficient way to go, which will prevent annoying screen flicker at
 refresh time.  Use the \method{noutrefresh()} method
 of each window to update the data structure
 representing the desired state of the screen; then change the physical
 screen to match the desired state in one go with the function
 \function{doupdate()}.  The normal \method{refresh()} method calls
 \function{doupdate()} as its last act.
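 The two-step update can be wrapped in a small helper.  This is only a
 sketch: \function{refresh_all()} is a hypothetical name, and it assumes
 every item is an ordinary window (a pad's \method{noutrefresh()} needs
 the same six coordinate arguments as its \method{refresh()}).

```python
import curses


def refresh_all(windows):
    """Redraw several windows with a single physical screen update."""
    for win in windows:
        win.noutrefresh()    # update the desired screen state only
    curses.doupdate()        # then change the physical screen in one go
```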
 \section{Displaying Text}
 {}From a C programmer's point of view, curses may sometimes look like
 a twisty maze of functions, all subtly different.  For example,
 \function{addstr()} displays a string at the current cursor location
 in the \code{stdscr} window, while \function{mvaddstr()} moves to a
 given y,x coordinate first before displaying the string.
 \function{waddstr()} is just like \function{addstr()}, but allows
 specifying a window to use, instead of using \code{stdscr} by default.
 \function{mvwaddstr()} follows similarly.
 Fortunately the Python interface hides all these details;
 \code{stdscr} is a window object like any other, and methods like
 \function{addstr()} accept multiple argument forms.  Usually there are
 four different forms.
 \begin{tableii}{|c|l|}{textrm}{Form}{Description}
 \lineii{\var{str} or \var{ch}}{Display the string \var{str} or
 character \var{ch}}
 \lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
 character \var{ch}, using attribute \var{attr}}
 \lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
 {Move to position \var{y,x} within the window, and display \var{str}
 or \var{ch}}
 \lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
 {Move to position \var{y,x} within the window, and display \var{str}
 or \var{ch}, using attribute \var{attr}}
 \end{tableii}
 Attributes allow displaying text in highlighted forms, such as in
 boldface, underline, reverse video, or in color.  They'll be explained
 in more detail in the next subsection.
 The \function{addstr()} function takes a Python string as the value to
 be displayed, while the \function{addch()} functions take a character,
 which can be either a Python string of length 1, or an integer.  If
 it's a string, you're limited to displaying characters between 0 and
 255.  SVr4 curses provides constants for extension characters; these
 constants are integers greater than 255.  For example,
 \constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
 the upper left corner of a box (handy for drawing borders).
 Windows remember where the cursor was left after the last operation,
 so if you leave out the \var{y,x} coordinates, the string or character
 will be displayed wherever the last operation left off.  You can also
 move the cursor with the \function{move(\var{y,x})} method.  Because
 some terminals always display a flashing cursor, you may want to
 ensure that the cursor is positioned in some location where it won't
 be distracting; it can be confusing to have the cursor blinking at
 some apparently random location.  
 If your application doesn't need a blinking cursor at all, you can
 call \function{curs_set(0)} to make it invisible.  Equivalently, and
 for compatibility with older curses versions, there's a
 \function{leaveok(\var{bool})} function.  When \var{bool} is true, the
 curses library will attempt to suppress the flashing cursor, and you
 won't need to worry about leaving it in odd locations.
 \subsection{Attributes and Color}
 Characters can be displayed in different ways.  Status lines in a
 text-based application are commonly shown in reverse video; a text
 viewer may need to highlight certain words.  curses supports this by
 allowing you to specify an attribute for each cell on the screen.
 An attribute is an integer, each bit representing a different
 attribute.  You can try to display text with multiple attribute bits
 set, but curses doesn't guarantee that all the possible combinations
 are available, or that they're all visually distinct.  That depends on
 the ability of the terminal being used, so it's safest to stick to the
 most commonly available attributes, listed here.
 \begin{tableii}{|c|l|}{constant}{Attribute}{Description}
 \lineii{A_BLINK}{Blinking text}
 \lineii{A_BOLD}{Extra bright or bold text}
 \lineii{A_DIM}{Half bright text}
 \lineii{A_REVERSE}{Reverse-video text}
 \lineii{A_STANDOUT}{The best highlighting mode available}
 \lineii{A_UNDERLINE}{Underlined text}
 \end{tableii}
 So, to display a reverse-video status line on the top line of the
 screen,
 you could code:
 \begin{verbatim}
 stdscr.addstr(0, 0, "Current mode: Typing mode",
               curses.A_REVERSE)
 stdscr.refresh()
 \end{verbatim}
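 Because every attribute occupies its own bits in the integer, several
 attributes can be combined with bitwise OR and tested with bitwise AND.
 The following is only a sketch; \code{heading_attr} and \code{has_attr}
 are names invented here, and whether the terminal can actually show the
 combination is still up to the terminal.

```python
import curses

# Combine two attributes into one integer value.
heading_attr = curses.A_BOLD | curses.A_UNDERLINE

def has_attr(attr, flag):
    """Return True if every bit of flag is set in attr (hypothetical helper)."""
    return attr & flag == flag
```

 A value such as \code{heading_attr} can be passed wherever the forms
 above accept \var{attr}.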
 The curses library also supports color on those terminals that
 provide it.  The most common such terminal is probably the Linux
 console, followed by color xterms.
 To use color, you must call the \function{start_color()} function
 soon after calling \function{initscr()}, to initialize the default
 color set (the \function{curses.wrapper.wrapper()} function does this
 automatically).  Once that's done, the \function{has_colors()}
 function returns TRUE if the terminal in use can actually display
 color.  (Note from AMK:  curses uses the American spelling
 'color', instead of the Canadian/British spelling 'colour'.  If you're
 like me, you'll have to resign yourself to misspelling it for the sake
 of these functions.)
 The curses library maintains a finite number of color pairs,
 containing a foreground (or text) color and a background color.  You
 can get the attribute value corresponding to a color pair with the
 \function{color_pair()} function; this can be bitwise-OR'ed with other
 attributes such as \constant{A_REVERSE}, but again, such combinations
 are not guaranteed to work on all terminals.
 An example, which displays a line of text using color pair 1:
 \begin{verbatim}
 stdscr.addstr( "Pretty text", curses.color_pair(1) )
 stdscr.refresh()
 \end{verbatim}
 As I said before, a color pair consists of a foreground and
 background color.  \function{start_color()} initializes 8 basic
 colors when it activates color mode.  They are: 0:black, 1:red,
 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white.  The curses
 module defines named constants for each of these colors:
 \constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
 forth.
 The \function{init_pair(\var{n, f, b})} function changes the
 definition of color pair \var{n} to foreground color \var{f} and
 background color \var{b}.  Color pair 0 is hard-wired to white on black,
 and cannot be changed.
 Let's put all this together. To change color pair 1 to red
 text on a white background, you would call:
 \begin{verbatim}
 curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
 \end{verbatim}
 When you change a color pair, any text already displayed using that
 color pair will change to the new colors.  You can also display new
 text in this color with:
 \begin{verbatim}
 stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
 \end{verbatim}
 Very fancy terminals can change the definitions of the actual colors
 to a given RGB value.  This lets you change color 1, which is usually
 red, to purple or blue or any other color you like.  Unfortunately,
 the Linux console doesn't support this, so I'm unable to try it out,
 and can't provide any examples.  You can check if your terminal can do
 this by calling \function{can_change_color()}, which returns TRUE if
 the capability is there.  If you're lucky enough to have such a
 talented terminal, consult your system's man pages for more
 information.
 \section{User Input}
 The curses library itself offers only very simple input mechanisms.
 Python's support adds a text-input widget that makes up for some of the
 lack.
 The most common way to get input to a window is to use its
 \method{getch()} method, which pauses and waits for the user to hit
 a key, displaying it if \function{echo()} has been called earlier.
 You can optionally specify a coordinate to which the cursor should be
 moved before pausing.
 It's possible to change this behavior with the method
 \method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
 the window becomes non-blocking and returns ERR (-1) when no input is
 ready.  There's also a \function{halfdelay()} function, which can be
 used to (in effect) set a timer on each \method{getch()}; if no input
 becomes available within the delay given as the argument to
 \function{halfdelay()} (measured in tenths of a second), curses
 raises an exception.
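 As a rough sketch of how non-blocking input might be used, the helper
 below collects whatever keys are already waiting and returns at once.
 \code{drain_keys} is a made-up name, and the code assumes
 \method{nodelay(1)} has already been called on the window passed in.

```python
def drain_keys(win):
    """Collect every key code already waiting on win, without blocking.

    Hypothetical helper; assumes win.nodelay(1) was called earlier, so
    getch() returns -1 (ERR) as soon as the input queue is empty.
    """
    keys = []
    while True:
        c = win.getch()
        if c == -1:          # nothing more to read right now
            break
        keys.append(c)
    return keys
```

 A program that animates the screen could call this once per frame and
 carry on drawing when the returned list is empty.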
 The \method{getch()} method returns an integer; if it's between 0 and
 255, it represents the ASCII code of the key pressed.  Values greater
 than 255 are special keys such as Page Up, Home, or the cursor keys.
 You can compare the value returned to constants such as
 \constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
 \constant{curses.KEY_LEFT}.  Usually the main loop of your program
 will look something like this:
 \begin{verbatim}
 while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
 \end{verbatim}
 The \module{curses.ascii} module supplies ASCII class membership
 functions that take either integer or 1-character-string
 arguments; these may be useful in writing more readable tests for
 your command interpreters.  It also supplies conversion functions 
 that take either integer or 1-character-string arguments and return
 the same type.  For example, \function{curses.ascii.ctrl()} returns
 the control character corresponding to its argument.
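 A short sketch of both kinds of function follows.  \code{describe_key}
 is a name invented for the example; the \module{curses.ascii} calls are
 the real ones, and each accepts either an integer or a 1-character
 string.

```python
import curses.ascii

def describe_key(c):
    """Classify a key given as an integer code or a 1-character string."""
    if curses.ascii.iscntrl(c):
        return "control"
    if curses.ascii.isdigit(c):
        return "digit"
    if curses.ascii.isalpha(c):
        return "letter"
    return "other"

# Conversion functions return the same type they were given.
ctrl_a = curses.ascii.ctrl('a')             # string in, string out
ctrl_a_code = curses.ascii.ctrl(ord('a'))   # integer in, integer out
```

 Since \method{getch()} returns integers, \code{describe_key(c)} can be
 applied directly to its result for values below 256.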
 There's also a method to retrieve an entire string,
 \method{getstr()}.  It isn't used very often, because its
 functionality is quite limited; the only editing keys available are
 the backspace key and the Enter key, which terminates the string.  It
 can optionally be limited to a fixed number of characters.
 \begin{verbatim}
 curses.echo()            # Enable echoing of characters
 # Get a 15-character string, with the cursor on the top line 
 s = stdscr.getstr(0,0, 15)  
 \end{verbatim}
 The Python \module{curses.textpad} module supplies something better.
 With it, you can turn a window into a text box that supports an
 Emacs-like set of keybindings.  Various methods of the \class{Textbox}
 class support editing with input validation and gathering the edit
 results either with or without trailing spaces.   See the library
 documentation on \module{curses.textpad} for the details.
 \section{For More Information}
 This HOWTO didn't cover some advanced topics, such as screen-scraping
 or capturing mouse events from an xterm instance.  But the Python
 library page for the curses modules is now pretty complete.  You
 should browse it next.
 If you're in doubt about the detailed behavior of any of the ncurses
 entry points, consult the manual pages for your curses implementation,
 whether it's ncurses or a proprietary Unix vendor's.  The manual pages
 will document any quirks, and provide complete lists of all the
 functions, attributes, and \constant{ACS_*} characters available to
 you.
 Because the curses API is so large, some functions aren't supported in
 the Python interface, not because they're difficult to implement, but
 because no one has needed them yet.  Feel free to add them and then
 submit a patch.  Also, we don't yet have support for the menus or
 panels libraries associated with ncurses; feel free to add that.
 If you write an interesting little program, feel free to contribute it
 as another demo.  We can always use more of them!
 The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
 \end{document}
--- a/Doc/howto/doanddont.tex
+++ b/Doc/howto/doanddont.tex
@ -0,0 +1,343 @@
 \documentclass{howto}
 \title{Idioms and Anti-Idioms in Python}
 \release{0.00}
 \author{Moshe Zadka}
 \authoraddress{howto@zadka.site.co.il}
 \begin{document}
 \maketitle
 This document is placed in the public domain.
 \begin{abstract}
 \noindent
 This document can be considered a companion to the tutorial. It
 shows how to use Python, and even more importantly, how {\em not}
 to use Python. 
 \end{abstract}
 \tableofcontents
 \section{Language Constructs You Should Not Use}
 While Python has relatively few gotchas compared to other languages, it
 still has some constructs which are only useful in corner cases, or are
 plain dangerous. 
 \subsection{from module import *}
 \subsubsection{Inside Function Definitions}
 \code{from module import *} is {\em invalid} inside function definitions.
 While many versions of Python do not check for the invalidity, that does not
 make it more valid, no more than having a smart lawyer makes a man innocent.
 Do not use it like that ever. Even in versions where it was accepted, it made
 the function execution slower, because the compiler could not be certain
 which names are local and which are global. In Python 2.1 this construct
 causes warnings, and sometimes even errors.
 \subsubsection{At Module Level}
 While it is valid to use \code{from module import *} at module level it
 is usually a bad idea. For one, this loses an important property Python
 otherwise has --- you can know where each toplevel name is defined by
 a simple "search" function in your favourite editor. You also open yourself
 to trouble in the future, if some module grows additional functions or
 classes. 
 One of the most awful questions asked on the newsgroup is why this code:
 \begin{verbatim}
 f = open("www")
 f.read()
 \end{verbatim}
 does not work. Of course, it works just fine (assuming you have a file
 called "www".) But it does not work if somewhere in the module, the
 statement \code{from os import *} is present. The \module{os} module
 has a function called \function{open()} which returns an integer. While
 it is very useful, shadowing builtins is one of its least useful properties.
 Remember, you can never know for sure what names a module exports, so either
 take what you need --- \code{from module import name1, name2}, or keep them in
 the module and access on a per-need basis --- 
 \code{import module;print module.name}.
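 The shadowing can be observed without damaging your own namespace by
 running the star-import in a scratch dictionary.  This is only a
 sketch; \code{scratch} and \code{shadowed_open} are names chosen for
 the illustration, and \code{os.devnull} merely stands in for a file
 that is sure to exist.

```python
import os

# Run the star-import inside a scratch namespace so the demonstration
# does not clobber the real built-in open() in the current module.
scratch = {}
exec("from os import *", scratch)

shadowed_open = scratch["open"]     # this is os.open, not the built-in

# os.open wants flags and hands back an integer file descriptor.
fd = shadowed_open(os.devnull, os.O_RDONLY)
os.close(fd)
```

 An integer has no \method{read()} method, which is why the two-line
 snippet above stops working after the star-import.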
 \subsubsection{When It Is Just Fine}
 There are situations in which \code{from module import *} is just fine:
 \begin{itemize}
 \item The interactive prompt. For example, \code{from math import *} makes
      Python an amazing scientific calculator.
 \item When extending a module in C with a module in Python.
 \item When the module advertises itself as \code{from import *} safe.
 \end{itemize}
 \subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
 The word ``unadorned'' refers to the use without an explicit dictionary,
 in which case those constructs evaluate code in the {\em current} environment.
 This is dangerous for the same reasons \code{from import *} is dangerous ---
 it might step over variables you are counting on and mess up things for
 the rest of your code. Simply do not do that.
 Bad examples:
 \begin{verbatim}
 >>> for name in sys.argv[1:]:
 >>>     exec "%s=1" % name
 >>> def func(s, **kw):
 >>>     for var, val in kw.items():
 >>>         exec "s.%s=val" % var  # invalid!
 >>> execfile("handler.py")
 >>> handle()
 \end{verbatim}
 Good examples:
 \begin{verbatim}
 >>> d = {}
 >>> for name in sys.argv[1:]:
 >>>     d[name] = 1
 >>> def func(s, **kw):
 >>>     for var, val in kw.items():
 >>>         setattr(s, var, val)
 >>> d={}
 >>> execfile("handler.py", d, d)
 >>> handle = d['handle']
 >>> handle()
 \end{verbatim}
 \subsection{from module import name1, name2}
 This is a ``don't'' which is much weaker than the previous ``don't''s
 but is still something you should not do if you don't have good reasons
 to do that. The reason it is usually a bad idea is that you suddenly
 have an object which lives in two separate namespaces. When the binding
 in one namespace changes, the binding in the other will not, so there
 will be a discrepancy between them. This happens when, for example,
 one module is reloaded, or changes the definition of a function at runtime.
 Bad example:
 \begin{verbatim}
 # foo.py
 a = 1
 # bar.py
 from foo import a
 if something():
    a = 2 # danger: foo.a != a 
 \end{verbatim}
 Good example:
 \begin{verbatim}
 # foo.py
 a = 1
 # bar.py
 import foo
 if something():
    foo.a = 2
 \end{verbatim}
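 The discrepancy is easy to reproduce in a single file.  In this sketch
 a throwaway module object stands in for \code{foo.py}, and the plain
 assignment \code{a = foo.a} plays the part of \code{from foo import a};
 the names \code{stale} and \code{fresh} are invented for the example.

```python
import types

# Build a throwaway module object standing in for foo.py.
foo = types.ModuleType("foo")
foo.a = 1

a = foo.a        # what "from foo import a" amounts to: copy the binding
foo.a = 2        # the module rebinds its own name later on

stale = a        # the copied name still refers to the old value
fresh = foo.a    # attribute access sees the current value
```

 After the rebinding, \code{stale} is still 1 while \code{fresh} is 2.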
 \subsection{except:}
 Python has the \code{except:} clause, which catches all exceptions.
 Since {\em every} error in Python raises an exception, this makes many
 programming errors look like runtime problems, and hinders
 the debugging process.
 The following code shows a great example:
 \begin{verbatim}
 try:
    foo = opne("file") # misspelled "open"
 except:
    sys.exit("could not open file!")
 \end{verbatim}
 The second line triggers a \exception{NameError} which is caught by the
 except clause. The program will exit, and you will have no idea that
 this has nothing to do with the readability of \code{"file"}.
 The example above is better written
 \begin{verbatim}
 try:
    foo = opne("file") # will be changed to "open" as soon as we run it
 except IOError:
    sys.exit("could not open file")
 \end{verbatim}
 There are some situations in which the \code{except:} clause is useful:
 for example, in a framework when running callbacks, it is good not to
 let any callback disturb the framework.
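 The framework case might look roughly like the sketch below.
 \code{run_callbacks} is a hypothetical helper, and it catches
 \exception{Exception} instead of using a bare \code{except:}; that is a
 choice made for this example so that, in current Python, a keyboard
 interrupt still gets through.

```python
import traceback

def run_callbacks(callbacks):
    """Call each callback; record failures instead of letting them propagate.

    Hypothetical helper: returns a list of (callback, traceback text)
    pairs for the callbacks that raised.
    """
    failures = []
    for callback in callbacks:
        try:
            callback()
        except Exception:
            # Keep the evidence so the error is not silently lost.
            failures.append((callback, traceback.format_exc()))
    return failures
```

 Note that the handler keeps the traceback; swallowing the exception
 without a trace would bring back the debugging problem described above.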
 \section{Exceptions}
 Exceptions are a useful feature of Python. You should learn to raise
 them whenever something unexpected occurs, and catch them only where
 you can do something about them.
 The following is a very popular anti-idiom
 \begin{verbatim}
 def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
 \end{verbatim}
 Consider the case where the file gets deleted between the time the call to
 \function{os.path.exists} is made and the time \function{open} is called.
 In that case the last line will raise an \exception{IOError}. The same would
 happen if \var{file} exists but has no read permission. Because testing
 on a normal machine with existing and non-existing files turns up neither
 problem, the results will seem fine and the code will get
 shipped. Then an unhandled \exception{IOError} escapes to the user, who
 has to watch the ugly traceback.
 Here is a better way to do it.
 \begin{verbatim}
 def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
 \end{verbatim}
 In this version, {\em either} the file gets opened and the line is read
 (so it works even on flaky NFS or SMB connections), or the message
 is printed and the application aborted.
 Still, \function{get_status} makes too many assumptions --- that it
 will only be used in a short running script, and not, say, in a long
 running server. Sure, the caller could do something like
 \begin{verbatim}
 try:
    status = get_status(log)
 except SystemExit:
    status = None
 \end{verbatim}
 So, try to make as few \code{except} clauses in your code as possible ---
 those will usually be a catch-all in the \function{main}, or inside calls
 which should always succeed.
 So, the best version is probably
 \begin{verbatim}
 def get_status(file):
    return open(file).readline()
 \end{verbatim}
 The caller can deal with the exception if it wants (for example, if it 
 tries several files in a loop), or just let the exception filter upwards
 to {\em its} caller.
 The last version is not very good either --- due to implementation details,
 the file would not be closed when an exception is raised until the handler
 finishes, and perhaps not at all in non-C implementations (e.g., Jython).
 Closing the file explicitly in a \code{finally} clause avoids the problem:
 \begin{verbatim}
 def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
 \end{verbatim}
 \section{Using the Batteries}
 Every so often, people seem to be writing stuff in the Python library
 again, usually poorly. While the occasional module has a poor interface,
 it is usually much better to use the rich standard library and data
 types that come with Python than to invent your own.
 A useful module very few people know about is \module{os.path}. It
 always has the correct path arithmetic for your operating system, and
 will usually be much better than whatever you come up with yourself.
 Compare:
 \begin{verbatim}
 # ugh!
 return dir+"/"+file
 # better
 return os.path.join(dir, file)
 \end{verbatim}
 More useful functions in \module{os.path}: \function{basename}, 
 \function{dirname} and \function{splitext}.
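 A quick sketch of those three functions; the path components are made
 up for the example, and \function{os.path.join} is used to build the
 path so that the separator is right on any platform.

```python
import os.path

# Hypothetical path, assembled portably.
path = os.path.join("reports", "2005", "summary.txt")

directory = os.path.dirname(path)              # everything before the last component
filename = os.path.basename(path)              # the last component
stem, extension = os.path.splitext(filename)   # split off the extension
```

 Here \code{filename} is \code{"summary.txt"}, and \function{splitext}
 divides it into \code{"summary"} and \code{".txt"}.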
 There are also many useful builtin functions people seem not to be
 aware of for some reason: \function{min()} and \function{max()} can
 find the minimum/maximum of any sequence with comparable semantics,
 for example, yet many people write their own max/min. Another highly
 useful function is \function{reduce()}. Classical use of \function{reduce()}
 is something like
 \begin{verbatim}
 import sys, operator
 nums = map(float, sys.argv[1:])
 print reduce(operator.add, nums)/len(nums)
 \end{verbatim}
 This cute little script prints the average of all numbers given on the
 command line. The \function{reduce()} adds up all the numbers, and
 the rest is just some pre- and postprocessing.
 Along the same lines, note that \function{float()}, \function{int()} and
 \function{long()} all accept arguments of type string, and so are
 suited to parsing --- assuming you are ready to deal with the
 \exception{ValueError} they raise.
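 One way to be ``ready to deal with'' the exception is sketched below.
 \code{parse_number} and its \var{default} argument are invented for
 this example; only \function{int()} and \function{float()} are used, so
 the code runs unchanged on current Python.

```python
def parse_number(text, default=None):
    """Convert text to an int if possible, else a float, else default.

    Hypothetical helper built on the built-in conversions.
    """
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue        # try the next, more permissive conversion
    return default
```

 Only \exception{ValueError} is caught, so a genuine programming error
 such as passing \code{None} still surfaces as a \exception{TypeError}.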
 \section{Using Backslash to Continue Statements}
 Since Python treats a newline as a statement terminator,
 and since statements are often longer than is comfortable to put
 on one line, many people do:
 \begin{verbatim}
 if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
   calculate_number(10, 20) != forbulate(500, 360):
      pass
 \end{verbatim}
 You should realize that this is dangerous: a stray space after the
 \code{\e} would make this line wrong, and stray spaces are notoriously
 hard to see in editors. In this case, at least it would be a syntax
 error, but if the code was:
 \begin{verbatim}
 value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
 \end{verbatim}
 then it would just be subtly wrong.
 It is usually much better to use the implicit continuation inside parentheses.
 This version is bulletproof:
 \begin{verbatim}
 value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9] 
        + calculate_number(10, 20)*forbulate(500, 360))
 \end{verbatim}
 \end{document}
--- a/Doc/howto/regex.tex
+++ b/Doc/howto/regex.tex
--- a/Doc/howto/rexec.tex
+++ b/Doc/howto/rexec.tex
@ -0,0 +1,61 @@
 \documentclass{howto}
 \title{Restricted Execution HOWTO}
 \release{2.1}
 \author{A.M. Kuchling}
 \authoraddress{\email{amk@amk.ca}}
 \begin{document}
 \maketitle
 \begin{abstract}
 \noindent
 Python 2.2.2 and earlier provided a \module{rexec} module for running
 untrusted code.  However, it's never been exhaustively audited for
 security and it hasn't been updated to take into account recent
 changes to Python such as new-style classes. Therefore, the
 \module{rexec} module should not be trusted.  To discourage use of 
 \module{rexec}, this HOWTO has been withdrawn.
 The \module{rexec} and \module{Bastion} modules have been disabled in
 the Python CVS tree, both on the trunk (which will eventually become
 Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
 (which will become Python 2.2.3, if someone ever volunteers to issue
 2.2.3).
 For discussion of the problems with \module{rexec}, see the python-dev
 threads starting at the following URLs:
 \url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
 and
 \url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
 \end{abstract}
 \section{Version History}
 Sep. 12, 1998: Minor revisions and added the reference to the Janus
 project.
 Feb. 26, 1998: First version.  Suggestions are welcome.
 Mar. 16, 1998: Made some revisions suggested by Jeff Rush.  Some minor
 changes and clarifications, and a sizable section on exceptions added.
 Oct. 4, 2000: Checked with Python 2.0.  Minor rewrites and fixes made.
 Version number increased to 2.0.
 Dec. 17, 2002: Withdrawn.
 Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
 and added links to relevant python-dev threads.
 \end{document}
--- a/Doc/howto/sockets.tex
+++ b/Doc/howto/sockets.tex
@ -0,0 +1,460 @@
 \documentclass{howto}
 \title{Socket Programming HOWTO}
 \release{0.00}
 \author{Gordon McMillan}
 \authoraddress{\email{gmcm@hypernet.com}}
 \begin{document}
 \maketitle
 \begin{abstract}
 \noindent
 Sockets are used nearly everywhere, but are one of the most severely
 misunderstood technologies around. This is a 10,000 foot overview of
 sockets. It's not really a tutorial - you'll still have work to do in
 getting things operational. It doesn't cover the fine points (and there
 are a lot of them), but I hope it will give you enough background to
 begin using them decently.
 This document is available from the Python HOWTO page at
 \url{http://www.python.org/doc/howto}.
 \end{abstract}
 \tableofcontents
 \section{Sockets}
 Sockets are used nearly everywhere, but are one of the most severely
 misunderstood technologies around. This is a 10,000 foot overview of
 sockets. It's not really a tutorial - you'll still have work to do in
 getting things working. It doesn't cover the fine points (and there
 are a lot of them), but I hope it will give you enough background to
 begin using them decently.
 I'm only going to talk about INET sockets, but they account for at
 least 99\% of the sockets in use. And I'll only talk about STREAM
 sockets - unless you really know what you're doing (in which case this
 HOWTO isn't for you!), you'll get better behavior and performance from
 a STREAM socket than anything else. I will try to clear up the mystery
 of what a socket is, as well as give some hints on how to work with
 blocking and non-blocking sockets. But I'll start by talking about
 blocking sockets. You'll need to know how they work before dealing
 with non-blocking sockets.
 Part of the trouble with understanding these things is that "socket"
 can mean a number of subtly different things, depending on context. So
 first, let's make a distinction between a "client" socket - an
 endpoint of a conversation, and a "server" socket, which is more like
 a switchboard operator. The client application (your browser, for
 example) uses "client" sockets exclusively; the web server it's
 talking to uses both "server" sockets and "client" sockets.
 \subsection{History}
 Of the various forms of IPC (\emph{Inter Process Communication}),
 sockets are by far the most popular.  On any given platform, there are
 likely to be other forms of IPC that are faster, but for
 cross-platform communication, sockets are about the only game in town.
 They were invented in Berkeley as part of the BSD flavor of Unix. They
 spread like wildfire with the Internet. With good reason --- the
 combination of sockets with INET makes talking to arbitrary machines
 around the world unbelievably easy (at least compared to other
 schemes).  
 \section{Creating a Socket}
 Roughly speaking, when you clicked on the link that brought you to
 this page, your browser did something like the following:
 \begin{verbatim}
    #create an INET, STREAMing socket
    s = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #now connect to the web server on port 80 
    # - the normal http port
    s.connect(("www.mcmillan-inc.com", 80))
 \end{verbatim}
 When the \code{connect} completes, the socket \code{s} can
 now be used to send in a request for the text of this page. The same
 socket will read the reply, and then be destroyed. That's right -
 destroyed. Client sockets are normally only used for one exchange (or
 a small set of sequential exchanges).
 What happens in the web server is a bit more complex. First, the web
 server creates a "server socket".
 \begin{verbatim}
    #create an INET, STREAMing socket
    serversocket = socket.socket(
        socket.AF_INET, socket.SOCK_STREAM)
    #bind the socket to a public host, 
    # and a well-known port
    serversocket.bind((socket.gethostname(), 80))
    #become a server socket
    serversocket.listen(5)
 \end{verbatim}
 A couple of things to notice: we used \code{socket.gethostname()}
 so that the socket would be visible to the outside world. If we had
 used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1',
 80))} we would still have a "server" socket, but one that was only
 visible within the same machine.  \code{s.bind(('', 80))} specifies
 that the socket is reachable by any address the machine happens to
 have.
 A second thing to note: low number ports are usually reserved for
 "well known" services (HTTP, SNMP etc). If you're playing around, use
 a nice high number (4 digits).
 Finally, the argument to \code{listen} tells the socket library that
 we want it to queue up as many as 5 connect requests (the normal max)
 before refusing outside connections. If the rest of the code is
 written properly, that should be plenty.
 OK, now we have a "server" socket, listening on port 80. Now we enter
 the mainloop of the web server:
 \begin{verbatim}
    while 1:
        #accept connections from outside
        (clientsocket, address) = serversocket.accept()
        #now do something with the clientsocket
        #in this case, we'll pretend this is a threaded server
        ct = client_thread(clientsocket)
        ct.run()
 \end{verbatim}
 There are actually three general ways in which this loop could work:
 dispatching a thread to handle \code{clientsocket}, creating a new
 process to handle \code{clientsocket}, or restructuring this app
 to use non-blocking sockets and multiplexing between our "server" socket
 and any active \code{clientsocket}s using
 \code{select}. More about that later. The important thing to
 understand now is this: this is \emph{all} a "server" socket
 does. It doesn't send any data. It doesn't receive any data. It just
 produces "client" sockets. Each \code{clientsocket} is created
 in response to some \emph{other} "client" socket doing a
 \code{connect()} to the host and port we're bound to. As soon as
 we've created that \code{clientsocket}, we go back to listening
 for more connections. The two "clients" are free to chat it up - they
 are using some dynamically allocated port which will be recycled when
 the conversation ends.
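 The first of those options, a thread per connection, might be sketched
 like this.  \code{handle_client} and \code{serve} are names made up for
 the illustration, the handler just echoes one chunk, and the loop is
 capped at \var{max_clients} connections so that the sketch terminates;
 a real server would loop forever.  Data is sent and received as bytes,
 as in current Python.

```python
import socket
import threading

def handle_client(clientsocket):
    """Hypothetical per-connection handler: echo one chunk, then close."""
    try:
        data = clientsocket.recv(1024)
        if data:
            clientsocket.sendall(data)
    finally:
        clientsocket.close()

def serve(serversocket, max_clients):
    """Accept max_clients connections, handing each to its own thread."""
    workers = []
    for _ in range(max_clients):
        # The "server" socket only produces "client" sockets.
        clientsocket, address = serversocket.accept()
        worker = threading.Thread(target=handle_client,
                                  args=(clientsocket,))
        worker.start()      # go straight back to accepting
        workers.append(worker)
    for worker in workers:
        worker.join()
```

 The accepting thread never touches the data itself; all reading and
 writing happens on the \code{clientsocket} inside the worker.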
 \subsection{IPC}
 If you need fast IPC between two processes
 on one machine, you should look into whatever form of shared memory
 the platform offers. A simple protocol based around shared memory and
 locks or semaphores is by far the fastest technique.
 If you do decide to use sockets, bind the "server" socket to
 \code{'localhost'}. On most platforms, this will take a shortcut
 around a couple of layers of network code and be quite a bit faster.
 \section{Using a Socket}
 The first thing to note is that the web browser's "client" socket and
 the web server's "client" socket are identical beasts. That is, this
 is a "peer to peer" conversation. Or to put it another way, \emph{as the
 designer, you will have to decide what the rules of etiquette are for
 a conversation}. Normally, the \code{connect}ing socket
 starts the conversation, by sending in a request, or perhaps a
 signon. But that's a design decision - it's not a rule of sockets.
 Now there are two sets of verbs to use for communication. You can use
 \code{send} and \code{recv}, or you can transform your
 client socket into a file-like beast and use \code{read} and
 \code{write}. The latter is the way Java presents their
 sockets. I'm not going to talk about it here, except to warn you that
 you need to use \code{flush} on sockets. These are buffered
 "files", and a common mistake is to \code{write} something, and
 then \code{read} for a reply. Without a \code{flush} in
 there, you may wait forever for the reply, because the request may
 still be in your output buffer.
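 To make the \code{flush} warning concrete, here is a sketch of one
 request/reply exchange through file-like wrappers.  \code{ask} is a
 hypothetical helper, and the one-line-per-message protocol is an
 assumption of the example, not something sockets impose.

```python
import socket

def ask(sock, request_line):
    """Send one text line over sock and return the one-line reply.

    Hypothetical helper; assumes the peer answers with a single line.
    """
    wfile = sock.makefile("w")
    rfile = sock.makefile("r")
    try:
        wfile.write(request_line + "\n")
        wfile.flush()            # push the request out of the local buffer
        return rfile.readline().rstrip("\n")
    finally:
        # Closing the wrappers does not close the underlying socket.
        wfile.close()
        rfile.close()
```

 Without the \code{flush()} call the request may sit in the wrapper's
 buffer while \code{readline()} waits for an answer that never comes.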
 Now we come to the major stumbling block of sockets - \code{send}
 and \code{recv} operate on the network buffers. They do not
 necessarily handle all the bytes you hand them (or expect from them),
 because their major focus is handling the network buffers. In general,
 they return when the associated network buffers have been filled
 (\code{send}) or emptied (\code{recv}). They then tell you
 how many bytes they handled. It is \emph{your} responsibility to call
 them again until your message has been completely dealt with.
 When a \code{recv} returns 0 bytes, it means the other side has
 closed (or is in the process of closing) the connection.  You will not
 receive any more data on this connection. Ever.  You may be able to
 send data successfully; I'll talk about that some on the next page.
 A protocol like HTTP uses a socket for only one transfer. The client
 sends a request, then reads a reply.  That's it. The socket is
 discarded. This means that a client can detect the end of the reply by
 receiving 0 bytes.
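 For such a one-shot exchange, reading the reply comes down to calling
 \code{recv} until it returns nothing.  A minimal sketch, with
 \code{read_until_closed} and its \var{bufsize} default being names and
 values picked for the example:

```python
def read_until_closed(sock, bufsize=4096):
    """Return every byte the peer sends before it closes the connection.

    Hypothetical helper for one-transfer protocols.
    """
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:            # 0 bytes: the other side has closed
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

 This only terminates because the sender closes its end; on a connection
 that stays open the loop would block in \code{recv}.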
 But if you plan to reuse your socket for further transfers, you need
 to realize that \emph{there is no "EOT" (End of Transfer) on a
 socket.} I repeat: if a socket \code{send} or
 \code{recv} returns after handling 0 bytes, the connection has
 been broken.  If the connection has \emph{not} been broken, you may
 wait on a \code{recv} forever, because the socket will
 \emph{not} tell you that there's nothing more to read (for now).  Now
 if you think about that a bit, you'll come to realize a fundamental
 truth of sockets: \emph{messages must either be fixed length} (yuck),
 \emph{or be delimited} (shrug), \emph{or indicate how long they are}
 (much better), \emph{or end by shutting down the connection}. The
 choice is entirely yours, (but some ways are righter than others).
 Assuming you don't want to end the connection, the simplest solution
 is a fixed length message:
 \begin{verbatim}
    import socket
    MSGLEN = 1024   # illustrative fixed message length; pick your own
    class mysocket:
        '''demonstration class only
          - coded for clarity, not efficiency'''
        def __init__(self, sock=None):
            if sock is None:
                self.sock = socket.socket(
                    socket.AF_INET, socket.SOCK_STREAM)
            else:
                self.sock = sock
        def connect(self, host, port):
            self.sock.connect((host, port))
        def mysend(self, msg):
            # msg must be exactly MSGLEN characters long
            totalsent = 0
            while totalsent < MSGLEN:
                sent = self.sock.send(msg[totalsent:])
                if sent == 0:
                    raise RuntimeError("socket connection broken")
                totalsent = totalsent + sent
        def myreceive(self):
            msg = ''
            while len(msg) < MSGLEN:
                chunk = self.sock.recv(MSGLEN-len(msg))
                if chunk == '':
                    raise RuntimeError("socket connection broken")
                msg = msg + chunk
            return msg
 \end{verbatim}
 The sending code here is usable for almost any messaging scheme - in
 Python you send strings, and you can use \code{len()} to
 determine a string's length (even if it has embedded \code{\e 0}
 characters). It's mostly the receiving code that gets more
 complex. (And in C, it's not much worse, except you can't use
 \code{strlen} if the message has embedded \code{\e 0}s.)
 The easiest enhancement is to make the first character of the message
 an indicator of message type, and have the type determine the
 length. Now you have two \code{recv}s - the first to get (at
 least) that first character so you can look up the length, and the
 second in a loop to get the rest. If you decide to go the delimited
 route, you'll be receiving in some arbitrary chunk size, (4096 or 8192
 is frequently a good match for network buffer sizes), and scanning
 what you've received for a delimiter.
 One complication to be aware of: if your conversational protocol
 allows multiple messages to be sent back to back (without some kind of
 reply), and you pass \code{recv} an arbitrary chunk size, you
 may end up reading the start of a following message. You'll need to
 put that aside and hold onto it, until it's needed.
 Prefixing the message with its length (say, as 5 numeric characters)
 gets more complex, because (believe it or not), you may not get all 5
 characters in one \code{recv}. In playing around, you'll get
 away with it; but in high network loads, your code will very quickly
 break unless you use two \code{recv} loops - the first to
 determine the length, the second to get the data part of the
 message. Nasty. This is also when you'll discover that
 \code{send} does not always manage to get rid of everything in
 one pass. And despite having read this, you will eventually get bit by
 it!
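 For the impatient, here is one way the length-prefix scheme might
 look. It is only a sketch, written for Python 3 (my assumption), and
 the names \code{HEADER_LEN}, \code{recv_exactly}, \code{send_message}
 and \code{recv_message} are invented for the illustration. The sending
 side leans on \code{sendall}, which does the "call \code{send} again"
 loop for you.

```python
import socket

HEADER_LEN = 5  # the "5 numeric characters" length prefix


def recv_exactly(sock, nbytes):
    """Keep calling recv until exactly nbytes have arrived."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # 0 bytes means the connection is gone
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock, payload):
    """Send payload (bytes) preceded by its length as 5 ASCII digits."""
    if len(payload) > 99999:
        raise ValueError("payload too long for a 5-digit length")
    header = ("%05d" % len(payload)).encode("ascii")
    sock.sendall(header + payload)  # sendall loops over send internally


def recv_message(sock):
    """First loop reads the length, second loop reads the data."""
    length = int(recv_exactly(sock, HEADER_LEN).decode("ascii"))
    return recv_exactly(sock, length)
```

 Because each read asks for exactly the number of bytes still owed, this
 version never swallows the start of a following message.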
 In the interests of space, building your character, (and preserving my
 competitive position), these enhancements are left as an exercise for
 the reader. Let's move on to cleaning up.
 \subsection{Binary Data}
 It is perfectly possible to send binary data over a socket. The major
 problem is that not all machines use the same formats for binary
 data. For example, a Motorola chip will represent a 16 bit integer
 with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
 are byte-reversed - that same 1 is 01 00. Socket libraries have calls
 for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
 htons} where "n" means \emph{network} and "h" means \emph{host},
 "s" means \emph{short} and "l" means \emph{long}. Where network order
 is host order, these do nothing, but where the machine is
 byte-reversed, these swap the bytes around appropriately.
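 The byte orders described above can be seen from Python with the
 \code{struct} module alongside the socket conversion calls. This is a
 small sketch in Python 3 (my assumption); the variable names are only
 for illustration.

```python
import socket
import struct
import sys

value = 1

big_endian = struct.pack(">H", value)     # 16-bit, most significant byte first: 00 01
little_endian = struct.pack("<H", value)  # 16-bit, byte-reversed: 01 00
network = struct.pack("!H", value)        # "!" means network order, which is big-endian

# htons converts a 16-bit value from host to network order; ntohs undoes it.
swapped = socket.htons(value)
round_trip = socket.ntohs(swapped)

# On a byte-reversed (little-endian) host htons really swaps; elsewhere it is a no-op.
host_is_big_endian = (sys.byteorder == "big")
```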
 In these days of 32 bit machines, the ascii representation of binary
 data is frequently smaller than the binary representation. That's
 because a surprising amount of the time, all those longs have the
 value 0, or maybe 1. The string "0" would be two bytes, while binary
 is four. Of course, this doesn't fit well with fixed-length
 messages. Decisions, decisions.
 \section{Disconnecting}
 Strictly speaking, you're supposed to use \code{shutdown} on a
 socket before you \code{close} it.  The \code{shutdown} is
 an advisory to the socket at the other end.  Depending on the argument
 you pass it, it can mean "I'm not going to send anymore, but I'll
 still listen", or "I'm not listening, good riddance!".  Most socket
 libraries, however, are so used to programmers neglecting to use this
 piece of etiquette that normally a \code{close} is the same as
 \code{shutdown(); close()}.  So in most situations, an explicit
 \code{shutdown} is not needed.
 One way to use \code{shutdown} effectively is in an HTTP-like
 exchange. The client sends a request and then does a
 \code{shutdown(1)}. This tells the server "This client is done
 sending, but can still receive."  The server can detect "EOF" by a
 receive of 0 bytes. It can assume it has the complete request.  The
 server sends a reply. If the \code{send} completes successfully
 then, indeed, the client was still receiving.
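 Here is a sketch of that exchange in Python 3 (my assumption), again
 using \code{socket.socketpair()} in place of a real client and server.
 \code{socket.SHUT_WR} is the named constant for the \code{1} passed to
 \code{shutdown} above; the request and reply contents are made up.

```python
import socket

client, server = socket.socketpair()

# Client: send the request, then announce "done sending, still listening".
client.sendall(b"GET /index")
client.shutdown(socket.SHUT_WR)

# Server: read until recv returns 0 bytes, which marks the end of the request.
request_parts = []
while True:
    chunk = server.recv(4096)
    if not chunk:
        break
    request_parts.append(chunk)
request = b"".join(request_parts)

# Server: reply, then close, so the client sees end-of-reply the same way.
server.sendall(b"reply to " + request)
server.close()

reply_parts = []
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    reply_parts.append(chunk)
reply = b"".join(reply_parts)
client.close()
```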
 Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
 \subsection{When Sockets Die}
 Probably the worst thing about using blocking sockets is what happens
 when the other side comes down hard (without doing a
 \code{close}). Your socket is likely to hang. \code{SOCK_STREAM} is a
 reliable protocol, and it will wait a long, long time before giving up
 on a connection. If you're using threads, the entire thread is
 essentially dead. There's not much you can do about it. As long as you
 aren't doing something dumb, like holding a lock while doing a
 blocking read, the thread isn't really consuming much in the way of
 resources. Do \emph{not} try to kill the thread - part of the reason
 that threads are more efficient than processes is that they avoid the
 overhead associated with the automatic recycling of resources. In
 other words, if you do manage to kill the thread, your whole process
 is likely to be screwed up.  
 \section{Non-blocking Sockets}
 If you've understood the preceding, you already know most of what you
 need to know about the mechanics of using sockets. You'll still use
 the same calls, in much the same ways. It's just that, if you do it
 right, your app will be almost inside-out.
 In Python, you use \code{socket.setblocking(0)} to make it
 non-blocking. In C, it's more complex, (for one thing, you'll need to
 choose between the Posix flavor \code{O_NONBLOCK} and the almost
 indistinguishable older flavor \code{O_NDELAY}, which is
 completely different from \code{TCP_NODELAY}), but it's the
 exact same idea. You do this after creating the socket, but before
 using it. (Actually, if you're nuts, you can switch back and forth.)
 The major mechanical difference is that \code{send},
 \code{recv}, \code{connect} and \code{accept} can
 return without having done anything. You have (of course) a number of
 choices. You can check return code and error codes and generally drive
 yourself crazy. If you don't believe me, try it sometime. Your app
 will grow large, buggy and suck CPU. So let's skip the brain-dead
 solutions and do it right.
 Use \code{select}.
 In C, coding \code{select} is fairly complex. In Python, it's a
 piece of cake, but it's close enough to the C version that if you
 understand \code{select} in Python, you'll have little trouble
 with it in C.
 \begin{verbatim}
    ready_to_read, ready_to_write, in_error = \
                   select.select(
                      potential_readers,
                      potential_writers,
                      potential_errs,
                      timeout)
 \end{verbatim}
 You pass \code{select} three lists: the first contains all
 sockets that you might want to try reading; the second all the sockets
 you might want to try writing to, and the last (normally left empty)
 those that you want to check for errors.  You should note that a
 socket can go into more than one list. The \code{select} call is
 blocking, but you can give it a timeout. This is generally a sensible
 thing to do - give it a nice long timeout (say a minute) unless you
 have good reason to do otherwise.
 In return, you will get three lists. They have the sockets that are
 actually readable, writable and in error. Each of these lists is a
 subset (possibly empty) of the corresponding list you passed in. And
 if you put a socket in more than one input list, it can show up in
 more than one output list.
 If a socket is in the output readable list, you can be
 as-close-to-certain-as-we-ever-get-in-this-business that a
 \code{recv} on that socket will return \emph{something}. Same
 idea for the writable list. You'll be able to send
 \emph{something}. Maybe not all you want to, but \emph{something} is
 better than nothing. (Actually, any reasonably healthy socket will
 return as writable - it just means outbound network buffer space is
 available.)
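 A small sketch of the call in action, in Python 3 (my assumption),
 with \code{socket.socketpair()} supplying two connected sockets. The
 variable names are invented for the illustration.

```python
import select
import socket

a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Nothing has been sent yet: b should not be readable, but a healthy
# socket like a should be writable. A timeout of 0 makes select just poll.
readable, writable, in_error = select.select([b], [a], [], 0)
b_readable_before = b in readable
a_writable_before = a in writable

a.send(b"hello")

# Now wait (up to 5 seconds) for b to become readable.
readable, writable, in_error = select.select([b], [], [], 5.0)
b_readable_after = b in readable

data = b.recv(4096) if b_readable_after else b""

a.close()
b.close()
```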
 If you have a "server" socket, put it in the \code{potential_readers}
 list. If it comes out in the readable list, your \code{accept}
 will (almost certainly) work. If you have created a new socket to
 \code{connect} to someone else, put it in the \code{potential_writers}
 list. If it shows up in the writable list, you have a decent chance
 that it has connected.
 One very nasty problem with \code{select}: if somewhere in those
 input lists of sockets is one which has died a nasty death, the
 \code{select} will fail. You then need to loop through every
 single damn socket in all those lists and do a
 \code{select([sock],[],[],0)} until you find the bad one. That
 timeout of 0 means it won't take long, but it's ugly.
 Actually, \code{select} can be handy even with blocking sockets.
 It's one way of determining whether you will block - the socket
 returns as readable when there's something in the buffers.  However,
 this still doesn't help with the problem of determining whether the
 other end is done, or just busy with something else.
 \textbf{Portability alert}: On Unix, \code{select} works with both
 sockets and files. Don't try this on Windows. On Windows,
 \code{select} works with sockets only. Also note that in C, many
 of the more advanced socket options are done differently on
 Windows. In fact, on Windows I usually use threads (which work very,
 very well) with my sockets. Face it, if you want any kind of
 performance, your code will look very different on Windows than on
 Unix. (I haven't the foggiest how you do this stuff on a Mac.)
 \subsection{Performance}
 There's no question that the fastest sockets code uses non-blocking
 sockets and select to multiplex them. You can put together something
 that will saturate a LAN connection without putting any strain on the
 CPU. The trouble is that an app written this way can't do much of
 anything else - it needs to be ready to shuffle bytes around at all
 times.
 Assuming that your app is actually supposed to do something more than
 that, threading is the optimal solution, (and using non-blocking
 sockets will be faster than using blocking sockets). Unfortunately,
 threading support in Unixes varies both in API and quality. So the
 normal Unix solution is to fork a subprocess to deal with each
 connection. The overhead for this is significant (and don't do this on
 Windows - the overhead of process creation is enormous there). It also
 means that unless each subprocess is completely independent, you'll
 need to use another form of IPC, say a pipe, or shared memory and
 semaphores, to communicate between the parent and child processes.
 Finally, remember that even though blocking sockets are somewhat
 slower than non-blocking, in many cases they are the "right"
 solution. After all, if your app is driven by the data it receives
 over a socket, there's not much sense in complicating the logic just
 so your app can wait on \code{select} instead of
 \code{recv}.
 \end{document}
--- a/Doc/howto/sorting.tex
+++ b/Doc/howto/sorting.tex
@ -0,0 +1,267 @@
 \documentclass{howto}
 \title{Sorting Mini-HOWTO}
 % Increment the release number whenever significant changes are made.
 % The author and/or editor can define 'significant' however they like.
 \release{0.01}
 \author{Andrew Dalke}
 \authoraddress{\email{dalke@bioreason.com}}
 \begin{document}
 \maketitle
 \begin{abstract}
 \noindent
 This document is a little tutorial
 showing a half dozen ways to sort a list with the built-in
 \method{sort()} method.  
 This document is available from the Python HOWTO page at
 \url{http://www.python.org/doc/howto}.
 \end{abstract}
 \tableofcontents
 Python lists have a built-in \method{sort()} method.  There are many
 ways to use it to sort a list and there doesn't appear to be a single,
 central place in the various manuals describing them, so I'll do so
 here.
 \section{Sorting basic data types}
 A simple ascending sort is easy; just call the \method{sort()} method of a list.
 \begin{verbatim}
 >>> a = [5, 2, 3, 1, 4]
 >>> a.sort()
 >>> print a
 [1, 2, 3, 4, 5]
 \end{verbatim}
 Sort takes an optional function which can be called for doing the
 comparisons.  The default sort routine is equivalent to
 \begin{verbatim}
 >>> a = [5, 2, 3, 1, 4]
 >>> a.sort(cmp)
 >>> print a
 [1, 2, 3, 4, 5]
 \end{verbatim}
 where \function{cmp} is the built-in function which compares two objects, \code{x} and
 \code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$.  During
 the course of the sort the relationships must stay the same for the
 final list to make sense.
 If you want, you can define your own function for the comparison.  For 
 integers (and numbers in general) we can do:
 \begin{verbatim}
 >>> def numeric_compare(x, y):
 ...     return x-y
 ...
 >>> a = [5, 2, 3, 1, 4]
 >>> a.sort(numeric_compare)
 >>> print a
 [1, 2, 3, 4, 5]
 \end{verbatim}
 By the way, this function won't work if the result of the subtraction
 is out of range, as in \code{sys.maxint - (-1)}.
 Or, if you don't want to define a new named function you can create an
 anonymous one using \keyword{lambda}, as in:
 \begin{verbatim}
 >>> a = [5, 2, 3, 1, 4]
 >>> a.sort(lambda x, y: x-y)
 >>> print a
 [1, 2, 3, 4, 5]
 \end{verbatim}
 If you want the numbers sorted in reverse you can do
 \begin{verbatim}
 >>> a = [5, 2, 3, 1, 4]
 >>> def reverse_numeric(x, y):
 ...     return y-x
 ...
 >>> a.sort(reverse_numeric)
 >>> print a
 [5, 4, 3, 2, 1]
 \end{verbatim}
 (a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
 However, it's faster if Python doesn't have to call a function for
 every comparison, so if you want a reverse-sorted list of basic data
 types, do the forward sort first, then use the \method{reverse()} method.
 \begin{verbatim}
 >>> a = [5, 2, 3, 1, 4]
 >>> a.sort()
 >>> a.reverse()
 >>> print a
 [5, 4, 3, 2, 1]
 \end{verbatim}
 Here's a case-insensitive string comparison using a \keyword{lambda} function:
 \begin{verbatim}
 >>> import string
 >>> a = string.split("This is a test string from Andrew.")
 >>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
 >>> print a
 ['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
 \end{verbatim}
 This goes through the overhead of converting a word to lower case
 every time it must be compared.  At times it may be faster to compute
 these once and use those values, and the following example shows how.
 \begin{verbatim}
 >>> words = string.split("This is a test string from Andrew.")
 >>> offsets = []
 >>> for i in range(len(words)):
 ...     offsets.append( (string.lower(words[i]), i) )
 ...
 >>> offsets.sort()
 >>> new_words = []
 >>> for dontcare, i in offsets:
 ...     new_words.append(words[i])
 ...
 >>> print new_words
 ['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
 \end{verbatim}
 The \code{offsets} list is filled with tuples, each holding a lower-case string
 and its position in the \code{words} list.  It is then sorted.  Python's
 sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
 \code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
 The result is that the \code{offsets} list is ordered by its first
 term, and the second term can be used to figure out where the original
 data was stored.  (The \code{for} loop assigns \code{dontcare} and
 \code{i} to the two fields of each term in the list, but we only need the
 index value.)
 Another way to implement this is to store the original data as the
 second term in the \code{offsets} list, as in:
 \begin{verbatim}
 >>> words = string.split("This is a test string from Andrew.")
 >>> offsets = []
 >>> for word in words:
 ...     offsets.append( (string.lower(word), word) )
 ...
 >>> offsets.sort()
 >>> new_words = []
 >>> for word in offsets:
 ...     new_words.append(word[1])
 ...
 >>> print new_words
 ['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
 \end{verbatim}
 This isn't always appropriate because the second terms in the list
 (the word, in this example) will be compared when the first terms are
 the same.  If this happens many times, then there will be the unneeded
 performance hit of comparing the two objects.  This can be a large
 cost if most terms are the same and the objects define their own
 \method{__cmp__} method, but there will still be some overhead to determine if
 \method{__cmp__} is defined.
 Still, for large lists, or for lists where the comparison information
 is expensive to calculate, the last two examples are likely to be the
 fastest way to sort a list.  This approach will not work on weakly sorted data,
 like complex numbers, but if you don't know what that means, you
 probably don't need to worry about it.
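 The two intermediate-list forms can be folded into one helper. The
 sketch below is written for Python 3 (my assumption), and
 \code{sorted_by_key} is a made-up name, not a library function. Each
 tuple carries the computed key, the original position, and the item
 itself; since positions are unique, ties on the key are settled by
 position and the items themselves are never compared.

```python
def sorted_by_key(items, keyfunc):
    """Return a new list of items ordered by keyfunc(item).

    Builds (key, position, item) tuples, sorts them, then strips the
    extra fields off again.
    """
    decorated = [(keyfunc(item), index, item)
                 for index, item in enumerate(items)]
    decorated.sort()
    return [item for key, index, item in decorated]


words = "This is a test string from Andrew.".split()
new_words = sorted_by_key(words, str.lower)

# Items that cannot be ordered directly (complex numbers here) are fine
# as long as the computed keys can be ordered.
by_real_part = sorted_by_key([3 + 1j, 1 + 2j, 3 + 0j], lambda z: z.real)
```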
 \section{Comparing classes}
 The comparison for two basic data types, like ints to ints or strings to
 string, is built into Python and makes sense.  There is a default way
 to compare class instances, but the default manner isn't usually very
 useful.  You can define your own comparison with the \method{__cmp__} method,
 as in:
 \begin{verbatim}
 >>> class Spam:
 ...     def __init__(self, spam, eggs):
 ...         self.spam = spam
 ...         self.eggs = eggs
 ...     def __cmp__(self, other):
 ...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
 ...     def __str__(self):
 ...         return str(self.spam + self.eggs)
 ...
 >>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
 >>> a.sort()
 >>> for spam in a:
 ...     print str(spam)
 ...
 5
 10
 12
 \end{verbatim}
 Sometimes you may want to sort by a specific attribute of a class.  If
 appropriate you should just define the \method{__cmp__} method to compare
 those values, but you cannot do this if you want to compare between
 different attributes at different times.  Instead, you'll need to go
 back to passing a comparison function to sort, as in:
 \begin{verbatim}
 >>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
 >>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
 >>> for spam in a:
 ...     print spam.eggs, str(spam)
 ...
 3 12
 4 5
 6 10
 \end{verbatim}
 If you want to compare two arbitrary attributes (and aren't overly
 concerned about performance) you can even define your own comparison
 function object.  This uses the ability of a class instance to emulate
 a function by defining the \method{__call__} method, as in:
 \begin{verbatim}
 >>> class CmpAttr:
 ...     def __init__(self, attr):
 ...         self.attr = attr
 ...     def __call__(self, x, y):
 ...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
 ...
 >>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
 >>> a.sort(CmpAttr("spam"))  # sort by the "spam" attribute
 >>> for spam in a:
 ...     print spam.spam, spam.eggs, str(spam)
 ...
 1 4 5
 4 6 10
 9 3 12
 >>> a.sort(CmpAttr("eggs"))   # re-sort by the "eggs" attribute
 >>> for spam in a:
 ...     print spam.spam, spam.eggs, str(spam)
 ...
 9 3 12
 1 4 5
 4 6 10
 \end{verbatim}
 Of course, if you want a faster sort you can extract the attributes
 into an intermediate list and sort that list.
 So, there you have it; about a half-dozen different ways to define how
 to sort a list:
 \begin{itemize}
 \item sort using the default method
 \item sort using a comparison function
 \item reverse sort not using a comparison function
 \item sort on an intermediate list (two forms)
 \item sort using class defined \method{__cmp__} method
 \item sort using a sort function object
 \end{itemize}
 \end{document}
 % LocalWords:  maxint
--- a/Doc/howto/unicode.rst
+++ b/Doc/howto/unicode.rst
@ -0,0 +1,765 @@
 Unicode HOWTO
 ================
 **Version 1.02**
 This HOWTO discusses Python's support for Unicode, and explains various 
 problems that people commonly encounter when trying to work with Unicode.
 Introduction to Unicode
 ------------------------------
 History of Character Codes
 ''''''''''''''''''''''''''''''
 In 1968, the American Standard Code for Information Interchange,
 better known by its acronym ASCII, was standardized.  ASCII defined
 numeric codes for various characters, with the numeric values running from 0 to
 127.  For example, the lowercase letter 'a' is assigned 97 as its code
 value.
 ASCII was an American-developed standard, so it only defined
 unaccented characters.  There was an 'e', but no 'é' or 'Í'.  This
 meant that languages which required accented characters couldn't be
 faithfully represented in ASCII.  (Actually the missing accents matter
 for English, too, which contains words such as 'naïve' and 'café', and some
 publications have house styles which require spellings such as
 'coöperate'.)
 For a while people just wrote programs that didn't display accents.  I
 remember looking at Apple ][ BASIC programs, published in French-language
 publications in the mid-1980s, that had lines like these::
 	PRINT "FICHIER EST COMPLETE."
 	PRINT "CARACTERE NON ACCEPTE."
 Those messages should contain accents, and they just look wrong to
 someone who can read French.  
 In the 1980s, almost all personal computers were 8-bit, meaning that
 bytes could hold values ranging from 0 to 255.  ASCII codes only went
 up to 127, so some machines assigned values between 128 and 255 to
 accented characters.  Different machines had different codes, however,
 which led to problems exchanging files.  Eventually various commonly
 used sets of values for the 128-255 range emerged.  Some were true
 standards, defined by the International Standards Organization, and
 some were **de facto** conventions that were invented by one company
 or another and managed to catch on.
 255 characters aren't very many.  For example, you can't fit
 both the accented characters used in Western Europe and the Cyrillic
 alphabet used for Russian into the 128-255 range because there are more than
 128 such characters.
 You could write files using different codes (all your Russian
 files in a coding system called KOI8, all your French files in 
 a different coding system called Latin1), but what if you wanted
 to write a French document that quotes some Russian text?  In the
 1980s people began to want to solve this problem, and the Unicode
 standardization effort began.
 Unicode started out using 16-bit characters instead of 8-bit characters.  16
 bits means you have 2^16 = 65,536 distinct values available, making it
 possible to represent many different characters from many different
 alphabets; an initial goal was to have Unicode contain the alphabets for
 every single human language.  It turns out that even 16 bits isn't enough to
 meet that goal, and the modern Unicode specification uses a wider range of
 codes, 0-1,114,111 (0x10ffff in base-16).
 There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
 originally separate efforts, but the specifications were merged with
 the 1.1 revision of Unicode.  
 (This discussion of Unicode's history is highly simplified.  I don't
 think the average Python programmer needs to worry about the
 historical details; consult the Unicode consortium site listed in the
 References for more information.)
 Definitions
 ''''''''''''''''''''''''
 A **character** is the smallest possible component of a text.  'A',
 'B', 'C', etc., are all different characters.  So are 'È' and
 'Í'.  Characters are abstractions, and vary depending on the
 language or context you're talking about.  For example, the symbol for
 ohms (Ω) is usually drawn much like the capital letter
 omega (Ω) in the Greek alphabet (they may even be the same in
 some fonts), but these are two different characters that have
 different meanings.
 The Unicode standard describes how characters are represented by
 **code points**.  A code point is an integer value, usually denoted in
 base 16.  In the standard, a code point is written using the notation
 U+12ca to mean the character with value 0x12ca (4810 decimal).  The
 Unicode standard contains a lot of tables listing characters and their
 corresponding code points::
 	0061    'a'; LATIN SMALL LETTER A
 	0062    'b'; LATIN SMALL LETTER B
 	0063    'c'; LATIN SMALL LETTER C
 	...
 	007B	'{'; LEFT CURLY BRACKET
 Strictly, these definitions imply that it's meaningless to say 'this is
 character U+12ca'.  U+12ca is a code point, which represents some particular
 character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
 In informal contexts, this distinction between code points and characters will
 sometimes be forgotten.
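 The relationship between characters, code points and character names
 can be checked from Python itself.  The sketch below is written for
 Python 3 (an assumption; there ``chr()`` accepts any code point) and
 uses the standard ``unicodedata`` module; the variable names are only
 for illustration.

```python
import unicodedata

code_point = 0x12ca                  # the code point written U+12ca in the text
character = chr(code_point)          # the character that code point represents
name = unicodedata.name(character)   # its official Unicode name

# Going the other way: from a character back to its code point,
# formatted in the U+XXXX notation used by the standard.
label = "U+%04X" % ord(character)

# The table entries quoted above can be checked the same way.
small_a = (ord("a"), unicodedata.name("a"))
left_brace = (ord("{"), unicodedata.name("{"))
```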
 A character is represented on a screen or on paper by a set of graphical
 elements that's called a **glyph**.  The glyph for an uppercase A, for
 example, is two diagonal strokes and a horizontal stroke, though the exact
 details will depend on the font being used.  Most Python code doesn't need
 to worry about glyphs; figuring out the correct glyph to display is
 generally the job of a GUI toolkit or a terminal's font renderer.
 Encodings
 '''''''''
 To summarize the previous section: 
 a Unicode string is a sequence of code points, which are
 numbers from 0 to 0x10ffff.  This sequence needs to be represented as
 a set of bytes (meaning, values from 0-255) in memory.  The rules for
 translating a Unicode string into a sequence of bytes are called an 
 **encoding**.
 The first encoding you might think of is an array of 32-bit integers.  
 In this representation, the string "Python" would look like this::
       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00 
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
 This representation is straightforward but using
 it presents a number of problems.
 1. It's not portable; different processors order the bytes 
   differently. 
 2. It's very wasteful of space.  In most texts, the majority of the code 
   points are less than 127, or less than 255, so a lot of space is occupied
   by zero bytes.  The above string takes 24 bytes compared to the 6
   bytes needed for an ASCII representation.  Increased RAM usage doesn't
   matter too much (desktop computers have megabytes of RAM, and strings
   aren't usually that large), but expanding our usage of disk and
   network bandwidth by a factor of 4 is intolerable.
 3. It's not compatible with existing C functions such as ``strlen()``,
   so a new family of wide string functions would need to be used.
 4. Many Internet standards are defined in terms of textual data, and 
   can't handle content with embedded zero bytes.
 Generally people don't use this encoding, choosing other encodings
 that are more efficient and convenient.
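 The 32-bit layout shown above corresponds to what Python calls the
 ``utf-32-le`` codec (little-endian, no byte-order mark).  A short
 sketch in Python 3 (an assumption) makes the size penalty concrete:

```python
text = "Python"

utf32_le = text.encode("utf-32-le")   # four bytes per code point, low byte first
ascii_bytes = text.encode("ascii")    # one byte per code point

# Count how many of the 32-bit encoding's bytes are just zero padding.
zero_bytes = utf32_le.count(b"\x00")

# The big-endian variant holds the same values in the opposite byte order,
# which is the portability problem mentioned in point 1.
utf32_be = text.encode("utf-32-be")
```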
 Encodings don't have to handle every possible Unicode character, and
 most encodings don't.  For example, Python's default encoding is the
 'ascii' encoding.  The rules for converting a Unicode string into the
 ASCII encoding are simple; for each code point:
 1. If the code point is <128, each byte is the same as the value of the 
   code point.
 2. If the code point is 128 or greater, the Unicode string can't 
   be represented in this encoding.  (Python raises  a 
   ``UnicodeEncodeError`` exception in this case.)
 Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode
 code points 0-255 are identical to the Latin-1 values, so converting
 to this encoding simply requires converting code points to byte
 values; if a code point larger than 255 is encountered, the string
 can't be encoded into Latin-1.
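 Both sets of rules can be tried out directly.  This is a sketch in
 Python 3 (an assumption), where text strings have an ``encode()``
 method; the sample strings and variable names are made up for the
 illustration.

```python
cafe = "caf\u00e9"      # 'café': U+00E9 is above 127 but below 256
euro = "\u20ac"         # EURO SIGN: well above 255

# ASCII: every code point below 128 becomes the byte of the same value.
plain = "abc".encode("ascii")

# ASCII: a code point of 128 or more cannot be represented.
try:
    cafe.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Latin-1: code points 0-255 map straight to byte values.
cafe_latin1 = cafe.encode("latin-1")

# Latin-1: anything larger than 255 cannot be encoded.
try:
    euro.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```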
 Encodings don't have to be simple one-to-one mappings like Latin-1.
 Consider IBM's EBCDIC, which was used on IBM mainframes.  Letter
 values weren't in one block: 'a' through 'i' had values from 129 to
 137, but 'j' through 'r' were 145 through 153.  If you wanted to use
 EBCDIC as an encoding, you'd probably use some sort of lookup table to
 perform the conversion, but this is largely an internal detail.
 UTF-8 is one of the most commonly used encodings.  UTF stands for
 "Unicode Transformation Format", and the '8' means that 8-bit numbers
 are used in the encoding.  (There's also a UTF-16 encoding, but it's
 less frequently used than UTF-8.)  UTF-8 uses the following rules:
 1. If the code point is <128, it's represented by the corresponding byte value.
 2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
 3. Code points >0x7ff are turned into three- or four-byte sequences, where
   each byte of the sequence is between 128 and 255.
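 To make the three rules concrete, here is a hand-written encoder for a
 single code point.  It is a sketch in Python 3 (an assumption), the
 name ``utf8_encode`` is made up, and the exact bit layout comes from
 the UTF-8 definition rather than from the summary above.  In real code
 you would simply call ``encode('utf-8')``.  Surrogate code points
 (U+D800 to U+DFFF) are not handled.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point (an int) as UTF-8 bytes."""
    if code_point < 0x80:
        # Rule 1: below 128, the byte value is the code point itself.
        return bytes([code_point])
    if code_point < 0x800:
        # Rule 2: 128 to 0x7ff becomes two bytes, both between 128 and 255.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Rule 3, three-byte form: 0x800 to 0xffff.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Rule 3, four-byte form: 0x10000 to 0x10ffff.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Number of bytes needed for a few sample code points.
lengths = [len(utf8_encode(cp)) for cp in (0x61, 0xE9, 0x20AC, 0x10000)]
```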
 UTF-8 has several convenient properties:
 1. It can handle any Unicode code point.
 2. A Unicode string is turned into a string of bytes containing no embedded zero bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
 3. A string of ASCII text is also valid UTF-8 text. 
 4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
 5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize.  It's also unlikely that random 8-bit data will look like valid UTF-8.
 References
 ''''''''''''''
 The Unicode Consortium site at <http://www.unicode.org> has character
 charts, a glossary, and PDF versions of the Unicode specification.  Be
 prepared for some difficult reading.
 <http://www.unicode.org/history/> is a chronology of the origin and
 development of Unicode.
 To help understand the standard, Jukka Korpela has written an
 introductory guide to reading the Unicode character tables, 
 available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
 Roman Czyborra wrote another explanation of Unicode's basic principles; 
 it's at <http://czyborra.com/unicode/characters.html>.
 Czyborra has written a number of other Unicode-related documents, 
 available from <http://www.cyzborra.com>.
 Two other good introductory articles were written by Joel Spolsky
 <http://www.joelonsoftware.com/articles/Unicode.html> and Jason
 Orendorff <http://www.jorendorff.com/articles/unicode/>.  If this
 introduction didn't make things clear to you, you should try reading
 one of these alternate articles before continuing.
 Wikipedia entries are often helpful; see the entries for "character
 encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
 <http://en.wikipedia.org/wiki/UTF-8>, for example.
 Python's Unicode Support
 ------------------------
 Now that you've learned the rudiments of Unicode, we can look at
 Python's Unicode features.
 The Unicode Type
 '''''''''''''''''''
 Unicode strings are expressed as instances of the ``unicode`` type,
 one of Python's repertoire of built-in types.  It derives from an
 abstract type called ``basestring``, which is also an ancestor of the
 ``str`` type; you can therefore check if a value is a string type with
 ``isinstance(value, basestring)``.  Under the hood, Python represents
 Unicode strings as either 16- or 32-bit integers, depending on how the
 Python interpreter was compiled.
 The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
 All of its arguments should be 8-bit strings.  The first argument is converted 
 to Unicode using the specified encoding; if you leave off the ``encoding`` argument, 
 the ASCII encoding is used for the conversion, so characters greater than 127 will 
 be treated as errors::
    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: 
                        ordinal not in range(128)
 The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules.  Legal values for this argument 
 are 'strict' (raise a ``UnicodeDecodeError`` exception), 
 'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'), 
 or 'ignore' (just leave the character out of the Unicode result).  
 The following examples show the differences::
    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: 
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'
 Encodings are specified as strings containing the encoding's name.
 Python 2.4 comes with roughly 100 different encodings; see the Python
 Library Reference at
 <http://docs.python.org/lib/standard-encodings.html> for a list.  Some
 encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
 and '8859' are all synonyms for the same encoding.
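
As a quick check of the alias claim, the following sketch encodes the same character under each of the three names and compares the results.  The variable names are illustrative, and the ``b''`` literal assumes Python 2.6 or newer.

```python
# Encode U+00E9 (e with acute accent) under three names for Latin-1.
names = ['latin-1', 'iso_8859_1', '8859']
encoded = [u'\xe9'.encode(name) for name in names]

# All three names select the same encoding, so the bytes are identical.
all_same = encoded[0] == encoded[1] == encoded[2]
single_byte = encoded[0] == b'\xe9'
```
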
 One-character Unicode strings can also be created with the
 ``unichr()`` built-in function, which takes integers and returns a
 Unicode string of length 1 that contains the corresponding code point.
 The reverse operation is the built-in ``ord()`` function that takes a
 one-character Unicode string and returns the code point value::
    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960
 Instances of the ``unicode`` type have many of the same methods as 
 the 8-bit string type for operations such as searching and formatting::
    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
 Note that the arguments to these methods can be Unicode strings or 8-bit strings.  
 8-bit strings will be converted to Unicode before carrying out the operation;
 Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1
 Much Python code that operates on strings will therefore work with
 Unicode strings without requiring any changes to the code.  (Input and
 output code needs more updating for Unicode; more on this later.)
 Another important method is ``.encode([encoding], [errors='strict'])``, 
 which returns an 8-bit string version of the
 Unicode string, encoded in the requested encoding.  The ``errors``
 parameter is the same as the parameter of the ``unicode()``
 constructor, with one additional possibility; as well as 'strict',
 'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
 uses XML's character references.  The following example shows the
 different results::
    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'
 Python's 8-bit strings have a ``.decode([encoding], [errors])`` method 
 that interprets the string using the given encoding::
    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True
 The low-level routines for registering and accessing the available
 encodings are found in the ``codecs`` module.  However, the encoding
 and decoding functions returned by this module are usually more
 low-level than is comfortable, so I'm not going to describe the
 ``codecs`` module here.  If you need to implement a completely new
 encoding, you'll need to learn about the ``codecs`` module interfaces,
 but implementing encodings is a specialized task that also won't be
 covered here.  Consult the Python documentation to learn more about
 this module.
 The most commonly used part of the ``codecs`` module is the 
 ``codecs.open()`` function which will be discussed in the section
 on input and output.
 Unicode Literals in Python Source Code
 ''''''''''''''''''''''''''''''''''''''''''
 In Python source code, Unicode literals are written as strings
 prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``.  Specific
 code points can be written using the ``\u`` escape sequence, which is
 followed by four hex digits giving the code point.  The ``\U`` escape
 sequence is similar, but expects 8 hex digits, not 4.  
 Unicode literals can also use the same escape sequences as 8-bit
 strings, including ``\x``, but ``\x`` only takes two hex digits so it
 can't express an arbitrary code point.  Octal escapes can go up to
 U+01ff, which is octal 777.
 ::
    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape 
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ... 
    97 172 4660 8364 32768
 Using escape sequences for code points greater than 127 is fine in
 small doses, but becomes an annoyance if you're using many accented
 characters, as you would in a program with messages in French or some
 other accent-using language.  You can also assemble strings using the
 ``unichr()`` built-in function, but this is even more tedious.
 Ideally, you'd want to be able to write literals in your language's
 natural encoding.  You could then edit Python source code with your
 favorite editor which would display the accented characters naturally,
 and have the right characters used at runtime.
 Python supports writing Unicode literals in any encoding, but you have
 to declare the encoding being used.  This is done by including a
 special comment as either the first or second line of the source
 file::
    #!/usr/bin/env python
    # -*- coding: latin-1 -*-
    u = u'abcdé'
    print ord(u[-1])
 The syntax is inspired by Emacs's notation for specifying variables local to a file.
 Emacs supports many different variables, but Python only supports 'coding'.  
 The ``-*-`` symbols indicate to Emacs that the comment is special; they
 have no significance to Python but are a convention.  Python looks for
 ``coding: name`` or ``coding=name`` in the comment.
 If you don't include such a comment, the default encoding used will be
 ASCII.  Versions of Python before 2.4 were Euro-centric and assumed
 Latin-1 as a default encoding for string literals; in Python 2.4,
 characters greater than 127 still work but result in a warning.  For
 example, the following program has no encoding declaration::
    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])
 When you run it with Python 2.4, it will output the following warning::
    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9' 
         in file p263.py on line 2, but no encoding declared; 
         see http://www.python.org/peps/pep-0263.html for details
 Unicode Properties
 '''''''''''''''''''
 The Unicode specification includes a database of information about
 code points.  For each code point that's defined, the information
 includes the character's name, its category, and the numeric value if
 applicable (Unicode has characters representing the Roman numerals and
 fractions such as one-third and four-fifths).  There are also
 properties related to the code point's use in bidirectional text and
 other display-related properties.
 The following program displays some information about several
 characters, and prints the numeric value of one particular character::
    import unicodedata
    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)
    # Get numeric value of second character
    print unicodedata.numeric(u[1])
 When run, this prints::
    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0
 The category codes are abbreviations describing the nature of the
 character.  These are grouped into categories such as "Letter",
 "Number", "Punctuation", or "Symbol", which in turn are broken up into
 subcategories.  To take the codes from the above output, ``'Ll'``
 means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
 "Mark, nonspacing", and ``'So'`` is "Symbol, other".  See
 <http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
 for a list of category codes.
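
Because the first letter of each code names the major category, a small lookup table is enough to group characters.  This is a minimal sketch; the ``MAJOR`` table and the ``major_category()`` helper are hypothetical names introduced for illustration, not part of ``unicodedata``.

```python
import unicodedata

# First letter of the two-letter code -> major category name.
MAJOR = {
    'L': 'Letter', 'M': 'Mark', 'N': 'Number', 'P': 'Punctuation',
    'S': 'Symbol', 'Z': 'Separator', 'C': 'Other',
}

def major_category(ch):
    """Return the major category name for a one-character Unicode string."""
    return MAJOR[unicodedata.category(ch)[0]]

# e-acute, TAMIL NUMBER ONE THOUSAND, TIBETAN MARK HALANTA, space, '!'
text = u'\xe9\u0bf2\u0f84 !'
labels = [major_category(ch) for ch in text]
# labels == ['Letter', 'Number', 'Mark', 'Separator', 'Punctuation']
```
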
 References
 ''''''''''''''
 The Unicode and 8-bit string types are described in the Python library
 reference at <http://docs.python.org/lib/typesseq.html>.
 The documentation for the ``unicodedata`` module is at 
 <http://docs.python.org/lib/module-unicodedata.html>.
 The documentation for the ``codecs`` module is at
 <http://docs.python.org/lib/module-codecs.html>.
 Marc-André Lemburg gave a presentation at EuroPython 2002
 titled "Python and Unicode".  A PDF version of his slides
 is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
 and is an excellent overview of the design of Python's Unicode features.
 Reading and Writing Unicode Data
 ----------------------------------------
 Once you've written some code that works with Unicode data, the next
 problem is input/output.  How do you get Unicode strings into your
 program, and how do you convert Unicode into a form suitable for
 storage or transmission?  
 Depending on your input sources and output destinations, you may not
 need to do anything; you should check whether the
 libraries used in your application support Unicode natively.  XML
 parsers often return Unicode data, for example.  Many relational
 databases also support Unicode-valued columns and can return Unicode
 values from an SQL query.
 Unicode data is usually converted to a particular encoding before it
 gets written to disk or sent over a socket.  It's possible to do all
 the work yourself: open a file, read an 8-bit string from it, and
 convert the string with ``unicode(str, encoding)``.  However, the
 manual approach is not recommended.
 One problem is the multi-byte nature of encodings; one Unicode
 character can be represented by several bytes.  If you want to read
 the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
 error-handling code to catch the case where only part of the bytes
 encoding a single Unicode character are read at the end of a chunk.
 One solution would be to read the entire file into memory and then
 perform the decoding, but that prevents you from working with files
 that are extremely large; if you need to read a 2 GB file, you need 2 GB
 of RAM.  (More, really, since for at least a moment you'd need to have 
 both the encoded string and its Unicode version in memory.)
 The solution would be to use the low-level decoding interface to catch
 the case of partial coding sequences.   The work of implementing this
 has already been done for you: the ``codecs`` module includes a
 version of the ``open()`` function that returns a file-like object
 that assumes the file's contents are in a specified encoding and
 accepts Unicode parameters for methods such as ``.read()`` and
 ``.write()``.
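
To make the chunking problem concrete, here is a sketch that splits UTF-8 data in the middle of a character and then reads the same data through a stream reader.  It uses ``codecs.getreader()`` over an in-memory ``io.BytesIO`` stream instead of a real file, purely to stay self-contained; that choice, the sample text, and the chunk size of 4 are my assumptions (``io`` requires Python 2.6 or newer).

```python
import codecs
import io

text = u'caf\xe9 \u4500'
data = text.encode('utf-8')      # 9 bytes: 3 + 2 + 1 + 3

# Naive chunking: the first 4 bytes end halfway through the e-acute.
try:
    data[:4].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# A stream reader keeps undecoded trailing bytes until the next read.
reader = codecs.getreader('utf-8')(io.BytesIO(data))
pieces = []
while True:
    piece = reader.read(4)       # ask for roughly 4 bytes at a time
    if not piece:
        break
    pieces.append(piece)
result = u''.join(pieces)
```

The stream reader may return more or fewer characters than requested per call, but the joined result matches the original text.
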
 The function's parameters are 
 ``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.  ``mode`` can be
 ``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
 regular built-in ``open()`` function; add a ``'+'`` to 
 update the file.  ``buffering`` is similarly
 parallel to the standard function's parameter.  
 ``encoding`` is a string giving 
 the encoding to use; if it's left as ``None``, a regular Python file
 object that accepts 8-bit strings is returned.  Otherwise, a wrapper
 object is returned, and data written to or read from the wrapper
 object will be converted as needed.  ``errors`` specifies the action
 for encoding errors and can be one of the usual values of 'strict',
 'ignore', and 'replace'.
 Reading Unicode from a file is therefore simple::
    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)
 It's also possible to open files in update mode, 
 allowing both reading and writing::
    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()
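
To see the conversion that the wrapper performs, you can write through ``codecs.open()`` and then read the raw bytes back with the regular ``open()``.  This is a sketch under my own assumptions: the temporary file created with ``tempfile.mkstemp()``, the Latin-1 encoding, and the sample text are all chosen for illustration.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Write a Unicode string; the wrapper encodes it as Latin-1.
    f = codecs.open(path, encoding='latin-1', mode='w')
    f.write(u'caf\xe9\n')
    f.close()

    # The regular open() shows the encoded bytes stored on disk.
    raw_file = open(path, 'rb')
    raw = raw_file.read()
    raw_file.close()

    # Reading through codecs.open() decodes back to Unicode.
    f = codecs.open(path, encoding='latin-1')
    decoded = f.read()
    f.close()
finally:
    os.remove(path)
```
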
 Unicode character U+FEFF is used as a byte-order mark (BOM), 
 and is often written as the first character of a file in order
 to assist with autodetection of the file's byte ordering.
 Some encodings, such as UTF-16, expect a BOM to be present at 
 the start of a file; when such an encoding is used,
 the BOM will be automatically written as the first character 
 and will be silently dropped when the file is read.  There are 
 variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
 for little-endian and big-endian encodings, that specify 
 one particular byte ordering and don't
 skip the BOM.
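
The BOM behaviour can be observed without touching a file by encoding and decoding strings directly.  The following sketch uses the ``codecs.BOM_UTF16`` and ``codecs.BOM_UTF16_LE`` constants; the variable names are illustrative.

```python
import codecs

text = u'abc'

with_bom = text.encode('utf-16')       # BOM in native byte order, then data
le_only = text.encode('utf-16-le')     # little-endian data, no BOM

starts_with_bom = with_bom.startswith(codecs.BOM_UTF16)      # True
le_has_bom = le_only.startswith(codecs.BOM_UTF16_LE)         # False

# 'utf-16' consumes the BOM when decoding...
round_trip = with_bom.decode('utf-16')                       # u'abc'

# ...while 'utf-16-le' treats it as an ordinary U+FEFF character.
kept = (codecs.BOM_UTF16_LE + le_only).decode('utf-16-le')   # u'\ufeffabc'
```
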
 Unicode filenames
 '''''''''''''''''''''''''
 Most of the operating systems in common use today support filenames
 that contain arbitrary Unicode characters.  Usually this is
 implemented by converting the Unicode string into some encoding that
 varies depending on the system.  For example, Mac OS X uses UTF-8 while
 Windows uses a configurable encoding; on Windows, Python uses the name
 "mbcs" to refer to whatever the currently configured encoding is.  On
 Unix systems, there will only be a filesystem encoding if you've set
 the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
 the default encoding is ASCII.
 The ``sys.getfilesystemencoding()`` function returns the encoding to
 use on your current system, in case you want to do the encoding
 manually, but there's not much reason to bother.  When opening a file
 for reading or writing, you can usually just provide the Unicode
 string as the filename, and it will be automatically converted to the
 right encoding for you::
    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()
 Functions in the ``os`` module such as ``os.stat()`` will also accept
 Unicode filenames.
 ``os.listdir()``, which returns filenames, raises an issue: should it
 return the Unicode version of filenames, or should it return 8-bit
 strings containing the encoded versions?  ``os.listdir()`` will do
 both, depending on whether you provided the directory path as an 8-bit
 string or a Unicode string.  If you pass a Unicode string as the path,
 filenames will be decoded using the filesystem's encoding and a list
 of Unicode strings will be returned, while passing an 8-bit path will
 return the 8-bit versions of the filenames.  For example, assuming the
 default filesystem encoding is UTF-8, running the following program::
    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()
    import os
    print os.listdir('.')
    print os.listdir(u'.')
 will produce the following output::
    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]
 The first list contains UTF-8-encoded filenames, and the second list
 contains the Unicode versions.
 Tips for Writing Unicode-aware Programs
 ''''''''''''''''''''''''''''''''''''''''''''
 This section provides some suggestions on writing software that 
 deals with Unicode.
 The most important tip is: 
    Software should only work with Unicode strings internally, 
    converting to a particular encoding on output.  
 If you attempt to write processing functions that accept both 
 Unicode and 8-bit strings, you will find your program vulnerable to 
 bugs wherever you combine the two different kinds of strings.  Python's 
 default encoding is ASCII, so whenever a byte with a value >127
 is in the input data, you'll get a ``UnicodeDecodeError``
 because that byte can't be handled by the ASCII encoding.  
 It's easy to miss such problems if you only test your software 
 with data that doesn't contain any 
 accents; everything will seem to work, but there's actually a bug in your
 program waiting for the first user who attempts to use characters >127.
 A second tip, therefore, is:
    Include characters >127 and, even better, characters >255 in your
    test data.
 When using data coming from a web browser or some other untrusted source,
 a common technique is to check for illegal characters in a string
 before using the string in a generated command line or storing it in a 
 database.  If you're doing this, be careful to check 
 the string once it's in the form that will be used or stored; it's 
 possible for encodings to be used to disguise characters.  This is especially
 true if the input data also specifies the encoding; 
 many encodings leave the commonly checked-for characters alone, 
 but Python includes some encodings such as ``'base64'``
 that modify every single character.
 For example, let's say you have a content management system that takes a 
 Unicode filename, and you want to disallow paths with a '/' character.
 You might write this code::
    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...
 However, if an attacker could specify the ``'base64'`` encoding,
 they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
 encoded form of the string ``'/etc/passwd'``, to read a 
 system file.   The above code looks for ``'/'`` characters 
 in the encoded form and misses the dangerous character 
 in the resulting decoded form.
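
One way to avoid this, sketched below, is to decode first and only then check the result.  The ``decode_filename()`` helper is a hypothetical name, and using ``codecs.decode()`` (rather than the string's ``.decode()`` method) is my choice so that the ``'base64'`` case behaves the same on later Python versions.  In a real application you would probably also restrict which encodings a caller may name.

```python
import codecs

def decode_filename(raw_name, encoding):
    """Decode raw_name, then reject any result containing '/'."""
    decoded = codecs.decode(raw_name, encoding)
    # Compare against a slash of the same type as the decoded value.
    slash = b'/' if isinstance(decoded, bytes) else u'/'
    if slash in decoded:
        raise ValueError("'/' not allowed in filenames")
    return decoded

# The base-64 form of '/etc/passwd' is now caught after decoding.
try:
    decode_filename(b'L2V0Yy9wYXNzd2Q=', 'base64')
    rejected = False
except ValueError:
    rejected = True

# An ordinary name passes through unchanged.
accepted = decode_filename(b'report.txt', 'utf-8')
```
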
 References
 ''''''''''''''
 The PDF slides for Marc-André Lemburg's presentation "Writing
 Unicode-aware Applications in Python" are available at
 <http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
 and discuss questions of character encodings as well as how to
 internationalize and localize an application.
 Revision History and Acknowledgements
 ------------------------------------------
 Thanks to the following people who have noted errors or offered
 suggestions on this article: Nicholas Bastin, 
 Marius Gedminas, Kent Johnson, Ken Krugler,
 Marc-André Lemburg, Martin von Löwis.
 Version 1.0: posted August 5 2005.
 Version 1.01: posted August 7 2005.  Corrects factual and markup
 errors; adds several links.
 Version 1.02: posted August 16 2005.  Corrects factual errors.
 .. comment Additional topic: building Python w/ UCS2 or UCS4 support
 .. comment Describe obscure -U switch somewhere?
 .. comment 
   Original outline:
   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
 	   - [ ] Character
 	   - [ ] Code point
 	 - [ ] Encodings
 	    - [ ] Common encodings: ASCII, Latin-1, UTF-8
       - [ ] Unicode Python type
 	   - [ ] Writing unicode literals
 	       - [ ] Obscurity: -U switch
 	   - [ ] Built-ins
 	       - [ ] unichr()
 	       - [ ] ord()
 	       - [ ] unicode() constructor
 	   - [ ] Unicode type
 	       - [ ] encode(), decode() methods
       - [ ] Unicodedata module for character properties
       - [ ] I/O
 	   - [ ] Reading/writing Unicode data into files
 	       - [ ] Byte-order marks
 	   - [ ] Unicode filenames
       - [ ] Writing Unicode programs
 	   - [ ] Do everything in Unicode
 	   - [ ] Declaring source code encodings (PEP 263)
       - [ ] Other issues
 	   - [ ] Building Python (UCS2, UCS4)