mirror of
https://github.com/python/cpython.git
synced 2025-11-03 03:22:27 +00:00
Commit the howto source to the main Python repository, with Fred's approval
parent f1b2ba6aa1
commit e8f44d683e
9 changed files with 4340 additions and 0 deletions
88
Doc/howto/Makefile
Normal file
@ -0,0 +1,88 @@

MKHOWTO=../tools/mkhowto
WEBDIR=.
RSTARGS = --input-encoding=utf-8
VPATH=.:dvi:pdf:ps:txt

# List of HOWTOs that aren't to be processed

REMOVE_HOWTO =

# Determine list of files to be built

HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
RST_SOURCES = $(shell echo *.rst)
DVI =$(patsubst %.tex,%.dvi,$(HOWTO))
PDF =$(patsubst %.tex,%.pdf,$(HOWTO))
PS =$(patsubst %.tex,%.ps,$(HOWTO))
TXT =$(patsubst %.tex,%.txt,$(HOWTO))
HTML =$(patsubst %.tex,%,$(HOWTO))

# Rules for building various formats
%.dvi : %.tex
	$(MKHOWTO) --dvi $<
	mv $@ dvi

%.pdf : %.tex
	$(MKHOWTO) --pdf $<
	mv $@ pdf

%.ps : %.tex
	$(MKHOWTO) --ps $<
	mv $@ ps

%.txt : %.tex
	$(MKHOWTO) --text $<
	mv $@ txt

% : %.tex
	$(MKHOWTO) --html --iconserver="." $<
	tar -zcvf html/$*.tgz $*
	#zip -r html/$*.zip $*

default:
	@echo "'all' -- build all files"
	@echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"

all: $(HTML)

.PHONY : dvi pdf ps txt html rst
dvi: $(DVI)

pdf: $(PDF)
ps: $(PS)
txt: $(TXT)
html:$(HTML)

# Rule to build collected tar files
dist: #all
	for i in dvi pdf ps txt ; do \
	    cd $$i ; \
	    tar -zcf All.tgz *.$$i ;\
	    cd .. ;\
	done

# Rule to copy files to the Web tree on AMK's machine
web: dist
	cp dvi/* $(WEBDIR)/dvi
	cp ps/* $(WEBDIR)/ps
	cp pdf/* $(WEBDIR)/pdf
	cp txt/* $(WEBDIR)/txt
	for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
	for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done

rst: unicode.html

%.html: %.rst
	rst2html $(RSTARGS) $< >$@

clean:
	rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
	rm -f *.dvi *.ps *.pdf *.bkm
	rm -f unicode.html

clobber:
	rm dvi/* ps/* pdf/* txt/* html/*

405
Doc/howto/advocacy.tex
Normal file
@ -0,0 +1,405 @@

\documentclass{howto}

\title{Python Advocacy HOWTO}

\release{0.03}

\author{A.M. Kuchling}
\authoraddress{\email{amk@amk.ca}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
It's usually difficult to get your management to accept open source
software, and Python is no exception to this rule. This document
discusses reasons to use Python, strategies for winning acceptance,
facts and arguments you can use, and cases where you \emph{shouldn't}
try to use Python.

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.

\end{abstract}

\tableofcontents

\section{Reasons to Use Python}

There are several reasons to incorporate a scripting language into
your development process, and this section will discuss them, and why
Python has some properties that make it a particularly good choice.

\subsection{Programmability}

Programs are often organized in a modular fashion. Lower-level
operations are grouped together, and called by higher-level functions,
which may in turn be used as basic operations by still further upper
levels.

For example, the lowest level might define a very low-level
set of functions for accessing a hash table. The next level might use
hash tables to store the headers of a mail message, mapping a header
name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
20:00:54 -0400}. A yet higher level may operate on message objects,
without knowing or caring that message headers are stored in a hash
table, and so forth.
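
As a rough sketch of this layering, a Python dictionary can play the
role of the hash table. The \class{Message} class and the
\function{was_sent_in()} helper are hypothetical names invented for this
illustration; they are not part of any library.

```python
class Message:
    """Middle level: a mail message that stores its headers in a dict."""

    def __init__(self):
        # Lowest level: Python's built-in hash table.
        self._headers = {}

    def set_header(self, name, value):
        # Header names are treated as case-insensitive.
        self._headers[name.lower()] = value

    def get_header(self, name, default=None):
        return self._headers.get(name.lower(), default)


def was_sent_in(message, year):
    """Highest level: works on message objects only.

    This function neither knows nor cares how the headers are stored.
    """
    date = message.get_header("Date", "")
    return str(year) in date.split()


msg = Message()
msg.set_header("Date", "Tue, 13 May 1997 20:00:54 -0400")
sent_in_1997 = was_sent_in(msg, 1997)
```

Swapping the dictionary for some other storage would only touch the
\class{Message} class; the higher-level function would stay unchanged.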

Often, the lowest levels do very simple things; they implement a data
structure such as a binary tree or hash table, or they perform some
simple computation, such as converting a date string to a number. The
higher levels then contain logic connecting these primitive
operations. Using this approach, the primitives can be seen as basic
building blocks which are then glued together to produce the complete
product.

Why is this design approach relevant to Python? Because Python is
well suited to functioning as such a glue language. A common approach
is to write a Python module that implements the lower level
operations; for the sake of speed, the implementation might be in C,
Java, or even Fortran. Once the primitives are available to Python
programs, the logic underlying higher level operations is written in
the form of Python code. The high-level logic is then more
understandable, and easier to modify.

John Ousterhout wrote a paper that explains this idea at greater
length, entitled ``Scripting: Higher Level Programming for the 21st
Century''. I recommend that you read this paper; see the references
for the URL. Ousterhout is the inventor of the Tcl language, and
therefore argues that Tcl should be used for this purpose; he only
briefly refers to other languages such as Python, Perl, and
Lisp/Scheme, but in reality, Ousterhout's argument applies to
scripting languages in general, since you could equally write
extensions for any of the languages mentioned above.

\subsection{Prototyping}

In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
following rule when planning software projects: ``Plan to throw one
away; you will anyway.'' Brooks is saying that the first attempt at a
software design often turns out to be wrong; unless the problem is
very simple or you're an extremely good designer, you'll find that new
requirements and features become apparent once development has
actually started. If these new requirements can't be cleanly
incorporated into the program's structure, you're presented with two
unpleasant choices: hammer the new features into the program somehow,
or scrap everything and write a new version of the program, taking the
new features into account from the beginning.

Python provides you with a good environment for quickly developing an
initial prototype. That lets you get the overall program structure
and logic right, and you can fine-tune small details in the fast
development cycle that Python provides. Once you're satisfied with
the GUI interface or program output, you can translate the Python code
into C++, Fortran, Java, or some other compiled language.

Prototyping means you have to be careful not to use too many Python
features that are hard to implement in your other language. Using
\code{eval()}, or regular expressions, or the \module{pickle} module,
means that you're going to need C or Java libraries for formula
evaluation, regular expressions, and serialization, for example. But
it's not hard to avoid such tricky code, and in the end the
translation usually isn't very difficult. The resulting code can be
rapidly debugged, because any serious logical errors will have been
removed from the prototype, leaving only more minor slip-ups in the
translation to track down.

This strategy builds on the earlier discussion of programmability.
Using Python as glue to connect lower-level components has obvious
relevance for constructing prototype systems. In this way Python can
help you with development, even if end users never come in contact
with Python code at all. If the performance of the Python version is
adequate and corporate politics allow it, you may not need to do a
translation into C or Java, but it can still be faster to develop a
prototype and then translate it, instead of attempting to produce the
final version immediately.

One example of this development strategy is Microsoft Merchant Server.
Version 1.0 was written in pure Python, by a company that subsequently
was purchased by Microsoft. Version 2.0 began to translate the code
into \Cpp, shipping with some \Cpp code and some Python code. Version
3.0 didn't contain any Python at all; all the code had been translated
into \Cpp. Even though the product doesn't contain a Python
interpreter, the Python language has still served a useful purpose by
speeding up development.

This is a very common use for Python. Past conference papers have
also described this approach for developing high-level numerical
algorithms; see David M. Beazley and Peter S. Lomdahl's paper
``Feeding a Large-scale Physics Application to Python'' in the
references for a good example. If an algorithm's basic operations are
things like "Take the inverse of this 4000x4000 matrix", and are
implemented in some lower-level language, then Python has almost no
additional performance cost; the extra time required for Python to
evaluate an expression like \code{m.invert()} is dwarfed by the cost
of the actual computation. It's particularly good for applications
where seemingly endless tweaking is required to get things right. GUI
interfaces and Web sites are prime examples.

The Python code is also shorter and faster to write (once you're
familiar with Python), so it's easier to throw it away if you decide
your approach was wrong; if you'd spent two weeks working on it
instead of just two hours, you might waste time trying to patch up
what you've got out of a natural reluctance to admit that those two
weeks were wasted. Truthfully, those two weeks haven't been wasted,
since you've learnt something about the problem and the technology
you're using to solve it, but it's human nature to view this as a
failure of some sort.

\subsection{Simplicity and Ease of Understanding}

Python is definitely \emph{not} a toy language that's only usable for
small tasks. The language features are general and powerful enough to
enable it to be used for many different purposes. It's useful at the
small end, for 10- or 20-line scripts, but it also scales up to larger
systems that contain thousands of lines of code.

However, this expressiveness doesn't come at the cost of an obscure or
tricky syntax. While Python has some dark corners that can lead to
obscure code, there are relatively few such corners, and proper design
can isolate their use to only a few classes or modules. It's
certainly possible to write confusing code by using too many features
with too little concern for clarity, but most Python code can look a
lot like a slightly-formalized version of human-understandable
pseudocode.

In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
definition for "compact":

\begin{quotation}
Compact \emph{adj.} Of a design, describes the valuable property
that it can all be apprehended at once in one's head. This
generally means the thing created from the design can be used
with greater facility and fewer errors than an equivalent tool
that is not compact. Compactness does not imply triviality or
lack of power; for example, C is compact and FORTRAN is not,
but C is more powerful than FORTRAN. Designs become
non-compact through accreting features and cruft that don't
merge cleanly into the overall design scheme (thus, some fans
of Classic C maintain that ANSI C is no longer compact).
\end{quotation}

(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})

In this sense of the word, Python is quite compact, because the
language has just a few ideas, which are used in lots of places. Take
namespaces, for example. Import a module with \code{import math}, and
you create a new namespace called \samp{math}. Classes are also
namespaces that share many of the properties of modules, and have a
few of their own; for example, you can create instances of a class.
Instances? They're yet another namespace. Namespaces are currently
implemented as Python dictionaries, so they have the same methods as
the standard dictionary data type: .keys() returns all the keys, and
so forth.
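
The following sketch shows the three kinds of namespace side by side.
It assumes a current Python 3 interpreter and uses the built-in
\function{vars()} function to reach each namespace's dictionary; the
\class{Point} class is a hypothetical example. How namespaces are
stored is an implementation detail, so treat this as an illustration
rather than a guarantee.

```python
import math


class Point:
    dimensions = 2          # lives in the class namespace

    def __init__(self, x, y):
        self.x = x          # lives in the instance namespace
        self.y = y


p = Point(3, 4)

# Each namespace exposes a dictionary-like mapping of names to objects.
module_names = vars(math).keys()     # names defined by the math module
class_names = vars(Point).keys()     # names defined in the class body
instance_names = vars(p).keys()      # attributes set on this instance

print(sorted(instance_names))
print("sqrt" in module_names, "dimensions" in class_names)
```

The same \method{keys()} call works on all three, which is the point of
the compactness argument: one idea, reused in several places.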

This simplicity arises from Python's development history. The
language syntax derives from different sources; ABC, a relatively
obscure teaching language, is one primary influence, and Modula-3 is
another. (For more information about ABC and Modula-3, consult their
respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
\url{http://www.m3.org}.) Other features have come from C, Icon,
Algol-68, and even Perl. Python hasn't really innovated very much,
but instead has tried to keep the language small and easy to learn,
building on ideas that have been tried in other languages and found
useful.

Simplicity is a virtue that should not be underestimated. It lets you
learn the language more quickly, and then rapidly write code, code
that often works the first time you run it.

\subsection{Java Integration}

If you're working with Java, Jython
(\url{http://www.jython.org/}) is definitely worth your
attention. Jython is a re-implementation of Python in Java that
compiles Python code into Java bytecodes. The resulting environment
has very tight, almost seamless, integration with Java. It's trivial
to access Java classes from Python, and you can write Python classes
that subclass Java classes. Jython can be used for prototyping Java
applications in much the same way CPython is used, and it can also be
used for test suites for Java code, or embedded in a Java application
to add scripting capabilities.

\section{Arguments and Rebuttals}

Let's say that you've decided upon Python as the best choice for your
application. How can you convince your management, or your fellow
developers, to use Python? This section lists some common arguments
against using Python, and provides some possible rebuttals.

\emph{Python is freely available software that doesn't cost anything.
How good can it be?}

Very good, indeed. These days Linux and Apache, two other pieces of
open source software, are becoming more respected as alternatives to
commercial software, but Python hasn't had all the publicity.

Python has been around for several years, with many users and
developers. Accordingly, the interpreter has been used by many
people, and has gotten most of the bugs shaken out of it. While bugs
are still discovered at intervals, they're usually either quite
obscure (they'd have to be, for no one to have run into them before)
or they involve interfaces to external libraries. The internals of
the language itself are quite stable.

Having the source code should be viewed as making the software
available for peer review; people can examine the code, suggest (and
implement) improvements, and track down bugs. To find out more about
the idea of open source code, along with arguments and case studies
supporting it, go to \url{http://www.opensource.org}.

\emph{Who's going to support it?}

Python has a sizable community of developers, and the number is still
growing. The Internet community surrounding the language is an active
one, and is worth being considered another one of Python's advantages.
Most questions posted to the comp.lang.python newsgroup are quickly
answered by someone.

Should you need to dig into the source code, you'll find it's clear
and well-organized, so it's not very difficult to write extensions and
track down bugs yourself. If you'd prefer to pay for support, there
are companies and individuals who offer commercial support for Python.

\emph{Who uses Python for serious work?}

Lots of people; one interesting thing about Python is the surprising
diversity of applications that it's been used for. People are using
Python to:

\begin{itemize}
\item Run Web sites
\item Write GUI interfaces
\item Control
number-crunching code on supercomputers
\item Make a commercial application scriptable by embedding the Python
interpreter inside it
\item Process large XML data sets
\item Build test suites for C or Java code
\end{itemize}

Whatever your application domain is, there's probably someone who's
used Python for something similar. Yet, despite being usable for
such high-end applications, Python's still simple enough to use for
little jobs.

See \url{http://www.python.org/psa/Users.html} for a list of some of the
organizations that use Python.

\emph{What are the restrictions on Python's use?}

They're practically nonexistent. Consult the \file{Misc/COPYRIGHT}
file in the source distribution, or
\url{http://www.python.org/doc/Copyright.html} for the full language,
but it boils down to three conditions.

\begin{itemize}

\item You have to leave the copyright notice on the software; if you
don't include the source code in a product, you have to put the
copyright notice in the supporting documentation.

\item Don't claim that the institutions that have developed Python
endorse your product in any way.

\item If something goes wrong, you can't sue for damages. Practically
all software licences contain this condition.

\end{itemize}

Notice that you don't have to provide source code for anything that
contains Python or is built with it. Also, the Python interpreter and
accompanying documentation can be modified and redistributed in any
way you like, and you don't have to pay anyone any licensing fees at
all.

\emph{Why should we use an obscure language like Python instead of
well-known language X?}

I hope this HOWTO, and the documents listed in the final section, will
help convince you that Python isn't obscure, and has a healthily
growing user base. One word of advice: always present Python's
positive advantages, instead of concentrating on language X's
failings. People want to know why a solution is good, rather than why
all the other solutions are bad. So instead of attacking a competing
solution on various grounds, simply show how Python's virtues can
help.


\section{Useful Resources}

\begin{definitions}

\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}

The first chapter of \emph{Internet Programming with Python} also
examines some of the reasons for using Python. The book is well worth
buying, but the publishers have made the first chapter available on
the Web.

\term{\url{http://home.pacbell.net/ouster/scripting.html}}

John Ousterhout's white paper on scripting is a good argument for the
utility of scripting languages, though naturally enough, he emphasizes
Tcl, the language he developed. Most of the arguments would apply to
any scripting language.

\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}

The authors, David M. Beazley and Peter S. Lomdahl,
describe their use of Python at Los Alamos National Laboratory.
It's another good example of how Python can help get real work done.
This quotation from the paper has been echoed by many people:

\begin{quotation}
Originally developed as a large monolithic application for
massively parallel processing systems, we have used Python to
transform our application into a flexible, highly modular, and
extremely powerful system for performing simulation, data
analysis, and visualization. In addition, we describe how Python
has solved a number of important problems related to the
development, debugging, deployment, and maintenance of scientific
software.
\end{quotation}

%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}

%This interview with Andy Feit, discussing Infoseek's use of Python, can be
%used to show that choosing Python didn't introduce any difficulties
%into a company's development process, and provided some substantial benefits.

\term{\url{http://www.python.org/psa/Commercial.html}}

Robin Friedrich wrote this document on how to support Python's use in
commercial projects.

\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}

For the 6th Python conference, Greg Stein presented a paper that
traced Python's adoption and usage at a startup called eShop, and
later at Microsoft.

\term{\url{http://www.opensource.org}}

Management may be doubtful of the reliability and usefulness of
software that wasn't written commercially. This site presents
arguments that show how open source software can have considerable
advantages over closed-source software.

\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}

The Linux Advocacy mini-HOWTO was the inspiration for this document,
and is also well worth reading for general suggestions on winning
acceptance for a new technology, such as Linux or Python. In general,
you won't make much progress by simply attacking existing systems and
complaining about their inadequacies; this often ends up looking like
unfocused whining. It's much better to point out some of the many
areas where Python is an improvement over other systems.

\end{definitions}

\end{document}

485
Doc/howto/curses.tex
Normal file
@ -0,0 +1,485 @@
\documentclass{howto}

\title{Curses Programming with Python}

\release{2.01}

\author{A.M. Kuchling, Eric S. Raymond}
\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}

\begin{document}
\maketitle

\begin{abstract}
\noindent
This document describes how to write text-mode programs with Python 2.x,
using the \module{curses} extension module to control the display.

This document is available from the Python HOWTO page at
\url{http://www.python.org/doc/howto}.
\end{abstract}

\tableofcontents

\section{What is curses?}

The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals
include VT100s, the Linux console, and the simulated terminal provided
by X11 programs such as xterm and rxvt. Display terminals support
various control codes to perform common operations such as moving the
cursor, scrolling the screen, and erasing areas. Different terminals
use widely differing codes, and often have their own minor quirks.

In a world of X displays, one might ask ``why bother''? It's true
that character-cell display terminals are an obsolete technology, but
there are niches in which being able to do fancy things with them is
still valuable. One is on small-footprint or embedded Unixes that
don't carry an X server. Another is for tools like OS installers
and kernel configurators that may have to run before X is available.

The curses library hides all the details of different terminals, and
provides the programmer with an abstraction of a display, containing
multiple non-overlapping windows. The contents of a window can be
changed in various ways--adding text, erasing it, changing its
appearance--and the curses library will automagically figure out what
control codes need to be sent to the terminal to produce the right
output.

The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT\&T added many enhancements and new functions.
BSD curses is no longer maintained, having been replaced by ncurses,
which is an open-source implementation of the AT\&T interface. If you're
using an open-source Unix such as Linux or FreeBSD, your system almost
certainly uses ncurses. Since most current commercial Unix versions
are based on System V code, all the functions described here will
probably be available. The older versions of curses carried by some
proprietary Unixes may not support everything, though.

No one has made a Windows port of the curses module. On a Windows
platform, try the Console module written by Fredrik Lundh. The
Console module provides cursor-addressable text output, plus full
support for mouse and keyboard input, and is available from
\url{http://effbot.org/efflib/console}.

\subsection{The Python curses module}

The Python module is a fairly simple wrapper over the C functions
provided by curses; if you're already familiar with curses programming
in C, it's really easy to transfer that knowledge to Python. The
biggest difference is that the Python interface makes things simpler,
by merging different C functions such as \function{addstr},
\function{mvaddstr}, \function{mvwaddstr}, into a single
\method{addstr()} method. You'll see this covered in more detail
later.

This HOWTO is simply an introduction to writing text-mode programs
with curses and Python. It doesn't attempt to be a complete guide to
the curses API; for that, see the Python library guide's section on
ncurses, and the C manual pages for ncurses. It will, however, give
you the basic ideas.

\section{Starting and ending a curses application}

Before doing anything, curses must be initialized. This is done by
calling the \function{initscr()} function, which will determine the
terminal type, send any required setup codes to the terminal, and
create various internal data structures. If successful,
\function{initscr()} returns a window object representing the entire
screen; this is usually called \code{stdscr}, after the name of the
corresponding C
variable.

\begin{verbatim}
import curses
stdscr = curses.initscr()
\end{verbatim}

Usually curses applications turn off automatic echoing of keys to the
screen, in order to be able to read keys and only display them under
certain circumstances. This requires calling the \function{noecho()}
function.

\begin{verbatim}
curses.noecho()
\end{verbatim}

Applications will also commonly need to react to keys instantly,
without requiring the Enter key to be pressed; this is called cbreak
mode, as opposed to the usual buffered input mode.

\begin{verbatim}
curses.cbreak()
\end{verbatim}

Terminals usually return special keys, such as the cursor keys or
navigation keys such as Page Up and Home, as a multibyte escape
sequence. While you could write your application to expect such
sequences and process them accordingly, curses can do it for you,
returning a special value such as \constant{curses.KEY_LEFT}. To get
curses to do the job, you'll have to enable keypad mode.

\begin{verbatim}
stdscr.keypad(1)
\end{verbatim}
|
||||||
|
|
||||||
|
Terminating a curses application is much easier than starting one.
|
||||||
|
You'll need to call
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
curses.nocbreak(); stdscr.keypad(0); curses.echo()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
to reverse the curses-friendly terminal settings. Then call the
|
||||||
|
\function{endwin()} function to restore the terminal to its original
|
||||||
|
operating mode.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
curses.endwin()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
A common problem when debugging a curses application is to get your
|
||||||
|
terminal messed up when the application dies without restoring the
|
||||||
|
terminal to its previous state. In Python this commonly happens when
|
||||||
|
your code is buggy and raises an uncaught exception. Keys are no
|
||||||
|
longer echoed to the screen when you type them, for example, which
|
||||||
|
makes using the shell difficult.
|
||||||
|
|
||||||
|
In Python you can avoid these complications and make debugging much
|
||||||
|
easier by importing the module \module{curses.wrapper}. It supplies a
|
||||||
|
function \function{wrapper} that takes a hook argument. It does the
|
||||||
|
initializations described above, and also initializes colors if color
|
||||||
|
support is present. It then runs your hook, and then finally
|
||||||
|
deinitializes appropriately. The hook is called inside a try-except
|
||||||
|
clause which catches exceptions, performs curses deinitialization, and
|
||||||
|
then passes the exception upwards. Thus, your terminal won't be left
|
||||||
|
in a funny state on exception.
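As a minimal sketch of the pattern just described, assuming a modern
Python 3 in which the function is reachable directly as
\function{curses.wrapper()} (the hook name \function{main} and the
message text are illustrative choices, not part of the library):

```python
import curses


def main(stdscr):
    # By the time this hook runs, wrapper() has already called
    # initscr(), noecho(), cbreak() and stdscr.keypad(True).
    stdscr.addstr(0, 0, "Press any key to quit")
    stdscr.refresh()
    return stdscr.getch()   # wrapper() hands back the hook's return value
```

From a real terminal, start the program with
\code{curses.wrapper(main)}; if \function{main} raises an exception,
the terminal is restored before the traceback is printed.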
|
||||||
|
|
||||||
|
\section{Windows and Pads}
|
||||||
|
|
||||||
|
Windows are the basic abstraction in curses. A window object
|
||||||
|
represents a rectangular area of the screen, and supports various
|
||||||
|
methods to display text, erase it, allow the user to input strings,
|
||||||
|
and so forth.
|
||||||
|
|
||||||
|
The \code{stdscr} object returned by the \function{initscr()} function
|
||||||
|
is a window object that covers the entire screen. Many programs may
|
||||||
|
need only this single window, but you might wish to divide the screen
|
||||||
|
into smaller windows, in order to redraw or clear them separately.
|
||||||
|
The \function{newwin()} function creates a new window of a given size,
|
||||||
|
returning the new window object.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
begin_x = 20 ; begin_y = 7
|
||||||
|
height = 5 ; width = 40
|
||||||
|
win = curses.newwin(height, width, begin_y, begin_x)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
A word about the coordinate system used in curses: coordinates are
|
||||||
|
always passed in the order \emph{y,x}, and the top-left corner of a
|
||||||
|
window is coordinate (0,0). This breaks a common convention for
|
||||||
|
handling coordinates, where the \emph{x} coordinate usually comes
|
||||||
|
first. This is an unfortunate difference from most other computer
|
||||||
|
applications, but it's been part of curses since it was first written,
|
||||||
|
and it's too late to change things now.
|
||||||
|
|
||||||
|
When you call a method to display or erase text, the effect doesn't
|
||||||
|
immediately show up on the display. This is because curses was
|
||||||
|
originally written with slow 300-baud terminal connections in mind;
|
||||||
|
with these terminals, minimizing the time required to redraw the
|
||||||
|
screen is very important. Deferring output lets curses accumulate changes to the
|
||||||
|
screen, and display them in the most efficient manner. For example,
|
||||||
|
if your program displays some characters in a window, and then clears
|
||||||
|
the window, there's no need to send the original characters because
|
||||||
|
they'd never be visible.
|
||||||
|
|
||||||
|
Accordingly, curses requires that you explicitly tell it to redraw
|
||||||
|
windows, using the \function{refresh()} method of window objects. In
|
||||||
|
practice, this doesn't really complicate programming with curses much.
|
||||||
|
Most programs go into a flurry of activity, and then pause waiting for
|
||||||
|
a keypress or some other action on the part of the user. All you have
|
||||||
|
to do is to be sure that the screen has been redrawn before pausing to
|
||||||
|
wait for user input, by simply calling \code{stdscr.refresh()} or the
|
||||||
|
\function{refresh()} method of some other relevant window.
|
||||||
|
|
||||||
|
A pad is a special case of a window; it can be larger than the actual
|
||||||
|
display screen, and only a portion of it is displayed at a time.
|
||||||
|
Creating a pad simply requires the pad's height and width, while
|
||||||
|
refreshing a pad requires giving the coordinates of the on-screen
|
||||||
|
area where a subsection of the pad will be displayed.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
pad = curses.newpad(100, 100)
|
||||||
|
# These loops fill the pad with letters; this is
|
||||||
|
# explained in the next section
|
||||||
|
for y in range(0, 100):
    for x in range(0, 100):
        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
        except curses.error: pass
|
||||||
|
|
||||||
|
# Displays a section of the pad in the middle of the screen
|
||||||
|
pad.refresh( 0,0, 5,5, 20,75)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The \function{refresh()} call displays a section of the pad in the
|
||||||
|
rectangle extending from coordinate (5,5) to coordinate (20,75) on the
|
||||||
|
screen; the upper left corner of the displayed section is coordinate
|
||||||
|
(0,0) on the pad. Beyond that difference, pads are exactly like
|
||||||
|
ordinary windows and support the same methods.
|
||||||
|
|
||||||
|
If you have multiple windows and pads on screen there is a more
|
||||||
|
efficient way to go, which will prevent annoying screen flicker at
|
||||||
|
refresh time. Use the \method{noutrefresh()} method
of each window to update the data structure
|
||||||
|
representing the desired state of the screen; then change the physical
|
||||||
|
screen to match the desired state in one go with the function
|
||||||
|
\function{doupdate()}. The normal \method{refresh()} method calls
|
||||||
|
\function{doupdate()} as its last act.
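A short sketch of this two-step update, assuming Python 3. The helper
name \function{redraw_all} and its \var{doupdate} parameter are
hypothetical; the parameter exists only so the helper can be exercised
without a terminal, and it defaults to the real
\function{curses.doupdate()}:

```python
import curses


def redraw_all(windows, doupdate=curses.doupdate):
    # Stage every window's contents in the virtual screen without
    # touching the terminal...
    for win in windows:
        win.noutrefresh()
    # ...then bring the physical screen up to date in a single pass.
    doupdate()
```

This works for ordinary windows; a pad's \method{noutrefresh()} takes
the same six coordinates as its \method{refresh()}, so pads would need
to be staged separately.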
|
||||||
|
|
||||||
|
\section{Displaying Text}
|
||||||
|
|
||||||
|
{}From a C programmer's point of view, curses may sometimes look like
|
||||||
|
a twisty maze of functions, all subtly different. For example,
|
||||||
|
\function{addstr()} displays a string at the current cursor location
|
||||||
|
in the \code{stdscr} window, while \function{mvaddstr()} moves to a
|
||||||
|
given y,x coordinate first before displaying the string.
|
||||||
|
\function{waddstr()} is just like \function{addstr()}, but allows
|
||||||
|
specifying a window to use, instead of using \code{stdscr} by default.
|
||||||
|
\function{mvwaddstr()} follows similarly.
|
||||||
|
|
||||||
|
Fortunately the Python interface hides all these details;
|
||||||
|
\code{stdscr} is a window object like any other, and methods like
|
||||||
|
\function{addstr()} accept multiple argument forms. Usually there are
|
||||||
|
four different forms.
|
||||||
|
|
||||||
|
\begin{tableii}{|c|l|}{textrm}{Form}{Description}
|
||||||
|
\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
|
||||||
|
character \var{ch}}
|
||||||
|
\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
|
||||||
|
character \var{ch}, using attribute \var{attr}}
|
||||||
|
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
|
||||||
|
{Move to position \var{y,x} within the window, and display \var{str}
|
||||||
|
or \var{ch}}
|
||||||
|
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
|
||||||
|
{Move to position \var{y,x} within the window, and display \var{str}
|
||||||
|
or \var{ch}, using attribute \var{attr}}
|
||||||
|
\end{tableii}
|
||||||
|
|
||||||
|
Attributes allow displaying text in highlighted forms, such as in
|
||||||
|
boldface, underline, reverse video, or in color. They'll be explained
|
||||||
|
in more detail in the next subsection.
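The four argument forms in the table can be sketched as follows,
assuming Python 3. The function name \function{show_forms} and the
strings are illustrative only; \var{win} may be \code{stdscr} or any
other window object:

```python
import curses


def show_forms(win):
    win.addstr("plain text at the cursor")            # str
    win.addstr(" (bold)", curses.A_BOLD)              # str, attr
    win.addstr(2, 0, "moved to row 2, column 0")      # y, x, str
    win.addstr(3, 0, "moved and underlined",
               curses.A_UNDERLINE)                    # y, x, str, attr
    win.refresh()
```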
|
||||||
|
|
||||||
|
The \function{addstr()} function takes a Python string as the value to
|
||||||
|
be displayed, while the \function{addch()} functions take a character,
|
||||||
|
which can be either a Python string of length 1, or an integer. If
|
||||||
|
it's a string, you're limited to displaying characters between 0 and
|
||||||
|
255. SVr4 curses provides constants for extension characters; these
|
||||||
|
constants are integers greater than 255. For example,
|
||||||
|
\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
|
||||||
|
the upper left corner of a box (handy for drawing borders).
|
||||||
|
|
||||||
|
Windows remember where the cursor was left after the last operation,
|
||||||
|
so if you leave out the \var{y,x} coordinates, the string or character
|
||||||
|
will be displayed wherever the last operation left off. You can also
|
||||||
|
move the cursor with the \function{move(\var{y,x})} method. Because
|
||||||
|
some terminals always display a flashing cursor, you may want to
|
||||||
|
ensure that the cursor is positioned in some location where it won't
|
||||||
|
be distracting; it can be confusing to have the cursor blinking at
|
||||||
|
some apparently random location.
|
||||||
|
|
||||||
|
If your application doesn't need a blinking cursor at all, you can
|
||||||
|
call \function{curs_set(0)} to make it invisible. Equivalently, and
|
||||||
|
for compatibility with older curses versions, there's a
|
||||||
|
\function{leaveok(\var{bool})} function. When \var{bool} is true, the
|
||||||
|
curses library will attempt to suppress the flashing cursor, and you
|
||||||
|
won't need to worry about leaving it in odd locations.
|
||||||
|
|
||||||
|
\subsection{Attributes and Color}
|
||||||
|
|
||||||
|
Characters can be displayed in different ways. Status lines in a
|
||||||
|
text-based application are commonly shown in reverse video; a text
|
||||||
|
viewer may need to highlight certain words. curses supports this by
|
||||||
|
allowing you to specify an attribute for each cell on the screen.
|
||||||
|
|
||||||
|
An attribute is an integer, each bit representing a different
|
||||||
|
attribute. You can try to display text with multiple attribute bits
|
||||||
|
set, but curses doesn't guarantee that all the possible combinations
|
||||||
|
are available, or that they're all visually distinct. That depends on
|
||||||
|
the ability of the terminal being used, so it's safest to stick to the
|
||||||
|
most commonly available attributes, listed here.
|
||||||
|
|
||||||
|
\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
|
||||||
|
\lineii{A_BLINK}{Blinking text}
|
||||||
|
\lineii{A_BOLD}{Extra bright or bold text}
|
||||||
|
\lineii{A_DIM}{Half bright text}
|
||||||
|
\lineii{A_REVERSE}{Reverse-video text}
|
||||||
|
\lineii{A_STANDOUT}{The best highlighting mode available}
|
||||||
|
\lineii{A_UNDERLINE}{Underlined text}
|
||||||
|
\end{tableii}
|
||||||
|
|
||||||
|
So, to display a reverse-video status line on the top line of the
|
||||||
|
screen,
|
||||||
|
you could code:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
stdscr.addstr(0, 0, "Current mode: Typing mode",
|
||||||
|
curses.A_REVERSE)
|
||||||
|
stdscr.refresh()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The curses library also supports color on those terminals that
|
||||||
|
provide it. The most common such terminal is probably the Linux
|
||||||
|
console, followed by color xterms.
|
||||||
|
|
||||||
|
To use color, you must call the \function{start_color()} function
|
||||||
|
soon after calling \function{initscr()}, to initialize the default
|
||||||
|
color set (the \function{curses.wrapper.wrapper()} function does this
|
||||||
|
automatically). Once that's done, the \function{has_colors()}
|
||||||
|
function returns TRUE if the terminal in use can actually display
|
||||||
|
color. (Note from AMK: curses uses the American spelling
|
||||||
|
'color', instead of the Canadian/British spelling 'colour'. If you're
|
||||||
|
like me, you'll have to resign yourself to misspelling it for the sake
|
||||||
|
of these functions.)
|
||||||
|
|
||||||
|
The curses library maintains a finite number of color pairs,
|
||||||
|
containing a foreground (or text) color and a background color. You
|
||||||
|
can get the attribute value corresponding to a color pair with the
|
||||||
|
\function{color_pair()} function; this can be bitwise-OR'ed with other
|
||||||
|
attributes such as \constant{A_REVERSE}, but again, such combinations
|
||||||
|
are not guaranteed to work on all terminals.
|
||||||
|
|
||||||
|
An example, which displays a line of text using color pair 1:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
stdscr.addstr( "Pretty text", curses.color_pair(1) )
|
||||||
|
stdscr.refresh()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
As I said before, a color pair consists of a foreground and
|
||||||
|
background color. \function{start_color()} initializes 8 basic
|
||||||
|
colors when it activates color mode. They are: 0:black, 1:red,
|
||||||
|
2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses
|
||||||
|
module defines named constants for each of these colors:
|
||||||
|
\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
|
||||||
|
forth.
|
||||||
|
|
||||||
|
The \function{init_pair(\var{n, f, b})} function changes the
|
||||||
|
definition of color pair \var{n}, to foreground color \var{f} and
|
||||||
|
background color \var{b}. Color pair 0 is hard-wired to white on black,
|
||||||
|
and cannot be changed.
|
||||||
|
|
||||||
|
Let's put all this together. To change color pair 1 to red
|
||||||
|
text on a white background, you would call:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
When you change a color pair, any text already displayed using that
|
||||||
|
color pair will change to the new colors. You can also display new
|
||||||
|
text in this color with:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
|
||||||
|
\end{verbatim}
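Since not every terminal supports color, a program may want a
monochrome fallback. The following is one possible sketch, assuming
Python 3 and assuming that \function{start_color()} has already been
called (for example by the wrapper function). The name
\function{red_alert} and the choice of \constant{A_REVERSE} as the
fallback are illustrative:

```python
import curses


def red_alert(stdscr):
    # start_color() is assumed to have been called already.
    if curses.has_colors():
        curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
        attr = curses.color_pair(1) | curses.A_BOLD
    else:
        attr = curses.A_REVERSE      # fall back on monochrome terminals
    stdscr.addstr(0, 0, "RED ALERT!", attr)
    stdscr.refresh()
```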
|
||||||
|
|
||||||
|
Very fancy terminals can change the definitions of the actual colors
|
||||||
|
to a given RGB value. This lets you change color 1, which is usually
|
||||||
|
red, to purple or blue or any other color you like. Unfortunately,
|
||||||
|
the Linux console doesn't support this, so I'm unable to try it out,
|
||||||
|
and can't provide any examples. You can check if your terminal can do
|
||||||
|
this by calling \function{can_change_color()}, which returns TRUE if
|
||||||
|
the capability is there. If you're lucky enough to have such a
|
||||||
|
talented terminal, consult your system's man pages for more
|
||||||
|
information.
|
||||||
|
|
||||||
|
\section{User Input}
|
||||||
|
|
||||||
|
The curses library itself offers only very simple input mechanisms.
|
||||||
|
Python's support adds a text-input widget that makes up for some of the
|
||||||
|
lack.
|
||||||
|
|
||||||
|
The most common way to get input to a window is to use its
|
||||||
|
\method{getch()} method, which pauses and waits for the user to hit
|
||||||
|
a key, displaying it if \function{echo()} has been called earlier.
|
||||||
|
You can optionally specify a coordinate to which the cursor should be
|
||||||
|
moved before pausing.
|
||||||
|
|
||||||
|
It's possible to change this behavior with the method
|
||||||
|
\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
|
||||||
|
the window becomes non-blocking and returns ERR (-1) when no input is
|
||||||
|
ready. There's also a \function{halfdelay()} function, which can be
|
||||||
|
used to (in effect) set a timer on each \method{getch()}; if no input
|
||||||
|
becomes available within the number of tenths of a second specified as the
argument to \function{halfdelay()}, curses raises an exception.
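A non-blocking read can be wrapped in a small helper. This is only a
sketch, assuming Python 3; the name \function{poll_key} is
hypothetical, and restoring blocking mode afterwards is a design
choice of this example rather than something curses requires:

```python
import curses


def poll_key(win):
    """Return the pending key code, or None if nothing is waiting."""
    win.nodelay(True)          # make getch() non-blocking for this window
    try:
        key = win.getch()
    finally:
        win.nodelay(False)     # restore blocking reads
    return None if key == curses.ERR else key
```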
|
||||||
|
|
||||||
|
The \method{getch()} method returns an integer; if it's between 0 and
|
||||||
|
255, it represents the ASCII code of the key pressed. Values greater
|
||||||
|
than 255 are special keys such as Page Up, Home, or the cursor keys.
|
||||||
|
You can compare the value returned to constants such as
|
||||||
|
\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
|
||||||
|
\constant{curses.KEY_LEFT}. Usually the main loop of your program
|
||||||
|
will look something like this:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The \module{curses.ascii} module supplies ASCII class membership
|
||||||
|
functions that take either integer or 1-character-string
|
||||||
|
arguments; these may be useful in writing more readable tests for
|
||||||
|
your command interpreters. It also supplies conversion functions
|
||||||
|
that take either integer or 1-character-string arguments and return
|
||||||
|
the same type. For example, \function{curses.ascii.ctrl()} returns
|
||||||
|
the control character corresponding to its argument.
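For instance, a command interpreter might classify the integers
returned by \method{getch()} along these lines. This is a sketch
assuming Python 3; the function name \function{describe} and its
category labels are made up for illustration:

```python
import curses.ascii


def describe(key):
    """Classify a key code such as the integers returned by getch()."""
    if curses.ascii.isdigit(key):
        return "digit"
    if curses.ascii.isctrl(key):
        return "control"
    if curses.ascii.isprint(key):
        return "printable"
    return "other"
```

Because \function{curses.ascii.ctrl()} returns the same type it is
given, \code{curses.ascii.ctrl(ord('a'))} is an integer that
\function{describe} reports as a control character.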
|
||||||
|
|
||||||
|
There's also a method to retrieve an entire string,
|
||||||
|
\method{getstr()}. It isn't used very often, because its
|
||||||
|
functionality is quite limited; the only editing keys available are
|
||||||
|
the backspace key and the Enter key, which terminates the string. It
|
||||||
|
can optionally be limited to a fixed number of characters.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
curses.echo() # Enable echoing of characters
|
||||||
|
|
||||||
|
# Get a 15-character string, with the cursor on the top line
|
||||||
|
s = stdscr.getstr(0,0, 15)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The Python \module{curses.textpad} module supplies something better.
|
||||||
|
With it, you can turn a window into a text box that supports an
|
||||||
|
Emacs-like set of keybindings. Various methods of the \class{Textbox}
|
||||||
|
class support editing with input validation and gathering the edit
|
||||||
|
results either with or without trailing spaces. See the library
|
||||||
|
documentation on \module{curses.textpad} for the details.
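A minimal sketch of a one-line text box, assuming Python 3. The
function name \function{ask_line}, the prompt, and the 30-column width
are all illustrative choices:

```python
import curses
import curses.textpad


def ask_line(stdscr, prompt="Name: "):
    stdscr.addstr(0, 0, prompt)
    stdscr.refresh()
    # A one-row, 30-column editing window placed just after the prompt.
    editwin = curses.newwin(1, 30, 0, len(prompt))
    box = curses.textpad.Textbox(editwin)
    # edit() runs until the user finishes (Ctrl-G ends editing) and
    # returns the window contents as a string.
    return box.edit().strip()
```

The function needs an initialized screen, so it is meant to be called
from inside a wrapper hook.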
|
||||||
|
|
||||||
|
\section{For More Information}
|
||||||
|
|
||||||
|
This HOWTO didn't cover some advanced topics, such as screen-scraping
|
||||||
|
or capturing mouse events from an xterm instance. But the Python
|
||||||
|
library page for the curses modules is now pretty complete. You
|
||||||
|
should browse it next.
|
||||||
|
|
||||||
|
If you're in doubt about the detailed behavior of any of the ncurses
|
||||||
|
entry points, consult the manual pages for your curses implementation,
|
||||||
|
whether it's ncurses or a proprietary Unix vendor's. The manual pages
|
||||||
|
will document any quirks, and provide complete lists of all the
|
||||||
|
functions, attributes, and \constant{ACS_*} characters available to
|
||||||
|
you.
|
||||||
|
|
||||||
|
Because the curses API is so large, some functions aren't supported in
|
||||||
|
the Python interface, not because they're difficult to implement, but
|
||||||
|
because no one has needed them yet. Feel free to add them and then
|
||||||
|
submit a patch. Also, we don't yet have support for the menus or
|
||||||
|
panels libraries associated with ncurses; feel free to add that.
|
||||||
|
|
||||||
|
If you write an interesting little program, feel free to contribute it
|
||||||
|
as another demo. We can always use more of them!
|
||||||
|
|
||||||
|
The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
|
||||||
|
|
||||||
|
\end{document}
|
||||||
343
Doc/howto/doanddont.tex
Normal file
|
|
@ -0,0 +1,343 @@
|
||||||
|
\documentclass{howto}
|
||||||
|
|
||||||
|
\title{Idioms and Anti-Idioms in Python}
|
||||||
|
|
||||||
|
\release{0.00}
|
||||||
|
|
||||||
|
\author{Moshe Zadka}
|
||||||
|
\authoraddress{howto@zadka.site.co.il}
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
\maketitle
|
||||||
|
|
||||||
|
This document is placed in the public domain.
|
||||||
|
|
||||||
|
\begin{abstract}
|
||||||
|
\noindent
|
||||||
|
This document can be considered a companion to the tutorial. It
|
||||||
|
shows how to use Python, and even more importantly, how {\em not}
|
||||||
|
to use Python.
|
||||||
|
\end{abstract}
|
||||||
|
|
||||||
|
\tableofcontents
|
||||||
|
|
||||||
|
\section{Language Constructs You Should Not Use}
|
||||||
|
|
||||||
|
While Python has relatively few gotchas compared to other languages, it
|
||||||
|
still has some constructs which are only useful in corner cases, or are
|
||||||
|
plain dangerous.
|
||||||
|
|
||||||
|
\subsection{from module import *}
|
||||||
|
|
||||||
|
\subsubsection{Inside Function Definitions}
|
||||||
|
|
||||||
|
\code{from module import *} is {\em invalid} inside function definitions.
|
||||||
|
While many versions of Python do not check for the invalidity, it does not
|
||||||
|
make it more valid, no more than having a smart lawyer makes a man innocent.
|
||||||
|
Do not use it like that ever. Even in versions where it was accepted, it made
|
||||||
|
the function execution slower, because the compiler could not be certain
|
||||||
|
which names are local and which are global. In Python 2.1 this construct
|
||||||
|
causes warnings, and sometimes even errors.
|
||||||
|
|
||||||
|
\subsubsection{At Module Level}
|
||||||
|
|
||||||
|
While it is valid to use \code{from module import *} at module level it
|
||||||
|
is usually a bad idea. For one, this loses an important property Python
|
||||||
|
otherwise has --- you can know where each toplevel name is defined by
|
||||||
|
a simple "search" function in your favourite editor. You also open yourself
|
||||||
|
to trouble in the future, if some module grows additional functions or
|
||||||
|
classes.
|
||||||
|
|
||||||
|
One of the most awful questions asked on the newsgroup is why this code:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
f = open("www")
|
||||||
|
f.read()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
does not work. Of course, it works just fine (assuming you have a file
|
||||||
|
called "www".) But it does not work if somewhere in the module, the
|
||||||
|
statement \code{from os import *} is present. The \module{os} module
|
||||||
|
has a function called \function{open()} which returns an integer. While
|
||||||
|
it is very useful, shadowing builtins is one of its least useful properties.
|
||||||
|
|
||||||
|
Remember, you can never know for sure what names a module exports, so either
|
||||||
|
take what you need --- \code{from module import name1, name2}, or keep them in
|
||||||
|
the module and access on a per-need basis ---
|
||||||
|
\code{import module;print module.name}.
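The shadowing can be observed directly. The sketch below assumes
Python 3, where \keyword{exec} is a function; the scratch dictionary
\code{namespace} is an illustrative name, used so that the star-import
does not touch the example's own globals:

```python
import builtins
import os

namespace = {}
exec("from os import *", namespace)   # star-import into a scratch namespace

# The name "open" in that namespace now refers to os.open, which
# returns an integer file descriptor, not to the builtin open().
shadowed = namespace["open"] is os.open
```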
|
||||||
|
|
||||||
|
\subsubsection{When It Is Just Fine}
|
||||||
|
|
||||||
|
There are situations in which \code{from module import *} is just fine:
|
||||||
|
|
||||||
|
\begin{itemize}
|
||||||
|
|
||||||
|
\item The interactive prompt. For example, \code{from math import *} makes
|
||||||
|
Python an amazing scientific calculator.
|
||||||
|
|
||||||
|
\item When extending a module in C with a module in Python.
|
||||||
|
|
||||||
|
\item When the module advertises itself as \code{from import *} safe.
|
||||||
|
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
|
||||||
|
|
||||||
|
The word ``unadorned'' refers to the use without an explicit dictionary,
|
||||||
|
in which case those constructs evaluate code in the {\em current} environment.
|
||||||
|
This is dangerous for the same reasons \code{from import *} is dangerous ---
|
||||||
|
it might step over variables you are counting on and mess up things for
|
||||||
|
the rest of your code. Simply do not do that.
|
||||||
|
|
||||||
|
Bad examples:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> for name in sys.argv[1:]:
>>>     exec "%s=1" % name
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         exec "s.%s=val" % var  # invalid!
>>> execfile("handler.py")
>>> handle()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Good examples:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> d = {}
>>> for name in sys.argv[1:]:
>>>     d[name] = 1
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         setattr(s, var, val)
>>> d={}
>>> execfile("handler.py", d, d)
>>> handle = d['handle']
>>> handle()
|
||||||
|
\end{verbatim}
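In Python 3, \keyword{exec} is a function and \function{execfile} no
longer exists, but the same advice applies: pass an explicit
dictionary. In this sketch the string \code{handler_source} stands in
for the contents of a handler file, and \function{load_handler} is a
hypothetical helper name:

```python
def load_handler(source):
    """Run source in its own namespace and return its handle() function."""
    namespace = {}
    exec(source, namespace)    # names land in namespace, not in our globals
    return namespace["handle"]


handler_source = "def handle():\n    return 'handled'\n"
handle = load_handler(handler_source)
result = handle()
```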
|
||||||
|
|
||||||
|
\subsection{from module import name1, name2}
|
||||||
|
|
||||||
|
This is a ``don't'' which is much weaker than the previous ``don't''s
|
||||||
|
but is still something you should not do if you don't have good reasons
|
||||||
|
to do that. The reason it is usually a bad idea is that you suddenly
|
||||||
|
have an object which lives in two separate namespaces. When the binding
|
||||||
|
in one namespace changes, the binding in the other will not, so there
|
||||||
|
will be a discrepancy between them. This happens when, for example,
|
||||||
|
one module is reloaded, or changes the definition of a function at runtime.
|
||||||
|
|
||||||
|
Bad example:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
# foo.py
|
||||||
|
a = 1
|
||||||
|
|
||||||
|
# bar.py
|
||||||
|
from foo import a
|
||||||
|
if something():
    a = 2 # danger: foo.a != a
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Good example:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
# foo.py
|
||||||
|
a = 1
|
||||||
|
|
||||||
|
# bar.py
|
||||||
|
import foo
|
||||||
|
if something():
    foo.a = 2
|
||||||
|
\end{verbatim}
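The discrepancy between the two bindings can be reproduced without
creating any files. In this sketch, which assumes Python 3, a
\class{types.ModuleType} object stands in for \file{foo.py} (an
assumption made only to keep the example self-contained), and a plain
assignment mimics what \code{from foo import a} does:

```python
import types

foo = types.ModuleType("foo")   # stand-in for a real foo.py
foo.a = 1

a = foo.a        # what "from foo import a" does: copy the binding
foo.a = 2        # rebinding the name inside the module...
stale = a        # ...leaves the copied name bound to the old value
```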
|
||||||
|
|
||||||
|
\subsection{except:}
|
||||||
|
|
||||||
|
Python has the \code{except:} clause, which catches all exceptions.
|
||||||
|
Since {\em every} error in Python raises an exception, this makes many
|
||||||
|
programming errors look like runtime problems, and hinders
|
||||||
|
the debugging process.
|
||||||
|
|
||||||
|
The following code shows a great example of why this is bad:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
try:
    foo = opne("file") # misspelled "open"
except:
    sys.exit("could not open file!")
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The second line triggers a \exception{NameError} which is caught by the
|
||||||
|
except clause. The program will exit, and you will have no idea that
|
||||||
|
this has nothing to do with the readability of \code{"file"}.
|
||||||
|
|
||||||
|
The example above is better written
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
try:
    foo = opne("file") # will be changed to "open" as soon as we run it
except IOError:
    sys.exit("could not open file")
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
There are some situations in which the \code{except:} clause is useful:
|
||||||
|
for example, in a framework when running callbacks, it is good not to
|
||||||
|
let any callback disturb the framework.
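One way such a framework might isolate its callbacks is sketched
below, assuming Python 3. Note the substitution: it uses
\code{except Exception:} rather than a bare \code{except:}, so that
\exception{KeyboardInterrupt} and \exception{SystemExit} still
propagate. The name \function{run_callbacks} is hypothetical:

```python
import traceback


def run_callbacks(callbacks):
    """Call every callback, reporting failures instead of propagating them."""
    failures = []
    for callback in callbacks:
        try:
            callback()
        except Exception:              # deliberate catch-all
            failures.append(callback)
            traceback.print_exc()      # never swallow the error silently
    return failures
```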
|
||||||
|
|
||||||
|
\section{Exceptions}
|
||||||
|
|
||||||
|
Exceptions are a useful feature of Python. You should learn to raise
|
||||||
|
them whenever something unexpected occurs, and catch them only where
|
||||||
|
you can do something about them.
|
||||||
|
|
||||||
|
The following is a very popular anti-idiom:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Consider the case where the file gets deleted between the time the call to
|
||||||
|
\function{os.path.exists} is made and the time \function{open} is called.
|
||||||
|
That means the last line will throw an \exception{IOError}. The same would
|
||||||
|
happen if \var{file} exists but has no read permission. Since testing this
|
||||||
|
on a normal machine on existing and non-existing files makes it seem bugless,
|
||||||
|
that means in testing the results will seem fine, and the code will get
|
||||||
|
shipped. Then an unhandled \exception{IOError} escapes to the user, who
|
||||||
|
has to watch the ugly traceback.
|
||||||
|
|
||||||
|
Here is a better way to do it.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
In this version, {\em either} the file gets opened and the line is read
|
||||||
|
(so it works even on flaky NFS or SMB connections), or the message
|
||||||
|
is printed and the application aborted.
|
||||||
|
|
||||||
|
Still, \function{get_status} makes too many assumptions --- that it
|
||||||
|
will only be used in a short running script, and not, say, in a long
|
||||||
|
running server. Sure, the caller could do something like
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
try:
    status = get_status(log)
except SystemExit:
    status = None
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
So, try to use as few \code{except} clauses in your code as possible --- those will
|
||||||
|
usually be a catch-all in the \function{main}, or inside calls which
|
||||||
|
should always succeed.
|
||||||
|
|
||||||
|
So, the best version is probably
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
def get_status(file):
    return open(file).readline()
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The caller can deal with the exception if it wants (for example, if it
|
||||||
|
tries several files in a loop), or just let the exception filter upwards
|
||||||
|
to {\em its} caller.
|
||||||
|
|
||||||
|
The last version is not very good either --- due to implementation details,
|
||||||
|
the file would not be closed when an exception is raised until the handler
|
||||||
|
finishes, and perhaps not at all in non-C implementations (e.g., Jython).
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
|
||||||
|
\end{verbatim}
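In Python versions that provide the \keyword{with} statement (2.6 and
later, including Python 3), the same guarantee can be written more
compactly. This is an alternative spelling, not the form used in the
text above; the parameter is renamed \var{path} here purely for
illustration:

```python
def get_status(path):
    # The file is closed when the block exits, whether readline()
    # returns normally or raises.
    with open(path) as fp:
        return fp.readline()
```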
|
||||||
|
|
||||||
|
\section{Using the Batteries}
|
||||||
|
|
||||||
|
Every so often, people seem to be writing stuff in the Python library
|
||||||
|
again, usually poorly. While the occasional module has a poor interface,
|
||||||
|
it is usually much better to use the rich standard library and data
|
||||||
|
types that come with Python than to invent your own.
|
||||||
|
|
||||||
|
A useful module very few people know about is \module{os.path}. It
|
||||||
|
always has the correct path arithmetic for your operating system, and
|
||||||
|
will usually be much better than whatever you come up with yourself.
|
||||||
|
|
||||||
|
Compare:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
# ugh!
|
||||||
|
return dir+"/"+file
|
||||||
|
# better
|
||||||
|
return os.path.join(dir, file)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
More useful functions in \module{os.path}: \function{basename},
|
||||||
|
\function{dirname} and \function{splitext}.
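A brief illustration of these functions, assuming Python 3. It imports
\module{posixpath}, the module that \module{os.path} refers to on
POSIX systems, so that the results are the same on every platform; the
path components are made up for the example:

```python
import posixpath   # what os.path is on POSIX systems

path = posixpath.join("/var/log", "app", "server.log")
directory = posixpath.dirname(path)              # "/var/log/app"
filename = posixpath.basename(path)              # "server.log"
stem, extension = posixpath.splitext(filename)   # "server", ".log"
```

In ordinary code you would simply use \module{os.path}, which picks
the right rules for the operating system the program is running on.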
|
||||||
|
|
||||||
|
There are also many useful builtin functions people seem not to be
|
||||||
|
aware of for some reason: \function{min()} and \function{max()} can
|
||||||
|
find the minimum/maximum of any sequence with comparable semantics,
|
||||||
|
for example, yet many people write their own max/min. Another highly
|
||||||
|
useful function is \function{reduce()}. Classical use of \function{reduce()}
|
||||||
|
is something like
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
import sys, operator
|
||||||
|
nums = map(float, sys.argv[1:])
|
||||||
|
print reduce(operator.add, nums)/len(nums)
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
This cute little script prints the average of all numbers given on the
|
||||||
|
command line. The \function{reduce()} adds up all the numbers, and
|
||||||
|
the rest is just some pre- and postprocessing.
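The same calculation can be wrapped in a function so that it can be tried without \code{sys.argv}. This is only a sketch and assumes Python 3, where \function{reduce()} lives in \module{functools} and \code{/} is true division; the name \code{average} is illustrative.

```python
import operator
from functools import reduce


def average(numbers):
    """Average a non-empty sequence by folding it with operator.add."""
    numbers = [float(n) for n in numbers]
    return reduce(operator.add, numbers) / len(numbers)


print(average(["1", "2", "6"]))
```

An empty sequence makes \function{reduce()} raise \exception{TypeError} here, since no initial value is supplied.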
|
||||||
|
|
||||||
|
In the same vein, note that \function{float()}, \function{int()} and
|
||||||
|
\function{long()} all accept arguments of type string, and so are
|
||||||
|
suited to parsing --- assuming you are ready to deal with the
|
||||||
|
\exception{ValueError} they raise.
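One way of being "ready to deal with" that exception is sketched below. The function name \code{parse_numbers} and the decision to collect the rejected strings instead of stopping are assumptions made for the example (Python 3).

```python
def parse_numbers(words):
    """Split words into parsed floats and the strings that failed to parse."""
    numbers, rejected = [], []
    for word in words:
        try:
            numbers.append(float(word))
        except ValueError:
            # float() raises ValueError for text that is not a number
            rejected.append(word)
    return numbers, rejected


print(parse_numbers(["3", "4.5", "spam"]))
```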
|
||||||
|
|
||||||
|
\section{Using Backslash to Continue Statements}
|
||||||
|
|
||||||
|
Since Python treats a newline as a statement terminator,
and since statements are often longer than is comfortable to put
on one line, many people do:
|
||||||
|
|
||||||
|
\begin{verbatim}
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
        calculate_number(10, 20) != forbulate(500, 360):
    pass
\end{verbatim}
|
||||||
|
|
||||||
|
You should realize that this is dangerous: a stray space after the
|
||||||
|
\code{\\} would make this line wrong, and stray spaces are notoriously
|
||||||
|
hard to see in editors. In this case, at least it would be a syntax
|
||||||
|
error, but if the code was:
|
||||||
|
|
||||||
|
\begin{verbatim}
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
\end{verbatim}
|
||||||
|
|
||||||
|
then it would just be subtly wrong.
|
||||||
|
|
||||||
|
It is usually much better to use the implicit continuation inside
parentheses. This version is bulletproof:
|
||||||
|
|
||||||
|
\begin{verbatim}
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
        + calculate_number(10, 20)*forbulate(500, 360))
\end{verbatim}
|
||||||
|
|
||||||
|
\end{document}
|
||||||
1466
Doc/howto/regex.tex
Normal file
File diff suppressed because it is too large
61
Doc/howto/rexec.tex
Normal file
|
|
@ -0,0 +1,61 @@
|
||||||
|
\documentclass{howto}
|
||||||
|
|
||||||
|
\title{Restricted Execution HOWTO}
|
||||||
|
|
||||||
|
\release{2.1}
|
||||||
|
|
||||||
|
\author{A.M. Kuchling}
|
||||||
|
\authoraddress{\email{amk@amk.ca}}
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
|
||||||
|
\maketitle
|
||||||
|
|
||||||
|
\begin{abstract}
|
||||||
|
\noindent
|
||||||
|
|
||||||
|
Python 2.2.2 and earlier provided a \module{rexec} module for running
|
||||||
|
untrusted code. However, it's never been exhaustively audited for
|
||||||
|
security and it hasn't been updated to take into account recent
|
||||||
|
changes to Python such as new-style classes. Therefore, the
|
||||||
|
\module{rexec} module should not be trusted. To discourage use of
|
||||||
|
\module{rexec}, this HOWTO has been withdrawn.
|
||||||
|
|
||||||
|
The \module{rexec} and \module{Bastion} modules have been disabled in
|
||||||
|
the Python CVS tree, both on the trunk (which will eventually become
|
||||||
|
Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
|
||||||
|
(which will become Python 2.2.3, if someone ever volunteers to issue
|
||||||
|
2.2.3).
|
||||||
|
|
||||||
|
For discussion of the problems with \module{rexec}, see the python-dev
|
||||||
|
threads starting at the following URLs:
|
||||||
|
\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
|
||||||
|
and
|
||||||
|
\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
|
||||||
|
|
||||||
|
\end{abstract}
|
||||||
|
|
||||||
|
|
||||||
|
\section{Version History}
|
||||||
|
|
||||||
|
Sep. 12, 1998: Minor revisions and added the reference to the Janus
|
||||||
|
project.
|
||||||
|
|
||||||
|
Feb. 26, 1998: First version. Suggestions are welcome.
|
||||||
|
|
||||||
|
Mar. 16, 1998: Made some revisions suggested by Jeff Rush. Some minor
|
||||||
|
changes and clarifications, and a sizable section on exceptions added.
|
||||||
|
|
||||||
|
Oct. 4, 2000: Checked with Python 2.0. Minor rewrites and fixes made.
|
||||||
|
Version number increased to 2.0.
|
||||||
|
|
||||||
|
Dec. 17, 2002: Withdrawn.
|
||||||
|
|
||||||
|
Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
|
||||||
|
and added links to relevant python-dev threads.
|
||||||
|
|
||||||
|
\end{document}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
460
Doc/howto/sockets.tex
Normal file
|
|
@ -0,0 +1,460 @@
|
||||||
|
\documentclass{howto}
|
||||||
|
|
||||||
|
\title{Socket Programming HOWTO}
|
||||||
|
|
||||||
|
\release{0.00}
|
||||||
|
|
||||||
|
\author{Gordon McMillan}
|
||||||
|
\authoraddress{\email{gmcm@hypernet.com}}
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
\maketitle
|
||||||
|
|
||||||
|
\begin{abstract}
|
||||||
|
\noindent
|
||||||
|
Sockets are used nearly everywhere, but are one of the most severely
|
||||||
|
misunderstood technologies around. This is a 10,000 foot overview of
|
||||||
|
sockets. It's not really a tutorial - you'll still have work to do in
|
||||||
|
getting things operational. It doesn't cover the fine points (and there
|
||||||
|
are a lot of them), but I hope it will give you enough background to
|
||||||
|
begin using them decently.
|
||||||
|
|
||||||
|
This document is available from the Python HOWTO page at
|
||||||
|
\url{http://www.python.org/doc/howto}.
|
||||||
|
|
||||||
|
\end{abstract}
|
||||||
|
|
||||||
|
\tableofcontents
|
||||||
|
|
||||||
|
\section{Sockets}
|
||||||
|
|
||||||
|
Sockets are used nearly everywhere, but are one of the most severely
|
||||||
|
misunderstood technologies around. This is a 10,000 foot overview of
|
||||||
|
sockets. It's not really a tutorial - you'll still have work to do in
|
||||||
|
getting things working. It doesn't cover the fine points (and there
|
||||||
|
are a lot of them), but I hope it will give you enough background to
|
||||||
|
begin using them decently.
|
||||||
|
|
||||||
|
I'm only going to talk about INET sockets, but they account for at
|
||||||
|
least 99\% of the sockets in use. And I'll only talk about STREAM
|
||||||
|
sockets - unless you really know what you're doing (in which case this
|
||||||
|
HOWTO isn't for you!), you'll get better behavior and performance from
|
||||||
|
a STREAM socket than anything else. I will try to clear up the mystery
|
||||||
|
of what a socket is, and give some hints on how to work with
|
||||||
|
blocking and non-blocking sockets. But I'll start by talking about
|
||||||
|
blocking sockets. You'll need to know how they work before dealing
|
||||||
|
with non-blocking sockets.
|
||||||
|
|
||||||
|
Part of the trouble with understanding these things is that "socket"
|
||||||
|
can mean a number of subtly different things, depending on context. So
|
||||||
|
first, let's make a distinction between a "client" socket - an
|
||||||
|
endpoint of a conversation, and a "server" socket, which is more like
|
||||||
|
a switchboard operator. The client application (your browser, for
|
||||||
|
example) uses "client" sockets exclusively; the web server it's
|
||||||
|
talking to uses both "server" sockets and "client" sockets.
|
||||||
|
|
||||||
|
|
||||||
|
\subsection{History}
|
||||||
|
|
||||||
|
Of the various forms of IPC (\emph{Inter Process Communication}),
|
||||||
|
sockets are by far the most popular. On any given platform, there are
|
||||||
|
likely to be other forms of IPC that are faster, but for
|
||||||
|
cross-platform communication, sockets are about the only game in town.
|
||||||
|
|
||||||
|
They were invented in Berkeley as part of the BSD flavor of Unix. They
|
||||||
|
spread like wildfire with the Internet. With good reason --- the
|
||||||
|
combination of sockets with INET makes talking to arbitrary machines
|
||||||
|
around the world unbelievably easy (at least compared to other
|
||||||
|
schemes).
|
||||||
|
|
||||||
|
\section{Creating a Socket}
|
||||||
|
|
||||||
|
Roughly speaking, when you clicked on the link that brought you to
|
||||||
|
this page, your browser did something like the following:
|
||||||
|
|
||||||
|
\begin{verbatim}
#create an INET, STREAMing socket
s = socket.socket(
    socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
\end{verbatim}
|
||||||
|
|
||||||
|
When the \code{connect} completes, the socket \code{s} can
|
||||||
|
now be used to send in a request for the text of this page. The same
|
||||||
|
socket will read the reply, and then be destroyed. That's right -
|
||||||
|
destroyed. Client sockets are normally only used for one exchange (or
|
||||||
|
a small set of sequential exchanges).
|
||||||
|
|
||||||
|
What happens in the web server is a bit more complex. First, the web
|
||||||
|
server creates a "server socket".
|
||||||
|
|
||||||
|
\begin{verbatim}
#create an INET, STREAMing socket
serversocket = socket.socket(
    socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
\end{verbatim}
|
||||||
|
|
||||||
|
A couple of things to notice: we used \code{socket.gethostname()}
so that the socket would be visible to the outside world. If we had
used \code{s.bind(('localhost', 80))} or \code{s.bind(('127.0.0.1', 80))}
we would still have a "server" socket, but one that was only visible
within the same machine. \code{s.bind(('', 80))} specifies that the
socket is reachable by any address the machine happens to have.
|
||||||
|
|
||||||
|
A second thing to note: low number ports are usually reserved for
|
||||||
|
"well known" services (HTTP, SNMP etc). If you're playing around, use
|
||||||
|
a nice high number (4 digits).
|
||||||
|
|
||||||
|
Finally, the argument to \code{listen} tells the socket library that
|
||||||
|
we want it to queue up as many as 5 connect requests (the normal max)
|
||||||
|
before refusing outside connections. If the rest of the code is
|
||||||
|
written properly, that should be plenty.
|
||||||
|
|
||||||
|
OK, now we have a "server" socket, listening on port 80. Now we enter
|
||||||
|
the mainloop of the web server:
|
||||||
|
|
||||||
|
\begin{verbatim}
while 1:
    #accept connections from outside
    (clientsocket, address) = serversocket.accept()
    #now do something with the clientsocket
    #in this case, we'll pretend this is a threaded server
    ct = client_thread(clientsocket)
    ct.run()
\end{verbatim}
|
||||||
|
|
||||||
|
There are actually 3 general ways in which this loop could work -
dispatching a thread to handle \code{clientsocket}, creating a new
process to handle \code{clientsocket}, or restructuring this app
to use non-blocking sockets and multiplexing between our "server" socket
and any active \code{clientsocket}s using
\code{select}. More about that later. The important thing to
|
||||||
|
understand now is this: this is \emph{all} a "server" socket
|
||||||
|
does. It doesn't send any data. It doesn't receive any data. It just
|
||||||
|
produces "client" sockets. Each \code{clientsocket} is created
|
||||||
|
in response to some \emph{other} "client" socket doing a
|
||||||
|
\code{connect()} to the host and port we're bound to. As soon as
|
||||||
|
we've created that \code{clientsocket}, we go back to listening
|
||||||
|
for more connections. The two "clients" are free to chat it up - they
|
||||||
|
are using some dynamically allocated port which will be recycled when
|
||||||
|
the conversation ends.
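To make the thread-per-connection variant concrete, here is a minimal sketch of an echo server and one client talking over the loopback interface. It assumes Python 3; the function names, the echo behaviour, the use of port 0 (so the operating system picks a free port) and the five-second timeouts are all choices made for the example, not part of any real server.

```python
import socket
import threading


def handle_client(clientsocket):
    """Echo everything one client sends, then close the connection."""
    with clientsocket:
        while True:
            chunk = clientsocket.recv(4096)
            if not chunk:              # 0 bytes: the peer is done sending
                break
            clientsocket.sendall(chunk)


def serve(serversocket, max_clients):
    """Accept max_clients connections, handing each to its own thread."""
    for _ in range(max_clients):
        clientsocket, address = serversocket.accept()
        threading.Thread(target=handle_client,
                         args=(clientsocket,), daemon=True).start()


def demo_echo():
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.bind(("127.0.0.1", 0))   # port 0: let the OS pick a port
    serversocket.listen(5)
    port = serversocket.getsockname()[1]
    acceptor = threading.Thread(target=serve,
                                args=(serversocket, 1), daemon=True)
    acceptor.start()

    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    client.sendall(b"hello")
    client.shutdown(socket.SHUT_WR)       # done sending, still receiving
    reply = b""
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break
        reply += chunk
    client.close()
    acceptor.join(timeout=5)
    serversocket.close()
    return reply


print(demo_echo())
```

Note that the "server" socket never touches the data; it only hands out the \code{clientsocket} that \code{handle\_client} then uses.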
|
||||||
|
|
||||||
|
\subsection{IPC}

If you need fast IPC between two processes
|
||||||
|
on one machine, you should look into whatever form of shared memory
|
||||||
|
the platform offers. A simple protocol based around shared memory and
|
||||||
|
locks or semaphores is by far the fastest technique.
|
||||||
|
|
||||||
|
If you do decide to use sockets, bind the "server" socket to
|
||||||
|
\code{'localhost'}. On most platforms, this will take a shortcut
|
||||||
|
around a couple of layers of network code and be quite a bit faster.
|
||||||
|
|
||||||
|
|
||||||
|
\section{Using a Socket}
|
||||||
|
|
||||||
|
The first thing to note is that the web browser's "client" socket and
|
||||||
|
the web server's "client" socket are identical beasts. That is, this
|
||||||
|
is a "peer to peer" conversation. Or to put it another way, \emph{as the
|
||||||
|
designer, you will have to decide what the rules of etiquette are for
|
||||||
|
a conversation}. Normally, the \code{connect}ing socket
|
||||||
|
starts the conversation, by sending in a request, or perhaps a
|
||||||
|
signon. But that's a design decision - it's not a rule of sockets.
|
||||||
|
|
||||||
|
Now there are two sets of verbs to use for communication. You can use
|
||||||
|
\code{send} and \code{recv}, or you can transform your
|
||||||
|
client socket into a file-like beast and use \code{read} and
|
||||||
|
\code{write}. The latter is the way Java presents their
|
||||||
|
sockets. I'm not going to talk about it here, except to warn you that
|
||||||
|
you need to use \code{flush} on sockets. These are buffered
|
||||||
|
"files", and a common mistake is to \code{write} something, and
|
||||||
|
then \code{read} for a reply. Without a \code{flush} in
|
||||||
|
there, you may wait forever for the reply, because the request may
|
||||||
|
still be in your output buffer.
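The \code{flush} pitfall can be tried out locally. The sketch below assumes Python 3 and uses \code{socket.socketpair()} to get two connected sockets in one process; the name \code{demo\_flush} and the "ping" message are invented for the example.

```python
import socket


def demo_flush():
    left, right = socket.socketpair()
    writer = left.makefile("w", encoding="ascii", newline="\n")
    reader = right.makefile("r", encoding="ascii", newline="\n")
    try:
        writer.write("ping\n")   # sits in the writer's buffer...
        writer.flush()           # ...until it is pushed to the socket
        return reader.readline()
    finally:
        writer.close()
        reader.close()
        left.close()
        right.close()


print(repr(demo_flush()))
```

Without the \code{flush} call, the \code{readline} on the other end would have nothing to read and would simply block.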
|
||||||
|
|
||||||
|
Now we come to the major stumbling block of sockets - \code{send}
|
||||||
|
and \code{recv} operate on the network buffers. They do not
|
||||||
|
necessarily handle all the bytes you hand them (or expect from them),
|
||||||
|
because their major focus is handling the network buffers. In general,
|
||||||
|
they return when the associated network buffers have been filled
|
||||||
|
(\code{send}) or emptied (\code{recv}). They then tell you
|
||||||
|
how many bytes they handled. It is \emph{your} responsibility to call
|
||||||
|
them again until your message has been completely dealt with.
|
||||||
|
|
||||||
|
When a \code{recv} returns 0 bytes, it means the other side has
|
||||||
|
closed (or is in the process of closing) the connection. You will not
|
||||||
|
receive any more data on this connection. Ever. You may be able to
|
||||||
|
send data successfully; I'll talk more about this later.
|
||||||
|
|
||||||
|
A protocol like HTTP uses a socket for only one transfer. The client
|
||||||
|
sends a request, then reads a reply. That's it. The socket is
|
||||||
|
discarded. This means that a client can detect the end of the reply by
|
||||||
|
receiving 0 bytes.
|
||||||
|
|
||||||
|
But if you plan to reuse your socket for further transfers, you need
|
||||||
|
to realize that \emph{there is no "EOT" (End of Transfer) on a
|
||||||
|
socket.} I repeat: if a socket \code{send} or
|
||||||
|
\code{recv} returns after handling 0 bytes, the connection has
|
||||||
|
been broken. If the connection has \emph{not} been broken, you may
|
||||||
|
wait on a \code{recv} forever, because the socket will
|
||||||
|
\emph{not} tell you that there's nothing more to read (for now). Now
|
||||||
|
if you think about that a bit, you'll come to realize a fundamental
|
||||||
|
truth of sockets: \emph{messages must either be fixed length} (yuck),
|
||||||
|
\emph{or be delimited} (shrug), \emph{or indicate how long they are}
|
||||||
|
(much better), \emph{or end by shutting down the connection}. The
|
||||||
|
choice is entirely yours, (but some ways are righter than others).
|
||||||
|
|
||||||
|
Assuming you don't want to end the connection, the simplest solution
|
||||||
|
is a fixed length message:
|
||||||
|
|
||||||
|
\begin{verbatim}
import socket

MSGLEN = 2048   # fixed message length (arbitrary value, for illustration)

class mysocket:
    '''demonstration class only
      - coded for clarity, not efficiency'''
    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(
                socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def mysend(self, msg):
        totalsent = 0
        while totalsent < MSGLEN:
            sent = self.sock.send(msg[totalsent:])
            if sent == 0:
                raise RuntimeError("socket connection broken")
            totalsent = totalsent + sent

    def myreceive(self):
        msg = ''
        while len(msg) < MSGLEN:
            chunk = self.sock.recv(MSGLEN-len(msg))
            if chunk == '':
                raise RuntimeError("socket connection broken")
            msg = msg + chunk
        return msg
\end{verbatim}
|
||||||
|
|
||||||
|
The sending code here is usable for almost any messaging scheme - in
|
||||||
|
Python you send strings, and you can use \code{len()} to
|
||||||
|
determine its length (even if it has embedded \code{\e 0}
|
||||||
|
characters). It's mostly the receiving code that gets more
|
||||||
|
complex. (And in C, it's not much worse, except you can't use
|
||||||
|
\code{strlen} if the message has embedded \code{\e 0}s.)
|
||||||
|
|
||||||
|
The easiest enhancement is to make the first character of the message
|
||||||
|
an indicator of message type, and have the type determine the
|
||||||
|
length. Now you have two \code{recv}s - the first to get (at
|
||||||
|
least) that first character so you can look up the length, and the
|
||||||
|
second in a loop to get the rest. If you decide to go the delimited
|
||||||
|
route, you'll be receiving in some arbitrary chunk size, (4096 or 8192
|
||||||
|
is frequently a good match for network buffer sizes), and scanning
|
||||||
|
what you've received for a delimiter.
|
||||||
|
|
||||||
|
One complication to be aware of: if your conversational protocol
|
||||||
|
allows multiple messages to be sent back to back (without some kind of
|
||||||
|
reply), and you pass \code{recv} an arbitrary chunk size, you
|
||||||
|
may end up reading the start of a following message. You'll need to
|
||||||
|
put that aside and hold onto it, until it's needed.
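A sketch of the delimited route, including the "put that aside" part, might look like the following. It assumes Python 3 (so data is \code{bytes}) and a newline delimiter; the class name \code{DelimitedReader} and its attributes are hypothetical.

```python
import socket


class DelimitedReader:
    """Read delimiter-terminated messages, keeping any surplus bytes."""

    def __init__(self, sock, delimiter=b"\n", chunk_size=4096):
        self.sock = sock
        self.delimiter = delimiter
        self.chunk_size = chunk_size
        self.pending = b""        # bytes received but not yet returned

    def next_message(self):
        # Keep receiving until a complete message is buffered.
        while self.delimiter not in self.pending:
            chunk = self.sock.recv(self.chunk_size)
            if chunk == b"":
                raise RuntimeError("socket connection broken")
            self.pending += chunk
        # Hand back one message; the rest waits for the next call.
        message, _, self.pending = self.pending.partition(self.delimiter)
        return message


left, right = socket.socketpair()
left.sendall(b"first\nsecond\n")       # two messages sent back to back
reader = DelimitedReader(right)
print(reader.next_message(), reader.next_message())
left.close()
right.close()
```

The start of a following message simply stays in \code{pending} until someone asks for it.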
|
||||||
|
|
||||||
|
Prefixing the message with its length (say, as 5 numeric characters)
|
||||||
|
gets more complex, because (believe it or not), you may not get all 5
|
||||||
|
characters in one \code{recv}. In playing around, you'll get
|
||||||
|
away with it; but in high network loads, your code will very quickly
|
||||||
|
break unless you use two \code{recv} loops - the first to
|
||||||
|
determine the length, the second to get the data part of the
|
||||||
|
message. Nasty. This is also when you'll discover that
|
||||||
|
\code{send} does not always manage to get rid of everything in
|
||||||
|
one pass. And despite having read this, you will eventually get bitten by
|
||||||
|
it!
|
||||||
|
|
||||||
|
In the interests of space, building your character, (and preserving my
|
||||||
|
competitive position), these enhancements are left as an exercise for
|
||||||
|
the reader. Let's move on to cleaning up.
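For readers who would like a starting point for that exercise, here is one possible sketch of the length-prefix scheme with two receive loops. It assumes Python 3 and a header of five ASCII digits; the function names and the \code{HEADER\_SIZE} constant are made up, and \code{sendall} is used on the sending side so that the send loop is handled by the library.

```python
import socket

HEADER_SIZE = 5   # five ASCII digits, an assumption for this sketch


def recv_exactly(sock, count):
    """Loop on recv until exactly count bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if chunk == b"":
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock, payload):
    """Send payload (bytes) preceded by its length as five digits."""
    if len(payload) > 99999:
        raise ValueError("payload too long for a five-digit length")
    header = ("%05d" % len(payload)).encode("ascii")
    sock.sendall(header + payload)


def recv_message(sock):
    """The first loop reads the length, the second reads the data."""
    length = int(recv_exactly(sock, HEADER_SIZE).decode("ascii"))
    return recv_exactly(sock, length)


left, right = socket.socketpair()
send_message(left, b"hello")
print(recv_message(right))
left.close()
right.close()
```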
|
||||||
|
|
||||||
|
\subsection{Binary Data}
|
||||||
|
|
||||||
|
It is perfectly possible to send binary data over a socket. The major
|
||||||
|
problem is that not all machines use the same formats for binary
|
||||||
|
data. For example, a Motorola chip will represent a 16 bit integer
|
||||||
|
with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
|
||||||
|
are byte-reversed - that same 1 is 01 00. Socket libraries have calls
|
||||||
|
for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
|
||||||
|
htons} where "n" means \emph{network} and "h" means \emph{host},
|
||||||
|
"s" means \emph{short} and "l" means \emph{long}. Where network order
|
||||||
|
is host order, these do nothing, but where the machine is
|
||||||
|
byte-reversed, these swap the bytes around appropriately.
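The byte orders described above can be inspected from Python with the \module{struct} module. This is a small sketch assuming Python 3; the variable names are illustrative.

```python
import socket
import struct
import sys

value = 1

big_endian = struct.pack(">H", value)      # most significant byte first
little_endian = struct.pack("<H", value)   # least significant byte first
network_order = struct.pack("!H", value)   # network order is big-endian

print(big_endian.hex(), little_endian.hex(), sys.byteorder)

# htons/ntohs only swap bytes on hosts whose order differs from the
# network's, so a round trip always gives the original value back.
round_trip = socket.ntohs(socket.htons(value))
print(round_trip)
```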
|
||||||
|
|
||||||
|
In these days of 32 bit machines, the ASCII representation of binary
|
||||||
|
data is frequently smaller than the binary representation. That's
|
||||||
|
because a surprising amount of the time, all those longs have the
|
||||||
|
value 0, or maybe 1. The string "0" would be two bytes, while binary
|
||||||
|
is four. Of course, this doesn't fit well with fixed-length
|
||||||
|
messages. Decisions, decisions.
|
||||||
|
|
||||||
|
\section{Disconnecting}
|
||||||
|
|
||||||
|
Strictly speaking, you're supposed to use \code{shutdown} on a
|
||||||
|
socket before you \code{close} it. The \code{shutdown} is
|
||||||
|
an advisory to the socket at the other end. Depending on the argument
|
||||||
|
you pass it, it can mean "I'm not going to send anymore, but I'll
|
||||||
|
still listen", or "I'm not listening, good riddance!". Most socket
|
||||||
|
libraries, however, are so used to programmers neglecting to use this
|
||||||
|
piece of etiquette that normally a \code{close} is the same as
|
||||||
|
\code{shutdown(); close()}. So in most situations, an explicit
|
||||||
|
\code{shutdown} is not needed.
|
||||||
|
|
||||||
|
One way to use \code{shutdown} effectively is in an HTTP-like
|
||||||
|
exchange. The client sends a request and then does a
|
||||||
|
\code{shutdown(1)}. This tells the server "This client is done
|
||||||
|
sending, but can still receive." The server can detect "EOF" by a
|
||||||
|
receive of 0 bytes. It can assume it has the complete request. The
|
||||||
|
server sends a reply. If the \code{send} completes successfully
|
||||||
|
then, indeed, the client was still receiving.
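That exchange can be sketched in one process with a connected socket pair and a helper thread playing the server. It assumes Python 3; the function names, the request text and the reply format are invented, and \code{socket.SHUT\_WR} is the named constant for the value 1.

```python
import socket
import threading


def read_until_eof(sock):
    """Collect data until recv returns 0 bytes."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if chunk == b"":
            break
        chunks.append(chunk)
    return b"".join(chunks)


def server_side(sock):
    request = read_until_eof(sock)     # complete once the client shuts down
    sock.sendall(b"you sent " + request)
    sock.close()


def demo_half_close():
    client, server = socket.socketpair()
    client.settimeout(5)
    worker = threading.Thread(target=server_side,
                              args=(server,), daemon=True)
    worker.start()
    client.sendall(b"GET /status")
    client.shutdown(socket.SHUT_WR)    # done sending, can still receive
    reply = read_until_eof(client)
    worker.join(timeout=5)
    client.close()
    return reply


print(demo_half_close())
```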
|
||||||
|
|
||||||
|
Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
|
||||||
|
|
||||||
|
|
||||||
|
\subsection{When Sockets Die}
|
||||||
|
|
||||||
|
Probably the worst thing about using blocking sockets is what happens
|
||||||
|
when the other side comes down hard (without doing a
|
||||||
|
\code{close}). Your socket is likely to hang. \code{SOCK_STREAM} is a
|
||||||
|
reliable protocol, and it will wait a long, long time before giving up
|
||||||
|
on a connection. If you're using threads, the entire thread is
|
||||||
|
essentially dead. There's not much you can do about it. As long as you
|
||||||
|
aren't doing something dumb, like holding a lock while doing a
|
||||||
|
blocking read, the thread isn't really consuming much in the way of
|
||||||
|
resources. Do \emph{not} try to kill the thread - part of the reason
|
||||||
|
that threads are more efficient than processes is that they avoid the
|
||||||
|
overhead associated with the automatic recycling of resources. In
|
||||||
|
other words, if you do manage to kill the thread, your whole process
|
||||||
|
is likely to be screwed up.
|
||||||
|
|
||||||
|
\section{Non-blocking Sockets}
|
||||||
|
|
||||||
|
If you've understood the preceding, you already know most of what you
|
||||||
|
need to know about the mechanics of using sockets. You'll still use
|
||||||
|
the same calls, in much the same ways. It's just that, if you do it
|
||||||
|
right, your app will be almost inside-out.
|
||||||
|
|
||||||
|
In Python, you use \code{socket.setblocking(0)} to make it
|
||||||
|
non-blocking. In C, it's more complex, (for one thing, you'll need to
|
||||||
|
choose between the POSIX flag \code{O_NONBLOCK} and the older, almost
indistinguishable \code{O_NDELAY}, which is
|
||||||
|
completely different from \code{TCP_NODELAY}), but it's the
|
||||||
|
exact same idea. You do this after creating the socket, but before
|
||||||
|
using it. (Actually, if you're nuts, you can switch back and forth.)
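To see what "non-blocking" means in practice, the sketch below switches one end of a socket pair and tries to read before anything has been sent. It assumes Python 3, where a \code{recv} that would have blocked raises \exception{BlockingIOError}; the helper name \code{try\_recv} is hypothetical.

```python
import socket


def try_recv(sock, size=4096):
    """Return received bytes, or None if no data is available right now."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        # a blocking socket would have waited here instead
        return None


left, right = socket.socketpair()
right.setblocking(False)      # same effect as setblocking(0)
print(try_recv(right))        # nothing has been sent yet
left.close()
right.close()
```

Note that a return value of \code{b""} still means what it always means: the other side has closed.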
|
||||||
|
|
||||||
|
The major mechanical difference is that \code{send},
|
||||||
|
\code{recv}, \code{connect} and \code{accept} can
|
||||||
|
return without having done anything. You have (of course) a number of
|
||||||
|
choices. You can check return code and error codes and generally drive
|
||||||
|
yourself crazy. If you don't believe me, try it sometime. Your app
|
||||||
|
will grow large, buggy and suck CPU. So let's skip the brain-dead
|
||||||
|
solutions and do it right.
|
||||||
|
|
||||||
|
Use \code{select}.
|
||||||
|
|
||||||
|
In C, coding \code{select} is fairly complex. In Python, it's a
|
||||||
|
piece of cake, but it's close enough to the C version that if you
|
||||||
|
understand \code{select} in Python, you'll have little trouble
|
||||||
|
with it in C.
|
||||||
|
|
||||||
|
\begin{verbatim}
ready_to_read, ready_to_write, in_error = \
               select.select(
                  potential_readers,
                  potential_writers,
                  potential_errs,
                  timeout)
\end{verbatim}
|
||||||
|
|
||||||
|
You pass \code{select} three lists: the first contains all
|
||||||
|
sockets that you might want to try reading; the second all the sockets
|
||||||
|
you might want to try writing to, and the last (normally left empty)
|
||||||
|
those that you want to check for errors. You should note that a
|
||||||
|
socket can go into more than one list. The \code{select} call is
|
||||||
|
blocking, but you can give it a timeout. This is generally a sensible
|
||||||
|
thing to do - give it a nice long timeout (say a minute) unless you
|
||||||
|
have good reason to do otherwise.
|
||||||
|
|
||||||
|
In return, you will get three lists. They have the sockets that are
|
||||||
|
actually readable, writable and in error. Each of these lists is a
|
||||||
|
subset (possibly empty) of the corresponding list you passed in. A
socket that you put in more than one input list can also show up in
more than one output list (readable and writable at once, say).
|
||||||
|
|
||||||
|
If a socket is in the output readable list, you can be
|
||||||
|
as-close-to-certain-as-we-ever-get-in-this-business that a
|
||||||
|
\code{recv} on that socket will return \emph{something}. Same
|
||||||
|
idea for the writable list. You'll be able to send
|
||||||
|
\emph{something}. Maybe not all you want to, but \emph{something} is
|
||||||
|
better than nothing. (Actually, any reasonably healthy socket will
|
||||||
|
return as writable - it just means outbound network buffer space is
|
||||||
|
available.)
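Here is a small sketch of a \code{select} call in action, using two local socket pairs so it runs without a network. It assumes Python 3 and a typical Unix-like system; the names \code{demo\_select}, \code{quiet} and \code{busy} are invented for the example.

```python
import select
import socket


def demo_select():
    quiet_a, quiet_b = socket.socketpair()
    busy_a, busy_b = socket.socketpair()
    try:
        busy_a.sendall(b"ping")          # only busy_b has data waiting
        watched = [quiet_b, busy_b]
        readable, writable, in_error = select.select(
            watched, watched, [], 1.0)
        return (busy_b in readable,      # data is waiting to be read
                quiet_b in readable,     # nothing was sent to this one
                busy_b in writable)      # outbound buffer space is free
    finally:
        for sock in (quiet_a, quiet_b, busy_a, busy_b):
            sock.close()


print(demo_select())
```

Only the socket with data waiting should come back as readable, while any healthy socket tends to come back as writable straight away.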
|
||||||
|
|
||||||
|
If you have a "server" socket, put it in the \code{potential_readers}
|
||||||
|
list. If it comes out in the readable list, your \code{accept}
|
||||||
|
will (almost certainly) work. If you have created a new socket to
|
||||||
|
\code{connect} to someone else, put it in the \code{potential_writers}
|
||||||
|
list. If it shows up in the writable list, you have a decent chance
|
||||||
|
that it has connected.
|
||||||
|
|
||||||
|
One very nasty problem with \code{select}: if somewhere in those
|
||||||
|
input lists of sockets is one which has died a nasty death, the
|
||||||
|
\code{select} will fail. You then need to loop through every
|
||||||
|
single damn socket in all those lists and do a
|
||||||
|
\code{select([sock],[],[],0)} until you find the bad one. That
|
||||||
|
timeout of 0 means it won't take long, but it's ugly.
|
||||||
|
|
||||||
|
Actually, \code{select} can be handy even with blocking sockets.
|
||||||
|
It's one way of determining whether you will block - the socket
|
||||||
|
returns as readable when there's something in the buffers. However,
|
||||||
|
this still doesn't help with the problem of determining whether the
|
||||||
|
other end is done, or just busy with something else.
|
||||||
|
|
||||||
|
\textbf{Portability alert}: On Unix, \code{select} works both with
|
||||||
|
the sockets and files. Don't try this on Windows. On Windows,
|
||||||
|
\code{select} works with sockets only. Also note that in C, many
|
||||||
|
of the more advanced socket options are done differently on
|
||||||
|
Windows. In fact, on Windows I usually use threads (which work very,
|
||||||
|
very well) with my sockets. Face it, if you want any kind of
|
||||||
|
performance, your code will look very different on Windows than on
|
||||||
|
Unix. (I haven't the foggiest how you do this stuff on a Mac.)
|
||||||
|
|
||||||
|
\subsection{Performance}
|
||||||
|
|
||||||
|
There's no question that the fastest sockets code uses non-blocking
|
||||||
|
sockets and select to multiplex them. You can put together something
|
||||||
|
that will saturate a LAN connection without putting any strain on the
|
||||||
|
CPU. The trouble is that an app written this way can't do much of
|
||||||
|
anything else - it needs to be ready to shuffle bytes around at all
|
||||||
|
times.
|
||||||
|
|
||||||
|
Assuming that your app is actually supposed to do something more than
|
||||||
|
that, threading is the optimal solution, (and using non-blocking
|
||||||
|
sockets will be faster than using blocking sockets). Unfortunately,
|
||||||
|
threading support in Unixes varies both in API and quality. So the
|
||||||
|
normal Unix solution is to fork a subprocess to deal with each
|
||||||
|
connection. The overhead for this is significant (and don't do this on
|
||||||
|
Windows - the overhead of process creation is enormous there). It also
|
||||||
|
means that unless each subprocess is completely independent, you'll
|
||||||
|
need to use another form of IPC, say a pipe, or shared memory and
|
||||||
|
semaphores, to communicate between the parent and child processes.
|
||||||
|
|
||||||
|
Finally, remember that even though blocking sockets are somewhat
|
||||||
|
slower than non-blocking, in many cases they are the "right"
|
||||||
|
solution. After all, if your app is driven by the data it receives
|
||||||
|
over a socket, there's not much sense in complicating the logic just
|
||||||
|
so your app can wait on \code{select} instead of
|
||||||
|
\code{recv}.
|
||||||
|
|
||||||
|
\end{document}
|
||||||
267
Doc/howto/sorting.tex
Normal file
|
|
@ -0,0 +1,267 @@
|
||||||
|
\documentclass{howto}
|
||||||
|
|
||||||
|
\title{Sorting Mini-HOWTO}
|
||||||
|
|
||||||
|
% Increment the release number whenever significant changes are made.
|
||||||
|
% The author and/or editor can define 'significant' however they like.
|
||||||
|
\release{0.01}
|
||||||
|
|
||||||
|
\author{Andrew Dalke}
|
||||||
|
\authoraddress{\email{dalke@bioreason.com}}
|
||||||
|
|
||||||
|
\begin{document}
|
||||||
|
\maketitle
|
||||||
|
|
||||||
|
\begin{abstract}
|
||||||
|
\noindent
|
||||||
|
This document is a little tutorial
|
||||||
|
showing a half dozen ways to sort a list with the built-in
|
||||||
|
\method{sort()} method.
|
||||||
|
|
||||||
|
This document is available from the Python HOWTO page at
|
||||||
|
\url{http://www.python.org/doc/howto}.
|
||||||
|
\end{abstract}
|
||||||
|
|
||||||
|
\tableofcontents
|
||||||
|
|
||||||
|
Python lists have a built-in \method{sort()} method. There are many
|
||||||
|
ways to use it to sort a list and there doesn't appear to be a single,
|
||||||
|
central place in the various manuals describing them, so I'll do so
|
||||||
|
here.
|
||||||
|
|
||||||
|
\section{Sorting basic data types}
|
||||||
|
|
||||||
|
A simple ascending sort is easy; just call the \method{sort()} method of a list.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> a.sort()
|
||||||
|
>>> print a
|
||||||
|
[1, 2, 3, 4, 5]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Sort takes an optional function which can be called for doing the
|
||||||
|
comparisons. The default sort routine is equivalent to
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> a.sort(cmp)
|
||||||
|
>>> print a
|
||||||
|
[1, 2, 3, 4, 5]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
where \function{cmp} is the built-in function which compares two objects, \code{x} and
|
||||||
|
\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During
|
||||||
|
the course of the sort the relationships must stay the same for the
|
||||||
|
final list to make sense.
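To make the contract of the comparison function concrete, here is a minimal sketch. It is an illustration, not one of the original examples: the helper name \code{compare} is made up, and \code{functools.cmp_to_key} is an assumption about your Python version (it only exists in releases newer than the one used elsewhere in this document, where you would simply write \code{a.sort(compare)}).

```python
from functools import cmp_to_key  # assumption: Python 2.7+ or 3.2+


def compare(x, y):
    """Return -1, 0 or 1, mirroring what the built-in cmp() is described as doing."""
    return (x > y) - (x < y)


a = [5, 2, 3, 1, 4]
# cmp_to_key() adapts a two-argument comparison function for sorted().
ascending = sorted(a, key=cmp_to_key(compare))
```

Whatever function you pass, it has to give consistent answers for the same pair of objects throughout the sort.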
|
||||||
|
|
||||||
|
If you want, you can define your own function for the comparison. For
|
||||||
|
integers (and numbers in general) we can do:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> def numeric_compare(x, y):
...     return x-y
...
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> a.sort(numeric_compare)
|
||||||
|
>>> print a
|
||||||
|
[1, 2, 3, 4, 5]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
By the way, this function won't work if the result of the subtraction
|
||||||
|
is out of range, as in \code{sys.maxint - (-1)}.
|
||||||
|
|
||||||
|
Or, if you don't want to define a new named function you can create an
|
||||||
|
anonymous one using \keyword{lambda}, as in:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> a.sort(lambda x, y: x-y)
|
||||||
|
>>> print a
|
||||||
|
[1, 2, 3, 4, 5]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
If you want the numbers sorted in reverse you can do
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> def reverse_numeric(x, y):
...     return y-x
...
|
||||||
|
>>> a.sort(reverse_numeric)
|
||||||
|
>>> print a
|
||||||
|
[5, 4, 3, 2, 1]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
|
||||||
|
|
||||||
|
However, it's faster if Python doesn't have to call a function for
|
||||||
|
every comparison, so if you want a reverse-sorted list of basic data
|
||||||
|
types, do the forward sort first, then use the \method{reverse()} method.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [5, 2, 3, 1, 4]
|
||||||
|
>>> a.sort()
|
||||||
|
>>> a.reverse()
|
||||||
|
>>> print a
|
||||||
|
[5, 4, 3, 2, 1]
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Here's a case-insensitive string comparison using a \keyword{lambda} function:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> import string
|
||||||
|
>>> a = string.split("This is a test string from Andrew.")
|
||||||
|
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
|
||||||
|
>>> print a
|
||||||
|
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
This goes through the overhead of converting a word to lower case
|
||||||
|
every time it must be compared. At times it may be faster to compute
|
||||||
|
these once and use those values, and the following example shows how.
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
...     offsets.append( (string.lower(words[i]), i) )
...
>>> offsets.sort()
>>> new_words = []
>>> for dontcare, i in offsets:
...     new_words.append(words[i])
...
>>> print new_words
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
The \code{offsets} list is filled with tuples, each holding a lower-case string
|
||||||
|
and its position in the \code{words} list. It is then sorted. Python's
|
||||||
|
sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
|
||||||
|
\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
|
||||||
|
|
||||||
|
The result is that the \code{offsets} list is ordered by its first
|
||||||
|
term, and the second term can be used to figure out where the original
|
||||||
|
data was stored. (The \code{for} loop assigns \code{dontcare} and
|
||||||
|
\code{i} to the two fields of each term in the list, but we only need the
|
||||||
|
index value.)
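The same idea can be wrapped in a function. This is only a sketch under my own naming (\code{sort\_by\_lowercase} is hypothetical), written with string methods and \code{enumerate()} instead of the \module{string} module so that it also runs on current Python versions.

```python
def sort_by_lowercase(words):
    """Sort words case-insensitively via an intermediate list of (key, index) tuples."""
    # Pair each lower-cased word with its position in the input list.
    offsets = [(word.lower(), i) for i, word in enumerate(words)]
    # Tuples compare term by term, so the lower-cased word decides the order.
    offsets.sort()
    # Use the stored index to fetch the original, unmodified word.
    return [words[i] for _, i in offsets]


words = "This is a test string from Andrew.".split()
new_words = sort_by_lowercase(words)
```

Each word is lower-cased exactly once, no matter how many comparisons the sort performs.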
|
||||||
|
|
||||||
|
Another way to implement this is to store the original data as the
|
||||||
|
second term in the \code{offsets} list, as in:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
...     offsets.append( (string.lower(word), word) )
...
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
...     new_words.append(word[1])
...
>>> print new_words
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
This isn't always appropriate because the second terms in the list
|
||||||
|
(the word, in this example) will be compared when the first terms are
|
||||||
|
the same. If this happens many times, then there will be the unneeded
|
||||||
|
performance hit of comparing the two objects. This can be a large
|
||||||
|
cost if most terms are the same and the objects define their own
|
||||||
|
\method{__cmp__} method, but there will still be some overhead to determine if
|
||||||
|
\method{__cmp__} is defined.
|
||||||
|
|
||||||
|
Still, for large lists, or for lists where the comparison information
|
||||||
|
is expensive to calculate, the last two examples are likely to be the
|
||||||
|
fastest way to sort a list. It will not work on weakly sorted data,
|
||||||
|
like complex numbers, but if you don't know what that means, you
|
||||||
|
probably don't need to worry about it.
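One way to combine the two forms is to store the key, then the index, then the original object. The sketch below is my own illustration of that idea, with hypothetical names (\code{Opaque}, \code{sort\_with\_tiebreak}); because every index is unique, a tie on the key is settled by the index and the objects themselves are never compared.

```python
class Opaque(object):
    """Hypothetical payload type that defines no ordering of its own."""

    def __init__(self, label):
        self.label = label


def sort_with_tiebreak(items, keyfunc):
    # (key, index, item): the unique index settles ties before the item is reached.
    decorated = [(keyfunc(item), i, item) for i, item in enumerate(items)]
    decorated.sort()
    return [item for _, _, item in decorated]


items = [Opaque("b"), Opaque("a"), Opaque("b"), Opaque("a")]
result = sort_with_tiebreak(items, lambda item: item.label)
labels = [item.label for item in result]
```

As a side effect, items with equal keys keep their original relative order.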
|
||||||
|
|
||||||
|
\section{Comparing classes}
|
||||||
|
|
||||||
|
The comparison for two basic data types, like ints to ints or string to
|
||||||
|
string, is built into Python and makes sense. There is a default way
|
||||||
|
to compare class instances, but the default manner isn't usually very
|
||||||
|
useful. You can define your own comparison with the \method{__cmp__} method,
|
||||||
|
as in:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> class Spam:
...     def __init__(self, spam, eggs):
...         self.spam = spam
...         self.eggs = eggs
...     def __cmp__(self, other):
...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
...     def __str__(self):
...         return str(self.spam + self.eggs)
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
...     print str(spam)
...
5
10
12
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Sometimes you may want to sort by a specific attribute of a class. If
|
||||||
|
appropriate you should just define the \method{__cmp__} method to compare
|
||||||
|
those values, but you cannot do this if you want to compare between
|
||||||
|
different attributes at different times. Instead, you'll need to go
|
||||||
|
back to passing a comparison function to sort, as in:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
|
||||||
|
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
|
||||||
|
>>> for spam in a:
...     print spam.eggs, str(spam)
...
|
||||||
|
3 12
|
||||||
|
4 5
|
||||||
|
6 10
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
If you want to compare two arbitrary attributes (and aren't overly
|
||||||
|
concerned about performance) you can even define your own comparison
|
||||||
|
function object. This uses the ability of a class instance to emulate
|
||||||
|
a function by defining the \method{__call__} method, as in:
|
||||||
|
|
||||||
|
\begin{verbatim}
|
||||||
|
>>> class CmpAttr:
...     def __init__(self, attr):
...         self.attr = attr
...     def __call__(self, x, y):
...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam"))  # sort by the "spam" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
1 4 5
4 6 10
9 3 12

>>> a.sort(CmpAttr("eggs"))   # re-sort by the "eggs" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
9 3 12
1 4 5
4 6 10
|
||||||
|
\end{verbatim}
|
||||||
|
|
||||||
|
Of course, if you want a faster sort you can extract the attributes
|
||||||
|
into an intermediate list and sort that list.
|
||||||
|
|
||||||
|
|
||||||
|
So, there you have it; about a half-dozen different ways to define how
|
||||||
|
to sort a list:
|
||||||
|
\begin{itemize}
|
||||||
|
\item sort using the default method
|
||||||
|
\item sort using a comparison function
|
||||||
|
\item reverse sort not using a comparison function
|
||||||
|
\item sort on an intermediate list (two forms)
|
||||||
|
\item sort using class defined __cmp__ method
|
||||||
|
\item sort using a sort function object
|
||||||
|
\end{itemize}
|
||||||
|
|
||||||
|
\end{document}
|
||||||
|
% LocalWords: maxint
|
||||||
765
Doc/howto/unicode.rst
Normal file
|
|
@ -0,0 +1,765 @@
|
||||||
|
Unicode HOWTO
|
||||||
|
================
|
||||||
|
|
||||||
|
**Version 1.02**
|
||||||
|
|
||||||
|
This HOWTO discusses Python's support for Unicode, and explains various
|
||||||
|
problems that people commonly encounter when trying to work with Unicode.
|
||||||
|
|
||||||
|
Introduction to Unicode
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
History of Character Codes
|
||||||
|
''''''''''''''''''''''''''''''
|
||||||
|
|
||||||
|
In 1968, the American Standard Code for Information Interchange,
|
||||||
|
better known by its acronym ASCII, was standardized. ASCII defined
|
||||||
|
numeric codes for various characters, with the numeric values running from 0 to
|
||||||
|
127. For example, the lowercase letter 'a' is assigned 97 as its code
|
||||||
|
value.
|
||||||
|
|
||||||
|
ASCII was an American-developed standard, so it only defined
|
||||||
|
unaccented characters. There was an 'e', but no 'é' or 'Í'. This
|
||||||
|
meant that languages which required accented characters couldn't be
|
||||||
|
faithfully represented in ASCII. (Actually the missing accents matter
|
||||||
|
for English, too, which contains words such as 'naïve' and 'café', and some
|
||||||
|
publications have house styles which require spellings such as
|
||||||
|
'coöperate'.)
|
||||||
|
|
||||||
|
For a while people just wrote programs that didn't display accents. I
|
||||||
|
remember looking at Apple ][ BASIC programs, published in French-language
|
||||||
|
publications in the mid-1980s, that had lines like these::
|
||||||
|
|
||||||
|
PRINT "FICHER EST COMPLETE."
|
||||||
|
PRINT "CARACTERE NON ACCEPTE."
|
||||||
|
|
||||||
|
Those messages should contain accents, and they just look wrong to
|
||||||
|
someone who can read French.
|
||||||
|
|
||||||
|
In the 1980s, almost all personal computers were 8-bit, meaning that
|
||||||
|
bytes could hold values ranging from 0 to 255. ASCII codes only went
|
||||||
|
up to 127, so some machines assigned values between 128 and 255 to
|
||||||
|
accented characters. Different machines had different codes, however,
|
||||||
|
which led to problems exchanging files. Eventually various commonly
|
||||||
|
used sets of values for the 128-255 range emerged. Some were true
|
||||||
|
standards, defined by the International Organization for Standardization, and
|
||||||
|
some were **de facto** conventions that were invented by one company
|
||||||
|
or another and managed to catch on.
|
||||||
|
|
||||||
|
255 characters aren't very many. For example, you can't fit
|
||||||
|
both the accented characters used in Western Europe and the Cyrillic
|
||||||
|
alphabet used for Russian into the 128-255 range because there are more than
|
||||||
|
128 such characters.
|
||||||
|
|
||||||
|
You could write files using different codes (all your Russian
|
||||||
|
files in a coding system called KOI8, all your French files in
|
||||||
|
a different coding system called Latin1), but what if you wanted
|
||||||
|
to write a French document that quotes some Russian text? In the
|
||||||
|
1980s people began to want to solve this problem, and the Unicode
|
||||||
|
standardization effort began.
|
||||||
|
|
||||||
|
Unicode started out using 16-bit characters instead of 8-bit characters. 16
|
||||||
|
bits means you have 2^16 = 65,536 distinct values available, making it
|
||||||
|
possible to represent many different characters from many different
|
||||||
|
alphabets; an initial goal was to have Unicode contain the alphabets for
|
||||||
|
every single human language. It turns out that even 16 bits isn't enough to
|
||||||
|
meet that goal, and the modern Unicode specification uses a wider range of
|
||||||
|
codes, 0-1,114,111 (0x10ffff in base-16).
|
||||||
|
|
||||||
|
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
|
||||||
|
originally separate efforts, but the specifications were merged with
|
||||||
|
the 1.1 revision of Unicode.
|
||||||
|
|
||||||
|
(This discussion of Unicode's history is highly simplified. I don't
|
||||||
|
think the average Python programmer needs to worry about the
|
||||||
|
historical details; consult the Unicode consortium site listed in the
|
||||||
|
References for more information.)
|
||||||
|
|
||||||
|
|
||||||
|
Definitions
|
||||||
|
''''''''''''''''''''''''
|
||||||
|
|
||||||
|
A **character** is the smallest possible component of a text. 'A',
|
||||||
|
'B', 'C', etc., are all different characters. So are 'È' and
|
||||||
|
'Í'. Characters are abstractions, and vary depending on the
|
||||||
|
language or context you're talking about. For example, the symbol for
|
||||||
|
ohms (Ω) is usually drawn much like the capital letter
|
||||||
|
omega (Ω) in the Greek alphabet (they may even be the same in
|
||||||
|
some fonts), but these are two different characters that have
|
||||||
|
different meanings.
|
||||||
|
|
||||||
|
The Unicode standard describes how characters are represented by
|
||||||
|
**code points**. A code point is an integer value, usually denoted in
|
||||||
|
base 16. In the standard, a code point is written using the notation
|
||||||
|
U+12ca to mean the character with value 0x12ca (4810 decimal). The
|
||||||
|
Unicode standard contains a lot of tables listing characters and their
|
||||||
|
corresponding code points::
|
||||||
|
|
||||||
|
0061 'a'; LATIN SMALL LETTER A
|
||||||
|
0062 'b'; LATIN SMALL LETTER B
|
||||||
|
0063 'c'; LATIN SMALL LETTER C
|
||||||
|
...
|
||||||
|
007B '{'; LEFT CURLY BRACKET
|
||||||
|
|
||||||
|
Strictly, these definitions imply that it's meaningless to say 'this is
|
||||||
|
character U+12ca'. U+12ca is a code point, which represents some particular
|
||||||
|
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
|
||||||
|
In informal contexts, this distinction between code points and characters will
|
||||||
|
sometimes be forgotten.
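As a small illustration of the relationship between a character, its code point, and its name, here is a sketch using the standard ``unicodedata`` module. The variable names are my own, and the ``u"..."`` literal is written so that the snippet should run on both old and current Python versions.

```python
import unicodedata

ch = u"\u12ca"                      # the character at code point U+12CA
code_point = ord(ch)                # the code point as a plain integer
notation = "U+%04x" % code_point    # the notation used in the text above
name = unicodedata.name(ch)         # the character name from the Unicode database
```

Here ``code_point`` is the integer 4810, and ``name`` is the character name quoted above.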
|
||||||
|
|
||||||
|
A character is represented on a screen or on paper by a set of graphical
|
||||||
|
elements that's called a **glyph**. The glyph for an uppercase A, for
|
||||||
|
example, is two diagonal strokes and a horizontal stroke, though the exact
|
||||||
|
details will depend on the font being used. Most Python code doesn't need
|
||||||
|
to worry about glyphs; figuring out the correct glyph to display is
|
||||||
|
generally the job of a GUI toolkit or a terminal's font renderer.
|
||||||
|
|
||||||
|
|
||||||
|
Encodings
|
||||||
|
'''''''''
|
||||||
|
|
||||||
|
To summarize the previous section:
|
||||||
|
a Unicode string is a sequence of code points, which are
|
||||||
|
numbers from 0 to 0x10ffff. This sequence needs to be represented as
|
||||||
|
a set of bytes (meaning, values from 0-255) in memory. The rules for
|
||||||
|
translating a Unicode string into a sequence of bytes are called an
|
||||||
|
**encoding**.
|
||||||
|
|
||||||
|
The first encoding you might think of is an array of 32-bit integers.
|
||||||
|
In this representation, the string "Python" would look like this::
|
||||||
|
|
||||||
|
       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
||||||
|
|
||||||
|
This representation is straightforward but using
|
||||||
|
it presents a number of problems.
|
||||||
|
|
||||||
|
1. It's not portable; different processors order the bytes
|
||||||
|
differently.
|
||||||
|
|
||||||
|
2. It's very wasteful of space. In most texts, the majority of the code
|
||||||
|
points are less than 127, or less than 255, so a lot of space is occupied
|
||||||
|
by zero bytes. The above string takes 24 bytes compared to the 6
|
||||||
|
bytes needed for an ASCII representation. Increased RAM usage doesn't
|
||||||
|
matter too much (desktop computers have megabytes of RAM, and strings
|
||||||
|
aren't usually that large), but expanding our usage of disk and
|
||||||
|
network bandwidth by a factor of 4 is intolerable.
|
||||||
|
|
||||||
|
3. It's not compatible with existing C functions such as ``strlen()``,
|
||||||
|
so a new family of wide string functions would need to be used.
|
||||||
|
|
||||||
|
4. Many Internet standards are defined in terms of textual data, and
|
||||||
|
can't handle content with embedded zero bytes.
|
||||||
|
|
||||||
|
Generally people don't use this encoding, choosing other encodings
|
||||||
|
that are more efficient and convenient.
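You can see the byte-order and space problems directly if your Python ships the UTF-32 codecs. That is an assumption: the codec names ``utf-32-le`` and ``utf-32-be`` exist in newer Python releases, not necessarily in the version this HOWTO was written for. A rough sketch:

```python
text = u"Python"

little = text.encode("utf-32-le")   # least significant byte of each code point first
big = text.encode("utf-32-be")      # most significant byte of each code point first

size = len(little)                  # four bytes for each of the six characters
zero_bytes = little.count(b"\x00")  # how many of those bytes are just padding
```

The two results contain the same 24 byte values in a different order, and three quarters of them are zero.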
|
||||||
|
|
||||||
|
Encodings don't have to handle every possible Unicode character, and
|
||||||
|
most encodings don't. For example, Python's default encoding is the
|
||||||
|
'ascii' encoding. The rules for converting a Unicode string into the
|
||||||
|
ASCII encoding are simple; for each code point:
|
||||||
|
|
||||||
|
1. If the code point is <128, each byte is the same as the value of the
|
||||||
|
code point.
|
||||||
|
|
||||||
|
2. If the code point is 128 or greater, the Unicode string can't
|
||||||
|
be represented in this encoding. (Python raises a
|
||||||
|
``UnicodeEncodeError`` exception in this case.)
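The two rules are simple enough to write out by hand. The function below is only a sketch of the rules, not how Python's codec is implemented; the name ``encode_ascii`` is hypothetical, and it raises a plain ``ValueError`` instead of the ``UnicodeEncodeError`` that Python itself raises.

```python
def encode_ascii(text):
    """Encode a Unicode string to ASCII bytes following the two rules above."""
    out = bytearray()
    for position, ch in enumerate(text):
        code_point = ord(ch)
        if code_point < 128:
            # Rule 1: the byte value is the code point itself.
            out.append(code_point)
        else:
            # Rule 2: the string cannot be represented in this encoding.
            raise ValueError("code point %d at position %d is not ASCII"
                             % (code_point, position))
    return bytes(out)
```

For example, ``encode_ascii(u"abc")`` returns the three bytes of ``abc``, while a string containing an accented character raises the error.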
|
||||||
|
|
||||||
|
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
|
||||||
|
code points 0-255 are identical to the Latin-1 values, so converting
|
||||||
|
to this encoding simply requires converting code points to byte
|
||||||
|
values; if a code point larger than 255 is encountered, the string
|
||||||
|
can't be encoded into Latin-1.
|
||||||
|
|
||||||
|
Encodings don't have to be simple one-to-one mappings like Latin-1.
|
||||||
|
Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
|
||||||
|
values weren't in one block: 'a' through 'i' had values from 129 to
|
||||||
|
137, but 'j' through 'r' were 145 through 153. If you wanted to use
|
||||||
|
EBCDIC as an encoding, you'd probably use some sort of lookup table to
|
||||||
|
perform the conversion, but this is largely an internal detail.
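Python includes codecs for several EBCDIC variants, so you can check the letter values quoted above. The sketch below assumes the ``cp500`` codec, which is one such variant; the variable names are my own.

```python
# Encode four letters with an EBCDIC codec and look at the byte values.
ebcdic = u"aijr".encode("cp500")
values = list(bytearray(ebcdic))    # the byte values as integers
```

The values for 'i' and 'j' are not adjacent, which is why a lookup table is the natural implementation.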
|
||||||
|
|
||||||
|
UTF-8 is one of the most commonly used encodings. UTF stands for
|
||||||
|
"Unicode Transformation Format", and the '8' means that 8-bit numbers
|
||||||
|
are used in the encoding. (There's also a UTF-16 encoding, but it's
|
||||||
|
less frequently used than UTF-8.) UTF-8 uses the following rules:
|
||||||
|
|
||||||
|
1. If the code point is <128, it's represented by the corresponding byte value.
|
||||||
|
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
|
||||||
|
between 128 and 255.
|
||||||
|
3. Code points >0x7ff are turned into three- or four-byte sequences, where
|
||||||
|
each byte of the sequence is between 128 and 255.
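The rules above leave out the exact bit layout. As a sketch of how they work out in practice, the hypothetical function below encodes a single code point by hand using the standard UTF-8 bit patterns; it does no validation (surrogates and out-of-range values are not rejected), so treat it as an illustration and use the built-in codec for real work.

```python
def utf8_encode_code_point(cp):
    """Encode one code point (an integer) as UTF-8 bytes."""
    if cp < 0x80:
        # Rule 1: one byte, identical to the code point.
        return bytes(bytearray([cp]))
    if cp < 0x800:
        # Rule 2: two bytes, 110xxxxx 10xxxxxx.
        return bytes(bytearray([0xC0 | (cp >> 6),
                                0x80 | (cp & 0x3F)]))
    if cp < 0x10000:
        # Rule 3, three-byte form: 1110xxxx 10xxxxxx 10xxxxxx.
        return bytes(bytearray([0xE0 | (cp >> 12),
                                0x80 | ((cp >> 6) & 0x3F),
                                0x80 | (cp & 0x3F)]))
    # Rule 3, four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    return bytes(bytearray([0xF0 | (cp >> 18),
                            0x80 | ((cp >> 12) & 0x3F),
                            0x80 | ((cp >> 6) & 0x3F),
                            0x80 | (cp & 0x3F)]))
```

Every byte of a multi-byte sequence has its high bit set, which is why the values all fall between 128 and 255.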
|
||||||
|
|
||||||
|
UTF-8 has several convenient properties:
|
||||||
|
|
||||||
|
1. It can handle any Unicode code point.
|
||||||
|
2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
|
||||||
|
3. A string of ASCII text is also valid UTF-8 text.
|
||||||
|
4. UTF-8 is fairly compact; code points below 0x800, which cover most alphabetic scripts, are turned into at most two bytes, and values less than 128 occupy only a single byte.
|
||||||
|
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
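A few of these properties are easy to check for yourself. The snippet below is a quick sketch with made-up variable names; the sample strings are arbitrary.

```python
# Property 3: ASCII text decodes to the same string as UTF-8.
ascii_bytes = b"Python"
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# Property 2: non-ASCII text produces no zero bytes.
encoded = u"caf\xe9 \u20ac".encode("utf-8")
has_zero_byte = b"\x00" in encoded

# Property 5: arbitrary 8-bit data is usually rejected as invalid UTF-8.
try:
    b"\xff\xfe\xfd".decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The encoded sample is nine bytes long: one byte for each ASCII character, two for 'é' and three for the euro sign.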
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
''''''''''''''
|
||||||
|
|
||||||
|
The Unicode Consortium site at <http://www.unicode.org> has character
|
||||||
|
charts, a glossary, and PDF versions of the Unicode specification. Be
|
||||||
|
prepared for some difficult reading.
|
||||||
|
<http://www.unicode.org/history/> is a chronology of the origin and
|
||||||
|
development of Unicode.
|
||||||
|
|
||||||
|
To help understand the standard, Jukka Korpela has written an
|
||||||
|
introductory guide to reading the Unicode character tables,
|
||||||
|
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
|
||||||
|
|
||||||
|
Roman Czyborra wrote another explanation of Unicode's basic principles;
|
||||||
|
it's at <http://czyborra.com/unicode/characters.html>.
|
||||||
|
Czyborra has written a number of other Unicode-related documents,
|
||||||
|
available from <http://www.czyborra.com>.
|
||||||
|
|
||||||
|
Two other good introductory articles were written by Joel Spolsky
|
||||||
|
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
|
||||||
|
Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
|
||||||
|
introduction didn't make things clear to you, you should try reading
|
||||||
|
one of these alternate articles before continuing.
|
||||||
|
|
||||||
|
Wikipedia entries are often helpful; see the entries for "character
|
||||||
|
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
|
||||||
|
<http://en.wikipedia.org/wiki/UTF-8>, for example.
|
||||||
|
|
||||||
|
|
||||||
|
Python's Unicode Support
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
Now that you've learned the rudiments of Unicode, we can look at
|
||||||
|
Python's Unicode features.
|
||||||
|
|
||||||
|
|
||||||
|
The Unicode Type
|
||||||
|
'''''''''''''''''''
|
||||||
|
|
||||||
|
Unicode strings are expressed as instances of the ``unicode`` type,
|
||||||
|
one of Python's repertoire of built-in types. It derives from an
|
||||||
|
abstract type called ``basestring``, which is also an ancestor of the
|
||||||
|
``str`` type; you can therefore check if a value is a string type with
|
||||||
|
``isinstance(value, basestring)``. Under the hood, Python represents
|
||||||
|
Unicode strings as either 16- or 32-bit integers, depending on how the
|
||||||
|
Python interpreter was compiled.
|
||||||
|
|
||||||
|
The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
|
||||||
|
All of its arguments should be 8-bit strings. The first argument is converted
|
||||||
|
to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
|
||||||
|
the ASCII encoding is used for the conversion, so characters greater than 127 will
|
||||||
|
be treated as errors::
|
||||||
|
|
||||||
|
>>> unicode('abcdef')
|
||||||
|
u'abcdef'
|
||||||
|
>>> s = unicode('abcdef')
|
||||||
|
>>> type(s)
|
||||||
|
<type 'unicode'>
|
||||||
|
>>> unicode('abcdef' + chr(255))
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in ?
|
||||||
|
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
|
||||||
|
ordinal not in range(128)
|
||||||
|
|
||||||
|
The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
|
||||||
|
are 'strict' (raise a ``UnicodeDecodeError`` exception),
|
||||||
|
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
|
||||||
|
or 'ignore' (just leave the character out of the Unicode result).
|
||||||
|
The following examples show the differences::
|
||||||
|
|
||||||
|
>>> unicode('\x80abc', errors='strict')
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in ?
|
||||||
|
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
|
||||||
|
ordinal not in range(128)
|
||||||
|
>>> unicode('\x80abc', errors='replace')
|
||||||
|
u'\ufffdabc'
|
||||||
|
>>> unicode('\x80abc', errors='ignore')
|
||||||
|
u'abc'
|
||||||
|
|
||||||
|
Encodings are specified as strings containing the encoding's name.
|
||||||
|
Python 2.4 comes with roughly 100 different encodings; see the Python
|
||||||
|
Library Reference at
|
||||||
|
<http://docs.python.org/lib/standard-encodings.html> for a list. Some
|
||||||
|
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
|
||||||
|
and '8859' are all synonyms for the same encoding.
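If you want to confirm that several names refer to the same encoding, the ``codecs`` module can resolve them. This is a sketch that assumes a Python version in which the object returned by ``codecs.lookup()`` has a ``name`` attribute (older releases return a plain tuple).

```python
import codecs

aliases = ["latin-1", "iso_8859_1", "8859"]

# codecs.lookup() resolves each alias; .name is the codec's canonical name.
canonical = [codecs.lookup(alias).name for alias in aliases]
all_same = len(set(canonical)) == 1

# Each alias should therefore produce identical bytes.
encoded = [u"\xe9".encode(alias) for alias in aliases]
```

Looking up a name that isn't registered raises ``LookupError``.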
|
||||||
|
|
||||||
|
One-character Unicode strings can also be created with the
|
||||||
|
``unichr()`` built-in function, which takes integers and returns a
|
||||||
|
Unicode string of length 1 that contains the corresponding code point.
|
||||||
|
The reverse operation is the built-in ``ord()`` function that takes a
|
||||||
|
one-character Unicode string and returns the code point value::
|
||||||
|
|
||||||
|
>>> unichr(40960)
|
||||||
|
u'\ua000'
|
||||||
|
>>> ord(u'\ua000')
|
||||||
|
40960
|
||||||
|
|
||||||
|
Instances of the ``unicode`` type have many of the same methods as
|
||||||
|
the 8-bit string type for operations such as searching and formatting::
|
||||||
|
|
||||||
|
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
|
||||||
|
>>> s.count('e')
|
||||||
|
5
|
||||||
|
>>> s.find('feather')
|
||||||
|
9
|
||||||
|
>>> s.find('bird')
|
||||||
|
-1
|
||||||
|
>>> s.replace('feather', 'sand')
|
||||||
|
u'Was ever sand so lightly blown to and fro as this multitude?'
|
||||||
|
>>> s.upper()
|
||||||
|
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
|
||||||
|
|
||||||
|
Note that the arguments to these methods can be Unicode strings or 8-bit strings.
|
||||||
|
8-bit strings will be converted to Unicode before carrying out the operation;
|
||||||
|
Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
|
||||||
|
|
||||||
|
>>> s.find('Was\x9f')
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in ?
|
||||||
|
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
|
||||||
|
>>> s.find(u'Was\x9f')
|
||||||
|
-1
|
||||||
|
|
||||||
|
Much Python code that operates on strings will therefore work with
|
||||||
|
Unicode strings without requiring any changes to the code. (Input and
|
||||||
|
output code needs more updating for Unicode; more on this later.)
|
||||||
|
|
||||||
|
Another important method is ``.encode([encoding], [errors='strict'])``,
|
||||||
|
which returns an 8-bit string version of the
|
||||||
|
Unicode string, encoded in the requested encoding. The ``errors``
|
||||||
|
parameter is the same as the parameter of the ``unicode()``
|
||||||
|
constructor, with one additional possibility; as well as 'strict',
|
||||||
|
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
|
||||||
|
uses XML's character references. The following example shows the
|
||||||
|
different results::
|
||||||
|
|
||||||
|
>>> u = unichr(40960) + u'abcd' + unichr(1972)
|
||||||
|
>>> u.encode('utf-8')
|
||||||
|
'\xea\x80\x80abcd\xde\xb4'
|
||||||
|
>>> u.encode('ascii')
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in ?
|
||||||
|
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
|
||||||
|
>>> u.encode('ascii', 'ignore')
|
||||||
|
'abcd'
|
||||||
|
>>> u.encode('ascii', 'replace')
|
||||||
|
'?abcd?'
|
||||||
|
>>> u.encode('ascii', 'xmlcharrefreplace')
|
||||||
|
'&#40960;abcd&#1972;'
|
||||||
|
|
||||||
|
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
|
||||||
|
that interprets the string using the given encoding::
|
||||||
|
|
||||||
|
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
|
||||||
|
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
|
||||||
|
>>> type(utf8_version), utf8_version
|
||||||
|
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
|
||||||
|
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
|
||||||
|
>>> u == u2 # The two strings match
|
||||||
|
True
|
||||||
|
|
||||||
|
The low-level routines for registering and accessing the available
|
||||||
|
encodings are found in the ``codecs`` module. However, the encoding
|
||||||
|
and decoding functions returned by this module are usually more
|
||||||
|
low-level than is comfortable, so I'm not going to describe the
|
||||||
|
``codecs`` module here. If you need to implement a completely new
|
||||||
|
encoding, you'll need to learn about the ``codecs`` module interfaces,
|
||||||
|
but implementing encodings is a specialized task that also won't be
|
||||||
|
covered here. Consult the Python documentation to learn more about
|
||||||
|
this module.
|
||||||
|
|
||||||
|
The most commonly used part of the ``codecs`` module is the
|
||||||
|
``codecs.open()`` function which will be discussed in the section
|
||||||
|
on input and output.
|
||||||
|
|
||||||
|
|
||||||
|
Unicode Literals in Python Source Code
|
||||||
|
''''''''''''''''''''''''''''''''''''''''''
|
||||||
|
|
||||||
|
In Python source code, Unicode literals are written as strings
|
||||||
|
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
|
||||||
|
code points can be written using the ``\u`` escape sequence, which is
|
||||||
|
followed by four hex digits giving the code point. The ``\U`` escape
|
||||||
|
sequence is similar, but expects 8 hex digits, not 4.
|
||||||
|
|
||||||
|
Unicode literals can also use the same escape sequences as 8-bit
|
||||||
|
strings, including ``\x``, but ``\x`` only takes two hex digits so it
|
||||||
|
can't express an arbitrary code point. Octal escapes can go up to
|
||||||
|
U+01ff, which is octal 777.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768
|
||||||
|
|
||||||
|
Using escape sequences for code points greater than 127 is fine in
|
||||||
|
small doses, but becomes an annoyance if you're using many accented
|
||||||
|
characters, as you would in a program with messages in French or some
|
||||||
|
other accent-using language. You can also assemble strings using the
|
||||||
|
``unichr()`` built-in function, but this is even more tedious.
|
||||||
|
|
||||||
|
Ideally, you'd want to be able to write literals in your language's
natural encoding. You could then edit Python source code with your
favorite editor, which would display the accented characters naturally,
and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source
file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables
local to a file. Emacs supports many different variables, but Python
only supports ``coding``. The ``-*-`` symbols indicate to Emacs that the
comment is special; they have no significance to Python but are a
convention. Within the comment, you must supply the name ``coding`` and
the name of your chosen encoding, separated by ``':'``.
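To make the "first or second line" rule concrete, here is a small sketch that looks for an encoding declaration in a list of source lines. The regular expression is the one given in PEP 263, which also accepts ``=`` as a separator; the helper name ``declared_encoding`` is hypothetical, and this is an illustration rather than the interpreter's actual implementation.

```python
import re

# Pattern from PEP 263: a comment containing "coding:" or "coding=".
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

example = [
    "#!/usr/bin/env python",
    "# -*- coding: latin-1 -*-",
    "print(1)",
]
print(declared_encoding(example))
```

A declaration that appears on the third line or later is ignored by this sketch, matching the rule stated above.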
|
||||||
|
|
||||||
|
If you don't include such a comment, the default encoding used will be
ASCII. Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning. For
example, the following program has no encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details


Unicode Properties
'''''''''''''''''''

The Unicode specification includes a database of information about
code points. For each code point that's defined, the information
includes the character's name, its category, and the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths). There are also
properties related to the code point's use in bidirectional text and
other display-related properties.

The following program displays some information about several
characters, and prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the
character. These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories. To take the codes from the above output, ``'Ll'``
means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
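Because the first letter of a category code identifies its major class, a program can group characters without listing every subcategory. The sketch below assumes a hand-written ``MAJOR_CLASSES`` table and a hypothetical ``describe()`` helper; only ``unicodedata.category()`` comes from the standard library.

```python
import unicodedata

# First letter of a general-category code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def describe(ch):
    """Return (category code, major class) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]

for ch in u"\u00e9\u0bf2\u0f84\u33af":
    print(describe(ch))
```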
|
||||||
|
|
||||||
|
References
''''''''''''''

The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the ``unicodedata`` module is at
<http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002
titled "Python and Unicode". A PDF version of his slides
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
and is an excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
----------------------------------------

Once you've written some code that works with Unicode data, the next
problem is input/output. How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?

Depending on your input sources and output destinations, you may not
need to do anything; you should check whether the
libraries used in your application support Unicode natively. XML
parsers often return Unicode data, for example. Many relational
databases also support Unicode-valued columns and can return Unicode
values from an SQL query.

Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket. It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``. However, the
manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes. If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only part of the bytes
encoding a single Unicode character are read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2 GB file, you need 2 GB
of RAM. (More, really, since for at least a moment you'd need to have
both the encoded string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch
the case of partial coding sequences. The work of implementing this
has already been done for you: the ``codecs`` module includes a
version of the ``open()`` function that returns a file-like object
that assumes the file's contents are in a specified encoding and
accepts Unicode parameters for methods such as ``.read()`` and
``.write()``.
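The partial-sequence problem can be seen without a file. The sketch below assumes that ``codecs.getincrementaldecoder()`` is available in your Python version; the helper name ``decode_chunks`` is hypothetical. It splits a UTF-8 byte string in the middle of a three-byte character and shows that an incremental decoder holds the incomplete bytes until the rest arrives, while decoding the first chunk on its own fails.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode a sequence of byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder complain about leftover partial bytes.
    pieces.append(decoder.decode(b"", final=True))
    return u"".join(pieces)

data = u"\u4500 blah".encode("utf-8")     # U+4500 takes three bytes
chunks = [data[:2], data[2:]]             # split inside that character

try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)
print(decode_chunks(chunks) == u"\u4500 blah")
```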
|
||||||
|
|
||||||
|
The function's parameters are
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.
``mode`` can be ``'r'``, ``'w'``, or ``'a'``, just like the corresponding
parameter to the regular built-in ``open()`` function; add a ``'+'`` to
update the file. ``buffering`` is similarly parallel to the standard
function's parameter. ``encoding`` is a string giving the encoding to
use; if it's left as ``None``, a regular Python file object that accepts
8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as
needed. ``errors`` specifies the action for encoding errors and can be
one of the usual values of ``'strict'``, ``'ignore'``, and ``'replace'``.
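The same three error-handler names are accepted by the ``decode()`` method, so their effect can be shown without opening a file. This is a small sketch using a byte string that is deliberately invalid UTF-8; the variable names are only for illustration.

```python
raw = b"abc\xff"          # 0xff never occurs in valid UTF-8

try:
    raw.decode("utf-8", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

ignored = raw.decode("utf-8", "ignore")     # bad byte is dropped
replaced = raw.decode("utf-8", "replace")   # bad byte becomes U+FFFD

print(strict_result)
print(repr(ignored))
print(repr(replaced))
```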
|
||||||
|
|
||||||
|
Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode,
allowing both reading and writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM),
and is often written as the first character of a file in order
to assist with autodetection of the file's byte ordering.
Some encodings, such as UTF-16, expect a BOM to be present at
the start of a file; when such an encoding is used,
the BOM will be automatically written as the first character
and will be silently dropped when the file is read. There are
variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
for little-endian and big-endian encodings, that specify
one particular byte ordering and don't
skip the BOM.
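The difference between ``'utf-16'`` and the fixed-order variants can be checked directly on byte strings. The following sketch uses only ``encode()``, ``decode()`` and the BOM constants from the ``codecs`` module; the variable names are illustrative.

```python
import codecs

text = u"hi"

# 'utf-16' writes a BOM first and drops it again when decoding.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
round_trip = with_bom.decode("utf-16")

# 'utf-16-le' writes no BOM, and keeps one if it is present in the input.
little = text.encode("utf-16-le")
kept = (codecs.BOM_UTF16_LE + little).decode("utf-16-le")

print(starts_with_bom)
print(repr(round_trip))
print(repr(little))
print(repr(kept))
```

In the last case the decoded string begins with U+FEFF, which is what "don't skip the BOM" means in practice.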
|
||||||
|
|
||||||
|
|
||||||
|
Unicode filenames
'''''''''''''''''''''''''

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. For example, Mac OS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is. On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.

The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother. When opening a file
for reading or writing, you can usually just provide the Unicode
string as the filename, and it will be automatically converted to the
right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.

``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions? ``os.listdir()`` will do
both, depending on whether you provided the directory path as an 8-bit
string or a Unicode string. If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list
of Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.
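The same behaviour can be reproduced in a scratch directory so that the listing is predictable. This sketch is an illustration under assumptions: it uses a temporary directory and an ASCII filename so that it does not depend on the filesystem encoding, and the variable names are hypothetical.

```python
import os
import shutil
import sys
import tempfile

workdir = tempfile.mkdtemp()
try:
    open(os.path.join(workdir, "example.txt"), "w").close()

    fs_encoding = sys.getfilesystemencoding()
    # Obtain both a Unicode and an 8-bit form of the same directory path.
    if isinstance(workdir, type(u"")):
        text_path = workdir
    else:
        text_path = workdir.decode(fs_encoding)
    byte_path = text_path.encode(fs_encoding)

    text_names = os.listdir(text_path)    # Unicode path -> Unicode names
    byte_names = os.listdir(byte_path)    # 8-bit path   -> 8-bit names
finally:
    shutil.rmtree(workdir)

print(text_names)
print(byte_names)
```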
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''

This section provides some suggestions on writing software that
deals with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally,
    converting to a particular encoding on output.

If you attempt to write processing functions that accept both
Unicode and 8-bit strings, you will find your program vulnerable to
bugs wherever you combine the two different kinds of strings. Python's
default encoding is ASCII, so whenever a character with a value greater
than 127 is in the input data, you'll get a ``UnicodeDecodeError``
because that character can't be handled by the ASCII encoding.
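One way to follow this tip is to decode 8-bit input once, at the point where it enters the program, and to pass only Unicode strings onward. The sketch below uses a hypothetical ``to_unicode()`` helper and assumes UTF-8 input; it also makes the ASCII decoding explicit to show the ``UnicodeDecodeError`` mentioned above.

```python
def to_unicode(value, encoding="utf-8"):
    """Decode 8-bit input at the boundary; pass Unicode through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw = b"caf\xc3\xa9"          # UTF-8 bytes for 'café'

# Decoding with the ASCII codec fails on the bytes above 127.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

name = to_unicode(raw)
print(ascii_failed)
print(len(name))
```

After decoding, the string has four characters even though the input had five bytes.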
|
||||||
|
|
||||||
|
It's easy to miss such problems if you only test your software
with data that doesn't contain any
accents; everything will seem to work, but there's actually a bug in your
program waiting for the first user who attempts to use characters >127.
A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your
    test data.

When using data coming from a web browser or some other untrusted source,
a common technique is to check for illegal characters in a string
before using the string in a generated command line or storing it in a
database. If you're doing this, be careful to check
the string once it's in the form that will be used or stored; it's
possible for encodings to be used to disguise characters. This is especially
true if the input data also specifies the encoding;
many encodings leave the commonly checked-for characters alone,
but Python includes some encodings such as ``'base64'``
that modify every single character.

For example, let's say you have a content management system that takes a
Unicode filename, and you want to disallow paths with a '/' character.
You might write this code::

    def read_file (filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
encoded form of the string ``'/etc/passwd'``, to read a
system file. The above code looks for ``'/'`` characters
in the encoded form and misses the dangerous character
in the resulting decoded form.
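A safer ordering is to decode first and validate the decoded form. The following is a minimal sketch of that idea, not a complete defence: the helper name ``checked_name`` is hypothetical, and the base-64 step uses the ``base64`` module only to confirm what the disguised string contains.

```python
import base64

def checked_name(raw_name, encoding):
    """Decode an 8-bit filename, then reject it if it contains '/'."""
    name = raw_name.decode(encoding)       # decode first ...
    if u"/" in name:                       # ... then check the decoded form
        raise ValueError("'/' not allowed in filenames")
    return name

disguised = b"L2V0Yy9wYXNzd2Q="
hidden = base64.b64decode(disguised)       # what the attacker really sent

try:
    checked_name(hidden, "utf-8")
    rejected = False
except ValueError:
    rejected = True

print(b"/" in disguised)   # the encoded form contains no '/'
print(rejected)            # the decoded form is rejected
```

Restricting the set of encodings that callers may name is a further precaution worth considering.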
|
||||||
|
|
||||||
|
References
''''''''''''''

The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.


Revision History and Acknowledgements
------------------------------------------

Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin,
Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup
errors; adds several links.

Version 1.02: posted August 16 2005. Corrects factual errors.
|
||||||
|
|
||||||
|
|
||||||
|
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?

.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
       - [ ] Writing unicode literals
           - [ ] Obscurity: -U switch
       - [ ] Built-ins
           - [ ] unichr()
           - [ ] ord()
           - [ ] unicode() constructor
       - [ ] Unicode type
           - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
       - [ ] Reading/writing Unicode data into files
           - [ ] Byte-order marks
       - [ ] Unicode filenames
   - [ ] Writing Unicode programs
       - [ ] Do everything in Unicode
       - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
       - [ ] Building Python (UCS2, UCS4)