mirror of
https://github.com/python/cpython.git
synced 2025-07-24 11:44:31 +00:00
Commit the howto source to the main Python repository, with Fred's approval
parent
f1b2ba6aa1
commit
e8f44d683e
9 changed files with 4340 additions and 0 deletions
88
Doc/howto/Makefile
Normal file
|
@ -0,0 +1,88 @@
|
|||
|
||||
MKHOWTO=../tools/mkhowto
|
||||
WEBDIR=.
|
||||
RSTARGS = --input-encoding=utf-8
|
||||
VPATH=.:dvi:pdf:ps:txt
|
||||
|
||||
# List of HOWTOs that aren't to be processed
|
||||
|
||||
REMOVE_HOWTO =
|
||||
|
||||
# Determine list of files to be built
|
||||
|
||||
HOWTO=$(filter-out $(REMOVE_HOWTO),$(wildcard *.tex))
|
||||
RST_SOURCES = $(shell echo *.rst)
|
||||
DVI =$(patsubst %.tex,%.dvi,$(HOWTO))
|
||||
PDF =$(patsubst %.tex,%.pdf,$(HOWTO))
|
||||
PS =$(patsubst %.tex,%.ps,$(HOWTO))
|
||||
TXT =$(patsubst %.tex,%.txt,$(HOWTO))
|
||||
HTML =$(patsubst %.tex,%,$(HOWTO))
|
||||
|
||||
# Rules for building various formats
|
||||
%.dvi : %.tex
|
||||
$(MKHOWTO) --dvi $<
|
||||
mv $@ dvi
|
||||
|
||||
%.pdf : %.tex
|
||||
$(MKHOWTO) --pdf $<
|
||||
mv $@ pdf
|
||||
|
||||
%.ps : %.tex
|
||||
$(MKHOWTO) --ps $<
|
||||
mv $@ ps
|
||||
|
||||
%.txt : %.tex
|
||||
$(MKHOWTO) --text $<
|
||||
mv $@ txt
|
||||
|
||||
% : %.tex
|
||||
$(MKHOWTO) --html --iconserver="." $<
|
||||
tar -zcvf html/$*.tgz $*
|
||||
#zip -r html/$*.zip $*
|
||||
|
||||
default:
|
||||
@echo "'all' -- build all files"
|
||||
@echo "'dvi', 'pdf', 'ps', 'txt', 'html' -- build one format"
|
||||
|
||||
all: $(HTML)
|
||||
|
||||
.PHONY : dvi pdf ps txt html rst
|
||||
dvi: $(DVI)
|
||||
|
||||
pdf: $(PDF)
|
||||
ps: $(PS)
|
||||
txt: $(TXT)
|
||||
html:$(HTML)
|
||||
|
||||
# Rule to build collected tar files
|
||||
dist: #all
|
||||
for i in dvi pdf ps txt ; do \
|
||||
cd $$i ; \
|
||||
tar -zcf All.tgz *.$$i ;\
|
||||
cd .. ;\
|
||||
done
|
||||
|
||||
# Rule to copy files to the Web tree on AMK's machine
|
||||
web: dist
|
||||
cp dvi/* $(WEBDIR)/dvi
|
||||
cp ps/* $(WEBDIR)/ps
|
||||
cp pdf/* $(WEBDIR)/pdf
|
||||
cp txt/* $(WEBDIR)/txt
|
||||
for dir in $(HTML) ; do cp -rp $$dir $(WEBDIR) ; done
|
||||
for ltx in $(HOWTO) ; do cp -p $$ltx $(WEBDIR)/latex ; done
|
||||
|
||||
rst: unicode.html
|
||||
|
||||
%.html: %.rst
|
||||
rst2html $(RSTARGS) $< >$@
|
||||
|
||||
clean:
|
||||
rm -f *~ *.log *.ind *.l2h *.aux *.toc *.how
|
||||
rm -f *.dvi *.ps *.pdf *.bkm
|
||||
rm -f unicode.html
|
||||
|
||||
clobber:
|
||||
rm dvi/* ps/* pdf/* txt/* html/*
|
||||
|
||||
|
||||
|
405
Doc/howto/advocacy.tex
Normal file
|
@ -0,0 +1,405 @@
|
|||
|
||||
\documentclass{howto}
|
||||
|
||||
\title{Python Advocacy HOWTO}
|
||||
|
||||
\release{0.03}
|
||||
|
||||
\author{A.M. Kuchling}
|
||||
\authoraddress{\email{amk@amk.ca}}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
It's usually difficult to get your management to accept open source
|
||||
software, and Python is no exception to this rule. This document
|
||||
discusses reasons to use Python, strategies for winning acceptance,
|
||||
facts and arguments you can use, and cases where you \emph{shouldn't}
|
||||
try to use Python.
|
||||
|
||||
This document is available from the Python HOWTO page at
|
||||
\url{http://www.python.org/doc/howto}.
|
||||
|
||||
\end{abstract}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
\section{Reasons to Use Python}
|
||||
|
||||
There are several reasons to incorporate a scripting language into
|
||||
your development process, and this section will discuss them, and why
|
||||
Python has some properties that make it a particularly good choice.
|
||||
|
||||
\subsection{Programmability}
|
||||
|
||||
Programs are often organized in a modular fashion. Lower-level
|
||||
operations are grouped together, and called by higher-level functions,
|
||||
which may in turn be used as basic operations by still further upper
|
||||
levels.
|
||||
|
||||
For example, the lowest level might define a very low-level
|
||||
set of functions for accessing a hash table. The next level might use
|
||||
hash tables to store the headers of a mail message, mapping a header
|
||||
name like \samp{Date} to a value such as \samp{Tue, 13 May 1997
|
||||
20:00:54 -0400}. A yet higher level may operate on message objects,
|
||||
without knowing or caring that message headers are stored in a hash
|
||||
table, and so forth.
|
||||
|
||||
Often, the lowest levels do very simple things; they implement a data
|
||||
structure such as a binary tree or hash table, or they perform some
|
||||
simple computation, such as converting a date string to a number. The
|
||||
higher levels then contain logic connecting these primitive
|
||||
operations. Using this approach, the primitives can be seen as basic
|
||||
building blocks which are then glued together to produce the complete
|
||||
product.
|
||||
|
||||
Why is this design approach relevant to Python? Because Python is
|
||||
well suited to functioning as such a glue language. A common approach
|
||||
is to write a Python module that implements the lower level
|
||||
operations; for the sake of speed, the implementation might be in C,
|
||||
Java, or even Fortran. Once the primitives are available to Python
|
||||
programs, the logic underlying higher level operations is written in
|
||||
the form of Python code. The high-level logic is then more
|
||||
understandable, and easier to modify.
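
As a rough sketch of this layering, the mail-header example above might
look like the following in Python. The names \function{split_header()},
\function{parse_headers()} and \code{Message} are invented for this
illustration and are not part of any library; in a real system the
lowest layer might instead be an extension module written in C.

\begin{verbatim}
# Lowest level: a primitive that splits one header line.  In a real
# system this might be implemented in C for speed.
def split_header(line):
    name, value = line.split(":", 1)
    return name.strip(), value.strip()

# Middle level: store the headers in a dictionary (a hash table).
def parse_headers(lines):
    headers = {}
    for line in lines:
        name, value = split_header(line)
        headers[name] = value
    return headers

# Highest level: callers work with message objects and never see
# how the headers are stored.
class Message:
    def __init__(self, lines):
        self._headers = parse_headers(lines)

    def get_date(self):
        return self._headers.get("Date")

msg = Message(["Date: Tue, 13 May 1997 20:00:54 -0400",
               "Subject: Hello"])
print(msg.get_date())
\end{verbatim}

Code that calls \method{get_date()} never needs to know that a
dictionary sits underneath, so the storage could later be changed
without touching the callers.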
|
||||
|
||||
John Ousterhout wrote a paper that explains this idea at greater
|
||||
length, entitled ``Scripting: Higher Level Programming for the 21st
|
||||
Century''. I recommend that you read this paper; see the references
|
||||
for the URL. Ousterhout is the inventor of the Tcl language, and
|
||||
therefore argues that Tcl should be used for this purpose; he only
|
||||
briefly refers to other languages such as Python, Perl, and
|
||||
Lisp/Scheme, but in reality, Ousterhout's argument applies to
|
||||
scripting languages in general, since you could equally write
|
||||
extensions for any of the languages mentioned above.
|
||||
|
||||
\subsection{Prototyping}
|
||||
|
||||
In \emph{The Mythical Man-Month}, Frederick Brooks suggests the
|
||||
following rule when planning software projects: ``Plan to throw one
|
||||
away; you will anyway.'' Brooks is saying that the first attempt at a
|
||||
software design often turns out to be wrong; unless the problem is
|
||||
very simple or you're an extremely good designer, you'll find that new
|
||||
requirements and features become apparent once development has
|
||||
actually started. If these new requirements can't be cleanly
|
||||
incorporated into the program's structure, you're presented with two
|
||||
unpleasant choices: hammer the new features into the program somehow,
|
||||
or scrap everything and write a new version of the program, taking the
|
||||
new features into account from the beginning.
|
||||
|
||||
Python provides you with a good environment for quickly developing an
|
||||
initial prototype. That lets you get the overall program structure
|
||||
and logic right, and you can fine-tune small details in the fast
|
||||
development cycle that Python provides. Once you're satisfied with
|
||||
the GUI interface or program output, you can translate the Python code
|
||||
into C++, Fortran, Java, or some other compiled language.
|
||||
|
||||
Prototyping means you have to be careful not to use too many Python
|
||||
features that are hard to implement in your other language. Using
|
||||
\code{eval()}, or regular expressions, or the \module{pickle} module,
|
||||
means that you're going to need C or Java libraries for formula
|
||||
evaluation, regular expressions, and serialization, for example. But
|
||||
it's not hard to avoid such tricky code, and in the end the
|
||||
translation usually isn't very difficult. The resulting code can be
|
||||
rapidly debugged, because any serious logical errors will have been
|
||||
removed from the prototype, leaving only more minor slip-ups in the
|
||||
translation to track down.
|
||||
|
||||
This strategy builds on the earlier discussion of programmability.
|
||||
Using Python as glue to connect lower-level components has obvious
|
||||
relevance for constructing prototype systems. In this way Python can
|
||||
help you with development, even if end users never come in contact
|
||||
with Python code at all. If the performance of the Python version is
|
||||
adequate and corporate politics allow it, you may not need to do a
|
||||
translation into C or Java, but it can still be faster to develop a
|
||||
prototype and then translate it, instead of attempting to produce the
|
||||
final version immediately.
|
||||
|
||||
One example of this development strategy is Microsoft Merchant Server.
|
||||
Version 1.0 was written in pure Python, by a company that subsequently
|
||||
was purchased by Microsoft. Version 2.0 began to translate the code
|
||||
into \Cpp, shipping with some \Cpp code and some Python code. Version
|
||||
3.0 didn't contain any Python at all; all the code had been translated
|
||||
into \Cpp. Even though the product doesn't contain a Python
|
||||
interpreter, the Python language has still served a useful purpose by
|
||||
speeding up development.
|
||||
|
||||
This is a very common use for Python. Past conference papers have
|
||||
also described this approach for developing high-level numerical
|
||||
algorithms; see David M. Beazley and Peter S. Lomdahl's paper
|
||||
``Feeding a Large-scale Physics Application to Python'' in the
|
||||
references for a good example. If an algorithm's basic operations are
|
||||
things like ``Take the inverse of this 4000x4000 matrix'', and are
|
||||
implemented in some lower-level language, then Python has almost no
|
||||
additional performance cost; the extra time required for Python to
|
||||
evaluate an expression like \code{m.invert()} is dwarfed by the cost
|
||||
of the actual computation. It's particularly good for applications
|
||||
where seemingly endless tweaking is required to get things right. GUI
|
||||
interfaces and Web sites are prime examples.
|
||||
|
||||
The Python code is also shorter and faster to write (once you're
|
||||
familiar with Python), so it's easier to throw it away if you decide
|
||||
your approach was wrong; if you'd spent two weeks working on it
|
||||
instead of just two hours, you might waste time trying to patch up
|
||||
what you've got out of a natural reluctance to admit that those two
|
||||
weeks were wasted. Truthfully, those two weeks haven't been wasted,
|
||||
since you've learnt something about the problem and the technology
|
||||
you're using to solve it, but it's human nature to view this as a
|
||||
failure of some sort.
|
||||
|
||||
\subsection{Simplicity and Ease of Understanding}
|
||||
|
||||
Python is definitely \emph{not} a toy language that's only usable for
|
||||
small tasks. The language features are general and powerful enough to
|
||||
enable it to be used for many different purposes. It's useful at the
|
||||
small end, for 10- or 20-line scripts, but it also scales up to larger
|
||||
systems that contain thousands of lines of code.
|
||||
|
||||
However, this expressiveness doesn't come at the cost of an obscure or
|
||||
tricky syntax. While Python has some dark corners that can lead to
|
||||
obscure code, there are relatively few such corners, and proper design
|
||||
can isolate their use to only a few classes or modules. It's
|
||||
certainly possible to write confusing code by using too many features
|
||||
with too little concern for clarity, but most Python code can look a
|
||||
lot like a slightly-formalized version of human-understandable
|
||||
pseudocode.
|
||||
|
||||
In \emph{The New Hacker's Dictionary}, Eric S. Raymond gives the following
|
||||
definition for ``compact'':
|
||||
|
||||
\begin{quotation}
|
||||
Compact \emph{adj.} Of a design, describes the valuable property
|
||||
that it can all be apprehended at once in one's head. This
|
||||
generally means the thing created from the design can be used
|
||||
with greater facility and fewer errors than an equivalent tool
|
||||
that is not compact. Compactness does not imply triviality or
|
||||
lack of power; for example, C is compact and FORTRAN is not,
|
||||
but C is more powerful than FORTRAN. Designs become
|
||||
non-compact through accreting features and cruft that don't
|
||||
merge cleanly into the overall design scheme (thus, some fans
|
||||
of Classic C maintain that ANSI C is no longer compact).
|
||||
\end{quotation}
|
||||
|
||||
(From \url{http://sagan.earthspace.net/jargon/jargon_18.html\#SEC25})
|
||||
|
||||
In this sense of the word, Python is quite compact, because the
|
||||
language has just a few ideas, which are used in lots of places. Take
|
||||
namespaces, for example. Import a module with \code{import math}, and
|
||||
you create a new namespace called \samp{math}. Classes are also
|
||||
namespaces that share many of the properties of modules, and have a
|
||||
few of their own; for example, you can create instances of a class.
|
||||
Instances? They're yet another namespace. Namespaces are currently
|
||||
implemented as Python dictionaries, so they have the same methods as
|
||||
the standard dictionary data type: \method{keys()} returns all the keys, and
|
||||
so forth.
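
The following short sketch illustrates the point using the built-in
\function{vars()} function, which returns the dictionary (or, for some
classes, a read-only view of it) behind a namespace. The \code{Point}
class is made up for this example.

\begin{verbatim}
import math

class Point:
    dimensions = 2          # stored in the class namespace

p = Point()
p.x = 3                     # stored in the instance namespace

# Each namespace is backed by a dictionary-like object, so the
# usual dictionary methods such as keys() are available.
print("sqrt" in vars(math).keys())
print("dimensions" in vars(Point).keys())
print(list(vars(p).keys()))
\end{verbatim}

The module, the class and the instance are all inspected with the same
call, which is the kind of reuse of a single idea that the definition
of ``compact'' describes.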
|
||||
|
||||
This simplicity arises from Python's development history. The
|
||||
language syntax derives from different sources; ABC, a relatively
|
||||
obscure teaching language, is one primary influence, and Modula-3 is
|
||||
another. (For more information about ABC and Modula-3, consult their
|
||||
respective Web sites at \url{http://www.cwi.nl/~steven/abc/} and
|
||||
\url{http://www.m3.org}.) Other features have come from C, Icon,
|
||||
Algol-68, and even Perl. Python hasn't really innovated very much,
|
||||
but instead has tried to keep the language small and easy to learn,
|
||||
building on ideas that have been tried in other languages and found
|
||||
useful.
|
||||
|
||||
Simplicity is a virtue that should not be underestimated. It lets you
|
||||
learn the language more quickly, and then rapidly write code, code
|
||||
that often works the first time you run it.
|
||||
|
||||
\subsection{Java Integration}
|
||||
|
||||
If you're working with Java, Jython
|
||||
(\url{http://www.jython.org/}) is definitely worth your
|
||||
attention. Jython is a re-implementation of Python in Java that
|
||||
compiles Python code into Java bytecodes. The resulting environment
|
||||
has very tight, almost seamless, integration with Java. It's trivial
|
||||
to access Java classes from Python, and you can write Python classes
|
||||
that subclass Java classes. Jython can be used for prototyping Java
|
||||
applications in much the same way CPython is used, and it can also be
|
||||
used for test suites for Java code, or embedded in a Java application
|
||||
to add scripting capabilities.
|
||||
|
||||
\section{Arguments and Rebuttals}
|
||||
|
||||
Let's say that you've decided upon Python as the best choice for your
|
||||
application. How can you convince your management, or your fellow
|
||||
developers, to use Python? This section lists some common arguments
|
||||
against using Python, and provides some possible rebuttals.
|
||||
|
||||
\emph{Python is freely available software that doesn't cost anything.
|
||||
How good can it be?}
|
||||
|
||||
Very good, indeed. These days Linux and Apache, two other pieces of
|
||||
open source software, are becoming more respected as alternatives to
|
||||
commercial software, but Python hasn't had all the publicity.
|
||||
|
||||
Python has been around for several years, with many users and
|
||||
developers. Accordingly, the interpreter has been used by many
|
||||
people, and has gotten most of the bugs shaken out of it. While bugs
|
||||
are still discovered at intervals, they're usually either quite
|
||||
obscure (they'd have to be, for no one to have run into them before)
|
||||
or they involve interfaces to external libraries. The internals of
|
||||
the language itself are quite stable.
|
||||
|
||||
Having the source code should be viewed as making the software
|
||||
available for peer review; people can examine the code, suggest (and
|
||||
implement) improvements, and track down bugs. To find out more about
|
||||
the idea of open source code, along with arguments and case studies
|
||||
supporting it, go to \url{http://www.opensource.org}.
|
||||
|
||||
\emph{Who's going to support it?}
|
||||
|
||||
Python has a sizable community of developers, and the number is still
|
||||
growing. The Internet community surrounding the language is an active
|
||||
one, and is worth being considered another one of Python's advantages.
|
||||
Most questions posted to the comp.lang.python newsgroup are quickly
|
||||
answered by someone.
|
||||
|
||||
Should you need to dig into the source code, you'll find it's clear
|
||||
and well-organized, so it's not very difficult to write extensions and
|
||||
track down bugs yourself. If you'd prefer to pay for support, there
|
||||
are companies and individuals who offer commercial support for Python.
|
||||
|
||||
\emph{Who uses Python for serious work?}
|
||||
|
||||
Lots of people; one interesting thing about Python is the surprising
|
||||
diversity of applications that it's been used for. People are using
|
||||
Python to:
|
||||
|
||||
\begin{itemize}
|
||||
\item Run Web sites
|
||||
\item Write GUI interfaces
|
||||
\item Control
|
||||
number-crunching code on supercomputers
|
||||
\item Make a commercial application scriptable by embedding the Python
|
||||
interpreter inside it
|
||||
\item Process large XML data sets
|
||||
\item Build test suites for C or Java code
|
||||
\end{itemize}
|
||||
|
||||
Whatever your application domain is, there's probably someone who's
|
||||
used Python for something similar. Yet, despite being usable for
|
||||
such high-end applications, Python's still simple enough to use for
|
||||
little jobs.
|
||||
|
||||
See \url{http://www.python.org/psa/Users.html} for a list of some of the
|
||||
organizations that use Python.
|
||||
|
||||
\emph{What are the restrictions on Python's use?}
|
||||
|
||||
They're practically nonexistent. Consult the \file{Misc/COPYRIGHT}
|
||||
file in the source distribution, or
|
||||
\url{http://www.python.org/doc/Copyright.html} for the full language,
|
||||
but it boils down to three conditions.
|
||||
|
||||
\begin{itemize}
|
||||
|
||||
\item You have to leave the copyright notice on the software; if you
|
||||
don't include the source code in a product, you have to put the
|
||||
copyright notice in the supporting documentation.
|
||||
|
||||
\item Don't claim that the institutions that have developed Python
|
||||
endorse your product in any way.
|
||||
|
||||
\item If something goes wrong, you can't sue for damages. Practically
|
||||
all software licences contain this condition.
|
||||
|
||||
\end{itemize}
|
||||
|
||||
Notice that you don't have to provide source code for anything that
|
||||
contains Python or is built with it. Also, the Python interpreter and
|
||||
accompanying documentation can be modified and redistributed in any
|
||||
way you like, and you don't have to pay anyone any licensing fees at
|
||||
all.
|
||||
|
||||
\emph{Why should we use an obscure language like Python instead of
|
||||
well-known language X?}
|
||||
|
||||
I hope this HOWTO, and the documents listed in the final section, will
|
||||
help convince you that Python isn't obscure, and has a healthily
|
||||
growing user base. One word of advice: always present Python's
|
||||
positive advantages, instead of concentrating on language X's
|
||||
failings. People want to know why a solution is good, rather than why
|
||||
all the other solutions are bad. So instead of attacking a competing
|
||||
solution on various grounds, simply show how Python's virtues can
|
||||
help.
|
||||
|
||||
|
||||
\section{Useful Resources}
|
||||
|
||||
\begin{definitions}
|
||||
|
||||
\term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
|
||||
|
||||
The first chapter of \emph{Internet Programming with Python} also
|
||||
examines some of the reasons for using Python. The book is well worth
|
||||
buying, but the publishers have made the first chapter available on
|
||||
the Web.
|
||||
|
||||
\term{\url{http://home.pacbell.net/ouster/scripting.html}}
|
||||
|
||||
John Ousterhout's white paper on scripting is a good argument for the
|
||||
utility of scripting languages, though naturally enough, he emphasizes
|
||||
Tcl, the language he developed. Most of the arguments would apply to
|
||||
any scripting language.
|
||||
|
||||
\term{\url{http://www.python.org/workshops/1997-10/proceedings/beazley.html}}
|
||||
|
||||
The authors, David M. Beazley and Peter S. Lomdahl,
|
||||
describe their use of Python at Los Alamos National Laboratory.
|
||||
It's another good example of how Python can help get real work done.
|
||||
This quotation from the paper has been echoed by many people:
|
||||
|
||||
\begin{quotation}
|
||||
Originally developed as a large monolithic application for
|
||||
massively parallel processing systems, we have used Python to
|
||||
transform our application into a flexible, highly modular, and
|
||||
extremely powerful system for performing simulation, data
|
||||
analysis, and visualization. In addition, we describe how Python
|
||||
has solved a number of important problems related to the
|
||||
development, debugging, deployment, and maintenance of scientific
|
||||
software.
|
||||
\end{quotation}
|
||||
|
||||
%\term{\url{http://www.pythonjournal.com/volume1/art-interview/}}
|
||||
|
||||
%This interview with Andy Feit, discussing Infoseek's use of Python, can be
|
||||
%used to show that choosing Python didn't introduce any difficulties
|
||||
%into a company's development process, and provided some substantial benefits.
|
||||
|
||||
\term{\url{http://www.python.org/psa/Commercial.html}}
|
||||
|
||||
Robin Friedrich wrote this document on how to support Python's use in
|
||||
commercial projects.
|
||||
|
||||
\term{\url{http://www.python.org/workshops/1997-10/proceedings/stein.ps}}
|
||||
|
||||
For the 6th Python conference, Greg Stein presented a paper that
|
||||
traced Python's adoption and usage at a startup called eShop, and
|
||||
later at Microsoft.
|
||||
|
||||
\term{\url{http://www.opensource.org}}
|
||||
|
||||
Management may be doubtful of the reliability and usefulness of
|
||||
software that wasn't written commercially. This site presents
|
||||
arguments that show how open source software can have considerable
|
||||
advantages over closed-source software.
|
||||
|
||||
\term{\url{http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html}}
|
||||
|
||||
The Linux Advocacy mini-HOWTO was the inspiration for this document,
|
||||
and is also well worth reading for general suggestions on winning
|
||||
acceptance for a new technology, such as Linux or Python. In general,
|
||||
you won't make much progress by simply attacking existing systems and
|
||||
complaining about their inadequacies; this often ends up looking like
|
||||
unfocused whining. It's much better to point out some of the many
|
||||
areas where Python is an improvement over other systems.
|
||||
|
||||
\end{definitions}
|
||||
|
||||
\end{document}
|
||||
|
||||
|
485
Doc/howto/curses.tex
Normal file
|
@ -0,0 +1,485 @@
|
|||
\documentclass{howto}
|
||||
|
||||
\title{Curses Programming with Python}
|
||||
|
||||
\release{2.01}
|
||||
|
||||
\author{A.M. Kuchling, Eric S. Raymond}
|
||||
\authoraddress{\email{amk@amk.ca}, \email{esr@thyrsus.com}}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
This document describes how to write text-mode programs with Python 2.x,
|
||||
using the \module{curses} extension module to control the display.
|
||||
|
||||
This document is available from the Python HOWTO page at
|
||||
\url{http://www.python.org/doc/howto}.
|
||||
\end{abstract}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
\section{What is curses?}
|
||||
|
||||
The curses library supplies a terminal-independent screen-painting and
|
||||
keyboard-handling facility for text-based terminals; such terminals
|
||||
include VT100s, the Linux console, and the simulated terminal provided
|
||||
by X11 programs such as xterm and rxvt. Display terminals support
|
||||
various control codes to perform common operations such as moving the
|
||||
cursor, scrolling the screen, and erasing areas. Different terminals
|
||||
use widely differing codes, and often have their own minor quirks.
|
||||
|
||||
In a world of X displays, one might ask ``why bother''? It's true
|
||||
that character-cell display terminals are an obsolete technology, but
|
||||
there are niches in which being able to do fancy things with them is
|
||||
still valuable. One is on small-footprint or embedded Unixes that
|
||||
don't carry an X server. Another is for tools like OS installers
|
||||
and kernel configurators that may have to run before X is available.
|
||||
|
||||
The curses library hides all the details of different terminals, and
|
||||
provides the programmer with an abstraction of a display, containing
|
||||
multiple non-overlapping windows. The contents of a window can be
|
||||
changed in various ways--adding text, erasing it, changing its
|
||||
appearance--and the curses library will automagically figure out what
|
||||
control codes need to be sent to the terminal to produce the right
|
||||
output.
|
||||
|
||||
The curses library was originally written for BSD Unix; the later System V
|
||||
versions of Unix from AT\&T added many enhancements and new functions.
|
||||
BSD curses is no longer maintained, having been replaced by ncurses,
|
||||
which is an open-source implementation of the AT\&T interface. If you're
|
||||
using an open-source Unix such as Linux or FreeBSD, your system almost
|
||||
certainly uses ncurses. Since most current commercial Unix versions
|
||||
are based on System V code, all the functions described here will
|
||||
probably be available. The older versions of curses carried by some
|
||||
proprietary Unixes may not support everything, though.
|
||||
|
||||
No one has made a Windows port of the curses module. On a Windows
|
||||
platform, try the Console module written by Fredrik Lundh. The
|
||||
Console module provides cursor-addressable text output, plus full
|
||||
support for mouse and keyboard input, and is available from
|
||||
\url{http://effbot.org/efflib/console}.
|
||||
|
||||
\subsection{The Python curses module}
|
||||
|
||||
The Python module is a fairly simple wrapper over the C functions
|
||||
provided by curses; if you're already familiar with curses programming
|
||||
in C, it's really easy to transfer that knowledge to Python. The
|
||||
biggest difference is that the Python interface makes things simpler,
|
||||
by merging different C functions such as \function{addstr},
|
||||
\function{mvaddstr}, and \function{mvwaddstr}, into a single
|
||||
\method{addstr()} method. You'll see this covered in more detail
|
||||
later.
|
||||
|
||||
This HOWTO is simply an introduction to writing text-mode programs
|
||||
with curses and Python. It doesn't attempt to be a complete guide to
|
||||
the curses API; for that, see the Python library guide's section on
|
||||
ncurses, and the C manual pages for ncurses. It will, however, give
|
||||
you the basic ideas.
|
||||
|
||||
\section{Starting and ending a curses application}
|
||||
|
||||
Before doing anything, curses must be initialized. This is done by
|
||||
calling the \function{initscr()} function, which will determine the
|
||||
terminal type, send any required setup codes to the terminal, and
|
||||
create various internal data structures. If successful,
|
||||
\function{initscr()} returns a window object representing the entire
|
||||
screen; this is usually called \code{stdscr}, after the name of the
|
||||
corresponding C
|
||||
variable.
|
||||
|
||||
\begin{verbatim}
|
||||
import curses
|
||||
stdscr = curses.initscr()
|
||||
\end{verbatim}
|
||||
|
||||
Usually curses applications turn off automatic echoing of keys to the
|
||||
screen, in order to be able to read keys and only display them under
|
||||
certain circumstances. This requires calling the \function{noecho()}
|
||||
function.
|
||||
|
||||
\begin{verbatim}
|
||||
curses.noecho()
|
||||
\end{verbatim}
|
||||
|
||||
Applications will also commonly need to react to keys instantly,
|
||||
without requiring the Enter key to be pressed; this is called cbreak
|
||||
mode, as opposed to the usual buffered input mode.
|
||||
|
||||
\begin{verbatim}
|
||||
curses.cbreak()
|
||||
\end{verbatim}
|
||||
|
||||
Terminals usually return special keys, such as the cursor keys or
|
||||
navigation keys such as Page Up and Home, as a multibyte escape
|
||||
sequence. While you could write your application to expect such
|
||||
sequences and process them accordingly, curses can do it for you,
|
||||
returning a special value such as \constant{curses.KEY_LEFT}. To get
|
||||
curses to do the job, you'll have to enable keypad mode.
|
||||
|
||||
\begin{verbatim}
|
||||
stdscr.keypad(1)
|
||||
\end{verbatim}
|
||||
|
||||
Terminating a curses application is much easier than starting one.
|
||||
You'll need to call
|
||||
|
||||
\begin{verbatim}
|
||||
curses.nocbreak(); stdscr.keypad(0); curses.echo()
|
||||
\end{verbatim}
|
||||
|
||||
to reverse the curses-friendly terminal settings. Then call the
|
||||
\function{endwin()} function to restore the terminal to its original
|
||||
operating mode.
|
||||
|
||||
\begin{verbatim}
|
||||
curses.endwin()
|
||||
\end{verbatim}
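
Put together, the startup and shutdown calls above might be arranged as
in the following sketch. The helper name \function{run()} and the use
of \code{try}/\code{finally} are assumptions of this example, not
something the curses module requires; the idea is only that the
terminal settings are undone even if the application code fails.

\begin{verbatim}
import curses

def run(body):
    # body is any callable that accepts the stdscr window.
    stdscr = curses.initscr()
    try:
        curses.noecho()
        curses.cbreak()
        stdscr.keypad(1)
        return body(stdscr)
    finally:
        # Undo the settings, even if body() raised an exception.
        curses.nocbreak()
        stdscr.keypad(0)
        curses.echo()
        curses.endwin()
\end{verbatim}

A program would then call something like \code{run(my_main)} from a
real terminal, where \code{my_main} is a function of your own that
takes the window as its only argument.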
|
||||
|
||||
A common problem when debugging a curses application is to get your
|
||||
terminal messed up when the application dies without restoring the
|
||||
terminal to its previous state. In Python this commonly happens when
|
||||
your code is buggy and raises an uncaught exception. Keys are no
|
||||
longer echoed to the screen when you type them, for example, which
|
||||
makes using the shell difficult.
|
||||
|
||||
In Python you can avoid these complications and make debugging much
|
||||
easier by importing the module \module{curses.wrapper}. It supplies a
|
||||
function \function{wrapper} that takes a hook argument. It does the
|
||||
initializations described above, and also initializes colors if color
|
||||
support is present. It then runs your hook, and then finally
|
||||
deinitializes appropriately. The hook is called inside a try...except
|
||||
clause which catches exceptions, performs curses deinitialization, and
|
||||
then passes the exception upwards. Thus, your terminal won't be left
|
||||
in a funny state on exception.
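
A minimal sketch of this pattern follows. It assumes the function can
be called as \code{curses.wrapper()}, as it can in current Python
releases; with the module layout described above, the equivalent import
would be \code{from curses.wrapper import wrapper}. The names
\function{main()} and \function{run()} are chosen for this example.

\begin{verbatim}
import curses

def main(stdscr):
    # wrapper() passes in the window object for the whole screen.
    stdscr.addstr(0, 0, "Press any key to quit")
    stdscr.refresh()
    stdscr.getch()

def run():
    # Initializes curses, calls main(stdscr), and restores the
    # terminal afterwards, even if main() raises an exception.
    curses.wrapper(main)
\end{verbatim}

Calling \function{run()} from a real terminal shows the message and
waits for a keypress. If \function{main()} raises an exception, the
traceback should appear on a usable terminal rather than a scrambled
one.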
|
||||
|
||||
\section{Windows and Pads}
|
||||
|
||||
Windows are the basic abstraction in curses. A window object
|
||||
represents a rectangular area of the screen, and supports various
|
||||
methods to display text, erase it, allow the user to input strings,
|
||||
and so forth.
|
||||
|
||||
The \code{stdscr} object returned by the \function{initscr()} function
|
||||
is a window object that covers the entire screen. Many programs may
|
||||
need only this single window, but you might wish to divide the screen
|
||||
into smaller windows, in order to redraw or clear them separately.
|
||||
The \function{newwin()} function creates a new window of a given size,
|
||||
returning the new window object.
|
||||
|
||||
\begin{verbatim}
|
||||
begin_x = 20 ; begin_y = 7
|
||||
height = 5 ; width = 40
|
||||
win = curses.newwin(height, width, begin_y, begin_x)
|
||||
\end{verbatim}
|
||||
|
||||
A word about the coordinate system used in curses: coordinates are
|
||||
always passed in the order \emph{y,x}, and the top-left corner of a
|
||||
window is coordinate (0,0). This breaks a common convention for
|
||||
handling coordinates, where the \emph{x} coordinate usually comes
|
||||
first. This is an unfortunate difference from most other computer
|
||||
applications, but it's been part of curses since it was first written,
|
||||
and it's too late to change things now.
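To make the \emph{y,x} ordering concrete, here is a small sketch that computes where a window should start so that it is centered. The helper \function{centered_origin()} is hypothetical and not part of the curses module; it only does the arithmetic.

```python
def centered_origin(screen_height, screen_width, height, width):
    """Return (begin_y, begin_x) that centers a height x width window."""
    # Integer division keeps the result on a character cell; max()
    # clamps to 0 if the window is larger than the screen.
    begin_y = max((screen_height - height) // 2, 0)
    begin_x = max((screen_width - width) // 2, 0)
    return begin_y, begin_x


# Rows come first, columns second, exactly as newwin() expects them.
begin_y, begin_x = centered_origin(24, 80, 5, 40)
print(begin_y, begin_x)
```

The pair can be passed straight to \code{curses.newwin(height, width, begin_y, begin_x)}.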
|
||||
|
||||
When you call a method to display or erase text, the effect doesn't
|
||||
immediately show up on the display. This is because curses was
|
||||
originally written with slow 300-baud terminal connections in mind;
|
||||
with these terminals, minimizing the time required to redraw the
|
||||
screen is very important. This lets curses accumulate changes to the
|
||||
screen, and display them in the most efficient manner. For example,
|
||||
if your program displays some characters in a window, and then clears
|
||||
the window, there's no need to send the original characters because
|
||||
they'd never be visible.
|
||||
|
||||
Accordingly, curses requires that you explicitly tell it to redraw
|
||||
windows, using the \function{refresh()} method of window objects. In
|
||||
practice, this doesn't really complicate programming with curses much.
|
||||
Most programs go into a flurry of activity, and then pause waiting for
|
||||
a keypress or some other action on the part of the user. All you have
|
||||
to do is to be sure that the screen has been redrawn before pausing to
|
||||
wait for user input, by simply calling \code{stdscr.refresh()} or the
|
||||
\function{refresh()} method of some other relevant window.
|
||||
|
||||
A pad is a special case of a window; it can be larger than the actual
|
||||
display screen, with only a portion of it displayed at a time.
|
||||
Creating a pad simply requires the pad's height and width, while
|
||||
refreshing a pad requires giving the coordinates of the on-screen
|
||||
area where a subsection of the pad will be displayed.
|
||||
|
||||
\begin{verbatim}
|
||||
pad = curses.newpad(100, 100)
|
||||
# These loops fill the pad with letters; this is
|
||||
# explained in the next section
|
||||
for y in range(0, 100):
    for x in range(0, 100):
        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
        except curses.error: pass
|
||||
|
||||
# Displays a section of the pad in the middle of the screen
|
||||
pad.refresh( 0,0, 5,5, 20,75)
|
||||
\end{verbatim}
|
||||
|
||||
The \function{refresh()} call displays a section of the pad in the
|
||||
rectangle extending from coordinate (5,5) to coordinate (20,75) on the
|
||||
screen; the upper left corner of the displayed section is coordinate
|
||||
(0,0) on the pad. Beyond that difference, pads are exactly like
|
||||
ordinary windows and support the same methods.
|
||||
|
||||
If you have multiple windows and pads on screen there is a more
|
||||
efficient way to go, which will prevent annoying screen flicker at
|
||||
refresh time. Use the \method{noutrefresh()} method
of each window to update the data structure
|
||||
representing the desired state of the screen; then change the physical
|
||||
screen to match the desired state in one go with the function
|
||||
\function{doupdate()}. The normal \method{refresh()} method calls
|
||||
\function{doupdate()} as its last act.
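The two-step update can be sketched as follows. The helper name \function{refresh_all()} is an assumption made for this example, and it covers ordinary windows only, since a pad's \method{noutrefresh()} needs the same six coordinates as its \method{refresh()}.

```python
import curses


def refresh_all(windows):
    # Stage each window's changes in the virtual screen without
    # touching the terminal yet.
    for win in windows:
        win.noutrefresh()
    # One physical update for everything staged above.
    curses.doupdate()
```

Calling this once per pass through the main loop, rather than calling \method{refresh()} on every window, means the terminal is written to only once.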
|
||||
|
||||
\section{Displaying Text}
|
||||
|
||||
{}From a C programmer's point of view, curses may sometimes look like
|
||||
a twisty maze of functions, all subtly different. For example,
|
||||
\function{addstr()} displays a string at the current cursor location
|
||||
in the \code{stdscr} window, while \function{mvaddstr()} moves to a
|
||||
given y,x coordinate first before displaying the string.
|
||||
\function{waddstr()} is just like \function{addstr()}, but allows
|
||||
specifying a window to use, instead of using \code{stdscr} by default.
|
||||
\function{mvwaddstr()} follows similarly.
|
||||
|
||||
Fortunately the Python interface hides all these details;
|
||||
\code{stdscr} is a window object like any other, and methods like
|
||||
\function{addstr()} accept multiple argument forms. Usually there are
|
||||
four different forms.
|
||||
|
||||
\begin{tableii}{|c|l|}{textrm}{Form}{Description}
|
||||
\lineii{\var{str} or \var{ch}}{Display the string \var{str} or
|
||||
character \var{ch}}
|
||||
\lineii{\var{str} or \var{ch}, \var{attr}}{Display the string \var{str} or
|
||||
character \var{ch}, using attribute \var{attr}}
|
||||
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}}
|
||||
{Move to position \var{y,x} within the window, and display \var{str}
|
||||
or \var{ch}}
|
||||
\lineii{\var{y}, \var{x}, \var{str} or \var{ch}, \var{attr}}
|
||||
{Move to position \var{y,x} within the window, and display \var{str}
|
||||
or \var{ch}, using attribute \var{attr}}
|
||||
\end{tableii}
|
||||
|
||||
Attributes allow displaying text in highlighted forms, such as in
|
||||
boldface, underline, reverse video, or in color. They'll be explained
|
||||
in more detail in the next subsection.
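The four forms in the table look like this in practice. This is only a sketch: \function{show_forms()} is a made-up name, and \var{win} is assumed to be any window object, such as \code{stdscr}.

```python
import curses


def show_forms(win):
    win.addstr("plain text at the cursor")              # str
    win.addstr(" in bold", curses.A_BOLD)               # str, attr
    win.addstr(2, 0, "moved to row 2, column 0")        # y, x, str
    win.addstr(3, 0, "moved and underlined",
               curses.A_UNDERLINE)                      # y, x, str, attr
    win.addch(4, 0, ord("*"))                           # y, x, ch
    win.refresh()
```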
|
||||
|
||||
The \function{addstr()} function takes a Python string as the value to
|
||||
be displayed, while the \function{addch()} functions take a character,
|
||||
which can be either a Python string of length 1, or an integer. If
|
||||
it's a string, you're limited to displaying characters between 0 and
|
||||
255. SVr4 curses provides constants for extension characters; these
|
||||
constants are integers greater than 255. For example,
|
||||
\constant{ACS_PLMINUS} is a +/- symbol, and \constant{ACS_ULCORNER} is
|
||||
the upper left corner of a box (handy for drawing borders).
|
||||
|
||||
Windows remember where the cursor was left after the last operation,
|
||||
so if you leave out the \var{y,x} coordinates, the string or character
|
||||
will be displayed wherever the last operation left off. You can also
|
||||
move the cursor with the \function{move(\var{y,x})} method. Because
|
||||
some terminals always display a flashing cursor, you may want to
|
||||
ensure that the cursor is positioned in some location where it won't
|
||||
be distracting; it can be confusing to have the cursor blinking at
|
||||
some apparently random location.
|
||||
|
||||
If your application doesn't need a blinking cursor at all, you can
|
||||
call \function{curs_set(0)} to make it invisible. Equivalently, and
|
||||
for compatibility with older curses versions, there's a
|
||||
\function{leaveok(\var{bool})} function. When \var{bool} is true, the
|
||||
curses library will attempt to suppress the flashing cursor, and you
|
||||
won't need to worry about leaving it in odd locations.
|
||||
|
||||
\subsection{Attributes and Color}
|
||||
|
||||
Characters can be displayed in different ways. Status lines in a
|
||||
text-based application are commonly shown in reverse video; a text
|
||||
viewer may need to highlight certain words. curses supports this by
|
||||
allowing you to specify an attribute for each cell on the screen.
|
||||
|
||||
An attribute is an integer, each bit representing a different
|
||||
attribute. You can try to display text with multiple attribute bits
|
||||
set, but curses doesn't guarantee that all the possible combinations
|
||||
are available, or that they're all visually distinct. That depends on
|
||||
the ability of the terminal being used, so it's safest to stick to the
|
||||
most commonly available attributes, listed here.
|
||||
|
||||
\begin{tableii}{|c|l|}{constant}{Attribute}{Description}
|
||||
\lineii{A_BLINK}{Blinking text}
|
||||
\lineii{A_BOLD}{Extra bright or bold text}
|
||||
\lineii{A_DIM}{Half bright text}
|
||||
\lineii{A_REVERSE}{Reverse-video text}
|
||||
\lineii{A_STANDOUT}{The best highlighting mode available}
|
||||
\lineii{A_UNDERLINE}{Underlined text}
|
||||
\end{tableii}
|
||||
|
||||
So, to display a reverse-video status line on the top line of the
|
||||
screen,
|
||||
you could code:
|
||||
|
||||
\begin{verbatim}
|
||||
stdscr.addstr(0, 0, "Current mode: Typing mode",
              curses.A_REVERSE)
|
||||
stdscr.refresh()
|
||||
\end{verbatim}
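Because each attribute is a separate bit, several can be requested at once with a bitwise OR. The sketch below shows this; \function{has_attribute()} is a hypothetical helper added only to show how the bits can be tested, and whether the terminal actually renders the combination is still up to the terminal.

```python
import curses

# Each attribute occupies its own bit, so bitwise OR combines them.
emphasis = curses.A_BOLD | curses.A_UNDERLINE


def has_attribute(attr, flag):
    # True when every bit of flag is set in attr.
    return (attr & flag) == flag
```

The combined value is passed wherever a single attribute would be, for example \code{stdscr.addstr(0, 0, "Warning", emphasis)}.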
|
||||
|
||||
The curses library also supports color on those terminals that
|
||||
provide it. The most common such terminal is probably the Linux
|
||||
console, followed by color xterms.
|
||||
|
||||
To use color, you must call the \function{start_color()} function
|
||||
soon after calling \function{initscr()}, to initialize the default
|
||||
color set (the \function{curses.wrapper.wrapper()} function does this
|
||||
automatically). Once that's done, the \function{has_colors()}
|
||||
function returns TRUE if the terminal in use can actually display
|
||||
color. (Note from AMK: curses uses the American spelling
|
||||
'color', instead of the Canadian/British spelling 'colour'. If you're
|
||||
like me, you'll have to resign yourself to misspelling it for the sake
|
||||
of these functions.)
|
||||
|
||||
The curses library maintains a finite number of color pairs,
|
||||
containing a foreground (or text) color and a background color. You
|
||||
can get the attribute value corresponding to a color pair with the
|
||||
\function{color_pair()} function; this can be bitwise-OR'ed with other
|
||||
attributes such as \constant{A_REVERSE}, but again, such combinations
|
||||
are not guaranteed to work on all terminals.
|
||||
|
||||
An example, which displays a line of text using color pair 1:
|
||||
|
||||
\begin{verbatim}
|
||||
stdscr.addstr( "Pretty text", curses.color_pair(1) )
|
||||
stdscr.refresh()
|
||||
\end{verbatim}
|
||||
|
||||
As I said before, a color pair consists of a foreground and
|
||||
background color. \function{start_color()} initializes 8 basic
|
||||
colors when it activates color mode. They are: 0:black, 1:red,
|
||||
2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and 7:white. The curses
|
||||
module defines named constants for each of these colors:
|
||||
\constant{curses.COLOR_BLACK}, \constant{curses.COLOR_RED}, and so
|
||||
forth.
|
||||
|
||||
The \function{init_pair(\var{n, f, b})} function changes the
|
||||
definition of color pair \var{n}, to foreground color \var{f} and
background color \var{b}. Color pair 0 is hard-wired to white on black,
|
||||
and cannot be changed.
|
||||
|
||||
Let's put all this together. To change color pair 1 to red
|
||||
text on a white background, you would call:
|
||||
|
||||
\begin{verbatim}
|
||||
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
|
||||
\end{verbatim}
|
||||
|
||||
When you change a color pair, any text already displayed using that
|
||||
color pair will change to the new colors. You can also display new
|
||||
text in this color with:
|
||||
|
||||
\begin{verbatim}
|
||||
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
|
||||
\end{verbatim}
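Putting the color calls together with a fallback for monochrome terminals might look like the sketch below. The function name \function{draw_alert()} is invented for the example, and the code assumes \function{start_color()} has already been called (for instance by the wrapper function).

```python
import curses


def draw_alert(stdscr, text="RED ALERT!"):
    # Assumes start_color() has already been called, for instance by
    # the wrapper function.
    if curses.has_colors():
        curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
        attr = curses.color_pair(1)
    else:
        # Fall back to an attribute that monochrome terminals support.
        attr = curses.A_REVERSE
    stdscr.addstr(0, 0, text, attr)
    stdscr.refresh()
    return attr
```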
|
||||
|
||||
Very fancy terminals can change the definitions of the actual colors
|
||||
to a given RGB value. This lets you change color 1, which is usually
|
||||
red, to purple or blue or any other color you like. Unfortunately,
|
||||
the Linux console doesn't support this, so I'm unable to try it out,
|
||||
and can't provide any examples. You can check if your terminal can do
|
||||
this by calling \function{can_change_color()}, which returns TRUE if
|
||||
the capability is there. If you're lucky enough to have such a
|
||||
talented terminal, consult your system's man pages for more
|
||||
information.
|
||||
|
||||
\section{User Input}
|
||||
|
||||
The curses library itself offers only very simple input mechanisms.
|
||||
Python's support adds a text-input widget that makes up some of the
|
||||
lack.
|
||||
|
||||
The most common way to get input to a window is to use its
|
||||
\method{getch()} method, which pauses and waits for the user to hit
|
||||
a key, displaying it if \function{echo()} has been called earlier.
|
||||
You can optionally specify a coordinate to which the cursor should be
|
||||
moved before pausing.
|
||||
|
||||
It's possible to change this behavior with the method
|
||||
\method{nodelay()}. After \method{nodelay(1)}, \method{getch()} for
|
||||
the window becomes non-blocking and returns ERR (-1) when no input is
|
||||
ready. There's also a \function{halfdelay()} function, which can be
|
||||
used to (in effect) set a timer on each \method{getch()}; if no input
|
||||
becomes available within the number of tenths of a second specified as the
argument to \function{halfdelay()}, curses raises an exception.
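A non-blocking read can be wrapped so that callers see \code{None} rather than the raw -1. This is a sketch only; \function{poll_key()} is a hypothetical helper, not a curses function.

```python
def poll_key(win):
    """Return the next key code, or None if no key is waiting."""
    win.nodelay(True)   # make getch() non-blocking for this window
    c = win.getch()
    if c == -1:         # curses reports "no input" as ERR (-1)
        return None
    return c
```

A program that animates the screen could call this once per frame and carry on drawing whenever it returns \code{None}.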
|
||||
|
||||
The \method{getch()} method returns an integer; if it's between 0 and
|
||||
255, it represents the ASCII code of the key pressed. Values greater
|
||||
than 255 are special keys such as Page Up, Home, or the cursor keys.
|
||||
You can compare the value returned to constants such as
|
||||
\constant{curses.KEY_PPAGE}, \constant{curses.KEY_HOME}, or
|
||||
\constant{curses.KEY_LEFT}. Usually the main loop of your program
|
||||
will look something like this:
|
||||
|
||||
\begin{verbatim}
|
||||
while 1:
    c = stdscr.getch()
    if c == ord('p'): PrintDocument()
    elif c == ord('q'): break  # Exit the while()
    elif c == curses.KEY_HOME: x = y = 0
|
||||
\end{verbatim}
|
||||
|
||||
The \module{curses.ascii} module supplies ASCII class membership
|
||||
functions that take either integer or 1-character-string
|
||||
arguments; these may be useful in writing more readable tests for
|
||||
your command interpreters. It also supplies conversion functions
|
||||
that take either integer or 1-character-string arguments and return
|
||||
the same type. For example, \function{curses.ascii.ctrl()} returns
|
||||
the control character corresponding to its argument.
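A short illustration of both kinds of function follows; the variable names are made up for the example.

```python
import curses.ascii

# The classification functions accept either an integer code or a
# one-character string.
print(curses.ascii.isdigit("7"), curses.ascii.isdigit(ord("7")))

# Conversion functions return the same type they were given.
ctrl_g_as_str = curses.ascii.ctrl("g")
ctrl_g_as_int = curses.ascii.ctrl(ord("g"))
print(repr(ctrl_g_as_str), ctrl_g_as_int)
```

In a key-handling loop this allows tests such as \code{c == curses.ascii.ctrl(ord('g'))} instead of a bare number.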
|
||||
|
||||
There's also a method to retrieve an entire string,
|
||||
\method{getstr()}. It isn't used very often, because its
|
||||
functionality is quite limited; the only editing keys available are
|
||||
the backspace key and the Enter key, which terminates the string. It
|
||||
can optionally be limited to a fixed number of characters.
|
||||
|
||||
\begin{verbatim}
|
||||
curses.echo() # Enable echoing of characters
|
||||
|
||||
# Get a 15-character string, with the cursor on the top line
|
||||
s = stdscr.getstr(0,0, 15)
|
||||
\end{verbatim}
|
||||
|
||||
The Python \module{curses.textpad} module supplies something better.
|
||||
With it, you can turn a window into a text box that supports an
|
||||
Emacs-like set of keybindings. Various methods of the \class{Textbox}
|
||||
class support editing with input validation and gathering the edit
|
||||
results either with or without trailing spaces. See the library
|
||||
documentation on \module{curses.textpad} for the details.
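As a rough sketch of how a text box with a validator might be used: the names \function{enter_terminates()} and \function{ask()} are hypothetical, as are the window size and position, and the library documentation remains the authority on the exact keybindings.

```python
import curses
import curses.ascii
import curses.textpad


def enter_terminates(ch):
    # Textbox.edit() stops on Ctrl-G. In a multi-line box Enter would
    # otherwise insert a newline, so translate it into Ctrl-G.
    if ch in (curses.ascii.NL, curses.ascii.CR):
        return curses.ascii.BEL
    return ch


def ask(stdscr):
    stdscr.addstr(0, 0, "Comment:")
    stdscr.refresh()
    # A 3-line, 30-column editing area just below the prompt.
    editwin = curses.newwin(3, 30, 1, 0)
    box = curses.textpad.Textbox(editwin)
    box.edit(enter_terminates)
    return box.gather().strip()
```

The validator sees every keystroke before the text box does, so it is also the place to filter out characters you don't want to accept.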
|
||||
|
||||
\section{For More Information}
|
||||
|
||||
This HOWTO didn't cover some advanced topics, such as screen-scraping
|
||||
or capturing mouse events from an xterm instance. But the Python
|
||||
library page for the curses modules is now pretty complete. You
|
||||
should browse it next.
|
||||
|
||||
If you're in doubt about the detailed behavior of any of the ncurses
|
||||
entry points, consult the manual pages for your curses implementation,
|
||||
whether it's ncurses or a proprietary Unix vendor's. The manual pages
|
||||
will document any quirks, and provide complete lists of all the
|
||||
functions, attributes, and \constant{ACS_*} characters available to
|
||||
you.
|
||||
|
||||
Because the curses API is so large, some functions aren't supported in
|
||||
the Python interface, not because they're difficult to implement, but
|
||||
because no one has needed them yet. Feel free to add them and then
|
||||
submit a patch. Also, we don't yet have support for the menus or
|
||||
panels libraries associated with ncurses; feel free to add that.
|
||||
|
||||
If you write an interesting little program, feel free to contribute it
|
||||
as another demo. We can always use more of them!
|
||||
|
||||
The ncurses FAQ: \url{http://dickey.his.com/ncurses/ncurses.faq.html}
|
||||
|
||||
\end{document}
|
343
Doc/howto/doanddont.tex
Normal file
|
@ -0,0 +1,343 @@
|
|||
\documentclass{howto}
|
||||
|
||||
\title{Idioms and Anti-Idioms in Python}
|
||||
|
||||
\release{0.00}
|
||||
|
||||
\author{Moshe Zadka}
|
||||
\authoraddress{howto@zadka.site.co.il}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
|
||||
This document is placed in the public domain.
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
This document can be considered a companion to the tutorial. It
|
||||
shows how to use Python, and even more importantly, how {\em not}
|
||||
to use Python.
|
||||
\end{abstract}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
\section{Language Constructs You Should Not Use}
|
||||
|
||||
While Python has relatively few gotchas compared to other languages, it
|
||||
still has some constructs which are only useful in corner cases, or are
|
||||
plain dangerous.
|
||||
|
||||
\subsection{from module import *}
|
||||
|
||||
\subsubsection{Inside Function Definitions}
|
||||
|
||||
\code{from module import *} is {\em invalid} inside function definitions.
|
||||
While many versions of Python do not check for the invalidity, it does not
make it more valid, no more than having a smart lawyer makes a man innocent.
|
||||
Do not use it like that ever. Even in versions where it was accepted, it made
|
||||
the function execution slower, because the compiler could not be certain
|
||||
which names are local and which are global. In Python 2.1 this construct
|
||||
causes warnings, and sometimes even errors.
|
||||
|
||||
\subsubsection{At Module Level}
|
||||
|
||||
While it is valid to use \code{from module import *} at module level it
|
||||
is usually a bad idea. For one, this loses an important property Python
|
||||
otherwise has --- you can know where each toplevel name is defined by
|
||||
a simple "search" function in your favourite editor. You also open yourself
|
||||
to trouble in the future, if some module grows additional functions or
|
||||
classes.
|
||||
|
||||
One of the most awful questions asked on the newsgroup is why this code:
|
||||
|
||||
\begin{verbatim}
|
||||
f = open("www")
|
||||
f.read()
|
||||
\end{verbatim}
|
||||
|
||||
does not work. Of course, it works just fine (assuming you have a file
|
||||
called "www".) But it does not work if somewhere in the module, the
|
||||
statement \code{from os import *} is present. The \module{os} module
|
||||
has a function called \function{open()} which returns an integer. While
|
||||
it is very useful, shadowing builtins is one of its least useful properties.
|
||||
|
||||
Remember, you can never know for sure what names a module exports, so either
|
||||
take what you need --- \code{from module import name1, name2}, or keep them in
|
||||
the module and access on a per-need basis ---
|
||||
\code{import module;print module.name}.
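The difference between the two \function{open()} functions is easy to see side by side. This sketch is written to run on a current Python interpreter and uses a temporary file; the variable names are illustrative only.

```python
import os
import tempfile

# Create a throwaway file so the example has something to open.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello\n")
os.close(fd)

builtin_result = open(path)                 # a file object with read()
os_result = os.open(path, os.O_RDONLY)      # a bare integer descriptor

print(hasattr(builtin_result, "read"), hasattr(os_result, "read"))

builtin_result.close()
os.close(os_result)
os.remove(path)
```

After \code{from os import *}, the name \code{open} would refer to the second function, so \code{f.read()} fails because an integer has no \method{read()} method.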
|
||||
|
||||
\subsubsection{When It Is Just Fine}
|
||||
|
||||
There are situations in which \code{from module import *} is just fine:
|
||||
|
||||
\begin{itemize}
|
||||
|
||||
\item The interactive prompt. For example, \code{from math import *} makes
|
||||
Python an amazing scientific calculator.
|
||||
|
||||
\item When extending a module in C with a module in Python.
|
||||
|
||||
\item When the module advertises itself as \code{from import *} safe.
|
||||
|
||||
\end{itemize}
|
||||
|
||||
\subsection{Unadorned \keyword{exec}, \function{execfile} and friends}
|
||||
|
||||
The word ``unadorned'' refers to the use without an explicit dictionary,
|
||||
in which case those constructs evaluate code in the {\em current} environment.
|
||||
This is dangerous for the same reasons \code{from import *} is dangerous ---
|
||||
it might step over variables you are counting on and mess up things for
|
||||
the rest of your code. Simply do not do that.
|
||||
|
||||
Bad examples:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> for name in sys.argv[1:]:
>>>     exec "%s=1" % name
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         exec "s.%s=val" % var  # invalid!
>>> execfile("handler.py")
>>> handle()
|
||||
\end{verbatim}
|
||||
|
||||
Good examples:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> d = {}
>>> for name in sys.argv[1:]:
>>>     d[name] = 1
>>> def func(s, **kw):
>>>     for var, val in kw.items():
>>>         setattr(s, var, val)
>>> d={}
>>> execfile("handler.py", d, d)
>>> handle = d['handle']
>>> handle()
|
||||
\end{verbatim}
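On current Python versions \function{execfile()} no longer exists and \keyword{exec} is a function, but the same advice applies. The sketch below assumes the code to run is held in a string; \var{source} stands in for the contents of a handler file and is invented for the example.

```python
source = "def handle():\n    return 'handled'\n"

# Run the code inside its own dictionary rather than in the current
# namespace, so it cannot overwrite names this module relies on.
namespace = {}
exec(source, namespace)

handle = namespace["handle"]
print(handle())
```

Only the names you explicitly pull out of \var{namespace} reach your own code; everything else the executed code defines stays in the dictionary.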
|
||||
|
||||
\subsection{from module import name1, name2}
|
||||
|
||||
This is a ``don't'' which is much weaker than the previous ``don't''s
|
||||
but is still something you should not do if you don't have good reasons
|
||||
to do that. The reason it is usually a bad idea is because you suddenly
|
||||
have an object which lives in two separate namespaces. When the binding
|
||||
in one namespace changes, the binding in the other will not, so there
|
||||
will be a discrepancy between them. This happens when, for example,
|
||||
one module is reloaded, or changes the definition of a function at runtime.
|
||||
|
||||
Bad example:
|
||||
|
||||
\begin{verbatim}
|
||||
# foo.py
|
||||
a = 1
|
||||
|
||||
# bar.py
|
||||
from foo import a
|
||||
if something():
    a = 2 # danger: foo.a != a
|
||||
\end{verbatim}
|
||||
|
||||
Good example:
|
||||
|
||||
\begin{verbatim}
|
||||
# foo.py
|
||||
a = 1
|
||||
|
||||
# bar.py
|
||||
import foo
|
||||
if something():
    foo.a = 2
|
||||
\end{verbatim}
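The discrepancy can be reproduced in a single file. In this sketch a module object is built by hand as a stand-in for \file{foo.py}, purely so the example is self-contained.

```python
import types

# Build a stand-in for foo.py so the example is self-contained.
foo = types.ModuleType("foo")
foo.a = 1

a = foo.a        # what "from foo import a" effectively does
foo.a = 2        # the module rebinds its own name later

print(a, foo.a)  # the copied binding did not follow the change
```

Code that always spells the name as \code{foo.a} sees the new value; code holding its own copy of the binding does not.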
|
||||
|
||||
\subsection{except:}
|
||||
|
||||
Python has the \code{except:} clause, which catches all exceptions.
|
||||
Since {\em every} error in Python raises an exception, this makes many
|
||||
programming errors look like runtime problems, and hinders
|
||||
the debugging process.
|
||||
|
||||
The following code shows a great example:
|
||||
|
||||
\begin{verbatim}
|
||||
try:
    foo = opne("file") # misspelled "open"
except:
    sys.exit("could not open file!")
|
||||
\end{verbatim}
|
||||
|
||||
The second line triggers a \exception{NameError} which is caught by the
|
||||
except clause. The program will exit, and you will have no idea that
|
||||
this has nothing to do with the readability of \code{"file"}.
|
||||
|
||||
The example above is better written
|
||||
|
||||
\begin{verbatim}
|
||||
try:
    foo = opne("file") # will be changed to "open" as soon as we run it
except IOError:
    sys.exit("could not open file")
|
||||
\end{verbatim}
|
||||
|
||||
There are some situations in which the \code{except:} clause is useful:
|
||||
for example, in a framework when running callbacks, it is good not to
|
||||
let any callback disturb the framework.
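A framework-style callback runner might be sketched as follows. Two assumptions are made here: the name \function{run_callbacks()} is invented, and the sketch catches \exception{Exception} rather than using a bare \code{except:}, so that on current Python versions \exception{KeyboardInterrupt} and \exception{SystemExit} still propagate.

```python
import traceback


def run_callbacks(callbacks):
    """Call every callback, collecting failures instead of stopping."""
    failures = []
    for callback in callbacks:
        try:
            callback()
        except Exception:
            # Record the full traceback so the bug is not hidden.
            failures.append(traceback.format_exc())
    return failures
```

Keeping the traceback matters: the framework survives a broken callback, but the programming error is still reported rather than silently swallowed.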
|
||||
|
||||
\section{Exceptions}
|
||||
|
||||
Exceptions are a useful feature of Python. You should learn to raise
|
||||
them whenever something unexpected occurs, and catch them only where
|
||||
you can do something about them.
|
||||
|
||||
The following is a very popular anti-idiom
|
||||
|
||||
\begin{verbatim}
|
||||
def get_status(file):
    if not os.path.exists(file):
        print "file not found"
        sys.exit(1)
    return open(file).readline()
|
||||
\end{verbatim}
|
||||
|
||||
Consider the case where the file gets deleted between the time the call to
|
||||
\function{os.path.exists} is made and the time \function{open} is called.
|
||||
That means the last line will throw an \exception{IOError}. The same would
|
||||
happen if \var{file} exists but has no read permission. Since testing this
|
||||
on a normal machine on existing and non-existing files makes it seem bugless,
|
||||
that means in testing the results will seem fine, and the code will get
|
||||
shipped. Then an unhandled \exception{IOError} escapes to the user, who
|
||||
has to watch the ugly traceback.
|
||||
|
||||
Here is a better way to do it.
|
||||
|
||||
\begin{verbatim}
|
||||
def get_status(file):
    try:
        return open(file).readline()
    except (IOError, OSError):
        print "file not found"
        sys.exit(1)
|
||||
\end{verbatim}
|
||||
|
||||
In this version, {\em either} the file gets opened and the line is read
|
||||
(so it works even on flaky NFS or SMB connections), or the message
|
||||
is printed and the application aborted.
|
||||
|
||||
Still, \function{get_status} makes too many assumptions --- that it
|
||||
will only be used in a short running script, and not, say, in a long
|
||||
running server. Sure, the caller could do something like
|
||||
|
||||
\begin{verbatim}
|
||||
try:
    status = get_status(log)
except SystemExit:
    status = None
|
||||
\end{verbatim}
|
||||
|
||||
So, try to use as few \code{except} clauses in your code as possible --- those will
|
||||
usually be a catch-all in the \function{main}, or inside calls which
|
||||
should always succeed.
|
||||
|
||||
So, the best version is probably
|
||||
|
||||
\begin{verbatim}
|
||||
def get_status(file):
    return open(file).readline()
|
||||
\end{verbatim}
|
||||
|
||||
The caller can deal with the exception if it wants (for example, if it
|
||||
tries several files in a loop), or just let the exception filter upwards
|
||||
to {\em its} caller.
|
||||
|
||||
The last version is not very good either --- due to implementation details,
|
||||
the file would not be closed when an exception is raised until the handler
|
||||
finishes, and perhaps not at all in non-C implementations (e.g., Jython).
|
||||
|
||||
\begin{verbatim}
|
||||
def get_status(file):
    fp = open(file)
    try:
        return fp.readline()
    finally:
        fp.close()
|
||||
\end{verbatim}
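On Python versions that have the \keyword{with} statement, the same guarantee can be written more compactly. This is a sketch under that assumption; \function{first_status()} is a hypothetical caller added to show where the exception handling ends up.

```python
def get_status(path):
    # The with statement closes the file when the block is left,
    # whether by return or by exception.
    with open(path) as fp:
        return fp.readline()


def first_status(paths):
    # The caller decides what a missing file means: here, try the next.
    for path in paths:
        try:
            return get_status(path)
        except (IOError, OSError):
            continue
    return None
```

The function that opens the file stays free of policy, and the one place that knows what to do about a missing file is the one that catches the exception.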
|
||||
|
||||
\section{Using the Batteries}
|
||||
|
||||
Every so often, people seem to be writing stuff in the Python library
|
||||
again, usually poorly. While the occasional module has a poor interface,
|
||||
it is usually much better to use the rich standard library and data
|
||||
types that come with Python than inventing your own.
|
||||
|
||||
A useful module very few people know about is \module{os.path}. It
|
||||
always has the correct path arithmetic for your operating system, and
|
||||
will usually be much better than whatever you come up with yourself.
|
||||
|
||||
Compare:
|
||||
|
||||
\begin{verbatim}
|
||||
# ugh!
|
||||
return dir+"/"+file
|
||||
# better
|
||||
return os.path.join(dir, file)
|
||||
\end{verbatim}
|
||||
|
||||
More useful functions in \module{os.path}: \function{basename},
|
||||
\function{dirname} and \function{splitext}.
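A quick illustration of these functions working together; the path components are made up for the example.

```python
import os.path

path = os.path.join("logs", "2003", "server.log")

directory = os.path.dirname(path)      # everything before the last component
filename = os.path.basename(path)      # the last component only
root, extension = os.path.splitext(filename)

print(directory, filename, root, extension)
```

Joining \var{directory} and \var{filename} again with \function{os.path.join()} gives back the original path, whichever separator the platform uses.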
|
||||
|
||||
There are also many useful builtin functions people seem not to be
|
||||
aware of for some reason: \function{min()} and \function{max()} can
|
||||
find the minimum/maximum of any sequence with comparable semantics,
|
||||
for example, yet many people write their own max/min. Another highly
|
||||
useful function is \function{reduce()}. Classical use of \function{reduce()}
|
||||
is something like
|
||||
|
||||
\begin{verbatim}
|
||||
import sys, operator
|
||||
nums = map(float, sys.argv[1:])
|
||||
print reduce(operator.add, nums)/len(nums)
|
||||
\end{verbatim}
|
||||
|
||||
This cute little script prints the average of all numbers given on the
|
||||
command line. The \function{reduce()} adds up all the numbers, and
|
||||
the rest is just some pre- and postprocessing.
|
||||
|
||||
On the same note, note that \function{float()}, \function{int()} and
|
||||
\function{long()} all accept arguments of type string, and so are
|
||||
suited to parsing --- assuming you are ready to deal with the
|
||||
\exception{ValueError} they raise.
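These built-ins combine naturally. The sketch below is written for current Python versions, where \function{reduce()} is imported from \module{functools}; the helper names \function{parse_numbers()} and \function{average()} are invented for the example.

```python
from functools import reduce
import operator


def parse_numbers(words):
    """Convert strings to floats, skipping anything that is not a number."""
    numbers = []
    for word in words:
        try:
            numbers.append(float(word))
        except ValueError:
            continue
    return numbers


def average(numbers):
    return reduce(operator.add, numbers) / len(numbers)


values = parse_numbers(["1", "2.5", "oops", "4"])
print(min(values), max(values), average(values))
```

Whether a bad argument should be skipped, as here, or reported to the user is a decision for the caller; the point is only that \function{float()} does the parsing and signals failure with \exception{ValueError}.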
|
||||
|
||||
\section{Using Backslash to Continue Statements}
|
||||
|
||||
Since Python treats a newline as a statement terminator,
|
||||
and since statements are often more than is comfortable to put
|
||||
in one line, many people do:
|
||||
|
||||
\begin{verbatim}
|
||||
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
   calculate_number(10, 20) != forbulate(500, 360):
      pass
|
||||
\end{verbatim}
|
||||
|
||||
You should realize that this is dangerous: a stray space after the
|
||||
\code{\\} would make this line wrong, and stray spaces are notoriously
|
||||
hard to see in editors. In this case, at least it would be a syntax
|
||||
error, but if the code was:
|
||||
|
||||
\begin{verbatim}
|
||||
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
        + calculate_number(10, 20)*forbulate(500, 360)
|
||||
\end{verbatim}
|
||||
|
||||
then it would just be subtly wrong.
|
||||
|
||||
It is usually much better to use the implicit continuation inside parentheses.
|
||||
|
||||
This version is bulletproof:
|
||||
|
||||
\begin{verbatim}
|
||||
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
        + calculate_number(10, 20)*forbulate(500, 360))
|
||||
\end{verbatim}
|
||||
|
||||
\end{document}
|
1466
Doc/howto/regex.tex
Normal file
File diff suppressed because it is too large
61
Doc/howto/rexec.tex
Normal file
|
@ -0,0 +1,61 @@
|
|||
\documentclass{howto}
|
||||
|
||||
\title{Restricted Execution HOWTO}
|
||||
|
||||
\release{2.1}
|
||||
|
||||
\author{A.M. Kuchling}
|
||||
\authoraddress{\email{amk@amk.ca}}
|
||||
|
||||
\begin{document}
|
||||
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
|
||||
Python 2.2.2 and earlier provided a \module{rexec} module for running
|
||||
untrusted code. However, it's never been exhaustively audited for
|
||||
security and it hasn't been updated to take into account recent
|
||||
changes to Python such as new-style classes. Therefore, the
|
||||
\module{rexec} module should not be trusted. To discourage use of
|
||||
\module{rexec}, this HOWTO has been withdrawn.
|
||||
|
||||
The \module{rexec} and \module{Bastion} modules have been disabled in
|
||||
the Python CVS tree, both on the trunk (which will eventually become
|
||||
Python 2.3alpha2 and later 2.3final) and on the release22-maint branch
|
||||
(which will become Python 2.2.3, if someone ever volunteers to issue
|
||||
2.2.3).
|
||||
|
||||
For discussion of the problems with \module{rexec}, see the python-dev
|
||||
threads starting at the following URLs:
|
||||
\url{http://mail.python.org/pipermail/python-dev/2002-December/031160.html},
|
||||
and
|
||||
\url{http://mail.python.org/pipermail/python-dev/2003-January/031848.html}.
|
||||
|
||||
\end{abstract}
|
||||
|
||||
|
||||
\section{Version History}
|
||||
|
||||
Sep. 12, 1998: Minor revisions and added the reference to the Janus
|
||||
project.
|
||||
|
||||
Feb. 26, 1998: First version. Suggestions are welcome.
|
||||
|
||||
Mar. 16, 1998: Made some revisions suggested by Jeff Rush. Some minor
|
||||
changes and clarifications, and a sizable section on exceptions added.
|
||||
|
||||
Oct. 4, 2000: Checked with Python 2.0. Minor rewrites and fixes made.
|
||||
Version number increased to 2.0.
|
||||
|
||||
Dec. 17, 2002: Withdrawn.
|
||||
|
||||
Jan. 8, 2003: Mention that \module{rexec} will be disabled in Python 2.3,
|
||||
and added links to relevant python-dev threads.
|
||||
|
||||
\end{document}
|
||||
|
||||
|
||||
|
||||
|
460
Doc/howto/sockets.tex
Normal file
|
@ -0,0 +1,460 @@
|
|||
\documentclass{howto}
|
||||
|
||||
\title{Socket Programming HOWTO}
|
||||
|
||||
\release{0.00}
|
||||
|
||||
\author{Gordon McMillan}
|
||||
\authoraddress{\email{gmcm@hypernet.com}}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
Sockets are used nearly everywhere, but are one of the most severely
|
||||
misunderstood technologies around. This is a 10,000 foot overview of
|
||||
sockets. It's not really a tutorial - you'll still have work to do in
|
||||
getting things operational. It doesn't cover the fine points (and there
|
||||
are a lot of them), but I hope it will give you enough background to
|
||||
begin using them decently.
|
||||
|
||||
This document is available from the Python HOWTO page at
|
||||
\url{http://www.python.org/doc/howto}.
|
||||
|
||||
\end{abstract}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
\section{Sockets}
|
||||
|
||||
Sockets are used nearly everywhere, but are one of the most severely
|
||||
misunderstood technologies around. This is a 10,000 foot overview of
|
||||
sockets. It's not really a tutorial - you'll still have work to do in
|
||||
getting things working. It doesn't cover the fine points (and there
|
||||
are a lot of them), but I hope it will give you enough background to
|
||||
begin using them decently.
|
||||
|
||||
I'm only going to talk about INET sockets, but they account for at
|
||||
least 99\% of the sockets in use. And I'll only talk about STREAM
|
||||
sockets - unless you really know what you're doing (in which case this
|
||||
HOWTO isn't for you!), you'll get better behavior and performance from
|
||||
a STREAM socket than anything else. I will try to clear up the mystery
|
||||
of what a socket is, as well as some hints on how to work with
|
||||
blocking and non-blocking sockets. But I'll start by talking about
|
||||
blocking sockets. You'll need to know how they work before dealing
|
||||
with non-blocking sockets.
|
||||
|
||||
Part of the trouble with understanding these things is that "socket"
|
||||
can mean a number of subtly different things, depending on context. So
|
||||
first, let's make a distinction between a "client" socket - an
|
||||
endpoint of a conversation, and a "server" socket, which is more like
|
||||
a switchboard operator. The client application (your browser, for
|
||||
example) uses "client" sockets exclusively; the web server it's
|
||||
talking to uses both "server" sockets and "client" sockets.
|
||||
|
||||
|
||||
\subsection{History}
|
||||
|
||||
Of the various forms of IPC (\emph{Inter Process Communication}),
|
||||
sockets are by far the most popular. On any given platform, there are
|
||||
likely to be other forms of IPC that are faster, but for
|
||||
cross-platform communication, sockets are about the only game in town.
|
||||
|
||||
They were invented in Berkeley as part of the BSD flavor of Unix. They
|
||||
spread like wildfire with the Internet. With good reason --- the
|
||||
combination of sockets with INET makes talking to arbitrary machines
|
||||
around the world unbelievably easy (at least compared to other
|
||||
schemes).
|
||||
|
||||
\section{Creating a Socket}
|
||||
|
||||
Roughly speaking, when you clicked on the link that brought you to
|
||||
this page, your browser did something like the following:
|
||||
|
||||
\begin{verbatim}
|
||||
#create an INET, STREAMing socket
|
||||
s = socket.socket(
|
||||
socket.AF_INET, socket.SOCK_STREAM)
|
||||
#now connect to the web server on port 80
|
||||
# - the normal http port
|
||||
s.connect(("www.mcmillan-inc.com", 80))
|
||||
\end{verbatim}
|
||||
|
||||
When the \code{connect} completes, the socket \code{s} can
|
||||
now be used to send in a request for the text of this page. The same
|
||||
socket will read the reply, and then be destroyed. That's right -
|
||||
destroyed. Client sockets are normally only used for one exchange (or
|
||||
a small set of sequential exchanges).
|
||||
|
||||
What happens in the web server is a bit more complex. First, the web
|
||||
server creates a "server socket".
|
||||
|
||||
\begin{verbatim}
|
||||
#create an INET, STREAMing socket
|
||||
serversocket = socket.socket(
|
||||
socket.AF_INET, socket.SOCK_STREAM)
|
||||
#bind the socket to a public host,
|
||||
# and a well-known port
|
||||
serversocket.bind((socket.gethostname(), 80))
|
||||
#become a server socket
|
||||
serversocket.listen(5)
|
||||
\end{verbatim}
|
||||
|
||||
A couple of things to notice: we used \code{socket.gethostname()}
|
||||
so that the socket would be visible to the outside world. If we had
|
||||
used \code{s.bind(('', 80))}, the socket would be reachable on any address the machine has; with \code{s.bind(('localhost',
|
||||
80))} or \code{s.bind(('127.0.0.1', 80))} we would still
|
||||
have a "server" socket, but one that was only visible within the same
|
||||
machine.
|
||||
|
||||
A second thing to note: low number ports are usually reserved for
|
||||
"well known" services (HTTP, SNMP etc). If you're playing around, use
|
||||
a nice high number (4 digits).
|
||||
|
||||
Finally, the argument to \code{listen} tells the socket library that
|
||||
we want it to queue up as many as 5 connect requests (the normal max)
|
||||
before refusing outside connections. If the rest of the code is
|
||||
written properly, that should be plenty.
|
||||
|
||||
OK, now we have a "server" socket, listening on port 80. Now we enter
|
||||
the mainloop of the web server:
|
||||
|
||||
\begin{verbatim}
|
||||
while 1:
    #accept connections from outside
    (clientsocket, address) = serversocket.accept()
    #now do something with the clientsocket
    #in this case, we'll pretend this is a threaded server
    ct = client_thread(clientsocket)
    ct.run()
|
||||
\end{verbatim}
|
||||
|
||||
There are actually three general ways in which this loop could work:
dispatching a thread to handle \code{clientsocket}, creating a new
process to handle \code{clientsocket}, or restructuring this app
to use non-blocking sockets and multiplexing between our "server" socket
|
||||
and any active \code{clientsocket}s using
|
||||
\code{select}. More about that later. The important thing to
|
||||
understand now is this: this is \emph{all} a "server" socket
|
||||
does. It doesn't send any data. It doesn't receive any data. It just
|
||||
produces "client" sockets. Each \code{clientsocket} is created
|
||||
in response to some \emph{other} "client" socket doing a
|
||||
\code{connect()} to the host and port we're bound to. As soon as
|
||||
we've created that \code{clientsocket}, we go back to listening
|
||||
for more connections. The two "clients" are free to chat it up - they
|
||||
are using some dynamically allocated port which will be recycled when
|
||||
the conversation ends.
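
The accept loop described above can be tried out on a single machine. What follows is only a minimal sketch, written for Python 3 (an assumption about your interpreter). The helper names \code{serve_once} and \code{demo}, the loopback address and the use of port 0 (which asks the operating system for any free port) are choices made for this illustration, not part of the web server being described.

```python
import socket
import threading


def serve_once(serversocket, greeting):
    # accept() blocks until a client connects, then hands back a brand-new
    # "client" socket; the "server" socket itself never carries any data.
    clientsocket, address = serversocket.accept()
    try:
        clientsocket.sendall(greeting)
    finally:
        clientsocket.close()


def demo():
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 lets the OS pick a free port (an assumption made so the
    # sketch does not need the privileges that port 80 requires).
    serversocket.bind(("127.0.0.1", 0))
    serversocket.listen(5)
    port = serversocket.getsockname()[1]

    worker = threading.Thread(target=serve_once,
                              args=(serversocket, b"hello"))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    chunks = []
    while True:
        chunk = client.recv(4096)
        if not chunk:          # empty result: the other side has closed
            break
        chunks.append(chunk)
    client.close()
    worker.join()
    serversocket.close()
    return b"".join(chunks)
```

Calling \code{demo()} runs one complete exchange: the "server" socket produces one "client" socket, which sends its greeting and closes.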
|
||||
|
||||
\subsection{IPC}

If you need fast IPC between two processes
|
||||
on one machine, you should look into whatever form of shared memory
|
||||
the platform offers. A simple protocol based around shared memory and
|
||||
locks or semaphores is by far the fastest technique.
|
||||
|
||||
If you do decide to use sockets, bind the "server" socket to
|
||||
\code{'localhost'}. On most platforms, this will take a shortcut
|
||||
around a couple of layers of network code and be quite a bit faster.
|
||||
|
||||
|
||||
\section{Using a Socket}
|
||||
|
||||
The first thing to note is that the web browser's "client" socket and
|
||||
the web server's "client" socket are identical beasts. That is, this
|
||||
is a "peer to peer" conversation. Or to put it another way, \emph{as the
|
||||
designer, you will have to decide what the rules of etiquette are for
|
||||
a conversation}. Normally, the \code{connect}ing socket
|
||||
starts the conversation, by sending in a request, or perhaps a
|
||||
signon. But that's a design decision - it's not a rule of sockets.
|
||||
|
||||
Now there are two sets of verbs to use for communication. You can use
|
||||
\code{send} and \code{recv}, or you can transform your
|
||||
client socket into a file-like beast and use \code{read} and
|
||||
\code{write}. The latter is the way Java presents their
|
||||
sockets. I'm not going to talk about it here, except to warn you that
|
||||
you need to use \code{flush} on sockets. These are buffered
|
||||
"files", and a common mistake is to \code{write} something, and
|
||||
then \code{read} for a reply. Without a \code{flush} in
|
||||
there, you may wait forever for the reply, because the request may
|
||||
still be in your output buffer.
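
To make the \code{flush} warning concrete, here is a small sketch for Python 3 (an assumption about your interpreter). It uses \code{socket.socketpair()} to get two connected sockets inside one process, purely so the example needs no network; \code{demo} is a made-up name.

```python
import socket


def demo():
    left, right = socket.socketpair()
    # makefile() wraps each socket in buffered, file-like objects.
    left_out = left.makefile("wb")
    left_in = left.makefile("rb")
    right_out = right.makefile("wb")
    right_in = right.makefile("rb")

    left_out.write(b"ping\n")
    left_out.flush()               # push the request out of the buffer
    request = right_in.readline()  # would wait forever without the flush

    right_out.write(b"pong\n")
    right_out.flush()
    reply = left_in.readline()

    for f in (left_out, left_in, right_out, right_in):
        f.close()
    left.close()
    right.close()
    return request, reply
```

If you comment out the first \code{flush()}, the first \code{readline()} never returns, which is exactly the mistake described above.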
|
||||
|
||||
Now we come to the major stumbling block of sockets - \code{send}
|
||||
and \code{recv} operate on the network buffers. They do not
|
||||
necessarily handle all the bytes you hand them (or expect from them),
|
||||
because their major focus is handling the network buffers. In general,
|
||||
they return when the associated network buffers have been filled
|
||||
(\code{send}) or emptied (\code{recv}). They then tell you
|
||||
how many bytes they handled. It is \emph{your} responsibility to call
|
||||
them again until your message has been completely dealt with.
|
||||
|
||||
When a \code{recv} returns 0 bytes, it means the other side has
|
||||
closed (or is in the process of closing) the connection. You will not
|
||||
receive any more data on this connection. Ever. You may be able to
|
||||
send data successfully; I'll talk more about this later.
|
||||
|
||||
A protocol like HTTP uses a socket for only one transfer. The client
|
||||
sends a request, then reads a reply. That's it. The socket is
|
||||
discarded. This means that a client can detect the end of the reply by
|
||||
receiving 0 bytes.
|
||||
|
||||
But if you plan to reuse your socket for further transfers, you need
|
||||
to realize that \emph{there is no "EOT" (End of Transfer) on a
|
||||
socket.} I repeat: if a socket \code{send} or
|
||||
\code{recv} returns after handling 0 bytes, the connection has
|
||||
been broken. If the connection has \emph{not} been broken, you may
|
||||
wait on a \code{recv} forever, because the socket will
|
||||
\emph{not} tell you that there's nothing more to read (for now). Now
|
||||
if you think about that a bit, you'll come to realize a fundamental
|
||||
truth of sockets: \emph{messages must either be fixed length} (yuck),
|
||||
\emph{or be delimited} (shrug), \emph{or indicate how long they are}
|
||||
(much better), \emph{or end by shutting down the connection}. The
|
||||
choice is entirely yours, (but some ways are righter than others).
|
||||
|
||||
Assuming you don't want to end the connection, the simplest solution
|
||||
is a fixed length message:
|
||||
|
||||
\begin{verbatim}
|
||||
import socket

MSGLEN = 4096   # fixed message length both ends agree on (illustrative value)

class mysocket:
    '''demonstration class only
      - coded for clarity, not efficiency'''

    def __init__(self, sock=None):
        if sock is None:
            self.sock = socket.socket(
                socket.AF_INET, socket.SOCK_STREAM)
        else:
            self.sock = sock

    def connect(self, host, port):
        self.sock.connect((host, port))

    def mysend(self, msg):
        totalsent = 0
        while totalsent < MSGLEN:
            sent = self.sock.send(msg[totalsent:])
            if sent == 0:
                raise RuntimeError("socket connection broken")
            totalsent = totalsent + sent

    def myreceive(self):
        msg = ''
        while len(msg) < MSGLEN:
            chunk = self.sock.recv(MSGLEN-len(msg))
            if chunk == '':
                raise RuntimeError("socket connection broken")
            msg = msg + chunk
        return msg
|
||||
\end{verbatim}
|
||||
|
||||
The sending code here is usable for almost any messaging scheme - in
|
||||
Python you send strings, and you can use \code{len()} to
|
||||
determine a string's length (even if it has embedded \code{\e 0}
|
||||
characters). It's mostly the receiving code that gets more
|
||||
complex. (And in C, it's not much worse, except you can't use
|
||||
\code{strlen} if the message has embedded \code{\e 0}s.)
|
||||
|
||||
The easiest enhancement is to make the first character of the message
|
||||
an indicator of message type, and have the type determine the
|
||||
length. Now you have two \code{recv}s - the first to get (at
|
||||
least) that first character so you can look up the length, and the
|
||||
second in a loop to get the rest. If you decide to go the delimited
|
||||
route, you'll be receiving in some arbitrary chunk size, (4096 or 8192
|
||||
is frequently a good match for network buffer sizes), and scanning
|
||||
what you've received for a delimiter.
|
||||
|
||||
One complication to be aware of: if your conversational protocol
|
||||
allows multiple messages to be sent back to back (without some kind of
|
||||
reply), and you pass \code{recv} an arbitrary chunk size, you
|
||||
may end up reading the start of a following message. You'll need to
|
||||
put that aside and hold onto it, until it's needed.
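
One possible shape for the delimited approach, including the "put it aside" step, is sketched below for Python 3 (an assumption about your interpreter). The class name \code{LineReader}, the newline delimiter and the 4096-byte chunk size are choices made for this illustration only.

```python
import socket


class LineReader:
    """Hypothetical helper: splits a byte stream on a delimiter."""

    def __init__(self, sock, delimiter=b"\n", chunk_size=4096):
        self.sock = sock
        self.delimiter = delimiter
        self.chunk_size = chunk_size
        self.pending = b""   # bytes already read that belong to later messages

    def next_message(self):
        # Keep receiving until at least one complete message is buffered.
        while self.delimiter not in self.pending:
            chunk = self.sock.recv(self.chunk_size)
            if not chunk:
                raise RuntimeError("socket connection broken")
            self.pending += chunk
        # Hand back one message and hold on to whatever followed it.
        message, _, self.pending = self.pending.partition(self.delimiter)
        return message
```

If the other end sends \code{b"one\e ntwo\e nthr"} in one go, two calls to \code{next_message()} return \code{b"one"} and \code{b"two"}, and the start of the third message waits in \code{pending} until the rest arrives.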
|
||||
|
||||
Prefixing the message with its length (say, as 5 numeric characters)
|
||||
gets more complex, because (believe it or not), you may not get all 5
|
||||
characters in one \code{recv}. In playing around, you'll get
|
||||
away with it; but in high network loads, your code will very quickly
|
||||
break unless you use two \code{recv} loops - the first to
|
||||
determine the length, the second to get the data part of the
|
||||
message. Nasty. This is also when you'll discover that
|
||||
\code{send} does not always manage to get rid of everything in
|
||||
one pass. And despite having read this, you will eventually get bit by
|
||||
it!
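
For readers who would like a starting point, here is one way the length-prefixed scheme might look in Python 3 (an assumption about your interpreter). It is a sketch, not a reference implementation: the names \code{recv_exactly}, \code{send_message} and \code{recv_message} are invented here, and the 5-digit ASCII prefix simply follows the example in the text.

```python
import socket

HEADER_LEN = 5  # length prefix: 5 ASCII digits, as in the text


def recv_exactly(sock, count):
    """Loop on recv until exactly count bytes have arrived."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_message(sock, payload):
    if len(payload) > 99999:
        raise ValueError("payload too long for a 5-digit prefix")
    header = str(len(payload)).zfill(HEADER_LEN).encode("ascii")
    data = header + payload
    totalsent = 0
    # send may accept only part of the data, so keep going until done.
    while totalsent < len(data):
        sent = sock.send(data[totalsent:])
        if sent == 0:
            raise RuntimeError("socket connection broken")
        totalsent += sent


def recv_message(sock):
    # First loop: the header. Second loop: the data part.
    header = recv_exactly(sock, HEADER_LEN)
    length = int(header.decode("ascii"))
    return recv_exactly(sock, length)
```

Both the header and the body go through the same \code{recv} loop, which is the point of the paragraph above: neither is guaranteed to arrive in one piece.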
|
||||
|
||||
In the interests of space, building your character, (and preserving my
|
||||
competitive position), these enhancements are left as an exercise for
|
||||
the reader. Let's move on to cleaning up.
|
||||
|
||||
\subsection{Binary Data}
|
||||
|
||||
It is perfectly possible to send binary data over a socket. The major
|
||||
problem is that not all machines use the same formats for binary
|
||||
data. For example, a Motorola chip will represent a 16 bit integer
|
||||
with the value 1 as the two hex bytes 00 01. Intel and DEC, however,
|
||||
are byte-reversed - that same 1 is 01 00. Socket libraries have calls
|
||||
for converting 16 and 32 bit integers - \code{ntohl, htonl, ntohs,
|
||||
htons} where "n" means \emph{network} and "h" means \emph{host},
|
||||
"s" means \emph{short} and "l" means \emph{long}. Where network order
|
||||
is host order, these do nothing, but where the machine is
|
||||
byte-reversed, these swap the bytes around appropriately.
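
A short sketch of the byte-order issue in Python 3 (an assumption about your interpreter). The \code{struct} format characters are standard, but the variable names are just for illustration.

```python
import socket
import struct
import sys

value = 1

# "!" selects network (big-endian) order, "<" little-endian order;
# "H" is an unsigned 16 bit integer.
network_bytes = struct.pack("!H", value)   # bytes 00 01
little_bytes = struct.pack("<H", value)    # bytes 01 00

# htons and ntohs undo each other on every host.
round_trip = socket.ntohs(socket.htons(value))

# On a big-endian host htons changes nothing; on a little-endian host
# it swaps the two bytes, so 1 (0x0001) becomes 256 (0x0100).
expected_htons = value if sys.byteorder == "big" else 256
```

Packing with an explicit \code{"!"} is often the simpler choice, because the result is the same on every machine.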
|
||||
|
||||
In these days of 32 bit machines, the ASCII representation of binary
|
||||
data is frequently smaller than the binary representation. That's
|
||||
because a surprising amount of the time, all those longs have the
|
||||
value 0, or maybe 1. The string "0" would be two bytes, while binary
|
||||
is four. Of course, this doesn't fit well with fixed-length
|
||||
messages. Decisions, decisions.
|
||||
|
||||
\section{Disconnecting}
|
||||
|
||||
Strictly speaking, you're supposed to use \code{shutdown} on a
|
||||
socket before you \code{close} it. The \code{shutdown} is
|
||||
an advisory to the socket at the other end. Depending on the argument
|
||||
you pass it, it can mean "I'm not going to send anymore, but I'll
|
||||
still listen", or "I'm not listening, good riddance!". Most socket
|
||||
libraries, however, are so used to programmers neglecting to use this
|
||||
piece of etiquette that normally a \code{close} is the same as
|
||||
\code{shutdown(); close()}. So in most situations, an explicit
|
||||
\code{shutdown} is not needed.
|
||||
|
||||
One way to use \code{shutdown} effectively is in an HTTP-like
|
||||
exchange. The client sends a request and then does a
|
||||
\code{shutdown(1)}. This tells the server "This client is done
|
||||
sending, but can still receive." The server can detect "EOF" by a
|
||||
receive of 0 bytes. It can assume it has the complete request. The
|
||||
server sends a reply. If the \code{send} completes successfully
|
||||
then, indeed, the client was still receiving.
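
The HTTP-like exchange just described could be sketched as follows in Python 3 (an assumption about your interpreter). \code{socket.socketpair()} stands in for a real client and server connection, and \code{read_until_eof} and \code{demo} are names invented for this illustration.

```python
import socket


def read_until_eof(sock):
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:        # 0 bytes: the peer has finished sending
            break
        chunks.append(chunk)
    return b"".join(chunks)


def demo():
    client, server = socket.socketpair()
    client.sendall(b"GET /index")
    client.shutdown(socket.SHUT_WR)    # same as shutdown(1): done sending

    request = read_until_eof(server)   # returns because of the shutdown
    server.sendall(b"reply to " + request)
    server.close()

    reply = read_until_eof(client)     # the client could still receive
    client.close()
    return request, reply
```

Without the \code{shutdown} call, the server's read loop would never see the 0-byte result and both sides would wait on each other.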
|
||||
|
||||
Python takes the automatic shutdown a step further, and says that when a socket is garbage collected, it will automatically do a \code{close} if it's needed. But relying on this is a very bad habit. If your socket just disappears without doing a \code{close}, the socket at the other end may hang indefinitely, thinking you're just being slow. \emph{Please} \code{close} your sockets when you're done.
|
||||
|
||||
|
||||
\subsection{When Sockets Die}
|
||||
|
||||
Probably the worst thing about using blocking sockets is what happens
|
||||
when the other side comes down hard (without doing a
|
||||
\code{close}). Your socket is likely to hang. TCP is a
|
||||
reliable protocol, and it will wait a long, long time before giving up
|
||||
on a connection. If you're using threads, the entire thread is
|
||||
essentially dead. There's not much you can do about it. As long as you
|
||||
aren't doing something dumb, like holding a lock while doing a
|
||||
blocking read, the thread isn't really consuming much in the way of
|
||||
resources. Do \emph{not} try to kill the thread - part of the reason
|
||||
that threads are more efficient than processes is that they avoid the
|
||||
overhead associated with the automatic recycling of resources. In
|
||||
other words, if you do manage to kill the thread, your whole process
|
||||
is likely to be screwed up.
|
||||
|
||||
\section{Non-blocking Sockets}
|
||||
|
||||
If you've understood the preceding, you already know most of what you
|
||||
need to know about the mechanics of using sockets. You'll still use
|
||||
the same calls, in much the same ways. It's just that, if you do it
|
||||
right, your app will be almost inside-out.
|
||||
|
||||
In Python, you use \code{socket.setblocking(0)} to make it
|
||||
non-blocking. In C, it's more complex, (for one thing, you'll need to
|
||||
choose between the POSIX flavor \code{O_NONBLOCK} and the almost
indistinguishable older BSD flavor \code{O_NDELAY}, which is
|
||||
completely different from \code{TCP_NODELAY}), but it's the
|
||||
exact same idea. You do this after creating the socket, but before
|
||||
using it. (Actually, if you're nuts, you can switch back and forth.)
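
Here is a small sketch of what a non-blocking \code{recv} looks like in Python 3 (an assumption about your interpreter), where an empty buffer shows up as a \code{BlockingIOError}. The helper \code{try_recv} and the use of \code{socket.socketpair()} are choices made for this illustration.

```python
import socket


def try_recv(sock, size=4096):
    """Return received bytes, or None if nothing is available yet."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        return None


def demo():
    left, right = socket.socketpair()
    right.setblocking(False)       # same effect as setblocking(0)
    first = try_recv(right)        # None: nothing has been sent yet

    left.sendall(b"data")
    right.setblocking(True)        # switch back, just to wait for the bytes
    second = right.recv(4096)

    left.close()
    right.close()
    return first, second
```

Switching back to blocking mode here only keeps the sketch short; a real non-blocking program would wait for readiness instead.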
|
||||
|
||||
The major mechanical difference is that \code{send},
|
||||
\code{recv}, \code{connect} and \code{accept} can
|
||||
return without having done anything. You have (of course) a number of
|
||||
choices. You can check return codes and error codes and generally drive
|
||||
yourself crazy. If you don't believe me, try it sometime. Your app
|
||||
will grow large, buggy and suck CPU. So let's skip the brain-dead
|
||||
solutions and do it right.
|
||||
|
||||
Use \code{select}.
|
||||
|
||||
In C, coding \code{select} is fairly complex. In Python, it's a
|
||||
piece of cake, but it's close enough to the C version that if you
|
||||
understand \code{select} in Python, you'll have little trouble
|
||||
with it in C.
|
||||
|
||||
\begin{verbatim}
ready_to_read, ready_to_write, in_error = \
               select.select(
                  potential_readers,
                  potential_writers,
                  potential_errs,
                  timeout)
|
||||
\end{verbatim}
|
||||
|
||||
You pass \code{select} three lists: the first contains all
|
||||
sockets that you might want to try reading; the second all the sockets
|
||||
you might want to try writing to, and the last (normally left empty)
|
||||
those that you want to check for errors. You should note that a
|
||||
socket can go into more than one list. The \code{select} call is
|
||||
blocking, but you can give it a timeout. This is generally a sensible
|
||||
thing to do - give it a nice long timeout (say a minute) unless you
|
||||
have good reason to do otherwise.
|
||||
|
||||
In return, you will get three lists. They have the sockets that are
|
||||
actually readable, writable and in error. Each of these lists is a
|
||||
subset (possibly empty) of the corresponding list you passed in. A
socket that you put in more than one input list can also come back in
more than one output list.
|
||||
|
||||
If a socket is in the output readable list, you can be
|
||||
as-close-to-certain-as-we-ever-get-in-this-business that a
|
||||
\code{recv} on that socket will return \emph{something}. Same
|
||||
idea for the writable list. You'll be able to send
|
||||
\emph{something}. Maybe not all you want to, but \emph{something} is
|
||||
better than nothing. (Actually, any reasonably healthy socket will
|
||||
return as writable - it just means outbound network buffer space is
|
||||
available.)
|
||||
|
||||
If you have a "server" socket, put it in the \code{potential_readers}
|
||||
list. If it comes out in the readable list, your \code{accept}
|
||||
will (almost certainly) work. If you have created a new socket to
|
||||
\code{connect} to someone else, put it in the \code{potential_writers}
|
||||
list. If it shows up in the writable list, you have a decent chance
|
||||
that it has connected.
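
Putting these pieces together, a tiny single-process \code{select} loop might look like the sketch below, written for Python 3 (an assumption about your interpreter). It only uses the readable list; the upper-casing echo, the loopback address, port 0 and the name \code{demo} are all invented for this illustration.

```python
import select
import socket


def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS choose
    server.listen(5)
    server.setblocking(False)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())   # blocking connect, for brevity
    client.sendall(b"hello")

    potential_readers = [server, client]
    reply = b""
    while len(reply) < 5:
        ready_to_read, _, _ = select.select(potential_readers, [], [], 60)
        if not ready_to_read:
            raise RuntimeError("timed out waiting for sockets")
        for sock in ready_to_read:
            if sock is server:
                # A readable "server" socket means accept will work.
                conn, address = server.accept()
                conn.setblocking(False)
                potential_readers.append(conn)
            elif sock is client:
                chunk = client.recv(4096)
                if not chunk:
                    raise RuntimeError("socket connection broken")
                reply += chunk
            else:
                # One of the accepted connections has data: echo it back.
                data = sock.recv(4096)
                if data:
                    sock.sendall(data.upper())
                else:
                    potential_readers.remove(sock)
                    sock.close()

    for sock in potential_readers:
        sock.close()
    return reply
```

The loop never calls \code{accept} or \code{recv} on a socket unless \code{select} has reported it readable, so none of those calls has to wait.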
|
||||
|
||||
One very nasty problem with \code{select}: if somewhere in those
|
||||
input lists of sockets is one which has died a nasty death, the
|
||||
\code{select} will fail. You then need to loop through every
|
||||
single damn socket in all those lists and do a
|
||||
\code{select([sock],[],[],0)} until you find the bad one. That
|
||||
timeout of 0 means it won't take long, but it's ugly.
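
The hunt for the bad socket can be sketched like this in Python 3 (an assumption about your interpreter). Catching both \code{OSError} and \code{ValueError} is also an assumption about how a dead socket shows up: a socket that was closed locally makes \code{select} raise \code{ValueError}, other failures surface as \code{OSError}. The function name is made up.

```python
import select
import socket


def find_bad_sockets(socks):
    """Probe each socket on its own with a zero timeout."""
    bad = []
    for sock in socks:
        try:
            select.select([sock], [], [], 0)
        except (OSError, ValueError):
            bad.append(sock)
    return bad
```

Once the culprits are known, remove them from your lists and call \code{select} again with what is left.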
|
||||
|
||||
Actually, \code{select} can be handy even with blocking sockets.
|
||||
It's one way of determining whether you will block - the socket
|
||||
returns as readable when there's something in the buffers. However,
|
||||
this still doesn't help with the problem of determining whether the
|
||||
other end is done, or just busy with something else.
|
||||
|
||||
\textbf{Portability alert}: On Unix, \code{select} works with both
sockets and files. Don't try this on Windows. On Windows,
|
||||
\code{select} works with sockets only. Also note that in C, many
|
||||
of the more advanced socket options are done differently on
|
||||
Windows. In fact, on Windows I usually use threads (which work very,
|
||||
very well) with my sockets. Face it, if you want any kind of
|
||||
performance, your code will look very different on Windows than on
|
||||
Unix. (I haven't the foggiest how you do this stuff on a Mac.)
|
||||
|
||||
\subsection{Performance}
|
||||
|
||||
There's no question that the fastest sockets code uses non-blocking
|
||||
sockets and select to multiplex them. You can put together something
|
||||
that will saturate a LAN connection without putting any strain on the
|
||||
CPU. The trouble is that an app written this way can't do much of
|
||||
anything else - it needs to be ready to shuffle bytes around at all
|
||||
times.
|
||||
|
||||
Assuming that your app is actually supposed to do something more than
|
||||
that, threading is the optimal solution, (and using non-blocking
|
||||
sockets will be faster than using blocking sockets). Unfortunately,
|
||||
threading support in Unixes varies both in API and quality. So the
|
||||
normal Unix solution is to fork a subprocess to deal with each
|
||||
connection. The overhead for this is significant (and don't do this on
|
||||
Windows - the overhead of process creation is enormous there). It also
|
||||
means that unless each subprocess is completely independent, you'll
|
||||
need to use another form of IPC, say a pipe, or shared memory and
|
||||
semaphores, to communicate between the parent and child processes.
|
||||
|
||||
Finally, remember that even though blocking sockets are somewhat
|
||||
slower than non-blocking, in many cases they are the "right"
|
||||
solution. After all, if your app is driven by the data it receives
|
||||
over a socket, there's not much sense in complicating the logic just
|
||||
so your app can wait on \code{select} instead of
|
||||
\code{recv}.
|
||||
|
||||
\end{document}
|
267
Doc/howto/sorting.tex
Normal file
|
@ -0,0 +1,267 @@
|
|||
\documentclass{howto}
|
||||
|
||||
\title{Sorting Mini-HOWTO}
|
||||
|
||||
% Increment the release number whenever significant changes are made.
|
||||
% The author and/or editor can define 'significant' however they like.
|
||||
\release{0.01}
|
||||
|
||||
\author{Andrew Dalke}
|
||||
\authoraddress{\email{dalke@bioreason.com}}
|
||||
|
||||
\begin{document}
|
||||
\maketitle
|
||||
|
||||
\begin{abstract}
|
||||
\noindent
|
||||
This document is a little tutorial
|
||||
showing a half dozen ways to sort a list with the built-in
|
||||
\method{sort()} method.
|
||||
|
||||
This document is available from the Python HOWTO page at
|
||||
\url{http://www.python.org/doc/howto}.
|
||||
\end{abstract}
|
||||
|
||||
\tableofcontents
|
||||
|
||||
Python lists have a built-in \method{sort()} method. There are many
|
||||
ways to use it to sort a list and there doesn't appear to be a single,
|
||||
central place in the various manuals describing them, so I'll do so
|
||||
here.
|
||||
|
||||
\section{Sorting basic data types}
|
||||
|
||||
A simple ascending sort is easy; just call the \method{sort()} method of a list.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> a = [5, 2, 3, 1, 4]
|
||||
>>> a.sort()
|
||||
>>> print a
|
||||
[1, 2, 3, 4, 5]
|
||||
\end{verbatim}
|
||||
|
||||
Sort takes an optional function which can be called for doing the
|
||||
comparisons. The default sort routine is equivalent to
|
||||
|
||||
\begin{verbatim}
|
||||
>>> a = [5, 2, 3, 1, 4]
|
||||
>>> a.sort(cmp)
|
||||
>>> print a
|
||||
[1, 2, 3, 4, 5]
|
||||
\end{verbatim}
|
||||
|
||||
where \function{cmp} is the built-in function which compares two objects, \code{x} and
|
||||
\code{y}, and returns -1, 0 or 1 depending on whether $x<y$, $x==y$, or $x>y$. During
|
||||
the course of the sort the relationships must stay the same for the
|
||||
final list to make sense.
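
The contract of a comparison function can be written out by hand. The sketch below is for Python 3, where \code{cmp} and the comparison argument to \method{sort()} are no longer available, so it goes through \code{functools.cmp_to_key}; that choice of interpreter is an assumption, and \code{compare} is an illustrative name.

```python
from functools import cmp_to_key


def compare(x, y):
    """Return -1, 0 or 1 for x<y, x==y, x>y, like the built-in cmp."""
    return (x > y) - (x < y)


a = [5, 2, 3, 1, 4]
# cmp_to_key adapts an old-style comparison function to a sort key.
a.sort(key=cmp_to_key(compare))
```

Any function that follows the same contract, and gives consistent answers during the sort, can be used in place of \code{compare}.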
|
||||
|
||||
If you want, you can define your own function for the comparison. For
|
||||
integers (and numbers in general) we can do:
|
||||
|
||||
\begin{verbatim}
>>> def numeric_compare(x, y):
...     return x-y
...
>>> a = [5, 2, 3, 1, 4]
>>> a.sort(numeric_compare)
>>> print a
[1, 2, 3, 4, 5]
|
||||
\end{verbatim}
|
||||
|
||||
By the way, this function won't work if the result of the subtraction
|
||||
is out of range, as in \code{sys.maxint - (-1)}.
|
||||
|
||||
Or, if you don't want to define a new named function you can create an
|
||||
anonymous one using \keyword{lambda}, as in:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> a = [5, 2, 3, 1, 4]
|
||||
>>> a.sort(lambda x, y: x-y)
|
||||
>>> print a
|
||||
[1, 2, 3, 4, 5]
|
||||
\end{verbatim}
|
||||
|
||||
If you want the numbers sorted in reverse you can do
|
||||
|
||||
\begin{verbatim}
>>> a = [5, 2, 3, 1, 4]
>>> def reverse_numeric(x, y):
...     return y-x
...
>>> a.sort(reverse_numeric)
>>> print a
[5, 4, 3, 2, 1]
|
||||
\end{verbatim}
|
||||
|
||||
(a more general implementation could return \code{cmp(y,x)} or \code{-cmp(x,y)}).
|
||||
|
||||
However, it's faster if Python doesn't have to call a function for
|
||||
every comparison, so if you want a reverse-sorted list of basic data
|
||||
types, do the forward sort first, then use the \method{reverse()} method.
|
||||
|
||||
\begin{verbatim}
|
||||
>>> a = [5, 2, 3, 1, 4]
|
||||
>>> a.sort()
|
||||
>>> a.reverse()
|
||||
>>> print a
|
||||
[5, 4, 3, 2, 1]
|
||||
\end{verbatim}
|
||||
|
||||
Here's a case-insensitive string comparison using a \keyword{lambda} function:
|
||||
|
||||
\begin{verbatim}
|
||||
>>> import string
|
||||
>>> a = string.split("This is a test string from Andrew.")
|
||||
>>> a.sort(lambda x, y: cmp(string.lower(x), string.lower(y)))
|
||||
>>> print a
|
||||
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
|
||||
\end{verbatim}
|
||||
|
||||
This goes through the overhead of converting a word to lower case
|
||||
every time it must be compared. At times it may be faster to compute
|
||||
these once and use those values, and the following example shows how.
|
||||
|
||||
\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for i in range(len(words)):
...     offsets.append( (string.lower(words[i]), i) )
...
>>> offsets.sort()
>>> new_words = []
>>> for dontcare, i in offsets:
...     new_words.append(words[i])
...
>>> print new_words
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
|
||||
\end{verbatim}
|
||||
|
||||
The \code{offsets} list is filled with tuples, each holding a lower-case string
|
||||
and its position in the \code{words} list. It is then sorted. Python's
|
||||
sort method sorts tuples by comparing terms; given \code{x} and \code{y}, compare
|
||||
\code{x[0]} to \code{y[0]}, then \code{x[1]} to \code{y[1]}, etc. until there is a difference.
|
||||
|
||||
The result is that the \code{offsets} list is ordered by its first
|
||||
term, and the second term can be used to figure out where the original
|
||||
data was stored. (The \code{for} loop assigns \code{dontcare} and
|
||||
\code{i} to the two fields of each term in the list, but we only need the
|
||||
index value.)
|
||||
|
||||
Another way to implement this is to store the original data as the
|
||||
second term in the \code{offsets} list, as in:
|
||||
|
||||
\begin{verbatim}
>>> words = string.split("This is a test string from Andrew.")
>>> offsets = []
>>> for word in words:
...     offsets.append( (string.lower(word), word) )
...
>>> offsets.sort()
>>> new_words = []
>>> for word in offsets:
...     new_words.append(word[1])
...
>>> print new_words
['a', 'Andrew.', 'from', 'is', 'string', 'test', 'This']
|
||||
\end{verbatim}
|
||||
|
||||
This isn't always appropriate because the second terms in the list
|
||||
(the word, in this example) will be compared when the first terms are
|
||||
the same. If this happens many times, then there will be the unneeded
|
||||
performance hit of comparing the two objects. This can be a large
|
||||
cost if most terms are the same and the objects define their own
|
||||
\method{__cmp__} method, but there will still be some overhead to determine if
|
||||
\method{__cmp__} is defined.
|
||||
|
||||
Still, for large lists, or for lists where the comparison information
|
||||
is expensive to calculate, the last two examples are likely to be the
|
||||
fastest way to sort a list. It will not work on weakly sorted data,
|
||||
like complex numbers, but if you don't know what that means, you
|
||||
probably don't need to worry about it.
|
||||
|
||||
\section{Comparing classes}
|
||||
|
||||
The comparison for two basic data types, like ints to ints or string to
|
||||
string, is built into Python and makes sense. There is a default way
|
||||
to compare class instances, but the default manner isn't usually very
|
||||
useful. You can define your own comparison with the \method{__cmp__} method,
|
||||
as in:
|
||||
|
||||
\begin{verbatim}
>>> class Spam:
...     def __init__(self, spam, eggs):
...         self.spam = spam
...         self.eggs = eggs
...     def __cmp__(self, other):
...         return cmp(self.spam+self.eggs, other.spam+other.eggs)
...     def __str__(self):
...         return str(self.spam + self.eggs)
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort()
>>> for spam in a:
...     print str(spam)
...
5
10
12
|
||||
\end{verbatim}
|
||||
|
||||
Sometimes you may want to sort by a specific attribute of a class. If
|
||||
appropriate you should just define the \method{__cmp__} method to compare
|
||||
those values, but you cannot do this if you want to compare between
|
||||
different attributes at different times. Instead, you'll need to go
|
||||
back to passing a comparison function to sort, as in:
|
||||
|
||||
\begin{verbatim}
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(lambda x, y: cmp(x.eggs, y.eggs))
>>> for spam in a:
...     print spam.eggs, str(spam)
...
3 12
4 5
6 10
|
||||
\end{verbatim}
|
||||
|
||||
If you want to compare two arbitrary attributes (and aren't overly
|
||||
concerned about performance) you can even define your own comparison
|
||||
function object. This uses the ability of a class instance to emulate
|
||||
a function by defining the \method{__call__} method, as in:
|
||||
|
||||
\begin{verbatim}
>>> class CmpAttr:
...     def __init__(self, attr):
...         self.attr = attr
...     def __call__(self, x, y):
...         return cmp(getattr(x, self.attr), getattr(y, self.attr))
...
>>> a = [Spam(1, 4), Spam(9, 3), Spam(4,6)]
>>> a.sort(CmpAttr("spam"))  # sort by the "spam" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
1 4 5
4 6 10
9 3 12

>>> a.sort(CmpAttr("eggs"))   # re-sort by the "eggs" attribute
>>> for spam in a:
...     print spam.spam, spam.eggs, str(spam)
...
9 3 12
1 4 5
4 6 10
|
||||
\end{verbatim}
|
||||
|
||||
Of course, if you want a faster sort you can extract the attributes
|
||||
into an intermediate list and sort that list.
|
||||
|
||||
|
||||
So, there you have it; about a half-dozen different ways to define how
|
||||
to sort a list:
|
||||
\begin{itemize}
|
||||
\item sort using the default method
|
||||
\item sort using a comparison function
|
||||
\item reverse sort not using a comparison function
|
||||
\item sort on an intermediate list (two forms)
|
||||
\item sort using a class-defined \method{__cmp__} method
|
||||
\item sort using a sort function object
|
||||
\end{itemize}
|
||||
|
||||
\end{document}
|
||||
% LocalWords: maxint
|
765
Doc/howto/unicode.rst
Normal file
|
@ -0,0 +1,765 @@
|
|||
Unicode HOWTO
|
||||
================
|
||||
|
||||
**Version 1.02**
|
||||
|
||||
This HOWTO discusses Python's support for Unicode, and explains various
|
||||
problems that people commonly encounter when trying to work with Unicode.
|
||||
|
||||
Introduction to Unicode
|
||||
------------------------------
|
||||
|
||||
History of Character Codes
|
||||
''''''''''''''''''''''''''''''
|
||||
|
||||
In 1968, the American Standard Code for Information Interchange,
|
||||
better known by its acronym ASCII, was standardized. ASCII defined
|
||||
numeric codes for various characters, with the numeric values running from 0 to
|
||||
127. For example, the lowercase letter 'a' is assigned 97 as its code
|
||||
value.
|
||||
|
||||
ASCII was an American-developed standard, so it only defined
|
||||
unaccented characters. There was an 'e', but no 'é' or 'Í'. This
|
||||
meant that languages which required accented characters couldn't be
|
||||
faithfully represented in ASCII. (Actually the missing accents matter
|
||||
for English, too, which contains words such as 'naïve' and 'café', and some
|
||||
publications have house styles which require spellings such as
|
||||
'coöperate'.)
|
||||
|
||||
For a while people just wrote programs that didn't display accents. I
|
||||
remember looking at Apple ][ BASIC programs, published in French-language
|
||||
publications in the mid-1980s, that had lines like these::
|
||||
|
||||
    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."
|
||||
|
||||
Those messages should contain accents, and they just look wrong to
|
||||
someone who can read French.
|
||||
|
||||
In the 1980s, almost all personal computers were 8-bit, meaning that
|
||||
bytes could hold values ranging from 0 to 255. ASCII codes only went
|
||||
up to 127, so some machines assigned values between 128 and 255 to
|
||||
accented characters. Different machines had different codes, however,
|
||||
which led to problems exchanging files. Eventually various commonly
|
||||
used sets of values for the 128-255 range emerged. Some were true
|
||||
standards, defined by the International Organization for Standardization, and
some were *de facto* conventions that were invented by one company
|
||||
or another and managed to catch on.
|
||||
|
||||
256 characters aren't very many.  For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.
|
||||
|
||||
You could write files using different codes (all your Russian
|
||||
files in a coding system called KOI8, all your French files in
|
||||
a different coding system called Latin1), but what if you wanted
|
||||
to write a French document that quotes some Russian text? In the
|
||||
1980s people began to want to solve this problem, and the Unicode
|
||||
standardization effort began.
|
||||
|
||||
Unicode started out using 16-bit characters instead of 8-bit characters. 16
|
||||
bits means you have 2^16 = 65,536 distinct values available, making it
|
||||
possible to represent many different characters from many different
|
||||
alphabets; an initial goal was to have Unicode contain the alphabets for
|
||||
every single human language. It turns out that even 16 bits isn't enough to
|
||||
meet that goal, and the modern Unicode specification uses a wider range of
|
||||
codes, 0-1,114,111 (0x10ffff in base-16).
|
||||
|
||||
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
|
||||
originally separate efforts, but the specifications were merged with
|
||||
the 1.1 revision of Unicode.
|
||||
|
||||
(This discussion of Unicode's history is highly simplified. I don't
|
||||
think the average Python programmer needs to worry about the
|
||||
historical details; consult the Unicode consortium site listed in the
|
||||
References for more information.)
|
||||
|
||||
|
||||
Definitions
|
||||
''''''''''''''''''''''''
|
||||
|
||||
A **character** is the smallest possible component of a text. 'A',
|
||||
'B', 'C', etc., are all different characters. So are 'È' and
|
||||
'Í'. Characters are abstractions, and vary depending on the
|
||||
language or context you're talking about. For example, the symbol for
|
||||
ohms (Ω) is usually drawn much like the capital letter
|
||||
omega (Ω) in the Greek alphabet (they may even be the same in
|
||||
some fonts), but these are two different characters that have
|
||||
different meanings.
|
||||
|
||||
The Unicode standard describes how characters are represented by
|
||||
**code points**. A code point is an integer value, usually denoted in
|
||||
base 16. In the standard, a code point is written using the notation
|
||||
U+12ca to mean the character with value 0x12ca (4810 decimal). The
|
||||
Unicode standard contains a lot of tables listing characters and their
|
||||
corresponding code points::
|
||||
|
||||
    0061    'a'; LATIN SMALL LETTER A
    0062    'b'; LATIN SMALL LETTER B
    0063    'c'; LATIN SMALL LETTER C
    ...
    007B    '{'; LEFT CURLY BRACKET
|
||||
|
||||
Strictly, these definitions imply that it's meaningless to say 'this is
|
||||
character U+12ca'. U+12ca is a code point, which represents some particular
|
||||
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
|
||||
In informal contexts, this distinction between code points and characters will
|
||||
sometimes be forgotten.
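
As a rough illustration of the notation, the following sketch moves between
a character, its code point, and its official name using the standard
``unicodedata`` module.  It is written to run unchanged on Python 2.6+ and
Python 3, which is an assumption beyond the Python 2.4 sessions shown
elsewhere in this document; the variable names are only illustrative.

```python
import unicodedata

ch = u'\u12ca'                      # the character at code point U+12CA
code_point = ord(ch)                # integer value of the code point
notation = 'U+%04x' % code_point    # the standard's U+xxxx notation
name = unicodedata.name(ch)         # the character's official name

print(code_point)    # 4810
print(notation)      # U+12ca
print(name)          # ETHIOPIC SYLLABLE WI
```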
|
||||
|
||||
A character is represented on a screen or on paper by a set of graphical
|
||||
elements that's called a **glyph**. The glyph for an uppercase A, for
|
||||
example, is two diagonal strokes and a horizontal stroke, though the exact
|
||||
details will depend on the font being used. Most Python code doesn't need
|
||||
to worry about glyphs; figuring out the correct glyph to display is
|
||||
generally the job of a GUI toolkit or a terminal's font renderer.
|
||||
|
||||
|
||||
Encodings
|
||||
'''''''''
|
||||
|
||||
To summarize the previous section:
|
||||
a Unicode string is a sequence of code points, which are
|
||||
numbers from 0 to 0x10ffff. This sequence needs to be represented as
|
||||
a sequence of bytes (meaning, values from 0-255) in memory.  The rules for
|
||||
translating a Unicode string into a sequence of bytes are called an
|
||||
**encoding**.
|
||||
|
||||
The first encoding you might think of is an array of 32-bit integers.
|
||||
In this representation, the string "Python" would look like this::
|
||||
|
||||
       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
||||
|
||||
This representation is straightforward but using
|
||||
it presents a number of problems.
|
||||
|
||||
1. It's not portable; different processors order the bytes
|
||||
differently.
|
||||
|
||||
2. It's very wasteful of space. In most texts, the majority of the code
|
||||
points are less than 128, or less than 256, so a lot of space is occupied
|
||||
by zero bytes. The above string takes 24 bytes compared to the 6
|
||||
bytes needed for an ASCII representation. Increased RAM usage doesn't
|
||||
matter too much (desktop computers have megabytes of RAM, and strings
|
||||
aren't usually that large), but expanding our usage of disk and
|
||||
network bandwidth by a factor of 4 is intolerable.
|
||||
|
||||
3. It's not compatible with existing C functions such as ``strlen()``,
|
||||
so a new family of wide string functions would need to be used.
|
||||
|
||||
4. Many Internet standards are defined in terms of textual data, and
|
||||
can't handle content with embedded zero bytes.
|
||||
|
||||
Generally people don't use this encoding, choosing other encodings
|
||||
that are more efficient and convenient.
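
The size and byte-order problems are easy to see for yourself.  The sketch
below uses the ``utf-32-le`` and ``utf-32-be`` codecs as stand-ins for "an
array of 32-bit integers"; those codecs only exist in Python 2.6 and later,
so this is an illustration under that assumption rather than something the
Python 2.4 described here can run.

```python
text = u'Python'

little = text.encode('utf-32-le')   # least significant byte first
big = text.encode('utf-32-be')      # most significant byte first
ascii_form = text.encode('ascii')

print(len(little))              # 24: four bytes per code point
print(len(ascii_form))          # 6: one byte per code point
print(little == big)            # False: same text, different byte order
print(little.count(b'\x00'))    # 18 of the 24 bytes are zero
```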
|
||||
|
||||
Encodings don't have to handle every possible Unicode character, and
|
||||
most encodings don't. For example, Python's default encoding is the
|
||||
'ascii' encoding. The rules for converting a Unicode string into the
|
||||
ASCII encoding are simple; for each code point:
|
||||
|
||||
1. If the code point is <128, each byte is the same as the value of the
|
||||
code point.
|
||||
|
||||
2. If the code point is 128 or greater, the Unicode string can't
|
||||
be represented in this encoding. (Python raises a
|
||||
``UnicodeEncodeError`` exception in this case.)
|
||||
|
||||
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
|
||||
code points 0-255 are identical to the Latin-1 values, so converting
|
||||
to this encoding simply requires converting code points to byte
|
||||
values; if a code point larger than 255 is encountered, the string
|
||||
can't be encoded into Latin-1.
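
Both sets of rules are simple enough to write out by hand.  The following
is only a sketch: ``encode_single_byte`` is a hypothetical helper, and it
raises a plain ``ValueError`` where Python's own codecs raise
``UnicodeEncodeError``.  It is written to run on Python 2.6+ and Python 3.

```python
def encode_single_byte(text, limit):
    """Encode text using one byte per code point (hypothetical helper).

    limit is 128 for the ASCII rules and 256 for the Latin-1 rules.
    """
    result = bytearray()
    for position, ch in enumerate(text):
        code_point = ord(ch)
        if code_point >= limit:
            # Corresponds to the 'strict' behaviour described above.
            raise ValueError('code point %d at position %d is out of range'
                             % (code_point, position))
        result.append(code_point)   # the byte equals the code point
    return bytes(result)
```

Calling ``encode_single_byte(u'abc', 128)`` should give the same bytes as
``u'abc'.encode('ascii')``, and passing 256 as the limit should match the
``'latin-1'`` codec.  In real code you would use the codecs, not this helper.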
|
||||
|
||||
Encodings don't have to be simple one-to-one mappings like Latin-1.
|
||||
Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
|
||||
values weren't in one block: 'a' through 'i' had values from 129 to
|
||||
137, but 'j' through 'r' were 145 through 153. If you wanted to use
|
||||
EBCDIC as an encoding, you'd probably use some sort of lookup table to
|
||||
perform the conversion, but this is largely an internal detail.
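
You can check the gap in the letter values with one of the EBCDIC codecs
that ships with Python.  I've picked ``cp500`` here as an example; other
EBCDIC variants exist, and the choice is mine rather than anything the
discussion above depends on.

```python
lowercase = u'abcdefghijklmnopqr'
values = list(bytearray(lowercase.encode('cp500')))

print(values[:9])    # 'a' through 'i': 129 through 137
print(values[9:])    # 'j' through 'r': 145 through 153
```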
|
||||
|
||||
UTF-8 is one of the most commonly used encodings. UTF stands for
|
||||
"Unicode Transformation Format", and the '8' means that 8-bit numbers
|
||||
are used in the encoding. (There's also a UTF-16 encoding, but it's
|
||||
less frequently used than UTF-8.) UTF-8 uses the following rules:
|
||||
|
||||
1. If the code point is <128, it's represented by the corresponding byte value.
|
||||
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
|
||||
between 128 and 255.
|
||||
3. Code points >0x7ff are turned into three- or four-byte sequences, where
|
||||
each byte of the sequence is between 128 and 255.
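
The three rules can be sketched as a small function.  The exact bit layout
used below (lead bytes built from ``0xC0``, ``0xE0``, and ``0xF0``, followed
by continuation bytes that each carry six bits) comes from the UTF-8
definition rather than from the summary above, and ``utf8_bytes`` is a
hypothetical name.  The sketch does no validation, for instance of surrogate
code points; in real code you would simply call ``.encode('utf-8')``.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 byte values for one code point as a list of ints."""
    if code_point < 0x80:
        # Rule 1: the byte is the code point itself.
        return [code_point]
    if code_point < 0x800:
        # Rule 2: two bytes, both in the range 128-255.
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point < 0x10000:
        # Rule 3, three-byte form.
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # Rule 3, four-byte form (code points up to 0x10ffff).
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]

print(utf8_bytes(0x61))      # [97]
print(utf8_bytes(0xA000))    # [234, 128, 128]
```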
|
||||
|
||||
UTF-8 has several convenient properties:
|
||||
|
||||
1. It can handle any Unicode code point.
|
||||
2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent through protocols that can't handle zero bytes.
|
||||
3. A string of ASCII text is also valid UTF-8 text.
|
||||
4. UTF-8 is fairly compact; code points less than 128 occupy only a single byte, and code points up to 0x7ff, which cover most accented Latin, Greek, and Cyrillic letters, take two bytes.
|
||||
5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize. It's also unlikely that random 8-bit data will look like valid UTF-8.
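
A few of these properties can be checked directly.  This is a small sketch
with made-up sample strings, written to run on Python 2.6+ and Python 3; it
demonstrates the properties on these inputs and is not a proof of them.

```python
ascii_text = u'Python'
mixed_text = u'na\u00efve caf\u00e9 \u20ac'

# Property 3: ASCII text encodes to the same bytes either way.
same_as_ascii = ascii_text.encode('utf-8') == ascii_text.encode('ascii')

# Property 2: no zero bytes appear in the encoded form of this text.
encoded = mixed_text.encode('utf-8')
has_zero_byte = b'\x00' in encoded

# Property 5: arbitrary 8-bit data is usually rejected by the decoder.
try:
    b'\xff\xfe\xfd'.decode('utf-8')
    random_data_is_valid = True
except UnicodeDecodeError:
    random_data_is_valid = False

print(same_as_ascii)           # True
print(has_zero_byte)           # False
print(random_data_is_valid)    # False
```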
|
||||
|
||||
|
||||
|
||||
References
|
||||
''''''''''''''
|
||||
|
||||
The Unicode Consortium site at <http://www.unicode.org> has character
|
||||
charts, a glossary, and PDF versions of the Unicode specification. Be
|
||||
prepared for some difficult reading.
|
||||
<http://www.unicode.org/history/> is a chronology of the origin and
|
||||
development of Unicode.
|
||||
|
||||
To help understand the standard, Jukka Korpela has written an
|
||||
introductory guide to reading the Unicode character tables,
|
||||
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
|
||||
|
||||
Roman Czyborra wrote another explanation of Unicode's basic principles;
|
||||
it's at <http://czyborra.com/unicode/characters.html>.
|
||||
Czyborra has written a number of other Unicode-related documents,
available from <http://www.czyborra.com>.
|
||||
|
||||
Two other good introductory articles were written by Joel Spolsky
|
||||
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
|
||||
Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
|
||||
introduction didn't make things clear to you, you should try reading
|
||||
one of these alternate articles before continuing.
|
||||
|
||||
Wikipedia entries are often helpful; see the entries for "character
|
||||
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
|
||||
<http://en.wikipedia.org/wiki/UTF-8>, for example.
|
||||
|
||||
|
||||
Python's Unicode Support
|
||||
------------------------
|
||||
|
||||
Now that you've learned the rudiments of Unicode, we can look at
|
||||
Python's Unicode features.
|
||||
|
||||
|
||||
The Unicode Type
|
||||
'''''''''''''''''''
|
||||
|
||||
Unicode strings are expressed as instances of the ``unicode`` type,
|
||||
one of Python's repertoire of built-in types. It derives from an
|
||||
abstract type called ``basestring``, which is also an ancestor of the
|
||||
``str`` type; you can therefore check if a value is a string type with
|
||||
``isinstance(value, basestring)``. Under the hood, Python represents
|
||||
Unicode strings as either 16- or 32-bit integers, depending on how the
|
||||
Python interpreter was compiled.
|
||||
|
||||
The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
|
||||
All of its arguments should be 8-bit strings. The first argument is converted
|
||||
to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
|
||||
the ASCII encoding is used for the conversion, so characters greater than 127 will
|
||||
be treated as errors::
|
||||
|
||||
    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                        ordinal not in range(128)
|
||||
|
||||
The ``errors`` argument specifies the response when the input string can't be converted according to the encoding's rules. Legal values for this argument
|
||||
are 'strict' (raise a ``UnicodeDecodeError`` exception),
|
||||
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
|
||||
or 'ignore' (just leave the character out of the Unicode result).
|
||||
The following examples show the differences::
|
||||
|
||||
    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'
|
||||
|
||||
Encodings are specified as strings containing the encoding's name.
|
||||
Python 2.4 comes with roughly 100 different encodings; see the Python
|
||||
Library Reference at
|
||||
<http://docs.python.org/lib/standard-encodings.html> for a list. Some
|
||||
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
|
||||
and '8859' are all synonyms for the same encoding.
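
One way to confirm that several names refer to the same encoding is to look
each of them up in the codec registry and compare the canonical names.  This
sketch assumes Python 2.5 or later, where ``codecs.lookup()`` returns an
object with a ``name`` attribute; in Python 2.4 it returns a plain tuple.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']

# Each lookup returns the codec's registry entry; .name is its canonical name.
canonical = [codecs.lookup(alias).name for alias in aliases]
all_same = len(set(canonical)) == 1

print(canonical)
print(all_same)    # True
```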
|
||||
|
||||
One-character Unicode strings can also be created with the
|
||||
``unichr()`` built-in function, which takes integers and returns a
|
||||
Unicode string of length 1 that contains the corresponding code point.
|
||||
The reverse operation is the built-in ``ord()`` function that takes a
|
||||
one-character Unicode string and returns the code point value::
|
||||
|
||||
    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960
|
||||
|
||||
Instances of the ``unicode`` type have many of the same methods as
|
||||
the 8-bit string type for operations such as searching and formatting::
|
||||
|
||||
    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
|
||||
|
||||
Note that the arguments to these methods can be Unicode strings or 8-bit strings.
|
||||
8-bit strings will be converted to Unicode before carrying out the operation;
|
||||
Python's default ASCII encoding will be used, so characters greater than 127 will cause an exception::
|
||||
|
||||
    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1
|
||||
|
||||
Much Python code that operates on strings will therefore work with
|
||||
Unicode strings without requiring any changes to the code. (Input and
|
||||
output code needs more updating for Unicode; more on this later.)
|
||||
|
||||
Another important method is ``.encode([encoding], [errors='strict'])``,
|
||||
which returns an 8-bit string version of the
|
||||
Unicode string, encoded in the requested encoding. The ``errors``
|
||||
parameter is the same as the parameter of the ``unicode()``
|
||||
constructor, with one additional possibility; as well as 'strict',
|
||||
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
|
||||
uses XML's character references. The following example shows the
|
||||
different results::
|
||||
|
||||
    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'
|
||||
|
||||
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
|
||||
that interprets the string using the given encoding::
|
||||
|
||||
    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True
|
||||
|
||||
The low-level routines for registering and accessing the available
|
||||
encodings are found in the ``codecs`` module. However, the encoding
|
||||
and decoding functions returned by this module are usually more
|
||||
low-level than is comfortable, so I'm not going to describe the
|
||||
``codecs`` module here. If you need to implement a completely new
|
||||
encoding, you'll need to learn about the ``codecs`` module interfaces,
|
||||
but implementing encodings is a specialized task that also won't be
|
||||
covered here. Consult the Python documentation to learn more about
|
||||
this module.
|
||||
|
||||
The most commonly used part of the ``codecs`` module is the
|
||||
``codecs.open()`` function which will be discussed in the section
|
||||
on input and output.
|
||||
|
||||
|
||||
Unicode Literals in Python Source Code
|
||||
''''''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
In Python source code, Unicode literals are written as strings
|
||||
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
|
||||
code points can be written using the ``\u`` escape sequence, which is
|
||||
followed by four hex digits giving the code point. The ``\U`` escape
|
||||
sequence is similar, but expects 8 hex digits, not 4.
|
||||
|
||||
Unicode literals can also use the same escape sequences as 8-bit
|
||||
strings, including ``\x``, but ``\x`` only takes two hex digits so it
|
||||
can't express an arbitrary code point. Octal escapes can go up to
|
||||
U+01ff, which is octal 777.
|
||||
|
||||
::
|
||||
|
||||
    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768
|
||||
|
||||
Using escape sequences for code points greater than 127 is fine in
|
||||
small doses, but becomes an annoyance if you're using many accented
|
||||
characters, as you would in a program with messages in French or some
|
||||
other accent-using language. You can also assemble strings using the
|
||||
``unichr()`` built-in function, but this is even more tedious.
|
||||
|
||||
Ideally, you'd want to be able to write literals in your language's
|
||||
natural encoding. You could then edit Python source code with your
|
||||
favorite editor which would display the accented characters naturally,
|
||||
and have the right characters used at runtime.
|
||||
|
||||
Python supports writing Unicode literals in any encoding, but you have
|
||||
to declare the encoding being used. This is done by including a
|
||||
special comment as either the first or second line of the source
|
||||
file::
|
||||
|
||||
    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])
|
||||
|
||||
The syntax is inspired by Emacs's notation for specifying variables local to a file.
|
||||
Emacs supports many different variables, but Python only supports 'coding'.
|
||||
The ``-*-`` symbols indicate that the comment is special; within them,
|
||||
you must supply the name ``coding`` and the name of your chosen encoding,
|
||||
separated by ``':'``.
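
To make the rule concrete, here is a sketch of how a tool might look for
the declaration.  The regular expression is adapted from PEP 263, which also
accepts ``=`` as the separator; ``declared_encoding`` is a hypothetical
helper of my own, not something Python provides under that name, and the
interpreter's real check is done inside its tokenizer.

```python
import re

# Pattern adapted from PEP 263; the helper below is hypothetical.
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in source_lines[:2]:
        # Only comment lines can carry the declaration.
        if line.lstrip().startswith('#'):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
    return None

example = ['#!/usr/bin/env python',
           '# -*- coding: latin-1 -*-',
           "u = u'abc'"]
print(declared_encoding(example))    # latin-1
```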
|
||||
|
||||
If you don't include such a comment, the default encoding used will be
|
||||
ASCII. Versions of Python before 2.4 were Euro-centric and assumed
|
||||
Latin-1 as a default encoding for string literals; in Python 2.4,
|
||||
characters greater than 127 still work but result in a warning. For
|
||||
example, the following program has no encoding declaration::
|
||||
|
||||
    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])
|
||||
|
||||
When you run it with Python 2.4, it will output the following warning::
|
||||
|
||||
    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details
|
||||
|
||||
|
||||
Unicode Properties
|
||||
'''''''''''''''''''
|
||||
|
||||
The Unicode specification includes a database of information about
|
||||
code points. For each code point that's defined, the information
|
||||
includes the character's name, its category, the numeric value if
|
||||
applicable (Unicode has characters representing the Roman numerals and
|
||||
fractions such as one-third and four-fifths). There are also
|
||||
properties related to the code point's use in bidirectional text and
|
||||
other display-related properties.
|
||||
|
||||
The following program displays some information about several
|
||||
characters, and prints the numeric value of one particular character::
|
||||
|
||||
    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])
|
||||
|
||||
When run, this prints::
|
||||
|
||||
    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0
|
||||
|
||||
The category codes are abbreviations describing the nature of the
|
||||
character. These are grouped into categories such as "Letter",
|
||||
"Number", "Punctuation", or "Symbol", which in turn are broken up into
|
||||
subcategories. To take the codes from the above output, ``'Ll'``
|
||||
means "Letter, lowercase", ``'No'`` means "Number, other", ``'Mn'`` is
|
||||
"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
|
||||
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
|
||||
for a list of category codes.
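
Because the first letter of a category code identifies the major class,
grouping characters by class takes very little code.  The table below is
deliberately partial: it covers only the classes named above, and both
``MAJOR_CLASSES`` and ``describe`` are hypothetical names for this sketch.

```python
import unicodedata

# First letter of a category code -> major class (partial table).
MAJOR_CLASSES = {'L': 'Letter', 'N': 'Number', 'M': 'Mark',
                 'P': 'Punctuation', 'S': 'Symbol'}

def describe(ch):
    """Return (category code, major class) for a single character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES.get(code[0], 'Other')

print(describe(u'\u00e9'))    # ('Ll', 'Letter')
print(describe(u'\u0bf2'))    # ('No', 'Number')
```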
|
||||
|
||||
References
|
||||
''''''''''''''
|
||||
|
||||
The Unicode and 8-bit string types are described in the Python library
|
||||
reference at <http://docs.python.org/lib/typesseq.html>.
|
||||
|
||||
The documentation for the ``unicodedata`` module is at
|
||||
<http://docs.python.org/lib/module-unicodedata.html>.
|
||||
|
||||
The documentation for the ``codecs`` module is at
|
||||
<http://docs.python.org/lib/module-codecs.html>.
|
||||
|
||||
Marc-André Lemburg gave a presentation at EuroPython 2002
|
||||
titled "Python and Unicode". A PDF version of his slides
|
||||
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
|
||||
and is an excellent overview of the design of Python's Unicode features.
|
||||
|
||||
|
||||
Reading and Writing Unicode Data
|
||||
----------------------------------------
|
||||
|
||||
Once you've written some code that works with Unicode data, the next
|
||||
problem is input/output. How do you get Unicode strings into your
|
||||
program, and how do you convert Unicode into a form suitable for
|
||||
storage or transmission?
|
||||
|
||||
It's possible that you may not need to do anything depending on your
|
||||
input sources and output destinations; you should check whether the
|
||||
libraries used in your application support Unicode natively. XML
|
||||
parsers often return Unicode data, for example. Many relational
|
||||
databases also support Unicode-valued columns and can return Unicode
|
||||
values from an SQL query.
|
||||
|
||||
Unicode data is usually converted to a particular encoding before it
|
||||
gets written to disk or sent over a socket. It's possible to do all
|
||||
the work yourself: open a file, read an 8-bit string from it, and
|
||||
convert the string with ``unicode(str, encoding)``. However, the
|
||||
manual approach is not recommended.
|
||||
|
||||
One problem is the multi-byte nature of encodings; one Unicode
|
||||
character can be represented by several bytes. If you want to read
|
||||
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
|
||||
error-handling code to catch the case where only part of the bytes
|
||||
encoding a single Unicode character are read at the end of a chunk.
|
||||
One solution would be to read the entire file into memory and then
|
||||
perform the decoding, but that prevents you from working with files
|
||||
that are extremely large; if you need to read a 2 GB file, you need 2 GB
|
||||
of RAM. (More, really, since for at least a moment you'd need to have
|
||||
both the encoded string and its Unicode version in memory.)
|
||||
|
||||
The solution would be to use the low-level decoding interface to catch
|
||||
the case of partial coding sequences. The work of implementing this
|
||||
has already been done for you: the ``codecs`` module includes a
|
||||
version of the ``open()`` function that returns a file-like object
|
||||
that assumes the file's contents are in a specified encoding and
|
||||
accepts Unicode parameters for methods such as ``.read()`` and
|
||||
``.write()``.
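
To see both the problem and the kind of machinery that solves it, consider
the sketch below.  It splits a UTF-8 byte string in the middle of a
character, shows that decoding the first chunk on its own fails, and then
decodes the same chunks with an incremental decoder.  It assumes Python 2.5
or later, where ``codecs.getincrementaldecoder()`` is available; the sample
data and variable names are made up for illustration.

```python
import codecs

data = u'caf\u00e9'.encode('utf-8')    # 5 bytes; the last character takes two
first, second = data[:4], data[4:]     # split in the middle of that character

# Decoding a chunk that ends with a partial character fails.
try:
    first.decode('utf-8')
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# An incremental decoder holds the partial character until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(first), decoder.decode(second, True)]
text = u''.join(pieces)

print(split_ok)                # False
print(text == u'caf\u00e9')    # True
```

For ordinary file access you don't need to do this yourself; the wrapper
returned by ``codecs.open()`` takes care of it.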
|
||||
|
||||
The function's parameters are
|
||||
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be
|
||||
``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
|
||||
regular built-in ``open()`` function; add a ``'+'`` to
|
||||
update the file. ``buffering`` is similarly
|
||||
parallel to the standard function's parameter.
|
||||
``encoding`` is a string giving
|
||||
the encoding to use; if it's left as ``None``, a regular Python file
|
||||
object that accepts 8-bit strings is returned. Otherwise, a wrapper
|
||||
object is returned, and data written to or read from the wrapper
|
||||
object will be converted as needed. ``errors`` specifies the action
|
||||
for encoding errors and can be one of the usual values of 'strict',
|
||||
'ignore', and 'replace'.
|
||||
|
||||
Reading Unicode from a file is therefore simple::
|
||||
|
||||
    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)
|
||||
|
||||
It's also possible to open files in update mode,
|
||||
allowing both reading and writing::
|
||||
|
||||
    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()
|
||||
|
||||
Unicode character U+FEFF is used as a byte-order mark (BOM),
|
||||
and is often written as the first character of a file in order
|
||||
to assist with autodetection of the file's byte ordering.
|
||||
Some encodings, such as UTF-16, expect a BOM to be present at
|
||||
the start of a file; when such an encoding is used,
|
||||
the BOM will be automatically written as the first character
|
||||
and will be silently dropped when the file is read. There are
|
||||
variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
|
||||
for little-endian and big-endian encodings, that specify
|
||||
one particular byte ordering and don't
|
||||
skip the BOM.
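
The difference between the variants shows up when encoding and decoding
strings in memory, too.  The following sketch uses the BOM constants from
the ``codecs`` module and is written to run on Python 2.6+ and Python 3;
the variable names are only illustrative.

```python
import codecs

with_bom = u'abc'.encode('utf-16')          # BOM written automatically
without_bom = u'abc'.encode('utf-16-le')    # fixed byte order, no BOM

# The BOM's byte order depends on the platform, so accept either form.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with 'utf-16' silently drops the BOM...
round_trip = with_bom.decode('utf-16')

# ...but 'utf-16-le' keeps a leading BOM as the character U+FEFF.
kept = (codecs.BOM_UTF16_LE + without_bom).decode('utf-16-le')

print(starts_with_bom)         # True
print(round_trip == u'abc')    # True
print(kept == u'\ufeffabc')    # True
```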
|
||||
|
||||
|
||||
Unicode filenames
|
||||
'''''''''''''''''''''''''
|
||||
|
||||
Most of the operating systems in common use today support filenames
|
||||
that contain arbitrary Unicode characters. Usually this is
|
||||
implemented by converting the Unicode string into some encoding that
|
||||
varies depending on the system.  For example, Mac OS X uses UTF-8 while
|
||||
Windows uses a configurable encoding; on Windows, Python uses the name
|
||||
"mbcs" to refer to whatever the currently configured encoding is. On
|
||||
Unix systems, there will only be a filesystem encoding if you've set
|
||||
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
|
||||
the default encoding is ASCII.
|
||||
|
||||
The ``sys.getfilesystemencoding()`` function returns the encoding to
|
||||
use on your current system, in case you want to do the encoding
|
||||
manually, but there's not much reason to bother. When opening a file
|
||||
for reading or writing, you can usually just provide the Unicode
|
||||
string as the filename, and it will be automatically converted to the
|
||||
right encoding for you::
|
||||
|
||||
    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()
|
||||
|
||||
Functions in the ``os`` module such as ``os.stat()`` will also accept
|
||||
Unicode filenames.
|
||||
|
||||
``os.listdir()``, which returns filenames, raises an issue: should it
|
||||
return the Unicode version of filenames, or should it return 8-bit
|
||||
strings containing the encoded versions? ``os.listdir()`` will do
|
||||
both, depending on whether you provided the directory path as an 8-bit
|
||||
string or a Unicode string. If you pass a Unicode string as the path,
|
||||
filenames will be decoded using the filesystem's encoding and a list
|
||||
of Unicode strings will be returned, while passing an 8-bit path will
|
||||
return the 8-bit versions of the filenames. For example, assuming the
|
||||
default filesystem encoding is UTF-8, running the following program::
|
||||
|
||||
fn = u'filename\u4500abc'
|
||||
f = open(fn, 'w')
|
||||
f.close()
|
||||
|
||||
import os
|
||||
print os.listdir('.')
|
||||
print os.listdir(u'.')
|
||||
|
||||
will produce the following output::
|
||||
|
||||
amk:~$ python t.py
|
||||
['.svn', 'filename\xe4\x94\x80abc', ...]
|
||||
[u'.svn', u'filename\u4500abc', ...]
|
||||
|
||||
The first list contains UTF-8-encoded filenames, and the second list
|
||||
contains the Unicode versions.
|
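The same behaviour can be sketched in Python 3 (an assumption; the program above is Python 2), where ``bytes`` plays the role of the 8-bit string and ``str`` the role of the Unicode string. The helper name ``list_both_ways`` is hypothetical:

```python
import os
import tempfile

def list_both_ways(directory):
    """Return (text_names, byte_names) for the same directory."""
    # A str path yields str names; a bytes path yields bytes names.
    text_names = sorted(os.listdir(directory))
    byte_names = sorted(os.listdir(os.fsencode(directory)))
    return text_names, byte_names

with tempfile.TemporaryDirectory() as tmp:
    # An ASCII name keeps the sketch independent of the filesystem encoding.
    open(os.path.join(tmp, 'example.txt'), 'w').close()
    text_names, byte_names = list_both_ways(tmp)

print(text_names, byte_names)
```

The type of the path you pass in determines the type of every name you get back.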
||||
|
||||
|
||||
|
||||
Tips for Writing Unicode-aware Programs
|
||||
''''''''''''''''''''''''''''''''''''''''''''
|
||||
|
||||
This section provides some suggestions on writing software that
|
||||
deals with Unicode.
|
||||
|
||||
The most important tip is:
|
||||
|
||||
Software should only work with Unicode strings internally, decoding
|
||||
input as early as possible and converting to a particular encoding only on output.
|
||||
|
||||
If you attempt to write processing functions that accept both
|
||||
Unicode and 8-bit strings, you will find your program vulnerable to
|
||||
bugs wherever you combine the two different kinds of strings. Python's
|
||||
default encoding is ASCII, so whenever a byte with a value >127
|
||||
is in the 8-bit input data, you'll get a ``UnicodeDecodeError``
|
||||
because that byte can't be handled by the ASCII encoding.
|
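The failure can be reproduced in isolation. Python 2 performs the ASCII decode implicitly when the two string types are combined; the sketch below, written for Python 3 (an assumption), performs the same decode explicitly. ``decode_as_ascii`` is a hypothetical helper:

```python
def decode_as_ascii(data):
    """Try to decode 8-bit data as ASCII; return (text, error)."""
    try:
        return data.decode('ascii'), None
    except UnicodeDecodeError as exc:
        return None, exc

# Pure ASCII data decodes without trouble.
ok_text, ok_error = decode_as_ascii(b'cafe')

# The UTF-8 bytes for an accented 'e' are both >127 and cannot be decoded.
bad_text, bad_error = decode_as_ascii(b'caf\xc3\xa9')

print(ok_text, bad_error)
```

The exception object records the position of the first offending byte, which helps when tracking down where undecoded data entered the program.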
||||
|
||||
It's easy to miss such problems if you only test your software
|
||||
with data that doesn't contain any
|
||||
accented or other non-ASCII characters; everything will seem to work, but there's actually a bug in your
|
||||
program waiting for the first user who attempts to use characters >127.
|
||||
A second tip, therefore, is:
|
||||
|
||||
Include characters >127 and, even better, characters >255 in your
|
||||
test data.
|
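One way to act on this tip is to keep a small set of sample strings covering each range and run them through a round-trip check. This is only a sketch in Python 3 (an assumption); ``SAMPLES`` and ``roundtrips`` are hypothetical names:

```python
SAMPLES = [
    'plain ascii',        # all characters <= 127
    'caf\u00e9',          # contains a character > 127
    'filename\u4500abc',  # contains a character > 255
]

def roundtrips(text, encoding):
    """Return True if text survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

for sample in SAMPLES:
    print(repr(sample), roundtrips(sample, 'latin-1'), roundtrips(sample, 'utf-8'))
```

The sample with a character >255 is the only one that fails under Latin-1, which is why test data limited to accented Western European characters can still hide bugs.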
||||
|
||||
When using data coming from a web browser or some other untrusted source,
|
||||
a common technique is to check for illegal characters in a string
|
||||
before using the string in a generated command line or storing it in a
|
||||
database. If you're doing this, be careful to check
|
||||
the string once it's in the form that will be used or stored; it's
|
||||
possible for encodings to be used to disguise characters. This is especially
|
||||
true if the input data also specifies the encoding;
|
||||
many encodings leave the commonly checked-for characters alone,
|
||||
but Python includes some encodings such as ``'base64'``
|
||||
that modify every single character.
|
||||
|
||||
For example, let's say you have a content management system that takes a
|
||||
filename and its encoding, and you want to disallow paths with a '/' character.
|
||||
You might write this code::
|
||||
|
||||
def read_file(filename, encoding):
|
||||
    if '/' in filename:
|
||||
        raise ValueError("'/' not allowed in filenames")
|
||||
    unicode_name = filename.decode(encoding)
|
||||
    f = open(unicode_name, 'r')
|
||||
    # ... return contents of file ...
|
||||
|
||||
However, if an attacker could specify the ``'base64'`` encoding,
|
||||
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
|
||||
encoded form of the string ``'/etc/passwd'``, to read a
|
||||
system file. The above code looks for ``'/'`` characters
|
||||
in the encoded form and misses the dangerous character
|
||||
in the resulting decoded form.
|
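A safer ordering decodes first and checks the decoded form. The following is a minimal sketch in Python 3 (an assumption; the code above is Python 2), using ``codecs.decode()`` because Python 3 no longer accepts ``'base64'`` in ``bytes.decode()``. The name ``validate_filename`` is hypothetical, and treating the decoded bytes as UTF-8 is also an assumption:

```python
import codecs

def validate_filename(raw_name, encoding):
    """Decode first, then check the decoded form for '/'."""
    decoded = codecs.decode(raw_name, encoding)
    if isinstance(decoded, bytes):
        # Bytes-to-bytes codecs such as 'base64' return bytes in Python 3;
        # assume the result is UTF-8 text for this sketch.
        decoded = decoded.decode('utf-8')
    if '/' in decoded:
        raise ValueError("'/' not allowed in filenames")
    return decoded

print(validate_filename(b'notes.txt', 'utf-8'))

try:
    validate_filename(b'L2V0Yy9wYXNzd2Q=', 'base64')
except ValueError as exc:
    print('rejected:', exc)
```

Because the check runs on the string that will actually be passed to ``open()``, the base-64 disguise no longer gets through.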
||||
|
||||
References
|
||||
''''''''''''''
|
||||
|
||||
The PDF slides for Marc-André Lemburg's presentation "Writing
|
||||
Unicode-aware Applications in Python" are available at
|
||||
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
|
||||
and discuss questions of character encodings as well as how to
|
||||
internationalize and localize an application.
|
||||
|
||||
|
||||
Revision History and Acknowledgements
|
||||
------------------------------------------
|
||||
|
||||
Thanks to the following people who have noted errors or offered
|
||||
suggestions on this article: Nicholas Bastin,
|
||||
Marius Gedminas, Kent Johnson, Ken Krugler,
|
||||
Marc-André Lemburg, Martin von Löwis.
|
||||
|
||||
Version 1.0: posted August 5 2005.
|
||||
|
||||
Version 1.01: posted August 7 2005. Corrects factual and markup
|
||||
errors; adds several links.
|
||||
|
||||
Version 1.02: posted August 16 2005. Corrects factual errors.
|
||||
|
||||
|
||||
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
|
||||
.. comment Describe obscure -U switch somewhere?
|
||||
|
||||
.. comment
|
||||
Original outline:
|
||||
|
||||
- [ ] Unicode introduction
|
||||
- [ ] ASCII
|
||||
- [ ] Terms
|
||||
- [ ] Character
|
||||
- [ ] Code point
|
||||
- [ ] Encodings
|
||||
- [ ] Common encodings: ASCII, Latin-1, UTF-8
|
||||
- [ ] Unicode Python type
|
||||
- [ ] Writing unicode literals
|
||||
- [ ] Obscurity: -U switch
|
||||
- [ ] Built-ins
|
||||
- [ ] unichr()
|
||||
- [ ] ord()
|
||||
- [ ] unicode() constructor
|
||||
- [ ] Unicode type
|
||||
- [ ] encode(), decode() methods
|
||||
- [ ] Unicodedata module for character properties
|
||||
- [ ] I/O
|
||||
- [ ] Reading/writing Unicode data into files
|
||||
- [ ] Byte-order marks
|
||||
- [ ] Unicode filenames
|
||||
- [ ] Writing Unicode programs
|
||||
- [ ] Do everything in Unicode
|
||||
- [ ] Declaring source code encodings (PEP 263)
|
||||
- [ ] Other issues
|
||||
- [ ] Building Python (UCS2, UCS4)
|