Move the 3k reST doc tree in place.

This commit is contained in:
Georg Brandl 2007-08-15 14:28:22 +00:00
parent 739c01d47b
commit 116aa62bf5
423 changed files with 131199 additions and 0 deletions

356
Doc/howto/advocacy.rst Normal file
View file

@ -0,0 +1,356 @@
*************************
Python Advocacy HOWTO
*************************
:Author: A.M. Kuchling
:Release: 0.03
.. topic:: Abstract
It's usually difficult to get your management to accept open source software,
and Python is no exception to this rule. This document discusses reasons to use
Python, strategies for winning acceptance, facts and arguments you can use, and
cases where you *shouldn't* try to use Python.
Reasons to Use Python
=====================
There are several reasons to incorporate a scripting language into your
development process, and this section will discuss them, and why Python has some
properties that make it a particularly good choice.
Programmability
---------------
Programs are often organized in a modular fashion. Lower-level operations are
grouped together, and called by higher-level functions, which may in turn be
used as basic operations by still further upper levels.
For example, the lowest level might define a very low-level set of functions for
accessing a hash table. The next level might use hash tables to store the
headers of a mail message, mapping a header name like ``Date`` to a value such
as ``Tue, 13 May 1997 20:00:54 -0400``. A yet higher level may operate on
message objects, without knowing or caring that message headers are stored in a
hash table, and so forth.
Often, the lowest levels do very simple things; they implement a data structure
such as a binary tree or hash table, or they perform some simple computation,
such as converting a date string to a number. The higher levels then contain
logic connecting these primitive operations. Using the approach, the primitives
can be seen as basic building blocks which are then glued together to produce
the complete product.
Why is this design approach relevant to Python? Because Python is well suited
to functioning as such a glue language. A common approach is to write a Python
module that implements the lower level operations; for the sake of speed, the
implementation might be in C, Java, or even Fortran. Once the primitives are
available to Python programs, the logic underlying higher level operations is
written in the form of Python code. The high-level logic is then more
understandable, and easier to modify.
John Ousterhout wrote a paper that explains this idea at greater length,
entitled "Scripting: Higher Level Programming for the 21st Century". I
recommend that you read this paper; see the references for the URL. Ousterhout
is the inventor of the Tcl language, and therefore argues that Tcl should be
used for this purpose; he only briefly refers to other languages such as Python,
Perl, and Lisp/Scheme, but in reality, Ousterhout's argument applies to
scripting languages in general, since you could equally write extensions for any
of the languages mentioned above.
Prototyping
-----------
In *The Mythical Man-Month*, Fredrick Brooks suggests the following rule when
planning software projects: "Plan to throw one away; you will anyway." Brooks
is saying that the first attempt at a software design often turns out to be
wrong; unless the problem is very simple or you're an extremely good designer,
you'll find that new requirements and features become apparent once development
has actually started. If these new requirements can't be cleanly incorporated
into the program's structure, you're presented with two unpleasant choices:
hammer the new features into the program somehow, or scrap everything and write
a new version of the program, taking the new features into account from the
beginning.
Python provides you with a good environment for quickly developing an initial
prototype. That lets you get the overall program structure and logic right, and
you can fine-tune small details in the fast development cycle that Python
provides. Once you're satisfied with the GUI interface or program output, you
can translate the Python code into C++, Fortran, Java, or some other compiled
language.
Prototyping means you have to be careful not to use too many Python features
that are hard to implement in your other language. Using ``eval()``, or regular
expressions, or the :mod:`pickle` module, means that you're going to need C or
Java libraries for formula evaluation, regular expressions, and serialization,
for example. But it's not hard to avoid such tricky code, and in the end the
translation usually isn't very difficult. The resulting code can be rapidly
debugged, because any serious logical errors will have been removed from the
prototype, leaving only more minor slip-ups in the translation to track down.
This strategy builds on the earlier discussion of programmability. Using Python
as glue to connect lower-level components has obvious relevance for constructing
prototype systems. In this way Python can help you with development, even if
end users never come in contact with Python code at all. If the performance of
the Python version is adequate and corporate politics allow it, you may not need
to do a translation into C or Java, but it can still be faster to develop a
prototype and then translate it, instead of attempting to produce the final
version immediately.
One example of this development strategy is Microsoft Merchant Server. Version
1.0 was written in pure Python, by a company that subsequently was purchased by
Microsoft. Version 2.0 began to translate the code into C++, shipping with some
C++code and some Python code. Version 3.0 didn't contain any Python at all; all
the code had been translated into C++. Even though the product doesn't contain
a Python interpreter, the Python language has still served a useful purpose by
speeding up development.
This is a very common use for Python. Past conference papers have also
described this approach for developing high-level numerical algorithms; see
David M. Beazley and Peter S. Lomdahl's paper "Feeding a Large-scale Physics
Application to Python" in the references for a good example. If an algorithm's
basic operations are things like "Take the inverse of this 4000x4000 matrix",
and are implemented in some lower-level language, then Python has almost no
additional performance cost; the extra time required for Python to evaluate an
expression like ``m.invert()`` is dwarfed by the cost of the actual computation.
It's particularly good for applications where seemingly endless tweaking is
required to get things right. GUI interfaces and Web sites are prime examples.
The Python code is also shorter and faster to write (once you're familiar with
Python), so it's easier to throw it away if you decide your approach was wrong;
if you'd spent two weeks working on it instead of just two hours, you might
waste time trying to patch up what you've got out of a natural reluctance to
admit that those two weeks were wasted. Truthfully, those two weeks haven't
been wasted, since you've learnt something about the problem and the technology
you're using to solve it, but it's human nature to view this as a failure of
some sort.
Simplicity and Ease of Understanding
------------------------------------
Python is definitely *not* a toy language that's only usable for small tasks.
The language features are general and powerful enough to enable it to be used
for many different purposes. It's useful at the small end, for 10- or 20-line
scripts, but it also scales up to larger systems that contain thousands of lines
of code.
However, this expressiveness doesn't come at the cost of an obscure or tricky
syntax. While Python has some dark corners that can lead to obscure code, there
are relatively few such corners, and proper design can isolate their use to only
a few classes or modules. It's certainly possible to write confusing code by
using too many features with too little concern for clarity, but most Python
code can look a lot like a slightly-formalized version of human-understandable
pseudocode.
In *The New Hacker's Dictionary*, Eric S. Raymond gives the following definition
for "compact":
.. epigraph::
Compact *adj.* Of a design, describes the valuable property that it can all be
apprehended at once in one's head. This generally means the thing created from
the design can be used with greater facility and fewer errors than an equivalent
tool that is not compact. Compactness does not imply triviality or lack of
power; for example, C is compact and FORTRAN is not, but C is more powerful than
FORTRAN. Designs become non-compact through accreting features and cruft that
don't merge cleanly into the overall design scheme (thus, some fans of Classic C
maintain that ANSI C is no longer compact).
(From http://www.catb.org/ esr/jargon/html/C/compact.html)
In this sense of the word, Python is quite compact, because the language has
just a few ideas, which are used in lots of places. Take namespaces, for
example. Import a module with ``import math``, and you create a new namespace
called ``math``. Classes are also namespaces that share many of the properties
of modules, and have a few of their own; for example, you can create instances
of a class. Instances? They're yet another namespace. Namespaces are currently
implemented as Python dictionaries, so they have the same methods as the
standard dictionary data type: .keys() returns all the keys, and so forth.
This simplicity arises from Python's development history. The language syntax
derives from different sources; ABC, a relatively obscure teaching language, is
one primary influence, and Modula-3 is another. (For more information about ABC
and Modula-3, consult their respective Web sites at http://www.cwi.nl/
steven/abc/ and http://www.m3.org.) Other features have come from C, Icon,
Algol-68, and even Perl. Python hasn't really innovated very much, but instead
has tried to keep the language small and easy to learn, building on ideas that
have been tried in other languages and found useful.
Simplicity is a virtue that should not be underestimated. It lets you learn the
language more quickly, and then rapidly write code, code that often works the
first time you run it.
Java Integration
----------------
If you're working with Java, Jython (http://www.jython.org/) is definitely worth
your attention. Jython is a re-implementation of Python in Java that compiles
Python code into Java bytecodes. The resulting environment has very tight,
almost seamless, integration with Java. It's trivial to access Java classes
from Python, and you can write Python classes that subclass Java classes.
Jython can be used for prototyping Java applications in much the same way
CPython is used, and it can also be used for test suites for Java code, or
embedded in a Java application to add scripting capabilities.
Arguments and Rebuttals
=======================
Let's say that you've decided upon Python as the best choice for your
application. How can you convince your management, or your fellow developers,
to use Python? This section lists some common arguments against using Python,
and provides some possible rebuttals.
**Python is freely available software that doesn't cost anything. How good can
it be?**
Very good, indeed. These days Linux and Apache, two other pieces of open source
software, are becoming more respected as alternatives to commercial software,
but Python hasn't had all the publicity.
Python has been around for several years, with many users and developers.
Accordingly, the interpreter has been used by many people, and has gotten most
of the bugs shaken out of it. While bugs are still discovered at intervals,
they're usually either quite obscure (they'd have to be, for no one to have run
into them before) or they involve interfaces to external libraries. The
internals of the language itself are quite stable.
Having the source code should be viewed as making the software available for
peer review; people can examine the code, suggest (and implement) improvements,
and track down bugs. To find out more about the idea of open source code, along
with arguments and case studies supporting it, go to http://www.opensource.org.
**Who's going to support it?**
Python has a sizable community of developers, and the number is still growing.
The Internet community surrounding the language is an active one, and is worth
being considered another one of Python's advantages. Most questions posted to
the comp.lang.python newsgroup are quickly answered by someone.
Should you need to dig into the source code, you'll find it's clear and
well-organized, so it's not very difficult to write extensions and track down
bugs yourself. If you'd prefer to pay for support, there are companies and
individuals who offer commercial support for Python.
**Who uses Python for serious work?**
Lots of people; one interesting thing about Python is the surprising diversity
of applications that it's been used for. People are using Python to:
* Run Web sites
* Write GUI interfaces
* Control number-crunching code on supercomputers
* Make a commercial application scriptable by embedding the Python interpreter
inside it
* Process large XML data sets
* Build test suites for C or Java code
Whatever your application domain is, there's probably someone who's used Python
for something similar. Yet, despite being useable for such high-end
applications, Python's still simple enough to use for little jobs.
See http://wiki.python.org/moin/OrganizationsUsingPython for a list of some of
the organizations that use Python.
**What are the restrictions on Python's use?**
They're practically nonexistent. Consult the :file:`Misc/COPYRIGHT` file in the
source distribution, or http://www.python.org/doc/Copyright.html for the full
language, but it boils down to three conditions.
* You have to leave the copyright notice on the software; if you don't include
the source code in a product, you have to put the copyright notice in the
supporting documentation.
* Don't claim that the institutions that have developed Python endorse your
product in any way.
* If something goes wrong, you can't sue for damages. Practically all software
licences contain this condition.
Notice that you don't have to provide source code for anything that contains
Python or is built with it. Also, the Python interpreter and accompanying
documentation can be modified and redistributed in any way you like, and you
don't have to pay anyone any licensing fees at all.
**Why should we use an obscure language like Python instead of well-known
language X?**
I hope this HOWTO, and the documents listed in the final section, will help
convince you that Python isn't obscure, and has a healthily growing user base.
One word of advice: always present Python's positive advantages, instead of
concentrating on language X's failings. People want to know why a solution is
good, rather than why all the other solutions are bad. So instead of attacking
a competing solution on various grounds, simply show how Python's virtues can
help.
Useful Resources
================
http://www.pythonology.com/success
The Python Success Stories are a collection of stories from successful users of
Python, with the emphasis on business and corporate users.
.. % \term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
.. % The first chapter of \emph{Internet Programming with Python} also
.. % examines some of the reasons for using Python. The book is well worth
.. % buying, but the publishers have made the first chapter available on
.. % the Web.
http://home.pacbell.net/ouster/scripting.html
John Ousterhout's white paper on scripting is a good argument for the utility of
scripting languages, though naturally enough, he emphasizes Tcl, the language he
developed. Most of the arguments would apply to any scripting language.
http://www.python.org/workshops/1997-10/proceedings/beazley.html
The authors, David M. Beazley and Peter S. Lomdahl, describe their use of
Python at Los Alamos National Laboratory. It's another good example of how
Python can help get real work done. This quotation from the paper has been
echoed by many people:
.. epigraph::
Originally developed as a large monolithic application for massively parallel
processing systems, we have used Python to transform our application into a
flexible, highly modular, and extremely powerful system for performing
simulation, data analysis, and visualization. In addition, we describe how
Python has solved a number of important problems related to the development,
debugging, deployment, and maintenance of scientific software.
http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html
This interview with Andy Feit, discussing Infoseek's use of Python, can be used
to show that choosing Python didn't introduce any difficulties into a company's
development process, and provided some substantial benefits.
.. % \term{\url{http://www.python.org/psa/Commercial.html}}
.. % Robin Friedrich wrote this document on how to support Python's use in
.. % commercial projects.
http://www.python.org/workshops/1997-10/proceedings/stein.ps
For the 6th Python conference, Greg Stein presented a paper that traced Python's
adoption and usage at a startup called eShop, and later at Microsoft.
http://www.opensource.org
Management may be doubtful of the reliability and usefulness of software that
wasn't written commercially. This site presents arguments that show how open
source software can have considerable advantages over closed-source software.
http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html
The Linux Advocacy mini-HOWTO was the inspiration for this document, and is also
well worth reading for general suggestions on winning acceptance for a new
technology, such as Linux or Python. In general, you won't make much progress
by simply attacking existing systems and complaining about their inadequacies;
this often ends up looking like unfocused whining. It's much better to point
out some of the many areas where Python is an improvement over other systems.

434
Doc/howto/curses.rst Normal file
View file

@ -0,0 +1,434 @@
**********************************
Curses Programming with Python
**********************************
:Author: A.M. Kuchling, Eric S. Raymond
:Release: 2.02
.. topic:: Abstract
This document describes how to write text-mode programs with Python 2.x, using
the :mod:`curses` extension module to control the display.
What is curses?
===============
The curses library supplies a terminal-independent screen-painting and
keyboard-handling facility for text-based terminals; such terminals include
VT100s, the Linux console, and the simulated terminal provided by X11 programs
such as xterm and rxvt. Display terminals support various control codes to
perform common operations such as moving the cursor, scrolling the screen, and
erasing areas. Different terminals use widely differing codes, and often have
their own minor quirks.
In a world of X displays, one might ask "why bother"? It's true that
character-cell display terminals are an obsolete technology, but there are
niches in which being able to do fancy things with them are still valuable. One
is on small-footprint or embedded Unixes that don't carry an X server. Another
is for tools like OS installers and kernel configurators that may have to run
before X is available.
The curses library hides all the details of different terminals, and provides
the programmer with an abstraction of a display, containing multiple
non-overlapping windows. The contents of a window can be changed in various
ways-- adding text, erasing it, changing its appearance--and the curses library
will automagically figure out what control codes need to be sent to the terminal
to produce the right output.
The curses library was originally written for BSD Unix; the later System V
versions of Unix from AT&T added many enhancements and new functions. BSD curses
is no longer maintained, having been replaced by ncurses, which is an
open-source implementation of the AT&T interface. If you're using an
open-source Unix such as Linux or FreeBSD, your system almost certainly uses
ncurses. Since most current commercial Unix versions are based on System V
code, all the functions described here will probably be available. The older
versions of curses carried by some proprietary Unixes may not support
everything, though.
No one has made a Windows port of the curses module. On a Windows platform, try
the Console module written by Fredrik Lundh. The Console module provides
cursor-addressable text output, plus full support for mouse and keyboard input,
and is available from http://effbot.org/efflib/console.
The Python curses module
------------------------
Thy Python module is a fairly simple wrapper over the C functions provided by
curses; if you're already familiar with curses programming in C, it's really
easy to transfer that knowledge to Python. The biggest difference is that the
Python interface makes things simpler, by merging different C functions such as
:func:`addstr`, :func:`mvaddstr`, :func:`mvwaddstr`, into a single
:meth:`addstr` method. You'll see this covered in more detail later.
This HOWTO is simply an introduction to writing text-mode programs with curses
and Python. It doesn't attempt to be a complete guide to the curses API; for
that, see the Python library guide's section on ncurses, and the C manual pages
for ncurses. It will, however, give you the basic ideas.
Starting and ending a curses application
========================================
Before doing anything, curses must be initialized. This is done by calling the
:func:`initscr` function, which will determine the terminal type, send any
required setup codes to the terminal, and create various internal data
structures. If successful, :func:`initscr` returns a window object representing
the entire screen; this is usually called ``stdscr``, after the name of the
corresponding C variable. ::
import curses
stdscr = curses.initscr()
Usually curses applications turn off automatic echoing of keys to the screen, in
order to be able to read keys and only display them under certain circumstances.
This requires calling the :func:`noecho` function. ::
curses.noecho()
Applications will also commonly need to react to keys instantly, without
requiring the Enter key to be pressed; this is called cbreak mode, as opposed to
the usual buffered input mode. ::
curses.cbreak()
Terminals usually return special keys, such as the cursor keys or navigation
keys such as Page Up and Home, as a multibyte escape sequence. While you could
write your application to expect such sequences and process them accordingly,
curses can do it for you, returning a special value such as
:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable
keypad mode. ::
stdscr.keypad(1)
Terminating a curses application is much easier than starting one. You'll need
to call ::
curses.nocbreak(); stdscr.keypad(0); curses.echo()
to reverse the curses-friendly terminal settings. Then call the :func:`endwin`
function to restore the terminal to its original operating mode. ::
curses.endwin()
A common problem when debugging a curses application is to get your terminal
messed up when the application dies without restoring the terminal to its
previous state. In Python this commonly happens when your code is buggy and
raises an uncaught exception. Keys are no longer be echoed to the screen when
you type them, for example, which makes using the shell difficult.
In Python you can avoid these complications and make debugging much easier by
importing the module :mod:`curses.wrapper`. It supplies a :func:`wrapper`
function that takes a callable. It does the initializations described above,
and also initializes colors if color support is present. It then runs your
provided callable and finally deinitializes appropriately. The callable is
called inside a try-catch clause which catches exceptions, performs curses
deinitialization, and then passes the exception upwards. Thus, your terminal
won't be left in a funny state on exception.
Windows and Pads
================
Windows are the basic abstraction in curses. A window object represents a
rectangular area of the screen, and supports various methods to display text,
erase it, allow the user to input strings, and so forth.
The ``stdscr`` object returned by the :func:`initscr` function is a window
object that covers the entire screen. Many programs may need only this single
window, but you might wish to divide the screen into smaller windows, in order
to redraw or clear them separately. The :func:`newwin` function creates a new
window of a given size, returning the new window object. ::
begin_x = 20 ; begin_y = 7
height = 5 ; width = 40
win = curses.newwin(height, width, begin_y, begin_x)
A word about the coordinate system used in curses: coordinates are always passed
in the order *y,x*, and the top-left corner of a window is coordinate (0,0).
This breaks a common convention for handling coordinates, where the *x*
coordinate usually comes first. This is an unfortunate difference from most
other computer applications, but it's been part of curses since it was first
written, and it's too late to change things now.
When you call a method to display or erase text, the effect doesn't immediately
show up on the display. This is because curses was originally written with slow
300-baud terminal connections in mind; with these terminals, minimizing the time
required to redraw the screen is very important. This lets curses accumulate
changes to the screen, and display them in the most efficient manner. For
example, if your program displays some characters in a window, and then clears
the window, there's no need to send the original characters because they'd never
be visible.
Accordingly, curses requires that you explicitly tell it to redraw windows,
using the :func:`refresh` method of window objects. In practice, this doesn't
really complicate programming with curses much. Most programs go into a flurry
of activity, and then pause waiting for a keypress or some other action on the
part of the user. All you have to do is to be sure that the screen has been
redrawn before pausing to wait for user input, by simply calling
``stdscr.refresh()`` or the :func:`refresh` method of some other relevant
window.
A pad is a special case of a window; it can be larger than the actual display
screen, and only a portion of it displayed at a time. Creating a pad simply
requires the pad's height and width, while refreshing a pad requires giving the
coordinates of the on-screen area where a subsection of the pad will be
displayed. ::
pad = curses.newpad(100, 100)
# These loops fill the pad with letters; this is
# explained in the next section
for y in range(0, 100):
for x in range(0, 100):
try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
except curses.error: pass
# Displays a section of the pad in the middle of the screen
pad.refresh( 0,0, 5,5, 20,75)
The :func:`refresh` call displays a section of the pad in the rectangle
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
that difference, pads are exactly like ordinary windows and support the same
methods.
If you have multiple windows and pads on screen there is a more efficient way to
go, which will prevent annoying screen flicker at refresh time. Use the
:meth:`noutrefresh` method of each window to update the data structure
representing the desired state of the screen; then change the physical screen to
match the desired state in one go with the function :func:`doupdate`. The
normal :meth:`refresh` method calls :func:`doupdate` as its last act.
Displaying Text
===============
From a C programmer's point of view, curses may sometimes look like a twisty
maze of functions, all subtly different. For example, :func:`addstr` displays a
string at the current cursor location in the ``stdscr`` window, while
:func:`mvaddstr` moves to a given y,x coordinate first before displaying the
string. :func:`waddstr` is just like :func:`addstr`, but allows specifying a
window to use, instead of using ``stdscr`` by default. :func:`mvwaddstr` follows
similarly.
Fortunately the Python interface hides all these details; ``stdscr`` is a window
object like any other, and methods like :func:`addstr` accept multiple argument
forms. Usually there are four different forms.
+---------------------------------+-----------------------------------------------+
| Form | Description |
+=================================+===============================================+
| *str* or *ch* | Display the string *str* or character *ch* at |
| | the current position |
+---------------------------------+-----------------------------------------------+
| *str* or *ch*, *attr* | Display the string *str* or character *ch*, |
| | using attribute *attr* at the current |
| | position |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and |
| | display *str* or *ch* |
+---------------------------------+-----------------------------------------------+
| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and |
| | display *str* or *ch*, using attribute *attr* |
+---------------------------------+-----------------------------------------------+
Attributes allow displaying text in highlighted forms, such as in boldface,
underline, reverse code, or in color. They'll be explained in more detail in
the next subsection.
The :func:`addstr` function takes a Python string as the value to be displayed,
while the :func:`addch` functions take a character, which can be either a Python
string of length 1 or an integer. If it's a string, you're limited to
displaying characters between 0 and 255. SVr4 curses provides constants for
extension characters; these constants are integers greater than 255. For
example, :const:`ACS_PLMINUS` is a +/- symbol, and :const:`ACS_ULCORNER` is the
upper left corner of a box (handy for drawing borders).
Windows remember where the cursor was left after the last operation, so if you
leave out the *y,x* coordinates, the string or character will be displayed
wherever the last operation left off. You can also move the cursor with the
``move(y,x)`` method. Because some terminals always display a flashing cursor,
you may want to ensure that the cursor is positioned in some location where it
won't be distracting; it can be confusing to have the cursor blinking at some
apparently random location.
If your application doesn't need a blinking cursor at all, you can call
``curs_set(0)`` to make it invisible. Equivalently, and for compatibility with
older curses versions, there's a ``leaveok(bool)`` function. When *bool* is
true, the curses library will attempt to suppress the flashing cursor, and you
won't need to worry about leaving it in odd locations.
Attributes and Color
--------------------
Characters can be displayed in different ways. Status lines in a text-based
application are commonly shown in reverse video; a text viewer may need to
highlight certain words. curses supports this by allowing you to specify an
attribute for each cell on the screen.
An attribute is a integer, each bit representing a different attribute. You can
try to display text with multiple attribute bits set, but curses doesn't
guarantee that all the possible combinations are available, or that they're all
visually distinct. That depends on the ability of the terminal being used, so
it's safest to stick to the most commonly available attributes, listed here.
+----------------------+--------------------------------------+
| Attribute | Description |
+======================+======================================+
| :const:`A_BLINK` | Blinking text |
+----------------------+--------------------------------------+
| :const:`A_BOLD` | Extra bright or bold text |
+----------------------+--------------------------------------+
| :const:`A_DIM` | Half bright text |
+----------------------+--------------------------------------+
| :const:`A_REVERSE` | Reverse-video text |
+----------------------+--------------------------------------+
| :const:`A_STANDOUT` | The best highlighting mode available |
+----------------------+--------------------------------------+
| :const:`A_UNDERLINE` | Underlined text |
+----------------------+--------------------------------------+
So, to display a reverse-video status line on the top line of the screen, you
could code::
stdscr.addstr(0, 0, "Current mode: Typing mode",
curses.A_REVERSE)
stdscr.refresh()
The curses library also supports color on those terminals that provide it, The
most common such terminal is probably the Linux console, followed by color
xterms.
To use color, you must call the :func:`start_color` function soon after calling
:func:`initscr`, to initialize the default color set (the
:func:`curses.wrapper.wrapper` function does this automatically). Once that's
done, the :func:`has_colors` function returns TRUE if the terminal in use can
actually display color. (Note: curses uses the American spelling 'color',
instead of the Canadian/British spelling 'colour'. If you're used to the
British spelling, you'll have to resign yourself to misspelling it for the sake
of these functions.)
The curses library maintains a finite number of color pairs, containing a
foreground (or text) color and a background color. You can get the attribute
value corresponding to a color pair with the :func:`color_pair` function; this
can be bitwise-OR'ed with other attributes such as :const:`A_REVERSE`, but
again, such combinations are not guaranteed to work on all terminals.
An example, which displays a line of text using color pair 1::
stdscr.addstr( "Pretty text", curses.color_pair(1) )
stdscr.refresh()
As I said before, a color pair consists of a foreground and background color.
:func:`start_color` initializes 8 basic colors when it activates color mode.
They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and
7:white. The curses module defines named constants for each of these colors:
:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth.
The ``init_pair(n, f, b)`` function changes the definition of color pair *n*, to
foreground color f and background color b. Color pair 0 is hard-wired to white
on black, and cannot be changed.
Let's put all this together. To change color 1 to red text on a white
background, you would call::
curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
When you change a color pair, any text already displayed using that color pair
will change to the new colors. You can also display new text in this color
with::
stdscr.addstr(0,0, "RED ALERT!", curses.color_pair(1) )
Very fancy terminals can change the definitions of the actual colors to a given
RGB value. This lets you change color 1, which is usually red, to purple or
blue or any other color you like. Unfortunately, the Linux console doesn't
support this, so I'm unable to try it out, and can't provide any examples. You
can check if your terminal can do this by calling :func:`can_change_color`,
which returns TRUE if the capability is there. If you're lucky enough to have
such a talented terminal, consult your system's man pages for more information.
User Input
==========
The curses library itself offers only very simple input mechanisms. Python's
support adds a text-input widget that makes up some of the lack.
The most common way to get input to a window is to use its :meth:`getch` method.
:meth:`getch` pauses and waits for the user to hit a key, displaying it if
:func:`echo` has been called earlier. You can optionally specify a coordinate
to which the cursor should be moved before pausing.
It's possible to change this behavior with the method :meth:`nodelay`. After
``nodelay(1)``, :meth:`getch` for the window becomes non-blocking and returns
``curses.ERR`` (a value of -1) when no input is ready. There's also a
:func:`halfdelay` function, which can be used to (in effect) set a timer on each
:meth:`getch`; if no input becomes available within the number of milliseconds
specified as the argument to :func:`halfdelay`, curses raises an exception.
The :meth:`getch` method returns an integer; if it's between 0 and 255, it
represents the ASCII code of the key pressed. Values greater than 255 are
special keys such as Page Up, Home, or the cursor keys. You can compare the
value returned to constants such as :const:`curses.KEY_PPAGE`,
:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. Usually the main loop of
your program will look something like this::
while 1:
c = stdscr.getch()
if c == ord('p'): PrintDocument()
elif c == ord('q'): break # Exit the while()
elif c == curses.KEY_HOME: x = y = 0
The :mod:`curses.ascii` module supplies ASCII class membership functions that
take either integer or 1-character-string arguments; these may be useful in
writing more readable tests for your command interpreters. It also supplies
conversion functions that take either integer or 1-character-string arguments
and return the same type. For example, :func:`curses.ascii.ctrl` returns the
control character corresponding to its argument.
There's also a method to retrieve an entire string, :const:`getstr()`. It isn't
used very often, because its functionality is quite limited; the only editing
keys available are the backspace key and the Enter key, which terminates the
string. It can optionally be limited to a fixed number of characters. ::
curses.echo() # Enable echoing of characters
# Get a 15-character string, with the cursor on the top line
s = stdscr.getstr(0,0, 15)
The Python :mod:`curses.textpad` module supplies something better. With it, you
can turn a window into a text box that supports an Emacs-like set of
keybindings. Various methods of :class:`Textbox` class support editing with
input validation and gathering the edit results either with or without trailing
spaces. See the library documentation on :mod:`curses.textpad` for the
details.
For More Information
====================
This HOWTO didn't cover some advanced topics, such as screen-scraping or
capturing mouse events from an xterm instance. But the Python library page for
the curses modules is now pretty complete. You should browse it next.
If you're in doubt about the detailed behavior of any of the ncurses entry
points, consult the manual pages for your curses implementation, whether it's
ncurses or a proprietary Unix vendor's. The manual pages will document any
quirks, and provide complete lists of all the functions, attributes, and
:const:`ACS_\*` characters available to you.
Because the curses API is so large, some functions aren't supported in the
Python interface, not because they're difficult to implement, but because no one
has needed them yet. Feel free to add them and then submit a patch. Also, we
don't yet have support for the menus or panels libraries associated with
ncurses; feel free to add that.
If you write an interesting little program, feel free to contribute it as
another demo. We can always use more of them!
The ncurses FAQ: http://dickey.his.com/ncurses/ncurses.faq.html

308
Doc/howto/doanddont.rst Normal file
View file

@ -0,0 +1,308 @@
************************************
Idioms and Anti-Idioms in Python
************************************
:Author: Moshe Zadka
This document is placed in the public doman.
.. topic:: Abstract
This document can be considered a companion to the tutorial. It shows how to use
Python, and even more importantly, how *not* to use Python.
Language Constructs You Should Not Use
======================================
While Python has relatively few gotchas compared to other languages, it still
has some constructs which are only useful in corner cases, or are plain
dangerous.
from module import \*
---------------------
Inside Function Definitions
^^^^^^^^^^^^^^^^^^^^^^^^^^^
``from module import *`` is *invalid* inside function definitions. While many
versions of Python do not check for the invalidity, it does not make it more
valid, no more then having a smart lawyer makes a man innocent. Do not use it
like that ever. Even in versions where it was accepted, it made the function
execution slower, because the compiler could not be certain which names are
local and which are global. In Python 2.1 this construct causes warnings, and
sometimes even errors.
At Module Level
^^^^^^^^^^^^^^^
While it is valid to use ``from module import *`` at module level it is usually
a bad idea. For one, this loses an important property Python otherwise has ---
you can know where each toplevel name is defined by a simple "search" function
in your favourite editor. You also open yourself to trouble in the future, if
some module grows additional functions or classes.
One of the most awful question asked on the newsgroup is why this code::
f = open("www")
f.read()
does not work. Of course, it works just fine (assuming you have a file called
"www".) But it does not work if somewhere in the module, the statement ``from os
import *`` is present. The :mod:`os` module has a function called :func:`open`
which returns an integer. While it is very useful, shadowing builtins is one of
its least useful properties.
Remember, you can never know for sure what names a module exports, so either
take what you need --- ``from module import name1, name2``, or keep them in the
module and access on a per-need basis --- ``import module;print module.name``.
When It Is Just Fine
^^^^^^^^^^^^^^^^^^^^
There are situations in which ``from module import *`` is just fine:
* The interactive prompt. For example, ``from math import *`` makes Python an
amazing scientific calculator.
* When extending a module in C with a module in Python.
* When the module advertises itself as ``from import *`` safe.
Unadorned :keyword:`exec` and friends
-------------------------------------
The word "unadorned" refers to the use without an explicit dictionary, in which
case those constructs evaluate code in the *current* environment. This is
dangerous for the same reasons ``from import *`` is dangerous --- it might step
over variables you are counting on and mess up things for the rest of your code.
Simply do not do that.
Bad examples::
>>> for name in sys.argv[1:]:
>>> exec "%s=1" % name
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> exec "s.%s=val" % var # invalid!
>>> exec(open("handler.py").read())
>>> handle()
Good examples::
>>> d = {}
>>> for name in sys.argv[1:]:
>>> d[name] = 1
>>> def func(s, **kw):
>>> for var, val in kw.items():
>>> setattr(s, var, val)
>>> d={}
>>> exec(open("handle.py").read(), d, d)
>>> handle = d['handle']
>>> handle()
from module import name1, name2
-------------------------------
This is a "don't" which is much weaker then the previous "don't"s but is still
something you should not do if you don't have good reasons to do that. The
reason it is usually bad idea is because you suddenly have an object which lives
in two seperate namespaces. When the binding in one namespace changes, the
binding in the other will not, so there will be a discrepancy between them. This
happens when, for example, one module is reloaded, or changes the definition of
a function at runtime.
Bad example::
# foo.py
a = 1
# bar.py
from foo import a
if something():
a = 2 # danger: foo.a != a
Good example::
# foo.py
a = 1
# bar.py
import foo
if something():
foo.a = 2
except:
-------
Python has the ``except:`` clause, which catches all exceptions. Since *every*
error in Python raises an exception, this makes many programming errors look
like runtime problems, and hinders the debugging process.
The following code shows a great example::
try:
foo = opne("file") # misspelled "open"
except:
sys.exit("could not open file!")
The second line triggers a :exc:`NameError` which is caught by the except
clause. The program will exit, and you will have no idea that this has nothing
to do with the readability of ``"file"``.
The example above is better written ::
try:
foo = opne("file") # will be changed to "open" as soon as we run it
except IOError:
sys.exit("could not open file")
There are some situations in which the ``except:`` clause is useful: for
example, in a framework when running callbacks, it is good not to let any
callback disturb the framework.
Exceptions
==========
Exceptions are a useful feature of Python. You should learn to raise them
whenever something unexpected occurs, and catch them only where you can do
something about them.
The following is a very popular anti-idiom ::
def get_status(file):
if not os.path.exists(file):
print "file not found"
sys.exit(1)
return open(file).readline()
Consider the case the file gets deleted between the time the call to
:func:`os.path.exists` is made and the time :func:`open` is called. That means
the last line will throw an :exc:`IOError`. The same would happen if *file*
exists but has no read permission. Since testing this on a normal machine on
existing and non-existing files make it seem bugless, that means in testing the
results will seem fine, and the code will get shipped. Then an unhandled
:exc:`IOError` escapes to the user, who has to watch the ugly traceback.
Here is a better way to do it. ::
def get_status(file):
try:
return open(file).readline()
except (IOError, OSError):
print "file not found"
sys.exit(1)
In this version, \*either\* the file gets opened and the line is read (so it
works even on flaky NFS or SMB connections), or the message is printed and the
application aborted.
Still, :func:`get_status` makes too many assumptions --- that it will only be
used in a short running script, and not, say, in a long running server. Sure,
the caller could do something like ::
try:
status = get_status(log)
except SystemExit:
status = None
So, try to make as few ``except`` clauses in your code --- those will usually be
a catch-all in the :func:`main`, or inside calls which should always succeed.
So, the best version is probably ::
def get_status(file):
return open(file).readline()
The caller can deal with the exception if it wants (for example, if it tries
several files in a loop), or just let the exception filter upwards to *its*
caller.
The last version is not very good either --- due to implementation details, the
file would not be closed when an exception is raised until the handler finishes,
and perhaps not at all in non-C implementations (e.g., Jython). ::
def get_status(file):
fp = open(file)
try:
return fp.readline()
finally:
fp.close()
Using the Batteries
===================
Every so often, people seem to be writing stuff in the Python library again,
usually poorly. While the occasional module has a poor interface, it is usually
much better to use the rich standard library and data types that come with
Python then inventing your own.
A useful module very few people know about is :mod:`os.path`. It always has the
correct path arithmetic for your operating system, and will usually be much
better then whatever you come up with yourself.
Compare::
# ugh!
return dir+"/"+file
# better
return os.path.join(dir, file)
More useful functions in :mod:`os.path`: :func:`basename`, :func:`dirname` and
:func:`splitext`.
There are also many useful builtin functions people seem not to be aware of for
some reason: :func:`min` and :func:`max` can find the minimum/maximum of any
sequence with comparable semantics, for example, yet many people write their own
:func:`max`/:func:`min`. Another highly useful function is :func:`reduce`. A
classical use of :func:`reduce` is something like ::
import sys, operator
nums = map(float, sys.argv[1:])
print reduce(operator.add, nums)/len(nums)
This cute little script prints the average of all numbers given on the command
line. The :func:`reduce` adds up all the numbers, and the rest is just some
pre- and postprocessing.
On the same note, note that :func:`float`, :func:`int` and :func:`long` all
accept arguments of type string, and so are suited to parsing --- assuming you
are ready to deal with the :exc:`ValueError` they raise.
Using Backslash to Continue Statements
======================================
Since Python treats a newline as a statement terminator, and since statements
are often more then is comfortable to put in one line, many people do::
if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
calculate_number(10, 20) != forbulate(500, 360):
pass
You should realize that this is dangerous: a stray space after the ``XXX`` would
make this line wrong, and stray spaces are notoriously hard to see in editors.
In this case, at least it would be a syntax error, but if the code was::
value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
+ calculate_number(10, 20)*forbulate(500, 360)
then it would just be subtly wrong.
It is usually much better to use the implicit continuation inside parenthesis:
This version is bulletproof::
value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
+ calculate_number(10, 20)*forbulate(500, 360))

1400
Doc/howto/functional.rst Normal file

File diff suppressed because it is too large Load diff

25
Doc/howto/index.rst Normal file
View file

@ -0,0 +1,25 @@
***************
Python HOWTOs
***************
Python HOWTOs are documents that cover a single, specific topic,
and attempt to cover it fairly completely. Modelled on the Linux
Documentation Project's HOWTO collection, this collection is an
effort to foster documentation that's more detailed than the
Python Library Reference.
Currently, the HOWTOs are:
.. toctree::
:maxdepth: 1
advocacy.rst
pythonmac.rst
curses.rst
doanddont.rst
functional.rst
regex.rst
sockets.rst
unicode.rst
urllib2.rst

202
Doc/howto/pythonmac.rst Normal file
View file

@ -0,0 +1,202 @@
.. _using-on-mac:
***************************
Using Python on a Macintosh
***************************
:Author: Bob Savage <bobsavage@mac.com>
Python on a Macintosh running Mac OS X is in principle very similar to Python on
any other Unix platform, but there are a number of additional features such as
the IDE and the Package Manager that are worth pointing out.
The Mac-specific modules are documented in :ref:`mac-specific-services`.
Python on Mac OS 9 or earlier can be quite different from Python on Unix or
Windows, but is beyond the scope of this manual, as that platform is no longer
supported, starting with Python 2.4. See http://www.cwi.nl/~jack/macpython for
installers for the latest 2.3 release for Mac OS 9 and related documentation.
.. _getting-osx:
Getting and Installing MacPython
================================
Mac OS X 10.4 comes with Python 2.3 pre-installed by Apple. However, you are
encouraged to install the most recent version of Python from the Python website
(http://www.python.org). A "universal binary" build of Python 2.5, which runs
natively on the Mac's new Intel and legacy PPC CPU's, is available there.
What you get after installing is a number of things:
* A :file:`MacPython 2.5` folder in your :file:`Applications` folder. In here
you find IDLE, the development environment that is a standard part of official
Python distributions; PythonLauncher, which handles double-clicking Python
scripts from the Finder; and the "Build Applet" tool, which allows you to
package Python scripts as standalone applications on your system.
* A framework :file:`/Library/Frameworks/Python.framework`, which includes the
Python executable and libraries. The installer adds this location to your shell
path. To uninstall MacPython, you can simply remove these three things. A
symlink to the Python executable is placed in /usr/local/bin/.
The Apple-provided build of Python is installed in
:file:`/System/Library/Frameworks/Python.framework` and :file:`/usr/bin/python`,
respectively. You should never modify or delete these, as they are
Apple-controlled and are used by Apple- or third-party software.
IDLE includes a help menu that allows you to access Python documentation. If you
are completely new to Python you should start reading the tutorial introduction
in that document.
If you are familiar with Python on other Unix platforms you should read the
section on running Python scripts from the Unix shell.
How to run a Python script
--------------------------
Your best way to get started with Python on Mac OS X is through the IDLE
integrated development environment, see section :ref:`ide` and use the Help menu
when the IDE is running.
If you want to run Python scripts from the Terminal window command line or from
the Finder you first need an editor to create your script. Mac OS X comes with a
number of standard Unix command line editors, :program:`vim` and
:program:`emacs` among them. If you want a more Mac-like editor,
:program:`BBEdit` or :program:`TextWrangler` from Bare Bones Software (see
http://www.barebones.com/products/bbedit/index.shtml) are good choices, as is
:program:`TextMate` (see http://macromates.com/). Other editors include
:program:`Gvim` (http://macvim.org) and :program:`Aquamacs`
(http://aquamacs.org).
To run your script from the Terminal window you must make sure that
:file:`/usr/local/bin` is in your shell search path.
To run your script from the Finder you have two options:
* Drag it to :program:`PythonLauncher`
* Select :program:`PythonLauncher` as the default application to open your
script (or any .py script) through the finder Info window and double-click it.
:program:`PythonLauncher` has various preferences to control how your script is
launched. Option-dragging allows you to change these for one invocation, or use
its Preferences menu to change things globally.
.. _osx-gui-scripts:
Running scripts with a GUI
--------------------------
With older versions of Python, there is one Mac OS X quirk that you need to be
aware of: programs that talk to the Aqua window manager (in other words,
anything that has a GUI) need to be run in a special way. Use :program:`pythonw`
instead of :program:`python` to start such scripts.
With Python 2.5, you can use either :program:`python` or :program:`pythonw`.
Configuration
-------------
Python on OS X honors all standard Unix environment variables such as
:envvar:`PYTHONPATH`, but setting these variables for programs started from the
Finder is non-standard as the Finder does not read your :file:`.profile` or
:file:`.cshrc` at startup. You need to create a file :file:`~
/.MacOSX/environment.plist`. See Apple's Technical Document QA1067 for details.
For more information on installation Python packages in MacPython, see section
:ref:`mac-package-manager`.
.. _ide:
The IDE
=======
MacPython ships with the standard IDLE development environment. A good
introduction to using IDLE can be found at http://hkn.eecs.berkeley.edu/
dyoo/python/idle_intro/index.html.
.. _mac-package-manager:
Installing Additional Python Packages
=====================================
There are several methods to install additional Python packages:
* http://pythonmac.org/packages/ contains selected compiled packages for Python
2.5, 2.4, and 2.3.
* Packages can be installed via the standard Python distutils mode (``python
setup.py install``).
* Many packages can also be installed via the :program:`setuptools` extension.
GUI Programming on the Mac
==========================
There are several options for building GUI applications on the Mac with Python.
*PyObjC* is a Python binding to Apple's Objective-C/Cocoa framework, which is
the foundation of most modern Mac development. Information on PyObjC is
available from http://pyobjc.sourceforge.net.
The standard Python GUI toolkit is :mod:`Tkinter`, based on the cross-platform
Tk toolkit (http://www.tcl.tk). An Aqua-native version of Tk is bundled with OS
X by Apple, and the latest version can be downloaded and installed from
http://www.activestate.com; it can also be built from source.
*wxPython* is another popular cross-platform GUI toolkit that runs natively on
Mac OS X. Packages and documentation are available from http://www.wxpython.org.
*PyQt* is another popular cross-platform GUI toolkit that runs natively on Mac
OS X. More information can be found at
http://www.riverbankcomputing.co.uk/pyqt/.
Distributing Python Applications on the Mac
===========================================
The "Build Applet" tool that is placed in the MacPython 2.5 folder is fine for
packaging small Python scripts on your own machine to run as a standard Mac
application. This tool, however, is not robust enough to distribute Python
applications to other users.
The standard tool for deploying standalone Python applications on the Mac is
:program:`py2app`. More information on installing and using py2app can be found
at http://undefined.org/python/#py2app.
Application Scripting
=====================
Python can also be used to script other Mac applications via Apple's Open
Scripting Architecture (OSA); see http://appscript.sourceforge.net. Appscript is
a high-level, user-friendly Apple event bridge that allows you to control
scriptable Mac OS X applications using ordinary Python scripts. Appscript makes
Python a serious alternative to Apple's own *AppleScript* language for
automating your Mac. A related package, *PyOSA*, is an OSA language component
for the Python scripting language, allowing Python code to be executed by any
OSA-enabled application (Script Editor, Mail, iTunes, etc.). PyOSA makes Python
a full peer to AppleScript.
Other Resources
===============
The MacPython mailing list is an excellent support resource for Python users and
developers on the Mac:
http://www.python.org/community/sigs/current/pythonmac-sig/
Another useful resource is the MacPython wiki:
http://wiki.python.org/moin/MacPython

1377
Doc/howto/regex.rst Normal file

File diff suppressed because it is too large Load diff

421
Doc/howto/sockets.rst Normal file
View file

@ -0,0 +1,421 @@
****************************
Socket Programming HOWTO
****************************
:Author: Gordon McMillan
.. topic:: Abstract
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of sockets.
It's not really a tutorial - you'll still have work to do in getting things
operational. It doesn't cover the fine points (and there are a lot of them), but
I hope it will give you enough background to begin using them decently.
Sockets
=======
Sockets are used nearly everywhere, but are one of the most severely
misunderstood technologies around. This is a 10,000 foot overview of sockets.
It's not really a tutorial - you'll still have work to do in getting things
working. It doesn't cover the fine points (and there are a lot of them), but I
hope it will give you enough background to begin using them decently.
I'm only going to talk about INET sockets, but they account for at least 99% of
the sockets in use. And I'll only talk about STREAM sockets - unless you really
know what you're doing (in which case this HOWTO isn't for you!), you'll get
better behavior and performance from a STREAM socket than anything else. I will
try to clear up the mystery of what a socket is, as well as some hints on how to
work with blocking and non-blocking sockets. But I'll start by talking about
blocking sockets. You'll need to know how they work before dealing with
non-blocking sockets.
Part of the trouble with understanding these things is that "socket" can mean a
number of subtly different things, depending on context. So first, let's make a
distinction between a "client" socket - an endpoint of a conversation, and a
"server" socket, which is more like a switchboard operator. The client
application (your browser, for example) uses "client" sockets exclusively; the
web server it's talking to uses both "server" sockets and "client" sockets.
History
-------
Of the various forms of IPC (*Inter Process Communication*), sockets are by far
the most popular. On any given platform, there are likely to be other forms of
IPC that are faster, but for cross-platform communication, sockets are about the
only game in town.
They were invented in Berkeley as part of the BSD flavor of Unix. They spread
like wildfire with the Internet. With good reason --- the combination of sockets
with INET makes talking to arbitrary machines around the world unbelievably easy
(at least compared to other schemes).
Creating a Socket
=================
Roughly speaking, when you clicked on the link that brought you to this page,
your browser did something like the following::
#create an INET, STREAMing socket
s = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#now connect to the web server on port 80
# - the normal http port
s.connect(("www.mcmillan-inc.com", 80))
When the ``connect`` completes, the socket ``s`` can now be used to send in a
request for the text of this page. The same socket will read the reply, and then
be destroyed. That's right - destroyed. Client sockets are normally only used
for one exchange (or a small set of sequential exchanges).
What happens in the web server is a bit more complex. First, the web server
creates a "server socket". ::
#create an INET, STREAMing socket
serversocket = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
#bind the socket to a public host,
# and a well-known port
serversocket.bind((socket.gethostname(), 80))
#become a server socket
serversocket.listen(5)
A couple things to notice: we used ``socket.gethostname()`` so that the socket
would be visible to the outside world. If we had used ``s.bind(('', 80))`` or
``s.bind(('localhost', 80))`` or ``s.bind(('127.0.0.1', 80))`` we would still
have a "server" socket, but one that was only visible within the same machine.
A second thing to note: low number ports are usually reserved for "well known"
services (HTTP, SNMP etc). If you're playing around, use a nice high number (4
digits).
Finally, the argument to ``listen`` tells the socket library that we want it to
queue up as many as 5 connect requests (the normal max) before refusing outside
connections. If the rest of the code is written properly, that should be plenty.
OK, now we have a "server" socket, listening on port 80. Now we enter the
mainloop of the web server::
while 1:
#accept connections from outside
(clientsocket, address) = serversocket.accept()
#now do something with the clientsocket
#in this case, we'll pretend this is a threaded server
ct = client_thread(clientsocket)
ct.run()
There's actually 3 general ways in which this loop could work - dispatching a
thread to handle ``clientsocket``, create a new process to handle
``clientsocket``, or restructure this app to use non-blocking sockets, and
mulitplex between our "server" socket and any active ``clientsocket``\ s using
``select``. More about that later. The important thing to understand now is
this: this is *all* a "server" socket does. It doesn't send any data. It doesn't
receive any data. It just produces "client" sockets. Each ``clientsocket`` is
created in response to some *other* "client" socket doing a ``connect()`` to the
host and port we're bound to. As soon as we've created that ``clientsocket``, we
go back to listening for more connections. The two "clients" are free to chat it
up - they are using some dynamically allocated port which will be recycled when
the conversation ends.
IPC
---
If you need fast IPC between two processes on one machine, you should look into
whatever form of shared memory the platform offers. A simple protocol based
around shared memory and locks or semaphores is by far the fastest technique.
If you do decide to use sockets, bind the "server" socket to ``'localhost'``. On
most platforms, this will take a shortcut around a couple of layers of network
code and be quite a bit faster.
Using a Socket
==============
The first thing to note, is that the web browser's "client" socket and the web
server's "client" socket are identical beasts. That is, this is a "peer to peer"
conversation. Or to put it another way, *as the designer, you will have to
decide what the rules of etiquette are for a conversation*. Normally, the
``connect``\ ing socket starts the conversation, by sending in a request, or
perhaps a signon. But that's a design decision - it's not a rule of sockets.
Now there are two sets of verbs to use for communication. You can use ``send``
and ``recv``, or you can transform your client socket into a file-like beast and
use ``read`` and ``write``. The latter is the way Java presents their sockets.
I'm not going to talk about it here, except to warn you that you need to use
``flush`` on sockets. These are buffered "files", and a common mistake is to
``write`` something, and then ``read`` for a reply. Without a ``flush`` in
there, you may wait forever for the reply, because the request may still be in
your output buffer.
Now we come the major stumbling block of sockets - ``send`` and ``recv`` operate
on the network buffers. They do not necessarily handle all the bytes you hand
them (or expect from them), because their major focus is handling the network
buffers. In general, they return when the associated network buffers have been
filled (``send``) or emptied (``recv``). They then tell you how many bytes they
handled. It is *your* responsibility to call them again until your message has
been completely dealt with.
When a ``recv`` returns 0 bytes, it means the other side has closed (or is in
the process of closing) the connection. You will not receive any more data on
this connection. Ever. You may be able to send data successfully; I'll talk
about that some on the next page.
A protocol like HTTP uses a socket for only one transfer. The client sends a
request, the reads a reply. That's it. The socket is discarded. This means that
a client can detect the end of the reply by receiving 0 bytes.
But if you plan to reuse your socket for further transfers, you need to realize
that *there is no "EOT" (End of Transfer) on a socket.* I repeat: if a socket
``send`` or ``recv`` returns after handling 0 bytes, the connection has been
broken. If the connection has *not* been broken, you may wait on a ``recv``
forever, because the socket will *not* tell you that there's nothing more to
read (for now). Now if you think about that a bit, you'll come to realize a
fundamental truth of sockets: *messages must either be fixed length* (yuck), *or
be delimited* (shrug), *or indicate how long they are* (much better), *or end by
shutting down the connection*. The choice is entirely yours, (but some ways are
righter than others).
Assuming you don't want to end the connection, the simplest solution is a fixed
length message::
class mysocket:
'''demonstration class only
- coded for clarity, not efficiency
'''
def __init__(self, sock=None):
if sock is None:
self.sock = socket.socket(
socket.AF_INET, socket.SOCK_STREAM)
else:
self.sock = sock
def connect(self, host, port):
self.sock.connect((host, port))
def mysend(self, msg):
totalsent = 0
while totalsent < MSGLEN:
sent = self.sock.send(msg[totalsent:])
if sent == 0:
raise RuntimeError, \
"socket connection broken"
totalsent = totalsent + sent
def myreceive(self):
msg = ''
while len(msg) < MSGLEN:
chunk = self.sock.recv(MSGLEN-len(msg))
if chunk == '':
raise RuntimeError, \
"socket connection broken"
msg = msg + chunk
return msg
The sending code here is usable for almost any messaging scheme - in Python you
send strings, and you can use ``len()`` to determine its length (even if it has
embedded ``\0`` characters). It's mostly the receiving code that gets more
complex. (And in C, it's not much worse, except you can't use ``strlen`` if the
message has embedded ``\0``\ s.)
The easiest enhancement is to make the first character of the message an
indicator of message type, and have the type determine the length. Now you have
two ``recv``\ s - the first to get (at least) that first character so you can
look up the length, and the second in a loop to get the rest. If you decide to
go the delimited route, you'll be receiving in some arbitrary chunk size, (4096
or 8192 is frequently a good match for network buffer sizes), and scanning what
you've received for a delimiter.
One complication to be aware of: if your conversational protocol allows multiple
messages to be sent back to back (without some kind of reply), and you pass
``recv`` an arbitrary chunk size, you may end up reading the start of a
following message. You'll need to put that aside and hold onto it, until it's
needed.
Prefixing the message with it's length (say, as 5 numeric characters) gets more
complex, because (believe it or not), you may not get all 5 characters in one
``recv``. In playing around, you'll get away with it; but in high network loads,
your code will very quickly break unless you use two ``recv`` loops - the first
to determine the length, the second to get the data part of the message. Nasty.
This is also when you'll discover that ``send`` does not always manage to get
rid of everything in one pass. And despite having read this, you will eventually
get bit by it!
In the interests of space, building your character, (and preserving my
competitive position), these enhancements are left as an exercise for the
reader. Lets move on to cleaning up.
Binary Data
-----------
It is perfectly possible to send binary data over a socket. The major problem is
that not all machines use the same formats for binary data. For example, a
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl,
htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means
*short* and "l" means *long*. Where network order is host order, these do
nothing, but where the machine is byte-reversed, these swap the bytes around
appropriately.
In these days of 32 bit machines, the ascii representation of binary data is
frequently smaller than the binary representation. That's because a surprising
amount of the time, all those longs have the value 0, or maybe 1. The string "0"
would be two bytes, while binary is four. Of course, this doesn't fit well with
fixed-length messages. Decisions, decisions.
Disconnecting
=============
Strictly speaking, you're supposed to use ``shutdown`` on a socket before you
``close`` it. The ``shutdown`` is an advisory to the socket at the other end.
Depending on the argument you pass it, it can mean "I'm not going to send
anymore, but I'll still listen", or "I'm not listening, good riddance!". Most
socket libraries, however, are so used to programmers neglecting to use this
piece of etiquette that normally a ``close`` is the same as ``shutdown();
close()``. So in most situations, an explicit ``shutdown`` is not needed.
One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client
sends a request and then does a ``shutdown(1)``. This tells the server "This
client is done sending, but can still receive." The server can detect "EOF" by
a receive of 0 bytes. It can assume it has the complete request. The server
sends a reply. If the ``send`` completes successfully then, indeed, the client
was still receiving.
Python takes the automatic shutdown a step further, and says that when a socket
is garbage collected, it will automatically do a ``close`` if it's needed. But
relying on this is a very bad habit. If your socket just disappears without
doing a ``close``, the socket at the other end may hang indefinitely, thinking
you're just being slow. *Please* ``close`` your sockets when you're done.
When Sockets Die
----------------
Probably the worst thing about using blocking sockets is what happens when the
other side comes down hard (without doing a ``close``). Your socket is likely to
hang. SOCKSTREAM is a reliable protocol, and it will wait a long, long time
before giving up on a connection. If you're using threads, the entire thread is
essentially dead. There's not much you can do about it. As long as you aren't
doing something dumb, like holding a lock while doing a blocking read, the
thread isn't really consuming much in the way of resources. Do *not* try to kill
the thread - part of the reason that threads are more efficient than processes
is that they avoid the overhead associated with the automatic recycling of
resources. In other words, if you do manage to kill the thread, your whole
process is likely to be screwed up.
Non-blocking Sockets
====================
If you've understood the preceeding, you already know most of what you need to
know about the mechanics of using sockets. You'll still use the same calls, in
much the same ways. It's just that, if you do it right, your app will be almost
inside-out.
In Python, you use ``socket.setblocking(0)`` to make it non-blocking. In C, it's
more complex, (for one thing, you'll need to choose between the BSD flavor
``O_NONBLOCK`` and the almost indistinguishable Posix flavor ``O_NDELAY``, which
is completely different from ``TCP_NODELAY``), but it's the exact same idea. You
do this after creating the socket, but before using it. (Actually, if you're
nuts, you can switch back and forth.)
The major mechanical difference is that ``send``, ``recv``, ``connect`` and
``accept`` can return without having done anything. You have (of course) a
number of choices. You can check return code and error codes and generally drive
yourself crazy. If you don't believe me, try it sometime. Your app will grow
large, buggy and suck CPU. So let's skip the brain-dead solutions and do it
right.
Use ``select``.
In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but
it's close enough to the C version that if you understand ``select`` in Python,
you'll have little trouble with it in C. ::
ready_to_read, ready_to_write, in_error = \
select.select(
potential_readers,
potential_writers,
potential_errs,
timeout)
You pass ``select`` three lists: the first contains all sockets that you might
want to try reading; the second all the sockets you might want to try writing
to, and the last (normally left empty) those that you want to check for errors.
You should note that a socket can go into more than one list. The ``select``
call is blocking, but you can give it a timeout. This is generally a sensible
thing to do - give it a nice long timeout (say a minute) unless you have good
reason to do otherwise.
In return, you will get three lists. They have the sockets that are actually
readable, writable and in error. Each of these lists is a subset (possbily
empty) of the corresponding list you passed in. And if you put a socket in more
than one input list, it will only be (at most) in one output list.
If a socket is in the output readable list, you can be
as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that
socket will return *something*. Same idea for the writable list. You'll be able
to send *something*. Maybe not all you want to, but *something* is better than
nothing. (Actually, any reasonably healthy socket will return as writable - it
just means outbound network buffer space is available.)
If you have a "server" socket, put it in the potential_readers list. If it comes
out in the readable list, your ``accept`` will (almost certainly) work. If you
have created a new socket to ``connect`` to someone else, put it in the
ptoential_writers list. If it shows up in the writable list, you have a decent
chance that it has connected.
One very nasty problem with ``select``: if somewhere in those input lists of
sockets is one which has died a nasty death, the ``select`` will fail. You then
need to loop through every single damn socket in all those lists and do a
``select([sock],[],[],0)`` until you find the bad one. That timeout of 0 means
it won't take long, but it's ugly.
Actually, ``select`` can be handy even with blocking sockets. It's one way of
determining whether you will block - the socket returns as readable when there's
something in the buffers. However, this still doesn't help with the problem of
determining whether the other end is done, or just busy with something else.
**Portability alert**: On Unix, ``select`` works both with the sockets and
files. Don't try this on Windows. On Windows, ``select`` works with sockets
only. Also note that in C, many of the more advanced socket options are done
differently on Windows. In fact, on Windows I usually use threads (which work
very, very well) with my sockets. Face it, if you want any kind of performance,
your code will look very different on Windows than on Unix. (I haven't the
foggiest how you do this stuff on a Mac.)
Performance
-----------
There's no question that the fastest sockets code uses non-blocking sockets and
select to multiplex them. You can put together something that will saturate a
LAN connection without putting any strain on the CPU. The trouble is that an app
written this way can't do much of anything else - it needs to be ready to
shuffle bytes around at all times.
Assuming that your app is actually supposed to do something more than that,
threading is the optimal solution, (and using non-blocking sockets will be
faster than using blocking sockets). Unfortunately, threading support in Unixes
varies both in API and quality. So the normal Unix solution is to fork a
subprocess to deal with each connection. The overhead for this is significant
(and don't do this on Windows - the overhead of process creation is enormous
there). It also means that unless each subprocess is completely independent,
you'll need to use another form of IPC, say a pipe, or shared memory and
semaphores, to communicate between the parent and child processes.
Finally, remember that even though blocking sockets are somewhat slower than
non-blocking, in many cases they are the "right" solution. After all, if your
app is driven by the data it receives over a socket, there's not much sense in
complicating the logic just so your app can wait on ``select`` instead of
``recv``.

732
Doc/howto/unicode.rst Normal file
View file

@ -0,0 +1,732 @@
*****************
Unicode HOWTO
*****************
:Release: 1.02
This HOWTO discusses Python's support for Unicode, and explains various problems
that people commonly encounter when trying to work with Unicode.
Introduction to Unicode
=======================
History of Character Codes
--------------------------
In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to
127. For example, the lowercase letter 'a' is assigned 97 as its code
value.
ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)
For a while people just wrote programs that didn't display accents. I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::
PRINT "FICHER EST COMPLETE."
PRINT "CARACTERE NON ACCEPTE."
Those messages should contain accents, and they just look wrong to someone who
can read French.
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128-255 range emerged.
Some were true standards, defined by the International Standards Organization,
and some were **de facto** conventions that were invented by one company or
another and managed to catch on.
255 characters aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128-255 range because there are more than 127 such characters.
You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.
Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
base-16).
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.
(This discussion of Unicode's history is highly simplified. I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)
Definitions
-----------
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.
The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
of tables listing characters and their corresponding code points::
0061 'a'; LATIN SMALL LETTER A
0062 'b'; LATIN SMALL LETTER B
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.
Encodings
---------
To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
represented as a set of bytes (meaning, values from 0-255) in memory. The rules
for translating a Unicode string into a sequence of bytes are called an
**encoding**.
The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this::
P y t h o n
0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
This representation is straightforward but using it presents a number of
problems.
1. It's not portable; different processors order the bytes differently.
2. It's very wasteful of space. In most texts, the majority of the code points
are less than 127, or less than 255, so a lot of space is occupied by zero
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
ASCII representation. Increased RAM usage doesn't matter too much (desktop
computers have megabytes of RAM, and strings aren't usually that large), but
expanding our usage of disk and network bandwidth by a factor of 4 is
intolerable.
3. It's not compatible with existing C functions such as ``strlen()``, so a new
family of wide string functions would need to be used.
4. Many Internet standards are defined in terms of textual data, and can't
handle content with embedded zero bytes.
Generally people don't use this encoding, instead choosing other encodings that
are more efficient and convenient.
Encodings don't have to handle every possible Unicode character, and most
encodings don't. For example, Python's default encoding is the 'ascii'
encoding. The rules for converting a Unicode string into the ASCII encoding are
simple; for each code point:
1. If the code point is < 128, each byte is the same as the value of the code
point.
2. If the code point is 128 or greater, the Unicode string can't be represented
in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
case.)
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0-255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.) UTF-8 uses the following rules:
1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
processed by C functions such as ``strcpy()`` and sent through protocols that
can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned into two
bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
next UTF-8-encoded code point and resynchronize. It's also unlikely that
random 8-bit data will look like valid UTF-8.
References
----------
The Unicode Consortium site at <http://www.unicode.org> has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. <http://www.unicode.org/history/> is a chronology of the
origin and development of Unicode.
To help understand the standard, Jukka Korpela has written an introductory guide
to reading the Unicode character tables, available at
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
Roman Czyborra wrote another explanation of Unicode's basic principles; it's at
<http://czyborra.com/unicode/characters.html>. Czyborra has written a number of
other Unicode-related documentation, available from <http://www.cyzborra.com>.
Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
<http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make
things clear to you, you should try reading one of these alternate articles
before continuing.
Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.
Python's Unicode Support
========================
Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.
The Unicode Type
----------------
Unicode strings are expressed as instances of the :class:`unicode` type, one of
Python's repertoire of built-in types. It derives from an abstract type called
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
therefore check if a value is a string type with ``isinstance(value,
basestring)``. Under the hood, Python represents Unicode strings as either 16-
or 32-bit integers, depending on how the Python interpreter was compiled.
The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``. All of its arguments should be 8-bit strings. The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::
>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result). The following examples show the differences::
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
Encodings are specified as strings containing the encoding's name. Python 2.4
comes with roughly 100 different encodings; see the Python Library Reference at
<http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.
One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::
>>> unichr(40960)
u'\ua000'
>>> ord(u'\ua000')
40960
Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
>>> s.count('e')
5
>>> s.find('feather')
9
>>> s.find('bird')
-1
>>> s.replace('feather', 'sand')
u'Was ever sand so lightly blown to and fro as this multitude?'
>>> s.upper()
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
Note that the arguments to these methods can be Unicode strings or 8-bit
strings. 8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::
>>> s.find('Was\x9f')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
>>> s.find(u'Was\x9f')
-1
Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code. (Input and output code needs
more updating for Unicode; more on this later.)
Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the requested
encoding. The ``errors`` parameter is the same as the parameter of the
``unicode()`` constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references. The following example shows the different results::
>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
>>> type(utf8_version), utf8_version
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
>>> u == u2 # The two strings match
True
The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here. If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about this module.
The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.
Unicode Literals in Python Source Code
--------------------------------------
In Python source code, Unicode literals are written as strings prefixed with the
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
using the ``\u`` escape sequence, which is followed by four hex digits giving
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
digits, not 4.
Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
::
>>> s = u"a\xac\u1234\u20ac\U00008000"
^^^^ two-digit hex escape
^^^^^^ four-digit Unicode escape
^^^^^^^^^^ eight-digit Unicode escape
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`unichr` built-in function, but this is
even more tedious.
Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.
Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used. This is done by including a special comment as
either the first or second line of the source file::
#!/usr/bin/env python
# -*- coding: latin-1 -*-
u = u'abcdé'
print ord(u[-1])
The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate that the comment is special; within
them, you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.
If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
encoding for string literals; in Python 2.4, characters greater than 127 still
work but result in a warning. For example, the following program has no
encoding declaration::
#!/usr/bin/env python
u = u'abcdé'
print ord(u[-1])
When you run it with Python 2.4, it will output the following warning::
amk:~$ python p263.py
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
in file p263.py on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Unicode Properties
------------------
The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.
The following program displays some information about several characters, and
prints the numeric value of one particular character::
import unicodedata
u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)
for i, c in enumerate(u):
print i, '%04x' % ord(c), unicodedata.category(c),
print unicodedata.name(c)
# Get numeric value of second character
print unicodedata.numeric(u[1])
When run, this prints::
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED
1000.0
The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a
list of category codes.
References
----------
The Unicode and 8-bit string types are described in the Python library reference
at :ref:`typesseq`.
The documentation for the :mod:`unicodedata` module.
The documentation for the :mod:`codecs` module.
Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode". A PDF version of his slides is available at
<http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.
Reading and Writing Unicode Data
================================
Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?
It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.
Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``. However, the manual approach is not recommended.
One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2Gb file, you need 2Gb of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.
The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
to the standard function's parameter. ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.
Reading Unicode from a file is therefore simple::
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
print repr(line)
It's also possible to open files in update mode, allowing both reading and
writing::
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
Unicode filenames
-----------------
Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, MacOS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::
filename = u'filename\u4500abc'
f = open(filename, 'w')
f.write('blah\n')
f.close()
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
:func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return 8-bit strings containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as an 8-bit string or a Unicode string. If you pass
a Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing an 8-bit
path will return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::
fn = u'filename\u4500abc'
f = open(fn, 'w')
f.close()
import os
print os.listdir('.')
print os.listdir(u'.')
will produce the following output::
amk:~$ python t.py
['.svn', 'filename\xe4\x94\x80abc', ...]
[u'.svn', u'filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
Tips for Writing Unicode-aware Programs
---------------------------------------
This section provides some suggestions on writing software that deals with
Unicode.
The most important tip is:
Software should only work with Unicode strings internally, converting to a
particular encoding on output.
If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever
a character with an ASCII value > 127 is in the input data, you'll get a
:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
encoding.
It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
a bug in your program waiting for the first user who attempts to use characters
> 127. A second tip, therefore, is:
Include characters > 127 and, even better, characters > 255 in your test
data.
When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the string once it's in the form that will be used or
stored; it's possible for encodings to be used to disguise characters. This is
especially true if the input data also specifies the encoding; many encodings
leave the commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.
For example, let's say you have a content management system that takes a Unicode
filename, and you want to disallow paths with a '/' character. You might write
this code::
def read_file (filename, encoding):
if '/' in filename:
raise ValueError("'/' not allowed in filenames")
unicode_name = filename.decode(encoding)
f = open(unicode_name, 'r')
# ... return contents of file ...
However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
characters in the encoded form and misses the dangerous character in the
resulting decoded form.
References
----------
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application.
Revision History and Acknowledgements
=====================================
Thanks to the following people who have noted errors or offered suggestions on
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
Version 1.0: posted August 5 2005.
Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
several links.
Version 1.02: posted August 16 2005. Corrects factual errors.
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
.. comment
Original outline:
- [ ] Unicode introduction
- [ ] ASCII
- [ ] Terms
- [ ] Character
- [ ] Code point
- [ ] Encodings
- [ ] Common encodings: ASCII, Latin-1, UTF-8
- [ ] Unicode Python type
- [ ] Writing unicode literals
- [ ] Obscurity: -U switch
- [ ] Built-ins
- [ ] unichr()
- [ ] ord()
- [ ] unicode() constructor
- [ ] Unicode type
- [ ] encode(), decode() methods
- [ ] Unicodedata module for character properties
- [ ] I/O
- [ ] Reading/writing Unicode data into files
- [ ] Byte-order marks
- [ ] Unicode filenames
- [ ] Writing Unicode programs
- [ ] Do everything in Unicode
- [ ] Declaring source code encodings (PEP 263)
- [ ] Other issues
- [ ] Building Python (UCS2, UCS4)

578
Doc/howto/urllib2.rst Normal file
View file

@ -0,0 +1,578 @@
************************************************
HOWTO Fetch Internet Resources Using urllib2
************************************************
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
.. note::
There is an French translation of an earlier revision of this
HOWTO, available at `urllib2 - Le Manuel manquant
<http://www.voidspace/python/articles/urllib2_francais.shtml>`_.
Introduction
============
.. sidebar:: Related Articles
You may also find useful the following article on fetching web resources
with Python :
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
A tutorial on *Basic Authentication*, with examples in Python.
**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string
before the ":" in URL - for example "ftp" is the URL scheme of
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.
For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib2` docs, but is supplementary to them.
Fetching URLs
=============
The simplest way to use urllib2 is as follows::
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
could have used an URL starting with 'ftp:', 'file:', etc.). However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.
HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib2 mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
Note that urllib2 makes use of the same Request interface to handle all URL
schemes. For example, you can make an FTP request like so::
req = urllib2.Request('ftp://example.com/')
In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server. Second, you can pass
extra information ("metadata") *about* the data or the about request itself, to
the server - this information is sent as HTTP "headers". Let's look at each of
these in turn.
Data
----
Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit a HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the ``urllib`` library
*not* from ``urllib2``. ::
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door). Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST requests from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.
This is done as follows::
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.open(full_url)
Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.
Headers
-------
We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.
Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_ . By default urllib2 identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.
Handling Exceptions
===================
*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
be raised).
``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
HTTP URLs.
URLError
--------
Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist. In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.
e.g. ::
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>> print e.reason
>>>
(4, 'getaddrinfo failed')
HTTPError
---------
Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib2 will handle that for you). For those it can't handle,
urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The ``HTTPError`` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.
Error Codes
~~~~~~~~~~~
Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100-299 range indicate success, you will usually only see error
codes in the 400-599 range.
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
response codes in that shows all the response codes used by RFC 2616. The
dictionary is reproduced here for convenience ::
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),
400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the ``HTTPError`` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info, methods. ::
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>> urllib2.urlopen(req)
>>> except URLError, e:
>>> print e.code
>>> print e.read()
>>>
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css"
type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
Wrapping it Up
--------------
So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are two
basic approaches. I prefer the second approach.
Number 1
~~~~~~~~
::
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
response = urlopen(req)
except HTTPError, e:
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
except URLError, e:
print 'We failed to reach a server.'
print 'Reason: ', e.reason
else:
# everything is fine
.. note::
The ``except HTTPError`` *must* come first, otherwise ``except URLError``
will *also* catch an ``HTTPError``.
Number 2
~~~~~~~~
::
from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
response = urlopen(req)
except URLError, e:
if hasattr(e, 'reason'):
print 'We failed to reach a server.'
print 'Reason: ', e.reason
elif hasattr(e, 'code'):
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
else:
# everything is fine
info and geturl
===============
The response returned by urlopen (or the ``HTTPError`` instance) has two useful
methods ``info`` and ``geturl``.
**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.
**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
``httplib.HTTPMessage`` instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
Openers and Handlers
====================
When you fetch a URL you use an opener (an instance of the perhaps
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.
To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.
Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call. ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.
Other sorts of handlers you might want to can handle proxies, authentication,
and other common but slightly specialised situations.
``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.
Opener objects have an ``open`` method, which can be called directly to fetch
urls in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
Basic Authentication
====================
To illustrate creating and installing a handler we will use the
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
including an explanation of how Basic Authentication works - see the `Basic
Authentication Tutorial
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
When authentication is required, the server sends a header (as well as the 401
error code) requesting authentication. This specifies the authentication scheme
and a 'realm'. The header looks like : ``Www-authenticate: SCHEME
realm="REALM"``.
e.g. ::
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use a
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL. This will be supplied
in the absence of you providing an alternative combination for a specific
realm. We indicate this by providing ``None`` as the realm argument to the
``add_password`` method.
The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to .add_password() will also match. ::
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)
# use the opener to fetch a URL
opener.open(a_url)
# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
.. note::
In the above example we only supplied our ``HHTPBasicAuthHandler`` to
``build_opener``. By default openers have the handlers for normal situations
-- ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
``FileHandler``, ``HTTPErrorProcessor``.
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. "example.com" or "example.com:8080"
(the latter example includes a port number). The authority, if present, must
NOT contain the "userinfo" component - for example "joe@password:example.com" is
not correct.
Proxies
=======
**urllib2** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
a good thing, but there are occasions when it may not be helpful [#]_. One way
to do this is to setup our own ``ProxyHandler``, with no proxies defined. This
is done using similar steps to setting up a `Basic Authentication`_ handler : ::
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
.. note::
Currently ``urllib2`` *does not* support fetching of ``https`` locations
through a proxy. However, this can be enabled by extending urllib2 as
shown in the recipe [#]_.
Sockets and Layers
==================
The Python support for fetching resources from the web is layered. urllib2 uses
the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. Currently,
the socket timeout is not exposed at the httplib or urllib2 levels. However,
you can set the default timeout globally for all sockets using ::
import socket
import urllib2
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
-------
Footnotes
=========
This document was reviewed and revised by John Lee.
.. [#] For an introduction to the CGI protocol see
`Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
.. [#] Like Google for example. The *proper* way to use google from a program
is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
`Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
for some examples of using the Google API.
.. [#] Browser sniffing is a very bad practise for website design - building
sites using web standards is much more sensible. Unfortunately a lot of
sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
*'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
`Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
is set to use the proxy, which urllib2 picks up on. In order to test
scripts with a localhost server, I have to prevent urllib2 from using
the proxy.
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.