mirror of
https://github.com/python/cpython.git
synced 2025-07-24 19:54:21 +00:00
Move the 3k reST doc tree in place.
parent
739c01d47b
commit
116aa62bf5
423 changed files with 131199 additions and 0 deletions
356
Doc/howto/advocacy.rst
Normal file
|
@ -0,0 +1,356 @@
|
|||
*************************
|
||||
Python Advocacy HOWTO
|
||||
*************************
|
||||
|
||||
:Author: A.M. Kuchling
|
||||
:Release: 0.03
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
It's usually difficult to get your management to accept open source software,
|
||||
and Python is no exception to this rule. This document discusses reasons to use
|
||||
Python, strategies for winning acceptance, facts and arguments you can use, and
|
||||
cases where you *shouldn't* try to use Python.
|
||||
|
||||
|
||||
Reasons to Use Python
|
||||
=====================
|
||||
|
||||
There are several reasons to incorporate a scripting language into your
|
||||
development process. This section discusses them and explains why Python's
|
||||
properties make it a particularly good choice.
|
||||
|
||||
|
||||
Programmability
|
||||
---------------
|
||||
|
||||
Programs are often organized in a modular fashion. Lower-level operations are
|
||||
grouped together, and called by higher-level functions, which may in turn be
|
||||
used as basic operations by still higher levels.
|
||||
|
||||
For example, the lowest level might define a very low-level set of functions for
|
||||
accessing a hash table. The next level might use hash tables to store the
|
||||
headers of a mail message, mapping a header name like ``Date`` to a value such
|
||||
as ``Tue, 13 May 1997 20:00:54 -0400``. A yet higher level may operate on
|
||||
message objects, without knowing or caring that message headers are stored in a
|
||||
hash table, and so forth.
|
||||
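
The layering described above can be sketched in a few lines of Python. This is only an illustration: the built-in ``dict`` stands in for the low-level hash table, and the names ``make_header_table``, ``set_header``, ``get_header``, and ``Message`` are hypothetical, not part of any standard library API.

```python
# Lowest level: a hash table. Python's built-in dict plays this role here.
def make_header_table():
    """Return an empty table mapping header names to values."""
    return {}


# Middle level: mail headers stored in the hash table.
def set_header(table, name, value):
    """Store a header; names are treated as case-insensitive."""
    table[name.lower()] = value


def get_header(table, name, default=None):
    """Look up a header by name, ignoring case."""
    return table.get(name.lower(), default)


# Highest level: a message object whose callers never see the table.
class Message(object):  # hypothetical class, for illustration only
    def __init__(self, body=""):
        self._headers = make_header_table()
        self.body = body

    def __setitem__(self, name, value):
        set_header(self._headers, name, value)

    def __getitem__(self, name):
        return get_header(self._headers, name)


msg = Message("Hello")
msg["Date"] = "Tue, 13 May 1997 20:00:54 -0400"
print(msg["date"])
```

Code that works with ``Message`` objects would keep working if the header storage were later replaced by some other data structure, which is the point of the layering.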
|
||||
Often, the lowest levels do very simple things; they implement a data structure
|
||||
such as a binary tree or hash table, or they perform some simple computation,
|
||||
such as converting a date string to a number. The higher levels then contain
|
||||
logic connecting these primitive operations. Using this approach, the primitives
|
||||
can be seen as basic building blocks which are then glued together to produce
|
||||
the complete product.
|
||||
|
||||
Why is this design approach relevant to Python? Because Python is well suited
|
||||
to functioning as such a glue language. A common approach is to write a Python
|
||||
module that implements the lower level operations; for the sake of speed, the
|
||||
implementation might be in C, Java, or even Fortran. Once the primitives are
|
||||
available to Python programs, the logic underlying higher level operations is
|
||||
written in the form of Python code. The high-level logic is then more
|
||||
understandable, and easier to modify.
|
||||
|
||||
John Ousterhout wrote a paper that explains this idea at greater length,
|
||||
entitled "Scripting: Higher Level Programming for the 21st Century". I
|
||||
recommend that you read this paper; see the references for the URL. Ousterhout
|
||||
is the inventor of the Tcl language, and therefore argues that Tcl should be
|
||||
used for this purpose; he only briefly refers to other languages such as Python,
|
||||
Perl, and Lisp/Scheme, but in reality, Ousterhout's argument applies to
|
||||
scripting languages in general, since you could equally write extensions for any
|
||||
of the languages mentioned above.
|
||||
|
||||
|
||||
Prototyping
|
||||
-----------
|
||||
|
||||
In *The Mythical Man-Month*, Frederick Brooks suggests the following rule when
|
||||
planning software projects: "Plan to throw one away; you will anyway." Brooks
|
||||
is saying that the first attempt at a software design often turns out to be
|
||||
wrong; unless the problem is very simple or you're an extremely good designer,
|
||||
you'll find that new requirements and features become apparent once development
|
||||
has actually started. If these new requirements can't be cleanly incorporated
|
||||
into the program's structure, you're presented with two unpleasant choices:
|
||||
hammer the new features into the program somehow, or scrap everything and write
|
||||
a new version of the program, taking the new features into account from the
|
||||
beginning.
|
||||
|
||||
Python provides you with a good environment for quickly developing an initial
|
||||
prototype. That lets you get the overall program structure and logic right, and
|
||||
you can fine-tune small details in the fast development cycle that Python
|
||||
provides. Once you're satisfied with the GUI or program output, you
|
||||
can translate the Python code into C++, Fortran, Java, or some other compiled
|
||||
language.
|
||||
|
||||
Prototyping means you have to be careful not to use too many Python features
|
||||
that are hard to implement in your other language. Using ``eval()``, or regular
|
||||
expressions, or the :mod:`pickle` module, means that you're going to need C or
|
||||
Java libraries for formula evaluation, regular expressions, and serialization,
|
||||
for example. But it's not hard to avoid such tricky code, and in the end the
|
||||
translation usually isn't very difficult. The resulting code can be rapidly
|
||||
debugged, because any serious logical errors will have been removed from the
|
||||
prototype, leaving only more minor slip-ups in the translation to track down.
|
||||
|
||||
This strategy builds on the earlier discussion of programmability. Using Python
|
||||
as glue to connect lower-level components has obvious relevance for constructing
|
||||
prototype systems. In this way Python can help you with development, even if
|
||||
end users never come in contact with Python code at all. If the performance of
|
||||
the Python version is adequate and corporate politics allow it, you may not need
|
||||
to do a translation into C or Java, but it can still be faster to develop a
|
||||
prototype and then translate it, instead of attempting to produce the final
|
||||
version immediately.
|
||||
|
||||
One example of this development strategy is Microsoft Merchant Server. Version
|
||||
1.0 was written in pure Python, by a company that subsequently was purchased by
|
||||
Microsoft. Version 2.0 began to translate the code into C++, shipping with some
|
||||
C++ code and some Python code. Version 3.0 didn't contain any Python at all; all
|
||||
the code had been translated into C++. Even though the product doesn't contain
|
||||
a Python interpreter, the Python language has still served a useful purpose by
|
||||
speeding up development.
|
||||
|
||||
This is a very common use for Python. Past conference papers have also
|
||||
described this approach for developing high-level numerical algorithms; see
|
||||
David M. Beazley and Peter S. Lomdahl's paper "Feeding a Large-scale Physics
|
||||
Application to Python" in the references for a good example. If an algorithm's
|
||||
basic operations are things like "Take the inverse of this 4000x4000 matrix",
|
||||
and are implemented in some lower-level language, then Python has almost no
|
||||
additional performance cost; the extra time required for Python to evaluate an
|
||||
expression like ``m.invert()`` is dwarfed by the cost of the actual computation.
|
||||
Python is particularly good for applications where seemingly endless tweaking is
|
||||
required to get things right. GUI interfaces and Web sites are prime examples.
|
||||
|
||||
The Python code is also shorter and faster to write (once you're familiar with
|
||||
Python), so it's easier to throw it away if you decide your approach was wrong;
|
||||
if you'd spent two weeks working on it instead of just two hours, you might
|
||||
waste time trying to patch up what you've got out of a natural reluctance to
|
||||
admit that those two weeks were wasted. Truthfully, those two weeks haven't
|
||||
been wasted, since you've learnt something about the problem and the technology
|
||||
you're using to solve it, but it's human nature to view this as a failure of
|
||||
some sort.
|
||||
|
||||
|
||||
Simplicity and Ease of Understanding
|
||||
------------------------------------
|
||||
|
||||
Python is definitely *not* a toy language that's only usable for small tasks.
|
||||
The language features are general and powerful enough to enable it to be used
|
||||
for many different purposes. It's useful at the small end, for 10- or 20-line
|
||||
scripts, but it also scales up to larger systems that contain thousands of lines
|
||||
of code.
|
||||
|
||||
However, this expressiveness doesn't come at the cost of an obscure or tricky
|
||||
syntax. While Python has some dark corners that can lead to obscure code, there
|
||||
are relatively few such corners, and proper design can isolate their use to only
|
||||
a few classes or modules. It's certainly possible to write confusing code by
|
||||
using too many features with too little concern for clarity, but most Python
|
||||
code can look a lot like a slightly-formalized version of human-understandable
|
||||
pseudocode.
|
||||
|
||||
In *The New Hacker's Dictionary*, Eric S. Raymond gives the following definition
|
||||
for "compact":
|
||||
|
||||
.. epigraph::
|
||||
|
||||
Compact *adj.* Of a design, describes the valuable property that it can all be
|
||||
apprehended at once in one's head. This generally means the thing created from
|
||||
the design can be used with greater facility and fewer errors than an equivalent
|
||||
tool that is not compact. Compactness does not imply triviality or lack of
|
||||
power; for example, C is compact and FORTRAN is not, but C is more powerful than
|
||||
FORTRAN. Designs become non-compact through accreting features and cruft that
|
||||
don't merge cleanly into the overall design scheme (thus, some fans of Classic C
|
||||
maintain that ANSI C is no longer compact).
|
||||
|
||||
(From http://www.catb.org/~esr/jargon/html/C/compact.html)
|
||||
|
||||
In this sense of the word, Python is quite compact, because the language has
|
||||
just a few ideas, which are used in lots of places. Take namespaces, for
|
||||
example. Import a module with ``import math``, and you create a new namespace
|
||||
called ``math``. Classes are also namespaces that share many of the properties
|
||||
of modules, and have a few of their own; for example, you can create instances
|
||||
of a class. Instances? They're yet another namespace. Namespaces are currently
|
||||
implemented as Python dictionaries, so they have the same methods as the
|
||||
standard dictionary data type: ``.keys()`` returns all the keys, and so forth.
|
||||
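
A short sketch can make the "namespaces are dictionaries" idea concrete. The ``Point`` class below is invented purely for illustration; ``vars()`` is the built-in that returns an object's namespace mapping.

```python
import math


class Point(object):
    """A tiny class used only for illustration."""
    dimensions = 2

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(3, 4)

# A module's namespace is a dictionary.
module_names = vars(math)
print("sqrt" in module_names.keys())

# A class's namespace exposes the same mapping interface.
class_names = vars(Point)
print("dimensions" in class_names.keys())

# So does an instance's namespace.
instance_names = vars(p)
print(sorted(instance_names.keys()))
```

The same ``.keys()`` call works at all three levels, which is what lets a small set of ideas cover modules, classes, and instances.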
|
||||
This simplicity arises from Python's development history. The language syntax
|
||||
derives from different sources; ABC, a relatively obscure teaching language, is
|
||||
one primary influence, and Modula-3 is another. (For more information about ABC
|
||||
and Modula-3, consult their respective Web sites at
|
||||
http://www.cwi.nl/~steven/abc/ and http://www.m3.org.) Other features have come from C, Icon,
|
||||
Algol-68, and even Perl. Python hasn't really innovated very much, but instead
|
||||
has tried to keep the language small and easy to learn, building on ideas that
|
||||
have been tried in other languages and found useful.
|
||||
|
||||
Simplicity is a virtue that should not be underestimated. It lets you learn the
|
||||
language more quickly, and then rapidly write code, code that often works the
|
||||
first time you run it.
|
||||
|
||||
|
||||
Java Integration
|
||||
----------------
|
||||
|
||||
If you're working with Java, Jython (http://www.jython.org/) is definitely worth
|
||||
your attention. Jython is a re-implementation of Python in Java that compiles
|
||||
Python code into Java bytecodes. The resulting environment has very tight,
|
||||
almost seamless, integration with Java. It's trivial to access Java classes
|
||||
from Python, and you can write Python classes that subclass Java classes.
|
||||
Jython can be used for prototyping Java applications in much the same way
|
||||
CPython is used, and it can also be used for test suites for Java code, or
|
||||
embedded in a Java application to add scripting capabilities.
|
||||
|
||||
|
||||
Arguments and Rebuttals
|
||||
=======================
|
||||
|
||||
Let's say that you've decided upon Python as the best choice for your
|
||||
application. How can you convince your management, or your fellow developers,
|
||||
to use Python? This section lists some common arguments against using Python,
|
||||
and provides some possible rebuttals.
|
||||
|
||||
**Python is freely available software that doesn't cost anything. How good can
|
||||
it be?**
|
||||
|
||||
Very good, indeed. These days Linux and Apache, two other pieces of open source
|
||||
software, are becoming more respected as alternatives to commercial software,
|
||||
but Python hasn't had all the publicity.
|
||||
|
||||
Python has been around for several years, with many users and developers.
|
||||
Accordingly, the interpreter has been used by many people, and has gotten most
|
||||
of the bugs shaken out of it. While bugs are still discovered at intervals,
|
||||
they're usually either quite obscure (they'd have to be, for no one to have run
|
||||
into them before) or they involve interfaces to external libraries. The
|
||||
internals of the language itself are quite stable.
|
||||
|
||||
Having the source code should be viewed as making the software available for
|
||||
peer review; people can examine the code, suggest (and implement) improvements,
|
||||
and track down bugs. To find out more about the idea of open source code, along
|
||||
with arguments and case studies supporting it, go to http://www.opensource.org.
|
||||
|
||||
**Who's going to support it?**
|
||||
|
||||
Python has a sizable community of developers, and the number is still growing.
|
||||
The Internet community surrounding the language is an active one, and is worth
|
||||
being considered another one of Python's advantages. Most questions posted to
|
||||
the comp.lang.python newsgroup are quickly answered by someone.
|
||||
|
||||
Should you need to dig into the source code, you'll find it's clear and
|
||||
well-organized, so it's not very difficult to write extensions and track down
|
||||
bugs yourself. If you'd prefer to pay for support, there are companies and
|
||||
individuals who offer commercial support for Python.
|
||||
|
||||
**Who uses Python for serious work?**
|
||||
|
||||
Lots of people; one interesting thing about Python is the surprising diversity
|
||||
of applications that it's been used for. People are using Python to:
|
||||
|
||||
* Run Web sites
|
||||
|
||||
* Write GUI interfaces
|
||||
|
||||
* Control number-crunching code on supercomputers
|
||||
|
||||
* Make a commercial application scriptable by embedding the Python interpreter
|
||||
inside it
|
||||
|
||||
* Process large XML data sets
|
||||
|
||||
* Build test suites for C or Java code
|
||||
|
||||
Whatever your application domain is, there's probably someone who's used Python
|
||||
for something similar. Yet, despite being usable for such high-end
|
||||
applications, Python's still simple enough to use for little jobs.
|
||||
|
||||
See http://wiki.python.org/moin/OrganizationsUsingPython for a list of some of
|
||||
the organizations that use Python.
|
||||
|
||||
**What are the restrictions on Python's use?**
|
||||
|
||||
They're practically nonexistent. Consult the :file:`Misc/COPYRIGHT` file in the
|
||||
source distribution, or http://www.python.org/doc/Copyright.html for the full
|
||||
language, but it boils down to three conditions.
|
||||
|
||||
* You have to leave the copyright notice on the software; if you don't include
|
||||
the source code in a product, you have to put the copyright notice in the
|
||||
supporting documentation.
|
||||
|
||||
* Don't claim that the institutions that have developed Python endorse your
|
||||
product in any way.
|
||||
|
||||
* If something goes wrong, you can't sue for damages. Practically all software
|
||||
licences contain this condition.
|
||||
|
||||
Notice that you don't have to provide source code for anything that contains
|
||||
Python or is built with it. Also, the Python interpreter and accompanying
|
||||
documentation can be modified and redistributed in any way you like, and you
|
||||
don't have to pay anyone any licensing fees at all.
|
||||
|
||||
**Why should we use an obscure language like Python instead of well-known
|
||||
language X?**
|
||||
|
||||
I hope this HOWTO, and the documents listed in the final section, will help
|
||||
convince you that Python isn't obscure, and has a healthily growing user base.
|
||||
One word of advice: always present Python's positive advantages, instead of
|
||||
concentrating on language X's failings. People want to know why a solution is
|
||||
good, rather than why all the other solutions are bad. So instead of attacking
|
||||
a competing solution on various grounds, simply show how Python's virtues can
|
||||
help.
|
||||
|
||||
|
||||
Useful Resources
|
||||
================
|
||||
|
||||
http://www.pythonology.com/success
|
||||
The Python Success Stories are a collection of stories from successful users of
|
||||
Python, with the emphasis on business and corporate users.
|
||||
|
||||
.. % \term{\url{http://www.fsbassociates.com/books/pythonchpt1.htm}}
|
||||
.. % The first chapter of \emph{Internet Programming with Python} also
|
||||
.. % examines some of the reasons for using Python. The book is well worth
|
||||
.. % buying, but the publishers have made the first chapter available on
|
||||
.. % the Web.
|
||||
|
||||
http://home.pacbell.net/ouster/scripting.html
|
||||
John Ousterhout's white paper on scripting is a good argument for the utility of
|
||||
scripting languages, though naturally enough, he emphasizes Tcl, the language he
|
||||
developed. Most of the arguments would apply to any scripting language.
|
||||
|
||||
http://www.python.org/workshops/1997-10/proceedings/beazley.html
|
||||
The authors, David M. Beazley and Peter S. Lomdahl, describe their use of
|
||||
Python at Los Alamos National Laboratory. It's another good example of how
|
||||
Python can help get real work done. This quotation from the paper has been
|
||||
echoed by many people:
|
||||
|
||||
.. epigraph::
|
||||
|
||||
Originally developed as a large monolithic application for massively parallel
|
||||
processing systems, we have used Python to transform our application into a
|
||||
flexible, highly modular, and extremely powerful system for performing
|
||||
simulation, data analysis, and visualization. In addition, we describe how
|
||||
Python has solved a number of important problems related to the development,
|
||||
debugging, deployment, and maintenance of scientific software.
|
||||
|
||||
http://pythonjournal.cognizor.com/pyj1/Everitt-Feit_interview98-V1.html
|
||||
This interview with Andy Feit, discussing Infoseek's use of Python, can be used
|
||||
to show that choosing Python didn't introduce any difficulties into a company's
|
||||
development process, and provided some substantial benefits.
|
||||
|
||||
.. % \term{\url{http://www.python.org/psa/Commercial.html}}
|
||||
.. % Robin Friedrich wrote this document on how to support Python's use in
|
||||
.. % commercial projects.
|
||||
|
||||
http://www.python.org/workshops/1997-10/proceedings/stein.ps
|
||||
For the 6th Python conference, Greg Stein presented a paper that traced Python's
|
||||
adoption and usage at a startup called eShop, and later at Microsoft.
|
||||
|
||||
http://www.opensource.org
|
||||
Management may be doubtful of the reliability and usefulness of software that
|
||||
wasn't written commercially. This site presents arguments that show how open
|
||||
source software can have considerable advantages over closed-source software.
|
||||
|
||||
http://sunsite.unc.edu/LDP/HOWTO/mini/Advocacy.html
|
||||
The Linux Advocacy mini-HOWTO was the inspiration for this document, and is also
|
||||
well worth reading for general suggestions on winning acceptance for a new
|
||||
technology, such as Linux or Python. In general, you won't make much progress
|
||||
by simply attacking existing systems and complaining about their inadequacies;
|
||||
this often ends up looking like unfocused whining. It's much better to point
|
||||
out some of the many areas where Python is an improvement over other systems.
|
||||
|
434
Doc/howto/curses.rst
Normal file
|
@ -0,0 +1,434 @@
|
|||
**********************************
|
||||
Curses Programming with Python
|
||||
**********************************
|
||||
|
||||
:Author: A.M. Kuchling, Eric S. Raymond
|
||||
:Release: 2.02
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
This document describes how to write text-mode programs with Python 2.x, using
|
||||
the :mod:`curses` extension module to control the display.
|
||||
|
||||
|
||||
What is curses?
|
||||
===============
|
||||
|
||||
The curses library supplies a terminal-independent screen-painting and
|
||||
keyboard-handling facility for text-based terminals; such terminals include
|
||||
VT100s, the Linux console, and the simulated terminal provided by X11 programs
|
||||
such as xterm and rxvt. Display terminals support various control codes to
|
||||
perform common operations such as moving the cursor, scrolling the screen, and
|
||||
erasing areas. Different terminals use widely differing codes, and often have
|
||||
their own minor quirks.
|
||||
|
||||
In a world of X displays, one might ask "why bother"? It's true that
|
||||
character-cell display terminals are an obsolete technology, but there are
|
||||
niches in which being able to do fancy things with them is still valuable. One
|
||||
is on small-footprint or embedded Unixes that don't carry an X server. Another
|
||||
is for tools like OS installers and kernel configurators that may have to run
|
||||
before X is available.
|
||||
|
||||
The curses library hides all the details of different terminals, and provides
|
||||
the programmer with an abstraction of a display, containing multiple
|
||||
non-overlapping windows. The contents of a window can be changed in various
|
||||
ways--adding text, erasing it, changing its appearance--and the curses library
|
||||
will automagically figure out what control codes need to be sent to the terminal
|
||||
to produce the right output.
|
||||
|
||||
The curses library was originally written for BSD Unix; the later System V
|
||||
versions of Unix from AT&T added many enhancements and new functions. BSD curses
|
||||
is no longer maintained, having been replaced by ncurses, which is an
|
||||
open-source implementation of the AT&T interface. If you're using an
|
||||
open-source Unix such as Linux or FreeBSD, your system almost certainly uses
|
||||
ncurses. Since most current commercial Unix versions are based on System V
|
||||
code, all the functions described here will probably be available. The older
|
||||
versions of curses carried by some proprietary Unixes may not support
|
||||
everything, though.
|
||||
|
||||
No one has made a Windows port of the curses module. On a Windows platform, try
|
||||
the Console module written by Fredrik Lundh. The Console module provides
|
||||
cursor-addressable text output, plus full support for mouse and keyboard input,
|
||||
and is available from http://effbot.org/efflib/console.
|
||||
|
||||
|
||||
The Python curses module
|
||||
------------------------
|
||||
|
||||
The Python module is a fairly simple wrapper over the C functions provided by
|
||||
curses; if you're already familiar with curses programming in C, it's really
|
||||
easy to transfer that knowledge to Python. The biggest difference is that the
|
||||
Python interface makes things simpler, by merging different C functions such as
|
||||
:func:`addstr`, :func:`mvaddstr`, :func:`mvwaddstr`, into a single
|
||||
:meth:`addstr` method. You'll see this covered in more detail later.
|
||||
|
||||
This HOWTO is simply an introduction to writing text-mode programs with curses
|
||||
and Python. It doesn't attempt to be a complete guide to the curses API; for
|
||||
that, see the Python library guide's section on ncurses, and the C manual pages
|
||||
for ncurses. It will, however, give you the basic ideas.
|
||||
|
||||
|
||||
Starting and ending a curses application
|
||||
========================================
|
||||
|
||||
Before doing anything, curses must be initialized. This is done by calling the
|
||||
:func:`initscr` function, which will determine the terminal type, send any
|
||||
required setup codes to the terminal, and create various internal data
|
||||
structures. If successful, :func:`initscr` returns a window object representing
|
||||
the entire screen; this is usually called ``stdscr``, after the name of the
|
||||
corresponding C variable. ::
|
||||
|
||||
import curses
|
||||
stdscr = curses.initscr()
|
||||
|
||||
Usually curses applications turn off automatic echoing of keys to the screen, in
|
||||
order to be able to read keys and only display them under certain circumstances.
|
||||
This requires calling the :func:`noecho` function. ::
|
||||
|
||||
curses.noecho()
|
||||
|
||||
Applications will also commonly need to react to keys instantly, without
|
||||
requiring the Enter key to be pressed; this is called cbreak mode, as opposed to
|
||||
the usual buffered input mode. ::
|
||||
|
||||
curses.cbreak()
|
||||
|
||||
Terminals usually return special keys, such as the cursor keys or navigation
|
||||
keys such as Page Up and Home, as a multibyte escape sequence. While you could
|
||||
write your application to expect such sequences and process them accordingly,
|
||||
curses can do it for you, returning a special value such as
|
||||
:const:`curses.KEY_LEFT`. To get curses to do the job, you'll have to enable
|
||||
keypad mode. ::
|
||||
|
||||
stdscr.keypad(1)
|
||||
|
||||
Terminating a curses application is much easier than starting one. You'll need
|
||||
to call ::
|
||||
|
||||
curses.nocbreak(); stdscr.keypad(0); curses.echo()
|
||||
|
||||
to reverse the curses-friendly terminal settings. Then call the :func:`endwin`
|
||||
function to restore the terminal to its original operating mode. ::
|
||||
|
||||
curses.endwin()
|
||||
|
||||
A common problem when debugging a curses application is to get your terminal
|
||||
messed up when the application dies without restoring the terminal to its
|
||||
previous state. In Python this commonly happens when your code is buggy and
|
||||
raises an uncaught exception. Keys are no longer echoed to the screen when
|
||||
you type them, for example, which makes using the shell difficult.
|
||||
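
One way to guard against this by hand is to put the application body inside a ``try``/``finally`` block, so the terminal settings are reversed even when an exception escapes. The sketch below is only an illustration; ``run_guarded`` and ``body`` are hypothetical names, not part of the curses module.

```python
import curses


def run_guarded(body):
    """Set up curses by hand, run body(stdscr), and always restore the terminal."""
    stdscr = curses.initscr()
    try:
        curses.noecho()
        curses.cbreak()
        stdscr.keypad(1)
        return body(stdscr)
    finally:
        # Reverse the curses-friendly settings, even if body() raised.
        stdscr.keypad(0)
        curses.nocbreak()
        curses.echo()
        curses.endwin()
```

Calling ``run_guarded(my_function)`` from a script run in a real terminal would leave the shell usable afterwards, whether or not ``my_function`` fails.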
|
||||
In Python you can avoid these complications and make debugging much easier by
|
||||
importing the module :mod:`curses.wrapper`. It supplies a :func:`wrapper`
|
||||
function that takes a callable. It does the initializations described above,
|
||||
and also initializes colors if color support is present. It then runs your
|
||||
provided callable and finally deinitializes appropriately. The callable is
|
||||
called inside a try...except clause which catches exceptions, performs curses
|
||||
deinitialization, and then passes the exception upwards. Thus, your terminal
|
||||
won't be left in a funny state on exception.
|
||||
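
A minimal sketch of this pattern follows. It assumes the :func:`wrapper` function is reachable as ``curses.wrapper``; the names ``main`` and ``run`` are chosen here for illustration only.

```python
import curses


def main(stdscr):
    """Application body; receives the window that wrapper() created."""
    stdscr.addstr(0, 0, "Press any key to quit")
    stdscr.refresh()
    stdscr.getch()  # block until a key is pressed


def run():
    """Start the application; call this only from a real terminal."""
    # wrapper() initializes curses, calls main(stdscr), and restores the
    # terminal afterwards, even if main() raises an exception.
    curses.wrapper(main)
```

Because ``main`` only talks to the window object it is given, it can also be exercised with a stand-in object when no terminal is available.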
|
||||
|
||||
Windows and Pads
|
||||
================
|
||||
|
||||
Windows are the basic abstraction in curses. A window object represents a
|
||||
rectangular area of the screen, and supports various methods to display text,
|
||||
erase it, allow the user to input strings, and so forth.
|
||||
|
||||
The ``stdscr`` object returned by the :func:`initscr` function is a window
|
||||
object that covers the entire screen. Many programs may need only this single
|
||||
window, but you might wish to divide the screen into smaller windows, in order
|
||||
to redraw or clear them separately. The :func:`newwin` function creates a new
|
||||
window of a given size, returning the new window object. ::
|
||||
|
||||
begin_x = 20 ; begin_y = 7
|
||||
height = 5 ; width = 40
|
||||
win = curses.newwin(height, width, begin_y, begin_x)
|
||||
|
||||
A word about the coordinate system used in curses: coordinates are always passed
|
||||
in the order *y,x*, and the top-left corner of a window is coordinate (0,0).
|
||||
This breaks a common convention for handling coordinates, where the *x*
|
||||
coordinate usually comes first. This is an unfortunate difference from most
|
||||
other computer applications, but it's been part of curses since it was first
|
||||
written, and it's too late to change things now.
|
||||
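
As a small illustration of the *y,x* ordering, the hypothetical helper below computes where a window of a given size should start in order to be centered on the screen. The function name and the 24 by 80 screen size are assumptions made for the example.

```python
def centered_origin(screen_height, screen_width, height, width):
    """Return (begin_y, begin_x) that centers a height x width window."""
    begin_y = max((screen_height - height) // 2, 0)
    begin_x = max((screen_width - width) // 2, 0)
    return begin_y, begin_x  # y first, matching the curses convention


# For a 24-line by 80-column screen and a 5 x 40 window:
begin_y, begin_x = centered_origin(24, 80, 5, 40)
print(begin_y, begin_x)
```

The result can be passed straight to ``curses.newwin(height, width, begin_y, begin_x)``, which also takes the vertical values before the horizontal ones.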
|
||||
When you call a method to display or erase text, the effect doesn't immediately
|
||||
show up on the display. This is because curses was originally written with slow
|
||||
300-baud terminal connections in mind; with these terminals, minimizing the time
|
||||
required to redraw the screen is very important. This lets curses accumulate
|
||||
changes to the screen, and display them in the most efficient manner. For
|
||||
example, if your program displays some characters in a window, and then clears
|
||||
the window, there's no need to send the original characters because they'd never
|
||||
be visible.
|
||||
|
||||
Accordingly, curses requires that you explicitly tell it to redraw windows,
|
||||
using the :meth:`refresh` method of window objects. In practice, this doesn't
|
||||
really complicate programming with curses much. Most programs go into a flurry
|
||||
of activity, and then pause waiting for a keypress or some other action on the
|
||||
part of the user. All you have to do is to be sure that the screen has been
|
||||
redrawn before pausing to wait for user input, by simply calling
|
||||
``stdscr.refresh()`` or the :meth:`refresh` method of some other relevant
|
||||
window.
|
||||
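
The "flurry of activity, then refresh, then wait" rhythm might look like the following sketch. ``show_and_wait`` is a hypothetical helper; it uses only the ``addstr``, ``refresh``, and ``getch`` window methods.

```python
def show_and_wait(win, lines):
    """Draw each line of text, refresh once, then wait for a key."""
    for row, text in enumerate(lines):
        win.addstr(row, 0, text)   # queued; nothing is visible yet
    win.refresh()                  # now the changes reach the terminal
    return win.getch()             # pause for user input
```

Inside a running curses application, this could be called as ``show_and_wait(stdscr, ["First line", "Second line"])``.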
|
||||
A pad is a special case of a window; it can be larger than the actual display
|
||||
screen, with only a portion of it displayed at a time. Creating a pad simply
|
||||
requires the pad's height and width, while refreshing a pad requires giving the
|
||||
coordinates of the on-screen area where a subsection of the pad will be
|
||||
displayed. ::
|
||||
|
||||
pad = curses.newpad(100, 100)
|
||||
# These loops fill the pad with letters; this is
|
||||
# explained in the next section
|
||||
for y in range(0, 100):
|
||||
    for x in range(0, 100):
|
||||
        try: pad.addch(y,x, ord('a') + (x*x+y*y) % 26 )
|
||||
        except curses.error: pass
|
||||
|
||||
# Displays a section of the pad in the middle of the screen
|
||||
pad.refresh( 0,0, 5,5, 20,75)
|
||||
|
||||
The :meth:`refresh` call displays a section of the pad in the rectangle
|
||||
extending from coordinate (5,5) to coordinate (20,75) on the screen; the upper
|
||||
left corner of the displayed section is coordinate (0,0) on the pad. Beyond
|
||||
that difference, pads are exactly like ordinary windows and support the same
|
||||
methods.
|
||||
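
Because the six positional arguments to a pad's ``refresh()`` are easy to mix up, it can help to name them. The helper below is a hypothetical convenience, not part of curses; it only puts the values in the order described above.

```python
def pad_refresh_args(pad_top, pad_left,
                     screen_top, screen_left,
                     screen_bottom, screen_right):
    """Order the six arguments the way a pad's refresh() expects them."""
    return (pad_top, pad_left,            # upper-left corner within the pad
            screen_top, screen_left,      # upper-left corner on the screen
            screen_bottom, screen_right)  # lower-right corner on the screen


# The same view as in the example above.
args = pad_refresh_args(pad_top=0, pad_left=0,
                        screen_top=5, screen_left=5,
                        screen_bottom=20, screen_right=75)

# Scrolling down by ten rows changes only the pad-side corner.
scrolled = pad_refresh_args(pad_top=10, pad_left=0,
                            screen_top=5, screen_left=5,
                            screen_bottom=20, screen_right=75)
print(args)
print(scrolled)
```

With a real pad, the call would then be ``pad.refresh(*args)``.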
|
||||
If you have multiple windows and pads on screen, there is a more efficient way to
|
||||
go, which will prevent annoying screen flicker at refresh time. Use the
|
||||
:meth:`noutrefresh` method of each window to update the data structure
|
||||
representing the desired state of the screen; then change the physical screen to
|
||||
match the desired state in one go with the function :func:`doupdate`. The
|
||||
normal :meth:`refresh` method calls :func:`doupdate` as its last act.
|
||||
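
A sketch of this two-step update, assuming a list of ordinary window objects (pads would need their six coordinate arguments passed to ``noutrefresh``); ``refresh_all`` is a hypothetical helper name.

```python
import curses


def refresh_all(windows):
    """Redraw several windows with a single physical screen update."""
    for win in windows:
        win.noutrefresh()   # update curses' picture of the desired screen
    curses.doupdate()       # one pass to bring the terminal up to date
```

Calling ``win.refresh()`` on each window instead would update the physical screen once per window, which is where the flicker comes from.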
|
||||
|
||||
Displaying Text
|
||||
===============
|
||||
|
||||
From a C programmer's point of view, curses may sometimes look like a twisty
|
||||
maze of functions, all subtly different. For example, :func:`addstr` displays a
|
||||
string at the current cursor location in the ``stdscr`` window, while
|
||||
:func:`mvaddstr` moves to a given y,x coordinate first before displaying the
|
||||
string. :func:`waddstr` is just like :func:`addstr`, but allows specifying a
|
||||
window to use, instead of using ``stdscr`` by default. :func:`mvwaddstr` follows
|
||||
similarly.
|
||||
|
||||
Fortunately the Python interface hides all these details; ``stdscr`` is a window
|
||||
object like any other, and methods like :func:`addstr` accept multiple argument
|
||||
forms. Usually there are four different forms.
|
||||
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| Form | Description |
|
||||
+=================================+===============================================+
|
||||
| *str* or *ch* | Display the string *str* or character *ch* at |
|
||||
| | the current position |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *str* or *ch*, *attr* | Display the string *str* or character *ch*, |
|
||||
| | using attribute *attr* at the current |
|
||||
| | position |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *y*, *x*, *str* or *ch* | Move to position *y,x* within the window, and |
|
||||
| | display *str* or *ch* |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
| *y*, *x*, *str* or *ch*, *attr* | Move to position *y,x* within the window, and |
|
||||
| | display *str* or *ch*, using attribute *attr* |
|
||||
+---------------------------------+-----------------------------------------------+
|
||||
|
||||
Attributes allow displaying text in highlighted forms, such as in boldface,
underline, reverse video, or in color. They'll be explained in more detail in
the next subsection.
|
||||
|
||||
The :func:`addstr` function takes a Python string as the value to be displayed,
|
||||
while the :func:`addch` functions take a character, which can be either a Python
|
||||
string of length 1 or an integer. If it's a string, you're limited to
|
||||
displaying characters between 0 and 255. SVr4 curses provides constants for
|
||||
extension characters; these constants are integers greater than 255. For
|
||||
example, :const:`ACS_PLMINUS` is a +/- symbol, and :const:`ACS_ULCORNER` is the
|
||||
upper left corner of a box (handy for drawing borders).
|
||||
|
||||
Windows remember where the cursor was left after the last operation, so if you
|
||||
leave out the *y,x* coordinates, the string or character will be displayed
|
||||
wherever the last operation left off. You can also move the cursor with the
|
||||
``move(y,x)`` method. Because some terminals always display a flashing cursor,
|
||||
you may want to ensure that the cursor is positioned in some location where it
|
||||
won't be distracting; it can be confusing to have the cursor blinking at some
|
||||
apparently random location.
|
||||
|
||||
If your application doesn't need a blinking cursor at all, you can call
|
||||
``curs_set(0)`` to make it invisible. Equivalently, and for compatibility with
|
||||
older curses versions, there's a ``leaveok(bool)`` function. When *bool* is
|
||||
true, the curses library will attempt to suppress the flashing cursor, and you
|
||||
won't need to worry about leaving it in odd locations.
|
||||
|
||||
|
||||
Attributes and Color
|
||||
--------------------
|
||||
|
||||
Characters can be displayed in different ways. Status lines in a text-based
|
||||
application are commonly shown in reverse video; a text viewer may need to
|
||||
highlight certain words. curses supports this by allowing you to specify an
|
||||
attribute for each cell on the screen.
|
||||
|
||||
An attribute is an integer, each bit representing a different attribute. You can
|
||||
try to display text with multiple attribute bits set, but curses doesn't
|
||||
guarantee that all the possible combinations are available, or that they're all
|
||||
visually distinct. That depends on the ability of the terminal being used, so
|
||||
it's safest to stick to the most commonly available attributes, listed here.
|
||||
|
||||
+----------------------+--------------------------------------+
|
||||
| Attribute | Description |
|
||||
+======================+======================================+
|
||||
| :const:`A_BLINK` | Blinking text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_BOLD` | Extra bright or bold text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_DIM` | Half bright text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_REVERSE` | Reverse-video text |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_STANDOUT` | The best highlighting mode available |
|
||||
+----------------------+--------------------------------------+
|
||||
| :const:`A_UNDERLINE` | Underlined text |
|
||||
+----------------------+--------------------------------------+
|
||||
|
||||
So, to display a reverse-video status line on the top line of the screen, you
|
||||
could code::
|
||||
|
||||
   stdscr.addstr(0, 0, "Current mode: Typing mode",
                 curses.A_REVERSE)
   stdscr.refresh()
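
Since each attribute is a separate bit, several of them can be combined with
bitwise OR and tested with bitwise AND. A small sketch follows; the variable
names are illustrative, and whether a given terminal actually renders the
combination is, as noted above, not guaranteed.

```python
import curses

# Combine two attributes into one integer value.
highlight = curses.A_BOLD | curses.A_UNDERLINE

# Test which bits are set in the combined value.
has_bold = bool(highlight & curses.A_BOLD)
has_blink = bool(highlight & curses.A_BLINK)

# In a running program: stdscr.addstr(0, 0, "Important", highlight)
```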
|
||||
|
||||
The curses library also supports color on those terminals that provide it. The
|
||||
most common such terminal is probably the Linux console, followed by color
|
||||
xterms.
|
||||
|
||||
To use color, you must call the :func:`start_color` function soon after calling
|
||||
:func:`initscr`, to initialize the default color set (the
|
||||
:func:`curses.wrapper.wrapper` function does this automatically). Once that's
|
||||
done, the :func:`has_colors` function returns TRUE if the terminal in use can
|
||||
actually display color. (Note: curses uses the American spelling 'color',
|
||||
instead of the Canadian/British spelling 'colour'. If you're used to the
|
||||
British spelling, you'll have to resign yourself to misspelling it for the sake
|
||||
of these functions.)
|
||||
|
||||
The curses library maintains a finite number of color pairs, containing a
|
||||
foreground (or text) color and a background color. You can get the attribute
|
||||
value corresponding to a color pair with the :func:`color_pair` function; this
|
||||
can be bitwise-OR'ed with other attributes such as :const:`A_REVERSE`, but
|
||||
again, such combinations are not guaranteed to work on all terminals.
|
||||
|
||||
An example, which displays a line of text using color pair 1::
|
||||
|
||||
   stdscr.addstr("Pretty text", curses.color_pair(1))
   stdscr.refresh()
|
||||
|
||||
As I said before, a color pair consists of a foreground and background color.
|
||||
:func:`start_color` initializes 8 basic colors when it activates color mode.
|
||||
They are: 0:black, 1:red, 2:green, 3:yellow, 4:blue, 5:magenta, 6:cyan, and
|
||||
7:white. The curses module defines named constants for each of these colors:
|
||||
:const:`curses.COLOR_BLACK`, :const:`curses.COLOR_RED`, and so forth.
|
||||
|
||||
The ``init_pair(n, f, b)`` function changes the definition of color pair *n* to
foreground color *f* and background color *b*. Color pair 0 is hard-wired to white
on black, and cannot be changed.
|
||||
|
||||
Let's put all this together. To change color pair 1 to red text on a white
background, you would call::

   curses.init_pair(1, curses.COLOR_RED, curses.COLOR_WHITE)
|
||||
|
||||
When you change a color pair, any text already displayed using that color pair
|
||||
will change to the new colors. You can also display new text in this color
|
||||
with::
|
||||
|
||||
   stdscr.addstr(0, 0, "RED ALERT!", curses.color_pair(1))
|
||||
|
||||
Very fancy terminals can change the definitions of the actual colors to a given
|
||||
RGB value. This lets you change color 1, which is usually red, to purple or
|
||||
blue or any other color you like. Unfortunately, the Linux console doesn't
|
||||
support this, so I'm unable to try it out, and can't provide any examples. You
|
||||
can check if your terminal can do this by calling :func:`can_change_color`,
|
||||
which returns TRUE if the capability is there. If you're lucky enough to have
|
||||
such a talented terminal, consult your system's man pages for more information.
|
||||
|
||||
|
||||
User Input
|
||||
==========
|
||||
|
||||
The curses library itself offers only very simple input mechanisms. Python's
support adds a text-input widget that makes up for some of the lack.
|
||||
|
||||
The most common way to get input to a window is to use its :meth:`getch` method.
|
||||
:meth:`getch` pauses and waits for the user to hit a key, displaying it if
|
||||
:func:`echo` has been called earlier. You can optionally specify a coordinate
|
||||
to which the cursor should be moved before pausing.
|
||||
|
||||
It's possible to change this behavior with the method :meth:`nodelay`. After
|
||||
``nodelay(1)``, :meth:`getch` for the window becomes non-blocking and returns
|
||||
``curses.ERR`` (a value of -1) when no input is ready. There's also a
|
||||
:func:`halfdelay` function, which can be used to (in effect) set a timer on each
:meth:`getch`; if no input becomes available within the delay specified as the
argument to :func:`halfdelay` (measured in tenths of a second), curses raises an
exception.
|
||||
|
||||
The :meth:`getch` method returns an integer; if it's between 0 and 255, it
|
||||
represents the ASCII code of the key pressed. Values greater than 255 are
|
||||
special keys such as Page Up, Home, or the cursor keys. You can compare the
|
||||
value returned to constants such as :const:`curses.KEY_PPAGE`,
|
||||
:const:`curses.KEY_HOME`, or :const:`curses.KEY_LEFT`. Usually the main loop of
|
||||
your program will look something like this::
|
||||
|
||||
   while True:
       c = stdscr.getch()
       if c == ord('p'):
           PrintDocument()
       elif c == ord('q'):
           break  # Exit the while loop
       elif c == curses.KEY_HOME:
           x = y = 0
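
One way to keep such a loop readable as it grows is to move the key handling
into a function that takes the key code and returns whether the loop should
continue. The sketch below is only an illustration: ``handle_key`` and the
``state`` dictionary are made-up names, and the choice of keys is arbitrary.

```python
import curses


def handle_key(c, state):
    """Update 'state' for key code 'c'; return False when the loop should stop."""
    if c == curses.ERR:          # -1: no key was waiting (nodelay mode)
        return True
    if c == ord('q'):
        return False
    if c == curses.KEY_HOME:
        state['x'] = state['y'] = 0
    elif c == curses.KEY_LEFT:
        state['x'] = max(0, state['x'] - 1)
    return True


state = {'x': 3, 'y': 7}
keep_going = handle_key(curses.KEY_LEFT, state)
```

The main loop then reduces to calling ``handle_key(stdscr.getch(), state)``
until it returns a false value.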
|
||||
|
||||
The :mod:`curses.ascii` module supplies ASCII class membership functions that
|
||||
take either integer or 1-character-string arguments; these may be useful in
|
||||
writing more readable tests for your command interpreters. It also supplies
|
||||
conversion functions that take either integer or 1-character-string arguments
|
||||
and return the same type. For example, :func:`curses.ascii.ctrl` returns the
|
||||
control character corresponding to its argument.
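
A short sketch of these helpers follows. The function ``is_command_char`` is a
made-up example of a "more readable test"; only the ``curses.ascii`` calls are
part of the library.

```python
import curses.ascii

# Conversion functions return the same type they were given.
ctrl_a_str = curses.ascii.ctrl('a')        # a 1-character string
ctrl_a_int = curses.ascii.ctrl(ord('a'))   # an integer


def is_command_char(c):
    """True for printable, non-digit characters (string or integer input)."""
    return curses.ascii.isprint(c) and not curses.ascii.isdigit(c)
```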
|
||||
|
||||
There's also a method to retrieve an entire string, :meth:`getstr`. It isn't
|
||||
used very often, because its functionality is quite limited; the only editing
|
||||
keys available are the backspace key and the Enter key, which terminates the
|
||||
string. It can optionally be limited to a fixed number of characters. ::
|
||||
|
||||
   curses.echo()            # Enable echoing of characters

   # Get a 15-character string, with the cursor on the top line
   s = stdscr.getstr(0, 0, 15)
|
||||
|
||||
The Python :mod:`curses.textpad` module supplies something better. With it, you
|
||||
can turn a window into a text box that supports an Emacs-like set of
|
||||
keybindings. Various methods of the :class:`Textbox` class support editing with
|
||||
input validation and gathering the edit results either with or without trailing
|
||||
spaces. See the library documentation on :mod:`curses.textpad` for the
|
||||
details.
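
As a rough sketch of how a text box might be used: the function name
``ask_for_message``, the prompt text, and the window sizes below are
arbitrary choices for illustration. The function needs a real terminal, so it
is only defined here and not called.

```python
import curses
from curses.textpad import Textbox, rectangle


def ask_for_message(stdscr):
    """Show a small framed edit window and return what the user typed."""
    stdscr.addstr(0, 0, "Enter a message (press Ctrl-G to finish):")

    # A 5-line by 30-column window whose upper left corner is at (2, 1).
    editwin = curses.newwin(5, 30, 2, 1)
    # Draw a frame around it on the main screen.
    rectangle(stdscr, 1, 0, 1 + 5 + 1, 1 + 30 + 1)
    stdscr.refresh()

    box = Textbox(editwin)
    box.edit()              # lets the user edit until Ctrl-G is pressed
    return box.gather()     # the window contents as a string


# In a real terminal: message = curses.wrapper(ask_for_message)
```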
|
||||
|
||||
|
||||
For More Information
|
||||
====================
|
||||
|
||||
This HOWTO didn't cover some advanced topics, such as screen-scraping or
|
||||
capturing mouse events from an xterm instance. But the Python library page for
|
||||
the curses modules is now pretty complete. You should browse it next.
|
||||
|
||||
If you're in doubt about the detailed behavior of any of the ncurses entry
|
||||
points, consult the manual pages for your curses implementation, whether it's
|
||||
ncurses or a proprietary Unix vendor's. The manual pages will document any
|
||||
quirks, and provide complete lists of all the functions, attributes, and
|
||||
:const:`ACS_\*` characters available to you.
|
||||
|
||||
Because the curses API is so large, some functions aren't supported in the
|
||||
Python interface, not because they're difficult to implement, but because no one
|
||||
has needed them yet. Feel free to add them and then submit a patch. Also, we
|
||||
don't yet have support for the menus or panels libraries associated with
|
||||
ncurses; feel free to add that.
|
||||
|
||||
If you write an interesting little program, feel free to contribute it as
|
||||
another demo. We can always use more of them!
|
||||
|
||||
The ncurses FAQ: http://dickey.his.com/ncurses/ncurses.faq.html
|
||||
|
308
Doc/howto/doanddont.rst
Normal file
|
@ -0,0 +1,308 @@
|
|||
************************************
|
||||
Idioms and Anti-Idioms in Python
|
||||
************************************
|
||||
|
||||
:Author: Moshe Zadka
|
||||
|
||||
This document is placed in the public domain.
|
||||
|
||||
|
||||
.. topic:: Abstract

   This document can be considered a companion to the tutorial. It shows how to use
   Python, and even more importantly, how *not* to use Python.
|
||||
|
||||
|
||||
Language Constructs You Should Not Use
|
||||
======================================
|
||||
|
||||
While Python has relatively few gotchas compared to other languages, it still
|
||||
has some constructs which are only useful in corner cases, or are plain
|
||||
dangerous.
|
||||
|
||||
|
||||
from module import \*
|
||||
---------------------
|
||||
|
||||
|
||||
Inside Function Definitions
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
``from module import *`` is *invalid* inside function definitions. While many
|
||||
versions of Python do not check for the invalidity, it does not make it more
|
||||
valid, no more than having a smart lawyer makes a man innocent. Do not use it
|
||||
like that ever. Even in versions where it was accepted, it made the function
|
||||
execution slower, because the compiler could not be certain which names are
|
||||
local and which are global. In Python 2.1 this construct causes warnings, and
|
||||
sometimes even errors.
|
||||
|
||||
|
||||
At Module Level
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
While it is valid to use ``from module import *`` at module level it is usually
|
||||
a bad idea. For one, this loses an important property Python otherwise has ---
|
||||
you can know where each toplevel name is defined by a simple "search" function
|
||||
in your favourite editor. You also open yourself to trouble in the future, if
|
||||
some module grows additional functions or classes.
|
||||
|
||||
One of the most awful questions asked on the newsgroup is why this code::

   f = open("www")
   f.read()
|
||||
|
||||
does not work. Of course, it works just fine (assuming you have a file called
|
||||
"www".) But it does not work if somewhere in the module, the statement ``from os
|
||||
import *`` is present. The :mod:`os` module has a function called :func:`open`
|
||||
which returns an integer. While it is very useful, shadowing builtins is one of
|
||||
its least useful properties.
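
The shadowing can be observed directly. The sketch below runs the star import
inside a scratch dictionary (the name ``namespace`` is arbitrary) so that the
surrounding program is not affected.

```python
import builtins
import os

# Simulate a module body that contains "from os import *".
namespace = {}
exec("from os import *", namespace)

# The name "open" in that namespace now refers to os.open, not the builtin.
shadowed = namespace["open"] is os.open
still_builtin = namespace["open"] is builtins.open
```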
|
||||
|
||||
Remember, you can never know for sure what names a module exports, so either
|
||||
take what you need --- ``from module import name1, name2``, or keep them in the
|
||||
module and access on a per-need basis --- ``import module; print(module.name)``.
|
||||
|
||||
|
||||
When It Is Just Fine
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
There are situations in which ``from module import *`` is just fine:
|
||||
|
||||
* The interactive prompt. For example, ``from math import *`` makes Python an
  amazing scientific calculator.

* When extending a module in C with a module in Python.

* When the module advertises itself as ``from import *`` safe.
|
||||
|
||||
|
||||
Unadorned :keyword:`exec` and friends
|
||||
-------------------------------------
|
||||
|
||||
The word "unadorned" refers to the use without an explicit dictionary, in which
|
||||
case those constructs evaluate code in the *current* environment. This is
|
||||
dangerous for the same reasons ``from import *`` is dangerous --- it might step
|
||||
over variables you are counting on and mess up things for the rest of your code.
|
||||
Simply do not do that.
|
||||
|
||||
Bad examples::
|
||||
|
||||
   >>> for name in sys.argv[1:]:
   ...     exec("%s=1" % name)
   >>> def func(s, **kw):
   ...     for var, val in kw.items():
   ...         exec("s.%s=val" % var)  # invalid!
   >>> exec(open("handler.py").read())
   >>> handle()
|
||||
|
||||
Good examples::
|
||||
|
||||
   >>> d = {}
   >>> for name in sys.argv[1:]:
   ...     d[name] = 1
   >>> def func(s, **kw):
   ...     for var, val in kw.items():
   ...         setattr(s, var, val)
   >>> d = {}
   >>> exec(open("handler.py").read(), d, d)
   >>> handle = d['handle']
   >>> handle()
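
The effect of the explicit dictionary can be checked without any external
file. In this sketch the source text is an inline string standing in for the
contents of a handler file, and the names ``source``, ``namespace`` and
``handler`` are illustrative.

```python
# Stand-in for the text that would be read from a handler file.
source = '''
def handle():
    return "handled"
'''

namespace = {}
exec(source, namespace)          # names land in 'namespace', nowhere else

handler = namespace["handle"]
result = handler()

# Everything the executed code defined, ignoring the injected __builtins__.
defined_names = sorted(n for n in namespace if not n.startswith("__"))
```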
|
||||
|
||||
|
||||
from module import name1, name2
|
||||
-------------------------------
|
||||
|
||||
This is a "don't" which is much weaker than the previous "don't"s but is still
something you should not do if you don't have good reasons to do that. The
reason it is usually a bad idea is because you suddenly have an object which lives
in two separate namespaces. When the binding in one namespace changes, the
binding in the other will not, so there will be a discrepancy between them. This
happens when, for example, one module is reloaded, or changes the definition of
a function at runtime.
|
||||
|
||||
Bad example::
|
||||
|
||||
   # foo.py
   a = 1

   # bar.py
   from foo import a
   if something():
       a = 2 # danger: foo.a != a
|
||||
|
||||
Good example::
|
||||
|
||||
   # foo.py
   a = 1

   # bar.py
   import foo
   if something():
       foo.a = 2
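
The discrepancy between the two namespaces can be reproduced in a single
script. The sketch below builds a throwaway module object in place of a real
``foo.py``; the module name ``foo_demo`` is made up for the illustration.

```python
import sys
import types

# Build a stand-in for foo.py and register it so it can be imported.
foo_demo = types.ModuleType("foo_demo")
foo_demo.a = 1
sys.modules["foo_demo"] = foo_demo

from foo_demo import a      # copies the current binding into this namespace

foo_demo.a = 2              # rebinding the name inside the module...
snapshot = a                # ...does not affect the name imported earlier
current = foo_demo.a        # attribute access always sees the new value
```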
|
||||
|
||||
|
||||
except:
|
||||
-------
|
||||
|
||||
Python has the ``except:`` clause, which catches all exceptions. Since *every*
|
||||
error in Python raises an exception, this makes many programming errors look
|
||||
like runtime problems, and hinders the debugging process.
|
||||
|
||||
The following code shows a great example::
|
||||
|
||||
   try:
       foo = opne("file") # misspelled "open"
   except:
       sys.exit("could not open file!")
|
||||
|
||||
The second line triggers a :exc:`NameError` which is caught by the except
|
||||
clause. The program will exit, and you will have no idea that this has nothing
|
||||
to do with the readability of ``"file"``.
|
||||
|
||||
The example above is better written ::
|
||||
|
||||
   try:
       foo = opne("file") # will be changed to "open" as soon as we run it
   except IOError:
       sys.exit("could not open file")
|
||||
|
||||
There are some situations in which the ``except:`` clause is useful: for
|
||||
example, in a framework when running callbacks, it is good not to let any
|
||||
callback disturb the framework.
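
A minimal sketch of the framework case follows. The names ``run_callbacks``,
``good`` and ``bad`` are illustrative. Note that it catches
:exc:`Exception` rather than using a bare ``except:``, which is a deliberate
variation: it still isolates ordinary errors, but lets
:exc:`KeyboardInterrupt` and :exc:`SystemExit` through.

```python
import traceback


def run_callbacks(callbacks):
    """Call every callback; collect failures instead of letting one stop the rest."""
    failures = []
    for callback in callbacks:
        try:
            callback()
        except Exception:
            # Keep the traceback text so the error is reported, not hidden.
            failures.append((callback, traceback.format_exc()))
    return failures


calls = []

def good():
    calls.append("good")

def bad():
    raise ValueError("boom")

failures = run_callbacks([bad, good])
```

Recording the traceback keeps the debugging information that a silent
catch-all would throw away.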
|
||||
|
||||
|
||||
Exceptions
|
||||
==========
|
||||
|
||||
Exceptions are a useful feature of Python. You should learn to raise them
|
||||
whenever something unexpected occurs, and catch them only where you can do
|
||||
something about them.
|
||||
|
||||
The following is a very popular anti-idiom ::
|
||||
|
||||
   def get_status(file):
       if not os.path.exists(file):
           print("file not found")
           sys.exit(1)
       return open(file).readline()
|
||||
|
||||
Consider the case where the file gets deleted between the time the call to
:func:`os.path.exists` is made and the time :func:`open` is called. That means
the last line will throw an :exc:`IOError`. The same would happen if *file*
exists but has no read permission. Since testing this on a normal machine with
existing and non-existing files makes it seem bugless, the test results will
seem fine, and the code will get shipped. Then an unhandled
:exc:`IOError` escapes to the user, who has to watch the ugly traceback.
|
||||
|
||||
Here is a better way to do it. ::
|
||||
|
||||
   def get_status(file):
       try:
           return open(file).readline()
       except (IOError, OSError):
           print("file not found")
           sys.exit(1)
|
||||
|
||||
In this version, *either* the file gets opened and the line is read (so it
|
||||
works even on flaky NFS or SMB connections), or the message is printed and the
|
||||
application aborted.
|
||||
|
||||
Still, :func:`get_status` makes too many assumptions --- that it will only be
|
||||
used in a short running script, and not, say, in a long running server. Sure,
|
||||
the caller could do something like ::
|
||||
|
||||
   try:
       status = get_status(log)
   except SystemExit:
       status = None
|
||||
|
||||
So, try to use as few ``except`` clauses as possible in your code --- those will usually be
|
||||
a catch-all in the :func:`main`, or inside calls which should always succeed.
|
||||
|
||||
So, the best version is probably ::
|
||||
|
||||
   def get_status(file):
       return open(file).readline()
|
||||
|
||||
The caller can deal with the exception if it wants (for example, if it tries
|
||||
several files in a loop), or just let the exception filter upwards to *its*
|
||||
caller.
|
||||
|
||||
The last version is not very good either --- due to implementation details, the
|
||||
file would not be closed when an exception is raised until the handler finishes,
|
||||
and perhaps not at all in non-C implementations (e.g., Jython). Closing the file
explicitly in a ``finally`` clause avoids this::

   def get_status(file):
       fp = open(file)
       try:
           return fp.readline()
       finally:
           fp.close()
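
On Python versions that provide the ``with`` statement, the same guarantee can
be written more compactly, since a file object closes itself when the block is
left. The sketch below is one possible form; the temporary file exists only to
make the example self-contained.

```python
import os
import tempfile


def get_status(file):
    # The file is closed when the block exits, whether by return or exception.
    with open(file) as fp:
        return fp.readline()


# Create a throwaway file to read from.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("ok\nsecond line\n")

status = get_status(tmp.name)
os.remove(tmp.name)
```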
|
||||
|
||||
|
||||
Using the Batteries
|
||||
===================
|
||||
|
||||
Every so often, people seem to be writing stuff in the Python library again,
|
||||
usually poorly. While the occasional module has a poor interface, it is usually
|
||||
much better to use the rich standard library and data types that come with
|
||||
Python than inventing your own.
|
||||
|
||||
A useful module very few people know about is :mod:`os.path`. It always has the
|
||||
correct path arithmetic for your operating system, and will usually be much
|
||||
better than whatever you come up with yourself.
|
||||
|
||||
Compare::
|
||||
|
||||
   # ugh!
   return dir+"/"+file
   # better
   return os.path.join(dir, file)
|
||||
|
||||
More useful functions in :mod:`os.path`: :func:`basename`, :func:`dirname` and
|
||||
:func:`splitext`.
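
A brief sketch of these functions working together; the path components used
here are arbitrary examples.

```python
import os.path

# Build a path with the separator appropriate for the current platform.
path = os.path.join("logs", "2007", "server.log")

directory = os.path.dirname(path)             # everything before the last component
filename = os.path.basename(path)             # the last component
stem, extension = os.path.splitext(filename)  # split off the extension
```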
|
||||
|
||||
There are also many useful builtin functions people seem not to be aware of for
|
||||
some reason: :func:`min` and :func:`max` can find the minimum/maximum of any
|
||||
sequence with comparable semantics, for example, yet many people write their own
|
||||
:func:`max`/:func:`min`. Another highly useful function is :func:`reduce`. A
|
||||
classical use of :func:`reduce` is something like ::
|
||||
|
||||
   import sys, operator
   from functools import reduce  # needed on Python 3, where reduce is not a builtin
   nums = list(map(float, sys.argv[1:]))
   print(reduce(operator.add, nums) / len(nums))
|
||||
|
||||
This cute little script prints the average of all numbers given on the command
|
||||
line. The :func:`reduce` adds up all the numbers, and the rest is just some
|
||||
pre- and postprocessing.
|
||||
|
||||
Along the same lines, note that :func:`float`, :func:`int` and :func:`long` all
|
||||
accept arguments of type string, and so are suited to parsing --- assuming you
|
||||
are ready to deal with the :exc:`ValueError` they raise.
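
For instance, a parser that separates valid numbers from junk might look like
the sketch below; ``parse_numbers`` is a made-up helper name.

```python
def parse_numbers(words):
    """Split 'words' into parsed floats and the strings that failed to parse."""
    numbers, rejected = [], []
    for word in words:
        try:
            numbers.append(float(word))
        except ValueError:
            rejected.append(word)
    return numbers, rejected


numbers, rejected = parse_numbers(["1", "2.5", "abc", "1e3"])
```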
|
||||
|
||||
|
||||
Using Backslash to Continue Statements
|
||||
======================================
|
||||
|
||||
Since Python treats a newline as a statement terminator, and since statements
are often longer than is comfortable to put in one line, many people do::

   if foo.bar()['first'][0] == baz.quux(1, 2)[5:9] and \
      calculate_number(10, 20) != forbulate(500, 360):
       pass
|
||||
|
||||
You should realize that this is dangerous: a stray space after the ``\`` would
|
||||
make this line wrong, and stray spaces are notoriously hard to see in editors.
|
||||
In this case, at least it would be a syntax error, but if the code was::
|
||||
|
||||
   value = foo.bar()['first'][0]*baz.quux(1, 2)[5:9] \
           + calculate_number(10, 20)*forbulate(500, 360)
|
||||
|
||||
then it would just be subtly wrong.
|
||||
|
||||
It is usually much better to use the implicit continuation inside parentheses.

This version is bulletproof::

   value = (foo.bar()['first'][0]*baz.quux(1, 2)[5:9]
            + calculate_number(10, 20)*forbulate(500, 360))
|
||||
|
1400
Doc/howto/functional.rst
Normal file
File diff suppressed because it is too large
25
Doc/howto/index.rst
Normal file
|
@ -0,0 +1,25 @@
|
|||
***************
|
||||
Python HOWTOs
|
||||
***************
|
||||
|
||||
Python HOWTOs are documents that cover a single, specific topic,
|
||||
and attempt to cover it fairly completely. Modelled on the Linux
|
||||
Documentation Project's HOWTO collection, this collection is an
|
||||
effort to foster documentation that's more detailed than the
|
||||
Python Library Reference.
|
||||
|
||||
Currently, the HOWTOs are:
|
||||
|
||||
.. toctree::
   :maxdepth: 1

   advocacy.rst
   pythonmac.rst
   curses.rst
   doanddont.rst
   functional.rst
   regex.rst
   sockets.rst
   unicode.rst
   urllib2.rst
|
||||
|
202
Doc/howto/pythonmac.rst
Normal file
|
@ -0,0 +1,202 @@
|
|||
|
||||
.. _using-on-mac:
|
||||
|
||||
***************************
|
||||
Using Python on a Macintosh
|
||||
***************************
|
||||
|
||||
:Author: Bob Savage <bobsavage@mac.com>
|
||||
|
||||
|
||||
Python on a Macintosh running Mac OS X is in principle very similar to Python on
|
||||
any other Unix platform, but there are a number of additional features such as
|
||||
the IDE and the Package Manager that are worth pointing out.
|
||||
|
||||
The Mac-specific modules are documented in :ref:`mac-specific-services`.
|
||||
|
||||
Python on Mac OS 9 or earlier can be quite different from Python on Unix or
|
||||
Windows, but is beyond the scope of this manual, as that platform is no longer
|
||||
supported, starting with Python 2.4. See http://www.cwi.nl/~jack/macpython for
|
||||
installers for the latest 2.3 release for Mac OS 9 and related documentation.
|
||||
|
||||
|
||||
.. _getting-osx:
|
||||
|
||||
Getting and Installing MacPython
|
||||
================================
|
||||
|
||||
Mac OS X 10.4 comes with Python 2.3 pre-installed by Apple. However, you are
|
||||
encouraged to install the most recent version of Python from the Python website
|
||||
(http://www.python.org). A "universal binary" build of Python 2.5, which runs
|
||||
natively on the Mac's new Intel and legacy PPC CPUs, is available there.
|
||||
|
||||
What you get after installing is a number of things:
|
||||
|
||||
* A :file:`MacPython 2.5` folder in your :file:`Applications` folder. In here
  you find IDLE, the development environment that is a standard part of official
  Python distributions; PythonLauncher, which handles double-clicking Python
  scripts from the Finder; and the "Build Applet" tool, which allows you to
  package Python scripts as standalone applications on your system.

* A framework :file:`/Library/Frameworks/Python.framework`, which includes the
  Python executable and libraries. The installer adds this location to your shell
  path.

* A symlink to the Python executable, placed in :file:`/usr/local/bin/`.

To uninstall MacPython, you can simply remove these three things.
|
||||
|
||||
The Apple-provided build of Python is installed in
|
||||
:file:`/System/Library/Frameworks/Python.framework` and :file:`/usr/bin/python`,
|
||||
respectively. You should never modify or delete these, as they are
|
||||
Apple-controlled and are used by Apple- or third-party software.
|
||||
|
||||
IDLE includes a help menu that allows you to access Python documentation. If you
|
||||
are completely new to Python you should start reading the tutorial introduction
|
||||
in that document.
|
||||
|
||||
If you are familiar with Python on other Unix platforms you should read the
|
||||
section on running Python scripts from the Unix shell.
|
||||
|
||||
|
||||
How to run a Python script
|
||||
--------------------------
|
||||
|
||||
The best way to get started with Python on Mac OS X is through the IDLE
integrated development environment; see section :ref:`ide` and use the Help menu
when the IDE is running.
|
||||
|
||||
If you want to run Python scripts from the Terminal window command line or from
|
||||
the Finder you first need an editor to create your script. Mac OS X comes with a
|
||||
number of standard Unix command line editors, :program:`vim` and
|
||||
:program:`emacs` among them. If you want a more Mac-like editor,
|
||||
:program:`BBEdit` or :program:`TextWrangler` from Bare Bones Software (see
|
||||
http://www.barebones.com/products/bbedit/index.shtml) are good choices, as is
|
||||
:program:`TextMate` (see http://macromates.com/). Other editors include
|
||||
:program:`Gvim` (http://macvim.org) and :program:`Aquamacs`
|
||||
(http://aquamacs.org).
|
||||
|
||||
To run your script from the Terminal window you must make sure that
|
||||
:file:`/usr/local/bin` is in your shell search path.
|
||||
|
||||
To run your script from the Finder you have two options:
|
||||
|
||||
* Drag it to :program:`PythonLauncher`

* Select :program:`PythonLauncher` as the default application to open your
  script (or any .py script) through the Finder Info window and double-click it.
  :program:`PythonLauncher` has various preferences to control how your script is
  launched. Option-dragging allows you to change these for one invocation, or use
  its Preferences menu to change things globally.
|
||||
|
||||
|
||||
.. _osx-gui-scripts:
|
||||
|
||||
Running scripts with a GUI
|
||||
--------------------------
|
||||
|
||||
With older versions of Python, there is one Mac OS X quirk that you need to be
|
||||
aware of: programs that talk to the Aqua window manager (in other words,
|
||||
anything that has a GUI) need to be run in a special way. Use :program:`pythonw`
|
||||
instead of :program:`python` to start such scripts.
|
||||
|
||||
With Python 2.5, you can use either :program:`python` or :program:`pythonw`.
|
||||
|
||||
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
Python on OS X honors all standard Unix environment variables such as
|
||||
:envvar:`PYTHONPATH`, but setting these variables for programs started from the
|
||||
Finder is non-standard as the Finder does not read your :file:`.profile` or
:file:`.cshrc` at startup. You need to create a file
:file:`~/.MacOSX/environment.plist`. See Apple's Technical Document QA1067 for details.
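
As a rough sketch, the contents of such a file could be generated with the
standard :mod:`plistlib` module. This assumes the file is a property list
dictionary mapping variable names to string values, and the
:envvar:`PYTHONPATH` value shown is a made-up example; check Apple's document
before relying on it. The sketch only builds the bytes and does not write to
your home directory.

```python
import plistlib

# Hypothetical setting: the directory is only an example value.
environment = {"PYTHONPATH": "/Users/me/lib/python"}

# Serialize the dictionary as an XML property list.
plist_bytes = plistlib.dumps(environment)

# Reading it back confirms the round trip.
round_trip = plistlib.loads(plist_bytes)

# To install it, the bytes would be written in binary mode to
# ~/.MacOSX/environment.plist.
```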
|
||||
|
||||
For more information on installing Python packages in MacPython, see section
|
||||
:ref:`mac-package-manager`.
|
||||
|
||||
|
||||
.. _ide:
|
||||
|
||||
The IDE
|
||||
=======
|
||||
|
||||
MacPython ships with the standard IDLE development environment. A good
introduction to using IDLE can be found at
http://hkn.eecs.berkeley.edu/~dyoo/python/idle_intro/index.html.
|
||||
|
||||
|
||||
.. _mac-package-manager:
|
||||
|
||||
Installing Additional Python Packages
|
||||
=====================================
|
||||
|
||||
There are several methods to install additional Python packages:
|
||||
|
||||
* http://pythonmac.org/packages/ contains selected compiled packages for Python
  2.5, 2.4, and 2.3.

* Packages can be installed via the standard Python distutils mode (``python
  setup.py install``).

* Many packages can also be installed via the :program:`setuptools` extension.
|
||||
|
||||
|
||||
GUI Programming on the Mac
|
||||
==========================
|
||||
|
||||
There are several options for building GUI applications on the Mac with Python.
|
||||
|
||||
*PyObjC* is a Python binding to Apple's Objective-C/Cocoa framework, which is
|
||||
the foundation of most modern Mac development. Information on PyObjC is
|
||||
available from http://pyobjc.sourceforge.net.
|
||||
|
||||
The standard Python GUI toolkit is :mod:`Tkinter`, based on the cross-platform
|
||||
Tk toolkit (http://www.tcl.tk). An Aqua-native version of Tk is bundled with OS
|
||||
X by Apple, and the latest version can be downloaded and installed from
|
||||
http://www.activestate.com; it can also be built from source.
|
||||
|
||||
*wxPython* is another popular cross-platform GUI toolkit that runs natively on
|
||||
Mac OS X. Packages and documentation are available from http://www.wxpython.org.
|
||||
|
||||
*PyQt* is another popular cross-platform GUI toolkit that runs natively on Mac
|
||||
OS X. More information can be found at
|
||||
http://www.riverbankcomputing.co.uk/pyqt/.
|
||||
|
||||
|
||||
Distributing Python Applications on the Mac
|
||||
===========================================
|
||||
|
||||
The "Build Applet" tool that is placed in the MacPython 2.5 folder is fine for
|
||||
packaging small Python scripts on your own machine to run as a standard Mac
|
||||
application. This tool, however, is not robust enough to distribute Python
|
||||
applications to other users.
|
||||
|
||||
The standard tool for deploying standalone Python applications on the Mac is
|
||||
:program:`py2app`. More information on installing and using py2app can be found
|
||||
at http://undefined.org/python/#py2app.
|
||||
|
||||
|
||||
Application Scripting
|
||||
=====================
|
||||
|
||||
Python can also be used to script other Mac applications via Apple's Open
|
||||
Scripting Architecture (OSA); see http://appscript.sourceforge.net. Appscript is
|
||||
a high-level, user-friendly Apple event bridge that allows you to control
|
||||
scriptable Mac OS X applications using ordinary Python scripts. Appscript makes
|
||||
Python a serious alternative to Apple's own *AppleScript* language for
|
||||
automating your Mac. A related package, *PyOSA*, is an OSA language component
|
||||
for the Python scripting language, allowing Python code to be executed by any
|
||||
OSA-enabled application (Script Editor, Mail, iTunes, etc.). PyOSA makes Python
|
||||
a full peer to AppleScript.
|
||||
|
||||
|
||||
Other Resources
|
||||
===============
|
||||
|
||||
The MacPython mailing list is an excellent support resource for Python users and
|
||||
developers on the Mac:
|
||||
|
||||
http://www.python.org/community/sigs/current/pythonmac-sig/
|
||||
|
||||
Another useful resource is the MacPython wiki:
|
||||
|
||||
http://wiki.python.org/moin/MacPython
|
||||
|
1377
Doc/howto/regex.rst
Normal file
File diff suppressed because it is too large
421
Doc/howto/sockets.rst
Normal file
|
@ -0,0 +1,421 @@
|
|||
****************************
|
||||
Socket Programming HOWTO
|
||||
****************************
|
||||
|
||||
:Author: Gordon McMillan
|
||||
|
||||
|
||||
.. topic:: Abstract
|
||||
|
||||
Sockets are used nearly everywhere, but are one of the most severely
|
||||
misunderstood technologies around. This is a 10,000 foot overview of sockets.
|
||||
It's not really a tutorial - you'll still have work to do in getting things
|
||||
operational. It doesn't cover the fine points (and there are a lot of them), but
|
||||
I hope it will give you enough background to begin using them decently.
|
||||
|
||||
|
||||
Sockets
|
||||
=======
|
||||
|
||||
Sockets are used nearly everywhere, but are one of the most severely
|
||||
misunderstood technologies around. This is a 10,000 foot overview of sockets.
|
||||
It's not really a tutorial - you'll still have work to do in getting things
|
||||
working. It doesn't cover the fine points (and there are a lot of them), but I
|
||||
hope it will give you enough background to begin using them decently.
|
||||
|
||||
I'm only going to talk about INET sockets, but they account for at least 99% of
|
||||
the sockets in use. And I'll only talk about STREAM sockets - unless you really
|
||||
know what you're doing (in which case this HOWTO isn't for you!), you'll get
|
||||
better behavior and performance from a STREAM socket than anything else. I will
|
||||
try to clear up the mystery of what a socket is, as well as some hints on how to
|
||||
work with blocking and non-blocking sockets. But I'll start by talking about
|
||||
blocking sockets. You'll need to know how they work before dealing with
|
||||
non-blocking sockets.
|
||||
|
||||
Part of the trouble with understanding these things is that "socket" can mean a
|
||||
number of subtly different things, depending on context. So first, let's make a
|
||||
distinction between a "client" socket - an endpoint of a conversation, and a
|
||||
"server" socket, which is more like a switchboard operator. The client
|
||||
application (your browser, for example) uses "client" sockets exclusively; the
|
||||
web server it's talking to uses both "server" sockets and "client" sockets.
|
||||
|
||||
|
||||
History
|
||||
-------
|
||||
|
||||
Of the various forms of IPC (*Inter Process Communication*), sockets are by far
|
||||
the most popular. On any given platform, there are likely to be other forms of
|
||||
IPC that are faster, but for cross-platform communication, sockets are about the
|
||||
only game in town.
|
||||
|
||||
They were invented in Berkeley as part of the BSD flavor of Unix. They spread
|
||||
like wildfire with the Internet. With good reason --- the combination of sockets
|
||||
with INET makes talking to arbitrary machines around the world unbelievably easy
|
||||
(at least compared to other schemes).
|
||||
|
||||
|
||||
Creating a Socket
|
||||
=================
|
||||
|
||||
Roughly speaking, when you clicked on the link that brought you to this page,
|
||||
your browser did something like the following::
|
||||
|
||||
   import socket

   # create an INET, STREAMing socket
   s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
   # now connect to the web server on port 80 - the normal http port
   s.connect(("www.mcmillan-inc.com", 80))
|
||||
|
||||
When the ``connect`` completes, the socket ``s`` can now be used to send in a
|
||||
request for the text of this page. The same socket will read the reply, and then
|
||||
be destroyed. That's right - destroyed. Client sockets are normally only used
|
||||
for one exchange (or a small set of sequential exchanges).
|
||||
|
||||
What happens in the web server is a bit more complex. First, the web server
|
||||
creates a "server socket". ::
|
||||
|
||||
   # create an INET, STREAMing socket
   serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
   # bind the socket to a public host, and a well-known port
   serversocket.bind((socket.gethostname(), 80))
   # become a server socket
   serversocket.listen(5)
|
||||
|
||||
A couple of things to notice: we used ``socket.gethostname()`` so that the socket
would be visible to the outside world. If we had used ``s.bind(('localhost', 80))``
or ``s.bind(('127.0.0.1', 80))`` we would still have a "server" socket, but one
that was only visible within the same machine. ``s.bind(('', 80))`` specifies
that the socket is reachable by any address the machine happens to have.
|
||||
|
||||
A second thing to note: low number ports are usually reserved for "well known"
|
||||
services (HTTP, SNMP etc). If you're playing around, use a nice high number (4
|
||||
digits).
|
||||
|
||||
Finally, the argument to ``listen`` tells the socket library that we want it to
|
||||
queue up as many as 5 connect requests (the normal max) before refusing outside
|
||||
connections. If the rest of the code is written properly, that should be plenty.
|
||||
|
||||
OK, now we have a "server" socket, listening on port 80. Now we enter the
|
||||
mainloop of the web server::
|
||||
|
||||
   while True:
       # accept connections from outside
       (clientsocket, address) = serversocket.accept()
       # now do something with the clientsocket
       # in this case, we'll pretend this is a threaded server
       ct = client_thread(clientsocket)
       ct.run()
|
||||
|
||||
There are actually 3 general ways in which this loop could work - dispatching a
thread to handle ``clientsocket``, creating a new process to handle
``clientsocket``, or restructuring this app to use non-blocking sockets and
multiplexing between our "server" socket and any active ``clientsocket``\ s using
``select``. More about that later. The important thing to understand now is
|
||||
this: this is *all* a "server" socket does. It doesn't send any data. It doesn't
|
||||
receive any data. It just produces "client" sockets. Each ``clientsocket`` is
|
||||
created in response to some *other* "client" socket doing a ``connect()`` to the
|
||||
host and port we're bound to. As soon as we've created that ``clientsocket``, we
|
||||
go back to listening for more connections. The two "clients" are free to chat it
|
||||
up - they are using some dynamically allocated port which will be recycled when
|
||||
the conversation ends.
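
Here is a minimal sketch of the first option, one thread per connection. The names ``make_server_socket``, ``serve`` and ``handle_client``, the echo behaviour, and the ``max_clients`` parameter are assumptions made for illustration; they are not part of the example above. Port 0 simply asks the operating system for any free port.

```python
import socket
import threading

def handle_client(clientsocket):
    # Hypothetical handler: echo whatever arrives until the peer closes.
    with clientsocket:
        while True:
            data = clientsocket.recv(4096)
            if not data:
                break
            clientsocket.sendall(data)

def make_server_socket(host="localhost", port=0):
    # Port 0 asks the operating system for any free port.
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.bind((host, port))
    serversocket.listen(5)
    return serversocket

def serve(serversocket, max_clients=None):
    # Accept connections forever, or max_clients times for a demonstration.
    served = 0
    while max_clients is None or served < max_clients:
        clientsocket, address = serversocket.accept()
        # start() runs the handler in a new thread, so this loop goes
        # straight back to accept() while the two "clients" chat.
        threading.Thread(target=handle_client, args=(clientsocket,),
                         daemon=True).start()
        served += 1
```

Note that the "server" socket itself never calls ``send`` or ``recv``; only the sockets returned by ``accept`` do.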
|
||||
|
||||
|
||||
IPC
|
||||
---
|
||||
|
||||
If you need fast IPC between two processes on one machine, you should look into
|
||||
whatever form of shared memory the platform offers. A simple protocol based
|
||||
around shared memory and locks or semaphores is by far the fastest technique.
|
||||
|
||||
If you do decide to use sockets, bind the "server" socket to ``'localhost'``. On
|
||||
most platforms, this will take a shortcut around a couple of layers of network
|
||||
code and be quite a bit faster.
|
||||
|
||||
|
||||
Using a Socket
|
||||
==============
|
||||
|
||||
The first thing to note is that the web browser's "client" socket and the web
|
||||
server's "client" socket are identical beasts. That is, this is a "peer to peer"
|
||||
conversation. Or to put it another way, *as the designer, you will have to
|
||||
decide what the rules of etiquette are for a conversation*. Normally, the
|
||||
``connect``\ ing socket starts the conversation, by sending in a request, or
|
||||
perhaps a signon. But that's a design decision - it's not a rule of sockets.
|
||||
|
||||
Now there are two sets of verbs to use for communication. You can use ``send``
|
||||
and ``recv``, or you can transform your client socket into a file-like beast and
|
||||
use ``read`` and ``write``. The latter is the way Java presents its sockets.
|
||||
I'm not going to talk about it here, except to warn you that you need to use
|
||||
``flush`` on sockets. These are buffered "files", and a common mistake is to
|
||||
``write`` something, and then ``read`` for a reply. Without a ``flush`` in
|
||||
there, you may wait forever for the reply, because the request may still be in
|
||||
your output buffer.
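
For the curious, here is a small sketch of the file-like route and the ``flush`` it needs. It assumes ``socket.socketpair()`` as a stand-in for a real ``connect``/``accept`` pair, and the ``PING``/``PONG`` messages are invented for the example.

```python
import socket

# socketpair() gives two already-connected sockets, which keeps the
# sketch self-contained; a real client would use connect() instead.
client, server = socket.socketpair()

client_file = client.makefile("rwb")   # buffered, file-like wrapper
server_file = server.makefile("rwb")

client_file.write(b"PING\n")
client_file.flush()        # without this, the request may sit in the buffer

request = server_file.readline()
server_file.write(b"PONG\n")
server_file.flush()

reply = client_file.readline()

for f in (client_file, server_file):
    f.close()
client.close()
server.close()
```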
|
||||
|
||||
Now we come to the major stumbling block of sockets - ``send`` and ``recv`` operate
|
||||
on the network buffers. They do not necessarily handle all the bytes you hand
|
||||
them (or expect from them), because their major focus is handling the network
|
||||
buffers. In general, they return when the associated network buffers have been
|
||||
filled (``send``) or emptied (``recv``). They then tell you how many bytes they
|
||||
handled. It is *your* responsibility to call them again until your message has
|
||||
been completely dealt with.
|
||||
|
||||
When a ``recv`` returns 0 bytes, it means the other side has closed (or is in
|
||||
the process of closing) the connection. You will not receive any more data on
|
||||
this connection. Ever. You may be able to send data successfully; I'll talk
more about this later.
|
||||
|
||||
A protocol like HTTP uses a socket for only one transfer. The client sends a
|
||||
request, then reads a reply. That's it. The socket is discarded. This means that
|
||||
a client can detect the end of the reply by receiving 0 bytes.
|
||||
|
||||
But if you plan to reuse your socket for further transfers, you need to realize
|
||||
that *there is no "EOT" (End of Transfer) on a socket.* I repeat: if a socket
|
||||
``send`` or ``recv`` returns after handling 0 bytes, the connection has been
|
||||
broken. If the connection has *not* been broken, you may wait on a ``recv``
|
||||
forever, because the socket will *not* tell you that there's nothing more to
|
||||
read (for now). Now if you think about that a bit, you'll come to realize a
|
||||
fundamental truth of sockets: *messages must either be fixed length* (yuck), *or
|
||||
be delimited* (shrug), *or indicate how long they are* (much better), *or end by
|
||||
shutting down the connection*. The choice is entirely yours (but some ways are
|
||||
righter than others).
|
||||
|
||||
Assuming you don't want to end the connection, the simplest solution is a fixed
|
||||
length message::
|
||||
|
||||
   import socket

   MSGLEN = 1024   # assumed fixed message length (example value)

   class mysocket:
       '''demonstration class only
         - coded for clarity, not efficiency
       '''

       def __init__(self, sock=None):
           if sock is None:
               self.sock = socket.socket(
                   socket.AF_INET, socket.SOCK_STREAM)
           else:
               self.sock = sock

       def connect(self, host, port):
           self.sock.connect((host, port))

       def mysend(self, msg):
           # msg is a bytes object of exactly MSGLEN bytes
           totalsent = 0
           while totalsent < MSGLEN:
               sent = self.sock.send(msg[totalsent:])
               if sent == 0:
                   raise RuntimeError("socket connection broken")
               totalsent = totalsent + sent

       def myreceive(self):
           msg = b''
           while len(msg) < MSGLEN:
               chunk = self.sock.recv(MSGLEN - len(msg))
               if chunk == b'':
                   raise RuntimeError("socket connection broken")
               msg = msg + chunk
           return msg
|
||||
|
||||
The sending code here is usable for almost any messaging scheme - in Python you
|
||||
send byte strings, and you can use ``len()`` to determine their length (even if they have
|
||||
embedded ``\0`` characters). It's mostly the receiving code that gets more
|
||||
complex. (And in C, it's not much worse, except you can't use ``strlen`` if the
|
||||
message has embedded ``\0``\ s.)
|
||||
|
||||
The easiest enhancement is to make the first character of the message an
|
||||
indicator of message type, and have the type determine the length. Now you have
|
||||
two ``recv``\ s - the first to get (at least) that first character so you can
|
||||
look up the length, and the second in a loop to get the rest. If you decide to
|
||||
go the delimited route, you'll be receiving in some arbitrary chunk size (4096
|
||||
or 8192 is frequently a good match for network buffer sizes), and scanning what
|
||||
you've received for a delimiter.
|
||||
|
||||
One complication to be aware of: if your conversational protocol allows multiple
|
||||
messages to be sent back to back (without some kind of reply), and you pass
|
||||
``recv`` an arbitrary chunk size, you may end up reading the start of a
|
||||
following message. You'll need to put that aside and hold onto it, until it's
|
||||
needed.
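
One way to "put that aside" is to keep a buffer of pending bytes next to the socket. The sketch below is only an illustration: the class name ``DelimitedReader``, the newline delimiter and the 4096 chunk size are my assumptions, not anything the socket library prescribes.

```python
import socket

class DelimitedReader:
    """Hypothetical helper: splits a byte stream on a delimiter."""

    def __init__(self, sock, delimiter=b"\n", chunk_size=4096):
        self.sock = sock
        self.delimiter = delimiter
        self.chunk_size = chunk_size
        self.pending = b""      # bytes received but not yet handed out

    def read_message(self):
        # Keep receiving until the buffer holds a complete message.
        while self.delimiter not in self.pending:
            chunk = self.sock.recv(self.chunk_size)
            if chunk == b"":
                raise RuntimeError("socket connection broken")
            self.pending += chunk
        # Hand out one message; whatever follows the delimiter (possibly
        # the start of the next message) stays in self.pending.
        message, _, self.pending = self.pending.partition(self.delimiter)
        return message
```

Each call to ``read_message`` returns exactly one message, however the bytes happened to be chunked on the way in.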
|
||||
|
||||
Prefixing the message with its length (say, as 5 numeric characters) gets more
|
||||
complex, because (believe it or not), you may not get all 5 characters in one
|
||||
``recv``. In playing around, you'll get away with it; but in high network loads,
|
||||
your code will very quickly break unless you use two ``recv`` loops - the first
|
||||
to determine the length, the second to get the data part of the message. Nasty.
|
||||
This is also when you'll discover that ``send`` does not always manage to get
|
||||
rid of everything in one pass. And despite having read this, you will eventually
|
||||
get bitten by it!
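
As a rough sketch of the two-loop idea, assuming a 5-digit ASCII length prefix as suggested above: the helper names ``recv_exactly``, ``send_message`` and ``recv_message`` are hypothetical, and ``sendall`` is used because it repeats ``send`` until everything has gone out.

```python
import socket

HEADER_LEN = 5   # five ASCII digits, as in the text; an illustrative choice

def recv_exactly(sock, count):
    # Loop until exactly `count` bytes have arrived.
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if chunk == b"":
            raise RuntimeError("socket connection broken")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def send_message(sock, payload):
    if len(payload) >= 10 ** HEADER_LEN:
        raise ValueError("payload too long for the length prefix")
    header = str(len(payload)).zfill(HEADER_LEN).encode("ascii")
    sock.sendall(header + payload)

def recv_message(sock):
    # First loop: the length prefix. Second loop: the data part.
    length = int(recv_exactly(sock, HEADER_LEN).decode("ascii"))
    return recv_exactly(sock, length)
```

The same ``recv_exactly`` loop serves both stages, which is most of what makes this scheme bearable.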
|
||||
|
||||
In the interests of space, building your character (and preserving my
competitive position), these enhancements are left as an exercise for the
reader. Let's move on to cleaning up.
|
||||
|
||||
|
||||
Binary Data
|
||||
-----------
|
||||
|
||||
It is perfectly possible to send binary data over a socket. The major problem is
|
||||
that not all machines use the same formats for binary data. For example, a
|
||||
Motorola chip will represent a 16 bit integer with the value 1 as the two hex
|
||||
bytes 00 01. Intel and DEC, however, are byte-reversed - that same 1 is 01 00.
|
||||
Socket libraries have calls for converting 16 and 32 bit integers - ``ntohl,
|
||||
htonl, ntohs, htons`` where "n" means *network* and "h" means *host*, "s" means
|
||||
*short* and "l" means *long*. Where network order is host order, these do
|
||||
nothing, but where the machine is byte-reversed, these swap the bytes around
|
||||
appropriately.
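
In Python the same conversions are available from the ``socket`` module, and the ``struct`` module lets you spell out the byte order explicitly. A short illustration (the variable names are mine):

```python
import socket
import struct

value = 1

big_endian = struct.pack(">H", value)      # bytes 00 01
little_endian = struct.pack("<H", value)   # bytes 01 00
network_order = struct.pack("!H", value)   # network order is big-endian

# htons/ntohs convert between host and network order; a round trip
# gives back the original value, whatever the host's byte order.
round_trip = socket.ntohs(socket.htons(value))
```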
|
||||
|
||||
In these days of 32 bit machines, the ASCII representation of binary data is
|
||||
frequently smaller than the binary representation. That's because a surprising
|
||||
amount of the time, all those longs have the value 0, or maybe 1. The string "0"
|
||||
would be two bytes, while binary is four. Of course, this doesn't fit well with
|
||||
fixed-length messages. Decisions, decisions.
|
||||
|
||||
|
||||
Disconnecting
|
||||
=============
|
||||
|
||||
Strictly speaking, you're supposed to use ``shutdown`` on a socket before you
|
||||
``close`` it. The ``shutdown`` is an advisory to the socket at the other end.
|
||||
Depending on the argument you pass it, it can mean "I'm not going to send
|
||||
anymore, but I'll still listen", or "I'm not listening, good riddance!". Most
|
||||
socket libraries, however, are so used to programmers neglecting to use this
|
||||
piece of etiquette that normally a ``close`` is the same as ``shutdown();
|
||||
close()``. So in most situations, an explicit ``shutdown`` is not needed.
|
||||
|
||||
One way to use ``shutdown`` effectively is in an HTTP-like exchange. The client
|
||||
sends a request and then does a ``shutdown(1)``. This tells the server "This
|
||||
client is done sending, but can still receive." The server can detect "EOF" by
|
||||
a receive of 0 bytes. It can assume it has the complete request. The server
|
||||
sends a reply. If the ``send`` completes successfully then, indeed, the client
|
||||
was still receiving.
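
A sketch of that exchange, assuming ``socket.socketpair()`` as a stand-in for a connected client and server, and with an invented request and reply. ``socket.SHUT_WR`` is the named form of the ``1`` passed to ``shutdown`` above.

```python
import socket

client, server = socket.socketpair()   # stands in for connect()/accept()

# Client: send the request, then signal "done sending".
client.sendall(b"GET /something")
client.shutdown(socket.SHUT_WR)

# Server: read until recv() returns 0 bytes, which marks end of request.
request = b""
while True:
    chunk = server.recv(4096)
    if chunk == b"":
        break
    request += chunk

# The client can still receive, so the reply goes through.
server.sendall(b"reply to " + request)
server.close()

# Client: read the reply until the server closes its side.
reply = b""
while True:
    chunk = client.recv(4096)
    if chunk == b"":
        break
    reply += chunk
client.close()
```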
|
||||
|
||||
Python takes the automatic shutdown a step further, and says that when a socket
|
||||
is garbage collected, it will automatically do a ``close`` if it's needed. But
|
||||
relying on this is a very bad habit. If your socket just disappears without
|
||||
doing a ``close``, the socket at the other end may hang indefinitely, thinking
|
||||
you're just being slow. *Please* ``close`` your sockets when you're done.
|
||||
|
||||
|
||||
When Sockets Die
|
||||
----------------
|
||||
|
||||
Probably the worst thing about using blocking sockets is what happens when the
|
||||
other side comes down hard (without doing a ``close``). Your socket is likely to
|
||||
hang. TCP is a reliable protocol, and it will wait a long, long time
|
||||
before giving up on a connection. If you're using threads, the entire thread is
|
||||
essentially dead. There's not much you can do about it. As long as you aren't
|
||||
doing something dumb, like holding a lock while doing a blocking read, the
|
||||
thread isn't really consuming much in the way of resources. Do *not* try to kill
|
||||
the thread - part of the reason that threads are more efficient than processes
|
||||
is that they avoid the overhead associated with the automatic recycling of
|
||||
resources. In other words, if you do manage to kill the thread, your whole
|
||||
process is likely to be screwed up.
|
||||
|
||||
|
||||
Non-blocking Sockets
|
||||
====================
|
||||
|
||||
If you've understood the preceding, you already know most of what you need to
|
||||
know about the mechanics of using sockets. You'll still use the same calls, in
|
||||
much the same ways. It's just that, if you do it right, your app will be almost
|
||||
inside-out.
|
||||
|
||||
In Python, you use ``socket.setblocking(0)`` to make a socket non-blocking. In C, it's
more complex (for one thing, you'll need to choose between the BSD flavor
|
||||
``O_NONBLOCK`` and the almost indistinguishable Posix flavor ``O_NDELAY``, which
|
||||
is completely different from ``TCP_NODELAY``), but it's the exact same idea. You
|
||||
do this after creating the socket, but before using it. (Actually, if you're
|
||||
nuts, you can switch back and forth.)
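
To see what "non-blocking" means in practice, here is a tiny sketch. It assumes ``socket.socketpair()`` for a pair of connected sockets; in current Python a ``recv`` with nothing to read raises ``BlockingIOError`` instead of waiting.

```python
import socket

a, b = socket.socketpair()
b.setblocking(False)          # same effect as setblocking(0)

try:
    b.recv(4096)              # nothing has been sent yet
    would_block = False
except BlockingIOError:
    would_block = True        # the call came back without doing anything

a.close()
b.close()
```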
|
||||
|
||||
The major mechanical difference is that ``send``, ``recv``, ``connect`` and
|
||||
``accept`` can return without having done anything. You have (of course) a
|
||||
number of choices. You can check return code and error codes and generally drive
|
||||
yourself crazy. If you don't believe me, try it sometime. Your app will grow
|
||||
large, buggy and suck CPU. So let's skip the brain-dead solutions and do it
|
||||
right.
|
||||
|
||||
Use ``select``.
|
||||
|
||||
In C, coding ``select`` is fairly complex. In Python, it's a piece of cake, but
|
||||
it's close enough to the C version that if you understand ``select`` in Python,
|
||||
you'll have little trouble with it in C. ::
|
||||
|
||||
   ready_to_read, ready_to_write, in_error = \
                  select.select(
                     potential_readers,
                     potential_writers,
                     potential_errs,
                     timeout)
|
||||
|
||||
You pass ``select`` three lists: the first contains all sockets that you might
|
||||
want to try reading; the second all the sockets you might want to try writing
|
||||
to, and the last (normally left empty) those that you want to check for errors.
|
||||
You should note that a socket can go into more than one list. The ``select``
|
||||
call is blocking, but you can give it a timeout. This is generally a sensible
|
||||
thing to do - give it a nice long timeout (say a minute) unless you have good
|
||||
reason to do otherwise.
|
||||
|
||||
In return, you will get three lists. They have the sockets that are actually
|
||||
readable, writable and in error. Each of these lists is a subset (possibly
empty) of the corresponding list you passed in. A socket that you put in more
than one input list may also come back in more than one output list.
|
||||
|
||||
If a socket is in the output readable list, you can be
|
||||
as-close-to-certain-as-we-ever-get-in-this-business that a ``recv`` on that
|
||||
socket will return *something*. Same idea for the writable list. You'll be able
|
||||
to send *something*. Maybe not all you want to, but *something* is better than
|
||||
nothing. (Actually, any reasonably healthy socket will return as writable - it
|
||||
just means outbound network buffer space is available.)
|
||||
|
||||
If you have a "server" socket, put it in the potential_readers list. If it comes
|
||||
out in the readable list, your ``accept`` will (almost certainly) work. If you
|
||||
have created a new socket to ``connect`` to someone else, put it in the
|
||||
potential_writers list. If it shows up in the writable list, you have a decent
|
||||
chance that it has connected.
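
Putting the pieces together, one pass of a ``select`` loop might look roughly like this. The function name ``serve_once``, the echo behaviour and the ``clients`` list are assumptions for the sake of the example; only the readable list is used, and the echo assumes messages small enough to send in one go.

```python
import select
import socket

def serve_once(serversocket, clients, timeout=60):
    """Run one pass of a select loop; `clients` is a list updated in place."""
    ready_to_read, _, _ = select.select(
        [serversocket] + clients, [], [], timeout)
    for sock in ready_to_read:
        if sock is serversocket:
            # A readable "server" socket means accept() will not block.
            clientsocket, address = serversocket.accept()
            clients.append(clientsocket)
        else:
            data = sock.recv(4096)
            if data:
                sock.sendall(data)      # echo; assumes small messages
            else:
                # 0 bytes: the peer closed the connection.
                clients.remove(sock)
                sock.close()
```

A real server would call something like this in a ``while True:`` loop, and would also watch the writable list before sending large replies.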
|
||||
|
||||
One very nasty problem with ``select``: if somewhere in those input lists of
|
||||
sockets is one which has died a nasty death, the ``select`` will fail. You then
|
||||
need to loop through every single damn socket in all those lists and do a
|
||||
``select([sock],[],[],0)`` until you find the bad one. That timeout of 0 means
|
||||
it won't take long, but it's ugly.
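
The hunt for the bad socket can be sketched as below. The function name ``find_bad_sockets`` is hypothetical, and the set of exceptions caught is my assumption about how the failure shows up in Python: an already-closed socket object raises ``ValueError``, while other failures surface as ``OSError``.

```python
import select
import socket

def find_bad_sockets(socks):
    """Return the sockets that make select() fail."""
    bad = []
    for sock in socks:
        try:
            select.select([sock], [], [], 0)   # timeout 0: just poll
        except (OSError, ValueError):
            # ValueError is what Python raises for an already-closed
            # socket object (its file descriptor is -1).
            bad.append(sock)
    return bad
```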
|
||||
|
||||
Actually, ``select`` can be handy even with blocking sockets. It's one way of
|
||||
determining whether you will block - the socket returns as readable when there's
|
||||
something in the buffers. However, this still doesn't help with the problem of
|
||||
determining whether the other end is done, or just busy with something else.
|
||||
|
||||
**Portability alert**: On Unix, ``select`` works with both sockets and
|
||||
files. Don't try this on Windows. On Windows, ``select`` works with sockets
|
||||
only. Also note that in C, many of the more advanced socket options are done
|
||||
differently on Windows. In fact, on Windows I usually use threads (which work
|
||||
very, very well) with my sockets. Face it, if you want any kind of performance,
|
||||
your code will look very different on Windows than on Unix. (I haven't the
|
||||
foggiest how you do this stuff on a Mac.)
|
||||
|
||||
|
||||
Performance
|
||||
-----------
|
||||
|
||||
There's no question that the fastest sockets code uses non-blocking sockets and
|
||||
select to multiplex them. You can put together something that will saturate a
|
||||
LAN connection without putting any strain on the CPU. The trouble is that an app
|
||||
written this way can't do much of anything else - it needs to be ready to
|
||||
shuffle bytes around at all times.
|
||||
|
||||
Assuming that your app is actually supposed to do something more than that,
|
||||
threading is the optimal solution (and using non-blocking sockets will be
|
||||
faster than using blocking sockets). Unfortunately, threading support in Unixes
|
||||
varies both in API and quality. So the normal Unix solution is to fork a
|
||||
subprocess to deal with each connection. The overhead for this is significant
|
||||
(and don't do this on Windows - the overhead of process creation is enormous
|
||||
there). It also means that unless each subprocess is completely independent,
|
||||
you'll need to use another form of IPC, say a pipe, or shared memory and
|
||||
semaphores, to communicate between the parent and child processes.
|
||||
|
||||
Finally, remember that even though blocking sockets are somewhat slower than
|
||||
non-blocking, in many cases they are the "right" solution. After all, if your
|
||||
app is driven by the data it receives over a socket, there's not much sense in
|
||||
complicating the logic just so your app can wait on ``select`` instead of
|
||||
``recv``.
|
||||
|
732
Doc/howto/unicode.rst
Normal file
|
@ -0,0 +1,732 @@
|
|||
*****************
|
||||
Unicode HOWTO
|
||||
*****************
|
||||
|
||||
:Release: 1.02
|
||||
|
||||
This HOWTO discusses Python's support for Unicode, and explains various problems
|
||||
that people commonly encounter when trying to work with Unicode.
|
||||
|
||||
Introduction to Unicode
|
||||
=======================
|
||||
|
||||
History of Character Codes
|
||||
--------------------------
|
||||
|
||||
In 1968, the American Standard Code for Information Interchange, better known by
|
||||
its acronym ASCII, was standardized. ASCII defined numeric codes for various
|
||||
characters, with the numeric values running from 0 to
|
||||
127. For example, the lowercase letter 'a' is assigned 97 as its code
|
||||
value.
|
||||
|
||||
ASCII was an American-developed standard, so it only defined unaccented
|
||||
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
|
||||
which required accented characters couldn't be faithfully represented in ASCII.
|
||||
(Actually the missing accents matter for English, too, which contains words such
|
||||
as 'naïve' and 'café', and some publications have house styles which require
|
||||
spellings such as 'coöperate'.)
|
||||
|
||||
For a while people just wrote programs that didn't display accents. I remember
|
||||
looking at Apple ][ BASIC programs, published in French-language publications in
|
||||
the mid-1980s, that had lines like these::
|
||||
|
||||
   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."
|
||||
|
||||
Those messages should contain accents, and they just look wrong to someone who
|
||||
can read French.
|
||||
|
||||
In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
|
||||
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
|
||||
machines assigned values between 128 and 255 to accented characters. Different
|
||||
machines had different codes, however, which led to problems exchanging files.
|
||||
Eventually various commonly used sets of values for the 128-255 range emerged.
|
||||
Some were true standards, defined by the International Standards Organization,
|
||||
and some were **de facto** conventions that were invented by one company or
|
||||
another and managed to catch on.
|
||||
|
||||
255 characters aren't very many. For example, you can't fit both the accented
|
||||
characters used in Western Europe and the Cyrillic alphabet used for Russian
|
||||
into the 128-255 range because there are more than 128 such characters.
|
||||
|
||||
You could write files using different codes (all your Russian files in a coding
|
||||
system called KOI8, all your French files in a different coding system called
|
||||
Latin1), but what if you wanted to write a French document that quotes some
|
||||
Russian text? In the 1980s people began to want to solve this problem, and the
|
||||
Unicode standardization effort began.
|
||||
|
||||
Unicode started out using 16-bit characters instead of 8-bit characters. 16
|
||||
bits means you have 2^16 = 65,536 distinct values available, making it possible
|
||||
to represent many different characters from many different alphabets; an initial
|
||||
goal was to have Unicode contain the alphabets for every single human language.
|
||||
It turns out that even 16 bits isn't enough to meet that goal, and the modern
|
||||
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
|
||||
base-16).
|
||||
|
||||
There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
|
||||
originally separate efforts, but the specifications were merged with the 1.1
|
||||
revision of Unicode.
|
||||
|
||||
(This discussion of Unicode's history is highly simplified. I don't think the
|
||||
average Python programmer needs to worry about the historical details; consult
|
||||
the Unicode consortium site listed in the References for more information.)
|
||||
|
||||
|
||||
Definitions
|
||||
-----------
|
||||
|
||||
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
|
||||
etc., are all different characters. So are 'È' and 'Í'. Characters are
|
||||
abstractions, and vary depending on the language or context you're talking
|
||||
about. For example, the symbol for ohms (Ω) is usually drawn much like the
|
||||
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
|
||||
some fonts), but these are two different characters that have different
|
||||
meanings.
|
||||
|
||||
The Unicode standard describes how characters are represented by **code
|
||||
points**. A code point is an integer value, usually denoted in base 16. In the
|
||||
standard, a code point is written using the notation U+12ca to mean the
|
||||
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
|
||||
of tables listing characters and their corresponding code points::
|
||||
|
||||
   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
|
||||
|
||||
Strictly, these definitions imply that it's meaningless to say 'this is
|
||||
character U+12ca'. U+12ca is a code point, which represents some particular
|
||||
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
|
||||
informal contexts, this distinction between code points and characters will
|
||||
sometimes be forgotten.
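
In Python 3 the relationship between characters and code points can be explored with the built-in ``ord()`` and ``chr()`` functions and the :mod:`unicodedata` module. A brief illustration (the variable names are only for this example):

```python
import unicodedata

code_point = ord("a")                 # 97, i.e. 0x61
character = chr(0x12ca)               # the character at U+12CA
name = unicodedata.name(character)    # official Unicode character name
notation = "U+%04X" % ord(character)  # standard-style notation
```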
|
||||
|
||||
A character is represented on a screen or on paper by a set of graphical
|
||||
elements that's called a **glyph**. The glyph for an uppercase A, for example,
|
||||
is two diagonal strokes and a horizontal stroke, though the exact details will
|
||||
depend on the font being used. Most Python code doesn't need to worry about
|
||||
glyphs; figuring out the correct glyph to display is generally the job of a GUI
|
||||
toolkit or a terminal's font renderer.
|
||||
|
||||
|
||||
Encodings
|
||||
---------
|
||||
|
||||
To summarize the previous section: a Unicode string is a sequence of code
|
||||
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
|
||||
represented as a sequence of bytes (meaning, values from 0-255) in memory. The rules
|
||||
for translating a Unicode string into a sequence of bytes are called an
|
||||
**encoding**.
|
||||
|
||||
The first encoding you might think of is an array of 32-bit integers. In this
|
||||
representation, the string "Python" would look like this::
|
||||
|
||||
       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
|
||||
|
||||
This representation is straightforward but using it presents a number of
|
||||
problems.
|
||||
|
||||
1. It's not portable; different processors order the bytes differently.
|
||||
|
||||
2. It's very wasteful of space. In most texts, the majority of the code points
are less than 128, or less than 256, so a lot of space is occupied by zero
|
||||
bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
|
||||
ASCII representation. Increased RAM usage doesn't matter too much (desktop
|
||||
computers have megabytes of RAM, and strings aren't usually that large), but
|
||||
expanding our usage of disk and network bandwidth by a factor of 4 is
|
||||
intolerable.
|
||||
|
||||
3. It's not compatible with existing C functions such as ``strlen()``, so a new
|
||||
family of wide string functions would need to be used.
|
||||
|
||||
4. Many Internet standards are defined in terms of textual data, and can't
|
||||
handle content with embedded zero bytes.
|
||||
|
||||
Generally people don't use this encoding, instead choosing other encodings that
|
||||
are more efficient and convenient.
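
The cost of the 32-bit representation is easy to measure. This sketch assumes the ``'utf-32-le'`` and ``'utf-32-be'`` codecs (available in Python 2.6 and later) are fair stand-ins for an array of 32-bit integers in each byte order:

```python
text = u'Python'

little = text.encode('utf-32-le')   # least significant byte first
big = text.encode('utf-32-be')      # most significant byte first
plain = text.encode('ascii')

# Same string, same size, but the byte order differs (problem 1).
first_le = little[:4]               # b'P\x00\x00\x00'
first_be = big[:4]                  # b'\x00\x00\x00P'

# Four bytes per code point, three of them zero here (problems 2 and 4).
zero_bytes = little.count(b'\x00')

print('%d bytes as 32-bit integers, %d as ASCII' % (len(little), len(plain)))
print('%d of the 32-bit bytes are zero' % zero_bytes)
```

Three quarters of the 32-bit output is padding for this string.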
|
||||
|
||||
Encodings don't have to handle every possible Unicode character, and most
|
||||
encodings don't. For example, Python's default encoding is the 'ascii'
|
||||
encoding. The rules for converting a Unicode string into the ASCII encoding are
|
||||
simple; for each code point:
|
||||
|
||||
1. If the code point is < 128, each byte is the same as the value of the code
|
||||
point.
|
||||
|
||||
2. If the code point is 128 or greater, the Unicode string can't be represented
|
||||
in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
|
||||
case.)
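
The two rules are simple enough to write out by hand. The function below is only an illustration of the rules, not Python's actual codec; ``encode_ascii`` is a hypothetical name, and it raises a plain :exc:`ValueError` to keep the sketch short:

```python
def encode_ascii(text):
    """Encode a Unicode string to ASCII bytes, following the two rules."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point >= 128:
            # Rule 2: the string can't be represented in this encoding.
            raise ValueError('U+%04X is outside the ASCII range' % code_point)
        # Rule 1: the byte is the same as the code point's value.
        out.append(code_point)
    return bytes(out)


# Compare the hand-written version with the built-in codec.
print(encode_ascii(u'abc') == u'abc'.encode('ascii'))
```

In real code you would simply call ``.encode('ascii')``; the point here is only that the encoding is a rule applied to each code point in turn.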
|
||||
|
||||
Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
|
||||
0-255 are identical to the Latin-1 values, so converting to this encoding simply
|
||||
requires converting code points to byte values; if a code point larger than 255
|
||||
is encountered, the string can't be encoded into Latin-1.
|
||||
|
||||
Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
|
||||
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
|
||||
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
|
||||
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
|
||||
some sort of lookup table to perform the conversion, but this is largely an
|
||||
internal detail.
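
Python ships codecs for several EBCDIC variants. The sketch below assumes ``'cp500'`` is an acceptable stand-in for the EBCDIC described here, and shows the gap between 'i' and 'j':

```python
# Encode a few lowercase letters with an EBCDIC codec and look at the
# resulting byte values.  Iterating over a bytearray yields integers.
letters = u'aijr'
values = list(bytearray(letters.encode('cp500')))

for letter, value in zip(letters, values):
    print('%s -> %d' % (letter, value))

# The codec's table is reversible: decoding gives the letters back.
round_trip = bytes(bytearray(values)).decode('cp500')
```

The lookup table lives inside the codec, so the calling code looks the same as for any other encoding.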
|
||||
|
||||
UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
|
||||
Transformation Format", and the '8' means that 8-bit numbers are used in the
|
||||
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
|
||||
UTF-8.) UTF-8 uses the following rules:
|
||||
|
||||
1. If the code point is <128, it's represented by the corresponding byte value.
|
||||
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
|
||||
between 128 and 255.
|
||||
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
|
||||
byte of the sequence is between 128 and 255.
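
The three rules can be sketched as a small function. This illustrates the rules and is not how Python's codec is implemented; ``utf8_encode`` is a hypothetical name, and the sketch does no validation (for example, it doesn't reject surrogate code points):

```python
def utf8_encode(code_point):
    """Return the UTF-8 bytes for a single integer code point."""
    if code_point < 0x80:
        # Rule 1: one byte, identical to the code point.
        data = [code_point]
    elif code_point < 0x800:
        # Rule 2: two bytes, 110xxxxx 10xxxxxx.
        data = [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    elif code_point < 0x10000:
        # Rule 3: three bytes, 1110xxxx 10xxxxxx 10xxxxxx.
        data = [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    else:
        # Rule 3: four bytes, 11110xxx followed by three 10xxxxxx bytes.
        data = [0xF0 | (code_point >> 18),
                0x80 | ((code_point >> 12) & 0x3F),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    return bytes(bytearray(data))


for cp in (0x61, 0x7b4, 0xa000, 0x10ffff):
    encoded = utf8_encode(cp)
    print('U+%04X -> %d byte(s)' % (cp, len(encoded)))
```

Every byte of a multi-byte sequence has its high bit set, which is why such sequences can never be mistaken for ASCII characters.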
|
||||
|
||||
UTF-8 has several convenient properties:
|
||||
|
||||
1. It can handle any Unicode code point.
|
||||
2. A Unicode string is turned into a string of bytes containing no embedded zero
|
||||
bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
|
||||
processed by C functions such as ``strcpy()`` and sent through protocols that
|
||||
can't handle zero bytes.
|
||||
3. A string of ASCII text is also valid UTF-8 text.
|
||||
4. UTF-8 is fairly compact; code points less than 128 occupy only a single
byte, code points up to 0x7ff occupy two bytes, and only higher code points
need three or four bytes.
|
||||
5. If bytes are corrupted or lost, it's possible to determine the start of the
|
||||
next UTF-8-encoded code point and resynchronize. It's also unlikely that
|
||||
random 8-bit data will look like valid UTF-8.
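
A few of these properties can be checked directly. The following sketch assumes nothing beyond the built-in ``'utf-8'`` codec; the resynchronization loop relies on the fact that continuation bytes in UTF-8 always have the bit pattern ``10xxxxxx``:

```python
ascii_text = u'Python'
mixed = u'a\u20acb'                       # 'a', EURO SIGN, 'b'

# Property 3: ASCII text is already valid UTF-8.
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

# Property 2: no zero bytes appear unless the text itself contains U+0000.
encoded = bytearray(mixed.encode('utf-8'))   # 61 e2 82 ac 62
has_zero = 0 in encoded

# Property 5: drop the first two bytes so the data starts in the middle
# of the euro sign, then skip continuation bytes to resynchronize.
damaged = encoded[2:]                     # 82 ac 62
start = 0
while start < len(damaged) and (damaged[start] & 0xC0) == 0x80:
    start += 1
recovered = bytes(damaged[start:]).decode('utf-8')

print('%s %s %s' % (same_bytes, has_zero, recovered == u'b'))
```

The damaged character itself is lost, but everything after it can still be decoded.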
|
||||
|
||||
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The Unicode Consortium site at <http://www.unicode.org> has character charts, a
|
||||
glossary, and PDF versions of the Unicode specification. Be prepared for some
|
||||
difficult reading. <http://www.unicode.org/history/> is a chronology of the
|
||||
origin and development of Unicode.
|
||||
|
||||
To help understand the standard, Jukka Korpela has written an introductory guide
|
||||
to reading the Unicode character tables, available at
|
||||
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.
|
||||
|
||||
Roman Czyborra wrote another explanation of Unicode's basic principles; it's at
<http://czyborra.com/unicode/characters.html>. Czyborra has written a number of
other Unicode-related documents, available from <http://www.czyborra.com>.
|
||||
|
||||
Two other good introductory articles were written by Joel Spolsky
|
||||
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
|
||||
<http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make
|
||||
things clear to you, you should try reading one of these alternate articles
|
||||
before continuing.
|
||||
|
||||
Wikipedia entries are often helpful; see the entries for "character encoding"
|
||||
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
|
||||
<http://en.wikipedia.org/wiki/UTF-8>, for example.
|
||||
|
||||
|
||||
Python's Unicode Support
|
||||
========================
|
||||
|
||||
Now that you've learned the rudiments of Unicode, we can look at Python's
|
||||
Unicode features.
|
||||
|
||||
|
||||
The Unicode Type
|
||||
----------------
|
||||
|
||||
Unicode strings are expressed as instances of the :class:`unicode` type, one of
|
||||
Python's repertoire of built-in types. It derives from an abstract type called
|
||||
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
|
||||
therefore check if a value is a string type with ``isinstance(value,
|
||||
basestring)``. Under the hood, Python represents Unicode strings as either 16-
|
||||
or 32-bit integers, depending on how the Python interpreter was compiled.
|
||||
|
||||
The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
|
||||
errors])``. All of its arguments should be 8-bit strings. The first argument
|
||||
is converted to Unicode using the specified encoding; if you leave off the
|
||||
``encoding`` argument, the ASCII encoding is used for the conversion, so
|
||||
characters greater than 127 will be treated as errors::
|
||||
|
||||
>>> unicode('abcdef')
|
||||
u'abcdef'
|
||||
>>> s = unicode('abcdef')
|
||||
>>> type(s)
|
||||
<type 'unicode'>
|
||||
>>> unicode('abcdef' + chr(255))
|
||||
Traceback (most recent call last):
|
||||
File "<stdin>", line 1, in ?
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
|
||||
ordinal not in range(128)
|
||||
|
||||
The ``errors`` argument specifies the response when the input string can't be
|
||||
converted according to the encoding's rules. Legal values for this argument are
|
||||
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
|
||||
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
|
||||
Unicode result). The following examples show the differences::
|
||||
|
||||
>>> unicode('\x80abc', errors='strict')
|
||||
Traceback (most recent call last):
|
||||
File "<stdin>", line 1, in ?
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
|
||||
ordinal not in range(128)
|
||||
>>> unicode('\x80abc', errors='replace')
|
||||
u'\ufffdabc'
|
||||
>>> unicode('\x80abc', errors='ignore')
|
||||
u'abc'
|
||||
|
||||
Encodings are specified as strings containing the encoding's name. Python 2.4
|
||||
comes with roughly 100 different encodings; see the Python Library Reference at
|
||||
<http://docs.python.org/lib/standard-encodings.html> for a list. Some encodings
|
||||
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
|
||||
synonyms for the same encoding.
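
One way to confirm that several names refer to the same encoding is to ask the :mod:`codecs` module to resolve them. This is a small sketch using ``codecs.lookup()``; the exact canonical name it reports may differ between Python versions, so the code only compares the results with each other:

```python
import codecs

# codecs.lookup() resolves an alias to the codec that implements it.
names = ['latin-1', 'iso_8859_1', '8859']
canonical = [codecs.lookup(name).name for name in names]

for alias, real in zip(names, canonical):
    print('%s -> %s' % (alias, real))

# All three aliases also produce identical bytes.
encoded = [u'caf\xe9'.encode(name) for name in names]
```

An unknown name makes ``codecs.lookup()`` raise :exc:`LookupError`, which is a cheap way to validate an encoding name received from outside your program.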
|
||||
|
||||
One-character Unicode strings can also be created with the :func:`unichr`
|
||||
built-in function, which takes integers and returns a Unicode string of length 1
|
||||
that contains the corresponding code point. The reverse operation is the
|
||||
built-in :func:`ord` function that takes a one-character Unicode string and
|
||||
returns the code point value::
|
||||
|
||||
>>> unichr(40960)
|
||||
u'\ua000'
|
||||
>>> ord(u'\ua000')
|
||||
40960
|
||||
|
||||
Instances of the :class:`unicode` type have many of the same methods as the
|
||||
8-bit string type for operations such as searching and formatting::
|
||||
|
||||
>>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
|
||||
>>> s.count('e')
|
||||
5
|
||||
>>> s.find('feather')
|
||||
9
|
||||
>>> s.find('bird')
|
||||
-1
|
||||
>>> s.replace('feather', 'sand')
|
||||
u'Was ever sand so lightly blown to and fro as this multitude?'
|
||||
>>> s.upper()
|
||||
u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'
|
||||
|
||||
Note that the arguments to these methods can be Unicode strings or 8-bit
|
||||
strings. 8-bit strings will be converted to Unicode before carrying out the
|
||||
operation; Python's default ASCII encoding will be used, so characters greater
|
||||
than 127 will cause an exception::
|
||||
|
||||
>>> s.find('Was\x9f')
|
||||
Traceback (most recent call last):
|
||||
File "<stdin>", line 1, in ?
|
||||
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
|
||||
>>> s.find(u'Was\x9f')
|
||||
-1
|
||||
|
||||
Much Python code that operates on strings will therefore work with Unicode
|
||||
strings without requiring any changes to the code. (Input and output code needs
|
||||
more updating for Unicode; more on this later.)
|
||||
|
||||
Another important method is ``.encode([encoding], [errors='strict'])``, which
|
||||
returns an 8-bit string version of the Unicode string, encoded in the requested
|
||||
encoding. The ``errors`` parameter is the same as the parameter of the
|
||||
``unicode()`` constructor, with one additional possibility; as well as 'strict',
|
||||
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
|
||||
character references. The following example shows the different results::
|
||||
|
||||
>>> u = unichr(40960) + u'abcd' + unichr(1972)
|
||||
>>> u.encode('utf-8')
|
||||
'\xea\x80\x80abcd\xde\xb4'
|
||||
>>> u.encode('ascii')
|
||||
Traceback (most recent call last):
|
||||
File "<stdin>", line 1, in ?
|
||||
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
|
||||
>>> u.encode('ascii', 'ignore')
|
||||
'abcd'
|
||||
>>> u.encode('ascii', 'replace')
|
||||
'?abcd?'
|
||||
>>> u.encode('ascii', 'xmlcharrefreplace')
|
||||
'&#40960;abcd&#1972;'
|
||||
|
||||
Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
|
||||
interprets the string using the given encoding::
|
||||
|
||||
>>> u = unichr(40960) + u'abcd' + unichr(1972) # Assemble a string
|
||||
>>> utf8_version = u.encode('utf-8') # Encode as UTF-8
|
||||
>>> type(utf8_version), utf8_version
|
||||
(<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
|
||||
>>> u2 = utf8_version.decode('utf-8') # Decode using UTF-8
|
||||
>>> u == u2 # The two strings match
|
||||
True
|
||||
|
||||
The low-level routines for registering and accessing the available encodings are
|
||||
found in the :mod:`codecs` module. However, the encoding and decoding functions
|
||||
returned by this module are usually more low-level than is comfortable, so I'm
|
||||
not going to describe the :mod:`codecs` module here. If you need to implement a
|
||||
completely new encoding, you'll need to learn about the :mod:`codecs` module
|
||||
interfaces, but implementing encodings is a specialized task that also won't be
|
||||
covered here. Consult the Python documentation to learn more about this module.
|
||||
|
||||
The most commonly used part of the :mod:`codecs` module is the
|
||||
:func:`codecs.open` function which will be discussed in the section on input and
|
||||
output.
|
||||
|
||||
|
||||
Unicode Literals in Python Source Code
|
||||
--------------------------------------
|
||||
|
||||
In Python source code, Unicode literals are written as strings prefixed with the
|
||||
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
|
||||
using the ``\u`` escape sequence, which is followed by four hex digits giving
|
||||
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
|
||||
digits, not 4.
|
||||
|
||||
Unicode literals can also use the same escape sequences as 8-bit strings,
|
||||
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
|
||||
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.
|
||||
|
||||
::
|
||||
|
||||
    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768
|
||||
|
||||
Using escape sequences for code points greater than 127 is fine in small doses,
|
||||
but becomes an annoyance if you're using many accented characters, as you would
|
||||
in a program with messages in French or some other accent-using language. You
|
||||
can also assemble strings using the :func:`unichr` built-in function, but this is
|
||||
even more tedious.
|
||||
|
||||
Ideally, you'd want to be able to write literals in your language's natural
|
||||
encoding. You could then edit Python source code with your favorite editor
|
||||
which would display the accented characters naturally, and have the right
|
||||
characters used at runtime.
|
||||
|
||||
Python supports writing Unicode literals in any encoding, but you have to
|
||||
declare the encoding being used. This is done by including a special comment as
|
||||
either the first or second line of the source file::
|
||||
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: latin-1 -*-
|
||||
|
||||
u = u'abcdé'
|
||||
print ord(u[-1])
|
||||
|
||||
The syntax is inspired by Emacs's notation for specifying variables local to a
|
||||
file. Emacs supports many different variables, but Python only supports
|
||||
'coding'. The ``-*-`` symbols indicate that the comment is special; within
|
||||
them, you must supply the name ``coding`` and the name of your chosen encoding,
|
||||
separated by ``':'``.
|
||||
|
||||
If you don't include such a comment, the default encoding used will be ASCII.
|
||||
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
|
||||
encoding for string literals; in Python 2.4, characters greater than 127 still
|
||||
work but result in a warning. For example, the following program has no
|
||||
encoding declaration::
|
||||
|
||||
#!/usr/bin/env python
|
||||
u = u'abcdé'
|
||||
print ord(u[-1])
|
||||
|
||||
When you run it with Python 2.4, it will output the following warning::
|
||||
|
||||
amk:~$ python p263.py
|
||||
sys:1: DeprecationWarning: Non-ASCII character '\xe9'
|
||||
in file p263.py on line 2, but no encoding declared;
|
||||
see http://www.python.org/peps/pep-0263.html for details
|
||||
|
||||
|
||||
Unicode Properties
|
||||
------------------
|
||||
|
||||
The Unicode specification includes a database of information about code points.
|
||||
For each code point that's defined, the information includes the character's
|
||||
name, its category, the numeric value if applicable (Unicode has characters
|
||||
representing the Roman numerals and fractions such as one-third and
|
||||
four-fifths). There are also properties related to the code point's use in
|
||||
bidirectional text and other display-related properties.
|
||||
|
||||
The following program displays some information about several characters, and
|
||||
prints the numeric value of one particular character::
|
||||
|
||||
    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])
|
||||
|
||||
When run, this prints::
|
||||
|
||||
0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
|
||||
1 0bf2 No TAMIL NUMBER ONE THOUSAND
|
||||
2 0f84 Mn TIBETAN MARK HALANTA
|
||||
3 1770 Lo TAGBANWA LETTER SA
|
||||
4 33af So SQUARE RAD OVER S SQUARED
|
||||
1000.0
|
||||
|
||||
The category codes are abbreviations describing the nature of the character.
|
||||
These are grouped into categories such as "Letter", "Number", "Punctuation", or
|
||||
"Symbol", which in turn are broken up into subcategories. To take the codes
|
||||
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
|
||||
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
|
||||
other". See
|
||||
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a
|
||||
list of category codes.
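
To get a feel for the categories and numeric values, it can help to query a handful of characters. The sample characters below are my own choice, not taken from the program above, and the sketch avoids the ``print`` statement so that it should also run on Python 3.3+:

```python
import unicodedata

samples = [
    u'A',        # LATIN CAPITAL LETTER A
    u'\xe9',     # LATIN SMALL LETTER E WITH ACUTE
    u'7',        # DIGIT SEVEN
    u'\u2167',   # ROMAN NUMERAL EIGHT
    u'\u2158',   # VULGAR FRACTION FOUR FIFTHS
]

categories = [unicodedata.category(ch) for ch in samples]

# numeric() accepts an optional default, returned for characters
# that have no numeric value.
values = [unicodedata.numeric(ch, None) for ch in samples]

for ch, cat, value in zip(samples, categories, values):
    print('%04x %s %s %s' % (ord(ch), cat, value, unicodedata.name(ch)))
```

Without the default argument, ``unicodedata.numeric()`` raises :exc:`ValueError` for a character such as ``u'A'``.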
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The Unicode and 8-bit string types are described in the Python library reference
|
||||
at :ref:`typesseq`.
|
||||
|
||||
The documentation for the :mod:`unicodedata` module.
|
||||
|
||||
The documentation for the :mod:`codecs` module.
|
||||
|
||||
Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
|
||||
Unicode". A PDF version of his slides is available at
|
||||
<http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>, and is an
|
||||
excellent overview of the design of Python's Unicode features.
|
||||
|
||||
|
||||
Reading and Writing Unicode Data
|
||||
================================
|
||||
|
||||
Once you've written some code that works with Unicode data, the next problem is
|
||||
input/output. How do you get Unicode strings into your program, and how do you
|
||||
convert Unicode into a form suitable for storage or transmission?
|
||||
|
||||
It's possible that you may not need to do anything depending on your input
|
||||
sources and output destinations; you should check whether the libraries used in
|
||||
your application support Unicode natively. XML parsers often return Unicode
|
||||
data, for example. Many relational databases also support Unicode-valued
|
||||
columns and can return Unicode values from an SQL query.
|
||||
|
||||
Unicode data is usually converted to a particular encoding before it gets
|
||||
written to disk or sent over a socket. It's possible to do all the work
|
||||
yourself: open a file, read an 8-bit string from it, and convert the string with
|
||||
``unicode(str, encoding)``. However, the manual approach is not recommended.
|
||||
|
||||
One problem is the multi-byte nature of encodings; one Unicode character can be
|
||||
represented by several bytes. If you want to read the file in arbitrary-sized
|
||||
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
|
||||
where only part of the bytes encoding a single Unicode character are read at the
|
||||
end of a chunk. One solution would be to read the entire file into memory and
|
||||
then perform the decoding, but that prevents you from working with files that
|
||||
are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
|
||||
(More, really, since for at least a moment you'd need to have both the encoded
|
||||
string and its Unicode version in memory.)
|
||||
|
||||
The solution would be to use the low-level decoding interface to catch the case
|
||||
of partial coding sequences. The work of implementing this has already been
|
||||
done for you: the :mod:`codecs` module includes a version of the :func:`open`
|
||||
function that returns a file-like object that assumes the file's contents are in
|
||||
a specified encoding and accepts Unicode parameters for methods such as
|
||||
``.read()`` and ``.write()``.
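
To see the partial-sequence problem and the low-level interface that handles it, consider the following sketch. It assumes ``codecs.getincrementaldecoder()`` (present in Python 2.5 and later) and simulates chunked reads by slicing a byte string by hand:

```python
import codecs

data = u'caf\xe9 \u4500'.encode('utf-8')
chunks = [data[:4], data[4:]]     # the split falls inside the two-byte e-acute

# Decoding each chunk independently fails on the truncated sequence.
try:
    chunks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the incomplete bytes until the next call.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))   # flush; raises if bytes remain
text = u''.join(pieces)

print(naive_ok)
print(text == u'caf\xe9 \u4500')
```

The file-like object returned by :func:`codecs.open` does this kind of buffering for you, which is why it's the recommended route.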
|
||||
|
||||
The function's parameters are ``open(filename, mode='rb', encoding=None,
|
||||
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
|
||||
just like the corresponding parameter to the regular built-in ``open()``
|
||||
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
|
||||
to the standard function's parameter. ``encoding`` is a string giving the
|
||||
encoding to use; if it's left as ``None``, a regular Python file object that
|
||||
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
|
||||
data written to or read from the wrapper object will be converted as needed.
|
||||
``errors`` specifies the action for encoding errors and can be one of the usual
|
||||
values of 'strict', 'ignore', and 'replace'.
|
||||
|
||||
Reading Unicode from a file is therefore simple::
|
||||
|
||||
    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)
|
||||
|
||||
It's also possible to open files in update mode, allowing both reading and
|
||||
writing::
|
||||
|
||||
f = codecs.open('test', encoding='utf-8', mode='w+')
|
||||
f.write(u'\u4500 blah blah blah\n')
|
||||
f.seek(0)
|
||||
print repr(f.readline()[:1])
|
||||
f.close()
|
||||
|
||||
Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
|
||||
written as the first character of a file in order to assist with autodetection
|
||||
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
|
||||
present at the start of a file; when such an encoding is used, the BOM will be
|
||||
automatically written as the first character and will be silently dropped when
|
||||
the file is read. There are variants of these encodings, such as 'utf-16-le'
|
||||
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
|
||||
particular byte ordering and don't skip the BOM.
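
The BOM behaviour can be observed without touching a file by encoding a short string in memory. This is a sketch using the ``BOM_UTF16_LE`` and ``BOM_UTF16_BE`` constants from the :mod:`codecs` module; which of the two the plain ``'utf-16'`` codec writes depends on the machine's native byte order:

```python
import codecs

text = u'abc'

with_bom = text.encode('utf-16')       # BOM first, then native byte order
little = text.encode('utf-16-le')      # fixed byte order, no BOM written

starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# 'utf-16' drops the BOM when decoding ...
dropped = with_bom.decode('utf-16')
# ... but 'utf-16-le' keeps it as an ordinary U+FEFF character.
kept = (codecs.BOM_UTF16_LE + little).decode('utf-16-le')

print('%d bytes with BOM, %d without' % (len(with_bom), len(little)))
```

If you read a file with ``'utf-16-le'`` and find a stray ``u'\ufeff'`` at the start of the data, this is the likely cause.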
|
||||
|
||||
|
||||
Unicode filenames
|
||||
-----------------
|
||||
|
||||
Most of the operating systems in common use today support filenames that contain
|
||||
arbitrary Unicode characters. Usually this is implemented by converting the
|
||||
Unicode string into some encoding that varies depending on the system. For
|
||||
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
|
||||
Windows, Python uses the name "mbcs" to refer to whatever the currently
|
||||
configured encoding is. On Unix systems, there will only be a filesystem
|
||||
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
|
||||
you haven't, the default encoding is ASCII.
|
||||
|
||||
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
|
||||
your current system, in case you want to do the encoding manually, but there's
|
||||
not much reason to bother. When opening a file for reading or writing, you can
|
||||
usually just provide the Unicode string as the filename, and it will be
|
||||
automatically converted to the right encoding for you::
|
||||
|
||||
filename = u'filename\u4500abc'
|
||||
f = open(filename, 'w')
|
||||
f.write('blah\n')
|
||||
f.close()
|
||||
|
||||
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
|
||||
filenames.
|
||||
|
||||
:func:`os.listdir`, which returns filenames, raises an issue: should it return
|
||||
the Unicode version of filenames, or should it return 8-bit strings containing
|
||||
the encoded versions? :func:`os.listdir` will do both, depending on whether you
|
||||
provided the directory path as an 8-bit string or a Unicode string. If you pass
|
||||
a Unicode string as the path, filenames will be decoded using the filesystem's
|
||||
encoding and a list of Unicode strings will be returned, while passing an 8-bit
|
||||
path will return the 8-bit versions of the filenames. For example, assuming the
|
||||
default filesystem encoding is UTF-8, running the following program::
|
||||
|
||||
fn = u'filename\u4500abc'
|
||||
f = open(fn, 'w')
|
||||
f.close()
|
||||
|
||||
import os
|
||||
print os.listdir('.')
|
||||
print os.listdir(u'.')
|
||||
|
||||
will produce the following output::
|
||||
|
||||
amk:~$ python t.py
|
||||
['.svn', 'filename\xe4\x94\x80abc', ...]
|
||||
[u'.svn', u'filename\u4500abc', ...]
|
||||
|
||||
The first list contains UTF-8-encoded filenames, and the second list contains
|
||||
the Unicode versions.
|
||||
|
||||
|
||||
|
||||
Tips for Writing Unicode-aware Programs
|
||||
---------------------------------------
|
||||
|
||||
This section provides some suggestions on writing software that deals with
|
||||
Unicode.
|
||||
|
||||
The most important tip is:
|
||||
|
||||
Software should only work with Unicode strings internally, converting to a
|
||||
particular encoding on output.
|
||||
|
||||
If you attempt to write processing functions that accept both Unicode and 8-bit
|
||||
strings, you will find your program vulnerable to bugs wherever you combine the
|
||||
two different kinds of strings. Python's default encoding is ASCII, so whenever
|
||||
a byte with a value > 127 is in the input data, you'll get a
|
||||
:exc:`UnicodeDecodeError` because that character can't be handled by the ASCII
|
||||
encoding.
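
One way to follow the first tip is to decode at the point where data enters the program and encode only where it leaves. The function below is a minimal sketch of that shape; ``shout`` is a hypothetical name, and the upper-casing stands in for whatever processing your program really does:

```python
def shout(raw, encoding='utf-8'):
    """Upper-case encoded text, working on Unicode internally."""
    text = raw.decode(encoding)        # bytes -> Unicode at the input boundary
    result = text.upper()              # all processing happens on Unicode
    return result.encode(encoding)     # Unicode -> bytes at the output boundary


latin1_input = u'caf\xe9'.encode('latin-1')
utf8_input = u'caf\xe9'.encode('utf-8')

print(shout(latin1_input, 'latin-1') == u'CAF\xc9'.encode('latin-1'))
print(shout(utf8_input) == u'CAF\xc9'.encode('utf-8'))
```

Because the middle step never sees encoded bytes, the same function handles both inputs correctly; only the boundary code needs to know the encoding.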
|
||||
|
||||
It's easy to miss such problems if you only test your software with data that
|
||||
doesn't contain any accents; everything will seem to work, but there's actually
|
||||
a bug in your program waiting for the first user who attempts to use characters
|
||||
> 127. A second tip, therefore, is:
|
||||
|
||||
Include characters > 127 and, even better, characters > 255 in your test
|
||||
data.
|
||||
|
||||
When using data coming from a web browser or some other untrusted source, a
|
||||
common technique is to check for illegal characters in a string before using the
|
||||
string in a generated command line or storing it in a database. If you're doing
|
||||
this, be careful to check the string once it's in the form that will be used or
|
||||
stored; it's possible for encodings to be used to disguise characters. This is
|
||||
especially true if the input data also specifies the encoding; many encodings
|
||||
leave the commonly checked-for characters alone, but Python includes some
|
||||
encodings such as ``'base64'`` that modify every single character.
|
||||
|
||||
For example, let's say you have a content management system that takes a Unicode
|
||||
filename, and you want to disallow paths with a '/' character. You might write
|
||||
this code::
|
||||
|
||||
    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...
|
||||
|
||||
However, if an attacker could specify the ``'base64'`` encoding, they could pass
|
||||
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
|
||||
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
|
||||
characters in the encoded form and misses the dangerous character in the
|
||||
resulting decoded form.
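
A safer ordering decodes first and checks afterwards. The sketch below is one possible arrangement, not a complete defense; ``decode_filename`` is a hypothetical helper, and UTF-7 is used for the demonstration only because it is another codec that can disguise a slash:

```python
def decode_filename(raw_name, encoding):
    """Decode an encoded filename, then validate the decoded form."""
    name = raw_name.decode(encoding)
    if u'/' in name:
        raise ValueError("'/' not allowed in filenames")
    return name


# In UTF-7, '+AC8-' is an encoded form of '/', so a check on the raw
# bytes sees no slash at all.
disguised = b'+AC8-etc+AC8-passwd'
slash_visible = b'/' in disguised

try:
    decode_filename(disguised, 'utf-7')
    rejected = False
except ValueError:
    rejected = True

print(slash_visible)
print(rejected)
```

Better still is to avoid letting untrusted input choose the encoding at all, and to accept only a short list of encodings you expect.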
|
||||
|
||||
References
|
||||
----------
|
||||
|
||||
The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
|
||||
Applications in Python" are available at
|
||||
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
|
||||
and discuss questions of character encodings as well as how to internationalize
|
||||
and localize an application.
|
||||
|
||||
|
||||
Revision History and Acknowledgements
|
||||
=====================================
|
||||
|
||||
Thanks to the following people who have noted errors or offered suggestions on
|
||||
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
|
||||
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
|
||||
|
||||
Version 1.0: posted August 5 2005.
|
||||
|
||||
Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
|
||||
several links.
|
||||
|
||||
Version 1.02: posted August 16 2005. Corrects factual errors.
|
||||
|
||||
|
||||
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
|
||||
.. comment Describe obscure -U switch somewhere?
|
||||
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
|
||||
|
||||
.. comment
|
||||
Original outline:
|
||||
|
||||
- [ ] Unicode introduction
|
||||
- [ ] ASCII
|
||||
- [ ] Terms
|
||||
- [ ] Character
|
||||
- [ ] Code point
|
||||
- [ ] Encodings
|
||||
- [ ] Common encodings: ASCII, Latin-1, UTF-8
|
||||
- [ ] Unicode Python type
|
||||
- [ ] Writing unicode literals
|
||||
- [ ] Obscurity: -U switch
|
||||
- [ ] Built-ins
|
||||
- [ ] unichr()
|
||||
- [ ] ord()
|
||||
- [ ] unicode() constructor
|
||||
- [ ] Unicode type
|
||||
- [ ] encode(), decode() methods
|
||||
- [ ] Unicodedata module for character properties
|
||||
- [ ] I/O
|
||||
- [ ] Reading/writing Unicode data into files
|
||||
- [ ] Byte-order marks
|
||||
- [ ] Unicode filenames
|
||||
- [ ] Writing Unicode programs
|
||||
- [ ] Do everything in Unicode
|
||||
- [ ] Declaring source code encodings (PEP 263)
|
||||
- [ ] Other issues
|
||||
- [ ] Building Python (UCS2, UCS4)
|
578
Doc/howto/urllib2.rst
Normal file
|
@ -0,0 +1,578 @@
|
|||
************************************************
|
||||
HOWTO Fetch Internet Resources Using urllib2
|
||||
************************************************
|
||||
|
||||
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
|
||||
|
||||
.. note::
|
||||
|
||||
There is a French translation of an earlier revision of this
|
||||
HOWTO, available at `urllib2 - Le Manuel manquant
|
||||
<http://www.voidspace/python/articles/urllib2_francais.shtml>`_.
|
||||
|
||||
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
.. sidebar:: Related Articles
|
||||
|
||||
You may also find the following article on fetching web resources with
Python useful:
|
||||
|
||||
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
|
||||
|
||||
A tutorial on *Basic Authentication*, with examples in Python.
|
||||
|
||||
**urllib2** is a `Python <http://www.python.org>`_ module for fetching URLs
|
||||
(Uniform Resource Locators). It offers a very simple interface, in the form of
|
||||
the *urlopen* function. This is capable of fetching URLs using a variety of
|
||||
different protocols. It also offers a slightly more complex interface for
|
||||
handling common situations - like basic authentication, cookies, proxies and so
|
||||
on. These are provided by objects called handlers and openers.
|
||||
|
||||
urllib2 supports fetching URLs for many "URL schemes" (identified by the string
|
||||
before the ":" in the URL - for example "ftp" is the URL scheme of
|
||||
"ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP).
|
||||
This tutorial focuses on the most common case, HTTP.
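
As a small illustration of what "URL scheme" means here, the standard library's
URL parser can split the scheme off a URL. This is only a sketch: the parser
lives in the ``urlparse`` module on Python 2, and the fallback import for
Python 3 (``urllib.parse``) is an addition that this HOWTO does not otherwise
cover.

```python
try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3 (outside the scope of this HOWTO)

# The scheme is the part of the URL before the first ":".
urls = ('ftp://python.org/', 'http://python.org/', 'file:///etc/hosts')
schemes = [urlparse(u).scheme for u in urls]
print(schemes)
```

The scheme is what decides which handler (FTP, HTTP, local file and so on) ends
up fetching the resource.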
|
||||
|
||||
For straightforward situations *urlopen* is very easy to use. But as soon as you
|
||||
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
|
||||
understanding of the HyperText Transfer Protocol. The most comprehensive and
|
||||
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
|
||||
not intended to be easy to read. This HOWTO aims to illustrate using *urllib2*,
|
||||
with enough detail about HTTP to help you through. It is not intended to replace
|
||||
the :mod:`urllib2` docs, but is supplementary to them.
|
||||
|
||||
|
||||
Fetching URLs
|
||||
=============
|
||||
|
||||
The simplest way to use urllib2 is as follows::
|
||||
|
||||
import urllib2
|
||||
response = urllib2.urlopen('http://python.org/')
|
||||
html = response.read()
|
||||
|
||||
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we
|
||||
could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the
|
||||
purpose of this tutorial to explain the more complicated cases, concentrating on
|
||||
HTTP.
|
||||
|
||||
HTTP is based on requests and responses - the client makes requests and servers
|
||||
send responses. urllib2 mirrors this with a ``Request`` object which represents
|
||||
the HTTP request you are making. In its simplest form you create a Request
|
||||
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
|
||||
Request object returns a response object for the URL requested. This response is
|
||||
a file-like object, which means you can for example call ``.read()`` on the
|
||||
response::
|
||||
|
||||
import urllib2
|
||||
|
||||
req = urllib2.Request('http://www.voidspace.org.uk')
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
|
||||
|
||||
Note that urllib2 makes use of the same Request interface to handle all URL
|
||||
schemes. For example, you can make an FTP request like so::
|
||||
|
||||
req = urllib2.Request('ftp://example.com/')
|
||||
|
||||
In the case of HTTP, there are two extra things that Request objects allow you
|
||||
to do: First, you can pass data to be sent to the server. Second, you can pass
|
||||
extra information ("metadata") *about* the data or about the request itself, to
|
||||
the server - this information is sent as HTTP "headers". Let's look at each of
|
||||
these in turn.
|
||||
|
||||
Data
|
||||
----
|
||||
|
||||
Sometimes you want to send data to a URL (often the URL will refer to a CGI
|
||||
(Common Gateway Interface) script [#]_ or other web application). With HTTP,
|
||||
this is often done using what's known as a **POST** request. This is often what
|
||||
your browser does when you submit an HTML form that you filled in on the web. Not
|
||||
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
|
||||
to your own application. In the common case of HTML forms, the data needs to be
|
||||
encoded in a standard way, and then passed to the Request object as the ``data``
|
||||
argument. The encoding is done using a function from the ``urllib`` library,
|
||||
*not* from ``urllib2``. ::
|
||||
|
||||
import urllib
|
||||
import urllib2
|
||||
|
||||
url = 'http://www.someserver.com/cgi-bin/register.cgi'
|
||||
values = {'name' : 'Michael Foord',
|
||||
'location' : 'Northampton',
|
||||
'language' : 'Python' }
|
||||
|
||||
data = urllib.urlencode(values)
|
||||
req = urllib2.Request(url, data)
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
|
||||
|
||||
Note that other encodings are sometimes required (e.g. for file upload from HTML
|
||||
forms - see `HTML Specification, Form Submission
|
||||
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
|
||||
details).
|
||||
|
||||
If you do not pass the ``data`` argument, urllib2 uses a **GET** request. One
|
||||
way in which GET and POST requests differ is that POST requests often have
|
||||
"side-effects": they change the state of the system in some way (for example by
|
||||
placing an order with the website for a hundredweight of tinned spam to be
|
||||
delivered to your door). Though the HTTP standard makes it clear that POSTs are
|
||||
intended to *always* cause side-effects, and GET requests *never* to cause
|
||||
side-effects, nothing prevents a GET request from having side-effects, nor a
|
||||
POST request from having no side-effects. Data can also be passed in an HTTP
|
||||
GET request by encoding it in the URL itself.
|
||||
|
||||
This is done as follows::
|
||||
|
||||
>>> import urllib2
|
||||
>>> import urllib
|
||||
>>> data = {}
|
||||
>>> data['name'] = 'Somebody Here'
|
||||
>>> data['location'] = 'Northampton'
|
||||
>>> data['language'] = 'Python'
|
||||
>>> url_values = urllib.urlencode(data)
|
||||
>>> print url_values
|
||||
name=Somebody+Here&language=Python&location=Northampton
|
||||
>>> url = 'http://www.example.com/example.cgi'
|
||||
>>> full_url = url + '?' + url_values
|
||||
>>> data = urllib2.urlopen(full_url)
|
||||
|
||||
Notice that the full URL is created by adding a ``?`` to the URL, followed by
|
||||
the encoded values.
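
The difference between the two kinds of request can be checked without touching
the network, because a ``Request`` object reports the HTTP method it would use.
The following is a minimal sketch rather than part of the original examples:
``www.example.com`` is a placeholder, the Python 3 fallback imports are an
addition, and the encoded string is turned into bytes only because Python 3
requires that for the ``data`` argument.

```python
try:
    from urllib import urlencode             # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode       # Python 3
    from urllib.request import Request

url = 'http://www.example.com/example.cgi'   # placeholder URL

# A list of pairs (rather than a dict) keeps the field order predictable.
query = urlencode([('name', 'Somebody Here'), ('language', 'Python')])

get_req = Request(url + '?' + query)             # values travel in the URL
post_req = Request(url, query.encode('ascii'))   # values travel in the body

print(query)
print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent until one of these objects is passed to ``urlopen``.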
|
||||
|
||||
Headers
|
||||
-------
|
||||
|
||||
We'll discuss here one particular HTTP header, to illustrate how to add headers
|
||||
to your HTTP request.
|
||||
|
||||
Some websites [#]_ dislike being browsed by programs, or send different versions
|
||||
to different browsers [#]_ . By default urllib2 identifies itself as
|
||||
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
|
||||
numbers of the Python release,
|
||||
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
|
||||
not work. The way a browser identifies itself is through the
|
||||
``User-Agent`` header [#]_. When you create a Request object you can
|
||||
pass a dictionary of headers in. The following example makes the same
|
||||
request as above, but identifies itself as a version of Internet
|
||||
Explorer [#]_. ::
|
||||
|
||||
import urllib
|
||||
import urllib2
|
||||
|
||||
url = 'http://www.someserver.com/cgi-bin/register.cgi'
|
||||
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
|
||||
values = {'name' : 'Michael Foord',
|
||||
'location' : 'Northampton',
|
||||
'language' : 'Python' }
|
||||
headers = { 'User-Agent' : user_agent }
|
||||
|
||||
data = urllib.urlencode(values)
|
||||
req = urllib2.Request(url, data, headers)
|
||||
response = urllib2.urlopen(req)
|
||||
the_page = response.read()
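
If you want to check which headers a ``Request`` will carry before sending it,
the object can be inspected directly. This is a hedged sketch: the URL is a
placeholder, the Python 3 fallback import is an addition, and the lookup names
rely on ``Request`` normalising header names with ``str.capitalize()``.

```python
try:
    from urllib2 import Request           # Python 2
except ImportError:
    from urllib.request import Request    # Python 3

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = Request('http://www.example.com/', headers={'User-Agent': user_agent})
req.add_header('Accept-Language', 'en')   # headers can also be added later

# Names are stored capitalised ('User-agent'), so look them up in that form.
stored_agent = req.get_header('User-agent')
has_language = req.has_header('Accept-language')
all_headers = sorted(req.header_items())

print(stored_agent)
print(has_language)
print(all_headers)
```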
|
||||
|
||||
The response also has two useful methods. See the section on `info and geturl`_
|
||||
which comes after we have a look at what happens when things go wrong.
|
||||
|
||||
|
||||
Handling Exceptions
|
||||
===================
|
||||
|
||||
*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
|
||||
with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
|
||||
be raised).
|
||||
|
||||
``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
|
||||
HTTP URLs.
|
||||
|
||||
URLError
|
||||
--------
|
||||
|
||||
Often, URLError is raised because there is no network connection (no route to
|
||||
the specified server), or the specified server doesn't exist. In this case, the
|
||||
exception raised will have a 'reason' attribute, which is a tuple containing an
|
||||
error code and a text error message.
|
||||
|
||||
e.g. ::
|
||||
|
||||
    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')
|
||||
|
||||
|
||||
HTTPError
|
||||
---------
|
||||
|
||||
Every HTTP response from the server contains a numeric "status code". Sometimes
|
||||
the status code indicates that the server is unable to fulfil the request. The
|
||||
default handlers will handle some of these responses for you (for example, if
|
||||
the response is a "redirection" that requests the client fetch the document from
|
||||
a different URL, urllib2 will handle that for you). For those it can't handle,
|
||||
urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
|
||||
found), '403' (request forbidden), and '401' (authentication required).
|
||||
|
||||
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
|
||||
|
||||
The ``HTTPError`` instance raised will have an integer 'code' attribute, which
|
||||
corresponds to the error sent by the server.
|
||||
|
||||
Error Codes
|
||||
~~~~~~~~~~~
|
||||
|
||||
Because the default handlers handle redirects (codes in the 300 range), and
|
||||
codes in the 100-299 range indicate success, you will usually only see error
|
||||
codes in the 400-599 range.
|
||||
|
||||
``BaseHTTPServer.BaseHTTPRequestHandler.responses`` is a useful dictionary of
|
||||
response codes that shows all the response codes used by RFC 2616. The
|
||||
dictionary is reproduced here for convenience ::
|
||||
|
||||
# Table mapping response codes to messages; entries have the
|
||||
# form {code: (shortmessage, longmessage)}.
|
||||
responses = {
|
||||
100: ('Continue', 'Request received, please continue'),
|
||||
101: ('Switching Protocols',
|
||||
'Switching to new protocol; obey Upgrade header'),
|
||||
|
||||
200: ('OK', 'Request fulfilled, document follows'),
|
||||
201: ('Created', 'Document created, URL follows'),
|
||||
202: ('Accepted',
|
||||
'Request accepted, processing continues off-line'),
|
||||
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
|
||||
204: ('No Content', 'Request fulfilled, nothing follows'),
|
||||
205: ('Reset Content', 'Clear input form for further input.'),
|
||||
206: ('Partial Content', 'Partial content follows.'),
|
||||
|
||||
300: ('Multiple Choices',
|
||||
'Object has several resources -- see URI list'),
|
||||
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
|
||||
302: ('Found', 'Object moved temporarily -- see URI list'),
|
||||
303: ('See Other', 'Object moved -- see Method and URL list'),
|
||||
304: ('Not Modified',
|
||||
'Document has not changed since given time'),
|
||||
305: ('Use Proxy',
|
||||
'You must use proxy specified in Location to access this '
|
||||
'resource.'),
|
||||
307: ('Temporary Redirect',
|
||||
'Object moved temporarily -- see URI list'),
|
||||
|
||||
400: ('Bad Request',
|
||||
'Bad request syntax or unsupported method'),
|
||||
401: ('Unauthorized',
|
||||
'No permission -- see authorization schemes'),
|
||||
402: ('Payment Required',
|
||||
'No payment -- see charging schemes'),
|
||||
403: ('Forbidden',
|
||||
'Request forbidden -- authorization will not help'),
|
||||
404: ('Not Found', 'Nothing matches the given URI'),
|
||||
405: ('Method Not Allowed',
|
||||
'Specified method is invalid for this server.'),
|
||||
406: ('Not Acceptable', 'URI not available in preferred format.'),
|
||||
407: ('Proxy Authentication Required', 'You must authenticate with '
|
||||
'this proxy before proceeding.'),
|
||||
408: ('Request Timeout', 'Request timed out; try again later.'),
|
||||
409: ('Conflict', 'Request conflict.'),
|
||||
410: ('Gone',
|
||||
'URI no longer exists and has been permanently removed.'),
|
||||
411: ('Length Required', 'Client must specify Content-Length.'),
|
||||
412: ('Precondition Failed', 'Precondition in headers is false.'),
|
||||
413: ('Request Entity Too Large', 'Entity is too large.'),
|
||||
414: ('Request-URI Too Long', 'URI is too long.'),
|
||||
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
|
||||
416: ('Requested Range Not Satisfiable',
|
||||
'Cannot satisfy request range.'),
|
||||
417: ('Expectation Failed',
|
||||
'Expect condition could not be satisfied.'),
|
||||
|
||||
500: ('Internal Server Error', 'Server got itself in trouble'),
|
||||
501: ('Not Implemented',
|
||||
'Server does not support this operation'),
|
||||
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
|
||||
503: ('Service Unavailable',
|
||||
'The server cannot process the request due to a high load'),
|
||||
504: ('Gateway Timeout',
|
||||
'The gateway server did not receive a timely response'),
|
||||
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
|
||||
}
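
Rather than copying the table, you can look codes up in the dictionary at run
time. A small sketch follows; the ``describe`` helper is hypothetical (it is not
part of any library), and the Python 3 fallback import (``http.server``) is an
addition to this HOWTO.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3

def describe(code):
    """Return 'code short-message' for an HTTP status code (hypothetical helper)."""
    short, long_message = BaseHTTPRequestHandler.responses.get(
        code, ('Unknown', ''))
    return '%d %s' % (code, short)

print(describe(404))
print(describe(503))
print(describe(999))   # not a registered code, so the fallback is used
```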
|
||||
|
||||
When an error is raised the server responds by returning an HTTP error code
|
||||
*and* an error page. You can use the ``HTTPError`` instance as a response on the
|
||||
page returned. This means that as well as the code attribute, it also has read,
|
||||
geturl, and info methods. ::
|
||||
|
||||
    >>> req = urllib2.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib2.urlopen(req)
    ... except urllib2.HTTPError, e:
    ...     print e.code
    ...     print e.read()
    ...
|
||||
404
|
||||
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
|
||||
"http://www.w3.org/TR/html4/loose.dtd">
|
||||
<?xml-stylesheet href="./css/ht2html.css"
|
||||
type="text/css"?>
|
||||
<html><head><title>Error 404: File Not Found</title>
|
||||
...... etc...
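
The same behaviour can be tried offline by building an ``HTTPError`` by hand and
treating it as a response. This is only a sketch under stated assumptions: the
URL and page body are made up, a real ``HTTPError`` is created by urllib2 rather
than by your code, the Python 3 fallback import is an addition, and the
``except ... as`` syntax needs Python 2.6 or later (the examples in this HOWTO
use the older ``except HTTPError, e`` form).

```python
import io
try:
    from urllib2 import HTTPError          # Python 2
except ImportError:
    from urllib.error import HTTPError     # Python 3

# Hand-built stand-in for the error page a server might send with a 404.
page = io.BytesIO(b'<html><title>Error 404: File Not Found</title></html>')
error = HTTPError('http://www.example.com/fish.html', 404, 'Not Found',
                  {}, page)

try:
    raise error
except HTTPError as e:
    status = e.code          # the numeric status code
    final_url = e.geturl()   # the URL the error page came from
    body = e.read()          # the error page itself, read like a file

print(status)
print(final_url)
```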
|
||||
|
||||
Wrapping it Up
|
||||
--------------
|
||||
|
||||
So if you want to be prepared for ``HTTPError`` *or* ``URLError`` there are two
|
||||
basic approaches. I prefer the second approach.
|
||||
|
||||
Number 1
|
||||
~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
|
||||
    from urllib2 import Request, urlopen, URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        pass
|
||||
|
||||
|
||||
.. note::
|
||||
|
||||
The ``except HTTPError`` *must* come first, otherwise ``except URLError``
|
||||
will *also* catch an ``HTTPError``.
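
The effect of the ordering can be seen without a network connection by raising
the exceptions by hand. The two helper functions below are hypothetical and
exist only for this illustration; the Python 3 fallback import is an addition,
and the ``except ... as`` form needs Python 2.6 or later.

```python
try:
    from urllib2 import URLError, HTTPError        # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError   # Python 3

def classify(exc):
    """Handlers in the recommended order: the subclass first."""
    try:
        raise exc
    except HTTPError as e:
        return 'HTTP error %d' % e.code
    except URLError as e:
        return 'URL error: %s' % (e.reason,)

def classify_wrong_order(exc):
    """Same handlers, wrong order: URLError swallows HTTPError."""
    try:
        raise exc
    except URLError:
        return 'URL error'
    except HTTPError:
        return 'HTTP error'   # never reached

http_failure = HTTPError('http://www.example.com/missing', 404,
                         'Not Found', {}, None)
network_failure = URLError('connection refused')

print(classify(http_failure))
print(classify(network_failure))
print(classify_wrong_order(http_failure))
```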
|
||||
|
||||
Number 2
|
||||
~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
    from urllib2 import Request, urlopen, URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
    else:
        # everything is fine
        pass
|
||||
|
||||
|
||||
info and geturl
|
||||
===============
|
||||
|
||||
The response returned by urlopen (or the ``HTTPError`` instance) has two useful
|
||||
methods: ``info`` and ``geturl``.
|
||||
|
||||
**geturl** - this returns the real URL of the page fetched. This is useful
|
||||
because ``urlopen`` (or the opener object used) may have followed a
|
||||
redirect. The URL of the page fetched may not be the same as the URL requested.
|
||||
|
||||
**info** - this returns a dictionary-like object that describes the page
|
||||
fetched, particularly the headers sent by the server. It is currently an
|
||||
``httplib.HTTPMessage`` instance.
|
||||
|
||||
Typical headers include 'Content-length', 'Content-type', and so on. See the
|
||||
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
|
||||
for a useful listing of HTTP headers with brief explanations of their meaning
|
||||
and use.
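
Both methods can be tried without a web server, since ``urlopen`` also handles
``file:`` URLs and returns the same kind of response object. The sketch below
assumes a POSIX-style absolute path for the temporary file (so that the URL can
be built by simple concatenation); the Python 3 fallback import is an addition.

```python
import os
import tempfile
try:
    from urllib2 import urlopen           # Python 2
except ImportError:
    from urllib.request import urlopen    # Python 3

# Create a small local file to fetch.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello')
os.close(fd)

try:
    response = urlopen('file://' + path)   # assumes a POSIX-style path
    fetched_url = response.geturl()        # the URL actually fetched
    headers = response.info()              # dictionary-like header object
    content_length = headers['Content-length']
    body = response.read()
    response.close()
finally:
    os.remove(path)

print(fetched_url)
print(content_length)
```

For HTTP responses the same calls return the final URL after any redirects and
the headers sent by the server.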
|
||||
|
||||
|
||||
Openers and Handlers
|
||||
====================
|
||||
|
||||
When you fetch a URL you use an opener (an instance of the perhaps
|
||||
confusingly-named :class:`urllib2.OpenerDirector`). Normally we have been using
|
||||
the default opener - via ``urlopen`` - but you can create custom
|
||||
openers. Openers use handlers. All the "heavy lifting" is done by the
|
||||
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
|
||||
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
|
||||
redirections or HTTP cookies.
|
||||
|
||||
You will want to create openers if you want to fetch URLs with specific handlers
|
||||
installed, for example to get an opener that handles cookies, or to get an
|
||||
opener that does not handle redirections.
|
||||
|
||||
To create an opener, instantiate an ``OpenerDirector``, and then call
|
||||
``.add_handler(some_handler_instance)`` repeatedly.
|
||||
|
||||
Alternatively, you can use ``build_opener``, which is a convenience function for
|
||||
creating opener objects with a single function call. ``build_opener`` adds
|
||||
several handlers by default, but provides a quick way to add more and/or
|
||||
override the default handlers.
|
||||
|
||||
Other sorts of handlers you might want can handle proxies, authentication,
|
||||
and other common but slightly specialised situations.
|
||||
|
||||
``install_opener`` can be used to make an ``opener`` object the (global) default
|
||||
opener. This means that calls to ``urlopen`` will use the opener you have
|
||||
installed.
|
||||
|
||||
Opener objects have an ``open`` method, which can be called directly to fetch
|
||||
URLs in the same way as the ``urlopen`` function: there's no need to call
|
||||
``install_opener``, except as a convenience.
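
The two ways of creating an opener can be compared side by side. This is a
sketch, not a recommended recipe: ``handler_names`` is a hypothetical helper,
it reads the opener's ``handlers`` list, which is an implementation detail
rather than a documented interface, and the Python 3 fallback import is an
addition.

```python
try:
    from urllib2 import (OpenerDirector, HTTPHandler,          # Python 2
                         HTTPCookieProcessor, build_opener)
except ImportError:
    from urllib.request import (OpenerDirector, HTTPHandler,   # Python 3
                                HTTPCookieProcessor, build_opener)

def handler_names(opener):
    """Class names of the handlers installed in an opener (hypothetical helper)."""
    return sorted(h.__class__.__name__ for h in opener.handlers)

# Built by hand: only the handlers we add are present.
bare = OpenerDirector()
bare.add_handler(HTTPHandler())

# Built with build_opener: the default handlers plus the one we pass in.
full = build_opener(HTTPCookieProcessor())

print(handler_names(bare))
print(handler_names(full))
```

Either object can then be used through its ``open`` method; nothing has been
installed globally at this point.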
|
||||
|
||||
|
||||
Basic Authentication
|
||||
====================
|
||||
|
||||
To illustrate creating and installing a handler we will use the
|
||||
``HTTPBasicAuthHandler``. For a more detailed discussion of this subject -
|
||||
including an explanation of how Basic Authentication works - see the `Basic
|
||||
Authentication Tutorial
|
||||
<http://www.voidspace.org.uk/python/articles/authentication.shtml>`_.
|
||||
|
||||
When authentication is required, the server sends a header (as well as the 401
|
||||
error code) requesting authentication. This specifies the authentication scheme
|
||||
and a 'realm'. The header looks like: ``Www-authenticate: SCHEME
|
||||
realm="REALM"``.
|
||||
|
||||
e.g. ::
|
||||
|
||||
Www-authenticate: Basic realm="cPanel Users"
|
||||
|
||||
|
||||
The client should then retry the request with the appropriate name and password
|
||||
for the realm included as a header in the request. This is 'basic
|
||||
authentication'. In order to simplify this process we can create an instance of
|
||||
``HTTPBasicAuthHandler`` and an opener to use this handler.
|
||||
|
||||
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
|
||||
the mapping of URLs and realms to passwords and usernames. If you know what the
|
||||
realm is (from the authentication header sent by the server), then you can use a
|
||||
``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
|
||||
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
|
||||
you to specify a default username and password for a URL. This will be supplied
|
||||
in the absence of you providing an alternative combination for a specific
|
||||
realm. We indicate this by providing ``None`` as the realm argument to the
|
||||
``add_password`` method.
|
||||
|
||||
The top-level URL is the first URL that requires authentication. URLs "deeper"
|
||||
than the URL you pass to ``.add_password()`` will also match. ::
|
||||
|
||||
# create a password manager
|
||||
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
|
||||
|
||||
# Add the username and password.
|
||||
# If we knew the realm, we could use it instead of ``None``.
|
||||
top_level_url = "http://example.com/foo/"
|
||||
password_mgr.add_password(None, top_level_url, username, password)
|
||||
|
||||
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
|
||||
|
||||
# create "opener" (OpenerDirector instance)
|
||||
opener = urllib2.build_opener(handler)
|
||||
|
||||
# use the opener to fetch a URL
|
||||
opener.open(a_url)
|
||||
|
||||
# Install the opener.
|
||||
# Now all calls to urllib2.urlopen use our opener.
|
||||
urllib2.install_opener(opener)
|
||||
|
||||
.. note::
|
||||
|
||||
In the above example we only supplied our ``HTTPBasicAuthHandler`` to
|
||||
``build_opener``. By default openers have the handlers for normal situations
|
||||
-- ``ProxyHandler``, ``UnknownHandler``, ``HTTPHandler``,
|
||||
``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
|
||||
``FileHandler``, ``HTTPErrorProcessor``.
|
||||
|
||||
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
|
||||
component and the hostname and optionally the port number)
|
||||
e.g. "http://example.com/" *or* an "authority" (i.e. the hostname,
|
||||
optionally including the port number) e.g. "example.com" or "example.com:8080"
|
||||
(the latter example includes a port number). The authority, if present, must
|
||||
NOT contain the "userinfo" component - for example "joe:password@example.com" is
|
||||
not correct.
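
How the password manager matches URLs can be checked directly by asking it for
credentials, which is what the handler does when a server demands
authentication. In this sketch the user name, password, realm and URLs are
invented for illustration, and the Python 3 fallback import is an addition.

```python
try:
    from urllib2 import HTTPPasswordMgrWithDefaultRealm           # Python 2
except ImportError:
    from urllib.request import HTTPPasswordMgrWithDefaultRealm    # Python 3

password_mgr = HTTPPasswordMgrWithDefaultRealm()
# None as the realm means "use these credentials whatever the realm is".
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL "deeper" than the registered one matches, for any realm ...
deeper = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/foo/bar/page.html')
# ... but a URL outside the registered prefix does not.
elsewhere = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/other/')

print(deeper)
print(elsewhere)
```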
|
||||
|
||||
|
||||
Proxies
|
||||
=======
|
||||
|
||||
**urllib2** will auto-detect your proxy settings and use those. This is through
|
||||
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
|
||||
a good thing, but there are occasions when it may not be helpful [#]_. One way
to bypass the detected proxy is to set up our own ``ProxyHandler``, with no proxies defined. This
|
||||
is done using similar steps to setting up a `Basic Authentication`_ handler::
|
||||
|
||||
>>> proxy_support = urllib2.ProxyHandler({})
|
||||
>>> opener = urllib2.build_opener(proxy_support)
|
||||
>>> urllib2.install_opener(opener)
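
The dictionary passed to ``ProxyHandler`` maps URL scheme names to proxy URLs,
so the same mechanism can select a specific proxy instead of disabling proxies.
The sketch below only builds the openers and does not fetch anything;
``proxy.example.com:3128`` is a made-up address, the ``proxies`` attribute it
prints is an implementation detail, and the Python 3 fallback import is an
addition.

```python
try:
    from urllib2 import ProxyHandler, OpenerDirector, build_opener          # Python 2
except ImportError:
    from urllib.request import ProxyHandler, OpenerDirector, build_opener   # Python 3

# An empty mapping: never use a proxy, whatever the environment says.
no_proxy = ProxyHandler({})
direct_opener = build_opener(no_proxy)

# An explicit mapping: send http requests through one (hypothetical) proxy.
explicit = ProxyHandler({'http': 'http://proxy.example.com:3128'})
proxied_opener = build_opener(explicit)

print(no_proxy.proxies)
print(explicit.proxies)
```

Neither opener affects ``urlopen`` until it is passed to ``install_opener``;
until then, use its ``open`` method directly.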
|
||||
|
||||
.. note::
|
||||
|
||||
Currently ``urllib2`` *does not* support fetching of ``https`` locations
|
||||
through a proxy. However, this can be enabled by extending urllib2 as
|
||||
shown in the recipe [#]_.
|
||||
|
||||
|
||||
Sockets and Layers
|
||||
==================
|
||||
|
||||
The Python support for fetching resources from the web is layered. urllib2 uses
|
||||
the httplib library, which in turn uses the socket library.
|
||||
|
||||
As of Python 2.3 you can specify how long a socket should wait for a response
|
||||
before timing out. This can be useful in applications which have to fetch web
|
||||
pages. By default the socket module has *no timeout* and can hang. Currently,
|
||||
the socket timeout is not exposed at the httplib or urllib2 levels. However,
|
||||
you can set the default timeout globally for all sockets using ::
|
||||
|
||||
import socket
|
||||
import urllib2
|
||||
|
||||
# timeout in seconds
|
||||
timeout = 10
|
||||
socket.setdefaulttimeout(timeout)
|
||||
|
||||
# this call to urllib2.urlopen now uses the default timeout
|
||||
# we have set in the socket module
|
||||
req = urllib2.Request('http://www.voidspace.org.uk')
|
||||
response = urllib2.urlopen(req)
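
Because the default timeout is global to the process, it affects every socket
created afterwards, not only those used by urllib2. A cautious pattern, sketched
here as a suggestion rather than as part of the original example, is to remember
the previous value and restore it when you are done.

```python
import socket

previous = socket.getdefaulttimeout()   # None means "no timeout"
socket.setdefaulttimeout(10)            # timeout in seconds
try:
    # Any socket created now, including those urllib2 creates, starts
    # with the default timeout.
    sock = socket.socket()
    timeout_seen = sock.gettimeout()
    sock.close()
finally:
    # Put the old setting back so that other code is not affected.
    socket.setdefaulttimeout(previous)

print(timeout_seen)
```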
|
||||
|
||||
|
||||
-------
|
||||
|
||||
|
||||
Footnotes
|
||||
=========
|
||||
|
||||
This document was reviewed and revised by John Lee.
|
||||
|
||||
.. [#] For an introduction to the CGI protocol see
|
||||
`Writing Web Applications in Python <http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html>`_.
|
||||
.. [#] Like Google for example. The *proper* way to use Google from a program
|
||||
is to use `PyGoogle <http://pygoogle.sourceforge.net>`_ of course. See
|
||||
`Voidspace Google <http://www.voidspace.org.uk/python/recipebook.shtml#google>`_
|
||||
for some examples of using the Google API.
|
||||
.. [#] Browser sniffing is a very bad practice for website design - building
|
||||
sites using web standards is much more sensible. Unfortunately a lot of
|
||||
sites still send different versions to different browsers.
|
||||
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
|
||||
.. [#] In my case I have to use a proxy to access the internet at work. If you
|
||||
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
|
||||
is set to use the proxy, which urllib2 picks up on. In order to test
|
||||
scripts with a localhost server, I have to prevent urllib2 from using
|
||||
the proxy.
|
||||
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
|
||||
<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
|
||||
|