NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size, with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
        followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads.  Almost never written.
    Size: 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size.  Bulk of the work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.
    Some use cases have two consecutive accesses to the same key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()

    * Counting elements in a sequence.
        for e in seqn:
            d[e] = d.get(e, 0) + 1

    * Accumulating references in a dictionary of lists:

        for pagenumber, page in enumerate(pages):
          for word in page:
            d.setdefault(word, []).append(pagenumber)

    Note that the second example is a use case characterized by a get
    and a set to the same key.  There are similar use cases with a
    __contains__ followed by a get, set, or del to the same key.  Part
    of the justification for d.setdefault is combining the two lookups
    into one, as the sketch below illustrates.
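
    A minimal sketch of that trade-off (d, word, and pagenumber stand
    in for any dictionary, key, and value):

        # Test-then-set: one lookup for the membership test, another
        # for the store, and a third for the indexing operation.
        if word not in d:
            d[word] = []
        d[word].append(pagenumber)

        # setdefault finds or inserts the key with a single probe
        # of the table.
        d.setdefault(word, []).append(pagenumber)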

Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
        such as with the % formatting operator.
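
    For example, each substitution field in a %-format string triggers
    one read against a dictionary that is never written during the
    operation (the names here are illustrative):

        d = {'name': 'dictobject.c', 'size': 8}
        print('%(name)s starts with %(size)d slots' % d)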

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.

Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Smalldicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
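
A quick check of those numbers (assuming each entry is a PyDictEntry
holding a hash and two pointers, each 4 bytes wide on a 32-bit box):

    CACHE_LINE = 64                             # bytes, per the heading above
    ENTRY_SIZE = 3 * 4                          # me_hash, me_key, me_value
    print(CACHE_LINE / float(ENTRY_SIZE))       # 5.333... entries per cache line
    print(8 * ENTRY_SIZE / float(CACHE_LINE))   # smalltable: 96 bytes = 1.5 lines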


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero out every cell.
    Each additional 8 cells consumes 1.5 cache lines.  Increasing the
    minimum size improves the sparseness of small dictionaries but costs
    time to read in the additional cache lines if they are not already
    in cache.  That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries denser, resulting in more
    collisions.  Decreasing it improves sparseness at the expense of
    spreading entries over more cache lines and of greater total memory
    consumption.

    The load test occurs in highly time-sensitive code.  Efforts to
    make the test more complex (for example, varying the load for
    different sizes) have degraded performance.

* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes, less
    effort to resize, better sparseness for some (but not all) dict
    sizes, and potentially doubled memory consumption, depending on
    the size of the dictionary.  Setting it to *4 eliminates every
    other resize step.  The sketch below illustrates the trade-off.
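
    A rough simulation of the resize count (assuming growth whenever an
    insertion pushes the load past 2/3; dictresize() in the C code sizes
    the new table from ma_used, so the real counts differ in detail):

        def count_resizes(n_keys, growth, minsize=8):
            size, resizes = minsize, 0
            for used in range(1, n_keys + 1):
                if used * 3 >= size * 2:    # load has exceeded 2/3
                    size *= growth
                    resizes += 1
            return resizes, size

        for n in (10, 100, 1000):
            print(n, count_resizes(n, 2), count_resizes(n, 4))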

Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Also, every dictionary is traversed in full at least twice: once by
the memset() when it is created and once by dealloc().


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries with a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i ^= 1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 step to an unrolled
version of the loop.
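
A rough Python model of the modified probe order (PERTURB_SHIFT is 5
in dictobject.c; the real change would be made in C, and the pairing
only helps when slot i's neighbor shares its cache line):

    PERTURB_SHIFT = 5

    def probe_order(h, mask):
        # Yield the slot sequence for a (nonnegative) hash h, visiting
        # each slot's neighbor (i ^ 1) before taking the usual
        # perturbed jump.
        i = h
        perturb = h
        while True:
            yield i & mask
            yield (i & mask) ^ 1        # adjacent slot, same cache line
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

    from itertools import islice
    print(list(islice(probe_order(123456, 7), 6)))   # first six slots probed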

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
and for large dicts, then the versions for large dicts can be given
an alternate search strategy without increasing collisions in small
dicts, which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
access pattern (as opposed to a more common pattern with a Zipf's law
distribution), then there will be more benefit for large dictionaries
because any given key is no more likely than another to already be
in cache.

* In use cases with paired accesses to the same key, the second access
is always in cache and gets no benefit from efforts to further improve
cache locality.

Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries (see
  the sketch after this list).  This is simple to implement, guaranteed
  to terminate rapidly, never searches the same entry twice, and
  precludes the need to check for dummy entries.

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization, with the most common entries at the
  top of the heap and a rapid binary search pattern.  Most probes and
  their results sit near the top of the tree, allowing them all to fit
  in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
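
A minimal Python model of the linear-search idea (the real version
would be a specialized C lookup over the contiguous entry array):

    def small_lookup(entries, key, key_hash):
        # entries is a short, contiguous list of (hash, key, value)
        # triples.  A linear scan never revisits a slot and, if deleted
        # slots are compacted away, needs no dummy-entry checks.
        for h, k, v in entries:
            if h == key_hash and k == key:
                return v
        raise KeyError(key)

    table = [(hash(k), k, len(k)) for k in ('a', 'bb', 'ccc')]
    print(small_lookup(table, 'bb', hash('bb')))    # -> 2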


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%.  This will sharply
   curtail the number of collisions but will increase iteration time.
   The builtin namespace is a prime example of a dictionary that can
   benefit from being highly sparse.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but the key insertion will go
   more quickly because the first half of the keys will be inserted into
   a sparser environment than before.  The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence and the number of *unique* keys is known, as in the sketch
   below.
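
   A sketch of the hint (presize() is hypothetical; CPython exposes no
   such public primitive, but the C implementation could size the table
   before the first insertion):

       def build_index(items, n_unique):
           d = presize(n_unique)   # hypothetical: allocate enough slots up front
           for k, v in items:
               d[k] = v            # no resizes occur during this loop
           return d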

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.


Readonly Dictionaries
---------------------
Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.
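
In later versions of Python, the error-prevention half of this idea is
available as types.MappingProxyType, a read-only view (it does not
rebuild or re-optimize the table, so it captures none of the lookup
benefits discussed below):

    from types import MappingProxyType

    d = {'pi': 3.14159, 'e': 2.71828}
    frozen = MappingProxyType(d)
    print(frozen['pi'])                   # reads work as usual
    try:
        frozen['tau'] = 6.28318
    except TypeError:
        pass                              # writes are rejected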

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->ma_mask + 1) steps.  Alternatively, a separate tuple of keys can
be kept just for iteration.


Caching Lookups
---------------
The idea is to exploit key access patterns by anticipating future lookups
based on previous lookups.

The simplest incarnation is to save the most recently accessed entry.
This gives optimal performance for use cases where every get is followed
by a set or del to the same key.
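
A rough Python model of that simplest incarnation (in C the cache
would be a saved entry pointer inside the dictobject, checked before
the normal probe in lookdict()):

    class CachingDict(dict):
        """Remember the last entry accessed so an immediately
        following access to the same key skips the table probe."""

        def __init__(self, *args, **kwargs):
            dict.__init__(self, *args, **kwargs)
            self._last = None                  # (key, value) of last access

        def __getitem__(self, key):
            last = self._last
            if last is not None and last[0] == key:
                return last[1]                 # hit in the one-entry cache
            value = dict.__getitem__(self, key)
            self._last = (key, value)
            return value

        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)
            self._last = (key, value)          # keep the cache coherent

        def __delitem__(self, key):
            dict.__delitem__(self, key)
            self._last = None                  # invalidate on deletion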
