mirror of
				https://github.com/python/cpython.git
				synced 2025-10-25 15:58:57 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			431 lines
		
	
	
	
		
			16 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
			
		
		
	
	
			431 lines
		
	
	
	
		
			16 KiB
		
	
	
	
		
			Text
		
	
	
	
	
	
| This document explains Crochemore and Perrin's Two-Way string matching
 | |
| algorithm, in which a smaller string (the "pattern" or "needle")
 | |
| is searched for in a longer string (the "text" or "haystack"),
 | |
| determining whether the needle is a substring of the haystack, and if
 | |
| so, at what index(es). It is to be used by Python's string
 | |
| (and bytes-like) objects when calling `find`, `index`, `__contains__`,
 | |
| or implicitly in methods like `replace` or `partition`.
 | |
| 
 | |
| This is essentially a re-telling of the paper
 | |
| 
 | |
|     Crochemore M., Perrin D., 1991, Two-way string-matching,
 | |
|         Journal of the ACM 38(3):651-675.
 | |
| 
 | |
| focused more on understanding and examples than on rigor. See also
 | |
| the code sample here:
 | |
| 
 | |
|     http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
 | |
| 
 | |
| The algorithm runs in O(len(needle) + len(haystack)) time and with
 | |
| O(1) space. However, since there is a larger preprocessing cost than
 | |
| simpler algorithms, this Two-Way algorithm is to be used only when the
 | |
| needle and haystack lengths meet certain thresholds.
 | |
| 
 | |
| 
 | |
| These are the basic steps of the algorithm:
 | |
| 
 | |
|     * "Very carefully" cut the needle in two.
 | |
|     * For each alignment attempted:
 | |
|         1. match the right part
 | |
|             * On failure, jump by the amount matched + 1
 | |
|         2. then match the left part.
 | |
|             * On failure jump by max(len(left), len(right)) + 1
 | |
|     * If the needle is periodic, don't re-do comparisons; maintain
 | |
|       a "memory" of how many characters you already know match.
 | |
| 
 | |
| 
 | |
| -------- Matching the right part --------
 | |
| 
 | |
| We first scan the right part of the needle to check if it matches the
 | |
| the aligned characters in the haystack. We scan left-to-right,
 | |
| and if a mismatch occurs, we jump ahead by the amount matched plus 1.
 | |
| 
 | |
| Example:
 | |
| 
 | |
|        text:    ........EFGX...................
 | |
|     pattern:    ....abcdEFGH....
 | |
|         cut:        <<<<>>>>
 | |
| 
 | |
| Matched 3, so jump ahead by 4:
 | |
| 
 | |
|        text:    ........EFGX...................
 | |
|     pattern:        ....abcdEFGH....
 | |
|         cut:            <<<<>>>>
 | |
| 
 | |
| Why are we allowed to do this? Because we cut the needle very
 | |
| carefully, in such a way that if the cut is ...abcd + EFGH... then
 | |
| we have
 | |
| 
 | |
|         d != E
 | |
|        cd != EF
 | |
|       bcd != EFG
 | |
|      abcd != EFGH
 | |
|           ... and so on.
 | |
| 
 | |
| If this is true for every pair of equal-length substrings around the
 | |
| cut, then the following alignments do not work, so we can skip them:
 | |
| 
 | |
|        text:    ........EFG....................
 | |
|     pattern:     ....abcdEFGH....
 | |
|                         ^   (Bad because d != E)
 | |
|        text:    ........EFG....................
 | |
|     pattern:      ....abcdEFGH....
 | |
|                         ^^   (Bad because cd != EF)
 | |
|        text:    ........EFG....................
 | |
|     pattern:       ....abcdEFGH....
 | |
|                         ^^^   (Bad because bcd != EFG)
 | |
| 
 | |
| Skip 3 alignments => increment alignment by 4.
 | |
| 
 | |
| 
 | |
| -------- If len(left_part) < len(right_part) --------
 | |
| 
 | |
| Above is the core idea, and it begins to suggest how the algorithm can
 | |
| be linear-time. There is one bit of subtlety involving what to do
 | |
| around the end of the needle: if the left half is shorter than the
 | |
| right, then we could run into something like this:
 | |
| 
 | |
|        text:    .....EFG......
 | |
|     pattern:       cdEFGH
 | |
| 
 | |
| The same argument holds that we can skip ahead by 4, so long as
 | |
| 
 | |
|        d != E
 | |
|       cd != EF
 | |
|      ?cd != EFG
 | |
|     ??cd != EFGH
 | |
|          etc.
 | |
| 
 | |
| The question marks represent "wildcards" that always match; they're
 | |
| outside the limits of the needle, so there's no way for them to
 | |
| invalidate a match. To ensure that the inequalities above are always
 | |
| true, we need them to be true for all possible '?' values. We thus
 | |
| need cd != FG and cd != GH, etc.
 | |
| 
 | |
| 
 | |
| -------- Matching the left part --------
 | |
| 
 | |
| Once we have ensured the right part matches, we scan the left part
 | |
| (order doesn't matter, but traditionally right-to-left), and if we
 | |
| find a mismatch, we jump ahead by
 | |
| max(len(left_part), len(right_part)) + 1. That we can jump by
 | |
| at least len(right_part) + 1 we have already seen:
 | |
| 
 | |
|        text: .....EFG.....
 | |
|     pattern:  abcdEFG
 | |
|     Matched 3, so jump by 4,
 | |
|     using the fact that d != E, cd != EF, and bcd != EFG.
 | |
| 
 | |
| But we can also jump by at least len(left_part) + 1:
 | |
| 
 | |
|        text: ....cdEF.....
 | |
|     pattern:   abcdEF
 | |
|     Jump by len('abcd') + 1 = 5.
 | |
| 
 | |
|     Skip the alignments:
 | |
|        text: ....cdEF.....
 | |
|     pattern:    abcdEF
 | |
|        text: ....cdEF.....
 | |
|     pattern:     abcdEF
 | |
|        text: ....cdEF.....
 | |
|     pattern:      abcdEF
 | |
|        text: ....cdEF.....
 | |
|     pattern:       abcdEF
 | |
| 
 | |
| This requires the following facts:
 | |
|        d != E
 | |
|       cd != EF
 | |
|      bcd != EF?
 | |
|     abcd != EF??
 | |
|          etc., for all values of ?s, as above.
 | |
| 
 | |
| If we have both sets of inequalities, then we can indeed jump by
 | |
| max(len(left_part), len(right_part)) + 1. Under the assumption of such
 | |
| a nice splitting of the needle, we now have enough to prove linear
 | |
| time for the search: consider the forward-progress/comparisons ratio
 | |
| at each alignment position. If a mismatch occurs in the right part,
 | |
| the ratio is 1 position forward per comparison. On the other hand,
 | |
| if a mismatch occurs in the left half, we advance by more than
 | |
| len(needle)//2 positions for at most len(needle) comparisons,
 | |
| so this ratio is more than 1/2. This average "movement speed" is
 | |
| bounded below by the constant "1 position per 2 comparisons", so we
 | |
| have linear time.
 | |
| 
 | |
| 
 | |
| -------- The periodic case --------
 | |
| 
 | |
| The sets of inequalities listed so far seem too good to be true in
 | |
| the general case. Indeed, they fail when a needle is periodic:
 | |
| there's no way to split 'AAbAAbAAbA' in two such that
 | |
| 
 | |
|     (the stuff n characters to the left of the split)
 | |
|     cannot equal
 | |
|     (the stuff n characters to the right of the split)
 | |
|     for all n.
 | |
| 
 | |
| This is because no matter how you cut it, you'll get
 | |
| s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
 | |
| needle in two so that n can be as big as possible. If we were to
 | |
| split it as
 | |
| 
 | |
|     AAbA + AbAAbA
 | |
| 
 | |
| then A == A at the split, so this is bad (we failed at length 1), but
 | |
| if we split it as
 | |
| 
 | |
|     AA + bAAbAAbA
 | |
| 
 | |
| we at least have A != b and AA != bA, and we fail at length 3
 | |
| since ?AA == bAA. We already knew that a cut to make length-3
 | |
| mismatch was impossible due to the period, but we now see that the
 | |
| bound is sharp; we can get length-1 and length-2 to mismatch.
 | |
| 
 | |
| This is exactly the content of the *critical factorization theorem*:
 | |
| that no matter the period of the original needle, you can cut it in
 | |
| such a way that (with the appropriate question marks),
 | |
| needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
 | |
| 
 | |
| Even "non-periodic" strings are periodic with a period equal to
 | |
| their length, so for such needles, the CFT already guarantees that
 | |
| the algorithm described so far will work, since we can cut the needle
 | |
| so that the length-k chunks on either side of the cut mismatch for all
 | |
| k < len(needle). Looking closer at the algorithm, we only actually
 | |
| require that k go up to max(len(left_part), len(right_part)).
 | |
| So long as the period exceeds that, we're good.
 | |
| 
 | |
| The more general shorter-period case is a bit harder. The essentials
 | |
| are the same, except we use the periodicity to our advantage by
 | |
| "remembering" periods that we've already compared. In our running
 | |
| example, say we're computing
 | |
| 
 | |
|     "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
 | |
| 
 | |
| We cut as AA + bAAbAAbA, and then the algorithm runs as follows:
 | |
| 
 | |
|     First alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|     AAbAAbAAbA
 | |
|       ^^X
 | |
|     - Mismatch at third position, so jump by 3.
 | |
|     - This requires that A!=b and AA != bA.
 | |
| 
 | |
|     Second alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|        AAbAAbAAbA
 | |
|          ^^^^^^^^
 | |
|         X
 | |
|     - Matched entire right part
 | |
|     - Mismatch at left part.
 | |
|     - Jump forward a period, remembering the existing comparisons
 | |
| 
 | |
|     Third alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|           AAbAAbAAbA
 | |
|           mmmmmmm^^X
 | |
|     - There's "memory": a bunch of characters were already matched.
 | |
|     - Two more characters match beyond that.
 | |
|     - The 8th character of the right part mismatched, so jump by 8
 | |
|     - The above rule is more complicated than usual: we don't have
 | |
|       the right inequalities for lengths 1 through 7, but we do have
 | |
|       shifted copies of the length-1 and length-2 inequalities,
 | |
|       along with knowledge of the mismatch. We can skip all of these
 | |
|       alignments at once:
 | |
| 
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                AAbAAbAAbA
 | |
|                 ~                   A != b at the cut
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                 AAbAAbAAbA
 | |
|                 ~~                  AA != bA at the cut
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                  AAbAAbAAbA
 | |
|                    ^^^^X            7-3=4 match, and the 5th misses.
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                   AAbAAbAAbA
 | |
|                    ~                A != b at the cut
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                    AAbAAbAAbA
 | |
|                    ~~               AA != bA at the cut
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                     AAbAAbAAbA
 | |
|                       ^X            7-3-3=1 match and the 2nd misses.
 | |
|         bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                      AAbAAbAAbA
 | |
|                       ~             A != b at the cut
 | |
| 
 | |
|     Fourth alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                  AAbAAbAAbA
 | |
|                    ^X
 | |
|     - Second character mismatches, so jump by 2.
 | |
| 
 | |
|     Fifth alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                   AAbAAbAAbA
 | |
|                     ^^^^^^^^
 | |
|                    X
 | |
|     - Right half matches, so use memory and skip ahead by period=3
 | |
| 
 | |
|     Sixth alignment:
 | |
|     bbbAbbAAbAAbAAbbbAAbAAbAAbAA
 | |
|                      AAbAAbAAbA
 | |
|                      mmmmmmmm^^
 | |
|     - Right part matches, left part is remembered, found a match!
 | |
| 
 | |
| The one tricky skip by 8 here generalizes: if we have a period of p,
 | |
| then the CFT says we can ensure the cut has the inequality property
 | |
| for lengths 1 through p-1, and jumping by p would line up the
 | |
| matching characters and mismatched character one period earlier.
 | |
| Inductively, this proves that we can skip by the number of characters
 | |
| matched in the right half, plus 1, just as in the original algorithm.
 | |
| 
 | |
| To make it explicit, the memory is set whenever the entire right part
 | |
| is matched and is then used as a starting point in the next alignment.
 | |
| In such a case, the alignment jumps forward one period, and the right
 | |
| half matches all except possibly the last period. Additionally,
 | |
| if we cut so that the left part has a length strictly less than the
 | |
| period (we always can!), then we can know that the left part already
 | |
| matches. The memory is reset to 0 whenever there is a mismatch in the
 | |
| right part.
 | |
| 
 | |
| To prove linearity for the periodic case, note that if a right-part
 | |
| character mismatches, then we advance forward 1 unit per comparison.
 | |
| On the other hand, if the entire right part matches, then the skipping
 | |
| forward by one period "defers" some of the comparisons to the next
 | |
| alignment, where they will then be spent at the usual rate of
 | |
| one comparison per step forward. Even if left-half comparisons
 | |
| are always "wasted", they constitute less than half of all
 | |
| comparisons, so the average rate is certainly at least 1 move forward
 | |
| per 2 comparisons.
 | |
| 
 | |
| 
 | |
| -------- When to choose the periodic algorithm ---------
 | |
| 
 | |
| The periodic algorithm is always valid but has an overhead of one
 | |
| more "memory" register and some memory computation steps, so the
 | |
| here-described-first non-periodic/long-period algorithm -- skipping by
 | |
| max(len(left_part), len(right_part)) + 1 rather than the period --
 | |
| should be preferred when possible.
 | |
| 
 | |
| Interestingly, the long-period algorithm does not require an exact
 | |
| computation of the period; it works even with some long-period, but
 | |
| undeniably "periodic" needles:
 | |
| 
 | |
|     Cut: AbcdefAbc == Abcde + fAbc
 | |
| 
 | |
| This cut gives these inequalities:
 | |
| 
 | |
|                  e != f
 | |
|                 de != fA
 | |
|                cde != fAb
 | |
|               bcde != fAbc
 | |
|              Abcde != fAbc?
 | |
|     The first failure is a period long, per the CFT:
 | |
|             ?Abcde == fAbc??
 | |
| 
 | |
| A sufficient condition for using the long-period algorithm is having
 | |
| the period of the needle be greater than
 | |
| max(len(left_part), len(right_part)). This way, after choosing a good
 | |
| split, we get all of the max(len(left_part), len(right_part))
 | |
| inequalities around the cut that were required in the long-period
 | |
| version of the algorithm.
 | |
| 
 | |
| With all of this in mind, here's how we choose:
 | |
| 
 | |
|     (1) Choose a "critical factorization" of the needle -- a cut
 | |
|         where we have period minus 1 inequalities in a row.
 | |
|         More specifically, choose a cut so that the left_part
 | |
|         is less than one period long.
 | |
|     (2) Determine the period P_r of the right_part.
 | |
|     (3) Check if the left part is just an extension of the pattern of
 | |
|         the right part, so that the whole needle has period P_r.
 | |
|         Explicitly, check if
 | |
|             needle[0:cut] == needle[0+P_r:cut+P_r]
 | |
|         If so, we use the periodic algorithm. If not equal, we use the
 | |
|         long-period algorithm.
 | |
| 
 | |
| Note that if equality holds in (3), then the period of the whole
 | |
| string is P_r. On the other hand, suppose equality does not hold.
 | |
| The period of the needle is then strictly greater than P_r. Here's
 | |
| a general fact:
 | |
| 
 | |
|     If p is a substring of s and p has period r, then the period
 | |
|     of s is either equal to r or greater than len(p).
 | |
| 
 | |
| We know that needle_period != P_r,
 | |
| and therefore needle_period > len(right_part).
 | |
| Additionally, we'll choose the cut (see below)
 | |
| so that len(left_part) < needle_period.
 | |
| 
 | |
| Thus, in the case where equality does not hold, we have that
 | |
| needle_period >= max(len(left_part), len(right_part)) + 1,
 | |
| so the long-period algorithm works, but otherwise, we know the period
 | |
| of the needle.
 | |
| 
 | |
| Note that this decision process doesn't always require an exact
 | |
| computation of the period -- we can get away with only computing P_r!
 | |
| 
 | |
| 
 | |
| -------- Computing the cut --------
 | |
| 
 | |
| Our remaining tasks are now to compute a cut of the needle with as
 | |
| many inequalities as possible, ensuring that cut < needle_period.
 | |
| Meanwhile, we must also compute the period P_r of the right_part.
 | |
| 
 | |
| The computation is relatively simple, essentially doing this:
 | |
| 
 | |
|     suffix1 = max(needle[i:] for i in range(len(needle)))
 | |
|     suffix2 = ... # the same as above, but invert the alphabet
 | |
|     cut1 = len(needle) - len(suffix1)
 | |
|     cut2 = len(needle) - len(suffix2)
 | |
|     cut = max(cut1, cut2) # the later cut
 | |
| 
 | |
| For cut2, "invert the alphabet" is different than saying min(...),
 | |
| since in lexicographic order, we still put "py" < "python", even
 | |
| if the alphabet is inverted. Computing these, along with the method
 | |
| of computing the period of the right half, is easiest to read directly
 | |
| from the source code in fastsearch.h, in which these are computed
 | |
| in linear time.
 | |
| 
 | |
| Crochemore & Perrin's Theorem 3.1 give that "cut" above is a
 | |
| critical factorization less than the period, but a very brief sketch
 | |
| of their proof goes something like this (this is far from complete):
 | |
| 
 | |
|     * If this cut splits the needle as some
 | |
|       needle == (a + w) + (w + b), meaning there's a bad equality
 | |
|       w == w, it's impossible for w + b to be bigger than both
 | |
|       b and w + w + b, so this can't happen. We thus have all of
 | |
|       the inequalities with no question marks.
 | |
|     * By maximality, the right part is not a substring of the left
 | |
|       part. Thus, we have all of the inequalities involving no
 | |
|       left-side question marks.
 | |
|     * If you have all of the inequalities without right-side question
 | |
|       marks, we have a critical factorization.
 | |
|     * If one such inequality fails, then there's a smaller period,
 | |
|       but the factorization is nonetheless critical. Here's where
 | |
|       you need the redundancy coming from computing both cuts and
 | |
|       choosing the later one.
 | |
| 
 | |
| 
 | |
| -------- Some more Bells and Whistles --------
 | |
| 
 | |
| Beyond Crochemore & Perrin's original algorithm, we can use a couple
 | |
| more tricks for speed in fastsearch.h:
 | |
| 
 | |
|     1. Even though C&P has a best-case O(n/m) time, this doesn't occur
 | |
|        very often, so we add a Boyer-Moore bad character table to
 | |
|        achieve sublinear time in more cases.
 | |
| 
 | |
|     2. The prework of computing the cut/period is expensive per
 | |
|        needle character, so we shouldn't do it if it won't pay off.
 | |
|        For this reason, if the needle and haystack are long enough,
 | |
|        only automatically start with two-way if the needle's length
 | |
|        is a small percentage of the length of the haystack.
 | |
| 
 | |
|     3. In cases where the needle and haystack are large but the needle
 | |
|        makes up a significant percentage of the length of the
 | |
|        haystack, don't pay the expensive two-way preprocessing cost
 | |
|        if you don't need to. Instead, keep track of how many
 | |
|        character comparisons are equal, and if that exceeds
 | |
|        O(len(needle)), then pay that cost, since the simpler algorithm
 | |
|        isn't doing very well.
 | 
