Each Answer to this Q is separated by one/two green lines.
I became curious to understand the internals of how string comparison works in python when I was solving the following example algorithm problem:
Given two strings, return the length of the longest common prefix
Solution 1: charByChar
My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like
def charByChar(smaller, bigger): assert len(smaller) <= len(bigger) for p in range(len(smaller)): if smaller[p] != bigger[p]: return p return len(smaller)
To simplify the code, the function assumes that the length of the first string,
smaller, is always smaller than or equal to the length of the second string,
Solution 2: binarySearch
Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix point is at least as long as the midpoint. Otherwise the common prefix point is at least no bigger than the midpoint. We can then recurse to find the prefix length.
Aka binary search.
def binarySearch(smaller, bigger): assert len(smaller) <= len(bigger) lo = 0 hi = len(smaller) # binary search for prefix while lo < hi: # +1 for even lengths mid = ((hi - lo + 1) // 2) + lo if smaller[:mid] == bigger[:mid]: # prefixes equal lo = mid else: # prefixes not equal hi = mid - 1 return lo
At first I assumed that that
binarySearch would be slower because string comparison would compare all characters several times rather than just the prefix characters as in
binarySearch turned out to be much faster after some preliminary benchmarking.
Above shows how performance is affected as prefix length is increased. Suffix length remains constant at 50 characters.
This graph shows two things:
- As expected, both algorithms perform linearly worse as prefix length increases.
- Performance of
charByChardegrades at a much faster rate.
binarySearch so much better? I think it is because
- The string comparison in
binarySearchis presumably optimized by the interpreter / CPU behind the scenes.
charByCharactually creates new strings for each character accessed and this produces significant overhead.
To validate this I benchmarked the performance of comparing and slicing a string, labelled
slice respectively below.
This graph show two important things:
- As expected, comparing and slicing increase linearly with length.
- The cost of comparing and slicing increase very slowly with length relative to algorithm performance, Figure A. Note both figures go up to strings of length 1 Billion characters. Therefore, the cost of comparing 1 character 1 Billion times is much much greater than comparing 1 Billion characters once. But this still doesn’t answer why …
In an effort to find out how the cpython interpreter optimizes string comparison I generated the byte code for the following function.
In : def slice_cmp(a, b): return a == b In : dis.dis(slice_cmp) 0 LOAD_FAST 0 (a) 2 LOAD_CONST 1 (0) 4 BINARY_SUBSCR 6 LOAD_FAST 1 (b) 8 LOAD_CONST 1 (0) 10 BINARY_SUBSCR 12 COMPARE_OP 2 (==) 14 RETURN_VALUE
- Where in the cpython does string comparison occur?
- Is there a CPU optimization? Is there a special x86 instruction which does string comparison? How can I see what assembly instructions are generated by cpython? You may assume I am using python3 latest, Intel Core i5, OS X 10.11.6.
- Why is comparing a long string so much faster than comparing each of it’s characters?
Bonus question: When is charByChar more performant?
If the prefix is sufficiently small in comparison to the length rest of the string, at some point the cost of creating substrings in
charByChar becomes less than the cost of comparing the substrings in
To describe this relationship I delved into runtime analysis.
To simplify the below equations, let’s assume that
bigger are the same size and I will refer to them as
charByChar(s1, s2) = costOfOneChar * prefixLen
costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)
cmp(1) is the cost of comparing two strings of length 1 char.
slice is the cost of accessing a char, the equivalent of
charAt(i). Python has immutable strings and accessing a char actually creates a new string of length 1.
slice(string_len, slice_len) is the cost of slicing a string of length
string_len to a slice of size
charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)
binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)
log_2 is the number of times to divide the strings in half until reaching a string of length 1. Where
costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)
So the big-O of
binarySearch will grow according to
binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))
Based on our previous analysis of the cost of
If we assume that
costOfHalfOfEachString is approximately the
costOfComparingOneChar then we can refer to them both as
charByChar(s1, s2) = O(x * prefixLen) binarySearch(s1, s2) = O(x * log_2(s1Len))
If we equate them
O(charByChar(s1, s2)) = O(binarySearch(s1, s2)) x * prefixLen = x * log_2(s1Len) prefixLen = log_2(s1Len) 2 ** prefixLen = s1Len
O(charByChar(s1, s2)) > O(binarySearch(s1, s2) when
2 ** prefixLen = s1Len
So plugging in the above formula I regenerated tests for Figure A but with strings of total length
2 ** prefixLen expecting the performance of the two algorithms to be roughly equal.
charByChar performs much better. With a bit of trial and error, the performance of the two algorithms are roughly equal when
s1Len = 200 * prefixLen
Why is the relationship 200x?
TL:DR: a slice compare is some Python overhead + a highly-optimized
memcmp (unless there’s UTF-8 processing?). Ideally, use slice compares to find the first mismatch to within less than 128 bytes or something, then loop a char at a time.
Or if it’s an option and the problem is important, make a modified copy of an asm-optimized
memcmp that returns the position of the first difference, instead of equal/not-equal; it will run as fast as a single
== of the whole strings. Python has ways to call native C / asm functions in libraries.
It’s a frustrating limitation that the CPU can do this blazingly fast, but Python doesn’t (AFAIK) give you access to an optimized compare loop that tells you the mismatch position instead of just equal / greater / less.
It’s totally normal that interpreter overhead dominates the cost of the real work in a simple Python loop, with CPython. Building an algorithm out of optimized building blocks is worth it even if it means doing more total work. This is why NumPy is good, but looping over a matrix element-by-element is terrible. The speed difference might be something like a factor of 20 to 100, for CPython vs. a compiled C (asm) loop for comparing one byte at a time (made up numbers, but probably right to within an order of magnitude).
Comparing blocks of memory for equality is probably one of the biggest mismatches between Python loops vs. operating on a whole list / slice. It’s a common problem with highly-optimized solutions (e.g. most libc implementations (including OS X) have a manually-vectorized hand-coded asm
memcmp that uses SIMD to compare 16 or 32 bytes in parallel, and runs much faster than a byte-at-a-time loop in C or assembly). So there’s another factor of 16 to 32 (if memory bandwidth isn’t a bottleneck) multiplying the factor of 20 to 100 speed difference between Python and C loops. Or depending on how optimized your
memcmp is, maybe “only” 6 or 8 bytes per cycle.
With data hot in L2 or L1d cache for medium-sized buffers, it’s reasonable to expect 16 or 32 bytes per cycle for
memcmp on a Haswell or later CPU. (i3/i5/i7 naming started with Nehalem; i5 alone is not sufficient to tell us much about your CPU.)
I’m not sure if either or both of your comparisons are having to process UTF-8 and check for equivalency classes or different ways to encode the same character. The worst case is if your Python char-at-a-time loop has to check for potentially-multi-byte characters but your slice compare can just use
Writing an efficient version in Python:
We’re just totally fighting against the language to get efficiency: your problem is almost exactly the same as the C standard library function
memcmp, except you want the position of the first difference instead of a – / 0 / + result telling you which string is greater. The search loop is identical, it’s just a difference in what the function does after finding the result.
Your binary search is not the best way to use a fast compare building block. A slice compare still has
O(n) cost, not
O(1), just with a much smaller constant factor. You can and should avoid re-comparing the starts of the buffers repeatedly by using slices to compare large chunks until you find a mismatch, then go back over that last chunk with a smaller chunk size.
# I don't actually know Python; consider this pseudo-code # or leave an edit if I got this wrong :P chunksize = min(8192, len(smaller)) # possibly round chunksize down to the next lowest power of 2? start = 0 while start+chunksize < len(smaller): if smaller[start:start+chunksize] == bigger[start:start+chunksize]: start += chunksize else: if chunksize <= 128: return char_at_a_time(smaller[start:start+chunksize], bigger[start:start+chunksize]) else: chunksize /= 8 # from the same start # TODO: verify this logic for corner cases like string length not a power of 2 # and/or a difference only in the last character: make sure it does check to the end
I chose 8192 because your CPU has a 32kiB L1d cache, so the total cache footprint of two 8k slices is 16k, half your L1d. When the loop finds a mismatch, it will re-scan the last 8kiB in 1k chunks, and these compares will loop over data that’s still hot in L1d. (Note that if
== found a mismatch, it probably only touched data up to that point, not the whole 8k. But HW prefetch will keep going somewhat beyond that.)
A factor of 8 should be a good balance between using large slices to localize quickly vs. not needing many passes over the same data. This is a tunable parameter of course, along with chunk size. The bigger the mismatch between Python and asm, the smaller this factor should be to reduce Python loop iterations.)
Hopefully 8k is big enough to hide the Python loop / slice overhead; hardware prefetching should still be working during the Python overhead between
memcmp calls from the interpreter so we don’t need the granularity to be huge. But for really big strings, if 8k doesn’t saturate memory bandwidth then maybe make it 64k (your L2 cache is 256kiB; i5 does tell us that much).
How exactly is
memcmp so fast:
I am running this on Intel Core i5 but I would imagine I would get the same results on most modern CPUs.
Even in C, Why is memcmp so much faster than a for loop check?
memcmp is faster than a byte-at-a-time compare loop, because even C compilers aren’t great at (or totally incapable of) auto-vectorizing search loops.
Even without hardware SIMD support, an optimized
memcmp could check 4 or 8 bytes at a time (word size / register width) even on a simple CPU without 16-byte or 32-byte SIMD.
But most modern CPUs, and all x86-64, have SIMD instructions. SSE2 is baseline for x86-64, and available as an extension in 32-bit mode.
An SSE2 or AVX2
memcmp can use
pmovmskb to compare 16 or 32 bytes in parallel. (I’m not going to go into detail about how to write memcmp in x86 asm or with C intrinsics. Google that separately, and/or look up those asm instructions in an x86 instruction-set reference. like http://felixcloutier.com/x86/index.html. See also the x86 tag wiki for asm and performance links. e.g. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? has some info about single-core memory bandwidth limitations.)
I found an old version from 2005 of Apple’s x86-64
memcmp (in AT&T syntax assembly language) on their opensource web site. It could definitely be better; for large buffers it should align one source pointer and only use
movdqu for the other one, allowing
pcmpeqb with a memory operand instead of 2x
movdqu, even if the strings are misaligned relative to each other.
xorl $0xFFFF,%eax /
jnz is also not optimal on CPUs where
cmp/jcc can macro fuse but
xor / jcc can’t.
Unrolling to check a whole 64-byte cache line at once would also hide loop overhead. (This is the same idea of a large chunk and then looping back over it when you find a hit). Glibc’s AVX2-
movbe version does this with
vpand to combine compare results in the main large-buffer loop, with the final combine being a
vptest that also sets flags from the result. (Smaller code-size but no fewer uops than
jcc; but no downside and maybe lower latency to reduce branch-mispredict penalties on loop exit). Glibc does dynamic CPU dispatching at dynamic link time; it picks this version on CPUs that support it.
memcmp is better these days; I don’t see source for it at all in the most recent
Libc directory, though. Hopefully they dispatch at runtime to an AVX2 version for Haswell and later CPUs.
LLoopOverChunks loop in the version I linked would only run at 1 iteration (16 bytes from each input) per ~2.5 cycles on Haswell; 10 fused-domain uops. But that’s still much faster than 1 byte per cycle for a naive C loop, or much much worse than that for a Python loop.
L(loop_4x_vec): loop is 18 fused-domain uops, and can thus run at just slightly less than 32 bytes (from each input) per clock cycle, when data is hot in L1d cache. Otherwise it will bottleneck on L2 bandwidth. It could have been 17 uops if they hadn’t used an extra instruction inside the loop decrementing a separate loop counter, and calculated an end-pointer outside the loop.
Finding instructions / hot spots in the Python interpreter’s own code
How could I drill down to find the C instructions and CPU instructions that my code calls?
On Linux you could run
perf record python ... then
perf report -Mintel to see which functions the CPU was spending the most time in, and which instructions in those functions were the hottest. You’ll get results something like I posted here: Why is float() faster than int()?. (Drill down into any function to see the actual machine instructions that ran, shown as assembly language because
perf has a disassembler built in.)
For a more nuanced view that samples the call-graph on each event, see linux perf: how to interpret and find hotspots.
(When you’re looking to actually optimize a program, you want to know which function calls are expensive so you can try to avoid them in the first place. Profiling for just “self” time will find hot spots, but you won’t always know which different callers caused a given loop to run most of the iterations. See Mike Dunlavey’s answer on that perf question.)
But for this specific case, profiling the interpreter running a slice-compare version over big strings should hopefully find the
memcmp loop where I think it will be spending most of its time. (Or for the char-at-a-time version, find the interpreter code that’s “hot”.)
Then you can directly see what asm instructions are in the loop. From the function names, assuming your binary has any symbols, you can probably find the source. Or if you have a version of Python built with debug info , you can get to the source directly from profile info. (Not a debug build with optimization disabled, just with full symbols).
This is both implementation-dependent and hardware-dependent. Without knowing your target machine and specific distribution, I couldn’t say for sure. However, I strongly suspect that the underlying hardware, like most, has memory block instructions. Among other things, this can compare a pair of arbitrarily long strings (up to addressing limits) in parallel and pipelined fashion. For instance, it may compare 8-byte slices at one slice per clock cycle. This is a lot faster than fiddling with byte-level indices.