Archive for the ‘STL’ Category

Use STL lower_bound even though it’s slower

Thursday, July 3rd, 2008

Summarizing the last few posts on STL lower_bound: instead of falling back to a naive algorithm when it’s not given a random-access iterator, STL’s lower_bound algorithm still performs binary search using bidirectional or forward-only iterators. This causes it to perform 2N list traversals, which is four times as many as the naive algorithm’s average case of N/2 list traversals. The STL algorithm guarantees that there will only be O( log N ) comparisons, but we’ve seen that’s not much of an advantage unless the comparison requires an extra memory access.

You might be thinking: why would anybody ever do lower_bound on a list anyway? Just maintain the data in a vector: that has O(1) amortized append, and the STL lower_bound wil run in O( log n ). Remember that the disadvantage of a vector is not merely the O(n) cost of inserting/deleting in the middle: adding an element to a vector invalidates all of its iterators because the addition can, in principle, realloc.

So given that, for lists, naive_lower_bound is clearer, shorter, and faster for large N even when comparisons are costly, when should you use it instead of STL lower_bound? Never. Always use STL lower_bound. The maintenance costs of naive_lower_bound are too high. If you put it in your code, even if you comment it clearly and provide a link to this discussion, in six months or two or ten years, some maintenance programmer is going to be looking at that line of code and wondering “What was wrong with STL lower_bound? Is it still true that naive_lower_bound is fast? Is it even correct?” And that maintenance programmer might well be you. It’s like the old adage: Nobody ever got fired for buying IBM. Or Microsoft. Or today, STL.

Of course, like all generalizations, this one is partially false. If your program is the client for a popular MMORPG and you offer your users a discount if they let you run numerical algorithms on their machine while the client is running, and you need to maintain a linked list of all other available clients, sorted by latency or some other simple metric of speed/quality, and frequently partition that list using lower_bound periodically, then perhaps it makes sense for you to avoid STL lower_bound. But maybe you could use a vector instead. Or maybe instead of maintaining a single sorted list, you could decide a priori to divide your clients into slow, medium, and fast links based on latency < 50ms, < 200 ms, over 200 ms. There are many available ways to approach the problem, and improving the algorithm’s running time — although it’s in some ways the easiest to measure and implement — carries with it maintenance costs that you won’t have to pay.

STL lower_bound is better for complex keys, but still loses on large N

Wednesday, July 2nd, 2008

Testing a list<int> is pretty much the worst possible case for STL’s lower bound algorithm. Recall from the analysis that STL’s lower_bound uses a binary search algorithm even when the cost of advancing an iterator N steps is O(N). The advantage of this algorithm is that it only performs O( log N ) comparisons, as opposed to O( N ) comparisons for the naive algorithm.

Testing STL lower_bound against naive_lower_bound using integers definitely plays to the strengths of naive_lower_bound: an integer compare is a single machine instruction. To test the effects of a more-costly comparison, I used the simplest costly key I could think of: a string consisting of a prefix of underscores, followed by the sequence number formatted in decimal. I ran with one, ten, and 64 underscores; 64 is the cache line length on my machine, so this length forces a comparison to make four memory accesses (two per argument) instead of only two.

For these tests, the STL lower_bound is initially twice as fast as the naive_lower_bound, and I must say it’s a bit of a relief to see the standard library algorithm outperforming the dumb algorithm. But as the cache effects start to be felt (for me, around 30,000 elements) the sheer cost of traversing the whole list twice wipes out the savings from reduced comparisons, at least for 1 and 10 underscores.

Recall from the earlier analysis that the naive algorithm runs in N * (c + t) / 2 for the average case, where c is the average cost of a comparison and t is the average cost of a list traversal. So the overall slope of the naive algorithm, for large N, should be (c + t) / 2. In contrast, the STL algorithm runs in 2 * t * N + c * lg N. Just for coarse analysis, we can throw away the c * lg N term (lg N is effectively a constant, varying between 10 and 16 when N goes from 1,000 to 60,000), which gives 2 * t for the slope of the STL algorithm. The difference between the expected slopes is (3t - c)/2.

With this in mind, you can see that even when there’s a substantial constant prefix on the key (ten underscores) this doesn’t appreciably raise the cost of a comparison, at least not compared to the cost of three list traversals. So let’s try forcing two more memory accesses per comparion by extending the prefix length to 64 bytes, the full length of a cache line.

64 bytes of underscores cause each comparison to make 4 memory accesses, slowing naive substantially

Now the cost of a comparison is much closer to the cost of three list traversals, but still slightly less: the graph shows that the execution time of STL is still increasing more quickly with increasing N. The second memory access of the comparison doesn’t suffer the full random-access penalty of a true cache miss because the key is contiguous and the cache reads ahead, while for the list traversal it’s quite unlikely that any two adjacent list elements would be within 64 bytes of each other.

Overall, though, for most small-to-medium list sizes the STL algorithm is winning when the key is more complex than a builtin type.

Next time: why this matters, when to use the STL lower_bound, and more.

STL lower_bound on lists four times slower than naive algorithm

Sunday, June 29th, 2008

Here is a comparison between the naive lower bound and STL’s version. Execution time reported in seconds was measured using the Windows high resolution timer. An STL list of integers 1-N was created by populating a vector, shuffling it, pushing the elements onto a list, and sorting the list. This was done so the list elements would not be allocated sequentially and thus benefit from cache locality.

For each N, 200 evenly spaced numbers in the range 1..N were selected. To these were added 0 and N+1, so that boundary conditions would be tested. The test values were shuffled to disrupt cache effects. A preliminary test was done to ensure that both algorithms got the same results for all test values. Then the lower_bound operation was timed, for each test value, repeated ten times.

Naive lower_bound about twice as fast as STL without optimization

As expected, the cost of memory access dominates the cost of the comparison, so the STL algorithm’s guarantee of fewer comparisons (at the cost of a guaranteed 2N memory accesses) turns out to be a bad trade. Note the knee in the graph, seen most clearly in the relative times (yellow triangles). I believe what happens there is the entire data set no longer fits in cache, and the STL algorithm suffers disproportionately from this because its first step — calculating the length of the list — causes the first elements to spill from cache as the last elements are being traversed. Then when it goes back to the beginning and starts its binary search, the first elements need to be reloaded from cache.

The results for a build with optimization turned on are similar:

With optimization naive is 4 times as fast as STL.

In this case the Naive algorithm is consistently 4 times faster than STL, and you can clearly see the knee in the STL graph around the first tick mark — effectively there are two linear portions of the graph. Interestingly, both STL and Naive suffer similarly from cache effects, as the relative timings stay constant as we pass the knee and get to the cache-limited region. Also (though you can’t see it because I left off the values of N), the numeric value of N where the knee occurs is larger in the optimized version (45,000 vs. 20,000), probably reflecting the overhead that the debug memory allocator attaches to each block.