[math-fun] New thoughts on sorting algorithms
In the old days, one would analyse a sorting algorithm to try to understand C(N), M(N), and S(N): the number of element-comparisons, the number of element-moves, and the number of extra storage locations, respectively, required to sort an array of N elements. One tried to make C(N) and M(N) approach the theoretical lower bounds of lg(N!), approximately N*lg(N) - 1.443*N, and N, respectively, while keeping S(N) = o(N), which I will call an "in-place" sorting method, as opposed to methods with order-N extra storage needs, such as merge sort.

But in the modern era this is inadequate. It seems to me we should instead consider:

  B(N) = number of compares used for branching, and hence likely to lead to branch mispredictions, causing the computer's pipelines to stall and have to be cleared out;
  C(N) = number of compares used in non-branching ways, i.e. such that no prediction is needed and getting a prediction wrong does not matter;
  R0(N), R1(N) = number of reads of array elements;
  W0(N), W1(N) = number of writes;

where the subscripts 0 and 1 refer to within-cache and outside-cache accesses. Finally, there is also additional bookkeeping work. The point is that branch-mispredicted comparisons can cost a lot more than ordinary comparisons. (How much this matters depends on how expensive a comparison is; if sorting plain 1-word integers it matters a lot; if sorting "a collection of chess positions" it matters little.) And inside-cache accesses cost a lot less than outside-cache ones. All this changes one's perspective considerably.

If we restrict attention to in-place, fairly simply-describable sort methods with O(N*lgN) runtime, then in the old days the main competitors were quicksort, heapsort, shellsort, and sorting networks. (Here I am being generous by pretending shellsort has O(N*lgN) runtime.) The usual old claim was that quicksort was the best, albeit with shellsort the best for small enough arrays. This is due to the simplicity of their inner loops; quicksort will approach N*lgN comparisons (optimal leading term) if good pivot elements are used, although it will use order N*lgN moves, and you have to be careful about the details, otherwise quadratic runtime will happen on certain not-too-implausible inputs. In the modern era, this thinking may change.

Quicksort: good cache behavior; it will perform only about a const/L fraction of its moves and compares out of cache, where L is the cardinality of a cache line. This is optimal. Poor branching behavior, in the sense that almost every compare is branched.

Heapsort: excellent branching behavior, in the sense that you can code it in such a way that there are only O(N) branched comparisons, while the remaining O(N*lgN) compares can be coded in a manner where misprediction does not matter. O(N*lgN) compares and branches even in the worst case. Cache behavior is pessimally bad: asymptotically, almost all the accesses will be out of cache.

Shellsort: assuming an increment sequence which increases, very roughly, exponentially, the cache behavior is pessimally bad; like heapsort, almost all the accesses will be out of cache asymptotically. Further, the compares are almost all branching, which also is pessimally bad. However, if instead of coding shellsort as "insertion sort with size-h strides" you re-coded it as a "repeatedly find the max" sort with h-strides, this might allow you to code it in a way in which almost all the compares were non-branching.
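For concreteness, here is a minimal sketch in C (names my own) of a max-scan whose compare feeds a conditional move rather than a taken/not-taken branch; whether the compiler actually emits cmov for the ternary depends on the compiler and flags, so treat it as illustrative only.

  #include <stddef.h>

  /* Find the index of the maximum of a[0..n-1].  The data comparison only
     selects which index to keep (a cmov candidate); the only branch left is
     the highly predictable loop-control branch. */
  static size_t argmax_branchless(const int *a, size_t n) {
      size_t best = 0;
      for (size_t i = 1; i < n; i++)
          best = (a[i] > a[best]) ? i : best;
      return best;
  }

Used with h-strides, the same pattern gives the non-branching shellsort variant suggested above.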
Repeatedly finding the max and moving it to the top of the array (the array keeps shrinking) is a better quadratic-time sort than insertion sort in the sense that the comparisons can be done nonbranching. On the other hand, in insertion sort you can accurately predict the comparisons almost always (i.e. as you walk linearly down the sorted part of the array asking "should I stop now?", the answer is almost always "no"). If you take that view, then insertion sort runs in order N^2 steps but incurs only O(N) mispredicted branches. In that view, shellsort is already coded fine and does not need recoding. Shellsort's worst-case and average-case runtimes are both hard to analyse.

Sorting networks: a sorting network based on "sort2 modules" is optimal in the sense that every compare+move can be done using "conditional move" with no branching at all, ever. The entire program is "data oblivious." Its cache behavior and runtime depend on the network used, but the cache behavior is probably going to be bad.

CONCLUSION: these theoretical thoughts predict

1. that heapsort should be superior to quicksort if N is small enough that going outside the cache is not an issue. I have done experiments, following up on earlier experiments by Paul Hsieh, which seem to confirm this prediction; my heapsort is the fastest among {heap, shell, quick} for all array sizes 30-3000 tried, sorting either 32-bit integers or 64-bit reals. To achieve this you have to code heapsort in a way that avoids branching. However, it is possible one could code quicksort better than Hsieh and I did... has anybody got some good quicksort codes in C?

2. that for large N, quicksort will win due to better cache behavior versus the others, but its branching behavior sucks. This suggests seeking a way to recode quicksort to reduce the number of branching compares and branch mispredictions.

Vladimir Yaroslavskiy
http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf
recently invented "dual pivot quicksort," which employs two pivot elements and partitions the array into three parts. He and other testers found this was 10% faster than plain quicksort. Why? It is plainly stupid by the old view, in the sense that it uses more comparisons to sort.

Shrinu Kushagra, Alejandro Lopez-Ortiz, J. Ian Munro, Aurick Qiao
http://epubs.siam.org/doi/abs/10.1137/1.9781611973198.6
went further than Yaroslavskiy, devising a 3-pivot quicksort partitioning the array into 4 parts. This was found to be faster than the dual-pivot and single-pivot quicksort variants tried. But both these new kinds of quicksort use MORE comparisons and MORE moves than classic quicksort (beating it on runtime for sorting plain 1-word integers, though)! Why?

Well, first of all, it seems that, although I said quicksort was excellent about the cache, with only a fraction c/L of accesses out of cache (c = constant, L = cache line size), these new algorithms are even more excellent about the cache in the sense that they achieve a smaller c. So the new methods would be predicted to win unless we have very expensive comparison operations. Also, and this is something those authors did not perceive: you can code them in a way which causes the comparisons to be more predictable than in single-pivot quicksort, precisely *because* they are "unbalanced," i.e. poor in an info-theoretic sense. E.g. suppose we had a "good" sorting method in which each comparison was 50:50, so you had no idea what the result was going to be.
You get 1 bit of info from each compare, but it costs you 0.5 mispredicts per compare -- which, if we crudely model predicted branches as "free," means your cost is 0.5 per bit. Now compare with a "bad" sorting method in which the compares are 90:10 predictable. Now you get only 0.1*lg(10) + 0.9*lg(10/9) = 0.469 bits of info per compare, but it costs you only 0.1 mispredicts. Hence the fact that you have to perform 2.13 times as many comparisons to extract the same amount of information is more than paid for (in this crude model) by the fact that you are paying 5 times less per comparison! That gets even more dramatic with a 127:1 predictable comparison, which gives you only about 1/16 of a bit but costs only 1/128. "Bad" sorting methods are actually better than "good" sorting methods, in this sense! Very bizarre!

To quantify that insight, suppose a mispredicted compare costs you B while a predicted one costs 1, where B>1. Your best strategy is to employ comparisons with an R:1 ratio of right:wrong predictions, where R>1 (as opposed to 1:1 for a "perfect" comparison). The optimum value of R is the one which minimizes

  (B+R) / ( log(R+1) + R*log(1+1/R) )

which is approximately the same thing as minimizing

  (B+R) / ( 1 + log(R+1) ).

If B --> infinity then the optimum R also --> infinity. But for fixed B, such as B=10, you get a finite R>1, and the best R approximately obeys R = B/ln(B). That is quite a large bias.

3. A new algorithm for in-place array sorting using (2+o(1))*N*lgN compares and (13+o(1))*N moves in the worst case, both optimal up to constant factors [shell, heap, networks, and quick all used order N*lgN moves or more; mergesort used O(N) moves but was not in-place], is here:

  GIANNI FRANCESCHINI & VILIAM GEFFERT: An In-Place Sorting with O(n log n) Comparisons and O(n) Moves, JACM 52,4 (2005) 515-537.

I am now going to describe a vastly simpler sorting method than theirs, which also achieves O(NlogN) compares worst case and O(N) moves worst case, both optimal, but unlike them I allow myself o(N) bits of auxiliary storage (note the lowercase o). They had to be more complicated because they allowed themselves only O(1) words of extra storage, which is about O(logN) reckoned in bits.

An approximately evenly-spaced sorted subsample S of the N elements, containing M elements, is used. Basically, a random sample would work, except where it does not; and where it does not, you can "split" the long gaps approximately in half, using linear-time median-finding algorithms, until it does. The other kind of flaw in a random sample, two sample elements that are too close together, can be fixed by discarding one of them. Non-even-spacing flaws in this sample can be dynamically repaired in this way as we proceed. In practice I would just use a "slightly improved random sample," since with optimal M it is extremely unlikely any fixes will be needed; but if you insist on genuine worst-case performance bounds deterministically, then you would need to worry about repairs.

Phase I: if we are willing to store o(N) extra bits, we can, using binary search, mentally insert each of the N elements into its correct post-office box (the boxes are delineated by the sample elements) without actually moving anything, purely keeping track of the COUNTS in each box. Then, with the counts known and stored in the auxiliary o(N) bits (note the lowercase o), we can make a second mass of binary searches, now with actual moves, since we now know where to move the elements. That works provided M = o(N/logN). (If M were larger than that, then the auxiliary storage for the counts would be too large.)
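Here is a minimal sketch in C of that Phase I counting pass (the function names, the int element type, and the tie-breaking direction are my own choices, purely for illustration). The binary search is written so that the data comparison only selects an index through a ternary, a candidate for a conditional move, anticipating the branchless remark further below.

  #include <stddef.h>

  /* Return the index of the "post office box" that x belongs to, i.e. the
     number of sample walls S[0..M-1] that are strictly less than x.
     The compare feeds an index update, not a taken/not-taken branch. */
  static size_t box_of(const int *S, size_t M, int x) {
      const int *base = S;
      size_t n = M;
      if (M == 0) return 0;
      while (n > 1) {
          size_t half = n / 2;
          base = (base[half - 1] < x) ? base + half : base;  /* cmov candidate */
          n -= half;
      }
      return (size_t)(base - S) + (base[0] < x ? 1 : 0);
  }

  /* Phase I: count[k] = number of input elements landing in box k, 0 <= k <= M. */
  void phase1_counts(const int *a, size_t N, const int *S, size_t M, size_t *count) {
      for (size_t k = 0; k <= M; k++) count[k] = 0;
      for (size_t i = 0; i < N; i++) count[box_of(S, M, a[i])]++;
  }

The count array has M+1 entries of O(logN) bits each, which is where the M = o(N/logN) requirement comes from.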
As a result, we now have an almost-sorted array of N elements, where each element is at most distance O(N/M) away from its correct final destination and every element is in its correct "post office box"; there are M+1 boxes with M known "walls" (the sample S, now already in its final locations).

Phase II: we now sort each box in succession using, for example, list-merge-sort or array-merge-sort (a small sketch of the per-box merge appears at the end of this post). The auxiliary storage each merge-sort needs is at most O(N/M) pointers, each O(logN) bits wide, if we use list merge; or at most O(N/M) extra array elements, if we use array merge. So this works provided logN = o(M). Other methods than merge sort could also be used; anything with a linear move-count and an O(NlogN) compare-count will do, where linear or even somewhat superlinear auxiliary storage is permitted.

I'm cheating here, because array-merge does order NlogN moves in sorting N items. List merge really does use only O(N) moves, but it uses order NlogN pointer-rewrites, which it is kind of cheating to claim are "not moves." Yes, they are not item-moves; but they are pointer-moves.

Another approach is to use heapsort with a large radix (tree branching factor) instead of radix=2. That is what Franceschini and Geffert did. This kind of heapsort has fewer moves but more comparisons. Specifically, with radix R, 1<R<N, heapsort will employ order N*log(N)/log(R) moves and order R*N*log(N)/log(R) compares to sort N items. Using heapsort also has the minor extra advantage of not requiring auxiliary storage.

If you use mergesort, the best tradeoff for my auxiliary memory needs seems to be making M be of order sqrt(N). Then you use O(sqrt(N)) words of auxiliary storage, each O(logN) bits wide. I claim this sorting method can then be made to employ at most (2+o(1))*N*lgN comparisons, expected, and perform about 4*N moves (I have not really computed the constant 4). The comparisons will all be in cache provided the cache size exceeds about sqrt(N). The moves will generally be out of cache, but no sorting algorithm can escape having order N out-of-cache moves, so that does not bother me. If you use radix-R heapsort, then M must be chosen to be of order N/(logN)^c, where c>1 is a suitable constant chosen to optimize performance, while R is of order logN.

So, in conclusion, this kind of sort algorithm is optimal in terms of comparisons, moves, "in place" nature (in my sense), and cache behavior of the comparisons if the cache is larger than about sqrt(N), all simultaneously. It is possible to do the binary searches in Phase I in a branchless manner (use conditional loads of pointer-increments), so those comparisons do not face branch-prediction issues. Array merge-sort can be done using conditional loads of pointers and increments, again branchless. However, list-merge-sort comparisons seem to have prediction issues. Heapsort can be done with all but order N comparisons branchless.

Also, one final remark. Franceschini and Geffert left it open whether a "stable" sorting method with these properties can be devised. Stable means that if two equal elements were in some order in the input, they stay in that order in the output. With my sorting method, you can afford to store, during Phase I, the original positions of the items in the sample S. Thus, during the Phase I binary searches, you can agree to always break ties in the direction that preserves stability. Also, in Phase II, we can use a stable sorting method such as merge sort. And S needs to be sorted itself using a stable method, but S is small enough that we can afford to use merge sort for it. So, provided you accept sublinear (but a lot larger than O(1)) auxiliary storage, stability too is solved.
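To round out Phase II, here is the promised minimal sketch in C of the per-box sort (function name and interface are my own choices): a plain top-down array merge sort on one box of b = O(N/M) elements, using a caller-supplied scratch buffer of the same size. The strict "<" in the merge keeps it stable, which matters for the stability remark above; whether the inner ternary compiles to a conditional move is up to the compiler.

  #include <stddef.h>
  #include <string.h>

  /* Sort one box a[0..b-1] with array merge sort, using scratch tmp[0..b-1]. */
  static void merge_sort_box(int *a, size_t b, int *tmp) {
      if (b < 2) return;
      size_t mid = b / 2;
      merge_sort_box(a, mid, tmp);
      merge_sort_box(a + mid, b - mid, tmp);
      size_t i = 0, j = mid, k = 0;
      while (i < mid && j < b)
          tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];  /* strict "<": stable */
      while (i < mid) tmp[k++] = a[i++];
      while (j < b)   tmp[k++] = a[j++];
      memcpy(a, tmp, b * sizeof a[0]);
  }

One would call this once per box, walking the boxes left to right, reusing a single scratch buffer of about N/M elements for every box.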
--
Warren D. Smith
http://RangeVoting.org  <-- add your endorsement (by clicking "endorse" as 1st step)
There are two ways to avoid branch mispredictions when sorting:

1. use comparison results in a non-branching manner (e.g. conditional load instructions);
2. use them in a branching manner, but where you can predict the branch direction most of the time (only wrong guesses count).

Insertion sort can be viewed as using method #2 and incurs only O(N) mispredictions while sorting N items in order N^2 time. My way of programming heapsort, using both methods, achieves an O(N) misprediction bound while sorting in O(NlogN) time.

What is the minimum possible number of mispredictions? [Is O(N) actually already optimal?] The answer is zero! That is because the much-derided quadratic-time "bubble sort" algorithm involves only comparing adjacent elements (if the left one is larger, move it right one hop). This can be done entirely with conditional loads; no branching based on the results of comparisons at all. Also, Batcher's "odd-even merge" sorting network (and his related "bitonic sort") involves zero; it runs in O(N*(logN)^2) time. And the AKS sorting network, which is practically utterly useless due to huge constant factors, also achieves zero and sorts in O(NlogN) time.

However, there are caveats attached to those results: the loop structure of bubble sort still involves branching (just not branching based on item-comparisons), and those branches incur order N mispredictions. Batcher involves at least order NlogN mispredictions of this kind too. I'm not sure how many AKS would involve, perhaps only order (logN)^2. This caveat could be removed by doing bubble sort in a single-loop manner where the two indices are derived algebraically from the one loop index. In that case there is only a single branch misprediction in the whole process. And "one" is surely the minimum possible, since you have to detect "the algorithm has ended" via a branch.
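To illustrate the conditional-load coding of bubble sort mentioned above, here is a minimal sketch in C (names my own): each adjacent pair is rewritten through min/max-style ternaries, so the data comparison feeds conditional moves rather than a taken/not-taken branch, and only the loop-control branches remain. Whether the compiler really emits cmov here is not guaranteed.

  #include <stddef.h>

  /* Bubble sort in which the adjacent compare-exchange is branch-free on the data. */
  void bubble_sort_branchless(int *a, size_t n) {
      for (size_t pass = 0; pass + 1 < n; pass++) {
          for (size_t i = 0; i + 1 < n - pass; i++) {
              int lo = (a[i] <= a[i+1]) ? a[i]   : a[i+1];  /* cmov candidates */
              int hi = (a[i] <= a[i+1]) ? a[i+1] : a[i];
              a[i]   = lo;
              a[i+1] = hi;
          }
      }
  }

The single-loop variant mentioned above would replace these two loops with one counter from which pass and i are recovered arithmetically, leaving (in principle) just the one mispredicted branch at termination.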