[math-fun] New thoughts on sorting algorithms
In the old days, one would analyse a sorting algorithm to try to understand C(N), M(N), and S(N): the number of element-comparisons, the number of element-moves, and the number of extra storage locations, respectively, required to sort an array of N elements. One tried to make C(N) and M(N) approach the theoretical lower bounds of lg(N!), approximately N*lg(N) - 1.443*N, and N, respectively, while keeping S(N) = o(N), which I will call an "in-place" sorting method, as opposed to methods with order-N extra storage needs, such as merge sort.

But in the modern era this is inadequate. It seems to me we should instead consider:

  B(N) = number of compares used for branching, and hence likely to lead to branch mispredictions, causing the computer's pipelines to stall and have to be cleared out;
  C(N) = number of compares used in non-branching ways, i.e. such that no prediction is needed and getting a prediction wrong does not matter;
  R0(N), R1(N) = number of reads of array elements;
  W0(N), W1(N) = number of writes;

where the subscripts 0 and 1 refer to within-cache and outside-cache accesses. Finally, there is also additional bookkeeping work. The point is that branch-mispredicted comparisons can cost a lot more than ordinary comparisons. (How much this matters depends on how expensive a comparison is; if sorting plain 1-word integers it matters a lot; if sorting "a collection of chess positions" it matters little.) And inside-cache accesses cost a lot less than outside-cache ones. All this changes one's perspective considerably.

If we restrict attention to in-place, fairly simply-describable sort methods with O(N*lgN) runtime, then in the old days the main competitors were quicksort, heapsort, shellsort, and sorting networks. (Here I am being generous by pretending shellsort has O(N*lgN) runtime.) The usual old claim was that quicksort was the best, albeit with shellsort the best for small enough arrays. This is due to the simplicity of their inner loops; quicksort will approach N*lgN comparisons (optimal leading term) if good pivot elements are used, although it will use order N*lgN moves, and you have to be careful about the details, otherwise quadratic runtime will happen on certain not-too-implausible inputs. In the modern era, this thinking may change.

Quicksort: good cache behavior; it will perform only about a const/L fraction of its moves and compares out of cache, where L is the cardinality of a cache line. This is optimal. Poor branching behavior, in the sense that almost every compare is branched.

Heapsort: excellent branching behavior, in the sense that you can code it in such a way that there are only O(N) branched comparisons, while the remaining O(N*lgN) compares can be coded in a manner where misprediction does not matter. O(N*lgN) compares and branches even in the worst case. Cache behavior is pessimally bad: asymptotically, almost all the accesses will be out of cache.

Shellsort: assuming an increment sequence which increases, very roughly, exponentially, the cache behavior is pessimally bad; like heapsort, almost all the accesses will be out of cache asymptotically. Further, the compares are almost all branching, which also is pessimally bad. However, if instead of coding shellsort as "insertion sort with size-h strides" you re-coded it as a "repeatedly find the max" sort with h-strides, this might allow you to code it in a way in which almost all the compares were non-branching.
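For concreteness, here is a minimal sketch in C (names my own) of a max-scan whose compare feeds a conditional move rather than a taken/not-taken branch; whether the compiler actually emits cmov for the ternary depends on the compiler and flags, so treat it as illustrative only.

  #include <stddef.h>

  /* Find the index of the maximum of a[0..n-1].  The data comparison only
     selects which index to keep (a cmov candidate); the only branch left is
     the highly predictable loop-control branch. */
  static size_t argmax_branchless(const int *a, size_t n) {
      size_t best = 0;
      for (size_t i = 1; i < n; i++)
          best = (a[i] > a[best]) ? i : best;
      return best;
  }

Used with h-strides, the same pattern gives the non-branching shellsort variant suggested above.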
Repeatedly finding the max and moving it to the top of the array (the array keeps shrinking) is a better quadratic-time sort than insertion sort in the sense that the comparisons can be done nonbranching. On the other hand, in insertion sort you can accurately predict the comparisons almost always (i.e. as you walk linearly down the sorted part of the array asking "should I stop now?", the answer is almost always "no"). If you take that view, then insertion sort runs in order N^2 steps but incurs only O(N) mispredicted branches. In that view, shellsort is already coded fine and does not need recoding. Shellsort's worst-case and average-case runtimes are both hard to analyse.

Sorting networks: a sorting network based on "sort2 modules" is optimal in the sense that every compare+move can be done using "conditional move" with no branching at all, ever. The entire program is "data oblivious." Its cache behavior and runtime depend on the network used, but the cache behavior is probably going to be bad.

CONCLUSION: these theoretical thoughts predict

1. that heapsort should be superior to quicksort if N is small enough that going outside the cache is not an issue. I have done experiments, following up on earlier experiments by Paul Hsieh, which seem to confirm this prediction; my heapsort is the fastest among {heap, shell, quick} for all array sizes 30-3000 tried, sorting either 32-bit integers or 64-bit reals. To achieve this you have to code heapsort in a way that avoids branching. However, it is possible one could code quicksort better than Hsieh and I did... has anybody got some good quicksort codes in C?

2. that for large N, quicksort will win due to better cache behavior versus the others, but its branching behavior sucks. This suggests seeking a way to recode quicksort to reduce the number of branching compares and branch mispredictions.

Vladimir Yaroslavskiy
http://iaroslavski.narod.ru/quicksort/DualPivotQuicksort.pdf
recently invented "dual pivot quicksort," which employs two pivot elements and partitions the array into three parts. He and other testers found this was 10% faster than plain quicksort. Why? It is plainly stupid by the old view, in the sense that it uses more comparisons to sort.

Shrinu Kushagra, Alejandro Lopez-Ortiz, J. Ian Munro, Aurick Qiao
http://epubs.siam.org/doi/abs/10.1137/1.9781611973198.6
went further than Yaroslavskiy, devising a 3-pivot quicksort partitioning the array into 4 parts. This was found to be faster than the dual-pivot and single-pivot quicksort variants tried. But both these new kinds of quicksort use MORE comparisons and MORE moves than classic quicksort (beating it on runtime for sorting plain 1-word integers, though)! Why?

Well, first of all, it seems that, although I said quicksort was excellent about the cache, with only a fraction c/L of accesses out of cache (c = constant, L = cache line size), these new algorithms are even more excellent about the cache in the sense that they achieve a smaller c. So the new methods would be predicted to win unless we have very expensive comparison operations. Also, and this is something those authors did not perceive: you can code them in a way which causes the comparisons to be more predictable than in single-pivot quicksort, precisely *because* they are "unbalanced," i.e. poor in an info-theoretic sense. E.g. suppose we had a "good" sorting method in which each comparison was 50:50, so you had no idea what the result was going to be.
You get 1 bit of info from each compare, but it costs you 0.5 mispredicts per compare -- which, if we crudely model predicted branches as "free," means your cost is 0.5 per bit. Now compare with a "bad" sorting method in which the compares are 90:10 predictable. Now you get only 0.1*lg(10) + 0.9*lg(10/9) = 0.469 bits of info per compare, but it costs you only 0.1 mispredicts. Hence the fact that you have to perform 2.13 times as many comparisons to extract the same amount of information is more than paid for (in this crude model) by the fact that you are paying 5 times less per comparison! That gets even more dramatic with a 127:1 predictable comparison, which gives you only about 1/16 of a bit but costs only 1/128. "Bad" sorting methods are actually better than "good" sorting methods, in this sense! Very bizarre!

To quantify that insight, suppose a mispredicted compare costs you B while a predicted one costs 1, where B>1. Your best strategy is to employ comparisons with an R:1 ratio of right:wrong predictions, where R>1 (as opposed to 1:1 for a "perfect" comparison). The optimum value of R is the one which minimizes

  (B+R) / ( log(R+1) + R*log(1+1/R) )

which is approximately the same thing as minimizing

  (B+R) / ( 1 + log(R+1) ).

If B --> infinity then the optimum R also --> infinity. But for fixed B, such as B=10, you get a finite R>1, and the best R approximately obeys R = B/ln(B). That is quite a large bias.

3. A new algorithm for in-place array sorting using (2+o(1))*N*lgN compares and (13+o(1))*N moves in the worst case, both optimal up to constant factors [shell, heap, networks, and quick all used order N*lgN moves or more; mergesort used O(N) moves but was not in-place], is here:

  GIANNI FRANCESCHINI & VILIAM GEFFERT: An In-Place Sorting with O(n log n) Comparisons and O(n) Moves, JACM 52,4 (2005) 515-537.

I am now going to describe a vastly simpler sorting method than theirs, which also achieves O(NlogN) compares worst case and O(N) moves worst case, both optimal, but unlike them I allow myself o(N) bits of auxiliary storage (note the lowercase o). They had to be more complicated because they allowed themselves only O(1) words of extra storage, which is about O(logN) reckoned in bits.

An approximately evenly-spaced sorted subsample S of the N elements, containing M elements, is used. Basically, a random sample would work, except where it does not; and where it does not, you can "split" the long gaps approximately in half, using linear-time median-finding algorithms, until it does. The other kind of flaw in a random sample, two sample elements that are too close together, can be fixed by discarding one of them. Non-even-spacing flaws in this sample can be dynamically repaired in this way as we proceed. In practice I would just use a "slightly improved random sample," since with optimal M it is extremely unlikely any fixes will be needed; but if you insist on genuine worst-case performance bounds deterministically, then you would need to worry about repairs.

Phase I: if we are willing to store o(N) extra bits, we can, using binary search, mentally insert each of the N elements into its correct post-office box (the boxes are delineated by the sample elements) without actually moving anything, purely keeping track of the COUNTS in each box. Then, with the counts known and stored in the auxiliary o(N) bits (note the lowercase o), we can make a second mass of binary searches, now with actual moves, since we now know where to move the elements. That works provided M = o(N/logN). (If M were larger than that, then the auxiliary storage for the counts would be too large.)
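Here is a minimal sketch in C of that Phase I counting pass (the function names, the int element type, and the tie-breaking direction are my own choices, purely for illustration). The binary search is written so that the data comparison only selects an index through a ternary, a candidate for a conditional move, anticipating the branchless remark further below.

  #include <stddef.h>

  /* Return the index of the "post office box" that x belongs to, i.e. the
     number of sample walls S[0..M-1] that are strictly less than x.
     The compare feeds an index update, not a taken/not-taken branch. */
  static size_t box_of(const int *S, size_t M, int x) {
      const int *base = S;
      size_t n = M;
      if (M == 0) return 0;
      while (n > 1) {
          size_t half = n / 2;
          base = (base[half - 1] < x) ? base + half : base;  /* cmov candidate */
          n -= half;
      }
      return (size_t)(base - S) + (base[0] < x ? 1 : 0);
  }

  /* Phase I: count[k] = number of input elements landing in box k, 0 <= k <= M. */
  void phase1_counts(const int *a, size_t N, const int *S, size_t M, size_t *count) {
      for (size_t k = 0; k <= M; k++) count[k] = 0;
      for (size_t i = 0; i < N; i++) count[box_of(S, M, a[i])]++;
  }

The count array has M+1 entries of O(logN) bits each, which is where the M = o(N/logN) requirement comes from.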
As a result, we now have an almost-sorted array of N elements, where each element is at most distance O(N/M) away from its correct final destination and every element is in its correct "post office box"; there are M+1 boxes with M known "walls" (the sample S, now already in its final locations).

Phase II: we now sort each box in succession using, for example, list-merge-sort or array-merge-sort (a small sketch of the per-box merge appears at the end of this post). The auxiliary storage each merge-sort needs is at most O(N/M) pointers, each O(logN) bits wide, if we use list merge; or at most O(N/M) extra array elements, if we use array merge. So this works provided logN = o(M). Other methods than merge sort could also be used; anything with a linear move-count and an O(NlogN) compare-count will do, where linear or even somewhat superlinear auxiliary storage is permitted.

I'm cheating here, because array-merge does order NlogN moves in sorting N items. List merge really does use only O(N) moves, but it uses order NlogN pointer-rewrites, which it is kind of cheating to claim are "not moves." Yes, they are not item-moves; but they are pointer-moves.

Another approach is to use heapsort with a large radix (tree branching factor) instead of radix=2. That is what Franceschini and Geffert did. This kind of heapsort has fewer moves but more comparisons. Specifically, with radix R, 1<R<N, heapsort will employ order N*log(N)/log(R) moves and order R*N*log(N)/log(R) compares to sort N items. Using heapsort also has the minor extra advantage of not requiring auxiliary storage.

If you use mergesort, the best tradeoff for my auxiliary memory needs seems to be making M be of order sqrt(N). Then you use O(sqrt(N)) words of auxiliary storage, each O(logN) bits wide. I claim this sorting method can then be made to employ at most (2+o(1))*N*lgN comparisons, expected, and perform about 4*N moves (I have not really computed the constant 4). The comparisons will all be in cache provided the cache size exceeds about sqrt(N). The moves will generally be out of cache, but no sorting algorithm can escape having order N out-of-cache moves, so that does not bother me. If you use radix-R heapsort, then M must be chosen to be of order N/(logN)^c, where c>1 is a suitable constant chosen to optimize performance, while R is of order logN.

So, in conclusion, this kind of sort algorithm is optimal in terms of comparisons, moves, "in place" nature (in my sense), and cache behavior of the comparisons if the cache is larger than about sqrt(N), all simultaneously. It is possible to do the binary searches in Phase I in a branchless manner (use conditional loads of pointer-increments), so those comparisons do not face branch-prediction issues. Array merge-sort can be done using conditional loads of pointers and increments, again branchless. However, list-merge-sort comparisons seem to have prediction issues. Heapsort can be done with all but order N comparisons branchless.

Also, one final remark. Franceschini and Geffert left it open whether a "stable" sorting method with these properties can be devised. Stable means that if two equal elements were in some order in the input, they stay in that order in the output. With my sorting method, you can afford to store, during Phase I, the original positions of the items in the sample S. Thus, during the Phase I binary searches, you can agree to always break ties in the direction that preserves stability. Also, in Phase II, we can use a stable sorting method such as merge sort. And S needs to be sorted itself using a stable method, but S is small enough that we can afford to use merge sort for it. So, provided you accept sublinear (but a lot larger than O(1)) auxiliary storage, stability too is solved.
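To round out Phase II, here is the promised minimal sketch in C of the per-box sort (function name and interface are my own choices): a plain top-down array merge sort on one box of b = O(N/M) elements, using a caller-supplied scratch buffer of the same size. The strict "<" in the merge keeps it stable, which matters for the stability remark above; whether the inner ternary compiles to a conditional move is up to the compiler.

  #include <stddef.h>
  #include <string.h>

  /* Sort one box a[0..b-1] with array merge sort, using scratch tmp[0..b-1]. */
  static void merge_sort_box(int *a, size_t b, int *tmp) {
      if (b < 2) return;
      size_t mid = b / 2;
      merge_sort_box(a, mid, tmp);
      merge_sort_box(a + mid, b - mid, tmp);
      size_t i = 0, j = mid, k = 0;
      while (i < mid && j < b)
          tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];  /* strict "<": stable */
      while (i < mid) tmp[k++] = a[i++];
      while (j < b)   tmp[k++] = a[j++];
      memcpy(a, tmp, b * sizeof a[0]);
  }

One would call this once per box, walking the boxes left to right, reusing a single scratch buffer of about N/M elements for every box.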
--
Warren D. Smith
http://RangeVoting.org  <-- add your endorsement (by clicking "endorse" as 1st step)
There are two ways to avoid branch mispredictions when sorting:

1. use comparison results in a non-branching manner (e.g. conditional load instructions);
2. use them in a branching manner, but where you can predict the branch direction most of the time (only wrong guesses count).

Insertion sort can be viewed as using method #2 and incurs only O(N) mispredictions while sorting N items in order N^2 time. My way of programming heapsort, using both methods, achieves an O(N) misprediction bound while sorting in O(NlogN) time.

What is the minimum possible number of mispredictions? [Is O(N) actually already optimal?] The answer is zero! That is because the much-derided quadratic-time "bubble sort" algorithm involves only comparing adjacent elements (if the left one is larger, move it right one hop). This can be done entirely with conditional loads; no branching based on the results of comparisons at all. Also, Batcher's "odd-even merge" sorting network (and his related "bitonic sort") involves zero; it runs in O(N*(logN)^2) time. And the AKS sorting network, which is practically utterly useless due to huge constant factors, also achieves zero and sorts in O(NlogN) time.

However, there are caveats attached to those results: the loop structure of bubble sort still involves branching (just not branching based on item-comparisons), and those branches incur order N mispredictions. Batcher involves at least order NlogN mispredictions of this kind too. I'm not sure how many AKS would involve, perhaps only order (logN)^2. This caveat could be removed by doing bubble sort in a single-loop manner where the two indices are derived algebraically from the one loop index. In that case there is only a single branch misprediction in the whole process. And "one" is surely the minimum possible, since you have to detect "the algorithm has ended" via a branch.
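To illustrate the conditional-load coding of bubble sort mentioned above, here is a minimal sketch in C (names my own): each adjacent pair is rewritten through min/max-style ternaries, so the data comparison feeds conditional moves rather than a taken/not-taken branch, and only the loop-control branches remain. Whether the compiler really emits cmov here is not guaranteed.

  #include <stddef.h>

  /* Bubble sort in which the adjacent compare-exchange is branch-free on the data. */
  void bubble_sort_branchless(int *a, size_t n) {
      for (size_t pass = 0; pass + 1 < n; pass++) {
          for (size_t i = 0; i + 1 < n - pass; i++) {
              int lo = (a[i] <= a[i+1]) ? a[i]   : a[i+1];  /* cmov candidates */
              int hi = (a[i] <= a[i+1]) ? a[i+1] : a[i];
              a[i]   = lo;
              a[i+1] = hi;
          }
      }
  }

The single-loop variant mentioned above would replace these two loops with one counter from which pass and i are recovered arithmetically, leaving (in principle) just the one mispredicted branch at termination.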