Warren: You probably already know this, but deeply pipelined units -- e.g., GPUs -- sometimes use _sorting networks_, which are more-or-less oblivious compare-and-swap straight-line instruction sequences. Since there isn't any branching, there's no branch prediction. Graphics operations often need to sort things, so this matters. However, such oblivious sorting networks can cost an additional factor of log N, at least in terms of hardware, but this may be acceptable in today's environment. Branch prediction reached absurd levels prior to modern multicore CPUs, and the emphasis has now shifted away from heavy branch prediction in favor of more cores. Branch prediction basically builds several hypothetical states in parallel (requiring lots of additional registers), and only when the results of the decisions finally become available does the CPU choose the correct state to "instantiate" as the actual state. In this sense, a branch-predicting CPU is very much like a two-phase database transaction: a branch opens a new transaction, a number of uncommitted results are computed, and when the branch decision becomes evident, the choice of the actual state is the "commit" part of the transaction.

At 04:58 PM 5/20/2014, Warren D Smith wrote:
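[As an editorial aside before the quoted message: the oblivious compare-exchange primitive such networks are built from can be sketched in C. This is my own illustration, not code from the thread; the ternaries are expected to compile to conditional moves on mainstream compilers, so nothing branches on the data.]

```c
/* Branchless compare-exchange: afterwards *a <= *b.
   Compilers typically lower these ternaries to conditional
   moves, so there is no data-dependent branch to mispredict. */
void cswap(int *a, int *b) {
    int lo = *a < *b ? *a : *b;
    int hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* A 5-comparator sorting network for 4 elements: the same fixed
   sequence of compare-exchanges sorts every input, obliviously. */
void sort4(int v[4]) {
    cswap(&v[0], &v[1]);
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}
```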
Suppose you are doing quicksort. Zoing, zoing, zoing, you do compares, and suddenly you decide "aha! this last comparison tells me I need to swap item[K]!" And then you branch off someplace else in your program to do it. Big lose. Mispredicted branch.
But instead: when your comparison sees a bad item, you conditional-load that item's ID number into a stash (no branching needed) and continue on. Eventually your stash gets full and you have to do a real branch to deal with that, but with a large enough stash size, this can be made arbitrarily rare. When the stash fills up (or we reach the end of the scan), we do all the swaps of the stashed bad items (emptying the stash), then branch back to the main code.
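[A minimal sketch of this stash idea in C, assuming a Lomuto-style partition around the last element; the names `stash_partition` and `STASH` are mine, not Warren's. The conditional increment `count += (a[i] < pivot)` is expected to compile branch-free on mainstream compilers.]

```c
#include <stddef.h>

#define STASH 64  /* stash capacity; larger makes the flush branch rarer */

/* Partition a[0..n-1] around pivot a[n-1]; return the pivot's final
   position. Instead of branching on every comparison, indices of
   "small" elements are conditionally recorded in a stash, and the
   deferred swaps are done in batches. */
size_t stash_partition(int *a, size_t n) {
    int pivot = a[n - 1];
    size_t store = 0;           /* next destination for a small element */
    size_t stash[STASH];        /* deferred indices of small elements  */
    size_t count = 0;

    for (size_t i = 0; i + 1 < n; i++) {
        /* Branch-free per element: the store always happens, and
           count only advances when a[i] < pivot.                   */
        stash[count] = i;
        count += (a[i] < pivot);

        if (count == STASH) {   /* rare branch: flush the stash */
            for (size_t k = 0; k < count; k++, store++) {
                int t = a[store]; a[store] = a[stash[k]]; a[stash[k]] = t;
            }
            count = 0;
        }
    }
    for (size_t k = 0; k < count; k++, store++) {  /* flush remainder */
        int t = a[store]; a[store] = a[stash[k]]; a[stash[k]] = t;
    }
    /* move the pivot into place */
    int t = a[store]; a[store] = a[n - 1]; a[n - 1] = t;
    return store;
}
```

The deferred swaps are executed in the same order, on the same index pairs, as an immediate Lomuto partition would perform them, so the result is identical; only the branching is batched.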
This stashing technique -- deferring all the branches, then later doing all the work those branches were going to do, almost without any branching -- is presumably of pretty general applicability as a high-performance programming technique. Certainly it is applicable to quicksort. I don't know what other problems it applies to, but probably many. It should yield substantial speedups for the right problems. (Any suggestions?)
Meanwhile, in preliminary work, I invented a different new way to substantially speed up quicksort, which finally beat my heapsort implementation (which in turn had beaten both of the currently most-ballyhooed quicksort codes, by Bentley+McIlroy and by Kushagra et al, in my speed tests). Some version of this stashing idea might now increase that lead, but I haven't tried it yet. Anyhow, at least in principle, quicksort now has excellent caching behavior, excellent branch-misprediction rates, and arbitrarily near-optimal comparison counts (with good pivot choices), all in one algorithm. Suckily large move counts, though.
-- Warren D. Smith http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)