[math-fun] bit reversal
Does this, in any sense, beat Knuth's? --rwg
Performance wise it should be identical (unless "rotate by one bit" is faster than "rotate by k bits" where k is an immediate operand); so WDS's benchmark looks dubious to me.
--hah! Maybe it should be identical with perfect machine language coding (I'll have to take your word for that, and frankly I doubt you are smart enough -- processor pipeline & etc issues are so complicated today nobody can easily tell)... but definitely not on my machine with my compiler. You are correct your assembler bswap version was the fastest, though (ArndtB in my benchmark). Here's another little hacking question for you: I give you a 64-bit word x. Your task is to compute the sum of its 8 component bytes. How fast can you do it? What if I only demand the sum be correct modulo 256?
* Warren D Smith <warren.wds@gmail.com> [Apr 05. 2014 07:27]:
Does this, in any sense, beat Knuth's? --rwg
Performance wise it should be identical (unless "rotate by one bit" is faster than "rotate by k bits" where k is an immediate operand); so WDS's benchmark looks dubious to me.
--hah! Maybe it should be identical with perfect machine language coding (I'll have to take your word for that, and frankly I doubt you are smart enough -- processor pipeline & etc issues are so complicated today nobody can easily tell)... but definitely not on my machine with my compiler.
Here is what I do: - create different versions and do timing - try understanding what was measured by looking at the generated machine code (current compilers create good machine code!) - read the CPU manuals (both intel and AMD from begin to end), same with the respective optimization manuals - keep in mind various, often conflicting, issues like register pressure and internal parallelism, and (sometimes ugly) curiosities of the CPUs. - talk to someone who knows the optimization mechanisms of a compiler _very_ well - repeat measuring routines embedded in programs that can be considered 'realistic' use - redo timings (synthetic and 'embedded') on different machines with different compilers; automated; understand what's going on - accept that the fastest routine (in some fixed setting) is the fastest, no matter how much you wish some other was
You are correct your assembler bswap version was the fastest, though (ArndtB in my benchmark).
Here's another little hacking question for you: I give you a 64-bit word x. Your task is to compute the sum of its 8 component bytes. How fast can you do it? What if I only demand the sum be correct modulo 256?
O( log_2( BYTES_PER_WORD ) ) with essentially the bit-count mechanism. O( 1 ) when using the multiplier. Both correct properly, not only mod 256. Best, jj
_______________________________________________ math-fun mailing list math-fun@mailman.xmission.com http://mailman.xmission.com/cgi-bin/mailman/listinfo/math-fun
participants (2)
-
Joerg Arndt -
Warren D Smith