Does this, in any sense, beat Knuth's? --rwg
Performance-wise it should be identical (unless "rotate by one bit" is faster than "rotate by k bits", where k is an immediate operand), so WDS's benchmark looks dubious to me.
--hah! Maybe it should be identical with perfect machine-language coding (I'll have to take your word for that, and frankly I doubt you are smart enough; processor pipeline etc. issues are so complicated today that nobody can easily tell)... but it is definitely not identical on my machine with my compiler. You are correct that your assembler bswap version was the fastest, though (ArndtB in my benchmark). Here's another little hacking question for you: I give you a 64-bit word x. Your task is to compute the sum of its 8 component bytes. How fast can you do it? What if I only demand that the sum be correct modulo 256?
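
For concreteness, here is one possible starting point for the byte-sum question, as a sketch in portable C. It is not claimed to be the fastest answer, and it is not anyone's answer from this thread. The function names (byte_sum_loop, byte_sum, byte_sum_mod256, byte_sum_check) are made up for illustration. The idea is to fold adjacent bytes into four 16-bit lanes first, so that a single multiply can add the lanes with no carry crossing from one lane into the next.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: add the eight bytes one at a time. */
unsigned byte_sum_loop(uint64_t x) {
    unsigned s = 0;
    for (int i = 0; i < 8; i++) {
        s += (unsigned)(x & 0xFF);
        x >>= 8;
    }
    return s;
}

/* Exact sum of the eight bytes (0..2040) without a loop.
 * Step 1: add even-position and odd-position bytes, giving four
 *         16-bit lanes, each at most 255 + 255 = 510.
 * Step 2: multiplying by 0x0001000100010001 accumulates the lanes;
 *         the top lane ends up holding lane0+lane1+lane2+lane3.
 *         The partial sums are at most 2040, so no lane overflows. */
unsigned byte_sum(uint64_t x) {
    const uint64_t m = UINT64_C(0x00FF00FF00FF00FF);
    uint64_t y = (x & m) + ((x >> 8) & m);
    return (unsigned)((y * UINT64_C(0x0001000100010001)) >> 48);
}

/* Sum modulo 256: here simply the exact sum truncated to 8 bits.
 * Note that the tempting shortcut (x * 0x0101010101010101) >> 56 is
 * NOT correct for arbitrary bytes: carries out of the lower byte
 * positions leak into the top byte (try x = 0xFFFF). */
unsigned byte_sum_mod256(uint64_t x) {
    return byte_sum(x) & 0xFF;
}

/* Hypothetical self-check helper: both fast versions must agree
 * with the loop for the given word. */
void byte_sum_check(uint64_t x) {
    assert(byte_sum(x) == byte_sum_loop(x));
    assert(byte_sum_mod256(x) == (byte_sum_loop(x) & 0xFF));
}
```

Whether this beats the plain loop, or a SIMD instruction where one is available, depends on the machine and the compiler, so it would have to be benchmarked rather than assumed.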