It is not so easy to permute bits, actually. Say you have a W-bit word (W=64?). You want to permute the lower-significance K bits in some arbitrary known manner. How do you do it? Here is one way: Step 1: multiply by 100001000010000... where the period is K. Carryless or regular multiply both work here and below. Step 2: AND with some magic bit mask with one 1-bit per K-chunk chosen to get exactly one copy of each original bit, one per chunk, ordered in arbitrary known order. Step 3: multiply by the correct magic sparse-1's number so that the correct permuted K bits magically appear in the most-signif K bits of the product word. It is possible here that if you are using regular (not carryless) multiply, you will need about log(K) buffer bits between the top K bits used for output, and the bottom K*K bits used for input. Step 4: shift right by W-K+BuffSize hops to reposition in original K bit locations. This works if BuffSize+K*K<=W, so if W=64 we can permute 8 bits with carryless, and maybe somewhat fewer with regular, multiply. That sucks. I mean, to permute 8 bits I could've just used a 256-byte lookup table. Looking in the FxtBook, http://www.jjj.de/fxt/fxtbook.pdf section 1.29 fails to answer the question though it does suggest future machine instructions that would.