Re: [math-fun] Optimally sorting a cyclic group
I don't know much about SPARC register banks, except that they seem to be "architectural" features -- i.e., something that the compiler has to know about in order to generate code that computes the correct answer -- rather than "performance" features -- i.e., something that the compiler *optimizers* have to know about in order to generate fast(er) code.

At least a decade before the SPARC, there were "real time" microprocessors -- e.g., RCA's COSMAC/180X 8-bit processors -- where the "registers" were merely locations in external RAM selected by a special index register (at least in some implementations). Thus, although the compiler thought in terms of "registers", these registers didn't have faster access than any other RAM location. The reasons for this design: 1) very low interrupt latency: simply drop in a new register "index" and you instantly have a brand-new set of registers; and 2) not very much space on the chip for even one register bank, much less several -- these were very early single-chip microprocessors.

WRT patterns of cache behavior: we talked about n^(-1.5) "long tail" behavior for caches in the past several months. This "-1.5" exponent appears to be in the middle of the range for actual programs, so I'm taking this as a "given", at least for the moment.

At 09:01 AM 4/22/2018, Tomas Rokicki wrote:
I was always intrigued by the SPARC register banks and how they compare to just using the Tomasulo algorithm directly with fast caches . . . has anyone looked at RISC-V enough to determine if it uses register windows?
Henry, I'm still struggling to understand precisely what pre-existing patterns you expect to exploit in order to gain a benefit in sort time. Few of the n! permutations have enough elements already in place (even when considering all cyclic end-states) to make this interesting up front, so you must have some sort of permutation distribution in mind for which this would be a benefit.
-tom
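[To make the n^(-1.5) "long tail" figure in Henry's note above a bit more concrete, here is a minimal numerical sketch. It assumes, purely for illustration, that the probability of touching a line at distance d falls off as d^(-1.5) and that latency grows linearly with distance; neither assumption comes from the thread, and the function and constants below are invented for the example.]

    # A minimal numerical sketch (not from the thread): if the probability
    # of touching a line at "distance" d falls off as d**-1.5, how much of
    # the expected access cost comes from the slow tail?  The latency model
    # latency(d) = d is purely illustrative.

    def tail_cost_fraction(exponent, n_lines=1_000_000, fast=1024):
        weights = [d ** -exponent for d in range(1, n_lines + 1)]
        total = sum(weights)
        # expected-cost contribution of each line, assuming latency(d) = d
        contrib = [(w / total) * d for d, w in enumerate(weights, start=1)]
        expected = sum(contrib)
        tail = sum(contrib[fast:])   # cost beyond the first `fast` lines
        return expected, tail / expected

    for exp in (1.5, 2.0, 2.5):
        e, frac = tail_cost_fraction(exp)
        print(f"exponent {exp}: expected cost {e:8.1f}, "
              f"{frac:.0%} of it beyond the first 1024 lines")

[Under these assumed costs, an exponent of 1.5 puts nearly all of the expected cost beyond the fast region, while 2.5 lets the fast region dominate; that contrast is roughly why the value of the exponent matters so much.]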
On Fri, Apr 20, 2018 at 9:48 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:
Starts sounding like SPARC's register banks / windows.
How would software discover what the machine is doing to self-tune?
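[One common answer to Andres's question is a self-measuring probe: chase a random cycle of pointers through working sets of increasing size and look for knees in the latency curve. The sketch below is only illustrative; real self-tuning code would use C and hardware cycle counters, and the sizes and step counts here are arbitrary choices of mine.]

    # Time a pointer chase over a random cycle for increasing working-set
    # sizes; the knees in the latency curve reveal the cache hierarchy.
    # Python's interpreter overhead will flatten the effect considerably --
    # this only shows the shape of the experiment, not production code.
    import random, time

    def chase_latency(n, steps=200_000):
        perm = list(range(n))
        random.shuffle(perm)             # one random n-cycle defeats prefetching
        nxt = [0] * n
        for i in range(n):
            nxt[perm[i]] = perm[(i + 1) % n]
        i = 0
        t0 = time.perf_counter()
        for _ in range(steps):
            i = nxt[i]
        return (time.perf_counter() - t0) / steps * 1e9   # ns per access

    for n in (1 << k for k in range(10, 24, 2)):
        print(f"working set {n:>8} slots: {chase_latency(n):6.1f} ns/access")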
On 4/20/18 14:30 , Henry Baker wrote:
Agreed.
Blind men & elephants come from designing & evaluating at the same time.
I'm trying to figure out what major mechanisms are required to achieve "essentially" optimal results w/o overdesigning.
E.g., I've been resisting allowing "massive" simultaneous rearrangements, but they may be required in order that worst-case behavior doesn't dominate the average case behavior.
Why I'm doing this: I've been unhappy with the standard RAM model, which assumes a constant access time for every memory access; that assumption is complete hogwash.
What I'm really trying to design/model here is a memory "cache", but a bizarre type of cache in that every cache line has a slightly different access time, with more "remote" elements being monotonically slower than "closer" elements. As in modern microprocessors, the fastest cache lines may be three orders of magnitude faster than the slowest. Furthermore, the cost of a cache line is directly correlated with its speed, so there are far more slow elements than fast ones.
I haven't settled down on a particular cost model, because I wanted to play with different cost models.
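[For concreteness, here is one way such a cost model could be parameterized. The power-law shape, the exponent, and the factor-of-1000 spread are placeholder choices of mine, matching only the qualitative description above (monotone in distance, roughly three orders of magnitude between fastest and slowest); nothing in the thread fixes them.]

    # One possible parameterization of the cache described above: every line
    # has its own latency, monotone in distance, with roughly a factor of
    # 1000 between the fastest and slowest lines.  The power-law shape and
    # the constants are placeholder guesses, not anything from the thread.

    def line_latency(i, n, spread=1000.0, p=1.5):
        """Latency of line i (0 = fastest) out of n lines, from 1 to `spread`."""
        return 1.0 + (spread - 1.0) * (i / (n - 1)) ** p

    def access_cost(line_indices, n):
        """Total cost of touching the given sequence of line indices."""
        return sum(line_latency(i, n) for i in line_indices)

    n = 1 << 20
    print(line_latency(0, n), line_latency(n // 2, n), line_latency(n - 1, n))

[Swapping in a different latency function (logarithmic, piecewise, etc.) is then a one-line change, which seems to be the kind of experimentation being described here.]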
As usual, we want most of the "work" to be done in the closest/fastest cache lines, including any work of rearranging the order of the cache lines. This is because "small" rearrangements can be much faster than "large" rearrangements.
These are all "exclusive" caches, meaning that a particular item lives in exactly one cache line.
I think I'm close to deciding that some massive rearrangements -- e.g., replacing the first N elements of the fastest part of the cache with N sequential remote elements -- will be a necessary built-in feature. However, it is likely that the cost of such a massive rearrangement will have to be correlated with the "distance" as well as the number of items. Furthermore, it may be necessary and/or convenient to have the number of elements (N) involved in each transfer be correlated with the distance. There is a tradeoff here because a large transfer from "far" may bring in something you need, but also a lot of other stuff you didn't want -- thus "polluting" the fast cache.
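[A bulk-move primitive of the kind contemplated above might look like the sketch below. The cost formula (linear in both transfer size and distance) and the non-overlap restriction are assumptions for illustration, not part of the proposal in the thread.]

    # Sketch of the bulk-move primitive contemplated above: exchange the
    # first n lines (the fast end) with n consecutive lines starting at a
    # remote position, charging a cost that grows with both the transfer
    # size and the distance.  The cost formula is purely illustrative.

    def bulk_swap(cache, start, n, dist_weight=0.01):
        """Exchange cache[0:n] with cache[start:start+n]; return (cache, cost)."""
        assert start >= n, "sketch assumes the two blocks do not overlap"
        cache = list(cache)
        cache[0:n], cache[start:start + n] = cache[start:start + n], cache[0:n]
        cost = n * (1 + dist_weight * start)   # bigger and farther => pricier
        return cache, cost

    new_cache, cost = bulk_swap(list(range(16)), start=8, n=4)
    print(new_cache, cost)   # the front block and the remote block trade places

[The "pollution" tradeoff shows up when n is chosen larger than what is actually needed: the extra lines pay the same transfer cost but displace fast lines that were doing useful work.]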
This is reminiscent of 1950's/1960's style programming where large blocks of data have to be brought in/out by "channel operations"; some of these channels were even smart enough to transfer data *within* main memory w/o even touching an I/O device.
At 12:44 PM 4/20/2018, Tomas Rokicki wrote:
I think we need some sort of explicit cost model here. Otherwise we are each blindly touching a different part of the elephant.
--
http://cube20.org/
http://golly.sf.net/