Re: [math-fun] Optimally sorting a cyclic group
I don't know much about SPARC register banks, except that they seem to be "architectural" features -- i.e., something that the compiler has to know about in order to generate code that computes the correct answer -- rather than "performance" features -- i.e., something that the compiler *optimizers* have to know about in order to generate fast(er) code.

At least a decade before the SPARC, there were "real time" microprocessors -- e.g., RCA's COSMAC/180X 8-bit processors -- where the "registers" were merely locations in external RAM selected by a special index register (at least in some implementations). Thus, although the compiler thought in terms of "registers", these registers didn't have faster access than any other RAM location. The reasons for this design: 1) very low interrupt latency: simply drop in a new register "index" and you instantly have a brand-new set of registers; and 2) not very much space on the chip for even one register bank, much less several -- these were very early single-chip microprocessors.

WRT patterns of cache behavior: we talked about n^(-1.5) "long tail" behavior for caches in the past several months. This "-1.5" exponent appears to be in the middle of the range for actual programs, so I'm taking this as a "given", at least for the moment.

At 09:01 AM 4/22/2018, Tomas Rokicki wrote:
I was always intrigued by the SPARC register banks and how they compare to just using the Tomasulo algorithm directly with fast caches . . . has anyone looked at RISC-V enough to determine if it uses register windows?
Henry, I'm still struggling to understand precisely what pre-existing patterns you expect to exploit in order to gain a benefit in sort time. Few of the n! permutations have enough elements already in place (even when considering all cyclic end-states) to make this interesting up front, so you must have some sort of permutation distribution in mind for which this would be a benefit.
-tom
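[To make the n^(-1.5) "long tail" figure in Henry's note above a bit more concrete, here is a minimal numerical sketch. It assumes, purely for illustration, that the probability of touching a line at distance d falls off as d^(-1.5) and that latency grows linearly with distance; neither assumption comes from the thread, and the function and constants below are invented for the example.]

    # A minimal numerical sketch (not from the thread): if the probability
    # of touching a line at "distance" d falls off as d**-1.5, how much of
    # the expected access cost comes from the slow tail?  The latency model
    # latency(d) = d is purely illustrative.

    def tail_cost_fraction(exponent, n_lines=1_000_000, fast=1024):
        weights = [d ** -exponent for d in range(1, n_lines + 1)]
        total = sum(weights)
        # expected-cost contribution of each line, assuming latency(d) = d
        contrib = [(w / total) * d for d, w in enumerate(weights, start=1)]
        expected = sum(contrib)
        tail = sum(contrib[fast:])   # cost beyond the first `fast` lines
        return expected, tail / expected

    for exp in (1.5, 2.0, 2.5):
        e, frac = tail_cost_fraction(exp)
        print(f"exponent {exp}: expected cost {e:8.1f}, "
              f"{frac:.0%} of it beyond the first 1024 lines")

[Under these assumed costs, an exponent of 1.5 puts nearly all of the expected cost beyond the fast region, while 2.5 lets the fast region dominate; that contrast is roughly why the value of the exponent matters so much.]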
On Fri, Apr 20, 2018 at 9:48 PM, Andres Valloud <avalloud@smalltalk.comcastbiz.net> wrote:
Starts sounding like SPARC's register banks / windows.
How would software discover what the machine is doing to self-tune?
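[One common answer to Andres's question is a self-measuring probe: chase a random cycle of pointers through working sets of increasing size and look for knees in the latency curve. The sketch below is only illustrative; real self-tuning code would use C and hardware cycle counters, and the sizes and step counts here are arbitrary choices of mine.]

    # Time a pointer chase over a random cycle for increasing working-set
    # sizes; the knees in the latency curve reveal the cache hierarchy.
    # Python's interpreter overhead will flatten the effect considerably --
    # this only shows the shape of the experiment, not production code.
    import random, time

    def chase_latency(n, steps=200_000):
        perm = list(range(n))
        random.shuffle(perm)             # one random n-cycle defeats prefetching
        nxt = [0] * n
        for i in range(n):
            nxt[perm[i]] = perm[(i + 1) % n]
        i = 0
        t0 = time.perf_counter()
        for _ in range(steps):
            i = nxt[i]
        return (time.perf_counter() - t0) / steps * 1e9   # ns per access

    for n in (1 << k for k in range(10, 24, 2)):
        print(f"working set {n:>8} slots: {chase_latency(n):6.1f} ns/access")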
On 4/20/18 14:30 , Henry Baker wrote:
Agreed.
Blind men & elephants come from designing & evaluating at the same time.
I'm trying to figure out what major mechanisms are required to achieve "essentially" optimal results w/o overdesigning.
E.g., I've been resisting allowing "massive" simultaneous rearrangements, but they may be required in order that worst-case behavior doesn't dominate the average case behavior.
Why I'm doing this: I've been unhappy with the standard RAM model, which assumes a constant access time for every memory access; that assumption is complete hogwash.
What I'm really trying to design/model here is a memory "cache", but a bizarre type of cache in that every cache line has a slightly different access time, with more "remote" elements being monotonically slower than "closer" elements. As in modern microprocessors, the fastest cache lines may be three orders of magnitude faster than the slowest. Furthermore, the cost of a cache line is directly correlated with its speed, so there are far more slow elements than fast ones.
I haven't settled down on a particular cost model, because I wanted to play with different cost models.
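[For concreteness, here is one way such a cost model could be parameterized. The power-law shape, the exponent, and the factor-of-1000 spread are placeholder choices of mine, matching only the qualitative description above (monotone in distance, roughly three orders of magnitude between fastest and slowest); nothing in the thread fixes them.]

    # One possible parameterization of the cache described above: every line
    # has its own latency, monotone in distance, with roughly a factor of
    # 1000 between the fastest and slowest lines.  The power-law shape and
    # the constants are placeholder guesses, not anything from the thread.

    def line_latency(i, n, spread=1000.0, p=1.5):
        """Latency of line i (0 = fastest) out of n lines, from 1 to `spread`."""
        return 1.0 + (spread - 1.0) * (i / (n - 1)) ** p

    def access_cost(line_indices, n):
        """Total cost of touching the given sequence of line indices."""
        return sum(line_latency(i, n) for i in line_indices)

    n = 1 << 20
    print(line_latency(0, n), line_latency(n // 2, n), line_latency(n - 1, n))

[Swapping in a different latency function (logarithmic, piecewise, etc.) is then a one-line change, which seems to be the kind of experimentation being described here.]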
As usual, we want most of the "work" to be done in the closest/fastest cache lines, including any work of rearranging the order of the cache lines. This is because "small" rearrangements can be much faster than "large" rearrangements.
These are all "exclusive" caches, meaning that a particular item lives in exactly one cache line.
I think I'm close to deciding that some massive rearrangements -- e.g., replacing the first N elements of the fastest part of the cache with N sequential remote elements -- will be a necessary built-in feature. However, it is likely that the cost of such a massive rearrangement will have to be correlated with the "distance" as well as the number of items. Furthermore, it may be necessary and/or convenient to have the number of elements (N) involved in each transfer be correlated with the distance. There is a tradeoff here because a large transfer from "far" may bring in something you need, but also a lot of other stuff you didn't want -- thus "polluting" the fast cache.
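[A bulk-move primitive of the kind contemplated above might look like the sketch below. The cost formula (linear in both transfer size and distance) and the non-overlap restriction are assumptions for illustration, not part of the proposal in the thread.]

    # Sketch of the bulk-move primitive contemplated above: exchange the
    # first n lines (the fast end) with n consecutive lines starting at a
    # remote position, charging a cost that grows with both the transfer
    # size and the distance.  The cost formula is purely illustrative.

    def bulk_swap(cache, start, n, dist_weight=0.01):
        """Exchange cache[0:n] with cache[start:start+n]; return (cache, cost)."""
        assert start >= n, "sketch assumes the two blocks do not overlap"
        cache = list(cache)
        cache[0:n], cache[start:start + n] = cache[start:start + n], cache[0:n]
        cost = n * (1 + dist_weight * start)   # bigger and farther => pricier
        return cache, cost

    new_cache, cost = bulk_swap(list(range(16)), start=8, n=4)
    print(new_cache, cost)   # the front block and the remote block trade places

[The "pollution" tradeoff shows up when n is chosen larger than what is actually needed: the extra lines pay the same transfer cost but displace fast lines that were doing useful work.]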
This is reminiscent of 1950's/1960's style programming where large blocks of data have to be brought in/out by "channel operations"; some of these channels were even smart enough to transfer data *within* main memory w/o even touching an I/O device.
At 12:44 PM 4/20/2018, Tomas Rokicki wrote:
I think we need some sort of explicit cost model here. Otherwise we are each blindly touching a different part of the elephant.
--
http://cube20.org/
http://golly.sf.net/