[math-fun] Keep your Lisp's Zipf'd
[Try pronouncing that phrase fast!]

Just for the heck of it, I instrumented one of the "Gabriel" benchmarks for Lisp -- the Boyer benchmark -- which operates solely on Lisp CONS cells, and never damages these cells using RPLACA/RPLACD.

I wanted to see what the cache performance of an ideal LRU *data* cache would be for this benchmark, so I computed statistics on the "move to front" behavior of the various CONS cells involved in this computation. The most important statistic is the so-called "stack distance" -- the number of positions between a cell's current position on the LRU list and the front of the list. For each stack distance we can accumulate the number of times a "move to front" traverses that particular distance.

If one runs a typical computation long enough, the frequency of these stack distances tends to fall off as n^(-beta), typically 1.3 < beta < 1.7 (note that beta=1 for Zipf). A higher beta means greater "locality": the cache is more effective; a lower beta means that the cache is less effective.

Interestingly, the beta I measured for Boyer is ~1.2666, which places it below the normal range for programs studied by computer architects, and may explain some of the poor cache performance of Lisp on architectures not optimized for Lisp.

BTW, my instrumented Boyer never does GC, on the assumption that any garbage cells quickly get pushed out of the early cache levels and hence don't get in the way of the ongoing computation. (See also Jon L White's paper "GC Considered Harmful".)

Now this assumption could be wrong: the whole point of "nurseries" and "generational GC's" is that the % of garbage is so high that large numbers of tiny GC's can reclaim CONS cells *while they're still in the cache*, and thus avoid cluttering up the cache with dead cells.

On the other hand, a "write-allocate" cache can also perform a CONS entirely within the cache, so the only downside of garbage is the cost of eventually writing the garbage back to backing store; since the write-back mechanism can be deeply pipelined, the effective cost of writing back garbage is overlapped and hence quite small.

I'll also be comparing the behavior of this "classical" Boyer with the behavior of a "hash consed" Lisp memory, where it is always the case that (EQ (CONS x y) (CONS x y)), i.e., a "hash consed" Lisp "uniquizes" cons cells based upon their CAR & CDR contents. I want to see if hash consing produces a better beta.

I'll also be measuring an *instruction cache* for a McCarthy-style Lisp interpreter to see what a beta for an instruction cache might look like.
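For concreteness, here is a minimal sketch of the "move to front" bookkeeping described above. The names RECORD-REFERENCE, *LRU-LIST*, *DISTANCE-COUNTS* and ESTIMATE-BETA are illustrative only, not the actual instrumentation used for the measurements; the real instrumented Lisp would call something like RECORD-REFERENCE on every CAR, CDR and CONS, and would use a balanced tree rather than a linear list so the accounting stays cheap.

;; Sketch only: record one reference to CELL under an ideal LRU policy.
;; *LRU-LIST* holds cells in most-recently-touched-first order;
;; *DISTANCE-COUNTS* maps each stack distance to how many times a
;; "move to front" traversed that distance.
(defvar *lru-list* '())
(defvar *distance-counts* (make-hash-table))

(defun record-reference (cell)
  (let ((distance (position cell *lru-list* :test #'eq)))
    (when distance                      ; NIL means first-ever reference
      (incf (gethash distance *distance-counts* 0)))
    ;; Move (or add) CELL to the front of the LRU list.
    (setf *lru-list*
          (cons cell (remove cell *lru-list* :test #'eq :count 1)))))

(defun estimate-beta ()
  ;; Crude log-log least-squares fit of the histogram: if counts fall
  ;; off as n^(-beta), the slope of log(count) vs. log(distance) is
  ;; -beta, so we negate it.
  (let ((xs '()) (ys '()))
    (maphash (lambda (d c)
               (when (> d 0)
                 (push (log d) xs)
                 (push (log c) ys)))
             *distance-counts*)
    (let* ((n (length xs))
           (mx (/ (reduce #'+ xs) n))
           (my (/ (reduce #'+ ys) n))
           (sxy (reduce #'+ (mapcar (lambda (x y) (* (- x mx) (- y my))) xs ys)))
           (sxx (reduce #'+ (mapcar (lambda (x) (* (- x mx) (- x mx))) xs))))
      (- (/ sxy sxx)))))

The linear POSITION/REMOVE scan makes this O(n) per reference; it is only meant to show the accounting, not to be fast.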
I made one trivial optimization, and suddenly the cache data beta became 1.865, which is again outside the typical range 1.3 < beta < 1.7, but this time on the *upper* side (i.e., further away from Zipf, and much better for a data cache).

Here's the trivial optimization:

The Boyer benchmark does a lot of *substitutions* -- e.g., (subst <x> <y> <exp>), where <exp> is an expression composed from Lisp CONS cells. However, a large fraction of the time, <y> doesn't occur in <exp>, so "subst" does a lot of pointless copying/consing.

So here's the fix:

(defun ccons (car cdr hint)
  ;; "Careful/conservationist" cons.
  ;; Try to reuse the existing cons cell HINT when its CAR and CDR are
  ;; already EQ to the requested ones; otherwise cons a fresh cell.
  (declare (type cons hint))
  (if (and (eq car (car hint))
           (eq cdr (cdr hint)))
      hint
      (cons car cdr)))

(defun subst (x y exp)
  ;; Substitute X for occurrences of Y in EXP, sharing any subtree that
  ;; comes back unchanged.
  (if (atom exp)
      exp
      (let* ((newcdr (subst x y (cdr exp)))
             (newcar (if (equal y (car exp))
                         x
                         (subst x y (car exp)))))
        (ccons newcar newcdr exp))))

Clearly, ccons does more CAR's and CDR's (1,884,829 vs. 1,699,835), but a lot fewer CONS's (84,621 vs. 256,559, i.e., roughly 1/3 as many), so the cache is a lot less bloated with copies of equal CONS cells.

One would expect the cache locality beta to improve; after all, ccons references both (car hint) and (cdr hint), so the 2nd reference would find hint in the first position of the LRU cache. In order to reduce this bias, I've charged only 1 reference for both the CAR+CDR in "ccons" in the statistics above.
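As a tiny sanity check (not part of the measured run) of why ccons saves conses: when <y> does not occur in <exp>, the hinted subst returns <exp> itself, EQ-identical, and only the spine leading to an actual replacement is freshly consed.

;; Illustration only, using the definitions above.
(let ((exp '(f (g a b) (h c))))
  (list (eq (subst 'x 'z exp) exp)        ; T -- nothing to replace, all cells reused
        (equal (subst 'x 'a exp)
               '(f (g x b) (h c)))))      ; T -- unchanged subtree (h c) is shared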
If we now do full "uniquized" cons'ing (aka "hash consing"), where there exists at most one CONS cell with a given CAR and CDR, then the total conses drop precipitously to 2,248 unique CONS cells, and the total of all CAR/CDR/CONS operations is 1,319,625. (Note that fully "uniquized" cons'ing completely subsumes any efficiencies gained by the careful/conservationist ccons described above.)

Curiously, the "beta" is now ~0.98, or almost identically Zipf's Law (Zipf beta=1)!!! Although our program now runs many times faster than it did before, its cache locality behavior (in terms of beta) is worse!

The good news: 99% of all CAR/CDR/CONS references can be handled with a 420-line cache, where each cache line holds a single CONS cell.
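Here is a minimal sketch of the "uniquizing" CONS itself; the names HCONS and *HCONS-TABLE* are illustrative, not the code used for the measurements above. A two-level EQL hash table, keyed first on the CAR and then on the CDR, guarantees that (EQ (HCONS x y) (HCONS x y)) always holds.

;; Sketch only: hash consing via a two-level EQL table.
(defvar *hcons-table* (make-hash-table :test #'eql))

(defun hcons (car cdr)
  ;; Return the unique cons cell whose CAR and CDR are EQ to the arguments.
  (let ((row (or (gethash car *hcons-table*)
                 (setf (gethash car *hcons-table*)
                       (make-hash-table :test #'eql)))))
    (or (gethash cdr row)
        (setf (gethash cdr row) (cons car cdr)))))

With every CONS in the benchmark routed through something like HCONS, structurally equal subtrees collapse to a single cell, which is why the live-cell count falls to a few thousand.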
participants (1): Henry Baker