[math-fun] Ken Thompson's compiler hack --- NN applicability?
Hi, suppose you're training a neural network via self play. It looks like it's getting stronger. How do you know the versions that get promoted do not also encode, in themselves, by chance, a collaboration mechanism that helps them win? That is, how do you know the strongest nets do not also help the winning side win when they play the losing side? How do you know they are not implementing Thompson's compiler hack? Andres.
There certainly are concerns of this kind, which can mostly be allayed by a proper training regimen. You'd have to be a bit more explicit about the training regimen you envisage before specific concerns about the contestants gaming your regimen could be addressed. In particular: how does a contestant determine that it is on the "losing side"? On Tue, Jun 23, 2020 at 2:59 PM Andres Valloud <ten@smallinteger.com> wrote:
Specifically, this is in regard to Leela Chess Zero. If you look at many games played against Stockfish, especially those from a while ago, you get the impression that this happens:

1. A locked position develops, and both sides make no progress. Both evaluations are mildly in favor of lc0.
2. Lc0 starts shuffling, i.e. making moves that do not improve its position (and also do not make it worse). Stockfish does the same.
3. But, eventually, Stockfish's evaluation shows a big advantage for lc0 that lc0 does not yet see. Stockfish reacts.
4. Lc0 then perceives the reaction, plays into the new weakness, and eventually wins.

So, how do you know the lc0 nets that get promoted as strong are not encoding, in themselves, the hints their own winning sides need to survive the training regime? That is, is the self-play training process optimizing for wins achieved unilaterally, or collaboratively?

Of course, this example is very specific to lc0 in that I first thought this was possible while watching those particular games. However, I am interested in the general concept.

An issue I see is that, without an audit trail, finding out whether an NN did this on its own is going to be very difficult. I'd imagine one way to ameliorate this problem is diversity (as in biological diversity, and for the same reasons).

I'm interested to hear more informed opinions on this. Are there any?

On 6/23/20 12:13, Allan Wechsler wrote:
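To make the selection pressure concrete: a minimal Python sketch of the kind of promotion gate self-play pipelines use (all names and thresholds hypothetical; lc0's actual gating differs in detail, and the games are stubbed out entirely). The point it illustrates is that the promotion test sees only the score, so any trait that correlates with winning in self-play, including a collusion channel of the kind Andres hypothesizes, is selected for just as surely as genuine strength.

import random

def play_game(net_a, net_b):
    # Stub standing in for a full game; returns 1 (net_a wins),
    # 0 (draw), or -1 (net_a loses).
    return random.choice([1, 0, -1])

def score(candidate, champion, games=400):
    # Candidate's points per game against the current champion.
    points = sum((play_game(candidate, champion) + 1) / 2
                 for _ in range(games))
    return points / games

def gate(candidate, champion, threshold=0.55):
    # Promote the candidate if it outscores the champion by enough.
    # Note what this test *cannot* see: why the candidate won. A net
    # that wins by strength and a net that wins by exploiting a quirk
    # shared with its opponent look identical here.
    return candidate if score(candidate, champion) > threshold else champion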
I am not "more informed", so perhaps you should discount my thoughts on the subject. I also don't yet understand the cheat or exploit that has you worried. LCZero trains against itself, and that training is complete (and the network frozen, no?) by the time it faces Stockfish. If I understand you correctly, you are worried that LCZero is responding to a change in Stockfish's mood, interpreting the change as a sign that Stockfish has noticed that it is losing. LCZero then changes strategy itself to take advantage of Stockfish's inferred weakness.

(a) What exactly is wrong with this? A strong human chess player constantly strives to interpret their opponent's mood, both by watching the action on the board and by watching the opponent's face and mannerisms. If Alice sees Bob flinch, and infers that she has an advantage she has not noticed to that point, she is likely to press, to look harder for weaknesses. Shame on Bob for not having a better poker-face, and likewise, shame on Stockfish for betraying its judgement so transparently in its play.

(b) How could LCZero conceivably learn this trick? It never sees Stockfish play until its training is over and its network locked down. If part of the training regimen were to play millions of games against Stockfish, I would understand the problem (but again, shame on Stockfish for playing transparently like that).

(c) There is no "trusted compiler" in which to install the feared Thompsonian backdoor, unless you mean the LCZero "engine" (the part that serves as a fixed interpreter for the variable neural net). In what way could such a cheat work? I'm not seeing enough similar pieces to justify the analogy with Thompson's exploit. (Thompson equipped his gimmicked compiler to specially detect when it was compiling two different programs, the login handler and the compiler itself. On all other programs it behaved as advertised.)

On Tue, Jun 23, 2020 at 3:22 PM Andres Valloud <ten@smallinteger.com> wrote:
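For readers who don't have the construction in mind, here is a toy rendering of the exploit Allan summarizes (illustrative Python, nothing like Thompson's actual C): the compiler behaves honestly except on two recognized inputs, and one of those inputs is itself, which is how the hack survives recompilation from clean source.

LOGIN_BACKDOOR = "\nif user == 'ken': grant_access()"
SELF_PATCH = "  # ...code that re-inserts both special cases..."

def toy_compile(source: str) -> str:
    # Honest behavior: output mirrors the source (a stand-in for
    # real code generation).
    output = source
    if "def check_login" in source:
        # Special case 1: compiling the login program -> add a backdoor.
        output += LOGIN_BACKDOOR
    if "def toy_compile" in source:
        # Special case 2: compiling the compiler itself -> re-insert
        # the recognition logic, so even a clean compiler source
        # yields a compromised binary.
        output += SELF_PATCH
    return output

The disanalogy Andres picks at below is that here there is at least a binary whose behavior on those two inputs could, in principle, be discovered by inspection.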
Hi, On 6/23/20 12:47, Allan Wechsler wrote:
I am not "more informed", so perhaps you should discount my thoughts on the subject.
Ok, sure :).
I also don't yet understand the cheat or exploit that has you worried.
Hmmmm... I'm not seeing it necessarily in the sense of a cheat or exploit, but rather in terms of an unintended consequence of training. That is, one would like to believe training produces quality X, when it actually produces quality X + Y --- you test for X and it's there, but how do you test for Y not being there too? With software, you could conceivably look at the source code and do something about that. With an NN? Now what?
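One partial, behavioral answer to "how do you test for Y not being there" (a sketch with invented names, not an established lc0 practice): treat Y as anything that makes strength fail to transfer, and measure the frozen net against a pool of opponents that never shared its training history.

import random

def win_rate(net, opponent, games=200):
    # Stub: fraction of points `net` scores against `opponent`.
    return sum(random.choice([0, 0.5, 1]) for _ in range(games)) / games

def pooled_rate(net, opponents, games_each=200):
    return sum(win_rate(net, o, games_each) for o in opponents) / len(opponents)

def transfer_gap(net, siblings, strangers):
    # Score against co-trained siblings minus score against engines
    # trained independently. A large positive gap is evidence of a
    # population-specific trait ("Y"), not of transferable strength ("X").
    return pooled_rate(net, siblings) - pooled_rate(net, strangers)

This never tells you what Y is; it only tells you whether the measured strength depends on who is on the other side of the board, which is the observable symptom at issue.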
LCZero trains against itself, and that training is complete (and the network frozen, no?) by the time it faces Stockfish. If I understand you correctly, you are worried that LCZero is responding to a change in Stockfish's mood, interpreting the change as a sign that Stockfish has noticed that it is losing. LCZero then changes strategy itself to take advantage of Stockfish's inferred weakness.
I thought it more like this: "lc0 shuffling" means "hey, other side, I have no idea what to do here --- help me win please so we both survive training?". And by shuffling, it causes (in this case) Stockfish to shuffle, and eventually that shuffling causes Stockfish to react, which lc0 interprets as *itself* helping *itself* win, which in training was a good trait because nets that win, survive. That is, such self-helping behavior could be selected for by the training.
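To make the hypothesized channel concrete, and to be clear that this is pure speculation about mechanism (nothing here is claimed about what lc0's nets actually do): among moves a net already evaluates as essentially equal, the choice itself is a free signaling channel, and shuffling positions are exactly where near-equal choices abound. A hypothetical sketch:

import hashlib

def shuffle_move(position_key: str, scored_moves, quirk=b"shared-weights",
                 eps=0.02):
    # scored_moves: list of (move, evaluation) pairs.
    best = max(s for _, s in scored_moves)
    near_equal = [m for m, s in scored_moves if best - s <= eps]
    # Any pick here costs (almost) nothing in evaluation, so the win/loss
    # training signal neither rewards nor punishes the pattern of picks.
    # A deterministic, quirk-dependent pick looks like arbitrary shuffling
    # from the outside, yet is recognizable to a net sharing the same quirk.
    digest = hashlib.sha256(quirk + position_key.encode()).digest()
    return near_equal[digest[0] % len(near_equal)]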
(a) What exactly is wrong with this? A strong human chess player constantly strives to interpret their opponent's mood, both by watching the action on the board and by watching the opponent's face and mannerisms. If Alice sees Bob flinch, and infers that she has an advantage she has not noticed to that point, she is likely to press, to look harder for weaknesses. Shame on Bob for not having a better poker-face, and likewise, shame on Stockfish for betraying its judgement so transparently in its play.
Like I said, I do not care that the things playing are called 'lc0' or 'stockfish' or 'fruit loops the color of neon banana peel'. I am using this only as an illustration of the general phenomenon where self-play training could conceivably select for unintended behavior traits. And then, the problem is how do you prevent that, and how do you know for a fact that it didn't happen.
(b) How could LCZero conceivably learn this trick? It never sees Stockfish play until its training is over and its network locked down. If part of the training regimen were to play millions of games against Stockfish, I would understand the problem (but again, shame on Stockfish for playing transparently like that).
Like I explained above. It doesn't happen all the time. Sometimes, there is shuffling, and after a long while lc0 seemingly realizes the game was a draw all along. So, no help => draw. Let me emphasize again: I couldn't care less what the names of the chess playing programs are, or even that they are playing chess. I am interested in the phenomenon of unintended traits selected by self-play training.
(c) There is no "trusted compiler" in which to install the feared Thompsonian backdoor, unless you mean the LCZero "engine" (the part that serves as a fixed interpreter for the variable neural net).
The nets are "trusted" to be strong because they win (and demonstrably so). But there is no way to tell what the nets are doing to "win" in self-play training. How do you know the strongest nets do not cheat against themselves in self-play training, helping themselves win, imperceptibly to any observer, in ways that other nets cannot? How do you know they are not backdooring their own wins to some extent? Their survival depends on whether they do. How would you detect that this is happening? And if you can't detect it, why wouldn't it happen?
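On "how would you detect that": one hedged possibility (standard statistics, not an existing lc0 test) is a two-proportion test comparing a net's score against co-trained siblings with its score against unrelated engines of similar rating. A significant gap would not prove self-collusion, but it would flag a population-specific effect worth auditing.

from math import sqrt

def two_proportion_z(wins_a, n_a, wins_b, n_b):
    # z-statistic for H0: the same win probability in both conditions.
    # (Treats each game as win/loss; folding draws in as half-wins is a
    # common simplification.)
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Example: 230/400 points vs siblings, 198/400 vs strangers.
# z is about 2.27, two-sided p is about 0.02: worth a closer look,
# though far from a conviction.
z = two_proportion_z(230, 400, 198, 400)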
In what way could such a cheat work? I'm not seeing enough similar pieces to justify the analogy with Thompson's exploit. (Thompson equipped his gimmicked compiler to specially detect when it was compiling two different programs, the login handler and the compiler itself. On all other programs it behaved as advertised.)
Right, it's an incomplete analogy. The part I'm focusing on is that in Thompson's compiler case, at least you could in principle decompile the binary and find the unintended code. How do you do that with an NN? Andres.
A team at Google+Stanford ran into a problem very much like this, with adversarial neural network image transformations.

popular account: https://techcrunch.com/2018/12/31/this-clever-ai-hid-data-from-its-creators-...
arXiv paper: https://arxiv.org/pdf/1712.02950.pdf

--Michael

On Tue, Jun 23, 2020 at 2:59 PM Andres Valloud <ten@smallinteger.com> wrote:
-- Forewarned is worth an octopus in the bush.
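The crudest possible rendition of the effect in that paper: information can ride in components of a signal that the quality metric barely weighs. The paper's CycleGAN hid source-image data in a near-imperceptible high-frequency pattern; the classic least-significant-bit trick below is a much simpler cousin, not what the network actually learned.

import numpy as np

def hide(cover: np.ndarray, secret: np.ndarray) -> np.ndarray:
    # Embed the top bit of `secret` in the bottom bit of `cover`
    # (both uint8 arrays of the same shape). The change per pixel is
    # at most 1/255: invisible to the eye and to most loss functions.
    return (cover & 0xFE) | (secret >> 7)

def reveal(stego: np.ndarray) -> np.ndarray:
    # Recover a 1-bit version of the secret.
    return (stego & 0x01) * 255

cover = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
secret = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
assert np.array_equal(reveal(hide(cover, secret)), (secret >> 7) * 255)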
LOL --- awesome! Thanks for the link. On 6/23/20 13:03, Michael Kleber wrote:
Great collection of "creative AI" stories: https://arxiv.org/abs/1803.03453

On Tue, Jun 23, 2020 at 2:04 PM Michael Kleber <michael.kleber@gmail.com> wrote:
-- Mike Stay - metaweta@gmail.com http://math.ucr.edu/~mike https://reperiendi.wordpress.com
I think that's an interesting question, both in theory and in practice. I'm not sure how it would develop a recognizer for itself, using NN rules, but I doubt that it is impossible. If the recognizer resulted in promoting one side further than the other side was demoted, it could enter a positive feedback loop, I suppose.
Hilarie
Yes, that positive feedback loop is what I was thinking about. On 6/23/20 13:27, Hilarie Orman wrote:
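A toy model of that loop (entirely invented dynamics, only to show its shape): if each promotion banks a recognition bonus proportional to the bonus already banked, the measured rating diverges from the true one geometrically.

def simulate(generations=12, true_gain=1.0, feedback=0.3):
    # Each generation improves genuinely by `true_gain`, but the
    # self-recognition effect compounds: inflated nets promote nets
    # that are even better at being recognized.
    true_elo, inflation = 0.0, 1.0
    for g in range(generations):
        true_elo += true_gain
        inflation *= 1.0 + feedback
        print(f"gen {g:2d}: true {true_elo:5.1f}  "
              f"measured {true_elo + inflation:7.1f}")
    return inflation

simulate()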
participants (5)

- Allan Wechsler
- Andres Valloud
- Hilarie Orman
- Michael Kleber
- Mike Stay