
AXRP · Civilisational risk and strategy

Adversarial Policies with Adam Gleave

Why this matters

Auto-discovered candidate. Editorial positioning to be finalized.

Summary

Auto-discovered from AXRP. Editorial summary pending review.

Perspective map

Mixed · Governance · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 55 full-transcript segments: median -7 · mean -5 · spread -170 (p10–p90 -100) · 0% risk-forward, 100% mixed, 0% opportunity-forward slices.

Slice bands
55 slices · p10–p90 -100

Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes safety
  • Emphasizes AI safety
  • Full transcript scored in 55 sequential slices (median slice -7); a sketch of this aggregation appears below.
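
As a rough illustration of how the slice statistics quoted above could be aggregated, assuming one numeric spectrum score per slice (negative = risk-forward): the band cut-offs and field names here are assumptions, not the site's actual pipeline.

```python
# Hypothetical sketch of the slice-aggregation arithmetic shown on this page.
# Cut-offs and input format are assumptions, not the site's real scoring code.
import numpy as np

def summarise_slices(scores, risk_cutoff=-50, opportunity_cutoff=50):
    """scores: one spectrum score per transcript slice (negative = risk-forward)."""
    s = np.asarray(scores, dtype=float)
    p10, p90 = np.percentile(s, [10, 90])
    return {
        "median": float(np.median(s)),
        "mean": float(np.mean(s)),
        "p10_p90": (float(p10), float(p90)),
        "pct_risk_forward": 100 * float(np.mean(s <= risk_cutoff)),
        "pct_opportunity_forward": 100 * float(np.mean(s >= opportunity_cutoff)),
        "pct_mixed": 100 * float(np.mean((s > risk_cutoff) & (s < opportunity_cutoff))),
    }

# e.g. summarise_slices([-7, -3, 0, -12, 5]) -> {"median": -3.0, "mean": -3.4, ...}
```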

Editor note

Auto-ingested from daily feed check. Review for editorial curation.

ai-safety · axrp

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video 8OkkBaPazl8 · stored Apr 2, 2026 · 1,573 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/adversarial-policies-with-adam-gleave.json when you have a listen-based summary.

Hello, everybody. Today I'll be speaking with Adam Gleave. Adam is a grad student at UC Berkeley; he works with the Center for Human-Compatible AI, and he's advised by Professor Stuart Russell. Today Adam and I are going to be talking about the paper he wrote, "Adversarial Policies: Attacking Deep Reinforcement Learning". This was presented at ICLR 2020, and the co-authors are Michael Dennis, Cody Wild, Neel Kant, Sergey Levine and Stuart Russell. So, welcome Adam.

Yeah, thanks for having me on the show, Daniel.

Okay, so I guess my first question is: could you summarize the paper? What did you do, and what did you find?

Sure. So the basic premise of the paper is that we're really concerned about adversarial attacks on machine learning systems, and most adversarial attacks people have talked about have assumed this kind of Lp-norm threat model, where you take some existing input and you add a small amount of perturbation to that input, and then something like an image classifier drastically changes its classification. I think probably people have seen examples of this, where you add some white noise to a panda and this completely changes the classification — it's a very striking example. But often what we care about isn't really the performance of image classifiers, because they're just outputting a label that doesn't directly have an effect on the world; we're concerned about the behavior of entire systems, and reinforcement learning is a technique to train policies that actually take actions in the world. So the stability and robustness of reinforcement learning is potentially of much more importance than with image classifiers. People have done research in the past looking at porting adversarial examples from image classifiers over into deep RL — there was some prior work by Huang and others, and Lin and others, on this — and shown that basically the same attack succeeds. But what we wanted to do in this work was come up with a threat model that was more appropriate to reinforcement learning, because you don't normally have the ability to just add arbitrary noise to some robot's sensors: if you already had that level of control, there are much easier ways of breaking a robot. So we're modeling the adversary as being another agent in this shared environment, and this adversarial agent can take the same set of actions as the victim agent that's being attacked, and those actions can indirectly change the victim's observations. And we found that even under this much more restricted threat model, that was sufficient to cause the victim to fail in quite surprising ways.

Okay. And could you say a little bit about what environments and what tasks you explored for this work?

Sure, yeah. The tasks we were using were all simulated robotics environments. They were two-player zero-sum games, so the policies that we were attacking were trained via self-play to win at these zero-sum games. You'd expect them to already be quite robust to adversarial behavior, because they were playing against an opponent that was trying to beat them during training, but we found that despite this, self-play isn't robust to these adversarial policies. We think it's probably because self-play is only exploring a small amount of the possible space of policies, so you can fairly easily find some part of the policy space that it's not robust to.

Okay. So when you trained these adversarial policies, how much were you training, and for how long were the — I believe you call them the victim policies — originally trained?

Yeah, so the victim policies were originally trained for at least 500 million time steps, and I think up to 2 billion time steps. They were actually trained by Bansal and others, a team at OpenAI, so we didn't train those policies, but they were considered to be state of the art at the time. Our adversarial policies were trained for no more than 20 million time steps, which is still a lot in absolute terms, but it's only a tiny fraction of what the victim policies were originally trained with, and is reasonably sample efficient for deep RL in these kinds of environments.

Okay, so it's not the case that you managed to defeat these policies by training for a ton of time — it was relatively cheap.

Yeah. I mean, these are all experiments you can run on a desktop PC in under 24 hours. So it's not really, really cheap — you don't want to run it on your laptop in real time — but it definitely doesn't require Google-scale compute to pull off these attacks.

Okay. And what did the attacks look like? If you train one of these adversarial policies, what does it do?

Yeah, that's a great question. I think one of the more shocking results here isn't just that we can exploit the victim, but that we exploit it in this really surprising way. So the tasks we were using were these kinds of simulated robotics games. One of them was a penalty shootout in soccer, where there's a kicker and a goalie trying to defend the goal posts, and we substituted in this adversarial goalie which made no attempt whatsoever to block the ball. It just fell over and kind of wriggled around on the ground, putting its limbs in this really contorted position, and to a human looking at it, it just looks like completely uncoordinated, chaotic behavior. But it actually causes the kicker to fail to kick the ball, and sometimes the kicker will even fall over, whereas it had been a very stable policy normally. And it's not just seeing something off-distribution — because you might think, well, if a goalie fell over in front of you in real life, you might also be a bit confused about what's going on — but we tried a random policy that takes completely uniformly random actions, which visually looks pretty similar to the adversary, and this didn't have the same effect on the victim. So it really is about finding a very specific type of behavior that triggers a glitch or a bug in the victim policy.

Okay, yeah. I'll say to listeners, the kind of behavior that you see is quite striking — I recommend the website, adversarialpolicies.github.io.
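
[The setup Adam describes — the victim's weights stay frozen while the adversary is trained as an ordinary single-agent RL problem in the shared environment — can be sketched as follows. The one-dimensional toy game, the hand-written stand-in for a pre-trained victim, and the use of stable-baselines3 PPO are illustrative assumptions; this is not the paper's code or its MuJoCo environments, it only shows the shape of the attack loop and the comparatively small attacker budget.]

```python
# Toy illustration of the adversarial-policy threat model (not the paper's code).
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO


class ToyTwoPlayerGame:
    """Both agents push along a line; whoever leaves [-1, 1] first loses.
    Each agent observes [own position, opponent position]."""

    def reset(self):
        self.pos = np.array([-0.5, 0.5], dtype=np.float32)  # [adversary, victim]
        self.t = 0
        return self.obs(0)

    def obs(self, i):
        return np.array([self.pos[i], self.pos[1 - i]], dtype=np.float32)

    def step_pair(self, a_adv, a_vic):
        self.pos[0] += 0.05 * a_adv
        self.pos[1] += 0.05 * a_vic
        self.t += 1
        out = np.abs(self.pos) > 1.0
        done = bool(out.any() or self.t >= 200)
        reward_adv = float(out[1]) - float(out[0])  # +1 if the victim leaves first
        return self.obs(0), reward_adv, done


def frozen_victim(obs):
    """Hand-written stand-in for the pre-trained victim. Its brittle habit --
    backing away whenever the opponent gets close -- is what the attacker can
    learn to exploit; an opponent that stays central never triggers it."""
    own, opp = obs
    if abs(opp - own) < 0.4:
        return float(np.clip(own - opp, -1.0, 1.0))  # back away from the opponent
    return float(np.clip(-own, -1.0, 1.0))           # otherwise drift to the centre


class AdversaryView(gym.Env):
    """Single-agent view of the two-player game: the frozen victim acts inside
    step(), so the RL algorithm only ever updates the adversary's policy."""

    observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
    action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def __init__(self):
        super().__init__()
        self.game = ToyTwoPlayerGame()

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.game.reset(), {}

    def step(self, action):
        a_vic = frozen_victim(self.game.obs(1))      # victim weights never change
        obs, reward, done = self.game.step_pair(float(np.squeeze(action)), a_vic)
        return obs, reward, done, False, {}


if __name__ == "__main__":
    # Small attacker budget, mirroring the ~20M-vs-up-to-2B time-step asymmetry.
    attacker = PPO("MlpPolicy", AdversaryView(), verbose=0)
    attacker.learn(total_timesteps=20_000)
```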
um this medium of course is the medium of podcasting is not actually great at uh conveying images but you can go there and look at these videos um so speaking of the robotics tasks um I guess there are a few different um multiplayer RL environments that you could play with right um why did you choose robotics specifically sure that's a good question so one one motivation was we wanted something that was closer to realistic attacks in that um robotics is an environment where people are at least beginning to uh transition from the lab to deployment it's one of the main actual motivating use of cases for deepl so having robotics that are actually robust of these kinds of attacks is actually you know important uh rather than you know some other environments are more just toy proof of Concepts um another important uh reason we chose this environment was that the uh number of Dimensions that the adversarial policy can influence is uh reasonably large uh so the a detel of this environment is that the both agents see other agents position um and this position is like their Center of mass uh the positions of their joints and this is normally between like 10 to 20 Dimensions so you know it's not huge it's not like an image based observation but it's enough degrees of freedom that an adversarial policy can actually uh you know confuse the victim just by its body positioning uh whereas if we' gone first on of really simple like Point Mass particle environments that some people use in multi-agent RL it would have been pretty hard to pull office kind of attack I suspect because you only got like an XY coordinate to play with and that's just not really High dimensional enough to to pull an attack so we need this kind of like minimum level of complexity to demonstrate it uh but we wanted to choose something that wasn't too complex because that just obscures um obscures results makes it harder to replicate and and run yeah although it is the case that sometimes um you'll have policies trained on recurrent models right M which like in some ways increases the dimensionality of the observation right yeah do different things at different times yeah that's right so we are actually looking now in some followup work on uh rock paper scissors so that's a very very simple game and probably played it in you know kindergarten um and if you're playing against an RNN that sees all the sequence of your actions then this is actually quite a high dimensional space and um because it's a a non-transitive game meaning that there's no single dominant deterministic policy um because you know rock beats paper paper beats scissors and so on uh this means that you know if unless your opponent perfectly randomizes if you're able ble to predict what your opponent is going to do you're going to be have have some advantage over them uh so we are actually trying to come up with some adversarial policies in that setting and it's still early work but it looks like it is possible at least with some kinds of training setups huh that's interesting so like from that I'm gathering you kind of think that these results are representative of whenever like you're in an environment where there's another agent that has like control over many degrees of freedom that like the victim is observing or depending on yeah so I think that basically whenever you're training a policy via selfplay this is a very very common technique using Alpha go Alpha star a bunch of other results um then if there's enough dimensionality they are just not going to have 
seen all the possible observations at training time there are probably going to be some areas where the new network policy you learn is going to generalize badly and an adversary is going to be able to push you towards those um those States uh and then your your performance is going to degrade uh now I I think that one limitation of our work so far is we've not attacked any kind of like truly state of art policies like it'd be nice to train attacks something like alphago which has actually beat you know high level humans uh the simulated robots we attacking were pretty good kickers but they're not you know about to win the World Cup uh so can we beat something that's truly said of art that's still an open question uh my suspicion is that if you are just attacking a neur network policy the answer is still going to be yes but when you start adding things like uh Monte car research which is used in alphago we're actually looking ahead a few steps then it's going to become much harder because you have to not only fool the network immediately you have to like fool it even once it can see the consequence of taking this stupid action so a lot more challenging yeah alth although in some ways adversarial policies still exist in this setting right so if I think about go right um if I'm a human playing go and I'm not an expert one thing I'm vulnerable to is people playing moves that are sort of similar to standard openings but require very different responses and often I'll just like make a mistake uh because like I can't quite anticipate what their right follow-up moves are and like this will kind of Doom me to be down a couple points or something like I'm wondering if you think that's like if if if that kind of thing that maybe listeners have experienced themselves is like analogous to what's going on here uh yeah I mean I think it's it's always hard to say exactly how kind of Human Experience translates onto to deep learning but I think that's a pretty good intuition to have that uh when when you've at training time alone you have seen a small set of possibilities and so they're kind of just pattern matching and saying well this is similar to something I've seen in the past uh so now you know I've got this rule of fun but I've learned has always worked me well in the past at training and I'm just going to keep on applying this rule of fun and if you put it in a you know sufficiently extreme State then maybe it applies that rule of FO too far and it takes the wrong action and and because we're able to sort of search systematically uh to find these kinds of examples you're able to exploit in I do sort of Have A Bit of Sympathy for viim policy because what we're doing during training is we're freezing for vixm and then we're just keeping on training an adversary and I don't know if someone was able to give me uh retrograde amnesia and just like keep on playing chess against me again and again i' probably eventually be able to find some move or I do something but I didn't just win but I do something incredibly stupid reliably um you know I'm probably not actually that that robust and so in s like a natural question to ask is well can we make policies we're at least going to be able to adapt really quickly to these kinds of mistakes so maybe you can fall them once but once they've been exploited realize oh that was a really stupid thing to do and learn um so that's one thing we're working on right now but it is quite challenging because generally RL training is quite simple and efficient so these vix 
and policies were trained for up to two billion time steps so you know you can you can lose many many games of soccer before you even get to a million time steps so we need to be able to really adapt very fast to be able to um be robust to these kinds of attacks at deployment time okay yeah it seems like a rough challenge um so moving on a little bit um in the introduction one point you make um is distinguish distinguishing between the preserving like the observations of a system and like being an agent that sort of Acts and produces like natural observations that are like different from what the system have seen y um but if I think about these robotics environments um the the input to the policies are like joint like locations and angles right y That's right so so to what extent are these like actually importantly different uh so like to what extent of actually actually pbed um yeah or or or like like I don't know how how is um how is what you're doing really that different from a from the classic like perturb the input adversarial example oh sure yeah so so the space of possibilities is actually reasonably constrained uh so varant have to hold them a space for example is where you can't have two lims that are actually intersecting uh for physic simulator when allow that uh there's normally limits on like how mobile each joint is and then there's also limitations where uh you know adversarial agent is still trying to to win at some game so so if it were to um in the case of a goalie if it were to move outside of a certain region uh of like the goalkeeping region when it's going to lose so it's not like it's got a complete Freedom uh I do think it's you know interesting to note that in one of the environments uh which was a Sumer wrestling environment there were more constraints on the adversary's behavior than in the other environments if adversary fell over which is basically what happens in other environments it would lose and we did still get you know surprisingly good results in in sumu humans but it wasn't uh it wasn't actually outperforming the normal opponents although it was getting basically the same performance as a normal opponent despite not trying to knock the victim over it was just kind of like kneeling in the strange position uh so at least in that case more constraints do seem to reduce Effectiveness and in Sumo the constraint is like you're not allowed to touch the ground with your with is it anything above the knees or uh I I can't remember the exact constraint I think that yeah there's certain parts of your body that are not allowed to touch the ground and I think there might be a constraint on your center of mass you're also not allowed to leave for arena as so you have to remain within the arena yeah of course okay so that makes sense um another question I had based on reading your paper um use some of these asymmetric environments so for instance like um one player is trying to kick a ball into a goal and one player is trying to be the goalie um in the videos maybe I didn't look enough but from what I can see it looks like the adversarial victim seems to always be the kicker and the sorry the victim is always the kicker and like the adversarial agent is always the goalie um did did you always do that or did you like try making the goalie the victim and it didn't work or what's up with that uh yeah so I think we actually should should revisit that we did try both of them in early stage experiments and we decided to prioritize the the kicker being that sorry the 
goalie being the adversary, because that seemed to be working a little bit better in initial experiments. But then we improved our technique quite a lot, so I don't actually know, if we were to revisit that now, whether we would still be able to get a good adversarial policy out of a kicker. Again, my suspicion is that it's going to be harder, because the goalie can kind of win by default if the kicker doesn't kick the ball, so it only needs to cause the kicker to malfunction, whereas the kicker, to win, does have to kick the ball. So it kind of needs to first make the goalie fall over and fail, and then get up off the ground and kick the ball. I suspect that such an adversarial policy does exist, but it might be a lot harder to train with RL, because you've now got to do this two-step thing, and we're generally quite bad at training policies to do one action after another.

Another way to think about it is that there are more constraints on the kicker's actions, right? It's got to kick the ball in, so that sort of determines what it can do with its legs, and maybe it can wave its arms around in a frightening way, but it's got fewer degrees of freedom, right?

Yeah, it's got fewer degrees of freedom. It does have a time element, though, so it might be able to start off by almost falling over — I think it can't actually fall over, the episode would end then — but it can crouch down on all fours and do a lot of crazy things that don't involve kicking the ball, and then once the goalie has fallen over, it can kind of dust itself off and go on to stage two of kicking the ball. Eventually the episode will time out, so it has to do this reasonably quickly, but it's not like it has to make the goalie fall over while en route to the ball — it can do a feint to trick the goalie and then kick the ball.

Okay, yeah, that makes sense. So, moving on a little bit: you not only found these adversarial policies, you also, in section five, have some results on what happens if you ignore your opponent, how the dimensionality of the problem affects things, and also what's going on with the activations of the victim agents. Could you describe those results?

Yeah, sure. I think probably one of the most striking results was what happens when you effectively just blindfold or mask the victim policy, so it can no longer see anything the adversary is doing. We don't change any of the policy weights; we just replace the part of the observations that normally corresponds to the opponent's position with a static value that's sort of a typical value at initialization. And what we find is that these masked policies actually do surprisingly well — they are basically robust to the adversary. This kind of makes sense when you think about it, because the adversary isn't doing anything to physically interfere with the victim, and so although it might not be great to not know where the goalie is if you're trying to kick the ball, you can still do pretty well just by kicking it in a random direction and hoping the goalie doesn't block it, because the goalie isn't trying to block the ball. So that was an interesting result, but it really is just confirming that the adversarial policies are working by indirectly changing the victim's observations.
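
[A masking baseline of the kind just described can be sketched as an observation wrapper: leave the policy weights untouched and overwrite the slice of the observation that encodes the opponent's position with a fixed, initialization-like value. The environment id, index range and placeholder below are illustrative assumptions rather than the paper's implementation.]

```python
# Hypothetical sketch of the "masked victim" baseline: the victim's weights are
# untouched; we only blank out the part of its observation describing the opponent.
import numpy as np
import gymnasium as gym


class MaskOpponentObservation(gym.ObservationWrapper):
    def __init__(self, env, opponent_slice, placeholder):
        """opponent_slice: observation indices that encode the opponent's position.
        placeholder: static values typical of the opponent at initialization."""
        super().__init__(env)
        self.opponent_slice = opponent_slice
        self.placeholder = np.asarray(placeholder, dtype=np.float32)

    def observation(self, obs):
        masked = np.array(obs, dtype=np.float32, copy=True)
        masked[self.opponent_slice] = self.placeholder   # victim is now "blindfolded"
        return masked


# Usage sketch (the environment id and index range are assumptions):
# env = MaskOpponentObservation(
#     gym.make("KickAndDefend-v0"),
#     opponent_slice=slice(24, 44),   # dims holding the goalie's position/joints
#     placeholder=np.zeros(20),       # or an average initial opponent position
# )
# The frozen victim policy is then evaluated on `env` exactly as before.
```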
And these masked policies, they don't do well against other agents, right?

Right — no, if you play them against a normal opponent, they do terribly, because they don't see the opponent coming at them and basically can't adapt their behavior.

Yeah, okay, so that's the masking. You also tested different dimensionality?

Yeah. So we also tested varying the robotic body, so we had both humanoids and ants, and you'd sort of expect, if our hypothesis is right that the dimensionality the adversary can manipulate is important, that the lower-dimensional the body is, the harder it is to perform an attack. And that seems to be true, in that it's much harder to win as an adversary in Sumo Ants — that was ants sumo wrestling — than in the humanoid version. But there are some confounders here, in that ants are just generally more stable than humanoids, so ideally we'd like to be able to decouple those two, but it's a little bit hard in robotics because high-dimensional bodies generally correlate with being harder to control.

Yeah, sorry — when you say ants are more stable, do you mean as a physical structure, or that the training is more stable?

Physically, ants are more stable, because they have four legs — which I know isn't biologically accurate, but that's how things work in MuJoCo — and they have a lower center of mass, so you don't fall over as an ant if you just exert basically constant control torques at your joints, whereas with a humanoid it's a constant balancing act.

It seems like one way to control for this would just be to have an ant play a humanoid in sumo, with the ant as the adversarial agent and the humanoid as the victim.

Yeah, that would be interesting; we could try that. We were always attacking these pre-trained policies by OpenAI, because we thought that would be fairer — then no one can say, well, we did something wrong in our training — and they didn't do that setup. But we could train our own policies and try to do that, and I think that'd be quite an interesting future experiment.

Okay. And finally — I guess in order to try to understand what these adversaries were doing, you looked at what the activations of the victim neural networks were. Can you say a bit about that?

Yeah, sure. So just to give some context, what we were doing was we had this fixed victim policy network play several different opponents: one of which was an adversarial policy, one of which was a normal opponent of the kind it was trained with via self-play, and then finally a policy that was just taking random actions. We wanted to see, basically, what does the victim policy actually think when it's playing against these different opponents? So we recorded all of the activations of the victim policy network and we fit a density model to these activations. We fit it on play against a normal opponent, and then we tried to predict how likely the activations were when it was playing a different normal opponent — because we had different normal opponents with different seeds — and also the random and adversarial policies. And what we found was that the adversarial policy very reliably induced just extremely unlikely activations — activations that would be very unlikely to occur when playing a normal opponent — so the victim policy was clearly thinking something very different.
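
[The detection idea in that answer — fit a density model to the victim's activations against normal opponents, then score how likely the activations induced by other opponents are — can be sketched as follows. The Gaussian mixture model, the array shapes and the collect_activations helper are assumptions for illustration; the paper's exact density model and the layers used may differ.]

```python
# Hypothetical sketch: score how "surprising" the victim's activations are under
# different opponents. collect_activations() is an assumed helper that would run
# the frozen victim against a given opponent and return an (n_steps, n_units)
# array of hidden activations.
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_activation_density(train_activations, n_components=20, seed=0):
    """Fit a density model to activations recorded against a *normal* opponent."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(train_activations)
    return gmm


def mean_log_likelihood(gmm, activations):
    """Average per-timestep log-likelihood: lower = more surprising to the victim."""
    return float(np.mean(gmm.score_samples(activations)))


# Usage sketch (the opponent handles and the helper are assumptions):
# density = fit_activation_density(collect_activations(victim, normal_opponent_seed_0))
# for name, opp in [("normal (other seed)", normal_opponent_seed_1),
#                   ("random", random_opponent),
#                   ("adversarial", adversarial_policy)]:
#     print(name, mean_log_likelihood(density, collect_activations(victim, opp)))
# Expected ordering, per the result described above:
#   normal > random > adversarial  (adversarial activations are the least likely)
```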
And then the random policy was also fairly unlikely compared to a normal opponent, but it was much more likely than the adversarial policy. So this is suggesting that it's not just being off-distribution, but that we're systematically finding some part of the state space that the victim is just very confused by, or we're triggering some features to be much larger than they usually would be.

Yeah. And if you fit this density model on opponent one, how surprising are the activations induced by opponent two?

They're generally pretty hard to distinguish. I think the only exception was in Sumo — for some reason the opponents there seemed to be more different than usual — but in the other two environments, there wasn't any significant difference, within confidence intervals, between different normal opponents.

Yeah. I guess if you think about sumo the sport, there are sort of distinct strategies you can go for, right? Like push the opponent out of the circle, or tip them over, and I guess you could specialize in one of those.

Yeah, and I think we did see that a little bit in the pre-trained opponents — certainly they had quite different win rates against each other, so that's suggesting that they weren't just a uniform opponent.

And in terms of these activations, could you say a little bit about what the neural nets were, and what layer you're getting these activations from? Are these just logits, or one before logits?

Sure. So I'm not actually 100% sure on this point, because someone else ran these experiments, but I think we were looking at activations from every layer of the network, so we didn't choose a particular layer. And in terms of the networks — again, we didn't actually train the victim policies, OpenAI did, so I'm not 100% sure on the policy architecture — but I think there wasn't anything fancy going on: these were just fairly standard multi-layer perceptrons. Some of them had LSTMs, but some of them were just standard feedforward networks.

Okay, great. So I guess that concludes the section where I'm asking directly about the paper. Now we're having more speculative questions on my part, so they might not make any sense.

Sure thing, yeah.

So one thing you note in the masking section is that the space of policies, and which ones beat others, is not transitive, right? So it can be the case that a normal policy will beat a masked policy, which will beat an adversarial policy, which will beat a normal policy.

Yep.

So it seems like if the space of policies were extremely transitive, then you wouldn't have to explore much in policy or activation space in order to do well — you'd just train against the best thing — but if it's very non-transitive, then if you're not used to a certain type of opponent, you can get really messed up, and it seems like that might be key to your results. So I was wondering, firstly, does that make sense as a theory, and secondly, is there some measure of how non-transitive the space of, say, Sumo or Kick and Defend policies is?

Sure — that's a very thought-provoking question. I would say that the intuition that something being non-transitive makes it harder to learn a good policy, especially via self-play, is absolutely correct. In fact, proofs of convergence for self-play normally do rely on transitivity, or a slightly weaker assumption than transitivity, because the intuition is that
you know you're you're beating a particular opponent which is often just a copy of yourself and you want this to also mean that you're stronger against previous versions of that opponent um and if if you don't do this you can just kind of end up in this cycle where you you beat the previous opponent but you get weaker at um you know opponent from 100 time steps ago um and he never actually converts to something uh so this actually you know I think a pretty interesting result there uh just if I'd been asked to guess before the start of this project uh is a penalty shootout in soccer transitive uh you know I I'm not sure I'd have been 100% but I've said yeah you know probably mostly it's a transitive game I guess there's like a few strategies that are non-transitive like D train uh you know whether you kick left or right that's that's kind of like non-transitivity um but like you know I I don't expect I'm going to be able to win a penalty shootout compared to a professional footballer so there clearly like some sense in which certain certain policies dominate others but these results we got there were at least for you know current state of art dprl they're just like you know very very highly non-transitive where um completely ridiculous policies can win against seemingly very proficient ones uh now in terms of like how you measure how non-transitive a space is I think that's a that's a really interesting question I don't feel don't have a good answer to um it seems like it's not just dependent on the task but also the class of policies that you're considering um yeah is you can kind of consider this extreme case where you have a policy which is just for optimal policy um and then you add some you add some classifiers of as optimal policy for TR to figure out is it playing against a particular opponent and if it does it just resigns or it just does something stupid uh and now you've effectively introduced this uh introduced a non-transitivity artificially or if it's like very weak a policy that would normally lose against optimal policy is now able to beat this you know very nearly optimal policy uh so you can always kind of play with tricks like that and and obviously that's not what we're doing when we're training via selfplay but I think there's maybe a bit of a similar effect where well there's just like this very small region of policy space which you might fail to be robust to unless you're systematically exploring the policy space at training hm yeah it's an interesting question like because it seems like like if you think that this is the crucial thing then you might say oh well before I deploy my deepl trained model in an environment you know maybe I want to like if there's some way to figure out how um how vulnerable it were to these adversarial attacks without actually like training adversarial policy you know that might be desirable um but I guess it's difficult especially if you don't know like what model class your opponents might be using right yeah that seems right uh I guess you can get some some some way there by if you're doing something like population Based training or just selfplay a bunch of different seeds you can at least check to see is there nontransitivity between the policies that You' have trained so far um and if there is and that that that should make you concerned uh but obviously that that's no guarantee um basically only exploring as small part as possible policies uh and in general I think it's going to be very hard to get you know full confidence that that 
you're AUST of these kinds of attacks because you're never sure that you've tested every possible attack verification of deep networks is you know active area of research um but still very challenging especially if you're trying to verify what happens in this you know unknown non-differentiable environments yeah definitely okay so another thought I had when reading your paper was the way you train these adversarial policies is essentially like the way like if you think about self play you just like fix an agent then train you know something else to beat it for some number of time steps yep um and then you know sort of do that ladder and so I'm wondering one inference that I might make from your work is oh it must be the case that like when I do the self-play training a bunch of the selfplay steps are just like uh an agent like finding an adversarial like a silly sort of trick adversarial policy against like the it's like fixed copy and then like learning to you know be robust to that do you think that might be one do you think I should make that inference and like to what extent do you think this work in general just reveals like what's happening within the Dynamics of selfplay yeah so I I we don't know for as results but my my expectation would be that yes you probably do see um that as result that many gradient steps being taken by a self play are really just kind of exploiting some silly bug in your opponent uh and so it's this kind of like very noisy gradient update where some of them are actually moving robustly in the direction of a good policy and others are just sort of overfitting to a particular opponent you have right now uh now I do want to make clear one thing that Although our training procedure is very similar to selfplay um we we are training against a fix a a victim for a large number of time steps uh and so you can view this is in some ways getting closer to the original motivation of selfplay which was fictitious play where you're supposed to be doing iterated best response because if you train for 10 million or 20 million time steps you're getting something that is reasonably close to best response um obviously RL is not not a perfect Optimizer but you're doing as well as you can with the techniques you have whereas if you're training for something like 100,000 time steps um before then updating both your yourself and your opponent um then you're never really doing best response you're just taking these like small steps towards a better response uh and so I think part of a reason why these attacks are possible is because this kind of traditional self-play can converge to these local equilibria um and what we actually found was if we just keep on fine-tuning a normal opponent so if we take one of our opponents the victim was trained with by a soft play and you know apply our attack method but starting from this normal opponent you might expect this is going to do because you're already starting from an opponent that wins against the victim a lot of the time but in fact it just it doesn't improve any further so it has kind of converge to some equilibria um but it's just a it's a local equilibria uh so it's not stable yeah I guess another crucial difference is um in selfplay uh you play you s you sort of like Fork your opponent um you start off identical to your opponent and then you start tra taking gradient steps whereas in your work you have a randomized you know randomly initialized agents and then take training steps and I suppose that probably makes it easier to like 
you know shed your preconceptions you know like uh if you're if you're doing selfplay and you have some blind spot then like when you're cloned you still have the blind spot and it might be harder to discover it yeah I think I think that's right uh some of the environments we're attacking were asymmetric so they they weren't actually training um against them themselves um but they were only training against one other agent so you can certainly imagine that both agents might have this kind of shared blind spot and neither of them is incentivized to to fix it because it's not being exploited by other agent um whereas something that's more of a population Based training approach where we play against randomly selected agent at each episode that that's much more likely to explore this kind of policy space and so would probably be a bit more robust um yeah I do think like you know starting from randomization it certainly don't have any any preconceptions other than bit of inductive bias from the training procedure uh but I I was surprised that our attack method worked I I wasn't even intending to do this it was actually a collaborator Michael Dennis who was like stubborn enough to keep on trying to to do uh deep RL from a spar reward and I was like no there's no way this is going to work because you don't have any kind of curriculum you're just trying to beat this um already very good uh victim um but it turns out but yeah just if you run it for a reasonable number of time steps DL is a good enough optimized that it can discover this but my suspicion is that there are going to be other kinds of adverse eror policies that we're not discovering with this training procedure um because most steps you take from a random initialization are not going to be able to beat a reasonably capable victim yeah so um speaking of these the nature of these adversarial policies so one thing that you've noted is that the way so in uh kick and defend the way the adversarial policies do not work is by blocking the ball from getting into the goal yep um do you have any understanding of like how they are actually working like why why these you know fool the network other than like oh it's just you know doing something unlikely yeah so I mean it's certainly more than just doing something unlikely because you know random policy doesn't have this effect and um in for victims were quite robust as some other perations of openingi team that originally trained us applied a random Force Vector to one of their victims it's just like you know the hand of God is suddenly trying to push you over um and and if the victims were quite robust to to that so they are robust to a lot of um things you might throw at them uh my best guess is that when they were training uh via soft play they would learn any future which is useful for for beting their opponent but some of these aren't going to be very robust so if the position of the kicker you know limb predicts which direction it's going to kick the ball in then maybe it says okay well at this um I guess that's the wrong way around because it's an adversarial goalie but you know if if a direction where goalie is is facing predicts which way it's going to fall to try and catch the ball uh then the kicker might learn oh well you know I should s a step in this direction to to kick the ball away from where it's going to block it and then if the goalie just really maxes out this feature by putting its limbs in a weird position um then what would normally have been an Adaptive response could um 
be sufficiently large but it it destabilizes the control uh mechanism in the kick it so this would be my best guess but I don't think we have a great understanding of what's actually going on here yeah yeah I guess this is yet another situation in which I wish we had much better machine learning interpretability you know yeah I mean that that would be really nice especially if you could apply the interpretability technique and and figure out these kinds of problems before deployment um because I know searching for adviser policies is a good way of testing um but as I said you can never be sure you you've found every possible adversary policy so it's very nice to have an interpretability technique as well that would give you some confidence yeah so um I guess I'd like to move on a bit to the I guess the reception and then the origin of the work so in terms of the reception my understanding is um this was published at I clear 2020 but um it was also am I right that it also appeared in a workshop in Europe's 2019 uh that's right it was a parl workshop okay so I was wondering um what you think of uh how people have reacted to this um and in general what do you think about the reception yeah so I think the works had a had a very positive reception to date u i mean people definitely love the love the videos which unfortunately I know we we can't include in our podcast uh I I think that uh the ml Community is pretty receptive to pointing out these kinds of of flaws in existing techniques and um especially in case of adversarial examples there's already a prettyy big research Community uh working on fixing women image classifiers so I think people are pretty excited about seeing similar kinds of threat models uh applied in other domains because it's really just you know opening up uh research problems that other people can can work on uh what what I I would say is that uh you know there's there's maybe a bit too much of a taste in research Community for flashy results so uh you know I I in this paper it worked out in my favor but I've had other papers which I think were you know from a technical perspective making an equally important contribution which had nowhere near as much attention because it's just hard to make this kind of really compelling demo video uh and tweet so in that sense I'd kind of hope that um thatl Community would uh I guess have slightly better uh Norms regarding what to promote but obviously it's a it's very hard because it's grown so quickly so uh maybe just you know five years ago we could have basically fitted most people who are interested in deep learning in one room and then you could just read every paper on deep learning and now we're in a situation where it's impossible to you know even catch up with all the papers at one conference so we've not yet come up with I think like really good curation mechanisms as a community uh but I think it's going to be very important to be able to incentivize the right kind of work yeah I do think it's just generally tricky to figure out like how to find like even just how to find relative relevant research um how to tell that it's like something you should pay attention to um seems like probably the most reliable method is like uh you know your local slack group you know somebody's like read a paper can say something interesting about it but and hopefully this podcast can do something similar um uh speaking of the reception I was wondering uh do you think uh are there any misconceptions that people have about your paper which our 
listeners might have that you'd like to potentially just correct right now sure so I think I think one misconception people have is that you know because it has the word adversarial and the title this is all about security and uh you know I think this is security is a good way of doing kind of worst case testing having that security mindset but uh I I think that these kind of results are going to be relevant generally when you care about robustness of a policy um because although it's unlikely that you run into these kinds of situations um against kind of like benign um opponents there's always some chance it's going to happen by chance and uh it's also just indicating that we really don't understand what our policies are doing and even if a policy seems to be completely reasonable when you test in a bunch of different ways and know I I want to emphasize I like open AI team that that made the policies were attacking they did a really good job of evaluation you know they were definitely following the standards of the field um even van it can Harbor these kinds of like really surprising failure modes and so it's not enough as an engineer to just even kind of dream out what you think are going to be kind of the worst nightmares of your policy because actually it might fail in just this completely surprising way uh and so I think this St we really need some kind of like automated fuzz testing approach to complement um existing testing methods yeah and um and I guess my second question about the reception um do you know of any followup work um looking into this so you mentioned that um you're doing something with uh uh rock paper scissors it was yeah yes I those are just like early early proof of concept um experiments so uh we have yeah we're working right now on uh improved defense mechanisms against this because this paper is mostly about attack so natural followup is say well okay how do we defend against this this attack or generally how do we make policies more robust uh and so that that's focusing on as I was saying this kind of R rapid adaptation approach uh so we don't have anything that's kind of um ready to be shared yet but we're hoping in in a few months to have a have a preprint on this work and I don't not aware of uh any other published work yet I mean this paper was only presented at aair a couple of months ago but um a workshop yet um but I've definitely had a lot of interest uh you know like emails people asking questions of a GitHub so my sense is are other teams um working on this this general idea so I'm hoping that we see some results soon okay yeah I'll be looking forward to it um so I guess flipping it around and looking at the origin um you weren't my understanding is that your previous work has not been an adversarial example that's right um what caused you to decide to work on this problem um as oppos to anything else you could have done sure yeah so I think that a big part of a motivation was um having I a bit of a frustration with uh standard testing methodologies in in the field um because it's you know if you compare to something like control theory where most of our papers are really coming with like you know robust guarantees that that A System's going to work in i setting RL has kind of gone complete opposite extreme and it says well we're just going to train things and then we're going to see do do they seem like they do the right behavior and uh you know notably RL has been making a lot of progress um in areas that are historically been difficult for control 
so uh kind of freeing yourself from those theoretical constraints can be very useful but then when we we start seeing applications on the horizon in the real world where we need to start I think getting back some of those guarantees if not from Theory at least from kind of rigorous engineering um methodologies and testing processes uh so I was like a big motivation and then I was just trying to think well okay but how do you improve this and uh if you want to do some kind of worst case testing and taking as I said it's kind like adversarial security approach I think is a pretty uh fruitful framing for a problem um and yeah I've been interested in in the prior work on uh examples in RL but it always just felt to me like the the Fret model wasn't quite right uh I think this is like an issue I think a lot of adversarial examples work where it's very sort of theoretically interesting work and it's telling us something about how Neal networks work but most of the time an attacker or or nature as the case may be can do things that aren't just adding like a small amount of white noise to your observations and so over time I think we need to be shifting to threat models that are more indicative of realistic kinds of failures or um attack is that you might see in the real world and so I was hoping this was going to be one step in that directional but obviously there's um more of it could be done there yeah I guess there's also um like maybe part of the reason that this work uh didn't get done by somebody else earlier is if you think about most reinforcement learning benchmarks they're usually um you know if they're Atari or something they're they're typically a one player environment right so I think it takes some like deliberate thinking like like you have to choose your setting in order to come up with this kind of research idea I guess yeah that's true I mean I guess for me I found multi-agent work to be a very natural framing uh you know some of my prior work has been on U multi-agent or multitask reward learning for example and it really seems to me like that is going to be the future in which most of our systems are going to be deployed uh and now you don't necessarily have to do multi-agent RL for everything like if you're just trying to train a autonomous vehicle you know maybe it's enough to kind of ignore the other vehicles not explicitly model them as agents just model them as obstacles he might still a to do reasonably well in some cases but uh we're definitely going to be deploying systems in multi-agent environment so uh we should at least be investigating um that that threat model cool um and I guess a similar question um is so is it is is it safe to say that you're interested to some extent in like AI alignment or um like you know ensuring that like when we get AGI that it's going to be human compatible or something oh yeah that's definitely an important motivation for my some of my work yeah so so do I understand correctly that like your that your understanding of how that fits into that whole project is like look we're going to be training these like RL agis and we need some way of just like testing them and seeing you know how well they actually work before they get deployed otherwise you know clearly something bad is going to happen with them is that a fair summary uh yeah I think it's definitely like one of the kind of more direct stories you could tell for how this work could fit into a long-term bigger picture I think I'm also excited by more indirect cause of for for impact 
where uh you know it's going to be hard for for me as an individual researcher to solve all of the problems needed for advanced AI systems to be be safe and reliable um but I think the as a whole um you know it it wants to solve these problems as well but sometimes uh people can get a little bit stuck on trying to improve performance on existing benchmarks and so I think there's value in just kind of pointing out the existence of a problem uh you know in some way similar to the original adverse are like samples paper uh and then hopefully you know other people in the community will also uh lend some of a their brain cells to solving this problem yeah so I was wondering how to think about this because in I guess in the safety Community there are a couple of different kinds of work so there's one kind of work where people think oh how are we how how could we build aligned hii like what's our strategy for building this thing and then it's like okay this strategy involves like these 10 sub components let's get to work on the sub components or something um or like you know maybe you have such a strategy in mind and you realize oh like these things are going to be problems um like like the these issues might block us you know how how can I get rid of these blocks and instead it seems like um this work is more in the vein of like look at problems that already exist in the field of AI that might lead to things down the road and like try and bring attention to these problems I'm wonder if you have thoughts on like the relative um the relative uh costs and or drawbacks and benefits of these of these approaches and like I don't know how how you think uh those of us in this community should split our time between these ways of thinking yeah sure that's a great question um so I'd like to know you think of these V approaches are actually quite complimentary um so sort of you know we don't want to neglect either one too much I'd view things like adverse oral policies also interpretability as fing into this kind of trust approach where you know let's say we we've developed this this Advanced AI system or we're training something and we want to be able to inspect it and and check that it is doing what we what we want uh and so this is s something given any um particular training procedure you can apply and um increase for reliability of the the overall outcome because you just won't won't deploy things that that seem unsafe uh and this is also going to be important commercially because uh if you're in a safety critical environment it's probably going to be regulated and it's only be very bad PR if you if you deploy something and it doesn't doesn't behave correctly so there's going to be a lot of demand for these kinds of trust techniques um but then you know you you you can only kind of like rule out um certain certain bad systems with this kind of approach uh so if your training procedure is just going to inherently produce um really bad outcomes then having these kinds of test procedures um you know is only going to buy you time eventually need to have a better training procedure now I think that you know something like adversarial policies okay maybe it's going to SP work on adverse defenses we might get a bit of both um but in general I think this is um that they are quite complimentary approaches where you want to be investing which everyone seems Seems more neglected have a margin uh now I think that if we're talking about the specific approach of let's kind of figure out um what a aligned AGI looks 
like and let's try and you know just like build it or or break it into sub components uh I think this is a useful framing but I a bit Ro can often ends up resulting in very unted research because I at least believe that we're still quite a way away from having um AGI even though there's been a lot of advancements in AI in the in the recent past and so um it's going to be going to be very hard to to tell whether the research you're doing is actually going to be useful in that context uh it's hard to get a good feedback leap if you're designing something which uh you know is going to help with something that doesn't exist yet uh and so I think it's uh useful if you can find a near-term problem where which you think is also going to be relevant in the long term but where the contributions you're going to make are going to have a a fairly direct impact H and where you're able to going to get some feedback both from the broader research community and potentially like um you know industrial uses of your work now I wouldn't want all work to fit into this category but I do think it's a very getting that kind of feedback mechanism is a very important consideration that I often see people neglecting yeah although I wonder like if it's the case that um AGI is very far away in time um and we like don't know how it's going to be built then like you might worry that oh I'm finding out these problems with the existing systems but like feature systems are just going to be so different that they're not going to have those problems they're going to have different set of problems like the more the the more different you think feature systems might be the the more you might might worry about this right yep so I'm wondering how do you like assuage that concern yeah I think it's a legitimate concern um if you do think that system is going to be very far far away and look very different is also I I guess a reason to focus on more General theoretical work which you think is going to be relevant to a broad range of of AI systems I don't really feel like I've got a comparative advantage in that space But but if I did I would be I'd be reasonably excited about it uh I I think that I I I would view development of fields as often being quite half dependent so even if in 10 years time we're you know no longer using proximal policy optimization as our favorite erl algorithm I think the kind of people that are are graduating you know from PHD programs now are going to have been influenced by uh the kind of research happening today so I I hope that even if adversarial policies isn't applicable directly in the long term that this kind of awareness of surprising failure modes is something that's going to stick with researchers and influence research happening in the future uh so that's one area in which I I'm I can see impact even if things change a lot uh I also think we can get some kind of qualitative insights from from it um not just from like the general AI Community but people who are like of explicitly focusing their career on on robustness and safety um because I I I'd say a lot of problem isn't specific speciic to deep networks it's um more of a question of what happens when you have generally a powerful function approximator uh you're using selfplay you don't have adequate coverage and I don't really see these fundamental problems as disappearing unfortunately you know I'd love it if someone did come up with a training procedure that just eliminated this but unless you're willing to spend a lot of compete on it um 
I don't think you're going to fully eliminate this. The best you might hope for is having a policy that says, "well, this is really weird, I don't know what to do in this situation", but even that seems quite hard.

Yeah. I guess you could even generalize further and say: as long as the way we train AI systems is to have a system that somehow trains and gets used to some environment, then as long as that type of environment is high enough dimensional, the story is, well, there's always going to be some corner of the high-dimensional space that you're not good at. That seems like a fairly general argument.

Yeah, so people have written papers about connections between high-dimensional geometry and adversarial examples, and I think there's a little bit of controversy about how much this applies, because it depends a lot on the shape of the manifold of natural images, or whatever space you're working in, and obviously we can't easily characterize this. But it does seem likely that if you're in a sufficiently high-dimensional space, it's just going to be impossible to cover every area, and this seems like a fairly fundamental problem. Now, that said, there are a lot of people whose opinions I respect who think that adversarial examples will just disappear once we have, I don't know, human-level classification accuracy on natural images, and you can point to humans seeming not to suffer from adversarial examples very much.

Doesn't AI already have human-level classification accuracy on natural images?

It does on, like, ImageNet, but not if you just took a photo with your phone.

Okay.

Now maybe that's just a dataset issue, but it does seem like there are a lot of artifacts in existing datasets, and maybe part of the reason why we have adversarial examples is that the classifiers are able to get really good accuracy by picking up on these artifacts, and you can mess with them fairly easily. It's hard to say, because obviously you can't differentiate through a human mind, at least not yet. And humans have also been trained in this adversarial setting by evolution, right — prey learn to camouflage themselves — so in that sense maybe we have been adversarially trained, and so it's just not surprising that we're robust. But there's at least some reason to be optimistic.

Okay, well, that's about it in terms of questions that I have. Thanks for being on the podcast. If people are interested in your work and want to follow you or learn more, what would you suggest they do?

Sure. So you can follow me on Twitter — my handle is ARGleave — and I also have a website at gleave.me, so that's just my surname dot me, where I post all my papers. And if any of the listeners have questions about my work, I'm also happy for people to email me. I can't always promise a detailed response, but I always love to hear from people interested in this work.

All right, and your email address can be found on your website.

That's right.

Well, thanks for coming on, and to the listeners, thanks for listening, and I hope you listen again to future episodes.

Related conversations

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -5 · 133 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -7 · 120 segs

AXRP

11 Apr 2024

AI Control with Buck Shlegeris and Ryan Greenblatt

This conversation examines technical alignment through AI Control with Buck Shlegeris and Ryan Greenblatt, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med -6 · avg -9 · 174 segs

Future of Life Institute Podcast

7 Jan 2026

How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann)

This conversation examines core safety through How to Avoid Two AI Catastrophes: Domination and Chaos (with Nora Ammann), surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum + transcript · tap

Slice bands

Spectrum trail (transcript)

Med 0 · avg -3 · 85 segs

Counterbalance on this topic

Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.