Library / In focus

Back to Library
AXRP · Technical alignment and control

Interpretability for Engineers with Stephen Casper

Why this matters

Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.

Summary

This conversation examines technical alignment through Interpretability for Engineers with Stephen Casper, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Perspective map

Mixed · Technical · Medium confidence · Transcript-informed

The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.

An explanation of the Perspective Map framework can be found here.

Episode arc by segment

Early → late · height = spectrum position · colour = band

Risk-forward · Mixed · Opportunity-forward

Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).

Start → End

Across 108 full-transcript segments: median 0 · mean -4 · spread -29 to 0 (p10–p90: -13 to 0) · 7% risk-forward, 93% mixed, 0% opportunity-forward slices.

Slice bands
108 slices · p10–p90: -13 to 0

Mixed leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.

  • Emphasizes alignment
  • Emphasizes control
  • Full transcript scored in 108 sequential slices (median slice 0).

Editor note

A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.

ai-safety · axrp · technical-alignment · technical

Play on sAIfe Hands

Episode transcript

YouTube captions (auto or uploaded) · video lmqfyYn_WJw · stored Apr 2, 2026 · 3,263 caption segments

Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.

No editorial assessment file yet. Add content/resources/transcript-assessments/interpretability-for-engineers-with-stephen-casper.json when you have a listen-based summary.

Show full transcript
Hello, everybody. In this episode I'll be speaking with Stephen Casper. Stephen was previously an intern working with me at UC Berkeley, but is now a PhD student at MIT working with Dylan Hadfield-Menell on adversaries and interpretability in machine learning. We'll be talking about his Engineer's Interpretability Sequence of blog posts, as well as his paper on benchmarking whether interpretability tools can find trojan horses inside neural networks. For links to what we're discussing, you can check the description of this episode, and you can review the transcript at axrp.net. All right, welcome to the podcast.

Thanks, good to be here.

So, from your published work it seems like you're really interested in neural network interpretability. Why is that?

One part of the answer is kind of boring and unremarkable: lots of people are interested in interpretability, and I have some past experience doing this, so it's become very natural and easy for me to continue working on what I've gathered interest in and have experience with. You know that from when we worked together. But aside from what I've just come to be interested in, interpretability is interesting to so many people for a reason: it's part of most general agendas for making very safe AI, for all the reasons people typically talk about. So I feel good to be working on something that's generally pretty well recognized as being important.

Can you give us a sense of why people think it would be important for making safe AI? And especially, to the degree that you agree with those claims, I'm interested in hearing why.

Yeah, I think there are a few levels at which interpretability can be useful, and some of these don't even involve typical AI safety motivations. For example, you could use interpretability tools to determine legal accountability, and that's great, but it's probably not going to be the kind of thing that saves us all someday. From an AI safety perspective, I think interpretability is just good in general for finding bugs and guiding the fixing of those bugs. There are two sides of the coin, diagnostics and debugging, and I think interpretability has a very broad appeal for this type of use. Usually when neural systems are evaluated in machine learning, it's using some type of test set and maybe some other easy evals on top of that. This is very standard, but just because a network is able to pass a test set or do well in some eval environment doesn't really mean it's doing great; sometimes this can actually reinforce lots of the biases or problems we don't want systems to have, things like dataset biases. So at its most basic, interpretability tools give us an additional way to go in, look at systems, evaluate them, and look for signs that they are or aren't doing what we want. Interpretability tools are not unique in this respect; any other approach to working with, evaluating, or editing models is closely related. But one very nice, at least theoretically useful thing about interpretability tools is that they could be used for finding and characterizing potentially dangerous behaviors from models on very anomalous inputs. Think trojans; think deceptive alignment. There might be cases in which some sort of system is misaligned, but it's almost impossible to find that through normal means, through treating the model as some type of black box, and interpretability is one of a small, unique set of approaches that could be used to really characterize those particularly insidious problems.
So it sounds like your take on interpretability is that it's about finding and fixing bugs in models. Is that basically right?

I think so, and lots of other people will have contrasting motivations. Many people, more than I do, will emphasize the usefulness of interpretability for making basic discoveries about networks, understanding them at a more fundamental level. I'll never argue that this isn't useful; I'll just say I don't emphasize it as much. But of course engineers in the real world benefit from theoretical or exploratory work all the time as well, even if indirectly.

I'm wondering why you don't emphasize it as much. Somebody might think: okay, we're dealing with AI, we have these neural networks, we're maybe going to rely on them to do really important stuff. Just developing a science of what's actually going on in them seems like it could be pretty useful, and like the kind of thing that interpretability could be good for.

Yeah, I have three mini-answers to this question. One is that if we're on short timelines, if highly impactful AI systems might come very soon and we might want interpretability tools to be able to evaluate and understand as much as we can about them, then we want to have a lot of people working on engineering applications. The second mini-answer involves pulling in the right direction. It's not that we should have all engineering-relevant interpretability research, and it's not that we should have all basic-science interpretability research; we probably want some sort of mix, some compromise between these things. That seems very uncontroversial, but right now I think the lion's share of interpretability research in the AI safety space is focused on basic understanding as opposed to engineering applications, so I think it's useful to pull closer toward the middle. The third reason to emphasize engineering applications is to get good progress signals about whether or not the field is moving somewhere. If a lot of time is spent speculating or pontificating or basically exploring what neural networks are doing, this can be very valuable, but only very indirectly, and it's not clear until you apply that knowledge whether or not it was very useful. Using things like benchmarking and real-world applications, it's much easier to get signals about whether progress is being made, even if they're somewhat muddled or not perfectly clear, than it is if you're just exploring.

Okay, before I really delve into some things you've written, one question I have is this: if I want to be fixing things with models, or noticing problems, one version of that looks like "I have a model that I've trained and now I'm going to do interpretability to it". But you could also imagine an approach that looks more like "we're going to really understand the theory of deep learning, and really understand that on these datasets, this learning method is going to do this kind of thing". That ends up looking less like interpreting a thing you have and more like understanding what kinds of things are going to happen in deep learning in general. What do you think about that alternative to interpretability work?
Yeah, so it definitely seems like this could be the case. We might be able to mine really good insights from the work that we do, and then use those insights to guide a much richer understanding of AI or deep learning that we can then use very usefully for something like AI safety or alignment applications. I have no argument in theory for why we should never expect this, but I think empirically there are some reasons to be a little doubtful that we can basic-science our way into understanding things in a very useful and very rigorous way. In general, I think the deep learning field has shown itself to be one that's guided by empirical progress much more than theoretical progress, and more specifically the same has happened with the interpretability field. One could argue that interpretability for AI safety has been quite popular since maybe 2017, maybe a bit before, and people were saying very similar things back then as they're saying now. These notions that we can make a lot of progress with the basic science were just as valid then as they are now, but it's notable that we haven't seen, I don't think, any particularly remarkable forms of progress on this front since then. I get that that's a very general claim to make, so maybe we can put a pin in that and talk more about it later.

So with this in mind, with this idea of the point of interpretability being diagnosing and fixing problems, what do you think of the state of the field of interpretability research?

Yeah, I think it's not actively and reliably producing tools for this right now. There are some, and there are some good proofs of concept and examples of times when you can use interpretability tools very competitively to diagnose and potentially debug problems, but this type of work seems to be the exception a bit more than the rule. I think that's okay and to be expected in a certain sense, because the field of interpretability is still growing, it's still new; certainly recently, and maybe even still now, there's a large extent to which it's just pre-paradigmatic. We don't fully understand exactly what we're looking for. But I think it's probably largely the case now, and is going to become more and more the case in the future, that in some sense it's time to have some sort of paradigm shift toward engineering applications, or to substantially increase the amount of work inside the field that's very focused on this, because I think it's possible, and of course from an alignment perspective it's probably needed if we're ever going to be able to use these tools to actually align something.

Why do you think it's possible?

So right now I think that lots of the progress that's being made, from an engineer's perspective, related to interpretability is coming from certain sets of tools. It's coming from the ability to use generative models to find and characterize bugs; it also comes from the ability to produce interesting classes of adversarial examples, which is very related; and it also comes from the ability to automate lots of processes, which generative models and coding models, sometimes things like chatbots, are now able to do in a more automated way. The tools for these things are substantially better than they were a few years ago, as is the case with most machine learning goals, and I think now is a point in time at which it's becoming much clearer to many more people that the ability to leverage some of these is pretty valuable when it comes to interpretability and other methods for evals.
Do you have any specific approaches in mind?

Sure, I'd say take one; consider this just an example. A few years ago we had adversarial patch work, where people were attacking vision models with small adversarial patches, just a localized region of an image. The adversary was able to control that patch and no other part of the image, so that's the sense in which the adversary's ability to influence the system was limited. Adversarial patches back circa 2017 looked like you would probably expect: strange things with some sort of structure and pattern to them, but still lots of high-frequency patterns, still things that by default were very difficult to interpret. A few years later, a handful of works found that you could use generators like GANs to attack the same types of systems with adversarial patches, which tended to produce more coherent features. And then a few years after that, right up to the present day, the state of the art for producing adversarial features is all done via diffusion models, which are able to produce features that are much more convincingly incorporated into images, features that look quite a bit better and are much easier to interpret, because diffusion models are really good at flexible image editing like this. I think this is one example of a progression from more crude or basic tools to better tools that can be used for human-understandable interpretations, and it was all facilitated by advances in adversaries research and generative modeling. I think analogous things are happening with other types of interpretability tools too.
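To make the flavor of this concrete, here is a minimal sketch, not from the episode, of the classic optimized-patch setup being described: the adversary controls only a small region of the image and optimizes it to push a classifier toward a chosen target class. The model choice, patch placement, target class, and hyperparameters are illustrative assumptions, and input normalization is omitted for brevity.

```python
# Minimal sketch of an optimized adversarial patch (illustrative, not from the episode).
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)

patch_size, target_class = 50, 954      # class 954 = "banana" in ImageNet, chosen arbitrarily
patch = torch.rand(1, 3, patch_size, patch_size, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.05)

def apply_patch(images, top=20, left=20):
    """Paste the (clamped) patch onto a fixed region of each image."""
    patched = images.clone()
    patched[:, :, top:top + patch_size, left:left + patch_size] = patch.clamp(0, 1)
    return patched

def attack_step(images):
    """One optimization step pushing predictions on patched images toward the target."""
    logits = model(apply_patch(images))
    targets = torch.full((images.shape[0],), target_class)
    loss = torch.nn.functional.cross_entropy(logits, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# for images, _ in dataloader:          # any batch of 224x224 images scaled to [0, 1]
#     attack_step(images)
```

In the progression Casper describes, swapping this raw pixel parameterization for a GAN or diffusion-model generator changes only how the patch is produced, not the overall optimization loop.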
So the example you gave was in the field of coming up with adversaries: basically, as I understand it, things that can trick image classifiers or other kinds of neural network models. What do you see as the relationship between those and interpretability in general?

Yeah, I think this is one of the takes that I'm most excited about, and I will say quite plainly and confidently that the study of interpretability and the study of adversaries are inextricably connected when it comes to deep learning and AI safety research. This is one of my favorite topics because I work on both of these things; I usually describe myself as someone who works on interpretability, adversaries, and the space between them, and the space between them I think is great and very neglected. There's still a lot of low-hanging fruit at the intersection. The argument is that there are four particularly important connections between interpretability and adversaries. One is that more robust networks are more interpretable, and vice versa. The second is that interpretability tools can help you design adversarial examples, and doing so is a really good thing to do with interpretability tools. The third is that adversaries are themselves interpretability tools a lot of the time, if you use them right. And the fourth is that mechanistic interpretability and latent adversarial training are the two types of tools that are uniquely equipped to handle things like deceptive alignment.

I guess in my head there's this strong connection, which is just that if I want to be an adversary, if I want to really mess with you somehow, the best way I can do that is to understand how your brain works, how you're working, so that I can exploit that. So there's one direction where coming up with adversarial examples tells you something about the system, but in the other direction, it seems like in order for an adversary to be good enough, it has to understand things about the target network. I'm wondering what you think about that perspective.

Yeah, I think that's the right way to think of it. These two things are very much both sides of the same coin, very much each other's dual. On the notion of using interpretability tools to design adversaries: the case is that you know you've understood a network very well if you're able to understand it enough to exploit it, and this is an example of doing something engineering-relevant, of a type that's potentially interesting to an engineer, using an interpretability tool. Then on the other hand, where adversaries are interpretability tools: if you construct an adversary, there's a certain sense in which you might argue you've already done some sort of interpretation. Saying that this thing, or this class of examples, fools the network - being able to say that is not unlike an interpretation. It might not be particularly rich or mechanistic, but it's something meaningful you can say about a model.

It kind of reminds me of my colleagues now at the Foundation for Alignment Research - it's called FAR, I forget exactly what the letters stand for - who basically trained this model to beat the best models that play Go. The adversaries they train aren't in general very good: if I taught you Go, after a day or two you could beat these adversaries. But to me a really cool aspect of their work is that you could look at what the adversary was doing, and if you're a decent player you could copy that strategy, which in some sense is a pretty good sign that you've understood something about the victim model, and that you understood how the adversarial attacker works.

Yeah, I think I understand things roughly the same way, and I'm really excited about this work for that reason. I'm also very excited about it because it suggests that even systems that seem quite superhuman might still have some silly vulnerabilities that adversarial examples or interpretability tools might be able to help us discover.

So one question this brings me to: if I think about adversaries and interpretability being super-linked, what does that suggest in the interpretability space? Are there any things being done with adversaries that suggest some sort of cool interpretability method that hasn't yet been conceived of as interpretability?
I think there are some examples of things that are maybe old and well known now, but that aren't usually described in the same settings or talked about among the same people who talk about interpretability tools. For example, understanding that high-frequency, non-robust features are things that are still predictive and used by models, and in large part seem to be responsible for adversarial vulnerability. This is a really important connection to be aware of, because high-frequency, non-robust, non-interpretable features are kind of the enemy of interpretability.

What do you mean when you say that they're predictive? What's true about them?

My understanding here largely stems from a paper, I think from 2019, called "Adversarial Examples Are Not Bugs, They Are Features", which studied this in a pretty clever way. Your typical Lp-norm adversarial perturbation is just a very subtle addition or perturbation that you can make to an image, and if you exaggerate it so it's visible, it looks like this confetti-like, noisy, perhaps mildly textured set of patterns. It's not something you might predict or really expect as a human, but when you apply it to the image, it can reliably cause a model to be fooled. What this paper asked is: are these features meaningful, are they predictive, are they something the models are using, or are they just random junk? They added to the evidence that these features are useful by conducting experiments where you take images, give them targeted adversarial perturbations, and then label those images consistently with the target instead of the source. To a human, all these images look mislabeled - or (n-1)/n of them, that proportion look mislabeled - but you can still train a network on this and have it meaningfully generalize to held-out, unperturbed data. That's really impressive, right? It suggests that networks may be learning and picking up on features that humans are not naturally disposed to understand very well, but that networks can. And this seems to be an important thing to keep in mind when we're trying to do interpretability from a human-centric standpoint: there might be trade-offs that are fundamental. If you want a human-focused approach to AI interpretability, humans just might not be able to pick up on everything useful that models are able to pick up on.
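To make that experimental design easier to follow, here is a compressed sketch in the spirit of that 2019 paper, not a reproduction of its exact setup: targeted adversarial examples are relabeled with their target class, a fresh model is trained only on this apparently mislabeled data, and it is then evaluated on clean test data. The attack parameters and data handling are illustrative assumptions.

```python
# Sketch of the relabeling experiment described above (in the spirit of Ilyas et al. 2019).
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset

def targeted_pgd(model, x, target, eps=0.03, step=0.007, iters=20):
    """Perturb x within an L-infinity ball so the model predicts `target`."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv - step * grad.sign()).detach()       # move toward the target class
        x_adv = x + (x_adv - x).clamp(-eps, eps)             # project back into the ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv

def build_nonrobust_dataset(source_model, loader, num_classes):
    """Relabel each adversarial image with its target class, not its true class."""
    xs, ys = [], []
    for x, y in loader:
        target = torch.randint(num_classes, y.shape)          # looks mislabeled to a human
        xs.append(targeted_pgd(source_model, x, target))
        ys.append(target)
    return TensorDataset(torch.cat(xs), torch.cat(ys))

# A fresh model trained on build_nonrobust_dataset(...) and then evaluated on the
# clean test set; nontrivial clean accuracy is the surprising result.
```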
Okay, so that was example one of a link between adversaries and interpretability. I think you were about to give example two when I interrupted you.

Yeah, another example is the trojan literature: data poisoning attacks that are meant to implant specific weaknesses into models so that they have those weaknesses in deployment. This is often studied from a security standpoint, but it's also very interesting from an interpretability standpoint, because the discovery of trojans is an interpretability problem and the removal of trojans is a robustness problem. So there are very close relationships between this type of problem and the types of tools the interpretability literature is hopefully able to produce. There's another connection too, because trojans are quite a bit like deceptive alignment: deceptively aligned models are going to have these triggers for bad behavior, but these are, by definition or by assumption, things that you're not going to find during normal training or evals. So the ability to characterize what models are going to do in a robust way on unseen, anomalous data is one way of describing the problem of detecting trojans, and one way of describing the problem of solving deceptive alignment.
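For readers unfamiliar with the trojan setup just mentioned, here is a minimal, illustrative sketch of the kind of data-poisoning attack the literature studies; the poison rate, trigger, and target class are placeholder choices, not details from the episode.

```python
# Toy data-poisoning trojan: a small fraction of training images get a fixed
# trigger patch and are relabeled with an attacker-chosen class, so the trained
# model behaves normally until the trigger appears at test time.
import torch

def poison_batch(x, y, target_class=0, poison_frac=0.05, trigger_value=1.0):
    """Stamp a small trigger into a random subset of a batch and relabel it."""
    x, y = x.clone(), y.clone()
    n_poison = max(1, int(poison_frac * x.shape[0]))
    idx = torch.randperm(x.shape[0])[:n_poison]
    x[idx, :, -4:, -4:] = trigger_value       # 4x4 bright square in one corner
    y[idx] = target_class                      # attacker-chosen label
    return x, y

# During training, each batch passes through poison_batch before the usual
# forward/backward step. At test time the model looks fine on clean inputs, but
# adding the same corner patch flips predictions to target_class. Finding that
# this trigger exists is the interpretability problem; making the model ignore
# it is the robustness problem.
```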
So I actually have some follow-up questions about both of the things you said.

Sure.

We're sort of skirting around things that you mentioned in this sequence, the Engineer's Interpretability Sequence. One claim I think you make, with regard to the first thing you mentioned, the existence of these useful features that aren't robust and seem like adversarial noise, is that this weighs against the use of human intuition in interpretability. I'm wondering how strongly it weighs against it. One analogy I could imagine making: sometimes in math there will be some pattern that appears kind of random to you, or you don't really understand why it's happening, and then there's some theorem with an understandable proof that explains the pattern. You wouldn't have understood the pattern without this theorem, but there's some mathematical argument such that once you see it, things totally make sense. You could imagine something similar in the case of these non-robust features, where the network has some really unintuitive-to-humans behavior, but there's a way of explaining this behavior using intuitive facts that eventually makes it intuitive to humans. I'm wondering what your reaction to that kind of proposal is.

Yeah, I think this makes sense. Earlier, when I say "a human-centric approach to interpretability", the kind of thing that's in my head is the idea of humans being able to look at and study something and easily describe in words what they're looking at or seeing or studying. That's not the case with typical adversarial perturbations, at least in images. But you bring up this notion: is it possible that we could relax that a little bit and do something else? I think this makes sense. You'd probably just have to have some sort of change in the primitives with which you describe what's going on, and you could probably describe things in terms of specific adversarial examples or perturbations or modes or something like this, even if, by themselves, when you look at them, they just look like glitter in an image, like nothing you could easily describe. I think this is very potentially useful. It's not the type of thing I meant when I talked about a human-centric approach to interpretability, but it sounds like, unless we want to accept trade-offs with model performance or something like that, it would do us well to go in and try to understand models more flexibly than in terms of just what a human can describe. But if we are to do this, it's probably going to involve a lot of automation, I assume.

Yeah. How do you see the prospects of using automation in interpretability research?

I think it's probably going to be very important and central to highly relevant forms of interpretability. It's possible that this claim could age poorly, but I do think it'll age well, and people can hold me accountable to this at any point in the future. So, lots of very rigorous, specifically rigorous mechanistic, interpretability research has been done at relatively small scales with humans in the loop, and we've learned some pretty interesting things about neural networks in the process, it seems. But there's a gap between this and what we would really need to fix AI and save anyone in the real world. Studying things in very small transformers, or very limited circuits in CNNs - these types of things are pretty small in scale and toy in scope. So if we are to take this approach of rigorously understanding networks from the bottom up, I think we're probably going to need to apply a lot of our automation tools. There are a few topics here to talk about: one is what's already been done, and there are some topics involving how this fits into agendas related to mechanistic interpretability and causal scrubbing, which is a whole other thing we can get into. This definitely has a few rabbit holes we can go down.

Yeah, I guess first of all, let's talk about mechanistic interpretability a little. What do you understand the term to mean, for those who haven't heard it?

That's a pretty good question. Mechanistic interpretability, and also "circuits", or "interpretability" itself - some of these are just vocab terms that people use to mean whatever they want, and I don't say that in a pejorative way; I do this too. But I guess this lends itself to a general definition of mechanistic interpretability, and I'd probably just describe it as anything that helps you explain model internals, or details about the algorithms that the model internals are implementing, something like this. The emphasis is that you're opening up the black box and trying to characterize the computations going on in passes through a network.

Okay. And you mentioned that you think at some point this will need to be automated or scaled up. Is that because you think it's a particularly important kind of interpretability that we need to do? What do you think about its role?

Yes, if you pose the question that way, then I think there are two very important points that I feel strongly about, but I feel strongly about them in ways that have a completely different ethos. On one hand, mechanistic interpretability is one of these tools, one of these methods or paradigms, that, if it works, can hopefully help us rigorously understand networks well enough to find, and empower us to fix, particularly insidious forms of misalignment, like deceptive alignment, or a paperclip maximizer that is actively trying to deceive you into thinking it's aligned even though it's not. There aren't that many tools, at the end of the day, that are going to be very useful for this, and mechanistic interpretability is one of them. So there's one sense in which I think we really, really need it.
There's another sense in which I think it's just really, really hard, and there's a big gap between where we are now and where we would want to be from an engineer's perspective. The reason it's really hard is that mechanistic interpretability is a problem with two different parts. You start with a system. Part one is coming up with mechanistic hypotheses to explain what this system is doing. This could be in terms of pseudocode; a mechanistic hypothesis could look like some sort of graph; and a hypothesis doesn't have to be one function or program, it could represent a class of functions or programs. But it needs to be some sort of representation of what's happening mechanistically inside the network. Step two is to take that mechanistic hypothesis and test to what extent it validly explains the computations being performed internally inside the network. So, step one: hypothesis generation; step two: hypothesis confirmation. I think step two is tractable, or at least it's the kind of thing we're able to make progress on. For example, the causal scrubbing agenda is something that's pretty popular relating to this and has had a lot of work done on it recently; it's a relatively tractable problem to come up with methods to confirm how computationally similar a hypothesis graph is to what a system is doing. Step one, though, seems quite difficult. It seems about as difficult as program synthesis, program induction, or programming language translation, and these are things that have been known to be quite hard for a long time. Lots of progress has been made in mechanistic interpretability by focusing on very simple problems where the hypotheses are easy, but unless we assume the systems we encounter in the future will have the things that are right or wrong about them explainable in terms of easy hypotheses, I don't think we're going to be able to get too much further, or scale too much higher, by relying on toy-problem, human-in-the-loop approaches to mechanistic interpretability.
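As a concrete illustration of the "hypothesis confirmation" half, here is a much simpler stand-in for causal scrubbing (which is more involved): intervene on the component a hypothesis says is doing the work, and check whether the behavior the hypothesis predicts actually degrades. The model, module name, and metric are illustrative assumptions.

```python
# Toy sketch of hypothesis confirmation by ablation/patching on a PyTorch model.
import torch

def run_with_ablation(model, inputs, module_name, mode="zero", cache=None):
    """Run the model while replacing one submodule's output (zeroed, or a cached patch)."""
    module = dict(model.named_modules())[module_name]

    def hook(_mod, _inp, out):
        return torch.zeros_like(out) if mode == "zero" else cache

    handle = module.register_forward_hook(hook)
    try:
        return model(inputs)
    finally:
        handle.remove()

def hypothesis_effect(model, inputs, labels, module_name):
    """Compare task loss with and without the hypothesized component intact."""
    loss_fn = torch.nn.functional.cross_entropy
    base = loss_fn(model(inputs), labels).item()
    ablated = loss_fn(run_with_ablation(model, inputs, module_name), labels).item()
    return ablated - base   # a large increase suggests the component matters here
```

Causal scrubbing, roughly speaking, generalizes this move by resampling activations according to the equivalence classes the hypothesis defines, rather than simply zeroing them.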
Yeah, I guess I have a few thoughts in response to that. The first is that when you say coming up with hypotheses seems about as hard as program synthesis or program translation, it's not clear to me why. I can see how it's closer to program translation: unlike synthesis, you have access to this neural network which is doing the thing, you have access to all the weights, so in some sense you know exactly how it works. And it seems to me that we have tools that can tell you things about your code, for instance type checkers; that's a tool that is, I guess, quasi-mechanistic, and it really does tell you something about your code. I was wondering if you could elaborate on how difficult you expect hypothesis generation to actually be.

Yeah, I think that's a good take, and it's probably worth being slightly more specific at this point. If you're forming mechanistic hypotheses from the task or the problem specification, then that's much like program synthesis. If you're forming them from input-output examples from the network, that's much like program induction. And then, like you said, if you're forming them from model internals, that's much like programming language translation, because you're trying to translate between different formalisms for computing things.

Right, and in this case you have all three sources of information.

Yeah, in this case you do, which is nice. I don't know of this being some sort of theoretical way around any proofs of hardness for any of these problems, but in practice it is nice; it's certainly a good thing to point out, and it's probably going to be useful. But there's this question of how we can make some sort of progress from this translation perspective, and if we wanted to do it particularly rigorously - if we shoot for the moon, we might land on the ground, because it might be very hard to just turn a network into a piece of code that describes it very well. But you mentioned the analogy to type checkers. Type checkers are nice because you can run them on things, and being able to determine something's type, or whether there's a likely syntax error, is not something that's made impossible by Rice's theorem or uncomputability-ish results. To the extent that we're able to do something analogous - find flags for interesting behavior, things to check out, parts of the architecture to scrutinize more, things we might be able to cut out, things that might be involved in the handling of anomalous inputs, anything like this - these sound very cool, and I think what you just described would probably be one of the best ways to try to move forward on a problem like this. It's not something I'll say I have a lot of faith in or not, just because I don't think we have a lot of examples of this type of thing, but I would certainly be interested to hear about more work on something like this: learning useful heuristics or rules associated with specific networks that flag interesting things about them.

Yeah, I like this idea a lot. And when you mention whether the thing you're trying to do is barred by Rice's theorem - so Rice's theorem says, and you can correct me if I'm wrong, that for any property of a program such that the property isn't about how you wrote the program but about its external behavior, and it's a non-trivial property - some programs have it and some don't - then you can't always determine whether any given program has this property. In theory, there are examples that you just can't work with.

Yes.
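For reference, the standard formal statement of the result being invoked (this formalization is mine, not from the episode):

```latex
% Rice's theorem, standard formulation; \varphi_e denotes the partial computable
% function computed by program e.
\textbf{Rice's theorem.} Let $\mathcal{P}$ be any set of partial computable
functions that is non-trivial, i.e.\
$\emptyset \neq \{\, e : \varphi_e \in \mathcal{P} \,\} \neq \mathbb{N}$.
Then the index set
\[
  I_{\mathcal{P}} \;=\; \{\, e \in \mathbb{N} \;:\; \varphi_e \in \mathcal{P} \,\}
\]
is undecidable. Only the two trivial semantic properties have decidable index sets,
matching the informal statement above.
```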
And I think that suggests that in some sense we should probably try to have neural networks that aren't just generic computer programs, where we do know that these kinds of things will work. Similarly with the analogy to program translation: it's probably better if you write your code nicely. And similarly, in a podcast that I've recorded but not yet released with Scott Aaronson, he mentions this result where, in the worst case, it's possible to take a two-layer neural network and plant a trojan, a backdoor, in it, such that the task of finding out that that happened is equivalent to some computationally difficult graph theory problem.

Right. I assume this involves a black-box assumption about the network, not that you have access to model internals?

No, even white-box - you have access to the weights.

Okay.

Yeah, if you think about it, having access to the weights is sort of like having access to some graph, and there are some computationally difficult problems on graphs. So if I put this all together, I might have some vision of: okay, we need to somehow ensure that models have a nice kind of structure so that we can mechanistically interpret them. And then I start thinking, well, maybe the reason you start with toy problems is that you get used to figuring out what kinds of structure actually help you understand things and explain various mechanisms. I know that was mostly my take, so what do you think about all that?

Sure. So there's this idea that I professed to be a fan of, this idea of doing something analogous to type checking, and you bring up this idea of making networks that are good for this, or amenable to this, in the first place. For a post-hoc version of this, a version where you're just looking at model weights in order to flag interesting parts of the architecture, I don't know of any examples off the top of my head that are particularly good. There's stuff like mechanistic anomaly detection that could maybe be used for it, but I don't know of a lot of work being done from this post-hoc perspective right now. Does anything come to mind for you? There's probably something out there, but my point is that I don't know of a lot of examples, and maybe it could be cool to think about in the future. To be honest, I know a little bit less about the interpretability literature than maybe I should. But then there's this non-post-hoc notion of doing something related pre-hoc, or intrinsically, where you want an architecture that has nice properties related to things you can verify about it, or modularity, or something like that. I think this work is very exciting, and obviously there's a lot of work on this in the literature at large; there are all sorts of things that are directly getting at simpler architectures, or architectures that are easier to study or more interpretable, or something of the sort. But one thing I think is a little interesting about the AI safety interpretability community is that there's a lot of emphasis on analyzing circuits, a lot of emphasis on this problem of mechanistic anomaly detection, and a bit less emphasis than I would normally expect on intrinsic approaches to making networks more interpretable. I think this is possibly a shame, or an opportunity that's being missed, because there are a lot of nice properties that intrinsic interpretability techniques can add to neural nets, and there are lots of different techniques that don't conflict with each other. I think it might be very interesting, sometime in the near future, to work on more intrinsically interpretable architectures as a stepping stone to doing better mechanistic interpretability in the future.
For example, how awesomely interpretable might some neural network be that is adversarially trained, and trained with elastic weight consolidation, and trained with bottlenecking or some other method to reduce polysemanticity, and maybe its architecture is sparse, and maybe there's some intrinsic modularity baked into the architecture, something like this? How much easier might it be to interpret a neural network that is optimized to be interpretable, as opposed to one that's just trained on some task using performance measures to evaluate it, and that you then use interpretability tools on after the fact? I think it's a shame that we have all this pressure for benchmarking and developing AI systems to be good at performance on some type of task, while not also having comparable feedback, benchmarking, and pressure in the research space for properties related to interpretability.

I think one reaction that people often have to this instinct is to say: look, the reason that deep neural networks are so performant, the reason they can do so much stuff, is kind of because they're these big, semi-unstructured blobs of matrices, such that the gradients can flow freely and the network can figure out its own structure. And I think there's some worry that most ways you're going to think of to impose some architecture are going to run contrary to Rich Sutton's bitter lesson, which is roughly: no, you just need methods that use computation to figure out what they should be doing, and you should only do things that scale nicely with computation. So how possible do you think it's going to be to reconcile performance with architectures that actually help interpretability in a real way?

Yeah, I expect this trade-off to show up somewhat, most of the time: when some type of intrinsic interpretability tool is applied, task performance goes down. If you adversarially train an ImageNet network, it's usually not going to do quite as well on clean data as a non-adversarially-trained network. And obviously we also know it's quite trivial to regularize a network to death; that's about as simple as setting some hyperparameter too high. So there's this question about whether there's good space to work in the middle, between maximally performant networks and over-regularized, impotent networks, and when it's framed that way, I think you can see the answer I'm getting at. It's probably something like: we've just got to find the sweet spot and see how much of one we're willing to trade off for the other. But we're probably also going to find a lot of things that are just better than other things. Take pruning - that's an intrinsic interpretability tool: if you have a network that's more sparse and has fewer weights, then you have less to scrutinize when you want to go and interpret it later, so it's easier. Maybe that just isn't as effective an interpretability tool, for the same cost in performance, as something else; maybe adversarial training is better for lots of classes of interpretability tools. Even if there is some sort of fundamental trade-off, maybe it's not too big, and maybe there are ways to minimize it by picking the right tools or combinations thereof. But I continue to be a little surprised at just how relatively little work there is on combining techniques and looking for synergies between them for results-oriented or engineering goals involving interpretability. So it could be the case that this isn't that useful for having competitive, performant networks, but I certainly still think it's worth trying some more - well, worth trying, period - and working on in earnest.
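As a rough illustration of what "combining techniques" could look like in practice - a sketch under my own assumptions, not a recipe from the episode - here is a training step that layers adversarial training on top of an activation-sparsity penalty intended to discourage dense, polysemantic representations. The attack, penalty weight, and hooked layer are placeholders.

```python
# Illustrative sketch: adversarial training combined with an L1 activation-sparsity penalty.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=4 / 255):
    """One-step adversarial example (FGSM) for adversarial training."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def train_step(model, opt, x, y, penalized_layer, l1_weight=1e-4):
    """Train on adversarial inputs while penalizing dense activations in one layer."""
    x_adv = fgsm(model, x, y)                       # robustness pressure
    acts = {}
    handle = penalized_layer.register_forward_hook(
        lambda _m, _i, out: acts.update(h=out)
    )
    try:
        logits = model(x_adv)
    finally:
        handle.remove()
    loss = F.cross_entropy(logits, y) + l1_weight * acts["h"].abs().mean()  # sparsity pressure
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The point of the sketch is only that these pressures compose in one loop; whether any particular combination buys interpretability at an acceptable performance cost is exactly the open empirical question discussed above.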
So you brought this up as a complaint you have about the AI safety interpretability community, which I take to mean this sort of community around, I don't know, Anthropic, Redwood Research, people who are worried about AI causing existential risk, and you mentioned this as a thing they could be doing better. I think many of my listeners are probably from this community. Do you have other things that you think it could improve on?

Yeah, I enumerated a few of these in the Engineer's Interpretability Sequence. In one sense, the AI safety interpretability community is young and it is small, so obviously it's not going to be able to do everything, and I think it's about equally obvious that so much of what it is doing is very cool. We're having this conversation, and so many other people are having so many other conversations about interesting topics, just because this community exists, so I want to be clear that I think it's great. But I think the AI safety interpretability community also has a few blind spots; maybe that's just inevitable given its size. The point we talked about, involving mechanistic interpretability having two parts with the first part being hard, is one of these. The relative lack of focus on intrinsic interpretability tools, like I mentioned, is another. And I also think the AI safety interpretability community is sometimes a little too eager to just start things up, and sometimes rename them, and sometimes rehash work on them, even though there are close connections to more mainstream AI literature. I know of a couple of examples of this, but a strong one involves the study of disentanglement and polysemanticity in neural networks. I don't want to over-emphasize this point in the podcast, but we could talk a bit about one case study involving a possible insularity, a possible isolation, of research topics inside the AI safety interpretability community.

Yeah, sure.

So we have this notion that's pretty popular inside the interpretability community of polysemanticity and superposition, and these are things that are bad, the enemies of useful, rigorous interpretability. It's pretty simple: the idea is that if a neuron responds to multiple distinct types of semantically different features, then it's polysemantic. If there's a neuron that fires for cats and for cars, we might call it polysemantic. Superposition is a little more of a general term that applies to a whole layer or something like that: a neuron exhibits superposition inasmuch as it is polysemantic, and a layer exhibits superposition inasmuch as it represents concepts as linear combinations of neurons that are not all orthogonal, so there's crosstalk between the activation vectors that correspond to distinct concepts.
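A tiny numerical illustration of that "non-orthogonal directions" picture (my own toy example, not from the episode): three concept directions packed into a two-neuron layer cannot all be orthogonal, so reading one feature off the activations picks up interference from the others.

```python
# Toy illustration of superposition: 3 "concept" directions in a 2-neuron layer.
import torch

# Three unit-norm feature directions in a 2-dimensional activation space.
directions = torch.tensor([[1.0, 0.0],
                           [-0.5, 0.866],
                           [-0.5, -0.866]])

features = torch.tensor([1.0, 0.0, 0.7])        # true feature intensities
activations = features @ directions              # what the 2 neurons actually carry

# Naive readout: project the activations back onto each concept direction.
readout = activations @ directions.T
print(readout)   # != features; the second feature (truly 0.0) reads as nonzero crosstalk
```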
These are useful terms, but they're also very similar to things that have been studied before. The polysemanticity and superposition crowd has pointed out the similarity with sparse coding, but much more recently there's been a lot of work in the mainstream AI literature on disentanglement, and this goes back significantly before the literature on polysemanticity and superposition. Disentanglement describes something very similar: it's when there's superposition, or when for some reason or other you don't have a bijective mapping between neurons and concepts. It's not that renaming something is intrinsically bad, but for community reasons there has been a bit of isolation on this topic between the AI safety interpretability community and other research communities, and that's been facilitated by having different vocabulary. At best this is a little confusing, and at worst it could lead to isolation among different researchers working on the same thing under different names. There's a case to be made that this is sometimes good: studying things using different formalisms and vocabularies can contribute to the overall richness of what's found. For example, studying Turing machines and studying the lambda calculus both got us to the same place, but arguably we've had richer insights as a result of studying both instead of just one. That could be the case here, but I think it's important to emphasize putting more effort into avoiding rehashing and renaming work.

In the case of polysemanticity and disentanglement, I think it's worth saying that one of the original papers on this topic does talk about the relationship to disentanglement. But do you see there as being insights in the disentanglement literature that are just being missed? Can you go into more detail about what problems you think this is causing?

Yeah, and it should be clear that there are citations - those pointers exist, although arguably they're not discussed in the optimal way, but that's less important. Here's an example. Think about the Distill and Anthropic communities, which are pretty prominent in the AI safety interpretability space, and the types of work they've done on this problem of superposition or entanglement: most of the work that's been done is to study it and characterize it, and that's great. But there's roughly one example I'm very familiar with of explicitly combating polysemanticity and superposition and entanglement, and that's from the paper called "Softmax Linear Units", which describes an activation function that's useful for reducing the amount of entanglement inside these layers. The reason that activation function works is that it causes neurons to compete to be activated; it's a mechanism for lateral inhibition. Lateral inhibition has been understood to be useful for reducing entanglement for a while now; there have been other works on lateral inhibition and different activation functions from the disentanglement literature, and there have also been quite a few non-lateral-inhibition ways of tackling the same problem from the disentanglement literature.
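For reference, the SoLU activation from that paper is simple to state; the sketch below shows the mechanism only, and the published version also pairs the activation with a LayerNorm, so treat this as an illustration rather than a reproduction of their implementation.

```python
# Sketch of the SoLU ("softmax linear units") idea: multiplying activations by
# their own softmax makes neurons in a layer compete, a form of lateral inhibition.
import torch

def solu(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """SoLU(x) = x * softmax(x): large activations suppress their neighbors."""
    return x * torch.softmax(x, dim=dim)

x = torch.tensor([[3.0, 1.0, 0.5, -1.0]])
print(solu(x))   # the largest activation dominates; the rest are damped toward zero
```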
The Softmax Linear Units work was very cool and very interesting, and I'm a smarter person because I've read it. But I'm also a smarter person because I've looked at some of these other works on similar goals, and I think things were a bit richer and a bit more fleshed out on the other side of the divide between the AI safety interpretability community and the more mainstream ML community. So the Softmax Linear Units paper was cool, but as we continue with work like this, I think it'll be really useful to take advantage of the wealth of understanding we have from a lot of work in the 2010s on disentanglement, instead of just trying a few things ourselves and, in some sense, reinventing the wheel.

Could you be more explicit about the problem you see here? Because in the paper about Softmax Linear Units, they do say "here are some things which could help with polysemanticity", and one of the things they mention is lateral inhibition. I don't know if they talk about its presence in the disentanglement literature, but given that they're using the same language for it, I'm not getting the impression that they had to reinvent the same idea.

So the claim is definitely not that the authors of this paper were unaware of anything like this; I think they probably are aware. But the AI safety interpretability community as a whole is a little different, and there's a difference between what bounces around inside this community - it's kind of a social cluster - and what's bouncing around elsewhere. As a result, I think something like Softmax Linear Units might be over-emphasized, or thought of more in isolation, as a technique for avoiding entanglement or superposition, while a good handful of other techniques are not emphasized enough. Maybe the key point here is something very simple: it's just the claim that it's important to make sure all relevant sources of insight are tapped into, if possible. The extent to which the AI safety community is guilty of being isolationist in different ways is probably debatable, and probably not a very productive debate either, but regardless of that exact extent, I think it's pretty useful to emphasize that lots of similar things are going on in other places.

Okay. So just to check that I understand: it sounds like your concern is that people are reading, I don't know, Anthropic papers, or papers coming out of certain labs that are sort of quote-unquote in this AI safety interpretability community, but there's other work that's just as relevant that might not be getting as much attention. Is that roughly what you think?

Yeah, I think so. And I think this is an effect that I'm also a victim of. There's so much literature out there in machine learning, you can't read it all, and if you focus on the AI safety part of the literature a bit more, you're going to be exposed to what people in the AI safety interpretability community are talking about. So this is kind of inevitable; it's something that will happen to some extent by default, and it happens to me with the information I look at on a day-to-day basis. So maybe there are some points to be made about how it's possible - I would say probably pretty likely - that it would be good to work to resist this a bit.
Sure. I'm wondering if there are any specific examples of work that you think are under-celebrated or little known in the AI safety interpretability community - so, work from outside the community that's under-celebrated inside it, and that you think should be better known than it is.

Yeah, I think that's a really good question; I probably don't have a commensurately good answer, and maybe my best version of the answer would involve listing things involving adversaries or something like this. There are lots of answers to this, and you can probably find versions of them in the Engineer's Interpretability Sequence, but I'll laser in on one type of research that I'm pretty excited about, and that is the automated synthesis of interesting classes of inputs in order to study the solutions learned by neural networks, particularly problems with them. This should sound familiar, because it's the stuff we've already talked about. Examples of this include synthesizing interesting adversarial features; examples include controllable generation; examples include seeing what happens when you perturb model internals in particularly interesting ways in order to control the end behavior, or the type of solution a network has learned. I think there are examples of all of these things from the AI safety interpretability community, because they're relatively broad categories, but some of my favorite papers in these spaces are from outside the AI safety interpretability community, from different labs who really study adversaries. I think my answer here is not the best.
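One simple member of this "synthesize inputs to study a network" family is classic feature visualization: optimize an input to drive a chosen internal unit, then inspect what the network appears to respond to. A minimal sketch, with the model, layer, channel, and regularization chosen for illustration rather than taken from the conversation:

```python
# Minimal sketch of feature visualization by input synthesis.
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)

acts = {}
layer = model.layer3                     # an internal layer chosen for illustration
layer.register_forward_hook(lambda _m, _i, out: acts.update(h=out))

image = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)
channel = 42                             # arbitrary unit to visualize

for _ in range(256):
    model(image.clamp(0, 1))
    objective = acts["h"][0, channel].mean()          # drive this channel's activation up
    loss = -objective + 1e-4 * image.abs().sum()      # mild penalty keeps pixels tame
    opt.zero_grad()
    loss.backward()
    opt.step()

# `image` now shows (roughly) what this channel responds to; in practice people add
# stronger regularization or a generative prior, which is where the GAN/diffusion
# progression discussed earlier comes back in.
```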
know many of them are adjacent to the space but I just think there's a lot of cool stuff going on in a lot of places I guess okay cool oh by the way this list is in um the second to last post in the engineers interpretability sequence and it's already outdated I should I should say sure yeah ml it's uh proceeding at a quick pace so one thing you also touch on I guess he said it a little bit earlier and you've touched on it in the piece is the relationship between mechanistic interpretability and deceptive alignment I'm wondering like you know what do you think like their relationship between those things is yeah I think it's um it's kind of like the relationship between interpretability and and adversaries I would I would describe the relationship between mechanistic interpretability and deceptive alignment as being one of like inextricable connection okay kind of understanding this probably requires me to clarify what I mean by deceptive alignment because deceptive alignment has been kind of uh introduced and defined in terms of like colloquially you know imagine some sort of like super intelligence system that like wants to hide its misalignment from you so it actively tricks you into into this in some way shape or form and um it's been described and characterized originally in like early posts on the topic about something that is done actively by A system that has a model of itself and of you and this is something that as alignment researchers we're all very worried about but I think there's a more like general or more like distilled understanding of deception that can come with fewer details attached to it and that is that if the system is deceptively aligned whether its problems are being hidden from us actively or not I think seems less important to an engineer just it's that if a system is deceptively misaligned it has problems and those problems are going to be ones that are really really hard to find during testing and evaluation but the types of things you would normally easily be able to throw at models to figure out if they're aligned with our goals or not so basically it just means that the system has problems that are triggered by anomalous inputs I would claim that this is something that is very general in a way that captures most of what we care about when it comes to deception let's understand deception for at least the purpose of this conversation or the next few minutes I as being an issue where a model is misaligned but it's hard to find the things that elicit the misaligned behavior so why is mechanistic interpretability really key to deception in this sense so there's a really good example that illustrates a few points from I think Paul Christiano originally introduced this but there's like a deceptively aligned system that is really really insidiously deceptive so imagine let's go back to thinking about super intelligences you have a super intelligent system that's misaligned and has goals that are not your goals but it understands your goals and it wants you to think that it has your goals and it wants to take over the world or maybe destroy the world but it can't it can't try to do that in like training or in simulation or in evaluation because then you'll stop it you'll press the off switch or you'll update it you'll fix something about it but it wants to self-preserve so its strategy is to like pick some sort of trigger to trigger its deceptive misaligned treacherous Behavior but something that's not going to show up in in training or evaluation and it's but 
instead something that's going to show up eventually in deployment. The example that Paul Christiano gave was picking some sort of cryptographic puzzle, waiting till it gets solved, and then acting treacherously once it has been solved, because cryptographic puzzles take a long time to solve, and if you wait till one has been solved, you're likely very far in the future and presumably in deployment. So the example Paul gave was waiting till the composite number RSA-2048 has been factored, and this is hard, like cryptographically hard. The reason why this is a great example is because it illustrates a situation in which a system might actively want to be deceptive and might be deceptive in a way that is very, very insidious. This deception is cryptographically hard to find, provably cryptographically hard to find. It's kind of trivial: imagine you just have some language model and it will only ever output the words "I will destroy the world" once it sees the factors of RSA-2048 inside of its input, so maybe there's only one input that could ever cause this problem. So this is very, very challenging, and it kind of illustrates the hard, at least in theory, part about this, about using evals and any other types of tools to figure out problems in models. The proof that this is hard makes a black-box assumption about the model, but if you assume you can open up the black box and characterize the system mechanistically, then you have a much better foothold, a much better chance, of characterizing the deception in a tractable way, or finding the bug in a tractable way. So this is where mechanistic interpretability comes in as being something that's so useful, at least in theory so promising, for alignment: it's a type of tool that could help us figure out this problem mechanistically as opposed to from inputs and outputs. This is one of the reasons why it plays such a central role in any type of very rigorous agenda for building aligned, safe AI, because it deals with those model internals, and it's one of two approaches, I think only two approaches, that are potentially useful for combating this type of deceptive failure. Well, now I want to know, what do you think the other approach is? The other would be mechanistic adversarial training, or latent adversarial training. Yeah, what's that? So really concretely, imagine that the system that's going to destroy the world once it sees the factors of RSA-2048, imagine concretely that there's just a single neuron or something inside of it that detects whether this number has been factored, or whether the factors have been shown, and then triggers some sort of circuit inside the model in order to enact the treacherous behavior. And this is too simple; more generally it could be some sort of set of neurons or direction or circuit or anything like this, but just think about it as being a neuron for simplicity. Just like mechanistic interpretability can help you find this neuron, latent adversarial training could give you perturbations to the model internals, such as this neuron, that are going to help to elicit potentially bad behavior during training, so that you can make the model more robust not just to attempted adversarial inputs that you find, but also synthetic adversarial inner hallucinations. Right. Yeah, it just might be a lot easier to make models hallucinate that they want to do something bad, or just
hallucinate triggers for bad behavior than to make them actually find those triggers for bad behavior in the input space yeah and just spell that out I guess the idea is you like find these inputs which make it do something bad during training and then you just train it to not do that like hopefully you train it such that there's no hallucination of some benign event where it like goes evil is my understanding yeah did you say inputs but like it can be the activations of any internal part of it yeah yeah great yeah one kind of the question I have about this approach is it seems like it's very closely related to mechanistic interpretability uh the reason being like suppose like for for basically any output that your AI could have there are cases where that output is desired right like otherwise he could just like set up a filter at the end of your AI being like just make it impossible to have this output right and so like you could imagine like there's for some undesired output maybe there's an early on neuron that says like yeah it checks if there's some factorization of this really large number and then it does this naughty thing or like maybe late you know somewhere else in the network there's this neuron that checks if there's like actually really good reason to do this thing which like At first seems naughty and then like if there is a really good reason then it does it right and kind of by construction you know if you've done things sensibly like for every output of your neural network there must be like some trigger that would make it a good idea so it seems like in order to distinguish between like hallucinations where it's Justified to do this thing versus hallucinations where it's unjustified to do this thing it seems like you need to um have some mechanistic understanding of like what's going inside what's going on inside the model I'm wondering what you make of that claim yeah this is a good point and I think there are a few things I have in mind one is that um if you think of like a classifier or you know like some sort of discrete action space if a network can only do like a discrete set of things then obviously you're only going to design it to do the set of discrete things that you would ever want it to do right you're never going to design like a discrete action space for some sort of agent where one of the actions is destroy the world and everything else has benign right yep but for something uh where the outputs are non-discreet like a language model or the ability to detect whether something is bad is like you know not trivial it's not like if a single word has been said then your system will usually be able to do things that are wrong that you're not trivially able to detect so for example you know maybe you take a language model and you train it to never ever ever say an offensive word and that's easy to like supervise and you could like even design out the tokenizations of your list of offensive words and not even include them in the models ability to you know process anything but even though the model still might be able to like lie you won't be able to prevent it from lying using some type of approach like this but then there's a second Point like you bring up this idea that like well if we can detect this right if we can detect whether something to the extent that we're able to detect whether something's a lie then we can do something like late adversarial training but um if we can detect whether something's a lie or something's bad then like what's the need for 
this right so yeah why not just use a classifier as a filter or something and this is kind of pointing out that if like we're really really really good at solving outer alignment or you know figuring out how to specify our goals or recognize failure for a system you know maybe this is a partial solution to just the inner alignment problem which is all about getting systems to correctly optimize for what we tell them to optimize for and uh I think this is very true to the extent that we're able to recognize failure you know in theory we can just always filter outputs that are bad in practice though I think we're going to run into challenges largely involving efficiency right maybe as is kind of assumed with something like reinforcement learning from Human feedback maybe our ability to recognize bad behavior is you know just like by a human looking at something and telling whether or not it's okay or whether it's bad that's not something that we can like very efficiently tack onto a system even in training let alone deployment all the time right so we have to take these shortcuts and um I certainly would want it to be an additional tool in the toolbox in addition to filters to have the ability to train role models that are more intrinsically and endogenously robust to the the problems that we can recognize a good thing about these types of different approaches is that certainly they don't seem mutually exclusive hmm so one thing I guess very related to this problem of deceptive alignment is detecting Trojans can you talk a little bit about how you see those as being similar yeah so a little bit earlier I kind of made the case for one broad way of understanding deception as just being when a system develops for whatever reason bad behavior that will be a response to some sort of anomalous input some sort of anomalous inputs that are like hard to find or simulate during training and end evaluation and this is quite close to the definition of what a Trojan is so uh Trojan or a back door kind of come from the security literature this wasn't first these aren't Concepts that are first from machine learning but they've kind of come to apply on machine learning and roughly synonymously just both Trojan and backdoor refer to some sort of particular subtle weakness or sneaky weakness that has been engineered into some sort of system so just a Trojan is like some misassociation between like a rare feature or a rare type of input and some sort of unexpected possibly bad or malicious behavior that's some sort of adversary could implant into the network the way this is usually like discussed is from a security standpoint it's just like oh imagine that uh you know someone has access to your training data what types of weird weaknesses or behaviors could they Implement in the network and this is really useful actually like think about uh just the internet which has now become the training data for lots of state-of-the-art models people could just put stuff up on the internet in order to like control well sorry what systems trained on internet scale data you know might actually end up doing but uh the reason this is interesting to the study of interpretability and adversaries and AI safety is less from this security perspective and more from the uh perspective of like how tools to characterize and find and scrub away Trojans are uh very related to the interpretability tools and research that we have and they're very much like the task of finding triggers for deceptive Behavior there's very few differences 
and the differences are practical, not technical, mostly, between deceptive failures and the types of failures that Trojans and backdoors elicit. Sure, so I think this is a good segue to talk about your paper called Benchmarking Interpretability Tools for Deep Neural Networks, which you co-authored with Yuxiao Li, Tong Bu, Kevin Zhang, and Dylan Hadfield-Menell, I hope I didn't get those names too wrong. No, that sounds right. But it's basically about benchmarking interpretability tools for whether they can detect certain Trojans that you implant in networks, right? Yes, and one quick note for people listening to this in the future: the paper is very likely to undergo a renaming, and it is likely to be titled Benchmarking AI Interpretability Tools Using Trojan Discovery, so, similar title, but likely to change. All right, cool. Well, I guess you've made the case for it being related to deceptive alignment, but I think I'm still curious why you chose Trojans as a benchmark. I kind of brainstormed, well, if I wanted to benchmark interpretability tools, what would I do, and I guess other possibilities are predicting downstream capabilities, for instance, if you have a large language model, can you predict whether it's going to be able to solve math tasks, or can you fine-tune your image model to do some other task. You could also do just predicting generalization loss, like what loss it achieves on datasets it hasn't seen. You could try and manually distill it to a smaller network. I don't know, there are a few things you could try to do with networks, or a few facts about networks that you could try to have interpretability tools find out, so I'm wondering, why Trojans in particular? Yeah, the types of things you mentioned, I think I find them pretty interesting too, and I think that any type of approach to interpretability, especially mechanistic interpretability, that goes and finds good answers to one of these problems seems like a pretty cool one. But yeah, we studied Trojans in particular, and in one sense, I'll point out, there is a bit of a similarity between discovering Trojans and some of the things that you described. For example, if you're asking what's the Trojan, versus if you're asking how is this model going to perform when I give it math questions, or if you're asking how is the model going to behave on this type of problem or that type of problem, there's something a bit similar in all of these types of tasks, and that's a sort of sense in which you're trying to answer questions about how the model is going to be behaving on interesting data, specifically if that data is unseen. That's another nice thing about a benchmark, hopefully, because if you already have the data, you can just run it through the network. But Trojans, right, so why do we use Trojans? One reason is that there's a very well-known ground truth. Ground truth is kind of easy to evaluate: if you are able to successfully match whatever evidence an interpretability tool produces with the actual trigger corresponding to the Trojan, then you can say with an amount of confidence that you've done something correctly. Other interpretability tools could be used for characterizing all sorts of properties of networks, and not all of them are like Trojans in the sense that there's some sort of knowable ground truth. Some of them
are but not all of them another advantage of using Trojans is that they are like very easy they're just kind of very convenient to work with you can make a Trojan trigger anything you want you can insert it any type of way you want to using any type of like data poisoning method and um the point here is also making them something that's like a very novel very distinct feature that so that you can again like evaluate it later on and the final reason why it's useful to use Trojans is that I think Trojans they're like fine recovering Trojans and scrubbing Trojans from networks doing things like this are very very closely related to very immediately practical or like an interesting type of tasks that we might want to do with neural networks and lots of the research literature kind of focuses on stuff like this there are lots of existing tools to find visualize what features will make like a vision Network do X or Y but there are not a lot of existing tools that are kind of like very uh well researched at least yet for you know telling whether or not a neural network is going to be good at math or something so I think it's roughly for these reasons it's largely like kind of born out of consistency having a ground truth in the inconvenience why we use Trojans but um it is nice that like these Trojans just kind of are features that cause unexpected outputs right and this is a very very familiar type of debugging problem because it appears all the time with like data set biases or learned spurious correlations and things like this so it makes sense that there's kind of like a good fit between this type of task and what has been focused on in the literature so far but yeah like the kind of stuff you describe I think could also make really really interesting benchmarking work too probably for different types of interpretability tools than we study here but we need different benchmarks because there are so many different types of interpretability tools yeah yeah that was actually a question I wanted to ask like you mentioned there are various things that interpretability tools could try to do you have this paper where you Benchmark a bunch of them on Trojan detection I'm wondering like how do you pick like like how do you decide oh these are things that we should even try for Trojan detection yeah so this benchmarking paper kind of really does two distinct things at the same time I think it's important to be clear about that um for example not all the reviewers were clear about that when we put out the first version of the paper but hopefully we fix this the first is on benchmarking feature attribution and saliency methods and the second is on venturing benchmarking feature synthesis interpretability approaches which are pretty different feature attribution and saliency approaches are focused on figuring out what features and individual inputs caused them to be handled the way they were handled and feature synthesis methods produce novel classes of inputs or novel types of inputs that help to characterize what types of things are out there in its input space that can elicit certain Behavior so these were two types of tools you know these two paradigms the attributions and synthesis the attribution saliency Paradigm and the synthesis Paradigm that uh that are reasonably equipped to do some work involving Trojans but yeah like there are definitely more types of things out there interpreting a network does not just mean attributing features or synthesizing things that are going to make it have certain 
behaviors. I think these are both interesting topics, but there can be quite a bit more that's going on. Really, I think other types of interpretability benchmarks that could be useful could include ones involving model editing, we didn't touch model editing at all in this paper, or model reverse engineering, we didn't touch that at all in this paper either, and I think the next few years might be really exciting times to work on and watch this additional type of work in the space on rigorously evaluating different interpretability tools. And if you put it like that, this paper was quite scoped in its focus on just a limited set of tools. Yeah, I guess my takeaway from the paper is that the interpretability methods were just not that good at detecting Trojans, right? So I think the attribution and saliency methods, yeah, not that good. I think that the feature synthesis methods range from very non-useful to useful almost half the time. But okay, yeah, there's much, much room for improvement. Yeah, I mean, I guess I'm kind of surprised. One thing that strikes me is that in the case of feature attribution or saliency, which I take to mean, look, you kind of take some input to a model and then you have to say which bits of the input were important for what the model did, as you mentioned in the paper, these can only help you detect a Trojan if you have one of these backdoored images, and you're seeing if the feature attribution or saliency method can find the backdoor, and this is kind of a strange model, right? It seems like maybe this is a fair test of feature attribution or saliency methods, but it's kind of a strange way to approach Trojan detection. And then in terms of input synthesis, so coming up with an input that is going to be really good for some output: again, I don't know, if my neural network is trained on a bunch of pictures, most things that classify as dog, it's because it's a picture of a dog, but it has some Trojan where, in one percent of the training data, there's this picture of my face grinning, and it was told that those things counted as dogs too. In some sense it would be kind of weird if input synthesis methods generated my face grinning, because the usual way to get a dog is to just have a picture of a dog, right? I don't know, I guess for both of these methods, I'm not even sure I should have thought that they would work at all for Trojan detection. Yeah, I really, really like this point, and there are somewhat different comments that I have about it for both types of methods. The more damning one involves feature attribution and saliency, because like you said, these tools, because of the types of tools they are, what they are, they're just useful for understanding what different parts of inputs are salient for specific images or specific inputs that you have access to. So if you can ever use them for debugging, it's because you already have data that exhibits the bugs, right? So if we have our engineer's hat on, it's not immediately clear why this type of thing would be competitive, would be any better than just doing something simpler and more competitive, which could just be analyzing the actual data points, right? And in cases where those data points have glaring Trojans like
in ours then this would probably be a both a simpler and better approach doing some sort of analysis on the data yeah I guess it is sort of embarrassing that they couldn't like that you have some image with like a cartoon smiley face and that cartoon smiley face is this trigger to get it classified I guess you would kind of hope that these saliency methods could figure out that it was the smiley face yeah I think it's a little bit troubling but like this paper and some other papers that have introduced some alternative approaches to evaluating the usefulness of saliency and attribution methods they find more successes than failures our work included right which is kind of disappointing because so much work is more successes than failures sorry more failures than successes okay yeah um so much work has been put on into a feature attribution and saliency research in in recent years it's one of the most popular subfields and interpretability so like why is it that these methods are failing so much and why is it that even if they're successful they might not be competitive part of the the answer here involves like being explicitly fair to these methods one of the reasons that their research is for like helping to determine accountability you know think people who are working on AI and have courtrooms in the back of their mind right yeah this can be very useful for like determining accountability and whether or not that lies on like a user or a creator of a system or something so so it's worth noting that these do have potential like practical societal legal uses but from an engineer's standpoint right sorry and even under so I think I'm just missing something why is it useful for accountability so uh your self-driving car like uh hit something hurt someone something like this right is this like an act of God kind of thing or an unforeseeable mistake kind of thing from a a courtroom's perspective or maybe the uh system designers just like were negligent and how they designed the vision system of the self-driving car and they didn't a court might rule that if the failure that it made was egregious enough and it was just attributing uh things to the obviously wrong things the court might like rule that there is negligence on the designer's part something like this I don't do law though but the the case is something like if a system makes a mistake you want to understand what was going on in that particular case yeah like like which facts the model relied on would maybe tell you if the model was messed up or if it was an unlucky situation even then it seems a bit tenuous to be honest yeah here's maybe a better um a better example like like suppose someone claims that they were discriminated against by an AI system and uh they were input into that AI system think like hiring or something and there were a bunch of features associated with them and this in this toy example imagine one of those features is like their race and another feature is like their age and uh if you you could use attribution methods to like find that the system might not have been discriminative because it was attributing a decision to age or a system might have been like racially discriminative because it was attributing a decision to race right I should say I think in many jurisdictions you're not supposed to discriminate just on the basis of age as well I see so okay let's let's think about like racism versus non-racism or something like that yeah okay this this might be a it's what I described as a toy situation it's not 
super realistic but this is one example of like using attribution and saliency methods to like answer practical questions from like a legal standpoint at least okay so that was your take about um attribution and sailing C methods and you also had a comment on yeah it's just that if I if I'm an AI safety engineer if I have that hat on then um I'm not thinking about like courtrooms or something like this I'm thinking about like how do I find problems and fix them okay and you will set a comment on features on input synthesis methods yeah so input synthesis methods um these do not require the use of data because feature attributions saliency methods do if you already have that data why not just use it and that's a really big Advantage um there's a a really nice thing about interpretability tools that do not require a data set as opposed to ones that do require a data set is that they are at least possibly equipped right to um characterize what a model can do on out of distribution or anomalous behavior and by creating their own features you know feature synthesis methods are our one answer to this they're one example of a data free approach to interpretability so now we have a natural claim to like competitiveness with just using a testing set or analyzing a testing set or something like this and that's really nice but is this going to be the kind of thing you know is this still is this still too limited to detect the most Insidious types of failures probably yeah right because if some sort of system is just sizing and searching for inputs most type of training processes are or searching procedures are you know not super exhaustive and if you don't stumble into like the most Insidious failure modes you just might not be able to find them right so this might kind of point at one intuition for why latent adversarial training and mechanistic interpretability still have some unique a unique Competitive Edge over synthesis methods one reason why lots of synthesis methods just might be poorly uh equipped to do this in the first place is that usually better synthesis methods the ones that are better at finding like Trojans are better because they use better priors for like how these features are synthesized or they have like better inductive biases as to what types of features they're able to come up with for example it's usually much better for interpretability to create features using a generative model that's already been pre-trained on a bunch of data much better than synthesizing a feature from completely from scratch right this is just another way of saying regularization usually helps or like having better priors usually helps but there's this kind of trade-off with the more regularized or the more the methods that are more restricted in the prior that they impose with the ability to like characterize Model Behavior on anomaly soft distribution data so that's a little bit disappointing right you know maybe you know some generative model that was trained on a data set might actually not be that much better than the data set you trained it with for synthesizing adversarial features or recovering Trojans or identifying triggers for deceptive Behavior Etc so in terms of like The Benchmark in this paper if there's some difficulty if these input synthesis methods aren't very effective and maybe there are reasons to think that they might not and if these saliency methods aren't don't seem to be very effective either like do you think the way forward is to like kind of try to use this match Mark to 
improve those types of methods or do you think like coming up with like different approaches that could help on Trojans is like a better kind of way forward for the interpretability space yeah I think to acquire things an extent I would want to be working on both and I think I think most questions like this is a better be better you know my answer is something like we want a toolbox not a silver bullet right but you still I think it's still a really important question right like should we start iterating on benchmarks or should we start changing the Paradigm a bit you know which one's more neglected or something like this I see a lot of value you know at least trying to get better benchmarks and do better on them because um I I would feel quite premature in kind of saying oh well they fail so let's uh let's move on because you know benchmarking uh for feature synthesis Methods at least really really hasn't happened in a comparable way to the try the way that we tried to do it in this paper benchmarking for future saliency and attribution has but the synthesis stuff is pretty unique which I'm excited about so I would I would think it a little bit premature to like not at least be excited about what could happen here in the next few years but uh I would also think of it like as and and on the other side of the coin I would think of it as being um a bit parochial or a bit too narrow to put all your stock in this I think alternative approaches to the whole feature synthesis and interpretability Paradigm are going to be really valuable too and that can be like more mechanistic interpretability stuff uh that could be late and adversarial training like we talked about earlier that's one thing I'm excited about so I I see cases really good reasons to uh work on all of the above it's it's a porque no Los Dos kind of thing um okay let's build the toolbox that's that's usually my perspective on these things okay yes there is a good point to make though that like you know having a very bloated toolbox or like you know having a bunch of tools without knowing which is great the ones that are likely to succeed does increase the alignment tax anyway I'm just kind of blabbering now all right so at this point I have some questions just about the the details of the paper one of the criteria you had was you wanted these Trojans to be human perceptible right so like examples were like if there's some like cartoon smiley face in the image make it do this thing or if the image has like the texture of jelly beans make it do this thing one thing I didn't totally understand was why you wanted was why this was considered important and especially because you know maybe if they're awake easier types of Trojans that like are still Out Of Reach but are closer you know that that kind of thing could potentially be more useful yeah so there's some trouble with this and there's a cost to this right one of them is that it kind of restricts the sets of Trojans that we're able to really use for a meaningful study like this hmm in inserting patches into images or changing the style of an image as drastically as we change the style of an image you know kind of takes the image a bit off distribution for like real natural features that are likely to cause problems or the types of features that some sort of adversary would like want to implant in a in a setting where security is compromised so there's a little bit like of a trade-off with realism here but the the reason we focused on human interpretable features was kind of like a 
matter of convenience as opposed to a matter of something that's really crucial to do so uh it just kind of boils down to restricting our approach I think there is something to be said about how like humano techniques that involve human oversight is is unique are unique right and we want techniques that Empower humans and techniques that don't Empower humans and do things in an automated way inside of the toolbox but there is definitely some sort of value to the human oversight and um we we went with this with this framework and lots of the research that we were engaging with like also use this type of framework trying to produce things that are meant to be understood by a human okay and uh this works and this kind of fit with the the scope of experiments that we tried but that is not to say at all that it wouldn't be very interesting or very useful to introduce classes of like Trojans or weaknesses or anything of the sort that you know are not human perceptible or interpretable it's just that our evaluation of whether or not you know tools for recovering these are successful can't involve a human in the loop obviously we'd need some other sort of way to like automatedly test whether or not a synthesized feature actually you know resembles very well via Trojan that it was trying to uncover and um I have no I have nothing bad to say about that approach because I think it sounds pretty awesome to me sounds a little bit challenging but um yeah that's the kind of thing that I'd be excited about sure now I want to ask about some of the details so you mentioned that um so in the paper you have these three types of Trojans right one is like these sort of patches that you like paste in that he sort of like superimposed onto the image like this is transparent cartoon smiley face or something and I don't know it seems like relatively simple to me to understand how those are going to work there are also examples so where you use neural style transfer to kind of for instance like for some of these I think you like jelly beanified images right like like you made you made them have the kind of texture of jelly beans while having their original form another of them was you sort of detected like if images happen to have a fork in them and then like I said that that was going to be the Trojan I'm wondering like these second two are they're kind of relying on these neural networks you've trained to do this task being like like performing pretty well and one thing that I I didn't get an amazing stencil from the paper is like how well did these Trojan generation methods actually work also great question so yeah uh past Trojans easy slap in a patch and you're great and uh we did use some like augmentation on the fat patches to make sure that it wasn't the same thing every time and we blurred the edges so that we didn't like implant we had biases about sharp lines into the network but yeah really simple and the networks as you might imagine like we're pretty good at picking up on the patch Trojans on the uh held out set they were I think on out like in general doing like above the 80s and 90s well above 80 and 90 percent accuracy on images that had the patch Trojan inside so something was learned yep the style Trojan like you mentioned there doing style transfer requires some sort of feature extractor to can and some sort of uh style Source image and the feature extractor like worked pretty well but you know style transfer is kind of difficult to do very very consistently sometimes the style just kind of 
obliterates lots of the discernible features in the image and sometimes the style like maybe on the other end of things just doesn't affect the image enough but like on average we try to tune it to do okay and um the neural networks were really really good at the uh after data poisoning and picking up on the Styles these were also being implanted with a roughly like 80 or 90 plus percent accuracy on the Trojan damages in the validation set the natural feature Trojans were a very different story right these natural feature Trojans we implanted just by relabeling images that had a natural feature in them which means that we needed to pull out some sort of object detector and use that to figure out when there was one of these natural features available and we did that but the object detector really wasn't super perfect and also these natural features come in like all sorts of different like types and shapes and orientations and locations and Etc right these were implanted much less robustly in the network and on the held out set the validation set the accuracies were significantly lower I think it was sometimes under 50 percent for individual natural feature Trojans and um to the point of like why it's interesting to use all three of these right one is for like simple diversity right it's uh you get better information from having like different types of Trojan features than just one type of Trojan features uh something that's nice about patch Trojans is that the location that you're inserting it into an image and where it is in an image is like known as a ground truth and that's really useful for like evaluating attribution and salience methods something that's nice about style Trojans that we found after the fact actually is that they're super super challenging for future synthesis methods to detect like really no feature synthesis methods had any sort of like convincing success at all anywhere helping to ReDiscover the style source that was used for them so this seems like a really challenging direction for possibly future work a cool thing about natural feature Trojans is that they very very closely simulate the real world problem of like getting networks to like understanding when they're picking up on bad data set biases and hopefully fixing that for example for the exact same reason that our Trojans Network learns to associate forks with um the target class of this attack I think it was a cicada an imagenet Network just trained on clean data is going to learn to associate a fork with like food related classes or an image net Network we'll learn to associate like tennis balls with dogs right we're just kind of simulating a data set bias here so the results involving natural feature Trojans are probably going to be the most germane to like practical debugging tasks at least ones that involve bugs that super Vein on data set biases yeah I mean I guess one question I still have is like is there some way I can check like how well these like these style transfer images or these like natural like these cases where you just naturally found a fork in the image is there someplace I can just look through these data sets and see like okay like do these images even look that jelly bean-ish to me I don't think I found this in the paper or the GitHub wrapper but I didn't yeah correct uh the best way to do this and anyone who's listening feel free to email me um the best way to do this is to ask me for the code to do it the code to do the data poisoning the training under data poisoning is not inside 
of the repository that we're sharing and the reason is that this paper is very soon going to be turned into a competition with some small prizes and with the website dedicated to it and um that competition is going to involve uncovering Trojans that we keep secrets and um with the style Trojans and the patch Trojans it would be perfectly sufficient to just hide those sources from the source code so that's not really a problem but it's a little bit harder to do with the natural feature Trojans because details about what object detector we use could help someone put strong priors on what types of natural feature Trojans were able to insert and maybe I've said too much already but for this reason it's not like it's not public but if anyone wants to like um forfeit their ability to compete in a future competition and email me I'll send them via the code if they promise to stay keep it on the down low can you share the data sets produced by the code like would rather than the code itself yeah that sounds like a pretty easy thing to do you know just like producing a bunch of examples of like into this particular patch and style and natural feature images that like were relabeled as part of these data poisoning I just haven't done it yet but um let me put that on a list I'll work on this if I can and I will especially work on this if someone explicitly asks me to and it sounds like maybe you are well mostly and I guess in podcast format so I guess um another question I have is the way you evaluated the input synthesis methods which was essentially like you ran a survey right where like people look at um all of these uh visualizations and they're supposed to say like which of these eight objects you know are they reminded of by the visualization or which eight images where one of the images represents the Trojan that you inserted so I guess I have two questions about this one of them is that like when when you're doing a survey like this I kind of want to know what the population was so what what was the population for the survey and like do you think the population that got surveyed would matter much for the evaluation of these um yeah so straightforwardly the population was uh Cloud connect knowledge workers which are very similar to like m2k knowledge workers uh lots of people do this as like their career or like a side aside job they do something like this and um they were all like English-speaking adults okay and I think for some types of like features you know you there might be very clear reasons to worry about like whether or not they're going to be like systematic biases um among different like cultures about like who's good at recognizing what features are not this could totally be true you know maybe things would be different in um very like lots of Eastern cultures with fork Trojans right because uh their Forks are just like less common there I don't know maybe they have like maybe people on different sides of the world or slightly maybe less apt to see forks and things that only vaguely resemble Forks than I might so I think I think there is some way for like like biases here and it's uh it's worth keeping in mind that the people that we studied were all just kind of English-speaking adults who are these knowledge workers but I I don't anticipate any particularly um nothing that nothing that keeps me up at night about this survey methodology and and the demographics of people who are who are part of it mostly because all of the images that we used is uh triggers or style sources or 
all and all the features that we used are just kind of benign boring types of things okay I guess sort of related to that question in the survey it doesn't really explain how the the feature visualizations methods work right um it's just like here's a bunch of images pick the one that looks very similar it strikes me as possible that like I don't know if I think of these feature visualization methods is like finely tuned tools or something then I might expect that if somebody knew more about how this tool worked and like I don't know what it was supposed to be doing they could potentially like do a better job at picking up what the tool was like trying to show them I'm wondering like do you think that's an effect that would potentially like change the results in your paper yeah I do I think there's this is an important Gap and uh I actually don't think the paper explicitly spells this out as a limitation but I should update it to do that because um we should expect that different tools in the hands of people who are very familiar with like how they work are very likely to be better uh or they're the people who know about the tools are going to be able to wield them more effectively I think we found at least one concrete example of this one of the feature synthesis methods that we used and this this example is in figure three of the paper it's in the collection of visualizations for Forks but uh the uh method for constructing robust feature level adversaries via fine-tuning a generator so the second to last row in this paper when it attempted to synthesize fork images it ended up kind of synthesizing things that looked a little bit like a pile of spaghetti in the middle of an image but with some gray background that had stripes in it kind of like the tines of a fork and uh in this particular case on this particular example the survey respondents uh you know answered Bowl which is another one another one of the multiple choice options and a bowl is just it was chosen because it was another one of like another common kitchen object like a fork and they chose a bowl over like fork and uh knife and spoon or I can't remember exactly what alternatives they were but this uh going back and looking at this I can kind of understand why so this thing in the middle looks a little bit like a spaghetti in a bowl or something and it's in the it's in the foreground it's in the center of the image yeah yeah but there's this still very like distinct striped pattern in the back that looks a lot like the tines of a fork and as someone who works a lot with feature visualization like I don't think I would have answered ball I might be uh speculating too much here but I think I would uh I I'm pretty well attenuated to the fact that like you know stripes and images you know tend to make feature detectors inside of networks go crazy sometimes so I think I probably would have answered Fork but uh that like foreground bias might have contributed to this one tool maybe not as being as effective in this one particular instance I'm wondering did you like I don't know have you had a chance to like basically test your colleagues on this I don't know it would hardly it would be hard to get significance but do you have a sense of like how that pans out yeah I have a bit and sometimes I'm pretty impressed by like when the others who sit next to me in lab at how good they actually are compared to like my subjective expectations for them but I haven't asked them about this specific example I should I think you quizzed me on 
this right I think I did I showed you a few uh Patrick examples yeah yeah and some visualizations involved in them do you know if I got them right oh did I not tell you I think I think usually when I asked you or some other people it would be like uh they got like two out of four right that I would show them or something yeah okay all right so so I I guess that's like that suggests that maybe informed people are batting at like 50 yeah I think they'd go better it could be 50 it could be more or less better than yeah yeah probably in like between 10 and 90 uh yeah just like I think uh all these results really give us is like a probable floor for um at least at the on average okay yeah and I guess like the final questions I have about this paper is um how do you think it relates to other things in the literature or in in this like World of benchmarks for interpretability tools yeah the part about this paper that I'm the most excited about is really not the saliency and attribution work it's the work with feature synthesis because um this is to the best of our knowledge like the first and only paper that takes this type of approach on feature synthesis methods and that's a little bit Niche but I I think it's a I mean it's a contribution that I'm excited about nonetheless and um if I if you ask me like what I what I would love to see in the next in the next few years as a result of this I'd like to see some more benchmarks and some maybe uh more carefully constructed ones that takes advantage of some of the lessons that we've learned here and I'd like to see um some more rigorous like uh competition to beat these benchmarks because in Ai and in other fields in general benchmarks have a pretty good way a pretty good tendency of like concretizing goals and building communities around these like concrete problems to solve they give a good way of like getting feedback on what's working and what's not so that the field can kind of iterate on um on what's going well if if you look at like reinforcement learning and benchmarks or like image classification and benchmarks you know so much progress has been made and so many useful combinations of like methods have been found by iterating on what exists and and beating the benchmarks that do exist and this isn't so much the case with interpretability so my optimistic hope for a benchmarking type work is that it could kind of help guide us quite a bit further than we've come already towards stuff that seems very practical uh in the same way that benchmarks have been useful in other fields all right so before we start wrapping up I'm wondering if there are any questions about this or about your broader kind of views of interpretability any questions that you wish I had asked but I haven't um I one thing I like to talk about a lot lately is like how whether and how interpretability tools could be useful for like shaping policy I have some like high level speculative optimistic takes for like ways interpretability could be useful all right yeah how could interpretability be useful for shaping uh AI policy or other kinds of policies oh what a what a coincidence you asked no um so from an engineer's standpoint you know if we get really good at using interpretability tools for diagnosing and debugging failures that's really great then it comes to like applying this in the real world that's that's kind of the final frontier the last major hurdle to get over when it comes to making sure that the the interpretability part of the agenda for AI alignment really gets 
fully realized so one type of work I'm really excited about is just kind of using tools to like red team real systems and figure out problems with them as ways of getting all the right type of attention from all the right types of people that we want to be like skeptical about AI systems in their applications right it seems very very good to take existing existing deployments or systems find problems with them and then make a big fuss about them so that there comes to be like a better Global understanding of risks from AI systems and how Insidious errors could cause could still pose dangers um I also think interpretability could be very usefully incorporated into policy via auditing and there are ways to let's do this in that are better in ways to do this that are worse but um I'm definitely not alone in recent months and kind of like thinking that this could be a really useful Avenue forward for impact there's a lot of interest from inside and outside the AI safety Community for having more auditing of impactful AI systems think how like the FDA and the United States regulates uh drugs and mandates clinical trials well maybe um the FTC in the United States or some other Federal body that governs AI could like mandate tests and evals and red teaming and could try to find uh risks as it governs AI so um the more that can be done in the next few years I think to demonstrate the Practical value of interpretability tools on real systems and the more attention that can be gotten from people who you know think about this from a policy perspective especially inside of government I think that's done that that could be very useful for um kind of starting to build a toolbox for governance uh and starting to think about how we might be able to avoid AI governance getting so badly outpaced by developments and capabilities okay and do you think that suggests any like particular directions within the space of interpretability I think maybe maybe an answer here maybe a couple of answers actually concretely yes I think one type of paper one genre of paper that I'm really excited about maybe working more on in the near future uh one type of paper is just you know one of those red teaming papers where you like we've we picked this system we use these methods and we found these problems with it and we told the makers of the system about what we found and here we're reporting on it to um you know show you all this this practical example or this case study about what can be done with like auditing tools that's something I'm excited about there's one example of this from pretty recently that I think is very cool the paper is titled um red teaming the stable diffusion safety filter and they did just this with the open source stable diffusion system and this was from some researchers at eth Zurich I think in spirit I love everything about like this approach yeah and in some ways the adversarial policies work for go seems like oh absolutely the same kind of thing I guess it seems less um I don't know you're less worried about it from like a safety perspective maybe it's less like eye-catching for policy people but sure on some level it's like the same thing right I agree yes and actually like at this point you bring up about eye-catching to policy people uh this is one this is one I don't know if this is an answer or uh or a critique of the cool way you ask that question but you asked if I had any interest in particular things and I actually sort of have an uh an explicit interest in less particular things in a 
certain sense right so and by less particular I just mean like less immediate of less immediate relevance to like what AI safety researchers immediately think about all the time concretely I think interpretability adversaries red teaming auditing this type of work could be useful for AI safety governance even if it focuses on problems that are not immediately useful to aicp so like as safety people care about this stuff too but lots of non-ass safety people are explicitly worried about like making sure models are like Fair they have social justice in the back of their mind right and these are objectively important problems but this is qualitatively distinct problem than like you know trying to prevent X risk but these could be really useful issues to serve as like hooks for instituting better governance and you know if we get the FTC to mandate a bunch of um like eval work in order to make sure that models like fit some sort of Standards involving social justice this isn't directly going to save anyone or just save save us from an extra X risk perspective but this type of thing could serve to like raise activation energies are kind of like lower the level of water in the barrel when it comes to slowing down AI in in some useful ways or making it more expensive to like uh train and audit and test and deploy and monetize AI systems so so if if anyone is sympathetic to the goal of slowing down AI in order to uh for for AI safety reasons I think they should also potentially be sympathetic to the idea of leveraging issues that are not just AI safety things in order to get useful and potentially even retoolable types of policies introduced at a governance level okay so so it seems like the strategy is roughly like if you are really worried about Ai and you want like people to have to proceed like slower then like if there are problems which other people think are problems but are just easier to measure yeah it sounds like your argument is like look measure those problems and then like build some infrastructure around like slowing research down and like causing it making sure it solves those problems and then I guess the hope is that that helps with the problems you were originally concerned about or or the problems you were originally focused on as well is that roughly the the idea yeah I like that way of putting it this is kind of like an argument for working on more near-termness problems or like you know problems that are substantially Less in magnitude than something like uh catastrophic risk but still like using them as like practical issues to focus on for like political reasons and maybe use like Laboratories of alignment too for trying to develop governance strategies and Technical tools that can later be like retooled for other types of failures that may matter more from a from a catastrophic risk perspective and I guess it's worth throwing in there like uh any anything that's a problem is is still worth working on or being concerned about to some extent and I think it's great to work on lots of things for all the right reasons uh even if you know some of my biggest concerns involve catastrophic risk okay cool well um I think we've done a good job of clarifying your thoughts on interpretability work and Trojans I'm wondering um if people are interested in future work that you do or or other parts of your research um how should they follow your um your research yeah I put a bit of effort into making myself pretty responsive or easy to reach out to so the first thing I'd I'd recommend to 
anyone is to just email me at scasper@mit.edu and we can talk. That especially goes for anyone who disagrees with anything I said in this podcast; you're really welcome to talk to me more about it. Another thing you could do is go to stephencasper.com, and through my email or stephencasper.com you can also find me on Twitter, which I use exclusively for machine-learning-related content. So I think those are the best ways to reach me.

And that's 'ph', not 'v'?

Yeah, Stephen with a 'ph', and Casper with a 'c'.

Okay, great. Well, thanks for talking to me today.

Yeah, thanks so much, Daniel.

This episode is edited by Jack Garrett, and Amber Dawn Ace helped with the transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Ben Weinstein-Raun and Tor Barstad. To read transcripts of this episode, or to learn how to support the podcast yourself, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.

[Music]

Related conversations

AXRP

28 Mar 2025

Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med 0 · avg -1 · 139 segs

AXRP

1 Mar 2025

David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med -9 · avg -7 · 21 segs

AXRP

1 Dec 2024

Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med -6 · avg -7 · 120 segs

AXRP

27 Jul 2023

Superalignment with Jan Leike

This conversation examines technical alignment through Superalignment with Jan Leike, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Same shelf or editorial thread

Spectrum trail (transcript): Med -10 · avg -7 · 112 segs

Counterbalance on this topic

Ranked with the mirror rule described in the methodology: picks sit closer to the opposite side of your score on the same axis, with lens alignment preferred. Each card plots you and the pick together.
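To make that ranking concrete, here is a minimal sketch of the mirror rule as described above, assuming each item carries a single spectrum score and a lens label. The function and field names (mirror_picks, score, lens) are illustrative assumptions, not taken from the site's actual methodology code.

# Hedged sketch of the mirror rule: reflect this page's score across the
# spectrum midpoint, then prefer picks that share the page's lens and sit
# closest to that mirrored point.
def mirror_picks(page, candidates, k=3):
    target = -page["score"]  # opposite side of the spectrum, same axis
    def rank_key(item):
        lens_mismatch = 0 if item["lens"] == page["lens"] else 1  # lens alignment preferred
        return (lens_mismatch, abs(item["score"] - target))
    return sorted(candidates, key=rank_key)[:k]

# Example with this page's headline score (-14.44, Technical lens):
page = {"score": -14.44, "lens": "Technical"}
candidates = [
    {"title": "David Rein on METR Time Horizons", "score": -10.64, "lens": "Technical"},
    {"title": "Tom Davidson on AI-enabled Coups", "score": -10.64, "lens": "Technical"},
]
print([c["title"] for c in mirror_picks(page, candidates)])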

Mirror pick 1

AXRP

3 Jan 2026

David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80
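(Δ appears to be the pick's score minus this page's score: -10.64 - (-14.44) = +3.80.)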

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -0 · 108 segs

Mirror pick 2

AXRP

7 Aug 2025

Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -5 · 133 segs

Mirror pick 3

AXRP

6 Jul 2025

Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.

Spectrum vs this page

This page -14.44 · This pick -10.64 · Δ +3.80

Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.

Spectrum trail (transcript): Med 0 · avg -4 · 72 segs