Assistance Games with Dylan Hadfield-Menell
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 113 full-transcript segments: median 0 · mean -4 · spread -29–0 (p10–p90 -13–0) · 5% risk-forward, 95% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes safety
- Emphasizes AI safety
- Full transcript scored in 113 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Episode transcript
YouTube captions (auto or uploaded) · video shaCQFlOGDQ · stored Apr 2, 2026 · 4,104 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/assistance-games-with-dylan-hadfield-menell.json when you have a listen-based summary.
Daniel: Hello, everyone. Today I'll be talking to Dylan Hadfield-Menell. Dylan is a graduating PhD student at UC Berkeley, advised by Anca Dragan, Pieter Abbeel, and Stuart Russell. His research focuses on the value alignment problem in artificial intelligence: the problem of designing algorithms that learn about and pursue the intended goals of their users, designers, and society in general. He will join the faculty of artificial intelligence and decision-making at MIT as an assistant professor this summer. Today we're going to be talking about his work on assistance games, and in particular the papers "Cooperative Inverse Reinforcement Learning", co-authored with Anca Dragan, Pieter Abbeel, and Stuart Russell; "The Off-Switch Game", also co-authored with Anca Dragan, Pieter Abbeel, and Stuart Russell; and "Inverse Reward Design", co-authored with Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. For links to these papers and other useful info you can check the description of this podcast, and you can read a transcript at axrp.net. Dylan, welcome to the show.

Dylan: Thanks so much for having me, Daniel. It's a pleasure to be here.

Daniel: I made something of an assumption in the introduction that you think about those papers under the umbrella of assistance games. Is that right?

Dylan: Yeah, I think so. I take a pretty broad view of what assistance games can mean, and I tend to think of cooperative IRL as a formalization of a class of assistance games, of which the off-switch game and inverse reward design are examples.

Daniel: For listeners who aren't familiar, what is an assistance game?

Dylan: An assistance game is a formalization of a sequential decision-making problem with two players: a human player and an AI player. They have a shared goal, in the sense that the AI system is meant to act on the person's behalf and optimize utility for the person. The big difference between the two is what information they have: in an assistance game, we assume that the person has knowledge about their goal, and the system has to learn about it via interactions with the person.

Daniel: What kinds of benefits do we get from thinking about AI problems in terms of assistance games?

Dylan: The big thing assistance games provide is a way of thinking about artificial intelligence that doesn't presuppose an obvious, measurable goal. One of the things my work on assistance games has pointed me towards is that a lot of what we're trying to do in AI is provide access to a broad set of intelligent computer behaviors. In many ways, improvements in artificial intelligence have really been improvements in the ability of a specialized, highly trained group of AI practitioners, graduate students, and early technologists to identify and build in complex, qualitative behaviors for systems. In the process of doing that, these practitioners developed a lot of the mathematics of optimal decision-making, and algorithms to implement it. But sitting outside of that research is always a researcher — in fact, a research community — asking: how generalizable is this approach? Does this capture intelligent behavior, whatever we mean by that? What does the spec for this system look like? How should we evaluate it? All of this was actually crucial to the eventual successes we've had in AI with, say, computer vision and more recent things in natural language.
What assistance games really do is study that phenomenon at the level of abstraction where we're trying to express goals. They let us identify the difficulty in expressing goals, the ways that things can go wrong, and the limits on our ability to specify arbitrarily good behaviors — where "good" can be interpreted quite loosely.

Daniel: That's interesting. This is the AI X-risk Research Podcast, and you're at least associated with a bunch of researchers who are interested in reducing existential risk from AI. Do you think assistance games have any particular benefits when it comes to reducing existential threats from artificial intelligence?

Dylan: Yes, definitely. The benefits for existential risk, I believe, come through the extension of the benefits they have for near-term concerns with AI systems. We are getting very good at increasing AI capabilities. For a long time this was the bottleneck in deploying effective AI systems, and it was reasonable to have an individual researcher use ad hoc mechanisms to align behavior with their qualitative goals. When I worked in motion planning earlier in my PhD, it was my job to make sure that the tests we designed, the experiments we built, the simulations I ran, and eventually the work I did with real robots were generalizable and were what I wanted. It wasn't necessary to formally study that, because we were looking at AI systems largely under lab conditions. Out in the real world, getting the behavior exactly right — often on the first go — matters, and what assistance games do is let us study how to make that choice correctly. We've drawn a distinction between AI technology, which is the system itself, and AI research, which is the thing that happens outside the system and sets it up. At this point, the average grad student can specify a really wide range of interesting visual, language, or motion-related tasks and have them executed on a system; that's really where we've gotten to in AI currently. But if you're doing that in a lab, it's okay for the experiment to go wrong, to break some hardware; you get a lot of trial and error. It's a very forgiving — quite literally experimental — environment. As we start to put systems into the real world, we need to get better at building that infrastructure around systems, and studying that problem is how I think assistance games can overall reduce existential risk from AI.

Daniel: Am I right that the basic story is something like: before I think in terms of assistance games, I pick some objective that I want a really smart AI to pursue, plug it in, press go, and it does something absolutely terrible — enslaves humanity, say — to achieve its objective. After assistance games, I have a method of analysis that makes me notice the problems of specifying what I want, and points towards an AI that doesn't pursue things literally and checks in with me. Is that roughly how I should think of the benefits?

Dylan: Yeah.
One way to think about it is that supervised learning is actually a goal specification language. It's an effective goal specification language for, say, labeling images, and that's how it gets tied down into real systems: we optimize a utility function that it defines. This goal specification language has lots of nice properties: programming in it can be distributed across a wide variety of relatively unskilled individuals relatively cheaply, and we can optimize it to produce behaviors that are empirically good. What assistance games do is study the properties of this interface between humans, who have a qualitative internal goal, and systems that are going to be optimizing that goal. We can study these languages from the standpoint of empirical capabilities, but also from the standpoint of risk, safety assessment, and robustness. Assistance games identify the type of solution we want when we're building AI systems, and they let us characterize different classes of solutions — in particular, in terms of brittleness and forgiveness of mistakes on the person's behalf.

Daniel: On the flip side: if I'm a listener really interested in reducing existential risk, what should I not expect assistance games to help me with? How should I limit my ambitions for what assistance games are going to be really good at?

Dylan: I think assistance games are first and foremost an analytical tool. Primarily, they allow us to formalize certain value alignment scenarios and assumptions, and to study their properties. For near-term systems it's easy to take these formalisms and build them directly into AI systems, but it's not my claim that implementing those systems directly would reduce x-risk. Taking current AI technology and merely saying "ah yes, I am designing my system as the solution to an assistance game of some kind" will not meaningfully reduce x-risk on its own.

Daniel: Why not?

Dylan: Because tracking the types of uncertainty you would want to track for truly advanced systems — at the level where they would pose a threat — is beyond the capability of current inference systems. If you try to build an assistance-game-style solution into a system right now, I presume it will consist of some form of Bayesian inference tracking uncertainty about objectives and maintaining those estimates, combined with some form of policy optimization. The types of Bayesian inference we can do right now are not up to par for systems effective enough to pose a threat. Does that make sense?

Daniel: Yeah, I think that makes sense.

Dylan: That's not to say those types of systems won't be better for lots of other reasons. There are lots of benefits to leveraging uncertainty about objectives in short-term systems, and that can reduce short-term risks. The primary thing this can really do for long-term existential risk at the moment is identify areas of research that are valuable. In this case, it says we should be investing very heavily in uncertainty estimation broadly, and specifically in uncertainty estimation about utilities and utility learning.
I believe that analyzing these theoretical models can help point out directions for the kinds of solutions we will want to move towards with other technologies. It can point out things that are risks and things that are likely to fail, and it can shape rough, abstract ideas of what safe interaction with advanced systems would look like: what desiderata we can aim for, what targets to shoot for. But it doesn't directly give us a recipe for how to implement those targets.

Daniel: In that answer you alluded to short-term benefits. I don't know if any listeners care at all about what happens in the next hundred years, but if they do, what are some short-term nice things that implementing solutions to assistance games could provide?

Dylan: One of the big ones is that we're very limited with AI systems to optimizing for what we can measure. This creates a large structural bias in the kinds of systems we build: they're either designed from the ground up to measure very specific things, or they end up optimizing for shallow measures of value, because that's the best thing you have. Introducing assistance games into the way we think about AI can reduce that. If it becomes easier to optimize for more qualitative goals within AI systems, that makes it easier for, say, a system designer at Facebook to set a clear goal of "here are the types of content that we want to recommend as a company". That's very complicated to specify now, but you could imagine work in assistance games on managing the interaction between a company like Facebook and its system, allowing much richer goals than the types of behavioral feedback they currently have — and you can see inklings of them, and of other large recommender platforms, trying to move in this direction. The other side of what assistance games give short-term systems is that we build into the analysis an expectation of ongoing oversight and adaptation. That's something that actually happens in practice: it's only in academic settings that we collect our training data, hit stop, and then run on a special test set. In reality, your model is being trained on an incoming data stream, and you're adapting and adjusting in an ongoing fashion to changes in your goals as well as changes in the world. If we design systems as solutions to assistance games, we build that dynamic into the math at a much deeper level, and I think this can expose interfaces to systems that are much more controllable and understandable.

Daniel: All right — now that listeners are hyped about all the great things about assistance games, let's talk about them more concretely. The first paper I want to discuss is "Cooperative Inverse Reinforcement Learning", or CIRL for short. First of all, what is CIRL, and is it the same thing as an assistance game?

Dylan: CIRL is not the same thing as an assistance game; it's a subclass of assistance games, and it's intended to be the simplest base case of an assistance game. To that end, we built it as an extension of a Markov decision process (MDP), a mathematical model for planning used a lot in AI systems. In an MDP, you have states, actions, and a reward function.
You take an action in a state; this leads to a new state, and you get a reward that corresponds to it. From the human's perspective, everything is the same in cooperative IRL: the human observes the state, takes actions, and can accomplish things in the world. What's new is that there's also a robot player, a second actor in the game. You can think of it as a turn-taking scenario where the person goes first and the robot goes second, and from the robot's perspective there is partial information: the rewards for each state are a partially observed component that only the person gets to see. Our goal was to extend the MDP in the minimal way needed to account for the fact that there are two players in the world — the human and the robot — and that the robot does not know the person's objective.

Daniel: The name implies it might be a type of inverse reinforcement learning. Can you say a little about what normal inverse reinforcement learning is, and why we might want it to be more cooperative?

Dylan: In inverse reinforcement learning you're solving the opposite of planning, really — although there's a joke that "inverse reinforcement learning" is a slight misnomer, and if you go back and look at competing branches, there's a lot of related work under "inverse optimal control". But essentially, inverse reinforcement learning does the opposite of reinforcement learning, or planning. If planning is the problem of taking in a Markov decision process and producing a sequence of actions that gets high reward, then in inverse reinforcement learning you observe a sequence of actions and try to infer the reward function that those actions optimize for. Planning takes in a reward function and spits out a trajectory; inverse reinforcement learning takes in a trajectory and spits out a reward function. The relationship to cooperative IRL is that in cooperative IRL you're solving the same, or at least a very similar, problem: you see a sequence of actions from the person, and from those actions you need to infer the reward function that generated them. Now, if there's this deep similarity, why do you need something new? Why can't you just use inverse reinforcement learning to solve assistance games? One reason — a minor tweak — is that inverse reinforcement learning is typically formalized as learning the reward function from the person's perspective. In the standard setup, if you watch someone go through their morning routine — they get up, they wash their face, they go get a coffee, and so on — and you infer a reward function with inverse reinforcement learning and then naively optimize it, what you end up with is a robot that washes the robot's face and tries to drink coffee for the robot. Naively, it's formulated as imitation learning, which is only one type of solution to an assistance game, and we want to allow for the broader class.
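For reference, the formal object Dylan is describing — the CIRL game, as defined in the paper — is a two-player Markov game in which only the human observes the reward parameter:

```latex
% CIRL game (Hadfield-Menell et al., 2016): a two-player Markov game.
\[
M = \big\langle\, S,\; \{A^{\mathrm{H}}, A^{\mathrm{R}}\},\;
T(s' \mid s, a^{\mathrm{H}}, a^{\mathrm{R}}),\; \Theta,\;
R(s, a^{\mathrm{H}}, a^{\mathrm{R}}; \theta),\; P_0(s_0, \theta),\; \gamma \,\big\rangle
\]
% Both players maximize the same expected discounted sum of rewards
% \sum_t \gamma^t R(s_t, a^{\mathrm{H}}_t, a^{\mathrm{R}}_t; \theta),
% but only the human observes \theta \sim P_0; the robot must infer it
% from the human's behavior.
```

The game is fully cooperative: plain IRL corresponds to the inference sub-problem of recovering θ from the human's trajectory, while CIRL additionally lets the human's policy depend on the fact that the robot is learning.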
Dylan: What cooperative IRL says is: no, there are actually two agents in the environment, the person and the robot. If you're doing cooperative IRL right, you observe the person's morning routine, and then the next day, while they're washing their face, you're making coffee so that it's ready for them. You're not imitating them. This is a minor tweak — you wouldn't lose sleep over it — but it points towards a much deeper issue with inverse reinforcement learning, which is that it assumes the behavior is generated without the person knowing that the robot is watching. It assumes that the behaviors you take are done purely in service of the rewards you get within those behaviors. If you think about how you would behave if you knew your new robot toy was watching you, I think you can at least agree you'd probably do something different. In lots of scenarios, people adapt their behavior when someone is watching them in order to be more informative, or in general to better accomplish their goals. This dynamic is actually crucial for the kinds of problems we're worried about in an existential risk context. The reason it's crucial is that, as we build representations of our goals for increasingly advanced systems to optimize, we are the humans in this assistance game. It's not literally cooperative IRL — it's got multiple people, non-stationary rewards, partial information up the wazoo, all kinds of things — but ultimately we want to learn how to play that game well for us, and we are certainly adapting our behavior to what the system will eventually do. It's this incentive that I think actually drives a lot of alignment. Inverse reinforcement learning is basically a type of imitation learning; for more advanced systems we don't expect imitation learning to truly be an effective solution, so we're going to be relying on a communication solution to an assistance game at some level. And when you study those communication mechanisms in value alignment or assistance games, they only make sense if the person is taking into account the impact of their actions on the future behavior of the AI system. If you look at natural language, for example, as a way to describe your goals: under inverse reinforcement learning you can't actually describe your goals with natural language — you can just show the system how you would like it to talk.

Daniel: So somehow cooperative IRL provides an analytical framework for understanding communication between humans and AI systems, and we think that kind of communication is going to be really important.

Dylan: And specifically, it allows us to formalize the limits of intentional communication of our goals.

Daniel: What are the limits of intentional communication of our goals? I didn't get this from the paper.

Dylan: Oh, I don't know that this is necessarily in the paper. I think we're now talking about the class of things that were motivation for the paper — not the kind of thing you could get through peer review in 2015.
Sorry, could you repeat the question?

Daniel: We were talking about the limits of intentional communication of objectives in cooperative inverse reinforcement learning.

Dylan: Oh yes — the limits of intentional communication. Naturally, they come in through cognitive limits, which are probably one of the biggest ones.

Daniel: On the human part, or the robot part?

Dylan: On the human part. When we study assistance games, a "solution" — in air quotes, because it's not really the answer to how to play an assistance game — is this: assistance games really let you say, if the person plays strategy X and the robot plays strategy Y, how well did they do, based on some assumptions about what the environment looks like and what types of goals are present. If you're thinking about building an AI system that can integrate well, be robust to lots of different people, and with very high likelihood produce at least an improvement in utility, one of the things you really have to compensate for is the fact that the human policy is not arbitrary or controllable. It's limited by our biology, our training, and our upbringing. This means there's only so much information we can provide about what we want our system to do at any given point. In principle, when you specify a system, you are actually specifying a mapping from all possible observations that system could get to the correct action to take in response. To do that, you have to think about all of those possibilities in some way — and the fact is that we can't. We have cognitive limits on how many options we can consider, and even on what types of options we can consider. I cannot imagine truly strange, unfamiliar situations; I have to experience them, or think hard about experiencing them, or have analogies I can draw on. All of these are limits on the amount of information I can provide about my goals. Perhaps another way to explain this — and maybe this is the central way assistance games can help reduce existential risk — is that they let us identify that there are really two types of information that play into what a system is going to do, at least in the current paradigm for AI systems. There is what I'll call objective information, which is information about how the world will unfold, perhaps in response to different actions you might take. That's separate from normative information, which is information about how to value those different possibilities. What assistance games primarily give us is a decision problem with a formal representation of a limited normative channel. The limits on that normative channel need to be accounted for and balanced against certain types of risk — how quickly you change the world, say. In a sense, you need to regularize certain aspects of your behavior by the amount of information you have about goals.

Daniel: Speaking of information about goals: when people in the AI alignment space think about learning normativity, they're sometimes thinking of different things.
One thing they could mean is learning the specification of a specific task — for instance, "I want to build a really good CPU", which is complicated, but ultimately about one CPU — as opposed to getting an AI that learns everything I care about in my life and makes the world perfect for me. Which of these is CIRL aimed at analyzing, or could it be equally good at both?

Dylan: Actually, I think it's both. These are very different in practice — the things I need to specify for a utility function to be good at getting a CPU designed are actually quite non-trivial. I could tell a story where you need to optimize CPUs to account for environmental possibilities, because they're going to be used in bitcoin mining. Things can get more complicated than a simple first-pass analysis might suggest.

Daniel: I think people don't use CPUs much in bitcoin mining anymore.

Dylan: That is a very good point.

Daniel: But they are used for other cryptocurrencies — for some, I think, CPUs are still on the cutting edge of mining.

Dylan: I more meant to say that one of the structural issues in building AI systems is the presumption of narrowness based on relatively simple analysis, and even CPU design can get far more complicated than a graduate-level education would lead you to imagine. But putting that aside: the main difference between those two cases is really not what the person's goals are. It has much more to do with what the robot is capable of doing, what it's likely to be doing, and what environment it's likely to be in. In cooperative IRL, you would formalize the difference not by a change in the person's reward function within the model, but via a change in the environment the robot is acting in — and maybe some small things about the prior structure. The learn-about-everything case is actually the simpler model: you have a really broad set of environments the system could be acting in, and so through interactions with the person you focus on trying to reduce uncertainty about objectives overall. There are some interesting ideas on this — if people want to look at that kind of idea more, I'd point at Justin Fu's work on adversarial IRL, which captures some of these ideas about learning generalizable reward functions. Now, learning a much more narrow task in an assistance game is probably modeled by a robot whose action set is primarily related to things you can do in computer circuit design, together with a prior over reward functions under which the person cares a lot about CPUs in some way. Maybe there are ten different metrics we can say are relevant to CPUs, or we can identify some grouping of features of the world relevant to that environment, and say there's a high probability that those features matter in the utility function. Then I think the optimal thing for the robot to do is to learn about this very specific task — even though, in principle, it could be learning about the person's norms and everything
about their preferences, it's really the structure of where it will be deployed that leads to narrow learning. One of the things that's nice about assistance games is that that solution falls out of the definition of the game as the correct thing to do, rather than from more ad hoc reasoning about "this is a narrow task versus not". Because the system's incentives come from doing things that are useful, its information-gathering behavior is targeted at relevant information, given its capabilities.

Daniel: This gets into a few questions I have about the CIRL analysis. Part of the setup is that there's a prior over human reward functions — by prior I mean a prior probability distribution: your distribution over what the human's reward function is before you know anything. You mentioned this a little, but how should I think about what the right prior is? Presumably that's going to pretty heavily influence the analysis.

Dylan: Yes, absolutely. I think it depends on what type of situation you're analyzing and what you're looking for. If you're in a setting where you're providing a product to a large population or user base, then you'd think about that prior as something you'd fit in a more data-driven way. If you're doing worst-case analysis of AI systems, then you probably want as broad a set of possible utilities as you can have, and you look for systems that are effective against that overall slate. I'm interested in trying to think about what interesting and general priors would be, and how you could develop priors in a more rigorous way — I'd say that's missing from a lot of the current analysis of assistance games. There are practical questions about how you could learn priors if you're using assistance games or cooperative IRL as a design template for systems; I think that's a bit more clear. For analysis going forward, you'd like to talk about how priors are shaped by environments. Rohin Shah's paper, which looked at identifying prior information about reward functions based on the assumption that the world state, when the system was turned on, had already been optimized, captures some of these ideas, and I think it's really interesting to explore that direction: given certain sets of evolutionary pressures, what types of reward functions are we likely to see? That's probably a couple of PhD theses' worth of research, if not more. But that's a bit of how I think about the prior question. If I can add on top of that: there's the question of what your prior is — your distribution over the set of reward functions you're considering — but I've been thinking a lot recently about how part of the question is just determining a good version of that set. Very specifically, within AI systems, a lot of the failure modes we think about in theory for x-risk scenarios — and that I would argue we have observed in practice — stem from missing features in particular. So a lot of the question of how you come up with the right prior is partially about how you come up with the right features. That's a refinement of the question that feels — I don't want to say more interesting, but more specific, at least.

Daniel: Yeah — in Bayesian inference, you can recover from your prior being off by a factor of two, but you can't really recover from the true hypothesis not being in the support. Defining the support is where it's at.
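To make the objects in this exchange concrete, here is a minimal sketch — not code from any of the papers; the feature vectors, candidate set, and rationality parameter are all invented — of Bayesian reward inference from noisily-rational human choices, where the prior is a distribution over candidate weight vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature vectors for four available actions (made-up numbers).
PHI = np.array([[1.0, 0.0, 0.2],
                [0.0, 1.0, 0.1],
                [0.5, 0.5, 0.0],
                [0.0, 0.0, 1.0]])

# Prior: uniform over a finite candidate set of reward weight vectors.
thetas = rng.normal(size=(200, 3))
posterior = np.ones(len(thetas)) / len(thetas)

def choice_probs(theta, beta=5.0):
    """P(human picks each action | theta): noisily rational (Boltzmann)."""
    u = PHI @ theta
    p = np.exp(beta * (u - u.max()))
    return p / p.sum()

true_theta = np.array([1.0, -0.5, 0.3])   # what the human actually wants
for _ in range(20):                       # observe 20 human choices
    a = rng.choice(len(PHI), p=choice_probs(true_theta))
    posterior *= np.array([choice_probs(t)[a] for t in thetas])
    posterior /= posterior.sum()

estimate = posterior @ thetas             # posterior mean weight vector
print(estimate, true_theta)
```

A "narrow task" corresponds to concentrating the candidate set on features that plausibly matter for the deployment; and if the true reward depends on a feature no candidate contains, no amount of data recovers it — the support problem Daniel points at.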
Dylan: Yeah. And in practice, missing features means missing features for the system. If we go back to the Facebook situation: they have features for how much you're engaging with the website, but they don't have features for how much you regret engaging with it. There are ways they could try to identify that, but there's a whole process — think about the sequence whereby you would deploy a system and then integrate that feature — and that's really the gap where our current systems fail in a big way. You absolutely are going to run into unintended consequences; it takes a long time to discover those unintended consequences; and it takes a long time to integrate proxies for, or measurements of, those consequences into the system — whether that happens at an organizational level or at a direct "rewrite its objective" level. So value alignment problems and assistance games where you look at mechanisms for identifying new qualitative features of utility are something I've been thinking about a lot recently.

Daniel: Am I right that you co-authored a paper very recently about this?

Dylan: Yes — we looked at recommender systems specifically. It's called "What are you optimizing for? Aligning recommender systems with human values". It presented an alignment perspective on recommender systems, and did its best to document what I actually think is quite interesting: the existing public information about attempts companies have taken to better align their systems with underserved goals. It turns out that companies are making these changes and doing some of these things, and I think the value of assistance games here is giving us a category with which to identify these types of interventions and move them from features of practice to objects of study.

Daniel: Going back to the CIRL formalism: part of the formalism is that there's a parameter, theta, specifying which reward function, out of all the possible functions, the human actually has, and at the start of the game the human observes it. I'm wondering how realistic you think this assumption is — and if it's not realistic, how bad that is.

Dylan: Like all models, it's wrong, but perhaps useful. I think it depends on what you imagine that theta captures. There is a way to set up this type of analysis where theta represents a welfare function, more in a moral-philosophy sense. In that case, the connection between theta and human behavior might be incredibly complicated, but we could imagine that there is this static
component that describes what we really would want our system to do, in general, overall — and I think that's a reasonable use of the model. On the other hand, it leads to a lot of complexity in the policy, and perhaps the statement that the human observes theta at the start is no longer reasonable. So one of the assumptions in cooperative IRL that we've relaxed is this one: the fact that the person has complete knowledge of the objective seems perhaps fishy, especially if it's a static, unchanging objective. The other way to read theta is as more like how you're feeling day to day, in which case the staticness assumption is just wrong, but the behavioral assumptions we typically make are a bit more reasonable.

Daniel: If people are interested in reading that follow-up work, what papers should they read?

Dylan: That's a paper Lawrence Chan was the first author on, called "The Assistive Multi-Armed Bandit". In it, we look at an assistance game where the person doesn't observe theta at the start of the game; instead, they learn about it over time through reward signals. The simple idea: in cooperative IRL, if you might be making someone tea or coffee, you would see them choose which one they want, and then in the future know which one to make for them. But if they're an alien who's just come down from outer space, or someone from a society that doesn't have tea or coffee, or a child, that would be the wrong kind of solution. It would be very bad to assume that the thing the person chose is actually what they want, because we know they have to learn and experience things before their behavior is indicative of their goals. The assistive multi-armed bandit formalizes that kind of scenario, where solutions look like: the person tries out several options, learns what they like, and the system eventually learns from that. There are some really interesting things we identify about ways the system can actually help you learn, before it learns what you want — because at the start of the game, the person and the robot have similar information but different computational abilities.

Daniel: Interestingly enough, that setup seems closer to actual inverse reinforcement learning, where you're learning how to optimize a reward.

Dylan: Yes, actually. To truly do inverse reinforcement learning you don't want the cooperative scenario, but the inference problem we're solving there is in fact real inverse reinforcement learning — if I can be so bold. And if you're listening to this podcast, Stuart, I'm sorry.

Daniel: Maybe I should say in every episode that Stuart Russell is Dylan's advisor, and mine.
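A minimal sketch of the dynamic Dylan describes — invented for illustration, not from the paper; the option values, noise level, and the human's learning rule are all arbitrary assumptions — in which the human's early choices are weak evidence because the human is still learning their own preferences:

```python
import numpy as np

rng = np.random.default_rng(1)

theta = np.array([0.2, 0.8, 0.5])   # true value of each option (unknown to all)
human_est = np.zeros(3)             # the human's running estimate of theta
counts = np.zeros(3)

for t in range(50):
    # The human explores early, then exploits their current estimate.
    if rng.random() < 1.0 / (t + 1):
        choice = rng.integers(3)
    else:
        choice = int(np.argmax(human_est))
    reward = theta[choice] + 0.3 * rng.normal()   # noisy experienced reward
    counts[choice] += 1
    human_est[choice] += (reward - human_est[choice]) / counts[choice]
    # A robot treating these choices as expert demonstrations (plain IRL)
    # would lock onto whatever the human happened to try first; an
    # assistive solution instead weights a choice by how much the human
    # has learned by the time they make it.

print(human_est)   # late choices, not early ones, reflect theta
```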
Daniel: Going back to cooperative inverse reinforcement learning: it's a game, and games have equilibria — Nash equilibria, or maybe other types. Should our analysis focus on the equilibria of these games and expect humans to land in one of them, or should we fix a relatively simple human policy? How should I think about equilibrium analysis in CIRL?

Dylan: Carefully. I think the primary value it has is in identifying limits on the ability to communicate normative information. If you look at a model of a cooperative IRL game and compute the optimal human-robot configuration, and there are still aspects of the human's objective that the system doesn't know, that gives you information about which types of value alignment problems cannot be solved — where you need more interaction, or more information, before deploying to that kind of setting. There are practical short-term applications where the equilibrium assumptions are — well, I don't want to say totally valid, but where people do develop towards best responses to the systems they're using. I know some people think fairly intentionally about recommender systems: "I don't want to click on this, because if I do, the system will show me more things like it." And as someone who has built a lot of robots and trained them to do things, I can provide information that's better in some ways, just because I know what types of mistakes the inference is likely to make. So I think there's value — limited value — in this, and in general in understanding the ways people might adapt to exploit components of your learning algorithm, ideally in positive ways. Once you have non-cooperative settings, that also starts to be relevant.

Daniel: If you think about the original CIRL game: if you have a wide enough action space — one that includes the human being able to type on a keyboard — it seems like at least one of the equilibria is going to be that the human just writes down their reward function, then sits back while the robot does everything from there on. Which is kind of scary.

Dylan: Yes. That is an example of what I would call a communication equilibrium: you encode your preferences into a language, into symbols, where all of those symbols are unit cost from your perspective. What this means is that you can have incredibly high normative bandwidth — the limits on your ability to communicate your preferences are like Huffman coding, given the appropriate prior. But at the same time, it's an incredibly brittle solution. In a communication setting, you have an encoder — the person, encoding preferences into actions, in this case symbols — and a decoder — the robot, taking those symbols and re-inflating them back into objectives and rewards. If those are mismatched, you have no bounds on the performance of the resulting robot policy; in many encoding schemes, there are small changes that would lead to the system actually minimizing utility, or something like that. One way to think about this — we talked about equilibria, but I want to step back from equilibria and say that the interesting objects are different types of strategy pairs, which may or may not be in equilibrium. Say we're comparing two different strategy pairs. One is the strategy pair that imitation learning corresponds to: the person does a thing, and the robot imitates. You can still communicate a lot of information about goals that way — there are some limits on how much — but it's also an incredibly robust
communication mechanism. It relies on the assumption that people can do things like what they want — which is not a crazy assumption about people — and it explicitly accepts that they make mistakes along the way. To the extent that they make mistakes, those mistakes can be related to how bad the outcomes are, and so this gives you, at least in theory, some bounds on how bad you can be when you optimize for the objective learned through that type of mechanism. On the other hand, look at the more communication-style equilibria. Say we're doing 2D navigation: you're in a maze of some kind, certain maze squares are good, certain squares are bad, and the robot observes the person take some actions and then acts in a similar maze. The imitation-learning-style version of this problem is very simple to explain: the person figures out how to solve the maze as best they can, and the robot tries to copy that. If we're doing IRL with the right kind of prior, the robot can learn some features of what the person was seeking out, and perhaps improve on the person's performance. Now, another class of solutions: assume first that the person doesn't care about the training scenario — all the actions in the maze have the same cost. In many cases this is true: the person is taking actions purely to generate information the system can then use to go do useful things. Then another type of solution is to come up with an encoding: discretize the reward space, assign a reward value to each square, and have the person just move to the appropriate square on each episode. That has limits based on how finely you can discretize the space, but you can keep going: on successive episodes the person, rather than treating this like a maze, forgets that it's a maze and just uses it to transmit an encoding of their reward function.

Daniel: Should I be thinking of something like: on the first timestep they go to the best square, and on the second timestep they go to the second-best square? How is this encoding working?

Dylan: The point is that the encoding can be arbitrary — but yes, you can imagine something like that. And there are lots of parallel solutions here. Say the maze has only three squares: I start in one place, and I can stay, go right, or go left. I can use that to encode arbitrarily complicated things with my actions — just in binary.

Daniel: Just in binary, exactly.

Dylan: But there are lots and lots of different ways to encode my preferences into binary, and so now I've taken a preference specification problem and turned it into a preference-specification specification problem.

Daniel: Nice. Or terrible.

Dylan: And it's not clear that I'm much better off, from a risk standpoint. On the other side, if you do align with the system on this communication strategy, you can provide far more information — effectively arbitrary information. The only limit is that you be able to encode it somehow.
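A toy version of this encoding strategy — purely illustrative; the quantization scheme and bit convention are invented — shows both the arbitrary bandwidth and the brittleness Dylan describes:

```python
# The human encodes reward weights in binary using otherwise-meaningless
# maze moves (0 = "left", 1 = "right"). Made-up convention: each weight
# in [0, 1) becomes 8 bits, most significant bit first.
def encode(weights, bits=8):
    actions = []
    for w in weights:
        q = int(w * (1 << bits))
        actions += [(q >> i) & 1 for i in reversed(range(bits))]
    return actions

def decode(actions, bits=8):
    weights = []
    for i in range(0, len(actions), bits):
        q = 0
        for b in actions[i:i + bits]:
            q = (q << 1) | b
        weights.append(q / (1 << bits))
    return weights

msg = encode([0.75, 0.10, 0.40])
print(decode(msg))        # ~[0.75, 0.10, 0.40]: near-arbitrary bandwidth

# But the code itself is arbitrary. A decoder reading the bits in the
# opposite order recovers unrelated weights -- a mismatched
# encoder/decoder pair gives no bound on how bad the learned objective is.
print(decode(msg[::-1]))  # garbage
```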
Dylan: So I think this is a really big tension: robustness — how much the system can adapt to our behavior, which rests on what assumptions it makes about the details of our behavior — versus, at the other end of the spectrum, these purely communication settings, where the encoding scheme you choose is in some sense arbitrary, which means you can get incredibly informative codes, but perhaps brittle ones.

Daniel: Okay. Now I'd like to move to the next paper, "The Off-Switch Game". Can you summarize this paper for us?

Dylan: Sure. The idea behind this paper really came from two places. One: we were inspired by Soares et al.'s paper on corrigibility, and we were looking to translate some of those ideas into an assistance-game-style framework. The other side was that, at this point, my advisor Stuart Russell had been going around talking about existential risk within the AI community a fair amount, and he was discovering that this point about the ability to turn off a system was coming up with some frequency in his discussions — and that what felt to us like a relatively clear observation about incentives was not being recognized. Part of the idea behind the paper was to show very clearly that being given a task creates incentives to accomplish the task, and that includes accomplishing other things that enable maintaining the ability to accomplish that task. This is relatively simple to say in English, relatively intuitive, and frankly not that hard to show with math. The off-switch game, then, is a cooperative IRL formulation of an interaction where there's, as before, a human and a robot, and we assume the robot already has prior information about what the utility function is. The robot makes a choice between one of three actions. It has a candidate action that it can execute, so it can go ahead and directly execute it; it can decide that the action is a bad idea and just switch itself off; or it can say to the person, "hey, I'd like to take this action — do you think that's a good idea?" The person then takes a binary action: either say yes and allow the action to execute, or say no and turn the robot off. We analyze this under scenarios where the robot assumes it knows the person's objective, and scenarios where it has uncertainty about that.

Daniel: I think a lot of people thinking about existential risk from AI for the first time ask: why doesn't the human just turn it off? Why does the robot need to present an option?

Dylan: Because our goal here is to understand the system's incentive to bypass oversight. The setup of the game is such that the system has the ability to bypass the human entirely, and we wanted to model that component. That's why you have this slightly weird structure of the robot choosing whether to allow the person to turn it off.

Daniel: And I guess this models, in real life, the fact that if I'm really smart and I don't want you to turn me off, I can keep you away from the button, or something.

Dylan: Yes, it captures that. The way I like to think about it is: we want to understand the incentive for oversight, and in order to do that, you have to have an option for no oversight. The model attempts to include that in the simplest way we could think of.

Daniel: I think we haven't said it yet: when do robots have incentives to allow themselves to be subject to oversight?
Dylan: When you have two factors at play. One is uncertainty about how to evaluate its actions — in this case, uncertainty about the utility evaluations of states — combined with the knowledge that the person's behavior reveals information about that uncertainty. The key thing in this model that creates an incentive for oversight is that the switch-off event is correlated with the action you were about to take being bad, or at least worse than switching off. It's the fact that you don't know whether that action is good, combined with this correlation, that creates the incentive to give the person this capability.
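That incentive is easy to check numerically. Below is a made-up Monte-Carlo sketch, not the paper's model: the robot's candidate action has utility U drawn from its belief; acting directly is worth E[U], switching off is worth 0, and deferring is worth E[U · P(human allows | U)]. Note that the rationality parameter here (call it beta) is Boltzmann-style — larger means more rational — the opposite of the convention Dylan mentions later.

```python
import numpy as np

rng = np.random.default_rng(2)

def value_of_deferring(mu, sigma, beta, n=200_000):
    """E[U * P(human allows | U)] under the robot's belief U ~ N(mu, sigma^2)."""
    u = rng.normal(mu, sigma, size=n)
    p_allow = 1.0 / (1.0 + np.exp(-beta * u))   # noisy human: tends to allow good actions
    return float(np.mean(u * p_allow))

mu = 0.2                         # robot's mean estimate of the action's utility
for sigma in (0.1, 1.0):         # robot's uncertainty about that utility
    for beta in (0.5, 10.0):     # human rationality (larger = more rational here)
        defer = value_of_deferring(mu, sigma, beta)
        print(f"sigma={sigma}, beta={beta}: act={mu:.3f}, off=0.000, "
              f"defer={defer:.3f}")
# With real uncertainty (sigma = 1.0) and a near-rational human (beta = 10),
# deferring beats both acting and switching off; with low uncertainty or a
# very noisy human, the incentive to keep the off switch available shrinks.
```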
Daniel: One question I have about this paper: it uses an assistance-games-style analysis, but it's very lightweight — it's only one round, and the CIRL paper has much more math in general. How often do you think these kinds of pared-down versions of assistance games are going to be useful for analysis, and how often will we need the full machinery?

Dylan: I think they're both really useful; they just serve different purposes. The off-switch game analysis is really good, at a very abstract level, for asking: when we're designing systems, what are the features I should be looking at, what should I be paying attention to? But as you add complications to the model, you can start to ask more nuanced questions. The off-switch game is a one-round interaction; with Smitha Milli, I came back and looked at an extension of it that we called the obedience game, which is effectively a multi-round off-switch game. What we ended up showing there involves some of the issues with missing features in systems, but it also lets us identify the dynamics of this learning environment: over time, uncertainty about the objective goes down, and that leads to some pretty structured dynamics in how you expect things to behave early on versus later on. And if you know you're involved in this game, you might take different actions early on to communicate more information for later rounds. All of these are interesting facets you want to be able to complicate the model towards. So these really simple, short, clear models — at the level where you could explain them to your non-technical friends — that's where the off-switch game has a lot of power. But it's also really useful to be able to add other things to the model, in particular to look at the sequential dynamics and how things change over time.

Daniel: Again about the structure of the paper: it seems like part of why you wrote it was to convince a bunch of people of a point. Did it work?

Dylan: Not really — but I think not because the paper didn't make an effective case; more because the people we were arguing with moved on to other focuses. In 2015, AI safety and concerns about existential risk were very poorly known within AI research circles. Over time — arguably, partly as a result of this paper and things like it — that has changed, to the point where this isn't as much of a taboo subject. At the same time, I think a lot of the people we were arguing with were motivated by a desire to maintain research funding by avoiding another AI research winter — a period when a lot of AI funding got cut because the field couldn't live up to the hype from the '80s. Other folks were generally concerned with maintaining the ability to design and build the AI systems they're excited about, and with driving that research forward. If those are your goals, you aren't focusing on the x-risk community these days, because there are a lot of other people pointing out the current problems with these systems, and I think that tends to be the focus of those folks for the most part.

Daniel: We've given what I feel comfortable calling our side of that discourse. If people are interested in what it's like from the other side, which people are we talking about, so that listeners can try to hear their take?

Dylan: I think you could look at Rodney Brooks, who wrote a long blog post in response to the off-switch game, in effect — that might be a good place to look for a more well-thought-out critique. The other place would be the several debates Stuart was involved in back in 2014-2015; the best thing to do would be to try to find those, although I don't have any references for them.

Daniel: We'll try to link at least the blog post in the episode description and the transcript. Let's get back to the paper itself. At the outset we talked about assistance games as an analysis framework, but the paper has a section about manually modifying the robot's uncertainty over what the human's reward function is — basically because the more uncertain the robot is, the more it's going to defer to humans. That seems in tension with the pure analysis frame, at least to me. How does that tension resolve?

Dylan: I think that was a bit of foreshadowing for where we were going to head with the obedience paper that we did next. The way I was looking at it was really about bringing in some degree of strategic uncertainty. Within that analysis, we assumed the robot knew beta — the parameter for how rational the person is. We had it set up so that, I believe, beta equal to zero corresponded to a rational person, and as it increased the person would make more and more errors. All of our results analyzed the trade-off between beta and the robot's uncertainty about utility, and what we identified is that there's a trade-off between those which ultimately determines whether the robot chooses to interact with the person, or to take its best guess and see that out. What we wanted to look at was a scenario where you're building a system that is going to interact with someone, and as the designer — or, you could think of it from the robot's perspective — you don't really know what beta is. So we looked at: if I have to guess what beta is, is it better to guess too smart or too dumb, and what are the different types of errors you get from each? I believe what we showed is that if
Kind of related to that: I think one thing the paper suggests is a model of what you might call calculated deference. You have an AI system, and if it implemented a solver for this game, what it would do is say, "okay, the human's trying to shut me off — how much do I believe that this is really informative, versus how sure am I that the thing I want to do is really the right thing to do?" To me, at least, there's something there I get a bit nervous about. It seems like it might be brittle to an AI system being wrong about what I want, or about how my actions relate to my preferences, and I get worried about mistakes that can't be undone. I might hope instead for some kind of uncalculated deference, where the robot just does what I want even if it thinks it's a bad idea, but is otherwise rational and reasons well — though I'm not exactly sure how that would work. I'm wondering if you have comments on this difference, and what your analysis might say about less calculated deference.

So if we pop back up from the off-switch game to the more general cooperative IRL perspective, the off switch characterizes a class of solutions to cooperative IRL games — at one point I tried to be somewhat formal about this — where there's a clearly identified signal in the environment, with some properties of the robot behavior and the human behavior such that the human has the ability to access it at all times, and a certain robot behavior that follows from sending the signal. The question cooperative IRL lets us ask is: based on certain assumptions about the environment, the space of possible utility functions, and the limits on human behavior, is it a good idea to implement a strict off-switch solution, a soft off-switch solution, or a no-off-switch solution? Different scenarios will call for different settings of that in general. To talk more about this idea of calculated versus uncalculated deference in particular: you should probably include a link to the obedience paper — "Should Robots Be Obedient?" — in the description as well, because that paper was intended to get at this question specifically.
What we did in that paper was look at an effectively repeated off-switch game, where within each round the person sees a set of actions — in this case we had ten — and tells the robot which one they'd like it to take; the robot observes this and can then take any action it wants. What we can compare is the performance of a purely obedient policy, which would be your uncalculated deference, against calculated deference. If you set this up — we just assumed some prior distributions over features for actions and weights for utility functions, nothing very fancy — you see a fairly predictable result, which is that according to this model, calculated deference is better than uncalculated deference, because you get more utility. If you understand the mathematics of these problems it's very clear why that happens: you're comparing the optimal policy with something that is possibly not the optimal policy, and, in fact, if you assume the person makes mistakes, is guaranteed not to be the optimal policy.

As researchers, we looked at this and it creates a bit of tension, right? An off switch — the ability to turn a system off — is a specification of desired behavior, and we seem to think we want that, yet from this analysis it seems like you certainly wouldn't. And I can definitely create scenarios, like a self-driving car, where you don't want it to wait for interaction, because your effective beta — as you page back in, context-switch, and start figuring out how to drive the car — will make you worse than whatever its best guess would be. So we wanted to understand this and formalize a bit of where this intuitive notion comes from, and to do that we investigated the game in the context of missing features, where you have components of your utility missing. To be more specific, we looked at variants of this problem where at one end of the spectrum we hid almost all of the features that determine utility from the system, and at the other end we introduced a bunch of distractor features. What we showed was that if you're missing features, this optimization strategy goes haywire: if you're missing enough, it can do crazy things like disobey the person on the first round, because from the robot's perspective you might select a dominated action, and in general you never really recover. If you have distractor features, it's fine, because they're noise and you eventually rule them out. But if you're missing features and you try to optimize, you can end up just being confidently wrong.

So the way I've come to understand calculated versus uncalculated deference is that the difference comes down to how good a job you think you've done at identifying the relevant features for this decision. If you think you've done a good job — presumably the self-driving-car scenario roughly fits this — then calculated deference is what you want; and if you haven't done a good job, or you're in a scenario where you're much less clear, then you want, at least for a long initial period, that kind of just general deference.
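Here is a toy version of that comparison — my own crude stand-in, not the paper's model. The features, the update rule, and all of the numbers are invented for illustration, with deleted features standing in for the misspecified-model case.

```python
import numpy as np

# Toy version of the obedience comparison (illustrative; the paper's setup
# differs in detail). Each round the human asks for their Boltzmann-best of
# ten actions. An "obedient" robot complies; a "calculated" robot overrides
# with the best action under its own reward estimate, which here is fit with
# some reward features deleted to stand in for a misspecified model.

rng = np.random.default_rng(1)
n_actions, n_feats, n_rounds = 10, 6, 200
phi = rng.normal(size=(n_actions, n_feats))   # action feature vectors
w_true = rng.normal(size=n_feats)
true_r = phi @ w_true

def boltzmann_choice(rewards, beta=3.0):
    p = np.exp(beta * (rewards - rewards.max()))
    return rng.choice(len(rewards), p=p / p.sum())

for n_missing in [0, 4]:
    vis = phi[:, : n_feats - n_missing]       # features the robot can see
    obedient = calculated = 0.0
    w_hat = np.zeros(vis.shape[1])
    for t in range(n_rounds):
        ask = boltzmann_choice(true_r)        # noisy human request
        obedient += true_r[ask]
        # crude "inference": nudge w_hat toward explaining the request
        w_hat += 0.1 * (vis[ask] - vis.mean(axis=0))
        calculated += true_r[np.argmax(vis @ w_hat)]
    print(f"missing={n_missing}: obedient={obedient/n_rounds:+.2f} "
          f"calculated={calculated/n_rounds:+.2f}")
```

In runs like this, the calculated robot tends to beat obedience when it can see every feature and to lock onto a wrong action when it can't — the "confidently wrong" failure described above — though a toy this small is only an intuition pump.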
And this falls out as the optimal strategy. Right — because if you want to do inference in a world where the system can learn about the missing features, that means its hypothesis space of possible reward functions is necessarily much larger, and so you have to provide more information in order for the system to be useful.

It seems like this suggests a problem. Suppose I'm an AI and I have a human overseer, and sometimes the human overseer does things that seem irrational — I can't make heads or tails of how this could possibly be rational. One way this can happen is that the human just is kind of irrational. Another way it could happen is that I'm the one who's irrational: the human knows some features of the environment that I don't, and therefore the human is taking what looks like a dominated option, but it's actually really good on some axis I just can't observe — that's why they're trading off against things I can tell are clearly good. It seems like you would ideally want to be able to distinguish these two situations.

You would like to be able to distinguish those situations. Is it possible, though? At least in some cases, no. It comes down to your joint prior over possible utility functions for the person and possible meta-strategies, where a strategy maps the person's preferences into their behavior. What you're saying is: I'm seeing a person behave, and my best estimate of a utility function that describes that behavior is bad, so am I missing something? Well, it's possible you're missing something in the sense that there are observable properties of the world which have mutual information with the person's future actions, and you could certainly learn those. But whether those are rewards or not is fundamentally kind of arbitrary — at least, it's arbitrary at the level of abstraction we're talking about.

Yeah — if you can only observe behavior, you can pack things into either policies or reward functions. Exactly. I think this calls back to the point I made about the difference this makes for objective information about the world — objective in the sense of true and identifiable. Let's say there's some pattern of neuron spikes that really predicts my behavior in the future. The causal relationship could be determined. But whether those spikes are good or bad — whether they're evidence of me thinking about the consequences of my actions and planning a sequence of things to accomplish some internal representation of a goal (maybe those spikes are anticipation of ice cream, so they're a representation of my goal, at least locally), or whether they're just the trigger for some tic that I don't enjoy and don't have the ability to stop — from an observational standpoint, unless you make some type of normative assumptions, which is to say unless you bring some additional source of normative information to bear, distinguishing those two based on observations isn't really possible.

Yeah, it's a tricky situation. I tend to think this is a really core component of alignment challenges within AI.
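One standard way to see the unidentifiability — my gloss, not a result quoted from the papers — is that under a Boltzmann observation model, behavior only pins down the product of the rationality parameter and the reward scale, so "more rational with weaker preferences" and "less rational with stronger preferences" look identical:

```latex
P(a \mid w, \beta)
  = \frac{\exp\!\big(\beta\, w^{\top}\phi(a)\big)}
         {\sum_{a'} \exp\!\big(\beta\, w^{\top}\phi(a')\big)}
  = P\!\big(a \mid c\,w,\; \beta/c\big)
  \qquad \text{for any } c > 0 .
```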
And I think there's an additional feature I'll add, which is that there are, I believe, some unavoidable costs to normative information. If normative information is generated by people choosing to invoke potentially cognitively costly routines — to think about what they want, what type of person they want to be, and what type of world they want to live in — then there are actually limits to the amount of information you should get in, say, an optimal cooperative IRL equilibrium: the solution does not involve fully identifying the person's utility function, even if that's possible, because there are unavoidable costs to them running their brain to generate that information. Those costs could be direct time costs, but also indirect psychological costs. That idea of normative information, and fundamental limits on it — I think a gut feeling that there are limits on the amount of normative information we can provide is a lot of what drives our concerns about existential risk: we expect that there is this imbalance.

I think there probably are interesting questions I could ask about that, but I can't think of any, so I'm going to plow straight ahead. Yes, please do. So, to wrap up this section a little: one thing you mentioned as an inspiration for this paper was the Soares et al. paper on corrigibility. Corrigibility is a term that gets used in the AI alignment space in particular — I think Paul Christiano talks about it a lot. What do you think the relationship is between this paper and corrigibility as it's thought of in those spaces?

I think they are very similar models of the world that operate on different assumptions about agent beliefs. The primary difference I see between corrigibility and the off-switch game is that corrigibility doesn't include a reference to a human agent in the environment, and this makes a very big difference, because that's where we get our source of potential information about utility. I think one of the key assumptions in their model is that the belief structure of the agent is such that it cannot acquire more information about utility.

You're thinking of the Soares et al. paper, is that right? Yes, the Soares et al. paper, where the primary result they get is that the only way to get a system to choose this type of oversight is through a type of indifference between utilities — which is the solution you get under the assumption that utilities are fully observed, or that all information about utility has already been collected. So which one of these results you take into systems of the future depends on what you think the belief structures of those systems will be, and what their relationship with incentives will be.

I mean, if you model systems as learning about the reward function from human behavior, at some point they've learned all they can learn from human behavior, right? At some point you use up all the bits, and human behavior becomes basically probabilistically independent of your posterior on the reward function. So it seems like you do end up in this Soares et al. world — is that fair to say? If you include the possibility of drift in preferences, for example, then that's not necessarily true. Yep. Well, I don't know — what we're assuming about these agents is actually quite unclear, if I'm being fully honest. We're making different sets of assumptions about what's possible in the belief structures.
Arguably, if you had set up the inference correctly and you truly reduce the uncertainty, then you actually do not need an off switch, and I think if your modeling assumptions are such that you've reached that point, and you model a system as behaving optimally given that information, then the behavior you'll get is quite clear. I tend to think that a lot of these results point out that creating systems whose limiting behavior is a fully rational policy is probably a mistake.

But it's hard not to do that, isn't it? I think it depends. One of the most effective things we've discovered in the field of artificial intelligence has been the mathematics of goals and practical computational routines for goal achievement, and I think this shapes a lot of what we think about for AI, but I don't think that's actually where we have to head in the long run. One of the things we're learning is that goal management systems are perhaps as crucial to our qualitative notions of intelligence as goal achievement. We have a strong bias to focus on rational behavior and goal achievement as the definition of intelligence, when I think it's more that goal achievement was the part of intelligence we were worst at in the 60s and 70s. As we've developed good computational mechanisms for that, I think we will step away from trying to build systems as fully optimal Bayesian agents. In many cases you're already seeing a lot of systems move away from that kind of design — perhaps composed of agents in some ways, but, you know, GANs are not Bayesian reasoners.

What are GANs? GANs are generative adversarial networks. They are a way of modeling, let's say, images: say you want to create a function that generates images that look like natural images, or like a data set of images. What GANs do is define a problem where one agent is trying to generate images, and another agent is trying to determine, between a generated image and a real image from the data set, which one is real. This creates a type of adaptive loss function for the generator that can lead to very effective image modeling and very photorealistic image generation. As we move in these directions — these types of adversarial learning approaches, or learning approaches that are robust to things like data poisoning and other real-world complications that come from deploying these systems — I think we're actually getting further from systems that look like a direct Bayesian agent. Goal achievement will be a part of artificial intelligence going forward; it's not like that part will go away. But if you're looking towards the future, at least a future where we have successfully built aligned and safe AI systems, I think as much if not more of the artificial cognitive architecture is about goal management as about goal achievement.

So I have a final question about the off-switch game. This is partially a question, but it's actually mostly a beg for listeners to work out a problem that I can't figure out — maybe you can. If you think about the off-switch game, the scenario is: you start off with a state — imagine it's a small circle — and this is the state where the robot is considering whether to do an action, turn itself off, or let the human decide. So you've got this initial state where you're deciding what to do, and from that state you can imagine an arrow to a state, maybe down on the left, where it does the action; an arrow to a state down on the right, where it turns itself off; and an arrow going straight down, to a state where it lets the human do either. Then from the state where the human could do either, you have an arrow to the left, where the human says "yep, you can do that action" and you're in the world where the AI got to do the action, and an arrow to the right, where the human turns it off, and it's turned off just as if it had turned itself off. This is exactly what the picture of a product diagram looks like in category theory: you have a map from some set at the top into a product A × B at the bottom in the middle, and that map factors through the projections — from A × B you can go left to A, or right to B.
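For reference, here is the universal property being gestured at — standard category theory, my rendering rather than anything in the paper. Read S as the initial state, A as "the action happens", B as "the system is off", and the wait-for-the-human state as the pairing map into A × B:

```latex
% Universal property of the product A \times B:
\forall\, f_A : S \to A,\; f_B : S \to B,\quad
\exists!\ \langle f_A, f_B\rangle : S \to A \times B
\quad\text{such that}\quad
\pi_A \circ \langle f_A, f_B\rangle = f_A
\quad\text{and}\quad
\pi_B \circ \langle f_A, f_B\rangle = f_B .
```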
Is this just a coincidence, or is there something deep going on here? I can't figure out what it means, but I feel like there might be something — or maybe it's just a coincidence. Do you have any thoughts about this?

I am wary that it might be a coincidence. Yeah, there aren't that many graphs with four nodes. Right. But let's say you were my PhD student and you came into my office and said, "I think this is the research problem: you did the off-switch game, I read it, we're working together now, I want to build on it, and I've noticed this weird thing about category theory — there's at least a graph isomorphism here." I think my answer would be to warn you that there very well could be nothing here. I'm not that well calibrated with these estimates, but my hunch would put it at like 60 to 80 percent — call it 70 — that it's a red herring kind of thing; but that leaves enough that I would be interested in looking into it. If I lean into trying to think of the ways in which it could be true: perhaps it means there are interesting ways to look at the joint human-robot system as a product of human and robot behavior, or perhaps certain types of interactions can be described as the human and the robot functioning in sequence versus in parallel. So maybe the question to ask is: if this is a product, what is addition? What is the off-switch or assistance-game representation of an additive interaction?

So I spent some time thinking about this, and the closest thing I got to — I guess the category-theory coproduct, which is kind of like the disjoint union of sets, which is sort of like addition — is that you can kind of tell a story where the robot being transparent ends up being a coproduct, but it's not super convincing. I might leave that to the listeners, to see if there's anything there — if there are interesting ideas, send them our way.

Speaking of interesting ideas, we're going to move on to the third paper we're talking about today, inverse reward design, which you worked on with Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Can you summarize this paper for us?
Yes. If you were to apply cooperative IRL in scenarios other than human-robot interaction, inverse reward design is probably where you might start. What it looks at is a cooperative IRL interaction: we've got two players, the human and the robot, in what is now a two-phase game. The person goes first and selects a proxy reward function of some kind, given an observation of a training environment — so the person gets to see the environment the robot's in and picks a proxy reward function. Then the robot goes to a new environment — the person doesn't know which one that will be — and the robot's goal is to maximize utility in this new deployment setting, potentially taking multiple actions.

What this is meant to capture is the following idea. Through cooperative IRL and the off-switch game we're arguing that uncertainty about objectives is important — okay, great, we'll take that point — but there's still a reason why we specify things with objectives. Put differently, there is a reason why the field of artificial intelligence happened upon reward functions as a way to communicate goals, and from our perspective this means that proxy reward functions are perhaps an information-dense source of evidence about the true reward, which certainly makes a lot of sense. So inverse reward design was our attempt to say: okay, we have observed a reward function; we know we should be uncertain about what the true reward is, so we're not going to interpret it literally; but then that raises the question of what type of uncertainty you should have about the true reward, given an observed proxy. Inverse reward design is our attempt to answer that, and the extra information we bring in to structure this inference is that notion of a development environment. You didn't get this proxy reward function out of the blue: it was designed in the context of a particular environment, and that is what gives us the leverage to do inference with.

So should I imagine that as: I have some robot, and when I'm writing down the reward function I'm imagining what it would do with various reward functions — should I think of this proxy environment as how I'm imagining the robot would behave for different reward functions, in a sense?

The way I think about this is: let's say you're a large company — the new company that's going to create household robots. It's going to be some variant of the PR2, a robot with wheels and two arms that can move around people's houses, and you as a company are building this to help tidy people's living rooms and do things like that. What's your practical strategy for doing this? Well, if you have the resources, you'll build a training scenario — a giant warehouse whose insides represent your attempt to cover the space of possible home environments your system will be deployed into. You go ahead and build this, and then you hire some roboticists and say, "here's a loose spec of what I want; make this happen in these environments." So these are the design-iteration environments you're working with, and what happens is that the designer identifies incentives — a reward function along with an optimization and planning approach — such that the behavior is good in that set of test environments. What we're saying is: now, when the robot leaves that very controlled setting and goes out into the broader world, what is a good, principled way to be uncertain about the things that might be missing from your objectives? Inverse reward design formalizes that inference problem. And I guess it's the inverse of the reward design problem of picking a good reward. Precisely.
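Schematically, the inference the paper sets up looks roughly like the following — my paraphrase of the model, with β, the feature map φ, and the trajectory ξ following the usual linear-reward conventions. The designer is modeled as noisily choosing a proxy reward w̃ that induces high true-reward behavior in the development MDP M̃, and the robot inverts that with Bayes' rule:

```latex
P(\tilde{w} \mid w^{*}, \tilde{M}) \;\propto\;
  \exp\!\Big(\beta\,
    \mathbb{E}\big[\, w^{*\top}\phi(\xi) \;\big|\; \xi \sim \pi(\cdot \mid \tilde{w}, \tilde{M}) \big]\Big),
\qquad
P(w^{*} \mid \tilde{w}, \tilde{M}) \;\propto\;
  P(\tilde{w} \mid w^{*}, \tilde{M})\; P(w^{*}) .
```

The key property is that any true reward inducing the same behavior as the proxy in the development environment explains the observation about equally well — which is exactly why the posterior stays wide on features the development environment never exercised.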
So you run these experiments with an agent that tries to solve this inverse reward design problem, and there are various things you could say about that, but one thing I'm wondering about — for things that are either explicitly solving this problem well, or somehow trying to reach the optimum, whatever that might be — is how predictable they would be. For context: there's this computer program I use called Emacs, where you type letters and they show up on the screen. It doesn't try to guess what I mean, it just does what I said, and for me it's very easy to reason about. There's also this website I use called Google, where I type things and it tries to do the most helpful thing given what I typed — it seems like it's doing some sort of inference on what I meant and trying to satisfy that. Emacs seems a lot more predictable to me than Google, and I kind of like that property, and predictability seems like a really desirable property for high-impact AI systems. So I'm wondering: how predictable do you think systems would be that behave somehow optimally for the IRD problem?

That's a very good question. With predictable systems you're relying on the person a lot more, in a way. It doesn't have to be the case, but if you have a diverse set of behaviors that you want to be predictable, then you need enough information to pick out the one that will happen in the future. And I think the reason we're in this mess is that for complex settings — the types we see in artificial intelligence — "predictable" often means predictably bad. There's a sense in which the range of things Emacs can do is much, much smaller; though if you really get into the craziness of it and deeply configure it, I'm sure you can get some really unpredictable behavior out of it.

Yeah — it can be an operating system, and you can use Google through it.

Right. So that's where the tension is — the way predictability can be a double-edged sword. Now let's talk about how predictability comes into inverse reward design. It turns out there's an interesting problem that comes up when you want to actually use the results of inference. It's a little technically involved, but actually quite interesting, and I think related to this problem, so bear with me a bit. When you're doing inference over preferences with the kinds of models we're using, there are certain components of the reward function that you're never able to fully identify.
In this case we're using a Boltzmann distribution to define the human's rationality — to define their behavior — and this means that all reward functions, if you add a constant to them, end up being the same. Okay, all well and good, until you do inference, produce a set of these reward functions, and now want to maximize expected utility. It turns out that if you do this directly, you actually do a really bad job, and the reason why is sort of subtle. It's because every reward function you infer is exactly as likely as that reward function plus c, for every possible value of c. Now, your inference procedure will not find every possible value of c; it will end up at some but not others, and because you're doing Bayesian inference in high-dimensional spaces, which is challenging, you'll end up arbitrarily setting a lot of these constant values for your different estimates of the reward functions. Then, if you naively average them together, that noise in your inference can have an outsized effect on the result.

So it seems like — let's say during inference you randomly sample ten reward functions and get their relative likelihoods, and the reward functions have different constants added to the reward of every single state. If I take the expectation over those, then it's like taking the expectation with all the constants at zero and then adding the expectation of the constants, because expectations are linear. So wouldn't that not affect how you choose between different actions?

In theory, no — and with enough samples, no, because those averages would cancel out. Even with ten samples, though? Even with ten. Let's say you're running Markov chain Monte Carlo to do inference. That constant value will just be going on a random walk of some kind, and the point where it reaches its minimum will be unrelated to the actual likelihood of the reward at that point — so this could be a really good reward that gets driven down a lot. So it could be, say there are two actions — action one and action two — and we're doing inference over which one is better, and it randomly so happens that for the reward functions where action one is better, the constants end up being negative.

Oh — so we're doing the sampling separately per action, and that's why the actions are getting different constants?

The actions have the same constants within a reward function. It's just that when you're comparing two reward functions, you're going to be comparing reward function one on action one versus reward function two on action two, which decomposes into some real reward value plus two different constants, and the values of those constants can matter more.

But where do they end up mattering? They don't end up mattering for the likelihood that the person takes the actions, right? Because you mentioned that Boltzmann-rational people aren't sensitive to constants. This matters when the robot is optimizing its estimate of utility. And one thing is that these issues get really exacerbated when you're trying to do risk-averse trajectory optimization, which is kind of where this is all headed in the end.
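A toy illustration of the pathology — my example, not the paper's experiment:

```python
import numpy as np

# Boltzmann likelihoods are unchanged by adding a constant to a reward
# function, so posterior samples carry arbitrary offsets. A risk-averse
# planner that maximises the minimum sampled value then defers to whichever
# sample happened to drift lowest, rather than to genuine disagreement.

rng = np.random.default_rng(2)
n_samples, n_actions = 10, 3
r = rng.normal(size=(n_samples, n_actions))      # sampled reward functions
c = rng.normal(0.0, 5.0, size=(n_samples, 1))    # unidentified constants
shifted = r + c

# Typically one sample (the lowest-drifting one) is the minimiser everywhere:
print("argmin sample per action:", shifted.argmin(axis=0))
print("risk-averse choice      :", shifted.min(axis=0).argmax())

# Fix: pin the free constant at a point all hypotheses agree on, e.g.
# "the fallback behaviour (action 0) is worth 0 under every reward function".
normed = shifted - shifted[:, :1]
print("argmin sample per action:", normed.argmin(axis=0))  # real disagreement
print("risk-averse choice      :", normed.min(axis=0).argmax())
```

Before normalization, the per-action minima all tend to come from the single sample whose constant drifted lowest, so the "worst case" reflects MCMC noise rather than the data; pinning every hypothesis to agree at a fallback point makes the minima reflect actual disagreement about actions, which anticipates the fix described next — including the way the planner retreats to the fallback when disagreement is large.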
You actually might be right that in expectation they all cancel. Yeah, maybe. Let's talk about risk-averse trajectory optimization. What ended up happening was that I tried to do risk-averse optimization for utility functions, and it totally failed the first time, and it took me a long time to figure out why. In practice it was because, when you're looking to maximize reward for, say, the minimum reward function in your hypothesis space, that minimization is more often determined by the constants from the reward-function inference than by the actual reward values themselves. It turns out that in order to fix this problem, what you have to specify is a point that all the reward functions agree on — a way to account for this unbound parameter, so that the constant is the same for everything — and standardize it in some way. And this standardization specifies the fallback behavior when the system has high uncertainty about reward evaluations. For me this was really interesting to see, because it showed there's a no-free-lunch kind of component here: if you want the system to do something different when it doesn't know what to do, you have to say what to do in that case — it can't figure it out, because it doesn't know what to do. And this is very clearly, at a low level in the math, telling you: here is the point where you put in the predictable part of what the system will do at deployment time. So what risk-averse optimization brings in here is effectively a check — are the trade-offs in my new environment similar to the trade-offs I saw in my training environment, and if not, what should I do?

You said this comes up when you're using risk-averse planning. Why use risk-averse planning as opposed to expectation? Maximizing expected utility, man, that's like the thing to do.

Well, as a practical matter, maximizing expected utility isn't going to do much different from optimizing for the literal proxy you've gotten. It does actually change some things, because you might get a proxy reward that isn't the most likely one.

So in the paper you talk about this gridworld with grass and dirt and lava; I think talking about that makes this clear, so can you introduce it?

Yeah. We hypothesize that there's a 2D robot that's going to navigate some terrain, and in the development environment the story is that the designer wants the robot to navigate that terrain. There are really three kinds of things: regular dirt paths, grass, and pots of gold — picture dirt paths going through a park in some way, with pots of gold in the park as well. The high-level goal for the system is to navigate to the pots of gold quickly, staying on the dirt where possible, but maybe taking shortcuts if that saves a lot of time. So, if you recall, this is our development environment: the robot's going to get an objective in that setting and then go to a new environment, and we want to capture the possibility that there's something here the designers didn't foresee or didn't intend. In the story, what they didn't realize was that this robot was also going to be deployed across all 50 states in the US.
And one of those states is Hawaii, so there's another important terrain type, lava, which you haven't thought about as the designer — so your reward function doesn't provide an accurate assessment of utility in that case. Now, in doing inference, what do you do in this setting? The really simple, intuitive version of what IRD does is it says: it doesn't matter what reward value lava had for what I did in the development environment — there were no instances of lava present, so changing the reward evaluation of that state would not have changed my behavior, and so I don't expect the designer to have thought very hard about what that reward evaluation is. Then, when you get to the deployment environment and there is this state, that is a principled reason to distrust your inferences about it. So now we have this uncertainty distribution — we know what reward functions the person could really have meant — and it has a lot of uncertainty about how good lava is. Planning for that in expectation, you might still plan to go right through it, because if you plan in expectation, you are effectively assuming that the states the designer didn't think about are equal to their prior value, which in expectation is a good idea. It very well could be that rather than being deployed to Hawaii, you were deployed to some magical land where this is a magic carpet that transports you instantly to the gold, or something like that — from the standpoint of this robot, that's just as plausible as something catastrophic like lava.

Yeah — although in that case it seems like the problem is one of not knowing the dynamics, rather than the reward.

Yeah — let's suppose that you could drive through lava, and the reason it's bad is that we humans might want to touch the robot afterwards, and if it went through lava we'd burn our hands or something.

Sure, that sounds good. So what risk aversion does is allow us to take advantage of that uncertainty and adapt our response. What it says is: if there are strategies you can take that don't go into this uncertain area — if you can, as a heuristic, just avoid that uncertainty, even if that increases the path length you have to travel — that can be worth it. Intuitively, I think the reason this makes sense, if you're someone who thinks we should be maximizing expected utility overall, is that what we're really doing is bringing in some prior knowledge about the priors on the types of failures there can be. We used a Gaussian prior over rewards, but that's actually because we were being lazy; we should really be looking for something with appropriately heavy-tailed failure modes. We could try to represent that, and it might be an interesting structure to bring into play; risk aversion lets us do it without having to be very specific about what those priors are.

Heavy tails by themselves wouldn't do it if the tails are symmetric. Yes, you'd have to have heavy lower tails — heavy one-sided — and now you start to get to a point where, mathematically, you're playing around with things to try to get the system you want, and it might be easier to just go in and modify the objective directly in that case.
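In one dimension, the expectation-versus-risk-aversion point can be seen with a few lines of arithmetic — my toy numbers, not the paper's environment:

```python
import numpy as np

# Two paths to the gold: a long one over terrain seen during development, and
# a shortcut over "lava", a feature absent from the development environment.
# IRD leaves wide posterior uncertainty on lava's reward, centred on the
# prior mean of 0, so expected-utility planning takes the shortcut.

rng = np.random.default_rng(3)
lava_reward = rng.normal(0.0, 2.0, size=1000)   # wide IRD posterior on lava
step_cost = -0.1

long_path  = 10 * step_cost                     # 10 safe steps
short_path = 4 * step_cost + lava_reward        # 4 steps, one lava cell

print("expected :", long_path, short_path.mean())              # shortcut "wins"
print("10th pctl:", long_path, np.percentile(short_path, 10))  # shortcut loses
```

In expectation the shortcut looks better, because the unseen lava cell reverts to its prior mean of zero; any risk-averse criterion — here a 10th-percentile evaluation standing in for worst-case planning — prefers the longer path that stays on terrain the development environment actually exercised.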
Right — and in that case we know that there are some of these dynamics at play, like catastrophic failures or dead ends or something like that. Yeah — and we're not able to represent that explicitly in this mathematical model at the inference level, but we can build it in.

I'm curious to know what properties of an assistance game make risk aversion the right thing to do.

The way I look at this is: in the same way that corrigibility, or the ability to turn a system off, is almost a specification of a type of solution — a behavior that we intuitively think is good for assistance games or alignment problems — I think risk aversion plays a similar role, in the sense that in many scenarios principled conservatism with respect to your utility does make sense, and different theoretical constructs could lead to it. We talked about one, and they're all different ways of building the possibility of catastrophic failure into the belief structure somehow.

One thing it's related to in my mind — I don't know if you've listened to this podcast's episode on infra-Bayesianism, but it's basically this notion of imprecise probability, where you plan for the worst case out of a set of probability distributions, and you can come up with an update rule where you can plan for the worst case and also be dynamically consistent, which is sort of non-trivial. I've just realized there might be connections there.

I could certainly see that. Perhaps some of the intuitive reason why I would want risk aversion has to do with my statement earlier about utilities — the idea that the goal-achievement part of intelligence is smaller than we think it is. If you think of the goal-achievement component as what the system should be doing, in the sense that maximizing utility is the end goal, then from that standpoint the question "well, shouldn't you just be maximizing expected utility?" makes a lot of sense. But if you imagine that the goal-achievement component is one part of a system that's going to be working with representations and abstractions of the real world to come up with plans to implement, then "planning conservatively is probably better in expectation" is perhaps a way to put it.

So, speaking about aspects of this IRD framework: just as in CIRL you have this prior over reward functions that you have to analyze, here there's also the principal's initial model of the world, which bears really heavily on the analysis, and I'm wondering if you can give us a sense of how we should formalize that, if we're trying to do this analysis. And just to clarify — the principal's model of the world is their model of this development environment, right?

Yeah — the development environment they're imagining the system being deployed in. I think there are two ways to look at it. One is that this is literally the scenario you were evaluated in during development: practically, the system was designed through iteration on behavior in a particular environment, and the assumptions behind IRD are basically that the reward function was iterated on enough to get behavior that is well adapted to that environment.
And I guess there you can just know that you trained on, say, MNIST — you actually have access to that, kind of by definition. Yeah. For listeners who don't know, MNIST is a data set of pictures of handwritten digits, with labels for which number they are. So that's the very literal interpretation of the model. The other side of it is, I think, what you were gesturing at: the designer having a model in their head of how the agent will respond — an idea of "here are the types of environments my system is likely to be in, and here is the mapping between those incentives and behavior". If you imagine that as going on inside someone's head, what this is really telling you is how to be a good instruction-follower. If you're working for me and I tell you to do something — or I tell you, here are some things I care about, here's some representation of my goals — then if you don't know what type of context I'm imagining you'll be in, you won't have much information about how to interpret those objectives, and you'll miss things. So IRD is, philosophically, saying that the way to interpret a goal someone gives you is to think about the environments and contexts they thought you were likely to be in, and that that's a core piece of cognitive state to be estimating.

Yeah. So think back to the CIRL analysis of the IRD game. In IRD you have this human policy, which is to write down a reward function that induces good behavior in some test environment, and then you analyze what the best response to that policy is, which produces the robot's policy. But it seems like the best response to that robot policy, on the human side, would probably not be the original behavior. So I'm wondering: how many steps towards equilibrium should we imagine being taken?

I think it depends on what information the person has about the environments the system will eventually be deployed into. This is going to get confusing, because in this setting we now have the designer's model that they have in mind while designing a reward function, there's that environment, and then there's the set of other environments the system might be put into. If you want to design a best response to the robot policy, you have to work backwards from that future sequence of behaviors.

Yeah — maybe you need to do it in a POMDP rather than an MDP: a world where there are some things about the state of the world that the human doesn't know.

Yeah, perhaps. In some ways the point of this model is to capture scenarios where people aren't thinking very much about that. Maybe here's a good way to put it: in this model we're actually leaving out some potentially large pieces of information, which is that the selection of development environments is not arbitrary or random — in fact, we tend to select them in order to communicate things about the objective that we think are important. So in some ways the development environment captures our best estimate, in the spirit of this model, of how the robot will be deployed, and in that case the person is actually going pretty far — potentially to equilibrium, actually.
You're absolutely right that thinking more, or harder, about the way the robot will be deployed could lead to changes in the proxy reward functions it's optimal to specify. And you're also right that the best thing to do, if you realize there's a part of your deployment environment that's not well represented in your development environment, is to augment the development environment to include that component — and then you're providing incentives that are better matched. So — I'm trying to think of the right summary to close this question on — I think it's that there's definitely a lot of additional iteration that could be possible here, and the opportunity for more coordination. I'm not sure it makes sense to study those directly within this model as it's currently represented, in the sense that that type of strategic improvement is perhaps better represented by changes to the development environment — or you'd end up assuming more, or you'd want a richer cognitive model of reward design and how reward design can go wrong, perhaps. Part of the core idea in setting this up is: if the person is wrong about the objectives — if the proxy reward is not actually the true reward — what can you do? And it only really makes sense to study that if people are limited.

That makes sense. So one way you could think about this work is in the context of side-effects mitigation — like this example with the lava: it was a side effect the humans didn't think about, "oh, what if the robot gets into lava", and this is a way of avoiding the robot having some nasty side effects. I guess this episode hasn't been released, so you don't know it yet, but the listeners will have just heard it: the previous episode of this podcast was with Victoria Krakovna on side-effects mitigation. And you've actually done some work yourself on this problem — you co-authored with Alexander Turner on attainable utility preservation. So what do you think about the relationship between the side-effects problem and this approach of inverse reward design?

Well, they're slightly different, in that one is a solution and the other is a problem. Do you think they're matched? I'm not sure. I think there certainly are some definitions of side effects for which IRD is well matched to the problem. At the same time — well, what I'll say is that side-effect avoidance covers a pretty broad range of approaches, so I wouldn't want to rule out solution approaches that don't leverage inverse reward design in its particular Bayesian form in some way. With that said, I think you can describe a lot of side-effect-avoidance approaches in language similar to the model we're producing here. What counts as a side effect — what types of problems we evaluate on — almost always ends up being: here's some environment, here's the reward function the system got, which is intuitively reasonable, here is the side effect, which is why that's wrong, and here is how optimizing for this relatively generic term can allow you to avoid it.
That reasoning often relies on some intuitive agreement that "this reward is reasonable", and what IRD does is provide a probabilistic definition of reasonableness, where rewards are reasonable if they work well in a development environment. So when I look at a lot of these side-effect examples, I often translate them in my head to: ah, there was a simpler environment where this particular action wasn't possible, and that's where the reward comes from, and now they're looking to go into this other setting. And there are interesting approaches to that problem which don't come from a Bayesian reward-inference, uncertainty perspective. I'm guessing you two talked about some of the relative reachability work that Victoria's been involved in; that's a different perspective on side-effect avoidance, one which would avoid lava if certain properties of the transition function make it bad, but wouldn't avoid it if it's merely the case that whether lava was good or bad was left out of the objective. So maybe that's why I didn't want to say there's a one-to-one relationship between side-effect avoidance and inverse reward design: there's another class of side-effect avoidance which involves doing things like preserving the ability to put things back, and I think that is solving a similar problem, but I don't want to claim it's the same.

So now I want to move on to some closing questions about the line of work as a whole. First of all, these papers were published between 2015 and 2017, if I recall correctly, which is some time ago. You've mentioned some papers that have been published in addition to the ones we've talked about — would you like to give listeners an overview of what's been done since then on this line of thinking?

Sure. We've extended inverse reward design in a couple of directions, looking both at active learning and at using it to fuse multiple reward functions into a single one, which can actually make designing reward functions easier: if you only have to worry about one development environment at a time, you can take a kind of divide-and-conquer approach. The divide-and-conquer work is with Ellis Ratner, and the work on active learning was with Sören Mindermann and Rohin Shah, as well as Adam Gleave. I also mentioned the assistive multi-armed bandit paper, which looks at what happens when the person doesn't observe their reward function directly but has to learn about it over time. And we've done some work on the mechanics of algorithms within cooperative IRL — for folks who are interested, I recommend reading our ICML paper.

Is that about the generalized Bellman update? Yes — a generalized Bellman update for cooperative inverse reinforcement learning, in which we give an efficient algorithm for computing optimal cooperative IRL strategy pairs. And that's with Malayandi Palaniappan and Dhruv Malik? Yeah — Malayandi. That's a paper I definitely recommend folks interested in this work go look at, because if you want to experiment with non-trivial cooperative IRL games, those algorithms are the ones you'd want.
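For the curious, the flavor of that update — my schematic gloss, not the paper's exact statement — is a belief-space value recursion in which the human's reward-dependent action both earns reward and updates the robot's belief:

```latex
V^{*}(s, b) \;=\; \max_{a^{R}}\;
  \mathbb{E}_{\theta \sim b,\; a^{H} \sim \pi^{H}(\cdot \mid s, \theta)}
  \Big[\, R\big(s, a^{H}\!, a^{R}; \theta\big) \;+\; \gamma\, V^{*}\big(s', b'\big) \Big],
\qquad
b'(\theta) \;\propto\; \pi^{H}\big(a^{H} \mid s, \theta\big)\, b(\theta) .
```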
I also mentioned a paper that looks at value alignment in the context of recommender systems, which doesn't directly use cooperative IRL but applies some of those ideas to identifying misalignment. Talking about this more generally, there's a paper called "Incomplete Contracting and AI Alignment", which looks at the connections between these assistance games and a really broad class of economics research on what's called incomplete contracting — our inability to specify exactly what we want from other people when we're contracting with them. An example of incomplete contracting that people will use for a long time is everything that happened when we went into lockdown for COVID: lots of people had contracts for exchanging money for goods or what have you, and then all of a sudden a pandemic happened, which was maybe written into contracts only in very loose ways, like acts-of-God clauses — contracting tools to manage uncertainty and our inability to specify what should happen in every possible outcome of the world. There's a very strong argument to make that assistance games are really studying AI under incomplete-contracting assumptions, in the sense that the reward function you specify for your system is, in effect, a contract between yourself and the system. Then there's the paper on attainable utility preservation with Alex Turner, which I'd recommend folks look at: the main idea there is that we measure distance and impact via change in a vector of utility functions, and we show that this has some really nice properties for limiting behavior, plus some practical applications in interesting side-effect-avoidance environments. And the last things I'd direct people towards are my most recent papers on the subject: "Multi-Principal Assistance Games", which looks at some of the interesting dynamics that come up when you have multiple people being learned from in an assistance game, and "Consequences of Misaligned AI", which is a theoretical analysis of proxy objectives, where we show that missing features, in a fairly strong theoretical sense, can lead to arbitrarily bad consequences from a utility perspective.

All right. And I guess the dual to that question is: what do you think the next steps for this research are? What needs to be done?

The thing I'm most excited about is integrating meta-reasoning models into cooperative IRL games — looking at cooperative IRL where one of the questions the person has to decide for themselves is how hard to think about a particular action or a particular problem. I think this is something that's missing from a lot of our models. It places strong limits on how much you can learn about utilities, because there are now costs to generating utility information that are distinct from actions in the world, and are avoidable in some sense. And because people choose how much to think about things, and what types of computations to run, this makes the relationship between the person and the system even more crucial, because the system may have to do more explicit things to induce the person to provide the information it needs. In more regular cooperative IRL, the system might need to steer the person to an informative region of the environment, where they can take actions that provide lots of information about reward.
When you introduce meta-reasoning and costly cognition, the robot has to steer the person in belief space, to a place where the person believes that choosing to take those informative actions is worthwhile — and I think that is a more complicated problem. On the other side, actually figuring out how to build systems that are calibrated for cognitive effort would, I think, be really valuable. To point this towards recommender systems, which is something I've been thinking about a lot recently: there's a critique, which I think is pretty valid, that a lot of the issues we're running into stem from the fact that our online systems are largely optimizing for System 1 preferences.

What's System 1? System 1 is a reference to Kahneman's model of the brain, where people have two systems: System 1, a fast, reactive, intuitive reasoning system, and System 2, a slower, logical, conscious reasoning system. The point I was trying to make is that a lot of our behavior, online particularly, is reactive and not very thought out — you could argue, in fact, that these systems are designed to push us into that reactive mode. One thing you could imagine being a better situation is if people had more conscious cognitive effort going into what kinds of online behaviors they actually want. One way to understand this: there's an appropriate amount of cognitive effort to spend deciding whether or not to click on a link. If the only impact of your action is that you might read a bad article and stop, or you might read a good article, there's a certain amount you should optimally think about that. However, if clicking on that link determines what links you will see in the future, the appropriate amount of cognition increases, because your actions have a larger effect. And I think miscalibrating this component is a kind of misalignment: you can think of the system as being misaligned with people, or you can think of people as being misaligned with what the system will do in the future — choosing the wrong level of cognition for their future selves.

So those are some next steps. But suppose today there's a listener who is a deep learning practitioner — they use deep learning to make things they want people to use. How do you think they should change what they're doing, based on assistance-game analysis?

Well, one thing is a kind of intuitive shift in how you look at the world: explicitly recognizing that there's a very subjective, normative component to what you're doing when you program your system. There's an idea that our data is labeled with what we want the system to do, and that sort of is true, but in reality the role that labels in a deep learning system play, or that the reward function in an MDP plays, is that of an explanation — a description of what you want from the system, a way you're trying to encode your goals into a representation that can be computed on. I think it would help if we as a field just shifted to thinking more about those types of questions: spending more time thinking about the source of normative information in our systems, and thinking explicitly about "do I have enough information here to capture what I really care about?". I saw a really great tweet from Dr. Alex Hanna, who's at Google Research, pointing out that for, say, a large language model, you can have trillions of parameters, but ultimately there are only hundreds of decisions that go into selecting your data.
But suppose there's a listener today who is a deep learning practitioner: they use deep learning to make things that they want people to use. How do you think they should change what they're doing based on assistance-game analysis?

Well, one change is an intuitive shift in how you look at the world: explicitly recognizing that there's a very subjective, normative component to what you are doing when you program your system. There's an idea that our data is labeled with what we want the system to do, and that sort of is true. But in reality, the role that labels play in a deep learning system, or that the reward function plays in an MDP, is that of an explanation: a description of what you want from the system, a way of trying to encode your goals into a representation that can be computed on. I think it would help if we as a field shifted to thinking more about those types of questions: spending more time on the sources of normative information in our systems, and asking explicitly, "do I have enough information here to capture what I really care about?" I saw a really great tweet from Dr. Alex Hanna, who's at Google Research, pointing out that for, say, a large language model, you can have trillions of parameters, but ultimately only hundreds of decisions go into selecting your data. Part of our difficulty in getting these large language models to output the text we want, I think, comes from that imbalance. It's complicated, because language actually has lots of normative information in it, but because it's observational and predictive ("this leads to that; I see this text, I produce that text"), you don't interpret it as normative information. I tend to think that's the biggest lesson for practice.

The other part is: what is your strategy for integrating qualitative research and analysis into your process? Even if you just want to be a good deep learning researcher who produces neural nets that do a good job, it pays to attend to and optimize this loop: what are features of my system's behavior that are not currently represented, that I don't like or that I do like? Identify those behaviors, develop measurements for them, and then integrate that back into the system. The best deep learning engineers and deep RL engineers, I think, are really good at completing this loop effectively; they have intuitive skill at it, but they also have really good methods and are pretty formulaic about it. This holds at whatever scale of system you're building: if you're a grad student building a system on your own, trying to do policy optimization for a robot, this is a good idea, and if you're a large company managing search results for the global population, you also want to be doing this. If you look at search, it's actually a relatively well-developed area, with good standards for what makes a good search result and lots of human raters providing explicit feedback. Arguably, part of why that works is that we're paying for the right kind of data.
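The identify-measure-integrate loop just described can be written down schematically. The sketch below is my own framing of that loop, not code from the episode or from any particular library; the names and the weighted-penalty scheme are illustrative assumptions.

```python
# Schematic sketch of the behavior-measurement loop: name the behaviors you
# care about, quantify them, and fold the measurements back into training.
from typing import Callable, Dict

Model = Callable[[str], str]        # stand-in: maps an input to an output
Measure = Callable[[Model], float]  # scores one behavior; 0 = fine, 1 = bad

def audit(model: Model, measures: Dict[str, Measure]) -> Dict[str, float]:
    """Steps 1-2: identify behaviors and develop a measurement for each."""
    return {name: measure(model) for name, measure in measures.items()}

def combined_penalty(scores: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """Step 3: fold the measurements into the objective. Choosing these
    weights is itself a normative decision about what you want."""
    return sum(weights[name] * score for name, score in scores.items())

def improvement_loop(model: Model,
                     retrain: Callable[[Model, float], Model],
                     measures: Dict[str, Measure],
                     weights: Dict[str, float],
                     rounds: int = 3) -> Model:
    """Repeatedly measure the behaviors and retrain against the penalty."""
    for _ in range(rounds):
        penalty = combined_penalty(audit(model, measures), weights)
        model = retrain(model, penalty)
    return model
```

The substance lives in the `measures` (where the qualitative research happens) and in the `weights` (a normative choice about how much each behavior matters); the loop itself is the formulaic part that the strongest practitioners execute over and over.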
All right, another related question, one I like to ask. People worried about AI existential risk sort of have this idea that there were people who did some work to develop really smart AI, and they didn't stop to ask, "okay, how could the work I'm doing have negative impacts?" So: how could the work that you're doing have negative impacts on the world? And you're not allowed to say, "oh, the negative impact is that we wasted our time and it didn't pan out."

No, there are some really clear negative impacts that this work enables. Effective alignment combined with concentrated power is a recipe for some pretty bad situations and environments. Single-agent value alignment on its own really just allows people to get systems to do what they want more effectively. That's arguably valuable, but alignment is about as dual-use as a technology could be, at least as it's currently thought about.

Sort of like a power plant, right? It's just a generally empowering thing: now you have a way to get more electricity, now you can do more things.

Yeah, though it's like a power plant where someone might build their own, a home power plant, and use it for their own purposes. If you had some people with power and others without, it could lead to scenarios where the people without power are treated really poorly, and arguably that's what AI systems are doing to some extent nowadays, in terms of cementing existing power dynamics; alignment could supercharge that. One of the concerns I have about effectively aligning systems to an individual is that it might be fundamentally immoral to get to a scenario where one individual has undue influence over the future course of the world, and that maps directly onto these power imbalances. There's also a practical, short-term version: right now, if Facebook gets really good at aligning systems with itself, and we don't get good at aligning Facebook with us, that's potentially bad. And if you start to think about future systems, systems that could reach strategic dominance, then you might want alignment approaches that cannot align to an individual but have to align to a group in some way. I don't know; that's a little bit vague.

That's a fine answer. It's kind of like: if we imagine that preferences don't actually reside within individuals but within societies, then alignment to an individual that allows that individual's preferences to capture a lot of utility could be quite bad. So, if people are interested in following or engaging with you and your work, how should they do that?

Yeah, I have a moderately active Twitter account where I publicize my work and generally tweet about AI safety and AI alignment issues: that's @dhadfieldmenell, first initial, last name. I'll also put out a plug: essentially, if you're interested in doing this type of work, you thought this conversation was fun, and you'd like to have more conversations like it with me, I invite you to apply to MIT's EECS PhD program next year and mention me in your application.

All right, well, thanks for appearing on the podcast.

Thanks so much, it was a real pleasure to be here, and listeners, I hope you'll join us again.

This episode was edited by Finan Adamson. The financial costs of making this episode are covered by a grant from the Long-Term Future Fund. To read a transcript of this episode, or to learn how to support the podcast, you can visit axrp.net. Finally, if you have any feedback about this podcast, you can email me at feedback@axrp.net.