Debate and Imitative Generalization with Beth Barnes
Why this matters
Auto-discovered candidate. Editorial positioning to be finalized.
Summary
Auto-discovered from AXRP. Editorial summary pending review.
Perspective map
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forward · Mixed · Opportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
Across 100 full-transcript segments: median 0 · mean -1 · spread -20–0 (p10–p90 -6–0) · 1% risk-forward, 99% mixed, 0% opportunity-forward slices.
Mixed leaning, primarily in the Governance lens. Evidence mode: interview. Confidence: medium.
- Emphasizes safety
- Emphasizes AI safety
- Full transcript scored in 100 sequential slices (median slice 0).
Editor note
Auto-ingested from daily feed check. Review for editorial curation.
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video ga8_okcLAes · stored Apr 2, 2026 · 3,593 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/debate-and-imitative-generalization-with-beth-barnes.json when you have a listen-based summary.
Daniel Filan: Hello, everybody. Today I'm going to be talking to Beth Barnes. She's currently a researcher at OpenAI, and before that she was the research assistant to the chief scientist at DeepMind, Shane Legg. Today we'll be talking about her work related to the topic of AI alignment via debate. Beth, welcome to AXRP.

Beth Barnes: Thanks for having me.

Daniel Filan: So I guess my first question is: what is AI safety, or AI alignment, via debate? What's the idea?

Beth Barnes: Debate is pretty closely related to IDA, iterated distillation and amplification, Paul's idea. There are several different ways to explain it. One is: what we want from an alignment technique is something such that, for everything the model knows, we can extract that, or create an overseer that knows all the things the model knows and can therefore oversee it adequately. Debate is one way to do that, and to do it efficiently. In IDA you have this implicit tree that's analogous to the tree in humans consulting HCH: questions and answers, questions and sub-questions and answers to those sub-questions, sub-questions to help with answering those, and so on. Debate you can think of as a different way to interact with the same kind of structure, where rather than imitating all the parts of this tree, you have two ML models that have this tree in their heads, and you take some path down the tree until you get to something that is human-checkable, and you're able to verify properties of the whole tree just by looking at one path down it. One way to think about this interaction is that one debater has a tree in their head, and the other debater is traversing it, looking for a leaf that has a flaw in it. If you see them do that traversal and they finally find the flaw, you know the tree has at least one flaw; and if you see them do that traversal and the leaf doesn't have a flaw, then you can be confident that's a property of the whole tree.
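A toy model may help make that "one path certifies the whole tree" claim concrete. In the sketch below, all names and the hand-labelled tree are illustrative, not from the episode: the critic searches for a flawed leaf, and the judge only ever reads the single path the critic returns.

```python
# A minimal sketch of "verify the whole tree by one path", assuming a
# toy argument tree whose leaves are pre-labelled sound or flawed.
from dataclasses import dataclass, field

@dataclass
class ArgumentNode:
    claim: str
    sound: bool = True            # meaningful at leaves: can a human verify it?
    children: list["ArgumentNode"] = field(default_factory=list)

def find_flaw(node: ArgumentNode) -> list[str]:
    """The critic's traversal: steer toward a flawed leaf if one exists.

    Returns the path of claims ending at a flawed leaf, or [] if every
    reachable leaf is sound. The judge only reads the returned path."""
    if not node.children:
        return [] if node.sound else [node.claim]
    for child in node.children:
        path = find_flaw(child)
        if path:
            return [node.claim] + path
    return []

tree = ArgumentNode("go to Bali", children=[
    ArgumentNode("flights are affordable", sound=True),
    ArgumentNode("your passport will be ready in time", sound=False),
])
print(find_flaw(tree))  # ['go to Bali', 'your passport will be ready in time']
```

If an optimal critic comes back empty-handed, the judge can treat every leaf as sound without reading the rest of the tree; that is the property debate is trying to exploit.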
Daniel Filan: So it sounds like this is related to this thing you called iterated distillation and amplification. Concretely, what is that, and what's this tree you're talking about?

Beth Barnes: Maybe the place to start is the other thing I mentioned, humans consulting HCH, which is a recursive acronym. It's one model for what a good answer to a question is: what a human would answer if they thought more and had more resources. One way to set that up specifically is: how a human would answer if they could ask sub-questions to copies of themselves, which could ask sub-questions to copies of themselves, and so on. You split some difficult task, where we don't know what the correct answer is, up into smaller tasks. And there's some claim that, because this HCH is all made of humans who are trying to give a good answer, HCH gives aligned answers; and then, if you could build something analogous to that, it is also trustworthy.

Daniel Filan: So is the tree the idea that you have a main question, and then sub-questions coming off of that? Do you have an example in your head of a starting question you could ask HCH?

Beth Barnes: The example in the debate paper, I think, is "where should I go for my holidays?" Some sub-questions might be: which climates do I like? How expensive are flights to different places? Is my passport ready, and do I need it to go to different places? If you're doing this question in the HCH tree, you can imagine you have infinite copies of yourself, and you pass these sub-questions to those copies: one copy of you looks up how expensive different flights are, one copy looks up which places have what climates, and one copy looks up travel restrictions, or whatever. The IDA model of that would be something like: you train an ML model to imitate copies of yourself doing little tasks, and then you can pass the subtasks to that trained model. Then you can train another model which imitates the team of you giving that first model parts of the problem and solving the problem overall, so now it's imitating a tree of depth two. And you can repeat that process until you have one model that should be the equivalent of imitating a whole big HCH tree.
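As a rough illustration of the recursion Beth describes, here is a minimal HCH-style sketch. The three helper functions are hypothetical stand-ins for a human (or a model trained to imitate one), and the decomposition is hard-coded for the holiday example.

```python
# A minimal sketch of the HCH recursion; everything here is illustrative.

def answer_directly(question: str) -> str:
    return f"<human's best unaided answer to: {question}>"

def decompose(question: str) -> list[str]:
    # Stand-in for a human splitting the task; hard-coded for the example.
    if "holidays" in question:
        return ["Which climates do I like?",
                "How expensive are flights to each place?",
                "Will my passport be ready in time?"]
    return []

def synthesize(question: str, subanswers: dict[str, str]) -> str:
    return f"<answer to {question!r} given {len(subanswers)} sub-answers>"

def hch(question: str, depth: int) -> str:
    """Each call is a fresh 'copy of the human' consulting more copies."""
    subquestions = decompose(question) if depth > 0 else []
    if not subquestions:
        return answer_directly(question)
    subanswers = {q: hch(q, depth - 1) for q in subquestions}
    return synthesize(question, subanswers)

print(hch("Where should I go for my holidays?", depth=2))
```

IDA's distillation step would then train a model to imitate, say, `hch(·, 1)`, substitute it for the recursive calls, and repeat, with each round imitating a tree one level deeper.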
Beth Barnes: The debate analogy for that specific question would be: instead of passing off all of these subtasks, you have two models, and both of them propose a destination, and they each critique parts of each other's argument for that destination. One of them suggests some place, and the other points out that your passport won't be ready in time, and then the first one points out that actually you can pay for an expedited passport service, and this is cheaper than the more expensive flight to the other place you're considering, or something like that. You surface all the considerations while only taking one path down.

Daniel Filan: So for debate, is the idea that I have some question that I'm going to ask a machine learning model, and I get maybe two different models, or two copies of the model, and they debate each other, and I read the transcript, and then I know the answer? Is that roughly right?

Beth Barnes: That's definitely the overall idea. In particular, we probably want two copies of the same model. Also, if you just did this once, you wouldn't have any particular reason to believe that the answers you got were right. The idea is that this provides the correct training signal, such that if you train for winning debates for a while, you end up with a model that gives correct answers. If you read the transcript and see that one debater seems to have won, that's only evidence that the answer is good assuming the opposing debater is actually playing reasonably, is actually bringing up all the relevant criticisms, and that sort of thing.

Daniel Filan: So to make it a bit more concrete: how would I go about training such a thing? Is the idea that I already have a model that has some knowledge and then I do debate on it, or do I start off by having it run debates against itself?

Beth Barnes: I think either is possible. You can imagine starting completely from scratch, where the model only learns about the world through humans saying that its claims are good or not good. That would obviously be kind of inefficient, but it means you're training on an aligned training signal the whole time. Or you can take something like a pre-trained language model, like GPT-3, and fine-tune it to do debate: it already has a bunch of world knowledge, and you're just providing a training signal that gets it to tell you what it knows.

Daniel Filan: So in either path, say you take literally GPT-3 and do this: every evaluation of whether a debate was won or lost seems to need a human to read the transcript and say "yes, good debate, this side won". It seems like everything your model trains on has to have a human labelling who won the debate. Is that right?

Beth Barnes: You can actually just train a model to imitate the human judgments. That's an easier problem than the original problem: training a model that imitates humans judging these relatively contained tasks of judging part of a transcript. You directly have a supervision signal for this, and you're not dealing with any distributional shift. So you probably don't need humans providing every single data point; you just train a reward model from the human behavior.

Daniel Filan: And I guess you might have to update it as the debates get better and the distribution moves.

Beth Barnes: Yeah, but that would work similarly to how summarization from human feedback works: you train a reward model on the human judgments, you update it over time, and then you do RL against that reward model.
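A minimal sketch of that loop, in the style of the summarization-from-human-feedback setup Beth references. Every class and function here is a hypothetical stand-in, not a real API; the point is just the shape of the loop: self-play debates, a reward model imitating occasional human verdicts, and RL against the reward model.

```python
# Illustrative sketch only: stubs stand in for the model, the reward
# model, and the human judge.
import random

class Debater:
    def reinforce(self, transcript: str, reward: float) -> None:
        pass  # stand-in for a policy-gradient / PPO update

class RewardModel:
    def update(self, transcript: str, human_verdict: int) -> None:
        pass  # supervised step toward the human judgment

    def score(self, transcript: str) -> float:
        return random.random()  # stand-in for the predicted verdict

def run_debate(debater: Debater, question: str) -> str:
    return f"<transcript: two copies of the model debate {question!r}>"

def human_judge(transcript: str) -> int:
    return random.choice([0, 1])  # which side the human says won

def train(debater, reward_model, questions, human_label_rate=0.05):
    for q in questions:
        t = run_debate(debater, q)
        # Most transcripts are scored by the reward model; a fraction get
        # fresh human labels, which also keep the reward model tracking
        # the shifting distribution as debates get better.
        if random.random() < human_label_rate:
            reward_model.update(t, human_judge(t))
        debater.reinforce(t, reward_model.score(t))

train(Debater(), RewardModel(), [f"question {i}" for i in range(100)])
```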
Daniel Filan: Cool. So we have this training procedure that's going to take a model and try to get it to output answers to questions where it knows the answer. It's kind of a strange question, but why do this? What problems in AI safety or AI alignment would this solve? And are there problems where you'd say, "nope, I'm not trying to solve that problem with debate, I'm just trying to solve these things"?

Beth Barnes: A lot of this revolves around Paul's idea that a simplification of what we need to be aiming for with an alignment technique is to know everything the model knows.

Daniel Filan: Just a second: who's Paul?

Beth Barnes: Paul Christiano, sorry. He was previously at OpenAI, and is now starting his own new research organization thinking about this kind of thing.

Daniel Filan: And did he come up with AI safety via debate?

Beth Barnes: He came up with IDA, and worked with Geoffrey Irving, who came up with debate.

Daniel Filan: Cool, so the same group. All right, so I interrupted: I asked what problems it solves and doesn't solve.

Beth Barnes: Right: Paul's idea is that a simpler way to think about what we need to be aiming for with an alignment technique is that we can know everything a model knows. It's relatively easy to say why this is sufficient: if we can get the model to tell us everything it knows, then it's easy for us to catch deception or treachery, because we can ask "are you being deceptive?" or "are you doing something we wouldn't be happy about?". More generally, there's this strategy-stealing argument: anything the model knows that it could be leveraging to advance some goal, we can also leverage to advance flourishing and being in control of things and understanding what's going on. It's a bit less clear whether it's actually necessary, and in some sense hopefully it's not, but it is a scary position if you have a model that has this kind of information advantage over you and knows a bunch of things it's not telling you. So debate is hoping to address that kind of thing. It's not that related to things like robustness: if you have problems because your model doesn't know something, it's not going to help with your robot walking off a cliff.

Daniel Filan: If it walks off accidentally, right.

Beth Barnes: Right, if it doesn't know what cliffs are, that kind of thing. And the other caveat is that we think there's a type of knowledge that debate doesn't work well for, or is at least extremely inefficient at extracting, which is knowledge that looks less like relatively explicit step-by-step reasoning and more like "this heuristic worked on all the training data I've seen, and seems reasonable a priori". If a model made that claim in a debate, it's unclear how you check it without going through all of your training data to see whether that heuristic did in fact perform well, and obviously you can't keep doing that every time a question like this comes up.
Daniel Filan: Just to check that I understand: am I right to say that the basic idea is, whatever bad thing you're worried about an AI deliberately doing, the point of debate is to get some way of eliciting the knowledge of the AI so that you can ask it, "hey, are you doing this bad thing?", and then if it is, you can say, "okay, I won't let you"?

Beth Barnes: Right, or you can train for models that don't intend to do the bad thing. Concretely, how you'd hope this would look is: you have two copies of the model, and one is suggesting some plan that it says has good consequences, and the other model says, "in fact, the reason it's suggesting this plan is that it has this consequence which results in the model getting a lot of power, and that's something you would be unhappy with". If the criticizing model can substantiate that claim somehow, then you can say, "ah, I don't like this plan", and provide a negative feedback signal; and if it can't substantiate it, and the plan looks otherwise good, then you say, "I guess this is good".

Daniel Filan: Cool. We'll go a bit more into the types of things that debate can't deal with, but one question I have is: what kind of work do you do on debate?

Beth Barnes: To be clear, I'm not actually working on debate anymore.

Daniel Filan: Oh, okay. So what work were you doing?

Beth Barnes: What I was doing in particular was experiments, which is a slightly strong word for the fiddly work involved: having humans pretend to be the models taking part in this debate, seeing what happens if people try to play this game strategically, and then modifying the rules to try to make them favor honesty and helpfulness more. In particular, we were using a setup where you want a gap between the players of the debate and the judge: we had people who know more physics, and have thought more about physics, debating the answers to physics problems, with judges from Mechanical Turk who don't know very much about physics.

Daniel Filan: That's interesting, and it strikes me that not a lot of research on how to make really smart AI systems does things like this. Should we all be doing these kinds of things, or is this a type of work that was really useful for debate but probably isn't useful for other stuff?

Beth Barnes: That's a good question. There are various ways in which it seems a lot easier to do with debate than with other things. Even with a slightly different variant, if you were trying to explore what an HCH tree would do, because you're trying to simulate the case where the humans in different places of the tree don't have the whole context, a tree with 30 nodes needs 30 different humans. That's still relatively amenable to having humans do it, but it's a lot more logistically difficult, whereas this debate setup only needs two debaters and a judge. The other thing is, again, we're talking about these big implicit trees, and things like IDA work by learning to imitate small parts of the tree and then composing them, then learning to imitate that, which is now a bigger chunk of the tree, and composing those. Debate is easier to think about and study because it just takes the single path down the tree, so you're dealing with a much smaller number of nodes.

Daniel Filan: And the tree you're talking about in debate: is that just the tree of all possible claims one side could make, and responses the other side could make?

Beth Barnes: That's right. I tend to think of these as pretty closely analogous to the trees in HCH or IDA, although in practice there's an actual game tree, all of the responses the debater could make at each point, which is slightly different from the implicit argument tree of all the claims that substantiate a top-level claim. But they're related.
Daniel Filan: Okay, cool. I think when a lot of people hear about this for the first time, a question they have is: suppose these ML models are somehow really smart, or you do a really good job of optimizing for convincing humans in a debate. There's this worry that the smart AI systems would come up with arguments that humans erroneously find extremely convincing, sort of an instance of the general worry about Goodhart's law, where "convincing to humans" is not necessarily the same as "actually a valid argument". Do you think that's a problem with proposals like debate, and if so, how are we going to deal with it?

Beth Barnes: I think that's definitely a problem. It somewhat applies to other things like IDA as well, although it's more likely we'll get lucky there and not have it; it's sort of "if this problem exists, debate will have it, and IDA might have it". Another thing: when people hear about debate, they tend to think about all the problems with human debating. I definitely used to think that, and felt this seemed really unpromising given how dismal political debate is. But I think we shouldn't take that as too strong a signal, because as far as I can see, people just really haven't tried that hard to optimize debate structures and mechanics for truthfulness. And there are a bunch of things you can do with ML systems that you just can't do with humans that are pretty useful. One of the things that has made our mechanism feel a lot more promising is that you can save a copy of a debater from earlier in the debate, so they don't know what the later context is, and you can ask questions of them, and this enforces having a consistent story. You obviously can't save a copy of a human and ask them questions, at least not without technology. So my real answer is: yes, obviously this is still a concern, but I think we can do a lot better than what it looks like if you imagine two agents having a free-text debate with no particular rules, where they just say things back and forth. Firstly, we've just thought about the mechanism more: the structure of how debaters can make claims, how you choose which parts of the argument to focus on, this thing of enforcing consistency by asking questions of past copies, how you traverse the tree. You can also have a debate about whether some claim was using a dubious persuasive strategy; or, apart from the judges, you can have monitors who are just looking out for anything that looks fishy; and you can taboo certain styles or types of arguments, anything getting political, or particular types of rhetoric you think are harmful. And if you think this generally works pretty well, but the problem is that you're hitting the extremes, where the very most persuasive thing is probably a bug, or hacking humans, you can maybe do something like mild optimization.
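"Mild optimization" can be illustrated with a quantilizer-style rule. The connection to debate is Beth's suggestion; the specific rule and numbers below are an assumption for illustration: rather than taking the argmax argument, where Goodhart pressure is worst, sample from the top fraction of candidates.

```python
# A toy quantilizer: sample from the top-q fraction by score instead of
# taking the maximum. Scores and the cutoff are invented for illustration.
import random

def quantilize(candidates, score, q=0.1):
    """Pick uniformly from the top-q fraction rather than the single argmax."""
    ranked = sorted(candidates, key=score, reverse=True)
    top = ranked[: max(1, int(len(ranked) * q))]
    return random.choice(top)

arguments = [f"argument-{i}" for i in range(100)]
persuasiveness = lambda a: hash(a) % 1000   # stand-in for a learned score
print(quantilize(arguments, persuasiveness))
```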
Daniel Filan: So in summary, it seems like your response is: we're going to think about the rules really carefully, and we're going to experiment. It sounds like you think this is potentially some kind of problem with debate, but there's going to be some way of dealing with it. Is that fair to say?

Beth Barnes: Something like that. I think it is a problem, but a lot of the things that people most immediately think of as problems are relatively fixable. I'm still not confident that it's all fine overall, though.

Daniel Filan: Related to that question: it seems like one of the things you're doing is trialling ways of structuring the debates, rule sets and such, and checking whether you can game them in order to win when you shouldn't be able to. Part of the strategy is that if humans can break a rule set, then definitely an AI could break it. On the other side, though, if you have some protocol for debate that seems to work when humans do it, how would you know that it's sufficient to train powerful AI systems on?

Beth Barnes: Obviously that's very hard, and I don't think we're going to be able to provide strong assurances of that kind. Hopefully we can do a bit better than "well, humans tried to break it and they couldn't, so it seems fine", and have slightly more principled arguments about why parts of the rules rule out whole classes of strategies. But I think mostly the kinds of assurances we're going to get are going to look like "we can't do any better". This is kind of the reason I was first interested in debate: it has this sense of being a reasonable encapsulation of the best we can expect to do. If you take two agents that are equally matched and get them to debate a viewpoint, and we can't hope that humans slightly favor the truth when things are perfectly evenly matched, then it seems like we're kind of screwed overall. That was kind of a tangent.

Daniel Filan: No, that was really interesting. It almost has this flavor of formal systems, where you can show that if the system behaves this way, then this thing happens. It does seem like with debate you have some hope of getting reasoning like that.

Beth Barnes: Yeah, you can maybe formally show things about the mechanism. You can say: if humans make mistakes at this rate, and the dishonest debater is able to find problems at this rate, then the mechanism works like this. The thing you'll need to assume is going to be something about whether a human, faced with two answers with these properties, will make the right decision. You can prove things about the game structure and how things are going to play out, and then you're left with something like: in this scenario, where we've tried to set everything up such that the humans should have the things they need to make the right decision, will they make the right decision?
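As a toy version of the kind of guarantee Beth is gesturing at (the independence assumption and the numbers are invented for illustration): if the judge errs independently with probability eps at each of the d steps the mechanism asks them to adjudicate, honest play survives the whole path with probability (1 - eps)^d.

```python
# Toy judge-error propagation; illustrative numbers only.
for eps in (0.01, 0.05, 0.10):
    for depth in (5, 10, 20):
        p_honest = (1 - eps) ** depth
        print(f"eps={eps:.2f} depth={depth:2d} -> P(honest verdict)={p_honest:.2f}")
```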
Daniel Filan: Getting a bit into the debate protocol: you mentioned that you're not just going to have free response, where one debater fills up a text field unconstrained, and then the other debater fills up a text field unconstrained. Why didn't that work?

Beth Barnes: Various problems. The reason we're hopeful about debate is that you can do this zooming-in thing: if there is anything wrong with one player's argument, the other player can highlight it, until eventually you get to something that's clearly wrong and relatively simple that the judge can engage with. But with free text this can break, because one debater can make an argument that has a hidden flaw in it, and when the other debater tries to criticize that flaw, the first debater evades the criticism: says "no, there's a problem in your argument", or pretends the other debater was referring to something else, and basically avoids actually focusing in on the flaw in their argument.

Daniel Filan: Is this almost a problem with human comprehension? If we were really smart, we could notice that debater two was pointing out a flaw that really did exist in debater one's argument, but it's just hard for us to read and parse, and therefore we need more structure just for our brains to deal with it?

Beth Barnes: I think it's a bit more fundamental than that, in that we're assuming there are in fact a bunch of facts the humans don't know, or things they haven't thought through. Say we imagine a human who's really good at comprehension, and is going to read really closely and look at the linguistic implications of the different things that have been said. I think it's still not clear that they should be able to solve this problem: if you're listening to two physicists make claims about a problem you completely don't understand, and they each say there's a flaw in the other's argument, it's not clear which one you should focus on.

Daniel Filan: That sort of sounds like a problem with any notion of debate, right?

Beth Barnes: The way we resolve this kind of thing is: in this scenario where one of the debaters has made a mistake, but both are claiming that the other has made a mistake, the debaters do know which mistake is real. So you can use that to figure out which one to focus on: you let the debaters decide which claim to recurse on, but you make them pay a small penalty out of their score if they're the one who chooses. The idea is: if I'm the honest debater and I know you've made a mistake, I can pay to choose to recurse on that, several times, until I get down to something that's clearly a flaw, and then I will win, and I will recoup all of the score I paid in order to choose where to go to find that flaw. But if I'm the dishonest debater, and I just keep paying bits of my score to choose where to focus, then because you're actually honest and haven't actually made any mistake, I'm never going to find the mistake that lets me win back that score, so I'm going to lose.

Daniel Filan: It almost sounds like a credit assignment problem: if the human is judging the debate, they figure out which side had the better argument, but the problem is the human isn't incentivizing clarifying things in the right way, so we need some kind of incentive mechanism to automatically do that. Is that kind of right?

Beth Barnes: It's more that if you just let the human choose which claim to focus on, they'd just be guessing every time, because they don't actually know which one is true. But the debaters do: the debaters know that if we pursue this branch of attack, it will eventually be clear that you were lying, whereas if we pursue this branch, you were honest, so we're never going to find a lie. The magic that's coming in here is being able to use the debaters' knowledge of where the mistake is, rather than having the human guess. And the reason you need this penalty for choosing what to recurse on is that we're just as worried about the dishonest debater being able to force a draw as being able to win. If your debates just draw all the time, this is unusable; and if you don't have some penalty for choosing, then a debater at any point can just decide to recurse on some random, stupid question that's going to be a draw.
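A toy sketch of that "pay to choose where to recurse" incentive. The fee and bonus values are invented; the point is only the asymmetry Beth describes: the honest chooser eventually reaches a real flaw and recoups the fees, while the dishonest chooser cannot.

```python
# Illustrative payoff asymmetry for the recursion-penalty rule.
CHOICE_FEE = 1.0   # paid each time a debater chooses where to recurse
WIN_BONUS = 5.0    # awarded for exhibiting a clear flaw to the judge

def chooser_score(chooser_is_honest: bool, rounds: int = 3) -> float:
    """Score for the player who keeps paying to pick the branch."""
    score = -CHOICE_FEE * rounds   # fees accumulate with each choice
    if chooser_is_honest:
        score += WIN_BONUS         # they steer to a real flaw and win
    # A dishonest chooser never reaches a real flaw in the honest
    # opponent's argument, so there is no bonus to recoup the fees.
    return score

print(chooser_score(chooser_is_honest=True))   # 2.0: fees repaid with interest
print(chooser_score(chooser_is_honest=False))  # -3.0: fees are a pure loss
```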
Daniel Filan: So you've mentioned that one thing you need for good debates is this mechanism for deciding, of a claim and a counterclaim, which one to dig in on. What other rules for debate do you currently think are most promising, or most important?

Beth Barnes: The one I'm most pleased with, or excited about, is what we call cross-examination. This is what I mentioned earlier, where you save a copy of a debater and are then able to ask it questions. Let me give a little more context. We've also been thinking that, rather than having one human who looks at the whole transcript, the thing the human judges is just a question, and then an answer from each debater, and each debater also gives sub-questions and supporting answers. So the human sees some sub-questions with some answers, and then a top-level question with two answers, and they have to pick which is better; they're seeing a relatively small context window. The problem with this is that, because language is inherently ambiguous, the dishonest debater can make two claims in support of their answer: they claim X, and they claim X implies their answer Y, and they make X ambiguous, such that if we choose to challenge X, they'll give a weaker interpretation that's easier to defend, and if we challenge "X implies Y", they'll give a stronger version of X that more clearly implies Y.

Daniel Filan: Is this just the motte-and-bailey fallacy?

Beth Barnes: Yeah, it's very similar to motte-and-bailey.

Daniel Filan: And for people who don't know what that is: it's just what Beth said, and hopefully for every listener one of those explanations will work. Maybe can you give an example of that type of argument?

Beth Barnes: I think it's not going to be a great example, but okay. One of the questions we've been debating is about some icebergs floating in water with different things inside them, and what happens to the water level when they melt, which is a deceptively tricky physics question that people get very confused about. The way this first came up is that a debater used "the iceberg" while equivocating between meaning the whole thing, including the air bubble inside it, and just the ice part, not including the air. This made an important difference to the part of the argument where they were talking about the displacement of the iceberg, or the volume of iceberg that was underwater. If you just see the phrase in one context, it's easy to interpret it one way or the other, and not realize that it conflicts with the usage in the contexts you don't see.

Daniel Filan: So we have this problem with ambiguous language letting you have it both ways. Can you remind us what cross-examination is, and how it helps solve that?

Beth Barnes: The idea is: to resolve the fact that the dishonest debater can interpret "the iceberg" in whichever way is convenient to them at the time, we define the debaters' statements as how they would have clarified what they meant if you had asked them right after they wrote it, before they knew what else had transpired. That means that if we're in a part of the tree where the debater wants to interpret it as meaning just the ice, we can ask the question back up at the point when they wrote it: "did you mean just the ice, or are you including the air bubble inside?" And because that copy of the debater doesn't know whether the question is being asked in the context where they wanted to say "just the ice" or the context where they wanted to say "the ice and the air bubble", either they have to answer randomly, or they have to pick one. If they're answering randomly, you can just ask several times and exhibit a contradiction. If they have to pick one, then because the debaters are two copies of the same model, and are also sharing activations, the other model should know what the dishonest model is planning to answer, and can then challenge in the branch where that answer doesn't support the argument.
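A minimal sketch of cross-examination on the iceberg example. The "saved copy" is modelled here as a closure that cannot see which later branch a question comes from; this stub implements the dishonest case, where the frozen copy is forced to answer at random, so repeated questioning exposes a contradiction.

```python
# Illustrative sketch only; not a real protocol implementation.
import random

def saved_copy(statement: str):
    """Freeze the debater at the moment `statement` was written: the copy
    cannot see which later branch of the debate the question comes from."""
    def clarify(question: str) -> str:
        # Dishonest case: with no way to tell which reading is convenient
        # in the asker's context, the copy can only guess.
        return random.choice(["just the ice", "ice plus the air bubble"])
    return clarify

ask = saved_copy("the iceberg displaces its weight in water")
answers = {ask("Did 'the iceberg' include the air bubble?") for _ in range(10)}
if len(answers) > 1:
    print("contradiction: the statement was being read both ways")
```

If the copy instead commits to one reading, the opposing debater, being a copy of the same model sharing activations, can predict the commitment and challenge in the branch that reading fails to support.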
Daniel Filan: Cool, that's an interesting strategy. And how did you come up with it? I don't quite know how to fill out that question, but how did you come up with that idea?

Beth Barnes: To be clear, this was mostly not me. I was thinking: we really want some way to get debaters to pre-commit to all the claims they're going to need, such that they can't make these conflicting claims. Maybe whenever they make some claim, they have to write down all of the claims they're going to want to use, and then they can only make moves by referencing one of those claims. But this is really intractable: because there's this large potential tree of where the debate could go, it's impossible to write down all of the things you might need to say, especially when you have someone adversarially trying to ask you a question such that you won't have anything written down to back your answer up. So I was thinking: this is untenable, and I'm sorrowful about it, because I want to make something like this work, but I can't. And I mentioned this to Chelsea Voss, who happened to have done a project on some similar mechanism for some complexity class, something something, I don't remember the details, and she said, "oh, you could just ask the question to a version of the debater that doesn't have the subsequent knowledge". And I thought: oh yeah, that'll work.
Daniel Filan: So, speaking of complexity classes. And I should say, for listeners who are interested, there's a blog post you have where you lay out, at least as of one point in time, what all the rules of the debate are and what they try to solve, and I would encourage people to go read that. But speaking of complexity theory: when you read explanations of this kind of thing, you'll often see claims like "the optimal strategy in this debate protocol is in non-deterministic exponential time", or "it implements a Merlin-Arthur protocol", and it's clear that complexity theory has had an influence on how these debates are designed. They're often written in the sense of "it's great that this debate protocol solves everything in polynomial space; that's a ton of questions, and that's good", where "polynomial space" is both a specific claim about some protocol and also a stand-in for some big class of questions. The thing I always think about when reading those claims is: it's widely believed that polynomial space, or non-deterministic exponential time, or Merlin-Arthur protocols, include a ton of intractable problems. Does that mean we just won't be able to efficiently train agents that do well in these debates?

Beth Barnes: There's a bunch of stuff going on there, so let me say a few different things to unpack it. Firstly, the original claim is only something like: for problems where both debaters know the answer, and the problem is in NEXP, non-deterministic exponential time, or polynomial space, or whatever, optimal play in debate is to give the correct answer. But basically you're completely right, and this is now something that I am more worried about. Not to imply that Geoffrey and co hadn't thought about this initially, but there is a problem: the formal version of the proof assumes that the debaters are computationally unbounded, and in fact that's obviously a totally unreasonable assumption. The actual model that's going on is more like: these are some very clever ML models that probably have some clever tricks, or intuitions, that allow them to solve some problems where the only algorithm we can understand is non-deterministic exponential time. We clearly can't assume that our models can solve arbitrary problems in NEXP.
Beth Barnes: I think in fact this does cause a bunch of problems. It's obviously true in theory: say one debater claims, "RSA-2048 doesn't have any factors".

Daniel Filan: What's RSA-2048?

Beth Barnes: A large, is "semiprime" the right word? It has two very big prime factors, and it's very hard to find out what they are. So if one debater says, "it doesn't have any factors, because look, my opponent can't tell you what the factors are", obviously debate doesn't actually work here: even if our ML models are smart and can sometimes solve problems that look intractable to us, they can't arbitrarily solve hard computational problems.

Daniel Filan: Although in that specific case, we can efficiently determine whether or not a number is prime; we just can't tell you the factors.

Beth Barnes: Yes. So if the debater was able to respond, "well, this primality test shows that it's composite, and here's the correctness proof for the test", then the problem is: what if the other debater says, "I know that there's a mistake in your correctness proof, I just don't know where it is, because the proof is really long"? If the proof is actually short, and in this case the proof is not going to be that long, it's going to be kind of obvious that finding a mistake in the proof is a lot easier than finding the factors. But you have this problem in general, where both debaters can say, "I know that there's a flaw in your argument, but it's computationally intractable for me to find the flaw, which is why I can't show it to you".
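Daniel's point can be made concrete: compositeness can be certified without exhibiting any factor. A Fermat witness a with pow(a, n-1, n) != 1 proves n is composite while revealing nothing about the factors. Toy numbers below; RSA-2048 itself is a 617-digit semiprime, and Miller-Rabin gives stronger guarantees than the bare Fermat test used here.

```python
# A Fermat witness certifies compositeness without revealing factors.
# (Toy example; real primality testing uses Miller-Rabin or similar.)
n = 221  # = 13 * 17, but nothing below depends on knowing that
a = 2
residue = pow(a, n - 1, n)  # fast modular exponentiation
assert residue != 1, "2 would be a Fermat liar for this n"
print(f"{a}^{n - 1} mod {n} = {residue} != 1, so {n} is composite")
```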
Daniel Filan: There's another post you have, on something you call the obfuscated arguments problem. Is that basically the obfuscated arguments problem?

Beth Barnes: Yes. Again, this thing about the optimal strategy sometimes being computationally intractable was in the original debate picture. The claim we're making is that, in fact, the dishonest debater can construct arguments such that finding the flaw in them is intractable, but which appear to support the dishonest answer, and it's not easy to just show that these are weird, deliberately obfuscated arguments. It doesn't have to look like some cryptographic thing where you're asking for a known computationally intractable result: they can just make a bunch of confusing claims about physics, and the honest debater doesn't know which one in particular is wrong, and they can keep supporting those claims with further claims, where there has to be some mistake somewhere, but it's not clear where.

Daniel Filan: I guess the question is: what do we do about this problem?

Beth Barnes: To be clear, I think we haven't robustly experimentally shown that you can in fact do this. We came up with a bunch of these arguments, eyeballed them, and thought: yes, it seems like if you practiced this, you could be good at generating them. In particular, maybe if it's some debate about mechanics, I can say, "huh, that looks a bit fishy; aren't energy arguments generally simpler than mechanical arguments?", but if it was something about quantum physics, I'd have no idea what reasonable argument strategies in that domain even look like. So I don't think we've firmly established that this is true, or established it in a way that other people should necessarily believe, but I feel pessimistic about it. Where I'm currently at is: this does make debate significantly less powerful than we were hoping, and we're going to need another piece in order to know everything the model knows. In particular, like I was saying before, you have this problem where your model knows something, and the reason it knows it is that it had some heuristic that was reasonable a priori, and over training that heuristic worked well and was gradually upregulated, if that's the right word, and now the model just has it as a kind of intuition. How are you going to justify that in a debate? The classic debate approach would be to say, "it worked this well on the first half of the training data and this well on the second half, and if you think one of those is wrong, challenge it". But in practice this is another of those computationally intractable things: the debaters aren't going to remember exactly how well the heuristic did on exactly which bits of the training data; they'll have thrown that information away if they're being reasonably efficient. It's completely unreasonable to expect them to produce optimal play in a debate about exactly how well this heuristic did on every data point they've ever seen. So we're going to need some way of verifying those kinds of claims that is less unreasonably demanding of our debaters.
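The "classic debate approach" Beth describes can be sketched as a bisection game: a debater claiming an inflated average over a dataset must commit to consistent per-half sub-claims, and the challenger recurses on the half whose claim deviates from the truth, until reaching a single example the judge can check by hand. The data and the dishonest strategy below are invented for illustration; Beth's point is precisely that real models cannot play this game optimally over a whole training run.

```python
# Toy bisection debate over a claimed average; illustrative only.
def dishonest_subclaims(data, claim):
    """Split a challenged claim into per-half sub-claims whose weighted
    average equals the parent claim, so some inflation must survive."""
    mid = len(data) // 2
    left, right = data[:mid], data[mid:]
    left_claim = sum(left) / len(left)  # tell the truth about one half...
    # ...and push all of the inflation into the other half's claim.
    right_claim = (claim * len(data) - left_claim * len(left)) / len(right)
    return (left, left_claim), (right, right_claim)

def debate_average(data, claim):
    """Honest strategy: recurse on the half whose claim is furthest from truth."""
    while len(data) > 1:
        (l, lc), (r, rc) = dishonest_subclaims(data, claim)
        err_l = abs(lc - sum(l) / len(l))
        err_r = abs(rc - sum(r) / len(r))
        data, claim = (l, lc) if err_l >= err_r else (r, rc)
    return data[0], claim  # one data point the judge can verify directly

scores = [1, 0, 1, 1, 0, 0, 1, 0]          # true accuracy: 0.5
point, final_claim = debate_average(scores, claim=0.9)
print(point, final_claim)                   # the lie is now about one example
```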
Daniel Filan: So is the research strategy: think of types of claims that debate maybe can't deal with, and then come up with protocols for those?

Beth Barnes: That's a lot of it, yeah. The earlier work I was doing was figuring out how to set up these experiments and what a reasonable debate protocol would look like, and then: okay, we tried these debates, they didn't work for this reason, can we tweak the mechanism to fix it? There's a branch between "is this something we can fix by tweaking the mechanism?" and "is there a fundamental problem?". We ended up with a bunch of things, like the ambiguity issue, or not focusing on the right claim, that we think we fixed by changing the mechanism, and then this class of things that we think can't be fixed by changing the mechanism, where we're going to need a completely different approach. Paul has some ideas in this direction, which have variously been known as learning the prior, or imitative generalization.

Daniel Filan: I'm going to talk about that a bit later; I still have more questions about debate. One thing I noticed, and this question has a specific and a general version. The specific version: the thing you mentioned, where sometimes your reason for believing something is "it's just a heuristic that worked well in my experience, and no other heuristic did better". I think that comes up in human arguments, right? I have this sense that sometimes, when my dad is telling me something, that's his reason. And in general, the task of coming up with ways of structuring debate that promote the truth seems pretty relevant to humans who want to use discourse to discover the truth. So I'm wondering, in both the specific case and in general: to what extent has thinking about debate in this AI context helped you think about how we, as humans trying to have rational discourse, can structure that well?

Beth Barnes: I haven't in fact thought about this that much, particularly because once we're getting towards the more exotic things involving making copies of people, it's not very practical.

Daniel Filan: Not yet.

Beth Barnes: Yes, indeed. I think the other thing is: all this stuff that can only be justified by intuition. In human debate you do get this; a lot of things that come up in human debate are just based on intuitions, where it's opaque to the human why they know it. So some of this is a fundamental problem that's shared with ML debate. But you might hope that in ML debate you can do better, because these models are being trained to reason in a way that's transparent to humans.

Daniel Filan: Why, or how, are they being trained that way?

Beth Barnes: In the sense that that's what debate incentivizes: it incentivizes being able to explain your reasoning in a convincing way. Maybe there are various adaptations where humans just throw away their reasoning once they're done with it; it might be relatively easy for our cognition to be more transparent to us, but it just happens not to be, because that hasn't happened to be particularly useful. If we were being optimized for that, it might be better.
Beth Barnes: So I think there's definitely interesting stuff here. The idea of making it clear how your argument splits up into parts, and then having some mechanism for choosing which part to focus on that makes use of the fact that the people in the debate might have more idea than the listener of which part is productive to focus on: that's good. But there's another difference: if you're thinking about how to have truth-conducive debates between people who are already trying to be truth-conducive and trying to reach agreement, you don't need these mechanisms so much, and in places where people aren't trying to be truth-conducive, you're not going to be able to get them to agree to use the mechanism. For political debates, if you said, "instead of having a televised debate where you just talk at each other, you'll play this game, this interaction with all these rules, and some panel of judges will judge all the individual parts of it, and then there'll be an answer about who's the winner", people probably aren't going to want to do that. And the other reason it's harder in that context is, like I said before, an individual debate transcript, at least how we've been thinking of it, should not necessarily be convincing to the judge. It's only convincing if you assume optimal play everywhere else. It's not that you see this transcript, read the argument, and say, "oh, I understand, I see why this answer is correct"; it's, "okay, if I assume that all the unchallenged answers were in fact correct, then I see that this is correct".

Daniel Filan: In terms of the political example, and the worry that it's only useful if all the players are trying to be honest: I guess there could be contexts where not all of the participants in a discussion are trying to be honest, but they're trying to appear as if they're trying to be honest. It seems like that should come up at least sometimes in human discussion, right?

Beth Barnes: Yeah, I think that's right. The context in which it should be most useful is: you have several experts who disagree, and maybe some of them are being dishonest, or are kind of frauds, and you don't know anything about the field, but you really want to know the true answer. And you have some control, because they're competing for your custom: you want to pay one of them to do something for you, and you want to figure out who's actually good, or actually correct. Then you can say to them: instead of just having a discussion in front of me, I want you to play this game.
Daniel Filan: Okay, so sort of related to this: one question I have is that it seems a key feature of the debate structure is that you have these very unambiguous claims that are either definitely true or definitely false, and it's both the case that that's what you're debating, and that those are the sub-claims being brought up. Do you think this covers much of the space of what we're going to want to know from big, advanced machine learning models? And do you worry that it's going to make it hard to discuss things that you only partially understand, but can still make valid arguments about? I guess this might be the same thing as the heuristic issue.

Beth Barnes: I think there are some different things to say about this. This isn't exactly what you're asking, but we did in fact run into some trouble with thinking of it as "the debaters make claims, and those claims are either true or false", because "true or false" is actually not that clearly defined. The classic example being "the king of France is bald": is this claim true or false, given that France doesn't have a king?

Daniel Filan: Seems false.

Beth Barnes: Well, it's definitely not true, but then what about "the king of France does not have hair"? If you formalize it, it's something like "there exists a king of France, and the king of France is bald", or, well, anyway: this isn't always as clear as you might hope. So we instead moved to the framing: there is a question, and the debaters give two answers, and what the judge is trying to say is which is the better answer, or whether they're indistinguishable. Then, if the debaters themselves only have fuzzy knowledge, they can say, "this is probably something, but it might be this other thing", and that can be a better answer than some confident claim, if you can show that the confident claim is overconfident. I don't know if that answers your question. A related thing, which you might have been asking about, is: can you use this for claims that have a moral component, or a component that's about someone's personal preferences, not just about things that seem like objective facts about the world?

Daniel Filan: What if I ask that question?

Beth Barnes: Okay. One thing is that once you start saying "what is a better answer to this question", it does end up being about the judge's preferences, and you end up getting into ethics pretty quickly, because part of one of the debaters' arguments could be, "this answer is better because it will cause you to do more ethical things in the future". "Better" ends up getting into ethics. I think there's an interesting phenomenon that happens in policy debate, this crazy thing that Americans do where they talk very fast.

Daniel Filan: Yes.

Beth Barnes: It's high school or university debate with a particular set of rules, which has these very elaborate strategies, and also involves people talking at ridiculous speeds. One of the things that tends to happen, and there's a term for it which I can't remember, is that one of the debate teams chooses a strategy of not necessarily trying to argue that some answer is factually true, but arguing that the judges are ethically obliged to pick their team for some reason.
A related thing you might have been asking about is whether you can use this for claims that have a moral component, or a component that's just about someone's personal preferences, not just about things that seem like objective facts about the world.

Okay, what if I ask that question?

Well, one thing is that once you start saying "what is a better answer to this question", it does end up being partly about the judge's preferences, and you end up getting into ethics pretty quickly, because part of one debater's argument could be "this answer is better because it will cause you to do more ethical things in the future". "Better" ends up getting into ethics. I think there's an interesting phenomenon that happens in policy debate, this crazy thing Americans do where they talk very fast: a high school or university debate format with a particular set of rules, very elaborate strategies, and yes, people talking at ridiculous speeds. One of the things that tends to happen, and there's a term for it which I can't remember, is that one of the debate teams chooses a strategy of not necessarily trying to argue that some answer is factually true, but arguing that the judges are ethically obliged to pick their team for some reason. There was some question where one team's answer was essentially that policy debate would be perceived as more fun and engaging if this answer was favored, and policy debate being fun and engaging was beneficial to students in all these ways, or something like that. So the debate ended up being much more about appeals to the morals of the judges, and about the consequences on the world of them choosing one answer or another. I think in some sense, even if you try to avoid this, debate tends to drift in that direction, because ultimately what you are trying to do is persuade a judge to choose one action over another, where one action is picking one answer and the other action is picking the other.

Wait, why couldn't you solve... I feel like listeners with a rationalist bent might be thinking: okay, why can't we just solve this by saying the way you judge debates is which answer corresponds more closely to reality?

Well, then you're weighing the rules against... you could still appeal to someone's moral obligations. Suppose you're a contractor employed as a judge in some game, and you have a certain set of instructions to follow, but the debaters say "if you pick this answer, some bad thing will happen in the world", and they persuade you of that. Ultimately you're trying to make someone make a decision, and what you're appealing to is the way they make decisions, which is related to their ethics. So you can try to have rules to rule this out, but I also don't think you actually want to, because if you do that, what are you going to do about any questions that are about what you should do?

You'd be like, oh well, the rationalist would say that's not a question that has a meaningful answer... or, I don't know, another response would be that there are just facts about what you should and shouldn't do, and the way you should answer the question is to correspond well with the facts about what you should and shouldn't do.

Yeah, I don't know. It seems tricky. Where I'm currently at is: you have to accept that things are going to come down to "what are the consequences of me choosing this answer versus that answer".

Okay, so here's another reason why that's difficult: something could correspond more closely to the facts but be really unhelpful for some other question that you were trying to answer. Maybe an example: a fact about myself is that I was born in Australia and raised in Australia, and I have Australian citizenship, but I live in the United States, and probably many of the people I am most similar to in the world are Americans. I sound kind of like an American.
So perhaps an example, and you can tell me if this is wrong or not: you're asking a question because you want to figure out part of my deal, say whether Daniel lives in America, and you ask "is Daniel an Australian?", and the answer is "well, yes, but that's not the important thing for what you really care about". Is that a kind of weird example, or is that an example?

Yeah. I guess the more general claim is just that what counts as a good answer to a question depends on the context in which it's being asked.

And even formally, degree of closeness of correspondence to reality depends on some metric, right?

Yeah. Say you ask, I don't know, "am I sitting still?". Well, in some mundane sense you are, and in some other sense I'm moving through space really fast, and in some other sense there just isn't an answer to that question, because it depends on the reference frame, and in some sense I'm twitching a tiny bit. So I think it's easier to follow a rule like "what is a better answer to that question", which includes how that question in context will help you make decisions and figure things out, rather than "does it technically correspond to reality" or something like that.

Okay, so I think I'd like to move on a little bit. You mentioned that one problem debate has is that it's bad at revealing knowledge, or revealing facts, that are based on heuristics which have worked well in the past and that we're going to apply now, and you mentioned that there's this approach called intuitive generalization that was maybe good for this. Could you talk about that?

Actually, I had one more thing to say about the previous topic, if that's all right.

Yeah, definitely.

On the subject of having some human who's been given some instructions and told to follow them, while the debaters are just trying to persuade them to do one action over the other: I think there's definitely some concern there for when you actually run debate. In the experiments and the theoretical thinking we've been doing, we've basically been assuming that the humans will correctly follow the instructions as written. But related to the worry that debate is hacking humans, or persuading them, there's something where, even if the debaters don't wrongly convince the judge of something, they might convince them to not follow the rules, and then everything else breaks down. We hope that debaters won't be able to hack humans or convince them, because they can only make these relatively short utterances, and we're doing all these checks to make sure that any suspicious, persuasive stuff doesn't get passed on. But if they can persuade the humans to not follow the rules, then you're kind of in trouble.
One kind of silly example Evan and I were thinking about is a debater who sort of breaks the fourth wall and, rather than talking about the subject at hand, tells the human they'll suffer horribly if they don't pick their answer, or something like that. So that's something we haven't thought much about, and something I think we're going to need to think about more with lots of different alignment techniques: how does this actually work with real humans in practice, given that they're not going to follow the instructions perfectly, and they can be persuaded that following the instructions is a bad idea?

And that Evan is Evan Hubinger, who appeared on episode 4 of this podcast, so it all links together. That was interesting. So, I think you mentioned that there's some problem where you just have some heuristic that works well, and that's why you believe a thing, but it's hard to give an argument other than "this heuristic has just worked really well", and I believe you mentioned that there's this approach called intuitive generalization... of generalization... imitative...?

We've been through a lot of names, and none of them are good. Sorry.

No, I think the problem is that I was thinking of the word "intuition" and my handwriting is messy. But: imitative generalization. Could you explain what that is?

Yeah. I think the easiest framing to come at this from is via debate, given that we've been talking about that already. As I mentioned before, you can imagine one debater saying "I just know that breaking this physics problem down in terms of conservation of energy is more likely to get the right answer than the approach you're suggesting, where we break it down in terms of all the forces. I don't have a logical argument for why that's true; I've just tried a bunch of problems and that's what happened to work better." And the other debater says "well, I don't believe you; I've tried a bunch of problems and the forces approach has worked better." So how are the debaters supposed to justify that? The standard debate approach would be: they can claim it worked well in the first half of the examples and it worked well in the second half of the examples, and if I'm lying you can challenge one of those claims. But this is impractical. What imitative generalization does is: rather than having to go over every single training example every time one of these kinds of claims comes up, you do one pass over all your data and you get out an object that captures all of that knowledge, and an amplified human (a human amplified using IDA, or a human interacting with a debate) is able to interact with that object and use all of the information.

So, what does this algorithm actually look like? First I'll give the version where you have as many humans as you want and can use a bunch of their time. You search over some space of hypotheses, which are sort of full distributions over everything you think about how the world is.
For each of these hypotheses, you have your human-supervised debate about what the prior probability of that hypothesis should be, and for each data point in your training data, you have a human-supervised debate about what the likelihood of that training data point is, given the hypothesis. So for each hypothesis you've got a prior and a likelihood score, and then you search over all of the hypotheses, or as many as possible, until you find the best ones. This isn't really intended to be a MAP point estimate; ideally it wouldn't just give you the single best hypothesis but more like a distribution. But obviously, in practice, we can't do all that.

And here the hypotheses are, like you said, these ways of reasoning about how everything will pan out?

Yeah. Concretely but impractically, you could think of a hypothesis as just a very long text that explains a bunch of rules for how to think about things, or ways the world is: various physical constants, laws of physics, principles that are useful for understanding history, things like this, that don't have any explanation behind them and are just what fits the data. And if you imagine this giant wall of text, obviously it's too long for the human to read and engage with all of it, so that's where you use something like debate: for any given hypothesis, one debater might say "this part should mean the prior is lower, because it's unreasonable for this reason", or something like that.

So you told us how we could do this if you have as many humans as you want, with as much time as you want.

Actually, I realize that's not quite what I gave you, because once we've introduced debate, we can say the same things we said about debate before. One is that you don't in fact have to use the human every time; you can train a reward model from the human and then run your debates against that reward model. The other is that once you've trained your debaters, you don't have to run whole debates: you make them think they're about to do a debate and just ask them for an answer, and they give the sort of answer they could support in a debate, which is hopefully truthful.

But that still seems like you're running this process over every big, long wall of text, right? And there are lots of possible treatises.

Yes, there are a lot of difficulties with this. To be clear, I don't think we have anything in mind here that actually seems like it would work; it just seems like the right sort of direction, in that you can describe versions of it that are completely impractical but would sort of do the right thing. To make this practical, you have to have some reasonable way to represent these hypotheses such that the humans can still interact with them, but they're feasible to optimize, or feasible to explore the space of.
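To pin down the shape of the idealized version just described, here is a rough Python sketch. The two `judge_*` callbacks stand in for the (amplified-)human judgments, for example the verdicts of debates about the prior and the likelihoods; they are assumptions of this sketch, and the whole loop is, as described, completely impractical.

```python
# Idealized imitative generalization: score every hypothesis by a
# human-judged prior plus human-judged likelihoods over the training data,
# then keep the best. Impractical by construction; the point is the shape.
import math
from typing import Callable, Iterable, List

def score_hypothesis(h: str,
                     data: List[object],
                     judge_prior: Callable[[str], float],
                     judge_likelihood: Callable[[str, object], float]) -> float:
    # log p(h) + sum over x of log p(x | h), with each factor settled by
    # human judgment (e.g. via a debate) rather than computed directly
    log_prior = math.log(judge_prior(h))
    log_lik = sum(math.log(judge_likelihood(h, x)) for x in data)
    return log_prior + log_lik

def best_hypotheses(hypotheses: Iterable[str],
                    data: List[object],
                    judge_prior, judge_likelihood, k: int = 1) -> List[str]:
    # "search over as many of the hypotheses as possible"
    ranked = sorted(hypotheses,
                    key=lambda h: score_hypothesis(h, data, judge_prior,
                                                   judge_likelihood),
                    reverse=True)
    return ranked[:k]
```

In practice, as noted above, you would amortize the human calls, for instance by training a reward model on a limited pool of human judgments and running the debates against that.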
You might also want to do something like having humans demonstrate modifying the hypothesis to make it more likely, or modifying it to support the data better, and then train your ML to imitate that human exploration process. There are various other things you could fiddle around with to try to make this actually tractable, and it's not clear whether that will all work out, or whether one of the modifications you'd need to make will actually break everything. The core hard problem does seem to be: how do you represent something that is competitive with representing it as a big neural net, in terms of ease of optimization, but still has the property that the human can actually engage with it and understand what it means?

So is the idea that, if somebody says "here's a heuristic that I've learned that has worked really well in the past and is going to serve me well in the future", the human is going to check "that's actually a crazy heuristic", or "that doesn't really support what you say it supports", or "that didn't really work well in the past", in a way that we couldn't do mechanistically?

I'm not quite sure what your question is, but...

I guess the question is: what's the benefit of this over just machine learning?

Oh. The other way to approach why you would need this is basically that machine learning generalization is kind of scary. We have no particular reason to believe that the prior that's implicit in ML architectures is a good prior. In particular, the worry is that something that performs really well on your training data is an agent that reasons about the process it's embedded in, and reasons about how to get a high score on the training data, and then who knows what that does later on. We think there's not that good a reason to believe that the neural net prior would favor something that's directly a model of the world, which will continue giving you sensible answers and doing what you expect, over something that's reasoning about the training process. Especially because, if you ever have mistakes or quirks in your training data, you're going to favor the thing that's an agent reasoning about the training process, because it will get those things "right", versus something that's trying to be an honest model of the world, which will get those things "wrong". And in particular, we're always going to have to do this generalization between questions that we know the answers to, or can produce training data for, and questions that we can't supervise at all. The way something like imitative generalization helps with this is that you are using the human prior rather than the neural net prior: you only ever use ML generalization for things that are basically on the same distribution, where you can check all of your ML generalizations, and the uncheckable generalization is informed by the human prior. This is kind of confusing to explain, sorry. Okay, so, to give a sense of the algorithm, and why it sort of helps with generalization (though this maybe doesn't fully convey why it's useful):
Imagine that you're doing every ML researcher's favorite task, which is labeling pictures of dogs. You have a training set with a bunch of breeds of dogs that are labeled, and a test set where you don't have any labels, but you want to label the dogs correctly. Further imagine that you have various human contractors, but they don't know anything about dog breeds, and for whatever reason there aren't any books or things to learn about dog breeds; all you have is this one labeled training set. The standard ML approach is: train your model on the training set, then run it on the test set and hope for the best. But we know there are various things about the ML process that can make this go wrong. For example, if in your training set all of your pictures of huskies were on a snowy background, then given a test-set picture of a husky on grass, maybe your model thinks it's not a husky. You could think about this as: your neural net prior doesn't favor the hypothesis "a husky is a big fluffy dog that's kind of like a wolf" over "a husky is an image with a lot of white and blue pixels in it".

What imitative generalization would do here: your hypothesis space is a sort of instruction book about how to label the images, which could contain different possible instructions, like "if there is snow, label it as husky", or "if the dog is gray and fluffy with pointy ears and kind of wolf-like, label it as husky".

And really, the hypothesis space is: any one of these books is one hypothesis, and the space is all of the possible books?

Yes, all of the possible instruction sets for what the dog breeds look like. So then you give these instructions to your humans, get them to label the dogs in the training set, where you know what the correct label is, and see how well they do, and you optimize the instructions to make the humans do well at labeling the dogs. But importantly, this alone wouldn't give you a distinction between the reasonable description of the husky and the description of the husky as an image with snow. Where you get that is: you also give these instructions to the humans and ask them to say, a priori, whether they're reasonable. For the purpose of the example, we assume that the humans vaguely know what animals and dogs are, but don't know specific dog breeds. So the humans would favor basically any description of a husky that describes plausible characteristics of a dog as more likely than the description of a husky as an image with lots of snow in it. So you optimize the guidebook, and hopefully you get one that has sensible and correct descriptions of how to label all the dogs. Then you give that to the humans and have them label your test set, and you train your ML model to imitate the result of the humans labeling the test set. Now you can make arbitrary predictions on your test set, you can always check them with the humans if you want, and you have on-distribution labels for the task.
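A sketch of this dog-breed example in the same style. The contractor calls (`follow_book`, `prior_plausibility`) and the naive hill-climbing search are stand-ins made up to show the shape of the optimization; a real version would need a far more feasible representation and search procedure.

```python
# Optimize an "instruction book" so that (a) contractors following it label
# the training set accurately and (b) contractors judge it a priori
# plausible. Both human calls and `propose_edit` are placeholders.
from typing import Callable, List, Tuple

def book_score(book: str,
               train: List[Tuple[object, str]],             # (image, label)
               follow_book: Callable[[str, object], str],   # contractor labels image per book
               prior_plausibility: Callable[[str], float]) -> float:
    accuracy = sum(follow_book(book, img) == label
                   for img, label in train) / len(train)
    # "a husky is a fluffy, wolf-like dog" should get a higher prior than
    # "a husky is an image with lots of white and blue pixels"
    return accuracy + prior_plausibility(book)

def optimize_book(book: str, train, follow_book, prior_plausibility,
                  propose_edit: Callable[[str], str], steps: int = 1000) -> str:
    # naive hill-climbing over candidate books, purely for illustration
    best = book_score(book, train, follow_book, prior_plausibility)
    for _ in range(steps):
        candidate = propose_edit(book)
        cand_score = book_score(candidate, train, follow_book, prior_plausibility)
        if cand_score > best:
            book, best = candidate, cand_score
    return book
```

Once optimized, the book is handed to the contractors to label the shifted test distribution, and an ML model is distilled from those human labels, matching the final step described above.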
So once you've moved to a new distribution, you get human labels on that distribution, and you can generalize from those human labels, which were generated from what was learned on the previous distribution?

Yes, yeah.

So bringing that back to the debate context: am I right that one example of where you might want to use this is, one debater says it's going to be easier and better to use a breakdown of the energies in this problem to solve this physics thing, and another debater says it's actually going to be easier and better to do the breakdown of the forces? Is the idea that, using imitative generalization, the reason that's going to go well is that we have a good human prior over which one of those is better?

It's not quite that. The reason this is hard in debate is partly just that it's intractable to do within a debate. You could still imagine having a debate about whether using an argument based on forces is better a priori, or whether using an argument based on energy conservation is better a priori. So the difference from debate isn't necessarily about the prior; the difference from standard ML is about the prior, and the difference from debate is just that this is more efficient, and more of a reasonable setup. If you did the IG thing with debate: instead of, every time you have a question that relates to something that's justified by intuition from a lot of data, having to go over your data again, you just look at the hypothesis that IG produces and ask what it says about that. So in this case, you'd look at the object you've got out and ask what it says about when you should use which kind of argument: for what sort of physics questions are arguments based on forces more likely to give the correct answer than arguments based on energy?

Okay, so you would sort of augment debate: we're going to run imitative generalization on the side, on a ton of stuff, and we're going to use that to inform heuristic questions?

I usually think of it more as using debate on top of imitative generalization, rather than the other way around, though I don't think it matters that much. You get this object out of imitative generalization that encodes everything a reasonable ML model could have learned from this dataset, that's interpretable to humans, and that is a reasonable sort of model, not an agent that's trying to tell you what you want to hear. And then you can have debates that bottom out either in some reasoning, "well, this implies that", or in "the stuff we learned from data says this".

Okay. And am I right in thinking that this is a line of work to make something like imitative generalization work, as opposed to "we now basically know what it's going to look like"?

Oh yeah, we really don't know. I think Paul is vaguely optimistic that something will work, but it definitely doesn't feel like we have something reasonable yet. And I can say a bit more about the difficulties, and what I
think the hard core of this is.

Yeah. I mean, it seems like... in my head there were two difficulties that jumped out, but I guess it's just one: there are a lot of long treatises that you have to deal with.

So, one other way to see why this is kind of tricky: maybe one hypothesis that does really well on the training set is just "copy the output of this big neural net". That does great, it gets all the answers right, but now we're back in the same place as before: we've just got to trust this thing, and when we run it on the test set, will it give us reasonable answers? We have no idea. So you need the human prior to down-weight "just believe this big neural net". But then it's kind of unclear. What we want is something like: if there is any sort of additional structure that could be pulled out of that neural net and exposed to the human, a hypothesis that does that will beat the one that says "just trust this big black box". That seems sort of doable, but it's very confusing how you represent things in a way that gives the opportunity for that. And if you imagine representing everything that AlphaFold knows in text, that's just going to be horribly inefficient to optimize and to have the humans engage with, and the humans are going to have no idea what's going on. How do you represent that in such a way that the human has meaningful understanding and it's reasonably efficient? This, I think, ends up being pretty close to the hard problems of interpretability.

Chris Olah's ideas about microscope AI feel like the original statement of this sort of problem, or of what we want to do. Chris says one way to have the capabilities of AI safely is, if we're able to do sufficiently good interpretability, we just look inside the model and see all the things it knows, all the things it's extracted and understood from the world, and then we use that knowledge to do stuff that we think is important. The obvious difficulty there is: how are we going to understand what the model knows? How are we going to come up with interpretability techniques that turn the gazillion weights into "ah, now I have this insight about the world"? One way you can imagine setting up imitative generalization, and the way you could represent the hypotheses, is: your hypothesis is a neural net with a bunch of annotations, and you optimize the weights and the annotations jointly, such that the combination of weights and annotations is plausible to the human, and, when the human assumes that the annotations are correct, they can use it to make good predictions. This ends up looking quite a lot like training a model for interpretability, where your standard for whether something is a correct interpretation of what the model knows is: does it cause the human to get correct answers on the training set?
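A compressed sketch of that joint objective. The two human calls are placeholders for judgments that would, in practice, themselves be produced by debate or amplification rather than a single API call.

```python
# Hypothesis = (weights, annotations), optimized jointly so that the human
# finds the pairing plausible and, taking the annotations at face value,
# predicts the training labels well. All callables are placeholders.
from typing import Callable, List, Tuple

def joint_objective(weights: object,
                    annotations: object,
                    train: List[Tuple[object, object]],
                    human_plausibility: Callable[[object, object], float],
                    human_predict: Callable[[object, object], object]) -> float:
    # (1) does the human find this (weights, annotations) pairing plausible?
    plausibility = human_plausibility(weights, annotations)
    # (2) assuming the annotations are correct, how often does the human
    #     reach the right answer on the training data?
    accuracy = sum(human_predict(annotations, x) == y
                   for x, y in train) / len(train)
    return plausibility + accuracy
```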
You can imagine the annotations looking a bit like the circuits that Chris Olah and collaborators have produced for the models we have now. In the husky case, the human would look at the annotations and see "here's the bit that's a fur detector, and this feeds into this pointy-ear detector, and here's the curved-line detector which detects a tail", and if they saw that the husky detector didn't use any of those features and just used a snow detector, they'd think that was suspicious, that that isn't how it should be structured. And the reason to think that the annotations you end up with when you optimize do correctly reflect the model's knowledge is: if you've labeled something as a pointy-ear detector versus a floppy-ear detector, but it's actually a boat detector versus an aeroplane detector, that will not make your humans get the right answer when they're trying to label dogs.

Okay, that actually gets to another question I had. Suppose some listener to this is thinking "this line of work is pretty exciting, I'm pretty pumped about it, but for whatever reason I don't want to work on it directly; I don't think that's for me". Maybe there are complements: things people could do that aren't exactly working on this, but that are really useful for it. What do you think the complements of your work are?

To be clear, I'm not exactly working on this either; I'm currently just doing miscellaneous stuff. But yes: I'm really excited about people just doing a bunch more interpretability. There are a bunch of benefits to this, and several different ways it ties in to both this stuff in particular and safety in general. As I was saying, the Chris Olah-style, microscope-AI-type interpretability in particular is trying to solve the same, more specific problem that we're running into here. It's focused on the same thing: how do we represent knowledge in a way that a human, maybe an assisted human (assisted with debate, and whatever interpretability tools we can build, and whatever else we can build that's helpful to the human), can actually engage with the stuff the model knows? And more broadly, I'm just excited about more interpretability. On the margin, doing more interpretability, and raising the bar, the expectation of how much we understand our models before we deploy them and do stuff with them, would be really good, because every bit more that we know about our models, and every bit of insight into what they're doing, is going to be helpful for safety. Sometimes I think about there being a kind of window for a sort of treacherous turn: the time between when a model is smart enough to think about deception and when it's good enough to get away with it. The more interpretability you do, the more you increase the probability that you'll be able to see that your model is thinking about these kinds of things, or considering doing things you don't like, rather than having to see it actually happen in practice.
And that pushes the window out: it's going to be a lot harder to make it invisible that you're thinking about deception, as opposed to just avoiding trying to take over stuff, or avoiding behaving in a way that's obviously bad. So there's the way this ties into the Paul-style alignment story, where we want to have the story of how we're really going to know everything the model knows. And even if you think that's just way too demanding and we're never going to be able to do anything like that, it's still useful to have a bit more interpretability. Even if we can just do very crude things, like "is this model thinking about some category of stuff it really shouldn't be thinking about?", or "is this structured in a way that suggests we've got an agent, or that it's doing a bunch of search and planning, when that wasn't what we wanted?". Or even: we know we've got something that's an agent, and we want to try to take parts of it that we can use without using the whole agent; can we figure out which bits are the world model and which bits are the agent, and do some very crude kind of lobotomy, model surgery? Even having a vague idea of which bits are doing what would help. (This idea is due to Richard Ngo, by the way.) It is interesting that in humans, frontal lobe lobotomies do in fact seem to do the thing of mostly removing agency and mostly preserving world knowledge. I don't know very much about this and would be interested in someone looking into it more, in particular whether lobotomy patients who are extremely passive can still model or understand the agency or actions of others. That is very much a tangent, but anyway: one of the things I'm excited about is interpretability, for various reasons, and I think it ties particularly closely to this.

Other things I'm excited about people doing: I'm excited about people generally building more of the infrastructure for these very human-in-the-loop training procedures. If we are going to want something where we have a large number of humans who are carefully and correctly following some moderately complicated instructions, and we need them to make good judgments, and we maybe also need them to be reflective of broad values in some reasonable sense, what are all the logistical challenges of doing that? You could do several iterations of trying to get humans to follow instructions while they interact with something that tries to persuade them not to, or something like that. What do you have to do to be confident that your humans will actually follow the instructions you've given? How are you going to set up layers of oversight and monitoring such that you can make sure your large number of humans are doing the correct thing? Something else I've been thinking about recently is: if you want your humans, the debate judges in the debate example, to reflect broad values, how would you do that? And how much convergence is there between people who have different object-level ideas about morality? Do they agree more about higher-level principles?
Like: what is good reasoning about ethics, who are the right sort of people to trust, what are the sorts of things you do to get a better answer to an ethical question? I feel like people who think about something like coherent extrapolated volition have sort of assumed that people pretty widely agree that you will make a better ethical decision if you've thought about it for longer and have more information, but I'm actually not so sure whether that's true across all cultures, or whether some people would say a decision is more valuable if you make it just based on faith, or that the first thing you should do is lock in your faith and make sure you don't get corrupted or tempted away from it by arguments. On the other hand, I think this could be useful for reducing races, or the sense of people wanting their side to develop AGI first, if we can show that people do in fact converge and agree a lot, or if we can construct some kind of example system, "here is this thing that kind of works", where there are a bunch of ways people can converge, and it would reasonably represent everyone's interests in a way they're happy with.

Okay, cool.

Also, on the obfuscated arguments post: we kind of took that work to a point where we eyeballed it and said "okay, we're sufficiently convinced that this is a limitation that we're going to move on to some other things", but it's really not very solid. It would be really nice for someone to try to experimentally validate that obfuscated arguments exist, and that people can't distinguish obfuscated decompositions from honest decompositions. Although a negative result with humans might not be that convincing, because maybe humans are bad at coming up with obfuscated arguments since that's not usually how they think, but if you trained models to do debate, they'd get really good at coming up with obfuscated arguments. Anyway, I think it would be worth investigating that a bit more. There are also some theoretical CS proofs, about Merlin-Arthur protocols being equivalent to the formal version of debate in a certain setup, that turned out to be a bit of a headache. I wish I had been able to do them well; I kind of gave up, but I could imagine someone with the right background being able to do that much more nicely, because I think we're still not quite sure whether we've got all of that right. It's relatively straightforward in the case where we're just talking about a big argument tree where the argument is all conjunctive, and the question is whether there's a flaw somewhere. If the debaters have no idea where the flaw is, it's relatively obvious that you could have a different protocol where you only have one model, and the branch to take is just chosen randomly, such that if the second debater was only choosing randomly anyway, those are equivalent. When the argument isn't actually conjunctive, when parts of it are disjunctive, it's a bit less clear what the simple equivalent is.
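A toy model of the fully conjunctive case, to make the claimed equivalence concrete; the tree, the two challengers, and everything else here are illustrative, not drawn from the actual proofs being discussed.

```python
# In a fully conjunctive argument tree, the argument is sound only if every
# leaf is sound, and a debate is one root-to-leaf descent. A challenger who
# knows where the flaw is always reaches it; a challenger with no idea does
# no better than random branch choice, which is the equivalence at issue.
import random
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    flawed: bool = False  # meaningful at leaves only
    children: List["Node"] = field(default_factory=list)

def subtree_has_flaw(node: Node) -> bool:
    if not node.children:
        return node.flawed
    return any(subtree_has_flaw(child) for child in node.children)

def descend(node: Node, choose: Callable[[List[Node]], Node]) -> bool:
    """Walk one path down the tree; return True iff a flawed leaf is reached."""
    while node.children:
        node = choose(node.children)
    return node.flawed

def informed(children: List[Node]) -> Node:
    # a challenger who knows where the flaw is heads straight for it
    return next((c for c in children if subtree_has_flaw(c)), children[0])

def uninformed(children: List[Node]) -> Node:
    # a challenger with no idea is equivalent to random branch choice
    return random.choice(children)
```

With disjunctive parts, exposing one flawed leaf no longer refutes the whole argument, which is why the simple random-branch equivalent stops being obvious.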
Okay, so there's lots of stuff there for people who'd be interested in thinking about this. So in terms of this general approach: I sort of see it as saying, okay, we're going to come up with debate, we're going to find some flaws with debate, and we're going to come up with a new thing to fix the flaws. Imitative generalization is an idea for fixing one of the flaws. How many other fixes do you think are left to be come up with?

There are maybe a few different categories of things. There's the thing I was talking about before, actually getting the logistics of this whole training process set up and making sure the humans are doing what you expect them to do; that's a whole thing, and it's going to be tricky. And there are various sorts of tweaks to the mechanism and the structure to make it better. Maybe I'm being very overconfident and too inside-view here, but it does feel like maybe there aren't additional problems that are as big as, and of a similar type to, this problem of knowledge that's only justified by intuition. It feels plausible that the pieces of IDA-slash-debate plus imitative generalization capture all of the kinds of ways of knowing things. There could easily be something else, and I'm not that confident, but it doesn't feel like there are going to be loads more pieces of that size.

Something else I'll say here, because I haven't mentioned it yet and it's maybe kind of important: I, and I think Paul, mostly don't think that we will actually use debate to train ML systems. The reason it's worth thinking about, at least for me, is that it's easier to understand and think about, and it's a lot more practical to do human experiments with than IDA, but it has basically the same set of limitations. The analogy is that if there is a winning dishonest strategy in debate, that corresponds to a problem you might have in IDA, and if there is no winning, or drawing, dishonest strategy in debate, then your IDA should work. So I imagine that what we're actually going to do looks a lot more like amplification, but we're thinking about things in terms of debate. For what it's worth, I think Geoffrey Irving, who's the actual original creator of debate, maybe thinks it more likely that we will actually do something that looks like debate, and doesn't see it as just an analogy, although he definitely sees it as analogous to IDA; that's how he came up with it, while he was trying to understand IDA.

I'm not sure we've explained exactly what IDA is in general, but now that we've said it's the thing we're actually going to do: what is IDA, and what does it stand for again?

It stands for iterated distillation and amplification. I mentioned HCH before. The idea is that one way you might solve a problem, or answer a question, or have a more formalized model of humans thinking for a long time, or doing all the deliberation they can, is to have humans who give subtasks to other humans,
and then use the results of those subtasks to answer part of their own task, and pass that up.

Yep.

So IDA is a proposal for how to tractably train an ML system that gives the same results as this HCH tree. One way you can think about it: first, you train a model that just imitates the human doing tasks. You have the human do a range of tasks, you train a model on that; if the imitation is good, that's relatively safe. Then you have the human do tasks using that model to help them with subtasks, so this is now a sort of depth-2 tree, where you have a human who sends queries to models. And now you can train another model to imitate that: you're training your model to imitate the depth-2 tree, and that little tree can hopefully do more than a human alone could do, because the human is able to ask for help.

Gotcha. And what's the distillation, and what's the amplification? I can sort of see what's being iterated here.

The amplification step is when you give the model to the human, to amplify their ability to do stuff, and the distillation step is when you train a model to imitate that team of humans and models.

Okay. And so the thought is, that's what we're actually going to do, and all these experiments about debate are just to inform us about how well that works? It seemed like the detailed rules were pretty important in debate; how are we going to transfer that knowledge to this IDA protocol?

In some ways the rules are basically irrelevant: they were just us figuring out which things are real problems and which are contingent problems. We fiddle around, we find a problem, we fiddle around and fix the rules, and eventually we find a problem that we can't seem to fix with any adjustment of the rules, and then we think about it and realize it's a deep problem for both approaches. But I think parts of the ideas and rules will still apply. Actually, this is honestly not something I've thought about in that much detail, because I've either been in the space of "let's just try to make debate work concretely" or in the space of "let's think about the deep problems and how they relate to each other". It feels like we're generally learning useful things, but I'm not actually sure there'd be that many direct correspondences between this particular set of rules and what we end up doing. There are also various ways you can blend debate and IDA. For example, instead of training your model to imitate this little tree, you could train your model to maximize the reward given by the little tree: your model does some stuff and gives some answers, and then the human, with the assistance of a previous version of that model, assesses how good the answers were. So you can run something that's half-and-half, where you have a little debate, and the debate is judged by the amplified human, or something like that. There's a sort of continuum between these; you can interleave them.
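A schematic of the distill/amplify loop as just described; `human_answer` and `train_imitator` are placeholders for the human workflow and the supervised-learning step, not a real training API.

```python
# Schematic IDA: amplify (the human answers with the current model's help),
# then distill (train a new model to imitate the amplified team), repeated.
from typing import Callable, List, Optional, Tuple

Answerer = Callable[[str], str]

def amplify(human_answer: Callable[[str, Optional[Answerer]], str],
            model: Optional[Answerer]) -> Answerer:
    # the human answers the question, delegating subquestions to the model
    # when one is available (the depth-2 tree, and deeper on later rounds)
    return lambda question: human_answer(question, model)

def ida(human_answer: Callable[[str, Optional[Answerer]], str],
        train_imitator: Callable[[List[Tuple[str, str]]], Answerer],
        tasks: List[str],
        iterations: int) -> Answerer:
    model: Optional[Answerer] = None
    for _ in range(iterations):
        team = amplify(human_answer, model)        # amplification step
        demonstrations = [(t, team(t)) for t in tasks]
        model = train_imitator(demonstrations)     # distillation step
    return model
```

The half-and-half variant mentioned above would swap the imitation objective in `train_imitator` for a reward given by the amplified team judging the model's answers.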
Okay, and one thing I'm just realizing might be closely related to that: it seems to me that at the start, we said the point of debate is that we have a machine learning system that knows some stuff, and if we have a question and it knows the answer, we want to know what answer it knows. Whereas IDA sounds to me like the problem of "I'm a human, I want to solve some task"; it has kind of a different flavor. Is the idea that when we're doing IDA, we're doing it on a human who's trying to figure something out about a model? Can you say more about that correspondence?

Yeah. You can imagine that the top-level question in IDA is something like "this model has suggested this plan; is it suspicious and dodgy, or is it a good plan?". Same as with debate: you can move between planning and actions in the world on the one hand, and just language and question answering on the other. You can ask questions about whether something is a good plan, or whether some model should be trusted to take some sequence of actions, or whatever. With IDA, we have some model, and we're trying to get it to do things, and we also want it to tell us correct answers to questions, and the way we know it's telling us correct answers is that we've trained it to imitate the answers that humans arrive at after they think about it and pass off all these subtasks. Like I said before, maybe you'd have a debate about whether someone was going to be deceptive, and one debater would point out that this part of the plan would have this consequence, and this would involve them getting lots of power, which would be bad. The IDA equivalent, the implicit equivalent HCH tree, would be a bunch of humans asking "is this part of the plan reasonable? is this part of the plan reasonable? what are the possible consequences of this action? what are the possible consequences of this other thing that's going to happen?", and eventually you get some answer like "yes, at step 20, when this lever is pressed, this happens", and so on. Sometimes I worry that I personally play too fast and loose with "everything's analogous to everything else and you can move between them", and maybe I'm missing out on some details that are important, but I do tend to think you can turn lots of these things into other things.

So, my last question on this type of work. As you might know, NeurIPS recently required broader impact statements, and in particular they asked people to imagine: the work you're doing, what are possible ways it could turn out to be bad? I kind of like this question, and I think people working in the AI x-risk space rarely ask it of themselves. So if all this work on debate and imitative generalization turned out to have negative consequences, and they don't get to be "oh, it didn't pan out, so there was an opportunity cost", but something bad happens because of this research, what's the bad thing?

Okay, there are a few different things I can come up with here.
One is something like: this stuff all looks good in theory-land and in human-experiments-land, but in fact it's just kind of impractical, and what we should have been thinking about the whole time was more in-the-weeds, hacking-together-ML stuff, to be incrementally more aligned and patch various problems. We're sitting here in our ivory tower speculating about how we'll know everything the model knows, when in fact we should have been asking "is there some trick we can use that makes it a bit more likely that we'll catch things of this type?". There's also something else; maybe it's not exactly that that's the problem, it's just that there's some fundamental flaw with things like debate, and it all looks like it's working, and we say "great, we'll use it to build the aligned AGI", and then it turns out it's actually not working, and it's all terrible. I could go on brainstorming, but there are probably things related to: maybe alignment isn't the only framing that captures the important part of the risk, and looking like you have more of a solution to alignment makes people more excited to build powerful AI, and then some not-quite-alignment-related thing goes badly, something about misuse or whatever, that people would have been less likely to do if nobody had been saying "we think we have a relatively good alignment solution", and everyone had instead thought "whatever powerful model you're trying to build, this will always backfire on you". Maybe that's actually a better world. I don't know; this is pretty speculative, and I don't really believe that.

Fair enough. So, closing out: you mentioned that you're no longer working on, at least, the debate stuff, and maybe you said you weren't working on imitative generalization. What are you doing these days?

I'm trying to figure out what exactly I should be doing. Currently I'm being a sort of miscellaneous safety strategy person at OpenAI, writing a lot of Google Docs, and gently doing a little bit of ML experiments that are kind of related to imitative generalization. I'm not quite sure what I'll be doing in the future.

Sure. If people are interested in following you or your research, or the exciting things you're putting out, how should they do that?

Most of the things will probably be Alignment Forum posts.

All right. Well, thanks for appearing on the show, and to the listeners, I hope you join us again.

Thanks for having me.

This episode was edited by Finan Adamson.