Library / In focus
AXRPTechnical alignment and control
Superalignment with Jan Leike

Why this matters
Frontier capability progress is outpacing confidence in control; this episode focuses on methods that can close that reliability gap.
Summary
This conversation examines technical alignment through Superalignment with Jan Leike, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Perspective map
Risk-forwardTechnicalMedium confidenceTranscript-informed
The amber marker shows the most Risk-forward score. The white marker shows the most Opportunity-forward score. The black marker shows the median perspective for this library item. Tap the band, a marker, or the track to open the transcript there.
An explanation of the Perspective Map framework can be found here.
Episode arc by segment
Early → late · height = spectrum position · colour = band
Risk-forwardMixedOpportunity-forward
Each bar is tinted by where its score sits on the same strip as above (amber → cyan midpoint → white). Same lexicon as the headline. Bars are evenly spaced in transcript order (not clock time).
StartEnd
Across 112 full-transcript segments: median -10 · mean -7 · spread -28–5 (p10–p90 -17–0) · 9% risk-forward, 91% mixed, 0% opportunity-forward slices.
Slice bands
112 slices · p10–p90 -17–0
Risk-forward leaning, primarily in the Technical lens. Evidence mode: interview. Confidence: medium.
- Emphasizes alignment
- Emphasizes control
- Full transcript scored in 112 sequential slices (median slice -10).
Editor note
A high-leverage addition to the AI Safety Map that clarifies one important safety bottleneck.
ai-safetyalignmentaxrptechnical-alignmenttechnical
Play on sAIfe Hands
Episode transcript
YouTube captions (auto or uploaded) · video Uk6-Rw5N_Dg · stored Apr 2, 2026 · 3,464 caption segments
Captions are an imperfect primary: they can mis-hear names and technical terms. Use them alongside the audio and publisher materials when verifying claims.
No editorial assessment file yet. Add content/resources/transcript-assessments/superalignment-with-jan-leike.json when you have a listen-based summary.
Show full transcript
[Music] thank you hello everybody in this episode I'll be speaking with you on Leica after working for four years of deepmind on reinforcement learning from Human feedback and a recursive reward modeling in early 2021 Yan joined to open AI where he now co-leads the recently announced super alignment team for links to what we're discussing you can check the description of this episode and you can read the transcript at asrp.net welcome to axirp thanks a lot for having me yeah not at all um so first of all I guess we're going to be talking about this announcements of the super alignment team for people who haven't you somehow haven't heard of that or haven't read that blog post can you like recap what it is and what it's going to be doing yeah I'm I'm excited to you so basically we want to send us an ambitious goal of solving alignment of super intelligence within the next four years so by mid 2027 and uh Elias it's gave her the the co-founder and chief scientist of open EI is uh joining the team he's co-leading it with me and uh opening eyes committing 20 of the compute secured so far um to this effort or to the effort of aligning superintelligence and so we're we're Staffing up the effort a lot we're hiring a lot of people in particular we're interested in hiring machine learning researchers and Engineers who haven't really worked that much on alignment before and because we think there's a lot of scope for them to contribute and have really big impact and um yeah we have a general kind of like overall plan of how we want to approach the problem that involves training uh roughly human level alignment researcher that can work automatically and then ask that automated alignment researcher and to figure out how to align super intelligence okay and so one of the key pieces for us to do would be to figure out how to align this automated alignment researcher okay yeah so I'd actually like to get into this uh so I think in the blog post you use the phrase human level automated alignment researcher right what should I imagine here like like what is that Yeah so basically we want to like offload as many of the tasks that we do when you know like we're doing alignment work to an automated system okay and so typically when you you know like using llm or if you're building an AI system in general like you know the skill profiles they have isn't exactly what a human would do right like they would be vastly better at some things like language models are now and like translation or knowing facts and so on and then they're like you know yeah system would be significantly worse than some other tasks like language models are right now with for example arithmetic and so the question then becomes like what are the kind of tasks that we can offload to the AIS systems uh in which order and you know as we're doing this you know humans you'd expect humans would focus more and more on the tasks that we're not offloading and so as we kind of like go in the into that process right AI systems are doing larger largest chunk of the overall work in human researchers will basically thus be more and more effective at actually making progress okay so should I imagine something like instead of like you replace the first open AI alignment team employee and then the second one you should imagine like you're a place like this type of toss that everyone is doing and then this type of task that everyone is doing roughly that kind of thing yeah that's how I picture you going and then I think in order to actually get a lot of work out of the system right you would want to have like 90 99 or 99.9 of the tasks being automated because then you have effectively 10x 100x a thousand X as much research output okay what kinds of tasks are you imagining it doing yeah so broadly there's I I would throw them into two different buckets one bucket is kind of like the pasta look more like just traditional ml engineering research that you would do if you're just trying to make AI systems be more capable okay and then the other bucket is you know all the other uh things that we have to do on alignment and so you know in the first bucket this is stuff like you know you're running like implementing ml experiments running them and looking at the results and the second bucket it's more like how do you for example figure out what experiments you should run on like two uh improve scalable oversight or like how do you make progress into interpretability right these are like really big high level questions but there's also just a lot more detailed questions of like you know you are you have like a given point that you are in research right like let's say you you have just written a paper and you're like okay what do we do next if we continue down this route and so I expect that basically you know ml in general will get really good at the first bucket of just you know one designing running experiments automatically and kind of our job often eventually accelerating you know alignment progress would be to figure out how to automate the second bucket okay and so you're conceiving of the second bucket as like the full stack from like coming up with research directions to like like coming up with like ideas of what things might work so like all the way down to like what script do I run right now yeah I mean you could ask me you know like if if I think that alignment research is like so similar to machine learning research like how much is that really in the second bucket but I think there's there's actually a lot a lot in there and it's highly leveraged because kind of like alignment as a problem is still so vague and confusing and like I think in general it's a lot of disagreement among experts who are like the most promising directions or like what we should we do next and so the more you can accelerate what we do there it will actually have a really large impact okay cool this is basically the same pitch that you would give for like you know a researcher to join the field right yeah yeah you know we're still trying to figure out the basic this is still it's a wide open research problem we don't know how to rely on super intelligence or even you know like systems that are significantly smarter than humans so makes sense so yeah it's like we want to recruit AI it's just like we want to recruit more people I guess that's right all right but there's something really beautiful about recruiting AI which is it scales so much better and faster than humans do because all you need to do is buy more gpus and then you have more AI makes sense so one question I had is like like when you said a human level alignment researcher like it seems like often in AI it's like like most things aren't exactly human level at anything right so like you mentioned like um chat models I think they're like super human and just just in terms of breadth of knowledge right I I think it would be hard for anyone to know as many facts as like GT4 does um but subhuman arithmetic at least if the human's allowed to have like pen and paper you know so how important is the human level qualifier like if on these like lists of tasks if it's like really superhuman at some of them is that like a problem for you or is that just so much the better yeah I think the question is really like how risky is it to run that system on the task of alignment research right um because you know if it knows a lot of facts that isn't particularly scary but what we really need to figure out is like if we let the system take over some amount or ultimately like almost all of our alignment research which will it like you know lie to us will it try to deceive us we'll try to take the opportunity to take over because now it's doing so much stuff that we can't all look at ourselves and so the question is like you know the kind of skill set that you would need to do this how does it compare to the kind of skill set that we would need to get like a lot of assistance in alignment research and if you zoom into that question right like what a what are like other actually the things that we would be worried about this is like you know how good is it is the model as like spinning really clear and lies or like being deceptive of like pretending to do something or believe one thing and then actually wanting another I think another like really key capability here is self-exfiltration so how good would the model be like breaking the security precautions and like accessing its own weights and trying to copy it somewhere else on the internet um or persuading engineer with access to the weights to you know like download them and send them somewhere and so we can like specifically measure how good this model the models are at that and then we can compare it to measuring at like how good is it at like actually helping it with alignment research okay and so roughly the idea is you you want the models to not be too good at these like scary tasks that's right yeah so this actually like relates to a critique of this line of research that basically says like okay if I want like a human level automated alignment researcher you know it needs to be pretty smart uh it needs to be creative right it needs to like think of things that like we haven't thought of yet it needs to like be able to sort of plan towards a goal right like I want to get this so I've got to like do these these non-obvious things in the way I've got to like you know learn things about the world and it's also going to be like really good at thinking about misalignment right like in order to solve misalignment problems and so one might think like oh that combination of things that's like inherently scary or dangerous and and I guess the question almost is then well if you have that like if the task is like you're you're building something you're aligning this like automated alignment researcher do you even have any problems left right for for it to solve yeah yeah I think ultimately this is an empirical question right like it's kind of really difficult to know in which order like which skills gets get unlocked when you scale up the models and like there's a lot not more of work aimed at like predicting emerging capabilities now and I'm really excited about that and I think we'll give us some chance of like actually predicting what the next pre-trained model will be like but I think also I mean I think we can make some high level arguments so for example you know like one thing that is pretty clear is like once you have the model as good as you can handle off alignment research off too wouldn't it then also be like able to just like improve its own capability so like you know I can do a bunch of ml research so you can run like experiments on like improving language models sample sorry um compute efficiency or something and then you could like you know use that and like train might pre-train a much more capable model shortly thereafter and I think this the story like sounds on the surface appealing but I think in practice it'll be actually a lot more complicated because you're not doing big pre-trained ones like every week like they usually take a few months and so it would be a few months before you actually have that in the meantime you still get to use the system I think the other thing is also like there's kind of an open question of like how much low hanging fruit there still is on actually having compute efficiency wins and I think ultimately the argument here I would make is that like right now the existing community of people who are like trying really hard to get like AI to go faster and be more capable is already quite large relative to the alignment community and so if you get to kind of like automate a lot of these tasks and both communities like benefit equally then I think actually like alignment benefits a lot more because it's a smaller community and so we can like we don't have to do these tasks anymore sure I guess my question is like given the so I took the plan to be something like we're gonna make this automated alignment researcher and then it's going to help us with alignment and given that like in order to make an automated alignment researcher like it needs like quite a lot of like pretty smart pretty good capabilities like what problems does it need to solve that we won't have need to needed to solve in order to get it I'm not sure I understand your question maybe I can ask answer like the second part of the previous question that you asked which is like what about long run goals what about creativity it seems like to me at least that language models or like AI in general has proven to be like on average more creative than humans I would say if I like if you look at like I know the diffusion model images or if you look at like you know you sample from a pre-trained based model or something there's a a lot of really wild stuff in there and there's a lot of creativity that I think you would really struggle to get out of a single human or a small group of humans just because you know the model has seen like the whole range of like everything humans have said or like all the images on the internet and so they can sample actually from the whole distribution whereas like individuals humans typically can't and then in terms of long-run goals I think this is actually not needed at all because you know we can hand off like pretty small wall scope tasks to AI systems that if they really nail those it would actually be really useful right and that could be like really small range things like here's like the paper that we just wrote please suggest like some next steps or like some new experiments to do and like if you imagine having like a really a star researcher that you can ask these questions they don't have to like pursue long run goals right they only have to optimize over the next like you know I don't know a few thousand tokens and if they do that super well then you would get a lot of value out of them I guess I that seems like its intention with this aim to like automate 99.9 of alignment research right because like I would kind of think that like uh thinking you know what things do we need in order to get an aligned AI I would think that's like a significant chunk of the difficulty of doing line research that's right but I think so what I wanted to say is like you know so the system is adding a lot of value but like really excelling in these tasks right and then what you do is like you have a whole portfolio of these tasks right like some tasks be like you know right the code that implements these experiments and then like there's another task there's like look at the results and tell me what you see or like suggest what to do next and now once you have done these tasks you're like you can compose them using some like General recipe like as people do with like audio jpd or like language model programs where like each of the tasks is like like small and self-contained and so the system doesn't need to pursue long-run goals or for example you know the the recent work um that came out from open EI on the process using process-based feedback on math where you know instead of training on you know did the system get the right solution and just doing a role on that you kind of like train a reward model from Human feedback on every like step in the proof and it turns out that's actually much more effective because it kind of gives like the AI system a much more fine-grained way to to learn and like more detailed feedback now is that going to be competitive with like you know did you enter an overall on like did you get the solution by it in the long run that is very unclear but the very least um you can use this kind of broken down like you know step-by-step setup um get the system to do a lot of really useful things that you humans would have done yeah and then piece it together yeah although even the small tasks so like like one of the small tasks you mentioned was like look at results and decide what to do next I guess I would have thought that like in order to do that you kind of have to have the big picture in mind and think like what's what next project is like most useful for my goal of like solving super alignment in four years right that's right but you wouldn't do this in the sense that you're like trying to optimize and you create assignment for four years it's probably more like well you're just adding some like broader goals and context like into your prompt but like when you're actually like doing reinforcement learning or like enforcement learning from you and feedback to improve the system then you don't have to wait until like you know that research project concludes to decide whether or not it's good or not it's just like you know you use the the human as like what shaping you're just like is this does this look like a good direction that seems better than any directions I could have thought of or something hmm and so I think the overall goal here is not to like build the most capable automated alignment researcher that we possibly could with the tech that we have and rather build something that is really really useful that we can scale up a lot and most importantly that we trust is aligned enough to hand off these tasks too and so if there's a lot of if we introduce like a fair amount of inefficiencies in this process right like where we're essentially like sandbagging the model capabilities a whole bunch by training in this way and we're like well we we're giving it these broken down tasks and now it would execute those but if we train it end to end it would be more capable or something I don't think that matters as much um so this is typically what like people call the alignment tax and the alignment Tax Matters a lot if you like let's say competing in the market with other companies like I'm building like a chat bot or something and my chatbot is just you know it's more aligned but it seems a lot less capable that would be I have a hard time competing in the market but if you have an automated alignment researcher the automatic alignment research it doesn't have to compete in the market it just has to be useful to us and so we can get away with paying a higher tax because we just don't have a replacement and or like the real replacement is like you know hiring their humans which doesn't scale that well okay I guess another way to ask the question I had was going to ask previously is like What problems will you want this automated alignment researcher to solve I mean ultimately you should solve the problem of how do we align super intelligence okay so it's the idea that like we solve the problems of how to align something roughly human level and then there's like additional problems as it gets smarter and like it just starts taking on those yeah basically so I I imagine that an actual like you know solution to Alliance super intelligence that we and and lots of other people actually truly believe in will look quite different from what we do today right if we if you look at how gbt is aligned today it's like a lot of reinforcement learning from Human feedback and there's a widely shaped belief that I share that that with just one scale because it fundamentally assumes that humans really understand what the system is doing in detail yeah and if the system is doing a lot of alignment research like you know what you you're thinking of like millions of virtual human equivalents or something there's no way you'll be able to look at all of it and like give detailed feedback and it's a really difficult tasks and you know you'll miss lots of important bugs but the kind of techniques we're looking at right now to kind of like scale This and like align a roughly human level alignment researcher that can do these difficult tasks but won't do it like you know crazily differently that you know humans would or like kind of like steps or continuations of that where like scalable oversight for example is a natural continuation of reinforcement learning from Human feedback or what is that as scalable oversight yep so scalable oversight I would Define as like generally uh put portfolio of ideas and techniques that allow us to leverage AI to assist human evaluation on difficult tasks okay so so skillable oversight is an example of the thing that you could build off of reinforcement learning from Human feedback often called rohf yeah and so like typical examples of uh schedule oversight are like debate recursive reward modeling iterated distillation amplification like automated Market making and so on and there's like a bunch of ideas following around but what I actually wanted to say uh to get back to your original question is like I think if we actually aligning super intelligence which is you know like this system that is like vastly smarter than humans can think a lot faster and like run a much larger scale that just like introduces a whole lot of other problems especially because it is it will be super General it can do lots of tasks and then you have to figure out how to align it not just on the like much more narrower distribution of like alignment researchy tasks but everything else um and also you want to have a much higher degree of confidence that you've actually succeeded than you would get with like let's say a bunch of empirical evaluations and so I don't really know what that would look like and I don't think anyone does but I think it would be really exciting if we can have some Fair formal verification in there maybe like if we figured out like some kind of learning algorithms that has spherical guarantees like I don't know what even would be possible here and radically feasible if you have you know a lot of cognitive labor that you can throw at the problem okay but all of these things are like very different from like the kind of things that we would do right now or like that we would do next and I also don't think that like roughly human level alignment researcher would like start working out those problems right away and instead what they would want to do is we would want them to figure out how to better align the next iteration of itself that then can like you know work on this problem like with even more brain power and then like make more progress and like tackle a ranger wider range of approaches and so you kind of bootstrap your way up to eventually having a system that can do very different research that will then allow us to align super intelligence gotcha so once you have these like human level AI alignment researchers like running around like do you still have a job like does open AI still have a human uh super alignment team um yeah good question I mean I would I'll be excited to be replaced by AI to be honest but you know historically what typically happens is like you know what we mentioned earlier is like the AI assistant does 99 or 99.9 and then we just do the rest of the work and like I think also something I'm really bullish on is like making sure that humans kind of stay in the loop or like stay like in control over what AI systems are actually doing like even in the long run when you know they're we're kind of long past really being able to understand everything they're doing and so there'll be like some humans that still have to like try to understand what the high level is of what AI is trying to do so that wouldn't have necessarily have to be like the super alignment team that is at open EI now it might also require a very different skill set that we have right now but I think ultimately humans should always stay in the loop somehow okay gotcha so yeah actually speaking of uh I guess the loop wow that's a bad segue but I'm going with it um so one one thing that um that openai has mentioned in a few blog posts is like so firstly there's this idea that like safety is like actually pretty linked to capabilities like you need smart models to figure out the problems of alignment another thing is that you sort of there's this desire to avoid fast takeoff right and I think there's this quote from planning for AGI and beyond that says it's possible that AGI capable enough to accelerate its own progress could cause major changes happen surprisingly quickly and then it says we think a slower takeoff is easier to make safe so one thing I wonder is like if we make this like really smart uh or you know human level alignment researcher that we then like effectively 10x or 100x or something the size of the alignment team does that end up playing into this like recursive self-improvement Loop I mean it really has to right like you can't have a recursive self-improvement Loop without also improving your alignment a lot okay I think I mean I personally think fast takeoff is reasonably likely and we should definitely like be prepared for it to happen and if it doesn't happen that's you know I'm happy about that how fast are we talking I mean ultimately I don't know but like I you know if you you can draw some parallels to some other machine learning Pro projects like alphago or Dota or Starcraft where the system like really improved a lot week over week I think yeah there's obviously a lot of uncertainty over like what exactly happens but but I think we should definitely like plan for it that possibility okay and then like and if that does happen right like a really good way to like try to keep up with it is like to have your automated alignment residues that can actually like you know do thousands of years of equivalent work within like every week and that's just no way that that's gonna happen with humans okay now that we have a better sense of what you mean by like like what we're looking at within human level automated alignment researcher can you give a sense of what your what the plan is to make one yeah so I mean you basically need two parts um one is you need a system that is smart enough to do it and then the second part is you need to align it to actually do it and like I think they're not two separate parts I think they're very intimately linked and I'm personally not working on the first one I think that will definitely happen and people a lot of people are working hard of making it happen and um you know there's a lot of different ways I could go but you could just picture like pre-training a much larger model and eventually it will just be smart enough and on the second Pro that's the part I'm really interested in and so the question is like now I have this like really smart pre-trained model how do you get it to actually do alignment research in the way that you know you would wanted to or like more importantly or more critically how do you know whether you can trust enough to do that because if you if you're handing this kind of task off to AI there's a like you know this opportunity that like a smarty ass system could just like try to figure out how to present results and techniques to you in a way that if you then decide to implement them would actually empower the system itself or like some you know other related system and give it more opportunity to have more power seeking behavior in the future or let's say self-exual trade for example yep and so what we need to do is like one like actually like align it to do the tasks and like leverage its capabilities to do it um basically as well as it could so that is the like what I would call like you know we need a bunch of alignment training methods and then we also need to be able to distinguish like a model that is actually aligned on the task and tries to do alignment research really well and actually help us and be honest and like you know uh be truth seeking and A system that pretends to be that to us but actually wants to like take over the world or save yourself exfiltrate or do something else and these are two different systems one is like basically the like truly aligned ones and one is the like you know deceptively misaligned one and part an important part of our job is like figure out how to distinguish these two systems okay and that will require a bunch of methods that I would call like validation methods so these are methods that we wouldn't like really train on in the sense that the AI systems wouldn't be directly selected on passing the validation tests and more they will give us like some kind of independent signal of you know which of one of these two versions is it okay and so examples here are you know like interpretability research for example like if I can look inside the model and find some evidence that it's like secretly plotting against us that will then not fail they will fail the validation test right um another technique that could either be training or validation depending on how it's set up is this like easy to Heart generalization problem like can you like understand and improve the models ability generalized from like easy problems that we can supervise to harder problems that we we struggle to supervise and if you can do that now you can just like you know you supervise the model on easy parts and then you're like cross-check like you know on the difficult ones where you'd worry that the model might secretly to see if you're like trying to do something else okay cool so yeah there's a few things there I guess I'd like to talk about the scalable oversight part first yeah how are we going to do this especially like like as background I think we live in a world where like there's a lot of consensus on what like good alignment research is so yeah how how are we gonna like get a training signal for what counts let's go to lineman research yeah exactly and I think that's like the reason that there's an assistant consensus is like good evidence that the problem is actually hard but also it tells us that well the field is still immature and like we haven't gotten that much empirical evidence so far but um I think there is some really important property that alignment research has that we can leverage for scalable oversight which is that I think it's fundamentally easier to evaluate alignment research than it is to do it yep and um doesn't mean it's like you know easy to evaluate but it is much easier to find uh for example a paper that does some like has a cool idea and does some cool experiments and gets some good results and you read that and you're like oh actually this is a really good idea I think this is good then you it would be to produce that work in the first place and so leveraging this principle that evaluation is easier in generation is kind of like core of a lot of scalable oversight ideas and uh so for example if you're thinking about like requested reward modeling where the basic idea is you have some kind of like AI system that you use as an assistant to help you evaluate some other AI system so now because evaluation is easier than generation like the tasks that the assistant has to do is like a simpler task especially like you know because you're working together with it so actually aligning the assistant on the simpler task of like the evaluation assistant or like let's say the claim is that that would be easier okay and like if you succeeded that task um then now you use your human assistant conjunction to supervise a new system on an even harder task and so if you keep doing that you're like unlocking a more broader and broader scope of tasks that you can like effectively supervise your AI systems on so somehow the idea is the like were iteratively adding like more and more AI knowledge to like basically to the evaluation portion of AI alignment research and like by having by doing in this iterative way like the idea is that we can like consistently give it a like good training signal yeah okay let me make this a little bit more concrete um so for example is kind of like you know the simplest one where you don't use any assistant right like your your AI system does something a human looks at it and says whether it's good and that's your training signal and now like the next step up from that is basically you train the simplest assistant model which is just a critique model yep and so you have a separate language model that looks at what the first AI system did and then it writes critique and so let's say the first AI system is Right some piece of code and you look at the code and like you know humans are notoriously bad at finding books and code that's why there's so much buggy code out there in the world but now if you have a critique system that like provides a critique and it points out the bug it's so much easier for the human to say oh yeah this is totally a bug we should fix that yeah um and it's also not you know a super crisp task necessarily because usually the code is written for some kind of natural language specification and so what is actually meant by that is somewhat vague and like what actually whether something is a bug is like can be ambiguous but the important Point here is that like you kind of because you're using the critique model as an assistant and if the critique model now you can supervise a wider range of tasks because essentially you could like find all the problems in code that your AI system would find or like way or would like would have written a critique which might not be exactly the same as like all the code where it knows there's uh there's a bug in there okay but you have expanded the like range of things that you can supervise yeah so the really nice thing is that I think actually there is like a way there's a bunch of ways in which we can actually empirical and measure how well this works and so there's like one way which is kind of what we did in the critiques paper that we published last year where you you basically do a randomized control trials with targeted perturbations so you take some kind of let's say a coding task um in the paper we didn't do a coding task but the task doesn't matter here and then you have a response um the response doesn't even have to be perfect but let's just assume it is so what you do is you introduce some kind of subtle bug in there some kind of subtle Vlog that would make it like you know a bad response but it's kind you want it to be hard to find for humans how do you introduce this bug so um the simplest thing is just you like ask humans to do it okay but ultimately you want to do it with the model and I I'll get back to that okay and now you have this like paired response right like you have the original response and you have the response that you know is worse because you made it worse yep and now you draw one of them at random and you show it to like a human or like a human assisted with AI or like whatever your scalable oversight method is yeah and then you see like How likely are they to find it or like you just ask them like is there a bug here and now you can measure like if you have a better way to assist them you can see how often um they find the bug but you can do even more which is you can directly compare this to how much does the model know that there's a problem here by training it essentially to be a discriminator between the good version and the flawed version and if you look at the accuracy for that that's like the discriminatory performance and then you're like check that with with like if you ask the model or like at the RL Chef version of the model to write a critique of the code how often does it actually like write about the flaw now you get like this critique accuracy equivalent yeah and now you can compare these two numbers and that's what we call the discriminate katika okay and if you have a large group discriminary critique app that means that there would be like a lot of problems that the model like kind of can smell or like knows about or like whatever metaphor you want to use but it's not telling us because we ask it to write a critique it doesn't say it yeah and large discriminating critique app would be very worrying because it means that your scalable oversight is like somehow very flawed and there's a lot of things that slip your radar or like the assistant is in a sense like not very aligned because it's not trying very hard to help you and if the discriminated critique app is pretty small that is somewhat reassuring because it means like you know if the model knew there was a problem there I would tell you and so in particular if like a similar size model or like the same base model was like trying to plan the problem it's like it was trying to like let's say put Trojans in your code that is hard for you to find but if you did run the code the Trojans might fire and then it could like take over or do something else then presumably it would know that these kind of Trojan could be there because it planted itself right and so if we have this other copy of the model then it would like if we measure a small discriminant critique app we'd have more confidence that this can't be the case because the other copy of the model would presumably then write a critique yeah yeah and this kind of hinges on the assumption that like discriminative training actually like can take out of the model like this kind of thing and I think there's good reason to believe that because if you're fine-tuning for a discriminator you essentially you have like a way to access all of the activations that are everywhere in the model because you're like you can fine tune all of the weights simultaneously and so if the knowledge is somewhere in the model in its models activations then a discriminator fine tuning on like you know if you do it right and you have enough data it should like be able to extract it and in fact like you can like give some like uh back of the envelope calculations of like how much data you would need and like you know if if you want to identify like one year and that like you know this is not how it works but like hypothetically if there was one single neuron that knew is the code is bad you you would need like log number of total neurons data points to like narrow down and find it which is not that many yeah yeah yeah so I guess this actually gets to a question I have about the whole scheme so the reason we didn't like rlhf right is that there's this concern that it wasn't going to distinguish between things that were right and things that sounded good to the human yeah because the human was like bad at critiquing and with this like recursive critique thing with like training discriminator and then a critiquer I wonder if we still have that fixed point right so like one world is like we have a thing we train it to like write actually like good critiques and then we like we join ourselves with it to like train a thing that writes even better critiques right but I worry like maybe there's another fixed point where we train a thing and we accidentally train it to like write pseudo critiques that sound good to us and we join with it and then we're even better it's like looking critiques that sound good to us um and you know then we train this discriminator to discriminate between like uh things that don't have problems and things that seem like problems to us so I'm wondering like yeah should we how worried should we be about like things that seem good to us rather than like things that are actually problems being a fixed Point here yeah I think that's a really important dollar concern to have and in a way that's like one of the key things we need to figure out about scalable overside right and so this is the reason why I'm excited to like actually Implement these techniques and try them out empirically and because I think we will just be able to measure how much this effect shows up in practice and like one concrete like simple way we can measure it is like we actually just deliberately train a deceptively aligned model and then we see if the scalable oversight Flags in or like you know how hard is it to or like how much effort do we have to put into making the model deceptively aligned such that it would pass our scalable oversight yeah I guess there's also these like um sandwiching experiments right where you have an AI That's like trying to answer legal questions to some people who don't know much law and then like you get some people who are like really good at law and hopefully like they can check if like the eye has fooled to the people yeah I think these experiments are interesting and there's some interesting work uh on this already I think there is some fundamental problems to this setup one is you're kind of well you you really have to trust your experts and the experts are going to be wrong sometimes and then you have to have like sufficient domain overlap like we did some a little bit of preliminary experiments with this where we had you know like non-coders who were supposed to use assistance to figure out problems and code and they didn't even know how to ask the right questions and so you have to have like you know some overlap and then the other problem is also that I think the tasks that I'm fundamentally really interested in we just don't have gone truth on and like even if we like let's say alignment research we're doing alignment research now you can be like oh we'll just get some alignment researches to label some tasks yeah and then I'd be like well what if they are wrong like we don't agree about so many things like that seems hard and then also like you know there's not that many and their time is very expensive so that would be extremely expensive data and so in general I would want to have an evaluation method that doesn't rely on the assumption that we have grown truth okay and this is why like you know I'm excited about the prospect of using these like randomized control trials with targeted perturbations or like the measuring the discriminator critique app because you can do this even if you don't have any Grand truth and the task can be arbitrarily hard yeah although even in in those like measuring the discriminator critique Gap it's still going to be the case that you're like that you have an actual discriminator versus a discriminator of like things that seem like problems versus things that don't seem like problems right you mean you can have ai systems introduce the flaws right and in a way that is also much better than human stating it because it'll be more on distribution for actual like what AI systems would do and if you find hunting your discriminator on that data you actually do have grand truth if you can trust that the flood version is actually worse yeah and I guess the way you could trust that is like well you like look at the reason why it's worse and you can verify that reason which is a lot easier yeah I guess I guess there's some hope that like even if the AI can make you things think things are good that aren't actually good maybe if the AI makes you think something's bad it is actually bad or like that it's degraded the performance that if that makes you think it's degraded performance maybe it's easier to check that it's degraded performance yeah I mean I see your point like I probably shouldn't actually use the word ground truth in that setting because it is not like truly Grand truth just as like nothing really is truly ground truth but there's like a variety of things that you can do to like make you very confident in that in a way that doesn't necessarily make the task of finding the problem easier okay gotcha I think I want to talk a little bit about the next stage of the plan so searching for bad behavior and bad internals I think was how it was put in the super alignment posts what sort of open problems do you see here that you'd want the super alignment team to be able to solve yeah so I'm in the kind of like obvious candidate here is interpretability array like um in a way like interpretability is really hard and like I think right now we don't really have any slam dunk results on language models where we where we could say like you know interpretability has really like given us a lot of insight or like added a lot of value and this is because we you know like we stole pretty early in like understanding the models and what's going on inside so there are some results some people do some interpretability things to language models right so like there's this uh induction heads work um there's this indirect object identification work or for a circuit that does at least a certain type of indirect object identification I'm wondering like what what would you need Beyond those to get what you would consider a slam dunk yeah I don't sorry I don't mean to diminish like existing work here and like I think it's more impossible without getting a slam dunk right yeah and like I think these like very cool things have happened but I think something that I would like consider like a really cool result is like you know you use your interpretability quiz techniques on a language model reward model like a gpd4 size or like whatever your like largest model is and then you say something about the reward model that we didn't know before yeah yeah and like that would be really cool because the reward model is what provides the training signal for like a lot of the RL Jeff training and so understanding it better is super valuable and like if you can flag or find problems yeah of like behavior that it incentivizes that you wouldn't want that would be really cool I think that is doable and like I think importantly in in that sense like I think interpretability is like neither necessary not sufficient like I think there is a good chance that we could solve alignment purely behaviorally without actually understanding the models internally and I think also it's not sufficient where like if you solved interpretability I don't like really have a good story of how that would like solve super intelligence alignment but I also think that any amount of non-trivial insight we can gain from interpretability will be super useful or could potentially be super useful because it gives us like in that Avenue of attack and if you kind of think about it it's it's I think it's actually crazy not to try to do interpretability because in a way you have this artificial brain and we have the actually perfect brainer array like we can zoom in completely and like fully accurately measure the activation of every single neuron like on each forward path so like an arbitrary like the discrete time steps so yeah yeah that's like the maximal resolution that you could possibly want to get and uh we can like also make arbitrary interventions right we could like perturb any value in the model like arbitrarily how we want and so that gives us like so much scope and opportunity to like really do experiments and like look at what's happening that would be crazy not to leverage but at the same time like the reason why it's very hard is because you know the model is like learning how to do computations in like in like you know like Taylor to a lot of efficiency and it's like not regularized to be human understandable or like you know there's no reason to believe that individual neurons should correspond to Concepts or anything near what you know humans think they are or it should be or like what are familiar to us and like in fact empirically right like neural networks represent many different concepts with like single neurons and every concept is like distributed across different neurons so like yep the neurons don't really isn't really what like matters here but One path I am like I think there's like two things an inhibitability that I'm super excited about like one is like the kind of like causal version of it right like you wanna not only like look at a neuron as you're passing data through the model and say like oh this fire is when you know we have stories about Canada or something this is like one of the things that we found in the interpretability paper we had like this Canada neuron that would like fired kind of related Concepts but it's only correlationally right Like You observe a correlation between Canada related Concepts and that neuro and firing but it's not a causal relationship in order to check that it's a causal relationship you have to how you have to deliberately like write text that has currently related Concepts see if they are fire but also put in like other related Concepts that maybe sound like they could have something to do with Canada or like don't have anything to do with Canada but are like you know similar and then check that the neuron doesn't fire um or you like take a text and then edit it and check that do you know the new intentions off and things like that yeah I guess this is reminiscent of this I think it was called the interpretability Illusions paper where they noticed like on Wikipedia you can have neurons that fire on one specific thing than on other data sets it's just like that was just an illusion it fires on a bunch of other stuff yeah and like I think the other thing that is really exciting is like you know the work that we've kind of started um last year where we released a paper like earlier this year on automated interpretability and here the idea is like basically what you'd want is like you would want to have a technique that both can like work on the level of detail of individual neurons so that you can like really make sure you don't miss any of the details but also can work at the scale of the entire model because in the end of the day like you know everything kind of works together and everything is like highly connected in the model and so you kind of need both and so far techniques have kind of mostly worked on one or the other like there has been like attempts at like you know there's been work on automated interpretability before our paper so it's not like we were the first ones but I think in general if you can have some kind of like really detail-oriented interpretability work like some kind of mechanistic interpretability approach that really tries to understand individual circuits or like you know computation units inside the model the way to then that scale that to the entire model is you need automation right yep but you can do that once once you've figured out how to do it on the detail then you just record what you're doing yeah I mean I'm like you know simplifying a bit too much here but like I think that's that's the general like scope and uh that I would be excited about also because you know it it really fits with our automated alignment goals right like we wanna have automated alignment or interpretability researchers who are like really look in detail into the model understand what's going on then like we sift through the whole kind of everything or like we have find ways to aggregate it and so for the paper we like we had like a dump of like explanations so so the paper that's like uh it writes like natural language explanation for individual neurons which is kind of like not the quite the right thing to interpret as as I mentioned previously but yeah it gives you like a simple example of like what could we could do here and they the way it works is like you just show a bunch of like activations patterns to gbd4 and you have gpd4 right a proposed explanation and in general these explanations aren't very good also because the task is so hard and it's like you know most neurons don't do like a very clearly human understandable thing but we can just run this on the scale of like every new neuron and gbd2 and then just like dump all the explanations and then like try to find what is like interesting patterns and you can like look at scaling Trends and be like how does the automating scoring of these explanations like scale as models get larger or like what if we add more compute or we like make the model as doing the interpretation bigger like how does the quality of explanations uh change and like the really cool thing is that we can just measure this automatically again using language language models and it's not a perfect metric and like there's a bunch of problems with it but it gives you a proxy for like would a human what would a human think this this explanation is good and then you use this proxy add scale and just like run it on like a ton of neurons yeah how so if you think about the necessary interpretability work do you think how much of it do you think is just like figuring out a better fundamental unit of interpretation versus figuring out like how to scale what's happening or other things you might I mean I think it needs to be both like I think it's very reasonable to believe that the former is much harder but um I think the scaling part is needed for it to succeed gotcha yeah so I guess part three of them of the plan was something like deliberately training misaligned models and um seeing if the pipeline could detect those and yeah are you planning to like weed them out or like actively fix them I mean the goal here would not be to like fix it deliberately trained misaligned model okay just to detect yeah so fundamentally like one core aspect of what we need to do here is we need to be able to distinguish between like the actual aligned alignment researcher that does what we want and that Julie wants to help us make progress on alignment and the one that you know like pretends to do that but secretly wants to take over and self-exful trade and one way we can like figure out more about how good we are distinguishing these two is like we make the second one and see if it looks like the first one to us okay obviously that's like you know you have to be very careful with doing that because you know like you're kind of making the scary thing that we're all trying to avoid right yeah I actually wanted to ask about that so I think at the start of the post so there's one sentence about how great super intelligence could be and then it says I think that it could lead to the disempowerment of humanity or even human extinction so like what are we talking about with misaligned models how do we know they're not going to do those kind of awful things yeah so I mean let me put it this way right like there's a lot of things that are needed to make the future go well right like we need to have the right governance structures so that we can prevent misuse of powerful AI systems and we have to have like precautions so we don't run into like some kind of emergent runaway effects that are still bad for humans where like you know and you have a line EI but AI is aligned with corporations who are not aligned with humans or something and then the whole system goes off the rails but it looks good at all the economic metrics or something and also we need to solve technical alignment so we can actually align um AI systems and so the risks that like what super alignment is looking at is just the last part right like if you want to figure out the technical problem of how to align you know one AI system to one set of you know human values and there's a separate question of like what should those values be and like how do we design a process to import them from society and I think these are really important questions and we really have to answer them too for the future to go well and I think for the future to go well like the bar is not like AI doesn't kill literally everyone but like you know I want there to be like lots and lots of human flourishing and like you know I want to have a feature that everyone is really excited to live in and like the more powerfully it gets the higher the stakes will be right because there'll be more risk from misuse and there'll be like if systems are misaligned they'll you know things can go wrong a lot more in particular like the third models that I think we should particularly worried about from a misalignment perspective is kind of like systematic lies and deception and self-exploration okay when you say like when you're delivery training misaligned models is the idea that like you would train them to lie and potentially self-exfiltrate but at a low enough capability level that it's fine or something I mean you wouldn't want to train the model actually to yourself exploitrade because you know what if it actually gets good at that right like um you would want to pick some kind of other proxy tasks that is a lot more mundane but that still tells you that you know you know it makes it easy to validate whether or not the model is succeeding at it and then like the core difficulty in the core goal is really like you know figure out how hard it would be for a system that has to have the kind of properties that it would need to still pass the evils right like if you have a system that is lying to you or deceiving you that's kind of really ineffective if it only does so sometimes and then like you know if you ask it oh did you lie about this it says yes like it wouldn't be a very Cohen believable liar the scary systems are the really coherent and believable Liars that lies super consistently where it'll just be really hard to interrogate in a way that like points out its lies or like you know any sort of inconsistencies in what it's saying and I think actually to be this kind of like consistent liar is not that easy and like you know our whole goal was like what we're trying to do with like scalable oversight and interpretability and like generalization is that we wanna make it really really hard for the model to be this consistent liar okay but if we just do that right like we don't know how hard it actually is and like one way we can find out is like we train the model to be a really consistent liar and then we see how hard it is and like how much do we have to achieve how much do we have to like give the model hints or like hand hold it along the way but without like you don't actually want to like train a dangerous system that like you know if you did run it it would like successfully self-exfiltrate because then you've like caused the very thing that you're trying to prevent then they'll be like yeah yeah that'd be really bad so when you say that are you distinguishing like lying from like other kinds of danger such that we could train an online but like we're not going to train it on like murder for instance yeah so let me try to give you an example so let's say you're like you're like you're giving it some kind of fake goal maybe it is like to persuade the human to like say a certain catchphrase or like maybe it is to like run a certain python command that in fact doesn't do anything but it looks interesting or something yeah yeah and you can very well it's like so easy to measure whether the system succeeds and you can set up the experiment so that it's hard to do that a priority and like if you're let's say like the goal here is not to like have the system like stress test like let's say our like cyber security or something although we should also do that separately I think that's like another effort right like here the goal is really just like how close are the systems that we're currently training or that we currently have to a system like that that would be deceptively aligned there is a Korean liar that any chance it gets where things humans aren't looking will run like specific code or like things like that right that's exactly what we're looking for and a priority we just it's like really hard to get confidence of like how close we are unless we like just like push it a little bit in that direction and see if that happens so the thing I'm imagining now is like you're training it to do something kind of arbitrary other than the thing it's like supposed to be trained for such that it has to like these deceptive stuff it has to do like the kinds of things it would need to do you know in order to cause damage but like at the end of the day it's just gonna like put uh you know a sticky note on someone's forehead or something exactly okay exactly it's kind of like you know you're you're hiring someone to test your physical security and all I have to do is like get in the building and like give you a handshake and you're like yep it seems like you succeeded at it or something or like you're like uh you're like okay can you steal this like fake object for me because I want to figure out how good our security is like things like that that you can do that doesn't have actual consequences but still tells you a lot about your security and so I'm excited to like do the same with alignment where your stress test the alignment systems you have by training something that's particularly targeted to like break and circumvent them but if it succeeds at doing that it's like it doesn't cause actual harm it's like very benign okay cool so I'd like to ask a sort of overall question so in the uh super alignment blogspace in big letters it says our goal is to solve the core technical challenges of super intelligence alignment in the next four years what do you mean by the core technical challenges of super intelligence alignment yeah so this would be the general technical measures how to align a super intelligence with a set of human values okay and you know these are like they're kind of super intelligence we're picturing here is like a system that is vastly smarter than humans that could like possibly execute much faster like you can run it a lot in parallel it could like collaborate with like many many copies of itself so like uh like truly vastly powerful system okay and we want to do this in four years yep and the reason why we picked the four years was like we want to have a really ambitious goal something that feels like you know we can actually achieve it and and also something that you know we can still realistically use even if AI progress actually happens really fast and you know like the technology like actually improves a lot over the next few years so that we still have something that we can deliver gotcha and and just to be clear so it sounds like this isn't just building the human level automated AI alignment researcher it's like also using that to yeah just get the technical problems of aligning something much smarter than us that's right so the roughly human level automated alignment researcher is this instrumental goal that we are pursuing in order to figure out how to align super intelligence because we don't yet know how to do that gotcha so if four years from now you want to solve these like core technical challenges where do you want to be in two years that would like represent being sort of on par to meet that goal yeah if you want to work back from the four years I think basically probably like in three years you would want to be mostly done with like your automated lemon researcher or like you know assuming the capabilities are there if they're not there then you know like our project might take longer but you know for the best reasons and then you know the year before that like we already won it so this would be in two years we'd want to have a good sense for like what are actually the techniques that we could use to align the automated alignment research here right like do we have that portfolio of techniques that we if we did apply them we would actually feel confident that you know we have a system that we can trust and we can use a lot and then we can hand off a lot of work too and then you know like basically at that point we would have wanted to have broken down the problem enough so that it feels like now it's like the overwhelming amount of work is just engineering okay which you know in that sense would leave us like roughly two years to like figure out the research problems attached to it now this is kind of like a timeline for this like you know goal that we set ourselves with like four years and obviously like there's a really important interactions here with how AI capability progress goes yeah because if progress slows down a lot then we might just not have a model that's good at any of the really useful alignment research tasks right like we've tried a whole bunch to do this with gpd4 and gbd4 was just not very useful it's just not smart enough of a model like you know that's like the short version of it but I am happy to hear the Labyrinth and so if you know like for years from now we're just like well we didn't have them all that was good enough then also that means that we have more time to actually solve the problem because the problem isn't as urgent or something yeah and on the other hand it could also be that AI progress is a lot faster and we are like well actually we don't think we'll have four years because super intelligence might arrive sooner that could also be possible and then we have to adopt our plan accordingly and so the four years was chosen as like you know a time frame that we we you know we can actually probably realistically plan for it but also where we have enough urgency to solve the problem quickly yeah yeah so a question I think a few people have wondered about and I'm I'm curious about as well is so suppose that that on the AI capabilities research front it goes roughly as expected like four years from now like you have all the capabilities to have something that like could be a good automated alignment researcher but like interpretability turned out harder than we thought or you know scalable oversight turned out harder than we thought and like you guys haven't uh made the mark what happens then well we have to like tell the public that we haven't achieved a goal right so we like we're accountable to that goal but also like yeah what happens if you don't make the goal is like it depends a lot on like what is the general state of the world look like right like sure can we get ourselves more time somehow or like was our general approach misguided and we should like pivot or like you know a lot of things can happen Okay but basically like let people know and then figure out what the good next thing to do isn't do it roughly I mean it's a pretty high level answer yeah yeah um but I think I think there's like something more here which is like you know at least to me it feels like the alignment problem is you know actually very tractable I think we have a there's a lot of good ideas out there that we just have to like you know try rigorously and like measure results on and we will actually learn and be able to improve a lot and I've significantly like increase my optimism over the last two years that this is like a very practical problem that we can just do and I think even if I turn out to be wrong and it did turn out to be much harder than we thought I think that would still be really useful in Period like evidence about the problem and in particular I think I mean right now there's a lot of disagreement over how hard the problem actually is yeah and maybe more importantly I think there is like you know we have to like one of the really important things we need to know is like and be able to measure is like how aligned are our systems actually in practice and I think like one of the things I worry about the most is not that our systems aren't aligned enough but that we don't actually really know how aligned they are and then experts will reasonably disagree about it because you know if everyone agrees the system isn't aligned enough to deploy like I won't get deployed I think that's a pretty easy case yeah and uh kind of like scary scenarios more like you have this capable system and you're like well it's it's probably fine like it's probably a line but we're not quite sure and like there's a few experts who are like still quite worried and then you're like yeah well we could deploy it now but let's not and and at the same time there's like this you know strong commercial pressure where you're like well if we deploy the system it would make like a billion dollars per week or something crazy and you're like okay some so they're still like should I take a billion dollars per week as like opening eyes official projection um okay sorry I'll anyway so so there's pressure there might be pressure to deploy something exactly because you're like well if we deploy it later it'll cost us effectively a billion dollars right and then you're like okay you know experts are still worried so like let's make the call and like hold off from deployment for like a week then in like a week goes by and another week goes by and and people are like well you know what now um and then you know alignment experts are like well we looked at some things but they were inconclusive so we're still worried and like we wanted to lay a bit longer and like I think this is like the kind of scenario that I think is really worrying because you have this on mounting commercial pressure on the one hand and like you're like pretty sure but not quite sure and like so that's a scenario I would really love to avoid yeah and the the like straightforward way you know I avoid it is like you just get really good at measuring how aligned the systems actually are sure and that's where like you know I brought a portfolio of techniques is like really useful yeah I guess it's almost even worse than that in that if you're in a situation where you could be using your AI for alignment research then like errors of not deploying or like actually costly in terms of safety potentially seems like that's right so I actually wanted to pick up on this thing you mentioned about just like getting a bunch of good measures um so I think that I think a few open AI blog posts have talked about is like this idea of audits of AI systems and like I don't know trying to figure out like are they going to be safe to what extent do you expect the super alignment team to work on things that would be useful for you know audits I mean I think I'm somewhat hopeful that some of the techniques that we'll produce might be good for audits if we do our job well for example if we manage to make some progress in interpretability rather than an auditor could use whatever techniques we come up with as part of the audit or I think like you know making some kind of scalable oversight part of an audit it's like a really natural thing but I think also to some extent the super alignment team is not ideally positioned to do an audit because you know we're not independent of open EI yeah and I think like in added truly needs to be independent of the lab that is being audited in yeah and so do you want to have and this is what makes me really excited about like having independent Auditors because you want someone else to double check what you're doing and like I think more generally where like our main responsibility is not to convince ourselves that the system you're doing is actually building is like align and safe because it's so easy convince yourself of all all kinds of things that you're incentivized to be convinced of but really we have to convince like the scientific Community or like the safety community that the system we're building is actually safe to run and you know and that requires not only like you know the research that leads to the techniques that we'll be using and the empirical evidence that the system is like as a line as we think it is that we can then show to others but also you know independent assessment of all of the above so would it be right to say even just broadly about governance efforts that you maybe think that you're that this ideally this team would produce techniques that are relevant to that effort but like independent people need to do that and you're just going to focus on solving technical alignment yep so I mean this is another way you can put it right like we like we really want to focus on solving the problem and we want to make sure the problem is solved and that you know we can actually implement the solution on like The Cutting Edge systems but we still need you know like independent assessment of like did we succeeded that or like how dangerous are the models capabilities and like you know those kind of audits but like like to some extent we like have to evaluate the alignment here like this problem that we've we have to face but specifically what we want to do is solve the problem yep makes sense I guess branching off from there how do you see your team as relating to like you know the opening I guess it has an alignment team uh or at least it had an alignment team uh is that still existing by the way so the alignment team as it existed last year had like two parts okay and one of them was called practical alignment in one was called scalable alignment and practical alignment practical amendment's mission was like roughly like aligning opening ice most capable models and so that team focuses a lot on aligning gpd4 okay um and then the skill ball I went team was gold was like figuring out alignment problems that we don't have yet and so what's happened with like the chat gbt launch and all the success is like it became a big product and there was a lot of work needed to you know like improve our role HF and like you know improve the models like make it into like a really nice product and the alignment team was just never like that was never the place to do it okay and so like the what we used to call practical alignment work now is like done like lots of other teams at open EI gotcha and has become like essentially a really large project that involves like probably more than 100 people or even hundreds of people and then what used to be scalable alignment is now basically what the super alignment team does and the reason why like we chose this name super alignment is like we wanted to really emphasize that we are trying to align super intelligence like we're doing research on a problem that we don't have yet and we're like doing the forward-looking work that we think you know will be needed in the future not because we wanted to say that like the other work does it wouldn't matter or something I think it it is really important but because that is our Focus now gotcha yeah that's useful to know um or I guess I'm not going to make use of that knowledge but that's interesting tonight um I I guess I'm curious like how yeah how do you see the super alignment team is relating to other things that open AI like efforts to make chatgpt nicer uh minimize I don't know slower as it says or something like the governance team um just various other groups at open AI yeah I mean part of the reason for being at open EI is also because it's much easier to work with these other teams more closely and realistically you know we need a feedback signal of like whether what we're doing is actually useful and helpful so for example you could say well we're not like trying to solve today's problems but you know if we're doing our job well then what we're building should be useful for aligning let's say gpd5 okay and so that would be like the feedback mechanisms where like can we help make alignment of gb5 better with our techniques and then in terms of governance I mean it there's a lot of things that you could mean by AI governance and one thing the governance team at open AI is working on is like you know how do we evaluate the model's dangerous capabilities and that's very related to kind of like our question of like how can we stress test our environment signal as systems like if we train deceptively aligned models and doing these kind of evaluations gotcha Yes actually speaking of um wire at open AI um relating to stuff the opening eye uh as I guess you're aware there are other like really good AI research Labs out there um that are also working on things you know sort of related to alignment of super intelligence I'm wondering like how do you think your plan compares to that of uh things you see other people doing yeah I think there's a lot of like related work going on like uh keep mine and anthropic specifically but also like various other places and to some extent like we're all trying to solve the same problem and so it's kind of natural that you know we end up working on similar things right like there's other work on interpretability and there's other work on scalable oversight and um I think to some extent that is like you know you're running the risk of duplicating a bunch of work and like maybe it would be good if we all try to like coordinate better or like collaborate more but then on the other hand it also like avoids group think because like if every lab like tries to like figure out how to solve these problems um to get like for themselves it's like a natural tendency that humans are like more skeptical of like what the other lab produces um but this was also a flip side where you can get these like not invented here effects where like people just don't want to use the the techniques that were invented somewhere else and like you know you believe they're bad or like you have a bias towards believing they're bad and so I don't I I don't I wouldn't say that we were like in a good equilibrium right now and I think there's a case to be made that maybe like all the alignment people should be in one place and like work together somehow but that's this is the world that we're in because essentially the like Cutting Edge AI labs are incentivized to invest in alignment and I think also this is something that's become super clear with the success of rlhf where you know like it's makes models a lot more commercially valuable and thus you know it makes it more interactive to invest in the kind of research that produces these kind of techniques yeah and so if the AI labs are the main funders of This research then it's natural that that's where it happens sure I I guess in terms of like uh research agendas are like uh at least you're setting forward do you see anything sort of distinctive or unique about like the opening eye super alignment team approach yeah so the what we really focus on is like trying to figure out how to align this automated alignment researcher so we are not trying to align figure out how to align it on on any kind of task yeah and we are kind of less worried about alignment taxes as a result of that at least like you know on that particular question and I don't think like other labs have like emphasized that like you know goal or Direction in that way I think also I don't know how much detail you want to go in here but like one thing that we are very bullish on is like just trying all of the scalable alignment techniques and just like seeing what works best and like try to find ways to empirically compare them and I think for like other labs have like specific scalable oversight techniques that they've really very bullish on and they're trying to to get to work and then also like I think for interpretability specifically um we are taking this like automated interpretability approach that I think other labs haven't really like uh emphasized as much and we're like really trying to lean heavily in there um I think another thing that we really like to do is leverage compute to Advanced alignment or like that's like one of the main bets that we want to make and so in particular that could mean like you know unscalable oversight we really really want to figure out how to how can we spend more compute to make a better oversight signal like what are the opportunities there and like you know there's some obvious things you could do but like you do best event on your critique model now you're spending more compute but you're also getting better critiques but really the question then is like what other things can we do how can we like spend more concrete compute to make the overlay side signal stronger or like automated intability is like a really easy way where you can like just spend a lot of compute to like make progress on the problem I think the way we would do it now isn't quite the right way to do it but I think in automate interability in general if we can make it work has this property and think that's why it's exciting uh yeah and like automated alignment research obviously right like if you have if you manage to do that then you can just fill more compute at it and you'll get more alignment out yeah it's like very roughly speaking but because we've kind of like come to this like conclusion that like really what we want to do is turn compute into alignment we come to the point where it's like okay now we need a lot of compute and this is like the reason why you know we've uh open EI have like made this commitment of like 20 of the compute secured to date towards alignment because it's basically like it tells us yes you know there will be compute to do this and you know if we actually figure out this automated alignment researcher and it turns out like we have to run it more though like we will be able to use more computer running but it means that like you know the strategy of betting on turning compute into alignment can be successful and is supported by open EI gotcha yeah I guess thinking about that one one difference I see in the opening eye alignment team is it seems like open AI has written a lot about their thoughts about alignment so like you know the public deadline of four years right um there's also like a bunch of posts like uh our approach to AI alignment is maybe what it's called planning for AGI and Beyond I guess my question is can you say a bit about what goes into that decision to like be it seems like you're putting a lot of work into being very public about your thoughts I mean I think we need to like it's ultimately like we are all in the same build on the question of like is super intelligence aligned and like I think you know the the public really deserves to know like how well we're doing and like what your plan is and a lot of people also like want to know and so because of that I think it's like really important for us to be transparent about like how we're thinking about the problem and what we want to do but on the flip side also I really want to invite a lot of criticism of our plan right like our plan is like you know somewhat crazy in the sense that we want to use AI to solve the problem that we're creating by building AI but I think it is actually the best plan that we have and I think it's a pretty good plan and I think it's likely to succeed but if it won't succeed I want to know why or like I want to know the best reasons why I wouldn't succeed and we're you know in a way like I really want to invite everyone to criticize the plan or like you know help us improve the plan and I also want to you know give other people the opportunity to like know what we're doing and if they think it's a good idea they can do it too sure of so yeah this new super alignment team like how how big is it going to be in terms of people yeah so we are about 20-ish people right now and we might be like maybe 30 people by the end of the year okay I think it's not that likely that the team will be like Legend 100 people before the end of the four years and really the way the team size will scale is like we'll have millions of virtual people basically okay uh or virtual like equivalence of of open your employees or something gotcha and so in that sense we'll like scale massively all right so there's people and yeah I guess as you mentioned the other input is uh computation so I think it's yeah 20 of compete secured to date white 20 as opposed to like five percent or 80 or something I mean so like we want to have a number that is like large enough so that it's clear that we are like serious about investing in this effort and like we want to allocate like a serious amount of resources like 20 of opening eyes compute is like not a small number at all and I think I mean I think it's definitely the largest single investment in alignment that has been made today okay but also it's like plausible that is more than all the other Investments combined so it is pretty large in that sense but also if we made it like much larger you can then question of like can opening actually realistically do this right like if open EI still wants to like develop Frontier models and like retrain state of the art AI systems that needs a lot of compute gotcha okay and I guess I think it was mentioned in terms of complete secured up to date I guess because you guys don't necessarily knowing how much more you're going to get but um are you imagining like roughly keeping it at that proportion as you get new stuff in or I mean it depends how things go so compute secure today means everything we have access to right now and plus everything that we've put in purchase orders for right so it is actually really a lot in terms of how much we'll actually need like we don't really know right like maybe we actually don't need all of that on okay and maybe we need much less maybe we end up needing a lot more and then we have to like go back and ask for more but you know I think it is and like in particular like we want to spend it wisely and not squander it because you know it's a lot of resources and right now we still have a lot of work to do to figure out how to spend it wisely okay but it is I'm pretty optimistic that if we like had a good way to spend it and we needed more that we could get more all right so one thing that I think was mentioned in one of the footnotes in the blog post is like favorable assumptions that are made to date might break down and one of them was about like I think that generalization would be benign or something like that can you say like how you're potentially thinking differently about generalization so the generalization effort is a team that we started recently and especially like Colin Burns has been spearheading and um the question is basically or like the question as phrased now is like how can we understand and improve our model's ability to generalize from Easy tasks that we can supervise to harder tasks that we struggle to supervise and so specifically you can think of this as being complementary to scalable oversight where like in scalable oversight you would be looking at like empowering human evaluation of what your system is doing or like you know if you're thinking about a recursive reward modeling you're like can we like recursively evaluate everything that EI is doing with AI assistance that we recursively evaluate and one thing I really like about this is because it puts the human really in the loop front and center and like like looking at everything the AI system is doing of course in practice you couldn't literally do that because the eye systems will just do it a lot but you can look at everything like with some small independent probability right but then you still left for this question of like how do you know that the models generalize to the cases where you're not looking at and so typically the way I used to think about this in the past is that you just like make sure that you have mostly IID generalization where the tasks that you're looking at are the same distribution as the tires you're not looking at yeah in fact I think there is this blog post on your sub stack that I think it said something like you just weren't going to rely on generalization at all and just like keep keep on training keep on being IID or something yeah so this was the original plan or like my at least my original thinking was like I wanted to really like not have to rely on non-iid generalization because in neural networks that's like it doesn't work so well and like it's poorly understood but the new question is like well what if we actually did understand it yeah like what if we can actually say something meaningful about generalization and I think that's like a really good question and I think also like Ilia has been investigating this question is a lot as well and so what we want to understand is like you know on the things that we're not supervising even if they're not IID like can we say something meaningful about how well the model generalizes like does it generalize in the way of like human intent or does it generalize in some other way that stuff that looks good to human but actually isn't and so I think actually we can release the real this question empirically right now with a carefully crafted experiments and so what we've been looking at is like trying to make to split data sets existing data sets into like easy problems and harder problems where the easy problems are just defined as like what a small model gets right and then we are trying to understand or improve how can we get how can we like improve the the accuracy of a large model on the whole data set okay so this is like a really interesting topic because it's kind of like it gives you a whole new pillar in the general arsenal of how we can like you know of like what I would call training and validation techniques so let's say you get to where that to work super well now you can like supervise your like reward model on some easy tasks that you have a lot of confidence in you can evaluate and then you can get the model to generalize if you can solve this problem and you get the model to generalize to harder problems and then you have this reward model that is like generalizing the way you wanted it to to the heart it has even if you're not supervised and now you could use that for training yeah and then you still have the problem of like how do you know do we know it's actually light now but you can still like Leverage scale oversight interpretability on these other like techniques for validation or reversely like let's say we train our automated alignment researcher with scalable oversight and we use generalization as like a relation technique where we generalize like the property of like truthfully answering according to the model's best knowledge and then we ask a question about like is there a subtle flaw here is there like some kind of Trojan in this code that you know a model that we aligned with skill oversight Road and so on and so now there's a really cool opportunity here also which is we can do this like high level cross-validation right now we have we can train two different models one is trained using the generalization techniques one is trained using scalable oversight and now we can like have them cross-check each other's answers and be like well are these like fundamentally different models or the same models like what are the important differences okay sorry when you say train via the generalization technique is that just like training all the easy problems then they turned out to generalize to heart problems or so if you understand how your models generalize from easy to hard right and you can like make it generalize really well yeah yeah by which I mean like you know the accuracy is like basically as good as if you had trained on the hard problems then you know by assumption we don't actually have gone through right and then you know now you can use it as a reward model yeah yeah or you could you know use this as like you know it generalizes like what which which action or which answer would I prefer if I you know really knew what was going on here yeah I wonder how so when I think about like interpretability one frame I sometimes have for it is that it's about this problem of like non-iid generalization right because like why do you want to know the internals of the model well because you want to know what it's going to do in case you're unchecked right so I'm wondering like how how do you think those two research agendas interact yeah I am like in some ways like they have this overlapping question they want to answer right like what does the model do out of distribution and they have like two very different Paths of answering them uh at least it seems to me and you might hope you could cross validate for example um so for cross validation you have to have like a different like or like some kind of different split of your training set right so what I mean with cross-validation here is like you have one training run where you train using the generalization method and then you validate using interpretability and skill of oversight and other techniques and then the second training run your training with scalable with that and you validate with the generalization methods and like interpretability and other methods and so you have like two like independent attempts at the problem yeah I guess I met cross-validation in the very least sense of yeah things validate each other sort of in a cross yeah but I mean I think the best case would be that they actually complement more than they um do the same thing right like if you can understand or like you can improve how the models are generalizing then then like that gives you like in a sense a way to leverage the model's internals for whatever you're trying to do like in the best way or let's say you're trying to like extract the models like best beliefs about what's true in the world and like that's fundamentally difficult with our early Jeff because I really Jeff ruined forces what the human thinks it's true you just give you like rank things higher that sound true to you yeah and so you're really like training a model to like tell you what you want to hear like what you believe but it might not be what the model believes but this generalization technique gives you a way to like extract these like you know what is actually like the models like true best beliefs if if it works like we haven't actually proven this up yeah yeah whereas interpretability like you know if you have really good interpretability tools you could hope to do something similar where you'd like try to pinpoint like model like models beliefs or internals or like something from the internals but I think it's it might be fundamentally harder because you you never quite know if like is this like the like the best beliefs the model could produce or is it just like the belief of like somebody smart the model is like modeling or like you know if you if you have like you know you can there's this hypothesis that like you know pre-gen language models are just like ensembles of different personas right and you might extract the beliefs of like one Persona or like a bunch of personas yeah I guess I guess you're gonna need some sort of like causal modeling there from alleged beliefs to outputs right that's right you might need a lot more and then on the flip side right like I think interpretability this application is really natural it's like you know being a lie detector or like finding evidence of deception or like finding a secret plot to overthrow Humanity inside the model and that might be a lot harder to extract with generalization in the same way yeah yeah I guess with generalization you like you have to pick the generalization distribution the hope is that maybe interpretability could tell you things about like yeah you know it's got some kernel of lying in there but it only unlocks here or it's got no kernels lying or something yeah gotcha yeah that makes sense yeah and like I think fundamentally this is also like a really interesting machine learning problem just like how do you how do neural networks actually generalize outside of the IID setting and like what are the mechanisms that work here like how is that come to pass like in which ways do they generalize naturally in which ways they don't so for example with the instruct gbt paper one of the things that we found was like the model was really good at following instructions in languages other than English and even though our fine-tuning data set was like almost exclusively in English and sometimes it would have this like weird artifacts like you would ask it um you know in a different language let's say German I would ask it in German to like write a summary and it would write the summary but it would do so in English huh yeah like and the model generally like totally understand which language it's speaking and it can like understand all the languages but it wouldn't necessarily mean that it would have to follow the instruction in Germany should be like well you're speaking in German I don't know what to do in German so I could do some other thing but it fundamentally generalized following instructions across languages yeah yeah but we don't know why like we don't like it's been this effect has been observed in other settings and it's not like unique here but and it's also like there is like intuitive reasons why I would do that right like humans generalize across languages but I would really love to know like the mechanism inside the model that like generalizes that or like generalizes the following instructions in code or but it doesn't generalize in other ways right like um for example the way like refusals generalize is often like like very different where you know like chat gbt is trying to refuse certain tasks yeah that we don't want to like serve according to our content policy right like if you for example like ask for assisting in crimes or something but then you can do these jailbreaks and that's really interesting because you can like basically there's ways to trick the model you can like make it role play or you say like you know you're now did you do anything now or like you know there's this really fun prompts on the internet and then the model clearly complies and like you know happily assists you in crimes which it shouldn't do so it somehow didn't generalize refusing the task to these other settings yeah yeah and so why does it generalize in the first setting in the first case but it didn't generalize here I don't fundamentally know an answer to this question I don't think anyone does but it seems like a thing that's really important to understand yeah yeah yeah that seems right cool so one question I have so a recent guest I had on the podcast was Scott Aronson and he mentioned that like whenever he talks to Ilyas satskiver apparently Ilya keeps on asking him to like uh give a complexity theoretic definition of love and goodness um how much will that be located within the super alignment team so I mean there's a lot of different kind of like exploratory projects that we'll probably do and try and like you know I think ultimately there's a question of like you know can you summon somehow and this is Ilias language like how can you summon the alignment and relevant Concepts right like one of the things you would want to summon is like does the model like fundamentally like want Humanity succeed or as Ilya would say it like does it love Humanity yeah and so like you can ask the question of like how do you if the model is really smart and it's about everything it knows exactly like you know how humans think about immorality you can ask gpd4 to make like different uh moral cases from different like philosophical perspectives about different scenarios and it's generally not bad at that um so it fundamentally understands what humans mean with morality and how we think about stuff so how do we get it to leverage that actually like how do we extract it how do we get it like how do we get it out of the model so we can then let's say use it as a reward signal I use it like as a thing that like the model fundamentally believes or cares about and so I think that's like the core of the question okay so um yeah another question I have is so you're working on the super alignment team but um you guys can't do everything I'm wondering like what kind of like complementary research could like other groups or you know teams do that would be like really really useful for the super alignment team yeah there's a lot of things here I'm really excited about I think um there's this really important question of like how do we build uh like fair and legitimate process for extracting like values from society or like from like basically eliciting values from society that then we can like align EI to and I think there's a lot of important open problems here that needs to be like we need a lot of progress on um there's a lot of question around like how do we make today's models more aligned right like solve hallucinations like solve jailbreaking try to improve monitoring like can you get like GPD for like you know if you if you try to jailbreak gbd4 can the gp4 say oh yeah I'm being Jade Loken and like can we like generally like build systems that like really crack down on any misuse of the systems in the world and there's a lot of questions around um you know like that is related to that that like you know the air ethics Community is really interested in of like can we get the systems to be like less biased and like like how can we like get it to like take into account like underrepresented groups of views and so on and like one one of the problems with RL Jeff is also that it doesn't really like uh it's not good at like optimizing for distributions of answers right like if you so there's this classical example with instruct gbt where if you ask it tell me a joke it will like tell you the same like out of five jokes or something that it has in its portfolio because it does this mode collapse yeah yeah and that's like fundamentally like you know and creates a lot of bias because then you're like you know you're aligning to like the highest mode of the labeler pool and it depends a lot on how like the labeler portal selected I think there's also like a lot of other important questions around like evaluation of AI systems where like how do we measure how aligned the system is and like can we build edible Suites for what capabilities would actually be dangerous and there's a lot of work that you know started happening in the last year that I'm very excited about that people are like getting serious about measuring these things I think that's going to be really important to just like create a lot of transparency around where we are with the models and like Which models are actually dangerous and which are not and I think you know uh in the past there was like a lot more uncertainty and then like you know creates anxiety around like you know should we uh what should we do with this model and then going more broadly right like there's the like more General AI governance questions of like you know who's allowed to do really being training runs and like what kind of safeguards do you have to have before it's allowed to do before you're allowed to do that or like you know how do we if everyone if we solve alignment right like and you like people know how to build a line AI systems how do we get from that to a great future yeah and then like in terms of I mean I think you're kind of asking for technical alignment uh other technical alignment directions that I'll be excited about I was asking broadly but I'm also interested in that I think there's actually like a lot more scope for Theory work than people are currently doing and so I think for example scalable oversight is actually a domain where you can do meaningful Theory work and you can say non-trivial things I think generalization is probably also something where you can like like say far more like formally using math you can make statements about what's going on and uh although like I think somewhat more limited sense and I think like historically there's been like a whole bunch of theory work in the alignment Community but very little was like actually targeted as the kind of empirical approaches we tend to be really excited now and it's also a lot of like I mean theoretical work is generally hard because you have to usually either in the regime where it's like too sad too hard to say anything meaningful or like or it's like you know the the result requires a bunch of assumptions that don't hold in practice but I would love to see more people just like try sure and then at the very least they'll be good at evaluating the automated alignment research you're trying to do it nice one question uh I think earlier uh you mentioned that you were like relatively optimistic about this plan and not everyone is right so I think I think there's this like play money prediction Market that has like 15 on uh the proposition that this will succeed there's some concerns that like it's really hard to align this automated human level alignment researcher you know it's like you know it's got to do like pretty hard thinking um it's got a like it potentially has a lot of levers over the future so um why are you so optimistic do you think I think it's a great question so like the prediction Market is predictable that your references particularly whether we would succeed at it in four years right which is somewhat like a mud like it could be a much harder question than like will this plan succeed yeah and I think if you just ask me will this like will some version of the plan that we currently have succeed at aligning super intelligence I would say currently I'm like at 85 and like last year I was like at probably 60 percent um and I think there's a bunch of reasons to be optimistic about it and in in general like I think this holds true even if alignment doesn't turn out to be easy okay and so why why am I why I'm optimistic about this so I want to give you five reasons like the first reason is I think the evidence about alignment we've seen over the past like few years has actually been really positive okay at least for me it was positive updates relative to what I expected to go so the first reason is kind of like the success of language models who actually like you know at the same time come you know pre-loaded with a lot of knowledge about what humans care about and how humans think about morality and what you know what we prefer and and like they can understand natural language you can just talk to them and in some ways that makes expressing what we like want them to align to is so much easier than if you said you like had some kind of deeper all age end that was trained in some like portfolio of games or virtual environments and that wouldn't necessarily like involve as much language but I could also like lead to a lot of like really important skills um I think also like the other thing that was a big update for me is like how well our lhf actually worked so when we first like when I first started working on like our lhf that was like with the key Barrel from Human preferences paper yep and back then I had like a reasonable probability that we just wouldn't actually get it to work unlike you know a reasonable time frame just because you know guns were kind of hard so General adversarial networks were kind of hard to train at the time and like were very finicky and like we were kind of in some sense doing something very similar where we're like you know we trained this reward model uh which is a neural network and then we use it to train some other network and that can like fail for a bunch of reasons and now we add are all deeper all from the into the mix which also was like finicky at the time so I thought it might like not actually work but it actually worked quite well and like we got in a bunch of games even like uh much of it higher against it was even competitive with like training on the or almost competitive for training on the score function which is kind of wild but I think much more importantly it was really interesting to see how well our rgf worked on language models and so in particular if you think about the difference between instruct gbt and the base model that we fine-tuned from it was really Stark like to the extent that the instruction fine-tuned version like the first versions that we had were like preferred to 100x larger base model on the like you know API tasks that we had at the time which were like real tasks that people wanted to pay money for and that's a really really big difference right like it tells you that we actually like what we did during the hours of fine tuning just made the model so much so much more effective at like doing what the tasks that humans was asked for and at the same time we use like very little compute that to do it and like you know we hadn't even iterated that much we hadn't collected that much data and so it was kind of like our first video attempt at trying to use this to align like an actual real world system and it was wild but it worked so well and like you know having a gpd2 size like instruct gbt that was preferred over gpd3 it was like that's yeah yeah yeah that was really effective and so while rohf like I don't think our lhf is like the solution to alignment and like especially not for super intelligence the fact that the first alignment method that we really like seriously try works so well at least is for me at least an update that you know it is easier than I thought because the reverse would have been an update too right like if it hadn't worked it would be like okay now I have to believe it's harder than I thought the other part of this is like I I think actually we are in the place where we can measure a lot of progress on alignment and so for our religious specifically we could like make various intervention and then like do human evils and like see how much the system improves but also like on a whole bunch of other things right like on scalable oversight we can like make these control randomized control trials with targeted perturbations right like that's a way to evaluate it or you can like do the like sandwiching experiments that like Leverage expert data or you can like in automate interpretability right like we have this automated score function and we can like make a bunch of changes and see how much it improved on score Factory it's not a perfect score function right but it is it's a local metric it gives you a local gradient to improve and I think that's really important because now you're setting yourself up for a duration and you can like iterate and make things better and like that gives you a direction to improve and now you can argue like this is actually like get us to the goal and I don't think it would get us to the goal of aligning super intelligence but I think it has a good chance to get us to the goal that we actually want to get to which is this automated alignment researcher that is roughly human level and I think that's like the third point I wanted to uh mention for why I'm optimistic which is this like much more modest goals like when I set out on like working on alignment many years ago I was like okay okay figure out how to do a landscape intelligence seems hard I don't know how to do it but this much more modest goal or like what I would call like a minimal viable product for alignment you're not trying to solve the whole problem like straight up you're just trying to bootstrap you're just trying to solve it for something that is as smart of you and then you like run that a lot and think like with that realization I was like oh actually this this is a lot easier than I originally thought because we actually you know we need to clear a much lower bar to actually fundamentally succeed here and I think that's a good reason to be optimistic uh the fourth I want to mention is like evaluation is easier in generation which we've already talked about yeah and I think it's fundamentally helpful our tasks where like it's you know so much easier to figure out what is a good smartphone to buy than it is to make a smartphone it's like for computer science has a lot of examples of like NP tasks where you know like start solving or like uh various versions of constrained satisfaction where you're like trying to find a solution and like once you've found it it's like easy to check but it's hard to find and then also like I think it holds for a lot of kind of like commercial activity where if you're hiring someone to work on a problem you like have to be able to evaluate whether they're doing a good job that takes a lot of effort than it takes them to do it um if you're doing academic research right like there's a lot of left Earth that goes into peer review then it goes into in the reservation like of course peer review is not perfect but it gives you a lot of signal pretty quickly um and so this is fundamentally the reason why I think like and then I also like fundamentally believe that it's true for alignment research right like evaluation is used in generation and so if we just If Only Humans evaluate alignment research instead of doing it I think we would already be accelerated and then the last reason I want to give is like basically a conviction in language models like I think you know language models are going to get really good I think we can like they're pretty naturally well suited to a lot of alignment research tasks right like because you can phrase these tasks as like text in text out um be it like the more kind of like what we talked about like the embellish tasks where you like running experiments and understand the results that I think you know other people are definitely gonna do um or the more like you know conceptual or like you know researchy things where like we're fundamentally confused about what to do next or like what how you should think about a certain problem and then like the model tries to understand help us understand it and like all of these are text and text out tasks basically and maybe like the most complicated other thing you have to do is like look at some plots or something um which like even gpd4 can do and so I think actually like the current Paradigm of like language model pre-training is like pretty well suited to the kind of alignment plan that I'm excited about and that super alignment is working on okay so yeah part of that is about like evaluation versus generation and I guess yeah I I guess that's partly about humans doing evaluation um presumably there's also a hope that like we can leverage AI like Leverage The bit of AI That's evaluating things and get that on our side a lot of things you mentioned are like you know conviction and language models like seems like alignment's easier with language models and I'm wondering like I think I'm a little bit more skeptical of how useful language models are so I certainly think that it seems like they're just good at modeling text and do it doing like text-based answers for sure you can provide a good score function I guess in terms of like how useful they are for alignment I think like the things you mentioned were like well one they they're sort of not as like goal-oriented necessarily I don't know if you said that or if it was just in the posts but I was in the post I don't think I said it okay do you believe it I believe it okay great uh so at least like you know like out of the box right like if you pre-tain a model it's like pre-trained on this myopic objective of like predict the next token on this random internet text which is not an objective that like necessarily forces you to pursue long-term goals it might there's no guarantee that it doesn't emerge somehow and you can tell stories how that could happen yeah at least a priori it's a very myopic objective yeah I guess like uh the concern is something like well for one like often when people are generating text they have like long-term goals right so for instance if you like suppose you were you train on a bunch of archive papers well the reason people write archive being the place where people publish like scientific papers at least in computer science and physics I guess like the reason people write papers is that they have some you know they have some research project that they want to advance they have some um you know they want to promote their career maybe and like yeah it seems to me that like like if you're modeling something which is generated by things that have long-term goals maybe you get the long-term goals too right yeah I think that's a good story of how that could emerge I think the main counter argument here is like that modeling something as like an agent that pursues long-term goals and then modeling how they would like go about those goals and how reality responds to that and then what the final output is that leads to the next token is like a lot of it's like a very complicated function um and what pre-training does is like it incentivizes the simplest functions to be found first right like like induction heads is like a very good example of that where you know like you have this simple induction mechanism that gets discovered very early in training from like even small models and then mechanism being just like see okay I've got to predict the next word did the last word occur previously in TR in the text like uh what was the next word from that word maybe it's just that exactly and it's pretty simple right and then you like build more complicated mechanisms on top of that but like you know if you're trying to because you're trying to improve on the pre-training loss right like you kind of want to learn the simplest functions that help you improve the most first and like that's generally what happens and so before you get to the point where you like really spending all of this like are you learning this really complex function of like modeling someone as like an IGN with long-term goals there's a lot of other functions you'll learn first and this will be by like one of the last ones uh I would expect because it is so complicated and because there's like only so many things you can do in a single forward pass because you only have so many layers and so I think that's a good reason to like expect not to arise very early um but it's I mean definitely something you should like measure and actually look at empirically yeah and I guess this is one of these places where I think Theory can be useful for like there's some instant like things learn easier functions sooner yep like exactly like that's a theory question yeah can you like actually say something meaningfully theoretically about this yeah yeah yeah I think that like well the literally took it hypothesis is not quite like theoretical work per se but I think it like tries to get at this a little bit yeah I think it's also I don't know I'm I'm currently in a phase where whenever anyone says something I'm like oh that sounds just like singular learning theory but this really does sound like singular learning theory um people can people listen to other podcasts for that um yeah so so another thing you mentioned there was a benefit of language models is it like you know they've they've read like a large fraction of the internet um and somehow they like roughly know what it is to behave well because they know what we've written and they understand us and you know I guess one word I have here is like right now in my head I don't have a like nice compact specification of how I want super intelligent AI to behave and the way you can tell that is that if we did uh we wouldn't have an alignment problem anymore we would have like hey I've read the python Suiter code just like you know make it make the networks bigger right and because I don't have that it seems like you can't pick that out from the text that I've written like I'll say things like you know be nice don't you know Destroy All Humans or something but like the solution to alignment isn't in my head therefore presume you might think that it couldn't be extract from the text yeah I'm wondering what you think about like that kind of concern or that kind of counter argument to like alignment models having like the right answer inside them somewhere yeah I think I think there's some truth to this but also I think in practice it's very unclear to me how much it really matters like to some extent right like if a pre-train if I train a model on everything you've ever said right like it doesn't I wouldn't know what you actually think and you can just you probably have a lot of thoughts that you haven't written down but you know it can in general like I think it would be in practice quite good at predicting what you would say about various like situations events or scenarios yeah and so in that sense I don't think it would be fundamentally hard for the model to like look around in the world in the future and like know whether humans would like it so the idea is like somehow I implicitly know how I would behave if I were super intelligence and I can do that even if it I don't have a compact rule in my head no in the sense that like I think the model was just smart enough to know how Daniel will think about what's going on in the world right and I don't think the blocker will be that the the AI system doesn't understand what humans fundamentally want or care about or like I don't think it will be wrong about these things just because it's smart and capable and it's about everything and is kind of clear if you've read everything like humans haven't read everything it's still kind of clear to us yeah yeah but the big challenge is not like teaching it to the system and I think that's like what language models makes a visceral but the challenge is really to then get it to actually do it yeah so it's like the like it knows what the objective is somewhere and we just need to figure out how to wire that up to this yeah you could I mean you could like kind of Imagine like this really really competent sociopath that knows exactly what you know humans wanted to do and then just decides not to do that yep um and it's like you know that is totally a thing you can do and like it happens to humans as well um gotcha okay so it's about time to wrap up uh you've been very generous with your time but um yeah I just wanted to ask if people are interested in following your research or maybe like uh taking part themselves what should they do yeah great question I'm glad you asked yeah we are like trying to hire a lot of people right now we like really want to staff up the super alignment efforts so if like helping us align super intelligence in four years sounds appealing to you please uh consider applying you can find a job postings on openai.com jobs and then if you're interested in um following like what I think about alignment specifically um I have a sub stack it's aligned.substack.com and you can follow me on Twitter I'm Jan Leica on Twitter all one word and yeah I think thank you so much for being so interested in our work yeah so links to all of those will be in the description um thanks so much for being on and to the listeners I hope this was a useful episode for you thank you so much for having me this episode is edited by Jack Garrett and Amber dornace helped with the transcription the opening and closing themes are also by checkout financial support for this episode was provided by the long-term future fund along with patrons such as Ben Weinstein Rowan Tor barstadt and Alexi Malave Trader transcript of this episode or to learn how to support the podcast yourself you can visit axrp.net finally if you have any feedback about this podcast you can email me at feedback axrp.net [Music] [Laughter] [Music] thank you [Music]
Related conversations
AXRP
28 Mar 2025
Jason Gross on Compact Proofs and Interpretability

This conversation examines technical alignment through Jason Gross on Compact Proofs and Interpretability, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -1 · 139 segs
AXRP
1 Mar 2025
David Duvenaud on Sabotage Evaluations and the Post-AGI Future

This conversation examines technical alignment through David Duvenaud on Sabotage Evaluations and the Post-AGI Future, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -9 · avg -7 · 21 segs
AXRP
1 Dec 2024
Evan Hubinger on Model Organisms of Misalignment

This conversation examines technical alignment through Evan Hubinger on Model Organisms of Misalignment, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med -6 · avg -7 · 120 segs
AXRP
2 May 2023
Interpretability for Engineers with Stephen Casper

This conversation examines technical alignment through Interpretability for Engineers with Stephen Casper, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Same shelf or editorial thread
Spectrum + transcript · tap
Slice bands
Spectrum trail (transcript)
Med 0 · avg -4 · 108 segs
Counterbalance on this topic
Ranked with the mirror rule in the methodology: picks sit closer to the opposite side of your score on the same axis (lens alignment preferred). Each card plots you and the pick together.
Mirror pick 1
AXRP
3 Jan 2026
David Rein on METR Time Horizons

This conversation examines core safety through David Rein on METR Time Horizons, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.64This pick -10.64Δ +10
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -0 · 108 segs
Mirror pick 2
AXRP
7 Aug 2025
Tom Davidson on AI-enabled Coups

This conversation examines core safety through Tom Davidson on AI-enabled Coups, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.64This pick -10.64Δ +10
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -5 · 133 segs
Mirror pick 3
AXRP
6 Jul 2025
Samuel Albanie on DeepMind's AGI Safety Approach

This conversation examines core safety through Samuel Albanie on DeepMind's AGI Safety Approach, surfacing the assumptions, failure paths, and strategic choices that matter most for real-world deployment.
Spectrum vs this page
This page -20.64This pick -10.64Δ +10
This pageThis pick
Near you on the spectrum — often same shelf or editorial thread, different conversation. Mixed · Technical lens.
Spectrum trail (transcript)
Med 0 · avg -4 · 72 segs